├── .gitignore ├── .gitmodules ├── LICENSE ├── Makefile ├── README.org ├── write-yourself-a-git.org └── wyag-tests.sh /.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | .last_push 3 | __pycache__ 4 | libwyag.py 5 | src 6 | wyag 7 | wyag.zip 8 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "lib/org-html-themes"] 2 | path = lib/org-html-themes 3 | url = https://github.com/fniessen/org-html-themes 4 | [submodule "lib/htmlize"] 5 | path = lib/htmlize 6 | url = https://github.com/hniksic/emacs-htmlize 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. 
Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. 
Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. 
Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. 
A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 
163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 
196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 
229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 
256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 
287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 
317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. 
If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 
386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 
421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. 
If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 
486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 
512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. 
If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 
578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 
613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | 635 | Copyright (C) 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see . 649 | 650 | Also add information on how to contact you by electronic and paper mail. 
651 | 652 | If the program does terminal interaction, make it output a short 653 | notice like this when it starts in an interactive mode: 654 | 655 | Copyright (C) 656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 657 | This is free software, and you are welcome to redistribute it 658 | under certain conditions; type `show c' for details. 659 | 660 | The hypothetical commands `show w' and `show c' should show the appropriate 661 | parts of the General Public License. Of course, your program's commands 662 | might be different; for a GUI interface, you would use an "about box". 663 | 664 | You should also get your employer (if you work as a programmer) or school, 665 | if any, to sign a "copyright disclaimer" for the program, if necessary. 666 | For more information on this, and how to apply and follow the GNU GPL, see 667 | . 668 | 669 | The GNU General Public License does not permit incorporating your program 670 | into proprietary programs. If your program is a subroutine library, you 671 | may consider it more useful to permit linking proprietary applications with 672 | the library. If this is what you want to do, use the GNU Lesser General 673 | Public License instead of this License. But first, please read 674 | . 
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
all: article program
article: write-yourself-a-git.html
program: wyag libwyag.py
push: .last_push

.PHONY: all article clean program push test

write-yourself-a-git.html: write-yourself-a-git.org wyag libwyag.py
	emacs --batch write-yourself-a-git.org \
		--eval "(add-to-list 'load-path (expand-file-name \"./lib/htmlize\"))" \
		--eval "(setq org-babel-inline-result-wrap \"%s\")" \
		--eval "(setq org-confirm-babel-evaluate nil)" \
		--eval "(setq python-indent-guess-indent-offset nil)" \
		--eval "(setq org-export-with-broken-links t)" \
		--eval "(setq org-html-htmlize-output-type 'css)" \
		-f org-html-export-to-html

write-yourself-a-git.pdf: write-yourself-a-git.org wyag libwyag.py
	emacs --batch write-yourself-a-git.org \
		--eval "(add-to-list 'load-path (expand-file-name \"./lib/htmlize\"))" \
		--eval "(setq org-babel-inline-result-wrap \"%s\")" \
		--eval "(setq org-confirm-babel-evaluate nil)" \
		--eval "(setq python-indent-guess-indent-offset nil)" \
		--eval "(setq org-export-with-broken-links t)" \
		-f org-latex-export-to-pdf

wyag libwyag.py: write-yourself-a-git.org
	emacs --batch write-yourself-a-git.org -f org-babel-tangle

wyag.zip: wyag libwyag.py LICENSE
	zip -r wyag.zip wyag libwyag.py LICENSE

clean:
	rm -f wyag libwyag.py write-yourself-a-git.html .last_push wyag.zip

test: wyag libwyag.py
	./wyag-tests.sh

.last_push: wyag.zip write-yourself-a-git.html
	scp -r write-yourself-a-git.html k9.thb.lt\:/var/www/wyag.thb.lt/index.html; \
	scp -r wyag.zip lib/org-html-themes/src k9.thb.lt:/var/www/wyag.thb.lt/; \
	touch .last_push

--------------------------------------------------------------------------------
/README.org:
--------------------------------------------------------------------------------
#+TITLE: Write yourself a Git!

Source repository for the [[https://wyag.thb.lt][Write yourself a Git]] article.

Wyag is a [[https://en.wikipedia.org/wiki/Literate_programming][literate program]] written in [[https://orgmode.org/][org-mode]], which means the same source document can be used to produce the HTML version of the article as published on [[https://wyag.thb.lt]] and the program itself. You only need a reasonably recent Emacs and the =make= program, then:

#+begin_src shell
  $ git clone --recursive https://github.com/thblt/write-yourself-a-git
  $ cd write-yourself-a-git
  $ make all
#+end_src

--------------------------------------------------------------------------------
/write-yourself-a-git.org:
--------------------------------------------------------------------------------
#+TITLE: Write yourself a Git!
#+AUTHOR: [[mailto:thibault@thb.lt][Thibault Polge]]

#
# This file is part of wyag
#
# Copyright (c) 2018-2023 Thibault Polge
# All rights reserved
#
# Wyag is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Wyag is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
# License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Wyag. If not, see .
#

#+LANGUAGE: en
#+OPTIONS: ':t ^:nil

#+SETUPFILE: lib/org-html-themes/org/theme-readtheorg-local.setup

* Introduction
:PROPERTIES:
:CUSTOM_ID: intro
:END:

#+begin_note
Recent changes (January 2025):

- =OrderedDict= has been replaced by regular dicts.
- Most string formatting has been replaced with f-strings.
- Multiple bugs fixed in =tag_create=.
#+end_note

This article is an attempt at explaining the [[https://git-scm.com/][Git version control
system]] from the bottom up, that is, starting at the most fundamental
level and moving up from there. This does not sound too easy, and has
been attempted multiple times with questionable success. But there's
an easy way: all it takes to understand Git internals is to
reimplement Git from scratch.

No, don't run.

#+NAME: slocs
#+begin_src emacs-lisp :exports none
  ;; Compute line numbers. We'll use that in a second.
  (shell-command-to-string
   "grep -v '^$' libwyag.py wyag | grep -v ' *#' | wc -l")
#+end_src

It's not a joke, and it's really not complicated: if you read this
article top to bottom and write the code (or just [[./wyag.zip][download it]] as a ZIP
--- but you should write the code yourself, really), you'll end up
with a program, called =wyag=, that will implement all the fundamental
features of git: =init=, =add=, =rm=, =status=, =commit=, =log=... in
a way that is perfectly compatible with =git= itself --- compatible
enough that the commit finally adding the section on commits was
[[https://github.com/thblt/write-yourself-a-git/commit/ed26daffb400b2be5f30e044c3237d220226d867][created by wyag itself, not git]]. And all that in exactly call_slocs()
lines of very simple Python code.

But isn't Git too complex for that? That Git is complex is, in my
opinion, a misconception.
Git is a large program, with a lot of
features, that's true. But the core of that program is actually
extremely simple, and its apparent complexity stems first from the
fact it's often deeply counterintuitive (and [[https://byorgey.wordpress.com/2009/01/12/abstraction-intuition-and-the-monad-tutorial-fallacy/][Git is a burrito]] blog
posts probably don't help). But maybe what makes Git the most
confusing is the extreme simplicity /and/ power of its core model. The
combination of core simplicity and powerful applications often makes
things really hard to grasp, because of the mental jump required to
derive the variety of applications from the essential simplicity of
the fundamental abstraction (monads, anyone?).

Implementing Git will expose its fundamentals in all their naked
glory.

*What to expect?* This article will implement and explain in great
detail (if something is not clear, please [[#feedback][report it]]!) a very
simplified version of Git core commands. I will keep the code simple
and to the point, so =wyag= won't come anywhere near the power of the
real git command-line --- but what's missing will be obvious, and
trivial to implement by anyone who wants to give it a try. “Upgrading
wyag to a full-featured git library and CLI is left as an exercise to
the reader”, as they say.

More precisely, we'll implement:

#+begin_src emacs-lisp :exports results :results list
  (mapcar
   (lambda (cmd)
     (format "=%s= ([[#cmd-%s][wyag source]]) [[https://git-scm.com/docs/git-%s][git man page]]" cmd cmd cmd))
   (list
    "add"
    "cat-file"
    "check-ignore"
    "checkout"
    "commit"
    "hash-object"
    "init"
    "log"
    "ls-files"
    "ls-tree"
    "rev-parse"
    "rm"
    "show-ref"
    "status"
    "tag"))
#+end_src

You're not going to need to know much to follow this article: just
some basic Git (obviously), some basic Python, some basic shell.

+ First, I'm only going to assume some level of familiarity with the
  most basic *git commands* --- nothing like an expert level, but if
  you've never used =init=, =add=, =rm=, =commit= or =checkout=, you will be
  lost.
+ Language-wise, wyag will be implemented in *Python*. Again, I won't
  use anything too fancy, and Python looks like pseudo-code anyway,
  so it will be easy to follow (ironically, the most complicated part
  will be the command-line argument parsing logic, and you don't
  really need to understand that). Yet, if you know programming but
  have never done any Python, I suggest you find a crash course
  somewhere on the internet just to get acquainted with the language.
+ =wyag= and =git= are terminal programs. I assume you know your way
  inside a Unix terminal. Again, you don't need to be a l77t h4x0r,
  but =cd=, =ls=, =rm=, =tree= and their friends should be in your toolbox.

#+BEGIN_warning
*Note for Windows users*

=wyag= should run on any Unix-like system with a Python interpreter,
but I have absolutely no idea how it will behave on Windows. The
test suite absolutely requires a bash-compatible shell, which I
assume the WSL can provide.
Also, if you are using WSL, make sure
your =wyag= file uses Unix-style line endings ([[https://stackoverflow.com/questions/48692741/how-can-i-make-all-line-endings-eols-in-all-files-in-visual-studio-code-unix][See this
StackOverflow solution if you use VS Code]]). Feedback from Windows
users would be appreciated!
#+END_warning

#+begin_note
*Acknowledgments*

This article benefited from significant contributions from multiple
people, and I'm grateful to them all. Special thanks to:

- GitHub user [[https://github.com/tammoippen][tammoippen]], who first drafted the =tag_create=
  function I had simply… forgotten to write (that was [[https://github.com/thblt/write-yourself-a-git/issues/9][#9]]).
- GitHub user [[https://github.com/hjlarry][hjlarry]] fixed multiple issues in [[https://github.com/thblt/write-yourself-a-git/pull/22][#22]].
- GitHub user [[https://github.com/cutebbb][cutebbb]] implemented the first version of ls-files in
  [[https://github.com/thblt/write-yourself-a-git/pull/32/][#32]], and by doing so finally brought wyag to the wonders of the
  staging area!
#+end_note

* Getting started
:PROPERTIES:
:CUSTOM_ID: getting-started
:END:

You're going to need Python 3.10 or higher, along with your
favorite text editor. We won't need third-party packages or
virtualenvs, or anything besides a regular Python interpreter:
everything we need is in Python's standard library.

We'll split the code into two files:

- An executable, called =wyag=;
- A Python library, called =libwyag.py=.

Now, every software project starts with a boatload of boilerplate, so
let's get this over with.

We'll begin by creating the (very short) executable.
Create a new
file called =wyag= in your text editor, and copy the following few
lines:

#+BEGIN_SRC python :tangle wyag :tangle-mode (identity #o755)
  #!/usr/bin/env python3

  import libwyag
  libwyag.main()
#+END_SRC

Then make it executable:

#+BEGIN_EXAMPLE
  $ chmod +x wyag
#+END_EXAMPLE

You're done!

# This is a noweb template to include in all three source files.
#+NAME: file_header
#+BEGIN_SRC shell :exports none
  This file is part of wyag
  Copyright (c) 2018-2023 Thibault Polge
  All rights reserved

  Wyag is free software: you can redistribute it and/or modify it
  under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  Wyag is distributed in the hope that it will be useful, but WITHOUT
  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
  or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
  License for more details.

  You should have received a copy of the GNU General Public License
  along with Wyag. If not, see .

#+END_SRC

#+BEGIN_SRC python :tangle libwyag.py :exports none :noweb yes
  # <<file_header>>

#+END_SRC

#+BEGIN_SRC python :tangle wyag :exports none :noweb yes

  # <<file_header>>
#+END_SRC

Now for the library. It must be called =libwyag.py=, and be in the
same directory as the =wyag= executable. Begin by opening the empty
=libwyag.py= in your text editor.

We're first going to need a bunch of imports (just copy each import,
or merge them all in a single line).

- Git is a CLI application, so we'll need something to parse
  command-line arguments.
  Python provides a cool module called
  [[https://docs.python.org/3/library/argparse.html][argparse]] that can do 99% of the job for us.

  #+BEGIN_SRC python :tangle libwyag.py
    import argparse
  #+END_SRC

- Git uses a configuration file format that is basically Microsoft's
  INI format. The [[https://docs.python.org/3/library/configparser.html][configparser]] module can read and write these
  files.

  #+BEGIN_SRC python :tangle libwyag.py
    import configparser
  #+END_SRC

- We'll be doing some date/time manipulation:

  #+BEGIN_SRC python :tangle libwyag.py
    from datetime import datetime
  #+END_SRC

- We'll need, just once, to read the users/group database on Unix
  (=grp= is for groups, =pwd= for users). This is because git saves
  the numerical owner/group ID of files, and we'll want to display
  that nicely (as text):

  #+BEGIN_SRC python :tangle libwyag.py
    import grp, pwd
  #+END_SRC

- To support =.gitignore=, we'll need to match filenames against
  patterns like =*.txt=. Filename matching is in… =fnmatch=:

  #+BEGIN_SRC python :tangle libwyag.py
    from fnmatch import fnmatch
  #+END_SRC

- Git uses the SHA-1 function quite extensively. In Python, it's in [[https://docs.python.org/3/library/hashlib.html][hashlib]].

  #+BEGIN_SRC python :tangle libwyag.py
    import hashlib
  #+END_SRC

- Just one function from [[https://docs.python.org/3/library/math.html][math]]:

  #+BEGIN_SRC python :tangle libwyag.py
    from math import ceil
  #+END_SRC

- [[https://docs.python.org/3/library/os.html][os]] and [[https://docs.python.org/3/library/os.path.html][os.path]] provide some nice filesystem abstraction routines.

  #+BEGIN_SRC python :tangle libwyag.py
    import os
  #+END_SRC

- We use /just a bit/ of regular expressions:

  #+BEGIN_SRC python :tangle libwyag.py
    import re
  #+END_SRC

- We also need [[https://docs.python.org/3/library/sys.html][sys]] to access the actual command-line arguments (in =sys.argv=):

  #+BEGIN_SRC python :tangle libwyag.py
    import sys
  #+END_SRC

- Git compresses everything using zlib. Python [[https://docs.python.org/3/library/zlib.html][has that]], too:

  #+BEGIN_SRC python :tangle libwyag.py
    import zlib
  #+END_SRC

Imports are done. We'll be working with command-line arguments a lot.
Python provides a simple yet reasonably powerful parsing library,
=argparse=. It's a nice library, but its interface may not be the
most intuitive ever; if needed, refer to its [[https://docs.python.org/3/library/argparse.html][documentation]].

#+BEGIN_SRC python :tangle libwyag.py
  argparser = argparse.ArgumentParser(description="The stupidest content tracker")
#+END_SRC

We'll need to handle subcommands (as in git: =init=, =commit=, etc.)
In argparse slang, these are called "subparsers". At this point we
only need to declare that our CLI will use some, and that all
invocations will actually /require/ one --- you don't just call =git=,
you call =git COMMAND=.

#+BEGIN_SRC python :tangle libwyag.py
  argsubparsers = argparser.add_subparsers(title="Commands", dest="command")
  argsubparsers.required = True
#+END_SRC

The ~dest="command"~ argument states that the name of the chosen
subparser will be returned as a string in a field called =command=.
So we just need to read this string and call the correct function
accordingly.
By convention, I'll call these functions "bridge
functions" and prefix their names by =cmd_=. Bridge functions
take the parsed arguments as their unique parameter, and are
responsible for processing and validating them before executing the
actual command.

#+BEGIN_SRC python :tangle libwyag.py
  def main(argv=sys.argv[1:]):
      args = argparser.parse_args(argv)
      match args.command:
          case "add"          : cmd_add(args)
          case "cat-file"     : cmd_cat_file(args)
          case "check-ignore" : cmd_check_ignore(args)
          case "checkout"     : cmd_checkout(args)
          case "commit"       : cmd_commit(args)
          case "hash-object"  : cmd_hash_object(args)
          case "init"         : cmd_init(args)
          case "log"          : cmd_log(args)
          case "ls-files"     : cmd_ls_files(args)
          case "ls-tree"      : cmd_ls_tree(args)
          case "rev-parse"    : cmd_rev_parse(args)
          case "rm"           : cmd_rm(args)
          case "show-ref"     : cmd_show_ref(args)
          case "status"       : cmd_status(args)
          case "tag"          : cmd_tag(args)
          case _              : print("Bad command.")
#+END_SRC

* Creating repositories: init
:PROPERTIES:
:CUSTOM_ID: init
:END:

Obviously, the first Git command in chronological /and/ logical order is
=git init=, so we'll begin by creating =wyag init=. To achieve this,
we're going to first need some very basic repository abstraction.

** The Repository object
:PROPERTIES:
:CUSTOM_ID: GitRepository
:END:

We'll obviously need some abstraction for a repository: almost every
time we run a git command, we're trying to do something to a
repository, to create it, read from it or modify it.

A git repository is made of two things: a “work tree”, where the files
meant to be in version control live, and a “git directory”, where Git
stores its own data.
In most cases, the worktree is a regular
directory and the git directory is a child directory of the worktree,
called =.git=.

Git supports /many more/ cases (bare repo, separated gitdir, etc) but
we won't need them: we'll stick to the basic approach of
=worktree/.git=. Our repository object will then just hold two paths:
the worktree and the gitdir.

To create a new =GitRepository= object, we only need to make a few checks:

- We must verify that the directory exists, and contains a
  subdirectory called =.git=.

- We then read its configuration in =.git/config= (it's just an INI
  file) and check that =core.repositoryformatversion= is 0. More on
  that field in a moment.

Our constructor takes an optional =force= argument which disables all
checks. That's because the =repo_create()= function which we'll
create later will use a =GitRepository= object to /create/ the repo. So
we need a way to create such objects even from (still) invalid
filesystem locations.

#+BEGIN_SRC python :tangle libwyag.py
  class GitRepository (object):
      """A git repository"""

      worktree = None
      gitdir = None
      conf = None

      def __init__(self, path, force=False):
          self.worktree = path
          self.gitdir = os.path.join(path, ".git")

          if not (force or os.path.isdir(self.gitdir)):
              raise Exception(f"Not a Git repository {path}")

          # Read configuration file in .git/config
          self.conf = configparser.ConfigParser()
          cf = repo_file(self, "config")

          if cf and os.path.exists(cf):
              self.conf.read([cf])
          elif not force:
              raise Exception("Configuration file missing")

          if not force:
              vers = int(self.conf.get("core", "repositoryformatversion"))
              if vers != 0:
                  raise Exception(f"Unsupported repositoryformatversion: {vers}")
#+END_SRC

We're going to be manipulating *lots* of paths in repositories. We may
as well create a few utility functions to compute those paths and
create missing directory structures if needed. First, just a general
path building function:

#+BEGIN_SRC python :tangle libwyag.py
  def repo_path(repo, *path):
      """Compute path under repo's gitdir."""
      return os.path.join(repo.gitdir, *path)
#+END_SRC

(A note on Python syntax: the star on the =*path= makes the function
variadic, so it can be called with multiple path components as
separate arguments. For example, ~repo_path(repo, "objects", "df",
"4ec9fc2ad990cb9da906a95a6eda6627d7b7b0")~ is a valid call. The
function receives =path= as a tuple.)

The next two functions, =repo_file()= and =repo_dir()=, return and
optionally create a path to a file or a directory, respectively. The
difference between them is that the file version only creates
directories up to the last component.

#+BEGIN_SRC python :tangle libwyag.py
  def repo_file(repo, *path, mkdir=False):
      """Same as repo_path, but create dirname(*path) if absent.  For
      example, repo_file(r, \"refs\", \"remotes\", \"origin\", \"HEAD\") will create
      .git/refs/remotes/origin."""

      if repo_dir(repo, *path[:-1], mkdir=mkdir):
          return repo_path(repo, *path)

  def repo_dir(repo, *path, mkdir=False):
      """Same as repo_path, but mkdir *path if absent if mkdir."""

      path = repo_path(repo, *path)

      if os.path.exists(path):
          if (os.path.isdir(path)):
              return path
          else:
              raise Exception(f"Not a directory {path}")

      if mkdir:
          os.makedirs(path)
          return path
      else:
          return None
#+END_SRC

(Second and last note on syntax: because the star in =*path= makes the
functions variadic, the =mkdir= argument must be passed explicitly by
name. For example, ~repo_file(repo, "objects", mkdir=True)~.)

To *create* a new repository, we start with a directory (which we
create if it doesn't already exist) and create the *git directory* inside
(which must not exist already, or be empty). That directory is called
=.git= (the leading period makes it "hidden" on Unix systems), and contains:

- =.git/objects/=: the object store, which we'll introduce [[#objects][in the next section]].
- =.git/refs/=: the reference store, which we'll discuss [[#cmd-show-ref][a bit later]].
  It contains two subdirectories, =heads= and =tags=.
- =.git/HEAD=, a reference to the current HEAD (more on that later!)
- =.git/config=, the repository's configuration file.
- =.git/description=, holds a free-form description of this
  repository's contents, for humans, and is rarely used.

#+BEGIN_SRC python :tangle libwyag.py
  def repo_create(path):
      """Create a new repository at path."""

      repo = GitRepository(path, True)

      # First, we make sure the worktree either doesn't exist or is a
      # directory whose .git is absent or empty.

      if os.path.exists(repo.worktree):
          if not os.path.isdir(repo.worktree):
              raise Exception (f"{path} is not a directory!")
          if os.path.exists(repo.gitdir) and os.listdir(repo.gitdir):
              raise Exception (f"{path} is not empty!")
      else:
          os.makedirs(repo.worktree)

      assert repo_dir(repo, "branches", mkdir=True)
      assert repo_dir(repo, "objects", mkdir=True)
      assert repo_dir(repo, "refs", "tags", mkdir=True)
      assert repo_dir(repo, "refs", "heads", mkdir=True)

      # .git/description
      with open(repo_file(repo, "description"), "w") as f:
          f.write("Unnamed repository; edit this file 'description' to name the repository.\n")

      # .git/HEAD
      with open(repo_file(repo, "HEAD"), "w") as f:
          f.write("ref: refs/heads/master\n")

      # .git/config
      with open(repo_file(repo, "config"), "w") as f:
          config = repo_default_config()
          config.write(f)

      return repo
#+END_SRC

The configuration file is very simple: it's an [[https://en.wikipedia.org/wiki/INI_file][INI]]-like file with a
single section (=[core]=) and three fields:

- =repositoryformatversion = 0=: the version of
  the gitdir format. 0 means the initial format, 1 the same with
  extensions. If > 1, git will panic; wyag will only accept 0.
- =filemode = false=: disable tracking of file modes (permissions)
  changes in the work tree.
- =bare = false=: indicates that this repository has a worktree. Git
  supports an optional =worktree= key which indicates the location of
  the worktree, if not =..=; wyag doesn't.
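
To make these three fields concrete, here's a minimal sketch that builds the same =[core]= section in memory, prints the resulting INI text, and parses it back. The in-memory buffer and the variable names (=buf=, =text=, =parsed=) are just for illustration; they're not part of wyag.

```python
import configparser
import io

# Build the [core] section with the three fields described above.
config = configparser.ConfigParser()
config.add_section("core")
config.set("core", "repositoryformatversion", "0")
config.set("core", "filemode", "false")
config.set("core", "bare", "false")

# Serialize to an in-memory buffer instead of .git/config (illustration only).
buf = io.StringIO()
config.write(buf)
text = buf.getvalue()
print(text)

# Parse the text back, the way a repository loader would read .git/config.
parsed = configparser.ConfigParser()
parsed.read_string(text)
version = int(parsed.get("core", "repositoryformatversion"))
bare = parsed.getboolean("core", "bare")
```

Note that =configparser= hands every value back as a string, which is why the version goes through =int()= before being compared to 0.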

We create this file using Python's =configparser= lib:

#+BEGIN_SRC python :tangle libwyag.py
  def repo_default_config():
      ret = configparser.ConfigParser()

      ret.add_section("core")
      ret.set("core", "repositoryformatversion", "0")
      ret.set("core", "filemode", "false")
      ret.set("core", "bare", "false")

      return ret
#+END_SRC

** The init command
:PROPERTIES:
:CUSTOM_ID: cmd-init
:END:

Now that we have code to read and create repositories, let's make this
code usable from the command line by creating the =wyag init= command.
=wyag init= behaves just like =git init= --- with much less
customizability, of course. The syntax of =wyag init= is going to be:

#+BEGIN_EXAMPLE
  wyag init [path]
#+END_EXAMPLE

We already have the complete repository creation logic. To create the
command, we're only going to need two more things:

1. We need to create an argparse subparser to handle our command's
   argument.

   #+BEGIN_SRC python :tangle libwyag.py
     argsp = argsubparsers.add_parser("init", help="Initialize a new, empty repository.")
   #+END_SRC

   In the case of =init=, there's a single, optional,
   positional argument: the path where to init the repo. It defaults
   to =.=, the current directory:

   #+BEGIN_SRC python :tangle libwyag.py
     argsp.add_argument("path",
                        metavar="directory",
                        nargs="?",
                        default=".",
                        help="Where to create the repository.")

   #+END_SRC

2. We also need a "bridge" function that will read argument values
   from the object returned by argparse and call the actual
   function with correct values.

   #+BEGIN_SRC python :tangle libwyag.py
     def cmd_init(args):
         repo_create(args.path)
   #+END_SRC

And we're done!
If you've followed these steps, you should now be
able to =wyag init= a git repository anywhere:

#+begin_example
$ wyag init test
#+end_example

(The =wyag= executable won't usually be in your =$PATH=: you'll want to call it by its
full name, e.g. =~/projects/wyag/wyag init .=)

** The repo_find() function
:PROPERTIES:
:repo_find:
:END:

While we're implementing repositories, we're going to need a function
to find the root of the current repository. We'll use it a lot, since
almost all Git functions work on an existing repository (except
=init=, of course!). Sometimes that root is the current directory,
but it may also be a parent: your repository's root may be in
=~/Documents/MyProject=, but you may currently be working in
=~/Documents/MyProject/src/tui/frames/mainview/=. The =repo_find()=
function we'll now create will look for that root, starting at the
current directory and recursing through its parents up to =/=. To
identify a path as a repository, it will check for the presence of a
=.git= directory.

#+BEGIN_SRC python :tangle libwyag.py
def repo_find(path=".", required=True):
    path = os.path.realpath(path)

    if os.path.isdir(os.path.join(path, ".git")):
        return GitRepository(path)

    # If we haven't returned, recurse in parent
    parent = os.path.realpath(os.path.join(path, ".."))

    if parent == path:
        # Base case
        # os.path.realpath("/..") == "/":
        # If parent==path, then path is root.
        if required:
            raise Exception("No git directory.")
        else:
            return None

    # Recursive case
    return repo_find(parent, required)
#+END_SRC

And we're done with repositories!
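
The same upward search can also be written as a loop. The sketch below is a standalone variant, not wyag's code: the name =find_git_root= is hypothetical, it returns a plain path instead of a =GitRepository=, and it tries itself out on a throwaway directory tree.

```python
import os
import tempfile

def find_git_root(path="."):
    """Return the closest ancestor of path containing a .git directory, or None."""
    path = os.path.realpath(path)
    while True:
        if os.path.isdir(os.path.join(path, ".git")):
            return path
        parent = os.path.dirname(path)
        if parent == path:
            # dirname() of the filesystem root is the root itself: stop here.
            return None
        path = parent

# Try it on a temporary tree: <tmp>/.git and <tmp>/src/tui/frames
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)
    nested = os.path.join(root, "src", "tui", "frames")
    os.makedirs(os.path.join(root, ".git"))
    os.makedirs(nested)
    found = find_git_root(nested)
    print(found == root)
```

The recursive version above is closer to how the idea is usually explained; the loop avoids one stack frame per directory level, which hardly matters at realistic path depths.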
660 | 661 | * Reading and writing objects: hash-object and cat-file 662 | :PROPERTIES: 663 | :CUSTOM_ID: objects 664 | :END: 665 | 666 | ** What are objects? 667 | :PROPERTIES: 668 | :CUSTOM_ID: objects-intro 669 | :END: 670 | 671 | Now that we have repositories, putting things inside them is in order. 672 | Also, repositories are boring, and writing a Git implementation 673 | shouldn't be just a matter of writing a bunch of =mkdir=. Let's talk 674 | about *objects*, and let's implement =git hash-object= and =git cat-file=. 675 | 676 | Maybe you don't know these two commands --- they're not exactly part 677 | of an everyday git toolbox, and they're actually quite low-level 678 | ("plumbing", in git parlance). What they do is actually very simple: 679 | =hash-object= converts an existing file into a git object, and =cat-file= 680 | prints an existing git object to the standard output. 681 | 682 | Now, *what actually is a Git object?* At its core, Git is a 683 | "content-addressed filesystem". That means that unlike regular 684 | filesystems, where the name of a file is arbitrary and unrelated to 685 | that file's contents, the names of files as stored by Git are 686 | mathematically derived from their contents. This has a very important 687 | implication: if a single byte of, say, a text file, changes, its 688 | internal name will change, too. To put it simply: you don't /modify/ 689 | a file in git, you create a new file in a different location. Objects 690 | are just that: *files in the git repository, whose paths are 691 | determined by their contents*. 692 | 693 | #+begin_warning 694 | *Git is not (really) a key-value store* 695 | 696 | Some documentation, including the excellent [[https://git-scm.com/book/id/v2/Git-Internals-Git-Objects][Pro Git]], call Git a 697 | "key-value store". This is not incorrect, but may be misleading. 698 | Regular filesystems are actually closer to a key-value store than Git 699 | is. 
Because it computes keys from data, Git could rather be called a
/value-value store/.
#+end_warning

Git uses objects to store quite a lot of things: first and foremost,
the actual files it keeps in version control --- source code, for
example. Commits are objects, too, as are tags. With a few
notable exceptions (which we'll see later!), almost everything, in
Git, is stored as an object.

The path where git stores a given object is computed by calculating
the [[https://en.wikipedia.org/wiki/SHA-1][SHA-1]] [[https://en.wikipedia.org/wiki/Cryptographic_hash_function][hash]] of its contents. More precisely, Git renders the hash
as a lowercase hexadecimal string, and splits it in two parts: the
first two characters, and the rest. It uses the first part as a
directory name, the rest as the file name. (This is because most
filesystems hate having too many files in a single directory and would
slow down to a crawl. Git's method creates 256 possible intermediate
directories, hence dividing the average number of files per directory
by 256.)

#+BEGIN_note
*What is a hash function?*

SHA-1 is what we call a “hash function”. Simply put, a hash function
is a kind of unidirectional mathematical function: it is easy to
compute the hash of a value, but there's no feasible way to compute
back which value produced a hash.

A very simple example of a hash function is the classical =len= (or
=strlen=) function, which returns the length of a string. It's really
easy to compute the length of a string, and the length of a given
string will never change (unless the string itself changes, of
course!) but it's impossible to retrieve the original string, given
only its length.
/Cryptographic/ hash functions are a much more
complex version of the same, with the added property that computing an
input meant to produce a given hash is hard enough to be practically
impossible. (To produce an input =i= with ~strlen(i) == 12~, you just
type twelve random characters. With algorithms such as SHA-1, it
would take much, much longer --- long enough to be practically
impossible[fn:1]).
#+END_note

Before we start implementing the object storage system, we must
understand the exact storage format of objects. An object starts with
a header that specifies its type: =blob=, =commit=, =tag= or =tree=
(more on that in a second). This header is followed by an ASCII space
(0x20), then the size of the object in bytes as an ASCII number, then
a null byte (0x00), then the contents of the object. The first 48
bytes of a commit object in Wyag's repo look like this:

#+BEGIN_EXAMPLE
00000000  63 6f 6d 6d 69 74 20 31  30 38 36 00 74 72 65 65  |commit 1086.tree|
00000010  20 32 39 66 66 31 36 63  39 63 31 34 65 32 36 35  | 29ff16c9c14e265|
00000020  32 62 32 32 66 38 62 37  38 62 62 30 38 61 35 61  |2b22f8b78bb08a5a|
#+END_EXAMPLE

In the first line, we see the type header, a space (=0x20=), the size in
ASCII (1086) and the null separator =0x00=. The last four bytes on the
first line are the beginning of that object's contents, the word
"tree" --- we'll discuss that further when we talk about commits.

The objects (headers and contents) are stored compressed with =zlib=.

** A generic object object

Objects can be of multiple types, but they all share the same
storage/retrieval mechanism and the same general header format.
Before we dive into the details of various types of objects, we need
to abstract over these common features.
The easiest way is to create
a generic =GitObject= with two unimplemented methods: =serialize()=
and =deserialize()=, and a default =init()= to create a new, empty
object if needed (sorry pythonistas, this isn't very nice design but
it's probably easier to read than superconstructors). Our =__init__=
either loads the object from the provided data, or calls the
subclass-provided =init()= to create a new, empty object.

Later, we'll subclass this generic class, actually implementing these
functions for each object format.

#+BEGIN_SRC python :tangle libwyag.py
class GitObject (object):

    def __init__(self, data=None):
        if data is not None:
            self.deserialize(data)
        else:
            self.init()

    def serialize(self):
        """This function MUST be implemented by subclasses.

        It must convert the object's in-memory representation into a
        byte string and return it. What exactly that means depends on
        each subclass.

        """
        raise Exception("Unimplemented!")

    def deserialize(self, data):
        """This function MUST be implemented by subclasses.

        It must read the object's contents from data, a byte string, and
        do whatever it takes to convert it into a meaningful representation.

        """
        raise Exception("Unimplemented!")

    def init(self):
        pass # Just do nothing. This is a reasonable default!
#+END_SRC

** Reading objects
:PROPERTIES:
:CUSTOM_ID: object_read
:END:

To read an object, we need to know its SHA-1 hash. We then compute
its path from this hash (with the formula explained above: first two
characters, then a directory delimiter =/=, then the remaining part)
and look it up inside of the "objects" directory in the gitdir. That
is, the path to =e673d1b7eaa0aa01b5bc2442d570a765bdaae751= is
=.git/objects/e6/73d1b7eaa0aa01b5bc2442d570a765bdaae751=.

We then read that file as a binary file, and decompress it using
=zlib=.
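
To make the storage format concrete, here is a small standalone sketch of the round trip: it adds the header, compresses, decompresses, and splits the header off again, entirely in memory. The helper names (=encode_object=, =decode_object=, =object_path=) are hypothetical and not part of wyag. It also assumes, as Git does, that the object's name is the SHA-1 of the header plus contents, before compression.

```python
import hashlib
import zlib

def encode_object(fmt, data):
    """Prepend the header: type, space, size in ASCII, null byte."""
    return fmt + b' ' + str(len(data)).encode("ascii") + b'\x00' + data

def decode_object(raw):
    """Split a decompressed object into (type, contents), checking the size."""
    x = raw.find(b' ')          # end of the type
    y = raw.find(b'\x00', x)    # end of the size
    fmt = raw[:x]
    size = int(raw[x + 1:y].decode("ascii"))
    body = raw[y + 1:]
    if size != len(body):
        raise ValueError("Malformed object: bad length")
    return fmt, body

def object_path(sha):
    """First two hex characters name the directory, the rest the file."""
    return "/".join([".git", "objects", sha[:2], sha[2:]])

raw = encode_object(b'blob', b'hello\n')
sha = hashlib.sha1(raw).hexdigest()   # assumption: hash covers header + contents
stored = zlib.compress(raw)           # what would be written to disk

fmt, body = decode_object(zlib.decompress(stored))
print(fmt, body, object_path(sha))
```

Nothing here touches the filesystem; =stored= is simply the byte string that a loose object file would contain.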
818 | 819 | From the decompressed data, we extract the two header components: the 820 | object type and its size. From the type, we determine the actual 821 | class to use. We convert the size to a Python integer, and check if 822 | it matches. 823 | 824 | When all is done, we just call the correct constructor for that 825 | object's format. 826 | 827 | #+BEGIN_SRC python :tangle libwyag.py 828 | def object_read(repo, sha): 829 | """Read object sha from Git repository repo. Return a 830 | GitObject whose exact type depends on the object.""" 831 | 832 | path = repo_file(repo, "objects", sha[0:2], sha[2:]) 833 | 834 | if not os.path.isfile(path): 835 | return None 836 | 837 | with open (path, "rb") as f: 838 | raw = zlib.decompress(f.read()) 839 | 840 | # Read object type 841 | x = raw.find(b' ') 842 | fmt = raw[0:x] 843 | 844 | # Read and validate object size 845 | y = raw.find(b'\x00', x) 846 | size = int(raw[x:y].decode("ascii")) 847 | if size != len(raw)-y-1: 848 | raise Exception(f"Malformed object {sha}: bad length") 849 | 850 | # Pick constructor 851 | match fmt: 852 | case b'commit' : c=GitCommit 853 | case b'tree' : c=GitTree 854 | case b'tag' : c=GitTag 855 | case b'blob' : c=GitBlob 856 | case _: 857 | raise Exception(f"Unknown type {fmt.decode("ascii")} for object {sha}") 858 | 859 | # Call constructor and return object 860 | return c(raw[y+1:]) 861 | #+END_SRC 862 | 863 | ** Writing objects 864 | :PROPERTIES: 865 | :object_write: 866 | :END: 867 | 868 | Writing an object is reading it in reverse: we compute the hash, 869 | insert the header, zlib-compress everything and write the result in 870 | the correct location. 
This really shouldn't require much explanation, 871 | just notice that the hash is computed *after* the header is added (so 872 | it's the hash of the object itself, uncompressed, not just its contents) 873 | 874 | #+BEGIN_SRC python :tangle libwyag.py 875 | def object_write(obj, repo=None): 876 | # Serialize object data 877 | data = obj.serialize() 878 | # Add header 879 | result = obj.fmt + b' ' + str(len(data)).encode() + b'\x00' + data 880 | # Compute hash 881 | sha = hashlib.sha1(result).hexdigest() 882 | 883 | if repo: 884 | # Compute path 885 | path=repo_file(repo, "objects", sha[0:2], sha[2:], mkdir=True) 886 | 887 | if not os.path.exists(path): 888 | with open(path, 'wb') as f: 889 | # Compress and write 890 | f.write(zlib.compress(result)) 891 | return sha 892 | #+END_SRC 893 | 894 | ** Working with blobs 895 | 896 | We said earlier that the type header could be one of four: =blob=, 897 | =commit=, =tag= and =tree= --- so git has four object types. 898 | 899 | Blobs are the simplest of those four types, because they have no 900 | actual format. Blobs are user data: the content of every file you put 901 | in git (=main.c=, =logo.png=, =README.md=) is stored as a blob. That 902 | makes them easy to manipulate, because they have no actual syntax or 903 | constraints beyond the basic object storage mechanism: they're just 904 | unspecified data. Creating a =GitBlob= class is thus trivial, the 905 | =serialize= and =deserialize= functions just have to store and return 906 | their input unmodified. 907 | 908 | #+BEGIN_SRC python :tangle libwyag.py 909 | class GitBlob(GitObject): 910 | fmt=b'blob' 911 | 912 | def serialize(self): 913 | return self.blobdata 914 | 915 | def deserialize(self, data): 916 | self.blobdata = data 917 | #+END_SRC 918 | 919 | ** The cat-file command 920 | :PROPERTIES: 921 | :CUSTOM_ID: cmd-cat-file 922 | :END: 923 | 924 | We can now create =wyag cat-file=. 
=git cat-file= simply prints the
raw contents of an object to stdout, uncompressed and without the git
header. In a clone of [[https://github.com/thblt/write-yourself-a-git][wyag's source repository]], =git cat-file blob
e0695f14a412c29e252c998c81de1dde59658e4a= will show a version of the
README.

Our simplified version will just take those two positional arguments:
a type and an object identifier:

#+BEGIN_EXAMPLE
wyag cat-file TYPE OBJECT
#+END_EXAMPLE

The subparser is very simple:

#+BEGIN_SRC python :tangle libwyag.py
argsp = argsubparsers.add_parser("cat-file",
                                 help="Provide content of repository objects")

argsp.add_argument("type",
                   metavar="type",
                   choices=["blob", "commit", "tag", "tree"],
                   help="Specify the type")

argsp.add_argument("object",
                   metavar="object",
                   help="The object to display")
#+END_SRC

And we can implement the functions, which just call into existing code we wrote earlier:

#+BEGIN_SRC python :tangle libwyag.py
def cmd_cat_file(args):
    repo = repo_find()
    cat_file(repo, args.object, fmt=args.type.encode())

def cat_file(repo, obj, fmt=None):
    obj = object_read(repo, object_find(repo, obj, fmt=fmt))
    sys.stdout.buffer.write(obj.serialize())
#+END_SRC

This function calls an =object_find=
function we haven't introduced yet. For now, it's just going to
return one of its arguments unmodified, like this:

#+BEGIN_SRC python
def object_find(repo, name, fmt=None, follow=True):
    return name
#+END_SRC

The reason for this strange small function is that Git has a /lot/ of
ways to refer to objects: full hash, short hash, tags...
=object_find()= will be our name resolution function.
We'll only
implement it [[#object_find][later]], so this is just a temporary placeholder. This
means that until we implement the real thing, the only way we can
refer to an object will be by its full hash.

** The hash-object command
:PROPERTIES:
:CUSTOM_ID: cmd-hash-object
:END:

We will want to put our /own/ data in our repositories,
though. =hash-object= is basically the opposite of =cat-file=: it
reads a file and computes its hash as an object, either storing it in the
repository (if the =-w= flag is passed) or just printing its hash.

The syntax of =wyag hash-object= is a simplification of =git
hash-object=:

#+BEGIN_EXAMPLE
wyag hash-object [-w] [-t TYPE] FILE
#+END_EXAMPLE

Which converts to:

#+BEGIN_SRC python :tangle libwyag.py
argsp = argsubparsers.add_parser(
    "hash-object",
    help="Compute object ID and optionally create a blob from a file")

argsp.add_argument("-t",
                   metavar="type",
                   dest="type",
                   choices=["blob", "commit", "tag", "tree"],
                   default="blob",
                   help="Specify the type")

argsp.add_argument("-w",
                   dest="write",
                   action="store_true",
                   help="Actually write the object into the database")

argsp.add_argument("path",
                   help="Read object from <file>")
#+END_SRC

As usual, we create a small bridge function, which is very simple:

#+BEGIN_SRC python :tangle libwyag.py
def cmd_hash_object(args):
    if args.write:
        repo = repo_find()
    else:
        repo = None

    with open(args.path, "rb") as fd:
        sha = object_hash(fd, args.type.encode(), repo)
        print(sha)
#+END_SRC

The actual implementation is also trivial.
The =repo= argument is 1037 | optional, and the object isn't written if it is =None= (this is 1038 | handled in =object_write()=, above): 1039 | 1040 | #+BEGIN_SRC python :tangle libwyag.py 1041 | def object_hash(fd, fmt, repo=None): 1042 | """ Hash object, writing it to repo if provided.""" 1043 | data = fd.read() 1044 | 1045 | # Choose constructor according to fmt argument 1046 | match fmt: 1047 | case b'commit' : obj=GitCommit(data) 1048 | case b'tree' : obj=GitTree(data) 1049 | case b'tag' : obj=GitTag(data) 1050 | case b'blob' : obj=GitBlob(data) 1051 | case _: raise Exception(f"Unknown type {fmt}!") 1052 | 1053 | return object_write(obj, repo) 1054 | #+END_SRC 1055 | 1056 | ** Aside: what about packfiles? 1057 | :PROPERTIES: 1058 | :CUSTOM_ID: packfiles 1059 | :END: 1060 | 1061 | What we've just implemented is called "loose objects". Git has a 1062 | second object storage mechanism called packfiles. Packfiles are much 1063 | more efficient, but also much more complex, than loose objects. Simply 1064 | put, a packfile is a compilation of loose objects (like a =tar=) but 1065 | some are stored as deltas (as a transformation of another object). 1066 | Packfiles are way too complex to be supported by wyag. 1067 | 1068 | The packfile is stored in =.git/objects/pack/=. It has a =.pack= 1069 | extension, and is accompanied by an index file of the same name with 1070 | the =.idx= extension. Should you want to convert a packfile to loose 1071 | objects format (to play with =wyag= on an existing repo, for example), 1072 | here's the solution. 1073 | 1074 | First, /move/ the packfile outside the gitdir (just copying it won't work). 1075 | 1076 | #+BEGIN_SRC shell 1077 | mv .git/objects/pack/pack-d9ef004d4ca729287f12aaaacf36fee39baa7c9d.pack . 1078 | #+END_SRC 1079 | 1080 | You can ignore the =.idx=. 
Then, from the worktree, just =cat= it and pipe the result to =git 1081 | unpack-objects=: 1082 | 1083 | #+BEGIN_SRC shell 1084 | cat pack-d9ef004d4ca729287f12aaaacf36fee39baa7c9d.pack | git unpack-objects 1085 | #+END_SRC 1086 | 1087 | * Reading commit history: log 1088 | 1089 | ** Parsing commits 1090 | 1091 | Now that we can read and write objects, we should consider commits. 1092 | A commit object (uncompressed, without headers) looks like this: 1093 | 1094 | #+BEGIN_EXAMPLE 1095 | tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147 1096 | parent 206941306e8a8af65b66eaaaea388a7ae24d49a0 1097 | author Thibault Polge 1527025023 +0200 1098 | committer Thibault Polge 1527025044 +0200 1099 | gpgsig -----BEGIN PGP SIGNATURE----- 1100 | 1101 | iQIzBAABCAAdFiEExwXquOM8bWb4Q2zVGxM2FxoLkGQFAlsEjZQACgkQGxM2FxoL 1102 | kGQdcBAAqPP+ln4nGDd2gETXjvOpOxLzIMEw4A9gU6CzWzm+oB8mEIKyaH0UFIPh 1103 | rNUZ1j7/ZGFNeBDtT55LPdPIQw4KKlcf6kC8MPWP3qSu3xHqx12C5zyai2duFZUU 1104 | wqOt9iCFCscFQYqKs3xsHI+ncQb+PGjVZA8+jPw7nrPIkeSXQV2aZb1E68wa2YIL 1105 | 3eYgTUKz34cB6tAq9YwHnZpyPx8UJCZGkshpJmgtZ3mCbtQaO17LoihnqPn4UOMr 1106 | V75R/7FjSuPLS8NaZF4wfi52btXMSxO/u7GuoJkzJscP3p4qtwe6Rl9dc1XC8P7k 1107 | NIbGZ5Yg5cEPcfmhgXFOhQZkD0yxcJqBUcoFpnp2vu5XJl2E5I/quIyVxUXi6O6c 1108 | /obspcvace4wy8uO0bdVhc4nJ+Rla4InVSJaUaBeiHTW8kReSFYyMmDCzLjGIu1q 1109 | doU61OM3Zv1ptsLu3gUE6GU27iWYj2RWN3e3HE4Sbd89IFwLXNdSuM0ifDLZk7AQ 1110 | WBhRhipCCgZhkj9g2NEk7jRVslti1NdN5zoQLaJNqSwO1MtxTmJ15Ksk3QP6kfLB 1111 | Q52UWybBzpaP9HEd4XnR+HuQ4k2K0ns2KgNImsNvIyFwbpMUyUWLMPimaV1DWUXo 1112 | 5SBjDB/V/W2JBFR+XKHFJeFwYhj7DD/ocsGr4ZMx/lgc8rjIBkI= 1113 | =lgTX 1114 | -----END PGP SIGNATURE----- 1115 | 1116 | Create first draft 1117 | #+END_EXAMPLE 1118 | 1119 | The format is a simplified version of mail messages, as specified in 1120 | [[https://www.ietf.org/rfc/rfc2822.txt][RFC 2822]]. It begins with a series of key-value pairs, with space as 1121 | the key/value separator, and ends with the commit message, that may 1122 | span over multiple lines. 
Values may continue over multiple lines:
subsequent lines start with a space, which the parser must drop (like
the =gpgsig= field above, which spans 16 lines).

Let's have a look at those fields:

- =tree= is a reference to a tree object, a type of object that we'll
  see just next. A tree maps blob IDs to filesystem locations, and
  describes a state of the work tree. Put simply, it is the actual
  content of the commit: file contents, and where they go.
- =parent= is a reference to the parent of this commit. It may be
  repeated: merge commits, for example, have multiple parents. It
  may also be absent: the very first commit in a repository obviously
  doesn't have a parent.
- =author= and =committer= are separate, because the author of a commit
  is not necessarily the person who can commit it. (This may not be
  obvious for GitHub users, but a lot of projects do Git through e-mail.)
- =gpgsig= is the PGP signature of this object.

We'll start by writing a simple parser for the format. The code is
straightforward. The name of the function we're about to create,
=kvlm_parse()=, may be confusing: it isn't called =commit_parse()= because
tags have the very same format, so we'll use it for both object types.
I use KVLM to mean "Key-Value List with Message".

#+BEGIN_SRC python :tangle libwyag.py
def kvlm_parse(raw, start=0, dct=None):
    if not dct:
        dct = dict()
        # You CANNOT declare the argument as dct=dict() or all calls to
        # the function will endlessly grow the same dict.

    # This function is recursive: it reads a key/value pair, then calls
    # itself back with the new position. So we first need to know
    # where we are: at a keyword, or already in the message?

    # We search for the next space and the next newline.
    spc = raw.find(b' ', start)
    nl = raw.find(b'\n', start)

    # If space appears before newline, we have a keyword. Otherwise,
    # it's the final message, which we just read to the end of the file.

    # Base case
    # =========
    # If newline appears first (or there's no space at all, in which
    # case find returns -1), we assume a blank line. A blank line
    # means the remainder of the data is the message. We store it in
    # the dictionary, with None as the key, and return.
    if (spc < 0) or (nl < spc):
        assert nl == start
        dct[None] = raw[start+1:]
        return dct

    # Recursive case
    # ==============
    # we read a key-value pair and recurse for the next.
    key = raw[start:spc]

    # Find the end of the value. Continuation lines begin with a
    # space, so we loop until we find a "\n" not followed by a space.
    end = start
    while True:
        end = raw.find(b'\n', end+1)
        if raw[end+1] != ord(' '): break

    # Grab the value
    # Also, drop the leading space on continuation lines
    value = raw[spc+1:end].replace(b'\n ', b'\n')

    # Don't overwrite existing data contents
    if key in dct:
        if type(dct[key]) == list:
            dct[key].append(value)
        else:
            dct[key] = [ dct[key], value ]
    else:
        dct[key]=value

    return kvlm_parse(raw, start=end+1, dct=dct)
#+END_SRC

#+begin_note
*Object identity rules*

We use dictionaries (HashMaps) to store key/value associations, but we
rely on a specific feature of Python dictionaries: *they preserve
insertion order*. This means that when we write an object back,
we'll iterate over the dictionary and get the fields back in the exact
order they were added.
This matters, because Git has *two strong 1212 | rules about object identity*: 1213 | 1214 | 1. The first rule is that *the same name will always refer to the 1215 | same object*. We've seen this one already, it's just a 1216 | consequence of the fact that an object's name is a hash of its 1217 | contents. 1218 | 2. The second rule is subtly different: *the same object will always 1219 | be referred by the same name*. This means that there shouldn't be 1220 | two equivalent objects under different names. This is why fields 1221 | order matter: by modifying the /order/ fields appear in a given 1222 | commit, eg by putting the =tree= after the =parent=, we'd modify 1223 | the SHA-1 hash of the commit, and we'd create two equivalent, but 1224 | numerically distinct, commit objects. 1225 | 1226 | For example, when comparing trees, git will assume that two trees with 1227 | different names /are/ different --- this is why we'll have to make 1228 | sure elements of the tree objects are properly sorted, so we don't 1229 | produce distinct but equivalent trees. 1230 | #+end_note 1231 | 1232 | We're also going to need to write similar objects, so let's add a 1233 | =kvlm_serialize()= function to our toolkit. This is very simple: we 1234 | write all fields first, then a newline, the message, and a final 1235 | newline. 
1236 | 1237 | #+BEGIN_SRC python :tangle libwyag.py 1238 | def kvlm_serialize(kvlm): 1239 | ret = b'' 1240 | 1241 | # Output fields 1242 | for k in kvlm.keys(): 1243 | # Skip the message itself 1244 | if k == None: continue 1245 | val = kvlm[k] 1246 | # Normalize to a list 1247 | if type(val) != list: 1248 | val = [ val ] 1249 | 1250 | for v in val: 1251 | ret += k + b' ' + (v.replace(b'\n', b'\n ')) + b'\n' 1252 | 1253 | # Append message 1254 | ret += b'\n' + kvlm[None] 1255 | 1256 | return ret 1257 | #+END_SRC 1258 | 1259 | ** The Commit object 1260 | :PROPERTIES: 1261 | :object_write: GitCommit 1262 | :END: 1263 | 1264 | Now we have the parser, we can create the =GitCommit= class: 1265 | 1266 | #+BEGIN_SRC python :tangle libwyag.py 1267 | class GitCommit(GitObject): 1268 | fmt=b'commit' 1269 | 1270 | def deserialize(self, data): 1271 | self.kvlm = kvlm_parse(data) 1272 | 1273 | def serialize(self): 1274 | return kvlm_serialize(self.kvlm) 1275 | 1276 | def init(self): 1277 | self.kvlm = dict() 1278 | #+END_SRC 1279 | 1280 | ** The log command 1281 | :PROPERTIES: 1282 | :CUSTOM_ID: cmd-log 1283 | :END: 1284 | 1285 | We'll implement a much, much simpler version of =log= than what Git 1286 | provides. Most importantly, we won't deal with representing the log 1287 | /at all/. Instead, we'll dump Graphviz data and let the user use 1288 | =dot= to render the actual log. (If you don't know how to use 1289 | Graphviz, just paste the raw output into [[https://dreampuf.github.io/GraphvizOnline/][this site]]. 
If the link is 1290 | dead, lookup "graphviz online" in your favorite search engine) 1291 | 1292 | #+BEGIN_SRC python :tangle libwyag.py 1293 | argsp = argsubparsers.add_parser("log", help="Display history of a given commit.") 1294 | argsp.add_argument("commit", 1295 | default="HEAD", 1296 | nargs="?", 1297 | help="Commit to start at.") 1298 | #+END_SRC 1299 | 1300 | #+BEGIN_SRC python :tangle libwyag.py 1301 | def cmd_log(args): 1302 | repo = repo_find() 1303 | 1304 | print("digraph wyaglog{") 1305 | print(" node[shape=rect]") 1306 | log_graphviz(repo, object_find(repo, args.commit), set()) 1307 | print("}") 1308 | 1309 | def log_graphviz(repo, sha, seen): 1310 | 1311 | if sha in seen: 1312 | return 1313 | seen.add(sha) 1314 | 1315 | commit = object_read(repo, sha) 1316 | message = commit.kvlm[None].decode("utf8").strip() 1317 | message = message.replace("\\", "\\\\") 1318 | message = message.replace("\"", "\\\"") 1319 | 1320 | if "\n" in message: # Keep only the first line 1321 | message = message[:message.index("\n")] 1322 | 1323 | print(f" c_{sha} [label=\"{sha[0:7]}: {message}\"]") 1324 | assert commit.fmt==b'commit' 1325 | 1326 | if not b'parent' in commit.kvlm.keys(): 1327 | # Base case: the initial commit. 1328 | return 1329 | 1330 | parents = commit.kvlm[b'parent'] 1331 | 1332 | if type(parents) != list: 1333 | parents = [ parents ] 1334 | 1335 | for p in parents: 1336 | p = p.decode("ascii") 1337 | print (f" c_{sha} -> c_{p};") 1338 | log_graphviz(repo, p, seen) 1339 | #+END_SRC 1340 | 1341 | You can now use our log command like this: 1342 | 1343 | #+BEGIN_SRC shell 1344 | wyag log e03158242ecab460f31b0d6ae1642880577ccbe8 > log.dot 1345 | dot -O -Tpdf log.dot 1346 | #+END_SRC 1347 | 1348 | ** Anatomy of a commit 1349 | :PROPERTIES: 1350 | :CUSTOM_ID: commit-anatomy 1351 | :END: 1352 | 1353 | You may have noticed a few things right now. 
1354 | 1355 | First and foremost, we've been playing with commits, browsing and 1356 | walking through commit objects, building a graph of commit history, 1357 | without ever touching a single file in the worktree or a blob. We've 1358 | done a lot with commits /without considering their contents/. This is 1359 | important: work tree contents are just one part of a commit. But a 1360 | commit is made of everything it holds: its contents, its authors, 1361 | *also its parents*. If you remember that the ID (the SHA-1 hash) of a 1362 | commit is computed from the whole commit object, you'll understand 1363 | what it means that commits are immutable: if you change the author, 1364 | the parent commit or a single file, you've actually created a new, 1365 | different object. Each and every commit is bound to its place and its 1366 | relationship to the whole repository up to the very first commit. To 1367 | put it otherwise, a given commit ID not only identifies some file 1368 | contents, but it also binds the commit to its whole history and to the 1369 | whole repository. 1370 | 1371 | It's also worth noting that from the point of view of a commit, time 1372 | somehow runs backwards: we're used to considering the history of a 1373 | project from its humble beginnings as an evening distraction, starting 1374 | with a few lines of code, some initial commits, and progressing to its 1375 | present state (millions of lines of code, dozens of contributors, 1376 | whatever). But each commit is completely unaware of its future, it's 1377 | only linked to the past. Commits have "memory", but no premonition. 1378 | 1379 | # #+begin_note 1380 | # In Terry Pratchett's Discworld, trolls believe they progress in time 1381 | # from the future to the past. The reasoning behind that belief is 1382 | # that when you walk, what you can see is what's /ahead/ of you. Of 1383 | # time, all you can perceive is the past, because you remember; hence 1384 | # it's where you're headed. 
# Commits are Discworld trolls.
# #+end_note

So what makes a commit? To sum it up:

- A tree object, which we'll discuss now, that is, the contents of a
  worktree, files and directories;
- Zero, one or more parents;
- An author identity (name and email), and a timestamp;
- A committer identity (name and email), and a timestamp;
- An optional PGP signature;
- A message.

All of this is hashed together into a unique SHA-1 identifier.

#+begin_note
*Wait, does that make Git a blockchain?*

Because of cryptocurrencies, blockchains are all the hype these
days. And yes, /in a way/, Git is a blockchain: it's a sequence of
blocks (commits) tied together by cryptographic means in a way that
guarantees that each single element is associated to the whole
history of the structure. Don't take the comparison too seriously,
though: we don't need a GitCoin. Really, we don't.
#+end_note

* Reading commit data: checkout
:PROPERTIES:
:CUSTOM_ID: checkout
:END:

It's all well that commits hold a lot more than files and directories
in a given state, but that doesn't make them really useful. It's
probably time to start implementing tree objects as well, so we'll be
able to checkout commits into the work tree.

** What's in a tree?

Informally, a tree describes the content of the work tree, that is, it
associates blobs to paths. It's an array of three-element tuples made
of a file mode, a path (relative to the worktree) and a SHA-1.
A typical tree contents may look like this:

| Mode     | SHA-1                                      | Path         |
|----------+--------------------------------------------+--------------|
| =100644= | =894a44cc066a027465cd26d634948d56d13af9af= | =.gitignore= |
| =100644= | =94a9ed024d3859793618152ea559a168bbcbb5e2= | =LICENSE=    |
| =100644= | =bab489c4f4600a38ce6dbfd652b90383a4aa3e45= | =README.md=  |
| =100644= | =d5ec863f17f3a2e92aa8f6b66ac18f7b09fd1b38= | =main.c=     |
| =040000= | =6d208e47659a2a10f5f8640e0155d9276a2130a9= | =src=        |
| =040000= | =e7445b03aea61ec801b20d6ab62f076208b7d097= | =tests=      |

Mode is just the file's [[https://en.wikipedia.org/wiki/File_system_permissions][mode]], path is its location.  The SHA-1 refers
to either a blob or another tree object.  If a blob, the path is a
file; if a tree, it's a directory.  To instantiate this tree in the
filesystem, we would begin by loading the object associated to the
first path (=.gitignore=) and check its type.  Since it's a blob,
we'll just create a file called =.gitignore= with this blob's
contents; and same for =LICENSE=, =README.md= and =main.c=.  But the
object associated with =src= is not a blob, but another tree: we'll
create the directory =src= and repeat the same operation in that
directory with the new tree.

#+BEGIN_warning
  *A path is a single filesystem entry*

  The path identifies exactly one file or directory.  Not two, not
  three.  If you have five levels of nested directories, even if four
  are empty save the next directory, you're going to need five tree
  objects recursively referring to one another.  You cannot take the
  shortcut of putting a full path in a single tree entry, like
  =dir1/dir2/dir3/dir4/dir5=.
#+END_warning

** Parsing trees

Unlike tags and commits, tree objects are binary objects, but their
format is actually quite simple.  A tree is the concatenation of
records of the format:

#+begin_example
  [mode] space [path] 0x00 [sha-1]
#+end_example

- =[mode]= is up to six bytes and is an octal representation of a file
  *mode*, stored in ASCII.  For example, 100644 is encoded with byte
  values 49 (ASCII "1"), 48 (ASCII "0"), 48, 54, 52, 52.  The first
  two digits encode the file type (file, directory, symlink or
  submodule), the last four the permissions.
- It's followed by 0x20, an ASCII *space*;
- Followed by the null-terminated (0x00) *path*;
- Followed by the object's *SHA-1* in binary encoding, on 20 bytes.

The parser is going to be quite simple.  First, create a tiny object
wrapper for a single record (a leaf, a single path):

#+BEGIN_SRC python :tangle libwyag.py
  class GitTreeLeaf (object):
      def __init__(self, mode, path, sha):
          self.mode = mode
          self.path = path
          self.sha = sha
#+END_SRC

Because a tree object is just the repetition of the same fundamental
data structure, we write the parser in two functions.  First, a parser
to extract a single record, which returns parsed data and the position
it reached in input data:

#+BEGIN_SRC python :tangle libwyag.py
  def tree_parse_one(raw, start=0):
      # Find the space terminator of the mode
      x = raw.find(b' ', start)
      assert x-start == 5 or x-start==6

      # Read the mode
      mode = raw[start:x]
      if len(mode) == 5:
          # Normalize to six bytes.
          mode = b"0" + mode

      # Find the NULL terminator of the path
      y = raw.find(b'\x00', x)
      # and read the path
      path = raw[x+1:y]

      # Read the SHA…
      raw_sha = int.from_bytes(raw[y+1:y+21], "big")
      # and convert it into a hex string, padded to 40 chars
      # with zeros if needed.
      sha = format(raw_sha, "040x")
      return y+21, GitTreeLeaf(mode, path.decode("utf8"), sha)
#+END_SRC

And the "real" parser which just calls the previous one in a loop,
until input data is exhausted.

#+BEGIN_SRC python :tangle libwyag.py
  def tree_parse(raw):
      pos = 0
      end = len(raw)  # named `end` to avoid shadowing the builtin max()
      ret = list()
      while pos < end:
          pos, data = tree_parse_one(raw, pos)
          ret.append(data)

      return ret
#+END_SRC

We'll finally need a serializer to write trees back.  Because we may
have added or modified entries, we need to sort them again.
Consistently sorting matters, because we need to respect git's
[[identity-rules][identity rules]], which say that no two equivalent objects can have a
different hash --- but differently sorted trees with the same contents
/would/ be equivalent (describing the same directory structure), and
still numerically distinct (different SHA-1 identifiers).  Incorrectly
sorted trees are invalid, but /git doesn't enforce that/.  I created
some invalid trees by accident writing wyag, and all I got was weird
bugs in =git status= (specifically, =status= would report an actually
clean worktree as fully modified).  We don't want that.

The ordering function is quite simple, with an unexpected twist.
Entries are sorted by name, alphabetically, /but/ directories (that is,
tree entries) are sorted with a final =/= added.
It matters, because it means that if =whatever= names a regular file,
it will sort /before/ =whatever.c=, but if =whatever= is a dir, it will
sort /after/, as =whatever/=.  (I'm not sure why git does that.  If
you're curious, see the function =base_name_compare= in =tree.c= in the
git source)

#+begin_src python :tangle libwyag.py
  # Notice this isn't a comparison function, but a conversion function.
  # Python's default sort doesn't accept a custom comparison function,
  # like in most languages, but a `key` argument that returns a new
  # value, which is compared using the default rules.  So we just return
  # the leaf name, with an extra / if it's a directory.
  def tree_leaf_sort_key(leaf):
      if leaf.mode.startswith(b"10"):
          return leaf.path
      else:
          return leaf.path + "/"
#+end_src

Then the serializer itself.  This one is very simple: we sort the
items using our newly created function as a transformer, then write
them in order.

#+BEGIN_SRC python :tangle libwyag.py
  def tree_serialize(obj):
      obj.items.sort(key=tree_leaf_sort_key)
      ret = b''
      for i in obj.items:
          ret += i.mode
          ret += b' '
          ret += i.path.encode("utf8")
          ret += b'\x00'
          sha = int(i.sha, 16)
          ret += sha.to_bytes(20, byteorder="big")
      return ret
#+END_SRC

And now we just have to combine all that into a class:

#+BEGIN_SRC python :tangle libwyag.py
  class GitTree(GitObject):
      fmt=b'tree'

      def deserialize(self, data):
          self.items = tree_parse(data)

      def serialize(self):
          return tree_serialize(self)

      def init(self):
          self.items = list()
#+END_SRC

** Showing trees: ls-tree
   :PROPERTIES:
   :CUSTOM_ID: cmd-ls-tree
   :END:

While we're at it, let's add the =ls-tree= command to wyag.  It's so
easy there's no reason not to.  =git ls-tree [-r] TREE= simply prints
the contents of a tree, recursively with the =-r= flag.  In recursive
mode, it doesn't show subtrees, just final objects with their full
paths.

#+NAME: cmd-ls-tree
#+BEGIN_SRC python :tangle libwyag.py
  argsp = argsubparsers.add_parser("ls-tree", help="Pretty-print a tree object.")
  argsp.add_argument("-r",
                     dest="recursive",
                     action="store_true",
                     help="Recurse into sub-trees")

  argsp.add_argument("tree",
                     help="A tree-ish object.")

  def cmd_ls_tree(args):
      repo = repo_find()
      ls_tree(repo, args.tree, args.recursive)

  def ls_tree(repo, ref, recursive=None, prefix=""):
      sha = object_find(repo, ref, fmt=b"tree")
      obj = object_read(repo, sha)
      for item in obj.items:
          if len(item.mode) == 5:
              type = item.mode[0:1]
          else:
              type = item.mode[0:2]

          match type: # Determine the type.
              case b'04': type = "tree"
              case b'10': type = "blob" # A regular file.
              case b'12': type = "blob" # A symlink. Blob contents is link target.
              case b'16': type = "commit" # A submodule
              case _: raise Exception(f"Weird tree leaf mode {item.mode}")

          if not (recursive and type=='tree'): # This is a leaf
              # Left-pad the mode to six digits.  Done outside the f-string:
              # reusing double quotes inside one needs Python 3.12 or later.
              mode = item.mode.decode("ascii").rjust(6, "0")
              print(f"{mode} {type} {item.sha}\t{os.path.join(prefix, item.path)}")
          else: # This is a branch, recurse
              ls_tree(repo, item.sha, recursive, os.path.join(prefix, item.path))
#+END_SRC

** The checkout command
   :PROPERTIES:
   :CUSTOM_ID: cmd-checkout
   :END:

=git checkout= simply instantiates a commit in the worktree.  We're
going to oversimplify the actual git command to make our
implementation clear and understandable.  We're also going to add a
few safeguards.  Here's how our version of checkout will work:

- It will take two arguments: a commit, and a directory.  Git checkout
  only needs a commit.

- It will then instantiate the tree in the directory, *if and only if
  the directory is empty*.  Git is full of safeguards to avoid
  deleting data, which would be too complicated and unsafe to try to
  reproduce in wyag.  Since the point of wyag is to demonstrate git,
  not to produce a working implementation, this limitation is
  acceptable.

Let's get started.  As usual, we need a subparser:

#+BEGIN_SRC python :tangle libwyag.py
  argsp = argsubparsers.add_parser("checkout", help="Checkout a commit inside of a directory.")

  argsp.add_argument("commit",
                     help="The commit or tree to checkout.")

  argsp.add_argument("path",
                     help="The EMPTY directory to checkout on.")
#+END_SRC

A wrapper function:

#+BEGIN_SRC python :tangle libwyag.py
  def cmd_checkout(args):
      repo = repo_find()

      obj = object_read(repo, object_find(repo, args.commit))

      # If the object is a commit, we grab its tree
      if obj.fmt == b'commit':
          obj = object_read(repo, obj.kvlm[b'tree'].decode("ascii"))

      # Verify that path is an empty directory
      if os.path.exists(args.path):
          if not os.path.isdir(args.path):
              raise Exception(f"Not a directory {args.path}!")
          if os.listdir(args.path):
              raise Exception(f"Not empty {args.path}!")
      else:
          os.makedirs(args.path)

      tree_checkout(repo, obj, os.path.realpath(args.path))
#+END_SRC

And a function to do the actual work:

#+BEGIN_SRC python :tangle libwyag.py
  def tree_checkout(repo, tree, path):
      for item in tree.items:
          obj = object_read(repo, item.sha)
          dest = os.path.join(path, item.path)

          if obj.fmt == b'tree':
              os.mkdir(dest)
              tree_checkout(repo, obj, dest)
          elif obj.fmt == b'blob':
              # @TODO Support symlinks
              # (identified by mode 12****)
              with open(dest, 'wb') as f:
                  f.write(obj.blobdata)
#+END_SRC

* Refs, tags and branches
** What a ref is, and the show-ref command
   :PROPERTIES:
   :CUSTOM_ID: cmd-show-ref
   :END:

As of now, the only way we can refer to objects is by their full
hexadecimal identifier.  In git, we actually rarely see those, except
to talk about a specific commit.  But in general, we're talking about
HEAD, about some branch with a name like =main= or
=feature/more-bombs=, and so on.  This is handled by a simple
mechanism called references.

Git references, or refs, are probably the simplest type of thing
git holds.  They live in subdirectories of =.git/refs=, and are text
files containing a hexadecimal representation of an object's hash,
encoded in ASCII.  They're actually as simple as this:

#+BEGIN_example
  6071c08bcb4757d8c89a30d9755d2466cef8c1de
#+END_example

Refs can also refer to another reference, and thus only indirectly to
an object, in which case they look like this:

#+BEGIN_EXAMPLE
  ref: refs/remotes/origin/master
#+END_EXAMPLE

#+BEGIN_note
  *Direct and indirect references*

  From now on, I will call a reference of the form =ref:
  path/to/other/ref= an *indirect* reference, and a ref with a SHA-1
  object ID a *direct reference*.
#+END_note

This section will describe the uses of refs.  For now, all that matters
is this:

- they're text files, in the =.git/refs= hierarchy;
- they hold the SHA-1 identifier of an object, or a reference to
  another reference, ultimately to a SHA-1 (no loops!)
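
Before touching the filesystem, the two points above can be
illustrated with a small, self-contained sketch.  It is not part of
=libwyag.py= (the block is deliberately not tangled), and the names
=classify_ref= and =resolve_in_table=, as well as the dict standing in
for the files under =.git=, are assumptions made only for this
illustration:

#+BEGIN_SRC python
  def classify_ref(content):
      """Return ("indirect", target) or ("direct", sha) for a ref file's text."""
      data = content.rstrip("\n")
      if data.startswith("ref: "):
          return ("indirect", data[5:])
      return ("direct", data)

  def resolve_in_table(table, name, seen=None):
      """Follow indirect refs in `table`, a dict mapping ref name -> file text.

      Returns a SHA-1 string, or None if a ref is missing.  Raises on loops."""
      seen = set() if seen is None else seen
      if name in seen:
          raise Exception(f"Reference loop at {name}")
      seen.add(name)
      if name not in table:
          return None
      kind, value = classify_ref(table[name])
      if kind == "indirect":
          return resolve_in_table(table, value, seen)
      return value

  # A hypothetical repository state: HEAD is indirect, the branch is direct.
  refs = {
      "HEAD": "ref: refs/heads/main\n",
      "refs/heads/main": "6071c08bcb4757d8c89a30d9755d2466cef8c1de\n",
  }
  print(resolve_in_table(refs, "HEAD"))
#+END_SRC

The loop check is an extra safeguard of this sketch only; it makes
the "no loops" rule explicit instead of relying on well-formed data.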

To work with refs, we're first going to need a simple recursive solver
that will take a ref name, follow any indirect references (refs
whose content begins with =ref:=, as exemplified above) and return a
SHA-1 identifier:

#+BEGIN_SRC python :tangle libwyag.py
  def ref_resolve(repo, ref):
      path = repo_file(repo, ref)

      # Sometimes, an indirect reference may be broken.  This is normal
      # in one specific case: we're looking for HEAD on a new repository
      # with no commits.  In that case, .git/HEAD points to "ref:
      # refs/heads/main", but .git/refs/heads/main doesn't exist yet
      # (since there's no commit for it to refer to).
      if not os.path.isfile(path):
          return None

      with open(path, 'r') as fp:
          data = fp.read()[:-1]
          # Drop final \n ^^^^^
      if data.startswith("ref: "):
          return ref_resolve(repo, data[5:])
      else:
          return data
#+END_SRC

Let's create two small functions, and implement the =show-ref=
command --- it just lists all references in a repository.  First, a
stupid recursive function to collect refs and return them as a dict:

#+BEGIN_SRC python :tangle libwyag.py
  def ref_list(repo, path=None):
      if not path:
          path = repo_dir(repo, "refs")
      ret = dict()
      # Git shows refs sorted.
      # To do the same, we sort the output of
      # listdir
      for f in sorted(os.listdir(path)):
          can = os.path.join(path, f)
          if os.path.isdir(can):
              ret[f] = ref_list(repo, can)
          else:
              ret[f] = ref_resolve(repo, can)

      return ret
#+END_SRC

And, as usual, a subparser, a bridge, and a (recursive) worker function:

#+BEGIN_SRC python :tangle libwyag.py
  argsp = argsubparsers.add_parser("show-ref", help="List references.")

  def cmd_show_ref(args):
      repo = repo_find()
      refs = ref_list(repo)
      show_ref(repo, refs, prefix="refs")

  def show_ref(repo, refs, with_hash=True, prefix=""):
      if prefix:
          prefix = prefix + '/'
      for k, v in refs.items():
          if type(v) == str and with_hash:
              print (f"{v} {prefix}{k}")
          elif type(v) == str:
              print (f"{prefix}{k}")
          else:
              show_ref(repo, v, with_hash=with_hash, prefix=f"{prefix}{k}")
#+END_SRC

** Tags as references
   :PROPERTIES:
   :CUSTOM_ID: tags
   :END:

The simplest use of refs is tags.  A tag is just a user-defined
name for an object, often a commit.  A very common use of tags is
identifying software releases: You've just merged the last commit of,
say, version 12.78.52 of your program, so your most recent commit
(let's call it =6071c08=) /is/ your version 12.78.52.  To make this
association explicit, all you have to do is:

#+BEGIN_src shell
  git tag v12.78.52 6071c08
  # the object hash ^here^^ is optional and defaults to HEAD.
#+END_SRC

This creates a new tag, called =v12.78.52=, pointing at =6071c08=.
Tagging is like aliasing: a tag introduces a new way to refer to an
existing object.  After the tag is created, the name =v12.78.52= refers
to =6071c08=.
For example, these two commands are now perfectly equivalent:

#+BEGIN_src shell
  git checkout v12.78.52
  git checkout 6071c08
#+END_src

#+begin_note
  Versions are a common use of tags, but like almost everything in
  Git, tags have no predefined semantics: they mean whatever you want
  them to mean, and can point to whichever object you want, you can
  even tag /blobs/!
#+end_note

** Lightweight tags and tag objects, and parsing the latter
   :PROPERTIES:
   :CUSTOM_ID: GitTag
   :END:

You've probably guessed already that tags are actually refs.  They
live in the =.git/refs/tags/= hierarchy.  The only point worth noting is
that they come in two flavors: lightweight tags and tag objects.

- "Lightweight" tags :: are just regular refs to a commit, a tree or
  a blob.

- Tag objects :: are regular refs pointing to an object of type =tag=.
  Unlike lightweight tags, tag objects have an author, a date, an
  optional PGP signature and an optional annotation.  Their format is
  the same as a commit object.

We don't even need to implement tag objects, we can reuse =GitCommit=
and just change the =fmt= field:

#+BEGIN_SRC python :tangle libwyag.py
  class GitTag(GitCommit):
      fmt = b'tag'
#+END_SRC

And now we support tags.

** The tag command
   :PROPERTIES:
   :CUSTOM_ID: cmd-tag
   :END:

Let's add the =tag= command.  In Git, it does two things: it creates a
new tag or lists existing tags (by default).
So you can invoke it with:

#+BEGIN_src shell
  git tag                  # List all tags
  git tag NAME [OBJECT]    # create a new *lightweight* tag NAME, pointing
                           # at HEAD (default) or OBJECT
  git tag -a NAME [OBJECT] # create a new tag *object* NAME, pointing at
                           # HEAD (default) or OBJECT
#+END_src

This translates to argparse as follows.  Notice we ignore the mutual
exclusion between =--list= and =[-a] name [object]=, which seems too
complicated for argparse.

# @FIXME This ignores the mutual exclusion
#+BEGIN_SRC python :tangle libwyag.py
  argsp = argsubparsers.add_parser(
      "tag",
      help="List and create tags")

  argsp.add_argument("-a",
                     action="store_true",
                     dest="create_tag_object",
                     help="Whether to create a tag object")

  argsp.add_argument("name",
                     nargs="?",
                     help="The new tag's name")

  argsp.add_argument("object",
                     default="HEAD",
                     nargs="?",
                     help="The object the new tag will point to")
#+END_SRC

The =cmd_tag= function will dispatch behavior (list or create) depending
on whether or not =name= is provided.

#+BEGIN_SRC python :tangle libwyag.py
  def cmd_tag(args):
      repo = repo_find()

      if args.name:
          tag_create(repo,
                     args.name,
                     args.object,
                     create_tag_object = args.create_tag_object)
      else:
          refs = ref_list(repo)
          show_ref(repo, refs["tags"], with_hash=False)
#+END_SRC

And we just need one more function to actually create the tag:

#+begin_src python :tangle libwyag.py
  def tag_create(repo, name, ref, create_tag_object=False):
      # get the GitObject from the object reference
      sha = object_find(repo, ref)

      if create_tag_object:
          # create tag object (commit)
          tag = GitTag()
          tag.kvlm = dict()
          tag.kvlm[b'object'] = sha.encode()
          tag.kvlm[b'type'] = b'commit'
          tag.kvlm[b'tag'] = name.encode()
          # Feel free to let the user give their name!
          # Notice you can fix this after commit, read on!
          tag.kvlm[b'tagger'] = b'Wyag '
          # …and a tag message!
          tag.kvlm[None] = b"A tag generated by wyag, which won't let you customize the message!\n"
          tag_sha = object_write(tag, repo)
          # create reference
          ref_create(repo, "tags/" + name, tag_sha)
      else:
          # create lightweight tag (ref)
          ref_create(repo, "tags/" + name, sha)

  def ref_create(repo, ref_name, sha):
      with open(repo_file(repo, "refs/" + ref_name), 'w') as fp:
          fp.write(sha + "\n")
#+end_src

** What's a branch?
   :PROPERTIES:
   :CUSTOM_ID: branches
   :END:

Tags are done.  Now for another big chunk: branches.

It's time to address the elephant in the room: like most Git users,
wyag still doesn't have any idea what a branch is.
It currently treats a repository as a bunch of disorganized objects,
some of them commits, and has no representation whatsoever of the fact
that commits are grouped in branches, and that at every point in time
there's a commit that's =HEAD=, /i.e./, the *head* commit (or "tip")
of the *active* branch.

So, what's a branch?  The answer is actually surprisingly simple, but
it may also end up being simply surprising: *a branch is a reference
to a commit*.  You could even say that a branch is a kind of a name
for a commit.  In this regard, a branch is exactly the same thing as a
tag.  Tags are refs that live in =.git/refs/tags=, branches are refs
that live in =.git/refs/heads=.

There are, of course, differences between a branch and a tag:

1. Branches are references to a /commit/, tags can refer to any object;
2. Most importantly, the branch ref is updated at each commit.  This means
   that whenever you commit, Git actually does this:
   1. a new commit object is created, with the current branch's
      (commit!) ID as its parent;
   2. the commit object is hashed and stored;
   3. the branch ref is updated to refer to the new commit's hash.

That's all.

But what about the *current* branch?  It's actually even easier.  It's a
ref file outside of the =refs= hierarchy, in =.git/HEAD=, which is an
*indirect* ref (that is, it is of the form =ref: path/to/other/ref=, and
not a simple hash).

#+begin_note
  *Detached HEAD*

  When you just check out a random commit, git will warn you it's in
  "detached HEAD state".  This means you're not on any branch anymore.
  In this case, =.git/HEAD= is a *direct* reference: it contains a
  SHA-1.
#+end_note

** Referring to objects: the =object_find= function
   :PROPERTIES:
   :CUSTOM_ID: object_find
   :END:

*** Resolving names

Remember when we created [[placeholder-object_find][the stupid =object_find= function]] that would
take four arguments, return the second unmodified and ignore the other
three?  It's time to replace it with something more useful.  We're going
to implement a small, but usable, subset of the actual Git name
resolution algorithm.  The new =object_find()= will work in two steps:
first, given a name, it will return a complete SHA-1 hash.  For
example, with =HEAD=, it will return the hash of the head commit of the
current branch, etc.  More precisely, this name resolution function
will work like this:

- If =name= is =HEAD=, it will just resolve =.git/HEAD=;
- If =name= is a full hash, this hash is returned unmodified.
- If =name= looks like a short hash, it will collect objects whose full
  hash begins with this short hash.
- Finally, it will resolve tags and branches matching name.

Notice how the last two steps /collect/ values: the first two are
absolute references, so we can safely return a result.  But short
hashes or branch names can be ambiguous, so we want to enumerate all
possible meanings of the name and raise an error if we've found more
than one.

#+begin_info
  *Short hashes*

  For convenience, Git allows referring to hashes by a prefix of their
  name.  For example, =5bd254aa973646fa16f66d702a5826ea14a3eb45= can
  be referred to as =5bd254=.  This is called a "short hash".
#+end_info

#+BEGIN_SRC python :tangle libwyag.py
  def object_resolve(repo, name):
      """Resolve name to an object hash in repo.

      This function is aware of:

      - the HEAD literal
      - short and long hashes
      - tags
      - branches
      - remote branches"""
      candidates = list()
      hashRE = re.compile(r"^[0-9A-Fa-f]{4,40}$")

      # Empty string?  Abort.
      if not name.strip():
          return None

      # Head is nonambiguous
      if name == "HEAD":
          return [ ref_resolve(repo, "HEAD") ]

      # If it's a hex string, try for a hash.
      if hashRE.match(name):
          # This may be a hash, either small or full.  4 seems to be the
          # minimal length for git to consider something a short hash.
          # This limit is documented in man git-rev-parse
          name = name.lower()
          prefix = name[0:2]
          path = repo_dir(repo, "objects", prefix, mkdir=False)
          if path:
              rem = name[2:]
              for f in os.listdir(path):
                  if f.startswith(rem):
                      # Notice a string startswith() itself, so this
                      # works for full hashes.
                      candidates.append(prefix + f)

      # Try for references.
      as_tag = ref_resolve(repo, "refs/tags/" + name)
      if as_tag: # Did we find a tag?
          candidates.append(as_tag)

      as_branch = ref_resolve(repo, "refs/heads/" + name)
      if as_branch: # Did we find a branch?
          candidates.append(as_branch)

      as_remote_branch = ref_resolve(repo, "refs/remotes/" + name)
      if as_remote_branch: # Did we find a remote branch?
          candidates.append(as_remote_branch)

      return candidates
#+END_SRC

The second step is to follow the object we found to an object of the
required type, if a type argument was provided.  Since we only need to
handle trivial cases, this is a very simple iterative process:

- If we have a tag and =fmt= is anything else, we follow the tag.
- If we have a commit and =fmt= is tree, we return this commit's tree
  object
- In all other situations, we bail out: nothing else makes sense.

(The process is iterative because it may take an undefined number of
steps, since tags themselves can be tagged)

#+BEGIN_SRC python :tangle libwyag.py
  def object_find(repo, name, fmt=None, follow=True):
      sha = object_resolve(repo, name)

      if not sha:
          raise Exception(f"No such reference {name}.")

      if len(sha) > 1:
          # Join outside the f-string: a backslash inside an f-string
          # expression is a syntax error before Python 3.12.
          candidates = "\n - ".join(sha)
          raise Exception(f"Ambiguous reference {name}: Candidates are:\n - {candidates}.")

      sha = sha[0]

      if not fmt:
          return sha

      while True:
          obj = object_read(repo, sha)
          #     ^^^^^^^^^^^ < this is a bit aggressive: we're reading
          # the full object just to get its type.  And we're doing
          # that in a loop, albeit normally short.  Don't expect
          # high performance here.

          if obj.fmt == fmt:
              return sha

          if not follow:
              return None

          # Follow tags
          if obj.fmt == b'tag':
              sha = obj.kvlm[b'object'].decode("ascii")
          elif obj.fmt == b'commit' and fmt == b'tree':
              sha = obj.kvlm[b'tree'].decode("ascii")
          else:
              return None
#+END_SRC

With the new =object_find()=, the CLI wyag becomes a bit more usable.  You can now do things like:

#+begin_example
  $ wyag checkout v3.11 # A tag
  $ wyag checkout feature/explosions # A branch
  $ wyag ls-tree -r HEAD # The active branch or commit.  There's also a
                         # follow here: HEAD is actually a commit.
  $ wyag cat-file blob e0695f # A short hash
  $ wyag cat-file tree master # A branch, as a tree (another "follow")
#+end_example

*** The rev-parse command
    :PROPERTIES:
    :CUSTOM_ID: cmd-rev-parse
    :END:

Let's implement =wyag rev-parse=.  The =git rev-parse= command does a
lot, but one of its use cases, the one we're going to clone, is
resolving references.  For the purpose of further testing the "follow"
feature of =object_find=, we'll add an optional =wyag-type= argument
to its interface.

#+BEGIN_SRC python :tangle libwyag.py
  argsp = argsubparsers.add_parser(
      "rev-parse",
      help="Parse revision (or other objects) identifiers")

  argsp.add_argument("--wyag-type",
                     metavar="type",
                     dest="type",
                     choices=["blob", "commit", "tag", "tree"],
                     default=None,
                     help="Specify the expected type")

  argsp.add_argument("name",
                     help="The name to parse")
#+END_SRC

The bridge does all the work:

#+BEGIN_SRC python :tangle libwyag.py
  def cmd_rev_parse(args):
      if args.type:
          fmt = args.type.encode()
      else:
          fmt = None

      repo = repo_find()

      print (object_find(repo, args.name, fmt, follow=True))
#+END_SRC

And it works:

#+begin_example
  $ wyag rev-parse --wyag-type commit HEAD
  6c22393f5e3830d15395fd8d2f8b0cf8eb40dd58
  $ wyag rev-parse --wyag-type tree HEAD
  11d33fad71dbac72840aff1447e0d080c7484361
  $ wyag rev-parse --wyag-type tag HEAD
  None
#+end_example

* Working with the staging area and the index file
  :PROPERTIES:
  :CUSTOM_ID: staging-area
  :END:

** What's the index file?
   :PROPERTIES:
   :CUSTOM_ID: staging-intro
   :END:

This final step will bring us to where commits happen (although
actually creating them is for the next section!)

You probably know that to commit in Git, you first "stage" some
changes, using =git add= and =git rm=, and only /then/ do you commit
those changes.  This intermediate stage between the last and the next
commit is called the *staging area*.

It would seem natural to use a commit or tree object to represent the
staging area, but Git actually uses a completely different
mechanism, in the form of what it calls the *index file*.

After a commit, the index file is a sort of copy of that commit: it
holds the same path/blob association as the corresponding tree.  But
it also holds extra information about files in the worktree, like
their creation/modification time, so =git status= doesn't often need
to actually compare files: it just checks that their modification time
is the same as the one stored in the index file, and only if it isn't
does it perform an actual comparison.

You can thus consider the index file as a three-way association list:
not only paths with blobs, but also paths with actual filesystem
entries.

Another important characteristic of the *index file* is that unlike a
tree, it can represent inconsistent states, like a merge conflict,
whereas a tree is always a complete, unambiguous representation.

When you commit, what git actually does is turn the index file into a
new tree object.  To summarize:

1. When the repository is “clean”, the index file holds the exact
   same contents as the HEAD commit, plus metadata about the
   corresponding filesystem entries.
   For instance, it may contain something like:

   #+begin_quote
     There's a file called =src/disp.c= whose contents are blob
     797441c76e59e28794458b39b0f1eff4c85f4fa0.  The real =src/disp.c=
     file, in the worktree, was created on 2023-07-15
     15:28:29.168572151, and last modified 2023-07-15
     15:28:29.1689427709.  It is stored on device 65026, inode 8922881.
   #+end_quote

2. When you =git add= or =git rm=, the index file is modified
   accordingly.  In the example above, if you modify =src/disp.c=,
   and =add= your changes, the index file will be updated with a new
   blob ID (the blob itself will also be created in the process, of
   course), and the various file metadata will be updated as well so
   =git status= knows when not to compare file contents.

3. When you =git commit= those changes, a new tree is produced from
   the index file, a new commit object is generated with that tree,
   branches are updated and we're done.

#+begin_note
  *A note on words*

  The staging area and the index are thus the same thing, but
  "staging area" is the name of the user-facing git feature (which
  could have been implemented otherwise), the abstraction if you
  will, while "index file" refers specifically to the way this
  abstract feature is actually implemented in git.
#+end_note

** Parsing the index
:PROPERTIES:
:CUSTOM_ID: index_read
:END:

The index file is by far the most complicated piece of data a Git
repository can hold.  Its complete documentation can be found in the
Git source tree or rendered [[https://git-scm.com/docs/index-format][on the git website]].
It's made of three parts:

- A header with the format version number and the number of entries
  the index holds;
- A series of entries, sorted, each representing a file and each
  padded to a multiple of 8 bytes;
- A series of optional extensions, which we'll ignore.
# @FIXME ^ Sorted how? Do we need to think about this?

The first thing we need to represent is a single entry.  It actually
holds quite a lot of stuff, so I'm leaving the details in comments.
It's worth observing that an entry stores *both* the SHA-1 of the
associated blob in the object store /and/ a ton of metadata about the
actual file on the actual filesystem.  Again, this is because
=git/wyag status= will need to determine which files in the index were
modified: it is much more efficient to begin by checking the
last-modified timestamp and comparing it with a known value, before
comparing actual files.

#+begin_src python :tangle libwyag.py
class GitIndexEntry (object):
    def __init__(self, ctime=None, mtime=None, dev=None, ino=None,
                 mode_type=None, mode_perms=None, uid=None, gid=None,
                 fsize=None, sha=None, flag_assume_valid=None,
                 flag_stage=None, name=None):
        # The last time a file's metadata changed.  This is a pair
        # (timestamp in seconds, nanoseconds)
        self.ctime = ctime
        # The last time a file's data changed.  This is a pair
        # (timestamp in seconds, nanoseconds)
        self.mtime = mtime
        # The ID of device containing this file
        self.dev = dev
        # The file's inode number
        self.ino = ino
        # The object type, either b1000 (regular), b1010 (symlink),
        # b1110 (gitlink).
        self.mode_type = mode_type
        # The object permissions, an integer.
        self.mode_perms = mode_perms
        # User ID of owner
        self.uid = uid
        # Group ID of owner
        self.gid = gid
        # Size of this object, in bytes
        self.fsize = fsize
        # The object's SHA
        self.sha = sha
        self.flag_assume_valid = flag_assume_valid
        self.flag_stage = flag_stage
        # Name of the object (full path this time!)
        self.name = name
#+end_src

The index file is a binary file, likely for performance reasons.  The
format is reasonably simple, though.  It begins with a header with the
=DIRC= magic bytes, a version number and the total number of entries
in that index file.  We create the =GitIndex= class to hold them:

#+BEGIN_SRC python :tangle libwyag.py
class GitIndex (object):
    version = None
    entries = []
    # ext = None
    # sha = None

    def __init__(self, version=2, entries=None):
        if not entries:
            entries = list()

        self.version = version
        self.entries = entries
#+END_SRC

And a parser to read index files into those objects.  After reading
the 12-byte header, we just parse entries in the order they appear.
An entry begins with a set of fixed-length data, followed by a
variable-length name.

The code is quite straightforward, but as it's reading a binary
format, it feels messier than what we did so far.  We use
=int.from_bytes(bytes, endianness)= a lot to read raw bytes into an
integer, and just a few bitwise operations to separate data that
share the same byte.

(This format was probably designed so index files could just be
=mmap()=-ed to memory, and read directly as C structs, with an index
built in O(n) time in most cases.
This kind of approach tends to produce more elegant code in C than in
Python…)

#+BEGIN_SRC python :tangle libwyag.py
def index_read(repo):
    index_file = repo_file(repo, "index")

    # New repositories have no index!
    if not os.path.exists(index_file):
        return GitIndex()

    with open(index_file, 'rb') as f:
        raw = f.read()

    header = raw[:12]
    signature = header[:4]
    assert signature == b"DIRC" # Stands for "DirCache"
    version = int.from_bytes(header[4:8], "big")
    assert version == 2, "wyag only supports index file version 2"
    count = int.from_bytes(header[8:12], "big")

    entries = list()

    content = raw[12:]
    idx = 0
    for i in range(0, count):
        # Read ctime (last metadata change), as a unix timestamp
        # (seconds since 1970-01-01 00:00:00, the "epoch")
        ctime_s =  int.from_bytes(content[idx: idx+4], "big")
        # Read ctime, as nanoseconds after that timestamp, for extra
        # precision.
        ctime_ns = int.from_bytes(content[idx+4: idx+8], "big")
        # Same for modification time: first seconds from epoch.
        mtime_s = int.from_bytes(content[idx+8: idx+12], "big")
        # Then extra nanoseconds
        mtime_ns = int.from_bytes(content[idx+12: idx+16], "big")
        # Device ID
        dev = int.from_bytes(content[idx+16: idx+20], "big")
        # Inode
        ino = int.from_bytes(content[idx+20: idx+24], "big")
        # Ignored.
        unused = int.from_bytes(content[idx+24: idx+26], "big")
        assert 0 == unused
        mode = int.from_bytes(content[idx+26: idx+28], "big")
        mode_type = mode >> 12
        assert mode_type in [0b1000, 0b1010, 0b1110]
        mode_perms = mode & 0b0000000111111111
        # User ID
        uid = int.from_bytes(content[idx+28: idx+32], "big")
        # Group ID
        gid = int.from_bytes(content[idx+32: idx+36], "big")
        # Size
        fsize = int.from_bytes(content[idx+36: idx+40], "big")
        # SHA (object ID).  We'll store it as a lowercase hex string
        # for consistency.
        sha = format(int.from_bytes(content[idx+40: idx+60], "big"), "040x")
        # Flags, packed with the name length on two bytes
        flags = int.from_bytes(content[idx+60: idx+62], "big")
        # Parse flags
        flag_assume_valid = (flags & 0b1000000000000000) != 0
        flag_extended = (flags & 0b0100000000000000) != 0
        assert not flag_extended
        flag_stage =  flags & 0b0011000000000000
        # Length of the name.  This is stored on 12 bits, so the max
        # value is 0xFFF, 4095.  Since names can occasionally go
        # beyond that length, git treats 0xFFF as meaning at least
        # 0xFFF, and looks for the final 0x00 to find the end of the
        # name, at a small, and probably very rare, performance
        # cost.
        name_length = flags & 0b0000111111111111

        # We've read 62 bytes so far.
        idx += 62

        if name_length < 0xFFF:
            assert content[idx + name_length] == 0x00
            raw_name = content[idx:idx+name_length]
            idx += name_length + 1
        else:
            print(f"Notice: Name is 0x{name_length:X} bytes long.")
            # This probably wasn't tested enough.  It works with a
            # path of exactly 0xFFF bytes.  Any extra bytes broke
            # something between git, my shell and my filesystem.
            null_idx = content.find(b'\x00', idx + 0xFFF)
            raw_name = content[idx: null_idx]
            idx = null_idx + 1

        # Just parse the name as utf8.
        name = raw_name.decode("utf8")

        # Data is padded on multiples of eight bytes for pointer
        # alignment, so we skip as many bytes as we need for the next
        # read to start at the right position.

        idx = 8 * ceil(idx / 8)

        # And we add this entry to our list.
        entries.append(GitIndexEntry(ctime=(ctime_s, ctime_ns),
                                     mtime=(mtime_s,  mtime_ns),
                                     dev=dev,
                                     ino=ino,
                                     mode_type=mode_type,
                                     mode_perms=mode_perms,
                                     uid=uid,
                                     gid=gid,
                                     fsize=fsize,
                                     sha=sha,
                                     flag_assume_valid=flag_assume_valid,
                                     flag_stage=flag_stage,
                                     name=name))

    return GitIndex(version=version, entries=entries)
#+END_SRC

** The ls-files command
:PROPERTIES:
:CUSTOM_ID: cmd-ls-files
:END:

=git ls-files= displays the names of files in the staging area, with,
as usual, a ton of options.  Our =ls-files= will be much simpler,
/but/ we'll add a =--verbose= option that doesn't exist in git, just
so we can display every single bit of info in the index file.

#+BEGIN_SRC python :tangle libwyag.py
argsp = argsubparsers.add_parser("ls-files", help = "List all the staged files")
argsp.add_argument("--verbose", action="store_true", help="Show everything.")

def cmd_ls_files(args):
    repo = repo_find()
    index = index_read(repo)
    if args.verbose:
        print(f"Index file format v{index.version}, containing {len(index.entries)} entries.")

    for e in index.entries:
        print(e.name)
        if args.verbose:
            entry_type = { 0b1000: "regular file",
                           0b1010: "symlink",
                           0b1110: "git link" }[e.mode_type]
            print(f"  {entry_type} with perms: {e.mode_perms:o}")
            print(f"  on blob: {e.sha}")
            print(f"  created: {datetime.fromtimestamp(e.ctime[0])}.{e.ctime[1]}, modified: {datetime.fromtimestamp(e.mtime[0])}.{e.mtime[1]}")
            print(f"  device: {e.dev}, inode: {e.ino}")
            print(f"  user: {pwd.getpwuid(e.uid).pw_name} ({e.uid})  group: {grp.getgrgid(e.gid).gr_name} ({e.gid})")
            print(f"  flags: stage={e.flag_stage} assume_valid={e.flag_assume_valid}")
#+END_SRC

If you run =ls-files=, you'll notice that on a “clean” worktree (an
unmodified checkout of =HEAD=), it lists all files on =HEAD=.  Again,
the index is not a /delta/ (a set of differences) from the =HEAD=
commit, but starts as a copy of it, in a different format.

** A detour: the check-ignore command
:PROPERTIES:
:CUSTOM_ID: cmd-check-ignore
:END:

We want to write =status=, but =status= needs to know about ignore
rules, which are stored in the various =.gitignore= files.  So we
first need to add some rudimentary support for ignore files in =wyag=.
We'll expose this support as the =check-ignore= command, which takes a
list of paths and prints back those that should be ignored.
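
As a rough preview of where this detour leads, ignore rules boil down
to glob patterns tested against paths, where the last matching rule
decides.  The sketch below is only an illustration and isn't tangled
into =libwyag.py=: the =rules= list and the =is_ignored= helper are
made up for this example, and the real functions are built step by
step in the rest of this section.

#+begin_src python
from fnmatch import fnmatch

# Hypothetical rules, in file order: (pattern, should it be ignored?)
rules = [("*.html", True), ("*.zip", True), ("index.html", False)]

def is_ignored(rules, path):
    """Return the verdict of the last rule matching path (False if none)."""
    verdict = False
    for pattern, value in rules:
        if fnmatch(path, pattern):
            verdict = value
    return verdict

paths = ["hello.el", "hello.html", "index.html", "wyag.zip"]
ignored = [p for p in paths if is_ignored(rules, p)]
print(ignored)
#+end_src

Here =index.html= matches =*.html= first, but the later, more specific
rule takes it back out of the ignored set.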

Again, the command parser is trivial:

#+BEGIN_SRC python :tangle libwyag.py
argsp = argsubparsers.add_parser("check-ignore", help = "Check path(s) against ignore rules.")
argsp.add_argument("path", nargs="+", help="Paths to check")
#+END_src

And the function is just as simple:

#+BEGIN_SRC python :tangle libwyag.py
def cmd_check_ignore(args):
    repo = repo_find()
    rules = gitignore_read(repo)
    for path in args.path:
        if check_ignore(rules, path):
            print(path)
#+END_src

But of course, most of the functions we call don't exist yet in wyag.
We'll begin by writing a reader for rules in ignore files,
=gitignore_read()=.  The syntax of those rules is quite simple: each
line in an ignore file is an exclusion pattern, so files that match
this pattern are ignored by =status=, =add -A= and so on.  There are
three special cases, though:

1. Lines that begin with an exclamation mark =!= /negate/ the pattern
   (files that match this pattern are /included/, even if they were
   ignored by an earlier pattern).
2. Lines that begin with a hash =#= are comments, and are skipped.
3. A backslash =\= at the beginning treats =!= and =#= as literal
   characters.

First, a parser for a single pattern.  This parser returns a pair: the
pattern itself, and a boolean to indicate if files matching the
pattern /should/ be excluded (=True=) or included (=False=).  In other
words, =False= if the pattern did start with =!=, =True= otherwise.

#+begin_src python :tangle libwyag.py
def gitignore_parse1(raw):
    raw = raw.strip() # Remove leading/trailing spaces

    if not raw or raw[0] == "#":
        return None
    elif raw[0] == "!":
        return (raw[1:], False)
    elif raw[0] == "\\":
        return (raw[1:], True)
    else:
        return (raw, True)
#+end_src

Parsing a file is just collecting all rules in that file.  Notice this
function doesn't parse /files/, but just lists of lines: that's
because we'll need to read rules from git blobs as well, and not just
regular files.

#+begin_src python :tangle libwyag.py
def gitignore_parse(lines):
    ret = list()

    for line in lines:
        parsed = gitignore_parse1(line)
        if parsed:
            ret.append(parsed)

    return ret
#+end_src

Last thing we need to do is collect the various ignore files.  They
come in two kinds:

- Some of these files *live in the index*: they're the various
  =.gitignore= files.  Emphasis on the plural; although there often is
  only one such file, at the root, there can be one in each
  directory, and it applies to this directory and its subdirectories.
  I'll call those *scoped*, because they only apply to paths under
  their directory.
- The others live *outside the index*.  They're the global ignore
  file (usually in =~/.config/git/ignore=) and the
  repository-specific =.git/info/exclude=.  I call those *absolute*,
  because they apply everywhere, but at a lower priority.

Again, a class to hold that: a list of absolute rules, a dict
(hashmap) of scoped rules.  The keys to this hashmap are
*directories*, relative to the root of a worktree.
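
To picture what goes into those two fields before the class itself,
here is a sketch of the shape of the data.  Every pattern and
directory name below is made up, and the block isn't tangled into
=libwyag.py=.  Rules are the =(pattern, exclude?)= pairs produced by
=gitignore_parse=.

#+begin_src python
import os.path

# Absolute rules: a list of rulesets, one per file, in the order read.
absolute = [
    [("*.swp", True)],                     # e.g. from .git/info/exclude
    [("*~", True), (".DS_Store", True)],   # e.g. from the global ignore file
]

# Scoped rules: directory (relative to the worktree root) -> ruleset.
# The root itself is the empty string, because that is what
# os.path.dirname() returns for a top-level ".gitignore".
scoped = {
    "": [("*.html", True), ("wyag.zip", True)],
    "src/vendor": [("*.o", True), ("keep.o", False)],
}

root_key = os.path.dirname(".gitignore")
nested_key = os.path.dirname("src/vendor/.gitignore")
print(repr(root_key), repr(nested_key))
#+end_src

Keying the dict by directory is what will later let us walk from a
path's own directory up to the root, one lookup per level.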

#+begin_src python :tangle libwyag.py
class GitIgnore(object):
    absolute = None
    scoped = None

    def __init__(self, absolute, scoped):
        self.absolute = absolute
        self.scoped = scoped
#+end_src

And finally our function to collect all gitignore rules in a
repository, and return a =GitIgnore= object.  Notice how it reads
scoped files from the index, and not the worktree: only /staged/
=.gitignore= files matter (also remember: HEAD is /already/ staged,
since the staging area is a copy, not a delta).

#+begin_src python :tangle libwyag.py
def gitignore_read(repo):
    ret = GitIgnore(absolute=list(), scoped=dict())

    # Read local configuration in .git/info/exclude.  (The variable
    # is not called repo_file, to avoid shadowing the function of
    # that name.)
    exclude_file = os.path.join(repo.gitdir, "info/exclude")
    if os.path.exists(exclude_file):
        with open(exclude_file, "r") as f:
            ret.absolute.append(gitignore_parse(f.readlines()))

    # Global configuration
    if "XDG_CONFIG_HOME" in os.environ:
        config_home = os.environ["XDG_CONFIG_HOME"]
    else:
        config_home = os.path.expanduser("~/.config")
    global_file = os.path.join(config_home, "git/ignore")

    if os.path.exists(global_file):
        with open(global_file, "r") as f:
            ret.absolute.append(gitignore_parse(f.readlines()))

    # .gitignore files in the index
    index = index_read(repo)

    for entry in index.entries:
        if entry.name == ".gitignore" or entry.name.endswith("/.gitignore"):
            dir_name = os.path.dirname(entry.name)
            contents = object_read(repo, entry.sha)
            lines = contents.blobdata.decode("utf8").splitlines()
            ret.scoped[dir_name] = gitignore_parse(lines)
    return ret
#+end_src

We're almost there.
To tie everything together, we need the =check_ignore= function that
matches a path, relative to the root of a worktree, against a set of
rules.  This is how this function will work:

- It will first try to match this path against the *scoped* rules.
  It will do this from the deepest parent of the path to the
  farthest.  That is, if the path is
  =src/support/w32/legacy/sound.c~=, it will first look for rules in
  =src/support/w32/legacy/.gitignore=, then
  =src/support/w32/.gitignore=, =src/support/.gitignore=, and so on
  up to simply =.gitignore= at the root.
- If nothing matches, it will continue with the *absolute* rules.

We write three small support functions.  The first matches a path
against a set of rules, and returns the result of the last matching
rule.  Notice how it's not a real boolean function, since it has
*three* possible return values: =True=, =False= but also =None=.  It
returns =None= if nothing matched, so the caller knows it should
continue trying with more general ignore files (e.g., go one directory
level up).

#+begin_src python :tangle libwyag.py
def check_ignore1(rules, path):
    result = None
    for (pattern, value) in rules:
        if fnmatch(path, pattern):
            result = value
    return result
#+end_src

The second matches against the dictionary of *scoped* rules (the
various =.gitignore= files).  It just starts at the path's directory
then moves up to the parent directory, repeatedly, until it has tested
the root.  Notice that this function (and the next two as well) never
breaks *inside* a given =.gitignore= file.
Even if a rule matches, they keep going through the file, because
another rule there may reverse the effect (rules are processed in
order, so if you want to exclude =*.c= but not =generator.c=, the
general rule must come before the specific one).  But as soon as at
least one rule matched in a file, we drop the remaining files, because
a more general file never cancels the effect of a more specific one
(this is why =check_ignore1= is ternary and not boolean).

#+begin_src python :tangle libwyag.py
def check_ignore_scoped(rules, path):
    parent = os.path.dirname(path)
    while True:
        if parent in rules:
            result = check_ignore1(rules[parent], path)
            if result is not None:
                return result
        if parent == "":
            break
        parent = os.path.dirname(parent)
    return None
#+end_src

A much simpler function to match against the list of absolute rules.
Notice that the order in which we push those rules to the list matters
(we /did/ read the repository rules before the global ones!)

#+begin_src python :tangle libwyag.py
def check_ignore_absolute(rules, path):
    for ruleset in rules:
        result = check_ignore1(ruleset, path)
        if result is not None:
            return result
    return False # This is a reasonable default at this point.
#+end_src

And finally, a function to bind them all.

#+begin_src python :tangle libwyag.py
def check_ignore(rules, path):
    if os.path.isabs(path):
        raise Exception("This function requires path to be relative to the repository's root")

    result = check_ignore_scoped(rules.scoped, path)
    if result is not None:
        return result

    return check_ignore_absolute(rules.absolute, path)
#+end_src

You can now call =wyag check-ignore=.
On its own source tree:

#+begin_example
$ wyag check-ignore hello.el hello.elc hello.html wyag.zip wyag.tar
hello.elc
hello.html
wyag.zip
#+end_example

#+begin_warning
  *This is only an approximation*

  This isn't a perfect reimplementation.  In particular, excluding
  whole directories with a rule that's only the directory name (e.g.
  =__pycache__=) won't work, because =fnmatch= would want the pattern
  as =__pycache__/**=.  If you really want to play with ignore rules,
  [[https://github.com/mherrmann/gitignore_parser][this may be a good starting point]].
#+end_warning

** The status command
:PROPERTIES:
:CUSTOM_ID: cmd-status
:END:

=status= is more complex than =ls-files=, because it needs to compare
the index with both HEAD /and/ the actual filesystem.  You call =git
status= to know which files were added, removed or modified since the
last commit, and which of these changes are actually staged, and will
make it to the next commit.  So =status= actually compares =HEAD=
with the staging area, and the staging area with the worktree.  This
is what its output looks like:

#+begin_example
On branch master

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   write-yourself-a-git.org

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   write-yourself-a-git.org

Untracked files:
  (use "git add <file>..."
   to include in what will be committed)
	org-html-themes/
	wl-copy
#+end_example

We'll implement =status= in three parts: first the active branch or
“detached HEAD”, then the difference between HEAD and the index
(“Changes to be committed”), then the difference between the index
and the worktree (“Changes not staged for commit” and “Untracked
files”).

The public interface is dead simple, since our =status= takes no
arguments:

#+BEGIN_SRC python :tangle libwyag.py
argsp = argsubparsers.add_parser("status", help = "Show the working tree status.")
#+END_src

And the bridge function just calls the three component functions in order:

#+BEGIN_SRC python :tangle libwyag.py
def cmd_status(_):
    repo = repo_find()
    index = index_read(repo)

    cmd_status_branch(repo)
    cmd_status_head_index(repo, index)
    print()
    cmd_status_index_worktree(repo, index)
#+END_src

*** Finding the active branch

First we need to know if we're on a branch, and if so which one.  We
do this by just looking at =.git/HEAD=.  It should contain either a
hexadecimal ID (a ref to a commit, in detached HEAD state), or an
indirect reference to something in =refs/heads/=: the active branch.
We either return its name, or =False=.

#+begin_src python :tangle libwyag.py
def branch_get_active(repo):
    with open(repo_file(repo, "HEAD"), "r") as f:
        head = f.read()

    if head.startswith("ref: refs/heads/"):
        # Drop the 16-character prefix and the trailing newline.
        return head[16:-1]
    else:
        return False
#+end_src

Based on this, we can write the first of the three =cmd_status_*=
functions the bridge calls.
This one prints the name of the active branch, or the hash of the
detached HEAD:

#+begin_src python :tangle libwyag.py
def cmd_status_branch(repo):
    branch = branch_get_active(repo)
    if branch:
        print(f"On branch {branch}.")
    else:
        print(f"HEAD detached at {object_find(repo, 'HEAD')}")
#+end_src

*** Finding changes between HEAD and index

The second block of the status output is the “changes to be
committed”, that is, how the staging area differs from HEAD.  To do
this, we're first going to read the =HEAD= tree, and flatten it as a
single dict (hashmap) with full paths as keys, so it's closer to the
(flat) index associating paths to blobs.  Then we'll just compare
them and output their differences.

First, a function to convert a tree (recursive, remember) to a (flat)
dict.  And since trees are recursive, so is the function itself,
again.  Sorry about that :)

#+begin_src python :tangle libwyag.py
def tree_to_dict(repo, ref, prefix=""):
    ret = dict()
    tree_sha = object_find(repo, ref, fmt=b"tree")
    tree = object_read(repo, tree_sha)

    for leaf in tree.items:
        full_path = os.path.join(prefix, leaf.path)

        # The mode is enough to tell the leaf's type, without
        # reading the object itself: directory (subtree) modes
        # start with 04.
        is_subtree = leaf.mode.startswith(b'04')

        # Depending on the type, we either store the path (if it's a
        # blob, so a regular file), or recurse (if it's another tree,
        # so a subdir)
        if is_subtree:
            ret.update(tree_to_dict(repo, leaf.sha, full_path))
        else:
            ret[full_path] = leaf.sha
    return ret
#+end_src

And the command itself:

#+begin_src python :tangle libwyag.py
def cmd_status_head_index(repo,
                          index):
    print("Changes to be committed:")

    head = tree_to_dict(repo, "HEAD")
    for entry in index.entries:
        if entry.name in head:
            if head[entry.name] != entry.sha:
                print("  modified:", entry.name)
            del head[entry.name] # Delete the key
        else:
            print("  added:   ", entry.name)

    # Keys still in HEAD are files that we haven't met in the index,
    # and thus have been deleted.
    for entry in head.keys():
        print("  deleted: ", entry)
#+end_src

*** Finding changes between index and worktree

#+begin_src python :tangle libwyag.py
def cmd_status_index_worktree(repo, index):
    print("Changes not staged for commit:")

    ignore = gitignore_read(repo)

    gitdir_prefix = repo.gitdir + os.path.sep

    all_files = list()

    # We begin by walking the filesystem
    for (root, _, files) in os.walk(repo.worktree, True):
        if root==repo.gitdir or root.startswith(gitdir_prefix):
            continue
        for f in files:
            full_path = os.path.join(root, f)
            rel_path = os.path.relpath(full_path, repo.worktree)
            all_files.append(rel_path)

    # We now traverse the index, and compare real files with the cached
    # versions.

    for entry in index.entries:
        full_path = os.path.join(repo.worktree, entry.name)

        # That file *name* is in the index

        if not os.path.exists(full_path):
            print("  deleted: ", entry.name)
        else:
            stat = os.stat(full_path)

            # Compare metadata
            ctime_ns = entry.ctime[0] * 10**9 + entry.ctime[1]
            mtime_ns = entry.mtime[0] * 10**9 + entry.mtime[1]
            if (stat.st_ctime_ns != ctime_ns) or (stat.st_mtime_ns != mtime_ns):
                # If different, deep compare.
                # @FIXME This *will* crash on symlinks to dir.
                with open(full_path, "rb") as fd:
                    new_sha = object_hash(fd, b"blob", None)
                    # If the hashes are the same, the files are actually the same.
                    same = entry.sha == new_sha

                    if not same:
                        print("  modified:", entry.name)

        if entry.name in all_files:
            all_files.remove(entry.name)

    print()
    print("Untracked files:")

    for f in all_files:
        # @TODO If a full directory is untracked, we should display
        # its name without its contents.
        if not check_ignore(ignore, f):
            print(" ", f)
#+end_src

Our status function is done.  It should output something like:

#+begin_example
$ wyag status
On branch main.
Changes to be committed:
  added:    src/main.c

Changes not staged for commit:
  modified: build.py
  deleted:  README.org

Untracked files:
  src/cli.c
#+end_example

The real =status= is a lot smarter: it can detect renames, for
example, where ours cannot.  Another significant difference worth
mentioning is that =git status= actually /writes/ the index back if a
file's metadata was modified, but not its content.  You can see it
with our special =ls-files=:

#+begin_example
$ wyag ls-files --verbose
Index file format v2, containing 1 entries.
file
  regular file with perms: 644
  on blob: f2f279981ce01b095c42ee7162aadf60185c8f67
  created: 2023-07-18 18:26:15.771460869, modified: 2023-07-18 18:26:15.771460869
  ...
$ touch file
$ git status > /dev/null
$ wyag ls-files --verbose
Index file format v2, containing 1 entries.
file
  regular file with perms: 644
  on blob: f2f279981ce01b095c42ee7162aadf60185c8f67
  created: 2023-07-18 18:26:41.421743098, modified: 2023-07-18 18:26:41.421743098
  ...
#+end_example

Notice how both timestamps, from the /index file/, were updated by
=git status= to reflect the changes in the real file's metadata.

* Staging area and index, part 2: staging and committing
:PROPERTIES:
:CUSTOM_ID: committing
:END:

OK.  Let's create commits.

We have /almost/ everything we need for that, except for three last
things:

1. We need commands to modify the index, so our commits aren't just a
   copy of their parent.  Those commands are =add= and =rm=.
2. These commands need to write the modified index back, since we
   commit /from the index/.
3. And obviously, we'll need the =commit= function and its associated
   =wyag commit= command.

** Writing the index
:PROPERTIES:
:CUSTOM_ID: index_write
:END:

We'll start by writing the index.  Roughly, we're just serializing
everything back to binary.  This is a bit tedious, but the code should
be straightforward.  I'm leaving the gory details for the comments,
but it's really just =index_read= in reverse: refer to it if needed,
and to the =GitIndexEntry= class.

#+begin_src python :tangle libwyag.py
def index_write(repo, index):
    with open(repo_file(repo, "index"), "wb") as f:

        # HEADER

        # Write the magic bytes.
        f.write(b"DIRC")
        # Write version number.
        f.write(index.version.to_bytes(4, "big"))
        # Write the number of entries.
        f.write(len(index.entries).to_bytes(4, "big"))

        # ENTRIES

        idx = 0
        for e in index.entries:
            f.write(e.ctime[0].to_bytes(4, "big"))
            f.write(e.ctime[1].to_bytes(4, "big"))
            f.write(e.mtime[0].to_bytes(4, "big"))
            f.write(e.mtime[1].to_bytes(4, "big"))
            f.write(e.dev.to_bytes(4, "big"))
            f.write(e.ino.to_bytes(4, "big"))

            # Mode
            mode = (e.mode_type << 12) | e.mode_perms
            f.write(mode.to_bytes(4, "big"))

            f.write(e.uid.to_bytes(4, "big"))
            f.write(e.gid.to_bytes(4, "big"))

            f.write(e.fsize.to_bytes(4, "big"))
            # @FIXME Convert back to int.
            f.write(int(e.sha, 16).to_bytes(20, "big"))

            flag_assume_valid = 0x1 << 15 if e.flag_assume_valid else 0

            name_bytes = e.name.encode("utf8")
            bytes_len = len(name_bytes)
            if bytes_len >= 0xFFF:
                name_length = 0xFFF
            else:
                name_length = bytes_len

            # We merge back three pieces of data (two flags and the
            # length of the name) on the same two bytes.
            f.write((flag_assume_valid | e.flag_stage | name_length).to_bytes(2, "big"))

            # Write back the name, and a final 0x00.
            f.write(name_bytes)
            f.write((0).to_bytes(1, "big"))

            idx += 62 + len(name_bytes) + 1

            # Add padding if necessary.
            if idx % 8 != 0:
                pad = 8 - (idx % 8)
                f.write((0).to_bytes(pad, "big"))
                idx += pad
#+end_src

** The rm command
:PROPERTIES:
:CUSTOM_ID: cmd-rm
:END:

The easiest change we can make to an index is to remove an entry from
it, which means that the next commit *won't include* this file. This
is what the =git rm= command does.

#+BEGIN_danger
=git rm= is *destructive*, and so is =wyag rm=. The command not
only modifies the index, it also removes file(s) from the worktree.
Unlike git, =wyag rm= doesn't care if the file it removes isn't
saved. Proceed with caution.
#+END_danger

=rm= takes a single argument, a list of paths to remove:

#+BEGIN_SRC python :tangle libwyag.py
argsp = argsubparsers.add_parser("rm", help="Remove files from the working tree and the index.")
argsp.add_argument("path", nargs="+", help="Files to remove")

def cmd_rm(args):
    repo = repo_find()
    rm(repo, args.path)
#+END_SRC

The =rm= function is a bit long, but it's very simple. It takes a
repository and a list of paths, reads that repository's index, and
removes entries in the index that match this list. The optional
arguments control whether the function should actually delete the
files, and whether it should abort if some paths aren't present in the
index (both those arguments are for the use of =add=, they're not
exposed in the =wyag rm= command).

#+BEGIN_SRC python :tangle libwyag.py
def rm(repo, paths, delete=True, skip_missing=False):
    # Find and read the index
    index = index_read(repo)

    worktree = repo.worktree + os.sep

    # Make paths absolute
    abspaths = set()
    for path in paths:
        abspath = os.path.abspath(path)
        if abspath.startswith(worktree):
            abspaths.add(abspath)
        else:
            raise Exception(f"Cannot remove paths outside of worktree: {path}")

    # The list of entries to *keep*, which we will write back to the
    # index.
    kept_entries = list()
    # The list of removed paths, which we'll use after index update
    # to physically remove the actual paths from the filesystem.
    remove = list()

    # Now iterate over the list of entries, and remove those whose
    # paths we find in abspaths. Preserve the others in kept_entries.
    for e in index.entries:
        full_path = os.path.join(repo.worktree, e.name)

        if full_path in abspaths:
            remove.append(full_path)
            abspaths.remove(full_path)
        else:
            kept_entries.append(e) # Preserve entry

    # If abspaths is not empty, it means some paths weren't in the index.
    if len(abspaths) > 0 and not skip_missing:
        raise Exception(f"Cannot remove paths not in the index: {abspaths}")

    # Physically delete paths from filesystem.
    if delete:
        for path in remove:
            os.unlink(path)

    # Update the list of entries in the index, and write it back.
    index.entries = kept_entries
    index_write(repo, index)
#+END_SRC

And we can now delete files with =wyag rm=.

** The add command
:PROPERTIES:
:CUSTOM_ID: cmd-add
:END:

Adding is just a bit more complex than removing, but nothing we don't
already know. Staging a file is a four-step operation:

1. We begin by removing the existing index entry, if there's one,
   without removing the file itself (this is why the =rm= function we
   just wrote has those optional arguments).
2. We then hash the file into a blob object,
3. create its entry,
4. and of course, finally write the modified index back.

First, the interface. Nothing surprising, =wyag add PATH ...= where
PATH is one or more file(s) to stage. The bridge is as boring as can be.

#+BEGIN_SRC python :tangle libwyag.py
argsp = argsubparsers.add_parser("add", help = "Add files contents to the index.")
argsp.add_argument("path", nargs="+", help="Files to add")

def cmd_add(args):
    repo = repo_find()
    add(repo, args.path)
#+END_SRC

The main difference with =rm= is that =add= needs to create an index
entry.
This isn't hard: we just =stat()= the file and copy the
metadata into the index entry's fields (=stat()= returns the metadata
the index stores: creation/modification time, and so on).

#+BEGIN_SRC python :tangle libwyag.py
def add(repo, paths, delete=True, skip_missing=False):

    # First remove all paths from the index, if they exist.
    rm(repo, paths, delete=False, skip_missing=True)

    worktree = repo.worktree + os.sep

    # Convert the paths to pairs: (absolute, relative_to_worktree).
    # (rm, above, has already dropped their index entries, if any.)
    clean_paths = set()
    for path in paths:
        abspath = os.path.abspath(path)
        if not (abspath.startswith(worktree) and os.path.isfile(abspath)):
            raise Exception(f"Not a file, or outside the worktree: {path}")
        relpath = os.path.relpath(abspath, repo.worktree)
        clean_paths.add((abspath, relpath))

    # Find and read the index. It was modified by rm. (This isn't
    # optimal, good enough for wyag!)
    #
    # @FIXME, though: we could just move the index through
    # commands instead of reading and writing it over again.
    index = index_read(repo)

    for (abspath, relpath) in clean_paths:
        with open(abspath, "rb") as fd:
            sha = object_hash(fd, b"blob", repo)

        stat = os.stat(abspath)

        ctime_s = int(stat.st_ctime)
        ctime_ns = stat.st_ctime_ns % 10**9
        mtime_s = int(stat.st_mtime)
        mtime_ns = stat.st_mtime_ns % 10**9

        entry = GitIndexEntry(ctime=(ctime_s, ctime_ns), mtime=(mtime_s, mtime_ns), dev=stat.st_dev, ino=stat.st_ino,
                              mode_type=0b1000, mode_perms=0o644, uid=stat.st_uid, gid=stat.st_gid,
                              fsize=stat.st_size, sha=sha, flag_assume_valid=False,
                              flag_stage=0, name=relpath)
        index.entries.append(entry)

    # Write the index back
    index_write(repo, index)
#+END_SRC

** The commit command
:PROPERTIES:
:CUSTOM_ID: cmd-commit
:END:

Now that we have modified the index, and so actually /staged changes/,
we only need to turn those changes into a commit. That's what =commit=
does.

#+begin_src python :tangle libwyag.py
argsp = argsubparsers.add_parser("commit", help="Record changes to the repository.")

argsp.add_argument("-m",
                   metavar="message",
                   dest="message",
                   help="Message to associate with this commit.")
#+end_src

To do so, we first need to convert the index into a tree object,
generate and store the corresponding commit object, and update the
HEAD branch to the new commit (remember: a branch is just a ref to a
commit).

Before we get to the interesting details, we will need to read git's
config to get the name of the user, which we'll use as the author and
committer. We'll use the same =configparser= library we've used to
read the repo's config.
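
As a side note, =configparser= makes "try several locations" cheap:
=ConfigParser.read()= accepts a list of file names, silently skips
the ones it can't open, and returns the list of files it did read.
Here's a minimal sketch of that behaviour, assuming a throwaway config
file (the =read_git_user= helper, the file names and the sample
identity are made up for the illustration, they aren't part of wyag):

#+begin_src python
import configparser
import os
import tempfile

def read_git_user(candidates):
    """Return (files actually read, user name or None)."""
    config = configparser.ConfigParser()
    # read() skips unreadable files and returns those it parsed.
    files_read = config.read(candidates)
    name = config.get("user", "name", fallback=None)
    return files_read, name

with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "gitconfig")       # hypothetical file
    missing = os.path.join(tmp, "no-such-file")    # never created
    with open(present, "w") as f:
        f.write("[user]\nname = Ada Lovelace\nemail = ada@example.com\n")

    files_read, name = read_git_user([missing, present])
    print(files_read == [present], name)
#+end_src

When several files in the list exist, they're read in order, so a
value from a later file replaces the same key from an earlier one.
With that in mind, reading git's configuration is short: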

#+begin_src python :tangle libwyag.py
def gitconfig_read():
    xdg_config_home = os.environ["XDG_CONFIG_HOME"] if "XDG_CONFIG_HOME" in os.environ else "~/.config"
    configfiles = [
        os.path.expanduser(os.path.join(xdg_config_home, "git/config")),
        os.path.expanduser("~/.gitconfig")
    ]

    config = configparser.ConfigParser()
    config.read(configfiles)
    return config
#+end_src

And just a simple function to grab, and format, the user identity:

#+begin_src python :tangle libwyag.py
def gitconfig_user_get(config):
    if "user" in config:
        if "name" in config["user"] and "email" in config["user"]:
            return f"{config['user']['name']} <{config['user']['email']}>"
    return None
#+end_src

Now for the interesting part. We first need to build a tree from the
index. This isn't hard, but notice that while the index is flat (it
stores full paths for the whole worktree), a tree is a recursive
structure: it lists files, or other trees. To "unflatten" the index
into a tree, we're going to:

1. Build a dictionary (hashmap) of directories. Keys are full paths
   from worktree root (like =assets/sprites/monsters/=), values are
   lists of =GitIndexEntry= --- files in the directory. At this point, our
   dictionary only contains /files/: directories are only its keys.
2. Traverse this dictionary, going bottom-up, that is, from the deepest
   directories up to root (depth doesn't really matter: we just want
   to see each directory /before/ its parent. To do that, we just
   sort them by /full/ path length, from longest to shortest ---
   parents are obviously always shorter). As an example, imagine we
   start at =assets/sprites/monsters/=.
3. At each directory, we build a tree with its contents, say
   =cacodemon.png=, =imp.png= and =baron-of-hell.png=.
4. We write the new tree to the repository.
5. We then add this tree to this directory's parent. Meaning that at
   this point, =assets/sprites/= now contains our new tree object's
   SHA-1 id under the name =monsters=.
6. And we iterate over the next directory, let's say
   =assets/sprites/keys= where we find =red.png=, =blue.png= and
   =yellow.png=, create a tree, store the tree, add the tree's SHA-1
   under the name =keys= under =assets/sprites/=, and so on.

And since trees are recursive, the last tree we'll build, which is
necessarily the one for root (since its key's length is 0), will
ultimately refer to all others, and thus will be the only one we'll
need. We'll simply return its SHA-1, and be done.

Since this may seem a bit complex, let's work this example in full
detail --- feel free to skip. At the beginning, the dictionary we
built from the index looks like this:

#+begin_example
contents["assets/sprites/monsters"] =
  [ cacodemon.png : GitIndexEntry
  , imp.png : GitIndexEntry
  , baron-of-hell.png : GitIndexEntry ]
contents["assets/sprites/keys"] =
  [ red.png : GitIndexEntry
  , blue.png : GitIndexEntry
  , yellow.png : GitIndexEntry ]
contents["assets/sprites/"] =
  [ hero.png : GitIndexEntry ]
contents["assets/"] = [] # No files in here
contents[""] = # Root!
  [ README: GitIndexEntry ]
#+end_example

We iterate over it, by order of descending key length. The first key
we meet is the longest, so =assets/sprites/monsters=. We build a new
tree object from its contents, which associates the three file names
(=cacodemon.png=, =imp.png=, =baron-of-hell.png=) with their
corresponding blobs (A tree leaf stores /less/ data than the index ---
just path, mode and blob.
So converting entries that way is easy.)

Notice we don't need to concern ourselves with storing the *contents*
of those files: =wyag add= did create the corresponding blobs as
needed. We need to store the /trees/ we create to the object store,
but we can assume the blobs are there already.

Let's say that our new tree, made from the index entries that
lived directly in =assets/sprites/monsters=, hashes down to
=426f894781bc3c38f1d26f8fd2c7f38ab8d21763=. We *modify our
dictionary* to add that new tree object to the directory's parent,
so what remains to traverse now looks like this:

#+begin_example
contents["assets/sprites/keys"] = # <- unmodified.
  [ red.png : GitIndexEntry
  , blue.png : GitIndexEntry
  , yellow.png : GitIndexEntry ]
contents["assets/sprites/"] =
  [ hero.png : GitIndexEntry
  , monsters : Tree 426f894781bc3c38f1d26f8fd2c7f38ab8d21763 ] <- look here
contents["assets/"] = [] # empty
contents[""] = # Root!
  [ README: GitIndexEntry ]
#+end_example

We do the same for the next longest key, =assets/sprites/keys=,
producing a tree of hash =b42788e087b1e94a0e69dcb7a4a243eaab802bb2=,
so:

#+begin_example
contents["assets/sprites/"] =
  [ hero.png : GitIndexEntry
  , monsters : Tree 426f894781bc3c38f1d26f8fd2c7f38ab8d21763
  , keys : Tree b42788e087b1e94a0e69dcb7a4a243eaab802bb2 ]
contents["assets/"] = [] # empty
contents[""] = # Root!
  [ README: GitIndexEntry ]
#+end_example

We then generate tree =6364113557ed681d775ccbd3c90895ed276956a2= from
=assets/sprites=, which now contains our two trees and =hero.png=.

#+begin_example
contents["assets/"] = [
  sprites: Tree 6364113557ed681d775ccbd3c90895ed276956a2 ]
contents[""] = # Root!
  [ README: GitIndexEntry ]
#+end_example

=assets= in turn becomes tree =4d35513cb6d2a816bc00505be926624440ebbddd=, so:

#+begin_example
contents[""] = # Root!
  [ README: GitIndexEntry,
    assets: Tree 4d35513cb6d2a816bc00505be926624440ebbddd ]
#+end_example

We make a tree from that last key (with the =README= blob and the
=assets= subtree); it hashes to
=9352e52ff58fa9bf5a750f090af64c09fa6a3d93=. That's our return value:
the tree whose contents are the same as the index's.

Here's the actual function:

#+begin_src python :tangle libwyag.py
def tree_from_index(repo, index):
    contents = dict()
    contents[""] = list()

    # Enumerate entries, and turn them into a dictionary where keys
    # are directories, and values are lists of directory contents.
    for entry in index.entries:
        dirname = os.path.dirname(entry.name)

        # We create all dictionary entries up to root (""). We need
        # them *all*, because even if a directory holds no files it
        # will contain at least a tree.
        key = dirname
        while key != "":
            if key not in contents:
                contents[key] = list()
            key = os.path.dirname(key)

        # For now, simply store the entry in the list.
        contents[dirname].append(entry)

    # Get keys (= directories) and sort them by length, descending.
    # This means that we'll always encounter a given path before its
    # parent, which is all we need, since for each directory D we'll
    # need to modify its parent P to add D's tree.
    sorted_paths = sorted(contents.keys(), key=len, reverse=True)

    # This variable will store the current tree's SHA-1. After we're
    # done iterating over our dict, it will contain the hash for the
    # root tree.
    sha = None

    # We go through the sorted list of paths (dict keys)
    for path in sorted_paths:
        # Prepare a new, empty tree object
        tree = GitTree()

        # Add each entry to our new tree, in turn
        for entry in contents[path]:
            # An entry can be a normal GitIndexEntry read from the
            # index, or a tree we've created.
            if isinstance(entry, GitIndexEntry): # Regular entry (a file)

                # We transcode the mode: the entry stores it as integers,
                # we need an octal ASCII representation for the tree.
                leaf_mode = f"{entry.mode_type:02o}{entry.mode_perms:04o}".encode("ascii")
                leaf = GitTreeLeaf(mode = leaf_mode, path=os.path.basename(entry.name), sha=entry.sha)
            else: # Tree. We've stored it as a pair: (basename, SHA)
                leaf = GitTreeLeaf(mode = b"040000", path=entry[0], sha=entry[1])

            tree.items.append(leaf)

        # Write the new tree object to the store.
        sha = object_write(tree, repo)

        # Add the new tree hash to the current directory's parent, as
        # a pair (basename, SHA)
        parent = os.path.dirname(path)
        base = os.path.basename(path) # The name without the path, eg main.go for src/main.go
        contents[parent].append((base, sha))

    return sha
#+end_src

This was the hard part; I hope it's clear enough. From this, creating
the commit object and updating HEAD will be way easier. Just remember
that what this function /does/ is build and store as many tree objects
as needed to represent the index, and return the root tree's SHA-1.

The function to create a commit object is simple enough. It just takes
some arguments: the hash of the tree, the hash of the parent commit,
the author's identity (a string), the timestamp and timezone delta,
and the message:

# @TODO Explain them!
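
The only fiddly argument is the timestamp. In a commit object, the
author and committer lines end with the date as seconds since the
epoch, followed by the timezone delta written as a sign, two digits
of hours and two digits of minutes. Here's a minimal sketch of that
formatting, on its own (=format_tz= and =identity_line= are
hypothetical helper names used only for this illustration, and the
sample identity is made up):

#+begin_src python
from datetime import datetime, timedelta, timezone

def format_tz(offset_seconds):
    """Format a UTC offset given in seconds as +HHMM or -HHMM."""
    sign = "+" if offset_seconds >= 0 else "-"
    # Work on the absolute value, so the sign is only written once.
    offset_seconds = abs(offset_seconds)
    hours = offset_seconds // 3600
    minutes = (offset_seconds % 3600) // 60
    return f"{sign}{hours:02}{minutes:02}"

def identity_line(identity, when):
    """Build 'Name <email> EPOCH TZ' from a timezone-aware datetime."""
    offset = int(when.utcoffset().total_seconds())
    return f"{identity} {int(when.timestamp())} {format_tz(offset)}"

# One hour east of UTC, the date used by wyag-tests.sh.
when = datetime(2010, 1, 1, 1, 2, 3, tzinfo=timezone(timedelta(hours=1)))
print(identity_line("wyag-tests.sh <wyag@example.com>", when))
#+end_src

Taking the absolute value before splitting into hours and minutes
matters: Python's =//= rounds towards negative infinity, so dividing
a negative offset directly would give the wrong hour count. With that
detail in mind, here's the function: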

#+begin_src python :tangle libwyag.py
def commit_create(repo, tree, parent, author, timestamp, message):
    commit = GitCommit() # Create the new commit object.
    commit.kvlm[b"tree"] = tree.encode("ascii")
    if parent:
        commit.kvlm[b"parent"] = parent.encode("ascii")

    # Trim message and add a trailing \n
    message = message.strip() + "\n"
    # Format timezone as +HHMM or -HHMM. We split the absolute value
    # into hours and minutes, so negative offsets come out right.
    offset = int(timestamp.astimezone().utcoffset().total_seconds())
    sign = "+" if offset >= 0 else "-"
    offset = abs(offset)
    hours = offset // 3600
    minutes = (offset % 3600) // 60
    tz = "{}{:02}{:02}".format(sign, hours, minutes)

    # Seconds since the epoch. (timestamp() is portable, unlike the
    # "%s" strftime directive.)
    author = author + " {} ".format(int(timestamp.timestamp())) + tz

    commit.kvlm[b"author"] = author.encode("utf8")
    commit.kvlm[b"committer"] = author.encode("utf8")
    commit.kvlm[None] = message.encode("utf8")

    return object_write(commit, repo)
#+end_src

All that remains to write is =cmd_commit=, the bridge function to the
=wyag commit= command:

#+begin_src python :tangle libwyag.py
def cmd_commit(args):
    repo = repo_find()
    index = index_read(repo)
    # Create trees, grab back SHA for the root tree.
    tree = tree_from_index(repo, index)

    # Create the commit object itself
    commit = commit_create(repo,
                           tree,
                           object_find(repo, "HEAD"),
                           gitconfig_user_get(gitconfig_read()),
                           datetime.now(),
                           args.message)

    # Update HEAD so our commit is now the tip of the active branch.
    active_branch = branch_get_active(repo)
    if active_branch: # If we're on a branch, we update refs/heads/BRANCH
        with open(repo_file(repo, os.path.join("refs/heads", active_branch)), "w") as fd:
            fd.write(commit + "\n")
    else: # Otherwise, we update HEAD itself.
        with open(repo_file(repo, "HEAD"), "w") as fd:
            fd.write(commit + "\n")
#+end_src

And we're done!

* Final words

** Conclusion, and beyond commit :noexport:

With that final command, =wyag= is done. I hope you've enjoyed that
little journey into the internals of git core. Obviously, we're still
very far from what the real git can do, but the goal was to expose the
model, and this is done.

One of the most fundamental design choices of git, which you've
probably noticed, is that it only stores full states of the
repository. We like to think of commits as /transformations/ of
source code, and it makes sense to us because in a way that's what
they /are/, but to git itself each commit is like a zip of a
directory, as disconnected from the previous one as it is from the
“next”. Git neither knows nor cares about deltas and patches, and even
file rename indications are just a trick. (To be perfectly honest: it
actually /knows/ about deltas, but as a storage optimization method,
in packfiles --- which are optional).

** Comments, feedback and issues
:PROPERTIES:
:CUSTOM_ID: feedback
:END:

This page has no comment system :) I can be reached by e-mail at
[[mailto:thibault@thb.lt][thibault@thb.lt]]. I can also be found [[https://toad.social/@thblt][on Mastodon as @thblt@toad.social]]
and [[https://twitter.com/ThbPlg][on Twitter as @ThbPlg]], and on IRC (sometimes)
as =thblt= on Libera.

The source for this article is hosted [[https://github.com/thblt/write-yourself-a-git][on GitHub]]. Issue reports and
pull requests are welcome, either directly on GitHub or through e-mail
if you prefer.

** Release information :noexport:

#+begin_src emacs-lisp :exports results :results table
(list
 '("Key" "Value")
 'hline
 `("Creation date" ,(current-time-string))
 `("On commit" ,(format "=%s= (%s)"
                        (string-trim (shell-command-to-string "git describe --tags --always"))
                        (if (string-empty-p
                             (string-trim (shell-command-to-string "git status --porcelain=v2")))
                            "clean" "*dirty*")))
 `("By" ,(format "%s (=%s= on =%s=)"
                 (user-full-name)
                 (user-login-name)
                 (system-name)))
 `("Emacs version" ,emacs-version)
 `("Org-mode version" ,org-version))
#+end_src

** License

This article is distributed under the terms of the [[https://creativecommons.org/licenses/by-nc-sa/4.0/][Creative Commons BY-NC-SA 4.0]].
The [[./wyag.zip][program itself]] is also licensed under the terms
of the GNU General Public License 3.0, or, at your option, any later
version of the same licence.

* Footnotes

[fn:1] You may know that [[https://shattered.io/][collisions have been discovered in SHA-1]].
Git actually doesn't use SHA-1 anymore: it uses a [[https://github.com/git/git/blob/26e47e261e969491ad4e3b6c298450c061749c9e/Documentation/technical/hash-function-transition.txt#L34-L36][hardened variant]]
which is not SHA, but which applies the same hash to every known input
but the two PDF files known to collide.

--------------------------------------------------------------------------------
/wyag-tests.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env bash
set -e

function step() {
    pos=$(caller)
    echo $pos $@
}

wyag=$(realpath ./wyag)

testdir=/tmp/wyag-tests
if [[ -e $testdir ]]; then
    rm -rf $testdir/*
else
    mkdir $testdir
fi
cd $testdir
step "working on $(pwd)"

step Create repos
$wyag init left
git init right > /dev/null

step status
cd left
git status > /dev/null
cd ../right
git status > /dev/null
cd ..

step hash-object
echo "Don't read me" > README
$wyag hash-object README > hash1
git hash-object README > hash2
cmp --quiet hash1 hash2

step hash-object -w
cd left
$wyag hash-object -w ../README > /dev/null
cd ../right
git hash-object -w ../README > /dev/null
cd ..
ls left/.git/objects/b1/7df541639ec7814a9ad274e177d9f8da1eb951 > /dev/null
ls right/.git/objects/b1/7df541639ec7814a9ad274e177d9f8da1eb951 > /dev/null

step cat-file
cd left
$wyag cat-file blob b17d > ../file1
cd ../right
git cat-file blob b17d > ../file2
cd ..
cmp file1 file2

step cat-file with long hash
cd left
$wyag cat-file blob b17df541639ec7814a9ad274e177d9f8da1eb951 > ../file1
cd ../right
git cat-file blob b17df541639ec7814a9ad274e177d9f8da1eb951 > ../file2
cd ..
cmp file1 file2

step "Create commit (git only, nothing is tested)" #@FIXME Add wyag commit
cd left
echo "Aleph" > hebraic-letter.txt
git add hebraic-letter.txt
GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
    GIT_AUTHOR_NAME="wyag-tests.sh" \
    GIT_AUTHOR_EMAIL="wyag@example.com" \
    GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
    GIT_COMMITTER_NAME="wyag-tests.sh" \
    GIT_COMMITTER_EMAIL="wyag@example.com" \
    git commit --no-gpg-sign -m "Initial commit" > /dev/null
cd ../right
echo "Aleph" > hebraic-letter.txt
git add hebraic-letter.txt
GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
    GIT_AUTHOR_NAME="wyag-tests.sh" \
    GIT_AUTHOR_EMAIL="wyag@example.com" \
    GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
    GIT_COMMITTER_NAME="wyag-tests.sh" \
    GIT_COMMITTER_EMAIL="wyag@example.com" \
    git commit --no-gpg-sign -m "Initial commit" > /dev/null

cd ..

step cat-file on commit object without indirection
cd left
$wyag cat-file commit HEAD > ../file1
cd ../right
git cat-file commit HEAD > ../file2
cd ..
cmp file1 file2

step cat-file on tree object redirected from commit
cd left
$wyag cat-file tree HEAD > ../file1
cd ../right
git cat-file tree HEAD > ../file2
cd ..
cmp file1 file2

step "Add some directories and commits (git only, nothing is tested)" #@FIXME Add wyag commit
cd left
mkdir a
echo "Alpha" > a/greek_letters
mkdir b
echo "Hamza" > a/arabic_letters
git add a/*
GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
    GIT_AUTHOR_NAME="wyag-tests.sh" \
    GIT_AUTHOR_EMAIL="wyag@example.com" \
    GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
    GIT_COMMITTER_NAME="wyag-tests.sh" \
    GIT_COMMITTER_EMAIL="wyag@example.com" \
    git commit --no-gpg-sign -m "Commit 2" > /dev/null
cd ../right
mkdir a
echo "Alpha" > a/greek_letters
mkdir b
echo "Hamza" > a/arabic_letters
git add a/*
GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
    GIT_AUTHOR_NAME="wyag-tests.sh" \
    GIT_AUTHOR_EMAIL="wyag@example.com" \
    GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
    GIT_COMMITTER_NAME="wyag-tests.sh" \
    GIT_COMMITTER_EMAIL="wyag@example.com" \
    git commit --no-gpg-sign -m "Commit 2" > /dev/null
cd ..

step ls-tree
cd left
$wyag ls-tree HEAD > ../file1
cd ../right
git ls-tree HEAD > ../file2
cd ..
cmp file1 file2

step checkout
# Git and Wyag syntax are different here
cd left
$wyag checkout HEAD ../temp1
mkdir ../temp2
cd ../temp2
git --git-dir=../right/.git checkout .
cd ..
diff -r temp1 temp2
rm -rf temp1 temp2

step rev-parse
cd left
$wyag rev-parse HEAD > ../file1
$wyag rev-parse 75ee4 >> ../file1
$wyag rev-parse 8a617 >> ../file1
#@FIXME Tags missing, branches missing, remotes missing
cd ../right
git rev-parse HEAD > ../file2
git rev-parse 75ee4 >> ../file2
git rev-parse 8a617 >> ../file2
cd ..
cmp file1 file2

step "ls-files "
cd left
$wyag ls-files > ../file1
cd ../right
git ls-files > ../file2
cd ..
cmp file1 file2

gitignore_prepare() {
    mkdir -p a/b/c/
    echo "!*.txt" > a/b/c/.gitignore
    echo "*.txt" > a/b/.gitignore
    echo "*.org" > a/.gitignore
    git add -A
}

step "gitignore"
cd left
gitignore_prepare
$wyag check-ignore a/b/c/hello.txt > ../file1
$wyag check-ignore a/b/hello.txt >> ../file1
$wyag check-ignore a/hello.org >> ../file1
$wyag check-ignore hello.org >> ../file1
cd ../right
set +e # git will return with non-zero
gitignore_prepare
git check-ignore a/b/c/hello.txt > ../file2
git check-ignore a/b/hello.txt >> ../file2
git check-ignore a/hello.org >> ../file2
git check-ignore hello.org >> ../file2
set -e
cd ..
cmp file1 file2




step THIS WAS A TRIUMPH
step "I'M MAKING A NOTE HERE"
step "HUGE SUCCESS"

--------------------------------------------------------------------------------