├── .gitignore
├── .gitmodules
├── LICENSE
├── Makefile
├── README.org
├── write-yourself-a-git.org
└── wyag-tests.sh


/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | .last_push
3 | __pycache__
4 | libwyag.py
5 | src
6 | wyag
7 | wyag.zip
8 | 


--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "lib/org-html-themes"]
2 | 	path = lib/org-html-themes
3 | 	url = https://github.com/fniessen/org-html-themes
4 | [submodule "lib/htmlize"]
5 | 	path = lib/htmlize
6 | 	url = https://github.com/hniksic/emacs-htmlize
7 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                     GNU GENERAL PUBLIC LICENSE
  2 |                        Version 3, 29 June 2007
  3 | 
  4 |  Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
  5 |  Everyone is permitted to copy and distribute verbatim copies
  6 |  of this license document, but changing it is not allowed.
  7 | 
  8 |                             Preamble
  9 | 
 10 |   The GNU General Public License is a free, copyleft license for
 11 | software and other kinds of works.
 12 | 
 13 |   The licenses for most software and other practical works are designed
 14 | to take away your freedom to share and change the works.  By contrast,
 15 | the GNU General Public License is intended to guarantee your freedom to
 16 | share and change all versions of a program--to make sure it remains free
 17 | software for all its users.  We, the Free Software Foundation, use the
 18 | GNU General Public License for most of our software; it applies also to
 19 | any other work released this way by its authors.  You can apply it to
 20 | your programs, too.
 21 | 
 22 |   When we speak of free software, we are referring to freedom, not
 23 | price.  Our General Public Licenses are designed to make sure that you
 24 | have the freedom to distribute copies of free software (and charge for
 25 | them if you wish), that you receive source code or can get it if you
 26 | want it, that you can change the software or use pieces of it in new
 27 | free programs, and that you know you can do these things.
 28 | 
 29 |   To protect your rights, we need to prevent others from denying you
 30 | these rights or asking you to surrender the rights.  Therefore, you have
 31 | certain responsibilities if you distribute copies of the software, or if
 32 | you modify it: responsibilities to respect the freedom of others.
 33 | 
 34 |   For example, if you distribute copies of such a program, whether
 35 | gratis or for a fee, you must pass on to the recipients the same
 36 | freedoms that you received.  You must make sure that they, too, receive
 37 | or can get the source code.  And you must show them these terms so they
 38 | know their rights.
 39 | 
 40 |   Developers that use the GNU GPL protect your rights with two steps:
 41 | (1) assert copyright on the software, and (2) offer you this License
 42 | giving you legal permission to copy, distribute and/or modify it.
 43 | 
 44 |   For the developers' and authors' protection, the GPL clearly explains
 45 | that there is no warranty for this free software.  For both users' and
 46 | authors' sake, the GPL requires that modified versions be marked as
 47 | changed, so that their problems will not be attributed erroneously to
 48 | authors of previous versions.
 49 | 
 50 |   Some devices are designed to deny users access to install or run
 51 | modified versions of the software inside them, although the manufacturer
 52 | can do so.  This is fundamentally incompatible with the aim of
 53 | protecting users' freedom to change the software.  The systematic
 54 | pattern of such abuse occurs in the area of products for individuals to
 55 | use, which is precisely where it is most unacceptable.  Therefore, we
 56 | have designed this version of the GPL to prohibit the practice for those
 57 | products.  If such problems arise substantially in other domains, we
 58 | stand ready to extend this provision to those domains in future versions
 59 | of the GPL, as needed to protect the freedom of users.
 60 | 
 61 |   Finally, every program is threatened constantly by software patents.
 62 | States should not allow patents to restrict development and use of
 63 | software on general-purpose computers, but in those that do, we wish to
 64 | avoid the special danger that patents applied to a free program could
 65 | make it effectively proprietary.  To prevent this, the GPL assures that
 66 | patents cannot be used to render the program non-free.
 67 | 
 68 |   The precise terms and conditions for copying, distribution and
 69 | modification follow.
 70 | 
 71 |                        TERMS AND CONDITIONS
 72 | 
 73 |   0. Definitions.
 74 | 
 75 |   "This License" refers to version 3 of the GNU General Public License.
 76 | 
 77 |   "Copyright" also means copyright-like laws that apply to other kinds of
 78 | works, such as semiconductor masks.
 79 | 
 80 |   "The Program" refers to any copyrightable work licensed under this
 81 | License.  Each licensee is addressed as "you".  "Licensees" and
 82 | "recipients" may be individuals or organizations.
 83 | 
 84 |   To "modify" a work means to copy from or adapt all or part of the work
 85 | in a fashion requiring copyright permission, other than the making of an
 86 | exact copy.  The resulting work is called a "modified version" of the
 87 | earlier work or a work "based on" the earlier work.
 88 | 
 89 |   A "covered work" means either the unmodified Program or a work based
 90 | on the Program.
 91 | 
 92 |   To "propagate" a work means to do anything with it that, without
 93 | permission, would make you directly or secondarily liable for
 94 | infringement under applicable copyright law, except executing it on a
 95 | computer or modifying a private copy.  Propagation includes copying,
 96 | distribution (with or without modification), making available to the
 97 | public, and in some countries other activities as well.
 98 | 
 99 |   To "convey" a work means any kind of propagation that enables other
100 | parties to make or receive copies.  Mere interaction with a user through
101 | a computer network, with no transfer of a copy, is not conveying.
102 | 
103 |   An interactive user interface displays "Appropriate Legal Notices"
104 | to the extent that it includes a convenient and prominently visible
105 | feature that (1) displays an appropriate copyright notice, and (2)
106 | tells the user that there is no warranty for the work (except to the
107 | extent that warranties are provided), that licensees may convey the
108 | work under this License, and how to view a copy of this License.  If
109 | the interface presents a list of user commands or options, such as a
110 | menu, a prominent item in the list meets this criterion.
111 | 
112 |   1. Source Code.
113 | 
114 |   The "source code" for a work means the preferred form of the work
115 | for making modifications to it.  "Object code" means any non-source
116 | form of a work.
117 | 
118 |   A "Standard Interface" means an interface that either is an official
119 | standard defined by a recognized standards body, or, in the case of
120 | interfaces specified for a particular programming language, one that
121 | is widely used among developers working in that language.
122 | 
123 |   The "System Libraries" of an executable work include anything, other
124 | than the work as a whole, that (a) is included in the normal form of
125 | packaging a Major Component, but which is not part of that Major
126 | Component, and (b) serves only to enable use of the work with that
127 | Major Component, or to implement a Standard Interface for which an
128 | implementation is available to the public in source code form.  A
129 | "Major Component", in this context, means a major essential component
130 | (kernel, window system, and so on) of the specific operating system
131 | (if any) on which the executable work runs, or a compiler used to
132 | produce the work, or an object code interpreter used to run it.
133 | 
134 |   The "Corresponding Source" for a work in object code form means all
135 | the source code needed to generate, install, and (for an executable
136 | work) run the object code and to modify the work, including scripts to
137 | control those activities.  However, it does not include the work's
138 | System Libraries, or general-purpose tools or generally available free
139 | programs which are used unmodified in performing those activities but
140 | which are not part of the work.  For example, Corresponding Source
141 | includes interface definition files associated with source files for
142 | the work, and the source code for shared libraries and dynamically
143 | linked subprograms that the work is specifically designed to require,
144 | such as by intimate data communication or control flow between those
145 | subprograms and other parts of the work.
146 | 
147 |   The Corresponding Source need not include anything that users
148 | can regenerate automatically from other parts of the Corresponding
149 | Source.
150 | 
151 |   The Corresponding Source for a work in source code form is that
152 | same work.
153 | 
154 |   2. Basic Permissions.
155 | 
156 |   All rights granted under this License are granted for the term of
157 | copyright on the Program, and are irrevocable provided the stated
158 | conditions are met.  This License explicitly affirms your unlimited
159 | permission to run the unmodified Program.  The output from running a
160 | covered work is covered by this License only if the output, given its
161 | content, constitutes a covered work.  This License acknowledges your
162 | rights of fair use or other equivalent, as provided by copyright law.
163 | 
164 |   You may make, run and propagate covered works that you do not
165 | convey, without conditions so long as your license otherwise remains
166 | in force.  You may convey covered works to others for the sole purpose
167 | of having them make modifications exclusively for you, or provide you
168 | with facilities for running those works, provided that you comply with
169 | the terms of this License in conveying all material for which you do
170 | not control copyright.  Those thus making or running the covered works
171 | for you must do so exclusively on your behalf, under your direction
172 | and control, on terms that prohibit them from making any copies of
173 | your copyrighted material outside their relationship with you.
174 | 
175 |   Conveying under any other circumstances is permitted solely under
176 | the conditions stated below.  Sublicensing is not allowed; section 10
177 | makes it unnecessary.
178 | 
179 |   3. Protecting Users' Legal Rights From Anti-Circumvention Law.
180 | 
181 |   No covered work shall be deemed part of an effective technological
182 | measure under any applicable law fulfilling obligations under article
183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
184 | similar laws prohibiting or restricting circumvention of such
185 | measures.
186 | 
187 |   When you convey a covered work, you waive any legal power to forbid
188 | circumvention of technological measures to the extent such circumvention
189 | is effected by exercising rights under this License with respect to
190 | the covered work, and you disclaim any intention to limit operation or
191 | modification of the work as a means of enforcing, against the work's
192 | users, your or third parties' legal rights to forbid circumvention of
193 | technological measures.
194 | 
195 |   4. Conveying Verbatim Copies.
196 | 
197 |   You may convey verbatim copies of the Program's source code as you
198 | receive it, in any medium, provided that you conspicuously and
199 | appropriately publish on each copy an appropriate copyright notice;
200 | keep intact all notices stating that this License and any
201 | non-permissive terms added in accord with section 7 apply to the code;
202 | keep intact all notices of the absence of any warranty; and give all
203 | recipients a copy of this License along with the Program.
204 | 
205 |   You may charge any price or no price for each copy that you convey,
206 | and you may offer support or warranty protection for a fee.
207 | 
208 |   5. Conveying Modified Source Versions.
209 | 
210 |   You may convey a work based on the Program, or the modifications to
211 | produce it from the Program, in the form of source code under the
212 | terms of section 4, provided that you also meet all of these conditions:
213 | 
214 |     a) The work must carry prominent notices stating that you modified
215 |     it, and giving a relevant date.
216 | 
217 |     b) The work must carry prominent notices stating that it is
218 |     released under this License and any conditions added under section
219 |     7.  This requirement modifies the requirement in section 4 to
220 |     "keep intact all notices".
221 | 
222 |     c) You must license the entire work, as a whole, under this
223 |     License to anyone who comes into possession of a copy.  This
224 |     License will therefore apply, along with any applicable section 7
225 |     additional terms, to the whole of the work, and all its parts,
226 |     regardless of how they are packaged.  This License gives no
227 |     permission to license the work in any other way, but it does not
228 |     invalidate such permission if you have separately received it.
229 | 
230 |     d) If the work has interactive user interfaces, each must display
231 |     Appropriate Legal Notices; however, if the Program has interactive
232 |     interfaces that do not display Appropriate Legal Notices, your
233 |     work need not make them do so.
234 | 
235 |   A compilation of a covered work with other separate and independent
236 | works, which are not by their nature extensions of the covered work,
237 | and which are not combined with it such as to form a larger program,
238 | in or on a volume of a storage or distribution medium, is called an
239 | "aggregate" if the compilation and its resulting copyright are not
240 | used to limit the access or legal rights of the compilation's users
241 | beyond what the individual works permit.  Inclusion of a covered work
242 | in an aggregate does not cause this License to apply to the other
243 | parts of the aggregate.
244 | 
245 |   6. Conveying Non-Source Forms.
246 | 
247 |   You may convey a covered work in object code form under the terms
248 | of sections 4 and 5, provided that you also convey the
249 | machine-readable Corresponding Source under the terms of this License,
250 | in one of these ways:
251 | 
252 |     a) Convey the object code in, or embodied in, a physical product
253 |     (including a physical distribution medium), accompanied by the
254 |     Corresponding Source fixed on a durable physical medium
255 |     customarily used for software interchange.
256 | 
257 |     b) Convey the object code in, or embodied in, a physical product
258 |     (including a physical distribution medium), accompanied by a
259 |     written offer, valid for at least three years and valid for as
260 |     long as you offer spare parts or customer support for that product
261 |     model, to give anyone who possesses the object code either (1) a
262 |     copy of the Corresponding Source for all the software in the
263 |     product that is covered by this License, on a durable physical
264 |     medium customarily used for software interchange, for a price no
265 |     more than your reasonable cost of physically performing this
266 |     conveying of source, or (2) access to copy the
267 |     Corresponding Source from a network server at no charge.
268 | 
269 |     c) Convey individual copies of the object code with a copy of the
270 |     written offer to provide the Corresponding Source.  This
271 |     alternative is allowed only occasionally and noncommercially, and
272 |     only if you received the object code with such an offer, in accord
273 |     with subsection 6b.
274 | 
275 |     d) Convey the object code by offering access from a designated
276 |     place (gratis or for a charge), and offer equivalent access to the
277 |     Corresponding Source in the same way through the same place at no
278 |     further charge.  You need not require recipients to copy the
279 |     Corresponding Source along with the object code.  If the place to
280 |     copy the object code is a network server, the Corresponding Source
281 |     may be on a different server (operated by you or a third party)
282 |     that supports equivalent copying facilities, provided you maintain
283 |     clear directions next to the object code saying where to find the
284 |     Corresponding Source.  Regardless of what server hosts the
285 |     Corresponding Source, you remain obligated to ensure that it is
286 |     available for as long as needed to satisfy these requirements.
287 | 
288 |     e) Convey the object code using peer-to-peer transmission, provided
289 |     you inform other peers where the object code and Corresponding
290 |     Source of the work are being offered to the general public at no
291 |     charge under subsection 6d.
292 | 
293 |   A separable portion of the object code, whose source code is excluded
294 | from the Corresponding Source as a System Library, need not be
295 | included in conveying the object code work.
296 | 
297 |   A "User Product" is either (1) a "consumer product", which means any
298 | tangible personal property which is normally used for personal, family,
299 | or household purposes, or (2) anything designed or sold for incorporation
300 | into a dwelling.  In determining whether a product is a consumer product,
301 | doubtful cases shall be resolved in favor of coverage.  For a particular
302 | product received by a particular user, "normally used" refers to a
303 | typical or common use of that class of product, regardless of the status
304 | of the particular user or of the way in which the particular user
305 | actually uses, or expects or is expected to use, the product.  A product
306 | is a consumer product regardless of whether the product has substantial
307 | commercial, industrial or non-consumer uses, unless such uses represent
308 | the only significant mode of use of the product.
309 | 
310 |   "Installation Information" for a User Product means any methods,
311 | procedures, authorization keys, or other information required to install
312 | and execute modified versions of a covered work in that User Product from
313 | a modified version of its Corresponding Source.  The information must
314 | suffice to ensure that the continued functioning of the modified object
315 | code is in no case prevented or interfered with solely because
316 | modification has been made.
317 | 
318 |   If you convey an object code work under this section in, or with, or
319 | specifically for use in, a User Product, and the conveying occurs as
320 | part of a transaction in which the right of possession and use of the
321 | User Product is transferred to the recipient in perpetuity or for a
322 | fixed term (regardless of how the transaction is characterized), the
323 | Corresponding Source conveyed under this section must be accompanied
324 | by the Installation Information.  But this requirement does not apply
325 | if neither you nor any third party retains the ability to install
326 | modified object code on the User Product (for example, the work has
327 | been installed in ROM).
328 | 
329 |   The requirement to provide Installation Information does not include a
330 | requirement to continue to provide support service, warranty, or updates
331 | for a work that has been modified or installed by the recipient, or for
332 | the User Product in which it has been modified or installed.  Access to a
333 | network may be denied when the modification itself materially and
334 | adversely affects the operation of the network or violates the rules and
335 | protocols for communication across the network.
336 | 
337 |   Corresponding Source conveyed, and Installation Information provided,
338 | in accord with this section must be in a format that is publicly
339 | documented (and with an implementation available to the public in
340 | source code form), and must require no special password or key for
341 | unpacking, reading or copying.
342 | 
343 |   7. Additional Terms.
344 | 
345 |   "Additional permissions" are terms that supplement the terms of this
346 | License by making exceptions from one or more of its conditions.
347 | Additional permissions that are applicable to the entire Program shall
348 | be treated as though they were included in this License, to the extent
349 | that they are valid under applicable law.  If additional permissions
350 | apply only to part of the Program, that part may be used separately
351 | under those permissions, but the entire Program remains governed by
352 | this License without regard to the additional permissions.
353 | 
354 |   When you convey a copy of a covered work, you may at your option
355 | remove any additional permissions from that copy, or from any part of
356 | it.  (Additional permissions may be written to require their own
357 | removal in certain cases when you modify the work.)  You may place
358 | additional permissions on material, added by you to a covered work,
359 | for which you have or can give appropriate copyright permission.
360 | 
361 |   Notwithstanding any other provision of this License, for material you
362 | add to a covered work, you may (if authorized by the copyright holders of
363 | that material) supplement the terms of this License with terms:
364 | 
365 |     a) Disclaiming warranty or limiting liability differently from the
366 |     terms of sections 15 and 16 of this License; or
367 | 
368 |     b) Requiring preservation of specified reasonable legal notices or
369 |     author attributions in that material or in the Appropriate Legal
370 |     Notices displayed by works containing it; or
371 | 
372 |     c) Prohibiting misrepresentation of the origin of that material, or
373 |     requiring that modified versions of such material be marked in
374 |     reasonable ways as different from the original version; or
375 | 
376 |     d) Limiting the use for publicity purposes of names of licensors or
377 |     authors of the material; or
378 | 
379 |     e) Declining to grant rights under trademark law for use of some
380 |     trade names, trademarks, or service marks; or
381 | 
382 |     f) Requiring indemnification of licensors and authors of that
383 |     material by anyone who conveys the material (or modified versions of
384 |     it) with contractual assumptions of liability to the recipient, for
385 |     any liability that these contractual assumptions directly impose on
386 |     those licensors and authors.
387 | 
388 |   All other non-permissive additional terms are considered "further
389 | restrictions" within the meaning of section 10.  If the Program as you
390 | received it, or any part of it, contains a notice stating that it is
391 | governed by this License along with a term that is a further
392 | restriction, you may remove that term.  If a license document contains
393 | a further restriction but permits relicensing or conveying under this
394 | License, you may add to a covered work material governed by the terms
395 | of that license document, provided that the further restriction does
396 | not survive such relicensing or conveying.
397 | 
398 |   If you add terms to a covered work in accord with this section, you
399 | must place, in the relevant source files, a statement of the
400 | additional terms that apply to those files, or a notice indicating
401 | where to find the applicable terms.
402 | 
403 |   Additional terms, permissive or non-permissive, may be stated in the
404 | form of a separately written license, or stated as exceptions;
405 | the above requirements apply either way.
406 | 
407 |   8. Termination.
408 | 
409 |   You may not propagate or modify a covered work except as expressly
410 | provided under this License.  Any attempt otherwise to propagate or
411 | modify it is void, and will automatically terminate your rights under
412 | this License (including any patent licenses granted under the third
413 | paragraph of section 11).
414 | 
415 |   However, if you cease all violation of this License, then your
416 | license from a particular copyright holder is reinstated (a)
417 | provisionally, unless and until the copyright holder explicitly and
418 | finally terminates your license, and (b) permanently, if the copyright
419 | holder fails to notify you of the violation by some reasonable means
420 | prior to 60 days after the cessation.
421 | 
422 |   Moreover, your license from a particular copyright holder is
423 | reinstated permanently if the copyright holder notifies you of the
424 | violation by some reasonable means, this is the first time you have
425 | received notice of violation of this License (for any work) from that
426 | copyright holder, and you cure the violation prior to 30 days after
427 | your receipt of the notice.
428 | 
429 |   Termination of your rights under this section does not terminate the
430 | licenses of parties who have received copies or rights from you under
431 | this License.  If your rights have been terminated and not permanently
432 | reinstated, you do not qualify to receive new licenses for the same
433 | material under section 10.
434 | 
435 |   9. Acceptance Not Required for Having Copies.
436 | 
437 |   You are not required to accept this License in order to receive or
438 | run a copy of the Program.  Ancillary propagation of a covered work
439 | occurring solely as a consequence of using peer-to-peer transmission
440 | to receive a copy likewise does not require acceptance.  However,
441 | nothing other than this License grants you permission to propagate or
442 | modify any covered work.  These actions infringe copyright if you do
443 | not accept this License.  Therefore, by modifying or propagating a
444 | covered work, you indicate your acceptance of this License to do so.
445 | 
446 |   10. Automatic Licensing of Downstream Recipients.
447 | 
448 |   Each time you convey a covered work, the recipient automatically
449 | receives a license from the original licensors, to run, modify and
450 | propagate that work, subject to this License.  You are not responsible
451 | for enforcing compliance by third parties with this License.
452 | 
453 |   An "entity transaction" is a transaction transferring control of an
454 | organization, or substantially all assets of one, or subdividing an
455 | organization, or merging organizations.  If propagation of a covered
456 | work results from an entity transaction, each party to that
457 | transaction who receives a copy of the work also receives whatever
458 | licenses to the work the party's predecessor in interest had or could
459 | give under the previous paragraph, plus a right to possession of the
460 | Corresponding Source of the work from the predecessor in interest, if
461 | the predecessor has it or can get it with reasonable efforts.
462 | 
463 |   You may not impose any further restrictions on the exercise of the
464 | rights granted or affirmed under this License.  For example, you may
465 | not impose a license fee, royalty, or other charge for exercise of
466 | rights granted under this License, and you may not initiate litigation
467 | (including a cross-claim or counterclaim in a lawsuit) alleging that
468 | any patent claim is infringed by making, using, selling, offering for
469 | sale, or importing the Program or any portion of it.
470 | 
471 |   11. Patents.
472 | 
473 |   A "contributor" is a copyright holder who authorizes use under this
474 | License of the Program or a work on which the Program is based.  The
475 | work thus licensed is called the contributor's "contributor version".
476 | 
477 |   A contributor's "essential patent claims" are all patent claims
478 | owned or controlled by the contributor, whether already acquired or
479 | hereafter acquired, that would be infringed by some manner, permitted
480 | by this License, of making, using, or selling its contributor version,
481 | but do not include claims that would be infringed only as a
482 | consequence of further modification of the contributor version.  For
483 | purposes of this definition, "control" includes the right to grant
484 | patent sublicenses in a manner consistent with the requirements of
485 | this License.
486 | 
487 |   Each contributor grants you a non-exclusive, worldwide, royalty-free
488 | patent license under the contributor's essential patent claims, to
489 | make, use, sell, offer for sale, import and otherwise run, modify and
490 | propagate the contents of its contributor version.
491 | 
492 |   In the following three paragraphs, a "patent license" is any express
493 | agreement or commitment, however denominated, not to enforce a patent
494 | (such as an express permission to practice a patent or covenant not to
495 | sue for patent infringement).  To "grant" such a patent license to a
496 | party means to make such an agreement or commitment not to enforce a
497 | patent against the party.
498 | 
499 |   If you convey a covered work, knowingly relying on a patent license,
500 | and the Corresponding Source of the work is not available for anyone
501 | to copy, free of charge and under the terms of this License, through a
502 | publicly available network server or other readily accessible means,
503 | then you must either (1) cause the Corresponding Source to be so
504 | available, or (2) arrange to deprive yourself of the benefit of the
505 | patent license for this particular work, or (3) arrange, in a manner
506 | consistent with the requirements of this License, to extend the patent
507 | license to downstream recipients.  "Knowingly relying" means you have
508 | actual knowledge that, but for the patent license, your conveying the
509 | covered work in a country, or your recipient's use of the covered work
510 | in a country, would infringe one or more identifiable patents in that
511 | country that you have reason to believe are valid.
512 | 
513 |   If, pursuant to or in connection with a single transaction or
514 | arrangement, you convey, or propagate by procuring conveyance of, a
515 | covered work, and grant a patent license to some of the parties
516 | receiving the covered work authorizing them to use, propagate, modify
517 | or convey a specific copy of the covered work, then the patent license
518 | you grant is automatically extended to all recipients of the covered
519 | work and works based on it.
520 | 
521 |   A patent license is "discriminatory" if it does not include within
522 | the scope of its coverage, prohibits the exercise of, or is
523 | conditioned on the non-exercise of one or more of the rights that are
524 | specifically granted under this License.  You may not convey a covered
525 | work if you are a party to an arrangement with a third party that is
526 | in the business of distributing software, under which you make payment
527 | to the third party based on the extent of your activity of conveying
528 | the work, and under which the third party grants, to any of the
529 | parties who would receive the covered work from you, a discriminatory
530 | patent license (a) in connection with copies of the covered work
531 | conveyed by you (or copies made from those copies), or (b) primarily
532 | for and in connection with specific products or compilations that
533 | contain the covered work, unless you entered into that arrangement,
534 | or that patent license was granted, prior to 28 March 2007.
535 | 
536 |   Nothing in this License shall be construed as excluding or limiting
537 | any implied license or other defenses to infringement that may
538 | otherwise be available to you under applicable patent law.
539 | 
540 |   12. No Surrender of Others' Freedom.
541 | 
542 |   If conditions are imposed on you (whether by court order, agreement or
543 | otherwise) that contradict the conditions of this License, they do not
544 | excuse you from the conditions of this License.  If you cannot convey a
545 | covered work so as to satisfy simultaneously your obligations under this
546 | License and any other pertinent obligations, then as a consequence you may
547 | not convey it at all.  For example, if you agree to terms that obligate you
548 | to collect a royalty for further conveying from those to whom you convey
549 | the Program, the only way you could satisfy both those terms and this
550 | License would be to refrain entirely from conveying the Program.
551 | 
552 |   13. Use with the GNU Affero General Public License.
553 | 
554 |   Notwithstanding any other provision of this License, you have
555 | permission to link or combine any covered work with a work licensed
556 | under version 3 of the GNU Affero General Public License into a single
557 | combined work, and to convey the resulting work.  The terms of this
558 | License will continue to apply to the part which is the covered work,
559 | but the special requirements of the GNU Affero General Public License,
560 | section 13, concerning interaction through a network will apply to the
561 | combination as such.
562 | 
563 |   14. Revised Versions of this License.
564 | 
565 |   The Free Software Foundation may publish revised and/or new versions of
566 | the GNU General Public License from time to time.  Such new versions will
567 | be similar in spirit to the present version, but may differ in detail to
568 | address new problems or concerns.
569 | 
570 |   Each version is given a distinguishing version number.  If the
571 | Program specifies that a certain numbered version of the GNU General
572 | Public License "or any later version" applies to it, you have the
573 | option of following the terms and conditions either of that numbered
574 | version or of any later version published by the Free Software
575 | Foundation.  If the Program does not specify a version number of the
576 | GNU General Public License, you may choose any version ever published
577 | by the Free Software Foundation.
578 | 
579 |   If the Program specifies that a proxy can decide which future
580 | versions of the GNU General Public License can be used, that proxy's
581 | public statement of acceptance of a version permanently authorizes you
582 | to choose that version for the Program.
583 | 
584 |   Later license versions may give you additional or different
585 | permissions.  However, no additional obligations are imposed on any
586 | author or copyright holder as a result of your choosing to follow a
587 | later version.
588 | 
589 |   15. Disclaimer of Warranty.
590 | 
591 |   THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
592 | APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
596 | PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
597 | IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
599 | 
600 |   16. Limitation of Liability.
601 | 
602 |   IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
610 | SUCH DAMAGES.
611 | 
612 |   17. Interpretation of Sections 15 and 16.
613 | 
614 |   If the disclaimer of warranty and limitation of liability provided
615 | above cannot be given local legal effect according to their terms,
616 | reviewing courts shall apply local law that most closely approximates
617 | an absolute waiver of all civil liability in connection with the
618 | Program, unless a warranty or assumption of liability accompanies a
619 | copy of the Program in return for a fee.
620 | 
621 |                      END OF TERMS AND CONDITIONS
622 | 
623 |             How to Apply These Terms to Your New Programs
624 | 
625 |   If you develop a new program, and you want it to be of the greatest
626 | possible use to the public, the best way to achieve this is to make it
627 | free software which everyone can redistribute and change under these terms.
628 | 
629 |   To do so, attach the following notices to the program.  It is safest
630 | to attach them to the start of each source file to most effectively
631 | state the exclusion of warranty; and each file should have at least
632 | the "copyright" line and a pointer to where the full notice is found.
633 | 
634 |     <one line to give the program's name and a brief idea of what it does.>
635 |     Copyright (C) <year>  <name of author>
636 | 
637 |     This program is free software: you can redistribute it and/or modify
638 |     it under the terms of the GNU General Public License as published by
639 |     the Free Software Foundation, either version 3 of the License, or
640 |     (at your option) any later version.
641 | 
642 |     This program is distributed in the hope that it will be useful,
643 |     but WITHOUT ANY WARRANTY; without even the implied warranty of
644 |     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
645 |     GNU General Public License for more details.
646 | 
647 |     You should have received a copy of the GNU General Public License
648 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
649 | 
650 | Also add information on how to contact you by electronic and paper mail.
651 | 
652 |   If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 | 
655 |     <program>  Copyright (C) <year>  <name of author>
656 |     This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 |     This is free software, and you are welcome to redistribute it
658 |     under certain conditions; type `show c' for details.
659 | 
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License.  Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 | 
664 |   You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | <http://www.gnu.org/licenses/>.
668 | 
669 |   The GNU General Public License does not permit incorporating your program
670 | into proprietary programs.  If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library.  If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License.  But first, please read
674 | <http://www.gnu.org/philosophy/why-not-lgpl.html>.
675 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | all: article program
 2 | article: write-yourself-a-git.html
 3 | program: wyag libwyag.py
 4 | push: .last_push
 5 | 
 6 | .PHONY: all article clean program push test
 7 | 
 8 | write-yourself-a-git.html: write-yourself-a-git.org wyag libwyag.py
 9 | 	emacs --batch write-yourself-a-git.org \
10 |     --eval "(add-to-list 'load-path (expand-file-name \"./lib/htmlize\"))" \
11 |     --eval "(setq org-babel-inline-result-wrap \"%s\")" \
12 |     --eval "(setq org-confirm-babel-evaluate nil)" \
13 |     --eval "(setq python-indent-guess-indent-offset nil)" \
14 |     --eval "(setq org-export-with-broken-links t)" \
15 | 		--eval "(setq org-html-htmlize-output-type 'css)" \
16 |     -f org-html-export-to-html
17 | 
18 | write-yourself-a-git.pdf: write-yourself-a-git.org wyag libwyag.py
19 | 	emacs --batch write-yourself-a-git.org \
20 |     --eval "(add-to-list 'load-path (expand-file-name \"./lib/htmlize\"))" \
21 |     --eval "(setq org-babel-inline-result-wrap \"%s\")" \
22 |     --eval "(setq org-confirm-babel-evaluate nil)" \
23 |     --eval "(setq python-indent-guess-indent-offset nil)" \
24 |     --eval "(setq org-export-with-broken-links t)" \
25 |     -f org-latex-export-to-pdf
26 | 
27 | wyag libwyag.py: write-yourself-a-git.org
28 | 	emacs --batch write-yourself-a-git.org -f org-babel-tangle
29 | 
30 | wyag.zip: wyag libwyag.py LICENSE
31 | 	zip -r wyag.zip wyag libwyag.py LICENSE
32 | 
33 | clean:
34 | 	rm -f wyag libwyag.py write-yourself-a-git.html .last_push wyag.zip
35 | 
36 | test: wyag libwyag.py
37 | 	./wyag-tests.sh
38 | 
39 | .last_push: wyag.zip write-yourself-a-git.html
40 | 	 scp -r write-yourself-a-git.html k9.thb.lt\:/var/www/wyag.thb.lt/index.html; \
41 |    scp -r wyag.zip lib/org-html-themes/src k9.thb.lt:/var/www/wyag.thb.lt/; \
42 |    touch .last_push
43 | 


--------------------------------------------------------------------------------
/README.org:
--------------------------------------------------------------------------------
 1 | #+TITLE: Write yourself a Git!
 2 | 
 3 | Source repository for the [[https://wyag.thb.lt][Write yourself a Git]] article.
 4 | 
 5 | Wyag is a [[https://en.wikipedia.org/wiki/Literate_programming][literate program]] written in [[https://orgmode.org/][org-mode]], which means the same source document can be used to produce the HTML version of the article as published on [[https://wyag.thb.lt]] and the program itself.  You only need a reasonably recent Emacs and the =make= program, then:
 6 | 
 7 | #+begin_src shell
 8 |   $ git clone --recursive https://github.com/thblt/write-yourself-a-git
 9 |   $ cd write-yourself-a-git
10 |   $ make all
11 | #+end_src
12 | 


--------------------------------------------------------------------------------
/write-yourself-a-git.org:
--------------------------------------------------------------------------------
   1 | #+TITLE: Write yourself a Git!
   2 | #+AUTHOR: [[mailto:thibault@thb.lt][Thibault Polge]]
   3 | 
   4 | #
   5 | # This file is part of wyag <https://wyag.thb.lt>
   6 | #
   7 | # Copyright (c) 2018-2023 Thibault Polge <thibault@thb.lt>
   8 | # All rights reserved
   9 | #
  10 | # Wyag is free software: you can redistribute it and/or modify it
  11 | # under the terms of the GNU General Public License as published by
  12 | # the Free Software Foundation, either version 3 of the License, or
  13 | # (at your option) any later version.
  14 | #
  15 | # Wyag is distributed in the hope that it will be useful, but WITHOUT
  16 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
  17 | # or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
  18 | # License for more details.
  19 | #
  20 | # You should have received a copy of the GNU General Public License
  21 | # along with Wyag.  If not, see <http://www.gnu.org/licenses/>.
  22 | #
  23 | 
  24 | #+LANGUAGE: en
  25 | #+OPTIONS: ':t ^:nil
  26 | 
  27 | #+SETUPFILE: lib/org-html-themes/org/theme-readtheorg-local.setup
  28 | 
  29 | * Introduction
  30 | :PROPERTIES:
  31 | :CUSTOM_ID: intro
  32 | :END:
  33 | 
  34 | #+begin_note
  35 |   Recent changes (January 2025):
  36 | 
  37 |    - =OrderedDict= has been replaced by regular dicts.
  38 |    - Most string formatting has been replaced with f-strings.
  39 |    - Multiple bugs fixed in =tag_create=.
  40 | #+end_note
  41 | 
  42 | This article is an attempt at explaining the [[https://git-scm.com/][Git version control
  43 | system]] from the bottom up, that is, starting at the most fundamental
  44 | level and moving up from there.  This does not sound too easy, and has
  45 | been attempted multiple times with questionable success.  But there's
  46 | an easy way: all it takes to understand Git internals is to
  47 | reimplement Git from scratch.
  48 | 
  49 | No, don't run.
  50 | 
  51 | #+NAME: slocs
  52 | #+begin_src emacs-lisp :exports none
  53 |   ;; Compute line numbers.  We'll use that in a second.
  54 |   (shell-command-to-string
  55 |    "grep -v '^$' libwyag.py wyag | grep -v ' *#'  | wc -l")
  56 | #+end_src
  57 | 
  58 | It's not a joke, and it's really not complicated: if you read this
  59 | article top to bottom and write the code (or just [[./wyag.zip][download it]] as a ZIP
  60 | --- but you should write the code yourself, really), you'll end up
  61 | with a program, called =wyag=, that will implement all the fundamental
  62 | features of git: =init=, =add=, =rm=, =status=, =commit=, =log=... in
  63 | a way that is perfectly compatible with =git= itself --- compatible
  64 | enough that the commit finally adding the section on commits was
  65 | [[https://github.com/thblt/write-yourself-a-git/commit/ed26daffb400b2be5f30e044c3237d220226d867][created by wyag itself, not git]].  And all that in exactly call_slocs()
  66 | lines of very simple Python code.
  67 | 
  68 | But isn't Git too complex for that?  That Git is complex is, in my
  69 | opinion, a misconception.  Git is a large program, with a lot of
  70 | features, that's true.  But the core of that program is actually
  71 | extremely simple, and its apparent complexity stems first from the
  72 | fact it's often deeply counterintuitive (and [[https://byorgey.wordpress.com/2009/01/12/abstraction-intuition-and-the-monad-tutorial-fallacy/][Git is a burrito]] blog
  73 | posts probably don't help).  But maybe what makes Git the most
  74 | confusing is the extreme simplicity /and/ power of its core model.  The
  75 | combination of core simplicity and powerful applications often makes
  76 | things really hard to grasp, because of the mental jump required to
  77 | derive the variety of applications from the essential simplicity of
  78 | the fundamental abstraction (monads, anyone?)
  79 | 
  80 | Implementing Git will expose its fundamentals in all their naked
  81 | glory.
  82 | 
  83 | *What to expect?*  This article will implement and explain in great
  84 | detail (if something is not clear, please [[#feedback][report it]]!) a very
  85 | simplified version of Git's core commands.  I will keep the code simple
  86 | and to the point, so =wyag= won't come anywhere near the power of the
  87 | real git command-line --- but what's missing will be obvious, and
  88 | trivial to implement by anyone who wants to give it a try.  “Upgrading
  89 | wyag to a full-featured git library and CLI is left as an exercise to
  90 | the reader”, as they say.
  91 | 
  92 | More precisely, we'll implement:
  93 | 
  94 | #+begin_src emacs-lisp :exports results :results list
  95 |   (mapcar
  96 |    (lambda (cmd)
  97 |      (format "=%s= ([[#cmd-%s][wyag source]]) [[https://git-scm.com/docs/git-%s][git man page]]" cmd cmd cmd))
  98 |    (list
  99 |     "add"
 100 |     "cat-file"
 101 |     "check-ignore"
 102 |     "checkout"
 103 |     "commit"
 104 |     "hash-object"
 105 |     "init"
 106 |     "log"
 107 |     "ls-files"
 108 |     "ls-tree"
 109 |     "rev-parse"
 110 |     "rm"
 111 |     "show-ref"
 112 |     "status"
 113 |     "tag"))
 114 | #+end_src
 115 | 
 116 | You're not going to need to know much to follow this article: just
 117 | some basic Git (obviously), some basic Python, some basic shell.
 118 | 
 119 |  + First, I'm only going to assume some level of familiarity with the
 120 |    most basic *git commands* --- nothing like an expert level, but if
 121 |    you've never used =init=, =add=, =rm=, =commit= or =checkout=, you will be
 122 |    lost.
 123 |  + Language-wise, wyag will be implemented in *Python*.  Again, I won't
 124 |    use anything too fancy, and Python looks like pseudo-code anyways,
 125 |    so it will be easy to follow (ironically, the most complicated part
 126 |    will be the command-line arguments parsing logic, and you don't
 127 |    really need to understand that).  Yet, if you know programming but
 128 |    have never done any Python, I suggest you find a crash course
 129 |    somewhere on the internet just to get acquainted with the language.
 130 |  + =wyag= and =git= are terminal programs.  I assume you know your way
 131 |    inside a Unix terminal.  Again, you don't need to be a l77t h4x0r,
 132 |    but =cd=, =ls=, =rm=, =tree= and their friends should be in your toolbox.
 133 | 
 134 | #+BEGIN_warning
 135 |   *Note for Windows users*
 136 | 
 137 |   =wyag= should run on any Unix-like system with a Python interpreter,
 138 |   but I have absolutely no idea how it will behave on Windows.  The
 139 |   test suite absolutely requires a bash-compatible shell, which I
 140 |   assume the WSL can provide.  Also, if you are using WSL, make sure
 141 |   your =wyag= file uses Unix-style line endings ([[https://stackoverflow.com/questions/48692741/how-can-i-make-all-line-endings-eols-in-all-files-in-visual-studio-code-unix][See this
 142 |   StackOverflow solution if you use VS Code]]).  Feedback from Windows
 143 |   users would be appreciated!
 144 | #+END_warning
 145 | 
 146 | #+begin_note
 147 |  *Acknowledgments*
 148 | 
 149 |  This article benefited from significant contributions from multiple
 150 |  people, and I'm grateful to them all.  Special thanks to:
 151 | 
 152 |   - GitHub user [[https://github.com/tammoippen][tammoippen]], who first drafted the =tag_create=
 153 |     function I had simply… forgotten to write (that was [[https://github.com/thblt/write-yourself-a-git/issues/9][#9]]).
 154 |   - GitHub user [[https://github.com/hjlarry][hjlarry]] fixed multiple issues in [[https://github.com/thblt/write-yourself-a-git/pull/22][#22]].
 155 |   - GitHub user [[https://github.com/cutebbb][cutebbb]] implemented the first version of ls-files in
 156 |     [[https://github.com/thblt/write-yourself-a-git/pull/32/][#32]], and by doing so finally brought wyag to the wonders of the
 157 |     staging area!
 158 | #+end_note
 159 | 
 160 | * Getting started
 161 | :PROPERTIES:
 162 | :CUSTOM_ID: getting-started
 163 | :END:
 164 | 
 165 | You're going to need Python 3.10 or higher, along with your
 166 | favorite text editor.  We won't need third party packages or
 167 | virtualenvs, or anything besides a regular Python interpreter:
 168 | everything we need is in Python's standard library.
 169 | 
 170 | We'll split the code into two files:
 171 | 
 172 |  - An executable, called =wyag=;
 173 |  - A Python library, called =libwyag.py=;
 174 | 
 175 | Now, every software project starts with a boatload of boilerplate, so
 176 | let's get this over with.
 177 | 
 178 | We'll begin by creating the (very short) executable.  Create a new
 179 | file called =wyag= in your text editor, and copy the following few
 180 | lines:
 181 | 
 182 | #+BEGIN_SRC python :tangle wyag :tangle-mode (identity #o755)
 183 |   #!/usr/bin/env python3
 184 | 
 185 |   import libwyag
 186 |   libwyag.main()
 187 | #+END_SRC
 188 | 
 189 | Then make it executable:
 190 | 
 191 | #+BEGIN_EXAMPLE
 192 |   $ chmod +x wyag
 193 | #+END_EXAMPLE
 194 | 
 195 | You're done!
 196 | 
 197 | # This is a noweb template to include in all three source files.
 198 | #+NAME: file_header
 199 | #+BEGIN_SRC shell :exports none
 200 |    This file is part of wyag <https://wyag.thb.lt>
 201 |    Copyright (c) 2018-2023 Thibault Polge <thibault@thb.lt>
 202 |    All rights reserved
 203 | 
 204 |    Wyag is free software: you can redistribute it and/or modify it
 205 |    under the terms of the GNU General Public License as published by
 206 |    the Free Software Foundation, either version 3 of the License, or
 207 |    (at your option) any later version.
 208 | 
 209 |    Wyag is distributed in the hope that it will be useful, but WITHOUT
 210 |    ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 211 |    or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
 212 |    License for more details.
 213 | 
 214 |    You should have received a copy of the GNU General Public License
 215 |    along with Wyag.  If not, see <https://www.gnu.org/licenses/>.
 216 | 
 217 | #+END_SRC
 218 | 
 219 |    #+BEGIN_SRC python :tangle libwyag.py :exports none :noweb yes
 220 | # <<file_header>>
 221 | 
 222 |    #+END_SRC
 223 | 
 224 |    #+BEGIN_SRC python :tangle wyag :exports none :noweb yes
 225 | 
 226 | # <<file_header>>
 227 |    #+END_SRC
 228 | 
 229 | Now for the library.  It must be called =libwyag.py=, and be in the
 230 | same directory as the =wyag= executable.  Begin by opening the empty
 231 | =libwyag.py= in your text editor.
 232 | 
 233 | We're first going to need a bunch of imports (just copy each import,
 234 | or merge them all into a single line):
 235 | 
 236 |  - Git is a CLI application, so we'll need something to parse
 237 |    command-line arguments.  Python provides a cool module called
 238 |    [[https://docs.python.org/3/library/argparse.html][argparse]] that can do 99% of the job for us.
 239 | 
 240 |    #+BEGIN_SRC python :tangle libwyag.py
 241 |      import argparse
 242 |    #+END_SRC
 243 | 
 244 |  - Git uses a configuration file format that is basically Microsoft's
 245 |    INI format.  The [[https://docs.python.org/3/library/configparser.html][configparser]] module can read and write these
 246 |    files.
 247 | 
 248 |    #+BEGIN_SRC python :tangle libwyag.py
 249 |      import configparser
 250 |    #+END_SRC
 251 | 
 252 |  - We'll be doing some date/time manipulation:
 253 | 
 254 |    #+BEGIN_SRC python :tangle libwyag.py
 255 |      from datetime import datetime
 256 |    #+END_SRC
 257 | 
 258 |  - We'll need, just once, to read the users/group database on Unix
 259 |    (=grp= is for groups, =pwd= for users).  This is because git saves
 260 |    the numerical owner/group ID of files, and we'll want to display
 261 |    that nicely (as text):
 262 | 
 263 |    #+BEGIN_SRC python :tangle libwyag.py
 264 |      import grp, pwd
 265 |    #+END_SRC
 266 | 
 267 |  - To support =.gitignore=, we'll need to match filenames against
 268 |    patterns like =*.txt=.  Filename matching is in… =fnmatch=:
 269 | 
 270 |    #+BEGIN_SRC python :tangle libwyag.py
 271 |      from fnmatch import fnmatch
 272 |    #+END_SRC
 273 | 
 274 |  - Git uses the SHA-1 function quite extensively.  In Python, it's in [[https://docs.python.org/3/library/hashlib.html][hashlib]].
 275 | 
 276 |    #+BEGIN_SRC python :tangle libwyag.py
 277 |      import hashlib
 278 |    #+END_SRC
 279 | 
 280 |  - Just one function from [[https://docs.python.org/3/library/math.html][math]]:
 281 | 
 282 |    #+BEGIN_SRC python :tangle libwyag.py
 283 |      from math import ceil
 284 |    #+END_SRC
 285 | 
 286 |  - [[https://docs.python.org/3/library/os.html][os]] and [[https://docs.python.org/3/library/os.path.html][os.path]] provide some nice filesystem abstraction routines.
 287 | 
 288 |    #+BEGIN_SRC python :tangle libwyag.py
 289 |      import os
 290 |    #+END_SRC
 291 | 
 292 |  - We use /just a bit/ of regular expressions:
 293 | 
 294 |    #+BEGIN_SRC python :tangle libwyag.py
 295 |      import re
 296 |    #+END_SRC
 297 | 
 298 |  - We also need [[https://docs.python.org/3/library/sys.html][sys]] to access the actual command-line arguments (in =sys.argv=):
 299 | 
 300 |    #+BEGIN_SRC python :tangle libwyag.py
 301 |      import sys
 302 |    #+END_SRC
 303 | 
 304 |  - Git compresses everything using zlib.  Python [[https://docs.python.org/3/library/zlib.html][has that]], too:
 305 | 
 306 |    #+BEGIN_SRC python :tangle libwyag.py
 307 |      import zlib
 308 |    #+END_SRC
 309 | 
 310 | Imports are done.  We'll be working with command-line arguments a lot.
 311 | Python provides a simple yet reasonably powerful parsing library,
 312 | =argparse=.  It's a nice library, but its interface may not be the
 313 | most intuitive ever; if needed, refer to its [[https://docs.python.org/3/library/argparse.html][documentation]].
 314 | 
 315 | #+BEGIN_SRC python :tangle libwyag.py
 316 |   argparser = argparse.ArgumentParser(description="The stupidest content tracker")
 317 | #+END_SRC
 318 | 
 319 | We'll need to handle subcommands (as in git: =init=, =commit=, etc.)
 320 | In argparse slang, these are called "subparsers".  At this point we
 321 | only need to declare that our CLI will use some, and that every
 322 | invocation will actually /require/ one --- you don't just call =git=,
 323 | you call =git COMMAND=.
 324 | 
 325 | #+BEGIN_SRC python :tangle libwyag.py
 326 |   argsubparsers = argparser.add_subparsers(title="Commands", dest="command")
 327 |   argsubparsers.required = True
 328 | #+END_SRC
 329 | 
 330 | The ~dest="command"~ argument states that the name of the chosen
 331 | subparser will be returned as a string in a field called =command=.
 332 | So we just need to read this string and call the correct function
 333 | accordingly.  By convention, I'll call these functions "bridge
 334 | functions" and prefix their names with =cmd_=.  Bridge functions
 335 | take the parsed arguments as their only parameter, and are
 336 | responsible for processing and validating them before executing the
 337 | actual command.
 338 | 
 339 | #+BEGIN_SRC python :tangle libwyag.py
 340 |   def main(argv=sys.argv[1:]):
 341 |       args = argparser.parse_args(argv)
 342 |       match args.command:
 343 |           case "add"          : cmd_add(args)
 344 |           case "cat-file"     : cmd_cat_file(args)
 345 |           case "check-ignore" : cmd_check_ignore(args)
 346 |           case "checkout"     : cmd_checkout(args)
 347 |           case "commit"       : cmd_commit(args)
 348 |           case "hash-object"  : cmd_hash_object(args)
 349 |           case "init"         : cmd_init(args)
 350 |           case "log"          : cmd_log(args)
 351 |           case "ls-files"     : cmd_ls_files(args)
 352 |           case "ls-tree"      : cmd_ls_tree(args)
 353 |           case "rev-parse"    : cmd_rev_parse(args)
 354 |           case "rm"           : cmd_rm(args)
 355 |           case "show-ref"     : cmd_show_ref(args)
 356 |           case "status"       : cmd_status(args)
 357 |           case "tag"          : cmd_tag(args)
 358 |           case _              : print("Bad command.")
 359 | #+END_SRC
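
To see how =dest="command"= and the required-subcommand setting fit together, here is a small standalone sketch that is not part of wyag.  The =demo= parser, the =greet= subcommand, its =name= argument and the =dispatch= helper are all invented for illustration; the sketch also uses a dictionary lookup rather than =match=, only to stay short.

```python
import argparse

# Hypothetical parser for illustration only; none of these names exist in wyag.
demo = argparse.ArgumentParser(prog="demo")
subparsers = demo.add_subparsers(title="Commands", dest="command")
subparsers.required = True

# Each subcommand gets its own parser, with its own arguments.
greet = subparsers.add_parser("greet", help="Print a greeting.")
greet.add_argument("name")

def cmd_greet(args):
    """Bridge function: receives the parsed arguments, returns the result."""
    return f"Hello, {args.name}!"

# Map the string stored in args.command to the matching bridge function.
HANDLERS = {"greet": cmd_greet}

def dispatch(argv):
    args = demo.parse_args(argv)
    return HANDLERS[args.command](args)

print(dispatch(["greet", "world"]))
```

Calling =demo.parse_args([])= with no subcommand makes argparse print a usage message and exit, which is the behavior we want from a =git=-like CLI.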
 360 | 
 361 | * Creating repositories: init
 362 | :PROPERTIES:
 363 | :CUSTOM_ID: init
 364 | :END:
 365 | 
 366 | Obviously, the first Git command in chronological /and/ logical order is
 367 | =git init=, so we'll begin by creating =wyag init=.  To achieve this,
 368 | we're going to first need some very basic repository abstraction.
 369 | 
 370 | ** The Repository object
 371 | :PROPERTIES:
 372 | :CUSTOM_ID: GitRepository
 373 | :END:
 374 | 
 375 | We'll obviously need some abstraction for a repository: almost every
 376 | time we run a git command, we're trying to do something to a
 377 | repository, to create it, read from it or modify it.
 378 | 
 379 | A git repository is made of two things: a “work tree”, where the files
 380 | meant to be in version control live, and a “git directory”, where Git
 381 | stores its own data.  In most cases, the worktree is a regular
 382 | directory and the git directory is a child directory of the worktree,
 383 | called =.git=.
 384 | 
 385 | Git supports /many more/ cases (bare repo, separated gitdir, etc.) but
 386 | we won't need them: we'll stick to the basic approach of
 387 | =worktree/.git=.  Our repository object will then just hold two paths:
 388 | the worktree and the gitdir.
 389 | 
 390 | To create a new =GitRepository= object, we only need to make a few checks:
 391 | 
 392 |  - We must verify that the directory exists, and contains a
 393 |    subdirectory called =.git=.
 394 | 
 395 |  - We then read its configuration in =.git/config= (it's just an INI
 396 |    file) and check that =core.repositoryformatversion= is 0.  More on
 397 |    that field in a moment.
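
As a rough illustration of the second check, and separate from wyag's own code, the snippet below parses an INI string held in memory with =configparser=.  The contents of =SAMPLE= are an assumption about what such a file may look like, not a copy of a real =.git/config=.

```python
import configparser

# Made-up configuration text, shaped like a minimal repository config.
SAMPLE = """\
[core]
repositoryformatversion = 0
filemode = false
bare = false
"""

conf = configparser.ConfigParser()
conf.read_string(SAMPLE)

# configparser returns every value as a string, hence the explicit int().
vers = int(conf.get("core", "repositoryformatversion"))
if vers != 0:
    raise Exception(f"Unsupported repositoryformatversion: {vers}")

print(vers)
```

=ConfigParser= also offers typed accessors such as =getint()= and =getboolean()=, which can replace the manual conversion.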
 398 | 
 399 | Our constructor takes an optional =force= argument which disables all
 400 | checks.  That's because the =repo_create()= function which we'll
 401 | create later will use a =GitRepository= object to /create/ the repo.  So
 402 | we need a way to create such objects even from (still) invalid
 403 | filesystem locations.
 404 | 
 405 | #+BEGIN_SRC python :tangle libwyag.py
 406 |   class GitRepository (object):
 407 |       """A git repository"""
 408 | 
 409 |       worktree = None
 410 |       gitdir = None
 411 |       conf = None
 412 | 
 413 |       def __init__(self, path, force=False):
 414 |           self.worktree = path
 415 |           self.gitdir = os.path.join(path, ".git")
 416 | 
 417 |           if not (force or os.path.isdir(self.gitdir)):
 418 |               raise Exception(f"Not a Git repository {path}")
 419 | 
 420 |           # Read configuration file in .git/config
 421 |           self.conf = configparser.ConfigParser()
 422 |           cf = repo_file(self, "config")
 423 | 
 424 |           if cf and os.path.exists(cf):
 425 |               self.conf.read([cf])
 426 |           elif not force:
 427 |               raise Exception("Configuration file missing")
 428 | 
 429 |           if not force:
 430 |               vers = int(self.conf.get("core", "repositoryformatversion"))
 431 |               if vers != 0:
 432 |                   raise Exception(f"Unsupported repositoryformatversion: {vers}")
 433 | #+END_SRC
 434 | 
 435 | We're going to be manipulating *lots* of paths in repositories.  We may
 436 | as well create a few utility functions to compute those paths and
 437 | create missing directory structures if needed.  First, just a general
 438 | path building function:
 439 | 
 440 | #+BEGIN_SRC python :tangle libwyag.py
 441 | def repo_path(repo, *path):
 442 |     """Compute path under repo's gitdir."""
 443 |     return os.path.join(repo.gitdir, *path)
 444 | #+END_SRC
 445 | 
 446 | (A note on Python syntax: the star on the =*path= makes the function
 447 | variadic, so it can be called with multiple path components as
 448 | separate arguments.  For example, ~repo_path(repo, "objects", "df",
 449 | "4ec9fc2ad990cb9da906a95a6eda6627d7b7b0")~ is a valid call.  The
 450 | function receives =path= as a tuple.)
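
A quick way to convince yourself of how the star behaves is the sketch below.  The =describe= function and the =demo= directory name are hypothetical and exist only for this example: the extra positional arguments arrive packed in a tuple, and the same star unpacks a sequence again at the call site.

```python
import os

def describe(*path):
    """Hypothetical helper: report how the variadic arguments were received."""
    return type(path).__name__, len(path)

# Separate arguments are packed into a single tuple.
print(describe("objects", "df", "4ec9"))

# At a call site, the star does the opposite and unpacks a sequence.
components = ["objects", "df"]
joined = os.path.join("demo", ".git", *components)
print(joined)
```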
 451 | 
 452 | The next two functions, =repo_file()= and =repo_dir()=, return and
 453 | optionally create a path to a file or a directory, respectively.  The
 454 | difference between them is that the file version only creates
 455 | directories up to the last component.
 456 | 
 457 | #+BEGIN_SRC python :tangle libwyag.py
 458 |   def repo_file(repo, *path, mkdir=False):
 459 |       """Same as repo_path, but create dirname(*path) if absent.  For
 460 |   example, repo_file(r, \"refs\", \"remotes\", \"origin\", \"HEAD\") will create
 461 |   .git/refs/remotes/origin."""
 462 | 
 463 |       if repo_dir(repo, *path[:-1], mkdir=mkdir):
 464 |           return repo_path(repo, *path)
 465 | 
 466 |   def repo_dir(repo, *path, mkdir=False):
 467 |       """Same as repo_path, but mkdir *path if absent if mkdir."""
 468 | 
 469 |       path = repo_path(repo, *path)
 470 | 
 471 |       if os.path.exists(path):
 472 |           if (os.path.isdir(path)):
 473 |               return path
 474 |           else:
 475 |               raise Exception(f"Not a directory {path}")
 476 | 
 477 |       if mkdir:
 478 |           os.makedirs(path)
 479 |           return path
 480 |       else:
 481 |           return None
 482 | #+END_SRC
 483 | 
 484 | (Second and last note on syntax: because the star in =*path= makes the
 485 | functions variadic, the =mkdir= argument must be passed explicitly by
 486 | name.  For example, ~repo_file(repo, "objects", mkdir=True)~.)
 487 | 
 488 | To *create* a new repository, we start with a directory (which we
 489 | create if it doesn't already exist) and create the *git directory* inside
 490 | (which must not exist already, or be empty).  That directory is called
 491 | =.git= (the leading period makes it "hidden" on Unix systems), and contains:
 492 | 
 493 |  - =.git/objects/=, the object store, which we'll introduce [[#objects][in the next section]].
 494 |  - =.git/refs/=, the reference store, which we'll discuss [[#cmd-show-ref][a bit later]].
 495 |    It contains two subdirectories, =heads= and =tags=.
 496 |  - =.git/HEAD=, a reference to the current HEAD (more on that later!)
 497 |  - =.git/config=, the repository's configuration file.
 498 |  - =.git/description=, holds a free-form description of this
 499 |    repository's contents, for humans, and is rarely used.
 500 | 
 501 | #+BEGIN_SRC python :tangle libwyag.py
 502 |   def repo_create(path):
 503 |       """Create a new repository at path."""
 504 | 
 505 |       repo = GitRepository(path, True)
 506 | 
 507 |       # First, we make sure the path either doesn't exist or is an
 508 |       # empty dir.
 509 | 
 510 |       if os.path.exists(repo.worktree):
 511 |           if not os.path.isdir(repo.worktree):
 512 |               raise Exception (f"{path} is not a directory!")
 513 |           if os.path.exists(repo.gitdir) and os.listdir(repo.gitdir):
 514 |               raise Exception (f"{path} is not empty!")
 515 |       else:
 516 |           os.makedirs(repo.worktree)
 517 | 
 518 |       assert repo_dir(repo, "branches", mkdir=True)
 519 |       assert repo_dir(repo, "objects", mkdir=True)
 520 |       assert repo_dir(repo, "refs", "tags", mkdir=True)
 521 |       assert repo_dir(repo, "refs", "heads", mkdir=True)
 522 | 
 523 |       # .git/description
 524 |       with open(repo_file(repo, "description"), "w") as f:
 525 |           f.write("Unnamed repository; edit this file 'description' to name the repository.\n")
 526 | 
 527 |       # .git/HEAD
 528 |       with open(repo_file(repo, "HEAD"), "w") as f:
 529 |           f.write("ref: refs/heads/master\n")
 530 | 
 531 |       with open(repo_file(repo, "config"), "w") as f:
 532 |           config = repo_default_config()
 533 |           config.write(f)
 534 | 
 535 |       return repo
 536 | #+END_SRC
 537 | 
 538 | The configuration file is very simple: it's an [[https://en.wikipedia.org/wiki/INI_file][INI]]-like file with a
 539 | single section (=[core]=) and three fields:
 540 | 
 541 |  - =repositoryformatversion = 0=: the version of
 542 |    the gitdir format.  0 means the initial format, 1 the same with
 543 |    extensions.  If > 1, git will panic; wyag will only accept 0.
 544 |  - =filemode = false=: disable tracking of file mode (permission)
 545 |    changes in the work tree.
 546 |  - =bare = false=: indicates that this repository has a worktree.  Git
 547 |    supports an optional =worktree= key which indicates the location of
 548 |    the worktree, if not =..=; wyag doesn't.
 549 | 
 550 | We create this file using Python's =configparser= lib:
 551 | 
 552 | #+BEGIN_SRC python :tangle libwyag.py
 553 | def repo_default_config():
 554 |     ret = configparser.ConfigParser()
 555 | 
 556 |     ret.add_section("core")
 557 |     ret.set("core", "repositoryformatversion", "0")
 558 |     ret.set("core", "filemode", "false")
 559 |     ret.set("core", "bare", "false")
 560 | 
 561 |     return ret
 562 | #+END_SRC
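
As a rough illustration of what ends up in =.git/config=, the sketch
below builds the same three fields and writes them to an in-memory
buffer instead of a file.  It only relies on the standard
=configparser= and =io= modules; the variable names are ours.

#+BEGIN_SRC python
import configparser
import io

config = configparser.ConfigParser()
config.add_section("core")
config.set("core", "repositoryformatversion", "0")
config.set("core", "filemode", "false")
config.set("core", "bare", "false")

# Write to an in-memory buffer instead of .git/config.
buf = io.StringIO()
config.write(buf)
text = buf.getvalue()
print(text)

# Reading the text back yields the same values.  They are stored as
# strings; getint() and getboolean() convert them on demand.
parsed = configparser.ConfigParser()
parsed.read_string(text)
print(parsed.getint("core", "repositoryformatversion"))
print(parsed.getboolean("core", "bare"))
#+END_SRC

Note that =configparser= only stores strings, which is why the
repository constructor wraps the version in =int()= before comparing it.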
 563 | 
 564 | ** The init command
 565 | :PROPERTIES:
 566 | :CUSTOM_ID: cmd-init
 567 | :END:
 568 | 
 569 | Now that we have code to read and create repositories, let's make this
 570 | code usable from the command line by creating the =wyag init= command.
 571 | =wyag init= behaves just like =git init= --- with much less
 572 | customizability, of course.  The syntax of =wyag init= is going to be:
 573 | 
 574 | #+BEGIN_EXAMPLE
 575 |   wyag init [path]
 576 | #+END_EXAMPLE
 577 | 
 578 | We already have the complete repository creation logic.  To create the
 579 | command, we're only going to need two more things:
 580 | 
 581 | 1. We need to create an argparse subparser to handle our command's
 582 |    argument.
 583 | 
 584 |    #+BEGIN_SRC python :tangle libwyag.py
 585 |      argsp = argsubparsers.add_parser("init", help="Initialize a new, empty repository.")
 586 |    #+END_SRC
 587 | 
 588 |    In the case of =init=, there's a single, optional,
 589 |    positional argument: the path where to init the repo.  It defaults
 590 |    to =.=, the current directory:
 591 | 
 592 |    #+BEGIN_SRC python :tangle libwyag.py
 593 |      argsp.add_argument("path",
 594 |                         metavar="directory",
 595 |                         nargs="?",
 596 |                         default=".",
 597 |                         help="Where to create the repository.")
 598 | 
 599 |    #+END_SRC
 600 | 
 601 | 2. We also need a "bridge" function that will read argument values
 602 |    from the object returned by argparse and call the actual
 603 |    function with correct values.
 604 | 
 605 |    #+BEGIN_SRC python :tangle libwyag.py
 606 |    def cmd_init(args):
 607 |        repo_create(args.path)
 608 |    #+END_SRC
 609 | 
 610 | And we're done!  If you've followed these steps, you should now be
 611 | able to =wyag init= a git repository anywhere:
 612 | 
 613 | #+begin_example
 614 |   $ wyag init test
 615 | #+end_example
 616 | 
 617 | (The =wyag= executable won't usually be in your =$PATH=: you'll want to call it by its
 618 | full name, e.g. =~/projects/wyag/wyag init .=)
 619 | 
 620 | ** The repo_find() function
 621 | :PROPERTIES:
 622 | :CUSTOM_ID: repo_find
 623 | :END:
 624 | 
 625 | While we're implementing repositories, we're going to need a function
 626 | to find the root of the current repository.  We'll use it a lot, since
 627 | almost all Git functions work on an existing repository (except
 628 | =init=, of course!).  Sometimes that root is the current directory,
 629 | but it may also be a parent: your repository's root may be in
 630 | =~/Documents/MyProject=, but you may currently be working in
 631 | =~/Documents/MyProject/src/tui/frames/mainview/=.  The =repo_find()=
 632 | function we'll now create will look for that root, starting at the
 633 | current directory and recursing back to =/=.  To identify a path as a
 634 | repository, it will check for the presence of a =.git= directory.
 635 | 
 636 | #+BEGIN_SRC python :tangle libwyag.py
 637 |   def repo_find(path=".", required=True):
 638 |       path = os.path.realpath(path)
 639 | 
 640 |       if os.path.isdir(os.path.join(path, ".git")):
 641 |           return GitRepository(path)
 642 | 
 643 |       # If we haven't returned, recurse in parent.
 644 |       parent = os.path.realpath(os.path.join(path, ".."))
 645 | 
 646 |       if parent == path:
 647 |           # Base case
 648 |           # The parent of "/" resolves to "/" itself, so
 649 |           # if parent == path, then path is the root.
 650 |           if required:
 651 |               raise Exception("No git directory.")
 652 |           else:
 653 |               return None
 654 | 
 655 |       # Recursive case
 656 |       return repo_find(parent, required)
 657 | #+END_SRC
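
The same upward search can also be written as a loop.  This is only a
sketch under simplifying assumptions: =find_git_root()= is a
hypothetical name, it returns a plain path (or =None=) rather than a
=GitRepository=, and the demo runs against a throwaway temporary
directory.

#+BEGIN_SRC python
import os
import tempfile

def find_git_root(path="."):
    """Return the closest ancestor of path containing .git, or None."""
    path = os.path.realpath(path)
    while True:
        if os.path.isdir(os.path.join(path, ".git")):
            return path
        parent = os.path.dirname(path)
        if parent == path:
            # The root is its own parent: nothing left to search.
            return None
        path = parent

# Build a throwaway tree: <tmp>/project/.git and <tmp>/project/src/tui
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(os.path.realpath(tmp), "project")
    deep = os.path.join(root, "src", "tui")
    os.makedirs(os.path.join(root, ".git"))
    os.makedirs(deep)
    found = find_git_root(deep)
    print(found == root)
#+END_SRC

Whether you prefer the loop or the recursion is mostly a matter of
taste: directory trees are rarely deep enough for recursion depth to
matter.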
 658 | 
 659 | And we're done with repositories!
 660 | 
 661 | * Reading and writing objects: hash-object and cat-file
 662 | :PROPERTIES:
 663 | :CUSTOM_ID: objects
 664 | :END:
 665 | 
 666 | ** What are objects?
 667 | :PROPERTIES:
 668 | :CUSTOM_ID: objects-intro
 669 | :END:
 670 | 
 671 | Now that we have repositories, putting things inside them is in order.
 672 | Also, repositories are boring, and writing a Git implementation
 673 | shouldn't be just a matter of writing a bunch of =mkdir=.  Let's talk
 674 | about *objects*, and let's implement =git hash-object= and =git cat-file=.
 675 | 
 676 | Maybe you don't know these two commands --- they're not exactly part
 677 | of an everyday git toolbox, and they're actually quite low-level
 678 | ("plumbing", in git parlance).  What they do is actually very simple:
 679 | =hash-object= converts an existing file into a git object, and =cat-file=
 680 | prints an existing git object to the standard output.
 681 | 
 682 | Now, *what actually is a Git object?* At its core, Git is a
 683 | "content-addressed filesystem".  That means that unlike regular
 684 | filesystems, where the name of a file is arbitrary and unrelated to
 685 | that file's contents, the names of files as stored by Git are
 686 | mathematically derived from their contents.  This has a very important
 687 | implication: if a single byte of, say, a text file, changes, its
 688 | internal name will change, too.  To put it simply: you don't /modify/
 689 | a file in git, you create a new file in a different location.  Objects
 690 | are just that: *files in the git repository, whose paths are
 691 | determined by their contents*.
 692 | 
 693 | #+begin_warning
 694 | *Git is not (really) a key-value store*
 695 | 
 696 | Some documentation, including the excellent [[https://git-scm.com/book/id/v2/Git-Internals-Git-Objects][Pro Git]], calls Git a
 697 | "key-value store".  This is not incorrect, but may be misleading.
 698 | Regular filesystems are actually closer to a key-value store than Git
 699 | is.  Because it computes keys from data, Git could rather be called a
 700 | /value-value store/.
 701 | #+end_warning
 702 | 
 703 | Git uses objects to store quite a lot of things: first and foremost,
 704 | the actual files it keeps in version control --- source code, for
 705 | example.  Commits are objects, too, as well as tags.  With a few
 706 | notable exceptions (which we'll see later!), almost everything, in
 707 | Git, is stored as an object.
 708 | 
 709 | The path where git stores a given object is computed by calculating
 710 | the [[https://en.wikipedia.org/wiki/SHA-1][SHA-1]] [[https://en.wikipedia.org/wiki/Cryptographic_hash_function][hash]] of its contents.  More precisely, Git renders the hash
 711 | as a lowercase hexadecimal string, and splits it in two parts: the
 712 | first two characters, and the rest.  It uses the first part as a
 713 | directory name, the rest as the file name (this is because most
 714 | filesystems hate having too many files in a single directory and would
 715 | slow down to a crawl.  Git's method creates 256 possible intermediate
 716 | directories, hence dividing the average number of files per directory
 717 | by 256).
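
As a minimal sketch of that splitting rule: =loose_object_path()=
below is a hypothetical helper, not part of wyag, and the bytes we
hash are arbitrary data rather than a real git object.

#+BEGIN_SRC python
import hashlib
import os

def loose_object_path(sha):
    """Return the path, relative to the gitdir, of the object named sha."""
    # First two hex characters: directory.  The other 38: file name.
    return os.path.join("objects", sha[:2], sha[2:])

# Any bytes will do for a demonstration; these are NOT a real git object.
digest = hashlib.sha1(b"some bytes").hexdigest()
print(digest, "->", loose_object_path(digest))

# One changed byte yields an unrelated name, hence an unrelated path.
other = hashlib.sha1(b"some bytez").hexdigest()
print(other, "->", loose_object_path(other))
#+END_SRC

Two hexadecimal characters can take 16 × 16 = 256 values, which is
where the 256 intermediate directories come from.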
 718 | 
 719 | #+BEGIN_note
 720 | *What is a hash function?*
 721 | 
 722 | SHA-1 is what we call a “hash function”. Simply put, a hash function
 723 | is a kind of unidirectional mathematical function: it is easy to
 724 | compute the hash of a value, but there's no way to compute back which
 725 | value produced a hash.
 726 | 
 727 | A very simple example of a hash function is the classical =len= (or
 728 | =strlen=) function, which returns the length of a string.  It's really
 729 | easy to compute the length of a string, and the length of a given
 730 | string will never change (unless the string itself changes, of
 731 | course!) but it's impossible to retrieve the original string, given
 732 | only its length.  /Cryptographic/ hash functions are a much more
 733 | complex version of the same, with the added property that computing an
 734 | input meant to produce a given hash is hard enough to be practically
 735 | impossible.  (To produce an input =i= with ~strlen(i) == 12~, you just
 736 | type twelve random characters.  With algorithms such as SHA-1, it
 737 | would take much, much longer --- long enough to be practically
 738 | impossible[fn:1]).
 739 | #+END_note
 740 | 
 741 | Before we start implementing the object storage system, we must
 742 | understand the exact storage format of objects.  An object starts with a
 743 | header that specifies its type: =blob=, =commit=, =tag= or =tree= (more on
 744 | that in a second).  This header is followed by an ASCII space (0x20),
 745 | then the size of the object's contents in bytes as an ASCII number, then a
 746 | null byte (0x00), then the contents of the object.  The first 48
 747 | bytes of a commit object in Wyag's repo look like this:
 748 | 
 749 | #+BEGIN_EXAMPLE
 750 | 00000000  63 6f 6d 6d 69 74 20 31  30 38 36 00 74 72 65 65  |commit 1086.tree|
 751 | 00000010  20 32 39 66 66 31 36 63  39 63 31 34 65 32 36 35  | 29ff16c9c14e265|
 752 | 00000020  32 62 32 32 66 38 62 37  38 62 62 30 38 61 35 61  |2b22f8b78bb08a5a|
 753 | #+END_EXAMPLE
 754 | 
 755 | In the first line, we see the type header, a space (=0x20=), the size in
 756 | ASCII (1086) and the null separator =0x00=.  The last four bytes on the
 757 | first line are the beginning of that object's contents, the word
 758 | "tree" --- we'll discuss that further when we talk about commits.
 759 | 
 760 | The objects (headers and contents) are stored compressed with =zlib=.
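
To make the format concrete, here's a small sketch that builds a blob
by hand, names it, and reads it back.  =encode_object()= and
=decode_object()= are illustrative names of ours, not wyag functions,
and nothing is written to disk.

#+BEGIN_SRC python
import hashlib
import zlib

def encode_object(fmt, data):
    """Prepend the header: type, space, size in ASCII, null byte."""
    return fmt + b" " + str(len(data)).encode("ascii") + b"\x00" + data

def decode_object(raw):
    """Split an uncompressed object into (type, contents)."""
    space = raw.find(b" ")
    null = raw.find(b"\x00", space)
    size = int(raw[space + 1:null].decode("ascii"))
    contents = raw[null + 1:]
    if size != len(contents):
        raise ValueError("declared size does not match contents")
    return raw[:space], contents

raw = encode_object(b"blob", b"hello world\n")
stored = zlib.compress(raw)              # what would be written to disk
print(raw)
print(hashlib.sha1(raw).hexdigest())     # the object's name
print(decode_object(zlib.decompress(stored)))
#+END_SRC

If you save the same twelve bytes to a file, =git hash-object= on that
file should print the same name, which is a handy way to check the
header logic.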
 761 | 
 762 | ** A generic object object
 763 | 
 764 | Objects can be of multiple types, but they all share the same
 765 | storage/retrieval mechanism and the same general header format.
 766 | Before we dive into the details of various types of objects, we need
 767 | to abstract over these common features.  The easiest way is to create
 768 | a generic =GitObject= with two unimplemented methods: =serialize()=
 769 | and =deserialize()=, and a default =init()= to create a new, empty
 770 | object if needed (sorry pythonistas, this isn't very nice design but
 771 | it's probably easier to read than superconstructors).  Our =__init__=
 772 | either loads the object from the provided data, or calls the
 773 | subclass-provided =init()= to create a new, empty object.
 774 | 
 775 | Later, we'll subclass this generic class, actually implementing these
 776 | functions for each object format.
 777 | 
 778 | #+BEGIN_SRC python :tangle libwyag.py
 779 |   class GitObject (object):
 780 | 
 781 |       def __init__(self, data=None):
 782 |           if data is not None:
 783 |               self.deserialize(data)
 784 |           else:
 785 |               self.init()
 786 | 
 787 |       def serialize(self):
 788 |           """This function MUST be implemented by subclasses.
 789 | 
 790 |   It must return the object's contents as a byte string, converted
 791 |   from whatever meaningful representation the subclass keeps in memory.
 792 |   What exactly that means depends on each subclass.
 793 | 
 794 |           """
 795 |           raise Exception("Unimplemented!")
 796 | 
 797 |       def deserialize(self, data):
 798 |           raise Exception("Unimplemented!")
 799 | 
 800 |       def init(self):
 801 |           pass # Just do nothing. This is a reasonable default!
 802 | #+END_SRC
 803 | 
 804 | ** Reading objects
 805 | :PROPERTIES:
 806 | :CUSTOM_ID: object_read
 807 | :END:
 808 | 
 809 | To read an object, we need to know its SHA-1 hash.  We then compute
 810 | its path from this hash (with the formula explained above: first two
 811 | characters, then a directory delimiter =/=, then the remaining part)
 812 | and look it up inside of the "objects" directory in the gitdir.  That
 813 | is, the path to =e673d1b7eaa0aa01b5bc2442d570a765bdaae751= is
 814 | =.git/objects/e6/73d1b7eaa0aa01b5bc2442d570a765bdaae751=.
 815 | 
 816 | We then read that file as a binary file, and decompress it using
 817 | =zlib=.
 818 | 
 819 | From the decompressed data, we extract the two header components: the
 820 | object type and its size.  From the type, we determine the actual
 821 | class to use.  We convert the size to a Python integer, and check that
 822 | it matches the length of the contents.
 823 | 
 824 | When all is done, we just call the correct constructor for that
 825 | object's format.
 826 | 
 827 | #+BEGIN_SRC python :tangle libwyag.py
 828 |   def object_read(repo, sha):
 829 |       """Read object sha from Git repository repo.  Return a
 830 |       GitObject whose exact type depends on the object."""
 831 | 
 832 |       path = repo_file(repo, "objects", sha[0:2], sha[2:])
 833 | 
 834 |       if not os.path.isfile(path):
 835 |           return None
 836 | 
 837 |       with open (path, "rb") as f:
 838 |           raw = zlib.decompress(f.read())
 839 | 
 840 |           # Read object type
 841 |           x = raw.find(b' ')
 842 |           fmt = raw[0:x]
 843 | 
 844 |           # Read and validate object size
 845 |           y = raw.find(b'\x00', x)
 846 |           size = int(raw[x:y].decode("ascii"))
 847 |           if size != len(raw)-y-1:
 848 |               raise Exception(f"Malformed object {sha}: bad length")
 849 | 
 850 |           # Pick constructor
 851 |           match fmt:
 852 |               case b'commit' : c=GitCommit
 853 |               case b'tree'   : c=GitTree
 854 |               case b'tag'    : c=GitTag
 855 |               case b'blob'   : c=GitBlob
 856 |               case _:
 857 |                   raise Exception(f"Unknown type {fmt.decode('ascii')} for object {sha}")
 858 | 
 859 |           # Call constructor and return object
 860 |           return c(raw[y+1:])
 861 | #+END_SRC
 862 | 
 863 | ** Writing objects
 864 | :PROPERTIES:
 865 | :CUSTOM_ID: object_write
 866 | :END:
 867 | 
 868 | Writing an object is reading it in reverse: we insert the header,
 869 | compute the hash, zlib-compress everything and write the result in
 870 | the correct location.  This really shouldn't require much explanation;
 871 | just notice that the hash is computed *after* the header is added (so
 872 | it's the hash of the object itself, uncompressed, not just its contents).
 873 | 
 874 | #+BEGIN_SRC python :tangle libwyag.py
 875 |   def object_write(obj, repo=None):
 876 |       # Serialize object data
 877 |       data = obj.serialize()
 878 |       # Add header
 879 |       result = obj.fmt + b' ' + str(len(data)).encode() + b'\x00' + data
 880 |       # Compute hash
 881 |       sha = hashlib.sha1(result).hexdigest()
 882 | 
 883 |       if repo:
 884 |           # Compute path
 885 |           path=repo_file(repo, "objects", sha[0:2], sha[2:], mkdir=True)
 886 | 
 887 |           if not os.path.exists(path):
 888 |               with open(path, 'wb') as f:
 889 |                   # Compress and write
 890 |                   f.write(zlib.compress(result))
 891 |       return sha
 892 | #+END_SRC
 893 | 
 894 | ** Working with blobs
 895 | 
 896 | We said earlier that the type header could be one of four: =blob=,
 897 | =commit=, =tag= and =tree= --- so git has four object types.
 898 | 
 899 | Blobs are the simplest of those four types, because they have no
 900 | actual format.  Blobs are user data: the content of every file you put
 901 | in git (=main.c=, =logo.png=, =README.md=) is stored as a blob.  That
 902 | makes them easy to manipulate, because they have no actual syntax or
 903 | constraints beyond the basic object storage mechanism: they're just
 904 | unspecified data.  Creating a =GitBlob= class is thus trivial, the
 905 | =serialize= and =deserialize= functions just have to store and return
 906 | their input unmodified.
 907 | 
 908 | #+BEGIN_SRC python :tangle libwyag.py
 909 |   class GitBlob(GitObject):
 910 |       fmt=b'blob'
 911 | 
 912 |       def serialize(self):
 913 |           return self.blobdata
 914 | 
 915 |       def deserialize(self, data):
 916 |           self.blobdata = data
 917 | #+END_SRC
 918 | 
 919 | ** The cat-file command
 920 | :PROPERTIES:
 921 | :CUSTOM_ID: cmd-cat-file
 922 | :END:
 923 | 
 924 | We can now create =wyag cat-file=.  =git cat-file= simply prints the
 925 | raw contents of an object to stdout, uncompressed and without the git
 926 | header.  In a clone of [[https://github.com/thblt/write-yourself-a-git][wyag's source repository]], =git cat-file blob
 927 | e0695f14a412c29e252c998c81de1dde59658e4a= will show a version of the
 928 | README.
 929 | 
 930 | Our simplified version will just take two positional arguments,
 931 | a type and an object identifier:
 932 | 
 933 | #+BEGIN_EXAMPLE
 934 |   wyag cat-file TYPE OBJECT
 935 | #+END_EXAMPLE
 936 | 
 937 | The subparser is very simple:
 938 | 
 939 | #+BEGIN_SRC python :tangle libwyag.py
 940 |   argsp = argsubparsers.add_parser("cat-file",
 941 |                                    help="Provide content of repository objects")
 942 | 
 943 |   argsp.add_argument("type",
 944 |                      metavar="type",
 945 |                      choices=["blob", "commit", "tag", "tree"],
 946 |                      help="Specify the type")
 947 | 
 948 |   argsp.add_argument("object",
 949 |                      metavar="object",
 950 |                      help="The object to display")
 951 | #+END_SRC
 952 | 
 953 | And we can implement the functions, which just call into existing code we wrote earlier:
 954 | 
 955 | #+BEGIN_SRC python :tangle libwyag.py
 956 |   def cmd_cat_file(args):
 957 |       repo = repo_find()
 958 |       cat_file(repo, args.object, fmt=args.type.encode())
 959 | 
 960 |   def cat_file(repo, obj, fmt=None):
 961 |       obj = object_read(repo, object_find(repo, obj, fmt=fmt))
 962 |       sys.stdout.buffer.write(obj.serialize())
 963 | #+END_SRC
 964 | 
 965 | <<placeholder-object_find>> This function calls an =object_find=
 966 | function we haven't introduced yet.  For now, it's just going to
 967 | return one of its arguments unmodified, like this:
 968 | 
 969 | #+BEGIN_SRC python
 970 | def object_find(repo, name, fmt=None, follow=True):
 971 |     return name
 972 | #+END_SRC
 973 | 
 974 | The reason for this strange small function is that Git has a /lot/ of
 975 | ways to refer to objects: full hash, short hash, tags...
 976 | =object_find()= will be our name resolution function.  We'll only
 977 | implement it [[#object_find][later]], so this is just a temporary placeholder.  This
 978 | means that until we implement the real thing, the only way we can
 979 | refer to an object will be by its full hash.
 980 | 
 981 | ** The hash-object command
 982 | :PROPERTIES:
 983 | :CUSTOM_ID: cmd-hash-object
 984 | :END:
 985 | 
 986 | We will want to put our /own/ data in our repositories,
 987 | though. =hash-object= is basically the opposite of =cat-file=: it
 988 | reads a file, computes its hash as an object, either storing it in the
 989 | repository (if the -w flag is passed) or just printing its hash.
 990 | 
 991 | The syntax of =wyag hash-object= is a simplification of =git
 992 | hash-object=:
 993 | 
 994 | #+BEGIN_EXAMPLE
 995 |   wyag hash-object [-w] [-t TYPE] FILE
 996 | #+END_EXAMPLE
 997 | 
 998 | Which converts to:
 999 | 
1000 | #+BEGIN_SRC python :tangle libwyag.py
1001 |   argsp = argsubparsers.add_parser(
1002 |       "hash-object",
1003 |       help="Compute object ID and optionally creates a blob from a file")
1004 | 
1005 |   argsp.add_argument("-t",
1006 |                      metavar="type",
1007 |                      dest="type",
1008 |                      choices=["blob", "commit", "tag", "tree"],
1009 |                      default="blob",
1010 |                      help="Specify the type")
1011 | 
1012 |   argsp.add_argument("-w",
1013 |                      dest="write",
1014 |                      action="store_true",
1015 |                      help="Actually write the object into the database")
1016 | 
1017 |   argsp.add_argument("path",
1018 |                      help="Read object from <file>")
1019 | #+END_SRC
1020 | 
1021 | The actual implementation is very simple.  As usual, we create a small
1022 | bridge function:
1023 | 
1024 | #+BEGIN_SRC python :tangle libwyag.py
1025 |   def cmd_hash_object(args):
1026 |       if args.write:
1027 |           repo = repo_find()
1028 |       else:
1029 |           repo = None
1030 | 
1031 |       with open(args.path, "rb") as fd:
1032 |           sha = object_hash(fd, args.type.encode(), repo)
1033 |           print(sha)
1034 | #+END_SRC
1035 | 
1036 | The =object_hash()= function it calls is also trivial.  The =repo= argument is
1037 | optional, and the object isn't written if it is =None= (this is
1038 | handled in =object_write()=, above):
1039 | 
1040 | #+BEGIN_SRC python :tangle libwyag.py
1041 |     def object_hash(fd, fmt, repo=None):
1042 |         """ Hash object, writing it to repo if provided."""
1043 |         data = fd.read()
1044 | 
1045 |         # Choose constructor according to fmt argument
1046 |         match fmt:
1047 |             case b'commit' : obj=GitCommit(data)
1048 |             case b'tree'   : obj=GitTree(data)
1049 |             case b'tag'    : obj=GitTag(data)
1050 |             case b'blob'   : obj=GitBlob(data)
1051 |             case _: raise Exception(f"Unknown type {fmt}!")
1052 | 
1053 |         return object_write(obj, repo)
1054 | #+END_SRC
1055 | 
1056 | ** Aside: what about packfiles?
1057 | :PROPERTIES:
1058 | :CUSTOM_ID: packfiles
1059 | :END:
1060 | 
1061 | What we've just implemented is called "loose objects".  Git has a
1062 | second object storage mechanism called packfiles.  Packfiles are much
1063 | more efficient, but also much more complex, than loose objects.  Simply
1064 | put, a packfile is a compilation of loose objects (like a =tar=) but
1065 | some are stored as deltas (as a transformation of another object).
1066 | Packfiles are way too complex to be supported by wyag.
1067 | 
1068 | The packfile is stored in =.git/objects/pack/=.  It has a =.pack=
1069 | extension, and is accompanied by an index file of the same name with
1070 | the =.idx= extension.  Should you want to convert a packfile to loose
1071 | objects format (to play with =wyag= on an existing repo, for example),
1072 | here's the solution.
1073 | 
1074 | First, /move/ the packfile outside the gitdir (just copying it won't work).
1075 | 
1076 | #+BEGIN_SRC shell
1077 |   mv .git/objects/pack/pack-d9ef004d4ca729287f12aaaacf36fee39baa7c9d.pack .
1078 | #+END_SRC
1079 | 
1080 | You can ignore the =.idx=.  Then, from the worktree, just =cat= it and pipe the result to =git
1081 | unpack-objects=:
1082 | 
1083 | #+BEGIN_SRC shell
1084 |   cat pack-d9ef004d4ca729287f12aaaacf36fee39baa7c9d.pack | git unpack-objects
1085 | #+END_SRC
1086 | 
1087 | * Reading commit history: log
1088 | 
1089 | ** Parsing commits
1090 | 
1091 | Now that we can read and write objects, we should consider commits.
1092 | A commit object (uncompressed, without headers) looks like this:
1093 | 
1094 | #+BEGIN_EXAMPLE
1095 | tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147
1096 | parent 206941306e8a8af65b66eaaaea388a7ae24d49a0
1097 | author Thibault Polge <thibault@thb.lt> 1527025023 +0200
1098 | committer Thibault Polge <thibault@thb.lt> 1527025044 +0200
1099 | gpgsig -----BEGIN PGP SIGNATURE-----
1100 | 
1101 |  iQIzBAABCAAdFiEExwXquOM8bWb4Q2zVGxM2FxoLkGQFAlsEjZQACgkQGxM2FxoL
1102 |  kGQdcBAAqPP+ln4nGDd2gETXjvOpOxLzIMEw4A9gU6CzWzm+oB8mEIKyaH0UFIPh
1103 |  rNUZ1j7/ZGFNeBDtT55LPdPIQw4KKlcf6kC8MPWP3qSu3xHqx12C5zyai2duFZUU
1104 |  wqOt9iCFCscFQYqKs3xsHI+ncQb+PGjVZA8+jPw7nrPIkeSXQV2aZb1E68wa2YIL
1105 |  3eYgTUKz34cB6tAq9YwHnZpyPx8UJCZGkshpJmgtZ3mCbtQaO17LoihnqPn4UOMr
1106 |  V75R/7FjSuPLS8NaZF4wfi52btXMSxO/u7GuoJkzJscP3p4qtwe6Rl9dc1XC8P7k
1107 |  NIbGZ5Yg5cEPcfmhgXFOhQZkD0yxcJqBUcoFpnp2vu5XJl2E5I/quIyVxUXi6O6c
1108 |  /obspcvace4wy8uO0bdVhc4nJ+Rla4InVSJaUaBeiHTW8kReSFYyMmDCzLjGIu1q
1109 |  doU61OM3Zv1ptsLu3gUE6GU27iWYj2RWN3e3HE4Sbd89IFwLXNdSuM0ifDLZk7AQ
1110 |  WBhRhipCCgZhkj9g2NEk7jRVslti1NdN5zoQLaJNqSwO1MtxTmJ15Ksk3QP6kfLB
1111 |  Q52UWybBzpaP9HEd4XnR+HuQ4k2K0ns2KgNImsNvIyFwbpMUyUWLMPimaV1DWUXo
1112 |  5SBjDB/V/W2JBFR+XKHFJeFwYhj7DD/ocsGr4ZMx/lgc8rjIBkI=
1113 |  =lgTX
1114 |  -----END PGP SIGNATURE-----
1115 | 
1116 | Create first draft
1117 | #+END_EXAMPLE
1118 | 
1119 | The format is a simplified version of mail messages, as specified in
1120 | [[https://www.ietf.org/rfc/rfc2822.txt][RFC 2822]].  It begins with a series of key-value pairs, with space as
1121 | the key/value separator, and ends with the commit message, which may
1122 | span multiple lines.  Values may also continue over multiple lines:
1123 | subsequent lines start with a space, which the parser must drop (like
1124 | the =gpgsig= field above, which spans 16 lines).
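
The continuation rule is easier to see in isolation.  In the sketch
below, =fold()= and =unfold()= are hypothetical names for the two
directions of that transformation, and the "signature" is a made-up
stand-in for a real multi-line value.

#+BEGIN_SRC python
def fold(value):
    """Prefix every line after the first with a space, for storage."""
    return value.replace(b"\n", b"\n ")

def unfold(stored):
    """Drop the space that starts each continuation line."""
    return stored.replace(b"\n ", b"\n")

signature = b"-----BEGIN-----\n\nabc\n-----END-----"
stored = fold(signature)
print(stored)
print(unfold(stored) == signature)
#+END_SRC

Note that the empty line inside the value becomes a line holding a
single space once folded, so it can't be mistaken for the blank line
that introduces the message.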
1125 | 
1126 | Let's have a look at those fields:
1127 | 
1128 | - =tree= is a reference to a tree object, a type of object that we'll
1129 |   see next.  A tree maps blob IDs to filesystem locations, and
1130 |   describes a state of the work tree.  Put simply, it is the actual
1131 |   content of the commit: file contents, and where they go.
1132 | - =parent= is a reference to the parent of this commit.  It may be
1133 |   repeated: merge commits, for example, have multiple parents.  It
1134 |   may also be absent: the very first commit in a repository obviously
1135 |   doesn't have a parent.
1136 | - =author= and =committer= are separate, because the author of a commit
1137 |   is not necessarily the person who can commit it.  (This may not be
1138 |   obvious for GitHub users, but a lot of projects do Git through e-mail.)
1139 | - =gpgsig= is the PGP signature of this object.
1140 | 
1141 | We'll start by writing a simple parser for the format.  The code is
1142 | obvious.  The name of the function we're about to create,
1143 | =kvlm_parse()=, may be confusing: it isn't called =commit_parse()= because
1144 | tags have the very same format, so we'll use it for both object types.
1145 | I use KVLM to mean "Key-Value List with Message".
1146 | 
1147 | #+BEGIN_SRC python :tangle libwyag.py
1148 |   def kvlm_parse(raw, start=0, dct=None):
1149 |       if dct is None:
1150 |           dct = dict()
1151 |           # You CANNOT declare the argument as dct=dict() or all calls to
1152 |           # the function will endlessly grow the same dict.
1153 | 
1154 |       # This function is recursive: it reads a key/value pair, then calls
1155 |       # itself back with the new position.  So we first need to know
1156 |       # where we are: at a keyword, or already in the message?
1157 | 
1158 |       # We search for the next space and the next newline.
1159 |       spc = raw.find(b' ', start)
1160 |       nl = raw.find(b'\n', start)
1161 | 
1162 |       # If space appears before newline, we have a keyword.  Otherwise,
1163 |       # it's the final message, which we just read to the end of the file.
1164 | 
1165 |       # Base case
1166 |       # =========
1167 |       # If newline appears first (or there's no space at all, in which
1168 |       # case find returns -1), we assume a blank line.  A blank line
1169 |       # means the remainder of the data is the message.  We store it in
1170 |       # the dictionary, with None as the key, and return.
1171 |       if (spc < 0) or (nl < spc):
1172 |           assert nl == start
1173 |           dct[None] = raw[start+1:]
1174 |           return dct
1175 | 
1176 |       # Recursive case
1177 |       # ==============
1178 |       # we read a key-value pair and recurse for the next.
1179 |       key = raw[start:spc]
1180 | 
1181 |       # Find the end of the value.  Continuation lines begin with a
1182 |       # space, so we loop until we find a "\n" not followed by a space.
1183 |       end = start
1184 |       while True:
1185 |           end = raw.find(b'\n', end+1)
1186 |           if raw[end+1] != ord(' '): break
1187 | 
1188 |       # Grab the value
1189 |       # Also, drop the leading space on continuation lines
1190 |       value = raw[spc+1:end].replace(b'\n ', b'\n')
1191 | 
1192 |       # Don't overwrite existing data contents
1193 |       if key in dct:
1194 |           if type(dct[key]) == list:
1195 |               dct[key].append(value)
1196 |           else:
1197 |               dct[key] = [ dct[key], value ]
1198 |       else:
1199 |           dct[key]=value
1200 | 
1201 |       return kvlm_parse(raw, start=end+1, dct=dct)
1202 | #+END_SRC
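
To make the parser's output concrete, the sketch below applies the same
rules with a loop instead of recursion and runs them on a small,
made-up merge commit.  The name =kvlm_parse_iter=, the sample data and
the hashes in it are assumptions for illustration only, not part of
wyag.

#+BEGIN_SRC python
def kvlm_parse_iter(raw):
    """Loop-based sketch of the rules used by kvlm_parse (hypothetical helper)."""
    dct = {}
    pos = 0
    while True:
        spc = raw.find(b' ', pos)
        nl = raw.find(b'\n', pos)
        # A newline before any space means we reached the blank line:
        # the rest is the message, stored under the key None.
        if spc < 0 or nl < spc:
            dct[None] = raw[pos + 1:]
            return dct
        key = raw[pos:spc]
        # The value ends at the first newline not followed by a space.
        end = pos
        while True:
            end = raw.find(b'\n', end + 1)
            if raw[end + 1:end + 2] != b' ':
                break
        value = raw[spc + 1:end].replace(b'\n ', b'\n')
        if key in dct:
            # Repeated keys (several parents, for instance) become a list.
            if isinstance(dct[key], list):
                dct[key].append(value)
            else:
                dct[key] = [dct[key], value]
        else:
            dct[key] = value
        pos = end + 1

# Made-up sample: a merge commit with a two-line gpgsig field.
sample = (b"tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147\n"
          b"parent 206941306e8a8af65b66eaaaea388a7ae24d49a0\n"
          b"parent 5f2c1a9e3d4b6c7d8e9f0a1b2c3d4e5f60718293\n"
          b"author Ada <ada@example.com> 1527025023 +0200\n"
          b"gpgsig line one\n"
          b" line two\n"
          b"\n"
          b"Merge two branches\n")

kvlm = kvlm_parse_iter(sample)
for key, value in kvlm.items():
    print(key, value)
#+END_SRC

Note how the two =parent= lines end up in a list, and how the message
lands under the =None= key.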
1203 | 
1204 | #+begin_note
1205 | <<identity-rules>> *Object identity rules*
1206 | 
1207 | We use dictionaries (HashMaps) to store key/value associations, but we
1208 | rely on a specific feature of Python dictionaries: *they preserve
1209 | insertion order*.  This means that when we write an object back,
1210 | we'll iterate over the dictionary and get the fields back in the exact
1211 | order they were added.  This matters, because Git has *two strong
1212 | rules about object identity*:
1213 | 
1214 |  1. The first rule is that *the same name will always refer to the
1215 |     same object*.  We've seen this one already, it's just a
1216 |     consequence of the fact that an object's name is a hash of its
1217 |     contents.
1218 |  2. The second rule is subtly different: *the same object will always
1219 |     be referred to by the same name*.  This means that there shouldn't
1220 |     be two equivalent objects under different names.  This is why field
1221 |     order matters: by modifying the /order/ fields appear in a given
1222 |     commit, e.g. by putting the =tree= after the =parent=, we'd modify
1223 |     the SHA-1 hash of the commit, and we'd create two equivalent, but
1224 |     numerically distinct, commit objects.
1225 | 
1226 | For example, when comparing trees, git will assume that two trees with
1227 | different names /are/ different --- this is why we'll have to make
1228 | sure elements of the tree objects are properly sorted, so we don't
1229 | produce distinct but equivalent trees.
1230 | #+end_note
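
The effect of field order on identity is easy to check by hand.  The
sketch below hashes the same (made-up, deliberately incomplete) commit
fields in two different orders, using Git's object hashing scheme: the
SHA-1 of a =type size\0= header followed by the payload.  The helper
name =git_hash= is mine, not wyag's.

#+BEGIN_SRC python
import hashlib

def git_hash(fmt, data):
    # Git hashes a header ("<type> <size>\0") followed by the payload.
    header = fmt + b' ' + str(len(data)).encode() + b'\x00'
    return hashlib.sha1(header + data).hexdigest()

# Made-up fields; a real commit would also carry a committer line.
tree = b"tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147\n"
parent = b"parent 206941306e8a8af65b66eaaaea388a7ae24d49a0\n"
rest = b"author Ada <ada@example.com> 1527025023 +0200\n\nSame content\n"

canonical = git_hash(b'commit', tree + parent + rest)
reordered = git_hash(b'commit', parent + tree + rest)
print(canonical)
print(reordered)
#+END_SRC

Both payloads say the same thing, yet they get different names, which
is exactly the situation the second rule forbids.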
1231 | 
1232 | We're also going to need to write similar objects, so let's add a
1233 | =kvlm_serialize()= function to our toolkit.  This is very simple: we
1234 | write all fields first, then a blank line, then the message (which
1235 | already ends with its own newline, as kept by the parser).
1236 | 
1237 | #+BEGIN_SRC python :tangle libwyag.py
1238 |   def kvlm_serialize(kvlm):
1239 |       ret = b''
1240 | 
1241 |       # Output fields
1242 |       for k in kvlm.keys():
1243 |           # Skip the message itself
1244 |           if k is None: continue
1245 |           val = kvlm[k]
1246 |           # Normalize to a list
1247 |           if type(val) != list:
1248 |               val = [ val ]
1249 | 
1250 |           for v in val:
1251 |               ret += k + b' ' + (v.replace(b'\n', b'\n ')) + b'\n'
1252 | 
1253 |       # Append message
1254 |       ret += b'\n' + kvlm[None]
1255 | 
1256 |       return ret
1257 | #+END_SRC
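
The continuation-line rule is the only subtle part of the format.  As
a rough sketch (the helper names =fold= and =unfold= are hypothetical,
and the signature is a placeholder), this is what happens to a
multi-line value on its way out and back in:

#+BEGIN_SRC python
def fold(value):
    # Serializing: the line following each newline gains a leading space.
    return value.replace(b'\n', b'\n ')

def unfold(value):
    # Parsing: drop that leading space again.
    return value.replace(b'\n ', b'\n')

# Placeholder signature, with an empty line in the middle.
signature = b"-----BEGIN PGP SIGNATURE-----\n\niQIzBAAB\n-----END PGP SIGNATURE-----"
line = b'gpgsig ' + fold(signature) + b'\n'
print(line.decode())
#+END_SRC

Because every continuation line of a folded value begins with a space,
even the empty line inside the signature cannot be mistaken for the
blank line that introduces the message.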
1258 | 
1259 | ** The Commit object
1260 | :PROPERTIES:
1261 | :object_write: GitCommit
1262 | :END:
1263 | 
1264 | Now that we have the parser, we can create the =GitCommit= class:
1265 | 
1266 | #+BEGIN_SRC python :tangle libwyag.py
1267 |   class GitCommit(GitObject):
1268 |       fmt=b'commit'
1269 | 
1270 |       def deserialize(self, data):
1271 |           self.kvlm = kvlm_parse(data)
1272 | 
1273 |       def serialize(self):
1274 |           return kvlm_serialize(self.kvlm)
1275 | 
1276 |       def init(self):
1277 |           self.kvlm = dict()
1278 | #+END_SRC
1279 | 
1280 | ** The log command
1281 | :PROPERTIES:
1282 | :CUSTOM_ID: cmd-log
1283 | :END:
1284 | 
1285 | We'll implement a much, much simpler version of =log= than what Git
1286 | provides.  Most importantly, we won't deal with representing the log
1287 | /at all/.  Instead, we'll dump Graphviz data and let the user use
1288 | =dot= to render the actual log.  (If you don't know how to use
1289 | Graphviz, just paste the raw output into [[https://dreampuf.github.io/GraphvizOnline/][this site]].  If the link is
1290 | dead, look up "graphviz online" in your favorite search engine.)
1291 | 
1292 | #+BEGIN_SRC python :tangle libwyag.py
1293 |   argsp = argsubparsers.add_parser("log", help="Display history of a given commit.")
1294 |   argsp.add_argument("commit",
1295 |                      default="HEAD",
1296 |                      nargs="?",
1297 |                      help="Commit to start at.")
1298 | #+END_SRC
1299 | 
1300 | #+BEGIN_SRC python :tangle libwyag.py
1301 |   def cmd_log(args):
1302 |       repo = repo_find()
1303 | 
1304 |       print("digraph wyaglog{")
1305 |       print("  node[shape=rect]")
1306 |       log_graphviz(repo, object_find(repo, args.commit), set())
1307 |       print("}")
1308 | 
1309 |   def log_graphviz(repo, sha, seen):
1310 | 
1311 |       if sha in seen:
1312 |           return
1313 |       seen.add(sha)
1314 | 
1315 |       commit = object_read(repo, sha)
1316 |       message = commit.kvlm[None].decode("utf8").strip()
1317 |       message = message.replace("\\", "\\\\")
1318 |       message = message.replace("\"", "\\\"")
1319 | 
1320 |       if "\n" in message: # Keep only the first line
1321 |           message = message[:message.index("\n")]
1322 | 
1323 |       print(f"  c_{sha} [label=\"{sha[0:7]}: {message}\"]")
1324 |       assert commit.fmt==b'commit'
1325 | 
1326 |       if not b'parent' in commit.kvlm.keys():
1327 |           # Base case: the initial commit.
1328 |           return
1329 | 
1330 |       parents = commit.kvlm[b'parent']
1331 | 
1332 |       if type(parents) != list:
1333 |           parents = [ parents ]
1334 | 
1335 |       for p in parents:
1336 |           p = p.decode("ascii")
1337 |           print (f"  c_{sha} -> c_{p};")
1338 |           log_graphviz(repo, p, seen)
1339 | #+END_SRC
1340 | 
1341 | You can now use our log command like this:
1342 | 
1343 | #+BEGIN_SRC shell
1344 |   wyag log e03158242ecab460f31b0d6ae1642880577ccbe8 > log.dot
1345 |   dot -O -Tpdf log.dot
1346 | #+END_SRC
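
If you have no repository at hand, the following sketch shows the
shape of the output.  It mimics the traversal of =log_graphviz= on a
hard-coded parent map instead of real commit objects; =history_to_dot=
and the one-letter commit names are invented for the example.

#+BEGIN_SRC python
def history_to_dot(parents, start):
    """Emit Graphviz text for a history given as {sha: [parent shas]}."""
    lines = ["digraph wyaglog{", "  node[shape=rect]"]
    seen = set()

    def visit(sha):
        if sha in seen:      # Already drawn: a merge can reach a commit twice.
            return
        seen.add(sha)
        lines.append(f'  c_{sha} [label="{sha[0:7]}"]')
        for p in parents.get(sha, []):
            lines.append(f"  c_{sha} -> c_{p};")
            visit(p)

    visit(start)
    lines.append("}")
    return "\n".join(lines)

# A tiny diamond: "d" merges "b" and "c", which both descend from "a".
fake = {"d": ["b", "c"], "b": ["a"], "c": ["a"], "a": []}
print(history_to_dot(fake, "d"))
#+END_SRC

The =seen= set is what keeps the shared ancestor from being emitted
twice, even though two paths lead to it.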
1347 | 
1348 | ** Anatomy of a commit
1349 | :PROPERTIES:
1350 | :CUSTOM_ID: commit-anatomy
1351 | :END:
1352 | 
1353 | You may have noticed a few things by now.
1354 | 
1355 | First and foremost, we've been playing with commits, browsing and
1356 | walking through commit objects, building a graph of commit history,
1357 | without ever touching a single file in the worktree or a blob.  We've
1358 | done a lot with commits /without considering their contents/.  This is
1359 | important: work tree contents are just one part of a commit.  But a
1360 | commit is made of everything it holds: its contents, its authors,
1361 | *also its parents*.  If you remember that the ID (the SHA-1 hash) of a
1362 | commit is computed from the whole commit object, you'll understand
1363 | what it means that commits are immutable: if you change the author,
1364 | the parent commit or a single file, you've actually created a new,
1365 | different object.  Each and every commit is bound to its place and its
1366 | relationship to the whole repository up to the very first commit.  To
1367 | put it otherwise, a given commit ID not only identifies some file
1368 | contents, but it also binds the commit to its whole history and to the
1369 | whole repository.
1370 | 
1371 | It's also worth noting that from the point of view of a commit, time
1372 | somehow runs backwards: we're used to considering the history of a
1373 | project from its humble beginnings as an evening distraction, starting
1374 | with a few lines of code, some initial commits, and progressing to its
1375 | present state (millions of lines of code, dozens of contributors,
1376 | whatever).  But each commit is completely unaware of its future; it's
1377 | only linked to the past.  Commits have "memory", but no premonition.
1378 | 
1379 | # #+begin_note
1380 | #   In Terry Pratchett's Discworld, trolls believe they progress in time
1381 | #   from the future to the past.  The reasoning behind that belief is
1382 | #   that when you walk, what you can see is what's /ahead/ of you.  Of
1383 | #   time, all you can perceive is the past, because you remember; hence
1384 | #   it's where you're headed.  Commits are Discworld trolls.
1385 | # #+end_note
1386 | 
1387 | So what makes a commit?  To sum it up:
1388 | 
1389 | - A tree object, which we'll discuss now, that is, the contents of a
1390 |   worktree, files and directories;
1391 | - Zero, one or more parents;
1392 | - An author identity (name and email), and a timestamp;
1393 | - A committer identity (name and email), and a timestamp;
1394 | - An optional PGP signature;
1395 | - A message.
1396 | 
1397 | All of this is hashed together into a unique SHA-1 identifier.
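
This binding to history is easy to observe with a toy model.  The
sketch below builds deliberately simplified commits (a parent and a
message only, which real commits never are) and shows that changing
the very first one changes every identifier after it.  The functions
=commit_id= and =chain= are hypothetical.

#+BEGIN_SRC python
import hashlib

def commit_id(message, parent=None):
    # Simplified commit: only a parent and a message, no tree or author.
    body = b''
    if parent:
        body += b'parent ' + parent.encode() + b'\n'
    body += b'\n' + message
    header = b'commit ' + str(len(body)).encode() + b'\x00'
    return hashlib.sha1(header + body).hexdigest()

def chain(messages):
    # Each commit takes the previous commit's identifier as its parent.
    ids, parent = [], None
    for m in messages:
        parent = commit_id(m, parent)
        ids.append(parent)
    return ids

original = chain([b"first\n", b"second\n", b"third\n"])
rewritten = chain([b"FIRST\n", b"second\n", b"third\n"])
for before, after in zip(original, rewritten):
    print(before[:7], "->", after[:7])
#+END_SRC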
1398 | 
1399 | #+begin_note
1400 |   *Wait, does that make Git a blockchain?*
1401 | 
1402 |   Because of cryptocurrencies, blockchains are all the hype these
1403 |   days.  And yes, /in a way/, Git is a blockchain: it's a sequence of
1404 |   blocks (commits) tied together by cryptographic means in a way that
1405 |   guarantees that each single element is associated with the whole
1406 |   history of the structure.  Don't take the comparison too seriously,
1407 |   though: we don't need a GitCoin.  Really, we don't.
1408 | #+end_note
1409 | 
1410 | * Reading commit data: checkout
1411 | :PROPERTIES:
1412 | :CUSTOM_ID: checkout
1413 | :END:
1414 | 
1415 | It's all well that commits hold a lot more than files and directories
1416 | in a given state, but that doesn't make them really useful.  It's
1417 | probably time to start implementing tree objects as well, so we'll be
1418 | able to check out commits into the work tree.
1419 | 
1420 | ** What's in a tree?
1421 | 
1422 | Informally, a tree describes the content of the work tree, that is, it
1423 | associates blobs to paths.  It's an array of three-element tuples made
1424 | of a file mode, a path (relative to the worktree) and a SHA-1.  A
1425 | typical tree contents may look like this:
1426 | 
1427 | | Mode     | SHA-1                                      | Path         |
1428 | |----------+--------------------------------------------+--------------|
1429 | | =100644= | =894a44cc066a027465cd26d634948d56d13af9af= | =.gitignore= |
1430 | | =100644= | =94a9ed024d3859793618152ea559a168bbcbb5e2= | =LICENSE=    |
1431 | | =100644= | =bab489c4f4600a38ce6dbfd652b90383a4aa3e45= | =README.md=  |
1432 | | =040000= | =6d208e47659a2a10f5f8640e0155d9276a2130a9= | =src=        |
1433 | | =040000= | =e7445b03aea61ec801b20d6ab62f076208b7d097= | =tests=      |
1434 | | =100644= | =d5ec863f17f3a2e92aa8f6b66ac18f7b09fd1b38= | =main.c=     |
1435 | 
1436 | Mode is just the file's [[https://en.wikipedia.org/wiki/File_system_permissions][mode]], path is its location.  The SHA-1 refers
1437 | to either a blob or another tree object.  If a blob, the path is a
1438 | file; if a tree, it's a directory.  To instantiate this tree in the
1439 | filesystem, we would begin by loading the object associated to the
1440 | first path (=.gitignore=) and check its type.  Since it's a blob,
1441 | we'll just create a file called =.gitignore= with this blob's
1442 | contents; and same for =LICENSE= and =README.md=.  But the object
1443 | associated with =src= is not a blob, but another tree: we'll create
1444 | the directory =src= and repeat the same operation in that directory
1445 | with the new tree.
1446 | 
1447 | #+BEGIN_warning
1448 |   *A path is a single filesystem entry*
1449 | 
1450 |   The path identifies exactly one file or directory.  Not two, not
1451 |   three.  If you have five levels of nested directories, even if four
1452 |   are empty save the next directory, you're going to need five tree
1453 |   objects recursively referring to one another.  You cannot take the
1454 |   shortcut of putting a full path in a single tree entry, like
1455 |   =dir1/dir2/dir3/dir4/dir5=.
1456 | #+END_warning
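
To make the rule concrete, here's a small sketch that turns full paths
into one nested dictionary per directory, the way tree objects nest.
It only illustrates the structure: =nest= and =count_trees= are made-up
helpers, and no actual objects are created.

#+BEGIN_SRC python
def nest(paths):
    """Group full paths into one dict per directory level (illustration only)."""
    root = {}
    for path in paths:
        node = root
        *dirs, name = path.split("/")
        for d in dirs:
            node = node.setdefault(d, {})   # One sub-tree per directory.
        node[name] = "blob"
    return root

def count_trees(node):
    # The node itself, plus every nested directory.
    return 1 + sum(count_trees(v) for v in node.values() if isinstance(v, dict))

layout = nest(["dir1/dir2/dir3/dir4/dir5/file.txt"])
print(count_trees(layout))
#+END_SRC

For the path above, the count is six: the root tree plus one tree for
each of the five directories.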
1457 | 
1458 | ** Parsing trees
1459 | 
1460 | Unlike tags and commits, tree objects are binary objects, but their
1461 | format is actually quite simple.  A tree is the concatenation of
1462 | records of the format:
1463 | 
1464 | #+begin_example
1465 | [mode] space [path] 0x00 [sha-1]
1466 | #+end_example
1467 | 
1468 | - =[mode]= is up to six bytes and is an octal representation of a file
1469 |   *mode*, stored in ASCII. For example, 100644 is encoded with byte
1470 |   values 49 (ASCII "1"), 48 (ASCII "0"), 48, 54, 52, 52.  The first
1471 |   two digits encode the file type (file, directory, symlink or
1472 |   submodule), the last four the permissions.
1473 | - It's followed by 0x20, an ASCII *space*;
1474 | - Followed by the null-terminated (0x00) *path*;
1475 | - Followed by the object's *SHA-1* in binary encoding, on 20 bytes.
1476 | 
1477 | The parser is going to be quite simple.  First, create a tiny object
1478 | wrapper for a single record (a leaf, a single path):
1479 | 
1480 | #+BEGIN_SRC python :tangle libwyag.py
1481 |   class GitTreeLeaf (object):
1482 |       def __init__(self, mode, path, sha):
1483 |           self.mode = mode
1484 |           self.path = path
1485 |           self.sha = sha
1486 | #+END_SRC
1487 | 
1488 | Because a tree object is just the repetition of the same fundamental
1489 | data structure, we write the parser in two functions.  First, a parser
1490 | to extract a single record, which returns parsed data and the position
1491 | it reached in input data:
1492 | 
1493 | #+BEGIN_SRC python :tangle libwyag.py
1494 |   def tree_parse_one(raw, start=0):
1495 |       # Find the space terminator of the mode
1496 |       x = raw.find(b' ', start)
1497 |       assert x-start == 5 or x-start==6
1498 | 
1499 |       # Read the mode
1500 |       mode = raw[start:x]
1501 |       if len(mode) == 5:
1502 |           # Normalize to six bytes.
1503 |           mode = b"0" + mode
1504 | 
1505 |       # Find the NULL terminator of the path
1506 |       y = raw.find(b'\x00', x)
1507 |       # and read the path
1508 |       path = raw[x+1:y]
1509 | 
1510 |       # Read the SHA…
1511 |       raw_sha = int.from_bytes(raw[y+1:y+21], "big")
1512 |       # and convert it into an hex string, padded to 40 chars
1513 |       # with zeros if needed.
1514 |       sha = format(raw_sha, "040x")
1515 |       return y+21, GitTreeLeaf(mode, path.decode("utf8"), sha)
1516 | #+END_SRC
1517 | 
1518 | And the "real" parser which just calls the previous one in a loop,
1519 | until input data is exhausted.
1520 | 
1521 | #+BEGIN_SRC python :tangle libwyag.py
1522 |   def tree_parse(raw):
1523 |       pos = 0
1524 |       max = len(raw)
1525 |       ret = list()
1526 |       while pos < max:
1527 |           pos, data = tree_parse_one(raw, pos)
1528 |           ret.append(data)
1529 | 
1530 |       return ret
1531 | #+END_SRC
1532 | 
1533 | We'll finally need a serializer to write trees back.  Because we may
1534 | have added or modified entries, we need to sort them again.
1535 | Consistently sorting matters, because we need to respect git's
1536 | [[identity-rules][identity rules]], which say that no two equivalent objects can have a
1537 | different hash --- but differently sorted trees with the same contents
1538 | /would/ be equivalent (describing the same directory structure), and
1539 | still numerically distinct (different SHA-1 identifiers).  Incorrectly
1540 | sorted trees are invalid, but /git doesn't enforce that/.  I created
1541 | some invalid trees by accident writing wyag, and all I got was weird
1542 | bugs in =git status= (specifically, =status= would report an actually
1543 | clean worktree as fully modified).  We don't want that.
1544 | 
1545 | The ordering function is quite simple, with an unexpected twist.
1546 | Entries are sorted by name, alphabetically, /but/ directories (that is,
1547 | tree entries) are sorted with a final =/= added.  It matters, because
1548 | it means that if =whatever= names a regular file, it will sort
1549 | /before/ =whatever.c=, but if =whatever= is a dir, it will sort
1550 | /after/, as =whatever/=. (I'm not sure why git does that.  If you're
1551 | curious, see the function =base_name_compare= in =tree.c= in the git
1552 | source.)
1553 | 
1554 | #+begin_src python :tangle libwyag.py
1555 |   # Notice this isn't a comparison function, but a conversion function.
1556 |   # Python's default sort doesn't accept a custom comparison function,
1557 |   # like in most languages, but a `key` argument that returns a new
1558 |   # value, which is compared using the default rules.  So we just return
1559 |   # the leaf name, with an extra / if it's a directory.
1560 |   def tree_leaf_sort_key(leaf):
1561 |       if leaf.mode.startswith(b"10"):
1562 |           return leaf.path
1563 |       else:
1564 |           return leaf.path + "/"
1565 | #+end_src
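
Here's a quick way to convince yourself of the difference.  The sketch
uses a =namedtuple= as a stand-in for =GitTreeLeaf= and repeats the key
function (under the made-up name =sort_key=) so that it runs on its
own:

#+BEGIN_SRC python
from collections import namedtuple

Leaf = namedtuple("Leaf", "mode path")   # Stand-in for GitTreeLeaf.

def sort_key(leaf):
    # Same rule as tree_leaf_sort_key: non-files compare with a trailing "/".
    return leaf.path if leaf.mode.startswith(b"10") else leaf.path + "/"

def order(leaves):
    return [leaf.path for leaf in sorted(leaves, key=sort_key)]

# "whatever" as a regular file, then as a directory.
as_file = order([Leaf(b"100644", "whatever.c"), Leaf(b"100644", "whatever")])
as_dir = order([Leaf(b"100644", "whatever.c"), Leaf(b"040000", "whatever")])
print(as_file)
print(as_dir)
#+END_SRC

The flip comes from a single character comparison: =.= sorts before
=/= in ASCII, so =whatever.c= precedes =whatever/=.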
1566 | 
1567 | Then the serializer itself.  This one is very simple: we sort the
1568 | items using our newly created function as a transformer, then write
1569 | them in order.
1570 | 
1571 | #+BEGIN_SRC python :tangle libwyag.py
1572 |   def tree_serialize(obj):
1573 |       obj.items.sort(key=tree_leaf_sort_key)
1574 |       ret = b''
1575 |       for i in obj.items:
1576 |           ret += i.mode
1577 |           ret += b' '
1578 |           ret += i.path.encode("utf8")
1579 |           ret += b'\x00'
1580 |           sha = int(i.sha, 16)
1581 |           ret += sha.to_bytes(20, byteorder="big")
1582 |       return ret
1583 | #+END_SRC
1584 | 
1585 | And now we just have to combine all that into a class:
1586 | 
1587 | #+BEGIN_SRC python :tangle libwyag.py
1588 |   class GitTree(GitObject):
1589 |       fmt=b'tree'
1590 | 
1591 |       def deserialize(self, data):
1592 |           self.items = tree_parse(data)
1593 | 
1594 |       def serialize(self):
1595 |           return tree_serialize(self)
1596 | 
1597 |       def init(self):
1598 |           self.items = list()
1599 | #+END_SRC
1600 | 
1601 | ** Showing trees: ls-tree
1602 | :PROPERTIES:
1603 | :CUSTOM_ID: cmd-ls-tree
1604 | :END:
1605 | 
1606 | While we're at it, let's add the =ls-tree= command to wyag.  It's so
1607 | easy there's no reason not to.  =git ls-tree [-r] TREE= simply prints
1608 | the contents of a tree, recursively with the =-r= flag.  In recursive
1609 | mode, it doesn't show subtrees, just final objects with their full
1610 | paths.
1611 | 
1612 | #+NAME: cmd-ls-tree
1613 | #+BEGIN_SRC python :tangle libwyag.py
1614 |   argsp = argsubparsers.add_parser("ls-tree", help="Pretty-print a tree object.")
1615 |   argsp.add_argument("-r",
1616 |                      dest="recursive",
1617 |                      action="store_true",
1618 |                      help="Recurse into sub-trees")
1619 | 
1620 |   argsp.add_argument("tree",
1621 |                      help="A tree-ish object.")
1622 | 
1623 |   def cmd_ls_tree(args):
1624 |       repo = repo_find()
1625 |       ls_tree(repo, args.tree, args.recursive)
1626 | 
1627 |   def ls_tree(repo, ref, recursive=None, prefix=""):
1628 |       sha = object_find(repo, ref, fmt=b"tree")
1629 |       obj = object_read(repo, sha)
1630 |       for item in obj.items:
1631 |           if len(item.mode) == 5:
1632 |               type = item.mode[0:1]
1633 |           else:
1634 |               type = item.mode[0:2]
1635 | 
1636 |           match type: # Determine the type.
1637 |               case b'04': type = "tree"
1638 |               case b'10': type = "blob" # A regular file.
1639 |               case b'12': type = "blob" # A symlink. Blob contents is link target.
1640 |               case b'16': type = "commit" # A submodule
1641 |               case _: raise Exception(f"Weird tree leaf mode {item.mode}")
1642 | 
1643 |           if not (recursive and type=='tree'): # This is a leaf
1644 |               print(f"{'0' * (6 - len(item.mode)) + item.mode.decode('ascii')} {type} {item.sha}\t{os.path.join(prefix, item.path)}")
1645 |           else: # This is a branch, recurse
1646 |               ls_tree(repo, item.sha, recursive, os.path.join(prefix, item.path))
1647 | #+END_SRC
1648 | 
1649 | ** The checkout command
1650 | :PROPERTIES:
1651 | :CUSTOM_ID: cmd-checkout
1652 | :END:
1653 | 
1654 | =git checkout= simply instantiates a commit in the worktree.  We're
1655 | going to oversimplify the actual git command to make our
1656 | implementation clear and understandable.  We're also going to add a
1657 | few safeguards.  Here's how our version of checkout will work:
1658 | 
1659 | - It will take two arguments: a commit, and a directory.  Git checkout
1660 |   only needs a commit.
1661 | 
1662 | - It will then instantiate the tree in the directory, *if and only if
1663 |   the directory is empty*.  Git is full of safeguards to avoid
1664 |   deleting data, which would be too complicated and unsafe to try to
1665 |   reproduce in wyag.  Since the point of wyag is to demonstrate git,
1666 |   not to produce a working implementation, this limitation is
1667 |   acceptable.
1668 | 
1669 | Let's get started.  As usual, we need a subparser:
1670 | 
1671 | #+BEGIN_SRC python :tangle libwyag.py
1672 |   argsp = argsubparsers.add_parser("checkout", help="Checkout a commit inside of a directory.")
1673 | 
1674 |   argsp.add_argument("commit",
1675 |                      help="The commit or tree to checkout.")
1676 | 
1677 |   argsp.add_argument("path",
1678 |                      help="The EMPTY directory to checkout on.")
1679 | #+END_SRC
1680 | 
1681 | A wrapper function:
1682 | 
1683 | #+BEGIN_SRC python :tangle libwyag.py
1684 |   def cmd_checkout(args):
1685 |       repo = repo_find()
1686 | 
1687 |       obj = object_read(repo, object_find(repo, args.commit))
1688 | 
1689 |       # If the object is a commit, we grab its tree
1690 |       if obj.fmt == b'commit':
1691 |           obj = object_read(repo, obj.kvlm[b'tree'].decode("ascii"))
1692 | 
1693 |       # Verify that path is an empty directory
1694 |       if os.path.exists(args.path):
1695 |           if not os.path.isdir(args.path):
1696 |               raise Exception(f"Not a directory {args.path}!")
1697 |           if os.listdir(args.path):
1698 |               raise Exception(f"Not empty {args.path}!")
1699 |       else:
1700 |           os.makedirs(args.path)
1701 | 
1702 |       tree_checkout(repo, obj, os.path.realpath(args.path))
1703 | #+END_SRC
1704 | 
1705 | And a function to do the actual work:
1706 | 
1707 | #+BEGIN_SRC python :tangle libwyag.py
1708 |   def tree_checkout(repo, tree, path):
1709 |       for item in tree.items:
1710 |           obj = object_read(repo, item.sha)
1711 |           dest = os.path.join(path, item.path)
1712 | 
1713 |           if obj.fmt == b'tree':
1714 |               os.mkdir(dest)
1715 |               tree_checkout(repo, obj, dest)
1716 |           elif obj.fmt == b'blob':
1717 |               # @TODO Support symlinks (identified by mode 12****)
1718 |               with open(dest, 'wb') as f:
1719 |                   f.write(obj.blobdata)
1720 | #+END_SRC
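
One possible way to address the TODO above, assuming a POSIX system
where =os.symlink= is allowed: for a symlink entry, the blob's
contents are the link target.  The helper =write_leaf= is hypothetical
and not part of wyag; treat it as a sketch.

#+BEGIN_SRC python
import os
import tempfile

def write_leaf(dest, mode, data):
    """Write one checked-out entry (hypothetical helper, not part of wyag)."""
    if mode.startswith(b"12"):
        # Symlink: the blob's contents are the link target.
        os.symlink(data.decode("utf8"), dest)
    else:
        with open(dest, "wb") as f:
            f.write(data)

with tempfile.TemporaryDirectory() as tmp:
    write_leaf(os.path.join(tmp, "real.txt"), b"100644", b"hello\n")
    write_leaf(os.path.join(tmp, "alias"), b"120000", b"real.txt")
    is_link = os.path.islink(os.path.join(tmp, "alias"))
    # Reading through the link reaches the regular file's contents.
    with open(os.path.join(tmp, "alias"), "rb") as f:
        through_link = f.read()

print(is_link, through_link)
#+END_SRC

In =tree_checkout=, this would mean looking at =item.mode= in the blob
branch before deciding how to write the entry.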
1721 | 
1722 | * Refs, tags and branches
1723 | ** What a ref is, and the show-ref command
1724 | :PROPERTIES:
1725 | :CUSTOM_ID: cmd-show-ref
1726 | :END:
1727 | 
1728 | As of now, the only way we can refer to objects is by their full
1729 | hexadecimal identifier.  In git, we actually rarely see those, except
1730 | to talk about a specific commit.  But in general, we're talking about
1731 | HEAD, about branches with names like =main= or
1732 | =feature/more-bombs=, and so on.  This is handled by a simple
1733 | mechanism called references.
1734 | 
1735 | Git references, or refs, are probably the simplest kind of thing
1736 | git holds.  They live in subdirectories of =.git/refs=, and are text
1737 | files containing a hexadecimal representation of an object's hash,
1738 | encoded in ASCII.  They're actually as simple as this:
1739 | 
1740 | #+BEGIN_example
1741 |   6071c08bcb4757d8c89a30d9755d2466cef8c1de
1742 | #+END_example
1743 | 
1744 | Refs can also refer to another reference, and thus only indirectly to
1745 | an object, in which case they look like this:
1746 | 
1747 | #+BEGIN_EXAMPLE
1748 |   ref: refs/remotes/origin/master
1749 | #+END_EXAMPLE
1750 | 
1751 | #+BEGIN_note
1752 |   *Direct and indirect references*
1753 | 
1754 |   From now on, I will call a reference of the form =ref:
1755 |   path/to/other/ref= an *indirect* reference, and a ref with a SHA-1
1756 |   object ID a *direct reference*.
1757 | #+END_note
1758 | 
1759 | This section will describe the uses of refs.  For now, all that matters
1760 | is this:
1761 | 
1762 | - they're text files, in the =.git/refs= hierarchy;
1763 | - they hold the SHA-1 identifier of an object, or a reference to
1764 |   another reference, ultimately to a SHA-1 (no loops!)
1765 | 
1766 | To work with refs, we're first going to need a simple recursive solver
1767 | that will take a ref name, follow any indirect references (refs
1768 | whose content begins with =ref:=, as exemplified above) and return a
1769 | SHA-1 identifier:
1770 | 
1771 | #+BEGIN_SRC python :tangle libwyag.py
1772 |   def ref_resolve(repo, ref):
1773 |       path = repo_file(repo, ref)
1774 | 
1775 |       # Sometimes, an indirect reference may be broken.  This is normal
1776 |       # in one specific case: we're looking for HEAD on a new repository
1777 |       # with no commits.  In that case, .git/HEAD points to "ref:
1778 |       # refs/heads/main", but .git/refs/heads/main doesn't exist yet
1779 |       # (since there's no commit for it to refer to).
1780 |       if not os.path.isfile(path):
1781 |           return None
1782 | 
1783 |       with open(path, 'r') as fp:
1784 |           data = fp.read()[:-1]
1785 |           # Drop final \n ^^^^^
1786 |       if data.startswith("ref: "):
1787 |           return ref_resolve(repo, data[5:])
1788 |       else:
1789 |           return data
1790 | #+END_SRC
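
The "no loops" condition deserves a caveat: nothing in the function
above stops two refs from pointing at each other.  Here's a standalone
sketch of the same logic with a crude guard, run against a throwaway
directory.  The name =resolve= and the depth limit of 10 are arbitrary
choices of mine.

#+BEGIN_SRC python
import os
import tempfile

def resolve(gitdir, ref, depth=0):
    """Follow "ref: ..." indirections, giving up after 10 hops."""
    if depth > 10:
        raise Exception(f"Too many indirections while resolving {ref}")
    path = os.path.join(gitdir, ref)
    if not os.path.isfile(path):
        return None
    with open(path) as fp:
        data = fp.read().strip()
    if data.startswith("ref: "):
        return resolve(gitdir, data[5:], depth + 1)
    return data

with tempfile.TemporaryDirectory() as gitdir:
    os.makedirs(os.path.join(gitdir, "refs", "heads"))
    with open(os.path.join(gitdir, "HEAD"), "w") as fp:
        fp.write("ref: refs/heads/main\n")

    unborn = resolve(gitdir, "HEAD")     # The branch file doesn't exist yet.

    with open(os.path.join(gitdir, "refs", "heads", "main"), "w") as fp:
        fp.write("6071c08bcb4757d8c89a30d9755d2466cef8c1de\n")

    resolved = resolve(gitdir, "HEAD")

print(unborn, resolved)
#+END_SRC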
1791 | 
1792 | Let's create two small functions, and implement the =show-ref=
1793 | command --- it just lists all references in a repository.  First, a
1794 | stupid recursive function to collect refs and return them as a dict:
1795 | 
1796 | #+BEGIN_SRC python :tangle libwyag.py
1797 |   def ref_list(repo, path=None):
1798 |       if not path:
1799 |           path = repo_dir(repo, "refs")
1800 |       ret = dict()
1801 |       # Git shows refs sorted.  To do the same, we sort the output of
1802 |       # listdir
1803 |       for f in sorted(os.listdir(path)):
1804 |           can = os.path.join(path, f)
1805 |           if os.path.isdir(can):
1806 |               ret[f] = ref_list(repo, can)
1807 |           else:
1808 |               ret[f] = ref_resolve(repo, can)
1809 | 
1810 |       return ret
1811 | #+END_SRC
1812 | 
1813 | And, as usual, a subparser, a bridge, and a (recursive) worker function:
1814 | 
1815 | #+BEGIN_SRC python :tangle libwyag.py
1816 |   argsp = argsubparsers.add_parser("show-ref", help="List references.")
1817 | 
1818 |   def cmd_show_ref(args):
1819 |       repo = repo_find()
1820 |       refs = ref_list(repo)
1821 |       show_ref(repo, refs, prefix="refs")
1822 | 
1823 |   def show_ref(repo, refs, with_hash=True, prefix=""):
1824 |       if prefix:
1825 |           prefix = prefix + '/'
1826 |       for k, v in refs.items():
1827 |           if type(v) == str and with_hash:
1828 |               print (f"{v} {prefix}{k}")
1829 |           elif type(v) == str:
1830 |               print (f"{prefix}{k}")
1831 |           else:
1832 |               show_ref(repo, v, with_hash=with_hash, prefix=f"{prefix}{k}")
1833 | #+END_SRC
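
The same nested dictionary can also be flattened into pairs, which may
be handier than printing directly.  A sketch on hand-written sample
data (=flatten_refs= is a hypothetical helper):

#+BEGIN_SRC python
def flatten_refs(refs, prefix="refs"):
    """Yield (full name, sha) pairs from a nested dict shaped like ref_list's."""
    for name, value in refs.items():
        full = f"{prefix}/{name}"
        if isinstance(value, dict):
            yield from flatten_refs(value, full)
        else:
            yield full, value

# Hand-written sample, shaped like the output of ref_list.
sample = {
    "heads": {"main": "6071c08bcb4757d8c89a30d9755d2466cef8c1de"},
    "tags": {"v12.78.52": "6071c08bcb4757d8c89a30d9755d2466cef8c1de"},
}
for name, sha in flatten_refs(sample):
    print(f"{sha} {name}")
#+END_SRC
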
1834 | ** Tags as references
1835 | :PROPERTIES:
1836 | :CUSTOM_ID: tags
1837 | :END:
1838 | 
1839 | The simplest use of refs is tags.  A tag is just a user-defined
1840 | name for an object, often a commit.  A very common use of tags is
1841 | identifying software releases: You've just merged the last commit of,
1842 | say, version 12.78.52 of your program, so your most recent commit
1843 | (let's call it =6071c08=) /is/ your version 12.78.52.  To make this
1844 | association explicit, all you have to do is:
1845 | 
1846 | #+BEGIN_src shell
1847 |   git tag v12.78.52 6071c08
1848 |   # the object hash ^here^^ is optional and defaults to HEAD.
1849 | #+END_SRC
1850 | 
1851 | This creates a new tag, called =v12.78.52=, pointing at =6071c08=.
1852 | Tagging is like aliasing: a tag introduces a new way to refer to an
1853 | existing object.  After the tag is created, the name =v12.78.52= refers
1854 | to =6071c08=.  For example, these two commands are now perfectly
1855 | equivalent:
1856 | 
1857 | #+BEGIN_src shell
1858 |   git checkout v12.78.52
1859 |   git checkout 6071c08
1860 | #+END_src
1861 | 
1862 | #+begin_note
1863 |   Versions are a common use of tags, but like almost everything in
1864 |   Git, tags have no predefined semantics: they mean whatever you want
1865 |   them to mean, and can point to whichever object you want, you can
1866 |   even tag /blobs/!
1867 | #+end_note
1868 | 
1869 | ** Lightweight tags and tag objects, and parsing the latter
1870 | :PROPERTIES:
1871 | :CUSTOM_ID: GitTag
1872 | :END:
1873 | 
1874 | You've probably guessed already that tags are actually refs.  They
1875 | live in the =.git/refs/tags/= hierarchy.  The only point worth noting is
1876 | that they come in two flavors: lightweight tags and tag objects.
1877 | 
1878 | - "Lightweight" tags :: are just regular refs to a commit, a tree or
1879 |      a blob.
1880 | 
1881 | - Tag objects :: are regular refs pointing to an object of type =tag=.
1882 |   Unlike lightweight tags, tag objects have an author, a date, an
1883 |   optional PGP signature and an optional annotation.  Their format is
1884 |   the same as a commit object.
1885 | 
1886 | We don't even need to implement tag objects, we can reuse =GitCommit=
1887 | and just change the =fmt= field:
1888 | 
1889 | #+BEGIN_SRC python :tangle libwyag.py
1890 | class GitTag(GitCommit):
1891 |     fmt = b'tag'
1892 | #+END_SRC
1893 | 
1894 | And now we support tags.
1895 | 
1896 | ** The tag command
1897 | :PROPERTIES:
1898 | :CUSTOM_ID: cmd-tag
1899 | :END:
1900 | 
1901 | Let's add the =tag= command.  In Git, it does two things: it creates a
1902 | new tag or lists existing tags (by default).  So you can invoke it with:
1903 | 
1904 | #+BEGIN_src shell
1905 |   git tag                  # List all tags
1906 |   git tag NAME [OBJECT]    # create a new *lightweight* tag NAME, pointing
1907 |                            # at HEAD (default) or OBJECT
1908 |   git tag -a NAME [OBJECT] # create a new tag *object* NAME, pointing at
1909 |                            # HEAD (default) or OBJECT
1910 | #+END_src
1911 | 
1912 | This translates to argparse as follows.  Notice we ignore the mutual
1913 | exclusion between =--list= and =[-a] name [object]=, which seems too
1914 | complicated for argparse.
1915 | 
1916 | # @FIXME This ignores the mutual exclusion
1917 | #+BEGIN_SRC python :tangle libwyag.py
1918 |   argsp = argsubparsers.add_parser(
1919 |       "tag",
1920 |       help="List and create tags")
1921 | 
1922 |   argsp.add_argument("-a",
1923 |                      action="store_true",
1924 |                      dest="create_tag_object",
1925 |                      help="Whether to create a tag object")
1926 | 
1927 |   argsp.add_argument("name",
1928 |                      nargs="?",
1929 |                      help="The new tag's name")
1930 | 
1931 |   argsp.add_argument("object",
1932 |                      default="HEAD",
1933 |                      nargs="?",
1934 |                      help="The object the new tag will point to")
1935 | #+END_SRC
1936 | 
1937 | The =cmd_tag= function will dispatch behavior (list or create) depending
1938 | on whether or not =name= is provided.
1939 | 
1940 | #+BEGIN_SRC python :tangle libwyag.py
1941 |     def cmd_tag(args):
1942 |         repo = repo_find()
1943 | 
1944 |         if args.name:
1945 |             tag_create(repo,
1946 |                        args.name,
1947 |                        args.object,
1948 |                        create_tag_object = args.create_tag_object)
1949 |         else:
1950 |             refs = ref_list(repo)
1951 |             show_ref(repo, refs["tags"], with_hash=False)
1952 | #+END_SRC
1953 | 
1954 | And we just need one more function to actually create the tag:
1955 | 
1956 | #+begin_src python :tangle libwyag.py
1957 |   def tag_create(repo, name, ref, create_tag_object=False):
1958 |       # get the GitObject from the object reference
1959 |       sha = object_find(repo, ref)
1960 | 
1961 |       if create_tag_object:
1962 |           # create tag object (commit)
1963 |           tag = GitTag()
1964 |           tag.kvlm = dict()
1965 |           tag.kvlm[b'object'] = sha.encode()
1966 |           tag.kvlm[b'type'] = b'commit'
1967 |           tag.kvlm[b'tag'] = name.encode()
1968 |           # Feel free to let the user give their name!
1969 |           # Notice you can fix this after commit, read on!
1970 |           tag.kvlm[b'tagger'] = b'Wyag <wyag@example.com>'
1971 |           # …and a tag message!
1972 |           tag.kvlm[None] = b"A tag generated by wyag, which won't let you customize the message!\n"
1973 |           tag_sha = object_write(tag, repo)
1974 |           # create reference
1975 |           ref_create(repo, "tags/" + name, tag_sha)
1976 |       else:
1977 |           # create lightweight tag (ref)
1978 |           ref_create(repo, "tags/" + name, sha)
1979 | 
1980 |   def ref_create(repo, ref_name, sha):
1981 |       with open(repo_file(repo, "refs/" + ref_name), 'w') as fp:
1982 |           fp.write(sha + "\n")
1983 | #+end_src
1984 | 
1985 | ** What's a branch?
1986 | :PROPERTIES:
1987 | :CUSTOM_ID: branches
1988 | :END:
1989 | 
1990 | Tags are done.  Now for another big chunk: branches.
1991 | 
1992 | It's time to address the elephant in the room: like most Git users,
1993 | wyag still doesn't have any idea what a branch is.  It currently
1994 | treats a repository as a bunch of disorganized objects, some of them
1995 | commits, and has no representation whatsoever of the fact that commits
1996 | are grouped in branches, and that at every point in time there's a
1997 | commit that's =HEAD=, /i.e./, the *head* commit (or "tip") of the
1998 | *active* branch.
1999 | 
2000 | So, what's a branch?  The answer is actually surprisingly simple, but
2001 | it may also end up being simply surprising: *a branch is a reference
2002 | to a commit*.  You could even say that a branch is a kind of a name
2003 | for a commit.  In this regard, a branch is exactly the same thing as a
2004 | tag.  Tags are refs that live in =.git/refs/tags=, branches are refs
2005 | that live in =.git/refs/heads=.
2006 | 
2007 | There are, of course, differences between a branch and a tag:
2008 | 
2009 | 1. Branches are references to a /commit/, tags can refer to any object;
2010 | 2. Most importantly, the branch ref is updated at each commit.  This means
2011 |    that whenever you commit, Git actually does this:
2012 |    1. a new commit object is created, with the current branch's
2013 |       (commit!) ID as its parent;
2014 |    2. the commit object is hashed and stored;
2015 |    3. the branch ref is updated to refer to the new commit's hash.
2016 | 
2017 | That's all.
2018 | 
2019 | But what about the *current* branch?  It's actually even easier.  It's a
2020 | ref file outside of the =refs= hierarchy, in =.git/HEAD=, which is an
2021 | *indirect* ref (that is, it is of the form =ref: path/to/other/ref=, and
2022 | not a simple hash).
2023 | 
2024 | #+begin_note
2025 |   *Detached HEAD*
2026 | 
2027 |   When you just check out a random commit, git will warn you that you're in
2028 |   "detached HEAD state".  This means you're not on any branch anymore.
2029 |   In this case, =.git/HEAD= is a *direct* reference: it contains a
2030 |   SHA-1.
2031 | #+end_note
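
To make the direct/indirect distinction concrete, here is a minimal, self-contained sketch that follows =ref: ...= indirections until it reaches a hash.  The helper name =resolve_ref= and the throwaway directory layout are assumptions made for this illustration; this is not wyag's own =ref_resolve=.

#+begin_src python
  import os
  import tempfile

  def resolve_ref(gitdir, ref):
      """Follow 'ref: ...' indirections until a hash is found.

      Hypothetical helper for illustration only."""
      path = os.path.join(gitdir, ref)
      if not os.path.isfile(path):
          return None
      with open(path) as fp:
          data = fp.read().strip()
      if data.startswith("ref: "):
          # Indirect reference: resolve the ref it points to.
          return resolve_ref(gitdir, data[len("ref: "):])
      # Direct reference: the file contains the hash itself.
      return data

  # Build a throwaway directory that mimics the layout of .git
  gitdir = tempfile.mkdtemp()
  os.makedirs(os.path.join(gitdir, "refs", "heads"))
  sha = "6c22393f5e3830d15395fd8d2f8b0cf8eb40dd58"
  with open(os.path.join(gitdir, "refs", "heads", "master"), "w") as fp:
      fp.write(sha + "\n")

  # HEAD on a branch: an indirect ref
  with open(os.path.join(gitdir, "HEAD"), "w") as fp:
      fp.write("ref: refs/heads/master\n")
  attached = resolve_ref(gitdir, "HEAD")

  # Detached HEAD: a direct ref
  with open(os.path.join(gitdir, "HEAD"), "w") as fp:
      fp.write(sha + "\n")
  detached = resolve_ref(gitdir, "HEAD")

  print(attached == detached)
#+end_src

Both forms end up at the same commit; the only difference is whether =HEAD= names a branch along the way.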
2032 | 
2033 | ** Referring to objects: the =object_find= function
2034 | :PROPERTIES:
2035 | :CUSTOM_ID: object_find
2036 | :END:
2037 | 
2038 | *** Resolving names
2039 | 
2040 | Remember when we created [[placeholder-object_find][the stupid =object_find= function]] that would
2041 | take four arguments, return the second unmodified and ignore the other
2042 | three?  It's time to replace it with something more useful.  We're going
2043 | to implement a small, but usable, subset of the actual Git name
2044 | resolution algorithm.  The new =object_find()= will work in two steps:
2045 | first, given a name, it will return a complete SHA-1 hash.  For
2046 | example, with =HEAD=, it will return the hash of the head commit of the
2047 | current branch, etc.  More precisely, this name resolution function
2048 | will work like this:
2049 | 
2050 | - If =name= is =HEAD=, it will just resolve =.git/HEAD=;
2051 | - If =name= is a full hash, this hash is returned unmodified.
2052 | - If =name= looks like a short hash, it will collect objects whose full
2053 |   hash begins with this short hash.
2054 | - Finally, it will resolve tags and branches matching =name=.
2055 | 
2056 | Notice how the last two steps /collect/ values: the first two are
2057 | absolute references, so we can safely return a result.  But short
2058 | hashes or branch names can be ambiguous, so we want to enumerate all
2059 | possible meanings of the name and raise an error if we've found more
2060 | than one.
2061 | 
2062 | #+begin_info
2063 |   *Short hashes*
2064 | 
2065 |   For convenience, Git lets you refer to hashes by a prefix of their
2066 |   name.  For example, =5bd254aa973646fa16f66d702a5826ea14a3eb45= can
2067 |   be referred to as =5bd254=.  This is called a "short hash".
2068 | #+end_info
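
As a small illustration of why short hashes can be ambiguous, the sketch below matches a prefix against a list of known hashes.  The function name =match_short_hash= and the padded hashes are made up for this example; the real lookup, shown next, walks the =objects= directory instead of a list.

#+begin_src python
  def match_short_hash(prefix, known_hashes):
      """Return every known hash that starts with prefix (hypothetical helper)."""
      prefix = prefix.lower()
      return [h for h in known_hashes if h.startswith(prefix)]

  known = [
      "5bd254aa973646fa16f66d702a5826ea14a3eb45",
      "5bd2" + "0" * 36,      # made-up hash sharing the first four digits
      "e0695f" + "1" * 34,    # made-up, unrelated hash
  ]

  unique = match_short_hash("5bd254", known)   # one candidate: unambiguous
  ambiguous = match_short_hash("5bd2", known)  # two candidates: ambiguous
  missing = match_short_hash("abcd", known)    # no candidate at all

  print(len(unique), len(ambiguous), len(missing))
#+end_src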
2069 | 
2070 | #+BEGIN_SRC python :tangle libwyag.py
2071 |   def object_resolve(repo, name):
2072 |       """Resolve name to an object hash in repo.
2073 | 
2074 |       This function is aware of:
2075 | 
2076 |       - the HEAD literal
2077 |       - short and long hashes
2078 |       - tags
2079 |       - branches
2080 |       - remote branches"""
2081 |       candidates = list()
2082 |       hashRE = re.compile(r"^[0-9A-Fa-f]{4,40}$")
2083 | 
2084 |       # Empty string?  Abort.
2085 |       if not name.strip():
2086 |           return None
2087 | 
2088 |       # Head is nonambiguous
2089 |       if name == "HEAD":
2090 |           return [ ref_resolve(repo, "HEAD") ]
2091 | 
2092 |       # If it's a hex string, try for a hash.
2093 |       if hashRE.match(name):
2094 |           # This may be a hash, either small or full.  4 seems to be the
2095 |           # minimal length for git to consider something a short hash.
2096 |           # This limit is documented in man git-rev-parse
2097 |           name = name.lower()
2098 |           prefix = name[0:2]
2099 |           path = repo_dir(repo, "objects", prefix, mkdir=False)
2100 |           if path:
2101 |               rem = name[2:]
2102 |               for f in os.listdir(path):
2103 |                   if f.startswith(rem):
2104 |                       # Notice a string startswith() itself, so this
2105 |                       # works for full hashes.
2106 |                       candidates.append(prefix + f)
2107 | 
2108 |       # Try for references.
2109 |       as_tag = ref_resolve(repo, "refs/tags/" + name)
2110 |       if as_tag: # Did we find a tag?
2111 |           candidates.append(as_tag)
2112 | 
2113 |       as_branch = ref_resolve(repo, "refs/heads/" + name)
2114 |       if as_branch: # Did we find a branch?
2115 |           candidates.append(as_branch)
2116 | 
2117 |       as_remote_branch = ref_resolve(repo, "refs/remotes/" + name)
2118 |       if as_remote_branch: # Did we find a remote branch?
2119 |           candidates.append(as_remote_branch)
2120 | 
2121 |       return candidates
2122 | #+END_SRC
2123 | 
2124 | The second step is to follow the object we found to an object of the
2125 | required type, if a type argument was provided.  Since we only need to
2126 | handle trivial cases, this is a very simple iterative process:
2127 | 
2128 | - If we have a tag and =fmt= is anything else, we follow the tag.
2129 | - If we have a commit and =fmt= is tree, we return this commit's tree
2130 |   object.
2131 | - In all other situations, we bail out: nothing else makes sense.
2132 | 
2133 | (The process is iterative because it may take an arbitrary number of
2134 | steps, since tags themselves can be tagged.)
2135 | 
2136 | #+BEGIN_SRC python :tangle libwyag.py
2137 |   def object_find(repo, name, fmt=None, follow=True):
2138 |       sha = object_resolve(repo, name)
2139 | 
2140 |       if not sha:
2141 |           raise Exception(f"No such reference {name}.")
2142 | 
2143 |       if len(sha) > 1:
2144 |           raise Exception("Ambiguous reference {0}: Candidates are:\n - {1}.".format(name, "\n - ".join(sha)))
2145 | 
2146 |       sha = sha[0]
2147 | 
2148 |       if not fmt:
2149 |           return sha
2150 | 
2151 |       while True:
2152 |           obj = object_read(repo, sha)
2153 |           #     ^^^^^^^^^^^ < this is a bit aggressive: we're reading
2154 |           # the full object just to get its type.  And we're doing
2155 |           # that in a loop, albeit normally short.  Don't expect
2156 |           # high performance here.
2157 | 
2158 |           if obj.fmt == fmt:
2159 |               return sha
2160 | 
2161 |           if not follow:
2162 |               return None
2163 | 
2164 |           # Follow tags
2165 |           if obj.fmt == b'tag':
2166 |               sha = obj.kvlm[b'object'].decode("ascii")
2167 |           elif obj.fmt == b'commit' and fmt == b'tree':
2168 |               sha = obj.kvlm[b'tree'].decode("ascii")
2169 |           else:
2170 |               return None
2171 | #+END_SRC
2172 | 
2173 | With the new =object_find()=, the CLI wyag becomes a bit more usable.  You can now do things like:
2174 | 
2175 | #+begin_example
2176 | $ wyag checkout v3.11 # A tag
2177 | $ wyag checkout feature/explosions # A branch
2178 | $ wyag ls-tree -r HEAD # The active branch or commit.  There's also a
2179 |                        # follow here: HEAD is actually a commit.
2180 | $ wyag cat-file blob e0695f # A short hash
2181 | $ wyag cat-file tree master # A branch, as a tree (another "follow")
2182 | #+end_example
2183 | 
2184 | *** The rev-parse command
2185 | :PROPERTIES:
2186 | :CUSTOM_ID: cmd-rev-parse
2187 | :END:
2188 | 
2189 | Let's implement =wyag rev-parse=.  The =git rev-parse= command does a
2190 | lot, but one of its use cases, the one we're going to clone, is
2191 | resolving references.  For the purpose of further testing the "follow"
2192 | feature of =object_find=, we'll add an optional =--wyag-type= argument
2193 | to its interface.
2194 | 
2195 | #+BEGIN_SRC python :tangle libwyag.py
2196 |   argsp = argsubparsers.add_parser(
2197 |       "rev-parse",
2198 |       help="Parse revision (or other objects) identifiers")
2199 | 
2200 |   argsp.add_argument("--wyag-type",
2201 |                      metavar="type",
2202 |                      dest="type",
2203 |                      choices=["blob", "commit", "tag", "tree"],
2204 |                      default=None,
2205 |                      help="Specify the expected type")
2206 | 
2207 |   argsp.add_argument("name",
2208 |                      help="The name to parse")
2209 | #+END_SRC
2210 | 
2211 | The bridge function does all the work:
2212 | 
2213 | #+BEGIN_SRC python :tangle libwyag.py
2214 |   def cmd_rev_parse(args):
2215 |       if args.type:
2216 |           fmt = args.type.encode()
2217 |       else:
2218 |           fmt = None
2219 | 
2220 |       repo = repo_find()
2221 | 
2222 |       print(object_find(repo, args.name, fmt, follow=True))
2223 | #+END_SRC
2224 | 
2225 | And it works:
2226 | 
2227 | #+begin_example
2228 | $ wyag rev-parse --wyag-type commit HEAD
2229 | 6c22393f5e3830d15395fd8d2f8b0cf8eb40dd58
2230 | $ wyag rev-parse --wyag-type tree HEAD
2231 | 11d33fad71dbac72840aff1447e0d080c7484361
2232 | $ wyag rev-parse --wyag-type tag HEAD
2233 | None
2234 | #+end_example
2235 | 
2236 | * Working with the staging area and the index file
2237 | :PROPERTIES:
2238 | :CUSTOM_ID: staging-area
2239 | :END:
2240 | 
2241 | ** What's the index file?
2242 | :PROPERTIES:
2243 | :CUSTOM_ID: staging-intro
2244 | :END:
2245 | 
2246 | This final step will bring us to where commits happen (although
2247 | actually creating them is for the next section!)
2248 | 
2249 | You probably know that to commit in Git, you first "stage" some
2250 | changes, using =git add= and =git rm=, and only /then/ do you commit
2251 | those changes.  This intermediate stage between the last and the next
2252 | commit is called the *staging area*.
2253 | 
2254 | It would seem natural to use a commit or tree object to represent the
2255 | staging area, but Git actually uses a completely different
2256 | mechanism, in the form of what it calls the *index file*.
2257 | 
2258 | After a commit, the index file is a sort of copy of that commit: it
2259 | holds the same path/blob association as the corresponding tree.  But
2260 | it also holds extra information about files in the worktree, like
2261 | their creation/modification time, so =git status= doesn't often need
2262 | to actually compare files: it just checks that their modification time
2263 | is the same as the one stored in the index file, and only if it isn't
2264 | does it perform an actual comparison.
2265 | 
2266 | You can thus consider the index file as a three-way association list:
2267 | not only paths with blobs, but also paths with actual filesystem
2268 | entries.
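
The timestamp shortcut described above can be sketched as follows.  This is only an illustration of the idea, assuming a hypothetical =needs_content_check= helper and a modification time recorded earlier; it is not how Git or wyag actually implement =status=.

#+begin_src python
  import os
  import tempfile

  def needs_content_check(path, recorded_mtime_ns):
      """Return True when the file's mtime differs from the recorded one.

      Hypothetical helper: only in that case would a real comparison
      of the file contents be worth doing."""
      return os.stat(path).st_mtime_ns != recorded_mtime_ns

  # Create a scratch file and remember its modification time, the way
  # an index entry would.
  fd, path = tempfile.mkstemp()
  os.close(fd)
  recorded = os.stat(path).st_mtime_ns

  unchanged = needs_content_check(path, recorded)

  # Simulate a later edit by pushing the modification time 10 seconds forward.
  os.utime(path, ns=(recorded, recorded + 10_000_000_000))
  changed = needs_content_check(path, recorded)

  os.remove(path)
  print(unchanged, changed)
#+end_src

A matching timestamp lets the expensive comparison be skipped; a mismatch only means the file /may/ have changed, so the contents still have to be checked.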
2269 | 
2270 | Another important characteristic of the *index file* is that unlike a
2271 | tree, it can represent inconsistent states, like a merge conflict,
2272 | whereas a tree is always a complete, unambiguous representation.
2273 | 
2274 | When you commit, what git actually does is turn the index file into a
2275 | new tree object.  To summarize:
2276 | 
2277 |  1. When the repository is “clean”, the index file holds the exact
2278 |     same contents as the HEAD commit, plus metadata about the
2279 |     corresponding filesystem entries.  For instance, it may contain
2280 |     something like:
2281 | 
2282 |     #+begin_quote
2283 |     There's a file called =src/disp.c= whose contents are blob
2284 |     797441c76e59e28794458b39b0f1eff4c85f4fa0.  The real =src/disp.c=
2285 |     file, in the worktree, was created on 2023-07-15
2286 |     15:28:29.168572151, and last modified 2023-07-15
2287 |     15:28:29.1689427709.  It is stored on device 65026, inode 8922881.
2288 |     #+end_quote
2289 | 
2290 |  2. When you =git add= or =git rm=, the index file is modified
2291 |     accordingly.  In the example above, if you modify =src/disp.c=,
2292 |     and =add= your changes, the index file will be updated with a new
2293 |     blob ID (the blob itself will also be created in the process, of
2294 |     course), and the various file metadata will be updated as well so
2295 |     =git status= knows when not to compare file contents.
2296 | 
2297 |  3. When you =git commit= those changes, a new tree is produced from
2298 |     the index file, a new commit object is generated with that tree,
2299 |     branches are updated and we're done.
2300 | 
2301 | #+begin_note
2302 |   *A note on words*
2303 | 
2304 |   The staging area and the index are thus the same thing, but the name
2305 |   "staging area" is more the name of the git user-exposed feature
2306 |   (that could have been implemented otherwise), the abstraction if you
2307 |   will; while "index file" refers specifically to the way this
2308 |   abstract feature is actually implemented in git.
2309 | #+end_note
2310 | 
2311 | ** Parsing the index
2312 | :PROPERTIES:
2313 | :CUSTOM_ID: index_read
2314 | :END:
2315 | 
2316 | The index file is by far the most complicated piece of data a Git
2317 | repository can hold.  Its complete documentation can be found in Git
2318 | source tree or rendered [[https://git-scm.com/docs/index-format][on the git website]].  It's made of three parts:
2319 | 
2320 |   - A header with the format version number and the number of entries
2321 |     the index holds;
2322 |   - A series of entries, sorted, each representing a file; padded to
2323 |     a multiple of 8 bytes.
2324 |   - A series of optional extensions, which we'll ignore.
2325 | # @FIXME ^ Sorted how? Do we need to think about this?
2326 | 
2327 | The first thing we need to represent is a single entry.  It actually
2328 | holds quite a lot of stuff; I'm leaving the details in comments.
2329 | It's worth observing that an entry stores *both* the SHA-1 of the
2330 | associated blob in the object store /and/ a ton of metadata about the
2331 | actual file on the actual filesystem.  Again, this is because
2332 | =git/wyag status= will need to determine which files in the index were
2333 | modified: it is much more efficient to begin by checking the
2334 | last-modified timestamp and comparing it with a known value, before
2335 | comparing actual files.
2336 | 
2337 | #+begin_src python :tangle libwyag.py
2338 |   class GitIndexEntry (object):
2339 |       def __init__(self, ctime=None, mtime=None, dev=None, ino=None,
2340 |                    mode_type=None, mode_perms=None, uid=None, gid=None,
2341 |                    fsize=None, sha=None, flag_assume_valid=None,
2342 |                    flag_stage=None, name=None):
2343 |           # The last time a file's metadata changed.  This is a pair
2344 |           # (timestamp in seconds, nanoseconds)
2345 |           self.ctime = ctime
2346 |           # The last time a file's data changed.  This is a pair
2347 |           # (timestamp in seconds, nanoseconds)
2348 |           self.mtime = mtime
2349 |           # The ID of device containing this file
2350 |           self.dev = dev
2351 |           # The file's inode number
2352 |           self.ino = ino
2353 |           # The object type, either b1000 (regular), b1010 (symlink),
2354 |           # b1110 (gitlink).
2355 |           self.mode_type = mode_type
2356 |           # The object permissions, an integer.
2357 |           self.mode_perms = mode_perms
2358 |           # User ID of owner
2359 |           self.uid = uid
2360 |           # Group ID of owner
2361 |           self.gid = gid
2362 |           # Size of this object, in bytes
2363 |           self.fsize = fsize
2364 |           # The object's SHA
2365 |           self.sha = sha
2366 |           self.flag_assume_valid = flag_assume_valid
2367 |           self.flag_stage = flag_stage
2368 |           # Name of the object (full path this time!)
2369 |           self.name = name
2370 | #+end_src
2371 | 
2372 | The index file is a binary file, likely for performance reasons.  The
2373 | format is reasonably simple, though.  It begins with a header with the
2374 | =DIRC= magic bytes, a version number and the total number of entries
2375 | in that index file.  We create the =GitIndex= class to hold them:
2376 | 
2377 | #+BEGIN_SRC python :tangle libwyag.py
2378 |   class GitIndex (object):
2379 |       version = None
2380 |       entries = []
2381 |       # ext = None
2382 |       # sha = None
2383 | 
2384 |       def __init__(self, version=2, entries=None):
2385 |           if not entries:
2386 |               entries = list()
2387 | 
2388 |           self.version = version
2389 |           self.entries = entries
2390 | #+END_SRC
2391 | 
2392 | And a parser to read index files into those objects.  After reading
2393 | the 12-byte header, we just parse entries in the order they appear.
2394 | An entry begins with a set of fixed-length data, followed by a
2395 | variable-length name.
2396 | 
2397 | The code is quite straightforward, but as it's reading a binary
2398 | format, it feels messier than what we did so far.  We use
2399 | =int.from_bytes(bytes, endianness)= a lot to read raw bytes into an
2400 | integer, and just a few bitwise operations to separate data
2401 | that share the same byte.
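
As a tiny worked example of those two techniques, here is how a two-byte mode field could be decoded.  The sample bytes are chosen for illustration (they encode a regular file with permissions 644); the shift and mask are the same ones the parser uses.

#+begin_src python
  # Two raw bytes, as they might appear in the mode field of an index entry.
  raw = bytes([0x81, 0xA4])

  # Big-endian: the first byte is the most significant one.
  mode = int.from_bytes(raw, "big")       # 0x81A4, i.e. 0o100644

  # The top four bits hold the object type...
  mode_type = mode >> 12                  # 0b1000: regular file
  # ...and the bottom nine bits hold the Unix permissions.
  mode_perms = mode & 0b0000000111111111  # 0o644

  print(f"{mode:o} {mode_type:04b} {mode_perms:o}")
#+end_src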
2402 | 
2403 | (This format was probably designed so index files could just be
2404 | mapped to memory with =mmap()=, and read directly as C structs, with an index
2405 | built in O(n) time in most cases.  This kind of approach tends to
2406 | produce more elegant code in C than in Python…)
2407 | 
2408 | #+BEGIN_SRC python :tangle libwyag.py
2409 |   def index_read(repo):
2410 |       index_file = repo_file(repo, "index")
2411 | 
2412 |       # New repositories have no index!
2413 |       if not os.path.exists(index_file):
2414 |           return GitIndex()
2415 | 
2416 |       with open(index_file, 'rb') as f:
2417 |           raw = f.read()
2418 | 
2419 |       header = raw[:12]
2420 |       signature = header[:4]
2421 |       assert signature == b"DIRC" # Stands for "DirCache"
2422 |       version = int.from_bytes(header[4:8], "big")
2423 |       assert version == 2, "wyag only supports index file version 2"
2424 |       count = int.from_bytes(header[8:12], "big")
2425 | 
2426 |       entries = list()
2427 | 
2428 |       content = raw[12:]
2429 |       idx = 0
2430 |       for i in range(0, count):
2431 |           # Read ctime (the last time the file's metadata changed), as a
2432 |           # unix timestamp (seconds since 1970-01-01 00:00:00, the "epoch")
2433 |           ctime_s = int.from_bytes(content[idx: idx+4], "big")
2434 |           # Read ctime again, as nanoseconds after that timestamp,
2435 |           # for extra precision.
2436 |           ctime_ns = int.from_bytes(content[idx+4: idx+8], "big")
2437 |           # Same for modification time: first seconds from epoch.
2438 |           mtime_s = int.from_bytes(content[idx+8: idx+12], "big")
2439 |           # Then extra nanoseconds
2440 |           mtime_ns = int.from_bytes(content[idx+12: idx+16], "big")
2441 |           # Device ID
2442 |           dev = int.from_bytes(content[idx+16: idx+20], "big")
2443 |           # Inode
2444 |           ino = int.from_bytes(content[idx+20: idx+24], "big")
2445 |           # Ignored.
2446 |           unused = int.from_bytes(content[idx+24: idx+26], "big")
2447 |           assert 0 == unused
2448 |           mode = int.from_bytes(content[idx+26: idx+28], "big")
2449 |           mode_type = mode >> 12
2450 |           assert mode_type in [0b1000, 0b1010, 0b1110]
2451 |           mode_perms = mode & 0b0000000111111111
2452 |           # User ID
2453 |           uid = int.from_bytes(content[idx+28: idx+32], "big")
2454 |           # Group ID
2455 |           gid = int.from_bytes(content[idx+32: idx+36], "big")
2456 |           # Size
2457 |           fsize = int.from_bytes(content[idx+36: idx+40], "big")
2458 |           # SHA (object ID).  We'll store it as a lowercase hex string
2459 |           # for consistency.
2460 |           sha = format(int.from_bytes(content[idx+40: idx+60], "big"), "040x")
2461 |           # Flags we're going to ignore
2462 |           flags = int.from_bytes(content[idx+60: idx+62], "big")
2463 |           # Parse flags
2464 |           flag_assume_valid = (flags & 0b1000000000000000) != 0
2465 |           flag_extended = (flags & 0b0100000000000000) != 0
2466 |           assert not flag_extended
2467 |           flag_stage = flags & 0b0011000000000000
2468 |           # Length of the name.  This is stored on 12 bits, so the max
2469 |           # value is 0xFFF, 4095.  Since names can occasionally go
2470 |           # beyond that length, git treats 0xFFF as meaning at least
2471 |           # 0xFFF, and looks for the final 0x00 to find the end of the
2472 |           # name --- at a small, and probably very rare, performance
2473 |           # cost.
2474 |           name_length = flags & 0b0000111111111111
2475 | 
2476 |           # We've read 62 bytes so far.
2477 |           idx += 62
2478 | 
2479 |           if name_length < 0xFFF:
2480 |               assert content[idx + name_length] == 0x00
2481 |               raw_name = content[idx:idx+name_length]
2482 |               idx += name_length + 1
2483 |           else:
2484 |               print(f"Notice: Name is 0x{name_length:X} bytes long.")
2485 |               # This probably wasn't tested enough.  It works with a
2486 |               # path of exactly 0xFFF bytes.  Any extra bytes broke
2487 |               # something between git, my shell and my filesystem.
2488 |               null_idx = content.find(b'\x00', idx + 0xFFF)
2489 |               raw_name = content[idx: null_idx]
2490 |               idx = null_idx + 1
2491 | 
2492 |           # Just parse the name as utf8.
2493 |           name = raw_name.decode("utf8")
2494 | 
2495 |           # Data is padded on multiples of eight bytes for pointer
2496 |           # alignment, so we skip as many bytes as we need for the next
2497 |           # read to start at the right position.
2498 | 
2499 |           idx = 8 * ceil(idx / 8)
2500 | 
2501 |           # And we add this entry to our list.
2502 |           entries.append(GitIndexEntry(ctime=(ctime_s, ctime_ns),
2503 |                                        mtime=(mtime_s,  mtime_ns),
2504 |                                        dev=dev,
2505 |                                        ino=ino,
2506 |                                        mode_type=mode_type,
2507 |                                        mode_perms=mode_perms,
2508 |                                        uid=uid,
2509 |                                        gid=gid,
2510 |                                        fsize=fsize,
2511 |                                        sha=sha,
2512 |                                        flag_assume_valid=flag_assume_valid,
2513 |                                        flag_stage=flag_stage,
2514 |                                        name=name))
2515 | 
2516 |       return GitIndex(version=version, entries=entries)
2517 | #+END_SRC
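
The padding arithmetic at the end of the loop is easy to get wrong, so here is the same computation in isolation.  This is a sketch under the same assumptions as the parser above (version 2 entries, names shorter than 0xFFF bytes, offsets counted from the end of the 12-byte header); =next_entry_offset= is a hypothetical name.

#+begin_src python
  from math import ceil

  def next_entry_offset(entry_start, name_length):
      """Offset of the entry that follows one starting at entry_start.

      62 bytes of fixed-length fields, then the name, then its
      terminating NUL, rounded up to a multiple of eight."""
      end = entry_start + 62 + name_length + 1
      return 8 * ceil(end / 8)

  one_char = next_entry_offset(0, 1)    # 64 bytes exactly: no padding needed
  two_chars = next_entry_offset(0, 2)   # 65 bytes, padded up to 72
  later = next_entry_offset(64, 10)     # ends at 137, padded up to 144

  print(one_char, two_chars, later)
#+end_src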
2518 | 
2519 | ** The ls-files command
2520 | :PROPERTIES:
2521 | :CUSTOM_ID: cmd-ls-files
2522 | :END:
2523 | 
2524 | =git ls-files= displays the names of files in the staging area, with,
2525 | as usual, a ton of options.  Our =ls-files= will be much simpler,
2526 | /but/ we'll add a =--verbose= option that doesn't exist in git, just
2527 | so we can display every single bit of info in the index file.
2528 | 
2529 | #+BEGIN_SRC python :tangle libwyag.py
2530 |   argsp = argsubparsers.add_parser("ls-files", help = "List all the staged files")
2531 |   argsp.add_argument("--verbose", action="store_true", help="Show everything.")
2532 | 
2533 |   def cmd_ls_files(args):
2534 |       repo = repo_find()
2535 |       index = index_read(repo)
2536 |       if args.verbose:
2537 |           print(f"Index file format v{index.version}, containing {len(index.entries)} entries.")
2538 | 
2539 |       for e in index.entries:
2540 |           print(e.name)
2541 |           if args.verbose:
2542 |               entry_type = { 0b1000: "regular file",
2543 |                              0b1010: "symlink",
2544 |                              0b1110: "git link" }[e.mode_type]
2545 |               print(f"  {entry_type} with perms: {e.mode_perms:o}")
2546 |               print(f"  on blob: {e.sha}")
2547 |               print(f"  created: {datetime.fromtimestamp(e.ctime[0])}.{e.ctime[1]}, modified: {datetime.fromtimestamp(e.mtime[0])}.{e.mtime[1]}")
2548 |               print(f"  device: {e.dev}, inode: {e.ino}")
2549 |               print(f"  user: {pwd.getpwuid(e.uid).pw_name} ({e.uid})  group: {grp.getgrgid(e.gid).gr_name} ({e.gid})")
2550 |               print(f"  flags: stage={e.flag_stage} assume_valid={e.flag_assume_valid}")
2551 | #+END_SRC
2552 | 
2553 | If you run ls-files, you'll notice that on a “clean” worktree (an
2554 | unmodified checkout of =HEAD=), it lists all files on =HEAD=.  Again,
2555 | the index is not a /delta/ (a set of differences) from the =HEAD=
2556 | commit, but starts as a copy of it, in a different format.
2557 | 
2558 | ** A detour: the check-ignore command
2559 | :PROPERTIES:
2560 | :CUSTOM_ID: cmd-check-ignore
2561 | :END:
2562 | 
2563 | We want to write =status=, but =status= needs to know about ignore
2564 | rules, which are stored in the various =.gitignore= files.  So we first
2565 | need to add some rudimentary support for ignore files in =wyag=.
2566 | We'll expose this support as the =check-ignore= command, which takes a
2567 | list of paths and prints those that should be
2568 | ignored.
2569 | 
2570 | Again, the command parser is trivial:
2571 | 
2572 | #+BEGIN_SRC python :tangle libwyag.py
2573 |   argsp = argsubparsers.add_parser("check-ignore", help = "Check path(s) against ignore rules.")
2574 |   argsp.add_argument("path", nargs="+", help="Paths to check")
2575 | #+END_src
2576 | 
2577 | And the function is just as simple:
2578 | 
2579 | #+BEGIN_SRC python :tangle libwyag.py
2580 |   def cmd_check_ignore(args):
2581 |       repo = repo_find()
2582 |       rules = gitignore_read(repo)
2583 |       for path in args.path:
2584 |           if check_ignore(rules, path):
2585 |               print(path)
2586 | #+END_src
2587 | 
2588 | But of course, most of the functions we call don't exist yet in wyag.
2589 | We'll begin by writing a reader for rules in ignore files,
2590 | =gitignore_read()=.  The syntax of those rules is quite simple: each
2591 | line in an ignore file is an exclusion pattern, so files that match
2592 | this pattern are ignored by =status=, =add -A= and so on.  There are
2593 | three special cases, though:
2594 | 
2595 |  1. Lines that begin with an exclamation mark =!= /negate/ the pattern
2596 |     (files that match this pattern are /included/, even if they were
2597 |     ignored by an earlier pattern).
2598 |  2. Lines that begin with a hash =#= are comments, and are skipped.
2599 |  3. A backslash =\= at the beginning treats =!= and =#= as literal
2600 |     characters.
2601 | 
2602 | First, a parser for a single pattern. This parser returns a pair: the
2603 | pattern itself, and a boolean to indicate if files matching the
2604 | pattern /should/ be excluded (=True=) or included (=False=).  In other
2605 | words, =False= if the pattern did start with =!=, =True= otherwise.
2606 | 
2607 | #+begin_src python :tangle libwyag.py
2608 |   def gitignore_parse1(raw):
2609 |       raw = raw.strip() # Remove leading/trailing spaces
2610 | 
2611 |       if not raw or raw[0] == "#":
2612 |           return None
2613 |       elif raw[0] == "!":
2614 |           return (raw[1:], False)
2615 |       elif raw[0] == "\\":
2616 |           return (raw[1:], True)
2617 |       else:
2618 |           return (raw, True)
2619 | #+end_src
2620 | 
2621 | Parsing a file is just collecting all rules in that file.  Notice this
2622 | function doesn't parse /files/, but just lists of lines: that's
2623 | because we'll need to read rules from git blobs as well, and not just
2624 | regular files.
2625 | 
2626 | #+begin_src python :tangle libwyag.py
2627 |   def gitignore_parse(lines):
2628 |       ret = list()
2629 | 
2630 |       for line in lines:
2631 |           parsed = gitignore_parse1(line)
2632 |           if parsed:
2633 |               ret.append(parsed)
2634 | 
2635 |       return ret
2636 | #+end_src
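
To give an idea of how such a list of =(pattern, bool)= pairs might be consumed, here is a minimal sketch in which the last matching rule wins.  It uses Python's =fnmatch= as a stand-in for Git's pattern matching, which is a simplification (for instance, =fnmatch= lets =*= match a =/=); the name =match_rules= is an assumption for this example.

#+begin_src python
  from fnmatch import fnmatch

  def match_rules(rules, path):
      """Apply (pattern, bool) rules in order; the last match decides.

      Returns True (ignored), False (explicitly included) or None
      (no rule matched).  Hypothetical helper for illustration."""
      result = None
      for pattern, value in rules:
          if fnmatch(path, pattern):
              result = value
      return result

  # The kind of list gitignore_parse would produce for the two lines
  # "*.html" and "!index.html".
  rules = [("*.html", True), ("index.html", False)]

  ignored = match_rules(rules, "about.html")
  included = match_rules(rules, "index.html")
  unmatched = match_rules(rules, "README.org")

  print(ignored, included, unmatched)
#+end_src

Keeping a third, "no opinion" result is what lets several rule sets be consulted in turn until one of them has something to say about the path.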
2637 | 
2638 | Last thing we need to do is collect the various ignore files.  They
2639 | come in two kinds:
2640 | 
2641 |  - Some of these files *live in the index*: they're the various
2642 |    =.gitignore= files.  Emphasis on the plural; although there often is
2643 |    only one such file, at the root, there can be one in each
2644 |    directory, and it applies to this directory and its subdirectories.
2645 |    I'll call those *scoped*, because they only apply to paths under
2646 |    their directory.
2647 |  - The others live *outside the index*.  They're the global ignore
2648 |    file (usually in =~/.config/git/ignore=) and the
2649 |    repository-specific =.git/info/exclude=.  I call those *absolute*,
2650 |    because they apply everywhere, but at a lower priority.
2651 | 
2652 | Again, a class to hold that: a list of absolute rules, a dict
2653 | (hashmap) of scoped rules.  The keys to this hashmap are
2654 | *directories*, relative to the root of a worktree.
2655 | 
2656 | #+begin_src python :tangle libwyag.py
2657 |   class GitIgnore(object):
2658 |       absolute = None
2659 |       scoped = None
2660 | 
2661 |       def __init__(self, absolute, scoped):
2662 |           self.absolute = absolute
2663 |           self.scoped = scoped
2664 | #+end_src
2665 | 
2666 | And finally our function to collect all gitignore rules in a
2667 | repository, and return a =GitIgnore= object.  Notice how it reads
2668 | scoped files from the index, and not the worktree: only /staged/
2669 | =.gitignore= files matter (also remember: HEAD is /already/ staged ---
2670 | the staging area is a copy, not a delta).
2671 | 
2672 | #+begin_src python :tangle libwyag.py
2673 |   def gitignore_read(repo):
2674 |       ret = GitIgnore(absolute=list(), scoped=dict())
2675 | 
2676 |       # Read local configuration in .git/info/exclude
2677 |       repo_file = os.path.join(repo.gitdir, "info/exclude")
2678 |       if os.path.exists(repo_file):
2679 |           with open(repo_file, "r") as f:
2680 |               ret.absolute.append(gitignore_parse(f.readlines()))
2681 | 
2682 |       # Global configuration
2683 |       if "XDG_CONFIG_HOME" in os.environ:
2684 |           config_home = os.environ["XDG_CONFIG_HOME"]
2685 |       else:
2686 |           config_home = os.path.expanduser("~/.config")
2687 |       global_file = os.path.join(config_home, "git/ignore")
2688 | 
2689 |       if os.path.exists(global_file):
2690 |           with open(global_file, "r") as f:
2691 |               ret.absolute.append(gitignore_parse(f.readlines()))
2692 | 
2693 |       # .gitignore files in the index
2694 |       index = index_read(repo)
2695 | 
2696 |       for entry in index.entries:
2697 |           if entry.name == ".gitignore" or entry.name.endswith("/.gitignore"):
2698 |               dir_name = os.path.dirname(entry.name)
2699 |               contents = object_read(repo, entry.sha)
2700 |               lines = contents.blobdata.decode("utf8").splitlines()
2701 |               ret.scoped[dir_name] = gitignore_parse(lines)
2702 |       return ret
2703 | #+end_src
2704 | 
2705 | We're almost there.  To tie everything together, we need the
2706 | =check_ignore= function that matches a path, relative to the root of a
2707 | worktree, against a set of rules.  This is how this function will
2708 | work:
2709 | 
2710 |  - It will first try to match this path against the *scoped* rules.
2711 |    It will do this from the deepest parent of the path to the
2712 |    farthest.  That is, if the path is
2713 |    =src/support/w32/legacy/sound.c~=, it will first look for rules in
2714 |    =src/support/w32/legacy/.gitignore=, then
2715 |    =src/support/w32/.gitignore=, =src/support/.gitignore=, and so on
2716 |    up to simply =.gitignore= at the root.
2717 |  - If nothing matches, it will continue with the *absolute* rules.
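
The lookup order for scoped rules can be sketched on its own.  The
helper name =ignore_lookup_order= is hypothetical (it isn't part of
wyag); it only lists which =.gitignore= files would be consulted, most
specific first, and uses =posixpath= so the output is the same on
every platform.

#+begin_src python
import posixpath

def ignore_lookup_order(path):
    """List the .gitignore files consulted for path, deepest first."""
    order = []
    parent = posixpath.dirname(path)
    while True:
        order.append(posixpath.join(parent, ".gitignore"))
        if parent == "":
            break                    # we just handled the root
        parent = posixpath.dirname(parent)
    return order

for candidate in ignore_lookup_order("src/support/w32/legacy/sound.c~"):
    print(candidate)
#+end_src

This prints the five files from the example above, ending with the
root =.gitignore=.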
2718 | 
2719 | We write three small support functions.  One to match a path against a
2720 | set of rules, and return the result of the last matching rule.  Notice
2721 | how it's not a real boolean function, since it has *three* possible
2722 | return values: =True=, =False= but also =None=.  It returns =None= if
2723 | nothing matched, so the caller knows it should continue trying with
2724 | more general ignore files (e.g., go one directory level up).
2725 | 
2726 | #+begin_src python :tangle libwyag.py
2727 |   def check_ignore1(rules, path):
2728 |       result = None
2729 |       for (pattern, value) in rules:
2730 |           if fnmatch(path, pattern):
2731 |               result = value
2732 |       return result
2733 | #+end_src
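
A quick way to convince yourself of the three possible results is to
run the function on a made-up rule list.  This sketch copies
=check_ignore1= so it is runnable as-is; the rules and file names are
only examples.

#+begin_src python
from fnmatch import fnmatch

def check_ignore1(rules, path):
    result = None
    for (pattern, value) in rules:
        if fnmatch(path, pattern):
            result = value   # a later match overrides an earlier one
    return result

# General rule first, exception second: order matters.
rules = [("*.c", True), ("generator.c", False)]

print(check_ignore1(rules, "parser.c"))      # True: ignored
print(check_ignore1(rules, "generator.c"))   # False: explicitly kept
print(check_ignore1(rules, "README"))        # None: no rule matched
#+end_src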
2734 | 
2735 | A function to match against the dictionary of *scoped* rules (the
2736 | various =.gitignore= files).  It just starts at the path's directory
2737 | then moves up to the parent directory, repeatedly, until it has
2738 | tested root.  Notice that this function (and the next two as well)
2739 | never breaks *inside* a given =.gitignore= file.  Even if a rule
2740 | matches, they keep going through the file, because another rule there
2741 | may reverse the effect (rules are processed in order, so if you
2742 | want to exclude =*.c= but not =generator.c=, the general rule must
2743 | come before the specific one).  But as soon as at least one rule
2744 | matched in a file, we drop the remaining files, because a more general
2745 | file never cancels the effect of a more specific one (this is why
2746 | =check_ignore1= is ternary and not boolean).
2747 | 
2748 | #+begin_src python :tangle libwyag.py
2749 |   def check_ignore_scoped(rules, path):
2750 |       parent = os.path.dirname(path)
2751 |       while True:
2752 |           if parent in rules:
2753 |               result = check_ignore1(rules[parent], path)
2754 |               if result is not None:
2755 |                   return result
2756 |           if parent == "":
2757 |               break
2758 |           parent = os.path.dirname(parent)
2759 |       return None
2760 | #+end_src
2761 | 
2762 | A much simpler function to match against the list of absolute rules.
2763 | Notice that the order we push those rules to the list matters (we
2764 | /did/ read the repository rules before the global ones!)
2765 | 
2766 | #+begin_src python :tangle libwyag.py
2767 |   def check_ignore_absolute(rules, path):
2768 |       # Rulesets are tried in insertion order: repository first, then global.
2769 |       for ruleset in rules:
2770 |           result = check_ignore1(ruleset, path)
2771 |           if result is not None:
2772 |               return result
2773 |       return False # This is a reasonable default at this point.
2774 | #+end_src
2775 | 
2776 | And finally, a function to bind them all.
2777 | 
2778 | #+begin_src python :tangle libwyag.py
2779 |   def check_ignore(rules, path):
2780 |       if os.path.isabs(path):
2781 |           raise Exception("This function requires path to be relative to the repository's root")
2782 | 
2783 |       result = check_ignore_scoped(rules.scoped, path)
2784 |       if result is not None:
2785 |           return result
2786 | 
2787 |       return check_ignore_absolute(rules.absolute, path)
2788 | #+end_src
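
To watch the precedence rules interact, here is a condensed,
self-contained version of the functions above, run against made-up
rules.  A =SimpleNamespace= stands in for the =GitIgnore= object, and
the absolute loop is folded into =check_ignore= to keep the sketch
short; the behaviour is meant to be the same, but treat it as an
illustration rather than a replacement.

#+begin_src python
import os
from fnmatch import fnmatch
from types import SimpleNamespace

def check_ignore1(rules, path):
    result = None
    for (pattern, value) in rules:
        if fnmatch(path, pattern):
            result = value
    return result

def check_ignore_scoped(rules, path):
    parent = os.path.dirname(path)
    while True:
        if parent in rules:
            result = check_ignore1(rules[parent], path)
            if result is not None:
                return result
        if parent == "":
            return None
        parent = os.path.dirname(parent)

def check_ignore(rules, path):
    result = check_ignore_scoped(rules.scoped, path)
    if result is not None:
        return result
    for ruleset in rules.absolute:
        result = check_ignore1(ruleset, path)
        if result is not None:
            return result
    return False

rules = SimpleNamespace(
    absolute=[[("*.log", True)]],                    # e.g. .git/info/exclude
    scoped={"": [("*.o", True)],                     # root .gitignore
            "vendor": [("*.o", False)]},             # vendor/.gitignore
)

print(check_ignore(rules, "main.o"))          # True: root .gitignore
print(check_ignore(rules, "vendor/blob.o"))   # False: deeper file wins
print(check_ignore(rules, "debug.log"))       # True: absolute rule
print(check_ignore(rules, "README"))          # False: nothing matched
#+end_src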
2789 | 
2790 | You can now call =wyag check-ignore=.  On its own source tree:
2791 | 
2792 | #+begin_example
2793 |   $ wyag check-ignore hello.el hello.elc hello.html wyag.zip wyag.tar
2794 |   hello.elc
2795 |   hello.html
2796 |   wyag.zip
2797 | #+end_example
2798 | 
2799 | #+begin_warning
2800 |   *This is only an approximation*
2801 | 
2802 |   This isn't a perfect reimplementation.  In particular, excluding
2803 |   whole directories with a rule that's only the directory name (e.g.
2804 |   =__pycache__=) won't work, because =fnmatch= would want the pattern
2805 |   as =__pycache__/**=.  If you really want to play with ignore rules,
2806 |   [[https://github.com/mherrmann/gitignore_parser][this may be a good
2807 |   starting point]].
2808 | #+end_warning
2809 | 
2810 | ** The status command
2811 | :PROPERTIES:
2812 | :CUSTOM_ID: cmd-status
2813 | :END:
2814 | 
2815 | =status= is more complex than =ls-files=, because it needs to compare
2816 | the index with both HEAD /and/ the actual filesystem.  You call =git
2817 | status= to know which files were added, removed or modified since the
2818 | last commit, and which of these changes are actually staged, and will
2819 | make it to the next commit.  So =status= actually compares the =HEAD=
2820 | with the staging area, and the staging area with the worktree.  This
2821 | is what its output looks like:
2822 | 
2823 | #+begin_example
2824 | On branch master
2825 | 
2826 | Changes to be committed:
2827 |   (use "git restore --staged <file>..." to unstage)
2828 | 	modified:   write-yourself-a-git.org
2829 | 
2830 | Changes not staged for commit:
2831 |   (use "git add <file>..." to update what will be committed)
2832 |   (use "git restore <file>..." to discard changes in working directory)
2833 | 	modified:   write-yourself-a-git.org
2834 | 
2835 | Untracked files:
2836 |   (use "git add <file>..." to include in what will be committed)
2837 | 	org-html-themes/
2838 | 	wl-copy
2839 | #+end_example
2840 | 
2841 | We'll implement =status= in three parts: first the active branch or
2842 | “detached HEAD”, then the difference between HEAD and the index
2843 | (“Changes to be committed”), then the difference between the index
2844 | and the worktree (“Changes not staged for commit” and “Untracked
2845 | files”).
2846 | 
2847 | The public interface is dead simple, since our status takes no argument:
2848 | 
2849 | #+BEGIN_SRC python :tangle libwyag.py
2850 |   argsp = argsubparsers.add_parser("status", help = "Show the working tree status.")
2851 | #+END_src
2852 | 
2853 | And the bridge function just calls the three component functions in order:
2854 | 
2855 | #+BEGIN_SRC python :tangle libwyag.py
2856 |   def cmd_status(_):
2857 |       repo = repo_find()
2858 |       index = index_read(repo)
2859 | 
2860 |       cmd_status_branch(repo)
2861 |       cmd_status_head_index(repo, index)
2862 |       print()
2863 |       cmd_status_index_worktree(repo, index)
2864 | #+END_src
2865 | 
2866 | *** Finding the active branch
2867 | 
2868 | First we need to know if we're on a branch, and if so which one.  We
2869 | do this by just looking at =.git/HEAD=.  It should contain either a
2870 | hexadecimal ID (a ref to a commit, in detached HEAD state), or an
2871 | indirect reference to something in =refs/heads/=: the active branch.
2872 | We either return its name, or =False=.
2873 | 
2874 | #+begin_src python :tangle libwyag.py
2875 |   def branch_get_active(repo):
2876 |       with open(repo_file(repo, "HEAD"), "r") as f:
2877 |           head = f.read()
2878 | 
2879 |       if head.startswith("ref: refs/heads/"):
2880 |           return head[16:-1]
2881 |       else:
2882 |           return False
2883 | #+end_src
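
The slice =head[16:-1]= may look a bit magic: 16 is the length of
=ref: refs/heads/=, and =-1= drops the trailing newline.  The sketch
below does the same thing on plain strings, without touching any
repository; =parse_head= is a hypothetical name used only here.

#+begin_src python
def parse_head(text):
    prefix = "ref: refs/heads/"          # 16 characters long
    if text.startswith(prefix):
        return text[len(prefix):].rstrip("\n")
    return False                         # detached HEAD (or something else)

print(parse_head("ref: refs/heads/main\n"))
print(parse_head("426f894781bc3c38f1d26f8fd2c7f38ab8d21763\n"))
#+end_src

The first call prints =main=, the second =False=.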
2884 | 
2885 | Based on this, we can write the first of the three =cmd_status_*=
2886 | functions the bridge calls.  This one prints the name of the active
2887 | branch, or the hash of the detached HEAD:
2888 | 
2889 | #+begin_src python :tangle libwyag.py
2890 |   def cmd_status_branch(repo):
2891 |       branch = branch_get_active(repo)
2892 |       if branch:
2893 |           print(f"On branch {branch}.")
2894 |       else:
2895 |           print(f"HEAD detached at {object_find(repo, 'HEAD')}")
2896 | #+end_src
2897 | 
2898 | *** Finding changes between HEAD and index
2899 | 
2900 | The second block of the status output is the “changes to be
2901 | committed”, that is, how the staging area differs from HEAD.  To do
2902 | this, we're going first to read the =HEAD= tree, and flatten it as a
2903 | single dict (hashmap) with full paths as keys, so it's closer to the
2904 | (flat) index associating paths to blobs.  Then we'll just compare
2905 | them and output their differences.
2906 | 
2907 | First, a function to convert a tree (recursive, remember) to a (flat)
2908 | dict.  And since trees are recursive, so is the function itself, again ---
2909 | sorry about that :)
2910 | 
2911 | #+begin_src python :tangle libwyag.py
2912 |   def tree_to_dict(repo, ref, prefix=""):
2913 |       ret = dict()
2914 |       tree_sha = object_find(repo, ref, fmt=b"tree")
2915 |       tree = object_read(repo, tree_sha)
2916 | 
2917 |       for leaf in tree.items:
2918 |           full_path = os.path.join(prefix, leaf.path)
2919 | 
2920 |           # A mode starting with 04 marks a subtree (a directory).
2921 |           # Checking the mode is enough: there's no need to read the
2922 |           # object itself just to learn its type.
2923 |           is_subtree = leaf.mode.startswith(b'04')
2924 | 
2925 |           # Depending on the type, we either store the path (if it's a
2926 |           # blob, so a regular file), or recurse (if it's another tree,
2927 |           # so a subdir)
2928 |           if is_subtree:
2929 |               ret.update(tree_to_dict(repo, leaf.sha, full_path))
2930 |           else:
2931 |               ret[full_path] = leaf.sha
2932 |       return ret
2933 | #+end_src
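
If the recursion is hard to picture, the same flattening can be tried
on plain nested dicts, with no object store involved.  In this sketch
a nested dict plays the role of a subtree and a string plays the role
of a blob's SHA; =flatten= and the fake hashes are assumptions made
for the example.

#+begin_src python
import posixpath

def flatten(tree, prefix=""):
    flat = {}
    for name, value in tree.items():
        full_path = posixpath.join(prefix, name)
        if isinstance(value, dict):
            # A subdirectory: recurse, carrying the path so far.
            flat.update(flatten(value, full_path))
        else:
            # A file: record its full path.
            flat[full_path] = value
    return flat

tree = {
    "README": "sha-a",
    "src": {"main.c": "sha-b", "lib": {"util.c": "sha-c"}},
}
print(flatten(tree))
# {'README': 'sha-a', 'src/main.c': 'sha-b', 'src/lib/util.c': 'sha-c'}
#+end_src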
2934 | 
2935 | And the command itself:
2936 | 
2937 | #+begin_src python :tangle libwyag.py
2938 |   def cmd_status_head_index(repo, index):
2939 |       print("Changes to be committed:")
2940 | 
2941 |       head = tree_to_dict(repo, "HEAD")
2942 |       for entry in index.entries:
2943 |           if entry.name in head:
2944 |               if head[entry.name] != entry.sha:
2945 |                   print("  modified:", entry.name)
2946 |               del head[entry.name] # Delete the key
2947 |           else:
2948 |               print("  added:   ", entry.name)
2949 | 
2950 |       # Keys still in HEAD are files that we haven't met in the index,
2951 |       # and thus have been deleted.
2952 |       for entry in head.keys():
2953 |           print("  deleted: ", entry)
2954 | #+end_src
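
Stripped of the repository plumbing, the comparison is a plain
dict-against-dict diff.  Here is a sketch on made-up data, where both
sides map paths to (fake) hashes; =diff_head_index= is a hypothetical
helper that returns the changes instead of printing them.

#+begin_src python
def diff_head_index(head, index):
    remaining = dict(head)               # copy, so the caller's dict survives
    changes = []
    for name, sha in index.items():
        if name in remaining:
            if remaining[name] != sha:
                changes.append(("modified", name))
            del remaining[name]
        else:
            changes.append(("added", name))
    # Whatever is left was in HEAD but is no longer in the index.
    changes.extend(("deleted", name) for name in remaining)
    return changes

head = {"README": "a1", "build.py": "b1", "old.txt": "c1"}
index = {"README": "a1", "build.py": "b2", "src/main.c": "d1"}
print(diff_head_index(head, index))
# [('modified', 'build.py'), ('added', 'src/main.c'), ('deleted', 'old.txt')]
#+end_src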
2955 | 
2956 | *** Finding changes between index and worktree
2957 | 
2958 | #+begin_src python :tangle libwyag.py
2959 |   def cmd_status_index_worktree(repo, index):
2960 |       print("Changes not staged for commit:")
2961 | 
2962 |       ignore = gitignore_read(repo)
2963 | 
2964 |       gitdir_prefix = repo.gitdir + os.path.sep
2965 | 
2966 |       all_files = list()
2967 | 
2968 |       # We begin by walking the filesystem
2969 |       for (root, _, files) in os.walk(repo.worktree, True):
2970 |           if root==repo.gitdir or root.startswith(gitdir_prefix):
2971 |               continue
2972 |           for f in files:
2973 |               full_path = os.path.join(root, f)
2974 |               rel_path = os.path.relpath(full_path, repo.worktree)
2975 |               all_files.append(rel_path)
2976 | 
2977 |       # We now traverse the index, and compare real files with the cached
2978 |       # versions.
2979 | 
2980 |       for entry in index.entries:
2981 |           full_path = os.path.join(repo.worktree, entry.name)
2982 | 
2983 |           # That file *name* is in the index
2984 | 
2985 |           if not os.path.exists(full_path):
2986 |               print("  deleted: ", entry.name)
2987 |           else:
2988 |               stat = os.stat(full_path)
2989 | 
2990 |               # Compare metadata
2991 |               ctime_ns = entry.ctime[0] * 10**9 + entry.ctime[1]
2992 |               mtime_ns = entry.mtime[0] * 10**9 + entry.mtime[1]
2993 |               if (stat.st_ctime_ns != ctime_ns) or (stat.st_mtime_ns != mtime_ns):
2994 |                   # If different, deep compare.
2995 |                   # @FIXME This *will* crash on symlinks to dir.
2996 |                   with open(full_path, "rb") as fd:
2997 |                       new_sha = object_hash(fd, b"blob", None)
2998 |                       # If the hashes are the same, the files are actually the same.
2999 |                       same = entry.sha == new_sha
3000 | 
3001 |                       if not same:
3002 |                           print("  modified:", entry.name)
3003 | 
3004 |           if entry.name in all_files:
3005 |               all_files.remove(entry.name)
3006 | 
3007 |       print()
3008 |       print("Untracked files:")
3009 | 
3010 |       for f in all_files:
3011 |           # @TODO If a full directory is untracked, we should display
3012 |           # its name without its contents.
3013 |           if not check_ignore(ignore, f):
3014 |               print(" ", f)
3015 | #+end_src
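
The two-stage check (cheap metadata first, hashing only if the
metadata differs) can be tried on a temporary file.  This sketch
doesn't use wyag at all: =blob_sha= is a small stand-in that hashes
data the way git hashes a blob (a =blob <size>= header, a NUL byte,
then the content), and the ten-second shift just simulates a =touch=.

#+begin_src python
import hashlib
import os
import tempfile

def blob_sha(data):
    header = b"blob " + str(len(data)).encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file")
    with open(path, "wb") as f:
        f.write(b"hello\n")

    st = os.stat(path)
    # What an index entry would record: (seconds, nanoseconds) and a hash.
    mtime = divmod(st.st_mtime_ns, 10**9)
    recorded_sha = blob_sha(b"hello\n")

    # Simulate `touch`: move mtime forward, leave the content alone.
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 10 * 10**9))

    now = os.stat(path)
    metadata_changed = now.st_mtime_ns != mtime[0] * 10**9 + mtime[1]
    with open(path, "rb") as f:
        content_changed = blob_sha(f.read()) != recorded_sha

print(metadata_changed, content_changed)   # True False
#+end_src

The metadata differs but the hash doesn't, which is exactly the case
where =status= must stay silent about the file.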
3016 | 
3017 | Our status function is done.  It should output something like:
3018 | 
3019 | #+begin_example
3020 | $ wyag status
3021 | On branch main.
3022 | Changes to be committed:
3023 |   added:    src/main.c
3024 | 
3025 | Changes not staged for commit:
3026 |   modified: build.py
3027 |   deleted:  README.org
3028 | 
3029 | Untracked files:
3030 |   src/cli.c
3031 | #+end_example
3032 | 
3033 | The real =status= is a lot smarter: it can detect renames, for
3034 | example, where ours cannot.  Another significant difference worth
3035 | mentioning is that =git status= actually /writes/ the index back if a
3036 | file's metadata was modified, but not its content.  You can see it with
3037 | our special ls-files:
3038 | 
3039 |   #+begin_example
3040 |   $ wyag ls-files --verbose
3041 |   Index file format v2, containing 1 entries.
3042 |   file
3043 |     regular file with perms: 644
3044 |     on blob: f2f279981ce01b095c42ee7162aadf60185c8f67
3045 |     created: 2023-07-18 18:26:15.771460869, modified: 2023-07-18 18:26:15.771460869
3046 |     ...
3047 |   $ touch file
3048 |   $ git status > /dev/null
3049 |   $ wyag ls-files --verbose
3050 |   Index file format v2, containing 1 entries.
3051 |   file
3052 |     regular file with perms: 644
3053 |     on blob: f2f279981ce01b095c42ee7162aadf60185c8f67
3054 |     created: 2023-07-18 18:26:41.421743098, modified: 2023-07-18 18:26:41.421743098
3055 |     ...
3056 |   #+end_example
3057 | 
3058 | Notice how both timestamps, from the /index file/, were updated by
3059 | =git status= to reflect the changes in the real file's metadata.
3060 | 
3061 | * Staging area and index, part 2: staging and committing
3062 | :PROPERTIES:
3063 | :CUSTOM_ID: committing
3064 | :END:
3065 | 
3066 | OK.  Let's create commits.
3067 | 
3068 | We have /almost/ everything we need for that, except for three last
3069 | things:
3070 | 
3071 |  1. We need commands to modify the index, so our commits aren't just a
3072 |     copy of their parent.  Those commands are =add= and =rm=.
3073 |  2. These commands need to write the modified index back, since we
3074 |     commit /from the index/.
3075 |  3. And obviously, we'll need the =commit= function and its associated
3076 |     =wyag commit= command.
3077 | 
3078 | ** Writing the index
3079 | :PROPERTIES:
3080 | :CUSTOM_ID: index_write
3081 | :END:
3082 | 
3083 | We'll start by writing the index.  Roughly, we're just serializing
3084 | everything back to binary.  This is a bit tedious, but the code should
3085 | be straightforward.  I'm leaving the gory details for the comments,
3086 | but it's really just =index_read= in reverse --- refer to it if
3087 | needed, and the =GitIndexEntry= class.
3088 | 
3089 | #+begin_src python :tangle libwyag.py
3090 |   def index_write(repo, index):
3091 |       with open(repo_file(repo, "index"), "wb") as f:
3092 | 
3093 |           # HEADER
3094 | 
3095 |           # Write the magic bytes.
3096 |           f.write(b"DIRC")
3097 |           # Write version number.
3098 |           f.write(index.version.to_bytes(4, "big"))
3099 |           # Write the number of entries.
3100 |           f.write(len(index.entries).to_bytes(4, "big"))
3101 | 
3102 |           # ENTRIES
3103 | 
3104 |           idx = 0
3105 |           for e in index.entries:
3106 |               f.write(e.ctime[0].to_bytes(4, "big"))
3107 |               f.write(e.ctime[1].to_bytes(4, "big"))
3108 |               f.write(e.mtime[0].to_bytes(4, "big"))
3109 |               f.write(e.mtime[1].to_bytes(4, "big"))
3110 |               f.write(e.dev.to_bytes(4, "big"))
3111 |               f.write(e.ino.to_bytes(4, "big"))
3112 | 
3113 |               # Mode
3114 |               mode = (e.mode_type << 12) | e.mode_perms
3115 |               f.write(mode.to_bytes(4, "big"))
3116 | 
3117 |               f.write(e.uid.to_bytes(4, "big"))
3118 |               f.write(e.gid.to_bytes(4, "big"))
3119 | 
3120 |               f.write(e.fsize.to_bytes(4, "big"))
3121 |               # @FIXME Convert back to int.
3122 |               f.write(int(e.sha, 16).to_bytes(20, "big"))
3123 | 
3124 |               flag_assume_valid = 0x1 << 15 if e.flag_assume_valid else 0
3125 | 
3126 |               name_bytes = e.name.encode("utf8")
3127 |               bytes_len = len(name_bytes)
3128 |               if bytes_len >= 0xFFF:
3129 |                   name_length = 0xFFF
3130 |               else:
3131 |                   name_length = bytes_len
3132 | 
3133 |               # We merge back three pieces of data (two flags and the
3134 |               # length of the name) on the same two bytes.
3135 |               f.write((flag_assume_valid | e.flag_stage | name_length).to_bytes(2, "big"))
3136 | 
3137 |               # Write back the name, and a final 0x00.
3138 |               f.write(name_bytes)
3139 |               f.write((0).to_bytes(1, "big"))
3140 | 
3141 |               idx += 62 + len(name_bytes) + 1
3142 | 
3143 |               # Add padding if necessary.
3144 |               if idx % 8 != 0:
3145 |                   pad = 8 - (idx % 8)
3146 |                   f.write((0).to_bytes(pad, "big"))
3147 |                   idx += pad
3148 | #+end_src
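
Two details of that function are easier to check in isolation: the
padding arithmetic and the packing of the flags.  The helpers below
are hypothetical and only mirror the computations above; =stage_bits=
is assumed to be already positioned in the 16-bit field, since
=index_write= ORs =e.flag_stage= in directly.

#+begin_src python
def entry_size(name):
    name_bytes = name.encode("utf8")
    size = 62 + len(name_bytes) + 1      # fixed fields + name + final 0x00
    padding = (8 - size % 8) % 8         # round up to a multiple of 8
    return size + padding

def pack_flags(name, assume_valid=False, stage_bits=0):
    name_length = min(len(name.encode("utf8")), 0xFFF)
    valid_bit = 0x8000 if assume_valid else 0
    return (valid_bit | stage_bits | name_length).to_bytes(2, "big")

print(entry_size("file"))                # 72
print(pack_flags("file").hex())          # 0004
print(pack_flags("file", True).hex())    # 8004
#+end_src

The 62 comes from adding up the fixed-size fields: ten 4-byte integers,
the 20-byte SHA and the 2-byte flags.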
3149 | 
3150 | ** The rm command
3151 | :PROPERTIES:
3152 | :CUSTOM_ID: cmd-rm
3153 | :END:
3154 | 
3155 | The easiest change we can make to an index is to remove an entry from
3156 | it, which means that the next commit *won't include* this file.  This
3157 | is what the =git rm= command does.
3158 | 
3159 | #+BEGIN_danger
3160 |   =git rm= is *destructive*, and so is =wyag rm=.  The command not
3161 |   only modifies the index, it also removes file(s) from the worktree.
3162 |   Unlike git, =wyag rm= doesn't care if the file it removes isn't
3163 |   saved.  Proceed with caution.
3164 | #+END_danger
3165 | 
3166 | =rm= takes a single argument, a list of paths to remove:
3167 | 
3168 | #+BEGIN_SRC python :tangle libwyag.py
3169 |   argsp = argsubparsers.add_parser("rm", help="Remove files from the working tree and the index.")
3170 |   argsp.add_argument("path", nargs="+", help="Files to remove")
3171 | 
3172 |   def cmd_rm(args):
3173 |       repo = repo_find()
3174 |       rm(repo, args.path)
3175 | #+END_src
3176 | 
3177 | The =rm= function is a bit long, but it's very simple.  It takes a
3178 | repository and a list of paths, reads that repository index, and
3179 | removes entries in the index that match this list.  The optional
3180 | arguments control whether the function should actually delete the
3181 | files, and whether it should abort if some paths aren't present in the
3182 | index (both those arguments are for the use of =add=, they're not
3183 | exposed in the =wyag rm= command).
3184 | 
3185 | #+BEGIN_SRC python :tangle libwyag.py
3186 |   def rm(repo, paths, delete=True, skip_missing=False):
3187 |       # Find and read the index
3188 |       index = index_read(repo)
3189 | 
3190 |       worktree = repo.worktree + os.sep
3191 | 
3192 |       # Make paths absolute
3193 |       abspaths = set()
3194 |       for path in paths:
3195 |           abspath = os.path.abspath(path)
3196 |           if abspath.startswith(worktree):
3197 |               abspaths.add(abspath)
3198 |           else:
3199 |               raise Exception(f"Cannot remove paths outside of worktree: {paths}")
3200 | 
3201 |       # The list of entries to *keep*, which we will write back to the
3202 |       # index.
3203 |       kept_entries = list()
3204 |       # The list of removed paths, which we'll use after index update
3205 |       # to physically remove the actual paths from the filesystem.
3206 |       remove = list()
3207 | 
3208 |       # Now iterate over the list of entries, and remove those whose
3209 |       # paths we find in abspaths.  Preserve the others in kept_entries.
3210 |       for e in index.entries:
3211 |           full_path = os.path.join(repo.worktree, e.name)
3212 | 
3213 |           if full_path in abspaths:
3214 |               remove.append(full_path)
3215 |               abspaths.remove(full_path)
3216 |           else:
3217 |               kept_entries.append(e) # Preserve entry
3218 | 
3219 |       # If abspaths is not empty, some paths weren't in the index.
3220 |       if len(abspaths) > 0 and not skip_missing:
3221 |           raise Exception(f"Cannot remove paths not in the index: {abspaths}")
3222 | 
3223 |       # Physically delete paths from filesystem.
3224 |       if delete:
3225 |           for path in remove:
3226 |               os.unlink(path)
3227 | 
3228 |       # Update the list of entries in the index, and write it back.
3229 |       index.entries = kept_entries
3230 |       index_write(repo, index)
3231 | #+END_SRC
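
The heart of =rm= is a three-way split: entries to keep, entries to
remove, and requested paths that were never in the index.  Here it is
on plain lists of names; =partition= is a hypothetical helper and the
file names are made up.

#+begin_src python
def partition(entry_names, requested):
    pending = set(requested)
    kept, removed = [], []
    for name in entry_names:
        if name in pending:
            removed.append(name)
            pending.remove(name)
        else:
            kept.append(name)
    # Whatever is still pending was requested but is not in the index.
    return kept, removed, pending

kept, removed, missing = partition(
    ["README", "build.py", "src/main.c"],
    ["build.py", "notes.txt"],
)
print(kept)      # ['README', 'src/main.c']
print(removed)   # ['build.py']
print(missing)   # {'notes.txt'}
#+end_src

A non-empty =missing= set is what makes =rm= raise, unless
=skip_missing= is set.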
3232 | 
3233 | And we can now delete files with =wyag rm=.
3234 | 
3235 | ** The add command
3236 | :PROPERTIES:
3237 | :CUSTOM_ID: cmd-add
3238 | :END:
3239 | 
3240 | Adding is just a bit more complex than removing, but nothing we don't
3241 | already know.  Staging a file is a four-step operation:
3242 | 
3243 |  1. We begin by removing the existing index entry, if there's one,
3244 |     without removing the file itself (this is why the =rm= function we
3245 |     just wrote has those optional arguments).
3246 |  2. We then hash the file into a blob object.
3247 |  3. We create its index entry.
3248 |  4. And of course, we finally write the modified index back.
3249 | 
3250 | First, the interface.  Nothing surprising, =wyag add PATH ...= where
3251 | PATH is one or more file(s) to stage.  The bridge is as boring as can be.
3252 | 
3253 | #+BEGIN_SRC python :tangle libwyag.py
3254 |   argsp = argsubparsers.add_parser("add", help = "Add files contents to the index.")
3255 |   argsp.add_argument("path", nargs="+", help="Files to add")
3256 | 
3257 |   def cmd_add(args):
3258 |       repo = repo_find()
3259 |       add(repo, args.path)
3260 | #+END_src
3261 | 
3262 | The main difference with =rm= is that =add= needs to create an index
3263 | entry.  This isn't hard: we just =stat()= the file and copy the
3264 | metadata into the entry's fields (=stat()= returns the metadata the
3265 | index stores: creation/modification time, and so on).
3266 | 
3267 | #+BEGIN_SRC python :tangle libwyag.py
3268 |   def add(repo, paths, delete=True, skip_missing=False):
3269 | 
3270 |       # First remove all paths from the index, if they exist.
3271 |       rm(repo, paths, delete=False, skip_missing=True)
3272 | 
3273 |       worktree = repo.worktree + os.sep
3274 | 
3275 |       # Convert the paths to pairs: (absolute, relative_to_worktree).
3276 |       # (They were already dropped from the index by rm above.)
3277 |       clean_paths = set()
3278 |       for path in paths:
3279 |           abspath = os.path.abspath(path)
3280 |           if not (abspath.startswith(worktree) and os.path.isfile(abspath)):
3281 |               raise Exception(f"Not a file, or outside the worktree: {paths}")
3282 |           relpath = os.path.relpath(abspath, repo.worktree)
3283 |           clean_paths.add((abspath,  relpath))
3284 | 
3285 |       # Find and read the index.  It was modified by rm.  (This isn't
3286 |       # optimal, good enough for wyag!)
3287 |       #
3288 |       # @FIXME, though: we could just move the index through
3289 |       # commands instead of reading and writing it over again.
3290 |       index = index_read(repo)
3291 | 
3292 |       for (abspath, relpath) in clean_paths:
3293 |           with open(abspath, "rb") as fd:
3294 |               sha = object_hash(fd, b"blob", repo)
3295 | 
3296 |               stat = os.stat(abspath)
3297 | 
3298 |               ctime_s = stat.st_ctime_ns // 10**9
3299 |               ctime_ns = stat.st_ctime_ns % 10**9
3300 |               mtime_s = stat.st_mtime_ns // 10**9
3301 |               mtime_ns = stat.st_mtime_ns % 10**9
3302 | 
3303 |               entry = GitIndexEntry(ctime=(ctime_s, ctime_ns), mtime=(mtime_s, mtime_ns), dev=stat.st_dev, ino=stat.st_ino,
3304 |                                     mode_type=0b1000, mode_perms=0o644, uid=stat.st_uid, gid=stat.st_gid,
3305 |                                     fsize=stat.st_size, sha=sha, flag_assume_valid=False,
3306 |                                     flag_stage=False, name=relpath)
3307 |               index.entries.append(entry)
3308 | 
3309 |       # Write the index back
3310 |       index_write(repo, index)
3311 | #+END_SRC
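
The path handling at the top of =add= deserves a closer look: every
argument becomes an (absolute, relative-to-worktree) pair, and
anything outside the worktree is rejected.  The sketch below shows
that step alone, in a temporary directory; =to_repo_paths= is a
hypothetical helper, and it skips the =isfile= check that =add= also
performs.

#+begin_src python
import os
import tempfile

def to_repo_paths(worktree, path):
    abspath = os.path.abspath(path)
    # The trailing separator matters: without it, "/repo-old/x" would
    # wrongly look like it lives inside "/repo".
    if not abspath.startswith(worktree + os.sep):
        raise ValueError(f"Outside the worktree: {path}")
    return abspath, os.path.relpath(abspath, worktree)

with tempfile.TemporaryDirectory() as tmp:
    worktree = os.path.abspath(tmp)

    inside = os.path.join(worktree, "src", "..", "README")
    abspath, relpath = to_repo_paths(worktree, inside)
    print(relpath)                       # README

    try:
        to_repo_paths(worktree, os.path.join(worktree, "..", "elsewhere"))
        rejected = False
    except ValueError:
        rejected = True
    print(rejected)                      # True
#+end_src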
3312 | 
3313 | ** The commit command
3314 | :PROPERTIES:
3315 | :CUSTOM_ID: cmd-commit
3316 | :END:
3317 | 
3318 | Now that we have modified the index, and so actually /staged changes/, we
3319 | only need to turn those changes into a commit.  That's what =commit= does.
3320 | 
3321 | #+begin_src python :tangle libwyag.py
3322 |   argsp = argsubparsers.add_parser("commit", help="Record changes to the repository.")
3323 | 
3324 |   argsp.add_argument("-m",
3325 |                      metavar="message",
3326 |                      dest="message",
3327 |                      help="Message to associate with this commit.")
3328 | #+end_src
3329 | 
3330 | To do so, we first need to convert the index into a tree object,
3331 | generate and store the corresponding commit object, and update the
3332 | HEAD branch to the new commit (remember: a branch is just a ref to a
3333 | commit).
3334 | 
3335 | Before we get to the interesting details, we will need to read git's
3336 | config to get the name of the user, which we'll use as the author and
3337 | committer.  We'll use the same =configparser= library we've used to
3338 | read repo's config.
3339 | 
3340 | #+begin_src python :tangle libwyag.py
3341 |   def gitconfig_read():
3342 |       xdg_config_home = os.environ["XDG_CONFIG_HOME"] if "XDG_CONFIG_HOME" in os.environ else "~/.config"
3343 |       configfiles = [
3344 |           os.path.expanduser(os.path.join(xdg_config_home, "git/config")),
3345 |           os.path.expanduser("~/.gitconfig")
3346 |       ]
3347 | 
3348 |       config = configparser.ConfigParser()
3349 |       config.read(configfiles)
3350 |       return config
3351 | #+end_src
3352 | 
3353 | And just a simple function to grab, and format, the user identity:
3354 | 
3355 | #+begin_src python :tangle libwyag.py
3356 |   def gitconfig_user_get(config):
3357 |       if "user" in config:
3358 |           if "name" in config["user"] and "email" in config["user"]:
3359 |               return f"{config['user']['name']} <{config['user']['email']}>"
3360 |       return None
3361 | #+end_src
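
You can try the formatting without touching your real configuration
by feeding =configparser= a string.  The identity below is obviously
made up; the function is copied from above so the sketch runs on its
own.

#+begin_src python
import configparser

def gitconfig_user_get(config):
    if "user" in config:
        if "name" in config["user"] and "email" in config["user"]:
            return f"{config['user']['name']} <{config['user']['email']}>"
    return None

config = configparser.ConfigParser()
config.read_string("[user]\nname = Ada Lovelace\nemail = ada@example.org\n")

print(gitconfig_user_get(config))
# Ada Lovelace <ada@example.org>
print(gitconfig_user_get(configparser.ConfigParser()))
# None
#+end_src

The =None= case matters: it means no identity is configured, and the
caller has to decide what to do about it.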
3362 | 
3363 | Now for the interesting part.  We first need to build a tree from the
3364 | index.  This isn't hard, but notice that while the index is flat (it
3365 | stores full paths for the whole worktree), a tree is a recursive
3366 | structure: it lists files, or other trees.  To "unflatten" the index
3367 | into a tree, we're going to:
3368 | 
3369 |  1. Build a dictionary (hashmap) of directories.  Keys are full paths
3370 |     from worktree root (like =assets/sprites/monsters/=), values are
3371 |     lists of =GitIndexEntry= --- files in the directory.  At this point, our
3372 |     dictionary only contains /files/: directories are only its keys.
3373 |  2. Traverse this dictionary, going bottom-up, that is, from the deepest
3374 |     directories up to root (depth doesn't really matter: we just want
3375 |     to see each directory /before/ its parent.  To do that, we just
3376 |     sort them by /full/ path length, from longest to shortest ---
3377 |     parents are obviously always shorter).  As an example, imagine we
3378 |     start at =assets/sprites/monsters/=.
3379 |  3. At each directory, we build a tree with its contents, say
3380 |     =cacodemon.png=, =imp.png= and =baron-of-hell.png=.
3381 |  4. We write the new tree to the repository.
3382 |  5. We then add this tree to this directory's parent.  Meaning that at
3383 |     this point, =assets/sprites/= now contains our new tree object's
3384 |     SHA-1 id under the name =monsters=.
3385 |  6. And we iterate over the next directory, let's say
3386 |     =assets/sprites/keys= where we find =red.png=, =blue.png= and
3387 |     =yellow.png=, create a tree, store the tree, add the tree's SHA-1
3388 |     under the name =keys= under =assets/sprites/=, and so on.
3389 | 
3390 | And since trees are recursive, the last tree we'll build, which is
3391 | necessarily the one for root (since its key's length is 0), will
3392 | ultimately refer to all others, and thus will be the only one we'll need.
3393 | We'll simply return its SHA-1, and be done.
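
The first two steps (grouping by directory, then ordering bottom-up)
can be sketched on bare path strings.  This is only an illustration
under simplifying assumptions: =group_by_directory= is a hypothetical
helper, it stores file names instead of =GitIndexEntry= objects, and
its keys carry no trailing slash.

#+begin_src python
import posixpath

def group_by_directory(paths):
    contents = {"": []}                  # root is always present
    for path in paths:
        dirname = posixpath.dirname(path)
        # Make sure every ancestor directory has a key, up to root.
        key = dirname
        while key != "":
            contents.setdefault(key, [])
            key = posixpath.dirname(key)
        contents[dirname].append(posixpath.basename(path))
    return contents

paths = [
    "README",
    "assets/sprites/hero.png",
    "assets/sprites/monsters/imp.png",
]
contents = group_by_directory(paths)

# Longest key first: every directory comes before its parent.
order = sorted(contents, key=len, reverse=True)
print(order)
# ['assets/sprites/monsters', 'assets/sprites', 'assets', '']
#+end_src

Notice that =assets= gets a key even though it holds no file directly:
it will still need a tree, to hold =sprites=.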
3394 | 
3395 | Since this may seem a bit complex, let's work through this example in
3396 | full detail (feel free to skip).  At the beginning, the dictionary we
3397 | built from the index looks like this:
3398 | 
3399 | #+begin_example
3400 |   contents["assets/sprites/monsters"] =
3401 |     [ cacodemon.png : GitIndexEntry
3402 |     , imp.png : GitIndexEntry
3403 |     , baron-of-hell.png : GitIndexEntry ]
3404 |   contents["assets/sprites/keys"] =
3405 |     [ red.png : GitIndexEntry
3406 |     , blue.png : GitIndexEntry
3407 |     , yellow.png : GitIndexEntry ]
3408 |   contents["assets/sprites/"] =
3409 |     [ hero.png : GitIndexEntry ]
3410 |   contents["assets/"] = [] # No files in here
3411 |   contents[""] = # Root!
3412 |     [ README: GitIndexEntry ]
3413 | #+end_example
3414 | 
3415 | We iterate over it, by order of descending key length.  The first key
3416 | we meet is the longest, so =assets/sprites/monsters=.  We build a new
3417 | tree object from its contents, which associates the three file names
3418 | (=cacodemon.png=, =imp.png=, =baron-of-hell.png=) with their
3419 | corresponding blobs.  (A tree leaf stores /less/ data than the index:
3420 | just path, mode and blob.  So converting entries that way is easy.)
3421 | 
3422 | Notice we don't need to concern ourselves with storing the *contents*
3423 | of those files: =wyag add= did create the corresponding blobs as
3424 | needed.  We need to store the /trees/ we create to the object store,
3425 | but we can assume the blobs are there already.
3426 | 
3427 | Let's say that our new tree, made from the index entries that
3428 | lived directly in =assets/sprites/monsters=, hashes down to
3429 | =426f894781bc3c38f1d26f8fd2c7f38ab8d21763=.  We *modify our
3430 | dictionary* to add that new tree object to the directory's parent,
3431 | so what remains to traverse now looks like this:
3432 | 
3433 | #+begin_example
3434 |   contents["assets/sprites/keys"] = # <- unmodified.
3435 |     [ red.png : GitIndexEntry
3436 |     , blue.png : GitIndexEntry
3437 |     , yellow.png : GitIndexEntry ]
3438 |   contents["assets/sprites/"] =
3439 |     [ hero.png : GitIndexEntry
3440 |     , monsters : Tree 426f894781bc3c38f1d26f8fd2c7f38ab8d21763 ] # <- look here
3441 |   contents["assets/"] = [] # empty
3442 |   contents[""] = # Root!
3443 |     [ README: GitIndexEntry ]
3444 | #+end_example
3445 | 
3446 | We do the same for the next longest key, =assets/sprites/keys=,
3447 | producing a tree of hash =b42788e087b1e94a0e69dcb7a4a243eaab802bb2=,
3448 | so:
3449 | 
3450 | #+begin_example
3451 |   contents["assets/sprites/"] =
3452 |     [ hero.png : GitIndexEntry
3453 |     , monsters : Tree 426f894781bc3c38f1d26f8fd2c7f38ab8d21763
3454 |     , keys : Tree b42788e087b1e94a0e69dcb7a4a243eaab802bb2 ]
3455 |   contents["assets/"] = [] # empty
3456 |   contents[""] = # Root!
3457 |     [ README: GitIndexEntry ]
3458 | #+end_example
3459 | 
3460 | We then generate tree =6364113557ed681d775ccbd3c90895ed276956a2= from
3461 | =assets/sprites=, which now contains our two trees and =hero.png=.
3462 | 
3463 | #+begin_example
3464 |   contents["assets/"] = [
3465 |     sprites: Tree 6364113557ed681d775ccbd3c90895ed276956a2 ]
3466 |   contents[""] = # Root!
3467 |     [ README: GitIndexEntry ]
3468 | #+end_example
3469 | 
3470 | =assets= in turn becomes tree =4d35513cb6d2a816bc00505be926624440ebbddd=, so:
3471 | 
3472 | #+begin_example
3473 |   contents[""] = # Root!
3474 |     [ README: GitIndexEntry,
3475 |       assets: Tree 4d35513cb6d2a816bc00505be926624440ebbddd]
3476 | #+end_example
3477 | 
3478 | We make a tree from that last key (with the =README= blob and the
3479 | =assets= subtree); it hashes to
3480 | =9352e52ff58fa9bf5a750f090af64c09fa6a3d93=.  That's our return value:
3481 | the tree whose contents are the same as the index's.
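
To see the traversal order in isolation, here is a minimal standalone
sketch that works on plain path strings instead of a real index.  It
is not part of =wyag=: the =unflatten= name is made up for
illustration, and the ids it produces are SHA-1 digests of a Python
=repr=, standing in for =object_write=.  They are /not/ real git
object hashes.

#+begin_src python
  import hashlib
  import os

  def unflatten(paths):
      """Group flat file paths by directory, then fold directories
      bottom-up.  Returns (root_id, trees), where trees maps each
      directory to its listing, in the order directories were visited."""
      contents = {"": []}
      for path in paths:
          dirname = os.path.dirname(path)
          # Make sure every ancestor directory has a key, even if empty.
          key = dirname
          while key != "":
              contents.setdefault(key, [])
              key = os.path.dirname(key)
          # Fake blob id (assumption: a real index entry carries a real SHA).
          blob_id = hashlib.sha1(path.encode()).hexdigest()
          contents[dirname].append(("blob", os.path.basename(path), blob_id))

      trees = {}
      tree_id = None
      # Longest key first: every directory is seen before its parent.
      for directory in sorted(contents, key=len, reverse=True):
          listing = sorted(contents[directory], key=lambda e: e[1])
          # Stand-in for object_write(): hash the listing itself.
          tree_id = hashlib.sha1(repr(listing).encode()).hexdigest()
          trees[directory] = listing
          if directory != "":
              parent = os.path.dirname(directory)
              contents[parent].append(
                  ("tree", os.path.basename(directory), tree_id))
      return tree_id, trees

  root_id, trees = unflatten([
      "README",
      "assets/sprites/hero.png",
      "assets/sprites/monsters/imp.png",
      "assets/sprites/keys/red.png",
  ])
  for directory, listing in trees.items():
      print(repr(directory), [name for _, name, _ in listing])
#+end_src

Directories are printed in the order they were folded, so the root
(the empty string) comes last, and by then its listing already
contains the =assets= subtree.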
3482 | 
3483 | Here's the actual function:
3484 | 
3485 | #+begin_src python :tangle libwyag.py
3486 |   def tree_from_index(repo, index):
3487 |       contents = dict()
3488 |       contents[""] = list()
3489 | 
3490 |       # Enumerate entries, and turn them into a dictionary where keys
3491 |       # are directories, and values are lists of directory contents.
3492 |       for entry in index.entries:
3493 |           dirname = os.path.dirname(entry.name)
3494 | 
3495 |           # We create all dictionary entries up to root ("").  We need
3496 |           # them *all*, because even if a directory holds no files it
3497 |           # will contain at least a tree.
3498 |           key = dirname
3499 |           while key != "":
3500 |               if key not in contents:
3501 |                   contents[key] = list()
3502 |               key = os.path.dirname(key)
3503 | 
3504 |           # For now, simply store the entry in the list.
3505 |           contents[dirname].append(entry)
3506 | 
3507 |       # Get keys (= directories) and sort them by length, descending.
3508 |       # This means that we'll always encounter a given path before its
3509 |       # parent, which is all we need, since for each directory D we'll
3510 |       # need to modify its parent P to add D's tree.
3511 |       sorted_paths = sorted(contents.keys(), key=len, reverse=True)
3512 | 
3513 |       # This variable will store the current tree's SHA-1.  After we're
3514 |       # done iterating over our dict, it will contain the hash for the
3515 |       # root tree.
3516 |       sha = None
3517 | 
3518 |       # We go through the sorted list of paths (dict keys)
3519 |       for path in sorted_paths:
3520 |           # Prepare a new, empty tree object
3521 |           tree = GitTree()
3522 | 
3523 |           # Add each entry to our new tree, in turn
3524 |           for entry in contents[path]:
3525 |               # An entry can be a normal GitIndexEntry read from the
3526 |               # index, or a tree we've created.
3527 |               if isinstance(entry, GitIndexEntry): # Regular entry (a file)
3528 | 
3529 |                   # We transcode the mode: the entry stores it as integers,
3530 |                   # we need an octal ASCII representation for the tree.
3531 |                   leaf_mode = f"{entry.mode_type:02o}{entry.mode_perms:04o}".encode("ascii")
3532 |                   leaf = GitTreeLeaf(mode = leaf_mode, path=os.path.basename(entry.name), sha=entry.sha)
3533 |               else: # Tree.  We've stored it as a pair: (basename, SHA)
3534 |                   leaf = GitTreeLeaf(mode = b"040000", path=entry[0], sha=entry[1])
3535 | 
3536 |               tree.items.append(leaf)
3537 | 
3538 |           # Write the new tree object to the store.
3539 |           sha = object_write(tree, repo)
3540 | 
3541 |           # Add the new tree hash to the current directory's parent, as
3542 |           # a pair (basename, SHA)
3543 |           parent = os.path.dirname(path)
3544 |           base = os.path.basename(path) # The name without the path, eg sprites for assets/sprites
3545 |           contents[parent].append((base, sha))
3546 | 
3547 |       return sha
3548 | #+end_src
3549 | 
3550 | This was the hard part; I hope it's clear enough.  From this, creating
3551 | the commit object and updating HEAD will be way easier.  Just remember
3552 | that what this function /does/ is build and store as many tree objects
3553 | as needed to represent the index, and return the root tree's SHA-1.
3554 | 
3555 | The function to create a commit object is simple enough.  It just takes
3556 | some arguments: the hash of the tree, the hash of the parent commit,
3557 | the author's identity (a string), the timestamp (a =datetime=, from
3558 | which the timezone offset is derived), and the message:
3559 | 
3560 | # @TODO Explain them!
3561 | 
3562 | #+begin_src python :tangle libwyag.py
3563 |   def commit_create(repo, tree, parent, author, timestamp, message):
3564 |       commit = GitCommit() # Create the new commit object.
3565 |       commit.kvlm[b"tree"] = tree.encode("ascii")
3566 |       if parent:
3567 |           commit.kvlm[b"parent"] = parent.encode("ascii")
3568 | 
3569 |       # Trim message and add a trailing \n
3570 |       message = message.strip() + "\n"
3571 |       # Format timezone as +HHMM or -HHMM (the sign is handled apart from the digits)
3572 |       offset = int(timestamp.astimezone().utcoffset().total_seconds())
3573 |       hours = abs(offset) // 3600
3574 |       minutes = (abs(offset) % 3600) // 60
3575 |       tz = "{}{:02}{:02}".format("+" if offset >= 0 else "-", hours, minutes)
3576 | 
3577 |       author = "{} {} {}".format(author, int(timestamp.timestamp()), tz) # strftime("%s") isn't portable
3578 | 
3579 |       commit.kvlm[b"author"] = author.encode("utf8")
3580 |       commit.kvlm[b"committer"] = author.encode("utf8")
3581 |       commit.kvlm[None] = message.encode("utf8")
3582 | 
3583 |       return object_write(commit, repo)
3584 | #+end_src
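
The sign handling of the timezone offset is the fiddly part of the
author line, so here is a small standalone sketch of it.  The helper
names =format_git_tz= and =format_git_identity= are hypothetical
(they're not part of =wyag=), and the sketch assumes a timezone-aware
=datetime=.

#+begin_src python
  from datetime import datetime, timedelta, timezone

  def format_git_tz(offset_seconds):
      """Format a UTC offset in seconds as +HHMM or -HHMM."""
      sign = "+" if offset_seconds >= 0 else "-"
      # Work on the absolute value: floor division on a negative
      # offset would round the hours the wrong way.
      hours, remainder = divmod(abs(offset_seconds), 3600)
      minutes = remainder // 60
      return f"{sign}{hours:02}{minutes:02}"

  def format_git_identity(identity, when):
      """Build 'Name <email> EPOCH TZ' from a timezone-aware datetime."""
      offset = int(when.utcoffset().total_seconds())
      return f"{identity} {int(when.timestamp())} {format_git_tz(offset)}"

  print(format_git_tz(3600))    # +0100
  print(format_git_tz(-18000))  # -0500
  print(format_git_tz(19800))   # +0530

  when = datetime(2010, 1, 1, 1, 2, 3, tzinfo=timezone(timedelta(hours=1)))
  print(format_git_identity("wyag-tests.sh <wyag@example.com>", when))
#+end_src

The last call uses the same date and identity as the commits made in
=wyag-tests.sh=.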
3585 | 
3586 | All that remains to write is =cmd_commit=, the bridge function to the
3587 | =wyag commit= command:
3588 | 
3589 | #+begin_src python :tangle libwyag.py
3590 |   def cmd_commit(args):
3591 |       repo = repo_find()
3592 |       index = index_read(repo)
3593 |       # Create trees, grab back SHA for the root tree.
3594 |       tree = tree_from_index(repo, index)
3595 | 
3596 |       # Create the commit object itself
3597 |       commit = commit_create(repo,
3598 |                              tree,
3599 |                              object_find(repo, "HEAD"),
3600 |                              gitconfig_user_get(gitconfig_read()),
3601 |                              datetime.now(),
3602 |                              args.message)
3603 | 
3604 |       # Update HEAD so our commit is now the tip of the active branch.
3605 |       active_branch = branch_get_active(repo)
3606 |       if active_branch: # If we're on a branch, we update refs/heads/BRANCH
3607 |           with open(repo_file(repo, os.path.join("refs/heads", active_branch)), "w") as fd:
3608 |               fd.write(commit + "\n")
3609 |       else: # Otherwise, we update HEAD itself.
3610 |           with open(repo_file(repo, "HEAD"), "w") as fd:
3611 |               fd.write(commit + "\n")
3612 | #+end_src
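
The two ways of updating HEAD can be tried without a real repository.
The following is only a sketch under simplifying assumptions:
=update_head= is a made-up name, it reads =HEAD= directly instead of
going through =branch_get_active=, and it ignores everything real git
does around this step (lock files, reflogs).

#+begin_src python
  import os
  import tempfile

  def update_head(gitdir, commit_sha):
      """Record commit_sha as the new tip; return the path written."""
      head_path = os.path.join(gitdir, "HEAD")
      with open(head_path) as fd:
          head = fd.read().strip()

      if head.startswith("ref: "):
          # Indirect HEAD: we're on a branch, so the branch's ref file
          # moves and HEAD itself stays untouched.
          target = os.path.join(gitdir, head[len("ref: "):])
          os.makedirs(os.path.dirname(target), exist_ok=True)
      else:
          # Detached HEAD: HEAD holds a bare hash, so we overwrite it.
          target = head_path

      with open(target, "w") as fd:
          fd.write(commit_sha + "\n")
      return target

  # Try the branch case in a throwaway directory.
  with tempfile.TemporaryDirectory() as gitdir:
      with open(os.path.join(gitdir, "HEAD"), "w") as fd:
          fd.write("ref: refs/heads/master\n")
      written = update_head(gitdir, "9352e52ff58fa9bf5a750f090af64c09fa6a3d93")
      print(os.path.relpath(written, gitdir))
#+end_src

In both cases the file that receives the hash ends with a newline,
which is what git itself writes.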
3613 | 
3614 | And we're done!
3615 | 
3616 | * Final words
3617 | 
3618 | ** Conclusion, and beyond commit :noexport:
3619 | 
3620 | With that final command, =wyag= is done.  I hope you've enjoyed that
3621 | little journey into the internals of git core.  Obviously, we're still
3622 | very far from what the real git can do, but the goal was to expose the
3623 | model, and this is done.
3624 | 
3625 | One of the most fundamental design choices of git, which you've
3626 | probably noticed, is that it only stores full states of the
3627 | repository.  We like to think of commits as /transformations/ of
3628 | source code, and it makes sense to us because in a way that's what
3629 | they /are/, but to git itself each commit is like a zip of a
3630 | directory, as disconnected from the previous as it is from the “next”.
3631 | Git neither knows nor cares about deltas and patches, and even file
3632 | rename indications are just a trick.  (To be perfectly honest: it
3633 | actually /knows/ about deltas, but as a storage optimization method,
3634 | in packfiles, which are optional.)
3635 | 
3636 | ** Comments, feedback and issues
3637 | :PROPERTIES:
3638 | :CUSTOM_ID: feedback
3639 | :END:
3640 | 
3641 | This page has no comment system :) I can be reached by e-mail at
3642 | [[mailto:thibault@thb.lt][thibault@thb.lt]].  I can also be found [[https://toad.social/@thblt][on Mastodon as
3643 | @thblt@toad.social]] and [[https://twitter.com/ThbPlg][on Twitter as @ThbPlg]], and on IRC (sometimes)
3644 | as =thblt= on Libera.
3645 | 
3646 | The source for this article is hosted [[https://github.com/thblt/write-yourself-a-git][on GitHub]].  Issue reports and
3647 | pull requests are welcome, either directly on GitHub or through e-mail
3648 | if you prefer.
3649 | 
3650 | ** Release information :noexport:
3651 | 
3652 | #+begin_src emacs-lisp :exports results :results table
3653 |   (list
3654 |    '("Key" "Value")
3655 |    'hline
3656 |    `("Creation date" ,(current-time-string))
3657 |    `("On commit" ,(format "=%s= (%s)"
3658 |                        (string-trim (shell-command-to-string "git describe --tags --always"))
3659 |                        (if (string-empty-p
3660 |                              (string-trim (shell-command-to-string "git status --porcelain=v2")))
3661 |                            "clean" "*dirty*")))
3662 |    `("By" ,(format "%s (=%s= on =%s=)"
3663 |      (user-full-name)
3664 |      (user-login-name)
3665 |      (system-name)))
3666 |    `("Emacs version" ,emacs-version)
3667 |    `("Org-mode version" ,org-version))
3668 | #+end_src
3669 | 
3670 | ** License
3671 | 
3672 | This article is distributed under the terms of the [[https://creativecommons.org/licenses/by-nc-sa/4.0/][Creative Commons
3673 | BY-NC-SA 4.0]].  The [[./wyag.zip][program itself]] is also licensed under the terms
3674 | of the GNU General Public License 3.0, or, at your option, any later
3675 | version of the same license.
3676 | 
3677 | * Footnotes
3678 | 
3679 | [fn:1] You may know that [[https://shattered.io/][collisions have been discovered in SHA-1]].
3680 | Git actually doesn't use SHA-1 anymore: it uses a [[https://github.com/git/git/blob/26e47e261e969491ad4e3b6c298450c061749c9e/Documentation/technical/hash-function-transition.txt#L34-L36][hardened variant]]
3681 | which is not SHA, but which produces the same hash for every known input
3682 | except the two PDF files known to collide.
3683 | 


--------------------------------------------------------------------------------
/wyag-tests.sh:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env bash
  2 | set -e
  3 | 
  4 | function step() {
  5 |     pos=$(caller)
  6 |     echo $pos $@
  7 | }
  8 | 
  9 | wyag=$(realpath ./wyag)
 10 | 
 11 | testdir=/tmp/wyag-tests
 12 | if [[ -e $testdir ]]; then
 13 |     rm -rf $testdir/*
 14 | else
 15 |     mkdir $testdir
 16 | fi
 17 | cd $testdir
 18 | step "working on $(pwd)"
 19 | 
 20 | step Create repos
 21 | $wyag init left
 22 | git init right > /dev/null
 23 | 
 24 | step status
 25 | cd left
 26 | git status > /dev/null
 27 | cd ../right
 28 | git status > /dev/null
 29 | cd ..
 30 | 
 31 | step hash-object
 32 | echo "Don't read me" > README
 33 | $wyag hash-object README > hash1
 34 | git hash-object README > hash2
 35 | cmp --quiet hash1 hash2
 36 | 
 37 | step hash-object -w
 38 | cd left
 39 | $wyag hash-object -w ../README > /dev/null
 40 | cd ../right
 41 | git hash-object -w ../README > /dev/null
 42 | cd ..
 43 | ls left/.git/objects/b1/7df541639ec7814a9ad274e177d9f8da1eb951 > /dev/null
 44 | ls right/.git/objects/b1/7df541639ec7814a9ad274e177d9f8da1eb951 > /dev/null
 45 | 
 46 | step cat-file
 47 | cd left
 48 | $wyag cat-file blob b17d > ../file1
 49 | cd ../right
 50 | git cat-file blob b17d > ../file2
 51 | cd ..
 52 | cmp file1 file2
 53 | 
 54 | step cat-file with long hash
 55 | cd left
 56 | $wyag cat-file blob b17df541639ec7814a9ad274e177d9f8da1eb951 > ../file1
 57 | cd ../right
 58 | git cat-file blob b17df541639ec7814a9ad274e177d9f8da1eb951 > ../file2
 59 | cd ..
 60 | cmp file1 file2
 61 | 
 62 | step "Create commit (git only, nothing is tested)" #@FIXME Add wyag commit
 63 | cd left
 64 | echo "Aleph" > hebraic-letter.txt
 65 | git add hebraic-letter.txt
 66 | GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
 67 |                GIT_AUTHOR_NAME="wyag-tests.sh" \
 68 |                GIT_AUTHOR_EMAIL="wyag@example.com" \
 69 |                GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
 70 |                GIT_COMMITTER_NAME="wyag-tests.sh" \
 71 |                GIT_COMMITTER_EMAIL="wyag@example.com" \
 72 |                git commit --no-gpg-sign -m "Initial commit" > /dev/null
 73 | cd ../right
 74 | echo "Aleph" > hebraic-letter.txt
 75 | git add hebraic-letter.txt
 76 | GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
 77 |                GIT_AUTHOR_NAME="wyag-tests.sh" \
 78 |                GIT_AUTHOR_EMAIL="wyag@example.com" \
 79 |                GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
 80 |                GIT_COMMITTER_NAME="wyag-tests.sh" \
 81 |                GIT_COMMITTER_EMAIL="wyag@example.com" \
 82 |                git commit --no-gpg-sign -m "Initial commit" > /dev/null
 83 | 
 84 | cd ..
 85 | 
 86 | step cat-file on commit object without indirection
 87 | cd left
 88 | $wyag cat-file commit HEAD > ../file1
 89 | cd ../right
 90 | git cat-file commit HEAD > ../file2
 91 | cd ..
 92 | cmp file1 file2
 93 | 
 94 | step cat-file on tree object redirected from commit
 95 | cd left
 96 | $wyag cat-file tree HEAD > ../file1
 97 | cd ../right
 98 | git cat-file tree HEAD > ../file2
 99 | cd ..
100 | cmp file1 file2
101 | 
102 | step "Add some directories and commits (git only, nothing is tested)" #@FIXME Add wyag commit
103 | cd left
104 | mkdir a
105 | echo "Alpha" > a/greek_letters
106 | mkdir b
107 | echo "Hamza" > a/arabic_letters
108 | git add a/*
109 | GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
110 |                GIT_AUTHOR_NAME="wyag-tests.sh" \
111 |                GIT_AUTHOR_EMAIL="wyag@example.com" \
112 |                GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
113 |                GIT_COMMITTER_NAME="wyag-tests.sh" \
114 |                GIT_COMMITTER_EMAIL="wyag@example.com" \
115 |                git commit --no-gpg-sign -m "Commit 2" > /dev/null
116 | cd ../right
117 | mkdir a
118 | echo "Alpha" > a/greek_letters
119 | mkdir b
120 | echo "Hamza" > a/arabic_letters
121 | git add a/*
122 | GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
123 |                GIT_AUTHOR_NAME="wyag-tests.sh" \
124 |                GIT_AUTHOR_EMAIL="wyag@example.com" \
125 |                GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
126 |                GIT_COMMITTER_NAME="wyag-tests.sh" \
127 |                GIT_COMMITTER_EMAIL="wyag@example.com" \
128 |                git commit --no-gpg-sign -m "Commit 2" > /dev/null
129 | cd ..
130 | 
131 | step ls-tree
132 | cd left
133 | $wyag ls-tree HEAD > ../file1
134 | cd ../right
135 | git ls-tree HEAD > ../file2
136 | cd ..
137 | cmp file1 file2
138 | 
139 | step checkout
140 | # Git and Wyag syntax are different here
141 | cd left
142 | $wyag checkout HEAD ../temp1
143 | mkdir ../temp2
144 | cd  ../temp2
145 | git --git-dir=../right/.git checkout .
146 | cd ..
147 | diff -r temp1 temp2
148 | rm -rf temp1 temp2
149 | 
150 | step rev-parse
151 | cd left
152 | $wyag rev-parse HEAD  > ../file1
153 | $wyag rev-parse 75ee4 >> ../file1
154 | $wyag rev-parse 8a617  >> ../file1
155 | #@FIXME Tags missing, branches missing, remotes missing
156 | cd ../right
157 | git rev-parse HEAD   > ../file2
158 | git rev-parse 75ee4 >> ../file2
159 | git rev-parse 8a617  >> ../file2
160 | cd ..
161 | cmp file1 file2
162 | 
163 | step "ls-files "
164 | cd left
165 | $wyag ls-files > ../file1
166 | cd ../right
167 | git ls-files > ../file2
168 | cd ..
169 | cmp file1 file2
170 | 
171 | gitignore_prepare() {
172 |     mkdir -p a/b/c/
173 |     echo "!*.txt" > a/b/c/.gitignore
174 |     echo "*.txt" > a/b/.gitignore
175 |     echo "*.org" > a/.gitignore
176 |     git add -A
177 | }
178 | 
179 | step "gitignore"
180 | cd left
181 | gitignore_prepare
182 | $wyag check-ignore a/b/c/hello.txt > ../file1
183 | $wyag check-ignore a/b/hello.txt >> ../file1
184 | $wyag check-ignore a/hello.org >> ../file1
185 | $wyag check-ignore hello.org >> ../file1
186 | cd ../right
187 | set +e # git will return with non-zero
188 | gitignore_prepare
189 | git check-ignore a/b/c/hello.txt > ../file2
190 | git check-ignore a/b/hello.txt >> ../file2
191 | git check-ignore a/hello.org >> ../file2
192 | git check-ignore hello.org >> ../file2
193 | set -e
194 | cd ..
195 | cmp file1 file2
196 | 
197 | 
198 | 
199 | 
200 | step THIS WAS A TRIUMPH
201 | step "I'M MAKING A NOTE HERE"
202 | step "HUGE SUCCESS"
203 | 


--------------------------------------------------------------------------------