├── .gitignore
├── .gitmodules
├── LICENSE
├── Makefile
├── README.org
├── write-yourself-a-git.org
└── wyag-tests.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | .last_push
3 | __pycache__
4 | libwyag.py
5 | src
6 | wyag
7 | wyag.zip
8 |
--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "lib/org-html-themes"]
2 | path = lib/org-html-themes
3 | url = https://github.com/fniessen/org-html-themes
4 | [submodule "lib/htmlize"]
5 | path = lib/htmlize
6 | url = https://github.com/hniksic/emacs-htmlize
7 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU GENERAL PUBLIC LICENSE
2 | Version 3, 29 June 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 | Preamble
9 |
10 | The GNU General Public License is a free, copyleft license for
11 | software and other kinds of works.
12 |
13 | The licenses for most software and other practical works are designed
14 | to take away your freedom to share and change the works. By contrast,
15 | the GNU General Public License is intended to guarantee your freedom to
16 | share and change all versions of a program--to make sure it remains free
17 | software for all its users. We, the Free Software Foundation, use the
18 | GNU General Public License for most of our software; it applies also to
19 | any other work released this way by its authors. You can apply it to
20 | your programs, too.
21 |
22 | When we speak of free software, we are referring to freedom, not
23 | price. Our General Public Licenses are designed to make sure that you
24 | have the freedom to distribute copies of free software (and charge for
25 | them if you wish), that you receive source code or can get it if you
26 | want it, that you can change the software or use pieces of it in new
27 | free programs, and that you know you can do these things.
28 |
29 | To protect your rights, we need to prevent others from denying you
30 | these rights or asking you to surrender the rights. Therefore, you have
31 | certain responsibilities if you distribute copies of the software, or if
32 | you modify it: responsibilities to respect the freedom of others.
33 |
34 | For example, if you distribute copies of such a program, whether
35 | gratis or for a fee, you must pass on to the recipients the same
36 | freedoms that you received. You must make sure that they, too, receive
37 | or can get the source code. And you must show them these terms so they
38 | know their rights.
39 |
40 | Developers that use the GNU GPL protect your rights with two steps:
41 | (1) assert copyright on the software, and (2) offer you this License
42 | giving you legal permission to copy, distribute and/or modify it.
43 |
44 | For the developers' and authors' protection, the GPL clearly explains
45 | that there is no warranty for this free software. For both users' and
46 | authors' sake, the GPL requires that modified versions be marked as
47 | changed, so that their problems will not be attributed erroneously to
48 | authors of previous versions.
49 |
50 | Some devices are designed to deny users access to install or run
51 | modified versions of the software inside them, although the manufacturer
52 | can do so. This is fundamentally incompatible with the aim of
53 | protecting users' freedom to change the software. The systematic
54 | pattern of such abuse occurs in the area of products for individuals to
55 | use, which is precisely where it is most unacceptable. Therefore, we
56 | have designed this version of the GPL to prohibit the practice for those
57 | products. If such problems arise substantially in other domains, we
58 | stand ready to extend this provision to those domains in future versions
59 | of the GPL, as needed to protect the freedom of users.
60 |
61 | Finally, every program is threatened constantly by software patents.
62 | States should not allow patents to restrict development and use of
63 | software on general-purpose computers, but in those that do, we wish to
64 | avoid the special danger that patents applied to a free program could
65 | make it effectively proprietary. To prevent this, the GPL assures that
66 | patents cannot be used to render the program non-free.
67 |
68 | The precise terms and conditions for copying, distribution and
69 | modification follow.
70 |
71 | TERMS AND CONDITIONS
72 |
73 | 0. Definitions.
74 |
75 | "This License" refers to version 3 of the GNU General Public License.
76 |
77 | "Copyright" also means copyright-like laws that apply to other kinds of
78 | works, such as semiconductor masks.
79 |
80 | "The Program" refers to any copyrightable work licensed under this
81 | License. Each licensee is addressed as "you". "Licensees" and
82 | "recipients" may be individuals or organizations.
83 |
84 | To "modify" a work means to copy from or adapt all or part of the work
85 | in a fashion requiring copyright permission, other than the making of an
86 | exact copy. The resulting work is called a "modified version" of the
87 | earlier work or a work "based on" the earlier work.
88 |
89 | A "covered work" means either the unmodified Program or a work based
90 | on the Program.
91 |
92 | To "propagate" a work means to do anything with it that, without
93 | permission, would make you directly or secondarily liable for
94 | infringement under applicable copyright law, except executing it on a
95 | computer or modifying a private copy. Propagation includes copying,
96 | distribution (with or without modification), making available to the
97 | public, and in some countries other activities as well.
98 |
99 | To "convey" a work means any kind of propagation that enables other
100 | parties to make or receive copies. Mere interaction with a user through
101 | a computer network, with no transfer of a copy, is not conveying.
102 |
103 | An interactive user interface displays "Appropriate Legal Notices"
104 | to the extent that it includes a convenient and prominently visible
105 | feature that (1) displays an appropriate copyright notice, and (2)
106 | tells the user that there is no warranty for the work (except to the
107 | extent that warranties are provided), that licensees may convey the
108 | work under this License, and how to view a copy of this License. If
109 | the interface presents a list of user commands or options, such as a
110 | menu, a prominent item in the list meets this criterion.
111 |
112 | 1. Source Code.
113 |
114 | The "source code" for a work means the preferred form of the work
115 | for making modifications to it. "Object code" means any non-source
116 | form of a work.
117 |
118 | A "Standard Interface" means an interface that either is an official
119 | standard defined by a recognized standards body, or, in the case of
120 | interfaces specified for a particular programming language, one that
121 | is widely used among developers working in that language.
122 |
123 | The "System Libraries" of an executable work include anything, other
124 | than the work as a whole, that (a) is included in the normal form of
125 | packaging a Major Component, but which is not part of that Major
126 | Component, and (b) serves only to enable use of the work with that
127 | Major Component, or to implement a Standard Interface for which an
128 | implementation is available to the public in source code form. A
129 | "Major Component", in this context, means a major essential component
130 | (kernel, window system, and so on) of the specific operating system
131 | (if any) on which the executable work runs, or a compiler used to
132 | produce the work, or an object code interpreter used to run it.
133 |
134 | The "Corresponding Source" for a work in object code form means all
135 | the source code needed to generate, install, and (for an executable
136 | work) run the object code and to modify the work, including scripts to
137 | control those activities. However, it does not include the work's
138 | System Libraries, or general-purpose tools or generally available free
139 | programs which are used unmodified in performing those activities but
140 | which are not part of the work. For example, Corresponding Source
141 | includes interface definition files associated with source files for
142 | the work, and the source code for shared libraries and dynamically
143 | linked subprograms that the work is specifically designed to require,
144 | such as by intimate data communication or control flow between those
145 | subprograms and other parts of the work.
146 |
147 | The Corresponding Source need not include anything that users
148 | can regenerate automatically from other parts of the Corresponding
149 | Source.
150 |
151 | The Corresponding Source for a work in source code form is that
152 | same work.
153 |
154 | 2. Basic Permissions.
155 |
156 | All rights granted under this License are granted for the term of
157 | copyright on the Program, and are irrevocable provided the stated
158 | conditions are met. This License explicitly affirms your unlimited
159 | permission to run the unmodified Program. The output from running a
160 | covered work is covered by this License only if the output, given its
161 | content, constitutes a covered work. This License acknowledges your
162 | rights of fair use or other equivalent, as provided by copyright law.
163 |
164 | You may make, run and propagate covered works that you do not
165 | convey, without conditions so long as your license otherwise remains
166 | in force. You may convey covered works to others for the sole purpose
167 | of having them make modifications exclusively for you, or provide you
168 | with facilities for running those works, provided that you comply with
169 | the terms of this License in conveying all material for which you do
170 | not control copyright. Those thus making or running the covered works
171 | for you must do so exclusively on your behalf, under your direction
172 | and control, on terms that prohibit them from making any copies of
173 | your copyrighted material outside their relationship with you.
174 |
175 | Conveying under any other circumstances is permitted solely under
176 | the conditions stated below. Sublicensing is not allowed; section 10
177 | makes it unnecessary.
178 |
179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
180 |
181 | No covered work shall be deemed part of an effective technological
182 | measure under any applicable law fulfilling obligations under article
183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
184 | similar laws prohibiting or restricting circumvention of such
185 | measures.
186 |
187 | When you convey a covered work, you waive any legal power to forbid
188 | circumvention of technological measures to the extent such circumvention
189 | is effected by exercising rights under this License with respect to
190 | the covered work, and you disclaim any intention to limit operation or
191 | modification of the work as a means of enforcing, against the work's
192 | users, your or third parties' legal rights to forbid circumvention of
193 | technological measures.
194 |
195 | 4. Conveying Verbatim Copies.
196 |
197 | You may convey verbatim copies of the Program's source code as you
198 | receive it, in any medium, provided that you conspicuously and
199 | appropriately publish on each copy an appropriate copyright notice;
200 | keep intact all notices stating that this License and any
201 | non-permissive terms added in accord with section 7 apply to the code;
202 | keep intact all notices of the absence of any warranty; and give all
203 | recipients a copy of this License along with the Program.
204 |
205 | You may charge any price or no price for each copy that you convey,
206 | and you may offer support or warranty protection for a fee.
207 |
208 | 5. Conveying Modified Source Versions.
209 |
210 | You may convey a work based on the Program, or the modifications to
211 | produce it from the Program, in the form of source code under the
212 | terms of section 4, provided that you also meet all of these conditions:
213 |
214 | a) The work must carry prominent notices stating that you modified
215 | it, and giving a relevant date.
216 |
217 | b) The work must carry prominent notices stating that it is
218 | released under this License and any conditions added under section
219 | 7. This requirement modifies the requirement in section 4 to
220 | "keep intact all notices".
221 |
222 | c) You must license the entire work, as a whole, under this
223 | License to anyone who comes into possession of a copy. This
224 | License will therefore apply, along with any applicable section 7
225 | additional terms, to the whole of the work, and all its parts,
226 | regardless of how they are packaged. This License gives no
227 | permission to license the work in any other way, but it does not
228 | invalidate such permission if you have separately received it.
229 |
230 | d) If the work has interactive user interfaces, each must display
231 | Appropriate Legal Notices; however, if the Program has interactive
232 | interfaces that do not display Appropriate Legal Notices, your
233 | work need not make them do so.
234 |
235 | A compilation of a covered work with other separate and independent
236 | works, which are not by their nature extensions of the covered work,
237 | and which are not combined with it such as to form a larger program,
238 | in or on a volume of a storage or distribution medium, is called an
239 | "aggregate" if the compilation and its resulting copyright are not
240 | used to limit the access or legal rights of the compilation's users
241 | beyond what the individual works permit. Inclusion of a covered work
242 | in an aggregate does not cause this License to apply to the other
243 | parts of the aggregate.
244 |
245 | 6. Conveying Non-Source Forms.
246 |
247 | You may convey a covered work in object code form under the terms
248 | of sections 4 and 5, provided that you also convey the
249 | machine-readable Corresponding Source under the terms of this License,
250 | in one of these ways:
251 |
252 | a) Convey the object code in, or embodied in, a physical product
253 | (including a physical distribution medium), accompanied by the
254 | Corresponding Source fixed on a durable physical medium
255 | customarily used for software interchange.
256 |
257 | b) Convey the object code in, or embodied in, a physical product
258 | (including a physical distribution medium), accompanied by a
259 | written offer, valid for at least three years and valid for as
260 | long as you offer spare parts or customer support for that product
261 | model, to give anyone who possesses the object code either (1) a
262 | copy of the Corresponding Source for all the software in the
263 | product that is covered by this License, on a durable physical
264 | medium customarily used for software interchange, for a price no
265 | more than your reasonable cost of physically performing this
266 | conveying of source, or (2) access to copy the
267 | Corresponding Source from a network server at no charge.
268 |
269 | c) Convey individual copies of the object code with a copy of the
270 | written offer to provide the Corresponding Source. This
271 | alternative is allowed only occasionally and noncommercially, and
272 | only if you received the object code with such an offer, in accord
273 | with subsection 6b.
274 |
275 | d) Convey the object code by offering access from a designated
276 | place (gratis or for a charge), and offer equivalent access to the
277 | Corresponding Source in the same way through the same place at no
278 | further charge. You need not require recipients to copy the
279 | Corresponding Source along with the object code. If the place to
280 | copy the object code is a network server, the Corresponding Source
281 | may be on a different server (operated by you or a third party)
282 | that supports equivalent copying facilities, provided you maintain
283 | clear directions next to the object code saying where to find the
284 | Corresponding Source. Regardless of what server hosts the
285 | Corresponding Source, you remain obligated to ensure that it is
286 | available for as long as needed to satisfy these requirements.
287 |
288 | e) Convey the object code using peer-to-peer transmission, provided
289 | you inform other peers where the object code and Corresponding
290 | Source of the work are being offered to the general public at no
291 | charge under subsection 6d.
292 |
293 | A separable portion of the object code, whose source code is excluded
294 | from the Corresponding Source as a System Library, need not be
295 | included in conveying the object code work.
296 |
297 | A "User Product" is either (1) a "consumer product", which means any
298 | tangible personal property which is normally used for personal, family,
299 | or household purposes, or (2) anything designed or sold for incorporation
300 | into a dwelling. In determining whether a product is a consumer product,
301 | doubtful cases shall be resolved in favor of coverage. For a particular
302 | product received by a particular user, "normally used" refers to a
303 | typical or common use of that class of product, regardless of the status
304 | of the particular user or of the way in which the particular user
305 | actually uses, or expects or is expected to use, the product. A product
306 | is a consumer product regardless of whether the product has substantial
307 | commercial, industrial or non-consumer uses, unless such uses represent
308 | the only significant mode of use of the product.
309 |
310 | "Installation Information" for a User Product means any methods,
311 | procedures, authorization keys, or other information required to install
312 | and execute modified versions of a covered work in that User Product from
313 | a modified version of its Corresponding Source. The information must
314 | suffice to ensure that the continued functioning of the modified object
315 | code is in no case prevented or interfered with solely because
316 | modification has been made.
317 |
318 | If you convey an object code work under this section in, or with, or
319 | specifically for use in, a User Product, and the conveying occurs as
320 | part of a transaction in which the right of possession and use of the
321 | User Product is transferred to the recipient in perpetuity or for a
322 | fixed term (regardless of how the transaction is characterized), the
323 | Corresponding Source conveyed under this section must be accompanied
324 | by the Installation Information. But this requirement does not apply
325 | if neither you nor any third party retains the ability to install
326 | modified object code on the User Product (for example, the work has
327 | been installed in ROM).
328 |
329 | The requirement to provide Installation Information does not include a
330 | requirement to continue to provide support service, warranty, or updates
331 | for a work that has been modified or installed by the recipient, or for
332 | the User Product in which it has been modified or installed. Access to a
333 | network may be denied when the modification itself materially and
334 | adversely affects the operation of the network or violates the rules and
335 | protocols for communication across the network.
336 |
337 | Corresponding Source conveyed, and Installation Information provided,
338 | in accord with this section must be in a format that is publicly
339 | documented (and with an implementation available to the public in
340 | source code form), and must require no special password or key for
341 | unpacking, reading or copying.
342 |
343 | 7. Additional Terms.
344 |
345 | "Additional permissions" are terms that supplement the terms of this
346 | License by making exceptions from one or more of its conditions.
347 | Additional permissions that are applicable to the entire Program shall
348 | be treated as though they were included in this License, to the extent
349 | that they are valid under applicable law. If additional permissions
350 | apply only to part of the Program, that part may be used separately
351 | under those permissions, but the entire Program remains governed by
352 | this License without regard to the additional permissions.
353 |
354 | When you convey a copy of a covered work, you may at your option
355 | remove any additional permissions from that copy, or from any part of
356 | it. (Additional permissions may be written to require their own
357 | removal in certain cases when you modify the work.) You may place
358 | additional permissions on material, added by you to a covered work,
359 | for which you have or can give appropriate copyright permission.
360 |
361 | Notwithstanding any other provision of this License, for material you
362 | add to a covered work, you may (if authorized by the copyright holders of
363 | that material) supplement the terms of this License with terms:
364 |
365 | a) Disclaiming warranty or limiting liability differently from the
366 | terms of sections 15 and 16 of this License; or
367 |
368 | b) Requiring preservation of specified reasonable legal notices or
369 | author attributions in that material or in the Appropriate Legal
370 | Notices displayed by works containing it; or
371 |
372 | c) Prohibiting misrepresentation of the origin of that material, or
373 | requiring that modified versions of such material be marked in
374 | reasonable ways as different from the original version; or
375 |
376 | d) Limiting the use for publicity purposes of names of licensors or
377 | authors of the material; or
378 |
379 | e) Declining to grant rights under trademark law for use of some
380 | trade names, trademarks, or service marks; or
381 |
382 | f) Requiring indemnification of licensors and authors of that
383 | material by anyone who conveys the material (or modified versions of
384 | it) with contractual assumptions of liability to the recipient, for
385 | any liability that these contractual assumptions directly impose on
386 | those licensors and authors.
387 |
388 | All other non-permissive additional terms are considered "further
389 | restrictions" within the meaning of section 10. If the Program as you
390 | received it, or any part of it, contains a notice stating that it is
391 | governed by this License along with a term that is a further
392 | restriction, you may remove that term. If a license document contains
393 | a further restriction but permits relicensing or conveying under this
394 | License, you may add to a covered work material governed by the terms
395 | of that license document, provided that the further restriction does
396 | not survive such relicensing or conveying.
397 |
398 | If you add terms to a covered work in accord with this section, you
399 | must place, in the relevant source files, a statement of the
400 | additional terms that apply to those files, or a notice indicating
401 | where to find the applicable terms.
402 |
403 | Additional terms, permissive or non-permissive, may be stated in the
404 | form of a separately written license, or stated as exceptions;
405 | the above requirements apply either way.
406 |
407 | 8. Termination.
408 |
409 | You may not propagate or modify a covered work except as expressly
410 | provided under this License. Any attempt otherwise to propagate or
411 | modify it is void, and will automatically terminate your rights under
412 | this License (including any patent licenses granted under the third
413 | paragraph of section 11).
414 |
415 | However, if you cease all violation of this License, then your
416 | license from a particular copyright holder is reinstated (a)
417 | provisionally, unless and until the copyright holder explicitly and
418 | finally terminates your license, and (b) permanently, if the copyright
419 | holder fails to notify you of the violation by some reasonable means
420 | prior to 60 days after the cessation.
421 |
422 | Moreover, your license from a particular copyright holder is
423 | reinstated permanently if the copyright holder notifies you of the
424 | violation by some reasonable means, this is the first time you have
425 | received notice of violation of this License (for any work) from that
426 | copyright holder, and you cure the violation prior to 30 days after
427 | your receipt of the notice.
428 |
429 | Termination of your rights under this section does not terminate the
430 | licenses of parties who have received copies or rights from you under
431 | this License. If your rights have been terminated and not permanently
432 | reinstated, you do not qualify to receive new licenses for the same
433 | material under section 10.
434 |
435 | 9. Acceptance Not Required for Having Copies.
436 |
437 | You are not required to accept this License in order to receive or
438 | run a copy of the Program. Ancillary propagation of a covered work
439 | occurring solely as a consequence of using peer-to-peer transmission
440 | to receive a copy likewise does not require acceptance. However,
441 | nothing other than this License grants you permission to propagate or
442 | modify any covered work. These actions infringe copyright if you do
443 | not accept this License. Therefore, by modifying or propagating a
444 | covered work, you indicate your acceptance of this License to do so.
445 |
446 | 10. Automatic Licensing of Downstream Recipients.
447 |
448 | Each time you convey a covered work, the recipient automatically
449 | receives a license from the original licensors, to run, modify and
450 | propagate that work, subject to this License. You are not responsible
451 | for enforcing compliance by third parties with this License.
452 |
453 | An "entity transaction" is a transaction transferring control of an
454 | organization, or substantially all assets of one, or subdividing an
455 | organization, or merging organizations. If propagation of a covered
456 | work results from an entity transaction, each party to that
457 | transaction who receives a copy of the work also receives whatever
458 | licenses to the work the party's predecessor in interest had or could
459 | give under the previous paragraph, plus a right to possession of the
460 | Corresponding Source of the work from the predecessor in interest, if
461 | the predecessor has it or can get it with reasonable efforts.
462 |
463 | You may not impose any further restrictions on the exercise of the
464 | rights granted or affirmed under this License. For example, you may
465 | not impose a license fee, royalty, or other charge for exercise of
466 | rights granted under this License, and you may not initiate litigation
467 | (including a cross-claim or counterclaim in a lawsuit) alleging that
468 | any patent claim is infringed by making, using, selling, offering for
469 | sale, or importing the Program or any portion of it.
470 |
471 | 11. Patents.
472 |
473 | A "contributor" is a copyright holder who authorizes use under this
474 | License of the Program or a work on which the Program is based. The
475 | work thus licensed is called the contributor's "contributor version".
476 |
477 | A contributor's "essential patent claims" are all patent claims
478 | owned or controlled by the contributor, whether already acquired or
479 | hereafter acquired, that would be infringed by some manner, permitted
480 | by this License, of making, using, or selling its contributor version,
481 | but do not include claims that would be infringed only as a
482 | consequence of further modification of the contributor version. For
483 | purposes of this definition, "control" includes the right to grant
484 | patent sublicenses in a manner consistent with the requirements of
485 | this License.
486 |
487 | Each contributor grants you a non-exclusive, worldwide, royalty-free
488 | patent license under the contributor's essential patent claims, to
489 | make, use, sell, offer for sale, import and otherwise run, modify and
490 | propagate the contents of its contributor version.
491 |
492 | In the following three paragraphs, a "patent license" is any express
493 | agreement or commitment, however denominated, not to enforce a patent
494 | (such as an express permission to practice a patent or covenant not to
495 | sue for patent infringement). To "grant" such a patent license to a
496 | party means to make such an agreement or commitment not to enforce a
497 | patent against the party.
498 |
499 | If you convey a covered work, knowingly relying on a patent license,
500 | and the Corresponding Source of the work is not available for anyone
501 | to copy, free of charge and under the terms of this License, through a
502 | publicly available network server or other readily accessible means,
503 | then you must either (1) cause the Corresponding Source to be so
504 | available, or (2) arrange to deprive yourself of the benefit of the
505 | patent license for this particular work, or (3) arrange, in a manner
506 | consistent with the requirements of this License, to extend the patent
507 | license to downstream recipients. "Knowingly relying" means you have
508 | actual knowledge that, but for the patent license, your conveying the
509 | covered work in a country, or your recipient's use of the covered work
510 | in a country, would infringe one or more identifiable patents in that
511 | country that you have reason to believe are valid.
512 |
513 | If, pursuant to or in connection with a single transaction or
514 | arrangement, you convey, or propagate by procuring conveyance of, a
515 | covered work, and grant a patent license to some of the parties
516 | receiving the covered work authorizing them to use, propagate, modify
517 | or convey a specific copy of the covered work, then the patent license
518 | you grant is automatically extended to all recipients of the covered
519 | work and works based on it.
520 |
521 | A patent license is "discriminatory" if it does not include within
522 | the scope of its coverage, prohibits the exercise of, or is
523 | conditioned on the non-exercise of one or more of the rights that are
524 | specifically granted under this License. You may not convey a covered
525 | work if you are a party to an arrangement with a third party that is
526 | in the business of distributing software, under which you make payment
527 | to the third party based on the extent of your activity of conveying
528 | the work, and under which the third party grants, to any of the
529 | parties who would receive the covered work from you, a discriminatory
530 | patent license (a) in connection with copies of the covered work
531 | conveyed by you (or copies made from those copies), or (b) primarily
532 | for and in connection with specific products or compilations that
533 | contain the covered work, unless you entered into that arrangement,
534 | or that patent license was granted, prior to 28 March 2007.
535 |
536 | Nothing in this License shall be construed as excluding or limiting
537 | any implied license or other defenses to infringement that may
538 | otherwise be available to you under applicable patent law.
539 |
540 | 12. No Surrender of Others' Freedom.
541 |
542 | If conditions are imposed on you (whether by court order, agreement or
543 | otherwise) that contradict the conditions of this License, they do not
544 | excuse you from the conditions of this License. If you cannot convey a
545 | covered work so as to satisfy simultaneously your obligations under this
546 | License and any other pertinent obligations, then as a consequence you may
547 | not convey it at all. For example, if you agree to terms that obligate you
548 | to collect a royalty for further conveying from those to whom you convey
549 | the Program, the only way you could satisfy both those terms and this
550 | License would be to refrain entirely from conveying the Program.
551 |
552 | 13. Use with the GNU Affero General Public License.
553 |
554 | Notwithstanding any other provision of this License, you have
555 | permission to link or combine any covered work with a work licensed
556 | under version 3 of the GNU Affero General Public License into a single
557 | combined work, and to convey the resulting work. The terms of this
558 | License will continue to apply to the part which is the covered work,
559 | but the special requirements of the GNU Affero General Public License,
560 | section 13, concerning interaction through a network will apply to the
561 | combination as such.
562 |
563 | 14. Revised Versions of this License.
564 |
565 | The Free Software Foundation may publish revised and/or new versions of
566 | the GNU General Public License from time to time. Such new versions will
567 | be similar in spirit to the present version, but may differ in detail to
568 | address new problems or concerns.
569 |
570 | Each version is given a distinguishing version number. If the
571 | Program specifies that a certain numbered version of the GNU General
572 | Public License "or any later version" applies to it, you have the
573 | option of following the terms and conditions either of that numbered
574 | version or of any later version published by the Free Software
575 | Foundation. If the Program does not specify a version number of the
576 | GNU General Public License, you may choose any version ever published
577 | by the Free Software Foundation.
578 |
579 | If the Program specifies that a proxy can decide which future
580 | versions of the GNU General Public License can be used, that proxy's
581 | public statement of acceptance of a version permanently authorizes you
582 | to choose that version for the Program.
583 |
584 | Later license versions may give you additional or different
585 | permissions. However, no additional obligations are imposed on any
586 | author or copyright holder as a result of your choosing to follow a
587 | later version.
588 |
589 | 15. Disclaimer of Warranty.
590 |
591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
599 |
600 | 16. Limitation of Liability.
601 |
602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
610 | SUCH DAMAGES.
611 |
612 | 17. Interpretation of Sections 15 and 16.
613 |
614 | If the disclaimer of warranty and limitation of liability provided
615 | above cannot be given local legal effect according to their terms,
616 | reviewing courts shall apply local law that most closely approximates
617 | an absolute waiver of all civil liability in connection with the
618 | Program, unless a warranty or assumption of liability accompanies a
619 | copy of the Program in return for a fee.
620 |
621 | END OF TERMS AND CONDITIONS
622 |
623 | How to Apply These Terms to Your New Programs
624 |
625 | If you develop a new program, and you want it to be of the greatest
626 | possible use to the public, the best way to achieve this is to make it
627 | free software which everyone can redistribute and change under these terms.
628 |
629 | To do so, attach the following notices to the program. It is safest
630 | to attach them to the start of each source file to most effectively
631 | state the exclusion of warranty; and each file should have at least
632 | the "copyright" line and a pointer to where the full notice is found.
633 |
634 | <one line to give the program's name and a brief idea of what it does.>
635 | Copyright (C) <year> <name of author>
636 |
637 | This program is free software: you can redistribute it and/or modify
638 | it under the terms of the GNU General Public License as published by
639 | the Free Software Foundation, either version 3 of the License, or
640 | (at your option) any later version.
641 |
642 | This program is distributed in the hope that it will be useful,
643 | but WITHOUT ANY WARRANTY; without even the implied warranty of
644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
645 | GNU General Public License for more details.
646 |
647 | You should have received a copy of the GNU General Public License
648 | along with this program. If not, see <https://www.gnu.org/licenses/>.
649 |
650 | Also add information on how to contact you by electronic and paper mail.
651 |
652 | If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 |
655 | <program> Copyright (C) <year> <name of author>
656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 | This is free software, and you are welcome to redistribute it
658 | under certain conditions; type `show c' for details.
659 |
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License. Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 |
664 | You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | <https://www.gnu.org/licenses/>.
668 |
669 | The GNU General Public License does not permit incorporating your program
670 | into proprietary programs. If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library. If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License. But first, please read
674 | <https://www.gnu.org/licenses/why-not-lgpl.html>.
675 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | all: article program
2 | article: write-yourself-a-git.html
3 | program: wyag libwyag.py
4 | push: .last_push
5 |
6 | .PHONY: all article clean program push test
7 |
8 | write-yourself-a-git.html: write-yourself-a-git.org wyag libwyag.py
9 | 	emacs --batch write-yourself-a-git.org \
10 | 		--eval "(add-to-list 'load-path (expand-file-name \"./lib/htmlize\"))" \
11 | 		--eval "(setq org-babel-inline-result-wrap \"%s\")" \
12 | 		--eval "(setq org-confirm-babel-evaluate nil)" \
13 | 		--eval "(setq python-indent-guess-indent-offset nil)" \
14 | 		--eval "(setq org-export-with-broken-links t)" \
15 | 		--eval "(setq org-html-htmlize-output-type 'css)" \
16 | 		-f org-html-export-to-html
17 |
18 | write-yourself-a-git.pdf: write-yourself-a-git.org wyag libwyag.py
19 | 	emacs --batch write-yourself-a-git.org \
20 | 		--eval "(add-to-list 'load-path (expand-file-name \"./lib/htmlize\"))" \
21 | 		--eval "(setq org-babel-inline-result-wrap \"%s\")" \
22 | 		--eval "(setq org-confirm-babel-evaluate nil)" \
23 | 		--eval "(setq python-indent-guess-indent-offset nil)" \
24 | 		--eval "(setq org-export-with-broken-links t)" \
25 | 		-f org-latex-export-to-pdf
26 |
27 | wyag libwyag.py: write-yourself-a-git.org
28 | 	emacs --batch write-yourself-a-git.org -f org-babel-tangle
29 |
30 | wyag.zip: wyag libwyag.py LICENSE
31 | 	zip -r wyag.zip wyag libwyag.py LICENSE
32 |
33 | clean:
34 | 	rm -f wyag libwyag.py write-yourself-a-git.html .last_push wyag.zip
35 |
36 | test: wyag libwyag.py
37 | 	./wyag-tests.sh
38 |
39 | .last_push: wyag.zip write-yourself-a-git.html
40 | 	scp -r write-yourself-a-git.html k9.thb.lt\:/var/www/wyag.thb.lt/index.html; \
41 | 	scp -r wyag.zip lib/org-html-themes/src k9.thb.lt:/var/www/wyag.thb.lt/; \
42 | 	touch .last_push
43 |
--------------------------------------------------------------------------------
/README.org:
--------------------------------------------------------------------------------
1 | #+TITLE: Write yourself a Git!
2 |
3 | Source repository for the [[https://wyag.thb.lt][Write yourself a Git]] article.
4 |
5 | Wyag is a [[https://en.wikipedia.org/wiki/Literate_programming][literate program]] written in [[https://orgmode.org/][org-mode]], which means the same source document can be used to produce the HTML version of the article as published on [[https://wyag.thb.lt]] and the program itself. You only need a reasonably recent Emacs and the =make= program, then:
6 |
7 | #+begin_src shell
8 | $ git clone --recursive https://github.com/thblt/write-yourself-a-git
9 | $ cd write-yourself-a-git
10 | $ make all
11 | #+end_src
12 |
--------------------------------------------------------------------------------
/write-yourself-a-git.org:
--------------------------------------------------------------------------------
1 | #+TITLE: Write yourself a Git!
2 | #+AUTHOR: [[mailto:thibault@thb.lt][Thibault Polge]]
3 |
4 | #
5 | # This file is part of wyag
6 | #
7 | # Copyright (c) 2018-2023 Thibault Polge
8 | # All rights reserved
9 | #
10 | # Wyag is free software: you can redistribute it and/or modify it
11 | # under the terms of the GNU General Public License as published by
12 | # the Free Software Foundation, either version 3 of the License, or
13 | # (at your option) any later version.
14 | #
15 | # Wyag is distributed in the hope that it will be useful, but WITHOUT
16 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
17 | # or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
18 | # License for more details.
19 | #
20 | # You should have received a copy of the GNU General Public License
21 | # along with Wyag. If not, see <https://www.gnu.org/licenses/>.
22 | #
23 |
24 | #+LANGUAGE: en
25 | #+OPTIONS: ':t ^:nil
26 |
27 | #+SETUPFILE: lib/org-html-themes/org/theme-readtheorg-local.setup
28 |
29 | * Introduction
30 | :PROPERTIES:
31 | :CUSTOM_ID: intro
32 | :END:
33 |
34 | #+begin_note
35 | Recent changes (January 2025):
36 |
37 | - =OrderedDict= instances have been replaced by regular dicts.
38 | - Most string formatting has been replaced with f-strings.
39 | - Multiple bugs fixed in =tag_create=.
40 | #+end_note
41 |
42 | This article is an attempt at explaining the [[https://git-scm.com/][Git version control
43 | system]] from the bottom up, that is, starting at the most fundamental
44 | level moving up from there. This does not sound too easy, and has
45 | been attempted multiple times with questionable success. But there's
46 | an easy way: all it takes to understand Git internals is to
47 | reimplement Git from scratch.
48 |
49 | No, don't run.
50 |
51 | #+NAME: slocs
52 | #+begin_src emacs-lisp :exports none
53 | ;; Compute line numbers. We'll use that in a second.
54 | (shell-command-to-string
55 | "grep -v '^$' libwyag.py wyag | grep -v ' *#' | wc -l")
56 | #+end_src
57 |
58 | It's not a joke, and it's really not complicated: if you read this
59 | article top to bottom and write the code (or just [[./wyag.zip][download it]] as a ZIP
60 | --- but you should write the code yourself, really), you'll end up
61 | with a program, called =wyag=, that will implement all the fundamental
62 | features of git: =init=, =add=, =rm=, =status=, =commit=, =log=... in
63 | a way that is perfectly compatible with =git= itself --- compatible
64 | enough that the commit finally adding the section on commits was
65 | [[https://github.com/thblt/write-yourself-a-git/commit/ed26daffb400b2be5f30e044c3237d220226d867][created by wyag itself, not git]]. And all that in exactly call_slocs()
66 | lines of very simple Python code.
67 |
68 | But isn't Git too complex for that? That Git is complex is, in my
69 | opinion, a misconception. Git is a large program, with a lot of
70 | features, that's true. But the core of that program is actually
71 | extremely simple, and its apparent complexity stems first from the
72 | fact it's often deeply counterintuitive (and [[https://byorgey.wordpress.com/2009/01/12/abstraction-intuition-and-the-monad-tutorial-fallacy/][Git is a burrito]] blog
73 | posts probably don't help). But maybe what makes Git the most
74 | confusing is the extreme simplicity /and/ power of its core model. The
75 | combination of core simplicity and powerful applications often makes
76 | things really hard to grasp, because of the mental jump required to
77 | derive the variety of applications from the essential simplicity of
78 | the fundamental abstraction (monads, anyone?)
79 |
80 | Implementing Git will expose its fundamentals in all their naked
81 | glory.
82 |
83 | *What to expect?* This article will implement and explain in great
84 | detail (if something is not clear, please [[#feedback][report it]]!) a very
85 | simplified version of Git core commands. I will keep the code simple
86 | and to the point, so =wyag= won't come anywhere near the power of the
87 | real git command-line --- but what's missing will be obvious, and
88 | trivial to implement by anyone who wants to give it a try. “Upgrading
89 | wyag to a full-featured git library and CLI is left as an exercise to
90 | the reader”, as they say.
91 |
92 | More precisely, we'll implement:
93 |
94 | #+begin_src emacs-lisp :exports results :results list
95 | (mapcar
96 | (lambda (cmd)
97 | (format "=%s= ([[#cmd-%s][wyag source]]) [[https://git-scm.com/docs/git-%s][git man page]]" cmd cmd cmd))
98 | (list
99 | "add"
100 | "cat-file"
101 | "check-ignore"
102 | "checkout"
103 | "commit"
104 | "hash-object"
105 | "init"
106 | "log"
107 | "ls-files"
108 | "ls-tree"
109 | "rev-parse"
110 | "rm"
111 | "show-ref"
112 | "status"
113 | "tag"))
114 | #+end_src
115 |
116 | You're not going to need to know much to follow this article: just
117 | some basic Git (obviously), some basic Python, some basic shell.
118 |
119 | + First, I'm only going to assume some level of familiarity with the
120 | most basic *git commands* --- nothing like an expert level, but if
121 | you've never used =init=, =add=, =rm=, =commit= or =checkout=, you will be
122 | lost.
123 | + Language-wise, wyag will be implemented in *Python*. Again, I won't
124 | use anything too fancy, and Python looks like pseudo-code anyways,
125 | so it will be easy to follow (ironically, the most complicated part
126 | will be the command-line arguments parsing logic, and you don't
127 | really need to understand that). Yet, if you know programming but
128 | have never done any Python, I suggest you find a crash course
129 | somewhere on the internet just to get acquainted with the language.
130 | + =wyag= and =git= are terminal programs. I assume you know your way
131 | inside a Unix terminal. Again, you don't need to be a l77t h4x0r,
132 | but =cd=, =ls=, =rm=, =tree= and their friends should be in your toolbox.
133 |
134 | #+BEGIN_warning
135 | *Note for Windows users*
136 |
137 | =wyag= should run on any Unix-like system with a Python interpreter,
138 | but I have absolutely no idea how it will behave on Windows. The
139 | test suite absolutely requires a bash-compatible shell, which I
140 | assume the WSL can provide. Also, if you are using WSL, make sure
141 | your =wyag= file uses Unix-style line endings ([[https://stackoverflow.com/questions/48692741/how-can-i-make-all-line-endings-eols-in-all-files-in-visual-studio-code-unix][See this
142 | StackOverflow solution if you use VS Code]]). Feedback from Windows
143 | users would be appreciated!
144 | #+END_warning
145 |
146 | #+begin_note
147 | *Acknowledgments*
148 |
149 | This article benefited from significant contributions from multiple
150 | people, and I'm grateful to them all. Special thanks to:
151 |
152 | - GitHub user [[https://github.com/tammoippen][tammoippen]], who first drafted the =tag_create=
153 | function I had simply… forgotten to write (that was [[https://github.com/thblt/write-yourself-a-git/issues/9][#9]]).
154 | - GitHub user [[https://github.com/hjlarry][hjlarry]] fixed multiple issues in [[https://github.com/thblt/write-yourself-a-git/pull/22][#22]].
155 | - GitHub user [[https://github.com/cutebbb][cutebbb]] implemented the first version of ls-files in
156 | [[https://github.com/thblt/write-yourself-a-git/pull/32/][#32]], and by doing so finally brought wyag to the wonders of the
157 | staging area!
158 | #+end_note
159 |
160 | * Getting started
161 | :PROPERTIES:
162 | :CUSTOM_ID: getting-started
163 | :END:
164 |
165 | You're going to need Python 3.10 or higher, along with your
166 | favorite text editor. We won't need third party packages or
167 | virtualenvs, or anything besides a regular Python interpreter:
168 | everything we need is in Python's standard library.
169 |
170 | We'll split the code into two files:
171 |
172 | - An executable, called =wyag=;
173 | - A Python library, called =libwyag.py=.
174 |
175 | Now, every software project starts with a boatload of boilerplate, so
176 | let's get this over with.
177 |
178 | We'll begin by creating the (very short) executable. Create a new
179 | file called =wyag= in your text editor, and copy the following few
180 | lines:
181 |
182 | #+BEGIN_SRC python :tangle wyag :tangle-mode (identity #o755)
183 | #!/usr/bin/env python3
184 |
185 | import libwyag
186 | libwyag.main()
187 | #+END_SRC
188 |
189 | Then make it executable:
190 |
191 | #+BEGIN_EXAMPLE
192 | $ chmod +x wyag
193 | #+END_EXAMPLE
194 |
195 | You're done!
196 |
197 | # This is a noweb template to include in all three source files.
198 | #+NAME: file_header
199 | #+BEGIN_SRC shell :exports none
200 | This file is part of wyag
201 | Copyright (c) 2018-2023 Thibault Polge
202 | All rights reserved
203 |
204 | Wyag is free software: you can redistribute it and/or modify it
205 | under the terms of the GNU General Public License as published by
206 | the Free Software Foundation, either version 3 of the License, or
207 | (at your option) any later version.
208 |
209 | Wyag is distributed in the hope that it will be useful, but WITHOUT
210 | ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
211 | or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
212 | License for more details.
213 |
214 | You should have received a copy of the GNU General Public License
215 | along with Wyag. If not, see <https://www.gnu.org/licenses/>.
216 |
217 | #+END_SRC
218 |
219 | #+BEGIN_SRC python :tangle libwyag.py :exports none :noweb yes
220 | # <<file_header>>
221 |
222 | #+END_SRC
223 |
224 | #+BEGIN_SRC python :tangle wyag :exports none :noweb yes
225 |
226 | # <<file_header>>
227 | #+END_SRC
228 |
229 | Now for the library. It must be called =libwyag.py=, and be in the
230 | same directory as the =wyag= executable. Begin by opening the empty
231 | =libwyag.py= in your text editor.
232 |
233 | We're first going to need a bunch of imports (just copy each import,
234 | or merge them all in a single line):
235 |
236 | - Git is a CLI application, so we'll need something to parse
237 | command-line arguments. Python provides a cool module called
238 | [[https://docs.python.org/3/library/argparse.html][argparse]] that can do 99% of the job for us.
239 |
240 | #+BEGIN_SRC python :tangle libwyag.py
241 | import argparse
242 | #+END_SRC
243 |
244 | - Git uses a configuration file format that is basically Microsoft's
245 | INI format. The [[https://docs.python.org/3/library/configparser.html][configparser]] module can read and write these
246 | files.
247 |
248 | #+BEGIN_SRC python :tangle libwyag.py
249 | import configparser
250 | #+END_SRC
251 |
252 | - We'll be doing some date/time manipulation:
253 |
254 | #+BEGIN_SRC python :tangle libwyag.py
255 | from datetime import datetime
256 | #+END_SRC
257 |
258 | - We'll need, just once, to read the users/group database on Unix
259 | (=grp= is for groups, =pwd= for users). This is because git saves
260 | the numerical owner/group ID of files, and we'll want to display
261 | that nicely (as text):
262 |
263 | #+BEGIN_SRC python :tangle libwyag.py
264 | import grp, pwd
265 | #+END_SRC
266 |
267 | - To support =.gitignore=, we'll need to match filenames against
268 | patterns like =*.txt=. Filename matching is in… =fnmatch=:
269 |
270 | #+BEGIN_SRC python :tangle libwyag.py
271 | from fnmatch import fnmatch
272 | #+END_SRC
273 |
274 | - Git uses the SHA-1 function quite extensively. In Python, it's in [[https://docs.python.org/3/library/hashlib.html][hashlib]].
275 |
276 | #+BEGIN_SRC python :tangle libwyag.py
277 | import hashlib
278 | #+END_SRC
279 |
280 | - Just one function from [[https://docs.python.org/3/library/math.html][math]]:
281 |
282 | #+BEGIN_SRC python :tangle libwyag.py
283 | from math import ceil
284 | #+END_SRC
285 |
286 | - [[https://docs.python.org/3/library/os.html][os]] and [[https://docs.python.org/3/library/os.path.html][os.path]] provide some nice filesystem abstraction routines.
287 |
288 | #+BEGIN_SRC python :tangle libwyag.py
289 | import os
290 | #+END_SRC
291 |
292 | - We use /just a bit/ of regular expressions:
293 |
294 | #+BEGIN_SRC python :tangle libwyag.py
295 | import re
296 | #+END_SRC
297 |
298 | - We also need [[https://docs.python.org/3/library/sys.html][sys]] to access the actual command-line arguments (in =sys.argv=):
299 |
300 | #+BEGIN_SRC python :tangle libwyag.py
301 | import sys
302 | #+END_SRC
303 |
304 | - Git compresses everything using zlib. Python [[https://docs.python.org/3/library/zlib.html][has that]], too:
305 |
306 | #+BEGIN_SRC python :tangle libwyag.py
307 | import zlib
308 | #+END_SRC
309 |
310 | Imports are done. We'll be working with command-line arguments a lot.
311 | Python provides a simple yet reasonably powerful parsing library,
312 | =argparse=. It's a nice library, but its interface may not be the
313 | most intuitive ever; if needed, refer to its [[https://docs.python.org/3/library/argparse.html][documentation]].
314 |
315 | #+BEGIN_SRC python :tangle libwyag.py
316 | argparser = argparse.ArgumentParser(description="The stupidest content tracker")
317 | #+END_SRC
318 |
319 | We'll need to handle subcommands (as in git: =init=, =commit=, etc.)
320 | In argparse slang, these are called "subparsers". At this point we
321 | only need to declare that our CLI will use some, and that every
322 | invocation will actually /require/ one --- you don't just call =git=,
323 | you call =git COMMAND=.
324 |
325 | #+BEGIN_SRC python :tangle libwyag.py
326 | argsubparsers = argparser.add_subparsers(title="Commands", dest="command")
327 | argsubparsers.required = True
328 | #+END_SRC
329 |
330 | The ~dest="command"~ argument states that the name of the chosen
331 | subparser will be returned as a string in a field called =command=.
332 | So we just need to read this string and call the correct function
333 | accordingly. By convention, I'll call these functions "bridge
334 | functions" and prefix their names with =cmd_=. Bridge functions
335 | take the parsed arguments as their unique parameter, and are
336 | responsible for processing and validating them before executing the
337 | actual command.
338 |
339 | #+BEGIN_SRC python :tangle libwyag.py
340 | def main(argv=sys.argv[1:]):
341 |     args = argparser.parse_args(argv)
342 |     match args.command:
343 |         case "add"          : cmd_add(args)
344 |         case "cat-file"     : cmd_cat_file(args)
345 |         case "check-ignore" : cmd_check_ignore(args)
346 |         case "checkout"     : cmd_checkout(args)
347 |         case "commit"       : cmd_commit(args)
348 |         case "hash-object"  : cmd_hash_object(args)
349 |         case "init"         : cmd_init(args)
350 |         case "log"          : cmd_log(args)
351 |         case "ls-files"     : cmd_ls_files(args)
352 |         case "ls-tree"      : cmd_ls_tree(args)
353 |         case "rev-parse"    : cmd_rev_parse(args)
354 |         case "rm"           : cmd_rm(args)
355 |         case "show-ref"     : cmd_show_ref(args)
356 |         case "status"       : cmd_status(args)
357 |         case "tag"          : cmd_tag(args)
358 |         case _              : print("Bad command.")
359 | #+END_SRC
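
To see the =dest="command"= mechanism in isolation, here is a minimal sketch on a toy parser, separate from wyag's real one. The program name =demo=, the =dispatch= helper and the two pretend subcommands are assumptions made up for this illustration; only the =argparse= calls themselves are the ones used above.

#+BEGIN_SRC python
import argparse

# Toy parser for illustration only; "demo" and its subcommands are
# hypothetical and not part of wyag.
parser = argparse.ArgumentParser(prog="demo")
subparsers = parser.add_subparsers(title="Commands", dest="command")
subparsers.required = True

init_parser = subparsers.add_parser("init", help="Pretend to create a repository.")
init_parser.add_argument("path", nargs="?", default=".")

subparsers.add_parser("status", help="Pretend to show the status.")

def dispatch(argv):
    """Parse argv and return a string describing what would run."""
    args = parser.parse_args(argv)
    # args.command holds the name of the chosen subparser, as a string.
    match args.command:
        case "init":
            return f"init in {args.path}"
        case "status":
            return "status"
        case _:
            return "Bad command."

print(dispatch(["init", "somewhere"]))  # init in somewhere
print(dispatch(["init"]))               # init in .
print(dispatch(["status"]))             # status
#+END_SRC

Because the subparsers are marked as required, calling =dispatch([])= never reaches the =match= statement: argparse prints a usage message and exits first.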
360 |
361 | * Creating repositories: init
362 | :PROPERTIES:
363 | :CUSTOM_ID: init
364 | :END:
365 |
366 | Obviously, the first Git command in chronological /and/ logical order is
367 | =git init=, so we'll begin by creating =wyag init=. To achieve this,
368 | we're going to first need some very basic repository abstraction.
369 |
370 | ** The Repository object
371 | :PROPERTIES:
372 | :CUSTOM_ID: GitRepository
373 | :END:
374 |
375 | We'll obviously need some abstraction for a repository: almost every
376 | time we run a git command, we're trying to do something to a
377 | repository, to create it, read from it or modify it.
378 |
379 | A git repository is made of two things: a “work tree”, where the files
380 | meant to be in version control live, and a “git directory”, where Git
381 | stores its own data. In most cases, the worktree is a regular
382 | directory and the git directory is a child directory of the worktree,
383 | called =.git=.
384 |
385 | Git supports /many more/ cases (bare repo, separated gitdir, etc.) but
386 | we won't need them: we'll stick to the basic approach of
387 | =worktree/.git=. Our repository object will then just hold two paths:
388 | the worktree and the gitdir.
389 |
390 | To create a new =GitRepository= object, we only need to make a few checks:
391 |
392 | - We must verify that the directory exists, and contains a
393 | subdirectory called =.git=.
394 |
395 | - We then read its configuration in =.git/config= (it's just an INI
396 | file) and check that =core.repositoryformatversion= is 0. More on
397 | that field in a moment.
398 |
399 | Our constructor takes an optional =force= argument which disables all
400 | checks. That's because the =repo_create()= function which we'll
401 | create later will use a =GitRepository= object to /create/ the repo. So
402 | we need a way to create such objects even from (still) invalid
403 | filesystem locations.
404 |
405 | #+BEGIN_SRC python :tangle libwyag.py
406 | class GitRepository (object):
407 |     """A git repository"""
408 |
409 |     worktree = None
410 |     gitdir = None
411 |     conf = None
412 |
413 |     def __init__(self, path, force=False):
414 |         self.worktree = path
415 |         self.gitdir = os.path.join(path, ".git")
416 |
417 |         if not (force or os.path.isdir(self.gitdir)):
418 |             raise Exception(f"Not a Git repository {path}")
419 |
420 |         # Read configuration file in .git/config
421 |         self.conf = configparser.ConfigParser()
422 |         cf = repo_file(self, "config")
423 |
424 |         if cf and os.path.exists(cf):
425 |             self.conf.read([cf])
426 |         elif not force:
427 |             raise Exception("Configuration file missing")
428 |
429 |         if not force:
430 |             vers = int(self.conf.get("core", "repositoryformatversion"))
431 |             if vers != 0:
432 |                 raise Exception(f"Unsupported repositoryformatversion: {vers}")
433 | #+END_SRC
434 |
435 | We're going to be manipulating *lots* of paths in repositories. We may
436 | as well create a few utility functions to compute those paths and
437 | create missing directory structures if needed. First, just a general
438 | path building function:
439 |
440 | #+BEGIN_SRC python :tangle libwyag.py
441 | def repo_path(repo, *path):
442 |     """Compute path under repo's gitdir."""
443 |     return os.path.join(repo.gitdir, *path)
444 | #+END_SRC
445 |
446 | (A note on Python syntax: the star on the =*path= makes the function
447 | variadic, so it can be called with multiple path components as
448 | separate arguments. For example, ~repo_path(repo, "objects", "df",
449 | "4ec9fc2ad990cb9da906a95a6eda6627d7b7b0")~ is a valid call. The
450 | function receives =path= as a tuple.)
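
As a quick illustration of the variadic call, here is a small sketch. The helper =join_under= is a hypothetical stand-in for =repo_path= that takes a plain base directory instead of a repository object, so that it can run on its own.

#+BEGIN_SRC python
import os

def join_under(base, *path):
    """Hypothetical stand-in for repo_path: join components under base."""
    # Inside the function, the extra positional arguments arrive as a tuple.
    assert isinstance(path, tuple)
    return os.path.join(base, *path)

# Components passed as separate arguments...
print(join_under(".git", "objects", "df", "4ec9fc2ad990cb9da906a95a6eda6627d7b7b0"))

# ...or unpacked from an existing sequence with a star at the call site.
parts = ["refs", "heads"]
print(join_under(".git", *parts))
#+END_SRC

The star does opposite jobs in the two places: in the =def= line it collects arguments into =path=, and at the call site it spreads a sequence back into separate arguments.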
451 |
452 | The next two functions, =repo_file()= and =repo_dir()=, return and
453 | optionally create a path to a file or a directory, respectively. The
454 | difference between them is that the file version only creates
455 | directories up to the last component.
456 |
457 | #+BEGIN_SRC python :tangle libwyag.py
458 | def repo_file(repo, *path, mkdir=False):
459 |     """Same as repo_path, but create dirname(*path) if absent. For
460 | example, repo_file(r, \"refs\", \"remotes\", \"origin\", \"HEAD\") will create
461 | .git/refs/remotes/origin."""
462 |
463 |     if repo_dir(repo, *path[:-1], mkdir=mkdir):
464 |         return repo_path(repo, *path)
465 |
466 | def repo_dir(repo, *path, mkdir=False):
467 |     """Same as repo_path, but mkdir *path if absent if mkdir."""
468 |
469 |     path = repo_path(repo, *path)
470 |
471 |     if os.path.exists(path):
472 |         if (os.path.isdir(path)):
473 |             return path
474 |         else:
475 |             raise Exception(f"Not a directory {path}")
476 |
477 |     if mkdir:
478 |         os.makedirs(path)
479 |         return path
480 |     else:
481 |         return None
482 | #+END_SRC
483 |
484 | (Second and last note on syntax: because the star in =*path= makes the
485 | functions variadic, the =mkdir= argument must be passed explicitly by
486 | name. For example, ~repo_file(repo, "objects", mkdir=True)~.)
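
The following sketch shows why the name is mandatory. =describe= is a hypothetical helper with the same signature shape as =repo_file()= and =repo_dir()=; it simply returns what it received.

#+BEGIN_SRC python
def describe(*path, mkdir=False):
    """Hypothetical helper: report how the arguments were bound."""
    return (path, mkdir)

# Passed by name: mkdir is set, and path holds only the components.
print(describe("objects", mkdir=True))  # (('objects',), True)

# Passed positionally: True is swallowed into path, mkdir keeps its default.
print(describe("objects", True))        # (('objects', True), False)
#+END_SRC

The second call raises no error, which is what makes the mistake easy to miss: the stray =True= just becomes one more path component.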
487 |
488 | To *create* a new repository, we start with a directory (which we
489 | create if it doesn't already exist) and create the *git directory* inside
490 | (which must not exist already, or be empty). That directory is called
491 | =.git= (the leading period makes it "hidden" on Unix systems), and contains:
492 |
493 | - =.git/objects/= : the object store, which we'll introduce [[#objects][in the next section]].
494 | - =.git/refs/= : the reference store, which we'll discuss [[#cmd-show-ref][a bit later]].
495 | It contains two subdirectories, =heads= and =tags=.
496 | - =.git/HEAD=, a reference to the current HEAD (more on that later!)
497 | - =.git/config=, the repository's configuration file.
498 | - =.git/description=, which holds a free-form description of this
499 | repository's contents, for humans, and is rarely used.
500 |
501 | #+BEGIN_SRC python :tangle libwyag.py
502 | def repo_create(path):
503 |     """Create a new repository at path."""
504 |
505 |     repo = GitRepository(path, True)
506 |
507 |     # First, we make sure the path either doesn't exist or is an
508 |     # empty dir.
509 |
510 |     if os.path.exists(repo.worktree):
511 |         if not os.path.isdir(repo.worktree):
512 |             raise Exception (f"{path} is not a directory!")
513 |         if os.path.exists(repo.gitdir) and os.listdir(repo.gitdir):
514 |             raise Exception (f"{path} is not empty!")
515 |     else:
516 |         os.makedirs(repo.worktree)
517 |
518 |     assert repo_dir(repo, "branches", mkdir=True)
519 |     assert repo_dir(repo, "objects", mkdir=True)
520 |     assert repo_dir(repo, "refs", "tags", mkdir=True)
521 |     assert repo_dir(repo, "refs", "heads", mkdir=True)
522 |
523 |     # .git/description
524 |     with open(repo_file(repo, "description"), "w") as f:
525 |         f.write("Unnamed repository; edit this file 'description' to name the repository.\n")
526 |
527 |     # .git/HEAD
528 |     with open(repo_file(repo, "HEAD"), "w") as f:
529 |         f.write("ref: refs/heads/master\n")
530 |
531 |     with open(repo_file(repo, "config"), "w") as f:
532 |         config = repo_default_config()
533 |         config.write(f)
534 |
535 |     return repo
536 | #+END_SRC
537 |
538 | The configuration file is very simple, it's an [[https://en.wikipedia.org/wiki/INI_file][INI]]-like file with a
539 | single section (=[core]=) and three fields:
540 |
541 | - =repositoryformatversion = 0=: the version of
542 | the gitdir format. 0 means the initial format, 1 the same with
543 | extensions. If > 1, git will panic; wyag will only accept 0.
544 | - =filemode = false=: disable tracking of file mode (permission)
545 | changes in the work tree.
546 | - =bare = false=: indicates that this repository has a worktree. Git
547 | supports an optional =worktree= key which indicates the location of
548 | the worktree, if not =..=; wyag doesn't.
549 |
550 | We create this file using Python's =configparser= lib:
551 |
552 | #+BEGIN_SRC python :tangle libwyag.py
553 | def repo_default_config():
554 |     ret = configparser.ConfigParser()
555 |
556 |     ret.add_section("core")
557 |     ret.set("core", "repositoryformatversion", "0")
558 |     ret.set("core", "filemode", "false")
559 |     ret.set("core", "bare", "false")
560 |
561 |     return ret
562 | #+END_SRC
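
To see what this produces, here is a minimal standalone sketch that writes the same three settings to an in-memory buffer instead of a file, then reads them back. The variable names are ours and nothing here is part of =libwyag.py=:

#+BEGIN_SRC python
import configparser
import io

config = configparser.ConfigParser()
config.add_section("core")
config.set("core", "repositoryformatversion", "0")
config.set("core", "filemode", "false")
config.set("core", "bare", "false")

# ConfigParser.write() expects a text-mode file object; StringIO works too.
buffer = io.StringIO()
config.write(buffer)
text = buffer.getvalue()
print(text)

# Values come back as strings, so the version must be converted explicitly.
parsed = configparser.ConfigParser()
parsed.read_string(text)
version = int(parsed.get("core", "repositoryformatversion"))
#+END_SRC

The printed text is what ends up in =.git/config=: a =[core]= header followed by one =key = value= line per field.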
563 |
564 | ** The init command
565 | :PROPERTIES:
566 | :CUSTOM_ID: cmd-init
567 | :END:
568 |
569 | Now that we have code to read and create repositories, let's make this
570 | code usable from the command line by creating the =wyag init= command.
571 | =wyag init= behaves just like =git init= --- with much less
572 | customizability, of course. The syntax of =wyag init= is going to be:
573 |
574 | #+BEGIN_EXAMPLE
575 | wyag init [path]
576 | #+END_EXAMPLE
577 |
578 | We already have the complete repository creation logic. To create the
579 | command, we're only going to need two more things:
580 |
581 | 1. We need to create an argparse subparser to handle our command's
582 | argument.
583 |
584 | #+BEGIN_SRC python :tangle libwyag.py
585 | argsp = argsubparsers.add_parser("init", help="Initialize a new, empty repository.")
586 | #+END_SRC
587 |
588 |    In the case of =init=, there's a single, optional, positional
589 |    argument: the path where the repository should be created. It
590 |    defaults to =.=, the current directory:
591 |
592 | #+BEGIN_SRC python :tangle libwyag.py
593 | argsp.add_argument("path",
594 | metavar="directory",
595 | nargs="?",
596 | default=".",
597 | help="Where to create the repository.")
598 |
599 | #+END_SRC
600 |
601 | 2. We also need a "bridge" function that will read argument values
602 | from the object returned by argparse and call the actual
603 | function with correct values.
604 |
605 | #+BEGIN_SRC python :tangle libwyag.py
606 | def cmd_init(args):
607 |     repo_create(args.path)
608 | #+END_SRC
609 |
610 | And we're done! If you've followed these steps, you should now be
611 | able to =wyag init= a git repository anywhere:
612 |
613 | #+begin_example
614 | $ wyag init test
615 | #+end_example
616 |
617 | (The =wyag= executable won't usually be in your =$PATH=: you'll want to call it by its
618 | full name, e.g. =~/projects/wyag/wyag init .=)
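
If you want to check how the optional positional argument behaves without running the whole program, the sketch below rebuilds a tiny parser on its own. The names =parser= and =subparsers= are stand-ins for the ones defined earlier, and we assume the subcommand name is stored under =command=:

#+BEGIN_SRC python
import argparse

parser = argparse.ArgumentParser(description="Toy command-line parser")
subparsers = parser.add_subparsers(title="Commands", dest="command")
subparsers.required = True

init_parser = subparsers.add_parser("init", help="Initialize a new, empty repository.")
init_parser.add_argument("path",
                         metavar="directory",
                         nargs="?",      # zero or one value
                         default=".",    # used when no value is given
                         help="Where to create the repository.")

# parse_args() accepts an explicit list, which is handy for experimenting.
with_path = parser.parse_args(["init", "test"])
without_path = parser.parse_args(["init"])
print(with_path.command, with_path.path)
print(without_path.command, without_path.path)
#+END_SRC

With =nargs="?"=, argparse falls back to =default= when the argument is omitted, which is what lets a bare =wyag init= target the current directory.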
619 |
620 | ** The repo_find() function
621 | :PROPERTIES:
622 | :repo_find:
623 | :END:
624 |
625 | While we're implementing repositories, we're going to need a function
626 | to find the root of the current repository. We'll use it a lot, since
627 | almost all Git functions work on an existing repository (except
628 | =init=, of course!). Sometimes that root is the current directory,
629 | but it may also be a parent: your repository's root may be in
630 | =~/Documents/MyProject=, but you may currently be working in
631 | =~/Documents/MyProject/src/tui/frames/mainview/=. The =repo_find()=
632 | function we'll now create will look for that root, starting at the
633 | current directory and recursing back to =/=. To identify a path as a
634 | repository, it will check for the presence of a =.git= directory.
635 |
636 | #+BEGIN_SRC python :tangle libwyag.py
637 | def repo_find(path=".", required=True):
638 |     path = os.path.realpath(path)
639 |
640 |     if os.path.isdir(os.path.join(path, ".git")):
641 |         return GitRepository(path)
642 |
643 |     # If we haven't returned, recurse in the parent directory
644 |     parent = os.path.realpath(os.path.join(path, ".."))
645 |
646 |     if parent == path:
647 |         # Base case
648 |         # os.path.realpath(os.path.join("/", "..")) == "/":
649 |         # If parent==path, then path is root.
650 |         if required:
651 |             raise Exception("No git directory.")
652 |         else:
653 |             return None
654 |
655 |     # Recursive case
656 |     return repo_find(parent, required)
657 | #+END_SRC
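
The same walk can also be written as a loop. The sketch below is a standalone variant, not the function we'll use: =find_git_root= is a hypothetical name, and it returns a plain path rather than a =GitRepository=. It is exercised against a throwaway directory tree:

#+BEGIN_SRC python
import os
import tempfile

def find_git_root(path="."):
    """Return the closest ancestor of path that contains a .git directory, or None."""
    path = os.path.realpath(path)
    while True:
        if os.path.isdir(os.path.join(path, ".git")):
            return path
        parent = os.path.dirname(path)
        if parent == path:
            # The parent of the filesystem root is the root itself: stop here.
            return None
        path = parent

with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)
    nested = os.path.join(root, "src", "tui", "frames")
    os.makedirs(os.path.join(root, ".git"))
    os.makedirs(nested)
    found = find_git_root(nested)
    found_matches_root = (found == root)

print(found_matches_root)
#+END_SRC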
658 |
659 | And we're done with repositories!
660 |
661 | * Reading and writing objects: hash-object and cat-file
662 | :PROPERTIES:
663 | :CUSTOM_ID: objects
664 | :END:
665 |
666 | ** What are objects?
667 | :PROPERTIES:
668 | :CUSTOM_ID: objects-intro
669 | :END:
670 |
671 | Now that we have repositories, putting things inside them is in order.
672 | Also, repositories are boring, and writing a Git implementation
673 | shouldn't be just a matter of writing a bunch of =mkdir=. Let's talk
674 | about *objects*, and let's implement =git hash-object= and =git cat-file=.
675 |
676 | Maybe you don't know these two commands --- they're not exactly part
677 | of an everyday git toolbox, and they're actually quite low-level
678 | ("plumbing", in git parlance). What they do is actually very simple:
679 | =hash-object= converts an existing file into a git object, and =cat-file=
680 | prints an existing git object to the standard output.
681 |
682 | Now, *what actually is a Git object?* At its core, Git is a
683 | "content-addressed filesystem". That means that unlike regular
684 | filesystems, where the name of a file is arbitrary and unrelated to
685 | that file's contents, the names of files as stored by Git are
686 | mathematically derived from their contents. This has a very important
687 | implication: if a single byte of, say, a text file, changes, its
688 | internal name will change, too. To put it simply: you don't /modify/
689 | a file in git, you create a new file in a different location. Objects
690 | are just that: *files in the git repository, whose paths are
691 | determined by their contents*.
692 |
693 | #+begin_warning
694 | *Git is not (really) a key-value store*
695 |
696 | Some documentation, including the excellent [[https://git-scm.com/book/id/v2/Git-Internals-Git-Objects][Pro Git]], calls Git a
697 | "key-value store". This is not incorrect, but may be misleading.
698 | Regular filesystems are actually closer to a key-value store than Git
699 | is. Because it computes keys from data, Git could rather be called a
700 | /value-value store/.
701 | #+end_warning
702 |
703 | Git uses objects to store quite a lot of things: first and foremost,
704 | the actual files it keeps in version control --- source code, for
705 | example. Commits are objects, too, as well as tags. With a few
706 | notable exceptions (which we'll see later!), almost everything in
707 | Git is stored as an object.
708 |
709 | The path where git stores a given object is computed by calculating
710 | the [[https://en.wikipedia.org/wiki/SHA-1][SHA-1]] [[https://en.wikipedia.org/wiki/Cryptographic_hash_function][hash]] of its contents. More precisely, Git renders the hash
711 | as a lowercase hexadecimal string, and splits it in two parts: the
712 | first two characters, and the rest. It uses the first part as a
713 | directory name and the rest as the file name. This is because most
714 | filesystems hate having too many files in a single directory and would
715 | slow down to a crawl. Git's method creates 256 possible intermediate
716 | directories, hence dividing the average number of files per directory
717 | by 256.
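
As a small illustration of that split, here is a sketch that hashes some arbitrary bytes and derives the path. The helper name =loose_object_path= is hypothetical, and the hashed bytes are not a real Git object, only something to produce a 40-character name:

#+BEGIN_SRC python
import hashlib
import os

def loose_object_path(gitdir, sha):
    """Split a hexadecimal hash into a two-character directory and a file name."""
    return os.path.join(gitdir, "objects", sha[:2], sha[2:])

sha = hashlib.sha1(b"any bytes will do").hexdigest()
path = loose_object_path(".git", sha)
print(sha)
print(path)
#+END_SRC

Two hexadecimal characters can take 16 × 16 = 256 values, which is where the 256 intermediate directories come from.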
718 |
719 | #+BEGIN_note
720 | *What is a hash function?*
721 |
722 | SHA-1 is what we call a “hash function”. Simply put, a hash function
723 | is a kind of unidirectional mathematical function: it is easy to
724 | compute the hash of a value, but there's no way to compute back which
725 | value produced a hash.
726 |
727 | A very simple example of a hash function is the classical =len= (or
728 | =strlen=) function, which returns the length of a string. It's really
729 | easy to compute the length of a string, and the length of a given
730 | string will never change (unless the string itself changes, of
731 | course!) but it's impossible to retrieve the original string, given
732 | only its length. /Cryptographic/ hash functions are a much more
733 | complex version of the same, with the added property that computing an
734 | input meant to produce a given hash is hard enough to be practically
735 | impossible. (To produce an input =i= with ~strlen(i) == 12~, you just
736 | type twelve random characters. With algorithms such as SHA-1, it
737 | would take much, much longer --- long enough to be practically
738 | impossible[fn:1].)
739 | #+END_note
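
The difference between the two kinds of functions is easy to observe. In this sketch (the two sample strings are arbitrary), =len= cannot tell the inputs apart, while SHA-1 gives two unrelated-looking digests for inputs that differ by a single byte:

#+BEGIN_SRC python
import hashlib

first = b"Hello, world"
second = b"Hello, World"   # differs from the first by a single byte

# len() is a (very poor) hash function: both inputs collide.
same_length = len(first) == len(second)

# SHA-1 digests differ, and always have the same size: 20 bytes, 40 hex digits.
digest_first = hashlib.sha1(first).hexdigest()
digest_second = hashlib.sha1(second).hexdigest()
print(same_length)
print(digest_first)
print(digest_second)
#+END_SRC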
740 |
741 | Before we start implementing the object storage system, we must
742 | understand its exact storage format. An object starts with a header
743 | that specifies its type: =blob=, =commit=, =tag= or =tree= (more on
744 | that in a second). This header is followed by an ASCII space (0x20),
745 | then the size of the object in bytes as an ASCII number, then a null
746 | byte (0x00), then the contents of the object. The first 48
747 | bytes of a commit object in Wyag's repo look like this:
748 |
749 | #+BEGIN_EXAMPLE
750 | 00000000 63 6f 6d 6d 69 74 20 31 30 38 36 00 74 72 65 65 |commit 1086.tree|
751 | 00000010 20 32 39 66 66 31 36 63 39 63 31 34 65 32 36 35 | 29ff16c9c14e265|
752 | 00000020 32 62 32 32 66 38 62 37 38 62 62 30 38 61 35 61 |2b22f8b78bb08a5a|
753 | #+END_EXAMPLE
754 |
755 | In the first line, we see the type header, a space (=0x20=), the size in
756 | ASCII (1086) and the null separator =0x00=. The last four bytes on the
757 | first line are the beginning of that object's contents, the word
758 | "tree" --- we'll discuss that further when we talk about commits.
759 |
760 | The objects (headers and contents) are stored compressed with =zlib=.
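
To make the layout concrete, here is a minimal sketch that builds the uncompressed bytes of a blob by hand and compresses them. =encode_object= is a hypothetical helper, not a function of wyag. As a sanity check, it also hashes the empty blob: as the section on writing objects explains, an object's name is the SHA-1 of header plus contents, and the empty blob has the same well-known ID in every Git repository.

#+BEGIN_SRC python
import hashlib
import zlib

def encode_object(fmt, content):
    """Prepend the header: type, space, size in ASCII, null byte."""
    return fmt + b" " + str(len(content)).encode("ascii") + b"\x00" + content

raw = encode_object(b"blob", b"hello\n")
compressed = zlib.compress(raw)   # this is what would be stored on disk

# The ID of the empty blob, computed from the header alone.
empty_blob_id = hashlib.sha1(encode_object(b"blob", b"")).hexdigest()
print(raw)
print(empty_blob_id)
#+END_SRC

You can compare the second printed value with the output of =git hash-object -t blob /dev/null=.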
761 |
762 | ** A generic object object
763 |
764 | Objects can be of multiple types, but they all share the same
765 | storage/retrieval mechanism and the same general header format.
766 | Before we dive into the details of various types of objects, we need
767 | to abstract over these common features. The easiest way is to create
768 | a generic =GitObject= with two unimplemented methods: =serialize()=
769 | and =deserialize()=, and a default =init()= to create a new, empty
770 | object if needed (sorry pythonistas, this isn't very nice design but
771 | it's probably easier to read than superconstructors). Our =__init__=
772 | either loads the object from the provided data, or calls the
773 | subclass-provided =init()= to create a new, empty object.
774 |
775 | Later, we'll subclass this generic class, actually implementing these
776 | functions for each object format.
777 |
778 | #+BEGIN_SRC python :tangle libwyag.py
779 | class GitObject (object):
780 |
781 |     def __init__(self, data=None):
782 |         if data is not None:
783 |             self.deserialize(data)
784 |         else:
785 |             self.init()
786 |
787 |     def serialize(self):
788 |         """This function MUST be implemented by subclasses.
789 |
790 |         It must return the object's contents as a byte string. What
791 |         exactly that means depends on each subclass."""
792 |         raise Exception("Unimplemented!")
793 |
794 |     def deserialize(self, data):
795 |         """This function MUST be implemented by subclasses. It must read
796 |         the object's contents from data, a byte string, and convert them
797 |         into a meaningful representation."""
798 |         raise Exception("Unimplemented!")
799 |
800 |     def init(self):
801 |         pass # Just do nothing. This is a reasonable default!
802 | #+END_SRC
803 |
804 | ** Reading objects
805 | :PROPERTIES:
806 | :CUSTOM_ID: object_read
807 | :END:
808 |
809 | To read an object, we need to know its SHA-1 hash. We then compute
810 | its path from this hash (with the formula explained above: first two
811 | characters, then a directory delimiter =/=, then the remaining part)
812 | and look it up inside of the "objects" directory in the gitdir. That
813 | is, the path to =e673d1b7eaa0aa01b5bc2442d570a765bdaae751= is
814 | =.git/objects/e6/73d1b7eaa0aa01b5bc2442d570a765bdaae751=.
815 |
816 | We then read that file as a binary file, and decompress it using
817 | =zlib=.
818 |
819 | From the decompressed data, we extract the two header components: the
820 | object type and its size. From the type, we determine the actual
821 | class to use. We convert the size to a Python integer, and check that
822 | it matches the actual length of the contents.
823 |
824 | When all is done, we just call the correct constructor for that
825 | object's format.
826 |
827 | #+BEGIN_SRC python :tangle libwyag.py
828 | def object_read(repo, sha):
829 |     """Read object sha from Git repository repo. Return a
830 |     GitObject whose exact type depends on the object."""
831 |
832 |     path = repo_file(repo, "objects", sha[0:2], sha[2:])
833 |
834 |     if not os.path.isfile(path):
835 |         return None
836 |
837 |     with open (path, "rb") as f:
838 |         raw = zlib.decompress(f.read())
839 |
840 |         # Read object type
841 |         x = raw.find(b' ')
842 |         fmt = raw[0:x]
843 |
844 |         # Read and validate object size
845 |         y = raw.find(b'\x00', x)
846 |         size = int(raw[x:y].decode("ascii"))
847 |         if size != len(raw)-y-1:
848 |             raise Exception(f"Malformed object {sha}: bad length")
849 |
850 |         # Pick constructor
851 |         match fmt:
852 |             case b'commit' : c=GitCommit
853 |             case b'tree' : c=GitTree
854 |             case b'tag' : c=GitTag
855 |             case b'blob' : c=GitBlob
856 |             case _:
857 |                 raise Exception(f"Unknown type {fmt.decode('ascii')} for object {sha}")
858 |
859 |         # Call constructor and return object
860 |         return c(raw[y+1:])
861 | #+END_SRC
862 |
863 | ** Writing objects
864 | :PROPERTIES:
865 | :object_write:
866 | :END:
867 |
868 | Writing an object is reading it in reverse: we insert the header,
869 | compute the hash, zlib-compress everything and write the result in
870 | the correct location. This really shouldn't require much explanation;
871 | just notice that the hash is computed *after* the header is added (so
872 | it's the hash of the object itself, uncompressed, not just its contents).
873 |
874 | #+BEGIN_SRC python :tangle libwyag.py
875 | def object_write(obj, repo=None):
876 |     # Serialize object data
877 |     data = obj.serialize()
878 |     # Add header
879 |     result = obj.fmt + b' ' + str(len(data)).encode() + b'\x00' + data
880 |     # Compute hash
881 |     sha = hashlib.sha1(result).hexdigest()
882 |
883 |     if repo:
884 |         # Compute path
885 |         path=repo_file(repo, "objects", sha[0:2], sha[2:], mkdir=True)
886 |
887 |         if not os.path.exists(path):
888 |             with open(path, 'wb') as f:
889 |                 # Compress and write
890 |                 f.write(zlib.compress(result))
891 |     return sha
892 | #+END_SRC
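
To convince yourself that reading and writing really mirror each other, here is a self-contained round trip that doesn't depend on the rest of the library. =write_loose= and =read_loose= are hypothetical stand-ins for =object_write= and =object_read=: they work on a bare =objects= directory and on raw bytes instead of =GitObject= instances.

#+BEGIN_SRC python
import hashlib
import os
import tempfile
import zlib

def write_loose(objects_dir, fmt, content):
    """Store content as a loose object and return its hexadecimal ID."""
    raw = fmt + b" " + str(len(content)).encode("ascii") + b"\x00" + content
    sha = hashlib.sha1(raw).hexdigest()          # hash of header + contents
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(zlib.compress(raw))
    return sha

def read_loose(objects_dir, sha):
    """Return (type, content) for the loose object named sha."""
    with open(os.path.join(objects_dir, sha[:2], sha[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, content = raw.partition(b"\x00")  # split at the first null byte
    fmt, _, size = header.partition(b" ")
    if int(size.decode("ascii")) != len(content):
        raise ValueError(f"Malformed object {sha}: bad length")
    return fmt, content

with tempfile.TemporaryDirectory() as objects_dir:
    sha = write_loose(objects_dir, b"blob", b"some file contents\n")
    fmt, content = read_loose(objects_dir, sha)

print(sha, fmt, content)
#+END_SRC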
893 |
894 | ** Working with blobs
895 |
896 | We said earlier that the type header could be one of four: =blob=,
897 | =commit=, =tag= and =tree= --- so git has four object types.
898 |
899 | Blobs are the simplest of those four types, because they have no
900 | actual format. Blobs are user data: the content of every file you put
901 | in git (=main.c=, =logo.png=, =README.md=) is stored as a blob. That
902 | makes them easy to manipulate, because they have no actual syntax or
903 | constraints beyond the basic object storage mechanism: they're just
904 | unspecified data. Creating a =GitBlob= class is thus trivial: the
905 | =serialize= and =deserialize= functions just have to store and return
906 | their input unmodified.
907 |
908 | #+BEGIN_SRC python :tangle libwyag.py
909 | class GitBlob(GitObject):
910 |     fmt=b'blob'
911 |
912 |     def serialize(self):
913 |         return self.blobdata
914 |
915 |     def deserialize(self, data):
916 |         self.blobdata = data
917 | #+END_SRC
918 |
919 | ** The cat-file command
920 | :PROPERTIES:
921 | :CUSTOM_ID: cmd-cat-file
922 | :END:
923 |
924 | We can now create =wyag cat-file=. =git cat-file= simply prints the
925 | raw contents of an object to stdout, uncompressed and without the git
926 | header. In a clone of [[https://github.com/thblt/write-yourself-a-git][wyag's source repository]], =git cat-file blob
927 | e0695f14a412c29e252c998c81de1dde59658e4a= will show a version of the
928 | README.
929 |
930 | Our simplified version will just take two positional arguments,
931 | a type and an object identifier:
932 |
933 | #+BEGIN_EXAMPLE
934 | wyag cat-file TYPE OBJECT
935 | #+END_EXAMPLE
936 |
937 | The subparser is very simple:
938 |
939 | #+BEGIN_SRC python :tangle libwyag.py
940 | argsp = argsubparsers.add_parser("cat-file",
941 | help="Provide content of repository objects")
942 |
943 | argsp.add_argument("type",
944 | metavar="type",
945 | choices=["blob", "commit", "tag", "tree"],
946 | help="Specify the type")
947 |
948 | argsp.add_argument("object",
949 | metavar="object",
950 | help="The object to display")
951 | #+END_SRC
952 |
953 | And we can implement the functions, which just call into existing code we wrote earlier:
954 |
955 | #+BEGIN_SRC python :tangle libwyag.py
956 | def cmd_cat_file(args):
957 |     repo = repo_find()
958 |     cat_file(repo, args.object, fmt=args.type.encode())
959 |
960 | def cat_file(repo, obj, fmt=None):
961 |     obj = object_read(repo, object_find(repo, obj, fmt=fmt))
962 |     sys.stdout.buffer.write(obj.serialize())
963 | #+END_SRC
964 |
965 | This function calls an =object_find=
966 | function we haven't introduced yet. For now, it's just going to
967 | return one of its arguments unmodified, like this:
968 |
969 | #+BEGIN_SRC python
970 | def object_find(repo, name, fmt=None, follow=True):
971 |     return name
972 | #+END_SRC
973 |
974 | The reason for this strange small function is that Git has a /lot/ of
975 | ways to refer to objects: full hash, short hash, tags...
976 | =object_find()= will be our name resolution function. We'll only
977 | implement it [[#object_find][later]], so this is just a temporary placeholder. This
978 | means that until we implement the real thing, the only way we can
979 | refer to an object will be by its full hash.
980 |
981 | ** The hash-object command
982 | :PROPERTIES:
983 | :CUSTOM_ID: cmd-hash-object
984 | :END:
985 |
986 | We will want to put our /own/ data in our repositories,
987 | though. =hash-object= is basically the opposite of =cat-file=: it
988 | reads a file, computes its hash as an object, and either stores it in
989 | the repository (if the =-w= flag is passed) or just prints its hash.
990 |
991 | The syntax of =wyag hash-object= is a simplification of =git
992 | hash-object=:
993 |
994 | #+BEGIN_EXAMPLE
995 | wyag hash-object [-w] [-t TYPE] FILE
996 | #+END_EXAMPLE
997 |
998 | Which converts to:
999 |
1000 | #+BEGIN_SRC python :tangle libwyag.py
1001 | argsp = argsubparsers.add_parser(
1002 | "hash-object",
1003 | help="Compute object ID and optionally creates a blob from a file")
1004 |
1005 | argsp.add_argument("-t",
1006 | metavar="type",
1007 | dest="type",
1008 | choices=["blob", "commit", "tag", "tree"],
1009 | default="blob",
1010 | help="Specify the type")
1011 |
1012 | argsp.add_argument("-w",
1013 | dest="write",
1014 | action="store_true",
1015 | help="Actually write the object into the database")
1016 |
1017 | argsp.add_argument("path",
1018 |                    help="Read object from <file>")
1019 | #+END_SRC
1020 |
1021 | The actual implementation is very simple. As usual, we create a small
1022 | bridge function:
1023 |
1024 | #+BEGIN_SRC python :tangle libwyag.py
1025 | def cmd_hash_object(args):
1026 |     if args.write:
1027 |         repo = repo_find()
1028 |     else:
1029 |         repo = None
1030 |
1031 |     with open(args.path, "rb") as fd:
1032 |         sha = object_hash(fd, args.type.encode(), repo)
1033 |         print(sha)
1034 | #+END_SRC
1035 |
1036 | The function it calls, =object_hash()=, is also trivial. The =repo=
1037 | argument is optional, and the object isn't written if it is =None=
1038 | (this is handled in =object_write()=, above):
1039 |
1040 | #+BEGIN_SRC python :tangle libwyag.py
1041 | def object_hash(fd, fmt, repo=None):
1042 |     """ Hash object, writing it to repo if provided."""
1043 |     data = fd.read()
1044 |
1045 |     # Choose constructor according to fmt argument
1046 |     match fmt:
1047 |         case b'commit' : obj=GitCommit(data)
1048 |         case b'tree' : obj=GitTree(data)
1049 |         case b'tag' : obj=GitTag(data)
1050 |         case b'blob' : obj=GitBlob(data)
1051 |         case _: raise Exception(f"Unknown type {fmt}!")
1052 |
1053 |     return object_write(obj, repo)
1054 | #+END_SRC
1055 |
1056 | ** Aside: what about packfiles?
1057 | :PROPERTIES:
1058 | :CUSTOM_ID: packfiles
1059 | :END:
1060 |
1061 | What we've just implemented is called "loose objects". Git has a
1062 | second object storage mechanism called packfiles. Packfiles are much
1063 | more efficient, but also much more complex, than loose objects. Simply
1064 | put, a packfile is a compilation of loose objects (like a =tar=) but
1065 | some are stored as deltas (as a transformation of another object).
1066 | Packfiles are way too complex to be supported by wyag.
1067 |
1068 | The packfile is stored in =.git/objects/pack/=. It has a =.pack=
1069 | extension, and is accompanied by an index file of the same name with
1070 | the =.idx= extension. Should you want to convert a packfile to loose
1071 | objects format (to play with =wyag= on an existing repo, for example),
1072 | here's the solution.
1073 |
1074 | First, /move/ the packfile outside the gitdir (just copying it won't work).
1075 |
1076 | #+BEGIN_SRC shell
1077 | mv .git/objects/pack/pack-d9ef004d4ca729287f12aaaacf36fee39baa7c9d.pack .
1078 | #+END_SRC
1079 |
1080 | You can ignore the =.idx=. Then, from the worktree, just =cat= it and pipe the result to =git
1081 | unpack-objects=:
1082 |
1083 | #+BEGIN_SRC shell
1084 | cat pack-d9ef004d4ca729287f12aaaacf36fee39baa7c9d.pack | git unpack-objects
1085 | #+END_SRC
1086 |
1087 | * Reading commit history: log
1088 |
1089 | ** Parsing commits
1090 |
1091 | Now that we can read and write objects, we should consider commits.
1092 | A commit object (uncompressed, without headers) looks like this:
1093 |
1094 | #+BEGIN_EXAMPLE
1095 | tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147
1096 | parent 206941306e8a8af65b66eaaaea388a7ae24d49a0
1097 | author Thibault Polge 1527025023 +0200
1098 | committer Thibault Polge 1527025044 +0200
1099 | gpgsig -----BEGIN PGP SIGNATURE-----
1100 |
1101 | iQIzBAABCAAdFiEExwXquOM8bWb4Q2zVGxM2FxoLkGQFAlsEjZQACgkQGxM2FxoL
1102 | kGQdcBAAqPP+ln4nGDd2gETXjvOpOxLzIMEw4A9gU6CzWzm+oB8mEIKyaH0UFIPh
1103 | rNUZ1j7/ZGFNeBDtT55LPdPIQw4KKlcf6kC8MPWP3qSu3xHqx12C5zyai2duFZUU
1104 | wqOt9iCFCscFQYqKs3xsHI+ncQb+PGjVZA8+jPw7nrPIkeSXQV2aZb1E68wa2YIL
1105 | 3eYgTUKz34cB6tAq9YwHnZpyPx8UJCZGkshpJmgtZ3mCbtQaO17LoihnqPn4UOMr
1106 | V75R/7FjSuPLS8NaZF4wfi52btXMSxO/u7GuoJkzJscP3p4qtwe6Rl9dc1XC8P7k
1107 | NIbGZ5Yg5cEPcfmhgXFOhQZkD0yxcJqBUcoFpnp2vu5XJl2E5I/quIyVxUXi6O6c
1108 | /obspcvace4wy8uO0bdVhc4nJ+Rla4InVSJaUaBeiHTW8kReSFYyMmDCzLjGIu1q
1109 | doU61OM3Zv1ptsLu3gUE6GU27iWYj2RWN3e3HE4Sbd89IFwLXNdSuM0ifDLZk7AQ
1110 | WBhRhipCCgZhkj9g2NEk7jRVslti1NdN5zoQLaJNqSwO1MtxTmJ15Ksk3QP6kfLB
1111 | Q52UWybBzpaP9HEd4XnR+HuQ4k2K0ns2KgNImsNvIyFwbpMUyUWLMPimaV1DWUXo
1112 | 5SBjDB/V/W2JBFR+XKHFJeFwYhj7DD/ocsGr4ZMx/lgc8rjIBkI=
1113 | =lgTX
1114 | -----END PGP SIGNATURE-----
1115 |
1116 | Create first draft
1117 | #+END_EXAMPLE
1118 |
1119 | The format is a simplified version of mail messages, as specified in
1120 | [[https://www.ietf.org/rfc/rfc2822.txt][RFC 2822]]. It begins with a series of key-value pairs, with space as
1121 | the key/value separator, and ends with the commit message, which may
1122 | span multiple lines. Values may also continue over multiple lines:
1123 | subsequent lines start with a space, which the parser must drop (like
1124 | the =gpgsig= field above, which spans 16 lines).
1125 |
1126 | Let's have a look at those fields:
1127 |
1128 | - =tree= is a reference to a tree object, a type of object that we'll
1129 |   see just next. A tree maps blob IDs to filesystem locations, and
1130 |   describes a state of the work tree. Put simply, it is the actual
1131 |   content of the commit: file contents, and where they go.
1132 | - =parent= is a reference to the parent of this commit. It may be
1133 |   repeated: merge commits, for example, have multiple parents. It
1134 |   may also be absent: the very first commit in a repository obviously
1135 |   doesn't have a parent.
1136 | - =author= and =committer= are separate, because the author of a commit
1137 |   is not necessarily the person who can commit it. (This may not be
1138 |   obvious for GitHub users, but a lot of projects do Git through e-mail.)
1139 | - =gpgsig= is the PGP signature of this object.
1140 |
1141 | We'll start by writing a simple parser for the format. The code is
1142 | obvious. The name of the function we're about to create,
1143 | =kvlm_parse()=, may be confusing: it isn't called =commit_parse()= because
1144 | tags have the very same format, so we'll use it for both object types.
1145 | I use KVLM to mean "Key-Value List with Message".
1146 |
1147 | #+BEGIN_SRC python :tangle libwyag.py
1148 | def kvlm_parse(raw, start=0, dct=None):
1149 |     if not dct:
1150 |         dct = dict()
1151 |         # You CANNOT declare the argument as dct=dict() or all calls to
1152 |         # the function will endlessly grow the same dict.
1153 |
1154 |     # This function is recursive: it reads a key/value pair, then calls
1155 |     # itself back with the new position. So we first need to know
1156 |     # where we are: at a keyword, or already in the message.
1157 |
1158 |     # We search for the next space and the next newline.
1159 |     spc = raw.find(b' ', start)
1160 |     nl = raw.find(b'\n', start)
1161 |
1162 |     # If space appears before newline, we have a keyword. Otherwise,
1163 |     # it's the final message, which we just read to the end of the file.
1164 |
1165 |     # Base case
1166 |     # =========
1167 |     # If newline appears first (or there's no space at all, in which
1168 |     # case find returns -1), we assume a blank line. A blank line
1169 |     # means the remainder of the data is the message. We store it in
1170 |     # the dictionary, with None as the key, and return.
1171 |     if (spc < 0) or (nl < spc):
1172 |         assert nl == start
1173 |         dct[None] = raw[start+1:]
1174 |         return dct
1175 |
1176 |     # Recursive case
1177 |     # ==============
1178 |     # we read a key-value pair and recurse for the next.
1179 |     key = raw[start:spc]
1180 |
1181 |     # Find the end of the value. Continuation lines begin with a
1182 |     # space, so we loop until we find a "\n" not followed by a space.
1183 |     end = start
1184 |     while True:
1185 |         end = raw.find(b'\n', end+1)
1186 |         if raw[end+1] != ord(' '): break
1187 |
1188 |     # Grab the value
1189 |     # Also, drop the leading space on continuation lines
1190 |     value = raw[spc+1:end].replace(b'\n ', b'\n')
1191 |
1192 |     # Don't overwrite existing data contents
1193 |     if key in dct:
1194 |         if type(dct[key]) == list:
1195 |             dct[key].append(value)
1196 |         else:
1197 |             dct[key] = [ dct[key], value ]
1198 |     else:
1199 |         dct[key]=value
1200 |
1201 |     return kvlm_parse(raw, start=end+1, dct=dct)
1202 | #+END_SRC
1203 |
1204 | #+begin_note
1205 | *Object identity rules*
1206 |
1207 | We use dictionaries (HashMaps) to store key/value associations, but we
1208 | rely on a specific feature of Python dictionaries: *they preserve
1209 | insertion order* (guaranteed since Python 3.7). It means that when we
1210 | write an object back, we'll iterate over the dictionary and get the
1211 | fields back in the exact order they were added. This matters, because
1212 | Git has *two strong rules about object identity*:
1213 |
1214 | 1. The first rule is that *the same name will always refer to the
1215 |    same object*. We've seen this one already: it's just a
1216 |    consequence of the fact that an object's name is a hash of its
1217 |    contents.
1218 | 2. The second rule is subtly different: *the same object will always
1219 |    be referred to by the same name*. This means that there shouldn't be
1220 |    two equivalent objects under different names. This is why field
1221 |    order matters: by modifying the /order/ in which fields appear in a
1222 |    given commit, e.g. by putting the =tree= after the =parent=, we'd
1223 |    modify the SHA-1 hash of the commit, and we'd create two equivalent,
1224 |    but numerically distinct, commit objects.
1225 |
1226 | For example, when comparing trees, git will assume that two trees with
1227 | different names /are/ different --- this is why we'll have to make
1228 | sure elements of the tree objects are properly sorted, so we don't
1229 | produce distinct but equivalent trees.
1230 | #+end_note
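
The second rule is easy to check experimentally. The sketch below hashes two commit-like bodies that carry the same fields in a different order. =commit_id= is a hypothetical helper, and the bodies are deliberately incomplete (no author or committer), so these are not IDs of real commits:

#+BEGIN_SRC python
import hashlib

def commit_id(body):
    """Name a commit body the way Git names objects: hash header + contents."""
    raw = b"commit " + str(len(body)).encode("ascii") + b"\x00" + body
    return hashlib.sha1(raw).hexdigest()

tree = b"tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147\n"
parent = b"parent 206941306e8a8af65b66eaaaea388a7ae24d49a0\n"
message = b"\nCreate first draft\n"

canonical = commit_id(tree + parent + message)
reordered = commit_id(parent + tree + message)   # same fields, other order
print(canonical)
print(reordered)

# A dict hands keys back in insertion order, so the parsed order is not lost.
fields = {}
fields[b"tree"] = b"29ff16c9c14e2652b22f8b78bb08a5a07930c147"
fields[b"parent"] = b"206941306e8a8af65b66eaaaea388a7ae24d49a0"
key_order = list(fields)
#+END_SRC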
1231 |
1232 | We're also going to need to write similar objects, so let's add a
1233 | =kvlm_serialize()= function to our toolkit. This is very simple: we
1234 | write all fields first, then a blank line, and finally the
1235 | message.
1236 |
1237 | #+BEGIN_SRC python :tangle libwyag.py
1238 | def kvlm_serialize(kvlm):
1239 |     ret = b''
1240 |
1241 |     # Output fields
1242 |     for k in kvlm.keys():
1243 |         # Skip the message itself
1244 |         if k is None: continue
1245 |         val = kvlm[k]
1246 |         # Normalize to a list
1247 |         if type(val) != list:
1248 |             val = [ val ]
1249 |
1250 |         for v in val:
1251 |             ret += k + b' ' + (v.replace(b'\n', b'\n ')) + b'\n'
1252 |
1253 |     # Append message
1254 |     ret += b'\n' + kvlm[None]
1255 |
1256 |     return ret
1257 | #+END_SRC
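
A good property to test is that parsing and serializing are exact inverses, continuation lines included. The sketch below does that with a compact, standalone pair of functions. They are a simplified variant of the code above, not a replacement: =parse= and =serialize= are hypothetical names, =parse= is iterative, it always stores values in lists, and it assumes well-formed input with at least one field.

#+BEGIN_SRC python
import re

def parse(raw):
    """Split raw into fields (key -> list of values) and a message (key None)."""
    head, _, message = raw.partition(b"\n\n")   # first blank line ends the fields
    fields = {}
    # A newline followed by a space is a continuation, not a field boundary.
    for chunk in re.split(rb"\n(?! )", head):
        key, _, value = chunk.partition(b" ")
        fields.setdefault(key, []).append(value.replace(b"\n ", b"\n"))
    fields[None] = message
    return fields

def serialize(fields):
    out = b""
    for key, values in fields.items():
        if key is None:
            continue
        for value in values:
            out += key + b" " + value.replace(b"\n", b"\n ") + b"\n"
    return out + b"\n" + fields[None]

raw = (b"tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147\n"
       b"parent 206941306e8a8af65b66eaaaea388a7ae24d49a0\n"
       b"sig first line\n \n third line\n"
       b"\n"
       b"Create first draft\n")

parsed = parse(raw)
round_trip_ok = serialize(parsed) == raw
print(round_trip_ok)
#+END_SRC

Note that the empty line inside the multi-line value is stored as a single space, so it can't be mistaken for the blank line that introduces the message.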
1258 |
1259 | ** The Commit object
1260 | :PROPERTIES:
1261 | :object_write: GitCommit
1262 | :END:
1263 |
1264 | Now we have the parser, we can create the =GitCommit= class:
1265 |
1266 | #+BEGIN_SRC python :tangle libwyag.py
1267 | class GitCommit(GitObject):
1268 |     fmt=b'commit'
1269 |
1270 |     def deserialize(self, data):
1271 |         self.kvlm = kvlm_parse(data)
1272 |
1273 |     def serialize(self):
1274 |         return kvlm_serialize(self.kvlm)
1275 |
1276 |     def init(self):
1277 |         self.kvlm = dict()
1278 | #+END_SRC
1279 |
1280 | ** The log command
1281 | :PROPERTIES:
1282 | :CUSTOM_ID: cmd-log
1283 | :END:
1284 |
1285 | We'll implement a much, much simpler version of =log= than what Git
1286 | provides. Most importantly, we won't deal with representing the log
1287 | /at all/. Instead, we'll dump Graphviz data and let the user use
1288 | =dot= to render the actual log. (If you don't know how to use
1289 | Graphviz, just paste the raw output into [[https://dreampuf.github.io/GraphvizOnline/][this site]]. If the link is
1290 | dead, look up "graphviz online" in your favorite search engine.)
1291 |
1292 | #+BEGIN_SRC python :tangle libwyag.py
1293 | argsp = argsubparsers.add_parser("log", help="Display history of a given commit.")
1294 | argsp.add_argument("commit",
1295 | default="HEAD",
1296 | nargs="?",
1297 | help="Commit to start at.")
1298 | #+END_SRC
1299 |
1300 | #+BEGIN_SRC python :tangle libwyag.py
1301 | def cmd_log(args):
1302 |     repo = repo_find()
1303 |
1304 |     print("digraph wyaglog{")
1305 |     print(" node[shape=rect]")
1306 |     log_graphviz(repo, object_find(repo, args.commit), set())
1307 |     print("}")
1308 |
1309 | def log_graphviz(repo, sha, seen):
1310 |
1311 |     if sha in seen:
1312 |         return
1313 |     seen.add(sha)
1314 |
1315 |     commit = object_read(repo, sha)
1316 |     message = commit.kvlm[None].decode("utf8").strip()
1317 |     message = message.replace("\\", "\\\\")
1318 |     message = message.replace("\"", "\\\"")
1319 |
1320 |     if "\n" in message: # Keep only the first line
1321 |         message = message[:message.index("\n")]
1322 |
1323 |     print(f" c_{sha} [label=\"{sha[0:7]}: {message}\"]")
1324 |     assert commit.fmt==b'commit'
1325 |
1326 |     if b'parent' not in commit.kvlm.keys():
1327 |         # Base case: the initial commit.
1328 |         return
1329 |
1330 |     parents = commit.kvlm[b'parent']
1331 |
1332 |     if type(parents) != list:
1333 |         parents = [ parents ]
1334 |
1335 |     for p in parents:
1336 |         p = p.decode("ascii")
1337 |         print (f" c_{sha} -> c_{p};")
1338 |         log_graphviz(repo, p, seen)
1339 | #+END_SRC
1340 |
1341 | You can now use our log command like this:
1342 |
1343 | #+BEGIN_SRC shell
1344 | wyag log e03158242ecab460f31b0d6ae1642880577ccbe8 > log.dot
1345 | dot -O -Tpdf log.dot
1346 | #+END_SRC
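
If you'd like to see the shape of the Graphviz output without a repository at hand, here is a sketch that runs the same traversal over a hand-written history. Everything in it is made up for illustration: the =history_to_dot= name, the dictionary layout and the two-character commit IDs. It also uses an explicit stack instead of recursion, which sidesteps Python's recursion limit on very long histories.

#+BEGIN_SRC python
def history_to_dot(commits, start):
    """Return Graphviz source for every commit reachable from start."""
    lines = ["digraph toylog{", "  node[shape=rect]"]
    seen = set()
    stack = [start]
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue
        seen.add(sha)
        message, parents = commits[sha]
        # Escape backslashes first, then double quotes, as in log_graphviz().
        label = message.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  c_{sha} [label="{sha[0:7]}: {label}"]')
        for parent in parents:
            lines.append(f"  c_{sha} -> c_{parent};")
            stack.append(parent)
    lines.append("}")
    return "\n".join(lines)

# Hypothetical history: a root, one commit on top of it, and a merge of both.
commits = {
    "a1": ("Initial commit", []),
    "b2": ("Add parser", ["a1"]),
    "c3": ('Merge "parser"', ["b2", "a1"]),
}
dot = history_to_dot(commits, "c3")
print(dot)
#+END_SRC

Each commit becomes one node, and each parent link becomes one edge pointing from the child to the parent.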
1347 |
1348 | ** Anatomy of a commit
1349 | :PROPERTIES:
1350 | :CUSTOM_ID: commit-anatomy
1351 | :END:
1352 |
1353 | You may have noticed a few things by now.
1354 |
1355 | First and foremost, we've been playing with commits, browsing and
1356 | walking through commit objects, building a graph of commit history,
1357 | without ever touching a single file in the worktree or a blob. We've
1358 | done a lot with commits /without considering their contents/. This is
1359 | important: work tree contents are just one part of a commit. But a
1360 | commit is made of everything it holds: its contents, its authors,
1361 | *also its parents*. If you remember that the ID (the SHA-1 hash) of a
1362 | commit is computed from the whole commit object, you'll understand
1363 | what it means that commits are immutable: if you change the author,
1364 | the parent commit or a single file, you've actually created a new,
1365 | different object. Each and every commit is bound to its place and its
1366 | relationship to the whole repository up to the very first commit. To
1367 | put it otherwise, a given commit ID not only identifies some file
1368 | contents, but it also binds the commit to its whole history and to the
1369 | whole repository.
1370 |
1371 | It's also worth noting that from the point of view of a commit, time
1372 | somehow runs backwards: we're used to considering the history of a
1373 | project from its humble beginnings as an evening distraction, starting
1374 | with a few lines of code, some initial commits, and progressing to its
1375 | present state (millions of lines of code, dozens of contributors,
1376 | whatever). But each commit is completely unaware of its future: it's
1377 | only linked to the past. Commits have "memory", but no premonition.
1378 |
1379 | # #+begin_note
1380 | # In Terry Pratchett's Discworld, trolls believe they progress in time
1381 | # from the future to the past. The reasoning behind that belief is
1382 | # that when you walk, what you can see is what's /ahead/ of you. Of
1383 | # time, all you can perceive is the past, because you remember; hence
1384 | # it's where you're headed. Commits are Discworld trolls.
1385 | # #+end_note
1386 |
1387 | So what makes a commit? To sum it up:
1388 |
1389 | - A tree object, which we'll discuss now, that is, the contents of a
1390 |   worktree, files and directories;
1391 | - Zero, one or more parents;
1392 | - An author identity (name and email), and a timestamp;
1393 | - A committer identity (name and email), and a timestamp;
1394 | - An optional PGP signature;
1395 | - A message.
1396 |
1397 | All of this is hashed together into a unique SHA-1 identifier.
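
To make the point concrete, here is a minimal sketch (not part of wyag) that hashes two hand-built commits differing only in their parent. The helpers =object_hash= and =make_commit=, the identity and the zero timestamp are made up for the illustration; the only thing taken from Git itself is the object header layout (type, space, size, NUL byte, then the body).

```python
import hashlib

def object_hash(fmt: bytes, body: bytes) -> str:
    """SHA-1 of a Git object: header (type, space, size, NUL) plus body."""
    header = fmt + b" " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def make_commit(tree: str, parents: list, message: str) -> bytes:
    """Serialize a hypothetical, minimal commit with a fixed identity."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append("author Wyag <wyag@example.com> 0 +0000")
    lines.append("committer Wyag <wyag@example.com> 0 +0000")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf8")

tree = "11d33fad71dbac72840aff1447e0d080c7484361"
root = object_hash(b"commit", make_commit(tree, [], "Initial commit"))

# Same tree, same message, same author: only the parent differs.
child_a = object_hash(b"commit", make_commit(tree, [root], "Second commit"))
child_b = object_hash(b"commit", make_commit(tree, ["0" * 40], "Second commit"))

print(child_a)
print(child_b)
print(child_a != child_b)  # True: the parent is part of the commit's identity
```

Because each parent ID was itself computed from that parent's own contents and parents, changing anything anywhere in the history changes every ID that follows it.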
1398 |
1399 | #+begin_note
1400 | *Wait, does that make Git a blockchain?*
1401 |
1402 | Because of cryptocurrencies, blockchains are all the hype these
1403 | days. And yes, /in a way/, Git is a blockchain: it's a sequence of
1404 | blocks (commits) tied together by cryptographic means in a way that
1405 | guarantees that each single element is bound to the whole
1406 | history of the structure. Don't take the comparison too seriously,
1407 | though: we don't need a GitCoin. Really, we don't.
1408 | #+end_note
1409 |
1410 | * Reading commit data: checkout
1411 | :PROPERTIES:
1412 | :CUSTOM_ID: checkout
1413 | :END:
1414 |
1415 | It's all well that commits hold a lot more than files and directories
1416 | in a given state, but that doesn't make them really useful. It's
1417 | probably time to start implementing tree objects as well, so we'll be
1418 | able to checkout commits into the work tree.
1419 |
1420 | ** What's in a tree?
1421 |
1422 | Informally, a tree describes the content of the work tree, that is, it
1423 | associates blobs to paths. It's an array of three-element tuples made
1424 | of a file mode, a path (relative to the worktree) and a SHA-1. A
1425 | typical tree's contents may look like this:
1426 |
1427 | | Mode     | SHA-1                                      | Path         |
1428 | |----------+--------------------------------------------+--------------|
1429 | | =100644= | =894a44cc066a027465cd26d634948d56d13af9af= | =.gitignore= |
1430 | | =100644= | =94a9ed024d3859793618152ea559a168bbcbb5e2= | =LICENSE=    |
1431 | | =100644= | =bab489c4f4600a38ce6dbfd652b90383a4aa3e45= | =README.md=  |
1432 | | =100644= | =d5ec863f17f3a2e92aa8f6b66ac18f7b09fd1b38= | =main.c=     |
1433 | | =040000= | =6d208e47659a2a10f5f8640e0155d9276a2130a9= | =src=        |
1434 | | =040000= | =e7445b03aea61ec801b20d6ab62f076208b7d097= | =tests=      |
1435 |
1436 | Mode is just the file's [[https://en.wikipedia.org/wiki/File_system_permissions][mode]], path is its location. The SHA-1 refers
1437 | to either a blob or another tree object. If it's a blob, the path is a
1438 | file; if it's a tree, the path is a directory. To instantiate this tree
1439 | in the filesystem, we would begin by loading the object associated with
1440 | the first path (=.gitignore=) and checking its type. Since it's a blob,
1441 | we'll just create a file called =.gitignore= with this blob's
1442 | contents, and do the same for =LICENSE= and =README.md=. But the object
1443 | associated with =src= is not a blob, but another tree: we'll create
1444 | the directory =src= and repeat the same operation in that directory
1445 | with the new tree.
1446 |
1447 | #+BEGIN_warning
1448 | *A path is a single filesystem entry*
1449 |
1450 | The path identifies exactly one file or directory. Not two, not
1451 | three. If you have five levels of nested directories, even if four
1452 | are empty save the next directory, you're going to need five tree
1453 | objects recursively referring to one another. You cannot take the
1454 | shortcut of putting a full path in a single tree entry, like
1455 | =dir1/dir2/dir3/dir4/dir5=.
1456 | #+END_warning
1457 |
1458 | ** Parsing trees
1459 |
1460 | Unlike tags and commits, tree objects are binary objects, but their
1461 | format is actually quite simple. A tree is the concatenation of
1462 | records of the format:
1463 |
1464 | #+begin_example
1465 | [mode] space [path] 0x00 [sha-1]
1466 | #+end_example
1467 |
1468 | - =[mode]= is up to six bytes and is an octal representation of a file
1469 |   *mode*, stored in ASCII. For example, 100644 is encoded with byte
1470 |   values 49 (ASCII "1"), 48 (ASCII "0"), 48, 54, 52, 52. The first
1471 |   two digits encode the file type (file, directory, symlink or
1472 |   submodule), the last four the permissions.
1473 | - It's followed by 0x20, an ASCII *space*;
1474 | - Followed by the null-terminated (0x00) *path*;
1475 | - Followed by the object's *SHA-1* in binary encoding, on 20 bytes.
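
As a worked example of that layout, the sketch below builds one record by hand and takes it apart again. The helper names =encode_record= and =decode_record= are invented for this illustration and are not wyag functions; the mode, path and hash are those of the =.gitignore= entry from the table above.

```python
def encode_record(mode: bytes, path: str, sha: str) -> bytes:
    """Build one tree record: mode, space, path, NUL, 20 raw SHA-1 bytes."""
    return mode + b" " + path.encode("utf8") + b"\x00" + bytes.fromhex(sha)

def decode_record(raw: bytes, start: int = 0):
    """Return (position after the record, (mode, path, sha))."""
    space = raw.index(b" ", start)
    nul = raw.index(b"\x00", space)
    mode = raw[start:space]
    path = raw[space + 1:nul].decode("utf8")
    sha = raw[nul + 1:nul + 21].hex()
    return nul + 21, (mode, path, sha)

record = encode_record(b"100644", ".gitignore",
                       "894a44cc066a027465cd26d634948d56d13af9af")

print(list(record[:6]))  # [49, 48, 48, 54, 52, 52]
print(len(record))       # 38, that is 6 + 1 + 10 + 1 + 20
end, fields = decode_record(record)
print(end, fields)
```

Note that the hash takes 20 bytes here, not the 40 characters of its hexadecimal form. That's the "binary" part of the format.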
1476 |
1477 | The parser is going to be quite simple. First, create a tiny object
1478 | wrapper for a single record (a leaf, a single path):
1479 |
1480 | #+BEGIN_SRC python :tangle libwyag.py
1481 | class GitTreeLeaf (object):
1482 |     def __init__(self, mode, path, sha):
1483 |         self.mode = mode
1484 |         self.path = path
1485 |         self.sha = sha
1486 | #+END_SRC
1487 |
1488 | Because a tree object is just the repetition of the same fundamental
1489 | data structure, we write the parser in two functions. First, a parser
1490 | to extract a single record, which returns parsed data and the position
1491 | it reached in input data:
1492 |
1493 | #+BEGIN_SRC python :tangle libwyag.py
1494 | def tree_parse_one(raw, start=0):
1495 |     # Find the space terminator of the mode
1496 |     x = raw.find(b' ', start)
1497 |     assert x-start == 5 or x-start==6
1498 |
1499 |     # Read the mode
1500 |     mode = raw[start:x]
1501 |     if len(mode) == 5:
1502 |         # Normalize to six bytes.
1503 |         mode = b"0" + mode
1504 |
1505 |     # Find the NULL terminator of the path
1506 |     y = raw.find(b'\x00', x)
1507 |     # and read the path
1508 |     path = raw[x+1:y]
1509 |
1510 |     # Read the SHA…
1511 |     raw_sha = int.from_bytes(raw[y+1:y+21], "big")
1512 |     # and convert it into a hex string, padded to 40 chars
1513 |     # with zeros if needed.
1514 |     sha = format(raw_sha, "040x")
1515 |     return y+21, GitTreeLeaf(mode, path.decode("utf8"), sha)
1516 | #+END_SRC
1517 |
1518 | And the "real" parser which just calls the previous one in a loop,
1519 | until input data is exhausted.
1520 |
1521 | #+BEGIN_SRC python :tangle libwyag.py
1522 | def tree_parse(raw):
1523 |     pos = 0
1524 |     max = len(raw)
1525 |     ret = list()
1526 |     while pos < max:
1527 |         pos, data = tree_parse_one(raw, pos)
1528 |         ret.append(data)
1529 |
1530 |     return ret
1531 | #+END_SRC
1532 |
1533 | We'll finally need a serializer to write trees back. Because we may
1534 | have added or modified entries, we need to sort them again.
1535 | Consistent sorting matters, because we need to respect git's
1536 | [[identity-rules][identity rules]], which say that no two equivalent objects can have
1537 | different hashes. Differently sorted trees with the same contents
1538 | /would/ be equivalent (describing the same directory structure), and
1539 | still numerically distinct (different SHA-1 identifiers). Incorrectly
1540 | sorted trees are invalid, but /git doesn't enforce that/. I created
1541 | some invalid trees by accident writing wyag, and all I got was weird
1542 | bugs in =git status= (specifically, =status= would report an actually
1543 | clean worktree as fully modified). We don't want that.
1544 |
1545 | The ordering function is quite simple, with an unexpected twist.
1546 | Entries are sorted by name, alphabetically, /but/ directories (that is,
1547 | tree entries) are sorted with a final =/= added. It matters, because
1548 | it means that if =whatever= names a regular file, it will sort
1549 | /before/ =whatever.c=, but if =whatever= is a dir, it will sort
1550 | /after/, as =whatever/=. (I'm not sure why git does that. If you're
1551 | curious, see the function =base_name_compare= in =tree.c= in the git
1552 | source.)
1553 |
1554 | #+begin_src python :tangle libwyag.py
1555 | # Notice this isn't a comparison function, but a conversion function.
1556 | # Python's default sort doesn't accept a custom comparison function,
1557 | # like in most languages, but a `key` argument that returns a new
1558 | # value, which is compared using the default rules. So we just return
1559 | # the leaf name, with an extra / if it's a directory.
1560 | def tree_leaf_sort_key(leaf):
1561 |     if leaf.mode.startswith(b"04"): # Only trees sort as "name/"
1562 |         return leaf.path + "/"
1563 |     else:
1564 |         return leaf.path
1565 | #+end_src
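
Here is a quick, self-contained check of the ordering rule described above. It's only a sketch: =Leaf= is a throwaway stand-in for =GitTreeLeaf=, and =sort_key= restates the rule (a trailing =/= for directories) rather than reusing wyag's function.

```python
from collections import namedtuple

# Hypothetical stand-in for GitTreeLeaf, holding just what sorting needs.
Leaf = namedtuple("Leaf", ["mode", "path"])

def sort_key(leaf):
    # Directories (mode 04....) compare as if their name ended with "/".
    return leaf.path + "/" if leaf.mode.startswith(b"04") else leaf.path

# Case 1: "whatever" is a regular file.
as_file = [Leaf(b"100644", "whatever.c"), Leaf(b"100644", "whatever")]
# Case 2: "whatever" is a directory.
as_dir = [Leaf(b"100644", "whatever.c"), Leaf(b"040000", "whatever")]

print([leaf.path for leaf in sorted(as_file, key=sort_key)])
# ['whatever', 'whatever.c']
print([leaf.path for leaf in sorted(as_dir, key=sort_key)])
# ['whatever.c', 'whatever']
```

The second result comes from comparing =whatever/= with =whatever.c=: the first differing characters are =/= (ASCII 47) and =.= (ASCII 46), so the file wins.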
1566 |
1567 | Then the serializer itself. This one is very simple: we sort the
1568 | items using our newly created function as a transformer, then write
1569 | them in order.
1570 |
1571 | #+BEGIN_SRC python :tangle libwyag.py
1572 | def tree_serialize(obj):
1573 |     obj.items.sort(key=tree_leaf_sort_key)
1574 |     ret = b''
1575 |     for i in obj.items:
1576 |         ret += i.mode
1577 |         ret += b' '
1578 |         ret += i.path.encode("utf8")
1579 |         ret += b'\x00'
1580 |         sha = int(i.sha, 16)
1581 |         ret += sha.to_bytes(20, byteorder="big")
1582 |     return ret
1583 | #+END_SRC
1584 |
1585 | And now we just have to combine all that into a class:
1586 |
1587 | #+BEGIN_SRC python :tangle libwyag.py
1588 | class GitTree(GitObject):
1589 |     fmt=b'tree'
1590 |
1591 |     def deserialize(self, data):
1592 |         self.items = tree_parse(data)
1593 |
1594 |     def serialize(self):
1595 |         return tree_serialize(self)
1596 |
1597 |     def init(self):
1598 |         self.items = list()
1599 | #+END_SRC
1600 |
1601 | ** Showing trees: ls-tree
1602 | :PROPERTIES:
1603 | :CUSTOM_ID: cmd-ls-tree
1604 | :END:
1605 |
1606 | While we're at it, let's add the =ls-tree= command to wyag. It's so
1607 | easy there's no reason not to. =git ls-tree [-r] TREE= simply prints
1608 | the contents of a tree, recursively with the =-r= flag. In recursive
1609 | mode, it doesn't show subtrees, just final objects with their full
1610 | paths.
1611 |
1612 | #+NAME: cmd-ls-tree
1613 | #+BEGIN_SRC python :tangle libwyag.py
1614 | argsp = argsubparsers.add_parser("ls-tree", help="Pretty-print a tree object.")
1615 | argsp.add_argument("-r",
1616 |                    dest="recursive",
1617 |                    action="store_true",
1618 |                    help="Recurse into sub-trees")
1619 |
1620 | argsp.add_argument("tree",
1621 |                    help="A tree-ish object.")
1622 |
1623 | def cmd_ls_tree(args):
1624 |     repo = repo_find()
1625 |     ls_tree(repo, args.tree, args.recursive)
1626 |
1627 | def ls_tree(repo, ref, recursive=None, prefix=""):
1628 |     sha = object_find(repo, ref, fmt=b"tree")
1629 |     obj = object_read(repo, sha)
1630 |     for item in obj.items:
1631 |         # Pad the mode to six digits (a tree's mode may be stored as
1632 |         # "40000"), so its first two digits always give the type.
1633 |         mode = item.mode.decode("ascii").rjust(6, "0")
1634 |
1635 |         match mode[0:2]: # Determine the type.
1636 |             case '04': type = "tree"
1637 |             case '10': type = "blob" # A regular file.
1638 |             case '12': type = "blob" # A symlink. Blob contents is link target.
1639 |             case '16': type = "commit" # A submodule
1640 |             case _: raise Exception(f"Weird tree leaf mode {item.mode}")
1641 |
1642 |         path = os.path.join(prefix, item.path)
1643 |         if not (recursive and type=='tree'): # This is a leaf
1644 |             print(f"{mode} {type} {item.sha}\t{path}")
1645 |         else: # This is a branch, recurse
1646 |             ls_tree(repo, item.sha, recursive, path)
1647 | #+END_SRC
1648 |
1649 | ** The checkout command
1650 | :PROPERTIES:
1651 | :CUSTOM_ID: cmd-checkout
1652 | :END:
1653 |
1654 | =git checkout= simply instantiates a commit in the worktree. We're
1655 | going to oversimplify the actual git command to make our
1656 | implementation clear and understandable. We're also going to add a
1657 | few safeguards. Here's how our version of checkout will work:
1658 |
1659 | - It will take two arguments: a commit, and a directory. Git checkout
1660 |   only needs a commit.
1661 |
1662 | - It will then instantiate the tree in the directory, *if and only if
1663 |   the directory is empty*. Git is full of safeguards to avoid
1664 |   deleting data, which would be too complicated and unsafe to try to
1665 |   reproduce in wyag. Since the point of wyag is to demonstrate git,
1666 |   not to produce a working implementation, this limitation is
1667 |   acceptable.
1668 |
1669 | Let's get started. As usual, we need a subparser:
1670 |
1671 | #+BEGIN_SRC python :tangle libwyag.py
1672 | argsp = argsubparsers.add_parser("checkout", help="Checkout a commit inside of a directory.")
1673 |
1674 | argsp.add_argument("commit",
1675 | help="The commit or tree to checkout.")
1676 |
1677 | argsp.add_argument("path",
1678 | help="The EMPTY directory to checkout on.")
1679 | #+END_SRC
1680 |
1681 | A wrapper function:
1682 |
1683 | #+BEGIN_SRC python :tangle libwyag.py
1684 | def cmd_checkout(args):
1685 |     repo = repo_find()
1686 |
1687 |     obj = object_read(repo, object_find(repo, args.commit))
1688 |
1689 |     # If the object is a commit, we grab its tree
1690 |     if obj.fmt == b'commit':
1691 |         obj = object_read(repo, obj.kvlm[b'tree'].decode("ascii"))
1692 |
1693 |     # Verify that path is an empty directory
1694 |     if os.path.exists(args.path):
1695 |         if not os.path.isdir(args.path):
1696 |             raise Exception(f"Not a directory {args.path}!")
1697 |         if os.listdir(args.path):
1698 |             raise Exception(f"Not empty {args.path}!")
1699 |     else:
1700 |         os.makedirs(args.path)
1701 |
1702 |     tree_checkout(repo, obj, os.path.realpath(args.path))
1703 | #+END_SRC
1704 |
1705 | And a function to do the actual work:
1706 |
1707 | #+BEGIN_SRC python :tangle libwyag.py
1708 | def tree_checkout(repo, tree, path):
1709 |     for item in tree.items:
1710 |         obj = object_read(repo, item.sha)
1711 |         dest = os.path.join(path, item.path)
1712 |
1713 |         if obj.fmt == b'tree':
1714 |             os.mkdir(dest)
1715 |             tree_checkout(repo, obj, dest)
1716 |         elif obj.fmt == b'blob':
1717 |             # @TODO Support symlinks (identified by mode 12****)
1718 |             with open(dest, 'wb') as f:
1719 |                 f.write(obj.blobdata)
1720 | #+END_SRC
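
The =@TODO= about symlinks could be handled along these lines. This is a hedged sketch, not wyag's code: =write_entry= is a hypothetical helper, and it assumes a platform where =os.symlink= works without special privileges (typically Linux or macOS). For a symlink entry (mode =12....=), the blob's contents are the link target rather than file data.

```python
import os
import tempfile

def write_entry(mode: bytes, data: bytes, dest: str) -> None:
    """Write one blob to dest, as a symlink if its mode starts with 12."""
    if mode.startswith(b"12"):
        # The blob holds the link target, not file contents.
        os.symlink(data.decode("utf8"), dest)
    else:
        with open(dest, "wb") as f:
            f.write(data)

with tempfile.TemporaryDirectory() as workdir:
    write_entry(b"100644", b"hello\n", os.path.join(workdir, "target.txt"))
    write_entry(b"120000", b"target.txt", os.path.join(workdir, "link"))

    link_target = os.readlink(os.path.join(workdir, "link"))
    with open(os.path.join(workdir, "link"), "rb") as f:
        through_link = f.read()

print(link_target)   # target.txt
print(through_link)  # b'hello\n'
```

In =tree_checkout=, such a helper would be called with =item.mode= and =obj.blobdata= in place of the plain =open= and =write=.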
1721 |
1722 | * Refs, tags and branches
1723 | ** What a ref is, and the show-ref command
1724 | :PROPERTIES:
1725 | :CUSTOM_ID: cmd-show-ref
1726 | :END:
1727 |
1728 | As of now, the only way we can refer to objects is by their full
1729 | hexadecimal identifier. In git, we actually rarely see those, except
1730 | to talk about a specific commit. But in general, we're talking about
1731 | HEAD, about branches with names like =main= or
1732 | =feature/more-bombs=, and so on. This is handled by a simple
1733 | mechanism called references.
1734 |
1735 | Git references, or refs, are probably the simplest kind of thing
1736 | git holds. They live in subdirectories of =.git/refs=, and are text
1737 | files containing a hexadecimal representation of an object's hash,
1738 | encoded in ASCII. They're actually as simple as this:
1739 |
1740 | #+BEGIN_example
1741 | 6071c08bcb4757d8c89a30d9755d2466cef8c1de
1742 | #+END_example
1743 |
1744 | Refs can also refer to another reference, and thus only indirectly to
1745 | an object, in which case they look like this:
1746 |
1747 | #+BEGIN_EXAMPLE
1748 | ref: refs/remotes/origin/master
1749 | #+END_EXAMPLE
1750 |
1751 | #+BEGIN_note
1752 | *Direct and indirect references*
1753 |
1754 | From now on, I will call a reference of the form =ref:
1755 | path/to/other/ref= an *indirect* reference, and a ref with a SHA-1
1756 | object ID a *direct reference*.
1757 | #+END_note
1758 |
1759 | This section will describe the uses of refs. For now, all that matters
1760 | is this:
1761 |
1762 | - they're text files, in the =.git/refs= hierarchy;
1763 | - they hold the SHA-1 identifier of an object, or a reference to
1764 |   another reference, ultimately to a SHA-1 (no loops!)
1765 |
1766 | To work with refs, we're first going to need a simple recursive solver
1767 | that will take a ref name, follow any chain of indirect references (refs
1768 | whose content begins with =ref:=, as exemplified above) and return a
1769 | SHA-1 identifier:
1770 |
1771 | #+BEGIN_SRC python :tangle libwyag.py
1772 | def ref_resolve(repo, ref):
1773 |     path = repo_file(repo, ref)
1774 |
1775 |     # Sometimes, an indirect reference may be broken. This is normal
1776 |     # in one specific case: we're looking for HEAD on a new repository
1777 |     # with no commits. In that case, .git/HEAD points to "ref:
1778 |     # refs/heads/main", but .git/refs/heads/main doesn't exist yet
1779 |     # (since there's no commit for it to refer to).
1780 |     if not os.path.isfile(path):
1781 |         return None
1782 |
1783 |     with open(path, 'r') as fp:
1784 |         data = fp.read()[:-1]
1785 |         # Drop final \n ^^^^^
1786 |     if data.startswith("ref: "):
1787 |         return ref_resolve(repo, data[5:])
1788 |     else:
1789 |         return data
1790 | #+END_SRC
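
To see the resolver's two interesting cases without a real repository, here is a standalone sketch. The =resolve= function below is a simplified stand-in that takes a plain directory path instead of a repository object (an assumption made so the example runs on its own), and the "git directory" is just a temporary folder.

```python
import os
import tempfile

def resolve(gitdir, ref):
    """Follow ref (and any 'ref: ...' indirection) down to a hash, or None."""
    path = os.path.join(gitdir, ref)
    if not os.path.isfile(path):
        return None
    with open(path) as fp:
        data = fp.read().strip()
    if data.startswith("ref: "):
        return resolve(gitdir, data[5:])
    return data

with tempfile.TemporaryDirectory() as gitdir:
    os.makedirs(os.path.join(gitdir, "refs", "heads"))
    with open(os.path.join(gitdir, "HEAD"), "w") as fp:
        fp.write("ref: refs/heads/main\n")

    # Fresh repository: HEAD points to a branch file that doesn't exist yet.
    before = resolve(gitdir, "HEAD")

    with open(os.path.join(gitdir, "refs", "heads", "main"), "w") as fp:
        fp.write("6071c08bcb4757d8c89a30d9755d2466cef8c1de\n")
    after = resolve(gitdir, "HEAD")

print(before)  # None
print(after)   # 6071c08bcb4757d8c89a30d9755d2466cef8c1de
```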
1791 |
1792 | Let's create two small functions, and implement the =show-ref=
1793 | command --- it just lists all references in a repository. First, a
1794 | stupid recursive function to collect refs and return them as a dict:
1795 |
1796 | #+BEGIN_SRC python :tangle libwyag.py
1797 | def ref_list(repo, path=None):
1798 |     if not path:
1799 |         path = repo_dir(repo, "refs")
1800 |     ret = dict()
1801 |     # Git shows refs sorted. To do the same, we sort the output of
1802 |     # listdir
1803 |     for f in sorted(os.listdir(path)):
1804 |         can = os.path.join(path, f)
1805 |         if os.path.isdir(can):
1806 |             ret[f] = ref_list(repo, can)
1807 |         else:
1808 |             ret[f] = ref_resolve(repo, can)
1809 |
1810 |     return ret
1811 | #+END_SRC
1812 |
1813 | And, as usual, a subparser, a bridge, and a (recursive) worker function:
1814 |
1815 | #+BEGIN_SRC python :tangle libwyag.py
1816 | argsp = argsubparsers.add_parser("show-ref", help="List references.")
1817 |
1818 | def cmd_show_ref(args):
1819 |     repo = repo_find()
1820 |     refs = ref_list(repo)
1821 |     show_ref(repo, refs, prefix="refs")
1822 |
1823 | def show_ref(repo, refs, with_hash=True, prefix=""):
1824 |     if prefix:
1825 |         prefix = prefix + '/'
1826 |     for k, v in refs.items():
1827 |         if type(v) == str and with_hash:
1828 |             print (f"{v} {prefix}{k}")
1829 |         elif type(v) == str:
1830 |             print (f"{prefix}{k}")
1831 |         else:
1832 |             show_ref(repo, v, with_hash=with_hash, prefix=f"{prefix}{k}")
1833 | #+END_SRC
1834 | ** Tags as references
1835 | :PROPERTIES:
1836 | :CUSTOM_ID: tags
1837 | :END:
1838 |
1839 | The simplest use of refs is tags. A tag is just a user-defined
1840 | name for an object, often a commit. A very common use of tags is
1841 | identifying software releases: You've just merged the last commit of,
1842 | say, version 12.78.52 of your program, so your most recent commit
1843 | (let's call it =6071c08=) /is/ your version 12.78.52. To make this
1844 | association explicit, all you have to do is:
1845 |
1846 | #+BEGIN_src shell
1847 | git tag v12.78.52 6071c08
1848 | # the object hash ^here^^ is optional and defaults to HEAD.
1849 | #+END_SRC
1850 |
1851 | This creates a new tag, called =v12.78.52=, pointing at =6071c08=.
1852 | Tagging is like aliasing: a tag introduces a new way to refer to an
1853 | existing object. After the tag is created, the name =v12.78.52= refers
1854 | to =6071c08=. For example, these two commands are now perfectly
1855 | equivalent:
1856 |
1857 | #+BEGIN_src shell
1858 | git checkout v12.78.52
1859 | git checkout 6071c08
1860 | #+END_src
1861 |
1862 | #+begin_note
1863 | Versions are a common use of tags, but like almost everything in
1864 | Git, tags have no predefined semantics: they mean whatever you want
1865 | them to mean, and can point to whichever object you want. You can
1866 | even tag /blobs/!
1867 | #+end_note
1868 |
1869 | ** Lightweight tags and tag objects, and parsing the latter
1870 | :PROPERTIES:
1871 | :CUSTOM_ID: GitTag
1872 | :END:
1873 |
1874 | You've probably guessed already that tags are actually refs. They
1875 | live in the =.git/refs/tags/= hierarchy. The only point worth noting is
1876 | that they come in two flavors: lightweight tags and tag objects.
1877 |
1878 | - "Lightweight" tags :: are just regular refs to a commit, a tree or
1879 |   a blob.
1880 |
1881 | - Tag objects :: are regular refs pointing to an object of type =tag=.
1882 |   Unlike lightweight tags, tag objects have an author, a date, an
1883 |   optional PGP signature and an optional annotation. Their format is
1884 |   the same as a commit object.
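
For reference, this is roughly what the body of a tag object looks like once serialized. The sketch below builds one by hand: =tag_body= is a hypothetical helper, and the tagger identity, timestamp and message are placeholders. The field names (=object=, =type=, =tag=, =tagger=) are the ones Git uses.

```python
def tag_body(obj_sha, obj_type, name, tagger, message):
    """Serialize a minimal tag body: key-value header, blank line, message."""
    header = "\n".join([
        f"object {obj_sha}",
        f"type {obj_type}",
        f"tag {name}",
        f"tagger {tagger}",
    ])
    return (header + "\n\n" + message + "\n").encode("utf8")

raw = tag_body("6071c08bcb4757d8c89a30d9755d2466cef8c1de",
               "commit",
               "v12.78.52",
               "Wyag <wyag@example.com> 0 +0000",  # placeholder identity
               "Release 12.78.52")

print(raw.decode("utf8"))
```

The shape is the same as a commit's: a few key-value lines, a blank line, then a free-form message.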
1885 |
1886 | We don't even need to implement tag objects, we can reuse =GitCommit=
1887 | and just change the =fmt= field:
1888 |
1889 | #+BEGIN_SRC python :tangle libwyag.py
1890 | class GitTag(GitCommit):
1891 |     fmt = b'tag'
1892 | #+END_SRC
1893 |
1894 | And now we support tags.
1895 |
1896 | ** The tag command
1897 | :PROPERTIES:
1898 | :CUSTOM_ID: cmd-tag
1899 | :END:
1900 |
1901 | Let's add the =tag= command. In Git, it does two things: it creates a
1902 | new tag or lists existing tags (by default). So you can invoke it with:
1903 |
1904 | #+BEGIN_src shell
1905 | git tag # List all tags
1906 | git tag NAME [OBJECT] # create a new *lightweight* tag NAME, pointing
1907 | # at HEAD (default) or OBJECT
1908 | git tag -a NAME [OBJECT] # create a new tag *object* NAME, pointing at
1909 | # HEAD (default) or OBJECT
1910 | #+END_src
1911 |
1912 | This translates to argparse as follows. Notice we ignore the mutual
1913 | exclusion between =--list= and =[-a] name [object]=, which seems too
1914 | complicated for argparse.
1915 |
1916 | # @FIXME This ignores the mutual exclusion
1917 | #+BEGIN_SRC python :tangle libwyag.py
1918 | argsp = argsubparsers.add_parser(
1919 | "tag",
1920 | help="List and create tags")
1921 |
1922 | argsp.add_argument("-a",
1923 | action="store_true",
1924 | dest="create_tag_object",
1925 | help="Whether to create a tag object")
1926 |
1927 | argsp.add_argument("name",
1928 | nargs="?",
1929 | help="The new tag's name")
1930 |
1931 | argsp.add_argument("object",
1932 | default="HEAD",
1933 | nargs="?",
1934 | help="The object the new tag will point to")
1935 | #+END_SRC
1936 |
1937 | The =cmd_tag= function will dispatch behavior (list or create) depending
1938 | on whether or not =name= is provided.
1939 |
1940 | #+BEGIN_SRC python :tangle libwyag.py
1941 | def cmd_tag(args):
1942 |     repo = repo_find()
1943 |
1944 |     if args.name:
1945 |         tag_create(repo,
1946 |                    args.name,
1947 |                    args.object,
1948 |                    create_tag_object = args.create_tag_object)
1949 |     else:
1950 |         refs = ref_list(repo)
1951 |         show_ref(repo, refs["tags"], with_hash=False)
1952 | #+END_SRC
1953 |
1954 | And we just need one more function to actually create the tag:
1955 |
1956 | #+begin_src python :tangle libwyag.py
1957 | def tag_create(repo, name, ref, create_tag_object=False):
1958 |     # Resolve the object reference to a SHA-1
1959 |     sha = object_find(repo, ref)
1960 |
1961 |     if create_tag_object:
1962 |         # create tag object (commit)
1963 |         tag = GitTag()
1964 |         tag.kvlm = dict()
1965 |         tag.kvlm[b'object'] = sha.encode()
1966 |         tag.kvlm[b'type'] = b'commit'
1967 |         tag.kvlm[b'tag'] = name.encode()
1968 |         # Feel free to let the user give their name!
1969 |         # Notice you can fix this after commit, read on!
1970 |         tag.kvlm[b'tagger'] = b'Wyag <wyag@example.com>'
1971 |         # …and a tag message!
1972 |         tag.kvlm[None] = b"A tag generated by wyag, which won't let you customize the message!\n"
1973 |         tag_sha = object_write(tag, repo)
1974 |         # create reference
1975 |         ref_create(repo, "tags/" + name, tag_sha)
1976 |     else:
1977 |         # create lightweight tag (ref)
1978 |         ref_create(repo, "tags/" + name, sha)
1979 |
1980 | def ref_create(repo, ref_name, sha):
1981 |     with open(repo_file(repo, "refs/" + ref_name), 'w') as fp:
1982 |         fp.write(sha + "\n")
1983 | #+end_src
1984 |
1985 | ** What's a branch?
1986 | :PROPERTIES:
1987 | :CUSTOM_ID: branches
1988 | :END:
1989 |
1990 | Tags are done. Now for another big chunk: branches.
1991 |
1992 | It's time to address the elephant in the room: like most Git users,
1993 | wyag still doesn't have any idea what a branch is. It currently
1994 | treats a repository as a bunch of disorganized objects, some of them
1995 | commits, and has no representation whatsoever of the fact that commits
1996 | are grouped in branches, and that at every point in time there's a
1997 | commit that's =HEAD=, /ie/, the *head* commit (or "tip") of the
1998 | *active* branch.
1999 |
2000 | So, what's a branch? The answer is actually surprisingly simple, but
2001 | it may also end up being simply surprising: *a branch is a reference
2002 | to a commit*. You could even say that a branch is a kind of a name
2003 | for a commit. In this regard, a branch is exactly the same thing as a
2004 | tag. Tags are refs that live in =.git/refs/tags=, branches are refs
2005 | that live in =.git/refs/heads=.
2006 |
2007 | There are, of course, differences between a branch and a tag:
2008 |
2009 | 1. Branches are references to a /commit/, tags can refer to any object;
2010 | 2. Most importantly, the branch ref is updated at each commit. This means
2011 |    that whenever you commit, Git actually does this:
2012 |    1. a new commit object is created, with the current branch's
2013 |       (commit!) ID as its parent;
2014 |    2. the commit object is hashed and stored;
2015 |    3. the branch ref is updated to refer to the new commit's hash.
2016 |
2017 | That's all.
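
The ref side of those steps can be sketched in a few lines. This is an illustration only, with assumptions stated plainly: =commit_on_branch= is a hypothetical helper, the "git directory" is a temporary folder, and the commit hashes are fake (we skip building and storing the commit object, and only show how the branch file moves).

```python
import os
import tempfile

def commit_on_branch(gitdir, branch, new_sha):
    """Move branch to new_sha; return the old tip, i.e. the new commit's parent."""
    branch_file = os.path.join(gitdir, "refs", "heads", branch)

    parent = None  # No parent before the very first commit
    if os.path.isfile(branch_file):
        with open(branch_file) as fp:
            parent = fp.read().strip()

    # (A real implementation would create and store the commit object here,
    # with `parent` recorded inside it.)
    os.makedirs(os.path.dirname(branch_file), exist_ok=True)
    with open(branch_file, "w") as fp:
        fp.write(new_sha + "\n")
    return parent

with tempfile.TemporaryDirectory() as gitdir:
    first_parent = commit_on_branch(gitdir, "main", "a" * 40)
    second_parent = commit_on_branch(gitdir, "main", "b" * 40)
    with open(os.path.join(gitdir, "refs", "heads", "main")) as fp:
        tip = fp.read().strip()

print(first_parent)               # None
print(second_parent == "a" * 40)  # True
print(tip == "b" * 40)            # True
```

A tag created at the first commit would still contain the first hash afterwards: nothing ever rewrites it. That is the whole practical difference between the two kinds of refs.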
2018 |
2019 | But what about the *current* branch? It's actually even easier. It's a
2020 | ref file outside of the =refs= hierarchy, in =.git/HEAD=, which is an
2021 | *indirect* ref (that is, it is of the form =ref: path/to/other/ref=, and
2022 | not a simple hash).
2023 |
2024 | #+begin_note
2025 | *Detached HEAD*
2026 |
2027 | When you just check out a random commit, git will warn you it's in
2028 | "detached HEAD state". This means you're not on any branch anymore.
2029 | In this case, =.git/HEAD= is a *direct* reference: it contains a
2030 | SHA-1.
2031 | #+end_note
2032 |
2033 | ** Referring to objects: the =object_find= function
2034 | :PROPERTIES:
2035 | :CUSTOM_ID: object_find
2036 | :END:
2037 |
2038 | *** Resolving names
2039 |
2040 | Remember when we created [[placeholder-object_find][the stupid =object_find= function]] that would
2041 | take four arguments, return the second unmodified and ignore the other
2042 | three? It's time to replace it with something more useful. We're going
2043 | to implement a small, but usable, subset of the actual Git name
2044 | resolution algorithm. The new =object_find()= will work in two steps:
2045 | first, given a name, it will return a complete sha-1 hash. For
2046 | example, with =HEAD=, it will return the hash of the head commit of the
2047 | current branch, etc. More precisely, this name resolution function
2048 | will work like this:
2049 |
2050 | - If =name= is =HEAD=, it will just resolve =.git/HEAD=;
2051 | - If =name= is a full hash, this hash is returned unmodified.
2052 | - If =name= looks like a short hash, it will collect objects whose full
2053 |   hash begins with this short hash.
2054 | - Finally, it will resolve tags and branches matching name.
2055 |
2056 | Notice how the last two steps /collect/ values: the first two are
2057 | absolute references, so we can safely return a result. But short
2058 | hashes or branch names can be ambiguous, so we want to enumerate all
2059 | possible meanings of the name and raise an error if we've found more
2060 | than one.
2061 |
2062 | #+begin_info
2063 | *Short hashes*
2064 |
2065 | For convenience, Git lets you refer to hashes by a prefix of their
2066 | name. For example, =5bd254aa973646fa16f66d702a5826ea14a3eb45= can
2067 | be referred to as =5bd254=. This is called a "short hash".
2068 | #+end_info
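
Short-hash lookup is just prefix matching, which is also why it can be ambiguous. The toy sketch below works on an in-memory set of hashes instead of the =.git/objects= directory; =expand_short_hash= is a hypothetical helper, and two of the three hashes are made up so that two of them share a prefix.

```python
def expand_short_hash(prefix, known_hashes):
    """Return every known hash starting with prefix (case-insensitive)."""
    prefix = prefix.lower()
    return sorted(h for h in known_hashes if h.startswith(prefix))

hashes = {
    "5bd254aa973646fa16f66d702a5826ea14a3eb45",
    "5bd2" + "f" * 36,    # made-up hash sharing the 5bd2 prefix
    "e0695f" + "0" * 34,  # made-up hash
}

print(expand_short_hash("5bd254", hashes))     # exactly one match
print(len(expand_short_hash("5bd2", hashes)))  # 2: ambiguous
print(expand_short_hash("ffff", hashes))       # []
```

A resolver built on this would return the single match, and report an error when the list holds zero or several candidates.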
2069 |
2070 | #+BEGIN_SRC python :tangle libwyag.py
2071 | def object_resolve(repo, name):
2072 |     """Resolve name to an object hash in repo.
2073 |
2074 |     This function is aware of:
2075 |
2076 |     - the HEAD literal
2077 |     - short and long hashes
2078 |     - tags
2079 |     - branches
2080 |     - remote branches"""
2081 |     candidates = list()
2082 |     hashRE = re.compile(r"^[0-9A-Fa-f]{4,40}$")
2083 |
2084 |     # Empty string? Abort.
2085 |     if not name.strip():
2086 |         return None
2087 |
2088 |     # Head is nonambiguous
2089 |     if name == "HEAD":
2090 |         return [ ref_resolve(repo, "HEAD") ]
2091 |
2092 |     # If it's a hex string, try for a hash.
2093 |     if hashRE.match(name):
2094 |         # This may be a hash, either small or full. 4 seems to be the
2095 |         # minimal length for git to consider something a short hash.
2096 |         # This limit is documented in man git-rev-parse
2097 |         name = name.lower()
2098 |         prefix = name[0:2]
2099 |         path = repo_dir(repo, "objects", prefix, mkdir=False)
2100 |         if path:
2101 |             rem = name[2:]
2102 |             for f in os.listdir(path):
2103 |                 if f.startswith(rem):
2104 |                     # Notice a string startswith() itself, so this
2105 |                     # works for full hashes.
2106 |                     candidates.append(prefix + f)
2107 |
2108 |     # Try for references.
2109 |     as_tag = ref_resolve(repo, "refs/tags/" + name)
2110 |     if as_tag: # Did we find a tag?
2111 |         candidates.append(as_tag)
2112 |
2113 |     as_branch = ref_resolve(repo, "refs/heads/" + name)
2114 |     if as_branch: # Did we find a branch?
2115 |         candidates.append(as_branch)
2116 |
2117 |     as_remote_branch = ref_resolve(repo, "refs/remotes/" + name)
2118 |     if as_remote_branch: # Did we find a remote branch?
2119 |         candidates.append(as_remote_branch)
2120 |
2121 |     return candidates
2122 | #+END_SRC
2123 |
2124 | The second step is to follow the object we found to an object of the
2125 | required type, if a type argument was provided. Since we only need to
2126 | handle trivial cases, this is a very simple iterative process:
2127 |
2128 | - If we have a tag and =fmt= is anything else, we follow the tag.
2129 | - If we have a commit and =fmt= is tree, we return this commit's tree
2130 |   object.
2131 | - In all other situations, we bail out: nothing else makes sense.
2132 |
2133 | (The process is iterative because it may take an undefined number of
2134 | steps, since tags themselves can be tagged)
2135 |
2136 | #+BEGIN_SRC python :tangle libwyag.py
2137 | def object_find(repo, name, fmt=None, follow=True):
2138 |     sha = object_resolve(repo, name)
2139 |
2140 |     if not sha:
2141 |         raise Exception(f"No such reference {name}.")
2142 |
2143 |     if len(sha) > 1:
2144 |         raise Exception(f"Ambiguous reference {name}: Candidates are:\n - " + "\n - ".join(sha) + ".")
2145 |
2146 |     sha = sha[0]
2147 |
2148 |     if not fmt:
2149 |         return sha
2150 |
2151 |     while True:
2152 |         obj = object_read(repo, sha)
2153 |         #     ^^^^^^^^^^^ < this is a bit aggressive: we're reading
2154 |         # the full object just to get its type. And we're doing
2155 |         # that in a loop, albeit normally short. Don't expect
2156 |         # high performance here.
2157 |
2158 |         if obj.fmt == fmt:
2159 |             return sha
2160 |
2161 |         if not follow:
2162 |             return None
2163 |
2164 |         # Follow tags
2165 |         if obj.fmt == b'tag':
2166 |             sha = obj.kvlm[b'object'].decode("ascii")
2167 |         elif obj.fmt == b'commit' and fmt == b'tree':
2168 |             sha = obj.kvlm[b'tree'].decode("ascii")
2169 |         else:
2170 |             return None
2171 | #+END_SRC
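
To see the "follow" loop in isolation, here's a minimal sketch that runs
the same logic against an in-memory dictionary instead of a real
repository. The =FAKE_OBJECTS= store, its made-up short IDs and the
=follow_to= helper are assumptions for illustration only; they aren't
part of wyag.

#+begin_src python
# Hypothetical object store: id -> (type, id of the object it points to).
FAKE_OBJECTS = {
    "t2": ("tag", "t1"),       # a tag pointing at another tag
    "t1": ("tag", "c1"),       # a tag pointing at a commit
    "c1": ("commit", "tr1"),   # a commit pointing at its tree
    "tr1": ("tree", None),
    "b1": ("blob", None),
}

def follow_to(sha, fmt):
    """Follow tags (and commit -> tree) until we reach an object of type fmt."""
    while True:
        obj_type, target = FAKE_OBJECTS[sha]
        if obj_type == fmt:
            return sha
        if obj_type == "tag":
            sha = target              # tags can point to anything, even tags
        elif obj_type == "commit" and fmt == "tree":
            sha = target              # a commit can stand in for its tree
        else:
            return None               # nothing else makes sense

print(follow_to("t2", "commit"))      # two hops through tags
print(follow_to("t2", "tree"))        # tags, then commit -> tree
print(follow_to("b1", "tree"))        # a blob can't be followed anywhere
#+end_src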
2172 |
2173 | With the new =object_find()=, the CLI wyag becomes a bit more usable. You can now do things like:
2174 |
2175 | #+begin_example
2176 | $ wyag checkout v3.11 # A tag
2177 | $ wyag checkout feature/explosions # A branch
2178 | $ wyag ls-tree -r HEAD # The active branch or commit. There's also a
2179 |                        # follow here: HEAD is actually a commit.
2180 | $ wyag cat-file blob e0695f # A short hash
2181 | $ wyag cat-file tree master # A branch, as a tree (another "follow")
2182 | #+end_example
2183 |
2184 | *** The rev-parse command
2185 | :PROPERTIES:
2186 | :CUSTOM_ID: cmd-rev-parse
2187 | :END:
2188 |
2189 | Let's implement =wyag rev-parse=. The =git rev-parse= command does a
2190 | lot, but one of its use cases, the one we're going to clone, is
2191 | resolving references. For the purpose of further testing the "follow"
2192 | feature of =object_find=, we'll add an optional =wyag-type= argument
2193 | to its interface.
2194 |
2195 | #+BEGIN_SRC python :tangle libwyag.py
2196 | argsp = argsubparsers.add_parser(
2197 |     "rev-parse",
2198 |     help="Parse revision (or other objects) identifiers")
2199 |
2200 | argsp.add_argument("--wyag-type",
2201 |                    metavar="type",
2202 |                    dest="type",
2203 |                    choices=["blob", "commit", "tag", "tree"],
2204 |                    default=None,
2205 |                    help="Specify the expected type")
2206 |
2207 | argsp.add_argument("name",
2208 |                    help="The name to parse")
2209 | #+END_SRC
2210 |
2211 | The bridge does all the work:
2212 |
2213 | #+BEGIN_SRC python :tangle libwyag.py
2214 | def cmd_rev_parse(args):
2215 |     if args.type:
2216 |         fmt = args.type.encode()
2217 |     else:
2218 |         fmt = None
2219 |
2220 |     repo = repo_find()
2221 |
2222 |     print(object_find(repo, args.name, fmt, follow=True))
2223 | #+END_SRC
2224 |
2225 | And it works:
2226 |
2227 | #+begin_example
2228 | $ wyag rev-parse --wyag-type commit HEAD
2229 | 6c22393f5e3830d15395fd8d2f8b0cf8eb40dd58
2230 | $ wyag rev-parse --wyag-type tree HEAD
2231 | 11d33fad71dbac72840aff1447e0d080c7484361
2232 | $ wyag rev-parse --wyag-type tag HEAD
2233 | None
2234 | #+end_example
2235 |
2236 | * Working with the staging area and the index file
2237 | :PROPERTIES:
2238 | :CUSTOM_ID: staging-area
2239 | :END:
2240 |
2241 | ** What's the index file?
2242 | :PROPERTIES:
2243 | :CUSTOM_ID: staging-intro
2244 | :END:
2245 |
2246 | This final step will bring us to where commits happen (although
2247 | actually creating them is for the next section!)
2248 |
2249 | You probably know that to commit in Git, you first "stage" some
2250 | changes, using =git add= and =git rm=, and only /then/ do you commit
2251 | those changes. This intermediate stage between the last and the next
2252 | commit is called the *staging area*.
2253 |
2254 | It would seem natural to use a commit or tree object to represent the
2255 | staging area, but Git actually uses a completely different
2256 | mechanism, in the form of what it calls the *index file*.
2257 |
2258 | After a commit, the index file is a sort of copy of that commit: it
2259 | holds the same path/blob association as the corresponding tree. But
2260 | it also holds extra information about files in the worktree, like
2261 | their creation/modification time, so =git status= doesn't often need
2262 | to actually compare files: it just checks that their modification time
2263 | is the same as the one stored in the index file, and only if it isn't
2264 | does it perform an actual comparison.
2265 |
2266 | You can thus consider the index file as a three-way association list:
2267 | not only paths with blobs, but also paths with actual filesystem
2268 | entries.
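
The timestamp shortcut is easy to try on its own. The following is a
rough sketch, not wyag code: =blob_sha= and =is_modified= are
hypothetical helpers, and the =cached= dict only stands in for a real
index entry. The one thing taken from Git itself is how a blob is
hashed (SHA-1 over =blob <size>=, a null byte, then the contents).

#+begin_src python
import hashlib
import os
import tempfile

def blob_sha(data):
    # Git hashes a blob as sha1(b"blob <size>\0" + contents).
    header = b"blob " + str(len(data)).encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()

def is_modified(path, cached):
    """cached is a hypothetical stand-in for an index entry."""
    if os.stat(path).st_mtime_ns == cached["mtime_ns"]:
        return False        # Same timestamp: assume unchanged, don't hash.
    with open(path, "rb") as f:
        return blob_sha(f.read()) != cached["sha"]

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "disp.c")
    original = b"int main(void) { return 0; }\n"
    with open(path, "wb") as f:
        f.write(original)
    cached = {"mtime_ns": os.stat(path).st_mtime_ns, "sha": blob_sha(original)}

    unchanged = is_modified(path, cached)

    # Simulate a "touch": the timestamp changes, the contents don't.
    os.utime(path, ns=(0, 0))
    touched = is_modified(path, cached)

    # A real edit (we force the timestamp so the test is deterministic).
    with open(path, "wb") as f:
        f.write(b"int main(void) { return 1; }\n")
    os.utime(path, ns=(1, 1))
    edited = is_modified(path, cached)

print(unchanged, touched, edited)
#+end_src

Only the last two calls actually hash the file; the first returns as
soon as the timestamps agree.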
2269 |
2270 | Another important characteristic of the *index file* is that unlike a
2271 | tree, it can represent inconsistent states, like a merge conflict,
2272 | whereas a tree is always a complete, unambiguous representation.
2273 |
2274 | When you commit, what git actually does is turn the index file into a
2275 | new tree object. To summarize:
2276 |
2277 | 1. When the repository is “clean”, the index file holds the exact
2278 |    same contents as the HEAD commit, plus metadata about the
2279 |    corresponding filesystem entries. For instance, it may contain
2280 |    something like:
2281 |
2282 |    #+begin_quote
2283 |    There's a file called =src/disp.c= whose contents are blob
2284 |    797441c76e59e28794458b39b0f1eff4c85f4fa0. The real =src/disp.c=
2285 |    file, in the worktree, was created on 2023-07-15
2286 |    15:28:29.168572151, and last modified 2023-07-15
2287 |    15:28:29.1689427709. It is stored on device 65026, inode 8922881.
2288 |    #+end_quote
2289 |
2290 | 2. When you =git add= or =git rm=, the index file is modified
2291 |    accordingly. In the example above, if you modify =src/disp.c=,
2292 |    and =add= your changes, the index file will be updated with a new
2293 |    blob ID (the blob itself will also be created in the process, of
2294 |    course), and the various file metadata will be updated as well so
2295 |    =git status= knows when not to compare file contents.
2296 |
2297 | 3. When you =git commit= those changes, a new tree is produced from
2298 |    the index file, a new commit object is generated with that tree,
2299 |    branches are updated and we're done.
2300 |
2301 | #+begin_note
2302 | *A note on words*
2303 |
2304 | The staging area and the index are thus the same thing, but the name
2305 | "staging area" is more the name of the git user-exposed feature
2306 | (that could have been implemented otherwise), the abstraction if you
2307 | will; while "index file" refers specifically to the way this
2308 | abstract feature is actually implemented in git.
2309 | #+end_note
2310 |
2311 | ** Parsing the index
2312 | :PROPERTIES:
2313 | :CUSTOM_ID: index_read
2314 | :END:
2315 |
2316 | The index file is by far the most complicated piece of data a Git
2317 | repository can hold. Its complete documentation can be found in the Git
2318 | source tree or rendered [[https://git-scm.com/docs/index-format][on the git website]]. It's made of three parts:
2319 |
2320 | - A header with the format version number and the number of entries
2321 |   the index holds;
2322 | - A series of entries, sorted by name, each representing a file and
2323 |   padded to a multiple of 8 bytes;
2324 | - A series of optional extensions, which we'll ignore.
2325 | # @FIXME ^ Sorted how? Do we need to think about this?
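
The header is small enough to try by hand. Here's a minimal sketch,
assuming only what's described above (a 4-byte signature followed by
two big-endian 32-bit integers); =parse_index_header= is a hypothetical
helper, and the header bytes are built by hand rather than read from a
real repository.

#+begin_src python
import struct

def parse_index_header(raw):
    # ">4sII": big-endian, a 4-byte signature, then two unsigned
    # 32-bit integers (version and entry count).
    signature, version, count = struct.unpack(">4sII", raw[:12])
    if signature != b"DIRC":
        raise ValueError("Not an index file")
    return version, count

# A hand-built header: version 2, three entries.
fake_header = b"DIRC" + (2).to_bytes(4, "big") + (3).to_bytes(4, "big")
print(parse_index_header(fake_header))
#+end_src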
2326 |
2327 | The first thing we need to represent is a single entry. It actually
2328 | holds quite a lot of stuff, I'm leaving the details in comments.
2329 | It's worth observing that an entry stores *both* the SHA-1 of the
2330 | associated blob in the object store /and/ a ton of metadata about the
2331 | actual file on the actual filesystem. Again, this is because
2332 | =git/wyag status= will need to determine which files in the index were
2333 | modified: it is much more efficient to begin by checking the
2334 | last-modified timestamp and comparing it with a known value, before
2335 | comparing actual files.
2336 |
2337 | #+begin_src python :tangle libwyag.py
2338 | class GitIndexEntry (object):
2339 |     def __init__(self, ctime=None, mtime=None, dev=None, ino=None,
2340 |                  mode_type=None, mode_perms=None, uid=None, gid=None,
2341 |                  fsize=None, sha=None, flag_assume_valid=None,
2342 |                  flag_stage=None, name=None):
2343 |         # The last time a file's metadata changed. This is a pair
2344 |         # (timestamp in seconds, nanoseconds)
2345 |         self.ctime = ctime
2346 |         # The last time a file's data changed. This is a pair
2347 |         # (timestamp in seconds, nanoseconds)
2348 |         self.mtime = mtime
2349 |         # The ID of device containing this file
2350 |         self.dev = dev
2351 |         # The file's inode number
2352 |         self.ino = ino
2353 |         # The object type, either b1000 (regular), b1010 (symlink),
2354 |         # b1110 (gitlink).
2355 |         self.mode_type = mode_type
2356 |         # The object permissions, an integer.
2357 |         self.mode_perms = mode_perms
2358 |         # User ID of owner
2359 |         self.uid = uid
2360 |         # Group ID of owner
2361 |         self.gid = gid
2362 |         # Size of this object, in bytes
2363 |         self.fsize = fsize
2364 |         # The object's SHA
2365 |         self.sha = sha
2366 |         self.flag_assume_valid = flag_assume_valid
2367 |         self.flag_stage = flag_stage
2368 |         # Name of the object (full path this time!)
2369 |         self.name = name
2370 | #+end_src
2371 |
2372 | The index file is a binary file, likely for performance reasons. The
2373 | format is reasonably simple, though. It begins with a header with the
2374 | =DIRC= magic bytes, a version number and the total number of entries
2375 | in that index file. We create the =GitIndex= class to hold them:
2376 |
2377 | #+BEGIN_SRC python :tangle libwyag.py
2378 | class GitIndex (object):
2379 |     version = None
2380 |     entries = []
2381 |     # ext = None
2382 |     # sha = None
2383 |
2384 |     def __init__(self, version=2, entries=None):
2385 |         if not entries:
2386 |             entries = list()
2387 |
2388 |         self.version = version
2389 |         self.entries = entries
2390 | #+END_SRC
2391 |
2392 | And a parser to read index files into those objects. After reading
2393 | the 12-byte header, we just parse entries in the order they appear.
2394 | An entry begins with a set of fixed-length data, followed by a
2395 | variable-length name.
2396 |
2397 | The code is quite straightforward, but as it's reading a binary
2398 | format, it feels messier than what we've done so far. We use
2399 | =int.from_bytes(bytes, endianness)= a lot to read raw bytes into an
2400 | integer, and just a few bitwise operations to separate data
2401 | that share the same byte.
2402 |
2403 | (This format was probably designed so index files could just be
2404 | =mmap()=-ed to memory, and read directly as C structs, with an index
2405 | built in O(n) time in most cases. This kind of approach tends to
2406 | produce more elegant code in C than in Python…)
2407 |
2408 | #+BEGIN_SRC python :tangle libwyag.py
2409 | def index_read(repo):
2410 |     index_file = repo_file(repo, "index")
2411 |
2412 |     # New repositories have no index!
2413 |     if not os.path.exists(index_file):
2414 |         return GitIndex()
2415 |
2416 |     with open(index_file, 'rb') as f:
2417 |         raw = f.read()
2418 |
2419 |     header = raw[:12]
2420 |     signature = header[:4]
2421 |     assert signature == b"DIRC" # Stands for "DirCache"
2422 |     version = int.from_bytes(header[4:8], "big")
2423 |     assert version == 2, "wyag only supports index file version 2"
2424 |     count = int.from_bytes(header[8:12], "big")
2425 |
2426 |     entries = list()
2427 |
2428 |     content = raw[12:]
2429 |     idx = 0
2430 |     for i in range(0, count):
2431 |         # Read the ctime (last metadata change), as a unix timestamp
2432 |         # (seconds since 1970-01-01 00:00:00, the "epoch")
2433 |         ctime_s = int.from_bytes(content[idx: idx+4], "big")
2434 |         # Then the nanoseconds after that timestamp, for extra
2435 |         # precision.
2436 |         ctime_ns = int.from_bytes(content[idx+4: idx+8], "big")
2437 |         # Same for modification time: first seconds from epoch.
2438 |         mtime_s = int.from_bytes(content[idx+8: idx+12], "big")
2439 |         # Then extra nanoseconds
2440 |         mtime_ns = int.from_bytes(content[idx+12: idx+16], "big")
2441 |         # Device ID
2442 |         dev = int.from_bytes(content[idx+16: idx+20], "big")
2443 |         # Inode
2444 |         ino = int.from_bytes(content[idx+20: idx+24], "big")
2445 |         # Ignored.
2446 |         unused = int.from_bytes(content[idx+24: idx+26], "big")
2447 |         assert 0 == unused
2448 |         mode = int.from_bytes(content[idx+26: idx+28], "big")
2449 |         mode_type = mode >> 12
2450 |         assert mode_type in [0b1000, 0b1010, 0b1110]
2451 |         mode_perms = mode & 0b0000000111111111
2452 |         # User ID
2453 |         uid = int.from_bytes(content[idx+28: idx+32], "big")
2454 |         # Group ID
2455 |         gid = int.from_bytes(content[idx+32: idx+36], "big")
2456 |         # Size
2457 |         fsize = int.from_bytes(content[idx+36: idx+40], "big")
2458 |         # SHA (object ID). We'll store it as a lowercase hex string
2459 |         # for consistency.
2460 |         sha = format(int.from_bytes(content[idx+40: idx+60], "big"), "040x")
2461 |         # Flags: 16 bits holding three flags and the name length.
2462 |         flags = int.from_bytes(content[idx+60: idx+62], "big")
2463 |         # Parse flags
2464 |         flag_assume_valid = (flags & 0b1000000000000000) != 0
2465 |         flag_extended = (flags & 0b0100000000000000) != 0
2466 |         assert not flag_extended
2467 |         flag_stage = flags & 0b0011000000000000
2468 |         # Length of the name. This is stored on 12 bits, so the max
2469 |         # value is 0xFFF, 4095. Since names can occasionally go
2470 |         # beyond that length, git treats 0xFFF as meaning at least
2471 |         # 0xFFF, and looks for the final 0x00 to find the end of the
2472 |         # name, at a small, and probably very rare, performance
2473 |         # cost.
2474 |         name_length = flags & 0b0000111111111111
2475 |
2476 |         # We've read 62 bytes so far.
2477 |         idx += 62
2478 |
2479 |         if name_length < 0xFFF:
2480 |             assert content[idx + name_length] == 0x00
2481 |             raw_name = content[idx:idx+name_length]
2482 |             idx += name_length + 1
2483 |         else:
2484 |             print(f"Notice: Name is 0x{name_length:X} bytes long.")
2485 |             # This probably wasn't tested enough. It works with a
2486 |             # path of exactly 0xFFF bytes. Any extra bytes broke
2487 |             # something between git, my shell and my filesystem.
2488 |             null_idx = content.find(b'\x00', idx + 0xFFF)
2489 |             raw_name = content[idx: null_idx]
2490 |             idx = null_idx + 1
2491 |
2492 |         # Just parse the name as utf8.
2493 |         name = raw_name.decode("utf8")
2494 |
2495 |         # Data is padded on multiples of eight bytes for pointer
2496 |         # alignment, so we skip as many bytes as we need for the next
2497 |         # read to start at the right position.
2498 |
2499 |         idx = 8 * ceil(idx / 8)
2500 |
2501 |         # And we add this entry to our list.
2502 |         entries.append(GitIndexEntry(ctime=(ctime_s, ctime_ns),
2503 |                                      mtime=(mtime_s, mtime_ns),
2504 |                                      dev=dev,
2505 |                                      ino=ino,
2506 |                                      mode_type=mode_type,
2507 |                                      mode_perms=mode_perms,
2508 |                                      uid=uid,
2509 |                                      gid=gid,
2510 |                                      fsize=fsize,
2511 |                                      sha=sha,
2512 |                                      flag_assume_valid=flag_assume_valid,
2513 |                                      flag_stage=flag_stage,
2514 |                                      name=name))
2515 |
2516 |     return GitIndex(version=version, entries=entries)
2517 | #+END_SRC
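
If the bit twiddling feels opaque, it can be tried on hand-made values.
This is only a sketch: =split_mode=, =split_flags= and
=padded_entry_length= are hypothetical helpers, not wyag functions.
Unlike =index_read=, this =split_flags= shifts the stage bits down so
the stage reads as a number between 0 and 3.

#+begin_src python
def split_mode(mode):
    # Top 4 bits: object type. Low 9 bits: unix permissions.
    return mode >> 12, mode & 0b111111111

def split_flags(flags):
    assume_valid = (flags & 0x8000) != 0
    extended = (flags & 0x4000) != 0
    stage = (flags & 0x3000) >> 12      # shifted down to 0..3
    name_length = flags & 0x0FFF        # 12 bits, so at most 0xFFF
    return assume_valid, extended, stage, name_length

def padded_entry_length(name_length):
    # 62 fixed bytes, the name, at least one null byte, then rounded
    # up to the next multiple of 8.
    return (62 + name_length + 8) & ~7

mode_type, perms = split_mode(0o100644)   # a regular file, rw-r--r--
print(bin(mode_type), oct(perms))
print(split_flags(0x2009))                # stage 2, 9-byte name
print(padded_entry_length(9), padded_entry_length(10))
#+end_src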
2518 |
2519 | ** The ls-files command
2520 | :PROPERTIES:
2521 | :CUSTOM_ID: cmd-ls-files
2522 | :END:
2523 |
2524 | =git ls-files= displays the names of files in the staging area, with,
2525 | as usual, a ton of options. Our =ls-files= will be much simpler,
2526 | /but/ we'll add a =--verbose= option that doesn't exist in git, just
2527 | so we can display every single bit of info in the index file.
2528 |
2529 | #+BEGIN_SRC python :tangle libwyag.py
2530 | argsp = argsubparsers.add_parser("ls-files", help = "List all the staged files")
2531 | argsp.add_argument("--verbose", action="store_true", help="Show everything.")
2532 |
2533 | def cmd_ls_files(args):
2534 |     repo = repo_find()
2535 |     index = index_read(repo)
2536 |     if args.verbose:
2537 |         print(f"Index file format v{index.version}, containing {len(index.entries)} entries.")
2538 |
2539 |     for e in index.entries:
2540 |         print(e.name)
2541 |         if args.verbose:
2542 |             entry_type = { 0b1000: "regular file",
2543 |                            0b1010: "symlink",
2544 |                            0b1110: "git link" }[e.mode_type]
2545 |             print(f"  {entry_type} with perms: {e.mode_perms:o}")
2546 |             print(f"  on blob: {e.sha}")
2547 |             print(f"  created: {datetime.fromtimestamp(e.ctime[0])}.{e.ctime[1]}, modified: {datetime.fromtimestamp(e.mtime[0])}.{e.mtime[1]}")
2548 |             print(f"  device: {e.dev}, inode: {e.ino}")
2549 |             print(f"  user: {pwd.getpwuid(e.uid).pw_name} ({e.uid})  group: {grp.getgrgid(e.gid).gr_name} ({e.gid})")
2550 |             print(f"  flags: stage={e.flag_stage} assume_valid={e.flag_assume_valid}")
2551 | #+END_SRC
2552 |
2553 | If you run ls-files, you'll notice that on a “clean” worktree (an
2554 | unmodified checkout of =HEAD=), it lists all files on =HEAD=. Again,
2555 | the index is not a /delta/ (a set of differences) from the =HEAD=
2556 | commit, but starts as a copy of it, in a different format.
2557 |
2558 | ** A detour: the check-ignore command
2559 | :PROPERTIES:
2560 | :CUSTOM_ID: cmd-check-ignore
2561 | :END:
2562 |
2563 | We want to write =status=, but =status= needs to know about ignore
2564 | rules, which are stored in the various =.gitignore= files. So we first
2565 | need to add some rudimentary support for ignore files in =wyag=.
2566 | We'll expose this support as the =check-ignore= command, which takes a
2567 | list of paths and prints back the ones that should be
2568 | ignored.
2569 |
2570 | Again, the command parser is trivial:
2571 |
2572 | #+BEGIN_SRC python :tangle libwyag.py
2573 | argsp = argsubparsers.add_parser("check-ignore", help = "Check path(s) against ignore rules.")
2574 | argsp.add_argument("path", nargs="+", help="Paths to check")
2575 | #+END_src
2576 |
2577 | And the function is just as simple:
2578 |
2579 | #+BEGIN_SRC python :tangle libwyag.py
2580 | def cmd_check_ignore(args):
2581 |     repo = repo_find()
2582 |     rules = gitignore_read(repo)
2583 |     for path in args.path:
2584 |         if check_ignore(rules, path):
2585 |             print(path)
2586 | #+END_src
2587 |
2588 | But of course, most of the functions we call don't exist yet in wyag.
2589 | We'll begin by writing a reader for rules in ignore files,
2590 | =gitignore_read()=. The syntax of those rules is quite simple: each
2591 | line in an ignore file is an exclusion pattern, so files that match
2592 | this pattern are ignored by =status=, =add -A= and so on. There are
2593 | three special cases, though:
2594 |
2595 | 1. Lines that begin with an exclamation mark =!= /negate/ the pattern
2596 |    (files that match this pattern are /included/, even if they were
2597 |    ignored by an earlier pattern)
2598 | 2. Lines that begin with a hash =#= are comments, and are skipped.
2599 | 3. A backslash =\= at the beginning treats =!= and =#= as literal
2600 |    characters.
2601 |
2602 | First, a parser for a single pattern. This parser returns a pair: the
2603 | pattern itself, and a boolean to indicate if files matching the
2604 | pattern /should/ be excluded (=True=) or included (=False=). In other
2605 | words, =False= if the pattern did start with =!=, =True= otherwise.
2606 |
2607 | #+begin_src python :tangle libwyag.py
2608 | def gitignore_parse1(raw):
2609 |     raw = raw.strip() # Remove leading/trailing spaces
2610 |
2611 |     if not raw or raw[0] == "#":
2612 |         return None
2613 |     elif raw[0] == "!":
2614 |         return (raw[1:], False)
2615 |     elif raw[0] == "\\":
2616 |         return (raw[1:], True)
2617 |     else:
2618 |         return (raw, True)
2619 | #+end_src
2620 |
2621 | Parsing a file is just collecting all rules in that file. Notice this
2622 | function doesn't parse /files/, but just lists of lines: that's
2623 | because we'll need to read rules from git blobs as well, and not just
2624 | regular files.
2625 |
2626 | #+begin_src python :tangle libwyag.py
2627 | def gitignore_parse(lines):
2628 |     ret = list()
2629 |
2630 |     for line in lines:
2631 |         parsed = gitignore_parse1(line)
2632 |         if parsed:
2633 |             ret.append(parsed)
2634 |
2635 |     return ret
2636 | #+end_src
2637 |
2638 | The last thing we need to do is collect the various ignore files. They
2639 | come in two kinds:
2640 |
2641 | - Some of these files *live in the index*: they're the various
2642 |   =.gitignore= files. Emphasis on the plural; although there often is
2643 |   only one such file, at the root, there can be one in each
2644 |   directory, and it applies to this directory and its subdirectories.
2645 |   I'll call those *scoped*, because they only apply to paths under
2646 |   their directory.
2647 | - The others live *outside the index*. They're the global ignore
2648 |   file (usually in =~/.config/git/ignore=) and the
2649 |   repository-specific =.git/info/exclude=. I call those *absolute*,
2650 |   because they apply everywhere, but at a lower priority.
2651 |
2652 | Again, a class to hold that: a list of absolute rules, a dict
2653 | (hashmap) of scoped rules. The keys to this hashmap are
2654 | *directories*, relative to the root of a worktree.
2655 |
2656 | #+begin_src python :tangle libwyag.py
2657 | class GitIgnore(object):
2658 |     absolute = None
2659 |     scoped = None
2660 |
2661 |     def __init__(self, absolute, scoped):
2662 |         self.absolute = absolute
2663 |         self.scoped = scoped
2664 | #+end_src
2665 |
2666 | And finally our function to collect all gitignore rules in a
2667 | repository, and return a =GitIgnore= object. Notice how it reads
2668 | scoped files from the index, and not the worktree: only /staged/
2669 | =.gitignore= files matter (also remember: HEAD is /already/ staged ---
2670 | the staging area is a copy, not a delta).
2671 |
2672 | #+begin_src python :tangle libwyag.py
2673 | def gitignore_read(repo):
2674 |     ret = GitIgnore(absolute=list(), scoped=dict())
2675 |
2676 |     # Read local configuration in .git/info/exclude. (The variable is
2677 |     # named so it doesn't shadow the repo_file() helper.)
2678 |     exclude_file = os.path.join(repo.gitdir, "info/exclude")
2679 |     if os.path.exists(exclude_file):
2680 |         with open(exclude_file, "r") as f:
2681 |             ret.absolute.append(gitignore_parse(f.readlines()))
2682 |
2683 |     # Global configuration
2684 |     if "XDG_CONFIG_HOME" in os.environ:
2685 |         config_home = os.environ["XDG_CONFIG_HOME"]
2686 |     else:
2687 |         config_home = os.path.expanduser("~/.config")
2688 |     global_file = os.path.join(config_home, "git/ignore")
2689 |
2690 |     if os.path.exists(global_file):
2691 |         with open(global_file, "r") as f:
2692 |             ret.absolute.append(gitignore_parse(f.readlines()))
2693 |
2694 |     # .gitignore files in the index
2695 |     index = index_read(repo)
2696 |
2697 |     for entry in index.entries:
2698 |         if entry.name == ".gitignore" or entry.name.endswith("/.gitignore"):
2699 |             dir_name = os.path.dirname(entry.name)
2700 |             contents = object_read(repo, entry.sha)
2701 |             lines = contents.blobdata.decode("utf8").splitlines()
2702 |             ret.scoped[dir_name] = gitignore_parse(lines)
2703 |     return ret
2703 | #+end_src
2704 |
2705 | We're almost there. To tie everything together, we need the
2706 | =check_ignore= function that matches a path, relative to the root of a
2707 | worktree, against a set of rules. This is how this function will
2708 | work:
2709 |
2710 | - It will first try to match this path against the *scoped* rules.
2711 |   It will do this from the deepest parent of the path to the
2712 |   farthest. That is, if the path is
2713 |   =src/support/w32/legacy/sound.c~=, it will first look for rules in
2714 |   =src/support/w32/legacy/.gitignore=, then
2715 |   =src/support/w32/.gitignore=, =src/support/.gitignore=, and so on
2716 |   up to simply =.gitignore= at the root.
2717 | - If nothing matches, it will continue with the *absolute* rules.
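
The search order for scoped rules can be listed explicitly. This is
just an illustration: =ignore_file_candidates= is a hypothetical
helper, and it uses =posixpath= so the result doesn't depend on the
platform's path separator.

#+begin_src python
import posixpath

def ignore_file_candidates(path):
    """List the .gitignore files to consult for path, deepest first."""
    candidates = []
    parent = posixpath.dirname(path)
    while True:
        candidates.append(posixpath.join(parent, ".gitignore"))
        if parent == "":          # we've just handled the worktree root
            break
        parent = posixpath.dirname(parent)
    return candidates

for candidate in ignore_file_candidates("src/support/w32/legacy/sound.c~"):
    print(candidate)
#+end_src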
2718 |
2719 | We write three small support functions. One to match a path against a
2720 | set of rules, and return the result of the last matching rule. Notice
2721 | how it's not a real boolean function, since it has *three* possible
2722 | return values: =True=, =False= but also =None=. It returns =None= if
2723 | nothing matched, so the caller knows it should continue trying with
2724 | more general ignore files (eg, go one directory level up).
2725 |
2726 | #+begin_src python :tangle libwyag.py
2727 | def check_ignore1(rules, path):
2728 |     result = None
2729 |     for (pattern, value) in rules:
2730 |         if fnmatch(path, pattern):
2731 |             result = value
2732 |     return result
2733 | #+end_src
2734 |
2735 | A function to match against the dictionary of *scoped* rules (the
2736 | various =.gitignore= files). It just starts at the path's directory
2737 | then moves up to the parent directory, repeatedly, until it has
2738 | tested root. Notice that this function (and the next two as well)
2739 | never breaks *inside* a given =.gitignore= file. Even if a rule
2740 | matches, they keep going through the file, because another rule there
2741 | may reverse the effect (rules are processed in order, so if you
2742 | want to exclude =*.c= but not =generator.c=, the general rule must
2743 | come before the specific one). But as soon as at least one rule
2744 | matched in a file, we drop the remaining files, because a more general
2745 | file never cancels the effect of a more specific one (this is why
2746 | =check_ignore1= is ternary and not boolean).
2747 |
2748 | #+begin_src python :tangle libwyag.py
2749 | def check_ignore_scoped(rules, path):
2750 |     parent = os.path.dirname(path)
2751 |     while True:
2752 |         if parent in rules:
2753 |             result = check_ignore1(rules[parent], path)
2754 |             if result is not None:
2755 |                 return result
2756 |         if parent == "":
2757 |             break
2758 |         parent = os.path.dirname(parent)
2759 |     return None
2760 | #+end_src
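
The "last matching rule wins" behaviour, and why rule order matters,
shows up clearly on a tiny example. The =last_match= function below is
a standalone restatement of the matching loop, written only so this
snippet runs on its own; the rule lists are made up.

#+begin_src python
from fnmatch import fnmatch

def last_match(rules, path):
    """Return the verdict of the last matching rule, or None if none match."""
    result = None
    for pattern, ignored in rules:
        if fnmatch(path, pattern):
            result = ignored
    return result

# General rule first, exception second: generator.c is kept.
good_order = [("*.c", True), ("generator.c", False)]
# Exception first: the later, general rule overrides it.
bad_order = [("generator.c", False), ("*.c", True)]

print(last_match(good_order, "parser.c"))
print(last_match(good_order, "generator.c"))
print(last_match(bad_order, "generator.c"))
print(last_match(good_order, "README"))
#+end_src

The last call returns =None=, not =False=: no rule had an opinion, so
the caller is free to ask a more general ignore file.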
2761 |
2762 | A much simpler function to match against the list of absolute rules.
2763 | Notice that the order we push those rules to the list matters (we
2764 | /did/ read the repository rules before the global ones!)
2765 |
2766 | #+begin_src python :tangle libwyag.py
2767 | def check_ignore_absolute(rules, path):
2768 |     for ruleset in rules:
2769 |         result = check_ignore1(ruleset, path)
2770 |         if result is not None:
2771 |             return result
2772 |     return False # This is a reasonable default at this point.
2774 | #+end_src
2775 |
2776 | And finally, a function to bind them all.
2777 |
2778 | #+begin_src python :tangle libwyag.py
2779 | def check_ignore(rules, path):
2780 |     if os.path.isabs(path):
2781 |         raise Exception("This function requires path to be relative to the repository's root")
2782 |
2783 |     result = check_ignore_scoped(rules.scoped, path)
2784 |     if result is not None:
2785 |         return result
2786 |
2787 |     return check_ignore_absolute(rules.absolute, path)
2788 | #+end_src
2789 |
2790 | You can now call =wyag check-ignore=. On its own source tree:
2791 |
2792 | #+begin_example
2793 | $ wyag check-ignore hello.el hello.elc hello.html wyag.zip wyag.tar
2794 | hello.elc
2795 | hello.html
2796 | wyag.zip
2797 | #+end_example
2798 |
2799 | #+begin_warning
2800 | *This is only an approximation*
2801 |
2802 | This isn't a perfect reimplementation. In particular, excluding
2803 | whole directories with a rule that's only the directory name (eg
2804 | =__pycache__=) won't work, because =fnmatch= would want the pattern
2805 | as =__pycache__/**=. If you really want to play with ignore rules,
2806 | [[https://github.com/mherrmann/gitignore_parser][this may be a good
2807 | starting point]].
2808 | #+end_warning
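
A few direct calls to =fnmatch= show where the approximation holds and
where it breaks. The paths are invented for the example; the last line
also shows a difference from real ignore rules, since =fnmatch= lets
=*= match across =/=.

#+begin_src python
from fnmatch import fnmatch

checks = [
    ("hello.elc", "*.elc"),                      # plain glob: works
    ("__pycache__/libwyag.pyc", "__pycache__"),  # bare directory name: no match
    ("__pycache__/libwyag.pyc", "__pycache__/**"),
    ("src/deep/hello.elc", "*.elc"),             # "*" also swallows slashes
]

for path, pattern in checks:
    print(f"{pattern!r:18} vs {path!r}: {fnmatch(path, pattern)}")
#+end_src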
2809 |
2810 | ** The status command
2811 | :PROPERTIES:
2812 | :CUSTOM_ID: cmd-status
2813 | :END:
2814 |
2815 | =status= is more complex than =ls-files=, because it needs to compare
2816 | the index with both HEAD /and/ the actual filesystem. You call =git
2817 | status= to know which files were added, removed or modified since the
2818 | last commit, and which of these changes are actually staged, and will
2819 | make it to the next commit. So =status= actually compares the =HEAD=
2820 | with the staging area, and the staging area with the worktree. This
2821 | is what its output looks like:
2822 |
2823 | #+begin_example
2824 | On branch master
2825 |
2826 | Changes to be committed:
2827 |   (use "git restore --staged <file>..." to unstage)
2828 |         modified:   write-yourself-a-git.org
2829 |
2830 | Changes not staged for commit:
2831 |   (use "git add <file>..." to update what will be committed)
2832 |   (use "git restore <file>..." to discard changes in working directory)
2833 |         modified:   write-yourself-a-git.org
2834 |
2835 | Untracked files:
2836 |   (use "git add <file>..." to include in what will be committed)
2837 |         org-html-themes/
2838 |         wl-copy
2839 | #+end_example
2840 |
2841 | We'll implement =status= in three parts: first the active branch or
2842 | “detached HEAD”, then the difference between HEAD and the index
2843 | (“Changes to be committed”), then the difference between the index
2844 | and the worktree (“Changes not staged for commit” and “Untracked
2845 | files”).
2846 |
2847 | The public interface is dead simple, our status will take no argument:
2848 |
2849 | #+BEGIN_SRC python :tangle libwyag.py
2850 | argsp = argsubparsers.add_parser("status", help = "Show the working tree status.")
2851 | #+END_src
2852 |
2853 | And the bridge function just calls the three component functions in order:
2854 |
2855 | #+BEGIN_SRC python :tangle libwyag.py
2856 | def cmd_status(_):
2857 |     repo = repo_find()
2858 |     index = index_read(repo)
2859 |
2860 |     cmd_status_branch(repo)
2861 |     cmd_status_head_index(repo, index)
2862 |     print()
2863 |     cmd_status_index_worktree(repo, index)
2864 | #+END_src
2865 |
2866 | *** Finding the active branch
2867 |
2868 | First we need to know if we're on a branch, and if so which one. We
2869 | do this by just looking at =.git/HEAD=. It should contain either a
2870 | hexadecimal ID (a ref to a commit, in detached HEAD state), or an
2871 | indirect reference to something in =refs/heads/=: the active branch.
2872 | We either return its name, or =False=.
2873 |
2874 | #+begin_src python :tangle libwyag.py
2875 | def branch_get_active(repo):
2876 |     with open(repo_file(repo, "HEAD"), "r") as f:
2877 |         head = f.read()
2878 |
2879 |     if head.startswith("ref: refs/heads/"):
2880 |         return head[16:-1]
2881 |     else:
2882 |         return False
2883 | #+end_src
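
The same parsing can be tried on plain strings, without a repository.
=parse_head= below is a hypothetical variant for experimenting: it
takes the contents of =HEAD= directly, and returns =None= rather than
=False= for a detached HEAD.

#+begin_src python
def parse_head(head):
    """Return the active branch name, or None for a detached HEAD."""
    prefix = "ref: refs/heads/"
    head = head.strip()                 # drop the trailing newline
    if head.startswith(prefix):
        return head[len(prefix):]
    return None

print(parse_head("ref: refs/heads/master\n"))
print(parse_head("ref: refs/heads/feature/explosions\n"))
print(parse_head("6c22393f5e3830d15395fd8d2f8b0cf8eb40dd58\n"))
#+end_src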
2884 |
2885 | Based on this, we can write the first of the three =cmd_status_*=
2886 | functions the bridge calls. This one prints the name of the active
2887 | branch, or the hash of the detached HEAD:
2888 |
2889 | #+begin_src python :tangle libwyag.py
2890 | def cmd_status_branch(repo):
2891 |     branch = branch_get_active(repo)
2892 |     if branch:
2893 |         print(f"On branch {branch}.")
2894 |     else:
2895 |         print(f"HEAD detached at {object_find(repo, 'HEAD')}")
2896 | #+end_src
2897 |
2898 | *** Finding changes between HEAD and index
2899 |
2900 | The second block of the status output is the “changes to be
2901 | committed”, that is, how the staging area differs from HEAD. To do
2902 | this, we're first going to read the =HEAD= tree, and flatten it as a
2903 | single dict (hashmap) with full paths as keys, so it's closer to the
2904 | (flat) index associating paths to blobs. Then we'll just compare
2905 | them and output their differences.
2906 |
2907 | First, a function to convert a tree (recursive, remember) to a (flat)
2908 | dict. And since trees are recursive, so is the function itself.
2909 | Sorry about that, again :)
2910 |
2911 | #+begin_src python :tangle libwyag.py
2912 | def tree_to_dict(repo, ref, prefix=""):
2913 |     ret = dict()
2914 |     tree_sha = object_find(repo, ref, fmt=b"tree")
2915 |     tree = object_read(repo, tree_sha)
2916 |
2917 |     for leaf in tree.items:
2918 |         full_path = os.path.join(prefix, leaf.path)
2919 |
2920 |         # The mode tells us the entry's type: modes starting with 04
2921 |         # are directories (subtrees), so we don't need to read the
2922 |         # object itself.
2923 |         is_subtree = leaf.mode.startswith(b'04')
2924 |
2925 |         # Depending on the type, we either store the path (if it's a
2926 |         # blob, so a regular file), or recurse (if it's another tree,
2927 |         # so a subdir)
2928 |         if is_subtree:
2929 |             ret.update(tree_to_dict(repo, leaf.sha, full_path))
2930 |         else:
2931 |             ret[full_path] = leaf.sha
2932 |     return ret
2933 | #+end_src
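
The flattening itself has nothing Git-specific about it. As a sketch,
here is the same recursion over nested Python dicts standing in for
trees; the =flatten= helper and the short IDs are made up for the
example.

#+begin_src python
import posixpath

def flatten(tree, prefix=""):
    """Turn a nested {name: id-or-subtree} dict into a flat {path: id} dict."""
    flat = {}
    for name, value in tree.items():
        full_path = posixpath.join(prefix, name)
        if isinstance(value, dict):
            # A subtree: recurse, with its path as the new prefix.
            flat.update(flatten(value, full_path))
        else:
            flat[full_path] = value
    return flat

fake_tree = {
    "README.org": "a1",
    "src": {"main.c": "b2", "lib": {"util.c": "c3"}},
}
print(flatten(fake_tree))
#+end_src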
2934 |
2935 | And the command itself:
2936 |
2937 | #+begin_src python :tangle libwyag.py
2938 | def cmd_status_head_index(repo, index):
2939 |     print("Changes to be committed:")
2940 |
2941 |     head = tree_to_dict(repo, "HEAD")
2942 |     for entry in index.entries:
2943 |         if entry.name in head:
2944 |             if head[entry.name] != entry.sha:
2945 |                 print("  modified:", entry.name)
2946 |             del head[entry.name] # Delete the key
2947 |         else:
2948 |             print("  added:   ", entry.name)
2949 |
2950 |     # Keys still in HEAD are files that we haven't met in the index,
2951 |     # and thus have been deleted.
2952 |     for entry in head.keys():
2953 |         print("  deleted: ", entry)
2954 | #+end_src
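
Once both sides are flat path-to-blob mappings, the comparison boils
down to a few set operations. This sketch isn't how the function above
is written (it deletes keys as it goes), but it should produce the
same three categories; =compare_head_index= and the sample data are
assumptions for illustration.

#+begin_src python
def compare_head_index(head, index):
    """Both arguments are flat dicts mapping paths to blob IDs."""
    added = sorted(set(index) - set(head))
    deleted = sorted(set(head) - set(index))
    modified = sorted(path for path in set(head) & set(index)
                      if head[path] != index[path])
    return added, modified, deleted

head = {"README.org": "a1", "build.py": "b2", "old.c": "c3"}
index = {"README.org": "a1", "build.py": "ff", "src/main.c": "d4"}

added, modified, deleted = compare_head_index(head, index)
print("added:   ", added)
print("modified:", modified)
print("deleted: ", deleted)
#+end_src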
2955 |
2956 | *** Finding changes between index and worktree
2957 |
2958 | #+begin_src python :tangle libwyag.py
2959 | def cmd_status_index_worktree(repo, index):
2960 |     print("Changes not staged for commit:")
2961 |
2962 |     ignore = gitignore_read(repo)
2963 |
2964 |     gitdir_prefix = repo.gitdir + os.path.sep
2965 |
2966 |     all_files = list()
2967 |
2968 |     # We begin by walking the filesystem
2969 |     for (root, _, files) in os.walk(repo.worktree, True):
2970 |         if root==repo.gitdir or root.startswith(gitdir_prefix):
2971 |             continue
2972 |         for f in files:
2973 |             full_path = os.path.join(root, f)
2974 |             rel_path = os.path.relpath(full_path, repo.worktree)
2975 |             all_files.append(rel_path)
2976 |
2977 |     # We now traverse the index, and compare real files with the cached
2978 |     # versions.
2979 |
2980 |     for entry in index.entries:
2981 |         full_path = os.path.join(repo.worktree, entry.name)
2982 |
2983 |         # That file *name* is in the index
2984 |
2985 |         if not os.path.exists(full_path):
2986 |             print("  deleted: ", entry.name)
2987 |         else:
2988 |             stat = os.stat(full_path)
2989 |
2990 |             # Compare metadata
2991 |             ctime_ns = entry.ctime[0] * 10**9 + entry.ctime[1]
2992 |             mtime_ns = entry.mtime[0] * 10**9 + entry.mtime[1]
2993 |             if (stat.st_ctime_ns != ctime_ns) or (stat.st_mtime_ns != mtime_ns):
2994 |                 # If different, deep compare.
2995 |                 # @FIXME This *will* crash on symlinks to dir.
2996 |                 with open(full_path, "rb") as fd:
2997 |                     new_sha = object_hash(fd, b"blob", None)
2998 |                     # If the hashes are the same, the files are actually the same.
2999 |                     same = entry.sha == new_sha
3000 |
3001 |                     if not same:
3002 |                         print("  modified:", entry.name)
3003 |
3004 |         if entry.name in all_files:
3005 |             all_files.remove(entry.name)
3006 |
3007 |     print()
3008 |     print("Untracked files:")
3009 |
3010 |     for f in all_files:
3011 |         # @TODO If a full directory is untracked, we should display
3012 |         # its name without its contents.
3013 |         if not check_ignore(ignore, f):
3014 |             print(" ", f)
3015 | #+end_src
3016 |
3017 | Our status function is done. It should output something like:
3018 |
3019 | #+begin_example
3020 | $ wyag status
3021 | On branch main.
3022 | Changes to be committed:
3023 | added: src/main.c
3024 |
3025 | Changes not staged for commit:
3026 | modified: build.py
3027 | deleted: README.org
3028 |
3029 | Untracked files:
3030 | src/cli.c
3031 | #+end_example
3032 |
3033 | The real =status= is a lot smarter: it can detect renames, for
3034 | example, where ours cannot. Another significant difference worth
3035 | mentioning is that =git status= actually /writes/ the index back if a
3036 | file metadata was modified, but not its content. You can see it with
3037 | our special ls-files:
3038 |
3039 | #+begin_example
3040 | $ wyag ls-files --verbose
3041 | Index file format v2, containing 1 entries.
3042 | file
3043 | regular file with perms: 644
3044 | on blob: f2f279981ce01b095c42ee7162aadf60185c8f67
3045 | created: 2023-07-18 18:26:15.771460869, modified: 2023-07-18 18:26:15.771460869
3046 | ...
3047 | $ touch file
3048 | $ git status > /dev/null
3049 | $ wyag ls-files --verbose
3050 | Index file format v2, containing 1 entries.
3051 | file
3052 | regular file with perms: 644
3053 | on blob: f2f279981ce01b095c42ee7162aadf60185c8f67
3054 | created: 2023-07-18 18:26:41.421743098, modified: 2023-07-18 18:26:41.421743098
3055 | ...
3056 | #+end_example
3057 |
3058 | Notice how both timestamps, from the /index file/, were updated by
3059 | =git status= to reflect the changes in the real file's metadata.
3060 |
3061 | * Staging area and index, part 2: staging and committing
3062 | :PROPERTIES:
3063 | :CUSTOM_ID: committing
3064 | :END:
3065 |
3066 | OK. Let's create commits.
3067 |
3068 | We have /almost/ everything we need for that, except for three last
3069 | things:
3070 |
3071 | 1. We need commands to modify the index, so our commits aren't just a
3072 |    copy of their parent. Those commands are =add= and =rm=.
3073 | 2. These commands need to write the modified index back, since we
3074 |    commit /from the index/.
3075 | 3. And obviously, we'll need the =commit= function and its associated
3076 |    =wyag commit= command.
3077 |
3078 | ** Writing the index
3079 | :PROPERTIES:
3080 | :CUSTOM_ID: index_write
3081 | :END:
3082 |
3083 | We'll start by writing the index. Roughly, we're just serializing
3084 | everything back to binary. This is a bit tedious, but the code should
3085 | be straightforward. I'm leaving the gory details for the comments,
3086 | but it's really just =index_read= in reverse --- refer to it if
3087 | needed, and the =GitIndexEntry= class.
3088 |
3089 | #+begin_src python :tangle libwyag.py
3090 | def index_write(repo, index):
3091 |     with open(repo_file(repo, "index"), "wb") as f:
3092 |
3093 |         # HEADER
3094 |
3095 |         # Write the magic bytes.
3096 |         f.write(b"DIRC")
3097 |         # Write version number.
3098 |         f.write(index.version.to_bytes(4, "big"))
3099 |         # Write the number of entries.
3100 |         f.write(len(index.entries).to_bytes(4, "big"))
3101 |
3102 |         # ENTRIES
3103 |
3104 |         idx = 0
3105 |         for e in index.entries:
3106 |             f.write(e.ctime[0].to_bytes(4, "big"))
3107 |             f.write(e.ctime[1].to_bytes(4, "big"))
3108 |             f.write(e.mtime[0].to_bytes(4, "big"))
3109 |             f.write(e.mtime[1].to_bytes(4, "big"))
3110 |             f.write(e.dev.to_bytes(4, "big"))
3111 |             f.write(e.ino.to_bytes(4, "big"))
3112 |
3113 |             # Mode
3114 |             mode = (e.mode_type << 12) | e.mode_perms
3115 |             f.write(mode.to_bytes(4, "big"))
3116 |
3117 |             f.write(e.uid.to_bytes(4, "big"))
3118 |             f.write(e.gid.to_bytes(4, "big"))
3119 |
3120 |             f.write(e.fsize.to_bytes(4, "big"))
3121 |             # @FIXME Convert back to int.
3122 |             f.write(int(e.sha, 16).to_bytes(20, "big"))
3123 |
3124 |             flag_assume_valid = 0x1 << 15 if e.flag_assume_valid else 0
3125 |
3126 |             name_bytes = e.name.encode("utf8")
3127 |             bytes_len = len(name_bytes)
3128 |             if bytes_len >= 0xFFF:
3129 |                 name_length = 0xFFF
3130 |             else:
3131 |                 name_length = bytes_len
3132 |
3133 |             # We merge back three pieces of data (two flags and the
3134 |             # length of the name) on the same two bytes.
3135 |             f.write((flag_assume_valid | e.flag_stage | name_length).to_bytes(2, "big"))
3136 |
3137 |             # Write back the name, and a final 0x00.
3138 |             f.write(name_bytes)
3139 |             f.write((0).to_bytes(1, "big"))
3140 |
3141 |             idx += 62 + len(name_bytes) + 1
3142 |
3143 |             # Add padding if necessary.
3144 |             if idx % 8 != 0:
3145 |                 pad = 8 - (idx % 8)
3146 |                 f.write((0).to_bytes(pad, "big"))
3147 |                 idx += pad
3148 | #+end_src
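
The two least obvious steps in the function above are the packing of
the flags and the padding arithmetic. Here's a small sketch of both,
pulled out as standalone helpers. =pack_flags= and =padding_after=
are hypothetical names used only for this illustration, and the
sketch assumes, like =index_write= does, that the stage value is
already sitting in its final bit position.

#+begin_src python
def pack_flags(assume_valid, stage, name_length):
    """Merge the assume-valid bit, the stage bits and the name length
    into the two-byte flags field. `stage` is assumed to be already
    shifted into place (this is an assumption of the sketch)."""
    flags = (0x8000 if assume_valid else 0) | stage | min(name_length, 0xFFF)
    return flags.to_bytes(2, "big")

def padding_after(name_length):
    """Number of zero bytes needed after an entry so that the next one
    starts on a multiple of 8. An entry is 62 fixed bytes, the name,
    and one terminating 0x00."""
    written = 62 + name_length + 1
    return (8 - written % 8) % 8

print(pack_flags(False, 0, 4).hex())     # Short name: length stored as-is
print(pack_flags(True, 0, 5000).hex())   # Long name: length capped at 0xFFF
print(padding_after(4), padding_after(1))
#+end_src

Because every entry ends on a multiple of 8, computing the padding
per entry gives the same result as tracking the running total the way
=index_write= does.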
3149 |
3150 | ** The rm command
3151 | :PROPERTIES:
3152 | :CUSTOM_ID: cmd-rm
3153 | :END:
3154 |
3155 | The easiest change we can make to an index is to remove an entry from
3156 | it, which means that the next commit *won't include* this file. This
3157 | is what the =git rm= command does.
3158 |
3159 | #+BEGIN_danger
3160 | =git rm= is *destructive*, and so is =wyag rm=. The command not
3161 | only modifies the index, it also removes file(s) from the worktree.
3162 | Unlike git, =wyag rm= doesn't care if the file it removes isn't
3163 | saved. Proceed with caution.
3164 | #+END_danger
3165 |
3166 | =rm= takes a single argument, a list of paths to remove:
3167 |
3168 | #+BEGIN_SRC python :tangle libwyag.py
3169 | argsp = argsubparsers.add_parser("rm", help="Remove files from the working tree and the index.")
3170 | argsp.add_argument("path", nargs="+", help="Files to remove")
3171 |
3172 | def cmd_rm(args):
3173 |     repo = repo_find()
3174 |     rm(repo, args.path)
3175 | #+END_src
3176 |
3177 | The =rm= function is a bit long, but it's very simple. It takes a
3178 | repository and a list of paths, reads that repository index, and
3179 | removes entries in the index that match this list. The optional
3180 | arguments control whether the function should actually delete the
3181 | files, and whether it should abort if some paths aren't present on the
3182 | index (both those arguments are for the use of =add=, they're not
3183 | exposed in the =wyag rm= command).
3184 |
3185 | #+BEGIN_SRC python :tangle libwyag.py
3186 | def rm(repo, paths, delete=True, skip_missing=False):
3187 |     # Find and read the index
3188 |     index = index_read(repo)
3189 |
3190 |     worktree = repo.worktree + os.sep
3191 |
3192 |     # Make paths absolute
3193 |     abspaths = set()
3194 |     for path in paths:
3195 |         abspath = os.path.abspath(path)
3196 |         if abspath.startswith(worktree):
3197 |             abspaths.add(abspath)
3198 |         else:
3199 |             raise Exception(f"Cannot remove paths outside of worktree: {paths}")
3200 |
3201 |     # The list of entries to *keep*, which we will write back to the
3202 |     # index.
3203 |     kept_entries = list()
3204 |     # The list of removed paths, which we'll use after index update
3205 |     # to physically remove the actual paths from the filesystem.
3206 |     remove = list()
3207 |
3208 |     # Now iterate over the list of entries, and remove those whose
3209 |     # paths we find in abspaths. Preserve the others in kept_entries.
3210 |     for e in index.entries:
3211 |         full_path = os.path.join(repo.worktree, e.name)
3212 |
3213 |         if full_path in abspaths:
3214 |             remove.append(full_path)
3215 |             abspaths.remove(full_path)
3216 |         else:
3217 |             kept_entries.append(e) # Preserve entry
3218 |
3219 |     # If abspaths is not empty, it means some paths weren't in the index.
3220 |     if len(abspaths) > 0 and not skip_missing:
3221 |         raise Exception(f"Cannot remove paths not in the index: {abspaths}")
3222 |
3223 |     # Physically delete paths from filesystem.
3224 |     if delete:
3225 |         for path in remove:
3226 |             os.unlink(path)
3227 |
3228 |     # Update the list of entries in the index, and write it back.
3229 |     index.entries = kept_entries
3230 |     index_write(repo, index)
3231 | #+END_SRC
3232 |
3233 | And we can now delete files with =wyag rm=.
3234 |
3235 | ** The add command
3236 | :PROPERTIES:
3237 | :CUSTOM_ID: cmd-add
3238 | :END:
3239 |
3240 | Adding is just a bit more complex than removing, but nothing we don't
3241 | already know. Staging a file is a four-step operation:
3242 |
3243 | 1. We begin by removing the existing index entry, if there's one,
3244 |    without removing the file itself (this is why the =rm= function we
3245 |    just wrote has those optional arguments).
3246 | 2. We then hash the file into a blob object,
3247 | 3. create its entry,
3248 | 4. and of course, finally write the modified index back.
3249 |
3250 | First, the interface. Nothing surprising, =wyag add PATH ...= where
3251 | PATH is one or more file(s) to stage. The bridge is as boring as can be.
3252 |
3253 | #+BEGIN_SRC python :tangle libwyag.py
3254 | argsp = argsubparsers.add_parser("add", help = "Add files contents to the index.")
3255 | argsp.add_argument("path", nargs="+", help="Files to add")
3256 |
3257 | def cmd_add(args):
3258 |     repo = repo_find()
3259 |     add(repo, args.path)
3260 | #+END_src
3261 |
3262 | The main difference with =rm= is that =add= needs to create an index
3263 | entry. This isn't hard: we just =stat()= the file and copy the
3264 | metadata into the entry's fields (=stat()= returns the metadata the
3265 | index stores: creation/modification time, and so on).
3266 |
3267 | #+BEGIN_SRC python :tangle libwyag.py
3268 | def add(repo, paths, delete=True, skip_missing=False):
3269 |
3270 |     # First remove all paths from the index, if they exist.
3271 |     rm(repo, paths, delete=False, skip_missing=True)
3272 |
3273 |     worktree = repo.worktree + os.sep
3274 |
3275 |     # Convert the paths to pairs: (absolute, relative_to_worktree).
3276 |     # Reject anything that isn't a regular file inside the worktree.
3277 |     clean_paths = set()
3278 |     for path in paths:
3279 |         abspath = os.path.abspath(path)
3280 |         if not (abspath.startswith(worktree) and os.path.isfile(abspath)):
3281 |             raise Exception(f"Not a file, or outside the worktree: {paths}")
3282 |         relpath = os.path.relpath(abspath, repo.worktree)
3283 |         clean_paths.add((abspath, relpath))
3284 |
3285 |     # Find and read the index. It was modified by rm. (This isn't
3286 |     # optimal, good enough for wyag!)
3287 |     #
3288 |     # @FIXME, though: we could just move the index through
3289 |     # commands instead of reading and writing it over again.
3290 |     index = index_read(repo)
3291 |
3292 |     for (abspath, relpath) in clean_paths:
3293 |         with open(abspath, "rb") as fd:
3294 |             sha = object_hash(fd, b"blob", repo)
3295 |
3296 |         stat = os.stat(abspath)
3297 |
3298 |         ctime_s = int(stat.st_ctime)
3299 |         ctime_ns = stat.st_ctime_ns % 10**9
3300 |         mtime_s = int(stat.st_mtime)
3301 |         mtime_ns = stat.st_mtime_ns % 10**9
3302 |
3303 |         entry = GitIndexEntry(ctime=(ctime_s, ctime_ns), mtime=(mtime_s, mtime_ns), dev=stat.st_dev, ino=stat.st_ino,
3304 |                               mode_type=0b1000, mode_perms=0o644, uid=stat.st_uid, gid=stat.st_gid,
3305 |                               fsize=stat.st_size, sha=sha, flag_assume_valid=False,
3306 |                               flag_stage=False, name=relpath)
3307 |         index.entries.append(entry)
3308 |
3309 |     # Write the index back
3310 |     index_write(repo, index)
3311 | #+END_SRC
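
The index stores each timestamp as a pair (whole seconds, remaining
nanoseconds), while =os.stat()= exposes it as one big nanosecond
count. As a small illustration of the round trip between the two
forms, here's a sketch using a temporary file. =split_ns= is a
hypothetical helper, not something =wyag= defines.

#+begin_src python
import os
import tempfile

def split_ns(total_ns):
    """Split a nanosecond timestamp into (seconds, nanoseconds)."""
    return divmod(total_ns, 10**9)

with tempfile.NamedTemporaryFile() as tmp:
    st = os.stat(tmp.name)
    seconds, nanoseconds = split_ns(st.st_mtime_ns)
    # Recombining the pair gives back exactly what stat() reported,
    # which is the comparison status relies on.
    assert seconds * 10**9 + nanoseconds == st.st_mtime_ns
    print(seconds, nanoseconds)
#+end_src

Sticking to integer arithmetic on the =_ns= fields avoids the
rounding that can creep in when going through the floating-point
=st_mtime=.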
3312 |
3313 | ** The commit command
3314 | :PROPERTIES:
3315 | :CUSTOM_ID: cmd-commit
3316 | :END:
3317 |
3318 | Now that we have modified the index, so actually /staged changes/, we
3319 | only need to turn those changes into a commit. That's what =commit= does.
3320 |
3321 | #+begin_src python :tangle libwyag.py
3322 | argsp = argsubparsers.add_parser("commit", help="Record changes to the repository.")
3323 |
3324 | argsp.add_argument("-m",
3325 | metavar="message",
3326 | dest="message",
3327 | help="Message to associate with this commit.")
3328 | #+end_src
3329 |
3330 | To do so, we first need to convert the index into a tree object,
3331 | generate and store the corresponding commit object, and update the
3332 | HEAD branch to the new commit (remember: a branch is just a ref to a
3333 | commit).
3334 |
3335 | Before we get to the interesting details, we will need to read git's
3336 | config to get the name of the user, which we'll use as the author and
3337 | committer. We'll use the same =configparser= library we've used to
3338 | read repo's config.
3339 |
3340 | #+begin_src python :tangle libwyag.py
3341 | def gitconfig_read():
3342 |     xdg_config_home = os.environ["XDG_CONFIG_HOME"] if "XDG_CONFIG_HOME" in os.environ else "~/.config"
3343 |     configfiles = [
3344 |         os.path.expanduser(os.path.join(xdg_config_home, "git/config")),
3345 |         os.path.expanduser("~/.gitconfig")
3346 |     ]
3347 |
3348 |     config = configparser.ConfigParser()
3349 |     config.read(configfiles)
3350 |     return config
3351 | #+end_src
3352 |
3353 | And just a simple function to grab, and format, the user identity:
3354 |
3355 | #+begin_src python :tangle libwyag.py
3356 | def gitconfig_user_get(config):
3357 |     if "user" in config:
3358 |         if "name" in config["user"] and "email" in config["user"]:
3359 |             return f"{config['user']['name']} <{config['user']['email']}>"
3360 |     return None
3361 | #+end_src
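
If you'd like to try the formatting without touching your real
configuration, =configparser= can also read from a string. The sketch
below repeats the lookup under a different, made-up name
(=user_identity=) so that it runs on its own; the identity is
obviously a placeholder.

#+begin_src python
import configparser

def user_identity(config):
    """Return 'Name <email>' from a parsed config, or None if the
    [user] section or one of its two keys is missing."""
    if "user" in config:
        user = config["user"]
        if "name" in user and "email" in user:
            return f"{user['name']} <{user['email']}>"
    return None

config = configparser.ConfigParser()
config.read_string("[user]\nname = Ada Lovelace\nemail = ada@example.com\n")
print(user_identity(config))
#+end_src

Note the =None= case: a caller that builds a commit has to decide
what to do when no identity is configured.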
3362 |
3363 | Now for the interesting part. We first need to build a tree from the
3364 | index. This isn't hard, but notice that while the index is flat (it
3365 | stores full paths for the whole worktree), a tree is a recursive
3366 | structure: it lists files, or other trees. To "unflatten" the index
3367 | into a tree, we're going to:
3368 |
3369 | 1. Build a dictionary (hashmap) of directories. Keys are full paths
3370 |    from worktree root (like =assets/sprites/monsters/=), values are
3371 |    lists of =GitIndexEntry= --- files in the directory. At this point, our
3372 |    dictionary only contains /files/: directories are only its keys.
3373 | 2. Traverse this dictionary, going bottom-up, that is, from the deepest
3374 |    directories up to root (depth doesn't really matter: we just want
3375 |    to see each directory /before/ its parent. To do that, we just
3376 |    sort them by /full/ path length, from longest to shortest ---
3377 |    parents are obviously always shorter). As an example, imagine we
3378 |    start at =assets/sprites/monsters/=.
3379 | 3. At each directory, we build a tree with its contents, say
3380 |    =cacodemon.png=, =imp.png= and =baron-of-hell.png=.
3381 | 4. We write the new tree to the repository.
3382 | 5. We then add this tree to this directory's parent. Meaning that at
3383 |    this point, =assets/sprites/= now contains our new tree object's
3384 |    SHA-1 id under the name =monsters=.
3385 | 6. And we iterate over the next directory, let's say
3386 |    =assets/sprites/keys= where we find =red.png=, =blue.png= and
3387 |    =yellow.png=, create a tree, store the tree, add the tree's SHA-1
3388 |    under the name =keys= under =assets/sprites/=, and so on.
3389 |
3390 | And since trees are recursive, the last tree we'll build, which is
3391 | necessarily the one for root (since its key's length is 0), will
3392 | ultimately refer to all others, and thus will be the only one we'll need.
3393 | We'll simply return its SHA-1, and be done.
3394 |
3395 | Since this may seem a bit complex, let's work this example in full
3396 | details --- feel free to skip. At the beginning, the dictionary we
3397 | built from the index looks like this:
3398 |
3399 | #+begin_example
3400 | contents["assets/sprites/monsters"] =
3401 | [ cacodemon.png : GitIndexEntry
3402 | , imp.png : GitIndexEntry
3403 | , baron-of-hell.png : GitIndexEntry ]
3404 | contents["assets/sprites/keys"] =
3405 | [ red.png : GitIndexEntry
3406 | , blue.png : GitIndexEntry
3407 | , yellow.png : GitIndexEntry ]
3408 | contents["assets/sprites/"] =
3409 | [ hero.png : GitIndexEntry ]
3410 | contents["assets/"] = [] # No files in here
3411 | contents[""] = # Root!
3412 | [ README: GitIndexEntry ]
3413 | #+end_example
3414 |
3415 | We iterate over it, by order of descending key length. The first key
3416 | we meet is the longest, so =assets/sprites/monsters=. We build a new
3417 | tree object from its contents, which associates the three file names
3418 | (=cacodemon.png=, =imp.png=, =baron-of-hell.png=) with their
3419 | corresponding blobs (A tree leaf stores /less/ data than the index ---
3420 | just path, mode and blob. So converting entries that way is easy)
3421 |
3422 | Notice we don't need to concern ourselves with storing the *contents*
3423 | of those files: =wyag add= did create the corresponding blobs as
3424 | needed. We need to store the /trees/ we create to the object store,
3425 | but we can assume the blobs are there already.
3426 |
3427 | Let's say that our new tree, made from the index entries that
3428 | lived directly in =assets/sprites/monsters=, hashes down to
3429 | =426f894781bc3c38f1d26f8fd2c7f38ab8d21763=. We *modify our
3430 | dictionary* to add that new tree object to the directory's parent,
3431 | so what remains to traverse now looks like this:
3432 |
3433 | #+begin_example
3434 | contents["assets/sprites/keys"] = # <- unmodified.
3435 | [ red.png : GitIndexEntry
3436 | , blue.png : GitIndexEntry
3437 | , yellow.png : GitIndexEntry ]
3438 | contents["assets/sprites/"] =
3439 | [ hero.png : GitIndexEntry
3440 | , monsters : Tree 426f894781bc3c38f1d26f8fd2c7f38ab8d21763 ] <- look here
3441 | contents["assets/"] = [] # empty
3442 | contents[""] = # Root!
3443 | [ README: GitIndexEntry ]
3444 | #+end_example
3445 |
3446 | We do the same for the next longest key, =assets/sprites/keys=,
3447 | producing a tree of hash =b42788e087b1e94a0e69dcb7a4a243eaab802bb2=,
3448 | so:
3449 |
3450 | #+begin_example
3451 | contents["assets/sprites/"] =
3452 | [ hero.png : GitIndexEntry
3453 | , monsters : Tree 426f894781bc3c38f1d26f8fd2c7f38ab8d21763
3454 | , keys : Tree b42788e087b1e94a0e69dcb7a4a243eaab802bb2 ]
3455 | contents["assets/"] = [] # empty
3456 | contents[""] = # Root!
3457 | [ README: GitIndexEntry ]
3458 | #+end_example
3459 |
3460 | We then generate tree =6364113557ed681d775ccbd3c90895ed276956a2= from
3461 | assets/sprites, which now contains our two trees and =hero.png=.
3462 |
3463 | #+begin_example
3464 | contents["assets/"] = [
3465 | sprites: Tree 6364113557ed681d775ccbd3c90895ed276956a2 ]
3466 | contents[""] = # Root!
3467 | [ README: GitIndexEntry ]
3468 | #+end_example
3469 |
3470 | Assets in turn becomes tree =4d35513cb6d2a816bc00505be926624440ebbddd=, so:
3471 |
3472 | #+begin_example
3473 | contents[""] = # Root!
3474 | [ README: GitIndexEntry,
3475 | assets: 4d35513cb6d2a816bc00505be926624440ebbddd]
3476 | #+end_example
3477 |
3478 | We make a tree from that last key (with the =README= blob and the
3479 | =assets= subtree), it hashes to
3480 | =9352e52ff58fa9bf5a750f090af64c09fa6a3d93=. That's our return value:
3481 | the tree whose contents are the same as the index's.
3482 |
3483 | Here's the actual function:
3484 |
3485 | #+begin_src python :tangle libwyag.py
3486 | def tree_from_index(repo, index):
3487 |     contents = dict()
3488 |     contents[""] = list()
3489 |
3490 |     # Enumerate entries, and turn them into a dictionary where keys
3491 |     # are directories, and values are lists of directory contents.
3492 |     for entry in index.entries:
3493 |         dirname = os.path.dirname(entry.name)
3494 |
3495 |         # We create all dictionary entries up to root (""). We need
3496 |         # them *all*, because even if a directory holds no files it
3497 |         # will contain at least a tree.
3498 |         key = dirname
3499 |         while key != "":
3500 |             if not key in contents:
3501 |                 contents[key] = list()
3502 |             key = os.path.dirname(key)
3503 |
3504 |         # For now, simply store the entry in the list.
3505 |         contents[dirname].append(entry)
3506 |
3507 |     # Get keys (= directories) and sort them by length, descending.
3508 |     # This means that we'll always encounter a given path before its
3509 |     # parent, which is all we need, since for each directory D we'll
3510 |     # need to modify its parent P to add D's tree.
3511 |     sorted_paths = sorted(contents.keys(), key=len, reverse=True)
3512 |
3513 |     # This variable will store the current tree's SHA-1. After we're
3514 |     # done iterating over our dict, it will contain the hash for the
3515 |     # root tree.
3516 |     sha = None
3517 |
3518 |     # We go through the sorted list of paths (dict keys)
3519 |     for path in sorted_paths:
3520 |         # Prepare a new, empty tree object
3521 |         tree = GitTree()
3522 |
3523 |         # Add each entry to our new tree, in turn
3524 |         for entry in contents[path]:
3525 |             # An entry can be a normal GitIndexEntry read from the
3526 |             # index, or a tree we've created.
3527 |             if isinstance(entry, GitIndexEntry): # Regular entry (a file)
3528 |
3529 |                 # We transcode the mode: the entry stores it as integers,
3530 |                 # we need an octal ASCII representation for the tree.
3531 |                 leaf_mode = f"{entry.mode_type:02o}{entry.mode_perms:04o}".encode("ascii")
3532 |                 leaf = GitTreeLeaf(mode = leaf_mode, path=os.path.basename(entry.name), sha=entry.sha)
3533 |             else: # Tree. We've stored it as a pair: (basename, SHA)
3534 |                 leaf = GitTreeLeaf(mode = b"040000", path=entry[0], sha=entry[1])
3535 |
3536 |             tree.items.append(leaf)
3537 |
3538 |         # Write the new tree object to the store.
3539 |         sha = object_write(tree, repo)
3540 |
3541 |         # Add the new tree hash to the current directory's parent, as
3542 |         # a pair (basename, SHA)
3543 |         parent = os.path.dirname(path)
3544 |         base = os.path.basename(path) # The name without the path, eg main.go for src/main.go
3545 |         contents[parent].append((base, sha))
3546 |
3547 |     return sha
3548 | #+end_src
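
If the ordering trick is still unclear, the following sketch
reproduces only the first half of the function (collecting the
directories and sorting them) on plain strings, without any git
object involved. =directories_bottom_up= is a hypothetical helper
written for this illustration, and it uses =posixpath= on the
assumption that index names always use forward slashes.

#+begin_src python
import posixpath

def directories_bottom_up(paths):
    """Return every directory implied by `paths`, including the root
    (""), ordered so that each directory comes before its parent."""
    dirs = {""}
    for path in paths:
        key = posixpath.dirname(path)
        # Walk up to the root, registering every ancestor on the way.
        while key != "":
            dirs.add(key)
            key = posixpath.dirname(key)
    # A parent's path is a strict prefix of its children's paths, so
    # it is always shorter: sorting by length, descending, is enough.
    return sorted(dirs, key=len, reverse=True)

order = directories_bottom_up([
    "README",
    "assets/sprites/hero.png",
    "assets/sprites/monsters/imp.png",
    "assets/sprites/keys/red.png",
])
for directory in order:
    print(repr(directory))
#+end_src

Notice that =assets= shows up even though no file lives directly in
it, which is exactly why the real function creates dictionary entries
for every ancestor.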
3549 |
3550 | This was the hard part; I hope it's clear enough. From this, creating
3551 | the commit object and updating HEAD will be way easier. Just remember
3552 | that what this function /does/ is build and store as many tree objects
3553 | as needed to represent the index, and return the root tree's SHA-1.
3554 |
3555 | The function to create a commit object is simple enough, it just takes
3556 | some arguments: the hash of the tree, the hash of the parent commit,
3557 | the author's identity (a string), the timestamp and timezone delta,
3558 | and the message:
3559 |
3560 | # @TODO Explain them!
3561 |
3562 | #+begin_src python :tangle libwyag.py
3563 | def commit_create(repo, tree, parent, author, timestamp, message):
3564 |     commit = GitCommit() # Create the new commit object.
3565 |     commit.kvlm[b"tree"] = tree.encode("ascii")
3566 |     if parent:
3567 |         commit.kvlm[b"parent"] = parent.encode("ascii")
3568 |
3569 |     # Trim message and add a trailing \n
3570 |     message = message.strip() + "\n"
3571 |     # Format timezone (sign first, then the absolute offset as HHMM)
3572 |     offset = int(timestamp.astimezone().utcoffset().total_seconds())
3573 |     hours = abs(offset) // 3600
3574 |     minutes = (abs(offset) % 3600) // 60
3575 |     tz = "{}{:02}{:02}".format("+" if offset >= 0 else "-", hours, minutes)
3576 |
3577 |     author = f"{author} {int(timestamp.timestamp())} {tz}"
3578 |
3579 |     commit.kvlm[b"author"] = author.encode("utf8")
3580 |     commit.kvlm[b"committer"] = author.encode("utf8")
3581 |     commit.kvlm[None] = message.encode("utf8")
3582 |
3583 |     return object_write(commit, repo)
3584 | #+end_src
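
The timezone part of the author line is the offset from UTC, written
as a sign followed by two digits of hours and two digits of minutes
(=+0100=, for instance). Negative and half-hour offsets are easy to
get wrong, so here's a sketch of that formatting as a standalone
function. =format_tz= is a hypothetical name, not part of =wyag=.

#+begin_src python
def format_tz(offset_seconds):
    """Format a UTC offset given in seconds as +HHMM or -HHMM."""
    sign = "+" if offset_seconds >= 0 else "-"
    # Work on the absolute value: floor division on a negative number
    # would round the hours the wrong way.
    hours, remainder = divmod(abs(offset_seconds), 3600)
    return f"{sign}{hours:02}{remainder // 60:02}"

for offset in (3600, 0, -18000, 19800, -12600):
    print(offset, format_tz(offset))
#+end_src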
3585 |
3586 | All that remains to write is =cmd_commit=, the bridge function to the
3587 | =wyag commit= command:
3588 |
3589 | #+begin_src python :tangle libwyag.py
3590 | def cmd_commit(args):
3591 |     repo = repo_find()
3592 |     index = index_read(repo)
3593 |     # Create trees, grab back SHA for the root tree.
3594 |     tree = tree_from_index(repo, index)
3595 |
3596 |     # Create the commit object itself
3597 |     commit = commit_create(repo,
3598 |                            tree,
3599 |                            object_find(repo, "HEAD"),
3600 |                            gitconfig_user_get(gitconfig_read()),
3601 |                            datetime.now(),
3602 |                            args.message)
3603 |
3604 |     # Update HEAD so our commit is now the tip of the active branch.
3605 |     active_branch = branch_get_active(repo)
3606 |     if active_branch: # If we're on a branch, we update refs/heads/BRANCH
3607 |         with open(repo_file(repo, os.path.join("refs/heads", active_branch)), "w") as fd:
3608 |             fd.write(commit + "\n")
3609 |     else: # Otherwise, we update HEAD itself.
3610 |         with open(repo_file(repo, "HEAD"), "w") as fd:
3611 |             fd.write(commit + "\n")
3612 | #+end_src
3613 |
3614 | And we're done!
3615 |
3616 | * Final words
3617 |
3618 | ** Conclusion, and beyond commit :noexport:
3619 |
3620 | With that final command, =wyag= is done. I hope you've enjoyed that
3621 | little journey into the internals of git core. Obviously, we're still
3622 | very far from what the real git can do, but the goal was to expose the
3623 | model, and this is done.
3624 |
3625 | One of the most fundamental design choices of git, which you've
3626 | probably noticed, is that it only stores full states of the
3627 | repository. We like to think of commits as /transformations/ of
3628 | source code, and it makes sense to us because in a way that's what
3629 | they /are/, but to git itself each commit is like a zip of a
3630 | directory, as disconnected from the previous as it is from the “next”.
3631 | Git neither knows nor cares about deltas and patches, and even file
3632 | rename indications are just a trick. (To be perfectly honest: it
3633 | actually /knows/ about deltas, but as a storage optimization method,
3634 | in packfiles --- which are optional).
3635 |
3636 | ** Comments, feedback and issues
3637 | :PROPERTIES:
3638 | :CUSTOM_ID: feedback
3639 | :END:
3640 |
3641 | This page has no comment system :) I can be reached by e-mail at
3642 | [[mailto:thibault@thb.lt][thibault@thb.lt]]. I can also be found [[https://toad.social/@thblt][on Mastodon as
3643 | @thblt@toad.social]] and [[https://twitter.com/ThbPlg][on Twitter as @ThbPlg]], and on IRC (sometimes)
3644 | as =thblt= on Libera.
3645 |
3646 | The source for this article is hosted [[https://github.com/thblt/write-yourself-a-git][on Github]]. Issue reports and
3647 | pull requests are welcome, either directly on GitHub or through e-mail
3648 | if you prefer.
3649 |
3650 | ** Release information :noexport:
3651 |
3652 | #+begin_src emacs-lisp :exports results :results table
3653 | (list
3654 | '("Key" "Value")
3655 | 'hline
3656 | `("Creation date" ,(current-time-string))
3657 | `("On commit" ,(format "=%s= (%s)"
3658 | (string-trim (shell-command-to-string "git describe --tags --always"))
3659 | (if (string-empty-p
3660 | (string-trim (shell-command-to-string "git status --porcelain=v2")))
3661 | "clean" "*dirty*")))
3662 | `("By" ,(format "%s (=%s= on =%s=)"
3663 | (user-full-name)
3664 | (user-login-name)
3665 | (system-name)))
3666 | `("Emacs version" ,emacs-version)
3667 | `("Org-mode version" ,org-version))
3668 | #+end_src
3669 |
3670 | ** License
3671 |
3672 | This article is distributed under the terms of the [[https://creativecommons.org/licenses/by-nc-sa/4.0/][Creative Commons
3673 | BY-NC-SA 4.0]]. The [[./wyag.zip][program itself]] is also licensed under the terms
3674 | of the GNU General Public License 3.0, or, at your option, any later
3675 | version of the same licence.
3676 |
3677 | * Footnotes
3678 |
3679 | [fn:1] You may know that [[https://shattered.io/][collisions have been discovered in SHA-1]].
3680 | Git actually doesn't use SHA-1 anymore: it uses a [[https://github.com/git/git/blob/26e47e261e969491ad4e3b6c298450c061749c9e/Documentation/technical/hash-function-transition.txt#L34-L36][hardened variant]]
3681 | which is not SHA, but which applies the same hash to every known input
3682 | but the two PDF files known to collide.
3683 |
--------------------------------------------------------------------------------
/wyag-tests.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | set -e
3 |
4 | function step() {
5 | pos=$(caller)
6 | echo $pos $@
7 | }
8 |
9 | wyag=$(realpath ./wyag)
10 |
11 | testdir=/tmp/wyag-tests
12 | if [[ -e $testdir ]]; then
13 | rm -rf $testdir/*
14 | else
15 | mkdir $testdir
16 | fi
17 | cd $testdir
18 | step "working on $(pwd)"
19 |
20 | step Create repos
21 | $wyag init left
22 | git init right > /dev/null
23 |
24 | step status
25 | cd left
26 | git status > /dev/null
27 | cd ../right
28 | git status > /dev/null
29 | cd ..
30 |
31 | step hash-object
32 | echo "Don't read me" > README
33 | $wyag hash-object README > hash1
34 | git hash-object README > hash2
35 | cmp --quiet hash1 hash2
36 |
37 | step hash-object -w
38 | cd left
39 | $wyag hash-object -w ../README > /dev/null
40 | cd ../right
41 | git hash-object -w ../README > /dev/null
42 | cd ..
43 | ls left/.git/objects/b1/7df541639ec7814a9ad274e177d9f8da1eb951 > /dev/null
44 | ls right/.git/objects/b1/7df541639ec7814a9ad274e177d9f8da1eb951 > /dev/null
45 |
46 | step cat-file
47 | cd left
48 | $wyag cat-file blob b17d > ../file1
49 | cd ../right
50 | git cat-file blob b17d > ../file2
51 | cd ..
52 | cmp file1 file2
53 |
54 | step cat-file with long hash
55 | cd left
56 | $wyag cat-file blob b17df541639ec7814a9ad274e177d9f8da1eb951 > ../file1
57 | cd ../right
58 | git cat-file blob b17df541639ec7814a9ad274e177d9f8da1eb951 > ../file2
59 | cd ..
60 | cmp file1 file2
61 |
62 | step "Create commit (git only, nothing is tested)" #@FIXME Add wyag commit
63 | cd left
64 | echo "Aleph" > hebraic-letter.txt
65 | git add hebraic-letter.txt
66 | GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
67 | GIT_AUTHOR_NAME="wyag-tests.sh" \
68 | GIT_AUTHOR_EMAIL="wyag@example.com" \
69 | GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
70 | GIT_COMMITTER_NAME="wyag-tests.sh" \
71 | GIT_COMMITTER_EMAIL="wyag@example.com" \
72 | git commit --no-gpg-sign -m "Initial commit" > /dev/null
73 | cd ../right
74 | echo "Aleph" > hebraic-letter.txt
75 | git add hebraic-letter.txt
76 | GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
77 | GIT_AUTHOR_NAME="wyag-tests.sh" \
78 | GIT_AUTHOR_EMAIL="wyag@example.com" \
79 | GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
80 | GIT_COMMITTER_NAME="wyag-tests.sh" \
81 | GIT_COMMITTER_EMAIL="wyag@example.com" \
82 | git commit --no-gpg-sign -m "Initial commit" > /dev/null
83 |
84 | cd ..
85 |
86 | step cat-file on commit object without indirection
87 | cd left
88 | $wyag cat-file commit HEAD > ../file1
89 | cd ../right
90 | git cat-file commit HEAD > ../file2
91 | cd ..
92 | cmp file1 file2
93 |
94 | step cat-file on tree object redirected from commit
95 | cd left
96 | $wyag cat-file tree HEAD > ../file1
97 | cd ../right
98 | git cat-file tree HEAD > ../file2
99 | cd ..
100 | cmp file1 file2
101 |
102 | step "Add some directories and commits (git only, nothing is tested)" #@FIXME Add wyag commit
103 | cd left
104 | mkdir a
105 | echo "Alpha" > a/greek_letters
106 | mkdir b
107 | echo "Hamza" > a/arabic_letters
108 | git add a/*
109 | GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
110 | GIT_AUTHOR_NAME="wyag-tests.sh" \
111 | GIT_AUTHOR_EMAIL="wyag@example.com" \
112 | GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
113 | GIT_COMMITTER_NAME="wyag-tests.sh" \
114 | GIT_COMMITTER_EMAIL="wyag@example.com" \
115 | git commit --no-gpg-sign -m "Commit 2" > /dev/null
116 | cd ../right
117 | mkdir a
118 | echo "Alpha" > a/greek_letters
119 | mkdir b
120 | echo "Hamza" > a/arabic_letters
121 | git add a/*
122 | GIT_AUTHOR_DATE="2010-01-01 01:02:03 +0100" \
123 | GIT_AUTHOR_NAME="wyag-tests.sh" \
124 | GIT_AUTHOR_EMAIL="wyag@example.com" \
125 | GIT_COMMITTER_DATE="2010-01-01 01:02:03 +0100" \
126 | GIT_COMMITTER_NAME="wyag-tests.sh" \
127 | GIT_COMMITTER_EMAIL="wyag@example.com" \
128 | git commit --no-gpg-sign -m "Commit 2" > /dev/null
129 | cd ..
130 |
131 | step ls-tree
132 | cd left
133 | $wyag ls-tree HEAD > ../file1
134 | cd ../right
135 | git ls-tree HEAD > ../file2
136 | cd ..
137 | cmp file1 file2
138 |
139 | step checkout
140 | # wyag checks out a commit into a new target directory; git checks out into the current directory
141 | cd left
142 | $wyag checkout HEAD ../temp1
143 | mkdir ../temp2
144 | cd ../temp2
145 | git --git-dir=../right/.git checkout .
146 | cd ..
147 | diff -r temp1 temp2
148 | rm -rf temp1 temp2
149 |
150 | step rev-parse
151 | cd left
152 | $wyag rev-parse HEAD > ../file1
153 | $wyag rev-parse 75ee4 >> ../file1
154 | $wyag rev-parse 8a617 >> ../file1
155 | #@FIXME Tags missing, branches missing, remotes missing
156 | cd ../right
157 | git rev-parse HEAD > ../file2
158 | git rev-parse 75ee4 >> ../file2
159 | git rev-parse 8a617 >> ../file2
160 | cd ..
161 | cmp file1 file2
162 |
163 | step "ls-files"
164 | cd left
165 | $wyag ls-files > ../file1
166 | cd ../right
167 | git ls-files > ../file2
168 | cd ..
169 | cmp file1 file2
170 |
171 | gitignore_prepare() {
172 |     mkdir -p a/b/c/
173 |     echo "!*.txt" > a/b/c/.gitignore
174 |     echo "*.txt" > a/b/.gitignore
175 |     echo "*.org" > a/.gitignore
176 |     git add -A
177 | }
178 |
179 | step "gitignore"
180 | cd left
181 | gitignore_prepare
182 | $wyag check-ignore a/b/c/hello.txt > ../file1
183 | $wyag check-ignore a/b/hello.txt >> ../file1
184 | $wyag check-ignore a/hello.org >> ../file1
185 | $wyag check-ignore hello.org >> ../file1
186 | cd ../right
187 | set +e # git check-ignore exits non-zero when a path is not ignored
188 | gitignore_prepare
189 | git check-ignore a/b/c/hello.txt > ../file2
190 | git check-ignore a/b/hello.txt >> ../file2
191 | git check-ignore a/hello.org >> ../file2
192 | git check-ignore hello.org >> ../file2
193 | set -e
194 | cd ..
195 | cmp file1 file2
196 |
197 |
198 |
199 |
200 | step THIS WAS A TRIUMPH
201 | step "I'M MAKING A NOTE HERE"
202 | step "HUGE SUCCESS"
203 |
--------------------------------------------------------------------------------