├── .github └── ISSUE_TEMPLATE │ └── bug_report.md ├── .gitignore ├── LICENSE ├── README.org ├── setup.py └── slob.py /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Report an issue with slob reader/writer 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | 21 | 22 | **Description** 23 | A clear and concise description of what the bug is. 24 | 25 | **To Reproduce** 26 | Steps to reproduce the behavior: 27 | - ... 28 | - ... 29 | 30 | **Expected behavior** 31 | A clear and concise description of what you expected to happen. 32 | 33 | **Environment:** 34 | - OS: [e.g. iOS] 35 | - Python version 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *~ 3 | *.egg-info 4 | __pycache__ 5 | README.html -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. [http://fsf.org/] 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 
43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. 
If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. 
Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 
234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 
296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. 
If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 
414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. 
The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. 
You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 
583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | {one line to give the program's name and a brief idea of what it does.} 635 | Copyright (C) {year} {name of author} 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see [http://www.gnu.org/licenses/]. 
649 | 
650 | Also add information on how to contact you by electronic and paper mail.
651 | 
652 | If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 | 
655 | {project} Copyright (C) {year} {fullname}
656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 | This is free software, and you are welcome to redistribute it
658 | under certain conditions; type `show c' for details.
659 | 
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License. Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 | 
664 | You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | [http://www.gnu.org/licenses/].
668 | 
669 | The GNU General Public License does not permit incorporating your program
670 | into proprietary programs. If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library. If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License. But first, please read
674 | [http://www.gnu.org/philosophy/why-not-lgpl.html].
--------------------------------------------------------------------------------
/README.org:
--------------------------------------------------------------------------------
1 | * Slob
2 | Slob (sorted list of blobs) is a read-only, compressed data store
3 | with a dictionary-like interface to look up content by text keys. Keys
4 | are sorted according to the [[http://www.unicode.org/reports/tr10/][Unicode Collation Algorithm]], which
5 | allows punctuation-, case- and diacritics-insensitive
6 | lookups. /slob.py/ is a reference implementation of a slob format
7 | reader and writer in [[http://python.org][Python 3]].
8 | 
9 | ** Installation
10 | 
11 | /slob.py/ depends on the following components:
12 | 
13 | - [[http://python.org][Python]] >= 3.6
14 | - [[http://icu-project.org][ICU]] >= 4.8
15 | - [[https://pypi.python.org/pypi/PyICU][PyICU]] >= 1.5
16 | 
17 | In addition, the following components are needed to set up a
18 | slob environment:
19 | 
20 | - [[http://git-scm.com/][git]]
21 | - [[https://virtualenv.pypa.io/][virtualenv]]
22 | 
23 | Consult your operating system documentation and these components'
24 | websites for installation instructions.
25 | 
26 | For example, on Ubuntu 20.04, the following commands install the
27 | required packages:
28 | 
29 | #+BEGIN_SRC sh
30 | sudo apt update
31 | sudo apt install python3 python3-icu python3.8-venv git
32 | #+END_SRC
33 | 
34 | Create a new Python virtual environment:
35 | 
36 | #+BEGIN_SRC sh
37 | python3 -m venv env-slob --system-site-packages
38 | #+END_SRC
39 | 
40 | Activate it:
41 | 
42 | #+BEGIN_SRC sh
43 | source env-slob/bin/activate
44 | #+END_SRC
45 | 
46 | Install from the source code repository:
47 | 
48 | #+BEGIN_SRC sh
49 | pip install git+https://github.com/itkach/slob.git
50 | #+END_SRC
51 | 
52 | or download the source code manually:
53 | 
54 | #+BEGIN_SRC sh
55 | wget https://github.com/itkach/slob/archive/master.zip
56 | pip install master.zip
57 | #+END_SRC
58 | 
59 | Run tests:
60 | 
61 | #+BEGIN_SRC sh
62 | python -m unittest slob
63 | #+END_SRC
64 | 
65 | ** Command line interface
66 | 
67 | /slob.py/ provides a basic command line interface to inspect
68 | and modify slob content.
69 | 
70 | #+BEGIN_SRC
71 | usage: slob [-h] {find,get,info,tag,convert} ...
72 | 
73 | positional arguments:
74 |   {find,get,info,tag,convert}  sub-command
75 |     find               Find keys
76 |     get                Retrieve blob content
77 |     info               Inspect slob and print basic information about it
78 |     tag                List tags, view or edit tag value
79 |     convert            Create new slob with the same content but different
80 |                        encoding and compression parameters,
81 |                        or split into multiple slobs
82 | 
83 | optional arguments:
84 |   -h, --help           show this help message and exit
85 | #+END_SRC
86 | 
87 | To see basic slob info such as text encoding, compression and tags:
88 | #+BEGIN_SRC
89 | slob info my.slob
90 | #+END_SRC
91 | 
92 | To see the value of a tag, for example /label/:
93 | #+BEGIN_SRC
94 | slob tag -n label my.slob
95 | #+END_SRC
96 | 
97 | To set a tag value:
98 | #+BEGIN_SRC
99 | slob tag -n label -v "A Fine Dictionary" my.slob
100 | #+END_SRC
101 | 
102 | To look up a key, for example /abc/:
103 | #+BEGIN_SRC
104 | slob find wordnet-3.0.slob abc
105 | #+END_SRC
106 | 
107 | The output should look something like
108 | #+BEGIN_SRC
109 | 465 text/html; charset=utf-8 ABC
110 | 466 text/html; charset=utf-8 abcoulomb
111 | 472 text/html; charset=utf-8 ABC's
112 | 468 text/html; charset=utf-8 ABCs
113 | #+END_SRC
114 | 
115 | The first column in the output is the blob id. It can be used to retrieve
116 | blob content (content bytes are written to stdout):
117 | #+BEGIN_SRC
118 | slob get wordnet-3.0.slob 465
119 | #+END_SRC
120 | 
121 | To re-encode or re-compress slob content with different
122 | parameters:
123 | #+BEGIN_SRC
124 | slob convert -c lzma2 -b 256 simplewiki-20140209.zlib.384k.slob simplewiki-20140209.lzma2.256k.slob
125 | #+END_SRC
126 | 
127 | To split into multiple slobs:
128 | 
129 | #+BEGIN_SRC
130 | slob convert --split 4096 enwiki-20150406.slob enwiki-20150406-vol.slob
131 | #+END_SRC
132 | 
133 | The output name /enwiki-20150406-vol.slob/ is the name of the
134 | directory where the resulting .slob files will be created.
135 | 
136 | This is useful for crippled systems that can't use normal
137 | filesystems and have file size limits, such as SD cards on
138 | vanilla Android. Note that this command doesn't duplicate any
139 | content, so clients must search all these slobs when looking for
140 | shared resources such as stylesheets, fonts, javascript or
141 | images, as in the sketch below.
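Such a search across volumes can be done with the same /slob.py/ API used
in the Examples section below. A minimal sketch, assuming the volume
directory layout produced by ~slob convert --split~; the directory and key
names here are hypothetical:

#+BEGIN_SRC python
import glob
import os

import slob

def find_across_volumes(volume_dir, key):
    # Shared resources may live in any volume, so try every .slob file.
    for path in sorted(glob.glob(os.path.join(volume_dir, '*.slob'))):
        with slob.open(path) as r:
            for blob in r.as_dict()[key]:
                print(path, blob.id, blob.content_type)

find_across_volumes('enwiki-20150406-vol.slob', 'shared.css')
#+END_SRC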
142 | 
143 | 
144 | ** Examples
145 | 
146 | *** Basic Usage
147 | 
148 | Create a slob:
149 | 
150 | #+BEGIN_SRC python
151 | import slob
152 | with slob.create('test.slob') as w:
153 |     w.add(b'Hello A', 'a')
154 |     w.add(b'Hello B', 'b')
155 | #+END_SRC
156 | 
157 | Read content:
158 | 
159 | #+BEGIN_SRC python
160 | import slob
161 | with slob.open('test.slob') as r:
162 |     d = r.as_dict()
163 |     for key in ('a', 'b'):
164 |         result = next(d[key])
165 |         print(result.content)
166 | 
167 | #+END_SRC
168 | 
169 | will print
170 | 
171 | #+BEGIN_SRC
172 | b'Hello A'
173 | b'Hello B'
174 | #+END_SRC
175 | 
176 | 
177 | The slob we created in this example certainly works, but it is not
178 | ideal: we neglected to specify a content type for the content we
179 | are adding. Let's consider a slightly more involved example:
180 | 
181 | #+BEGIN_SRC python
182 | import slob
183 | PLAIN_TEXT = 'text/plain; charset=utf-8'
184 | with slob.create('test1.slob') as w:
185 |     w.add('Hello, Earth!'.encode('utf-8'),
186 |           'earth', 'terra', content_type=PLAIN_TEXT)
187 |     w.add_alias('земля', 'earth')
188 |     w.add('Hello, Mars!'.encode('utf-8'), 'mars',
189 |           content_type=PLAIN_TEXT)
190 | #+END_SRC
191 | 
192 | Here we specify the MIME type of the content we are adding so that
193 | consumers of this content can display or process it
194 | properly. Note that the same content may be associated with
195 | multiple keys, either when it is added or later with /add_alias/.
196 | 
197 | This
198 | 
199 | #+BEGIN_SRC python
200 | with slob.open('test1.slob') as r:
201 | 
202 |     def p(blob):
203 |         print(blob.id, blob.content_type, blob.content)
204 | 
205 |     for key in ('earth', 'земля', 'terra'):
206 |         blob = next(r.as_dict()[key])
207 |         p(blob)
208 | 
209 |     p(next(r.as_dict()['mars']))
210 | 
211 | #+END_SRC
212 | 
213 | will print
214 | 
215 | #+BEGIN_SRC
216 | 0 text/plain; charset=utf-8 b'Hello, Earth!'
217 | 0 text/plain; charset=utf-8 b'Hello, Earth!'
218 | 0 text/plain; charset=utf-8 b'Hello, Earth!'
219 | 1 text/plain; charset=utf-8 b'Hello, Mars!'
220 | #+END_SRC
221 | 
222 | Note that the blob id for the first three keys is the same: they all
223 | point to the same content item.
224 | 
225 | Take a look at the tests in /slob.py/ for more examples.
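Lookups are collation-based rather than exact. /slob.py/ exports the UCA
strength constants /PRIMARY/, /SECONDARY/, /TERTIARY/, /QUATERNARY/ and
/IDENTICAL/. A hedged sketch, assuming /as_dict/ accepts a strength
argument (check the module source): at /PRIMARY/ strength, lookups ignore
case and diacritics, so looking up /EARTH/ finds the entry keyed /earth/:

#+BEGIN_SRC python
import slob

with slob.open('test1.slob') as r:
    # Assumed signature: as_dict(strength). PRIMARY-strength collation
    # matches 'EARTH' against the key 'earth'.
    d = r.as_dict(slob.PRIMARY)
    for blob in d['EARTH']:
        print(blob.id, blob.content_type, blob.content)
#+END_SRC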
226 | 
227 | 
228 | *** Software and Dictionaries
229 | 
230 | - [[https://github.com/itkach/slob/wiki/Dictionaries][Wikipedia, Wiktionary, WordNet, FreeDict and more]]
231 | - [[http://github.com/itkach/aard2-android/][aard2-android]] - dictionary for Android
232 | - [[https://github.com/farfromrefug/OSS-Dict][OSS-Dict]] - fork of Aard2 with new Material design and updated features
233 | - [[https://github.com/itkach/aard2-web][aard2-web]] - minimalistic Web UI (Java)
234 | - [[http://github.com/itkach/slobber/][slobber]] - Web API to look up content in slob dictionaries
235 | - [[http://github.com/itkach/slobby/][slobby]] - minimalistic Web UI (Python)
236 | - [[https://github.com/ilius/pyglossary][pyglossary]] - convert dictionaries in various formats, including slob
237 | - [[https://github.com/itkach/mw2slob][mw2slob]] - create slob dictionaries from Wikimedia Enterprise HTML Dumps or MediaWiki API
238 | - [[http://github.com/itkach/xdxf2slob/][xdxf2slob]] - create slob dictionaries from XDXF
239 | - [[https://github.com/itkach/tei2slob/][tei2slob]] - create slob dictionaries from TEI
240 | - [[http://github.com/itkach/wordnet2slob/][wordnet2slob]] - convert the WordNet database to a slob dictionary
241 | 
242 | 
243 | ** Slob File Format
244 | 
245 | *** Slob
246 | 
247 | | Element       | Type                                 | Description |
248 | |---------------+--------------------------------------+-------------|
249 | | magic         | fixed size sequence of 8 bytes       | Bytes ~21 2d 31 53 4c 4f 42 1f~: string ~!-1SLOB~ followed by the ascii unit separator (ascii hex code ~1f~) identifying the slob format |
250 | |---------------+--------------------------------------+-------------|
251 | | uuid          | fixed size sequence of 16 bytes      | Unique slob identifier ([[https://tools.ietf.org/html/rfc4122][RFC 4122]] UUID) |
252 | |---------------+--------------------------------------+-------------|
253 | | encoding      | tiny text (utf8)                     | Name of the text encoding used for all other text elements: tag names and values, content types, keys, fragments |
254 | |---------------+--------------------------------------+-------------|
255 | | compression   | tiny text                            | Name of the compression algorithm used to compress storage bins. |
256 | |               |                                      | slob.py understands the following names: /bz2/ and /zlib/, which correspond to Python module names, and /lzma2/, which refers to raw lzma2 compression with the LZMA2 filter (this is the default). |
257 | |               |                                      | An empty value means bins are not compressed. |
258 | |---------------+--------------------------------------+-------------|
259 | | tags          | char-sized sequence of tags          | Tags are text key-value pairs that may provide additional information about the slob or its data. |
260 | |---------------+--------------------------------------+-------------|
261 | | content types | char-sized sequence of content types | MIME content types. Content items refer to content types by id. |
262 | |               |                                      | A content type id is the 0-based position of the content type in this sequence. |
263 | |---------------+--------------------------------------+-------------|
264 | | blob count    | int                                  | Number of content items stored in the slob |
265 | |---------------+--------------------------------------+-------------|
266 | | store offset  | long                                 | File position at which store data begins |
267 | |---------------+--------------------------------------+-------------|
268 | | size          | long                                 | Total file byte size (or sum of all files if the slob is split into multiple files) |
269 | |---------------+--------------------------------------+-------------|
270 | | refs          | list of long-positioned refs         | References to content |
271 | |---------------+--------------------------------------+-------------|
272 | | store         | list of long-positioned store items  | A store item contains the number of items stored, a content type id for each item and a storage bin with each item's content |
273 | 
274 | 
275 | 
276 | *** tiny text
277 | 
278 | char-sized sequence of encoded text bytes
279 | 
280 | 
281 | *** text
282 | 
283 | short-sized sequence of encoded text bytes
284 | 
285 | 
286 | *** large byte string
287 | 
288 | int-sized sequence of bytes
289 | 
290 | *** /size type/-sized sequence of /items/
291 | 
292 | | Element | Type                      |
293 | |---------+---------------------------|
294 | | count   | /size type/               |
295 | | items   | sequence of /count/ items |
296 | 
297 | 
298 | *** tag
299 | 
300 | | Element | Type                        |
301 | |---------+-----------------------------|
302 | | name    | tiny text                   |
303 | | value   | tiny text padded to maximum |
304 | |         | length with null bytes      |
305 | 
306 | Tag values are tiny text of length 255, starting with encoded
307 | text bytes followed by null bytes. This allows modifying tag
308 | values without having to rebuild the whole slob. Null bytes
309 | must be stripped before decoding value text.
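For illustration, a hedged sketch of decoding tiny text, including the
null-byte stripping for maximum-length tag values (a standalone helper
mirroring the reference reader's logic, not /slob.py/'s public API):

#+BEGIN_SRC python
from struct import unpack

def read_tiny_text(f, encoding='utf-8'):
    # tiny text: one unsigned char length prefix followed by that many bytes
    (length,) = unpack('>B', f.read(1))
    data = f.read(length)
    if length == 255:
        # padded to maximum length (a tag value): drop everything
        # starting from the first null byte before decoding
        data = data.split(b'\x00', 1)[0]
    return data.decode(encoding)
#+END_SRC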
310 | 
311 | *** content type
312 | 
313 | text
314 | 
315 | 
316 | *** ref
317 | 
318 | | Element    | Type      | Description |
319 | |------------+-----------+-------------|
320 | | key        | text      | Text key associated with content |
321 | | bin index  | int       | Index of the compressed bin containing content |
322 | | item index | short     | Index of the content item inside the uncompressed bin |
323 | | fragment   | tiny text | Text identifier of a specific location inside content |
324 | 
325 | 
326 | *** store item
327 | | Element          | Type                                                    | Description |
328 | |------------------+---------------------------------------------------------+-------------|
329 | | content type ids | int-sized sequence of bytes                             | Each byte is a char representing a content type id. |
330 | | storage bin      | list of int-positioned large byte strings without count | Content |
331 | 
332 | The storage bin doesn't include a leading int that would represent the
333 | item count: the item count equals the length of the content type ids. Items
334 | in the storage bin are large byte strings - the actual content bytes.
335 | 
336 | *** list of /position type/-positioned /items/
337 | 
338 | | Element   | Type                                                        | Description |
339 | |-----------+-------------------------------------------------------------+-------------|
340 | | positions | int-sized sequence of item offsets of type /position type/ | An item offset specifies the position in the file where item data starts, relative to the end of the position data |
341 | | items     | sequence of /items/                                         |             |
342 | 
343 | *** char
344 | unsigned char (1 byte)
345 | 
346 | *** short
347 | big endian unsigned short (2 bytes)
348 | 
349 | *** int
350 | big endian unsigned int (4 bytes)
351 | 
352 | *** long
353 | big endian unsigned long long (8 bytes)
354 | 
355 | 
356 | ** Design Considerations
357 | 
358 | The slob format design is influenced by the old Aard Dictionary's [[https://github.com/aarddict/tools/blob/master/doc/aardformat.rst][aard]] and [[http://openzim.org/][ZIM]]
359 | file formats. Similar to Aard Dictionary, it supports
360 | non-exact lookups based on the UCA's notion of collation
361 | strength. Similar to ZIM, it groups and compresses multiple
362 | content items to achieve a high compression ratio and can combine
363 | several physical files into one logical container. Both aard and
364 | ZIM contain vestigial elements of predecessor formats as well
365 | as elements specific to a particular use case (such as
366 | implementing offline Wikipedia content access). Slob aims to
367 | provide a minimal framework to allow building such applications
368 | while remaining a simple, generic, read-only data store.
369 | 
370 | *** No Format Version
371 | The slob header doesn't contain an explicit file format version
372 | number. Any incompatible change to the format
373 | should be introduced as a new file format which will get its own
374 | identifying magic bytes.
375 | 
376 | *** No Content Checksum
377 | Unlike the aard and ZIM file formats, slob doesn't contain a
378 | content checksum. File integrity can easily be verified by
379 | employing standard tools to calculate a content hash. Including a
380 | pre-calculated hash in the file itself prevents using most
381 | standard tools and puts the burden of implementing hash calculation
382 | on every slob reader implementation.
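For example, the hash can be computed with stock tools (~sha256sum
my.slob~) or with a few lines of Python; a sketch that reads in chunks to
keep memory use constant on large slobs:

#+BEGIN_SRC python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        # read 1 MiB at a time instead of loading the whole file
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
#+END_SRC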
383 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | 
3 | setup(name='Slob',
4 |       version='1.0.2',
5 |       description='Read-only compressed data store',
6 |       author='Igor Tkach',
7 |       author_email='itkach@gmail.com',
8 |       url='http://github.com/itkach/slob',
9 |       license='GPL3',
10 |       py_modules=['slob'],
11 |       install_requires=['PyICU >= 1.5'],
12 |       entry_points={'console_scripts': ['slob=slob:main']})
13 | 
--------------------------------------------------------------------------------
/slob.py:
--------------------------------------------------------------------------------
1 | # pylint: disable=C0111,C0103,C0302,R0903,R0904,R0914,R0201
2 | import argparse
3 | import array
4 | import collections
5 | import encodings
6 | import functools
7 | import io
8 | import os
9 | import pickle
10 | import random
11 | import sys
12 | import tempfile
13 | import unicodedata
14 | import unittest
15 | import warnings
16 | 
17 | from abc import abstractmethod
18 | from bisect import bisect_left
19 | from builtins import open as fopen
20 | from collections import namedtuple
21 | from collections.abc import Sequence
22 | from datetime import datetime, timezone
23 | from functools import lru_cache
24 | from struct import pack, unpack, calcsize
25 | from threading import RLock
26 | from types import MappingProxyType
27 | from uuid import uuid4, UUID
28 | 
29 | import icu
30 | 
31 | DEFAULT_COMPRESSION = "lzma2"
32 | 
33 | UTF8 = "utf-8"
34 | MAGIC = b"!-1SLOB\x1F"  # "!-1SLOB" followed by the ASCII unit separator
35 | 
36 | Compression = namedtuple("Compression", "compress decompress")
37 | 
38 | Ref = namedtuple("Ref", "key bin_index item_index fragment")
39 | 
40 | Header = namedtuple(
41 |     "Header",
42 |     "magic uuid encoding "
43 |     "compression tags content_types "
44 |     "blob_count "
45 |     "store_offset "
46 |     "refs_offset "
47 |     "size",
48 | )
49 | 
50 | U_CHAR = ">B"
51 | U_CHAR_SIZE = calcsize(U_CHAR)
52 | U_SHORT = ">H"
53 | U_SHORT_SIZE = calcsize(U_SHORT)
54 | U_INT = ">I"
55 | U_INT_SIZE = calcsize(U_INT)
56 | U_LONG_LONG = ">Q"
57 | U_LONG_LONG_SIZE = calcsize(U_LONG_LONG)
58 | 
59 | 
60 | def calcmax(len_size_spec):
61 |     return 2 ** (calcsize(len_size_spec) * 8) - 1  # largest unsigned value for the given struct spec
62 | 
63 | 
64 | MAX_TEXT_LEN = calcmax(U_SHORT)
65 | MAX_TINY_TEXT_LEN = calcmax(U_CHAR)
66 | MAX_LARGE_BYTE_STRING_LEN = calcmax(U_INT)
67 | MAX_BIN_ITEM_COUNT = calcmax(U_SHORT)
68 | 
69 | from icu import Locale, Collator, UCollAttribute, UCollAttributeValue
70 | 
71 | PRIMARY = Collator.PRIMARY
72 | SECONDARY = Collator.SECONDARY
73 | TERTIARY = Collator.TERTIARY
74 | QUATERNARY = Collator.QUATERNARY
75 | IDENTICAL = Collator.IDENTICAL
76 | 
77 | 
78 | def init_compressions():
79 |     ident = lambda x: x
80 |     compressions = {"": Compression(ident, ident)}
81 |     for name in ("bz2", "zlib"):
82 |         try:
83 |             m = __import__(name)
84 |         except ImportError:
85 |             warnings.warn("%s is not available" % name)
86 |         else:
87 |             compressions[name] = Compression(lambda x, m=m: m.compress(x, 9), m.decompress)  # m=m binds the module now; a bare closure would late-bind to the last imported module
88 | 
89 |     try:
90 |         import lzma
91 |     except ImportError:
92 |         warnings.warn("lzma is not available")
93 |     else:
94 |         filters = [{"id": lzma.FILTER_LZMA2}]
95 |         compress = lambda s: lzma.compress(s, format=lzma.FORMAT_RAW, filters=filters)
96 |         decompress = lambda s: lzma.decompress(
97 |             s, format=lzma.FORMAT_RAW, filters=filters
98 |         )
99 |         compressions["lzma2"] = Compression(compress, decompress)
100 |     return compressions
101 | 
102 | 
103 | COMPRESSIONS = init_compressions()
init_compressions() 104 | 105 | 106 | del init_compressions 107 | 108 | 109 | MIME_TEXT = "text/plain" 110 | MIME_HTML = "text/html" 111 | MIME_CSS = "text/css" 112 | MIME_JS = "application/javascript" 113 | 114 | MIME_TYPES = { 115 | "html": MIME_HTML, 116 | "txt": MIME_TEXT, 117 | "js": MIME_JS, 118 | "css": MIME_CSS, 119 | "json": "application/json", 120 | "woff": "application/font-woff", 121 | "svg": "image/svg+xml", 122 | "png": "image/png", 123 | "jpg": "image/jpeg", 124 | "jpeg": "image/jpeg", 125 | "gif": "image/gif", 126 | "ttf": "application/x-font-ttf", 127 | "otf": "application/x-font-opentype", 128 | } 129 | 130 | 131 | class FileFormatException(Exception): 132 | pass 133 | 134 | 135 | class UnknownFileFormat(FileFormatException): 136 | pass 137 | 138 | 139 | class UnknownCompression(FileFormatException): 140 | pass 141 | 142 | 143 | class UnknownEncoding(FileFormatException): 144 | pass 145 | 146 | 147 | class IncorrectFileSize(FileFormatException): 148 | pass 149 | 150 | 151 | class TagNotFound(Exception): 152 | pass 153 | 154 | 155 | @lru_cache(maxsize=None) 156 | def sortkey(strength, maxlength=None): 157 | c = Collator.createInstance(Locale("")) 158 | c.setStrength(strength) 159 | c.setAttribute(UCollAttribute.ALTERNATE_HANDLING, UCollAttributeValue.SHIFTED) 160 | if maxlength is None: 161 | return c.getSortKey 162 | else: 163 | return lambda x: c.getSortKey(x)[:maxlength] 164 | 165 | 166 | def sortkey_length(strength, word): 167 | c = Collator.createInstance(Locale("")) 168 | c.setStrength(strength) 169 | c.setAttribute(UCollAttribute.ALTERNATE_HANDLING, UCollAttributeValue.SHIFTED) 170 | coll_key = c.getSortKey(word) 171 | return len(coll_key) - 1 # subtract 1 for ending \x00 byte 172 | 173 | 174 | class MultiFileReader(io.BufferedIOBase): 175 | 176 | def __init__(self, *args): 177 | filenames = [] 178 | for arg in args: 179 | if isinstance(arg, str): 180 | filenames.append(arg) 181 | else: 182 | for name in arg: 183 | filenames.append(name) 184 | files = [] 185 | ranges = [] 186 | offset = 0 187 | for name in filenames: 188 | size = os.stat(name).st_size 189 | ranges.append(range(offset, offset + size)) 190 | files.append(fopen(name, "rb")) 191 | offset += size 192 | self.size = offset 193 | self._ranges = ranges 194 | self._files = files 195 | self._fcount = len(self._files) 196 | self._offset = -1 197 | self.seek(0) 198 | 199 | def __enter__(self): 200 | return self 201 | 202 | def __exit__(self, exc_type, exc_val, exc_tb): 203 | self.close() 204 | return False 205 | 206 | def close(self): 207 | for f in self._files: 208 | f.close() 209 | self._files.clear() 210 | self._ranges.clear() 211 | 212 | def closed(self): 213 | return len(self._ranges) == 0 214 | 215 | def isatty(self): 216 | return False 217 | 218 | def readable(self): 219 | return True 220 | 221 | def seek(self, offset, whence=io.SEEK_SET): 222 | if whence == io.SEEK_SET: 223 | self._offset = offset 224 | elif whence == io.SEEK_CUR: 225 | self._offset = self._offset + offset 226 | elif whence == io.SEEK_END: 227 | self._offset = self.size + offset 228 | else: 229 | raise ValueError("Invalid value for parameter whence: %r" % whence) 230 | return self._offset 231 | 232 | def seekable(self): 233 | return True 234 | 235 | def tell(self): 236 | return self._offset 237 | 238 | def writable(self): 239 | return False 240 | 241 | def read(self, n=-1): 242 | file_index = -1 243 | actual_offset = 0 244 | for i, r in enumerate(self._ranges): 245 | if self._offset in r: 246 | file_index = i 247 | actual_offset = 
self._offset - r.start 248 | break 249 | result = b"" 250 | if n == -1 or n is None: 251 | to_read = self.size 252 | else: 253 | to_read = n 254 | while -1 < file_index < self._fcount: 255 | f = self._files[file_index] 256 | f.seek(actual_offset) 257 | read = f.read(to_read) 258 | read_count = len(read) 259 | self._offset += read_count 260 | result += read 261 | to_read -= read_count 262 | if to_read > 0: 263 | file_index += 1 264 | actual_offset = 0 265 | else: 266 | break 267 | return result 268 | 269 | 270 | class CollationKeyList(object): 271 | 272 | def __init__(self, lst, sortkey_): 273 | self.lst = lst 274 | self.sortkey = sortkey_ 275 | 276 | def __len__(self): 277 | return len(self.lst) 278 | 279 | def __getitem__(self, i): 280 | return self.sortkey(self.lst[i].key) 281 | 282 | 283 | class KeydItemDict(object): 284 | 285 | def __init__(self, lst, strength, maxlength=None): 286 | self.lst = lst 287 | self.sortkey = sortkey(strength, maxlength=maxlength) 288 | self.sortkeylist = CollationKeyList(lst, self.sortkey) 289 | 290 | def __len__(self): 291 | return len(self.lst) 292 | 293 | def __getitem__(self, key): 294 | key_as_sk = self.sortkey(key) 295 | i = bisect_left(self.sortkeylist, key_as_sk) 296 | if i != len(self.lst): 297 | while i < len(self.lst): 298 | if self.sortkey(self.lst[i].key) == key_as_sk: 299 | yield self.lst[i] 300 | else: 301 | break 302 | i += 1 303 | 304 | def __contains__(self, key): 305 | try: 306 | next(self[key]) 307 | except StopIteration: 308 | return False 309 | else: 310 | return True 311 | 312 | 313 | class Blob(object): 314 | 315 | def __init__(self, content_id, key, fragment, read_content_type_func, read_func): 316 | self._content_id = content_id 317 | self._key = key 318 | self._fragment = fragment 319 | self._read_content_type = read_content_type_func 320 | self._read = read_func 321 | 322 | @property 323 | def id(self): 324 | return self._content_id 325 | 326 | @property 327 | def key(self): 328 | return self._key 329 | 330 | @property 331 | def fragment(self): 332 | return self._fragment 333 | 334 | @property 335 | def content_type(self): 336 | return self._read_content_type() 337 | 338 | @property 339 | def content(self): 340 | return self._read() 341 | 342 | def __str__(self): 343 | return self.key 344 | 345 | def __repr__(self): 346 | return "<{0.__class__.__module__}.{0.__class__.__name__} " "{0.key}>".format( 347 | self 348 | ) 349 | 350 | 351 | def read_byte_string(f, len_spec): 352 | length = unpack(len_spec, f.read(calcsize(len_spec)))[0] 353 | return f.read(length) 354 | 355 | 356 | class StructReader: 357 | 358 | def __init__(self, file_, encoding=None): 359 | self._file = file_ 360 | self.encoding = encoding 361 | 362 | def read_int(self): 363 | s = self.read(U_INT_SIZE) 364 | return unpack(U_INT, s)[0] 365 | 366 | def read_long(self): 367 | b = self.read(U_LONG_LONG_SIZE) 368 | return unpack(U_LONG_LONG, b)[0] 369 | 370 | def read_byte(self): 371 | s = self.read(U_CHAR_SIZE) 372 | return unpack(U_CHAR, s)[0] 373 | 374 | def read_short(self): 375 | return unpack(U_SHORT, self._file.read(U_SHORT_SIZE))[0] 376 | 377 | def _read_text(self, len_spec): 378 | max_len = 2 ** (8 * calcsize(len_spec)) - 1 379 | byte_string = read_byte_string(self._file, len_spec) 380 | if len(byte_string) == max_len: 381 | terminator = byte_string.find(0) 382 | if terminator > -1: 383 | byte_string = byte_string[:terminator] 384 | return byte_string.decode(self.encoding) 385 | 386 | def read_tiny_text(self): 387 | return self._read_text(U_CHAR) 388 | 389 | def 
read_text(self): 390 | return self._read_text(U_SHORT) 391 | 392 | def __getattr__(self, name): 393 | return getattr(self._file, name) 394 | 395 | 396 | class StructWriter: 397 | 398 | def __init__(self, file_, encoding=None): 399 | self._file = file_ 400 | self.encoding = encoding 401 | 402 | def write_int(self, value): 403 | self._file.write(pack(U_INT, value)) 404 | 405 | def write_long(self, value): 406 | self._file.write(pack(U_LONG_LONG, value)) 407 | 408 | def write_byte(self, value): 409 | self._file.write(pack(U_CHAR, value)) 410 | 411 | def write_short(self, value): 412 | self._file.write(pack(U_SHORT, value)) 413 | 414 | def _write_text(self, text, len_size_spec, encoding=None, pad_to_length=None): 415 | if encoding is None: 416 | encoding = self.encoding 417 | text_bytes = text.encode(encoding) 418 | length = len(text_bytes) 419 | max_length = calcmax(len_size_spec) 420 | if length > max_length: 421 | raise ValueError("Text is too long for size spec %s" % len_size_spec) 422 | self._file.write( 423 | pack(len_size_spec, pad_to_length if pad_to_length else length) 424 | ) 425 | self._file.write(text_bytes) 426 | if pad_to_length: 427 | for _ in range(pad_to_length - length): 428 | self._file.write(pack(U_CHAR, 0)) 429 | 430 | def write_tiny_text(self, text, encoding=None, editable=False): 431 | pad_to_length = 255 if editable else None 432 | self._write_text(text, U_CHAR, encoding=encoding, pad_to_length=pad_to_length) 433 | 434 | def write_text(self, text, encoding=None): 435 | self._write_text(text, U_SHORT, encoding=encoding) 436 | 437 | def __getattr__(self, name): 438 | return getattr(self._file, name) 439 | 440 | 441 | def set_tag_value(filename, name, value): 442 | with fopen(filename, "rb+") as f: 443 | f.seek(len(MAGIC) + 16) 444 | encoding = read_byte_string(f, U_CHAR).decode(UTF8) 445 | if encodings.search_function(encoding) is None: 446 | raise UnknownEncoding(encoding) 447 | f = StructWriter(StructReader(f, encoding=encoding), encoding=encoding) 448 | f.read_tiny_text() 449 | tag_count = f.read_byte() 450 | for _ in range(tag_count): 451 | key = f.read_tiny_text() 452 | if key == name: 453 | f.write_tiny_text(value, editable=True) 454 | return 455 | f.read_tiny_text() 456 | raise TagNotFound(name) 457 | 458 | 459 | def read_header(f): 460 | f.seek(0) 461 | 462 | magic = f.read(len(MAGIC)) 463 | if magic != MAGIC: 464 | raise UnknownFileFormat() 465 | 466 | uuid = UUID(bytes=f.read(16)) 467 | encoding = read_byte_string(f, U_CHAR).decode(UTF8) 468 | if encodings.search_function(encoding) is None: 469 | raise UnknownEncoding(encoding) 470 | 471 | f = StructReader(f, encoding) 472 | compression = f.read_tiny_text() 473 | if not compression in COMPRESSIONS: 474 | raise UnknownCompression(compression) 475 | 476 | def read_tags(): 477 | tags = {} 478 | count = f.read_byte() 479 | for _ in range(count): 480 | key = f.read_tiny_text() 481 | value = f.read_tiny_text() 482 | tags[key] = value 483 | return tags 484 | 485 | tags = read_tags() 486 | 487 | def read_content_types(): 488 | content_types = [] 489 | count = f.read_byte() 490 | for _ in range(count): 491 | content_type = f.read_text() 492 | content_types.append(content_type) 493 | return tuple(content_types) 494 | 495 | content_types = read_content_types() 496 | 497 | blob_count = f.read_int() 498 | store_offset = f.read_long() 499 | size = f.read_long() 500 | refs_offset = f.tell() 501 | 502 | return Header( 503 | magic=magic, 504 | uuid=uuid, 505 | encoding=encoding, 506 | compression=compression, 507 | 
tags=MappingProxyType(tags), 508 | content_types=content_types, 509 | blob_count=blob_count, 510 | store_offset=store_offset, 511 | refs_offset=refs_offset, 512 | size=size, 513 | ) 514 | 515 | 516 | def meld_ints(a, b): 517 | return (a << 16) | b 518 | 519 | 520 | def unmeld_ints(c): 521 | bstr = bin(c).lstrip("0b").zfill(48) 522 | a, b = bstr[-48:-16], bstr[-16:] 523 | return int(a, 2), int(b, 2) 524 | 525 | 526 | class Slob(Sequence): 527 | 528 | def __init__(self, file_or_filenames): 529 | self._f = MultiFileReader(file_or_filenames) 530 | 531 | try: 532 | self._header = read_header(self._f) 533 | if self._f.size != self._header.size: 534 | raise IncorrectFileSize( 535 | "File size should be {0}, {1} bytes found".format( 536 | self._header.size, self._f.size 537 | ) 538 | ) 539 | except FileFormatException: 540 | self._f.close() 541 | raise 542 | 543 | self._refs = RefList( 544 | self._f, self._header.encoding, offset=self._header.refs_offset 545 | ) 546 | 547 | self._g = MultiFileReader(file_or_filenames) 548 | self._store = Store( 549 | self._g, 550 | self._header.store_offset, 551 | COMPRESSIONS[self._header.compression].decompress, 552 | self._header.content_types, 553 | ) 554 | 555 | def __enter__(self): 556 | return self 557 | 558 | def __exit__(self, exc_type, exc_val, exc_tb): 559 | self.close() 560 | return False 561 | 562 | @property 563 | def id(self): 564 | return self._header.uuid.hex 565 | 566 | @property 567 | def content_types(self): 568 | return self._header.content_types 569 | 570 | @property 571 | def tags(self): 572 | return self._header.tags 573 | 574 | @property 575 | def blob_count(self): 576 | return self._header.blob_count 577 | 578 | @property 579 | def compression(self): 580 | return self._header.compression 581 | 582 | @property 583 | def encoding(self): 584 | return self._header.encoding 585 | 586 | def __len__(self): 587 | return len(self._refs) 588 | 589 | def __getitem__(self, i): 590 | ref = self._refs[i] 591 | 592 | def read_func(): 593 | return self._store.get(ref.bin_index, ref.item_index)[1] 594 | 595 | read_func = lru_cache(maxsize=None)(read_func) 596 | 597 | def read_content_type_func(): 598 | return self._store.content_type(ref.bin_index, ref.item_index) 599 | 600 | content_id = meld_ints(ref.bin_index, ref.item_index) 601 | return Blob( 602 | content_id, ref.key, ref.fragment, read_content_type_func, read_func 603 | ) 604 | 605 | def get(self, blob_id): 606 | bin_index, bin_item_index = unmeld_ints(blob_id) 607 | return self._store.get(bin_index, bin_item_index) 608 | 609 | @lru_cache(maxsize=None) 610 | def as_dict(self, strength=TERTIARY, maxlength=None): 611 | return KeydItemDict(self, strength, maxlength=maxlength) 612 | 613 | def close(self): 614 | self._f.close() 615 | self._g.close() 616 | 617 | 618 | def find_parts(fname): 619 | fname = os.path.expanduser(fname) 620 | dirname = os.path.dirname(fname) or os.getcwd() 621 | basename = os.path.basename(fname) 622 | candidates = [] 623 | for name in os.listdir(dirname): 624 | if name.startswith(basename): 625 | candidates.append(os.path.join(dirname, name)) 626 | return sorted(candidates) 627 | 628 | 629 | def open(file_or_filenames): 630 | if isinstance(file_or_filenames, str): 631 | if not os.path.exists(file_or_filenames): 632 | file_or_filenames = find_parts(file_or_filenames) 633 | return Slob(file_or_filenames) 634 | 635 | 636 | def create(*args, **kwargs): 637 | return Writer(*args, **kwargs) 638 | 639 | 640 | class BinMemWriter: 641 | 642 | def __init__(self): 643 | 
self.content_type_ids = [] 644 | self.item_dir = [] 645 | self.items = [] 646 | self.current_offset = 0 647 | 648 | def add(self, content_type_id, blob): 649 | self.content_type_ids.append(content_type_id) 650 | self.item_dir.append(pack(U_INT, self.current_offset)) 651 | length_and_bytes = pack(U_INT, len(blob)) + blob 652 | self.items.append(length_and_bytes) 653 | self.current_offset += len(length_and_bytes) 654 | 655 | def __len__(self): 656 | return len(self.item_dir) 657 | 658 | def finalize(self, fout: "output file", compress: "function"): 659 | count = len(self) 660 | fout.write(pack(U_INT, count)) 661 | for content_type_id in self.content_type_ids: 662 | fout.write(pack(U_CHAR, content_type_id)) 663 | content = b"".join(self.item_dir + self.items) 664 | compressed = compress(content) 665 | fout.write(pack(U_INT, len(compressed))) 666 | fout.write(compressed) 667 | self.content_type_ids.clear() 668 | self.item_dir.clear() 669 | self.items.clear() 670 | 671 | 672 | class ItemList(Sequence): 673 | 674 | def __init__(self, file_, offset, count_or_spec, pos_spec, cache_size=None): 675 | self.lock = RLock() 676 | self._file = file_ 677 | file_.seek(offset) 678 | if isinstance(count_or_spec, str): 679 | count_spec = count_or_spec 680 | self.count = unpack(count_spec, file_.read(calcsize(count_spec)))[0] 681 | else: 682 | self.count = count_or_spec 683 | self.pos_offset = file_.tell() 684 | self.pos_spec = pos_spec 685 | self.pos_size = calcsize(pos_spec) 686 | self.data_offset = self.pos_offset + self.pos_size * self.count 687 | if cache_size: 688 | self.__getitem__ = lru_cache(maxsize=cache_size)(self.__getitem__) 689 | 690 | def __len__(self): 691 | return self.count 692 | 693 | def pos(self, i): 694 | with self.lock: 695 | self._file.seek(self.pos_offset + self.pos_size * i) 696 | return unpack(self.pos_spec, self._file.read(self.pos_size))[0] 697 | 698 | def read(self, pos): 699 | with self.lock: 700 | self._file.seek(self.data_offset + pos) 701 | return self._read_item() 702 | 703 | @abstractmethod 704 | def _read_item(self): 705 | pass 706 | 707 | def __getitem__(self, i): 708 | if i >= len(self) or i < 0: 709 | raise IndexError("index out of range") 710 | return self.read(self.pos(i)) 711 | 712 | 713 | class RefList(ItemList): 714 | 715 | def __init__(self, f, encoding, offset=0, count=None): 716 | super().__init__( 717 | StructReader(f, encoding), 718 | offset, 719 | U_INT if count is None else count, 720 | U_LONG_LONG, 721 | cache_size=512, 722 | ) 723 | 724 | def _read_item(self): 725 | key = self._file.read_text() 726 | bin_index = self._file.read_int() 727 | item_index = self._file.read_short() 728 | fragment = self._file.read_tiny_text() 729 | return Ref( 730 | key=key, bin_index=bin_index, item_index=item_index, fragment=fragment 731 | ) 732 | 733 | @lru_cache(maxsize=None) 734 | def as_dict(self, strength=TERTIARY, maxlength=None): 735 | return KeydItemDict(self, strength, maxlength=maxlength) 736 | 737 | 738 | class Bin(ItemList): 739 | 740 | def __init__(self, count, bin_bytes): 741 | super().__init__(StructReader(io.BytesIO(bin_bytes)), 0, count, U_INT) 742 | 743 | def _read_item(self): 744 | content_len = self._file.read_int() 745 | content = self._file.read(content_len) 746 | return content 747 | 748 | 749 | StoreItem = namedtuple("StoreItem", "content_type_ids compressed_content") 750 | 751 | 752 | class Store(ItemList): 753 | 754 | def __init__(self, file_, offset, decompress, content_types): 755 | super().__init__(StructReader(file_), offset, U_INT, U_LONG_LONG, 
cache_size=32) 756 | self.decompress = decompress 757 | self.content_types = content_types 758 | 759 | def _read_item(self): 760 | bin_item_count = self._file.read_int() 761 | packed_content_type_ids = self._file.read(bin_item_count * U_CHAR_SIZE) 762 | content_type_ids = [] 763 | for i in range(bin_item_count): 764 | content_type_id = unpack(U_CHAR, packed_content_type_ids[i : i + 1])[0] 765 | content_type_ids.append(content_type_id) 766 | content_length = self._file.read_int() 767 | content = self._file.read(content_length) 768 | return StoreItem(content_type_ids=content_type_ids, compressed_content=content) 769 | 770 | def _content_type(self, bin_index, item_index): 771 | store_item = self[bin_index] 772 | content_type_id = store_item.content_type_ids[item_index] 773 | content_type = self.content_types[content_type_id] 774 | return content_type, store_item 775 | 776 | def content_type(self, bin_index, item_index): 777 | return self._content_type(bin_index, item_index)[0] 778 | 779 | @lru_cache(maxsize=16) 780 | def _decompress(self, bin_index): 781 | store_item = self[bin_index] 782 | return self.decompress(store_item.compressed_content) 783 | 784 | def get(self, bin_index, item_index): 785 | content_type, store_item = self._content_type(bin_index, item_index) 786 | content = self._decompress(bin_index) 787 | count = len(store_item.content_type_ids) 788 | store_bin = Bin(count, content) 789 | content = store_bin[item_index] 790 | return (content_type, content) 791 | 792 | 793 | def find(word, slobs, match_prefix=True): 794 | seen = set() 795 | if isinstance(slobs, Slob): 796 | slobs = [slobs] 797 | 798 | variants = [] 799 | 800 | for strength in (QUATERNARY, TERTIARY, SECONDARY, PRIMARY): 801 | variants.append((strength, None)) 802 | 803 | if match_prefix: 804 | for strength in (QUATERNARY, TERTIARY, SECONDARY, PRIMARY): 805 | variants.append((strength, sortkey_length(strength, word))) 806 | 807 | for strength, maxlength in variants: 808 | for slob in slobs: 809 | d = slob.as_dict(strength=strength, maxlength=maxlength) 810 | for item in d[word]: 811 | dedup_key = (slob.id, item.id, item.fragment) 812 | if dedup_key in seen: 813 | continue 814 | else: 815 | seen.add(dedup_key) 816 | yield slob, item 817 | 818 | 819 | WriterEvent = namedtuple("WriterEvent", "name data") 820 | 821 | 822 | class KeyTooLongException(Exception): 823 | 824 | @property 825 | def key(self): 826 | return self.args[0] 827 | 828 | 829 | class Writer(object): 830 | 831 | def __init__( 832 | self, 833 | filename, 834 | workdir=None, 835 | encoding=UTF8, 836 | compression=DEFAULT_COMPRESSION, 837 | min_bin_size=512 * 1024, 838 | max_redirects=5, 839 | observer=None, 840 | ): 841 | self.filename = filename 842 | self.observer = observer 843 | if os.path.exists(self.filename): 844 | raise SystemExit("File %r already exists" % self.filename) 845 | 846 | # make sure we can write 847 | with fopen(self.filename, "wb"): 848 | pass 849 | 850 | self.encoding = encoding 851 | 852 | if encodings.search_function(self.encoding) is None: 853 | raise UnknownEncoding(self.encoding) 854 | 855 | self.workdir = workdir 856 | 857 | self.tmpdir = tmpdir = tempfile.TemporaryDirectory( 858 | prefix="{0}-".format(os.path.basename(filename)), dir=workdir 859 | ) 860 | 861 | self.f_ref_positions = self._wbfopen("ref-positions") 862 | self.f_store_positions = self._wbfopen("store-positions") 863 | self.f_refs = self._wbfopen("refs") 864 | self.f_store = self._wbfopen("store") 865 | 866 | self.max_redirects = max_redirects 867 | if 
max_redirects: 868 | self.aliases_path = os.path.join(tmpdir.name, "aliases") 869 | self.f_aliases = Writer( 870 | self.aliases_path, 871 | workdir=tmpdir.name, 872 | max_redirects=0, 873 | compression=None, 874 | ) 875 | 876 | if compression is None: 877 | compression = "" 878 | if not compression in COMPRESSIONS: 879 | raise UnknownCompression(compression) 880 | else: 881 | self.compress = COMPRESSIONS[compression].compress 882 | 883 | self.compression = compression 884 | self.content_types = {} 885 | 886 | self.min_bin_size = min_bin_size 887 | 888 | self.current_bin = None 889 | 890 | self.blob_count = 0 891 | self.ref_count = 0 892 | self.bin_count = 0 893 | self._tags = { 894 | "version.python": sys.version.replace("\n", " "), 895 | "version.pyicu": icu.VERSION, 896 | "version.icu": icu.ICU_VERSION, 897 | "created.at": datetime.now(timezone.utc).isoformat(), 898 | } 899 | self.tags = MappingProxyType(self._tags) 900 | 901 | def _wbfopen(self, name): 902 | return StructWriter( 903 | fopen(os.path.join(self.tmpdir.name, name), "wb"), encoding=self.encoding 904 | ) 905 | 906 | def tag(self, name, value=""): 907 | if len(name.encode(self.encoding)) > MAX_TINY_TEXT_LEN: 908 | self._fire_event("tag_name_too_long", (name, value)) 909 | return 910 | 911 | if len(value.encode(self.encoding)) > MAX_TINY_TEXT_LEN: 912 | self._fire_event("tag_value_too_long", (name, value)) 913 | value = "" 914 | 915 | self._tags[name] = value 916 | 917 | def _split_key(self, key): 918 | if isinstance(key, str): 919 | actual_key = key 920 | fragment = "" 921 | else: 922 | actual_key, fragment = key 923 | if len(actual_key) > MAX_TEXT_LEN or len(fragment) > MAX_TINY_TEXT_LEN: 924 | raise KeyTooLongException(key) 925 | return actual_key, fragment 926 | 927 | def add(self, blob, *keys, content_type=""): 928 | 929 | if len(blob) > MAX_LARGE_BYTE_STRING_LEN: 930 | self._fire_event("content_too_long", blob) 931 | return 932 | 933 | if len(content_type) > MAX_TEXT_LEN: 934 | self._fire_event("content_type_too_long", content_type) 935 | return 936 | 937 | actual_keys = [] 938 | 939 | for key in keys: 940 | try: 941 | actual_key, fragment = self._split_key(key) 942 | except KeyTooLongException as e: 943 | self._fire_event("key_too_long", e.key) 944 | else: 945 | actual_keys.append((actual_key, fragment)) 946 | 947 | if len(actual_keys) == 0: 948 | return 949 | 950 | if self.current_bin is None: 951 | self.current_bin = BinMemWriter() 952 | self.bin_count += 1 953 | 954 | if content_type not in self.content_types: 955 | self.content_types[content_type] = len(self.content_types) 956 | 957 | self.current_bin.add(self.content_types[content_type], blob) 958 | self.blob_count += 1 959 | bin_item_index = len(self.current_bin) - 1 960 | bin_index = self.bin_count - 1 961 | 962 | for actual_key, fragment in actual_keys: 963 | self._write_ref(actual_key, bin_index, bin_item_index, fragment) 964 | 965 | if ( 966 | self.current_bin.current_offset > self.min_bin_size 967 | or len(self.current_bin) == MAX_BIN_ITEM_COUNT 968 | ): 969 | self._write_current_bin() 970 | 971 | def add_alias(self, key, target_key): 972 | if self.max_redirects: 973 | try: 974 | self._split_key(key) 975 | except KeyTooLongException as e: 976 | self._fire_event("alias_too_long", e.key) 977 | return 978 | try: 979 | self._split_key(target_key) 980 | except KeyTooLongException as e: 981 | self._fire_event("alias_target_too_long", e.key) 982 | return 983 | self.f_aliases.add(pickle.dumps(target_key), key) 984 | else: 985 | raise NotImplementedError() 986 | 987 | 
def _fire_event(self, name, data=None): 988 | if self.observer: 989 | self.observer(WriterEvent(name, data)) 990 | 991 | def _write_current_bin(self): 992 | self.f_store_positions.write_long(self.f_store.tell()) 993 | self.current_bin.finalize(self.f_store, self.compress) 994 | self.current_bin = None 995 | 996 | def _write_ref(self, key, bin_index, item_index, fragment=""): 997 | self.f_ref_positions.write_long(self.f_refs.tell()) 998 | self.f_refs.write_text(key) 999 | self.f_refs.write_int(bin_index) 1000 | self.f_refs.write_short(item_index) 1001 | self.f_refs.write_tiny_text(fragment) 1002 | self.ref_count += 1 1003 | 1004 | def _sort(self): 1005 | self._fire_event("begin_sort") 1006 | f_ref_positions_sorted = self._wbfopen("ref-positions-sorted") 1007 | self.f_refs.flush() 1008 | self.f_ref_positions.close() 1009 | with MultiFileReader(self.f_ref_positions.name, self.f_refs.name) as f: 1010 | ref_list = RefList(f, self.encoding, count=self.ref_count) 1011 | sortkey_func = sortkey(IDENTICAL) 1012 | for i in sorted( 1013 | range(len(ref_list)), key=lambda j: sortkey_func(ref_list[j].key) 1014 | ): 1015 | ref_pos = ref_list.pos(i) 1016 | f_ref_positions_sorted.write_long(ref_pos) 1017 | f_ref_positions_sorted.close() 1018 | os.remove(self.f_ref_positions.name) 1019 | os.rename(f_ref_positions_sorted.name, self.f_ref_positions.name) 1020 | self.f_ref_positions = StructWriter( 1021 | fopen(self.f_ref_positions.name, "ab"), encoding=self.encoding 1022 | ) 1023 | self._fire_event("end_sort") 1024 | 1025 | def _resolve_aliases(self): 1026 | self._fire_event("begin_resolve_aliases") 1027 | self.f_aliases.finalize() 1028 | with MultiFileReader(self.f_ref_positions.name, self.f_refs.name) as f_ref_list: 1029 | ref_list = RefList(f_ref_list, self.encoding, count=self.ref_count) 1030 | ref_dict = ref_list.as_dict() 1031 | with open(self.aliases_path) as r: 1032 | aliases = r.as_dict() 1033 | path = os.path.join(self.tmpdir.name, "resolved-aliases") 1034 | with create( 1035 | path, workdir=self.tmpdir.name, max_redirects=0, compression=None 1036 | ) as alias_writer: 1037 | 1038 | def read_key_frag(item, default_fragment): 1039 | key_frag = pickle.loads(item.content) 1040 | if isinstance(key_frag, str): 1041 | return key_frag, default_fragment 1042 | else: 1043 | return key_frag 1044 | 1045 | for item in r: 1046 | from_key = item.key 1047 | keys = set() 1048 | keys.add(from_key) 1049 | to_key, fragment = read_key_frag(item, item.fragment) 1050 | count = 0 1051 | while count <= self.max_redirects: 1052 | # is target key itself a redirect? 
1053 | try: 1054 | orig_to_key = to_key 1055 | to_key, fragment = read_key_frag( 1056 | next(aliases[to_key]), fragment 1057 | ) 1058 | count += 1 1059 | keys.add(orig_to_key) 1060 | except StopIteration: 1061 | break 1062 | if count > self.max_redirects: 1063 | self._fire_event("too_many_redirects", from_key) 1064 | try: 1065 | target_ref = next(ref_dict[to_key]) 1066 | except StopIteration: 1067 | self._fire_event("alias_target_not_found", to_key) 1068 | else: 1069 | for key in keys: 1070 | ref = Ref( 1071 | key=key, 1072 | bin_index=target_ref.bin_index, 1073 | item_index=target_ref.item_index, 1074 | # last fragment in the chain wins 1075 | fragment=target_ref.fragment or fragment, 1076 | ) 1077 | alias_writer.add(pickle.dumps(ref), key) 1078 | 1079 | with open(path) as resolved_aliases_reader: 1080 | previous = None 1081 | targets = set() 1082 | 1083 | for item in resolved_aliases_reader: 1084 | ref = pickle.loads(item.content) 1085 | if previous is not None and ref.key != previous.key: 1086 | for bin_index, item_index, fragment in sorted(targets): 1087 | self._write_ref(previous.key, bin_index, item_index, fragment) 1088 | targets.clear() 1089 | targets.add((ref.bin_index, ref.item_index, ref.fragment)) 1090 | previous = ref 1091 | 1092 | for bin_index, item_index, fragment in sorted(targets): 1093 | self._write_ref(previous.key, bin_index, item_index, fragment) 1094 | 1095 | self._sort() 1096 | self._fire_event("end_resolve_aliases") 1097 | 1098 | def finalize(self): 1099 | self._fire_event("begin_finalize") 1100 | if not self.current_bin is None: 1101 | self._write_current_bin() 1102 | 1103 | self._sort() 1104 | if self.max_redirects: 1105 | self._resolve_aliases() 1106 | 1107 | files = ( 1108 | self.f_ref_positions, 1109 | self.f_refs, 1110 | self.f_store_positions, 1111 | self.f_store, 1112 | ) 1113 | 1114 | for f in files: 1115 | f.close() 1116 | 1117 | buf_size = 10 * 1024 * 1024 1118 | 1119 | with fopen(self.filename, mode="wb") as output_file: 1120 | out = StructWriter(output_file, self.encoding) 1121 | out.write(MAGIC) 1122 | out.write(uuid4().bytes) 1123 | out.write_tiny_text(self.encoding, encoding=UTF8) 1124 | out.write_tiny_text(self.compression) 1125 | 1126 | def write_tags(tags, f): 1127 | f.write(pack(U_CHAR, len(tags))) 1128 | for key, value in tags.items(): 1129 | f.write_tiny_text(key) 1130 | f.write_tiny_text(value, editable=True) 1131 | 1132 | write_tags(self.tags, out) 1133 | 1134 | def write_content_types(content_types, f): 1135 | count = len(content_types) 1136 | f.write(pack(U_CHAR, count)) 1137 | types = sorted(content_types.items(), key=lambda x: x[1]) 1138 | for content_type, _ in types: 1139 | f.write_text(content_type) 1140 | 1141 | write_content_types(self.content_types, out) 1142 | 1143 | out.write_int(self.blob_count) 1144 | store_offset = ( 1145 | out.tell() 1146 | + U_LONG_LONG_SIZE # this value 1147 | + U_LONG_LONG_SIZE # file size value 1148 | + U_INT_SIZE # ref count value 1149 | + os.stat(self.f_ref_positions.name).st_size 1150 | + os.stat(self.f_refs.name).st_size 1151 | ) 1152 | out.write_long(store_offset) 1153 | out.flush() 1154 | 1155 | file_size = ( 1156 | out.tell() # bytes written so far 1157 | + U_LONG_LONG_SIZE # file size value 1158 | + 2 * U_INT_SIZE 1159 | ) # ref count and bin count 1160 | file_size += sum((os.stat(f.name).st_size for f in files)) 1161 | out.write_long(file_size) 1162 | 1163 | def mv(src, out): 1164 | fname = src.name 1165 | self._fire_event("begin_move", fname) 1166 | with fopen(fname, mode="rb") as f: 1167 | 
while True: 1168 | data = f.read(buf_size) 1169 | if len(data) == 0: 1170 | break 1171 | out.write(data) 1172 | out.flush() 1173 | os.remove(fname) 1174 | self._fire_event("end_move", fname) 1175 | 1176 | out.write_int(self.ref_count) 1177 | mv(self.f_ref_positions, out) 1178 | mv(self.f_refs, out) 1179 | 1180 | out.write_int(self.bin_count) 1181 | mv(self.f_store_positions, out) 1182 | mv(self.f_store, out) 1183 | 1184 | self.tmpdir.cleanup() 1185 | self._fire_event("end_finalize") 1186 | 1187 | def size_header(self): 1188 | size = 0 1189 | size += len(MAGIC) 1190 | size += 16 # uuid bytes 1191 | size += U_CHAR_SIZE + len(self.encoding.encode(UTF8)) 1192 | size += U_CHAR_SIZE + len(self.compression.encode(self.encoding)) 1193 | 1194 | size += U_CHAR_SIZE # tag length 1195 | size += U_CHAR_SIZE # content types count 1196 | 1197 | # tags and content types themselves counted elsewhere 1198 | 1199 | size += U_INT_SIZE # blob count 1200 | size += U_LONG_LONG_SIZE # store offset 1201 | size += U_LONG_LONG_SIZE # file size 1202 | size += U_INT_SIZE # ref count 1203 | size += U_INT_SIZE # bin count 1204 | 1205 | return size 1206 | 1207 | def size_tags(self): 1208 | size = 0 1209 | for key, _ in self.tags.items(): 1210 | size += U_CHAR_SIZE + len(key.encode(self.encoding)) 1211 | size += 255 1212 | return size 1213 | 1214 | def size_content_types(self): 1215 | size = 0 1216 | for content_type in self.content_types: 1217 | size += U_CHAR_SIZE + len(content_type.encode(self.encoding)) 1218 | return size 1219 | 1220 | def size_data(self): 1221 | files = ( 1222 | self.f_ref_positions, 1223 | self.f_refs, 1224 | self.f_store_positions, 1225 | self.f_store, 1226 | ) 1227 | return sum((os.stat(f.name).st_size for f in files)) 1228 | 1229 | def __enter__(self): 1230 | return self 1231 | 1232 | def __exit__(self, exc_type, exc_val, exc_tb): 1233 | self.finalize() 1234 | return False 1235 | 1236 | 1237 | class TestReadWrite(unittest.TestCase): 1238 | 1239 | def setUp(self): 1240 | 1241 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1242 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1243 | 1244 | with create(self.path) as w: 1245 | 1246 | self.tags = {"a": "abc", "bb": "xyz123", "ccc": "lkjlk"} 1247 | for name, value in self.tags.items(): 1248 | w.tag(name, value) 1249 | 1250 | self.tag2 = "bb", "xyz123" 1251 | 1252 | self.blob_encoding = "ascii" 1253 | 1254 | self.data = [ 1255 | (("c", "cc", "ccc"), MIME_TEXT, "Hello C 1"), 1256 | ("a", MIME_TEXT, "Hello A 12"), 1257 | ("z", MIME_TEXT, "Hello Z 123"), 1258 | ("b", MIME_TEXT, "Hello B 1234"), 1259 | ("d", MIME_TEXT, "Hello D 12345"), 1260 | ("uuu", MIME_HTML, "Hello U!"), 1261 | ((("yy", "frag1"),), MIME_HTML, '
<h1 id="frag1">Section 1</h1>
'), 1262 | ] 1263 | 1264 | self.all_keys = [] 1265 | 1266 | self.data_as_dict = {} 1267 | 1268 | for k, t, v in self.data: 1269 | if isinstance(k, str): 1270 | k = (k,) 1271 | for key in k: 1272 | if isinstance(key, tuple): 1273 | key, fragment = key 1274 | else: 1275 | fragment = "" 1276 | self.all_keys.append(key) 1277 | self.data_as_dict[key] = (t, v, fragment) 1278 | w.add(v.encode(self.blob_encoding), *k, content_type=t) 1279 | self.all_keys.sort() 1280 | 1281 | self.w = w 1282 | 1283 | def test_header(self): 1284 | with MultiFileReader(self.path) as f: 1285 | header = read_header(f) 1286 | 1287 | for key, value in self.tags.items(): 1288 | self.assertEqual(header.tags[key], value) 1289 | 1290 | self.assertEqual(self.w.encoding, UTF8) 1291 | self.assertEqual(header.encoding, self.w.encoding) 1292 | 1293 | self.assertEqual(header.compression, self.w.compression) 1294 | 1295 | for i, content_type in enumerate(header.content_types): 1296 | self.assertEqual(self.w.content_types[content_type], i) 1297 | 1298 | self.assertEqual(header.blob_count, len(self.data)) 1299 | 1300 | def test_content(self): 1301 | with open(self.path) as r: 1302 | self.assertEqual(len(r), len(self.all_keys)) 1303 | self.assertRaises(IndexError, r.__getitem__, len(self.all_keys)) 1304 | for i, item in enumerate(r): 1305 | self.assertEqual(item.key, self.all_keys[i]) 1306 | content_type, value, fragment = self.data_as_dict[item.key] 1307 | self.assertEqual(item.content_type, content_type) 1308 | self.assertEqual(item.content.decode(self.blob_encoding), value) 1309 | self.assertEqual(item.fragment, fragment) 1310 | 1311 | def tearDown(self): 1312 | self.tmpdir.cleanup() 1313 | 1314 | 1315 | class TestSort(unittest.TestCase): 1316 | 1317 | def setUp(self): 1318 | 1319 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1320 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1321 | 1322 | with create(self.path) as w: 1323 | 1324 | data = [ 1325 | "Ф, ф", 1326 | "Ф ф", 1327 | "Ф", 1328 | "Э", 1329 | "Е е", 1330 | "г", 1331 | "н", 1332 | "ф", 1333 | "а", 1334 | "Ф, Ф", 1335 | "е", 1336 | "Е", 1337 | "Ее", 1338 | "ё", 1339 | "Ё", 1340 | "Её", 1341 | "Е ё", 1342 | "А", 1343 | "э", 1344 | "ы", 1345 | ] 1346 | 1347 | self.data_sorted = sorted(data, key=sortkey(IDENTICAL)) 1348 | 1349 | for k in data: 1350 | v = ";".join(unicodedata.name(c) for c in k) 1351 | w.add(v.encode("ascii"), k) 1352 | 1353 | self.r = open(self.path) 1354 | 1355 | def test_sort_order(self): 1356 | for i in range(len(self.r)): 1357 | self.assertEqual(self.r[i].key, self.data_sorted[i]) 1358 | 1359 | def tearDown(self): 1360 | self.r.close() 1361 | self.tmpdir.cleanup() 1362 | 1363 | 1364 | class TestFind(unittest.TestCase): 1365 | 1366 | def setUp(self): 1367 | 1368 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1369 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1370 | 1371 | with create(self.path) as w: 1372 | data = [ 1373 | "Cc", 1374 | "aA", 1375 | "aa", 1376 | "Aa", 1377 | "Bb", 1378 | "cc", 1379 | "Äā", 1380 | "ăÀ", 1381 | "a\u00A0a", 1382 | "a-a", 1383 | "a\u2019a", 1384 | "a\u2032a", 1385 | "a,a", 1386 | "a a", 1387 | ] 1388 | 1389 | for k in data: 1390 | v = ";".join(unicodedata.name(c) for c in k) 1391 | w.add(v.encode("ascii"), k) 1392 | 1393 | self.r = open(self.path) 1394 | 1395 | def get(self, d, key): 1396 | return list(item.content.decode("ascii") for item in d[key]) 1397 | 1398 | def test_find_identical(self): 1399 | d = self.r.as_dict(IDENTICAL) 1400 | self.assertEqual( 1401 | self.get(d, "aa"), 
["LATIN SMALL LETTER A;LATIN SMALL LETTER A"] 1402 | ) 1403 | self.assertEqual( 1404 | self.get(d, "a-a"), 1405 | ["LATIN SMALL LETTER A;HYPHEN-MINUS;LATIN SMALL LETTER A"], 1406 | ) 1407 | self.assertEqual( 1408 | self.get(d, "aA"), ["LATIN SMALL LETTER A;LATIN CAPITAL LETTER A"] 1409 | ) 1410 | self.assertEqual( 1411 | self.get(d, "Äā"), 1412 | [ 1413 | "LATIN CAPITAL LETTER A WITH DIAERESIS;" 1414 | "LATIN SMALL LETTER A WITH MACRON" 1415 | ], 1416 | ) 1417 | self.assertEqual( 1418 | self.get(d, "a a"), ["LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A"] 1419 | ) 1420 | 1421 | def test_find_quaternary(self): 1422 | d = self.r.as_dict(QUATERNARY) 1423 | self.assertEqual( 1424 | self.get(d, "a\u2032a"), ["LATIN SMALL LETTER A;PRIME;LATIN SMALL LETTER A"] 1425 | ) 1426 | self.assertEqual( 1427 | self.get(d, "a a"), 1428 | [ 1429 | "LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A", 1430 | "LATIN SMALL LETTER A;NO-BREAK SPACE;LATIN SMALL LETTER A", 1431 | ], 1432 | ) 1433 | 1434 | def test_find_tertiary(self): 1435 | d = self.r.as_dict(TERTIARY) 1436 | self.assertEqual( 1437 | self.get(d, "aa"), 1438 | [ 1439 | "LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A", 1440 | "LATIN SMALL LETTER A;NO-BREAK SPACE;LATIN SMALL LETTER A", 1441 | "LATIN SMALL LETTER A;HYPHEN-MINUS;LATIN SMALL LETTER A", 1442 | "LATIN SMALL LETTER A;COMMA;LATIN SMALL LETTER A", 1443 | "LATIN SMALL LETTER A;RIGHT SINGLE QUOTATION MARK;LATIN SMALL LETTER A", 1444 | "LATIN SMALL LETTER A;PRIME;LATIN SMALL LETTER A", 1445 | "LATIN SMALL LETTER A;LATIN SMALL LETTER A", 1446 | ], 1447 | ) 1448 | 1449 | def test_find_secondary(self): 1450 | d = self.r.as_dict(SECONDARY) 1451 | self.assertEqual( 1452 | self.get(d, "aa"), 1453 | [ 1454 | "LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A", 1455 | "LATIN SMALL LETTER A;NO-BREAK SPACE;LATIN SMALL LETTER A", 1456 | "LATIN SMALL LETTER A;HYPHEN-MINUS;LATIN SMALL LETTER A", 1457 | "LATIN SMALL LETTER A;COMMA;LATIN SMALL LETTER A", 1458 | "LATIN SMALL LETTER A;RIGHT SINGLE QUOTATION MARK;LATIN SMALL LETTER A", 1459 | "LATIN SMALL LETTER A;PRIME;LATIN SMALL LETTER A", 1460 | "LATIN SMALL LETTER A;LATIN SMALL LETTER A", 1461 | "LATIN SMALL LETTER A;LATIN CAPITAL LETTER A", 1462 | "LATIN CAPITAL LETTER A;LATIN SMALL LETTER A", 1463 | ], 1464 | ) 1465 | 1466 | def test_find_primary(self): 1467 | d = self.r.as_dict(PRIMARY) 1468 | 1469 | self.assertEqual( 1470 | self.get(d, "aa"), 1471 | [ 1472 | "LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A", 1473 | "LATIN SMALL LETTER A;NO-BREAK SPACE;LATIN SMALL LETTER A", 1474 | "LATIN SMALL LETTER A;HYPHEN-MINUS;LATIN SMALL LETTER A", 1475 | "LATIN SMALL LETTER A;COMMA;LATIN SMALL LETTER A", 1476 | "LATIN SMALL LETTER A;RIGHT SINGLE QUOTATION MARK;LATIN SMALL LETTER A", 1477 | "LATIN SMALL LETTER A;PRIME;LATIN SMALL LETTER A", 1478 | "LATIN SMALL LETTER A;LATIN SMALL LETTER A", 1479 | "LATIN SMALL LETTER A;LATIN CAPITAL LETTER A", 1480 | "LATIN CAPITAL LETTER A;LATIN SMALL LETTER A", 1481 | "LATIN SMALL LETTER A WITH BREVE;LATIN CAPITAL LETTER A WITH GRAVE", 1482 | "LATIN CAPITAL LETTER A WITH DIAERESIS;LATIN SMALL LETTER A WITH MACRON", 1483 | ], 1484 | ) 1485 | 1486 | def tearDown(self): 1487 | self.r.close() 1488 | self.tmpdir.cleanup() 1489 | 1490 | 1491 | class TestPrefixFind(unittest.TestCase): 1492 | 1493 | def setUp(self): 1494 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1495 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1496 | self.data = ["a", "ab", "abc", "abcd", "abcde"] 1497 | with create(self.path) as w: 1498 | 
for k in self.data: 1499 | w.add(k.encode("ascii"), k) 1500 | 1501 | def tearDown(self): 1502 | self.tmpdir.cleanup() 1503 | 1504 | def test(self): 1505 | with open(self.path) as r: 1506 | for i, k in enumerate(self.data): 1507 | d = r.as_dict(IDENTICAL, len(k)) 1508 | self.assertEqual( 1509 | list(v.content.decode("ascii") for v in d[k]), self.data[i:] 1510 | ) 1511 | 1512 | 1513 | class TestBestMatch(unittest.TestCase): 1514 | 1515 | def setUp(self): 1516 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1517 | self.path1 = os.path.join(self.tmpdir.name, "test1.slob") 1518 | self.path2 = os.path.join(self.tmpdir.name, "test2.slob") 1519 | 1520 | data1 = ["aa", "Aa", "a-a", "aabc", "Äā", "bb", "aa"] 1521 | data2 = ["aa", "aA", "āā", "a,a", "a-a", "aade", "Äā", "cc"] 1522 | 1523 | with create(self.path1) as w: 1524 | for key in data1: 1525 | w.add(b"", key) 1526 | 1527 | with create(self.path2) as w: 1528 | for key in data2: 1529 | w.add(b"", key) 1530 | 1531 | def test_best_match(self): 1532 | self.maxDiff = None 1533 | with open(self.path1) as s1, open(self.path2) as s2: 1534 | result = find("aa", [s1, s2], match_prefix=True) 1535 | actual = list((s.id, item.key) for s, item in result) 1536 | expected = [ 1537 | (s1.id, "aa"), 1538 | (s1.id, "aa"), 1539 | (s2.id, "aa"), 1540 | (s1.id, "a-a"), 1541 | (s2.id, "a-a"), 1542 | (s2.id, "a,a"), 1543 | (s1.id, "Aa"), 1544 | (s2.id, "aA"), 1545 | (s1.id, "Äā"), 1546 | (s2.id, "Äā"), 1547 | (s2.id, "āā"), 1548 | (s1.id, "aabc"), 1549 | (s2.id, "aade"), 1550 | ] 1551 | self.assertEqual(expected, actual) 1552 | 1553 | def tearDown(self): 1554 | self.tmpdir.cleanup() 1555 | 1556 | 1557 | class TestAlias(unittest.TestCase): 1558 | 1559 | def setUp(self): 1560 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1561 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1562 | 1563 | def tearDown(self): 1564 | self.tmpdir.cleanup() 1565 | 1566 | def test_alias(self): 1567 | too_many_redirects = [] 1568 | target_not_found = [] 1569 | 1570 | def observer(event): 1571 | if event.name == "too_many_redirects": 1572 | too_many_redirects.append(event.data) 1573 | elif event.name == "alias_target_not_found": 1574 | target_not_found.append(event.data) 1575 | 1576 | with create(self.path, observer=observer) as w: 1577 | data = ["z", "b", "q", "a", "u", "g", "p", "n"] 1578 | for k in data: 1579 | v = ";".join(unicodedata.name(c) for c in k) 1580 | w.add(v.encode("ascii"), k) 1581 | 1582 | w.add_alias("w", "u") 1583 | w.add_alias("small u", "u") 1584 | w.add_alias("y1", "y2") 1585 | w.add_alias("y2", "y3") 1586 | w.add_alias("y3", "z") 1587 | w.add_alias("ZZZ", "YYY") 1588 | 1589 | w.add_alias("l3", "l1") 1590 | w.add_alias("l1", "l2") 1591 | w.add_alias("l2", "l3") 1592 | 1593 | w.add_alias("a1", ("a", "a-frag1")) 1594 | w.add_alias("a2", "a1") 1595 | w.add_alias("a3", ("a2", "a-frag2")) 1596 | 1597 | w.add_alias("g1", "g") 1598 | w.add_alias("g2", ("g1", "g-frag1")) 1599 | 1600 | w.add_alias("n or p", "n") 1601 | w.add_alias("n or p", "p") 1602 | 1603 | self.assertEqual(too_many_redirects, ["l1", "l2", "l3"]) 1604 | self.assertEqual(target_not_found, ["l2", "l3", "l1", "YYY"]) 1605 | 1606 | with open(self.path) as r: 1607 | d = r.as_dict() 1608 | 1609 | def get(key): 1610 | return list(item.content.decode("ascii") for item in d[key]) 1611 | 1612 | self.assertEqual(get("w"), ["LATIN SMALL LETTER U"]) 1613 | self.assertEqual(get("small u"), ["LATIN SMALL LETTER U"]) 1614 | self.assertEqual(get("y1"), ["LATIN SMALL LETTER Z"]) 1615 | 
self.assertEqual(get("y2"), ["LATIN SMALL LETTER Z"]) 1616 | self.assertEqual(get("y3"), ["LATIN SMALL LETTER Z"]) 1617 | self.assertEqual(get("ZZZ"), []) 1618 | self.assertEqual(get("l1"), []) 1619 | self.assertEqual(get("l2"), []) 1620 | self.assertEqual(get("l3"), []) 1621 | 1622 | self.assertEqual(len(list(d["n or p"])), 2) 1623 | 1624 | item_a1 = next(d["a1"]) 1625 | self.assertEqual(item_a1.content, b"LATIN SMALL LETTER A") 1626 | self.assertEqual(item_a1.fragment, "a-frag1") 1627 | 1628 | item_a2 = next(d["a2"]) 1629 | self.assertEqual(item_a2.content, b"LATIN SMALL LETTER A") 1630 | self.assertEqual(item_a2.fragment, "a-frag1") 1631 | 1632 | item_a3 = next(d["a3"]) 1633 | self.assertEqual(item_a3.content, b"LATIN SMALL LETTER A") 1634 | self.assertEqual(item_a3.fragment, "a-frag1") 1635 | 1636 | item_g1 = next(d["g1"]) 1637 | self.assertEqual(item_g1.content, b"LATIN SMALL LETTER G") 1638 | self.assertEqual(item_g1.fragment, "") 1639 | 1640 | item_g2 = next(d["g2"]) 1641 | self.assertEqual(item_g2.content, b"LATIN SMALL LETTER G") 1642 | self.assertEqual(item_g2.fragment, "g-frag1") 1643 | 1644 | 1645 | class TestBlobId(unittest.TestCase): 1646 | 1647 | def test(self): 1648 | max_i = 2**32 - 1 1649 | max_j = 2**16 - 1 1650 | i_values = [0, max_i] + [random.randint(1, max_i - 1) for _ in range(100)] 1651 | j_values = [0, max_j] + [random.randint(1, max_j - 1) for _ in range(100)] 1652 | for i in i_values: 1653 | for j in j_values: 1654 | self.assertEqual(unmeld_ints(meld_ints(i, j)), (i, j)) 1655 | 1656 | 1657 | class TestMultiFileReader(unittest.TestCase): 1658 | 1659 | def setUp(self): 1660 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1661 | 1662 | def tearDown(self): 1663 | self.tmpdir.cleanup() 1664 | 1665 | def test_read_all(self): 1666 | fnames = [] 1667 | for name in "abcdef": 1668 | path = os.path.join(self.tmpdir.name, name) 1669 | fnames.append(path) 1670 | with fopen(path, "wb") as f: 1671 | f.write(name.encode(UTF8)) 1672 | with MultiFileReader(fnames) as m: 1673 | self.assertEqual(m.read().decode(UTF8), "abcdef") 1674 | 1675 | def test_seek_and_read(self): 1676 | 1677 | def mkfile(basename, content): 1678 | part = os.path.join(self.tmpdir.name, basename) 1679 | with fopen(part, "wb") as f: 1680 | f.write(content) 1681 | return part 1682 | 1683 | content = b"abc\nd\nefgh\nij" 1684 | part1 = mkfile("1", content[:4]) 1685 | part2 = mkfile("2", content[4:5]) 1686 | part3 = mkfile("3", content[5:]) 1687 | 1688 | with MultiFileReader(part1, part2, part3) as m: 1689 | self.assertEqual(m.size, len(content)) 1690 | m.seek(2) 1691 | self.assertEqual(m.read(2), content[2:4]) 1692 | m.seek(1) 1693 | self.assertEqual(m.read(len(content) - 2), content[1:-1]) 1694 | m.seek(-1, whence=io.SEEK_END) 1695 | self.assertEqual(m.read(10), content[-1:]) 1696 | 1697 | m.seek(4) 1698 | m.seek(-2, whence=io.SEEK_CUR) 1699 | self.assertEqual(m.read(3), content[2:5]) 1700 | 1701 | 1702 | class TestFormatErrors(unittest.TestCase): 1703 | 1704 | def setUp(self): 1705 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1706 | 1707 | def tearDown(self): 1708 | self.tmpdir.cleanup() 1709 | 1710 | def test_wrong_file_type(self): 1711 | name = os.path.join(self.tmpdir.name, "1") 1712 | with fopen(name, "wb") as f: 1713 | f.write(b"123") 1714 | self.assertRaises(UnknownFileFormat, open, name) 1715 | 1716 | def test_truncated_file(self): 1717 | name = os.path.join(self.tmpdir.name, "1") 1718 | 1719 | with create(name) as f: 1720 | f.add(b"123", "a") 1721 | f.add( 1722 | b"234", 1723 | 
"b", 1724 | ) 1725 | 1726 | with fopen(name, "rb") as f: 1727 | all_bytes = f.read() 1728 | 1729 | with fopen(name, "wb") as f: 1730 | f.write(all_bytes[:-1]) 1731 | 1732 | self.assertRaises(IncorrectFileSize, open, name) 1733 | 1734 | with fopen(name, "wb") as f: 1735 | f.write(all_bytes) 1736 | f.write(b"\n") 1737 | 1738 | self.assertRaises(IncorrectFileSize, open, name) 1739 | 1740 | 1741 | class TestFindParts(unittest.TestCase): 1742 | 1743 | def setUp(self): 1744 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1745 | 1746 | def tearDown(self): 1747 | self.tmpdir.cleanup() 1748 | 1749 | def test_find_parts(self): 1750 | names = [ 1751 | os.path.join(self.tmpdir.name, name) for name in ("abc-1", "abc-2", "abc-3") 1752 | ] 1753 | for name in names: 1754 | with fopen(name, "wb"): 1755 | pass 1756 | parts = find_parts(os.path.join(self.tmpdir.name, "abc")) 1757 | self.assertEqual(names, parts) 1758 | 1759 | 1760 | class TestTooLongText(unittest.TestCase): 1761 | 1762 | def setUp(self): 1763 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1764 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1765 | 1766 | def tearDown(self): 1767 | self.tmpdir.cleanup() 1768 | 1769 | def test_too_long(self): 1770 | rejected_keys = [] 1771 | rejected_aliases = [] 1772 | rejected_alias_targets = [] 1773 | rejected_tags = [] 1774 | rejected_content_types = [] 1775 | 1776 | def observer(event): 1777 | if event.name == "key_too_long": 1778 | rejected_keys.append(event.data) 1779 | elif event.name == "alias_too_long": 1780 | rejected_aliases.append(event.data) 1781 | elif event.name == "alias_target_too_long": 1782 | rejected_alias_targets.append(event.data) 1783 | elif event.name == "tag_name_too_long": 1784 | rejected_tags.append(event.data) 1785 | elif event.name == "content_type_too_long": 1786 | rejected_content_types.append(event.data) 1787 | 1788 | long_tag_name = "t" * (MAX_TINY_TEXT_LEN + 1) 1789 | long_tag_value = "v" * (MAX_TINY_TEXT_LEN + 1) 1790 | long_content_type = "T" * (MAX_TEXT_LEN + 1) 1791 | long_key = "c" * (MAX_TEXT_LEN + 1) 1792 | long_frag = "d" * (MAX_TINY_TEXT_LEN + 1) 1793 | key_with_long_frag = ("d", long_frag) 1794 | tag_with_long_name = (long_tag_name, "t3 value") 1795 | tag_with_long_value = ("t1", long_tag_value) 1796 | long_alias = "f" * (MAX_TEXT_LEN + 1) 1797 | alias_with_long_frag = ("i", long_frag) 1798 | long_alias_target = long_key 1799 | long_alias_target_frag = key_with_long_frag 1800 | 1801 | with create(self.path, observer=observer) as w: 1802 | 1803 | w.tag(*tag_with_long_value) 1804 | w.tag("t2", "t2 value") 1805 | w.tag(*tag_with_long_name) 1806 | 1807 | data = ["a", "b", long_key, key_with_long_frag] 1808 | 1809 | for k in data: 1810 | if isinstance(k, str): 1811 | v = k.encode("ascii") 1812 | else: 1813 | v = "#".join(k).encode("ascii") 1814 | w.add(v, k) 1815 | 1816 | w.add_alias("e", "a") 1817 | w.add_alias(long_alias, "a") 1818 | w.add_alias(alias_with_long_frag, "a") 1819 | w.add_alias("g", long_alias_target) 1820 | w.add_alias("h", long_alias_target_frag) 1821 | 1822 | w.add(b"Hello", "hello", content_type=long_content_type) 1823 | 1824 | self.assertEqual(rejected_keys, [long_key, key_with_long_frag]) 1825 | self.assertEqual(rejected_aliases, [long_alias, alias_with_long_frag]) 1826 | self.assertEqual( 1827 | rejected_alias_targets, [long_alias_target, long_alias_target_frag] 1828 | ) 1829 | self.assertEqual(rejected_tags, [tag_with_long_name]) 1830 | self.assertEqual(rejected_content_types, [long_content_type]) 1831 | 1832 | with 
open(self.path) as r:
1833 |             self.assertEqual(r.tags["t2"], "t2 value")
1834 |             self.assertFalse(tag_with_long_name[0] in r.tags)
1835 |             self.assertTrue(tag_with_long_value[0] in r.tags)
1836 |             self.assertEqual(r.tags[tag_with_long_value[0]], "")
1837 |             d = r.as_dict()
1838 |             self.assertTrue("a" in d)
1839 |             self.assertTrue("b" in d)
1840 |             self.assertFalse(long_key in d)
1841 |             self.assertFalse(key_with_long_frag[0] in d)
1842 |             self.assertTrue("e" in d)
1843 |             self.assertFalse(long_alias in d)
1844 |             self.assertFalse("g" in d)
1845 | 
1846 |         self.assertRaises(ValueError, set_tag_value, self.path, "t1", "ы" * 128)
1847 | 
1848 | 
1849 | class TestEditTag(unittest.TestCase):
1850 | 
1851 |     def setUp(self):
1852 |         self.tmpdir = tempfile.TemporaryDirectory(prefix="test")
1853 |         self.path = os.path.join(self.tmpdir.name, "test.slob")
1854 |         with create(self.path) as w:
1855 |             w.tag("a", "123456")
1856 |             w.tag("b", "654321")
1857 | 
1858 |     def tearDown(self):
1859 |         self.tmpdir.cleanup()
1860 | 
1861 |     def test_edit_existing_tag(self):
1862 |         with open(self.path) as f:
1863 |             self.assertEqual(f.tags["a"], "123456")
1864 |             self.assertEqual(f.tags["b"], "654321")
1865 |         set_tag_value(self.path, "b", "efg")
1866 |         set_tag_value(self.path, "a", "xyz")
1867 |         with open(self.path) as f:
1868 |             self.assertEqual(f.tags["a"], "xyz")
1869 |             self.assertEqual(f.tags["b"], "efg")
1870 | 
1871 |     def test_edit_nonexisting_tag(self):
1872 |         self.assertRaises(TagNotFound, set_tag_value, self.path, "z", "abc")
1873 | 
1874 | 
1875 | class TestBinItemNumberLimit(unittest.TestCase):
1876 | 
1877 |     def setUp(self):
1878 |         self.tmpdir = tempfile.TemporaryDirectory(prefix="test")
1879 |         self.path = os.path.join(self.tmpdir.name, "test.slob")
1880 | 
1881 |     def tearDown(self):
1882 |         self.tmpdir.cleanup()
1883 | 
1884 |     def test_writing_more_than_max_number_of_bin_items(self):
1885 |         with create(self.path) as w:
1886 |             for _ in range(MAX_BIN_ITEM_COUNT + 2):
1887 |                 w.add(b"a", "a")
1888 |             self.assertEqual(w.bin_count, 2)
1889 | 
1890 | 
1891 | def _cli_info(args):
1892 |     from collections import OrderedDict
1893 | 
1894 |     with open(args.path) as s:
1895 |         h = s._header
1896 |         print("\n")
1897 |         info = OrderedDict(
1898 |             (
1899 |                 ("id", s.id),
1900 |                 ("encoding", h.encoding),
1901 |                 ("compression", h.compression),
1902 |                 ("blob count", s.blob_count),
1903 |                 ("ref count", len(s)),
1904 |             )
1905 |         )
1906 |         _print_title(args.path)
1907 |         _print_dict(info)
1908 |         print("\n")
1909 |         _print_title("CONTENT TYPES")
1910 |         for ct in h.content_types:
1911 |             print(ct)
1912 |         print("\n")
1913 |         _print_title("TAGS")
1914 |         _print_tags(s)
1915 |         print("\n")
1916 | 
1917 | 
1918 | def _print_title(title):
1919 |     print(title)
1920 |     print("-" * len(title))
1921 | 
1922 | 
1923 | def _cli_tag(args):
1924 |     tag_name = args.name
1925 |     if args.value:
1926 |         try:
1927 |             set_tag_value(args.filename, tag_name, args.value)
1928 |         except TagNotFound:
1929 |             print("No such tag")
1930 |     else:
1931 |         with open(args.filename) as s:
1932 |             _print_tags(s, tag_name)
1933 | 
1934 | 
1935 | def _print_tags(s, tag_name=None):
1936 |     if tag_name:
1937 |         try:
1938 |             value = s.tags[tag_name]
1939 |         except KeyError:
1940 |             print("No such tag")
1941 |         else:
1942 |             print(value)
1943 |     else:
1944 |         _print_dict(s.tags)
1945 | 
1946 | 
1947 | def _print_dict(d):
1948 |     max_key_len = 0
1949 |     for k, v in d.items():
1950 |         key_len = len(k)
1951 |         if key_len > max_key_len:
1952 |             max_key_len = key_len
1953 |     fmt_template = "{:>%s}: {}"
1954 |     fmt = fmt_template % max_key_len
1955 |     for k, v in d.items():
1956 |         print(fmt.format(k, v))
1957 | 
1958 | 
1959 | def _cli_find(args):
1960 |     with open(args.path) as s:
1961 |         match_prefix = not args.whole
1962 |         for i, item in enumerate(find(args.query, s, match_prefix=match_prefix)):
1963 |             if i == args.limit:  # stop before printing more than --limit results
1964 |                 break
1965 |             _, blob = item
1966 |             print(blob.id, blob.content_type, blob.key)
1967 | 
1968 | 
1969 | def _cli_aliases(args):
1970 |     word = args.query
1971 |     with open(args.path) as s:
1972 |         d = s.as_dict(strength=QUATERNARY)
1973 |         length = len(s)
1974 |         aliases = []
1975 | 
1976 |         def print_item(i, item):
1977 |             print(("\t {:>%s} {}" % len(str(length))).format(i, item.key))
1978 | 
1979 |         for blob in d[word]:
1980 |             print(blob.id, blob.content_type, blob.key)
1981 |             progress = ""
1982 |             for i, item in enumerate(s):
1983 |                 if i % 10000 == 0:
1984 |                     new_progress = "{:>4.1f}% ...".format(100 * i / length)
1985 |                     if new_progress != progress:
1986 |                         print(new_progress)
1987 |                         progress = new_progress
1988 | 
1989 |                 if item.id == blob.id:
1990 |                     aliases.append((i, item))
1991 |                     print_item(i, item)
1992 |             print("100%")
1993 |             header = "{} {}".format(blob.id, blob.content_type)
1994 |             print("=" * len(header), "\n")
1995 |             print(header)
1996 |             print("-" * len(header))
1997 |             for i, item in aliases:
1998 |                 print_item(i, item)
1999 | 
2000 | 
2001 | def _cli_get(args):
2002 |     with open(args.path) as s:
2003 |         _content_type, content = s.get(args.blob_id)
2004 |         sys.stdout.buffer.write(content)
2005 | 
2006 | 
2007 | def _p(i, *args, step=100, steps_per_line=50, fmt="{}"):
2008 |     line = steps_per_line * step
2009 |     if i % step == 0 and i != 0:
2010 |         sys.stdout.write(".")
2011 |     if i and i % line == 0:
2012 |         sys.stdout.write(fmt.format(*args))
2013 |     sys.stdout.flush()
2014 | 
2015 | 
2016 | def _cli_convert(args):
2017 |     import sys
2018 |     import time
2019 | 
2020 |     t0 = time.time()
2021 | 
2022 |     output_name = args.output
2023 | 
2024 |     output_dir = os.path.dirname(output_name)
2025 |     output_base = os.path.basename(output_name)
2026 |     output_base_noext, output_base_ext = os.path.splitext(output_base)
2027 | 
2028 |     DOT_SLOB = ".slob"
2029 | 
2030 |     if output_base_ext != DOT_SLOB:
2031 |         output_base = output_base + DOT_SLOB
2032 |         output_base_noext, output_base_ext = os.path.splitext(output_base)
2033 | 
2034 |     if args.split:
2035 |         output_dir = os.path.join(output_dir, output_base)
2036 |         if os.path.exists(output_dir):
2037 |             raise SystemExit("%r already exists" % output_dir)
2038 |         os.mkdir(output_dir)
2039 |     else:
2040 |         output_name = os.path.join(output_dir, output_base)
2041 |         if os.path.exists(output_name):
2042 |             raise SystemExit("%r already exists" % output_name)
2043 | 
2044 |     with open(args.path) as s:
2045 |         workdir = args.workdir
2046 |         encoding = args.encoding or s._header.encoding
2047 |         compression = (
2048 |             s._header.compression if args.compression is None else args.compression
2049 |         )
2050 |         min_bin_size = 1024 * args.min_bin_size
2051 |         split = 1024 * 1024 * args.split
2052 | 
2053 |         print("Mapping blobs to keys...")
2054 |         blob_to_refs = [  # per bin: item index -> array of ref indexes
2055 |             collections.defaultdict(lambda: array.array("L"))
2056 |             for _ in range(len(s._store))
2057 |         ]
2058 |         key_count = 0
2059 |         pp = functools.partial(_p, step=10000, fmt=" {:.2f}%/{:.2f}s\n")
2060 |         total_keys = len(s)
2061 |         blob_count = s.blob_count
2062 |         for i, ref in enumerate(s._refs):
2063 |             blob_to_refs[ref.bin_index][ref.item_index].append(i)
2064 |             key_count += 1
2065 |             pp(i, 100 * key_count / total_keys, time.time() - t0)
2066 | 
2067 |         print(
2068 |             "\nFound {} keys for {} blobs in {:.2f}s".format(
2069 |                 key_count, blob_count, time.time() - t0
2070 |             )
2071 |         )
2072 | 
2073 |         print("Converting...")
2074 |         Mb = 1024 * 1024
2075 |         pp = functools.partial(_p, step=100, fmt=" {:.2f}%/{:.2f}Mb/{:.2f}s\n")
2076 | 
2077 |         def mkout(output):
2078 |             w = create(
2079 |                 output,
2080 |                 workdir=workdir,
2081 |                 encoding=encoding,
2082 |                 compression=compression,
2083 |                 min_bin_size=min_bin_size,
2084 |             )
2085 |             for name, value in s.tags.items():
2086 |                 if name not in w.tags:
2087 |                     w.tag(name, value)
2088 |             size_header_and_tags = w.size_header() + w.size_tags()  # fixed per-volume overhead, counted toward the --split limit
2089 |             return w, size_header_and_tags
2090 | 
2091 |         def fin(t1, w, current_output):
2092 |             print(
2093 |                 "\nDone adding content to {1} in {0:.2f}s".format(
2094 |                     time.time() - t1, current_output
2095 |                 )
2096 |             )
2097 |             print("Finalizing...")
2098 |             w.finalize()
2099 |             return time.time(), None, 0
2100 | 
2101 |         t1, w, size_header_and_tags = time.time(), None, 0
2102 |         current_size = 0
2103 |         volume_count = 0
2104 |         current_output = output_name
2105 |         current = 0
2106 |         for j, store_item in enumerate(s._store):
2107 |             for i in range(len(store_item.content_type_ids)):
2108 |                 bin_index = j
2109 |                 item_index = i
2110 |                 current += 1
2111 |                 content_type, content = s._store.get(bin_index, item_index)
2112 |                 pp(
2113 |                     current,
2114 |                     100 * (current / blob_count),
2115 |                     current_size / Mb,
2116 |                     time.time() - t1,
2117 |                 )
2118 |                 keys = []
2119 |                 for k in blob_to_refs[bin_index][item_index]:
2120 |                     ref = s._refs[k]
2121 |                     keys.append((ref.key, ref.fragment))
2122 |                 if w is None:
2123 |                     volume_count += 1
2124 |                     if split:
2125 |                         current_output = os.path.join(
2126 |                             output_dir,
2127 |                             "".join(
2128 |                                 (
2129 |                                     output_base_noext,
2130 |                                     "-{:02d}".format(volume_count),
2131 |                                     output_base_ext,
2132 |                                 )
2133 |                             ),
2134 |                         )
2135 |                     w, size_header_and_tags = mkout(current_output)
2136 |                 w.add(content, *keys, content_type=content_type)
2137 |                 current_size = (
2138 |                     size_header_and_tags + w.size_content_types() + w.size_data()
2139 |                 )
2140 |                 if split and current_size + min_bin_size >= split:  # close this volume before it can exceed the size limit
2141 |                     t1, w, size_header_and_tags = fin(t1, w, current_output)
2142 |         if w:
2143 |             fin(t1, w, current_output)
2144 | 
2145 |     print("\nDone in {0:.2f}s".format(time.time() - t0))
2146 | 
2147 | 
2148 | def _arg_parser():
2149 |     parser = argparse.ArgumentParser()
2150 |     parent = argparse.ArgumentParser(add_help=False)
2151 |     parent.add_argument("path", help="Slob path")
2152 | 
2153 |     parents = [parent]
2154 | 
2155 |     subparsers = parser.add_subparsers(help="sub-command")
2156 | 
2157 |     parser_find = subparsers.add_parser(
2158 |         "find",
2159 |         parents=parents,
2160 |         help="Find keys",
2161 |     )
2162 |     parser_find.add_argument("query", help="Word to look up")
2163 |     parser_find.add_argument(
2164 |         "--whole", action="store_true", help="Match whole words only"
2165 |     )
2166 |     parser_find.add_argument(
2167 |         "-l",
2168 |         "--limit",
2169 |         type=int,
2170 |         default=10,
2171 |         help="Maximum number of results to display",
2172 |     )
2173 |     parser_find.set_defaults(func=_cli_find)
2174 | 
2175 |     parser_aliases = subparsers.add_parser(
2176 |         "aliases",
2177 |         parents=parents,
2178 |         help="Find blob aliases",
2179 |     )
2180 |     parser_aliases.add_argument("query", help="Word to look up")
2181 |     parser_aliases.set_defaults(func=_cli_aliases)
2182 | 
2183 |     parser_get = subparsers.add_parser(
2184 |         "get",
2185 |         parents=parents,
2186 |         help="Retrieve blob content",
2187 |     )
2188 |     parser_get.add_argument(
2189 |         "blob_id", type=int, help='Id of blob to retrieve (from output of "find")'
2190 |     )
2191 | 
2192 |     parser_get.set_defaults(func=_cli_get)
2193 | 
2194 |     parser_info = subparsers.add_parser(
2195 |         "info",
2196 |         parents=parents,
2197 |         help="Inspect slob and print basic information about it",
2198 |     )
2199 |     parser_info.set_defaults(func=_cli_info)
2200 | 
2201 |     parser_tag = subparsers.add_parser("tag", help="List tags, view or edit tag value")
2202 |     parser_tag.add_argument(
2203 |         "-n", "--name", default="", help="Name of tag to view or edit"
2204 |     )
2205 |     parser_tag.add_argument("-v", "--value", help="Tag value to set")
2206 |     parser_tag.add_argument(
2207 |         "filename", help="Slob file name (split files are not supported)"
2208 |     )
2209 |     parser_tag.set_defaults(func=_cli_tag)
2210 | 
2211 |     parser_convert = subparsers.add_parser(
2212 |         "convert",
2213 |         parents=parents,
2214 |         help=(
2215 |             "Create new slob with the same content "
2216 |             "but different encoding and compression parameters "
2217 |             "or split into multiple slobs"
2218 |         ),
2219 |     )
2220 |     parser_convert.add_argument("output", help="Name of slob file to create")
2221 | 
2222 |     parser_convert.add_argument(
2223 |         "--workdir",
2224 |         help="Directory where temporary files for conversion should be created",
2225 |     )
2226 | 
2227 |     parser_convert.add_argument(
2228 |         "-e",
2229 |         "--encoding",
2230 |         help="Text encoding for the output slob. Default: same as source",
2231 |     )
2232 | 
2233 |     parser_convert.add_argument(
2234 |         "-c",
2235 |         "--compression",
2236 |         help=(
2237 |             "Compression algorithm to use for the output slob. "
2238 |             "Default: same as source"
2239 |         ),
2240 |     )
2241 | 
2242 |     parser_convert.add_argument(
2243 |         "-b",
2244 |         "--min_bin_size",
2245 |         type=int,
2246 |         help=(
2247 |             "Minimum size of a storage bin to compress, in kilobytes. "
2248 |             "Default: %(default)s"
2249 |         ),
2250 |         default=384,
2251 |     )
2252 | 
2253 |     parser_convert.add_argument(
2254 |         "-s",
2255 |         "--split",
2256 |         type=int,
2257 |         help=(
2258 |             "Split output into multiple slob files no larger "
2259 |             "than the specified number of megabytes. "
2260 |             "Default: %(default)s (do not split)"
2261 |         ),
2262 |         default=0,
2263 |     )
2264 | 
2265 |     parser_convert.set_defaults(func=_cli_convert)
2266 | 
2267 |     return parser
2268 | 
2269 | 
2270 | def add_dir(
2271 |     slb, topdir, prefix="", include_only=None, mime_types=MIME_TYPES, print=print
2272 | ):
2273 |     print("Adding", topdir)
2274 |     for item in os.walk(topdir):
2275 |         dirpath, _dirnames, filenames = item
2276 |         for filename in filenames:
2277 |             full_path = os.path.join(dirpath, filename)
2278 |             rel_path = os.path.relpath(full_path, topdir)
2279 |             if include_only and not any(rel_path.startswith(x) for x in include_only):
2280 |                 print("SKIPPING (not included): {!a}".format(rel_path))
2281 |                 continue
2282 |             _, ext = os.path.splitext(filename)
2283 |             ext = ext.lstrip(os.path.extsep)
2284 |             content_type = mime_types.get(ext.lower())
2285 |             if not content_type:
2286 |                 print("SKIPPING (unknown content type): {!a}".format(rel_path))
2287 |             else:
2288 |                 with fopen(full_path, "rb") as f:
2289 |                     content = f.read()
2290 |                 key = prefix + rel_path
2291 |                 print("ADDING: {!a}".format(key))
2292 |                 try:
2293 |                     key.encode(slb.encoding)
2294 |                 except UnicodeEncodeError:
2295 |                     print("Failed to add, broken unicode in key: {!a}".format(key))
2296 |                 else:
2297 |                     slb.add(content, key, content_type=content_type)
2298 | 
2299 | 
2300 | import time
2301 | from datetime import timedelta
2302 | 
2303 | 
2304 | class SimpleTimingObserver(object):
2305 | 
2306 |     def __init__(self, p=None):
2307 | 
2308 |         if p is not None:
2309 |             self.p = p
2310 | 
2311 |         self.times = {}
2312 | 
2313 |     def p(self, text):
2314 |         sys.stdout.write(text)
2315 |         sys.stdout.flush()
2316 | 
2317 |     def begin(self, name):
2318 |         self.times[name] = time.time()
2319 | 
2320 |     def end(self, name):
2321 |         t0 = self.times.pop(name)
2322 |         dt = timedelta(seconds=int(time.time() - t0))
2323 |         return dt
2324 | 
2325 |     def __call__(self, e):
2326 |         p = self.p
2327 |         begin = self.begin
2328 |         end = self.end
2329 |         if e.name == "begin_finalize":
2330 |             p("\nFinished adding content in %s" % end("content"))
2331 |             p("\nFinalizing...")
2332 |             begin("finalize")
2333 |         elif e.name == "end_finalize":
2334 |             p("\nFinalized in %s" % end("finalize"))
2335 |         elif e.name == "begin_resolve_aliases":
2336 |             p("\nResolving aliases...")
2337 |             begin("aliases")
2338 |         elif e.name == "end_resolve_aliases":
2339 |             p("\nResolved aliases in %s" % end("aliases"))
2340 |         elif e.name == "begin_sort":
2341 |             p("\nSorting...")
2342 |             begin("sort")
2343 |         elif e.name == "end_sort":
2344 |             p(" sorted in %s" % end("sort"))
2345 | 
2346 | 
2347 | def main():
2348 |     parser = _arg_parser()
2349 |     args = parser.parse_args()
2350 |     if hasattr(args, "func"):
2351 |         args.func(args)
2352 |     else:
2353 |         parser.print_help()
2354 | 
2355 | 
2356 | if __name__ == "__main__":
2357 |     main()
2358 | 
--------------------------------------------------------------------------------