├── .github └── ISSUE_TEMPLATE │ └── bug_report.md ├── .gitignore ├── LICENSE ├── README.org ├── setup.py └── slob.py /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Report an issue with slob reader/writer 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | 21 | 22 | **Description** 23 | A clear and concise description of what the bug is. 24 | 25 | **To Reproduce** 26 | Steps to reproduce the behavior: 27 | - ... 28 | - ... 29 | 30 | **Expected behavior** 31 | A clear and concise description of what you expected to happen. 32 | 33 | **Environment:** 34 | - OS: [e.g. iOS] 35 | - Python version 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *~ 3 | *.egg-info 4 | __pycache__ 5 | README.html -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. [http://fsf.org/] 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 
43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. 
If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. 
Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 
234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 
296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. 
If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 
414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. 
The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. 
You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 
583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | {one line to give the program's name and a brief idea of what it does.} 635 | Copyright (C) {year} {name of author} 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see [http://www.gnu.org/licenses/]. 
649 | 650 | Also add information on how to contact you by electronic and paper mail. 651 | 652 | If the program does terminal interaction, make it output a short 653 | notice like this when it starts in an interactive mode: 654 | 655 | {project} Copyright (C) {year} {fullname} 656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 657 | This is free software, and you are welcome to redistribute it 658 | under certain conditions; type `show c' for details. 659 | 660 | The hypothetical commands `show w' and `show c' should show the appropriate 661 | parts of the General Public License. Of course, your program's commands 662 | might be different; for a GUI interface, you would use an "about box". 663 | 664 | You should also get your employer (if you work as a programmer) or school, 665 | if any, to sign a "copyright disclaimer" for the program, if necessary. 666 | For more information on this, and how to apply and follow the GNU GPL, see 667 | [http://www.gnu.org/licenses/]. 668 | 669 | The GNU General Public License does not permit incorporating your program 670 | into proprietary programs. If your program is a subroutine library, you 671 | may consider it more useful to permit linking proprietary applications with 672 | the library. If this is what you want to do, use the GNU Lesser General 673 | Public License instead of this License. But first, please read 674 | [http://www.gnu.org/philosophy/why-not-lgpl.html]. -------------------------------------------------------------------------------- /README.org: -------------------------------------------------------------------------------- 1 | * Slob 2 | Slob (sorted list of blobs) is a read-only, compressed data store 3 | with dictionary-like interface to look up content by text keys. Keys 4 | are sorted according to [[http://www.unicode.org/reports/tr10/][Unicode Collation Algorithm]]. This allows to 5 | perform punctuation, case and diacritics insensitive 6 | lookups. /slob.py/ is a reference implementation of slob format 7 | reader and writer in [[http://python.org][Python 3]]. 8 | 9 | ** Installation 10 | 11 | /slob.py/ depends on the following components: 12 | 13 | - [[http://python.org][Python]] >= 3.6 14 | - [[http://icu-project.org][ICU]] >= 4.8 15 | - [[https://pypi.python.org/pypi/PyICU][PyICU]] >= 1.5 16 | 17 | In addition, the following components are needed to set up 18 | slob environment: 19 | 20 | - [[http://git-scm.com/][git]] 21 | - [[https://virtualenv.pypa.io/][virtualenv]] 22 | 23 | Consult your operating system documentation and these component's 24 | websites for installation instructions. 
25 | 26 | For example, on Ubuntu 20.04, the following commands install the 27 | required packages: 28 | 29 | #+BEGIN_SRC sh 30 | sudo apt update 31 | sudo apt install python3 python3-icu python3.8-venv git 32 | #+END_SRC 33 | 34 | Create a new Python virtual environment: 35 | 36 | #+BEGIN_SRC sh 37 | python3 -m venv env-slob --system-site-packages 38 | #+END_SRC 39 | 40 | Activate it: 41 | 42 | #+BEGIN_SRC sh 43 | source env-slob/bin/activate 44 | #+END_SRC 45 | 46 | Install from the source code repository: 47 | 48 | #+BEGIN_SRC sh 49 | pip install git+https://github.com/itkach/slob.git 50 | #+END_SRC 51 | 52 | or download the source code manually: 53 | 54 | #+BEGIN_SRC sh 55 | wget https://github.com/itkach/slob/archive/master.zip 56 | pip install master.zip 57 | #+END_SRC 58 | 59 | Run the tests: 60 | 61 | #+BEGIN_SRC sh 62 | python -m unittest slob 63 | #+END_SRC 64 | 65 | ** Command line interface 66 | 67 | /slob.py/ provides a basic command line interface to inspect 68 | and modify slob content. 69 | 70 | #+BEGIN_SRC 71 | usage: slob [-h] {find,get,info,tag} ... 72 | 73 | positional arguments: 74 | {find,get,info,tag} sub-command 75 | find Find keys 76 | get Retrieve blob content 77 | info Inspect slob and print basic information about it 78 | tag List tags, view or edit tag value 79 | convert Create new slob with the same content but different 80 | encoding and compression parameters 81 | or split into multiple slobs 82 | 83 | optional arguments: 84 | -h, --help show this help message and exit 85 | #+END_SRC 86 | 87 | To see basic slob info such as text encoding, compression and tags: 88 | #+BEGIN_SRC 89 | slob info my.slob 90 | #+END_SRC 91 | 92 | To see the value of a tag, for example /label/: 93 | #+BEGIN_SRC 94 | slob tag -n label my.slob 95 | #+END_SRC 96 | 97 | To set a tag value: 98 | #+BEGIN_SRC 99 | slob tag -n label -v "A Fine Dictionary" my.slob 100 | #+END_SRC 101 | 102 | To look up a key, for example /abc/: 103 | #+BEGIN_SRC 104 | slob find wordnet-3.0.slob abc 105 | #+END_SRC 106 | 107 | The output should look something like 108 | #+BEGIN_SRC 109 | 465 text/html; charset=utf-8 ABC 110 | 466 text/html; charset=utf-8 abcoulomb 111 | 472 text/html; charset=utf-8 ABC's 112 | 468 text/html; charset=utf-8 ABCs 113 | #+END_SRC 114 | 115 | The first column in the output is the blob id. It can be used to retrieve 116 | blob content (content bytes are written to stdout): 117 | #+BEGIN_SRC 118 | slob get wordnet-3.0.slob 465 119 | #+END_SRC 120 | 121 | To re-encode or re-compress slob content with different 122 | parameters: 123 | #+BEGIN_SRC 124 | slob convert -c lzma2 -b 256 simplewiki-20140209.zlib.384k.slob simplewiki-20140209.lzma2.256k.slob 125 | #+END_SRC 126 | 127 | To split into multiple slobs: 128 | 129 | #+BEGIN_SRC 130 | slob convert --split 4096 enwiki-20150406.slob enwiki-20150406-vol.slob 131 | #+END_SRC 132 | 133 | The output name /enwiki-20150406-vol.slob/ is the name of the 134 | directory where the resulting .slob files will be created. 135 | 136 | This is useful for crippled systems that can't use normal 137 | filesystems and have file size limits, such as SD cards on 138 | vanilla Android. Note that this command doesn't duplicate any 139 | content, so clients must search all these slobs when looking for 140 | shared resources such as stylesheets, fonts, javascript or 141 | images.
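For a rough picture of what that means for a reader, here is a small, purely illustrative sketch (not part of /slob.py/) of a client that opens every volume produced by /--split/ and consults each of them when resolving a key. The directory layout and the helper name are assumptions; the sketch only reuses the /slob.open/ and /as_dict/ calls shown in the examples below.

#+BEGIN_SRC python
import os
import slob

def find_in_volumes(volume_dir, key):
    """Yield matching content from every .slob volume in volume_dir.

    Hypothetical helper: volume_dir is assumed to be the directory
    created by ``slob convert --split``.
    """
    for name in sorted(os.listdir(volume_dir)):
        if not name.endswith('.slob'):
            continue
        with slob.open(os.path.join(volume_dir, name)) as r:
            for blob in r.as_dict()[key]:
                yield blob.content

# e.g. collect a shared stylesheet from whichever volume contains it
# (key name is just an example):
# css = list(find_in_volumes('enwiki-20150406-vol.slob', 'styles.css'))
#+END_SRC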
142 | 143 | 144 | ** Examples 145 | 146 | *** Basic Usage 147 | 148 | Create a slob: 149 | 150 | #+BEGIN_SRC python 151 | import slob 152 | with slob.create('test.slob') as w: 153 | w.add(b'Hello A', 'a') 154 | w.add(b'Hello B', 'b') 155 | #+END_SRC 156 | 157 | Read content: 158 | 159 | #+BEGIN_SRC python 160 | import slob 161 | with slob.open('test.slob') as r: 162 | d = r.as_dict() 163 | for key in ('a', 'b'): 164 | result = next(d[key]) 165 | print(result.content) 166 | 167 | #+END_SRC 168 | 169 | will print 170 | 171 | #+BEGIN_SRC 172 | b'Hello A' 173 | b'Hello B' 174 | #+END_SRC 175 | 176 | 177 | The slob we created in this example certainly works, but it is not 178 | ideal: we neglected to specify a content type for the content we 179 | are adding. Let's consider a slightly more involved example: 180 | 181 | #+BEGIN_SRC python 182 | import slob 183 | PLAIN_TEXT = 'text/plain; charset=utf-8' 184 | with slob.create('test1.slob') as w: 185 | w.add('Hello, Earth!'.encode('utf-8'), 186 | 'earth', 'terra', content_type=PLAIN_TEXT) 187 | w.add_alias('земля', 'earth') 188 | w.add('Hello, Mars!'.encode('utf-8'), 'mars', 189 | content_type=PLAIN_TEXT) 190 | #+END_SRC 191 | 192 | Here we specify the MIME type of the content we are adding so that 193 | consumers of this content can display or process it 194 | properly. Note that the same content may be associated with 195 | multiple keys, either when it is added or later with /add_alias/. 196 | 197 | This 198 | 199 | #+BEGIN_SRC python 200 | with slob.open('test1.slob') as r: 201 | 202 | def p(blob): 203 | print(blob.id, blob.content_type, blob.content) 204 | 205 | for key in ('earth', 'земля', 'terra'): 206 | blob = next(r.as_dict()[key]) 207 | p(blob) 208 | 209 | p(next(r.as_dict()['mars'])) 210 | 211 | #+END_SRC 212 | 213 | will print 214 | 215 | #+BEGIN_SRC 216 | 0 text/plain; charset=utf-8 b'Hello, Earth!' 217 | 0 text/plain; charset=utf-8 b'Hello, Earth!' 218 | 0 text/plain; charset=utf-8 b'Hello, Earth!' 219 | 1 text/plain; charset=utf-8 b'Hello, Mars!' 220 | #+END_SRC 221 | 222 | Note that the blob id for the first three keys is the same: they all 223 | point to the same content item. 224 | 225 | Take a look at the tests in /slob.py/ for more examples.
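One more detail is worth illustrating: lookups are collation-aware. Because the reader compares keys by their UCA sort key, a query can match keys that differ only in case or diacritics. The snippet below is a sketch that assumes /as_dict/ accepts a collation strength argument and uses the /PRIMARY/ strength constant defined in /slob.py/; adjust it to the actual reader API as needed.

#+BEGIN_SRC python
import slob

with slob.create('greetings.slob') as w:
    w.add('Hello, Zürich!'.encode('utf-8'), 'Zürich',
          content_type='text/plain; charset=utf-8')

with slob.open('greetings.slob') as r:
    # as_dict is assumed here to accept a collation strength.
    # At PRIMARY strength the Unicode Collation Algorithm ignores case
    # and diacritics, so the plain-ASCII query matches the key 'Zürich'.
    d = r.as_dict(slob.PRIMARY)
    for blob in d['zurich']:
        print(blob.key, blob.content)
#+END_SRC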
226 | 227 | 228 | *** Software and Dictionaries 229 | 230 | - [[https://github.com/itkach/slob/wiki/Dictionaries][Wikipedia, Wiktionary, WordNet, FreeDict and more]] 231 | - [[http://github.com/itkach/aard2-android/][aard2-android]] - dictionary for Android 232 | - [[https://github.com/farfromrefug/OSS-Dict][OSS-Dict]] - fork of Aard2 with new Material design and updated features 233 | - [[https://github.com/itkach/aard2-web][aard2-web]] - minimalistic Web UI (Java) 234 | - [[http://github.com/itkach/slobber/][slobber]] - Web API to look up content in slob dictionaries 235 | - [[http://github.com/itkach/slobby/][slobby]] - minimalistic Web UI (Python) 236 | - [[https://github.com/ilius/pyglossary][pyglossary]] - convert dictionaries in various formats, including slob 237 | - [[https://github.com/itkach/mw2slob][mw2slob]] - create slob dictionaries from Wikimedia Enterprise HTML Dumps or MediaWiki API 238 | - [[http://github.com/itkach/xdxf2slob/][xdxf2slob]] - create slob dictionaries from XDXF 239 | - [[https://github.com/itkach/tei2slob/][tei2slob]] - create slob dictionaries from TEI 240 | - [[http://github.com/itkach/wordnet2slob/][wordnet2slob]] - convert the WordNet database to a slob dictionary 241 | 242 | 243 | ** Slob File Format 244 | 245 | *** Slob 246 | 247 | | Element | Type | Description | 248 | |---------------+--------------------------------------+------------------------------------------------------------| 249 | | magic | fixed size sequence of 8 bytes | Bytes ~21 2d 31 53 4c 4f 42 1f~: the string ~!-1SLOB~ followed by the ASCII unit separator (hex code ~1f~), identifying the slob format | 250 | |---------------+--------------------------------------+------------------------------------------------------------| 251 | | uuid | fixed size sequence of 16 bytes | Unique slob identifier ([[https://tools.ietf.org/html/rfc4122][RFC 4122]] UUID) | 252 | |---------------+--------------------------------------+------------------------------------------------------------| 253 | | encoding | tiny text (utf8) | Name of the text encoding used for all other text elements: tag names and values, content types, keys, fragments | 254 | |---------------+--------------------------------------+------------------------------------------------------------| 255 | | compression | tiny text | Name of the compression algorithm used to compress storage bins. | 256 | | | | slob.py understands the following names: /bz2/ and /zlib/, which correspond to Python module names, and /lzma2/, which refers to raw LZMA2 compression with the LZMA2 filter (this is the default). | 257 | | | | An empty value means bins are not compressed. | 258 | |---------------+--------------------------------------+------------------------------------------------------------| 259 | | tags | char-sized sequence of tags | Tags are text key-value pairs that may provide additional information about the slob or its data.
| 260 | |---------------+--------------------------------------+------------------------------------------------------------| 261 | | content types | char-sized sequence of content types | MIME content types. Content items refer to content types by id. | 262 | | | | A content type id is the 0-based position of the content type in this sequence. | 263 | |---------------+--------------------------------------+------------------------------------------------------------| 264 | | blob count | int | Number of content items stored in the slob | 265 | |---------------+--------------------------------------+------------------------------------------------------------| 266 | | store offset | long | File position at which store data begins | 267 | |---------------+--------------------------------------+------------------------------------------------------------| 268 | | size | long | Total file byte size (or the sum of all files if the slob is split into multiple files) | 269 | |---------------+--------------------------------------+------------------------------------------------------------| 270 | | refs | list of long-positioned refs | References to content | 271 | |---------------+--------------------------------------+------------------------------------------------------------| 272 | | store | list of long-positioned store items | A store item contains the number of items stored, a content type id for each item and a storage bin with each item's content | 273 | 274 | 275 | 276 | *** tiny text 277 | 278 | char-sized sequence of encoded text bytes 279 | 280 | 281 | *** text 282 | 283 | short-sized sequence of encoded text bytes 284 | 285 | 286 | *** large byte string 287 | 288 | int-sized sequence of bytes 289 | 290 | *** /size type/-sized sequence of /items/ 291 | 292 | | Element | Type | 293 | |---------+---------------------------| 294 | | count | /size type/ | 295 | | items | sequence of /count/ items | 296 | 297 | 298 | *** tag 299 | 300 | | Element | Type | 301 | |---------+-----------------------------| 302 | | name | tiny text | 303 | | value | tiny text padded to maximum | 304 | | | length with null bytes | 305 | 306 | Tag values are tiny text of length 255, starting with encoded 307 | text bytes followed by null bytes. This allows modifying tag 308 | values without having to rewrite the whole slob. Null bytes 309 | must be stripped before decoding value text.
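To make the padding rule concrete, here is a small illustrative sketch (not code from /slob.py/) of writing and reading a tag value under these rules; the helper names are made up for the example.

#+BEGIN_SRC python
MAX_TINY_TEXT_LEN = 255  # maximum length of char-sized (tiny) text

def encode_tag_value(value, encoding='utf-8'):
    """Encode a tag value as tiny text padded to the maximum length."""
    raw = value.encode(encoding)
    if len(raw) > MAX_TINY_TEXT_LEN:
        raise ValueError('tag value too long')
    # Always write the full 255 bytes so the value can later be
    # overwritten in place without moving any other data in the file.
    padded = raw + b'\x00' * (MAX_TINY_TEXT_LEN - len(raw))
    return bytes([MAX_TINY_TEXT_LEN]) + padded

def decode_tag_value(data, encoding='utf-8'):
    """Read a length-prefixed tag value, stripping the null padding."""
    length = data[0]
    raw = data[1:1 + length]
    terminator = raw.find(b'\x00')
    if terminator > -1:
        raw = raw[:terminator]
    return raw.decode(encoding)

assert decode_tag_value(encode_tag_value('A Fine Dictionary')) == 'A Fine Dictionary'
#+END_SRC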
310 | 311 | *** content type 312 | 313 | text 314 | 315 | 316 | *** ref 317 | 318 | | Element | Type | Description | 319 | |------------+-----------+-------------------------------------------------------| 320 | | key | text | Text key associated with content | 321 | | bin index | int | Index of compressed bin containing content | 322 | | item index | short | Index of content item inside uncompressed bin | 323 | | fragment | tiny text | Text identifier of a specific location inside content | 324 | 325 | 326 | *** store item 327 | | Element | Type | Description | 328 | |------------------+---------------------------------------------------------+---------------------------------------------------| 329 | | content type ids | int-sized sequence of bytes | Each byte is a char representing content type id. | 330 | | storage bin | list of int-positioned large byte strings without count | Content | 331 | 332 | The storage bin doesn't include the leading int that would represent the item 333 | count: the item count equals the length of content type ids. Items in the 334 | storage bin are large byte strings - the actual content bytes. 335 | 336 | *** list of /position type/-positioned /items/ 337 | 338 | | Element | Type | Description | 339 | |-----------+-------------------------------------------------------------+-----------------------------------------------------------------------------------------------------| 340 | | positions | int-sized sequence of item offsets of type /position type/. | Item offset specifies the position in the file where item data starts, relative to the end of the position data | 341 | | items | sequence of /items/ | | 342 | 343 | *** char 344 | unsigned char (1 byte) 345 | 346 | *** short 347 | big endian unsigned short (2 bytes) 348 | 349 | *** int 350 | big endian unsigned int (4 bytes) 351 | 352 | *** long 353 | big endian unsigned long long (8 bytes) 354 | 355 | 356 | ** Design Considerations 357 | 358 | The slob format design is influenced by the old Aard Dictionary's [[https://github.com/aarddict/tools/blob/master/doc/aardformat.rst][aard]] and [[http://openzim.org/][ZIM]] 359 | file formats. Similar to Aard Dictionary, it allows performing 360 | non-exact lookups based on UCA's notion of collation 361 | strength. Similar to ZIM, it groups and compresses multiple 362 | content items to achieve a high compression ratio and can combine 363 | several physical files into one logical container. Both aard and 364 | ZIM contain vestigial elements of predecessor formats as well 365 | as elements specific to a particular use case (such as 366 | implementing offline Wikipedia content access). Slob aims to 367 | provide a minimal framework for building such applications 368 | while remaining a simple, generic, read-only data store. 369 | 370 | *** No Format Version 371 | The slob header doesn't contain an explicit file format version 372 | number. Any incompatible changes to the format 373 | should be introduced in a new file format which will get its own 374 | identifying magic bytes. 375 | 376 | *** No Content Checksum 377 | Unlike the aard and ZIM file formats, slob doesn't contain a 378 | content checksum. File integrity can be easily verified by 379 | employing standard tools to calculate a content hash. Including a 380 | pre-calculated hash in the file itself prevents using most 381 | standard tools and puts the burden of implementing hash calculation 382 | on every slob reader implementation.
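A published slob can instead be accompanied by an ordinary checksum file and verified with standard tooling. As a minimal sketch, the hash can be computed with Python's /hashlib/ (the file name here is just an example):

#+BEGIN_SRC python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against a published checksum, e.g. the contents of my.slob.sha256:
print(file_sha256('my.slob'))
#+END_SRC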
383 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | setup(name='Slob', 4 | version='1.0.2', 5 | description='Read-only compressed data store', 6 | author='Igor Tkach', 7 | author_email='itkach@gmail.com', 8 | url='http://github.com/itkach/slob', 9 | license='GPL3', 10 | py_modules=['slob'], 11 | install_requires=['PyICU >= 1.5'], 12 | entry_points={'console_scripts': ['slob=slob:main']}) 13 | -------------------------------------------------------------------------------- /slob.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=C0111,C0103,C0302,R0903,R0904,R0914,R0201 2 | import argparse 3 | import array 4 | import collections 5 | import encodings 6 | import functools 7 | import io 8 | import os 9 | import pickle 10 | import random 11 | import sys 12 | import tempfile 13 | import unicodedata 14 | import unittest 15 | import warnings 16 | 17 | from abc import abstractmethod 18 | from bisect import bisect_left 19 | from builtins import open as fopen 20 | from collections import namedtuple 21 | from collections.abc import Sequence 22 | from datetime import datetime, timezone 23 | from functools import lru_cache 24 | from struct import pack, unpack, calcsize 25 | from threading import RLock 26 | from types import MappingProxyType 27 | from uuid import uuid4, UUID 28 | 29 | import icu 30 | 31 | DEFAULT_COMPRESSION = "lzma2" 32 | 33 | UTF8 = "utf-8" 34 | MAGIC = b"!-1SLOB\x1F" 35 | 36 | Compression = namedtuple("Compression", "compress decompress") 37 | 38 | Ref = namedtuple("Ref", "key bin_index item_index fragment") 39 | 40 | Header = namedtuple( 41 | "Header", 42 | "magic uuid encoding " 43 | "compression tags content_types " 44 | "blob_count " 45 | "store_offset " 46 | "refs_offset " 47 | "size", 48 | ) 49 | 50 | U_CHAR = ">B" 51 | U_CHAR_SIZE = calcsize(U_CHAR) 52 | U_SHORT = ">H" 53 | U_SHORT_SIZE = calcsize(U_SHORT) 54 | U_INT = ">I" 55 | U_INT_SIZE = calcsize(U_INT) 56 | U_LONG_LONG = ">Q" 57 | U_LONG_LONG_SIZE = calcsize(U_LONG_LONG) 58 | 59 | 60 | def calcmax(len_size_spec): 61 | return 2 ** (calcsize(len_size_spec) * 8) - 1 62 | 63 | 64 | MAX_TEXT_LEN = calcmax(U_SHORT) 65 | MAX_TINY_TEXT_LEN = calcmax(U_CHAR) 66 | MAX_LARGE_BYTE_STRING_LEN = calcmax(U_INT) 67 | MAX_BIN_ITEM_COUNT = calcmax(U_SHORT) 68 | 69 | from icu import Locale, Collator, UCollAttribute, UCollAttributeValue 70 | 71 | PRIMARY = Collator.PRIMARY 72 | SECONDARY = Collator.SECONDARY 73 | TERTIARY = Collator.TERTIARY 74 | QUATERNARY = Collator.QUATERNARY 75 | IDENTICAL = Collator.IDENTICAL 76 | 77 | 78 | def init_compressions(): 79 | ident = lambda x: x 80 | compressions = {"": Compression(ident, ident)} 81 | for name in ("bz2", "zlib"): 82 | try: 83 | m = __import__(name) 84 | except ImportError: 85 | warnings.warn("%s is not available" % name) 86 | else: 87 | compressions[name] = Compression(lambda x: m.compress(x, 9), m.decompress) 88 | 89 | try: 90 | import lzma 91 | except ImportError: 92 | warnings.warn("lzma is not available") 93 | else: 94 | filters = [{"id": lzma.FILTER_LZMA2}] 95 | compress = lambda s: lzma.compress(s, format=lzma.FORMAT_RAW, filters=filters) 96 | decompress = lambda s: lzma.decompress( 97 | s, format=lzma.FORMAT_RAW, filters=filters 98 | ) 99 | compressions["lzma2"] = Compression(compress, decompress) 100 | return compressions 101 | 102 | 103 | COMPRESSIONS = 
init_compressions() 104 | 105 | 106 | del init_compressions 107 | 108 | 109 | MIME_TEXT = "text/plain" 110 | MIME_HTML = "text/html" 111 | MIME_CSS = "text/css" 112 | MIME_JS = "application/javascript" 113 | 114 | MIME_TYPES = { 115 | "html": MIME_HTML, 116 | "txt": MIME_TEXT, 117 | "js": MIME_JS, 118 | "css": MIME_CSS, 119 | "json": "application/json", 120 | "woff": "application/font-woff", 121 | "svg": "image/svg+xml", 122 | "png": "image/png", 123 | "jpg": "image/jpeg", 124 | "jpeg": "image/jpeg", 125 | "gif": "image/gif", 126 | "ttf": "application/x-font-ttf", 127 | "otf": "application/x-font-opentype", 128 | } 129 | 130 | 131 | class FileFormatException(Exception): 132 | pass 133 | 134 | 135 | class UnknownFileFormat(FileFormatException): 136 | pass 137 | 138 | 139 | class UnknownCompression(FileFormatException): 140 | pass 141 | 142 | 143 | class UnknownEncoding(FileFormatException): 144 | pass 145 | 146 | 147 | class IncorrectFileSize(FileFormatException): 148 | pass 149 | 150 | 151 | class TagNotFound(Exception): 152 | pass 153 | 154 | 155 | @lru_cache(maxsize=None) 156 | def sortkey(strength, maxlength=None): 157 | c = Collator.createInstance(Locale("")) 158 | c.setStrength(strength) 159 | c.setAttribute(UCollAttribute.ALTERNATE_HANDLING, UCollAttributeValue.SHIFTED) 160 | if maxlength is None: 161 | return c.getSortKey 162 | else: 163 | return lambda x: c.getSortKey(x)[:maxlength] 164 | 165 | 166 | def sortkey_length(strength, word): 167 | c = Collator.createInstance(Locale("")) 168 | c.setStrength(strength) 169 | c.setAttribute(UCollAttribute.ALTERNATE_HANDLING, UCollAttributeValue.SHIFTED) 170 | coll_key = c.getSortKey(word) 171 | return len(coll_key) - 1 # subtract 1 for ending \x00 byte 172 | 173 | 174 | class MultiFileReader(io.BufferedIOBase): 175 | 176 | def __init__(self, *args): 177 | filenames = [] 178 | for arg in args: 179 | if isinstance(arg, str): 180 | filenames.append(arg) 181 | else: 182 | for name in arg: 183 | filenames.append(name) 184 | files = [] 185 | ranges = [] 186 | offset = 0 187 | for name in filenames: 188 | size = os.stat(name).st_size 189 | ranges.append(range(offset, offset + size)) 190 | files.append(fopen(name, "rb")) 191 | offset += size 192 | self.size = offset 193 | self._ranges = ranges 194 | self._files = files 195 | self._fcount = len(self._files) 196 | self._offset = -1 197 | self.seek(0) 198 | 199 | def __enter__(self): 200 | return self 201 | 202 | def __exit__(self, exc_type, exc_val, exc_tb): 203 | self.close() 204 | return False 205 | 206 | def close(self): 207 | for f in self._files: 208 | f.close() 209 | self._files.clear() 210 | self._ranges.clear() 211 | 212 | def closed(self): 213 | return len(self._ranges) == 0 214 | 215 | def isatty(self): 216 | return False 217 | 218 | def readable(self): 219 | return True 220 | 221 | def seek(self, offset, whence=io.SEEK_SET): 222 | if whence == io.SEEK_SET: 223 | self._offset = offset 224 | elif whence == io.SEEK_CUR: 225 | self._offset = self._offset + offset 226 | elif whence == io.SEEK_END: 227 | self._offset = self.size + offset 228 | else: 229 | raise ValueError("Invalid value for parameter whence: %r" % whence) 230 | return self._offset 231 | 232 | def seekable(self): 233 | return True 234 | 235 | def tell(self): 236 | return self._offset 237 | 238 | def writable(self): 239 | return False 240 | 241 | def read(self, n=-1): 242 | file_index = -1 243 | actual_offset = 0 244 | for i, r in enumerate(self._ranges): 245 | if self._offset in r: 246 | file_index = i 247 | actual_offset = 
self._offset - r.start 248 | break 249 | result = b"" 250 | if n == -1 or n is None: 251 | to_read = self.size 252 | else: 253 | to_read = n 254 | while -1 < file_index < self._fcount: 255 | f = self._files[file_index] 256 | f.seek(actual_offset) 257 | read = f.read(to_read) 258 | read_count = len(read) 259 | self._offset += read_count 260 | result += read 261 | to_read -= read_count 262 | if to_read > 0: 263 | file_index += 1 264 | actual_offset = 0 265 | else: 266 | break 267 | return result 268 | 269 | 270 | class CollationKeyList(object): 271 | 272 | def __init__(self, lst, sortkey_): 273 | self.lst = lst 274 | self.sortkey = sortkey_ 275 | 276 | def __len__(self): 277 | return len(self.lst) 278 | 279 | def __getitem__(self, i): 280 | return self.sortkey(self.lst[i].key) 281 | 282 | 283 | class KeydItemDict(object): 284 | 285 | def __init__(self, lst, strength, maxlength=None): 286 | self.lst = lst 287 | self.sortkey = sortkey(strength, maxlength=maxlength) 288 | self.sortkeylist = CollationKeyList(lst, self.sortkey) 289 | 290 | def __len__(self): 291 | return len(self.lst) 292 | 293 | def __getitem__(self, key): 294 | key_as_sk = self.sortkey(key) 295 | i = bisect_left(self.sortkeylist, key_as_sk) 296 | if i != len(self.lst): 297 | while i < len(self.lst): 298 | if self.sortkey(self.lst[i].key) == key_as_sk: 299 | yield self.lst[i] 300 | else: 301 | break 302 | i += 1 303 | 304 | def __contains__(self, key): 305 | try: 306 | next(self[key]) 307 | except StopIteration: 308 | return False 309 | else: 310 | return True 311 | 312 | 313 | class Blob(object): 314 | 315 | def __init__(self, content_id, key, fragment, read_content_type_func, read_func): 316 | self._content_id = content_id 317 | self._key = key 318 | self._fragment = fragment 319 | self._read_content_type = read_content_type_func 320 | self._read = read_func 321 | 322 | @property 323 | def id(self): 324 | return self._content_id 325 | 326 | @property 327 | def key(self): 328 | return self._key 329 | 330 | @property 331 | def fragment(self): 332 | return self._fragment 333 | 334 | @property 335 | def content_type(self): 336 | return self._read_content_type() 337 | 338 | @property 339 | def content(self): 340 | return self._read() 341 | 342 | def __str__(self): 343 | return self.key 344 | 345 | def __repr__(self): 346 | return "<{0.__class__.__module__}.{0.__class__.__name__} " "{0.key}>".format( 347 | self 348 | ) 349 | 350 | 351 | def read_byte_string(f, len_spec): 352 | length = unpack(len_spec, f.read(calcsize(len_spec)))[0] 353 | return f.read(length) 354 | 355 | 356 | class StructReader: 357 | 358 | def __init__(self, file_, encoding=None): 359 | self._file = file_ 360 | self.encoding = encoding 361 | 362 | def read_int(self): 363 | s = self.read(U_INT_SIZE) 364 | return unpack(U_INT, s)[0] 365 | 366 | def read_long(self): 367 | b = self.read(U_LONG_LONG_SIZE) 368 | return unpack(U_LONG_LONG, b)[0] 369 | 370 | def read_byte(self): 371 | s = self.read(U_CHAR_SIZE) 372 | return unpack(U_CHAR, s)[0] 373 | 374 | def read_short(self): 375 | return unpack(U_SHORT, self._file.read(U_SHORT_SIZE))[0] 376 | 377 | def _read_text(self, len_spec): 378 | max_len = 2 ** (8 * calcsize(len_spec)) - 1 379 | byte_string = read_byte_string(self._file, len_spec) 380 | if len(byte_string) == max_len: 381 | terminator = byte_string.find(0) 382 | if terminator > -1: 383 | byte_string = byte_string[:terminator] 384 | return byte_string.decode(self.encoding) 385 | 386 | def read_tiny_text(self): 387 | return self._read_text(U_CHAR) 388 | 389 | def 
read_text(self): 390 | return self._read_text(U_SHORT) 391 | 392 | def __getattr__(self, name): 393 | return getattr(self._file, name) 394 | 395 | 396 | class StructWriter: 397 | 398 | def __init__(self, file_, encoding=None): 399 | self._file = file_ 400 | self.encoding = encoding 401 | 402 | def write_int(self, value): 403 | self._file.write(pack(U_INT, value)) 404 | 405 | def write_long(self, value): 406 | self._file.write(pack(U_LONG_LONG, value)) 407 | 408 | def write_byte(self, value): 409 | self._file.write(pack(U_CHAR, value)) 410 | 411 | def write_short(self, value): 412 | self._file.write(pack(U_SHORT, value)) 413 | 414 | def _write_text(self, text, len_size_spec, encoding=None, pad_to_length=None): 415 | if encoding is None: 416 | encoding = self.encoding 417 | text_bytes = text.encode(encoding) 418 | length = len(text_bytes) 419 | max_length = calcmax(len_size_spec) 420 | if length > max_length: 421 | raise ValueError("Text is too long for size spec %s" % len_size_spec) 422 | self._file.write( 423 | pack(len_size_spec, pad_to_length if pad_to_length else length) 424 | ) 425 | self._file.write(text_bytes) 426 | if pad_to_length: 427 | for _ in range(pad_to_length - length): 428 | self._file.write(pack(U_CHAR, 0)) 429 | 430 | def write_tiny_text(self, text, encoding=None, editable=False): 431 | pad_to_length = 255 if editable else None 432 | self._write_text(text, U_CHAR, encoding=encoding, pad_to_length=pad_to_length) 433 | 434 | def write_text(self, text, encoding=None): 435 | self._write_text(text, U_SHORT, encoding=encoding) 436 | 437 | def __getattr__(self, name): 438 | return getattr(self._file, name) 439 | 440 | 441 | def set_tag_value(filename, name, value): 442 | with fopen(filename, "rb+") as f: 443 | f.seek(len(MAGIC) + 16) 444 | encoding = read_byte_string(f, U_CHAR).decode(UTF8) 445 | if encodings.search_function(encoding) is None: 446 | raise UnknownEncoding(encoding) 447 | f = StructWriter(StructReader(f, encoding=encoding), encoding=encoding) 448 | f.read_tiny_text() 449 | tag_count = f.read_byte() 450 | for _ in range(tag_count): 451 | key = f.read_tiny_text() 452 | if key == name: 453 | f.write_tiny_text(value, editable=True) 454 | return 455 | f.read_tiny_text() 456 | raise TagNotFound(name) 457 | 458 | 459 | def read_header(f): 460 | f.seek(0) 461 | 462 | magic = f.read(len(MAGIC)) 463 | if magic != MAGIC: 464 | raise UnknownFileFormat() 465 | 466 | uuid = UUID(bytes=f.read(16)) 467 | encoding = read_byte_string(f, U_CHAR).decode(UTF8) 468 | if encodings.search_function(encoding) is None: 469 | raise UnknownEncoding(encoding) 470 | 471 | f = StructReader(f, encoding) 472 | compression = f.read_tiny_text() 473 | if not compression in COMPRESSIONS: 474 | raise UnknownCompression(compression) 475 | 476 | def read_tags(): 477 | tags = {} 478 | count = f.read_byte() 479 | for _ in range(count): 480 | key = f.read_tiny_text() 481 | value = f.read_tiny_text() 482 | tags[key] = value 483 | return tags 484 | 485 | tags = read_tags() 486 | 487 | def read_content_types(): 488 | content_types = [] 489 | count = f.read_byte() 490 | for _ in range(count): 491 | content_type = f.read_text() 492 | content_types.append(content_type) 493 | return tuple(content_types) 494 | 495 | content_types = read_content_types() 496 | 497 | blob_count = f.read_int() 498 | store_offset = f.read_long() 499 | size = f.read_long() 500 | refs_offset = f.tell() 501 | 502 | return Header( 503 | magic=magic, 504 | uuid=uuid, 505 | encoding=encoding, 506 | compression=compression, 507 | 
tags=MappingProxyType(tags), 508 | content_types=content_types, 509 | blob_count=blob_count, 510 | store_offset=store_offset, 511 | refs_offset=refs_offset, 512 | size=size, 513 | ) 514 | 515 | 516 | def meld_ints(a, b): 517 | return (a << 16) | b 518 | 519 | 520 | def unmeld_ints(c): 521 | bstr = bin(c).lstrip("0b").zfill(48) 522 | a, b = bstr[-48:-16], bstr[-16:] 523 | return int(a, 2), int(b, 2) 524 | 525 | 526 | class Slob(Sequence): 527 | 528 | def __init__(self, file_or_filenames): 529 | self._f = MultiFileReader(file_or_filenames) 530 | 531 | try: 532 | self._header = read_header(self._f) 533 | if self._f.size != self._header.size: 534 | raise IncorrectFileSize( 535 | "File size should be {0}, {1} bytes found".format( 536 | self._header.size, self._f.size 537 | ) 538 | ) 539 | except FileFormatException: 540 | self._f.close() 541 | raise 542 | 543 | self._refs = RefList( 544 | self._f, self._header.encoding, offset=self._header.refs_offset 545 | ) 546 | 547 | self._g = MultiFileReader(file_or_filenames) 548 | self._store = Store( 549 | self._g, 550 | self._header.store_offset, 551 | COMPRESSIONS[self._header.compression].decompress, 552 | self._header.content_types, 553 | ) 554 | 555 | def __enter__(self): 556 | return self 557 | 558 | def __exit__(self, exc_type, exc_val, exc_tb): 559 | self.close() 560 | return False 561 | 562 | @property 563 | def id(self): 564 | return self._header.uuid.hex 565 | 566 | @property 567 | def content_types(self): 568 | return self._header.content_types 569 | 570 | @property 571 | def tags(self): 572 | return self._header.tags 573 | 574 | @property 575 | def blob_count(self): 576 | return self._header.blob_count 577 | 578 | @property 579 | def compression(self): 580 | return self._header.compression 581 | 582 | @property 583 | def encoding(self): 584 | return self._header.encoding 585 | 586 | def __len__(self): 587 | return len(self._refs) 588 | 589 | def __getitem__(self, i): 590 | ref = self._refs[i] 591 | 592 | def read_func(): 593 | return self._store.get(ref.bin_index, ref.item_index)[1] 594 | 595 | read_func = lru_cache(maxsize=None)(read_func) 596 | 597 | def read_content_type_func(): 598 | return self._store.content_type(ref.bin_index, ref.item_index) 599 | 600 | content_id = meld_ints(ref.bin_index, ref.item_index) 601 | return Blob( 602 | content_id, ref.key, ref.fragment, read_content_type_func, read_func 603 | ) 604 | 605 | def get(self, blob_id): 606 | bin_index, bin_item_index = unmeld_ints(blob_id) 607 | return self._store.get(bin_index, bin_item_index) 608 | 609 | @lru_cache(maxsize=None) 610 | def as_dict(self, strength=TERTIARY, maxlength=None): 611 | return KeydItemDict(self, strength, maxlength=maxlength) 612 | 613 | def close(self): 614 | self._f.close() 615 | self._g.close() 616 | 617 | 618 | def find_parts(fname): 619 | fname = os.path.expanduser(fname) 620 | dirname = os.path.dirname(fname) or os.getcwd() 621 | basename = os.path.basename(fname) 622 | candidates = [] 623 | for name in os.listdir(dirname): 624 | if name.startswith(basename): 625 | candidates.append(os.path.join(dirname, name)) 626 | return sorted(candidates) 627 | 628 | 629 | def open(file_or_filenames): 630 | if isinstance(file_or_filenames, str): 631 | if not os.path.exists(file_or_filenames): 632 | file_or_filenames = find_parts(file_or_filenames) 633 | return Slob(file_or_filenames) 634 | 635 | 636 | def create(*args, **kwargs): 637 | return Writer(*args, **kwargs) 638 | 639 | 640 | class BinMemWriter: 641 | 642 | def __init__(self): 643 | 
self.content_type_ids = [] 644 | self.item_dir = [] 645 | self.items = [] 646 | self.current_offset = 0 647 | 648 | def add(self, content_type_id, blob): 649 | self.content_type_ids.append(content_type_id) 650 | self.item_dir.append(pack(U_INT, self.current_offset)) 651 | length_and_bytes = pack(U_INT, len(blob)) + blob 652 | self.items.append(length_and_bytes) 653 | self.current_offset += len(length_and_bytes) 654 | 655 | def __len__(self): 656 | return len(self.item_dir) 657 | 658 | def finalize(self, fout: "output file", compress: "function"): 659 | count = len(self) 660 | fout.write(pack(U_INT, count)) 661 | for content_type_id in self.content_type_ids: 662 | fout.write(pack(U_CHAR, content_type_id)) 663 | content = b"".join(self.item_dir + self.items) 664 | compressed = compress(content) 665 | fout.write(pack(U_INT, len(compressed))) 666 | fout.write(compressed) 667 | self.content_type_ids.clear() 668 | self.item_dir.clear() 669 | self.items.clear() 670 | 671 | 672 | class ItemList(Sequence): 673 | 674 | def __init__(self, file_, offset, count_or_spec, pos_spec, cache_size=None): 675 | self.lock = RLock() 676 | self._file = file_ 677 | file_.seek(offset) 678 | if isinstance(count_or_spec, str): 679 | count_spec = count_or_spec 680 | self.count = unpack(count_spec, file_.read(calcsize(count_spec)))[0] 681 | else: 682 | self.count = count_or_spec 683 | self.pos_offset = file_.tell() 684 | self.pos_spec = pos_spec 685 | self.pos_size = calcsize(pos_spec) 686 | self.data_offset = self.pos_offset + self.pos_size * self.count 687 | if cache_size: 688 | self.__getitem__ = lru_cache(maxsize=cache_size)(self.__getitem__) 689 | 690 | def __len__(self): 691 | return self.count 692 | 693 | def pos(self, i): 694 | with self.lock: 695 | self._file.seek(self.pos_offset + self.pos_size * i) 696 | return unpack(self.pos_spec, self._file.read(self.pos_size))[0] 697 | 698 | def read(self, pos): 699 | with self.lock: 700 | self._file.seek(self.data_offset + pos) 701 | return self._read_item() 702 | 703 | @abstractmethod 704 | def _read_item(self): 705 | pass 706 | 707 | def __getitem__(self, i): 708 | if i >= len(self) or i < 0: 709 | raise IndexError("index out of range") 710 | return self.read(self.pos(i)) 711 | 712 | 713 | class RefList(ItemList): 714 | 715 | def __init__(self, f, encoding, offset=0, count=None): 716 | super().__init__( 717 | StructReader(f, encoding), 718 | offset, 719 | U_INT if count is None else count, 720 | U_LONG_LONG, 721 | cache_size=512, 722 | ) 723 | 724 | def _read_item(self): 725 | key = self._file.read_text() 726 | bin_index = self._file.read_int() 727 | item_index = self._file.read_short() 728 | fragment = self._file.read_tiny_text() 729 | return Ref( 730 | key=key, bin_index=bin_index, item_index=item_index, fragment=fragment 731 | ) 732 | 733 | @lru_cache(maxsize=None) 734 | def as_dict(self, strength=TERTIARY, maxlength=None): 735 | return KeydItemDict(self, strength, maxlength=maxlength) 736 | 737 | 738 | class Bin(ItemList): 739 | 740 | def __init__(self, count, bin_bytes): 741 | super().__init__(StructReader(io.BytesIO(bin_bytes)), 0, count, U_INT) 742 | 743 | def _read_item(self): 744 | content_len = self._file.read_int() 745 | content = self._file.read(content_len) 746 | return content 747 | 748 | 749 | StoreItem = namedtuple("StoreItem", "content_type_ids compressed_content") 750 | 751 | 752 | class Store(ItemList): 753 | 754 | def __init__(self, file_, offset, decompress, content_types): 755 | super().__init__(StructReader(file_), offset, U_INT, U_LONG_LONG, 
cache_size=32) 756 | self.decompress = decompress 757 | self.content_types = content_types 758 | 759 | def _read_item(self): 760 | bin_item_count = self._file.read_int() 761 | packed_content_type_ids = self._file.read(bin_item_count * U_CHAR_SIZE) 762 | content_type_ids = [] 763 | for i in range(bin_item_count): 764 | content_type_id = unpack(U_CHAR, packed_content_type_ids[i : i + 1])[0] 765 | content_type_ids.append(content_type_id) 766 | content_length = self._file.read_int() 767 | content = self._file.read(content_length) 768 | return StoreItem(content_type_ids=content_type_ids, compressed_content=content) 769 | 770 | def _content_type(self, bin_index, item_index): 771 | store_item = self[bin_index] 772 | content_type_id = store_item.content_type_ids[item_index] 773 | content_type = self.content_types[content_type_id] 774 | return content_type, store_item 775 | 776 | def content_type(self, bin_index, item_index): 777 | return self._content_type(bin_index, item_index)[0] 778 | 779 | @lru_cache(maxsize=16) 780 | def _decompress(self, bin_index): 781 | store_item = self[bin_index] 782 | return self.decompress(store_item.compressed_content) 783 | 784 | def get(self, bin_index, item_index): 785 | content_type, store_item = self._content_type(bin_index, item_index) 786 | content = self._decompress(bin_index) 787 | count = len(store_item.content_type_ids) 788 | store_bin = Bin(count, content) 789 | content = store_bin[item_index] 790 | return (content_type, content) 791 | 792 | 793 | def find(word, slobs, match_prefix=True): 794 | seen = set() 795 | if isinstance(slobs, Slob): 796 | slobs = [slobs] 797 | 798 | variants = [] 799 | 800 | for strength in (QUATERNARY, TERTIARY, SECONDARY, PRIMARY): 801 | variants.append((strength, None)) 802 | 803 | if match_prefix: 804 | for strength in (QUATERNARY, TERTIARY, SECONDARY, PRIMARY): 805 | variants.append((strength, sortkey_length(strength, word))) 806 | 807 | for strength, maxlength in variants: 808 | for slob in slobs: 809 | d = slob.as_dict(strength=strength, maxlength=maxlength) 810 | for item in d[word]: 811 | dedup_key = (slob.id, item.id, item.fragment) 812 | if dedup_key in seen: 813 | continue 814 | else: 815 | seen.add(dedup_key) 816 | yield slob, item 817 | 818 | 819 | WriterEvent = namedtuple("WriterEvent", "name data") 820 | 821 | 822 | class KeyTooLongException(Exception): 823 | 824 | @property 825 | def key(self): 826 | return self.args[0] 827 | 828 | 829 | class Writer(object): 830 | 831 | def __init__( 832 | self, 833 | filename, 834 | workdir=None, 835 | encoding=UTF8, 836 | compression=DEFAULT_COMPRESSION, 837 | min_bin_size=512 * 1024, 838 | max_redirects=5, 839 | observer=None, 840 | ): 841 | self.filename = filename 842 | self.observer = observer 843 | if os.path.exists(self.filename): 844 | raise SystemExit("File %r already exists" % self.filename) 845 | 846 | # make sure we can write 847 | with fopen(self.filename, "wb"): 848 | pass 849 | 850 | self.encoding = encoding 851 | 852 | if encodings.search_function(self.encoding) is None: 853 | raise UnknownEncoding(self.encoding) 854 | 855 | self.workdir = workdir 856 | 857 | self.tmpdir = tmpdir = tempfile.TemporaryDirectory( 858 | prefix="{0}-".format(os.path.basename(filename)), dir=workdir 859 | ) 860 | 861 | self.f_ref_positions = self._wbfopen("ref-positions") 862 | self.f_store_positions = self._wbfopen("store-positions") 863 | self.f_refs = self._wbfopen("refs") 864 | self.f_store = self._wbfopen("store") 865 | 866 | self.max_redirects = max_redirects 867 | if 
max_redirects: 868 | self.aliases_path = os.path.join(tmpdir.name, "aliases") 869 | self.f_aliases = Writer( 870 | self.aliases_path, 871 | workdir=tmpdir.name, 872 | max_redirects=0, 873 | compression=None, 874 | ) 875 | 876 | if compression is None: 877 | compression = "" 878 | if not compression in COMPRESSIONS: 879 | raise UnknownCompression(compression) 880 | else: 881 | self.compress = COMPRESSIONS[compression].compress 882 | 883 | self.compression = compression 884 | self.content_types = {} 885 | 886 | self.min_bin_size = min_bin_size 887 | 888 | self.current_bin = None 889 | 890 | self.blob_count = 0 891 | self.ref_count = 0 892 | self.bin_count = 0 893 | self._tags = { 894 | "version.python": sys.version.replace("\n", " "), 895 | "version.pyicu": icu.VERSION, 896 | "version.icu": icu.ICU_VERSION, 897 | "created.at": datetime.now(timezone.utc).isoformat(), 898 | } 899 | self.tags = MappingProxyType(self._tags) 900 | 901 | def _wbfopen(self, name): 902 | return StructWriter( 903 | fopen(os.path.join(self.tmpdir.name, name), "wb"), encoding=self.encoding 904 | ) 905 | 906 | def tag(self, name, value=""): 907 | if len(name.encode(self.encoding)) > MAX_TINY_TEXT_LEN: 908 | self._fire_event("tag_name_too_long", (name, value)) 909 | return 910 | 911 | if len(value.encode(self.encoding)) > MAX_TINY_TEXT_LEN: 912 | self._fire_event("tag_value_too_long", (name, value)) 913 | value = "" 914 | 915 | self._tags[name] = value 916 | 917 | def _split_key(self, key): 918 | if isinstance(key, str): 919 | actual_key = key 920 | fragment = "" 921 | else: 922 | actual_key, fragment = key 923 | if len(actual_key) > MAX_TEXT_LEN or len(fragment) > MAX_TINY_TEXT_LEN: 924 | raise KeyTooLongException(key) 925 | return actual_key, fragment 926 | 927 | def add(self, blob, *keys, content_type=""): 928 | 929 | if len(blob) > MAX_LARGE_BYTE_STRING_LEN: 930 | self._fire_event("content_too_long", blob) 931 | return 932 | 933 | if len(content_type) > MAX_TEXT_LEN: 934 | self._fire_event("content_type_too_long", content_type) 935 | return 936 | 937 | actual_keys = [] 938 | 939 | for key in keys: 940 | try: 941 | actual_key, fragment = self._split_key(key) 942 | except KeyTooLongException as e: 943 | self._fire_event("key_too_long", e.key) 944 | else: 945 | actual_keys.append((actual_key, fragment)) 946 | 947 | if len(actual_keys) == 0: 948 | return 949 | 950 | if self.current_bin is None: 951 | self.current_bin = BinMemWriter() 952 | self.bin_count += 1 953 | 954 | if content_type not in self.content_types: 955 | self.content_types[content_type] = len(self.content_types) 956 | 957 | self.current_bin.add(self.content_types[content_type], blob) 958 | self.blob_count += 1 959 | bin_item_index = len(self.current_bin) - 1 960 | bin_index = self.bin_count - 1 961 | 962 | for actual_key, fragment in actual_keys: 963 | self._write_ref(actual_key, bin_index, bin_item_index, fragment) 964 | 965 | if ( 966 | self.current_bin.current_offset > self.min_bin_size 967 | or len(self.current_bin) == MAX_BIN_ITEM_COUNT 968 | ): 969 | self._write_current_bin() 970 | 971 | def add_alias(self, key, target_key): 972 | if self.max_redirects: 973 | try: 974 | self._split_key(key) 975 | except KeyTooLongException as e: 976 | self._fire_event("alias_too_long", e.key) 977 | return 978 | try: 979 | self._split_key(target_key) 980 | except KeyTooLongException as e: 981 | self._fire_event("alias_target_too_long", e.key) 982 | return 983 | self.f_aliases.add(pickle.dumps(target_key), key) 984 | else: 985 | raise NotImplementedError() 986 | 987 | 
def _fire_event(self, name, data=None): 988 | if self.observer: 989 | self.observer(WriterEvent(name, data)) 990 | 991 | def _write_current_bin(self): 992 | self.f_store_positions.write_long(self.f_store.tell()) 993 | self.current_bin.finalize(self.f_store, self.compress) 994 | self.current_bin = None 995 | 996 | def _write_ref(self, key, bin_index, item_index, fragment=""): 997 | self.f_ref_positions.write_long(self.f_refs.tell()) 998 | self.f_refs.write_text(key) 999 | self.f_refs.write_int(bin_index) 1000 | self.f_refs.write_short(item_index) 1001 | self.f_refs.write_tiny_text(fragment) 1002 | self.ref_count += 1 1003 | 1004 | def _sort(self): 1005 | self._fire_event("begin_sort") 1006 | f_ref_positions_sorted = self._wbfopen("ref-positions-sorted") 1007 | self.f_refs.flush() 1008 | self.f_ref_positions.close() 1009 | with MultiFileReader(self.f_ref_positions.name, self.f_refs.name) as f: 1010 | ref_list = RefList(f, self.encoding, count=self.ref_count) 1011 | sortkey_func = sortkey(IDENTICAL) 1012 | for i in sorted( 1013 | range(len(ref_list)), key=lambda j: sortkey_func(ref_list[j].key) 1014 | ): 1015 | ref_pos = ref_list.pos(i) 1016 | f_ref_positions_sorted.write_long(ref_pos) 1017 | f_ref_positions_sorted.close() 1018 | os.remove(self.f_ref_positions.name) 1019 | os.rename(f_ref_positions_sorted.name, self.f_ref_positions.name) 1020 | self.f_ref_positions = StructWriter( 1021 | fopen(self.f_ref_positions.name, "ab"), encoding=self.encoding 1022 | ) 1023 | self._fire_event("end_sort") 1024 | 1025 | def _resolve_aliases(self): 1026 | self._fire_event("begin_resolve_aliases") 1027 | self.f_aliases.finalize() 1028 | with MultiFileReader(self.f_ref_positions.name, self.f_refs.name) as f_ref_list: 1029 | ref_list = RefList(f_ref_list, self.encoding, count=self.ref_count) 1030 | ref_dict = ref_list.as_dict() 1031 | with open(self.aliases_path) as r: 1032 | aliases = r.as_dict() 1033 | path = os.path.join(self.tmpdir.name, "resolved-aliases") 1034 | with create( 1035 | path, workdir=self.tmpdir.name, max_redirects=0, compression=None 1036 | ) as alias_writer: 1037 | 1038 | def read_key_frag(item, default_fragment): 1039 | key_frag = pickle.loads(item.content) 1040 | if isinstance(key_frag, str): 1041 | return key_frag, default_fragment 1042 | else: 1043 | return key_frag 1044 | 1045 | for item in r: 1046 | from_key = item.key 1047 | keys = set() 1048 | keys.add(from_key) 1049 | to_key, fragment = read_key_frag(item, item.fragment) 1050 | count = 0 1051 | while count <= self.max_redirects: 1052 | # is target key itself a redirect? 
1053 | try: 1054 | orig_to_key = to_key 1055 | to_key, fragment = read_key_frag( 1056 | next(aliases[to_key]), fragment 1057 | ) 1058 | count += 1 1059 | keys.add(orig_to_key) 1060 | except StopIteration: 1061 | break 1062 | if count > self.max_redirects: 1063 | self._fire_event("too_many_redirects", from_key) 1064 | try: 1065 | target_ref = next(ref_dict[to_key]) 1066 | except StopIteration: 1067 | self._fire_event("alias_target_not_found", to_key) 1068 | else: 1069 | for key in keys: 1070 | ref = Ref( 1071 | key=key, 1072 | bin_index=target_ref.bin_index, 1073 | item_index=target_ref.item_index, 1074 | # last fragment in the chain wins 1075 | fragment=target_ref.fragment or fragment, 1076 | ) 1077 | alias_writer.add(pickle.dumps(ref), key) 1078 | 1079 | with open(path) as resolved_aliases_reader: 1080 | previous = None 1081 | targets = set() 1082 | 1083 | for item in resolved_aliases_reader: 1084 | ref = pickle.loads(item.content) 1085 | if previous is not None and ref.key != previous.key: 1086 | for bin_index, item_index, fragment in sorted(targets): 1087 | self._write_ref(previous.key, bin_index, item_index, fragment) 1088 | targets.clear() 1089 | targets.add((ref.bin_index, ref.item_index, ref.fragment)) 1090 | previous = ref 1091 | 1092 | for bin_index, item_index, fragment in sorted(targets): 1093 | self._write_ref(previous.key, bin_index, item_index, fragment) 1094 | 1095 | self._sort() 1096 | self._fire_event("end_resolve_aliases") 1097 | 1098 | def finalize(self): 1099 | self._fire_event("begin_finalize") 1100 | if not self.current_bin is None: 1101 | self._write_current_bin() 1102 | 1103 | self._sort() 1104 | if self.max_redirects: 1105 | self._resolve_aliases() 1106 | 1107 | files = ( 1108 | self.f_ref_positions, 1109 | self.f_refs, 1110 | self.f_store_positions, 1111 | self.f_store, 1112 | ) 1113 | 1114 | for f in files: 1115 | f.close() 1116 | 1117 | buf_size = 10 * 1024 * 1024 1118 | 1119 | with fopen(self.filename, mode="wb") as output_file: 1120 | out = StructWriter(output_file, self.encoding) 1121 | out.write(MAGIC) 1122 | out.write(uuid4().bytes) 1123 | out.write_tiny_text(self.encoding, encoding=UTF8) 1124 | out.write_tiny_text(self.compression) 1125 | 1126 | def write_tags(tags, f): 1127 | f.write(pack(U_CHAR, len(tags))) 1128 | for key, value in tags.items(): 1129 | f.write_tiny_text(key) 1130 | f.write_tiny_text(value, editable=True) 1131 | 1132 | write_tags(self.tags, out) 1133 | 1134 | def write_content_types(content_types, f): 1135 | count = len(content_types) 1136 | f.write(pack(U_CHAR, count)) 1137 | types = sorted(content_types.items(), key=lambda x: x[1]) 1138 | for content_type, _ in types: 1139 | f.write_text(content_type) 1140 | 1141 | write_content_types(self.content_types, out) 1142 | 1143 | out.write_int(self.blob_count) 1144 | store_offset = ( 1145 | out.tell() 1146 | + U_LONG_LONG_SIZE # this value 1147 | + U_LONG_LONG_SIZE # file size value 1148 | + U_INT_SIZE # ref count value 1149 | + os.stat(self.f_ref_positions.name).st_size 1150 | + os.stat(self.f_refs.name).st_size 1151 | ) 1152 | out.write_long(store_offset) 1153 | out.flush() 1154 | 1155 | file_size = ( 1156 | out.tell() # bytes written so far 1157 | + U_LONG_LONG_SIZE # file size value 1158 | + 2 * U_INT_SIZE 1159 | ) # ref count and bin count 1160 | file_size += sum((os.stat(f.name).st_size for f in files)) 1161 | out.write_long(file_size) 1162 | 1163 | def mv(src, out): 1164 | fname = src.name 1165 | self._fire_event("begin_move", fname) 1166 | with fopen(fname, mode="rb") as f: 1167 | 
while True: 1168 | data = f.read(buf_size) 1169 | if len(data) == 0: 1170 | break 1171 | out.write(data) 1172 | out.flush() 1173 | os.remove(fname) 1174 | self._fire_event("end_move", fname) 1175 | 1176 | out.write_int(self.ref_count) 1177 | mv(self.f_ref_positions, out) 1178 | mv(self.f_refs, out) 1179 | 1180 | out.write_int(self.bin_count) 1181 | mv(self.f_store_positions, out) 1182 | mv(self.f_store, out) 1183 | 1184 | self.tmpdir.cleanup() 1185 | self._fire_event("end_finalize") 1186 | 1187 | def size_header(self): 1188 | size = 0 1189 | size += len(MAGIC) 1190 | size += 16 # uuid bytes 1191 | size += U_CHAR_SIZE + len(self.encoding.encode(UTF8)) 1192 | size += U_CHAR_SIZE + len(self.compression.encode(self.encoding)) 1193 | 1194 | size += U_CHAR_SIZE # tag length 1195 | size += U_CHAR_SIZE # content types count 1196 | 1197 | # tags and content types themselves counted elsewhere 1198 | 1199 | size += U_INT_SIZE # blob count 1200 | size += U_LONG_LONG_SIZE # store offset 1201 | size += U_LONG_LONG_SIZE # file size 1202 | size += U_INT_SIZE # ref count 1203 | size += U_INT_SIZE # bin count 1204 | 1205 | return size 1206 | 1207 | def size_tags(self): 1208 | size = 0 1209 | for key, _ in self.tags.items(): 1210 | size += U_CHAR_SIZE + len(key.encode(self.encoding)) 1211 | size += 255 1212 | return size 1213 | 1214 | def size_content_types(self): 1215 | size = 0 1216 | for content_type in self.content_types: 1217 | size += U_CHAR_SIZE + len(content_type.encode(self.encoding)) 1218 | return size 1219 | 1220 | def size_data(self): 1221 | files = ( 1222 | self.f_ref_positions, 1223 | self.f_refs, 1224 | self.f_store_positions, 1225 | self.f_store, 1226 | ) 1227 | return sum((os.stat(f.name).st_size for f in files)) 1228 | 1229 | def __enter__(self): 1230 | return self 1231 | 1232 | def __exit__(self, exc_type, exc_val, exc_tb): 1233 | self.finalize() 1234 | return False 1235 | 1236 | 1237 | class TestReadWrite(unittest.TestCase): 1238 | 1239 | def setUp(self): 1240 | 1241 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1242 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1243 | 1244 | with create(self.path) as w: 1245 | 1246 | self.tags = {"a": "abc", "bb": "xyz123", "ccc": "lkjlk"} 1247 | for name, value in self.tags.items(): 1248 | w.tag(name, value) 1249 | 1250 | self.tag2 = "bb", "xyz123" 1251 | 1252 | self.blob_encoding = "ascii" 1253 | 1254 | self.data = [ 1255 | (("c", "cc", "ccc"), MIME_TEXT, "Hello C 1"), 1256 | ("a", MIME_TEXT, "Hello A 12"), 1257 | ("z", MIME_TEXT, "Hello Z 123"), 1258 | ("b", MIME_TEXT, "Hello B 1234"), 1259 | ("d", MIME_TEXT, "Hello D 12345"), 1260 | ("uuu", MIME_HTML, "
Hello U!"), 1261 | ((("yy", "frag1"),), MIME_HTML, '