├── .github └── ISSUE_TEMPLATE │ └── bug_report.md ├── .gitignore ├── LICENSE ├── README.org ├── setup.py └── slob.py /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Report an issue with slob reader/writer 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | 21 | 22 | **Description** 23 | A clear and concise description of what the bug is. 24 | 25 | **To Reproduce** 26 | Steps to reproduce the behavior: 27 | - ... 28 | - ... 29 | 30 | **Expected behavior** 31 | A clear and concise description of what you expected to happen. 32 | 33 | **Environment:** 34 | - OS: [e.g. iOS] 35 | - Python version 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *~ 3 | *.egg-info 4 | __pycache__ 5 | README.html -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. [http://fsf.org/] 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 
43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. 
If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. 
Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 
234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 
296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. 
If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 
414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. 
The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. 
You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 
583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | {one line to give the program's name and a brief idea of what it does.} 635 | Copyright (C) {year} {name of author} 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see [http://www.gnu.org/licenses/]. 
649 | 
650 | Also add information on how to contact you by electronic and paper mail.
651 | 
652 | If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 | 
655 | {project} Copyright (C) {year} {fullname}
656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 | This is free software, and you are welcome to redistribute it
658 | under certain conditions; type `show c' for details.
659 | 
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License. Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 | 
664 | You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | [http://www.gnu.org/licenses/].
668 | 
669 | The GNU General Public License does not permit incorporating your program
670 | into proprietary programs. If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library. If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License. But first, please read
674 | [http://www.gnu.org/philosophy/why-not-lgpl.html].
--------------------------------------------------------------------------------
/README.org:
--------------------------------------------------------------------------------
1 | * Slob
2 | Slob (sorted list of blobs) is a read-only, compressed data store
3 | with a dictionary-like interface to look up content by text keys. Keys
4 | are sorted according to the [[http://www.unicode.org/reports/tr10/][Unicode Collation Algorithm]], which
5 | allows punctuation-, case- and diacritics-insensitive
6 | lookups. /slob.py/ is a reference implementation of a slob format
7 | reader and writer in [[http://python.org][Python 3]].
8 | 
9 | ** Installation
10 | 
11 | /slob.py/ depends on the following components:
12 | 
13 | - [[http://python.org][Python]] >= 3.6
14 | - [[http://icu-project.org][ICU]] >= 4.8
15 | - [[https://pypi.python.org/pypi/PyICU][PyICU]] >= 1.5
16 | 
17 | In addition, the following components are needed to set up a
18 | slob environment:
19 | 
20 | - [[http://git-scm.com/][git]]
21 | - [[https://virtualenv.pypa.io/][virtualenv]]
22 | 
23 | Consult your operating system documentation and these components'
24 | websites for installation instructions.
25 | 
26 | For example, on Ubuntu 20.04, the following commands install the
27 | required packages:
28 | 
29 | #+BEGIN_SRC sh
30 | sudo apt update
31 | sudo apt install python3 python3-icu python3.8-venv git
32 | #+END_SRC
33 | 
34 | Create a new Python virtual environment:
35 | 
36 | #+BEGIN_SRC sh
37 | python3 -m venv env-slob --system-site-packages
38 | #+END_SRC
39 | 
40 | Activate it:
41 | 
42 | #+BEGIN_SRC sh
43 | source env-slob/bin/activate
44 | #+END_SRC
45 | 
46 | Install from the source code repository:
47 | 
48 | #+BEGIN_SRC sh
49 | pip install git+https://github.com/itkach/slob.git
50 | #+END_SRC
51 | 
52 | or download the source code manually:
53 | 
54 | #+BEGIN_SRC sh
55 | wget https://github.com/itkach/slob/archive/master.zip
56 | pip install master.zip
57 | #+END_SRC
58 | 
59 | Run tests:
60 | 
61 | #+BEGIN_SRC sh
62 | python -m unittest slob
63 | #+END_SRC
64 | 
65 | ** Command line interface
66 | 
67 | /slob.py/ provides a basic command line interface to inspect
68 | and modify slob content.
69 | 
70 | #+BEGIN_SRC
71 | usage: slob [-h] {find,get,info,tag,convert} ...
72 | 
73 | positional arguments:
74 |   {find,get,info,tag,convert}  sub-command
75 |     find               Find keys
76 |     get                Retrieve blob content
77 |     info               Inspect slob and print basic information about it
78 |     tag                List tags, view or edit tag value
79 |     convert            Create new slob with the same content but different
80 |                        encoding and compression parameters,
81 |                        or split into multiple slobs
82 | 
83 | optional arguments:
84 |   -h, --help           show this help message and exit
85 | #+END_SRC
86 | 
87 | To see basic slob info such as text encoding, compression and tags:
88 | #+BEGIN_SRC
89 | slob info my.slob
90 | #+END_SRC
91 | 
92 | To see the value of a tag, for example /label/:
93 | #+BEGIN_SRC
94 | slob tag -n label my.slob
95 | #+END_SRC
96 | 
97 | To set a tag value:
98 | #+BEGIN_SRC
99 | slob tag -n label -v "A Fine Dictionary" my.slob
100 | #+END_SRC
101 | 
102 | To look up a key, for example /abc/:
103 | #+BEGIN_SRC
104 | slob find wordnet-3.0.slob abc
105 | #+END_SRC
106 | 
107 | The output should look something like
108 | #+BEGIN_SRC
109 | 465 text/html; charset=utf-8 ABC
110 | 466 text/html; charset=utf-8 abcoulomb
111 | 472 text/html; charset=utf-8 ABC's
112 | 468 text/html; charset=utf-8 ABCs
113 | #+END_SRC
114 | 
115 | The first column in the output is the blob id. It can be used to retrieve
116 | blob content (content bytes are written to stdout):
117 | #+BEGIN_SRC
118 | slob get wordnet-3.0.slob 465
119 | #+END_SRC
120 | 
121 | To re-encode or re-compress slob content with different
122 | parameters:
123 | #+BEGIN_SRC
124 | slob convert -c lzma2 -b 256 simplewiki-20140209.zlib.384k.slob simplewiki-20140209.lzma2.256k.slob
125 | #+END_SRC
126 | 
127 | To split into multiple slobs:
128 | 
129 | #+BEGIN_SRC
130 | slob convert --split 4096 enwiki-20150406.slob enwiki-20150406-vol.slob
131 | #+END_SRC
132 | 
133 | The output name /enwiki-20150406-vol.slob/ is the name of the
134 | directory where the resulting .slob files will be created.
135 | 
136 | This is useful for crippled systems that can't use normal
137 | filesystems and have file size limits, such as SD cards on
138 | vanilla Android. Note that this command doesn't duplicate any
139 | content, so clients must search all these slobs when looking for
140 | shared resources such as stylesheets, fonts, javascript or
141 | images, as in the sketch below.
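Such a search across volumes can be done with the same /slob.py/ API used
in the Examples section below. A minimal sketch, assuming the volume
directory layout produced by ~slob convert --split~; the directory and key
names here are hypothetical:

#+BEGIN_SRC python
import glob
import os

import slob

def find_across_volumes(volume_dir, key):
    # Shared resources may live in any volume, so try every .slob file.
    for path in sorted(glob.glob(os.path.join(volume_dir, '*.slob'))):
        with slob.open(path) as r:
            for blob in r.as_dict()[key]:
                print(path, blob.id, blob.content_type)

find_across_volumes('enwiki-20150406-vol.slob', 'shared.css')
#+END_SRC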
142 | 
143 | 
144 | ** Examples
145 | 
146 | *** Basic Usage
147 | 
148 | Create a slob:
149 | 
150 | #+BEGIN_SRC python
151 | import slob
152 | with slob.create('test.slob') as w:
153 |     w.add(b'Hello A', 'a')
154 |     w.add(b'Hello B', 'b')
155 | #+END_SRC
156 | 
157 | Read content:
158 | 
159 | #+BEGIN_SRC python
160 | import slob
161 | with slob.open('test.slob') as r:
162 |     d = r.as_dict()
163 |     for key in ('a', 'b'):
164 |         result = next(d[key])
165 |         print(result.content)
166 | 
167 | #+END_SRC
168 | 
169 | will print
170 | 
171 | #+BEGIN_SRC
172 | b'Hello A'
173 | b'Hello B'
174 | #+END_SRC
175 | 
176 | 
177 | The slob we created in this example certainly works, but it is not
178 | ideal: we neglected to specify a content type for the content we
179 | are adding. Let's consider a slightly more involved example:
180 | 
181 | #+BEGIN_SRC python
182 | import slob
183 | PLAIN_TEXT = 'text/plain; charset=utf-8'
184 | with slob.create('test1.slob') as w:
185 |     w.add('Hello, Earth!'.encode('utf-8'),
186 |           'earth', 'terra', content_type=PLAIN_TEXT)
187 |     w.add_alias('земля', 'earth')
188 |     w.add('Hello, Mars!'.encode('utf-8'), 'mars',
189 |           content_type=PLAIN_TEXT)
190 | #+END_SRC
191 | 
192 | Here we specify the MIME type of the content we are adding so that
193 | consumers of this content can display or process it
194 | properly. Note that the same content may be associated with
195 | multiple keys, either when it is added or later with /add_alias/.
196 | 
197 | This
198 | 
199 | #+BEGIN_SRC python
200 | with slob.open('test1.slob') as r:
201 | 
202 |     def p(blob):
203 |         print(blob.id, blob.content_type, blob.content)
204 | 
205 |     for key in ('earth', 'земля', 'terra'):
206 |         blob = next(r.as_dict()[key])
207 |         p(blob)
208 | 
209 |     p(next(r.as_dict()['mars']))
210 | 
211 | #+END_SRC
212 | 
213 | will print
214 | 
215 | #+BEGIN_SRC
216 | 0 text/plain; charset=utf-8 b'Hello, Earth!'
217 | 0 text/plain; charset=utf-8 b'Hello, Earth!'
218 | 0 text/plain; charset=utf-8 b'Hello, Earth!'
219 | 1 text/plain; charset=utf-8 b'Hello, Mars!'
220 | #+END_SRC
221 | 
222 | Note that the blob id for the first three keys is the same: they all
223 | point to the same content item.
224 | 
225 | Take a look at the tests in /slob.py/ for more examples.
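Lookups are collation-based rather than exact. /slob.py/ exports the UCA
strength constants /PRIMARY/, /SECONDARY/, /TERTIARY/, /QUATERNARY/ and
/IDENTICAL/. A hedged sketch, assuming /as_dict/ accepts a strength
argument (check the module source): at /PRIMARY/ strength, lookups ignore
case and diacritics, so looking up /EARTH/ finds the entry keyed /earth/:

#+BEGIN_SRC python
import slob

with slob.open('test1.slob') as r:
    # Assumed signature: as_dict(strength). PRIMARY-strength collation
    # matches 'EARTH' against the key 'earth'.
    d = r.as_dict(slob.PRIMARY)
    for blob in d['EARTH']:
        print(blob.id, blob.content_type, blob.content)
#+END_SRC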
226 | 
227 | 
228 | *** Software and Dictionaries
229 | 
230 | - [[https://github.com/itkach/slob/wiki/Dictionaries][Wikipedia, Wiktionary, WordNet, FreeDict and more]]
231 | - [[http://github.com/itkach/aard2-android/][aard2-android]] - dictionary for Android
232 | - [[https://github.com/farfromrefug/OSS-Dict][OSS-Dict]] - fork of Aard2 with new Material design and updated features
233 | - [[https://github.com/itkach/aard2-web][aard2-web]] - minimalistic Web UI (Java)
234 | - [[http://github.com/itkach/slobber/][slobber]] - Web API to look up content in slob dictionaries
235 | - [[http://github.com/itkach/slobby/][slobby]] - minimalistic Web UI (Python)
236 | - [[https://github.com/ilius/pyglossary][pyglossary]] - convert dictionaries in various formats, including slob
237 | - [[https://github.com/itkach/mw2slob][mw2slob]] - create slob dictionaries from Wikimedia Enterprise HTML Dumps or MediaWiki API
238 | - [[http://github.com/itkach/xdxf2slob/][xdxf2slob]] - create slob dictionaries from XDXF
239 | - [[https://github.com/itkach/tei2slob/][tei2slob]] - create slob dictionaries from TEI
240 | - [[http://github.com/itkach/wordnet2slob/][wordnet2slob]] - convert the WordNet database to a slob dictionary
241 | 
242 | 
243 | ** Slob File Format
244 | 
245 | *** Slob
246 | 
247 | | Element       | Type                                 | Description |
248 | |---------------+--------------------------------------+-------------|
249 | | magic         | fixed size sequence of 8 bytes       | Bytes ~21 2d 31 53 4c 4f 42 1f~: string ~!-1SLOB~ followed by the ascii unit separator (ascii hex code ~1f~) identifying the slob format |
250 | |---------------+--------------------------------------+-------------|
251 | | uuid          | fixed size sequence of 16 bytes      | Unique slob identifier ([[https://tools.ietf.org/html/rfc4122][RFC 4122]] UUID) |
252 | |---------------+--------------------------------------+-------------|
253 | | encoding      | tiny text (utf8)                     | Name of the text encoding used for all other text elements: tag names and values, content types, keys, fragments |
254 | |---------------+--------------------------------------+-------------|
255 | | compression   | tiny text                            | Name of the compression algorithm used to compress storage bins. |
256 | |               |                                      | slob.py understands the following names: /bz2/ and /zlib/, which correspond to Python module names, and /lzma2/, which refers to raw lzma2 compression with the LZMA2 filter (this is the default). |
257 | |               |                                      | An empty value means bins are not compressed. |
258 | |---------------+--------------------------------------+-------------|
259 | | tags          | char-sized sequence of tags          | Tags are text key-value pairs that may provide additional information about the slob or its data. |
260 | |---------------+--------------------------------------+-------------|
261 | | content types | char-sized sequence of content types | MIME content types. Content items refer to content types by id. |
262 | |               |                                      | A content type id is the 0-based position of the content type in this sequence. |
263 | |---------------+--------------------------------------+-------------|
264 | | blob count    | int                                  | Number of content items stored in the slob |
265 | |---------------+--------------------------------------+-------------|
266 | | store offset  | long                                 | File position at which store data begins |
267 | |---------------+--------------------------------------+-------------|
268 | | size          | long                                 | Total file byte size (or sum of all files if the slob is split into multiple files) |
269 | |---------------+--------------------------------------+-------------|
270 | | refs          | list of long-positioned refs         | References to content |
271 | |---------------+--------------------------------------+-------------|
272 | | store         | list of long-positioned store items  | A store item contains the number of items stored, a content type id for each item and a storage bin with each item's content |
273 | 
274 | 
275 | 
276 | *** tiny text
277 | 
278 | char-sized sequence of encoded text bytes
279 | 
280 | 
281 | *** text
282 | 
283 | short-sized sequence of encoded text bytes
284 | 
285 | 
286 | *** large byte string
287 | 
288 | int-sized sequence of bytes
289 | 
290 | *** /size type/-sized sequence of /items/
291 | 
292 | | Element | Type                      |
293 | |---------+---------------------------|
294 | | count   | /size type/               |
295 | | items   | sequence of /count/ items |
296 | 
297 | 
298 | *** tag
299 | 
300 | | Element | Type                        |
301 | |---------+-----------------------------|
302 | | name    | tiny text                   |
303 | | value   | tiny text padded to maximum |
304 | |         | length with null bytes      |
305 | 
306 | Tag values are tiny text of length 255, starting with encoded
307 | text bytes followed by null bytes. This allows modifying tag
308 | values without having to rebuild the whole slob. Null bytes
309 | must be stripped before decoding value text.
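For illustration, a hedged sketch of decoding tiny text, including the
null-byte stripping for maximum-length tag values (a standalone helper
mirroring the reference reader's logic, not /slob.py/'s public API):

#+BEGIN_SRC python
from struct import unpack

def read_tiny_text(f, encoding='utf-8'):
    # tiny text: one unsigned char length prefix followed by that many bytes
    (length,) = unpack('>B', f.read(1))
    data = f.read(length)
    if length == 255:
        # padded to maximum length (a tag value): drop everything
        # starting from the first null byte before decoding
        data = data.split(b'\x00', 1)[0]
    return data.decode(encoding)
#+END_SRC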
310 | 
311 | *** content type
312 | 
313 | text
314 | 
315 | 
316 | *** ref
317 | 
318 | | Element    | Type      | Description |
319 | |------------+-----------+-------------|
320 | | key        | text      | Text key associated with content |
321 | | bin index  | int       | Index of the compressed bin containing content |
322 | | item index | short     | Index of the content item inside the uncompressed bin |
323 | | fragment   | tiny text | Text identifier of a specific location inside content |
324 | 
325 | 
326 | *** store item
327 | | Element          | Type                                                    | Description |
328 | |------------------+---------------------------------------------------------+-------------|
329 | | content type ids | int-sized sequence of bytes                             | Each byte is a char representing a content type id. |
330 | | storage bin      | list of int-positioned large byte strings without count | Content |
331 | 
332 | The storage bin doesn't include a leading int that would represent the
333 | item count: the item count equals the length of the content type ids. Items
334 | in the storage bin are large byte strings - the actual content bytes.
335 | 
336 | *** list of /position type/-positioned /items/
337 | 
338 | | Element   | Type                                                        | Description |
339 | |-----------+-------------------------------------------------------------+-------------|
340 | | positions | int-sized sequence of item offsets of type /position type/ | An item offset specifies the position in the file where item data starts, relative to the end of the position data |
341 | | items     | sequence of /items/                                         |             |
342 | 
343 | *** char
344 | unsigned char (1 byte)
345 | 
346 | *** short
347 | big endian unsigned short (2 bytes)
348 | 
349 | *** int
350 | big endian unsigned int (4 bytes)
351 | 
352 | *** long
353 | big endian unsigned long long (8 bytes)
354 | 
355 | 
356 | ** Design Considerations
357 | 
358 | The slob format design is influenced by the old Aard Dictionary's [[https://github.com/aarddict/tools/blob/master/doc/aardformat.rst][aard]] and [[http://openzim.org/][ZIM]]
359 | file formats. Similar to Aard Dictionary, it supports
360 | non-exact lookups based on the UCA's notion of collation
361 | strength. Similar to ZIM, it groups and compresses multiple
362 | content items to achieve a high compression ratio and can combine
363 | several physical files into one logical container. Both aard and
364 | ZIM contain vestigial elements of predecessor formats as well
365 | as elements specific to a particular use case (such as
366 | implementing offline Wikipedia content access). Slob aims to
367 | provide a minimal framework to allow building such applications
368 | while remaining a simple, generic, read-only data store.
369 | 
370 | *** No Format Version
371 | The slob header doesn't contain an explicit file format version
372 | number. Any incompatible change to the format
373 | should be introduced as a new file format which will get its own
374 | identifying magic bytes.
375 | 
376 | *** No Content Checksum
377 | Unlike the aard and ZIM file formats, slob doesn't contain a
378 | content checksum. File integrity can easily be verified by
379 | employing standard tools to calculate a content hash. Including a
380 | pre-calculated hash in the file itself prevents using most
381 | standard tools and puts the burden of implementing hash calculation
382 | on every slob reader implementation.
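For example, the hash can be computed with stock tools (~sha256sum
my.slob~) or with a few lines of Python; a sketch that reads in chunks to
keep memory use constant on large slobs:

#+BEGIN_SRC python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        # read 1 MiB at a time instead of loading the whole file
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
#+END_SRC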
383 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | 
3 | setup(name='Slob',
4 |       version='1.0.2',
5 |       description='Read-only compressed data store',
6 |       author='Igor Tkach',
7 |       author_email='itkach@gmail.com',
8 |       url='http://github.com/itkach/slob',
9 |       license='GPL3',
10 |       py_modules=['slob'],
11 |       install_requires=['PyICU >= 1.5'],
12 |       entry_points={'console_scripts': ['slob=slob:main']})
13 | 
--------------------------------------------------------------------------------
/slob.py:
--------------------------------------------------------------------------------
1 | # pylint: disable=C0111,C0103,C0302,R0903,R0904,R0914,R0201
2 | import argparse
3 | import array
4 | import collections
5 | import encodings
6 | import functools
7 | import io
8 | import os
9 | import pickle
10 | import random
11 | import sys
12 | import tempfile
13 | import unicodedata
14 | import unittest
15 | import warnings
16 | 
17 | from abc import abstractmethod
18 | from bisect import bisect_left
19 | from builtins import open as fopen
20 | from collections import namedtuple
21 | from collections.abc import Sequence
22 | from datetime import datetime, timezone
23 | from functools import lru_cache
24 | from struct import pack, unpack, calcsize
25 | from threading import RLock
26 | from types import MappingProxyType
27 | from uuid import uuid4, UUID
28 | 
29 | import icu
30 | 
31 | DEFAULT_COMPRESSION = "lzma2"
32 | 
33 | UTF8 = "utf-8"
34 | MAGIC = b"!-1SLOB\x1F"  # "!-1SLOB" followed by the ASCII unit separator
35 | 
36 | Compression = namedtuple("Compression", "compress decompress")
37 | 
38 | Ref = namedtuple("Ref", "key bin_index item_index fragment")
39 | 
40 | Header = namedtuple(
41 |     "Header",
42 |     "magic uuid encoding "
43 |     "compression tags content_types "
44 |     "blob_count "
45 |     "store_offset "
46 |     "refs_offset "
47 |     "size",
48 | )
49 | 
50 | U_CHAR = ">B"
51 | U_CHAR_SIZE = calcsize(U_CHAR)
52 | U_SHORT = ">H"
53 | U_SHORT_SIZE = calcsize(U_SHORT)
54 | U_INT = ">I"
55 | U_INT_SIZE = calcsize(U_INT)
56 | U_LONG_LONG = ">Q"
57 | U_LONG_LONG_SIZE = calcsize(U_LONG_LONG)
58 | 
59 | 
60 | def calcmax(len_size_spec):
61 |     return 2 ** (calcsize(len_size_spec) * 8) - 1  # largest unsigned value for the given struct spec
62 | 
63 | 
64 | MAX_TEXT_LEN = calcmax(U_SHORT)
65 | MAX_TINY_TEXT_LEN = calcmax(U_CHAR)
66 | MAX_LARGE_BYTE_STRING_LEN = calcmax(U_INT)
67 | MAX_BIN_ITEM_COUNT = calcmax(U_SHORT)
68 | 
69 | from icu import Locale, Collator, UCollAttribute, UCollAttributeValue
70 | 
71 | PRIMARY = Collator.PRIMARY
72 | SECONDARY = Collator.SECONDARY
73 | TERTIARY = Collator.TERTIARY
74 | QUATERNARY = Collator.QUATERNARY
75 | IDENTICAL = Collator.IDENTICAL
76 | 
77 | 
78 | def init_compressions():
79 |     ident = lambda x: x
80 |     compressions = {"": Compression(ident, ident)}
81 |     for name in ("bz2", "zlib"):
82 |         try:
83 |             m = __import__(name)
84 |         except ImportError:
85 |             warnings.warn("%s is not available" % name)
86 |         else:
87 |             compressions[name] = Compression(lambda x, m=m: m.compress(x, 9), m.decompress)  # m=m binds the module now; a bare closure would late-bind to the last imported module
88 | 
89 |     try:
90 |         import lzma
91 |     except ImportError:
92 |         warnings.warn("lzma is not available")
93 |     else:
94 |         filters = [{"id": lzma.FILTER_LZMA2}]
95 |         compress = lambda s: lzma.compress(s, format=lzma.FORMAT_RAW, filters=filters)
96 |         decompress = lambda s: lzma.decompress(
97 |             s, format=lzma.FORMAT_RAW, filters=filters
98 |         )
99 |         compressions["lzma2"] = Compression(compress, decompress)
100 |     return compressions
101 | 
102 | 
103 | COMPRESSIONS = init_compressions()
init_compressions() 104 | 105 | 106 | del init_compressions 107 | 108 | 109 | MIME_TEXT = "text/plain" 110 | MIME_HTML = "text/html" 111 | MIME_CSS = "text/css" 112 | MIME_JS = "application/javascript" 113 | 114 | MIME_TYPES = { 115 | "html": MIME_HTML, 116 | "txt": MIME_TEXT, 117 | "js": MIME_JS, 118 | "css": MIME_CSS, 119 | "json": "application/json", 120 | "woff": "application/font-woff", 121 | "svg": "image/svg+xml", 122 | "png": "image/png", 123 | "jpg": "image/jpeg", 124 | "jpeg": "image/jpeg", 125 | "gif": "image/gif", 126 | "ttf": "application/x-font-ttf", 127 | "otf": "application/x-font-opentype", 128 | } 129 | 130 | 131 | class FileFormatException(Exception): 132 | pass 133 | 134 | 135 | class UnknownFileFormat(FileFormatException): 136 | pass 137 | 138 | 139 | class UnknownCompression(FileFormatException): 140 | pass 141 | 142 | 143 | class UnknownEncoding(FileFormatException): 144 | pass 145 | 146 | 147 | class IncorrectFileSize(FileFormatException): 148 | pass 149 | 150 | 151 | class TagNotFound(Exception): 152 | pass 153 | 154 | 155 | @lru_cache(maxsize=None) 156 | def sortkey(strength, maxlength=None): 157 | c = Collator.createInstance(Locale("")) 158 | c.setStrength(strength) 159 | c.setAttribute(UCollAttribute.ALTERNATE_HANDLING, UCollAttributeValue.SHIFTED) 160 | if maxlength is None: 161 | return c.getSortKey 162 | else: 163 | return lambda x: c.getSortKey(x)[:maxlength] 164 | 165 | 166 | def sortkey_length(strength, word): 167 | c = Collator.createInstance(Locale("")) 168 | c.setStrength(strength) 169 | c.setAttribute(UCollAttribute.ALTERNATE_HANDLING, UCollAttributeValue.SHIFTED) 170 | coll_key = c.getSortKey(word) 171 | return len(coll_key) - 1 # subtract 1 for ending \x00 byte 172 | 173 | 174 | class MultiFileReader(io.BufferedIOBase): 175 | 176 | def __init__(self, *args): 177 | filenames = [] 178 | for arg in args: 179 | if isinstance(arg, str): 180 | filenames.append(arg) 181 | else: 182 | for name in arg: 183 | filenames.append(name) 184 | files = [] 185 | ranges = [] 186 | offset = 0 187 | for name in filenames: 188 | size = os.stat(name).st_size 189 | ranges.append(range(offset, offset + size)) 190 | files.append(fopen(name, "rb")) 191 | offset += size 192 | self.size = offset 193 | self._ranges = ranges 194 | self._files = files 195 | self._fcount = len(self._files) 196 | self._offset = -1 197 | self.seek(0) 198 | 199 | def __enter__(self): 200 | return self 201 | 202 | def __exit__(self, exc_type, exc_val, exc_tb): 203 | self.close() 204 | return False 205 | 206 | def close(self): 207 | for f in self._files: 208 | f.close() 209 | self._files.clear() 210 | self._ranges.clear() 211 | 212 | def closed(self): 213 | return len(self._ranges) == 0 214 | 215 | def isatty(self): 216 | return False 217 | 218 | def readable(self): 219 | return True 220 | 221 | def seek(self, offset, whence=io.SEEK_SET): 222 | if whence == io.SEEK_SET: 223 | self._offset = offset 224 | elif whence == io.SEEK_CUR: 225 | self._offset = self._offset + offset 226 | elif whence == io.SEEK_END: 227 | self._offset = self.size + offset 228 | else: 229 | raise ValueError("Invalid value for parameter whence: %r" % whence) 230 | return self._offset 231 | 232 | def seekable(self): 233 | return True 234 | 235 | def tell(self): 236 | return self._offset 237 | 238 | def writable(self): 239 | return False 240 | 241 | def read(self, n=-1): 242 | file_index = -1 243 | actual_offset = 0 244 | for i, r in enumerate(self._ranges): 245 | if self._offset in r: 246 | file_index = i 247 | actual_offset = 
self._offset - r.start 248 | break 249 | result = b"" 250 | if n == -1 or n is None: 251 | to_read = self.size 252 | else: 253 | to_read = n 254 | while -1 < file_index < self._fcount: 255 | f = self._files[file_index] 256 | f.seek(actual_offset) 257 | read = f.read(to_read) 258 | read_count = len(read) 259 | self._offset += read_count 260 | result += read 261 | to_read -= read_count 262 | if to_read > 0: 263 | file_index += 1 264 | actual_offset = 0 265 | else: 266 | break 267 | return result 268 | 269 | 270 | class CollationKeyList(object): 271 | 272 | def __init__(self, lst, sortkey_): 273 | self.lst = lst 274 | self.sortkey = sortkey_ 275 | 276 | def __len__(self): 277 | return len(self.lst) 278 | 279 | def __getitem__(self, i): 280 | return self.sortkey(self.lst[i].key) 281 | 282 | 283 | class KeydItemDict(object): 284 | 285 | def __init__(self, lst, strength, maxlength=None): 286 | self.lst = lst 287 | self.sortkey = sortkey(strength, maxlength=maxlength) 288 | self.sortkeylist = CollationKeyList(lst, self.sortkey) 289 | 290 | def __len__(self): 291 | return len(self.lst) 292 | 293 | def __getitem__(self, key): 294 | key_as_sk = self.sortkey(key) 295 | i = bisect_left(self.sortkeylist, key_as_sk) 296 | if i != len(self.lst): 297 | while i < len(self.lst): 298 | if self.sortkey(self.lst[i].key) == key_as_sk: 299 | yield self.lst[i] 300 | else: 301 | break 302 | i += 1 303 | 304 | def __contains__(self, key): 305 | try: 306 | next(self[key]) 307 | except StopIteration: 308 | return False 309 | else: 310 | return True 311 | 312 | 313 | class Blob(object): 314 | 315 | def __init__(self, content_id, key, fragment, read_content_type_func, read_func): 316 | self._content_id = content_id 317 | self._key = key 318 | self._fragment = fragment 319 | self._read_content_type = read_content_type_func 320 | self._read = read_func 321 | 322 | @property 323 | def id(self): 324 | return self._content_id 325 | 326 | @property 327 | def key(self): 328 | return self._key 329 | 330 | @property 331 | def fragment(self): 332 | return self._fragment 333 | 334 | @property 335 | def content_type(self): 336 | return self._read_content_type() 337 | 338 | @property 339 | def content(self): 340 | return self._read() 341 | 342 | def __str__(self): 343 | return self.key 344 | 345 | def __repr__(self): 346 | return "<{0.__class__.__module__}.{0.__class__.__name__} " "{0.key}>".format( 347 | self 348 | ) 349 | 350 | 351 | def read_byte_string(f, len_spec): 352 | length = unpack(len_spec, f.read(calcsize(len_spec)))[0] 353 | return f.read(length) 354 | 355 | 356 | class StructReader: 357 | 358 | def __init__(self, file_, encoding=None): 359 | self._file = file_ 360 | self.encoding = encoding 361 | 362 | def read_int(self): 363 | s = self.read(U_INT_SIZE) 364 | return unpack(U_INT, s)[0] 365 | 366 | def read_long(self): 367 | b = self.read(U_LONG_LONG_SIZE) 368 | return unpack(U_LONG_LONG, b)[0] 369 | 370 | def read_byte(self): 371 | s = self.read(U_CHAR_SIZE) 372 | return unpack(U_CHAR, s)[0] 373 | 374 | def read_short(self): 375 | return unpack(U_SHORT, self._file.read(U_SHORT_SIZE))[0] 376 | 377 | def _read_text(self, len_spec): 378 | max_len = 2 ** (8 * calcsize(len_spec)) - 1 379 | byte_string = read_byte_string(self._file, len_spec) 380 | if len(byte_string) == max_len: 381 | terminator = byte_string.find(0) 382 | if terminator > -1: 383 | byte_string = byte_string[:terminator] 384 | return byte_string.decode(self.encoding) 385 | 386 | def read_tiny_text(self): 387 | return self._read_text(U_CHAR) 388 | 389 | def 
read_text(self): 390 | return self._read_text(U_SHORT) 391 | 392 | def __getattr__(self, name): 393 | return getattr(self._file, name) 394 | 395 | 396 | class StructWriter: 397 | 398 | def __init__(self, file_, encoding=None): 399 | self._file = file_ 400 | self.encoding = encoding 401 | 402 | def write_int(self, value): 403 | self._file.write(pack(U_INT, value)) 404 | 405 | def write_long(self, value): 406 | self._file.write(pack(U_LONG_LONG, value)) 407 | 408 | def write_byte(self, value): 409 | self._file.write(pack(U_CHAR, value)) 410 | 411 | def write_short(self, value): 412 | self._file.write(pack(U_SHORT, value)) 413 | 414 | def _write_text(self, text, len_size_spec, encoding=None, pad_to_length=None): 415 | if encoding is None: 416 | encoding = self.encoding 417 | text_bytes = text.encode(encoding) 418 | length = len(text_bytes) 419 | max_length = calcmax(len_size_spec) 420 | if length > max_length: 421 | raise ValueError("Text is too long for size spec %s" % len_size_spec) 422 | self._file.write( 423 | pack(len_size_spec, pad_to_length if pad_to_length else length) 424 | ) 425 | self._file.write(text_bytes) 426 | if pad_to_length: 427 | for _ in range(pad_to_length - length): 428 | self._file.write(pack(U_CHAR, 0)) 429 | 430 | def write_tiny_text(self, text, encoding=None, editable=False): 431 | pad_to_length = 255 if editable else None 432 | self._write_text(text, U_CHAR, encoding=encoding, pad_to_length=pad_to_length) 433 | 434 | def write_text(self, text, encoding=None): 435 | self._write_text(text, U_SHORT, encoding=encoding) 436 | 437 | def __getattr__(self, name): 438 | return getattr(self._file, name) 439 | 440 | 441 | def set_tag_value(filename, name, value): 442 | with fopen(filename, "rb+") as f: 443 | f.seek(len(MAGIC) + 16) 444 | encoding = read_byte_string(f, U_CHAR).decode(UTF8) 445 | if encodings.search_function(encoding) is None: 446 | raise UnknownEncoding(encoding) 447 | f = StructWriter(StructReader(f, encoding=encoding), encoding=encoding) 448 | f.read_tiny_text() 449 | tag_count = f.read_byte() 450 | for _ in range(tag_count): 451 | key = f.read_tiny_text() 452 | if key == name: 453 | f.write_tiny_text(value, editable=True) 454 | return 455 | f.read_tiny_text() 456 | raise TagNotFound(name) 457 | 458 | 459 | def read_header(f): 460 | f.seek(0) 461 | 462 | magic = f.read(len(MAGIC)) 463 | if magic != MAGIC: 464 | raise UnknownFileFormat() 465 | 466 | uuid = UUID(bytes=f.read(16)) 467 | encoding = read_byte_string(f, U_CHAR).decode(UTF8) 468 | if encodings.search_function(encoding) is None: 469 | raise UnknownEncoding(encoding) 470 | 471 | f = StructReader(f, encoding) 472 | compression = f.read_tiny_text() 473 | if not compression in COMPRESSIONS: 474 | raise UnknownCompression(compression) 475 | 476 | def read_tags(): 477 | tags = {} 478 | count = f.read_byte() 479 | for _ in range(count): 480 | key = f.read_tiny_text() 481 | value = f.read_tiny_text() 482 | tags[key] = value 483 | return tags 484 | 485 | tags = read_tags() 486 | 487 | def read_content_types(): 488 | content_types = [] 489 | count = f.read_byte() 490 | for _ in range(count): 491 | content_type = f.read_text() 492 | content_types.append(content_type) 493 | return tuple(content_types) 494 | 495 | content_types = read_content_types() 496 | 497 | blob_count = f.read_int() 498 | store_offset = f.read_long() 499 | size = f.read_long() 500 | refs_offset = f.tell() 501 | 502 | return Header( 503 | magic=magic, 504 | uuid=uuid, 505 | encoding=encoding, 506 | compression=compression, 507 | 
tags=MappingProxyType(tags), 508 | content_types=content_types, 509 | blob_count=blob_count, 510 | store_offset=store_offset, 511 | refs_offset=refs_offset, 512 | size=size, 513 | ) 514 | 515 | 516 | def meld_ints(a, b): 517 | return (a << 16) | b 518 | 519 | 520 | def unmeld_ints(c): 521 | bstr = bin(c).lstrip("0b").zfill(48) 522 | a, b = bstr[-48:-16], bstr[-16:] 523 | return int(a, 2), int(b, 2) 524 | 525 | 526 | class Slob(Sequence): 527 | 528 | def __init__(self, file_or_filenames): 529 | self._f = MultiFileReader(file_or_filenames) 530 | 531 | try: 532 | self._header = read_header(self._f) 533 | if self._f.size != self._header.size: 534 | raise IncorrectFileSize( 535 | "File size should be {0}, {1} bytes found".format( 536 | self._header.size, self._f.size 537 | ) 538 | ) 539 | except FileFormatException: 540 | self._f.close() 541 | raise 542 | 543 | self._refs = RefList( 544 | self._f, self._header.encoding, offset=self._header.refs_offset 545 | ) 546 | 547 | self._g = MultiFileReader(file_or_filenames) 548 | self._store = Store( 549 | self._g, 550 | self._header.store_offset, 551 | COMPRESSIONS[self._header.compression].decompress, 552 | self._header.content_types, 553 | ) 554 | 555 | def __enter__(self): 556 | return self 557 | 558 | def __exit__(self, exc_type, exc_val, exc_tb): 559 | self.close() 560 | return False 561 | 562 | @property 563 | def id(self): 564 | return self._header.uuid.hex 565 | 566 | @property 567 | def content_types(self): 568 | return self._header.content_types 569 | 570 | @property 571 | def tags(self): 572 | return self._header.tags 573 | 574 | @property 575 | def blob_count(self): 576 | return self._header.blob_count 577 | 578 | @property 579 | def compression(self): 580 | return self._header.compression 581 | 582 | @property 583 | def encoding(self): 584 | return self._header.encoding 585 | 586 | def __len__(self): 587 | return len(self._refs) 588 | 589 | def __getitem__(self, i): 590 | ref = self._refs[i] 591 | 592 | def read_func(): 593 | return self._store.get(ref.bin_index, ref.item_index)[1] 594 | 595 | read_func = lru_cache(maxsize=None)(read_func) 596 | 597 | def read_content_type_func(): 598 | return self._store.content_type(ref.bin_index, ref.item_index) 599 | 600 | content_id = meld_ints(ref.bin_index, ref.item_index) 601 | return Blob( 602 | content_id, ref.key, ref.fragment, read_content_type_func, read_func 603 | ) 604 | 605 | def get(self, blob_id): 606 | bin_index, bin_item_index = unmeld_ints(blob_id) 607 | return self._store.get(bin_index, bin_item_index) 608 | 609 | @lru_cache(maxsize=None) 610 | def as_dict(self, strength=TERTIARY, maxlength=None): 611 | return KeydItemDict(self, strength, maxlength=maxlength) 612 | 613 | def close(self): 614 | self._f.close() 615 | self._g.close() 616 | 617 | 618 | def find_parts(fname): 619 | fname = os.path.expanduser(fname) 620 | dirname = os.path.dirname(fname) or os.getcwd() 621 | basename = os.path.basename(fname) 622 | candidates = [] 623 | for name in os.listdir(dirname): 624 | if name.startswith(basename): 625 | candidates.append(os.path.join(dirname, name)) 626 | return sorted(candidates) 627 | 628 | 629 | def open(file_or_filenames): 630 | if isinstance(file_or_filenames, str): 631 | if not os.path.exists(file_or_filenames): 632 | file_or_filenames = find_parts(file_or_filenames) 633 | return Slob(file_or_filenames) 634 | 635 | 636 | def create(*args, **kwargs): 637 | return Writer(*args, **kwargs) 638 | 639 | 640 | class BinMemWriter: 641 | 642 | def __init__(self): 643 | 
self.content_type_ids = [] 644 | self.item_dir = [] 645 | self.items = [] 646 | self.current_offset = 0 647 | 648 | def add(self, content_type_id, blob): 649 | self.content_type_ids.append(content_type_id) 650 | self.item_dir.append(pack(U_INT, self.current_offset)) 651 | length_and_bytes = pack(U_INT, len(blob)) + blob 652 | self.items.append(length_and_bytes) 653 | self.current_offset += len(length_and_bytes) 654 | 655 | def __len__(self): 656 | return len(self.item_dir) 657 | 658 | def finalize(self, fout: "output file", compress: "function"): 659 | count = len(self) 660 | fout.write(pack(U_INT, count)) 661 | for content_type_id in self.content_type_ids: 662 | fout.write(pack(U_CHAR, content_type_id)) 663 | content = b"".join(self.item_dir + self.items) 664 | compressed = compress(content) 665 | fout.write(pack(U_INT, len(compressed))) 666 | fout.write(compressed) 667 | self.content_type_ids.clear() 668 | self.item_dir.clear() 669 | self.items.clear() 670 | 671 | 672 | class ItemList(Sequence): 673 | 674 | def __init__(self, file_, offset, count_or_spec, pos_spec, cache_size=None): 675 | self.lock = RLock() 676 | self._file = file_ 677 | file_.seek(offset) 678 | if isinstance(count_or_spec, str): 679 | count_spec = count_or_spec 680 | self.count = unpack(count_spec, file_.read(calcsize(count_spec)))[0] 681 | else: 682 | self.count = count_or_spec 683 | self.pos_offset = file_.tell() 684 | self.pos_spec = pos_spec 685 | self.pos_size = calcsize(pos_spec) 686 | self.data_offset = self.pos_offset + self.pos_size * self.count 687 | if cache_size: 688 | self.__getitem__ = lru_cache(maxsize=cache_size)(self.__getitem__) 689 | 690 | def __len__(self): 691 | return self.count 692 | 693 | def pos(self, i): 694 | with self.lock: 695 | self._file.seek(self.pos_offset + self.pos_size * i) 696 | return unpack(self.pos_spec, self._file.read(self.pos_size))[0] 697 | 698 | def read(self, pos): 699 | with self.lock: 700 | self._file.seek(self.data_offset + pos) 701 | return self._read_item() 702 | 703 | @abstractmethod 704 | def _read_item(self): 705 | pass 706 | 707 | def __getitem__(self, i): 708 | if i >= len(self) or i < 0: 709 | raise IndexError("index out of range") 710 | return self.read(self.pos(i)) 711 | 712 | 713 | class RefList(ItemList): 714 | 715 | def __init__(self, f, encoding, offset=0, count=None): 716 | super().__init__( 717 | StructReader(f, encoding), 718 | offset, 719 | U_INT if count is None else count, 720 | U_LONG_LONG, 721 | cache_size=512, 722 | ) 723 | 724 | def _read_item(self): 725 | key = self._file.read_text() 726 | bin_index = self._file.read_int() 727 | item_index = self._file.read_short() 728 | fragment = self._file.read_tiny_text() 729 | return Ref( 730 | key=key, bin_index=bin_index, item_index=item_index, fragment=fragment 731 | ) 732 | 733 | @lru_cache(maxsize=None) 734 | def as_dict(self, strength=TERTIARY, maxlength=None): 735 | return KeydItemDict(self, strength, maxlength=maxlength) 736 | 737 | 738 | class Bin(ItemList): 739 | 740 | def __init__(self, count, bin_bytes): 741 | super().__init__(StructReader(io.BytesIO(bin_bytes)), 0, count, U_INT) 742 | 743 | def _read_item(self): 744 | content_len = self._file.read_int() 745 | content = self._file.read(content_len) 746 | return content 747 | 748 | 749 | StoreItem = namedtuple("StoreItem", "content_type_ids compressed_content") 750 | 751 | 752 | class Store(ItemList): 753 | 754 | def __init__(self, file_, offset, decompress, content_types): 755 | super().__init__(StructReader(file_), offset, U_INT, U_LONG_LONG, 
cache_size=32) 756 | self.decompress = decompress 757 | self.content_types = content_types 758 | 759 | def _read_item(self): 760 | bin_item_count = self._file.read_int() 761 | packed_content_type_ids = self._file.read(bin_item_count * U_CHAR_SIZE) 762 | content_type_ids = [] 763 | for i in range(bin_item_count): 764 | content_type_id = unpack(U_CHAR, packed_content_type_ids[i : i + 1])[0] 765 | content_type_ids.append(content_type_id) 766 | content_length = self._file.read_int() 767 | content = self._file.read(content_length) 768 | return StoreItem(content_type_ids=content_type_ids, compressed_content=content) 769 | 770 | def _content_type(self, bin_index, item_index): 771 | store_item = self[bin_index] 772 | content_type_id = store_item.content_type_ids[item_index] 773 | content_type = self.content_types[content_type_id] 774 | return content_type, store_item 775 | 776 | def content_type(self, bin_index, item_index): 777 | return self._content_type(bin_index, item_index)[0] 778 | 779 | @lru_cache(maxsize=16) 780 | def _decompress(self, bin_index): 781 | store_item = self[bin_index] 782 | return self.decompress(store_item.compressed_content) 783 | 784 | def get(self, bin_index, item_index): 785 | content_type, store_item = self._content_type(bin_index, item_index) 786 | content = self._decompress(bin_index) 787 | count = len(store_item.content_type_ids) 788 | store_bin = Bin(count, content) 789 | content = store_bin[item_index] 790 | return (content_type, content) 791 | 792 | 793 | def find(word, slobs, match_prefix=True): 794 | seen = set() 795 | if isinstance(slobs, Slob): 796 | slobs = [slobs] 797 | 798 | variants = [] 799 | 800 | for strength in (QUATERNARY, TERTIARY, SECONDARY, PRIMARY): 801 | variants.append((strength, None)) 802 | 803 | if match_prefix: 804 | for strength in (QUATERNARY, TERTIARY, SECONDARY, PRIMARY): 805 | variants.append((strength, sortkey_length(strength, word))) 806 | 807 | for strength, maxlength in variants: 808 | for slob in slobs: 809 | d = slob.as_dict(strength=strength, maxlength=maxlength) 810 | for item in d[word]: 811 | dedup_key = (slob.id, item.id, item.fragment) 812 | if dedup_key in seen: 813 | continue 814 | else: 815 | seen.add(dedup_key) 816 | yield slob, item 817 | 818 | 819 | WriterEvent = namedtuple("WriterEvent", "name data") 820 | 821 | 822 | class KeyTooLongException(Exception): 823 | 824 | @property 825 | def key(self): 826 | return self.args[0] 827 | 828 | 829 | class Writer(object): 830 | 831 | def __init__( 832 | self, 833 | filename, 834 | workdir=None, 835 | encoding=UTF8, 836 | compression=DEFAULT_COMPRESSION, 837 | min_bin_size=512 * 1024, 838 | max_redirects=5, 839 | observer=None, 840 | ): 841 | self.filename = filename 842 | self.observer = observer 843 | if os.path.exists(self.filename): 844 | raise SystemExit("File %r already exists" % self.filename) 845 | 846 | # make sure we can write 847 | with fopen(self.filename, "wb"): 848 | pass 849 | 850 | self.encoding = encoding 851 | 852 | if encodings.search_function(self.encoding) is None: 853 | raise UnknownEncoding(self.encoding) 854 | 855 | self.workdir = workdir 856 | 857 | self.tmpdir = tmpdir = tempfile.TemporaryDirectory( 858 | prefix="{0}-".format(os.path.basename(filename)), dir=workdir 859 | ) 860 | 861 | self.f_ref_positions = self._wbfopen("ref-positions") 862 | self.f_store_positions = self._wbfopen("store-positions") 863 | self.f_refs = self._wbfopen("refs") 864 | self.f_store = self._wbfopen("store") 865 | 866 | self.max_redirects = max_redirects 867 | if 
max_redirects: 868 | self.aliases_path = os.path.join(tmpdir.name, "aliases") 869 | self.f_aliases = Writer( 870 | self.aliases_path, 871 | workdir=tmpdir.name, 872 | max_redirects=0, 873 | compression=None, 874 | ) 875 | 876 | if compression is None: 877 | compression = "" 878 | if not compression in COMPRESSIONS: 879 | raise UnknownCompression(compression) 880 | else: 881 | self.compress = COMPRESSIONS[compression].compress 882 | 883 | self.compression = compression 884 | self.content_types = {} 885 | 886 | self.min_bin_size = min_bin_size 887 | 888 | self.current_bin = None 889 | 890 | self.blob_count = 0 891 | self.ref_count = 0 892 | self.bin_count = 0 893 | self._tags = { 894 | "version.python": sys.version.replace("\n", " "), 895 | "version.pyicu": icu.VERSION, 896 | "version.icu": icu.ICU_VERSION, 897 | "created.at": datetime.now(timezone.utc).isoformat(), 898 | } 899 | self.tags = MappingProxyType(self._tags) 900 | 901 | def _wbfopen(self, name): 902 | return StructWriter( 903 | fopen(os.path.join(self.tmpdir.name, name), "wb"), encoding=self.encoding 904 | ) 905 | 906 | def tag(self, name, value=""): 907 | if len(name.encode(self.encoding)) > MAX_TINY_TEXT_LEN: 908 | self._fire_event("tag_name_too_long", (name, value)) 909 | return 910 | 911 | if len(value.encode(self.encoding)) > MAX_TINY_TEXT_LEN: 912 | self._fire_event("tag_value_too_long", (name, value)) 913 | value = "" 914 | 915 | self._tags[name] = value 916 | 917 | def _split_key(self, key): 918 | if isinstance(key, str): 919 | actual_key = key 920 | fragment = "" 921 | else: 922 | actual_key, fragment = key 923 | if len(actual_key) > MAX_TEXT_LEN or len(fragment) > MAX_TINY_TEXT_LEN: 924 | raise KeyTooLongException(key) 925 | return actual_key, fragment 926 | 927 | def add(self, blob, *keys, content_type=""): 928 | 929 | if len(blob) > MAX_LARGE_BYTE_STRING_LEN: 930 | self._fire_event("content_too_long", blob) 931 | return 932 | 933 | if len(content_type) > MAX_TEXT_LEN: 934 | self._fire_event("content_type_too_long", content_type) 935 | return 936 | 937 | actual_keys = [] 938 | 939 | for key in keys: 940 | try: 941 | actual_key, fragment = self._split_key(key) 942 | except KeyTooLongException as e: 943 | self._fire_event("key_too_long", e.key) 944 | else: 945 | actual_keys.append((actual_key, fragment)) 946 | 947 | if len(actual_keys) == 0: 948 | return 949 | 950 | if self.current_bin is None: 951 | self.current_bin = BinMemWriter() 952 | self.bin_count += 1 953 | 954 | if content_type not in self.content_types: 955 | self.content_types[content_type] = len(self.content_types) 956 | 957 | self.current_bin.add(self.content_types[content_type], blob) 958 | self.blob_count += 1 959 | bin_item_index = len(self.current_bin) - 1 960 | bin_index = self.bin_count - 1 961 | 962 | for actual_key, fragment in actual_keys: 963 | self._write_ref(actual_key, bin_index, bin_item_index, fragment) 964 | 965 | if ( 966 | self.current_bin.current_offset > self.min_bin_size 967 | or len(self.current_bin) == MAX_BIN_ITEM_COUNT 968 | ): 969 | self._write_current_bin() 970 | 971 | def add_alias(self, key, target_key): 972 | if self.max_redirects: 973 | try: 974 | self._split_key(key) 975 | except KeyTooLongException as e: 976 | self._fire_event("alias_too_long", e.key) 977 | return 978 | try: 979 | self._split_key(target_key) 980 | except KeyTooLongException as e: 981 | self._fire_event("alias_target_too_long", e.key) 982 | return 983 | self.f_aliases.add(pickle.dumps(target_key), key) 984 | else: 985 | raise NotImplementedError() 986 | 987 | 
def _fire_event(self, name, data=None): 988 | if self.observer: 989 | self.observer(WriterEvent(name, data)) 990 | 991 | def _write_current_bin(self): 992 | self.f_store_positions.write_long(self.f_store.tell()) 993 | self.current_bin.finalize(self.f_store, self.compress) 994 | self.current_bin = None 995 | 996 | def _write_ref(self, key, bin_index, item_index, fragment=""): 997 | self.f_ref_positions.write_long(self.f_refs.tell()) 998 | self.f_refs.write_text(key) 999 | self.f_refs.write_int(bin_index) 1000 | self.f_refs.write_short(item_index) 1001 | self.f_refs.write_tiny_text(fragment) 1002 | self.ref_count += 1 1003 | 1004 | def _sort(self): 1005 | self._fire_event("begin_sort") 1006 | f_ref_positions_sorted = self._wbfopen("ref-positions-sorted") 1007 | self.f_refs.flush() 1008 | self.f_ref_positions.close() 1009 | with MultiFileReader(self.f_ref_positions.name, self.f_refs.name) as f: 1010 | ref_list = RefList(f, self.encoding, count=self.ref_count) 1011 | sortkey_func = sortkey(IDENTICAL) 1012 | for i in sorted( 1013 | range(len(ref_list)), key=lambda j: sortkey_func(ref_list[j].key) 1014 | ): 1015 | ref_pos = ref_list.pos(i) 1016 | f_ref_positions_sorted.write_long(ref_pos) 1017 | f_ref_positions_sorted.close() 1018 | os.remove(self.f_ref_positions.name) 1019 | os.rename(f_ref_positions_sorted.name, self.f_ref_positions.name) 1020 | self.f_ref_positions = StructWriter( 1021 | fopen(self.f_ref_positions.name, "ab"), encoding=self.encoding 1022 | ) 1023 | self._fire_event("end_sort") 1024 | 1025 | def _resolve_aliases(self): 1026 | self._fire_event("begin_resolve_aliases") 1027 | self.f_aliases.finalize() 1028 | with MultiFileReader(self.f_ref_positions.name, self.f_refs.name) as f_ref_list: 1029 | ref_list = RefList(f_ref_list, self.encoding, count=self.ref_count) 1030 | ref_dict = ref_list.as_dict() 1031 | with open(self.aliases_path) as r: 1032 | aliases = r.as_dict() 1033 | path = os.path.join(self.tmpdir.name, "resolved-aliases") 1034 | with create( 1035 | path, workdir=self.tmpdir.name, max_redirects=0, compression=None 1036 | ) as alias_writer: 1037 | 1038 | def read_key_frag(item, default_fragment): 1039 | key_frag = pickle.loads(item.content) 1040 | if isinstance(key_frag, str): 1041 | return key_frag, default_fragment 1042 | else: 1043 | return key_frag 1044 | 1045 | for item in r: 1046 | from_key = item.key 1047 | keys = set() 1048 | keys.add(from_key) 1049 | to_key, fragment = read_key_frag(item, item.fragment) 1050 | count = 0 1051 | while count <= self.max_redirects: 1052 | # is target key itself a redirect? 
1053 | try: 1054 | orig_to_key = to_key 1055 | to_key, fragment = read_key_frag( 1056 | next(aliases[to_key]), fragment 1057 | ) 1058 | count += 1 1059 | keys.add(orig_to_key) 1060 | except StopIteration: 1061 | break 1062 | if count > self.max_redirects: 1063 | self._fire_event("too_many_redirects", from_key) 1064 | try: 1065 | target_ref = next(ref_dict[to_key]) 1066 | except StopIteration: 1067 | self._fire_event("alias_target_not_found", to_key) 1068 | else: 1069 | for key in keys: 1070 | ref = Ref( 1071 | key=key, 1072 | bin_index=target_ref.bin_index, 1073 | item_index=target_ref.item_index, 1074 | # last fragment in the chain wins 1075 | fragment=target_ref.fragment or fragment, 1076 | ) 1077 | alias_writer.add(pickle.dumps(ref), key) 1078 | 1079 | with open(path) as resolved_aliases_reader: 1080 | previous = None 1081 | targets = set() 1082 | 1083 | for item in resolved_aliases_reader: 1084 | ref = pickle.loads(item.content) 1085 | if previous is not None and ref.key != previous.key: 1086 | for bin_index, item_index, fragment in sorted(targets): 1087 | self._write_ref(previous.key, bin_index, item_index, fragment) 1088 | targets.clear() 1089 | targets.add((ref.bin_index, ref.item_index, ref.fragment)) 1090 | previous = ref 1091 | 1092 | for bin_index, item_index, fragment in sorted(targets): 1093 | self._write_ref(previous.key, bin_index, item_index, fragment) 1094 | 1095 | self._sort() 1096 | self._fire_event("end_resolve_aliases") 1097 | 1098 | def finalize(self): 1099 | self._fire_event("begin_finalize") 1100 | if not self.current_bin is None: 1101 | self._write_current_bin() 1102 | 1103 | self._sort() 1104 | if self.max_redirects: 1105 | self._resolve_aliases() 1106 | 1107 | files = ( 1108 | self.f_ref_positions, 1109 | self.f_refs, 1110 | self.f_store_positions, 1111 | self.f_store, 1112 | ) 1113 | 1114 | for f in files: 1115 | f.close() 1116 | 1117 | buf_size = 10 * 1024 * 1024 1118 | 1119 | with fopen(self.filename, mode="wb") as output_file: 1120 | out = StructWriter(output_file, self.encoding) 1121 | out.write(MAGIC) 1122 | out.write(uuid4().bytes) 1123 | out.write_tiny_text(self.encoding, encoding=UTF8) 1124 | out.write_tiny_text(self.compression) 1125 | 1126 | def write_tags(tags, f): 1127 | f.write(pack(U_CHAR, len(tags))) 1128 | for key, value in tags.items(): 1129 | f.write_tiny_text(key) 1130 | f.write_tiny_text(value, editable=True) 1131 | 1132 | write_tags(self.tags, out) 1133 | 1134 | def write_content_types(content_types, f): 1135 | count = len(content_types) 1136 | f.write(pack(U_CHAR, count)) 1137 | types = sorted(content_types.items(), key=lambda x: x[1]) 1138 | for content_type, _ in types: 1139 | f.write_text(content_type) 1140 | 1141 | write_content_types(self.content_types, out) 1142 | 1143 | out.write_int(self.blob_count) 1144 | store_offset = ( 1145 | out.tell() 1146 | + U_LONG_LONG_SIZE # this value 1147 | + U_LONG_LONG_SIZE # file size value 1148 | + U_INT_SIZE # ref count value 1149 | + os.stat(self.f_ref_positions.name).st_size 1150 | + os.stat(self.f_refs.name).st_size 1151 | ) 1152 | out.write_long(store_offset) 1153 | out.flush() 1154 | 1155 | file_size = ( 1156 | out.tell() # bytes written so far 1157 | + U_LONG_LONG_SIZE # file size value 1158 | + 2 * U_INT_SIZE 1159 | ) # ref count and bin count 1160 | file_size += sum((os.stat(f.name).st_size for f in files)) 1161 | out.write_long(file_size) 1162 | 1163 | def mv(src, out): 1164 | fname = src.name 1165 | self._fire_event("begin_move", fname) 1166 | with fopen(fname, mode="rb") as f: 1167 | 
while True: 1168 | data = f.read(buf_size) 1169 | if len(data) == 0: 1170 | break 1171 | out.write(data) 1172 | out.flush() 1173 | os.remove(fname) 1174 | self._fire_event("end_move", fname) 1175 | 1176 | out.write_int(self.ref_count) 1177 | mv(self.f_ref_positions, out) 1178 | mv(self.f_refs, out) 1179 | 1180 | out.write_int(self.bin_count) 1181 | mv(self.f_store_positions, out) 1182 | mv(self.f_store, out) 1183 | 1184 | self.tmpdir.cleanup() 1185 | self._fire_event("end_finalize") 1186 | 1187 | def size_header(self): 1188 | size = 0 1189 | size += len(MAGIC) 1190 | size += 16 # uuid bytes 1191 | size += U_CHAR_SIZE + len(self.encoding.encode(UTF8)) 1192 | size += U_CHAR_SIZE + len(self.compression.encode(self.encoding)) 1193 | 1194 | size += U_CHAR_SIZE # tag length 1195 | size += U_CHAR_SIZE # content types count 1196 | 1197 | # tags and content types themselves counted elsewhere 1198 | 1199 | size += U_INT_SIZE # blob count 1200 | size += U_LONG_LONG_SIZE # store offset 1201 | size += U_LONG_LONG_SIZE # file size 1202 | size += U_INT_SIZE # ref count 1203 | size += U_INT_SIZE # bin count 1204 | 1205 | return size 1206 | 1207 | def size_tags(self): 1208 | size = 0 1209 | for key, _ in self.tags.items(): 1210 | size += U_CHAR_SIZE + len(key.encode(self.encoding)) 1211 | size += 255 1212 | return size 1213 | 1214 | def size_content_types(self): 1215 | size = 0 1216 | for content_type in self.content_types: 1217 | size += U_CHAR_SIZE + len(content_type.encode(self.encoding)) 1218 | return size 1219 | 1220 | def size_data(self): 1221 | files = ( 1222 | self.f_ref_positions, 1223 | self.f_refs, 1224 | self.f_store_positions, 1225 | self.f_store, 1226 | ) 1227 | return sum((os.stat(f.name).st_size for f in files)) 1228 | 1229 | def __enter__(self): 1230 | return self 1231 | 1232 | def __exit__(self, exc_type, exc_val, exc_tb): 1233 | self.finalize() 1234 | return False 1235 | 1236 | 1237 | class TestReadWrite(unittest.TestCase): 1238 | 1239 | def setUp(self): 1240 | 1241 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1242 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1243 | 1244 | with create(self.path) as w: 1245 | 1246 | self.tags = {"a": "abc", "bb": "xyz123", "ccc": "lkjlk"} 1247 | for name, value in self.tags.items(): 1248 | w.tag(name, value) 1249 | 1250 | self.tag2 = "bb", "xyz123" 1251 | 1252 | self.blob_encoding = "ascii" 1253 | 1254 | self.data = [ 1255 | (("c", "cc", "ccc"), MIME_TEXT, "Hello C 1"), 1256 | ("a", MIME_TEXT, "Hello A 12"), 1257 | ("z", MIME_TEXT, "Hello Z 123"), 1258 | ("b", MIME_TEXT, "Hello B 1234"), 1259 | ("d", MIME_TEXT, "Hello D 12345"), 1260 | ("uuu", MIME_HTML, "Hello U!"), 1261 | ((("yy", "frag1"),), MIME_HTML, '
<h1 id="frag1">Section 1</h1>
'), 1262 | ] 1263 | 1264 | self.all_keys = [] 1265 | 1266 | self.data_as_dict = {} 1267 | 1268 | for k, t, v in self.data: 1269 | if isinstance(k, str): 1270 | k = (k,) 1271 | for key in k: 1272 | if isinstance(key, tuple): 1273 | key, fragment = key 1274 | else: 1275 | fragment = "" 1276 | self.all_keys.append(key) 1277 | self.data_as_dict[key] = (t, v, fragment) 1278 | w.add(v.encode(self.blob_encoding), *k, content_type=t) 1279 | self.all_keys.sort() 1280 | 1281 | self.w = w 1282 | 1283 | def test_header(self): 1284 | with MultiFileReader(self.path) as f: 1285 | header = read_header(f) 1286 | 1287 | for key, value in self.tags.items(): 1288 | self.assertEqual(header.tags[key], value) 1289 | 1290 | self.assertEqual(self.w.encoding, UTF8) 1291 | self.assertEqual(header.encoding, self.w.encoding) 1292 | 1293 | self.assertEqual(header.compression, self.w.compression) 1294 | 1295 | for i, content_type in enumerate(header.content_types): 1296 | self.assertEqual(self.w.content_types[content_type], i) 1297 | 1298 | self.assertEqual(header.blob_count, len(self.data)) 1299 | 1300 | def test_content(self): 1301 | with open(self.path) as r: 1302 | self.assertEqual(len(r), len(self.all_keys)) 1303 | self.assertRaises(IndexError, r.__getitem__, len(self.all_keys)) 1304 | for i, item in enumerate(r): 1305 | self.assertEqual(item.key, self.all_keys[i]) 1306 | content_type, value, fragment = self.data_as_dict[item.key] 1307 | self.assertEqual(item.content_type, content_type) 1308 | self.assertEqual(item.content.decode(self.blob_encoding), value) 1309 | self.assertEqual(item.fragment, fragment) 1310 | 1311 | def tearDown(self): 1312 | self.tmpdir.cleanup() 1313 | 1314 | 1315 | class TestSort(unittest.TestCase): 1316 | 1317 | def setUp(self): 1318 | 1319 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1320 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1321 | 1322 | with create(self.path) as w: 1323 | 1324 | data = [ 1325 | "Ф, ф", 1326 | "Ф ф", 1327 | "Ф", 1328 | "Э", 1329 | "Е е", 1330 | "г", 1331 | "н", 1332 | "ф", 1333 | "а", 1334 | "Ф, Ф", 1335 | "е", 1336 | "Е", 1337 | "Ее", 1338 | "ё", 1339 | "Ё", 1340 | "Её", 1341 | "Е ё", 1342 | "А", 1343 | "э", 1344 | "ы", 1345 | ] 1346 | 1347 | self.data_sorted = sorted(data, key=sortkey(IDENTICAL)) 1348 | 1349 | for k in data: 1350 | v = ";".join(unicodedata.name(c) for c in k) 1351 | w.add(v.encode("ascii"), k) 1352 | 1353 | self.r = open(self.path) 1354 | 1355 | def test_sort_order(self): 1356 | for i in range(len(self.r)): 1357 | self.assertEqual(self.r[i].key, self.data_sorted[i]) 1358 | 1359 | def tearDown(self): 1360 | self.r.close() 1361 | self.tmpdir.cleanup() 1362 | 1363 | 1364 | class TestFind(unittest.TestCase): 1365 | 1366 | def setUp(self): 1367 | 1368 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1369 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1370 | 1371 | with create(self.path) as w: 1372 | data = [ 1373 | "Cc", 1374 | "aA", 1375 | "aa", 1376 | "Aa", 1377 | "Bb", 1378 | "cc", 1379 | "Äā", 1380 | "ăÀ", 1381 | "a\u00A0a", 1382 | "a-a", 1383 | "a\u2019a", 1384 | "a\u2032a", 1385 | "a,a", 1386 | "a a", 1387 | ] 1388 | 1389 | for k in data: 1390 | v = ";".join(unicodedata.name(c) for c in k) 1391 | w.add(v.encode("ascii"), k) 1392 | 1393 | self.r = open(self.path) 1394 | 1395 | def get(self, d, key): 1396 | return list(item.content.decode("ascii") for item in d[key]) 1397 | 1398 | def test_find_identical(self): 1399 | d = self.r.as_dict(IDENTICAL) 1400 | self.assertEqual( 1401 | self.get(d, "aa"), 
["LATIN SMALL LETTER A;LATIN SMALL LETTER A"] 1402 | ) 1403 | self.assertEqual( 1404 | self.get(d, "a-a"), 1405 | ["LATIN SMALL LETTER A;HYPHEN-MINUS;LATIN SMALL LETTER A"], 1406 | ) 1407 | self.assertEqual( 1408 | self.get(d, "aA"), ["LATIN SMALL LETTER A;LATIN CAPITAL LETTER A"] 1409 | ) 1410 | self.assertEqual( 1411 | self.get(d, "Äā"), 1412 | [ 1413 | "LATIN CAPITAL LETTER A WITH DIAERESIS;" 1414 | "LATIN SMALL LETTER A WITH MACRON" 1415 | ], 1416 | ) 1417 | self.assertEqual( 1418 | self.get(d, "a a"), ["LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A"] 1419 | ) 1420 | 1421 | def test_find_quaternary(self): 1422 | d = self.r.as_dict(QUATERNARY) 1423 | self.assertEqual( 1424 | self.get(d, "a\u2032a"), ["LATIN SMALL LETTER A;PRIME;LATIN SMALL LETTER A"] 1425 | ) 1426 | self.assertEqual( 1427 | self.get(d, "a a"), 1428 | [ 1429 | "LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A", 1430 | "LATIN SMALL LETTER A;NO-BREAK SPACE;LATIN SMALL LETTER A", 1431 | ], 1432 | ) 1433 | 1434 | def test_find_tertiary(self): 1435 | d = self.r.as_dict(TERTIARY) 1436 | self.assertEqual( 1437 | self.get(d, "aa"), 1438 | [ 1439 | "LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A", 1440 | "LATIN SMALL LETTER A;NO-BREAK SPACE;LATIN SMALL LETTER A", 1441 | "LATIN SMALL LETTER A;HYPHEN-MINUS;LATIN SMALL LETTER A", 1442 | "LATIN SMALL LETTER A;COMMA;LATIN SMALL LETTER A", 1443 | "LATIN SMALL LETTER A;RIGHT SINGLE QUOTATION MARK;LATIN SMALL LETTER A", 1444 | "LATIN SMALL LETTER A;PRIME;LATIN SMALL LETTER A", 1445 | "LATIN SMALL LETTER A;LATIN SMALL LETTER A", 1446 | ], 1447 | ) 1448 | 1449 | def test_find_secondary(self): 1450 | d = self.r.as_dict(SECONDARY) 1451 | self.assertEqual( 1452 | self.get(d, "aa"), 1453 | [ 1454 | "LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A", 1455 | "LATIN SMALL LETTER A;NO-BREAK SPACE;LATIN SMALL LETTER A", 1456 | "LATIN SMALL LETTER A;HYPHEN-MINUS;LATIN SMALL LETTER A", 1457 | "LATIN SMALL LETTER A;COMMA;LATIN SMALL LETTER A", 1458 | "LATIN SMALL LETTER A;RIGHT SINGLE QUOTATION MARK;LATIN SMALL LETTER A", 1459 | "LATIN SMALL LETTER A;PRIME;LATIN SMALL LETTER A", 1460 | "LATIN SMALL LETTER A;LATIN SMALL LETTER A", 1461 | "LATIN SMALL LETTER A;LATIN CAPITAL LETTER A", 1462 | "LATIN CAPITAL LETTER A;LATIN SMALL LETTER A", 1463 | ], 1464 | ) 1465 | 1466 | def test_find_primary(self): 1467 | d = self.r.as_dict(PRIMARY) 1468 | 1469 | self.assertEqual( 1470 | self.get(d, "aa"), 1471 | [ 1472 | "LATIN SMALL LETTER A;SPACE;LATIN SMALL LETTER A", 1473 | "LATIN SMALL LETTER A;NO-BREAK SPACE;LATIN SMALL LETTER A", 1474 | "LATIN SMALL LETTER A;HYPHEN-MINUS;LATIN SMALL LETTER A", 1475 | "LATIN SMALL LETTER A;COMMA;LATIN SMALL LETTER A", 1476 | "LATIN SMALL LETTER A;RIGHT SINGLE QUOTATION MARK;LATIN SMALL LETTER A", 1477 | "LATIN SMALL LETTER A;PRIME;LATIN SMALL LETTER A", 1478 | "LATIN SMALL LETTER A;LATIN SMALL LETTER A", 1479 | "LATIN SMALL LETTER A;LATIN CAPITAL LETTER A", 1480 | "LATIN CAPITAL LETTER A;LATIN SMALL LETTER A", 1481 | "LATIN SMALL LETTER A WITH BREVE;LATIN CAPITAL LETTER A WITH GRAVE", 1482 | "LATIN CAPITAL LETTER A WITH DIAERESIS;LATIN SMALL LETTER A WITH MACRON", 1483 | ], 1484 | ) 1485 | 1486 | def tearDown(self): 1487 | self.r.close() 1488 | self.tmpdir.cleanup() 1489 | 1490 | 1491 | class TestPrefixFind(unittest.TestCase): 1492 | 1493 | def setUp(self): 1494 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1495 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1496 | self.data = ["a", "ab", "abc", "abcd", "abcde"] 1497 | with create(self.path) as w: 1498 | 
for k in self.data: 1499 | w.add(k.encode("ascii"), k) 1500 | 1501 | def tearDown(self): 1502 | self.tmpdir.cleanup() 1503 | 1504 | def test(self): 1505 | with open(self.path) as r: 1506 | for i, k in enumerate(self.data): 1507 | d = r.as_dict(IDENTICAL, len(k)) 1508 | self.assertEqual( 1509 | list(v.content.decode("ascii") for v in d[k]), self.data[i:] 1510 | ) 1511 | 1512 | 1513 | class TestBestMatch(unittest.TestCase): 1514 | 1515 | def setUp(self): 1516 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1517 | self.path1 = os.path.join(self.tmpdir.name, "test1.slob") 1518 | self.path2 = os.path.join(self.tmpdir.name, "test2.slob") 1519 | 1520 | data1 = ["aa", "Aa", "a-a", "aabc", "Äā", "bb", "aa"] 1521 | data2 = ["aa", "aA", "āā", "a,a", "a-a", "aade", "Äā", "cc"] 1522 | 1523 | with create(self.path1) as w: 1524 | for key in data1: 1525 | w.add(b"", key) 1526 | 1527 | with create(self.path2) as w: 1528 | for key in data2: 1529 | w.add(b"", key) 1530 | 1531 | def test_best_match(self): 1532 | self.maxDiff = None 1533 | with open(self.path1) as s1, open(self.path2) as s2: 1534 | result = find("aa", [s1, s2], match_prefix=True) 1535 | actual = list((s.id, item.key) for s, item in result) 1536 | expected = [ 1537 | (s1.id, "aa"), 1538 | (s1.id, "aa"), 1539 | (s2.id, "aa"), 1540 | (s1.id, "a-a"), 1541 | (s2.id, "a-a"), 1542 | (s2.id, "a,a"), 1543 | (s1.id, "Aa"), 1544 | (s2.id, "aA"), 1545 | (s1.id, "Äā"), 1546 | (s2.id, "Äā"), 1547 | (s2.id, "āā"), 1548 | (s1.id, "aabc"), 1549 | (s2.id, "aade"), 1550 | ] 1551 | self.assertEqual(expected, actual) 1552 | 1553 | def tearDown(self): 1554 | self.tmpdir.cleanup() 1555 | 1556 | 1557 | class TestAlias(unittest.TestCase): 1558 | 1559 | def setUp(self): 1560 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1561 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1562 | 1563 | def tearDown(self): 1564 | self.tmpdir.cleanup() 1565 | 1566 | def test_alias(self): 1567 | too_many_redirects = [] 1568 | target_not_found = [] 1569 | 1570 | def observer(event): 1571 | if event.name == "too_many_redirects": 1572 | too_many_redirects.append(event.data) 1573 | elif event.name == "alias_target_not_found": 1574 | target_not_found.append(event.data) 1575 | 1576 | with create(self.path, observer=observer) as w: 1577 | data = ["z", "b", "q", "a", "u", "g", "p", "n"] 1578 | for k in data: 1579 | v = ";".join(unicodedata.name(c) for c in k) 1580 | w.add(v.encode("ascii"), k) 1581 | 1582 | w.add_alias("w", "u") 1583 | w.add_alias("small u", "u") 1584 | w.add_alias("y1", "y2") 1585 | w.add_alias("y2", "y3") 1586 | w.add_alias("y3", "z") 1587 | w.add_alias("ZZZ", "YYY") 1588 | 1589 | w.add_alias("l3", "l1") 1590 | w.add_alias("l1", "l2") 1591 | w.add_alias("l2", "l3") 1592 | 1593 | w.add_alias("a1", ("a", "a-frag1")) 1594 | w.add_alias("a2", "a1") 1595 | w.add_alias("a3", ("a2", "a-frag2")) 1596 | 1597 | w.add_alias("g1", "g") 1598 | w.add_alias("g2", ("g1", "g-frag1")) 1599 | 1600 | w.add_alias("n or p", "n") 1601 | w.add_alias("n or p", "p") 1602 | 1603 | self.assertEqual(too_many_redirects, ["l1", "l2", "l3"]) 1604 | self.assertEqual(target_not_found, ["l2", "l3", "l1", "YYY"]) 1605 | 1606 | with open(self.path) as r: 1607 | d = r.as_dict() 1608 | 1609 | def get(key): 1610 | return list(item.content.decode("ascii") for item in d[key]) 1611 | 1612 | self.assertEqual(get("w"), ["LATIN SMALL LETTER U"]) 1613 | self.assertEqual(get("small u"), ["LATIN SMALL LETTER U"]) 1614 | self.assertEqual(get("y1"), ["LATIN SMALL LETTER Z"]) 1615 | 
self.assertEqual(get("y2"), ["LATIN SMALL LETTER Z"]) 1616 | self.assertEqual(get("y3"), ["LATIN SMALL LETTER Z"]) 1617 | self.assertEqual(get("ZZZ"), []) 1618 | self.assertEqual(get("l1"), []) 1619 | self.assertEqual(get("l2"), []) 1620 | self.assertEqual(get("l3"), []) 1621 | 1622 | self.assertEqual(len(list(d["n or p"])), 2) 1623 | 1624 | item_a1 = next(d["a1"]) 1625 | self.assertEqual(item_a1.content, b"LATIN SMALL LETTER A") 1626 | self.assertEqual(item_a1.fragment, "a-frag1") 1627 | 1628 | item_a2 = next(d["a2"]) 1629 | self.assertEqual(item_a2.content, b"LATIN SMALL LETTER A") 1630 | self.assertEqual(item_a2.fragment, "a-frag1") 1631 | 1632 | item_a3 = next(d["a3"]) 1633 | self.assertEqual(item_a3.content, b"LATIN SMALL LETTER A") 1634 | self.assertEqual(item_a3.fragment, "a-frag1") 1635 | 1636 | item_g1 = next(d["g1"]) 1637 | self.assertEqual(item_g1.content, b"LATIN SMALL LETTER G") 1638 | self.assertEqual(item_g1.fragment, "") 1639 | 1640 | item_g2 = next(d["g2"]) 1641 | self.assertEqual(item_g2.content, b"LATIN SMALL LETTER G") 1642 | self.assertEqual(item_g2.fragment, "g-frag1") 1643 | 1644 | 1645 | class TestBlobId(unittest.TestCase): 1646 | 1647 | def test(self): 1648 | max_i = 2**32 - 1 1649 | max_j = 2**16 - 1 1650 | i_values = [0, max_i] + [random.randint(1, max_i - 1) for _ in range(100)] 1651 | j_values = [0, max_j] + [random.randint(1, max_j - 1) for _ in range(100)] 1652 | for i in i_values: 1653 | for j in j_values: 1654 | self.assertEqual(unmeld_ints(meld_ints(i, j)), (i, j)) 1655 | 1656 | 1657 | class TestMultiFileReader(unittest.TestCase): 1658 | 1659 | def setUp(self): 1660 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1661 | 1662 | def tearDown(self): 1663 | self.tmpdir.cleanup() 1664 | 1665 | def test_read_all(self): 1666 | fnames = [] 1667 | for name in "abcdef": 1668 | path = os.path.join(self.tmpdir.name, name) 1669 | fnames.append(path) 1670 | with fopen(path, "wb") as f: 1671 | f.write(name.encode(UTF8)) 1672 | with MultiFileReader(fnames) as m: 1673 | self.assertEqual(m.read().decode(UTF8), "abcdef") 1674 | 1675 | def test_seek_and_read(self): 1676 | 1677 | def mkfile(basename, content): 1678 | part = os.path.join(self.tmpdir.name, basename) 1679 | with fopen(part, "wb") as f: 1680 | f.write(content) 1681 | return part 1682 | 1683 | content = b"abc\nd\nefgh\nij" 1684 | part1 = mkfile("1", content[:4]) 1685 | part2 = mkfile("2", content[4:5]) 1686 | part3 = mkfile("3", content[5:]) 1687 | 1688 | with MultiFileReader(part1, part2, part3) as m: 1689 | self.assertEqual(m.size, len(content)) 1690 | m.seek(2) 1691 | self.assertEqual(m.read(2), content[2:4]) 1692 | m.seek(1) 1693 | self.assertEqual(m.read(len(content) - 2), content[1:-1]) 1694 | m.seek(-1, whence=io.SEEK_END) 1695 | self.assertEqual(m.read(10), content[-1:]) 1696 | 1697 | m.seek(4) 1698 | m.seek(-2, whence=io.SEEK_CUR) 1699 | self.assertEqual(m.read(3), content[2:5]) 1700 | 1701 | 1702 | class TestFormatErrors(unittest.TestCase): 1703 | 1704 | def setUp(self): 1705 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1706 | 1707 | def tearDown(self): 1708 | self.tmpdir.cleanup() 1709 | 1710 | def test_wrong_file_type(self): 1711 | name = os.path.join(self.tmpdir.name, "1") 1712 | with fopen(name, "wb") as f: 1713 | f.write(b"123") 1714 | self.assertRaises(UnknownFileFormat, open, name) 1715 | 1716 | def test_truncated_file(self): 1717 | name = os.path.join(self.tmpdir.name, "1") 1718 | 1719 | with create(name) as f: 1720 | f.add(b"123", "a") 1721 | f.add( 1722 | b"234", 1723 | 
"b", 1724 | ) 1725 | 1726 | with fopen(name, "rb") as f: 1727 | all_bytes = f.read() 1728 | 1729 | with fopen(name, "wb") as f: 1730 | f.write(all_bytes[:-1]) 1731 | 1732 | self.assertRaises(IncorrectFileSize, open, name) 1733 | 1734 | with fopen(name, "wb") as f: 1735 | f.write(all_bytes) 1736 | f.write(b"\n") 1737 | 1738 | self.assertRaises(IncorrectFileSize, open, name) 1739 | 1740 | 1741 | class TestFindParts(unittest.TestCase): 1742 | 1743 | def setUp(self): 1744 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1745 | 1746 | def tearDown(self): 1747 | self.tmpdir.cleanup() 1748 | 1749 | def test_find_parts(self): 1750 | names = [ 1751 | os.path.join(self.tmpdir.name, name) for name in ("abc-1", "abc-2", "abc-3") 1752 | ] 1753 | for name in names: 1754 | with fopen(name, "wb"): 1755 | pass 1756 | parts = find_parts(os.path.join(self.tmpdir.name, "abc")) 1757 | self.assertEqual(names, parts) 1758 | 1759 | 1760 | class TestTooLongText(unittest.TestCase): 1761 | 1762 | def setUp(self): 1763 | self.tmpdir = tempfile.TemporaryDirectory(prefix="test") 1764 | self.path = os.path.join(self.tmpdir.name, "test.slob") 1765 | 1766 | def tearDown(self): 1767 | self.tmpdir.cleanup() 1768 | 1769 | def test_too_long(self): 1770 | rejected_keys = [] 1771 | rejected_aliases = [] 1772 | rejected_alias_targets = [] 1773 | rejected_tags = [] 1774 | rejected_content_types = [] 1775 | 1776 | def observer(event): 1777 | if event.name == "key_too_long": 1778 | rejected_keys.append(event.data) 1779 | elif event.name == "alias_too_long": 1780 | rejected_aliases.append(event.data) 1781 | elif event.name == "alias_target_too_long": 1782 | rejected_alias_targets.append(event.data) 1783 | elif event.name == "tag_name_too_long": 1784 | rejected_tags.append(event.data) 1785 | elif event.name == "content_type_too_long": 1786 | rejected_content_types.append(event.data) 1787 | 1788 | long_tag_name = "t" * (MAX_TINY_TEXT_LEN + 1) 1789 | long_tag_value = "v" * (MAX_TINY_TEXT_LEN + 1) 1790 | long_content_type = "T" * (MAX_TEXT_LEN + 1) 1791 | long_key = "c" * (MAX_TEXT_LEN + 1) 1792 | long_frag = "d" * (MAX_TINY_TEXT_LEN + 1) 1793 | key_with_long_frag = ("d", long_frag) 1794 | tag_with_long_name = (long_tag_name, "t3 value") 1795 | tag_with_long_value = ("t1", long_tag_value) 1796 | long_alias = "f" * (MAX_TEXT_LEN + 1) 1797 | alias_with_long_frag = ("i", long_frag) 1798 | long_alias_target = long_key 1799 | long_alias_target_frag = key_with_long_frag 1800 | 1801 | with create(self.path, observer=observer) as w: 1802 | 1803 | w.tag(*tag_with_long_value) 1804 | w.tag("t2", "t2 value") 1805 | w.tag(*tag_with_long_name) 1806 | 1807 | data = ["a", "b", long_key, key_with_long_frag] 1808 | 1809 | for k in data: 1810 | if isinstance(k, str): 1811 | v = k.encode("ascii") 1812 | else: 1813 | v = "#".join(k).encode("ascii") 1814 | w.add(v, k) 1815 | 1816 | w.add_alias("e", "a") 1817 | w.add_alias(long_alias, "a") 1818 | w.add_alias(alias_with_long_frag, "a") 1819 | w.add_alias("g", long_alias_target) 1820 | w.add_alias("h", long_alias_target_frag) 1821 | 1822 | w.add(b"Hello", "hello", content_type=long_content_type) 1823 | 1824 | self.assertEqual(rejected_keys, [long_key, key_with_long_frag]) 1825 | self.assertEqual(rejected_aliases, [long_alias, alias_with_long_frag]) 1826 | self.assertEqual( 1827 | rejected_alias_targets, [long_alias_target, long_alias_target_frag] 1828 | ) 1829 | self.assertEqual(rejected_tags, [tag_with_long_name]) 1830 | self.assertEqual(rejected_content_types, [long_content_type]) 1831 | 1832 | with 
open(self.path) as r:
1833 |             self.assertEqual(r.tags["t2"], "t2 value")
1834 |             self.assertFalse(tag_with_long_name[0] in r.tags)
1835 |             self.assertTrue(tag_with_long_value[0] in r.tags)
1836 |             self.assertEqual(r.tags[tag_with_long_value[0]], "")
1837 |             d = r.as_dict()
1838 |             self.assertTrue("a" in d)
1839 |             self.assertTrue("b" in d)
1840 |             self.assertFalse(long_key in d)
1841 |             self.assertFalse(key_with_long_frag[0] in d)
1842 |             self.assertTrue("e" in d)
1843 |             self.assertFalse(long_alias in d)
1844 |             self.assertFalse("g" in d)
1845 | 
1846 |         self.assertRaises(ValueError, set_tag_value, self.path, "t1", "ы" * 128)
1847 | 
1848 | 
1849 | class TestEditTag(unittest.TestCase):
1850 | 
1851 |     def setUp(self):
1852 |         self.tmpdir = tempfile.TemporaryDirectory(prefix="test")
1853 |         self.path = os.path.join(self.tmpdir.name, "test.slob")
1854 |         with create(self.path) as w:
1855 |             w.tag("a", "123456")
1856 |             w.tag("b", "654321")
1857 | 
1858 |     def tearDown(self):
1859 |         self.tmpdir.cleanup()
1860 | 
1861 |     def test_edit_existing_tag(self):
1862 |         with open(self.path) as f:
1863 |             self.assertEqual(f.tags["a"], "123456")
1864 |             self.assertEqual(f.tags["b"], "654321")
1865 |         set_tag_value(self.path, "b", "efg")
1866 |         set_tag_value(self.path, "a", "xyz")
1867 |         with open(self.path) as f:
1868 |             self.assertEqual(f.tags["a"], "xyz")
1869 |             self.assertEqual(f.tags["b"], "efg")
1870 | 
1871 |     def test_edit_nonexisting_tag(self):
1872 |         self.assertRaises(TagNotFound, set_tag_value, self.path, "z", "abc")
1873 | 
1874 | 
1875 | class TestBinItemNumberLimit(unittest.TestCase):
1876 | 
1877 |     def setUp(self):
1878 |         self.tmpdir = tempfile.TemporaryDirectory(prefix="test")
1879 |         self.path = os.path.join(self.tmpdir.name, "test.slob")
1880 | 
1881 |     def tearDown(self):
1882 |         self.tmpdir.cleanup()
1883 | 
1884 |     def test_writing_more_than_max_number_of_bin_items(self):
1885 |         with create(self.path) as w:
1886 |             for _ in range(MAX_BIN_ITEM_COUNT + 2):
1887 |                 w.add(b"a", "a")
1888 |             self.assertEqual(w.bin_count, 2)
1889 | 
1890 | 
1891 | def _cli_info(args):
1892 |     from collections import OrderedDict
1893 | 
1894 |     with open(args.path) as s:
1895 |         h = s._header
1896 |         print("\n")
1897 |         info = OrderedDict(
1898 |             (
1899 |                 ("id", s.id),
1900 |                 ("encoding", h.encoding),
1901 |                 ("compression", h.compression),
1902 |                 ("blob count", s.blob_count),
1903 |                 ("ref count", len(s)),
1904 |             )
1905 |         )
1906 |         _print_title(args.path)
1907 |         _print_dict(info)
1908 |         print("\n")
1909 |         _print_title("CONTENT TYPES")
1910 |         for ct in h.content_types:
1911 |             print(ct)
1912 |         print("\n")
1913 |         _print_title("TAGS")
1914 |         _print_tags(s)
1915 |         print("\n")
1916 | 
1917 | 
1918 | def _print_title(title):
1919 |     print(title)
1920 |     print("-" * len(title))
1921 | 
1922 | 
1923 | def _cli_tag(args):
1924 |     tag_name = args.name
1925 |     if args.value:
1926 |         try:
1927 |             set_tag_value(args.filename, tag_name, args.value)
1928 |         except TagNotFound:
1929 |             print("No such tag")
1930 |     else:
1931 |         with open(args.filename) as s:
1932 |             _print_tags(s, tag_name)
1933 | 
1934 | 
1935 | def _print_tags(s, tag_name=None):
1936 |     if tag_name:
1937 |         try:
1938 |             value = s.tags[tag_name]
1939 |         except KeyError:
1940 |             print("No such tag")
1941 |         else:
1942 |             print(value)
1943 |     else:
1944 |         _print_dict(s.tags)
1945 | 
1946 | 
1947 | def _print_dict(d):
1948 |     max_key_len = 0
1949 |     for k, v in d.items():
1950 |         key_len = len(k)
1951 |         if key_len > max_key_len:
1952 |             max_key_len = key_len
1953 |     fmt_template = "{:>%s}: {}"
1954 |     fmt = fmt_template % max_key_len
1955 |     for k, v in d.items():
1956 |         print(fmt.format(k, v))
1957 | 
1958 | 
1959 | def _cli_find(args):
1960 |     with open(args.path) as s:
1961 |         match_prefix = not args.whole
1962 |         for i, item in enumerate(find(args.query, s, match_prefix=match_prefix)):
1963 |             if i == args.limit:  # stop before printing more than --limit results
1964 |                 break
1965 |             _, blob = item
1966 |             print(blob.id, blob.content_type, blob.key)
1967 | 
1968 | 
1969 | def _cli_aliases(args):
1970 |     word = args.query
1971 |     with open(args.path) as s:
1972 |         d = s.as_dict(strength=QUATERNARY)
1973 |         length = len(s)
1974 |         aliases = []
1975 | 
1976 |         def print_item(i, item):
1977 |             print(("\t {:>%s} {}" % len(str(length))).format(i, item.key))
1978 | 
1979 |         for blob in d[word]:
1980 |             print(blob.id, blob.content_type, blob.key)
1981 |             progress = ""
1982 |             for i, item in enumerate(s):
1983 |                 if i % 10000 == 0:
1984 |                     new_progress = "{:>4.1f}% ...".format(100 * i / length)
1985 |                     if new_progress != progress:
1986 |                         print(new_progress)
1987 |                         progress = new_progress
1988 | 
1989 |                 if item.id == blob.id:
1990 |                     aliases.append((i, item))
1991 |                     print_item(i, item)
1992 |             print("100%")
1993 |             header = "{} {}".format(blob.id, blob.content_type)
1994 |             print("=" * len(header), "\n")
1995 |             print(header)
1996 |             print("-" * len(header))
1997 |             for i, item in aliases:
1998 |                 print_item(i, item)
1999 | 
2000 | 
2001 | def _cli_get(args):
2002 |     with open(args.path) as s:
2003 |         _content_type, content = s.get(args.blob_id)
2004 |         sys.stdout.buffer.write(content)
2005 | 
2006 | 
2007 | def _p(i, *args, step=100, steps_per_line=50, fmt="{}"):
2008 |     line = steps_per_line * step
2009 |     if i % step == 0 and i != 0:
2010 |         sys.stdout.write(".")
2011 |     if i and i % line == 0:
2012 |         sys.stdout.write(fmt.format(*args))
2013 |     sys.stdout.flush()
2014 | 
2015 | 
2016 | def _cli_convert(args):
2017 |     import sys
2018 |     import time
2019 | 
2020 |     t0 = time.time()
2021 | 
2022 |     output_name = args.output
2023 | 
2024 |     output_dir = os.path.dirname(output_name)
2025 |     output_base = os.path.basename(output_name)
2026 |     output_base_noext, output_base_ext = os.path.splitext(output_base)
2027 | 
2028 |     DOT_SLOB = ".slob"
2029 | 
2030 |     if output_base_ext != DOT_SLOB:
2031 |         output_base = output_base + DOT_SLOB
2032 |         output_base_noext, output_base_ext = os.path.splitext(output_base)
2033 | 
2034 |     if args.split:
2035 |         output_dir = os.path.join(output_dir, output_base)
2036 |         if os.path.exists(output_dir):
2037 |             raise SystemExit("%r already exists" % output_dir)
2038 |         os.mkdir(output_dir)
2039 |     else:
2040 |         output_name = os.path.join(output_dir, output_base)
2041 |         if os.path.exists(output_name):
2042 |             raise SystemExit("%r already exists" % output_name)
2043 | 
2044 |     with open(args.path) as s:
2045 |         workdir = args.workdir
2046 |         encoding = args.encoding or s._header.encoding
2047 |         compression = (
2048 |             s._header.compression if args.compression is None else args.compression
2049 |         )
2050 |         min_bin_size = 1024 * args.min_bin_size
2051 |         split = 1024 * 1024 * args.split
2052 | 
2053 |         print("Mapping blobs to keys...")
2054 |         blob_to_refs = [  # per bin: item index -> array of ref indexes
2055 |             collections.defaultdict(lambda: array.array("L"))
2056 |             for _ in range(len(s._store))
2057 |         ]
2058 |         key_count = 0
2059 |         pp = functools.partial(_p, step=10000, fmt=" {:.2f}%/{:.2f}s\n")
2060 |         total_keys = len(s)
2061 |         blob_count = s.blob_count
2062 |         for i, ref in enumerate(s._refs):
2063 |             blob_to_refs[ref.bin_index][ref.item_index].append(i)
2064 |             key_count += 1
2065 |             pp(i, 100 * key_count / total_keys, time.time() - t0)
2066 | 
2067 |         print(
2068 |             "\nFound {} keys for {} blobs in {:.2f}s".format(
2069 |                 key_count, blob_count, time.time() - t0
2070 |             )
2071 |         )
2072 | 
2073 |         print("Converting...")
2074 |         Mb = 1024 * 1024
2075 |         pp = functools.partial(_p, step=100, fmt=" {:.2f}%/{:.2f}Mb/{:.2f}s\n")
2076 | 
2077 |         def mkout(output):
2078 |             w = create(
2079 |                 output,
2080 |                 workdir=workdir,
2081 |                 encoding=encoding,
2082 |                 compression=compression,
2083 |                 min_bin_size=min_bin_size,
2084 |             )
2085 |             for name, value in s.tags.items():
2086 |                 if name not in w.tags:
2087 |                     w.tag(name, value)
2088 |             size_header_and_tags = w.size_header() + w.size_tags()  # fixed per-volume overhead, counted toward the --split limit
2089 |             return w, size_header_and_tags
2090 | 
2091 |         def fin(t1, w, current_output):
2092 |             print(
2093 |                 "\nDone adding content to {1} in {0:.2f}s".format(
2094 |                     time.time() - t1, current_output
2095 |                 )
2096 |             )
2097 |             print("Finalizing...")
2098 |             w.finalize()
2099 |             return time.time(), None, 0
2100 | 
2101 |         t1, w, size_header_and_tags = time.time(), None, 0
2102 |         current_size = 0
2103 |         volume_count = 0
2104 |         current_output = output_name
2105 |         current = 0
2106 |         for j, store_item in enumerate(s._store):
2107 |             for i in range(len(store_item.content_type_ids)):
2108 |                 bin_index = j
2109 |                 item_index = i
2110 |                 current += 1
2111 |                 content_type, content = s._store.get(bin_index, item_index)
2112 |                 pp(
2113 |                     current,
2114 |                     100 * (current / blob_count),
2115 |                     current_size / Mb,
2116 |                     time.time() - t1,
2117 |                 )
2118 |                 keys = []
2119 |                 for k in blob_to_refs[bin_index][item_index]:
2120 |                     ref = s._refs[k]
2121 |                     keys.append((ref.key, ref.fragment))
2122 |                 if w is None:
2123 |                     volume_count += 1
2124 |                     if split:
2125 |                         current_output = os.path.join(
2126 |                             output_dir,
2127 |                             "".join(
2128 |                                 (
2129 |                                     output_base_noext,
2130 |                                     "-{:02d}".format(volume_count),
2131 |                                     output_base_ext,
2132 |                                 )
2133 |                             ),
2134 |                         )
2135 |                     w, size_header_and_tags = mkout(current_output)
2136 |                 w.add(content, *keys, content_type=content_type)
2137 |                 current_size = (
2138 |                     size_header_and_tags + w.size_content_types() + w.size_data()
2139 |                 )
2140 |                 if split and current_size + min_bin_size >= split:  # close this volume before it can exceed the size limit
2141 |                     t1, w, size_header_and_tags = fin(t1, w, current_output)
2142 |         if w:
2143 |             fin(t1, w, current_output)
2144 | 
2145 |     print("\nDone in {0:.2f}s".format(time.time() - t0))
2146 | 
2147 | 
2148 | def _arg_parser():
2149 |     parser = argparse.ArgumentParser()
2150 |     parent = argparse.ArgumentParser(add_help=False)
2151 |     parent.add_argument("path", help="Slob path")
2152 | 
2153 |     parents = [parent]
2154 | 
2155 |     subparsers = parser.add_subparsers(help="sub-command")
2156 | 
2157 |     parser_find = subparsers.add_parser(
2158 |         "find",
2159 |         parents=parents,
2160 |         help="Find keys",
2161 |     )
2162 |     parser_find.add_argument("query", help="Word to look up")
2163 |     parser_find.add_argument(
2164 |         "--whole", action="store_true", help="Match whole words only"
2165 |     )
2166 |     parser_find.add_argument(
2167 |         "-l",
2168 |         "--limit",
2169 |         type=int,
2170 |         default=10,
2171 |         help="Maximum number of results to display",
2172 |     )
2173 |     parser_find.set_defaults(func=_cli_find)
2174 | 
2175 |     parser_aliases = subparsers.add_parser(
2176 |         "aliases",
2177 |         parents=parents,
2178 |         help="Find blob aliases",
2179 |     )
2180 |     parser_aliases.add_argument("query", help="Word to look up")
2181 |     parser_aliases.set_defaults(func=_cli_aliases)
2182 | 
2183 |     parser_get = subparsers.add_parser(
2184 |         "get",
2185 |         parents=parents,
2186 |         help="Retrieve blob content",
2187 |     )
2188 |     parser_get.add_argument(
2189 |         "blob_id", type=int, help='Id of blob to retrieve (from output of "find")'
2190 |     )
2191 | 
2192 |     parser_get.set_defaults(func=_cli_get)
2193 | 
2194 |     parser_info = subparsers.add_parser(
2195 |         "info",
2196 |         parents=parents,
2197 |         help="Inspect slob and print basic information about it",
2198 |     )
2199 |     parser_info.set_defaults(func=_cli_info)
2200 | 
2201 |     parser_tag = subparsers.add_parser("tag", help="List tags, view or edit tag value")
2202 |     parser_tag.add_argument(
2203 |         "-n", "--name", default="", help="Name of tag to view or edit"
2204 |     )
2205 |     parser_tag.add_argument("-v", "--value", help="Tag value to set")
2206 |     parser_tag.add_argument(
2207 |         "filename", help="Slob file name (split files are not supported)"
2208 |     )
2209 |     parser_tag.set_defaults(func=_cli_tag)
2210 | 
2211 |     parser_convert = subparsers.add_parser(
2212 |         "convert",
2213 |         parents=parents,
2214 |         help=(
2215 |             "Create new slob with the same content "
2216 |             "but different encoding and compression parameters "
2217 |             "or split into multiple slobs"
2218 |         ),
2219 |     )
2220 |     parser_convert.add_argument("output", help="Name of slob file to create")
2221 | 
2222 |     parser_convert.add_argument(
2223 |         "--workdir",
2224 |         help="Directory where temporary files for conversion should be created",
2225 |     )
2226 | 
2227 |     parser_convert.add_argument(
2228 |         "-e",
2229 |         "--encoding",
2230 |         help="Text encoding for the output slob. Default: same as source",
2231 |     )
2232 | 
2233 |     parser_convert.add_argument(
2234 |         "-c",
2235 |         "--compression",
2236 |         help=(
2237 |             "Compression algorithm to use for the output slob. "
2238 |             "Default: same as source"
2239 |         ),
2240 |     )
2241 | 
2242 |     parser_convert.add_argument(
2243 |         "-b",
2244 |         "--min_bin_size",
2245 |         type=int,
2246 |         help=(
2247 |             "Minimum size of a storage bin to compress, in kilobytes. "
2248 |             "Default: %(default)s"
2249 |         ),
2250 |         default=384,
2251 |     )
2252 | 
2253 |     parser_convert.add_argument(
2254 |         "-s",
2255 |         "--split",
2256 |         type=int,
2257 |         help=(
2258 |             "Split output into multiple slob files no larger "
2259 |             "than the specified number of megabytes. "
2260 |             "Default: %(default)s (do not split)"
2261 |         ),
2262 |         default=0,
2263 |     )
2264 | 
2265 |     parser_convert.set_defaults(func=_cli_convert)
2266 | 
2267 |     return parser
2268 | 
2269 | 
2270 | def add_dir(
2271 |     slb, topdir, prefix="", include_only=None, mime_types=MIME_TYPES, print=print
2272 | ):
2273 |     print("Adding", topdir)
2274 |     for item in os.walk(topdir):
2275 |         dirpath, _dirnames, filenames = item
2276 |         for filename in filenames:
2277 |             full_path = os.path.join(dirpath, filename)
2278 |             rel_path = os.path.relpath(full_path, topdir)
2279 |             if include_only and not any(rel_path.startswith(x) for x in include_only):
2280 |                 print("SKIPPING (not included): {!a}".format(rel_path))
2281 |                 continue
2282 |             _, ext = os.path.splitext(filename)
2283 |             ext = ext.lstrip(os.path.extsep)
2284 |             content_type = mime_types.get(ext.lower())
2285 |             if not content_type:
2286 |                 print("SKIPPING (unknown content type): {!a}".format(rel_path))
2287 |             else:
2288 |                 with fopen(full_path, "rb") as f:
2289 |                     content = f.read()
2290 |                 key = prefix + rel_path
2291 |                 print("ADDING: {!a}".format(key))
2292 |                 try:
2293 |                     key.encode(slb.encoding)
2294 |                 except UnicodeEncodeError:
2295 |                     print("Failed to add, broken unicode in key: {!a}".format(key))
2296 |                 else:
2297 |                     slb.add(content, key, content_type=content_type)
2298 | 
2299 | 
2300 | import time
2301 | from datetime import timedelta
2302 | 
2303 | 
2304 | class SimpleTimingObserver(object):
2305 | 
2306 |     def __init__(self, p=None):
2307 | 
2308 |         if p is not None:
2309 |             self.p = p
2310 | 
2311 |         self.times = {}
2312 | 
2313 |     def p(self, text):
2314 |         sys.stdout.write(text)
2315 |         sys.stdout.flush()
2316 | 
2317 |     def begin(self, name):
2318 |         self.times[name] = time.time()
2319 | 
2320 |     def end(self, name):
2321 |         t0 = self.times.pop(name)
2322 |         dt = timedelta(seconds=int(time.time() - t0))
2323 |         return dt
2324 | 
2325 |     def __call__(self, e):
2326 |         p = self.p
2327 |         begin = self.begin
2328 |         end = self.end
2329 |         if e.name == "begin_finalize":
2330 |             p("\nFinished adding content in %s" % end("content"))
2331 |             p("\nFinalizing...")
2332 |             begin("finalize")
2333 |         elif e.name == "end_finalize":
2334 |             p("\nFinalized in %s" % end("finalize"))
2335 |         elif e.name == "begin_resolve_aliases":
2336 |             p("\nResolving aliases...")
2337 |             begin("aliases")
2338 |         elif e.name == "end_resolve_aliases":
2339 |             p("\nResolved aliases in %s" % end("aliases"))
2340 |         elif e.name == "begin_sort":
2341 |             p("\nSorting...")
2342 |             begin("sort")
2343 |         elif e.name == "end_sort":
2344 |             p(" sorted in %s" % end("sort"))
2345 | 
2346 | 
2347 | def main():
2348 |     parser = _arg_parser()
2349 |     args = parser.parse_args()
2350 |     if hasattr(args, "func"):
2351 |         args.func(args)
2352 |     else:
2353 |         parser.print_help()
2354 | 
2355 | 
2356 | if __name__ == "__main__":
2357 |     main()
2358 | 
--------------------------------------------------------------------------------