├── .gitignore
├── LICENSE.txt
├── MANIFEST.in
├── README.rst
├── docs
│   ├── similarities
│   │   └── simserver.rst
│   └── simserver.rst
├── ez_setup.py
├── setup.py
└── simserver
    ├── __init__.py
    ├── run_simserver.py
    ├── simserver.py
    └── test
        ├── __init__.py
        └── test_simserver.py

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
*.egg
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
                    GNU AFFERO GENERAL PUBLIC LICENSE
                       Version 3, 19 November 2007

 Copyright (C) 2007 Free Software Foundation, Inc.
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.

  The licenses for most software and other practical works are designed
to take away your freedom to share and change the works.  By contrast,
our General Public Licenses are intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.

  When we speak of free software, we are referring to freedom, not
price.  Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

  Developers that use our General Public Licenses protect your rights
with two steps: (1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.

  A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate.  Many developers of free software are heartened and
encouraged by the resulting cooperation.  However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public.

  The GNU Affero General Public License is designed specifically to
ensure that, in such cases, the modified source code becomes available
to the community.  It requires the operator of a network server to
provide the source code of the modified version running there to the
users of that server.  Therefore, public use of a modified version, on
a publicly accessible server, gives the public access to the source
code of the modified version.

  An older license, called the Affero General Public License and
published by Affero, was designed to accomplish similar goals.  This is
a different license, not a version of the Affero GPL, but Affero has
released a new version of the Affero GPL which permits relicensing under
this license.

  The precise terms and conditions for copying, distribution and
modification follow.

                       TERMS AND CONDITIONS

  0. Definitions.

  "This License" refers to version 3 of the GNU Affero General Public License.

  "Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

  "The Program" refers to any copyrightable work licensed under this
License.  Each licensee is addressed as "you".  "Licensees" and
"recipients" may be individuals or organizations.

  To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy.  The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.

  A "covered work" means either the unmodified Program or a work based
on the Program.

  To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy.  Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.

  To "convey" a work means any kind of propagation that enables other
parties to make or receive copies.  Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

  An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License.  If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.

  1. Source Code.

  The "source code" for a work means the preferred form of the work
for making modifications to it.  "Object code" means any non-source
form of a work.

  A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.

  The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form.  A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.

  The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities.  However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work.  For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.

  The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.

  The Corresponding Source for a work in source code form is that
same work.

  2. Basic Permissions.

  All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met.  This License explicitly affirms your unlimited
permission to run the unmodified Program.  The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work.  This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.

  You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force.  You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright.  Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.

  Conveying under any other circumstances is permitted solely under
the conditions stated below.  Sublicensing is not allowed; section 10
makes it unnecessary.

  3. Protecting Users' Legal Rights From Anti-Circumvention Law.

  No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.

  When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.

  4. Conveying Verbatim Copies.

  You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.

  You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.

  5. Conveying Modified Source Versions.

  You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:

    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.

    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".

    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.

    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.

  A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit.  Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.

  6. Conveying Non-Source Forms.

  You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:

    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.

    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.

    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.

    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.

    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.

  A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

  A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling.  In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage.  For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product.  A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

  "Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source.  The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

  If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information.  But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

  The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed.  Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

  Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

  7. Additional Terms.

  "Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law.  If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

  When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it.  (Additional permissions may be written to require their own
removal in certain cases when you modify the work.)  You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

  Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or

    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or

    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or

    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or

    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or

    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.

  All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10.  If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term.  If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

  If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

  Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

  8. Termination.

  You may not propagate or modify a covered work except as expressly
provided under this License.  Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

  However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

  Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

  Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License.  If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

  9. Acceptance Not Required for Having Copies.

  You are not required to accept this License in order to receive or
run a copy of the Program.  Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance.  However,
nothing other than this License grants you permission to propagate or
modify any covered work.  These actions infringe copyright if you do
not accept this License.  Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

  10. Automatic Licensing of Downstream Recipients.

  Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License.  You are not responsible
for enforcing compliance by third parties with this License.

  An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations.  If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

  You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License.  For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

  11. Patents.

  A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based.  The
work thus licensed is called the contributor's "contributor version".

  A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version.  For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

  Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

  In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement).  To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

  If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients.  "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

  If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

  A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License.  You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

  Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

  12. No Surrender of Others' Freedom.

  If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all.  For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

  13. Remote Network Interaction; Use with the GNU General Public License.

  Notwithstanding any other provision of this License, if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your version
supports such interaction) an opportunity to receive the Corresponding
Source of your version by providing access to the Corresponding Source
from a network server at no charge, through some standard or customary
means of facilitating copying of software.  This Corresponding Source
shall include the Corresponding Source for any work covered by version 3
of the GNU General Public License that is incorporated pursuant to the
following paragraph.

  Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU General Public License into a single
combined work, and to convey the resulting work.  The terms of this
License will continue to apply to the part which is the covered work,
but the work with which it is combined will remain governed by version
3 of the GNU General Public License.

  14. Revised Versions of this License.

  The Free Software Foundation may publish revised and/or new versions of
the GNU Affero General Public License from time to time.  Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

  Each version is given a distinguishing version number.  If the
Program specifies that a certain numbered version of the GNU Affero General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation.  If the Program does not specify a version number of the
GNU Affero General Public License, you may choose any version ever published
by the Free Software Foundation.

  If the Program specifies that a proxy can decide which future
versions of the GNU Affero General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

  Later license versions may give you additional or different
permissions.  However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

  15. Disclaimer of Warranty.

  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. Limitation of Liability.

  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

  17. Interpretation of Sections 15 and 16.

  If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
631 | 632 | &lt;one line to give the program's name and a brief idea of what it does.&gt; 633 | Copyright (C) &lt;year&gt; &lt;name of author&gt; 634 | 635 | This program is free software: you can redistribute it and/or modify 636 | it under the terms of the GNU Affero General Public License as published by 637 | the Free Software Foundation, either version 3 of the License, or 638 | (at your option) any later version. 639 | 640 | This program is distributed in the hope that it will be useful, 641 | but WITHOUT ANY WARRANTY; without even the implied warranty of 642 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 643 | GNU Affero General Public License for more details. 644 | 645 | You should have received a copy of the GNU Affero General Public License 646 | along with this program. If not, see &lt;http://www.gnu.org/licenses/&gt;. 647 | 648 | Also add information on how to contact you by electronic and paper mail. 649 | 650 | If your software can interact with users remotely through a computer 651 | network, you should also make sure that it provides a way for users to 652 | get its source. For example, if your program is a web application, its 653 | interface could display a "Source" link that leads users to an archive 654 | of the code. There are many ways you could offer source, and different 655 | solutions will be better for different programs; see section 13 for the 656 | specific requirements. 657 | 658 | You should also get your employer (if you work as a programmer) or school, 659 | if any, to sign a "copyright disclaimer" for the program, if necessary. 660 | For more information on this, and how to apply and follow the GNU AGPL, see 661 | &lt;http://www.gnu.org/licenses/&gt;. 662 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | recursive-include docs * 2 | recursive-include simserver/test/test_data * 3 | recursive-include . 
*.sh 4 | prune docs/src* 5 | include README.rst 6 | include CHANGELOG.txt 7 | include COPYING 8 | include COPYING.LESSER 9 | include LICENSE.txt 10 | include ez_setup.py 11 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ================================================== 2 | simserver -- document similarity server in Python 3 | ================================================== 4 | 5 | 6 | Index plain text documents and query the index for semantically related documents. 7 | 8 | Simserver uses transactions internally to provide a robust and scalable similarity server. 9 | 10 | 11 | Installation 12 | ------------ 13 | 14 | Simserver builds on the `gensim `_ framework for 15 | topic modelling. 16 | 17 | The simple way to install `simserver` is with:: 18 | 19 | sudo easy_install -U simserver 20 | 21 | Or, if you have instead downloaded and unzipped the `source tar.gz `_ package, 22 | you'll need to run:: 23 | 24 | python setup.py test 25 | sudo python setup.py install 26 | 27 | This version has been tested under Python 2.5 and 2.7, but should run on any 2.5 <= Python < 3.0. 28 | 29 | Documentation 30 | ------------- 31 | 32 | See http://radimrehurek.com/gensim/simserver.html. More coming soon. 33 | 34 | Licensing 35 | ---------------- 36 | 37 | Simserver is released under the `GNU Affero GPL license v3 `_. 38 | 39 | This means you may use simserver freely in your application (even a commercial application!), 40 | but you **must then open-source your application as well**, under an AGPL-compatible license. 41 | 42 | The AGPL license makes sure that this applies even when you make your application 43 | available only remotely (such as through the web).
44 | 45 | TL;DR: **simserver is open-source, but you have to contact me for any proprietary use.** 46 | 47 | History 48 | ------------- 49 | 50 | 0.1.4: 51 | * performance improvements to sharding 52 | * change to threading model -- removed restriction on per-thread session access 53 | * bug fix in index optimize() 54 | 55 | 0.1.3: 56 | * changed behaviour for very few training documents: instead of latent semantic analysis, use a simpler log-entropy model 57 | * fixed bug with leaking SQLite file descriptors 58 | 59 | ------------- 60 | 61 | Copyright (c) 2009-2012 Radim Rehurek 62 | -------------------------------------------------------------------------------- /docs/similarities/simserver.rst: -------------------------------------------------------------------------------- 1 | :mod:`simserver` -- Document similarity server 2 | ====================================================== 3 | 4 | .. automodule:: simserver.simserver 5 | :synopsis: Document similarity server 6 | :members: 7 | :inherited-members: 8 | 9 | -------------------------------------------------------------------------------- /docs/simserver.rst: -------------------------------------------------------------------------------- 1 | .. _simserver: 2 | 3 | Document Similarity Server 4 | ============================= 5 | 6 | The 0.7.x series of `gensim `_ was about improving performance and consolidating the API. 7 | 0.8.x will be about new features --- 0.8.1, first of the series, is a **document similarity service**. 8 | 9 | The source code itself has been moved from gensim to its own, dedicated package, named `simserver`. 10 | Get it from `PyPI `_ or clone it on `Github `_. 11 | 12 | What is a document similarity service? 13 | --------------------------------------- 14 | 15 | Conceptually, a service that lets you: 16 | 17 | 1. train a semantic model from a corpus of plain texts (no manual annotation and mark-up needed) 18 | 2. index arbitrary documents using this semantic model 19 | 3. 
query the index for similar documents (the query can be either an id of a document already in the index, or an arbitrary text) 20 | 21 | 22 | >>> from simserver import SessionServer 23 | >>> server = SessionServer('/tmp/my_server') # resume server (or create a new one) 24 | 25 | >>> server.train(training_corpus, method='lsi') # create a semantic model 26 | >>> server.index(some_documents) # convert plain text to semantic representation and index it 27 | >>> server.find_similar(query) # convert query to semantic representation and compare against index 28 | >>> ... 29 | >>> server.index(more_documents) # add to index: incremental indexing works 30 | >>> server.find_similar(query) 31 | >>> ... 32 | >>> server.delete(ids_to_delete) # incremental deleting also works 33 | >>> server.find_similar(query) 34 | >>> ... 35 | 36 | .. note:: 37 | "Semantic" here refers to semantics of the crude, statistical type -- 38 | `Latent Semantic Analysis `_, 39 | `Latent Dirichlet Allocation `_ etc. 40 | Nothing to do with the semantic web, manual resource tagging or detailed linguistic inference. 41 | 42 | 43 | What is it good for? 44 | --------------------- 45 | 46 | Digital libraries of (mostly) text documents. More generally, it helps you annotate, 47 | organize and navigate documents in a more abstract way, compared to plain keyword search. 48 | 49 | How is it unique? 50 | ----------------- 51 | 52 | 1. **Memory independent**. Gensim has unique algorithms for statistical analysis that allow 53 | you to create semantic models of arbitrarily large training corpora (larger than RAM) very quickly 54 | and in constant RAM. 55 | 2. **Memory independent (again)**. Indexing shards are stored as files to disk/mmapped back as needed, 56 | so you can index very large corpora. So again, constant RAM, this time independent of the number of indexed documents. 57 | 3. **Efficient**. Gensim makes heavy use of Python's NumPy and SciPy libraries to make indexing and 58 | querying efficient. 59 | 4. 
**Robust**. Modifications of the index are transactional, so you can commit/rollback an 60 | entire indexing session. Also, during the session, the service is still available 61 | for querying (using its state from when the session started). Power failures leave 62 | the service in a consistent state (implicit rollback). 63 | 5. **Pure Python**. Well, technically, NumPy and SciPy are mostly wrapped C and Fortran, but 64 | `gensim `_ itself is pure Python. No compiling, installing or root privileges needed. 65 | 6. **Concurrency support**. The underlying service object is thread-safe and can 66 | therefore be used as a daemon server: clients connect to it via RPC and issue train/index/query requests remotely. 67 | 7. **Cross-network, cross-platform and cross-language**. While the Python server runs 68 | over TCP using `Pyro `_, 69 | clients in Java/.NET are trivial thanks to `Pyrolite `_. 70 | 71 | The rest of this document serves as a tutorial explaining the features in more detail. 72 | 73 | ----- 74 | 75 | Prerequisites 76 | ---------------------- 77 | 78 | It is assumed you have `gensim` properly :doc:`installed `. You'll also 79 | need the `sqlitedict `_ package that wraps 80 | Python's sqlite3 module in a thread-safe manner:: 81 | 82 | $ sudo easy_install -U sqlitedict 83 | 84 | To test the remote server capabilities, install Pyro4 (Python Remote Objects, at 85 | version 4.8 as of this writing):: 86 | 87 | $ sudo easy_install Pyro4 88 | 89 | .. note:: 90 | Don't forget to initialize logging to see logging messages:: 91 | 92 | >>> import logging 93 | >>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 94 | 95 | What is a document?
96 | ------------------- 97 | 98 | In the case of text documents, the service expects:: 99 | 100 | >>> document = {'id': 'some_unique_string', 101 | >>> 'tokens': ['content', 'of', 'the', 'document', '...'], 102 | >>> 'other_fields_are_allowed_but_ignored': None} 103 | 104 | This format was chosen because it coincides with plain JSON and is therefore easy to serialize and send over the wire, in almost any language. 105 | All strings involved must be utf8-encoded. 106 | 107 | 108 | What is a corpus? 109 | ----------------- 110 | 111 | A sequence of documents. Anything that supports the `for document in corpus: ...` 112 | iterator protocol. Generators are ok. Plain lists are also ok (but consume more memory). 113 | 114 | >>> from gensim import utils 115 | >>> texts = ["Human machine interface for lab abc computer applications", 116 | >>> "A survey of user opinion of computer system response time", 117 | >>> "The EPS user interface management system", 118 | >>> "System and human system engineering testing of EPS", 119 | >>> "Relation of user perceived response time to error measurement", 120 | >>> "The generation of random binary unordered trees", 121 | >>> "The intersection graph of paths in trees", 122 | >>> "Graph minors IV Widths of trees and well quasi ordering", 123 | >>> "Graph minors A survey"] 124 | >>> corpus = [{'id': 'doc_%i' % num, 'tokens': utils.simple_preprocess(text)} 125 | >>> for num, text in enumerate(texts)] 126 | 127 | Since corpora are allowed to be arbitrarily large, it is 128 | recommended that the client split them into smaller chunks before uploading them to the server: 129 | 130 | >>> utils.upload_chunked(server, corpus, chunksize=1000) # send 1k docs at a time 131 | 132 | Wait, upload what, where? 133 | ------------------------- 134 | 135 | If you use the similarity service object (instance of :class:`simserver.SessionServer`) in 136 | your code directly---no remote access---that's perfectly fine. 
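Whichever way you use the service, the chunked upload recommended above boils down to slicing the corpus iterator into fixed-size batches and sending one batch per call. Here is a minimal, self-contained sketch of that idea (`iter_chunks` is our own hypothetical helper, not the actual `gensim.utils.upload_chunked` implementation):

```python
# Hypothetical illustration of chunked uploads: slice any corpus iterable
# into fixed-size batches, conceptually what utils.upload_chunked does.
from itertools import islice

def iter_chunks(corpus, chunksize):
    """Yield successive lists of at most `chunksize` documents."""
    corpus = iter(corpus)  # works for generators and plain lists alike
    while True:
        chunk = list(islice(corpus, chunksize))
        if not chunk:
            break
        yield chunk

# toy corpus of document dicts, in the {'id': ..., 'tokens': [...]} format
corpus = [{'id': 'doc_%i' % n, 'tokens': ['token_%i' % n]} for n in range(9)]

batches = list(iter_chunks(corpus, chunksize=4))
print([len(b) for b in batches])  # [4, 4, 1]
```

Each yielded batch could then be passed to a single `server.index(batch)` call, keeping memory usage bounded no matter how large the corpus is.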
Using the service remotely, from a different process/machine, is an 137 | option, not a necessity. 138 | 139 | The similarity service can also run as a long-running daemon process, on a separate machine. In that 140 | case, I'll call the service object a *server*. 141 | 142 | But let's start with a local object. Open your `favourite shell `_ and:: 143 | 144 | >>> from gensim import utils 145 | >>> from simserver import SessionServer 146 | >>> service = SessionServer('/tmp/my_server/') # or wherever 147 | 148 | That initialized a new service, located in `/tmp/my_server` (you need write access rights to that directory). 149 | 150 | .. note:: 151 | The service is fully defined by the content of its location directory ("`/tmp/my_server/`"). 152 | If you use an existing location, the service object will resume 153 | from the index found there. Also, to "clone" a service, just copy that 154 | directory somewhere else. The copy will be a fully working duplicate of the 155 | original service. 156 | 157 | 158 | Model training 159 | --------------- 160 | 161 | We can start indexing right away: 162 | 163 | >>> service.index(corpus) 164 | AttributeError: must initialize model for /tmp/my_server/b before indexing documents 165 | 166 | Oops, we cannot. The service indexes documents in a semantic representation, which 167 | is different from the plain text we give it. We must teach the service how to convert 168 | between plain text and semantics first:: 169 | 170 | >>> service.train(corpus, method='lsi') 171 | 172 | That was easy. The `method='lsi'` parameter meant that we trained a model for 173 | `Latent Semantic Indexing `_ 174 | with the default dimensionality (400), over a `tf-idf `_ 175 | representation of our little `corpus`, all automatically. More on that later. 176 | 177 | Note that for the semantic model to make sense, it should be trained 178 | on a corpus that is: 179 | 180 | * Reasonably similar to the documents you want to index later. 
Training on a corpus 181 | of recipes in French when all indexed documents will be about programming in English 182 | will not help. 183 | * Reasonably large (at least thousands of documents), so that the statistical analysis has 184 | a chance to kick in. Don't use my example corpus here of 9 documents in production O_o 185 | 186 | Indexing documents 187 | ------------------ 188 | 189 | >>> service.index(corpus) # index the same documents that we trained on... 190 | 191 | Indexing can happen over any documents, but I'm too lazy to create another example corpus, so we index the same 9 docs used for training. 192 | 193 | Delete documents with:: 194 | 195 | >>> service.delete(['doc_5', 'doc_8']) # supply a list of document ids to be removed from the index 196 | 197 | When you pass documents that have the same id as some already indexed document, 198 | the indexed document is overwritten by the new input (=only the latest counts; 199 | document ids are always unique per service):: 200 | 201 | >>> service.index(corpus[:3]) # overall index size unchanged (just 3 docs overwritten) 202 | 203 | The index/delete/overwrite calls can be arbitrarily interspersed with queries. 204 | You don't have to index **all** documents first to start querying; indexing can be incremental. 205 | 206 | Querying 207 | --------- 208 | 209 | There are two types of queries: 210 | 211 | 1. by id: 212 | 213 | .. code-block:: python 214 | 215 | >>> print service.find_similar('doc_0') 216 | [('doc_0', 1.0, None), ('doc_2', 0.30426699, None), ('doc_1', 0.25648531, None), ('doc_3', 0.25480536, None)] 217 | 218 | >>> print service.find_similar('doc_5') # we deleted doc_5 and doc_8, remember? 219 | ValueError: document 'doc_5' not in index 220 | 221 | In the resulting 3-tuples, `doc_n` is the document id we supplied during indexing, 222 | `0.30426699` is the similarity of `doc_n` to the query, but what's up with that `None`, you ask? 
223 | Well, you can associate each document with a "payload" during indexing. 224 | This payload object (anything pickle-able) is later returned during querying. 225 | If you don't specify `doc['payload']` during indexing, queries simply return `None` in the result tuple, as in our example here. 226 | 227 | 2. or by document (using `document['tokens']`; id is ignored in this case): 228 | 229 | .. code-block:: python 230 | 231 | >>> doc = {'tokens': utils.simple_preprocess('Graph and minors and humans and trees.')} 232 | >>> print service.find_similar(doc, min_score=0.4, max_results=50) 233 | [('doc_7', 0.93350589, None), ('doc_3', 0.42718196, None)] 234 | 235 | Remote access 236 | ------------- 237 | 238 | So far, we did everything in our Python shell, locally. I very much like `Pyro `_, 239 | a pure Python package for Remote Procedure Calls (RPC), so I'll illustrate remote 240 | service access via Pyro. Pyro takes care of all the socket listening/request routing/data marshalling/thread 241 | spawning, so it saves us a lot of trouble. 242 | 243 | To create a similarity server, we just create a :class:`simserver.SessionServer` object and register it 244 | with a Pyro daemon for remote access. There is a small `example script `_ 245 | included with simserver; run it with:: 246 | 247 | $ python -m simserver.run_simserver /tmp/testserver 248 | 249 | You can just `ctrl+c` to terminate the server, but leave it running for now. 
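While the server starts up, a quick aside on the similarity scores that `find_similar` returned above: under the hood, gensim-based indexes compare documents by the cosine similarity of their vectors in the trained semantic space. A toy, self-contained illustration over plain bag-of-words counts (our own sketch, not the actual simserver internals, which operate on LSI vectors):

```python
# Toy illustration: similarity scores like the ones find_similar returns
# are cosine similarities between document vectors.
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# bag-of-words counts over a tiny shared vocabulary
query = [1, 1, 0, 0]
doc_a = [1, 1, 1, 0]   # shares two terms with the query
doc_b = [0, 0, 1, 1]   # shares none

print(round(cosine(query, doc_a), 4))  # 0.8165 -- high overlap, score near 1
print(round(cosine(query, doc_b), 4))  # 0.0 -- no overlap
```

This is also why the `min_score` parameter above makes sense: scores live on a fixed scale, with 1.0 meaning "identical direction in the semantic space".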
250 | 251 | Now open your Python shell again, in another terminal window or possibly on another machine, and:: 252 | 253 | >>> import Pyro4 254 | >>> service = Pyro4.Proxy(Pyro4.locateNS().lookup('gensim.testserver')) 255 | 256 | Now `service` is only a proxy object: every call is physically executed wherever 257 | you ran the `simserver.run_simserver` script, which can be a totally different computer 258 | (within a network broadcast domain), but you don't even know:: 259 | 260 | >>> print service.status() 261 | >>> service.train(corpus) 262 | >>> service.index(other_corpus) 263 | >>> service.find_similar(query) 264 | >>> ... 265 | 266 | It is worth mentioning that Irmen, the author of Pyro, also released 267 | `Pyrolite `_ recently. That is a package 268 | which allows you to create Pyro proxies also from Java and .NET, in addition to Python. 269 | That way you can call remote methods from there too---the client doesn't have to be in Python. 270 | 271 | Concurrency 272 | ----------- 273 | 274 | Ok, now it's getting interesting. Since we can access the service remotely, what 275 | happens if multiple clients create proxies to it at the same time? What if they 276 | want to modify the server index at the same time? 277 | 278 | Answer: the `SessionServer` object is thread-safe, so that when each client spawns a request 279 | thread via Pyro, they don't step on each other's toes. 280 | 281 | This means that: 282 | 283 | 1. There can be multiple simultaneous `service.find_similar` queries (or, in 284 | general, multiple simultaneous calls that are "read-only"). 285 | 2. When two clients issue modification calls (`index`/`train`/`delete`/`drop_index`/...) 286 | at the same time, an internal lock serializes them -- the later call has to wait. 287 | 3. While one client is modifying the index, all other clients' queries still see 288 | the original index. Only once the modifications are committed do they become 289 | "visible". 290 | 291 | What do you mean, visible? 
292 | -------------------------- 293 | 294 | The service uses transactions internally. This means that each modification is 295 | done over a clone of the service. If the modification session fails for whatever 296 | reason (exception in code; power failure that turns off the server; client unhappy 297 | with how the session went), it can be rolled back. It also means other clients can 298 | continue querying the original index during index updates. 299 | 300 | The mechanism is hidden from users by default through auto-committing (it was already happening 301 | in the examples above too), but auto-committing can be turned off explicitly:: 302 | 303 | >>> service.set_autosession(False) 304 | >>> service.train(corpus) 305 | RuntimeError: must open a session before modifying SessionServer 306 | >>> service.open_session() 307 | >>> service.train(corpus) 308 | >>> service.index(corpus) 309 | >>> service.delete(doc_ids) 310 | >>> ... 311 | 312 | None of these changes are visible to other clients, yet. Also, other clients' 313 | calls to index/train/etc will block until this session is committed/rolled back---there 314 | cannot be two open sessions at the same time. 315 | 316 | To end a session:: 317 | 318 | >>> service.rollback() # discard all changes since open_session() 319 | 320 | or:: 321 | 322 | >>> service.commit() # make changes public; now other clients can see changes/acquire the modification lock 323 | 324 | 325 | Other stuff 326 | ------------ 327 | 328 | TODO Custom document parsing (in lieu of `utils.simple_preprocess`). Different models (not just `lsi`). Optimizing the index with `service.optimize()`. 329 | TODO add some hard numbers; example tutorial for some bigger collection, e.g. for `arxiv.org `_ or wikipedia. 
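As a footnote to the transactional sessions described earlier, the clone-and-swap mechanism can be sketched with a toy in-memory index (our own illustration, not the actual `SessionServer` internals): modifications go to a deep copy, readers keep seeing the committed state, `commit()` swaps the clone in, and `rollback()` simply discards it.

```python
# Toy sketch of the clone-based transaction scheme: queries read `committed`,
# modifications go to a clone, and commit() swaps the clone in atomically.
import copy

class ToyIndex(object):
    def __init__(self):
        self.committed = {}   # what queries see
        self.session = None   # pending clone, if a session is open

    def open_session(self):
        if self.session is not None:
            raise RuntimeError('a session is already open')
        self.session = copy.deepcopy(self.committed)  # modify the clone only

    def index(self, doc_id, tokens):
        if self.session is None:
            raise RuntimeError('must open a session before modifying the index')
        self.session[doc_id] = tokens

    def commit(self):
        if self.session is None:
            raise RuntimeError('no open session to commit')
        self.committed, self.session = self.session, None  # swap clone in

    def rollback(self):
        self.session = None  # just discard the clone

idx = ToyIndex()
idx.open_session()
idx.index('doc_0', ['hello'])
print(sorted(idx.committed))  # []  -- change is not visible before commit
idx.commit()
print(sorted(idx.committed))  # ['doc_0']  -- now visible
```

A crash before `commit()` leaves `committed` untouched, which is the "implicit rollback" behaviour described above; the real service achieves the same effect by cloning its on-disk state.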
330 | 331 | -------------------------------------------------------------------------------- /ez_setup.py: -------------------------------------------------------------------------------- 1 | #!python 2 | """Bootstrap setuptools installation 3 | 4 | If you want to use setuptools in your package's setup.py, just include this 5 | file in the same directory with it, and add this to the top of your setup.py:: 6 | 7 | from ez_setup import use_setuptools 8 | use_setuptools() 9 | 10 | If you want to require a specific version of setuptools, set a download 11 | mirror, or use an alternate download directory, you can do so by supplying 12 | the appropriate options to ``use_setuptools()``. 13 | 14 | This file can also be run as a script to install or upgrade setuptools. 15 | """ 16 | import sys 17 | DEFAULT_VERSION = "0.6c11" 18 | DEFAULT_URL = "http://pypi.python.org/packages/%s/s/setuptools/" % sys.version[:3] 19 | 20 | md5_data = { 21 | 'setuptools-0.6b1-py2.3.egg': '8822caf901250d848b996b7f25c6e6ca', 22 | 'setuptools-0.6b1-py2.4.egg': 'b79a8a403e4502fbb85ee3f1941735cb', 23 | 'setuptools-0.6b2-py2.3.egg': '5657759d8a6d8fc44070a9d07272d99b', 24 | 'setuptools-0.6b2-py2.4.egg': '4996a8d169d2be661fa32a6e52e4f82a', 25 | 'setuptools-0.6b3-py2.3.egg': 'bb31c0fc7399a63579975cad9f5a0618', 26 | 'setuptools-0.6b3-py2.4.egg': '38a8c6b3d6ecd22247f179f7da669fac', 27 | 'setuptools-0.6b4-py2.3.egg': '62045a24ed4e1ebc77fe039aa4e6f7e5', 28 | 'setuptools-0.6b4-py2.4.egg': '4cb2a185d228dacffb2d17f103b3b1c4', 29 | 'setuptools-0.6c1-py2.3.egg': 'b3f2b5539d65cb7f74ad79127f1a908c', 30 | 'setuptools-0.6c1-py2.4.egg': 'b45adeda0667d2d2ffe14009364f2a4b', 31 | 'setuptools-0.6c10-py2.3.egg': 'ce1e2ab5d3a0256456d9fc13800a7090', 32 | 'setuptools-0.6c10-py2.4.egg': '57d6d9d6e9b80772c59a53a8433a5dd4', 33 | 'setuptools-0.6c10-py2.5.egg': 'de46ac8b1c97c895572e5e8596aeb8c7', 34 | 'setuptools-0.6c10-py2.6.egg': '58ea40aef06da02ce641495523a0b7f5', 35 | 'setuptools-0.6c11-py2.3.egg': 
'2baeac6e13d414a9d28e7ba5b5a596de', 36 | 'setuptools-0.6c11-py2.4.egg': 'bd639f9b0eac4c42497034dec2ec0c2b', 37 | 'setuptools-0.6c11-py2.5.egg': '64c94f3bf7a72a13ec83e0b24f2749b2', 38 | 'setuptools-0.6c11-py2.6.egg': 'bfa92100bd772d5a213eedd356d64086', 39 | 'setuptools-0.6c2-py2.3.egg': 'f0064bf6aa2b7d0f3ba0b43f20817c27', 40 | 'setuptools-0.6c2-py2.4.egg': '616192eec35f47e8ea16cd6a122b7277', 41 | 'setuptools-0.6c3-py2.3.egg': 'f181fa125dfe85a259c9cd6f1d7b78fa', 42 | 'setuptools-0.6c3-py2.4.egg': 'e0ed74682c998bfb73bf803a50e7b71e', 43 | 'setuptools-0.6c3-py2.5.egg': 'abef16fdd61955514841c7c6bd98965e', 44 | 'setuptools-0.6c4-py2.3.egg': 'b0b9131acab32022bfac7f44c5d7971f', 45 | 'setuptools-0.6c4-py2.4.egg': '2a1f9656d4fbf3c97bf946c0a124e6e2', 46 | 'setuptools-0.6c4-py2.5.egg': '8f5a052e32cdb9c72bcf4b5526f28afc', 47 | 'setuptools-0.6c5-py2.3.egg': 'ee9fd80965da04f2f3e6b3576e9d8167', 48 | 'setuptools-0.6c5-py2.4.egg': 'afe2adf1c01701ee841761f5bcd8aa64', 49 | 'setuptools-0.6c5-py2.5.egg': 'a8d3f61494ccaa8714dfed37bccd3d5d', 50 | 'setuptools-0.6c6-py2.3.egg': '35686b78116a668847237b69d549ec20', 51 | 'setuptools-0.6c6-py2.4.egg': '3c56af57be3225019260a644430065ab', 52 | 'setuptools-0.6c6-py2.5.egg': 'b2f8a7520709a5b34f80946de5f02f53', 53 | 'setuptools-0.6c7-py2.3.egg': '209fdf9adc3a615e5115b725658e13e2', 54 | 'setuptools-0.6c7-py2.4.egg': '5a8f954807d46a0fb67cf1f26c55a82e', 55 | 'setuptools-0.6c7-py2.5.egg': '45d2ad28f9750e7434111fde831e8372', 56 | 'setuptools-0.6c8-py2.3.egg': '50759d29b349db8cfd807ba8303f1902', 57 | 'setuptools-0.6c8-py2.4.egg': 'cba38d74f7d483c06e9daa6070cce6de', 58 | 'setuptools-0.6c8-py2.5.egg': '1721747ee329dc150590a58b3e1ac95b', 59 | 'setuptools-0.6c9-py2.3.egg': 'a83c4020414807b496e4cfbe08507c03', 60 | 'setuptools-0.6c9-py2.4.egg': '260a2be2e5388d66bdaee06abec6342a', 61 | 'setuptools-0.6c9-py2.5.egg': 'fe67c3e5a17b12c0e7c541b7ea43a8e6', 62 | 'setuptools-0.6c9-py2.6.egg': 'ca37b1ff16fa2ede6e19383e7b59245a', 63 | } 64 | 65 | import sys, os 66 | try: 
from hashlib import md5 67 | except ImportError: from md5 import md5 68 | 69 | def _validate_md5(egg_name, data): 70 | if egg_name in md5_data: 71 | digest = md5(data).hexdigest() 72 | if digest != md5_data[egg_name]: 73 | print >>sys.stderr, ( 74 | "md5 validation of %s failed! (Possible download problem?)" 75 | % egg_name 76 | ) 77 | sys.exit(2) 78 | return data 79 | 80 | def use_setuptools( 81 | version=DEFAULT_VERSION, download_base=DEFAULT_URL, to_dir=os.curdir, 82 | download_delay=15 83 | ): 84 | """Automatically find/download setuptools and make it available on sys.path 85 | 86 | `version` should be a valid setuptools version number that is available 87 | as an egg for download under the `download_base` URL (which should end with 88 | a '/'). `to_dir` is the directory where setuptools will be downloaded, if 89 | it is not already available. If `download_delay` is specified, it should 90 | be the number of seconds that will be paused before initiating a download, 91 | should one be required. If an older version of setuptools is installed, 92 | this routine will print a message to ``sys.stderr`` and raise SystemExit in 93 | an attempt to abort the calling script. 94 | """ 95 | was_imported = 'pkg_resources' in sys.modules or 'setuptools' in sys.modules 96 | def do_download(): 97 | egg = download_setuptools(version, download_base, to_dir, download_delay) 98 | sys.path.insert(0, egg) 99 | import setuptools; setuptools.bootstrap_install_from = egg 100 | try: 101 | import pkg_resources 102 | except ImportError: 103 | return do_download() 104 | try: 105 | pkg_resources.require("setuptools>="+version); return 106 | except pkg_resources.VersionConflict, e: 107 | if was_imported: 108 | print >>sys.stderr, ( 109 | "The required version of setuptools (>=%s) is not available, and\n" 110 | "can't be installed while this script is running. Please install\n" 111 | " a more recent version first, using 'easy_install -U setuptools'." 
112 | "\n\n(Currently using %r)" 113 | ) % (version, e.args[0]) 114 | sys.exit(2) 115 | else: 116 | del pkg_resources, sys.modules['pkg_resources'] # reload ok 117 | return do_download() 118 | except pkg_resources.DistributionNotFound: 119 | return do_download() 120 | 121 | def download_setuptools( 122 | version=DEFAULT_VERSION, download_base=DEFAULT_URL, to_dir=os.curdir, 123 | delay = 15 124 | ): 125 | """Download setuptools from a specified location and return its filename 126 | 127 | `version` should be a valid setuptools version number that is available 128 | as an egg for download under the `download_base` URL (which should end 129 | with a '/'). `to_dir` is the directory where the egg will be downloaded. 130 | `delay` is the number of seconds to pause before an actual download attempt. 131 | """ 132 | import urllib2, shutil 133 | egg_name = "setuptools-%s-py%s.egg" % (version,sys.version[:3]) 134 | url = download_base + egg_name 135 | saveto = os.path.join(to_dir, egg_name) 136 | src = dst = None 137 | if not os.path.exists(saveto): # Avoid repeated downloads 138 | try: 139 | from distutils import log 140 | if delay: 141 | log.warn(""" 142 | --------------------------------------------------------------------------- 143 | This script requires setuptools version %s to run (even to display 144 | help). I will attempt to download it for you (from 145 | %s), but 146 | you may need to enable firewall access for this script first. 147 | I will start the download in %d seconds. 148 | 149 | (Note: if this machine does not have network access, please obtain the file 150 | 151 | %s 152 | 153 | and place it in this directory before rerunning this script.) 
154 | ---------------------------------------------------------------------------""", 155 | version, download_base, delay, url 156 | ); from time import sleep; sleep(delay) 157 | log.warn("Downloading %s", url) 158 | src = urllib2.urlopen(url) 159 | # Read/write all in one block, so we don't create a corrupt file 160 | # if the download is interrupted. 161 | data = _validate_md5(egg_name, src.read()) 162 | dst = open(saveto,"wb"); dst.write(data) 163 | finally: 164 | if src: src.close() 165 | if dst: dst.close() 166 | return os.path.realpath(saveto) 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | def main(argv, version=DEFAULT_VERSION): 204 | """Install or upgrade setuptools and EasyInstall""" 205 | try: 206 | import setuptools 207 | except ImportError: 208 | egg = None 209 | try: 210 | egg = download_setuptools(version, delay=0) 211 | sys.path.insert(0,egg) 212 | from setuptools.command.easy_install import main 213 | return main(list(argv)+[egg]) # we're done here 214 | finally: 215 | if egg and os.path.exists(egg): 216 | os.unlink(egg) 217 | else: 218 | if setuptools.__version__ == '0.0.1': 219 | print >>sys.stderr, ( 220 | "You have an obsolete version of setuptools installed. Please\n" 221 | "remove it from your system entirely before rerunning this script." 
222 | ) 223 | sys.exit(2) 224 | 225 | req = "setuptools>="+version 226 | import pkg_resources 227 | try: 228 | pkg_resources.require(req) 229 | except pkg_resources.VersionConflict: 230 | try: 231 | from setuptools.command.easy_install import main 232 | except ImportError: 233 | from easy_install import main 234 | main(list(argv)+[download_setuptools(delay=0)]) 235 | sys.exit(0) # try to force an exit 236 | else: 237 | if argv: 238 | from setuptools.command.easy_install import main 239 | main(argv) 240 | else: 241 | print "Setuptools version",version,"or greater has been installed." 242 | print '(Run "ez_setup.py -U setuptools" to reinstall or upgrade.)' 243 | 244 | def update_md5(filenames): 245 | """Update our built-in md5 registry""" 246 | 247 | import re 248 | 249 | for name in filenames: 250 | base = os.path.basename(name) 251 | f = open(name,'rb') 252 | md5_data[base] = md5(f.read()).hexdigest() 253 | f.close() 254 | 255 | data = [" %r: %r,\n" % it for it in md5_data.items()] 256 | data.sort() 257 | repl = "".join(data) 258 | 259 | import inspect 260 | srcfile = inspect.getsourcefile(sys.modules[__name__]) 261 | f = open(srcfile, 'rb'); src = f.read(); f.close() 262 | 263 | match = re.search("\nmd5_data = {\n([^}]+)}", src) 264 | if not match: 265 | print >>sys.stderr, "Internal error!" 
266 | sys.exit(2) 267 | 268 | src = src[:match.start(1)] + repl + src[match.end(1):] 269 | f = open(srcfile,'w') 270 | f.write(src) 271 | f.close() 272 | 273 | 274 | if __name__=='__main__': 275 | if len(sys.argv)>2 and sys.argv[1]=='--md5update': 276 | update_md5(sys.argv[2:]) 277 | else: 278 | main(sys.argv[1:]) 279 | 280 | 281 | 282 | 283 | 284 | 285 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # 4 | # Copyright (C) 2012 Radim Rehurek 5 | # Licensed under the GNU AGPL v3.0 - http://www.gnu.org/licenses/agpl.html 6 | 7 | """ 8 | Run with: 9 | 10 | sudo python ./setup.py install 11 | """ 12 | 13 | import os 14 | import sys 15 | 16 | if sys.version_info[:2] < (2, 5): 17 | raise Exception('This version of simserver needs Python 2.5 or later. ') 18 | 19 | import ez_setup 20 | ez_setup.use_setuptools() 21 | from setuptools import setup, find_packages 22 | 23 | 24 | def read(fname): 25 | return open(os.path.join(os.path.dirname(__file__), fname)).read() 26 | 27 | 28 | 29 | setup( 30 | name = 'simserver', 31 | version = '0.1.4', 32 | description = 'Document similarity server', 33 | long_description = read('README.rst'), 34 | 35 | packages = find_packages(), 36 | 37 | # there is a bug in python2.5, preventing distutils from using any non-ascii characters :( http://bugs.python.org/issue2562 38 | author = 'Radim Rehurek', # u'Radim Řehůřek', # <- should really be this... 
39 | author_email = 'radimrehurek@seznam.cz', 40 | 41 | url = 'https://github.com/piskvorky/gensim-simserver', 42 | download_url = 'http://pypi.python.org/pypi/simserver', 43 | 44 | keywords = 'Similarity server, document database, Latent Semantic Indexing, LSA, ' 45 | 'LSI, LDA, Latent Dirichlet Allocation, TF-IDF, gensim', 46 | 47 | license = 'AGPL v3', 48 | platforms = 'any', 49 | 50 | zip_safe = False, 51 | 52 | classifiers = [ # from http://pypi.python.org/pypi?%3Aaction=list_classifiers 53 | 'Development Status :: 4 - Beta', 54 | 'Environment :: Console', 55 | 'Intended Audience :: Science/Research', 56 | 'License :: OSI Approved :: GNU Affero General Public License v3', # match the AGPL license declared above 57 | 'Operating System :: OS Independent', 58 | 'Programming Language :: Python :: 2.5', 59 | 'Topic :: Scientific/Engineering :: Artificial Intelligence', 60 | 'Topic :: Scientific/Engineering :: Information Analysis', 61 | 'Topic :: Text Processing :: Linguistic', 62 | ], 63 | 64 | test_suite = "simserver.test", 65 | 66 | install_requires = [ 67 | 'gensim >= 0.8.5', 68 | 'Pyro4 >= 4.8', 69 | 'sqlitedict >= 1.0.8', 70 | ], 71 | 72 | include_package_data = True, 73 | 74 | ) 75 | -------------------------------------------------------------------------------- /simserver/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Package containing a document similarity server, an extension of gensim. 3 | """ 4 | 5 | # for IPython tab-completion 6 | from simserver import SessionServer, SimServer 7 | 8 | 9 | try: 10 | __version__ = __import__('pkg_resources').get_distribution('simserver').version 11 | except: 12 | __version__ = '?' 
13 | 14 | 15 | import logging 16 | 17 | class NullHandler(logging.Handler): 18 | """For python versions <= 2.6; same as `logging.NullHandler` in 2.7.""" 19 | def emit(self, record): 20 | pass 21 | 22 | logger = logging.getLogger('simserver') 23 | if len(logger.handlers) == 0: # To ensure reload() doesn't add another one 24 | logger.addHandler(NullHandler()) 25 | -------------------------------------------------------------------------------- /simserver/run_simserver.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # 4 | # Copyright (C) 2011 Radim Rehurek 5 | # Licensed under the GNU AGPL v3 - http://www.gnu.org/licenses/agpl.html 6 | 7 | """ 8 | USAGE: %(program)s DATA_DIRECTORY 9 | 10 | Start a sample similarity server, register it with Pyro and leave it running \ 11 | as a daemon. 12 | 13 | Example: 14 | python -m simserver.run_simserver /tmp/server 15 | """ 16 | 17 | from __future__ import with_statement 18 | 19 | import logging 20 | import os 21 | import sys 22 | 23 | import gensim 24 | import simserver 25 | 26 | if __name__ == '__main__': 27 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(module)s:%(lineno)d : %(funcName)s(%(threadName)s) : %(message)s') 28 | logging.root.setLevel(level=logging.INFO) 29 | logging.info("running %s" % ' '.join(sys.argv)) 30 | 31 | program = os.path.basename(sys.argv[0]) 32 | 33 | # check and process input arguments 34 | if len(sys.argv) < 2: 35 | print globals()['__doc__'] % locals() 36 | sys.exit(1) 37 | 38 | basename = sys.argv[1] 39 | server = simserver.SessionServer(basename) 40 | gensim.utils.pyro_daemon('gensim.testserver', server) 41 | 42 | logging.info("finished running %s" % program) 43 | -------------------------------------------------------------------------------- /simserver/simserver.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- 
coding: utf-8 -*- 3 | # 4 | # Copyright (C) 2012 Radim Rehurek 5 | # Licensed under the GNU AGPL v3 - http://www.gnu.org/licenses/agpl.html 6 | 7 | 8 | """ 9 | "Find similar" service, using gensim (=vector spaces) for backend. 10 | 11 | The server performs 3 main functions: 12 | 13 | 1. converts documents to semantic representation (TF-IDF, LSA, LDA...) 14 | 2. indexes documents in the vector representation, for faster retrieval 15 | 3. for a given query document, return ids of the most similar documents from the index 16 | 17 | SessionServer objects are transactional, so that you can rollback/commit an entire 18 | set of changes. 19 | 20 | The server is ready for concurrent requests (thread-safe). Indexing is incremental 21 | and you can query the SessionServer even while it's being updated, so that there 22 | is virtually no down-time. 23 | 24 | """ 25 | 26 | from __future__ import with_statement 27 | 28 | import os 29 | import logging 30 | import threading 31 | import shutil 32 | 33 | import numpy 34 | 35 | import gensim 36 | from sqlitedict import SqliteDict # needs sqlitedict: run "sudo easy_install sqlitedict" 37 | 38 | 39 | logger = logging.getLogger('gensim.similarities.simserver') 40 | 41 | 42 | TOP_SIMS = 100 # when precomputing similarities, only consider this many "most similar" documents 43 | SHARD_SIZE = 65536 # spill index shards to disk in SHARD_SIZE-ed chunks of documents 44 | DEFAULT_NUM_TOPICS = 400 # use this many topics for topic models unless user specified a value 45 | JOURNAL_MODE = 'OFF' # don't keep journals in sqlite dbs 46 | 47 | 48 | 49 | def merge_sims(oldsims, newsims, clip=None): 50 | """Merge two precomputed similarity lists, truncating the result to `clip` most similar items.""" 51 | if oldsims is None: 52 | result = newsims or [] 53 | elif newsims is None: 54 | result = oldsims 55 | else: 56 | result = sorted(oldsims + newsims, key=lambda item: -item[1]) 57 | if clip is not None: 58 | result = result[:clip] 59 | return result 60 | 
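The `merge_sims` helper above is pure Python and easy to exercise in isolation. A minimal standalone sketch (the function copied verbatim from above; the doc ids and scores are made-up sample data):

```python
def merge_sims(oldsims, newsims, clip=None):
    """Merge two precomputed similarity lists, truncating the result to `clip` most similar items."""
    if oldsims is None:
        result = newsims or []
    elif newsims is None:
        result = oldsims
    else:
        # concatenate and re-sort by decreasing similarity score
        result = sorted(oldsims + newsims, key=lambda item: -item[1])
    if clip is not None:
        result = result[:clip]
    return result

# Similarity lists are [(doc_id, score)] pairs, sorted by decreasing score.
old = [('doc_a', 0.9), ('doc_c', 0.5)]
new = [('doc_b', 0.7), ('doc_d', 0.1)]
print(merge_sims(old, new, clip=3))  # [('doc_a', 0.9), ('doc_b', 0.7), ('doc_c', 0.5)]
print(merge_sims(None, new))         # [('doc_b', 0.7), ('doc_d', 0.1)]
```

This is the merging step `SimIndex.merge` and `SimServer.find_similar` rely on when combining hits from the fresh and optimized indexes.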
61 | 62 | 63 | class SimIndex(gensim.utils.SaveLoad): 64 | """ 65 | An index of documents. Used internally by SimServer. 66 | 67 | It uses the Similarity class to persist all document vectors to disk (via mmap). 68 | """ 69 | def __init__(self, fname, num_features, shardsize=SHARD_SIZE, topsims=TOP_SIMS): 70 | """ 71 | Spill index shards to disk after every `shardsize` documents. 72 | In similarity queries, return only the `topsims` most similar documents. 73 | """ 74 | self.fname = fname 75 | self.shardsize = int(shardsize) 76 | self.topsims = int(topsims) 77 | self.id2pos = {} # map document id (string) to index position (integer) 78 | self.pos2id = {} # reverse mapping for id2pos; redundant, for performance 79 | self.id2sims = SqliteDict(self.fname + '.id2sims', journal_mode=JOURNAL_MODE) # precomputed top similar: document id -> [(doc_id, similarity)] 80 | self.qindex = gensim.similarities.Similarity(self.fname + '.idx', corpus=None, 81 | num_best=None, num_features=num_features, shardsize=shardsize) 82 | self.length = 0 83 | 84 | def save(self, fname): 85 | tmp, self.id2sims = self.id2sims, None 86 | super(SimIndex, self).save(fname) 87 | self.id2sims = tmp 88 | 89 | 90 | @staticmethod 91 | def load(fname): 92 | result = gensim.utils.SaveLoad.load(fname) 93 | result.fname = fname 94 | result.check_moved() 95 | result.id2sims = SqliteDict(fname + '.id2sims', journal_mode=JOURNAL_MODE) 96 | return result 97 | 98 | 99 | def check_moved(self): 100 | output_prefix = self.fname + '.idx' 101 | if self.qindex.output_prefix != output_prefix: 102 | logger.info("index seems to have moved from %s to %s; updating locations" % 103 | (self.qindex.output_prefix, output_prefix)) 104 | self.qindex.output_prefix = output_prefix 105 | self.qindex.check_moved() 106 | 107 | 108 | def close(self): 109 | "Explicitly release important resources (file handles, db, ...)" 110 | try: 111 | self.id2sims.close() 112 | except: 113 | pass 114 | try: 115 | del self.qindex 116 | except: 117 | 
pass 118 | 119 | 120 | def terminate(self): 121 | """Delete all files created by this index, invalidating `self`. Use with care.""" 122 | try: 123 | self.id2sims.terminate() 124 | except: 125 | pass 126 | import glob 127 | for fname in glob.glob(self.fname + '*'): 128 | try: 129 | os.remove(fname) 130 | logger.info("deleted %s" % fname) 131 | except Exception, e: 132 | logger.warning("failed to delete %s: %s" % (fname, e)) 133 | for val in self.__dict__.keys(): 134 | try: 135 | delattr(self, val) 136 | except: 137 | pass 138 | 139 | 140 | def index_documents(self, fresh_docs, model): 141 | """ 142 | Update fresh index with new documents (potentially replacing old ones with 143 | the same id). `fresh_docs` is a dictionary-like object (=dict, sqlitedict, shelve etc) 144 | that maps document_id->document. 145 | """ 146 | docids = fresh_docs.keys() 147 | vectors = (model.docs2vecs(fresh_docs[docid] for docid in docids)) 148 | logger.info("adding %i documents to %s" % (len(docids), self)) 149 | self.qindex.add_documents(vectors) 150 | self.qindex.save() 151 | self.update_ids(docids) 152 | 153 | 154 | def update_ids(self, docids): 155 | """Update id->pos mapping with new document ids.""" 156 | logger.info("updating %i id mappings" % len(docids)) 157 | for docid in docids: 158 | if docid is not None: 159 | pos = self.id2pos.get(docid, None) 160 | if pos is not None: 161 | logger.info("replacing existing document %r in %s" % (docid, self)) 162 | del self.pos2id[pos] 163 | self.id2pos[docid] = self.length 164 | try: 165 | del self.id2sims[docid] 166 | except: 167 | pass 168 | self.length += 1 169 | self.id2sims.sync() 170 | self.update_mappings() 171 | 172 | 173 | def update_mappings(self): 174 | """Synchronize id<->position mappings.""" 175 | self.pos2id = dict((v, k) for k, v in self.id2pos.iteritems()) 176 | assert len(self.pos2id) == len(self.id2pos), "duplicate ids or positions detected" 177 | 178 | 179 | def delete(self, docids): 180 | """Delete documents (specified 
by their ids) from the index.""" 181 | logger.debug("deleting %i documents from %s" % (len(docids), self)) 182 | deleted = 0 183 | for docid in docids: 184 | try: 185 | del self.id2pos[docid] 186 | deleted += 1 187 | del self.id2sims[docid] 188 | except: 189 | pass 190 | self.id2sims.sync() 191 | if deleted: 192 | logger.info("deleted %i documents from %s" % (deleted, self)) 193 | self.update_mappings() 194 | 195 | 196 | def sims2scores(self, sims, eps=1e-7): 197 | """Convert raw similarity vector to a list of (docid, similarity) results.""" 198 | result = [] 199 | if isinstance(sims, numpy.ndarray): 200 | sims = abs(sims) # TODO or maybe clip? are opposite vectors "similar" or "dissimilar"?! 201 | for pos in numpy.argsort(sims)[::-1]: 202 | if pos in self.pos2id and sims[pos] > eps: # ignore deleted/rewritten documents 203 | # convert positions of resulting docs back to ids 204 | result.append((self.pos2id[pos], sims[pos])) 205 | if len(result) == self.topsims: 206 | break 207 | else: 208 | for pos, score in sims: 209 | if pos in self.pos2id and abs(score) > eps: # ignore deleted/rewritten documents 210 | # convert positions of resulting docs back to ids 211 | result.append((self.pos2id[pos], abs(score))) 212 | if len(result) == self.topsims: 213 | break 214 | return result 215 | 216 | 217 | def vec_by_id(self, docid): 218 | """Return indexed vector corresponding to document `docid`.""" 219 | pos = self.id2pos[docid] 220 | return self.qindex.vector_by_id(pos) 221 | 222 | 223 | def sims_by_id(self, docid): 224 | """Find the most similar documents to the (already indexed) document with `docid`.""" 225 | result = self.id2sims.get(docid, None) 226 | if result is None: 227 | self.qindex.num_best = self.topsims 228 | sims = self.qindex.similarity_by_id(self.id2pos[docid]) 229 | result = self.sims2scores(sims) 230 | return result 231 | 232 | 233 | def sims_by_vec(self, vec, normalize=None): 234 | """ 235 | Find the most similar documents to a given vector (=already 
processed document). 236 | """ 237 | if normalize is None: 238 | normalize = self.qindex.normalize 239 | norm, self.qindex.normalize = self.qindex.normalize, normalize # store old value 240 | self.qindex.num_best = self.topsims 241 | sims = self.qindex[vec] 242 | self.qindex.normalize = norm # restore old value of qindex.normalize 243 | return self.sims2scores(sims) 244 | 245 | 246 | def merge(self, other): 247 | """Merge documents from the other index. Update precomputed similarities 248 | in the process.""" 249 | other.qindex.normalize, other.qindex.num_best = False, self.topsims 250 | # update precomputed "most similar" for old documents (in case some of 251 | # the new docs make it to the top-N for some of the old documents) 252 | logger.info("updating old precomputed values") 253 | pos, lenself = 0, len(self.qindex) 254 | for chunk in self.qindex.iter_chunks(): 255 | for sims in other.qindex[chunk]: 256 | if pos in self.pos2id: 257 | # ignore masked entries (deleted, overwritten documents) 258 | docid = self.pos2id[pos] 259 | sims = self.sims2scores(sims) 260 | self.id2sims[docid] = merge_sims(self.id2sims[docid], sims, self.topsims) 261 | pos += 1 262 | if pos % 10000 == 0: 263 | logger.info("PROGRESS: updated doc #%i/%i" % (pos, lenself)) 264 | self.id2sims.sync() 265 | 266 | logger.info("merging fresh index into optimized one") 267 | pos, docids = 0, [] 268 | for chunk in other.qindex.iter_chunks(): 269 | for vec in chunk: 270 | if pos in other.pos2id: # don't copy deleted documents 271 | self.qindex.add_documents([vec]) 272 | docids.append(other.pos2id[pos]) 273 | pos += 1 274 | self.qindex.save() 275 | self.update_ids(docids) 276 | 277 | logger.info("precomputing most similar for the fresh index") 278 | pos, lenother = 0, len(other.qindex) 279 | norm, self.qindex.normalize = self.qindex.normalize, False 280 | topsims, self.qindex.num_best = self.qindex.num_best, self.topsims 281 | for chunk in other.qindex.iter_chunks(): 282 | for sims in 
self.qindex[chunk]: 283 | if pos in other.pos2id: 284 | # ignore masked entries (deleted, overwritten documents) 285 | docid = other.pos2id[pos] 286 | self.id2sims[docid] = self.sims2scores(sims) 287 | pos += 1 288 | if pos % 10000 == 0: 289 | logger.info("PROGRESS: precomputed doc #%i/%i" % (pos, lenother)) 290 | self.qindex.normalize, self.qindex.num_best = norm, topsims 291 | self.id2sims.sync() 292 | 293 | 294 | def __len__(self): 295 | return len(self.id2pos) 296 | 297 | def __contains__(self, docid): 298 | return docid in self.id2pos 299 | 300 | def keys(self): 301 | return self.id2pos.keys() 302 | 303 | def __str__(self): 304 | return "SimIndex(%i docs, %i real size)" % (len(self), self.length) 305 | #endclass SimIndex 306 | 307 | 308 | 309 | class SimModel(gensim.utils.SaveLoad): 310 | """ 311 | A semantic model responsible for translating between plain text and (semantic) 312 | vectors. 313 | 314 | These vectors can then be indexed/queried for similarity, see the `SimIndex` 315 | class. Used internally by `SimServer`. 316 | """ 317 | def __init__(self, fresh_docs, dictionary=None, method=None, params=None): 318 | """ 319 | Train a model, using `fresh_docs` as training corpus. 320 | 321 | If `dictionary` is not specified, it is computed from the documents. 322 | 323 | `method` is currently one of "tfidf"/"lsi"/"lda". 324 | """ 325 | # FIXME TODO: use subclassing/injection for different methods, instead of param.. 
326 | self.method = method 327 | if params is None: 328 | params = {} 329 | self.params = params 330 | logger.info("collecting %i document ids" % len(fresh_docs)) 331 | docids = fresh_docs.keys() 332 | logger.info("creating model from %s documents" % len(docids)) 333 | preprocessed = lambda: (fresh_docs[docid]['tokens'] for docid in docids) 334 | 335 | # create id->word (integer->string) mapping 336 | logger.info("creating dictionary from %s documents" % len(docids)) 337 | if dictionary is None: 338 | dictionary = gensim.corpora.Dictionary(preprocessed()) 339 | if len(docids) >= 1000: 340 | dictionary.filter_extremes(no_below=5, no_above=0.2, keep_n=50000) 341 | else: 342 | logger.warning("training model on only %i documents; is this intentional?" % len(docids)) 343 | dictionary.filter_extremes(no_below=0, no_above=1.0, keep_n=50000) 344 | self.dictionary = dictionary 345 | 346 | class IterableCorpus(object): 347 | def __iter__(self): 348 | for tokens in preprocessed(): 349 | yield dictionary.doc2bow(tokens) 350 | 351 | def __len__(self): 352 | return len(docids) 353 | corpus = IterableCorpus() 354 | 355 | if method == 'lsi': 356 | logger.info("training TF-IDF model") 357 | self.tfidf = gensim.models.TfidfModel(corpus, id2word=self.dictionary) 358 | logger.info("training LSI model") 359 | tfidf_corpus = self.tfidf[corpus] 360 | self.lsi = gensim.models.LsiModel(tfidf_corpus, id2word=self.dictionary, **params) 361 | self.lsi.projection.u = self.lsi.projection.u.astype(numpy.float32) # use single precision to save mem 362 | self.num_features = len(self.lsi.projection.s) 363 | elif method == 'lda_tfidf': 364 | logger.info("training TF-IDF model") 365 | self.tfidf = gensim.models.TfidfModel(corpus, id2word=self.dictionary) 366 | logger.info("training LDA model") 367 | self.lda = gensim.models.LdaModel(self.tfidf[corpus], id2word=self.dictionary, **params) 368 | self.num_features = self.lda.num_topics 369 | elif method == 'lda': 370 | logger.info("training TF-IDF 
model") 371 | self.tfidf = gensim.models.TfidfModel(corpus, id2word=self.dictionary) 372 | logger.info("training LDA model") 373 | self.lda = gensim.models.LdaModel(corpus, id2word=self.dictionary, **params) 374 | self.num_features = self.lda.num_topics 375 | elif method == 'logentropy': 376 | logger.info("training a log-entropy model") 377 | self.logent = gensim.models.LogEntropyModel(corpus, id2word=self.dictionary, **params) 378 | self.num_features = len(self.dictionary) 379 | else: 380 | msg = "unknown semantic method %s" % method 381 | logger.error(msg) 382 | raise NotImplementedError(msg) 383 | 384 | 385 | def doc2vec(self, doc): 386 | """Convert a single SimilarityDocument to vector.""" 387 | bow = self.dictionary.doc2bow(doc['tokens']) 388 | if self.method == 'lsi': 389 | return self.lsi[self.tfidf[bow]] 390 | elif self.method == 'lda': 391 | return self.lda[bow] 392 | elif self.method == 'lda_tfidf': 393 | return self.lda[self.tfidf[bow]] 394 | elif self.method == 'logentropy': 395 | return self.logent[bow] 396 | 397 | 398 | def docs2vecs(self, docs): 399 | """Convert multiple SimilarityDocuments to vectors (batch version of doc2vec).""" 400 | bows = (self.dictionary.doc2bow(doc['tokens']) for doc in docs) 401 | if self.method == 'lsi': 402 | return self.lsi[self.tfidf[bows]] 403 | elif self.method == 'lda': 404 | return self.lda[bows] 405 | elif self.method == 'lda_tfidf': 406 | return self.lda[self.tfidf[bows]] 407 | elif self.method == 'logentropy': 408 | return self.logent[bows] 409 | 410 | 411 | def get_tfidf(self, doc): 412 | bow = self.dictionary.doc2bow(doc['tokens']) 413 | if hasattr(self, 'tfidf'): 414 | return self.tfidf[bow] 415 | if hasattr(self, 'logent'): 416 | return self.logent[bow] 417 | raise ValueError("model must contain either TF-IDF or LogEntropy transformation") 418 | 419 | 420 | def close(self): 421 | """Release important resources manually.""" 422 | pass 423 | 424 | def __str__(self): 425 | return "SimModel(method=%s, dict=%s)" % 
(self.method, self.dictionary) 426 | #endclass SimModel 427 | 428 | 429 | 430 | class SimServer(object): 431 | """ 432 | Top-level functionality for similarity services. A similarity server takes 433 | care of:: 434 | 435 | 1. creating semantic models 436 | 2. indexing documents using these models 437 | 3. finding the most similar documents in an index. 438 | 439 | An object of this class can be shared across network via Pyro, to answer remote 440 | client requests. It is thread safe. Using a server concurrently from multiple 441 | processes is safe for reading = answering similarity queries. Modifying 442 | (training/indexing) is realized via locking = serialized internally. 443 | """ 444 | def __init__(self, basename, use_locks=False): 445 | """ 446 | All data will be stored under directory `basename`. If there is a server 447 | there already, it will be loaded (resumed). 448 | 449 | The server object is stateless in RAM -- its state is defined entirely by its location. 450 | There is therefore no need to store the server object. 
451 | """ 452 | if not os.path.isdir(basename): 453 | raise ValueError("%r must be a writable directory" % basename) 454 | self.basename = basename 455 | self.use_locks = use_locks 456 | self.lock_update = threading.RLock() if use_locks else gensim.utils.nocm 457 | try: 458 | self.fresh_index = SimIndex.load(self.location('index_fresh')) 459 | except: 460 | logger.debug("starting a new fresh index") 461 | self.fresh_index = None 462 | try: 463 | self.opt_index = SimIndex.load(self.location('index_opt')) 464 | except: 465 | logger.debug("starting a new optimized index") 466 | self.opt_index = None 467 | try: 468 | self.model = SimModel.load(self.location('model')) 469 | except: 470 | self.model = None 471 | self.payload = SqliteDict(self.location('payload'), autocommit=True, journal_mode=JOURNAL_MODE) 472 | self.flush(save_index=False, save_model=False, clear_buffer=True) 473 | logger.info("loaded %s" % self) 474 | 475 | 476 | def location(self, name): 477 | return os.path.join(self.basename, name) 478 | 479 | 480 | @gensim.utils.synchronous('lock_update') 481 | def flush(self, save_index=False, save_model=False, clear_buffer=False): 482 | """Commit all changes, clear all caches.""" 483 | if save_index: 484 | if self.fresh_index is not None: 485 | self.fresh_index.save(self.location('index_fresh')) 486 | if self.opt_index is not None: 487 | self.opt_index.save(self.location('index_opt')) 488 | if save_model: 489 | if self.model is not None: 490 | self.model.save(self.location('model')) 491 | self.payload.commit() 492 | if clear_buffer: 493 | if hasattr(self, 'fresh_docs'): 494 | try: 495 | self.fresh_docs.terminate() # erase all buffered documents + file on disk 496 | except: 497 | pass 498 | self.fresh_docs = SqliteDict(journal_mode=JOURNAL_MODE) # buffer defaults to a random location in temp 499 | self.fresh_docs.sync() 500 | 501 | 502 | def close(self): 503 | """Explicitly close open file handles, databases etc.""" 504 | try: 505 | self.payload.close() 506 | 
except: 507 | pass 508 | try: 509 | self.model.close() 510 | except: 511 | pass 512 | try: 513 | self.fresh_index.close() 514 | except: 515 | pass 516 | try: 517 | self.opt_index.close() 518 | except: 519 | pass 520 | try: 521 | self.fresh_docs.terminate() 522 | except: 523 | pass 524 | 525 | def __del__(self): 526 | """When the server went out of scope, make an effort to close its DBs.""" 527 | self.close() 528 | 529 | @gensim.utils.synchronous('lock_update') 530 | def buffer(self, documents): 531 | """ 532 | Add a sequence of documents to be processed (indexed or trained on). 533 | 534 | Here, the documents are simply collected; real processing is done later, 535 | during the `self.index` or `self.train` calls. 536 | 537 | `buffer` can be called repeatedly; the result is the same as if it was 538 | called once, with a concatenation of all the partial document batches. 539 | The point is to save memory when sending large corpora over network: the 540 | entire `documents` must be serialized into RAM. See `utils.upload_chunked()`. 541 | 542 | A call to `flush()` clears this documents-to-be-processed buffer (`flush` 543 | is also implicitly called when you call `index()` and `train()`). 544 | """ 545 | logger.info("adding documents to temporary buffer of %s" % (self)) 546 | for doc in documents: 547 | docid = doc['id'] 548 | # logger.debug("buffering document %r" % docid) 549 | if docid in self.fresh_docs: 550 | logger.warning("asked to re-add id %r; rewriting old value" % docid) 551 | self.fresh_docs[docid] = doc 552 | self.fresh_docs.sync() 553 | 554 | 555 | @gensim.utils.synchronous('lock_update') 556 | def train(self, corpus=None, method='auto', clear_buffer=True, params=None): 557 | """ 558 | Create an indexing model. Will overwrite the model if it already exists. 559 | All indexes become invalid, because documents in them use a now-obsolete 560 | representation. 
561 | 562 | The model is trained on documents previously entered via `buffer`, 563 | or directly on `corpus`, if specified. 564 | """ 565 | if corpus is not None: 566 | # use the supplied corpus only (erase existing buffer, if any) 567 | self.flush(clear_buffer=True) 568 | self.buffer(corpus) 569 | if not self.fresh_docs: 570 | msg = "train called but no training corpus specified for %s" % self 571 | logger.error(msg) 572 | raise ValueError(msg) 573 | if method == 'auto': 574 | numdocs = len(self.fresh_docs) 575 | if numdocs < 1000: 576 | logging.warning("too few training documents; using simple log-entropy model instead of latent semantic indexing") 577 | method = 'logentropy' 578 | else: 579 | method = 'lsi' 580 | if params is None: 581 | params = {} 582 | self.model = SimModel(self.fresh_docs, method=method, params=params) 583 | self.flush(save_model=True, clear_buffer=clear_buffer) 584 | 585 | 586 | @gensim.utils.synchronous('lock_update') 587 | def index(self, corpus=None, clear_buffer=True): 588 | """ 589 | Permanently index all documents previously added via `buffer`, or 590 | directly index documents from `corpus`, if specified. 591 | 592 | The indexing model must already exist (see `train`) before this function 593 | is called. 
594 | """ 595 | if not self.model: 596 | msg = 'must initialize model for %s before indexing documents' % self.basename 597 | logger.error(msg) 598 | raise AttributeError(msg) 599 | 600 | if corpus is not None: 601 | # use the supplied corpus only (erase existing buffer, if any) 602 | self.flush(clear_buffer=True) 603 | self.buffer(corpus) 604 | 605 | if not self.fresh_docs: 606 | msg = "index called but no indexing corpus specified for %s" % self 607 | logger.error(msg) 608 | raise ValueError(msg) 609 | 610 | if not self.fresh_index: 611 | logger.info("starting a new fresh index for %s" % self) 612 | self.fresh_index = SimIndex(self.location('index_fresh'), self.model.num_features) 613 | self.fresh_index.index_documents(self.fresh_docs, self.model) 614 | if self.opt_index is not None: 615 | self.opt_index.delete(self.fresh_docs.keys()) 616 | logger.info("storing document payloads") 617 | for docid in self.fresh_docs: 618 | payload = self.fresh_docs[docid].get('payload', None) 619 | if payload is None: 620 | # HACK: exit on first doc without a payload (=assume all docs have payload, or none does) 621 | break 622 | self.payload[docid] = payload 623 | self.flush(save_index=True, clear_buffer=clear_buffer) 624 | 625 | 626 | @gensim.utils.synchronous('lock_update') 627 | def optimize(self): 628 | """ 629 | Precompute top similarities for all indexed documents. This speeds up 630 | `find_similar` queries by id (but not queries by fulltext). 631 | 632 | Internally, documents are moved from a fresh index (=no precomputed similarities) 633 | to an optimized index (precomputed similarities). Similarity queries always 634 | query both indexes, so this split is transparent to clients. 635 | 636 | If you add documents later via `index`, they go to the fresh index again. 637 | To precompute top similarities for these new documents too, simply call 638 | `optimize` again. 
639 | 640 | """ 641 | if self.fresh_index is None: 642 | logger.warning("optimize called but there are no new documents") 643 | return # nothing to do! 644 | 645 | if self.opt_index is None: 646 | logger.info("starting a new optimized index for %s" % self) 647 | self.opt_index = SimIndex(self.location('index_opt'), self.model.num_features) 648 | 649 | self.opt_index.merge(self.fresh_index) 650 | self.fresh_index.terminate() # delete old files 651 | self.fresh_index = None 652 | self.flush(save_index=True) 653 | 654 | 655 | @gensim.utils.synchronous('lock_update') 656 | def drop_index(self, keep_model=True): 657 | """Drop all indexed documents. If `keep_model` is False, also drop the model.""" 658 | modelstr = "" if keep_model else "and model " 659 | logger.info("deleting similarity index " + modelstr + "from %s" % self.basename) 660 | 661 | # delete indexes 662 | for index in [self.fresh_index, self.opt_index]: 663 | if index is not None: 664 | index.terminate() 665 | self.fresh_index, self.opt_index = None, None 666 | 667 | # delete payload 668 | if self.payload is not None: 669 | self.payload.close() 670 | 671 | fname = self.location('payload') 672 | try: 673 | if os.path.exists(fname): 674 | os.remove(fname) 675 | logger.info("deleted %s" % fname) 676 | except Exception, e: 677 | logger.warning("failed to delete %s: %s" % (fname, e)) 678 | self.payload = SqliteDict(self.location('payload'), autocommit=True, journal_mode=JOURNAL_MODE) 679 | 680 | # optionally, delete the model as well 681 | if not keep_model and self.model is not None: 682 | self.model.close() 683 | fname = self.location('model') 684 | try: 685 | if os.path.exists(fname): 686 | os.remove(fname) 687 | logger.info("deleted %s" % fname) 688 | except Exception, e: 689 | logger.warning("failed to delete %s: %s" % (fname, e)) 690 | self.model = None 691 | self.flush(save_index=True, save_model=True, clear_buffer=True) 692 | 693 | 694 | @gensim.utils.synchronous('lock_update') 695 | def delete(self, docids): 696 | 
"""Delete specified documents from the index.""" 697 | logger.info("asked to drop %i documents" % len(docids)) 698 | for index in [self.opt_index, self.fresh_index]: 699 | if index is not None: 700 | index.delete(docids) 701 | self.flush(save_index=True) 702 | 703 | 704 | def is_locked(self): 705 | return self.use_locks and self.lock_update._RLock__count > 0 706 | 707 | 708 | def vec_by_id(self, docid): 709 | for index in [self.opt_index, self.fresh_index]: 710 | if index is not None and docid in index: 711 | return index.vec_by_id(docid) 712 | 713 | 714 | def find_similar(self, doc, min_score=0.0, max_results=100): 715 | """ 716 | Find `max_results` most similar articles in the index, each having similarity 717 | score of at least `min_score`. The resulting list may be shorter than `max_results`, 718 | in case there are not enough matching documents. 719 | 720 | `doc` is either a string (=document id, previously indexed) or a 721 | dict containing a 'tokens' key. These tokens are processed to produce a 722 | vector, which is then used as a query against the index. 723 | 724 | The similar documents are returned in decreasing similarity order, as 725 | `(doc_id, similarity_score, doc_payload)` 3-tuples. The payload returned 726 | is identical to what was supplied for this document during indexing. 
727 | 728 | """ 729 | logger.debug("received query call with %r" % doc) 730 | if self.is_locked(): 731 | msg = "cannot query while the server is being updated" 732 | logger.error(msg) 733 | raise RuntimeError(msg) 734 | sims_opt, sims_fresh = None, None 735 | for index in [self.fresh_index, self.opt_index]: 736 | if index is not None: 737 | index.topsims = max_results 738 | if isinstance(doc, basestring): 739 | # query by direct document id 740 | docid = doc 741 | if self.opt_index is not None and docid in self.opt_index: 742 | sims_opt = self.opt_index.sims_by_id(docid) 743 | if self.fresh_index is not None: 744 | vec = self.opt_index.vec_by_id(docid) 745 | sims_fresh = self.fresh_index.sims_by_vec(vec, normalize=False) 746 | elif self.fresh_index is not None and docid in self.fresh_index: 747 | sims_fresh = self.fresh_index.sims_by_id(docid) 748 | if self.opt_index is not None: 749 | vec = self.fresh_index.vec_by_id(docid) 750 | sims_opt = self.opt_index.sims_by_vec(vec, normalize=False) 751 | else: 752 | raise ValueError("document %r not in index" % docid) 753 | else: 754 | if 'topics' in doc: 755 | # user supplied vector directly => use that 756 | vec = gensim.matutils.any2sparse(doc['topics']) 757 | else: 758 | # query by an arbitrary text (=tokens) inside doc['tokens'] 759 | vec = self.model.doc2vec(doc) # convert document (text) to vector 760 | if self.opt_index is not None: 761 | sims_opt = self.opt_index.sims_by_vec(vec) 762 | if self.fresh_index is not None: 763 | sims_fresh = self.fresh_index.sims_by_vec(vec) 764 | 765 | merged = merge_sims(sims_opt, sims_fresh) 766 | logger.debug("got %s raw similars, pruning with max_results=%s, min_score=%s" % 767 | (len(merged), max_results, min_score)) 768 | result = [] 769 | for docid, score in merged: 770 | if score < min_score or 0 < max_results <= len(result): 771 | break 772 | result.append((docid, float(score), self.payload.get(docid, None))) 773 | return result 774 | 775 | 776 | def __str__(self): 777 | 
return ("SimServer(loc=%r, fresh=%s, opt=%s, model=%s, buffer=%s)" % 778 | (self.basename, self.fresh_index, self.opt_index, self.model, self.fresh_docs)) 779 | 780 | 781 | def __len__(self): 782 | return sum(len(index) for index in [self.opt_index, self.fresh_index] 783 | if index is not None) 784 | 785 | 786 | def __contains__(self, docid): 787 | """Is document with `docid` in the index?""" 788 | return any(index is not None and docid in index 789 | for index in [self.opt_index, self.fresh_index]) 790 | 791 | 792 | def get_tfidf(self, *args, **kwargs): 793 | return self.model.get_tfidf(*args, **kwargs) 794 | 795 | 796 | def status(self): 797 | return str(self) 798 | 799 | def keys(self): 800 | """Return ids of all indexed documents.""" 801 | result = [] 802 | if self.fresh_index is not None: 803 | result += self.fresh_index.keys() 804 | if self.opt_index is not None: 805 | result += self.opt_index.keys() 806 | return result 807 | 808 | def memdebug(self): 809 | from guppy import hpy 810 | return str(hpy().heap()) 811 | #endclass SimServer 812 | 813 | 814 | 815 | class SessionServer(gensim.utils.SaveLoad): 816 | """ 817 | Similarity server on top of :class:`SimServer` that implements sessions = transactions. 818 | 819 | A transaction is a set of server modifications (index/delete/train calls) that 820 | may be either committed or rolled back entirely. 821 | 822 | Sessions are realized by: 823 | 824 | 1. cloning (=copying) a SimServer at the beginning of a session 825 | 2. serving read-only queries from the original server (the clone may be modified during queries) 826 | 3. modifications affect only the clone 827 | 4. at commit, the clone becomes the original 828 | 5. 
at rollback, do nothing (clone is discarded, next transaction starts from the original again) 829 | """ 830 | def __init__(self, basedir, autosession=True, use_locks=True): 831 | self.basedir = basedir 832 | self.autosession = autosession 833 | self.use_locks = use_locks 834 | self.lock_update = threading.RLock() if use_locks else gensim.utils.nocm 835 | self.locs = ['a', 'b'] # directories under which to store stable/session data 836 | try: 837 | stable = open(self.location('stable')).read().strip() 838 | self.istable = self.locs.index(stable) 839 | except: 840 | self.istable = 0 841 | logger.info("stable index pointer not found or invalid; starting from %s" % 842 | self.loc_stable) 843 | try: 844 | os.makedirs(self.loc_stable) 845 | except: 846 | pass 847 | self.write_istable() 848 | self.stable = SimServer(self.loc_stable, use_locks=self.use_locks) 849 | self.session = None 850 | 851 | 852 | def location(self, name): 853 | return os.path.join(self.basedir, name) 854 | 855 | @property 856 | def loc_stable(self): 857 | return self.location(self.locs[self.istable]) 858 | 859 | @property 860 | def loc_session(self): 861 | return self.location(self.locs[1 - self.istable]) 862 | 863 | def __contains__(self, docid): 864 | return docid in self.stable 865 | 866 | def __str__(self): 867 | return "SessionServer(\n\tstable=%s\n\tsession=%s\n)" % (self.stable, self.session) 868 | 869 | def __len__(self): 870 | return len(self.stable) 871 | 872 | def keys(self): 873 | return self.stable.keys() 874 | 875 | @gensim.utils.synchronous('lock_update') 876 | def check_session(self): 877 | """ 878 | Make sure a session is open. 879 | 880 | If it's not and autosession is turned on, create a new session automatically. 881 | If it's not and autosession is off, raise an exception. 
882 | """ 883 | if self.session is None: 884 | if self.autosession: 885 | self.open_session() 886 | else: 887 | msg = "must open a session before modifying %s" % self 888 | raise RuntimeError(msg) 889 | 890 | @gensim.utils.synchronous('lock_update') 891 | def open_session(self): 892 | """ 893 | Open a new session to modify this server. 894 | 895 | You can either call this fnc directly, or turn on autosession which will 896 | open/commit sessions for you transparently. 897 | """ 898 | if self.session is not None: 899 | msg = "session already open; commit it or rollback before opening another one in %s" % self 900 | logger.error(msg) 901 | raise RuntimeError(msg) 902 | 903 | logger.info("opening a new session") 904 | logger.info("removing %s" % self.loc_session) 905 | try: 906 | shutil.rmtree(self.loc_session) 907 | except: 908 | logger.info("failed to delete %s" % self.loc_session) 909 | logger.info("cloning server from %s to %s" % 910 | (self.loc_stable, self.loc_session)) 911 | shutil.copytree(self.loc_stable, self.loc_session) 912 | self.session = SimServer(self.loc_session, use_locks=self.use_locks) 913 | self.lock_update.acquire() # no other thread can call any modification methods until commit/rollback 914 | 915 | @gensim.utils.synchronous('lock_update') 916 | def buffer(self, *args, **kwargs): 917 | """Buffer documents, in the current session""" 918 | self.check_session() 919 | result = self.session.buffer(*args, **kwargs) 920 | return result 921 | 922 | @gensim.utils.synchronous('lock_update') 923 | def index(self, *args, **kwargs): 924 | """Index documents, in the current session""" 925 | self.check_session() 926 | result = self.session.index(*args, **kwargs) 927 | if self.autosession: 928 | self.commit() 929 | return result 930 | 931 | @gensim.utils.synchronous('lock_update') 932 | def train(self, *args, **kwargs): 933 | """Update semantic model, in the current session.""" 934 | self.check_session() 935 | result = self.session.train(*args, **kwargs) 936 | 
if self.autosession: 937 | self.commit() 938 | return result 939 | 940 | @gensim.utils.synchronous('lock_update') 941 | def drop_index(self, keep_model=True): 942 | """Drop all indexed documents from the session. Optionally, drop model too.""" 943 | self.check_session() 944 | result = self.session.drop_index(keep_model) 945 | if self.autosession: 946 | self.commit() 947 | return result 948 | 949 | @gensim.utils.synchronous('lock_update') 950 | def delete(self, docids): 951 | """Delete documents from the current session.""" 952 | self.check_session() 953 | result = self.session.delete(docids) 954 | if self.autosession: 955 | self.commit() 956 | return result 957 | 958 | @gensim.utils.synchronous('lock_update') 959 | def optimize(self): 960 | """Optimize index for faster by-document-id queries.""" 961 | self.check_session() 962 | result = self.session.optimize() 963 | if self.autosession: 964 | self.commit() 965 | return result 966 | 967 | @gensim.utils.synchronous('lock_update') 968 | def write_istable(self): 969 | with open(self.location('stable'), 'w') as fout: 970 | fout.write(os.path.basename(self.loc_stable)) 971 | 972 | @gensim.utils.synchronous('lock_update') 973 | def commit(self): 974 | """Commit changes made by the latest session.""" 975 | if self.session is not None: 976 | logger.info("committing transaction in %s" % self) 977 | tmp = self.stable 978 | self.stable, self.session = self.session, None 979 | self.istable = 1 - self.istable 980 | self.write_istable() 981 | tmp.close() # don't wait for gc, release resources manually 982 | self.lock_update.release() 983 | else: 984 | logger.warning("commit called but there's no open session in %s" % self) 985 | 986 | @gensim.utils.synchronous('lock_update') 987 | def rollback(self): 988 | """Ignore all changes made in the latest session (terminate the session).""" 989 | if self.session is not None: 990 | logger.info("rolling back transaction in %s" % self) 991 | self.session.close() 992 | self.session = None 993 
| self.lock_update.release() 994 | else: 995 | logger.warning("rollback called but there's no open session in %s" % self) 996 | 997 | @gensim.utils.synchronous('lock_update') 998 | def set_autosession(self, value=None): 999 | """ 1000 | Turn autosession (automatic committing after each modification call) on/off. 1001 | If value is None, only query the current value (don't change anything). 1002 | """ 1003 | if value is not None: 1004 | self.rollback() 1005 | self.autosession = value 1006 | return self.autosession 1007 | 1008 | @gensim.utils.synchronous('lock_update') 1009 | def close(self): 1010 | """Don't wait for gc, try to release important resources manually""" 1011 | try: 1012 | self.stable.close() 1013 | except: 1014 | pass 1015 | try: 1016 | self.session.close() 1017 | except: 1018 | pass 1019 | 1020 | def __del__(self): 1021 | self.close() 1022 | 1023 | @gensim.utils.synchronous('lock_update') 1024 | def terminate(self): 1025 | """Delete all files created by this server, invalidating `self`. Use with care.""" 1026 | logger.info("deleting entire server %s" % self) 1027 | self.close() 1028 | try: 1029 | shutil.rmtree(self.basedir) 1030 | logger.info("deleted server under %s" % self.basedir) 1031 | # delete everything from self, so that using this object results 1032 | # in an error as quickly as possible 1033 | for val in self.__dict__.keys(): 1034 | try: 1035 | delattr(self, val) 1036 | except: 1037 | pass 1038 | except Exception, e: 1039 | logger.warning("failed to delete SessionServer: %s" % (e)) 1040 | 1041 | 1042 | def find_similar(self, *args, **kwargs): 1043 | """ 1044 | Find similar articles. 1045 | 1046 | With autosession off, use the index state *before* the current session started, 1047 | so that changes made in the session will not be visible here. With autosession 1048 | on, close the current session first (so that session changes *are* committed 1049 | and visible). 
1050 | """ 1051 | if self.session is not None and self.autosession: 1052 | # with autosession on, commit the pending transaction first 1053 | self.commit() 1054 | return self.stable.find_similar(*args, **kwargs) 1055 | 1056 | 1057 | def get_tfidf(self, *args, **kwargs): 1058 | if self.session is not None and self.autosession: 1059 | # with autosession on, commit the pending transaction first 1060 | self.commit() 1061 | return self.stable.get_tfidf(*args, **kwargs) 1062 | 1063 | 1064 | # add some functions for remote access (RPC via Pyro) 1065 | def debug_model(self): 1066 | return self.stable.model 1067 | 1068 | def status(self): # str() alias 1069 | return str(self) 1070 | -------------------------------------------------------------------------------- /simserver/test/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/piskvorky/gensim-simserver/e7e59e836ef6d9da019a8c6b218ef0bdd998b2da/simserver/test/__init__.py -------------------------------------------------------------------------------- /simserver/test/test_simserver.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # 4 | # Copyright (C) 2011 Radim Rehurek 5 | # Licensed under the GNU AGPL v3 - http://www.gnu.org/licenses/agpl.html 6 | 7 | """ 8 | Automated tests for checking similarity server. 
9 | """ 10 | 11 | from __future__ import with_statement 12 | 13 | import logging 14 | import sys 15 | import unittest 16 | from copy import deepcopy 17 | 18 | import numpy 19 | import Pyro4 20 | 21 | import gensim 22 | import simserver 23 | 24 | logger = logging.getLogger('test_simserver') 25 | 26 | 27 | 28 | def mock_documents(language, category): 29 | """Create a few SimServer documents, for testing.""" 30 | documents = ["Human machine interface for lab abc computer applications", 31 | "A survey of user opinion of computer system response time", 32 | "The EPS user interface management system", 33 | "System and human system engineering testing of EPS", 34 | "Relation of user perceived response time to error measurement", 35 | "The generation of random binary unordered trees", 36 | "The intersection graph of paths in trees", 37 | "Graph minors IV Widths of trees and well quasi ordering", 38 | "Graph minors A survey"] 39 | 40 | # Create SimServer dicts from the texts. These are the object that the gensim 41 | # server expects as input. They must contain doc['id'] and doc['tokens'] attributes. 
42 | docs = [{'id': '_'.join((language, category, str(num))), 43 | 'tokens': gensim.utils.simple_preprocess(document), 'payload': range(num), 44 | 'language': language, 'category': category} 45 | for num, document in enumerate(documents)] 46 | return docs 47 | 48 | 49 | class SessionServerTester(unittest.TestCase): 50 | """Test a running SessionServer""" 51 | def setUp(self): 52 | self.docs = mock_documents('en', '') 53 | try: 54 | self.server = Pyro4.Proxy('PYRONAME:gensim.testserver') 55 | logger.info(self.server.status()) 56 | except Exception: 57 | logger.info("could not locate running SessionServer; starting a local server") 58 | self.server = simserver.SessionServer(gensim.utils.randfname()) 59 | self.server.set_autosession(True) 60 | 61 | def tearDown(self): 62 | self.docs = None 63 | self.server.terminate() 64 | try: 65 | # if server is remote, just close the proxy connection 66 | self.server._pyroRelease() 67 | except AttributeError: 68 | try: 69 | # for local server, close & remove all files 70 | self.server.terminate() 71 | except: 72 | pass 73 | 74 | 75 | def check_equal(self, sims1, sims2): 76 | """Check that two returned lists of similarities are equal.""" 77 | sims1 = dict(s[:2] for s in sims1) 78 | sims2 = dict(s[:2] for s in sims2) 79 | for docid in set(sims1.keys() + sims2.keys()): 80 | self.assertTrue(numpy.allclose(sims1.get(docid, 0.0), sims2.get(docid, 0.0), atol=1e-7)) 81 | 82 | 83 | def test_model(self): 84 | """test remote server model creation""" 85 | logger.debug(self.server.status()) 86 | # calling train without specifying a training corpus raises a ValueError: 87 | self.assertRaises(ValueError, self.server.train, method='lsi') 88 | 89 | # now do the training for real. use a common pattern -- upload documents 90 | # to be processed to the server. 91 | # the documents will be stored server-side in an Sqlite db, not in memory, 92 | # so the training corpus may be larger than RAM. 
93 | # if the corpus is very large, upload it in smaller chunks, like 10k docs 94 | # at a time (or else Pyro & cPickle will choke). also see `utils.upload_chunked`. 95 | self.server.buffer(self.docs[:2]) # upload 2 documents to server 96 | self.server.buffer(self.docs[2:]) # upload the rest of the documents 97 | 98 | # now, train a model 99 | self.server.train(method='lsi') 100 | 101 | # check that the model was trained correctly 102 | model = self.server.debug_model() 103 | s_values = [1.2704573, 1.13604315, 1.07827574, 1.02963433, 0.97147057, 104 | 0.9280468, 0.90321329, 0.83034548, 0.74981662] 105 | self.assertTrue(numpy.allclose(model.lsi.projection.s, s_values)) 106 | 107 | vec0 = [(0, -0.27759625943508104), (1, -0.29736164713214469), (2, 0.14768134319504395), 108 | (3, -0.24586351025187975), (4, 0.8359357384389362), (5, 0.084659319019917578), 109 | (6, -0.2042204354844826), (7, -0.016382387104491858), (8, 0.065784642613330224)] 110 | got = model.doc2vec(self.docs[0]) 111 | self.assertTrue(numpy.allclose(abs(gensim.matutils.sparse2full(vec0, model.num_features)), 112 | abs(gensim.matutils.sparse2full(got, model.num_features)))) 113 | 114 | 115 | def test_index(self): 116 | """test remote server incremental indexing""" 117 | # delete any existing model and indexes first 118 | self.server.drop_index(keep_model=False) 119 | logger.debug(self.server.status()) 120 | 121 | # try indexing without a model -- raises AttributeError 122 | self.assertRaises(AttributeError, self.server.index, self.docs) 123 | 124 | # train a fresh model 125 | self.server.train(self.docs, method='lsi') 126 | 127 | # use incremental indexing -- start by indexing the first three documents 128 | self.server.buffer(self.docs[:3]) # upload the documents 129 | self.server.index() # index uploaded documents & clear upload buffer 130 | self.assertRaises(ValueError, self.server.find_similar, 'fakeid') # no such id -> raises ValueError 131 | 132 | expected = [('en__1', 0.99999994), ('en__2', 
0.16279206), ('en__0', 0.09881371)] 133 | got = self.server.find_similar(self.docs[1]['id']) # retrieve documents similar to the second document 134 | self.check_equal(expected, got) 135 | 136 | self.server.index(self.docs[3:]) # upload & index the rest of the documents 137 | logger.debug(self.server.status()) 138 | expected = [('en__1', 0.99999994), ('en__4', 0.2686055), ('en__8', 0.229533), 139 | ('en__2', 0.16279206), ('en__3', 0.143899247), ('en__0', 0.09881371), 140 | ('en__6', 0.018686194), ('en__5', 0.017070908), ('en__7', 0.01428914)] 141 | got = self.server.find_similar(self.docs[1]['id']) # retrieve documents similar to the second document 142 | self.check_equal(expected, got) 143 | 144 | # re-index documents. just index documents with the same id -- the old document 145 | # will be replaced by the new one, so that only the latest update counts. 146 | docs = deepcopy(self.docs) 147 | docs[2]['tokens'] = docs[1]['tokens'] # different text, same id 148 | self.server.index(docs[1:3]) # reindex the two modified docs -- total number of indexed docs doesn't change 149 | logger.debug(self.server.status()) 150 | expected = [('en__2', 0.99999994), ('en__1', 0.99999994), ('en__4', 0.26860553), 151 | ('en__8', 0.229533046), ('en__3', 0.143899247), ('en__0', 0.0988137126), 152 | ('en__6', 0.01868619397), ('en__5', 0.0170709081), ('en__7', 0.0142891407)] 153 | got = self.server.find_similar(self.docs[2]['id']) 154 | self.check_equal(expected, got) 155 | 156 | # delete documents: pass it a collection of ids to be removed from the index 157 | to_delete = [doc['id'] for doc in self.docs[-3:]] 158 | self.server.delete(to_delete) # delete the last 3 documents 159 | logger.debug(self.server.status()) 160 | expected = [('en__2', 0.99999994), ('en__1', 0.99999994), ('en__4', 0.26860553), 161 | ('en__3', 0.143899247), ('en__0', 0.09881371), ('en__5', 0.017070908)] 162 | got = self.server.find_similar(self.docs[2]['id']) 163 | self.check_equal(expected, got) 164 | self.assertRaises(ValueError, 
self.server.find_similar, to_delete[0]) # deleted document not there anymore 165 | 166 | 167 | def test_optimize(self): 168 | # to speed up queries by id, call server.optimize() 169 | # it will precompute the most similar documents, for all documents in the index, 170 | # and store them to an Sqlite db for lightning-fast querying. 171 | # querying by fulltext is not affected by this optimization, though. 172 | self.server.drop_index(keep_model=False) 173 | self.server.train(self.docs, method='lsi') 174 | self.server.index(self.docs) 175 | self.server.optimize() 176 | logger.debug(self.server.status()) 177 | # TODO how to test that it's faster? 178 | 179 | 180 | def test_query_id(self): 181 | # index some docs first 182 | self.server.drop_index(keep_model=False) 183 | self.server.train(self.docs, method='lsi') 184 | self.server.index(self.docs) 185 | 186 | # query index by id: return the most similar documents to an already indexed document 187 | docid = self.docs[0]['id'] 188 | expected = [('en__0', 1.0), ('en__2', 0.112942614), ('en__1', 0.09881371), 189 | ('en__3', 0.087866522)] 190 | got = self.server.find_similar(docid) 191 | self.check_equal(expected, got) 192 | 193 | # same thing, but only get docs with similarity >= 0.3 194 | expected = [('en__0', 1.0)] 195 | got = self.server.find_similar(docid, min_score=0.3) 196 | self.check_equal(expected, got) 197 | 198 | # same thing, but only get at most 3 documents with similarity >= 0.1 199 | expected = [('en__0', 1.0), ('en__2', 0.112942614)] 200 | got = self.server.find_similar(docid, max_results=3, min_score=0.1) 201 | self.check_equal(expected, got) 202 | 203 | 204 | def test_query_document(self): 205 | # index some docs first 206 | self.server.drop_index(keep_model=False) 207 | self.server.train(self.docs, method='lsi') 208 | self.server.index(self.docs) 209 | 210 | # query index by document text: id is ignored 211 | doc = self.docs[0] 212 | doc['id'] = None # clear out id; not necessary, just to demonstrate it's 
not used in query-by-document 213 | expected = [('en__0', 1.0), ('en__2', 0.11294261), ('en__1', 0.09881371), ('en__3', 0.087866522)] 214 | got = self.server.find_similar(doc) 215 | self.check_equal(expected, got) 216 | 217 | # same thing, but only get docs with similarity >= 0.3 218 | expected = [('en__0', 1.0)] 219 | got = self.server.find_similar(doc, min_score=0.3) 220 | self.check_equal(expected, got) 221 | 222 | # same thing, but only get at most 3 documents with similarity >= 0.1 223 | expected = [('en__0', 1.0), ('en__2', 0.112942614)] 224 | got = self.server.find_similar(doc, max_results=3, min_score=0.1) 225 | self.check_equal(expected, got) 226 | 227 | 228 | def test_payload(self): 229 | """test storing/retrieving document payload""" 230 | # delete any existing model and indexes first 231 | self.server.drop_index(keep_model=False) 232 | self.server.train(self.docs, method='lsi') 233 | 234 | # create payload for three documents 235 | docs = deepcopy(self.docs) 236 | docs[0]['payload'] = 'some payload' 237 | docs[1]['payload'] = range(10) 238 | docs[2]['payload'] = 3.14 239 | id2doc = dict((doc['id'], doc) for doc in docs) 240 | 241 | # index documents & store payloads 242 | self.server.index(docs) 243 | 244 | # do a few queries, check that returned payloads match what we sent to the server 245 | for queryid in [docs[0]['id'], docs[1]['id'], docs[2]['id']]: 246 | for docid, sim, payload in self.server.find_similar(queryid): 247 | self.assertEqual(payload, id2doc[docid].get('payload', None)) 248 | 249 | 250 | def test_sessions(self): 251 | """check similarity server transactions (autosession off)""" 252 | self.server.drop_index(keep_model=False) 253 | self.server.set_autosession(False) # turn off auto-commit 254 | 255 | # trying to modify index with auto-commit off and without an open session results in exception 256 | self.assertRaises(RuntimeError, self.server.train, self.docs, method='lsi') 257 | self.assertRaises(RuntimeError, self.server.index, 
self.docs) 258 | 259 | # open session, train model & index some documents 260 | self.server.open_session() 261 | self.server.train(self.docs, method='lsi') 262 | self.server.index(self.docs) 263 | 264 | # cannot open 2 simultaneous sessions: must commit or rollback first 265 | self.assertRaises(RuntimeError, self.server.open_session) 266 | 267 | self.server.commit() # commit ends the session 268 | 269 | # no session open; cannot modify 270 | self.assertRaises(RuntimeError, self.server.index, self.docs) 271 | 272 | # open another session (using outcome of the previously committed one) 273 | self.server.open_session() 274 | doc = self.docs[0] 275 | self.server.delete([doc['id']]) # delete one document from the session 276 | # queries hit the original index; current session modifications are ignored 277 | self.server.find_similar(doc['id']) # document still there! 278 | self.server.commit() 279 | 280 | # session committed => document is gone now, querying for its id raises exception 281 | self.assertRaises(ValueError, self.server.find_similar, doc['id']) 282 | 283 | # open another session; this one will be rolled back 284 | self.server.open_session() 285 | self.server.index([doc]) # re-add the deleted document 286 | self.assertRaises(ValueError, self.server.find_similar, doc['id']) # no commit yet -- document still gone! 287 | self.server.rollback() # ignore all changes made since open_session 288 | 289 | self.assertRaises(ValueError, self.server.find_similar, doc['id']) # addition was rolled back -- document still gone! 290 | #end SessionServerTester 291 | 292 | 293 | 294 | if __name__ == '__main__': 295 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(module)s:%(lineno)d : %(funcName)s(%(threadName)s) : %(message)s', 296 | level=logging.DEBUG) 297 | unittest.main() 298 | --------------------------------------------------------------------------------
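Editor's note: the SessionServer transactions tested above rest on a simple double-directory scheme (lines 830-848 and 973-981 of simserver.py): two sibling directories `a` and `b` alternate as the committed copy, a tiny `stable` pointer file records which one is current, a session clones the stable directory into the other slot and modifies only the clone, and commit merely flips the pointer while rollback forgets the clone. Below is a minimal, self-contained sketch of that pointer-swap pattern; `TwoDirStore` and its methods are illustrative names invented for this sketch, not part of the simserver API.

```python
import os
import shutil
import tempfile


class TwoDirStore:
    """Sketch of the a/b "stable pointer" commit scheme (illustrative only)."""
    LOCS = ['a', 'b']  # the two alternating data directories

    def __init__(self, basedir):
        self.basedir = basedir
        self.istable = 0  # index into LOCS of the committed copy
        pointer = os.path.join(basedir, 'stable')
        if os.path.exists(pointer):
            with open(pointer) as fin:
                self.istable = self.LOCS.index(fin.read().strip())
        if not os.path.isdir(self.loc_stable):
            os.makedirs(self.loc_stable)
        self._write_pointer()

    @property
    def loc_stable(self):
        return os.path.join(self.basedir, self.LOCS[self.istable])

    @property
    def loc_session(self):
        return os.path.join(self.basedir, self.LOCS[1 - self.istable])

    def _write_pointer(self):
        with open(os.path.join(self.basedir, 'stable'), 'w') as fout:
            fout.write(self.LOCS[self.istable])

    def open_session(self):
        # start the session from a full copy of the committed state
        if os.path.isdir(self.loc_session):
            shutil.rmtree(self.loc_session)
        shutil.copytree(self.loc_stable, self.loc_session)

    def write(self, name, data):
        # modifications touch only the session clone
        with open(os.path.join(self.loc_session, name), 'w') as fout:
            fout.write(data)

    def read(self, name):
        # reads always come from the committed copy
        with open(os.path.join(self.loc_stable, name)) as fin:
            return fin.read()

    def commit(self):
        # flipping the pointer promotes the clone to the new stable copy
        self.istable = 1 - self.istable
        self._write_pointer()

    def rollback(self):
        # nothing to do: the stable copy was never touched
        pass


if __name__ == '__main__':
    base = tempfile.mkdtemp()
    store = TwoDirStore(base)
    store.open_session()
    store.write('doc', 'v1')
    store.commit()
    store.open_session()
    store.write('doc', 'v2')
    store.rollback()
    print(store.read('doc'))  # prints 'v1': the rolled-back 'v2' never reached the stable copy
    shutil.rmtree(base)
```

This is why queries in `test_sessions` still see the old index until `commit()`: readers only ever touch the stable directory, so a half-finished session (or a crash mid-session) can never corrupt committed data, and rollback costs nothing.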