├── .gitignore
├── LICENSE.txt
├── MANIFEST.in
├── README.rst
├── docs
│   ├── similarities
│   │   └── simserver.rst
│   └── simserver.rst
├── ez_setup.py
├── setup.py
└── simserver
    ├── __init__.py
    ├── run_simserver.py
    ├── simserver.py
    └── test
        ├── __init__.py
        └── test_simserver.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.egg
3 |
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | GNU AFFERO GENERAL PUBLIC LICENSE
2 | Version 3, 19 November 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 | Preamble
9 |
10 | The GNU Affero General Public License is a free, copyleft license for
11 | software and other kinds of works, specifically designed to ensure
12 | cooperation with the community in the case of network server software.
13 |
14 | The licenses for most software and other practical works are designed
15 | to take away your freedom to share and change the works. By contrast,
16 | our General Public Licenses are intended to guarantee your freedom to
17 | share and change all versions of a program--to make sure it remains free
18 | software for all its users.
19 |
20 | When we speak of free software, we are referring to freedom, not
21 | price. Our General Public Licenses are designed to make sure that you
22 | have the freedom to distribute copies of free software (and charge for
23 | them if you wish), that you receive source code or can get it if you
24 | want it, that you can change the software or use pieces of it in new
25 | free programs, and that you know you can do these things.
26 |
27 | Developers that use our General Public Licenses protect your rights
28 | with two steps: (1) assert copyright on the software, and (2) offer
29 | you this License which gives you legal permission to copy, distribute
30 | and/or modify the software.
31 |
32 | A secondary benefit of defending all users' freedom is that
33 | improvements made in alternate versions of the program, if they
34 | receive widespread use, become available for other developers to
35 | incorporate. Many developers of free software are heartened and
36 | encouraged by the resulting cooperation. However, in the case of
37 | software used on network servers, this result may fail to come about.
38 | The GNU General Public License permits making a modified version and
39 | letting the public access it on a server without ever releasing its
40 | source code to the public.
41 |
42 | The GNU Affero General Public License is designed specifically to
43 | ensure that, in such cases, the modified source code becomes available
44 | to the community. It requires the operator of a network server to
45 | provide the source code of the modified version running there to the
46 | users of that server. Therefore, public use of a modified version, on
47 | a publicly accessible server, gives the public access to the source
48 | code of the modified version.
49 |
50 | An older license, called the Affero General Public License and
51 | published by Affero, was designed to accomplish similar goals. This is
52 | a different license, not a version of the Affero GPL, but Affero has
53 | released a new version of the Affero GPL which permits relicensing under
54 | this license.
55 |
56 | The precise terms and conditions for copying, distribution and
57 | modification follow.
58 |
59 | TERMS AND CONDITIONS
60 |
61 | 0. Definitions.
62 |
63 | "This License" refers to version 3 of the GNU Affero General Public License.
64 |
65 | "Copyright" also means copyright-like laws that apply to other kinds of
66 | works, such as semiconductor masks.
67 |
68 | "The Program" refers to any copyrightable work licensed under this
69 | License. Each licensee is addressed as "you". "Licensees" and
70 | "recipients" may be individuals or organizations.
71 |
72 | To "modify" a work means to copy from or adapt all or part of the work
73 | in a fashion requiring copyright permission, other than the making of an
74 | exact copy. The resulting work is called a "modified version" of the
75 | earlier work or a work "based on" the earlier work.
76 |
77 | A "covered work" means either the unmodified Program or a work based
78 | on the Program.
79 |
80 | To "propagate" a work means to do anything with it that, without
81 | permission, would make you directly or secondarily liable for
82 | infringement under applicable copyright law, except executing it on a
83 | computer or modifying a private copy. Propagation includes copying,
84 | distribution (with or without modification), making available to the
85 | public, and in some countries other activities as well.
86 |
87 | To "convey" a work means any kind of propagation that enables other
88 | parties to make or receive copies. Mere interaction with a user through
89 | a computer network, with no transfer of a copy, is not conveying.
90 |
91 | An interactive user interface displays "Appropriate Legal Notices"
92 | to the extent that it includes a convenient and prominently visible
93 | feature that (1) displays an appropriate copyright notice, and (2)
94 | tells the user that there is no warranty for the work (except to the
95 | extent that warranties are provided), that licensees may convey the
96 | work under this License, and how to view a copy of this License. If
97 | the interface presents a list of user commands or options, such as a
98 | menu, a prominent item in the list meets this criterion.
99 |
100 | 1. Source Code.
101 |
102 | The "source code" for a work means the preferred form of the work
103 | for making modifications to it. "Object code" means any non-source
104 | form of a work.
105 |
106 | A "Standard Interface" means an interface that either is an official
107 | standard defined by a recognized standards body, or, in the case of
108 | interfaces specified for a particular programming language, one that
109 | is widely used among developers working in that language.
110 |
111 | The "System Libraries" of an executable work include anything, other
112 | than the work as a whole, that (a) is included in the normal form of
113 | packaging a Major Component, but which is not part of that Major
114 | Component, and (b) serves only to enable use of the work with that
115 | Major Component, or to implement a Standard Interface for which an
116 | implementation is available to the public in source code form. A
117 | "Major Component", in this context, means a major essential component
118 | (kernel, window system, and so on) of the specific operating system
119 | (if any) on which the executable work runs, or a compiler used to
120 | produce the work, or an object code interpreter used to run it.
121 |
122 | The "Corresponding Source" for a work in object code form means all
123 | the source code needed to generate, install, and (for an executable
124 | work) run the object code and to modify the work, including scripts to
125 | control those activities. However, it does not include the work's
126 | System Libraries, or general-purpose tools or generally available free
127 | programs which are used unmodified in performing those activities but
128 | which are not part of the work. For example, Corresponding Source
129 | includes interface definition files associated with source files for
130 | the work, and the source code for shared libraries and dynamically
131 | linked subprograms that the work is specifically designed to require,
132 | such as by intimate data communication or control flow between those
133 | subprograms and other parts of the work.
134 |
135 | The Corresponding Source need not include anything that users
136 | can regenerate automatically from other parts of the Corresponding
137 | Source.
138 |
139 | The Corresponding Source for a work in source code form is that
140 | same work.
141 |
142 | 2. Basic Permissions.
143 |
144 | All rights granted under this License are granted for the term of
145 | copyright on the Program, and are irrevocable provided the stated
146 | conditions are met. This License explicitly affirms your unlimited
147 | permission to run the unmodified Program. The output from running a
148 | covered work is covered by this License only if the output, given its
149 | content, constitutes a covered work. This License acknowledges your
150 | rights of fair use or other equivalent, as provided by copyright law.
151 |
152 | You may make, run and propagate covered works that you do not
153 | convey, without conditions so long as your license otherwise remains
154 | in force. You may convey covered works to others for the sole purpose
155 | of having them make modifications exclusively for you, or provide you
156 | with facilities for running those works, provided that you comply with
157 | the terms of this License in conveying all material for which you do
158 | not control copyright. Those thus making or running the covered works
159 | for you must do so exclusively on your behalf, under your direction
160 | and control, on terms that prohibit them from making any copies of
161 | your copyrighted material outside their relationship with you.
162 |
163 | Conveying under any other circumstances is permitted solely under
164 | the conditions stated below. Sublicensing is not allowed; section 10
165 | makes it unnecessary.
166 |
167 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
168 |
169 | No covered work shall be deemed part of an effective technological
170 | measure under any applicable law fulfilling obligations under article
171 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
172 | similar laws prohibiting or restricting circumvention of such
173 | measures.
174 |
175 | When you convey a covered work, you waive any legal power to forbid
176 | circumvention of technological measures to the extent such circumvention
177 | is effected by exercising rights under this License with respect to
178 | the covered work, and you disclaim any intention to limit operation or
179 | modification of the work as a means of enforcing, against the work's
180 | users, your or third parties' legal rights to forbid circumvention of
181 | technological measures.
182 |
183 | 4. Conveying Verbatim Copies.
184 |
185 | You may convey verbatim copies of the Program's source code as you
186 | receive it, in any medium, provided that you conspicuously and
187 | appropriately publish on each copy an appropriate copyright notice;
188 | keep intact all notices stating that this License and any
189 | non-permissive terms added in accord with section 7 apply to the code;
190 | keep intact all notices of the absence of any warranty; and give all
191 | recipients a copy of this License along with the Program.
192 |
193 | You may charge any price or no price for each copy that you convey,
194 | and you may offer support or warranty protection for a fee.
195 |
196 | 5. Conveying Modified Source Versions.
197 |
198 | You may convey a work based on the Program, or the modifications to
199 | produce it from the Program, in the form of source code under the
200 | terms of section 4, provided that you also meet all of these conditions:
201 |
202 | a) The work must carry prominent notices stating that you modified
203 | it, and giving a relevant date.
204 |
205 | b) The work must carry prominent notices stating that it is
206 | released under this License and any conditions added under section
207 | 7. This requirement modifies the requirement in section 4 to
208 | "keep intact all notices".
209 |
210 | c) You must license the entire work, as a whole, under this
211 | License to anyone who comes into possession of a copy. This
212 | License will therefore apply, along with any applicable section 7
213 | additional terms, to the whole of the work, and all its parts,
214 | regardless of how they are packaged. This License gives no
215 | permission to license the work in any other way, but it does not
216 | invalidate such permission if you have separately received it.
217 |
218 | d) If the work has interactive user interfaces, each must display
219 | Appropriate Legal Notices; however, if the Program has interactive
220 | interfaces that do not display Appropriate Legal Notices, your
221 | work need not make them do so.
222 |
223 | A compilation of a covered work with other separate and independent
224 | works, which are not by their nature extensions of the covered work,
225 | and which are not combined with it such as to form a larger program,
226 | in or on a volume of a storage or distribution medium, is called an
227 | "aggregate" if the compilation and its resulting copyright are not
228 | used to limit the access or legal rights of the compilation's users
229 | beyond what the individual works permit. Inclusion of a covered work
230 | in an aggregate does not cause this License to apply to the other
231 | parts of the aggregate.
232 |
233 | 6. Conveying Non-Source Forms.
234 |
235 | You may convey a covered work in object code form under the terms
236 | of sections 4 and 5, provided that you also convey the
237 | machine-readable Corresponding Source under the terms of this License,
238 | in one of these ways:
239 |
240 | a) Convey the object code in, or embodied in, a physical product
241 | (including a physical distribution medium), accompanied by the
242 | Corresponding Source fixed on a durable physical medium
243 | customarily used for software interchange.
244 |
245 | b) Convey the object code in, or embodied in, a physical product
246 | (including a physical distribution medium), accompanied by a
247 | written offer, valid for at least three years and valid for as
248 | long as you offer spare parts or customer support for that product
249 | model, to give anyone who possesses the object code either (1) a
250 | copy of the Corresponding Source for all the software in the
251 | product that is covered by this License, on a durable physical
252 | medium customarily used for software interchange, for a price no
253 | more than your reasonable cost of physically performing this
254 | conveying of source, or (2) access to copy the
255 | Corresponding Source from a network server at no charge.
256 |
257 | c) Convey individual copies of the object code with a copy of the
258 | written offer to provide the Corresponding Source. This
259 | alternative is allowed only occasionally and noncommercially, and
260 | only if you received the object code with such an offer, in accord
261 | with subsection 6b.
262 |
263 | d) Convey the object code by offering access from a designated
264 | place (gratis or for a charge), and offer equivalent access to the
265 | Corresponding Source in the same way through the same place at no
266 | further charge. You need not require recipients to copy the
267 | Corresponding Source along with the object code. If the place to
268 | copy the object code is a network server, the Corresponding Source
269 | may be on a different server (operated by you or a third party)
270 | that supports equivalent copying facilities, provided you maintain
271 | clear directions next to the object code saying where to find the
272 | Corresponding Source. Regardless of what server hosts the
273 | Corresponding Source, you remain obligated to ensure that it is
274 | available for as long as needed to satisfy these requirements.
275 |
276 | e) Convey the object code using peer-to-peer transmission, provided
277 | you inform other peers where the object code and Corresponding
278 | Source of the work are being offered to the general public at no
279 | charge under subsection 6d.
280 |
281 | A separable portion of the object code, whose source code is excluded
282 | from the Corresponding Source as a System Library, need not be
283 | included in conveying the object code work.
284 |
285 | A "User Product" is either (1) a "consumer product", which means any
286 | tangible personal property which is normally used for personal, family,
287 | or household purposes, or (2) anything designed or sold for incorporation
288 | into a dwelling. In determining whether a product is a consumer product,
289 | doubtful cases shall be resolved in favor of coverage. For a particular
290 | product received by a particular user, "normally used" refers to a
291 | typical or common use of that class of product, regardless of the status
292 | of the particular user or of the way in which the particular user
293 | actually uses, or expects or is expected to use, the product. A product
294 | is a consumer product regardless of whether the product has substantial
295 | commercial, industrial or non-consumer uses, unless such uses represent
296 | the only significant mode of use of the product.
297 |
298 | "Installation Information" for a User Product means any methods,
299 | procedures, authorization keys, or other information required to install
300 | and execute modified versions of a covered work in that User Product from
301 | a modified version of its Corresponding Source. The information must
302 | suffice to ensure that the continued functioning of the modified object
303 | code is in no case prevented or interfered with solely because
304 | modification has been made.
305 |
306 | If you convey an object code work under this section in, or with, or
307 | specifically for use in, a User Product, and the conveying occurs as
308 | part of a transaction in which the right of possession and use of the
309 | User Product is transferred to the recipient in perpetuity or for a
310 | fixed term (regardless of how the transaction is characterized), the
311 | Corresponding Source conveyed under this section must be accompanied
312 | by the Installation Information. But this requirement does not apply
313 | if neither you nor any third party retains the ability to install
314 | modified object code on the User Product (for example, the work has
315 | been installed in ROM).
316 |
317 | The requirement to provide Installation Information does not include a
318 | requirement to continue to provide support service, warranty, or updates
319 | for a work that has been modified or installed by the recipient, or for
320 | the User Product in which it has been modified or installed. Access to a
321 | network may be denied when the modification itself materially and
322 | adversely affects the operation of the network or violates the rules and
323 | protocols for communication across the network.
324 |
325 | Corresponding Source conveyed, and Installation Information provided,
326 | in accord with this section must be in a format that is publicly
327 | documented (and with an implementation available to the public in
328 | source code form), and must require no special password or key for
329 | unpacking, reading or copying.
330 |
331 | 7. Additional Terms.
332 |
333 | "Additional permissions" are terms that supplement the terms of this
334 | License by making exceptions from one or more of its conditions.
335 | Additional permissions that are applicable to the entire Program shall
336 | be treated as though they were included in this License, to the extent
337 | that they are valid under applicable law. If additional permissions
338 | apply only to part of the Program, that part may be used separately
339 | under those permissions, but the entire Program remains governed by
340 | this License without regard to the additional permissions.
341 |
342 | When you convey a copy of a covered work, you may at your option
343 | remove any additional permissions from that copy, or from any part of
344 | it. (Additional permissions may be written to require their own
345 | removal in certain cases when you modify the work.) You may place
346 | additional permissions on material, added by you to a covered work,
347 | for which you have or can give appropriate copyright permission.
348 |
349 | Notwithstanding any other provision of this License, for material you
350 | add to a covered work, you may (if authorized by the copyright holders of
351 | that material) supplement the terms of this License with terms:
352 |
353 | a) Disclaiming warranty or limiting liability differently from the
354 | terms of sections 15 and 16 of this License; or
355 |
356 | b) Requiring preservation of specified reasonable legal notices or
357 | author attributions in that material or in the Appropriate Legal
358 | Notices displayed by works containing it; or
359 |
360 | c) Prohibiting misrepresentation of the origin of that material, or
361 | requiring that modified versions of such material be marked in
362 | reasonable ways as different from the original version; or
363 |
364 | d) Limiting the use for publicity purposes of names of licensors or
365 | authors of the material; or
366 |
367 | e) Declining to grant rights under trademark law for use of some
368 | trade names, trademarks, or service marks; or
369 |
370 | f) Requiring indemnification of licensors and authors of that
371 | material by anyone who conveys the material (or modified versions of
372 | it) with contractual assumptions of liability to the recipient, for
373 | any liability that these contractual assumptions directly impose on
374 | those licensors and authors.
375 |
376 | All other non-permissive additional terms are considered "further
377 | restrictions" within the meaning of section 10. If the Program as you
378 | received it, or any part of it, contains a notice stating that it is
379 | governed by this License along with a term that is a further
380 | restriction, you may remove that term. If a license document contains
381 | a further restriction but permits relicensing or conveying under this
382 | License, you may add to a covered work material governed by the terms
383 | of that license document, provided that the further restriction does
384 | not survive such relicensing or conveying.
385 |
386 | If you add terms to a covered work in accord with this section, you
387 | must place, in the relevant source files, a statement of the
388 | additional terms that apply to those files, or a notice indicating
389 | where to find the applicable terms.
390 |
391 | Additional terms, permissive or non-permissive, may be stated in the
392 | form of a separately written license, or stated as exceptions;
393 | the above requirements apply either way.
394 |
395 | 8. Termination.
396 |
397 | You may not propagate or modify a covered work except as expressly
398 | provided under this License. Any attempt otherwise to propagate or
399 | modify it is void, and will automatically terminate your rights under
400 | this License (including any patent licenses granted under the third
401 | paragraph of section 11).
402 |
403 | However, if you cease all violation of this License, then your
404 | license from a particular copyright holder is reinstated (a)
405 | provisionally, unless and until the copyright holder explicitly and
406 | finally terminates your license, and (b) permanently, if the copyright
407 | holder fails to notify you of the violation by some reasonable means
408 | prior to 60 days after the cessation.
409 |
410 | Moreover, your license from a particular copyright holder is
411 | reinstated permanently if the copyright holder notifies you of the
412 | violation by some reasonable means, this is the first time you have
413 | received notice of violation of this License (for any work) from that
414 | copyright holder, and you cure the violation prior to 30 days after
415 | your receipt of the notice.
416 |
417 | Termination of your rights under this section does not terminate the
418 | licenses of parties who have received copies or rights from you under
419 | this License. If your rights have been terminated and not permanently
420 | reinstated, you do not qualify to receive new licenses for the same
421 | material under section 10.
422 |
423 | 9. Acceptance Not Required for Having Copies.
424 |
425 | You are not required to accept this License in order to receive or
426 | run a copy of the Program. Ancillary propagation of a covered work
427 | occurring solely as a consequence of using peer-to-peer transmission
428 | to receive a copy likewise does not require acceptance. However,
429 | nothing other than this License grants you permission to propagate or
430 | modify any covered work. These actions infringe copyright if you do
431 | not accept this License. Therefore, by modifying or propagating a
432 | covered work, you indicate your acceptance of this License to do so.
433 |
434 | 10. Automatic Licensing of Downstream Recipients.
435 |
436 | Each time you convey a covered work, the recipient automatically
437 | receives a license from the original licensors, to run, modify and
438 | propagate that work, subject to this License. You are not responsible
439 | for enforcing compliance by third parties with this License.
440 |
441 | An "entity transaction" is a transaction transferring control of an
442 | organization, or substantially all assets of one, or subdividing an
443 | organization, or merging organizations. If propagation of a covered
444 | work results from an entity transaction, each party to that
445 | transaction who receives a copy of the work also receives whatever
446 | licenses to the work the party's predecessor in interest had or could
447 | give under the previous paragraph, plus a right to possession of the
448 | Corresponding Source of the work from the predecessor in interest, if
449 | the predecessor has it or can get it with reasonable efforts.
450 |
451 | You may not impose any further restrictions on the exercise of the
452 | rights granted or affirmed under this License. For example, you may
453 | not impose a license fee, royalty, or other charge for exercise of
454 | rights granted under this License, and you may not initiate litigation
455 | (including a cross-claim or counterclaim in a lawsuit) alleging that
456 | any patent claim is infringed by making, using, selling, offering for
457 | sale, or importing the Program or any portion of it.
458 |
459 | 11. Patents.
460 |
461 | A "contributor" is a copyright holder who authorizes use under this
462 | License of the Program or a work on which the Program is based. The
463 | work thus licensed is called the contributor's "contributor version".
464 |
465 | A contributor's "essential patent claims" are all patent claims
466 | owned or controlled by the contributor, whether already acquired or
467 | hereafter acquired, that would be infringed by some manner, permitted
468 | by this License, of making, using, or selling its contributor version,
469 | but do not include claims that would be infringed only as a
470 | consequence of further modification of the contributor version. For
471 | purposes of this definition, "control" includes the right to grant
472 | patent sublicenses in a manner consistent with the requirements of
473 | this License.
474 |
475 | Each contributor grants you a non-exclusive, worldwide, royalty-free
476 | patent license under the contributor's essential patent claims, to
477 | make, use, sell, offer for sale, import and otherwise run, modify and
478 | propagate the contents of its contributor version.
479 |
480 | In the following three paragraphs, a "patent license" is any express
481 | agreement or commitment, however denominated, not to enforce a patent
482 | (such as an express permission to practice a patent or covenant not to
483 | sue for patent infringement). To "grant" such a patent license to a
484 | party means to make such an agreement or commitment not to enforce a
485 | patent against the party.
486 |
487 | If you convey a covered work, knowingly relying on a patent license,
488 | and the Corresponding Source of the work is not available for anyone
489 | to copy, free of charge and under the terms of this License, through a
490 | publicly available network server or other readily accessible means,
491 | then you must either (1) cause the Corresponding Source to be so
492 | available, or (2) arrange to deprive yourself of the benefit of the
493 | patent license for this particular work, or (3) arrange, in a manner
494 | consistent with the requirements of this License, to extend the patent
495 | license to downstream recipients. "Knowingly relying" means you have
496 | actual knowledge that, but for the patent license, your conveying the
497 | covered work in a country, or your recipient's use of the covered work
498 | in a country, would infringe one or more identifiable patents in that
499 | country that you have reason to believe are valid.
500 |
501 | If, pursuant to or in connection with a single transaction or
502 | arrangement, you convey, or propagate by procuring conveyance of, a
503 | covered work, and grant a patent license to some of the parties
504 | receiving the covered work authorizing them to use, propagate, modify
505 | or convey a specific copy of the covered work, then the patent license
506 | you grant is automatically extended to all recipients of the covered
507 | work and works based on it.
508 |
509 | A patent license is "discriminatory" if it does not include within
510 | the scope of its coverage, prohibits the exercise of, or is
511 | conditioned on the non-exercise of one or more of the rights that are
512 | specifically granted under this License. You may not convey a covered
513 | work if you are a party to an arrangement with a third party that is
514 | in the business of distributing software, under which you make payment
515 | to the third party based on the extent of your activity of conveying
516 | the work, and under which the third party grants, to any of the
517 | parties who would receive the covered work from you, a discriminatory
518 | patent license (a) in connection with copies of the covered work
519 | conveyed by you (or copies made from those copies), or (b) primarily
520 | for and in connection with specific products or compilations that
521 | contain the covered work, unless you entered into that arrangement,
522 | or that patent license was granted, prior to 28 March 2007.
523 |
524 | Nothing in this License shall be construed as excluding or limiting
525 | any implied license or other defenses to infringement that may
526 | otherwise be available to you under applicable patent law.
527 |
528 | 12. No Surrender of Others' Freedom.
529 |
530 | If conditions are imposed on you (whether by court order, agreement or
531 | otherwise) that contradict the conditions of this License, they do not
532 | excuse you from the conditions of this License. If you cannot convey a
533 | covered work so as to satisfy simultaneously your obligations under this
534 | License and any other pertinent obligations, then as a consequence you may
535 | not convey it at all. For example, if you agree to terms that obligate you
536 | to collect a royalty for further conveying from those to whom you convey
537 | the Program, the only way you could satisfy both those terms and this
538 | License would be to refrain entirely from conveying the Program.
539 |
540 | 13. Remote Network Interaction; Use with the GNU General Public License.
541 |
542 | Notwithstanding any other provision of this License, if you modify the
543 | Program, your modified version must prominently offer all users
544 | interacting with it remotely through a computer network (if your version
545 | supports such interaction) an opportunity to receive the Corresponding
546 | Source of your version by providing access to the Corresponding Source
547 | from a network server at no charge, through some standard or customary
548 | means of facilitating copying of software. This Corresponding Source
549 | shall include the Corresponding Source for any work covered by version 3
550 | of the GNU General Public License that is incorporated pursuant to the
551 | following paragraph.
552 |
553 | Notwithstanding any other provision of this License, you have
554 | permission to link or combine any covered work with a work licensed
555 | under version 3 of the GNU General Public License into a single
556 | combined work, and to convey the resulting work. The terms of this
557 | License will continue to apply to the part which is the covered work,
558 | but the work with which it is combined will remain governed by version
559 | 3 of the GNU General Public License.
560 |
561 | 14. Revised Versions of this License.
562 |
563 | The Free Software Foundation may publish revised and/or new versions of
564 | the GNU Affero General Public License from time to time. Such new versions
565 | will be similar in spirit to the present version, but may differ in detail to
566 | address new problems or concerns.
567 |
568 | Each version is given a distinguishing version number. If the
569 | Program specifies that a certain numbered version of the GNU Affero General
570 | Public License "or any later version" applies to it, you have the
571 | option of following the terms and conditions either of that numbered
572 | version or of any later version published by the Free Software
573 | Foundation. If the Program does not specify a version number of the
574 | GNU Affero General Public License, you may choose any version ever published
575 | by the Free Software Foundation.
576 |
577 | If the Program specifies that a proxy can decide which future
578 | versions of the GNU Affero General Public License can be used, that proxy's
579 | public statement of acceptance of a version permanently authorizes you
580 | to choose that version for the Program.
581 |
582 | Later license versions may give you additional or different
583 | permissions. However, no additional obligations are imposed on any
584 | author or copyright holder as a result of your choosing to follow a
585 | later version.
586 |
587 | 15. Disclaimer of Warranty.
588 |
589 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
590 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
591 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
592 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
593 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
594 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
595 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
596 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
597 |
598 | 16. Limitation of Liability.
599 |
600 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
601 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
602 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
603 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
604 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
605 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
606 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
607 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
608 | SUCH DAMAGES.
609 |
610 | 17. Interpretation of Sections 15 and 16.
611 |
612 | If the disclaimer of warranty and limitation of liability provided
613 | above cannot be given local legal effect according to their terms,
614 | reviewing courts shall apply local law that most closely approximates
615 | an absolute waiver of all civil liability in connection with the
616 | Program, unless a warranty or assumption of liability accompanies a
617 | copy of the Program in return for a fee.
618 |
619 | END OF TERMS AND CONDITIONS
620 |
621 | How to Apply These Terms to Your New Programs
622 |
623 | If you develop a new program, and you want it to be of the greatest
624 | possible use to the public, the best way to achieve this is to make it
625 | free software which everyone can redistribute and change under these terms.
626 |
627 | To do so, attach the following notices to the program. It is safest
628 | to attach them to the start of each source file to most effectively
629 | state the exclusion of warranty; and each file should have at least
630 | the "copyright" line and a pointer to where the full notice is found.
631 |
632 |
633 | Copyright (C)
634 |
635 | This program is free software: you can redistribute it and/or modify
636 | it under the terms of the GNU Affero General Public License as published by
637 | the Free Software Foundation, either version 3 of the License, or
638 | (at your option) any later version.
639 |
640 | This program is distributed in the hope that it will be useful,
641 | but WITHOUT ANY WARRANTY; without even the implied warranty of
642 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
643 | GNU Affero General Public License for more details.
644 |
645 | You should have received a copy of the GNU Affero General Public License
646 | along with this program. If not, see .
647 |
648 | Also add information on how to contact you by electronic and paper mail.
649 |
650 | If your software can interact with users remotely through a computer
651 | network, you should also make sure that it provides a way for users to
652 | get its source. For example, if your program is a web application, its
653 | interface could display a "Source" link that leads users to an archive
654 | of the code. There are many ways you could offer source, and different
655 | solutions will be better for different programs; see section 13 for the
656 | specific requirements.
657 |
658 | You should also get your employer (if you work as a programmer) or school,
659 | if any, to sign a "copyright disclaimer" for the program, if necessary.
660 | For more information on this, and how to apply and follow the GNU AGPL, see
661 | .
662 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | recursive-include docs *
2 | recursive-include simserver/test/test_data *
3 | recursive-include . *.sh
4 | prune docs/src*
5 | include README.rst
6 | include CHANGELOG.txt
7 | include COPYING
8 | include COPYING.LESSER
9 | include LICENSE.txt
10 | include ez_setup.py
11 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | ==================================================
2 | simserver -- document similarity server in Python
3 | ==================================================
4 |
5 |
6 | Index plain text documents and query the index for semantically related documents.
7 |
8 | Simserver uses transactions internally to provide a robust and scalable similarity server.
9 |
10 |
11 | Installation
12 | ------------
13 |
14 | Simserver builds on the `gensim `_ framework for
15 | topic modelling.
16 |
17 | The simple way to install `simserver` is with::
18 |
19 | sudo easy_install -U simserver
20 |
21 | Or, if you have instead downloaded and unzipped the `source tar.gz `_ package,
22 | you'll need to run::
23 |
24 | python setup.py test
25 | sudo python setup.py install
26 |
27 | This version has been tested under Python 2.5 and 2.7, but should run on any 2.5 <= Python < 3.0.
28 |
29 | Documentation
30 | -------------
31 |
32 | See http://radimrehurek.com/gensim/simserver.html . More coming soon.
33 |
34 | Licensing
35 | ----------------
36 |
37 | Simserver is released under the `GNU Affero GPL license v3 `_.
38 |
39 | This means you may use simserver freely in your application (even a commercial application!),
40 | but you **must then open-source your application as well**, under an AGPL-compatible license.
41 |
42 | The AGPL license makes sure that this applies even when you make your application
43 | available only remotely (such as through the web).
44 |
45 | TL;DR: **simserver is open-source, but you have to contact me for any proprietary use.**
46 |
47 | History
48 | -------------
49 |
50 | 0.1.4:
51 | * performance improvements to sharding
52 | * change to threading model -- removed restriction on per-thread session access
53 | * bug fix in index optimize()
54 |
55 | 0.1.3:
56 | * changed behaviour for very few training documents: instead of latent semantic analysis, use simpler log-entropy model
57 | * fixed bug with leaking SQLite file descriptors
58 |
59 | -------------
60 |
61 | Copyright (c) 2009-2012 Radim Rehurek
62 |
--------------------------------------------------------------------------------
/docs/similarities/simserver.rst:
--------------------------------------------------------------------------------
1 | :mod:`simserver` -- Document similarity server
2 | ======================================================
3 |
4 | .. automodule:: simserver.simserver
5 | :synopsis: Document similarity server
6 | :members:
7 | :inherited-members:
8 |
9 |
--------------------------------------------------------------------------------
/docs/simserver.rst:
--------------------------------------------------------------------------------
1 | .. _simserver:
2 |
3 | Document Similarity Server
4 | =============================
5 |
6 | The 0.7.x series of `gensim `_ was about improving performance and consolidating the API.
7 | 0.8.x will be about new features --- 0.8.1, the first of the series, is a **document similarity service**.
8 |
9 | The source code itself has been moved from gensim to its own, dedicated package, named `simserver`.
10 | Get it from `PyPI `_ or clone it on `Github `_.
11 |
12 | What is a document similarity service?
13 | ---------------------------------------
14 |
15 | Conceptually, a service that lets you:
16 |
17 | 1. train a semantic model from a corpus of plain texts (no manual annotation and mark-up needed)
18 | 2. index arbitrary documents using this semantic model
19 | 3. query the index for similar documents (the query can be either an id of a document already in the index, or an arbitrary text)
20 |
21 |
22 | >>> from simserver import SessionServer
23 | >>> server = SessionServer('/tmp/my_server') # resume server (or create a new one)
24 |
25 | >>> server.train(training_corpus, method='lsi') # create a semantic model
26 | >>> server.index(some_documents) # convert plain text to semantic representation and index it
27 | >>> server.find_similar(query) # convert query to semantic representation and compare against index
28 | >>> ...
29 | >>> server.index(more_documents) # add to index: incremental indexing works
30 | >>> server.find_similar(query)
31 | >>> ...
32 | >>> server.delete(ids_to_delete) # incremental deleting also works
33 | >>> server.find_similar(query)
34 | >>> ...
35 |
36 | .. note::
37 | "Semantic" here refers to semantics of the crude, statistical type --
38 | `Latent Semantic Analysis `_,
39 | `Latent Dirichlet Allocation `_ etc.
40 | Nothing to do with the semantic web, manual resource tagging or detailed linguistic inference.
41 |
42 |
43 | What is it good for?
44 | ---------------------
45 |
46 | Digital libraries of (mostly) text documents. More generally, it helps you annotate,
47 | organize and navigate documents in a more abstract way, compared to plain keyword search.
48 |
49 | How is it unique?
50 | -----------------
51 |
52 | 1. **Memory independent**. Gensim has unique algorithms for statistical analysis that allow
53 | you to create semantic models of arbitrarily large training corpora (larger than RAM) very quickly
54 | and in constant RAM.
55 | 2. **Memory independent (again)**. Indexing shards are stored as files to disk/mmapped back as needed,
56 | so you can index very large corpora. So again, constant RAM, this time independent of the number of indexed documents.
57 | 3. **Efficient**. Gensim makes heavy use of Python's NumPy and SciPy libraries to make indexing and
58 | querying efficient.
59 | 4. **Robust**. Modifications of the index are transactional, so you can commit/rollback an
60 | entire indexing session. Also, during the session, the service is still available
61 | for querying (using its state from when the session started). Power failures leave
62 | service in a consistent state (implicit rollback).
63 | 5. **Pure Python**. Well, technically, NumPy and SciPy are mostly wrapped C and Fortran, but
64 | `gensim `_ itself is pure Python. No compiling, installing or root privileges needed.
65 | 6. **Concurrency support**. The underlying service object is thread-safe and can
66 | therefore be used as a daemon server: clients connect to it via RPC and issue train/index/query requests remotely.
67 | 7. **Cross-network, cross-platform and cross-language**. While the Python server runs
68 | over TCP using `Pyro `_,
69 | clients in Java/.NET are trivial thanks to `Pyrolite `_.
70 |
71 | The rest of this document serves as a tutorial explaining the features in more detail.
72 |
73 | -----
74 |
75 | Prerequisites
76 | ----------------------
77 |
78 | It is assumed you have `gensim` properly :doc:`installed `. You'll also
79 | need the `sqlitedict `_ package that wraps
80 | Python's sqlite3 module in a thread-safe manner::
81 |
82 | $ sudo easy_install -U sqlitedict
83 |
84 | To test the remote server capabilities, install Pyro4 (Python Remote Objects, at
85 | version 4.8 as of this writing)::
86 |
87 | $ sudo easy_install Pyro4
88 |
89 | .. note::
90 | Don't forget to initialize logging to see logging messages::
91 |
92 | >>> import logging
93 | >>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
94 |
95 | What is a document?
96 | -------------------
97 |
98 | In case of text documents, the service expects::
99 |
100 | >>> document = {'id': 'some_unique_string',
101 | >>> 'tokens': ['content', 'of', 'the', 'document', '...'],
102 | >>> 'other_fields_are_allowed_but_ignored': None}
103 |
104 | This format was chosen because it coincides with plain JSON and is therefore easy to serialize and send over the wire, in almost any language.
105 | All strings involved must be utf8-encoded.
106 |
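Before sending anything to the service, it can help to see how a raw string becomes such a dict. The helper below is a hypothetical sketch (not part of simserver; gensim's `utils.simple_preprocess` does this job properly), using only crude whitespace tokenization:

```python
def make_document(doc_id, text):
    """Hypothetical helper: turn a raw string into a simserver-style
    document dict. Crude whitespace tokenization only; in practice you
    would use something like gensim's utils.simple_preprocess instead."""
    # lowercase, split on whitespace, keep purely alphabetic tokens
    tokens = [token for token in text.lower().split() if token.isalpha()]
    return {'id': doc_id, 'tokens': tokens}

doc = make_document('doc_42', 'Human machine interface for lab abc computer applications')
print(doc['tokens'][:3])  # ['human', 'machine', 'interface']
```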
107 |
108 | What is a corpus?
109 | -----------------
110 |
111 | A sequence of documents. Anything that supports the `for document in corpus: ...`
112 | iterator protocol. Generators are ok. Plain lists are also ok (but consume more memory).
113 |
114 | >>> from gensim import utils
115 | >>> texts = ["Human machine interface for lab abc computer applications",
116 | >>> "A survey of user opinion of computer system response time",
117 | >>> "The EPS user interface management system",
118 | >>> "System and human system engineering testing of EPS",
119 | >>> "Relation of user perceived response time to error measurement",
120 | >>> "The generation of random binary unordered trees",
121 | >>> "The intersection graph of paths in trees",
122 | >>> "Graph minors IV Widths of trees and well quasi ordering",
123 | >>> "Graph minors A survey"]
124 | >>> corpus = [{'id': 'doc_%i' % num, 'tokens': utils.simple_preprocess(text)}
125 | >>> for num, text in enumerate(texts)]
126 |
127 | Since corpora are allowed to be arbitrarily large, it is
128 | recommended that the client split them into smaller chunks before uploading them to the server:
129 |
130 | >>> utils.upload_chunked(server, corpus, chunksize=1000) # send 1k docs at a time
131 |
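The chunking itself is straightforward. Here is a minimal sketch of the idea behind `upload_chunked` (the real function also performs the uploading; this hypothetical version only yields the chunks), written so that generator corpora are never materialized in full:

```python
from itertools import islice

def chunked(corpus, chunksize):
    """Yield successive lists of at most `chunksize` documents.

    Works with any iterable, including generators, so an arbitrarily
    large corpus never has to fit in memory at once.
    """
    corpus = iter(corpus)
    while True:
        chunk = list(islice(corpus, chunksize))
        if not chunk:
            break
        yield chunk

docs = ({'id': 'doc_%i' % num, 'tokens': []} for num in range(2500))
print([len(chunk) for chunk in chunked(docs, 1000)])  # [1000, 1000, 500]
```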
132 | Wait, upload what, where?
133 | -------------------------
134 |
135 | If you use the similarity service object (instance of :class:`simserver.SessionServer`) in
136 | your code directly---no remote access---that's perfectly fine. Using the service remotely, from a different process/machine, is an
137 | option, not a necessity.
138 |
139 | Document similarity can also act as a long-running service, a daemon process on a separate machine. In that
140 | case, I'll call the service object a *server*.
141 |
142 | But let's start with a local object. Open your `favourite shell `_ and::
143 |
144 | >>> from gensim import utils
145 | >>> from simserver import SessionServer
146 | >>> service = SessionServer('/tmp/my_server/') # or wherever
147 |
148 | That initialized a new service, located in `/tmp/my_server` (you need write access to that directory).
149 |
150 | .. note::
151 | The service is fully defined by the content of its location directory ("`/tmp/my_server/`").
152 | If you use an existing location, the service object will resume
153 | from the index found there. Also, to "clone" a service, just copy that
154 | directory somewhere else. The copy will be a fully working duplicate of the
155 | original service.
156 |
157 |
158 | Model training
159 | ---------------
160 |
161 | We can start indexing right away:
162 |
163 | >>> service.index(corpus)
164 | AttributeError: must initialize model for /tmp/my_server/b before indexing documents
165 |
166 | Oops, we cannot. The service indexes documents in a semantic representation, which
167 | is different from the plain text we give it. We must first teach the service how to convert
168 | between plain text and semantics::
169 |
170 | >>> service.train(corpus, method='lsi')
171 |
172 | That was easy. The `method='lsi'` parameter means we trained a model for
173 | `Latent Semantic Indexing `_
174 | with the default dimensionality (400), over a `tf-idf `_
175 | representation of our little `corpus`, all automatically. More on that later.
176 |
177 | Note that for the semantic model to make sense, it should be trained
178 | on a corpus that is:
179 |
180 | * Reasonably similar to the documents you want to index later. Training on a corpus
181 | of recipes in French when all indexed documents will be about programming in English
182 | will not help.
183 | * Reasonably large (at least thousands of documents), so that the statistical analysis has
184 | a chance to kick in. Don't use my example corpus of 9 documents in production O_o
185 |
186 | Indexing documents
187 | ------------------
188 |
189 | >>> service.index(corpus) # index the same documents that we trained on...
190 |
191 | Indexing can happen over any documents, but I'm too lazy to create another example corpus, so we index the same 9 docs used for training.
192 |
193 | Delete documents with::
194 |
195 | >>> service.delete(['doc_5', 'doc_8']) # supply a list of document ids to be removed from the index
196 |
197 | When you pass documents that have the same id as some already indexed document,
198 | the indexed document is overwritten by the new input (=only the latest counts;
199 | document ids are always unique per service)::
200 |
201 | >>> service.index(corpus[:3]) # overall index size unchanged (just 3 docs overwritten)
202 |
203 | The index/delete/overwrite calls can be arbitrarily interspersed with queries.
204 | You don't have to index **all** documents first to start querying, indexing can be incremental.
205 |
206 | Querying
207 | ---------
208 |
209 | There are two types of queries:
210 |
211 | 1. by id:
212 |
213 | .. code-block:: python
214 |
215 | >>> print service.find_similar('doc_0')
216 | [('doc_0', 1.0, None), ('doc_2', 0.30426699, None), ('doc_1', 0.25648531, None), ('doc_3', 0.25480536, None)]
217 |
218 | >>> print service.find_similar('doc_5') # we deleted doc_5 and doc_8, remember?
219 | ValueError: document 'doc_5' not in index
220 |
221 | In the resulting 3-tuples, `doc_n` is the document id we supplied during indexing,
222 | `0.30426699` is the similarity of `doc_n` to the query, but what's up with that `None`, you ask?
223 | Well, you can associate each document with a "payload", during indexing.
224 | This payload object (anything pickle-able) is later returned during querying.
225 | If you don't specify `doc['payload']` during indexing, queries simply return `None` in the result tuple, as in our example here.
226 |
227 | 2. or by document (using `document['tokens']`; id is ignored in this case):
228 |
229 | .. code-block:: python
230 |
231 | >>> doc = {'tokens': utils.simple_preprocess('Graph and minors and humans and trees.')}
232 | >>> print service.find_similar(doc, min_score=0.4, max_results=50)
233 | [('doc_7', 0.93350589, None), ('doc_3', 0.42718196, None)]
234 |
235 | Remote access
236 | -------------
237 |
238 | So far, we did everything in our Python shell, locally. I very much like `Pyro `_,
239 | a pure Python package for Remote Procedure Calls (RPC), so I'll illustrate remote
240 | service access via Pyro. Pyro takes care of all the socket listening/request routing/data marshalling/thread
241 | spawning, so it saves us a lot of trouble.
242 |
243 | To create a similarity server, we just create a :class:`simserver.SessionServer` object and register it
244 | with a Pyro daemon for remote access. There is a small `example script `_
245 | included with simserver, run it with::
246 |
247 | $ python -m simserver.run_simserver /tmp/testserver
248 |
249 | You can just `ctrl+c` to terminate the server, but leave it running for now.
250 |
251 | Now open your Python shell again, in another terminal window or possibly on another machine, and::
252 |
253 | >>> import Pyro4
254 | >>> service = Pyro4.Proxy(Pyro4.locateNS().lookup('gensim.testserver'))
255 |
256 | Now `service` is only a proxy object: every call is physically executed wherever
257 | you ran the `run_simserver` script, which can be a totally different computer
258 | (within a network broadcast domain) -- and you don't even need to know::
259 |
260 | >>> print service.status()
261 | >>> service.train(corpus)
262 | >>> service.index(other_corpus)
263 | >>> service.find_similar(query)
264 | >>> ...
265 |
266 | It is worth mentioning that Irmen, the author of Pyro, recently also released
267 | `Pyrolite `_, a package
268 | that lets you create Pyro proxies from Java and .NET, in addition to Python.
269 | That way you can call remote methods from there too---the client doesn't have to be in Python.
270 |
271 | Concurrency
272 | -----------
273 |
274 | Ok, now it's getting interesting. Since we can access the service remotely, what
275 | happens if multiple clients create proxies to it at the same time? What if they
276 | want to modify the server index at the same time?
277 |
278 | Answer: the `SessionServer` object is thread-safe, so that when each client spawns a request
279 | thread via Pyro, they don't step on each other's toes.
280 |
281 | This means that:
282 |
283 | 1. There can be multiple simultaneous `service.find_similar` queries (or, in
284 | general, multiple simultaneous calls that are "read-only").
285 | 2. When two clients issue modification calls (`index`/`train`/`delete`/`drop_index`/...)
286 | at the same time, an internal lock serializes them -- the later call has to wait.
287 | 3. While one client is modifying the index, all other clients' queries still see
288 | the original index. Only once the modifications are committed do they become
289 | "visible".
290 |
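The locking discipline behind points 1--3 can be sketched in a few lines. This toy class is a hypothetical illustration, not simserver's actual implementation: reads consult a published snapshot without locking, while writes are serialized by a single modification lock and publish their changes atomically at the end:

```python
import threading

class ToyIndex(object):
    """Hypothetical sketch of snapshot reads + serialized writes."""
    def __init__(self):
        self._write_lock = threading.Lock()  # serializes all modifications
        self._published = {}                 # snapshot that readers see

    def find_similar(self, doc_id):
        # read-only call: no lock, just consult the current snapshot
        return self._published.get(doc_id)

    def index(self, docs):
        with self._write_lock:               # a second writer waits here
            pending = dict(self._published)  # modify a clone, not the live index
            pending.update(docs)
            self._published = pending        # atomic rebind: changes become visible

index = ToyIndex()
index.index({'doc_0': ['human', 'interface']})
print(index.find_similar('doc_0'))  # ['human', 'interface']
```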
291 | What do you mean, visible?
292 | --------------------------
293 |
294 | The service uses transactions internally. This means that each modification is
295 | done over a clone of the service. If the modification session fails for whatever
296 | reason (exception in code; power failure that turns off the server; client unhappy
297 | with how the session went), it can be rolled back. It also means other clients can
298 | continue querying the original index during index updates.
299 |
300 | By default this mechanism is hidden from users through auto-committing (it was already happening
301 | in the examples above), but auto-committing can be turned off explicitly::
302 |
303 | >>> service.set_autosession(False)
304 | >>> service.train(corpus)
305 | RuntimeError: must open a session before modifying SessionServer
306 | >>> service.open_session()
307 | >>> service.train(corpus)
308 | >>> service.index(corpus)
309 | >>> service.delete(doc_ids)
310 | >>> ...
311 |
312 | None of these changes are visible to other clients yet. Also, other clients'
313 | calls to index/train/etc. will block until this session is committed/rolled back---there
314 | cannot be two open sessions at the same time.
315 |
316 | To end a session::
317 |
318 | >>> service.rollback() # discard all changes since open_session()
319 |
320 | or::
321 |
322 | >>> service.commit() # make changes public; now other clients can see changes/acquire the modification lock
323 |
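Put together, the session semantics above boil down to "modify a clone, publish on commit". A hypothetical toy version (not simserver's API, which also persists to disk and survives power failures) might look like:

```python
class ToyServer(object):
    """Hypothetical sketch of explicit sessions: modifications go to a
    clone of the index and only become visible on commit()."""
    def __init__(self):
        self.index = {}       # what queries see
        self._session = None  # pending clone, if a session is open

    def open_session(self):
        if self._session is not None:
            raise RuntimeError('only one open session at a time')
        self._session = dict(self.index)  # modify a clone

    def add(self, doc_id, doc):
        if self._session is None:
            raise RuntimeError('must open a session before modifying')
        self._session[doc_id] = doc

    def commit(self):
        if self._session is None:
            raise RuntimeError('no open session to commit')
        self.index, self._session = self._session, None  # publish the clone

    def rollback(self):
        self._session = None  # discard the clone

server = ToyServer()
server.open_session()
server.add('doc_0', ['human', 'interface'])
print('doc_0' in server.index)  # False: not committed yet
server.commit()
print('doc_0' in server.index)  # True
```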
324 |
325 | Other stuff
326 | ------------
327 |
328 | TODO Custom document parsing (in lieu of `utils.simple_preprocess`). Different models (not just `lsi`). Optimizing the index with `service.optimize()`.
329 | TODO add some hard numbers; example tutorial for some bigger collection, e.g. for `arxiv.org `_ or wikipedia.
330 |
331 |
--------------------------------------------------------------------------------
/ez_setup.py:
--------------------------------------------------------------------------------
1 | #!python
2 | """Bootstrap setuptools installation
3 |
4 | If you want to use setuptools in your package's setup.py, just include this
5 | file in the same directory with it, and add this to the top of your setup.py::
6 |
7 | from ez_setup import use_setuptools
8 | use_setuptools()
9 |
10 | If you want to require a specific version of setuptools, set a download
11 | mirror, or use an alternate download directory, you can do so by supplying
12 | the appropriate options to ``use_setuptools()``.
13 |
14 | This file can also be run as a script to install or upgrade setuptools.
15 | """
16 | import sys
17 | DEFAULT_VERSION = "0.6c11"
18 | DEFAULT_URL = "http://pypi.python.org/packages/%s/s/setuptools/" % sys.version[:3]
19 |
20 | md5_data = {
21 | 'setuptools-0.6b1-py2.3.egg': '8822caf901250d848b996b7f25c6e6ca',
22 | 'setuptools-0.6b1-py2.4.egg': 'b79a8a403e4502fbb85ee3f1941735cb',
23 | 'setuptools-0.6b2-py2.3.egg': '5657759d8a6d8fc44070a9d07272d99b',
24 | 'setuptools-0.6b2-py2.4.egg': '4996a8d169d2be661fa32a6e52e4f82a',
25 | 'setuptools-0.6b3-py2.3.egg': 'bb31c0fc7399a63579975cad9f5a0618',
26 | 'setuptools-0.6b3-py2.4.egg': '38a8c6b3d6ecd22247f179f7da669fac',
27 | 'setuptools-0.6b4-py2.3.egg': '62045a24ed4e1ebc77fe039aa4e6f7e5',
28 | 'setuptools-0.6b4-py2.4.egg': '4cb2a185d228dacffb2d17f103b3b1c4',
29 | 'setuptools-0.6c1-py2.3.egg': 'b3f2b5539d65cb7f74ad79127f1a908c',
30 | 'setuptools-0.6c1-py2.4.egg': 'b45adeda0667d2d2ffe14009364f2a4b',
31 | 'setuptools-0.6c10-py2.3.egg': 'ce1e2ab5d3a0256456d9fc13800a7090',
32 | 'setuptools-0.6c10-py2.4.egg': '57d6d9d6e9b80772c59a53a8433a5dd4',
33 | 'setuptools-0.6c10-py2.5.egg': 'de46ac8b1c97c895572e5e8596aeb8c7',
34 | 'setuptools-0.6c10-py2.6.egg': '58ea40aef06da02ce641495523a0b7f5',
35 | 'setuptools-0.6c11-py2.3.egg': '2baeac6e13d414a9d28e7ba5b5a596de',
36 | 'setuptools-0.6c11-py2.4.egg': 'bd639f9b0eac4c42497034dec2ec0c2b',
37 | 'setuptools-0.6c11-py2.5.egg': '64c94f3bf7a72a13ec83e0b24f2749b2',
38 | 'setuptools-0.6c11-py2.6.egg': 'bfa92100bd772d5a213eedd356d64086',
39 | 'setuptools-0.6c2-py2.3.egg': 'f0064bf6aa2b7d0f3ba0b43f20817c27',
40 | 'setuptools-0.6c2-py2.4.egg': '616192eec35f47e8ea16cd6a122b7277',
41 | 'setuptools-0.6c3-py2.3.egg': 'f181fa125dfe85a259c9cd6f1d7b78fa',
42 | 'setuptools-0.6c3-py2.4.egg': 'e0ed74682c998bfb73bf803a50e7b71e',
43 | 'setuptools-0.6c3-py2.5.egg': 'abef16fdd61955514841c7c6bd98965e',
44 | 'setuptools-0.6c4-py2.3.egg': 'b0b9131acab32022bfac7f44c5d7971f',
45 | 'setuptools-0.6c4-py2.4.egg': '2a1f9656d4fbf3c97bf946c0a124e6e2',
46 | 'setuptools-0.6c4-py2.5.egg': '8f5a052e32cdb9c72bcf4b5526f28afc',
47 | 'setuptools-0.6c5-py2.3.egg': 'ee9fd80965da04f2f3e6b3576e9d8167',
48 | 'setuptools-0.6c5-py2.4.egg': 'afe2adf1c01701ee841761f5bcd8aa64',
49 | 'setuptools-0.6c5-py2.5.egg': 'a8d3f61494ccaa8714dfed37bccd3d5d',
50 | 'setuptools-0.6c6-py2.3.egg': '35686b78116a668847237b69d549ec20',
51 | 'setuptools-0.6c6-py2.4.egg': '3c56af57be3225019260a644430065ab',
52 | 'setuptools-0.6c6-py2.5.egg': 'b2f8a7520709a5b34f80946de5f02f53',
53 | 'setuptools-0.6c7-py2.3.egg': '209fdf9adc3a615e5115b725658e13e2',
54 | 'setuptools-0.6c7-py2.4.egg': '5a8f954807d46a0fb67cf1f26c55a82e',
55 | 'setuptools-0.6c7-py2.5.egg': '45d2ad28f9750e7434111fde831e8372',
56 | 'setuptools-0.6c8-py2.3.egg': '50759d29b349db8cfd807ba8303f1902',
57 | 'setuptools-0.6c8-py2.4.egg': 'cba38d74f7d483c06e9daa6070cce6de',
58 | 'setuptools-0.6c8-py2.5.egg': '1721747ee329dc150590a58b3e1ac95b',
59 | 'setuptools-0.6c9-py2.3.egg': 'a83c4020414807b496e4cfbe08507c03',
60 | 'setuptools-0.6c9-py2.4.egg': '260a2be2e5388d66bdaee06abec6342a',
61 | 'setuptools-0.6c9-py2.5.egg': 'fe67c3e5a17b12c0e7c541b7ea43a8e6',
62 | 'setuptools-0.6c9-py2.6.egg': 'ca37b1ff16fa2ede6e19383e7b59245a',
63 | }
64 |
65 | import sys, os
66 | try: from hashlib import md5
67 | except ImportError: from md5 import md5
68 |
69 | def _validate_md5(egg_name, data):
70 | if egg_name in md5_data:
71 | digest = md5(data).hexdigest()
72 | if digest != md5_data[egg_name]:
73 | print >>sys.stderr, (
74 | "md5 validation of %s failed! (Possible download problem?)"
75 | % egg_name
76 | )
77 | sys.exit(2)
78 | return data
79 |
80 | def use_setuptools(
81 | version=DEFAULT_VERSION, download_base=DEFAULT_URL, to_dir=os.curdir,
82 | download_delay=15
83 | ):
84 | """Automatically find/download setuptools and make it available on sys.path
85 |
86 | `version` should be a valid setuptools version number that is available
87 | as an egg for download under the `download_base` URL (which should end with
88 | a '/'). `to_dir` is the directory where setuptools will be downloaded, if
89 | it is not already available. If `download_delay` is specified, it should
90 | be the number of seconds that will be paused before initiating a download,
91 | should one be required. If an older version of setuptools is installed,
92 | this routine will print a message to ``sys.stderr`` and raise SystemExit in
93 | an attempt to abort the calling script.
94 | """
95 | was_imported = 'pkg_resources' in sys.modules or 'setuptools' in sys.modules
96 | def do_download():
97 | egg = download_setuptools(version, download_base, to_dir, download_delay)
98 | sys.path.insert(0, egg)
99 | import setuptools; setuptools.bootstrap_install_from = egg
100 | try:
101 | import pkg_resources
102 | except ImportError:
103 | return do_download()
104 | try:
105 | pkg_resources.require("setuptools>="+version); return
106 | except pkg_resources.VersionConflict, e:
107 | if was_imported:
108 | print >>sys.stderr, (
109 | "The required version of setuptools (>=%s) is not available, and\n"
110 | "can't be installed while this script is running. Please install\n"
111 | " a more recent version first, using 'easy_install -U setuptools'."
112 | "\n\n(Currently using %r)"
113 | ) % (version, e.args[0])
114 | sys.exit(2)
115 | else:
116 | del pkg_resources, sys.modules['pkg_resources'] # reload ok
117 | return do_download()
118 | except pkg_resources.DistributionNotFound:
119 | return do_download()
120 |
121 | def download_setuptools(
122 | version=DEFAULT_VERSION, download_base=DEFAULT_URL, to_dir=os.curdir,
123 | delay = 15
124 | ):
125 | """Download setuptools from a specified location and return its filename
126 |
127 | `version` should be a valid setuptools version number that is available
128 | as an egg for download under the `download_base` URL (which should end
129 | with a '/'). `to_dir` is the directory where the egg will be downloaded.
130 | `delay` is the number of seconds to pause before an actual download attempt.
131 | """
132 | import urllib2, shutil
133 | egg_name = "setuptools-%s-py%s.egg" % (version,sys.version[:3])
134 | url = download_base + egg_name
135 | saveto = os.path.join(to_dir, egg_name)
136 | src = dst = None
137 | if not os.path.exists(saveto): # Avoid repeated downloads
138 | try:
139 | from distutils import log
140 | if delay:
141 | log.warn("""
142 | ---------------------------------------------------------------------------
143 | This script requires setuptools version %s to run (even to display
144 | help). I will attempt to download it for you (from
145 | %s), but
146 | you may need to enable firewall access for this script first.
147 | I will start the download in %d seconds.
148 |
149 | (Note: if this machine does not have network access, please obtain the file
150 |
151 | %s
152 |
153 | and place it in this directory before rerunning this script.)
154 | ---------------------------------------------------------------------------""",
155 | version, download_base, delay, url
156 | ); from time import sleep; sleep(delay)
157 | log.warn("Downloading %s", url)
158 | src = urllib2.urlopen(url)
159 | # Read/write all in one block, so we don't create a corrupt file
160 | # if the download is interrupted.
161 | data = _validate_md5(egg_name, src.read())
162 | dst = open(saveto,"wb"); dst.write(data)
163 | finally:
164 | if src: src.close()
165 | if dst: dst.close()
166 | return os.path.realpath(saveto)
167 |
168 |
169 |
170 |
171 |
172 |
173 |
174 |
175 |
176 |
177 |
178 |
179 |
180 |
181 |
182 |
183 |
184 |
185 |
186 |
187 |
188 |
189 |
190 |
191 |
192 |
193 |
194 |
195 |
196 |
197 |
198 |
199 |
200 |
201 |
202 |
203 | def main(argv, version=DEFAULT_VERSION):
204 | """Install or upgrade setuptools and EasyInstall"""
205 | try:
206 | import setuptools
207 | except ImportError:
208 | egg = None
209 | try:
210 | egg = download_setuptools(version, delay=0)
211 | sys.path.insert(0,egg)
212 | from setuptools.command.easy_install import main
213 | return main(list(argv)+[egg]) # we're done here
214 | finally:
215 | if egg and os.path.exists(egg):
216 | os.unlink(egg)
217 | else:
218 | if setuptools.__version__ == '0.0.1':
219 | print >>sys.stderr, (
220 | "You have an obsolete version of setuptools installed. Please\n"
221 | "remove it from your system entirely before rerunning this script."
222 | )
223 | sys.exit(2)
224 |
225 | req = "setuptools>="+version
226 | import pkg_resources
227 | try:
228 | pkg_resources.require(req)
229 | except pkg_resources.VersionConflict:
230 | try:
231 | from setuptools.command.easy_install import main
232 | except ImportError:
233 | from easy_install import main
234 | main(list(argv)+[download_setuptools(delay=0)])
235 | sys.exit(0) # try to force an exit
236 | else:
237 | if argv:
238 | from setuptools.command.easy_install import main
239 | main(argv)
240 | else:
241 | print "Setuptools version",version,"or greater has been installed."
242 | print '(Run "ez_setup.py -U setuptools" to reinstall or upgrade.)'
243 |
244 | def update_md5(filenames):
245 | """Update our built-in md5 registry"""
246 |
247 | import re
248 |
249 | for name in filenames:
250 | base = os.path.basename(name)
251 | f = open(name,'rb')
252 | md5_data[base] = md5(f.read()).hexdigest()
253 | f.close()
254 |
255 | data = [" %r: %r,\n" % it for it in md5_data.items()]
256 | data.sort()
257 | repl = "".join(data)
258 |
259 | import inspect
260 | srcfile = inspect.getsourcefile(sys.modules[__name__])
261 | f = open(srcfile, 'rb'); src = f.read(); f.close()
262 |
263 | match = re.search("\nmd5_data = {\n([^}]+)}", src)
264 | if not match:
265 | print >>sys.stderr, "Internal error!"
266 | sys.exit(2)
267 |
268 | src = src[:match.start(1)] + repl + src[match.end(1):]
269 | f = open(srcfile,'w')
270 | f.write(src)
271 | f.close()
272 |
273 |
274 | if __name__=='__main__':
275 | if len(sys.argv)>2 and sys.argv[1]=='--md5update':
276 | update_md5(sys.argv[2:])
277 | else:
278 | main(sys.argv[1:])
279 |
280 |
281 |
282 |
283 |
284 |
285 |
--------------------------------------------------------------------------------
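The `update_md5` helper above patches the script's own `md5_data` registry by regex-splicing a freshly formatted dict body into its source text. A minimal, self-contained Python 3 sketch of the same splice (the `src` string and egg names below are made-up stand-ins, not the real registry):

```python
import re

# A stand-in for the script's own source text.
src = (
    "#!/usr/bin/env python\n"
    "md5_data = {\n"
    "    'setuptools-0.6c1-py2.3.egg': 'aaaa',\n"
    "}\n"
    "DEFAULT_VERSION = '0.6c11'\n"
)

# Freshly computed registry entries, pre-formatted one per line.
repl = "    'setuptools-0.6c11-py2.6.egg': 'bbbb',\n"

# Locate the dict body between "md5_data = {" and the closing "}".
match = re.search(r"\nmd5_data = \{\n([^}]+)\}", src)
if not match:
    raise SystemExit("internal error: md5_data block not found")

# Splice the new body in place of the old one, leaving the rest untouched.
patched = src[:match.start(1)] + repl + src[match.end(1):]
```

The same pattern (capture the dict interior, replace only group 1) is what lets the script rewrite itself without disturbing surrounding code.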
/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # Copyright (C) 2012 Radim Rehurek
5 | # Licensed under the GNU AGPL v3.0 - http://www.gnu.org/licenses/agpl.html
6 |
7 | """
8 | Run with:
9 |
10 | sudo python ./setup.py install
11 | """
12 |
13 | import os
14 | import sys
15 |
16 | if sys.version_info[:2] < (2, 5):
17 |     raise Exception('This version of simserver needs Python 2.5 or later.')
18 |
19 | import ez_setup
20 | ez_setup.use_setuptools()
21 | from setuptools import setup, find_packages
22 |
23 |
24 | def read(fname):
25 | return open(os.path.join(os.path.dirname(__file__), fname)).read()
26 |
27 |
28 |
29 | setup(
30 | name = 'simserver',
31 | version = '0.1.4',
32 | description = 'Document similarity server',
33 | long_description = read('README.rst'),
34 |
35 | packages = find_packages(),
36 |
37 | # there is a bug in python2.5, preventing distutils from using any non-ascii characters :( http://bugs.python.org/issue2562
38 | author = 'Radim Rehurek', # u'Radim Řehůřek', # <- should really be this...
39 | author_email = 'radimrehurek@seznam.cz',
40 |
41 | url = 'https://github.com/piskvorky/gensim-simserver',
42 | download_url = 'http://pypi.python.org/pypi/simserver',
43 |
44 | keywords = 'Similarity server, document database, Latent Semantic Indexing, LSA, '
45 | 'LSI, LDA, Latent Dirichlet Allocation, TF-IDF, gensim',
46 |
47 | license = 'AGPL v3',
48 | platforms = 'any',
49 |
50 | zip_safe = False,
51 |
52 | classifiers = [ # from http://pypi.python.org/pypi?%3Aaction=list_classifiers
53 | 'Development Status :: 4 - Beta',
54 | 'Environment :: Console',
55 | 'Intended Audience :: Science/Research',
56 |         'License :: OSI Approved :: GNU Affero General Public License v3',
57 | 'Operating System :: OS Independent',
58 | 'Programming Language :: Python :: 2.5',
59 | 'Topic :: Scientific/Engineering :: Artificial Intelligence',
60 | 'Topic :: Scientific/Engineering :: Information Analysis',
61 | 'Topic :: Text Processing :: Linguistic',
62 | ],
63 |
64 | test_suite = "simserver.test",
65 |
66 | install_requires = [
67 | 'gensim >= 0.8.5',
68 | 'Pyro4 >= 4.8',
69 | 'sqlitedict >= 1.0.8',
70 | ],
71 |
72 | include_package_data = True,
73 |
74 | )
75 |
--------------------------------------------------------------------------------
/simserver/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Package containing a document similarity server, an extension of gensim.
3 | """
4 |
5 | # for IPython tab-completion
6 | from simserver import SessionServer, SimServer
7 |
8 |
9 | try:
10 | __version__ = __import__('pkg_resources').get_distribution('simserver').version
11 | except:
12 | __version__ = '?'
13 |
14 |
15 | import logging
16 |
17 | class NullHandler(logging.Handler):
18 | """For python versions <= 2.6; same as `logging.NullHandler` in 2.7."""
19 | def emit(self, record):
20 | pass
21 |
22 | logger = logging.getLogger('simserver')
23 | if len(logger.handlers) == 0: # To ensure reload() doesn't add another one
24 | logger.addHandler(NullHandler())
25 |
--------------------------------------------------------------------------------
/simserver/run_simserver.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # Copyright (C) 2011 Radim Rehurek
5 | # Licensed under the GNU AGPL v3 - http://www.gnu.org/licenses/agpl.html
6 |
7 | """
8 | USAGE: %(program)s DATA_DIRECTORY
9 |
10 | Start a sample similarity server, register it with Pyro and leave it running \
11 | as a daemon.
12 |
13 | Example:
14 | python -m simserver.run_simserver /tmp/server
15 | """
16 |
17 | from __future__ import with_statement
18 |
19 | import logging
20 | import os
21 | import sys
22 |
23 | import gensim
24 | import simserver
25 |
26 | if __name__ == '__main__':
27 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(module)s:%(lineno)d : %(funcName)s(%(threadName)s) : %(message)s')
28 | logging.root.setLevel(level=logging.INFO)
29 | logging.info("running %s" % ' '.join(sys.argv))
30 |
31 | program = os.path.basename(sys.argv[0])
32 |
33 | # check and process input arguments
34 | if len(sys.argv) < 2:
35 | print globals()['__doc__'] % locals()
36 | sys.exit(1)
37 |
38 | basename = sys.argv[1]
39 | server = simserver.SessionServer(basename)
40 | gensim.utils.pyro_daemon('gensim.testserver', server)
41 |
42 | logging.info("finished running %s" % program)
43 |
--------------------------------------------------------------------------------
/simserver/simserver.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # Copyright (C) 2012 Radim Rehurek
5 | # Licensed under the GNU AGPL v3 - http://www.gnu.org/licenses/agpl.html
6 |
7 |
8 | """
9 | "Find similar" service, using gensim (vector space modelling) as the backend.
10 |
11 | The server performs 3 main functions:
12 |
13 | 1. converts documents to semantic representation (TF-IDF, LSA, LDA...)
14 | 2. indexes documents in the vector representation, for faster retrieval
15 | 3. for a given query document, returns ids of the most similar documents from the index
16 |
17 | SessionServer objects are transactional, so that you can rollback/commit an entire
18 | set of changes.
19 |
20 | The server is ready for concurrent requests (thread-safe). Indexing is incremental
21 | and you can query the SessionServer even while it's being updated, so that there
22 | is virtually no down-time.
23 |
24 | """
25 |
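The three functions listed above map onto a short client workflow. A hedged sketch (the corpus below is invented; the server calls are shown inside a function so nothing runs without a real `SessionServer`/`SimServer` instance):

```python
# Documents are plain dicts: 'id' and 'tokens' are what the model consumes;
# 'payload' is optional opaque data stored alongside the index.
corpus = [
    {'id': 'doc_%i' % num, 'tokens': text.lower().split(), 'payload': text}
    for num, text in enumerate([
        'Human machine interface for lab abc computer applications',
        'A survey of user opinion of computer system response time',
        'Graph minors a survey',
    ])
]

def demo(server):
    """Typical session against a (Session)SimServer instance."""
    server.train(corpus, method='lsi')   # 1. build the semantic model
    server.index(corpus)                 # 2. index documents in that model
    return server.find_similar('doc_0')  # 3. query by id of an indexed document
```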
26 | from __future__ import with_statement
27 |
28 | import os
29 | import logging
30 | import threading
31 | import shutil
32 |
33 | import numpy
34 |
35 | import gensim
36 | from sqlitedict import SqliteDict # needs sqlitedict: run "sudo easy_install sqlitedict"
37 |
38 |
39 | logger = logging.getLogger('gensim.similarities.simserver')
40 |
41 |
42 | TOP_SIMS = 100 # when precomputing similarities, only consider this many "most similar" documents
43 | SHARD_SIZE = 65536 # spill index shards to disk in SHARD_SIZE-ed chunks of documents
44 | DEFAULT_NUM_TOPICS = 400 # use this many topics for topic models unless user specified a value
45 | JOURNAL_MODE = 'OFF' # don't keep journals in sqlite dbs
46 |
47 |
48 |
49 | def merge_sims(oldsims, newsims, clip=None):
50 | """Merge two precomputed similarity lists, truncating the result to `clip` most similar items."""
51 | if oldsims is None:
52 | result = newsims or []
53 | elif newsims is None:
54 | result = oldsims
55 | else:
56 | result = sorted(oldsims + newsims, key=lambda item: -item[1])
57 | if clip is not None:
58 | result = result[:clip]
59 | return result
60 |
61 |
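For example, merging two ranked lists with `clip=2` keeps only the two highest-scoring items overall. A Python 3 restatement of `merge_sims` with made-up ids:

```python
def merge_sims(oldsims, newsims, clip=None):
    """Merge two (id, score) lists, highest score first, truncated to `clip` items."""
    if oldsims is None:
        result = newsims or []
    elif newsims is None:
        result = oldsims
    else:
        # Sort the concatenation by descending similarity.
        result = sorted(oldsims + newsims, key=lambda item: -item[1])
    if clip is not None:
        result = result[:clip]
    return result

merged = merge_sims([('a', 0.9), ('b', 0.5)], [('c', 0.7)], clip=2)
# keeps ('a', 0.9) and ('c', 0.7); ('b', 0.5) falls outside the clip
```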
62 |
63 | class SimIndex(gensim.utils.SaveLoad):
64 | """
65 | An index of documents. Used internally by SimServer.
66 |
67 | It uses the Similarity class to persist all document vectors to disk (via mmap).
68 | """
69 | def __init__(self, fname, num_features, shardsize=SHARD_SIZE, topsims=TOP_SIMS):
70 | """
71 | Spill index shards to disk after every `shardsize` documents.
72 | In similarity queries, return only the `topsims` most similar documents.
73 | """
74 | self.fname = fname
75 | self.shardsize = int(shardsize)
76 | self.topsims = int(topsims)
77 | self.id2pos = {} # map document id (string) to index position (integer)
78 | self.pos2id = {} # reverse mapping for id2pos; redundant, for performance
79 | self.id2sims = SqliteDict(self.fname + '.id2sims', journal_mode=JOURNAL_MODE) # precomputed top similar: document id -> [(doc_id, similarity)]
80 | self.qindex = gensim.similarities.Similarity(self.fname + '.idx', corpus=None,
81 | num_best=None, num_features=num_features, shardsize=shardsize)
82 | self.length = 0
83 |
84 | def save(self, fname):
85 | tmp, self.id2sims = self.id2sims, None
86 | super(SimIndex, self).save(fname)
87 | self.id2sims = tmp
88 |
89 |
90 | @staticmethod
91 | def load(fname):
92 | result = gensim.utils.SaveLoad.load(fname)
93 | result.fname = fname
94 | result.check_moved()
95 | result.id2sims = SqliteDict(fname + '.id2sims', journal_mode=JOURNAL_MODE)
96 | return result
97 |
98 |
99 | def check_moved(self):
100 | output_prefix = self.fname + '.idx'
101 | if self.qindex.output_prefix != output_prefix:
102 | logger.info("index seems to have moved from %s to %s; updating locations" %
103 | (self.qindex.output_prefix, output_prefix))
104 | self.qindex.output_prefix = output_prefix
105 | self.qindex.check_moved()
106 |
107 |
108 | def close(self):
109 | "Explicitly release important resources (file handles, db, ...)"
110 | try:
111 | self.id2sims.close()
112 | except:
113 | pass
114 | try:
115 | del self.qindex
116 | except:
117 | pass
118 |
119 |
120 | def terminate(self):
121 | """Delete all files created by this index, invalidating `self`. Use with care."""
122 | try:
123 | self.id2sims.terminate()
124 | except:
125 | pass
126 | import glob
127 | for fname in glob.glob(self.fname + '*'):
128 | try:
129 | os.remove(fname)
130 | logger.info("deleted %s" % fname)
131 | except Exception, e:
132 | logger.warning("failed to delete %s: %s" % (fname, e))
133 | for val in self.__dict__.keys():
134 | try:
135 | delattr(self, val)
136 | except:
137 | pass
138 |
139 |
140 | def index_documents(self, fresh_docs, model):
141 | """
142 | Update fresh index with new documents (potentially replacing old ones with
143 | the same id). `fresh_docs` is a dictionary-like object (=dict, sqlitedict, shelve etc)
144 | that maps document_id->document.
145 | """
146 | docids = fresh_docs.keys()
147 | vectors = (model.docs2vecs(fresh_docs[docid] for docid in docids))
148 | logger.info("adding %i documents to %s" % (len(docids), self))
149 | self.qindex.add_documents(vectors)
150 | self.qindex.save()
151 | self.update_ids(docids)
152 |
153 |
154 | def update_ids(self, docids):
155 | """Update id->pos mapping with new document ids."""
156 | logger.info("updating %i id mappings" % len(docids))
157 | for docid in docids:
158 | if docid is not None:
159 | pos = self.id2pos.get(docid, None)
160 | if pos is not None:
161 | logger.info("replacing existing document %r in %s" % (docid, self))
162 | del self.pos2id[pos]
163 | self.id2pos[docid] = self.length
164 | try:
165 | del self.id2sims[docid]
166 | except:
167 | pass
168 | self.length += 1
169 | self.id2sims.sync()
170 | self.update_mappings()
171 |
172 |
173 | def update_mappings(self):
174 | """Synchronize id<->position mappings."""
175 | self.pos2id = dict((v, k) for k, v in self.id2pos.iteritems())
176 | assert len(self.pos2id) == len(self.id2pos), "duplicate ids or positions detected"
177 |
178 |
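The id↔position bookkeeping above has one subtlety worth spelling out: re-adding an existing document id appends a brand-new position and merely unmaps the old one, leaving a masked hole in the underlying vector index (hence `__len__` vs. "real size" in `__str__`). A minimal dict-only sketch of that invariant, with no gensim involved:

```python
id2pos = {}   # docid -> position in the vector index
pos2id = {}   # inverse mapping; masked (deleted/overwritten) positions are absent
length = 0    # next free position; grows monotonically, holes are never reused

def add(docid):
    """Mirror of SimIndex.update_ids for a single id."""
    global length
    old = id2pos.get(docid)
    if old is not None:
        del pos2id[old]          # mask the stale position
    id2pos[docid] = length       # a re-added doc always gets a fresh position
    pos2id[length] = docid       # keep the inverse mapping in sync
    length += 1

add('doc_a')   # doc_a -> position 0
add('doc_b')   # doc_b -> position 1
add('doc_a')   # re-add: doc_a -> position 2; position 0 is now masked
```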
179 | def delete(self, docids):
180 | """Delete documents (specified by their ids) from the index."""
181 | logger.debug("deleting %i documents from %s" % (len(docids), self))
182 | deleted = 0
183 | for docid in docids:
184 | try:
185 | del self.id2pos[docid]
186 | deleted += 1
187 | del self.id2sims[docid]
188 | except:
189 | pass
190 | self.id2sims.sync()
191 | if deleted:
192 | logger.info("deleted %i documents from %s" % (deleted, self))
193 | self.update_mappings()
194 |
195 |
196 | def sims2scores(self, sims, eps=1e-7):
197 | """Convert raw similarity vector to a list of (docid, similarity) results."""
198 | result = []
199 | if isinstance(sims, numpy.ndarray):
200 | sims = abs(sims) # TODO or maybe clip? are opposite vectors "similar" or "dissimilar"?!
201 | for pos in numpy.argsort(sims)[::-1]:
202 | if pos in self.pos2id and sims[pos] > eps: # ignore deleted/rewritten documents
203 | # convert positions of resulting docs back to ids
204 | result.append((self.pos2id[pos], sims[pos]))
205 | if len(result) == self.topsims:
206 | break
207 | else:
208 | for pos, score in sims:
209 | if pos in self.pos2id and abs(score) > eps: # ignore deleted/rewritten documents
210 | # convert positions of resulting docs back to ids
211 | result.append((self.pos2id[pos], abs(score)))
212 | if len(result) == self.topsims:
213 | break
214 | return result
215 |
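The second branch of `sims2scores` (raw `(position, score)` pairs, assumed already sorted best-first by the index) boils down to: drop masked positions, drop near-zero scores, map positions back to ids, and stop at `topsims`. A dependency-free sketch with invented data:

```python
def sims2scores(sims, pos2id, topsims, eps=1e-7):
    """Turn raw (position, score) pairs into (docid, score) results."""
    result = []
    for pos, score in sims:
        # Positions missing from pos2id belong to deleted/rewritten documents.
        if pos in pos2id and abs(score) > eps:
            result.append((pos2id[pos], abs(score)))
            if len(result) == topsims:
                break
    return result

pos2id = {0: 'doc_a', 2: 'doc_c'}             # position 1 is masked
sims = [(0, 0.9), (1, 0.8), (2, -0.4), (3, 0.0)]
scores = sims2scores(sims, pos2id, topsims=10)
# position 1 is skipped (masked), -0.4 becomes 0.4, and 0.0 fails the eps test
```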
216 |
217 | def vec_by_id(self, docid):
218 | """Return indexed vector corresponding to document `docid`."""
219 | pos = self.id2pos[docid]
220 | return self.qindex.vector_by_id(pos)
221 |
222 |
223 | def sims_by_id(self, docid):
224 | """Find the most similar documents to the (already indexed) document with `docid`."""
225 | result = self.id2sims.get(docid, None)
226 | if result is None:
227 | self.qindex.num_best = self.topsims
228 | sims = self.qindex.similarity_by_id(self.id2pos[docid])
229 | result = self.sims2scores(sims)
230 | return result
231 |
232 |
233 | def sims_by_vec(self, vec, normalize=None):
234 | """
235 | Find the most similar documents to a given vector (=already processed document).
236 | """
237 | if normalize is None:
238 | normalize = self.qindex.normalize
239 | norm, self.qindex.normalize = self.qindex.normalize, normalize # store old value
240 | self.qindex.num_best = self.topsims
241 | sims = self.qindex[vec]
242 | self.qindex.normalize = norm # restore old value of qindex.normalize
243 | return self.sims2scores(sims)
244 |
245 |
246 | def merge(self, other):
247 | """Merge documents from the other index. Update precomputed similarities
248 | in the process."""
249 | other.qindex.normalize, other.qindex.num_best = False, self.topsims
250 | # update precomputed "most similar" for old documents (in case some of
251 | # the new docs make it to the top-N for some of the old documents)
252 | logger.info("updating old precomputed values")
253 | pos, lenself = 0, len(self.qindex)
254 | for chunk in self.qindex.iter_chunks():
255 | for sims in other.qindex[chunk]:
256 | if pos in self.pos2id:
257 | # ignore masked entries (deleted, overwritten documents)
258 | docid = self.pos2id[pos]
259 | sims = self.sims2scores(sims)
260 | self.id2sims[docid] = merge_sims(self.id2sims[docid], sims, self.topsims)
261 | pos += 1
262 | if pos % 10000 == 0:
263 | logger.info("PROGRESS: updated doc #%i/%i" % (pos, lenself))
264 | self.id2sims.sync()
265 |
266 | logger.info("merging fresh index into optimized one")
267 | pos, docids = 0, []
268 | for chunk in other.qindex.iter_chunks():
269 | for vec in chunk:
270 | if pos in other.pos2id: # don't copy deleted documents
271 | self.qindex.add_documents([vec])
272 | docids.append(other.pos2id[pos])
273 | pos += 1
274 | self.qindex.save()
275 | self.update_ids(docids)
276 |
277 | logger.info("precomputing most similar for the fresh index")
278 | pos, lenother = 0, len(other.qindex)
279 | norm, self.qindex.normalize = self.qindex.normalize, False
280 | topsims, self.qindex.num_best = self.qindex.num_best, self.topsims
281 | for chunk in other.qindex.iter_chunks():
282 | for sims in self.qindex[chunk]:
283 | if pos in other.pos2id:
284 | # ignore masked entries (deleted, overwritten documents)
285 | docid = other.pos2id[pos]
286 | self.id2sims[docid] = self.sims2scores(sims)
287 | pos += 1
288 | if pos % 10000 == 0:
289 | logger.info("PROGRESS: precomputed doc #%i/%i" % (pos, lenother))
290 | self.qindex.normalize, self.qindex.num_best = norm, topsims
291 | self.id2sims.sync()
292 |
293 |
294 | def __len__(self):
295 | return len(self.id2pos)
296 |
297 | def __contains__(self, docid):
298 | return docid in self.id2pos
299 |
300 | def keys(self):
301 | return self.id2pos.keys()
302 |
303 | def __str__(self):
304 | return "SimIndex(%i docs, %i real size)" % (len(self), self.length)
305 | #endclass SimIndex
306 |
307 |
308 |
309 | class SimModel(gensim.utils.SaveLoad):
310 | """
311 | A semantic model responsible for translating between plain text and (semantic)
312 | vectors.
313 |
314 | These vectors can then be indexed/queried for similarity, see the `SimIndex`
315 | class. Used internally by `SimServer`.
316 | """
317 | def __init__(self, fresh_docs, dictionary=None, method=None, params=None):
318 | """
319 | Train a model, using `fresh_docs` as training corpus.
320 |
321 | If `dictionary` is not specified, it is computed from the documents.
322 |
323 | `method` is currently one of "tfidf"/"lsi"/"lda".
324 | """
325 | # FIXME TODO: use subclassing/injection for different methods, instead of param..
326 | self.method = method
327 | if params is None:
328 | params = {}
329 | self.params = params
330 | logger.info("collecting %i document ids" % len(fresh_docs))
331 | docids = fresh_docs.keys()
332 | logger.info("creating model from %s documents" % len(docids))
333 | preprocessed = lambda: (fresh_docs[docid]['tokens'] for docid in docids)
334 |
335 | # create id->word (integer->string) mapping
336 | logger.info("creating dictionary from %s documents" % len(docids))
337 | if dictionary is None:
338 | dictionary = gensim.corpora.Dictionary(preprocessed())
339 | if len(docids) >= 1000:
340 | dictionary.filter_extremes(no_below=5, no_above=0.2, keep_n=50000)
341 | else:
342 | logger.warning("training model on only %i documents; is this intentional?" % len(docids))
343 | dictionary.filter_extremes(no_below=0, no_above=1.0, keep_n=50000)
344 | self.dictionary = dictionary
345 |
346 | class IterableCorpus(object):
347 | def __iter__(self):
348 | for tokens in preprocessed():
349 | yield dictionary.doc2bow(tokens)
350 |
351 | def __len__(self):
352 | return len(docids)
353 | corpus = IterableCorpus()
354 |
355 | if method == 'lsi':
356 | logger.info("training TF-IDF model")
357 | self.tfidf = gensim.models.TfidfModel(corpus, id2word=self.dictionary)
358 | logger.info("training LSI model")
359 | tfidf_corpus = self.tfidf[corpus]
360 | self.lsi = gensim.models.LsiModel(tfidf_corpus, id2word=self.dictionary, **params)
361 | self.lsi.projection.u = self.lsi.projection.u.astype(numpy.float32) # use single precision to save mem
362 | self.num_features = len(self.lsi.projection.s)
363 | elif method == 'lda_tfidf':
364 | logger.info("training TF-IDF model")
365 | self.tfidf = gensim.models.TfidfModel(corpus, id2word=self.dictionary)
366 | logger.info("training LDA model")
367 | self.lda = gensim.models.LdaModel(self.tfidf[corpus], id2word=self.dictionary, **params)
368 | self.num_features = self.lda.num_topics
369 | elif method == 'lda':
370 | logger.info("training TF-IDF model")
371 | self.tfidf = gensim.models.TfidfModel(corpus, id2word=self.dictionary)
372 | logger.info("training LDA model")
373 | self.lda = gensim.models.LdaModel(corpus, id2word=self.dictionary, **params)
374 | self.num_features = self.lda.num_topics
375 | elif method == 'logentropy':
376 | logger.info("training a log-entropy model")
377 | self.logent = gensim.models.LogEntropyModel(corpus, id2word=self.dictionary, **params)
378 | self.num_features = len(self.dictionary)
379 | else:
380 | msg = "unknown semantic method %s" % method
381 | logger.error(msg)
382 | raise NotImplementedError(msg)
383 |
384 |
385 | def doc2vec(self, doc):
386 | """Convert a single SimilarityDocument to vector."""
387 | bow = self.dictionary.doc2bow(doc['tokens'])
388 | if self.method == 'lsi':
389 | return self.lsi[self.tfidf[bow]]
390 | elif self.method == 'lda':
391 | return self.lda[bow]
392 | elif self.method == 'lda_tfidf':
393 | return self.lda[self.tfidf[bow]]
394 | elif self.method == 'logentropy':
395 | return self.logent[bow]
396 |
397 |
398 | def docs2vecs(self, docs):
399 | """Convert multiple SimilarityDocuments to vectors (batch version of doc2vec)."""
400 | bows = (self.dictionary.doc2bow(doc['tokens']) for doc in docs)
401 | if self.method == 'lsi':
402 | return self.lsi[self.tfidf[bows]]
403 | elif self.method == 'lda':
404 | return self.lda[bows]
405 | elif self.method == 'lda_tfidf':
406 | return self.lda[self.tfidf[bows]]
407 | elif self.method == 'logentropy':
408 | return self.logent[bows]
409 |
410 |
411 | def get_tfidf(self, doc):
412 | bow = self.dictionary.doc2bow(doc['tokens'])
413 | if hasattr(self, 'tfidf'):
414 | return self.tfidf[bow]
415 | if hasattr(self, 'logent'):
416 | return self.logent[bow]
417 | raise ValueError("model must contain either TF-IDF or LogEntropy transformation")
418 |
419 |
420 | def close(self):
421 | """Release important resources manually."""
422 | pass
423 |
424 | def __str__(self):
425 | return "SimModel(method=%s, dict=%s)" % (self.method, self.dictionary)
426 | #endclass SimModel
427 |
428 |
429 |
430 | class SimServer(object):
431 | """
432 | Top-level functionality for similarity services. A similarity server takes
433 | care of::
434 |
435 | 1. creating semantic models
436 | 2. indexing documents using these models
437 | 3. finding the most similar documents in an index.
438 |
439 | An object of this class can be shared across network via Pyro, to answer remote
440 | client requests. It is thread safe. Using a server concurrently from multiple
441 |     processes is safe for reading (= answering similarity queries); modifications
442 |     (training/indexing) are serialized internally via locking.
443 | """
444 | def __init__(self, basename, use_locks=False):
445 | """
446 | All data will be stored under directory `basename`. If there is a server
447 | there already, it will be loaded (resumed).
448 |
449 | The server object is stateless in RAM -- its state is defined entirely by its location.
450 | There is therefore no need to store the server object.
451 | """
452 | if not os.path.isdir(basename):
453 | raise ValueError("%r must be a writable directory" % basename)
454 | self.basename = basename
455 | self.use_locks = use_locks
456 | self.lock_update = threading.RLock() if use_locks else gensim.utils.nocm
457 | try:
458 | self.fresh_index = SimIndex.load(self.location('index_fresh'))
459 | except:
460 | logger.debug("starting a new fresh index")
461 | self.fresh_index = None
462 | try:
463 | self.opt_index = SimIndex.load(self.location('index_opt'))
464 | except:
465 | logger.debug("starting a new optimized index")
466 | self.opt_index = None
467 | try:
468 | self.model = SimModel.load(self.location('model'))
469 | except:
470 | self.model = None
471 | self.payload = SqliteDict(self.location('payload'), autocommit=True, journal_mode=JOURNAL_MODE)
472 | self.flush(save_index=False, save_model=False, clear_buffer=True)
473 | logger.info("loaded %s" % self)
474 |
475 |
476 | def location(self, name):
477 | return os.path.join(self.basename, name)
478 |
479 |
480 | @gensim.utils.synchronous('lock_update')
481 | def flush(self, save_index=False, save_model=False, clear_buffer=False):
482 | """Commit all changes, clear all caches."""
483 | if save_index:
484 | if self.fresh_index is not None:
485 | self.fresh_index.save(self.location('index_fresh'))
486 | if self.opt_index is not None:
487 | self.opt_index.save(self.location('index_opt'))
488 | if save_model:
489 | if self.model is not None:
490 | self.model.save(self.location('model'))
491 | self.payload.commit()
492 | if clear_buffer:
493 | if hasattr(self, 'fresh_docs'):
494 | try:
495 | self.fresh_docs.terminate() # erase all buffered documents + file on disk
496 | except:
497 | pass
498 | self.fresh_docs = SqliteDict(journal_mode=JOURNAL_MODE) # buffer defaults to a random location in temp
499 | self.fresh_docs.sync()
500 |
501 |
502 | def close(self):
503 | """Explicitly close open file handles, databases etc."""
504 | try:
505 | self.payload.close()
506 | except:
507 | pass
508 | try:
509 | self.model.close()
510 | except:
511 | pass
512 | try:
513 | self.fresh_index.close()
514 | except:
515 | pass
516 | try:
517 | self.opt_index.close()
518 | except:
519 | pass
520 | try:
521 | self.fresh_docs.terminate()
522 | except:
523 | pass
524 |
525 | def __del__(self):
526 | """When the server went out of scope, make an effort to close its DBs."""
527 | self.close()
528 |
529 | @gensim.utils.synchronous('lock_update')
530 | def buffer(self, documents):
531 | """
532 | Add a sequence of documents to be processed (indexed or trained on).
533 |
534 | Here, the documents are simply collected; real processing is done later,
535 | during the `self.index` or `self.train` calls.
536 |
537 | `buffer` can be called repeatedly; the result is the same as if it was
538 | called once, with a concatenation of all the partial document batches.
539 |         The point is to save memory when sending large corpora over the network: without chunking,
540 |         the entire `documents` sequence would have to be serialized into RAM at once. See `utils.upload_chunked()`.
541 |
542 | A call to `flush()` clears this documents-to-be-processed buffer (`flush`
543 | is also implicitly called when you call `index()` and `train()`).
544 | """
545 | logger.info("adding documents to temporary buffer of %s" % (self))
546 | for doc in documents:
547 | docid = doc['id']
548 | # logger.debug("buffering document %r" % docid)
549 | if docid in self.fresh_docs:
550 | logger.warning("asked to re-add id %r; rewriting old value" % docid)
551 | self.fresh_docs[docid] = doc
552 | self.fresh_docs.sync()
553 |
554 |
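Because repeated `buffer` calls concatenate, a client can stream a large corpus in slices instead of serializing it whole (the docstring points at `utils.upload_chunked()` for this). A stand-alone sketch of the same idea; `chunked` and `DummyServer` below are illustrations, not the gensim API:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

class DummyServer:
    """Stand-in mimicking SimServer.buffer semantics: last write per id wins."""
    def __init__(self):
        self.fresh_docs = {}
    def buffer(self, documents):
        for doc in documents:
            self.fresh_docs[doc['id']] = doc

server = DummyServer()
corpus = ({'id': 'doc_%i' % i, 'tokens': ['token_%i' % i]} for i in range(10))
for chunk in chunked(corpus, size=4):   # ship four documents per call
    server.buffer(chunk)
```

Only one chunk at a time is materialized in memory, which is exactly the point of `buffer` being cumulative.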
555 | @gensim.utils.synchronous('lock_update')
556 | def train(self, corpus=None, method='auto', clear_buffer=True, params=None):
557 | """
558 | Create an indexing model. Will overwrite the model if it already exists.
559 | All indexes become invalid, because documents in them use a now-obsolete
560 | representation.
561 |
562 | The model is trained on documents previously entered via `buffer`,
563 | or directly on `corpus`, if specified.
564 | """
565 | if corpus is not None:
566 | # use the supplied corpus only (erase existing buffer, if any)
567 | self.flush(clear_buffer=True)
568 | self.buffer(corpus)
569 | if not self.fresh_docs:
570 | msg = "train called but no training corpus specified for %s" % self
571 | logger.error(msg)
572 | raise ValueError(msg)
573 | if method == 'auto':
574 | numdocs = len(self.fresh_docs)
575 | if numdocs < 1000:
576 |             logger.warning("too few training documents; using simple log-entropy model instead of latent semantic indexing")
577 | method = 'logentropy'
578 | else:
579 | method = 'lsi'
580 | if params is None:
581 | params = {}
582 | self.model = SimModel(self.fresh_docs, method=method, params=params)
583 | self.flush(save_model=True, clear_buffer=clear_buffer)
584 |
585 |
586 | @gensim.utils.synchronous('lock_update')
587 | def index(self, corpus=None, clear_buffer=True):
588 | """
589 | Permanently index all documents previously added via `buffer`, or
590 | directly index documents from `corpus`, if specified.
591 |
592 | The indexing model must already exist (see `train`) before this function
593 | is called.
594 | """
595 | if not self.model:
596 | msg = 'must initialize model for %s before indexing documents' % self.basename
597 | logger.error(msg)
598 | raise AttributeError(msg)
599 |
600 | if corpus is not None:
601 | # use the supplied corpus only (erase existing buffer, if any)
602 | self.flush(clear_buffer=True)
603 | self.buffer(corpus)
604 |
605 | if not self.fresh_docs:
606 | msg = "index called but no indexing corpus specified for %s" % self
607 | logger.error(msg)
608 | raise ValueError(msg)
609 |
610 | if not self.fresh_index:
611 | logger.info("starting a new fresh index for %s" % self)
612 | self.fresh_index = SimIndex(self.location('index_fresh'), self.model.num_features)
613 | self.fresh_index.index_documents(self.fresh_docs, self.model)
614 | if self.opt_index is not None:
615 | self.opt_index.delete(self.fresh_docs.keys())
616 | logger.info("storing document payloads")
617 | for docid in self.fresh_docs:
618 | payload = self.fresh_docs[docid].get('payload', None)
619 | if payload is None:
620 | # HACK: exit on first doc without a payload (=assume all docs have payload, or none does)
621 | break
622 | self.payload[docid] = payload
623 | self.flush(save_index=True, clear_buffer=clear_buffer)
624 |
625 |
626 | @gensim.utils.synchronous('lock_update')
627 | def optimize(self):
628 | """
629 | Precompute top similarities for all indexed documents. This speeds up
630 | `find_similar` queries by id (but not queries by fulltext).
631 |
632 | Internally, documents are moved from a fresh index (=no precomputed similarities)
633 | to an optimized index (precomputed similarities). Similarity queries always
634 | query both indexes, so this split is transparent to clients.
635 |
636 | If you add documents later via `index`, they go to the fresh index again.
637 | To precompute top similarities for these new documents too, simply call
638 | `optimize` again.
639 |
640 | """
641 | if self.fresh_index is None:
642 | logger.warning("optimize called but there are no new documents")
643 | return # nothing to do!
644 |
645 | if self.opt_index is None:
646 | logger.info("starting a new optimized index for %s" % self)
647 | self.opt_index = SimIndex(self.location('index_opt'), self.model.num_features)
648 |
649 | self.opt_index.merge(self.fresh_index)
650 | self.fresh_index.terminate() # delete old files
651 | self.fresh_index = None
652 | self.flush(save_index=True)
653 |
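The fresh-to-optimized migration that `optimize` performs can be sketched with plain dicts standing in for `SimIndex` (a toy model of the control flow, not the real `SimIndex` API):

```python
# Toy sketch of `optimize`: merge the "fresh" store (no precomputed
# similarities) into the "optimized" one, then drop the fresh store.
# Plain dicts (docid -> vector) stand in for SimIndex here.

def optimize(fresh_index, opt_index):
    """Return (fresh_index, opt_index) after migrating fresh into opt."""
    if fresh_index is None:
        return None, opt_index        # nothing to do, as in the early return
    if opt_index is None:
        opt_index = {}                # lazily start an optimized index
    opt_index.update(fresh_index)     # opt_index.merge(fresh_index)
    return None, opt_index            # fresh_index is terminated

fresh = {'doc1': [0.1, 0.2], 'doc2': [0.3, 0.4]}
fresh, opt = optimize(fresh, None)
```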
654 |
655 | @gensim.utils.synchronous('lock_update')
656 | def drop_index(self, keep_model=True):
657 |         """Drop all indexed documents. If `keep_model` is False, also drop the model."""
658 | modelstr = "" if keep_model else "and model "
659 | logger.info("deleting similarity index " + modelstr + "from %s" % self.basename)
660 |
661 | # delete indexes
662 | for index in [self.fresh_index, self.opt_index]:
663 | if index is not None:
664 | index.terminate()
665 | self.fresh_index, self.opt_index = None, None
666 |
667 | # delete payload
668 | if self.payload is not None:
669 | self.payload.close()
670 |
671 | fname = self.location('payload')
672 | try:
673 | if os.path.exists(fname):
674 | os.remove(fname)
675 | logger.info("deleted %s" % fname)
676 |         except Exception, e:
677 |             logger.warning("failed to delete %s: %s" % (fname, e))
678 | self.payload = SqliteDict(self.location('payload'), autocommit=True, journal_mode=JOURNAL_MODE)
679 |
680 | # optionally, delete the model as well
681 | if not keep_model and self.model is not None:
682 | self.model.close()
683 | fname = self.location('model')
684 | try:
685 | if os.path.exists(fname):
686 | os.remove(fname)
687 | logger.info("deleted %s" % fname)
688 |             except Exception, e:
689 |                 logger.warning("failed to delete %s: %s" % (fname, e))
690 | self.model = None
691 | self.flush(save_index=True, save_model=True, clear_buffer=True)
692 |
693 |
694 | @gensim.utils.synchronous('lock_update')
695 | def delete(self, docids):
696 | """Delete specified documents from the index."""
697 | logger.info("asked to drop %i documents" % len(docids))
698 | for index in [self.opt_index, self.fresh_index]:
699 | if index is not None:
700 | index.delete(docids)
701 | self.flush(save_index=True)
702 |
703 |
704 | def is_locked(self):
705 | return self.use_locks and self.lock_update._RLock__count > 0
706 |
707 |
708 | def vec_by_id(self, docid):
709 | for index in [self.opt_index, self.fresh_index]:
710 | if index is not None and docid in index:
711 | return index.vec_by_id(docid)
712 |
713 |
714 | def find_similar(self, doc, min_score=0.0, max_results=100):
715 | """
716 | Find `max_results` most similar articles in the index, each having similarity
717 | score of at least `min_score`. The resulting list may be shorter than `max_results`,
718 | in case there are not enough matching documents.
719 |
720 |         `doc` is either a string (=document id, previously indexed) or a
721 |         dict with either a 'tokens' key (text that is converted to a query
722 |         vector via the model) or a 'topics' key (a ready-made query vector).
723 |
724 | The similar documents are returned in decreasing similarity order, as
725 | `(doc_id, similarity_score, doc_payload)` 3-tuples. The payload returned
726 | is identical to what was supplied for this document during indexing.
727 |
728 | """
729 | logger.debug("received query call with %r" % doc)
730 | if self.is_locked():
731 | msg = "cannot query while the server is being updated"
732 | logger.error(msg)
733 | raise RuntimeError(msg)
734 | sims_opt, sims_fresh = None, None
735 | for index in [self.fresh_index, self.opt_index]:
736 | if index is not None:
737 | index.topsims = max_results
738 | if isinstance(doc, basestring):
739 | # query by direct document id
740 | docid = doc
741 | if self.opt_index is not None and docid in self.opt_index:
742 | sims_opt = self.opt_index.sims_by_id(docid)
743 | if self.fresh_index is not None:
744 | vec = self.opt_index.vec_by_id(docid)
745 | sims_fresh = self.fresh_index.sims_by_vec(vec, normalize=False)
746 | elif self.fresh_index is not None and docid in self.fresh_index:
747 | sims_fresh = self.fresh_index.sims_by_id(docid)
748 | if self.opt_index is not None:
749 | vec = self.fresh_index.vec_by_id(docid)
750 | sims_opt = self.opt_index.sims_by_vec(vec, normalize=False)
751 | else:
752 | raise ValueError("document %r not in index" % docid)
753 | else:
754 | if 'topics' in doc:
755 | # user supplied vector directly => use that
756 | vec = gensim.matutils.any2sparse(doc['topics'])
757 | else:
758 | # query by an arbitrary text (=tokens) inside doc['tokens']
759 | vec = self.model.doc2vec(doc) # convert document (text) to vector
760 | if self.opt_index is not None:
761 | sims_opt = self.opt_index.sims_by_vec(vec)
762 | if self.fresh_index is not None:
763 | sims_fresh = self.fresh_index.sims_by_vec(vec)
764 |
765 | merged = merge_sims(sims_opt, sims_fresh)
766 | logger.debug("got %s raw similars, pruning with max_results=%s, min_score=%s" %
767 | (len(merged), max_results, min_score))
768 | result = []
769 | for docid, score in merged:
770 | if score < min_score or 0 < max_results <= len(result):
771 | break
772 | result.append((docid, float(score), self.payload.get(docid, None)))
773 | return result
774 |
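The final pruning loop in `find_similar` can be run in isolation; here is a self-contained sketch over a hypothetical merged similarity list (already sorted by decreasing score, as `merge_sims` produces):

```python
# Sketch of the pruning step in `find_similar`: walk similarities in
# decreasing-score order, stop at the first score below `min_score` or
# once `max_results` hits are collected (max_results=0 means unbounded).

def prune(merged, min_score=0.0, max_results=100):
    result = []
    for docid, score in merged:
        if score < min_score or 0 < max_results <= len(result):
            break
        result.append((docid, float(score)))
    return result

sims = [('a', 0.9), ('b', 0.5), ('c', 0.2), ('d', 0.1)]
top = prune(sims, min_score=0.3)    # [('a', 0.9), ('b', 0.5)]
```

Because the input is sorted, the first score below `min_score` ends the scan early instead of filtering the whole list.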
775 |
776 | def __str__(self):
777 | return ("SimServer(loc=%r, fresh=%s, opt=%s, model=%s, buffer=%s)" %
778 | (self.basename, self.fresh_index, self.opt_index, self.model, self.fresh_docs))
779 |
780 |
781 | def __len__(self):
782 | return sum(len(index) for index in [self.opt_index, self.fresh_index]
783 | if index is not None)
784 |
785 |
786 | def __contains__(self, docid):
787 | """Is document with `docid` in the index?"""
788 | return any(index is not None and docid in index
789 | for index in [self.opt_index, self.fresh_index])
790 |
791 |
792 | def get_tfidf(self, *args, **kwargs):
793 | return self.model.get_tfidf(*args, **kwargs)
794 |
795 |
796 | def status(self):
797 | return str(self)
798 |
799 | def keys(self):
800 | """Return ids of all indexed documents."""
801 | result = []
802 | if self.fresh_index is not None:
803 | result += self.fresh_index.keys()
804 | if self.opt_index is not None:
805 | result += self.opt_index.keys()
806 | return result
807 |
808 | def memdebug(self):
809 | from guppy import hpy
810 | return str(hpy().heap())
811 | #endclass SimServer
812 |
813 |
814 |
815 | class SessionServer(gensim.utils.SaveLoad):
816 | """
817 | Similarity server on top of :class:`SimServer` that implements sessions = transactions.
818 |
819 | A transaction is a set of server modifications (index/delete/train calls) that
820 | may be either committed or rolled back entirely.
821 |
822 | Sessions are realized by:
823 |
824 | 1. cloning (=copying) a SimServer at the beginning of a session
825 | 2. serving read-only queries from the original server (the clone may be modified during queries)
826 | 3. modifications affect only the clone
827 | 4. at commit, the clone becomes the original
828 | 5. at rollback, do nothing (clone is discarded, next transaction starts from the original again)
829 | """
830 | def __init__(self, basedir, autosession=True, use_locks=True):
831 | self.basedir = basedir
832 | self.autosession = autosession
833 | self.use_locks = use_locks
834 | self.lock_update = threading.RLock() if use_locks else gensim.utils.nocm
835 |         self.locs = ['a', 'b'] # directories under which to store stable/session data
836 | try:
837 | stable = open(self.location('stable')).read().strip()
838 | self.istable = self.locs.index(stable)
839 | except:
840 | self.istable = 0
841 | logger.info("stable index pointer not found or invalid; starting from %s" %
842 | self.loc_stable)
843 | try:
844 | os.makedirs(self.loc_stable)
845 | except:
846 | pass
847 | self.write_istable()
848 | self.stable = SimServer(self.loc_stable, use_locks=self.use_locks)
849 | self.session = None
850 |
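The a/b double-directory scheme set up above can be sketched with the standard library alone (hypothetical file layout; this only illustrates steps 1-5 from the class docstring, not the real server):

```python
# Toy sketch of the clone-based transaction scheme: stable state lives in
# one of two sibling directories ('a' or 'b'); opening a session clones
# the stable dir, modifications touch only the clone, and commit flips
# the stable pointer to the clone.
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
locs, istable = ['a', 'b'], 0
os.makedirs(os.path.join(base, locs[istable]))
with open(os.path.join(base, locs[istable], 'data'), 'w') as f:
    f.write('v1')                        # committed (stable) state

# open_session: clone stable -> session directory
session_dir = os.path.join(base, locs[1 - istable])
shutil.rmtree(session_dir, ignore_errors=True)
shutil.copytree(os.path.join(base, locs[istable]), session_dir)

# modifications affect only the clone
with open(os.path.join(session_dir, 'data'), 'w') as f:
    f.write('v2')

# commit: flip the stable pointer; rollback would simply skip this flip
istable = 1 - istable
stable_dir = os.path.join(base, locs[istable])
```

Queries keep hitting the old stable directory until the pointer flips, which is why reads stay consistent while a session is open.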
851 |
852 | def location(self, name):
853 | return os.path.join(self.basedir, name)
854 |
855 | @property
856 | def loc_stable(self):
857 | return self.location(self.locs[self.istable])
858 |
859 | @property
860 | def loc_session(self):
861 | return self.location(self.locs[1 - self.istable])
862 |
863 | def __contains__(self, docid):
864 | return docid in self.stable
865 |
866 | def __str__(self):
867 | return "SessionServer(\n\tstable=%s\n\tsession=%s\n)" % (self.stable, self.session)
868 |
869 | def __len__(self):
870 | return len(self.stable)
871 |
872 | def keys(self):
873 | return self.stable.keys()
874 |
875 | @gensim.utils.synchronous('lock_update')
876 | def check_session(self):
877 | """
878 | Make sure a session is open.
879 |
880 | If it's not and autosession is turned on, create a new session automatically.
881 | If it's not and autosession is off, raise an exception.
882 | """
883 | if self.session is None:
884 | if self.autosession:
885 | self.open_session()
886 | else:
887 | msg = "must open a session before modifying %s" % self
888 | raise RuntimeError(msg)
889 |
890 | @gensim.utils.synchronous('lock_update')
891 | def open_session(self):
892 | """
893 | Open a new session to modify this server.
894 |
895 |         You can either call this function directly, or turn on autosession,
896 |         which will open/commit sessions for you transparently.
897 | """
898 | if self.session is not None:
899 | msg = "session already open; commit it or rollback before opening another one in %s" % self
900 | logger.error(msg)
901 | raise RuntimeError(msg)
902 |
903 | logger.info("opening a new session")
904 | logger.info("removing %s" % self.loc_session)
905 | try:
906 | shutil.rmtree(self.loc_session)
907 | except:
908 | logger.info("failed to delete %s" % self.loc_session)
909 | logger.info("cloning server from %s to %s" %
910 | (self.loc_stable, self.loc_session))
911 | shutil.copytree(self.loc_stable, self.loc_session)
912 | self.session = SimServer(self.loc_session, use_locks=self.use_locks)
913 | self.lock_update.acquire() # no other thread can call any modification methods until commit/rollback
914 |
915 | @gensim.utils.synchronous('lock_update')
916 | def buffer(self, *args, **kwargs):
917 |         """Buffer documents in the current session."""
918 | self.check_session()
919 | result = self.session.buffer(*args, **kwargs)
920 | return result
921 |
922 | @gensim.utils.synchronous('lock_update')
923 | def index(self, *args, **kwargs):
924 |         """Index documents in the current session."""
925 | self.check_session()
926 | result = self.session.index(*args, **kwargs)
927 | if self.autosession:
928 | self.commit()
929 | return result
930 |
931 | @gensim.utils.synchronous('lock_update')
932 | def train(self, *args, **kwargs):
933 | """Update semantic model, in the current session."""
934 | self.check_session()
935 | result = self.session.train(*args, **kwargs)
936 | if self.autosession:
937 | self.commit()
938 | return result
939 |
940 | @gensim.utils.synchronous('lock_update')
941 | def drop_index(self, keep_model=True):
942 | """Drop all indexed documents from the session. Optionally, drop model too."""
943 | self.check_session()
944 | result = self.session.drop_index(keep_model)
945 | if self.autosession:
946 | self.commit()
947 | return result
948 |
949 | @gensim.utils.synchronous('lock_update')
950 | def delete(self, docids):
951 | """Delete documents from the current session."""
952 | self.check_session()
953 | result = self.session.delete(docids)
954 | if self.autosession:
955 | self.commit()
956 | return result
957 |
958 | @gensim.utils.synchronous('lock_update')
959 | def optimize(self):
960 | """Optimize index for faster by-document-id queries."""
961 | self.check_session()
962 | result = self.session.optimize()
963 | if self.autosession:
964 | self.commit()
965 | return result
966 |
967 | @gensim.utils.synchronous('lock_update')
968 | def write_istable(self):
969 | with open(self.location('stable'), 'w') as fout:
970 | fout.write(os.path.basename(self.loc_stable))
971 |
972 | @gensim.utils.synchronous('lock_update')
973 | def commit(self):
974 | """Commit changes made by the latest session."""
975 | if self.session is not None:
976 | logger.info("committing transaction in %s" % self)
977 | tmp = self.stable
978 | self.stable, self.session = self.session, None
979 | self.istable = 1 - self.istable
980 | self.write_istable()
981 | tmp.close() # don't wait for gc, release resources manually
982 | self.lock_update.release()
983 | else:
984 | logger.warning("commit called but there's no open session in %s" % self)
985 |
986 | @gensim.utils.synchronous('lock_update')
987 | def rollback(self):
988 | """Ignore all changes made in the latest session (terminate the session)."""
989 | if self.session is not None:
990 | logger.info("rolling back transaction in %s" % self)
991 | self.session.close()
992 | self.session = None
993 | self.lock_update.release()
994 | else:
995 | logger.warning("rollback called but there's no open session in %s" % self)
996 |
997 | @gensim.utils.synchronous('lock_update')
998 | def set_autosession(self, value=None):
999 | """
1000 | Turn autosession (automatic committing after each modification call) on/off.
1001 | If value is None, only query the current value (don't change anything).
1002 | """
1003 | if value is not None:
1004 | self.rollback()
1005 | self.autosession = value
1006 | return self.autosession
1007 |
1008 | @gensim.utils.synchronous('lock_update')
1009 | def close(self):
1010 | """Don't wait for gc, try to release important resources manually"""
1011 | try:
1012 | self.stable.close()
1013 | except:
1014 | pass
1015 | try:
1016 | self.session.close()
1017 | except:
1018 | pass
1019 |
1020 | def __del__(self):
1021 | self.close()
1022 |
1023 | @gensim.utils.synchronous('lock_update')
1024 | def terminate(self):
1025 | """Delete all files created by this server, invalidating `self`. Use with care."""
1026 | logger.info("deleting entire server %s" % self)
1027 | self.close()
1028 | try:
1029 | shutil.rmtree(self.basedir)
1030 | logger.info("deleted server under %s" % self.basedir)
1031 |             # delete everything from self, so that using this object
1032 |             # results in an error as quickly as possible
1033 | for val in self.__dict__.keys():
1034 | try:
1035 | delattr(self, val)
1036 | except:
1037 | pass
1038 | except Exception, e:
1039 | logger.warning("failed to delete SessionServer: %s" % (e))
1040 |
1041 |
1042 | def find_similar(self, *args, **kwargs):
1043 | """
1044 | Find similar articles.
1045 |
1046 | With autosession off, use the index state *before* current session started,
1047 | so that changes made in the session will not be visible here. With autosession
1048 | on, close the current session first (so that session changes *are* committed
1049 | and visible).
1050 | """
1051 | if self.session is not None and self.autosession:
1052 | # with autosession on, commit the pending transaction first
1053 | self.commit()
1054 | return self.stable.find_similar(*args, **kwargs)
1055 |
1056 |
1057 | def get_tfidf(self, *args, **kwargs):
1058 | if self.session is not None and self.autosession:
1059 | # with autosession on, commit the pending transaction first
1060 | self.commit()
1061 | return self.stable.get_tfidf(*args, **kwargs)
1062 |
1063 |
1064 | # add some functions for remote access (RPC via Pyro)
1065 | def debug_model(self):
1066 | return self.stable.model
1067 |
1068 | def status(self): # str() alias
1069 | return str(self)
1070 |
--------------------------------------------------------------------------------
/simserver/test/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/piskvorky/gensim-simserver/e7e59e836ef6d9da019a8c6b218ef0bdd998b2da/simserver/test/__init__.py
--------------------------------------------------------------------------------
/simserver/test/test_simserver.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # Copyright (C) 2011 Radim Rehurek
5 | # Licensed under the GNU AGPL v3 - http://www.gnu.org/licenses/agpl.html
6 |
7 | """
8 | Automated tests for checking similarity server.
9 | """
10 |
11 | from __future__ import with_statement
12 |
13 | import logging
14 | import sys
15 | import unittest
16 | from copy import deepcopy
17 |
18 | import numpy
19 | import Pyro4
20 |
21 | import gensim
22 | import simserver
23 |
24 | logger = logging.getLogger('test_simserver')
25 |
26 |
27 |
28 | def mock_documents(language, category):
29 | """Create a few SimServer documents, for testing."""
30 | documents = ["Human machine interface for lab abc computer applications",
31 | "A survey of user opinion of computer system response time",
32 | "The EPS user interface management system",
33 | "System and human system engineering testing of EPS",
34 | "Relation of user perceived response time to error measurement",
35 | "The generation of random binary unordered trees",
36 | "The intersection graph of paths in trees",
37 | "Graph minors IV Widths of trees and well quasi ordering",
38 | "Graph minors A survey"]
39 |
40 | # Create SimServer dicts from the texts. These are the object that the gensim
41 | # server expects as input. They must contain doc['id'] and doc['tokens'] attributes.
42 | docs = [{'id': '_'.join((language, category, str(num))),
43 | 'tokens': gensim.utils.simple_preprocess(document), 'payload': range(num),
44 | 'language': language, 'category': category}
45 | for num, document in enumerate(documents)]
46 | return docs
47 |
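The document dicts built above are plain mappings; a minimal stand-alone version, with a naive tokenizer as a hypothetical stand-in for `gensim.utils.simple_preprocess`:

```python
# Minimal sketch of the input format SimServer/SessionServer expect:
# each document is a dict with an 'id' and 'tokens'; 'payload' is
# optional and is returned verbatim by find_similar. simple_tokenize is
# a naive stand-in for gensim.utils.simple_preprocess.

def simple_tokenize(text):
    return [token for token in text.lower().split() if token.isalpha()]

def make_doc(docid, text, payload=None):
    doc = {'id': docid, 'tokens': simple_tokenize(text)}
    if payload is not None:
        doc['payload'] = payload
    return doc

doc = make_doc('en_0', 'Human machine interface', payload={'source': 'demo'})
```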
48 |
49 | class SessionServerTester(unittest.TestCase):
50 | """Test a running SessionServer"""
51 | def setUp(self):
52 | self.docs = mock_documents('en', '')
53 | try:
54 | self.server = Pyro4.Proxy('PYRONAME:gensim.testserver')
55 | logger.info(self.server.status())
56 | except Exception:
57 | logger.info("could not locate running SessionServer; starting a local server")
58 | self.server = simserver.SessionServer(gensim.utils.randfname())
59 | self.server.set_autosession(True)
60 |
61 | def tearDown(self):
62 | self.docs = None
64 | try:
65 | # if server is remote, just close the proxy connection
66 | self.server._pyroRelease()
67 | except AttributeError:
68 | try:
69 | # for local server, close & remove all files
70 | self.server.terminate()
71 | except:
72 | pass
73 |
74 |
75 | def check_equal(self, sims1, sims2):
76 | """Check that two returned lists of similarities are equal."""
77 | sims1 = dict(s[:2] for s in sims1)
78 | sims2 = dict(s[:2] for s in sims2)
79 | for docid in set(sims1.keys() + sims2.keys()):
80 | self.assertTrue(numpy.allclose(sims1.get(docid, 0.0), sims2.get(docid, 0.0), atol=1e-7))
81 |
82 |
83 | def test_model(self):
84 | """test remote server model creation"""
85 | logger.debug(self.server.status())
86 | # calling train without specifying a training corpus raises a ValueError:
87 | self.assertRaises(ValueError, self.server.train, method='lsi')
88 |
89 | # now do the training for real. use a common pattern -- upload documents
90 | # to be processed to the server.
91 | # the documents will be stored server-side in an Sqlite db, not in memory,
92 | # so the training corpus may be larger than RAM.
93 | # if the corpus is very large, upload it in smaller chunks, like 10k docs
94 | # at a time (or else Pyro & cPickle will choke). also see `utils.upload_chunked`.
95 | self.server.buffer(self.docs[:2]) # upload 2 documents to server
96 | self.server.buffer(self.docs[2:]) # upload the rest of the documents
97 |
98 | # now, train a model
99 | self.server.train(method='lsi')
100 |
101 | # check that the model was trained correctly
102 | model = self.server.debug_model()
103 | s_values = [1.2704573, 1.13604315, 1.07827574, 1.02963433, 0.97147057,
104 | 0.9280468, 0.90321329, 0.83034548, 0.74981662]
105 | self.assertTrue(numpy.allclose(model.lsi.projection.s, s_values))
106 |
107 | vec0 = [(0, -0.27759625943508104), (1, -0.29736164713214469), (2, 0.14768134319504395),
108 | (3, -0.24586351025187975), (4, 0.8359357384389362), (5, 0.084659319019917578),
109 | (6, -0.2042204354844826), (7, -0.016382387104491858), (8, 0.065784642613330224)]
110 | got = model.doc2vec(self.docs[0])
111 | self.assertTrue(numpy.allclose(abs(gensim.matutils.sparse2full(vec0, model.num_features)),
112 | abs(gensim.matutils.sparse2full(got, model.num_features))))
113 |
114 |
115 | def test_index(self):
116 | """test remote server incremental indexing"""
117 | # delete any existing model and indexes first
118 | self.server.drop_index(keep_model=False)
119 | logger.debug(self.server.status())
120 |
121 | # try indexing without a model -- raises AttributeError
122 | self.assertRaises(AttributeError, self.server.index, self.docs)
123 |
124 | # train a fresh model
125 | self.server.train(self.docs, method='lsi')
126 |
127 | # use incremental indexing -- start by indexing the first three documents
128 | self.server.buffer(self.docs[:3]) # upload the documents
129 | self.server.index() # index uploaded documents & clear upload buffer
130 | self.assertRaises(ValueError, self.server.find_similar, 'fakeid') # no such id -> raises ValueError
131 |
132 | expected = [('en__1', 0.99999994), ('en__2', 0.16279206), ('en__0', 0.09881371)]
133 |         got = self.server.find_similar(self.docs[1]['id']) # get docs similar to the second document
134 | self.check_equal(expected, got)
135 |
136 | self.server.index(self.docs[3:]) # upload & index the rest of the documents
137 | logger.debug(self.server.status())
138 | expected = [('en__1', 0.99999994), ('en__4', 0.2686055), ('en__8', 0.229533),
139 | ('en__2', 0.16279206), ('en__3', 0.143899247), ('en__0', 0.09881371),
140 | ('en__6', 0.018686194), ('en__5', 0.017070908), ('en__7', 0.01428914)]
141 |         got = self.server.find_similar(self.docs[1]['id']) # get docs similar to the second document
142 | self.check_equal(expected, got)
143 |
144 | # re-index documents. just index documents with the same id -- the old document
145 | # will be replaced by the new one, so that only the latest update counts.
146 | docs = deepcopy(self.docs)
147 | docs[2]['tokens'] = docs[1]['tokens'] # different text, same id
148 | self.server.index(docs[1:3]) # reindex the two modified docs -- total number of indexed docs doesn't change
149 | logger.debug(self.server.status())
150 | expected = [('en__2', 0.99999994), ('en__1', 0.99999994), ('en__4', 0.26860553),
151 | ('en__8', 0.229533046), ('en__3', 0.143899247), ('en__0', 0.0988137126),
152 | ('en__6', 0.01868619397), ('en__5', 0.0170709081), ('en__7', 0.0142891407)]
153 | got = self.server.find_similar(self.docs[2]['id'])
154 | self.check_equal(expected, got)
155 |
156 | # delete documents: pass it a collection of ids to be removed from the index
157 | to_delete = [doc['id'] for doc in self.docs[-3:]]
158 | self.server.delete(to_delete) # delete the last 3 documents
159 | logger.debug(self.server.status())
160 | expected = [('en__2', 0.99999994), ('en__1', 0.99999994), ('en__4', 0.26860553),
161 | ('en__3', 0.143899247), ('en__0', 0.09881371), ('en__5', 0.017070908)]
162 | got = self.server.find_similar(self.docs[2]['id'])
163 | self.check_equal(expected, got)
164 | self.assertRaises(ValueError, self.server.find_similar, to_delete[0]) # deleted document not there anymore
165 |
166 |
167 | def test_optimize(self):
168 | # to speed up queries by id, call server.optimize()
169 | # it will precompute the most similar documents, for all documents in the index,
170 | # and store them to Sqlite db for lightning-fast querying.
171 | # querying by fulltext is not affected by this optimization, though.
172 | self.server.drop_index(keep_model=False)
173 | self.server.train(self.docs, method='lsi')
174 | self.server.index(self.docs)
175 | self.server.optimize()
176 | logger.debug(self.server.status())
177 | # TODO how to test that it's faster?
178 |
179 |
180 | def test_query_id(self):
181 | # index some docs first
182 | self.server.drop_index(keep_model=False)
183 | self.server.train(self.docs, method='lsi')
184 | self.server.index(self.docs)
185 |
186 | # query index by id: return the most similar documents to an already indexed document
187 | docid = self.docs[0]['id']
188 | expected = [('en__0', 1.0), ('en__2', 0.112942614), ('en__1', 0.09881371),
189 | ('en__3', 0.087866522)]
190 | got = self.server.find_similar(docid)
191 | self.check_equal(expected, got)
192 |
193 | # same thing, but only get docs with similarity >= 0.3
194 | expected = [('en__0', 1.0)]
195 | got = self.server.find_similar(docid, min_score=0.3)
196 | self.check_equal(expected, got)
197 |
198 |         # same thing, but return at most 3 documents with similarity >= 0.1
199 | expected = [('en__0', 1.0), ('en__2', 0.112942614)]
200 | got = self.server.find_similar(docid, max_results=3, min_score=0.1)
201 | self.check_equal(expected, got)
202 |
203 |
204 | def test_query_document(self):
205 | # index some docs first
206 | self.server.drop_index(keep_model=False)
207 | self.server.train(self.docs, method='lsi')
208 | self.server.index(self.docs)
209 |
210 | # query index by document text: id is ignored
211 | doc = self.docs[0]
212 | doc['id'] = None # clear out id; not necessary, just to demonstrate it's not used in query-by-document
213 | expected = [('en__0', 1.0), ('en__2', 0.11294261), ('en__1', 0.09881371), ('en__3', 0.087866522)]
214 | got = self.server.find_similar(doc)
215 | self.check_equal(expected, got)
216 |
217 | # same thing, but only get docs with similarity >= 0.3
218 | expected = [('en__0', 1.0)]
219 | got = self.server.find_similar(doc, min_score=0.3)
220 | self.check_equal(expected, got)
221 |
222 |         # same thing, but return at most 3 documents with similarity >= 0.1
223 | expected = [('en__0', 1.0), ('en__2', 0.112942614)]
224 | got = self.server.find_similar(doc, max_results=3, min_score=0.1)
225 | self.check_equal(expected, got)
226 |
227 |
228 | def test_payload(self):
229 | """test storing/retrieving document payload"""
230 | # delete any existing model and indexes first
231 | self.server.drop_index(keep_model=False)
232 | self.server.train(self.docs, method='lsi')
233 |
234 | # create payload for three documents
235 | docs = deepcopy(self.docs)
236 | docs[0]['payload'] = 'some payload'
237 | docs[1]['payload'] = range(10)
238 | docs[2]['payload'] = 3.14
239 | id2doc = dict((doc['id'], doc) for doc in docs)
240 |
241 | # index documents & store payloads
242 | self.server.index(docs)
243 |
244 | # do a few queries, check that returned payloads match what we sent to the server
245 | for queryid in [docs[0]['id'], docs[1]['id'], docs[2]['id']]:
246 | for docid, sim, payload in self.server.find_similar(queryid):
247 | self.assertEqual(payload, id2doc[docid].get('payload', None))
248 |
249 |
250 | def test_sessions(self):
251 | """check similarity server transactions (autosession off)"""
252 | self.server.drop_index(keep_model=False)
253 | self.server.set_autosession(False) # turn off auto-commit
254 |
255 | # trying to modify index with auto-commit off and without an open session results in exception
256 | self.assertRaises(RuntimeError, self.server.train, self.docs, method='lsi')
257 | self.assertRaises(RuntimeError, self.server.index, self.docs)
258 |
259 | # open session, train model & index some documents
260 | self.server.open_session()
261 | self.server.train(self.docs, method='lsi')
262 | self.server.index(self.docs)
263 |
264 | # cannot open 2 simultaneous sessions: must commit or rollback first
265 | self.assertRaises(RuntimeError, self.server.open_session)
266 |
267 | self.server.commit() # commit ends the session
268 |
269 | # no session open; cannot modify
270 | self.assertRaises(RuntimeError, self.server.index, self.docs)
271 |
272 | # open another session (using outcome of the previously committed one)
273 | self.server.open_session()
274 | doc = self.docs[0]
275 | self.server.delete([doc['id']]) # delete one document from the session
276 | # queries hit the original index; current session modifications are ignored
277 | self.server.find_similar(doc['id']) # document still there!
278 | self.server.commit()
279 |
280 | # session committed => document is gone now, querying for its id raises exception
281 | self.assertRaises(ValueError, self.server.find_similar, doc['id'])
282 |
283 | # open another session; this one will be rolled back
284 | self.server.open_session()
285 | self.server.index([doc]) # re-add the deleted document
286 | self.assertRaises(ValueError, self.server.find_similar, doc['id']) # no commit yet -- document still gone!
287 | self.server.rollback() # ignore all changes made since open_session
288 |
289 | self.assertRaises(ValueError, self.server.find_similar, doc['id']) # addition was rolled back -- document still gone!
290 | #end SessionServerTester
291 |
292 |
293 |
294 | if __name__ == '__main__':
295 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(module)s:%(lineno)d : %(funcName)s(%(threadName)s) : %(message)s',
296 | level=logging.DEBUG)
297 | unittest.main()
298 |
--------------------------------------------------------------------------------