├── .gitignore
├── .nojekyll
├── 404.html
├── CNAME
├── Dockerfile
├── LICENSE
├── README.md
├── SUMMARY.md
├── asset
    ├── docsify-apachecn-footer.js
    ├── docsify-baidu-push.js
    ├── docsify-baidu-stat.js
    ├── docsify-clicker.js
    ├── docsify-cnzz.js
    ├── docsify-copy-code.min.js
    ├── docsify.min.js
    ├── prism-darcula.css
    ├── prism-python.min.js
    ├── search.min.js
    ├── style.css
    └── vue.css
├── blog
    ├── Install
    │   └── README.md
    ├── Introduction
    │   └── README.md
    └── tutorial
    │   ├── 1.md
    │   ├── 2.md
    │   ├── 3.md
    │   ├── 4.md
    │   ├── 5.md
    │   └── README.md
├── imgs
    ├── Introduction
    │   ├── develop.svg
    │   ├── develop_1.svg
    │   └── gensim.svg
    └── gensim.png
├── index.html
└── update.sh

/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 |
49 | # Translations
50 | *.mo
51 | *.pot
52 |
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 |
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 |
61 | # Scrapy stuff:
62 | .scrapy
63 |
64 | # Sphinx documentation
65 | docs/_build/
66 |
67 | # PyBuilder
68 | target/
69 |
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 |
73 | # pyenv
74 | .python-version
75 |
76 | # celery beat schedule file
77 | celerybeat-schedule
78 |
79 | # SageMath parsed files
80 | *.sage.py
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 |
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 |
94 | # Rope project settings
95 | .ropeproject
96 |
97 | # mkdocs documentation
98 | /site
99 |
100 | # mypy
101 | .mypy_cache/
102 | .DS_Store
103 |
104 | # gitbook
105 | _book
106 |
107 | # node.js
108 | node_modules
109 |
110 | # windows
111 | Thumbs.db
112 |
113 | # word
114 | ~$*.docx
115 | ~$*.doc
116 |
117 | # custom
118 | docs/README.md
119 |
--------------------------------------------------------------------------------
/.nojekyll:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apachecn/gensim-doc-zh/69158825f7bd8cce00288e06576e943528037b6b/.nojekyll
--------------------------------------------------------------------------------
/404.html:
--------------------------------------------------------------------------------
1 | ---
2 | permalink: /404.html
3 | ---
4 |
5 |
--------------------------------------------------------------------------------
/CNAME:
--------------------------------------------------------------------------------
1 | gensim.apachecn.org
-------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM httpd:2.4 2 | COPY ./ /usr/local/apache2/htdocs/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. 
Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 
62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 
102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 
133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. 
You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 
196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 
229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 
256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 
287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 
317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. 
If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 
386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 
421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. 
If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 
486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 
512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. 
If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 
578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 
613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | 635 | Copyright (C) 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see . 649 | 650 | Also add information on how to contact you by electronic and paper mail. 
651 | 652 | If the program does terminal interaction, make it output a short 653 | notice like this when it starts in an interactive mode: 654 | 655 | Copyright (C) 656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 657 | This is free software, and you are welcome to redistribute it 658 | under certain conditions; type `show c' for details. 659 | 660 | The hypothetical commands `show w' and `show c' should show the appropriate 661 | parts of the General Public License. Of course, your program's commands 662 | might be different; for a GUI interface, you would use an "about box". 663 | 664 | You should also get your employer (if you work as a programmer) or school, 665 | if any, to sign a "copyright disclaimer" for the program, if necessary. 666 | For more information on this, and how to apply and follow the GNU GPL, see 667 | . 668 | 669 | The GNU General Public License does not permit incorporating your program 670 | into proprietary programs. If your program is a subroutine library, you 671 | may consider it more useful to permit linking proprietary applications with 672 | the library. If this is what you want to do, use the GNU Lesser General 673 | Public License instead of this License. But first, please read 674 | . 
675 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Gensim Chinese Documentation 2 | 3 | ![](imgs/gensim.png) 4 | 5 | > Original: [Gensim documentation](https://radimrehurek.com/gensim/index.html) 6 | > 7 | > License: [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/) 8 | > 9 | > Programs must be written for people to read, and only incidentally for machines to execute. (Harold Abelson) 10 | 11 | * [Read online](https://gensim.apachecn.org) 12 | * [Read online (Gitee)](https://apachecn.gitee.io/gensim-doc-zh/) 13 | * [ApacheCN machine learning discussion group 629470233](http://shang.qq.com/wpa/qunwpa?idkey=30e5f1123a79867570f665aa3a483ca404b1c3f77737bc01ec520ed5f078ddef) 14 | * [AILearning: Machine Learning in Action](https://github.com/apachecn/AILearning) 15 | * [ApacheCN organization resources](https://github.com/apachecn/home) 16 | 17 | ## Contact 18 | 19 | > Maintainer 20 | 21 | * [@片刻](https://github.com/jiangzhonglian): 529815144 22 | 23 | > How to join 24 | 25 | * QQ: 529815144 (片刻) 1042658081 (那伊抹微笑) 190442212 (瑶妹) 26 | * ApacheCN [About us](http://cwiki.apachecn.org/pages/viewpage.action?pageId=2887240) && [Join us](http://cwiki.apachecn.org/pages/viewpage.action?pageId=2887239) 27 | 28 | ## Disclaimer - [For study and reference only] 29 | 30 | * ApacheCN translated this book purely for learning purposes and out of personal interest 31 | * ApacheCN reserves the right of attribution and other related rights to this version of the translation 32 | 33 | ## Download 34 | 35 | ### Docker 36 | 37 | ``` 38 | docker pull apachecn0/gensim-doc-zh 39 | docker run -tid -p {port}:80 apachecn0/gensim-doc-zh 40 | # Visit http://localhost:{port} to view the documentation 41 | ``` 42 | 43 | ### PYPI 44 | 45 | ``` 46 | pip install gensim-doc-zh 47 | gensim-doc-zh 48 | # Visit http://localhost:{port} to view the documentation 49 | ``` 50 | 51 | ### NPM 52 | 53 | ``` 54 | npm install -g gensim-doc-zh 55 | gensim-doc-zh 56 | # Visit http://localhost:{port} to view the documentation 57 | ``` 58 | 59 | ## Sponsor us 60 | 61 | ![](http://www.apachecn.org/img/about/donate.jpg) 62 | -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | * [Introduction](blog/Introduction/README.md) 2
| * [Installation](blog/Install/README.md) 3 | * [Tutorials](blog/tutorial/README.md) 4 | * [Corpora and Vector Spaces](blog/tutorial/1.md) 5 | * [Topics and Transformations](blog/tutorial/2.md) 6 | * [Similarity Queries](blog/tutorial/3.md) 7 | * [Experiments on the English Wikipedia](blog/tutorial/4.md) 8 | * [Distributed Computing](blog/tutorial/5.md) 9 | -------------------------------------------------------------------------------- /asset/docsify-apachecn-footer.js: -------------------------------------------------------------------------------- 1 | (function(){ 2 | var footer = [ 3 | '
', 4 | '
', 5 | '

我们一直在努力

', 6 | '

apachecn/gensim-doc-zh

', 7 | '

', 8 | ' ', 9 | ' ', 10 | ' ML | ApacheCN

', 11 | '

', 12 | '
', 13 | ' ', 17 | '
', 18 | '
' 19 | ].join('\n') 20 | var plugin = function(hook) { 21 | hook.afterEach(function(html) { 22 | return html + footer 23 | }) 24 | hook.doneEach(function() { 25 | (adsbygoogle = window.adsbygoogle || []).push({}) 26 | }) 27 | } 28 | var plugins = window.$docsify.plugins || [] 29 | plugins.push(plugin) 30 | window.$docsify.plugins = plugins 31 | })() -------------------------------------------------------------------------------- /asset/docsify-baidu-push.js: -------------------------------------------------------------------------------- 1 | (function(){ 2 | var plugin = function(hook) { 3 | hook.doneEach(function() { 4 | new Image().src = 5 | '//api.share.baidu.com/s.gif?r=' + 6 | encodeURIComponent(document.referrer) + 7 | "&l=" + encodeURIComponent(location.href) 8 | }) 9 | } 10 | var plugins = window.$docsify.plugins || [] 11 | plugins.push(plugin) 12 | window.$docsify.plugins = plugins 13 | })() -------------------------------------------------------------------------------- /asset/docsify-baidu-stat.js: -------------------------------------------------------------------------------- 1 | (function(){ 2 | var plugin = function(hook) { 3 | hook.doneEach(function() { 4 | window._hmt = window._hmt || [] 5 | var hm = document.createElement("script") 6 | hm.src = "https://hm.baidu.com/hm.js?" 
+ window.$docsify.bdStatId 7 | document.querySelector("article").appendChild(hm) 8 | }) 9 | } 10 | var plugins = window.$docsify.plugins || [] 11 | plugins.push(plugin) 12 | window.$docsify.plugins = plugins 13 | })() -------------------------------------------------------------------------------- /asset/docsify-clicker.js: -------------------------------------------------------------------------------- 1 | (function() { 2 | var ids = [ 3 | '109577065', '108852955', '102682374', '100520874', '92400861', '90312982', 4 | '109963325', '109323014', '109301511', '108898970', '108590722', '108538676', 5 | '108503526', '108437109', '108402202', '108292691', '108291153', '108268498', 6 | '108030854', '107867070', '107847299', '107827334', '107825454', '107802131', 7 | '107775320', '107752974', '107735139', '107702571', '107598864', '107584507', 8 | '107568311', '107526159', '107452391', '107437455', '107430050', '107395781', 9 | '107325304', '107283210', '107107145', '107085440', '106995421', '106993460', 10 | '106972215', '106959775', '106766787', '106749609', '106745967', '106634313', 11 | '106451602', '106180097', '106095505', '106077010', '106008089', '106002346', 12 | '105653809', '105647855', '105130705', '104837872', '104706815', '104192620', 13 | '104074941', '104040537', '103962171', '103793502', '103783460', '103774572', 14 | '103547748', '103547703', '103547571', '103490757', '103413481', '103341935', 15 | '103330191', '103246597', '103235808', '103204403', '103075981', '103015105', 16 | '103014899', '103014785', '103014702', '103014540', '102993780', '102993754', 17 | '102993680', '102958443', '102913317', '102903382', '102874766', '102870470', 18 | '102864513', '102811179', '102761237', '102711565', '102645443', '102621845', 19 | '102596167', '102593333', '102585262', '102558427', '102537547', '102530610', 20 | '102527017', '102504698', '102489806', '102372981', '102258897', '102257303', 21 | '102056248', '101920097', '101648638', '101516708', '101350577', 
'101268149', 22 | '101128167', '101107328', '101053939', '101038866', '100977414', '100945061', 23 | '100932401', '100886407', '100797378', '100634918', '100588305', '100572447', 24 | '100192249', '100153559', '100099032', '100061455', '100035392', '100033450', 25 | '99671267', '99624846', '99172551', '98992150', '98989508', '98987516', '98938304', 26 | '98937682', '98725145', '98521688', '98450861', '98306787', '98203342', '98026348', 27 | '97680167', '97492426', '97108940', '96888872', '96568559', '96509100', '96508938', 28 | '96508611', '96508374', '96498314', '96476494', '96333593', '96101522', '95989273', 29 | '95960507', '95771870', '95770611', '95766810', '95727700', '95588929', '95218707', 30 | '95073151', '95054615', '95016540', '94868371', '94839549', '94719281', '94401578', 31 | '93931439', '93853494', '93198026', '92397889', '92063437', '91635930', '91433989', 32 | '91128193', '90915507', '90752423', '90738421', '90725712', '90725083', '90722238', 33 | '90647220', '90604415', '90544478', '90379769', '90288341', '90183695', '90144066', 34 | '90108283', '90021771', '89914471', '89876284', '89852050', '89839033', '89812373', 35 | '89789699', '89786189', '89752620', '89636380', '89632889', '89525811', '89480625', 36 | '89464088', '89464025', '89463984', '89463925', '89445280', '89441793', '89430432', 37 | '89429877', '89416176', '89412750', '89409618', '89409485', '89409365', '89409292', 38 | '89409222', '89399738', '89399674', '89399526', '89355336', '89330241', '89308077', 39 | '89222240', '89140953', '89139942', '89134398', '89069355', '89049266', '89035735', 40 | '89004259', '88925790', '88925049', '88915838', '88912706', '88911548', '88899438', 41 | '88878890', '88837519', '88832555', '88824257', '88777952', '88752158', '88659061', 42 | '88615256', '88551434', '88375675', '88322134', '88322085', '88321996', '88321978', 43 | '88321950', '88321931', '88321919', '88321899', '88321830', '88321756', '88321710', 44 | '88321661', '88321632', '88321566', 
'88321550', '88321506', '88321475', '88321440', 45 | '88321409', '88321362', '88321321', '88321293', '88321226', '88232699', '88094874', 46 | '88090899', '88090784', '88089091', '88048808', '87938224', '87913318', '87905933', 47 | '87897358', '87856753', '87856461', '87827666', '87822008', '87821456', '87739137', 48 | '87734022', '87643633', '87624617', '87602909', '87548744', '87548689', '87548624', 49 | '87548550', '87548461', '87463201', '87385913', '87344048', '87078109', '87074784', 50 | '87004367', '86997632', '86997466', '86997303', '86997116', '86996474', '86995899', 51 | '86892769', '86892654', '86892569', '86892457', '86892347', '86892239', '86892124', 52 | '86798671', '86777307', '86762845', '86760008', '86759962', '86759944', '86759930', 53 | '86759922', '86759646', '86759638', '86759633', '86759622', '86759611', '86759602', 54 | '86759596', '86759591', '86759580', '86759572', '86759567', '86759558', '86759545', 55 | '86759534', '86749811', '86741502', '86741074', '86741059', '86741020', '86740897', 56 | '86694754', '86670104', '86651882', '86651875', '86651866', '86651828', '86651790', 57 | '86651767', '86651756', '86651735', '86651720', '86651708', '86618534', '86618526', 58 | '86594785', '86590937', '86550497', '86550481', '86550472', '86550453', '86550438', 59 | '86550429', '86550407', '86550381', '86550359', '86536071', '86536035', '86536014', 60 | '86535988', '86535963', '86535953', '86535932', '86535902', '86472491', '86472298', 61 | '86472236', '86472191', '86472108', '86471967', '86471899', '86471822', '86439022', 62 | '86438972', '86438902', '86438887', '86438867', '86438836', '86438818', '85850119', 63 | '85850075', '85850021', '85849945', '85849893', '85849837', '85849790', '85849740', 64 | '85849661', '85849620', '85849550', '85606096', '85564441', '85547709', '85471981', 65 | '85471317', '85471136', '85471073', '85470629', '85470456', '85470169', '85469996', 66 | '85469877', '85469775', '85469651', '85469331', '85469033', '85345768', 
'85345742', 67 | '85337900', '85337879', '85337860', '85337833', '85337797', '85322822', '85322810', 68 | '85322791', '85322745', '85317667', '85265742', '85265696', '85265618', '85265350', 69 | '85098457', '85057670', '85009890', '84755581', '84637437', '84637431', '84637393', 70 | '84637374', '84637355', '84637338', '84637321', '84637305', '84637283', '84637259', 71 | '84629399', '84629314', '84629233', '84629124', '84629065', '84628997', '84628933', 72 | '84628838', '84628777', '84628690', '84591581', '84591553', '84591511', '84591484', 73 | '84591468', '84591416', '84591386', '84591350', '84591308', '84572155', '84572107', 74 | '84503228', '84500221', '84403516', '84403496', '84403473', '84403442', '84075703', 75 | '84029659', '83933480', '83933459', '83933435', '83903298', '83903274', '83903258', 76 | '83752369', '83345186', '83116487', '83116446', '83116402', '83116334', '83116213', 77 | '82944248', '82941023', '82938777', '82936611', '82932735', '82918102', '82911085', 78 | '82888399', '82884263', '82883507', '82880996', '82875334', '82864060', '82831039', 79 | '82823385', '82795277', '82790832', '82775718', '82752022', '82730437', '82718126', 80 | '82661646', '82588279', '82588267', '82588261', '82588192', '82347066', '82056138', 81 | '81978722', '81211571', '81104145', '81069048', '81006768', '80788365', '80767582', 82 | '80759172', '80759144', '80759129', '80736927', '80661288', '80616304', '80602366', 83 | '80584625', '80561364', '80549878', '80549875', '80541470', '80539726', '80531328', 84 | '80513257', '80469816', '80406810', '80356781', '80334130', '80333252', '80332666', 85 | '80332389', '80311244', '80301070', '80295974', '80292252', '80286963', '80279504', 86 | '80278369', '80274371', '80249825', '80247284', '80223054', '80219559', '80209778', 87 | '80200279', '80164236', '80160900', '80153046', '80149560', '80144670', '80061205', 88 | '80046520', '80025644', '80014721', '80005213', '80004664', '80001653', '79990178', 89 | '79989283', '79947873', 
'79946002', '79941517', '79938786', '79932755', '79921178', 90 | '79911339', '79897603', '79883931', '79872574', '79846509', '79832150', '79828161', 91 | '79828156', '79828149', '79828146', '79828140', '79828139', '79828135', '79828123', 92 | '79820772', '79776809', '79776801', '79776788', '79776782', '79776772', '79776767', 93 | '79776760', '79776753', '79776736', '79776705', '79676183', '79676171', '79676166', 94 | '79676160', '79658242', '79658137', '79658130', '79658123', '79658119', '79658112', 95 | '79658100', '79658092', '79658089', '79658069', '79658054', '79633508', '79587857', 96 | '79587850', '79587842', '79587831', '79587825', '79587819', '79547908', '79477700', 97 | '79477692', '79440956', '79431176', '79428647', '79416896', '79406699', '79350633', 98 | '79350545', '79344765', '79339391', '79339383', '79339157', '79307345', '79293944', 99 | '79292623', '79274443', '79242798', '79184420', '79184386', '79184355', '79184269', 100 | '79183979', '79100314', '79100206', '79100064', '79090813', '79057834', '78967246', 101 | '78941571', '78927340', '78911467', '78909741', '78848006', '78628917', '78628908', 102 | '78628889', '78571306', '78571273', '78571253', '78508837', '78508791', '78448073', 103 | '78430940', '78408150', '78369548', '78323851', '78314301', '78307417', '78300457', 104 | '78287108', '78278945', '78259349', '78237192', '78231360', '78141031', '78100357', 105 | '78095793', '78084949', '78073873', '78073833', '78067868', '78067811', '78055014', 106 | '78041555', '78039240', '77948804', '77879624', '77837792', '77824937', '77816459', 107 | '77816208', '77801801', '77801767', '77776636', '77776610', '77505676', '77485156', 108 | '77478296', '77460928', '77327521', '77326428', '77278423', '77258908', '77252370', 109 | '77248841', '77239042', '77233843', '77230880', '77200256', '77198140', '77196405', 110 | '77193456', '77186557', '77185568', '77181823', '77170422', '77164604', '77163389', 111 | '77160103', '77159392', '77150721', '77146204', 
'77141824', '77129604', '77123259', 112 | '77113014', '77103247', '77101924', '77100165', '77098190', '77094986', '77088637', 113 | '77073399', '77062405', '77044198', '77036923', '77017092', '77007016', '76999924', 114 | '76977678', '76944015', '76923087', '76912696', '76890184', '76862282', '76852434', 115 | '76829683', '76794256', '76780755', '76762181', '76732277', '76718569', '76696048', 116 | '76691568', '76689003', '76674746', '76651230', '76640301', '76615315', '76598528', 117 | '76571947', '76551820', '74178127', '74157245', '74090991', '74012309', '74001789', 118 | '73910511', '73613471', '73605647', '73605082', '73503704', '73380636', '73277303', 119 | '73274683', '73252108', '73252085', '73252070', '73252039', '73252025', '73251974', 120 | '73135779', '73087531', '73044025', '73008658', '72998118', '72997953', '72847091', 121 | '72833384', '72830909', '72828999', '72823633', '72793092', '72757626', '71157154', 122 | '71131579', '71128551', '71122253', '71082760', '71078326', '71075369', '71057216', 123 | '70812997', '70384625', '70347260', '70328937', '70313267', '70312950', '70255825', 124 | '70238893', '70237566', '70237072', '70230665', '70228737', '70228729', '70175557', 125 | '70175401', '70173259', '70172591', '70170835', '70140724', '70139606', '70053923', 126 | '69067886', '69063732', '69055974', '69055708', '69031254', '68960022', '68957926', 127 | '68957556', '68953383', '68952755', '68946828', '68483371', '68120861', '68065606', 128 | '68064545', '68064493', '67646436', '67637525', '67632961', '66984317', '66968934', 129 | '66968328', '66491589', '66475786', '66473308', '65946462', '65635220', '65632553', 130 | '65443309', '65437683', '63260222', '63253665', '63253636', '63253628', '63253610', 131 | '63253572', '63252767', '63252672', '63252636', '63252537', '63252440', '63252329', 132 | '63252155', '62888876', '62238064', '62039365', '62038016', '61925813', '60957024', 133 | '60146286', '59523598', '59489460', '59480461', '59160354', 
'59109234', '59089006', 134 | '58595549', '57406062', '56678797', '55001342', '55001340', '55001336', '55001330', 135 | '55001328', '55001325', '55001311', '55001305', '55001298', '55001290', '55001283', 136 | '55001278', '55001272', '55001265', '55001262', '55001253', '55001246', '55001242', 137 | '55001236', '54907997', '54798827', '54782693', '54782689', '54782688', '54782676', 138 | '54782673', '54782671', '54782662', '54782649', '54782636', '54782630', '54782628', 139 | '54782627', '54782624', '54782621', '54782620', '54782615', '54782613', '54782608', 140 | '54782604', '54782600', '54767237', '54766779', '54755814', '54755674', '54730253', 141 | '54709338', '54667667', '54667657', '54667639', '54646201', '54407212', '54236114', 142 | '54234220', '54233181', '54232788', '54232407', '54177960', '53991319', '53932970', 143 | '53888106', '53887128', '53885944', '53885094', '53884497', '53819985', '53812640', 144 | '53811866', '53790628', '53785053', '53782838', '53768406', '53763191', '53763163', 145 | '53763148', '53763104', '53763092', '53576302', '53576157', '53573472', '53560183', 146 | '53523648', '53516634', '53514474', '53510917', '53502297', '53492224', '53467240', 147 | '53467122', '53437115', '53436579', '53435710', '53415115', '53377875', '53365337', 148 | '53350165', '53337979', '53332925', '53321283', '53318758', '53307049', '53301773', 149 | '53289364', '53286367', '53259948', '53242892', '53239518', '53230890', '53218625', 150 | '53184121', '53148662', '53129280', '53116507', '53116486', '52980893', '52980652', 151 | '52971002', '52950276', '52950259', '52944714', '52934397', '52932994', '52924939', 152 | '52887083', '52877145', '52858258', '52858046', '52840214', '52829673', '52818774', 153 | '52814054', '52805448', '52798019', '52794801', '52786111', '52774750', '52748816', 154 | '52745187', '52739313', '52738109', '52734410', '52734406', '52734401', '52515005', 155 | '52056818', '52039757', '52034057', '50899381', '50738883', '50726018', 
'50695984', 156 | '50695978', '50695961', '50695931', '50695913', '50695902', '50695898', '50695896', 157 | '50695885', '50695852', '50695843', '50695829', '50643222', '50591997', '50561827', 158 | '50550829', '50541472', '50527581', '50527317', '50527206', '50527094', '50526976', 159 | '50525931', '50525764', '50518363', '50498312', '50493019', '50492927', '50492881', 160 | '50492863', '50492772', '50492741', '50492688', '50492454', '50491686', '50491675', 161 | '50491602', '50491550', '50491467', '50488409', '50485177', '48683433', '48679853', 162 | '48678381', '48626023', '48623059', '48603183', '48599041', '48595555', '48576507', 163 | '48574581', '48574425', '48547849', '48542371', '48518705', '48494395', '48493321', 164 | '48491545', '48471207', '48471161', '48471085', '48468239', '48416035', '48415577', 165 | '48415515', '48297597', '48225865', '48224037', '48223553', '48213383', '48211439', 166 | '48206757', '48195685', '48193981', '48154955', '48128811', '48105995', '48105727', 167 | '48105441', '48105085', '48101717', '48101691', '48101637', '48101569', '48101543', 168 | '48085839', '48085821', '48085797', '48085785', '48085775', '48085765', '48085749', 169 | '48085717', '48085687', '48085377', '48085189', '48085119', '48085043', '48084991', 170 | '48084747', '48084139', '48084075', '48055511', '48055403', '48054259', '48053917', 171 | '47378253', '47359989', '47344793', '47344083', '47336927', '47335827', '47316383', 172 | '47315813', '47312213', '47295745', '47294471', '47259467', '47256015', '47255529', 173 | '47253649', '47207791', '47206309', '47189383', '47172333', '47170495', '47166223', '47149681', '47146967', '47126915', '47126883', '47108297', '47091823', '47084039', 174 | '47080883', '47058549', '47056435', '47054703', '47041395', '47035325', '47035143', 175 | '47027547', '47016851', '47006665', '46854213', '46128743', '45035163', '43053503', 176 | '41968283', '41958265', '40707993', '40706971', '40685165', '40684953', '40684575', 177 | 
'40683867', '40683021', '39853417', '39806033', '39757139', '38391523', '37595169', 178 | '37584503', '35696501', '29593529', '28100441', '27330071', '26950993', '26011757', 179 | '26010983', '26010603', '26004793', '26003621', '26003575', '26003405', '26003373', 180 | '26003307', '26003225', '26003189', '26002929', '26002863', '26002749', '26001477', 181 | '25641541', '25414671', '25410705', '24973063', '20648491', '20621099', '17802317', 182 | '17171597', '17141619', '17141381', '17139321', '17121903', '16898605', '16886449', 183 | '14523439', '14104635', '14054225', '9317965' 184 | ] 185 | var urlb64 = 'aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dpemFyZGZvcmNlbC9hcnRpY2xlL2RldGFpbHMv' 186 | var plugin = function(hook) { 187 | hook.doneEach(function() { 188 | for (var i = 0; i < 5; i++) { 189 | var idx = Math.trunc(Math.random() * ids.length) 190 | new Image().src = atob(urlb64) + ids[idx] 191 | } 192 | }) 193 | } 194 | var plugins = window.$docsify.plugins || [] 195 | plugins.push(plugin) 196 | window.$docsify.plugins = plugins 197 | })() -------------------------------------------------------------------------------- /asset/docsify-cnzz.js: -------------------------------------------------------------------------------- 1 | (function(){ 2 | var plugin = function(hook) { 3 | hook.doneEach(function() { 4 | var sc = document.createElement('script') 5 | sc.src = 'https://s5.cnzz.com/z_stat.php?id=' + 6 | window.$docsify.cnzzId + '&online=1&show=line' 7 | document.querySelector('article').appendChild(sc) 8 | }) 9 | } 10 | var plugins = window.$docsify.plugins || [] 11 | plugins.push(plugin) 12 | window.$docsify.plugins = plugins 13 | })() -------------------------------------------------------------------------------- /asset/docsify-copy-code.min.js: -------------------------------------------------------------------------------- 1 | /*! 
2 | * docsify-copy-code 3 | * v2.1.0 4 | * https://github.com/jperasmus/docsify-copy-code 5 | * (c) 2017-2019 JP Erasmus 6 | * MIT license 7 | */ 8 | !function(){"use strict";function r(o){return(r="function"==typeof Symbol&&"symbol"==typeof Symbol.iterator?function(o){return typeof o}:function(o){return o&&"function"==typeof Symbol&&o.constructor===Symbol&&o!==Symbol.prototype?"symbol":typeof o})(o)}!function(o,e){void 0===e&&(e={});var t=e.insertAt;if(o&&"undefined"!=typeof document){var n=document.head||document.getElementsByTagName("head")[0],c=document.createElement("style");c.type="text/css","top"===t&&n.firstChild?n.insertBefore(c,n.firstChild):n.appendChild(c),c.styleSheet?c.styleSheet.cssText=o:c.appendChild(document.createTextNode(o))}}(".docsify-copy-code-button,.docsify-copy-code-button span{cursor:pointer;transition:all .25s ease}.docsify-copy-code-button{position:absolute;z-index:1;top:0;right:0;overflow:visible;padding:.65em .8em;border:0;border-radius:0;outline:0;font-size:1em;background:grey;background:var(--theme-color,grey);color:#fff;opacity:0}.docsify-copy-code-button span{border-radius:3px;background:inherit;pointer-events:none}.docsify-copy-code-button .error,.docsify-copy-code-button .success{position:absolute;z-index:-100;top:50%;left:0;padding:.5em .65em;font-size:.825em;opacity:0;-webkit-transform:translateY(-50%);transform:translateY(-50%)}.docsify-copy-code-button.error .error,.docsify-copy-code-button.success .success{opacity:1;-webkit-transform:translate(-115%,-50%);transform:translate(-115%,-50%)}.docsify-copy-code-button:focus,pre:hover .docsify-copy-code-button{opacity:1}"),document.querySelector('link[href*="docsify-copy-code"]')&&console.warn("[Deprecation] Link to external docsify-copy-code stylesheet is no longer necessary."),window.DocsifyCopyCodePlugin={init:function(){return function(o,e){o.ready(function(){console.warn("[Deprecation] Manually initializing docsify-copy-code using window.DocsifyCopyCodePlugin.init() is no 
longer necessary.")})}}},window.$docsify=window.$docsify||{},window.$docsify.plugins=[function(o,s){o.doneEach(function(){var o=Array.apply(null,document.querySelectorAll("pre[data-lang]")),c={buttonText:"Copy to clipboard",errorText:"Error",successText:"Copied"};s.config.copyCode&&Object.keys(c).forEach(function(t){var n=s.config.copyCode[t];"string"==typeof n?c[t]=n:"object"===r(n)&&Object.keys(n).some(function(o){var e=-1',''.concat(c.buttonText,""),''.concat(c.errorText,""),''.concat(c.successText,""),""].join("");o.forEach(function(o){o.insertAdjacentHTML("beforeend",e)})}),o.mounted(function(){document.querySelector(".content").addEventListener("click",function(o){if(o.target.classList.contains("docsify-copy-code-button")){var e="BUTTON"===o.target.tagName?o.target:o.target.parentNode,t=document.createRange(),n=e.parentNode.querySelector("code"),c=window.getSelection();t.selectNode(n),c.removeAllRanges(),c.addRange(t);try{document.execCommand("copy")&&(e.classList.add("success"),setTimeout(function(){e.classList.remove("success")},1e3))}catch(o){console.error("docsify-copy-code: ".concat(o)),e.classList.add("error"),setTimeout(function(){e.classList.remove("error")},1e3)}"function"==typeof(c=window.getSelection()).removeRange?c.removeRange(t):"function"==typeof c.removeAllRanges&&c.removeAllRanges()}})})}].concat(window.$docsify.plugins||[])}(); 9 | //# sourceMappingURL=docsify-copy-code.min.js.map 10 | -------------------------------------------------------------------------------- /asset/prism-darcula.css: -------------------------------------------------------------------------------- 1 | /** 2 | * Darcula theme 3 | * 4 | * Adapted from a theme based on: 5 | * IntelliJ Darcula Theme (https://github.com/bulenkov/Darcula) 6 | * 7 | * @author Alexandre Paradis 8 | * @version 1.0 9 | */ 10 | 11 | code[class*="lang-"], 12 | pre[data-lang] { 13 | color: #a9b7c6 !important; 14 | background-color: #2b2b2b !important; 15 | font-family: Consolas, Monaco, 'Andale 
Mono', monospace; 16 | direction: ltr; 17 | text-align: left; 18 | white-space: pre; 19 | word-spacing: normal; 20 | word-break: normal; 21 | line-height: 1.5; 22 | 23 | -moz-tab-size: 4; 24 | -o-tab-size: 4; 25 | tab-size: 4; 26 | 27 | -webkit-hyphens: none; 28 | -moz-hyphens: none; 29 | -ms-hyphens: none; 30 | hyphens: none; 31 | } 32 | 33 | pre[data-lang]::-moz-selection, pre[data-lang] ::-moz-selection, 34 | code[class*="lang-"]::-moz-selection, code[class*="lang-"] ::-moz-selection { 35 | color: inherit; 36 | background: rgba(33, 66, 131, .85); 37 | } 38 | 39 | pre[data-lang]::selection, pre[data-lang] ::selection, 40 | code[class*="lang-"]::selection, code[class*="lang-"] ::selection { 41 | color: inherit; 42 | background: rgba(33, 66, 131, .85); 43 | } 44 | 45 | /* Code blocks */ 46 | pre[data-lang] { 47 | padding: 1em; 48 | margin: .5em 0; 49 | overflow: auto; 50 | } 51 | 52 | :not(pre) > code[class*="lang-"], 53 | pre[data-lang] { 54 | background: #2b2b2b; 55 | } 56 | 57 | /* Inline code */ 58 | :not(pre) > code[class*="lang-"] { 59 | padding: .1em; 60 | border-radius: .3em; 61 | } 62 | 63 | .token.comment, 64 | .token.prolog, 65 | .token.cdata { 66 | color: #808080; 67 | } 68 | 69 | .token.delimiter, 70 | .token.boolean, 71 | .token.keyword, 72 | .token.selector, 73 | .token.important, 74 | .token.atrule { 75 | color: #cc7832; 76 | } 77 | 78 | .token.operator, 79 | .token.punctuation, 80 | .token.attr-name { 81 | color: #a9b7c6; 82 | } 83 | 84 | .token.tag, 85 | .token.tag .punctuation, 86 | .token.doctype, 87 | .token.builtin { 88 | color: #e8bf6a; 89 | } 90 | 91 | .token.entity, 92 | .token.number, 93 | .token.symbol { 94 | color: #6897bb; 95 | } 96 | 97 | .token.property, 98 | .token.constant, 99 | .token.variable { 100 | color: #9876aa; 101 | } 102 | 103 | .token.string, 104 | .token.char { 105 | color: #6a8759; 106 | } 107 | 108 | .token.attr-value, 109 | .token.attr-value .punctuation { 110 | color: #a5c261; 111 | } 112 | 113 | .token.attr-value 
.punctuation:first-child { 114 | color: #a9b7c6; 115 | } 116 | 117 | .token.url { 118 | color: #287bde; 119 | text-decoration: underline; 120 | } 121 | 122 | .token.function { 123 | color: #ffc66d; 124 | } 125 | 126 | .token.regex { 127 | background: #364135; 128 | } 129 | 130 | .token.bold { 131 | font-weight: bold; 132 | } 133 | 134 | .token.italic { 135 | font-style: italic; 136 | } 137 | 138 | .token.inserted { 139 | background: #294436; 140 | } 141 | 142 | .token.deleted { 143 | background: #484a4a; 144 | } 145 | 146 | code.lang-css .token.property, 147 | code.lang-css .token.property + .token.punctuation { 148 | color: #a9b7c6; 149 | } 150 | 151 | code.lang-css .token.id { 152 | color: #ffc66d; 153 | } 154 | 155 | code.lang-css .token.selector > .token.class, 156 | code.lang-css .token.selector > .token.attribute, 157 | code.lang-css .token.selector > .token.pseudo-class, 158 | code.lang-css .token.selector > .token.pseudo-element { 159 | color: #ffc66d; 160 | } -------------------------------------------------------------------------------- /asset/prism-python.min.js: -------------------------------------------------------------------------------- 1 | Prism.languages.python={comment:{pattern:/(^|[^\\])#.*/,lookbehind:!0},"string-interpolation":{pattern:/(?:f|rf|fr)(?:("""|''')[\s\S]*?\1|("|')(?:\\.|(?!\2)[^\\\r\n])*\2)/i,greedy:!0,inside:{interpolation:{pattern:/((?:^|[^{])(?:{{)*){(?!{)(?:[^{}]|{(?!{)(?:[^{}]|{(?!{)(?:[^{}])+})+})+}/,lookbehind:!0,inside:{"format-spec":{pattern:/(:)[^:(){}]+(?=}$)/,lookbehind:!0},"conversion-option":{pattern:/![sra](?=[:}]$)/,alias:"punctuation"},rest:null}},string:/[\s\S]+/}},"triple-quoted-string":{pattern:/(?:[rub]|rb|br)?("""|''')[\s\S]*?\1/i,greedy:!0,alias:"string"},string:{pattern:/(?:[rub]|rb|br)?("|')(?:\\.|(?!\1)[^\\\r\n])*\1/i,greedy:!0},function:{pattern:/((?:^|\s)def[ 
\t]+)[a-zA-Z_]\w*(?=\s*\()/g,lookbehind:!0},"class-name":{pattern:/(\bclass\s+)\w+/i,lookbehind:!0},decorator:{pattern:/(^\s*)@\w+(?:\.\w+)*/im,lookbehind:!0,alias:["annotation","punctuation"],inside:{punctuation:/\./}},keyword:/\b(?:and|as|assert|async|await|break|class|continue|def|del|elif|else|except|exec|finally|for|from|global|if|import|in|is|lambda|nonlocal|not|or|pass|print|raise|return|try|while|with|yield)\b/,builtin:/\b(?:__import__|abs|all|any|apply|ascii|basestring|bin|bool|buffer|bytearray|bytes|callable|chr|classmethod|cmp|coerce|compile|complex|delattr|dict|dir|divmod|enumerate|eval|execfile|file|filter|float|format|frozenset|getattr|globals|hasattr|hash|help|hex|id|input|int|intern|isinstance|issubclass|iter|len|list|locals|long|map|max|memoryview|min|next|object|oct|open|ord|pow|property|range|raw_input|reduce|reload|repr|reversed|round|set|setattr|slice|sorted|staticmethod|str|sum|super|tuple|type|unichr|unicode|vars|xrange|zip)\b/,boolean:/\b(?:True|False|None)\b/,number:/(?:\b(?=\d)|\B(?=\.))(?:0[bo])?(?:(?:\d|0x[\da-f])[\da-f]*\.?\d*|\.\d+)(?:e[+-]?\d+)?j?\b/i,operator:/[-+%=]=?|!=|\*\*?=?|\/\/?=?|<[<=>]?|>[=>]?|[&|^~]/,punctuation:/[{}[\];(),.:]/},Prism.languages.python["string-interpolation"].inside.interpolation.inside.rest=Prism.languages.python,Prism.languages.py=Prism.languages.python; -------------------------------------------------------------------------------- /asset/search.min.js: -------------------------------------------------------------------------------- 1 | !function(){"use strict";function e(e){var n={"&":"&","<":"<",">":">",'"':""","'":"'","/":"/"};return String(e).replace(/[&<>"'\/]/g,function(e){return n[e]})}function n(e){var n=[];return h.dom.findAll("a:not([data-nosearch])").map(function(t){var o=t.href,i=t.getAttribute("href"),r=e.parse(o).path;r&&-1===n.indexOf(r)&&!Docsify.util.isAbsolutePath(i)&&n.push(r)}),n}function 
t(e){localStorage.setItem("docsify.search.expires",Date.now()+e),localStorage.setItem("docsify.search.index",JSON.stringify(g))}function o(e,n,t,o){void 0===n&&(n="");var i,r=window.marked.lexer(n),a=window.Docsify.slugify,s={};return r.forEach(function(n){if("heading"===n.type&&n.depth<=o)i=t.toURL(e,{id:a(n.text)}),s[i]={slug:i,title:n.text,body:""};else{if(!i)return;s[i]?s[i].body?s[i].body+="\n"+(n.text||""):s[i].body=n.text:s[i]={slug:i,title:"",body:""}}}),a.clear(),s}function i(n){var t=[],o=[];Object.keys(g).forEach(function(e){o=o.concat(Object.keys(g[e]).map(function(n){return g[e][n]}))}),n=n.trim();var i=n.split(/[\s\-\,\\\/]+/);1!==i.length&&(i=[].concat(n,i));for(var r=0;rl.length&&(d=l.length);var p="..."+e(l).substring(f,d).replace(o,''+n+"")+"...";s+=p}}),a)){var d={title:e(c),content:s,url:f};t.push(d)}}(r);return t}function r(e,i){h=Docsify;var r="auto"===e.paths,a=localStorage.getItem("docsify.search.expires")
',o=Docsify.dom.create("div",t),i=Docsify.dom.find("aside");Docsify.dom.toggleClass(o,"search"),Docsify.dom.before(i,o)}function c(e){var n=Docsify.dom.find("div.search"),t=Docsify.dom.find(n,".results-panel");if(!e)return t.classList.remove("show"),void(t.innerHTML="");var o=i(e),r="";o.forEach(function(e){r+='
\n \n

'+e.title+"

\n

"+e.content+"

\n
\n
"}),t.classList.add("show"),t.innerHTML=r||'

'+y+"

"}function l(){var e,n=Docsify.dom.find("div.search"),t=Docsify.dom.find(n,"input");Docsify.dom.on(n,"click",function(e){return"A"!==e.target.tagName&&e.stopPropagation()}),Docsify.dom.on(t,"input",function(n){clearTimeout(e),e=setTimeout(function(e){return c(n.target.value.trim())},100)})}function f(e,n){var t=Docsify.dom.getNode('.search input[type="search"]');if(t)if("string"==typeof e)t.placeholder=e;else{var o=Object.keys(e).filter(function(e){return n.indexOf(e)>-1})[0];t.placeholder=e[o]}}function d(e,n){if("string"==typeof e)y=e;else{var t=Object.keys(e).filter(function(e){return n.indexOf(e)>-1})[0];y=e[t]}}function p(e,n){var t=n.router.parse().query.s;a(),s(e,t),l(),t&&setTimeout(function(e){return c(t)},500)}function u(e,n){f(e.placeholder,n.route.path),d(e.noData,n.route.path)}var h,g={},y="",m={placeholder:"Type to search",noData:"No Results!",paths:"auto",depth:2,maxAge:864e5},v=function(e,n){var t=Docsify.util,o=n.config.search||m;Array.isArray(o)?m.paths=o:"object"==typeof o&&(m.paths=Array.isArray(o.paths)?o.paths:"auto",m.maxAge=t.isPrimitive(o.maxAge)?o.maxAge:m.maxAge,m.placeholder=o.placeholder||m.placeholder,m.noData=o.noData||m.noData,m.depth=o.depth||m.depth);var i="auto"===m.paths;e.mounted(function(e){p(m,n),!i&&r(m,n)}),e.doneEach(function(e){u(m,n),i&&r(m,n)})};$docsify.plugins=[].concat(v,$docsify.plugins)}(); 2 | -------------------------------------------------------------------------------- /asset/style.css: -------------------------------------------------------------------------------- 1 | /*隐藏头部的目录*/ 2 | #main>ul:nth-child(1) { 3 | display: none; 4 | } 5 | 6 | #main>ul:nth-child(2) { 7 | display: none; 8 | } 9 | 10 | .markdown-section h1 { 11 | margin: 3rem 0 2rem 0; 12 | } 13 | 14 | .markdown-section h2 { 15 | margin: 2rem 0 1rem; 16 | } 17 | 18 | img, 19 | pre { 20 | border-radius: 8px; 21 | } 22 | 23 | .content, 24 | .sidebar, 25 | .markdown-section, 26 | body, 27 | .search input { 28 | background-color: rgba(243, 242, 238, 1) 
!important; 29 | } 30 | 31 | @media (min-width:600px) { 32 | .sidebar-toggle { 33 | background-color: #f3f2ee; 34 | } 35 | } 36 | 37 | .docsify-copy-code-button { 38 | background: #f8f8f8 !important; 39 | color: #7a7a7a !important; 40 | } 41 | 42 | body { 43 | /*font-family: Microsoft YaHei, Source Sans Pro, Helvetica Neue, Arial, sans-serif !important;*/ 44 | } 45 | 46 | .markdown-section>p { 47 | font-size: 16px !important; 48 | } 49 | 50 | .markdown-section pre>code { 51 | font-family: Consolas, Roboto Mono, Monaco, courier, monospace !important; 52 | font-size: .9rem !important; 53 | 54 | } 55 | 56 | /*.anchor span { 57 | color: rgb(66, 185, 131); 58 | }*/ 59 | 60 | section.cover h1 { 61 | margin: 0; 62 | } 63 | 64 | body>section>div.cover-main>ul>li>a { 65 | color: #42b983; 66 | } 67 | 68 | .markdown-section img { 69 | box-shadow: 7px 9px 10px #aaa !important; 70 | } 71 | 72 | 73 | pre { 74 | background-color: #f3f2ee !important; 75 | } 76 | 77 | @media (min-width:600px) { 78 | pre code { 79 | /*box-shadow: 2px 1px 20px 2px #aaa;*/ 80 | /*border-radius: 10px !important;*/ 81 | padding-left: 20px !important; 82 | } 83 | } 84 | 85 | @media (max-width:600px) { 86 | pre { 87 | padding-left: 0px !important; 88 | padding-right: 0px !important; 89 | } 90 | } 91 | 92 | .markdown-section pre { 93 | padding-left: 0 !important; 94 | padding-right: 0px !important; 95 | box-shadow: 2px 1px 20px 2px #aaa; 96 | } -------------------------------------------------------------------------------- /asset/vue.css: -------------------------------------------------------------------------------- 1 | @import url("https://fonts.googleapis.com/css?family=Roboto+Mono|Source+Sans+Pro:300,400,600"); 2 | * { 3 | -webkit-font-smoothing: antialiased; 4 | -webkit-overflow-scrolling: touch; 5 | -webkit-tap-highlight-color: rgba(0,0,0,0); 6 | -webkit-text-size-adjust: none; 7 | -webkit-touch-callout: none; 8 | box-sizing: border-box; 9 | } 10 | body:not(.ready) { 11 | overflow: hidden; 12 | } 
13 | body:not(.ready) [data-cloak], 14 | body:not(.ready) .app-nav, 15 | body:not(.ready) > nav { 16 | display: none; 17 | } 18 | div#app { 19 | font-size: 30px; 20 | font-weight: lighter; 21 | margin: 40vh auto; 22 | text-align: center; 23 | } 24 | div#app:empty::before { 25 | content: 'Loading...'; 26 | } 27 | .emoji { 28 | height: 1.2rem; 29 | vertical-align: middle; 30 | } 31 | .progress { 32 | background-color: var(--theme-color, #42b983); 33 | height: 2px; 34 | left: 0px; 35 | position: fixed; 36 | right: 0px; 37 | top: 0px; 38 | transition: width 0.2s, opacity 0.4s; 39 | width: 0%; 40 | z-index: 999999; 41 | } 42 | .search a:hover { 43 | color: var(--theme-color, #42b983); 44 | } 45 | .search .search-keyword { 46 | color: var(--theme-color, #42b983); 47 | font-style: normal; 48 | font-weight: bold; 49 | } 50 | html, 51 | body { 52 | height: 100%; 53 | } 54 | body { 55 | -moz-osx-font-smoothing: grayscale; 56 | -webkit-font-smoothing: antialiased; 57 | color: #34495e; 58 | font-family: 'Source Sans Pro', 'Helvetica Neue', Arial, sans-serif; 59 | font-size: 15px; 60 | letter-spacing: 0; 61 | margin: 0; 62 | overflow-x: hidden; 63 | } 64 | img { 65 | max-width: 100%; 66 | } 67 | a[disabled] { 68 | cursor: not-allowed; 69 | opacity: 0.6; 70 | } 71 | kbd { 72 | border: solid 1px #ccc; 73 | border-radius: 3px; 74 | display: inline-block; 75 | font-size: 12px !important; 76 | line-height: 12px; 77 | margin-bottom: 3px; 78 | padding: 3px 5px; 79 | vertical-align: middle; 80 | } 81 | li input[type='checkbox'] { 82 | margin: 0 0.2em 0.25em 0; 83 | vertical-align: middle; 84 | } 85 | .app-nav { 86 | margin: 25px 60px 0 0; 87 | position: absolute; 88 | right: 0; 89 | text-align: right; 90 | z-index: 10; 91 | /* navbar dropdown */ 92 | } 93 | .app-nav.no-badge { 94 | margin-right: 25px; 95 | } 96 | .app-nav p { 97 | margin: 0; 98 | } 99 | .app-nav > a { 100 | margin: 0 1rem; 101 | padding: 5px 0; 102 | } 103 | .app-nav ul, 104 | .app-nav li { 105 | display: inline-block; 
106 | list-style: none; 107 | margin: 0; 108 | } 109 | .app-nav a { 110 | color: inherit; 111 | font-size: 16px; 112 | text-decoration: none; 113 | transition: color 0.3s; 114 | } 115 | .app-nav a:hover { 116 | color: var(--theme-color, #42b983); 117 | } 118 | .app-nav a.active { 119 | border-bottom: 2px solid var(--theme-color, #42b983); 120 | color: var(--theme-color, #42b983); 121 | } 122 | .app-nav li { 123 | display: inline-block; 124 | margin: 0 1rem; 125 | padding: 5px 0; 126 | position: relative; 127 | cursor: pointer; 128 | } 129 | .app-nav li ul { 130 | background-color: #fff; 131 | border: 1px solid #ddd; 132 | border-bottom-color: #ccc; 133 | border-radius: 4px; 134 | box-sizing: border-box; 135 | display: none; 136 | max-height: calc(100vh - 61px); 137 | overflow-y: auto; 138 | padding: 10px 0; 139 | position: absolute; 140 | right: -15px; 141 | text-align: left; 142 | top: 100%; 143 | white-space: nowrap; 144 | } 145 | .app-nav li ul li { 146 | display: block; 147 | font-size: 14px; 148 | line-height: 1rem; 149 | margin: 0; 150 | margin: 8px 14px; 151 | white-space: nowrap; 152 | } 153 | .app-nav li ul a { 154 | display: block; 155 | font-size: inherit; 156 | margin: 0; 157 | padding: 0; 158 | } 159 | .app-nav li ul a.active { 160 | border-bottom: 0; 161 | } 162 | .app-nav li:hover ul { 163 | display: block; 164 | } 165 | .github-corner { 166 | border-bottom: 0; 167 | position: fixed; 168 | right: 0; 169 | text-decoration: none; 170 | top: 0; 171 | z-index: 1; 172 | } 173 | .github-corner:hover .octo-arm { 174 | -webkit-animation: octocat-wave 560ms ease-in-out; 175 | animation: octocat-wave 560ms ease-in-out; 176 | } 177 | .github-corner svg { 178 | color: #fff; 179 | fill: var(--theme-color, #42b983); 180 | height: 80px; 181 | width: 80px; 182 | } 183 | main { 184 | display: block; 185 | position: relative; 186 | width: 100vw; 187 | height: 100%; 188 | z-index: 0; 189 | } 190 | main.hidden { 191 | display: none; 192 | } 193 | .anchor { 194 | 
display: inline-block; 195 | text-decoration: none; 196 | transition: all 0.3s; 197 | } 198 | .anchor span { 199 | color: #34495e; 200 | } 201 | .anchor:hover { 202 | text-decoration: underline; 203 | } 204 | .sidebar { 205 | border-right: 1px solid rgba(0,0,0,0.07); 206 | overflow-y: auto; 207 | padding: 40px 0 0; 208 | position: absolute; 209 | top: 0; 210 | bottom: 0; 211 | left: 0; 212 | transition: transform 250ms ease-out; 213 | width: 300px; 214 | z-index: 20; 215 | } 216 | .sidebar > h1 { 217 | margin: 0 auto 1rem; 218 | font-size: 1.5rem; 219 | font-weight: 300; 220 | text-align: center; 221 | } 222 | .sidebar > h1 a { 223 | color: inherit; 224 | text-decoration: none; 225 | } 226 | .sidebar > h1 .app-nav { 227 | display: block; 228 | position: static; 229 | } 230 | .sidebar .sidebar-nav { 231 | line-height: 2em; 232 | padding-bottom: 40px; 233 | } 234 | .sidebar li.collapse .app-sub-sidebar { 235 | display: none; 236 | } 237 | .sidebar ul { 238 | margin: 0 0 0 15px; 239 | padding: 0; 240 | } 241 | .sidebar li > p { 242 | font-weight: 700; 243 | margin: 0; 244 | } 245 | .sidebar ul, 246 | .sidebar ul li { 247 | list-style: none; 248 | } 249 | .sidebar ul li a { 250 | border-bottom: none; 251 | display: block; 252 | } 253 | .sidebar ul li ul { 254 | padding-left: 20px; 255 | } 256 | .sidebar::-webkit-scrollbar { 257 | width: 4px; 258 | } 259 | .sidebar::-webkit-scrollbar-thumb { 260 | background: transparent; 261 | border-radius: 4px; 262 | } 263 | .sidebar:hover::-webkit-scrollbar-thumb { 264 | background: rgba(136,136,136,0.4); 265 | } 266 | .sidebar:hover::-webkit-scrollbar-track { 267 | background: rgba(136,136,136,0.1); 268 | } 269 | .sidebar-toggle { 270 | background-color: transparent; 271 | background-color: rgba(255,255,255,0.8); 272 | border: 0; 273 | outline: none; 274 | padding: 10px; 275 | position: absolute; 276 | bottom: 0; 277 | left: 0; 278 | text-align: center; 279 | transition: opacity 0.3s; 280 | width: 284px; 281 | z-index: 30; 282 | 
cursor: pointer; 283 | } 284 | .sidebar-toggle:hover .sidebar-toggle-button { 285 | opacity: 0.4; 286 | } 287 | .sidebar-toggle span { 288 | background-color: var(--theme-color, #42b983); 289 | display: block; 290 | margin-bottom: 4px; 291 | width: 16px; 292 | height: 2px; 293 | } 294 | body.sticky .sidebar, 295 | body.sticky .sidebar-toggle { 296 | position: fixed; 297 | } 298 | .content { 299 | padding-top: 60px; 300 | position: absolute; 301 | top: 0; 302 | right: 0; 303 | bottom: 0; 304 | left: 300px; 305 | transition: left 250ms ease; 306 | } 307 | .markdown-section { 308 | margin: 0 auto; 309 | max-width: 80%; 310 | padding: 30px 15px 40px 15px; 311 | position: relative; 312 | } 313 | .markdown-section > * { 314 | box-sizing: border-box; 315 | font-size: inherit; 316 | } 317 | .markdown-section > :first-child { 318 | margin-top: 0 !important; 319 | } 320 | .markdown-section hr { 321 | border: none; 322 | border-bottom: 1px solid #eee; 323 | margin: 2em 0; 324 | } 325 | .markdown-section iframe { 326 | border: 1px solid #eee; 327 | /* fix horizontal overflow on iOS Safari */ 328 | width: 1px; 329 | min-width: 100%; 330 | } 331 | .markdown-section table { 332 | border-collapse: collapse; 333 | border-spacing: 0; 334 | display: block; 335 | margin-bottom: 1rem; 336 | overflow: auto; 337 | width: 100%; 338 | } 339 | .markdown-section th { 340 | border: 1px solid #ddd; 341 | font-weight: bold; 342 | padding: 6px 13px; 343 | } 344 | .markdown-section td { 345 | border: 1px solid #ddd; 346 | padding: 6px 13px; 347 | } 348 | .markdown-section tr { 349 | border-top: 1px solid #ccc; 350 | } 351 | .markdown-section tr:nth-child(2n) { 352 | background-color: #f8f8f8; 353 | } 354 | .markdown-section p.tip { 355 | background-color: #f8f8f8; 356 | border-bottom-right-radius: 2px; 357 | border-left: 4px solid #f66; 358 | border-top-right-radius: 2px; 359 | margin: 2em 0; 360 | padding: 12px 24px 12px 30px; 361 | position: relative; 362 | } 363 | .markdown-section 
p.tip:before { 364 | background-color: #f66; 365 | border-radius: 100%; 366 | color: #fff; 367 | content: '!'; 368 | font-family: 'Dosis', 'Source Sans Pro', 'Helvetica Neue', Arial, sans-serif; 369 | font-size: 14px; 370 | font-weight: bold; 371 | left: -12px; 372 | line-height: 20px; 373 | position: absolute; 374 | height: 20px; 375 | width: 20px; 376 | text-align: center; 377 | top: 14px; 378 | } 379 | .markdown-section p.tip code { 380 | background-color: #efefef; 381 | } 382 | .markdown-section p.tip em { 383 | color: #34495e; 384 | } 385 | .markdown-section p.warn { 386 | background: rgba(66,185,131,0.1); 387 | border-radius: 2px; 388 | padding: 1rem; 389 | } 390 | .markdown-section ul.task-list > li { 391 | list-style-type: none; 392 | } 393 | body.close .sidebar { 394 | transform: translateX(-300px); 395 | } 396 | body.close .sidebar-toggle { 397 | width: auto; 398 | } 399 | body.close .content { 400 | left: 0; 401 | } 402 | @media print { 403 | .github-corner, 404 | .sidebar-toggle, 405 | .sidebar, 406 | .app-nav { 407 | display: none; 408 | } 409 | } 410 | @media screen and (max-width: 768px) { 411 | .github-corner, 412 | .sidebar-toggle, 413 | .sidebar { 414 | position: fixed; 415 | } 416 | .app-nav { 417 | margin-top: 16px; 418 | } 419 | .app-nav li ul { 420 | top: 30px; 421 | } 422 | main { 423 | height: auto; 424 | overflow-x: hidden; 425 | } 426 | .sidebar { 427 | left: -300px; 428 | transition: transform 250ms ease-out; 429 | } 430 | .content { 431 | left: 0; 432 | max-width: 100vw; 433 | position: static; 434 | padding-top: 20px; 435 | transition: transform 250ms ease; 436 | } 437 | .app-nav, 438 | .github-corner { 439 | transition: transform 250ms ease-out; 440 | } 441 | .sidebar-toggle { 442 | background-color: transparent; 443 | width: auto; 444 | padding: 30px 30px 10px 10px; 445 | } 446 | body.close .sidebar { 447 | transform: translateX(300px); 448 | } 449 | body.close .sidebar-toggle { 450 | background-color: rgba(255,255,255,0.8); 451 | 
transition: 1s background-color; 452 | width: 284px; 453 | padding: 10px; 454 | } 455 | body.close .content { 456 | transform: translateX(300px); 457 | } 458 | body.close .app-nav, 459 | body.close .github-corner { 460 | display: none; 461 | } 462 | .github-corner:hover .octo-arm { 463 | -webkit-animation: none; 464 | animation: none; 465 | } 466 | .github-corner .octo-arm { 467 | -webkit-animation: octocat-wave 560ms ease-in-out; 468 | animation: octocat-wave 560ms ease-in-out; 469 | } 470 | } 471 | @-webkit-keyframes octocat-wave { 472 | 0%, 100% { 473 | transform: rotate(0); 474 | } 475 | 20%, 60% { 476 | transform: rotate(-25deg); 477 | } 478 | 40%, 80% { 479 | transform: rotate(10deg); 480 | } 481 | } 482 | @keyframes octocat-wave { 483 | 0%, 100% { 484 | transform: rotate(0); 485 | } 486 | 20%, 60% { 487 | transform: rotate(-25deg); 488 | } 489 | 40%, 80% { 490 | transform: rotate(10deg); 491 | } 492 | } 493 | section.cover { 494 | align-items: center; 495 | background-position: center center; 496 | background-repeat: no-repeat; 497 | background-size: cover; 498 | height: 100vh; 499 | width: 100vw; 500 | display: none; 501 | } 502 | section.cover.show { 503 | display: flex; 504 | } 505 | section.cover.has-mask .mask { 506 | background-color: #fff; 507 | opacity: 0.8; 508 | position: absolute; 509 | top: 0; 510 | height: 100%; 511 | width: 100%; 512 | } 513 | section.cover .cover-main { 514 | flex: 1; 515 | margin: -20px 16px 0; 516 | text-align: center; 517 | position: relative; 518 | } 519 | section.cover a { 520 | color: inherit; 521 | text-decoration: none; 522 | } 523 | section.cover a:hover { 524 | text-decoration: none; 525 | } 526 | section.cover p { 527 | line-height: 1.5rem; 528 | margin: 1em 0; 529 | } 530 | section.cover h1 { 531 | color: inherit; 532 | font-size: 2.5rem; 533 | font-weight: 300; 534 | margin: 0.625rem 0 2.5rem; 535 | position: relative; 536 | text-align: center; 537 | } 538 | section.cover h1 a { 539 | display: block; 540 | } 541 | 
section.cover h1 small { 542 | bottom: -0.4375rem; 543 | font-size: 1rem; 544 | position: absolute; 545 | } 546 | section.cover blockquote { 547 | font-size: 1.5rem; 548 | text-align: center; 549 | } 550 | section.cover ul { 551 | line-height: 1.8; 552 | list-style-type: none; 553 | margin: 1em auto; 554 | max-width: 500px; 555 | padding: 0; 556 | } 557 | section.cover .cover-main > p:last-child a { 558 | border-color: var(--theme-color, #42b983); 559 | border-radius: 2rem; 560 | border-style: solid; 561 | border-width: 1px; 562 | box-sizing: border-box; 563 | color: var(--theme-color, #42b983); 564 | display: inline-block; 565 | font-size: 1.05rem; 566 | letter-spacing: 0.1rem; 567 | margin: 0.5rem 1rem; 568 | padding: 0.75em 2rem; 569 | text-decoration: none; 570 | transition: all 0.15s ease; 571 | } 572 | section.cover .cover-main > p:last-child a:last-child { 573 | background-color: var(--theme-color, #42b983); 574 | color: #fff; 575 | } 576 | section.cover .cover-main > p:last-child a:last-child:hover { 577 | color: inherit; 578 | opacity: 0.8; 579 | } 580 | section.cover .cover-main > p:last-child a:hover { 581 | color: inherit; 582 | } 583 | section.cover blockquote > p > a { 584 | border-bottom: 2px solid var(--theme-color, #42b983); 585 | transition: color 0.3s; 586 | } 587 | section.cover blockquote > p > a:hover { 588 | color: var(--theme-color, #42b983); 589 | } 590 | body { 591 | background-color: #fff; 592 | } 593 | /* sidebar */ 594 | .sidebar { 595 | background-color: #fff; 596 | color: #364149; 597 | } 598 | .sidebar li { 599 | margin: 6px 0 6px 0; 600 | } 601 | .sidebar ul li a { 602 | color: #505d6b; 603 | font-size: 14px; 604 | font-weight: normal; 605 | overflow: hidden; 606 | text-decoration: none; 607 | text-overflow: ellipsis; 608 | white-space: nowrap; 609 | } 610 | .sidebar ul li a:hover { 611 | text-decoration: underline; 612 | } 613 | .sidebar ul li ul { 614 | padding: 0; 615 | } 616 | .sidebar ul li.active > a { 617 | border-right: 2px 
solid; 618 | color: var(--theme-color, #42b983); 619 | font-weight: 600; 620 | } 621 | .app-sub-sidebar li::before { 622 | content: '-'; 623 | padding-right: 4px; 624 | float: left; 625 | } 626 | /* markdown content found on pages */ 627 | .markdown-section h1, 628 | .markdown-section h2, 629 | .markdown-section h3, 630 | .markdown-section h4, 631 | .markdown-section strong { 632 | color: #2c3e50; 633 | font-weight: 600; 634 | } 635 | .markdown-section a { 636 | color: var(--theme-color, #42b983); 637 | font-weight: 600; 638 | } 639 | .markdown-section h1 { 640 | font-size: 2rem; 641 | margin: 0 0 1rem; 642 | } 643 | .markdown-section h2 { 644 | font-size: 1.75rem; 645 | margin: 45px 0 0.8rem; 646 | } 647 | .markdown-section h3 { 648 | font-size: 1.5rem; 649 | margin: 40px 0 0.6rem; 650 | } 651 | .markdown-section h4 { 652 | font-size: 1.25rem; 653 | } 654 | .markdown-section h5 { 655 | font-size: 1rem; 656 | } 657 | .markdown-section h6 { 658 | color: #777; 659 | font-size: 1rem; 660 | } 661 | .markdown-section figure, 662 | .markdown-section p { 663 | margin: 1.2em 0; 664 | } 665 | .markdown-section p, 666 | .markdown-section ul, 667 | .markdown-section ol { 668 | line-height: 1.6rem; 669 | word-spacing: 0.05rem; 670 | } 671 | .markdown-section ul, 672 | .markdown-section ol { 673 | padding-left: 1.5rem; 674 | } 675 | .markdown-section blockquote { 676 | border-left: 4px solid var(--theme-color, #42b983); 677 | color: #858585; 678 | margin: 2em 0; 679 | padding-left: 20px; 680 | } 681 | .markdown-section blockquote p { 682 | font-weight: 600; 683 | margin-left: 0; 684 | } 685 | .markdown-section iframe { 686 | margin: 1em 0; 687 | } 688 | .markdown-section em { 689 | color: #7f8c8d; 690 | } 691 | .markdown-section code { 692 | background-color: #f8f8f8; 693 | border-radius: 2px; 694 | color: #e96900; 695 | font-family: 'Roboto Mono', Monaco, courier, monospace; 696 | font-size: 0.8rem; 697 | margin: 0 2px; 698 | padding: 3px 5px; 699 | white-space: pre-wrap; 700 
| } 701 | .markdown-section pre { 702 | -moz-osx-font-smoothing: initial; 703 | -webkit-font-smoothing: initial; 704 | background-color: #f8f8f8; 705 | font-family: 'Roboto Mono', Monaco, courier, monospace; 706 | line-height: 1.5rem; 707 | margin: 1.2em 0; 708 | overflow: auto; 709 | padding: 0 1.4rem; 710 | position: relative; 711 | word-wrap: normal; 712 | } 713 | /* code highlight */ 714 | .token.comment, 715 | .token.prolog, 716 | .token.doctype, 717 | .token.cdata { 718 | color: #8e908c; 719 | } 720 | .token.namespace { 721 | opacity: 0.7; 722 | } 723 | .token.boolean, 724 | .token.number { 725 | color: #c76b29; 726 | } 727 | .token.punctuation { 728 | color: #525252; 729 | } 730 | .token.property { 731 | color: #c08b30; 732 | } 733 | .token.tag { 734 | color: #2973b7; 735 | } 736 | .token.string { 737 | color: var(--theme-color, #42b983); 738 | } 739 | .token.selector { 740 | color: #6679cc; 741 | } 742 | .token.attr-name { 743 | color: #2973b7; 744 | } 745 | .token.entity, 746 | .token.url, 747 | .language-css .token.string, 748 | .style .token.string { 749 | color: #22a2c9; 750 | } 751 | .token.attr-value, 752 | .token.control, 753 | .token.directive, 754 | .token.unit { 755 | color: var(--theme-color, #42b983); 756 | } 757 | .token.keyword, 758 | .token.function { 759 | color: #e96900; 760 | } 761 | .token.statement, 762 | .token.regex, 763 | .token.atrule { 764 | color: #22a2c9; 765 | } 766 | .token.placeholder, 767 | .token.variable { 768 | color: #3d8fd1; 769 | } 770 | .token.deleted { 771 | text-decoration: line-through; 772 | } 773 | .token.inserted { 774 | border-bottom: 1px dotted #202746; 775 | text-decoration: none; 776 | } 777 | .token.italic { 778 | font-style: italic; 779 | } 780 | .token.important, 781 | .token.bold { 782 | font-weight: bold; 783 | } 784 | .token.important { 785 | color: #c94922; 786 | } 787 | .token.entity { 788 | cursor: help; 789 | } 790 | .markdown-section pre > code { 791 | -moz-osx-font-smoothing: initial; 792 | 
-webkit-font-smoothing: initial; 793 | background-color: #f8f8f8; 794 | border-radius: 2px; 795 | color: #525252; 796 | display: block; 797 | font-family: 'Roboto Mono', Monaco, courier, monospace; 798 | font-size: 0.8rem; 799 | line-height: inherit; 800 | margin: 0 2px; 801 | max-width: inherit; 802 | overflow: inherit; 803 | padding: 2.2em 5px; 804 | white-space: inherit; 805 | } 806 | .markdown-section code::after, 807 | .markdown-section code::before { 808 | letter-spacing: 0.05rem; 809 | } 810 | code .token { 811 | -moz-osx-font-smoothing: initial; 812 | -webkit-font-smoothing: initial; 813 | min-height: 1.5rem; 814 | position: relative; 815 | left: auto; 816 | } 817 | pre::after { 818 | color: #ccc; 819 | content: attr(data-lang); 820 | font-size: 0.6rem; 821 | font-weight: 600; 822 | height: 15px; 823 | line-height: 15px; 824 | padding: 5px 10px 0; 825 | position: absolute; 826 | right: 0; 827 | text-align: right; 828 | top: 0; 829 | } 830 | -------------------------------------------------------------------------------- /blog/Install/README.md: -------------------------------------------------------------------------------- 1 | # Installation 2 | 3 | ## Quick install 4 | 5 | Run in your terminal (recommended): 6 | 7 | `pip install --upgrade gensim` 8 | 9 | or, alternatively, for conda environments: 10 | 11 | `conda install -c conda-forge gensim` 12 | 13 | That's it! Congratulations, you can proceed to the [tutorial](https://radimrehurek.com/gensim/tutorial.html). 14 | 15 | If that fails, make sure you are installing into a writable location (or use sudo). 16 | 17 | --- 18 | 19 | ## Code dependencies 20 | 21 | Gensim runs on Linux, Windows and Mac OS X, and should run on any other platform that supports Python 2.7+ and NumPy. Gensim depends on the following software: 22 | 23 | * [Python](https://www.python.org/) >= 2.7 (tested with versions 2.7, 3.5 and 3.6) 24 | * [NumPy](http://www.numpy.org/) >= 1.11.3 25 | * [SciPy](https://www.scipy.org/) >= 0.18.1 26 | * [Six](https://pypi.org/project/six/) >= 1.5.0 27 | * [smart_open](https://pypi.org/project/smart_open/) >= 1.2.1 28 | 29 | ## Testing Gensim 30 | 31 | Gensim uses continuous integration, automatically running the full test suite on each pull request: 32 | 33 | | CI service | Task | Build badge | 34 | | :-- | :-- | :-- | 35 |
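| Travis | Run tests on Linux and check [code style](https://www.python.org/dev/peps/pep-0008/?) | ![Travis](/imgs/Introduction/gensim.svg) | 36 | | AppVeyor | Run tests on Windows | ![AppVeyor](/imgs/Introduction/develop_1.svg) | 37 | | CircleCI | Build documentation | ![CircleCI](/imgs/Introduction/develop.svg) | 38 | 39 | ## Problems? 40 | 41 | Use the [Gensim discussion group](https://groups.google.com/group/gensim/) for questions and troubleshooting. See the [support page](https://radimrehurek.com/gensim/support.html) for commercial support. 42 | -------------------------------------------------------------------------------- /blog/Introduction/README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | Gensim is a [free](https://radimrehurek.com/gensim/intro.html#availability) Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible. 4 | 5 | Gensim is designed to process raw, unstructured digital texts (**`plain text`**). 6 | 7 | The algorithms in Gensim, such as [`Word2Vec`](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec "gensim.models.word2vec.Word2Vec"), [`FastText`](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText "gensim.models.fasttext.FastText"), Latent Semantic Analysis (LSI, LSA, see [`LsiModel`](https://radimrehurek.com/gensim/models/lsimodel.html#gensim.models.lsimodel.LsiModel "gensim.models.lsimodel.LsiModel")), Latent Dirichlet Allocation (LDA, see [`LdaModel`](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel "gensim.models.ldamodel.LdaModel")) etc., automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are **unsupervised**, which means no human input is necessary - you only need a corpus of plain text documents. 8 | 9 | Once these statistical patterns are found, any plain text document (sentence, phrase, word...) can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents (words, phrases...). 10 | 11 | > Note 12 | If the previous paragraphs left you confused, you can read more about the [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model) and [unsupervised document analysis](https://en.wikipedia.org/wiki/Latent_semantic_indexing) on Wikipedia. 13 | 14 | ## Features 15 | 16 | * **Memory independence** - there is no need for the whole training corpus to reside fully in RAM at any one time (can process large, web-scale corpora). 17 | * **Memory sharing** - trained models can be persisted to disk and loaded back via [mmap](https://en.wikipedia.org/wiki/Mmap). Multiple processes can share the same data, cutting down the RAM footprint. 18 | * 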
Efficient implementations of several popular vector space algorithms, including [`Word2Vec`](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec "gensim.models.word2vec.Word2Vec"), [`Doc2Vec`](https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec "gensim.models.doc2vec.Doc2Vec"), [`FastText`](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText "gensim.models.fasttext.FastText"), TF-IDF, Latent Semantic Analysis (LSI, LSA, see [`LsiModel`](https://radimrehurek.com/gensim/models/lsimodel.html#gensim.models.lsimodel.LsiModel "gensim.models.lsimodel.LsiModel")), Latent Dirichlet Allocation (LDA, see [`LdaModel`](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel "gensim.models.ldamodel.LdaModel")) or Random Projection (see [`RpModel`](https://radimrehurek.com/gensim/models/rpmodel.html#gensim.models.rpmodel.RpModel "gensim.models.rpmodel.RpModel")). 19 | * I/O wrappers and readers for several popular data formats. 20 | * Fast similarity queries for documents in their semantic representation. 21 | 22 | The **principal design objectives** behind Gensim are: 23 | 24 | 1. Straightforward interfaces and a low API learning curve for developers. Good for prototyping. 25 | 2. Memory independence with respect to the size of the input corpus; all intermediate steps and algorithms operate in a streaming fashion, accessing one document at a time. 26 | 27 | See also 28 | 29 | We also built a high-performance commercial server for NLP, document analysis, indexing, search and clustering: [https://scaletext.ai](https://scaletext.ai/). ScaleText is available both on-premises and as SaaS. 30 | 31 | Reach out at info@scaletext.com if you need an industry-grade NLP tool with professional support. 32 | 33 | ## Availability 34 | 35 | Gensim is licensed under the OSI-approved [GNU LGPLv2.1 license](https://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html) and can be downloaded either from its [Github repository](https://github.com/piskvorky/gensim/) or from the [Python Package Index](https://pypi.python.org/pypi/gensim). 36 | 37 | See also 38 | 39 | See the [install](https://radimrehurek.com/gensim/install.html) page for more information on Gensim deployment. 40 | 41 | ## Core concepts 42 | 43 | > Corpus 44 | 45 | A collection of digital documents. Corpora serve two roles in Gensim: 46 | 47 | 1. 
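Input for model training. The corpus is used to automatically train a machine learning model, such as [`LsiModel`](https://radimrehurek.com/gensim/models/lsimodel.html#gensim.models.lsimodel.LsiModel "gensim.models.lsimodel.LsiModel") or [`LdaModel`](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel "gensim.models.ldamodel.LdaModel"). 48 | 49 | The models use this *training corpus* to look for common themes and topics, initializing their internal model parameters. 50 | 51 | Gensim focuses on *unsupervised* models, so no human intervention, such as costly annotations or tagging documents by hand, is required. 52 | 53 | 2. Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). 54 | 55 | Such corpora can be [indexed](https://radimrehurek.com/gensim/tut3.html) and queried by semantic similarity, clustered etc. 56 | 57 | > Vector space model 58 | 59 | In a Vector Space Model (VSM), each document is represented by an array of features. For example, a single feature may be thought of as a question-answer pair: 60 | 61 | 1. How many times does the word *splonge* appear in the document? Zero. 62 | 2. How many paragraphs does the document consist of? Two. 63 | 3. How many fonts does the document use? Five. 64 | 65 | A question is usually represented only by its integer id (such as 1, 2 and 3 here), so the representation of this document becomes a series of pairs like `(1, 0.0), (2, 2.0), (3, 5.0)`. 66 | 67 | If we know all the questions in advance, we may leave them implicit and simply write `(0.0, 2.0, 5.0)`. 68 | 69 | This sequence of answers can be thought of as a **vector** (in this case a 3-dimensional dense vector). For practical purposes, only questions to which the answer is (or can be converted to) a *single floating point number* are allowed in Gensim. 70 | 71 | The questions are the same for each document, so that looking at two vectors (representing two documents), we will hopefully be able to draw conclusions such as "The numbers in these two vectors are very similar, and therefore the original documents must be similar, too". Of course, whether such conclusions correspond to reality depends on how well we picked our questions. 72 | 73 | > Gensim sparse vector, Bag-of-words vector 74 | 75 | To save space, in Gensim we omit all vector elements with value 0.0. For example, instead of the 3-dimensional dense vector `(0.0, 2.0, 5.0)`, we write only `[(2, 2.0), (3, 5.0)]` (note the missing `(1, 0.0)`). Each vector element is a pair (2-tuple) of `(feature_id, feature_value)`. The values of all missing features in this sparse representation can be unambiguously resolved to zero, `0.0`. 76 | 77 | Documents in Gensim are represented by sparse vectors (sometimes called bag-of-words vectors). 78 | 79 | > Gensim streamed corpus 80 | 81 | Gensim does not prescribe any specific corpus format. A corpus is simply a sequence of sparse vectors (see above). 82 | 83 | For example, `[ [(2, 2.0), (3, 5.0)], [(3, 1.0)] ]` is a simple corpus of two documents = two sparse vectors: the first with two non-zero elements, the second with one non-zero element. This particular corpus is represented as a plain Python `list`. 84 | 85 | However, the full power of Gensim comes from the fact that a corpus does not have to be a `list`, or a `NumPy` array, or a `Pandas` dataframe, or anything of the sort. Gensim *accepts any object that, when iterated over, successively yields these sparse bag-of-words vectors*. 86 | 87 | This flexibility allows you to create your own corpus classes that stream the sparse vectors directly from disk, network, database, dataframes... The models in Gensim are implemented such that they do not require all vectors to reside in RAM at once. You can even create the sparse vectors on the fly! 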
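The streamed-corpus idea described above can be sketched in plain Python. This is only an illustrative sketch, not Gensim's own code: the `StreamingCorpus` class, the `read_lines` callable and the hand-written `token2id` mapping are assumptions made for this example (in real code the mapping would normally come from a Gensim dictionary).

```python
class StreamingCorpus:
    """Hypothetical example class: yields one sparse bag-of-words vector per document."""

    def __init__(self, read_lines, token2id):
        # read_lines: callable returning a fresh iterable of text lines (one document per line)
        # token2id: plain dict mapping token -> integer feature id (assumed to be built beforehand)
        self.read_lines = read_lines
        self.token2id = token2id

    def __iter__(self):
        for line in self.read_lines():
            counts = {}
            for token in line.lower().split():
                feature_id = self.token2id.get(token)
                if feature_id is not None:  # tokens missing from the mapping are ignored
                    counts[feature_id] = counts.get(feature_id, 0) + 1
            # sparse vector: (feature_id, feature_value) pairs, zero-valued features omitted
            yield sorted(counts.items())


token2id = {"computer": 0, "human": 1, "system": 5, "eps": 8}
documents = [
    "Human computer interaction",
    "System and human system engineering testing of EPS",
]
corpus = StreamingCorpus(lambda: iter(documents), token2id)

for vector in corpus:  # only one vector is materialized at a time
    print(vector)
```

Because `read_lines` is a callable that returns a fresh iterable on every call, the corpus can be iterated more than once, which matters for algorithms that make several passes over the data. In practice the callable would open a file, query a database or read from the network instead of wrapping an in-memory list.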
88 | 89 | See our [tutorial on streamed data processing in Python](https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/). 90 | 91 | For a built-in example of an efficient corpus format streamed directly from disk, see the Matrix Market format in [`mmcorpus`](https://radimrehurek.com/gensim/corpora/mmcorpus.html#module-gensim.corpora.mmcorpus "gensim.corpora.mmcorpus: Corpus in Matrix Market format"). For a minimal blueprint of how to create your own streamed corpora, check out the [source code of the CSV corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/corpora/csvcorpus.py). 92 | 93 | > Model, Transformation 94 | 95 | Gensim uses **model** to refer to the code and associated data (model parameters) required to transform one document representation into another. 96 | 97 | In Gensim, documents are represented as vectors (see above), so a model can be thought of as a transformation from one vector space to another. The parameters of this transformation are learned from the training corpus. 98 | 99 | Trained models (the data parameters) can be persisted to disk and later loaded back, either to continue training on new training documents or to transform new documents. 100 | 101 | Gensim implements multiple models, such as [`Word2Vec`](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec "gensim.models.word2vec.Word2Vec"), [`LsiModel`](https://radimrehurek.com/gensim/models/lsimodel.html#gensim.models.lsimodel.LsiModel "gensim.models.lsimodel.LsiModel"), [`LdaModel`](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel "gensim.models.ldamodel.LdaModel"), [`FastText`](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText "gensim.models.fasttext.FastText") etc. See the [API reference](https://radimrehurek.com/gensim/apiref.html) for a full list. 102 | 103 | See also 104 | 105 | For some examples of how this all works out in code, go to the [tutorials](https://radimrehurek.com/gensim/tutorial.html). 106 | -------------------------------------------------------------------------------- /blog/tutorial/1.md: -------------------------------------------------------------------------------- 1 | # Corpora and Vector Spaces 2 | 3 | This tutorial is available as a Jupyter Notebook [here](https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb). 4 | 5 | Don't forget to set 6 | 7 | ```py 8 | >>> import logging 9 | >>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 10 | ``` 11 | 12 | if you want to see logging events. 13 | 14 
| ## [From Strings to Vectors](https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors "Permalink to this headline") 15 | 16 | This time, let's start from documents represented as strings: 17 | 18 | ```py 19 | >>> from gensim import corpora 20 | >>> 21 | >>> documents = ["Human machine interface for lab abc computer applications", 22 | ...              "A survey of user opinion of computer system response time", 23 | ...              "The EPS user interface management system", 24 | ...              "System and human system engineering testing of EPS", 25 | ...              "Relation of user perceived response time to error measurement", 26 | ...              "The generation of random binary unordered trees", 27 | ...              "The intersection graph of paths in trees", 28 | ...              "Graph minors IV Widths of trees and well quasi ordering", 29 | ...              "Graph minors A survey"] 30 | ``` 31 | 32 | This is a tiny corpus of nine documents, each consisting of only a single sentence. 33 | 34 | First, let's tokenize the documents, remove common words (using a toy stoplist) as well as words that only appear once in the corpus: 35 | 36 | ```py 37 | >>> # remove common words and tokenize 38 | >>> stoplist = set('for a of the and to in'.split()) 39 | >>> texts = [[word for word in document.lower().split() if word not in stoplist] 40 | ...          for document in documents] 41 | >>> 42 | >>> # remove words that appear only once 43 | >>> from collections import defaultdict 44 | >>> frequency = defaultdict(int) 45 | >>> for text in texts: 46 | ...     for token in text: 47 | ...         frequency[token] += 1 48 | ... 49 | >>> texts = [[token for token in text if frequency[token] > 1] 50 | ...          for text in texts] 51 | >>> 52 | >>> from pprint import pprint # pretty-printer 53 | >>> pprint(texts) 54 | [['human', 'interface', 'computer'], 55 | ['survey', 'user', 'computer', 'system', 'response', 'time'], 56 | ['eps', 'user', 'interface', 'system'], 57 | ['system', 'human', 'system', 'eps'], 58 | ['user', 'response', 'time'], 59 | ['trees'], 60 | ['graph', 'trees'], 61 | ['graph', 'minors', 'trees'], 62 | ['graph', 'minors', 'survey']] 63 | ``` 64 | 65 | Your way of processing the documents will likely vary; 
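here, I only split on whitespace to tokenize, followed by lowercasing each word. In fact, I use this particular (simplistic and inefficient) setup to mimic the experiment done in Deerwester et al.'s original LSA article [[1]](https://radimrehurek.com/gensim/tut1.html#id3). 66 | 67 | The ways to process documents are so varied and so application- and language-dependent that I decided *not* to constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its "surface" string form: how you get to the features is up to you. Below I describe one common, general-purpose approach (called *bag-of-words*), but keep in mind that different application domains call for different features, and, as always, it's [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out)... 68 | 69 | To convert documents to vectors, we'll use a document representation called [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words). In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of: 70 | 71 | > "How many times does the word system appear in the document? Once." 72 | 73 | It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary: 74 | 75 | ```py 76 | >>> dictionary = corpora.Dictionary(texts) 77 | >>> dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference 78 | >>> print(dictionary) 79 | Dictionary(12 unique tokens) 80 | ``` 81 | 82 | Here we assigned a unique integer id to every word appearing in the corpus, using the [`gensim.corpora.dictionary.Dictionary`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary "gensim.corpora.dictionary.Dictionary") class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids: 83 | 84 | ```py 85 | >>> print(dictionary.token2id) 86 | {'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0, 87 | 'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3} 88 | ``` 89 | 90 | To actually convert tokenized documents to vectors: 91 | 92 | ```py 93 | >>> new_doc = "Human computer interaction" 94 | >>> new_vec = dictionary.doc2bow(new_doc.lower().split()) 95 | >>> print(new_vec) # the word "interaction" does not appear in the dictionary and is ignored 96 | [(0, 1), (1, 1)] 97 | ``` 98 | 99 | The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector `[(0, 1), (1, 1)]` therefore reads: in the document "Human computer interaction", the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times. 100 | 101 | ```py 102 | >>> corpus = [dictionary.doc2bow(text) for text in texts] 103 | >>> corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use 104 | >>> print(corpus) 105 | [(0, 1), (1, 1), 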
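(2, 1)] 106 | [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)] 107 | [(2, 1), (5, 1), (7, 1), (8, 1)] 108 | [(1, 1), (5, 2), (8, 1)] 109 | [(3, 1), (6, 1), (7, 1)] 110 | [(9, 1)] 111 | [(9, 1), (10, 1)] 112 | [(9, 1), (10, 1), (11, 1)] 113 | [(4, 1), (10, 1), (11, 1)] 114 | ``` 115 | 116 | By now it should be clear that the vector feature with `id=10` stands for the question "How many times does the word graph appear in the document?" and that the answer is "zero" for the first six documents and "one" for the remaining three. As a matter of fact, we have arrived at exactly the same corpus of vectors as in the [Quick Example](https://radimrehurek.com/gensim/tutorial.html#first-example). 117 | 118 | 119 | ## [Corpus Streaming - One Document at a Time](https://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time "Permalink to this headline") 120 | 121 | Note that the corpus above resides fully in memory, as a plain Python list. In this simple example it doesn't matter much, but just to make things clear, let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do. Instead, let's assume the documents are stored in a file on disk, one document per line. gensim only requires that a corpus be able to return one document vector at a time: 122 | 123 | ```py 124 | >>> class MyCorpus(object): 125 | ...     def __iter__(self): 126 | ...         for line in open('mycorpus.txt'): 127 | ...             # assume there's one document per line, tokens separated by whitespace 128 | ...             yield dictionary.doc2bow(line.lower().split()) 129 | ``` 130 | 131 | Download the sample [mycorpus.txt file here](https://radimrehurek.com/gensim/mycorpus.txt). The assumption that each document occupies one line in a single file is not important; you can mold the `__iter__` function to fit your input format, whatever it is. Walking directories, parsing XML, accessing the network... Just parse your input to retrieve a clean list of tokens for each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`. 132 | 133 | ```py 134 | >>> corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory! 135 | >>> print(corpus_memory_friendly) 136 | ``` 137 | 138 | The corpus is now an object. We didn't define any way to print it, so print just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector (one at a time): 139 | 140 | ```py 141 | >>> for vector in corpus_memory_friendly: # load one vector into memory at a time 142 | ... 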
print(vector) 143 | [(0, 1), (1, 1), (2, 1)] 144 | [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)] 145 | [(2, 1), (5, 1), (7, 1), (8, 1)] 146 | [(1, 1), (5, 2), (8, 1)] 147 | [(3, 1), (6, 1), (7, 1)] 148 | [(9, 1)] 149 | [(9, 1), (10, 1)] 150 | [(9, 1), (10, 1), (11, 1)] 151 | [(4, 1), (10, 1), (11, 1)] 152 | ``` 153 | 154 | Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want. 155 | 156 | Similarly, to construct the dictionary without loading all texts into memory: 157 | 158 | ```py 159 | >>> from six import iteritems 160 | >>> # collect statistics about all tokens 161 | >>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt')) 162 | >>> # remove stop words and words that appear only once 163 | >>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist 164 | ...             if stopword in dictionary.token2id] 165 | >>> once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1] 166 | >>> dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once 167 | >>> dictionary.compactify() # remove gaps in id sequence after words that were removed 168 | >>> print(dictionary) 169 | Dictionary(12 unique tokens) 170 | ``` 171 | 172 | And that is all there is to it! At least as far as the bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn't, and we will need to apply a transformation to this simple representation first, before we can use it to compute any meaningful document vs. document similarities. Transformations are covered in the [next tutorial](https://radimrehurek.com/gensim/tut2.html), but before that, let's briefly turn our attention to *corpus persistency*. 173 | 174 | ## [Corpus Formats](https://radimrehurek.com/gensim/tut1.html#corpus-formats "Permalink to this headline") 175 | 176 | There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk. gensim implements them via the *streaming corpus interface* mentioned earlier: documents are read from (resp. stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once. 177 | 178 | One of the more notable file formats is the [Matrix Market format](http://math.nist.gov/MatrixMarket/formats.html). To save a corpus in the Matrix Market format: 179 | 180 | ```py 181 | >>> # create a toy corpus of 2 documents, as a plain Python list 182 | >>> corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it 183 | >>> 184 | >>> 
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus) 185 | ``` 186 | 187 | Other formats include [Joachim's SVMlight format](http://svmlight.joachims.org/), [Blei's LDA-C format](https://www.cs.princeton.edu/~blei/lda-c/) and the [GibbsLDA++ format](http://gibbslda.sourceforge.net/). 188 | 189 | ```py 190 | >>> corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus) 191 | >>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus) 192 | >>> corpora.LowCorpus.serialize('/tmp/corpus.low', corpus) 193 | ``` 194 | 195 | Conversely, to load a corpus iterator from a Matrix Market file: 196 | 197 | ```py 198 | >>> corpus = corpora.MmCorpus('/tmp/corpus.mm') 199 | ``` 200 | 201 | Corpus objects are streams, so typically you won't be able to print them directly: 202 | 203 | ```py 204 | >>> print(corpus) 205 | MmCorpus(2 documents, 2 features, 1 non-zero entries) 206 | ``` 207 | 208 | Instead, to view the contents of a corpus: 209 | 210 | ```py 211 | >>> # one way of printing a corpus: load it entirely into memory 212 | >>> print(list(corpus)) # calling list() will convert any sequence to a plain Python list 213 | [[(1, 0.5)], []] 214 | ``` 215 | 216 | or 217 | 218 | ```py 219 | >>> # another way of doing it: print one document at a time, making use of the streaming interface 220 | >>> for doc in corpus: 221 | ... 
print(doc) 222 | [(1, 0.5)] 223 | [] 224 | ``` 225 | 226 | The second way is obviously more memory-friendly, but for testing and development purposes, nothing beats the simplicity of calling `list(corpus)`. 227 | 228 | To save the same Matrix Market document stream in Blei's LDA-C format, 229 | 230 | ```py 231 | >>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus) 232 | ``` 233 | 234 | In this way, gensim can also be used as a memory-efficient **I/O format conversion tool**: just load a document stream using one format and immediately save it in another format. Adding new formats is easy; check out the [code for the SVMlight corpus](https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py) for an example. 235 | 236 | ## [Compatibility with NumPy and SciPy](https://radimrehurek.com/gensim/tut1.html#compatibility-with-numpy-and-scipy "Permalink to this headline") 237 | 238 | gensim also contains [efficient utility functions](https://radimrehurek.com/gensim/matutils.html) to help with converting from/to numpy matrices: 239 | 240 | ```py 241 | >>> import gensim 242 | >>> import numpy as np 243 | >>> numpy_matrix = np.random.randint(10, size=[5,2]) # random matrix as an example 244 | >>> corpus = gensim.matutils.Dense2Corpus(numpy_matrix) 245 | >>> numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features) 246 | ``` 247 | 248 | and from/to scipy.sparse matrices: 249 | 250 | ```py 251 | >>> import scipy.sparse 252 | >>> scipy_sparse_matrix = scipy.sparse.random(5,2) # random sparse matrix as example 253 | >>> corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix) 254 | >>> scipy_csc_matrix = gensim.matutils.corpus2csc(corpus) 255 | ``` 256 | 257 | --- 258 | 259 | For a complete reference (Want to prune the dictionary to a smaller size? Optimize converting between corpora and NumPy/SciPy arrays?), see the [API documentation](https://radimrehurek.com/gensim/apiref.html). Or continue to the next tutorial on [Topics and Transformations](https://radimrehurek.com/gensim/tut2.html). 260 | 261 | [[1]](https://radimrehurek.com/gensim/tut1.html#id1) This is the same corpus as used in [Deerwester et al. (1990): Indexing by Latent Semantic Analysis](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf), Table 2. 262 | -------------------------------------------------------------------------------- /blog/tutorial/2.md: 
-------------------------------------------------------------------------------- 1 | # Topics and Transformations 2 | 3 | Don't forget to set 4 | 5 | ```py 6 | >>> import logging 7 | >>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 8 | ``` 9 | 10 | if you want to see logging events. 11 | 12 | ## [Transformation interface](https://radimrehurek.com/gensim/tut2.html#transformation-interface "Permalink to this headline") 13 | 14 | In the previous tutorial on [Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html), we created a corpus of documents represented as a stream of vectors. To continue, let's fire up gensim and use that corpus: 15 | 16 | ```py 17 | >>> import os; from gensim import corpora, models, similarities  # os is needed for os.path.exists below 18 | >>> if (os.path.exists("/tmp/deerwester.dict")): 19 | ...     dictionary = corpora.Dictionary.load('/tmp/deerwester.dict') 20 | ...     corpus = corpora.MmCorpus('/tmp/deerwester.mm') 21 | ...     print("Used files generated from first tutorial") 22 | ... else: 23 | ...     print("Please run first tutorial to generate data set") 24 | ``` 25 | 26 | Printing `corpus` at this point shows `MmCorpus(9 documents, 12 features, 28 non-zero entries)`. 27 | 28 | In this tutorial, I will show how to transform documents from one vector representation into another. This process serves two goals: 29 | 30 | 1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way. 31 | 2. 
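To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, noise reduction). 32 | 33 | ### [Creating a transformation](https://radimrehurek.com/gensim/tut2.html#creating-a-transformation "Permalink to this headline") 34 | 35 | The transformations are standard Python objects, typically initialized by means of a *training corpus*: 36 | 37 | ```py 38 | >>> tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model 39 | ``` 40 | 41 | We used our old corpus from tutorial 1 to initialize (train) the transformation model. Different transformations may require different initialization parameters; in the case of TfIdf, the "training" consists simply of going through the supplied corpus once and computing document frequencies of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is much more involved and, consequently, takes much more time. 42 | 43 | Note 44 | 45 | Transformations always convert between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failure to use the same input feature space, such as applying a different string preprocessing, using different feature ids, or using bag-of-words input vectors where TfIdf vectors are expected, will result in a feature mismatch during transformation calls and consequently in garbage output and/or runtime exceptions. 46 | 47 | ### [Transforming vectors](https://radimrehurek.com/gensim/tut2.html#transforming-vectors "Permalink to this headline") 48 | 49 | From now on, `tfidf` is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights): 50 | 51 | ```py 52 | >>> doc_bow = [(0, 1), (1, 1)] 53 | >>> print(tfidf[doc_bow]) # step 2 -- use the model to transform vectors 54 | [(0, 0.70710678), (1, 0.70710678)] 55 | ``` 56 | 57 | Or to apply a transformation to a whole corpus: 58 | 59 | ```py 60 | >>> corpus_tfidf = tfidf[corpus] 61 | >>> for doc in corpus_tfidf: 62 | ... 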
print(doc) 63 | [(0, 0.57735026918962573), (1, 0.57735026918962573), (2, 0.57735026918962573)] 64 | [(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.32448702061385548), (6, 0.44424552527467476), (7, 0.32448702061385548)] 65 | [(2, 0.5710059809418182), (5, 0.41707573620227772), (7, 0.41707573620227772), (8, 0.5710059809418182)] 66 | [(1, 0.49182558987264147), (5, 0.71848116070837686), (8, 0.49182558987264147)] 67 | [(3, 0.62825804686700459), (6, 0.62825804686700459), (7, 0.45889394536615247)] 68 | [(9, 1.0)] 69 | [(9, 0.70710678118654746), (10, 0.70710678118654746)] 70 | [(9, 0.50804290089167492), (10, 0.50804290089167492), (11, 0.69554641952003704)] 71 | [(4, 0.62825804686700459), (10, 0.45889394536615247), (11, 0.62825804686700459)] 72 | ``` 73 | 74 | In this particular case, we are transforming the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA etc. 75 | 76 | > Note 77 | Calling `model[corpus]` only creates a wrapper around the old `corpus` document stream - the actual conversions are done on the fly, during document iteration. We cannot convert the entire corpus at the time of calling `corpus_transformed = model[corpus]`, because that would mean storing the result in main memory, and that contradicts gensim's objective of memory independence. If you will be iterating over the transformed `corpus_transformed` multiple times, and the transformation is costly, [serialize the resulting corpus to disk first](https://radimrehurek.com/gensim/tut1.html#corpus-formats) and continue using that. 78 | 79 | Transformations can also be serialized, one on top of another, in a sort of chain: 80 | 81 | ```py 82 | >>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation 83 | >>> corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi 84 | ``` 85 | 86 | Here we transformed our Tf-Idf corpus via [Latent Semantic Indexing](https://en.wikipedia.org/wiki/Latent_semantic_indexing) into a latent 2-D space (2-D because we set `num_topics=2`). Now you're probably wondering: what do these two latent dimensions stand for? Let's inspect with `models.LsiModel.print_topics()`: 87 | 88 | ```py 89 | >>> lsi.print_topics(2) 90 | topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface" 91 | topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + 
-0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees" 92 | ``` 93 | 94 | (the topics are printed to the log - see the note at the top of this page about activating logging) 95 | 96 | It appears that according to LSI, "trees", "graph" and "minors" are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic, while the remaining four documents are related to the first topic: 97 | 98 | ```py 99 | >>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly 100 | ... print(doc) 101 | [(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications" 102 | [(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time" 103 | [(0, -0.090), (1, 0.724)] # "The EPS user interface management system" 104 | [(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS" 105 | [(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement" 106 | [(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees" 107 | [(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees" 108 | [(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering" 109 | [(0, -0.617), (1, 0.054)] # "Graph minors A survey" 110 | ``` 111 | 112 | Model persistency is achieved with the `save()` and `load()` functions: 113 | 114 | ```py 115 | >>> lsi.save('/tmp/model.lsi') # same for tfidf, lda, ... 
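116 | >>> lsi = models.LsiModel.load('/tmp/model.lsi') 117 | ``` 118 | 119 | The next question might be: just how exactly similar are those documents to each other? Is there a way to formalize the similarity, so that for a given input document, we can order some other set of documents according to their similarity? Similarity queries are covered in the [next tutorial](https://radimrehurek.com/gensim/tut3.html). 120 | 121 | ## [Available transformations](https://radimrehurek.com/gensim/tut2.html#available-transformations "Permalink to this headline") 122 | 123 | gensim implements several popular Vector Space Model algorithms: 124 | 125 | * [Term Frequency * Inverse Document Frequency, Tf-Idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) expects a bag-of-words (integer values) training corpus during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality, except that features which were rare in the training corpus will have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length. 126 | 127 | `>>> model = models.TfidfModel(corpus, normalize=True)` 128 | 129 | * [Latent Semantic Indexing, LSI (or sometimes LSA)](https://en.wikipedia.org/wiki/Latent_semantic_indexing) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of lower dimensionality. For the toy corpus above we used only 2 latent dimensions, but on real corpora, a target dimensionality of 200-500 is recommended as a "golden standard" [[1]](https://radimrehurek.com/gensim/tut2.html#id6). 130 | 131 | `>>> model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)` 132 | 133 | LSI training is unique in that we can continue "training" at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite - just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meantime! 134 | 135 | ```py 136 | >>> model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus 137 | >>> lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model 138 | >>> ... 139 | >>> model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents 140 | >>> lsi_vec = model[tfidf_vec] 141 | >>> ... 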
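142 | ``` 143 | 144 | See the [`gensim.models.lsimodel`](https://radimrehurek.com/gensim/models/lsimodel.html#module-gensim.models.lsimodel "gensim.models.lsimodel: Latent Semantic Indexing") documentation for details on how to make LSI gradually "forget" old observations in infinite streams. If you want to get your hands dirty, there are also parameters you can tweak that affect speed vs. memory footprint vs. numerical precision of the LSI algorithm. 145 | 146 | gensim uses a novel online incremental streamed distributed training algorithm (quite a mouthful!), which I published in [[5]](https://radimrehurek.com/gensim/tut2.html#id10). gensim also executes a stochastic multi-pass algorithm from Halko et al. [[4]](https://radimrehurek.com/gensim/tut2.html#id9) internally, to accelerate the in-core part of the computations. See also [Experiments on the English Wikipedia](https://radimrehurek.com/gensim/wiki.html) for further speed-ups by distributing the computation across a cluster of computers. 147 | 148 | * [Random Projections, RP](http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf) aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness. The recommended target dimensionality is again in the hundreds/thousands, depending on your dataset. 149 | 150 | `>>> model = models.RpModel(tfidf_corpus, num_topics=500)` 151 | 152 | * [Latent Dirichlet Allocation, LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA's topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA). 153 | 154 | `>>> model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)` 155 | 156 | gensim uses a fast implementation of online LDA parameter estimation based on [[2]](https://radimrehurek.com/gensim/tut2.html#id7), modified to run in [distributed mode](https://radimrehurek.com/gensim/distributed.html) on a cluster of computers. 157 | 158 | * [Hierarchical Dirichlet Process, HDP](http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf) is a non-parametric Bayesian method (note the missing number of requested topics): 159 | 160 | `>>> model = models.HdpModel(corpus, id2word=dictionary)` 161 | 162 | gensim uses a fast, online implementation based on [[3]](https://radimrehurek.com/gensim/tut2.html#id8). The HDP model is a new addition to gensim, and still rough around its academic edges - use with care. 163 | 164 | Adding new VSM transformations (such as different weighting schemes) is rather trivial; see the [API reference](https://radimrehurek.com/gensim/apiref.html) or directly the [Python code](https://github.com/piskvorky/gensim/blob/develop/gensim/models/tfidfmodel.py) for more information and examples. 165 | 166 | It is worth repeating that these are all unique, **incremental** implementations, which do not require the whole training corpus to be present in main memory all at once. With memory taken care of, I am now improving [Distributed Computing](https://radimrehurek.com/gensim/distributed.html), to improve CPU efficiency, too. If you feel you could contribute (by testing, providing use-cases or code), please [let us know](mailto:radimrehurek%40seznam.cz). 167 | 168 | 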
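To make "incremental, one document at a time" concrete for the simplest transformation in the list above, here is a minimal pure-Python sketch of Tf-Idf. It is not gensim's implementation: the function names `document_frequencies` and `tfidf` are made up for this example, and the weighting (raw count times log2(N/df), then scaling to unit Euclidean length) is an assumption, chosen because it reproduces the numbers printed earlier in this tutorial.

```python
import math


def document_frequencies(corpus):
    """Single streaming pass: count documents and, per feature id, the documents containing it."""
    num_docs = 0
    dfs = {}
    for doc in corpus:  # any iterable of sparse vectors works; one document at a time
        num_docs += 1
        for feature_id, _ in doc:
            dfs[feature_id] = dfs.get(feature_id, 0) + 1
    return num_docs, dfs


def tfidf(doc, num_docs, dfs):
    """Weight each count by log2(num_docs / df), then scale the vector to unit Euclidean length."""
    weighted = [
        (feature_id, count * math.log2(num_docs / dfs[feature_id]))
        for feature_id, count in doc
        if feature_id in dfs  # features never seen during "training" are dropped
    ]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    if norm == 0.0:
        return []
    return [(feature_id, weight / norm) for feature_id, weight in weighted]


# the nine-document bag-of-words corpus from the first tutorial
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

num_docs, dfs = document_frequencies(iter(bow_corpus))  # "training": one pass over a stream
print(tfidf([(0, 1), (1, 1)], num_docs, dfs))            # both weights come out near 0.707
```

The only state kept after the training pass is the document count and one integer per feature, which is why the corpus itself never needs to fit in memory. Transforming a vector afterwards does not modify that state, matching the read-only use of `tfidf` described earlier.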
Continue on to the next tutorial on [Similarity Queries](https://radimrehurek.com/gensim/tut3.html). 169 | 170 | --- 171 | 172 | [[1]](https://radimrehurek.com/gensim/tut2.html#id1) Bradford. 2008. An empirical study of required dimensionality for large-scale latent semantic indexing applications. 173 | [[2]](https://radimrehurek.com/gensim/tut2.html#id4) Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation. 174 | [[3]](https://radimrehurek.com/gensim/tut2.html#id5) Wang, Paisley, Blei. 2011. Online variational inference for the hierarchical Dirichlet process. 175 | [[4]](https://radimrehurek.com/gensim/tut2.html#id3) Halko, Martinsson, Tropp. 2009. Finding structure with randomness. 176 | [[5]](https://radimrehurek.com/gensim/tut2.html#id2) Řehůřek. 2011. Subspace tracking for Latent Semantic Analysis. 177 | -------------------------------------------------------------------------------- /blog/tutorial/3.md: -------------------------------------------------------------------------------- 1 | # Similarity Queries 2 | 3 | Don't forget to set 4 | 5 | ```py 6 | >>> import logging 7 | >>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 8 | ``` 9 | 10 | if you want to see logging events. 11 | 12 | ## [Similarity interface](https://radimrehurek.com/gensim/tut3.html#similarity-interface "Permalink to this headline") 13 | 14 | In the previous tutorials on [Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html) and [Topics and Transformations](https://radimrehurek.com/gensim/tut2.html), we covered what it means to create a corpus in the Vector Space Model and how to transform it between different vector spaces. A common reason for going to all this trouble is that we want to determine the **similarity between pairs of documents**, or the **similarity between a specific document and a set of other documents** (such as a user query vs. indexed documents). 15 | 16 | To show how this can be done in gensim, let us consider the same corpus as in the previous examples (which originally comes from Deerwester et al.'s seminal 1990 article ["Indexing by Latent Semantic Analysis"](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf)): 17 | 18 | ```py 19 | >>> from gensim import corpora, models, similarities 20 | >>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict') 21 | >>> corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, "From strings to vectors" 22 | >>> print(corpus) 23 | MmCorpus(9 documents, 12 features, 28 non-zero entries) 24 | ``` 25 | 26 | To follow Deerwester's example, we first use this tiny corpus to define a 2-dimensional LSI space: 27 | 28 | ```py 29 | >>> lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2) 30 | ``` 31 | 32 | 
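Now suppose a user typed in the query "Human computer interaction". We would like to sort our nine corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we only concentrate on a single aspect of possible similarities - the apparent semantic relatedness of their texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match: 33 | 34 | ```py 35 | >>> doc = "Human computer interaction" 36 | >>> vec_bow = dictionary.doc2bow(doc.lower().split()) 37 | >>> vec_lsi = lsi[vec_bow] # convert the query to LSI space 38 | >>> print(vec_lsi) 39 | [(0, -0.461821), (1, 0.070028)] 40 | ``` 41 | 42 | In addition, we will be using [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to determine the similarity of two vectors. Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, [different similarity measures](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence) may be more appropriate. 43 | 44 | ### [Initializing query structures](https://radimrehurek.com/gensim/tut3.html#initializing-query-structures "Permalink to this headline") 45 | 46 | To prepare for similarity queries, we need to enter all documents which we want to compare against subsequent queries. In our case, they are the same nine documents used for training LSI, converted to 2-D LSA space. But that is only incidental; we might just as well be indexing a different corpus altogether. 47 | 48 | ```py 49 | >>> index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it 50 | ``` 51 | 52 | > Warning 53 | * The class `similarities.MatrixSimilarity` is only appropriate when the whole set of vectors fits into memory. For example, a corpus of one million documents would require 2GB of RAM in a 256-dimensional LSI space, when used with this class. 54 | * Without 2GB of free RAM, you would need to use the `similarities.Similarity` class. This class operates in fixed memory, by splitting the index across multiple files on disk, called shards. It uses `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity` internally, so it is still fast, although slightly more complex. 55 | 56 | Index persistency is handled via the standard `save()` and `load()` functions: 57 | 58 | ```py 59 | >>> index.save('/tmp/deerwester.index') 60 | >>> index = similarities.MatrixSimilarity.load('/tmp/deerwester.index') 61 | ``` 62 | 63 | This is true for all similarity indexing classes (`similarities.Similarity`, `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity`). Also in the following, index can be an object of any of these. When in doubt, use `similarities.Similarity`, as it is the most scalable version, and it also supports adding more documents to the index later. 64 | 65 | ### [Performing queries](https://radimrehurek.com/gensim/tut3.html#performing-queries "Permalink to this headline") 66 | 67 | To obtain the similarities of our query document against the nine indexed documents: 68 | 69 | ```py 70 | >>> sims = index[vec_lsi] # perform a similarity query against the corpus 71 | >>> print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples 72 | [(0, 0.99809301), 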
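(1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), 73 | (5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)] 74 | ``` 75 | 76 | The cosine measure returns similarities in the range [-1, 1] (the greater, the more similar), so that the first document has a score of 0.99809301 etc. 77 | 78 | With some standard Python magic we sort these similarities into descending order, and obtain the final answer to the query `Human computer interaction`: 79 | 80 | ```py 81 | >>> sims = sorted(enumerate(sims), key=lambda item: -item[1]) 82 | >>> print(sims) # print sorted (document number, similarity score) 2-tuples 83 | [(2, 0.99844527), # The EPS user interface management system 84 | (0, 0.99809301), # Human machine interface for lab abc computer applications 85 | (3, 0.9865886), # System and human system engineering testing of EPS 86 | (1, 0.93748635), # A survey of user opinion of computer system response time 87 | (4, 0.90755945), # Relation of user perceived response time to error measurement 88 | (8, 0.050041795), # Graph minors A survey 89 | (7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering 90 | (6, -0.1063926), # The intersection graph of paths in trees 91 | (5, -0.12416792)] # The generation of random binary unordered trees 92 | ``` 93 | 94 | (I added the original documents in their "string form" to the output comments, to improve clarity.) 95 | 96 | The thing to note here is that documents no. 2 (`The EPS user interface management system`) and 4 (`Relation of user perceived response time to error measurement`) would never be returned by a standard boolean fulltext search, because they do not share any common words with `Human computer interaction`. However, after applying LSI, we can observe that both of them received quite high similarity scores (no. 2 is actually the most similar!), which corresponds better to our intuition that they share a `computer-human` related topic with the query. In fact, this semantic generalization is the reason why we apply transformations and do topic modelling in the first place. 97 | 98 | ## [Where next?](https://radimrehurek.com/gensim/tut3.html#where-next "Permalink to this headline") 99 | 100 | Congratulations, you have finished the tutorials - now you know how gensim works :-) To delve into more details, you can browse through the [API documentation](https://radimrehurek.com/gensim/apiref.html), see the [Wikipedia experiments](https://radimrehurek.com/gensim/wiki.html) or perhaps check out [distributed computing](https://radimrehurek.com/gensim/distributed.html) in gensim. 101 | 102 | gensim is a fairly mature package that has been used successfully by many individuals and companies, both for rapid prototyping and in production. That doesn't mean it's perfect, though: 103 | 104 | * there are parts that could be implemented more efficiently (in C, for example), or make better use of parallelism (multiple machines/cores) 105 | * new algorithms are published all the time; 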
  help gensim keep up by [discussing them](https://groups.google.com/group/gensim) and [contributing code](https://github.com/piskvorky/gensim/wiki/Developer-page).
* Your **feedback is most welcome** and appreciated (and it is not just the code!): [idea contributions](https://github.com/piskvorky/gensim/wiki/Ideas-&-Features-proposals), [bug reports](https://github.com/piskvorky/gensim/issues), or simply consider contributing [user stories and general questions](https://groups.google.com/group/gensim/topics).

gensim has no ambition to become an all-encompassing framework across all NLP (or even machine learning) subfields. Its mission is to help NLP practitioners try out popular topic modeling algorithms on large datasets easily, and to facilitate prototyping of new algorithms for researchers.

--------------------------------------------------------------------------------
/blog/tutorial/4.md:
--------------------------------------------------------------------------------
# Experiments on the English Wikipedia

To test gensim performance, we run it against the English version of Wikipedia.

This page describes the process of obtaining and processing Wikipedia, so that anyone can reproduce the results. It is assumed that you have gensim properly [installed](https://radimrehurek.com/gensim/install.html).

## [Preparing the corpus](https://radimrehurek.com/gensim/wiki.html#preparing-the-corpus "Permalink to this headline")

1. First, download the dump of all Wikipedia articles from [http://download.wikimedia.org/enwiki/](https://download.wikimedia.org/enwiki/) (you want the file `enwiki-latest-pages-articles.xml.bz2`, or `enwiki-YYYYMMDD-pages-articles.xml.bz2` for date-specific dumps). This file is about 8GB in size and contains a compressed version of all articles from the English Wikipedia.

2.
   Convert the articles to plain text (processing the Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on the fly, and we do not even need to uncompress the whole archive to disk. gensim includes a script that does just that; run:

   `$ python -m gensim.scripts.make_wiki`

> Note
* This preprocessing step makes two passes over the 8.2GB compressed wiki dump (one to extract the dictionary, one to create and store the sparse vectors) and takes about 9 hours on a laptop, so you may want to go have a coffee or two.
* Additionally, you will need about 35GB of free disk space to store the sparse output vectors. I recommend compressing these files immediately, for example with bzip2 (down to ~13GB). gensim can work with compressed files directly, so this lets you save disk space.

## [Latent Semantic Analysis](https://radimrehurek.com/gensim/wiki.html#latent-semantic-analysis "Permalink to this headline")

First let's load the corpus iterator and dictionary created in the second step above:

```py
>>> import logging, gensim
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm.bz2')  # use this if you compressed the TFIDF output (recommended)

>>> print(mm)
MmCorpus(3931787 documents, 100000 features, 756379027 non-zero entries)
```

We see that our corpus contains 3.9M documents, 100K features (distinct tokens) and 0.76G non-zero entries in the sparse TF-IDF matrix. The Wikipedia corpus contains about 2.24 billion tokens in total.

Now we are ready to compute LSA of the English Wikipedia:

```py
>>> # extract 400 LSI topics; use the default one-pass algorithm
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)

>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.print_topics(10)
topic #0(332.762): 0.425*"utc" + 0.299*"talk" + 0.293*"page" + 0.226*"article" + 0.224*"delete" + 0.216*"discussion" + 0.205*"deletion" + 0.198*"should" + 0.146*"debate" + 0.132*"be"
topic #1(201.852): 0.282*"link" + 0.209*"he" + 0.145*"com" + 0.139*"his" + -0.137*"page" + -0.118*"delete" + 0.114*"blacklist" + -0.108*"deletion" + -0.105*"discussion" + 0.100*"diff"
topic #2(191.991): -0.565*"link" +
-0.241*"com" + -0.238*"blacklist" + -0.202*"diff" + -0.193*"additions" + -0.182*"users" + -0.158*"coibot" + -0.136*"user" + 0.133*"he" + -0.130*"resolves"
topic #3(141.284): -0.476*"image" + -0.255*"copyright" + -0.245*"fair" + -0.225*"use" + -0.173*"album" + -0.163*"cover" + -0.155*"resolution" + -0.141*"licensing" + 0.137*"he" + -0.121*"copies"
topic #4(130.909): 0.264*"population" + 0.246*"age" + 0.243*"median" + 0.213*"income" + 0.195*"census" + -0.189*"he" + 0.184*"households" + 0.175*"were" + 0.167*"females" + 0.166*"males"
topic #5(120.397): 0.304*"diff" + 0.278*"utc" + 0.213*"you" + -0.171*"additions" + 0.165*"talk" + -0.159*"image" + 0.159*"undo" + 0.155*"www" + -0.152*"page" + 0.148*"contribs"
topic #6(115.414): -0.362*"diff" + -0.203*"www" + 0.197*"you" + -0.180*"undo" + -0.180*"kategori" + 0.164*"users" + 0.157*"additions" + -0.150*"contribs" + -0.139*"he" + -0.136*"image"
topic #7(111.440): 0.429*"kategori" + 0.276*"categoria" + 0.251*"category" + 0.207*"kategorija" + 0.198*"kategorie" + -0.188*"diff" + 0.163*"категория" + 0.153*"categoría" + 0.139*"kategoria" + 0.133*"categorie"
topic #8(109.907): 0.385*"album" + 0.224*"song" + 0.209*"chart" + 0.204*"band" + 0.169*"released" + 0.151*"music" + 0.142*"diff" + 0.141*"vocals" + 0.138*"she" + 0.132*"guitar"
topic #9(102.599): -0.237*"league" + -0.214*"he" + -0.180*"season" + -0.174*"football" + -0.166*"team" + 0.159*"station" + -0.137*"played" + -0.131*"cup" + 0.131*"she" + -0.128*"utc"
```

Creating the LSI model of Wikipedia takes about 4 hours and 9 minutes on my laptop [[1]](https://radimrehurek.com/gensim/wiki.html#id6). That is about **16,000 documents per minute, including all I/O**.

> Note
> If you need results faster, see the tutorial on [distributed computing](https://radimrehurek.com/gensim/distributed.html). Note that the BLAS libraries used by gensim make use of multiple cores transparently, so the same data will be processed faster on a multicore machine "for free", without any distributed setup.

We see that the total processing time is dominated by the preprocessing step of preparing the TF-IDF corpus from a raw Wikipedia XML dump, which took 9 hours. [[2]](https://radimrehurek.com/gensim/wiki.html#id7)

The algorithm used in gensim only needs to see each input document once, so it is suitable for environments where the documents come as a non-repeatable stream, or where the cost of storing or iterating over the corpus multiple times is too high.

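The single-pass property can be illustrated with a small sketch. This is not gensim code: `iter_chunks`, `document_stream` and the chunk size are hypothetical names chosen for illustration, and the loop only counts documents where a real model would update its state. The point is that the stream is consumed exactly once and only one chunk is held in memory at a time.

```python
from itertools import islice


def iter_chunks(stream, chunksize):
    """Yield lists of up to `chunksize` documents from a one-shot iterable."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


def document_stream():
    """Hypothetical non-repeatable source: each document is produced only once."""
    for doc_id in range(10):
        # a sparse vector in the (feature_id, weight) format used in these tutorials
        yield [(doc_id % 3, 1.0)]


seen = 0
chunk_sizes = []
for chunk in iter_chunks(document_stream(), chunksize=4):
    # a single-pass algorithm would update its model from `chunk` here
    chunk_sizes.append(len(chunk))
    seen += len(chunk)

print(chunk_sizes, seen)  # prints [4, 4, 2] 10
```

Because the generator is exhausted after one loop, any algorithm that needed a second pass would have to store the documents itself, which is exactly the cost a one-pass algorithm avoids.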
## [Latent Dirichlet Allocation](https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation "Permalink to this headline")

As with Latent Semantic Analysis above, first load the corpus iterator and dictionary:

```py
>>> import logging, gensim
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm.bz2')  # use this if you compressed the TFIDF output

>>> print(mm)
MmCorpus(3931787 documents, 100000 features, 756379027 non-zero entries)
```

We will run online LDA (see Hoffman et al. [[3]](https://radimrehurek.com/gensim/wiki.html#id8)), an algorithm that takes a chunk of documents, updates the LDA model, takes another chunk, updates the model, and so on. Online LDA can be contrasted with batch LDA, which processes the whole corpus (one full pass), then updates the model, then makes another pass, another update, and so on. The difference is that, given a reasonably stationary document stream (not much topic drift), the online updates over the smaller chunks (subcorpora) are pretty good in themselves, so the model estimation converges faster. As a result, we may need only a single full pass over the corpus: if the corpus has 3 million articles and we update once after every 10,000 articles, we will have done 300 updates in one pass, quite likely enough for a very accurate topic estimate:

```py
>>> # extract 100 LDA topics, using 1 pass and updating once every 1 chunk (10,000 documents)
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
using serial LDA version on this node
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3931787 documents, updating model once every 10000 documents
...
```

Unlike LSA, the topics coming from LDA are easier to interpret:

```py
>>> # print the most contributing words for 20 randomly selected topics
>>> lda.print_topics(20)
topic #0: 0.009*river + 0.008*lake + 0.006*island + 0.005*mountain + 0.004*area + 0.004*park + 0.004*antarctic + 0.004*south + 0.004*mountains + 0.004*dam
topic #1: 0.026*relay + 0.026*athletics + 0.025*metres + 0.023*freestyle + 0.022*hurdles + 0.020*ret + 0.017*divisão + 0.017*athletes + 0.016*bundesliga + 0.014*medals
topic #2: 0.002*were + 0.002*he + 0.002*court + 0.002*his + 0.002*had + 0.002*law + 0.002*government + 0.002*police + 0.002*patrolling + 0.002*their
topic #3: 0.040*courcelles + 0.035*centimeters + 0.023*mattythewhite + 0.021*wine + 0.019*stamps + 0.018*oko + 0.017*perennial + 0.014*stubs + 0.012*ovate + 0.011*greyish
topic #4: 0.039*al + 0.029*sysop + 0.019*iran + 0.015*pakistan + 0.014*ali + 0.013*arab + 0.010*islamic + 0.010*arabic + 0.010*saudi + 0.010*muhammad
topic #5: 0.020*copyrighted + 0.020*northamerica + 0.014*uncopyrighted + 0.007*rihanna + 0.005*cloudz + 0.005*knowles + 0.004*gaga + 0.004*zombie + 0.004*wigan + 0.003*maccabi
topic #6: 0.061*israel + 0.056*israeli + 0.030*sockpuppet + 0.025*jerusalem + 0.025*tel + 0.023*aviv + 0.022*palestinian + 0.019*ifk + 0.016*palestine + 0.014*hebrew
topic #7: 0.015*melbourne + 0.014*rovers + 0.013*vfl + 0.012*australian + 0.012*wanderers + 0.011*afl + 0.008*dinamo + 0.008*queensland + 0.008*tracklist + 0.008*brisbane
topic #8: 0.011*film + 0.007*her + 0.007*she + 0.004*he + 0.004*series + 0.004*his + 0.004*episode + 0.003*films + 0.003*television + 0.003*best
topic #9: 0.019*wrestling + 0.013*château + 0.013*ligue + 0.012*discus + 0.012*estonian + 0.009*uci + 0.008*hockeyarchives + 0.008*wwe + 0.008*estonia + 0.007*reign
topic #10: 0.078*edits + 0.059*notability + 0.035*archived + 0.025*clearer + 0.022*speedy + 0.021*deleted + 0.016*hook + 0.015*checkuser + 0.014*ron +
0.011*nominator
topic #11: 0.013*admins + 0.009*acid + 0.009*molniya + 0.009*chemical + 0.007*ch + 0.007*chemistry + 0.007*compound + 0.007*anemone + 0.006*mg + 0.006*reaction
topic #12: 0.018*india + 0.013*indian + 0.010*tamil + 0.009*singh + 0.008*film + 0.008*temple + 0.006*kumar + 0.006*hindi + 0.006*delhi + 0.005*bengal
topic #13: 0.047*bwebs + 0.024*malta + 0.020*hobart + 0.019*basa + 0.019*columella + 0.019*huon + 0.018*tasmania + 0.016*popups + 0.014*tasmanian + 0.014*modèle
topic #14: 0.014*jewish + 0.011*rabbi + 0.008*bgwhite + 0.008*lebanese + 0.007*lebanon + 0.006*homs + 0.005*beirut + 0.004*jews + 0.004*hebrew + 0.004*caligari
topic #15: 0.025*german + 0.020*der + 0.017*von + 0.015*und + 0.014*berlin + 0.012*germany + 0.012*die + 0.010*des + 0.008*kategorie + 0.007*cross
topic #16: 0.003*can + 0.003*system + 0.003*power + 0.003*are + 0.003*energy + 0.002*data + 0.002*be + 0.002*used + 0.002*or + 0.002*using
topic #17: 0.049*indonesia + 0.042*indonesian + 0.031*malaysia + 0.024*singapore + 0.022*greek + 0.021*jakarta + 0.016*greece + 0.015*dord + 0.014*athens + 0.011*malaysian
topic #18: 0.031*stakes + 0.029*webs + 0.018*futsal + 0.014*whitish + 0.013*hyun + 0.012*thoroughbred + 0.012*dnf + 0.012*jockey + 0.011*medalists + 0.011*racehorse
topic #19: 0.119*oblast + 0.034*uploaded + 0.034*uploads + 0.033*nordland + 0.025*selsoviet + 0.023*raion + 0.022*krai + 0.018*okrug + 0.015*hålogaland + 0.015*russiae + 0.020*manga + 0.017*dragon + 0.012*theme + 0.011*dvd + 0.011*super + 0.011*hunter + 0.009*ash + 0.009*dream + 0.009*angel
```

Creating this LDA model of Wikipedia takes about 6 hours and 20 minutes on my laptop [[1]](https://radimrehurek.com/gensim/wiki.html#id6). If you need your results faster, consider running [Distributed Latent Dirichlet Allocation](https://radimrehurek.com/gensim/dist_lda.html) on a cluster of computers.

Note two differences between the LDA and LSA runs. First, we asked LSA to extract 400 topics but LDA only 100 (so the difference in speed is in fact even greater). Second, the LSA implementation in gensim is truly online: if the nature of the input stream changes over time, LSA will re-orient itself to reflect these changes in a reasonably small number of updates. In contrast, LDA is not truly online (
despite the name of the article [[3]](https://radimrehurek.com/gensim/wiki.html#id8)), because the impact of later updates on the model gradually diminishes. If there is topic drift in the input document stream, LDA will get confused and be increasingly slower at adjusting itself to the new state of affairs.

In short, be careful if you use LDA to incrementally add new documents to the model over time. **Batch usage of LDA**, where the entire training corpus is either known beforehand or does not exhibit topic drift, **is fine and not affected**.

To run batch LDA (not online), train `LdaModel` with:

```py
>>> # extract 100 LDA topics, using 20 full passes, no online updates
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)
```

As usual, a trained model can be used to transform new, unseen documents (plain bag-of-words count vectors) into LDA topic distributions:

```py
>>> doc_lda = lda[doc_bow]
```

---

[1] *([1](https://radimrehurek.com/gensim/wiki.html#id1), [2](https://radimrehurek.com/gensim/wiki.html#id4))* My laptop = MacBook Pro, Intel Core i7 2.3GHz, 16GB DDR3 RAM, OS X with libVec.

[[2]](https://radimrehurek.com/gensim/wiki.html#id2) Here we are mostly interested in performance, but it is also interesting to look at the retrieved LSA concepts. I am no Wikipedia expert and cannot see into its inner workings, but Brian Mingus had this to say about the result:

```py
There appears to be a lot of noise in your dataset. The first three topics
in your list appear to be meta topics, concerning the administration and
cleanup of Wikipedia. These show up because you didn't exclude templates
such as these, some of which are included in most articles for quality
control: http://en.wikipedia.org/wiki/Wikipedia:Template_messages/Cleanup

The fourth and fifth topics clearly shows the influence of bots that import
massive databases of cities, countries, etc. and their statistics such as
population, capita, etc.

The sixth shows the influence of sports bots, and the seventh of music bots.
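```

So the top ten concepts are apparently dominated by Wikipedia robots and expanded templates; this is a good reminder that LSA is a powerful tool for data analysis, but no silver bullet. As always, it is [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out)... By the way, improvements to the Wiki markup parsing code are welcome :-)

[3] *([1](https://radimrehurek.com/gensim/wiki.html#id3), [2](https://radimrehurek.com/gensim/wiki.html#id5))* Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation [ [pdf](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) ] [ [code](https://www.cs.princeton.edu/~mdhoffma/) ]

--------------------------------------------------------------------------------
/blog/tutorial/5.md:
--------------------------------------------------------------------------------
# Distributed Computing

## [Why distributed computing?](https://radimrehurek.com/gensim/distributed.html#why-distributed-computing "Permalink to this headline")

Need to build a semantic representation of a corpus that is millions of documents large, and it is taking forever? Have several idle machines at your disposal that you could use? [Distributed computing](https://en.wikipedia.org/wiki/Distributed_computing) tries to accelerate computations by splitting a given task into several smaller subtasks and passing them on to several computing nodes in parallel.

In the context of gensim, computing nodes are computers identified by their IP address/port, and communication happens over TCP/IP. The whole collection of available machines is called a *cluster*. The distribution is very coarse-grained (not much communication going on), so the network is allowed to have relatively high latency.

> Warning
* The primary reason for using distributed computing is making things run faster. In gensim, most of the time-consuming work is done inside low-level linear algebra routines in NumPy, independent of any gensim code. **Installing a fast** [BLAS (Basic Linear Algebra)](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) **library for NumPy can improve performance up to 15 times!** So before you start buying those extra computers, consider installing a fast, threaded BLAS that is optimized for your particular machine (as opposed to a generic, binary-distributed library). Options include your vendor's BLAS library (Intel's MKL, AMD's ACML, OS X's vecLib, Sun's Sunperf, ...) or some open-source alternative (GotoBLAS, ATLAS).
* To see which BLAS and LAPACK you are using, type into your shell:
  `python -c 'import scipy; scipy.show_config()'`

## [Prerequisites](https://radimrehurek.com/gensim/distributed.html#prerequisites "Permalink to this headline")

For communication between nodes, gensim uses [Pyro (PYthon Remote Objects)](https://pypi.python.org/pypi/Pyro4), version >= 4.27. This is a library for low-level socket communication and remote procedure calls (RPC) in Python. Pyro4 is a pure-Python library, so its installation is quite painless and only involves copying its `*.py` files onto your Python import path:

`pip install Pyro4`

You do not have to install Pyro to run gensim, but if you don't, you will not be able to access the distributed features (that is, everything will always run in serial mode, and the examples on this page do not apply).

## [Core concepts](https://radimrehurek.com/gensim/distributed.html#core-concepts "Permalink to this headline")
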
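As always, gensim strives for a clear and straightforward API (see [Features](https://radimrehurek.com/gensim/intro.html#design)). To this end, *you do not need to make any changes in your code at all* in order to run it over a cluster of computers!

What you need to do is run a [worker](https://radimrehurek.com/gensim/distributed.html#term-worker) script (see below) on each of your cluster nodes prior to starting your computation. Running this script tells gensim that it may use the node as a subordinate to which it can delegate some work. During initialization, the algorithms inside gensim will try to look for and take over all available worker nodes.

* Node

  A logical working unit. It can correspond to a single physical machine, but you can also run multiple workers on one machine, resulting in multiple logical nodes.

* Cluster

  Several nodes which communicate over TCP/IP. Currently, network broadcasting is used to discover and connect all communicating nodes, so the nodes must lie within the same [broadcast domain](https://en.wikipedia.org/wiki/Broadcast_domain).

* Worker

  A process which is created on each node. To remove a node from your cluster, simply kill its worker process.

* Dispatcher

  The dispatcher is in charge of negotiating all computations, queueing ("dispatching") individual jobs to the workers. Computations never "talk" to worker nodes directly, only through this dispatcher. Unlike workers, there can only be one active dispatcher at a time in the cluster.

## [Available distributed algorithms](https://radimrehurek.com/gensim/distributed.html#available-distributed-algorithms "Permalink to this headline")

* [Distributed Latent Semantic Analysis](https://radimrehurek.com/gensim/dist_lsi.html)
* [Distributed Latent Dirichlet Allocation](https://radimrehurek.com/gensim/dist_lda.html)

--------------------------------------------------------------------------------
/blog/tutorial/README.md:
--------------------------------------------------------------------------------
# Tutorials

The tutorials are organized as a series of examples that highlight various features of Gensim. It is assumed that the reader is familiar with the [Python language](https://www.python.org/), has [installed gensim](/blog/Install/README.md) and has read the [introduction](/blog/Introduction/README.md).

The examples are divided into the following parts:

* [Corpora and Vector Spaces](blog/tutorial/1.md)
    * From strings to vectors
    * Corpus streaming - one document at a time
    * Corpus formats
    * Compatibility with NumPy and SciPy
* [Topics and Transformations](blog/tutorial/2.md)
    * Transformation interface
    * Available transformations
* [Similarity Queries](blog/tutorial/3.md)
    * Similarity interface
    * Where next?
* [Experiments on the English Wikipedia](blog/tutorial/4.md)
    * Preparing the corpus
    * Latent Semantic Analysis
    * Latent Dirichlet Allocation
* [Distributed Computing](blog/tutorial/5.md)
    * Why distributed computing?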
    * Prerequisites
    * Core concepts
    * Available distributed algorithms

## Preliminaries

All the examples can be copied directly into your Python interpreter shell. [IPython](http://ipython.scipy.org/)'s `cpaste` command is especially handy for copying code fragments, including the leading `>>>` characters.

Gensim uses Python's standard `logging` module to log various things at various priority levels; to activate logging (this is optional), run

```py
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
```

## Quick example

First, let's import gensim and create a small corpus of nine documents and twelve features [[1]](https://radimrehurek.com/gensim/tutorial.html#id2):

```py
>>> from gensim import corpora, models, similarities
>>>
>>> corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
...           [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
...           [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
...           [(0, 1.0), (4, 2.0), (7, 1.0)],
...           [(3, 1.0), (5, 1.0), (6, 1.0)],
...           [(9, 1.0)],
...           [(9, 1.0), (10, 1.0)],
...           [(9, 1.0), (10, 1.0), (11, 1.0)],
...           [(8, 1.0), (10, 1.0), (11, 1.0)]]
```

In Gensim, a *corpus* is simply an object which, when iterated over, returns its documents represented as sparse vectors. In this case we are using a list of lists of tuples. If you are not familiar with the [vector space model](https://en.wikipedia.org/wiki/Vector_space_model), we will bridge the gap between **raw strings**, **corpora** and **sparse vectors** in the next tutorial, on [Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html).

If you are familiar with the vector space model, you will probably know that the way you parse your documents and convert them to vectors has a major impact on the quality of any subsequent applications.

> Note:
> In this example, the whole corpus is stored in memory, as a Python list. However, the corpus interface only dictates that a corpus must support iteration over its constituent documents. For very large corpora, it is advantageous to keep the corpus on disk and access its documents sequentially, one at a time. All the operations and transformations are implemented in such a way that makes them independent of the size of the corpus, memory-wise.

Next, let's initialize a *transformation*:

```py
>>> tfidf = models.TfidfModel(corpus)
```

A transformation is used to convert documents from one vector representation into another:

```py
>>> vec = [(0, 1), (4, 1)]
>>> print(tfidf[vec])
[(0, 0.8075244), (4, 0.5898342)]
```

Here we used [Tf-Idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), a simple transformation which takes documents represented as bag-of-words counts and applies a weighting which discounts common terms (or, equivalently, promotes rare terms). It also scales the resulting vector to unit length (in the [Euclidean norm](https://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm)).

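To make the weighting concrete, the following sketch recomputes the Tf-Idf vector for the query `[(0, 1), (4, 1)]` by hand, in plain Python without gensim. It assumes the weighting is raw term frequency multiplied by `log2(N / df)`, followed by Euclidean normalization; that assumption reproduces the values printed by `tfidf[vec]` in the quick example. The helper name `tfidf_by_hand` is hypothetical and not part of gensim.

```python
import math

# the nine-document corpus from the quick example
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]


def tfidf_by_hand(vec, corpus):
    """Weight `vec` by tf * log2(N / df), then scale to unit Euclidean length."""
    num_docs = len(corpus)
    # document frequency: in how many documents each feature id occurs
    doc_freq = {}
    for doc in corpus:
        for feature_id, _ in doc:
            doc_freq[feature_id] = doc_freq.get(feature_id, 0) + 1
    weighted = [(fid, tf * math.log2(num_docs / doc_freq[fid])) for fid, tf in vec]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(fid, w / norm) for fid, w in weighted]


result = tfidf_by_hand([(0, 1), (4, 1)], corpus)
print([(fid, round(w, 4)) for fid, w in result])  # prints [(0, 0.8075), (4, 0.5898)]
```

Feature 0 occurs in two of the nine documents and feature 4 in three, so feature 0 is the rarer term and receives the larger weight.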
Transformations are covered in detail in the [Topics and Transformations](https://radimrehurek.com/gensim/tut2.html) tutorial.

To transform the whole corpus via TfIdf and index it, in preparation for similarity queries:

```py
>>> index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
```

and to query the similarity of our query vector `vec` against every document in the corpus:

```py
>>> sims = index[tfidf[vec]]
>>> print(list(enumerate(sims)))
[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
```

How to read this output? Document number zero (the first document) has a similarity score of 0.466 = 46.6%, the second document has a similarity score of 19.1%, and so on.

Thus, according to the TfIdf document representation and the cosine similarity measure, the document most similar to our query document `vec` is document no. 3, with a similarity score of 82.1%. Note that in the TfIdf representation, any documents which do not share any common features with `vec` at all (documents no. 4-8) get a similarity score of 0.0. See the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) tutorial for more detail.

---

> [1] This is the same corpus as used in [Deerwester et al. (1990): Indexing by Latent Semantic Analysis](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf), Table 2.

--------------------------------------------------------------------------------
/imgs/Introduction/develop.svg:
--------------------------------------------------------------------------------
(SVG badge; markup lost in extraction. Label text: circleci, passing)
--------------------------------------------------------------------------------
/imgs/Introduction/develop_1.svg:
--------------------------------------------------------------------------------
(SVG badge; markup lost in extraction. Label text: build, passing)
--------------------------------------------------------------------------------
/imgs/Introduction/gensim.svg:
--------------------------------------------------------------------------------
(SVG badge; markup lost in extraction. Label text: build, passing)
--------------------------------------------------------------------------------
/imgs/gensim.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apachecn/gensim-doc-zh/69158825f7bd8cce00288e06576e943528037b6b/imgs/gensim.png
--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
now loading...
--------------------------------------------------------------------------------
/update.sh:
--------------------------------------------------------------------------------
git add -A
git commit -am "$(date "+%Y-%m-%d %H:%M:%S")"
git push
--------------------------------------------------------------------------------