├── .gitignore
├── .travis.yml
├── AUTHORS
├── CHANGELOG
├── INSTALL
├── LICENSE
├── MANIFEST.in
├── README.rst
├── cluster.bmp
├── cluster
    ├── __init__.py
    ├── cluster.py
    ├── linkage.py
    ├── matrix.py
    ├── method
    │   ├── __init__.py
    │   ├── base.py
    │   ├── hierarchical.py
    │   └── kmeans.py
    ├── test
    │   ├── test_hierarchical.py
    │   ├── test_kmeans.py
    │   ├── test_linkage.py
    │   └── test_numpy.py
    ├── util.py
    └── version.txt
├── dev-requirements.txt
├── docs
    ├── Makefile
    ├── _static
    │   └── .gitkeep
    ├── apidoc
    │   ├── cluster.matrix.rst
    │   ├── cluster.method.base.rst
    │   ├── cluster.method.hierarchical.rst
    │   ├── cluster.method.kmeans.rst
    │   ├── cluster.rst
    │   └── cluster.util.rst
    ├── changelog.rst
    ├── conf.py
    └── index.rst
├── fabfile.py
├── makedist.sh
├── pytest.ini
├── setup.cfg
├── setup.py
└── tox.ini


/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | /*.egg-info
3 | /.cache
4 | /.pytest_cache
5 | /.tox
6 | /MANIFEST
7 | /build
8 | /dist
9 | /docs/_build
10 | /env
11 | /env3
12 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | arch:
2 |   - amd64
3 |   - ppc64le
4 | language: python
5 | python:
6 |   - "2.7"
7 |   - "3.5"
8 |   - "3.6"
9 |   - "3.7"
10 |   - "3.8"
11 |   - "3.9"
12 |   - "nightly"
13 | install: pip install tox-travis
14 | script: "tox"
15 |
--------------------------------------------------------------------------------
/AUTHORS:
--------------------------------------------------------------------------------
1 | Michel Albert (exhuma@users.sourceforge.net)
2 | Sam Sandberg (@LoisaidaSam)
--------------------------------------------------------------------------------
/CHANGELOG:
--------------------------------------------------------------------------------
1 | Release 1.4.1.post3
2 | ===================
3 |
4 | This is a "house-keeping" commit. No new features or fixes are introduced.
5 |
6 | * Update CI test rules to include amd64 and ppc (santosh653)
7 |
8 |
9 | Release 1.4.1.post2
10 | ===================
11 |
12 | This is a "house-keeping" commit. No new features or fixes are introduced.
13 |
14 | * Update changelog.
15 | * Removed the ``Pipfile`` which was introduced in ``1.4.1.post1``. The file
16 |   caused false positives on security checks. Additionally, having a ``Pipfile``
17 |   is mainly useful in applications, and not in libraries like this one.
18 |
19 | Release 1.4.1.post1
20 | ===================
21 |
22 | This is a "house-keeping" commit. No new features or fixes are introduced.
23 |
24 | * Update changelog.
25 | * Switch doc-building to use ``pipenv`` & update ``Pipfile`` accordingly.
26 |
27 | Release 1.4.1
28 | =============
29 |
30 | * Fix clustering of dictionaries. See GitHub issue #28 (Tim Littlefair).
31 |
32 | Release 1.4.0
33 | =============
34 |
35 | * Added a "display" method to hierarchical clusters (by 1kastner).
36 |
37 | Release 1.3.2 & 1.3.3
38 | =====================
39 |
40 | * Fix regression introduced in 1.3.1 related to package version metadata.
41 |
42 | Release 1.3.1
43 | =============
44 |
45 | * Don't break if the cluster is initiated with iterable elements (GitHub Issue
46 |   #20).
47 | * Fix package version metadata in setup.py
48 |
49 | Release 1.3.0
50 | =============
51 |
52 | * Performance improvements for hierarchical clustering (at the cost of memory)
53 | * Cluster instances are now iterable. Iteration visits each element,
54 |   resulting in a flat list of items.
55 | * New option to specify a progress callback for hierarchical clustering. This
56 |   callback will be called on each iteration for hierarchical clusters. It gets
57 |   two numeric values as arguments: the total count of elements, and the number
58 |   of processed elements. It gives users a way to present the progress on screen.
59 | * The library now also has a ``__version__`` member.
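The progress callback described under Release 1.3.0 can be pictured with a small, self-contained sketch. It does not use the library itself: ``cluster_with_progress`` is a hypothetical helper written only for illustration, it uses a naive single-linkage merge loop, and it assumes that each merge step counts as one "processed" element. The only part taken from the changelog is the callback shape, which receives two numbers (the total count and the processed count).

```python
def cluster_with_progress(items, distance, progress_callback=None):
    """Naively merge the two closest groups until one group remains.

    Hypothetical helper for illustration only; this is not the
    library's implementation.
    """
    groups = [[item] for item in items]
    total = len(groups)
    processed = 0
    while len(groups) > 1:
        # Find the pair of groups with the smallest single-linkage distance.
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = min(distance(a, b) for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
        processed += 1  # assumption: one merge counts as one processed element
        if progress_callback is not None:
            # Two numeric arguments: total element count, processed count.
            progress_callback(total, processed)
    return groups[0] if groups else []


# Record every progress report instead of printing it.
calls = []
result = cluster_with_progress(
    [1, 2, 10, 11, 30],
    lambda a, b: abs(a - b),
    progress_callback=lambda total, done: calls.append((total, done)),
)
```

A caller that wants on-screen feedback could pass a function that prints ``done / total`` instead of appending to a list.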
60 |
61 |
62 | Release 1.2.2
63 | =============
64 |
65 | * Package metadata fixed.
66 |
67 | Release 1.2.1
68 | =============
69 |
70 | * Fixed an issue in multiprocessing code.
71 |
72 | Release 1.2.0
73 | =============
74 |
75 | * Multiprocessing (by loisaidasam)
76 | * Python 3 support
77 | * Split up one big file into smaller more logical sub-modules
78 | * Fixed https://github.com/exhuma/python-cluster/issues/11
79 | * Documentation update.
80 | * Migrated to GitHub
81 |
82 | Release 1.1.1b3
83 | ===============
84 |
85 | * Fixed bug #1727558
86 | * Some more unit-tests
87 | * ValueError changed to ClusteringError where appropriate
88 |
89 | Release 1.1.1b2
90 | ===============
91 |
92 | * Fixed bug #1604859 (thanks to Willi Richert for reporting it)
93 |
94 | Release 1.1.1b1
95 | ===============
96 |
97 | * Applied SVN patch [1535137] (thanks ajaksu)
98 |
99 |   * Topology output supported
100 |   * ``data`` and ``raw_data`` are now properties.
101 |
102 | Release 1.1.0b1
103 | ===============
104 |
105 | * KMeans Clustering implemented for simple numeric tuples.
106 |
107 |   Data in the form ``[(1,1), (2,1), (5,3), ...]`` can be clustered.
108 |
109 |   Usage::
110 |
111 |     >>> from cluster import KMeansClustering
112 |     >>> cl = KMeansClustering([(1,1), (2,1), (5,3), ...])
113 |     >>> clusters = cl.getclusters(2)
114 |
115 |   The method ``getclusters`` takes the number of clusters you would like to
116 |   have as its parameter.
117 |
118 |   Only numeric values are supported in the tuples. The reason for this is
119 |   that the "centroid" method which I use essentially returns a tuple of
120 |   floats, so you will lose any other kind of metadata. Once I figure out
121 |   how to recode that method, other types should be possible.
122 |
123 | Release 1.0.1b2
124 | ===============
125 |
126 | * Optimized calculation of the hierarchical clustering by using the fact that
127 |   the generated matrix is symmetrical.
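The symmetry optimisation mentioned under Release 1.0.1b2 can be sketched as follows. This is only an illustration of the general idea, not the library's code: ``pairwise_distances`` is a hypothetical name, and the sketch assumes that the distance function is symmetric and that an item's distance to itself is zero.

```python
def pairwise_distances(items, distance):
    """Build a full distance matrix, calling `distance` only for i < j.

    Hypothetical helper for illustration. It assumes
    distance(a, b) == distance(b, a) and distance(a, a) == 0.
    """
    n = len(items)
    matrix = [[0] * n for _ in range(n)]
    calls = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(items[i], items[j])
            calls += 1
            matrix[i][j] = d
            matrix[j][i] = d  # mirror the value instead of recomputing it
    return matrix, calls


matrix, calls = pairwise_distances([1, 4, 9], lambda a, b: abs(a - b))
```

For ``n`` items this evaluates the distance function ``n * (n - 1) / 2`` times instead of ``n * n`` times.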
128 |
129 | Release 1.0.1b1
130 | ===============
131 |
132 | * Implemented complete-, average-, and uclus-linkage methods. You can select
133 |   one by specifying it in the constructor, for example::
134 |
135 |       cl = HierarchicalClustering(data, distfunc, linkage='uclus')
136 |
137 |   or by setting it before starting the clustering process::
138 |
139 |       cl = HierarchicalClustering(data, distfunc)
140 |       cl.setLinkageMethod('uclus')
141 |       cl.cluster()
142 |
143 | * Clustering is not executed on object creation, but on the first call of
144 |   ``getlevel``. You can force the creation of the clusters by calling the
145 |   ``cluster`` method as shown above.
146 |
147 | .. vim: filetype=rst :
148 |
--------------------------------------------------------------------------------
/INSTALL:
--------------------------------------------------------------------------------
1 | INSTALLATION
2 | ============
3 |
4 | Simply run::
5 |
6 |     pip install cluster
7 |
8 | Or, if you run it in a virtualenv::
9 |
10 |     /path/to/your/env/bin/pip install cluster
11 |
12 |
13 | Source installation
14 | ~~~~~~~~~~~~~~~~~~~
15 |
16 | Untar the archive::
17 |
18 |     tar xf
19 |
20 | Next, go to the folder just created (it has the same name as the package,
21 | for example "cluster-1.2.2") and run::
22 |
23 |     python setup.py install
24 |
25 | This will require superuser privileges unless you install it in a virtual environment::
26 |
27 |     /path/to/your/env/bin/python setup.py install
28 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU LESSER GENERAL PUBLIC LICENSE
2 | Version 2.1, February 1999
3 |
4 | Copyright (C) 1991, 1999 Free Software Foundation, Inc.
5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
6 | Everyone is permitted to copy and distribute verbatim copies
7 | of this license document, but changing it is not allowed.
8 | 9 | [This is the first released version of the Lesser GPL. It also counts 10 | as the successor of the GNU Library Public License, version 2, hence 11 | the version number 2.1.] 12 | 13 | Preamble 14 | 15 | The licenses for most software are designed to take away your 16 | freedom to share and change it. By contrast, the GNU General Public 17 | Licenses are intended to guarantee your freedom to share and change 18 | free software--to make sure the software is free for all its users. 19 | 20 | This license, the Lesser General Public License, applies to some 21 | specially designated software packages--typically libraries--of the 22 | Free Software Foundation and other authors who decide to use it. You 23 | can use it too, but we suggest you first think carefully about whether 24 | this license or the ordinary General Public License is the better 25 | strategy to use in any particular case, based on the explanations below. 26 | 27 | When we speak of free software, we are referring to freedom of use, 28 | not price. Our General Public Licenses are designed to make sure that 29 | you have the freedom to distribute copies of free software (and charge 30 | for this service if you wish); that you receive source code or can get 31 | it if you want it; that you can change the software and use pieces of 32 | it in new free programs; and that you are informed that you can do 33 | these things. 34 | 35 | To protect your rights, we need to make restrictions that forbid 36 | distributors to deny you these rights or to ask you to surrender these 37 | rights. These restrictions translate to certain responsibilities for 38 | you if you distribute copies of the library or if you modify it. 39 | 40 | For example, if you distribute copies of the library, whether gratis 41 | or for a fee, you must give the recipients all the rights that we gave 42 | you. You must make sure that they, too, receive or can get the source 43 | code. 
If you link other code with the library, you must provide 44 | complete object files to the recipients, so that they can relink them 45 | with the library after making changes to the library and recompiling 46 | it. And you must show them these terms so they know their rights. 47 | 48 | We protect your rights with a two-step method: (1) we copyright the 49 | library, and (2) we offer you this license, which gives you legal 50 | permission to copy, distribute and/or modify the library. 51 | 52 | To protect each distributor, we want to make it very clear that 53 | there is no warranty for the free library. Also, if the library is 54 | modified by someone else and passed on, the recipients should know 55 | that what they have is not the original version, so that the original 56 | author's reputation will not be affected by problems that might be 57 | introduced by others. 58 | 59 | Finally, software patents pose a constant threat to the existence of 60 | any free program. We wish to make sure that a company cannot 61 | effectively restrict the users of a free program by obtaining a 62 | restrictive license from a patent holder. Therefore, we insist that 63 | any patent license obtained for a version of the library must be 64 | consistent with the full freedom of use specified in this license. 65 | 66 | Most GNU software, including some libraries, is covered by the 67 | ordinary GNU General Public License. This license, the GNU Lesser 68 | General Public License, applies to certain designated libraries, and 69 | is quite different from the ordinary General Public License. We use 70 | this license for certain libraries in order to permit linking those 71 | libraries into non-free programs. 72 | 73 | When a program is linked with a library, whether statically or using 74 | a shared library, the combination of the two is legally speaking a 75 | combined work, a derivative of the original library. 
The ordinary 76 | General Public License therefore permits such linking only if the 77 | entire combination fits its criteria of freedom. The Lesser General 78 | Public License permits more lax criteria for linking other code with 79 | the library. 80 | 81 | We call this license the "Lesser" General Public License because it 82 | does Less to protect the user's freedom than the ordinary General 83 | Public License. It also provides other free software developers Less 84 | of an advantage over competing non-free programs. These disadvantages 85 | are the reason we use the ordinary General Public License for many 86 | libraries. However, the Lesser license provides advantages in certain 87 | special circumstances. 88 | 89 | For example, on rare occasions, there may be a special need to 90 | encourage the widest possible use of a certain library, so that it becomes 91 | a de-facto standard. To achieve this, non-free programs must be 92 | allowed to use the library. A more frequent case is that a free 93 | library does the same job as widely used non-free libraries. In this 94 | case, there is little to gain by limiting the free library to free 95 | software only, so we use the Lesser General Public License. 96 | 97 | In other cases, permission to use a particular library in non-free 98 | programs enables a greater number of people to use a large body of 99 | free software. For example, permission to use the GNU C Library in 100 | non-free programs enables many more people to use the whole GNU 101 | operating system, as well as its variant, the GNU/Linux operating 102 | system. 103 | 104 | Although the Lesser General Public License is Less protective of the 105 | users' freedom, it does ensure that the user of a program that is 106 | linked with the Library has the freedom and the wherewithal to run 107 | that program using a modified version of the Library. 108 | 109 | The precise terms and conditions for copying, distribution and 110 | modification follow. 
Pay close attention to the difference between a 111 | "work based on the library" and a "work that uses the library". The 112 | former contains code derived from the library, whereas the latter must 113 | be combined with the library in order to run. 114 | 115 | GNU LESSER GENERAL PUBLIC LICENSE 116 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 117 | 118 | 0. This License Agreement applies to any software library or other 119 | program which contains a notice placed by the copyright holder or 120 | other authorized party saying it may be distributed under the terms of 121 | this Lesser General Public License (also called "this License"). 122 | Each licensee is addressed as "you". 123 | 124 | A "library" means a collection of software functions and/or data 125 | prepared so as to be conveniently linked with application programs 126 | (which use some of those functions and data) to form executables. 127 | 128 | The "Library", below, refers to any such software library or work 129 | which has been distributed under these terms. A "work based on the 130 | Library" means either the Library or any derivative work under 131 | copyright law: that is to say, a work containing the Library or a 132 | portion of it, either verbatim or with modifications and/or translated 133 | straightforwardly into another language. (Hereinafter, translation is 134 | included without limitation in the term "modification".) 135 | 136 | "Source code" for a work means the preferred form of the work for 137 | making modifications to it. For a library, complete source code means 138 | all the source code for all modules it contains, plus any associated 139 | interface definition files, plus the scripts used to control compilation 140 | and installation of the library. 141 | 142 | Activities other than copying, distribution and modification are not 143 | covered by this License; they are outside its scope. 
The act of 144 | running a program using the Library is not restricted, and output from 145 | such a program is covered only if its contents constitute a work based 146 | on the Library (independent of the use of the Library in a tool for 147 | writing it). Whether that is true depends on what the Library does 148 | and what the program that uses the Library does. 149 | 150 | 1. You may copy and distribute verbatim copies of the Library's 151 | complete source code as you receive it, in any medium, provided that 152 | you conspicuously and appropriately publish on each copy an 153 | appropriate copyright notice and disclaimer of warranty; keep intact 154 | all the notices that refer to this License and to the absence of any 155 | warranty; and distribute a copy of this License along with the 156 | Library. 157 | 158 | You may charge a fee for the physical act of transferring a copy, 159 | and you may at your option offer warranty protection in exchange for a 160 | fee. 161 | 162 | 2. You may modify your copy or copies of the Library or any portion 163 | of it, thus forming a work based on the Library, and copy and 164 | distribute such modifications or work under the terms of Section 1 165 | above, provided that you also meet all of these conditions: 166 | 167 | a) The modified work must itself be a software library. 168 | 169 | b) You must cause the files modified to carry prominent notices 170 | stating that you changed the files and the date of any change. 171 | 172 | c) You must cause the whole of the work to be licensed at no 173 | charge to all third parties under the terms of this License. 
174 | 175 | d) If a facility in the modified Library refers to a function or a 176 | table of data to be supplied by an application program that uses 177 | the facility, other than as an argument passed when the facility 178 | is invoked, then you must make a good faith effort to ensure that, 179 | in the event an application does not supply such function or 180 | table, the facility still operates, and performs whatever part of 181 | its purpose remains meaningful. 182 | 183 | (For example, a function in a library to compute square roots has 184 | a purpose that is entirely well-defined independent of the 185 | application. Therefore, Subsection 2d requires that any 186 | application-supplied function or table used by this function must 187 | be optional: if the application does not supply it, the square 188 | root function must still compute square roots.) 189 | 190 | These requirements apply to the modified work as a whole. If 191 | identifiable sections of that work are not derived from the Library, 192 | and can be reasonably considered independent and separate works in 193 | themselves, then this License, and its terms, do not apply to those 194 | sections when you distribute them as separate works. But when you 195 | distribute the same sections as part of a whole which is a work based 196 | on the Library, the distribution of the whole must be on the terms of 197 | this License, whose permissions for other licensees extend to the 198 | entire whole, and thus to each and every part regardless of who wrote 199 | it. 200 | 201 | Thus, it is not the intent of this section to claim rights or contest 202 | your rights to work written entirely by you; rather, the intent is to 203 | exercise the right to control the distribution of derivative or 204 | collective works based on the Library. 
205 | 206 | In addition, mere aggregation of another work not based on the Library 207 | with the Library (or with a work based on the Library) on a volume of 208 | a storage or distribution medium does not bring the other work under 209 | the scope of this License. 210 | 211 | 3. You may opt to apply the terms of the ordinary GNU General Public 212 | License instead of this License to a given copy of the Library. To do 213 | this, you must alter all the notices that refer to this License, so 214 | that they refer to the ordinary GNU General Public License, version 2, 215 | instead of to this License. (If a newer version than version 2 of the 216 | ordinary GNU General Public License has appeared, then you can specify 217 | that version instead if you wish.) Do not make any other change in 218 | these notices. 219 | 220 | Once this change is made in a given copy, it is irreversible for 221 | that copy, so the ordinary GNU General Public License applies to all 222 | subsequent copies and derivative works made from that copy. 223 | 224 | This option is useful when you wish to copy part of the code of 225 | the Library into a program that is not a library. 226 | 227 | 4. You may copy and distribute the Library (or a portion or 228 | derivative of it, under Section 2) in object code or executable form 229 | under the terms of Sections 1 and 2 above provided that you accompany 230 | it with the complete corresponding machine-readable source code, which 231 | must be distributed under the terms of Sections 1 and 2 above on a 232 | medium customarily used for software interchange. 233 | 234 | If distribution of object code is made by offering access to copy 235 | from a designated place, then offering equivalent access to copy the 236 | source code from the same place satisfies the requirement to 237 | distribute the source code, even though third parties are not 238 | compelled to copy the source along with the object code. 239 | 240 | 5. 
A program that contains no derivative of any portion of the 241 | Library, but is designed to work with the Library by being compiled or 242 | linked with it, is called a "work that uses the Library". Such a 243 | work, in isolation, is not a derivative work of the Library, and 244 | therefore falls outside the scope of this License. 245 | 246 | However, linking a "work that uses the Library" with the Library 247 | creates an executable that is a derivative of the Library (because it 248 | contains portions of the Library), rather than a "work that uses the 249 | library". The executable is therefore covered by this License. 250 | Section 6 states terms for distribution of such executables. 251 | 252 | When a "work that uses the Library" uses material from a header file 253 | that is part of the Library, the object code for the work may be a 254 | derivative work of the Library even though the source code is not. 255 | Whether this is true is especially significant if the work can be 256 | linked without the Library, or if the work is itself a library. The 257 | threshold for this to be true is not precisely defined by law. 258 | 259 | If such an object file uses only numerical parameters, data 260 | structure layouts and accessors, and small macros and small inline 261 | functions (ten lines or less in length), then the use of the object 262 | file is unrestricted, regardless of whether it is legally a derivative 263 | work. (Executables containing this object code plus portions of the 264 | Library will still fall under Section 6.) 265 | 266 | Otherwise, if the work is a derivative of the Library, you may 267 | distribute the object code for the work under the terms of Section 6. 268 | Any executables containing that work also fall under Section 6, 269 | whether or not they are linked directly with the Library itself. 270 | 271 | 6. 
As an exception to the Sections above, you may also combine or 272 | link a "work that uses the Library" with the Library to produce a 273 | work containing portions of the Library, and distribute that work 274 | under terms of your choice, provided that the terms permit 275 | modification of the work for the customer's own use and reverse 276 | engineering for debugging such modifications. 277 | 278 | You must give prominent notice with each copy of the work that the 279 | Library is used in it and that the Library and its use are covered by 280 | this License. You must supply a copy of this License. If the work 281 | during execution displays copyright notices, you must include the 282 | copyright notice for the Library among them, as well as a reference 283 | directing the user to the copy of this License. Also, you must do one 284 | of these things: 285 | 286 | a) Accompany the work with the complete corresponding 287 | machine-readable source code for the Library including whatever 288 | changes were used in the work (which must be distributed under 289 | Sections 1 and 2 above); and, if the work is an executable linked 290 | with the Library, with the complete machine-readable "work that 291 | uses the Library", as object code and/or source code, so that the 292 | user can modify the Library and then relink to produce a modified 293 | executable containing the modified Library. (It is understood 294 | that the user who changes the contents of definitions files in the 295 | Library will not necessarily be able to recompile the application 296 | to use the modified definitions.) 297 | 298 | b) Use a suitable shared library mechanism for linking with the 299 | Library. 
A suitable mechanism is one that (1) uses at run time a 300 | copy of the library already present on the user's computer system, 301 | rather than copying library functions into the executable, and (2) 302 | will operate properly with a modified version of the library, if 303 | the user installs one, as long as the modified version is 304 | interface-compatible with the version that the work was made with. 305 | 306 | c) Accompany the work with a written offer, valid for at 307 | least three years, to give the same user the materials 308 | specified in Subsection 6a, above, for a charge no more 309 | than the cost of performing this distribution. 310 | 311 | d) If distribution of the work is made by offering access to copy 312 | from a designated place, offer equivalent access to copy the above 313 | specified materials from the same place. 314 | 315 | e) Verify that the user has already received a copy of these 316 | materials or that you have already sent this user a copy. 317 | 318 | For an executable, the required form of the "work that uses the 319 | Library" must include any data and utility programs needed for 320 | reproducing the executable from it. However, as a special exception, 321 | the materials to be distributed need not include anything that is 322 | normally distributed (in either source or binary form) with the major 323 | components (compiler, kernel, and so on) of the operating system on 324 | which the executable runs, unless that component itself accompanies 325 | the executable. 326 | 327 | It may happen that this requirement contradicts the license 328 | restrictions of other proprietary libraries that do not normally 329 | accompany the operating system. Such a contradiction means you cannot 330 | use both them and the Library together in an executable that you 331 | distribute. 332 | 333 | 7. 
You may place library facilities that are a work based on the 334 | Library side-by-side in a single library together with other library 335 | facilities not covered by this License, and distribute such a combined 336 | library, provided that the separate distribution of the work based on 337 | the Library and of the other library facilities is otherwise 338 | permitted, and provided that you do these two things: 339 | 340 | a) Accompany the combined library with a copy of the same work 341 | based on the Library, uncombined with any other library 342 | facilities. This must be distributed under the terms of the 343 | Sections above. 344 | 345 | b) Give prominent notice with the combined library of the fact 346 | that part of it is a work based on the Library, and explaining 347 | where to find the accompanying uncombined form of the same work. 348 | 349 | 8. You may not copy, modify, sublicense, link with, or distribute 350 | the Library except as expressly provided under this License. Any 351 | attempt otherwise to copy, modify, sublicense, link with, or 352 | distribute the Library is void, and will automatically terminate your 353 | rights under this License. However, parties who have received copies, 354 | or rights, from you under this License will not have their licenses 355 | terminated so long as such parties remain in full compliance. 356 | 357 | 9. You are not required to accept this License, since you have not 358 | signed it. However, nothing else grants you permission to modify or 359 | distribute the Library or its derivative works. These actions are 360 | prohibited by law if you do not accept this License. Therefore, by 361 | modifying or distributing the Library (or any work based on the 362 | Library), you indicate your acceptance of this License to do so, and 363 | all its terms and conditions for copying, distributing or modifying 364 | the Library or works based on it. 365 | 366 | 10. 
Each time you redistribute the Library (or any work based on the 367 | Library), the recipient automatically receives a license from the 368 | original licensor to copy, distribute, link with or modify the Library 369 | subject to these terms and conditions. You may not impose any further 370 | restrictions on the recipients' exercise of the rights granted herein. 371 | You are not responsible for enforcing compliance by third parties with 372 | this License. 373 | 374 | 11. If, as a consequence of a court judgment or allegation of patent 375 | infringement or for any other reason (not limited to patent issues), 376 | conditions are imposed on you (whether by court order, agreement or 377 | otherwise) that contradict the conditions of this License, they do not 378 | excuse you from the conditions of this License. If you cannot 379 | distribute so as to satisfy simultaneously your obligations under this 380 | License and any other pertinent obligations, then as a consequence you 381 | may not distribute the Library at all. For example, if a patent 382 | license would not permit royalty-free redistribution of the Library by 383 | all those who receive copies directly or indirectly through you, then 384 | the only way you could satisfy both it and this License would be to 385 | refrain entirely from distribution of the Library. 386 | 387 | If any portion of this section is held invalid or unenforceable under any 388 | particular circumstance, the balance of the section is intended to apply, 389 | and the section as a whole is intended to apply in other circumstances. 390 | 391 | It is not the purpose of this section to induce you to infringe any 392 | patents or other property right claims or to contest validity of any 393 | such claims; this section has the sole purpose of protecting the 394 | integrity of the free software distribution system which is 395 | implemented by public license practices. 
Many people have made 396 | generous contributions to the wide range of software distributed 397 | through that system in reliance on consistent application of that 398 | system; it is up to the author/donor to decide if he or she is willing 399 | to distribute software through any other system and a licensee cannot 400 | impose that choice. 401 | 402 | This section is intended to make thoroughly clear what is believed to 403 | be a consequence of the rest of this License. 404 | 405 | 12. If the distribution and/or use of the Library is restricted in 406 | certain countries either by patents or by copyrighted interfaces, the 407 | original copyright holder who places the Library under this License may add 408 | an explicit geographical distribution limitation excluding those countries, 409 | so that distribution is permitted only in or among countries not thus 410 | excluded. In such case, this License incorporates the limitation as if 411 | written in the body of this License. 412 | 413 | 13. The Free Software Foundation may publish revised and/or new 414 | versions of the Lesser General Public License from time to time. 415 | Such new versions will be similar in spirit to the present version, 416 | but may differ in detail to address new problems or concerns. 417 | 418 | Each version is given a distinguishing version number. If the Library 419 | specifies a version number of this License which applies to it and 420 | "any later version", you have the option of following the terms and 421 | conditions either of that version or of any later version published by 422 | the Free Software Foundation. If the Library does not specify a 423 | license version number, you may choose any version ever published by 424 | the Free Software Foundation. 425 | 426 | 14. If you wish to incorporate parts of the Library into other free 427 | programs whose distribution conditions are incompatible with these, 428 | write to the author to ask for permission. 
For software which is 429 | copyrighted by the Free Software Foundation, write to the Free 430 | Software Foundation; we sometimes make exceptions for this. Our 431 | decision will be guided by the two goals of preserving the free status 432 | of all derivatives of our free software and of promoting the sharing 433 | and reuse of software generally. 434 | 435 | NO WARRANTY 436 | 437 | 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO 438 | WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 439 | EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR 440 | OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY 441 | KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE 442 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 443 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE 444 | LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME 445 | THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 446 | 447 | 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN 448 | WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY 449 | AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU 450 | FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR 451 | CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE 452 | LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING 453 | RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A 454 | FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF 455 | SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH 456 | DAMAGES. 
457 | 458 | END OF TERMS AND CONDITIONS 459 | 460 | How to Apply These Terms to Your New Libraries 461 | 462 | If you develop a new library, and you want it to be of the greatest 463 | possible use to the public, we recommend making it free software that 464 | everyone can redistribute and change. You can do so by permitting 465 | redistribution under these terms (or, alternatively, under the terms of the 466 | ordinary General Public License). 467 | 468 | To apply these terms, attach the following notices to the library. It is 469 | safest to attach them to the start of each source file to most effectively 470 | convey the exclusion of warranty; and each file should have at least the 471 | "copyright" line and a pointer to where the full notice is found. 472 | 473 | <one line to give the library's name and a brief idea of what it does.> 474 | Copyright (C) <year> <name of author> 475 | 476 | This library is free software; you can redistribute it and/or 477 | modify it under the terms of the GNU Lesser General Public 478 | License as published by the Free Software Foundation; either 479 | version 2.1 of the License, or (at your option) any later version. 480 | 481 | This library is distributed in the hope that it will be useful, 482 | but WITHOUT ANY WARRANTY; without even the implied warranty of 483 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 484 | Lesser General Public License for more details. 485 | 486 | You should have received a copy of the GNU Lesser General Public 487 | License along with this library; if not, write to the Free Software 488 | Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 489 | 490 | Also add information on how to contact you by electronic and paper mail. 491 | 492 | You should also get your employer (if you work as a programmer) or your 493 | school, if any, to sign a "copyright disclaimer" for the library, if 494 | necessary.
Here is a sample; alter the names: 495 | 496 | Yoyodyne, Inc., hereby disclaims all copyright interest in the 497 | library `Frob' (a library for tweaking knobs) written by James Random Hacker. 498 | 499 | <signature of Ty Coon>, 1 April 1990 500 | Ty Coon, President of Vice 501 | 502 | That's all there is to it! 503 | 504 | 505 | 506 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.rst LICENSE CHANGELOG 2 | include cluster.bmp 3 | include cluster/version.txt 4 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | DESCRIPTION 2 | =========== 3 | 4 | .. image:: https://readthedocs.org/projects/python-cluster/badge/?version=latest 5 | :target: http://python-cluster.readthedocs.org 6 | :alt: Documentation Status 7 | 8 | python-cluster is a "simple" package that allows you to create several groups 9 | (clusters) of objects from a list. It's meant to be flexible and able to 10 | cluster any object. To ensure this kind of flexibility, you need not only to 11 | supply the list of objects, but also a function that calculates the similarity 12 | between two of those objects. For simple datatypes, like integers, this can be 13 | as simple as a subtraction, but more complex calculations are possible. Right 14 | now, it is possible to generate the clusters using a hierarchical clustering 15 | and the popular K-Means algorithm. For the hierarchical algorithm there are 16 | different "linkage" (single, complete, average and uclus) methods available. 17 | 18 | Algorithms are based on the document found at 19 | http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/ 20 | 21 | ..
note:: 22 | The above site is no longer available, but you can still view it in the 23 | Internet Archive at: 24 | https://web.archive.org/web/20070912040206/http://home.dei.polimi.it//matteucc/Clustering/tutorial_html/ 25 | 26 | 27 | USAGE 28 | ===== 29 | 30 | A simple Python program could look like this:: 31 | 32 | >>> from cluster import HierarchicalClustering 33 | >>> data = [12,34,23,32,46,96,13] 34 | >>> cl = HierarchicalClustering(data, lambda x,y: abs(x-y)) 35 | >>> cl.getlevel(10) # get clusters of items closer than 10 36 | [96, 46, [12, 13, 23, 34, 32]] 37 | >>> cl.getlevel(5) # get clusters of items closer than 5 38 | [96, 46, [12, 13], 23, [34, 32]] 39 | 40 | Note that when you retrieve a set of clusters, it immediately starts the 41 | clustering process, which is quite complex. If you intend to create clusters 42 | from a large dataset, consider doing that in a separate thread. 43 | 44 | For K-Means clustering it would look like this:: 45 | 46 | >>> from cluster import KMeansClustering 47 | >>> cl = KMeansClustering([(1,1), (2,1), (5,3), ...]) 48 | >>> clusters = cl.getclusters(2) 49 | 50 | The parameter passed to getclusters is the count of clusters generated. 51 | 52 | 53 | .. image:: https://readthedocs.org/projects/python-cluster/badge/?version=latest 54 | :target: http://python-cluster.readthedocs.org 55 | :alt: Documentation Status 56 | -------------------------------------------------------------------------------- /cluster.bmp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/exhuma/python-cluster/2739ff420ef5bf8fba53f67788453b8239f16c9a/cluster.bmp -------------------------------------------------------------------------------- /cluster/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together.
3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it 6 | # under the terms of the GNU Lesser General Public License as published by the 7 | # Free Software Foundation; either version 2.1 of the License, or (at your 8 | # option) any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | # FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 12 | # for more details. 13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, 15 | # Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | 19 | from pkg_resources import resource_string 20 | 21 | from .method.hierarchical import HierarchicalClustering 22 | from .method.kmeans import KMeansClustering 23 | from .util import ClusteringError 24 | 25 | __version__ = resource_string('cluster', 'version.txt').decode('ascii').strip() 26 | -------------------------------------------------------------------------------- /cluster/cluster.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it 6 | # under the terms of the GNU Lesser General Public License as published by the 7 | # Free Software Foundation; either version 2.1 of the License, or (at your 8 | # option) any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | # FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 12 | # for more details. 
13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, 15 | # Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | from __future__ import print_function 19 | 20 | from .util import fullyflatten 21 | 22 | 23 | class Cluster(object): 24 | """ 25 | A collection of items. This is internally used to detect clustered items 26 | in the data so we could distinguish other collection types (lists, dicts, 27 | ...) from the actual clusters. This means that you could also create 28 | clusters of lists with this class. 29 | """ 30 | 31 | def __repr__(self): 32 | return "<Cluster@%s(%s)>" % (self.level, self.items) 33 | 34 | def __str__(self): 35 | return self.__repr__() 36 | 37 | def __init__(self, level, *args): 38 | """ 39 | Constructor 40 | 41 | :param level: The level of this cluster. This is used in hierarchical 42 | clustering to retrieve a specific set of clusters. The higher the 43 | level, the smaller the count of clusters returned. The level depends 44 | on the difference function used. 45 | :param *args: every additional argument passed following the level value 46 | will get added as item to the cluster. You could also pass a list as 47 | second parameter to initialise the cluster with that list as content 48 | """ 49 | self.level = level 50 | if len(args) == 0: 51 | self.items = [] 52 | else: 53 | self.items = args 54 | 55 | def __iter__(self): 56 | for item in self.items: 57 | if isinstance(item, Cluster): 58 | for recursed_item in item: 59 | yield recursed_item 60 | else: 61 | yield item 62 | 63 | def display(self, depth=0): 64 | """ 65 | Pretty-prints this cluster. Useful for debugging.
66 | """ 67 | print(depth * " " + "[level %s]" % self.level) 68 | for item in self.items: 69 | if isinstance(item, Cluster): 70 | item.display(depth + 1) 71 | else: 72 | print(depth * " " + "%s" % item) 73 | 74 | def topology(self): 75 | """ 76 | Returns the structure (topology) of the cluster as tuples. 77 | 78 | Output from cl.data:: 79 | 80 | [<Cluster@...(['CVS', 81 | <Cluster@...(['34.xls', 82 | <Cluster@...([<Cluster@...(['0.txt', 83 | <Cluster@...(['ChangeLog', 'ChangeLog.txt'])>])>, 84 | <Cluster@...(['20060730.py', 85 | <Cluster@...(['.cvsignore', 86 | <Cluster@...(['About.py', <Cluster@...(['.idlerc', 87 | '.pylint.d'])>])>])>])>])>])>])>] 88 | 89 | Corresponding output from cl.topo():: 90 | 91 | ('CVS', ('34.xls', (('0.txt', ('ChangeLog', 'ChangeLog.txt')), 92 | ('20060730.py', ('.cvsignore', ('About.py', 93 | ('.idlerc', '.pylint.d'))))))) 94 | """ 95 | 96 | left = self.items[0] 97 | right = self.items[1] 98 | 99 | if isinstance(left, Cluster): 100 | first = left.topology() 101 | else: 102 | first = left 103 | 104 | if isinstance(right, Cluster): 105 | second = right.topology() 106 | else: 107 | second = right 108 | 109 | return first, second 110 | 111 | def getlevel(self, threshold): 112 | """ 113 | Retrieve all clusters up to a specific level threshold. This 114 | level-threshold represents the maximum distance between two clusters. 115 | So the lower you set this threshold, the more clusters you will 116 | receive, and the higher you set it, the fewer but bigger clusters 117 | you will receive. 118 | 119 | :param threshold: The level threshold. 120 | 121 | .. note:: 122 | It is debatable whether the value passed into this method should 123 | really be as strongly linked to the real cluster-levels as it is 124 | right now. The end-user will not know the range of this value 125 | unless s/he first inspects the top-level cluster. So instead you 126 | might argue that a value ranging from 0 to 1 might be a more 127 | useful approach.
128 | """ 129 | 130 | left = self.items[0] 131 | right = self.items[1] 132 | 133 | # if this object itself is below the threshold value we only need to 134 | # return its contents as a list 135 | if self.level <= threshold: 136 | return [fullyflatten(self.items)] 137 | 138 | # if this cluster's level is higher than the threshold we will 139 | # investigate its left and right part. Their level could be below the 140 | # threshold 141 | if isinstance(left, Cluster) and left.level <= threshold: 142 | if isinstance(right, Cluster): 143 | return [fullyflatten(left.items)] + right.getlevel(threshold) 144 | else: 145 | return [fullyflatten(left.items)] + [[right]] 146 | elif isinstance(right, Cluster) and right.level <= threshold: 147 | if isinstance(left, Cluster): 148 | return left.getlevel(threshold) + [fullyflatten(right.items)] 149 | else: 150 | return [[left]] + [fullyflatten(right.items)] 151 | 152 | # Alright. We covered the cases where one of the clusters was below 153 | # the threshold value. Now we'll deal with the clusters that are above 154 | # by recursively applying the previous cases. 155 | if isinstance(left, Cluster) and isinstance(right, Cluster): 156 | return left.getlevel(threshold) + right.getlevel(threshold) 157 | elif isinstance(left, Cluster): 158 | return left.getlevel(threshold) + [[right]] 159 | elif isinstance(right, Cluster): 160 | return [[left]] + right.getlevel(threshold) 161 | else: 162 | return [[left], [right]] 163 | -------------------------------------------------------------------------------- /cluster/linkage.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from functools import wraps 3 | 4 | 5 | def cached(fun): 6 | """ 7 | Memoizing decorator for linkage functions. 8 | 9 | Parameters have been hardcoded (no ``*args``, ``**kwargs`` magic), because 10 | the way this is coded (interchangeably using sets and frozensets) is true 11 | for this specific case.
For other cases that is not necessarily guaranteed. 12 | """ 13 | 14 | _cache = {} 15 | 16 | @wraps(fun) 17 | def newfun(a, b, distance_function): 18 | frozen_a = frozenset(a) 19 | frozen_b = frozenset(b) 20 | if (frozen_a, frozen_b) not in _cache: 21 | result = fun(a, b, distance_function) 22 | _cache[(frozen_a, frozen_b)] = result 23 | return _cache[(frozen_a, frozen_b)] 24 | return newfun 25 | 26 | 27 | @cached 28 | def single(a, b, distance_function): 29 | """ 30 | Given two collections ``a`` and ``b``, this will return the distance of the 31 | points which are closest together. ``distance_function`` is used to 32 | determine the distance between two elements. 33 | 34 | Example:: 35 | 36 | >>> single([1, 2], [3, 4], lambda x, y: abs(x-y)) 37 | 1 # (distance between 2 and 3) 38 | """ 39 | left_a, right_a = min(a), max(a) 40 | left_b, right_b = min(b), max(b) 41 | result = min(distance_function(left_a, right_b), 42 | distance_function(left_b, right_a)) 43 | return result 44 | 45 | 46 | @cached 47 | def complete(a, b, distance_function): 48 | """ 49 | Given two collections ``a`` and ``b``, this will return the distance of the 50 | points which are farthest apart. ``distance_function`` is used to determine 51 | the distance between two elements. 52 | 53 | Example:: 54 | 55 | >>> complete([1, 2], [3, 4], lambda x, y: abs(x-y)) 56 | 3 # (distance between 1 and 4) 57 | """ 58 | left_a, right_a = min(a), max(a) 59 | left_b, right_b = min(b), max(b) 60 | result = max(distance_function(left_a, right_b), 61 | distance_function(left_b, right_a)) 62 | return result 63 | 64 | 65 | @cached 66 | def average(a, b, distance_function): 67 | """ 68 | Given two collections ``a`` and ``b``, this will return the mean of all 69 | distances. ``distance_function`` is used to determine the distance between 70 | two elements.
71 | 72 | Example:: 73 | 74 | >>> average([1, 2], [3, 100], lambda x, y: abs(x-y)) 75 | 50.0 # (mean of the distances 2, 99, 1 and 98) 76 | """ 77 | distances = [distance_function(x, y) 78 | for x in a for y in b] 79 | return sum(distances) / len(distances) 80 | 81 | 82 | @cached 83 | def uclus(a, b, distance_function): 84 | """ 85 | Given two collections ``a`` and ``b``, this will return the *median* of all 86 | distances. ``distance_function`` is used to determine the distance between 87 | two elements. 88 | 89 | Example:: 90 | 91 | >>> uclus([1, 2], [3, 100], lambda x, y: abs(x-y)) 92 | 50.0 # (median of the sorted distances 1, 2, 98 and 99) 93 | """ 94 | distances = sorted([distance_function(x, y) 95 | for x in a for y in b]) 96 | midpoint, rest = len(distances) // 2, len(distances) % 2 97 | if not rest: 98 | return sum(distances[midpoint-1:midpoint+1]) / 2 99 | else: 100 | return distances[midpoint] 101 | -------------------------------------------------------------------------------- /cluster/matrix.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it 6 | # under the terms of the GNU Lesser General Public License as published by the 7 | # Free Software Foundation; either version 2.1 of the License, or (at your 8 | # option) any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | # FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 12 | # for more details.
13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, 15 | # Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | 19 | import logging 20 | from multiprocessing import Process, Queue, current_process 21 | 22 | 23 | logger = logging.getLogger(__name__) 24 | 25 | def _encapsulate_item_for_combinfunc(item): 26 | """ 27 | This function has been extracted in order to 28 | make Github issue #28 easier to investigate. 29 | It replaces the following two lines of code, 30 | which occur twice in method genmatrix, just 31 | before the invocation of combinfunc. 32 | if not hasattr(item, '__iter__') or isinstance(item, tuple): 33 | item = [item] 34 | Logging was added to the original two lines 35 | and shows that the outcome of this snippet 36 | has changed between Python2.7 and Python3.5. 37 | This logging showed that the difference in 38 | outcome consisted of the handling of the builtin 39 | str class, which was encapsulated into a list in 40 | Python2.7 but returned naked in Python3.5. 41 | Adding a test for this specific class to the 42 | set of conditions appears to give correct behaviour 43 | under both versions. 44 | """ 45 | encapsulated_item = None 46 | if ( 47 | not hasattr(item, '__iter__') or 48 | isinstance(item, tuple) or 49 | isinstance(item, str) 50 | ): 51 | encapsulated_item = [item] 52 | else: 53 | encapsulated_item = item 54 | logging.debug( 55 | "item class:%s encapsulated as:%s ", 56 | item.__class__.__name__, 57 | encapsulated_item.__class__.__name__ 58 | ) 59 | return encapsulated_item 60 | 61 | 62 | class Matrix(object): 63 | """ 64 | Object representation of the item-item matrix. 65 | """ 66 | 67 | def __init__(self, data, combinfunc, symmetric=False, diagonal=None): 68 | """ 69 | Takes a list of data and generates a 2D-matrix using the supplied 70 | combination function to calculate the values. 
71 | 72 | :param data: the list of items. 73 | :param combinfunc: the function that is used to calculate the value in a 74 | cell. It must accept two arguments. 75 | :param symmetric: Whether it will be a symmetric matrix along the 76 | diagonal. For example, if the list contains integers, and the 77 | combination function is ``abs(x-y)``, then the matrix will be 78 | symmetric. 79 | :param diagonal: The value to be put into the diagonal. For some 80 | functions, the diagonal will stay constant. An example could be the 81 | function ``x-y``. Then each diagonal cell will be ``0``. If this 82 | value is set to None, then the diagonal will be calculated. 83 | """ 84 | self.data = data 85 | self.combinfunc = combinfunc 86 | self.symmetric = symmetric 87 | self.diagonal = diagonal 88 | 89 | def worker(self): 90 | """ 91 | Multiprocessing task function run by worker processes 92 | """ 93 | tasks_completed = 0 94 | for task in iter(self.task_queue.get, 'STOP'): 95 | col_index, item, item2 = task 96 | # Wrap scalars, tuples and strings the same way as the 97 | # single-process path in genmatrix() (see GitHub issue #28). 98 | item = _encapsulate_item_for_combinfunc(item) 99 | item2 = _encapsulate_item_for_combinfunc(item2) 100 | result = (col_index, self.combinfunc(item, item2)) 101 | self.done_queue.put(result) 102 | tasks_completed += 1 103 | logger.info("Worker %s performed %s tasks", 104 | current_process().name, 105 | tasks_completed) 106 | 107 | def genmatrix(self, num_processes=1): 108 | """ 109 | Actually generate the matrix 110 | 111 | :param num_processes: If you want to use multiprocessing to split up the 112 | work and run ``combinfunc()`` in parallel, specify 113 | ``num_processes > 1`` and this number of workers will be spun up, 114 | with the work split evenly amongst them.
115 | """ 116 | use_multiprocessing = num_processes > 1 117 | if use_multiprocessing: 118 | self.task_queue = Queue() 119 | self.done_queue = Queue() 120 | 121 | self.matrix = [] 122 | logger.info("Generating matrix for %s items - O(n^2)", len(self.data)) 123 | if use_multiprocessing: 124 | logger.info("Using multiprocessing on %s processes!", num_processes) 125 | 126 | if use_multiprocessing: 127 | logger.info("Spinning up %s workers", num_processes) 128 | processes = [Process(target=self.worker) for i in range(num_processes)] 129 | [process.start() for process in processes] 130 | 131 | for row_index, item in enumerate(self.data): 132 | logger.debug("Generating row %s/%s (%0.2f%%)", 133 | row_index, 134 | len(self.data), 135 | 100.0 * row_index / len(self.data)) 136 | row = {} 137 | if use_multiprocessing: 138 | num_tasks_queued = num_tasks_completed = 0 139 | for col_index, item2 in enumerate(self.data): 140 | if self.diagonal is not None and col_index == row_index: 141 | # This is a cell on the diagonal 142 | row[col_index] = self.diagonal 143 | elif self.symmetric and col_index < row_index: 144 | # The matrix is symmetric and we are "in the lower left 145 | # triangle" - fill this in after (in case of multiprocessing) 146 | pass 147 | # Otherwise, this cell is not on the diagonal and we do indeed 148 | # need to call combinfunc() 149 | elif use_multiprocessing: 150 | # Add that thing to the task queue! 
151 | self.task_queue.put((col_index, item, item2)) 152 | num_tasks_queued += 1 153 | # Start grabbing the results as we go, so as not to stuff all of 154 | # the worker args into memory at once (as Queue.get() is a 155 | # blocking operation) 156 | if num_tasks_queued > num_processes: 157 | col_index, result = self.done_queue.get() 158 | row[col_index] = result 159 | num_tasks_completed += 1 160 | else: 161 | # Otherwise do it here, in line 162 | """ 163 | if not hasattr(item, '__iter__') or isinstance(item, tuple): 164 | item = [item] 165 | if not hasattr(item2, '__iter__') or isinstance(item2, tuple): 166 | item2 = [item2] 167 | """ 168 | # See the comment in function _encapsulate_item_for_combinfunc 169 | # for details of why the lines above have been replaced 170 | # by function invocations 171 | item = _encapsulate_item_for_combinfunc(item) 172 | item2 = _encapsulate_item_for_combinfunc(item2) 173 | row[col_index] = self.combinfunc(item, item2) 174 | 175 | if self.symmetric: 176 | # One more iteration to get symmetric lower left triangle 177 | for col_index, item2 in enumerate(self.data): 178 | if col_index >= row_index: 179 | break 180 | # post-process symmetric "lower left triangle" 181 | row[col_index] = self.matrix[col_index][row_index] 182 | 183 | if use_multiprocessing: 184 | # Grab the remaining worker task results 185 | while num_tasks_completed < num_tasks_queued: 186 | col_index, result = self.done_queue.get() 187 | row[col_index] = result 188 | num_tasks_completed += 1 189 | 190 | row_indexed = [row[index] for index in range(len(self.data))] 191 | self.matrix.append(row_indexed) 192 | 193 | if use_multiprocessing: 194 | logger.info("Stopping/joining %s workers", num_processes) 195 | [self.task_queue.put('STOP') for i in range(num_processes)] 196 | [process.join() for process in processes] 197 | 198 | logger.info("Matrix generated") 199 | 200 | def __str__(self): 201 | """ 202 | Returns a 2-dimensional list of data as text-string which can be 203 | 
displayed to the user. 204 | """ 205 | # determine maximum length 206 | maxlen = 0 207 | colcount = len(self.matrix[0]) # requires genmatrix() to have run 208 | for col in self.matrix: 209 | for cell in col: 210 | maxlen = max(len(str(cell)), maxlen) 211 | format = " %%%is |" % maxlen 212 | format = "|" + format * colcount 213 | rows = [format % tuple(row) for row in self.matrix] 214 | return "\n".join(rows) 215 | -------------------------------------------------------------------------------- /cluster/method/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it 6 | # under the terms of the GNU Lesser General Public License as published by the 7 | # Free Software Foundation; either version 2.1 of the License, or (at your 8 | # option) any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | # FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 12 | # for more details. 13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, 15 | # Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | -------------------------------------------------------------------------------- /cluster/method/base.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together.
3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it 6 | # under the terms of the GNU Lesser General Public License as published by the 7 | # Free Software Foundation; either version 2.1 of the License, or (at your 8 | # option) any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | # FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 12 | # for more details. 13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, 15 | # Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | 19 | class BaseClusterMethod(object): 20 | """ 21 | The base class of all clustering methods. 22 | 23 | :param input: a list of objects 24 | :param distance_function: a function returning the distance - or opposite of 25 | similarity ``(distance = -similarity)`` - of two items from the input. 26 | In other words, the closer the two items are related, the smaller this 27 | value needs to be, with 0 meaning they are exactly the same. 28 | 29 | .. note:: 30 | The distance function should always return the absolute distance between 31 | two given items of the list. Say:: 32 | 33 | distance(input[1], input[4]) == distance(input[4], input[1]) 34 | 35 | This is very important for the clustering algorithm to work! Naturally, 36 | the data returned by the distance function MUST be a comparable 37 | datatype, so you can perform arithmetic comparisons on them (``<`` or 38 | ``>``)! The simplest examples would be floats or ints. But as long as 39 | they are comparable, it's ok.
40 | """ 41 | 42 | def __init__(self, input, distance_function, progress_callback=None): 43 | self.distance = distance_function 44 | self._input = input # the original input 45 | self._data = input[:] # clone the input so we can work with it 46 | # without destroying the original data. 47 | self.progress_callback = progress_callback 48 | 49 | def topo(self): 50 | """ 51 | Returns the structure (topology) of the cluster. 52 | 53 | See :py:meth:`~cluster.cluster.Cluster.topology` for more information. 54 | """ 55 | return self.data[0].topology() 56 | 57 | @property 58 | def data(self): 59 | """ 60 | Returns the data that is currently being processed. 61 | """ 62 | return self._data 63 | 64 | @property 65 | def raw_data(self): 66 | """ 67 | Returns the raw data (data without being clustered). 68 | """ 69 | return self._input 70 | -------------------------------------------------------------------------------- /cluster/method/hierarchical.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it 6 | # under the terms of the GNU Lesser General Public License as published by the 7 | # Free Software Foundation; either version 2.1 of the License, or (at your 8 | # option) any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | # FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 12 | # for more details.
13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, 15 | # Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | from functools import partial 19 | import logging 20 | 21 | from cluster.cluster import Cluster 22 | from cluster.matrix import Matrix 23 | from cluster.method.base import BaseClusterMethod 24 | from cluster.linkage import single, complete, average, uclus 25 | 26 | 27 | logger = logging.getLogger(__name__) 28 | 29 | 30 | class HierarchicalClustering(BaseClusterMethod): 31 | """ 32 | Implementation of the hierarchical clustering method as explained in a 33 | tutorial_ by *matteucc*. 34 | 35 | Object prerequisites: 36 | 37 | * Items must be sortable (See `issue #11`_) 38 | * Items must be hashable. 39 | 40 | .. _issue #11: https://github.com/exhuma/python-cluster/issues/11 41 | .. _tutorial: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html 42 | 43 | Example: 44 | 45 | >>> from cluster import HierarchicalClustering 46 | >>> # or: from cluster import * 47 | >>> cl = HierarchicalClustering([123,334,345,242,234,1,3], 48 | lambda x,y: float(abs(x-y))) 49 | >>> cl.getlevel(90) 50 | [[345, 334], [234, 242], [123], [3, 1]] 51 | 52 | Note that all of the returned clusters are more than 90 (``getlevel(90)``) 53 | apart. 54 | 55 | See :py:class:`~cluster.method.base.BaseClusterMethod` for more details. 56 | 57 | :param data: The collection of items to be clustered. 58 | :param distance_function: A function which takes two elements of ``data`` 59 | and returns a distance between both elements (note that the distance 60 | should not be returned as negative value!) 61 | :param linkage: The method used to determine the distance between two 62 | clusters. See :py:meth:`~.HierarchicalClustering.set_linkage_method` for 63 | possible values. 
64 | :param num_processes: If you want to use multiprocessing to split up the 65 | work and run ``genmatrix()`` in parallel, specify num_processes > 1 and 66 | this number of workers will be spun up, the work split up amongst them 67 | evenly. 68 | :param progress_callback: A function to be called on each iteration to 69 | publish the progress. The function is called with two integer arguments 70 | which represent the total number of elements in the cluster, and the 71 | remaining elements to be clustered. 72 | """ 73 | 74 | def __init__(self, data, distance_function, linkage=None, num_processes=1, 75 | progress_callback=None): 76 | if not linkage: 77 | linkage = single 78 | logger.info("Initializing HierarchicalClustering object with linkage " 79 | "method %s", linkage) 80 | BaseClusterMethod.__init__(self, sorted(data), distance_function) 81 | self.set_linkage_method(linkage) 82 | self.num_processes = num_processes 83 | self.progress_callback = progress_callback 84 | self.__cluster_created = False 85 | 86 | def publish_progress(self, total, current): 87 | """ 88 | If a progress function was supplied, this will call that function with 89 | the total number of elements, and the remaining number of elements. 90 | 91 | :param total: The total number of elements. 92 | :param current: The remaining number of elements. 93 | """ 94 | if self.progress_callback: 95 | self.progress_callback(total, current) 96 | 97 | def set_linkage_method(self, method): 98 | """ 99 | Sets the method to determine the distance between two clusters. 100 | 101 | :param method: The method to use. It can be one of ``'single'``, 102 | ``'complete'``, ``'average'`` or ``'uclus'``, or a callable. The 103 | callable should take two collections as parameters and return a 104 | distance value between both collections. 
105 | """ 106 | if method == 'single': 107 | self.linkage = single 108 | elif method == 'complete': 109 | self.linkage = complete 110 | elif method == 'average': 111 | self.linkage = average 112 | elif method == 'uclus': 113 | self.linkage = uclus 114 | elif hasattr(method, '__call__'): 115 | self.linkage = method 116 | else: 117 | raise ValueError('distance method must be one of single, ' 118 | 'complete, average or uclus') 119 | 120 | def cluster(self, matrix=None, level=None, sequence=None): 121 | """ 122 | Perform hierarchical clustering. 123 | 124 | :param matrix: The 2D list that is currently under processing. The 125 | matrix contains the distances of each item with each other 126 | :param level: The current level of clustering 127 | :param sequence: The sequence number of the clustering 128 | """ 129 | logger.info("Performing cluster()") 130 | 131 | if matrix is None: 132 | # create level 0, first iteration (sequence) 133 | level = 0 134 | sequence = 0 135 | matrix = [] 136 | 137 | # if the matrix only has two rows left, we are done 138 | linkage = partial(self.linkage, distance_function=self.distance) 139 | initial_element_count = len(self._data) 140 | while len(matrix) > 2 or matrix == []: 141 | 142 | item_item_matrix = Matrix(self._data, 143 | linkage, 144 | True, 145 | 0) 146 | item_item_matrix.genmatrix(self.num_processes) 147 | matrix = item_item_matrix.matrix 148 | 149 | smallestpair = None 150 | mindistance = None 151 | rowindex = 0 # keep track of where we are in the matrix 152 | # find the minimum distance 153 | for row in matrix: 154 | cellindex = 0 # keep track of where we are in the matrix 155 | for cell in row: 156 | # if we are not on the diagonal (which is always 0) 157 | # and if this cell represents a new minimum... 
158 | cell_lt_mdist = cell < mindistance if mindistance else False 159 | if ((rowindex != cellindex) and 160 | (cell_lt_mdist or smallestpair is None)): 161 | smallestpair = (rowindex, cellindex) 162 | mindistance = cell 163 | cellindex += 1 164 | rowindex += 1 165 | 166 | sequence += 1 167 | level = matrix[smallestpair[1]][smallestpair[0]] 168 | cluster = Cluster(level, self._data[smallestpair[0]], 169 | self._data[smallestpair[1]]) 170 | 171 | # maintain the data by combining the two most similar items 172 | # in the list. We use the min and max functions to ensure the 173 | # integrity of the data. Imagine: if we first remove the item 174 | # with the smaller index, all the rest of the items shift down by 175 | # one. So the next index will be wrong. We could simply adjust the 176 | # value of the second "remove" call, but we don't know the order 177 | # in which they come. The max and min approach clarifies that 178 | self._data.remove(self._data[max(smallestpair[0], 179 | smallestpair[1])]) # remove item 1 180 | self._data.remove(self._data[min(smallestpair[0], 181 | smallestpair[1])]) # remove item 2 182 | self._data.append(cluster) # append item 1 and 2 combined 183 | 184 | self.publish_progress(initial_element_count, len(self._data)) 185 | 186 | # all the data is in one single cluster. We return that and stop 187 | self.__cluster_created = True 188 | logger.info("Call to cluster() is complete") 189 | return 190 | 191 | def getlevel(self, threshold): 192 | """ 193 | Returns all clusters with a maximum distance of *threshold* in between 194 | each other 195 | 196 | :param threshold: the maximum distance between clusters. 
197 | 198 | See :py:meth:`~cluster.cluster.Cluster.getlevel` 199 | """ 200 | 201 | # if it's not worth clustering, just return the data 202 | if len(self._input) <= 1: 203 | return self._input 204 | 205 | # initialize the cluster if not yet done 206 | if not self.__cluster_created: 207 | self.cluster() 208 | 209 | return self._data[0].getlevel(threshold) 210 | 211 | def display(self): 212 | """ 213 | Prints a simple dendrogram-like representation of the full cluster 214 | to the console. 215 | """ 216 | # initialize the cluster if not yet done 217 | if not self.__cluster_created: 218 | self.cluster() 219 | 220 | self._data[0].display() 221 | -------------------------------------------------------------------------------- /cluster/method/kmeans.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it 6 | # under the terms of the GNU Lesser General Public License as published by the 7 | # Free Software Foundation; either version 2.1 of the License, or (at your 8 | # option) any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | # FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 12 | # for more details. 13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, 15 | # Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | 19 | from cluster.util import ClusteringError, centroid, minkowski_distance 20 | 21 | 22 | class KMeansClustering(object): 23 | """ 24 | Implementation of the kmeans clustering method as explained in a tutorial_ 25 | by *matteucc*. 26 | 27 | .. 
_tutorial: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html 28 | 29 | Example: 30 | 31 | >>> from cluster import KMeansClustering 32 | >>> cl = KMeansClustering([(1,1), (2,1), (5,3), ...]) 33 | >>> clusters = cl.getclusters(2) 34 | 35 | :param data: A list of tuples or integers. 36 | :param distance: A function determining the distance between two items. 37 | Default (if ``None`` is passed): It assumes the tuples contain numeric 38 | values and applies a generalised form of the Euclidean-distance 39 | algorithm on them. 40 | :param equality: A function to test equality of items. By default the 41 | standard python equality operator (``==``) is applied. 42 | :raises ValueError: if the list contains heterogeneous items or if the 43 | distance between items cannot be determined. 44 | """ 45 | 46 | def __init__(self, data, distance=None, equality=None): 47 | self.__clusters = [] 48 | self.__data = data 49 | self.distance = distance 50 | self.__initial_length = len(data) 51 | self.equality = equality 52 | 53 | # test if each item is of same dimensions 54 | if len(data) > 1 and isinstance(data[0], tuple): 55 | control_length = len(data[0]) 56 | for item in data[1:]: 57 | if len(item) != control_length: 58 | raise ValueError("Each item in the data list must have " 59 | "the same amount of dimensions. Item " 60 | "%r was out of line!" % (item,)) 61 | # now check if we need and have a distance function 62 | if (len(data) > 1 and not isinstance(data[0], tuple) and 63 | distance is None): 64 | raise ValueError("You supplied non-standard items but no " 65 | "distance function! We cannot continue!") 66 | # we now know that we have tuples, and assume therefore that its 67 | # items are numeric 68 | elif distance is None: 69 | self.distance = minkowski_distance 70 | 71 | def getclusters(self, count): 72 | """ 73 | Generates *count* clusters. 74 | 75 | :param count: The amount of clusters that should be generated. 
count 76 | must be greater than ``1``. 77 | :raises ClusteringError: if *count* is out of bounds. 78 | """ 79 | 80 | # only proceed if we got sensible input 81 | if count <= 1: 82 | raise ClusteringError("When clustering, you need to ask for at " 83 | "least two clusters! " 84 | "You asked for %d" % count) 85 | 86 | # return the data straight away if there is nothing to cluster 87 | if (self.__data == [] or len(self.__data) == 1 or 88 | count == self.__initial_length): 89 | return self.__data 90 | 91 | # It makes no sense to ask for more clusters than data-items available 92 | if count > self.__initial_length: 93 | raise ClusteringError( 94 | "Unable to generate more clusters than " 95 | "items available. You supplied %d items, and asked for " 96 | "%d clusters." % (self.__initial_length, count)) 97 | 98 | self.initialise_clusters(self.__data, count) 99 | 100 | items_moved = True # tells us if any item moved between the clusters, 101 | # as we initialised the clusters, we assume that 102 | # is the case 103 | 104 | while items_moved is True: 105 | items_moved = False 106 | for cluster in self.__clusters: 107 | for item in cluster: 108 | res = self.assign_item(item, cluster) 109 | if items_moved is False: 110 | items_moved = res 111 | return self.__clusters 112 | 113 | def assign_item(self, item, origin): 114 | """ 115 | Assigns an item from a given cluster to the closest located cluster. 116 | 117 | :param item: the item to be moved. 118 | :param origin: the originating cluster. 119 | """ 120 | closest_cluster = origin 121 | for cluster in self.__clusters: 122 | if self.distance(item, centroid(cluster)) < self.distance( 123 | item, centroid(closest_cluster)): 124 | closest_cluster = cluster 125 | 126 | if id(closest_cluster) != id(origin): 127 | self.move_item(item, origin, closest_cluster) 128 | return True 129 | else: 130 | return False 131 | 132 | def move_item(self, item, origin, destination): 133 | """ 134 | Moves an item from one cluster to another cluster. 
135 | 136 | :param item: the item to be moved. 137 | :param origin: the originating cluster. 138 | :param destination: the target cluster. 139 | """ 140 | if self.equality: 141 | item_index = 0 142 | for i, element in enumerate(origin): 143 | if self.equality(element, item): 144 | item_index = i 145 | break 146 | else: 147 | item_index = origin.index(item) 148 | 149 | destination.append(origin.pop(item_index)) 150 | 151 | def initialise_clusters(self, input_, clustercount): 152 | """ 153 | Initialises the clusters by distributing the items from the data 154 | evenly across n clusters. 155 | 156 | :param input_: the data set (a list of tuples). 157 | :param clustercount: the amount of clusters (n). 158 | """ 159 | # initialise the clusters with empty lists 160 | self.__clusters = [] 161 | for _ in range(clustercount): 162 | self.__clusters.append([]) 163 | 164 | # distribute the items into the clusters 165 | count = 0 166 | for item in input_: 167 | self.__clusters[count % clustercount].append(item) 168 | count += 1 169 | -------------------------------------------------------------------------------- /cluster/test/test_hierarchical.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it under 6 | # the terms of the GNU Lesser General Public License as published by the Free 7 | # Software Foundation; either version 2.1 of the License, or (at your option) 8 | # any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS 11 | # FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more 12 | # details. 
13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, Inc., 15 | # 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | """ 19 | Tests for hierarchical clustering. 20 | 21 | .. note:: 22 | 23 | Even though the results are lists, the order of items in the resulting 24 | clusters is non-deterministic. This should be taken into consideration when 25 | writing "expected" values! 26 | """ 27 | 28 | from difflib import SequenceMatcher 29 | from math import sqrt 30 | from sys import hexversion 31 | import unittest 32 | 33 | from cluster import HierarchicalClustering 34 | 35 | 36 | class Py23TestCase(unittest.TestCase): 37 | 38 | def __init__(self, *args, **kwargs): 39 | super(Py23TestCase, self).__init__(*args, **kwargs) 40 | if hexversion < 0x030000f0: 41 | self.assertCItemsEqual = self.assertItemsEqual 42 | else: 43 | self.assertCItemsEqual = self.assertCountEqual 44 | 45 | 46 | class HClusterSmallListTestCase(Py23TestCase): 47 | """ 48 | Test for Bug #1516204 49 | """ 50 | 51 | def testClusterLen1(self): 52 | """ 53 | Testing if hierarchical clustering a set of length 1 returns a set of 54 | length 1 55 | """ 56 | cl = HierarchicalClustering([876], lambda x, y: abs(x - y)) 57 | self.assertCItemsEqual([876], cl.getlevel(40)) 58 | 59 | def testClusterLen0(self): 60 | """ 61 | Testing if hierarchical clustering an empty list returns an empty list 62 | """ 63 | cl = HierarchicalClustering([], lambda x, y: abs(x - y)) 64 | self.assertEqual([], cl.getlevel(40)) 65 | 66 | 67 | class HClusterIntegerTestCase(Py23TestCase): 68 | 69 | def setUp(self): 70 | self.__data = [791, 956, 676, 124, 564, 84, 24, 365, 594, 940, 398, 71 | 971, 131, 365, 542, 336, 518, 835, 134, 391] 72 | 73 | def testSingleLinkage(self): 74 | "Basic Hierarchical Clustering test with integers" 75 | cl = HierarchicalClustering(self.__data, lambda x, y: abs(x - y)) 76 | result = 
cl.getlevel(40) 77 | 78 | # sort the values to make the tests less prone to algorithm changes 79 | result = [sorted(_) for _ in result] 80 | self.assertCItemsEqual([ 81 | [24], 82 | [336, 365, 365, 391, 398], 83 | [518, 542, 564, 594], 84 | [676], 85 | [791], 86 | [835], 87 | [84, 124, 131, 134], 88 | [940, 956, 971], 89 | ], result) 90 | 91 | def testCompleteLinkage(self): 92 | "Basic Hierarchical Clustering test with integers" 93 | cl = HierarchicalClustering(self.__data, 94 | lambda x, y: abs(x - y), 95 | linkage='complete') 96 | result = cl.getlevel(40) 97 | 98 | # sort the values to make the tests less prone to algorithm changes 99 | result = sorted([sorted(_) for _ in result]) 100 | 101 | expected = [ 102 | [24], 103 | [84], 104 | [124, 131, 134], 105 | [336, 365, 365], 106 | [391, 398], 107 | [518], 108 | [542, 564], 109 | [594], 110 | [676], 111 | [791], 112 | [835], 113 | [940, 956, 971], 114 | ] 115 | self.assertEqual(result, expected) 116 | 117 | def testUCLUS(self): 118 | "Basic Hierarchical Clustering test with integers" 119 | cl = HierarchicalClustering(self.__data, 120 | lambda x, y: abs(x - y), 121 | linkage='uclus') 122 | expected = [ 123 | [24], 124 | [84], 125 | [124, 131, 134], 126 | [336, 365, 365, 391, 398], 127 | [518, 542, 564], 128 | [594], 129 | [676], 130 | [791], 131 | [835], 132 | [940, 956, 971], 133 | ] 134 | result = sorted([sorted(_) for _ in cl.getlevel(40)]) 135 | self.assertEqual(result, expected) 136 | 137 | def testAverageLinkage(self): 138 | cl = HierarchicalClustering(self.__data, 139 | lambda x, y: abs(x - y), 140 | linkage='average') 141 | # TODO: The current test-data does not really trigger a difference 142 | # between UCLUS and "average" linkage. 
143 | expected = [ 144 | [24], 145 | [84], 146 | [124, 131, 134], 147 | [336, 365, 365, 391, 398], 148 | [518, 542, 564], 149 | [594], 150 | [676], 151 | [791], 152 | [835], 153 | [940, 956, 971], 154 | ] 155 | result = sorted([sorted(_) for _ in cl.getlevel(40)]) 156 | self.assertEqual(result, expected) 157 | 158 | def testUnmodifiedData(self): 159 | cl = HierarchicalClustering(self.__data, lambda x, y: abs(x - y)) 160 | new_data = [] 161 | [new_data.extend(_) for _ in cl.getlevel(40)] 162 | self.assertEqual(sorted(new_data), sorted(self.__data)) 163 | 164 | def testMultiprocessing(self): 165 | cl = HierarchicalClustering(self.__data, lambda x, y: abs(x - y), 166 | num_processes=4) 167 | new_data = [] 168 | [new_data.extend(_) for _ in cl.getlevel(40)] 169 | self.assertEqual(sorted(new_data), sorted(self.__data)) 170 | 171 | 172 | class HClusterStringTestCase(Py23TestCase): 173 | 174 | def sim(self, x, y): 175 | sm = SequenceMatcher(lambda x: x in ". -", x, y) 176 | return 1 - sm.ratio() 177 | 178 | def setUp(self): 179 | self.__data = ("Lorem ipsum dolor sit amet consectetuer adipiscing " 180 | "elit Ut elit Phasellus consequat ultricies mi Sed " 181 | "congue leo at neque Nullam").split() 182 | 183 | def testDataTypes(self): 184 | "Test for bug #?" 185 | cl = HierarchicalClustering(self.__data, self.sim) 186 | for item in cl.getlevel(0.5): 187 | self.assertEqual( 188 | type(item), type([]), 189 | "Every item should be a list!") 190 | 191 | def testCluster(self): 192 | "Basic Hierarchical clustering test with strings" 193 | self.skipTest('These values lead to non-deterministic results. 
' 194 | 'This makes it untestable!') 195 | cl = HierarchicalClustering(self.__data, self.sim) 196 | self.assertEqual([ 197 | ['ultricies'], 198 | ['Sed'], 199 | ['Phasellus'], 200 | ['mi'], 201 | ['Nullam'], 202 | ['sit', 'elit', 'elit', 'Ut', 'amet', 'at'], 203 | ['leo', 'Lorem', 'dolor'], 204 | ['congue', 'neque', 'consectetuer', 'consequat'], 205 | ['adipiscing'], 206 | ['ipsum'], 207 | ], cl.getlevel(0.5)) 208 | 209 | def testUnmodifiedData(self): 210 | cl = HierarchicalClustering(self.__data, self.sim) 211 | new_data = [] 212 | [new_data.extend(_) for _ in cl.getlevel(0.5)] 213 | self.assertEqual(sorted(new_data), sorted(self.__data)) 214 | 215 | 216 | class HClusterTuplesTestCase(Py23TestCase): 217 | ''' 218 | Test case to cover the case where the data contains tuple-items 219 | 220 | See Github issue #20 221 | ''' 222 | 223 | def testSingleLinkage(self): 224 | "Basic Hierarchical Clustering test with integers" 225 | 226 | def euclidian_distance(a, b): 227 | return sqrt(sum([pow(z[0] - z[1], 2) for z in zip(a, b)])) 228 | 229 | self.__data = [(1, 1), (1, 2), (1, 3)] 230 | cl = HierarchicalClustering(self.__data, euclidian_distance) 231 | result = cl.getlevel(40) 232 | self.assertIsNotNone(result) 233 | 234 | class Issue28TestCase(Py23TestCase): 235 | ''' 236 | Test case to cover the case where the data consist 237 | of dictionary keys, and the distance function executes 238 | on the values these keys are associated with in the 239 | dictionary, rather than the keys themselves. 240 | 241 | Behaviour for this test case differs between Python2.7 242 | and Python3.5: on 2.7 the test behaves as expected, 243 | 244 | See Github issue #28. 
245 | ''' 246 | 247 | def testIssue28(self): 248 | "Issue28 (Hierarchical Clustering)" 249 | 250 | points1D = { 251 | 'p4' : 5, 'p2' : 6, 'p7' : 10, 252 | 'p9' : 120, 'p10' : 121, 'p11' : 119, 253 | } 254 | 255 | distance_func = lambda a,b : abs(points1D[a]-points1D[b]) 256 | cl = HierarchicalClustering(list(points1D.keys()), distance_func) 257 | result = cl.getlevel(20) 258 | self.assertIsNotNone(result) 259 | 260 | if __name__ == '__main__': 261 | 262 | import logging 263 | 264 | suite = unittest.TestSuite(( 265 | unittest.makeSuite(HClusterIntegerTestCase), 266 | unittest.makeSuite(HClusterSmallListTestCase), 267 | unittest.makeSuite(HClusterStringTestCase), 268 | unittest.makeSuite(Issue28TestCase), 269 | )) 270 | 271 | logging.basicConfig(level=logging.DEBUG) 272 | unittest.TextTestRunner(verbosity=2).run(suite) 273 | -------------------------------------------------------------------------------- /cluster/test/test_kmeans.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it under 6 | # the terms of the GNU Lesser General Public License as published by the Free 7 | # Software Foundation; either version 2.1 of the License, or (at your option) 8 | # any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS 11 | # FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more 12 | # details. 
13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, Inc., 15 | # 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | from cluster import (KMeansClustering, ClusteringError) 19 | import unittest 20 | 21 | 22 | def compare_list(x, y): 23 | """ 24 | Compare lists by content. Ordering does not matter. 25 | Returns True if both lists contain the same items (and are of identical 26 | length) 27 | """ 28 | 29 | cmpx = [set(cluster) for cluster in x] 30 | cmpy = [set(cluster) for cluster in y] 31 | 32 | all_ok = True 33 | 34 | for cset in cmpx: 35 | all_ok &= cset in cmpy 36 | 37 | for cset in cmpy: 38 | all_ok &= cset in cmpx 39 | 40 | return all_ok 41 | 42 | 43 | class KClusterSmallListTestCase(unittest.TestCase): 44 | 45 | def testClusterLen1(self): 46 | "Testing that a search space of length 1 returns only one cluster" 47 | cl = KMeansClustering([876]) 48 | self.assertEqual([876], cl.getclusters(2)) 49 | self.assertEqual([876], cl.getclusters(5)) 50 | 51 | def testClusterLen0(self): 52 | "Testing if clustering an empty set, returns an empty set" 53 | cl = KMeansClustering([]) 54 | self.assertEqual([], cl.getclusters(2)) 55 | self.assertEqual([], cl.getclusters(7)) 56 | 57 | 58 | class KCluster2DTestCase(unittest.TestCase): 59 | 60 | def testClusterCount(self): 61 | "Test that asking for less than 2 clusters raises an error" 62 | cl = KMeansClustering([876, 123, 344, 676], 63 | distance=lambda x, y: abs(x - y)) 64 | self.assertRaises(ClusteringError, cl.getclusters, 0) 65 | self.assertRaises(ClusteringError, cl.getclusters, 1) 66 | 67 | def testNonsenseCluster(self): 68 | """ 69 | Test that asking for more clusters than data-items available raises an 70 | error 71 | """ 72 | cl = KMeansClustering([876, 123], distance=lambda x, y: abs(x - y)) 73 | self.assertRaises(ClusteringError, cl.getclusters, 5) 74 | 75 | def testUniformLength(self): 76 | """ 
77 | Test if there is an item in the cluster that has a different 78 | cardinality 79 | """ 80 | data = [(1, 5), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6, 7), (7, 3), 81 | (8, 1), (8, 2), (8), (9, 2), (9, 3)] 82 | self.assertRaises(ValueError, KMeansClustering, data) 83 | 84 | def testPointDoubling(self): 85 | "test for bug #1604868" 86 | data = [(18, 13), (15, 12), (17, 12), (18, 12), (19, 12), (16, 11), 87 | (18, 11), (19, 10), (0, 0), (1, 4), (1, 2), (2, 3), (4, 1), 88 | (4, 3), (5, 2), (6, 1)] 89 | cl = KMeansClustering(data) 90 | clusters = cl.getclusters(2) 91 | expected = [[(18, 13), (15, 12), (17, 12), (18, 12), (19, 12), 92 | (16, 11), (18, 11), (19, 10)], 93 | [(0, 0), (1, 4), (1, 2), (2, 3), (4, 1), 94 | (5, 2), (6, 1), (4, 3)]] 95 | self.assertTrue(compare_list( 96 | clusters, 97 | expected), 98 | "Elements differ!\n%s\n%s" % (clusters, expected)) 99 | 100 | def testClustering(self): 101 | "Basic clustering test" 102 | data = [(8, 2), (7, 3), (2, 6), (3, 5), (3, 6), (1, 5), (8, 1), 103 | (3, 4), (8, 3), (9, 2), (2, 5), (9, 3)] 104 | cl = KMeansClustering(data) 105 | self.assertEqual( 106 | cl.getclusters(2), 107 | [[(8, 2), (8, 1), (8, 3), (7, 3), (9, 2), (9, 3)], 108 | [(3, 5), (1, 5), (3, 4), (2, 6), (2, 5), (3, 6)]]) 109 | 110 | def testUnmodifiedData(self): 111 | "Basic clustering test" 112 | data = [(8, 2), (7, 3), (2, 6), (3, 5), (3, 6), (1, 5), (8, 1), 113 | (3, 4), (8, 3), (9, 2), (2, 5), (9, 3)] 114 | cl = KMeansClustering(data) 115 | 116 | new_data = [] 117 | [new_data.extend(_) for _ in cl.getclusters(2)] 118 | self.assertEqual(sorted(new_data), sorted(data)) 119 | 120 | 121 | class KClusterSFBugs(unittest.TestCase): 122 | 123 | def testLostFunctionReference(self): 124 | "test for bug #1727558" 125 | cl = KMeansClustering([(1, 1), (20, 40), (20, 41)], 126 | lambda x, y: x + y) 127 | clusters = cl.getclusters(3) 128 | expected = [(1, 1), (20, 40), (20, 41)] 129 | self.assertTrue(compare_list( 130 | clusters, 131 | expected), 132 | "Elements 
differ!\n%s\n%s" % (clusters, expected)) 133 | 134 | def testMultidimArray(self): 135 | from random import random 136 | data = [] 137 | for _ in range(200): 138 | data.append([random(), random()]) 139 | cl = KMeansClustering(data, lambda p0, p1: ( 140 | p0[0] - p1[0]) ** 2 + (p0[1] - p1[1]) ** 2) 141 | cl.getclusters(10) 142 | -------------------------------------------------------------------------------- /cluster/test/test_linkage.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from cluster.linkage import single, complete, uclus, average 4 | 5 | 6 | class LinkageMethods(unittest.TestCase): 7 | 8 | def setUp(self): 9 | self.set_a = [1, 2, 3, 4] 10 | self.set_b = [10, 11, 12, 13, 14, 15, 100] 11 | self.dist = lambda x, y: abs(x-y) # NOQA 12 | 13 | def test_single_distance(self): 14 | result = single(self.set_a, self.set_b, self.dist) 15 | expected = 6 16 | self.assertEqual(result, expected) 17 | 18 | def test_complete_distance(self): 19 | result = complete(self.set_a, self.set_b, self.dist) 20 | expected = 99 21 | self.assertEqual(result, expected) 22 | 23 | def test_uclus_distance(self): 24 | result = uclus(self.set_a, self.set_b, self.dist) 25 | expected = 10.5 26 | self.assertEqual(result, expected) 27 | 28 | def test_average_distance(self): 29 | result = average(self.set_a, self.set_b, self.dist) 30 | expected = 22.5 31 | self.assertEqual(result, expected) 32 | 33 | if __name__ == '__main__': 34 | 35 | import logging 36 | 37 | suite = unittest.TestSuite(( 38 | unittest.makeSuite(LinkageMethods), 39 | )) 40 | 41 | logging.basicConfig(level=logging.DEBUG) 42 | unittest.TextTestRunner(verbosity=2).run(suite) 43 | -------------------------------------------------------------------------------- /cluster/test/test_numpy.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 
3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it under 6 | # the terms of the GNU Lesser General Public License as published by the Free 7 | # Software Foundation; either version 2.1 of the License, or (at your option) 8 | # any later version. 9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS 11 | # FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more 12 | # details. 13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, Inc., 15 | # 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | import unittest 19 | 20 | from cluster import KMeansClustering 21 | try: 22 | import numpy 23 | NUMPY_AVAILABLE = True 24 | except: 25 | NUMPY_AVAILABLE = False 26 | 27 | 28 | @unittest.skipUnless(NUMPY_AVAILABLE, 29 | 'numpy not available. Associated test will not be loaded!') 30 | class NumpyTests(unittest.TestCase): 31 | 32 | def testNumpyRandom(self): 33 | data = numpy.random.rand(500, 2) 34 | cl = KMeansClustering(data, lambda p0, p1: ( 35 | p0[0] - p1[0]) ** 2 + (p0[1] - p1[1]) ** 2, numpy.array_equal) 36 | cl.getclusters(10) 37 | -------------------------------------------------------------------------------- /cluster/util.py: -------------------------------------------------------------------------------- 1 | # 2 | # This is part of "python-cluster". A library to group similar items together. 3 | # Copyright (C) 2006 Michel Albert 4 | # 5 | # This library is free software; you can redistribute it and/or modify it 6 | # under the terms of the GNU Lesser General Public License as published by the 7 | # Free Software Foundation; either version 2.1 of the License, or (at your 8 | # option) any later version. 
9 | # This library is distributed in the hope that it will be useful, but WITHOUT 10 | # ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | # FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 12 | # for more details. 13 | # You should have received a copy of the GNU Lesser General Public License 14 | # along with this library; if not, write to the Free Software Foundation, 15 | # Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA 16 | # 17 | 18 | from __future__ import print_function 19 | import logging 20 | 21 | 22 | logger = logging.getLogger(__name__) 23 | 24 | 25 | class ClusteringError(Exception): 26 | pass 27 | 28 | 29 | def flatten(L): 30 | """ 31 | Flattens a list. 32 | 33 | Example: 34 | 35 | >>> flatten([a,b,[c,d,[e,f]]]) 36 | [a,b,c,d,e,f] 37 | """ 38 | if not isinstance(L, list): 39 | return [L] 40 | 41 | if L == []: 42 | return L 43 | 44 | return flatten(L[0]) + flatten(L[1:]) 45 | 46 | 47 | def fullyflatten(container): 48 | """ 49 | Completely flattens out a cluster and returns a one-dimensional set 50 | containing the cluster's items. This is useful in cases where some items of 51 | the cluster are clusters in their own right and you only want the items. 52 | 53 | :param container: the container to flatten. 54 | """ 55 | flattened_items = [] 56 | 57 | for item in container: 58 | if hasattr(item, 'items'): 59 | flattened_items = flattened_items + fullyflatten(item.items) 60 | else: 61 | flattened_items.append(item) 62 | 63 | return flattened_items 64 | 65 | 66 | def median(numbers): 67 | """ 68 | Return the median of the list of numbers. 69 | see: http://mail.python.org/pipermail/python-list/2004-December/294990.html 70 | """ 71 | 72 | # Sort the list and take the middle element. 
73 | n = len(numbers) 74 | copy = sorted(numbers) 75 | if n & 1: # There is an odd number of elements 76 | return copy[n // 2] 77 | else: 78 | return (copy[n // 2 - 1] + copy[n // 2]) / 2.0 79 | 80 | 81 | def mean(numbers): 82 | """ 83 | Returns the arithmetic mean of a numeric list. 84 | see: http://mail.python.org/pipermail/python-list/2004-December/294990.html 85 | """ 86 | return float(sum(numbers)) / float(len(numbers)) 87 | 88 | 89 | def minkowski_distance(x, y, p=2): 90 | """ 91 | Calculates the Minkowski distance between two points. 92 | 93 | :param x: the first point 94 | :param y: the second point 95 | :param p: the order of the Minkowski algorithm. If *p=1* it is equal 96 | to the Manhattan distance, if *p=2* it is equal to the Euclidean 97 | distance. The higher the order, the closer it converges to the 98 | Chebyshev distance, which has *p=infinity*. 99 | """ 100 | from math import pow 101 | assert len(y) == len(x) 102 | assert len(x) >= 1 103 | sum = 0 104 | for i in range(len(x)): 105 | sum += abs(x[i] - y[i]) ** p 106 | return pow(sum, 1.0 / float(p)) 107 | 108 | 109 | def magnitude(a): 110 | "calculates the magnitude of a vector" 111 | from math import sqrt 112 | sum = 0 113 | for coord in a: 114 | sum += coord ** 2 115 | return sqrt(sum) 116 | 117 | 118 | def dotproduct(a, b): 119 | "Calculates the dotproduct between two vectors" 120 | assert(len(a) == len(b)) 121 | out = 0 122 | for i in range(len(a)): 123 | out += a[i] * b[i] 124 | return out 125 | 126 | 127 | def centroid(data, method=median): 128 | "returns the central vector of a list of vectors" 129 | out = [] 130 | for i in range(len(data[0])): 131 | out.append(method([x[i] for x in data])) 132 | return tuple(out) 133 | -------------------------------------------------------------------------------- /cluster/version.txt: -------------------------------------------------------------------------------- 1 | 1.4.1.post3 2 | 
-------------------------------------------------------------------------------- /dev-requirements.txt: -------------------------------------------------------------------------------- 1 | sphinx 2 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # User-friendly check for sphinx-build 11 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) 12 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) 13 | endif 14 | 15 | # Internal variables. 16 | PAPEROPT_a4 = -D latex_paper_size=a4 17 | PAPEROPT_letter = -D latex_paper_size=letter 18 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 19 | # the i18n builder cannot share the environment and doctrees with the others 20 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
21 | 22 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 23 | 24 | help: 25 | @echo "Please use \`make <target>' where <target> is one of" 26 | @echo " html to make standalone HTML files" 27 | @echo " dirhtml to make HTML files named index.html in directories" 28 | @echo " singlehtml to make a single large HTML file" 29 | @echo " pickle to make pickle files" 30 | @echo " json to make JSON files" 31 | @echo " htmlhelp to make HTML files and a HTML help project" 32 | @echo " qthelp to make HTML files and a qthelp project" 33 | @echo " devhelp to make HTML files and a Devhelp project" 34 | @echo " epub to make an epub" 35 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 36 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 37 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" 38 | @echo " text to make text files" 39 | @echo " man to make manual pages" 40 | @echo " texinfo to make Texinfo files" 41 | @echo " info to make Texinfo files and run them through makeinfo" 42 | @echo " gettext to make PO message catalogs" 43 | @echo " changes to make an overview of all changed/added/deprecated items" 44 | @echo " xml to make Docutils-native XML files" 45 | @echo " pseudoxml to make pseudoxml-XML files for display purposes" 46 | @echo " linkcheck to check all external links for integrity" 47 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 48 | 49 | clean: 50 | rm -rf $(BUILDDIR)/* 51 | 52 | html: 53 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 54 | @echo 55 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 56 | 57 | dirhtml: 58 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 59 | @echo 60 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
61 | 62 | singlehtml: 63 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 64 | @echo 65 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 66 | 67 | pickle: 68 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 69 | @echo 70 | @echo "Build finished; now you can process the pickle files." 71 | 72 | json: 73 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 74 | @echo 75 | @echo "Build finished; now you can process the JSON files." 76 | 77 | htmlhelp: 78 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 79 | @echo 80 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 81 | ".hhp project file in $(BUILDDIR)/htmlhelp." 82 | 83 | qthelp: 84 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 85 | @echo 86 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 87 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 88 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/python-cluster.qhcp" 89 | @echo "To view the help file:" 90 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/python-cluster.qhc" 91 | 92 | devhelp: 93 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 94 | @echo 95 | @echo "Build finished." 96 | @echo "To view the help file:" 97 | @echo "# mkdir -p $$HOME/.local/share/devhelp/python-cluster" 98 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/python-cluster" 99 | @echo "# devhelp" 100 | 101 | epub: 102 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 103 | @echo 104 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 105 | 106 | latex: 107 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 108 | @echo 109 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 110 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 111 | "(use \`make latexpdf' here to do that automatically)." 
112 | 113 | latexpdf: 114 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 115 | @echo "Running LaTeX files through pdflatex..." 116 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 117 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 118 | 119 | latexpdfja: 120 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 121 | @echo "Running LaTeX files through platex and dvipdfmx..." 122 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja 123 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 124 | 125 | text: 126 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 127 | @echo 128 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 129 | 130 | man: 131 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 132 | @echo 133 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 134 | 135 | texinfo: 136 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 137 | @echo 138 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 139 | @echo "Run \`make' in that directory to run these through makeinfo" \ 140 | "(use \`make info' here to do that automatically)." 141 | 142 | info: 143 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 144 | @echo "Running Texinfo files through makeinfo..." 145 | make -C $(BUILDDIR)/texinfo info 146 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 147 | 148 | gettext: 149 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 150 | @echo 151 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 152 | 153 | changes: 154 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 155 | @echo 156 | @echo "The overview file is in $(BUILDDIR)/changes." 157 | 158 | linkcheck: 159 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 160 | @echo 161 | @echo "Link check complete; look for any errors in the above output " \ 162 | "or in $(BUILDDIR)/linkcheck/output.txt." 
163 | 164 | doctest: 165 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 166 | @echo "Testing of doctests in the sources finished, look at the " \ 167 | "results in $(BUILDDIR)/doctest/output.txt." 168 | 169 | xml: 170 | $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml 171 | @echo 172 | @echo "Build finished. The XML files are in $(BUILDDIR)/xml." 173 | 174 | pseudoxml: 175 | $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml 176 | @echo 177 | @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 178 | -------------------------------------------------------------------------------- /docs/_static/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/exhuma/python-cluster/2739ff420ef5bf8fba53f67788453b8239f16c9a/docs/_static/.gitkeep -------------------------------------------------------------------------------- /docs/apidoc/cluster.matrix.rst: -------------------------------------------------------------------------------- 1 | cluster.matrix 2 | ============== 3 | 4 | .. automodule:: cluster.matrix 5 | :members: 6 | :undoc-members: 7 | :show-inheritance: 8 | -------------------------------------------------------------------------------- /docs/apidoc/cluster.method.base.rst: -------------------------------------------------------------------------------- 1 | cluster.method.base 2 | =================== 3 | 4 | .. automodule:: cluster.method.base 5 | :members: 6 | :undoc-members: 7 | :show-inheritance: 8 | -------------------------------------------------------------------------------- /docs/apidoc/cluster.method.hierarchical.rst: -------------------------------------------------------------------------------- 1 | cluster.method.hierarchical 2 | =========================== 3 | 4 | .. 
automodule:: cluster.method.hierarchical 5 | :members: 6 | :undoc-members: 7 | :show-inheritance: 8 | -------------------------------------------------------------------------------- /docs/apidoc/cluster.method.kmeans.rst: -------------------------------------------------------------------------------- 1 | cluster.method.kmeans 2 | ===================== 3 | 4 | .. automodule:: cluster.method.kmeans 5 | :members: 6 | :undoc-members: 7 | :show-inheritance: 8 | -------------------------------------------------------------------------------- /docs/apidoc/cluster.rst: -------------------------------------------------------------------------------- 1 | cluster 2 | ======= 3 | 4 | .. automodule:: cluster.cluster 5 | :members: 6 | :undoc-members: 7 | :show-inheritance: 8 | -------------------------------------------------------------------------------- /docs/apidoc/cluster.util.rst: -------------------------------------------------------------------------------- 1 | cluster.util 2 | ============ 3 | 4 | .. automodule:: cluster.util 5 | :members: 6 | :undoc-members: 7 | :show-inheritance: 8 | -------------------------------------------------------------------------------- /docs/changelog.rst: -------------------------------------------------------------------------------- 1 | Changelog 2 | ######### 3 | 4 | .. include:: ../CHANGELOG 5 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # python-cluster documentation build configuration file, created by 4 | # sphinx-quickstart on Wed Aug 27 07:50:52 2014. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 
11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | import sys 16 | import os 17 | from os.path import dirname, join 18 | 19 | # If extensions (or modules to document with autodoc) are in another directory, 20 | # add these directories to sys.path here. If the directory is relative to the 21 | # documentation root, use os.path.abspath to make it absolute, like shown here. 22 | #sys.path.insert(0, os.path.abspath('.')) 23 | 24 | # -- General configuration ------------------------------------------------ 25 | 26 | # If your documentation needs a minimal Sphinx version, state it here. 27 | #needs_sphinx = '1.0' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones. 32 | extensions = [ 33 | 'sphinx.ext.autodoc', 34 | ] 35 | 36 | # Add any paths that contain templates here, relative to this directory. 37 | templates_path = ['_templates'] 38 | 39 | # The suffix of source filenames. 40 | source_suffix = '.rst' 41 | 42 | # The encoding of source files. 43 | #source_encoding = 'utf-8-sig' 44 | 45 | # The master toctree document. 46 | master_doc = 'index' 47 | 48 | # General information about the project. 49 | project = u'python-cluster' 50 | copyright = u'2014, Michel Albert' 51 | 52 | # The version info for the project you're documenting, acts as replacement for 53 | # |version| and |release|, also used in various other places throughout the 54 | # built documents. 55 | # 56 | version_file = join(dirname(__file__), '..', 'cluster', 'version.txt') 57 | with open(version_file) as fptr: 58 | # The full version, including alpha/beta/rc tags. 59 | release = fptr.read().strip() 60 | versioninfo = release.split('.') 61 | 62 | # The short X.Y version. 63 | version = '%s.%s' % (versioninfo[0], versioninfo[1]) 64 | 65 | # The language for content autogenerated by Sphinx. 
Refer to documentation 66 | # for a list of supported languages. 67 | #language = None 68 | 69 | # There are two options for replacing |today|: either, you set today to some 70 | # non-false value, then it is used: 71 | #today = '' 72 | # Else, today_fmt is used as the format for a strftime call. 73 | #today_fmt = '%B %d, %Y' 74 | 75 | # List of patterns, relative to source directory, that match files and 76 | # directories to ignore when looking for source files. 77 | exclude_patterns = ['_build'] 78 | 79 | # The reST default role (used for this markup: `text`) to use for all 80 | # documents. 81 | #default_role = None 82 | 83 | # If true, '()' will be appended to :func: etc. cross-reference text. 84 | #add_function_parentheses = True 85 | 86 | # If true, the current module name will be prepended to all description 87 | # unit titles (such as .. function::). 88 | #add_module_names = True 89 | 90 | # If true, sectionauthor and moduleauthor directives will be shown in the 91 | # output. They are ignored by default. 92 | #show_authors = False 93 | 94 | # The name of the Pygments (syntax highlighting) style to use. 95 | pygments_style = 'sphinx' 96 | 97 | # A list of ignored prefixes for module index sorting. 98 | #modindex_common_prefix = [] 99 | 100 | # If true, keep warnings as "system message" paragraphs in the built documents. 101 | #keep_warnings = False 102 | 103 | 104 | # -- Options for HTML output ---------------------------------------------- 105 | 106 | # The theme to use for HTML and HTML Help pages. See the documentation for 107 | # a list of builtin themes. 108 | html_theme = 'alabaster' 109 | 110 | # Theme options are theme-specific and customize the look and feel of a theme 111 | # further. For a list of options available for each theme, see the 112 | # documentation. 113 | #html_theme_options = {} 114 | 115 | # Add any paths that contain custom themes here, relative to this directory. 
116 | #html_theme_path = [] 117 | 118 | # The name for this set of Sphinx documents. If None, it defaults to 119 | # "<project> v<release> documentation". 120 | #html_title = None 121 | 122 | # A shorter title for the navigation bar. Default is the same as html_title. 123 | #html_short_title = None 124 | 125 | # The name of an image file (relative to this directory) to place at the top 126 | # of the sidebar. 127 | #html_logo = None 128 | 129 | # The name of an image file (within the static path) to use as favicon of the 130 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 131 | # pixels large. 132 | #html_favicon = None 133 | 134 | # Add any paths that contain custom static files (such as style sheets) here, 135 | # relative to this directory. They are copied after the builtin static files, 136 | # so a file named "default.css" will overwrite the builtin "default.css". 137 | html_static_path = ['_static'] 138 | 139 | # Add any extra paths that contain custom files (such as robots.txt or 140 | # .htaccess) here, relative to this directory. These files are copied 141 | # directly to the root of the documentation. 142 | #html_extra_path = [] 143 | 144 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 145 | # using the given strftime format. 146 | #html_last_updated_fmt = '%b %d, %Y' 147 | 148 | # If true, SmartyPants will be used to convert quotes and dashes to 149 | # typographically correct entities. 150 | #html_use_smartypants = True 151 | 152 | # Custom sidebar templates, maps document names to template names. 153 | #html_sidebars = {} 154 | 155 | # Additional templates that should be rendered to pages, maps page names to 156 | # template names. 157 | #html_additional_pages = {} 158 | 159 | # If false, no module index is generated. 160 | #html_domain_indices = True 161 | 162 | # If false, no index is generated. 163 | #html_use_index = True 164 | 165 | # If true, the index is split into individual pages for each letter.
166 | #html_split_index = False 167 | 168 | # If true, links to the reST sources are added to the pages. 169 | #html_show_sourcelink = True 170 | 171 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 172 | #html_show_sphinx = True 173 | 174 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 175 | #html_show_copyright = True 176 | 177 | # If true, an OpenSearch description file will be output, and all pages will 178 | # contain a <link> tag referring to it. The value of this option must be the 179 | # base URL from which the finished HTML is served. 180 | #html_use_opensearch = '' 181 | 182 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 183 | #html_file_suffix = None 184 | 185 | # Output file base name for HTML help builder. 186 | htmlhelp_basename = 'python-clusterdoc' 187 | 188 | 189 | # -- Options for LaTeX output --------------------------------------------- 190 | 191 | latex_elements = { 192 | # The paper size ('letterpaper' or 'a4paper'). 193 | #'papersize': 'letterpaper', 194 | 195 | # The font size ('10pt', '11pt' or '12pt'). 196 | #'pointsize': '10pt', 197 | 198 | # Additional stuff for the LaTeX preamble. 199 | #'preamble': '', 200 | } 201 | 202 | # Grouping the document tree into LaTeX files. List of tuples 203 | # (source start file, target name, title, 204 | # author, documentclass [howto, manual, or own class]). 205 | latex_documents = [ 206 | ('index', 'python-cluster.tex', u'python-cluster Documentation', 207 | u'Michel Albert', 'manual'), 208 | ] 209 | 210 | # The name of an image file (relative to this directory) to place at the top of 211 | # the title page. 212 | #latex_logo = None 213 | 214 | # For "manual" documents, if this is true, then toplevel headings are parts, 215 | # not chapters. 216 | #latex_use_parts = False 217 | 218 | # If true, show page references after internal links.
219 | #latex_show_pagerefs = False 220 | 221 | # If true, show URL addresses after external links. 222 | #latex_show_urls = False 223 | 224 | # Documents to append as an appendix to all manuals. 225 | #latex_appendices = [] 226 | 227 | # If false, no module index is generated. 228 | #latex_domain_indices = True 229 | 230 | 231 | # -- Options for manual page output --------------------------------------- 232 | 233 | # One entry per manual page. List of tuples 234 | # (source start file, name, description, authors, manual section). 235 | man_pages = [ 236 | ('index', 'python-cluster', u'python-cluster Documentation', 237 | [u'Michel Albert'], 1) 238 | ] 239 | 240 | # If true, show URL addresses after external links. 241 | #man_show_urls = False 242 | 243 | 244 | # -- Options for Texinfo output ------------------------------------------- 245 | 246 | # Grouping the document tree into Texinfo files. List of tuples 247 | # (source start file, target name, title, author, 248 | # dir menu entry, description, category) 249 | texinfo_documents = [ 250 | ('index', 'python-cluster', u'python-cluster Documentation', 251 | u'Michel Albert', 'python-cluster', 'One line description of project.', 252 | 'Miscellaneous'), 253 | ] 254 | 255 | # Documents to append as an appendix to all manuals. 256 | #texinfo_appendices = [] 257 | 258 | # If false, no module index is generated. 259 | #texinfo_domain_indices = True 260 | 261 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 262 | #texinfo_show_urls = 'footnote' 263 | 264 | # If true, do not generate a @detailmenu in the "Top" node's menu. 265 | #texinfo_no_detailmenu = False 266 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | Welcome to python-cluster's documentation! 2 | ========================================== 3 | 4 | Index 5 | ----- 6 | 7 | .. 
toctree:: 8 | :maxdepth: 1 9 | 10 | changelog 11 | 12 | 13 | Introduction 14 | ------------ 15 | 16 | Implementation of cluster algorithms in pure Python. 17 | 18 | As this is executed in the Python runtime, the code runs slower than similar 19 | implementations in compiled languages. In exchange, you can run it on pretty 20 | much any Python object. The different clustering methods have different 21 | prerequisites, however, which are mentioned in the different implementations. 22 | 23 | 24 | 25 | Example for K-Means Clustering 26 | ------------------------------ 27 | 28 | :: 29 | 30 | from cluster import KMeansClustering 31 | data = [ 32 | (8, 2), 33 | (7, 3), 34 | (2, 6), 35 | (3, 5), 36 | (3, 6), 37 | (1, 5), 38 | (8, 1), 39 | (3, 4), 40 | (8, 3), 41 | (9, 2), 42 | (2, 5), 43 | (9, 3) 44 | ] 45 | cl = KMeansClustering(data) 46 | cl.getclusters(2) 47 | 48 | The above code would give the following result:: 49 | 50 | [ 51 | [(8, 2), (8, 1), (8, 3), (7, 3), (9, 2), (9, 3)], 52 | [(3, 5), (1, 5), (3, 4), (2, 6), (2, 5), (3, 6)] 53 | ] 54 | 55 | 56 | Example for Hierarchical Clustering 57 | ----------------------------------- 58 | 59 | :: 60 | 61 | from cluster import HierarchicalClustering 62 | data = [791, 956, 676, 124, 564, 84, 24, 365, 594, 940, 398, 63 | 971, 131, 365, 542, 336, 518, 835, 134, 391] 64 | cl = HierarchicalClustering(data) 65 | cl.getlevel(40) 66 | 67 | The above code would give the following result:: 68 | 69 | [ 70 | [24], 71 | [84, 124, 131, 134], 72 | [336, 365, 365, 391, 398], 73 | [676], 74 | [594, 518, 542, 564], 75 | [940, 956, 971], 76 | [791], 77 | [835], 78 | ] 79 | 80 | 81 | Using :py:meth:`~cluster.method.hierarchical.HierarchicalClustering.getlevel()` 82 | returns clusters where the distance between each cluster is no less than 83 | *level*. 84 | 85 | .. note:: 86 | 87 | Due to a bug_ in earlier releases, the elements of the input data *must be* 88 | sortable! 89 | 90 | ..
_bug: https://github.com/exhuma/python-cluster/issues/11 91 | 92 | 93 | API 94 | --- 95 | 96 | .. toctree:: 97 | :maxdepth: 1 98 | 99 | apidoc/cluster 100 | apidoc/cluster.matrix 101 | apidoc/cluster.method.base 102 | apidoc/cluster.method.hierarchical 103 | apidoc/cluster.method.kmeans 104 | apidoc/cluster.util 105 | 106 | Indices and tables 107 | ================== 108 | 109 | * :ref:`genindex` 110 | * :ref:`modindex` 111 | * :ref:`search` 112 | 113 | -------------------------------------------------------------------------------- /fabfile.py: -------------------------------------------------------------------------------- 1 | import fabric.api as fab 2 | 3 | 4 | @fab.task 5 | def doc(): 6 | with fab.lcd('docs'): 7 | fab.local('../env/bin/sphinx-build ' 8 | '-b html ' 9 | '-d _build/doctrees . ' 10 | '_build/html') 11 | -------------------------------------------------------------------------------- /makedist.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python setup.py sdist 4 | python setup.py bdist 5 | python setup.py bdist_rpm 6 | python setup.py bdist_wininst 7 | 8 | 9 | -------------------------------------------------------------------------------- /pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | looponfailroots = cluster 3 | norecursedirs = env env3 env3_nonumpy .git 4 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [bdist_wheel] 2 | universal=1 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | readme_contents = open("README.rst").read() 4 | 5 | # index where the first paragraph starts 6 | parastart = readme_contents.find('=\n') + 3 7 | 8 
| # first sentence of first paragraph 9 | sentence_end = readme_contents.find('.', parastart) 10 | 11 | setup( 12 | name='cluster', 13 | version=open('cluster/version.txt').read().strip(), 14 | author='Michel Albert', 15 | author_email='michel@albert.lu', 16 | url='https://github.com/exhuma/python-cluster', 17 | packages=['cluster', 'cluster.method'], 18 | license='LGPL', 19 | description=readme_contents[parastart:sentence_end], 20 | long_description=readme_contents, 21 | include_package_data=True, 22 | classifiers=[ 23 | 'Development Status :: 5 - Production/Stable', 24 | 'Intended Audience :: Developers', 25 | 'Intended Audience :: Education', 26 | 'Intended Audience :: Other Audience', 27 | 'Intended Audience :: Science/Research', 28 | 'License :: OSI Approved :: GNU Lesser General Public License v2 (LGPLv2)', 29 | 'Operating System :: OS Independent', 30 | 'Programming Language :: Python', 31 | 'Programming Language :: Python :: 2', 32 | 'Programming Language :: Python :: 3', 33 | 'Topic :: Scientific/Engineering :: Information Analysis', 34 | ] 35 | ) 36 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py27, py35, py36 3 | 4 | [testenv] 5 | deps = pytest 6 | commands = pytest 7 | --------------------------------------------------------------------------------
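The `description` value in `setup.py` above is sliced out of the README by two `find()` calls. The sketch below replays that slicing on a small, invented README string (the real `README.rst` is not shown in this listing, so the text here is purely hypothetical) to make the index arithmetic visible.

```python
# Hypothetical README text, invented for this sketch; the real README.rst
# content is not reproduced here.
readme_contents = (
    "DESCRIPTION\n"
    "===========\n"
    "\n"
    "A toy package that groups objects. More text follows.\n"
)

# find('=\n') locates the last "=" of the first title underline.
# Adding 3 skips that "=", its newline, and the blank line after the title.
parastart = readme_contents.find('=\n') + 3

# The first "." at or after parastart is treated as the end of the sentence.
sentence_end = readme_contents.find('.', parastart)

# Slice excludes the terminating period.
description = readme_contents[parastart:sentence_end]
```

Under these assumptions the slice relies on exactly one blank line after the underline, and it stops at the first period it meets, so an abbreviation such as "e.g." in the opening sentence would cut the description short.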