├── LICENSE
├── README.md
├── example.ijs
├── km.ijs
└── km.jproj

/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2014 Vincent Toups

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
K-Means In J
------------

This is an implementation of K-Means, a simple clustering algorithm, in
the remarkable and remarkably weird language J.

Usage
-----

    load '/km.ijs'
    vectors =: ? 100 2 $ 50
    nClusters =: 3

    nClusters km vectors

    0 0 0 0 1 1 2 0 0 1 2 0 2 2 ...


This results in a list of integral labels, associating each vector in
the right hand argument with a cluster. To recover the cluster
centers, pass those labels (called `ids` here) to `calcCenters`:

    ids calcCenters vectors

which returns a list of vectors, one for the center of each
cluster.

About K-Means
-------------

K-means works by the almost comically simple expedient of assigning
each vector in the data set to a random cluster, calculating the
centers of these clusters, and then re-assigning the vectors to the
center to which they are nearest.

The process is repeated as necessary until the assignments converge.
Generally, this results in one of a few stable configurations,
depending on the initial assignment.

--------------------------------------------------------------------------------
/example.ijs:
--------------------------------------------------------------------------------
load 'stats/distribs'
load './km.ijs'
load 'graphics/plot'

NB. size createCluster cx,cy,sx,sy
NB. draw size points centered at (cx,cy) and scaled by (sx,sy);
NB. the result is a size x 2 matrix.
createCluster =: dyad define
  size =. x
  'cx cy sx sy' =. y
  xs =. cx + sx * (rnorm size)
  ys =. cy + sy * (rnorm size)
  |: (2,size) $ xs,ys
)

nPerCreateCluster =: 300

NB. create three Gaussian clusters
c1 =: nPerCreateCluster createCluster 0 3 0.6 0.6
c2 =: nPerCreateCluster createCluster _3 0 0.6 0.6
c3 =: nPerCreateCluster createCluster 3 0 0.6 0.6

NB. combine the clusters into one data set
NB. shuffle just to prove nothing funny is going on
data =: shuffle c1,c2,c3

NB. split an N x 2 matrix into boxed x and y coordinate lists
forPlotting =: (0&{ ; 1&{)@|:


NB. cluster our data
clustering =: 3 km data

pd 'reset'
pd 'type dot'
pd 'color red'
pd forPlotting (I. 0 = clustering) { data

pd 'color green'
pd forPlotting (I. 1 = clustering) { data

pd 'color blue'
pd forPlotting (I. 2 = clustering) { data

pd 'show'

--------------------------------------------------------------------------------
/km.ijs:
--------------------------------------------------------------------------------
NB. This file contains an implementation of k-means, a clustering
NB. algorithm which takes a set of vectors and a number of clusters
NB. and assigns each vector to a cluster.
NB.
NB. Example Data
NB. v1 =: ? 10 2 $ 10
NB. v2 =: ? 3 2 $ 10
NB. v3 =: ? 100 2 $ 50
NB. trivialV =: 2 2 $ 10
NB.
NB. Examples
NB. 3 km v3

NB. produce a random permutation of the indices of y.
permutation =: (#@[ ? #@[)

NB. shuffle list
NB. put the items of the list in a random order
shuffle =: {~ permutation

NB. nClusters drawIds vectors
NB. given a list of vectors and a number of clusters
NB. assign each vector to a cluster randomly
NB. with the constraint that all clusters are
NB. roughly equal in size.
drawIds =: shuffle@(#@] $ i.@[)

NB. squared Euclidean distance between two vectors as lists
vd =: +/@:(*:@-)

NB. Give the (squared) distance between every vector in x and every vector in y
distances =: (vd"1 1)/

NB. return the index of the minimum value of a list;
NB. applied to each row of a table with "1.
minI =: i. <./

NB. Given x a list of centers and y a list of vectors,
NB. return the labeling of the vectors to each center.
reId =: ((minI"1)@distances)~

NB. canonicalize labels
NB. given a list of labels, re-label the labels so
NB. that 0 appears first, 1 second, etc
canonicalize =: ] { /:@~.

NB. (minI"1 v1 distances v2 )

NB. column-wise sum and column-wise mean of a list of vectors
vsum =: +/"1@|:
vm =: (# %~ (+/"1@|:))

NB. ids calcCenters vectors
NB. mean of each group of vectors sharing an id, ordered by id
calcCenters =: (/:@~.@[) { (vm/.)

NB. ids kmOnce vectors This performs the heavy lifting for K-means.
NB. Given a list of integral ids, one for each vector in the N x D
NB. vectors, this verb calculates the averages of the groups implied
NB. by those labels and re-assigns each vector to one of those
NB. averages, whichever it is closest to.

kmOnce =: dyad define
  ids =. x
  vectors =. y
  centers =. ids calcCenters vectors
  centers reId vectors
)

NB. Use the power conjunction to iterate kmOnce until the labels converge
kmConverge =: ((kmOnce~)^:_)~

NB. nClusters km dataset k means. Given a cluster count on the left
NB. and a vector data set on the right, return an array of IDs
NB. assigning each vector to a cluster.
NB. One can use `calcCenters` to recover the cluster centers.

km =: canonicalize@((drawIds) kmConverge ])

NB. Example Data
NB. v1 =: ? 10 2 $ 10
NB. v2 =: ? 3 2 $ 10
NB. v3 =: ? 100 2 $ 50
NB. trivialV =: 2 2 $ 10

NB. Examples
NB. 3 km v3

--------------------------------------------------------------------------------
/km.jproj:
--------------------------------------------------------------------------------
NB. project:
NB.
NB. defines list of source files.
NB. path defaults to project directory.

km.ijs

--------------------------------------------------------------------------------