├── .gitignore ├── wiki ├── Tutorials │ ├── img │ │ ├── DL.png │ │ ├── RDL.png │ │ ├── formuleRDL.png │ │ ├── dendrogram-viz.png │ │ ├── single-linkage.png │ │ ├── complete-linkage.png │ │ ├── kmeans-data-real.png │ │ ├── data-inspector-iris.png │ │ ├── renoir-river-form.png │ │ ├── elbow-method-credit-card.png │ │ ├── normalization_comparison.png │ │ ├── segmented-renoir-river.png │ │ ├── petal-graph-roassal-kmeans.png │ │ ├── renoir-river-all-segments.png │ │ ├── workflow_linear_regression.png │ │ ├── credit-card-dataset-inspector.png │ │ ├── market-basket-inspecting-file.png │ │ ├── workflow_logistic_regression.png │ │ ├── credit-card-reduced-two-dimensions.png │ │ ├── kmeans-data-clustered-two-clusters.png │ │ ├── kmeans-data-clustered-three-clusters.png │ │ └── Capture d’écran 2022-05-09 à 11.12.04.png │ ├── image-segmentation-using-kmeans.md │ ├── clustering-simple-example.md │ ├── market-basket-analysis-using-a-priori.md │ ├── clustering-credit-card-kmeans.md │ ├── linear-regression-tutorial.md │ ├── logistic-regression-tutorial.md │ └── edit-distances-tutorial.md ├── MachineLearning │ ├── img │ │ ├── 1.png │ │ ├── 2.png │ │ ├── 3.png │ │ ├── 4.png │ │ ├── svmHyperplan.png │ │ ├── trash │ │ │ ├── bestHyper.png │ │ │ ├── exampleSVM.png │ │ │ └── svmExample1.png │ │ └── White_square_50%_transparency.png │ ├── Measuring-the-accuracy-of-a-model.md │ ├── k-nearest-neighbors.md │ ├── Logistic-Regression.md │ ├── Support-Vector-Machines.md │ └── Linear-Regression.md ├── StringMatching │ └── Edit-distances.md ├── GettingStarted │ ├── Contributing.md │ └── GettingStarted.md ├── NaturalLanguageProcessing │ └── TFIDF.md ├── LinearAlgebra │ ├── LinearAlgebra.md │ └── Lapack.md ├── Clustering │ └── k-means.md ├── DataExploration │ ├── Random-Partitioner.md │ ├── Normalization.md │ └── Metrics.md └── Graphs │ └── Graph-Algorithms.md ├── .gitattributes ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /wiki/Tutorials/img/DL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/DL.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/RDL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/RDL.png -------------------------------------------------------------------------------- /wiki/MachineLearning/img/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/1.png -------------------------------------------------------------------------------- /wiki/MachineLearning/img/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/2.png -------------------------------------------------------------------------------- /wiki/MachineLearning/img/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/3.png -------------------------------------------------------------------------------- /wiki/MachineLearning/img/4.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/4.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/formuleRDL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/formuleRDL.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/dendrogram-viz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/dendrogram-viz.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/single-linkage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/single-linkage.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/complete-linkage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/complete-linkage.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/kmeans-data-real.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/kmeans-data-real.png -------------------------------------------------------------------------------- /wiki/MachineLearning/img/svmHyperplan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/svmHyperplan.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/data-inspector-iris.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/data-inspector-iris.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/renoir-river-form.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/renoir-river-form.png -------------------------------------------------------------------------------- /wiki/MachineLearning/img/trash/bestHyper.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/trash/bestHyper.png -------------------------------------------------------------------------------- /wiki/MachineLearning/img/trash/exampleSVM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/trash/exampleSVM.png -------------------------------------------------------------------------------- /wiki/MachineLearning/img/trash/svmExample1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/trash/svmExample1.png -------------------------------------------------------------------------------- 
/wiki/Tutorials/img/elbow-method-credit-card.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/elbow-method-credit-card.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/normalization_comparison.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/normalization_comparison.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/segmented-renoir-river.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/segmented-renoir-river.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/petal-graph-roassal-kmeans.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/petal-graph-roassal-kmeans.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/renoir-river-all-segments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/renoir-river-all-segments.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/workflow_linear_regression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/workflow_linear_regression.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/credit-card-dataset-inspector.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/credit-card-dataset-inspector.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/market-basket-inspecting-file.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/market-basket-inspecting-file.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/workflow_logistic_regression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/workflow_logistic_regression.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/credit-card-reduced-two-dimensions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/credit-card-reduced-two-dimensions.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/kmeans-data-clustered-two-clusters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/kmeans-data-clustered-two-clusters.png -------------------------------------------------------------------------------- 
/wiki/MachineLearning/img/White_square_50%_transparency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/MachineLearning/img/White_square_50%_transparency.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/kmeans-data-clustered-three-clusters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/kmeans-data-clustered-three-clusters.png -------------------------------------------------------------------------------- /wiki/Tutorials/img/Capture d’écran 2022-05-09 à 11.12.04.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pharo-ai/wiki/HEAD/wiki/Tutorials/img/Capture d’écran 2022-05-09 à 11.12.04.png -------------------------------------------------------------------------------- /wiki/StringMatching/Edit-distances.md: -------------------------------------------------------------------------------- 1 | # Edit distances 2 | 3 | Please read the README of the repository for more information: https://github.com/pharo-ai/edit-distances 4 | -------------------------------------------------------------------------------- /wiki/GettingStarted/Contributing.md: -------------------------------------------------------------------------------- 1 | # Pharo-ai contribution guide 2 | 3 | The pharo-ai contribution guide can be found: [here](https://github.com/pharo-ai/ai/blob/master/CONTRIBUTING.md) 4 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Set the default behavior, in case people don't have core.autocrlf set. 2 | * text=auto 3 | 4 | # Declare files that will always have CRLF line endings on checkout. 5 | *.st text eol=crlf 6 | 7 | # Denote all files that are truly binary and should not be modified. 8 | *.png binary 9 | *.jpg binary 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 pharo-ai 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /wiki/MachineLearning/Measuring-the-accuracy-of-a-model.md: -------------------------------------------------------------------------------- 1 | # Measuring the accuracy of a model 2 | 3 | To measure the accuracy of the model we will use the metrics package, also available in Pharo-AI. You can refer to the metrics page of this wiki for more information. 4 | 5 | For example, let's imagine that our logistic regression model has produced the following output. 6 | 7 | ```st 8 | prediction := logisticRegressionModel predict: someValues 9 | "#( 1 3 5 7 4 2)" 10 | ``` 11 | 12 | And we know that the real values are: 13 | ```st 14 | realValues := #(1 3.5 4 7 3.5 1) 15 | ``` 16 | 17 | So, we can test the accuracy of the model using the metrics package. 18 | For the previous values, we can compute the R2 score (the coefficient of determination). 19 | 20 | ```st 21 | prediction := #( 1 3 5 7 4 2). 22 | realValues := #(1 3.5 4 7 3.5 1). 23 | 24 | metric := AIR2Score new. 25 | metric computeForActual: realValues predicted: prediction. 26 | "0.8993288590604027" 27 | ``` 28 | 29 | So, our model has an R2 score of 0.89. An R2 score of 1 indicates that the model is predicting perfectly. 30 | 31 | For more information, please refer to the metrics page available in this same wiki. -------------------------------------------------------------------------------- /wiki/GettingStarted/GettingStarted.md: -------------------------------------------------------------------------------- 1 | # Getting Started with Pharo-AI 2 | 3 | To install Pharo-AI, please execute the following code snippet in a Pharo Playground: 4 | 5 | ```st 6 | EpMonitor disableDuring: [ 7 | Metacello new 8 | baseline: 'AIPharo'; 9 | repository: 'github://pharo-ai/ai/src'; 10 | onWarningLog; 11 | load ] 12 | ``` 13 | 14 | If you want to see the status of all our currently available projects, please refer to: https://github.com/pharo-ai/ai 15 | 16 | ## How to contribute 17 | 18 | This is an open-source community project, so we are more than happy to receive contributions. 19 | 20 | For development purposes, we keep the different projects/algorithms in several GitHub repositories. On each wiki page we put a link to the specific repository in which that project is located. 21 | 22 | If you want to fix an issue you can open a Pull Request. You can see more information on the [Contribute a fix to Pharo](https://github.com/pharo-project/pharo/wiki/Contribute-a-fix-to-Pharo) wiki page. 23 | 24 | Finally, if you have an idea or project you want to discuss, we are happy to hear from you. Please send an email to any of the [Pharo mailing lists](https://pharo.org/community) -------------------------------------------------------------------------------- /wiki/MachineLearning/k-nearest-neighbors.md: -------------------------------------------------------------------------------- 1 | # K-Nearest Neighbors 2 | 3 | Repository: https://github.com/pharo-ai/k-nearest-neighbors 4 | 5 | K-nearest neighbors is a supervised, non-parametric classifier. It can be used for both regression and classification, but it is more generally used for classification problems. 6 | 7 | To classify new data, it calculates the distance from the point to its k nearest neighbors and assigns the point the class label of the majority. 8 | 9 | ## Using it 10 | 11 | ```st 12 | kNN := AIKNearestNeighbors new. 13 | 14 | kNN fitX: inputMatrix y: outputVector.
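"inputMatrix is expected to be a collection of numeric feature vectors, for example #( #(1 2) #(2 3) #(8 9) ), and outputVector the class label of each vector, for example #(0 0 1). These shapes are an assumption based on the other examples in this wiki, not on the k-nearest-neighbors documentation itself."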
15 | kNN predict: aCollectionOfPoints 16 | ``` 17 | 18 | ## K-Nearest-Neighbors Distance metrics 19 | 20 | For calculating the distance between two points, k-nearest neighbors uses the Euclidean distance by default. But one can also specify other distances for the calculations. 21 | 22 | For using another distance metric, one can send one of the following messages: 23 | 24 | ```st 25 | kNN := AIKNearestNeighbors k: 5. 26 | 27 | kNN useEuclideanDistance. 28 | kNN useHammingDistance. 29 | kNN useManhattanDistance. 30 | 31 | kNN useMinkowskiDistanceWithPValue: 4 32 | ``` 33 | 34 | ### Euclidean Distance 35 | 36 | This is the distance that is used by default. It uses the Euclidean formula to calculate the distance. One limitation is that this metric will work properly only for real-valued vectors. 37 | 38 | ``` 39 | d(x, y) = ( Σ (yi - xi)^2 ) sqrt 40 | ``` 41 | 42 | ### Manhattan distance 43 | 44 | ``` 45 | d(x, y) = Σ abs (yi - xi) 46 | ``` 47 | 48 | ### Minkowski Distance 49 | 50 | This distance needs to be initialized with a p value. 51 | 52 | ``` 53 | kNN useMinkowskiDistanceWithPValue: 4 54 | ``` 55 | 56 | The formula is: 57 | 58 | ``` 59 | d(x, y) = ( Σ (abs (yi - xi))^p ) ^ (1/p) 60 | ``` 61 | 62 | ### Hamming Distance 63 | 64 | The Hamming distance is mostly used for binary or string vectors. It counts the positions in which the vectors do not match. -------------------------------------------------------------------------------- /wiki/MachineLearning/Logistic-Regression.md: -------------------------------------------------------------------------------- 1 | 2 | # Logistic regression 3 | 4 | Repository: https://github.com/pharo-ai/linear-models 5 | 6 | Logistic regression is a statistical model that predicts the likelihood that an event will happen, giving the probability as output. 7 | 8 | For example, if we have a function: 9 | 10 | ``` 11 | f(x) = 1 if x ≥ 0 12 | 0 if x < 0 13 | ``` 14 | 15 | that returns `1` for all positive numbers including 0, and `0` for all negative numbers. 16 | We can train a logistic model for making predictions when a new number arrives. 17 | 18 | ```st 19 | input := #( (-6.64) (-7) (-0.16) (-8) (-2) 20 | (-90.03) (-7.26) (-3.02) (-54.18) (-9.6) 21 | (8) (7) (3.29) (8.21) (76.09) 22 | (45) (3) (10) (9) (5.11)). 23 | 24 | output := #(0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1). 25 | ``` 26 | 27 | Now we have to train the logistic regression model. 28 | 29 | ```Smalltalk 30 | logisticRegressionModel := AILogisticRegression new 31 | learningRate: 0.001; 32 | maxIterations: 2000; 33 | yourself. 34 | 35 | logisticRegressionModel fitX: input y: output. 36 | ``` 37 | 38 | We can look at the trained parameters. 39 | 40 | ```Smalltalk 41 | b := logisticRegressionModel bias. 42 | "0.005920279313031135" 43 | w := logisticRegressionModel weights. 44 | "#(0.9606463585049467)" 45 | ``` 46 | 47 | We can use the model to predict the output for previously unseen input. 48 | 49 | ```Smalltalk 50 | testInput := #( (-3) (-0.11) (-4.67) (9) (7) (10)). 51 | 52 | expectedOutput := #(0 0 0 1 1 1). 53 | ``` 54 | 55 | ```Smalltalk 56 | logisticRegressionModel predict: testInput. 57 | "#(0 0 0 1 1 1)" 58 | ``` 59 | 60 | If we want to have the probabilities of the model, not just `1` or `0`, we can call the method `predictProbabilities:`. It will return the probabilities that the inputs have an output of 1.
61 | 62 | ```st 63 | logisticRegressionModel predictProbabilities: testInput 64 | "#(0.05335185163762839 0.4750829523986599 0.011203102336961287 0.9998252077292514 0.998807422178672 0.9999331093095885)" 65 | ``` 66 | 67 | In our example, we have a `0.05335185163762839` probability that the output for `-3` is `1`. Also, we have a `0.9999331093095885` probability that the output for `10` is `1`. 68 | 69 | [Measuring the accuracy of a model](./Measuring-the-accuracy-of-a-model.md) -------------------------------------------------------------------------------- /wiki/NaturalLanguageProcessing/TFIDF.md: -------------------------------------------------------------------------------- 1 | # Term Frequency - Inverse Document Frequency (TF-IDF) 2 | 3 | Repository: https://github.com/pharo-ai/tf-idf 4 | 5 | ## How to use it 6 | 7 | Here is a simple example of how you can train a TF-IDF model and use it to assign scores to words. You are given an array of sentences where each sentence is represented as an array of words: 8 | 9 | ```Smalltalk 10 | sentences := #( 11 | (I am Sam) 12 | (Sam I am) 13 | (I 'don''t' like green eggs and ham)). 14 | ``` 15 | 16 | You can train a TF-IDF model on those sentences: 17 | 18 | ```Smalltalk 19 | tfidf := AITermFrequencyInverseDocumentFrequency new. 20 | tfidf trainOn: sentences. 21 | ``` 22 | 23 | Now you can use it to assign TF-IDF scores to words: 24 | 25 | ```Smalltalk 26 | tfidf scoreOf: 'Sam' in: #(I am Sam). "0.4054651081081644" 27 | tfidf scoreOf: 'I' in: #(I am Sam). "0.0" 28 | tfidf scoreOf: 'green' in: #(I am green green ham). "2.1972245773362196" 29 | ``` 30 | 31 | `scoreOf:` is the multiplication of `tf` (term frequency) and `idf` (inverse document frequency). 32 | 33 | ```st 34 | AITermFrequencyInverseDocumentFrequency >> scoreOf: aWord in: aDocument 35 | 36 | | tf idf | 37 | tf := self termFrequencyOf: aWord in: aDocument. 38 | idf := self inverseDocumentFrequencyOf: aWord. 39 | ^ tf * idf 40 | ``` 41 | 42 | You can also encode any given text with a TF-IDF vector, which will contain a TF-IDF score for each word from the vocabulary of unique words extracted from the sentences on which TF-IDF was trained: 43 | 44 | ```Smalltalk 45 | tfidfVector := tfidf vectorFor: #(I am green green ham). 46 | "#(0.0 0.0 0.4054651081081644 0.0 0.0 0.0 2.1972245773362196 1.0986122886681098 0.0)" 47 | ``` 48 | 49 | Those vectors can be used to find semantic similarities between different texts. 50 | 51 | We access the vocabulary: 52 | 53 | ```st 54 | vocabulary := tfidf vocabulary. 55 | "#(#I #Sam #am #and 'don''t' #eggs #green #ham #like)" 56 | ``` 57 | 58 | Map the vocabulary to the score of each word for one sentence: 59 | 60 | ```st 61 | tfidfVector := tfidf vectorFor: #(I am green green ham). 62 | vocabulary := tfidf vocabulary. 63 | 64 | vocabulary 65 | with: tfidfVector 66 | collect: [ :word :scoreOfTheWord | word -> scoreOfTheWord ] 67 | 68 | "{ #I->0.0. 69 | #Sam->0.0. 70 | #am->0.4054651081081644. 71 | #and->0.0. 'don''t'->0.0. 72 | #eggs->0.0. #green->2.1972245773362196. 73 | #ham->1.0986122886681098. 74 | #like->0.0 }" 75 | ``` -------------------------------------------------------------------------------- /wiki/LinearAlgebra/LinearAlgebra.md: -------------------------------------------------------------------------------- 1 | # Linear Algebra 2 | 3 | Repository: https://github.com/pharo-ai/linear-algebra 4 | 5 | Fast linear algebra implemented using [Pharo Lapack](https://github.com/pharo-ai/lapack).
6 | 7 | ## Matrices implemented 8 | 9 | We have implemented several matrices to be used as data structures. 10 | 11 | All the matrices are subclasses of `AIAbstractMatrix`. They have the same API for creating the matrices. 12 | 13 | You can create an empty matrix: 14 | 15 | ```st 16 | AIAbstractMatrix newRows: aNumberOfRows columns: aNumberOfColumns 17 | ``` 18 | 19 | For creating a matrix with the rows represented as an array of arrays: 20 | 21 | ```st 22 | AIAbstractMatrix rows: collectionOfRows 23 | ``` 24 | 25 | ### Contiguous matrix `AIContiguousMatrix` 26 | 27 | The abstract class `AIContiguousMatrix` stores the elements in a flat array. That means we can access one element of the matrix in constant time. 28 | 29 | We have implemented 3 types of matrices: 30 | 31 | - `AIColumnMajorMatrix` 32 | - `AIRowMajorMatrix` 33 | 34 | The difference is how they store the elements in the flat array: column major or row major. 35 | 36 | - `AINativeFloatMatrix` 37 | 38 | This matrix also stores the elements in a column-major form, but it uses a `Float64Array` for storing them. This can be very useful when we want to use this matrix with foreign functions: when calling, for example, C from Pharo, we need to convert the Pharo objects into C objects, and this can take time. So, one can use this matrix for speeding things up, as we do not need to do the conversion. On the other hand, if one wants to use this matrix in Pharo, it is not a good idea, since it will convert each element into a Pharo object and then into a native object again. 39 | 40 | ## Implemented algorithms 41 | 42 | - For the moment, we have only implemented a solver for the [Least Squares Problem](https://en.wikipedia.org/wiki/Least_squares) 43 | 44 | For using it, you first need to use `AIColumnMajorMatrix`, as this solver uses Fortran LAPACK, which expects the matrices to be flattened in a column-major way. 45 | 46 | ```st 47 | matrixA := #( 48 | ( 0.12 -8.19 7.69 -2.26 -4.71) 49 | (-6.91 2.22 -5.12 -9.08 9.96) 50 | (-3.33 -8.94 -6.72 -4.40 -9.98) 51 | ( 3.97 3.33 -2.74 -7.92 -3.20)) asAIColumnMajorMatrix. 52 | 53 | matrixB := #( 54 | (7.30 0.47 -6.28) 55 | (1.33 6.58 -3.42) 56 | (2.68 -1.71 3.46) 57 | (-9.62 -0.79 0.41)) asAIColumnMajorMatrix. 58 | 59 | algo := AILeastSquares new 60 | matrixA: matrixA; 61 | matrixB: matrixB; 62 | yourself. 63 | 64 | algo solve. 65 | 66 | algo solution. "AIColumnMajorMatrix( 67 | (-0.69 -0.80 0.38 0.29 0.29) 68 | (-0.29 -0.48 0.51 0.20 0.18) 69 | (-0.02 -0.15 0.26 -0.18 0.04) )" 70 | 71 | algo singularValues. "#(18.66 15.99 10.01 8.51)" 72 | algo rank. "4" 73 | ``` 74 | -------------------------------------------------------------------------------- /wiki/LinearAlgebra/Lapack.md: -------------------------------------------------------------------------------- 1 | # Pharo LAPACK 2 | 3 | Repository: https://github.com/pharo-ai/lapack 4 | 5 | A minimal FFI binding for LAPACK (http://www.netlib.org/lapack) in Pharo. 6 | 7 | For creating another binding to another method (or routine, because the library is written in Fortran), you only need to create a subclass of `LapackAlgorithm`. You can check `LapackDgelsd` as an example. 8 | 9 | _Note: We have only tested this on macOS, but it should work on Windows and Linux too. A prerequisite is to have the library already installed on your OS. For making it work on Linux and Windows, you only need to override the methods `unixLibraryName` and `win32LibraryName` to return the path in which the library is installed on your system.
Check `LapackLibrary>>#macLibraryName`_ 10 | 11 | ## How to use it? 12 | 13 | Currently, we have only bound one routine: `dgelsd()`, in the class `LapackDgelsd`. 14 | 15 | This algorithm computes the minimum-norm solution to a linear least squares problem for double-precision GE matrices. If you don't know about LAPACK or about the least squares problem, please see the class comment (`LapackDgelsd`). For now, we are going to continue the explanation assuming that the reader knows about it. 16 | 17 | To use the method, you need to pass two matrices: `matrixA` and `matrixB`. The objects need to understand the messages `contentsForLapack` and `contentsForLapackOfAtLeast:`. Those methods are implemented in `Collection` as extensions. So, you can pass an Array or a Collection. 18 | 19 | Be careful: this is the binding for the Fortran LAPACK and it expects the matrices to be flattened in a column-major form. If you want to use this library in a high-level way, we strongly recommend using the [linear-algebra](https://github.com/pharo-ai/linear-algebra) library of pharo-ai. It uses this LAPACK binding for solving the least squares problem, provides a nicer API, and also has matrix data structures that internally store the elements already flattened in a column-major way and that you can use directly. 20 | 21 | ### Code example 22 | 23 | For example, you can use the `LapackDgelsd` algorithm in the following way: 24 | 25 | ```st 26 | matrixA := #( 0.12 -6.91 -3.33 3.97 -8.19 27 | 2.22 -8.94 3.33 7.69 -5.12 28 | -6.72 -2.74 -2.26 -9.08 -4.40 29 | -7.92 -4.71 9.96 -9.98 -3.20 ). 30 | matrixB := #( 7.30 1.33 2.68 -9.62 0.00 31 | 0.47 6.58 -1.71 -0.79 0.00 32 | -6.28 -3.42 3.46 0.41 0.00 ). 33 | 34 | algorithm := LapackDgelsd new 35 | numberOfRows: 4; 36 | numberOfColumns: 5; 37 | numberOfRightHandSides: 3; 38 | matrixA: matrixA; 39 | matrixB: matrixB; 40 | yourself. 41 | 42 | algorithm solve. 43 | 44 | "To get the result, we have to call the accessors. 45 | Info represents if the process completed with success" 46 | algorithm info. 47 | 48 | "the array with the solutions" 49 | algorithm minimumNormSolution. 50 | 51 | "The effective rank" 52 | algorithm rank. 53 | 54 | "And the singular values" 55 | algorithm singularValues. 56 | ``` 57 | -------------------------------------------------------------------------------- /wiki/Clustering/k-means.md: -------------------------------------------------------------------------------- 1 | # K-Means Clustering Algorithm 2 | 3 | K-Means is an unsupervised machine learning algorithm that clusters the data into k clusters. The algorithm consists of 4 basic steps: 4 | 5 | 1. Initialize the initial centroids 6 | 2. Assign all the points to the closest centroid using the Euclidean distance 7 | 3. Update the centroids using the mean of all their points 8 | 4. Repeat steps 2 and 3 until all the centroids remain the same 9 | 10 | After some time the algorithm will converge, but it can converge to a local minimum. This depends on the initialization of the centroids. For bypassing this problem, the algorithm is run several times and it will return the best centroids that were found. See `AIKMeans class>>#defaultNumberOfTimesItIsRun`. 11 | 12 | There are different strategies for initializing the centroids. For the moment, in this implementation we choose k random points that are in the range of the data. That means we take the minimum and maximum of each of the coordinates (in a 2-dimensional space we take `x` and `y`) and we choose a random point between those limits.
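As a concrete illustration of this default strategy, here is a minimal sketch that picks k random points inside the bounding box of the data. This is not the pharo-ai implementation; it assumes, purely for illustration, that the data are 2D Pharo `Point` objects.

```st
"Sketch only: choose k random centroids inside the bounding box of the data."
points := { 1@2. 2@1. 8@9. 9@8. 5@5 }.
k := 2.
minX := (points collect: [ :p | p x ]) min.
maxX := (points collect: [ :p | p x ]) max.
minY := (points collect: [ :p | p y ]) min.
maxY := (points collect: [ :p | p y ]) max.
random := Random new.
centroids := (1 to: k) collect: [ :each |
	(random nextBetween: minX and: maxX) @ (random nextBetween: minY and: maxY) ]
```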
13 | 14 | ## K-Means++ centroids initialization algorithm 15 | 16 | The pharo-ai K-Means implementation uses the k-means++ centroid initialization algorithm by default. 17 | 18 | Using this algorithm instead of randomly choosing the initial centroids makes the centroid initialization step more expensive, but it makes the algorithm converge faster. 19 | 20 | It was proposed in 2007 by Arthur and Vassilvitskii. 21 | 22 | The algorithm is as follows: 23 | 24 | ``` 25 | 1. Choose the first centroid to be a random point. 26 | 2. Calculate the distance of all the points to the chosen centroids. Keep, for each point, the minimum distance to the chosen centroids. 27 | 3. Choose the next centroid at random, with the farthest points having the highest probability of being chosen. 28 | 4. Repeat steps 2 and 3 until k centroids are selected 29 | ``` 30 | 31 | ## Using K-Means 32 | 33 | ## Configuring the parameters 34 | 35 | For creating an instance of the object, you need to specify the number of clusters. For that, you can use the constructor method `numberOfClusters:` 36 | 37 | ```st 38 | kMeans := AIKMeans numberOfClusters: 3. 39 | ``` 40 | 41 | You can also configure the max iterations, the number of clusters and the number of times the k-means algorithm will be run. 42 | 43 | ```st 44 | kMeans 45 | maxIterations: 1000; 46 | numberOfClusters: 4; 47 | timesToRun: 4 48 | ``` 49 | 50 | ## Fitting 51 | 52 | With the method `fit:` you can train the model. As K-Means is a clustering algorithm, that is, unsupervised learning, we only need to pass the data points (there is no output variable) to train the model. 53 | 54 | ```st 55 | kMeans fit: aCollectionOfPoints. 56 | ``` 57 | 58 | ## Clustering 59 | 60 | 61 | After the model is fitted, you can access the clusters using the `clusters` accessor method. 62 | The final clusters are the result of the clustering of the data. 63 | 64 | ```st 65 | kMeans clusters 66 | ``` 67 | 68 | For clustering unseen data you can use the `predict:` method. It will return the clustering for the new data (data that was not used for training the model). 69 | 70 | ```st 71 | kMeans predict: aCollection 72 | ``` 73 | 74 | ## Evaluating 75 | 76 | For getting the score of the K-Means model you can send the message `score:`. The score in this case is the sum of the Euclidean distances between all the points and their centroids. It is also called inertia. 77 | 78 | ```st 79 | kMeans score: aCollectionOfPoints 80 | ``` 81 | 82 | The `transform:` method will calculate the Euclidean distance from each of the points that are passed as arguments to their centroids. 83 | 84 | ```st 85 | kMeans transform: aCollectionOfPoints 86 | ``` -------------------------------------------------------------------------------- /wiki/DataExploration/Random-Partitioner.md: -------------------------------------------------------------------------------- 1 | # Random partitioner for Datasets 2 | 3 | Repository: https://github.com/pharo-ai/random-partitioner 4 | 5 | This is a Pharo library for partitioning a collection. Given a set of k proportions, for example 50%, 30%, and 20%, it shuffles the collection and divides it into k non-empty subsets in such a way that every element is included in exactly one subset. 6 | 7 | `RandomPartitioner` can be used in machine learning and statistical analysis for splitting the data into training, validation, and test (a.k.a. holdout) sets, or partitioning the data for cross-validation.
8 | 9 | ## Table of contents 10 | 11 | - [Simple example](#simple-example) 12 | - [Practical example: training, validation, and test sets](#practical-example-training-validation-and-test-sets) 13 | 14 | ## Simple example 15 | 16 | Here is a small array of 10 letters: 17 | 18 | ```Smalltalk 19 | letters := #(a b c d e f g h i j). 20 | ``` 21 | We can split it into 3 random subsets with 50%, 30%, and 20% of the data respectively: 22 | 23 | ```Smalltalk 24 | partitioner := AIRandomPartitioner new. 25 | subsets := partitioner split: letters withProportions: #(0.5 0.3 0.2). 26 | ``` 27 | The result might look something like this: 28 | 29 | ``` 30 | #((d h j a b) 31 | (i f e) 32 | (g c)) 33 | ``` 34 | 35 | Alternatively, you might want to specify the exact sizes of each partition. Let's split the array into two random subsets with 3 and 7 elements: 36 | 37 | ```Smalltalk 38 | subsets := partitioner split: letters withSizes: #(3 7). 39 | ``` 40 | 41 | This may produce the following partition: 42 | 43 | ``` 44 | #((d e a) 45 | (c j g f i b h)) 46 | ``` 47 | 48 | ## Practical example: training, validation, and test sets 49 | 50 | In this example, we will be splitting a real dataset into three subsets: one for training the machine learning model, one for validation (adjusting the parameters of the model) and one for testing the final result (a separate subset of data that is not used during training and allows us to evaluate how well the model generalizes when fed previously unseen data). 51 | 52 | We will be working with the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) of flowers - it is a simple and relatively small dataset that is widely used for teaching classification algorithms. 53 | 54 | The easiest way to quickly load the Iris dataset is to install the [Pharo Datasets](https://github.com/pharo-ai/Datasets) - a simple library that allows you to load various toy datasets. We install it by executing the following Metacello script: 55 | 56 | ```Smalltalk 57 | Metacello new 58 | baseline: 'AIDatasets'; 59 | repository: 'github://pharo-ai/Datasets'; 60 | load. 61 | ``` 62 | 63 | Now we can load the Iris dataset: 64 | 65 | ```Smalltalk 66 | irisDataset := AIDatasets loadIris. 67 | ``` 68 | 69 | This gives us a [data frame](https://github.com/PolyMathOrg/DataFrame) with 150 rows and 5 columns. Just to illustrate what we are working with, here are the first 5 rows of our dataset: 70 | 71 | | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class | 72 | |-----|-----|-----|-----|--------| 73 | | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 74 | | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 75 | | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 76 | | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 77 | | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 78 | 79 | We split this data frame into three non-intersecting subsets: we will use 50% of the data for training the model (75 flowers), 25% of the data for validating it (37 flowers), and 25% for testing (38 flowers). 80 | 81 | ```Smalltalk 82 | partitioner := AIRandomPartitioner new. 83 | subsets := partitioner split: irisDataset withProportions: #(0.5 0.25 0.25). 84 | 85 | irisTraining := subsets first. 86 | irisValidation := subsets second. 87 | irisTest := subsets third. 88 | ``` 89 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Pharo-AI Wiki 2 | 3 | This is the Pharo-AI Wiki.
The goal of this wiki is to provide documentation and tutorials to help people start using our Pharo AI/Machine-Learning libraries. 4 | 5 | - [Getting Started page](./wiki/GettingStarted/GettingStarted.md). 6 | - [Contributing guide](./wiki/GettingStarted/Contributing.md) 7 | 8 | If you want to see other Machine Learning projects in Pharo, please see: https://github.com/pharo-ai/awesome-pharo-ml 9 | 10 | Keep in mind that the wiki and pharo-ai are currently under construction, so not all the algorithms are documented yet or have all the functionalities that we would like them to have. Nevertheless, everything that is documented here has been revised and is working. 11 | 12 | ## Contents 13 | 14 | - [Tutorials](#tutorials) 15 | - [Linear Regression](#linear-regression) 16 | - [Logistic Regression](#logistic-regression) 17 | - [Clustering](#clustering) 18 | - [Data Mining](#data-mining) 19 | - [Edit Distances](#edit-distances) 20 | - [Machine Learning](#machine-learning) 21 | - [Regression](#regression) 22 | - [Classification](#classification) 23 | - [Clustering](#clustering-1) 24 | - [Using metrics](#using-metrics) 25 | - [Linear Algebra](#linear-algebra) 26 | - [Data Preprocessing](#data-preprocessing) 27 | - [Data Mining](#data-mining) 28 | - [Metrics](#metrics) 29 | - [State Space Search](#state-space-search) 30 | - [Natural Language Processing](#natural-language-processing) 31 | 32 | ## Tutorials 33 | 34 | ##### Linear Regression 35 | 36 | - [Using Linear Regression for predicting the price of a house using the Boston Dataset](./wiki/Tutorials/linear-regression-tutorial.md) 37 | 38 | ##### Logistic Regression 39 | 40 | - [Using Logistic Regression for predicting whether someone has diabetes based on their physical condition](./wiki/Tutorials/logistic-regression-tutorial.md) 41 | 42 | ##### Clustering 43 | 44 | - [Using K-Means Clustering Machine Learning Algorithm - Simple Example](./wiki/Tutorials/clustering-simple-example.md) 45 | - [Clustering Users of a Credit Card Company using the K-Means Algorithm](./wiki/Tutorials/clustering-credit-card-kmeans.md) 46 | - [Image segmentation using K-Means](./wiki/Tutorials/image-segmentation-using-kmeans.md) 47 | - [Hierarchical clustering](./wiki/Tutorials/hierarchical-clustering.md) 48 | 49 | ##### Data Mining 50 | 51 | - [Market Basket Analysis Using A-Priori](./wiki/Tutorials/market-basket-analysis-using-a-priori.md) 52 | 53 | ##### Edit Distances 54 | 55 | - [Edit distances: Understanding them and Examples](./wiki/Tutorials/edit-distances-tutorial.md) 56 | 57 | ## Machine Learning 58 | 59 | ##### Regression 60 | 61 | - [Linear regression](./wiki/MachineLearning/Linear-Regression.md) 62 | - [Logistic regression](./wiki/MachineLearning/Logistic-Regression.md) 63 | - [Support Vector Machines](./wiki/MachineLearning/Support-Vector-Machines.md) 64 | 65 | ##### Classification 66 | 67 | - Decision Tree Model 68 | - Naive Bayes Classifier 69 | - [K-Nearest Neighbors](./wiki/MachineLearning/k-nearest-neighbors.md) 70 | 71 | ##### Clustering 72 | 73 | - [K Means](./wiki/Clustering/k-means.md) 74 | - Hierarchical Clustering (WIP) 75 | - Gaussian Mixture Model (WIP) 76 | 77 | ##### Using metrics 78 | 79 | - [Measuring the accuracy of a model](./wiki/MachineLearning/Measuring-the-accuracy-of-a-model.md) 80 | 81 | ## Linear Algebra 82 | 83 | - [Linear Algebra](./wiki/LinearAlgebra/LinearAlgebra.md) 84 | - [Pharo LAPACK](./wiki/LinearAlgebra/Lapack.md) 85 | 86 | ## Data Preprocessing 87 | 88 | - [Normalization](./wiki/DataExploration/Normalization.md) 89 | -
[Random Partitioner for Datasets](./wiki/DataExploration/Random-Partitioner.md) 90 | 91 | ## Data Mining 92 | 93 | - A Priori algorithm 94 | 95 | ## Metrics 96 | 97 | - [Metrics](./wiki/DataExploration/Metrics.md) 98 | - [Edit distances](./wiki/StringMatching/Edit-distances.md) 99 | 100 | ## State Space Search 101 | 102 | - [Graphs algorithms](./wiki/Graphs/Graph-Algorithms.md) 103 | 104 | ## Natural Language Processing 105 | 106 | - Natural language Processing (WIP) 107 | - [Term Frequency - Inverse Document Frequency (TF-IDF)](./wiki/NaturalLanguageProcessing/TFIDF.md) 108 | - N-gram Model (WIP) 109 | - Spelling correction (WIP) 110 | -------------------------------------------------------------------------------- /wiki/DataExploration/Normalization.md: -------------------------------------------------------------------------------- 1 | # Normalization Strategies for Machine Learning 2 | 3 | Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. 4 | 5 | ### Table of Contents 6 | 7 | - [What is normalization](#what-is-normalization) 8 | - [Min-Max Normalization (a.k.a. Rescaling)](#min-max-normalization-aka-rescaling) 9 | - [Standardization](#standardization) 10 | - [How to use it - API](#how-to-use-it---API) 11 | - [How to define new normalization strategies](#how-to-define-new-normalization-strategies) 12 | 13 | ## What is normalization 14 | 15 | For example, consider that you have two collections, `ages` and `salaries`: 16 | 17 | ```Smalltalk 18 | ages := #(25 19 30 32 41 50 24). 19 | salaries := #(1600 1000 2500 2400 5000 3500 2500). 20 | ``` 21 | 22 | Those collections are on very different scales. The differences in salaries have a larger magnitude than the differences in age, which can confuse some machine learning algorithms and force them to "think" that if the difference in salaries is 600 (euros) and the difference in age is 6 (years), then the salary difference is 100 times greater than the age difference. Such algorithms require data to be normalized - for example, both ages and salaries can be transformed to a scale of [0, 1]. 23 | 24 | There are different types of normalization. In this repository, we implement the two most commonly used strategies: [Min-Max Normalization](https://en.wikipedia.org/wiki/Feature_scaling) and [Standardization](https://en.wikipedia.org/wiki/Standard_score). You can easily define your own strategy by adding a subclass of `AINormalizer`. 25 | 26 | ## Min-Max Normalization (a.k.a. Rescaling) 27 | 28 | [Min-Max or Rescaling](https://en.wikipedia.org/wiki/Feature_scaling) is the type of normalization where every element of the numeric collection is transformed to a scale of [0, 1]: 29 | 30 | ``` 31 | x'[i] = (x[i] - x min) / (x max - x min) 32 | ``` 33 | 34 | ## Standardization 35 | 36 | [Standardization](https://en.wikipedia.org/wiki/Standard_score) is the type of normalization where every element of the numeric collection is scaled to be centered around the mean with a unit standard deviation: 37 | 38 | ``` 39 | x'[i] = (x[i] - x mean) / x std 40 | ``` 41 | 42 | ## How to use it - API 43 | 44 | You can normalize any numeric collection by calling the `normalized` method on it: 45 | 46 | ```Smalltalk 47 | numbers := #(10 -3 4 2 -7 1000 0.1 -4.05). 48 | numbers normalized.
"#(0.0169 0.004 0.0109 0.0089 0.0 1.0 0.007 0.0029)" 49 | ``` 50 | 51 | By default, it will use the `AIMinMaxNormalizer`. If you want to use a different normalization strategy, you can call `normalizedUsing:` on a collection: 52 | 53 | ```Smalltalk 54 | normalizer := AIStandardizationNormalizer new. 55 | numbers normalizedUsing: normalizer. "#(-0.3261 -0.3628 -0.343 -0.3487 -0.3741 2.475 -0.3541 -0.3658)" 56 | ``` 57 | 58 | For the two normalization strategies that are defined in this package, we provide alias methods: 59 | 60 | ```Smalltalk 61 | numbers rescaled. 62 | 63 | "is the same as" 64 | numbers normalizedUsing: AIMinMaxNormalizer new. 65 | ``` 66 | ```Smalltalk 67 | numbers standardized. 68 | 69 | "is the same as" 70 | numbers normalizedUsing: AIStandardizationNormalizer new. 71 | ``` 72 | 73 | Each normalizer remembers the parameters of the original collection (e.g., min/max or mean/std) and can use them to restore the normalized collection to its original state: 74 | 75 | ```Smalltalk 76 | numbers := #(10 -3 4 2 -7 1000 0.1 -4.05). 77 | 78 | normalizer := AIMinMaxNormalizer new. 79 | normalizedNumbers := normalizer normalize: numbers. "#(0.0169 0.004 0.0109 0.0089 0.0 1.0 0.007 0.0029)" 80 | restoredNumbers := normalizer restore: normalizedNumbers. "#(10 -3 4 2 -7 1000 0.1 -4.05)" 81 | ``` 82 | 83 | ## How to define new normalization strategies 84 | 85 | Normalization is implemented using a [strategy design pattern](https://en.wikipedia.org/wiki/Strategy_pattern). The `AI-Normalization` defines an abstract class `AINormalizer` which has two abstract methods `AINormalizer class >> normalize: aCollection` and `AINormalizer class >> restore: aCollection`. To define a normalization strategy, please implement a subclass of `AINormalizer` and provide your own definitions of `normalize:` and `restore:` methods. Keep in mind that those methods must not modify the given collection but return a new one. 86 | 87 | To normalize a collection using your own strategy, call: 88 | 89 | ```Smalltalk 90 | normalizer := YourCustomNormalizer new. 91 | numbers normalizedUsing: normalizer. 92 | ``` 93 | -------------------------------------------------------------------------------- /wiki/MachineLearning/Support-Vector-Machines.md: -------------------------------------------------------------------------------- 1 | 9 | 10 | Repository: https://github.com/pharo-ai/Support-Vector-Machines 11 | 12 | 13 | ## Support Vector Machines 14 | 15 | Support Vector Machines (abbreviated as SVM) are supervised learning algorithms that can be used for classification and regression problems. They are based on the idea of finding a hyperplane that best separates the features into two classes. Suppose that we have a set of points in a two-dimensional feature space, with each of the points belonging to one of two classes. An SVM finds what is known as a separating hyperplane: a hyperplane (a line, in the two-dimensional case) which separates the two classes of points from one another. 16 | 17 |

18 | 19 |

20 | 21 | Depending on the dataset and how we want to classify it there are three approaches of SVMs: **Hard Margin**, **Soft Margin** and **Kernels**. We will introduce only the Soft Margin since it is the only implementation we have currently. 22 | ## Soft Margin 23 | 24 | Soft Margin consists in classifying the data avoiding overfitting. In other words, we want to allow some misclassifications in the hope of achieving better generality. 25 | Let's visualize this with an example : 26 | 27 | Given 4 input vectors _x = (x1, x2)_ and 4 output values _y_. 28 | 29 | ```st 30 | input := #( #( 1 0 0 ) #( 1 1 1 ) #( 1 3 3 ) #( 1 4 4 ) ). 31 | output := #( 1 1 -1 -1 ). 32 | ``` 33 | 34 | You will notice that there's a 1 at the beginning of each row of the input, don't worry we will explain it later. 35 | 36 | So we want to find a hyperplan that separates the two classes each input vector belongs to. SVM were originally designed to only support two-class-problems. So our two classes here are {-1,1}. As you can see, the red points belong to the first class and the blue ones to the second class. 37 | 38 |

39 | 40 |

41 | 42 | We have an infinit number of possible hyperplans that can separate our data. Since we're looking for the best one we're dealing with an optimazation problem. Therefor, one of the commonly used ways of solving this type of problems is the Stochastic Gradient Descent (SGD) which will be used here. In simple words, SGD is an iterative learning process where an objective function is minimized according to the direction of steepest ascent so that the best coefficients for modeling may be converged upon. 43 | 44 | 45 |

46 | 47 | 48 |

49 | 50 | 51 | 52 | So basically the SVM what it does : it finds an optimal value of the parameter w (weight) and b (bias) such as when using the function **sgn(x)= sign(wx+b)** we manage to determine the classifier for our dataset. SVM is a linear model, hence the signe function is a straight line and has the general equation form. Where w is the gradient of the line (how steep the line is) and b is the y -intercept (the point in which the line crosses the y -axis). 53 | 54 | For simplicity reasons, we don't want to carry on a dot and a sum. We will push the param b into the weight vector and add a one at the end of each inputVector so that when operating the weights, the last input vector would be doted to the last weight instance and we can get the parameter b (bias). 55 | 56 | ```st 57 | input := input collect: [ :each | each copyWith: 1 ]. 58 | ``` 59 | 60 | We initialize an SVM model and fit it to the data. 61 | 62 | ```st 63 | model := AISupportVectorMachines new. 64 | model regularizationStrenght: 10000. 65 | model learningRate: 1e-6. 66 | model maxEpochs: 5000. 67 | 68 | model fitX: inputMatrix y: outputVector. 69 | ``` 70 | Now that we trained the model, we have the values of the weights that will be needed to predict the output values for new input vectors. 71 | 72 | ```st 73 | model weights. "#(-0.5100145599113426 -0.5100145599113426 2.0303388323187965)" 74 | ``` 75 | 76 | The predict function is defined by the sign function 77 | 78 | So can proceed to predict the output on a new input vector that we will name inputTaste 79 | 80 | ```st 81 | testInput := #( #( 1 -1 -1 ) #( 1 5 5 ) #( 1 6 6 ) #( 1 7 7 ) ). 82 | expectedOutput := #(1 -1 -1 -1) 83 | ``` 84 | ```st 85 | model predict: testInput "#(1 -1 -1 -1)" 86 | ``` 87 | We can see that the expected output and the predicted one are the same. -------------------------------------------------------------------------------- /wiki/Tutorials/image-segmentation-using-kmeans.md: -------------------------------------------------------------------------------- 1 | # Image segmentation using K-Means 2 | 3 | Image segmentation is the process of assigning a label to every pixel of an image. The pixels are grouped into clusters that share certain characteristics. 4 | 5 | By doing this, it reduces the complexity of the image. This can help for finding, for example, the silhouette of a figure. 6 | 7 | In this tutorial, we will use the [K-Means](https://github.com/pharo-ai/k-means) algorithm for segmenting the image. 8 | 9 | ## Installing the project 10 | 11 | First, you need to install our image segmentation project, available in https://github.com/pharo-ai/image-segmentation. 12 | 13 | You can execute the following Metacello script in the Pharo Playground. 14 | 15 | ```st 16 | Metacello new 17 | baseline: 'AIImageSegmentation'; 18 | repository: 'github://pharo-ai/image-segmentation'; 19 | load. 20 | ``` 21 | 22 | ## Importing the image 23 | 24 | For the sake of this tutorial, we will use some open source images that are installing along with the project. But, you can use whatever image that you want. Just keep in mind that the bigger the image is, the longer the algorithm is going to take to fit. 25 | 26 | We will use the picture of a painting of [Renoir](https://en.wikipedia.org/wiki/Pierre-Auguste_Renoir). 27 | 28 | First, we will import the image as a file reference. 29 | 30 | ```st 31 | file := 'pharo-local/iceberg/pharo-ai/image-segmentation/img/renoir_river.jpg' asFileReference. 
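"asFileReference answers a FileReference object for the path string. The path above assumes the project was loaded with Iceberg into the default pharo-local folder; if you want to use your own picture, point the path to any image file on your disk."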
32 | ``` 33 | 34 | We have the image as a `FileReference` object. 35 | 36 | ![Renoir](./img/renoir-river-form.png) 37 | 38 | ## Using the image segmentator 39 | 40 | Let's create the image segmentation model. After creating the object, we need to load the image using the `loadImage:` message. That method will convert the `FileReference` object into an instance of `Form`. Also, we need to specify the number of segments, that is, into how many colors the image will be segmented. In this case we will choose 3 segments (or colors). 41 | 42 | ```st 43 | segmentator := AIImageSegmentator new 44 | loadImage: file; 45 | numberOfSegments: 3; 46 | yourself. 47 | ``` 48 | 49 | To cluster the pixels of the image, we need to send the message `clusterImagePixels` to the `segmentator`. This will fit the k-means algorithm with the image that we passed. Note that if the image is large, this can take some time. With the Renoir image that we are using, this takes only some seconds. 50 | 51 | ```st 52 | segmentator clusterImagePixels. 53 | ``` 54 | 55 | We have clustered the pixels of our image. 56 | 57 | We can inspect the clusters with: 58 | 59 | ```st 60 | segmentator clusters. "#(2 2 2 2 2 2 2 2 2 2 2 2 2 2 [..]" 61 | ``` 62 | 63 | We can also see the colors that were assigned to each of the clusters: 64 | 65 | ```st 66 | segmentator clusterColors. "an OrderedCollection((Color r: 0.6021505376344086 g: 0.6324535679374389 b: 0.624633431085044 alpha: 1.0) (Color r: 0.8347996089931574 g: 0.8123167155425219 b: 0.7478005865102639 alpha: 1.0) (Color r: 0.42424242424242425 g: 0.48484848484848486 b: 0.5073313782991202 alpha: 1.0))" 67 | ``` 68 | 69 | Now, after clustering the pixels of the image, we want to segment it. To do that, we send the message `segmentate` to the `segmentator`. 70 | 71 | ```st 72 | segmentator segmentate. 73 | ``` 74 | 75 | The image is segmented. We can inspect it using the message: 76 | 77 | ```st 78 | segmentator segmentedImage. 79 | ``` 80 | 81 | ![Segmented Renoir River](./img/segmented-renoir-river.png) 82 | 83 | We see the same Renoir painting but only with 3 colors. Cool! 84 | 85 | We can also inspect each of the image segments with: 86 | 87 | ```st 88 | segmentator segments. 89 | ``` 90 | 91 | Or, we can open all the images (the original image, the segmented one and each of the segments) with: 92 | 93 | ```st 94 | segmentator openAll. 95 | ``` 96 | 97 | ![](./img/renoir-river-all-segments.png) 98 | 99 | ## Summary of the code 100 | 101 | ```st 102 | file := 'pharo-local/iceberg/pharo-ai/image-segmentation/img/renoir_river.jpg' asFileReference. 103 | 104 | segmentator := AIImageSegmentator new 105 | loadImage: file; 106 | numberOfSegments: 3; 107 | yourself. 108 | 109 | segmentator clusterImagePixels. 110 | 111 | segmentator clusters. "#(2 2 2 2 2 2 2 2 2 2 2 2 2 2 [..]" 112 | segmentator clusterColors. "an OrderedCollection((Color r: 0.6021505376344086 g: 0.6324535679374389 b: 0.624633431085044 alpha: 1.0) (Color r: 0.8347996089931574 g: 0.8123167155425219 b: 0.7478005865102639 alpha: 1.0) (Color r: 0.42424242424242425 g: 0.48484848484848486 b: 0.5073313782991202 alpha: 1.0))" 113 | 114 | segmentator segmentate. 115 | 116 | segmentator segmentedImage. 117 | segmentator segments.
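"segments answers the individual segment images; openAll opens the original image, the segmented image and each of the segments."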
118 | segmentator openAll 119 | ``` 120 | -------------------------------------------------------------------------------- /wiki/Tutorials/clustering-simple-example.md: -------------------------------------------------------------------------------- 1 | # Using Agglomerative Hierarchical Clustering Algorithm - Simple Example 2 | 3 | ## Overview 4 | 5 | _If you don't have the library installed, you can refer to: [Getting Started page](../GettingStarted/GettingStarted.md)_ 6 | 7 | In this example, we are going to cluster data using the hierarchical clustering algorithm on a dummy example. 8 | 9 | Hierarchical clustering works on a collection of vectors representing *elements* to cluster. 10 | Each element is represented by an `AIVectorItem`. 11 | 12 | The agglomerative hierarchical clustering algorithm will recursively group together the two closest elements of the collection into one new cluster. 13 | 14 | 1. Each elements can be seen as an atomic cluster; 15 | 1. Group two closest clusters (atomic or not) into one; 16 | 1. Thus at each iteration, there is one less cluster in the collection (replacing two clusters by their new parent); 17 | 1. Recompute the distance from new cluster to all remaining ones 18 | 1. Go back to step 2. until there is only one cluster left (the "root") containing all the initial elements. 19 | 20 | ## Example 21 | 22 | We will use a dummy example of 12 elements. 23 | 24 | We make an **Array** of `AIVectorItem`: 25 | ```st 26 | elts := { 27 | AIVectorItem with: #a and: #(1 0). 28 | AIVectorItem with: #b and: #(1 0). 29 | AIVectorItem with: #c and: #(2 0). 30 | AIVectorItem with: #d and: #(3 0). 31 | AIVectorItem with: #e and: #(4 0). 32 | AIVectorItem with: #f and: #(5 0). 33 | AIVectorItem with: #g and: #(6 0). 34 | AIVectorItem with: #h and: #(7 0). 35 | AIVectorItem with: #i and: #(8 0). 36 | AIVectorItem with: #j and: #(9 0). 37 | AIVectorItem with: #k and: #(0 10). 38 | AIVectorItem with: #l and: #(0 10). 39 | }. 40 | ``` 41 | 42 | Note that #a is equal to #b and #k is equal to #l. therefore they will be the first elements to be grouped in clusters in iteration 1 and 2 of the algorithm. 43 | 44 | To compute the hierarchy of clusters (called a *Dendrogram*) we use: 45 | ```st 46 | engine := AIClusterEngine with: elts. 47 | engine hierarchicalClusteringUsing: #averageLinkage. 48 | ``` 49 | 50 | The dendrogram can be found in `engine dendrogram` it's a binary three with a `#left` and `#right` children. 51 | 52 | Note: The dendrogram may be visualized with Roassal extension: [https://github.com/pharo-ai/viz](https://github.com/pharo-ai/viz). 53 | For this, just execute `engine plotDendrogram` 54 | 55 | 56 | 57 | ## Configuration 58 | 59 | The argument `#averageLinkage` in the code above is used to compute the distance of new clusters to the other ones. 60 | There are 3 values possible: 61 | - `#singleLinkage`: Distance between two clusters is the distance between their two closest (more similar) objects. 62 | - `#averageLinkage`: Distance between two clusters is the arithmetic mean of all the distances between the objects of one and the objects of the other. 63 | - `#completeLinkage`: Distance between two clusters is the distance between their two most dissimilar objects. 64 | 65 | Single Linkage can result in clusters formed of a "chain of elements" but the first and last elements quite far from one another. 66 | 67 | 68 | 69 | Complete linkage results in clusters with "compact contours", but elements not necessarily compact inside. 
70 | Notice that the left branch of the big cluster (ie. top branch on the plot) is better balanced with Complete linkage than with Average linkage (first figure above) 71 | 72 | 73 | 74 | ## Distance matrix 75 | 76 | In the code above, the distance matrix is a default one computed from the elements. 77 | In this case the distance is the sum of the square differences between two vectors. 78 | 79 | One can provide a different matrix using any other possible metrics. 80 | This must be a **distance** matrix (higher value means more different), not a similarity matrix (higher value mean more similar). 81 | 82 | ## Threshold 83 | 84 | In the plots, the horizontal bar are not always at the same position. 85 | This depends on the `threshold` of each cluster. 86 | The threshold of a cluster can be seen as the distance between its two elements. 87 | 88 | With less elements, the threshold is typically small (elements close one to the other). 89 | With more elements, the difference start to be bigger and the threshold augment. 90 | 91 | If you compare the first plot (average linkage) and the last one (complete linkage), you might see that the threshold of the first left node is smaller with average linkage. 92 | The two clusters contain the same elements (10 elements with a 0 as second part). 93 | But because complete linkage computes the distance between two clusters as the distance between their two most dissimilar objects, the threshold ends up being higher. -------------------------------------------------------------------------------- /wiki/Tutorials/market-basket-analysis-using-a-priori.md: -------------------------------------------------------------------------------- 1 | # Market Basket Analysis using A-Priori Algorithm 2 | 3 | In this tutorial we will show an application of the A-Priori algorithm. 4 | You can find it at [pharo-ai/a-priori](https://github.com/pharo-ai/APriori). 5 | We will analyze a dataset that contains a list of transactions of supermarket items that were bought. 6 | 7 | You need to install the A-Priori algorithm. 8 | To do that, execute the following script in your Pharo image: 9 | 10 | ```st 11 | Metacello new 12 | repository: 'github://pharo-ai/a-priori:v1.0.0'; 13 | baseline: 'AIAPriori'; 14 | load. 15 | ``` 16 | 17 | The A-Priori algorithm is going to give us some associations rule of the most frequently items that are bought together. 18 | For example, if someone goes to the supermarket and buys eggs, butter vegetables and yogurt is likely that they will buy milk too. 19 | 20 | ## Importing the data 21 | 22 | First, let's download the [dataset CSV file from Kaggle](https://www.kaggle.com/datasets/irfanasrullah/groceries). 23 | 24 | Then, we need to import the csv file into memory with this line. You need to change the path to location of where you downloaded the file. 25 | 26 | ```st 27 | file := 'your-path-to-the-file/groceries.csv' asFileReference. 28 | ``` 29 | 30 | If we inspect the contents of the file, we will see that we have some empty spaces. Also it has a header. 31 | 32 | ![Inspecting file object](./img/market-basket-inspecting-file.png) 33 | 34 | So, we will do a little pre-processing to clean this dataset. We will divide each line by commas and then we will reject the empty fields. 35 | 36 | ```st 37 | transactions := file contents lines allButFirst collect: [ :line | 38 | (',' split: line) allButFirst reject: [ :item | item isEmpty ] ]. 
39 | ``` 40 | 41 | ## Instantiating A-Priori Algorithm 42 | 43 | We need to create an instance of `APrioriTransactionsArray`. 44 | It is only an object in which we need to pass the array of `transactions`. 45 | 46 | ```st 47 | transactionsSource := APrioriTransactionsArray from: transactions. 48 | ``` 49 | 50 | Then, we instantiate the A-Priori algorithm with the transactions object that we just created. 51 | 52 | ```st 53 | apriori := APriori forTransactions: transactionsSource. 54 | ``` 55 | 56 | A-Priori will find the combinations of items (we call them itemsets) that frequently appear together. 57 | To do that, first we need to define, what it means to be "frequent". 58 | This can be expressed with a metric called _support_ -- a percentage of transactions in which the given itemset appears. 59 | By setting a minimum support threshold of 1\%, we tell the A-Priori algorithm that it should only select the combinations of items that appear in at least 1% of transactions. 60 | 61 | ```st 62 | apriori minSupport: 0.01. 63 | ``` 64 | 65 | For running the algorithm we need to send the message `findFrequentItemsets` to the a-priori object. 66 | 67 | ```st 68 | apriori findFrequentItemsets. 69 | ``` 70 | 71 | The `apriori` object will mine frequent itemsets from the dataset of transactions and store them in `apriori frequentItemsets` attribute. 72 | Those will include itemsets that only contain one item (for example, 50% of customers buy water, which means that water is a frequently purchased item). 73 | But in our case, we are only interested in co-occurences of items (pairs, triplets, etc.) so we remove all frequent itemsets of size 1: 74 | 75 | ```st 76 | frequentItemsets := apriori frequentItemsets select: [ :each | each size > 1 ]. 77 | ``` 78 | 79 | Now we can ask A-Priori to build the association rules with `buildAssociationRules`. 80 | Those are the rules in form `{left itemset} -> {right itemset}` that represent the situation when client has already purchased the items on the left and now we want to recommend him or her the items on the right. 81 | 82 | ```st 83 | apriori buildAssociationRules. 84 | ``` 85 | 86 | We built *all* the associations of the transactions. That means that we built all the combinations of the items that were bought together. Now we need to filter them. 87 | Because we only want to get the rules which have high _confidence_. 88 | Confidence is another metric which tells us the conditional probability of a cient purchasing the items on the right given that this same client has already purchased items on the left. 89 | We ask A-Priori to calculate confidence: 90 | 91 | ```st 92 | apriori calculateAssociationRuleMetrics: { 93 | APrioriConfidenceMetric }. 94 | ``` 95 | 96 | Now we can filter the association rules that have at least 50\% confidence: 97 | 98 | ```st 99 | rules := apriori associationRules select: [ :each | each confidence >= 0.5 ]. 100 | ``` 101 | 102 | ## Summary of the code 103 | 104 | ```st 105 | file := 'your-path-to-the-file/groceries.csv' asFileReference. 106 | 107 | transactions := file contents lines allButFirst collect: [ :line | 108 | (',' split: line) allButFirst reject: [ :item | item isEmpty ] ]. 109 | 110 | transactionsSource := APrioriTransactionsArray from: transactions. 111 | 112 | apriori := APriori forTransactions: transactionsSource. 113 | 114 | apriori minSupport: 0.01. 115 | 116 | apriori findFrequentItemsets. 117 | 118 | frequentItemsets := apriori frequentItemsets select: [ :each | each size > 1 ]. 
119 | 120 | apriori buildAssociationRules. 121 | 122 | apriori calculateAssociationRuleMetrics: { 123 | APrioriConfidenceMetric . 124 | APrioriLiftMetric }. 125 | 126 | rules := apriori associationRules select: [ :each | each confidence >= 0.5 ]. 127 | ``` 128 | -------------------------------------------------------------------------------- /wiki/MachineLearning/Linear-Regression.md: -------------------------------------------------------------------------------- 1 | # Linear regression 2 | 3 | Repository: https://github.com/pharo-ai/linear-models 4 | 5 | ## Table of contents 6 | 7 | - [Linear Regression with Gradient Descent](#linear-regression-with-gradient-descent) 8 | - [Measuring the accuracy of a model](#measuring-the-accuracy-of-a-model) 9 | - [Linear Regression with Least Squares using Pharo-Lapack](#linear-regression-with-least-squares-using-pharo-lapack) 10 | - [Linear Regression with Least Squares using PolyMath](#linear-regression-with-least-squares-using-polymath) 11 | 12 | ## Linear Regression with Gradient Descent 13 | Linear regression is a well-known machine learning algorithms. 14 | It allows us to learn the linear dependency between the input variables $x_1, \dots, x_n$ and the output variable $y$. 15 | Attempts to find the the linear relationship between one or more input variables _x1, x2, ..., xn_ and an output variable _y_. It finds a set of parameters _b, w1, w2, ..., wn_ such that the predicted output _h(x) = b + w1 * x1 + ... + wn * xn_ is as close as possible to the real output _y_. Then we can use the trained model to predict the previously unseen values of y. 16 | 17 | For example: 18 | 19 | Given 20 input vectors _x = (x1, x2, x3)_ and 20 output values _y_. 20 | 21 | ```Smalltalk 22 | input := #( (-6 0.44 3) (4 -0.45 -7) (-4 -0.16 4) (9 0.17 -8) (-6 -0.41 8) 23 | (9 0.03 6) (-2 -0.26 -4) (-3 -0.02 -6) (6 -0.18 -2) (-6 -0.11 9) 24 | (-10 0.15 -8) (-8 -0.13 7) (3 -0.29 1) (8 -0.21 -1) (-3 0.12 7) 25 | (4 0.03 5) (3 -0.27 2) (-8 -0.21 -10) (-10 -0.41 -8) (-5 0.11 0)). 26 | 27 | output := #(-10.6 10.5 -13.6 27.7 -24.1 12.3 -2.6 -0.2 12.2 -22.1 -10.5 -24.3 2.1 14.9 -11.8 3.3 1.3 -8.1 -16.1 -8.9). 28 | ``` 29 | 30 | We want to find the linear relationships between the input and the output. In other words, we need to find such parameters _b, w1, w2, w3_ that the line _h(x) = b + w1 * x1 + w2 * x2 + w3 * x3_ fits the data as closely as possible. To do that, we initialize a linear regression model and fit it to the data. 31 | 32 | ```Smalltalk 33 | linearRegressionModel := AILinearRegression new 34 | learningRate: 0.001; 35 | maxIterations: 2000; 36 | yourself. 37 | 38 | linearRegressionModel fitX: input y: output. 39 | ``` 40 | 41 | Now we can look at the trained parameters. The real relationship between x and y is _y = 2*x1 + 10*x2 - x3_, so the parameters should be close to _b=0_, _w1=2_, _w2=10_, _w3=-1_. 42 | 43 | ```Smalltalk 44 | b := linearRegressionModel bias. 45 | "-0.0029744215186773065" 46 | w := linearRegressionModel weights. 47 | "#(1.9999658061022905 9.972821149946537 -0.9998757756318858)" 48 | ``` 49 | 50 | Finally, we can use the model to predict the output of previously unseen input. 51 | 52 | ```Smalltalk 53 | testInput := #( (-3 0.43 1) (-3 -0.11 -7) 54 | (-6 0.06 -9) (-7 -0.41 7) (3 0.43 10)). 55 | 56 | expectedOutput := #(-2.7 -0.01 -2.4 -25.1 0.3). 57 | ``` 58 | 59 | ```Smalltalk 60 | linearRegressionModel predict: testInput. 
61 | "#(-2.7144345209804244 -0.10075173689646852 -2.405518008448657 -25.090722165135997 0.28647833494634645)"
62 | ```
63 | 
64 | ## Measuring the accuracy of a model
65 | 
66 | Please refer to the wiki page [Measuring the accuracy of a model](./Measuring-the-accuracy-of-a-model.md).
67 | 
68 | ## Linear Regression with Least Squares using Pharo-Lapack
69 | 
70 | We have implemented the class `AILinearRegressionLeastSquares`, which also trains a linear regression model, but with a different approach. The class `AILinearRegression` uses the gradient descent algorithm to find the weights of the model. With `AILinearRegressionLeastSquares`, we express the problem as a least squares problem and solve it using our [Linear Algebra](../LinearAlgebra/Lapack.md) package, which solves linear algebra problems such as least squares. That package uses [Pharo-Lapack](../LinearAlgebra/Lapack.md), a binding for Lapack in Pharo. We use Lapack to speed up the computation, as Lapack is a highly optimized library written in Fortran.
71 | 
72 | The prerequisite is that you need to have Lapack installed on your computer. If you use a Mac, it is already installed by default. You can check [this link](https://netlib.org/lapack/lug/node14.html) for instructions on installing it.
73 | 
74 | ### How to use it
75 | 
76 | This class has the same API for training and predicting as the other ones.
77 | 
78 | ```st
79 | linearRegressionModel := AILinearRegressionLeastSquares new.
80 | 
81 | linearRegressionModel fitX: input y: output.
82 | linearRegressionModel predict: testInput.
83 | ```
84 | 
85 | The difference is that you need to use the class `AIColumnMajorMatrix` or even `AINativeFloatMatrix`. Those are matrix data structures that we implemented to be used with Lapack. They are optimized to avoid transposing the matrix and copying the data.
86 | 
87 | You can create the matrices using the class-side message `rows:`
88 | 
89 | 
90 | ```st
91 | AIColumnMajorMatrix rows: #(
92 | 	#(12 3 5)
93 | 	#(1 4 17)
94 | 	#(3 7 33)
95 | ).
96 | 
97 | AINativeFloatMatrix rows: #(
98 | 	#(12 3 5)
99 | 	#(1 4 17)
100 | 	#(3 7 33)
101 | ).
102 | ```
103 | 
104 | For more information, please refer to the wiki pages [Linear Algebra](../LinearAlgebra/LinearAlgebra.md) and [Lapack](../LinearAlgebra/Lapack.md).
105 | 
106 | 
107 | ## Linear Regression with Least Squares using PolyMath
108 | 
109 | We also have the class `AILinearRegressionLeastSquaresVanilla`, a subclass of `AILinearRegressionLeastSquares`, which uses PolyMath and is therefore written in pure Pharo. You can call the fit method with an array of arrays; since it does not use Lapack, you do not need the special data structures.
110 | 
--------------------------------------------------------------------------------
/wiki/DataExploration/Metrics.md:
--------------------------------------------------------------------------------
1 | # Metrics for Machine Learning
2 | 
3 | Repository: https://github.com/pharo-ai/metrics
4 | 
5 | In this package we have implemented several metrics for different types of machine learning algorithms.
6 | 
7 | _Note: This is currently work in progress.
More metrics will be added in the future and maybe also changes in the architecture._ 8 | 9 | ## Table of Contents 10 | 11 | - [Clustering metrics](#clustering-metrics) 12 | - [Regression metrics](#regression-metrics) 13 | - [Classification metrics](#classification-metrics) 14 | 15 | ## Clustering metrics 16 | 17 | - Jaccard Index (`AIJaccardIndex`) 18 | - Rand Index (`AIRandIndex`) 19 | - Silhouette Index (`AISilhouetteIndex`) 20 | - Adjusted Rand Index (`AIAdjustedRandIndex`) 21 | - Fowlkes Mallows Index (`AIFowlkesMallowsIndex`) 22 | - Mirkin Index (`AIMirkinIndex`) 23 | - Weighted Jaccard Index (`AIWeightedJaccardIndex`) 24 | 25 | ## Regression metrics 26 | 27 | - [Mean Squared Error](#mean-squared-error) 28 | - [Mean Absolute Error](#mean-absolute-error) 29 | - [Mean Squared Logarithmic Error](#mean-squared-logarithmic-error) 30 | - [R2 Score](#r2-score) 31 | - [Root Mean Squared Error](#root-mean-squared-error) 32 | - [Max Error](#max-error) 33 | - [Explained Variance Score](#explained-variance-score) 34 | 35 | This package also contains regression metrics for testing the performance of different machine learning models. For example: Logistic and Linear regression. 36 | Part of this text for the explanation were extracted from [scikit-learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) 37 | 38 | ### Mean Squared Error 39 | 40 | The mean_squared_error function computes mean square error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss. 41 | 42 | ```st 43 | | yActual yPredicted metric | 44 | metric := AIMeanSquaredError new. 45 | yActual := #( 3 -0.5 2 7 ). 46 | yPredicted := #( 2.5 0.0 2 8 ). 47 | metric computeForActual: yActual predicted: yPredicted "0.375" 48 | ``` 49 | 50 | ### Mean Absolute Error 51 | 52 | The mean absolute error function computes mean absolute error, a risk metric corresponding to the expected value of the absolute error loss or -norm loss. 53 | 54 | ```st 55 | | yActual yPredicted metric | 56 | metric := AIMeanAbsoluteError new. 57 | yActual := #( 3 -0.5 2 7 ). 58 | yPredicted := #( 2.5 0.0 2 8 ). 59 | metric computeForActual: yActual predicted: yPredicted "0.5" 60 | ``` 61 | 62 | ### Mean Squared Logarithmic Error 63 | 64 | The mean squared log error function computes a risk metric corresponding to the expected value of the squared logarithmic (quadratic) error or loss. 65 | 66 | ```st 67 | | yActual yPredicted metric | 68 | metric := AIMeanSquaredLogarithmicError new. 69 | yActual := #( 3 5 2.5 7 ). 70 | yPredicted := #( 2.5 5 4 8 ). 71 | metric computeForActual: yActual predicted: yPredicted "0.03973012298459379" 72 | ``` 73 | 74 | ### R2 Score 75 | 76 | The r2 score is the coefficient of determination, usually denoted as R². 77 | 78 | It represents the proportion of variance (of y) that has been explained by the independent variables in the model. 79 | It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance. 80 | 81 | As such variance is dataset dependent, R² may not be meaningfully comparable across different datasets. 82 | Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). 83 | A constant model that always predicts the expected value of y, disregarding the input features, would get a R² score of 0.0. 84 | 85 | ```st 86 | | yActual yPredicted metric | 87 | metric := AIR2Score new. 88 | yActual := #( 3 -0.5 2 7 ). 
89 | yPredicted := #( 2.5 0.0 2 8 ).
90 | metric computeForActual: yActual predicted: yPredicted "0.9486081370449679"
91 | ```
92 | 
93 | ### Root Mean Squared Error
94 | 
95 | This metric is simply the square root of the Mean Squared Error.
96 | 
97 | ```st
98 | | yActual yPredicted metric |
99 | metric := AIRootMeanSquaredError new.
100 | yActual := #( 6 7 2.43 8 ).
101 | yPredicted := #( 5 7 3.09 7 ).
102 | metric computeForActual: yActual predicted: yPredicted "0.780320446995976"
103 | ```
104 | 
105 | ### Max Error
106 | 
107 | The max error function computes the maximum residual error, a metric that captures the worst case error between the predicted value and the true value.
108 | In a perfectly fitted single output regression model, max_error would be 0 on the training set and, though this would be highly unlikely in the real world,
109 | this metric shows the extent of error that the model had when it was fitted.
110 | 
111 | ```st
112 | | yActual yPredicted metric |
113 | metric := AIMaxError new.
114 | yActual := #( 3 2 7 1 ).
115 | yPredicted := #( 9 2 7 1 ).
116 | metric computeForActual: yActual predicted: yPredicted "6"
117 | ```
118 | 
119 | ### Explained Variance Score
120 | 
121 | In statistics, explained variation measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set. Often, variation is quantified as variance; then, the more specific term explained variance can be used.
122 | 
123 | The complementary part of the total variation is called unexplained or residual variation.
124 | 
125 | ```st
126 | | yActual yPredicted metric |
127 | metric := AIExplainedVarianceScore new.
128 | yActual := #( 3 -0.5 2 7 ).
129 | yPredicted := #( 2.5 0.0 2 8 ).
130 | metric computeForActual: yActual predicted: yPredicted "0.9571734475374732"
131 | ```
132 | 
133 | ## Classification metrics
134 | 
135 | - [Accuracy Score](#accuracy-score)
136 | - [F1 Score](#f1-score)
137 | - [Precision Score](#precision-score)
138 | - [Recall Score](#recall-score)
139 | 
140 | ### Accuracy Score
141 | 
142 | The accuracy score function computes the accuracy, returning a fraction. The accuracy is defined as the number of correct predictions divided by the total number of predictions.
143 | 
144 | ```st
145 | | yActual yPredicted metric |
146 | metric := AIAccuracyScore new.
147 | yActual := #( 0 1 2 3 ).
148 | yPredicted := #( 0 2 1 3 ).
149 | metric computeForActual: yActual predicted: yPredicted "0.5"
150 | ```
151 | 
152 | ### F1 Score
153 | 
154 | The F1 score can be interpreted as a harmonic mean of the precision and recall, where the F1 score reaches its best value at 1 and its worst value at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is:
155 | 
156 | `F1 = 2 * (precision * recall) / (precision + recall)`
157 | 
158 | ```st
159 | metric := AIF1Score new.
160 | yActual := #( 0 1 0 0 1 0 ).
161 | yPredicted := #( 0 0 1 0 1 1 ).
162 | metric computeForActual: yActual predicted: yPredicted "0.4"
163 | ```
164 | 
165 | ### Precision Score
166 | 
167 | The precision is the ratio `tp / (tp + fp)` where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
168 | 
169 | The best value is 1 and the worst value is 0.
170 | 
171 | ```st
172 | metric := AIPrecisionScore new.
173 | yActual := #( 0 1 0 0 1 0 ).
174 | yPredicted := #( 0 0 1 0 1 1 ).
175 | metric computeForActual: yActual predicted: yPredicted "(1/3)"
176 | ```
177 | 
178 | ### Recall Score
179 | 
180 | The recall is the ratio `tp / (tp + fn)` where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
181 | 
182 | The best value is 1 and the worst value is 0.
183 | 
184 | ```st
185 | metric := AIRecallScore new.
186 | yActual := #( 0 1 0 0 1 0 ).
187 | yPredicted := #( 0 0 1 0 1 1 ).
188 | metric computeForActual: yActual predicted: yPredicted "(1/2)"
189 | ```
--------------------------------------------------------------------------------
/wiki/Tutorials/clustering-credit-card-kmeans.md:
--------------------------------------------------------------------------------
1 | # Clustering Users of a Credit Card Company using the K-Means Algorithm
2 | 
3 | In this tutorial we will use the k-means clustering algorithm for clustering the clients of a credit card company based on their credit card consumption.
4 | 
5 | ### Data Analysis and Manipulation
6 | 
7 | First, we will load the dataset.
8 | 
9 | ```st
10 | creditCardData := AIDatasets loadCreditCard.
11 | ```
12 | 
13 | If we inspect the DataFrame (open the Pharo Inspector on it), we can see that there are 18 features for each client of the credit card company.
14 | 
15 | 
16 | 
17 | Also, after inspecting the data, we can see that it contains nil fields, so we will replace them with zeros.
18 | 
19 | ```st
20 | creditCardData replaceAllNilsWithZeros.
21 | ```
22 | 
23 | Currently, the ID parameter is a string. It is the letter C along with a number, for example: `C10001`.
24 | So, we convert the ID field into a numeric field.
25 | 
26 | ```st
27 | creditCardData
28 | 	toColumn: 'CUST_ID'
29 | 	applyElementwise: [ :element | element asInteger ].
30 | ```
31 | 
32 | ### K-Means Algorithm
33 | 
34 | For now, the K-Means algorithm expects an `Array` or a `Collection`. The data is a `DataFrame` object, so we need to convert the data from DataFrame to Array.
35 | 
36 | [Optional]
37 | To speed up the computation, we will take 1000 random elements of the data. You can run it with the whole dataset, but it will take some minutes to execute.
38 | 
39 | ```st
40 | dataAsArray := creditCardData asArrayOfRows shuffled first: 1000.
41 | ```
42 | 
43 | The data is too big and has too many dimensions to be visualised. So, to find the best number of clusters, we will train the model with different numbers of clusters, from 2 to 12.
44 | Then, we will plot the error of the model for each number of clusters.
45 | 
46 | ```st
47 | numberOfClustersCollection := 2 to: 12.
48 | 
49 | "Create a collection for storing the errors."
50 | inertias := OrderedCollection new.
51 | 
52 | "We train 11 k-means models and store the error of each of them"
53 | numberOfClustersCollection do: [ :numberOfClusters |
54 | 	kMeans := AIKMeans numberOfClusters: numberOfClusters.
55 | 	kMeans fit: dataAsArray.
56 | 	inertias add: (kMeans score: dataAsArray) ].
57 | ```
58 | 
59 | By definition, if the number of clusters increases, the error will always decrease. There are different techniques to find the best number of clusters. We will use the elbow method.
60 | 
61 | If we plot the errors, the curve will look like an arm. We need to manually find the point at which the error stops decreasing sharply: the "elbow".
62 | 
63 | We will use [Roassal 3](https://github.com/ObjectProfile/Roassal3) for doing the plot. We will not explain the plotting code in this tutorial.
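
If you would rather pick the elbow programmatically than read it off the chart, one rough heuristic is to take the number of clusters where the inertia curve bends the most, i.e. where the second difference of the errors is largest. The snippet below is only an illustrative sketch (it is not part of the pharo-ai API) and reuses the `inertias` and `numberOfClustersCollection` variables defined above.

```st
"Rough elbow heuristic (illustrative sketch, not part of pharo-ai):
pick the number of clusters whose inertia has the largest second difference,
i.e. the point where the curve bends the most."
secondDifferences := (2 to: inertias size - 1) collect: [ :i |
	(inertias at: i - 1) - (2 * (inertias at: i)) + (inertias at: i + 1) ].
elbowIndex := secondDifferences indexOf: secondDifferences max.
bestNumberOfClusters := numberOfClustersCollection at: elbowIndex + 1.
```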
64 | 65 | According to the elbow plot, we can see that the best number of cluster is nine clusters. 66 | 67 | 68 | 69 | Code for plotting the elbow plot. 70 | 71 | ```st 72 | "Elbow draw with Roassal" 73 | elbowChart := RSChart new. 74 | elbowChart extent: (numberOfClustersCollection size * 20) @ (7 * 20). 75 | elbowChart addPlot: 76 | (RSLinePlot new 77 | x: numberOfClustersCollection y: inertias; 78 | color: Color red; 79 | fmt: 'o'; 80 | yourself). 81 | elbowChart addDecoration: 82 | (RSHorizontalTick new 83 | numberOfTicks: numberOfClustersCollection size; 84 | integer; 85 | yourself). 86 | elbowChart addDecoration: 87 | (RSVerticalTick new 88 | asFloat; 89 | yourself). 90 | 91 | elbowChart xlabel: 'Number of Clusters'. 92 | elbowChart ylabel: 'Inertia'. 93 | elbowChart title: 'Elbow '. 94 | elbowChart build. 95 | elbowChart canvas open. 96 | ``` 97 | 98 | If we look at the elbow plot, we it seems that the best number of clusters is 8. Note that you may have not the exact same results. 99 | 100 | We train the algorithm again with that number of clusters. 101 | 102 | ```st 103 | kMeans := AIKMeans numberOfClusters: 8. 104 | kMeans fit: dataAsArray. 105 | 106 | clusters := kMeans clusters. 107 | ``` 108 | 109 | ## All the code 110 | 111 | ```st 112 | "" 113 | "DATA ANALYSIS AND MANIPULATION" 114 | "" 115 | 116 | "Load credit card dataset" 117 | creditCardData := AIDatasets loadCreditCard. 118 | 119 | "If we inspect the DataFrame, we can see that there is 18 features for each client of the credit card company" 120 | 121 | creditCardData columnNames. "('CUST_ID' 'BALANCE' 'BALANCE_FREQUENCY' 'PURCHASES' 'ONEOFF_PURCHASES' ...)" 122 | 123 | "Also, after inspecting the data, we can see that it contains nil fields, 124 | so we will replace them with zeros" 125 | creditCardData replaceAllNilsWithZeros. 126 | 127 | "Currently, the ID parameter is a string. 128 | It is the letter C along with a number. For example: C10001 129 | So, we convert the ID field to be a numeric field" 130 | creditCardData 131 | toColumn: 'CUST_ID' 132 | applyElementwise: [ :element | element asInteger ]. 133 | 134 | "" 135 | "K-MEANS ALGORITHM CLUSTERING" 136 | "" 137 | 138 | "Convert the data from DataFrame to Array and take 1000 random elements to speed up." 139 | dataAsArray := creditCardData asArrayOfRows shuffled first: 1000. 140 | 141 | "We train the model with 12 clusters" 142 | numberOfClustersCollection := 2 to: 12. 143 | 144 | "Create a collection for storing the errors." 145 | inertias := OrderedCollection new. 146 | 147 | "We train 11 k-means models, and store the error of each of them" 148 | numberOfClustersCollection do: [ :numberOfClusters | 149 | kMeans := AIKMeans numberOfClusters: numberOfClusters. 150 | kMeans fit: dataAsArray. 151 | inertias add: (kMeans score: dataAsArray) ]. 152 | 153 | "If we look at the elbow plot, we see that 8 clusters seems to be the best solution." 154 | kMeans := AIKMeans numberOfClusters: 8. 155 | kMeans fit: dataAsArray. 156 | 157 | clusters := kMeans clusters. 158 | ``` 159 | 160 | ## Principal Component Analysis 161 | 162 | Principal component analysis (PCA), is a statistical method that allows to summarize the information of a dataset. It gives the k-most representative features of data. It can be used to more easily visualise the data. 163 | 164 | For doing this, we will use [PolyMath](https://github.com/PolyMathOrg/PolyMath) library that is loaded in the pahro-ai library. 
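
Before applying PCA to the credit card data, here is a tiny sanity check of the PolyMath API that we are about to use (`componentsNumber:`, `fit:` and `transform:`) on a small made-up matrix. The numbers are arbitrary; only the message sends matter.

```st
"Toy PCA example on a hypothetical 4x3 matrix: reduce 3 features to 2 components."
toyMatrix := PMMatrix rows: #( #(2.5 2.4 0.5) #(0.5 0.7 1.5) #(2.2 2.9 0.1) #(1.9 2.2 0.3) ).

toyPca := PMPrincipalComponentAnalyserSVD new.
toyPca componentsNumber: 2.
toyPca fit: toyMatrix.
(toyPca transform: toyMatrix) rows. "4 rows, each reduced from 3 values to 2"
```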
165 | 166 | [Optional] 167 | For speeding purposes, we will take 1000 random elements of the data. You can run it with the whole dataset but it will take some minutes to execute. 168 | 169 | ```st 170 | "For doing the principal component analysis (PCA), we will take randomly 1000 elements for speed up the computation" 171 | shuffledData := creditCardData asArrayOfRows shuffled first: 1000. 172 | ``` 173 | 174 | Now, we create our PolyMath matrix 175 | 176 | ```st 177 | polyMathMatrix := PMMatrix rows: shuffledData. 178 | ``` 179 | 180 | Wa want to plot the data. So, we want 2 parameters, one for the x axis and the other one for the y axis. 181 | 182 | We train the principal component analyser for 2 components. 183 | 184 | ```st 185 | pca := PMPrincipalComponentAnalyserSVD new. 186 | pca componentsNumber: 2. 187 | pca fit: polyMathMatrix. 188 | principalComponents := pca transform: polyMathMatrix. 189 | 190 | firstPrincipalComponent := principalComponents rows collect: [ :each | each first]. 191 | secondPrincipalComponent := principalComponents rows collect: [ :each | each second]. 192 | ``` 193 | 194 | As we discussed in the last part, we see that 8 is the number of clusters that work the best. We train again our model. 195 | 196 | ```st 197 | kMeans := AIKMeans numberOfClusters: 8. 198 | kMeans fit: shuffledData. 199 | 200 | clusters := kMeans clusters. 201 | ``` 202 | 203 | We use those `x` and `y` to plot the data with its different groups. 204 | 205 | ```st 206 | colors := RSColorPalette qualitative dark28 range. 207 | 208 | clusteredDataChart :=RSChart new. 209 | clusteredDataChart addPlot: (plot := RSScatterPlot new x: firstPrincipalComponent y: secondPrincipalComponent ). 210 | clusteredDataChart 211 | xlabel: 'most representative feature'; 212 | ylabel: 'second most representative feature'; 213 | title: 'Clustered data reduced to 2 dimensions'. 214 | 215 | clusteredDataChart build. 216 | 217 | plot ellipses doWithIndex: [ :e :i| 218 | e color: (colors at: (clusters at: i)) ]. 219 | 220 | clusteredDataChart canvas open. 221 | ``` 222 | 223 | We have to keep in mind that the data has 18 dimensions, but we plot it only using 2. So, we lost information in terms of visualisation. Also, we choose the principal components using information of around only 1/9 of the whole dataset. Finallly, it can be that the k-means algorithm may not be the best approach for this problem. 224 | 225 | 226 | 227 | ## All the dimensionality reduction code 228 | 229 | ```st 230 | "take randomly 1000 elements for speed up the computation" 231 | shuffledData := creditCardData asArrayOfRows shuffled first: 1000. 232 | 233 | polyMathMatrix := PMMatrix rows: shuffledData. 234 | 235 | pca := PMPrincipalComponentAnalyserSVD new. 236 | pca componentsNumber: 2. 237 | pca fit: polyMathMatrix. 238 | principalComponents := pca transform: polyMathMatrix. 239 | 240 | 241 | "If we look at the elbow plot, we see that 8 clusters seems to be the best solution." 242 | kMeans := AIKMeans numberOfClusters: 8. 243 | kMeans fit: shuffledData. 244 | 245 | clusters := kMeans clusters. 246 | 247 | "Get the principal components" 248 | firstPrincipalComponent := principalComponents rows collect: [ :each | each first]. 249 | secondPrincipalComponent := principalComponents rows collect: [ :each | each second]. 250 | 251 | "Visualisation to show the different groups that were found using the clustering algorithm" 252 | 253 | colors := RSColorPalette qualitative dark28 range. 254 | 255 | clusteredDataChart :=RSChart new. 
256 | clusteredDataChart addPlot: (plot := RSScatterPlot new x: firstPrincipalComponent y: secondPrincipalComponent ). 257 | clusteredDataChart 258 | xlabel: 'most representative feature'; 259 | ylabel: 'second most representative feature'; 260 | title: 'Clustered data reduced to 2 dimensions'. 261 | 262 | clusteredDataChart build. 263 | 264 | plot ellipses doWithIndex: [ :e :i| 265 | e color: (colors at: (clusters at: i)) ]. 266 | 267 | clusteredDataChart canvas open. 268 | ``` 269 | -------------------------------------------------------------------------------- /wiki/Tutorials/linear-regression-tutorial.md: -------------------------------------------------------------------------------- 1 | # Hands-On Linear Regression 2 | 3 | 4 | _If you don't have the library installed, you can refer to: [Getting Started page](../GettingStarted/GettingStarted.md)_ 5 | 6 | Linear regression is a machine learning model that learns the linear dependencies between the independent variables and the dependent variable. It is capable of making predictions for previously unseen values once it has learned. 7 | 8 | Here we will use the renowned [Boston Housing dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to train the linear regression model. The Boston dataset that contains 13 parameters of a house, like its surface,per capita crime rate by town, etc and as an output the median value of owner-occupied homes in $1000's. 9 | 10 | First, we will load the dataset. Then, normalizing the data, to split between test and training datasets. After, training the logistic regression model and finally measuring its performance. 11 | 12 | 13 | 14 | ## Table of Contents 15 | 16 | - [Hands-On Linear Regression](#hands-on-linear-regression) 17 | - [Table of Contents](#table-of-contents) 18 | - [Preprocessing the data](#preprocessing-the-data) 19 | - [Training the machine learning model](#training-the-machine-learning-model) 20 | - [About normalization](#about-normalization) 21 | - [What is normalization?](#what-is-normalization) 22 | - [Measuring the performance of the model](#measuring-the-performance-of-the-model) 23 | - [All the code](#all-the-code) 24 | 25 | ## Preprocessing the data 26 | 27 | We will use [Pharo Datasets](https://github.com/pharo-ai/datasets) to load the dataset into the Pharo image. The library contains several datasets ready to be loaded. Pharo Datasets will return a [Pharo DataFrame](https://github.com/PolyMathOrg/DataFrame) object. To install Pharo Datasets you only need to run the code sniped of the Metacello script available on the [README](https://github.com/pharo-ai/Datasets) 28 | 29 | First, we load the boston housing dataset into the Pharo image. 30 | 31 | ```st 32 | "Loading the dataset" 33 | data := AIDatasets loadBostonHousing. 34 | ``` 35 | 36 | Now, to train the machine model we need to separate the dataset into at least two parts: one for training and the other for testing it. We have already a library in Pharo that does that: [Random partitioner](https://github.com/pharo-ai/random-partitioner). It is already included be default if you load the Pharo Datasets library. 37 | 38 | We will separate it into two sets: test and train. We need this to train the model and after to see how precisely the model is predicting. 39 | 40 | ```st 41 | "Dividing into test and training" 42 | partitioner := AIRandomPartitioner new. 43 | subsets := partitioner split: data withProportions: #(0.75 0.25). 44 | trainData := subsets first. 45 | testData := subsets second. 
46 | ``` 47 | 48 | Then, we need to obtain the X (independent, input variables) and Y (dependent, output variable) for each one of the test and training sets. 49 | 50 | ```st 51 | "Separating between X and Y" 52 | trainData columnNames. "an OrderedCollection('CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT' 'MEDV')" 53 | 54 | xTrain := trainData columns: #('CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'). 55 | yTrain := trainData column: 'MEDV'. 56 | 57 | xTest := testData columns: #('CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' LSTAT ). 58 | yTest := testData column: 'MEDV'. 59 | ``` 60 | 61 | Finally, as our linear regression model only accepts `SequenceableCollection` and not `DataFrame` objects (for now!), we need to convert the DataFrame into an array. We can do that only sending the message `asArray` or `asArrayOfRows`. 62 | 63 | ```st 64 | "Converting the DataFrame into an array of arrays For using it in the linear model. 65 | For now, the linear model does not work on a DataFrame." 66 | xTrain := xTrain asArrayOfRows. 67 | yTrain := yTrain asArray. 68 | 69 | xTest := xTest asArrayOfRows. 70 | yTest := yTest asArray. 71 | ``` 72 | 73 | ## Training the machine learning model 74 | 75 | We have everything that is needed to start training the linear regression model. We need to load the [Linear models library](https://github.com/pharo-ai/linear-models) from pharo-ai. That library contains both the logistic regression and linear regression algorithms. Both algorithms have the same API. 76 | 77 | We instantiate the model, set the learning rate and the max iterations (if not set, the model will use the default values). After that, we train the model with the `trainX` and `trainY` collections that we have obtained. 78 | 79 | ```st 80 | "Training the linear regression model" 81 | linearRegression := AILinearRegression 82 | learningRate: 0.1 83 | maxIterations: 5000. 84 | 85 | linearRegression fitX: xTrain y: yTrain. 86 | ``` 87 | 88 | If you try to run all the code that we wrote until now, you most likely saw an exception with the message: `The model is starting to diverge. Try setting up a smaller learning rate or normalizing your data.` It is normal! Usually, a model starts to diverge when the data is not normalized or the learning rate is too high. In this case it is because the data is not normalized. 89 | 90 | ## About normalization 91 | 92 | ### What is normalization? 93 | 94 | In statistics and machine learning, normalization is the process which transforms multiple columns of a dataset to make them numerically consistent with each other (e.g. be on the same scale) but at the same time preserve the valuable information stored in these columns. 95 | 96 | For example, we have a table containing the Salaries that a person earns according to some criteria. The values of variable Years since PhD are in the range of `[1 .. 56]` and the salaries `[57,800 .. 231,545]`. If we plot the two variables we see: 97 | 98 | Source: https://blog.oleks.fr/normalization 99 | 100 | So, the big difference between the range of the values can affect our model. 101 | 102 | If you want to read more about normalization Oleks has a [nice blog post](https://blog.oleks.fr/normalization) about it. 103 | >Part of the text for explaining normalization were extracted from that post. 
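
To make this concrete, here is what min-max scaling does to a single column, written out by hand just for illustration (in the tutorial itself we simply call the DataFrame API shown below). The numbers are a hypothetical salary column within the range mentioned above.

```st
"Min-max scaling by hand: every value is mapped into the range [0, 1]."
salaries := #(57800 120000 231545).
minValue := salaries min.
maxValue := salaries max.
salaries collect: [ :each | (each - minValue) / (maxValue - minValue) asFloat ].
"#(0.0 0.358... 1.0) -- same ordering as before, but now on the same scale as any other normalized column"
```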
104 | 105 | For normalizing our data, DataFrame has a simple API: we just call the `DataFrame >> normalized` method that returns a new DataFrame that has been normalized. This method uses the default normalizer that is the min max normalizer. If you want to use another one you can use the method `DataFrame >> normalized: aNormalizerClass` instead. 106 | 107 | So, we just execute this part **before** partitioning the data. 108 | 109 | ```st 110 | "Normalizing the data frames" 111 | normalizedData := data normalized. 112 | ``` 113 | 114 | Pay attention, now, as we want to use the normalized data, in the partitioning part we need to use the `normalizedData` variable instead of `data`. 115 | 116 | ```st 117 | subsets := partitioner split: normalizedData withProportions: #(0.75 0.25). 118 | ``` 119 | 120 | ## Measuring the performance of the model 121 | 122 | Now we can make predictions for previously unseen values: estimate the price of a new house based on its parameters. To make a prediction we need to send the message `predict:` to the linear regression model with the data that we want to predict as an argument. 123 | 124 | ```st 125 | yPredicted := linearRegression predict: xTest. 126 | ``` 127 | 128 | We want to see how well our model is performing. In Pharo we also have a library for measuring the metrics of machine learning models: [Machine learning metrics!](https://github.com/pharo-ai/metrics). As usual, you will find the Metacello script for installing it on the README file. 129 | 130 | For a linear regression model we have several metrics implemented: 131 | - Mean Squared Error (AIMeanSquaredError) 132 | - Mean Absolute Error (AIMeanAbsoluteError) 133 | - Mean Squared Logarithmic Error (AIMeanSquaredLogarithmicError) 134 | - R2 Score (AIR2Score) 135 | - Root Mean Squared Error (AIRootMeanSquaredError) 136 | - Max Error (AIMaxError) 137 | - and Explained Variance Score (AIExplainedVarianceScore) 138 | 139 | For this exercise will we use the R2 score metric (coefficient of determination). It is a coefficient that determinates the proportion of the variation in the dependent variable that is predictable from the independent variables. That means: How close are the predictions to the real values. 140 | 141 | If the value of r2 is 1 means that the model predicts perfectly. 142 | 143 | ```st 144 | "Computing the accuracy of the logistic regression model" 145 | metric := AIR2Score new. 146 | 147 | r2Score "0.7382841848355153" := (metric computeForActual: yTest predicted: yPredicted) asFloat. 148 | ``` 149 | 150 | ## All the code 151 | 152 | Here is the complete workflow of the exercise in which we have worked today. You can run everything in a Pharo Playground to play with the model. 153 | 154 | Do not forget that you need to install the libraries to this to work. 155 | 156 | ```st 157 | "Loading the dataset" 158 | data := AIDatasets loadBostonHousing. 159 | 160 | 161 | "Normalizing the data frames" 162 | normalizedData := data normalized. 163 | 164 | 165 | "SEPARATING THE DATA" 166 | "Dividing into test and training" 167 | partitioner := AIRandomPartitioner new. 168 | subsets := partitioner split: normalizedData withProportions: #(0.75 0.25). 169 | trainData := subsets first. 170 | testData := subsets second. 171 | 172 | 173 | "Separating between X and Y" 174 | trainData columnNames. 
"an OrderedCollection('CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT' 'MEDV')" 175 | 176 | xTrain := trainData columns: #('CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'). 177 | yTrain := trainData column: 'MEDV'. 178 | 179 | xTest := testData columns: #('CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' LSTAT ). 180 | yTest := testData column: 'MEDV'. 181 | 182 | 183 | "Converting the DataFrame into an array of arrays For using it in the linear model. 184 | For now, the linear model does not work on a DataFrame." 185 | xTrain := xTrain asArrayOfRows. 186 | yTrain := yTrain asArray. 187 | 188 | xTest := xTest asArrayOfRows. 189 | yTest := yTest asArray. 190 | 191 | 192 | "TRAINING THE MACHINE LEARNING MODEL" 193 | "Training the linear regression model" 194 | linearRegression := AILinearRegression 195 | learningRate: 0.1 196 | maxIterations: 5000. 197 | 198 | linearRegression fitX: xTrain y: yTrain. 199 | 200 | yPredicted := linearRegression predict: xTest. 201 | 202 | 203 | "COMPUTING METRICS" 204 | "Computing the accuracy of the logistic regression model" 205 | metric := AIR2Score new. 206 | 207 | r2Score "0.7382841848355153" := (metric computeForActual: yTest predicted: yPredicted) asFloat. 208 | ``` 209 | -------------------------------------------------------------------------------- /wiki/Tutorials/logistic-regression-tutorial.md: -------------------------------------------------------------------------------- 1 | # Hands-On Logistic Regression 2 | 3 | Logistic regression is a classification machine algorithm for variables that follow a linear distribution. Here, we will to use the Logistic Regression Pharo implementation to determine if someone has or has not diabetes based on their physical condition. 4 | 5 | We will use data from the [National Institute of Diabetes and Digestive and Kidney Diseases](https://www.kaggle.com/uciml/pima-indians-diabetes-database) to train the machine learning to be able to predict if someone has or not diabetes based in several parameters. 6 | 7 | To make this exercise, we will follow a workflow. First, we need to do all the steps for preprocessing the data. After we have the data all set, we train the machine learning model to finally measure its accuracy. 8 | 9 | 10 | 11 | This is the link of the linear regression exercise if you want to see it. 12 | 13 | ## Table of Contents 14 | - [Preprocessing the data](#preprocessing-the-data) 15 | - [Training the machine learning model](#training-the-machine-learning-model) 16 | - [About normalization](#about-normalization) 17 | - [Measuring the accuracy and other metrics of the model](#measuring-the-accuracy-and-other-metrics-of-the-model) 18 | - [Workflow summary](#workflow-summary) 19 | 20 | ## Preprocessing the data 21 | 22 | In Pharo, we have a library for loading several dataset directly into Pharo as DataFrame objects: [Pharo Datasets](https://github.com/pharo-ai/Datasets). It contains the well-know examples like the [iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) and several other ones. Feel free to inspect the available datasets. Here we will use a dataset from the [National Institute of Diabetes and Digestive and Kidney Diseases](https://www.kaggle.com/uciml/pima-indians-diabetes-database) for predicting if a patient has or not diabetes. 
23 | 24 | First we need to install the library using the [Metacello script](https://github.com/pharo-ai/Datasets) available in the README. The method `loadXXX` will return a [DataFrame object](https://github.com/PolyMathOrg/DataFrame). 25 | 26 | [Pharo DataFrame](https://github.com/PolyMathOrg/DataFrame) can be seen as the Pharo equivalent of [Python Pandas](https://pandas.pydata.org/). 27 | 28 | ```st 29 | "Loading the dataset" 30 | diabetesPima := AIDatasets loadDiabetesPima. 31 | ``` 32 | 33 | Now, for training our model, we need at least two partitions of the dataset: one for training and the other for measuring the accuracy of the model. Also in Pharo, we have a small library to help you with that task! [Random Partitioner](https://github.com/pharo-ai/random-partitioner). The library first random shuffle the data and then partition it with the given proportions. This library is already included by default when you load the [Pharo Datasets](https://github.com/pharo-ai/Datasets) or [Pharo DataFrame](https://github.com/PolyMathOrg/DataFrame). So, you do not need to install it again. We will partition our data into two sets: training and test with a proportion of 75%-25%. 34 | 35 | ```st 36 | "Dividing into test and training" 37 | partitioner := AIRandomPartitioner new. 38 | subsets := partitioner split: diabetesPima withProportions: #(0.75 0.25). 39 | diabetesPimaTrainDF := subsets first. 40 | diabetesPimaTestDF := subsets second. 41 | ``` 42 | 43 | As a next step, we can separate the features between X and Y. That means: into the independent variables `X` and the dependent variable `Y`. As we have the data loaded in a DataFrame object, we can select the desire columns. 44 | 45 | If we inspect `diabetesPima columnNames` we will see all the columns that the data has: `('Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI' 'DiabetesPedigreeFunction' 'Age' 'Outcome')`. Where `Outcome` is the dependent variable and all the rest are the independent ones. 46 | 47 | The method `DataFrame>>columns:` will return a new DataFrame with the specify columns. 48 | 49 | ```st 50 | "Separating between X and Y" 51 | diabetesPimaTrainDF columnNames. "an OrderedCollection('Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI' 'DiabetesPedigreeFunction' 'Age' 'Outcome')" 52 | 53 | xTrain := diabetesPimaTrainDF columns: #('Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI' 'DiabetesPedigreeFunction' 'Age'). 54 | yTrain := diabetesPimaTrainDF column: 'Outcome'. 55 | 56 | xTest := diabetesPimaTestDF columns: #('Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI' 'DiabetesPedigreeFunction' 'Age'). 57 | yTest := diabetesPimaTestDF column: 'Outcome'. 58 | ``` 59 | 60 | Now we have everything that we need to start training our machine learning model! 61 | 62 | ## Training the machine learning model 63 | 64 | The API for both the logistic and linear regression is the same. You can load the logistic regression from the [pharo-ai linear-models](https://github.com/pharo-ai/linear-models) repository. 65 | 66 | The linear models (logistic and linear regression) accept only a `SequenceableCollection` (For now, we are working on making it compatible with DataFrame). We need to convert the DataFrame to an array. Which is quite easy, we only send the message `asArray` or `asArrayOfRows`. 67 | 68 | ```st 69 | "Converting the DataFrame into an array of arrays For using it in the linar model. 70 | For now, the linear model does not work on a DataFrame." 
71 | xTrain := xTrain asArrayOfRows. 72 | yTrain := yTrain asArray. 73 | 74 | xTest := xTest asArrayOfRows. 75 | yTest := yTest asArray. 76 | ``` 77 | 78 | We are set. Now we proceed to train the model. 79 | 80 | ```st 81 | "Training the logistic regression model" 82 | logisticRegression := AILogisticRegression 83 | learningRate: 0.1 84 | maxIterations: 5000. 85 | 86 | logisticRegression fitX: xTrain y: yTrain. 87 | ``` 88 | 89 | If you try to run all the code that we wrote until now, you most likely saw an exception with the message: `The model is starting to diverge. Try setting up a smaller learning rate or normalizing your data.` It is normal! Usually, a model starts to diverge when the data is not normalized or the learning rate is too high. In this case is because the data is not normalized. 90 | 91 | ## About normalization 92 | 93 | ### What is normalization? 94 | 95 | In statistics and machine learning, normalization is the process which transforms multiple columns of a dataset to make them numerically consistent with each other (e.g. be on the same scale) but in the same time preserve the valuable information stored in these columns. 96 | 97 | For example, we have a table that the Salaries that a person earns according to some criteria. The values of variable Years since PhD are in the range of `[1 .. 56]` and the salaries `[57,800 .. 231,545]`. If we plot the two variables we see: 98 | 99 | Source: https://blog.oleks.fr/normalization 100 | 101 | So, the big difference between the range of the values can affect out model. 102 | 103 | If you want to read more about normalization Oleks has a [nice blog post](https://blog.oleks.fr/normalization) about it. 104 | >Part of the text for explaining normalization were extracted from that post. 105 | 106 | For normalizing our data, DataFrame has a simple API: we just call the `DataFrame >> normalized` method that returns a new DataFrame that has been normalized. This method uses the default normalizer that is the min max normalizer. If you want to use another one you can use the method `DataFrame >> normalized: aNormalizerClass` instead. 107 | 108 | So, we just execute this part **before** partitioning the data. 109 | 110 | ```st 111 | "Normalizing the data frames" 112 | normalizedDF := diabetesPima normalized. 113 | ``` 114 | 115 | Pay attention, now, as we want to use the normalized data, in the partitioning part we need to use the `normalizedDF` variable instead of the `diabetesPima`. 116 | 117 | ```st 118 | subsets := partitioner split: normalizedDF withProportions: #(0.75 0.25). 119 | ``` 120 | 121 | Now, if we train the model with the normalized data we will see that everything runs smoothly! Also, as all the data is in the range of [0, 1] we can use a bigger learning rate. We experimented and we saw that with a learning rate of `1` the model converges faster and has the same accuracy. 122 | 123 | ```st 124 | "Training the logistic regression model" 125 | logisticRegression := AILogisticRegression 126 | learningRate: 1 127 | maxIterations: 5000. 128 | 129 | logisticRegression fitX: xTrain y: yTrain. 130 | ``` 131 | 132 | ## Measuring the accuracy and other metrics of the model 133 | 134 | We have our model trained. But now we want to see how our model it is performing. 135 | We will use statistical metrics to measure our model! 136 | 137 | Yes, in Pharo we also have a library for that also: [Machine learning metrics](https://github.com/pharo-ai/metrics). 
138 | As usual, just install the library running the Metacello script from the README. 139 | 140 | Now, we want to make predictions with our model. We want to predict if someone has diabetes or not. We will use the `training dataset` which contains new data that the model has not seen before. 141 | 142 | For making a prediction we send the message `predict:` with the training dataset as an argument. 143 | 144 | ```st 145 | yPredicted := logisticRegression predict: xTest. 146 | ``` 147 | 148 | Now, as we have also the real values for the data. We will measure what is the accuracy of the predictions. 149 | 150 | ```st 151 | "Computing the accuracy of the logistic regression model" 152 | metric := AIAccuracyScore new. 153 | accuracy "0.7916666666666666" := (metric computeForActual: yTest predicted: yPredicted) asFloat. 154 | ``` 155 | 156 | As we can see, our model has a accuracy of **79%**. 157 | 158 | That means, the model predicted correctly if 8 out of 10 people have or do not have diabetes. 159 | 160 | ## All the code 161 | 162 | You can run the following script in a Pharo image to get all the result that we discussed above. 163 | Do not forget to install the libraries! 164 | 165 | The summary of all the workflow that we have done is: 166 | 167 | ```st 168 | "Loading the dataset" 169 | diabetesPima := AIDatasets loadDiabetesPima. 170 | 171 | 172 | "PREPROCESSING THE DATA" 173 | "Normalizing the data frames" 174 | normalizedDF := diabetesPima normalized. 175 | 176 | 177 | "SEPARATING THE DATA" 178 | "Dividing into test and training" 179 | partitioner := AIRandomPartitioner new. 180 | subsets := partitioner split: normalizedDF withProportions: #(0.75 0.25). 181 | diabetesPimaTrainDF := subsets first. 182 | diabetesPimaTestDF := subsets second. 183 | 184 | 185 | "Separating between X and Y" 186 | diabetesPimaTrainDF columnNames. "an OrderedCollection('Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI' 'DiabetesPedigreeFunction' 'Age' 'Outcome')" 187 | 188 | xTrain := diabetesPimaTrainDF columns: #('Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI' 'DiabetesPedigreeFunction' 'Age'). 189 | yTrain := diabetesPimaTrainDF column: 'Outcome'. 190 | 191 | xTest := diabetesPimaTestDF columns: #('Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI' 'DiabetesPedigreeFunction' 'Age'). 192 | yTest := diabetesPimaTestDF column: 'Outcome'. 193 | 194 | 195 | "Converting the DataFrame into an array of arrays For using it in the linear model. 196 | For now, the linear model does not work on a DataFrame." 197 | xTrain := xTrain asArrayOfRows. 198 | yTrain := yTrain asArray. 199 | 200 | xTest := xTest asArrayOfRows. 201 | yTest := yTest asArray. 202 | 203 | 204 | "TRAINING THE MACHINE LEARNING MODEL" 205 | "Training the logistic regression model" 206 | logisticRegression := AILogisticRegression 207 | learningRate: 3 208 | maxIterations: 5000. 209 | 210 | logisticRegression fitX: xTrain y: yTrain. 211 | 212 | yPredicted := logisticRegression predict: xTest. 213 | 214 | 215 | "COMPUTING METRICS" 216 | "Computing the accuracy of the logistic regression model" 217 | metric := AIAccuracyScore new. 218 | accuracy "0.7916666666666666" := (metric computeForActual: yTest predicted: yPredicted) asFloat. 
219 | ```
220 | 
--------------------------------------------------------------------------------
/wiki/Tutorials/edit-distances-tutorial.md:
--------------------------------------------------------------------------------
1 | # Understanding Edit Distances
2 | 
3 | Repository: https://github.com/pharo-ai/edit-distances
4 | 
5 | In this tutorial we are going to see what edit distance is about and some cool applications in everyday life that are implemented with this distance metric. First of all, let's explain what an edit distance is: given two strings (finite sequences of symbols) $s_1$ and $s_2$, the edit distance between them is the minimum number of edit operations required to transform $s_1$ into $s_2$. So we can say that with this distance we are able to measure the similarity between two sequences of symbols. The basic edit operations here are:
6 | 
7 | • Substitutions
8 | • Deletions
9 | • Insertions
10 | 
11 | We say 'basic operations' because when changing one string into another we usually have to substitute, delete or insert a character or a symbol (depending on what we are working with). Say you want to write `fork` but instead you wrote `zork`; to correct this sad mistake you substitute the `z` for an `f`. If we give this operation the value of 1 (a commonly used value), then `zork` is _1 edit distance away from `fork`_. There are evidently other operations that one can do, so the variants of edit distance that exist are obtained by restricting the set of operations or adding more.
12 | 
13 | It is important to say that edit distance has the properties of dynamic programming: the basic principle of dynamic programming is to decompose a problem into stages, find an optimal solution for each stage, and save each solution. And that is exactly how the different edit distance algorithms work.
14 | 
15 | # How does the edit distance algorithm work?
16 | 
17 | The idea is to build a distance matrix. Here we are going to show the progress of the `restricted Damerau-Levenshtein` and the `full Damerau-Levenshtein` algorithms. By explaining these two algorithms you will understand the Levenshtein algorithm as well, since they are based on it.
18 | 
19 | ## Restricted Damerau-Levenshtein distance
20 | 
21 | For 2 words, such as `'a cat'` and `'an act'`, a matrix of size 5x6 is created as shown in the matrix below. Note that the row and column surrounded in light purple are not part of the matrix. They are just added to start counting correctly.
22 | 
23 | For calculating the cost of each cell we proceed by using this function:
24 | 
25 | formulaRDL
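
In case the figure does not render, the recurrence it shows can be written, up to notation, as follows, where $d_{i,j}$ is the value of the cell at row $i$ and column $j$, and $a$ and $b$ are the two strings:

$$
d_{i,j} = \min
\begin{cases}
d_{i-1,\,j} + 1 & \text{(deletion)} \\
d_{i,\,j-1} + 1 & \text{(insertion)} \\
d_{i-1,\,j-1} + [a_i \neq b_j] & \text{(substitution or match)} \\
d_{i-2,\,j-2} + 1 & \text{(transposition, only when } a_i = b_{j-1} \text{ and } a_{i-1} = b_j\text{)}
\end{cases}
$$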
26 | 
27 | 
28 | This function above defines the restricted Damerau-Levenshtein distance. The first four conditions alone define the Levenshtein distance.
29 | 
30 | - Okay, so for each cell, as said, we apply the function above. Let's take the cell (4,3) for example. Note that in Pharo we start at 1 and not 0. The index i is for the rows and j for the columns.
31 | - So we calculate the min of the upper cell value +1, the left cell value +1 and the upper-left cell value + (0 if the characters of this cell are equal and 1 if they are not - in this case they are not). These cells are the ones indicated with the red arrows: min(2+1, 3+1, 2+1) = 3. If we stopped here, we would have calculated the edit distance using the Levenshtein distance.
32 | - To use the restricted Damerau-Levenshtein distance we also have to consider the transposition operation (swapping two consecutive characters). This is possible with the last condition of our function above, which uses the value of the cell (i-2,j-2). Here its value is 1, so we do the same as before and calculate min(2+1, 3+1, 2+1, 1+1) = 2. It's the cell coloured in green. This cell gives us the value of the restricted Damerau-Levenshtein distance between our first string "a cat" and our second string "an act".
33 | - _Note_: Edit distance has the structure of a dynamic programming problem because at each stage of the algorithm we make the optimal choice. Back to our example of the cell (4,3): this cell represents the substrings `"a ca"` and `"an "`. The edit distance between these two is 3, since we have to add "n" after the "a" and add "c" and "a" after the space character " ". Three edit operations to transform "a ca" into "an ".
34 | - _Continuation of the note_: If you take any cell of the matrix you'll notice that its value is the edit distance of the substrings it relates to. So obviously, the last cell, in green, is the value of the edit distance of our two starting strings.
35 | 
36 | formulaRDL
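To make the walk-through above concrete, here is a minimal plain-Pharo sketch of that matrix computation for `'a cat'` and `'an act'`. It uses only standard collection messages and is written for illustration; it is not the pharo-ai implementation, just the same recurrence spelled out:

```st
| s1 s2 rows cols d cost |
s1 := 'a cat'.
s2 := 'an act'.
rows := s1 size + 1.
cols := s2 size + 1.
"The distance table, stored as an array of row arrays, including the extra border row and column."
d := (1 to: rows) collect: [ :i | Array new: cols ].
1 to: rows do: [ :i | (d at: i) at: 1 put: i - 1 ].
1 to: cols do: [ :j | (d at: 1) at: j put: j - 1 ].
2 to: rows do: [ :i |
	2 to: cols do: [ :j |
		cost := (s1 at: i - 1) = (s2 at: j - 1) ifTrue: [ 0 ] ifFalse: [ 1 ].
		"Levenshtein part: deletion, insertion, substitution (or match)"
		(d at: i) at: j put: ((((d at: i - 1) at: j) + 1
			min: ((d at: i) at: j - 1) + 1)
			min: ((d at: i - 1) at: j - 1) + cost).
		"Restricted transposition of two adjacent characters"
		(i > 2 and: [ j > 2
			and: [ (s1 at: i - 1) = (s2 at: j - 2)
			and: [ (s1 at: i - 2) = (s2 at: j - 1) ] ] ]) ifTrue: [
				(d at: i) at: j put:
					(((d at: i) at: j) min: ((d at: i - 2) at: j - 2) + 1) ] ] ].
(d at: rows) at: cols "2, the same value the library reports below"
```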
37 | 
38 | 
39 | ## Damerau-Levenshtein distance
40 | 
41 | For 2 words, such as 'a cat' and 'a abct', a matrix of size 5x6 is created as shown in the matrix below. Note that the rows and columns surrounded in dark and light purple are not part of the matrix; they are just added to count correctly.
42 | 
43 | The main difference between the restricted algorithm and this one is that in the first we couldn't calculate a transposition of non-adjacent characters. In other words, we couldn't edit a substring that was already edited. In the example of the matrix below we have the first string `"a cat"` and the second string `"a abct"`. "ca" could be swapped (swap = transpose) with "ac". But we couldn't re-edit this substring to add "b" to it and have "abc". On the contrary, in the non-restricted Damerau-Levenshtein distance we can! Let's see how.
44 | - Let's take the example of cell (6,7). We apply the Levenshtein function (the first four conditions of the equation above) to calculate the edit distance value of the cell. In addition to that, we calculate the value of a transposition. To know whether we have a transposition or not we:
45 |     - First, go backwards through the rows from our current cell (Matrix: red arrow from the blue cell to the cell with the star) and stop when we find the row whose character is the current column's character - here we're looking for the character `"c"`.
46 |     - Second, go backwards through the columns from the cell we stopped in (here the blue one) and look for the column in this row where the characters match the characters of our current cell (remember, our current cell is still (6,7)). One column back (j-1) we have (c,b), not a match; two columns back (j-2) we have (a,c), it's a match!
47 |     - Now we're in cell (5,5). Go to its upper-left cell, (4,4). Notice that the value of this cell is the edit distance before the transposition. Here, to calculate the transposition, we take the value of the cell we're in (the cell before the transposition), so we have 0. We add to it the cost 1 of the transposition, since it's an edit operation itself, so now we have 0 + 1 = 1. We then add the number of columns and rows we have in between our characters that can be swapped, because we have to count how many edit operations we have to do along with the transposition.
48 |     - Columns: we have 1 column between our two characters that can be swapped, the column for character "b".
49 |     - Rows: we have 0 rows between our two characters that can be swapped.
50 |     - Let's count then: 0 + 1 + 1 + 0 = 2. This is the value of the cell in blue.
51 | - Do the same thing for each cell. At the end we have an edit distance of 2 (cell in green).
52 | 
53 | - And that's how the Damerau-Levenshtein algorithm works. In the code in Pharo, instead of going backwards through the rows, we hold a dictionary with each character of the first string and its row number. Then we only have to compare our current column's character with the matching character from the dictionary and get the number of the row.
54 | 
55 | formulaRDL
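The walk-through above can also be written as a formula. Under the usual Lowrance-Wagner formulation of the unrestricted distance - offered here as a reconstruction of the steps just described, not as a transcription of the figure - the transposition candidate for cell $(i,j)$ is:

$$
d_{i,j} \;\leftarrow\; \min\Bigl( d_{i,j},\; d_{k-1,\,l-1} + (i - k - 1) + 1 + (j - l - 1) \Bigr)
$$

where $k$ is the last row above $i$ whose character equals the current column's character $s_2[j]$, and $l$ is the last column before $j$ whose character equals the current row's character $s_1[i]$. The three added terms are exactly the "rows in between", the transposition itself, and the "columns in between" counted in the example.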
56 | 
57 | 
58 | ## Comparing both distances in Pharo
59 | 
60 | ```st
61 | levenshtein := AILevenshteinDistance new.
62 | restrictedDL := AIRestrictedDamerauLevenshteinDistance new.
63 | fullDL := AIDamerauLevenshteinDistance new.
64 | ```
65 | 
66 | ```st
67 | levenshtein distanceBetween: 'a cat' and: 'an act'. "3"
68 | ```
69 | 
70 | ```st
71 | restrictedDL distanceBetween: 'a cat' and: 'an act'. "2"
72 | ```
73 | 
74 | ```st
75 | fullDL distanceBetween: 'a cat' and: 'a abct'. "2"
76 | ```
77 | 
78 | ```st
79 | restrictedDL distanceBetween: 'a cat' and: 'a abct'. "3"
80 | ```
81 | 
82 | 
83 | # What are they used for?
84 | 
85 | The usefulness of edit distances differs from one use to another. By default, a lower distance implies greater similarity between two words. In NLP (Natural Language Processing) we generally wish to minimize the distance, whereas in computational biology we usually reason in terms of maximizing a similarity score, and in error-correcting codes we even wish to maximize the distance so that one codeword is not easily confused with another one. There are many application domains where you might find uses of edit distance, such as:
86 | 
87 | - **Spelling correction**: For example, a user typed `graffe`; which word is the closest?
88 |   - graf
89 |   - graft
90 |   - grail
91 |   - giraffe
92 | 
93 | Using the edit distance metric we can affirm that `giraffe` is the closest. That's how spelling correction works: it compares every typed word with thousands of correctly spelled words and then uses the edit distance metric to determine the correct spelling by choosing the lowest distance between the typed word and the words of our dictionary. (A small Pharo sketch of this idea appears just before the Example section below.)
94 | 
95 | - **DNA sequence alignment** - [Example](#dna-sequence-alignment): We might wonder about the purpose of sequence alignment; well, it's important for identifying regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
96 | 
97 | 
98 | # Some applications in Machine Learning
99 | 
100 | Just to put things in context, we will quickly recall the definition of Machine Learning:
101 | 
102 | 
103 | _It's the use and development of computer systems that are **able to learn and adapt without following explicit instructions**, by using algorithms and statistical models to analyse and draw inferences from patterns in data._
104 | So it's essentially about prediction: we don't have to repeat a task that can be done mechanically (automatically) after training our algorithm or machine on a historical dataset and applying it to new data. Here are some machine learning domains that use edit distance or similar distance metrics:
105 | 
106 | - **Recommendation systems** (an algorithm that suggests relevant items to users): Using the cosine similarity, whose value lies in the interval between 0 and 1, we can express the degree of similarity between items as a percentage (73%, 50%, ... of similarity).
107 | - **Optical character recognition**: Recognizing off-line characters from text images. It allows you to quickly and automatically digitize a document without the need for manual data entry.
108 | - **Document similarity**: Used in Information Retrieval, whose goal is to develop models for retrieving (collecting) information from repositories of documents.
109 | - **Image data matching for entity resolution**: Used to track Google Images results for product design copyright infringement, or to match products across different competitors to understand market size or track prices.
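Before moving on to the DNA example, here is the tiny spelling-correction illustration promised above. It uses only the API already shown in this tutorial plus standard Pharo collection messages; the four-word dictionary is of course a toy stand-in for a real one:

```st
| levenshtein dictionary typed |
levenshtein := AILevenshteinDistance new.
dictionary := #( 'graf' 'graft' 'grail' 'giraffe' ).
typed := 'graffe'.

"Pick the dictionary word with the smallest Levenshtein distance to the typed word."
dictionary detectMin: [ :word | levenshtein distanceBetween: typed and: word ]. "'giraffe'"
```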
110 | 
111 | # Example
112 | 
113 | ## DNA sequence alignment
114 | 
115 | **Computational biology** is a field in which computers are used to do research on biological systems. Here we will compute DNA sequence alignments. The optimization process involves evaluating how well individual base pairs match up in the DNA sequence case.
116 | 
117 | **Biology review**: DNA is an acid that contains all of the hereditary genetic information and comes as an "instruction manual". It is composed of four nucleotide bases {Adenine (A), Thymine (T), Guanine (G), Cytosine (C)} that together form a string called a _genetic sequence_.
118 | 
119 | **In bioinformatics**, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
120 | 
121 | **Relation with edit distance?**
122 | 
123 | The edit distance is the score of the best possible alignment between two genetic sequences over all possible alignments.
124 | 
125 | **Why sequence alignment?**
126 | 
127 | • Assembling fragments to sequence DNA.
128 | • Comparing individuals to look for mutations.
129 | 
130 | For our example, we want to compare the DNA sequence of a gene that we found in an unstudied organism. Why? To learn the function of the protein that this gene encodes, by comparing it to other sequences and choosing the one with the lowest distance - since that will mean it is the most similar and has similar functions. So, using edit distance, we will measure the similarity of this gene with two other genes that have been sequenced before and whose functions are understood.
131 | 
132 | So let's compare this sequence:
133 | 
134 | `Seq1 : G C T A A C T C G G A`
135 | 
136 | with this one
137 | 
138 | `Seq2 : C G T A A C A C G G A`
139 | 
140 | and this one
141 | 
142 | `Seq3 : C G T A A C A C T T G`
143 | 
144 | We will use the Levenshtein distance.
145 | 
146 | ```st
147 | | levenshtein unknownSequence firstSequence secondSequence |
148 | 
149 | levenshtein := AILevenshteinDistance new.
150 | 
151 | unknownSequence := 'GCTAACTCGGA'.
152 | 
153 | firstSequence := 'CGTAACACGGA'.
154 | 
155 | secondSequence := 'CGTAACACTTG'.
156 | 
157 | levenshtein distanceBetween: unknownSequence and: firstSequence. "3"
158 | 
159 | levenshtein distanceBetween: unknownSequence and: secondSequence. "6"
160 | 
161 | ```
162 | 
163 | The most similar of these two sequences is `Seq2`. So, for their research, biologists will choose this one over `Seq3` to find the functions of the protein that this unknown gene sequence encodes.
164 | 
165 | 
166 | 
167 | 
168 | 
169 | 
170 | 
171 | 
172 | 
173 | 
174 | 
--------------------------------------------------------------------------------
/wiki/Graphs/Graph-Algorithms.md:
--------------------------------------------------------------------------------
1 | # Graph Algorithms
2 | 
3 | Repository: https://github.com/pharo-ai/graph-algorithms
4 | 
5 | Graph algorithms are very useful for different types of problems. We also have a great booklet, [Booklet-PharoGraphs](https://github.com/SquareBracketAssociates/Booklet-PharoGraphs), in which we go into the details of the logic of the algorithms and their implementation in Pharo.
6 | 7 | ### Table of Contents 8 | 9 | - [Implemented graph algorithms](#implemented-graph-algorithms) 10 | - [How to use the graph algorithms - API](#how-to-use-the-graph-algorithms---api) 11 | - [Graph generation algorithms](#graph-generation-algorithms) 12 | - [Comparison of the implemented algorithms](#comparison-of-the-implemented-algorithms) 13 | - [External links](#external-links) 14 | 15 | ## Implemented graph algorithms 16 | 17 | Currently, we have these algorithms implemented: 18 | 19 | - Topological Sorting 20 | + Kahn's Algorithm 21 | - Shortest Path 22 | + BFS (Breadth First Search) 23 | + DFS (Depth First Search) 24 | + Dijkstra's Algorithm 25 | + in DAG (Directed Acyclic Graph) 26 | + Bellman-Ford Algorithm (for DCG, Directed Cyclic Graph) 27 | + Floyd-Warshall Algorithm 28 | + A* Algorithm 29 | - Longest Path 30 | + in DAG (variation of shortest path in DAG) 31 | + in DCG (variation of Bellman-Ford's Algorithm) 32 | - Minimum Spanning Tree 33 | + Kruskal’s Algorithm 34 | + Prim’s Algorithm 35 | - Strongly Connected Components 36 | + Tarjan's Algorithm 37 | + Graph Reducer 38 | - Maximum Flow 39 | + Edmonds-Karp Algorithm 40 | + Dinic's Algorithm 41 | - Link Analysis 42 | + HITS (Hyperlink-Induced Topic Search) Algorithm 43 | - Matching (Independent Edge Set) 44 | + Maximum-weight Matching Approximation Algorithm 45 | (with variations for minimum-weight and maximum-cardinality) 46 | + Gale-Shapely Stable Matching Algorithm (for the so-called marriage problem) 47 | 48 | ## How to use the graph algorithms - API 49 | 50 | All the graph algorithms of this library share a common API also. The class AIGraphAlgorithm provides the common API to add nodes, edges, searching the nodes, etc. 51 | 52 | Some of the common methods are: 53 | 54 | - `algorithm nodes:` 55 | - `algorithm nodes` 56 | - `algorithm edges` 57 | - `algorithm edges:from:to:` 58 | - `algorithm edges:from:to:weight:` 59 | - `algorithm findNode:` 60 | - `algorithm run` 61 | 62 | For example, for using the topological sort algorithm for the following graph, we can run this code snippet 63 | 64 | 65 | 66 | ```st 67 | "First define the nodes and the edges" 68 | nodes := #( $A $B $C $D $E $F $G ). 69 | edges := #( #( $A $B ) #( $A $C ) #( $B $E ) #( $C $E ) #( $C $D ) 70 | #( $D $E ) #( $D $F ) #( $E $G ) #( $F $G ) ). 71 | 72 | "Instantiate the graph algorithm" 73 | topSortingAlgo := AITopologicalSorting new. 74 | 75 | "Set the nodes and edges" 76 | topSortingAlgo nodes: nodes. 77 | topSortingAlgo 78 | edges: edges 79 | from: [ :each | each first ] 80 | to: [ :each | each second ]. 81 | 82 | "Run to obtain the result" 83 | topologicalSortedElements := topSortingAlgo run. 84 | ``` 85 | 86 | Or if we want to find the shortest path in a weighted graph: 87 | 88 | 89 | 90 | ```st 91 | nodes := $A to: $F. 92 | edges := #( #( $A $B 5 ) #( $A $C 1 ) #( $B $C 2 ) #( $B $E 20 ) 93 | #( $B $D 3 ) #( $C $B 3 ) #( $C $E 12 ) #( $D $C 3 ) 94 | #( $D $E 2 ) #( $D $F 6 ) #( $E $F 1 ) ). 95 | 96 | dijkstra := AIDijkstra new. 97 | dijkstra nodes: nodes. 98 | dijkstra 99 | edges: edges 100 | from: [ :each | each first ] 101 | to: [ :each | each second ] 102 | weight: [ :each | each third ]. 103 | 104 | shortestPathAToB := dijkstra runFrom: $A to: $B. 105 | pathDistanceAToB := (dijkstra findNode: $B) pathDistance. 106 | 107 | dijkstra end: $F. 108 | shortestPathAToF := dijkstra reconstructPath. 109 | pathDistanceAToF := (dijkstra findNode: $F) pathDistance. 110 | 111 | dijkstra reset. 112 | shortestPathBToE := dijkstra runFrom: $B to: $E. 
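"Hypothetical continuation for illustration only (pathDistanceBToE is not defined elsewhere
in this page): after the re-run, the distance of the new target can be read with the same
accessors shown above."
pathDistanceBToE := (dijkstra findNode: $E) pathDistance.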
113 | ``` 114 | 115 | ## Graph generation algorithms 116 | 117 | This library also contains algorithms for generating regular and random graphs. This algorithms are not loaded by default. To load them, you can either load them manually using [Iceberg directly from the Pharo image](https://github.com/pharo-vcs/iceberg/wiki/Tutorial) or load the `GraphGenerators` baseline group. 118 | 119 | The algorithms implemented are: 120 | 121 | - Albert Barabasi Graph Generator 122 | - Atlas Graph Graph Generator 123 | - Erdos Renyi GNM Graph Generator 124 | - Erdos Renyi GNP Graph Generator 125 | - Grid 2D Graph Generator 126 | - Grid 3D Graph Generator 127 | - Hexagonal Lattice Graph Generator 128 | - Kleinberg Graph Generator 129 | - Triangular Lattice Graph Generator 130 | - Waltz Strogatz Graph Generator 131 | 132 | ## Comparison of the implemented algorithms 133 | 134 | This section summarizes for each provided graph algorithm in the library its input, its output, some remark and the implemented time complexity. 135 | 136 | **Abbreviations**: DAG means Directed Acyclic Graph, DCG means Directed Cyclic Graph, DG means DG Directed Graph (cyclic or not). V represents the number of vertices and E the number of edges. 137 | 138 | ### Topological sorting 139 | 140 | | Algorithm | Input | Output | Remark | Complexity | 141 | | --------- | ----- | ------------ | ---------------------- | ---------- | 142 | | Kahn's | DAG | sorted nodes | working list iteration | O(V + E) | 143 | 144 | ### Shortest path 145 | 146 | | Algorithm | Input | Output | Remark | Complexity | 147 | | -------------- | --------------------------------------------------- | --------------------------- | --------------------------------------------------------------------------------------------------------------------------- | -------------- | 148 | | BFS | unweighted graph and starting node | path nodes | BFS in queue | O(V + E) | 149 | | DFS | unweighted graph and starting node | path nodes | DFS in queue | O(V + E) | 150 | | Dijkstra's | graph with non-negative weights and starting node | path nodes | [dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming#Dijkstra's_algorithm_for_the_shortest_path_problem) | O(V²) | 151 | | in DAG | weighted DAG and starting node | path nodes | based on topological sorting | O(V + E) | 152 | | Bellman-Ford | weighted graph and starting node | path nodes | edge [relaxation](https://en.wikipedia.org/wiki/Relaxation_(iterative_method)); detects negative cycles | O(V \* E) | 153 | | Floyd-Warshall | weighted graph | path for every node pair | data structure: two matrices; detects negative cycles | O(V³) | 154 | | A* | graph with non-negative weights, start and end node | path from start to end node | Extension of Dijkstra's algorithm | O(E \* log(V)) | 155 | 156 | ### Strongly connected components 157 | 158 | | Algorithm | Input | Output | Remark | Complexity | 159 | | ------------- | ------------------------- | ------------------------------ | ------------------------------- | ---------- | 160 | | Tarjan's | unweighted graph | list of node lists | based on DFS | O(V + E) | 161 | | Graph Reducer | Tarjan's algorithm output | nodes of the new reduced graph | no default reduction of weights | ? 
| 162 | 163 | ### Maximum flow 164 | 165 | | Algorithm | Input | Output | Remark | Complexity | 166 | | ------------ | -------------------------------------- | ------------------ | ---------------- | ---------- | 167 | | Edmonds-Karp | weighted DG, source node and sink node | maximal flow value | uses BFS | O(V * E²) | 168 | | Dinic's | weighted DG, source node and sink node | maximal flow value | layering concept | O(V² \* E) | 169 | 170 | ### Link analysis 171 | 172 | | Algorithm | Input | Output | Remark | Complexity | 173 | | --------- | ----- | -------------------------------------------- | --------------------------------------- | ---------- | 174 | | HITS | DG | hub score and authority score for every node | optional weights to improve performance | O(V + E) | 175 | 176 | ### Minimum spanning tree 177 | 178 | | Algorithm | Input | Output | Remark | Complexity | 179 | | --------- | ------------------------------------------------ | -------------------------------- | ---------------------------------------------------------------------------------------- | ---------------- | 180 | | Kruskal’s | undirected weighted graph (non-negative weights) | subset of edges (tree or forest) | [disjoint-set data structure](https://en.wikipedia.org/wiki/Disjoint-set_data_structure) | O(V \* log(E)) | 181 | | Prim’s | connected undirected weighted graph | subset of edges (tree) | very similar to Dijkstra's algorithm | O( V \* (V + E)) | 182 | 183 | Both algorithms are [greedy](https://en.wikipedia.org/wiki/Greedy_algorithm). 184 | 185 | ### Matching 186 | 187 | | Problem | Algorithm | Input | Output | Remark | Complexity | 188 | | ---------------- | -------------- | ----------------------------------------------------------- | ------------------------------------------------ | ---------------------------------- | -------------- | 189 | | Maximum Matching | Graph Matching | undirected weighted graph | independent edges set | greedy approximation | O(E \* log(V)) | 190 | | Stable Matching | Gale-Shapely's | nodes in group A or B with preferences over the other group | set of edges, each one relating a different pair | groups A and B are of equal size n | O(n²) | 191 | 192 | ## External links 193 | 194 | Almost all items are well described in Wikipedia. For a couple of items, the posts in [w3schools](https://www.w3schools.com), [Research Scientist Pod](https://researchdatapod.com) and [Jiho's Blog](https://berryjune07.github.io/computer-science) are outstanding. 
195 | 196 | ### Graph problems 197 | 198 | - Topological sorting: [Wikipedia](https://en.wikipedia.org/wiki/Topological_sorting) 199 | 200 | - Shortest path: [Wikipedia](https://en.wikipedia.org/wiki/Shortest_path_problem), [comparison between algorithms](https://www.geeksforgeeks.org/dsa/comparison-between-shortest-path-algorithms/) 201 | 202 | - Minimum spanning tree: [Wikipedia](https://en.wikipedia.org/wiki/Minimum_spanning_tree), [Jiho's Blog](https://berryjune07.github.io/computer-science/minimum-spanning-tree.html), [Prim's vs Kruskal's (Geeks for geeks)](https://www.geeksforgeeks.org/dsa/difference-between-prims-and-kruskals-algorithm-for-mst), [Prim's vs Kruskal's (Baeldung)](https://www.baeldung.com/cs/kruskals-vs-prims-algorithm) 203 | 204 | - Strongly connected components: [Wikipedia](https://en.wikipedia.org/wiki/Strongly_connected_component) 205 | 206 | - Maximum flow: [Wikipedia](https://en.wikipedia.org/wiki/Maximum_flow_problem) 207 | 208 | - Link analysis: [Wikipedia](https://en.wikipedia.org/wiki/Link_analysis) 209 | 210 | - Matching: [Wikipedia](https://en.wikipedia.org/wiki/Matching_(graph_theory)), [Brilliant course](https://brilliant.org/wiki/matching-algorithms) 211 | 212 | ### Graph algorithms 213 | 214 | - A*: [Wikipedia](https://en.wikipedia.org/wiki/A*_search_algorithm), [Reserch Scientist Pod](https://researchdatapod.com/a-star-algorithm/) 215 | 216 | - Bellman-Ford: [Wikipedia](https://en.wikipedia.org/wiki/Bellman–Ford_algorithm), [w3schools](https://www.w3schools.com/dsa/dsa_algo_graphs_bellmanford.php), [Research Scientist Pod](https://researchdatapod.com/bellman-ford-algorithm/), [Jiho's Blog](https://berryjune07.github.io/computer-science/bellman-ford-algorithm.html) 217 | 218 | - BFS vs. DFS: [Wikipedia (BFS)](https://en.wikipedia.org/wiki/Breadth-first_search), [Wikipedia (DFS)](https://en.wikipedia.org/wiki/Depth-first_search), [wscubetech](https://www.wscubetech.com/resources/dsa/dfs-vs-bfs), [Baeldung](https://www.baeldung.com/cs/dfs-vs-bfs), [Jiho's Blog](https://berryjune07.github.io/computer-science/dfs-and-bfs.html), [IntelliPaat](https://intellipaat.com/blog/difference-between-bfs-and-dfs/), [codecademy](https://www.codecademy.com/article/bfs-vs-dfs) 219 | 220 | - Dijkstra's: [Wikipedia](https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm), [w3schools](https://www.w3schools.com/dsa/dsa_algo_graphs_dijkstra.php), [Research Scientist Pod](https://researchdatapod.com/dijkstras-algorithm/) 221 | 222 | - Dinic's: [Wikipedia](https://en.wikipedia.org/wiki/Dinic%27s_algorithm), [Geeks for geeks](https://www.geeksforgeeks.org/dsa/dinics-algorithm-maximum-flow) 223 | 224 | - Edmonds-Karp: [Wikipedia](https://en.wikipedia.org/wiki/Edmonds–Karp_algorithm) 225 | 226 | - Floyd-Warshall: [Wikipedia](https://en.wikipedia.org/wiki/Floyd–Warshall_algorithm), [Research Scientist Pod](https://researchdatapod.com/floyd-warshall-algorithm/), [Jiho's Blog](https://berryjune07.github.io/computer-science/floyd-warshall-algorithm.html) 227 | 228 | - Gale-Shapely: [Wikipedia](https://de.wikipedia.org/wiki/Stable_Marriage_Problem#Gale-Shapley-Algorithmus) 229 | 230 | - HITS: [Wikipedia](https://en.wikipedia.org/wiki/HITS_algorithm) 231 | 232 | - Kahn's: [Wikipedia](https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm) 233 | 234 | - Kruskal’s: [Wikipedia](https://en.wikipedia.org/wiki/Kruskal%27s_algorithm) 235 | 236 | - Disjoint-Set data structure: [Wikipedia](https://en.wikipedia.org/wiki/Disjoint-set_data_structure) 237 | 238 | - Disjoint-Set Forest: 
[Jiho's Blog](https://berryjune07.github.io/computer-science/disjoint-set-forest.html) 239 | 240 | - Matching Approximation: [Cornell lecture](https://www.cs.cornell.edu/courses/cs6820/2014fa/matchingNotes.pdf) 241 | 242 | - Prim's: [Wikipedia](https://en.wikipedia.org/wiki/Prim%27s_algorithm) 243 | 244 | - Tarjan's: [Wikipedia](https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm), [dev.to](https://dev.to/muhammad_taaha_90438c47f1/understanding-tarjans-algorithm-with-visual-examples-5hb8) 245 | --------------------------------------------------------------------------------