├── 0_Intro_OT.ipynb
├── 0_Intro_OT.py
├── 1_DomainAdaptation.ipynb
├── 1_DomainAdaptation.py
├── 2_ColorGrading.ipynb
├── 2_ColorGrading.py
├── 3_WMD.ipynb
├── 3_WMD.py
├── LICENSE
├── README.md
├── data
│   ├── data_text.npz
│   ├── klimt.jpg
│   ├── manhattan.npz
│   ├── mnist_usps.npz
│   ├── model.npz
│   └── schiele.jpg
└── slides
    ├── Part1_intro_OT_2022.pdf
    ├── Part2_UOT_GW_Rennes_2022.pdf
    └── Part3_OTML_Rennes_2022.pdf


/0_Intro_OT.py:
--------------------------------------------------------------------------------

# coding: utf-8

# # Introduction to Optimal Transport with Python
#
# #### *Rémi Flamary, Nicolas Courty*

# ## POT installation

# + Install with pip:
# ```bash
# pip install pot
# ```
# + Install with conda
# ```bash
# conda install -c conda-forge pot
# ```

# ## POT Python Optimal Transport Toolbox
#
# #### Import the toolbox

# In[1]:


import numpy as np # always need it
import scipy as sp # often use it
import pylab as pl # do the plots

import ot # ot


#%% #### Getting help
#
# Online documentation : [http://pot.readthedocs.io](http://pot.readthedocs.io)
#
# Or inline help:
#

# In[2]:


help(ot.dist)


#%% ## First OT Problem
#
# We will solve the Bakery/Cafés problem of transporting croissants from a number of Bakeries to Cafés in a City (in this case Manhattan). We did a quick Google Maps search in Manhattan for bakeries and Cafés:
#
# ![bak.png](https://remi.flamary.com/cours/otml/bak.png)
#
# We extracted from this search their positions and generated fictional production and sale numbers (that both sum to the same value).
#
# We have access to the position of Bakeries ```bakery_pos``` and their respective production ```bakery_prod``` which describe the source distribution.
# The Cafés where the croissants are sold are also defined by their position ```cafe_pos``` and their sales ```cafe_prod```. For fun we also provide a map ```Imap``` that will illustrate the position of these shops in the city.
#
#
# Now we load the data
#
#

# In[3]:


data=np.load('data/manhattan.npz')

bakery_pos=data['bakery_pos']
bakery_prod=data['bakery_prod']
cafe_pos=data['cafe_pos']
cafe_prod=data['cafe_prod']
Imap=data['Imap']

print('Bakery production: {}'.format(bakery_prod))
print('Cafe sale: {}'.format(cafe_prod))
print('Total croissants : {}'.format(cafe_prod.sum()))


#%% #### Plotting bakeries in the city
#
# Next we plot the position of the bakeries and cafés on the map. The size of each circle is proportional to the production (bakeries) or sales (cafés).
#

# In[4]:



pl.figure(1,(8,7))
pl.clf()
pl.imshow(Imap,interpolation='bilinear') # plot the map
pl.scatter(bakery_pos[:,0],bakery_pos[:,1],s=bakery_prod,c='r', edgecolors='k',label='Bakeries')
pl.scatter(cafe_pos[:,0],cafe_pos[:,1],s=cafe_prod,c='b', edgecolors='k',label='Cafés')
pl.legend()
pl.title('Manhattan Bakeries and Cafés');


#%% #### Cost matrix
#
#
# We compute the cost matrix between the bakeries and the cafés; this will be the transport cost matrix. This can be done using the [ot.dist](http://pot.readthedocs.io/en/stable/all.html#ot.dist) function, which defaults to the squared Euclidean distance but can also return other metrics such as cityblock (Manhattan distance).
#
#

#%% #### Solving the OT problem with [ot.emd](http://pot.readthedocs.io/en/stable/all.html#ot.emd)

# #### Transportation plan visualization
#
# A good visualization of the OT matrix in the 2D plane is to denote the transportation of mass between a Bakery and a Café by a line.
# This can easily be done with a double ```for``` loop.
#
# In order to make it more interpretable one can also use the ```alpha``` parameter of plot and set it to ```alpha=G[i,j]/G.max()```.

#%% #### OT loss and dual variables
#
# The resulting Wasserstein loss is of the form:
#
# $W=\sum_{i,j}\gamma_{i,j}C_{i,j}$
#
# where $\gamma$ is the optimal transport matrix.
#

#%% #### Regularized OT with Sinkhorn
#
# The Sinkhorn algorithm is very simple to code. You can implement it directly using the following pseudo-code:
#
# ![sinkhorn.png](attachment:sinkhorn.png)
#
# An alternative is to use the POT toolbox with [ot.sinkhorn](http://pot.readthedocs.io/en/stable/all.html#ot.sinkhorn)
#
# Beware of numerical problems: for large costs the kernel exp(-C/reg) underflows to zero. A good pre-processing for Sinkhorn is to divide the cost matrix ```C```
# by its maximum value.

--------------------------------------------------------------------------------
/1_DomainAdaptation.py:
--------------------------------------------------------------------------------

# coding: utf-8

# # Domain Adaptation between digits
#
# #### *Rémi Flamary, Nicolas Courty*
#
# In this practical session we will apply to digit classification the OT-based domain adaptation method proposed in
#
# N. Courty, R. Flamary, D. Tuia, A. Rakotomamonjy, "[Optimal transport for domain adaptation](http://remi.flamary.com/biblio/courty2016optimal.pdf)", Pattern Analysis and Machine Intelligence, IEEE Transactions on , 2016.
#
# ![otda.png](http://remi.flamary.com/cours/otml/otda.png)
#
# To this end we will try to adapt between the MNIST and USPS datasets.
# Since those datasets do not have the same resolution (28x28 for MNIST and 16x16 for USPS) we zero-pad the USPS digits to 28x28.
#
#
#%% #### Import modules
#
# First we import the relevant modules. Note that you will need ```sklearn``` to learn the Support Vector Machine classifier and to project the data with TSNE.
#

# In[1]:


import numpy as np # always need it
import pylab as pl # do the plots

from sklearn.svm import SVC
from sklearn.manifold import TSNE
import ot


#%% ### Loading data and normalization
#
# We load the data in memory and perform a normalization of the images so that they all sum to 1.
#
# Note that every line in the ```xs``` and ```xt``` is a 28x28 image.

# In[2]:


data=np.load('data/mnist_usps.npz')

xs,ys=data['xs'],data['ys']
xt,yt=data['xt'],data['yt']


# normalization
xs=xs/xs.sum(1,keepdims=True) # every line (image) now sums to 1
xt=xt/xt.sum(1,keepdims=True)

ns=xs.shape[0]
nt=xt.shape[0]


#%% ### Visualizing Source (MNIST) and Target (USPS) datasets
#
#
#
#

# In[3]:



# function for plotting images
def plot_image(x):
    pl.imshow(x.reshape((28,28)),cmap='gray')
    pl.xticks(())
    pl.yticks(())


nb=10

# First we plot MNIST
pl.figure(1,(nb,nb))
for i in range(nb*nb):
    pl.subplot(nb,nb,1+i)
    c=i%nb
    plot_image(xs[np.where(ys==c)[0][i//nb],:])
pl.gcf().suptitle("MNIST", fontsize=20);
pl.gcf().subplots_adjust(top=0.95)

# Then we plot USPS
pl.figure(2,(nb,nb))
for i in range(nb*nb):
    pl.subplot(nb,nb,1+i)
    c=i%nb
    plot_image(xt[np.where(yt==c)[0][i//nb],:])
pl.gcf().suptitle("USPS", fontsize=20);
pl.gcf().subplots_adjust(top=0.95)


# Note that there is a large discrepancy especially between
# the 1, 2 and 5 that have different shapes in both datasets.
#
# Also, since we have performed zero padding on the USPS digits, they are on average slightly smaller than the MNIST digits, which can take up the whole image.
#
#
#%% ### Classification without domain adaptation
#
# We learn a classifier on the MNIST dataset (we will not be state of the art on 1000 samples). We evaluate this classifier on MNIST and on the USPS dataset.

# In[4]:



# Train SVM with reg parameter C=1 and RBF kernel parameter gamma=1e2
clf=SVC(C=1,gamma=1e2) # might take time
clf.fit(xs,ys)

# Compute accuracy
ACC_MNIST=clf.score(xs,ys) # training accuracy: beware of overfitting !
ACC_USPS=clf.score(xt,yt)

print('ACC_MNIST={:1.3f}'.format(ACC_MNIST))
print('ACC_USPS={:1.3f}'.format(ACC_USPS))


#%% There is a very large loss in performance. This can be better explained by performing a TSNE embedding on the data.
#
# ### TSNE of the Source/Target domains
#
# [t-distributed stochastic neighbor embedding (TSNE)](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) is a well-known approach that allows projection of complex high-dimensional data into a lower-dimensional space while keeping its structure.
#
#

# In[5]:



xtot=np.concatenate((xs,xt),axis=0) # all data

xp=TSNE().fit_transform(xtot) # this might take a while (30 sec on my laptop)

# separate again but now in 2D
xps=xp[:ns,:]
xpt=xp[ns:,:]


# In[6]:



pl.figure(3,(12,10))

pl.scatter(xps[:,0],xps[:,1],c=ys,marker='o',cmap='tab10',label='Source data')
pl.scatter(xpt[:,0],xpt[:,1],c=yt,marker='+',cmap='tab10',label='Target data')
pl.legend()
pl.colorbar()
pl.title('TSNE Embedding of the Source/Target data');


# We can see that while the classes are relatively well clustered, the clusters from the source and target datasets rarely overlap. This is the main reason for the large loss in performance between source and target.
#
#%% ### Optimal Transport Domain Adaptation (OTDA)
#
# Now we perform domain adaptation with the following 3 steps illustrated at the top of the notebook:
#
# 1. Compute the OT matrix between source and target datasets
# 1. Perform OT mapping with barycentric mapping (```np.dot```).
# 1. Estimate classifier on the mapped source samples
#
# #### 1. OT between domains
#
# First we compute the cost matrix and visualize it. Note that the samples are sorted by class in both source and target domains in order to better see the class-based structure in the cost matrix and OT matrix.
#
#
#

# We can clearly see the (noisy) structure in the matrix. It is also interesting to note that the class 1 in USPS (second column) is particularly different from all the other classes in MNIST data (even class 1).
#
#
#%% Next we compute the OT matrix using exact LP OT [ot.emd](http://pot.readthedocs.io/en/stable/all.html#ot.emd) or regularized OT with [ot.sinkhorn](http://pot.readthedocs.io/en/stable/all.html#ot.sinkhorn).
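#
# As a minimal sketch of this step, assuming only NumPy, the regularized plan can be obtained with a few lines of Sinkhorn iterations. The helper `sinkhorn_plan` and the toy arrays `xs_toy`/`xt_toy` are illustrative names, not part of the notebook or of POT; on the real data one would rather build the cost with `ot.dist(xs, xt)` and call `ot.emd` or `ot.sinkhorn`.

```python
import numpy as np


def sinkhorn_plan(a, b, C, reg=0.05, n_iter=500):
    """Hand-rolled Sinkhorn iterations (illustrative stand-in for ot.sinkhorn)."""
    K = np.exp(-C / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)             # rescale columns towards the target weights b
        u = a / (K @ v)               # rescale rows towards the source weights a
    return u[:, None] * K * v[None, :]


rng = np.random.default_rng(0)
# Toy data: two classes of 5 samples each, sorted by class; the target classes are shifted.
xs_toy = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
xt_toy = np.vstack([rng.normal(0.5, 0.1, (5, 2)), rng.normal(3.5, 0.1, (5, 2))])

# Squared Euclidean cost, divided by its maximum to avoid numerical problems.
C = ((xs_toy[:, None, :] - xt_toy[None, :, :]) ** 2).sum(axis=2)
C = C / C.max()

a = np.full(10, 0.1)                  # uniform weights on the source samples
b = np.full(10, 0.1)                  # uniform weights on the target samples
G = sinkhorn_plan(a, b, C)

block_mass = G[:5, :5].sum() + G[5:, 5:].sum()
print('mass carried by the diagonal (same-class) blocks:', block_mass)
```

# Because the samples are sorted by class, almost all of the mass of `G` should sit in the two diagonal blocks. The row marginals match `a` after the last update; the column marginals match `b` only approximately, up to the convergence of the iterations.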

# We can see that most of the transportation is done in the block-diagonal, which means that on average samples from one class are assigned to the proper class in the target.
#
#%% #### 2/3 Mapping + Classification
#
# Now we perform the barycentric mapping of the samples and train the classifier on the mapped samples. We recommend using a smaller ```gamma=1e1``` here because some samples will be mislabeled and a smoother classifier will work better.

# We can see that the adaptation with EMD leads to a performance gain of nearly 10%. You can get even better performance using entropic regularized OT or group lasso regularization.
#
#%% #### TSNE visualization for OTDA
#
# In order to see the effect of the adaptation we can perform a new TSNE embedding to see if the classes are better aligned.
#
#

# In[ ]:





# We can see that when using the emd solver the OT matrix is a permutation, so the samples are exactly superimposed. On average the classes are also well transported but there exist a number of badly transported samples that have a class permutation.
#
#
#%% #### Transported samples visualization
#
# We can now also plot the transported samples.

# Those are the same MNIST samples that have been plotted above but after transportation. There are several samples that are transported to the wrong class but again on average the class information is preserved, which explains the accuracy gain.
#
#%% ### OTDA with regularization
#
# We now recommend trying regularized OT and redoing the classification/TSNE/visualization to see the impact of the regularization in terms of performance, TSNE and transported samples.
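#
# To get a feel for the effect of the regularization before running it on the digits, here is a small NumPy-only sketch on toy 2D data. The helpers `entropic_plan` and `barycentric_map` and the arrays `xs_toy`/`xt_toy` are hypothetical names introduced for this illustration; with POT one would use `ot.sinkhorn` for the plan and keep the same `np.dot`-style mapping.

```python
import numpy as np


def entropic_plan(a, b, C, reg, n_iter=300):
    """Plain Sinkhorn iterations (illustrative stand-in for ot.sinkhorn)."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


def barycentric_map(G, xt):
    """Send each source sample to the plan-weighted average of the target samples."""
    return (G / G.sum(axis=1, keepdims=True)) @ xt


rng = np.random.default_rng(0)
xs_toy = rng.normal(0.0, 1.0, (30, 2))                          # source cloud
xt_toy = rng.normal(0.0, 1.0, (40, 2)) + np.array([4.0, 0.0])   # shifted target cloud
a = np.full(30, 1.0 / 30)
b = np.full(40, 1.0 / 40)

C = ((xs_toy[:, None, :] - xt_toy[None, :, :]) ** 2).sum(axis=2)
C = C / C.max()                       # normalize the cost before Sinkhorn

mapped = {}
spread = {}
for reg in (0.02, 0.1, 1.0):
    G = entropic_plan(a, b, C, reg)
    mapped[reg] = barycentric_map(G, xt_toy)
    spread[reg] = mapped[reg].std(axis=0).mean()
    print('reg={} spread of the mapped samples={:.3f}'.format(reg, spread[reg]))
```

# With a strong regularization the plan gets close to the independent coupling, so every mapped sample collapses towards the mean of the target cloud; with a weak regularization the mapped samples keep a spread closer to that of the target. The same trade-off is worth watching on the transported digits.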

--------------------------------------------------------------------------------
/2_ColorGrading.py:
--------------------------------------------------------------------------------

# coding: utf-8

# # Color grading with optimal transport
#
# #### *Nicolas Courty, Rémi Flamary*

#%% In this tutorial we will learn how to perform color grading of images with optimal transport. This is a very direct use of optimal transport. You will learn how to treat an image as an empirical distribution, and apply optimal transport to find a matching between two different images seen as distributions.

# First we need to load two images. To this end we need some packages
#

# In[1]:


import numpy as np
import matplotlib.pylab as pl
from matplotlib.pyplot import imread
from mpl_toolkits.mplot3d import Axes3D # registers the 3D projection on older matplotlib versions

I1 = imread('./data/klimt.jpg').astype(np.float64) / 256
I2 = imread('./data/schiele.jpg').astype(np.float64) / 256


#%% We need some code to visualize them

# In[2]:


def showImage(I,myPreferredFigsize=(8,8)):
    pl.figure(figsize=myPreferredFigsize)
    pl.imshow(I)
    pl.axis('off')
    pl.tight_layout()
    pl.show()


# In[3]:


showImage(I1)
showImage(I2)


# Those are two beautiful paintings by Gustav Klimt and Egon Schiele, respectively. Now we will treat them as empirical distributions.

#%% Write two functions that will be used to convert 2D images to arrays of 3D points (in the color space), and back.

# In[4]:


def im2mat(I):
    """Converts an image to a matrix (one pixel per line)"""
    return I.reshape((I.shape[0] * I.shape[1], I.shape[2]))


def mat2im(X, shape):
    """Converts a matrix back to an image"""
    return X.reshape(shape)

X1 = im2mat(I1)
X2 = im2mat(I2)


#%% It is unlikely that our solver, however efficient, can handle such large distributions (1Mx1M for the coupling). We will use the mini-batch k-means procedure from sklearn to subsample those distributions. Write the code that performs this subsampling (you can choose a size of 1000 clusters to have a good approximation of the image)

# In[5]:


import sklearn.cluster as skcluster
nbsamples=1000


#%% You can use the following procedure to display them as point clouds

# In[6]:


def showImageAsPointCloud(X,myPreferredFigsize=(8,8)):
    fig = pl.figure(figsize=myPreferredFigsize)
    ax = fig.add_subplot(111, projection='3d')
    ax.set_xlim(0,1)
    ax.scatter(X[:,0], X[:,1], X[:,2], c=X, marker='o', alpha=1.0)
    ax.set_xlabel('R',fontsize=22)
    ax.set_xticklabels([])
    ax.set_ylim(0,1)
    ax.set_ylabel('G',fontsize=22)
    ax.set_yticklabels([])
    ax.set_zlim(0,1)
    ax.set_zlabel('B',fontsize=22)
    ax.set_zticklabels([])
    ax.grid(False) # the string 'off' is truthy and would turn the grid on
    pl.show()


#%% You can now compute the coupling between those two distributions using the exact LP solver (EMD)

#%% Using the barycentric mapping method, express the transformation of both images into the other one

#%% Since only the centroids of the clusters have changed, we need to figure out a simple way of transporting all the pixels in the original image. At first, we will apply a simple strategy where the new value of a pixel is simply the new position of its corresponding centroid

#%% Express this transformation in your code, and display the corresponding adapted image.
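#
# One possible way to express this nearest-centroid strategy is sketched below on a tiny synthetic image, using NumPy only. The function `transfer_by_centroid` and the arrays `I_toy`, `centroids` and `new_centroids` are hypothetical names for this illustration; in the tutorial the centroids would come from the k-means subsampling and the new positions from the barycentric mapping.

```python
import numpy as np


def transfer_by_centroid(I, centroids, new_centroids):
    """Replace every pixel by the transported position of its closest centroid."""
    X = I.reshape(-1, I.shape[-1])                       # one pixel per line
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                           # index of the closest centroid
    return new_centroids[labels].reshape(I.shape)        # back to an image


# Toy 2x2 RGB image with two reddish and two bluish pixels.
I_toy = np.array([[[0.9, 0.1, 0.1], [0.1, 0.1, 0.8]],
                  [[0.8, 0.2, 0.1], [0.2, 0.1, 0.9]]])

# Cluster centers in color space (assumed output of the subsampling step).
centroids = np.array([[0.85, 0.15, 0.10],     # reddish cluster
                      [0.15, 0.10, 0.85]])    # bluish cluster

# Made-up transported positions of these centers (assumed output of the mapping step).
new_centroids = np.array([[0.9, 0.8, 0.2],    # red moves to yellow
                          [0.1, 0.6, 0.3]])   # blue moves to green

I_adapted = transfer_by_centroid(I_toy, centroids, new_centroids)
print(I_adapted)
```

# The broadcasted distance array has one entry per (pixel, centroid, channel) triple, which is fine for a toy image but too large for a full painting with 1000 clusters. On real images the cluster index of each pixel is better obtained from the fitted k-means model, for instance with its `predict` method.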

#%% You can also use the entropy regularized version of Optimal Transport (a.k.a. the Sinkhorn algorithm) to explore the impact of regularization on the final result
#

--------------------------------------------------------------------------------
/3_WMD.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word Mover's distance\n",
    "\n",
    "In this notebook we will see an application of Optimal Transport to the problem of computing similarities between sentences and texts. The method under the lens is called 'Word Mover's Distance' in reference to 'Earth Mover's Distance', another name of the Wasserstein $1$ distance, mostly used in computer vision. \n",
    "\n",
    "Traditionally, portions of text are compared by cosine similarity on bag-of-words vectors, i.e. histograms of occurrences of words in a text. It captures the exact similarity in terms of words, but two closely related sentences can be orthogonal if they use different words with the same meaning. A semantic distance can be obtained by using *word embeddings*, that are embeddings of words in a Euclidean space (of potentially large dimension) where the Euclidean distance has a semantic meaning: two related words will be close in such embeddings. A popular embedding is the *word2vec* embedding, obtained with neural networks. A study of those mechanisms is not in the scope of this notebook, but the interested reader can find more information on [the corresponding Wikipedia page](https://en.wikipedia.org/wiki/Word2vec). 
Throughout the rest of this tutorial, we will use a subset of the [GloVe](https://nlp.stanford.edu/projects/glove/) embedding.\n",
    "\n",
    "The key observation made by Kusner and colleagues [1] is that, given two sentences/documents, the optimal transport distance can be used between their histograms of occurring words, using a ground metric obtained through word embeddings. In such a way, related words will be matched together, and the resulting distance will express the semantic relatedness between the contents.\n",
    "\n",
    "[1] Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International Conference on Machine Learning (pp. 957-966). http://proceedings.mlr.press/v37/kusnerb15.pdf\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A basic example \n",
    "\n",
    "We will start by reproducing Figure $1$ of the original paper\n",
    "\n",
    ""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two sentences are considered: 'Obama speaks to the media in Illinois' and 'The president greets the press in Chicago'. The cosine similarity on bag-of-words vectors rates these two sentences as totally unrelated since, stopwords aside, they have no word in common. We will start with some imports and create, for each sentence, the list of its words without the stopwords, which are not relevant for our analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pylab as pl\n",
    "import ot\n",
    "\n",
    "\n",
    "s1 = ['Obama','speaks','media','Illinois']\n",
    "s2 = ['President','greets','press','Chicago']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use a subset of the GloVe word embedding, expressed as a dictionary (word,embedding) that you can load this way"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    " \n",
    "model=dict(np.load('data/model.npz'))\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then the embedded representation of the sentences can be obtained by"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "s1_embed = np.array([model[w] for w in s1])\n",
    "s2_embed = np.array([model[w] for w in s2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the multidimensional scaling (MDS) method from scikit-learn, try to visualize the corresponding embedding of words in 2D."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "
"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn import manifold\n",
    "\n",
    "C = ot.dist(np.vstack((s1_embed,s2_embed)))\n",
    "\n",
    "nmds = manifold.MDS(\n",
    "    2,\n",
    "    eps=1e-9,\n",
    "    dissimilarity=\"precomputed\",\n",
    "    n_init=1)\n",
    "npos = nmds.fit_transform(C)\n",
    "\n",
    "pl.figure(figsize=(6,6))\n",
    "pl.scatter(npos[:4,0],npos[:4,1],c='r',s=50, edgecolor = 'k')\n",
    "for i, txt in enumerate(s1):\n",
    "    pl.annotate(txt, (npos[i,0]-4,npos[i,1]+2),fontsize=20)\n",
    "pl.scatter(npos[4:,0],npos[4:,1],c='b',s=50, edgecolor = 'k')\n",
    "for i, txt in enumerate(s2):\n",
    "    pl.annotate(txt, (npos[i+4,0]-4,npos[i+4,1]+2),fontsize=20)\n",
    "pl.axis('off')\n",
    "pl.tight_layout()\n",
    "pl.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now compute the coupling between those two distributions and visualize the corresponding result \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAc0AAAGoCAYAAAAkfL70AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzs3Xm8VVX9//HXG1BRJMcccEycB0xzCjFBpcwx0yRNC6fSvmVZmqaZaOO3X98y7dugX5MsTcqhNHFABBJxyNQcyAynFBVzQBFBBT6/P9Y6cticc+/mTufee97Px+M8Nnfttfde5wD3c9asiMDMzMxa16fRBTAzM+spHDTNzMxKctA0MzMryUHTzMysJAdNMzOzkhw0zczMSnLQtBZJGiPpt13wnI0lhaR+HXS/pyTtU+fccEnPVv38iKThHfHczpQ/n00bXY4KSZMlHZ///ClJtzS4PG9I2qSF83X/TZiV5aDZg0j6uqTxhbR/1Un7ZNeWrueKiG0iYnKjy9GTRcTlEfHhBpdh5Yh4AkDSWEnfbs/9JK0v6XJJL0uaK+keSQfkcxvmIF15Rc5T+XmPjnhP1v04aPYsfwF2l9QXQNI6wHLAjoW0TXPe0pT434MZIGl1YCrwNrANsCbwY+AKSYdFxL9zkF45IlbOl21flXZ7g4puncy/JHuWv5KC5Pvzzx8CJgH/LKQ9HhHPAUgaKumvkl7Lx6GVm+Xmte9IugN4E9hE0vskTZE0R9IE0i+LuiQdIOkBSbMlTZM0pOrcU5JOk/Rg/hZ+iaS1Jd2Y73+rpNUKtzxW0nOSnpf01ap79ZF0hqTH8zf/3+dfbJXzR0t6Op87q1DGFXPN41VJ04GdC+ffbbbLzdG/l3RZLuMjknaqyrujpPvzuT9IGlevRiNpsKTbcpleyrWWVQvPPTV/Pq/le/WvOn9a/hyek3RsK38Pq0u6NOd9VdIfq86dIGmGpFckXSdpUE5fqklcSza5jpZ0h6QLc/kelbR3neePljS16ueQdKJSq8erkv5XkvK5vpL+J38mT0r6QrEcVfc5RtL1VT/PkPT7qp+fkfT+qmduKumzwKeArynV+q6vuuX7633eBacAbwDHRcQLETEvIn4HfAf4n8p7sebjoNmDRMTbwN2kwEg+3k76Rlyd9hd499vyDcAFwBrAj4AbJK1Rddujgc8CA4GngSuAv5GC5beAz9Qrj6QdgV8Bn8v3/yVwnaQVqrIdCowENgcOBG4Ezsz37wOcXLjtCGAz4MPAGVrcB3Uy8DFgT2AQ8Crwv7kcWwM/z+9lUC7L+lX3PAcYnF8faek9ZQcBVwKrAtcBP83PWR64FhgLrA78DjikhfsI+F4u01bABsCYQp7DgX2B9wFDgNH5WfsCp5I+u82A1vrifgOsRKoVrUWqFSFpr1yGw4F1SX/HV7Zyr2q7Ak+Q/r7OAa6p/rLSigNIX1C2z8//SE4/Afgo6YvejqS/13qmAHvkL03rkr407g6g1H+5MvBg9QURcRFwOfCDXOs7sOp0zc+7hpHA1RGxqJD+e2BD0r9na0IOmj3PFBYHyD1IQfP2QtqU/Of9gX9FxG8iYkH+pvwoKXhVjI2IRyJiAemX6s7A2RHxVkT8Baj+ll50AvDLiLg7IhZGxK+Bt4DdqvJcGBGzImJmLufdEXF/RLxFCkA7FO55bkTMjYiHgEuBI3L654CzIuLZfO0Y4LBcOzkM+HNE/CWfOxuo/mV3OPCdiHglIp4hfYloydSIGB8RC0nBaPucvhvQD7ggIt6JiGuAe+rdJCJmRMSE/Fn+h/SlZc9Ctgsi4rmIeIX0WVdaDA4HLo2IhyNiLksH23flYPJR4MSIeDWXrfJv4FPAryLivvzZfB34oKSNW/kMKl4Ezs/3HEdq1di/5LXfj4jZEfFvUotI9Xv7Sf67fBX4fr0b5D7KOfnaPYGbgZmStsw/314jsLWk3uddtCbwfI3056vOWxNy0Ox
5/gIMy82a742IfwHTgKE5bVsW92cOItUsqj0NrFf18zNVfx4EvJp/SVfnr2cj4KtKTbOzJc0m1aYGVeWZVfXneTV+XpklVZfn6ap7bQRcW/WcfwALgbVznnevy+V/ufC+ivdtyQtVf34T6J+D8yBgZiy5y8Ez1CFpLUlXSpop6XXgtyz9y7b4rMrnsSxl3gB4JQegoiX+DUTEG6TPZr0aeWspvt/qv5PWlH1vdT/DbAownPTFcAowmRQw92TxF8Sy6pWp6CXSl8iidavOWxNy0Ox57gRWITWp3gEQEa8Dz+W05yLiyZz3OVKwqbYhMLPq5+pfiM8Dq0kaUMhfzzOkGtyqVa+Vco22rTYoPPu5qmd9tPCs/rkG+3z1dZJWIjXRVr+v4n3b4nlgvUJ/1gb1MpOaRQMYEhHvAY4iNdmWfVbZMj8DrF7dX1pliX8D+e92DdK/gcqXo5Wq8q9TuL74fqv/TtrqeZZsPm/pM4TFQbPSijKF1oNme7dvuhU4VEsPjjuc9Hk/1s77Ww/loNnDRMQ84F7gK6TmzoqpOa161Ox4YHNJR0rqJ2kUsDXw5zr3fjrf+1xJy0saxpJNuUUXAydK2lXJAEn7SxrY5jcIZ0taSdI2wDHAuJz+C+A7kjYCkPReSQfnc1cBB0galvsdz2PJf9u/B74uaTVJ6wNfbGPZ7iTVbr+QP8+DgV1ayD+QNJhktqT1gNOW4Vm/B0ZL2jp/CTinXsaIeJ7UV/yz/B6Xk1Rprr8COEbS+3Nf83dJTeRP5SbjmcBReXDOsaR+32prASfne36C1Dc7nvb5PfAlSevlQH96K/mnkPq6V4yIZ0n/7vclBf/761wzC6g7Z7OEHwPvAS6RtI6k/pKOAM4CTivUvq2JOGj2TFNIv8ymVqXdntPeDZoR8TJpMMZXSU1yXwMOiIiWmpaOJA3+eIX0i/qyehkj4l5Sv+ZPSQNzZlB/YEVZU/J9JgI/jIjKhPmfkAbl3CJpDnBXLicR8QjwX6QA8Xwuy7NV9zyX1Kz4JHALqZ9ymeWBWB8HjgNmk2qOfyb149ZyLmmgy2ukAVnXLMOzbgTOB24jfR63tXLJ0cA7pD7rF4Ev5/tMJPXxXk36bAYD1XN4TyAF85dJg4imFe57N2kg0kukkaOH5X9X7XEx6e/hQVLQGw8sIH0hWUpEPEb68nF7/vl10uCkO3K/cy2XAFvn5vw/1slTV36Pw4D+wHTS5/MV4Ojct2tNSv7CZNZ2ku4GfhERlza6LB1N0mjg+IgY1snP+SjpMyx2JZh1O65pmi0DSXvm5rp+kj5DmrZwU6PL1ZMozZvdL3+G65FaNK5tdLk6khbPgR1bSB+b0zduLW9XqFUea5mDptmy2QL4O6nJ9auk5spaUxOsPpGarl8lNc/+A/hmQ0vUDjnouMmuSbh51sysHSoBMyJUlbYxqQ/91xExuip9LGlxjfdFxFM5bTlSX/NrXf0FLM/xXYW0itg7XfnsnqpDdpQwM7O2ycHq0QY9+3lqL+Jgdbh51sysgcr0f0r6nKSHJM2XNEvSRZJWqXO/D0i6WtKLkt5SWpP5Z7lWWcxbs09T0kGSJiqtffyW0prGUyR9vgPfeo/kmqaZWff2A9K6vdeTpuqMIE0V2hTYqzqj0tZlV5P6ja8iTbX6AHAScLCk3SvNwvUoLXj/S9LqSdeTphutRRr0dgzwsw56Xz2Sg6aZWfe2G7BdXsOXvKTjbcAISbtExD05fWXSZgL9gOHV25NJOp20xu9FpM0QWvI50pZo20fEi9UnJDX9mrtunjUz697OqwRMgLy5QmVecPWKVAeTVkkaV2M/z/8BngJGSiqzjOQC0mIZS2hlYZSm4KBpZta93VsjrbLIffV+tDvm41KrR+VAW1ktrLizUNHlpPWIH5H0Y0kfk/TeZShvr+agaWbWvc2ukbYgH/tWpVUGBtUbDVtJr7Ww/7si4kekaTH/Ju1jey0wS9IkVW3I3qwcNM3MeofX8rG4U03
FuoV8dUXEZRGxG6m5d3/SWr4fAm6WtFZ7C9qTOWiamfUOlR1fhhdP5MFDlTWE7yt7w7yJ+PiIOIE0yGh10hZtTctB08ysd/gjaXeiIyTtVjj3ZdJWabdWDyqqRdK+OcgWVWqYb7a7pD2Yp5yYmfUCEfFG3hP1D8AUSX8g9Ut+gDTN5AXSdJLWXAnMlzSVNOJWpNrlzsDfSBt0Ny0HTTOzXiIi/iRpd+BM0oIIq5CC5S+Ab0XEcyVuc0a+dkdgP2A+aZGE04GfN/satV6w3czMrCT3aZqZmZXkoGlmZlaS+zTNzLqQpIHAqH6w6QKYQVr2bk6jy2XluE/TzKyLSBrWH8aPgD67w4A7YO4kWDQf9ouIqY0un7XOQdPMrAtIGtgfZl4HA99HWstuTWACcBDMmQ+DIuKNxpbSWuM+TTOzrnHU5rD894DNSBtWAowERqTfxaMaVjIrzUHTzKwTSdpe0oXAjx+EFZ4Cvg2MrsozFAb0hcFdXzpbVh4IZGbWwSStAhwBHE9akedt4IGdYchd0L9YW5kGcxfC411dTlt2rmmamXUAJcMkjSVtw/VzYHngS8AgYJ+H4J2JhesmAJNgETCuK8trbeOBQGZm7SBpbeDTwHHAFsAc4HfA/wH3RtUv2erRs0NhwDSPnu1xHDTNzJaRpL6kRdCPBw4idXXdQQqUf4iIuS1cuzIwqi8Mzk2y4zxqtudw0DQzK0nSxsCxwDHA+sBLwK+BSyLiH40rmXUVB00zsxZIWgH4GKn5dZ+cfDOpVnl9RLzdqLJZ13PQNDOrQdK2pEB5NLAGaW/KXwGXtraRs/VennJiZpZV1oUlBcvdgHeAP5JqlRMjYmEDi2fdgGuaZtbUJAnYlTSo55PAAGA6KVD+NiL+08DiWTfjmqaZNSVJa5KaXo8HtgbeBK4kBcu7wjUKq8E1TTNrGpL6kAbzHAccAiwH3A1cQpr68XoDi2c9gIOmmfV6kjYgTRM5FtgIeAX4DWmqyEONLJv1LA6aZtYrSVoeOIDU/LovIOBWUvPrnyJifgOLZz2Ug6aZ9SqStiQ1v34GeC8wE7gU+FVEPNnIslnP54FAZtbjSRoAfIIULIcBC4DrSbXKmz1VxDqKa5pm1iPlqSIfIDW/HgkMBB4jDeq5LCJeaGDxrJfy1mBmTUDSxpIib1tV9prR+ZrRnVeyZSdpdUlfBO4H/kraYeRa4EPAlhHxAwdM6yxunjXrwXL/3X8BI4ANgBVJi4jfD1wDXN4bBrzkqSJ7kmqVhwIrAH8DTgJ+FxGvNbB41kQcNM16KEnfBM4htRjdRdpt4w1gbWA4qT/vJGCnNj7i2nzf59tb1raSNAgYTeqr3AR4jfS+LomI+xtVLmteDppmPZCkM4FzgWeAT0TE3TXyHAB8ta3PyLW3Lq/BSVoO2I8UKPcnfSmYTPqCcHVEzOvqMplVuE/TrIfJezqOIS0mvl+tgAkQEX8mzU9c6npJV0p6SdJ8SffmAFvMV7dPU9L6ki6Q9K98j1ck3SPp7EK+EZIukjRd0uuS5kl6WNI5kvoX8m4q6XvAs6RF0vcHFgH/AMbm9DcljalRns0kXSZppqS3JT2Xf96s1mdj1lauaZr1PMeQln+7MiIebiljRLxVSNoIuAd4grQizuqkXT3+JGmfiJjU2sMl7UTaT3J14C+kvtOVSOu3jgG+VZX9dGBLYBpwA9Af2D3nG56D9cdIfZXDSUGy0gd7R36tA/wMuKVOeXYmLVowELiOtNj6lsCngIMl7R0R97b2vszKcNA063mG5ePENlw7HBgTEedWEiRdAdwEnAa0GDTzKjt/IAXMT0XEFYXzGxQu+TzwZHHxc0kXAScAs0i7ijwBnAVsR9pp5AcRcXpV/vNJwb5YHgGXAe8BjoqIy6vOjSItwP5bSVtHxKKW3ptZGW6eNet51s3HZ9tw7dPAt6sTIuJm0gbLu5S4/kBgY+C6YsDM93qm8PMTlYApaRVJJ0m6lxQwIQXNvYDNgB8
CB5P6UYtl/DspOBYNJdUq76wOmPmaccBUYAsWf9EwaxcHTbOeR/nYlpVJHqizOs4zwGolrt8tH28s8zBJAyT9UtJLwGxSM+sHqrLMiIhJuRa4BWnKzIMRMafG7abWSNsxH2+rU4RK+g5lymvWGjfPmvU8z5FqV+u34drZddIXUO5L9Kr5OLOlTJLWJvW9nkPqx6wM6LmTFKDJ51aoumyVfJxV57a10ivX1JsWU0lftc55s2XioGnW80wlNWnuTVoyritVgu56xROS+gIfIQ3qOZDFv1+mAvtGxNyqvOuSgma1yl6Wa9d5dq30ypSYdepcs24hn1m7uHnWrOe5lDTd5FBJW7eUUdIKLZ1vg7vy8aNVz3ifpG+R+ktvIPUfng/8KGf5QXXAzPasce9HgXnAEEkDa5yv1S9ZWeBgeJ3yVtLvq3PebJk4aJr1MBHxFGnKxvLADXkKyFIk7UvJvsdlcD3wFHBQnqc5gcUjXx8EPgusHxGnkdaFhUJAk7QJ8N/FG0fE28A4UpPrNwrXbE9aY7boDuCfwDBJhxWuOYy0Hu1j1O4PNVtmbp4164Ei4ruS+pGaOP8qaRpwL4uX0fsQaURqR89P3JzUL7kR8EXgLVJAeozUx/rziLg4570emAF8RdJ2pFrhhqSNoW/Ify46g9T0/DVJu5Lmd64LHA6MJ83pfHfqSESEpM8AE4Bxkv5EqrFukfPOAT7t6SbWUVzTNOuhIuI8YFvgp6Ta2TGkuZb7A4+T+hY7YqpFf0nHS7oLeAg4jBT0ricNtNmVFKBWpaqfMjfJ7gVcAWwDnAwMIS1+cFSd9zSLNI3ksnzNKaSRr58HKlNKXi9cczewc37OB0mfwVDgd8DO9VZMMmsL76dpZkvJiwbsRgq8o0gLEEwnLZb+m4h4qQFl+g5wJmlQ0c1d/XwzcNA0syqS3gscTVosfWtgLmlVnf8D7i6u7NNJZRgUEc8V0rYjNdW+DazXG7Y7s57JfZpmTSCPRh0F/TaFBTOAcZUFBPJelfuQapUfI61rezdp1Z5xdRYa6Ez3SpoBPEwK2puxeLeTEx0wrZFc0zTr5SQNg/7jYUQf2H0A3DEXJi2C+aNJa70eQxrY8wqpL/GS1haC7+TynkMK3huTFmGfTZrq8sOImNyocpmBg6ZZr5ZqmP1nwnUDYSSpdfN64PtUDaydQFok4Y81dkUxsypunjXr3UalGua2wKmkiuR/SAv6DH4HHj87IpaaM2lmtXnKiVmv1m/T1CS7gDQzZQ/SbJGngdHLQd9VWrzczJbgmqZZr7ZgRurDPGtAWu+8OkZOmwsLH29Uycx6IvdpmvViS/dpVkwADpoD8wdFxBuNKp9ZT+OgadbLLTl6duiAVMOctAjm7xcRXpPVbBk4aJo1AUkrA6Og7+DcJDvONUyzZeegaWZmVpJHz5qZmZXkoGlmZlaSg6aZmVlJDppmZmYlOWiamZmV5KBpZmZWkoOmmZlZSQ6aZmZmJTlompmZleSgaWZmVpKDppmZWUkOmmZmZiU5aJqZmZXkoGlmZlaSg6aZmVlJDppmZmYlOWiamZmV5KBpZmZWkoOmmZlZSQ6aZmZmJTlompmZleSgaWZmVpKDppmZWUkOmmZmZiU5aJqZmZXkoGlmZlaSg6aZmVlJDppmZmYlOWiamZmV5KBpZmZWkoOmmZlZSQ6aZmZmJTlompmZleSgaWZmVpKDppmZWUkOmmZmZiU5aJqZmZXkoGlmZlaSg6aZmVlJDppmZmYlOWiamZmV5KBpZmZWkoOmmZlZSQ6aZmZmJTlompmZleSgaWZmVpKDppmZWUkOmmZmZiU5aJqZmZXkoGlmZlaSg6aZmVlJDppmZmYlOWiamZmV5KBpZmZWkoOmmZlZSQ6aZmZmJTlompmZleSgaWZmVpKDppmZWUkOmmZmZiU5aJqZmZXkoGlmZlaSg6aZmVlJDppmZmYlOWiamZmV5KBpZmZWkoOmmZlZSQ6aZmZmJTlompmZleS
\n", 150 | "text/plain": [ 151 | "
" 152 | ] 153 | }, 154 | "metadata": {}, 155 | "output_type": "display_data" 156 | } 157 | ], 158 | "source": [ 159 | "C2= ot.dist(s1_embed,s2_embed)\n", 160 | "G=ot.emd(ot.unif(4),ot.unif(4),C2)\n", 161 | "\n", 162 | "pl.figure(figsize=(6,6))\n", 163 | "pl.scatter(npos[:4,0],npos[:4,1],c='r',s=50, edgecolor = 'k')\n", 164 | "for i, txt in enumerate(s1):\n", 165 | "    pl.annotate(txt, (npos[i,0]-4,npos[i,1]+2),fontsize=20)\n", 166 | "pl.scatter(npos[4:,0],npos[4:,1],c='b',s=50, edgecolor = 'k')\n", 167 | "for i, txt in enumerate(s2):\n", 168 | "    pl.annotate(txt, (npos[i+4,0]-4,npos[i+4,1]+2),fontsize=20)\n", 169 | "for i in range(G.shape[0]):\n", 170 | "    for j in range(G.shape[1]):\n", 171 | "        if G[i,j]>1e-5:\n", 172 | "            pl.plot([npos[i,0],npos[j+4,0]],[npos[i,1],npos[j+4,1]],'k',alpha=G[i,j]/np.max(G))\n", 173 | "pl.title('Word embedding and coupling with OT')\n", 174 | "pl.axis('off')\n", 175 | "pl.tight_layout()\n", 176 | "pl.show()" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "## Sentence similarity\n", 184 | "We will now assess how well this Word Mover's Distance (WMD) performs in a regression context, where the goal is to estimate the similarity (or relatedness) of two sentences on a scale of 0 to 5 (5 being the most similar). Given a set of sentence pairs with human-annotated relatedness, our goal is to predict the relatedness of a new pair of sentences.\n", 185 | "\n", 186 | "We will use the [SICK (Sentences Involving Compositional Knowledge) dataset](http://clic.cimec.unitn.it/composes/sick.html) for this purpose.\n", 187 | "\n", 188 | "We first load it."
189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 6, 194 | "metadata": {}, 195 | "outputs": [ 196 | { 197 | "name": "stdout", 198 | "output_type": "stream", 199 | "text": [ 200 | "A group of kids is playing in a yard and an old man is standing in the background\n", 201 | "A group of boys in a yard is playing and a man is standing in the background\n", 202 | "4.5\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | " \n", 208 | "data=np.load('data/data_text.npz') \n", 209 | "setA=data['setA']\n", 210 | "setB=data['setB']\n", 211 | "scores=data['scores']\n", 212 | "\n", 213 | "print(setA[0])\n", 214 | "print(setB[0])\n", 215 | "print(scores[0])\n", 216 | "\n", 217 | "np.savez('data/data_text.npz',setA=setA,setB=setB,scores=scores)\n" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "We keep only the first 200 sentence pairs for training our regression model and use the rest for testing." 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 7, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "n=200\n", 234 | "testA=setA[n:]\n", 235 | "trainA=setA[:n]\n", 236 | "testB=setB[n:]\n", 237 | "trainB=setB[:n]\n", 238 | "\n", 239 | "scores_train=scores[:n]\n", 240 | "scores_test=scores[n:]" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Using the `CountVectorizer` model from scikit-learn, compute the bag-of-words representations of all the sentences." 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 8, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "from sklearn.feature_extraction.text import CountVectorizer\n", 257 | "vect = # TO BE FILLED" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "Build a large matrix containing the embeddings of all the words present in the dataset.\n" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code",
269 | "execution_count": 9, 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "all_feat = # TO BE FILLED" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "Compute a large matrix of all pairwise feature distances using the `ot.dist` function of POT." 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": 10, 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "D = ot.dist(all_feat)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "Now write code that computes the cosine and WMD dissimilarities for all the pairs of the training set." 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 11, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "X_cos=[]\n", 306 | "X_wmd=[]\n", 307 | "Y=[]\n", 308 | "\n", 309 | "\n", 310 | "\n", 311 | "for i in range(len(trainA)):\n", 312 | "    s1 = vect.transform([trainA[i]]).toarray().ravel()\n", 313 | "    s2 = vect.transform([trainB[i]]).toarray().ravel()\n", 314 | "    # Cosine similarity between bag of words\n", 315 | "    d_cos=# TO BE FILLED\n", 316 | "    X_cos.append(d_cos)\n", 317 | "    # WMD\n", 318 | "    d_wmd=# TO BE FILLED\n", 319 | "    X_wmd.append(d_wmd)\n", 320 | "    Y.append(scores_train[i])\n", 321 | "\n", 322 | "\n" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "Plot the golden (human-annotated) relatedness scores against each dissimilarity on the training set. This gives a first impression of how much better WMD captures this similarity." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 12, 335 | "metadata": {}, 336 | "outputs": [ 337 | { 338 | "data": { 339 | "image/png": "\n", 340 | "text/plain": [ 341 | "
" 342 | ] 343 | }, 344 | "metadata": {}, 345 | "output_type": "display_data" 346 | }, 347 | { 348 | "data": { 349 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEICAYAAACktLTqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJztnXu0XXV17z/fHI54IsgBOZfiSUIQGUnRKJEjj4ZaQocGEWkG6AWKDzq0ufXqrVaMI+m1GKxe8HJrba+23rRVQShSCaQ8i9iAPAroiQmPgFSQRziAHB4HCAkhj3n/WGsn++yzH2uvvdZejz0/Y5xx9l7r95i/x5577d+cv/mTmeE4juOUi2lZC+A4juMkjyt3x3GcEuLK3XEcp4S4cnccxykhrtwdx3FKiCt3x3GcEuLK3ekISbMkbZLUFzP/JklvCV9/X9JXO5Dlekkfj5s/j0iaLckk7dHg/gpJF3dbLif/uHLPIZKWS7q+5tqvGlw7PXxtkp6pVgKS+sNrVnXtZkmvSnpZ0kuS1kpaJmnPJvLMkLRK0rOSXpR0n6SzAMzscTPby8x2xGlrmPfXcfLWKev9ZnZhKPNZkm6LU07YH7fUub6/pNckvV3S6yT9laQnwi+oRyV9s9M2OE5SuHLPJ7cAv1N5GpZ0INAPzK+59tYwbYUXgPdXvX9/eK2Wz5jZ3sCBwNnA6cB1ktRAnh8AG4GDgDcBHwV+E69pyaOAJOfyxQT9f3DN9dOBe83sPmA5MAIcCewNHAf8IkEZSk2jXyJOcrhyzyc/J1Dmh4fvfxe4CXiw5trDZvZkVb4fAB+rev8x4KJGlZjZK2Z2M3AycAzwgQZJ3w18P0y/3czWmdn1MHXZIPxl8FVJ/xE+0V4t6U2SLgl/Kfxc0uxKwWHet9ZWKGlfSddIGpf0Qvh6RtX9myV9TdLtwGbgLeG1T0r6beA7wDGhDBOS3i3pN9XLR5JOkXR3nX55AlhD8CVWTXV/vhu40syetIBHzaxhX0t6n6QHw18+fyfpp5I+Gd6bJulLkh4Lf2ldJGmfBuUcHOZ9WdKNwP41948O+35C0t2Sjqvps7+UdHuY/8eS9q+tI0y7f9jnE5Kel3Rr5QtU0kxJV4Rj85ykb7VqR9U8+YSkx8P+bSqv0xmu3HOImb0G3AW8J7z0HuBW4Laaa7VLB6uB90galLQvwRfAv0ao73FgNExfjzuBb0s6XdKsCE04nUAxDgOHAHcA3wP2Ax4AvhyhjGlhnoOAWcAW4Fs1aT4KLCF4cn6sqj0PAH8C3BEu+wya2c+B54D31eRvpJAvpEq5S5pD8MX6z+GlO4HPS/rvkuY1+dVDqEAvJ3jafxPBl/TvVCU5K/xbCLwF2KtOWyv8M7CWQKn/JbDLxiBpGLgW+CpBX38BWCVpqCr/HwJ/BPwX4HVhmnqcDTwBDAEHAH8OWPjleA1Bf88mGOMfttGO3wN+G1gUUV4nJq7c88tP2a3If5dAud9ac+2nNXleBa4GTgv/rgqvReFJgg9YPT4c1v0XwCOS1kt6d5OyvmdmD5vZi8D1BL8wfmJm24EfAfNbCWNmz5nZKjPbbGYvA18jUAzVfN/MNoS/Jra1KpNAYX8EQNJ+wCJ2K+targQOkFRRwh8Drjez8fD9ecDXgTMJvhjH1NiYeyKwwcyuCPvgb4Gnq+6fCXzDzH5tZpsIvgROr126CL9Y3w38hZltNbNbCMa7wkeA68zsOjPbaWY3hrKdWJXme2b2n2a2BfgXdv8SrGUbwbLdQWa2zcxutSAQ1ZHAm4Gl4S+5V82sYtuI0o4VYb4tEeV1YuLKPb/cAhwbKqEhM/sV8B8Ea8H7AW9n6pM7BE
+iH6PFkkwdhoHn690wsxfMbJmZvY3gKW49sLrJ02r1evyWOu/3aiWMpOmS/l/4E/8lgrYOarJXzsZW5dRwMfBBSW8A/itwq5k9VS+hmW0m+CL6WNjOM6nqTzPbYWbfNrMFwCDBl893wyWhWt5cLWuoJJ+ouf9Y1fvHgD0I+rq2nBfM7JWatBUOAj4cLnFMSJoAjiVQ0hWqv1Q203gsLgAeAn4s6deSloXXZwKPhV9StURpR/WYRZHXiYkr9/xyB7AP8MfA7QBm9hLBE/YfA0+a2SN18t1K8OE4gGAZpyWSZgJHhHmbYmbPAv+H4IPc6Ek/Cc4G5gBHmdkb2f2LpfoLpVlI0yn3zGyMoF9PIVhy+UELGS4k+BJ4L8HSz9X1EpnZFjP7NoHx+rA6SZ4Cqu0Fqn5PMKYHVb2fBWxnqtH6KWDf8MupOm2FjcAPwmWoyt8bzOz8Jm2si5m9bGZnm9lbCGwyn5f0+2Eds2p/VbTRjupxSUxeZyqu3HNK+LN1FPg8k5XubeG1ek/tlafCDwInW4t4zuHT8e8RrMv/DLiuQbqvK3D/20PS3sCngIfM7Lk2m9UOexM85U+Ev1SirNNX8xtghqTX1Vy/CPgiMA+4okUZtwITwErgh6EtBABJn5N0nKSBsF8+Hsq8rk451wLzJC0OleKngd+qun8p8GehsXQv4H8Bl9U+HZvZYwRz4lwFrpjHEox1hcovk0WS+iS9PpSx+oskEpJOkvTW8IvoRWAHsJNgnjwFnC/pDWEdC9ppRxryOlNx5Z5vfkpg+Kp+Ar81vFZXuQOE69AbmpT7LUkvEyjAbwKrgBPMbGeD9NMJ1qAngF8TPJ2dHLURMfkmMAA8S2C8/Lc2868BNgBPS3q26vqVBPJfGS69NCT8crwoTF+7xLUZ+CuCZY5nCRT2qfV89sNfOx8G/jeBUfcwAiW9NUzyXYJfEbcAjxDYSf5HA7H+EDiKYAnty0xeKtoI/AGB8XOc4Ml4KfE+54cCPwE2Efza+Tszuyncz/BBAjfcxwmWl06L0Y6k5XVqkB/W4fQakh4G/puZ/SSj+qcRKMUzzeymLGRwyo9/Qzo9haRTCdZ913S53kWhi+qeBE+qIvhF4jip4LvEnJ5B0s0ESyIfbbIElRbHELhdvg64H1gc2lUcJxV8WcZxHKeE+LKM4zhOCclsWWb//fe32bNnZ1W94zhOIVm7du2zZtYyRENmyn327NmMjo5mVb3jOE4hkfRY61S+LOM4jlNKXLk7juOUEFfujuM4JcSVu+M4Tglx5e44jlNCXLk7juOUkEiukJIeBV4mCPu53cxGau4L+BuCE1Q2A2eZmR8WnCCr141xwQ0P8uTEFt48OMDSRXNYPH84a7Ecx8kp7fi5LwxDl9bj/QQhQg8lCEf69+F/JwFWrxtj+RX3smXbDgDGJraw/Ip7AVzBO45Tl6SWZf4AuCg8Bf5OguPQ/KishLjghgd3KfYKW7bt4IIbHsxIIsdx8k5U5W4EZymulbSkzv1hJp+N+ER4bRKSlkgalTQ6Pj5ee9tpwJMT9YMHNrruOI4TVbkfa2bvIlh++bSk97TKUA8zW2lmI2Y2MjTUMjSCE/LmwYG2rjuO40Racw8PFsbMnpF0JXAkk495GyM4Fb3CjPBaLunEOJmFYXPpojmT1twBBvr7WLpoTqr1RsWNvY6TP1o+uYeH4O5deQ28D7ivJtlVwMcUcDTwopk9lbi0CVAxTo5NbMHYbZxcva71d1EneTth8fxhzjtlHsODAwgYHhzgvFPm5UKBZtUnjuM0J8qT+wHAlYG3I3sA/2xm/ybpTwDM7DvAdQRukA8RuEL+UTridk4z42QrZdlJ3k5ZPH84F8q8liz7xHGcxrRU7uFp7u+sc/07Va+N4PT33NOJcdINm1PxPnGcfNJzO1Q7MU66YXMq3ieOk096TrkvXTSHgf6+SdeiGic7yVtWvE8cJ59kdkD2yMiItXsSU1JeGdXl7DPQjwQTm7dNet
2o/Lx5huRBnjzI4Di9gqS1tSFg6qYrinKv3YIPwRNiJ14j9cqsptPy0yaNPnEcJ99EVe6FWZZJYwt+vTKTLD9tPCyB4ziNKIxyT8MroxMPmTzgniqO4zSiMMo9Da+MTjxk8oB7qjiO04jCKPeli+bQP02TrvVPU0deGfU8PaoZ6O9j4dwhFpy/hoOXXcuC89c03Hm5et1YpHRJ0sxTJQt5nOTw8XM6pZ147tmjFu/bpGJ0rOc58+bBARbOHWLV2rGWcdSzirdeK3/FUwXw+O8FxuP3O0lQGG+ZBeevYazOWvLw4AC3Lzs+SdHarjML2ZqRN3mc9vDxc5pROm+ZLIyHUevMm2Ezb/I47eHj5yRBYZR7FsbDqHXmzbCZN3mc9vDxc5KgMMo9i23uUetMQrZqA9r8r/yYw8/9cWxjWj15+qeJza9tj1WmG/e6i4d0cJKgMAbVRsbDLAyWtXV2KlutAe2Fzdt23YtjTKtnKH7lte27ym2nTDfudZ8s5rpTPgpjUC0zjQxo1XRiTOvEQOfGPcfJF6UzqJaZtHfKegx7x+k9XLnngLR3ynoMe8fpPSIrd0l9ktZJuqbOvbMkjUtaH/59Mlkxi0UzA2S9e1F2yjYzprUyeJY5hr0bex2nPpHX3CV9HhgB3mhmJ9XcOwsYMbPPRK24rGvuzcLwAk3vVQxog9P7MYMXtzSOKx+lvtpdtJ0YfPNo3POQx04vkmg8d0kzgAuBrwGfd+XemGYGSCBx42QvGzx7ue1O75K0QfWbwBeBnU3SnCrpHkmXS5rZQKglkkYljY6Pj0esulg0M0B2M2xxLxg8e7ntjtOKlspd0knAM2a2tkmyq4HZZvYO4EaCp/wpmNlKMxsxs5GhoaFYAuedZgbIboYt7gWDZy+33XFaEeXJfQFwsqRHgR8Cx0u6uDqBmT1nZlvDt/8IHJGolAWimQEyDeNk3g2eadLLbXecVrTcoWpmy4HlAJKOA75gZh+pTiPpQDN7Knx7MvBAwnLmnmqj4+D0fvbcY1pDg2i1cXLh3CFWXLWBz122HoB9p/fz5Q++ra3dqKOPPc+ld21khxl9EqceMRwpf9KG0m4bXn0np+M0Jnb4AUlfAUbN7CrgTyWdDGwHngfOSka8YlAvfMBAfx9/fdrhdUMVVK6tXjfG0h/dzbadu43aL2zextLL796VNkrdq9aOsSM0jO8wY9XaMUYO2q9p/qTDCmQZ096VueNMpa1NTGZ2c8VTxszOCRU7ZrbczN5mZu80s4Vm9ss0hM0rcQ+qvuCGBycp9grbdljkQ647qTvJw7X9sG7HyRe+QzUB4nptNLsf1eMj6brjepq454rj5AtX7gkQ12uj2f2oHh9J1x3X08Q9VxwnXxQm5C/kd6fk0kVz6u6UbOW1sXTRnClr7gD9fbsP/m7V5lZ1N8rfLF/Ufq5ON/11U8MnxPVcyYOhtxtzLa/z2SkHhVHueY4rHtdro3J/xVUbmNgSxFqv9paJ0uZmdUfJH/dw7dqyX3lt8nq7ILLXTjV5MPR2Y67leT475aAw8dx7cat5p22Ok7/TQ8HjyNmpzEmX14251ovz2UmG0sVz70WDXadtjpO/00PB200Tt/40y+vGXOvF+ex0l8Io91402HXa5jj5Oz0UvN00cetPs7xuzLVenM9OdymMcs/zVvO0Yoo3ivP+ytbtkepo1GcL5w41lLdZP1e3c/Nr2+mfpoZ1xx2bpMc5annVbXtl63b6+9QyTyfkeT475aAwBtW8bjVP0zBWyX/u1RsmHZo9sWVbpDrq9dnCuUOsWjvWUN6ohtYXNm+jv08MDvTz4pZtbcWgj9LmpMY5Snm1YzixZRv908S+0/uZ2NxZezqRy3E6oTAG1bxSNONb3LLKbAAsc9uc8lE6g2peKZrxLS87WvNEmdvm9C6u3DukaMa3vOxozRNlbpvTu7hy75BuGMaSrCNuWWU2AJa5bU7vUhiDKuRzu3Y9o+eee7T+zqy0ZWxiC30SO8
wYbtCmJI1vne6mzTosQBrluHHTKSOFMajm+aT7dmWrlz5KvjKR1HjmeV44ThqUzqCa53jh7cpWL32UfGUiqfHM87xwnCwpjHLPs0dDu7J1Eue9LCQ1nnmeF46TJYVR7nn2aGhXtk7ivJeFpMYzz/PCcbIksnKX1CdpnaRr6tzbU9Jlkh6SdJek2UkKCfn2aGhXtkZhBerlSyu0QdYsnDtEbfCCOOPZTt+36ssk+jqr8WpWb1nnkNOcdrxlPgs8ALyxzr1PAC+Y2VslnQ58HTgtAfl2kWePhnZlWzx/mNHHnueSOx+n2pxdGwO9rDG/K4d6N2t7VKL2fau+TKKvsxqvZvVCtPj8TvmI5C0jaQZwIfA14POVQ7Kr7t8ArDCzOyTtATwNDFmTwssSfiAuUba8l3VbfBbtalVnEjJlNV7N6gVKOYd6maS9Zb4JfBHY2eD+MLARwMy2Ay8Cb6oj1BJJo5JGx8fHI1ZdTqIYAstqLMyiXa3qTEKmrMarWb1lnUNOa1oqd0knAc+Y2dpOKzOzlWY2YmYjQ0NDnRZXaKIYAstqLMyiXa3qTEKmrMarWb1lnUNOa6I8uS8ATpb0KPBD4HhJF9ekGQNmAoTLMvsAzyUoJ5APw1BSMkQxBEaNrZ61Aa3d+rIwjreqMwmZkjTutkOzevPsiOCkS0uDqpktB5YDSDoO+IKZfaQm2VXAx4E7gA8Ba5qtt8chD8bFJGWIYgiMc4h1s3tp9FOcPsnCON6qziRkSsq4m3TbOm2XU0zaCj9QpdxPkvQVYNTMrpL0euAHwHzgeeB0M/t1s7KKeEB2HmRoJQd014CWlz4pEt5nTidENai2FTjMzG4Gbg5fn1N1/VXgw+2J2B55MAzlQYa4cqQlY176pEh4nzndwHeoFkyGVnJ0W8a89EmR8D5zukFhlHseDEN5kKGVHN2WMS99UiS8z5xuUJh47ovnD/Oj0ce5/eHnd11716x9umoYinrYclTjVW3ahXOHuOmX41Py1ivzvFPmZWpAq5ZpcHo/e+4xLdLh2HmMyd8OUeRvlKY6hn916IV9p/fz5Q++rVD94OSfwsRz/9Lqe7n4zsenXP/I0bP46uJ5SYoWm3ZiizeL6V6d99Qjhlm1dixX8crjxlAveuz1KPI3SlNvHBuV4TjNKF0890vv2tjW9SxoJ7Z4s5ju1XkvvWtj7uKVx42hXvTY61Hkb5Sm3jg2KsNxkqAwyn1Hg18Yja5nQTteEFE9Ixq1L0vPiqTj1BfFS6STkBGt5mlR+sApDoVR7n2qDRDb/HoWtOMFEdUzolH7svSsSDpOfVG8RDoJGdFqnhalD5ziUBjlfsZRM9u6XqGbW/EbeUEsnDs0RYZmMd2r855x1Ez6p01WDP3TlKlnRVxvj6WL5tDfV9OWvmzbUk2rudJJyIgzjpoZOYZ/VHmSaFM7fGn1vRyy/DpmL7uWQ5Zfx5dW39s6k5MZhVHuj4xvaus67DZujU1swdi9zTstBb94/jDnnTKP4cEBRLDjsGJIq5UBmJL2I0fPmvT+vFPmMXLQfkw51SLjHyv12hnZIFi7OpGTVbUocyVKuxul+eriebuuw+4n+UZ9l8TcTXL+VxwaKstLO8y4+M7HXcHnmMJ4y8xedm3De4+e/4G61/OwzbtTGfLQhqTIc1vyJlve4ssfsvy6unaDPomHzzuxrbKczkgl/EDRyIMBr1MZ4ubPoz95HsajEXmTLW/x5Yvg0OBMpjDLMnHIgwGvUxni5O/2clRU8jAejcibbHmLL18EhwZnMoVR7gsO2a+t69DZNu9uxm1POn9e/cnbbUvtGHxp9b2pGce7ERKgHYNkK3mizM8k2xTXocHJjsIsyxw8tNek0APV1xsRN0Z3t+O2J50/b0sMFdppS70xqN6hnHSc+rRjzNfusK4YJIG6O6ybyRN1fibZpoqMl961kR1m9EmccdTM3OwOd6ZSGINqNw06eTOutUvR5YfGbailKG1Kcv6WYXyd+JQu/E
A3DTp5ffKNShmiDnZqcM4bSc7fos9PpzsURrl306CTN+Nau3Tkh54TOjU4540k52/R56fTHVoqd0mvl/QzSXdL2iDp3DppzpI0Lml9+PfJpAXtpkGnDE++i+cPc/uy43nk/A9w+7LjC6XYof4Y1FKkMUly/pZhfjrpE8WguhU43sw2SeoHbpN0vZndWZPuMjP7TPIiBnTToNOu4e/cqzfwwuZtAAwO9LPi5Naxubvlh55Hf/co1BuDRvHu2yFqDP0oedupP8n522p+FnXMnWRp94Ds6cBtwKfM7K6q62cBI+0o93YNqnlk9boxll5+N9t2TO7D/mnigg+/s6mS6EZc86LHT0+aqDH0o8bfz2NfFkVOJz6JGlQl9UlaDzwD3Fit2Ks4VdI9ki6X1BPOrxfc8OAUxQ6wbac19Snvlh96Xv3dsyJqDP2o8ffz2JdFkdNJn0jK3cx2mNnhwAzgSElvr0lyNTDbzN4B3AhcWK8cSUskjUoaHR8f70TuXNDMOyHOvaS9HdyrYjKdeOAUpS+LIqeTPm15y5jZBHATcELN9efMbGv49h+BIxrkX2lmI2Y2MjQ0FEfeXNHMOyHOvaS9HdyrYjKdeOBk1Zft7pT2MXcqRPGWGZI0GL4eAN4L/LImzYFVb08GHkhSyApn/sMdzF527a6/M//hjkj50orpXi8+ObSOt94tb4eye1W0O65RPHD6+8QrW7dPKTOLvowSI6i2DxbOHYosZzfPOnC6T5Qn9wOBmyTdA/ycYM39GklfkXRymOZPQzfJu4E/Bc5KWtAz/+GOKeEHbn/4+ZYKPs0gWovnD3PBh97JvtP7d10bHOhvakyt5OuGH3oZ/N0bEWdc6/VHdQz9faf3g8HElm1TysyiL1utn9frg1Vrxzj1iOGWcuY1uJyTHIUJPxAnnjv4Vu2yksa45m2uHLzs2rpnmQh45PwPdCRv3trqRKd04Qfi4gamcpLGuOZtrrRaP+9E3ry11Ume0it3NzCVkzTGNW9zpdU6fyfy5q2tTvIUJuTvgkP2qxvyt1k8dwg+IPU2dcQ1hCW9+893E8Yj6XGNU+aXVt+7a8dpheEIY9hqzKvvD07vZ889pvHilm1T0nbSB2n0X6f4ZyFZCrPmDlONqgsO2Y9L/viYlvmSmjRJ7/7z3YSdkYYyiFpmbXz2apqNYasxb3dOdNIHeVKm/lmITtQ190Ip96xJ2gjlRq3i0ig+e4VGY9hqzHt1TvRqu+PgBtUUSNoI5Uat4tIqDnu7Y1u53qtzolfbnSau3NsgaSOUG7WKS6s47O2ObeV6r86JXm13mhTGoAqTDVhZnOEYxQjVzjpmVKNWnLXRrNdi2y2jm+u/nci2z0A/UvMn92aGyVZjnpWhs14bJzZPNuKmOUZ5NPAWnb4VK1ZkUvHKlStXLFmyJHL6igGr8pEy4J4nXuTZTVs5fu4BqchYy9wD38iMfQe4d+xFNr26neHBAc754GGT4mgvv+Jent/8GgAvv7qdn/7nODP2HWDugW9su7w4ZcbNk0TeuGUkUWdUOpXt1e07eXXbzobl1xvDalqNeZQ5kTTN2ljpn6df2sL/XfNQamOURbuLyrnnnvvUihUrVrZKVxiDajcPyI5LXnZNZr1zsd0yumlMS0q2KHmLQpQ29kl1P39Fb3sRKZ1BtZsHZMclL7sms965GNeY2EmdUUlDhqIb/aLI3+hzVvS2l5nCKPduHpAdl7zsmsx652JcY2IndUYlDRmKbvSLIn+jz1nR215mCqPcu3lAdlzSCAsbp8xO5EiiDe2W0c1wuknIFjVvUYjSxjOOmlnq8NFlpDDeMl9dPI9HxjdN2aHaTW+ZVrRzsHaaZcbJU+st8fr+aVO8JdKSOY1+a+TZ0alstZ4kC+cOccEND/Jnl63v6i7PJD1XWrWxUvbIQfvlZker05rCGFR9e3J6lK1vy34AednGy2mP0hlU/eDf9Chb35b9APKyjZeTDoVR7r49OT3K1rdlP4C8bO
PlpENhlLtvT06PsvVtXg4gT+uM0rKNl5MOLQ2qkl4P3ALsGaa/3My+XJNmT+Ai4AjgOeA0M3s0SUGXLprD0svvZtuO3TaC/r7mB1EXhaxDr5Zt63ej9iycO8SC89fU7ec4Y9BsTtaui1fOKK3QyXinNV7txJl3g2o8utmHUbxltgLHm9kmSf3AbZKuN7M7q9J8AnjBzN4q6XTg68BpiUtba/vNz/6l2DRTAt364KThrZIl9dqzcO4Qq9aONVS2scegwZxstC6+4qoNbN2+s6PxTsu7qFkf5GGeFp1u92Fb3jKSpgO3AZ8ys7uqrt8ArDCzOyTtATwNDFmTwtv1lilrvOeytitvNOtnINYYNCvzyYktbT17ZD3eHmc+fZLqw0S9ZST1SVoPPAPcWK3YK/IBGwHMbDvwIvCmOuUskTQqaXR8fDxK1bsoqxGprO3KG836Oe4YNMvX7vp31uPtcebTp9t9GEm5m9kOMzscmAEcKentcSozs5VmNmJmI0NDQ23lLasRqaztyhvN+jnuGDTL12gn7L7T+2PVlTYeZz59ut2HbXnLmNkEcBNwQs2tMWAmQLgssw+BYTUxurlFPa6XQ5x89dolYOHc9r78nOY0mz9x51azfIvnD3PeKfMYHhxAwGC46/eFzduojdKSB+N1qz7oZJ6m5TVUNLqpwyCCcpc0JGkwfD0AvBf4ZU2yq4CPh68/BKxptt4eh9oPy/DgQCo78ipGj7FwzbRi9Gg1IePmWzx/mFOPGJ70gTdg1dqxnv0QpEGz+RN3brXKt3j+MLcvO56/Pu1wtm7fyQubtwHB+FbGO6153C5R2hJnnsb9XJSRbumwCi0NqpLeAVwI9BF8GfyLmX1F0leAUTO7KnSX/AEwH3geON3Mft2s3LwekB3X6JF1DHUnv5RlfLt9toBTn6gG1ZaukGZ2D4HSrr1+TtXrV4EPtytkHknDuJZWnU4xKMv4dvtsAaczCrNDtVukYVxLq06nGJRlfLt9toDTGa7ca0jDuJZWnU4xKMv4dvtsAaczChPPvVvE3f3Xya7Bsu0QzYI8b41PYnzz0L40zxbIQ/vKRmHiuTtOI8oe39zb51RTunjujtOIssc39/Y5cXDl7hSesntkePucOLhydwpP2T0yvH1OHNygWoUbdYpJmvHom82JevegvvGwk7mVx3j7SX5W8ti+MuAG1RA36hSbNL6Ym80JYMq9/j6BwbadNin9qUcMT4olX11OVBnz9OCRxmd9ai2JAAANI0lEQVQlT+3LO1ENqq7cQ3ybtFNLnBjw9eiT2FHnc1bUueWflWxJLPxAr+BGHaeWpOZEPcUep5y84J+VYuAG1RA36ji1xIkBX48+1Qb5bV5+3vHPSjFw5R6Sl3jxeYt9nTd5usnCuUMNY6/Xmy/9faJ/mqakP+OomcF6fA2vbN2e6/5sNPYeUqAY+LJMSLdCADQ7JBc6OKg5BXr5UOTV68ZYtXZs0jmoAk49YnhS26N4ywBc9rONU+qY2LItt/0ZZezdAJpv3KDaZdI4qDktetlwlmTbG5XVSZlp08tjn3fcoJpTkoyJnTa9bDhLsu1xzwLIkl4e+7Lga+5dJo2DmtMib/J0kyTbHvcsgCzp5bEvC1HOUJ0p6SZJ90vaIOmzddIcJ+lFSevDv3PqleWkc1BzWuRNnm6SZNvrldVpmWnTy2NfFqIsy2wHzjazX0jaG1gr6UYzu78m3a1mdlLyIpaLKMaovBiq4hrO0tot2s1+qW374PR+zODPLlvPBTc82Fb91WWNTWzZtampT5oU/TBPBskiGk1bzZFe2wXbtkFV0r8C3zKzG6uuHQd8oR3l3qsG1bKT1tb0LENDJF1/1u0pI636tEx9nko8d0mzCQ7LvqvO7WMk3S3peklva6dcpzykEZs763jfSdefdXvKSKs+7cU+j+wtI2kvYBXwOTN7qeb2L4CDzGyTpBOB1cChdcpYAiwBmDVrVmyhnfyShpdF1p4bSdefdXvKSK
s+7cU+j/TkLqmfQLFfYmZX1N43s5fMbFP4+jqgX9L+ddKtNLMRMxsZGhrqUHQnj6ThZZG150bS9WfdnjLSqk97sc+jeMsI+CfgATP7RoM0vxWmQ9KRYbnPJSlokemlLfxpeFlk7bnRSf31xj5qeb00bzpl6aI5U0I/9E/Trj7Neg5lQZRlmQXAR4F7Ja0Pr/05MAvAzL4DfAj4lKTtwBbgdMtq62vO6LUt/Gl4WWTtudGJ11C9sT/vlHmcd8q8lp4dvTRvEqE2fE/V+6znUBZ4+IGU8W3cvUsnY+/zpj16qb9S8ZZx2qcXDTlOQCdj7/OmPby/puLKPWV60ZDjBHQy9j5v2sP7ayqu3FOmFw05WZMXQ2QnY99u3kqbZy+7lkOWX8fsHjPC+udsKh4VMmV60ZCTJXkyRHYy9u3krW1z5Vi/XjLC+udsKm5QdUpFLxnWKhQxXrwTHzeoOj1JLxrWihgv3kkfV+5OqehFw1oR48U76ePK3SkVRTSsdWoALmK8eCd93KDqlIqiGdaSMAA3ihc/nPO2O+niBlXHyZBeNAA7neEGVccpAL1oAHa6gyt3x8mQXjQAO93BlbvjZEgRDcBOMXCDquOkTLODmYtmAK6m1w6cLhqu3B0nRaJ4wyyeP1w4pZinMA9OfXxZxnFSpKwHM5e1XWXClbvjpEhZvWHK2q4y4crdcVKkrN4wZW1XmYhyQPZMSTdJul/SBkmfrZNGkv5W0kOS7pH0rnTEdZzsaSdcQD1vGAEL5w6lLGW6uJdP/ony5L4dONvMDgOOBj4t6bCaNO8HDg3/lgB/n6iUjpMTKobEsYktGLsNiY0U/OL5w5x6xPCks5sNWLV2rNAHaSyeP8x5p8xjeHAAEeyoPe+UeW5MzREtvWXM7CngqfD1y5IeAIaB+6uS/QFwkQWxDO6UNCjpwDCv45SGZobERortpl+OUxvko1WeIlBEL59eoq01d0mzgfnAXTW3hoGNVe+fCK/V5l8iaVTS6Pj4eHuSOk4OiGNIdOOjkwWRlbukvYBVwOfM7KU4lZnZSjMbMbORoaFirzk6vUkcQ6IbH50siKTcJfUTKPZLzOyKOknGgJlV72eE1xynVMQxJHbL+JiXg8GLTJn6sOWauyQB/wQ8YGbfaJDsKuAzkn4IHAW86OvtThmJEy6gGyEGfMdo55StD1vGc5d0LHArcC+wM7z858AsADP7TvgF8C3gBGAz8Edm1jRYu8dzd5zk8LjwnVOUPowazz2Kt8xtMMmTq14aAz4dXTzHcZLEjbadU7Y+9B2qjlMC3GjbOWXrQ1fujlMCfMdo55StDz3kr+OUgCLHhc8LZetDPyDbcRynQPgB2Y7jOD2MK3fHcZwS4srdcRynhLhB1XFyih9A7XSCK3fHySFl2wrvdB9flnGcHOIHUDud4srdcXJI2bbCO93Hlbvj5JCybYV3uo+vuTtOisQ1ii5dNGfSmjvU3wrvRlenEa7cHSclOjGKRtkK70ZXpxmu3B0nJeIcpl1NqwOoOy3fKTe+5u44KZG2UdSNrk4zXLk7TkqkbRR1o6vTjJbKXdJ3JT0j6b4G94+T9KKk9eHfOcmL6TjFI+344GWLP+4kS5Q19+8TnI96UZM0t5rZSYlI5DglodP44K08YTopPy0vG/feyQ9RzlC9RdLs9EVxnPLRyijaiKieMHHKT8vLxr138kVSa+7HSLpb0vWS3pZQmY7Ts6QZfiCtsj1kQr5IwhXyF8BBZrZJ0onAauDQegklLQGWAMyaNSuBqh2nnKTpCZNW2e69ky86fnI3s5fMbFP4+jqgX9L+DdKuNLMRMxsZGhrqtGrHKS1pesKkVbZ77+SLjpW7pN+SpPD1kWGZz3VaruP0Mml6wqRVdr1yBSycm/yD3Op1Yyw4fw0HL7uWBeevYfW6scTrKDotl2UkXQocB+wv6Qngy0A/gJl9B/gQ8ClJ24EtwOmW1anbjlMSOvW0yaLsxfOHGX3seS6583EqCsCAVW
vHGDlov8SMqm64jYay0sMjIyM2OjqaSd2O46TDgvPXMFZnjX14cIDblx1fmDryjKS1ZjbSKp3vUHUcJzG6YVR1w200XLk7jpMY3TCquuE2Gq7cHcdJjKSMtc0Mph52IRoe8tdxnMRIwljbymCaprG5TLhB1XGcXNHrBtNWuEHVcZxC4gbTZHDl7jhOrnCDaTK4cnccJ1e4wTQZ3KDqOE6ucINpMrhydxwnd8SNg+/sxpdlHMdxSogrd8dxnBLiyt1xHKeEuHJ3HMcpIa7cHcdxSogrd8dxnBKSWWwZSePAYzGz7w88m6A43aKIcrvM3aOIcrvM3aFa5oPMrOXZhZkp906QNBolcE7eKKLcLnP3KKLcLnN3iCOzL8s4juOUEFfujuM4JaSoyn1l1gLEpIhyu8zdo4hyu8zdoW2ZC7nm7jiO4zSnqE/ujuM4ThNcuTuO45SQwil3SSdIelDSQ5KWZS1PIyR9V9Izku6rurafpBsl/Sr8v2+WMlYjaaakmyTdL2mDpM+G13MrM4Ck10v6maS7Q7nPDa8fLOmucJ5cJul1Wctai6Q+SeskXRO+z7XMkh6VdK+k9ZJGw2u5nh8AkgYlXS7pl5IekHRMnuWWNCfs48rfS5I+167MhVLukvqAbwPvBw4DzpB0WLZSNeT7wAk115YB/25mhwL/Hr7PC9uBs83sMOBo4NNh3+ZZZoCtwPFm9k7gcOAESUcDXwf+2szeCrwAfCJDGRvxWeCBqvdFkHmhmR1e5XOd9/kB8DfAv5nZXOCdBH2eW7nN7MGwjw8HjgA2A1fSrsxmVpg/4Bjghqr3y4HlWcvVRN7ZwH1V7x8EDgxfHwg8mLWMTWT/V+C9BZN5OvAL4CiC3Xx71Js3efgDZoQf0OOBawAVQOZHgf1rruV6fgD7AI8QOo8URe4qOd8H3B5H5kI9uQPDwMaq90+E14rCAWb2VPj6aeCALIVphKTZwHzgLgogc7i8sR54BrgReBiYMLPtYZI8zpNvAl8Edobv30T+ZTbgx5LWSloSXsv7/DgYGAe+Fy6B/aOkN5B/uSucDlwavm5L5qIp99Jgwddv7vxQJe0FrAI+Z2YvVd/Lq8xmtsOCn7AzgCOBuRmL1BRJJwHPmNnarGVpk2PN7F0Ey6KflvSe6ps5nR97AO8C/t7M5gOvULOckVO5CW0uJwM/qr0XReaiKfcxYGbV+xnhtaLwG0kHAoT/n8lYnklI6idQ7JeY2RXh5VzLXI2ZTQA3ESxpDEqqnBGct3myADhZ0qPADwmWZv6GfMuMmY2F/58hWAM+kvzPjyeAJ8zsrvD95QTKPu9yQ/Al+gsz+034vi2Zi6bcfw4cGnoVvI7gJ8tVGcvUDlcBHw9ff5xgXTsXSBLwT8ADZvaNqlu5lRlA0pCkwfD1AIGd4AECJf+hMFmu5Daz5WY2w8xmE8zhNWZ2JjmWWdIbJO1deU2wFnwfOZ8fZvY0sFHSnPDS7wP3k3O5Q85g95IMtCtz1gaDGAaGE4H/JFhX/Z9Zy9NEzkuBp4BtBE8PnyBYV/134FfAT4D9spazSt5jCX7m3QOsD/9OzLPModzvANaFct8HnBNefwvwM+Ahgp+1e2YtawP5jwOuybvMoWx3h38bKp+9vM+PUMbDgdFwjqwG9s273MAbgOeAfaqutSWzhx9wHMcpIUVblnEcx3Ei4MrdcRynhLhydxzHKSGu3B3HcUqIK3fHcZwS4srdcRynhLhydxzHKSH/Hx0ARVv23WaMAAAAAElFTkSuQmCC\n", 350 | "text/plain": [ 351 | "
" 352 | ] 353 | }, 354 | "metadata": {}, 355 | "output_type": "display_data" 356 | } 357 | ], 358 | "source": [ 359 | "pl.figure()\n", 360 | "pl.scatter(X_cos,Y)\n", 361 | "pl.title('Cosine Similarity VS golden score')\n", 362 | "pl.show()\n", 363 | "pl.figure()\n", 364 | "pl.scatter(X_wmd,Y)\n", 365 | "pl.title('WMD Similarity VS golden score')\n", 366 | "pl.show()\n" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "You can learn a simple regression model between those 2 quantities. Use a polynomial of degree 2 to learn the regression model." 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 13, 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "import numpy.polynomial.polynomial as poly\n", 383 | "k_cos = # TO BE FILLED\n", 384 | "k_wmd = # TO BE FILLED" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "Now compute from your regression model the estimated relatedness for all the pairs in the test set." 
392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 14, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "X_cos=[]\n", 401 | "X_wmd=[]\n", 402 | "Y_test=[]\n", 403 | "for i in range(len(testA)):\n", 404 | " s1 = vect.transform([testA[i]]).toarray().ravel()\n", 405 | " s2 = vect.transform([testB[i]]).toarray().ravel()\n", 406 | " # cosine similarity between bag of words\n", 407 | " d_cos=# TO BE FILLED\n", 408 | " X_cos.append(d_cos)\n", 409 | " # WMD\n", 410 | " d_wmd=# TO BE FILLED\n", 411 | " X_wmd.append(d_wmd)\n", 412 | " Y_test.append(scores_test[i])\n", 413 | "\n", 414 | "# Final regression scores\n", 415 | "Y_cos = # TO BE FILLED\n", 416 | "Y_wmd = # TO BE FILLED" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "We will use MSE, Spearman's rho and Pearson coefficients to measure the quality of our regression model" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 15, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "from sklearn.metrics import mean_squared_error as mse\n", 433 | "from scipy.stats import pearsonr\n", 434 | "from scipy.stats import spearmanr\n" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "Estimate the quality of your regression model for both Cosine and WMD dissimilarities" 442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": 16, 447 | "metadata": {}, 448 | "outputs": [ 449 | { 450 | "name": "stdout", 451 | "output_type": "stream", 452 | "text": [ 453 | "-------- Cosine\n", 454 | "Test Pearson (test): 0.4391501309646927\n", 455 | "Test Spearman (test): 0.36660337782004127\n", 456 | "Test MSE (test): 0.8784692440428779\n", 457 | "-------- WMD\n", 458 | "Test Pearson (test): 0.7062364858939475\n", 459 | "Test Spearman (test): 0.5872407398045967\n", 460 | "Test MSE (test): 0.5858116558928291\n" 461 | ] 462 | } 463 | ], 464 | 
"source": [ 465 | "print('-------- Cosine')\n", 466 | "\n", 467 | "pr = pearsonr(Y_cos, Y_test)[0]\n", 468 | "sr = spearmanr(Y_cos,Y_test)[0]\n", 469 | "se = mse(Y_cos,Y_test)\n", 470 | "\n", 471 | "print('Test Pearson (test): ' + str(pr))\n", 472 | "print('Test Spearman (test): ' + str(sr))\n", 473 | "print('Test MSE (test): ' + str(se))\n", 474 | "\n", 475 | "print('-------- WMD')\n", 476 | "\n", 477 | "pr = pearsonr(Y_wmd, Y_test)[0]\n", 478 | "sr = spearmanr(Y_wmd,Y_test)[0]\n", 479 | "se = mse(Y_wmd,Y_test)\n", 480 | "\n", 481 | "print('Test Pearson (test): ' + str(pr))\n", 482 | "print('Test Spearman (test): ' + str(sr))\n", 483 | "print('Test MSE (test): ' + str(se))" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "Not bad isn't it ?" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [] 499 | } 500 | ], 501 | "metadata": { 502 | "kernelspec": { 503 | "display_name": "Python 2", 504 | "language": "python", 505 | "name": "python2" 506 | }, 507 | "language_info": { 508 | "codemirror_mode": { 509 | "name": "ipython", 510 | "version": 2 511 | }, 512 | "file_extension": ".py", 513 | "mimetype": "text/x-python", 514 | "name": "python", 515 | "nbconvert_exporter": "python", 516 | "pygments_lexer": "ipython2", 517 | "version": "2.7.15" 518 | } 519 | }, 520 | "nbformat": 4, 521 | "nbformat_minor": 2 522 | } 523 | -------------------------------------------------------------------------------- /3_WMD.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # # Word Mover's distance 5 | # 6 | # In this note notebook we will see an application of Optimal Transport to the problem of computing similarities between sentences and texts. 
# The method under the lens is called 'Word Mover's Distance' in reference to 'Earth Mover's Distance', another name of the Wasserstein $1$ distance, mostly used in computer vision.
#
# Traditionally, portions of text are compared by Cosine similarity on bag-of-words vectors, i.e. histograms of occurrences of words in a text. This captures exact word overlap, but two closely related sentences can be orthogonal if they use different words with the same meaning. A semantic distance can be obtained by using *word embeddings*, that is, embeddings of words in a Euclidean space (of potentially large dimension) where the Euclidean distance has a semantic meaning: two related words will be close in such embeddings. A popular embedding is the *word2vec* embedding, obtained with neural networks. A study of those mechanisms is not in the scope of this notebook, but the interested reader can find more information on [the corresponding Wikipedia page](https://en.wikipedia.org/wiki/Word2vec). Throughout the rest of this tutorial, we will use a subset of the [GloVe](https://nlp.stanford.edu/projects/glove/) embedding.
#
# The key observation made by Kusner and colleagues [1] is that, given two sentences/documents, the optimal transport distance can be used between their histograms of occurring words, with a ground metric obtained through word embeddings. In this way, related words will be matched together, and the resulting distance will express the semantic relatedness between the contents.
#
# [1] Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International Conference on Machine Learning (pp. 957-966).
# http://proceedings.mlr.press/v37/kusnerb15.pdf
#

# ## A basic example
#
# We will start by reproducing Figure $1$ of the original paper.
#
#

# Two sentences are considered: 'Obama speaks to the media in Illinois' and 'The president greets the press in Chicago'. Once stopwords are removed, the two sentences have no word in common, so the Cosine similarity between their bag-of-words vectors is zero and suggests that they are unrelated. We will start with some imports and create the two sentences as lists of words, without the stopwords that are not relevant for our analysis.

# In[1]:


import os

import numpy as np
import matplotlib.pylab as pl
import ot


s1 = ['Obama','speaks','media','Illinois']
s2 = ['President','greets','press','Chicago']


# We will use a subset of the GloVe word embedding, expressed as a dictionary (word, embedding) that you can load this way

# In[2]:



model=dict(np.load('data/model.npz'))


# Then the embedded representation of the sentences can be obtained by

# In[3]:


s1_embed = np.array([model[w] for w in s1])
s2_embed = np.array([model[w] for w in s2])


# Using the multidimensional scaling method from scikit-learn, try to visualize the corresponding embedding of words in 2D.

# In[4]:


from sklearn import manifold

C = ot.dist(np.vstack((s1_embed,s2_embed)))

nmds = manifold.MDS(
    2,
    eps=1e-9,
    dissimilarity="precomputed",
    n_init=1)
npos = nmds.fit_transform(C)

pl.figure(figsize=(6,6))
pl.scatter(npos[:4,0],npos[:4,1],c='r',s=50, edgecolor = 'k')
for i, txt in enumerate(s1):
    pl.annotate(txt, (npos[i,0]-4,npos[i,1]+2),fontsize=20)
pl.scatter(npos[4:,0],npos[4:,1],c='b',s=50, edgecolor = 'k')
for i, txt in enumerate(s2):
    pl.annotate(txt, (npos[i+4,0]-4,npos[i+4,1]+2),fontsize=20)
pl.axis('off')
pl.tight_layout()
pl.show()


# Let's now compute the coupling between those two distributions and visualize the corresponding result
#

# In[5]:


C2= ot.dist(s1_embed,s2_embed)
G=ot.emd(ot.unif(4),ot.unif(4),C2)

pl.figure(figsize=(6,6))
pl.scatter(npos[:4,0],npos[:4,1],c='r',s=50, edgecolor = 'k')
for i, txt in enumerate(s1):
    pl.annotate(txt, (npos[i,0]-4,npos[i,1]+2),fontsize=20)
pl.scatter(npos[4:,0],npos[4:,1],c='b',s=50, edgecolor = 'k')
for i, txt in enumerate(s2):
    pl.annotate(txt, (npos[i+4,0]-4,npos[i+4,1]+2),fontsize=20)
for i in range(G.shape[0]):
    for j in range(G.shape[1]):
        if G[i,j]>1e-5:
            pl.plot([npos[i,0],npos[j+4,0]],[npos[i,1],npos[j+4,1]],'k',alpha=G[i,j]/np.max(G))
pl.title('Word embedding and coupling with OT')
pl.axis('off')
pl.tight_layout()
pl.show()


# ## Sentence similarity
# We will now compare this Word Mover's Distance (WMD) with the Cosine similarity in a regression context, where our goal is to estimate the similarity (or relatedness) of two sentences on a scale of 0 to 5 (5 being the most similar). Given a set of pairs of sentences with a human-annotated relatedness, our goal is to predict the relatedness of a new pair of sentences.
#
# We will use the [SICK (Sentences Involving Compositional Knowledge) dataset](http://clic.cimec.unitn.it/composes/sick.html) for this purpose.
#
# We first load it.

# In[6]:



data=np.load('data/data_text.npz')
setA=data['setA']
setB=data['setB']
scores=data['scores']

print (setA[0])
print (setB[0])
print(scores[0])

np.savez('data/data_text.npz',setA=setA,setB=setB,scores=scores)


# We will only keep the first 200 sentence pairs for learning our regression model and the rest for testing

# In[7]:


n=200
testA=setA[n:]
trainA=setA[:n]
testB=setB[n:]
trainB=setB[:n]

scores_train=scores[:n]
scores_test=scores[n:]


# Using the CountVectorizer model from scikit-learn, compute all the bag-of-words representations of the sentences

# In[8]:


from sklearn.feature_extraction.text import CountVectorizer
vect = # TO BE FILLED


# Build a big data matrix containing the embeddings of all the words present in the dataset
#

# In[9]:


all_feat = # TO BE FILLED


# Compute a big matrix of all pairwise feature distances using the dist() method of POT

# In[10]:


D = ot.dist(all_feat)


# Now you can write code that computes the Cosine and WMD dissimilarities for all the pairs of the training set

# In[11]:


X_cos=[]
X_wmd=[]
Y=[]



for i in range(len(trainA)):
    s1 = vect.transform([trainA[i]]).toarray().ravel()
    s2 = vect.transform([trainB[i]]).toarray().ravel()
    # Cosine similarity between bag of words
    d_cos=# TO BE FILLED
    X_cos.append(d_cos)
    # WMD
    d_wmd=# TO BE FILLED
    X_wmd.append(d_wmd)
    Y.append(scores_train[i])



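# As an illustration of the two quantities involved (and not as the solution of the exercise), here is a minimal, self-contained sketch on a toy pair of sentences. The 4-word vocabulary, the 2D "embeddings" and the helper names `cosine_similarity` and `wmd` are made up for this example. To make the underlying linear program explicit and to run without POT, the sketch solves it with `scipy.optimize.linprog`; with POT installed, `ot.emd2(a, b, M)` should give the same value.

```python
import numpy as np
from scipy.optimize import linprog


def cosine_similarity(h1, h2):
    """Cosine similarity between two bag-of-words count vectors."""
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))


def wmd(h1, h2, D):
    """Word Mover's Distance between two count vectors over the same vocabulary.

    D[i, j] is the ground cost between word i and word j of the vocabulary.
    """
    # Keep only the words that occur, and normalise the counts to histograms.
    i1, i2 = np.flatnonzero(h1), np.flatnonzero(h2)
    a = h1[i1] / h1[i1].sum()
    b = h2[i2] / h2[i2].sum()
    M = D[np.ix_(i1, i2)]
    n, m = M.shape

    # Marginal constraints on the flattened (row-major) plan G:
    # rows of G sum to a, columns of G sum to b.
    A_rows = np.kron(np.eye(n), np.ones((1, m)))
    A_cols = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(M.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun)


# Hypothetical toy vocabulary: ['obama', 'president', 'speaks', 'greets'],
# with made-up 2D embeddings in which related words are close.
emb = np.array([[0., 0.], [0., 1.], [3., 0.], [3., 1.]])
# Squared Euclidean ground cost (the default metric of ot.dist).
D = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)

h1 = np.array([1, 0, 1, 0])  # "obama speaks"
h2 = np.array([0, 1, 0, 1])  # "president greets"

print(cosine_similarity(h1, h2))  # no shared word: the vectors are orthogonal
print(wmd(h1, h2, D))             # small: each word moves to a close neighbour
```

# On the toy pair the Cosine similarity vanishes although the sentences are related, while the WMD stays small because each word only has to travel to a nearby word. On the real data, `D` would be the matrix computed above with `ot.dist(all_feat)` and the count vectors would come from the vectorizer.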
# Visualize the golden (human-annotated) scores against the similarities / distances computed on the learning set. This gives a first impression of how much better WMD captures relatedness.

# In[12]:


pl.figure()
pl.scatter(X_cos,Y)
pl.title('Cosine Similarity VS golden score')
pl.show()
pl.figure()
pl.scatter(X_wmd,Y)
pl.title('WMD Similarity VS golden score')
pl.show()


# You can learn a simple regression model between those 2 quantities. Use a polynomial of degree 2 to learn the regression model.

# In[13]:


import numpy.polynomial.polynomial as poly
k_cos = # TO BE FILLED
k_wmd = # TO BE FILLED


# Now compute from your regression model the estimated relatedness for all the pairs in the test set.

# In[14]:


X_cos=[]
X_wmd=[]
Y_test=[]
for i in range(len(testA)):
    s1 = vect.transform([testA[i]]).toarray().ravel()
    s2 = vect.transform([testB[i]]).toarray().ravel()
    # cosine similarity between bag of words
    d_cos=# TO BE FILLED
    X_cos.append(d_cos)
    # WMD
    d_wmd=# TO BE FILLED
    X_wmd.append(d_wmd)
    Y_test.append(scores_test[i])

# Final regression scores
Y_cos = # TO BE FILLED
Y_wmd = # TO BE FILLED


# We will use MSE, Spearman's rho and Pearson coefficients to measure the quality of our regression model

# In[15]:


from sklearn.metrics import mean_squared_error as mse
from scipy.stats import pearsonr
from scipy.stats import spearmanr


# Estimate the quality of your regression model for both Cosine and WMD dissimilarities

# In[16]:


print('-------- Cosine')

pr = pearsonr(Y_cos, Y_test)[0]
sr = spearmanr(Y_cos,Y_test)[0]
se = mse(Y_cos,Y_test)

print('Test Pearson (test): ' + str(pr))
print('Test Spearman (test): ' + str(sr))
print('Test MSE (test): ' + str(se))

print('-------- WMD')

pr = pearsonr(Y_wmd, Y_test)[0]
sr = spearmanr(Y_wmd,Y_test)[0]
se = mse(Y_wmd,Y_test)

print('Test Pearson (test): ' + str(pr))
print('Test Spearman (test): ' + str(sr))
print('Test MSE (test): ' + str(se))


# Not bad, is it?

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Rémi Flamary

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Optimal Transport and Machine Learning Course 2022

Courses and practical sessions for the Optimal Transport and Machine learning
course.



### Course


The slides from the course can be downloaded here:

* [Part 1](slides/Part1_intro_OT_2022.pdf) Intro to numerical Optimal Transport
  (N. Courty)
* [Part 2](slides/Part2_UOT_GW_Rennes_2022.pdf) Unbalanced OT and OT across
  spaces (L. Chapel)
* [Part 3](slides/Part3_OTML_Rennes_2022.pdf) Optimal Transport for Machine
  Learning (R. Flamary)

### Practical Sessions

You can download the introductory slides to the practical session [here](https://remi.flamary.com/cours/otml/OTML_TPDS3_2018.pdf).


#### Install Python and POT Toolbox

In order to do the practical sessions you need to have a working Python installation.
The simplest way on any OS is to install the [Anaconda](https://www.anaconda.com/download/) distribution that can be freely downloaded from [here](https://www.anaconda.com/download/).

When Anaconda is installed, the simplest way to install POT is to launch the Anaconda terminal and execute:

```
conda install -c conda-forge pot
```

which will install the POT OT Toolbox automatically. Note that on Windows you need to launch the Anaconda terminal in administrator mode to install with conda.



#### Download the Notebooks for the session

You can download all the necessary files here: [OTML_DS3_2018.zip](https://github.com/rflamary/OTML_DS3_2018/archive/master.zip)

The zip file contains the following sessions:

0. [Introduction to OT with POT](0_Intro_OT.ipynb)
1. 
[Domain adaptation on digits with OT](1_DomainAdaptation.ipynb)
2. [Color Grading with OT](2_ColorGrading.ipynb)
3. [Word Mover's Distance on text](3_WMD.ipynb)

You can choose to do the practical sessions using either the included notebooks or the Python scripts. We recommend the notebooks for beginners.

The solutions for the practical sessions can be obtained at the following URL:

```
https://remi.flamary.com/cours/otml/solution_[NUMBER].zip
```

where [NUMBER] has to be replaced by the integer part of the value of the
Wasserstein distance obtained in Practical [Session 0](0_Intro_OT.ipynb) using
the Manhattan/Cityblock ground metric (without normalization of the marginals).

--------------------------------------------------------------------------------
/data/data_text.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/data/data_text.npz
--------------------------------------------------------------------------------
/data/klimt.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/data/klimt.jpg
--------------------------------------------------------------------------------
/data/manhattan.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/data/manhattan.npz
--------------------------------------------------------------------------------
/data/mnist_usps.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/data/mnist_usps.npz
--------------------------------------------------------------------------------
/data/model.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/data/model.npz
--------------------------------------------------------------------------------
/data/schiele.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/data/schiele.jpg
--------------------------------------------------------------------------------
/slides/Part1_intro_OT_2022.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/slides/Part1_intro_OT_2022.pdf
--------------------------------------------------------------------------------
/slides/Part2_UOT_GW_Rennes_2022.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/slides/Part2_UOT_GW_Rennes_2022.pdf
--------------------------------------------------------------------------------
/slides/Part3_OTML_Rennes_2022.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PythonOT/OTML_course_2022/4f1189e73c61937dbdcfc3a458cf5463f639d337/slides/Part3_OTML_Rennes_2022.pdf
--------------------------------------------------------------------------------