├── .gitignore ├── LICENSE ├── README.md ├── data └── kitti_features.zip └── ilp ├── __init__.py ├── algo ├── __init__.py ├── base_sl_graph.py ├── datastore.py ├── incremental_label_prop.py ├── knn_graph_utils.py ├── knn_sl_graph.py └── knn_sl_subgraph.py ├── constants.py ├── experiments ├── __init__.py ├── base.py ├── cfg │ ├── default.yml │ ├── var_k_L.yml │ ├── var_k_U.yml │ ├── var_n_L.yml │ ├── var_stream_labeled.yml │ └── var_theta.yml ├── default_run.py ├── var_n_labeled.py ├── var_n_neighbors_labeled.py ├── var_n_neighbors_unlabeled.py ├── var_stream_labeled.py └── var_theta.py ├── helpers ├── __init__.py ├── data_fetcher.py ├── data_flow.py ├── datasets.yml ├── fc_heap.py ├── log.py ├── params_parse.py └── stats.py └── plots ├── __init__.py └── plot_stats.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled source # 2 | ################### 3 | *.com 4 | *.class 5 | *.dll 6 | *.exe 7 | *.o 8 | *.so 9 | 10 | # Packages # 11 | ############ 12 | # it's better to unpack these files and commit the raw source 13 | # git has its own built in compression methods 14 | *.7z 15 | *.dmg 16 | *.gz 17 | *.iso 18 | *.jar 19 | *.rar 20 | *.tar 21 | *.zip 22 | 23 | # Logs and databases # 24 | ###################### 25 | *.log 26 | *.sql 27 | *.sqlite 28 | 29 | # OS generated files # 30 | ###################### 31 | .DS_Store 32 | .DS_Store? 33 | ._* 34 | .Spotlight-V100 35 | .Trashes 36 | ehthumbs.db 37 | Thumbs.db 38 | 39 | # Latex generated files # 40 | ######################### 41 | *.log 42 | *.aux 43 | *.blg 44 | *.out 45 | *.gz 46 | *.bbl 47 | 48 | # IDE # 49 | ####### 50 | .idea/ 51 | __pycache__/ 52 | build/ 53 | cmake-build-debug/ 54 | dist/ 55 | _build/ 56 | _generate/ 57 | *.so 58 | *.py[cod] 59 | *.egg-info 60 | 61 | 62 | # Project Specific # 63 | #################### 64 | doc/ 65 | *.mat 66 | *.p 67 | *.stat 68 | *.hist 69 | *.png 70 | */experiments/results/ 71 | data/kitti_features/ 72 | data/mnist/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 John Chiotellis 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Incremental Label Propagation 2 | This repository provides the implementation of our paper ["Incremental Semi-Supervised Learning from Streams for Object Classification"](https://vision.in.tum.de/_media/spezial/bib/chiotellis2018ilp.pdf) (Ioannis Chiotellis*, Franziska Zimmermann*, Daniel Cremers and Rudolph Triebel, IROS 2018). All results presented in our work were produced with this code. 3 | 4 | * [Installation](#installation) 5 | * [Datasets](#datasets) 6 | * [Experiments](#experiments) 7 | * [Publication](#publication) 8 | * [License and Contact](#license-and-contact) 9 | 10 | 11 | ## Installation 12 | The code was developed in Python 3.5 under Ubuntu 16.04. You can clone the repository with: 13 | ``` 14 | git clone https://github.com/johny-c/incremental-label-propagation.git 15 | ``` 16 | 17 | ## Datasets 18 | * KITTI 19 | 20 | The repository includes 64-dimensional features extracted from KITTI sequences, compressed in a zip file (data/kitti_features.zip). The included files will be extracted automatically if one of the included experiments is run on KITTI. 21 | 22 | * MNIST 23 | 24 | A script will automatically download the MNIST dataset if an experiment is run on it. 25 | 26 | 27 | ## Experiments 28 | 29 | The repository includes scripts that replicate the experiments found in the paper, including: 30 | 31 | * Varying the number of labeled points or the ratio of labeled points in the data. 32 | * Varying the number of labeled or unlabeled neighbors considered for each node. 33 | * Varying the hyperparameter $\theta$ that controls the propagation area size. 34 | 35 | To run an experiment with varying $\theta$: 36 | 37 | python ilp/experiments/var_theta.py -d mnist 38 | 39 | You can set different experiment options in the .yml files found in the ilp/experiments/cfg directory. 40 | 41 | #### WARNING: 42 | The included experiment scripts compute and store statistics after every new data point, therefore the resulting output files are very large. 43 | 44 | 45 | ## Publication 46 | If you use this code in your work, please cite the following paper. 47 | 48 | Ioannis Chiotellis*, Franziska Zimmermann*, Daniel Cremers and Rudolph Triebel, _"Incremental Semi-Supervised Learning from Streams for Object Classification"_, in Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018). ([pdf](https://vision.in.tum.de/_media/spezial/bib/chiotellis2018ilp.pdf)) 49 | 50 | *equal contribution 51 | 52 | @InProceedings{chiotellis2018incremental, 53 | author = "I. Chiotellis and F. Zimmermann and D. Cremers and R. Triebel", 54 | title = "Incremental Semi-Supervised Learning from Streams for Object Classification", 55 | booktitle = "IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)", 56 | year = "2018", 57 | month = "October", 58 | keywords={stream-based learning, sequential data, semi-supervised learning, object classification}, 59 | note = {{[code]} }, 60 | } 61 | 62 | ## License and Contact 63 | 64 | This work is released under the [MIT License](LICENSE). 65 | 66 | Contact **John Chiotellis** [:envelope:](mailto:chiotell@in.tum.de) for questions, comments and reporting bugs.
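
## Minimal usage example

For orientation, here is a minimal sketch of how the core classes (`SemiLabeledDataStore` from `ilp/algo/datastore.py` and `IncrementalLabelPropagation` from `ilp/algo/incremental_label_prop.py`) are wired together. The toy blob data, the labeling ratios and the parameter values are illustrative assumptions that only loosely mirror `ilp/experiments/cfg/default.yml`, and the snippet assumes the dependencies the code was written against (e.g., a scikit-learn version that still ships `sklearn.externals.six`).

```
# Minimal sketch: stream toy blobs through the incremental learner.
# The data, labeling ratios and parameter values below are illustrative
# assumptions for this example, not values prescribed by the paper.
import numpy as np
from sklearn.datasets import make_blobs

from ilp.algo.datastore import SemiLabeledDataStore
from ilp.algo.incremental_label_prop import IncrementalLabelPropagation

X, y = make_blobs(n_samples=500, centers=3, random_state=0)
rng = np.random.RandomState(0)
is_labeled = rng.rand(len(y)) < 0.1        # ~10% of the stream arrives labeled
is_labeled[:100] = rng.rand(100) < 0.5     # denser labels during the burn-in phase

datastore = SemiLabeledDataStore(max_samples=len(y),
                                 max_labeled=int(is_labeled.sum()),
                                 classes=np.unique(y))

learner = IncrementalLabelPropagation(
    datastore=datastore,
    params_graph={'kernel': 'knn',
                  'n_neighbors_labeled': 3,
                  'n_neighbors_unlabeled': 3},
    params_offline_lp={'max_iter': 30, 'tol': 1e-3},
    n_burn_in=100,   # offline LP runs once this many points have been observed
    theta=0.1,       # threshold controlling how far label updates propagate
    iprint=100)

# Feed the stream one point at a time; unlabeled points are passed with label -1
for x_i, y_i, lab_i in zip(X, y, is_labeled):
    learner.fit_incremental(x_i, y_i if lab_i else -1)

# Predict labels for unseen points without inserting them into the graph
print(learner.predict(X[:5]))
```

The experiment scripts under `ilp/experiments` wrap this same loop and additionally record statistics through a `StatisticsWorker`.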
67 | -------------------------------------------------------------------------------- /data/kitti_features.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/data/kitti_features.zip -------------------------------------------------------------------------------- /ilp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/__init__.py -------------------------------------------------------------------------------- /ilp/algo/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/algo/__init__.py -------------------------------------------------------------------------------- /ilp/algo/base_sl_graph.py: -------------------------------------------------------------------------------- 1 | from sklearn.externals import six 2 | from abc import ABCMeta, abstractmethod 3 | 4 | 5 | class BaseSemiLabeledGraph(six.with_metaclass(ABCMeta)): 6 | """ 7 | Parameters 8 | ---------- 9 | 10 | datastore : algo.datastore.SemoLabeledDatastore 11 | A datastore to store observations as they arrive 12 | 13 | max_samples : int, optional (default=1000) 14 | The maximum number of points expected to be observed. Useful for 15 | memory allocation. 16 | 17 | max_labeled : {float, int}, optional 18 | Maximum expected labeled points ratio, or number of labeled points 19 | 20 | dtype : dtype, optional (default=np.float32) 21 | Precision in floats, (can also be float16, float64) 22 | 23 | """ 24 | 25 | def __init__(self, datastore): 26 | 27 | self.datastore = datastore 28 | self.max_samples = datastore.max_samples 29 | self.max_labeled = datastore.max_labeled 30 | self.dtype = datastore.dtype 31 | self.eps = datastore.eps 32 | 33 | self.n_labeled = 0 34 | self.n_unlabeled = 0 35 | 36 | 37 | @abstractmethod 38 | def build(self, X_l, X_u): 39 | raise NotImplementedError('build is not implemented!') 40 | 41 | @abstractmethod 42 | def add_node(self, x, ind, labeled): 43 | raise NotImplementedError('add_node is not implemented!') 44 | 45 | @abstractmethod 46 | def add_labeled_node(self, x, ind): 47 | raise NotImplementedError('add_labeled_node is not implemented!') 48 | 49 | @abstractmethod 50 | def add_unlabeled_node(self, x, ind): 51 | raise NotImplementedError('add_unlabeled_node is not implemented!') 52 | 53 | def get_n_nodes(self): 54 | return self.n_labeled + self.n_unlabeled 55 | -------------------------------------------------------------------------------- /ilp/algo/datastore.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.preprocessing import LabelBinarizer, label_binarize 3 | 4 | from ilp.constants import EPS_32, EPS_64 5 | 6 | 7 | class SemiLabeledDataStore: 8 | 9 | def __init__(self, max_samples, max_labeled, classes, precision='float32'): 10 | 11 | self.max_samples = max_samples 12 | self.max_labeled = max_labeled 13 | self.classes = classes 14 | 15 | self.precision = precision 16 | self.dtype = np.dtype(precision) 17 | self.eps = EPS_32 if self.dtype == np.dtype('float32') else EPS_64 18 | self.n_labeled = 0 19 | self.n_unlabeled = 0 20 | 21 | self.X_labeled = np.array([]) 22 | 
self.X_unlabeled = np.array([]) 23 | self.y_labeled = np.array([]) 24 | 25 | def _allocate(self, n_features): 26 | 27 | # Allocate memory for data matrices 28 | self.X_labeled = np.zeros((self.max_labeled, n_features), 29 | dtype=self.dtype) 30 | self.X_unlabeled = np.zeros((self.max_samples, n_features), 31 | dtype=self.dtype) 32 | 33 | # Allocate memory for label matrix 34 | self.y_labeled = np.zeros((self.max_labeled, len(self.classes)), 35 | dtype=self.dtype) 36 | 37 | def append(self, x, y): 38 | 39 | if self.get_n_samples() == 0: 40 | self._allocate(len(x)) 41 | 42 | if y == -1: 43 | ind_new = self.n_unlabeled 44 | self.X_unlabeled[ind_new] = x 45 | self.n_unlabeled += 1 46 | else: 47 | ind_new = self.n_labeled 48 | self.X_labeled[ind_new] = x 49 | self.y_labeled[ind_new] = label_binarize([y], self.classes) 50 | self.n_labeled += 1 51 | 52 | return ind_new 53 | 54 | def inverse_transform_labels(self, y_proba): 55 | if not hasattr(self, 'label_binarizer'): 56 | self.label_binarizer = LabelBinarizer() 57 | self.label_binarizer.fit(self.classes) 58 | 59 | return self.label_binarizer.inverse_transform(y_proba) 60 | 61 | def get_n_samples(self): 62 | return self.n_labeled + self.n_unlabeled 63 | 64 | def get_X_l(self): 65 | return self.X_labeled[:self.n_labeled] 66 | 67 | def get_X_u(self): 68 | return self.X_unlabeled[:self.n_unlabeled] 69 | 70 | def get_y_l(self): 71 | return self.y_labeled[:self.n_labeled] 72 | 73 | def get_y_l_int(self): 74 | return np.argmax(self.get_y_l(), axis=1) -------------------------------------------------------------------------------- /ilp/algo/incremental_label_prop.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from time import time 3 | from sklearn.utils.extmath import safe_sparse_dot as ssdot 4 | from sklearn.utils.validation import check_random_state 5 | from sklearn.preprocessing import normalize 6 | 7 | from ilp.algo.knn_graph_utils import construct_weight_mat 8 | from ilp.algo.knn_sl_graph import KnnSemiLabeledGraph 9 | from ilp.helpers.stats import JobType 10 | from ilp.helpers.log import make_logger 11 | 12 | 13 | logger = make_logger(__name__) 14 | 15 | 16 | class IncrementalLabelPropagation: 17 | """ 18 | Parameters 19 | ---------- 20 | 21 | datastore : algo.datastore.SemiLabeledDataStore 22 | A datastore instance to store observations as they arrive 23 | 24 | stats_worker : helper.stats.StatisticsWorker, optional 25 | A statistics computing unit (run in a separate thread) 26 | 27 | theta : float, optional 28 | The threshold of significance for a label update (defines the online nature of the algorithm). (default: 0.1) 29 | 30 | max_iter : int, optional 31 | The maximum number of iterations per point insertion. (default: 30) 32 | 33 | tol : float, optional 34 | The tolerance of absolute difference between two label distributions 35 | that indicates convergence. (default: 1e-3) 36 | 37 | params_graph : dict 38 | Parameters for the graph construction. Will be passed to the Graph 39 | constructor. 40 | 41 | params_offline_lp : dict 42 | Parameters for the offline label propagation that will be performed 43 | once `n_burn_in` number of samples have been observed. 44 | 45 | random_state : int seed, RandomState instance, or None (default) 46 | The seed of the pseudo random number generator to use when 47 | shuffling the data. 48 | 49 | iprint : integer, optional 50 | The n_labeled_freq of printing progress messages to stdout. 
51 | 52 | n_jobs : integer, optional 53 | The number of CPUs to use to do the OVA (One Versus All, for 54 | multi-class problems) computation. -1 means 'all CPUs'. Defaults to 1. 55 | 56 | 57 | Attributes 58 | ---------- 59 | 60 | classes : tuple, shape = (n_classes,) 61 | The classes expected to appear in the stream. 62 | 63 | """ 64 | 65 | def __init__(self, datastore=None, stats_worker=None, 66 | params_graph=None, params_offline_lp=None, n_burn_in=0, 67 | theta=0.1, max_iter=30, tol=1e-3, 68 | random_state=42, n_jobs=-1, iprint=0): 69 | 70 | self.datastore = datastore 71 | self.max_iter = max_iter 72 | self.tol = tol 73 | self.theta = theta 74 | self.params_graph = params_graph 75 | self.params_offline_lp = params_offline_lp 76 | self.n_burn_in = n_burn_in 77 | self.random_state = random_state 78 | 79 | kernel = params_graph['kernel'] 80 | if kernel == 'knn': 81 | Graph = KnnSemiLabeledGraph 82 | # elif kernel == 'm-knn': 83 | # Graph = MutualKnnSemiLabeledGraph 84 | else: 85 | raise NotImplementedError('Only knn graphs.') 86 | 87 | self.graph = Graph(datastore, **params_graph) 88 | 89 | self.n_jobs = n_jobs 90 | self.stats_worker = stats_worker 91 | self.iprint = iprint 92 | 93 | def fit_burn_in(self): 94 | """Fit a semi-supervised label propagation model 95 | 96 | A bootstrap data set is provided in matrix X (labeled and unlabeled) 97 | and corresponding label matrix y with a dedicated value (-1) for 98 | unlabeled samples. 99 | 100 | Parameters 101 | ---------- 102 | X : array-like, shape (n_samples, n_features) 103 | A {n_samples by n_samples} size matrix will be created from this 104 | 105 | Returns 106 | ------- 107 | self : returns an instance of self. 108 | """ 109 | 110 | self.random_state_ = check_random_state(self.random_state) 111 | 112 | # Get appearing classes 113 | classes = self.datastore.classes 114 | 115 | self.n_iter_online = 0 116 | self.n_burn_in_ = self.datastore.get_n_samples() 117 | logger.debug('FITTING BURN-IN WITH {} SAMPLES'.format(self.n_burn_in_)) 118 | 119 | # actual graph construction (implementations should override this) 120 | logger.debug('Building graph....') 121 | self.graph.build(self.get_X_l(), self.get_X_u()) 122 | logger.debug('Graph built.') 123 | 124 | u_max = self.datastore.max_samples 125 | # Initialize F_U with uniform label vectors 126 | self.y_unlabeled = np.full((u_max, len(classes)), 1 / len(classes), 127 | dtype=self.datastore.dtype) 128 | 129 | # Initialize F_U with zero label vectors 130 | # self.y_unlabeled = np.zeros((u_max, len(classes)), 131 | # dtype=self.datastore.dtype) 132 | 133 | # Offline label propagation on burn-in data set 134 | self._offline_lp(**self.params_offline_lp) 135 | 136 | # Normalize F_U as it might have numerically diverged from [0, 1] 137 | normalize(self.y_unlabeled, norm='l1', axis=1, copy=False) 138 | 139 | return self 140 | 141 | def predict(self, X, mode=None): 142 | """Predict the labels for a batch of data points, without actually 143 | inserting them in the graph. 144 | 145 | Parameters 146 | ---------- 147 | X : array, shape (n_samples_batch, n_features) 148 | A batch of data points. 149 | 150 | mode : string 151 | Test with nearest labeled neighbors ('knn'), or nearest 152 | unlabeled neighbors ('lp'), their combination (default), 153 | or return both ('pair'). 154 | 155 | Returns 156 | ------- 157 | y : array, shape (n_samples_batch, n_classes) 158 | The predicted label distributions for the given points. 
159 | 160 | """ 161 | 162 | modes = ['knn', 'lp', 'pair'] 163 | if mode is not None and mode not in modes: 164 | raise ValueError('predict_proba can have modes: {}'.format(modes)) 165 | 166 | if X.ndim == 1: 167 | X = X.reshape(1, -1) 168 | 169 | if mode == 'pair': 170 | y_proba_knn, y_proba_lp = self.predict_proba(X, 'pair') 171 | y_pred_knn = self.datastore.inverse_transform_labels(y_proba_knn) 172 | y_pred_lp = self.datastore.inverse_transform_labels(y_proba_lp) 173 | return y_pred_knn, y_pred_lp 174 | 175 | y_proba = self.predict_proba(X, mode) 176 | y_pred = self.datastore.inverse_transform_labels(y_proba) 177 | 178 | return y_pred 179 | 180 | def predict_proba(self, X, mode=None): 181 | 182 | modes = ['knn', 'lp', 'pair'] 183 | if mode is not None and mode not in modes: 184 | raise ValueError('predict_proba can have modes: {}'.format(modes)) 185 | 186 | u, l = self.graph.n_unlabeled, self.graph.n_labeled 187 | 188 | logger.info('Now testing on {} samples...'.format(len(X))) 189 | neighbors, distances = self.graph.find_labeled_neighbors(X) 190 | affinity_mat = construct_weight_mat(neighbors, distances, 191 | (X.shape[0], l), self.graph.dtype) 192 | p_tl = normalize(affinity_mat.tocsr(), norm='l1', axis=1) 193 | y_from_labeled = ssdot(p_tl, self.datastore.y_labeled[:l], True) 194 | 195 | neighbors, distances = self.graph.find_unlabeled_neighbors(X) 196 | affinity_mat = construct_weight_mat(neighbors, distances, 197 | (X.shape[0], u), self.graph.dtype) 198 | p_tu = normalize(affinity_mat.tocsr(), norm='l1', axis=1) 199 | y_from_unlabeled = ssdot(p_tu, self.y_unlabeled[:u], True) 200 | 201 | y_pred_proba = y_from_labeled + y_from_unlabeled 202 | logger.info('Labels have been predicted.') 203 | 204 | if mode is None: 205 | return y_pred_proba 206 | elif mode == 'knn': 207 | return y_from_labeled 208 | elif mode == 'lp': 209 | return y_from_unlabeled 210 | elif mode == 'pair': 211 | return y_from_labeled, y_pred_proba 212 | 213 | def get_X_l(self): 214 | return self.datastore.get_X_l() 215 | 216 | def get_X_u(self): 217 | return self.datastore.get_X_u() 218 | 219 | def set_params(self, L): 220 | self.graph.reset_metric(L) 221 | tic = time() 222 | self.graph.build(self.get_X_l(), self.get_X_u()) 223 | toc = time() 224 | logger.info('Reconstructed graph in {:.4f}s\n'.format(toc-tic)) 225 | 226 | def _offline_lp(self, return_iter=False, max_iter=30, tol=0.001): 227 | """Perform the offline label propagation until convergence of the label 228 | estimates of the unlabeled points. 229 | 230 | Parameters 231 | ---------- 232 | return_iter : bool, default=False 233 | Whether or not to return the number of iterations till convergence 234 | of the label estimates. 
235 | 236 | Returns 237 | ------- 238 | y_unlabeled, num_iter : 239 | the new label estimates and optionally the number of iterations 240 | """ 241 | 242 | logger.debug('Doing Offline LP...') 243 | 244 | u, l = self.graph.n_unlabeled, self.graph.n_labeled 245 | 246 | p_ul = self.graph.subgraph_ul.transition_matrix[:u] 247 | p_uu = self.graph.subgraph_uu.transition_matrix[:u, :u] 248 | y_unlabeled = self.y_unlabeled[:u] 249 | y_labeled = self.datastore.y_labeled 250 | 251 | # First iteration 252 | y_static = ssdot(p_ul, y_labeled, dense_output=True) 253 | 254 | # Continue loop 255 | n_iter = 0 256 | converged = False 257 | while n_iter < max_iter and not converged: 258 | y_unlabeled_prev = y_unlabeled.copy() 259 | y_unlabeled = y_static + ssdot(p_uu, y_unlabeled, True) 260 | n_iter += 1 261 | 262 | converged = _converged(y_unlabeled, y_unlabeled_prev, tol) 263 | 264 | logger.info('Offline LP took {} iterations'.format(n_iter)) 265 | 266 | if return_iter: 267 | return y_unlabeled, n_iter 268 | else: 269 | return y_unlabeled 270 | 271 | def fit_incremental(self, x_new, y_new): 272 | 273 | n_samples = self.datastore.get_n_samples() 274 | if n_samples == 0 and self.stats_worker is not None: 275 | logger.info('\n\nStarting the Statistics Worker\n\n') 276 | self.stats_worker.start() 277 | 278 | if n_samples < self.n_burn_in: 279 | logger.debug('Still in burn-in phase... observed {:>4} ' 280 | 'points'.format(n_samples)) 281 | self.datastore.append(x_new, y_new) 282 | if n_samples == self.n_burn_in - 1: 283 | logger.debug('Burn-in complete!') 284 | self.fit_burn_in() 285 | else: 286 | ind_new = self.datastore.append(x_new, y_new) 287 | self._fit_incremental(x_new, y_new, ind_new) 288 | 289 | def _fit_incremental(self, x_new, y_new, ind_new): 290 | """Fit a single new point 291 | 292 | Args: 293 | x_new : array_like, shape (1, n_features) 294 | A new data point. 295 | 296 | y_new : int 297 | Label of the new data point (-1 if point is unlabeled). 298 | 299 | ind_new : int 300 | Index of the new point in the data store. 301 | 302 | Returns: 303 | IncrementalLabelPropagation: a reference to self. 304 | """ 305 | 306 | tic = time() 307 | 308 | if self.n_iter_online == 0: 309 | self.tic_iprint = time() 310 | 311 | labeled = y_new != -1 312 | 313 | self.graph.add_node(x_new, ind_new, labeled) 314 | 315 | _, n_in_iter = self._propagate_single(ind_new, y_new, return_iter=True) 316 | self.n_iter_online += 1 317 | 318 | # Update statistics 319 | dt = time() - tic 320 | self.log_stats(JobType.ONLINE_ITER, dt=dt, n_in_iter=n_in_iter) 321 | 322 | if not labeled: 323 | # Track prediction entropy and accuracy 324 | label_vec = self.y_unlabeled[ind_new][None, :] 325 | pred = self.datastore.inverse_transform_labels(label_vec) 326 | self.log_stats(JobType.POINT_PREDICTION, vec=label_vec, y=pred) 327 | 328 | # Print information if needed 329 | if self.n_iter_online % self.iprint == 0: 330 | dt = time() - self.tic_iprint 331 | max_samples = self.datastore.max_samples 332 | n_samples_curr = self.graph.get_n_nodes() 333 | n_samples_prev = n_samples_curr - self.iprint 334 | logger.info('Iterations {} to {}/{} took {:.4f}s'. 
335 | format(n_samples_prev, n_samples_curr, 336 | max_samples, dt)) 337 | self.tic_iprint = time() 338 | self.log_stats(JobType.PRINT_STATS) 339 | 340 | # Normalize y_u as it might have diverged from [0, 1] 341 | u = self.graph.n_unlabeled 342 | normalize(self.y_unlabeled[:u], norm='l1', axis=1, copy=False) 343 | 344 | return self 345 | 346 | def log_stats(self, job_type, **kwargs): 347 | if self.stats_worker is not None: 348 | d = dict(job_type=job_type) 349 | d.update(**kwargs) 350 | self.stats_worker.send(d) 351 | 352 | def _propagate_single(self, ind_new, y_new, return_iter=False): 353 | """Perform label propagation until convergence of the label 354 | estimates of the unlabeled points. Assume the new node has already 355 | been added to the graph, but no label has been estimated. 356 | 357 | Parameters 358 | ---------- 359 | ind_new : int 360 | The index of the new observation determined during graph addition. 361 | 362 | y_new : int 363 | The label of the new observation (-1 if point is unlabeled). 364 | 365 | return_iter : bool, default=False 366 | Whether to return the number of iterations until convergence of 367 | the label estimates. 368 | 369 | Returns 370 | ------- 371 | y_unlabeled, num_iter : returns the new label estimates and optionally 372 | the number of iterations 373 | """ 374 | # The number of labeled and unlabeled nodes now includes the new point 375 | y_u = self.y_unlabeled 376 | y_l = self.datastore.y_labeled 377 | 378 | p_ul = self.graph.subgraph_ul.transition_matrix 379 | p_uu = self.graph.subgraph_uu.transition_matrix 380 | 381 | a_rev_ul = self.graph.subgraph_ul.rev_adj 382 | a_rev_uu = self.graph.subgraph_uu.rev_adj 383 | 384 | if y_new == -1: 385 | # Estimate the label of the new unlabeled point 386 | label_new = ssdot(p_ul[ind_new], y_l, True) \ 387 | + ssdot(p_uu[ind_new], y_u, True) 388 | y_u[ind_new] = label_new 389 | 390 | # The first LP candidates are the unlabeled samples that have 391 | # the new point as a nearest neighbor 392 | candidates = a_rev_uu.get(ind_new, set()) 393 | else: 394 | # The label of the new labeled point is already in the data store 395 | candidates = a_rev_ul.get(ind_new, set()) 396 | 397 | # Initialize a tentative label matrix / hash-map 398 | y_u_tent = {} # y_u[:u].copy() 399 | 400 | # Tentative labels are the label est. 
after the new point insertion 401 | candid1_norms = [] 402 | for ind in candidates: 403 | y_u_tent.setdefault(ind, y_u[ind].copy()) 404 | label = ssdot(p_ul[ind], y_l, True) + ssdot(p_uu[ind], y_u, True) 405 | y_u_tent[ind] = label.ravel() 406 | 407 | n_updates_per_iter = [] 408 | n_iter = 0 409 | k_u = self.graph.n_neighbors_unlabeled 410 | u = max(self.graph.n_unlabeled, 1) 411 | max_iter = int(np.log(u) / np.log(k_u)) if k_u > 1 else self.max_iter 412 | while len(candidates) and n_iter < max_iter: # < self.max_iter: 413 | 414 | # Pick the ones that change significantly and change them 415 | updates, norm = filter_and_update(candidates, y_u_tent, y_u, 416 | self.theta) 417 | n_updates_per_iter.append(len(updates)) 418 | 419 | # Get the next set of candidates (farther from the source) 420 | candidates = get_next_candidates(updates, y_u_tent, y_u, a_rev_uu, 421 | p_uu) 422 | 423 | n_iter += 1 424 | 425 | # Print the total number of updates 426 | n_updates = sum(n_updates_per_iter) 427 | if n_updates: 428 | logger.info('Iter {:6}: {:6} updates in {:2} LP iters, ' 429 | 'max_iter = {:2}' 430 | .format(self.n_iter_online, n_updates, n_iter, max_iter)) 431 | 432 | if return_iter: 433 | return y_u, n_iter 434 | else: 435 | return y_u 436 | 437 | 438 | def get_next_candidates(major_changes, y_u_tent, y_u, a_rev_uu, p_uu): 439 | candidates = set() 440 | for index, label_diff in major_changes: 441 | back_neighbors = a_rev_uu.get(index, set()) 442 | for neigh in back_neighbors: 443 | y_u_tent.setdefault(neigh, y_u[neigh].copy()) 444 | y_u_tent[neigh] += ssdot(p_uu[neigh, index], label_diff, True) 445 | candidates.add(neigh) 446 | return candidates 447 | 448 | 449 | def filter_and_update(candidates, y_u_tent, y_u, theta, top_ratio=None): 450 | 451 | # Store for visualization all norms to see how to tune theta 452 | major_updates = [] 453 | updates_norm = 0. 
454 | candidate_changes = [] 455 | for candidate in candidates: 456 | dy_u = y_u_tent[candidate] - y_u[candidate] 457 | dy_u_norm = np.abs(dy_u).sum() 458 | if top_ratio is not None: 459 | candidate_changes.append((dy_u_norm, dy_u, candidate)) 460 | else: 461 | if dy_u_norm > theta: 462 | # Apply the update 463 | y_u[candidate] = y_u_tent[candidate] 464 | major_updates.append((candidate, dy_u)) 465 | updates_norm += dy_u_norm 466 | 467 | if top_ratio is None: 468 | return major_updates, updates_norm 469 | 470 | # Sort changes by descending norm and select the top k candidates for LP 471 | n_candidates = len(candidates) 472 | candidate_changes.sort(reverse=True, key=lambda x: x[0]) 473 | 474 | # Apply the changes to the top k candidates 475 | top_k = int(top_ratio * n_candidates) 476 | for _, dy_u, candidate in candidate_changes[:top_k]: 477 | # Apply the update 478 | y_u[candidate] = y_u_tent[candidate] 479 | major_updates.append((candidate, dy_u)) 480 | updates_norm += np.abs(dy_u).sum() 481 | 482 | return major_updates, updates_norm 483 | 484 | 485 | def _converged(y_curr, y_prev, tol=0.01): 486 | """basic convergence check""" 487 | return np.abs(y_curr - y_prev).max() < tol 488 | -------------------------------------------------------------------------------- /ilp/algo/knn_graph_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.metrics.pairwise import euclidean_distances 3 | from scipy.sparse import coo_matrix 4 | 5 | 6 | def squared_distances(X1, X2, L=None): 7 | 8 | if L is None: 9 | dist = euclidean_distances(X1, X2, squared=True) 10 | else: 11 | dist = euclidean_distances(X1.dot(L.T), X2.dot(L.T), squared=True) 12 | 13 | return dist 14 | 15 | 16 | def get_nearest(distances, n_neighbors): 17 | 18 | n, m = distances.shape 19 | 20 | neighbors = np.argpartition(distances, n_neighbors - 1, axis=1) 21 | neighbors = neighbors[:, :n_neighbors] 22 | 23 | return neighbors, distances[np.arange(n)[:, None], neighbors] 24 | 25 | 26 | def find_nearest_neighbors(X1, X2, n_neighbors, L=None): 27 | """ 28 | Args: 29 | X1 (array_like): [n_samples, n_features] input data points 30 | X2 (array_like): [m_samples, n_features] reference data points 31 | n_neighbors (int): number of nearest neighbors to find 32 | L (array) : linear transformation for Mahalanobis distance computation 33 | 34 | Returns: 35 | tuple: 36 | (array_like): [n_samples, k_samples] indices of nearest neighbors 37 | (array_like): [n_samples, k_distances] distances to nearest neighbors 38 | 39 | """ 40 | 41 | dist = squared_distances(X1, X2, L) 42 | 43 | if X1 is X2: 44 | np.fill_diagonal(dist, np.inf) 45 | 46 | n, m = X1.shape[0], X2.shape[0] 47 | 48 | neigh_ind = np.argpartition(dist, n_neighbors - 1, axis=1) 49 | neigh_ind = neigh_ind[:, :n_neighbors] 50 | 51 | return neigh_ind, dist[np.arange(n)[:, None], neigh_ind] 52 | 53 | 54 | def construct_weight_mat(neighbors, distances, shape, dtype): 55 | 56 | n, k = neighbors.shape 57 | rows = np.repeat(range(n), k) 58 | cols = neighbors.ravel() 59 | weights = np.exp(-distances.ravel()) 60 | mat = coo_matrix((weights, (rows, cols)), shape, dtype) 61 | 62 | return mat 63 | -------------------------------------------------------------------------------- /ilp/algo/knn_sl_graph.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.sparse import spdiags 3 | 4 | from ilp.algo.knn_graph_utils import squared_distances, find_nearest_neighbors 5 | from 
ilp.algo.knn_sl_subgraph import KnnSubGraph 6 | from ilp.algo.base_sl_graph import BaseSemiLabeledGraph 7 | 8 | N_ITER_ELIMINATE_ZEROS = 20 9 | 10 | 11 | class KnnSemiLabeledGraph(BaseSemiLabeledGraph): 12 | """ 13 | Parameters 14 | ---------- 15 | 16 | n_neighbors_labeled : int, optional (default=1) 17 | The number of labeled neighbors to use if the 'knn' kernel is used. 18 | 19 | n_neighbors_unlabeled : int, optional (default=7) 20 | The number of unlabeled neighbors to use if the 'knn' kernel is used. 21 | 22 | max_samples : int, optional 23 | The maximum number of points expected to be observed. Useful for 24 | memory allocation. (default: 1000) 25 | 26 | max_labeled : {float, int}, optional 27 | Maximum expected labeled points ratio, or number of labeled points 28 | 29 | dtype : dtype, optional 30 | Precision in floats, (default is float32, can also be float16, float64) 31 | 32 | 33 | Attributes 34 | ---------- 35 | 36 | L : array-like, shape (n_features_out, n_features_in) 37 | 38 | weight_matrix_{xy} : {array-like, sparse matrix}, shape = [n_samples, n_samples] 39 | xy can be in {ll, lu, ul, uu} indicating labeled or unlabeled points 40 | 41 | transition_matrix_{xy} : {array-like, sparse matrix}, shape = [n_samples, n_samples] 42 | 43 | adj_list_{xy} : dict that contains the graph connectivity -> 44 | keys = nodes indices, values = neighbor nodes indices 45 | 46 | """ 47 | 48 | def __init__(self, datastore, n_neighbors_labeled=1, 49 | n_neighbors_unlabeled=7, **kwargs): 50 | 51 | super(KnnSemiLabeledGraph, self).__init__(datastore=datastore) 52 | self.n_neighbors_labeled = n_neighbors_labeled 53 | self.n_neighbors_unlabeled = n_neighbors_unlabeled 54 | 55 | max_l, max_u = self.max_labeled, self.max_samples 56 | dtype = self.dtype 57 | 58 | self.subgraph_ll = KnnSubGraph(n_neighbors=n_neighbors_labeled, 59 | dtype=dtype, shape=(max_l, max_l)) 60 | 61 | self.subgraph_lu = KnnSubGraph(n_neighbors=n_neighbors_unlabeled, 62 | dtype=dtype, shape=(max_l, max_u)) 63 | 64 | self.subgraph_ul = KnnSubGraph(n_neighbors=n_neighbors_labeled, 65 | dtype=dtype, shape=(max_u, max_l)) 66 | 67 | self.subgraph_uu = KnnSubGraph(n_neighbors=n_neighbors_unlabeled, 68 | dtype=dtype, shape=(max_u, max_u)) 69 | 70 | self.L = None 71 | 72 | def build(self, X_l, X_u): 73 | """ 74 | Build the graph for online label propagation by computing the weighted adjacency submatrices 75 | 76 | Parameters 77 | ---------- 78 | 79 | X_l : array-like, shape [l_samples, n_features], the labeled features 80 | 81 | X_u : array-like, shape [u_samples, n_features], the unlabeled features 82 | 83 | """ 84 | 85 | print('Building graph with {} labeled and {} unlabeled ' 86 | 'samples...'.format(X_l.shape[0], X_u.shape[0])) 87 | 88 | self.subgraph_ll.build(X_l, X_l, self.L) 89 | self.subgraph_lu.build(X_l, X_u, self.L) 90 | self.subgraph_ul.build(X_u, X_l, self.L) 91 | self.subgraph_uu.build(X_u, X_u, self.L) 92 | 93 | self.n_labeled = X_l.shape[0] 94 | self.n_unlabeled = X_u.shape[0] 95 | 96 | print('Computing transitions...') 97 | 98 | self._compute_transitions() 99 | 100 | def find_labeled_neighbors(self, X): 101 | 102 | X_l = self.datastore.X_labeled[:self.n_labeled] 103 | return find_nearest_neighbors(X, X_l, self.n_neighbors_labeled, self.L) 104 | 105 | def find_unlabeled_neighbors(self, X): 106 | 107 | X_u = self.datastore.X_unlabeled[:self.n_unlabeled] 108 | return find_nearest_neighbors(X, X_u, self.n_neighbors_unlabeled, 109 | self.L) 110 | 111 | def add_node(self, x, ind, labeled): 112 | if labeled: 113 | res = self.add_labeled_node(x, ind) 114 |
else: 115 | res = self.add_unlabeled_node(x, ind) 116 | 117 | # Periodically remove explicit zeros from the sparse matrices 118 | if self.get_n_nodes() % N_ITER_ELIMINATE_ZEROS == 0: 119 | 120 | self.subgraph_ll.eliminate_zeros() 121 | self.subgraph_lu.eliminate_zeros() 122 | self.subgraph_ul.eliminate_zeros() 123 | self.subgraph_uu.eliminate_zeros() 124 | 125 | return res 126 | 127 | def add_labeled_node(self, x_new, ind_new): 128 | 129 | # Compute distances to all other labeled nodes 130 | X_l = self.datastore.X_labeled[:self.n_labeled] 131 | distances = squared_distances(x_new.reshape(1, -1), X_l, self.L) 132 | 133 | # Update the labeled-labeled subgraph 134 | self.subgraph_ll.append_row(ind_new, distances) 135 | self.subgraph_ll.update_columns(ind_new, distances) 136 | 137 | # Compute distances to all other unlabeled nodes 138 | X_u = self.datastore.X_unlabeled[:self.n_unlabeled] 139 | distances = squared_distances(x_new.reshape(1, -1), X_u, self.L) 140 | 141 | # Update the labeled-unlabeled subgraph 142 | self.subgraph_lu.append_row(ind_new, distances) 143 | 144 | # Update the unlabeled-labeled subgraph 145 | self.subgraph_ul.update_columns(ind_new, distances) 146 | 147 | self.n_labeled += 1 148 | 149 | # Compute normalized weight matrix (matrices) 150 | self._compute_transitions() 151 | 152 | return ind_new 153 | 154 | def add_unlabeled_node(self, x_new, ind_new): 155 | 156 | # Compute distances to all other unlabeled nodes 157 | X_u = self.datastore.X_unlabeled[:self.n_unlabeled] 158 | distances = squared_distances(x_new.reshape(1, -1), X_u, self.L) 159 | 160 | # Update the unlabeled-unlabeled subgraph 161 | self.subgraph_uu.append_row(ind_new, distances) 162 | self.subgraph_uu.update_columns(ind_new, distances) 163 | 164 | # Compute distances to all labeled nodes 165 | X_l = self.datastore.X_labeled[:self.n_labeled] 166 | distances = squared_distances(x_new.reshape(1, -1), X_l, self.L) 167 | 168 | # Update the unlabeled-labeled subgraph 169 | self.subgraph_ul.append_row(ind_new, distances) 170 | 171 | # Update the labeled-unlabeled subgraph 172 | self.subgraph_lu.update_columns(ind_new, distances) 173 | 174 | self.n_unlabeled += 1 175 | 176 | # Compute normalized weight matrix (matrices) 177 | self._compute_transitions() 178 | 179 | return ind_new 180 | 181 | def _compute_transitions(self): 182 | """Normalize the weight matrices by dividing with the row sums""" 183 | 184 | self.row_sum_l = self.subgraph_ll.weight_matrix.sum(axis=1) + \ 185 | self.subgraph_lu.weight_matrix.sum(axis=1) 186 | 187 | self.row_sum_u = self.subgraph_ul.weight_matrix.sum(axis=1) + \ 188 | self.subgraph_uu.weight_matrix.sum(axis=1) 189 | 190 | # Avoid division by zero 191 | actual_l = self.row_sum_l[:self.n_labeled] 192 | actual_l[actual_l < self.eps] = 1. 193 | # print('Min value l: ', actual_l.min()) 194 | actual_u = self.row_sum_u[:self.n_unlabeled] 195 | actual_u[actual_u < self.eps] = 1.
196 | # print('Min value u: ', actual_u.min()) 197 | 198 | row_sum_l_inv = 1 / np.asarray(actual_l, dtype=self.dtype) 199 | row_sum_l_inv[row_sum_l_inv == np.inf] = 1 200 | 201 | row_sum_u_inv = 1 / np.asarray(actual_u, dtype=self.dtype) 202 | row_sum_u_inv[row_sum_u_inv == np.inf] = 1 203 | 204 | # Temporary divisors (diagonal pre-multiplier matrices) 205 | diag_l = spdiags(row_sum_l_inv.ravel(), 0, *self.subgraph_ll.shape) 206 | diag_u = spdiags(row_sum_u_inv.ravel(), 0, *self.subgraph_uu.shape) 207 | 208 | self.subgraph_ll.update_transitions(diag_l) 209 | self.subgraph_lu.update_transitions(diag_l) 210 | self.subgraph_ul.update_transitions(diag_u) 211 | self.subgraph_uu.update_transitions(diag_u) 212 | 213 | def reset_metric(self, L): 214 | self.L = L 215 | -------------------------------------------------------------------------------- /ilp/algo/knn_sl_subgraph.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.sparse import csr_matrix, coo_matrix 3 | from sklearn.utils.extmath import safe_sparse_dot as ssdot 4 | 5 | from ilp.helpers.fc_heap import FixedCapacityHeap as FSH 6 | from ilp.algo.knn_graph_utils import find_nearest_neighbors, get_nearest, construct_weight_mat 7 | 8 | 9 | class KnnSubGraph: 10 | def __init__(self, n_neighbors=1, dtype=float, shape=None, **kwargs): 11 | 12 | self.n_neighbors = n_neighbors 13 | self.dtype = dtype 14 | self.shape = shape 15 | 16 | self.weight_matrix = csr_matrix(shape, dtype=dtype) 17 | self.transition_matrix = csr_matrix(shape, dtype=dtype) 18 | self.radii = np.zeros(shape[0], dtype=dtype) 19 | 20 | self.adj = {} 21 | self.rev_adj = {} 22 | 23 | def build(self, X1, X2, L=None): 24 | 25 | neigh_ind, dist = find_nearest_neighbors(X1, X2, self.n_neighbors, L) 26 | weight_matrix = construct_weight_mat(neigh_ind, dist, self.shape, 27 | self.dtype) 28 | self.weight_matrix = weight_matrix.tocsr() 29 | self.radii[:len(dist)] = dist[:, self.n_neighbors-1] 30 | self.adj = self.adj_from_weight_mat(weight_matrix) 31 | self.rev_adj = self.rev_adj_from_weight_mat(weight_matrix) 32 | 33 | def adj_from_weight_mat(self, weight_mat): 34 | """Get the non-zero cols for each row and insert in a FSPQ in the 35 | form (weight, ind) 36 | 37 | Args: 38 | weight_mat (coo_matrix): a weights submatrix 39 | 40 | Returns: 41 | adj_list (dict): a dictionary where keys are indices and values 42 | are FSHs 43 | 44 | """ 45 | 46 | # Create a list of adjacent vertices for each node 47 | print('Creating hashmap for {} nodes...'.format(self.shape[0])) 48 | adj_list = {i: [] for i in range(self.shape[0])} 49 | print('Iterating over weightmat.row, col, data...') 50 | for r, c, w in zip(weight_mat.row, weight_mat.col, weight_mat.data): 51 | adj_list[r].append((w, c)) 52 | 53 | # Convert each list to a FixedCapacityHeap 54 | print('Converting to FCH') 55 | for node, neighbors in adj_list.items(): 56 | adj_list[node] = FSH(neighbors, capacity=self.n_neighbors) 57 | 58 | return adj_list 59 | 60 | def rev_adj_from_weight_mat(self, weight_mat): 61 | # Create a list of adjacent vertices for each node 62 | # adj_list = {i: set() for i in range(self.shape[1])} 63 | adj_list = {} 64 | for r, c, w in zip(weight_mat.row, weight_mat.col, weight_mat.data): 65 | adj_list.setdefault(c, set()).add(r) 66 | # adj_list[c].add(r) 67 | 68 | return adj_list 69 | 70 | def update_transitions(self, normalizer): 71 | self.transition_matrix = ssdot(normalizer, self.weight_matrix) 72 | 73 | def eliminate_zeros(self): 74 | 
self.weight_matrix.eliminate_zeros() 75 | self.transition_matrix.eliminate_zeros() 76 | 77 | def append_row(self, index, distances): 78 | 79 | # Identify the k nearest neighbors 80 | nearest, dist_nearest = get_nearest(distances, self.n_neighbors) 81 | nearest = nearest.ravel() 82 | 83 | # Create the new node's adjacency list 84 | weights = np.exp(-dist_nearest.ravel()) 85 | lst = [(w, i) for w, i in zip(weights, nearest)] 86 | self.adj[index] = FSH(lst, self.n_neighbors) 87 | 88 | # Update the reverse adjacency list 89 | for w, i in zip(weights, nearest): 90 | self.rev_adj.setdefault(i, set()).add(index) 91 | # self.rev_adj[i].add(index) 92 | 93 | # Update the W_LL matrix (append the row vector) 94 | row = [index] * len(weights) 95 | row_new = csr_matrix((weights, (row, nearest)), self.shape, self.dtype) 96 | self.weight_matrix = self.weight_matrix + row_new 97 | 98 | def update_columns(self, ind_new, distances): 99 | """ 100 | 101 | Parameters 102 | ---------- 103 | ind_new : int 104 | Index of the new point. 105 | 106 | distances : array 107 | Array of distances of the new point to the reference points of 108 | the subgraph. 109 | 110 | """ 111 | 112 | distances = distances.ravel() 113 | # Identify the samples that have the new point in their knn radius 114 | back_refs, = np.where(distances < self.radii[:len(distances)]) 115 | back_weights = np.exp(-distances[back_refs]) 116 | 117 | # Update the W_LL matrix (compute the column update) 118 | update_mat = self._update(ind_new, back_refs, back_weights) 119 | self.weight_matrix = self.weight_matrix + update_mat.tocsr() 120 | 121 | def _update(self, ind_new, back_refs, weights_new, eps=1e-12): 122 | row, col, val = [], [], [] 123 | # row_del, col_del, val_del = [], [], [] 124 | for neigh_new, weight_new in zip(back_refs, weights_new): 125 | neighbors_heap = self.adj[neigh_new] # FSH with (weight, ind) 126 | inserted, removed = neighbors_heap.push((weight_new, ind_new)) 127 | if inserted: # neigh got a new nearest neighbor: ind_new 128 | row.append(neigh_new) 129 | col.append(ind_new) 130 | val.append(weight_new) 131 | 132 | # Update the reverse adjacency list 133 | self.rev_adj.setdefault(ind_new, set()).add(neigh_new) 134 | if removed is not None: # old point swapped a nearest neighbor 135 | # row_del.append(neigh) 136 | # col_del.append(removed[1]) 137 | # val_del.append(-removed[0]) 138 | row.append(neigh_new) 139 | col.append(removed[1]) 140 | val.append(-removed[0]) 141 | 142 | # Update the reverse adjacency list 143 | self.rev_adj[removed[1]].discard(neigh_new) 144 | 145 | # Update the radii 146 | min_weight = neighbors_heap.get_min()[0] 147 | self.radii[neigh_new] = -np.log(max(min_weight, eps)) 148 | 149 | update_mat = coo_matrix((val, (row, col)), self.shape, self.dtype) 150 | 151 | return update_mat 152 | -------------------------------------------------------------------------------- /ilp/constants.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | 4 | CWD = os.path.abspath(os.path.split(__file__)[0]) 5 | PROJECT_DIR = os.path.split(CWD)[0] 6 | 7 | SOURCE_DIR = os.path.join(PROJECT_DIR, 'ilp') 8 | DATA_DIR = os.path.join(PROJECT_DIR, 'data') 9 | 10 | STATS_DIR = os.path.join(SOURCE_DIR, 'stats_res') 11 | EXPERIMENTS_DIR = os.path.join(SOURCE_DIR, 'experiments') 12 | CONFIG_DIR = os.path.join(EXPERIMENTS_DIR, 'cfg') 13 | RESULTS_DIR = os.path.join(EXPERIMENTS_DIR, 'results') 14 | # PLOT_DIR = os.path.join(PROJECT_DIR, 'plot') 15 | 16 | 17 | EPS_32 = 
np.spacing(np.float32(0)) 18 | EPS_64 = np.spacing(np.float64(0)) 19 | -------------------------------------------------------------------------------- /ilp/experiments/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/experiments/__init__.py -------------------------------------------------------------------------------- /ilp/experiments/base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import time 4 | from datetime import datetime 5 | from sklearn.externals import six 6 | from sklearn.preprocessing import LabelBinarizer 7 | from sklearn.utils.random import check_random_state 8 | from abc import ABCMeta, abstractmethod 9 | from matplotlib import pyplot as plt 10 | 11 | from ilp.constants import RESULTS_DIR 12 | from ilp.helpers.data_fetcher import check_supported_dataset, fetch_load_data 13 | from ilp.helpers.stats import StatisticsWorker, aggregate_statistics, JobType 14 | from ilp.plots.plot_stats import plot_curves 15 | from ilp.algo.incremental_label_prop import IncrementalLabelPropagation 16 | from ilp.algo.datastore import SemiLabeledDataStore 17 | from ilp.helpers.data_flow import gen_semilabeled_data, split_labels_rest, split_burn_in_rest 18 | from ilp.helpers.params_parse import print_config 19 | from ilp.helpers.log import make_logger 20 | 21 | 22 | logger = make_logger(__name__) 23 | 24 | 25 | class BaseExperiment(six.with_metaclass(ABCMeta)): 26 | 27 | def __init__(self, name, config, plot_title, multi_var, n_runs, isave=100): 28 | self.name = name 29 | self.config = config 30 | self.precision = config.get('options', {}).get('precision', 'float32') 31 | self.isave = isave 32 | self.plot_title = plot_title 33 | self.multi_var = multi_var 34 | self.n_runs = n_runs 35 | self._setup() 36 | 37 | def _setup(self): 38 | self.dataset = self.config['dataset']['name'].lower() 39 | check_supported_dataset(self.dataset) 40 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S') 41 | self.class_dir = os.path.join(RESULTS_DIR, self.name) 42 | instance_dir = self.name + '_' + self.dataset.upper() + '_' + cur_time 43 | self.top_dir = os.path.join(self.class_dir, instance_dir) 44 | 45 | def run(self, dataset_name, random_state=42): 46 | 47 | config = self.config 48 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name) 49 | 50 | for n_run in range(self.n_runs): 51 | seed_run = random_state * n_run 52 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run)) 53 | rng = check_random_state(seed_run) 54 | if config['dataset']['is_stream']: 55 | logger.info('Dataset is a stream. 
Sampling observed labels.') 56 | # Just randomly sample ratio_labeled samples for mask_labeled 57 | n_burn_in = config['data']['n_burn_in_stream'] 58 | ratio_labeled = config['data']['stream']['ratio_labeled'] 59 | n_labeled = int(ratio_labeled*len(y_train)) 60 | ind_labeled = rng.choice(len(y_train), n_labeled, 61 | replace=False) 62 | mask_labeled = np.zeros(len(y_train), dtype=bool) 63 | mask_labeled[ind_labeled] = True 64 | X_run, y_run = X_train, y_train 65 | else: 66 | burn_in_params = config['data']['burn_in'] 67 | ind_burn_in, mask_labeled_burn_in = \ 68 | split_burn_in_rest(y_train, shuffle=True, seed=seed_run, 69 | **burn_in_params) 70 | X_burn_in, y_burn_in = X_train[ind_burn_in], \ 71 | y_train[ind_burn_in] 72 | mask_rest = np.ones(len(X_train), dtype=bool) 73 | mask_rest[ind_burn_in] = False 74 | X_rest, y_rest = X_train[mask_rest], y_train[mask_rest] 75 | stream_params = config['data']['stream'] 76 | mask_labeled_rest = split_labels_rest( 77 | y_rest, seed=seed_run, shuffle=True, **stream_params) 78 | 79 | # Shuffle the rest 80 | indices = np.arange(len(y_rest)) 81 | rng.shuffle(indices) 82 | X_run = np.concatenate((X_burn_in, X_rest[indices])) 83 | y_run = np.concatenate((y_burn_in, y_rest[indices])) 84 | mask_labeled = np.concatenate((mask_labeled_burn_in, 85 | mask_labeled_rest[indices])) 86 | n_burn_in = len(y_burn_in) 87 | 88 | config['data']['n_burn_in'] = n_burn_in 89 | config.setdefault('options', {}) 90 | config['options']['random_state'] = seed_run 91 | 92 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in, 93 | seed_run, X_test, y_test, n_run) 94 | 95 | @abstractmethod 96 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 97 | X_test, y_test, n_run): 98 | raise NotImplementedError('pre_single_run must be overriden!') 99 | 100 | def _single_run(self, X, y, mask_labeled, n_burn_in, stats_path, 101 | random_state, X_test=None, y_test=None): 102 | 103 | lb = LabelBinarizer() 104 | lb.fit(y) 105 | logger.info('\n\nLABELS SEEN BY LABEL BINARIZER: {}'.format(lb.classes_)) 106 | 107 | # Now print configuration for sanity check 108 | self.config.setdefault('dataset', {}) 109 | self.config['dataset']['classes'] = lb.classes_ 110 | print_config(self.config) 111 | 112 | logger.info('Creating stream generator...') 113 | stream_generator = gen_semilabeled_data(X, y, mask_labeled) 114 | 115 | logger.info('Creating one-hot groundtruth...') 116 | y_u_true_int = y[~mask_labeled] 117 | y_u_true = np.asarray(lb.transform(y_u_true_int), dtype=self.precision) 118 | 119 | logger.info('Initializing learner...') 120 | datastore_params = {'precision': self.precision, 121 | 'max_samples': len(y), 122 | 'max_labeled': sum(mask_labeled), 123 | 'classes': lb.classes_} 124 | learner = self.init_learner(stats_path, datastore_params, random_state, 125 | n_burn_in) 126 | 127 | # Iterate through the generated samples and learn 128 | t_total = time.time() 129 | logger.info('Now feeding stream . . .') 130 | for t, x_new, y_new, is_labeled in stream_generator: 131 | 132 | # Pass the new point to the learner 133 | y_observed = y_new if is_labeled else -1 134 | learner.fit_incremental(x_new, y_observed) 135 | 136 | if t > n_burn_in: 137 | # Compute classification error 138 | u = learner.datastore.n_unlabeled 139 | y_u = learner.y_unlabeled[:u] 140 | learner.log_stats(JobType.EVAL, y_est=y_u, y_true=y_u_true[:u]) 141 | 142 | # Compute test error every 1000 samples 143 | if t % 1000 == 0: 144 | if X_test is not None: 145 | logger.info('Now testing . . 
.') 146 | t_test = time.time() 147 | y_pred_knn, y_pred_lp = learner.predict(X_test, mode='pair') 148 | t_test = time.time() - t_test 149 | logger.info('Testing finished in {}s'.format(t_test)) 150 | learner.log_stats(JobType.TEST_PRED, y_pred_knn=y_pred_knn, 151 | y_pred_lp=y_pred_lp, y_true=y_test) 152 | 153 | # Store the true label stream in statistics 154 | learner.log_stats(JobType.LABEL_STREAM, y_true=y, mask_obs=mask_labeled) 155 | 156 | logger.info('Reached end of generated data.') 157 | total_runtime = time.time() - t_total 158 | logger.info('Total time elapsed: {} s'.format(total_runtime)) 159 | learner.log_stats(JobType.RUNTIME, t=total_runtime) 160 | 161 | # Store last predictions in statistics 162 | u = learner.datastore.n_unlabeled 163 | y_u = learner.y_unlabeled[:u] 164 | learner.log_stats(JobType.TRAIN_PRED, y_est=y_u, y_true=y_u_true[:u]) 165 | 166 | if X_test is not None: 167 | logger.info('Now testing . . .') 168 | t_test = time.time() 169 | y_pred_knn, y_pred_lp = learner.predict(X_test, mode='pair') 170 | t_test = time.time() - t_test 171 | logger.info('Testing finished in {}s'.format(t_test)) 172 | learner.log_stats(JobType.TEST_PRED, y_pred_knn=y_pred_knn, 173 | y_pred_lp=y_pred_lp, y_true=y_test) 174 | 175 | if learner.stats_worker is not None: 176 | learner.stats_worker.stop() 177 | 178 | def init_learner(self, stats_path, datastore_params, random_state, n_burn_in): 179 | 180 | config = self.config 181 | ilp_params = dict(params_offline_lp=config['offline_lp'], 182 | params_graph=config['graph'], 183 | **config['online_lp']) 184 | 185 | # Instantiate a worker thread for statistics 186 | stats_worker = StatisticsWorker(config=config, isave=self.isave, path=stats_path) 187 | 188 | # Instantiate a datastore for labeled and unlabeled samples 189 | datastore = SemiLabeledDataStore(**datastore_params) 190 | 191 | # Instantiate the learner 192 | learner = IncrementalLabelPropagation(datastore=datastore, stats_worker=stats_worker, 193 | random_state=random_state, n_burn_in=n_burn_in, **ilp_params) 194 | 195 | return learner 196 | 197 | def load_plot(self, path=None): 198 | if path is None: 199 | path = self.top_dir 200 | elif not os.path.isdir(path): 201 | # Load and plot the latest experiment 202 | logger.info('Experiment Class dir: {}'.format(self.class_dir)) 203 | logger.info('Experiment subdirs: {}'.format(os.listdir(self.class_dir))) 204 | files_in_class = os.listdir(self.class_dir) 205 | a_files = [os.path.join(self.class_dir, d) for d in files_in_class] 206 | list_of_dirs = [d for d in a_files if os.path.isdir(d)] 207 | path = max(list_of_dirs, key=os.path.getctime) 208 | 209 | logger.info('Collecting statistics from {}'.format(path)) 210 | config = None 211 | if self.multi_var: 212 | experiment_stats = [] 213 | for variable_dir in os.listdir(path): 214 | var_value = variable_dir[len(self.name) + 1:] 215 | experiment_dir = os.path.join(path, variable_dir) 216 | stats_mean, stats_std, config = aggregate_statistics(experiment_dir) 217 | experiment_stats.append((self.name, var_value, stats_mean, stats_std)) 218 | else: 219 | stats_mean, stats_std, config = aggregate_statistics(path) 220 | experiment_stats = (stats_mean, stats_std) 221 | 222 | if config is None: 223 | raise KeyError('No configuration found for {}'.format(path)) 224 | 225 | title = self.plot_title 226 | plot_curves(experiment_stats, config, title=title, path=path) 227 | plt.show() 228 | -------------------------------------------------------------------------------- /ilp/experiments/cfg/default.yml: 
-------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.10 7 | batch_size : 100 8 | max_samples : 200000 9 | n_burn_in_stream : 100 10 | 11 | 12 | offline_lp: 13 | tol : 0.001 14 | max_iter : 30 15 | 16 | 17 | online_lp: 18 | tol : 0.001 19 | max_iter : 30 20 | theta : 0.1 21 | iprint : 100 22 | 23 | 24 | graph: 25 | kernel : knn 26 | n_neighbors_labeled : 3 27 | n_neighbors_unlabeled : 3 28 | 29 | 30 | options: 31 | precision : float64 32 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/cfg/var_k_L.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.05 7 | batch_size : 100 8 | max_samples : 1000000 9 | 10 | 11 | offline_lp: 12 | tol : 0.001 13 | max_iter : 30 14 | 15 | 16 | online_lp: 17 | tol : 0.001 18 | max_iter : 30 19 | theta : 0.3 20 | iprint : 100 21 | 22 | 23 | graph: 24 | kernel : knn 25 | n_neighbors_labeled : [1, 3, 7, 11, 15, 19] 26 | n_neighbors_unlabeled : 3 27 | 28 | 29 | options: 30 | precision : float64 31 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/cfg/var_k_U.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.05 7 | batch_size : 100 8 | max_samples : 1000000 9 | 10 | 11 | offline_lp: 12 | tol : 0.001 13 | max_iter : 30 14 | 15 | 16 | online_lp: 17 | tol : 0.001 18 | max_iter : 30 19 | theta : 0.3 20 | iprint : 100 21 | 22 | 23 | graph: 24 | kernel : knn 25 | n_neighbors_labeled : 3 26 | n_neighbors_unlabeled : [1, 3, 7, 11, 15, 19] 27 | 28 | 29 | options: 30 | precision : float64 31 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/cfg/var_n_L.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.05 7 | batch_size : 100 8 | max_samples : 1000000 9 | n_labeled_per_class : [300, 500] 10 | 11 | 12 | offline_lp: 13 | tol : 0.001 14 | max_iter : 30 15 | 16 | 17 | online_lp: 18 | tol : 0.001 19 | max_iter : 30 20 | theta : 0.3 21 | iprint : 100 22 | 23 | 24 | graph: 25 | kernel : knn 26 | n_neighbors_labeled : 3 27 | n_neighbors_unlabeled : 3 28 | 29 | 30 | options: 31 | precision : float64 32 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/cfg/var_stream_labeled.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : [0.05, 0.1, 0.2] 7 | batch_size : 100 8 | max_samples : 1000000 9 | n_burn_in_stream : 100 10 | 11 | 12 | offline_lp: 13 | tol : 0.001 14 | max_iter : 30 15 | 16 | 17 | online_lp: 18 | tol : 0.001 19 | max_iter : 30 20 | theta : 1.0 21 | iprint : 100 22 | 23 | 24 | graph: 25 | kernel : knn 26 | n_neighbors_labeled : 3 27 | n_neighbors_unlabeled : 3 28 | 29 | 30 | options: 31 | precision : float64 32 | iter_stats : 100 -------------------------------------------------------------------------------- 
/ilp/experiments/cfg/var_theta.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.20 7 | batch_size : 100 8 | max_samples : 1000000 9 | n_burn_in_stream : 100 10 | 11 | 12 | offline_lp: 13 | tol : 0.001 14 | max_iter : 30 15 | 16 | 17 | online_lp: 18 | tol : 0.001 19 | max_iter : 30 20 | theta : [0.0, 0.1, 0.5, 1.0, 1.5, 2.0] 21 | iprint : 100 22 | 23 | 24 | graph: 25 | kernel : knn 26 | n_neighbors_labeled : 3 27 | n_neighbors_unlabeled : 3 28 | 29 | 30 | options: 31 | precision : float64 32 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/default_run.py: -------------------------------------------------------------------------------- 1 | import os 2 | from datetime import datetime 3 | 4 | from ilp.experiments.base import BaseExperiment 5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM 6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 7 | from ilp.constants import CONFIG_DIR 8 | 9 | 10 | class DefaultRun(BaseExperiment): 11 | 12 | def __init__(self, params, n_runs=1, isave=100): 13 | super(DefaultRun, self).__init__(name='default_run', config=params, 14 | isave=isave, n_runs=n_runs, 15 | plot_title=r'Default run', 16 | multi_var=False) 17 | 18 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 19 | X_test, y_test, n_run): 20 | 21 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S') 22 | stats_path = os.path.join(self.top_dir, 'run_' + cur_time) 23 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, stats_path, 24 | seed_run, X_test, y_test) 25 | 26 | 27 | if __name__ == '__main__': 28 | 29 | # Parse user input 30 | parser = experiment_arg_parser() 31 | args = vars(parser.parse_args()) 32 | dataset_name = args['dataset'].lower() 33 | config_file = os.path.join(CONFIG_DIR, 'default.yml') 34 | config = parse_yaml(config_file) 35 | 36 | # Store dataset info 37 | config.setdefault('dataset', {}) 38 | config['dataset']['name'] = dataset_name 39 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 40 | 41 | # Instantiate experiment 42 | experiment = DefaultRun(params=config, n_runs=args['n_runs']) 43 | 44 | if args['plot'] != '': 45 | # python3 default_run.py -p latest 46 | experiment.load_plot(path=args['plot']) 47 | else: 48 | # python3 default_run.py -d digits 49 | experiment.run(dataset_name) 50 | -------------------------------------------------------------------------------- /ilp/experiments/var_n_labeled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import numpy as np 4 | from sklearn.utils.random import check_random_state 5 | 6 | from ilp.experiments.base import BaseExperiment 7 | from ilp.helpers.data_fetcher import fetch_load_data, IS_DATASET_STREAM 8 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 9 | from ilp.constants import CONFIG_DIR 10 | from ilp.helpers.data_flow import split_labels_rest, split_burn_in_rest 11 | from ilp.helpers.log import make_logger 12 | 13 | 14 | logger = make_logger(__name__) 15 | 16 | 17 | class VarSamplesLabeled(BaseExperiment): 18 | 19 | def __init__(self, n_labeled_values, params, n_runs=1, isave=100): 20 | super(VarSamplesLabeled, self).__init__(name='n_L', config=params, 21 | isave=isave, n_runs=n_runs, 22 | plot_title=r'Influence of ' 23 | r'number of 
labels', 24 | multi_var=True) 25 | self.n_labeled_values = n_labeled_values 26 | 27 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 28 | X_test, y_test, n_run): 29 | 30 | config = self.config 31 | 32 | n_labels = config['data']['n_labels'] 33 | save_dir = os.path.join(self.top_dir, 'n_L_' + str(n_labels)) 34 | stats_file = os.path.join(save_dir, 'run_' + str(n_run)) 35 | logger.info('\n\nExperiment: {}, n_labels = {}, run {}...\n'. 36 | format(self.name.upper(), n_labels, n_run)) 37 | time.sleep(1) 38 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 39 | stats_file, seed_run, X_test, y_test) 40 | 41 | def run(self, dataset_name, random_state=42): 42 | 43 | config = self.config 44 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name) 45 | 46 | n_classes = len(np.unique(y_train)) 47 | 48 | # if dataset_name == 'usps': 49 | # X_train = np.concatenate((X_train, X_test)) 50 | # y_train = np.concatenate((y_train, y_test)) 51 | 52 | for n_run in range(self.n_runs): 53 | seed_run = random_state * n_run 54 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run)) 55 | rng = check_random_state(seed_run) 56 | if config['dataset']['is_stream']: 57 | logger.info('Dataset is a stream. Sampling observed labels.') 58 | # Just randomly sample ratio_labeled samples for mask_labeled 59 | n_burn_in = config['data']['n_burn_in_stream'] 60 | ratio_labeled = config['data']['stream']['ratio_labeled'] 61 | n_labeled = int(ratio_labeled*len(y_train)) 62 | ind_labeled = rng.choice(len(y_train), n_labeled, 63 | replace=False) 64 | mask_labeled = np.zeros(len(y_train), dtype=bool) 65 | mask_labeled[ind_labeled] = True 66 | X_run, y_run = X_train, y_train 67 | else: 68 | 69 | burn_in_params = config['data']['burn_in'] 70 | ind_burn_in, mask_labeled_burn_in = \ 71 | split_burn_in_rest(y_train, shuffle=True, seed=seed_run, 72 | **burn_in_params) 73 | n_labeled_burn_in = sum(mask_labeled_burn_in) 74 | X_burn_in, y_burn_in = X_train[ind_burn_in], \ 75 | y_train[ind_burn_in] 76 | mask_rest = np.ones(len(X_train), dtype=bool) 77 | mask_rest[ind_burn_in] = False 78 | X_rest, y_rest = X_train[mask_rest], y_train[mask_rest] 79 | 80 | for nlpc in self.n_labeled_values: 81 | n_labels = nlpc*n_classes 82 | config['data']['n_labels'] = n_labels 83 | 84 | rl = (n_labels - n_labeled_burn_in) / len(y_rest) 85 | assert rl >= 0 86 | mask_labeled_rest = split_labels_rest(y_rest, batch_size=0, 87 | seed=seed_run, shuffle=True, ratio_labeled=rl) 88 | 89 | # Shuffle the rest 90 | indices = np.arange(len(y_rest)) 91 | rng.shuffle(indices) 92 | X_run = np.concatenate((X_burn_in, X_rest[indices])) 93 | y_run = np.concatenate((y_burn_in, y_rest[indices])) 94 | mask_labeled = np.concatenate((mask_labeled_burn_in, 95 | mask_labeled_rest[indices])) 96 | n_burn_in = len(y_burn_in) 97 | 98 | config['data']['n_burn_in'] = n_burn_in 99 | config.setdefault('options', {}) 100 | config['options']['random_state'] = seed_run 101 | 102 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in, 103 | seed_run, X_test, y_test, n_run) 104 | 105 | 106 | if __name__ == '__main__': 107 | parser = experiment_arg_parser() 108 | args = vars(parser.parse_args()) 109 | dataset_name = args['dataset'].lower() 110 | config_file = os.path.join(CONFIG_DIR, 'var_n_L.yml') 111 | config = parse_yaml(config_file) 112 | 113 | # Store dataset info 114 | config.setdefault('dataset', {}) 115 | config['dataset']['name'] = dataset_name 116 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 
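    # Note: var_n_L.yml supplies n_labeled_per_class as a list (e.g. [300, 500]).
    # run() above sweeps each value nlpc and sets the total label budget to
    # n_labels = nlpc * n_classes, e.g. 300 labels/class * 10 MNIST classes
    # = 3000 observed labels for that run.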
117 | 118 | N_LABELED_PER_CLASS = config['data']['n_labeled_per_class'].copy() 119 | 120 | experiment = VarSamplesLabeled(N_LABELED_PER_CLASS, params=config, 121 | n_runs=args['n_runs']) 122 | if args['plot'] != '': 123 | experiment.load_plot(path=args['plot']) 124 | else: 125 | experiment.run(dataset_name) 126 | -------------------------------------------------------------------------------- /ilp/experiments/var_n_neighbors_labeled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | 4 | from ilp.experiments.base import BaseExperiment 5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM 6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 7 | from ilp.constants import CONFIG_DIR 8 | from ilp.helpers.log import make_logger 9 | 10 | 11 | logger = make_logger(__name__) 12 | 13 | 14 | class VarNeighborsLabeled(BaseExperiment): 15 | 16 | def __init__(self, n_neighbors_values, params, n_runs, isave=100): 17 | super(VarNeighborsLabeled, self).__init__(name='k_L', config=params, 18 | isave=isave, n_runs=n_runs, 19 | plot_title=r'Influence of ' 20 | r'$k_l$', 21 | multi_var=True) 22 | self.n_neighbors_values = n_neighbors_values 23 | 24 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 25 | X_test, y_test, n_run): 26 | 27 | params = self.config 28 | 29 | for n_neighbors in self.n_neighbors_values: 30 | params['graph']['n_neighbors_labeled'] = n_neighbors 31 | save_dir = os.path.join(self.top_dir, 'k_L_' + str(n_neighbors)) 32 | stats_file = os.path.join(save_dir, 'run_' + str(n_run)) 33 | logger.info('\n\nExperiment: {}, k_L = {}, run {}...\n'. 34 | format(self.name.upper(), n_neighbors, n_run)) 35 | time.sleep(1) 36 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 37 | stats_file, seed_run, X_test, y_test) 38 | 39 | 40 | if __name__ == '__main__': 41 | 42 | parser = experiment_arg_parser() 43 | args = vars(parser.parse_args()) 44 | dataset_name = args['dataset'].lower() 45 | config_file = os.path.join(CONFIG_DIR, 'var_k_L.yml') 46 | config = parse_yaml(config_file) 47 | 48 | # Store dataset info 49 | config.setdefault('dataset', {}) 50 | config['dataset']['name'] = dataset_name 51 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 52 | 53 | N_NEIGHBORS_VALUES = config['graph']['n_neighbors_labeled'].copy() 54 | 55 | experiment = VarNeighborsLabeled(N_NEIGHBORS_VALUES, params=config, 56 | n_runs=args['n_runs']) 57 | if args['plot'] != '': 58 | experiment.load_plot(path=args['plot']) 59 | else: 60 | experiment.run(dataset_name) -------------------------------------------------------------------------------- /ilp/experiments/var_n_neighbors_unlabeled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | 4 | from ilp.experiments.base import BaseExperiment 5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM 6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 7 | from ilp.constants import CONFIG_DIR 8 | from ilp.helpers.log import make_logger 9 | 10 | 11 | logger = make_logger(__name__) 12 | 13 | 14 | class VarNeighborsUnLabeled(BaseExperiment): 15 | 16 | def __init__(self, n_neighbors_values, params, n_runs, isave=100): 17 | super(VarNeighborsUnLabeled, self).__init__(name='k_U', config=params, 18 | isave=isave, 19 | plot_title=r'Influence of ' 20 | r'$k_u$', 21 | n_runs=n_runs, 22 | multi_var=True) 23 | self.n_neighbors_values = n_neighbors_values 24 | 
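    # pre_single_run() below sweeps the k_U values read from var_k_U.yml: for
    # each value it overwrites config['graph']['n_neighbors_unlabeled'] and
    # writes that run's statistics to <top_dir>/k_U_<value>/run_<n_run>.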
25 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, 26 | seed_run, X_test, y_test, n_run): 27 | params = self.config 28 | 29 | for n_neighbors in self.n_neighbors_values: 30 | params['graph']['n_neighbors_unlabeled'] = n_neighbors 31 | save_dir = os.path.join(self.top_dir, 32 | 'k_U_' + str(n_neighbors)) 33 | stats_file = os.path.join(save_dir, 'run_' + str(n_run)) 34 | logger.info('\n\nExperiment: {}, k_U = {}, run {}...\n'. 35 | format(self.name.upper(), n_neighbors, n_run)) 36 | time.sleep(1) 37 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 38 | stats_file, seed_run, X_test, y_test) 39 | 40 | 41 | if __name__ == '__main__': 42 | parser = experiment_arg_parser() 43 | args = vars(parser.parse_args()) 44 | dataset_name = args['dataset'].lower() 45 | config_file = os.path.join(CONFIG_DIR, 'var_k_U.yml') 46 | config = parse_yaml(config_file) 47 | 48 | # Store dataset info 49 | config.setdefault('dataset', {}) 50 | config['dataset']['name'] = dataset_name 51 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 52 | 53 | N_NEIGHBORS_VALUES = config['graph']['n_neighbors_unlabeled'].copy() 54 | 55 | experiment = VarNeighborsUnLabeled(N_NEIGHBORS_VALUES, params=config, 56 | n_runs=args['n_runs']) 57 | if args['plot'] != '': 58 | experiment.load_plot(path=args['plot']) 59 | else: 60 | experiment.run(dataset_name) -------------------------------------------------------------------------------- /ilp/experiments/var_stream_labeled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | from time import sleep 4 | from sklearn.utils.random import check_random_state 5 | 6 | from ilp.experiments.base import BaseExperiment 7 | from ilp.helpers.data_fetcher import fetch_load_data, IS_DATASET_STREAM 8 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 9 | from ilp.constants import CONFIG_DIR 10 | from ilp.helpers.log import make_logger 11 | 12 | 13 | logger = make_logger(__name__) 14 | 15 | 16 | class VarStreamLabeled(BaseExperiment): 17 | 18 | def __init__(self, ratio_labeled_values, params, n_runs=1, isave=100): 19 | super(VarStreamLabeled, self).__init__(name='srl', 20 | config=params, 21 | isave=isave, n_runs=n_runs, 22 | plot_title=r'Influence of ratio of labels', 23 | multi_var=True) 24 | self.ratio_labeled_values = ratio_labeled_values 25 | 26 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 27 | X_test, y_test, n_run): 28 | 29 | config = self.config 30 | 31 | ratio_labeled = config['data']['stream']['ratio_labeled'] 32 | save_dir = os.path.join(self.top_dir, 'srl_' + str(ratio_labeled)) 33 | stats_path = os.path.join(save_dir, 'run_' + str(n_run)) 34 | logger.info('\n\nExperiment: {}, ratio_labeled = {}, run {}...\n'. 35 | format(self.name.upper(), ratio_labeled, n_run)) 36 | sleep(1) 37 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 38 | stats_path, seed_run, X_test, y_test) 39 | 40 | def run(self, dataset_name, random_state=42): 41 | 42 | config = self.config 43 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name) 44 | 45 | for n_run in range(self.n_runs): 46 | seed_run = random_state * n_run 47 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run)) 48 | rng = check_random_state(seed_run) 49 | if config['dataset']['is_stream']: 50 | logger.info('Dataset is a stream. 
Sampling observed labels.') 51 | # Just randomly sample ratio_labeled samples for mask_labeled 52 | n_burn_in = config['data']['n_burn_in_stream'] 53 | 54 | for ratio_labeled in self.ratio_labeled_values: 55 | 56 | config['data']['stream']['ratio_labeled'] = ratio_labeled 57 | n_labeled = int(ratio_labeled*len(y_train)) 58 | ind_labeled = rng.choice(len(y_train), n_labeled, 59 | replace=False) 60 | mask_labeled = np.zeros(len(y_train), dtype=bool) 61 | mask_labeled[ind_labeled] = True 62 | X_run, y_run = X_train, y_train 63 | 64 | config['data']['n_burn_in'] = n_burn_in 65 | config.setdefault('options', {}) 66 | config['options']['random_state'] = seed_run 67 | 68 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in, 69 | seed_run, X_test, y_test, n_run) 70 | 71 | 72 | if __name__ == '__main__': 73 | parser = experiment_arg_parser() 74 | args = vars(parser.parse_args()) 75 | dataset_name = args['dataset'].lower() 76 | config_file = os.path.join(CONFIG_DIR, 'var_stream_labeled.yml') 77 | config = parse_yaml(config_file) 78 | 79 | # Store dataset info 80 | config.setdefault('dataset', {}) 81 | config['dataset']['name'] = dataset_name 82 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 83 | 84 | N_RATIO_LABELED = config['data']['stream']['ratio_labeled'].copy() 85 | 86 | experiment = VarStreamLabeled(N_RATIO_LABELED, params=config, 87 | n_runs=args['n_runs']) 88 | if args['plot'] != '': 89 | experiment.load_plot(path=args['plot']) 90 | else: 91 | experiment.run(dataset_name) 92 | -------------------------------------------------------------------------------- /ilp/experiments/var_theta.py: -------------------------------------------------------------------------------- 1 | import os 2 | from time import sleep 3 | 4 | from ilp.experiments.base import BaseExperiment 5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM 6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 7 | from ilp.constants import CONFIG_DIR 8 | from ilp.helpers.log import make_logger 9 | 10 | 11 | logger = make_logger(__name__) 12 | 13 | 14 | class VarTheta(BaseExperiment): 15 | 16 | def __init__(self, theta_values, params, n_runs, isave=100): 17 | super(VarTheta, self).__init__(name='theta', config=params, 18 | isave=isave, n_runs=n_runs, 19 | plot_title=r'Influence of $\vartheta$', 20 | multi_var=True) 21 | self.theta_values = theta_values 22 | 23 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 24 | X_test, y_test, n_run): 25 | 26 | params = self.config 27 | 28 | for theta in self.theta_values: 29 | params['online_lp']['theta'] = theta 30 | save_dir = os.path.join(self.top_dir, 'theta_' + str(theta)) 31 | stats_file = os.path.join(save_dir, 'run_' + str(n_run)) 32 | logger.info('\n\nExperiment: {}, theta = {}, run {}...\n'. 
33 | format(self.name.upper(), theta, n_run)) 34 | sleep(1) 35 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 36 | stats_file, seed_run, X_test, y_test) 37 | 38 | 39 | if __name__ == '__main__': 40 | 41 | parser = experiment_arg_parser() 42 | args = vars(parser.parse_args()) 43 | dataset_name = args['dataset'].lower() 44 | config_file = os.path.join(CONFIG_DIR, 'var_theta.yml') 45 | config = parse_yaml(config_file) 46 | 47 | # Store dataset info 48 | config.setdefault('dataset', {}) 49 | config['dataset']['name'] = dataset_name 50 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 51 | 52 | THETA_VALUES = config['online_lp']['theta'].copy() 53 | 54 | # Instantiate experiment 55 | experiment = VarTheta(theta_values=THETA_VALUES, params=config, 56 | n_runs=args['n_runs']) 57 | if args['plot'] != '': 58 | experiment.load_plot(path=args['plot']) 59 | else: 60 | # python3 default_run.py -d digits 61 | experiment.run(dataset_name) -------------------------------------------------------------------------------- /ilp/helpers/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/helpers/__init__.py -------------------------------------------------------------------------------- /ilp/helpers/data_fetcher.py: -------------------------------------------------------------------------------- 1 | import os 2 | import gzip 3 | import zipfile 4 | from urllib import request 5 | import yaml 6 | import numpy as np 7 | from sklearn.datasets import make_classification 8 | from sklearn.model_selection import train_test_split 9 | 10 | from ilp.constants import DATA_DIR 11 | 12 | 13 | CWD = os.path.split(__file__)[0] 14 | DATASET_CONFIG_PATH = os.path.join(CWD, 'datasets.yml') 15 | 16 | SUPPORTED_DATASETS = {'mnist', 'usps', 'blobs', 'kitti_features'} 17 | IS_DATASET_STREAM = {'kitti_features': True} 18 | 19 | 20 | def check_supported_dataset(dataset): 21 | 22 | if dataset not in SUPPORTED_DATASETS: 23 | raise FileNotFoundError('Dataset {} is not supported.'.format(dataset)) 24 | 25 | return True 26 | 27 | 28 | def fetch_load_data(name): 29 | 30 | print('\nFetching/Loading {}...'.format(name)) 31 | with open(DATASET_CONFIG_PATH, 'r') as f: 32 | datasets_configs = yaml.load(f) 33 | if name.upper() not in datasets_configs: 34 | raise FileNotFoundError('Dataset {} not supported.'.format(name)) 35 | 36 | config = datasets_configs[name.upper()] 37 | name_ = config.get('name', name) 38 | test_size = config.get('test_size', 0) 39 | 40 | if name_ == 'KITTI_FEATURES': 41 | X_tr, y_tr, X_te, y_te = fetch_kitti() 42 | elif name_ == 'USPS': 43 | X_tr, y_tr, X_te, y_te = fetch_usps() 44 | elif name_ == 'MNIST': 45 | X_tr, y_tr, X_te, y_te = fetch_mnist() 46 | X_tr = X_tr / 255. 47 | X_te = X_te / 255. 
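        # Note: fetch_mnist() returns the raw 0-255 pixel intensities as
        # float64, so the division by 255 above rescales MNIST features to
        # [0, 1] (e.g. a pixel value of 128 becomes ~0.502).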
48 | elif name_ == 'BLOBS': 49 | X, y = make_classification(n_samples=60) 50 | X = np.asarray(X) 51 | y = np.asarray(y, dtype=int) 52 | 53 | if test_size > 0: 54 | if type(test_size) is int: 55 | t = test_size 56 | print('{} has shape {}'.format(name_, X.shape)) 57 | print('Splitting data with test size = {}'.format(test_size)) 58 | X_tr, X_te, y_tr, y_te = X[:-t], X[-t:], y[:-t], y[-t:] 59 | elif type(test_size) is float: 60 | X_tr, X_te, y_tr, y_te = train_test_split( 61 | X, y, test_size=test_size, stratify=y) 62 | else: 63 | raise TypeError('test_size is neither int or float.') 64 | 65 | print('Loaded training set with shape {}'.format(X_tr.shape)) 66 | print('Loaded testing set with shape {}'.format(X_te.shape)) 67 | return X_tr, y_tr, X_te, y_te 68 | else: 69 | print('Loaded {} with {} samples of dimension {}.' 70 | .format(name_, X.shape[0], X.shape[1])) 71 | return X, y, None, None 72 | else: 73 | raise NameError('No data set {} found!'.format(name_)) 74 | 75 | print('Loaded training data with shape {}'.format(X_tr.shape)) 76 | print('Loaded training labels with shape {}'.format(y_tr.shape)) 77 | print('Loaded testing data with shape {}'.format(X_te.shape)) 78 | print('Loaded testing labels with shape {}'.format(y_te.shape)) 79 | return X_tr, y_tr, X_te, y_te 80 | 81 | 82 | def fetch_usps(save_dir=None): 83 | 84 | base_url = 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/' 85 | train_file = 'zip.train.gz' 86 | test_file = 'zip.test.gz' 87 | save_dir = DATA_DIR if save_dir is None else save_dir 88 | 89 | if not os.path.isdir(save_dir): 90 | raise NotADirectoryError('{} is not a directory.'.format(save_dir)) 91 | 92 | train_source = os.path.join(base_url, train_file) 93 | test_source = os.path.join(base_url, test_file) 94 | 95 | train_dest = os.path.join(save_dir, train_file) 96 | test_dest = os.path.join(save_dir, test_file) 97 | 98 | def download_file(source, destination): 99 | if not os.path.exists(destination): 100 | print('Downloading from {}...'.format(source)) 101 | f, msg = request.urlretrieve(url=source, filename=destination) 102 | print('HTTP response: {}'.format(msg)) 103 | return f, msg 104 | else: 105 | print('Found dataset in {}!'.format(destination)) 106 | return None 107 | 108 | download_file(train_source, train_dest) 109 | download_file(test_source, test_dest) 110 | 111 | X_train = np.loadtxt(train_dest) 112 | y_train, X_train = X_train[:, 0].astype(np.int32), X_train[:, 1:] 113 | 114 | X_test = np.loadtxt(test_dest) 115 | y_test, X_test = X_test[:, 0].astype(np.int32), X_test[:, 1:] 116 | 117 | return X_train, y_train, X_test, y_test 118 | 119 | 120 | def fetch_kitti(data_dir=None): 121 | 122 | if data_dir is None: 123 | data_dir = os.path.join(DATA_DIR, 'kitti_features') 124 | 125 | files = ['kitti_all_train.data', 126 | 'kitti_all_train.labels', 127 | 'kitti_all_test.data', 128 | 'kitti_all_test.labels'] 129 | 130 | for file in files: 131 | if file not in os.listdir(data_dir): 132 | zip_path = os.path.join(data_dir, 'kitti_features.zip') 133 | target_path = os.path.dirname(zip_path) 134 | print("Extracting {} to {}...".format(zip_path, target_path)) 135 | with zipfile.ZipFile(zip_path, "r") as zip_ref: 136 | zip_ref.extractall(target_path) 137 | print("Done.") 138 | break 139 | 140 | X_train = np.loadtxt(os.path.join(data_dir, files[0]), np.float64, skiprows=1) 141 | y_train = np.loadtxt(os.path.join(data_dir, files[1]), np.int32, skiprows=1) 142 | X_test = np.loadtxt(os.path.join(data_dir, files[2]), np.float64, skiprows=1) 143 | y_test = 
np.loadtxt(os.path.join(data_dir, files[3]), np.int32, skiprows=1) 144 | 145 | return X_train, y_train, X_test, y_test 146 | 147 | 148 | def fetch_mnist(data_dir=None): 149 | 150 | if data_dir is None: 151 | data_dir = os.path.join(DATA_DIR, 'mnist') 152 | 153 | url = 'http://yann.lecun.com/exdb/mnist/' 154 | files = ['train-images-idx3-ubyte.gz', 155 | 'train-labels-idx1-ubyte.gz', 156 | 't10k-images-idx3-ubyte.gz', 157 | 't10k-labels-idx1-ubyte.gz'] 158 | 159 | # Create path if it doesn't exist 160 | os.makedirs(data_dir, exist_ok=True) 161 | 162 | # Download any missing files 163 | for file in files: 164 | if file not in os.listdir(data_dir): 165 | request.urlretrieve(url + file, os.path.join(data_dir, file)) 166 | print("Downloaded %s to %s" % (file, data_dir)) 167 | 168 | def _images(path): 169 | """Return flattened images loaded from local file.""" 170 | with gzip.open(path) as f: 171 | # First 16 bytes are magic_number, n_imgs, n_rows, n_cols 172 | pixels = np.frombuffer(f.read(), '>B', offset=16) 173 | return pixels.reshape(-1, 784).astype('float64') 174 | 175 | def _labels(path): 176 | with gzip.open(path) as f: 177 | # First 8 bytes are magic_number, n_labels 178 | integer_labels = np.frombuffer(f.read(), '>B', offset=8) 179 | 180 | return integer_labels 181 | 182 | X_train = _images(os.path.join(data_dir, files[0])) 183 | y_train = _labels(os.path.join(data_dir, files[1])) 184 | X_test = _images(os.path.join(data_dir, files[2])) 185 | y_test = _labels(os.path.join(data_dir, files[3])) 186 | 187 | return X_train, y_train, X_test, y_test 188 | -------------------------------------------------------------------------------- /ilp/helpers/data_flow.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.utils.validation import check_random_state 3 | 4 | 5 | def check_min_samples_per_class(y, min_samples=2): 6 | classes, class_sizes = np.unique(y, return_counts=True) 7 | min_class_size = class_sizes.min() 8 | print('Class sizes: {}'.format(class_sizes)) 9 | if min_class_size < min_samples: 10 | print('Classes: {}'.format(np.unique(y))) 11 | raise ValueError('Minimum class size < 2.') 12 | 13 | return classes, class_sizes 14 | 15 | 16 | def split_burn_in_rest(y, n_labeled_per_class, ratio_labeled, shuffle=False, seed=None): 17 | """ 18 | 19 | Parameters 20 | ---------- 21 | y : array, shape (n_samples,) 22 | The true data labels. 23 | 24 | n_labeled_per_class : int 25 | Number of labeled samples per class to include in the burn-in set. 26 | 27 | ratio_labeled : float 28 | Ratio of labeled samples to include within the burn-in set. 29 | 30 | shuffle : bool 31 | Whether to shuffle indices within classes before adding to burn-in set. 32 | 33 | seed : int, np.random.RandomState or None 34 | For reproducibility. 35 | 36 | Returns 37 | ------- 38 | ind_burn_in : list of length n_burn_in 39 | Indices of the samples in the burn-in set. 40 | 41 | mask_labeled_burn_in : array, shape (n_burn_in,) 42 | Mask indicating whether the labels in the burn-in set are observed. 
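
    Notes
    -----
    A small worked example: with the default burn-in settings
    (n_labeled_per_class=2, ratio_labeled=0.5), each class contributes
    2 labeled samples plus int(2 * (1 / 0.5 - 1)) = 2 unlabeled samples
    to the burn-in set (capped at the class size).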
43 | 44 | """ 45 | classes, class_sizes = check_min_samples_per_class(y, n_labeled_per_class) 46 | 47 | rng = check_random_state(seed) 48 | ind_burn_in = [] 49 | set_ind_burn_in_labeled = set() 50 | 51 | for class_ in classes: 52 | ind_class, = np.where(y == class_) 53 | n_class = len(ind_class) 54 | n_unlabeled_class = int( 55 | n_labeled_per_class * (1 / ratio_labeled - 1)) #+ 1 56 | n_unlabeled_class = min(n_unlabeled_class, 57 | n_class - n_labeled_per_class) 58 | n_burn_in_class = n_labeled_per_class + n_unlabeled_class 59 | 60 | if shuffle: 61 | ind_samples = rng.choice(n_class, n_burn_in_class, replace=False) 62 | ind_burn_in_class = ind_class[ind_samples] 63 | ind_burn_in.extend(ind_burn_in_class) 64 | ind_samples = rng.choice(n_burn_in_class, n_labeled_per_class, 65 | replace=False) 66 | ind_burn_in_class_labeled = ind_burn_in_class[ind_samples] 67 | else: 68 | ind_burn_in_class = ind_class[:n_burn_in_class] 69 | ind_burn_in.extend(ind_burn_in_class) 70 | ind_burn_in_class_labeled = ind_burn_in_class[:n_labeled_per_class] 71 | 72 | set_ind_burn_in_labeled.update(ind_burn_in_class_labeled) 73 | 74 | mask_labeled_burn_in = [i in set_ind_burn_in_labeled for i in ind_burn_in] 75 | mask_labeled_burn_in = np.asarray(mask_labeled_burn_in) 76 | 77 | y_burn_in = y[ind_burn_in] 78 | y_burn_in_labeled = y_burn_in[mask_labeled_burn_in] 79 | y_burn_in_unlabeled = y_burn_in[~mask_labeled_burn_in] 80 | 81 | _, class_sizes_labeled = np.unique(y_burn_in_labeled, return_counts=True) 82 | _, class_sizes_unlabeled = np.unique(y_burn_in_unlabeled, 83 | return_counts=True) 84 | 85 | if len(y_burn_in_labeled) == 0 and len(y_burn_in_unlabeled) == 0: 86 | class_sizes_burnin = np.zeros(len(classes)) 87 | elif len(y_burn_in_labeled) == 0: 88 | class_sizes_burnin = class_sizes_unlabeled 89 | elif len(y_burn_in_unlabeled) == 0: 90 | class_sizes_burnin = class_sizes_labeled 91 | else: 92 | class_sizes_burnin = class_sizes_labeled + class_sizes_unlabeled 93 | 94 | print('\n\n') 95 | print('Burn-in labeled class sizes: {} , sum = {}'.format( 96 | class_sizes_labeled, sum(class_sizes_labeled))) 97 | print('Burn-in unlabeled class sizes: {}, sum = {}'.format( 98 | class_sizes_unlabeled, sum(class_sizes_unlabeled))) 99 | print('Burn-in total class sizes: {}, sum = {}'.format( 100 | class_sizes_burnin, sum(class_sizes_burnin))) 101 | print('\nRest total size: {}'.format(len(y) - len(y_burn_in))) 102 | 103 | return ind_burn_in, mask_labeled_burn_in 104 | 105 | 106 | def split_labels_rest(y_rest, ratio_labeled, batch_size, shuffle=False, 107 | seed=None): 108 | """ 109 | 110 | Parameters 111 | ---------- 112 | y_rest : array with shape (n_rest,) 113 | Remaining data labels after burn-in. 114 | 115 | ratio_labeled : float 116 | Ratio of observed labels in remaining data. 117 | 118 | batch_size : int 119 | Number of points for which the ratio_labeled must be satisfied. 120 | 121 | shuffle : bool 122 | Whether to shuffle indices within classes before adding to burn-in set. 123 | 124 | seed : int, np.random.RandomState or None 125 | For reproducibility. 126 | 127 | Returns 128 | ------- 129 | mask_labeled_rest : array, shape (n_rest,) 130 | Mask indicating whether the labels in the rest set are observed. 
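
    Notes
    -----
    A small worked example: with ratio_labeled=0.1, a class with 200 remaining
    samples contributes int(200 * 0.1) = 20 observed labels to the mask.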
131 | 132 | """ 133 | 134 | classes = np.unique(y_rest) 135 | 136 | rng = check_random_state(seed) 137 | 138 | set_ind_rest_labeled = set() 139 | 140 | for class_ in classes: 141 | ind_class, = np.where(y_rest == class_) 142 | n_class = len(ind_class) 143 | n_labeled_class = int(n_class * ratio_labeled) 144 | 145 | if shuffle: 146 | ind_samples = rng.choice(n_class, n_labeled_class, replace=False) 147 | is_labeled_rest_class = ind_class[ind_samples] 148 | else: 149 | is_labeled_rest_class = ind_class[:n_labeled_class] 150 | 151 | set_ind_rest_labeled.update(is_labeled_rest_class) 152 | 153 | mask_labeled_rest = [i in set_ind_rest_labeled for i in range(len(y_rest))] 154 | mask_labeled_rest = np.asarray(mask_labeled_rest) 155 | 156 | y_rest_labeled = y_rest[mask_labeled_rest] 157 | y_rest_unlabeled = y_rest[~mask_labeled_rest] 158 | 159 | _, class_sizes_labeled = np.unique(y_rest_labeled, return_counts=True) 160 | _, class_sizes_unlabeled = np.unique(y_rest_unlabeled, return_counts=True) 161 | 162 | if len(y_rest_labeled) == 0 and len(y_rest_unlabeled) == 0: 163 | class_sizes_rest = np.zeros(len(classes)) 164 | elif len(y_rest_labeled) == 0: 165 | class_sizes_rest = class_sizes_unlabeled 166 | elif len(y_rest_unlabeled) == 0: 167 | class_sizes_rest = class_sizes_labeled 168 | else: 169 | class_sizes_rest = class_sizes_labeled + class_sizes_unlabeled 170 | 171 | print('\n\n') 172 | print('Rest labeled class sizes: {}, sum = {}'.format( 173 | class_sizes_labeled, sum(class_sizes_labeled))) 174 | print('Rest unlabeled class sizes: {}, sum = {}'.format( 175 | class_sizes_unlabeled, sum(class_sizes_unlabeled))) 176 | print('Rest total class sizes: {}, sum = {}'.format( 177 | class_sizes_rest, sum(class_sizes_rest))) 178 | print('\n\n') 179 | 180 | return mask_labeled_rest 181 | 182 | 183 | def gen_semilabeled_data(inputs, targets, flags): 184 | """ 185 | Generates a sequence of all inputs and targets, prepended with an id 186 | of the sample. A boolean value indicating if the label is observed by 187 | the algorithm or not is also generated. 188 | """ 189 | 190 | assert len(inputs) == len(targets) == len(flags) 191 | 192 | indices = range(len(inputs)) 193 | 194 | for i, j, k, l in zip(indices, inputs, targets, flags): 195 | yield i, j, k, l 196 | 197 | 198 | def gen_data_stream(inputs, targets, shuffle=False, seed=None): 199 | """ 200 | Generates a sequence of all inputs and targets, optionally shuffled. 
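
    Example (a sketch with plain Python lists):

        >>> list(gen_data_stream([10, 20, 30], ['a', 'b', 'c']))
        [(10, 'a'), (20, 'b'), (30, 'c')]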
201 | """ 202 | 203 | assert len(inputs) == len(targets) 204 | if shuffle: 205 | indices = np.arange(len(inputs)) 206 | random_state = check_random_state(seed) 207 | random_state.shuffle(indices) 208 | for i in indices: 209 | yield inputs[i], targets[i] 210 | else: 211 | for i in range(len(inputs)): 212 | yield inputs[i], targets[i] 213 | -------------------------------------------------------------------------------- /ilp/helpers/datasets.yml: -------------------------------------------------------------------------------- 1 | KITTI_FEATURES: 2 | name : KITTI_FEATURES 3 | test_size : 9090 4 | 5 | 6 | BLOBS: 7 | name : BLOBS 8 | pca : True 9 | test_size : 0.0 10 | 11 | 12 | MNIST: 13 | name : MNIST 14 | test_size : 10000 15 | 16 | 17 | USPS: 18 | name : USPS 19 | test_size : 2007 20 | -------------------------------------------------------------------------------- /ilp/helpers/fc_heap.py: -------------------------------------------------------------------------------- 1 | import heapq 2 | import warnings 3 | 4 | 5 | class FixedCapacityHeap: 6 | """Implementation of a min-heap with fixed capacity. 7 | The heap contains tuples of the form (edge_weight, node_id), 8 | which means the min. edge weight is extracted first 9 | """ 10 | 11 | def __init__(self, lst=None, capacity=10): 12 | self.capacity = capacity 13 | if lst is None: 14 | self.data = [] 15 | elif type(lst) is list: 16 | self.data = lst 17 | else: 18 | self.data = lst.tolist() 19 | 20 | if lst is not None: 21 | heapq.heapify(self.data) 22 | if len(self.data) > capacity: 23 | msg = 'Input data structure is larger than the queue\'s ' \ 24 | 'capacity ({}), truncating to smallest ' \ 25 | 'elements.'.format(capacity) 26 | 27 | warnings.warn(msg, UserWarning) 28 | self.data = self.data[:self.capacity] 29 | 30 | def push(self, item): 31 | """Insert an element in the heap if its key is smaller than the current 32 | max-key elements and remove the current max-key element if the new 33 | heap size exceeds the heap capacity 34 | 35 | Args: 36 | item (tuple): (edge_weight, node_ind) 37 | 38 | Returns: 39 | tuple : (bool, item) 40 | bool: whether the item was actually inserted in the queue 41 | item: another item that was removed from the queue or None if none was removed 42 | 43 | """ 44 | inserted = False 45 | removed = None 46 | if len(self.data) < self.capacity: 47 | heapq.heappush(self.data, item) 48 | inserted = True 49 | else: 50 | if item > self.get_min(): 51 | removed = heapq.heappushpop(self.data, item) 52 | inserted = True 53 | 54 | return inserted, removed 55 | 56 | def get_min(self): 57 | """Return the min-key element without removing it from the heap""" 58 | return self.data[0] 59 | 60 | def __len__(self): 61 | return len(self.data) 62 | -------------------------------------------------------------------------------- /ilp/helpers/log.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import logging 3 | 4 | 5 | def make_logger(name, path=None): 6 | logger = logging.getLogger(name) 7 | logger.setLevel(logging.DEBUG) 8 | stream_handler = logging.StreamHandler(stream=sys.stdout) 9 | # fmt = '%(asctime)s ' \ 10 | fmt = '[%(levelname)-10s] %(name)-10s : %(message)s' 11 | # fmt = '[{levelname}] {name} {message}' 12 | formatter = logging.Formatter(fmt=fmt, style='%') 13 | stream_handler.setFormatter(formatter) 14 | logger.addHandler(stream_handler) 15 | 16 | if path: 17 | file_handler = logging.FileHandler(filename=path) 18 | file_handler.setFormatter(formatter) 19 | 
logger.addHandler(file_handler) 20 | 21 | return logger 22 | -------------------------------------------------------------------------------- /ilp/helpers/params_parse.py: -------------------------------------------------------------------------------- 1 | import os 2 | import yaml 3 | from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter 4 | 5 | from ilp.constants import CONFIG_DIR 6 | from ilp.helpers.data_fetcher import SUPPORTED_DATASETS 7 | 8 | 9 | def args_to_dict(args): 10 | 11 | sample_yaml = os.path.join(CONFIG_DIR, 'default.yml') 12 | with open(sample_yaml, 'r') as f: 13 | sample_dict = yaml.load(f) 14 | 15 | params_dict = dict(sample_dict) 16 | for k, v in args.items(): 17 | for section in sample_dict: 18 | if k in sample_dict[section]: 19 | params_dict[section][k] = v 20 | 21 | return params_dict 22 | 23 | 24 | def parse_yaml(config_file): 25 | 26 | sample_yaml = os.path.join(CONFIG_DIR, 'default.yml') 27 | with open(sample_yaml, 'r') as f: 28 | default_params = yaml.load(f) 29 | 30 | with open(config_file) as cfg_file: 31 | params = yaml.load(cfg_file) 32 | 33 | for k, v in default_params.items(): 34 | if k not in params: 35 | params[k] = default_params[k] 36 | 37 | return params 38 | 39 | 40 | def print_config(params): 41 | for section in params: 42 | print('\n{} PARAMS:'.format(section.upper())) 43 | if type(params[section]) is dict: 44 | for k, v in params[section].items(): 45 | print('{}: {}'.format(k, v)) 46 | else: 47 | print('{}'.format(params[section])) 48 | 49 | 50 | def experiment_arg_parser(): 51 | arg_parser = ArgumentParser(description="ILP experiment", formatter_class=ArgumentDefaultsHelpFormatter) 52 | 53 | # Dataset 54 | arg_parser.add_argument( 55 | '-d', '--dataset', type=str, default='digits', 56 | help='Load the given dataset.\nSupported datasets are: {}' 57 | .format(SUPPORTED_DATASETS) 58 | ) 59 | 60 | arg_parser.add_argument( 61 | '-n', '--n_runs', type=int, default=1, 62 | help='Number of times to run the experiment with different seeds' 63 | ) 64 | 65 | arg_parser.add_argument( 66 | '-p', '--plot', type=str, default='', 67 | help='Plot the latest experiment results from the given directory' 68 | ) 69 | 70 | return arg_parser 71 | -------------------------------------------------------------------------------- /ilp/helpers/stats.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | import shelve 4 | import sklearn.preprocessing as prep 5 | from datetime import datetime 6 | from threading import Thread 7 | from enum import Enum 8 | from queue import Queue 9 | 10 | from ilp.constants import EPS_32, EPS_64, STATS_DIR 11 | from ilp.helpers.log import make_logger 12 | 13 | 14 | STATS_FILE_EXT = '.stat' 15 | 16 | 17 | class JobType(Enum): 18 | EVAL = 1 19 | ONLINE_ITER = 3 20 | PRINT_STATS = 6 21 | LABEL_STREAM = 9 22 | TRAIN_PRED = 11 23 | TEST_PRED = 12 24 | RUNTIME = 13 25 | POINT_PREDICTION = 14 26 | 27 | 28 | logger = make_logger(__name__) 29 | 30 | class StatisticsWorker: 31 | """ 32 | Parameters 33 | ---------- 34 | 35 | config : dict 36 | Dictionary with configuration key-value pairs of the running experiment. 37 | 38 | path : str, optional 39 | File path to save the aggregated statistics (default=None). 40 | 41 | isave : int, optional 42 | The frequency of saving statistics (default=1000). 43 | 44 | 45 | Attributes 46 | ---------- 47 | 48 | _stats : Statistics 49 | Class to store different kinds of statistics during an experiment run. 
50 | 51 | _jobs : Queue 52 | Queue of jobs to process in a different thread. 53 | 54 | _thread : Thread 55 | Thread in which to process incoming jobs. 56 | 57 | """ 58 | 59 | def __init__(self, config, path=None, isave=1): 60 | if path is None: 61 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S') 62 | dataset_name = config['dataset']['name'] 63 | filename = 'stats_' + cur_time + '_' + str(dataset_name) + STATS_FILE_EXT 64 | path = os.path.join(STATS_DIR, filename) 65 | elif not path.endswith(STATS_FILE_EXT): 66 | path = path + STATS_FILE_EXT 67 | 68 | os.makedirs(os.path.split(path)[0], exist_ok=True) 69 | self.path = path 70 | self.config = config 71 | self.isave = isave 72 | self._stats = Statistics() 73 | self._jobs = Queue() 74 | self._thread = Thread(target=self.work) 75 | 76 | def start(self): 77 | self.n_iter_eval = 0 78 | self._thread.start() 79 | 80 | def save(self): 81 | prev_file = self.path + '_iter_' + str(self.n_iter_eval) 82 | if os.path.exists(self.path): 83 | os.rename(self.path, prev_file) 84 | with shelve.open(self.path, 'c') as shelf: 85 | shelf['stats'] = self._stats.__dict__ 86 | shelf['config'] = self.config 87 | if os.path.exists(prev_file): 88 | os.remove(prev_file) 89 | 90 | def stop(self): 91 | self._jobs.put_nowait({'job_type': JobType.PRINT_STATS}) 92 | self._jobs.put(None) 93 | self._thread.join() 94 | self.save() 95 | 96 | def send(self, d): 97 | self._jobs.put_nowait(d) 98 | 99 | def work(self): 100 | while True: 101 | job = self._jobs.get() 102 | 103 | if job is None: # End of algorithm 104 | self._jobs.task_done() 105 | break 106 | 107 | job_type = job['job_type'] 108 | 109 | if job_type == JobType.EVAL: 110 | self._stats.evaluate(job['y_est'], job['y_true']) 111 | self.n_iter_eval += 1 112 | elif job_type == JobType.LABEL_STREAM: 113 | self._stats.label_stream_true = job['y_true'] 114 | self._stats.label_stream_mask_observed = job['mask_obs'] 115 | elif job_type == JobType.POINT_PREDICTION: 116 | f = job['vec'] 117 | h = self._stats.entropy(f) 118 | self._stats.entropy_point_after.append(h) 119 | y = job['y'] 120 | self._stats.pred_point_after.append(y) 121 | self._stats.conf_point_after.append(f.max()) 122 | elif job_type == JobType.ONLINE_ITER: 123 | self._stats.iter_online_count.append(job['n_in_iter']) 124 | self._stats.iter_online_duration.append(job['dt']) 125 | elif job_type == JobType.PRINT_STATS: 126 | err = self._stats.clf_error_mixed[-1] * 100 127 | logger.info('Classif. 
Error: {:5.2f}%\n\n'.format(err)) 128 | elif job_type == JobType.TRAIN_PRED: 129 | self._stats.train_est = job['y_est'] 130 | elif job_type == JobType.TEST_PRED: 131 | y_pred_knn = job['y_pred_knn'] 132 | self._stats.test_pred_knn.append(y_pred_knn) 133 | 134 | y_pred_lp = job['y_pred_lp'] 135 | self._stats.test_pred_lp.append(y_pred_lp) 136 | 137 | self._stats.test_true = y_test = job['y_true'] 138 | 139 | err_knn = np.mean(np.not_equal(y_pred_knn, y_test)) 140 | err_lp = np.mean(np.not_equal(y_pred_lp, y_test)) 141 | logger.info('knn test err: {:5.2f}%'.format(err_knn*100)) 142 | logger.info('ILP test err: {:5.2f}%'.format(err_lp*100)) 143 | self._stats.test_error_knn.append(err_knn) 144 | self._stats.test_error_ilp.append(err_lp) 145 | elif job_type == JobType.RUNTIME: 146 | self._stats.runtime = job['t'] 147 | 148 | if self.n_iter_eval % self.isave == 0: 149 | self.save() 150 | 151 | self._jobs.task_done() 152 | 153 | 154 | EXCLUDED_METRICS = {'label_stream_true', 155 | 'label_stream_mask_observed', 156 | 'n_burn_in', 157 | 'test_pred_knn', 'test_pred_lp', 'test_true', 158 | 'train_est', 'runtime', 159 | 'conf_point_after', 'test_error_knn', 'test_error_ilp'} 160 | 161 | 162 | class Statistics: 163 | """ 164 | Statistics gathered during learning (training and testing). 165 | """ 166 | def __init__(self): 167 | self.iter_online_count = [] 168 | self.iter_online_duration = [] 169 | 170 | # Evaluation after a new point arrives 171 | self.n_invalid_samples = [] 172 | self.invalid_samples_ratio = [] 173 | self.clf_error_mixed = [] 174 | self.clf_error_valid = [] 175 | self.l1_error_mixed = [] 176 | self.l1_error_valid = [] 177 | self.cross_ent_mixed = [] 178 | self.cross_ent_valid = [] 179 | self.entropy_pred_mixed = [] 180 | self.entropy_pred_valid = [] 181 | 182 | # Defined once 183 | self.label_stream_true = [] 184 | self.label_stream_mask_observed = [] 185 | self.test_pred_knn = [] 186 | self.test_pred_lp = [] 187 | self.test_true = [] 188 | self.train_est = [] 189 | self.runtime = np.nan 190 | self.test_error_ilp = [] 191 | self.test_error_knn = [] 192 | self.entropy_point_after = [] 193 | self.pred_point_after = [] 194 | self.conf_point_after = [] 195 | 196 | def evaluate(self, y_predictions, y_true): 197 | """Computes statistics for a given set of predictions and the ground truth. 
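
        (The computed metrics are appended to this Statistics instance's
        running lists.)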
198 | 199 | Args: 200 | y_predictions (array_like): [u_samples, n_classes] soft class predictions for current unlabeled samples 201 | y_true (array_like): [u_samples, n_classes] one-hot encoding of the true classes_ of the unlabeled samples 202 | 203 | eps (float): quantity slightly larger than zero to avoid division by zero 204 | 205 | Returns: 206 | float, average accuracy 207 | 208 | """ 209 | 210 | u_samples, n_classes = y_predictions.shape 211 | 212 | # Clip predictions to [0,1] 213 | eps = EPS_32 if y_predictions.itemsize == 4 else EPS_64 214 | y_pred_01 = np.clip(y_predictions, eps, 1-eps) 215 | # Normalize predictions to make them proper distributions 216 | y_pred = prep.normalize(y_pred_01, copy=False, norm='l1') 217 | 218 | # 0-1 Classification error under valid and invalid points 219 | y_pred_max = np.argmax(y_pred, axis=1) 220 | y_true_max = np.argmax(y_true, axis=1) 221 | fc_err_mixed = self.zero_one_loss(y_pred_max, y_true_max) 222 | self.clf_error_mixed.append(fc_err_mixed) 223 | 224 | # L1 error under valid and invalid points 225 | l1_err_mixed = np.mean(self.l1_error(y_pred, y_true)) 226 | self.l1_error_mixed.append(l1_err_mixed) 227 | 228 | # Cross-entropy loss 229 | crossent_mixed = np.mean(self.cross_entropy(y_true, y_pred)) 230 | self.cross_ent_mixed.append(crossent_mixed) 231 | 232 | # Identify valid points (for which a label has been estimated) 233 | ind_valid, = np.where(y_pred.sum(axis=1) != 0) 234 | n_valid = len(ind_valid) 235 | n_invalid = u_samples - n_valid 236 | 237 | self.n_invalid_samples.append(n_invalid) 238 | self.invalid_samples_ratio.append(n_invalid / u_samples) 239 | 240 | # Entropy of the predictions 241 | if n_invalid == 0: 242 | entropy_pred_mixed = np.mean(self.entropy(y_pred)) 243 | self.entropy_pred_mixed.append(entropy_pred_mixed) 244 | return 245 | 246 | y_pred_valid = y_pred[ind_valid] 247 | y_true_valid = y_true[ind_valid] 248 | 249 | # 0-1 Classification error under valid points only 250 | y_pred_valid_max = y_pred_max[ind_valid] 251 | y_true_valid_max = y_true_max[ind_valid] 252 | err_valid_max = self.zero_one_loss(y_pred_valid_max, y_true_valid_max) 253 | self.clf_error_valid.append(err_valid_max) 254 | 255 | # L1 error under valid points only 256 | l1_err_valid = np.mean(self.l1_error(y_pred_valid, y_true_valid)) 257 | self.l1_error_valid.append(l1_err_valid) 258 | 259 | # Cross-entropy loss 260 | ce_valid = np.mean(self.cross_entropy(y_true_valid, y_pred_valid)) 261 | self.cross_ent_valid.append(ce_valid) 262 | 263 | # Entropy of the predictions 264 | entropy_pred_valid = np.mean(self.entropy(y_pred_valid)) 265 | self.entropy_pred_valid.append(entropy_pred_valid) 266 | n_total = n_valid + n_invalid 267 | entropy_pred_mixed = (entropy_pred_valid*n_valid + n_invalid) / n_total 268 | self.entropy_pred_mixed.append(entropy_pred_mixed) 269 | 270 | 271 | @staticmethod 272 | def zero_one_loss(y_pred, y_true, average=True): 273 | """ 274 | 275 | Args: 276 | y_pred (array_like): (n_samples, n_classes) 277 | y_true (array_like): (n_samples, n_classes) 278 | average (bool): Whether to take the average over all predictions. 279 | 280 | Returns: The absolute difference for each row. 281 | Note that this will be in [0,2] for p.d.f.s. 
282 | 283 | """ 284 | 285 | if average: 286 | return np.mean(np.not_equal(y_pred, y_true)) 287 | else: 288 | return np.sum(np.not_equal(y_pred, y_true)) 289 | 290 | @staticmethod 291 | def l1_error(y_pred, y_true, norm=True): 292 | """ 293 | 294 | Args: 295 | y_pred (array_like): An array of probability distributions (usually predictions) with shape (n_distros, n_classes) 296 | y_true (array_like): An array of probability distributions (usually groundtruth) with shape (n_distros, n_classes) 297 | norm (bool): Whether to constrain the L1 error to be in [0,1]. 298 | 299 | Returns: The absolute difference for each row. Note that this will be in [0,2] for pdfs. 300 | 301 | """ 302 | 303 | l1_error = np.abs(y_pred - y_true).sum(axis=1) 304 | if norm: 305 | l1_error /= 2 306 | 307 | return l1_error 308 | 309 | @staticmethod 310 | def entropy(p, norm=True): 311 | """ 312 | 313 | Args: 314 | p (array_like): An array of probability distributions with shape (n_distros, n_classes) 315 | norm (bool): Whether to normalize the entropy to constrain it in [0,1] 316 | 317 | Returns: An array of entropies of the distributions with shape (n_distros,) 318 | 319 | """ 320 | 321 | entropy = - (p * np.log(p)).sum(axis=1) 322 | if norm: 323 | entropy /= np.log(p.shape[1]) 324 | 325 | return entropy 326 | 327 | @staticmethod 328 | def cross_entropy(p, q, norm=True): 329 | """ 330 | 331 | Args: 332 | p (array_like): An array of probability distributions (usually groundtruth) with shape (n_distros, n_classes) 333 | q (array_like): An array of probability distributions (usually predictions) with shape (n_distros, n_classes) 334 | norm (bool): Whether to normalize the entropy to constrain it in [0,1] 335 | 336 | Returns: An array of cross entropies between the groundtruth and the prediction with shape (n_distros,) 337 | 338 | """ 339 | 340 | cross_ent = -(p * np.log(q)).sum(axis=1) 341 | if norm: 342 | cross_ent /= np.log(p.shape[1]) 343 | 344 | return cross_ent 345 | 346 | 347 | def aggregate_statistics(stats_path, metrics=None, excluded_metrics=None): 348 | 349 | print('Aggregating statistics from {}'.format(stats_path)) 350 | if stats_path.endswith(STATS_FILE_EXT): 351 | list_of_files = [stats_path] 352 | else: 353 | list_of_files = [os.path.join(stats_path, f) for f in os.listdir( 354 | stats_path) if f.endswith(STATS_FILE_EXT)] 355 | 356 | stats_runs = [] 357 | random_states = [] 358 | for stats_file in list_of_files: 359 | with shelve.open(stats_file, 'r') as f: 360 | stats_runs.append(f['stats']) 361 | random_states.append(f['config']['options']['random_state']) 362 | 363 | print('\nRandom seeds used: {}\n'.format(random_states)) 364 | 365 | if metrics is None: 366 | metrics = Statistics().__dict__.keys() 367 | 368 | if excluded_metrics is None: 369 | excluded_metrics = EXCLUDED_METRICS 370 | 371 | stats_mean, stats_std = {}, {} 372 | stats_run0 = stats_runs[0] 373 | 374 | for metric in metrics: 375 | if metric in excluded_metrics: continue 376 | if metric not in stats_run0: 377 | print('\nMetric {} not found!'.format(metric)) 378 | continue 379 | metric_lists = [stats[metric] for stats in stats_runs] 380 | 381 | # Make a numpy 2D array to merge the different runs 382 | metric_runs = np.asarray(metric_lists) 383 | s = metric_runs.shape 384 | if len(s) < 2: 385 | print('No values for metric, skipping.') 386 | continue 387 | stats_mean[metric] = np.mean(metric_runs, axis=0) 388 | stats_std[metric] = np.std(metric_runs, axis=0) 389 | 390 | with shelve.open(list_of_files[0], 'r') as f: 391 | config = f['config'] 
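    # At this point stats_mean[metric] is the element-wise mean over runs
    # (np.mean(metric_runs, axis=0) above), so indexing with [0] and [-1]
    # below gives the initial and final values averaged across runs.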
392 | 393 | lp_times = stats_mean['iter_online_duration'] 394 | ice = stats_mean['clf_error_mixed'][0] * 100 395 | fce = stats_mean['clf_error_mixed'][-1] * 100 396 | print('Avg. LP time/iter: {:.4f}s'.format(np.mean(lp_times))) 397 | print('Initial classification error: {:.2f}%'.format(ice)) 398 | print('Final classification error: {:.2f}%'.format(fce)) 399 | 400 | # Add excluded metrics in the end 401 | for stats_run in stats_runs: 402 | for ex_metric in excluded_metrics: 403 | if ex_metric in stats_run: 404 | print('Appending excluded metric: {}'.format(ex_metric)) 405 | stats_mean[ex_metric] = stats_run[ex_metric] 406 | 407 | if len(list_of_files) == 1: 408 | stats_std = None 409 | 410 | return stats_mean, stats_std, config 411 | -------------------------------------------------------------------------------- /ilp/plots/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/plots/__init__.py -------------------------------------------------------------------------------- /ilp/plots/plot_stats.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib 3 | from matplotlib import pyplot as plt 4 | from itertools import count, product 5 | from math import ceil 6 | from sklearn.metrics import confusion_matrix 7 | from tabulate import tabulate 8 | 9 | from ilp.helpers.params_parse import print_config 10 | from ilp.helpers.data_fetcher import fetch_load_data 11 | from ilp.helpers.stats import STATS_FILE_EXT 12 | 13 | 14 | COLORS = ['red', 'darkorange', 'black', 'green', 'cyan', 'blue'] 15 | 16 | N_AXES_PER_ROW = 3 17 | 18 | PLOT_LABELS = {'iter_online_count': r'\#LP iterations', 19 | 'iter_online_duration': r'LP time (s)', 20 | 'n_invalid_samples' : r'\#Invalid samples', 21 | 'invalid_samples_ratio': r'Invalid samples ratio', 22 | 'clf_error_mixed': r'classification error', 23 | 'l1_error_mixed': r'$\ell_1$ error', 24 | 'cross_ent_mixed': r'cross entropy', 25 | 'entropy_pred_mixed': r'prediction entropy', 26 | 'theta': r'$\vartheta$', 27 | 'k_L': r'$k_l$', 28 | 'k_U': r'$k_u$', 29 | 'n_L': r'\#labels', 30 | 'srl': r'labeled', 31 | 'entropy_point_after': r'$H(f)$', 32 | 'conf_point_after': r'$\max_c f_i$', 33 | 'test_error_ilp': 'ILP test error', 34 | 'test_error_knn': 'test error (knn)' 35 | } 36 | 37 | PLOT_TITLE = {'entropy_point_after': 'Entropy', 38 | 'conf_point_after': 'Confidence'} 39 | 40 | METRIC_ORDER = [ 41 | 'l1_error_mixed', 42 | 'cross_ent_mixed', 43 | 'clf_error_mixed' , 44 | 'entropy_point_after', 45 | 'entropy_pred_mixed', 46 | 'iter_online_count', 47 | 'iter_online_duration', 48 | 'test_error_ilp', 49 | 'test_error_knn' 50 | ] 51 | 52 | METRIC_TITLE = {'l1_error_mixed': 53 | r'$\frac{1}{2u}\sum\limits_{i} ' 54 | r'|{F_U}_{(i)}^{True} - {F_U}_{(i)}|_1$', 55 | 'cross_ent_mixed': 56 | r'$\frac{1}{u}\sum\limits_{i} H({F_U}_{(i)}^{True}, ' 57 | r'{F_U}_{(i)})$', 58 | 'entropy_pred_mixed': 59 | r'$\frac{1}{u}\sum\limits_{i} H({F_U}_{(i)})$', 60 | 'clf_error_mixed': 61 | r'$\frac{1}{u}\sum\limits_{i} I(\arg \max {F_U}_{(i)} ' 62 | r'\neq \arg \max {F_U}_{(i)}^{True})$'} 63 | 64 | SCATTER_METRIC = ['iter_online_duration', 'iter_online_count'] 65 | 66 | LEGEND_METRIC = 'clf_error_mixed' 67 | 68 | DECORATORS = {'iter_offline', 'burn_in_labels_true', 'label_stream_true'} 69 | 70 | X_LABEL_DEFAULT = r'\#observed samples' 71 | DEFAULT_COLOR = 'b' 72 | 
DEFAULT_MEAN_COLOR = 'r' 73 | DEFAULT_STD_COLOR = 'darkorange' 74 | COLOR_MAP = plt.cm.inferno 75 | N_CURVES = 6 # THETA \in [0., 0.4, 0.8, 1.2, 1.6, 2.0] 76 | COLOR_IDX = np.linspace(0, 1, N_CURVES + 2)[1:-1] 77 | 78 | KITTI_CLASSES = ['car', 'van', 'truck', 'pedestrian', 'sitter', 'cyclist', 79 | 'tram', 'misc'] 80 | 81 | 82 | def remove_frame(top=True, bottom=True, left=True, right=True): 83 | 84 | ax = plt.gca() 85 | if top: 86 | ax.spines['top'].set_visible(False) 87 | 88 | if bottom: 89 | ax.spines['bottom'].set_visible(False) 90 | 91 | if left: 92 | ax.spines['left'].set_visible(False) 93 | 94 | if right: 95 | ax.spines['right'].set_visible(False) 96 | 97 | 98 | def print_latex_table(stats_list): 99 | 100 | headers = ['\#Labels', 'Runtime (s)', 'Est. error (%)', 101 | 'knn error (%)', 'ILP error (%)'] 102 | table = [] 103 | 104 | for _, var_value, stats_value, _ in stats_list: 105 | runtime = stats_value['runtime'] 106 | est_err = stats_value['clf_error_mixed'][-1] 107 | 108 | y_pred_knn = stats_value['test_pred_knn'] 109 | y_pred = stats_value['test_pred_lp'] 110 | y_true = stats_value['test_true'] 111 | test_err_knn = np.mean(np.not_equal(y_true, y_pred_knn)) 112 | test_err_ilp = np.mean(np.not_equal(y_true, y_pred)) 113 | 114 | runtime = '{:6.2f}'.format(runtime) 115 | est_err = '{:5.2f}'.format(est_err*100) 116 | 117 | test_err_knn = '{:5.2f}'.format(test_err_knn*100) 118 | test_err_ilp = '{:5.2f}'.format(test_err_ilp*100) 119 | row = [var_value, runtime, est_err, test_err_knn, test_err_ilp] 120 | table.append(row) 121 | 122 | print(tabulate(table, headers, tablefmt="latex")) 123 | 124 | 125 | def plot_histogram(ax, values, title, xlabel, ylabel, value_range): 126 | 127 | weights = np.ones_like(values) / float(len(values)) 128 | bin_y, bin_x, _ = ax.hist(values, range=value_range, normed=False, bins=20, 129 | weights=weights, alpha=0.5, align='mid') 130 | print('Bin values min/max: {}, {}'.format(bin_y.min(), bin_y.max())) 131 | 132 | ax.set_xticks(np.arange(0, 1.1, 0.2)) 133 | ax.set_yticks(np.arange(0, 1.1, 0.2)) 134 | ax.set_title(title, fontweight='bold') 135 | ax.set_xlabel(xlabel, fontweight='bold') 136 | ax.set_ylabel(ylabel) 137 | 138 | ax.spines['top'].set_visible(False) 139 | ax.spines['right'].set_visible(False) 140 | 141 | 142 | def plot_metric_histogram(stats_value, ax1, ax2, metric_key, pos=1): 143 | 144 | if metric_key not in stats_value: 145 | print('FOUND NO KEY {}.'.format(metric_key)) 146 | return 147 | if 'pred_point_after' not in stats_value: 148 | print('FOUND NO KEY pred_point_after.') 149 | return 150 | 151 | y_u_metric = np.asarray(stats_value[metric_key]) 152 | print('START PLOT HISTOGRAM FOR {}'.format(metric_key)) 153 | y_u_pred = stats_value['pred_point_after'].ravel() 154 | 155 | n = len(y_u_metric) 156 | y_true = stats_value['label_stream_true'] 157 | mask_labeled = stats_value['label_stream_mask_observed'] 158 | y_u_true = y_true[~mask_labeled] 159 | 160 | mask_correct = np.equal(y_u_pred, y_u_true[-n:]) 161 | metric_hits = y_u_metric[mask_correct] 162 | metric_miss = y_u_metric[~mask_correct] 163 | 164 | value_range = (0., 1.) 
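    # pos marks the row position in the stacked histogram figure built by
    # plot_metric_histograms() below: 1 for the top row (panel titles kept,
    # x-labels suppressed), -1 for the bottom row (x-labels kept), anything
    # else for middle rows.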
165 | ylabel = 'Samples ratio' 166 | xlabel = PLOT_LABELS[metric_key] 167 | 168 | if pos == 1: 169 | title = PLOT_TITLE[metric_key] + ' - correct predictions' 170 | xlabel = '' 171 | elif pos == -1: 172 | title = '' 173 | else: 174 | xlabel = '' 175 | title = '' 176 | plot_histogram(ax1, metric_hits, title, xlabel, ylabel, value_range) 177 | 178 | if pos == 1: 179 | title = PLOT_TITLE[metric_key] + ' - false predictions' 180 | xlabel = '' 181 | elif pos == -1: 182 | title = '' 183 | else: 184 | xlabel = '' 185 | title = '' 186 | plot_histogram(ax2, metric_miss, title, xlabel, ylabel, value_range) 187 | 188 | 189 | def plot_metric_histograms(stats, metric_key): 190 | 191 | if type(stats) is list: 192 | fig = plt.figure(6, (8, 2.5*len(stats)), dpi=200) 193 | sp_count = count(1) 194 | for i, (_, var_value, stats_value, _) in enumerate(stats): 195 | ax1 = fig.add_subplot(len(stats), 2, next(sp_count)) 196 | ax2 = fig.add_subplot(len(stats), 2, next(sp_count)) 197 | 198 | if i == 0: 199 | pos = 1 200 | elif i == len(stats) - 1: 201 | pos = -1 202 | else: 203 | pos = 0 204 | 205 | plot_metric_histogram(stats_value, ax1, ax2, metric_key, pos) 206 | 207 | title = r'$\vartheta = $' + str(var_value) 208 | ax1.set_label(title) 209 | plt.legend() 210 | else: 211 | fig = plt.figure(6, (8, 2.5), dpi=100) 212 | ax1 = fig.add_subplot(1, 2, 1) 213 | ax2 = fig.add_subplot(1, 2, 2) 214 | plot_metric_histogram(stats, ax1, ax2, metric_key) 215 | 216 | fig.subplots_adjust(top=0.9) 217 | fig.tight_layout() 218 | 219 | 220 | def plot_class_distro_stream(ax, y_true, mask_observed, n_burn_in, classes): 221 | """Plot incoming label distributions""" 222 | print('Plotting {}'.format('labels stream distro')) 223 | 224 | xx = np.arange(len(y_true)) 225 | x_lu = [xx[mask_observed], xx[~mask_observed]] 226 | y_lu = [y_true[mask_observed], y_true[~mask_observed]] 227 | c_lu = ['red', 'gray'] 228 | sizes = [2, 1] 229 | markers = ['d', '.'] 230 | 231 | n_labeled = sum(mask_observed) 232 | n_unlabeled = len(y_true) - n_labeled 233 | labels = [r'labeled ({})'.format(n_labeled), 234 | r'unlabeled ({})'.format(n_unlabeled)] 235 | 236 | for x, y, c, s, m, label in zip(x_lu, y_lu, c_lu, sizes, markers, labels): 237 | ax.scatter(x, y, c=c, marker=m, s=s, label=label) 238 | 239 | burn_in_label = 'burn-in ({})'.format(n_burn_in) 240 | ax.vlines(n_burn_in, *ax.get_ylim(), colors='blue', linestyle=':', 241 | label=burn_in_label) 242 | 243 | classes_type = type(classes[0]) 244 | if classes_type is str: 245 | true_labels = np.unique(y_true) 246 | plt.yticks(range(len(classes)), classes, rotation=45, fontsize=7) 247 | else: 248 | ax.set_yticks(classes) 249 | 250 | ax.set_xlabel(X_LABEL_DEFAULT) 251 | ax.set_ylabel(r'class labels') 252 | 253 | ax.spines['top'].set_visible(False) 254 | ax.spines['right'].set_visible(False) 255 | 256 | plt.legend(loc='upper right') 257 | 258 | 259 | def plot_corrected_samples(y_true, y_pred1, y_pred2, config, n_samples=50): 260 | 261 | dataset = config['dataset']['name'] 262 | _, _, X_test, y_test = fetch_load_data(dataset) 263 | 264 | mask_miss1 = np.not_equal(y_true, y_pred1) 265 | print('KNN missed {} test cases'.format(sum(mask_miss1))) 266 | mask_hits2 = np.equal(y_true, y_pred2) 267 | print('ILP missed {} test cases'.format(sum(~mask_hits2))) 268 | mask_miss1_hits2 = np.logical_and(mask_miss1, mask_hits2) 269 | print('ILP missed {} less than KNN'.format(sum(mask_miss1_hits2))) 270 | 271 | samples = X_test[mask_miss1_hits2] 272 | y_pred_miss = y_pred1[mask_miss1_hits2] 273 | y_pred_hit = 
y_pred2[mask_miss1_hits2] 274 | print('THERE ARE {} CASES OF MISS/HIT'.format(len(samples))) 275 | 276 | if len(samples) > n_samples: 277 | ind = np.random.choice(len(samples), n_samples, replace=False) 278 | samples = samples[ind] 279 | y_pred_miss = y_pred_miss[ind] 280 | y_pred_hit = y_pred_hit[ind] 281 | 282 | fig = plt.figure(4, figsize=(4, 4), dpi=200) 283 | dim = int(np.sqrt(len(X_test[0]))) 284 | 285 | n_subplots = len(samples) 286 | n_cols = 10 287 | n_empty_rows = 0 # 2 288 | n_rows = ceil(n_subplots / n_cols) + n_empty_rows 289 | subplot_count = count(1) 290 | 291 | for x, ym, yh in zip(samples, y_pred_miss, y_pred_hit): 292 | i = next(subplot_count) 293 | ax = fig.add_subplot(n_rows, n_cols, i) 294 | ax.set_xticks([]) 295 | ax.set_yticks([]) 296 | 297 | image = x.reshape(dim, dim) 298 | ax.imshow(image, cmap=plt.cm.gray) 299 | ax.set_title(r'{}$\rightarrow${}'.format(ym, yh), fontsize=8, 300 | fontweight='bold', horizontalalignment="center") 301 | fig.add_subplot(ax) 302 | 303 | fig.suptitle('Corrected samples') 304 | 305 | 306 | def plot_confusion_matrix(y_true, y_pred, title, cmap=plt.cm.Greys): 307 | 308 | cm = confusion_matrix(y_true, y_pred) 309 | # Normalize 310 | cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 311 | true_labels = [str(int(y)) for y in np.unique(y_true)] 312 | pred_labels = [str(int(y)) for y in np.unique(y_pred)] 313 | plt.imshow(cm_norm, interpolation='nearest', cmap=cmap) 314 | plt.title(title, fontsize=14, fontweight='bold') 315 | xtick_marks = np.arange(len(true_labels)) 316 | ytick_marks = np.arange(len(pred_labels)) 317 | plt.xticks(xtick_marks, true_labels, fontsize=6, fontweight='bold') 318 | plt.yticks(ytick_marks, pred_labels, fontsize=6, fontweight='bold') 319 | 320 | [i.set_color("b") for i in plt.gca().get_xticklabels()] 321 | [i.set_color("b") for i in plt.gca().get_yticklabels()] 322 | 323 | thresh = cm_norm.max() / 2. 
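    # Annotate every cell with its raw count; cells whose normalized value
    # exceeds half of the maximum are drawn with white text so the numbers
    # remain readable on the dark end of the colormap.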
324 | for i, j in product(range(cm.shape[0]), range(cm.shape[1])): 325 | s = r'{:4}'.format(str(cm[i, j])) 326 | s = r'\textbf{' + s + r'}' 327 | plt.text(j, i, s, fontsize=8, 328 | horizontalalignment="center", verticalalignment='center', 329 | color="white" if cm_norm[i, j] > thresh else "black") 330 | 331 | plt.tick_params(top='off', bottom='off', left='off', right='off', 332 | labelleft='on', labelbottom='on') 333 | 334 | remove_frame() 335 | 336 | plt.tight_layout() 337 | ax = plt.gca() 338 | ax.set_ylabel('True label', fontsize=10) 339 | ax.set_xlabel('Predicted label', fontsize=10) 340 | 341 | 342 | def plot_confusion_matrices(stats, config, improvements=True): 343 | matplotlib.rcParams['text.latex.unicode'] = True 344 | print('Plotting confusion matrices...') 345 | 346 | y_pred_knn = stats['test_pred_knn'] 347 | y_pred_ilp = stats['test_pred_lp'] 348 | y_true = stats['test_true'] 349 | 350 | err_knn = np.mean(np.not_equal(y_pred_knn, y_true)) 351 | err_ilp = np.mean(np.not_equal(y_pred_ilp, y_true)) 352 | print('knn error: {}'.format(err_knn)) 353 | print('ilp error: {}'.format(err_ilp)) 354 | 355 | fig = plt.figure(3, figsize=(8, 4), dpi=200) 356 | n_subplots = 2 357 | 358 | fig.add_subplot(1, n_subplots, 1) 359 | plot_confusion_matrix(y_true, y_pred_knn, '$knn$') 360 | 361 | fig.add_subplot(1, n_subplots, 2) 362 | plot_confusion_matrix(y_true, y_pred_ilp, 'ILP') 363 | 364 | if improvements: 365 | plot_corrected_samples(y_true, y_pred_knn, y_pred_ilp, 366 | config, n_samples=40) 367 | 368 | dt = config['dataset']['name'].upper() 369 | fig.suptitle(r'\textbf{Confusion}' + ' (' + dt + ')') 370 | 371 | fig.tight_layout() 372 | fig.subplots_adjust(top=0.85) 373 | 374 | 375 | def plot_single_run_single_var(stats, fig, config=None): 376 | 377 | metrics_to_plot = [] 378 | for metric_key in METRIC_ORDER: 379 | if metric_key in stats and len(stats[metric_key]): 380 | metrics_to_plot.append(metric_key) 381 | 382 | n_subplots = len(metrics_to_plot) 383 | if 'label_stream_true' in stats: 384 | n_subplots += 1 385 | if 'iter_offline_duration' in stats: 386 | n_subplots += 1 387 | 388 | n_cols = 4 389 | n_rows = ceil(n_subplots / n_cols) 390 | plot_count = count(1) 391 | 392 | n = len(stats['iter_online_count']) 393 | xx = range(n) 394 | for metric_key in metrics_to_plot: 395 | print('Plotting metric {}'.format(metric_key)) 396 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 397 | if metric_key in SCATTER_METRIC: 398 | ax.scatter(xx, stats[metric_key], s=2) 399 | else: 400 | ax.plot(stats[metric_key], color=DEFAULT_COLOR, label=None, lw=2) 401 | 402 | ax.set_xlabel(X_LABEL_DEFAULT) 403 | ax.set_ylabel(PLOT_LABELS[metric_key]) 404 | 405 | if 'label_stream_true' in stats and config is not None: 406 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 407 | y_true = np.asarray(stats['label_stream_true']) 408 | y_mask = np.asarray(stats['label_stream_mask_observed']) 409 | n_burn_in = config['data'].get('n_burn_in', 0) 410 | classes = np.unique(y_true) 411 | dataset_config = config.get('dataset', None) 412 | if dataset_config is not None: 413 | classes = dataset_config.get('classes', classes) 414 | if dataset_config['name'].startswith('kitti'): 415 | classes = KITTI_CLASSES 416 | plot_class_distro_stream(ax, y_true, y_mask, n_burn_in, classes) 417 | 418 | iod = stats.get('iter_offline_duration', None) 419 | if iod is not None: 420 | if len(iod): 421 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 422 | ax.scatter(stats['iter_offline'], iod) 423 | ax.set_xlabel(X_LABEL_DEFAULT) 
424 | ax.set_ylabel(PLOT_LABELS['iter_offline_duration']) 425 | 426 | fig.subplots_adjust(top=0.9) 427 | 428 | 429 | def plot_single_run_multi_var(stats_list, fig, config): 430 | # Single run for each value a variable (e.g. \vartheta) takes. 431 | 432 | kitti = config['dataset']['name'].startswith('kitti') 433 | LW = 1.5 434 | 435 | var_name, var_value0, stats0, _ = stats_list[0] 436 | var_label = PLOT_LABELS[var_name] 437 | metrics_to_plot = [] 438 | 439 | metric_order = METRIC_ORDER 440 | if var_name.startswith('k'): 441 | metric_order = METRIC_ORDER[:3] 442 | for metric_key in metric_order: 443 | print('Metric {}'.format(metric_key)) 444 | print(stats0[metric_key]) 445 | if metric_key in stats0: 446 | if hasattr(stats0[metric_key], '__len__'): 447 | if len(stats0[metric_key]): 448 | print('Metric to plot: {}'.format(metric_key)) 449 | metrics_to_plot.append(metric_key) 450 | 451 | stats_list.sort(key=lambda x: float(x[1])) 452 | 453 | for i, (_, var_value, stats_value, _) in enumerate(stats_list): 454 | rt = stats_value['runtime'] 455 | print('theta = {}, runtime = {}'.format(var_value, rt)) 456 | 457 | if len(stats_list) > 7: 458 | # for n_neighbors pick 1,3,5,7,11,15,19 -> idx = [0,1,2,3,5,7,9] 459 | stats_list = [stats_list[i] for i in [0,1,2,3,5,7,9]] 460 | 461 | n_subplots = len(metrics_to_plot) 462 | n_cols = N_AXES_PER_ROW if n_subplots >= N_AXES_PER_ROW else n_subplots 463 | n_rows = ceil(n_subplots / n_cols) 464 | plot_count = count(1) 465 | 466 | n_values = len(stats_list) 467 | if n_values != N_CURVES: 468 | color_idx = np.linspace(0, 1, n_values+2) 469 | color_idx = color_idx[1:-1] 470 | else: 471 | color_idx = COLOR_IDX 472 | 473 | n = len(stats0['iter_online_count']) 474 | xx = range(n) 475 | for metric_key in metrics_to_plot: 476 | print('Plotting metric {}'.format(metric_key)) 477 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 478 | for i, (_, var_value, stats_value, _) in enumerate(stats_list): 479 | c = COLOR_MAP(color_idx[i]) 480 | 481 | val = int(float(var_value)*100) 482 | label = r'{}={}\%'.format(var_label, int(val)) 483 | if metric_key in SCATTER_METRIC: 484 | ax.scatter(xx, stats_value[metric_key], s=1, color=c, 485 | label=label) 486 | else: 487 | if kitti and metric_key.startswith('test_error'): 488 | test_errors = stats_value[metric_key] 489 | print('Final Test error: {:5.2f} for {}% of labels'.format( 490 | test_errors[-1]*100, var_value)) 491 | test_times = np.arange(1, len(test_errors) + 1)* 1000 492 | ax.plot(test_times, test_errors, color=c, label=label, 493 | lw=LW, marker='.') 494 | ax.set_ylim((0.0, 0.5)) 495 | # ax.set_xticks(test_times) 496 | else: 497 | ax.plot(stats_value[metric_key], color=c, label=label, 498 | lw=LW) 499 | 500 | ax.set_xlabel(X_LABEL_DEFAULT) 501 | ax.set_ylabel(PLOT_LABELS[metric_key]) 502 | if metric_key == LEGEND_METRIC: 503 | plt.legend(loc='best') 504 | 505 | plt.legend(loc='best') 506 | 507 | 508 | def plot_multi_run_single_var(stats_mean, stats_std, fig): 509 | """Multiple runs (random seeds) for a single variable (e.g. 
\vartheta)""" 510 | 511 | metrics_to_plot = [] 512 | for metric_key in METRIC_ORDER: 513 | if metric_key in stats_mean and len(stats_mean[metric_key]): 514 | metrics_to_plot.append(metric_key) 515 | 516 | n_subplots = len(metrics_to_plot) 517 | n_cols = 4 518 | n_rows = ceil(n_subplots / n_cols) 519 | plot_count = count(1) 520 | 521 | for metric_key in metrics_to_plot: 522 | print('Plotting metric {}'.format(metric_key)) 523 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 524 | metric_mean = stats_mean[metric_key] 525 | color = DEFAULT_MEAN_COLOR 526 | ax.plot(metric_mean, color=color, label=None, lw=2) 527 | metric_std = 1 * stats_std[metric_key] 528 | lb, ub = metric_mean - metric_std, metric_mean + metric_std 529 | color = DEFAULT_STD_COLOR 530 | ax.plot(lb, color=color) 531 | ax.plot(ub, color=color) 532 | ax.fill_between(range(len(lb)), lb, ub, facecolor=color, alpha=0.5) 533 | ax.set_xlabel(X_LABEL_DEFAULT) 534 | ax.set_ylabel(PLOT_LABELS[metric_key]) 535 | 536 | 537 | def plot_multi_run_multi_var(stats_list, fig): 538 | # Multiple runs (random seeds) for each value a variable takes 539 | 540 | headers = [r'$\vartheta$', 'Runtime (s)', 'Est. error (%)', 541 | 'Test error (%)'] 542 | table = [] 543 | for i in range(len(stats_list)): 544 | _, var_value, stats_value_mean, stats_value_std = stats_list[i] 545 | print('\n\nVar value = {}'.format(var_value)) 546 | runtime = stats_value_mean['runtime'] 547 | print('Runtime: {}'.format(runtime)) 548 | 549 | est_err_mean = stats_value_mean['clf_error_mixed'][-1] 550 | est_err_std = stats_value_std['clf_error_mixed'][-1] 551 | print('est_err: {} ({})'.format(est_err_mean, est_err_std)) 552 | s1 = '{:5.2f} ({:4.2f})'.format(est_err_mean*100, est_err_std*100) 553 | 554 | test_err_mean = stats_value_mean.get('test_error_ilp', None) 555 | test_err_std = stats_value_std.get('test_error_ilp', None) 556 | print('test_err: {}'.format(test_err_mean, test_err_std)) 557 | if test_err_mean is None: 558 | s2 = '-' 559 | else: 560 | s2 = '{:5.2f} ({:4.2f})'.format(test_err_mean*100, test_err_std*100) 561 | 562 | row = [var_value, runtime, s1, s2] 563 | table.append(row) 564 | 565 | print(tabulate(table, headers, tablefmt="latex")) 566 | 567 | metrics_to_plot = [] 568 | _, _, stats_value_mean0, _ = stats_list[0] 569 | for metric_key in METRIC_ORDER: 570 | metric = stats_value_mean0.get(metric_key, None) 571 | if metric is not None and len(metric): 572 | metrics_to_plot.append(metric_key) 573 | 574 | n_subplots = len(metrics_to_plot) 575 | n_cols = 4 576 | n_rows = ceil(n_subplots / n_cols) 577 | plot_count = count(1) 578 | 579 | for metric_key in metrics_to_plot: 580 | print('Plotting metric {}'.format(metric_key)) 581 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 582 | 583 | for i in range(len(stats_list)): 584 | _, var_value, stats_value_mean, stats_value_std = stats_list[i] 585 | 586 | metric_mean = stats_value_mean[metric_key] 587 | color = DEFAULT_MEAN_COLOR 588 | ax.plot(metric_mean, color=color, label=None, lw=2) 589 | metric_std = 1 * stats_value_std[metric_key] 590 | lb, ub = metric_mean - metric_std, metric_mean + metric_std 591 | color = DEFAULT_STD_COLOR 592 | ax.plot(lb, color=color) 593 | ax.plot(ub, color=color) 594 | ax.fill_between(range(len(lb)), lb, ub, facecolor=color, alpha=0.3) 595 | ax.set_xlabel(X_LABEL_DEFAULT) 596 | ax.set_ylabel(PLOT_LABELS[metric_key]) 597 | 598 | 599 | def plot_standard(single_run, single_var, stats, config, title, path): 600 | 601 | figsize = (11, 5) 602 | fig = plt.figure(1, figsize=figsize, 
dpi=200) 603 | 604 | print() 605 | if single_run and single_var: # default_run 606 | print('Plotting a single run with a single variable value') 607 | plot_single_run_single_var(stats[0], fig, config) 608 | elif single_run and not single_var: # var_theta 609 | print('Plotting a single run for each variable value') 610 | plot_single_run_multi_var(stats, fig, config) 611 | elif single_var and not single_run: # mean and std of default_run 612 | print('Plotting multiple runs for a single variable value') 613 | plot_multi_run_single_var(stats[0], stats[1], fig) 614 | elif not single_run and not single_var: 615 | print('Plotting multiple runs for multiple variable values') 616 | plot_multi_run_multi_var(stats, fig) 617 | print() 618 | 619 | # plt.legend(loc='upper right') 620 | dataset = config['dataset']['name'] 621 | dataset = 'kitti' if dataset.startswith('kitti') else dataset 622 | dataset = dataset.replace('_', ' ') 623 | if title == r'Default run' or dataset == 'kitti': 624 | title = dataset.upper() 625 | else: 626 | title = r'\textbf{' + title + r'}' + ' (' + dataset.upper() + ')' 627 | 628 | fig.suptitle(title, fontsize='xx-large', fontweight='bold') 629 | 630 | fig.tight_layout() 631 | fig.subplots_adjust(top=0.9) 632 | 633 | if path is not None: 634 | if path.endswith(STATS_FILE_EXT): 635 | path = path[:-len(STATS_FILE_EXT)] 636 | 637 | try: 638 | fig.savefig(path + '.pdf', orientation='landscape', transparent=False) 639 | except RuntimeError as e: 640 | print('Cannot save figure due to: \n{}'.format(e)) 641 | 642 | 643 | def plot_curves(stats, config, title='', path=None): 644 | plt.rc('text', usetex=True) # need dvipng in Ubuntu 645 | plt.rc('font', family='serif') 646 | # matplotlib.rcParams['text.latex.unicode'] = True 647 | 648 | if type(stats) is tuple: 649 | single_var = True 650 | stats_mean, stats_std = stats 651 | single_run = stats_std is None 652 | stats_print = stats_mean 653 | elif type(stats) is list: 654 | single_var = False 655 | var_name, var_value, stats_mean0, stats_std0 = stats[0] 656 | print('Plotting multi var: {}'.format(var_name)) 657 | single_run = stats_std0 is None 658 | stats_print = stats_mean0 659 | else: 660 | print('stats has type: {}'.format(type(stats).__name__)) 661 | raise TypeError('stats is neither list nor tuple!') 662 | 663 | print('\nStatistics{}Size{}Type\n{}'.format(' '*25, ' '*8, '-'*52)) 664 | for k in sorted(stats_print.keys()): 665 | v = stats_print[k] 666 | if v is not None: 667 | if hasattr(v, '__len__'): 668 | print('{:>32} {:>8} {:>9}'.format(k, len(v), type(v).__name__)) 669 | else: 670 | print('{:>32} {:>8.2f} {:>9}'.format(k, v, type(v).__name__)) 671 | 672 | if config is not None: 673 | print_config(config) 674 | 675 | plot_standard(single_run, single_var, stats, config, title, path) 676 | 677 | if single_var: 678 | if config['dataset']['name'] == 'mnist': 679 | test_pred_knn = stats_mean.get('test_pred_knn', None) 680 | if test_pred_knn is not None and len(test_pred_knn) > 0: 681 | plot_confusion_matrices(stats_mean, config) 682 | --------------------------------------------------------------------------------
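A minimal usage sketch for the plotting entry point above. This is not part of the repository: the stats keys and the config dictionary below are assumptions inferred from `plot_single_run_single_var` and `plot_standard`, a working LaTeX installation (with dvipng) is required because `plot_curves` enables `text.usetex`, and `print_config` from `ilp.helpers.params_parse` is assumed to simply pretty-print the config dictionary.

```python
import numpy as np
from matplotlib import pyplot as plt

from ilp.plots.plot_stats import plot_curves

# Synthetic single-run statistics: only keys that the single-run /
# single-variable code path actually reads (all values here are
# illustrative assumptions, not real experiment output).
n = 500
rng = np.random.RandomState(0)
stats_mean = {
    'iter_online_count': rng.poisson(5, size=n),                 # LP iterations per point
    'iter_online_duration': np.abs(rng.normal(0.01, 0.002, n)),  # LP time per point (s)
    'clf_error_mixed': np.clip(
        np.linspace(0.5, 0.1, n) + rng.normal(0, 0.01, n), 0., 1.),
    'runtime': 42.0,
}

# Minimal config: plot_standard only needs dataset['name'] here; a real run
# would pass the full experiment config loaded from experiments/cfg/*.yml.
config = {
    'dataset': {'name': 'mnist'},
    'data': {'n_burn_in': 100},
}

# A (stats_mean, None) tuple means: one variable value, one run (no std band).
plot_curves((stats_mean, None), config, title=r'Default run', path=None)
plt.show()
```

For experiments that vary a hyperparameter, `plot_curves` instead expects a list of `(var_name, var_value, stats_mean, stats_std)` tuples, which is dispatched to `plot_single_run_multi_var` or `plot_multi_run_multi_var` depending on whether `stats_std` is `None`.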