├── .gitignore
├── LICENSE
├── README.md
├── data
│   └── kitti_features.zip
└── ilp
    ├── __init__.py
    ├── algo
    │   ├── __init__.py
    │   ├── base_sl_graph.py
    │   ├── datastore.py
    │   ├── incremental_label_prop.py
    │   ├── knn_graph_utils.py
    │   ├── knn_sl_graph.py
    │   └── knn_sl_subgraph.py
    ├── constants.py
    ├── experiments
    │   ├── __init__.py
    │   ├── base.py
    │   ├── cfg
    │   │   ├── default.yml
    │   │   ├── var_k_L.yml
    │   │   ├── var_k_U.yml
    │   │   ├── var_n_L.yml
    │   │   ├── var_stream_labeled.yml
    │   │   └── var_theta.yml
    │   ├── default_run.py
    │   ├── var_n_labeled.py
    │   ├── var_n_neighbors_labeled.py
    │   ├── var_n_neighbors_unlabeled.py
    │   ├── var_stream_labeled.py
    │   └── var_theta.py
    ├── helpers
    │   ├── __init__.py
    │   ├── data_fetcher.py
    │   ├── data_flow.py
    │   ├── datasets.yml
    │   ├── fc_heap.py
    │   ├── log.py
    │   ├── params_parse.py
    │   └── stats.py
    └── plots
        ├── __init__.py
        └── plot_stats.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Compiled source #
2 | ###################
3 | *.com
4 | *.class
5 | *.dll
6 | *.exe
7 | *.o
8 | *.so
9 |
10 | # Packages #
11 | ############
12 | # it's better to unpack these files and commit the raw source
13 | # git has its own built in compression methods
14 | *.7z
15 | *.dmg
16 | *.gz
17 | *.iso
18 | *.jar
19 | *.rar
20 | *.tar
21 | *.zip
22 |
23 | # Logs and databases #
24 | ######################
25 | *.log
26 | *.sql
27 | *.sqlite
28 |
29 | # OS generated files #
30 | ######################
31 | .DS_Store
32 | .DS_Store?
33 | ._*
34 | .Spotlight-V100
35 | .Trashes
36 | ehthumbs.db
37 | Thumbs.db
38 |
39 | # Latex generated files #
40 | #########################
41 | *.log
42 | *.aux
43 | *.blg
44 | *.out
45 | *.gz
46 | *.bbl
47 |
48 | # IDE #
49 | #######
50 | .idea/
51 | __pycache__/
52 | build/
53 | cmake-build-debug/
54 | dist/
55 | _build/
56 | _generate/
57 | *.so
58 | *.py[cod]
59 | *.egg-info
60 |
61 |
62 | # Project Specific #
63 | ####################
64 | doc/
65 | *.mat
66 | *.p
67 | *.stat
68 | *.hist
69 | *.png
70 | */experiments/results/
71 | data/kitti_features/
72 | data/mnist/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 John Chiotellis
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Incremental Label Propagation
2 | This repository provides the implementation of our paper ["Incremental Semi-Supervised Learning from Streams for Object Classification"](https://vision.in.tum.de/_media/spezial/bib/chiotellis2018ilp.pdf) (Ioannis Chiotellis*, Franziska Zimmermann*, Daniel Cremers and Rudolph Triebel, IROS 2018). All results presented in our work were produced with this code.
3 |
 4 | * [Installation](#installation)
 5 | * [Datasets](#datasets)
 6 | * [Experiments](#experiments)
 7 | * [Publication](#publication)
 8 | * [License and Contact](#license-and-contact)
9 |
10 |
11 | ## Installation
12 | The code was developed in Python 3.5 under Ubuntu 16.04. You can clone the repo with:
13 | ```
14 | git clone https://github.com/johny-c/incremental-label-propagation.git
15 | ```
16 |
17 | ## Datasets
18 | * KITTI
19 |
20 | The repository includes 64-dimensional features extracted from KITTI sequences, compressed in a zip file (data/kitti_features.zip). The features will be extracted automatically when one of the included experiments is run on KITTI.
21 |
22 | * MNIST
23 |
24 | A script will automatically download the MNIST dataset if an experiment is run on it.
25 |
26 |
27 | ## Experiments
28 |
29 | The repository includes scripts that replicate the experiments presented in the paper:
30 |
31 | * Varying the number of labeled points or the ratio of labeled points in the data.
32 | * Varying the number of labeled or unlabeled neighbors considered for each node.
33 | * Varying the hyperparameter $$\theta$$ that controls the propagation area size.
34 |
35 | To run an experiment with varying $$\theta$$:
36 |
37 | python ilp/experiments/var_theta.py -d mnist
38 |
39 | You can set different experiment options in the .yml files found in the ilp/experiments/cfg directory.
40 |
41 | #### WARNING:
42 | The included experiment scripts compute and store statistics after every new data point, so the resulting output files can become very large.
43 |
44 |
45 | ## Publication
46 | If you use this code in your work, please cite the following paper.
47 |
48 | Ioannis Chiotellis*, Franziska Zimmermann*, Daniel Cremers and Rudolph Triebel, _"Incremental Semi-Supervised Learning from Streams for Object Classification"_, in proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018). ([pdf](https://vision.in.tum.de/_media/spezial/bib/chiotellis2018ilp.pdf))
49 |
50 | *equal contribution
51 |
52 | @InProceedings{chiotellis2018incremental,
53 | author = "I. Chiotellis and F. Zimmermann and D. Cremers and R. Triebel",
54 | title = "Incremental Semi-Supervised Learning from Streams for Object Classification",
55 |       booktitle = "IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)",
56 | year = "2018",
57 | month = "October",
58 | keywords={stream-based learning, sequential data, semi-supervised learning, object classification},
59 | note = {{[code]} },
60 | }
61 |
62 | ## License and Contact
63 |
64 | This work is released under the [MIT License](LICENSE).
65 |
66 | Contact **John Chiotellis** [:envelope:](mailto:chiotell@in.tum.de) for questions, comments and reporting bugs.
67 |
--------------------------------------------------------------------------------
/data/kitti_features.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/data/kitti_features.zip
--------------------------------------------------------------------------------
/ilp/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/__init__.py
--------------------------------------------------------------------------------
/ilp/algo/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/algo/__init__.py
--------------------------------------------------------------------------------
/ilp/algo/base_sl_graph.py:
--------------------------------------------------------------------------------
1 | from sklearn.externals import six
2 | from abc import ABCMeta, abstractmethod
3 |
4 |
5 | class BaseSemiLabeledGraph(six.with_metaclass(ABCMeta)):
6 | """
7 | Parameters
8 | ----------
9 |
10 |     datastore : algo.datastore.SemiLabeledDataStore
11 | A datastore to store observations as they arrive
12 |
13 | max_samples : int, optional (default=1000)
14 | The maximum number of points expected to be observed. Useful for
15 | memory allocation.
16 |
17 | max_labeled : {float, int}, optional
18 | Maximum expected labeled points ratio, or number of labeled points
19 |
20 | dtype : dtype, optional (default=np.float32)
21 |         Float precision (can also be float16 or float64).
22 |
23 | """
24 |
25 | def __init__(self, datastore):
26 |
27 | self.datastore = datastore
28 | self.max_samples = datastore.max_samples
29 | self.max_labeled = datastore.max_labeled
30 | self.dtype = datastore.dtype
31 | self.eps = datastore.eps
32 |
33 | self.n_labeled = 0
34 | self.n_unlabeled = 0
35 |
36 |
37 | @abstractmethod
38 | def build(self, X_l, X_u):
39 | raise NotImplementedError('build is not implemented!')
40 |
41 | @abstractmethod
42 | def add_node(self, x, ind, labeled):
43 | raise NotImplementedError('add_node is not implemented!')
44 |
45 | @abstractmethod
46 | def add_labeled_node(self, x, ind):
47 | raise NotImplementedError('add_labeled_node is not implemented!')
48 |
49 | @abstractmethod
50 | def add_unlabeled_node(self, x, ind):
51 | raise NotImplementedError('add_unlabeled_node is not implemented!')
52 |
53 | def get_n_nodes(self):
54 | return self.n_labeled + self.n_unlabeled
55 |
--------------------------------------------------------------------------------
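
A minimal sketch of how this interface is meant to be subclassed (the class and method bodies below are hypothetical; the actual implementation in this repository is `KnnSemiLabeledGraph` in `ilp/algo/knn_sl_graph.py`):

```python
from ilp.algo.base_sl_graph import BaseSemiLabeledGraph


class DummySemiLabeledGraph(BaseSemiLabeledGraph):
    """Toy subclass: all four abstract methods must be overridden."""

    def build(self, X_l, X_u):
        # bulk construction from the burn-in data
        self.n_labeled, self.n_unlabeled = len(X_l), len(X_u)

    def add_node(self, x, ind, labeled):
        # dispatch a single streamed point to the right insertion routine
        if labeled:
            return self.add_labeled_node(x, ind)
        return self.add_unlabeled_node(x, ind)

    def add_labeled_node(self, x, ind):
        self.n_labeled += 1
        return ind

    def add_unlabeled_node(self, x, ind):
        self.n_unlabeled += 1
        return ind
```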
/ilp/algo/datastore.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from sklearn.preprocessing import LabelBinarizer, label_binarize
3 |
4 | from ilp.constants import EPS_32, EPS_64
5 |
6 |
7 | class SemiLabeledDataStore:
8 |
9 | def __init__(self, max_samples, max_labeled, classes, precision='float32'):
10 |
11 | self.max_samples = max_samples
12 | self.max_labeled = max_labeled
13 | self.classes = classes
14 |
15 | self.precision = precision
16 | self.dtype = np.dtype(precision)
17 | self.eps = EPS_32 if self.dtype == np.dtype('float32') else EPS_64
18 | self.n_labeled = 0
19 | self.n_unlabeled = 0
20 |
21 | self.X_labeled = np.array([])
22 | self.X_unlabeled = np.array([])
23 | self.y_labeled = np.array([])
24 |
25 | def _allocate(self, n_features):
26 |
27 | # Allocate memory for data matrices
28 | self.X_labeled = np.zeros((self.max_labeled, n_features),
29 | dtype=self.dtype)
30 | self.X_unlabeled = np.zeros((self.max_samples, n_features),
31 | dtype=self.dtype)
32 |
33 | # Allocate memory for label matrix
34 | self.y_labeled = np.zeros((self.max_labeled, len(self.classes)),
35 | dtype=self.dtype)
36 |
37 | def append(self, x, y):
38 |
39 | if self.get_n_samples() == 0:
40 | self._allocate(len(x))
41 |
42 | if y == -1:
43 | ind_new = self.n_unlabeled
44 | self.X_unlabeled[ind_new] = x
45 | self.n_unlabeled += 1
46 | else:
47 | ind_new = self.n_labeled
48 | self.X_labeled[ind_new] = x
49 | self.y_labeled[ind_new] = label_binarize([y], self.classes)
50 | self.n_labeled += 1
51 |
52 | return ind_new
53 |
54 | def inverse_transform_labels(self, y_proba):
55 | if not hasattr(self, 'label_binarizer'):
56 | self.label_binarizer = LabelBinarizer()
57 | self.label_binarizer.fit(self.classes)
58 |
59 | return self.label_binarizer.inverse_transform(y_proba)
60 |
61 | def get_n_samples(self):
62 | return self.n_labeled + self.n_unlabeled
63 |
64 | def get_X_l(self):
65 | return self.X_labeled[:self.n_labeled]
66 |
67 | def get_X_u(self):
68 | return self.X_unlabeled[:self.n_unlabeled]
69 |
70 | def get_y_l(self):
71 | return self.y_labeled[:self.n_labeled]
72 |
73 | def get_y_l_int(self):
74 | return np.argmax(self.get_y_l(), axis=1)
--------------------------------------------------------------------------------
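
A minimal usage sketch of the datastore on toy data (the arguments mirror how `BaseExperiment.init_learner` in `ilp/experiments/base.py` constructs it):

```python
import numpy as np

from ilp.algo.datastore import SemiLabeledDataStore

store = SemiLabeledDataStore(max_samples=100, max_labeled=10, classes=(0, 1, 2))

store.append(np.array([0.1, 0.2, 0.3]), 1)    # labeled point of class 1
store.append(np.array([0.4, 0.5, 0.6]), -1)   # unlabeled point (y == -1)

print(store.get_X_l().shape)   # (1, 3): labeled features observed so far
print(store.get_X_u().shape)   # (1, 3): unlabeled features observed so far
print(store.get_y_l_int())     # [1]: integer labels recovered from one-hot rows
```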
/ilp/algo/incremental_label_prop.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from time import time
3 | from sklearn.utils.extmath import safe_sparse_dot as ssdot
4 | from sklearn.utils.validation import check_random_state
5 | from sklearn.preprocessing import normalize
6 |
7 | from ilp.algo.knn_graph_utils import construct_weight_mat
8 | from ilp.algo.knn_sl_graph import KnnSemiLabeledGraph
9 | from ilp.helpers.stats import JobType
10 | from ilp.helpers.log import make_logger
11 |
12 |
13 | logger = make_logger(__name__)
14 |
15 |
16 | class IncrementalLabelPropagation:
17 | """
18 | Parameters
19 | ----------
20 |
21 | datastore : algo.datastore.SemiLabeledDataStore
22 | A datastore instance to store observations as they arrive
23 |
24 | stats_worker : helper.stats.StatisticsWorker, optional
25 | A statistics computing unit (run in a separate thread)
26 |
27 | theta : float, optional
28 | The threshold of significance for a label update (defines the online nature of the algorithm). (default: 0.1)
29 |
30 | max_iter : int, optional
31 | The maximum number of iterations per point insertion. (default: 30)
32 |
33 | tol : float, optional
34 | The tolerance of absolute difference between two label distributions
35 | that indicates convergence. (default: 1e-3)
36 |
37 | params_graph : dict
38 | Parameters for the graph construction. Will be passed to the Graph
39 | constructor.
40 |
41 | params_offline_lp : dict
42 | Parameters for the offline label propagation that will be performed
43 | once `n_burn_in` number of samples have been observed.
44 |
45 | random_state : int seed, RandomState instance, or None (default)
46 | The seed of the pseudo random number generator to use when
47 | shuffling the data.
48 |
49 | iprint : integer, optional
50 |         The frequency of printing progress messages to stdout.
51 |
52 | n_jobs : integer, optional
53 | The number of CPUs to use to do the OVA (One Versus All, for
54 | multi-class problems) computation. -1 means 'all CPUs'. Defaults to 1.
55 |
56 |
57 | Attributes
58 | ----------
59 |
60 | classes : tuple, shape = (n_classes,)
61 | The classes expected to appear in the stream.
62 |
63 | """
64 |
65 | def __init__(self, datastore=None, stats_worker=None,
66 | params_graph=None, params_offline_lp=None, n_burn_in=0,
67 | theta=0.1, max_iter=30, tol=1e-3,
68 | random_state=42, n_jobs=-1, iprint=0):
69 |
70 | self.datastore = datastore
71 | self.max_iter = max_iter
72 | self.tol = tol
73 | self.theta = theta
74 | self.params_graph = params_graph
75 | self.params_offline_lp = params_offline_lp
76 | self.n_burn_in = n_burn_in
77 | self.random_state = random_state
78 |
79 | kernel = params_graph['kernel']
80 | if kernel == 'knn':
81 | Graph = KnnSemiLabeledGraph
82 | # elif kernel == 'm-knn':
83 | # Graph = MutualKnnSemiLabeledGraph
84 | else:
85 | raise NotImplementedError('Only knn graphs.')
86 |
87 | self.graph = Graph(datastore, **params_graph)
88 |
89 | self.n_jobs = n_jobs
90 | self.stats_worker = stats_worker
91 | self.iprint = iprint
92 |
93 | def fit_burn_in(self):
 94 |         """Fit a semi-supervised label propagation model on the burn-in data
 95 | 
 96 |         The bootstrap (burn-in) data set, containing both labeled and
 97 |         unlabeled samples, is read from the datastore; unlabeled samples
 98 |         carry the dedicated label value -1. The k-nearest-neighbor graph
 99 |         is built on these samples and an offline label propagation is run
100 |         until convergence to initialize the label estimates of the
101 |         unlabeled points, which are finally re-normalized to valid label
102 |         distributions.
103 | 
104 |
105 | Returns
106 | -------
107 | self : returns an instance of self.
108 | """
109 |
110 | self.random_state_ = check_random_state(self.random_state)
111 |
112 | # Get appearing classes
113 | classes = self.datastore.classes
114 |
115 | self.n_iter_online = 0
116 | self.n_burn_in_ = self.datastore.get_n_samples()
117 | logger.debug('FITTING BURN-IN WITH {} SAMPLES'.format(self.n_burn_in_))
118 |
119 | # actual graph construction (implementations should override this)
120 | logger.debug('Building graph....')
121 | self.graph.build(self.get_X_l(), self.get_X_u())
122 | logger.debug('Graph built.')
123 |
124 | u_max = self.datastore.max_samples
125 | # Initialize F_U with uniform label vectors
126 | self.y_unlabeled = np.full((u_max, len(classes)), 1 / len(classes),
127 | dtype=self.datastore.dtype)
128 |
129 | # Initialize F_U with zero label vectors
130 | # self.y_unlabeled = np.zeros((u_max, len(classes)),
131 | # dtype=self.datastore.dtype)
132 |
133 | # Offline label propagation on burn-in data set
134 | self._offline_lp(**self.params_offline_lp)
135 |
136 | # Normalize F_U as it might have numerically diverged from [0, 1]
137 | normalize(self.y_unlabeled, norm='l1', axis=1, copy=False)
138 |
139 | return self
140 |
141 | def predict(self, X, mode=None):
142 | """Predict the labels for a batch of data points, without actually
143 | inserting them in the graph.
144 |
145 | Parameters
146 | ----------
147 | X : array, shape (n_samples_batch, n_features)
148 | A batch of data points.
149 |
150 | mode : string
151 | Test with nearest labeled neighbors ('knn'), or nearest
152 | unlabeled neighbors ('lp'), their combination (default),
153 | or return both ('pair').
154 |
155 | Returns
156 | -------
157 | y : array, shape (n_samples_batch, n_classes)
158 | The predicted label distributions for the given points.
159 |
160 | """
161 |
162 | modes = ['knn', 'lp', 'pair']
163 | if mode is not None and mode not in modes:
164 | raise ValueError('predict_proba can have modes: {}'.format(modes))
165 |
166 | if X.ndim == 1:
167 | X = X.reshape(1, -1)
168 |
169 | if mode == 'pair':
170 | y_proba_knn, y_proba_lp = self.predict_proba(X, 'pair')
171 | y_pred_knn = self.datastore.inverse_transform_labels(y_proba_knn)
172 | y_pred_lp = self.datastore.inverse_transform_labels(y_proba_lp)
173 | return y_pred_knn, y_pred_lp
174 |
175 | y_proba = self.predict_proba(X, mode)
176 | y_pred = self.datastore.inverse_transform_labels(y_proba)
177 |
178 | return y_pred
179 |
180 | def predict_proba(self, X, mode=None):
181 |
182 | modes = ['knn', 'lp', 'pair']
183 | if mode is not None and mode not in modes:
184 | raise ValueError('predict_proba can have modes: {}'.format(modes))
185 |
186 | u, l = self.graph.n_unlabeled, self.graph.n_labeled
187 |
188 | logger.info('Now testing on {} samples...'.format(len(X)))
189 | neighbors, distances = self.graph.find_labeled_neighbors(X)
190 | affinity_mat = construct_weight_mat(neighbors, distances,
191 | (X.shape[0], l), self.graph.dtype)
192 | p_tl = normalize(affinity_mat.tocsr(), norm='l1', axis=1)
193 | y_from_labeled = ssdot(p_tl, self.datastore.y_labeled[:l], True)
194 |
195 | neighbors, distances = self.graph.find_unlabeled_neighbors(X)
196 | affinity_mat = construct_weight_mat(neighbors, distances,
197 | (X.shape[0], u), self.graph.dtype)
198 | p_tu = normalize(affinity_mat.tocsr(), norm='l1', axis=1)
199 | y_from_unlabeled = ssdot(p_tu, self.y_unlabeled[:u], True)
200 |
201 | y_pred_proba = y_from_labeled + y_from_unlabeled
202 | logger.info('Labels have been predicted.')
203 |
204 | if mode is None:
205 | return y_pred_proba
206 | elif mode == 'knn':
207 | return y_from_labeled
208 | elif mode == 'lp':
209 | return y_from_unlabeled
210 | elif mode == 'pair':
211 | return y_from_labeled, y_pred_proba
212 |
213 | def get_X_l(self):
214 | return self.datastore.get_X_l()
215 |
216 | def get_X_u(self):
217 | return self.datastore.get_X_u()
218 |
219 | def set_params(self, L):
220 | self.graph.reset_metric(L)
221 | tic = time()
222 | self.graph.build(self.get_X_l(), self.get_X_u())
223 | toc = time()
224 | logger.info('Reconstructed graph in {:.4f}s\n'.format(toc-tic))
225 |
226 | def _offline_lp(self, return_iter=False, max_iter=30, tol=0.001):
227 | """Perform the offline label propagation until convergence of the label
228 | estimates of the unlabeled points.
229 |
230 | Parameters
231 | ----------
232 | return_iter : bool, default=False
233 | Whether or not to return the number of iterations till convergence
234 | of the label estimates.
235 |
236 | Returns
237 | -------
238 | y_unlabeled, num_iter :
239 | the new label estimates and optionally the number of iterations
240 | """
241 |
242 | logger.debug('Doing Offline LP...')
243 |
244 | u, l = self.graph.n_unlabeled, self.graph.n_labeled
245 |
246 | p_ul = self.graph.subgraph_ul.transition_matrix[:u]
247 | p_uu = self.graph.subgraph_uu.transition_matrix[:u, :u]
248 | y_unlabeled = self.y_unlabeled[:u]
249 | y_labeled = self.datastore.y_labeled
250 |
251 | # First iteration
252 | y_static = ssdot(p_ul, y_labeled, dense_output=True)
253 |
254 | # Continue loop
255 | n_iter = 0
256 | converged = False
257 | while n_iter < max_iter and not converged:
258 | y_unlabeled_prev = y_unlabeled.copy()
259 | y_unlabeled = y_static + ssdot(p_uu, y_unlabeled, True)
260 | n_iter += 1
261 |
262 | converged = _converged(y_unlabeled, y_unlabeled_prev, tol)
263 |
264 | logger.info('Offline LP took {} iterations'.format(n_iter))
265 |
266 | if return_iter:
267 | return y_unlabeled, n_iter
268 | else:
269 | return y_unlabeled
270 |
271 | def fit_incremental(self, x_new, y_new):
272 |
273 | n_samples = self.datastore.get_n_samples()
274 | if n_samples == 0 and self.stats_worker is not None:
275 | logger.info('\n\nStarting the Statistics Worker\n\n')
276 | self.stats_worker.start()
277 |
278 | if n_samples < self.n_burn_in:
279 | logger.debug('Still in burn-in phase... observed {:>4} '
280 | 'points'.format(n_samples))
281 | self.datastore.append(x_new, y_new)
282 | if n_samples == self.n_burn_in - 1:
283 | logger.debug('Burn-in complete!')
284 | self.fit_burn_in()
285 | else:
286 | ind_new = self.datastore.append(x_new, y_new)
287 | self._fit_incremental(x_new, y_new, ind_new)
288 |
289 | def _fit_incremental(self, x_new, y_new, ind_new):
290 | """Fit a single new point
291 |
292 | Args:
293 | x_new : array_like, shape (1, n_features)
294 | A new data point.
295 |
296 | y_new : int
297 | Label of the new data point (-1 if point is unlabeled).
298 |
299 | ind_new : int
300 | Index of the new point in the data store.
301 |
302 | Returns:
303 | IncrementalLabelPropagation: a reference to self.
304 | """
305 |
306 | tic = time()
307 |
308 | if self.n_iter_online == 0:
309 | self.tic_iprint = time()
310 |
311 | labeled = y_new != -1
312 |
313 | self.graph.add_node(x_new, ind_new, labeled)
314 |
315 | _, n_in_iter = self._propagate_single(ind_new, y_new, return_iter=True)
316 | self.n_iter_online += 1
317 |
318 | # Update statistics
319 | dt = time() - tic
320 | self.log_stats(JobType.ONLINE_ITER, dt=dt, n_in_iter=n_in_iter)
321 |
322 | if not labeled:
323 | # Track prediction entropy and accuracy
324 | label_vec = self.y_unlabeled[ind_new][None, :]
325 | pred = self.datastore.inverse_transform_labels(label_vec)
326 | self.log_stats(JobType.POINT_PREDICTION, vec=label_vec, y=pred)
327 |
328 | # Print information if needed
329 | if self.n_iter_online % self.iprint == 0:
330 | dt = time() - self.tic_iprint
331 | max_samples = self.datastore.max_samples
332 | n_samples_curr = self.graph.get_n_nodes()
333 | n_samples_prev = n_samples_curr - self.iprint
334 | logger.info('Iterations {} to {}/{} took {:.4f}s'.
335 | format(n_samples_prev, n_samples_curr,
336 | max_samples, dt))
337 | self.tic_iprint = time()
338 | self.log_stats(JobType.PRINT_STATS)
339 |
340 | # Normalize y_u as it might have diverged from [0, 1]
341 | u = self.graph.n_unlabeled
342 | normalize(self.y_unlabeled[:u], norm='l1', axis=1, copy=False)
343 |
344 | return self
345 |
346 | def log_stats(self, job_type, **kwargs):
347 | if self.stats_worker is not None:
348 | d = dict(job_type=job_type)
349 | d.update(**kwargs)
350 | self.stats_worker.send(d)
351 |
352 | def _propagate_single(self, ind_new, y_new, return_iter=False):
353 | """Perform label propagation until convergence of the label
354 | estimates of the unlabeled points. Assume the new node has already
355 | been added to the graph, but no label has been estimated.
356 |
357 | Parameters
358 | ----------
359 | ind_new : int
360 | The index of the new observation determined during graph addition.
361 |
362 | y_new : int
363 | The label of the new observation (-1 if point is unlabeled).
364 |
365 | return_iter : bool, default=False
366 | Whether to return the number of iterations until convergence of
367 | the label estimates.
368 |
369 | Returns
370 | -------
371 | y_unlabeled, num_iter : returns the new label estimates and optionally
372 | the number of iterations
373 | """
374 | # The number of labeled and unlabeled nodes now includes the new point
375 | y_u = self.y_unlabeled
376 | y_l = self.datastore.y_labeled
377 |
378 | p_ul = self.graph.subgraph_ul.transition_matrix
379 | p_uu = self.graph.subgraph_uu.transition_matrix
380 |
381 | a_rev_ul = self.graph.subgraph_ul.rev_adj
382 | a_rev_uu = self.graph.subgraph_uu.rev_adj
383 |
384 | if y_new == -1:
385 | # Estimate the label of the new unlabeled point
386 | label_new = ssdot(p_ul[ind_new], y_l, True) \
387 | + ssdot(p_uu[ind_new], y_u, True)
388 | y_u[ind_new] = label_new
389 |
390 | # The first LP candidates are the unlabeled samples that have
391 | # the new point as a nearest neighbor
392 | candidates = a_rev_uu.get(ind_new, set())
393 | else:
394 | # The label of the new labeled point is already in the data store
395 | candidates = a_rev_ul.get(ind_new, set())
396 |
397 | # Initialize a tentative label matrix / hash-map
398 | y_u_tent = {} # y_u[:u].copy()
399 |
400 | # Tentative labels are the label est. after the new point insertion
401 | candid1_norms = []
402 | for ind in candidates:
403 | y_u_tent.setdefault(ind, y_u[ind].copy())
404 | label = ssdot(p_ul[ind], y_l, True) + ssdot(p_uu[ind], y_u, True)
405 | y_u_tent[ind] = label.ravel()
406 |
407 | n_updates_per_iter = []
408 | n_iter = 0
409 | k_u = self.graph.n_neighbors_unlabeled
410 | u = max(self.graph.n_unlabeled, 1)
411 | max_iter = int(np.log(u) / np.log(k_u)) if k_u > 1 else self.max_iter
412 | while len(candidates) and n_iter < max_iter: # < self.max_iter:
413 |
414 | # Pick the ones that change significantly and change them
415 | updates, norm = filter_and_update(candidates, y_u_tent, y_u,
416 | self.theta)
417 | n_updates_per_iter.append(len(updates))
418 |
419 | # Get the next set of candidates (farther from the source)
420 | candidates = get_next_candidates(updates, y_u_tent, y_u, a_rev_uu,
421 | p_uu)
422 |
423 | n_iter += 1
424 |
425 | # Print the total number of updates
426 | n_updates = sum(n_updates_per_iter)
427 | if n_updates:
428 | logger.info('Iter {:6}: {:6} updates in {:2} LP iters, '
429 | 'max_iter = {:2}'
430 | .format(self.n_iter_online, n_updates, n_iter, max_iter))
431 |
432 | if return_iter:
433 | return y_u, n_iter
434 | else:
435 | return y_u
436 |
437 |
438 | def get_next_candidates(major_changes, y_u_tent, y_u, a_rev_uu, p_uu):
439 | candidates = set()
440 | for index, label_diff in major_changes:
441 | back_neighbors = a_rev_uu.get(index, set())
442 | for neigh in back_neighbors:
443 | y_u_tent.setdefault(neigh, y_u[neigh].copy())
444 | y_u_tent[neigh] += ssdot(p_uu[neigh, index], label_diff, True)
445 | candidates.add(neigh)
446 | return candidates
447 |
448 |
449 | def filter_and_update(candidates, y_u_tent, y_u, theta, top_ratio=None):
450 |
451 | # Store for visualization all norms to see how to tune theta
452 | major_updates = []
453 | updates_norm = 0.
454 | candidate_changes = []
455 | for candidate in candidates:
456 | dy_u = y_u_tent[candidate] - y_u[candidate]
457 | dy_u_norm = np.abs(dy_u).sum()
458 | if top_ratio is not None:
459 | candidate_changes.append((dy_u_norm, dy_u, candidate))
460 | else:
461 | if dy_u_norm > theta:
462 | # Apply the update
463 | y_u[candidate] = y_u_tent[candidate]
464 | major_updates.append((candidate, dy_u))
465 | updates_norm += dy_u_norm
466 |
467 | if top_ratio is None:
468 | return major_updates, updates_norm
469 |
470 | # Sort changes by descending norm and select the top k candidates for LP
471 | n_candidates = len(candidates)
472 | candidate_changes.sort(reverse=True, key=lambda x: x[0])
473 |
474 | # Apply the changes to the top k candidates
475 | top_k = int(top_ratio * n_candidates)
476 | for _, dy_u, candidate in candidate_changes[:top_k]:
477 | # Apply the update
478 | y_u[candidate] = y_u_tent[candidate]
479 | major_updates.append((candidate, dy_u))
480 | updates_norm += np.abs(dy_u).sum()
481 |
482 | return major_updates, updates_norm
483 |
484 |
485 | def _converged(y_curr, y_prev, tol=0.01):
486 | """basic convergence check"""
487 | return np.abs(y_curr - y_prev).max() < tol
488 |
--------------------------------------------------------------------------------
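
An end-to-end toy sketch of the online learner (the data and labeling scheme below are hypothetical; the parameter dictionaries mirror how `BaseExperiment.init_learner` in `ilp/experiments/base.py` wires the config into the constructor):

```python
import numpy as np

from ilp.algo.datastore import SemiLabeledDataStore
from ilp.algo.incremental_label_prop import IncrementalLabelPropagation

# Toy stream: three Gaussian blobs, with every 5th label observed
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + 4 * c for c in range(3)])
y = np.repeat([0, 1, 2], 100)
order = rng.permutation(len(y))

datastore = SemiLabeledDataStore(max_samples=len(y), max_labeled=len(y),
                                 classes=(0, 1, 2))
ilp = IncrementalLabelPropagation(
    datastore=datastore,
    params_graph=dict(kernel='knn', n_neighbors_labeled=3,
                      n_neighbors_unlabeled=3),
    params_offline_lp=dict(max_iter=30, tol=1e-3),
    n_burn_in=50, theta=0.1, iprint=100)

for t, i in enumerate(order):
    y_obs = y[i] if t % 5 == 0 else -1     # -1 marks an unlabeled observation
    ilp.fit_incremental(X[i], y_obs)       # burn-in first, then online updates

# Predict labels for unseen points without inserting them into the graph
print(ilp.predict(np.array([[0.0, 0.0], [8.0, 8.0]])))
```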
/ilp/algo/knn_graph_utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from sklearn.metrics.pairwise import euclidean_distances
3 | from scipy.sparse import coo_matrix
4 |
5 |
6 | def squared_distances(X1, X2, L=None):
7 |
8 | if L is None:
9 | dist = euclidean_distances(X1, X2, squared=True)
10 | else:
11 | dist = euclidean_distances(X1.dot(L.T), X2.dot(L.T), squared=True)
12 |
13 | return dist
14 |
15 |
16 | def get_nearest(distances, n_neighbors):
17 |
18 | n, m = distances.shape
19 |
20 | neighbors = np.argpartition(distances, n_neighbors - 1, axis=1)
21 | neighbors = neighbors[:, :n_neighbors]
22 |
23 | return neighbors, distances[np.arange(n)[:, None], neighbors]
24 |
25 |
26 | def find_nearest_neighbors(X1, X2, n_neighbors, L=None):
27 | """
28 | Args:
29 | X1 (array_like): [n_samples, n_features] input data points
30 | X2 (array_like): [m_samples, n_features] reference data points
31 | n_neighbors (int): number of nearest neighbors to find
32 | L (array) : linear transformation for Mahalanobis distance computation
33 |
34 | Returns:
35 | tuple:
36 |             (array_like): [n_samples, n_neighbors] indices of nearest neighbors
37 |             (array_like): [n_samples, n_neighbors] distances to nearest neighbors
38 |
39 | """
40 |
41 | dist = squared_distances(X1, X2, L)
42 |
43 | if X1 is X2:
44 | np.fill_diagonal(dist, np.inf)
45 |
46 | n, m = X1.shape[0], X2.shape[0]
47 |
48 | neigh_ind = np.argpartition(dist, n_neighbors - 1, axis=1)
49 | neigh_ind = neigh_ind[:, :n_neighbors]
50 |
51 | return neigh_ind, dist[np.arange(n)[:, None], neigh_ind]
52 |
53 |
54 | def construct_weight_mat(neighbors, distances, shape, dtype):
55 |
56 | n, k = neighbors.shape
57 | rows = np.repeat(range(n), k)
58 | cols = neighbors.ravel()
59 | weights = np.exp(-distances.ravel())
60 | mat = coo_matrix((weights, (rows, cols)), shape, dtype)
61 |
62 | return mat
63 |
--------------------------------------------------------------------------------
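
A small sketch on toy arrays of how these utilities combine the k-NN search with the exponential weighting used throughout the graph code:

```python
import numpy as np

from ilp.algo.knn_graph_utils import find_nearest_neighbors, construct_weight_mat

rng = np.random.RandomState(0)
X1 = rng.rand(5, 3)    # query points
X2 = rng.rand(8, 3)    # reference points

# Indices and squared distances of the 2 nearest references per query point
neigh_ind, dist = find_nearest_neighbors(X1, X2, n_neighbors=2)

# Sparse (5 x 8) weight matrix with w = exp(-squared distance) on the kNN edges
W = construct_weight_mat(neigh_ind, dist, (5, 8), np.float64)
print(W.toarray().round(3))
```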
/ilp/algo/knn_sl_graph.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from scipy.sparse import spdiags
3 |
4 | from ilp.algo.knn_graph_utils import squared_distances, find_nearest_neighbors
5 | from ilp.algo.knn_sl_subgraph import KnnSubGraph
6 | from ilp.algo.base_sl_graph import BaseSemiLabeledGraph
7 |
8 | N_ITER_ELIMINATE_ZEROS = 20
9 |
10 |
11 | class KnnSemiLabeledGraph(BaseSemiLabeledGraph):
12 | """
13 | Parameters
14 | ----------
15 |
16 | n_neighbors_labeled : int, optional (default=1)
17 |         The number of labeled neighbors to use if the 'knn' kernel is used.
18 |
19 | n_neighbors_unlabeled : int, optional (default=7)
20 |         The number of unlabeled neighbors to use if the 'knn' kernel is used.
21 |
22 | max_samples : int, optional
23 | The maximum number of points expected to be observed. Useful for
24 | memory allocation. (default: 1000)
25 |
26 | max_labeled : {float, int}, optional
27 | Maximum expected labeled points ratio, or number of labeled points
28 |
29 | dtype : dtype, optional
30 |         Float precision (default is float32; can also be float16 or float64).
31 |
32 |
33 | Attributes
34 | ----------
35 |
36 | L : array-like, shape (n_features_out, n_features_in)
37 |
38 | weight_matrix_{xy} : {array-like, sparse matrix}, shape = [n_samples, n_samples]
39 | xy can be in {ll, lu, ul, uu} indicating labeled or unlabeled points
40 |
41 | transition_matrix_{xy} : {array-like, sparse matrix}, shape = [n_samples, n_samples]
42 |
43 | adj_list_{xy} : dict that contains the graph connectivity ->
44 | keys = nodes indices, values = neighbor nodes indices
45 |
46 | """
47 |
48 | def __init__(self, datastore, n_neighbors_labeled=1,
49 | n_neighbors_unlabeled=7, **kwargs):
50 |
51 | super(KnnSemiLabeledGraph, self).__init__(datastore=datastore)
52 | self.n_neighbors_labeled = n_neighbors_labeled
53 | self.n_neighbors_unlabeled = n_neighbors_unlabeled
54 |
55 | max_l, max_u = self.max_labeled, self.max_samples
56 | dtype = self.dtype
57 |
58 | self.subgraph_ll = KnnSubGraph(n_neighbors=n_neighbors_labeled,
59 | dtype=dtype, shape=(max_l, max_l))
60 |
61 | self.subgraph_lu = KnnSubGraph(n_neighbors=n_neighbors_unlabeled,
62 | dtype=dtype, shape=(max_l, max_u))
63 |
64 | self.subgraph_ul = KnnSubGraph(n_neighbors=n_neighbors_labeled,
65 | dtype=dtype, shape=(max_u, max_l))
66 |
67 | self.subgraph_uu = KnnSubGraph(n_neighbors=n_neighbors_unlabeled,
68 | dtype=dtype, shape=(max_u, max_u))
69 |
70 | self.L = None
71 |
72 | def build(self, X_l, X_u):
73 | """
74 |         Build the graph for online label propagation by computing the weighted adjacency matrices of its four subgraphs.
75 |
76 | Parameters
77 | ----------
78 |
79 | X_l : array-like, shape [l_samples, n_features], the labeled features
80 |
81 | X_u : array-like, shape [u_samples, n_features], the unlabeled features
82 |
83 | """
84 |
85 |         print('Building graph with {} labeled and {} unlabeled '
86 |               'samples...'.format(X_l.shape[0], X_u.shape[0]))
87 |
88 | self.subgraph_ll.build(X_l, X_l, self.L)
89 | self.subgraph_lu.build(X_l, X_u, self.L)
90 | self.subgraph_ul.build(X_u, X_l, self.L)
91 | self.subgraph_uu.build(X_u, X_u, self.L)
92 |
93 | self.n_labeled = X_l.shape[0]
94 | self.n_unlabeled = X_u.shape[0]
95 |
96 | print('Computing transitions...')
97 |
98 | self._compute_transitions()
99 |
100 | def find_labeled_neighbors(self, X):
101 |
102 | X_l = self.datastore.X_labeled[:self.n_labeled]
103 | return find_nearest_neighbors(X, X_l, self.n_neighbors_labeled, self.L)
104 |
105 | def find_unlabeled_neighbors(self, X):
106 |
107 | X_u = self.datastore.X_unlabeled[:self.n_unlabeled]
108 | return find_nearest_neighbors(X, X_u, self.n_neighbors_unlabeled,
109 | self.L)
110 |
111 | def add_node(self, x, ind, labeled):
112 | if labeled:
113 | res = self.add_labeled_node(x, ind)
114 | else:
115 | res = self.add_unlabeled_node(x, ind)
116 |
117 | # Periodically remove explicit zeros from the sparse matrices
118 | if self.get_n_nodes() % N_ITER_ELIMINATE_ZEROS == 0:
119 |
120 | self.subgraph_ll.eliminate_zeros()
121 | self.subgraph_lu.eliminate_zeros()
122 | self.subgraph_ul.eliminate_zeros()
123 | self.subgraph_uu.eliminate_zeros()
124 |
125 | return res
126 |
127 | def add_labeled_node(self, x_new, ind_new):
128 |
129 | # Compute distances to all other labeled nodes
130 | X_l = self.datastore.X_labeled[:self.n_labeled]
131 | distances = squared_distances(x_new.reshape(1, -1), X_l, self.L)
132 |
133 | # Update the labeled-labeled subgraph
134 | self.subgraph_ll.append_row(ind_new, distances)
135 | self.subgraph_ll.update_columns(ind_new, distances)
136 |
137 | # Compute distances to all other unlabeled nodes
138 | X_u = self.datastore.X_unlabeled[:self.n_unlabeled]
139 | distances = squared_distances(x_new.reshape(1, -1), X_u, self.L)
140 |
141 | # Update the labeled-unlabeled subgraph
142 | self.subgraph_lu.append_row(ind_new, distances)
143 |
144 | # Update the unlabeled-labeled subgraph
145 | self.subgraph_ul.update_columns(ind_new, distances)
146 |
147 | self.n_labeled += 1
148 |
149 | # Compute normalized weight matrix (matrices)
150 | self._compute_transitions()
151 |
152 | return ind_new
153 |
154 | def add_unlabeled_node(self, x_new, ind_new):
155 |
156 |         # Compute distances to all other unlabeled nodes
157 |         X_u = self.datastore.X_unlabeled[:self.n_unlabeled]
158 | distances = squared_distances(x_new.reshape(1, -1), X_u, self.L)
159 |
160 |         # Update the unlabeled-unlabeled subgraph
161 | self.subgraph_uu.append_row(ind_new, distances)
162 | self.subgraph_uu.update_columns(ind_new, distances)
163 |
164 | # Compute distances to all labeled nodes
165 | X_l = self.datastore.X_labeled[:self.n_labeled]
166 | distances = squared_distances(x_new.reshape(1, -1), X_l, self.L)
167 |
168 |         # Update the unlabeled-labeled subgraph
169 | self.subgraph_ul.append_row(ind_new, distances)
170 |
171 |         # Update the labeled-unlabeled subgraph
172 | self.subgraph_lu.update_columns(ind_new, distances)
173 |
174 | self.n_unlabeled += 1
175 |
176 | # Compute normalized weight matrix (matrices)
177 | self._compute_transitions()
178 |
179 | return ind_new
180 |
181 | def _compute_transitions(self):
182 | """Normalize the weight matrices by dividing with the row sums"""
183 |
184 | self.row_sum_l = self.subgraph_ll.weight_matrix.sum(axis=1) + \
185 | self.subgraph_lu.weight_matrix.sum(axis=1)
186 |
187 | self.row_sum_u = self.subgraph_ul.weight_matrix.sum(axis=1) + \
188 | self.subgraph_uu.weight_matrix.sum(axis=1)
189 |
190 | # Avoid division by zero
191 | actual_l = self.row_sum_l[:self.n_labeled]
192 | actual_l[actual_l < self.eps] = 1.
193 | # print('Min value l: ', actual_l.min())
194 | actual_u = self.row_sum_u[:self.n_unlabeled]
195 | actual_u[actual_u < self.eps] = 1.
196 | # print('Min value u: ', actual_u.min())
197 |
198 | row_sum_l_inv = 1 / np.asarray(actual_l, dtype=self.dtype)
199 | row_sum_l_inv[row_sum_l_inv == np.inf] = 1
200 |
201 | row_sum_u_inv = 1 / np.asarray(actual_u, dtype=self.dtype)
202 | row_sum_u_inv[row_sum_u_inv == np.inf] = 1
203 |
204 | # Temporary divisors (diagonal pre-multiplier matrices)
205 | diag_l = spdiags(row_sum_l_inv.ravel(), 0, *self.subgraph_ll.shape)
206 | diag_u = spdiags(row_sum_u_inv.ravel(), 0, *self.subgraph_uu.shape)
207 |
208 | self.subgraph_ll.update_transitions(diag_l)
209 | self.subgraph_lu.update_transitions(diag_l)
210 | self.subgraph_ul.update_transitions(diag_u)
211 | self.subgraph_uu.update_transitions(diag_u)
212 |
213 | def reset_metric(self, L):
214 | self.L = L
215 |
--------------------------------------------------------------------------------
/ilp/algo/knn_sl_subgraph.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from scipy.sparse import csr_matrix, coo_matrix
3 | from sklearn.utils.extmath import safe_sparse_dot as ssdot
4 |
5 | from ilp.helpers.fc_heap import FixedCapacityHeap as FSH
6 | from ilp.algo.knn_graph_utils import find_nearest_neighbors, get_nearest, construct_weight_mat
7 |
8 |
9 | class KnnSubGraph:
10 | def __init__(self, n_neighbors=1, dtype=float, shape=None, **kwargs):
11 |
12 | self.n_neighbors = n_neighbors
13 | self.dtype = dtype
14 | self.shape = shape
15 |
16 | self.weight_matrix = csr_matrix(shape, dtype=dtype)
17 | self.transition_matrix = csr_matrix(shape, dtype=dtype)
18 | self.radii = np.zeros(shape[0], dtype=dtype)
19 |
20 | self.adj = {}
21 | self.rev_adj = {}
22 |
23 | def build(self, X1, X2, L=None):
24 |
25 | neigh_ind, dist = find_nearest_neighbors(X1, X2, self.n_neighbors, L)
26 | weight_matrix = construct_weight_mat(neigh_ind, dist, self.shape,
27 | self.dtype)
28 | self.weight_matrix = weight_matrix.tocsr()
29 | self.radii[:len(dist)] = dist[:, self.n_neighbors-1]
30 | self.adj = self.adj_from_weight_mat(weight_matrix)
31 | self.rev_adj = self.rev_adj_from_weight_mat(weight_matrix)
32 |
33 | def adj_from_weight_mat(self, weight_mat):
34 |         """Get the non-zero cols for each row and insert them into a
35 |         FixedCapacityHeap (FSH) in the form (weight, ind)
36 |
37 | Args:
38 | weight_mat (coo_matrix): a weights submatrix
39 |
40 | Returns:
41 | adj_list (dict): a dictionary where keys are indices and values
42 | are FSHs
43 |
44 | """
45 |
46 | # Create a list of adjacent vertices for each node
47 | print('Creating hashmap for {} nodes...'.format(self.shape[0]))
48 | adj_list = {i: [] for i in range(self.shape[0])}
49 | print('Iterating over weightmat.row, col, data...')
50 | for r, c, w in zip(weight_mat.row, weight_mat.col, weight_mat.data):
51 | adj_list[r].append((w, c))
52 |
53 | # Convert each list to a FixedCapacityHeap
54 | print('Converting to FCH')
55 | for node, neighbors in adj_list.items():
56 | adj_list[node] = FSH(neighbors, capacity=self.n_neighbors)
57 |
58 | return adj_list
59 |
60 | def rev_adj_from_weight_mat(self, weight_mat):
61 | # Create a list of adjacent vertices for each node
62 | # adj_list = {i: set() for i in range(self.shape[1])}
63 | adj_list = {}
64 | for r, c, w in zip(weight_mat.row, weight_mat.col, weight_mat.data):
65 | adj_list.setdefault(c, set()).add(r)
66 | # adj_list[c].add(r)
67 |
68 | return adj_list
69 |
70 | def update_transitions(self, normalizer):
71 | self.transition_matrix = ssdot(normalizer, self.weight_matrix)
72 |
73 | def eliminate_zeros(self):
74 | self.weight_matrix.eliminate_zeros()
75 | self.transition_matrix.eliminate_zeros()
76 |
77 | def append_row(self, index, distances):
78 |
79 | # Identify the k nearest neighbors
80 | nearest, dist_nearest = get_nearest(distances, self.n_neighbors)
81 | nearest = nearest.ravel()
82 |
83 | # Create the new node's adjacency list
84 | weights = np.exp(-dist_nearest.ravel())
85 | lst = [(w, i) for w, i in zip(weights, nearest)]
86 | self.adj[index] = FSH(lst, self.n_neighbors)
87 |
88 | # Update the reverse adjacency list
89 | for w, i in zip(weights, nearest):
90 | self.rev_adj.setdefault(i, set()).add(index)
91 | # self.rev_adj[i].add(index)
92 |
93 |         # Update the weight matrix (append the row vector)
94 | row = [index] * len(weights)
95 | row_new = csr_matrix((weights, (row, nearest)), self.shape, self.dtype)
96 | self.weight_matrix = self.weight_matrix + row_new
97 |
98 | def update_columns(self, ind_new, distances):
99 | """
100 |
101 | Parameters
102 | ----------
103 | ind_new : int
104 | Index of the new point.
105 |
106 | distances : array
107 | Array of distances of the new point to the reference points of
108 | the subgraph.
109 |
110 | """
111 |
112 | distances = distances.ravel()
113 | # Identify the samples that have the new point in their knn radius
114 | back_refs, = np.where(distances < self.radii[:len(distances)])
115 | back_weights = np.exp(-distances[back_refs])
116 |
117 |         # Update the weight matrix (compute the column update)
118 | update_mat = self._update(ind_new, back_refs, back_weights)
119 | self.weight_matrix = self.weight_matrix + update_mat.tocsr()
120 |
121 | def _update(self, ind_new, back_refs, weights_new, eps=1e-12):
122 | row, col, val = [], [], []
123 | # row_del, col_del, val_del = [], [], []
124 | for neigh_new, weight_new in zip(back_refs, weights_new):
125 | neighbors_heap = self.adj[neigh_new] # FSH with (weight, ind)
126 | inserted, removed = neighbors_heap.push((weight_new, ind_new))
127 | if inserted: # neigh got a new nearest neighbor: ind_new
128 | row.append(neigh_new)
129 | col.append(ind_new)
130 | val.append(weight_new)
131 |
132 | # Update the reverse adjacency list
133 | self.rev_adj.setdefault(ind_new, set()).add(neigh_new)
134 | if removed is not None: # old point swapped a nearest neighbor
135 | # row_del.append(neigh)
136 | # col_del.append(removed[1])
137 | # val_del.append(-removed[0])
138 | row.append(neigh_new)
139 | col.append(removed[1])
140 | val.append(-removed[0])
141 |
142 | # Update the reverse adjacency list
143 | self.rev_adj[removed[1]].discard(neigh_new)
144 |
145 | # Update the radii
146 | min_weight = neighbors_heap.get_min()[0]
147 | self.radii[neigh_new] = -np.log(max(min_weight, eps))
148 |
149 | update_mat = coo_matrix((val, (row, col)), self.shape, self.dtype)
150 |
151 | return update_mat
152 |
--------------------------------------------------------------------------------
/ilp/constants.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 |
4 | CWD = os.path.abspath(os.path.split(__file__)[0])
5 | PROJECT_DIR = os.path.split(CWD)[0]
6 |
7 | SOURCE_DIR = os.path.join(PROJECT_DIR, 'ilp')
8 | DATA_DIR = os.path.join(PROJECT_DIR, 'data')
9 |
10 | STATS_DIR = os.path.join(SOURCE_DIR, 'stats_res')
11 | EXPERIMENTS_DIR = os.path.join(SOURCE_DIR, 'experiments')
12 | CONFIG_DIR = os.path.join(EXPERIMENTS_DIR, 'cfg')
13 | RESULTS_DIR = os.path.join(EXPERIMENTS_DIR, 'results')
14 | # PLOT_DIR = os.path.join(PROJECT_DIR, 'plot')
15 |
16 |
17 | EPS_32 = np.spacing(np.float32(0))
18 | EPS_64 = np.spacing(np.float64(0))
19 |
--------------------------------------------------------------------------------
/ilp/experiments/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/experiments/__init__.py
--------------------------------------------------------------------------------
/ilp/experiments/base.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | import time
4 | from datetime import datetime
5 | from sklearn.externals import six
6 | from sklearn.preprocessing import LabelBinarizer
7 | from sklearn.utils.random import check_random_state
8 | from abc import ABCMeta, abstractmethod
9 | from matplotlib import pyplot as plt
10 |
11 | from ilp.constants import RESULTS_DIR
12 | from ilp.helpers.data_fetcher import check_supported_dataset, fetch_load_data
13 | from ilp.helpers.stats import StatisticsWorker, aggregate_statistics, JobType
14 | from ilp.plots.plot_stats import plot_curves
15 | from ilp.algo.incremental_label_prop import IncrementalLabelPropagation
16 | from ilp.algo.datastore import SemiLabeledDataStore
17 | from ilp.helpers.data_flow import gen_semilabeled_data, split_labels_rest, split_burn_in_rest
18 | from ilp.helpers.params_parse import print_config
19 | from ilp.helpers.log import make_logger
20 |
21 |
22 | logger = make_logger(__name__)
23 |
24 |
25 | class BaseExperiment(six.with_metaclass(ABCMeta)):
26 |
27 | def __init__(self, name, config, plot_title, multi_var, n_runs, isave=100):
28 | self.name = name
29 | self.config = config
30 | self.precision = config.get('options', {}).get('precision', 'float32')
31 | self.isave = isave
32 | self.plot_title = plot_title
33 | self.multi_var = multi_var
34 | self.n_runs = n_runs
35 | self._setup()
36 |
37 | def _setup(self):
38 | self.dataset = self.config['dataset']['name'].lower()
39 | check_supported_dataset(self.dataset)
40 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S')
41 | self.class_dir = os.path.join(RESULTS_DIR, self.name)
42 | instance_dir = self.name + '_' + self.dataset.upper() + '_' + cur_time
43 | self.top_dir = os.path.join(self.class_dir, instance_dir)
44 |
45 | def run(self, dataset_name, random_state=42):
46 |
47 | config = self.config
48 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name)
49 |
50 | for n_run in range(self.n_runs):
51 | seed_run = random_state * n_run
52 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run))
53 | rng = check_random_state(seed_run)
54 | if config['dataset']['is_stream']:
55 | logger.info('Dataset is a stream. Sampling observed labels.')
56 | # Just randomly sample ratio_labeled samples for mask_labeled
57 | n_burn_in = config['data']['n_burn_in_stream']
58 | ratio_labeled = config['data']['stream']['ratio_labeled']
59 | n_labeled = int(ratio_labeled*len(y_train))
60 | ind_labeled = rng.choice(len(y_train), n_labeled,
61 | replace=False)
62 | mask_labeled = np.zeros(len(y_train), dtype=bool)
63 | mask_labeled[ind_labeled] = True
64 | X_run, y_run = X_train, y_train
65 | else:
66 | burn_in_params = config['data']['burn_in']
67 | ind_burn_in, mask_labeled_burn_in = \
68 | split_burn_in_rest(y_train, shuffle=True, seed=seed_run,
69 | **burn_in_params)
70 | X_burn_in, y_burn_in = X_train[ind_burn_in], \
71 | y_train[ind_burn_in]
72 | mask_rest = np.ones(len(X_train), dtype=bool)
73 | mask_rest[ind_burn_in] = False
74 | X_rest, y_rest = X_train[mask_rest], y_train[mask_rest]
75 | stream_params = config['data']['stream']
76 | mask_labeled_rest = split_labels_rest(
77 | y_rest, seed=seed_run, shuffle=True, **stream_params)
78 |
79 | # Shuffle the rest
80 | indices = np.arange(len(y_rest))
81 | rng.shuffle(indices)
82 | X_run = np.concatenate((X_burn_in, X_rest[indices]))
83 | y_run = np.concatenate((y_burn_in, y_rest[indices]))
84 | mask_labeled = np.concatenate((mask_labeled_burn_in,
85 | mask_labeled_rest[indices]))
86 | n_burn_in = len(y_burn_in)
87 |
88 | config['data']['n_burn_in'] = n_burn_in
89 | config.setdefault('options', {})
90 | config['options']['random_state'] = seed_run
91 |
92 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in,
93 | seed_run, X_test, y_test, n_run)
94 |
95 | @abstractmethod
96 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run,
97 | X_test, y_test, n_run):
98 |         raise NotImplementedError('pre_single_run must be overridden!')
99 |
100 | def _single_run(self, X, y, mask_labeled, n_burn_in, stats_path,
101 | random_state, X_test=None, y_test=None):
102 |
103 | lb = LabelBinarizer()
104 | lb.fit(y)
105 | logger.info('\n\nLABELS SEEN BY LABEL BINARIZER: {}'.format(lb.classes_))
106 |
107 | # Now print configuration for sanity check
108 | self.config.setdefault('dataset', {})
109 | self.config['dataset']['classes'] = lb.classes_
110 | print_config(self.config)
111 |
112 | logger.info('Creating stream generator...')
113 | stream_generator = gen_semilabeled_data(X, y, mask_labeled)
114 |
115 | logger.info('Creating one-hot groundtruth...')
116 | y_u_true_int = y[~mask_labeled]
117 | y_u_true = np.asarray(lb.transform(y_u_true_int), dtype=self.precision)
118 |
119 | logger.info('Initializing learner...')
120 | datastore_params = {'precision': self.precision,
121 | 'max_samples': len(y),
122 | 'max_labeled': sum(mask_labeled),
123 | 'classes': lb.classes_}
124 | learner = self.init_learner(stats_path, datastore_params, random_state,
125 | n_burn_in)
126 |
127 | # Iterate through the generated samples and learn
128 | t_total = time.time()
129 | logger.info('Now feeding stream . . .')
130 | for t, x_new, y_new, is_labeled in stream_generator:
131 |
132 | # Pass the new point to the learner
133 | y_observed = y_new if is_labeled else -1
134 | learner.fit_incremental(x_new, y_observed)
135 |
136 | if t > n_burn_in:
137 | # Compute classification error
138 | u = learner.datastore.n_unlabeled
139 | y_u = learner.y_unlabeled[:u]
140 | learner.log_stats(JobType.EVAL, y_est=y_u, y_true=y_u_true[:u])
141 |
142 | # Compute test error every 1000 samples
143 | if t % 1000 == 0:
144 | if X_test is not None:
145 | logger.info('Now testing . . .')
146 | t_test = time.time()
147 | y_pred_knn, y_pred_lp = learner.predict(X_test, mode='pair')
148 | t_test = time.time() - t_test
149 | logger.info('Testing finished in {}s'.format(t_test))
150 | learner.log_stats(JobType.TEST_PRED, y_pred_knn=y_pred_knn,
151 | y_pred_lp=y_pred_lp, y_true=y_test)
152 |
153 | # Store the true label stream in statistics
154 | learner.log_stats(JobType.LABEL_STREAM, y_true=y, mask_obs=mask_labeled)
155 |
156 | logger.info('Reached end of generated data.')
157 | total_runtime = time.time() - t_total
158 | logger.info('Total time elapsed: {} s'.format(total_runtime))
159 | learner.log_stats(JobType.RUNTIME, t=total_runtime)
160 |
161 | # Store last predictions in statistics
162 | u = learner.datastore.n_unlabeled
163 | y_u = learner.y_unlabeled[:u]
164 | learner.log_stats(JobType.TRAIN_PRED, y_est=y_u, y_true=y_u_true[:u])
165 |
166 | if X_test is not None:
167 | logger.info('Now testing . . .')
168 | t_test = time.time()
169 | y_pred_knn, y_pred_lp = learner.predict(X_test, mode='pair')
170 | t_test = time.time() - t_test
171 | logger.info('Testing finished in {}s'.format(t_test))
172 | learner.log_stats(JobType.TEST_PRED, y_pred_knn=y_pred_knn,
173 | y_pred_lp=y_pred_lp, y_true=y_test)
174 |
175 | if learner.stats_worker is not None:
176 | learner.stats_worker.stop()
177 |
178 | def init_learner(self, stats_path, datastore_params, random_state, n_burn_in):
179 |
180 | config = self.config
181 | ilp_params = dict(params_offline_lp=config['offline_lp'],
182 | params_graph=config['graph'],
183 | **config['online_lp'])
184 |
185 | # Instantiate a worker thread for statistics
186 | stats_worker = StatisticsWorker(config=config, isave=self.isave, path=stats_path)
187 |
188 | # Instantiate a datastore for labeled and unlabeled samples
189 | datastore = SemiLabeledDataStore(**datastore_params)
190 |
191 | # Instantiate the learner
192 | learner = IncrementalLabelPropagation(datastore=datastore, stats_worker=stats_worker,
193 | random_state=random_state, n_burn_in=n_burn_in, **ilp_params)
194 |
195 | return learner
196 |
197 | def load_plot(self, path=None):
198 | if path is None:
199 | path = self.top_dir
200 | elif not os.path.isdir(path):
201 | # Load and plot the latest experiment
202 | logger.info('Experiment Class dir: {}'.format(self.class_dir))
203 | logger.info('Experiment subdirs: {}'.format(os.listdir(self.class_dir)))
204 | files_in_class = os.listdir(self.class_dir)
205 | a_files = [os.path.join(self.class_dir, d) for d in files_in_class]
206 | list_of_dirs = [d for d in a_files if os.path.isdir(d)]
207 | path = max(list_of_dirs, key=os.path.getctime)
208 |
209 | logger.info('Collecting statistics from {}'.format(path))
210 | config = None
211 | if self.multi_var:
212 | experiment_stats = []
213 | for variable_dir in os.listdir(path):
214 | var_value = variable_dir[len(self.name) + 1:]
215 | experiment_dir = os.path.join(path, variable_dir)
216 | stats_mean, stats_std, config = aggregate_statistics(experiment_dir)
217 | experiment_stats.append((self.name, var_value, stats_mean, stats_std))
218 | else:
219 | stats_mean, stats_std, config = aggregate_statistics(path)
220 | experiment_stats = (stats_mean, stats_std)
221 |
222 | if config is None:
223 | raise KeyError('No configuration found for {}'.format(path))
224 |
225 | title = self.plot_title
226 | plot_curves(experiment_stats, config, title=title, path=path)
227 | plt.show()
228 |
--------------------------------------------------------------------------------
/ilp/experiments/cfg/default.yml:
--------------------------------------------------------------------------------
1 | data:
2 | burn_in:
3 | n_labeled_per_class : 2
4 | ratio_labeled : 0.5
5 | stream:
6 | ratio_labeled : 0.10
7 | batch_size : 100
8 | max_samples : 200000
9 | n_burn_in_stream : 100
10 |
11 |
12 | offline_lp:
13 | tol : 0.001
14 | max_iter : 30
15 |
16 |
17 | online_lp:
18 | tol : 0.001
19 | max_iter : 30
20 | theta : 0.1
21 | iprint : 100
22 |
23 |
24 | graph:
25 | kernel : knn
26 | n_neighbors_labeled : 3
27 | n_neighbors_unlabeled : 3
28 |
29 |
30 | options:
31 | precision : float64
32 | iter_stats : 100
--------------------------------------------------------------------------------
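
For reference, a sketch of how these sections are consumed: `BaseExperiment.init_learner` in `ilp/experiments/base.py` maps them to the constructor arguments of `IncrementalLabelPropagation` (here `parse_yaml` is assumed to simply return the nested dict):

```python
import os

from ilp.constants import CONFIG_DIR
from ilp.helpers.params_parse import parse_yaml  # assumed to return a nested dict

config = parse_yaml(os.path.join(CONFIG_DIR, 'default.yml'))

# Mirrors BaseExperiment.init_learner: one YAML section per constructor argument
ilp_params = dict(params_offline_lp=config['offline_lp'],  # tol, max_iter
                  params_graph=config['graph'],            # kernel, n_neighbors_*
                  **config['online_lp'])                   # theta, max_iter, tol, iprint
```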
/ilp/experiments/cfg/var_k_L.yml:
--------------------------------------------------------------------------------
1 | data:
2 | burn_in:
3 | n_labeled_per_class : 2
4 | ratio_labeled : 0.5
5 | stream:
6 | ratio_labeled : 0.05
7 | batch_size : 100
8 | max_samples : 1000000
9 |
10 |
11 | offline_lp:
12 | tol : 0.001
13 | max_iter : 30
14 |
15 |
16 | online_lp:
17 | tol : 0.001
18 | max_iter : 30
19 | theta : 0.3
20 | iprint : 100
21 |
22 |
23 | graph:
24 | kernel : knn
25 | n_neighbors_labeled : [1, 3, 7, 11, 15, 19]
26 | n_neighbors_unlabeled : 3
27 |
28 |
29 | options:
30 | precision : float64
31 | iter_stats : 100
--------------------------------------------------------------------------------
/ilp/experiments/cfg/var_k_U.yml:
--------------------------------------------------------------------------------
1 | data:
2 | burn_in:
3 | n_labeled_per_class : 2
4 | ratio_labeled : 0.5
5 | stream:
6 | ratio_labeled : 0.05
7 | batch_size : 100
8 | max_samples : 1000000
9 |
10 |
11 | offline_lp:
12 | tol : 0.001
13 | max_iter : 30
14 |
15 |
16 | online_lp:
17 | tol : 0.001
18 | max_iter : 30
19 | theta : 0.3
20 | iprint : 100
21 |
22 |
23 | graph:
24 | kernel : knn
25 | n_neighbors_labeled : 3
26 | n_neighbors_unlabeled : [1, 3, 7, 11, 15, 19]
27 |
28 |
29 | options:
30 | precision : float64
31 | iter_stats : 100
--------------------------------------------------------------------------------
/ilp/experiments/cfg/var_n_L.yml:
--------------------------------------------------------------------------------
1 | data:
2 | burn_in:
3 | n_labeled_per_class : 2
4 | ratio_labeled : 0.5
5 | stream:
6 | ratio_labeled : 0.05
7 | batch_size : 100
8 | max_samples : 1000000
9 | n_labeled_per_class : [300, 500]
10 |
11 |
12 | offline_lp:
13 | tol : 0.001
14 | max_iter : 30
15 |
16 |
17 | online_lp:
18 | tol : 0.001
19 | max_iter : 30
20 | theta : 0.3
21 | iprint : 100
22 |
23 |
24 | graph:
25 | kernel : knn
26 | n_neighbors_labeled : 3
27 | n_neighbors_unlabeled : 3
28 |
29 |
30 | options:
31 | precision : float64
32 | iter_stats : 100
--------------------------------------------------------------------------------
/ilp/experiments/cfg/var_stream_labeled.yml:
--------------------------------------------------------------------------------
1 | data:
2 | burn_in:
3 | n_labeled_per_class : 2
4 | ratio_labeled : 0.5
5 | stream:
6 | ratio_labeled : [0.05, 0.1, 0.2]
7 | batch_size : 100
8 | max_samples : 1000000
9 | n_burn_in_stream : 100
10 |
11 |
12 | offline_lp:
13 | tol : 0.001
14 | max_iter : 30
15 |
16 |
17 | online_lp:
18 | tol : 0.001
19 | max_iter : 30
20 | theta : 1.0
21 | iprint : 100
22 |
23 |
24 | graph:
25 | kernel : knn
26 | n_neighbors_labeled : 3
27 | n_neighbors_unlabeled : 3
28 |
29 |
30 | options:
31 | precision : float64
32 | iter_stats : 100
--------------------------------------------------------------------------------
/ilp/experiments/cfg/var_theta.yml:
--------------------------------------------------------------------------------
1 | data:
2 | burn_in:
3 | n_labeled_per_class : 2
4 | ratio_labeled : 0.5
5 | stream:
6 | ratio_labeled : 0.20
7 | batch_size : 100
8 | max_samples : 1000000
9 | n_burn_in_stream : 100
10 |
11 |
12 | offline_lp:
13 | tol : 0.001
14 | max_iter : 30
15 |
16 |
17 | online_lp:
18 | tol : 0.001
19 | max_iter : 30
20 | theta : [0.0, 0.1, 0.5, 1.0, 1.5, 2.0]
21 | iprint : 100
22 |
23 |
24 | graph:
25 | kernel : knn
26 | n_neighbors_labeled : 3
27 | n_neighbors_unlabeled : 3
28 |
29 |
30 | options:
31 | precision : float64
32 | iter_stats : 100
--------------------------------------------------------------------------------
/ilp/experiments/default_run.py:
--------------------------------------------------------------------------------
1 | import os
2 | from datetime import datetime
3 |
4 | from ilp.experiments.base import BaseExperiment
5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM
6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser
7 | from ilp.constants import CONFIG_DIR
8 |
9 |
10 | class DefaultRun(BaseExperiment):
11 |
12 | def __init__(self, params, n_runs=1, isave=100):
13 | super(DefaultRun, self).__init__(name='default_run', config=params,
14 | isave=isave, n_runs=n_runs,
15 | plot_title=r'Default run',
16 | multi_var=False)
17 |
18 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run,
19 | X_test, y_test, n_run):
20 |
21 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S')
22 | stats_path = os.path.join(self.top_dir, 'run_' + cur_time)
23 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, stats_path,
24 | seed_run, X_test, y_test)
25 |
26 |
27 | if __name__ == '__main__':
28 |
29 | # Parse user input
30 | parser = experiment_arg_parser()
31 | args = vars(parser.parse_args())
32 | dataset_name = args['dataset'].lower()
33 | config_file = os.path.join(CONFIG_DIR, 'default.yml')
34 | config = parse_yaml(config_file)
35 |
36 | # Store dataset info
37 | config.setdefault('dataset', {})
38 | config['dataset']['name'] = dataset_name
39 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False)
40 |
41 | # Instantiate experiment
42 | experiment = DefaultRun(params=config, n_runs=args['n_runs'])
43 |
44 | if args['plot'] != '':
45 | # python3 default_run.py -p latest
46 | experiment.load_plot(path=args['plot'])
47 | else:
48 | # python3 default_run.py -d digits
49 | experiment.run(dataset_name)
50 |
--------------------------------------------------------------------------------
/ilp/experiments/var_n_labeled.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import numpy as np
4 | from sklearn.utils.random import check_random_state
5 |
6 | from ilp.experiments.base import BaseExperiment
7 | from ilp.helpers.data_fetcher import fetch_load_data, IS_DATASET_STREAM
8 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser
9 | from ilp.constants import CONFIG_DIR
10 | from ilp.helpers.data_flow import split_labels_rest, split_burn_in_rest
11 | from ilp.helpers.log import make_logger
12 |
13 |
14 | logger = make_logger(__name__)
15 |
16 |
17 | class VarSamplesLabeled(BaseExperiment):
18 |
19 | def __init__(self, n_labeled_values, params, n_runs=1, isave=100):
20 | super(VarSamplesLabeled, self).__init__(name='n_L', config=params,
21 | isave=isave, n_runs=n_runs,
22 | plot_title=r'Influence of '
23 | r'number of labels',
24 | multi_var=True)
25 | self.n_labeled_values = n_labeled_values
26 |
27 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run,
28 | X_test, y_test, n_run):
29 |
30 | config = self.config
31 |
32 | n_labels = config['data']['n_labels']
33 | save_dir = os.path.join(self.top_dir, 'n_L_' + str(n_labels))
34 | stats_file = os.path.join(save_dir, 'run_' + str(n_run))
35 | logger.info('\n\nExperiment: {}, n_labels = {}, run {}...\n'.
36 | format(self.name.upper(), n_labels, n_run))
37 | time.sleep(1)
38 | self._single_run(X_run, y_run, mask_labeled, n_burn_in,
39 | stats_file, seed_run, X_test, y_test)
40 |
41 | def run(self, dataset_name, random_state=42):
42 |
43 | config = self.config
44 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name)
45 |
46 | n_classes = len(np.unique(y_train))
47 |
48 | # if dataset_name == 'usps':
49 | # X_train = np.concatenate((X_train, X_test))
50 | # y_train = np.concatenate((y_train, y_test))
51 |
52 | for n_run in range(self.n_runs):
53 | seed_run = random_state * n_run
54 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run))
55 | rng = check_random_state(seed_run)
56 | if config['dataset']['is_stream']:
57 | logger.info('Dataset is a stream. Sampling observed labels.')
58 | # Just randomly sample ratio_labeled samples for mask_labeled
59 | n_burn_in = config['data']['n_burn_in_stream']
60 | ratio_labeled = config['data']['stream']['ratio_labeled']
61 | n_labeled = int(ratio_labeled*len(y_train))
62 | ind_labeled = rng.choice(len(y_train), n_labeled,
63 | replace=False)
64 | mask_labeled = np.zeros(len(y_train), dtype=bool)
65 | mask_labeled[ind_labeled] = True
66 | X_run, y_run = X_train, y_train
67 | else:
68 |
69 | burn_in_params = config['data']['burn_in']
70 | ind_burn_in, mask_labeled_burn_in = \
71 | split_burn_in_rest(y_train, shuffle=True, seed=seed_run,
72 | **burn_in_params)
73 | n_labeled_burn_in = sum(mask_labeled_burn_in)
74 | X_burn_in, y_burn_in = X_train[ind_burn_in], \
75 | y_train[ind_burn_in]
76 | mask_rest = np.ones(len(X_train), dtype=bool)
77 | mask_rest[ind_burn_in] = False
78 | X_rest, y_rest = X_train[mask_rest], y_train[mask_rest]
79 |
80 | for nlpc in self.n_labeled_values:
81 | n_labels = nlpc*n_classes
82 | config['data']['n_labels'] = n_labels
83 |
84 | rl = (n_labels - n_labeled_burn_in) / len(y_rest)
85 | assert rl >= 0
86 | mask_labeled_rest = split_labels_rest(y_rest, batch_size=0,
87 | seed=seed_run, shuffle=True, ratio_labeled=rl)
88 |
89 | # Shuffle the rest
90 | indices = np.arange(len(y_rest))
91 | rng.shuffle(indices)
92 | X_run = np.concatenate((X_burn_in, X_rest[indices]))
93 | y_run = np.concatenate((y_burn_in, y_rest[indices]))
94 | mask_labeled = np.concatenate((mask_labeled_burn_in,
95 | mask_labeled_rest[indices]))
96 | n_burn_in = len(y_burn_in)
97 |
98 | config['data']['n_burn_in'] = n_burn_in
99 | config.setdefault('options', {})
100 | config['options']['random_state'] = seed_run
101 |
102 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in,
103 | seed_run, X_test, y_test, n_run)
104 |
105 |
106 | if __name__ == '__main__':
107 | parser = experiment_arg_parser()
108 | args = vars(parser.parse_args())
109 | dataset_name = args['dataset'].lower()
110 | config_file = os.path.join(CONFIG_DIR, 'var_n_L.yml')
111 | config = parse_yaml(config_file)
112 |
113 | # Store dataset info
114 | config.setdefault('dataset', {})
115 | config['dataset']['name'] = dataset_name
116 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False)
117 |
118 | N_LABELED_PER_CLASS = config['data']['n_labeled_per_class'].copy()
119 |
120 | experiment = VarSamplesLabeled(N_LABELED_PER_CLASS, params=config,
121 | n_runs=args['n_runs'])
122 | if args['plot'] != '':
123 | experiment.load_plot(path=args['plot'])
124 | else:
125 | experiment.run(dataset_name)
126 |
--------------------------------------------------------------------------------
/ilp/experiments/var_n_neighbors_labeled.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 |
4 | from ilp.experiments.base import BaseExperiment
5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM
6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser
7 | from ilp.constants import CONFIG_DIR
8 | from ilp.helpers.log import make_logger
9 |
10 |
11 | logger = make_logger(__name__)
12 |
13 |
14 | class VarNeighborsLabeled(BaseExperiment):
15 |
16 | def __init__(self, n_neighbors_values, params, n_runs, isave=100):
17 | super(VarNeighborsLabeled, self).__init__(name='k_L', config=params,
18 | isave=isave, n_runs=n_runs,
19 | plot_title=r'Influence of '
20 | r'$k_l$',
21 | multi_var=True)
22 | self.n_neighbors_values = n_neighbors_values
23 |
24 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run,
25 | X_test, y_test, n_run):
26 |
27 | params = self.config
28 |
29 | for n_neighbors in self.n_neighbors_values:
30 | params['graph']['n_neighbors_labeled'] = n_neighbors
31 | save_dir = os.path.join(self.top_dir, 'k_L_' + str(n_neighbors))
32 | stats_file = os.path.join(save_dir, 'run_' + str(n_run))
33 | logger.info('\n\nExperiment: {}, k_L = {}, run {}...\n'.
34 | format(self.name.upper(), n_neighbors, n_run))
35 | time.sleep(1)
36 | self._single_run(X_run, y_run, mask_labeled, n_burn_in,
37 | stats_file, seed_run, X_test, y_test)
38 |
39 |
40 | if __name__ == '__main__':
41 |
42 | parser = experiment_arg_parser()
43 | args = vars(parser.parse_args())
44 | dataset_name = args['dataset'].lower()
45 | config_file = os.path.join(CONFIG_DIR, 'var_k_L.yml')
46 | config = parse_yaml(config_file)
47 |
48 | # Store dataset info
49 | config.setdefault('dataset', {})
50 | config['dataset']['name'] = dataset_name
51 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False)
52 |
53 | N_NEIGHBORS_VALUES = config['graph']['n_neighbors_labeled'].copy()
54 |
55 | experiment = VarNeighborsLabeled(N_NEIGHBORS_VALUES, params=config,
56 | n_runs=args['n_runs'])
57 | if args['plot'] != '':
58 | experiment.load_plot(path=args['plot'])
59 | else:
60 | experiment.run(dataset_name)
--------------------------------------------------------------------------------
/ilp/experiments/var_n_neighbors_unlabeled.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 |
4 | from ilp.experiments.base import BaseExperiment
5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM
6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser
7 | from ilp.constants import CONFIG_DIR
8 | from ilp.helpers.log import make_logger
9 |
10 |
11 | logger = make_logger(__name__)
12 |
13 |
14 | class VarNeighborsUnLabeled(BaseExperiment):
15 |
16 | def __init__(self, n_neighbors_values, params, n_runs, isave=100):
17 | super(VarNeighborsUnLabeled, self).__init__(name='k_U', config=params,
18 | isave=isave,
19 | plot_title=r'Influence of '
20 | r'$k_u$',
21 | n_runs=n_runs,
22 | multi_var=True)
23 | self.n_neighbors_values = n_neighbors_values
24 |
25 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in,
26 | seed_run, X_test, y_test, n_run):
27 | params = self.config
28 |
29 | for n_neighbors in self.n_neighbors_values:
30 | params['graph']['n_neighbors_unlabeled'] = n_neighbors
31 | save_dir = os.path.join(self.top_dir,
32 | 'k_U_' + str(n_neighbors))
33 | stats_file = os.path.join(save_dir, 'run_' + str(n_run))
34 | logger.info('\n\nExperiment: {}, k_U = {}, run {}...\n'.
35 | format(self.name.upper(), n_neighbors, n_run))
36 | time.sleep(1)
37 | self._single_run(X_run, y_run, mask_labeled, n_burn_in,
38 | stats_file, seed_run, X_test, y_test)
39 |
40 |
41 | if __name__ == '__main__':
42 | parser = experiment_arg_parser()
43 | args = vars(parser.parse_args())
44 | dataset_name = args['dataset'].lower()
45 | config_file = os.path.join(CONFIG_DIR, 'var_k_U.yml')
46 | config = parse_yaml(config_file)
47 |
48 | # Store dataset info
49 | config.setdefault('dataset', {})
50 | config['dataset']['name'] = dataset_name
51 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False)
52 |
53 | N_NEIGHBORS_VALUES = config['graph']['n_neighbors_unlabeled'].copy()
54 |
55 | experiment = VarNeighborsUnLabeled(N_NEIGHBORS_VALUES, params=config,
56 | n_runs=args['n_runs'])
57 | if args['plot'] != '':
58 | experiment.load_plot(path=args['plot'])
59 | else:
60 | experiment.run(dataset_name)
--------------------------------------------------------------------------------
/ilp/experiments/var_stream_labeled.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | from time import sleep
4 | from sklearn.utils.random import check_random_state
5 |
6 | from ilp.experiments.base import BaseExperiment
7 | from ilp.helpers.data_fetcher import fetch_load_data, IS_DATASET_STREAM
8 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser
9 | from ilp.constants import CONFIG_DIR
10 | from ilp.helpers.log import make_logger
11 |
12 |
13 | logger = make_logger(__name__)
14 |
15 |
16 | class VarStreamLabeled(BaseExperiment):
17 |
18 | def __init__(self, ratio_labeled_values, params, n_runs=1, isave=100):
19 | super(VarStreamLabeled, self).__init__(name='srl',
20 | config=params,
21 | isave=isave, n_runs=n_runs,
22 | plot_title=r'Influence of ratio of labels',
23 | multi_var=True)
24 | self.ratio_labeled_values = ratio_labeled_values
25 |
26 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run,
27 | X_test, y_test, n_run):
28 |
29 | config = self.config
30 |
31 | ratio_labeled = config['data']['stream']['ratio_labeled']
32 | save_dir = os.path.join(self.top_dir, 'srl_' + str(ratio_labeled))
33 | stats_path = os.path.join(save_dir, 'run_' + str(n_run))
34 | logger.info('\n\nExperiment: {}, ratio_labeled = {}, run {}...\n'.
35 | format(self.name.upper(), ratio_labeled, n_run))
36 | sleep(1)
37 | self._single_run(X_run, y_run, mask_labeled, n_burn_in,
38 | stats_path, seed_run, X_test, y_test)
39 |
40 | def run(self, dataset_name, random_state=42):
41 |
42 | config = self.config
43 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name)
44 |
45 | for n_run in range(self.n_runs):
46 | seed_run = random_state * n_run
47 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run))
48 | rng = check_random_state(seed_run)
49 | if config['dataset']['is_stream']:
50 | logger.info('Dataset is a stream. Sampling observed labels.')
51 | # Just randomly sample ratio_labeled samples for mask_labeled
52 | n_burn_in = config['data']['n_burn_in_stream']
53 |
54 | for ratio_labeled in self.ratio_labeled_values:
55 |
56 | config['data']['stream']['ratio_labeled'] = ratio_labeled
57 | n_labeled = int(ratio_labeled*len(y_train))
58 | ind_labeled = rng.choice(len(y_train), n_labeled,
59 | replace=False)
60 | mask_labeled = np.zeros(len(y_train), dtype=bool)
61 | mask_labeled[ind_labeled] = True
62 | X_run, y_run = X_train, y_train
63 |
64 | config['data']['n_burn_in'] = n_burn_in
65 | config.setdefault('options', {})
66 | config['options']['random_state'] = seed_run
67 |
68 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in,
69 | seed_run, X_test, y_test, n_run)
70 |
71 |
72 | if __name__ == '__main__':
73 | parser = experiment_arg_parser()
74 | args = vars(parser.parse_args())
75 | dataset_name = args['dataset'].lower()
76 | config_file = os.path.join(CONFIG_DIR, 'var_stream_labeled.yml')
77 | config = parse_yaml(config_file)
78 |
79 | # Store dataset info
80 | config.setdefault('dataset', {})
81 | config['dataset']['name'] = dataset_name
82 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False)
83 |
84 | N_RATIO_LABELED = config['data']['stream']['ratio_labeled'].copy()
85 |
86 | experiment = VarStreamLabeled(N_RATIO_LABELED, params=config,
87 | n_runs=args['n_runs'])
88 | if args['plot'] != '':
89 | experiment.load_plot(path=args['plot'])
90 | else:
91 | experiment.run(dataset_name)
92 |
--------------------------------------------------------------------------------
/ilp/experiments/var_theta.py:
--------------------------------------------------------------------------------
1 | import os
2 | from time import sleep
3 |
4 | from ilp.experiments.base import BaseExperiment
5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM
6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser
7 | from ilp.constants import CONFIG_DIR
8 | from ilp.helpers.log import make_logger
9 |
10 |
11 | logger = make_logger(__name__)
12 |
13 |
14 | class VarTheta(BaseExperiment):
15 |
16 | def __init__(self, theta_values, params, n_runs, isave=100):
17 | super(VarTheta, self).__init__(name='theta', config=params,
18 | isave=isave, n_runs=n_runs,
19 | plot_title=r'Influence of $\vartheta$',
20 | multi_var=True)
21 | self.theta_values = theta_values
22 |
23 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run,
24 | X_test, y_test, n_run):
25 |
26 | params = self.config
27 |
28 | for theta in self.theta_values:
29 | params['online_lp']['theta'] = theta
30 | save_dir = os.path.join(self.top_dir, 'theta_' + str(theta))
31 | stats_file = os.path.join(save_dir, 'run_' + str(n_run))
32 | logger.info('\n\nExperiment: {}, theta = {}, run {}...\n'.
33 | format(self.name.upper(), theta, n_run))
34 | sleep(1)
35 | self._single_run(X_run, y_run, mask_labeled, n_burn_in,
36 | stats_file, seed_run, X_test, y_test)
37 |
38 |
39 | if __name__ == '__main__':
40 |
41 | parser = experiment_arg_parser()
42 | args = vars(parser.parse_args())
43 | dataset_name = args['dataset'].lower()
44 | config_file = os.path.join(CONFIG_DIR, 'var_theta.yml')
45 | config = parse_yaml(config_file)
46 |
47 | # Store dataset info
48 | config.setdefault('dataset', {})
49 | config['dataset']['name'] = dataset_name
50 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False)
51 |
52 | THETA_VALUES = config['online_lp']['theta'].copy()
53 |
54 | # Instantiate experiment
55 | experiment = VarTheta(theta_values=THETA_VALUES, params=config,
56 | n_runs=args['n_runs'])
57 | if args['plot'] != '':
58 | experiment.load_plot(path=args['plot'])
59 | else:
60 | # python3 default_run.py -d digits
61 | experiment.run(dataset_name)
--------------------------------------------------------------------------------
/ilp/helpers/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/helpers/__init__.py
--------------------------------------------------------------------------------
/ilp/helpers/data_fetcher.py:
--------------------------------------------------------------------------------
1 | import os
2 | import gzip
3 | import zipfile
4 | from urllib import request
5 | import yaml
6 | import numpy as np
7 | from sklearn.datasets import make_classification
8 | from sklearn.model_selection import train_test_split
9 |
10 | from ilp.constants import DATA_DIR
11 |
12 |
13 | CWD = os.path.split(__file__)[0]
14 | DATASET_CONFIG_PATH = os.path.join(CWD, 'datasets.yml')
15 |
16 | SUPPORTED_DATASETS = {'mnist', 'usps', 'blobs', 'kitti_features'}
17 | IS_DATASET_STREAM = {'kitti_features': True}
18 |
19 |
20 | def check_supported_dataset(dataset):
21 |
22 | if dataset not in SUPPORTED_DATASETS:
23 | raise FileNotFoundError('Dataset {} is not supported.'.format(dataset))
24 |
25 | return True
26 |
27 |
28 | def fetch_load_data(name):
29 |
30 | print('\nFetching/Loading {}...'.format(name))
31 | with open(DATASET_CONFIG_PATH, 'r') as f:
32 |         datasets_configs = yaml.safe_load(f)
33 | if name.upper() not in datasets_configs:
34 | raise FileNotFoundError('Dataset {} not supported.'.format(name))
35 |
36 | config = datasets_configs[name.upper()]
37 | name_ = config.get('name', name)
38 | test_size = config.get('test_size', 0)
39 |
40 | if name_ == 'KITTI_FEATURES':
41 | X_tr, y_tr, X_te, y_te = fetch_kitti()
42 | elif name_ == 'USPS':
43 | X_tr, y_tr, X_te, y_te = fetch_usps()
44 | elif name_ == 'MNIST':
45 | X_tr, y_tr, X_te, y_te = fetch_mnist()
46 | X_tr = X_tr / 255.
47 | X_te = X_te / 255.
48 | elif name_ == 'BLOBS':
49 | X, y = make_classification(n_samples=60)
50 | X = np.asarray(X)
51 | y = np.asarray(y, dtype=int)
52 |
53 | if test_size > 0:
54 | if type(test_size) is int:
55 | t = test_size
56 | print('{} has shape {}'.format(name_, X.shape))
57 | print('Splitting data with test size = {}'.format(test_size))
58 | X_tr, X_te, y_tr, y_te = X[:-t], X[-t:], y[:-t], y[-t:]
59 | elif type(test_size) is float:
60 | X_tr, X_te, y_tr, y_te = train_test_split(
61 | X, y, test_size=test_size, stratify=y)
62 | else:
63 |                 raise TypeError('test_size is neither int nor float.')
64 |
65 | print('Loaded training set with shape {}'.format(X_tr.shape))
66 | print('Loaded testing set with shape {}'.format(X_te.shape))
67 | return X_tr, y_tr, X_te, y_te
68 | else:
69 | print('Loaded {} with {} samples of dimension {}.'
70 | .format(name_, X.shape[0], X.shape[1]))
71 | return X, y, None, None
72 | else:
73 | raise NameError('No data set {} found!'.format(name_))
74 |
75 | print('Loaded training data with shape {}'.format(X_tr.shape))
76 | print('Loaded training labels with shape {}'.format(y_tr.shape))
77 | print('Loaded testing data with shape {}'.format(X_te.shape))
78 | print('Loaded testing labels with shape {}'.format(y_te.shape))
79 | return X_tr, y_tr, X_te, y_te
80 |
81 |
82 | def fetch_usps(save_dir=None):
83 |
84 | base_url = 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/'
85 | train_file = 'zip.train.gz'
86 | test_file = 'zip.test.gz'
87 | save_dir = DATA_DIR if save_dir is None else save_dir
88 |
89 | if not os.path.isdir(save_dir):
90 | raise NotADirectoryError('{} is not a directory.'.format(save_dir))
91 |
92 | train_source = os.path.join(base_url, train_file)
93 | test_source = os.path.join(base_url, test_file)
94 |
95 | train_dest = os.path.join(save_dir, train_file)
96 | test_dest = os.path.join(save_dir, test_file)
97 |
98 | def download_file(source, destination):
99 | if not os.path.exists(destination):
100 | print('Downloading from {}...'.format(source))
101 | f, msg = request.urlretrieve(url=source, filename=destination)
102 | print('HTTP response: {}'.format(msg))
103 | return f, msg
104 | else:
105 | print('Found dataset in {}!'.format(destination))
106 | return None
107 |
108 | download_file(train_source, train_dest)
109 | download_file(test_source, test_dest)
110 |
111 | X_train = np.loadtxt(train_dest)
112 | y_train, X_train = X_train[:, 0].astype(np.int32), X_train[:, 1:]
113 |
114 | X_test = np.loadtxt(test_dest)
115 | y_test, X_test = X_test[:, 0].astype(np.int32), X_test[:, 1:]
116 |
117 | return X_train, y_train, X_test, y_test
118 |
119 |
120 | def fetch_kitti(data_dir=None):
121 |
122 | if data_dir is None:
123 | data_dir = os.path.join(DATA_DIR, 'kitti_features')
124 |
125 | files = ['kitti_all_train.data',
126 | 'kitti_all_train.labels',
127 | 'kitti_all_test.data',
128 | 'kitti_all_test.labels']
129 |
130 | for file in files:
131 | if file not in os.listdir(data_dir):
132 | zip_path = os.path.join(data_dir, 'kitti_features.zip')
133 | target_path = os.path.dirname(zip_path)
134 | print("Extracting {} to {}...".format(zip_path, target_path))
135 | with zipfile.ZipFile(zip_path, "r") as zip_ref:
136 | zip_ref.extractall(target_path)
137 | print("Done.")
138 | break
139 |
140 | X_train = np.loadtxt(os.path.join(data_dir, files[0]), np.float64, skiprows=1)
141 | y_train = np.loadtxt(os.path.join(data_dir, files[1]), np.int32, skiprows=1)
142 | X_test = np.loadtxt(os.path.join(data_dir, files[2]), np.float64, skiprows=1)
143 | y_test = np.loadtxt(os.path.join(data_dir, files[3]), np.int32, skiprows=1)
144 |
145 | return X_train, y_train, X_test, y_test
146 |
147 |
148 | def fetch_mnist(data_dir=None):
149 |
150 | if data_dir is None:
151 | data_dir = os.path.join(DATA_DIR, 'mnist')
152 |
153 | url = 'http://yann.lecun.com/exdb/mnist/'
154 | files = ['train-images-idx3-ubyte.gz',
155 | 'train-labels-idx1-ubyte.gz',
156 | 't10k-images-idx3-ubyte.gz',
157 | 't10k-labels-idx1-ubyte.gz']
158 |
159 | # Create path if it doesn't exist
160 | os.makedirs(data_dir, exist_ok=True)
161 |
162 | # Download any missing files
163 | for file in files:
164 | if file not in os.listdir(data_dir):
165 | request.urlretrieve(url + file, os.path.join(data_dir, file))
166 | print("Downloaded %s to %s" % (file, data_dir))
167 |
168 | def _images(path):
169 | """Return flattened images loaded from local file."""
170 | with gzip.open(path) as f:
171 | # First 16 bytes are magic_number, n_imgs, n_rows, n_cols
172 | pixels = np.frombuffer(f.read(), '>B', offset=16)
173 | return pixels.reshape(-1, 784).astype('float64')
174 |
175 | def _labels(path):
176 | with gzip.open(path) as f:
177 | # First 8 bytes are magic_number, n_labels
178 | integer_labels = np.frombuffer(f.read(), '>B', offset=8)
179 |
180 | return integer_labels
181 |
182 | X_train = _images(os.path.join(data_dir, files[0]))
183 | y_train = _labels(os.path.join(data_dir, files[1]))
184 | X_test = _images(os.path.join(data_dir, files[2]))
185 | y_test = _labels(os.path.join(data_dir, files[3]))
186 |
187 | return X_train, y_train, X_test, y_test
188 |
--------------------------------------------------------------------------------
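
A minimal usage sketch for the fetcher above (assuming the package is importable and the requested dataset is already under DATA_DIR or can still be downloaded from the listed URLs):

    from ilp.helpers.data_fetcher import check_supported_dataset, fetch_load_data

    name = 'usps'
    check_supported_dataset(name)                    # raises FileNotFoundError otherwise
    X_tr, y_tr, X_te, y_te = fetch_load_data(name)   # downloads on first call, then loads
    print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
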
/ilp/helpers/data_flow.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from sklearn.utils.validation import check_random_state
3 |
4 |
5 | def check_min_samples_per_class(y, min_samples=2):
6 | classes, class_sizes = np.unique(y, return_counts=True)
7 | min_class_size = class_sizes.min()
8 | print('Class sizes: {}'.format(class_sizes))
9 | if min_class_size < min_samples:
10 | print('Classes: {}'.format(np.unique(y)))
11 |         raise ValueError('Minimum class size < {}.'.format(min_samples))
12 |
13 | return classes, class_sizes
14 |
15 |
16 | def split_burn_in_rest(y, n_labeled_per_class, ratio_labeled, shuffle=False, seed=None):
17 | """
18 |
19 | Parameters
20 | ----------
21 | y : array, shape (n_samples,)
22 | The true data labels.
23 |
24 | n_labeled_per_class : int
25 | Number of labeled samples per class to include in the burn-in set.
26 |
27 | ratio_labeled : float
28 | Ratio of labeled samples to include within the burn-in set.
29 |
30 | shuffle : bool
31 | Whether to shuffle indices within classes before adding to burn-in set.
32 |
33 | seed : int, np.random.RandomState or None
34 | For reproducibility.
35 |
36 | Returns
37 | -------
38 | ind_burn_in : list of length n_burn_in
39 | Indices of the samples in the burn-in set.
40 |
41 | mask_labeled_burn_in : array, shape (n_burn_in,)
42 | Mask indicating whether the labels in the burn-in set are observed.
43 |
44 | """
45 | classes, class_sizes = check_min_samples_per_class(y, n_labeled_per_class)
46 |
47 | rng = check_random_state(seed)
48 | ind_burn_in = []
49 | set_ind_burn_in_labeled = set()
50 |
51 | for class_ in classes:
52 | ind_class, = np.where(y == class_)
53 | n_class = len(ind_class)
54 | n_unlabeled_class = int(
55 | n_labeled_per_class * (1 / ratio_labeled - 1)) #+ 1
56 | n_unlabeled_class = min(n_unlabeled_class,
57 | n_class - n_labeled_per_class)
58 | n_burn_in_class = n_labeled_per_class + n_unlabeled_class
59 |
60 | if shuffle:
61 | ind_samples = rng.choice(n_class, n_burn_in_class, replace=False)
62 | ind_burn_in_class = ind_class[ind_samples]
63 | ind_burn_in.extend(ind_burn_in_class)
64 | ind_samples = rng.choice(n_burn_in_class, n_labeled_per_class,
65 | replace=False)
66 | ind_burn_in_class_labeled = ind_burn_in_class[ind_samples]
67 | else:
68 | ind_burn_in_class = ind_class[:n_burn_in_class]
69 | ind_burn_in.extend(ind_burn_in_class)
70 | ind_burn_in_class_labeled = ind_burn_in_class[:n_labeled_per_class]
71 |
72 | set_ind_burn_in_labeled.update(ind_burn_in_class_labeled)
73 |
74 | mask_labeled_burn_in = [i in set_ind_burn_in_labeled for i in ind_burn_in]
75 | mask_labeled_burn_in = np.asarray(mask_labeled_burn_in)
76 |
77 | y_burn_in = y[ind_burn_in]
78 | y_burn_in_labeled = y_burn_in[mask_labeled_burn_in]
79 | y_burn_in_unlabeled = y_burn_in[~mask_labeled_burn_in]
80 |
81 | _, class_sizes_labeled = np.unique(y_burn_in_labeled, return_counts=True)
82 | _, class_sizes_unlabeled = np.unique(y_burn_in_unlabeled,
83 | return_counts=True)
84 |
85 | if len(y_burn_in_labeled) == 0 and len(y_burn_in_unlabeled) == 0:
86 | class_sizes_burnin = np.zeros(len(classes))
87 | elif len(y_burn_in_labeled) == 0:
88 | class_sizes_burnin = class_sizes_unlabeled
89 | elif len(y_burn_in_unlabeled) == 0:
90 | class_sizes_burnin = class_sizes_labeled
91 | else:
92 | class_sizes_burnin = class_sizes_labeled + class_sizes_unlabeled
93 |
94 | print('\n\n')
95 | print('Burn-in labeled class sizes: {} , sum = {}'.format(
96 | class_sizes_labeled, sum(class_sizes_labeled)))
97 | print('Burn-in unlabeled class sizes: {}, sum = {}'.format(
98 | class_sizes_unlabeled, sum(class_sizes_unlabeled)))
99 | print('Burn-in total class sizes: {}, sum = {}'.format(
100 | class_sizes_burnin, sum(class_sizes_burnin)))
101 | print('\nRest total size: {}'.format(len(y) - len(y_burn_in)))
102 |
103 | return ind_burn_in, mask_labeled_burn_in
104 |
105 |
106 | def split_labels_rest(y_rest, ratio_labeled, batch_size, shuffle=False,
107 | seed=None):
108 | """
109 |
110 | Parameters
111 | ----------
112 | y_rest : array with shape (n_rest,)
113 | Remaining data labels after burn-in.
114 |
115 | ratio_labeled : float
116 | Ratio of observed labels in remaining data.
117 |
118 | batch_size : int
119 | Number of points for which the ratio_labeled must be satisfied.
120 |
121 | shuffle : bool
122 | Whether to shuffle indices within classes before adding to burn-in set.
123 |
124 | seed : int, np.random.RandomState or None
125 | For reproducibility.
126 |
127 | Returns
128 | -------
129 | mask_labeled_rest : array, shape (n_rest,)
130 | Mask indicating whether the labels in the rest set are observed.
131 |
132 | """
133 |
134 | classes = np.unique(y_rest)
135 |
136 | rng = check_random_state(seed)
137 |
138 | set_ind_rest_labeled = set()
139 |
140 | for class_ in classes:
141 | ind_class, = np.where(y_rest == class_)
142 | n_class = len(ind_class)
143 | n_labeled_class = int(n_class * ratio_labeled)
144 |
145 | if shuffle:
146 | ind_samples = rng.choice(n_class, n_labeled_class, replace=False)
147 | is_labeled_rest_class = ind_class[ind_samples]
148 | else:
149 | is_labeled_rest_class = ind_class[:n_labeled_class]
150 |
151 | set_ind_rest_labeled.update(is_labeled_rest_class)
152 |
153 | mask_labeled_rest = [i in set_ind_rest_labeled for i in range(len(y_rest))]
154 | mask_labeled_rest = np.asarray(mask_labeled_rest)
155 |
156 | y_rest_labeled = y_rest[mask_labeled_rest]
157 | y_rest_unlabeled = y_rest[~mask_labeled_rest]
158 |
159 | _, class_sizes_labeled = np.unique(y_rest_labeled, return_counts=True)
160 | _, class_sizes_unlabeled = np.unique(y_rest_unlabeled, return_counts=True)
161 |
162 | if len(y_rest_labeled) == 0 and len(y_rest_unlabeled) == 0:
163 | class_sizes_rest = np.zeros(len(classes))
164 | elif len(y_rest_labeled) == 0:
165 | class_sizes_rest = class_sizes_unlabeled
166 | elif len(y_rest_unlabeled) == 0:
167 | class_sizes_rest = class_sizes_labeled
168 | else:
169 | class_sizes_rest = class_sizes_labeled + class_sizes_unlabeled
170 |
171 | print('\n\n')
172 | print('Rest labeled class sizes: {}, sum = {}'.format(
173 | class_sizes_labeled, sum(class_sizes_labeled)))
174 | print('Rest unlabeled class sizes: {}, sum = {}'.format(
175 | class_sizes_unlabeled, sum(class_sizes_unlabeled)))
176 | print('Rest total class sizes: {}, sum = {}'.format(
177 | class_sizes_rest, sum(class_sizes_rest)))
178 | print('\n\n')
179 |
180 | return mask_labeled_rest
181 |
182 |
183 | def gen_semilabeled_data(inputs, targets, flags):
184 | """
185 | Generates a sequence of all inputs and targets, prepended with an id
186 | of the sample. A boolean value indicating if the label is observed by
187 | the algorithm or not is also generated.
188 | """
189 |
190 | assert len(inputs) == len(targets) == len(flags)
191 |
192 | indices = range(len(inputs))
193 |
194 | for i, j, k, l in zip(indices, inputs, targets, flags):
195 | yield i, j, k, l
196 |
197 |
198 | def gen_data_stream(inputs, targets, shuffle=False, seed=None):
199 | """
200 | Generates a sequence of all inputs and targets, optionally shuffled.
201 | """
202 |
203 | assert len(inputs) == len(targets)
204 | if shuffle:
205 | indices = np.arange(len(inputs))
206 | random_state = check_random_state(seed)
207 | random_state.shuffle(indices)
208 | for i in indices:
209 | yield inputs[i], targets[i]
210 | else:
211 | for i in range(len(inputs)):
212 | yield inputs[i], targets[i]
213 |
--------------------------------------------------------------------------------
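
A small sketch with made-up toy labels (not from the repository) showing how the two split helpers are combined, mirroring their use in the experiment scripts:

    import numpy as np
    from ilp.helpers.data_flow import split_burn_in_rest, split_labels_rest

    y = np.repeat(np.arange(3), 20)   # toy stream: 3 classes, 20 samples each

    # Burn-in set: 2 labeled samples per class, half of the burn-in labeled
    ind_burn_in, mask_labeled_burn_in = split_burn_in_rest(
        y, n_labeled_per_class=2, ratio_labeled=0.5, shuffle=True, seed=0)

    # Remaining stream: observe 10% of the labels
    mask_rest = np.ones(len(y), dtype=bool)
    mask_rest[ind_burn_in] = False
    mask_labeled_rest = split_labels_rest(
        y[mask_rest], ratio_labeled=0.1, batch_size=0, shuffle=True, seed=0)
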
/ilp/helpers/datasets.yml:
--------------------------------------------------------------------------------
1 | KITTI_FEATURES:
2 | name : KITTI_FEATURES
3 | test_size : 9090
4 |
5 |
6 | BLOBS:
7 | name : BLOBS
8 | pca : True
9 | test_size : 0.0
10 |
11 |
12 | MNIST:
13 | name : MNIST
14 | test_size : 10000
15 |
16 |
17 | USPS:
18 | name : USPS
19 | test_size : 2007
20 |
--------------------------------------------------------------------------------
/ilp/helpers/fc_heap.py:
--------------------------------------------------------------------------------
1 | import heapq
2 | import warnings
3 |
4 |
5 | class FixedCapacityHeap:
6 | """Implementation of a min-heap with fixed capacity.
7 | The heap contains tuples of the form (edge_weight, node_id),
8 | which means the min. edge weight is extracted first
9 | """
10 |
11 | def __init__(self, lst=None, capacity=10):
12 | self.capacity = capacity
13 | if lst is None:
14 | self.data = []
15 | elif type(lst) is list:
16 | self.data = lst
17 | else:
18 | self.data = lst.tolist()
19 |
20 | if lst is not None:
21 | heapq.heapify(self.data)
22 | if len(self.data) > capacity:
23 | msg = 'Input data structure is larger than the queue\'s ' \
24 | 'capacity ({}), truncating to smallest ' \
25 | 'elements.'.format(capacity)
26 |
27 | warnings.warn(msg, UserWarning)
28 | self.data = self.data[:self.capacity]
29 |
30 | def push(self, item):
31 | """Insert an element in the heap if its key is smaller than the current
32 | max-key elements and remove the current max-key element if the new
33 | heap size exceeds the heap capacity
34 |
35 | Args:
36 | item (tuple): (edge_weight, node_ind)
37 |
38 | Returns:
39 | tuple : (bool, item)
40 | bool: whether the item was actually inserted in the queue
41 | item: another item that was removed from the queue or None if none was removed
42 |
43 | """
44 | inserted = False
45 | removed = None
46 | if len(self.data) < self.capacity:
47 | heapq.heappush(self.data, item)
48 | inserted = True
49 | else:
50 | if item > self.get_min():
51 | removed = heapq.heappushpop(self.data, item)
52 | inserted = True
53 |
54 | return inserted, removed
55 |
56 | def get_min(self):
57 | """Return the min-key element without removing it from the heap"""
58 | return self.data[0]
59 |
60 | def __len__(self):
61 | return len(self.data)
62 |
--------------------------------------------------------------------------------
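
A minimal sketch with hypothetical edge weights (not from the repository) of the heap's eviction behaviour: once capacity is reached, a new (weight, node) pair only enters if its weight exceeds the current minimum, which is returned as the removed item:

    from ilp.helpers.fc_heap import FixedCapacityHeap

    heap = FixedCapacityHeap(capacity=3)
    for item in [(0.2, 0), (0.9, 1), (0.5, 2)]:
        heap.push(item)

    inserted, removed = heap.push((0.7, 3))   # evicts (0.2, 0)
    print(inserted, removed, sorted(heap.data))
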
/ilp/helpers/log.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import logging
3 |
4 |
5 | def make_logger(name, path=None):
6 | logger = logging.getLogger(name)
7 | logger.setLevel(logging.DEBUG)
8 | stream_handler = logging.StreamHandler(stream=sys.stdout)
9 | # fmt = '%(asctime)s ' \
10 | fmt = '[%(levelname)-10s] %(name)-10s : %(message)s'
11 | # fmt = '[{levelname}] {name} {message}'
12 | formatter = logging.Formatter(fmt=fmt, style='%')
13 | stream_handler.setFormatter(formatter)
14 | logger.addHandler(stream_handler)
15 |
16 | if path:
17 | file_handler = logging.FileHandler(filename=path)
18 | file_handler.setFormatter(formatter)
19 | logger.addHandler(file_handler)
20 |
21 | return logger
22 |
--------------------------------------------------------------------------------
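
A minimal usage sketch mirroring how the experiment and helper modules above create their loggers:

    from ilp.helpers.log import make_logger

    logger = make_logger(__name__)
    logger.info('Loaded configuration.')
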
/ilp/helpers/params_parse.py:
--------------------------------------------------------------------------------
1 | import os
2 | import yaml
3 | from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
4 |
5 | from ilp.constants import CONFIG_DIR
6 | from ilp.helpers.data_fetcher import SUPPORTED_DATASETS
7 |
8 |
9 | def args_to_dict(args):
10 |
11 | sample_yaml = os.path.join(CONFIG_DIR, 'default.yml')
12 | with open(sample_yaml, 'r') as f:
13 |         sample_dict = yaml.safe_load(f)
14 |
15 | params_dict = dict(sample_dict)
16 | for k, v in args.items():
17 | for section in sample_dict:
18 | if k in sample_dict[section]:
19 | params_dict[section][k] = v
20 |
21 | return params_dict
22 |
23 |
24 | def parse_yaml(config_file):
25 |
26 | sample_yaml = os.path.join(CONFIG_DIR, 'default.yml')
27 | with open(sample_yaml, 'r') as f:
28 |         default_params = yaml.safe_load(f)
29 | 
30 |     with open(config_file) as cfg_file:
31 |         params = yaml.safe_load(cfg_file)
32 |
33 |     for k, v in default_params.items():
34 |         if k not in params:
35 |             params[k] = v
36 |
37 | return params
38 |
39 |
40 | def print_config(params):
41 | for section in params:
42 | print('\n{} PARAMS:'.format(section.upper()))
43 | if type(params[section]) is dict:
44 | for k, v in params[section].items():
45 | print('{}: {}'.format(k, v))
46 | else:
47 | print('{}'.format(params[section]))
48 |
49 |
50 | def experiment_arg_parser():
51 | arg_parser = ArgumentParser(description="ILP experiment", formatter_class=ArgumentDefaultsHelpFormatter)
52 |
53 | # Dataset
54 | arg_parser.add_argument(
55 | '-d', '--dataset', type=str, default='digits',
56 | help='Load the given dataset.\nSupported datasets are: {}'
57 | .format(SUPPORTED_DATASETS)
58 | )
59 |
60 | arg_parser.add_argument(
61 | '-n', '--n_runs', type=int, default=1,
62 | help='Number of times to run the experiment with different seeds'
63 | )
64 |
65 | arg_parser.add_argument(
66 | '-p', '--plot', type=str, default='',
67 | help='Plot the latest experiment results from the given directory'
68 | )
69 |
70 | return arg_parser
71 |
--------------------------------------------------------------------------------
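
A minimal sketch (not part of the repository; the dataset and run count are example arguments) combining the argument parser with the YAML loader, as the experiment entry points do:

    import os

    from ilp.constants import CONFIG_DIR
    from ilp.helpers.params_parse import experiment_arg_parser, parse_yaml, print_config

    parser = experiment_arg_parser()
    args = vars(parser.parse_args(['-d', 'usps', '-n', '2']))

    config = parse_yaml(os.path.join(CONFIG_DIR, 'default.yml'))
    print_config(config)
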
/ilp/helpers/stats.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import os
3 | import shelve
4 | import sklearn.preprocessing as prep
5 | from datetime import datetime
6 | from threading import Thread
7 | from enum import Enum
8 | from queue import Queue
9 |
10 | from ilp.constants import EPS_32, EPS_64, STATS_DIR
11 | from ilp.helpers.log import make_logger
12 |
13 |
14 | STATS_FILE_EXT = '.stat'
15 |
16 |
17 | class JobType(Enum):
18 | EVAL = 1
19 | ONLINE_ITER = 3
20 | PRINT_STATS = 6
21 | LABEL_STREAM = 9
22 | TRAIN_PRED = 11
23 | TEST_PRED = 12
24 | RUNTIME = 13
25 | POINT_PREDICTION = 14
26 |
27 |
28 | logger = make_logger(__name__)
29 |
30 | class StatisticsWorker:
31 | """
32 | Parameters
33 | ----------
34 |
35 | config : dict
36 | Dictionary with configuration key-value pairs of the running experiment.
37 |
38 | path : str, optional
39 | File path to save the aggregated statistics (default=None).
40 |
41 | isave : int, optional
42 |         The frequency of saving statistics (default=1).
43 |
44 |
45 | Attributes
46 | ----------
47 |
48 | _stats : Statistics
49 | Class to store different kinds of statistics during an experiment run.
50 |
51 | _jobs : Queue
52 | Queue of jobs to process in a different thread.
53 |
54 | _thread : Thread
55 | Thread in which to process incoming jobs.
56 |
57 | """
58 |
59 | def __init__(self, config, path=None, isave=1):
60 | if path is None:
61 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S')
62 | dataset_name = config['dataset']['name']
63 | filename = 'stats_' + cur_time + '_' + str(dataset_name) + STATS_FILE_EXT
64 | path = os.path.join(STATS_DIR, filename)
65 | elif not path.endswith(STATS_FILE_EXT):
66 | path = path + STATS_FILE_EXT
67 |
68 | os.makedirs(os.path.split(path)[0], exist_ok=True)
69 | self.path = path
70 | self.config = config
71 | self.isave = isave
72 | self._stats = Statistics()
73 | self._jobs = Queue()
74 | self._thread = Thread(target=self.work)
75 |
76 | def start(self):
77 | self.n_iter_eval = 0
78 | self._thread.start()
79 |
80 | def save(self):
81 | prev_file = self.path + '_iter_' + str(self.n_iter_eval)
82 | if os.path.exists(self.path):
83 | os.rename(self.path, prev_file)
84 | with shelve.open(self.path, 'c') as shelf:
85 | shelf['stats'] = self._stats.__dict__
86 | shelf['config'] = self.config
87 | if os.path.exists(prev_file):
88 | os.remove(prev_file)
89 |
90 | def stop(self):
91 | self._jobs.put_nowait({'job_type': JobType.PRINT_STATS})
92 | self._jobs.put(None)
93 | self._thread.join()
94 | self.save()
95 |
96 | def send(self, d):
97 | self._jobs.put_nowait(d)
98 |
99 | def work(self):
100 | while True:
101 | job = self._jobs.get()
102 |
103 | if job is None: # End of algorithm
104 | self._jobs.task_done()
105 | break
106 |
107 | job_type = job['job_type']
108 |
109 | if job_type == JobType.EVAL:
110 | self._stats.evaluate(job['y_est'], job['y_true'])
111 | self.n_iter_eval += 1
112 | elif job_type == JobType.LABEL_STREAM:
113 | self._stats.label_stream_true = job['y_true']
114 | self._stats.label_stream_mask_observed = job['mask_obs']
115 | elif job_type == JobType.POINT_PREDICTION:
116 | f = job['vec']
117 | h = self._stats.entropy(f)
118 | self._stats.entropy_point_after.append(h)
119 | y = job['y']
120 | self._stats.pred_point_after.append(y)
121 | self._stats.conf_point_after.append(f.max())
122 | elif job_type == JobType.ONLINE_ITER:
123 | self._stats.iter_online_count.append(job['n_in_iter'])
124 | self._stats.iter_online_duration.append(job['dt'])
125 | elif job_type == JobType.PRINT_STATS:
126 | err = self._stats.clf_error_mixed[-1] * 100
127 | logger.info('Classif. Error: {:5.2f}%\n\n'.format(err))
128 | elif job_type == JobType.TRAIN_PRED:
129 | self._stats.train_est = job['y_est']
130 | elif job_type == JobType.TEST_PRED:
131 | y_pred_knn = job['y_pred_knn']
132 | self._stats.test_pred_knn.append(y_pred_knn)
133 |
134 | y_pred_lp = job['y_pred_lp']
135 | self._stats.test_pred_lp.append(y_pred_lp)
136 |
137 | self._stats.test_true = y_test = job['y_true']
138 |
139 | err_knn = np.mean(np.not_equal(y_pred_knn, y_test))
140 | err_lp = np.mean(np.not_equal(y_pred_lp, y_test))
141 | logger.info('knn test err: {:5.2f}%'.format(err_knn*100))
142 | logger.info('ILP test err: {:5.2f}%'.format(err_lp*100))
143 | self._stats.test_error_knn.append(err_knn)
144 | self._stats.test_error_ilp.append(err_lp)
145 | elif job_type == JobType.RUNTIME:
146 | self._stats.runtime = job['t']
147 |
148 | if self.n_iter_eval % self.isave == 0:
149 | self.save()
150 |
151 | self._jobs.task_done()
152 |
153 |
154 | EXCLUDED_METRICS = {'label_stream_true',
155 | 'label_stream_mask_observed',
156 | 'n_burn_in',
157 | 'test_pred_knn', 'test_pred_lp', 'test_true',
158 | 'train_est', 'runtime',
159 | 'conf_point_after', 'test_error_knn', 'test_error_ilp'}
160 |
161 |
162 | class Statistics:
163 | """
164 | Statistics gathered during learning (training and testing).
165 | """
166 | def __init__(self):
167 | self.iter_online_count = []
168 | self.iter_online_duration = []
169 |
170 | # Evaluation after a new point arrives
171 | self.n_invalid_samples = []
172 | self.invalid_samples_ratio = []
173 | self.clf_error_mixed = []
174 | self.clf_error_valid = []
175 | self.l1_error_mixed = []
176 | self.l1_error_valid = []
177 | self.cross_ent_mixed = []
178 | self.cross_ent_valid = []
179 | self.entropy_pred_mixed = []
180 | self.entropy_pred_valid = []
181 |
182 | # Defined once
183 | self.label_stream_true = []
184 | self.label_stream_mask_observed = []
185 | self.test_pred_knn = []
186 | self.test_pred_lp = []
187 | self.test_true = []
188 | self.train_est = []
189 | self.runtime = np.nan
190 | self.test_error_ilp = []
191 | self.test_error_knn = []
192 | self.entropy_point_after = []
193 | self.pred_point_after = []
194 | self.conf_point_after = []
195 |
196 | def evaluate(self, y_predictions, y_true):
197 | """Computes statistics for a given set of predictions and the ground truth.
198 |
199 | Args:
200 |             y_predictions (array_like): [u_samples, n_classes] soft class predictions for current unlabeled samples
201 |             y_true (array_like): [u_samples, n_classes] one-hot encoding of the true classes of the unlabeled samples
202 | 
203 |         Returns:
204 |             None. The computed metrics (classification error, l1 error,
205 |             cross entropy, prediction entropy) are appended to the
206 |             corresponding lists of this Statistics instance.
207 | 
208 | """
209 |
210 | u_samples, n_classes = y_predictions.shape
211 |
212 | # Clip predictions to [0,1]
213 | eps = EPS_32 if y_predictions.itemsize == 4 else EPS_64
214 | y_pred_01 = np.clip(y_predictions, eps, 1-eps)
215 | # Normalize predictions to make them proper distributions
216 | y_pred = prep.normalize(y_pred_01, copy=False, norm='l1')
217 |
218 | # 0-1 Classification error under valid and invalid points
219 | y_pred_max = np.argmax(y_pred, axis=1)
220 | y_true_max = np.argmax(y_true, axis=1)
221 | fc_err_mixed = self.zero_one_loss(y_pred_max, y_true_max)
222 | self.clf_error_mixed.append(fc_err_mixed)
223 |
224 | # L1 error under valid and invalid points
225 | l1_err_mixed = np.mean(self.l1_error(y_pred, y_true))
226 | self.l1_error_mixed.append(l1_err_mixed)
227 |
228 | # Cross-entropy loss
229 | crossent_mixed = np.mean(self.cross_entropy(y_true, y_pred))
230 | self.cross_ent_mixed.append(crossent_mixed)
231 |
232 | # Identify valid points (for which a label has been estimated)
233 | ind_valid, = np.where(y_pred.sum(axis=1) != 0)
234 | n_valid = len(ind_valid)
235 | n_invalid = u_samples - n_valid
236 |
237 | self.n_invalid_samples.append(n_invalid)
238 | self.invalid_samples_ratio.append(n_invalid / u_samples)
239 |
240 | # Entropy of the predictions
241 | if n_invalid == 0:
242 | entropy_pred_mixed = np.mean(self.entropy(y_pred))
243 | self.entropy_pred_mixed.append(entropy_pred_mixed)
244 | return
245 |
246 | y_pred_valid = y_pred[ind_valid]
247 | y_true_valid = y_true[ind_valid]
248 |
249 | # 0-1 Classification error under valid points only
250 | y_pred_valid_max = y_pred_max[ind_valid]
251 | y_true_valid_max = y_true_max[ind_valid]
252 | err_valid_max = self.zero_one_loss(y_pred_valid_max, y_true_valid_max)
253 | self.clf_error_valid.append(err_valid_max)
254 |
255 | # L1 error under valid points only
256 | l1_err_valid = np.mean(self.l1_error(y_pred_valid, y_true_valid))
257 | self.l1_error_valid.append(l1_err_valid)
258 |
259 | # Cross-entropy loss
260 | ce_valid = np.mean(self.cross_entropy(y_true_valid, y_pred_valid))
261 | self.cross_ent_valid.append(ce_valid)
262 |
263 | # Entropy of the predictions
264 | entropy_pred_valid = np.mean(self.entropy(y_pred_valid))
265 | self.entropy_pred_valid.append(entropy_pred_valid)
266 | n_total = n_valid + n_invalid
267 | entropy_pred_mixed = (entropy_pred_valid*n_valid + n_invalid) / n_total
268 | self.entropy_pred_mixed.append(entropy_pred_mixed)
269 |
270 |
271 | @staticmethod
272 | def zero_one_loss(y_pred, y_true, average=True):
273 | """
274 |
275 | Args:
276 |             y_pred (array_like): (n_samples,) predicted class labels
277 |             y_true (array_like): (n_samples,) true class labels
278 |             average (bool): Whether to take the average over all predictions.
279 | 
280 |         Returns: The mean (if average is True) or the total number of
281 |         misclassified predictions.
282 |
283 | """
284 |
285 | if average:
286 | return np.mean(np.not_equal(y_pred, y_true))
287 | else:
288 | return np.sum(np.not_equal(y_pred, y_true))
289 |
290 | @staticmethod
291 | def l1_error(y_pred, y_true, norm=True):
292 | """
293 |
294 | Args:
295 | y_pred (array_like): An array of probability distributions (usually predictions) with shape (n_distros, n_classes)
296 | y_true (array_like): An array of probability distributions (usually groundtruth) with shape (n_distros, n_classes)
297 | norm (bool): Whether to constrain the L1 error to be in [0,1].
298 |
299 |         Returns: The summed absolute difference for each row, which lies in [0,2] for p.d.f.s, or in [0,1] if norm is True.
300 |
301 | """
302 |
303 | l1_error = np.abs(y_pred - y_true).sum(axis=1)
304 | if norm:
305 | l1_error /= 2
306 |
307 | return l1_error
308 |
309 | @staticmethod
310 | def entropy(p, norm=True):
311 | """
312 |
313 | Args:
314 | p (array_like): An array of probability distributions with shape (n_distros, n_classes)
315 | norm (bool): Whether to normalize the entropy to constrain it in [0,1]
316 |
317 | Returns: An array of entropies of the distributions with shape (n_distros,)
318 |
319 | """
320 |
321 | entropy = - (p * np.log(p)).sum(axis=1)
322 | if norm:
323 | entropy /= np.log(p.shape[1])
324 |
325 | return entropy
326 |
327 | @staticmethod
328 | def cross_entropy(p, q, norm=True):
329 | """
330 |
331 | Args:
332 | p (array_like): An array of probability distributions (usually groundtruth) with shape (n_distros, n_classes)
333 | q (array_like): An array of probability distributions (usually predictions) with shape (n_distros, n_classes)
334 |             norm (bool): Whether to divide by log(n_classes), as for the entropy (the cross entropy may still exceed 1)
335 |
336 | Returns: An array of cross entropies between the groundtruth and the prediction with shape (n_distros,)
337 |
338 | """
339 |
340 | cross_ent = -(p * np.log(q)).sum(axis=1)
341 | if norm:
342 | cross_ent /= np.log(p.shape[1])
343 |
344 | return cross_ent
345 |
346 |
347 | def aggregate_statistics(stats_path, metrics=None, excluded_metrics=None):
348 |
349 | print('Aggregating statistics from {}'.format(stats_path))
350 | if stats_path.endswith(STATS_FILE_EXT):
351 | list_of_files = [stats_path]
352 | else:
353 | list_of_files = [os.path.join(stats_path, f) for f in os.listdir(
354 | stats_path) if f.endswith(STATS_FILE_EXT)]
355 |
356 | stats_runs = []
357 | random_states = []
358 | for stats_file in list_of_files:
359 | with shelve.open(stats_file, 'r') as f:
360 | stats_runs.append(f['stats'])
361 | random_states.append(f['config']['options']['random_state'])
362 |
363 | print('\nRandom seeds used: {}\n'.format(random_states))
364 |
365 | if metrics is None:
366 | metrics = Statistics().__dict__.keys()
367 |
368 | if excluded_metrics is None:
369 | excluded_metrics = EXCLUDED_METRICS
370 |
371 | stats_mean, stats_std = {}, {}
372 | stats_run0 = stats_runs[0]
373 |
374 | for metric in metrics:
375 | if metric in excluded_metrics: continue
376 | if metric not in stats_run0:
377 | print('\nMetric {} not found!'.format(metric))
378 | continue
379 | metric_lists = [stats[metric] for stats in stats_runs]
380 |
381 | # Make a numpy 2D array to merge the different runs
382 | metric_runs = np.asarray(metric_lists)
383 | s = metric_runs.shape
384 | if len(s) < 2:
385 | print('No values for metric, skipping.')
386 | continue
387 | stats_mean[metric] = np.mean(metric_runs, axis=0)
388 | stats_std[metric] = np.std(metric_runs, axis=0)
389 |
390 | with shelve.open(list_of_files[0], 'r') as f:
391 | config = f['config']
392 |
393 | lp_times = stats_mean['iter_online_duration']
394 | ice = stats_mean['clf_error_mixed'][0] * 100
395 | fce = stats_mean['clf_error_mixed'][-1] * 100
396 | print('Avg. LP time/iter: {:.4f}s'.format(np.mean(lp_times)))
397 | print('Initial classification error: {:.2f}%'.format(ice))
398 | print('Final classification error: {:.2f}%'.format(fce))
399 |
400 | # Add excluded metrics in the end
401 | for stats_run in stats_runs:
402 | for ex_metric in excluded_metrics:
403 | if ex_metric in stats_run:
404 | print('Appending excluded metric: {}'.format(ex_metric))
405 | stats_mean[ex_metric] = stats_run[ex_metric]
406 |
407 | if len(list_of_files) == 1:
408 | stats_std = None
409 |
410 | return stats_mean, stats_std, config
411 |
--------------------------------------------------------------------------------
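
A small worked example with toy distributions (not from the repository) for the static metric helpers defined in Statistics:

    import numpy as np
    from ilp.helpers.stats import Statistics

    y_true = np.array([[1.0, 0.0], [0.0, 1.0]])    # one-hot ground truth
    y_pred = np.array([[0.8, 0.2], [0.4, 0.6]])    # soft predictions

    print(Statistics.l1_error(y_pred, y_true))         # [0.2 0.4] (halved, so in [0, 1])
    print(Statistics.entropy(y_pred))                  # normalized by log(n_classes)
    print(Statistics.cross_entropy(y_true, y_pred))
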
/ilp/plots/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/plots/__init__.py
--------------------------------------------------------------------------------
/ilp/plots/plot_stats.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib
3 | from matplotlib import pyplot as plt
4 | from itertools import count, product
5 | from math import ceil
6 | from sklearn.metrics import confusion_matrix
7 | from tabulate import tabulate
8 |
9 | from ilp.helpers.params_parse import print_config
10 | from ilp.helpers.data_fetcher import fetch_load_data
11 | from ilp.helpers.stats import STATS_FILE_EXT
12 |
13 |
14 | COLORS = ['red', 'darkorange', 'black', 'green', 'cyan', 'blue']
15 |
16 | N_AXES_PER_ROW = 3
17 |
18 | PLOT_LABELS = {'iter_online_count': r'\#LP iterations',
19 | 'iter_online_duration': r'LP time (s)',
20 | 'n_invalid_samples' : r'\#Invalid samples',
21 | 'invalid_samples_ratio': r'Invalid samples ratio',
22 | 'clf_error_mixed': r'classification error',
23 | 'l1_error_mixed': r'$\ell_1$ error',
24 | 'cross_ent_mixed': r'cross entropy',
25 | 'entropy_pred_mixed': r'prediction entropy',
26 | 'theta': r'$\vartheta$',
27 | 'k_L': r'$k_l$',
28 | 'k_U': r'$k_u$',
29 | 'n_L': r'\#labels',
30 | 'srl': r'labeled',
31 | 'entropy_point_after': r'$H(f)$',
32 | 'conf_point_after': r'$\max_c f_i$',
33 | 'test_error_ilp': 'ILP test error',
34 | 'test_error_knn': 'test error (knn)'
35 | }
36 |
37 | PLOT_TITLE = {'entropy_point_after': 'Entropy',
38 | 'conf_point_after': 'Confidence'}
39 |
40 | METRIC_ORDER = [
41 | 'l1_error_mixed',
42 | 'cross_ent_mixed',
43 | 'clf_error_mixed' ,
44 | 'entropy_point_after',
45 | 'entropy_pred_mixed',
46 | 'iter_online_count',
47 | 'iter_online_duration',
48 | 'test_error_ilp',
49 | 'test_error_knn'
50 | ]
51 |
52 | METRIC_TITLE = {'l1_error_mixed':
53 | r'$\frac{1}{2u}\sum\limits_{i} '
54 | r'|{F_U}_{(i)}^{True} - {F_U}_{(i)}|_1$',
55 | 'cross_ent_mixed':
56 | r'$\frac{1}{u}\sum\limits_{i} H({F_U}_{(i)}^{True}, '
57 | r'{F_U}_{(i)})$',
58 | 'entropy_pred_mixed':
59 | r'$\frac{1}{u}\sum\limits_{i} H({F_U}_{(i)})$',
60 | 'clf_error_mixed':
61 | r'$\frac{1}{u}\sum\limits_{i} I(\arg \max {F_U}_{(i)} '
62 | r'\neq \arg \max {F_U}_{(i)}^{True})$'}
63 |
64 | SCATTER_METRIC = ['iter_online_duration', 'iter_online_count']
65 |
66 | LEGEND_METRIC = 'clf_error_mixed'
67 |
68 | DECORATORS = {'iter_offline', 'burn_in_labels_true', 'label_stream_true'}
69 |
70 | X_LABEL_DEFAULT = r'\#observed samples'
71 | DEFAULT_COLOR = 'b'
72 | DEFAULT_MEAN_COLOR = 'r'
73 | DEFAULT_STD_COLOR = 'darkorange'
74 | COLOR_MAP = plt.cm.inferno
75 | N_CURVES = 6 # THETA \in [0., 0.4, 0.8, 1.2, 1.6, 2.0]
76 | COLOR_IDX = np.linspace(0, 1, N_CURVES + 2)[1:-1]
77 |
78 | KITTI_CLASSES = ['car', 'van', 'truck', 'pedestrian', 'sitter', 'cyclist',
79 | 'tram', 'misc']
80 |
81 |
82 | def remove_frame(top=True, bottom=True, left=True, right=True):
83 |
84 | ax = plt.gca()
85 | if top:
86 | ax.spines['top'].set_visible(False)
87 |
88 | if bottom:
89 | ax.spines['bottom'].set_visible(False)
90 |
91 | if left:
92 | ax.spines['left'].set_visible(False)
93 |
94 | if right:
95 | ax.spines['right'].set_visible(False)
96 |
97 |
98 | def print_latex_table(stats_list):
99 |
100 | headers = ['\#Labels', 'Runtime (s)', 'Est. error (%)',
101 | 'knn error (%)', 'ILP error (%)']
102 | table = []
103 |
104 | for _, var_value, stats_value, _ in stats_list:
105 | runtime = stats_value['runtime']
106 | est_err = stats_value['clf_error_mixed'][-1]
107 |
108 | y_pred_knn = stats_value['test_pred_knn']
109 | y_pred = stats_value['test_pred_lp']
110 | y_true = stats_value['test_true']
111 | test_err_knn = np.mean(np.not_equal(y_true, y_pred_knn))
112 | test_err_ilp = np.mean(np.not_equal(y_true, y_pred))
113 |
114 | runtime = '{:6.2f}'.format(runtime)
115 | est_err = '{:5.2f}'.format(est_err*100)
116 |
117 | test_err_knn = '{:5.2f}'.format(test_err_knn*100)
118 | test_err_ilp = '{:5.2f}'.format(test_err_ilp*100)
119 | row = [var_value, runtime, est_err, test_err_knn, test_err_ilp]
120 | table.append(row)
121 |
122 | print(tabulate(table, headers, tablefmt="latex"))
123 |
124 |
125 | def plot_histogram(ax, values, title, xlabel, ylabel, value_range):
126 |
127 | weights = np.ones_like(values) / float(len(values))
128 |     bin_y, bin_x, _ = ax.hist(values, range=value_range, bins=20,
129 | weights=weights, alpha=0.5, align='mid')
130 | print('Bin values min/max: {}, {}'.format(bin_y.min(), bin_y.max()))
131 |
132 | ax.set_xticks(np.arange(0, 1.1, 0.2))
133 | ax.set_yticks(np.arange(0, 1.1, 0.2))
134 | ax.set_title(title, fontweight='bold')
135 | ax.set_xlabel(xlabel, fontweight='bold')
136 | ax.set_ylabel(ylabel)
137 |
138 | ax.spines['top'].set_visible(False)
139 | ax.spines['right'].set_visible(False)
140 |
141 |
142 | def plot_metric_histogram(stats_value, ax1, ax2, metric_key, pos=1):
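    """Histogram a per-sample metric (e.g. entropy or confidence), split into
    correctly and incorrectly predicted unlabeled samples (ax1 vs. ax2).

    `pos` controls decoration when several rows share one figure: 1 = top row
    (gets the panel titles, no x-label), -1 = bottom row (keeps the x-label),
    anything else = middle row (neither).
    """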
143 |
144 | if metric_key not in stats_value:
145 | print('FOUND NO KEY {}.'.format(metric_key))
146 | return
147 | if 'pred_point_after' not in stats_value:
148 | print('FOUND NO KEY pred_point_after.')
149 | return
150 |
151 | y_u_metric = np.asarray(stats_value[metric_key])
152 | print('START PLOT HISTOGRAM FOR {}'.format(metric_key))
153 | y_u_pred = stats_value['pred_point_after'].ravel()
154 |
155 | n = len(y_u_metric)
156 | y_true = stats_value['label_stream_true']
157 | mask_labeled = stats_value['label_stream_mask_observed']
158 | y_u_true = y_true[~mask_labeled]
159 |
160 | mask_correct = np.equal(y_u_pred, y_u_true[-n:])
161 | metric_hits = y_u_metric[mask_correct]
162 | metric_miss = y_u_metric[~mask_correct]
163 |
164 | value_range = (0., 1.)
165 | ylabel = 'Samples ratio'
166 | xlabel = PLOT_LABELS[metric_key]
167 |
168 | if pos == 1:
169 | title = PLOT_TITLE[metric_key] + ' - correct predictions'
170 | xlabel = ''
171 | elif pos == -1:
172 | title = ''
173 | else:
174 | xlabel = ''
175 | title = ''
176 | plot_histogram(ax1, metric_hits, title, xlabel, ylabel, value_range)
177 |
178 | if pos == 1:
179 | title = PLOT_TITLE[metric_key] + ' - false predictions'
180 | xlabel = ''
181 | elif pos == -1:
182 | title = ''
183 | else:
184 | xlabel = ''
185 | title = ''
186 | plot_histogram(ax2, metric_miss, title, xlabel, ylabel, value_range)
187 |
188 |
189 | def plot_metric_histograms(stats, metric_key):
190 |
191 | if type(stats) is list:
192 | fig = plt.figure(6, (8, 2.5*len(stats)), dpi=200)
193 | sp_count = count(1)
194 | for i, (_, var_value, stats_value, _) in enumerate(stats):
195 | ax1 = fig.add_subplot(len(stats), 2, next(sp_count))
196 | ax2 = fig.add_subplot(len(stats), 2, next(sp_count))
197 |
198 | if i == 0:
199 | pos = 1
200 | elif i == len(stats) - 1:
201 | pos = -1
202 | else:
203 | pos = 0
204 |
205 | plot_metric_histogram(stats_value, ax1, ax2, metric_key, pos)
206 |
207 | title = r'$\vartheta = $' + str(var_value)
208 | ax1.set_label(title)
209 | plt.legend()
210 | else:
211 | fig = plt.figure(6, (8, 2.5), dpi=100)
212 | ax1 = fig.add_subplot(1, 2, 1)
213 | ax2 = fig.add_subplot(1, 2, 2)
214 | plot_metric_histogram(stats, ax1, ax2, metric_key)
215 |
216 | fig.subplots_adjust(top=0.9)
217 | fig.tight_layout()
218 |
219 |
220 | def plot_class_distro_stream(ax, y_true, mask_observed, n_burn_in, classes):
221 | """Plot incoming label distributions"""
222 | print('Plotting {}'.format('labels stream distro'))
223 |
224 | xx = np.arange(len(y_true))
225 | x_lu = [xx[mask_observed], xx[~mask_observed]]
226 | y_lu = [y_true[mask_observed], y_true[~mask_observed]]
227 | c_lu = ['red', 'gray']
228 | sizes = [2, 1]
229 | markers = ['d', '.']
230 |
231 | n_labeled = sum(mask_observed)
232 | n_unlabeled = len(y_true) - n_labeled
233 | labels = [r'labeled ({})'.format(n_labeled),
234 | r'unlabeled ({})'.format(n_unlabeled)]
235 |
236 | for x, y, c, s, m, label in zip(x_lu, y_lu, c_lu, sizes, markers, labels):
237 | ax.scatter(x, y, c=c, marker=m, s=s, label=label)
238 |
239 | burn_in_label = 'burn-in ({})'.format(n_burn_in)
240 | ax.vlines(n_burn_in, *ax.get_ylim(), colors='blue', linestyle=':',
241 | label=burn_in_label)
242 |
243 | classes_type = type(classes[0])
244 | if classes_type is str:
245 |         # string class names (e.g. KITTI) get rotated, smaller tick labels
246 | plt.yticks(range(len(classes)), classes, rotation=45, fontsize=7)
247 | else:
248 | ax.set_yticks(classes)
249 |
250 | ax.set_xlabel(X_LABEL_DEFAULT)
251 | ax.set_ylabel(r'class labels')
252 |
253 | ax.spines['top'].set_visible(False)
254 | ax.spines['right'].set_visible(False)
255 |
256 | plt.legend(loc='upper right')
257 |
258 |
259 | def plot_corrected_samples(y_true, y_pred1, y_pred2, config, n_samples=50):
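    """Show test samples that y_pred1 (knn) got wrong but y_pred2 (ILP) got
    right, rendered as square grayscale images; assumes image-valued features
    (e.g. MNIST). At most n_samples randomly chosen cases are shown.
    """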
260 |
261 | dataset = config['dataset']['name']
262 | _, _, X_test, y_test = fetch_load_data(dataset)
263 |
264 | mask_miss1 = np.not_equal(y_true, y_pred1)
265 | print('KNN missed {} test cases'.format(sum(mask_miss1)))
266 | mask_hits2 = np.equal(y_true, y_pred2)
267 | print('ILP missed {} test cases'.format(sum(~mask_hits2)))
268 | mask_miss1_hits2 = np.logical_and(mask_miss1, mask_hits2)
269 | print('ILP missed {} less than KNN'.format(sum(mask_miss1_hits2)))
270 |
271 | samples = X_test[mask_miss1_hits2]
272 | y_pred_miss = y_pred1[mask_miss1_hits2]
273 | y_pred_hit = y_pred2[mask_miss1_hits2]
274 | print('THERE ARE {} CASES OF MISS/HIT'.format(len(samples)))
275 |
276 | if len(samples) > n_samples:
277 | ind = np.random.choice(len(samples), n_samples, replace=False)
278 | samples = samples[ind]
279 | y_pred_miss = y_pred_miss[ind]
280 | y_pred_hit = y_pred_hit[ind]
281 |
282 | fig = plt.figure(4, figsize=(4, 4), dpi=200)
283 | dim = int(np.sqrt(len(X_test[0])))
284 |
285 | n_subplots = len(samples)
286 | n_cols = 10
287 | n_empty_rows = 0 # 2
288 | n_rows = ceil(n_subplots / n_cols) + n_empty_rows
289 | subplot_count = count(1)
290 |
291 | for x, ym, yh in zip(samples, y_pred_miss, y_pred_hit):
292 | i = next(subplot_count)
293 | ax = fig.add_subplot(n_rows, n_cols, i)
294 | ax.set_xticks([])
295 | ax.set_yticks([])
296 |
297 | image = x.reshape(dim, dim)
298 | ax.imshow(image, cmap=plt.cm.gray)
299 | ax.set_title(r'{}$\rightarrow${}'.format(ym, yh), fontsize=8,
300 | fontweight='bold', horizontalalignment="center")
301 | fig.add_subplot(ax)
302 |
303 | fig.suptitle('Corrected samples')
304 |
305 |
306 | def plot_confusion_matrix(y_true, y_pred, title, cmap=plt.cm.Greys):
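    """Plot a confusion matrix normalized per true-class row, with the
    absolute counts overlaid (white text on dark cells, black on light ones).
    """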
307 |
308 | cm = confusion_matrix(y_true, y_pred)
309 | # Normalize
310 | cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
311 | true_labels = [str(int(y)) for y in np.unique(y_true)]
312 | pred_labels = [str(int(y)) for y in np.unique(y_pred)]
313 | plt.imshow(cm_norm, interpolation='nearest', cmap=cmap)
314 | plt.title(title, fontsize=14, fontweight='bold')
315 | xtick_marks = np.arange(len(true_labels))
316 | ytick_marks = np.arange(len(pred_labels))
317 | plt.xticks(xtick_marks, true_labels, fontsize=6, fontweight='bold')
318 | plt.yticks(ytick_marks, pred_labels, fontsize=6, fontweight='bold')
319 |
320 |     plt.setp(plt.gca().get_xticklabels(), color="b")
321 |     plt.setp(plt.gca().get_yticklabels(), color="b")
322 |
323 | thresh = cm_norm.max() / 2.
324 | for i, j in product(range(cm.shape[0]), range(cm.shape[1])):
325 | s = r'{:4}'.format(str(cm[i, j]))
326 | s = r'\textbf{' + s + r'}'
327 | plt.text(j, i, s, fontsize=8,
328 | horizontalalignment="center", verticalalignment='center',
329 | color="white" if cm_norm[i, j] > thresh else "black")
330 |
331 |     plt.tick_params(top=False, bottom=False, left=False, right=False,
332 |                     labelleft=True, labelbottom=True)
333 |
334 | remove_frame()
335 |
336 | plt.tight_layout()
337 | ax = plt.gca()
338 | ax.set_ylabel('True label', fontsize=10)
339 | ax.set_xlabel('Predicted label', fontsize=10)
340 |
341 |
342 | def plot_confusion_matrices(stats, config, improvements=True):
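    """Plot knn and ILP confusion matrices on the test set and, if requested,
    the test samples that ILP corrected with respect to knn.
    """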
343 | matplotlib.rcParams['text.latex.unicode'] = True
344 | print('Plotting confusion matrices...')
345 |
346 | y_pred_knn = stats['test_pred_knn']
347 | y_pred_ilp = stats['test_pred_lp']
348 | y_true = stats['test_true']
349 |
350 | err_knn = np.mean(np.not_equal(y_pred_knn, y_true))
351 | err_ilp = np.mean(np.not_equal(y_pred_ilp, y_true))
352 | print('knn error: {}'.format(err_knn))
353 | print('ilp error: {}'.format(err_ilp))
354 |
355 | fig = plt.figure(3, figsize=(8, 4), dpi=200)
356 | n_subplots = 2
357 |
358 | fig.add_subplot(1, n_subplots, 1)
359 | plot_confusion_matrix(y_true, y_pred_knn, '$knn$')
360 |
361 | fig.add_subplot(1, n_subplots, 2)
362 | plot_confusion_matrix(y_true, y_pred_ilp, 'ILP')
363 |
364 | if improvements:
365 | plot_corrected_samples(y_true, y_pred_knn, y_pred_ilp,
366 | config, n_samples=40)
367 |
368 | dt = config['dataset']['name'].upper()
369 | fig.suptitle(r'\textbf{Confusion}' + ' (' + dt + ')')
370 |
371 | fig.tight_layout()
372 | fig.subplots_adjust(top=0.85)
373 |
374 |
375 | def plot_single_run_single_var(stats, fig, config=None):
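    """Plot every available metric of a single run in its own subplot, plus
    the incoming label stream and the offline-iteration durations when present.
    """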
376 |
377 | metrics_to_plot = []
378 | for metric_key in METRIC_ORDER:
379 | if metric_key in stats and len(stats[metric_key]):
380 | metrics_to_plot.append(metric_key)
381 |
382 | n_subplots = len(metrics_to_plot)
383 | if 'label_stream_true' in stats:
384 | n_subplots += 1
385 | if 'iter_offline_duration' in stats:
386 | n_subplots += 1
387 |
388 | n_cols = 4
389 | n_rows = ceil(n_subplots / n_cols)
390 | plot_count = count(1)
391 |
392 | n = len(stats['iter_online_count'])
393 | xx = range(n)
394 | for metric_key in metrics_to_plot:
395 | print('Plotting metric {}'.format(metric_key))
396 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count))
397 | if metric_key in SCATTER_METRIC:
398 | ax.scatter(xx, stats[metric_key], s=2)
399 | else:
400 | ax.plot(stats[metric_key], color=DEFAULT_COLOR, label=None, lw=2)
401 |
402 | ax.set_xlabel(X_LABEL_DEFAULT)
403 | ax.set_ylabel(PLOT_LABELS[metric_key])
404 |
405 | if 'label_stream_true' in stats and config is not None:
406 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count))
407 | y_true = np.asarray(stats['label_stream_true'])
408 | y_mask = np.asarray(stats['label_stream_mask_observed'])
409 | n_burn_in = config['data'].get('n_burn_in', 0)
410 | classes = np.unique(y_true)
411 | dataset_config = config.get('dataset', None)
412 | if dataset_config is not None:
413 | classes = dataset_config.get('classes', classes)
414 | if dataset_config['name'].startswith('kitti'):
415 | classes = KITTI_CLASSES
416 | plot_class_distro_stream(ax, y_true, y_mask, n_burn_in, classes)
417 |
418 | iod = stats.get('iter_offline_duration', None)
419 | if iod is not None:
420 | if len(iod):
421 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count))
422 | ax.scatter(stats['iter_offline'], iod)
423 | ax.set_xlabel(X_LABEL_DEFAULT)
424 | ax.set_ylabel(PLOT_LABELS['iter_offline_duration'])
425 |
426 | fig.subplots_adjust(top=0.9)
427 |
428 |
429 | def plot_single_run_multi_var(stats_list, fig, config):
430 |     r"""Single run for each value a variable (e.g. \vartheta) takes."""
431 |
432 | kitti = config['dataset']['name'].startswith('kitti')
433 | LW = 1.5
434 |
435 | var_name, var_value0, stats0, _ = stats_list[0]
436 | var_label = PLOT_LABELS[var_name]
437 | metrics_to_plot = []
438 |
439 | metric_order = METRIC_ORDER
440 | if var_name.startswith('k'):
441 | metric_order = METRIC_ORDER[:3]
442 | for metric_key in metric_order:
443 | print('Metric {}'.format(metric_key))
444 |         print(stats0.get(metric_key))
445 | if metric_key in stats0:
446 | if hasattr(stats0[metric_key], '__len__'):
447 | if len(stats0[metric_key]):
448 | print('Metric to plot: {}'.format(metric_key))
449 | metrics_to_plot.append(metric_key)
450 |
451 | stats_list.sort(key=lambda x: float(x[1]))
452 |
453 | for i, (_, var_value, stats_value, _) in enumerate(stats_list):
454 | rt = stats_value['runtime']
455 | print('theta = {}, runtime = {}'.format(var_value, rt))
456 |
457 | if len(stats_list) > 7:
458 | # for n_neighbors pick 1,3,5,7,11,15,19 -> idx = [0,1,2,3,5,7,9]
459 |         stats_list = [stats_list[i] for i in [0, 1, 2, 3, 5, 7, 9]]
460 |
461 | n_subplots = len(metrics_to_plot)
462 | n_cols = N_AXES_PER_ROW if n_subplots >= N_AXES_PER_ROW else n_subplots
463 | n_rows = ceil(n_subplots / n_cols)
464 | plot_count = count(1)
465 |
466 | n_values = len(stats_list)
467 | if n_values != N_CURVES:
468 | color_idx = np.linspace(0, 1, n_values+2)
469 | color_idx = color_idx[1:-1]
470 | else:
471 | color_idx = COLOR_IDX
472 |
473 | n = len(stats0['iter_online_count'])
474 | xx = range(n)
475 | for metric_key in metrics_to_plot:
476 | print('Plotting metric {}'.format(metric_key))
477 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count))
478 | for i, (_, var_value, stats_value, _) in enumerate(stats_list):
479 | c = COLOR_MAP(color_idx[i])
480 |
481 | val = int(float(var_value)*100)
482 | label = r'{}={}\%'.format(var_label, int(val))
483 | if metric_key in SCATTER_METRIC:
484 | ax.scatter(xx, stats_value[metric_key], s=1, color=c,
485 | label=label)
486 | else:
487 | if kitti and metric_key.startswith('test_error'):
488 | test_errors = stats_value[metric_key]
489 | print('Final Test error: {:5.2f} for {}% of labels'.format(
490 | test_errors[-1]*100, var_value))
491 |                     # x positions assume one test evaluation per 1000 observed samples
492 |                     test_times = np.arange(1, len(test_errors) + 1) * 1000
492 | ax.plot(test_times, test_errors, color=c, label=label,
493 | lw=LW, marker='.')
494 | ax.set_ylim((0.0, 0.5))
495 | # ax.set_xticks(test_times)
496 | else:
497 | ax.plot(stats_value[metric_key], color=c, label=label,
498 | lw=LW)
499 |
500 | ax.set_xlabel(X_LABEL_DEFAULT)
501 | ax.set_ylabel(PLOT_LABELS[metric_key])
502 | if metric_key == LEGEND_METRIC:
503 | plt.legend(loc='best')
504 |
505 | plt.legend(loc='best')
506 |
507 |
508 | def plot_multi_run_single_var(stats_mean, stats_std, fig):
509 |     r"""Multiple runs (random seeds) for a single variable (e.g. \vartheta)"""
510 |
511 | metrics_to_plot = []
512 | for metric_key in METRIC_ORDER:
513 | if metric_key in stats_mean and len(stats_mean[metric_key]):
514 | metrics_to_plot.append(metric_key)
515 |
516 | n_subplots = len(metrics_to_plot)
517 | n_cols = 4
518 | n_rows = ceil(n_subplots / n_cols)
519 | plot_count = count(1)
520 |
521 | for metric_key in metrics_to_plot:
522 | print('Plotting metric {}'.format(metric_key))
523 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count))
524 | metric_mean = stats_mean[metric_key]
525 | color = DEFAULT_MEAN_COLOR
526 | ax.plot(metric_mean, color=color, label=None, lw=2)
527 | metric_std = 1 * stats_std[metric_key]
528 | lb, ub = metric_mean - metric_std, metric_mean + metric_std
529 | color = DEFAULT_STD_COLOR
530 | ax.plot(lb, color=color)
531 | ax.plot(ub, color=color)
532 | ax.fill_between(range(len(lb)), lb, ub, facecolor=color, alpha=0.5)
533 | ax.set_xlabel(X_LABEL_DEFAULT)
534 | ax.set_ylabel(PLOT_LABELS[metric_key])
535 |
536 |
537 | def plot_multi_run_multi_var(stats_list, fig):
538 |     """Multiple runs (random seeds) for each value a variable takes."""
539 |
540 | headers = [r'$\vartheta$', 'Runtime (s)', 'Est. error (%)',
541 | 'Test error (%)']
542 | table = []
543 | for i in range(len(stats_list)):
544 | _, var_value, stats_value_mean, stats_value_std = stats_list[i]
545 | print('\n\nVar value = {}'.format(var_value))
546 | runtime = stats_value_mean['runtime']
547 | print('Runtime: {}'.format(runtime))
548 |
549 | est_err_mean = stats_value_mean['clf_error_mixed'][-1]
550 | est_err_std = stats_value_std['clf_error_mixed'][-1]
551 | print('est_err: {} ({})'.format(est_err_mean, est_err_std))
552 | s1 = '{:5.2f} ({:4.2f})'.format(est_err_mean*100, est_err_std*100)
553 |
554 | test_err_mean = stats_value_mean.get('test_error_ilp', None)
555 | test_err_std = stats_value_std.get('test_error_ilp', None)
556 |         print('test_err: {} ({})'.format(test_err_mean, test_err_std))
557 | if test_err_mean is None:
558 | s2 = '-'
559 | else:
560 | s2 = '{:5.2f} ({:4.2f})'.format(test_err_mean*100, test_err_std*100)
561 |
562 | row = [var_value, runtime, s1, s2]
563 | table.append(row)
564 |
565 | print(tabulate(table, headers, tablefmt="latex"))
566 |
567 | metrics_to_plot = []
568 | _, _, stats_value_mean0, _ = stats_list[0]
569 | for metric_key in METRIC_ORDER:
570 | metric = stats_value_mean0.get(metric_key, None)
571 | if metric is not None and len(metric):
572 | metrics_to_plot.append(metric_key)
573 |
574 | n_subplots = len(metrics_to_plot)
575 | n_cols = 4
576 | n_rows = ceil(n_subplots / n_cols)
577 | plot_count = count(1)
578 |
579 | for metric_key in metrics_to_plot:
580 | print('Plotting metric {}'.format(metric_key))
581 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count))
582 |
583 | for i in range(len(stats_list)):
584 | _, var_value, stats_value_mean, stats_value_std = stats_list[i]
585 |
586 | metric_mean = stats_value_mean[metric_key]
587 | color = DEFAULT_MEAN_COLOR
588 | ax.plot(metric_mean, color=color, label=None, lw=2)
589 | metric_std = 1 * stats_value_std[metric_key]
590 | lb, ub = metric_mean - metric_std, metric_mean + metric_std
591 | color = DEFAULT_STD_COLOR
592 | ax.plot(lb, color=color)
593 | ax.plot(ub, color=color)
594 | ax.fill_between(range(len(lb)), lb, ub, facecolor=color, alpha=0.3)
595 | ax.set_xlabel(X_LABEL_DEFAULT)
596 | ax.set_ylabel(PLOT_LABELS[metric_key])
597 |
598 |
599 | def plot_standard(single_run, single_var, stats, config, title, path):
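    """Dispatch to the matching plotting routine (single/multiple runs x
    single/multiple variable values), set a figure title and, if a path is
    given, save the figure as <path>.pdf.
    """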
600 |
601 | figsize = (11, 5)
602 | fig = plt.figure(1, figsize=figsize, dpi=200)
603 |
604 | print()
605 | if single_run and single_var: # default_run
606 | print('Plotting a single run with a single variable value')
607 | plot_single_run_single_var(stats[0], fig, config)
608 | elif single_run and not single_var: # var_theta
609 | print('Plotting a single run for each variable value')
610 | plot_single_run_multi_var(stats, fig, config)
611 | elif single_var and not single_run: # mean and std of default_run
612 | print('Plotting multiple runs for a single variable value')
613 | plot_multi_run_single_var(stats[0], stats[1], fig)
614 | elif not single_run and not single_var:
615 | print('Plotting multiple runs for multiple variable values')
616 | plot_multi_run_multi_var(stats, fig)
617 | print()
618 |
619 | # plt.legend(loc='upper right')
620 | dataset = config['dataset']['name']
621 | dataset = 'kitti' if dataset.startswith('kitti') else dataset
622 | dataset = dataset.replace('_', ' ')
623 | if title == r'Default run' or dataset == 'kitti':
624 | title = dataset.upper()
625 | else:
626 | title = r'\textbf{' + title + r'}' + ' (' + dataset.upper() + ')'
627 |
628 | fig.suptitle(title, fontsize='xx-large', fontweight='bold')
629 |
630 | fig.tight_layout()
631 | fig.subplots_adjust(top=0.9)
632 |
633 | if path is not None:
634 | if path.endswith(STATS_FILE_EXT):
635 | path = path[:-len(STATS_FILE_EXT)]
636 |
637 | try:
638 | fig.savefig(path + '.pdf', orientation='landscape', transparent=False)
639 | except RuntimeError as e:
640 | print('Cannot save figure due to: \n{}'.format(e))
641 |
642 |
643 | def plot_curves(stats, config, title='', path=None):
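    """Entry point: print a summary of the collected statistics and the
    config, then plot them.

    `stats` is either a (stats_mean, stats_std) tuple for a single variable
    value (stats_std is None for a single run), or a list of
    (var_name, var_value, stats_mean, stats_std) tuples, one per value of the
    varied parameter.
    """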
644 | plt.rc('text', usetex=True) # need dvipng in Ubuntu
645 | plt.rc('font', family='serif')
646 | # matplotlib.rcParams['text.latex.unicode'] = True
647 |
648 | if type(stats) is tuple:
649 | single_var = True
650 | stats_mean, stats_std = stats
651 | single_run = stats_std is None
652 | stats_print = stats_mean
653 | elif type(stats) is list:
654 | single_var = False
655 | var_name, var_value, stats_mean0, stats_std0 = stats[0]
656 | print('Plotting multi var: {}'.format(var_name))
657 | single_run = stats_std0 is None
658 | stats_print = stats_mean0
659 | else:
660 | print('stats has type: {}'.format(type(stats).__name__))
661 | raise TypeError('stats is neither list nor tuple!')
662 |
663 | print('\nStatistics{}Size{}Type\n{}'.format(' '*25, ' '*8, '-'*52))
664 | for k in sorted(stats_print.keys()):
665 | v = stats_print[k]
666 | if v is not None:
667 | if hasattr(v, '__len__'):
668 | print('{:>32} {:>8} {:>9}'.format(k, len(v), type(v).__name__))
669 | else:
670 | print('{:>32} {:>8.2f} {:>9}'.format(k, v, type(v).__name__))
671 |
672 | if config is not None:
673 | print_config(config)
674 |
675 | plot_standard(single_run, single_var, stats, config, title, path)
676 |
677 | if single_var:
678 | if config['dataset']['name'] == 'mnist':
679 | test_pred_knn = stats_mean.get('test_pred_knn', None)
680 | if test_pred_knn is not None and len(test_pred_knn) > 0:
681 | plot_confusion_matrices(stats_mean, config)
682 |
--------------------------------------------------------------------------------