├── .gitignore ├── LICENSE ├── README.md ├── data └── kitti_features.zip └── ilp ├── __init__.py ├── algo ├── __init__.py ├── base_sl_graph.py ├── datastore.py ├── incremental_label_prop.py ├── knn_graph_utils.py ├── knn_sl_graph.py └── knn_sl_subgraph.py ├── constants.py ├── experiments ├── __init__.py ├── base.py ├── cfg │ ├── default.yml │ ├── var_k_L.yml │ ├── var_k_U.yml │ ├── var_n_L.yml │ ├── var_stream_labeled.yml │ └── var_theta.yml ├── default_run.py ├── var_n_labeled.py ├── var_n_neighbors_labeled.py ├── var_n_neighbors_unlabeled.py ├── var_stream_labeled.py └── var_theta.py ├── helpers ├── __init__.py ├── data_fetcher.py ├── data_flow.py ├── datasets.yml ├── fc_heap.py ├── log.py ├── params_parse.py └── stats.py └── plots ├── __init__.py └── plot_stats.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled source # 2 | ################### 3 | *.com 4 | *.class 5 | *.dll 6 | *.exe 7 | *.o 8 | *.so 9 | 10 | # Packages # 11 | ############ 12 | # it's better to unpack these files and commit the raw source 13 | # git has its own built in compression methods 14 | *.7z 15 | *.dmg 16 | *.gz 17 | *.iso 18 | *.jar 19 | *.rar 20 | *.tar 21 | *.zip 22 | 23 | # Logs and databases # 24 | ###################### 25 | *.log 26 | *.sql 27 | *.sqlite 28 | 29 | # OS generated files # 30 | ###################### 31 | .DS_Store 32 | .DS_Store? 33 | ._* 34 | .Spotlight-V100 35 | .Trashes 36 | ehthumbs.db 37 | Thumbs.db 38 | 39 | # Latex generated files # 40 | ######################### 41 | *.log 42 | *.aux 43 | *.blg 44 | *.out 45 | *.gz 46 | *.bbl 47 | 48 | # IDE # 49 | ####### 50 | .idea/ 51 | __pycache__/ 52 | build/ 53 | cmake-build-debug/ 54 | dist/ 55 | _build/ 56 | _generate/ 57 | *.so 58 | *.py[cod] 59 | *.egg-info 60 | 61 | 62 | # Project Specific # 63 | #################### 64 | doc/ 65 | *.mat 66 | *.p 67 | *.stat 68 | *.hist 69 | *.png 70 | */experiments/results/ 71 | data/kitti_features/ 72 | data/mnist/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 John Chiotellis 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Incremental Label Propagation 2 | This repository provides the implementation of our paper ["Incremental Semi-Supervised Learning from Streams for Object Classification"](https://vision.in.tum.de/_media/spezial/bib/chiotellis2018ilp.pdf) (Ioannis Chiotellis*, Franziska Zimmermann*, Daniel Cremers and Rudolph Triebel, IROS 2018). All results presented in our work were produced with this code. 3 | 4 | * [Installation](#installation) 5 | * [Datasets](#datasets) 6 | * [Experiments](#experiments) 7 | * [Publication](#publication) 8 | * [License and Contact](#license-and-contact) 9 | 10 | 11 | ## Installation 12 | The code was developed in Python 3.5 under Ubuntu 16.04. You can clone the repository with: 13 | ``` 14 | git clone https://github.com/johny-c/incremental-label-propagation.git 15 | ``` 16 | 17 | ## Datasets 18 | * KITTI 19 | 20 | The repository includes 64-dimensional features extracted from KITTI sequences, compressed in a zip file (data/kitti_features.zip). The included files will be extracted automatically if one of the included experiments is run on KITTI. 21 | 22 | * MNIST 23 | 24 | A script will automatically download the MNIST dataset if an experiment is run on it. 25 | 26 | 27 | ## Experiments 28 | 29 | The repository includes scripts that replicate the experiments found in the paper, including: 30 | 31 | * Varying the number of labeled points or the ratio of labeled points in the data. 32 | * Varying the number of labeled or unlabeled neighbors considered for each node. 33 | * Varying the hyperparameter $\theta$ that controls the propagation area size. 34 | 35 | To run an experiment with varying $\theta$: 36 | 37 | python ilp/experiments/var_theta.py -d mnist 38 | 39 | You can set different experiment options in the .yml files found in the ilp/experiments/cfg directory. 40 | 41 | #### WARNING: 42 | The included experiment scripts compute and store statistics after every new data point, therefore the resulting output files are very large. 43 | 44 | 45 | ## Publication 46 | If you use this code in your work, please cite the following paper. 47 | 48 | Ioannis Chiotellis*, Franziska Zimmermann*, Daniel Cremers and Rudolph Triebel, _"Incremental Semi-Supervised Learning from Streams for Object Classification"_, in Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018). ([pdf](https://vision.in.tum.de/_media/spezial/bib/chiotellis2018ilp.pdf)) 49 | 50 | *equal contribution 51 | 52 | @InProceedings{chiotellis2018incremental, 53 | author = "I. Chiotellis and F. Zimmermann and D. Cremers and R. Triebel", 54 | title = "Incremental Semi-Supervised Learning from Streams for Object Classification", 55 | booktitle = "IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)", 56 | year = "2018", 57 | month = "October", 58 | keywords={stream-based learning, sequential data, semi-supervised learning, object classification}, 59 | note = {{[code]} }, 60 | } 61 | 62 | ## License and Contact 63 | 64 | This work is released under the [MIT License](LICENSE). 65 | 66 | Contact **John Chiotellis** [:envelope:](mailto:chiotell@in.tum.de) for questions, comments and reporting bugs.
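
## Minimal usage example

For orientation, here is a minimal sketch of how the core classes (`SemiLabeledDataStore` from `ilp/algo/datastore.py` and `IncrementalLabelPropagation` from `ilp/algo/incremental_label_prop.py`) are wired together. The toy blob data, the labeling ratios and the parameter values are illustrative assumptions that only loosely mirror `ilp/experiments/cfg/default.yml`, and the snippet assumes the dependencies the code was written against (e.g., a scikit-learn version that still ships `sklearn.externals.six`).

```
# Minimal sketch: stream toy blobs through the incremental learner.
# The data, labeling ratios and parameter values below are illustrative
# assumptions for this example, not values prescribed by the paper.
import numpy as np
from sklearn.datasets import make_blobs

from ilp.algo.datastore import SemiLabeledDataStore
from ilp.algo.incremental_label_prop import IncrementalLabelPropagation

X, y = make_blobs(n_samples=500, centers=3, random_state=0)
rng = np.random.RandomState(0)
is_labeled = rng.rand(len(y)) < 0.1        # ~10% of the stream arrives labeled
is_labeled[:100] = rng.rand(100) < 0.5     # denser labels during the burn-in phase

datastore = SemiLabeledDataStore(max_samples=len(y),
                                 max_labeled=int(is_labeled.sum()),
                                 classes=np.unique(y))

learner = IncrementalLabelPropagation(
    datastore=datastore,
    params_graph={'kernel': 'knn',
                  'n_neighbors_labeled': 3,
                  'n_neighbors_unlabeled': 3},
    params_offline_lp={'max_iter': 30, 'tol': 1e-3},
    n_burn_in=100,   # offline LP runs once this many points have been observed
    theta=0.1,       # threshold controlling how far label updates propagate
    iprint=100)

# Feed the stream one point at a time; unlabeled points are passed with label -1
for x_i, y_i, lab_i in zip(X, y, is_labeled):
    learner.fit_incremental(x_i, y_i if lab_i else -1)

# Predict labels for unseen points without inserting them into the graph
print(learner.predict(X[:5]))
```

The experiment scripts under `ilp/experiments` wrap this same loop and additionally record statistics through a `StatisticsWorker`.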
67 | -------------------------------------------------------------------------------- /data/kitti_features.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/data/kitti_features.zip -------------------------------------------------------------------------------- /ilp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/__init__.py -------------------------------------------------------------------------------- /ilp/algo/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/algo/__init__.py -------------------------------------------------------------------------------- /ilp/algo/base_sl_graph.py: -------------------------------------------------------------------------------- 1 | from sklearn.externals import six 2 | from abc import ABCMeta, abstractmethod 3 | 4 | 5 | class BaseSemiLabeledGraph(six.with_metaclass(ABCMeta)): 6 | """ 7 | Parameters 8 | ---------- 9 | 10 | datastore : algo.datastore.SemoLabeledDatastore 11 | A datastore to store observations as they arrive 12 | 13 | max_samples : int, optional (default=1000) 14 | The maximum number of points expected to be observed. Useful for 15 | memory allocation. 16 | 17 | max_labeled : {float, int}, optional 18 | Maximum expected labeled points ratio, or number of labeled points 19 | 20 | dtype : dtype, optional (default=np.float32) 21 | Precision in floats, (can also be float16, float64) 22 | 23 | """ 24 | 25 | def __init__(self, datastore): 26 | 27 | self.datastore = datastore 28 | self.max_samples = datastore.max_samples 29 | self.max_labeled = datastore.max_labeled 30 | self.dtype = datastore.dtype 31 | self.eps = datastore.eps 32 | 33 | self.n_labeled = 0 34 | self.n_unlabeled = 0 35 | 36 | 37 | @abstractmethod 38 | def build(self, X_l, X_u): 39 | raise NotImplementedError('build is not implemented!') 40 | 41 | @abstractmethod 42 | def add_node(self, x, ind, labeled): 43 | raise NotImplementedError('add_node is not implemented!') 44 | 45 | @abstractmethod 46 | def add_labeled_node(self, x, ind): 47 | raise NotImplementedError('add_labeled_node is not implemented!') 48 | 49 | @abstractmethod 50 | def add_unlabeled_node(self, x, ind): 51 | raise NotImplementedError('add_unlabeled_node is not implemented!') 52 | 53 | def get_n_nodes(self): 54 | return self.n_labeled + self.n_unlabeled 55 | -------------------------------------------------------------------------------- /ilp/algo/datastore.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.preprocessing import LabelBinarizer, label_binarize 3 | 4 | from ilp.constants import EPS_32, EPS_64 5 | 6 | 7 | class SemiLabeledDataStore: 8 | 9 | def __init__(self, max_samples, max_labeled, classes, precision='float32'): 10 | 11 | self.max_samples = max_samples 12 | self.max_labeled = max_labeled 13 | self.classes = classes 14 | 15 | self.precision = precision 16 | self.dtype = np.dtype(precision) 17 | self.eps = EPS_32 if self.dtype == np.dtype('float32') else EPS_64 18 | self.n_labeled = 0 19 | self.n_unlabeled = 0 20 | 21 | self.X_labeled = np.array([]) 22 | 
self.X_unlabeled = np.array([]) 23 | self.y_labeled = np.array([]) 24 | 25 | def _allocate(self, n_features): 26 | 27 | # Allocate memory for data matrices 28 | self.X_labeled = np.zeros((self.max_labeled, n_features), 29 | dtype=self.dtype) 30 | self.X_unlabeled = np.zeros((self.max_samples, n_features), 31 | dtype=self.dtype) 32 | 33 | # Allocate memory for label matrix 34 | self.y_labeled = np.zeros((self.max_labeled, len(self.classes)), 35 | dtype=self.dtype) 36 | 37 | def append(self, x, y): 38 | 39 | if self.get_n_samples() == 0: 40 | self._allocate(len(x)) 41 | 42 | if y == -1: 43 | ind_new = self.n_unlabeled 44 | self.X_unlabeled[ind_new] = x 45 | self.n_unlabeled += 1 46 | else: 47 | ind_new = self.n_labeled 48 | self.X_labeled[ind_new] = x 49 | self.y_labeled[ind_new] = label_binarize([y], self.classes) 50 | self.n_labeled += 1 51 | 52 | return ind_new 53 | 54 | def inverse_transform_labels(self, y_proba): 55 | if not hasattr(self, 'label_binarizer'): 56 | self.label_binarizer = LabelBinarizer() 57 | self.label_binarizer.fit(self.classes) 58 | 59 | return self.label_binarizer.inverse_transform(y_proba) 60 | 61 | def get_n_samples(self): 62 | return self.n_labeled + self.n_unlabeled 63 | 64 | def get_X_l(self): 65 | return self.X_labeled[:self.n_labeled] 66 | 67 | def get_X_u(self): 68 | return self.X_unlabeled[:self.n_unlabeled] 69 | 70 | def get_y_l(self): 71 | return self.y_labeled[:self.n_labeled] 72 | 73 | def get_y_l_int(self): 74 | return np.argmax(self.get_y_l(), axis=1) -------------------------------------------------------------------------------- /ilp/algo/incremental_label_prop.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from time import time 3 | from sklearn.utils.extmath import safe_sparse_dot as ssdot 4 | from sklearn.utils.validation import check_random_state 5 | from sklearn.preprocessing import normalize 6 | 7 | from ilp.algo.knn_graph_utils import construct_weight_mat 8 | from ilp.algo.knn_sl_graph import KnnSemiLabeledGraph 9 | from ilp.helpers.stats import JobType 10 | from ilp.helpers.log import make_logger 11 | 12 | 13 | logger = make_logger(__name__) 14 | 15 | 16 | class IncrementalLabelPropagation: 17 | """ 18 | Parameters 19 | ---------- 20 | 21 | datastore : algo.datastore.SemiLabeledDataStore 22 | A datastore instance to store observations as they arrive 23 | 24 | stats_worker : helper.stats.StatisticsWorker, optional 25 | A statistics computing unit (run in a separate thread) 26 | 27 | theta : float, optional 28 | The threshold of significance for a label update (defines the online nature of the algorithm). (default: 0.1) 29 | 30 | max_iter : int, optional 31 | The maximum number of iterations per point insertion. (default: 30) 32 | 33 | tol : float, optional 34 | The tolerance of absolute difference between two label distributions 35 | that indicates convergence. (default: 1e-3) 36 | 37 | params_graph : dict 38 | Parameters for the graph construction. Will be passed to the Graph 39 | constructor. 40 | 41 | params_offline_lp : dict 42 | Parameters for the offline label propagation that will be performed 43 | once `n_burn_in` number of samples have been observed. 44 | 45 | random_state : int seed, RandomState instance, or None (default) 46 | The seed of the pseudo random number generator to use when 47 | shuffling the data. 48 | 49 | iprint : integer, optional 50 | The n_labeled_freq of printing progress messages to stdout. 
51 | 52 | n_jobs : integer, optional 53 | The number of CPUs to use to do the OVA (One Versus All, for 54 | multi-class problems) computation. -1 means 'all CPUs'. Defaults to 1. 55 | 56 | 57 | Attributes 58 | ---------- 59 | 60 | classes : tuple, shape = (n_classes,) 61 | The classes expected to appear in the stream. 62 | 63 | """ 64 | 65 | def __init__(self, datastore=None, stats_worker=None, 66 | params_graph=None, params_offline_lp=None, n_burn_in=0, 67 | theta=0.1, max_iter=30, tol=1e-3, 68 | random_state=42, n_jobs=-1, iprint=0): 69 | 70 | self.datastore = datastore 71 | self.max_iter = max_iter 72 | self.tol = tol 73 | self.theta = theta 74 | self.params_graph = params_graph 75 | self.params_offline_lp = params_offline_lp 76 | self.n_burn_in = n_burn_in 77 | self.random_state = random_state 78 | 79 | kernel = params_graph['kernel'] 80 | if kernel == 'knn': 81 | Graph = KnnSemiLabeledGraph 82 | # elif kernel == 'm-knn': 83 | # Graph = MutualKnnSemiLabeledGraph 84 | else: 85 | raise NotImplementedError('Only knn graphs.') 86 | 87 | self.graph = Graph(datastore, **params_graph) 88 | 89 | self.n_jobs = n_jobs 90 | self.stats_worker = stats_worker 91 | self.iprint = iprint 92 | 93 | def fit_burn_in(self): 94 | """Fit a semi-supervised label propagation model 95 | 96 | A bootstrap data set is provided in matrix X (labeled and unlabeled) 97 | and corresponding label matrix y with a dedicated value (-1) for 98 | unlabeled samples. 99 | 100 | Parameters 101 | ---------- 102 | X : array-like, shape (n_samples, n_features) 103 | A {n_samples by n_samples} size matrix will be created from this 104 | 105 | Returns 106 | ------- 107 | self : returns an instance of self. 108 | """ 109 | 110 | self.random_state_ = check_random_state(self.random_state) 111 | 112 | # Get appearing classes 113 | classes = self.datastore.classes 114 | 115 | self.n_iter_online = 0 116 | self.n_burn_in_ = self.datastore.get_n_samples() 117 | logger.debug('FITTING BURN-IN WITH {} SAMPLES'.format(self.n_burn_in_)) 118 | 119 | # actual graph construction (implementations should override this) 120 | logger.debug('Building graph....') 121 | self.graph.build(self.get_X_l(), self.get_X_u()) 122 | logger.debug('Graph built.') 123 | 124 | u_max = self.datastore.max_samples 125 | # Initialize F_U with uniform label vectors 126 | self.y_unlabeled = np.full((u_max, len(classes)), 1 / len(classes), 127 | dtype=self.datastore.dtype) 128 | 129 | # Initialize F_U with zero label vectors 130 | # self.y_unlabeled = np.zeros((u_max, len(classes)), 131 | # dtype=self.datastore.dtype) 132 | 133 | # Offline label propagation on burn-in data set 134 | self._offline_lp(**self.params_offline_lp) 135 | 136 | # Normalize F_U as it might have numerically diverged from [0, 1] 137 | normalize(self.y_unlabeled, norm='l1', axis=1, copy=False) 138 | 139 | return self 140 | 141 | def predict(self, X, mode=None): 142 | """Predict the labels for a batch of data points, without actually 143 | inserting them in the graph. 144 | 145 | Parameters 146 | ---------- 147 | X : array, shape (n_samples_batch, n_features) 148 | A batch of data points. 149 | 150 | mode : string 151 | Test with nearest labeled neighbors ('knn'), or nearest 152 | unlabeled neighbors ('lp'), their combination (default), 153 | or return both ('pair'). 154 | 155 | Returns 156 | ------- 157 | y : array, shape (n_samples_batch, n_classes) 158 | The predicted label distributions for the given points. 
159 | 160 | """ 161 | 162 | modes = ['knn', 'lp', 'pair'] 163 | if mode is not None and mode not in modes: 164 | raise ValueError('predict_proba can have modes: {}'.format(modes)) 165 | 166 | if X.ndim == 1: 167 | X = X.reshape(1, -1) 168 | 169 | if mode == 'pair': 170 | y_proba_knn, y_proba_lp = self.predict_proba(X, 'pair') 171 | y_pred_knn = self.datastore.inverse_transform_labels(y_proba_knn) 172 | y_pred_lp = self.datastore.inverse_transform_labels(y_proba_lp) 173 | return y_pred_knn, y_pred_lp 174 | 175 | y_proba = self.predict_proba(X, mode) 176 | y_pred = self.datastore.inverse_transform_labels(y_proba) 177 | 178 | return y_pred 179 | 180 | def predict_proba(self, X, mode=None): 181 | 182 | modes = ['knn', 'lp', 'pair'] 183 | if mode is not None and mode not in modes: 184 | raise ValueError('predict_proba can have modes: {}'.format(modes)) 185 | 186 | u, l = self.graph.n_unlabeled, self.graph.n_labeled 187 | 188 | logger.info('Now testing on {} samples...'.format(len(X))) 189 | neighbors, distances = self.graph.find_labeled_neighbors(X) 190 | affinity_mat = construct_weight_mat(neighbors, distances, 191 | (X.shape[0], l), self.graph.dtype) 192 | p_tl = normalize(affinity_mat.tocsr(), norm='l1', axis=1) 193 | y_from_labeled = ssdot(p_tl, self.datastore.y_labeled[:l], True) 194 | 195 | neighbors, distances = self.graph.find_unlabeled_neighbors(X) 196 | affinity_mat = construct_weight_mat(neighbors, distances, 197 | (X.shape[0], u), self.graph.dtype) 198 | p_tu = normalize(affinity_mat.tocsr(), norm='l1', axis=1) 199 | y_from_unlabeled = ssdot(p_tu, self.y_unlabeled[:u], True) 200 | 201 | y_pred_proba = y_from_labeled + y_from_unlabeled 202 | logger.info('Labels have been predicted.') 203 | 204 | if mode is None: 205 | return y_pred_proba 206 | elif mode == 'knn': 207 | return y_from_labeled 208 | elif mode == 'lp': 209 | return y_from_unlabeled 210 | elif mode == 'pair': 211 | return y_from_labeled, y_pred_proba 212 | 213 | def get_X_l(self): 214 | return self.datastore.get_X_l() 215 | 216 | def get_X_u(self): 217 | return self.datastore.get_X_u() 218 | 219 | def set_params(self, L): 220 | self.graph.reset_metric(L) 221 | tic = time() 222 | self.graph.build(self.get_X_l(), self.get_X_u()) 223 | toc = time() 224 | logger.info('Reconstructed graph in {:.4f}s\n'.format(toc-tic)) 225 | 226 | def _offline_lp(self, return_iter=False, max_iter=30, tol=0.001): 227 | """Perform the offline label propagation until convergence of the label 228 | estimates of the unlabeled points. 229 | 230 | Parameters 231 | ---------- 232 | return_iter : bool, default=False 233 | Whether or not to return the number of iterations till convergence 234 | of the label estimates. 
235 | 236 | Returns 237 | ------- 238 | y_unlabeled, num_iter : 239 | the new label estimates and optionally the number of iterations 240 | """ 241 | 242 | logger.debug('Doing Offline LP...') 243 | 244 | u, l = self.graph.n_unlabeled, self.graph.n_labeled 245 | 246 | p_ul = self.graph.subgraph_ul.transition_matrix[:u] 247 | p_uu = self.graph.subgraph_uu.transition_matrix[:u, :u] 248 | y_unlabeled = self.y_unlabeled[:u] 249 | y_labeled = self.datastore.y_labeled 250 | 251 | # First iteration 252 | y_static = ssdot(p_ul, y_labeled, dense_output=True) 253 | 254 | # Continue loop 255 | n_iter = 0 256 | converged = False 257 | while n_iter < max_iter and not converged: 258 | y_unlabeled_prev = y_unlabeled.copy() 259 | y_unlabeled = y_static + ssdot(p_uu, y_unlabeled, True) 260 | n_iter += 1 261 | 262 | converged = _converged(y_unlabeled, y_unlabeled_prev, tol) 263 | 264 | logger.info('Offline LP took {} iterations'.format(n_iter)) 265 | 266 | if return_iter: 267 | return y_unlabeled, n_iter 268 | else: 269 | return y_unlabeled 270 | 271 | def fit_incremental(self, x_new, y_new): 272 | 273 | n_samples = self.datastore.get_n_samples() 274 | if n_samples == 0 and self.stats_worker is not None: 275 | logger.info('\n\nStarting the Statistics Worker\n\n') 276 | self.stats_worker.start() 277 | 278 | if n_samples < self.n_burn_in: 279 | logger.debug('Still in burn-in phase... observed {:>4} ' 280 | 'points'.format(n_samples)) 281 | self.datastore.append(x_new, y_new) 282 | if n_samples == self.n_burn_in - 1: 283 | logger.debug('Burn-in complete!') 284 | self.fit_burn_in() 285 | else: 286 | ind_new = self.datastore.append(x_new, y_new) 287 | self._fit_incremental(x_new, y_new, ind_new) 288 | 289 | def _fit_incremental(self, x_new, y_new, ind_new): 290 | """Fit a single new point 291 | 292 | Args: 293 | x_new : array_like, shape (1, n_features) 294 | A new data point. 295 | 296 | y_new : int 297 | Label of the new data point (-1 if point is unlabeled). 298 | 299 | ind_new : int 300 | Index of the new point in the data store. 301 | 302 | Returns: 303 | IncrementalLabelPropagation: a reference to self. 304 | """ 305 | 306 | tic = time() 307 | 308 | if self.n_iter_online == 0: 309 | self.tic_iprint = time() 310 | 311 | labeled = y_new != -1 312 | 313 | self.graph.add_node(x_new, ind_new, labeled) 314 | 315 | _, n_in_iter = self._propagate_single(ind_new, y_new, return_iter=True) 316 | self.n_iter_online += 1 317 | 318 | # Update statistics 319 | dt = time() - tic 320 | self.log_stats(JobType.ONLINE_ITER, dt=dt, n_in_iter=n_in_iter) 321 | 322 | if not labeled: 323 | # Track prediction entropy and accuracy 324 | label_vec = self.y_unlabeled[ind_new][None, :] 325 | pred = self.datastore.inverse_transform_labels(label_vec) 326 | self.log_stats(JobType.POINT_PREDICTION, vec=label_vec, y=pred) 327 | 328 | # Print information if needed 329 | if self.n_iter_online % self.iprint == 0: 330 | dt = time() - self.tic_iprint 331 | max_samples = self.datastore.max_samples 332 | n_samples_curr = self.graph.get_n_nodes() 333 | n_samples_prev = n_samples_curr - self.iprint 334 | logger.info('Iterations {} to {}/{} took {:.4f}s'. 
335 | format(n_samples_prev, n_samples_curr, 336 | max_samples, dt)) 337 | self.tic_iprint = time() 338 | self.log_stats(JobType.PRINT_STATS) 339 | 340 | # Normalize y_u as it might have diverged from [0, 1] 341 | u = self.graph.n_unlabeled 342 | normalize(self.y_unlabeled[:u], norm='l1', axis=1, copy=False) 343 | 344 | return self 345 | 346 | def log_stats(self, job_type, **kwargs): 347 | if self.stats_worker is not None: 348 | d = dict(job_type=job_type) 349 | d.update(**kwargs) 350 | self.stats_worker.send(d) 351 | 352 | def _propagate_single(self, ind_new, y_new, return_iter=False): 353 | """Perform label propagation until convergence of the label 354 | estimates of the unlabeled points. Assume the new node has already 355 | been added to the graph, but no label has been estimated. 356 | 357 | Parameters 358 | ---------- 359 | ind_new : int 360 | The index of the new observation determined during graph addition. 361 | 362 | y_new : int 363 | The label of the new observation (-1 if point is unlabeled). 364 | 365 | return_iter : bool, default=False 366 | Whether to return the number of iterations until convergence of 367 | the label estimates. 368 | 369 | Returns 370 | ------- 371 | y_unlabeled, num_iter : returns the new label estimates and optionally 372 | the number of iterations 373 | """ 374 | # The number of labeled and unlabeled nodes now includes the new point 375 | y_u = self.y_unlabeled 376 | y_l = self.datastore.y_labeled 377 | 378 | p_ul = self.graph.subgraph_ul.transition_matrix 379 | p_uu = self.graph.subgraph_uu.transition_matrix 380 | 381 | a_rev_ul = self.graph.subgraph_ul.rev_adj 382 | a_rev_uu = self.graph.subgraph_uu.rev_adj 383 | 384 | if y_new == -1: 385 | # Estimate the label of the new unlabeled point 386 | label_new = ssdot(p_ul[ind_new], y_l, True) \ 387 | + ssdot(p_uu[ind_new], y_u, True) 388 | y_u[ind_new] = label_new 389 | 390 | # The first LP candidates are the unlabeled samples that have 391 | # the new point as a nearest neighbor 392 | candidates = a_rev_uu.get(ind_new, set()) 393 | else: 394 | # The label of the new labeled point is already in the data store 395 | candidates = a_rev_ul.get(ind_new, set()) 396 | 397 | # Initialize a tentative label matrix / hash-map 398 | y_u_tent = {} # y_u[:u].copy() 399 | 400 | # Tentative labels are the label est. 
after the new point insertion 401 | candid1_norms = [] 402 | for ind in candidates: 403 | y_u_tent.setdefault(ind, y_u[ind].copy()) 404 | label = ssdot(p_ul[ind], y_l, True) + ssdot(p_uu[ind], y_u, True) 405 | y_u_tent[ind] = label.ravel() 406 | 407 | n_updates_per_iter = [] 408 | n_iter = 0 409 | k_u = self.graph.n_neighbors_unlabeled 410 | u = max(self.graph.n_unlabeled, 1) 411 | max_iter = int(np.log(u) / np.log(k_u)) if k_u > 1 else self.max_iter 412 | while len(candidates) and n_iter < max_iter: # < self.max_iter: 413 | 414 | # Pick the ones that change significantly and change them 415 | updates, norm = filter_and_update(candidates, y_u_tent, y_u, 416 | self.theta) 417 | n_updates_per_iter.append(len(updates)) 418 | 419 | # Get the next set of candidates (farther from the source) 420 | candidates = get_next_candidates(updates, y_u_tent, y_u, a_rev_uu, 421 | p_uu) 422 | 423 | n_iter += 1 424 | 425 | # Print the total number of updates 426 | n_updates = sum(n_updates_per_iter) 427 | if n_updates: 428 | logger.info('Iter {:6}: {:6} updates in {:2} LP iters, ' 429 | 'max_iter = {:2}' 430 | .format(self.n_iter_online, n_updates, n_iter, max_iter)) 431 | 432 | if return_iter: 433 | return y_u, n_iter 434 | else: 435 | return y_u 436 | 437 | 438 | def get_next_candidates(major_changes, y_u_tent, y_u, a_rev_uu, p_uu): 439 | candidates = set() 440 | for index, label_diff in major_changes: 441 | back_neighbors = a_rev_uu.get(index, set()) 442 | for neigh in back_neighbors: 443 | y_u_tent.setdefault(neigh, y_u[neigh].copy()) 444 | y_u_tent[neigh] += ssdot(p_uu[neigh, index], label_diff, True) 445 | candidates.add(neigh) 446 | return candidates 447 | 448 | 449 | def filter_and_update(candidates, y_u_tent, y_u, theta, top_ratio=None): 450 | 451 | # Store for visualization all norms to see how to tune theta 452 | major_updates = [] 453 | updates_norm = 0. 
454 | candidate_changes = [] 455 | for candidate in candidates: 456 | dy_u = y_u_tent[candidate] - y_u[candidate] 457 | dy_u_norm = np.abs(dy_u).sum() 458 | if top_ratio is not None: 459 | candidate_changes.append((dy_u_norm, dy_u, candidate)) 460 | else: 461 | if dy_u_norm > theta: 462 | # Apply the update 463 | y_u[candidate] = y_u_tent[candidate] 464 | major_updates.append((candidate, dy_u)) 465 | updates_norm += dy_u_norm 466 | 467 | if top_ratio is None: 468 | return major_updates, updates_norm 469 | 470 | # Sort changes by descending norm and select the top k candidates for LP 471 | n_candidates = len(candidates) 472 | candidate_changes.sort(reverse=True, key=lambda x: x[0]) 473 | 474 | # Apply the changes to the top k candidates 475 | top_k = int(top_ratio * n_candidates) 476 | for _, dy_u, candidate in candidate_changes[:top_k]: 477 | # Apply the update 478 | y_u[candidate] = y_u_tent[candidate] 479 | major_updates.append((candidate, dy_u)) 480 | updates_norm += np.abs(dy_u).sum() 481 | 482 | return major_updates, updates_norm 483 | 484 | 485 | def _converged(y_curr, y_prev, tol=0.01): 486 | """basic convergence check""" 487 | return np.abs(y_curr - y_prev).max() < tol 488 | -------------------------------------------------------------------------------- /ilp/algo/knn_graph_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.metrics.pairwise import euclidean_distances 3 | from scipy.sparse import coo_matrix 4 | 5 | 6 | def squared_distances(X1, X2, L=None): 7 | 8 | if L is None: 9 | dist = euclidean_distances(X1, X2, squared=True) 10 | else: 11 | dist = euclidean_distances(X1.dot(L.T), X2.dot(L.T), squared=True) 12 | 13 | return dist 14 | 15 | 16 | def get_nearest(distances, n_neighbors): 17 | 18 | n, m = distances.shape 19 | 20 | neighbors = np.argpartition(distances, n_neighbors - 1, axis=1) 21 | neighbors = neighbors[:, :n_neighbors] 22 | 23 | return neighbors, distances[np.arange(n)[:, None], neighbors] 24 | 25 | 26 | def find_nearest_neighbors(X1, X2, n_neighbors, L=None): 27 | """ 28 | Args: 29 | X1 (array_like): [n_samples, n_features] input data points 30 | X2 (array_like): [m_samples, n_features] reference data points 31 | n_neighbors (int): number of nearest neighbors to find 32 | L (array) : linear transformation for Mahalanobis distance computation 33 | 34 | Returns: 35 | tuple: 36 | (array_like): [n_samples, k_samples] indices of nearest neighbors 37 | (array_like): [n_samples, k_distances] distances to nearest neighbors 38 | 39 | """ 40 | 41 | dist = squared_distances(X1, X2, L) 42 | 43 | if X1 is X2: 44 | np.fill_diagonal(dist, np.inf) 45 | 46 | n, m = X1.shape[0], X2.shape[0] 47 | 48 | neigh_ind = np.argpartition(dist, n_neighbors - 1, axis=1) 49 | neigh_ind = neigh_ind[:, :n_neighbors] 50 | 51 | return neigh_ind, dist[np.arange(n)[:, None], neigh_ind] 52 | 53 | 54 | def construct_weight_mat(neighbors, distances, shape, dtype): 55 | 56 | n, k = neighbors.shape 57 | rows = np.repeat(range(n), k) 58 | cols = neighbors.ravel() 59 | weights = np.exp(-distances.ravel()) 60 | mat = coo_matrix((weights, (rows, cols)), shape, dtype) 61 | 62 | return mat 63 | -------------------------------------------------------------------------------- /ilp/algo/knn_sl_graph.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.sparse import spdiags 3 | 4 | from ilp.algo.knn_graph_utils import squared_distances, find_nearest_neighbors 5 | from 
ilp.algo.knn_sl_subgraph import KnnSubGraph 6 | from ilp.algo.base_sl_graph import BaseSemiLabeledGraph 7 | 8 | N_ITER_ELIMINATE_ZEROS = 20 9 | 10 | 11 | class KnnSemiLabeledGraph(BaseSemiLabeledGraph): 12 | """ 13 | Parameters 14 | ---------- 15 | 16 | n_neighbors_labeled : int, optional (default=1) 17 | The number of labeled neighbors to use if the 'knn' kernel is used. 18 | 19 | n_neighbors_unlabeled : int, optional (default=7) 20 | The number of unlabeled neighbors to use if the 'knn' kernel is used. 21 | 22 | max_samples : int, optional 23 | The maximum number of points expected to be observed. Useful for 24 | memory allocation. (default: 1000) 25 | 26 | max_labeled : {float, int}, optional 27 | Maximum expected labeled points ratio, or number of labeled points 28 | 29 | dtype : dtype, optional 30 | Precision in floats, (default is float32, can also be float16, float64) 31 | 32 | 33 | Attributes 34 | ---------- 35 | 36 | L : array-like, shape (n_features_out, n_features_in) 37 | 38 | weight_matrix_{xy} : {array-like, sparse matrix}, shape = [n_samples, n_samples] 39 | xy can be in {ll, lu, ul, uu} indicating labeled or unlabeled points 40 | 41 | transition_matrix_{xy} : {array-like, sparse matrix}, shape = [n_samples, n_samples] 42 | 43 | adj_list_{xy} : dict that contains the graph connectivity -> 44 | keys = nodes indices, values = neighbor nodes indices 45 | 46 | """ 47 | 48 | def __init__(self, datastore, n_neighbors_labeled=1, 49 | n_neighbors_unlabeled=7, **kwargs): 50 | 51 | super(KnnSemiLabeledGraph, self).__init__(datastore=datastore) 52 | self.n_neighbors_labeled = n_neighbors_labeled 53 | self.n_neighbors_unlabeled = n_neighbors_unlabeled 54 | 55 | max_l, max_u = self.max_labeled, self.max_samples 56 | dtype = self.dtype 57 | 58 | self.subgraph_ll = KnnSubGraph(n_neighbors=n_neighbors_labeled, 59 | dtype=dtype, shape=(max_l, max_l)) 60 | 61 | self.subgraph_lu = KnnSubGraph(n_neighbors=n_neighbors_unlabeled, 62 | dtype=dtype, shape=(max_l, max_u)) 63 | 64 | self.subgraph_ul = KnnSubGraph(n_neighbors=n_neighbors_labeled, 65 | dtype=dtype, shape=(max_u, max_l)) 66 | 67 | self.subgraph_uu = KnnSubGraph(n_neighbors=n_neighbors_unlabeled, 68 | dtype=dtype, shape=(max_u, max_u)) 69 | 70 | self.L = None 71 | 72 | def build(self, X_l, X_u): 73 | """ 74 | Build the graph for online label propagation by computing the weighted adjacency submatrices 75 | 76 | Parameters 77 | ---------- 78 | 79 | X_l : array-like, shape [l_samples, n_features], the labeled features 80 | 81 | X_u : array-like, shape [u_samples, n_features], the unlabeled features 82 | 83 | """ 84 | 85 | print('Building graph with {} labeled and {} unlabeled ' 86 | 'samples...'.format(X_l.shape[0], X_u.shape[0])) 87 | 88 | self.subgraph_ll.build(X_l, X_l, self.L) 89 | self.subgraph_lu.build(X_l, X_u, self.L) 90 | self.subgraph_ul.build(X_u, X_l, self.L) 91 | self.subgraph_uu.build(X_u, X_u, self.L) 92 | 93 | self.n_labeled = X_l.shape[0] 94 | self.n_unlabeled = X_u.shape[0] 95 | 96 | print('Computing transitions...') 97 | 98 | self._compute_transitions() 99 | 100 | def find_labeled_neighbors(self, X): 101 | 102 | X_l = self.datastore.X_labeled[:self.n_labeled] 103 | return find_nearest_neighbors(X, X_l, self.n_neighbors_labeled, self.L) 104 | 105 | def find_unlabeled_neighbors(self, X): 106 | 107 | X_u = self.datastore.X_unlabeled[:self.n_unlabeled] 108 | return find_nearest_neighbors(X, X_u, self.n_neighbors_unlabeled, 109 | self.L) 110 | 111 | def add_node(self, x, ind, labeled): 112 | if labeled: 113 | res = self.add_labeled_node(x, ind) 114 |
else: 115 | res = self.add_unlabeled_node(x, ind) 116 | 117 | # Periodically remove explicit zeros from the sparse matrices 118 | if self.get_n_nodes() % N_ITER_ELIMINATE_ZEROS == 0: 119 | 120 | self.subgraph_ll.eliminate_zeros() 121 | self.subgraph_lu.eliminate_zeros() 122 | self.subgraph_ul.eliminate_zeros() 123 | self.subgraph_uu.eliminate_zeros() 124 | 125 | return res 126 | 127 | def add_labeled_node(self, x_new, ind_new): 128 | 129 | # Compute distances to all other labeled nodes 130 | X_l = self.datastore.X_labeled[:self.n_labeled] 131 | distances = squared_distances(x_new.reshape(1, -1), X_l, self.L) 132 | 133 | # Update the labeled-labeled subgraph 134 | self.subgraph_ll.append_row(ind_new, distances) 135 | self.subgraph_ll.update_columns(ind_new, distances) 136 | 137 | # Compute distances to all other unlabeled nodes 138 | X_u = self.datastore.X_unlabeled[:self.n_unlabeled] 139 | distances = squared_distances(x_new.reshape(1, -1), X_u, self.L) 140 | 141 | # Update the labeled-unlabeled subgraph 142 | self.subgraph_lu.append_row(ind_new, distances) 143 | 144 | # Update the unlabeled-labeled subgraph 145 | self.subgraph_ul.update_columns(ind_new, distances) 146 | 147 | self.n_labeled += 1 148 | 149 | # Compute normalized weight matrix (matrices) 150 | self._compute_transitions() 151 | 152 | return ind_new 153 | 154 | def add_unlabeled_node(self, x_new, ind_new): 155 | 156 | # Compute distances to all other unlabeled nodes 157 | X_u = self.datastore.X_unlabeled[:self.n_unlabeled] 158 | distances = squared_distances(x_new.reshape(1, -1), X_u, self.L) 159 | 160 | # Update the unlabeled-unlabeled subgraph 161 | self.subgraph_uu.append_row(ind_new, distances) 162 | self.subgraph_uu.update_columns(ind_new, distances) 163 | 164 | # Compute distances to all labeled nodes 165 | X_l = self.datastore.X_labeled[:self.n_labeled] 166 | distances = squared_distances(x_new.reshape(1, -1), X_l, self.L) 167 | 168 | # Update the unlabeled-labeled subgraph 169 | self.subgraph_ul.append_row(ind_new, distances) 170 | 171 | # Update the labeled-unlabeled subgraph 172 | self.subgraph_lu.update_columns(ind_new, distances) 173 | 174 | self.n_unlabeled += 1 175 | 176 | # Compute normalized weight matrix (matrices) 177 | self._compute_transitions() 178 | 179 | return ind_new 180 | 181 | def _compute_transitions(self): 182 | """Normalize the weight matrices by dividing with the row sums""" 183 | 184 | self.row_sum_l = self.subgraph_ll.weight_matrix.sum(axis=1) + \ 185 | self.subgraph_lu.weight_matrix.sum(axis=1) 186 | 187 | self.row_sum_u = self.subgraph_ul.weight_matrix.sum(axis=1) + \ 188 | self.subgraph_uu.weight_matrix.sum(axis=1) 189 | 190 | # Avoid division by zero 191 | actual_l = self.row_sum_l[:self.n_labeled] 192 | actual_l[actual_l < self.eps] = 1. 193 | # print('Min value l: ', actual_l.min()) 194 | actual_u = self.row_sum_u[:self.n_unlabeled] 195 | actual_u[actual_u < self.eps] = 1.
196 | # print('Min value u: ', actual_u.min()) 197 | 198 | row_sum_l_inv = 1 / np.asarray(actual_l, dtype=self.dtype) 199 | row_sum_l_inv[row_sum_l_inv == np.inf] = 1 200 | 201 | row_sum_u_inv = 1 / np.asarray(actual_u, dtype=self.dtype) 202 | row_sum_u_inv[row_sum_u_inv == np.inf] = 1 203 | 204 | # Temporary divisors (diagonal pre-multiplier matrices) 205 | diag_l = spdiags(row_sum_l_inv.ravel(), 0, *self.subgraph_ll.shape) 206 | diag_u = spdiags(row_sum_u_inv.ravel(), 0, *self.subgraph_uu.shape) 207 | 208 | self.subgraph_ll.update_transitions(diag_l) 209 | self.subgraph_lu.update_transitions(diag_l) 210 | self.subgraph_ul.update_transitions(diag_u) 211 | self.subgraph_uu.update_transitions(diag_u) 212 | 213 | def reset_metric(self, L): 214 | self.L = L 215 | -------------------------------------------------------------------------------- /ilp/algo/knn_sl_subgraph.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy.sparse import csr_matrix, coo_matrix 3 | from sklearn.utils.extmath import safe_sparse_dot as ssdot 4 | 5 | from ilp.helpers.fc_heap import FixedCapacityHeap as FSH 6 | from ilp.algo.knn_graph_utils import find_nearest_neighbors, get_nearest, construct_weight_mat 7 | 8 | 9 | class KnnSubGraph: 10 | def __init__(self, n_neighbors=1, dtype=float, shape=None, **kwargs): 11 | 12 | self.n_neighbors = n_neighbors 13 | self.dtype = dtype 14 | self.shape = shape 15 | 16 | self.weight_matrix = csr_matrix(shape, dtype=dtype) 17 | self.transition_matrix = csr_matrix(shape, dtype=dtype) 18 | self.radii = np.zeros(shape[0], dtype=dtype) 19 | 20 | self.adj = {} 21 | self.rev_adj = {} 22 | 23 | def build(self, X1, X2, L=None): 24 | 25 | neigh_ind, dist = find_nearest_neighbors(X1, X2, self.n_neighbors, L) 26 | weight_matrix = construct_weight_mat(neigh_ind, dist, self.shape, 27 | self.dtype) 28 | self.weight_matrix = weight_matrix.tocsr() 29 | self.radii[:len(dist)] = dist[:, self.n_neighbors-1] 30 | self.adj = self.adj_from_weight_mat(weight_matrix) 31 | self.rev_adj = self.rev_adj_from_weight_mat(weight_matrix) 32 | 33 | def adj_from_weight_mat(self, weight_mat): 34 | """Get the non-zero cols for each row and insert in a FSPQ in the 35 | form (weight, ind) 36 | 37 | Args: 38 | weight_mat (coo_matrix): a weights submatrix 39 | 40 | Returns: 41 | adj_list (dict): a dictionary where keys are indices and values 42 | are FSHs 43 | 44 | """ 45 | 46 | # Create a list of adjacent vertices for each node 47 | print('Creating hashmap for {} nodes...'.format(self.shape[0])) 48 | adj_list = {i: [] for i in range(self.shape[0])} 49 | print('Iterating over weightmat.row, col, data...') 50 | for r, c, w in zip(weight_mat.row, weight_mat.col, weight_mat.data): 51 | adj_list[r].append((w, c)) 52 | 53 | # Convert each list to a FixedCapacityHeap 54 | print('Converting to FCH') 55 | for node, neighbors in adj_list.items(): 56 | adj_list[node] = FSH(neighbors, capacity=self.n_neighbors) 57 | 58 | return adj_list 59 | 60 | def rev_adj_from_weight_mat(self, weight_mat): 61 | # Create a list of adjacent vertices for each node 62 | # adj_list = {i: set() for i in range(self.shape[1])} 63 | adj_list = {} 64 | for r, c, w in zip(weight_mat.row, weight_mat.col, weight_mat.data): 65 | adj_list.setdefault(c, set()).add(r) 66 | # adj_list[c].add(r) 67 | 68 | return adj_list 69 | 70 | def update_transitions(self, normalizer): 71 | self.transition_matrix = ssdot(normalizer, self.weight_matrix) 72 | 73 | def eliminate_zeros(self): 74 | 
self.weight_matrix.eliminate_zeros() 75 | self.transition_matrix.eliminate_zeros() 76 | 77 | def append_row(self, index, distances): 78 | 79 | # Identify the k nearest neighbors 80 | nearest, dist_nearest = get_nearest(distances, self.n_neighbors) 81 | nearest = nearest.ravel() 82 | 83 | # Create the new node's adjacency list 84 | weights = np.exp(-dist_nearest.ravel()) 85 | lst = [(w, i) for w, i in zip(weights, nearest)] 86 | self.adj[index] = FSH(lst, self.n_neighbors) 87 | 88 | # Update the reverse adjacency list 89 | for w, i in zip(weights, nearest): 90 | self.rev_adj.setdefault(i, set()).add(index) 91 | # self.rev_adj[i].add(index) 92 | 93 | # Update the W_LL matrix (append the row vector) 94 | row = [index] * len(weights) 95 | row_new = csr_matrix((weights, (row, nearest)), self.shape, self.dtype) 96 | self.weight_matrix = self.weight_matrix + row_new 97 | 98 | def update_columns(self, ind_new, distances): 99 | """ 100 | 101 | Parameters 102 | ---------- 103 | ind_new : int 104 | Index of the new point. 105 | 106 | distances : array 107 | Array of distances of the new point to the reference points of 108 | the subgraph. 109 | 110 | """ 111 | 112 | distances = distances.ravel() 113 | # Identify the samples that have the new point in their knn radius 114 | back_refs, = np.where(distances < self.radii[:len(distances)]) 115 | back_weights = np.exp(-distances[back_refs]) 116 | 117 | # Update the W_LL matrix (compute the column update) 118 | update_mat = self._update(ind_new, back_refs, back_weights) 119 | self.weight_matrix = self.weight_matrix + update_mat.tocsr() 120 | 121 | def _update(self, ind_new, back_refs, weights_new, eps=1e-12): 122 | row, col, val = [], [], [] 123 | # row_del, col_del, val_del = [], [], [] 124 | for neigh_new, weight_new in zip(back_refs, weights_new): 125 | neighbors_heap = self.adj[neigh_new] # FSH with (weight, ind) 126 | inserted, removed = neighbors_heap.push((weight_new, ind_new)) 127 | if inserted: # neigh got a new nearest neighbor: ind_new 128 | row.append(neigh_new) 129 | col.append(ind_new) 130 | val.append(weight_new) 131 | 132 | # Update the reverse adjacency list 133 | self.rev_adj.setdefault(ind_new, set()).add(neigh_new) 134 | if removed is not None: # old point swapped a nearest neighbor 135 | # row_del.append(neigh) 136 | # col_del.append(removed[1]) 137 | # val_del.append(-removed[0]) 138 | row.append(neigh_new) 139 | col.append(removed[1]) 140 | val.append(-removed[0]) 141 | 142 | # Update the reverse adjacency list 143 | self.rev_adj[removed[1]].discard(neigh_new) 144 | 145 | # Update the radii 146 | min_weight = neighbors_heap.get_min()[0] 147 | self.radii[neigh_new] = -np.log(max(min_weight, eps)) 148 | 149 | update_mat = coo_matrix((val, (row, col)), self.shape, self.dtype) 150 | 151 | return update_mat 152 | -------------------------------------------------------------------------------- /ilp/constants.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | 4 | CWD = os.path.abspath(os.path.split(__file__)[0]) 5 | PROJECT_DIR = os.path.split(CWD)[0] 6 | 7 | SOURCE_DIR = os.path.join(PROJECT_DIR, 'ilp') 8 | DATA_DIR = os.path.join(PROJECT_DIR, 'data') 9 | 10 | STATS_DIR = os.path.join(SOURCE_DIR, 'stats_res') 11 | EXPERIMENTS_DIR = os.path.join(SOURCE_DIR, 'experiments') 12 | CONFIG_DIR = os.path.join(EXPERIMENTS_DIR, 'cfg') 13 | RESULTS_DIR = os.path.join(EXPERIMENTS_DIR, 'results') 14 | # PLOT_DIR = os.path.join(PROJECT_DIR, 'plot') 15 | 16 | 17 | EPS_32 = 
np.spacing(np.float32(0)) 18 | EPS_64 = np.spacing(np.float64(0)) 19 | -------------------------------------------------------------------------------- /ilp/experiments/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/experiments/__init__.py -------------------------------------------------------------------------------- /ilp/experiments/base.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import time 4 | from datetime import datetime 5 | from sklearn.externals import six 6 | from sklearn.preprocessing import LabelBinarizer 7 | from sklearn.utils.random import check_random_state 8 | from abc import ABCMeta, abstractmethod 9 | from matplotlib import pyplot as plt 10 | 11 | from ilp.constants import RESULTS_DIR 12 | from ilp.helpers.data_fetcher import check_supported_dataset, fetch_load_data 13 | from ilp.helpers.stats import StatisticsWorker, aggregate_statistics, JobType 14 | from ilp.plots.plot_stats import plot_curves 15 | from ilp.algo.incremental_label_prop import IncrementalLabelPropagation 16 | from ilp.algo.datastore import SemiLabeledDataStore 17 | from ilp.helpers.data_flow import gen_semilabeled_data, split_labels_rest, split_burn_in_rest 18 | from ilp.helpers.params_parse import print_config 19 | from ilp.helpers.log import make_logger 20 | 21 | 22 | logger = make_logger(__name__) 23 | 24 | 25 | class BaseExperiment(six.with_metaclass(ABCMeta)): 26 | 27 | def __init__(self, name, config, plot_title, multi_var, n_runs, isave=100): 28 | self.name = name 29 | self.config = config 30 | self.precision = config.get('options', {}).get('precision', 'float32') 31 | self.isave = isave 32 | self.plot_title = plot_title 33 | self.multi_var = multi_var 34 | self.n_runs = n_runs 35 | self._setup() 36 | 37 | def _setup(self): 38 | self.dataset = self.config['dataset']['name'].lower() 39 | check_supported_dataset(self.dataset) 40 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S') 41 | self.class_dir = os.path.join(RESULTS_DIR, self.name) 42 | instance_dir = self.name + '_' + self.dataset.upper() + '_' + cur_time 43 | self.top_dir = os.path.join(self.class_dir, instance_dir) 44 | 45 | def run(self, dataset_name, random_state=42): 46 | 47 | config = self.config 48 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name) 49 | 50 | for n_run in range(self.n_runs): 51 | seed_run = random_state * n_run 52 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run)) 53 | rng = check_random_state(seed_run) 54 | if config['dataset']['is_stream']: 55 | logger.info('Dataset is a stream. 
Sampling observed labels.') 56 | # Just randomly sample ratio_labeled samples for mask_labeled 57 | n_burn_in = config['data']['n_burn_in_stream'] 58 | ratio_labeled = config['data']['stream']['ratio_labeled'] 59 | n_labeled = int(ratio_labeled*len(y_train)) 60 | ind_labeled = rng.choice(len(y_train), n_labeled, 61 | replace=False) 62 | mask_labeled = np.zeros(len(y_train), dtype=bool) 63 | mask_labeled[ind_labeled] = True 64 | X_run, y_run = X_train, y_train 65 | else: 66 | burn_in_params = config['data']['burn_in'] 67 | ind_burn_in, mask_labeled_burn_in = \ 68 | split_burn_in_rest(y_train, shuffle=True, seed=seed_run, 69 | **burn_in_params) 70 | X_burn_in, y_burn_in = X_train[ind_burn_in], \ 71 | y_train[ind_burn_in] 72 | mask_rest = np.ones(len(X_train), dtype=bool) 73 | mask_rest[ind_burn_in] = False 74 | X_rest, y_rest = X_train[mask_rest], y_train[mask_rest] 75 | stream_params = config['data']['stream'] 76 | mask_labeled_rest = split_labels_rest( 77 | y_rest, seed=seed_run, shuffle=True, **stream_params) 78 | 79 | # Shuffle the rest 80 | indices = np.arange(len(y_rest)) 81 | rng.shuffle(indices) 82 | X_run = np.concatenate((X_burn_in, X_rest[indices])) 83 | y_run = np.concatenate((y_burn_in, y_rest[indices])) 84 | mask_labeled = np.concatenate((mask_labeled_burn_in, 85 | mask_labeled_rest[indices])) 86 | n_burn_in = len(y_burn_in) 87 | 88 | config['data']['n_burn_in'] = n_burn_in 89 | config.setdefault('options', {}) 90 | config['options']['random_state'] = seed_run 91 | 92 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in, 93 | seed_run, X_test, y_test, n_run) 94 | 95 | @abstractmethod 96 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 97 | X_test, y_test, n_run): 98 | raise NotImplementedError('pre_single_run must be overriden!') 99 | 100 | def _single_run(self, X, y, mask_labeled, n_burn_in, stats_path, 101 | random_state, X_test=None, y_test=None): 102 | 103 | lb = LabelBinarizer() 104 | lb.fit(y) 105 | logger.info('\n\nLABELS SEEN BY LABEL BINARIZER: {}'.format(lb.classes_)) 106 | 107 | # Now print configuration for sanity check 108 | self.config.setdefault('dataset', {}) 109 | self.config['dataset']['classes'] = lb.classes_ 110 | print_config(self.config) 111 | 112 | logger.info('Creating stream generator...') 113 | stream_generator = gen_semilabeled_data(X, y, mask_labeled) 114 | 115 | logger.info('Creating one-hot groundtruth...') 116 | y_u_true_int = y[~mask_labeled] 117 | y_u_true = np.asarray(lb.transform(y_u_true_int), dtype=self.precision) 118 | 119 | logger.info('Initializing learner...') 120 | datastore_params = {'precision': self.precision, 121 | 'max_samples': len(y), 122 | 'max_labeled': sum(mask_labeled), 123 | 'classes': lb.classes_} 124 | learner = self.init_learner(stats_path, datastore_params, random_state, 125 | n_burn_in) 126 | 127 | # Iterate through the generated samples and learn 128 | t_total = time.time() 129 | logger.info('Now feeding stream . . .') 130 | for t, x_new, y_new, is_labeled in stream_generator: 131 | 132 | # Pass the new point to the learner 133 | y_observed = y_new if is_labeled else -1 134 | learner.fit_incremental(x_new, y_observed) 135 | 136 | if t > n_burn_in: 137 | # Compute classification error 138 | u = learner.datastore.n_unlabeled 139 | y_u = learner.y_unlabeled[:u] 140 | learner.log_stats(JobType.EVAL, y_est=y_u, y_true=y_u_true[:u]) 141 | 142 | # Compute test error every 1000 samples 143 | if t % 1000 == 0: 144 | if X_test is not None: 145 | logger.info('Now testing . . 
.') 146 | t_test = time.time() 147 | y_pred_knn, y_pred_lp = learner.predict(X_test, mode='pair') 148 | t_test = time.time() - t_test 149 | logger.info('Testing finished in {}s'.format(t_test)) 150 | learner.log_stats(JobType.TEST_PRED, y_pred_knn=y_pred_knn, 151 | y_pred_lp=y_pred_lp, y_true=y_test) 152 | 153 | # Store the true label stream in statistics 154 | learner.log_stats(JobType.LABEL_STREAM, y_true=y, mask_obs=mask_labeled) 155 | 156 | logger.info('Reached end of generated data.') 157 | total_runtime = time.time() - t_total 158 | logger.info('Total time elapsed: {} s'.format(total_runtime)) 159 | learner.log_stats(JobType.RUNTIME, t=total_runtime) 160 | 161 | # Store last predictions in statistics 162 | u = learner.datastore.n_unlabeled 163 | y_u = learner.y_unlabeled[:u] 164 | learner.log_stats(JobType.TRAIN_PRED, y_est=y_u, y_true=y_u_true[:u]) 165 | 166 | if X_test is not None: 167 | logger.info('Now testing . . .') 168 | t_test = time.time() 169 | y_pred_knn, y_pred_lp = learner.predict(X_test, mode='pair') 170 | t_test = time.time() - t_test 171 | logger.info('Testing finished in {}s'.format(t_test)) 172 | learner.log_stats(JobType.TEST_PRED, y_pred_knn=y_pred_knn, 173 | y_pred_lp=y_pred_lp, y_true=y_test) 174 | 175 | if learner.stats_worker is not None: 176 | learner.stats_worker.stop() 177 | 178 | def init_learner(self, stats_path, datastore_params, random_state, n_burn_in): 179 | 180 | config = self.config 181 | ilp_params = dict(params_offline_lp=config['offline_lp'], 182 | params_graph=config['graph'], 183 | **config['online_lp']) 184 | 185 | # Instantiate a worker thread for statistics 186 | stats_worker = StatisticsWorker(config=config, isave=self.isave, path=stats_path) 187 | 188 | # Instantiate a datastore for labeled and unlabeled samples 189 | datastore = SemiLabeledDataStore(**datastore_params) 190 | 191 | # Instantiate the learner 192 | learner = IncrementalLabelPropagation(datastore=datastore, stats_worker=stats_worker, 193 | random_state=random_state, n_burn_in=n_burn_in, **ilp_params) 194 | 195 | return learner 196 | 197 | def load_plot(self, path=None): 198 | if path is None: 199 | path = self.top_dir 200 | elif not os.path.isdir(path): 201 | # Load and plot the latest experiment 202 | logger.info('Experiment Class dir: {}'.format(self.class_dir)) 203 | logger.info('Experiment subdirs: {}'.format(os.listdir(self.class_dir))) 204 | files_in_class = os.listdir(self.class_dir) 205 | a_files = [os.path.join(self.class_dir, d) for d in files_in_class] 206 | list_of_dirs = [d for d in a_files if os.path.isdir(d)] 207 | path = max(list_of_dirs, key=os.path.getctime) 208 | 209 | logger.info('Collecting statistics from {}'.format(path)) 210 | config = None 211 | if self.multi_var: 212 | experiment_stats = [] 213 | for variable_dir in os.listdir(path): 214 | var_value = variable_dir[len(self.name) + 1:] 215 | experiment_dir = os.path.join(path, variable_dir) 216 | stats_mean, stats_std, config = aggregate_statistics(experiment_dir) 217 | experiment_stats.append((self.name, var_value, stats_mean, stats_std)) 218 | else: 219 | stats_mean, stats_std, config = aggregate_statistics(path) 220 | experiment_stats = (stats_mean, stats_std) 221 | 222 | if config is None: 223 | raise KeyError('No configuration found for {}'.format(path)) 224 | 225 | title = self.plot_title 226 | plot_curves(experiment_stats, config, title=title, path=path) 227 | plt.show() 228 | -------------------------------------------------------------------------------- /ilp/experiments/cfg/default.yml: 
-------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.10 7 | batch_size : 100 8 | max_samples : 200000 9 | n_burn_in_stream : 100 10 | 11 | 12 | offline_lp: 13 | tol : 0.001 14 | max_iter : 30 15 | 16 | 17 | online_lp: 18 | tol : 0.001 19 | max_iter : 30 20 | theta : 0.1 21 | iprint : 100 22 | 23 | 24 | graph: 25 | kernel : knn 26 | n_neighbors_labeled : 3 27 | n_neighbors_unlabeled : 3 28 | 29 | 30 | options: 31 | precision : float64 32 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/cfg/var_k_L.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.05 7 | batch_size : 100 8 | max_samples : 1000000 9 | 10 | 11 | offline_lp: 12 | tol : 0.001 13 | max_iter : 30 14 | 15 | 16 | online_lp: 17 | tol : 0.001 18 | max_iter : 30 19 | theta : 0.3 20 | iprint : 100 21 | 22 | 23 | graph: 24 | kernel : knn 25 | n_neighbors_labeled : [1, 3, 7, 11, 15, 19] 26 | n_neighbors_unlabeled : 3 27 | 28 | 29 | options: 30 | precision : float64 31 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/cfg/var_k_U.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.05 7 | batch_size : 100 8 | max_samples : 1000000 9 | 10 | 11 | offline_lp: 12 | tol : 0.001 13 | max_iter : 30 14 | 15 | 16 | online_lp: 17 | tol : 0.001 18 | max_iter : 30 19 | theta : 0.3 20 | iprint : 100 21 | 22 | 23 | graph: 24 | kernel : knn 25 | n_neighbors_labeled : 3 26 | n_neighbors_unlabeled : [1, 3, 7, 11, 15, 19] 27 | 28 | 29 | options: 30 | precision : float64 31 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/cfg/var_n_L.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.05 7 | batch_size : 100 8 | max_samples : 1000000 9 | n_labeled_per_class : [300, 500] 10 | 11 | 12 | offline_lp: 13 | tol : 0.001 14 | max_iter : 30 15 | 16 | 17 | online_lp: 18 | tol : 0.001 19 | max_iter : 30 20 | theta : 0.3 21 | iprint : 100 22 | 23 | 24 | graph: 25 | kernel : knn 26 | n_neighbors_labeled : 3 27 | n_neighbors_unlabeled : 3 28 | 29 | 30 | options: 31 | precision : float64 32 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/cfg/var_stream_labeled.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : [0.05, 0.1, 0.2] 7 | batch_size : 100 8 | max_samples : 1000000 9 | n_burn_in_stream : 100 10 | 11 | 12 | offline_lp: 13 | tol : 0.001 14 | max_iter : 30 15 | 16 | 17 | online_lp: 18 | tol : 0.001 19 | max_iter : 30 20 | theta : 1.0 21 | iprint : 100 22 | 23 | 24 | graph: 25 | kernel : knn 26 | n_neighbors_labeled : 3 27 | n_neighbors_unlabeled : 3 28 | 29 | 30 | options: 31 | precision : float64 32 | iter_stats : 100 -------------------------------------------------------------------------------- 
/ilp/experiments/cfg/var_theta.yml: -------------------------------------------------------------------------------- 1 | data: 2 | burn_in: 3 | n_labeled_per_class : 2 4 | ratio_labeled : 0.5 5 | stream: 6 | ratio_labeled : 0.20 7 | batch_size : 100 8 | max_samples : 1000000 9 | n_burn_in_stream : 100 10 | 11 | 12 | offline_lp: 13 | tol : 0.001 14 | max_iter : 30 15 | 16 | 17 | online_lp: 18 | tol : 0.001 19 | max_iter : 30 20 | theta : [0.0, 0.1, 0.5, 1.0, 1.5, 2.0] 21 | iprint : 100 22 | 23 | 24 | graph: 25 | kernel : knn 26 | n_neighbors_labeled : 3 27 | n_neighbors_unlabeled : 3 28 | 29 | 30 | options: 31 | precision : float64 32 | iter_stats : 100 -------------------------------------------------------------------------------- /ilp/experiments/default_run.py: -------------------------------------------------------------------------------- 1 | import os 2 | from datetime import datetime 3 | 4 | from ilp.experiments.base import BaseExperiment 5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM 6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 7 | from ilp.constants import CONFIG_DIR 8 | 9 | 10 | class DefaultRun(BaseExperiment): 11 | 12 | def __init__(self, params, n_runs=1, isave=100): 13 | super(DefaultRun, self).__init__(name='default_run', config=params, 14 | isave=isave, n_runs=n_runs, 15 | plot_title=r'Default run', 16 | multi_var=False) 17 | 18 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 19 | X_test, y_test, n_run): 20 | 21 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S') 22 | stats_path = os.path.join(self.top_dir, 'run_' + cur_time) 23 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, stats_path, 24 | seed_run, X_test, y_test) 25 | 26 | 27 | if __name__ == '__main__': 28 | 29 | # Parse user input 30 | parser = experiment_arg_parser() 31 | args = vars(parser.parse_args()) 32 | dataset_name = args['dataset'].lower() 33 | config_file = os.path.join(CONFIG_DIR, 'default.yml') 34 | config = parse_yaml(config_file) 35 | 36 | # Store dataset info 37 | config.setdefault('dataset', {}) 38 | config['dataset']['name'] = dataset_name 39 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 40 | 41 | # Instantiate experiment 42 | experiment = DefaultRun(params=config, n_runs=args['n_runs']) 43 | 44 | if args['plot'] != '': 45 | # python3 default_run.py -p latest 46 | experiment.load_plot(path=args['plot']) 47 | else: 48 | # python3 default_run.py -d digits 49 | experiment.run(dataset_name) 50 | -------------------------------------------------------------------------------- /ilp/experiments/var_n_labeled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import numpy as np 4 | from sklearn.utils.random import check_random_state 5 | 6 | from ilp.experiments.base import BaseExperiment 7 | from ilp.helpers.data_fetcher import fetch_load_data, IS_DATASET_STREAM 8 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 9 | from ilp.constants import CONFIG_DIR 10 | from ilp.helpers.data_flow import split_labels_rest, split_burn_in_rest 11 | from ilp.helpers.log import make_logger 12 | 13 | 14 | logger = make_logger(__name__) 15 | 16 | 17 | class VarSamplesLabeled(BaseExperiment): 18 | 19 | def __init__(self, n_labeled_values, params, n_runs=1, isave=100): 20 | super(VarSamplesLabeled, self).__init__(name='n_L', config=params, 21 | isave=isave, n_runs=n_runs, 22 | plot_title=r'Influence of ' 23 | r'number of 
labels', 24 | multi_var=True) 25 | self.n_labeled_values = n_labeled_values 26 | 27 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 28 | X_test, y_test, n_run): 29 | 30 | config = self.config 31 | 32 | n_labels = config['data']['n_labels'] 33 | save_dir = os.path.join(self.top_dir, 'n_L_' + str(n_labels)) 34 | stats_file = os.path.join(save_dir, 'run_' + str(n_run)) 35 | logger.info('\n\nExperiment: {}, n_labels = {}, run {}...\n'. 36 | format(self.name.upper(), n_labels, n_run)) 37 | time.sleep(1) 38 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 39 | stats_file, seed_run, X_test, y_test) 40 | 41 | def run(self, dataset_name, random_state=42): 42 | 43 | config = self.config 44 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name) 45 | 46 | n_classes = len(np.unique(y_train)) 47 | 48 | # if dataset_name == 'usps': 49 | # X_train = np.concatenate((X_train, X_test)) 50 | # y_train = np.concatenate((y_train, y_test)) 51 | 52 | for n_run in range(self.n_runs): 53 | seed_run = random_state * n_run 54 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run)) 55 | rng = check_random_state(seed_run) 56 | if config['dataset']['is_stream']: 57 | logger.info('Dataset is a stream. Sampling observed labels.') 58 | # Just randomly sample ratio_labeled samples for mask_labeled 59 | n_burn_in = config['data']['n_burn_in_stream'] 60 | ratio_labeled = config['data']['stream']['ratio_labeled'] 61 | n_labeled = int(ratio_labeled*len(y_train)) 62 | ind_labeled = rng.choice(len(y_train), n_labeled, 63 | replace=False) 64 | mask_labeled = np.zeros(len(y_train), dtype=bool) 65 | mask_labeled[ind_labeled] = True 66 | X_run, y_run = X_train, y_train 67 | else: 68 | 69 | burn_in_params = config['data']['burn_in'] 70 | ind_burn_in, mask_labeled_burn_in = \ 71 | split_burn_in_rest(y_train, shuffle=True, seed=seed_run, 72 | **burn_in_params) 73 | n_labeled_burn_in = sum(mask_labeled_burn_in) 74 | X_burn_in, y_burn_in = X_train[ind_burn_in], \ 75 | y_train[ind_burn_in] 76 | mask_rest = np.ones(len(X_train), dtype=bool) 77 | mask_rest[ind_burn_in] = False 78 | X_rest, y_rest = X_train[mask_rest], y_train[mask_rest] 79 | 80 | for nlpc in self.n_labeled_values: 81 | n_labels = nlpc*n_classes 82 | config['data']['n_labels'] = n_labels 83 | 84 | rl = (n_labels - n_labeled_burn_in) / len(y_rest) 85 | assert rl >= 0 86 | mask_labeled_rest = split_labels_rest(y_rest, batch_size=0, 87 | seed=seed_run, shuffle=True, ratio_labeled=rl) 88 | 89 | # Shuffle the rest 90 | indices = np.arange(len(y_rest)) 91 | rng.shuffle(indices) 92 | X_run = np.concatenate((X_burn_in, X_rest[indices])) 93 | y_run = np.concatenate((y_burn_in, y_rest[indices])) 94 | mask_labeled = np.concatenate((mask_labeled_burn_in, 95 | mask_labeled_rest[indices])) 96 | n_burn_in = len(y_burn_in) 97 | 98 | config['data']['n_burn_in'] = n_burn_in 99 | config.setdefault('options', {}) 100 | config['options']['random_state'] = seed_run 101 | 102 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in, 103 | seed_run, X_test, y_test, n_run) 104 | 105 | 106 | if __name__ == '__main__': 107 | parser = experiment_arg_parser() 108 | args = vars(parser.parse_args()) 109 | dataset_name = args['dataset'].lower() 110 | config_file = os.path.join(CONFIG_DIR, 'var_n_L.yml') 111 | config = parse_yaml(config_file) 112 | 113 | # Store dataset info 114 | config.setdefault('dataset', {}) 115 | config['dataset']['name'] = dataset_name 116 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 
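    # Note: var_n_L.yml supplies n_labeled_per_class as a list (e.g. [300, 500]).
    # run() above sweeps each value nlpc and sets the total label budget to
    # n_labels = nlpc * n_classes, e.g. 300 labels/class * 10 MNIST classes
    # = 3000 observed labels for that run.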
117 | 118 | N_LABELED_PER_CLASS = config['data']['n_labeled_per_class'].copy() 119 | 120 | experiment = VarSamplesLabeled(N_LABELED_PER_CLASS, params=config, 121 | n_runs=args['n_runs']) 122 | if args['plot'] != '': 123 | experiment.load_plot(path=args['plot']) 124 | else: 125 | experiment.run(dataset_name) 126 | -------------------------------------------------------------------------------- /ilp/experiments/var_n_neighbors_labeled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | 4 | from ilp.experiments.base import BaseExperiment 5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM 6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 7 | from ilp.constants import CONFIG_DIR 8 | from ilp.helpers.log import make_logger 9 | 10 | 11 | logger = make_logger(__name__) 12 | 13 | 14 | class VarNeighborsLabeled(BaseExperiment): 15 | 16 | def __init__(self, n_neighbors_values, params, n_runs, isave=100): 17 | super(VarNeighborsLabeled, self).__init__(name='k_L', config=params, 18 | isave=isave, n_runs=n_runs, 19 | plot_title=r'Influence of ' 20 | r'$k_l$', 21 | multi_var=True) 22 | self.n_neighbors_values = n_neighbors_values 23 | 24 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 25 | X_test, y_test, n_run): 26 | 27 | params = self.config 28 | 29 | for n_neighbors in self.n_neighbors_values: 30 | params['graph']['n_neighbors_labeled'] = n_neighbors 31 | save_dir = os.path.join(self.top_dir, 'k_L_' + str(n_neighbors)) 32 | stats_file = os.path.join(save_dir, 'run_' + str(n_run)) 33 | logger.info('\n\nExperiment: {}, k_L = {}, run {}...\n'. 34 | format(self.name.upper(), n_neighbors, n_run)) 35 | time.sleep(1) 36 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 37 | stats_file, seed_run, X_test, y_test) 38 | 39 | 40 | if __name__ == '__main__': 41 | 42 | parser = experiment_arg_parser() 43 | args = vars(parser.parse_args()) 44 | dataset_name = args['dataset'].lower() 45 | config_file = os.path.join(CONFIG_DIR, 'var_k_L.yml') 46 | config = parse_yaml(config_file) 47 | 48 | # Store dataset info 49 | config.setdefault('dataset', {}) 50 | config['dataset']['name'] = dataset_name 51 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 52 | 53 | N_NEIGHBORS_VALUES = config['graph']['n_neighbors_labeled'].copy() 54 | 55 | experiment = VarNeighborsLabeled(N_NEIGHBORS_VALUES, params=config, 56 | n_runs=args['n_runs']) 57 | if args['plot'] != '': 58 | experiment.load_plot(path=args['plot']) 59 | else: 60 | experiment.run(dataset_name) -------------------------------------------------------------------------------- /ilp/experiments/var_n_neighbors_unlabeled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | 4 | from ilp.experiments.base import BaseExperiment 5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM 6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 7 | from ilp.constants import CONFIG_DIR 8 | from ilp.helpers.log import make_logger 9 | 10 | 11 | logger = make_logger(__name__) 12 | 13 | 14 | class VarNeighborsUnLabeled(BaseExperiment): 15 | 16 | def __init__(self, n_neighbors_values, params, n_runs, isave=100): 17 | super(VarNeighborsUnLabeled, self).__init__(name='k_U', config=params, 18 | isave=isave, 19 | plot_title=r'Influence of ' 20 | r'$k_u$', 21 | n_runs=n_runs, 22 | multi_var=True) 23 | self.n_neighbors_values = n_neighbors_values 24 | 
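    # pre_single_run() below sweeps the k_U values read from var_k_U.yml: for
    # each value it overwrites config['graph']['n_neighbors_unlabeled'] and
    # writes that run's statistics to <top_dir>/k_U_<value>/run_<n_run>.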
25 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, 26 | seed_run, X_test, y_test, n_run): 27 | params = self.config 28 | 29 | for n_neighbors in self.n_neighbors_values: 30 | params['graph']['n_neighbors_unlabeled'] = n_neighbors 31 | save_dir = os.path.join(self.top_dir, 32 | 'k_U_' + str(n_neighbors)) 33 | stats_file = os.path.join(save_dir, 'run_' + str(n_run)) 34 | logger.info('\n\nExperiment: {}, k_U = {}, run {}...\n'. 35 | format(self.name.upper(), n_neighbors, n_run)) 36 | time.sleep(1) 37 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 38 | stats_file, seed_run, X_test, y_test) 39 | 40 | 41 | if __name__ == '__main__': 42 | parser = experiment_arg_parser() 43 | args = vars(parser.parse_args()) 44 | dataset_name = args['dataset'].lower() 45 | config_file = os.path.join(CONFIG_DIR, 'var_k_U.yml') 46 | config = parse_yaml(config_file) 47 | 48 | # Store dataset info 49 | config.setdefault('dataset', {}) 50 | config['dataset']['name'] = dataset_name 51 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 52 | 53 | N_NEIGHBORS_VALUES = config['graph']['n_neighbors_unlabeled'].copy() 54 | 55 | experiment = VarNeighborsUnLabeled(N_NEIGHBORS_VALUES, params=config, 56 | n_runs=args['n_runs']) 57 | if args['plot'] != '': 58 | experiment.load_plot(path=args['plot']) 59 | else: 60 | experiment.run(dataset_name) -------------------------------------------------------------------------------- /ilp/experiments/var_stream_labeled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | from time import sleep 4 | from sklearn.utils.random import check_random_state 5 | 6 | from ilp.experiments.base import BaseExperiment 7 | from ilp.helpers.data_fetcher import fetch_load_data, IS_DATASET_STREAM 8 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 9 | from ilp.constants import CONFIG_DIR 10 | from ilp.helpers.log import make_logger 11 | 12 | 13 | logger = make_logger(__name__) 14 | 15 | 16 | class VarStreamLabeled(BaseExperiment): 17 | 18 | def __init__(self, ratio_labeled_values, params, n_runs=1, isave=100): 19 | super(VarStreamLabeled, self).__init__(name='srl', 20 | config=params, 21 | isave=isave, n_runs=n_runs, 22 | plot_title=r'Influence of ratio of labels', 23 | multi_var=True) 24 | self.ratio_labeled_values = ratio_labeled_values 25 | 26 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 27 | X_test, y_test, n_run): 28 | 29 | config = self.config 30 | 31 | ratio_labeled = config['data']['stream']['ratio_labeled'] 32 | save_dir = os.path.join(self.top_dir, 'srl_' + str(ratio_labeled)) 33 | stats_path = os.path.join(save_dir, 'run_' + str(n_run)) 34 | logger.info('\n\nExperiment: {}, ratio_labeled = {}, run {}...\n'. 35 | format(self.name.upper(), ratio_labeled, n_run)) 36 | sleep(1) 37 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 38 | stats_path, seed_run, X_test, y_test) 39 | 40 | def run(self, dataset_name, random_state=42): 41 | 42 | config = self.config 43 | X_train, y_train, X_test, y_test = fetch_load_data(dataset_name) 44 | 45 | for n_run in range(self.n_runs): 46 | seed_run = random_state * n_run 47 | logger.info('\n\nRANDOM SEED = {} for data split.'.format(seed_run)) 48 | rng = check_random_state(seed_run) 49 | if config['dataset']['is_stream']: 50 | logger.info('Dataset is a stream. 
Sampling observed labels.') 51 | # Just randomly sample ratio_labeled samples for mask_labeled 52 | n_burn_in = config['data']['n_burn_in_stream'] 53 | 54 | for ratio_labeled in self.ratio_labeled_values: 55 | 56 | config['data']['stream']['ratio_labeled'] = ratio_labeled 57 | n_labeled = int(ratio_labeled*len(y_train)) 58 | ind_labeled = rng.choice(len(y_train), n_labeled, 59 | replace=False) 60 | mask_labeled = np.zeros(len(y_train), dtype=bool) 61 | mask_labeled[ind_labeled] = True 62 | X_run, y_run = X_train, y_train 63 | 64 | config['data']['n_burn_in'] = n_burn_in 65 | config.setdefault('options', {}) 66 | config['options']['random_state'] = seed_run 67 | 68 | self.pre_single_run(X_run, y_run, mask_labeled, n_burn_in, 69 | seed_run, X_test, y_test, n_run) 70 | 71 | 72 | if __name__ == '__main__': 73 | parser = experiment_arg_parser() 74 | args = vars(parser.parse_args()) 75 | dataset_name = args['dataset'].lower() 76 | config_file = os.path.join(CONFIG_DIR, 'var_stream_labeled.yml') 77 | config = parse_yaml(config_file) 78 | 79 | # Store dataset info 80 | config.setdefault('dataset', {}) 81 | config['dataset']['name'] = dataset_name 82 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 83 | 84 | N_RATIO_LABELED = config['data']['stream']['ratio_labeled'].copy() 85 | 86 | experiment = VarStreamLabeled(N_RATIO_LABELED, params=config, 87 | n_runs=args['n_runs']) 88 | if args['plot'] != '': 89 | experiment.load_plot(path=args['plot']) 90 | else: 91 | experiment.run(dataset_name) 92 | -------------------------------------------------------------------------------- /ilp/experiments/var_theta.py: -------------------------------------------------------------------------------- 1 | import os 2 | from time import sleep 3 | 4 | from ilp.experiments.base import BaseExperiment 5 | from ilp.helpers.data_fetcher import IS_DATASET_STREAM 6 | from ilp.helpers.params_parse import parse_yaml, experiment_arg_parser 7 | from ilp.constants import CONFIG_DIR 8 | from ilp.helpers.log import make_logger 9 | 10 | 11 | logger = make_logger(__name__) 12 | 13 | 14 | class VarTheta(BaseExperiment): 15 | 16 | def __init__(self, theta_values, params, n_runs, isave=100): 17 | super(VarTheta, self).__init__(name='theta', config=params, 18 | isave=isave, n_runs=n_runs, 19 | plot_title=r'Influence of $\vartheta$', 20 | multi_var=True) 21 | self.theta_values = theta_values 22 | 23 | def pre_single_run(self, X_run, y_run, mask_labeled, n_burn_in, seed_run, 24 | X_test, y_test, n_run): 25 | 26 | params = self.config 27 | 28 | for theta in self.theta_values: 29 | params['online_lp']['theta'] = theta 30 | save_dir = os.path.join(self.top_dir, 'theta_' + str(theta)) 31 | stats_file = os.path.join(save_dir, 'run_' + str(n_run)) 32 | logger.info('\n\nExperiment: {}, theta = {}, run {}...\n'. 
33 | format(self.name.upper(), theta, n_run)) 34 | sleep(1) 35 | self._single_run(X_run, y_run, mask_labeled, n_burn_in, 36 | stats_file, seed_run, X_test, y_test) 37 | 38 | 39 | if __name__ == '__main__': 40 | 41 | parser = experiment_arg_parser() 42 | args = vars(parser.parse_args()) 43 | dataset_name = args['dataset'].lower() 44 | config_file = os.path.join(CONFIG_DIR, 'var_theta.yml') 45 | config = parse_yaml(config_file) 46 | 47 | # Store dataset info 48 | config.setdefault('dataset', {}) 49 | config['dataset']['name'] = dataset_name 50 | config['dataset']['is_stream'] = IS_DATASET_STREAM.get(dataset_name, False) 51 | 52 | THETA_VALUES = config['online_lp']['theta'].copy() 53 | 54 | # Instantiate experiment 55 | experiment = VarTheta(theta_values=THETA_VALUES, params=config, 56 | n_runs=args['n_runs']) 57 | if args['plot'] != '': 58 | experiment.load_plot(path=args['plot']) 59 | else: 60 | # python3 default_run.py -d digits 61 | experiment.run(dataset_name) -------------------------------------------------------------------------------- /ilp/helpers/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/helpers/__init__.py -------------------------------------------------------------------------------- /ilp/helpers/data_fetcher.py: -------------------------------------------------------------------------------- 1 | import os 2 | import gzip 3 | import zipfile 4 | from urllib import request 5 | import yaml 6 | import numpy as np 7 | from sklearn.datasets import make_classification 8 | from sklearn.model_selection import train_test_split 9 | 10 | from ilp.constants import DATA_DIR 11 | 12 | 13 | CWD = os.path.split(__file__)[0] 14 | DATASET_CONFIG_PATH = os.path.join(CWD, 'datasets.yml') 15 | 16 | SUPPORTED_DATASETS = {'mnist', 'usps', 'blobs', 'kitti_features'} 17 | IS_DATASET_STREAM = {'kitti_features': True} 18 | 19 | 20 | def check_supported_dataset(dataset): 21 | 22 | if dataset not in SUPPORTED_DATASETS: 23 | raise FileNotFoundError('Dataset {} is not supported.'.format(dataset)) 24 | 25 | return True 26 | 27 | 28 | def fetch_load_data(name): 29 | 30 | print('\nFetching/Loading {}...'.format(name)) 31 | with open(DATASET_CONFIG_PATH, 'r') as f: 32 | datasets_configs = yaml.load(f) 33 | if name.upper() not in datasets_configs: 34 | raise FileNotFoundError('Dataset {} not supported.'.format(name)) 35 | 36 | config = datasets_configs[name.upper()] 37 | name_ = config.get('name', name) 38 | test_size = config.get('test_size', 0) 39 | 40 | if name_ == 'KITTI_FEATURES': 41 | X_tr, y_tr, X_te, y_te = fetch_kitti() 42 | elif name_ == 'USPS': 43 | X_tr, y_tr, X_te, y_te = fetch_usps() 44 | elif name_ == 'MNIST': 45 | X_tr, y_tr, X_te, y_te = fetch_mnist() 46 | X_tr = X_tr / 255. 47 | X_te = X_te / 255. 
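        # Note: fetch_mnist() returns the raw 0-255 pixel intensities as
        # float64, so the division by 255 above rescales MNIST features to
        # [0, 1] (e.g. a pixel value of 128 becomes ~0.502).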
48 | elif name_ == 'BLOBS': 49 | X, y = make_classification(n_samples=60) 50 | X = np.asarray(X) 51 | y = np.asarray(y, dtype=int) 52 | 53 | if test_size > 0: 54 | if type(test_size) is int: 55 | t = test_size 56 | print('{} has shape {}'.format(name_, X.shape)) 57 | print('Splitting data with test size = {}'.format(test_size)) 58 | X_tr, X_te, y_tr, y_te = X[:-t], X[-t:], y[:-t], y[-t:] 59 | elif type(test_size) is float: 60 | X_tr, X_te, y_tr, y_te = train_test_split( 61 | X, y, test_size=test_size, stratify=y) 62 | else: 63 | raise TypeError('test_size is neither int or float.') 64 | 65 | print('Loaded training set with shape {}'.format(X_tr.shape)) 66 | print('Loaded testing set with shape {}'.format(X_te.shape)) 67 | return X_tr, y_tr, X_te, y_te 68 | else: 69 | print('Loaded {} with {} samples of dimension {}.' 70 | .format(name_, X.shape[0], X.shape[1])) 71 | return X, y, None, None 72 | else: 73 | raise NameError('No data set {} found!'.format(name_)) 74 | 75 | print('Loaded training data with shape {}'.format(X_tr.shape)) 76 | print('Loaded training labels with shape {}'.format(y_tr.shape)) 77 | print('Loaded testing data with shape {}'.format(X_te.shape)) 78 | print('Loaded testing labels with shape {}'.format(y_te.shape)) 79 | return X_tr, y_tr, X_te, y_te 80 | 81 | 82 | def fetch_usps(save_dir=None): 83 | 84 | base_url = 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/' 85 | train_file = 'zip.train.gz' 86 | test_file = 'zip.test.gz' 87 | save_dir = DATA_DIR if save_dir is None else save_dir 88 | 89 | if not os.path.isdir(save_dir): 90 | raise NotADirectoryError('{} is not a directory.'.format(save_dir)) 91 | 92 | train_source = os.path.join(base_url, train_file) 93 | test_source = os.path.join(base_url, test_file) 94 | 95 | train_dest = os.path.join(save_dir, train_file) 96 | test_dest = os.path.join(save_dir, test_file) 97 | 98 | def download_file(source, destination): 99 | if not os.path.exists(destination): 100 | print('Downloading from {}...'.format(source)) 101 | f, msg = request.urlretrieve(url=source, filename=destination) 102 | print('HTTP response: {}'.format(msg)) 103 | return f, msg 104 | else: 105 | print('Found dataset in {}!'.format(destination)) 106 | return None 107 | 108 | download_file(train_source, train_dest) 109 | download_file(test_source, test_dest) 110 | 111 | X_train = np.loadtxt(train_dest) 112 | y_train, X_train = X_train[:, 0].astype(np.int32), X_train[:, 1:] 113 | 114 | X_test = np.loadtxt(test_dest) 115 | y_test, X_test = X_test[:, 0].astype(np.int32), X_test[:, 1:] 116 | 117 | return X_train, y_train, X_test, y_test 118 | 119 | 120 | def fetch_kitti(data_dir=None): 121 | 122 | if data_dir is None: 123 | data_dir = os.path.join(DATA_DIR, 'kitti_features') 124 | 125 | files = ['kitti_all_train.data', 126 | 'kitti_all_train.labels', 127 | 'kitti_all_test.data', 128 | 'kitti_all_test.labels'] 129 | 130 | for file in files: 131 | if file not in os.listdir(data_dir): 132 | zip_path = os.path.join(data_dir, 'kitti_features.zip') 133 | target_path = os.path.dirname(zip_path) 134 | print("Extracting {} to {}...".format(zip_path, target_path)) 135 | with zipfile.ZipFile(zip_path, "r") as zip_ref: 136 | zip_ref.extractall(target_path) 137 | print("Done.") 138 | break 139 | 140 | X_train = np.loadtxt(os.path.join(data_dir, files[0]), np.float64, skiprows=1) 141 | y_train = np.loadtxt(os.path.join(data_dir, files[1]), np.int32, skiprows=1) 142 | X_test = np.loadtxt(os.path.join(data_dir, files[2]), np.float64, skiprows=1) 143 | y_test = 
np.loadtxt(os.path.join(data_dir, files[3]), np.int32, skiprows=1) 144 | 145 | return X_train, y_train, X_test, y_test 146 | 147 | 148 | def fetch_mnist(data_dir=None): 149 | 150 | if data_dir is None: 151 | data_dir = os.path.join(DATA_DIR, 'mnist') 152 | 153 | url = 'http://yann.lecun.com/exdb/mnist/' 154 | files = ['train-images-idx3-ubyte.gz', 155 | 'train-labels-idx1-ubyte.gz', 156 | 't10k-images-idx3-ubyte.gz', 157 | 't10k-labels-idx1-ubyte.gz'] 158 | 159 | # Create path if it doesn't exist 160 | os.makedirs(data_dir, exist_ok=True) 161 | 162 | # Download any missing files 163 | for file in files: 164 | if file not in os.listdir(data_dir): 165 | request.urlretrieve(url + file, os.path.join(data_dir, file)) 166 | print("Downloaded %s to %s" % (file, data_dir)) 167 | 168 | def _images(path): 169 | """Return flattened images loaded from local file.""" 170 | with gzip.open(path) as f: 171 | # First 16 bytes are magic_number, n_imgs, n_rows, n_cols 172 | pixels = np.frombuffer(f.read(), '>B', offset=16) 173 | return pixels.reshape(-1, 784).astype('float64') 174 | 175 | def _labels(path): 176 | with gzip.open(path) as f: 177 | # First 8 bytes are magic_number, n_labels 178 | integer_labels = np.frombuffer(f.read(), '>B', offset=8) 179 | 180 | return integer_labels 181 | 182 | X_train = _images(os.path.join(data_dir, files[0])) 183 | y_train = _labels(os.path.join(data_dir, files[1])) 184 | X_test = _images(os.path.join(data_dir, files[2])) 185 | y_test = _labels(os.path.join(data_dir, files[3])) 186 | 187 | return X_train, y_train, X_test, y_test 188 | -------------------------------------------------------------------------------- /ilp/helpers/data_flow.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.utils.validation import check_random_state 3 | 4 | 5 | def check_min_samples_per_class(y, min_samples=2): 6 | classes, class_sizes = np.unique(y, return_counts=True) 7 | min_class_size = class_sizes.min() 8 | print('Class sizes: {}'.format(class_sizes)) 9 | if min_class_size < min_samples: 10 | print('Classes: {}'.format(np.unique(y))) 11 | raise ValueError('Minimum class size < 2.') 12 | 13 | return classes, class_sizes 14 | 15 | 16 | def split_burn_in_rest(y, n_labeled_per_class, ratio_labeled, shuffle=False, seed=None): 17 | """ 18 | 19 | Parameters 20 | ---------- 21 | y : array, shape (n_samples,) 22 | The true data labels. 23 | 24 | n_labeled_per_class : int 25 | Number of labeled samples per class to include in the burn-in set. 26 | 27 | ratio_labeled : float 28 | Ratio of labeled samples to include within the burn-in set. 29 | 30 | shuffle : bool 31 | Whether to shuffle indices within classes before adding to burn-in set. 32 | 33 | seed : int, np.random.RandomState or None 34 | For reproducibility. 35 | 36 | Returns 37 | ------- 38 | ind_burn_in : list of length n_burn_in 39 | Indices of the samples in the burn-in set. 40 | 41 | mask_labeled_burn_in : array, shape (n_burn_in,) 42 | Mask indicating whether the labels in the burn-in set are observed. 
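
    Notes
    -----
    A small worked example: with the default burn-in settings
    (n_labeled_per_class=2, ratio_labeled=0.5), each class contributes
    2 labeled samples plus int(2 * (1 / 0.5 - 1)) = 2 unlabeled samples
    to the burn-in set (capped at the class size).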
43 | 44 | """ 45 | classes, class_sizes = check_min_samples_per_class(y, n_labeled_per_class) 46 | 47 | rng = check_random_state(seed) 48 | ind_burn_in = [] 49 | set_ind_burn_in_labeled = set() 50 | 51 | for class_ in classes: 52 | ind_class, = np.where(y == class_) 53 | n_class = len(ind_class) 54 | n_unlabeled_class = int( 55 | n_labeled_per_class * (1 / ratio_labeled - 1)) #+ 1 56 | n_unlabeled_class = min(n_unlabeled_class, 57 | n_class - n_labeled_per_class) 58 | n_burn_in_class = n_labeled_per_class + n_unlabeled_class 59 | 60 | if shuffle: 61 | ind_samples = rng.choice(n_class, n_burn_in_class, replace=False) 62 | ind_burn_in_class = ind_class[ind_samples] 63 | ind_burn_in.extend(ind_burn_in_class) 64 | ind_samples = rng.choice(n_burn_in_class, n_labeled_per_class, 65 | replace=False) 66 | ind_burn_in_class_labeled = ind_burn_in_class[ind_samples] 67 | else: 68 | ind_burn_in_class = ind_class[:n_burn_in_class] 69 | ind_burn_in.extend(ind_burn_in_class) 70 | ind_burn_in_class_labeled = ind_burn_in_class[:n_labeled_per_class] 71 | 72 | set_ind_burn_in_labeled.update(ind_burn_in_class_labeled) 73 | 74 | mask_labeled_burn_in = [i in set_ind_burn_in_labeled for i in ind_burn_in] 75 | mask_labeled_burn_in = np.asarray(mask_labeled_burn_in) 76 | 77 | y_burn_in = y[ind_burn_in] 78 | y_burn_in_labeled = y_burn_in[mask_labeled_burn_in] 79 | y_burn_in_unlabeled = y_burn_in[~mask_labeled_burn_in] 80 | 81 | _, class_sizes_labeled = np.unique(y_burn_in_labeled, return_counts=True) 82 | _, class_sizes_unlabeled = np.unique(y_burn_in_unlabeled, 83 | return_counts=True) 84 | 85 | if len(y_burn_in_labeled) == 0 and len(y_burn_in_unlabeled) == 0: 86 | class_sizes_burnin = np.zeros(len(classes)) 87 | elif len(y_burn_in_labeled) == 0: 88 | class_sizes_burnin = class_sizes_unlabeled 89 | elif len(y_burn_in_unlabeled) == 0: 90 | class_sizes_burnin = class_sizes_labeled 91 | else: 92 | class_sizes_burnin = class_sizes_labeled + class_sizes_unlabeled 93 | 94 | print('\n\n') 95 | print('Burn-in labeled class sizes: {} , sum = {}'.format( 96 | class_sizes_labeled, sum(class_sizes_labeled))) 97 | print('Burn-in unlabeled class sizes: {}, sum = {}'.format( 98 | class_sizes_unlabeled, sum(class_sizes_unlabeled))) 99 | print('Burn-in total class sizes: {}, sum = {}'.format( 100 | class_sizes_burnin, sum(class_sizes_burnin))) 101 | print('\nRest total size: {}'.format(len(y) - len(y_burn_in))) 102 | 103 | return ind_burn_in, mask_labeled_burn_in 104 | 105 | 106 | def split_labels_rest(y_rest, ratio_labeled, batch_size, shuffle=False, 107 | seed=None): 108 | """ 109 | 110 | Parameters 111 | ---------- 112 | y_rest : array with shape (n_rest,) 113 | Remaining data labels after burn-in. 114 | 115 | ratio_labeled : float 116 | Ratio of observed labels in remaining data. 117 | 118 | batch_size : int 119 | Number of points for which the ratio_labeled must be satisfied. 120 | 121 | shuffle : bool 122 | Whether to shuffle indices within classes before adding to burn-in set. 123 | 124 | seed : int, np.random.RandomState or None 125 | For reproducibility. 126 | 127 | Returns 128 | ------- 129 | mask_labeled_rest : array, shape (n_rest,) 130 | Mask indicating whether the labels in the rest set are observed. 
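
    Notes
    -----
    A small worked example: with ratio_labeled=0.1, a class with 200 remaining
    samples contributes int(200 * 0.1) = 20 observed labels to the mask.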
131 | 132 | """ 133 | 134 | classes = np.unique(y_rest) 135 | 136 | rng = check_random_state(seed) 137 | 138 | set_ind_rest_labeled = set() 139 | 140 | for class_ in classes: 141 | ind_class, = np.where(y_rest == class_) 142 | n_class = len(ind_class) 143 | n_labeled_class = int(n_class * ratio_labeled) 144 | 145 | if shuffle: 146 | ind_samples = rng.choice(n_class, n_labeled_class, replace=False) 147 | is_labeled_rest_class = ind_class[ind_samples] 148 | else: 149 | is_labeled_rest_class = ind_class[:n_labeled_class] 150 | 151 | set_ind_rest_labeled.update(is_labeled_rest_class) 152 | 153 | mask_labeled_rest = [i in set_ind_rest_labeled for i in range(len(y_rest))] 154 | mask_labeled_rest = np.asarray(mask_labeled_rest) 155 | 156 | y_rest_labeled = y_rest[mask_labeled_rest] 157 | y_rest_unlabeled = y_rest[~mask_labeled_rest] 158 | 159 | _, class_sizes_labeled = np.unique(y_rest_labeled, return_counts=True) 160 | _, class_sizes_unlabeled = np.unique(y_rest_unlabeled, return_counts=True) 161 | 162 | if len(y_rest_labeled) == 0 and len(y_rest_unlabeled) == 0: 163 | class_sizes_rest = np.zeros(len(classes)) 164 | elif len(y_rest_labeled) == 0: 165 | class_sizes_rest = class_sizes_unlabeled 166 | elif len(y_rest_unlabeled) == 0: 167 | class_sizes_rest = class_sizes_labeled 168 | else: 169 | class_sizes_rest = class_sizes_labeled + class_sizes_unlabeled 170 | 171 | print('\n\n') 172 | print('Rest labeled class sizes: {}, sum = {}'.format( 173 | class_sizes_labeled, sum(class_sizes_labeled))) 174 | print('Rest unlabeled class sizes: {}, sum = {}'.format( 175 | class_sizes_unlabeled, sum(class_sizes_unlabeled))) 176 | print('Rest total class sizes: {}, sum = {}'.format( 177 | class_sizes_rest, sum(class_sizes_rest))) 178 | print('\n\n') 179 | 180 | return mask_labeled_rest 181 | 182 | 183 | def gen_semilabeled_data(inputs, targets, flags): 184 | """ 185 | Generates a sequence of all inputs and targets, prepended with an id 186 | of the sample. A boolean value indicating if the label is observed by 187 | the algorithm or not is also generated. 188 | """ 189 | 190 | assert len(inputs) == len(targets) == len(flags) 191 | 192 | indices = range(len(inputs)) 193 | 194 | for i, j, k, l in zip(indices, inputs, targets, flags): 195 | yield i, j, k, l 196 | 197 | 198 | def gen_data_stream(inputs, targets, shuffle=False, seed=None): 199 | """ 200 | Generates a sequence of all inputs and targets, optionally shuffled. 
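
    Example (a sketch with plain Python lists):

        >>> list(gen_data_stream([10, 20, 30], ['a', 'b', 'c']))
        [(10, 'a'), (20, 'b'), (30, 'c')]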
201 | """ 202 | 203 | assert len(inputs) == len(targets) 204 | if shuffle: 205 | indices = np.arange(len(inputs)) 206 | random_state = check_random_state(seed) 207 | random_state.shuffle(indices) 208 | for i in indices: 209 | yield inputs[i], targets[i] 210 | else: 211 | for i in range(len(inputs)): 212 | yield inputs[i], targets[i] 213 | -------------------------------------------------------------------------------- /ilp/helpers/datasets.yml: -------------------------------------------------------------------------------- 1 | KITTI_FEATURES: 2 | name : KITTI_FEATURES 3 | test_size : 9090 4 | 5 | 6 | BLOBS: 7 | name : BLOBS 8 | pca : True 9 | test_size : 0.0 10 | 11 | 12 | MNIST: 13 | name : MNIST 14 | test_size : 10000 15 | 16 | 17 | USPS: 18 | name : USPS 19 | test_size : 2007 20 | -------------------------------------------------------------------------------- /ilp/helpers/fc_heap.py: -------------------------------------------------------------------------------- 1 | import heapq 2 | import warnings 3 | 4 | 5 | class FixedCapacityHeap: 6 | """Implementation of a min-heap with fixed capacity. 7 | The heap contains tuples of the form (edge_weight, node_id), 8 | which means the min. edge weight is extracted first 9 | """ 10 | 11 | def __init__(self, lst=None, capacity=10): 12 | self.capacity = capacity 13 | if lst is None: 14 | self.data = [] 15 | elif type(lst) is list: 16 | self.data = lst 17 | else: 18 | self.data = lst.tolist() 19 | 20 | if lst is not None: 21 | heapq.heapify(self.data) 22 | if len(self.data) > capacity: 23 | msg = 'Input data structure is larger than the queue\'s ' \ 24 | 'capacity ({}), truncating to smallest ' \ 25 | 'elements.'.format(capacity) 26 | 27 | warnings.warn(msg, UserWarning) 28 | self.data = self.data[:self.capacity] 29 | 30 | def push(self, item): 31 | """Insert an element in the heap if its key is smaller than the current 32 | max-key elements and remove the current max-key element if the new 33 | heap size exceeds the heap capacity 34 | 35 | Args: 36 | item (tuple): (edge_weight, node_ind) 37 | 38 | Returns: 39 | tuple : (bool, item) 40 | bool: whether the item was actually inserted in the queue 41 | item: another item that was removed from the queue or None if none was removed 42 | 43 | """ 44 | inserted = False 45 | removed = None 46 | if len(self.data) < self.capacity: 47 | heapq.heappush(self.data, item) 48 | inserted = True 49 | else: 50 | if item > self.get_min(): 51 | removed = heapq.heappushpop(self.data, item) 52 | inserted = True 53 | 54 | return inserted, removed 55 | 56 | def get_min(self): 57 | """Return the min-key element without removing it from the heap""" 58 | return self.data[0] 59 | 60 | def __len__(self): 61 | return len(self.data) 62 | -------------------------------------------------------------------------------- /ilp/helpers/log.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import logging 3 | 4 | 5 | def make_logger(name, path=None): 6 | logger = logging.getLogger(name) 7 | logger.setLevel(logging.DEBUG) 8 | stream_handler = logging.StreamHandler(stream=sys.stdout) 9 | # fmt = '%(asctime)s ' \ 10 | fmt = '[%(levelname)-10s] %(name)-10s : %(message)s' 11 | # fmt = '[{levelname}] {name} {message}' 12 | formatter = logging.Formatter(fmt=fmt, style='%') 13 | stream_handler.setFormatter(formatter) 14 | logger.addHandler(stream_handler) 15 | 16 | if path: 17 | file_handler = logging.FileHandler(filename=path) 18 | file_handler.setFormatter(formatter) 19 | 
logger.addHandler(file_handler) 20 | 21 | return logger 22 | -------------------------------------------------------------------------------- /ilp/helpers/params_parse.py: -------------------------------------------------------------------------------- 1 | import os 2 | import yaml 3 | from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter 4 | 5 | from ilp.constants import CONFIG_DIR 6 | from ilp.helpers.data_fetcher import SUPPORTED_DATASETS 7 | 8 | 9 | def args_to_dict(args): 10 | 11 | sample_yaml = os.path.join(CONFIG_DIR, 'default.yml') 12 | with open(sample_yaml, 'r') as f: 13 | sample_dict = yaml.load(f) 14 | 15 | params_dict = dict(sample_dict) 16 | for k, v in args.items(): 17 | for section in sample_dict: 18 | if k in sample_dict[section]: 19 | params_dict[section][k] = v 20 | 21 | return params_dict 22 | 23 | 24 | def parse_yaml(config_file): 25 | 26 | sample_yaml = os.path.join(CONFIG_DIR, 'default.yml') 27 | with open(sample_yaml, 'r') as f: 28 | default_params = yaml.load(f) 29 | 30 | with open(config_file) as cfg_file: 31 | params = yaml.load(cfg_file) 32 | 33 | for k, v in default_params.items(): 34 | if k not in params: 35 | params[k] = default_params[k] 36 | 37 | return params 38 | 39 | 40 | def print_config(params): 41 | for section in params: 42 | print('\n{} PARAMS:'.format(section.upper())) 43 | if type(params[section]) is dict: 44 | for k, v in params[section].items(): 45 | print('{}: {}'.format(k, v)) 46 | else: 47 | print('{}'.format(params[section])) 48 | 49 | 50 | def experiment_arg_parser(): 51 | arg_parser = ArgumentParser(description="ILP experiment", formatter_class=ArgumentDefaultsHelpFormatter) 52 | 53 | # Dataset 54 | arg_parser.add_argument( 55 | '-d', '--dataset', type=str, default='digits', 56 | help='Load the given dataset.\nSupported datasets are: {}' 57 | .format(SUPPORTED_DATASETS) 58 | ) 59 | 60 | arg_parser.add_argument( 61 | '-n', '--n_runs', type=int, default=1, 62 | help='Number of times to run the experiment with different seeds' 63 | ) 64 | 65 | arg_parser.add_argument( 66 | '-p', '--plot', type=str, default='', 67 | help='Plot the latest experiment results from the given directory' 68 | ) 69 | 70 | return arg_parser 71 | -------------------------------------------------------------------------------- /ilp/helpers/stats.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | import shelve 4 | import sklearn.preprocessing as prep 5 | from datetime import datetime 6 | from threading import Thread 7 | from enum import Enum 8 | from queue import Queue 9 | 10 | from ilp.constants import EPS_32, EPS_64, STATS_DIR 11 | from ilp.helpers.log import make_logger 12 | 13 | 14 | STATS_FILE_EXT = '.stat' 15 | 16 | 17 | class JobType(Enum): 18 | EVAL = 1 19 | ONLINE_ITER = 3 20 | PRINT_STATS = 6 21 | LABEL_STREAM = 9 22 | TRAIN_PRED = 11 23 | TEST_PRED = 12 24 | RUNTIME = 13 25 | POINT_PREDICTION = 14 26 | 27 | 28 | logger = make_logger(__name__) 29 | 30 | class StatisticsWorker: 31 | """ 32 | Parameters 33 | ---------- 34 | 35 | config : dict 36 | Dictionary with configuration key-value pairs of the running experiment. 37 | 38 | path : str, optional 39 | File path to save the aggregated statistics (default=None). 40 | 41 | isave : int, optional 42 | The frequency of saving statistics (default=1000). 43 | 44 | 45 | Attributes 46 | ---------- 47 | 48 | _stats : Statistics 49 | Class to store different kinds of statistics during an experiment run. 
50 | 51 | _jobs : Queue 52 | Queue of jobs to process in a different thread. 53 | 54 | _thread : Thread 55 | Thread in which to process incoming jobs. 56 | 57 | """ 58 | 59 | def __init__(self, config, path=None, isave=1): 60 | if path is None: 61 | cur_time = datetime.now().strftime('%d%m%Y-%H%M%S') 62 | dataset_name = config['dataset']['name'] 63 | filename = 'stats_' + cur_time + '_' + str(dataset_name) + STATS_FILE_EXT 64 | path = os.path.join(STATS_DIR, filename) 65 | elif not path.endswith(STATS_FILE_EXT): 66 | path = path + STATS_FILE_EXT 67 | 68 | os.makedirs(os.path.split(path)[0], exist_ok=True) 69 | self.path = path 70 | self.config = config 71 | self.isave = isave 72 | self._stats = Statistics() 73 | self._jobs = Queue() 74 | self._thread = Thread(target=self.work) 75 | 76 | def start(self): 77 | self.n_iter_eval = 0 78 | self._thread.start() 79 | 80 | def save(self): 81 | prev_file = self.path + '_iter_' + str(self.n_iter_eval) 82 | if os.path.exists(self.path): 83 | os.rename(self.path, prev_file) 84 | with shelve.open(self.path, 'c') as shelf: 85 | shelf['stats'] = self._stats.__dict__ 86 | shelf['config'] = self.config 87 | if os.path.exists(prev_file): 88 | os.remove(prev_file) 89 | 90 | def stop(self): 91 | self._jobs.put_nowait({'job_type': JobType.PRINT_STATS}) 92 | self._jobs.put(None) 93 | self._thread.join() 94 | self.save() 95 | 96 | def send(self, d): 97 | self._jobs.put_nowait(d) 98 | 99 | def work(self): 100 | while True: 101 | job = self._jobs.get() 102 | 103 | if job is None: # End of algorithm 104 | self._jobs.task_done() 105 | break 106 | 107 | job_type = job['job_type'] 108 | 109 | if job_type == JobType.EVAL: 110 | self._stats.evaluate(job['y_est'], job['y_true']) 111 | self.n_iter_eval += 1 112 | elif job_type == JobType.LABEL_STREAM: 113 | self._stats.label_stream_true = job['y_true'] 114 | self._stats.label_stream_mask_observed = job['mask_obs'] 115 | elif job_type == JobType.POINT_PREDICTION: 116 | f = job['vec'] 117 | h = self._stats.entropy(f) 118 | self._stats.entropy_point_after.append(h) 119 | y = job['y'] 120 | self._stats.pred_point_after.append(y) 121 | self._stats.conf_point_after.append(f.max()) 122 | elif job_type == JobType.ONLINE_ITER: 123 | self._stats.iter_online_count.append(job['n_in_iter']) 124 | self._stats.iter_online_duration.append(job['dt']) 125 | elif job_type == JobType.PRINT_STATS: 126 | err = self._stats.clf_error_mixed[-1] * 100 127 | logger.info('Classif. 
Error: {:5.2f}%\n\n'.format(err)) 128 | elif job_type == JobType.TRAIN_PRED: 129 | self._stats.train_est = job['y_est'] 130 | elif job_type == JobType.TEST_PRED: 131 | y_pred_knn = job['y_pred_knn'] 132 | self._stats.test_pred_knn.append(y_pred_knn) 133 | 134 | y_pred_lp = job['y_pred_lp'] 135 | self._stats.test_pred_lp.append(y_pred_lp) 136 | 137 | self._stats.test_true = y_test = job['y_true'] 138 | 139 | err_knn = np.mean(np.not_equal(y_pred_knn, y_test)) 140 | err_lp = np.mean(np.not_equal(y_pred_lp, y_test)) 141 | logger.info('knn test err: {:5.2f}%'.format(err_knn*100)) 142 | logger.info('ILP test err: {:5.2f}%'.format(err_lp*100)) 143 | self._stats.test_error_knn.append(err_knn) 144 | self._stats.test_error_ilp.append(err_lp) 145 | elif job_type == JobType.RUNTIME: 146 | self._stats.runtime = job['t'] 147 | 148 | if self.n_iter_eval % self.isave == 0: 149 | self.save() 150 | 151 | self._jobs.task_done() 152 | 153 | 154 | EXCLUDED_METRICS = {'label_stream_true', 155 | 'label_stream_mask_observed', 156 | 'n_burn_in', 157 | 'test_pred_knn', 'test_pred_lp', 'test_true', 158 | 'train_est', 'runtime', 159 | 'conf_point_after', 'test_error_knn', 'test_error_ilp'} 160 | 161 | 162 | class Statistics: 163 | """ 164 | Statistics gathered during learning (training and testing). 165 | """ 166 | def __init__(self): 167 | self.iter_online_count = [] 168 | self.iter_online_duration = [] 169 | 170 | # Evaluation after a new point arrives 171 | self.n_invalid_samples = [] 172 | self.invalid_samples_ratio = [] 173 | self.clf_error_mixed = [] 174 | self.clf_error_valid = [] 175 | self.l1_error_mixed = [] 176 | self.l1_error_valid = [] 177 | self.cross_ent_mixed = [] 178 | self.cross_ent_valid = [] 179 | self.entropy_pred_mixed = [] 180 | self.entropy_pred_valid = [] 181 | 182 | # Defined once 183 | self.label_stream_true = [] 184 | self.label_stream_mask_observed = [] 185 | self.test_pred_knn = [] 186 | self.test_pred_lp = [] 187 | self.test_true = [] 188 | self.train_est = [] 189 | self.runtime = np.nan 190 | self.test_error_ilp = [] 191 | self.test_error_knn = [] 192 | self.entropy_point_after = [] 193 | self.pred_point_after = [] 194 | self.conf_point_after = [] 195 | 196 | def evaluate(self, y_predictions, y_true): 197 | """Computes statistics for a given set of predictions and the ground truth. 
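
        (The computed metrics are appended to this Statistics instance's
        running lists.)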
198 | 199 | Args: 200 | y_predictions (array_like): [u_samples, n_classes] soft class predictions for current unlabeled samples 201 | y_true (array_like): [u_samples, n_classes] one-hot encoding of the true classes_ of the unlabeled samples 202 | 203 | eps (float): quantity slightly larger than zero to avoid division by zero 204 | 205 | Returns: 206 | float, average accuracy 207 | 208 | """ 209 | 210 | u_samples, n_classes = y_predictions.shape 211 | 212 | # Clip predictions to [0,1] 213 | eps = EPS_32 if y_predictions.itemsize == 4 else EPS_64 214 | y_pred_01 = np.clip(y_predictions, eps, 1-eps) 215 | # Normalize predictions to make them proper distributions 216 | y_pred = prep.normalize(y_pred_01, copy=False, norm='l1') 217 | 218 | # 0-1 Classification error under valid and invalid points 219 | y_pred_max = np.argmax(y_pred, axis=1) 220 | y_true_max = np.argmax(y_true, axis=1) 221 | fc_err_mixed = self.zero_one_loss(y_pred_max, y_true_max) 222 | self.clf_error_mixed.append(fc_err_mixed) 223 | 224 | # L1 error under valid and invalid points 225 | l1_err_mixed = np.mean(self.l1_error(y_pred, y_true)) 226 | self.l1_error_mixed.append(l1_err_mixed) 227 | 228 | # Cross-entropy loss 229 | crossent_mixed = np.mean(self.cross_entropy(y_true, y_pred)) 230 | self.cross_ent_mixed.append(crossent_mixed) 231 | 232 | # Identify valid points (for which a label has been estimated) 233 | ind_valid, = np.where(y_pred.sum(axis=1) != 0) 234 | n_valid = len(ind_valid) 235 | n_invalid = u_samples - n_valid 236 | 237 | self.n_invalid_samples.append(n_invalid) 238 | self.invalid_samples_ratio.append(n_invalid / u_samples) 239 | 240 | # Entropy of the predictions 241 | if n_invalid == 0: 242 | entropy_pred_mixed = np.mean(self.entropy(y_pred)) 243 | self.entropy_pred_mixed.append(entropy_pred_mixed) 244 | return 245 | 246 | y_pred_valid = y_pred[ind_valid] 247 | y_true_valid = y_true[ind_valid] 248 | 249 | # 0-1 Classification error under valid points only 250 | y_pred_valid_max = y_pred_max[ind_valid] 251 | y_true_valid_max = y_true_max[ind_valid] 252 | err_valid_max = self.zero_one_loss(y_pred_valid_max, y_true_valid_max) 253 | self.clf_error_valid.append(err_valid_max) 254 | 255 | # L1 error under valid points only 256 | l1_err_valid = np.mean(self.l1_error(y_pred_valid, y_true_valid)) 257 | self.l1_error_valid.append(l1_err_valid) 258 | 259 | # Cross-entropy loss 260 | ce_valid = np.mean(self.cross_entropy(y_true_valid, y_pred_valid)) 261 | self.cross_ent_valid.append(ce_valid) 262 | 263 | # Entropy of the predictions 264 | entropy_pred_valid = np.mean(self.entropy(y_pred_valid)) 265 | self.entropy_pred_valid.append(entropy_pred_valid) 266 | n_total = n_valid + n_invalid 267 | entropy_pred_mixed = (entropy_pred_valid*n_valid + n_invalid) / n_total 268 | self.entropy_pred_mixed.append(entropy_pred_mixed) 269 | 270 | 271 | @staticmethod 272 | def zero_one_loss(y_pred, y_true, average=True): 273 | """ 274 | 275 | Args: 276 | y_pred (array_like): (n_samples, n_classes) 277 | y_true (array_like): (n_samples, n_classes) 278 | average (bool): Whether to take the average over all predictions. 279 | 280 | Returns: The absolute difference for each row. 281 | Note that this will be in [0,2] for p.d.f.s. 
282 | 283 | """ 284 | 285 | if average: 286 | return np.mean(np.not_equal(y_pred, y_true)) 287 | else: 288 | return np.sum(np.not_equal(y_pred, y_true)) 289 | 290 | @staticmethod 291 | def l1_error(y_pred, y_true, norm=True): 292 | """ 293 | 294 | Args: 295 | y_pred (array_like): An array of probability distributions (usually predictions) with shape (n_distros, n_classes) 296 | y_true (array_like): An array of probability distributions (usually groundtruth) with shape (n_distros, n_classes) 297 | norm (bool): Whether to constrain the L1 error to be in [0,1]. 298 | 299 | Returns: The absolute difference for each row. Note that this will be in [0,2] for pdfs. 300 | 301 | """ 302 | 303 | l1_error = np.abs(y_pred - y_true).sum(axis=1) 304 | if norm: 305 | l1_error /= 2 306 | 307 | return l1_error 308 | 309 | @staticmethod 310 | def entropy(p, norm=True): 311 | """ 312 | 313 | Args: 314 | p (array_like): An array of probability distributions with shape (n_distros, n_classes) 315 | norm (bool): Whether to normalize the entropy to constrain it in [0,1] 316 | 317 | Returns: An array of entropies of the distributions with shape (n_distros,) 318 | 319 | """ 320 | 321 | entropy = - (p * np.log(p)).sum(axis=1) 322 | if norm: 323 | entropy /= np.log(p.shape[1]) 324 | 325 | return entropy 326 | 327 | @staticmethod 328 | def cross_entropy(p, q, norm=True): 329 | """ 330 | 331 | Args: 332 | p (array_like): An array of probability distributions (usually groundtruth) with shape (n_distros, n_classes) 333 | q (array_like): An array of probability distributions (usually predictions) with shape (n_distros, n_classes) 334 | norm (bool): Whether to normalize the entropy to constrain it in [0,1] 335 | 336 | Returns: An array of cross entropies between the groundtruth and the prediction with shape (n_distros,) 337 | 338 | """ 339 | 340 | cross_ent = -(p * np.log(q)).sum(axis=1) 341 | if norm: 342 | cross_ent /= np.log(p.shape[1]) 343 | 344 | return cross_ent 345 | 346 | 347 | def aggregate_statistics(stats_path, metrics=None, excluded_metrics=None): 348 | 349 | print('Aggregating statistics from {}'.format(stats_path)) 350 | if stats_path.endswith(STATS_FILE_EXT): 351 | list_of_files = [stats_path] 352 | else: 353 | list_of_files = [os.path.join(stats_path, f) for f in os.listdir( 354 | stats_path) if f.endswith(STATS_FILE_EXT)] 355 | 356 | stats_runs = [] 357 | random_states = [] 358 | for stats_file in list_of_files: 359 | with shelve.open(stats_file, 'r') as f: 360 | stats_runs.append(f['stats']) 361 | random_states.append(f['config']['options']['random_state']) 362 | 363 | print('\nRandom seeds used: {}\n'.format(random_states)) 364 | 365 | if metrics is None: 366 | metrics = Statistics().__dict__.keys() 367 | 368 | if excluded_metrics is None: 369 | excluded_metrics = EXCLUDED_METRICS 370 | 371 | stats_mean, stats_std = {}, {} 372 | stats_run0 = stats_runs[0] 373 | 374 | for metric in metrics: 375 | if metric in excluded_metrics: continue 376 | if metric not in stats_run0: 377 | print('\nMetric {} not found!'.format(metric)) 378 | continue 379 | metric_lists = [stats[metric] for stats in stats_runs] 380 | 381 | # Make a numpy 2D array to merge the different runs 382 | metric_runs = np.asarray(metric_lists) 383 | s = metric_runs.shape 384 | if len(s) < 2: 385 | print('No values for metric, skipping.') 386 | continue 387 | stats_mean[metric] = np.mean(metric_runs, axis=0) 388 | stats_std[metric] = np.std(metric_runs, axis=0) 389 | 390 | with shelve.open(list_of_files[0], 'r') as f: 391 | config = f['config'] 
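    # At this point stats_mean[metric] is the element-wise mean over runs
    # (np.mean(metric_runs, axis=0) above), so indexing with [0] and [-1]
    # below gives the initial and final values averaged across runs.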
392 | 393 | lp_times = stats_mean['iter_online_duration'] 394 | ice = stats_mean['clf_error_mixed'][0] * 100 395 | fce = stats_mean['clf_error_mixed'][-1] * 100 396 | print('Avg. LP time/iter: {:.4f}s'.format(np.mean(lp_times))) 397 | print('Initial classification error: {:.2f}%'.format(ice)) 398 | print('Final classification error: {:.2f}%'.format(fce)) 399 | 400 | # Add excluded metrics in the end 401 | for stats_run in stats_runs: 402 | for ex_metric in excluded_metrics: 403 | if ex_metric in stats_run: 404 | print('Appending excluded metric: {}'.format(ex_metric)) 405 | stats_mean[ex_metric] = stats_run[ex_metric] 406 | 407 | if len(list_of_files) == 1: 408 | stats_std = None 409 | 410 | return stats_mean, stats_std, config 411 | -------------------------------------------------------------------------------- /ilp/plots/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/johny-c/incremental-label-propagation/29c413dba023694b99e2c2708c0aa98d891d234d/ilp/plots/__init__.py -------------------------------------------------------------------------------- /ilp/plots/plot_stats.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib 3 | from matplotlib import pyplot as plt 4 | from itertools import count, product 5 | from math import ceil 6 | from sklearn.metrics import confusion_matrix 7 | from tabulate import tabulate 8 | 9 | from ilp.helpers.params_parse import print_config 10 | from ilp.helpers.data_fetcher import fetch_load_data 11 | from ilp.helpers.stats import STATS_FILE_EXT 12 | 13 | 14 | COLORS = ['red', 'darkorange', 'black', 'green', 'cyan', 'blue'] 15 | 16 | N_AXES_PER_ROW = 3 17 | 18 | PLOT_LABELS = {'iter_online_count': r'\#LP iterations', 19 | 'iter_online_duration': r'LP time (s)', 20 | 'n_invalid_samples' : r'\#Invalid samples', 21 | 'invalid_samples_ratio': r'Invalid samples ratio', 22 | 'clf_error_mixed': r'classification error', 23 | 'l1_error_mixed': r'$\ell_1$ error', 24 | 'cross_ent_mixed': r'cross entropy', 25 | 'entropy_pred_mixed': r'prediction entropy', 26 | 'theta': r'$\vartheta$', 27 | 'k_L': r'$k_l$', 28 | 'k_U': r'$k_u$', 29 | 'n_L': r'\#labels', 30 | 'srl': r'labeled', 31 | 'entropy_point_after': r'$H(f)$', 32 | 'conf_point_after': r'$\max_c f_i$', 33 | 'test_error_ilp': 'ILP test error', 34 | 'test_error_knn': 'test error (knn)' 35 | } 36 | 37 | PLOT_TITLE = {'entropy_point_after': 'Entropy', 38 | 'conf_point_after': 'Confidence'} 39 | 40 | METRIC_ORDER = [ 41 | 'l1_error_mixed', 42 | 'cross_ent_mixed', 43 | 'clf_error_mixed' , 44 | 'entropy_point_after', 45 | 'entropy_pred_mixed', 46 | 'iter_online_count', 47 | 'iter_online_duration', 48 | 'test_error_ilp', 49 | 'test_error_knn' 50 | ] 51 | 52 | METRIC_TITLE = {'l1_error_mixed': 53 | r'$\frac{1}{2u}\sum\limits_{i} ' 54 | r'|{F_U}_{(i)}^{True} - {F_U}_{(i)}|_1$', 55 | 'cross_ent_mixed': 56 | r'$\frac{1}{u}\sum\limits_{i} H({F_U}_{(i)}^{True}, ' 57 | r'{F_U}_{(i)})$', 58 | 'entropy_pred_mixed': 59 | r'$\frac{1}{u}\sum\limits_{i} H({F_U}_{(i)})$', 60 | 'clf_error_mixed': 61 | r'$\frac{1}{u}\sum\limits_{i} I(\arg \max {F_U}_{(i)} ' 62 | r'\neq \arg \max {F_U}_{(i)}^{True})$'} 63 | 64 | SCATTER_METRIC = ['iter_online_duration', 'iter_online_count'] 65 | 66 | LEGEND_METRIC = 'clf_error_mixed' 67 | 68 | DECORATORS = {'iter_offline', 'burn_in_labels_true', 'label_stream_true'} 69 | 70 | X_LABEL_DEFAULT = r'\#observed samples' 71 | DEFAULT_COLOR = 'b' 72 | 
DEFAULT_MEAN_COLOR = 'r' 73 | DEFAULT_STD_COLOR = 'darkorange' 74 | COLOR_MAP = plt.cm.inferno 75 | N_CURVES = 6 # THETA \in [0., 0.4, 0.8, 1.2, 1.6, 2.0] 76 | COLOR_IDX = np.linspace(0, 1, N_CURVES + 2)[1:-1] 77 | 78 | KITTI_CLASSES = ['car', 'van', 'truck', 'pedestrian', 'sitter', 'cyclist', 79 | 'tram', 'misc'] 80 | 81 | 82 | def remove_frame(top=True, bottom=True, left=True, right=True): 83 | 84 | ax = plt.gca() 85 | if top: 86 | ax.spines['top'].set_visible(False) 87 | 88 | if bottom: 89 | ax.spines['bottom'].set_visible(False) 90 | 91 | if left: 92 | ax.spines['left'].set_visible(False) 93 | 94 | if right: 95 | ax.spines['right'].set_visible(False) 96 | 97 | 98 | def print_latex_table(stats_list): 99 | 100 | headers = ['\#Labels', 'Runtime (s)', 'Est. error (%)', 101 | 'knn error (%)', 'ILP error (%)'] 102 | table = [] 103 | 104 | for _, var_value, stats_value, _ in stats_list: 105 | runtime = stats_value['runtime'] 106 | est_err = stats_value['clf_error_mixed'][-1] 107 | 108 | y_pred_knn = stats_value['test_pred_knn'] 109 | y_pred = stats_value['test_pred_lp'] 110 | y_true = stats_value['test_true'] 111 | test_err_knn = np.mean(np.not_equal(y_true, y_pred_knn)) 112 | test_err_ilp = np.mean(np.not_equal(y_true, y_pred)) 113 | 114 | runtime = '{:6.2f}'.format(runtime) 115 | est_err = '{:5.2f}'.format(est_err*100) 116 | 117 | test_err_knn = '{:5.2f}'.format(test_err_knn*100) 118 | test_err_ilp = '{:5.2f}'.format(test_err_ilp*100) 119 | row = [var_value, runtime, est_err, test_err_knn, test_err_ilp] 120 | table.append(row) 121 | 122 | print(tabulate(table, headers, tablefmt="latex")) 123 | 124 | 125 | def plot_histogram(ax, values, title, xlabel, ylabel, value_range): 126 | 127 | weights = np.ones_like(values) / float(len(values)) 128 | bin_y, bin_x, _ = ax.hist(values, range=value_range, normed=False, bins=20, 129 | weights=weights, alpha=0.5, align='mid') 130 | print('Bin values min/max: {}, {}'.format(bin_y.min(), bin_y.max())) 131 | 132 | ax.set_xticks(np.arange(0, 1.1, 0.2)) 133 | ax.set_yticks(np.arange(0, 1.1, 0.2)) 134 | ax.set_title(title, fontweight='bold') 135 | ax.set_xlabel(xlabel, fontweight='bold') 136 | ax.set_ylabel(ylabel) 137 | 138 | ax.spines['top'].set_visible(False) 139 | ax.spines['right'].set_visible(False) 140 | 141 | 142 | def plot_metric_histogram(stats_value, ax1, ax2, metric_key, pos=1): 143 | 144 | if metric_key not in stats_value: 145 | print('FOUND NO KEY {}.'.format(metric_key)) 146 | return 147 | if 'pred_point_after' not in stats_value: 148 | print('FOUND NO KEY pred_point_after.') 149 | return 150 | 151 | y_u_metric = np.asarray(stats_value[metric_key]) 152 | print('START PLOT HISTOGRAM FOR {}'.format(metric_key)) 153 | y_u_pred = stats_value['pred_point_after'].ravel() 154 | 155 | n = len(y_u_metric) 156 | y_true = stats_value['label_stream_true'] 157 | mask_labeled = stats_value['label_stream_mask_observed'] 158 | y_u_true = y_true[~mask_labeled] 159 | 160 | mask_correct = np.equal(y_u_pred, y_u_true[-n:]) 161 | metric_hits = y_u_metric[mask_correct] 162 | metric_miss = y_u_metric[~mask_correct] 163 | 164 | value_range = (0., 1.) 
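    # pos marks the row position in the stacked histogram figure built by
    # plot_metric_histograms() below: 1 for the top row (panel titles kept,
    # x-labels suppressed), -1 for the bottom row (x-labels kept), anything
    # else for middle rows.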
165 | ylabel = 'Samples ratio' 166 | xlabel = PLOT_LABELS[metric_key] 167 | 168 | if pos == 1: 169 | title = PLOT_TITLE[metric_key] + ' - correct predictions' 170 | xlabel = '' 171 | elif pos == -1: 172 | title = '' 173 | else: 174 | xlabel = '' 175 | title = '' 176 | plot_histogram(ax1, metric_hits, title, xlabel, ylabel, value_range) 177 | 178 | if pos == 1: 179 | title = PLOT_TITLE[metric_key] + ' - false predictions' 180 | xlabel = '' 181 | elif pos == -1: 182 | title = '' 183 | else: 184 | xlabel = '' 185 | title = '' 186 | plot_histogram(ax2, metric_miss, title, xlabel, ylabel, value_range) 187 | 188 | 189 | def plot_metric_histograms(stats, metric_key): 190 | 191 | if type(stats) is list: 192 | fig = plt.figure(6, (8, 2.5*len(stats)), dpi=200) 193 | sp_count = count(1) 194 | for i, (_, var_value, stats_value, _) in enumerate(stats): 195 | ax1 = fig.add_subplot(len(stats), 2, next(sp_count)) 196 | ax2 = fig.add_subplot(len(stats), 2, next(sp_count)) 197 | 198 | if i == 0: 199 | pos = 1 200 | elif i == len(stats) - 1: 201 | pos = -1 202 | else: 203 | pos = 0 204 | 205 | plot_metric_histogram(stats_value, ax1, ax2, metric_key, pos) 206 | 207 | title = r'$\vartheta = $' + str(var_value) 208 | ax1.set_label(title) 209 | plt.legend() 210 | else: 211 | fig = plt.figure(6, (8, 2.5), dpi=100) 212 | ax1 = fig.add_subplot(1, 2, 1) 213 | ax2 = fig.add_subplot(1, 2, 2) 214 | plot_metric_histogram(stats, ax1, ax2, metric_key) 215 | 216 | fig.subplots_adjust(top=0.9) 217 | fig.tight_layout() 218 | 219 | 220 | def plot_class_distro_stream(ax, y_true, mask_observed, n_burn_in, classes): 221 | """Plot incoming label distributions""" 222 | print('Plotting {}'.format('labels stream distro')) 223 | 224 | xx = np.arange(len(y_true)) 225 | x_lu = [xx[mask_observed], xx[~mask_observed]] 226 | y_lu = [y_true[mask_observed], y_true[~mask_observed]] 227 | c_lu = ['red', 'gray'] 228 | sizes = [2, 1] 229 | markers = ['d', '.'] 230 | 231 | n_labeled = sum(mask_observed) 232 | n_unlabeled = len(y_true) - n_labeled 233 | labels = [r'labeled ({})'.format(n_labeled), 234 | r'unlabeled ({})'.format(n_unlabeled)] 235 | 236 | for x, y, c, s, m, label in zip(x_lu, y_lu, c_lu, sizes, markers, labels): 237 | ax.scatter(x, y, c=c, marker=m, s=s, label=label) 238 | 239 | burn_in_label = 'burn-in ({})'.format(n_burn_in) 240 | ax.vlines(n_burn_in, *ax.get_ylim(), colors='blue', linestyle=':', 241 | label=burn_in_label) 242 | 243 | classes_type = type(classes[0]) 244 | if classes_type is str: 245 | true_labels = np.unique(y_true) 246 | plt.yticks(range(len(classes)), classes, rotation=45, fontsize=7) 247 | else: 248 | ax.set_yticks(classes) 249 | 250 | ax.set_xlabel(X_LABEL_DEFAULT) 251 | ax.set_ylabel(r'class labels') 252 | 253 | ax.spines['top'].set_visible(False) 254 | ax.spines['right'].set_visible(False) 255 | 256 | plt.legend(loc='upper right') 257 | 258 | 259 | def plot_corrected_samples(y_true, y_pred1, y_pred2, config, n_samples=50): 260 | 261 | dataset = config['dataset']['name'] 262 | _, _, X_test, y_test = fetch_load_data(dataset) 263 | 264 | mask_miss1 = np.not_equal(y_true, y_pred1) 265 | print('KNN missed {} test cases'.format(sum(mask_miss1))) 266 | mask_hits2 = np.equal(y_true, y_pred2) 267 | print('ILP missed {} test cases'.format(sum(~mask_hits2))) 268 | mask_miss1_hits2 = np.logical_and(mask_miss1, mask_hits2) 269 | print('ILP missed {} less than KNN'.format(sum(mask_miss1_hits2))) 270 | 271 | samples = X_test[mask_miss1_hits2] 272 | y_pred_miss = y_pred1[mask_miss1_hits2] 273 | y_pred_hit = 
y_pred2[mask_miss1_hits2] 274 | print('THERE ARE {} CASES OF MISS/HIT'.format(len(samples))) 275 | 276 | if len(samples) > n_samples: 277 | ind = np.random.choice(len(samples), n_samples, replace=False) 278 | samples = samples[ind] 279 | y_pred_miss = y_pred_miss[ind] 280 | y_pred_hit = y_pred_hit[ind] 281 | 282 | fig = plt.figure(4, figsize=(4, 4), dpi=200) 283 | dim = int(np.sqrt(len(X_test[0]))) 284 | 285 | n_subplots = len(samples) 286 | n_cols = 10 287 | n_empty_rows = 0 # 2 288 | n_rows = ceil(n_subplots / n_cols) + n_empty_rows 289 | subplot_count = count(1) 290 | 291 | for x, ym, yh in zip(samples, y_pred_miss, y_pred_hit): 292 | i = next(subplot_count) 293 | ax = fig.add_subplot(n_rows, n_cols, i) 294 | ax.set_xticks([]) 295 | ax.set_yticks([]) 296 | 297 | image = x.reshape(dim, dim) 298 | ax.imshow(image, cmap=plt.cm.gray) 299 | ax.set_title(r'{}$\rightarrow${}'.format(ym, yh), fontsize=8, 300 | fontweight='bold', horizontalalignment="center") 301 | fig.add_subplot(ax) 302 | 303 | fig.suptitle('Corrected samples') 304 | 305 | 306 | def plot_confusion_matrix(y_true, y_pred, title, cmap=plt.cm.Greys): 307 | 308 | cm = confusion_matrix(y_true, y_pred) 309 | # Normalize 310 | cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 311 | true_labels = [str(int(y)) for y in np.unique(y_true)] 312 | pred_labels = [str(int(y)) for y in np.unique(y_pred)] 313 | plt.imshow(cm_norm, interpolation='nearest', cmap=cmap) 314 | plt.title(title, fontsize=14, fontweight='bold') 315 | xtick_marks = np.arange(len(true_labels)) 316 | ytick_marks = np.arange(len(pred_labels)) 317 | plt.xticks(xtick_marks, true_labels, fontsize=6, fontweight='bold') 318 | plt.yticks(ytick_marks, pred_labels, fontsize=6, fontweight='bold') 319 | 320 | [i.set_color("b") for i in plt.gca().get_xticklabels()] 321 | [i.set_color("b") for i in plt.gca().get_yticklabels()] 322 | 323 | thresh = cm_norm.max() / 2. 
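    # Annotate every cell with its raw count; cells whose normalized value
    # exceeds half of the maximum are drawn with white text so the numbers
    # remain readable on the dark end of the colormap.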
324 | for i, j in product(range(cm.shape[0]), range(cm.shape[1])): 325 | s = r'{:4}'.format(str(cm[i, j])) 326 | s = r'\textbf{' + s + r'}' 327 | plt.text(j, i, s, fontsize=8, 328 | horizontalalignment="center", verticalalignment='center', 329 | color="white" if cm_norm[i, j] > thresh else "black") 330 | 331 | plt.tick_params(top='off', bottom='off', left='off', right='off', 332 | labelleft='on', labelbottom='on') 333 | 334 | remove_frame() 335 | 336 | plt.tight_layout() 337 | ax = plt.gca() 338 | ax.set_ylabel('True label', fontsize=10) 339 | ax.set_xlabel('Predicted label', fontsize=10) 340 | 341 | 342 | def plot_confusion_matrices(stats, config, improvements=True): 343 | matplotlib.rcParams['text.latex.unicode'] = True 344 | print('Plotting confusion matrices...') 345 | 346 | y_pred_knn = stats['test_pred_knn'] 347 | y_pred_ilp = stats['test_pred_lp'] 348 | y_true = stats['test_true'] 349 | 350 | err_knn = np.mean(np.not_equal(y_pred_knn, y_true)) 351 | err_ilp = np.mean(np.not_equal(y_pred_ilp, y_true)) 352 | print('knn error: {}'.format(err_knn)) 353 | print('ilp error: {}'.format(err_ilp)) 354 | 355 | fig = plt.figure(3, figsize=(8, 4), dpi=200) 356 | n_subplots = 2 357 | 358 | fig.add_subplot(1, n_subplots, 1) 359 | plot_confusion_matrix(y_true, y_pred_knn, '$knn$') 360 | 361 | fig.add_subplot(1, n_subplots, 2) 362 | plot_confusion_matrix(y_true, y_pred_ilp, 'ILP') 363 | 364 | if improvements: 365 | plot_corrected_samples(y_true, y_pred_knn, y_pred_ilp, 366 | config, n_samples=40) 367 | 368 | dt = config['dataset']['name'].upper() 369 | fig.suptitle(r'\textbf{Confusion}' + ' (' + dt + ')') 370 | 371 | fig.tight_layout() 372 | fig.subplots_adjust(top=0.85) 373 | 374 | 375 | def plot_single_run_single_var(stats, fig, config=None): 376 | 377 | metrics_to_plot = [] 378 | for metric_key in METRIC_ORDER: 379 | if metric_key in stats and len(stats[metric_key]): 380 | metrics_to_plot.append(metric_key) 381 | 382 | n_subplots = len(metrics_to_plot) 383 | if 'label_stream_true' in stats: 384 | n_subplots += 1 385 | if 'iter_offline_duration' in stats: 386 | n_subplots += 1 387 | 388 | n_cols = 4 389 | n_rows = ceil(n_subplots / n_cols) 390 | plot_count = count(1) 391 | 392 | n = len(stats['iter_online_count']) 393 | xx = range(n) 394 | for metric_key in metrics_to_plot: 395 | print('Plotting metric {}'.format(metric_key)) 396 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 397 | if metric_key in SCATTER_METRIC: 398 | ax.scatter(xx, stats[metric_key], s=2) 399 | else: 400 | ax.plot(stats[metric_key], color=DEFAULT_COLOR, label=None, lw=2) 401 | 402 | ax.set_xlabel(X_LABEL_DEFAULT) 403 | ax.set_ylabel(PLOT_LABELS[metric_key]) 404 | 405 | if 'label_stream_true' in stats and config is not None: 406 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 407 | y_true = np.asarray(stats['label_stream_true']) 408 | y_mask = np.asarray(stats['label_stream_mask_observed']) 409 | n_burn_in = config['data'].get('n_burn_in', 0) 410 | classes = np.unique(y_true) 411 | dataset_config = config.get('dataset', None) 412 | if dataset_config is not None: 413 | classes = dataset_config.get('classes', classes) 414 | if dataset_config['name'].startswith('kitti'): 415 | classes = KITTI_CLASSES 416 | plot_class_distro_stream(ax, y_true, y_mask, n_burn_in, classes) 417 | 418 | iod = stats.get('iter_offline_duration', None) 419 | if iod is not None: 420 | if len(iod): 421 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 422 | ax.scatter(stats['iter_offline'], iod) 423 | ax.set_xlabel(X_LABEL_DEFAULT) 
424 | ax.set_ylabel(PLOT_LABELS['iter_offline_duration']) 425 | 426 | fig.subplots_adjust(top=0.9) 427 | 428 | 429 | def plot_single_run_multi_var(stats_list, fig, config): 430 | # Single run for each value a variable (e.g. \vartheta) takes. 431 | 432 | kitti = config['dataset']['name'].startswith('kitti') 433 | LW = 1.5 434 | 435 | var_name, var_value0, stats0, _ = stats_list[0] 436 | var_label = PLOT_LABELS[var_name] 437 | metrics_to_plot = [] 438 | 439 | metric_order = METRIC_ORDER 440 | if var_name.startswith('k'): 441 | metric_order = METRIC_ORDER[:3] 442 | for metric_key in metric_order: 443 | print('Metric {}'.format(metric_key)) 444 | print(stats0[metric_key]) 445 | if metric_key in stats0: 446 | if hasattr(stats0[metric_key], '__len__'): 447 | if len(stats0[metric_key]): 448 | print('Metric to plot: {}'.format(metric_key)) 449 | metrics_to_plot.append(metric_key) 450 | 451 | stats_list.sort(key=lambda x: float(x[1])) 452 | 453 | for i, (_, var_value, stats_value, _) in enumerate(stats_list): 454 | rt = stats_value['runtime'] 455 | print('theta = {}, runtime = {}'.format(var_value, rt)) 456 | 457 | if len(stats_list) > 7: 458 | # for n_neighbors pick 1,3,5,7,11,15,19 -> idx = [0,1,2,3,5,7,9] 459 | stats_list = [stats_list[i] for i in [0,1,2,3,5,7,9]] 460 | 461 | n_subplots = len(metrics_to_plot) 462 | n_cols = N_AXES_PER_ROW if n_subplots >= N_AXES_PER_ROW else n_subplots 463 | n_rows = ceil(n_subplots / n_cols) 464 | plot_count = count(1) 465 | 466 | n_values = len(stats_list) 467 | if n_values != N_CURVES: 468 | color_idx = np.linspace(0, 1, n_values+2) 469 | color_idx = color_idx[1:-1] 470 | else: 471 | color_idx = COLOR_IDX 472 | 473 | n = len(stats0['iter_online_count']) 474 | xx = range(n) 475 | for metric_key in metrics_to_plot: 476 | print('Plotting metric {}'.format(metric_key)) 477 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 478 | for i, (_, var_value, stats_value, _) in enumerate(stats_list): 479 | c = COLOR_MAP(color_idx[i]) 480 | 481 | val = int(float(var_value)*100) 482 | label = r'{}={}\%'.format(var_label, int(val)) 483 | if metric_key in SCATTER_METRIC: 484 | ax.scatter(xx, stats_value[metric_key], s=1, color=c, 485 | label=label) 486 | else: 487 | if kitti and metric_key.startswith('test_error'): 488 | test_errors = stats_value[metric_key] 489 | print('Final Test error: {:5.2f} for {}% of labels'.format( 490 | test_errors[-1]*100, var_value)) 491 | test_times = np.arange(1, len(test_errors) + 1)* 1000 492 | ax.plot(test_times, test_errors, color=c, label=label, 493 | lw=LW, marker='.') 494 | ax.set_ylim((0.0, 0.5)) 495 | # ax.set_xticks(test_times) 496 | else: 497 | ax.plot(stats_value[metric_key], color=c, label=label, 498 | lw=LW) 499 | 500 | ax.set_xlabel(X_LABEL_DEFAULT) 501 | ax.set_ylabel(PLOT_LABELS[metric_key]) 502 | if metric_key == LEGEND_METRIC: 503 | plt.legend(loc='best') 504 | 505 | plt.legend(loc='best') 506 | 507 | 508 | def plot_multi_run_single_var(stats_mean, stats_std, fig): 509 | """Multiple runs (random seeds) for a single variable (e.g. 
\vartheta)""" 510 | 511 | metrics_to_plot = [] 512 | for metric_key in METRIC_ORDER: 513 | if metric_key in stats_mean and len(stats_mean[metric_key]): 514 | metrics_to_plot.append(metric_key) 515 | 516 | n_subplots = len(metrics_to_plot) 517 | n_cols = 4 518 | n_rows = ceil(n_subplots / n_cols) 519 | plot_count = count(1) 520 | 521 | for metric_key in metrics_to_plot: 522 | print('Plotting metric {}'.format(metric_key)) 523 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 524 | metric_mean = stats_mean[metric_key] 525 | color = DEFAULT_MEAN_COLOR 526 | ax.plot(metric_mean, color=color, label=None, lw=2) 527 | metric_std = 1 * stats_std[metric_key] 528 | lb, ub = metric_mean - metric_std, metric_mean + metric_std 529 | color = DEFAULT_STD_COLOR 530 | ax.plot(lb, color=color) 531 | ax.plot(ub, color=color) 532 | ax.fill_between(range(len(lb)), lb, ub, facecolor=color, alpha=0.5) 533 | ax.set_xlabel(X_LABEL_DEFAULT) 534 | ax.set_ylabel(PLOT_LABELS[metric_key]) 535 | 536 | 537 | def plot_multi_run_multi_var(stats_list, fig): 538 | # Multiple runs (random seeds) for each value a variable takes 539 | 540 | headers = [r'$\vartheta$', 'Runtime (s)', 'Est. error (%)', 541 | 'Test error (%)'] 542 | table = [] 543 | for i in range(len(stats_list)): 544 | _, var_value, stats_value_mean, stats_value_std = stats_list[i] 545 | print('\n\nVar value = {}'.format(var_value)) 546 | runtime = stats_value_mean['runtime'] 547 | print('Runtime: {}'.format(runtime)) 548 | 549 | est_err_mean = stats_value_mean['clf_error_mixed'][-1] 550 | est_err_std = stats_value_std['clf_error_mixed'][-1] 551 | print('est_err: {} ({})'.format(est_err_mean, est_err_std)) 552 | s1 = '{:5.2f} ({:4.2f})'.format(est_err_mean*100, est_err_std*100) 553 | 554 | test_err_mean = stats_value_mean.get('test_error_ilp', None) 555 | test_err_std = stats_value_std.get('test_error_ilp', None) 556 | print('test_err: {}'.format(test_err_mean, test_err_std)) 557 | if test_err_mean is None: 558 | s2 = '-' 559 | else: 560 | s2 = '{:5.2f} ({:4.2f})'.format(test_err_mean*100, test_err_std*100) 561 | 562 | row = [var_value, runtime, s1, s2] 563 | table.append(row) 564 | 565 | print(tabulate(table, headers, tablefmt="latex")) 566 | 567 | metrics_to_plot = [] 568 | _, _, stats_value_mean0, _ = stats_list[0] 569 | for metric_key in METRIC_ORDER: 570 | metric = stats_value_mean0.get(metric_key, None) 571 | if metric is not None and len(metric): 572 | metrics_to_plot.append(metric_key) 573 | 574 | n_subplots = len(metrics_to_plot) 575 | n_cols = 4 576 | n_rows = ceil(n_subplots / n_cols) 577 | plot_count = count(1) 578 | 579 | for metric_key in metrics_to_plot: 580 | print('Plotting metric {}'.format(metric_key)) 581 | ax = fig.add_subplot(n_rows, n_cols, next(plot_count)) 582 | 583 | for i in range(len(stats_list)): 584 | _, var_value, stats_value_mean, stats_value_std = stats_list[i] 585 | 586 | metric_mean = stats_value_mean[metric_key] 587 | color = DEFAULT_MEAN_COLOR 588 | ax.plot(metric_mean, color=color, label=None, lw=2) 589 | metric_std = 1 * stats_value_std[metric_key] 590 | lb, ub = metric_mean - metric_std, metric_mean + metric_std 591 | color = DEFAULT_STD_COLOR 592 | ax.plot(lb, color=color) 593 | ax.plot(ub, color=color) 594 | ax.fill_between(range(len(lb)), lb, ub, facecolor=color, alpha=0.3) 595 | ax.set_xlabel(X_LABEL_DEFAULT) 596 | ax.set_ylabel(PLOT_LABELS[metric_key]) 597 | 598 | 599 | def plot_standard(single_run, single_var, stats, config, title, path): 600 | 601 | figsize = (11, 5) 602 | fig = plt.figure(1, figsize=figsize, 
dpi=200) 603 | 604 | print() 605 | if single_run and single_var: # default_run 606 | print('Plotting a single run with a single variable value') 607 | plot_single_run_single_var(stats[0], fig, config) 608 | elif single_run and not single_var: # var_theta 609 | print('Plotting a single run for each variable value') 610 | plot_single_run_multi_var(stats, fig, config) 611 | elif single_var and not single_run: # mean and std of default_run 612 | print('Plotting multiple runs for a single variable value') 613 | plot_multi_run_single_var(stats[0], stats[1], fig) 614 | elif not single_run and not single_var: 615 | print('Plotting multiple runs for multiple variable values') 616 | plot_multi_run_multi_var(stats, fig) 617 | print() 618 | 619 | # plt.legend(loc='upper right') 620 | dataset = config['dataset']['name'] 621 | dataset = 'kitti' if dataset.startswith('kitti') else dataset 622 | dataset = dataset.replace('_', ' ') 623 | if title == r'Default run' or dataset == 'kitti': 624 | title = dataset.upper() 625 | else: 626 | title = r'\textbf{' + title + r'}' + ' (' + dataset.upper() + ')' 627 | 628 | fig.suptitle(title, fontsize='xx-large', fontweight='bold') 629 | 630 | fig.tight_layout() 631 | fig.subplots_adjust(top=0.9) 632 | 633 | if path is not None: 634 | if path.endswith(STATS_FILE_EXT): 635 | path = path[:-len(STATS_FILE_EXT)] 636 | 637 | try: 638 | fig.savefig(path + '.pdf', orientation='landscape', transparent=False) 639 | except RuntimeError as e: 640 | print('Cannot save figure due to: \n{}'.format(e)) 641 | 642 | 643 | def plot_curves(stats, config, title='', path=None): 644 | plt.rc('text', usetex=True) # need dvipng in Ubuntu 645 | plt.rc('font', family='serif') 646 | # matplotlib.rcParams['text.latex.unicode'] = True 647 | 648 | if type(stats) is tuple: 649 | single_var = True 650 | stats_mean, stats_std = stats 651 | single_run = stats_std is None 652 | stats_print = stats_mean 653 | elif type(stats) is list: 654 | single_var = False 655 | var_name, var_value, stats_mean0, stats_std0 = stats[0] 656 | print('Plotting multi var: {}'.format(var_name)) 657 | single_run = stats_std0 is None 658 | stats_print = stats_mean0 659 | else: 660 | print('stats has type: {}'.format(type(stats).__name__)) 661 | raise TypeError('stats is neither list nor tuple!') 662 | 663 | print('\nStatistics{}Size{}Type\n{}'.format(' '*25, ' '*8, '-'*52)) 664 | for k in sorted(stats_print.keys()): 665 | v = stats_print[k] 666 | if v is not None: 667 | if hasattr(v, '__len__'): 668 | print('{:>32} {:>8} {:>9}'.format(k, len(v), type(v).__name__)) 669 | else: 670 | print('{:>32} {:>8.2f} {:>9}'.format(k, v, type(v).__name__)) 671 | 672 | if config is not None: 673 | print_config(config) 674 | 675 | plot_standard(single_run, single_var, stats, config, title, path) 676 | 677 | if single_var: 678 | if config['dataset']['name'] == 'mnist': 679 | test_pred_knn = stats_mean.get('test_pred_knn', None) 680 | if test_pred_knn is not None and len(test_pred_knn) > 0: 681 | plot_confusion_matrices(stats_mean, config) 682 | --------------------------------------------------------------------------------
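A minimal usage sketch for the plotting entry point above. This is not part of the repository: the stats keys and the config dictionary below are assumptions inferred from `plot_single_run_single_var` and `plot_standard`, a working LaTeX installation (with dvipng) is required because `plot_curves` enables `text.usetex`, and `print_config` from `ilp.helpers.params_parse` is assumed to simply pretty-print the config dictionary.

```python
import numpy as np
from matplotlib import pyplot as plt

from ilp.plots.plot_stats import plot_curves

# Synthetic single-run statistics: only keys that the single-run /
# single-variable code path actually reads (all values here are
# illustrative assumptions, not real experiment output).
n = 500
rng = np.random.RandomState(0)
stats_mean = {
    'iter_online_count': rng.poisson(5, size=n),                 # LP iterations per point
    'iter_online_duration': np.abs(rng.normal(0.01, 0.002, n)),  # LP time per point (s)
    'clf_error_mixed': np.clip(
        np.linspace(0.5, 0.1, n) + rng.normal(0, 0.01, n), 0., 1.),
    'runtime': 42.0,
}

# Minimal config: plot_standard only needs dataset['name'] here; a real run
# would pass the full experiment config loaded from experiments/cfg/*.yml.
config = {
    'dataset': {'name': 'mnist'},
    'data': {'n_burn_in': 100},
}

# A (stats_mean, None) tuple means: one variable value, one run (no std band).
plot_curves((stats_mean, None), config, title=r'Default run', path=None)
plt.show()
```

For experiments that vary a hyperparameter, `plot_curves` instead expects a list of `(var_name, var_value, stats_mean, stats_std)` tuples, which is dispatched to `plot_single_run_multi_var` or `plot_multi_run_multi_var` depending on whether `stats_std` is `None`.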