├── .gitignore
├── LICENSE
├── README.md
├── big-data.ipynb
├── big-data
│   ├── baseline.csv
│   └── benchmark.csv
├── data
│   ├── __init__.py
│   └── download.sh
├── img
│   ├── FIt-SNE.pdf
│   ├── FIt-SNE.png
│   ├── LargeVis.pdf
│   ├── LargeVis.png
│   ├── Multicore t-SNE.pdf
│   ├── Multicore t-SNE.png
│   ├── NCVis.pdf
│   ├── NCVis.png
│   ├── Umap.pdf
│   ├── Umap.png
│   ├── efficiency.pdf
│   ├── efficiency.png
│   ├── gen_png.sh
│   ├── isolated.jpg
│   ├── news_2kk.jpg
│   ├── pendigits.pdf
│   ├── pendigits.png
│   ├── t-SNE.pdf
│   ├── t-SNE.png
│   ├── teaser.jpg
│   ├── time.pdf
│   ├── time.png
│   ├── time_all.pdf
│   ├── time_all.png
│   └── words.jpg
├── requirements-conda.txt
├── requirements-pip.txt
├── sample.ipynb
└── utils
    └── __init__.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
data/*/
__pycache__/
.ipynb_checkpoints/
big-data/*
!big-data/baseline.csv
!big-data/benchmark.csv
annoy_index_file
*.dat
*.in
*.out
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Aleksandr Artemenkov

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ncvis-examples
Examples for the [NCVis](https://github.com/alartum/ncvis) Python wrapper.

|Notebook| Contents |
|-------|:-----------|
|[sample.ipynb](https://nbviewer.jupyter.org/github/alartum/ncvis-examples/blob/master/sample.ipynb) | Introduction to NCVis |
|[big-data.ipynb](https://nbviewer.jupyter.org/github/alartum/ncvis-examples/blob/master/big-data.ipynb)| Large-scale application case |

# Setup

## Conda [recommended]

You do not need to set up the environment when using *conda*; all dependencies are installed automatically.
```bash
$ conda install --file requirements-conda.txt
```

## Pip [not recommended]

**Important**: be sure to have a compiler with *OpenMP* support. *GCC* has it by default, which is not the case with *clang*; you may need to install the *llvm-openmp* library beforehand.

1. Install the **numpy** and **cython** packages (compile-time dependencies):
```bash
$ pip install numpy cython
```
2. Install the other packages:
```bash
$ pip install -r requirements-pip.txt
```


# Popular Datasets

Datasets can be downloaded using the *download.sh* script:
```bash
$ bash data/download.sh <dataset>
```
Replace *\<dataset\>* with the corresponding entry from the table below. You can also download all of them at once:
```bash
$ bash data/download.sh
```

The datasets can then be accessed using the interfaces from the *data* Python module.

|Dataset| \<dataset\> | Dataset Class|
|-------|:-----------:|:------:|
|[MNIST](http://yann.lecun.com/exdb/mnist/)|mnist| MNIST|
|[Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist)|fmnist| FMNIST|
|[Iris](https://archive.ics.uci.edu/ml/datasets/Iris)|iris|Iris|
|[Handwritten Digits](https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits)|pendigits|PenDigits|
|[COIL-20](http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php)|coil20|COIL20|
|[COIL-100](http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php)|coil100|COIL100|
|[Mouse scRNA-seq](https://hemberg-lab.github.io/scRNA.seq.datasets/mouse/brain/)|scrna|ScRNA|
|[Statlog (Shuttle)](https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle))|shuttle|Shuttle|

Each dataset can be used in the following way:

|Sample Code | Action |
|-----|--------|
|```ds = data.MNIST()```| Load the dataset.|
|```ds.X```| Get the samples as a numpy array of shape *(n_samples, n_dimensions)*. If the samples have more than one dimension, they are flattened.|
|```ds.y```| Get the labels of the samples.|
|```len(ds)```| Get the total number of samples.|
|```ds[0]```| Get the 0-th *(sample, label)* pair from the dataset.|
|```ds.shape```| Get the original shape of the samples. For example, it equals *(28, 28)* for MNIST.|
--------------------------------------------------------------------------------
/big-data/baseline.csv:
--------------------------------------------------------------------------------
,method,time
0,t-SNE,29.504466772079468
1,Multicore t-SNE,14.302932739257812
2,Umap,7.528846740722656
3,NCVis,0.9122929573059082
--------------------------------------------------------------------------------
/big-data/benchmark.csv:
--------------------------------------------------------------------------------
,method,n_samples,time
0,t-SNE,1000.0,5.071566581726074
1,t-SNE,2000.0,14.646541357040405
2,t-SNE,4000.0,30.49600863456726
3,t-SNE,8000.0,96.22555685043335
4,t-SNE,16000.0,255.46899485588074
5,t-SNE,32000.0,779.162201166153
6,Multicore t-SNE,1000.0,5.27056884765625
7,Multicore t-SNE,2000.0,9.886829376220703
8,Multicore t-SNE,4000.0,14.905185222625732
9,Multicore t-SNE,8000.0,34.719964027404785
10,Multicore t-SNE,16000.0,71.90425729751587
11,Multicore t-SNE,32000.0,203.12810039520264
12,Multicore t-SNE,64000.0,573.0402963161469
13,Multicore t-SNE,128000.0,1814.4171047210693
14,Umap,1000.0,3.215291738510132
15,Umap,2000.0,2.4666836261749268
16,Umap,4000.0,5.363312721252441
17,Umap,8000.0,13.978663921356201
18,Umap,16000.0,12.72315764427185
19,Umap,32000.0,26.08822512626648
20,Umap,64000.0,58.3027617931366
21,Umap,128000.0,127.516841173172
22,Umap,256000.0,283.3631901741028
23,Umap,512000.0,682.2378108501434
24,NCVis,1000.0,0.11863160133361816
25,NCVis,2000.0,0.3299705982208252
26,NCVis,4000.0,0.6363554000854492
27,NCVis,8000.0,1.353666067123413
28,NCVis,16000.0,2.899792432785034
29,NCVis,32000.0,6.572118282318115
30,NCVis,64000.0,15.098254680633545
31,NCVis,128000.0,34.84386920928955
32,NCVis,256000.0,82.09642720222473
33,NCVis,512000.0,173.22600722312927
34,NCVis,1024000.0,375.61637353897095
--------------------------------------------------------------------------------
/data/__init__.py:
--------------------------------------------------------------------------------
import os
import struct
import numpy as np
import pandas as pd

from abc import ABC, abstractmethod
class Dataset(ABC):
    """
    Abstract data interface class.
    """
    @abstractmethod
    def __init__(self):
        """
        Must define:
        self.X -- numpy array of data samples;
        self.y -- numpy array of labels;
        self.names -- dictionary of names for each value of label
        self.shape -- shape of the raw data
        """
        super().__init__()
        self.X = None
        self.y = None
        self.names = {}
        self.shape = None

    def __getitem__(self, id):
        """
        Returns:
        X, y -- data sample and label by given index
        """
        return self.X[id], self.y[id]

    def X(self):
        """
        Returns:
        X -- data samples
        """
        return self.X

    def y(self):
        """
        Returns:
        y -- data labels
        """
        return self.y

    def __len__(self):
        """
        Returns:
        n -- number of samples in the dataset
        """
        if self.X.shape[0] != self.y.shape[0]:
            raise RuntimeError("Data samples and labels sizes differ {} and {}, but must be the same".format(self.X.shape[0], self.y.shape[0]))
        return self.X.shape[0]

from multiprocessing import Process, Pool, Queue
import time
import progressbar
class LargePool:
    """
    Multiprocessing with progressbar.
    """
    def __init__(self, tasks, worker_class, worker_args=(), worker_kwargs={}, message='Loading '):
        self.tasks = tasks
        self.worker_class = worker_class
        self.worker_args = worker_args
        self.worker_kwargs = worker_kwargs
        self.message = message

    def run(self, processes=None, progress=True, delay=0.2):
        tasks = Queue()
        size = len(self.tasks)
        results = Queue(maxsize=size)

        def init():
            worker = self.worker_class(*self.worker_args, **self.worker_kwargs)
            while True:
                t = tasks.get()
                results.put(worker(t))

        def load_queue():
            for t in self.tasks:
                tasks.put(t)
        p = Process(target=load_queue)
        p.start()

        pool = Pool(processes=processes, initializer=init)
        if progress:
            with progressbar.ProgressBar(max_value=size, prefix=self.message) as bar:
                while not results.full():
                    bar.update(results.qsize())
                    time.sleep(delay)

        res = [results.get() for i in range(size)]

        p.terminate()
        pool.terminate()
        return [r for r in res if r is not None]

class Worker(ABC):
    @abstractmethod
    def __init__(self):
        super().__init__()
        pass
    @abstractmethod
    def __call__(self, task):
        pass

def load_mnist_raw(path, kind):
    """
    Load image/labels data packed as http://yann.lecun.com/exdb/mnist/.

    Arguments:
    path -- path to the loaded file
    kind -- kind of the file contents:
            'l' = labels
            'i' = images

    Returns:
    data -- loaded data as numpy array
    """
    with open(path, 'rb') as f:
        if kind == 'l':
            magic, n = struct.unpack('>ii', f.read(8))
            data = np.fromfile(f, dtype=np.uint8)
        elif kind == 'i':
            magic, num, rows, cols = struct.unpack(">iiii", f.read(16))
            data = np.fromfile(f, dtype=np.uint8).reshape(num, rows*cols)
        else:
            raise RuntimeError("Unsupported file contents kind: '{}'".format(kind))

    return data

def load_mnist_like(folder='mnist'):
    """
    Load MNIST(F-MNIST) dataset.

    Returns:
    X, y -- data points and labels
    """
    train = {'i': 'data/{}/train-images-idx3-ubyte'.format(folder),
             'l': 'data/{}/train-labels-idx1-ubyte'.format(folder)}
    test = {'i': 'data/{}/t10k-images-idx3-ubyte'.format(folder),
            'l': 'data/{}/t10k-labels-idx1-ubyte'.format(folder)}
    files = [train, test]

    storage = {'i': None,
               'l': None}
    for f in files:
        for kind in storage:
            arr = load_mnist_raw(f[kind], kind)
            if storage[kind] is None:
                storage[kind] = arr
            else:
                storage[kind] = np.concatenate((storage[kind], arr))

    return storage['i'], storage['l']

class MNIST(Dataset):
    """
    MNIST Dataset
    Alias: mnist
    http://yann.lecun.com/exdb/mnist/
    """
    def __init__(self):
        super().__init__()
        self.X, self.y = load_mnist_like('mnist')
        # max()+1 so that the largest label also gets a name
        self.names = {k: str(k) for k in range(self.y.max()+1)}
        self.shape = (28, 28)

class FMNIST(Dataset):
    """
    Fashion MNIST Dataset
    Alias: fmnist
    https://github.com/zalandoresearch/fashion-mnist
    """
    def __init__(self):
        super().__init__()
        self.X, self.y = load_mnist_like('fmnist')
        self.names = {
            0: "T-shirt/top",
            1: "Trouser",
            2: "Pullover",
            3: "Dress",
            4: "Coat",
            5: "Sandal",
            6: "Shirt",
            7: "Sneaker",
            8: "Bag",
            9: "Ankle boot"
        }
        self.shape = (28, 28)

class Iris(Dataset):
    """
    Iris Dataset
    https://archive.ics.uci.edu/ml/datasets/Iris
    """
    def __init__(self):
        super().__init__()
        df = pd.read_csv("data/iris/iris.data", header=None)
        self.X = df.iloc[:, :-1].values
        classes = df.iloc[:, -1].astype("category").cat
        self.y = classes.codes.values
        self.names = dict(enumerate(classes.categories))
        for k in self.names:
            self.names[k] = self.names[k].rsplit('-', 1)[1].title()
        self.shape = (self.X.shape[1], )

import re
import imageio
class CoilLoader(Worker):
    def __init__(self, path):
        super().__init__()
        self.pattern = re.compile(r'obj(\d+)__(\d+).png')
        self.path = path

    def __call__(self, file):
        match = self.pattern.match(file)
        if match:
            obj = match.group(1)
            res = imageio.imread(os.path.join(self.path, file)).ravel(), int(obj)-1
            return res

def load_coil_like(path):
    """
    Load COIL-20(COIL-100) dataset.

    Returns:
    X, y -- data points and labels
    """
    for _, _, f in os.walk(path):
        fs = f
        break

    p = LargePool(fs, CoilLoader, (path,))
    res = p.run()

    X = []
    y = []
    for r in res:
        X.append(r[0])
        y.append(r[1])

    return np.stack(X), np.stack(y)

class COIL20(Dataset):
    """
    COIL-20 Dataset
    Alias: coil20
    http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
    """
    def __init__(self):
        super().__init__()

        self.X, self.y = load_coil_like('data/coil20/coil-20-proc')
        # max()+1 so that the largest label also gets a name
        self.names = {k: 'Object ' + str(k) for k in range(self.y.max()+1)}
        self.shape = (128, 128)

class COIL100(Dataset):
    """
    COIL-100 Dataset
    Alias: coil100
    http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
    """
    def __init__(self):
        super().__init__()

        self.X, self.y = load_coil_like('data/coil100/coil-100')
        self.names = {k: 'Object ' + str(k) for k in range(self.y.max()+1)}
        self.shape = (128, 128, 3)

class PenDigits(Dataset):
    """
    Pen Digits Dataset
    Alias: pendigits
    https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
    """
    def __init__(self):
        super().__init__()
        files = ["data/pendigits/optdigits.tes",
                 "data/pendigits/optdigits.tra"]

        loaded = [None]*2
        for f in files:
            df = pd.read_csv(f, header=None)
            for i in range(2):
                if i == 0:
                    new = df.iloc[:, :-1].values
                else:
                    new = df.iloc[:, -1].values
                if loaded[i] is None:
                    loaded[i] = new
                else:
                    loaded[i] = np.concatenate((loaded[i], new))
        self.X, self.y = loaded

        self.names = {k: str(k) for k in range(self.y.max()+1)}
        self.shape = (8, 8)

from io import StringIO
class CsvLoader(Worker):
    def __init__(self, sep='\t'):
        super().__init__()
        self.sep = sep

    def __call__(self, text):
        csv = StringIO(text)
        return pd.read_csv(csv, sep=self.sep, header=None, engine='c')

class CsvReader:
    def __init__(self, path, nlines, chunksize=1024):
        super().__init__()
        self.nlines = nlines
        self.chunksize = chunksize
        self.path = path

    def __iter__(self):
        nlines = 0
        nread = 0
        text = ''
        with open(self.path) as f:
            f.readline()
            for line in f:
                if nlines == self.nlines:
                    break
                nlines += 1
                nread += 1
                # Lines read from the file already end with '\n'
                text += line
                if nread == self.chunksize:
                    yield text
                    nread = 0
                    text = ''
        yield text

    def __len__(self):
        return (self.nlines+self.chunksize-1)//self.chunksize

class ScRNA(Dataset):
    """
    Mouse scRNA-seq Dataset
    Alias: scrna
    https://hemberg-lab.github.io/scRNA.seq.datasets/mouse/brain/
    """
    def __init__(self):
        super().__init__()

        # Load labels
        df = pd.read_csv('data/scrna/GSE93374_cell_metadata.txt', sep='\t')
        classes = df.iloc[:, 6].astype('category').cat
        name_to_class = dict(zip(df.iloc[:, 0], classes.codes.values))
        self.names = dict(enumerate(classes.categories))
        df = pd.read_csv('data/scrna/GSE93374_Merged_all_020816_DGE.txt', sep='\t', nrows=1)
        ind_to_name = df.columns.values
        self.y = np.empty(len(ind_to_name), dtype=int)
        for i in range(len(ind_to_name)):
            self.y[i] = name_to_class[ind_to_name[i]]

        # Load the data itself
        path = 'data/scrna/GSE93374_Merged_all_020816_DGE.txt'
        nlines = 26774
        reader = CsvReader(path, nlines=nlines, chunksize=1024)
        p = LargePool(reader, CsvLoader, ('\t',))
        df = pd.concat(p.run())
        self.X = df.iloc[:, 1:].values.T

        self.shape = (nlines,)

class Shuttle(Dataset):
    """
    Statlog (Shuttle)
    Dataset
    Alias: shuttle
    https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
    """
    def __init__(self, drop_time=True):
        super().__init__()

        base = "data/shuttle/shuttle."
        exts = ["trn", "tst"]

        vals = {'X': None,
                'y': None}
        for ext in exts:
            df = pd.read_csv(base+ext, sep=' ').values
            new = {'X': df[:, 1:-1] if drop_time else df[:, :-1],
                   'y': df[:, -1]}
            for k in new:
                if vals[k] is None:
                    vals[k] = new[k]
                else:
                    vals[k] = np.concatenate((vals[k], new[k]))

        self.X, self.y = vals['X'], vals['y']
        self.names = {
            1: 'Rad Flow',
            2: 'Fpv Close',
            3: 'Fpv Open',
            4: 'High',
            5: 'Bypass',
            6: 'Bpv Close',
            7: 'Bpv Open'
        }
        self.shape = (self.X.shape[1],)

--------------------------------------------------------------------------------
/data/download.sh:
--------------------------------------------------------------------------------
#!/bin/bash

path=$(git rev-parse --show-toplevel)
cd $path/data

# https://gist.github.com/iamtekeste/3cdfd0366ebfd2c0d805#gistcomment-2359248
gdrive_download() {
    # $1 = GDrive id of the file
    # $2 = location to save file
    CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$1" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$1" -O $2
    rm -rf /tmp/cookies.txt
}

download() {
    # $1 = folder name
    # $2 = url1, $3 = url2, ...
    local dir=$1
    while [ -n "$2" ]; do
        wget --no-check-certificate -N -P $dir $2
        shift
    done
}

extract() {
    # $1 = folder name
    for f in $1/*; do
        case $f in
            *.zip) yes N | unzip -d $1 -qq $f ;;
            *.gz) yes n | gzip -dk $f ;;
            *.Z) yes n | uncompress $f ;;
            *) echo "$f is not compressed"
        esac
    done
}

get_data() {
    # $1 = folder name
    # $2 = url1, $3 = url2, ...
    download "$@"
    extract $1
}

init_data() {
    case $1 in
        iris)
            # Iris Dataset
            # https://archive.ics.uci.edu/ml/datasets/Iris
            get_data iris http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
            ;;
        mnist)
            # MNIST Dataset
            # http://yann.lecun.com/exdb/mnist/
            get_data mnist http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
            ;;
        pendigits)
            # Pen digits
            # https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
            get_data pendigits https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes
            ;;
        coil20)
            # COIL-20
            # http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
            get_data coil20 http://www.cs.columbia.edu/CAVE/databases/SLAM_coil-20_coil-100/coil-20/coil-20-proc.zip
            ;;
        coil100)
            # COIL-100
            # http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
            get_data coil100 http://www.cs.columbia.edu/CAVE/databases/SLAM_coil-20_coil-100/coil-100/coil-100.zip
            ;;
        fmnist)
            # Fashion-MNIST
            # https://github.com/zalandoresearch/fashion-mnist
            get_data fmnist https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/t10k-images-idx3-ubyte.gz https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/t10k-labels-idx1-ubyte.gz https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/train-images-idx3-ubyte.gz https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/train-labels-idx1-ubyte.gz
            ;;
        scrna)
            # scRNA-seq
            # https://hemberg-lab.github.io/scRNA.seq.datasets/mouse/brain/
            get_data scrna ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE93nnn/GSE93374/suppl/GSE93374_Merged_all_020816_DGE.txt.gz ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE93nnn/GSE93374/suppl/GSE93374_cell_metadata.txt.gz
            ;;
        shuttle)
            # Statlog (Shuttle)
            # https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
            get_data shuttle https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/shuttle/shuttle.trn.Z https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/shuttle/shuttle.tst
            ;;
        flow)
            # Flow cytometry
            # https://flowrepository.org/id/FR-FCM-ZZ36
            urls=(https://flowrepository.org/experiments/102/fcs_files/10672/download https://flowrepository.org/experiments/102/fcs_files/10673/download)
            names=(pbmc_luca_cd8.fcs pbmc_luca.fcs)
            local dir_name="flow/"
            for idx in ${!urls[*]}; do
                download ${dir_name} ${urls[idx]}
                mv ${dir_name}download ${dir_name}${names[idx]}
            done
            ;;
        news)
            # GoogleNews
            # https://code.google.com/archive/p/word2vec/
            mkdir -p news
            f="news/GoogleNews-vectors-negative300.bin.gz"
            if [[ -f "$f" ]]; then
                echo "$f is already present, nothing to do."
            else
                gdrive_download 0B7XkCwpI5KDYNlNUTTlSS21pQmM $f
                yes n | gzip -dk $f
            fi
            ;;
    esac
}

if [ -z $1 ]; then
    all=(iris mnist pendigits coil20 coil100 fmnist scrna shuttle flow news)
else
    all=($1)
fi

for idx in ${!all[*]}; do
    echo "[$((idx+1))/${#all[*]}] Downloading: ${all[idx]}"
    init_data ${all[idx]}
done
--------------------------------------------------------------------------------
/img/FIt-SNE.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/FIt-SNE.pdf
--------------------------------------------------------------------------------
/img/FIt-SNE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/FIt-SNE.png
--------------------------------------------------------------------------------
/img/LargeVis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/LargeVis.pdf
--------------------------------------------------------------------------------
/img/LargeVis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/LargeVis.png
--------------------------------------------------------------------------------
/img/Multicore t-SNE.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/Multicore t-SNE.pdf
--------------------------------------------------------------------------------
/img/Multicore t-SNE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/Multicore t-SNE.png
--------------------------------------------------------------------------------
/img/NCVis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/NCVis.pdf
--------------------------------------------------------------------------------
/img/NCVis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/NCVis.png
--------------------------------------------------------------------------------
/img/Umap.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/Umap.pdf
--------------------------------------------------------------------------------
/img/Umap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/Umap.png
--------------------------------------------------------------------------------
/img/efficiency.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/efficiency.pdf
--------------------------------------------------------------------------------
/img/efficiency.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/efficiency.png
--------------------------------------------------------------------------------
/img/gen_png.sh:
--------------------------------------------------------------------------------
for i in *.pdf; do
    convert "$i" -resize 800 -background white -alpha background -alpha off -compress zip +adjoin "${i%%.pdf}.png";
done
--------------------------------------------------------------------------------
/img/isolated.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/isolated.jpg
--------------------------------------------------------------------------------
/img/news_2kk.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/news_2kk.jpg
--------------------------------------------------------------------------------
/img/pendigits.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/pendigits.pdf
--------------------------------------------------------------------------------
/img/pendigits.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/pendigits.png
--------------------------------------------------------------------------------
/img/t-SNE.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/t-SNE.pdf
--------------------------------------------------------------------------------
/img/t-SNE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/t-SNE.png
--------------------------------------------------------------------------------
/img/teaser.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/teaser.jpg
--------------------------------------------------------------------------------
/img/time.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/time.pdf
--------------------------------------------------------------------------------
/img/time.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/time.png
--------------------------------------------------------------------------------
/img/time_all.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/time_all.pdf
--------------------------------------------------------------------------------
/img/time_all.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/time_all.png
--------------------------------------------------------------------------------
/img/words.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/words.jpg
--------------------------------------------------------------------------------
/requirements-conda.txt:
--------------------------------------------------------------------------------
numpy
pandas
imageio
progressbar2
matplotlib
scikit-learn
conda-forge::umap-learn
conda-forge::ncvis
kaggle
--------------------------------------------------------------------------------
/requirements-pip.txt:
--------------------------------------------------------------------------------
numpy
cython
pandas
imageio
progressbar2
matplotlib
scikit-learn
umap-learn
kaggle
ncvis
--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler

from scipy.interpolate import interp1d
def scatter_classes(xs, y, ax, show_labels=True, silhouette_coefficient=False, **kwargs):
    if silhouette_coefficient:
        from sklearn.metrics import silhouette_samples
        scores = silhouette_samples(xs, y)
        labels = np.unique(y)
        n_labels = labels.shape[0]
        macro_scores = np.empty(n_labels)
        for i in range(n_labels):
            macro_scores[i] = np.mean(scores[y == labels[i]])
        score = np.mean(macro_scores)
        std_score = np.std(macro_scores)

        props = dict(boxstyle='round', facecolor='grey', alpha=0.1, linewidth=0)
        info = 'Silhouette Coefficient = {:.3f}±{:.3f}'

    # Interpolating the discrete colormap
    n = 12
    ps = np.linspace(0, 1, n)
    cs = plt.cm.Set3(ps)
    f = interp1d(ps, cs.T)
    nclasses = len(np.unique(y))
    custom_cycler = cycler(color=f(np.linspace(0, 1, nclasses)).T)
    ax.set_prop_cycle(custom_cycler)
    n_dims = xs.shape[-1]
    for k in np.unique(y):
        npoints = np.count_nonzero(y==k)
        x1 = x2 = None

        class_mask = (y==k)
        if n_dims == 2:
            x1 = xs[class_mask, 0]
            x2 = xs[class_mask, 1]
        elif n_dims == 1:
            x1 = xs[class_mask, 0]
            x2 = np.random.uniform(0, 1, npoints)

        ax.scatter(x1, x2, label="{}".format(k) if show_labels else None, **kwargs)

    if show_labels:
        ax.legend(loc='upper right')
    if silhouette_coefficient:
        ax.text(0.01, 0.01, info.format(score, std_score), fontsize=14, bbox=props, transform=ax.transAxes)
--------------------------------------------------------------------------------
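As a quick illustration of the `Dataset` contract that the README's usage table describes (`ds.X`, `ds.y`, `len(ds)`, `ds[0]`, `ds.shape`), here is a minimal sketch with a hypothetical in-memory `ToyDigits` class standing in for e.g. `data.MNIST`, so it runs without downloading anything:

```python
import numpy as np
from abc import ABC, abstractmethod

class Dataset(ABC):
    """Minimal mirror of the abstract interface in data/__init__.py."""
    @abstractmethod
    def __init__(self):
        super().__init__()
        self.X = None
        self.y = None
        self.names = {}
        self.shape = None

    def __getitem__(self, id):
        # 0-th pair is (sample, label), as in the README table
        return self.X[id], self.y[id]

    def __len__(self):
        return self.X.shape[0]

class ToyDigits(Dataset):
    """Hypothetical stand-in dataset: 10 random 28x28 'images'."""
    def __init__(self):
        super().__init__()
        self.shape = (28, 28)  # original shape of one sample
        rng = np.random.default_rng(0)
        # Samples are stored flattened as (n_samples, n_dimensions)
        self.X = rng.random((10, 28 * 28))
        self.y = np.arange(10)
        self.names = {k: str(k) for k in range(10)}

ds = ToyDigits()
print(ds.X.shape)           # (10, 784)
print(len(ds))              # 10
sample, label = ds[0]
print(sample.shape, label)  # (784,) 0
print(ds.shape)             # (28, 28)
```

The real classes in `data/__init__.py` follow the same pattern; they only differ in how `__init__` fills `X`, `y`, and `names` from the downloaded files.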