├── .gitignore
├── LICENSE
├── README.md
├── big-data.ipynb
├── big-data
│   ├── baseline.csv
│   └── benchmark.csv
├── data
│   ├── __init__.py
│   └── download.sh
├── img
│   ├── FIt-SNE.pdf
│   ├── FIt-SNE.png
│   ├── LargeVis.pdf
│   ├── LargeVis.png
│   ├── Multicore t-SNE.pdf
│   ├── Multicore t-SNE.png
│   ├── NCVis.pdf
│   ├── NCVis.png
│   ├── Umap.pdf
│   ├── Umap.png
│   ├── efficiency.pdf
│   ├── efficiency.png
│   ├── gen_png.sh
│   ├── isolated.jpg
│   ├── news_2kk.jpg
│   ├── pendigits.pdf
│   ├── pendigits.png
│   ├── t-SNE.pdf
│   ├── t-SNE.png
│   ├── teaser.jpg
│   ├── time.pdf
│   ├── time.png
│   ├── time_all.pdf
│   ├── time_all.png
│   └── words.jpg
├── requirements-conda.txt
├── requirements-pip.txt
├── sample.ipynb
└── utils
    └── __init__.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
data/*/
__pycache__/
.ipynb_checkpoints/
big-data/*
!big-data/baseline.csv
!big-data/benchmark.csv
annoy_index_file
*.dat
*.in
*.out
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Aleksandr Artemenkov

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ncvis-examples
Examples for the [NCVis](https://github.com/alartum/ncvis) Python wrapper.

|Notebook| Contents |
|-------|:-----------|
|[sample.ipynb](https://nbviewer.jupyter.org/github/alartum/ncvis-examples/blob/master/sample.ipynb) | Introduction to NCVis |
|[big-data.ipynb](https://nbviewer.jupyter.org/github/alartum/ncvis-examples/blob/master/big-data.ipynb)| Large-scale application case |

# Setup

## Conda [recommended]

You do not need to set up the environment when using *conda*; all dependencies are installed automatically.
```bash
$ conda install --file requirements-conda.txt
```

## Pip [not recommended]

**Important**: be sure to have a compiler with *OpenMP* support. *GCC* has it by default, which is not the case with *clang*; you may need to install the *llvm-openmp* library beforehand.

1. Install the **numpy** and **cython** packages (compile-time dependencies):
```bash
$ pip install numpy cython
```
2. Install the other packages:
```bash
$ pip install -r requirements-pip.txt
```


# Popular Datasets

Datasets can be downloaded using the *download.sh* script:
```bash
$ bash data/download.sh <dataset>
```
Replace *\<dataset\>* with the corresponding entry from the table below. You can also download all of them at once:
```bash
$ bash data/download.sh
```

The datasets can then be accessed using the interfaces from the *data* Python module.

|Dataset| \<dataset\> | Dataset Class|
|-------|:-----------:|:------:|
|[MNIST](http://yann.lecun.com/exdb/mnist/)|mnist| MNIST|
|[Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist)|fmnist| FMNIST|
|[Iris](https://archive.ics.uci.edu/ml/datasets/Iris)|iris|Iris|
|[Handwritten Digits](https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits)|pendigits|PenDigits|
|[COIL-20](http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php)|coil20|COIL20|
|[COIL-100](http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php)|coil100|COIL100|
|[Mouse scRNA-seq](https://hemberg-lab.github.io/scRNA.seq.datasets/mouse/brain/)|scrna|ScRNA|
|[Statlog (Shuttle)](https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle))|shuttle|Shuttle|

Each dataset can be used in the following way:

|Sample Code | Action |
|-----|--------|
|```ds = data.MNIST()```| Load the dataset.|
|```ds.X```| Get the samples as a numpy array of shape *(n_samples, n_dimensions)*. If the samples have more than one dimension, they are flattened.|
|```ds.y```| Get the labels of the samples.|
|```len(ds)```| Get the total number of samples.|
|```ds[0]```| Get the 0-th *(sample, label)* pair from the dataset.|
|```ds.shape```| Get the original shape of the samples. For example, it equals *(28, 28)* for MNIST.|
--------------------------------------------------------------------------------
/big-data/baseline.csv:
--------------------------------------------------------------------------------
,method,time
0,t-SNE,29.504466772079468
1,Multicore t-SNE,14.302932739257812
2,Umap,7.528846740722656
3,NCVis,0.9122929573059082
--------------------------------------------------------------------------------
/big-data/benchmark.csv:
--------------------------------------------------------------------------------
,method,n_samples,time
0,t-SNE,1000.0,5.071566581726074
1,t-SNE,2000.0,14.646541357040405
2,t-SNE,4000.0,30.49600863456726
3,t-SNE,8000.0,96.22555685043335
4,t-SNE,16000.0,255.46899485588074
5,t-SNE,32000.0,779.162201166153
6,Multicore t-SNE,1000.0,5.27056884765625
7,Multicore t-SNE,2000.0,9.886829376220703
8,Multicore t-SNE,4000.0,14.905185222625732
9,Multicore t-SNE,8000.0,34.719964027404785
10,Multicore t-SNE,16000.0,71.90425729751587
11,Multicore t-SNE,32000.0,203.12810039520264
12,Multicore t-SNE,64000.0,573.0402963161469
13,Multicore t-SNE,128000.0,1814.4171047210693
14,Umap,1000.0,3.215291738510132
15,Umap,2000.0,2.4666836261749268
16,Umap,4000.0,5.363312721252441
17,Umap,8000.0,13.978663921356201
18,Umap,16000.0,12.72315764427185
19,Umap,32000.0,26.08822512626648
20,Umap,64000.0,58.3027617931366
21,Umap,128000.0,127.516841173172
22,Umap,256000.0,283.3631901741028
23,Umap,512000.0,682.2378108501434
24,NCVis,1000.0,0.11863160133361816
25,NCVis,2000.0,0.3299705982208252
26,NCVis,4000.0,0.6363554000854492
27,NCVis,8000.0,1.353666067123413
28,NCVis,16000.0,2.899792432785034
29,NCVis,32000.0,6.572118282318115
30,NCVis,64000.0,15.098254680633545
31,NCVis,128000.0,34.84386920928955
32,NCVis,256000.0,82.09642720222473
33,NCVis,512000.0,173.22600722312927
34,NCVis,1024000.0,375.61637353897095
--------------------------------------------------------------------------------
/data/__init__.py:
--------------------------------------------------------------------------------
import os
import struct
import numpy as np
import pandas as pd

from abc import ABC, abstractmethod
class Dataset(ABC):
    """
    Abstract data interface class.
    """
    @abstractmethod
    def __init__(self):
        """
        Must define:
        self.X -- numpy array of data samples;
        self.y -- numpy array of labels;
        self.names -- dictionary of names for each value of label
        self.shape -- shape of the raw data
        """
        super().__init__()
        self.X = None
        self.y = None
        self.names = {}
        self.shape = None

    def __getitem__(self, id):
        """
        Returns:
        X, y -- data sample and label by given index
        """
        return self.X[id], self.y[id]

    def X(self):
        """
        Returns:
        X -- data samples
        """
        return self.X

    def y(self):
        """
        Returns:
        y -- data labels
        """
        return self.y

    def __len__(self):
        """
        Returns:
        n -- number of samples in the dataset
        """
        if self.X.shape[0] != self.y.shape[0]:
            raise RuntimeError("Data samples and labels sizes differ {} and {}, but must be the same".format(self.X.shape[0], self.y.shape[0]))
        return self.X.shape[0]

from multiprocessing import Process, Pool, Queue
import time
import progressbar
class LargePool:
    """
    Multiprocessing with progressbar.
    """
    def __init__(self, tasks, worker_class, worker_args=(), worker_kwargs={}, message='Loading '):
        self.tasks = tasks
        self.worker_class = worker_class
        self.worker_args = worker_args
        self.worker_kwargs = worker_kwargs
        self.message = message

    def run(self, processes=None, progress=True, delay=0.2):
        tasks = Queue()
        size = len(self.tasks)
        results = Queue(maxsize=size)

        def init():
            worker = self.worker_class(*self.worker_args, **self.worker_kwargs)
            while True:
                t = tasks.get()
                results.put(worker(t))

        def load_queue():
            for t in self.tasks:
                tasks.put(t)
        p = Process(target=load_queue)
        p.start()

        pool = Pool(processes=processes, initializer=init)
        if progress:
            with progressbar.ProgressBar(max_value=size, prefix=self.message) as bar:
                while not results.full():
                    bar.update(results.qsize())
                    time.sleep(delay)

        res = [results.get() for i in range(size)]

        p.terminate()
        pool.terminate()
        return [r for r in res if r is not None]

class Worker(ABC):
    @abstractmethod
    def __init__(self):
        super().__init__()
        pass
    @abstractmethod
    def __call__(self, task):
        pass

def load_mnist_raw(path, kind):
    """
    Load image/labels data packed as http://yann.lecun.com/exdb/mnist/.

    Arguments:
    path -- path to the loaded file
    kind -- kind of the file contents:
            'l' = labels
            'i' = images

    Returns:
    data -- loaded data as numpy array
    """
    with open(path, 'rb') as f:
        if kind == 'l':
            magic, n = struct.unpack('>ii', f.read(8))
            data = np.fromfile(f, dtype=np.uint8)
        elif kind == 'i':
            magic, num, rows, cols = struct.unpack(">iiii", f.read(16))
            data = np.fromfile(f, dtype=np.uint8).reshape(num, rows*cols)
        else:
            raise RuntimeError("Unsupported file contents kind: '{}'".format(kind))

    return data

def load_mnist_like(folder='mnist'):
    """
    Load MNIST(F-MNIST) dataset.

    Returns:
    X, y -- data points and labels
    """
    train = {'i': 'data/{}/train-images-idx3-ubyte'.format(folder),
             'l': 'data/{}/train-labels-idx1-ubyte'.format(folder)}
    test = {'i': 'data/{}/t10k-images-idx3-ubyte'.format(folder),
            'l': 'data/{}/t10k-labels-idx1-ubyte'.format(folder)}
    files = [train, test]

    storage = {'i': None,
               'l': None}
    for f in files:
        for kind in storage:
            arr = load_mnist_raw(f[kind], kind)
            if storage[kind] is None:
                storage[kind] = arr
            else:
                storage[kind] = np.concatenate((storage[kind], arr))

    return storage['i'], storage['l']

class MNIST(Dataset):
    """
    MNIST Dataset
    Alias: mnist
    http://yann.lecun.com/exdb/mnist/
    """
    def __init__(self):
        super().__init__()
        self.X, self.y = load_mnist_like('mnist')
        # max()+1 so that the largest label also gets a name
        self.names = {k: str(k) for k in range(self.y.max()+1)}
        self.shape = (28, 28)

class FMNIST(Dataset):
    """
    Fashion MNIST Dataset
    Alias: fmnist
    https://github.com/zalandoresearch/fashion-mnist
    """
    def __init__(self):
        super().__init__()
        self.X, self.y = load_mnist_like('fmnist')
        self.names = {
            0: "T-shirt/top",
            1: "Trouser",
            2: "Pullover",
            3: "Dress",
            4: "Coat",
            5: "Sandal",
            6: "Shirt",
            7: "Sneaker",
            8: "Bag",
            9: "Ankle boot"
        }
        self.shape = (28, 28)

class Iris(Dataset):
    """
    Iris Dataset
    https://archive.ics.uci.edu/ml/datasets/Iris
    """
    def __init__(self):
        super().__init__()
        df = pd.read_csv("data/iris/iris.data", header=None)
        self.X = df.iloc[:, :-1].values
        classes = df.iloc[:, -1].astype("category").cat
        self.y = classes.codes.values
        self.names = dict(enumerate(classes.categories))
        for k in self.names:
            self.names[k] = self.names[k].rsplit('-', 1)[1].title()
        self.shape = (self.X.shape[1], )

import re
import imageio
class CoilLoader(Worker):
    def __init__(self, path):
        super().__init__()
        self.pattern = re.compile(r'obj(\d+)__(\d+).png')
        self.path = path

    def __call__(self, file):
        match = self.pattern.match(file)
        if match:
            obj = match.group(1)
            res = imageio.imread(os.path.join(self.path, file)).ravel(), int(obj)-1
            return res

def load_coil_like(path):
    """
    Load COIL-20(COIL-100) dataset.

    Returns:
    X, y -- data points and labels
    """
    for _, _, f in os.walk(path):
        fs = f
        break

    p = LargePool(fs, CoilLoader, (path,))
    res = p.run()

    X = []
    y = []
    for r in res:
        X.append(r[0])
        y.append(r[1])

    return np.stack(X), np.stack(y)

class COIL20(Dataset):
    """
    COIL-20 Dataset
    Alias: coil20
    http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
    """
    def __init__(self):
        super().__init__()

        self.X, self.y = load_coil_like('data/coil20/coil-20-proc')
        # max()+1 so that the largest label also gets a name
        self.names = {k: 'Object ' + str(k) for k in range(self.y.max()+1)}
        self.shape = (128, 128)

class COIL100(Dataset):
    """
    COIL-100 Dataset
    Alias: coil100
    http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
    """
    def __init__(self):
        super().__init__()

        self.X, self.y = load_coil_like('data/coil100/coil-100')
        self.names = {k: 'Object ' + str(k) for k in range(self.y.max()+1)}
        self.shape = (128, 128, 3)

class PenDigits(Dataset):
    """
    Pen Digits Dataset
    Alias: pendigits
    https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
    """
    def __init__(self):
        super().__init__()
        files = ["data/pendigits/optdigits.tes",
                 "data/pendigits/optdigits.tra"]

        loaded = [None]*2
        for f in files:
            df = pd.read_csv(f, header=None)
            for i in range(2):
                if i == 0:
                    new = df.iloc[:, :-1].values
                else:
                    new = df.iloc[:, -1].values
                if loaded[i] is None:
                    loaded[i] = new
                else:
                    loaded[i] = np.concatenate((loaded[i], new))
        self.X, self.y = loaded

        self.names = {k: str(k) for k in range(self.y.max()+1)}
        self.shape = (8, 8)

from io import StringIO
class CsvLoader(Worker):
    def __init__(self, sep='\t'):
        super().__init__()
        self.sep = sep

    def __call__(self, text):
        csv = StringIO(text)
        return pd.read_csv(csv, sep=self.sep, header=None, engine='c')

class CsvReader:
    def __init__(self, path, nlines, chunksize=1024):
        super().__init__()
        self.nlines = nlines
        self.chunksize = chunksize
        self.path = path

    def __iter__(self):
        nlines = 0
        nread = 0
        text = ''
        with open(self.path) as f:
            f.readline()
            for line in f:
                if nlines == self.nlines:
                    break
                nlines += 1
                nread += 1
                # Lines read from the file already end with '\n'
                text += line
                if nread == self.chunksize:
                    yield text
                    nread = 0
                    text = ''
        yield text

    def __len__(self):
        return (self.nlines+self.chunksize-1)//self.chunksize

class ScRNA(Dataset):
    """
    Mouse scRNA-seq Dataset
    Alias: scrna
    https://hemberg-lab.github.io/scRNA.seq.datasets/mouse/brain/
    """
    def __init__(self):
        super().__init__()

        # Load labels
        df = pd.read_csv('data/scrna/GSE93374_cell_metadata.txt', sep='\t')
        classes = df.iloc[:, 6].astype('category').cat
        name_to_class = dict(zip(df.iloc[:, 0], classes.codes.values))
        self.names = dict(enumerate(classes.categories))
        df = pd.read_csv('data/scrna/GSE93374_Merged_all_020816_DGE.txt', sep='\t', nrows=1)
        ind_to_name = df.columns.values
        self.y = np.empty(len(ind_to_name), dtype=int)
        for i in range(len(ind_to_name)):
            self.y[i] = name_to_class[ind_to_name[i]]

        # Load the data itself
        path = 'data/scrna/GSE93374_Merged_all_020816_DGE.txt'
        nlines = 26774
        reader = CsvReader(path, nlines=nlines, chunksize=1024)
        p = LargePool(reader, CsvLoader, ('\t',))
        df = pd.concat(p.run())
        self.X = df.iloc[:, 1:].values.T

        self.shape = (nlines,)

class Shuttle(Dataset):
    """
    Statlog (Shuttle)
    Dataset
    Alias: shuttle
    https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
    """
    def __init__(self, drop_time=True):
        super().__init__()

        base = "data/shuttle/shuttle."
        exts = ["trn", "tst"]

        vals = {'X': None,
                'y': None}
        for ext in exts:
            df = pd.read_csv(base+ext, sep=' ').values
            new = {'X': df[:, 1:-1] if drop_time else df[:, :-1],
                   'y': df[:, -1]}
            for k in new:
                if vals[k] is None:
                    vals[k] = new[k]
                else:
                    vals[k] = np.concatenate((vals[k], new[k]))

        self.X, self.y = vals['X'], vals['y']
        self.names = {
            1: 'Rad Flow',
            2: 'Fpv Close',
            3: 'Fpv Open',
            4: 'High',
            5: 'Bypass',
            6: 'Bpv Close',
            7: 'Bpv Open'
        }
        self.shape = (self.X.shape[1],)

--------------------------------------------------------------------------------
/data/download.sh:
--------------------------------------------------------------------------------
#!/bin/bash

path=$(git rev-parse --show-toplevel)
cd $path/data

# https://gist.github.com/iamtekeste/3cdfd0366ebfd2c0d805#gistcomment-2359248
gdrive_download() {
    # $1 = GDrive id of the file
    # $2 = location to save file
    CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$1" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$1" -O $2
    rm -rf /tmp/cookies.txt
}

download() {
    # $1 = folder name
    # $2 = url1, $3 = url2, ...
    local dir=$1
    while [ -n "$2" ]; do
        wget --no-check-certificate -N -P $dir $2
        shift
    done
}

extract() {
    # $1 = folder name
    for f in $1/*; do
        case $f in
            *.zip) yes N | unzip -d $1 -qq $f ;;
            *.gz) yes n | gzip -dk $f ;;
            *.Z) yes n | uncompress $f ;;
            *) echo "$f is not compressed"
        esac
    done
}

get_data() {
    # $1 = folder name
    # $2 = url1, $3 = url2, ...
    download "$@"
    extract $1
}

init_data() {
    case $1 in
        iris)
            # Iris Dataset
            # https://archive.ics.uci.edu/ml/datasets/Iris
            get_data iris http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
            ;;
        mnist)
            # MNIST Dataset
            # http://yann.lecun.com/exdb/mnist/
            get_data mnist http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
            ;;
        pendigits)
            # Pen digits
            # https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
            get_data pendigits https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes
            ;;
        coil20)
            # COIL-20
            # http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
            get_data coil20 http://www.cs.columbia.edu/CAVE/databases/SLAM_coil-20_coil-100/coil-20/coil-20-proc.zip
            ;;
        coil100)
            # COIL-100
            # http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php
            get_data coil100 http://www.cs.columbia.edu/CAVE/databases/SLAM_coil-20_coil-100/coil-100/coil-100.zip
            ;;
        fmnist)
            # Fashion-MNIST
            # https://github.com/zalandoresearch/fashion-mnist
            get_data fmnist https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/t10k-images-idx3-ubyte.gz https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/t10k-labels-idx1-ubyte.gz https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/train-images-idx3-ubyte.gz https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/train-labels-idx1-ubyte.gz
            ;;
        scrna)
            # scRNA-seq
            # https://hemberg-lab.github.io/scRNA.seq.datasets/mouse/brain/
            get_data scrna ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE93nnn/GSE93374/suppl/GSE93374_Merged_all_020816_DGE.txt.gz ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE93nnn/GSE93374/suppl/GSE93374_cell_metadata.txt.gz
            ;;
        shuttle)
            # Statlog (Shuttle)
            # https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
            get_data shuttle https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/shuttle/shuttle.trn.Z https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/shuttle/shuttle.tst
            ;;
        flow)
            # Flow cytometry
            # https://flowrepository.org/id/FR-FCM-ZZ36
            urls=(https://flowrepository.org/experiments/102/fcs_files/10672/download https://flowrepository.org/experiments/102/fcs_files/10673/download)
            names=(pbmc_luca_cd8.fcs pbmc_luca.fcs)
            local dir_name="flow/"
            for idx in ${!urls[*]}; do
                download ${dir_name} ${urls[idx]}
                mv ${dir_name}download ${dir_name}${names[idx]}
            done
            ;;
        news)
            # GoogleNews
            # https://code.google.com/archive/p/word2vec/
            mkdir -p news
            f="news/GoogleNews-vectors-negative300.bin.gz"
            if [[ -f "$f" ]]; then
                echo "$f is already present, nothing to do."
            else
                gdrive_download 0B7XkCwpI5KDYNlNUTTlSS21pQmM $f
                yes n | gzip -dk $f
            fi
            ;;
    esac
}

if [ -z $1 ]; then
    all=(iris mnist pendigits coil20 coil100 fmnist scrna shuttle flow news)
else
    all=($1)
fi

for idx in ${!all[*]}; do
    echo "[$((idx+1))/${#all[*]}] Downloading: ${all[idx]}"
    init_data ${all[idx]}
done
--------------------------------------------------------------------------------
/img/FIt-SNE.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/FIt-SNE.pdf
--------------------------------------------------------------------------------
/img/FIt-SNE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/FIt-SNE.png
--------------------------------------------------------------------------------
/img/LargeVis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/LargeVis.pdf
--------------------------------------------------------------------------------
/img/LargeVis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/LargeVis.png
--------------------------------------------------------------------------------
/img/Multicore t-SNE.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/Multicore t-SNE.pdf
--------------------------------------------------------------------------------
/img/Multicore t-SNE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/Multicore t-SNE.png
--------------------------------------------------------------------------------
/img/NCVis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/NCVis.pdf
--------------------------------------------------------------------------------
/img/NCVis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/NCVis.png
--------------------------------------------------------------------------------
/img/Umap.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/Umap.pdf
--------------------------------------------------------------------------------
/img/Umap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/Umap.png
--------------------------------------------------------------------------------
/img/efficiency.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/efficiency.pdf
--------------------------------------------------------------------------------
/img/efficiency.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/efficiency.png
--------------------------------------------------------------------------------
/img/gen_png.sh:
--------------------------------------------------------------------------------
for i in *.pdf; do
    convert "$i" -resize 800 -background white -alpha background -alpha off -compress zip +adjoin "${i%%.pdf}.png";
done
--------------------------------------------------------------------------------
/img/isolated.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/isolated.jpg
--------------------------------------------------------------------------------
/img/news_2kk.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/news_2kk.jpg
--------------------------------------------------------------------------------
/img/pendigits.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/pendigits.pdf
--------------------------------------------------------------------------------
/img/pendigits.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/pendigits.png
--------------------------------------------------------------------------------
/img/t-SNE.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/t-SNE.pdf
--------------------------------------------------------------------------------
/img/t-SNE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/t-SNE.png
--------------------------------------------------------------------------------
/img/teaser.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/teaser.jpg
--------------------------------------------------------------------------------
/img/time.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/time.pdf
--------------------------------------------------------------------------------
/img/time.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/time.png
--------------------------------------------------------------------------------
/img/time_all.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/time_all.pdf
--------------------------------------------------------------------------------
/img/time_all.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/time_all.png
--------------------------------------------------------------------------------
/img/words.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stat-ml/ncvis-examples/40510efac97a611c9db97ad6855684f4899dbaff/img/words.jpg
--------------------------------------------------------------------------------
/requirements-conda.txt:
--------------------------------------------------------------------------------
numpy
pandas
imageio
progressbar2
matplotlib
scikit-learn
conda-forge::umap-learn
conda-forge::ncvis
kaggle
--------------------------------------------------------------------------------
/requirements-pip.txt:
--------------------------------------------------------------------------------
numpy
cython
pandas
imageio
progressbar2
matplotlib
scikit-learn
umap-learn
kaggle
ncvis
--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
from cycler import cycler

from scipy.interpolate import interp1d
def scatter_classes(xs, y, ax, show_labels=True, silhouette_coefficient=False, **kwargs):
    if silhouette_coefficient:
        from sklearn.metrics import silhouette_samples
        scores = silhouette_samples(xs, y)
        labels = np.unique(y)
        n_labels = labels.shape[0]
        macro_scores = np.empty(n_labels)
        for i in range(n_labels):
            macro_scores[i] = np.mean(scores[y == labels[i]])
        score = np.mean(macro_scores)
        std_score = np.std(macro_scores)

        props = dict(boxstyle='round', facecolor='grey', alpha=0.1, linewidth=0)
        info = 'Silhouette Coefficient = {:.3f}±{:.3f}'

    # Interpolating the discrete colormap
    n = 12
    ps = np.linspace(0, 1, n)
    cs = plt.cm.Set3(ps)
    f = interp1d(ps, cs.T)
    nclasses = len(np.unique(y))
    custom_cycler = cycler(color=f(np.linspace(0, 1, nclasses)).T)
    ax.set_prop_cycle(custom_cycler)
    n_dims = xs.shape[-1]
    for k in np.unique(y):
        npoints = np.count_nonzero(y==k)
        x1 = x2 = None

        class_mask = (y==k)
        if n_dims == 2:
            x1 = xs[class_mask, 0]
            x2 = xs[class_mask, 1]
        elif n_dims == 1:
            x1 = xs[class_mask, 0]
            x2 = np.random.uniform(0, 1, npoints)

        ax.scatter(x1, x2, label="{}".format(k) if show_labels else None, **kwargs)

    if show_labels:
        ax.legend(loc='upper right')
    if silhouette_coefficient:
        ax.text(0.01, 0.01, info.format(score, std_score), fontsize=14, bbox=props, transform=ax.transAxes)
--------------------------------------------------------------------------------
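As a quick illustration of the `Dataset` contract that the README's usage table describes (`ds.X`, `ds.y`, `len(ds)`, `ds[0]`, `ds.shape`), here is a minimal sketch with a hypothetical in-memory `ToyDigits` class standing in for e.g. `data.MNIST`, so it runs without downloading anything:

```python
import numpy as np
from abc import ABC, abstractmethod

class Dataset(ABC):
    """Minimal mirror of the abstract interface in data/__init__.py."""
    @abstractmethod
    def __init__(self):
        super().__init__()
        self.X = None
        self.y = None
        self.names = {}
        self.shape = None

    def __getitem__(self, id):
        # 0-th pair is (sample, label), as in the README table
        return self.X[id], self.y[id]

    def __len__(self):
        return self.X.shape[0]

class ToyDigits(Dataset):
    """Hypothetical stand-in dataset: 10 random 28x28 'images'."""
    def __init__(self):
        super().__init__()
        self.shape = (28, 28)  # original shape of one sample
        rng = np.random.default_rng(0)
        # Samples are stored flattened as (n_samples, n_dimensions)
        self.X = rng.random((10, 28 * 28))
        self.y = np.arange(10)
        self.names = {k: str(k) for k in range(10)}

ds = ToyDigits()
print(ds.X.shape)           # (10, 784)
print(len(ds))              # 10
sample, label = ds[0]
print(sample.shape, label)  # (784,) 0
print(ds.shape)             # (28, 28)
```

The real classes in `data/__init__.py` follow the same pattern; they only differ in how `__init__` fills `X`, `y`, and `names` from the downloaded files.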