├── 11scRNAseq_DatasetSource.txt
├── Codes
    ├── GenerateSimData.py
    └── GenerateSimData_Rare.py
├── ExamplePollen.zip
├── LICENSE.txt
├── OLMC_animation.gif
├── PanoView.jpg
├── PanoViewManual.pdf
├── PanoramicView
    ├── __init__.py
    └── scPanoView.py
├── README.md
└── setup.py


/11scRNAseq_DatasetSource.txt:
--------------------------------------------------------------------------------
1 | ###################################################################################################
2 | The following is the download information for 11 scRNA-seq datasets used in the PanoView manuscript.
3 | You can find expression values and original cluster assignments from these sources.
4 | ###################################################################################################
5 |
6 | ### Yan et al. ###
7 | 1. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36552
8 | 2. scRNA-Seq Datasets at Hemberg Group/Sanger Institute https://hemberg-lab.github.io/scRNA.seq.datasets/
9 |
10 | ### Goolam et al. ###
11 | 1. ArrayExpress at EMBL-EBI https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3321/
12 | 2. scRNA-Seq Datasets at Hemberg Group/Sanger Institute https://hemberg-lab.github.io/scRNA.seq.datasets/mouse/edev/
13 |
14 | ### Deng et al. ###
15 | 1. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE45719
16 | 2. ArrayExpress at EMBL-EBI https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-45719/
17 | 3. scRNA-Seq Datasets at Hemberg Group/Sanger Institute https://hemberg-lab.github.io/scRNA.seq.datasets/mouse/edev/
18 |
19 | ### Pollen et al. ###
20 | 1. scRNA-Seq Datasets at Hemberg Group/Sanger Institute https://hemberg-lab.github.io/scRNA.seq.datasets/human/tissues/
21 |
22 | ### Patel et al. ###
23 | 1. GiniClust at GitHub https://github.com/lanjiangboston/GiniClust
24 | 2. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57872
25 |
26 | ### Usoskin et al. ###
27 | 1. 
Linnarsson Lab http://linnarssonlab.org/drg/
28 | 2. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE59739
29 |
30 | ### Villani et al. ###
31 | 1. Single Cell Portal at Broad Institute https://singlecell.broadinstitute.org/single_cell
32 | 2. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94820
33 |
34 | ### Zeisel et al. ###
35 | 1. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60361
36 | 2. Linnarsson Lab http://linnarssonlab.org/cortex/
37 |
38 | ### Tirosh et al. ###
39 | 1. Single Cell Portal at Broad Institute https://singlecell.broadinstitute.org/single_cell
40 | 2. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72056
41 |
42 | ### Baron et al. ###
43 | 1. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133
44 | 2. scRNA-Seq Datasets at Hemberg Group/Sanger Institute https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/
45 |
46 | ### Campbell et al. ###
47 | 1. Gene Expression Omnibus https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93374
48 | 2. 
Single Cell Portal at Broad Institute https://singlecell.broadinstitute.org/single_cell

--------------------------------------------------------------------------------
/Codes/GenerateSimData.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import random
4 | from sklearn import datasets
5 | from sklearn.preprocessing import MinMaxScaler
6 |
7 | def Skl_scale(data):  # min-max scale each row (gene) to the range [0, 10000]
8 |     newdata = data.copy()
9 |     for i in newdata.index:
10 |         scaler = MinMaxScaler(feature_range=(0,10000))
11 |         scaler = scaler.fit(newdata.loc[i,:].values.reshape(len(newdata.columns),1))
12 |         newdata.loc[i,:] = scaler.transform(newdata.loc[i,:].values.reshape(len(newdata.columns),1)).reshape(1,len(newdata.columns))
13 |     return(newdata)
14 |
15 |
16 | R_number = random.sample(range(1000), k=20) # random numbers used as seeds
17 | for i in range(20):
18 |     rnumber = R_number[i]
19 |     for j in range(3,23):  # number of clusters: 3 to 22
20 |         blobs = datasets.make_blobs(n_samples=500,n_features=20000,random_state=rnumber,centers=j,cluster_std=1)
21 |         inputdf = pd.DataFrame(data=blobs[0])
22 |         blobs_cluster=pd.DataFrame(data=blobs[1],columns=['cluster']) # the ground truth
23 |         inputdf =Skl_scale(inputdf.transpose()) # expression data for simulation
24 |

--------------------------------------------------------------------------------
/Codes/GenerateSimData_Rare.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import random
4 | from sklearn import datasets
5 | from sklearn.preprocessing import MinMaxScaler
6 |
7 | def Skl_scale(data):  # min-max scale each row (gene) to the range [0, 10000]
8 |     newdata = data.copy()
9 |     for i in newdata.index:
10 |         scaler = MinMaxScaler(feature_range=(0,10000))
11 |         scaler = scaler.fit(newdata.loc[i,:].values.reshape(len(newdata.columns),1))
12 |         newdata.loc[i,:] = scaler.transform(newdata.loc[i,:].values.reshape(len(newdata.columns),1)).reshape(1,len(newdata.columns))
13 |     return(newdata)
14 |
15 |
16 |
17 
| R_number = random.sample(range(1000), k=20)  # random numbers used as seeds
18 |
19 | for r in R_number:
20 |     for j in range(3,16):  # number of clusters: 3 to 15
21 |
22 |         blobs = datasets.make_blobs(n_samples=500,n_features=20000,random_state=r,centers=j,cluster_std=1)
23 |         blobs_cluster=pd.DataFrame(data=blobs[1],columns=['cluster'])
24 |
25 |         inputdf = pd.DataFrame(data=blobs[0])
26 |         inputdf=Skl_scale(inputdf.transpose())
27 |
28 |         clusternumber = random.sample(range(j), k=1)[0]  # pick one cluster (as a scalar label) to make rare
29 |         clustersize = len(blobs_cluster[blobs_cluster.cluster == clusternumber])
30 |         randomnumber = random.sample(range(clustersize), k=round(clustersize*0.9))  # remove 90% of its cells
31 |         RandomCell = blobs_cluster[blobs_cluster.cluster == clusternumber].index[randomnumber]
32 |
33 |         inputdf2=inputdf.drop(labels=RandomCell,axis=1)
34 |         inputdf2.columns=range(len(inputdf2.columns)) # expression data for the simulation of rare cells
35 |
36 |         blobs_cluster2 = blobs_cluster.drop(labels=RandomCell,axis=0)
37 |         blobs_cluster2.index=range(len(blobs_cluster2.index))
38 |         blobs_cluster2=blobs_cluster2.replace(to_replace=clusternumber,value=999) # ground truth
39 |

--------------------------------------------------------------------------------
/ExamplePollen.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mhu10/scPanoView/84d2a1e5c12f146314d2b741d0e3f0e3f63cfb2b/ExamplePollen.zip

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c)
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | 
furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in
13 | all copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21 | THE SOFTWARE.
22 |

--------------------------------------------------------------------------------
/OLMC_animation.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mhu10/scPanoView/84d2a1e5c12f146314d2b741d0e3f0e3f63cfb2b/OLMC_animation.gif

--------------------------------------------------------------------------------
/PanoView.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mhu10/scPanoView/84d2a1e5c12f146314d2b741d0e3f0e3f63cfb2b/PanoView.jpg

--------------------------------------------------------------------------------
/PanoViewManual.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mhu10/scPanoView/84d2a1e5c12f146314d2b741d0e3f0e3f63cfb2b/PanoViewManual.pdf

--------------------------------------------------------------------------------
/PanoramicView/__init__.py:
--------------------------------------------------------------------------------
1 | from PanoramicView import scPanoView
2 |

--------------------------------------------------------------------------------
/PanoramicView/scPanoView.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import seaborn as sns
4 | import matplotlib.pyplot as plt
5 | import matplotlib as mpl
6 | import matplotlib.gridspec as gridspec
7 | from numpy import linalg as LA
8 | from collections import Counter
9 | from scipy.spatial import ConvexHull
10 | from scipy import stats
11 | from scipy.spatial import distance
12 | from scipy.cluster.hierarchy import linkage
13 | from scipy.cluster.hierarchy import dendrogram
14 | from scipy.cluster.hierarchy import fcluster
15 | from scipy.cluster import hierarchy
16 | from sklearn.decomposition import PCA
17 | from sklearn.neighbors import BallTree
18 | from sklearn.manifold import TSNE
19 | from sklearn.preprocessing import MinMaxScaler
20 | from sklearn.preprocessing import normalize
21 | from statsmodels.sandbox.stats.multicomp import multipletests
22 |
23 | import warnings
24 | warnings.filterwarnings("ignore", category=FutureWarning)
25 | np.random.seed(1)
26 |
27 |
28 | def RunPCA(data,n):
29 |     pca = PCA(n_components=n)
30 |     pca.fit(data)
31 |     data_trans = pca.transform(data) ### new coordinates after pca transform
32 |     return(data_trans,pca.explained_variance_ratio_)
33 |
34 |
35 | def gini(data):
36 |     total=0
37 |     for i in data:
38 |         for j in data:
39 |
40 |             total = total + abs(i-j)
41 |     result = total / (2*len(data)*len(data)*np.mean(data))
42 |     return(result)
43 |
44 |
45 | def Skl_scale(data):
46 |     newdata = data.copy()
47 |     for i in newdata.index:
48 |         scaler = MinMaxScaler(feature_range=(-2,2))
49 |         scaler = scaler.fit(newdata.loc[i,:].values.reshape(len(newdata.columns),1))
50 |         newdata.loc[i,:] = scaler.transform(newdata.loc[i,:].values.reshape(len(newdata.columns),1)).reshape(1,len(newdata.columns))
51 |     return(newdata)
52 |
53 |
54 | def OrderCell(data,radius):
55 |     tree = BallTree(data,leaf_size=2)
56 |     Countnumber=[]
57 |     for point in range(len(data)):
58 |         count = 
tree.query_radius(data[point].reshape(1,-1), r=radius, count_only = True) # counting the number of neighbors for each point 59 | Countnumber.append(count) # storing number of neighbors 60 | CountnumberDf = pd.DataFrame(Countnumber,columns =['neighbors']) 61 | return(CountnumberDf) 62 | 63 | 64 | def jitter(a_series, noise_reduction=1000000): 65 | return (np.random.random(len(a_series))*a_series.std()/noise_reduction)-(a_series.std()/(2*noise_reduction)) 66 | 67 | 68 | def HighVarGene(data,z,meangene): 69 | data=data.transpose() 70 | data=data.loc[:,(data!=0).any(axis=0)] 71 | data.loc[:,'average'] = np.mean(data,axis=1) 72 | data.loc[:,'genegroup'] = pd.qcut(data.loc[:,'average'] + jitter(data.loc[:,'average']),20,labels=range(1,21)) 73 | data.loc[:,'variance']= np.var(data.drop('genegroup',axis = 1),axis=1) 74 | data.loc[:,'dispersion']= data.drop('genegroup',axis =1).variance/data.drop('genegroup',axis=1).average 75 | pickdf=[] 76 | for group in range(1,21): 77 | 78 | data.loc[data[data.genegroup == group].index,'zscore'] = pd.DataFrame(stats.zscore(data[data.genegroup == group].dispersion)).fillna(0).values 79 | if len(data[(data.zscore>z)&(data.average>meangene)]) > 0: 80 | pickdf.append(data[(data.zscore>z)&(data.average>meangene)].index) 81 | 82 | if pickdf ==[]: 83 | for group in range(1,21): 84 | data.loc[data[data.genegroup == group].index,'zscore'] = pd.DataFrame(stats.zscore(data[data.genegroup == group].dispersion)).fillna(0).values 85 | if len(data[data.zscore>z]) >0: 86 | pickdf.append(data[(data.zscore>z)].index) 87 | if pickdf==[]: 88 | return([]) 89 | else: 90 | hvg=(np.unique([g for sublist in pickdf for g in sublist])) 91 | if len(hvg) <3: 92 | return([]) 93 | else: 94 | return(hvg) 95 | else: 96 | hvg=(np.unique([g for sublist in pickdf for g in sublist])) 97 | if len(hvg) < 3: 98 | return([]) 99 | else: 100 | return(hvg) 101 | 102 | 103 | def Distohull(xcoord,point,clusthull,clusthullvertices): 104 | disttovertices = [LA.norm(xcoord[i] - 
xcoord[point]) for i in clusthull.iloc[clusthullvertices,:].index ] 105 | if np.min(disttovertices) < np.mean(distance.pdist(xcoord[clusthull.iloc[clusthullvertices,:].index])): 106 | outcheck = 0 # assign to cluster 107 | outvalue = np.min(disttovertices) 108 | else: 109 | outcheck = 1 # not belong to cluster 110 | outvalue = 9999999999999 111 | return(outcheck,outvalue) 112 | 113 | 114 | def Findlocalmax(countdataframe,xcoordinate,bin): 115 | clusters = [] 116 | newdataframe = countdataframe 117 | distohighest = [LA.norm(xcoordinate[point] - xcoordinate[np.argmax(newdataframe.neighbors)]) for point in newdataframe.index] 118 | newdataframe = newdataframe.assign(dist = distohighest) 119 | hist = np.histogram(distohighest,bin) 120 | firstclust = newdataframe[newdataframe.dist < hist[1][1]] 121 | clust_1 = list(firstclust.index) 122 | if len(clust_1) < 4: 123 | return(False) 124 | hull = ConvexHull(xcoordinate[firstclust.index],qhull_options ='QJ') 125 | clusters.append([firstclust,clust_1,hull]) 126 | 127 | tempclust = newdataframe[newdataframe.dist >= hist[1][1]] 128 | checkpoint = len(tempclust) 129 | while checkpoint > 0: 130 | checkpoint = 0 131 | neighbornumber = np.unique(tempclust.neighbors)[::-1] 132 | newcenter = [] 133 | for i in neighbornumber: 134 | for cell in tempclust[tempclust.neighbors ==i].index: 135 | checkgroup = [] 136 | checkvalue = [] 137 | for k in range(len(clusters)): 138 | check = Distohull(xcoordinate,cell,clusters[k][0],clusters[k][2].vertices) 139 | checkgroup.append(check[0]) 140 | checkvalue.append(check[1]) 141 | if 0 not in checkgroup: 142 | newcenter.append(cell) 143 | break 144 | else: 145 | tempclust = tempclust.drop(cell) 146 | clusters[np.argmin(checkvalue)][1].append(cell) 147 | clusters[np.argmin(checkvalue)][0] = newdataframe.loc[clusters[np.argmin(checkvalue)][1],:] 148 | clusters[np.argmin(checkvalue)][2] = ConvexHull(xcoordinate[clusters[np.argmin(checkvalue)][1]],qhull_options ='QJ') 149 | 150 | if newcenter !=[]: 151 | 
break 152 | elif i == 1: 153 | return(clusters) 154 | 155 | distohighest = [LA.norm(xcoordinate[point] - xcoordinate[newcenter[0]]) for point in tempclust.index] 156 | tempclust=tempclust.assign(dist=distohighest) 157 | hist = np.histogram(distohighest,bin) 158 | cluster2 = tempclust[tempclust.dist < hist[1][1]] 159 | clust_2 = list(cluster2.index) 160 | if len(clust_2) < 4: 161 | return(clusters) 162 | else: 163 | hull2 = ConvexHull(xcoordinate[cluster2.index],qhull_options ='QJ') 164 | clusters.append([cluster2,clust_2,hull2]) 165 | tempclust = tempclust[tempclust.dist >= hist[1][1]] 166 | checkpoint = len(tempclust) 167 | return(clusters) 168 | 169 | 170 | 171 | class Panoite: 172 | 173 | def __init__(self,expression): 174 | self.expression = expression 175 | self.genespace=[] 176 | self.membership = pd.DataFrame({'L1Cluster':0},index=list(range(len(self.expression)))) 177 | self.stopite = False 178 | 179 | def generate_clusters(self,lowgene,zscore): 180 | CellMaximumn=1000 181 | ginicutoff = 0.05 182 | Bc= 20 183 | Bg=20 184 | maxbb = 20 185 | findvarg = HighVarGene(self.expression,zscore,lowgene) 186 | if len(findvarg) > 0: 187 | self.genespace.append(findvarg) 188 | subdf = self.expression.loc[:,self.genespace[-1]] 189 | 190 | elif len(findvarg) == 0: 191 | self.stopite = True 192 | return() 193 | 194 | subdf=Skl_scale(subdf) 195 | pcaspace = RunPCA(subdf.values.astype(float),3)[0] 196 | Radius = np.histogram(distance.pdist(pcaspace),Bc)[1][1] 197 | temppca = pcaspace 198 | Ordercell = OrderCell(temppca,Radius) 199 | bb=0 200 | opt_bins=True 201 | lm_number = 1 202 | while opt_bins == True: 203 | bb = bb+1 204 | if len(temppca) >CellMaximumn: 205 | localmax = Findlocalmax(Ordercell,temppca,Bg) 206 | opt_bins = False 207 | 208 | else: 209 | localmax = Findlocalmax(Ordercell,temppca,5*bb) 210 | 211 | if localmax == False: 212 | if bb == 1: 213 | self.membership.L1Cluster= 1 214 | self.CSIZE = len(self.membership) 215 | return() 216 | else: 217 | localmax = 
Findlocalmax(Ordercell,temppca,5*(bb-1)) 218 | opt_bins = False 219 | 220 | lm_number_next = len(localmax) 221 | if bb > maxbb and len(temppca) = lm_number: 224 | lm_number = lm_number_next 225 | elif lm_number_next < lm_number and len(temppca) ginicutoff) 273 | while check_iteation == True: 274 | 275 | if all(Gini.values < ginicutoff): 276 | check_iteation = False 277 | 278 | else: 279 | clust_pick_auto = list(set(np.unique(Ordercell.cluster))) 280 | clust_keep_auto = Eva.idxmin().values 281 | clust_pick_auto.remove(clust_keep_auto) 282 | 283 | nextdf = self.expression.loc[self.membership[self.membership.L1Cluster.isin(clust_pick_auto)].index,:] 284 | 285 | print('Percentage of cells being analyzed: '+str(round((1-(float(len(nextdf))/float(len(self.expression))))*100))+'%') 286 | 287 | findvarg = HighVarGene(nextdf,zscore,lowgene) 288 | 289 | if len(findvarg) > 0: 290 | self.genespace.append(findvarg) 291 | subdf = self.expression.loc[nextdf.index,self.genespace[-1]] 292 | 293 | elif len(findvarg) == 0: 294 | self.stopite = True 295 | return() 296 | 297 | subdf=Skl_scale(subdf) 298 | pcaspace = RunPCA(subdf.values.astype(float),3)[0] 299 | Radius = np.histogram(distance.pdist(pcaspace),Bc)[1][1] 300 | temppca = pcaspace 301 | Ordercell = OrderCell(temppca,Radius) 302 | bb=0 303 | opt_bins=True 304 | lm_number = 1 305 | while opt_bins == True: 306 | bb = bb+1 307 | 308 | if len(temppca) >CellMaximumn: 309 | localmax = Findlocalmax(Ordercell,temppca,Bg) 310 | opt_bins = False 311 | else: 312 | localmax = Findlocalmax(Ordercell,temppca,5*bb) 313 | 314 | if localmax == False: 315 | if bb == 1: 316 | self.membership.loc[nextdf.index,'L1Cluster'] = np.max(self.membership.L1Cluster)+1 317 | opt_bins = False 318 | else: 319 | localmax = Findlocalmax(Ordercell,temppca,5*(bb-1)) 320 | opt_bins = False 321 | 322 | lm_number_next = len(localmax) 323 | if bb > maxbb and len(temppca) = lm_number: 327 | lm_number = lm_number_next 328 | elif lm_number_next < lm_number and 
len(temppca) ginicutoff)
377 |
378 |
379 | class PanoView:
380 |
381 |     def __init__(self,filename,annotation=None):
382 |
383 |         expression=pd.read_csv(filename+'.csv',index_col=0)
384 |
385 |         if annotation is not None:
386 |             self.cell_anno = pd.read_csv(annotation+'.csv',index_col=0).values
387 |
388 |         self.raw_exp = expression
389 |         self.log_exp =[]
390 |         self.cell_id =[]
391 |         self.vargene =[]
392 |         self.cell_clusters=[]
393 |         self.cell_membership=[]
394 |         self.sim_matrix=[]
395 |         self.tsne2d=np.array([])
396 |         self.vg_stat=[]
397 |         self.L1cell_color=[]
398 |         self.L1cell_dendro_order=[]
399 |         self.L1cluster_color=[]
400 |         self.L2cell_color=[]
401 |         self.L2cell_dendro_order=[]
402 |         self.L2cluster_color=[]
403 |
404 |
405 |     def RunSearching(self,Normal=True,Log2=True,GeneLow='default',Zscore='default'):
406 |
407 |         self.raw_exp.index = self.raw_exp.index.astype(str)  # assign back; astype returns a new index
408 |         generepeat = len(self.raw_exp.index) != len(np.unique(self.raw_exp.index))
409 |         while generepeat == True:  # suffix duplicated gene names until all are unique
410 |             self.raw_exp.index = self.raw_exp.index.where(~self.raw_exp.index.duplicated(), self.raw_exp.index + '_dp')
411 |             generepeat = len(self.raw_exp.index) != len(np.unique(self.raw_exp.index))
412 |
413 |         if Normal == False:
414 |             if Log2 == False:
415 |                 self.log_exp = self.raw_exp.transpose()
416 |                 expression = self.log_exp.loc[:,(self.log_exp!=0).any(axis=0)]
417 |             else:
418 |                 self.log_exp = np.log2(1+self.raw_exp.transpose())
419 |                 expression = self.log_exp.loc[:,(self.log_exp!=0).any(axis=0)]
420 |         else:
421 |
422 |             self.raw_exp=self.raw_exp.loc[(self.raw_exp!=0).any(axis=1),:]
423 |             raw_norm = pd.DataFrame(normalize(self.raw_exp,norm='l1',axis=0)*self.raw_exp.sum().median(),index=self.raw_exp.index,columns=self.raw_exp.columns)
424 |             self.log_exp = np.log2(1+raw_norm.transpose())
425 |             expression = self.log_exp
426 |
427 |         self.cell_id = expression.index
428 |         self.gene_id = expression.columns
429 |
430 |         expression.index=range(len(expression))
431 |         VarGene=[]
432 |
433 |         if GeneLow != 'default':
434 | 
GeneLow = GeneLow 435 | else: 436 | GeneLow = 0.5 437 | 438 | if Zscore != 'default': 439 | Zscore = Zscore 440 | else: 441 | Zscore = 1.5 442 | 443 | result = Panoite(expression) 444 | result.generate_clusters(GeneLow,Zscore) 445 | 446 | finalcluster=[] 447 | VarGene.append(np.unique([gene for sublist in result.genespace for gene in sublist])) 448 | for i in np.unique(result.membership.L1Cluster): 449 | finalcluster.append(result.membership[result.membership.L1Cluster ==i].index.values) 450 | self.vargene= np.unique([gene for sublist in VarGene for gene in sublist]) 451 | self.cell_clusters = finalcluster 452 | print("Clusters generated ") 453 | 454 | 455 | def OutputResult(self,clust_merge='default',metric_dis='default',fclust_height= 'default', init='default',n_PCs='default'): 456 | 457 | if clust_merge != 'default': 458 | clust_merge=clust_merge; 459 | else: 460 | clust_merge = 0.2 461 | 462 | if metric_dis != 'default': 463 | metric_dis = 2; 464 | else: 465 | metric_dis = 1 466 | 467 | if fclust_height != 'default': 468 | fclust_height = fclust_height 469 | else: 470 | fclust_height = 0.2 471 | 472 | if init != 'default': 473 | init = 'random' 474 | else: 475 | init = 'pca' 476 | 477 | if n_PCs != 'default': 478 | n_PCs = n_PCs 479 | else: 480 | n_PCs = 15 481 | 482 | expression= self.log_exp 483 | expression.index=range(len(expression)) 484 | expression = Skl_scale(expression) 485 | 486 | cluster_list=[] 487 | for i in range(len(self.cell_clusters)): 488 | cluster_list.append(expression.loc[self.cell_clusters[i],self.vargene]) 489 | 490 | cluster_list_oroginal = cluster_list 491 | 492 | sim_mat = pd.DataFrame(0,index=np.arange(1,len(cluster_list)+1),columns=np.arange(1,len(cluster_list)+1)) 493 | 494 | for i in sim_mat.index: 495 | ci = cluster_list[i-1] 496 | for j in sim_mat.index: 497 | cj = cluster_list[j-1] 498 | if metric_dis == 1: 499 | 500 | sim_mat.loc[i,j] = np.mean(distance.cdist(ci,cj,metric='correlation')) 501 | elif metric_dis ==2: 502 | 503 | 
sim_mat.loc[i,j] = np.mean(distance.cdist(ci,cj,metric='euclidean')) 504 | 505 | sim_mat_original = sim_mat 506 | clustdis_within = pd.DataFrame(np.diagonal(sim_mat),index=sim_mat.index,columns=['dis']) 507 | 508 | 509 | linkage_matrix = linkage(sim_mat,method="ward") 510 | cutree = pd.DataFrame(data=hierarchy.cut_tree(linkage_matrix , n_clusters=None, height=None)+1,index=range(1,len(cluster_list)+1)) 511 | 512 | init_lenth=0 513 | Lgroups=[] 514 | MergeCluster=[] 515 | 516 | for step in range(1,len(cutree.columns)): 517 | 518 | repeats=[item for item, count in Counter(cutree[step]).items() if count > 1] 519 | check_lenth = len(repeats) 520 | 521 | lgroups=[] 522 | for i in repeats: 523 | lgroups.append(cutree[step][cutree[step]==i].index.tolist()) 524 | Lgroups.append(lgroups) 525 | 526 | if check_lenth > init_lenth: 527 | cgroups=[] 528 | for i in repeats: 529 | cgroups.append(cutree[step][cutree[step]==i].index.tolist()) 530 | 531 | for g in cgroups: 532 | if len(g) ==2: 533 | pairdis = sim_mat.loc[g[0],g[1]] 534 | 535 | if abs(pairdis - clustdis_within.loc[g[0]].dis) <= clustdis_within.min().dis*clust_merge or abs(pairdis - clustdis_within.loc[g[1]].dis) <= clustdis_within.min().dis*clust_merge: 536 | if g not in MergeCluster: 537 | MergeCluster.append(g) 538 | 539 | elif check_lenth < init_lenth: 540 | break 541 | else: 542 | clust=[i for i in Lgroups[step-1] if i not in Lgroups[step-2]] 543 | clust=[i for sublist in clust for i in sublist] 544 | clust_to_merge = [s for s in MergeCluster if bool(set(s) & set(clust))] 545 | clust_to_merge=[i for sublist in clust_to_merge for i in sublist] 546 | clust_add = [s for s in clust if s not in clust_to_merge][0] 547 | 548 | for c in clust_to_merge: 549 | 550 | pairdis = sim_mat.loc[c,clust_add] 551 | 552 | if abs(pairdis - clustdis_within.loc[c].dis) <= clustdis_within.min().dis*clust_merge or abs(pairdis - clustdis_within.loc[clust_add].dis) <= clustdis_within.min().dis*clust_merge: 553 | 554 | 
MergeCluster.remove(clust_to_merge) 555 | clust_to_merge.append(clust_add) 556 | MergeCluster.append(clust_to_merge) 557 | break 558 | init_lenth = check_lenth 559 | 560 | mergelist = MergeCluster 561 | 562 | if len(mergelist) !=0: 563 | for i in mergelist: 564 | frame = [cluster_list[s-1] for s in i ] 565 | cluster_list.append(pd.concat(frame)) 566 | 567 | rmlist = [item for sublist in mergelist for item in sublist] 568 | cluster_list_new = [v for i, v in enumerate(cluster_list) if i+1 not in rmlist] 569 | 570 | sim_mat = pd.DataFrame(0,index=np.arange(1,len(cluster_list_new)+1),columns=np.arange(1,len(cluster_list_new)+1)) 571 | for i in sim_mat.index: 572 | ci = cluster_list_new[i-1] 573 | for j in sim_mat.index: 574 | cj = cluster_list_new[j-1] 575 | 576 | if metric_dis == 1: 577 | sim_mat.loc[i,j] = np.mean(distance.cdist(ci,cj,metric='correlation')) 578 | elif metric_dis ==2: 579 | sim_mat.loc[i,j] = np.mean(distance.cdist(ci,cj,metric='euclidean')) 580 | 581 | cluster_list = cluster_list_new 582 | 583 | if len(sim_mat) ==1: 584 | cluster_list = cluster_list_oroginal 585 | linkage_matrix = linkage(sim_mat_original,method="ward") 586 | sim_mat=sim_mat_original 587 | 588 | else: 589 | linkage_matrix = linkage(sim_mat,method="ward") 590 | 591 | dn1=dendrogram(linkage_matrix,distance_sort='descending',leaf_font_size=24,labels=sim_mat.index,color_threshold=0.01,no_plot=True) 592 | 593 | dfheat_order=[] 594 | for i in dn1['leaves']: 595 | dfheat_order.append(cluster_list[i]) 596 | 597 | membership = pd.DataFrame({'L1Cluster':0},index=list(range(len(expression)))) 598 | for i in range(len(dfheat_order)): 599 | membership.loc[dfheat_order[i].index,'L1Cluster'] = dn1['leaves'][i]+1 600 | 601 | self.L1clust_dendro=dn1['leaves'] 602 | 603 | membership=membership.astype(int) 604 | membership.loc[:,'Cell_ID'] = self.cell_id 605 | self.cell_membership = membership 606 | dn1_linkage_matrix_dn1=linkage_matrix 607 | 608 | colormaps = sns.color_palette("hls", 
len(np.unique(self.cell_membership.L1Cluster))) 609 | cellgroup=[] 610 | heatcolor = [] 611 | cluster_color=[] 612 | for i in self.L1clust_dendro: 613 | cellgroup.append(self.cell_membership[self.cell_membership.L1Cluster == (i+1)].index) 614 | cluster_color.append([(i+1),colormaps[i]]) 615 | for j in range(len(self.cell_membership[self.cell_membership.L1Cluster == (i+1)].index)): 616 | heatcolor.append(colormaps[i]) 617 | 618 | self.L1cell_color = heatcolor 619 | self.L1cell_dendro_order = [item for sublist in cellgroup for item in sublist] 620 | self.L1cluster_color=cluster_color 621 | 622 | if fclust_height != 0.2: 623 | fclust_height = fclust_height; 624 | 625 | assignments = fcluster(dn1_linkage_matrix_dn1,fclust_height,'distance') 626 | Final_cluster = pd.DataFrame({'cluster':assignments}) 627 | Final_cluster.index=sim_mat.index 628 | Final_cluster = Final_cluster.iloc[self.L1clust_dendro] 629 | 630 | for i in Final_cluster.index: 631 | self.cell_membership.loc[self.cell_membership[self.cell_membership.L1Cluster ==i].index,'L2Cluster'] = Final_cluster.loc[i,'cluster'] 632 | self.cell_membership.loc[:,'L2Cluster'] =self.cell_membership.loc[:,'L2Cluster'].astype(int) 633 | self.cell_membership=self.cell_membership[['Cell_ID','L1Cluster','L2Cluster']] 634 | self.cell_membership.to_csv('Cell_Membership.csv') 635 | 636 | print("Output Cell_Membership.csv ") 637 | 638 | 639 | sns.set_style(style="white") 640 | fig = plt.figure(figsize=(18, 12)) 641 | #plt.figure(figsize=(18,12)) 642 | gs = gridspec.GridSpec(2, 2) 643 | gs.update(wspace=0.05) 644 | 645 | ax1 = plt.subplot(gs[:, 0]) 646 | dendrogram(dn1_linkage_matrix_dn1,distance_sort='descending',leaf_font_size=18,labels=sim_mat.index,color_threshold=0.01,above_threshold_color='grey') 647 | ax1_1 = plt.gca() 648 | xlbls = ax1_1.get_xmajorticklabels() 649 | c=0 650 | for lbl in xlbls: 651 | lbl.set_color(self.L1cluster_color[c][1]) 652 | c=c+1 653 | ax1.spines['top'].set_visible(False) 654 | 
        ax1.spines['right'].set_visible(False)
        ax1.spines['bottom'].set_visible(False)
        plt.axhline(y=fclust_height, color='grey', linestyle='--', linewidth=1.45)

        # 2-D t-SNE embedding of the scaled variable genes after PCA reduction
        expression = self.log_exp
        result = self.cell_membership
        tsnedata = Skl_scale(expression.loc[:, self.vargene])
        tmodel = TSNE(n_components=2, random_state=1, init=init)
        if n_PCs != 15:
            n_PCs = n_PCs  # no-op kept from the original source
        tsnedata = RunPCA(tsnedata, n_PCs)[0]
        tcoord = tmodel.fit_transform(tsnedata)
        self.tsne2d = tcoord

        # Upper-right panel: cells colored by L1 cluster
        sns.set_style(style="white")
        ax2 = plt.subplot(gs[0, 1])
        cluster_colors = [i[1] for i in self.L1cluster_color]
        j = 0

        for i in [i[0] for i in self.L1cluster_color]:
            idx = result[result.L1Cluster == i].index
            ax2.scatter(tcoord[idx, 0], tcoord[idx, 1], color=cluster_colors[j], s=50, label=i)
            j = j + 1

        # Use a two-column legend when there are more than 15 clusters
        if len(np.unique(result.L1Cluster)) > 15:
            ncol = 2
        else:
            ncol = 1

        plt.legend(prop={'size': 11}, bbox_to_anchor=(0.95, 1.05), ncol=ncol, loc='upper left', frameon=False)
        plt.grid()
        plt.title('L1 Cluster', fontsize=16)
        plt.xticks([])
        plt.yticks([])
        ax2.spines['top'].set_visible(False)
        ax2.spines['right'].set_visible(False)

        # Lower-right panel: cells colored by L2 cluster
        sns.set_style(style="white")
        ax3 = plt.subplot(gs[1, 1])
        cluster_colors = sns.color_palette("hls", len(np.unique(result.L2Cluster)))
        # Unique L2 cluster labels in order of first appearance
        l2order = []
        for item in Final_cluster.cluster:
            if item not in l2order:
                l2order.append(item)
        j = 0

        heatcolor2 = []
        cluster_color2 = []
        cellgroup2 = []
        for i in l2order:
            idx = result[result.L2Cluster == i].index
            cluster_color2.append([i, cluster_colors[j]])
            clabel = str(i) + " " + str([i for i in Final_cluster[Final_cluster.cluster == i].index])
            ax3.scatter(tcoord[idx, 0], tcoord[idx, 1], color=cluster_colors[j], s=50, label=clabel)
            heatcolor2.append([cluster_colors[j] for k in range(len(idx))])
            cellgroup2.append(idx)
            j = j + 1

        # Per-cell colors and cell order, reused by HeatMapVGs
        self.L2cell_color = [item for sublist in heatcolor2 for item in sublist]
        self.L2cell_dendro_order = [item for sublist in cellgroup2 for item in sublist]
        self.L2cluster_color = cluster_color2

        if len(np.unique(result.L2Cluster)) > 15:
            ncol = 2
        else:
            ncol = 1

        plt.legend(prop={'size': 11}, bbox_to_anchor=(0.95, 1.05), ncol=ncol, loc='upper left', frameon=False)
        plt.grid()
        plt.title('L2 Cluster (---)', fontsize=16, color='grey')
        plt.xticks([])
        plt.yticks([])
        ax3.spines['top'].set_visible(False)
        ax3.spines['right'].set_visible(False)
        plt.savefig('PanoView_output', dpi=fig.dpi)
        print("Output PanoView_output.png")


    def VisCluster(self, clevel, cnumber):
        # Highlight one cluster (blue) against all others (gray) on the t-SNE map
        result = self.cell_membership
        tcoord = self.tsne2d

        clustnumber1 = np.unique(result.L1Cluster)
        clustnumber2 = np.unique(result.L2Cluster)

        sns.set_style(style="white")
        fig = plt.figure(figsize=(10, 10))

        if clevel == 1:

            if cnumber not in clustnumber1:
                print('cnumber not found, please input a different cnumber')

            else:
                for i in clustnumber1:
                    idx = result[result.L1Cluster == i].index
                    if i == cnumber:
                        plt.scatter(tcoord[idx, 0], tcoord[idx, 1], color='b', s=50, label=i)
                    else:
                        plt.scatter(tcoord[idx, 0], tcoord[idx, 1], color='gray', s=50, label=i)

                plt.legend(loc='upper left', prop={'size': 16}, bbox_to_anchor=(0.99, 1), ncol=1, frameon=False)

        elif clevel == 2:

            if cnumber not in clustnumber2:
                print('cnumber not found, please input a different cnumber')

            else:
                for i in clustnumber2:
                    idx = result[result.L2Cluster == i].index
                    if i == cnumber:
                        plt.scatter(tcoord[idx, 0], tcoord[idx, 1], color='b', s=50, label=i)
                    else:
                        plt.scatter(tcoord[idx, 0], tcoord[idx, 1], color='gray', s=50, label=i)

                plt.legend(loc='upper left', prop={'size': 16}, bbox_to_anchor=(0.99, 1), ncol=1, frameon=False)

        else:
            print('clevel not found, please input a different clevel')

        plt.savefig('cluster_%s.png' % cnumber, dpi=fig.dpi)


    def VisClusterAnno(self):
        # Color the t-SNE map by the user-supplied cell annotation
        annotation = pd.DataFrame(self.cell_anno, columns=['anno'])
        cluster_id = np.unique(annotation)
        tcoord = self.tsne2d
        sns.set_style(style="white")
        fig = plt.figure(figsize=(16, 10))
        cluster_colors = sns.color_palette("hls", len(cluster_id))
        j = 0
        for i in cluster_id:
            idx = annotation[annotation.anno == i].index
            plt.scatter(tcoord[idx, 0], tcoord[idx, 1], color=cluster_colors[j], s=50, label=i)
            j = j + 1

        plt.legend(prop={'size': 14}, bbox_to_anchor=(0.99, 1), loc='upper left', frameon=False)
        plt.grid()
        plt.xticks([])
        plt.yticks([])
        plt.savefig('Clusters_Annotation', dpi=fig.dpi)


    def VisGeneExp(self, genes):
        # Save one t-SNE plot per gene, colored by normalized expression
        markers = genes
        markerdata = self.log_exp
        markerdata = normalize(markerdata, norm='max')  ## normalization
        markerdata = pd.DataFrame(markerdata, index=self.log_exp.index, columns=self.log_exp.columns)

        for i in range(len(markers)):
            sns.set_style(style="white")
            fig = plt.figure(figsize=(10, 10))
            plt.suptitle(markers[i], fontsize=36)
            plt.scatter(self.tsne2d[:, 0], self.tsne2d[:, 1], c=markerdata.loc[:, markers[i]], s=50, cmap='BuPu', edgecolor='gray', alpha=0.3)
            plt.savefig('%s.png' % markers[i], dpi=fig.dpi)


    def RunVGs(self, clevel):
        # Kruskal-Wallis test per gene across clusters, restricted to genes whose
        # cluster means differ by at least 1, followed by Benjamini-Hochberg correction
        pvalue = []
        logFD = []
        datafordeg = self.log_exp.loc[:, (self.log_exp != 0).any(axis=0)]
        for i in datafordeg.columns:
            groups = []
            fdgroups = []
            if clevel == 1:
                for j in np.unique(self.cell_membership.L1Cluster):
                    idx = self.cell_membership[self.cell_membership.L1Cluster == j].index
                    groups.append(datafordeg.loc[idx, i])
                    fdgroups.append(datafordeg.loc[idx, i].mean())
                if max(fdgroups) - min(fdgroups) >= 1:
                    pvalue.append([i, stats.kruskal(*groups)[1]])
                    logFD.append(max(fdgroups) - min(fdgroups))
            elif clevel == 2:
                for j in np.unique(self.cell_membership.L2Cluster):
                    idx = self.cell_membership[self.cell_membership.L2Cluster == j].index
                    groups.append(datafordeg.loc[idx, i])
                    fdgroups.append(datafordeg.loc[idx, i].mean())
                if max(fdgroups) - min(fdgroups) >= 1:
                    pvalue.append([i, stats.kruskal(*groups)[1]])
                    logFD.append(max(fdgroups) - min(fdgroups))

        DEGstat = pd.DataFrame(pvalue, columns=['gene', 'pvalue']).fillna(1)
        p_value_adj = multipletests(DEGstat.iloc[:, 1], alpha=0.05, method='fdr_bh')
        DEGstat.loc[:, 'Padj'] = p_value_adj[1]
        DEGstat.loc[:, 'log2FD'] = logFD
        self.vg_stat = DEGstat


    def HeatMapVGs(self, pval, number, fd, clevel, genes_add=None):

        if clevel == 1:
            cellcolor = self.L1cell_color
            cellorder = self.L1cell_dendro_order
        elif clevel == 2:
            cellcolor = self.L2cell_color
            cellorder = self.L2cell_dendro_order

        # Top genes by adjusted p-value that pass both thresholds
        topgene = self.vg_stat.query("Padj <@pval & log2FD >@fd").sort_values(by='Padj')[:number].gene.tolist()
        if genes_add is not None:
            topgene = topgene + genes_add
            topgene = list(set(topgene))
        df = self.log_exp.loc[cellorder, topgene]

        # Reorder genes (columns) by complete-linkage hierarchical clustering
        linkage_matrix = linkage(df.T, method='complete')
        dn = dendrogram(linkage_matrix, show_leaf_counts=True, orientation='left', no_plot=True)
        df = df.iloc[:, dn['leaves']]

        sns.set_style(style="white")
        FigHeat = plt.figure(figsize=(12, 12))

        ax = FigHeat.add_subplot(111)
        cax = ax.matshow(df, aspect='auto', cmap='BuPu')
        cbr = plt.colorbar(cax, fraction=0.02, pad=0.05)
        cbr.ax.set_title(r'$\log2$', fontsize=10)  # raw string avoids the invalid '\l' escape
        cbr.outline.set_visible(False)

        ax.set_xticks(range(len(topgene)))
        ax.set_xticklabels(df.columns, fontsize=12, rotation=90)
        ax.set_yticks([])
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['bottom'].set_visible(False)
        ax.spines['left'].set_visible(False)

        # Side bar showing the cluster color of each cell (row)
        axbar = FigHeat.add_axes([0.1, 0.11, 0.02, 0.770])
        cmap1 = mpl.colors.ListedColormap(cellcolor[::-1])
        cbar = mpl.colorbar.ColorbarBase(axbar, cmap=cmap1, orientation='vertical', ticks=[])
        cbar.outline.set_visible(False)
        plt.savefig('HeatmapVGs', dpi=FigHeat.dpi)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Single-cell Panoramic View Clustering (PanoView) #

**PanoView** is an iterative PCA-based method that integrates a novel density-based clustering algorithm, ordering local maximum by convex hull (OLMC), to identify cell subpopulations in single-cell RNA-sequencing data. For details of the method, please see our paper in PLOS Computational Biology (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007040).




![PanoView](https://github.com/mhu10/scPanoView/blob/master/PanoView.jpg)

:heavy_plus_sign:


## Installation ##
**PanoView** is a Python module that uses common Python libraries such as *numpy*, *scipy*, *pandas*, and *scikit-learn* to implement the proposed algorithm. Before installing **PanoView** from the GitHub repository, make sure that Git is installed; if it is not, see https://git-scm.com/ for installation instructions.
To install **PanoView** on your local computer, open a command prompt and type the following:

```
pip install git+https://github.com/mhu10/scPanoView.git#egg=scPanoView
```

This installs all the Python libraries required to run **PanoView**. To test the installation, open the Python interpreter or your preferred IDE (*Spyder*, *PyCharm*, *Jupyter*, etc.) and type the following:

```
from PanoramicView import scPanoView
```
The import should complete without any error message.

Note: PanoView was implemented and tested with Python 3.6.



## Tutorial ##

Please refer to the manual (*"PanoViewManual.pdf"*) for details on running the **PanoView** algorithm in Python.

To run the tutorial in the manual, download the example dataset (*"ExamplePollen.zip"*) and unzip it into your Python working directory.
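
The two setup steps above (checking the installation and unpacking the example dataset) can also be done from Python. The following is only a minimal sketch using the standard library: the helper names `panoview_installed` and `unpack_example` are hypothetical and not part of **PanoView**, and the sketch assumes *"ExamplePollen.zip"* has already been downloaded into the working directory.

```python
import importlib.util
import zipfile
from pathlib import Path


def panoview_installed():
    """Return True if the PanoramicView package can be located (hypothetical helper)."""
    # find_spec only looks the package up; it does not import it.
    return importlib.util.find_spec("PanoramicView") is not None


def unpack_example(archive="ExamplePollen.zip", dest="."):
    """Extract the archive into dest and return the member names (hypothetical helper)."""
    archive = Path(archive)
    if not archive.is_file():
        raise FileNotFoundError(f"{archive} not found; download it from the repository first")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

Calling `panoview_installed()` before `unpack_example()` gives an early, readable failure if the `pip install` step was skipped; the names of the extracted files are whatever the archive contains and are not assumed here.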

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup, find_packages

setup(
    name='scPanoView',
    version='0.1',
    packages=find_packages(exclude=['tests*']),
    license='MIT',
    description='A single-cell clustering algorithm',
    long_description=open('README.md').read(),
    install_requires=['numpy>=1.13','pandas>=0.20','scipy>=0.19','matplotlib','seaborn>=0.8','scikit-learn>=0.19','statsmodels>=0.8'],
    url='https://github.com/mhu10/scPanoView',
    author='Ming-Wen Hu & Jiang Qian',
    author_email='mhu10@jhmi.edu,jiang.qian@jhmi.edu'
)

--------------------------------------------------------------------------------