├── LICENSE ├── New_predicted_genes.png ├── README.md ├── SpaGE ├── dimensionality_reduction.py ├── main.py └── principal_vectors.py ├── SpaGE_Tutorial.ipynb ├── benchmark ├── MERFISH_Moffit │ ├── Liger │ │ └── LIGER.R │ ├── Performance_evaluation.py │ ├── Seurat │ │ ├── MERFISH.R │ │ ├── MERFISH_integration.R │ │ └── Moffit_RNA.R │ ├── SpaGE │ │ ├── Integration.py │ │ ├── MERFISH_data_preprocessing.py │ │ ├── Moffit_RNA_preprocessing.py │ │ ├── Precise_output.py │ │ ├── dimensionality_reduction.py │ │ └── principal_vectors.py │ └── gimVI │ │ └── gimVI.py ├── Performance_evaluation_osmFISH.py ├── Precise_output_all.py ├── STARmap_AllenVISp │ ├── DownSampling │ │ ├── DownSampling.py │ │ └── DownSampling_evaluation.py │ ├── Liger │ │ └── LIGER.R │ ├── Performance_evaluation.py │ ├── Seurat │ │ ├── allen_brain.R │ │ ├── impute_starmap.R │ │ ├── starmap.R │ │ └── starmap_integration.R │ ├── SpaGE │ │ ├── Allen_data_preprocessing.py │ │ ├── Integration.py │ │ ├── Precise_output.py │ │ ├── Starmap_data_preprocessing.py │ │ ├── dimensionality_reduction.py │ │ ├── principal_vectors.py │ │ └── viz.py │ ├── Starmap_plots.R │ └── gimVI │ │ └── gimVI.py ├── Timing_Evaluation.py ├── osmFISH_AllenSSp │ ├── Liger │ │ └── LIGER.R │ ├── Performance_evaluation.py │ ├── Seurat │ │ ├── allen_brain.R │ │ ├── impute_osmFISH.R │ │ ├── osmFISH.R │ │ └── osmFISH_integration.R │ ├── SpaGE │ │ ├── Allen_data_preprocessing.py │ │ ├── Integration.py │ │ ├── Precise_output.py │ │ ├── dimensionality_reduction.py │ │ ├── osmFISH_data_preprocessing.py │ │ └── principal_vectors.py │ └── gimVI │ │ └── gimVI.py ├── osmFISH_AllenVISp │ ├── Liger │ │ └── LIGER.R │ ├── Performance_evaluation.py │ ├── Seurat │ │ ├── allen_brain.R │ │ ├── impute_osmFISH.R │ │ ├── osmFISH.R │ │ └── osmFISH_integration.R │ ├── SpaGE │ │ ├── Allen_data_preprocessing.py │ │ ├── Integration.py │ │ ├── Precise_output.py │ │ ├── dimensionality_reduction.py │ │ ├── osmFISH_data_preprocessing.py │ │ └── principal_vectors.py │ └── 
gimVI │ │ └── gimVI.py ├── osmFISH_Ziesel │ ├── Liger │ │ └── LIGER.R │ ├── Performance_evaluation.py │ ├── Seurat │ │ ├── Zeisel_Cortex.R │ │ ├── impute_osmFISH.R │ │ ├── osmFISH.R │ │ └── osmFISH_integration.R │ ├── SpaGE │ │ ├── Integration.py │ │ ├── Linnarson_data_preprocessing.py │ │ ├── Precise_output.py │ │ ├── dimensionality_reduction.py │ │ └── principal_vectors.py │ └── gimVI │ │ └── gimVI.py └── seqFISH_AllenVISp │ ├── DownSampling │ ├── DownSampling.py │ └── DownSampling_evaluation.py │ ├── Performance_evaluation.py │ └── SpaGE │ ├── Integration.py │ ├── Precise_output.py │ ├── dimensionality_reduction.py │ ├── principal_vectors.py │ └── seqFISH_data_preprocessing.py ├── dimensionality_reduction.py └── principal_vectors.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 tabdelaal 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 |
--------------------------------------------------------------------------------
/New_predicted_genes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tabdelaal/SpaGE/edd0fb7086eca791c6589020893eb4f034b195c7/New_predicted_genes.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SpaGE
2 | ## Predicting whole-transcriptome expression of spatial transcriptomics data through integration with scRNA-seq data
3 |
4 | ### Implementation description
5 |
6 | The Python implementation can be found in the 'SpaGE' folder. The ```SpaGE``` function takes as input i) two single cell datasets, spatial transcriptomics and scRNA-seq, ii) the number of principal vectors *(PVs)*, and iii) the set of unmeasured genes in the spatial data for which predictions are obtained from the scRNA-seq (optional). The function returns the predicted expression for these unmeasured genes across all spatial cells.
7 |
8 | For a full description, please check the ```SpaGE``` function description in ```main.py```.
9 |
10 | ### Tutorial
11 |
12 | The ```SpaGE_Tutorial``` notebook is a step-by-step example showing how to validate SpaGE on the spatially measured genes, and how to use SpaGE to predict new spatial gene patterns.
13 |
14 |
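As a rough illustration of the optional third input described under 'Implementation description': when no gene list is passed, ```main.py``` falls back to the genes found only in the scRNA-seq columns, and the alignment uses the genes shared by both datasets. The sketch below mimics that bookkeeping in plain Python. The gene names and the helper functions are hypothetical and not part of SpaGE, which applies ```np.setdiff1d``` and ```np.intersect1d``` to the DataFrame columns.

```python
# Hypothetical gene lists standing in for the column names of the two inputs
# (spatial data and scRNA-seq data, both cells x genes).
spatial_genes = ["Gad1", "Slc17a7", "Sst", "Pvalb"]
rna_genes = ["Gad1", "Slc17a7", "Sst", "Pvalb", "Vip", "Lamp5", "Rorb"]


def shared_genes(spatial, rna):
    """Genes measured in both datasets, sorted; these are used for alignment."""
    return sorted(set(spatial) & set(rna))


def default_genes_to_predict(spatial, rna):
    """Genes present only in the scRNA-seq data, sorted; the default prediction set."""
    return sorted(set(rna) - set(spatial))


def n_pv_is_valid(n_pv, spatial, rna):
    """The main.py docstring asks for n_pv <= number of shared genes."""
    return 0 < n_pv <= len(shared_genes(spatial, rna))


common = shared_genes(spatial_genes, rna_genes)
to_predict = default_genes_to_predict(spatial_genes, rna_genes)
print("shared genes:", common)
print("genes to predict by default:", to_predict)
print("n_pv=4 valid:", n_pv_is_valid(4, spatial_genes, rna_genes))
```

With real data, the two lists would simply be the column names of the normalized spatial and scRNA-seq DataFrames.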

15 | 16 |

17 |
18 | ### Experiments code description
19 |
20 | The 'benchmark' folder contains the scripts to reproduce the results shown in the pre-print. The benchmark folder has five subfolders, each corresponding to one dataset-pair and containing the scripts to run SpaGE, Seurat-v3, Liger and gimVI. Additionally, we provide evaluation scripts to calculate and compare the performance of all four methods, and to reproduce all the figures in the paper.
21 |
22 | ### Datasets
23 |
24 | All datasets used are publicly available. For convenience, they can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.3967291).
25 |
26 | For citation and further information please refer to:
27 | "SpaGE: Spatial Gene Enhancement using scRNA-seq", [NAR](https://academic.oup.com/nar/article/48/18/e107/5909530)
28 |
--------------------------------------------------------------------------------
/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and they can be directly
5 | called by string name from the main method. All the methods
6 | use the scikit-learn implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 |     """
24 |     Default linear dimensionality reduction method. For each method, return a
25 |     BaseEstimator instance corresponding to the method given as input.
26 |     Attributes
27 |     -------
28 |     method: str, default to 'pca'
29 |         Method used for dimensionality reduction.
30 |         Implemented: 'pca', 'ica', 'fa' (Factor Analysis),
31 |         'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA), 'pls' (PLS).
32 |
33 |     n_dim: int, default to 10
34 |         Number of domain-specific factors to compute.
35 |     Return values
36 |     -------
37 |     Classifier, i.e. BaseEstimator instance
38 |     """
39 |
40 |     if method.lower() == 'pca':
41 |         clf = PCA(n_components=n_dim)
42 |
43 |     elif method.lower() == 'ica':
44 |         print('ICA')
45 |         clf = FastICA(n_components=n_dim)
46 |
47 |     elif method.lower() == 'fa':
48 |         clf = FactorAnalysis(n_components=n_dim)
49 |
50 |     elif method.lower() == 'nmf':
51 |         clf = NMF(n_components=n_dim)
52 |
53 |     elif method.lower() == 'sparsepca':
54 |         clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 |     elif method.lower() == 'pls':
57 |         clf = PLS(n_components=n_dim)
58 |
59 |     else:
60 |         raise NameError('%s is not an implemented method'%(method))
61 |
62 |     return clf
63 |
64 |
65 | class PLS():
66 |     """
67 |     Implement PLS to make it compliant with the other dimensionality
68 |     reduction methodology.
69 |     (Simple class rewriting).
70 |     """
71 |     def __init__(self, n_components=10):
72 |         self.clf = PLSRegression(n_components)
73 |
74 |     def get_components_(self):
75 |         return self.clf.x_weights_.transpose()
76 |
77 |     def set_components_(self, x):
78 |         pass
79 |
80 |     components_ = property(get_components_, set_components_)
81 |
82 |     def fit(self, X, y):
83 |         self.clf.fit(X,y)
84 |         return self
85 |
86 |     def transform(self, X):
87 |         return self.clf.transform(X)
88 |
89 |     def predict(self, X):
90 |         return self.clf.predict(X)
--------------------------------------------------------------------------------
/SpaGE/main.py:
--------------------------------------------------------------------------------
1 | """ SpaGE [1]
2 | @author: Tamim Abdelaal
3 | This function integrates two single-cell datasets, spatial and scRNA-seq, and
4 | enhances the spatial data by predicting the expression of the spatially
5 | unmeasured genes from the scRNA-seq data.
6 | The integration is performed using the domain adaptation method PRECISE [2]
7 |
8 | References
9 | -------
10 | [1] Abdelaal T., Mourragui S., Mahfouz A., Reinders M.J.T. (2020)
11 |     SpaGE: Spatial Gene Enhancement using scRNA-seq
12 | [2] Mourragui S., Loog M., Reinders M.J.T., Wessels L.F.A. (2019)
13 |     PRECISE: A domain adaptation approach to transfer predictors of drug response
14 |     from pre-clinical models to tumors
15 | """
16 |
17 | import numpy as np
18 | import pandas as pd
19 | import scipy.stats as st
20 | from sklearn.neighbors import NearestNeighbors
21 | from SpaGE.principal_vectors import PVComputation
22 |
23 | def SpaGE(Spatial_data,RNA_data,n_pv,genes_to_predict=None):
24 |     """
25 |     @author: Tamim Abdelaal
26 |     This function integrates two single-cell datasets, spatial and scRNA-seq,
27 |     and enhances the spatial data by predicting the expression of the spatially
28 |     unmeasured genes from the scRNA-seq data.
29 |
30 |     Parameters
31 |     -------
32 |     Spatial_data : Dataframe
33 |         Normalized Spatial data matrix (cells X genes).
34 |     RNA_data : Dataframe
35 |         Normalized scRNA-seq data matrix (cells X genes).
36 |     n_pv : int
37 |         Number of principal vectors to find from the independently computed
38 |         principal components, and used to align both datasets. This should
39 |         be <= number of shared genes between the two datasets.
40 |     genes_to_predict : str array
41 |         list of gene names missing from the spatial data, to be predicted
42 |         from the scRNA-seq data. Default is the set of different genes
43 |         (columns) between scRNA-seq and spatial data.
44 |
45 |     Return
46 |     -------
47 |     Imp_Genes: Dataframe
48 |         Matrix containing the predicted gene expressions for the spatial
49 |         cells. Rows are equal to the number of spatial data rows (cells),
50 |         and columns are equal to genes_to_predict.
51 |     """
52 |
53 |     if genes_to_predict is None:
54 |         genes_to_predict = np.setdiff1d(RNA_data.columns,Spatial_data.columns)
55 |
56 |     RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data,axis=0),
57 |                                    index = RNA_data.index,columns=RNA_data.columns)
58 |     Spatial_data_scaled = pd.DataFrame(data=st.zscore(Spatial_data,axis=0),
59 |                                        index = Spatial_data.index,columns=Spatial_data.columns)
60 |     Common_data = RNA_data_scaled[np.intersect1d(Spatial_data_scaled.columns,RNA_data_scaled.columns)]
61 |
62 |     Y_train = RNA_data[genes_to_predict]
63 |
64 |     Imp_Genes = pd.DataFrame(np.zeros((Spatial_data.shape[0],len(genes_to_predict))),
65 |                              columns=genes_to_predict)
66 |
67 |     pv_Spatial_RNA = PVComputation(
68 |             n_factors = n_pv,
69 |             n_pv = n_pv,
70 |             dim_reduction = 'pca',
71 |             dim_reduction_target = 'pca'
72 |     )
73 |
74 |     pv_Spatial_RNA.fit(Common_data,Spatial_data_scaled[Common_data.columns])
75 |
76 |     S = pv_Spatial_RNA.source_components_.T
77 |
78 |     Effective_n_pv = sum(np.diag(pv_Spatial_RNA.cosine_similarity_matrix_) > 0.3)
79 |     S = S[:,0:Effective_n_pv]
80 |
81 |     Common_data_projected = Common_data.dot(S)
82 |     Spatial_data_projected = Spatial_data_scaled[Common_data.columns].dot(S)
83 |
84 |     nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',
85 |                             metric = 'cosine').fit(Common_data_projected)
86 |     distances, indices = nbrs.kneighbors(Spatial_data_projected)
87 |
88 |     for j in range(0,Spatial_data.shape[0]):
89 |
90 |         weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
91 |         weights = weights/(len(weights)-1)
92 |         Imp_Genes.iloc[j,:] = np.dot(weights,Y_train.iloc[indices[j,:][distances[j,:] < 1]])
93 |
94 |     return Imp_Genes
95 |
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/Liger/LIGER.R:
--------------------------------------------------------------------------------
1 | setwd("MERFISH_Moffit/")
2 | library(liger)
3 | library(Seurat)
4 | library(ggplot2)
5 |
6 | # Moffit RNA
7 | Moffit <- Read10X("data/Moffit_RNA/GSE113576/")
8 |
9 | Moffit <- as.matrix(Moffit)
10 | Genes_count = rowSums(Moffit > 0)
11 | Moffit <- Moffit[Genes_count>=10,]
12 |
13 | # MERFISH
14 | MERFISH <- read.csv(file = "data/MERFISH/Moffitt_and_Bambah-Mukku_et_al_merfish_all_cells.csv", header = TRUE)
15 | MERFISH_1 <- MERFISH[MERFISH['Animal_ID']==1,]
16 | MERFISH_1 <- MERFISH_1[MERFISH_1['Cell_class']!='Ambiguous',]
17 | MERFISH_meta <- MERFISH_1[,c(1:9)]
18 | MERFISH_data <- MERFISH_1[,c(10:170)]
19 | drops <- c('Blank_1','Blank_2','Blank_3','Blank_4','Blank_5','Fos')
20 | MERFISH_data <- MERFISH_data[ , !(colnames(MERFISH_data) %in% drops)]
21 | MERFISH_data <- t(MERFISH_data)
22 |
23 | Gene_set <- intersect(rownames(MERFISH_data),rownames(Moffit))
24 |
25 | #### New genes prediction
26 | Ligerex <- createLiger(list(MERFISH = MERFISH_data, Moffit_RNA = Moffit))
27 | Ligerex <- normalize(Ligerex)
28 | Ligerex@var.genes <- Gene_set
29 | Ligerex <- scaleNotCenter(Ligerex)
30 |
31 | # suggestK(Ligerex) # K = 25
32 | # suggestLambda(Ligerex, k = 25)
33 |
34 | Ligerex <- optimizeALS(Ligerex,k = 25)
35 | Ligerex <- quantileAlignSNF(Ligerex)
36 |
37 | # leave-one-out validation
38 |
genes.leaveout <- Gene_set 39 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(MERFISH_data)[2]) 40 | rownames(Imp_genes) <- genes.leaveout 41 | colnames(Imp_genes) <- colnames(MERFISH_data) 42 | NMF_time <- vector(mode= "numeric") 43 | knn_time <- vector(mode= "numeric") 44 | 45 | for(i in 1:length(genes.leaveout)) { 46 | print(i) 47 | start_time <- Sys.time() 48 | Ligerex.leaveout <- createLiger(list(MERFISH = MERFISH_data[-which(rownames(MERFISH_data) %in% genes.leaveout[i]),], Moffit_RNA = Moffit)) 49 | Ligerex.leaveout <- normalize(Ligerex.leaveout) 50 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i]) 51 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout) 52 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 25) 53 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout) 54 | end_time <- Sys.time() 55 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 56 | 57 | start_time <- Sys.time() 58 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'Moffit_RNA', queries = list('MERFISH'), norm = TRUE, scale = FALSE) 59 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$MERFISH[genes.leaveout[i],]) 60 | end_time <- Sys.time() 61 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 62 | } 63 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv') 64 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE) 65 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE) -------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/Performance_evaluation.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('MERFISH_Moffit/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import pickle 7 | import matplotlib 8 | matplotlib.use('qt5agg') 9 | matplotlib.rcParams['pdf.fonttype'] = 42 10 | 
matplotlib.rcParams['ps.fonttype'] = 42 11 | import matplotlib.pyplot as plt 12 | import scipy.stats as st 13 | 14 | with open ('data/SpaGE_pkl/MERFISH.pkl', 'rb') as f: 15 | datadict = pickle.load(f) 16 | 17 | MERFISH_data = datadict['MERFISH_data'] 18 | del datadict 19 | 20 | with open ('data/SpaGE_pkl/Moffit_RNA.pkl', 'rb') as f: 21 | datadict = pickle.load(f) 22 | 23 | RNA_data = datadict['RNA_data'] 24 | del datadict 25 | 26 | Gene_Order = np.intersect1d(MERFISH_data.columns,RNA_data.columns) 27 | 28 | ### SpaGE 29 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut.csv',header=0,index_col=0,sep=',') 30 | 31 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order] 32 | 33 | SpaGE_Corr = pd.Series(index = Gene_Order) 34 | for i in Gene_Order: 35 | SpaGE_Corr[i] = st.spearmanr(MERFISH_data[i],SpaGE_imputed[i])[0] 36 | 37 | ### gimVI 38 | gimVI_imputed = pd.read_csv('Results/gimVI_LeaveOneOut.csv',header=0,index_col=0,sep=',') 39 | gimVI_imputed = gimVI_imputed.drop(columns='AVPR2') 40 | 41 | gimVI_imputed = gimVI_imputed.loc[:,[x.upper() for x in np.array(Gene_Order,dtype='str')]] 42 | 43 | gimVI_Corr = pd.Series(index = Gene_Order) 44 | for i in Gene_Order: 45 | gimVI_Corr[i] = st.spearmanr(MERFISH_data[i],gimVI_imputed[str(i).upper()])[0] 46 | gimVI_Corr[np.isnan(gimVI_Corr)] = 0 47 | 48 | 49 | ### Seurat 50 | Seurat_imputed = pd.read_csv('Results/Seurat_LeaveOneOut.csv',header=0,index_col=0,sep=',').T 51 | 52 | Seurat_imputed = Seurat_imputed.loc[:,Gene_Order] 53 | 54 | Seurat_Corr = pd.Series(index = Gene_Order) 55 | for i in Gene_Order: 56 | Seurat_Corr[i] = st.spearmanr(MERFISH_data[i],Seurat_imputed[i])[0] 57 | 58 | ### Liger 59 | Liger_imputed = pd.read_csv('Results/Liger_LeaveOneOut.csv',header=0,index_col=0,sep=',').T 60 | 61 | Liger_imputed = Liger_imputed.loc[:,Gene_Order] 62 | 63 | Liger_Corr = pd.Series(index = Gene_Order) 64 | for i in Gene_Order: 65 | Liger_Corr[i] = st.spearmanr(MERFISH_data[i],Liger_imputed[i])[0] 66 | 
Liger_Corr[np.isnan(Liger_Corr)] = 0 67 | 68 | ### Comparison plots 69 | plt.style.use('ggplot') 70 | fig, ax = plt.subplots(figsize=(3.7, 5.5)) 71 | ax.boxplot([SpaGE_Corr,Seurat_Corr, Liger_Corr,gimVI_Corr]) 72 | 73 | y = SpaGE_Corr 74 | x = np.random.normal(1, 0.05, len(y)) 75 | plt.plot(x, y, 'g.', markersize=3, alpha=0.2) 76 | y = Seurat_Corr 77 | x = np.random.normal(2, 0.05, len(y)) 78 | plt.plot(x, y, 'g.', markersize=3, alpha=0.2) 79 | y = Liger_Corr 80 | x = np.random.normal(3, 0.05, len(y)) 81 | plt.plot(x, y, 'g.', markersize=3, alpha=0.2) 82 | y = gimVI_Corr 83 | x = np.random.normal(4, 0.05, len(y)) 84 | plt.plot(x, y, 'g.', markersize=3, alpha=0.2) 85 | 86 | plt.xticks((1,2,3,4),('SpaGE', 'Seurat', 'Liger','gimVI'),size=12) 87 | plt.yticks(size=8) 88 | plt.gca().set_ylim([-0.5,1]) 89 | plt.ylabel('Spearman Correlation',size=12) 90 | ax.set_aspect(aspect=3) 91 | _,p_val = st.wilcoxon(SpaGE_Corr,Seurat_Corr) 92 | plt.text(2,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8) 93 | _,p_val = st.wilcoxon(SpaGE_Corr,Liger_Corr) 94 | plt.text(3,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8) 95 | _,p_val = st.wilcoxon(SpaGE_Corr,gimVI_Corr) 96 | plt.text(4,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8) 97 | plt.show() 98 | 99 | def Compare_Correlations(X,Y): 100 | fig, ax = plt.subplots(figsize=(4.5, 4.5)) 101 | ax.scatter(X, Y, s=5) 102 | ax.axvline(linestyle='--',color='gray') 103 | ax.axhline(linestyle='--',color='gray') 104 | plt.gca().set_ylim([-0.5,1]) 105 | lims = [np.min([ax.get_xlim(), ax.get_ylim()]), 106 | np.max([ax.get_xlim(), ax.get_ylim()])] 107 | ax.plot(lims, lims, 'k-') 108 | ax.set_aspect('equal') 109 | ax.set_xlim(lims) 110 | ax.set_ylim(lims) 111 | plt.xticks(size=8) 112 | plt.yticks(size=8) 113 | plt.show() 114 | 115 | 116 | Compare_Correlations(Seurat_Corr,SpaGE_Corr) 117 | plt.xlabel('Spearman Correlation Seurat',size=12) 118 | plt.ylabel('Spearman Correlation SpaGE',size=12) 119 | 
plt.show() 120 | 121 | Compare_Correlations(Liger_Corr,SpaGE_Corr) 122 | plt.xlabel('Spearman Correlation Liger',size=12) 123 | plt.ylabel('Spearman Correlation SpaGE',size=12) 124 | plt.show() 125 | 126 | Compare_Correlations(gimVI_Corr,SpaGE_Corr) 127 | plt.xlabel('Spearman Correlation gimVI',size=12) 128 | plt.ylabel('Spearman Correlation SpaGE',size=12) 129 | plt.show() 130 | -------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/Seurat/MERFISH.R: -------------------------------------------------------------------------------- 1 | setwd("MERFISH_Moffit/") 2 | library(Seurat) 3 | library(Matrix) 4 | 5 | MERFISH <- read.csv(file = "data/MERFISH/Moffitt_and_Bambah-Mukku_et_al_merfish_all_cells.csv", header = TRUE) 6 | MERFISH_1 <- MERFISH[MERFISH['Animal_ID']==1,] 7 | MERFISH_1 <- MERFISH_1[MERFISH_1['Cell_class']!='Ambiguous',] 8 | MERFISH_meta <- MERFISH_1[,c(1:9)] 9 | MERFISH_data <- MERFISH_1[,c(10:170)] 10 | drops <- c('Blank_1','Blank_2','Blank_3','Blank_4','Blank_5','Fos') 11 | MERFISH_data <- MERFISH_data[ , !(colnames(MERFISH_data) %in% drops)] 12 | MERFISH_data <- t(MERFISH_data) 13 | 14 | MERFISH_seurat <- CreateSeuratObject(counts = MERFISH_data, project = 'MERFISH', assay = 'RNA', meta.data = MERFISH_meta ,min.cells = -1, min.features = -1) 15 | total.counts = colSums(x = as.matrix(MERFISH_seurat@assays$RNA@counts)) 16 | MERFISH_seurat <- NormalizeData(MERFISH_seurat, scale.factor = median(x = total.counts)) 17 | MERFISH_seurat <- ScaleData(MERFISH_seurat) 18 | saveRDS(object = MERFISH_seurat, file = 'data/seurat_objects/MERFISH.rds') -------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/Seurat/MERFISH_integration.R: -------------------------------------------------------------------------------- 1 | setwd("MERFISH_Moffit/") 2 | library(Seurat) 3 | library(ggplot2) 4 | 5 | MERFISH <- readRDS("data/seurat_objects/MERFISH.rds") 6 | Moffit <- 
readRDS("data/seurat_objects/Moffit_RNA.rds") 7 | 8 | genes.leaveout <- intersect(rownames(MERFISH),rownames(Moffit)) 9 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(MERFISH@assays$RNA)[2]) 10 | rownames(Imp_genes) <- genes.leaveout 11 | anchor_time <- vector(mode= "numeric") 12 | Transfer_time <- vector(mode= "numeric") 13 | 14 | run_imputation <- function(ref.obj, query.obj, feature.remove) { 15 | message(paste0('removing ', feature.remove)) 16 | features <- setdiff(rownames(query.obj), feature.remove) 17 | DefaultAssay(ref.obj) <- 'RNA' 18 | DefaultAssay(query.obj) <- 'RNA' 19 | 20 | start_time <- Sys.time() 21 | anchors <- FindTransferAnchors( 22 | reference = ref.obj, 23 | query = query.obj, 24 | features = features, 25 | dims = 1:30, 26 | reduction = 'cca' 27 | ) 28 | end_time <- Sys.time() 29 | anchor_time <- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 30 | 31 | refdata <- GetAssayData( 32 | object = ref.obj, 33 | assay = 'RNA', 34 | slot = 'data' 35 | ) 36 | 37 | start_time <- Sys.time() 38 | imputation <- TransferData( 39 | anchorset = anchors, 40 | refdata = refdata, 41 | weight.reduction = 'pca' 42 | ) 43 | query.obj[['seq']] <- imputation 44 | end_time <- Sys.time() 45 | Transfer_time <- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 46 | return(query.obj) 47 | } 48 | 49 | for(i in 1:length(genes.leaveout)) { 50 | imputed.ss2 <- run_imputation(ref.obj = Moffit, query.obj = MERFISH, feature.remove = genes.leaveout[[i]]) 51 | MERFISH[['ss2']] <- imputed.ss2[, colnames(MERFISH)][['seq']] 52 | Imp_genes[genes.leaveout[[i]],] = as.vector(MERFISH@assays$ss2[genes.leaveout[i],]) 53 | } 54 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv') 55 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE) 56 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE) 
-------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/Seurat/Moffit_RNA.R: -------------------------------------------------------------------------------- 1 | setwd("MERFISH_Moffit/") 2 | library(Seurat) 3 | 4 | Moffit <- Read10X("data/Moffit_RNA/GSE113576/") 5 | 6 | Mo <- CreateSeuratObject(counts = Moffit, project = 'POR', min.cells = 10) 7 | Mo <- NormalizeData(object = Mo) 8 | Mo <- FindVariableFeatures(object = Mo, nfeatures = 2000) 9 | Mo <- ScaleData(object = Mo) 10 | Mo <- RunPCA(object = Mo, npcs = 50, verbose = FALSE) 11 | Mo <- RunUMAP(object = Mo, dims = 1:50, nneighbors = 5) 12 | saveRDS(object = Mo, file = paste0("data/seurat_objects/","Moffit_RNA.rds")) -------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/SpaGE/Integration.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('MERFISH_Moffit/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.neighbors import NearestNeighbors 8 | import time as tm 9 | 10 | with open ('data/SpaGE_pkl/MERFISH.pkl', 'rb') as f: 11 | datadict = pickle.load(f) 12 | 13 | MERFISH_data = datadict['MERFISH_data'] 14 | MERFISH_data_scaled = datadict['MERFISH_data_scaled'] 15 | MERFISH_meta = datadict['MERFISH_meta'] 16 | del datadict 17 | 18 | with open ('data/SpaGE_pkl/Moffit_RNA.pkl', 'rb') as f: 19 | datadict = pickle.load(f) 20 | 21 | RNA_data = datadict['RNA_data'] 22 | RNA_data_scaled = datadict['RNA_data_scaled'] 23 | del datadict 24 | 25 | #### Leave One Out Validation #### 26 | Common_data = RNA_data_scaled[np.intersect1d(MERFISH_data_scaled.columns,RNA_data_scaled.columns)] 27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns) 28 | precise_time = [] 29 | knn_time = [] 30 | for i in Common_data.columns: 31 | print(i) 32 | start = tm.time() 33 | from principal_vectors import PVComputation 34 | 35 | n_factors = 50 
36 | n_pv = 50 37 | dim_reduction = 'pca' 38 | dim_reduction_target = 'pca' 39 | 40 | pv_FISH_RNA = PVComputation(n_factors = n_factors,n_pv = n_pv,dim_reduction = dim_reduction,dim_reduction_target = dim_reduction_target) 41 | 42 | pv_FISH_RNA.fit(Common_data.drop(i,axis=1),MERFISH_data_scaled[Common_data.columns].drop(i,axis=1)) 43 | 44 | S = pv_FISH_RNA.source_components_.T 45 | 46 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 47 | S = S[:,0:Effective_n_pv] 48 | 49 | Common_data_t = Common_data.drop(i,axis=1).dot(S) 50 | FISH_exp_t = MERFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S) 51 | precise_time.append(tm.time()-start) 52 | 53 | start = tm.time() 54 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t) 55 | distances, indices = nbrs.kneighbors(FISH_exp_t) 56 | 57 | Imp_Gene = np.zeros(MERFISH_data.shape[0]) 58 | for j in range(0,MERFISH_data.shape[0]): 59 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 60 | weights = weights/(len(weights)-1) 61 | Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights)) 62 | Imp_Gene[np.isnan(Imp_Gene)] = 0 63 | Imp_Genes[i] = Imp_Gene 64 | knn_time.append(tm.time()-start) 65 | 66 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv') 67 | precise_time = pd.DataFrame(precise_time) 68 | knn_time = pd.DataFrame(knn_time) 69 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False) 70 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False) -------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/SpaGE/MERFISH_data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('MERFISH_Moffit/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | 9 | MERFISH = 
pd.read_csv('data/MERFISH/Moffitt_and_Bambah-Mukku_et_al_merfish_all_cells.csv') 10 | #Select 1st replicate, Naive female 11 | MERFISH_1 = MERFISH.loc[MERFISH['Animal_ID']==1,:] 12 | 13 | #remove Blank-1 to 5 and Fos --> 155 genes 14 | MERFISH_1 = MERFISH_1.loc[MERFISH_1['Cell_class']!='Ambiguous',:] 15 | MERFISH_meta = MERFISH_1.iloc[:,0:9] 16 | MERFISH_data = MERFISH_1.iloc[:,9:171] 17 | MERFISH_data = MERFISH_data.drop(columns = ['Blank_1','Blank_2','Blank_3','Blank_4','Blank_5','Fos']) 18 | del MERFISH, MERFISH_1 19 | 20 | MERFISH_data = MERFISH_data.T 21 | 22 | cell_count = np.sum(MERFISH_data,axis=0) 23 | def Log_Norm(x): 24 | return np.log(((x/np.sum(x))*np.median(cell_count)) + 1) 25 | 26 | MERFISH_data = MERFISH_data.apply(Log_Norm,axis=0) 27 | MERFISH_data_scaled = pd.DataFrame(data=st.zscore(MERFISH_data.T),index = MERFISH_data.columns,columns=MERFISH_data.index) 28 | 29 | datadict = dict() 30 | datadict['MERFISH_data'] = MERFISH_data.T 31 | datadict['MERFISH_data_scaled'] = MERFISH_data_scaled 32 | datadict['MERFISH_meta'] = MERFISH_meta 33 | 34 | with open('data/SpaGE_pkl/MERFISH.pkl','wb') as f: 35 | pickle.dump(datadict, f) -------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/SpaGE/Moffit_RNA_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('MERFISH_Moffit/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | import scipy.io as io 9 | 10 | genes = pd.read_csv('data/Moffit_RNA/GSE113576/genes.tsv',sep='\t',header=None) 11 | barcodes = pd.read_csv('data/Moffit_RNA/GSE113576/barcodes.tsv',sep='\t',header=None) 12 | 13 | genes = np.array(genes.loc[:,1]) 14 | barcodes = np.array(barcodes.loc[:,0]) 15 | RNA_data = io.mmread('data/Moffit_RNA/GSE113576/matrix.mtx') 16 | RNA_data = RNA_data.todense() 17 | RNA_data = pd.DataFrame(RNA_data,index=genes,columns=barcodes) 18 | 19 | # 
filter lowly expressed genes
20 | Genes_count = np.sum(RNA_data > 0, axis=1)
21 | RNA_data = RNA_data.loc[Genes_count >=10,:]
22 | del Genes_count
23 |
24 | def Log_Norm(x):
25 |     return np.log(((x/np.sum(x))*1000000) + 1)
26 |
27 | RNA_data = RNA_data.apply(Log_Norm,axis=0)
28 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index)
29 |
30 | datadict = dict()
31 | datadict['RNA_data'] = RNA_data.T
32 | datadict['RNA_data_scaled'] = RNA_data_scaled
33 |
34 | with open('data/SpaGE_pkl/Moffit_RNA.pkl','wb') as f:
35 |     pickle.dump(datadict, f, protocol=4)
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/SpaGE/Precise_output.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('MERFISH_Moffit/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | import matplotlib
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | import seaborn as sns
12 | import sys
13 | sys.path.insert(1,'SpaGE/')
14 | from principal_vectors import PVComputation
15 |
16 | with open ('data/SpaGE_pkl/MERFISH.pkl', 'rb') as f:
17 |     datadict = pickle.load(f)
18 |
19 | MERFISH_data = datadict['MERFISH_data']
20 | MERFISH_data_scaled = datadict['MERFISH_data_scaled']
21 | MERFISH_meta = datadict['MERFISH_meta']
22 | del datadict
23 |
24 | with open ('data/SpaGE_pkl/Moffit_RNA.pkl', 'rb') as f:
25 |     datadict = pickle.load(f)
26 |
27 | RNA_data = datadict['RNA_data']
28 | RNA_data_scaled = datadict['RNA_data_scaled']
29 | del datadict
30 |
31 | Common_data = RNA_data_scaled[np.intersect1d(MERFISH_data_scaled.columns,RNA_data_scaled.columns)]
32 |
33 | n_factors = 50
34 | n_pv = 50
35 | n_pv_display = 50
36 | dim_reduction = 'pca'
37 | dim_reduction_target = 'pca'
38 |
39 | pv_FISH_RNA = PVComputation(
40 |         n_factors = n_factors,
41 |         n_pv =
n_pv, 42 | dim_reduction = dim_reduction, 43 | dim_reduction_target = dim_reduction_target 44 | ) 45 | 46 | pv_FISH_RNA.fit(Common_data,MERFISH_data_scaled[Common_data.columns]) 47 | 48 | fig = plt.figure() 49 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 50 | center=0, vmax=1., vmin=0) 51 | plt.xlabel('MERFISH',fontsize=18, color='black') 52 | plt.ylabel('scRNA-seq',fontsize=18, color='black') 53 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 54 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 55 | plt.gca().set_ylim([n_pv_display,0]) 56 | plt.show() 57 | 58 | plt.figure() 59 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 60 | center=0, vmax=1., vmin=0) 61 | for i in range(n_pv_display-1): 62 | plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black') 63 | 64 | plt.xlabel('MERFISH',fontsize=18, color='black') 65 | plt.ylabel('scRNA-seq',fontsize=18, color='black') 66 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 67 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 68 | plt.gca().set_ylim([n_pv_display,0]) 69 | plt.show() 70 | 71 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns) 72 | Importance.sort_values(ascending=False,inplace=True) 73 | Importance.index[0:50] 74 | 75 | ### Technology specific Processes 76 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 77 | 78 | # explained variance RNA 79 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 80 | # explained variance spatial 81 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 82 | 
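The script above relies on `PVComputation`, whose source is not shown in this part of the dump. As a rough illustration of what the cosine similarity matrix and the 0.3 threshold represent, the following is a minimal NumPy-only sketch. It assumes that principal vectors are obtained from an SVD of the product of the two PCA loading matrices; the helper names `pca_components` and `principal_vectors` and the toy data are hypothetical and are not part of the repository.

```python
import numpy as np

def pca_components(X, n_factors):
    """Top principal axes (as rows) of a cells-by-genes matrix, via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:n_factors]

def principal_vectors(source, target, n_factors):
    """Paired principal vectors of two datasets and their cosine similarities.

    Assumption: the pairing comes from the SVD of Ps @ Pt.T, whose singular
    values are the cosines of the principal angles between the two subspaces.
    """
    Ps = pca_components(source, n_factors)
    Pt = pca_components(target, n_factors)
    u, cosines, vt = np.linalg.svd(Ps @ Pt.T)
    return u.T @ Ps, vt @ Pt, cosines

# Toy data (hypothetical): both datasets share a 3-dimensional gene program.
rng = np.random.default_rng(0)
loadings = rng.normal(size=(3, 20))
source = rng.normal(size=(200, 3)) @ loadings + 0.1 * rng.normal(size=(200, 20))
target = rng.normal(size=(150, 3)) @ loadings + 0.1 * rng.normal(size=(150, 20))

source_pvs, target_pvs, cosines = principal_vectors(source, target, n_factors=5)

# Same thresholding rule as in the script: keep pairs with similarity above 0.3.
effective_n_pv = int(np.sum(cosines > 0.3))
print(np.round(cosines, 2), effective_n_pv)
```

In this sketch the diagonal of `source_pvs @ target_pvs.T` equals `cosines`, which plays the role of the diagonal that the script thresholds to obtain `Effective_n_pv`.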
-------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/SpaGE/dimensionality_reduction.py: -------------------------------------------------------------------------------- 1 | """ Dimensionality Reduction 2 | @author: Soufiane Mourragui 3 | This module extracts the domain-specific factors from the high-dimensional omics 4 | dataset. Several methods are implemented here and they can be called directly 5 | by string name from the main method. All the methods 6 | use the scikit-learn implementation. 7 | Notes 8 | ------- 9 | - 10 | 11 | References 12 | ------- 13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python. 14 | Journal of Machine Learning Research 15 | """ 16 | 17 | import numpy as np 18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA 19 | from sklearn.cross_decomposition import PLSRegression 20 | 21 | 22 | def process_dim_reduction(method='pca', n_dim=10): 23 | """ 24 | Default linear dimensionality reduction method. For each method, return a 25 | BaseEstimator instance corresponding to the method given as input. 26 | Parameters 27 | ------- 28 | method: str, default to 'pca' 29 | Method used for dimensionality reduction. 30 | Implemented: 'pca', 'ica', 'fa' (Factor Analysis), 31 | 'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA), 'pls' (Partial Least Squares). 32 | 33 | n_dim: int, default to 10 34 | Number of domain-specific factors to compute. 35 | Return values 36 | ------- 37 | Classifier, i.e. 
BaseEstimator instance 38 | """ 39 | 40 | if method.lower() == 'pca': 41 | clf = PCA(n_components=n_dim) 42 | 43 | elif method.lower() == 'ica': 44 | print('ICA') 45 | clf = FastICA(n_components=n_dim) 46 | 47 | elif method.lower() == 'fa': 48 | clf = FactorAnalysis(n_components=n_dim) 49 | 50 | elif method.lower() == 'nmf': 51 | clf = NMF(n_components=n_dim) 52 | 53 | elif method.lower() == 'sparsepca': 54 | clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1) 55 | 56 | elif method.lower() == 'pls': 57 | clf = PLS(n_components=n_dim) 58 | 59 | else: 60 | raise NameError('%s is not an implemented method'%(method)) 61 | 62 | return clf 63 | 64 | 65 | class PLS(): 66 | """ 67 | Implement PLS to make it compliant with the other dimensionality 68 | reduction methodology. 69 | (Simple class rewriting). 70 | """ 71 | def __init__(self, n_components=10): 72 | self.clf = PLSRegression(n_components) 73 | 74 | def get_components_(self): 75 | return self.clf.x_weights_.transpose() 76 | 77 | def set_components_(self, x): 78 | pass 79 | 80 | components_ = property(get_components_, set_components_) 81 | 82 | def fit(self, X, y): 83 | self.clf.fit(X,y) 84 | return self 85 | 86 | def transform(self, X): 87 | return self.clf.transform(X) 88 | 89 | def predict(self, X): 90 | return self.clf.predict(X) -------------------------------------------------------------------------------- /benchmark/MERFISH_Moffit/gimVI/gimVI.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('MERFISH_Moffit/') 3 | 4 | from scvi.dataset import CsvDataset 5 | from scvi.models import JVAE, Classifier 6 | from scvi.inference import JVAETrainer 7 | import numpy as np 8 | import pandas as pd 9 | import copy 10 | import torch 11 | import time as tm 12 | 13 | ### MERFISH data 14 | MERFISH_data = CsvDataset('data/gimVI_data/MERFISH_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 15 | 16 | ### Moffit scRNA-seq 17 | 
RNA_data = CsvDataset('data/gimVI_data/Moffit_RNA_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 18 | 19 | ### Leave-one-out validation 20 | Gene_set = np.intersect1d(MERFISH_data.gene_names,RNA_data.gene_names) 21 | MERFISH_data.gene_names = Gene_set 22 | MERFISH_data.X = MERFISH_data.X[:,np.reshape(np.vstack(np.argwhere(i==MERFISH_data.gene_names) for i in Gene_set),-1)] 23 | Common_data = copy.deepcopy(RNA_data) 24 | Common_data.gene_names = Gene_set 25 | Common_data.X = Common_data.X[:,np.reshape(np.vstack(np.argwhere(i==RNA_data.gene_names) for i in Gene_set),-1)] 26 | Imp_Genes = pd.DataFrame(columns=Gene_set) 27 | gimVI_time = [] 28 | 29 | for i in Gene_set: 30 | print(i) 31 | # Create copy of the fish dataset with hidden genes 32 | data_spatial_partial = copy.deepcopy(MERFISH_data) 33 | data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(MERFISH_data.gene_names,i)) 34 | data_spatial_partial.batch_indices += Common_data.n_batches 35 | 36 | if(data_spatial_partial.X.shape[0] != MERFISH_data.X.shape[0]): 37 | continue 38 | 39 | datasets = [Common_data, data_spatial_partial] 40 | generative_distributions = ["zinb", "nb"] 41 | gene_mappings = [slice(None), np.reshape(np.vstack(np.argwhere(i==Common_data.gene_names) for i in data_spatial_partial.gene_names),-1)] 42 | n_inputs = [d.nb_genes for d in datasets] 43 | total_genes = Common_data.nb_genes 44 | n_batches = sum([d.n_batches for d in datasets]) 45 | 46 | model_library_size = [True, False] 47 | 48 | n_latent = 8 49 | kappa = 5 50 | 51 | start = tm.time() 52 | torch.manual_seed(0) 53 | 54 | model = JVAE( 55 | n_inputs, 56 | total_genes, 57 | gene_mappings, 58 | generative_distributions, 59 | model_library_size, 60 | n_layers_decoder_individual=0, 61 | n_layers_decoder_shared=0, 62 | n_layers_encoder_individual=1, 63 | n_layers_encoder_shared=1, 64 | dim_hidden_encoder=64, 65 | dim_hidden_decoder_shared=64, 66 | dropout_rate_encoder=0.2, 67 | dropout_rate_decoder=0.2, 68 | 
n_batch=n_batches, 69 | n_latent=n_latent, 70 | ) 71 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 72 | 73 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa) 74 | trainer.train(n_epochs=200) 75 | _,Imputed = trainer.get_imputed_values(normalized=True) 76 | Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1) 77 | Imp_Genes[i] = Imputed 78 | gimVI_time.append(tm.time()-start) 79 | 80 | Imp_Genes = Imp_Genes.fillna(0) 81 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv') 82 | gimVI_time = pd.DataFrame(gimVI_time) 83 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False) 84 | -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/DownSampling/DownSampling.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('STARmap_AllenVISp/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.neighbors import NearestNeighbors 8 | import sys 9 | sys.path.insert(1,'Scripts/SpaGE/') 10 | from principal_vectors import PVComputation 11 | 12 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f: 13 | datadict = pickle.load(f) 14 | 15 | Starmap_data = datadict['Starmap_data'] 16 | Starmap_data_scaled = datadict['Starmap_data_scaled'] 17 | coords = datadict['coords'] 18 | del datadict 19 | 20 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f: 21 | datadict = pickle.load(f) 22 | 23 | RNA_data = datadict['RNA_data'] 24 | RNA_data_scaled = datadict['RNA_data_scaled'] 25 | del datadict 26 | 27 | all_centroids = np.vstack([c.mean(0) for c in coords]) 28 | 29 | def Moran_I(SpatialData,XYmap): 30 | XYnbrs = NearestNeighbors(n_neighbors=5, algorithm='auto',metric = 'euclidean').fit(XYmap) 31 | XYdistances, XYindices = XYnbrs.kneighbors(XYmap) 32 | W = np.zeros((SpatialData.shape[0],SpatialData.shape[0])) 33 | for i in range(0,SpatialData.shape[0]): 34 | 
W[i,XYindices[i,:]]=1 35 | 36 | for i in range(0,SpatialData.shape[0]): 37 | W[i,i]=0 38 | 39 | I = pd.Series(index=SpatialData.columns) 40 | for k in SpatialData.columns: 41 | X_minus_mean = np.array(SpatialData[k] - np.mean(SpatialData[k])) 42 | X_minus_mean = np.reshape(X_minus_mean,(len(X_minus_mean),1)) 43 | Nom = np.sum(np.multiply(W,np.matmul(X_minus_mean,X_minus_mean.T))) 44 | Den = np.sum(np.multiply(X_minus_mean,X_minus_mean)) 45 | I[k] = (len(SpatialData[k])/np.sum(W))*(Nom/Den) 46 | return(I) 47 | 48 | Moran_Is = Moran_I(Starmap_data,all_centroids) 49 | 50 | Gene_Order = np.intersect1d(Starmap_data.columns,RNA_data.columns) 51 | Moran_Is = Moran_Is[Gene_Order] 52 | 53 | Moran_Is.sort_values(ascending=False,inplace=True) 54 | test_set = Moran_Is.index[0:50] 55 | 56 | Common_data = RNA_data[np.intersect1d(Starmap_data.columns,RNA_data.columns)] 57 | Common_data = Common_data.drop(columns=test_set) 58 | 59 | Corr = np.corrcoef(Common_data.T) 60 | removed_genes = [] 61 | for i in range(0,Corr.shape[0]): 62 | for j in range(i+1,Corr.shape[0]): 63 | if(np.abs(Corr[i,j]) > 0.7): 64 | Vi = np.var(Common_data.iloc[:,i]) 65 | Vj = np.var(Common_data.iloc[:,j]) 66 | if(Vi > Vj): 67 | removed_genes.append(Common_data.columns[j]) 68 | else: 69 | removed_genes.append(Common_data.columns[i]) 70 | removed_genes= np.unique(removed_genes) 71 | 72 | Common_data = Common_data.drop(columns=removed_genes) 73 | Variance = np.var(Common_data) 74 | Variance.sort_values(ascending=False,inplace=True) 75 | Variance = Variance.append(pd.Series(0,index=removed_genes)) 76 | 77 | genes_to_impute = test_set 78 | for i in [10,30,50,100,200,500,len(Variance)]: 79 | Imp_New_Genes = pd.DataFrame(np.zeros((Starmap_data.shape[0],len(genes_to_impute))),columns=genes_to_impute) 80 | 81 | if(i>=50): 82 | n_factors = 50 83 | n_pv = 50 84 | else: 85 | n_factors = i 86 | n_pv = i 87 | 88 | dim_reduction = 'pca' 89 | dim_reduction_target = 'pca' 90 | 91 | pv_FISH_RNA = PVComputation( 92 | n_factors 
= n_factors, 93 | n_pv = n_pv, 94 | dim_reduction = dim_reduction, 95 | dim_reduction_target = dim_reduction_target 96 | ) 97 | 98 | source_data = RNA_data_scaled[Variance.index[0:i]] 99 | target_data = Starmap_data_scaled[Variance.index[0:i]] 100 | 101 | pv_FISH_RNA.fit(source_data,target_data) 102 | 103 | S = pv_FISH_RNA.source_components_.T 104 | 105 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 106 | S = S[:,0:Effective_n_pv] 107 | 108 | RNA_data_t = source_data.dot(S) 109 | FISH_exp_t = target_data.dot(S) 110 | 111 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto', 112 | metric = 'cosine').fit(RNA_data_t) 113 | distances, indices = nbrs.kneighbors(FISH_exp_t) 114 | 115 | for j in range(0,Starmap_data.shape[0]): 116 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 117 | weights = weights/(len(weights)-1) 118 | Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]]) 119 | 120 | Imp_New_Genes.to_csv('Results/' + str(i) +'SpaGE_New_genes.csv') 121 | -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/DownSampling/DownSampling_evaluation.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('STARmap_AllenVISp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import matplotlib 7 | matplotlib.use('qt5agg') 8 | matplotlib.rcParams['pdf.fonttype'] = 42 9 | matplotlib.rcParams['ps.fonttype'] = 42 10 | import matplotlib.pyplot as plt 11 | import scipy.stats as st 12 | import pickle 13 | 14 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f: 15 | datadict = pickle.load(f) 16 | 17 | Starmap_data = datadict['Starmap_data'] 18 | del datadict 19 | 20 | test_set = pd.read_csv('Results/10SpaGE_New_genes.csv',header=0,index_col=0,sep=',').columns 21 | Starmap_data = Starmap_data[test_set] 22 | 23 | ### SpaGE 24 | #10 25 | 
SpaGE_imputed_10 = pd.read_csv('Results/10SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 26 | 27 | SpaGE_Corr_10 = pd.Series(index = test_set) 28 | for i in test_set: 29 | SpaGE_Corr_10[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_10[i])[0] 30 | 31 | #30 32 | SpaGE_imputed_30 = pd.read_csv('Results/30SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 33 | 34 | SpaGE_Corr_30 = pd.Series(index = test_set) 35 | for i in test_set: 36 | SpaGE_Corr_30[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_30[i])[0] 37 | 38 | #50 39 | SpaGE_imputed_50 = pd.read_csv('Results/50SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 40 | 41 | SpaGE_Corr_50 = pd.Series(index = test_set) 42 | for i in test_set: 43 | SpaGE_Corr_50[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_50[i])[0] 44 | 45 | #100 46 | SpaGE_imputed_100 = pd.read_csv('Results/100SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 47 | 48 | SpaGE_Corr_100 = pd.Series(index = test_set) 49 | for i in test_set: 50 | SpaGE_Corr_100[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_100[i])[0] 51 | 52 | #200 53 | SpaGE_imputed_200 = pd.read_csv('Results/200SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 54 | 55 | SpaGE_Corr_200 = pd.Series(index = test_set) 56 | for i in test_set: 57 | SpaGE_Corr_200[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_200[i])[0] 58 | 59 | #500 60 | SpaGE_imputed_500 = pd.read_csv('Results/500SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 61 | 62 | SpaGE_Corr_500 = pd.Series(index = test_set) 63 | for i in test_set: 64 | SpaGE_Corr_500[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_500[i])[0] 65 | 66 | #944 67 | SpaGE_imputed_944 = pd.read_csv('Results/944SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 68 | 69 | SpaGE_Corr_944 = pd.Series(index = test_set) 70 | for i in test_set: 71 | SpaGE_Corr_944[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_944[i])[0] 72 | 73 | ### Comparison plots 74 | plt.style.use('ggplot') 75 | plt.figure(figsize=(9, 3)) 76 | plt.boxplot([SpaGE_Corr_10, 
SpaGE_Corr_30, SpaGE_Corr_50, 77 | SpaGE_Corr_100,SpaGE_Corr_200,SpaGE_Corr_500,SpaGE_Corr_944]) 78 | 79 | y = SpaGE_Corr_10 80 | x = np.random.normal(1, 0.05, len(y)) 81 | plt.plot(x, y, 'g.', alpha=0.2) 82 | y = SpaGE_Corr_30 83 | x = np.random.normal(2, 0.05, len(y)) 84 | plt.plot(x, y, 'g.', alpha=0.2) 85 | y = SpaGE_Corr_50 86 | x = np.random.normal(3, 0.05, len(y)) 87 | plt.plot(x, y, 'g.', alpha=0.2) 88 | y = SpaGE_Corr_100 89 | x = np.random.normal(4, 0.05, len(y)) 90 | plt.plot(x, y, 'g.', alpha=0.2) 91 | y = SpaGE_Corr_200 92 | x = np.random.normal(5, 0.05, len(y)) 93 | plt.plot(x, y, 'g.', alpha=0.2) 94 | y = SpaGE_Corr_500 95 | x = np.random.normal(6, 0.05, len(y)) 96 | plt.plot(x, y, 'g.', alpha=0.2) 97 | y = SpaGE_Corr_944 98 | x = np.random.normal(7, 0.05, len(y)) 99 | plt.plot(x, y, 'g.', alpha=0.2) 100 | 101 | 102 | plt.xticks((1,2,3,4,5,6,7),('10','30', '50','100','200','500','944'),size=10) 103 | plt.yticks(size=8) 104 | plt.xlabel('Number of shared genes',size=12) 105 | plt.gca().set_ylim([-0.3,0.8]) 106 | plt.ylabel('Spearman Correlation',size=12) 107 | plt.show() 108 | -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/Liger/LIGER.R: -------------------------------------------------------------------------------- 1 | setwd("STARmap_AllenVISp/") 2 | library(liger) 3 | library(Seurat) 4 | library(ggplot2) 5 | 6 | # allen VISp 7 | allen <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv", 8 | row.names = 1, sep = ',', stringsAsFactors = FALSE, header = TRUE) 9 | allen <- as.matrix(x = allen) 10 | genes <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv", 11 | sep = ',', stringsAsFactors = FALSE, header = TRUE) 12 | rownames(x = allen) <- make.unique(names = genes$gene_symbol) 13 | meta.data <- read.csv(file = "data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv", 14 | row.names = 1, stringsAsFactors = FALSE) 15 | 16 | Genes_count = rowSums(allen > 
0) 17 | allen <- allen[Genes_count>=10,] 18 | 19 | low.q.cells <- rownames(x = meta.data[meta.data$class %in% c('Low Quality', 'No Class'), ]) 20 | ok.cells <- rownames(x = meta.data)[!(rownames(x = meta.data) %in% low.q.cells)] 21 | allen <-allen[,ok.cells] 22 | 23 | # STARmap 24 | counts <- read.table(file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/cell_barcode_count.csv", 25 | sep = ",", stringsAsFactors = FALSE) 26 | gene.names <- read.table(file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/genes.csv", 27 | sep = ",", stringsAsFactors = FALSE) 28 | 29 | qhulls <- read.table(file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/qhulls.tsv", 30 | sep = '\t', col.names = c('cell', 'y', 'x'), stringsAsFactors = FALSE) 31 | centroids <- read.table(file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/centroids.tsv", 32 | sep = "\t", col.names = c("spatial2", "spatial1"), stringsAsFactors = FALSE) 33 | colnames(counts) <- gene.names$V1 34 | rownames(counts) <- paste0('starmap', seq(1:nrow(counts))) 35 | counts <- as.matrix(counts) 36 | rownames(centroids) <- rownames(counts) 37 | centroids <- as.matrix(centroids) 38 | counts <- t(counts) 39 | 40 | Gene_set <- intersect(rownames(counts),rownames(allen)) 41 | 42 | #### New genes prediction 43 | Ligerex <- createLiger(list(STARmap = counts, AllenVISp = allen)) 44 | Ligerex <- normalize(Ligerex) 45 | Ligerex@var.genes <- Gene_set 46 | Ligerex <- scaleNotCenter(Ligerex) 47 | 48 | # suggestK(Ligerex) # K = 25 49 | # suggestLambda(Ligerex, k = 25) # Lambda = 35 50 | 51 | Ligerex <- optimizeALS(Ligerex,k = 25, lambda = 35) 52 | Ligerex <- quantileAlignSNF(Ligerex) 53 | 54 | Imputation <- imputeKNN(Ligerex,reference = 'AllenVISp', queries = list('STARmap'), norm = TRUE, scale = FALSE) 55 | 56 | new.genes <- c('Tesc', 'Pvrl3', 'Sox10', 'Grm2', 'Tcrb') 57 | Imp_New_genes <- matrix(0,nrow= length(new.genes),ncol = dim(Imputation@norm.data$STARmap)[2]) 58 | rownames(Imp_New_genes) <- new.genes 59 | 60 | for (i in 
c(1:length(new.genes))){ 61 | Imp_New_genes[new.genes[[i]],] = as.vector(Imputation@norm.data$STARmap[new.genes[i],]) 62 | } 63 | write.csv(Imp_New_genes,file = 'Results/Liger_New_genes.csv') 64 | 65 | # leave-one-out validation 66 | genes.leaveout <- Gene_set 67 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(counts)[2]) 68 | rownames(Imp_genes) <- genes.leaveout 69 | colnames(Imp_genes) <- colnames(counts) 70 | NMF_time <- vector(mode= "numeric") 71 | knn_time <- vector(mode= "numeric") 72 | 73 | for(i in 1:length(genes.leaveout)) { 74 | print(i) 75 | start_time <- Sys.time() 76 | Ligerex.leaveout <- createLiger(list(STARmap = counts[-which(rownames(counts) %in% genes.leaveout[i]),], AllenVISp = allen)) 77 | Ligerex.leaveout <- normalize(Ligerex.leaveout) 78 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i]) 79 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout) 80 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 25, lambda = 35) 81 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout) 82 | end_time <- Sys.time() 83 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 84 | 85 | start_time <- Sys.time() 86 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'AllenVISp', queries = list('STARmap'), norm = TRUE, scale = FALSE) 87 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$STARmap[genes.leaveout[i],]) 88 | end_time <- Sys.time() 89 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 90 | } 91 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv') 92 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE) 93 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE) -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/Performance_evaluation.py: -------------------------------------------------------------------------------- 1 | import os 
2 | os.chdir('STARmap_AllenVISp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import pickle 7 | import matplotlib 8 | matplotlib.use('qt5agg') 9 | matplotlib.rcParams['pdf.fonttype'] = 42 10 | matplotlib.rcParams['ps.fonttype'] = 42 11 | import matplotlib.pyplot as plt 12 | import scipy.stats as st 13 | from sklearn.neighbors import NearestNeighbors 14 | from matplotlib import cm 15 | 16 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f: 17 | datadict = pickle.load(f) 18 | 19 | coords = datadict['coords'] 20 | Starmap_data = datadict['Starmap_data'] 21 | del datadict 22 | 23 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f: 24 | datadict = pickle.load(f) 25 | 26 | RNA_data = datadict['RNA_data'] 27 | del datadict 28 | 29 | all_centroids = np.vstack([c.mean(0) for c in coords]) 30 | 31 | plt.style.use('dark_background') 32 | cmap = cm.get_cmap('viridis',20) 33 | 34 | def Moran_I(SpatialData,XYmap): 35 | XYnbrs = NearestNeighbors(n_neighbors=5, algorithm='auto',metric = 'euclidean').fit(XYmap) 36 | XYdistances, XYindices = XYnbrs.kneighbors(XYmap) 37 | W = np.zeros((SpatialData.shape[0],SpatialData.shape[0])) 38 | for i in range(0,SpatialData.shape[0]): 39 | W[i,XYindices[i,:]]=1 40 | 41 | for i in range(0,SpatialData.shape[0]): 42 | W[i,i]=0 43 | 44 | I = pd.Series(index=SpatialData.columns) 45 | for k in SpatialData.columns: 46 | X_minus_mean = np.array(SpatialData[k] - np.mean(SpatialData[k])) 47 | X_minus_mean = np.reshape(X_minus_mean,(len(X_minus_mean),1)) 48 | Nom = np.sum(np.multiply(W,np.matmul(X_minus_mean,X_minus_mean.T))) 49 | Den = np.sum(np.multiply(X_minus_mean,X_minus_mean)) 50 | I[k] = (len(SpatialData[k])/np.sum(W))*(Nom/Den) 51 | return(I) 52 | 53 | Moran_Is = Moran_I(Starmap_data,all_centroids) 54 | 55 | Gene_Order = np.intersect1d(Starmap_data.columns,RNA_data.columns) 56 | 57 | Moran_Is = Moran_Is[Gene_Order] 58 | 59 | ### SpaGE 60 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut.csv',header=0,index_col=0,sep=',') 
61 | 62 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order] 63 | 64 | SpaGE_Corr = pd.Series(index = Gene_Order) 65 | for i in Gene_Order: 66 | SpaGE_Corr[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed[i])[0] 67 | 68 | ### gimVI 69 | gimVI_imputed = pd.read_csv('Results/gimVI_LeaveOneOut.csv',header=0,index_col=0,sep=',') 70 | 71 | gimVI_imputed.columns = Gene_Order 72 | 73 | gimVI_Corr = pd.Series(index = Gene_Order) 74 | for i in Gene_Order: 75 | gimVI_Corr[i] = st.spearmanr(Starmap_data[i],gimVI_imputed[i])[0] 76 | gimVI_Corr[np.isnan(gimVI_Corr)] = 0 77 | 78 | ### Seurat 79 | Seurat_imputed = pd.read_csv('Results/Seurat_LeaveOneOut.csv',header=0,index_col=0,sep=',').T 80 | 81 | Seurat_imputed = Seurat_imputed.loc[:,Gene_Order] 82 | Seurat_imputed.index = range(0,Seurat_imputed.shape[0]) 83 | cell_labels = pd.read_csv('data/Starmap/visual_1020/20180505_BY3_1kgenes/class_labels.csv', 84 | header=0,sep=',') 85 | Starmap_data_Seurat = Starmap_data.drop(np.where(cell_labels['ClusterName']=='HPC')[0],axis=0) 86 | 87 | Seurat_Corr = pd.Series(index = Gene_Order) 88 | for i in Gene_Order: 89 | Seurat_Corr[i] = st.spearmanr(Starmap_data_Seurat[i],Seurat_imputed[i])[0] 90 | 91 | ### Liger 92 | Liger_imputed = pd.read_csv('Results/Liger_LeaveOneOut.csv',header=0,index_col=0,sep=',').T 93 | 94 | Liger_imputed = Liger_imputed.loc[:,Gene_Order] 95 | Liger_imputed.index = range(0,Liger_imputed.shape[0]) 96 | 97 | Liger_Corr = pd.Series(index = Gene_Order) 98 | for i in Gene_Order: 99 | Liger_Corr[i] = st.spearmanr(Starmap_data[i],Liger_imputed[i])[0] 100 | Liger_Corr[np.isnan(Liger_Corr)] = 0 101 | 102 | ### Comparison plots 103 | plt.style.use('ggplot') 104 | fig, ax = plt.subplots(figsize=(3.7, 5.5)) 105 | ax.boxplot([SpaGE_Corr,Seurat_Corr, Liger_Corr,gimVI_Corr]) 106 | 107 | y = SpaGE_Corr 108 | x = np.random.normal(1, 0.05, len(y)) 109 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2) 110 | y = Seurat_Corr 111 | x = np.random.normal(2, 0.05, len(y)) 112 | plt.plot(x, y, 
'g.', markersize=1, alpha=0.2) 113 | y = Liger_Corr 114 | x = np.random.normal(3, 0.05, len(y)) 115 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2) 116 | y = gimVI_Corr 117 | x = np.random.normal(4, 0.05, len(y)) 118 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2) 119 | 120 | plt.xticks((1,2,3,4),('SpaGE', 'Seurat', 'Liger','gimVI'),size=12) 121 | plt.yticks(size=8) 122 | plt.gca().set_ylim([-0.5,1]) 123 | plt.ylabel('Spearman Correlation',size=12) 124 | ax.set_aspect(aspect=3) 125 | _,p_val = st.wilcoxon(SpaGE_Corr,Seurat_Corr) 126 | plt.text(2,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8) 127 | _,p_val = st.wilcoxon(SpaGE_Corr,Liger_Corr) 128 | plt.text(3,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8) 129 | _,p_val = st.wilcoxon(SpaGE_Corr,gimVI_Corr) 130 | plt.text(4,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8) 131 | plt.show() 132 | 133 | def Compare_Correlations(X,Y): 134 | fig, ax = plt.subplots(figsize=(5.2, 5.2)) 135 | cmap = Moran_Is 136 | ax.axvline(linestyle='--',color='gray') 137 | ax.axhline(linestyle='--',color='gray') 138 | im = ax.scatter(X, Y, s=1, c=cmap) 139 | im.set_cmap('seismic') 140 | plt.gca().set_ylim([-0.5,1]) 141 | lims = [np.min([ax.get_xlim(), ax.get_ylim()]), 142 | np.max([ax.get_xlim(), ax.get_ylim()])] 143 | ax.plot(lims, lims, 'k-') 144 | ax.set_aspect('equal') 145 | ax.set_xlim(lims) 146 | ax.set_ylim(lims) 147 | plt.xticks((-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),size=8) 148 | plt.yticks((-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),size=8) 149 | cbar = plt.colorbar(im) 150 | cbar.ax.tick_params(labelsize=8) 151 | cbar.ax.set_ylabel("Moran's I statistic",fontsize=12) 152 | plt.show() 153 | 154 | Compare_Correlations(Seurat_Corr,SpaGE_Corr) 155 | plt.xlabel('Spearman Correlation Seurat',size=12) 156 | plt.ylabel('Spearman Correlation SpaGE',size=12) 157 | plt.show() 158 | 159 | Compare_Correlations(Liger_Corr,SpaGE_Corr) 160 | plt.xlabel('Spearman Correlation Liger',size=12) 161 | 
plt.ylabel('Spearman Correlation SpaGE',size=12) 162 | plt.show() 163 | 164 | Compare_Correlations(gimVI_Corr,SpaGE_Corr) 165 | plt.xlabel('Spearman Correlation gimVI',size=12) 166 | plt.ylabel('Spearman Correlation SpaGE',size=12) 167 | plt.show() 168 | 169 | def Correlation_vs_Moran(X,Y): 170 | fig, ax = plt.subplots(figsize=(4.8, 4.8)) 171 | ax.scatter(X, Y, s=1) 172 | Corr = st.spearmanr(X,Y)[0] 173 | plt.text(np.mean(plt.gca().get_xlim()),np.min(plt.gca().get_ylim()),'%1.3f'%Corr,color='black',size=9) 174 | plt.xticks(size=8) 175 | plt.yticks(size=8) 176 | plt.axis('scaled') 177 | plt.gca().set_ylim([-0.5,1]) 178 | plt.gca().set_xlim([-0.2,1]) 179 | plt.show() 180 | 181 | 182 | Correlation_vs_Moran(Moran_Is,SpaGE_Corr) 183 | plt.xlabel("Moran's I",size=12) 184 | plt.ylabel('Spearman Correlation SpaGE',size=12) 185 | plt.show() 186 | 187 | Correlation_vs_Moran(Moran_Is,Seurat_Corr) 188 | plt.xlabel("Moran's I",size=12) 189 | plt.ylabel('Spearman Correlation Seurat',size=12) 190 | plt.show() 191 | 192 | Correlation_vs_Moran(Moran_Is,Liger_Corr) 193 | plt.xlabel("Moran's I",size=12) 194 | plt.ylabel('Spearman Correlation Liger',size=12) 195 | plt.show() 196 | 197 | Correlation_vs_Moran(Moran_Is,gimVI_Corr) 198 | plt.xlabel("Moran's I",size=12) 199 | plt.ylabel('Spearman Correlation gimVI',size=12) 200 | plt.show() 201 | -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/Seurat/allen_brain.R: -------------------------------------------------------------------------------- 1 | setwd("STARmap_AllenVISp/") 2 | library(Seurat) 3 | 4 | allen <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv", 5 | row.names = 1,sep = ',', stringsAsFactors = FALSE, header = TRUE) 6 | allen <- as.matrix(x = allen) 7 | genes <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv", 8 | sep = ',', stringsAsFactors = FALSE, header = TRUE) 9 | rownames(x = allen) <- make.unique(names = 
genes$gene_symbol) 10 | meta.data <- read.csv(file = "data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv", 11 | row.names = 1, stringsAsFactors = FALSE) 12 | 13 | al <- CreateSeuratObject(counts = allen, project = 'VISp', meta.data = meta.data, min.cells = 10) 14 | low.q.cells <- rownames(x = meta.data[meta.data$class %in% c('Low Quality', 'No Class'), ]) 15 | ok.cells <- rownames(x = meta.data)[!(rownames(x = meta.data) %in% low.q.cells)] 16 | al <- al[, ok.cells] 17 | al <- NormalizeData(object = al) 18 | al <- FindVariableFeatures(object = al, nfeatures = 2000) 19 | al <- ScaleData(object = al) 20 | al <- RunPCA(object = al, npcs = 50, verbose = FALSE) 21 | al <- RunUMAP(object = al, dims = 1:50, nneighbors = 5) 22 | saveRDS(object = al, file = paste0("data/seurat_objects/","allen_brain.rds")) -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/Seurat/impute_starmap.R: -------------------------------------------------------------------------------- 1 | setwd("STARmap_AllenVISp/") 2 | library(Seurat) 3 | 4 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds") 5 | allen <- readRDS("data/seurat_objects/allen_brain.rds") 6 | 7 | # remove HPC from starmap 8 | class_labels <- read.table( 9 | file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/class_labels.csv", 10 | sep = ",", 11 | header = TRUE, 12 | stringsAsFactors = FALSE 13 | ) 14 | 15 | class_labels$cellname <- paste0('starmap', rownames(class_labels)) 16 | 17 | class_labels$ClusterName <- ifelse(is.na(class_labels$ClusterName), 'Other', class_labels$ClusterName) 18 | 19 | hpc <- class_labels[class_labels$ClusterName == 'HPC', ]$cellname 20 | 21 | accept.cells <- setdiff(colnames(starmap), hpc) 22 | 23 | starmap <- starmap[, accept.cells] 24 | 25 | starmap@misc$spatial <- starmap@misc$spatial[starmap@misc$spatial$cell %in% accept.cells, ] 26 | 27 | #Project on allen labels 28 | i2 <- FindTransferAnchors( 29 | reference = allen, 30 
| query = starmap, 31 | features = rownames(starmap), 32 | reduction = 'cca', 33 | reference.assay = 'RNA', 34 | query.assay = 'RNA' 35 | ) 36 | 37 | refdata <- GetAssayData( 38 | object = allen, 39 | assay = 'RNA', 40 | slot = 'data' 41 | ) 42 | 43 | imputation <- TransferData( 44 | anchorset = i2, 45 | refdata = refdata, 46 | weight.reduction = 'pca' 47 | ) 48 | 49 | starmap[['ss2']] <- imputation 50 | saveRDS(starmap, 'data/seurat_objects/20180505_BY3_1kgenes_imputed.rds') -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/Seurat/starmap.R: -------------------------------------------------------------------------------- 1 | setwd("STARmap_AllenVISp/") 2 | library(Seurat) 3 | library(Matrix) 4 | 5 | read_data <- function(base_path, project) { 6 | counts <- read.table( 7 | file = paste0(base_path, "cell_barcode_count.csv"), 8 | sep = ",", 9 | stringsAsFactors = FALSE 10 | ) 11 | gene.names <- read.table( 12 | file = paste0(base_path, "genes.csv"), 13 | sep = ",", 14 | stringsAsFactors = FALSE 15 | ) 16 | qhulls <- read.table( 17 | file = paste0(base_path, "qhulls.tsv"), 18 | sep = '\t', 19 | col.names = c('cell', 'y', 'x'), 20 | stringsAsFactors = FALSE 21 | ) 22 | centroids <- read.table( 23 | file = paste0(base_path, "centroids.tsv"), 24 | sep = "\t", 25 | col.names = c("spatial2", "spatial1"), 26 | stringsAsFactors = FALSE 27 | ) 28 | colnames(x = counts) <- gene.names$V1 29 | rownames(x = counts) <- paste0('starmap', seq(1:nrow(x = counts))) 30 | counts <- as.matrix(x = counts) 31 | rownames(x = centroids) <- rownames(x = counts) 32 | centroids <- as.matrix(x = centroids) 33 | total.counts = rowSums(x = counts) 34 | 35 | obj <- CreateSeuratObject( 36 | counts = t(x = counts), 37 | project = project, 38 | min.cells = -1, 39 | min.features = -1 40 | ) 41 | obj <- NormalizeData( 42 | object = obj, 43 | scale.factor = median(x = total.counts) 44 | ) 45 | obj <- ScaleData( 46 | object = obj, 47 | features = 
rownames(x = obj) 48 | ) 49 | obj <- RunPCA( 50 | object = obj, 51 | features = rownames(x = obj), 52 | npcs = 30 53 | ) 54 | obj <- RunUMAP( 55 | object = obj, 56 | dims = 1:30 57 | ) 58 | obj[['spatial']] <- CreateDimReducObject( 59 | embeddings = centroids, 60 | assay = "RNA", 61 | key = 'spatial' 62 | ) 63 | qhulls$cell <- paste0('starmap', qhulls$cell) 64 | obj@misc[['spatial']] <- qhulls 65 | return(obj) 66 | } 67 | 68 | # experiments <- c("visual_1020/20180505_BY3_1kgenes/", "visual_1020/20180410-BY3_1kgenes/") 69 | experiments <- c("data/Starmap/visual_1020/20180505_BY3_1kgenes/") 70 | 71 | for(i in 1:length(experiments)) { 72 | project.name <- unlist(x = strsplit(x = experiments[[i]], "/"))[[4]] 73 | dat <- read_data(base_path = experiments[[i]],project = project.name) 74 | saveRDS(object = dat, file = paste0("data/seurat_objects/", project.name, ".rds")) 75 | } -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/Seurat/starmap_integration.R: -------------------------------------------------------------------------------- 1 | setwd("STARmap_AllenVISp/") 2 | library(Seurat) 3 | library(ggplot2) 4 | 5 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds") 6 | starmap.imputed <- readRDS("data/seurat_objects/20180505_BY3_1kgenes_imputed.rds") 7 | allen <- readRDS("data/seurat_objects/allen_brain.rds") 8 | 9 | # remove HPC from starmap 10 | class_labels <- read.table( 11 | file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/class_labels.csv", 12 | sep = ",", 13 | header = TRUE, 14 | stringsAsFactors = FALSE 15 | ) 16 | 17 | class_labels$cellname <- paste0('starmap', rownames(class_labels)) 18 | 19 | class_labels$ClusterName <- ifelse(is.na(class_labels$ClusterName), 'Other', class_labels$ClusterName) 20 | 21 | hpc <- class_labels[class_labels$ClusterName == 'HPC', ]$cellname 22 | 23 | accept.cells <- setdiff(colnames(starmap), hpc) 24 | 25 | starmap <- starmap[, accept.cells] 26 | 27 | 
starmap@misc$spatial <- starmap@misc$spatial[starmap@misc$spatial$cell %in% accept.cells, ] 28 | 29 | genes.leaveout <- intersect(rownames(starmap),rownames(allen)) 30 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(starmap@assays$RNA)[2]) 31 | rownames(Imp_genes) <- genes.leaveout 32 | anchor_time <- vector(mode= "numeric") 33 | Transfer_time <- vector(mode= "numeric") 34 | 35 | run_imputation <- function(ref.obj, query.obj, feature.remove) { 36 | message(paste0('removing ', feature.remove)) 37 | features <- setdiff(rownames(query.obj), feature.remove) 38 | DefaultAssay(ref.obj) <- 'RNA' 39 | DefaultAssay(query.obj) <- 'RNA' 40 | 41 | start_time <- Sys.time() 42 | anchors <- FindTransferAnchors( 43 | reference = ref.obj, 44 | query = query.obj, 45 | features = features, 46 | dims = 1:30, 47 | reduction = 'cca' 48 | ) 49 | end_time <- Sys.time() 50 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 51 | 52 | refdata <- GetAssayData( 53 | object = ref.obj, 54 | assay = 'RNA', 55 | slot = 'data' 56 | ) 57 | 58 | start_time <- Sys.time() 59 | imputation <- TransferData( 60 | anchorset = anchors, 61 | refdata = refdata, 62 | weight.reduction = 'pca' 63 | ) 64 | query.obj[['seq']] <- imputation 65 | end_time <- Sys.time() 66 | Transfer_time <<- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 67 | return(query.obj) 68 | } 69 | 70 | for(i in 1:length(genes.leaveout)) { 71 | imputed.ss2 <- run_imputation(ref.obj = allen, query.obj = starmap, feature.remove = genes.leaveout[[i]]) 72 | starmap[['ss2']] <- imputed.ss2[, colnames(starmap)][['seq']] 73 | Imp_genes[genes.leaveout[[i]],] = as.vector(starmap@assays$ss2[genes.leaveout[i],]) 74 | } 75 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv') 76 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE) 77 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE) 78 | 79 | # show 
genes not in the starmap dataset 80 | DefaultAssay(starmap.imputed) <- "ss2" 81 | new.genes <- c('Tesc', 'Pvrl3', 'Sox10', 'Grm2', 'Tcrb') 82 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(starmap.imputed@assays$ss2)[2]) 83 | rownames(Imp_New_genes) <- new.genes 84 | for(i in 1:length(new.genes)) { 85 | Imp_New_genes[new.genes[[i]],] = as.vector(starmap.imputed@assays$ss2[new.genes[i],]) 86 | } 87 | write.csv(Imp_New_genes,file = 'Results/Seurat_New_genes.csv') -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/SpaGE/Allen_data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('STARmap_AllenVISp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | 9 | RNA_data = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv', 10 | header=0,index_col=0,sep=',') 11 | 12 | Genes = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv', 13 | header=0,sep=',') 14 | RNA_data.index = Genes.gene_symbol 15 | del Genes 16 | 17 | # filter lowly expressed genes 18 | Genes_count = np.sum(RNA_data > 0, axis=1) 19 | RNA_data = RNA_data.loc[Genes_count >=10,:] 20 | del Genes_count 21 | 22 | # filter low quality cells 23 | meta_data = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv', 24 | header=0,sep=',') 25 | HighQualityCells = (meta_data['class'] != 'No Class') & (meta_data['class'] != 'Low Quality') 26 | RNA_data = RNA_data.iloc[:,np.where(HighQualityCells)[0]] 27 | del meta_data, HighQualityCells 28 | 29 | def Log_Norm(x): 30 | return np.log(((x/np.sum(x))*1000000) + 1) 31 | 32 | RNA_data = RNA_data.apply(Log_Norm,axis=0) 33 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index) 34 | 35 | datadict = dict() 36 | datadict['RNA_data'] = RNA_data.T 37 | datadict['RNA_data_scaled'] = 
RNA_data_scaled 38 | 39 | with open('data/SpaGE_pkl/Allen_VISp.pkl','wb') as f: 40 | pickle.dump(datadict, f) 41 | -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/SpaGE/Integration.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('STARmap_AllenVISp/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.neighbors import NearestNeighbors 8 | import time as tm 9 | 10 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f: 11 | datadict = pickle.load(f) 12 | 13 | Starmap_data = datadict['Starmap_data'] 14 | Starmap_data_scaled = datadict['Starmap_data_scaled'] 15 | labels = datadict['labels'] 16 | qhulls = datadict['qhulls'] 17 | coords = datadict['coords'] 18 | del datadict 19 | 20 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f: 21 | datadict = pickle.load(f) 22 | 23 | RNA_data = datadict['RNA_data'] 24 | RNA_data_scaled = datadict['RNA_data_scaled'] 25 | del datadict 26 | 27 | #### Leave One Out Validation #### 28 | Common_data = RNA_data_scaled[np.intersect1d(Starmap_data_scaled.columns,RNA_data_scaled.columns)] 29 | Imp_Genes = pd.DataFrame(columns=Common_data.columns) 30 | precise_time = [] 31 | knn_time = [] 32 | for i in Common_data.columns: 33 | print(i) 34 | start = tm.time() 35 | from principal_vectors import PVComputation 36 | 37 | n_factors = 50 38 | n_pv = 50 39 | dim_reduction = 'pca' 40 | dim_reduction_target = 'pca' 41 | 42 | pv_FISH_RNA = PVComputation(n_factors = n_factors,n_pv = n_pv,dim_reduction = dim_reduction,dim_reduction_target = dim_reduction_target) 43 | 44 | pv_FISH_RNA.fit(Common_data.drop(i,axis=1),Starmap_data_scaled[Common_data.columns].drop(i,axis=1)) 45 | 46 | S = pv_FISH_RNA.source_components_.T 47 | 48 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 49 | S = S[:,0:Effective_n_pv] 50 | 51 | Common_data_t = Common_data.drop(i,axis=1).dot(S) 
52 | FISH_exp_t = Starmap_data_scaled[Common_data.columns].drop(i,axis=1).dot(S) 53 | precise_time.append(tm.time()-start) 54 | 55 | start = tm.time() 56 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t) 57 | distances, indices = nbrs.kneighbors(FISH_exp_t) 58 | 59 | Imp_Gene = np.zeros(Starmap_data.shape[0]) 60 | for j in range(0,Starmap_data.shape[0]): 61 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 62 | weights = weights/(len(weights)-1) 63 | Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights)) 64 | Imp_Gene[np.isnan(Imp_Gene)] = 0 65 | Imp_Genes[i] = Imp_Gene 66 | knn_time.append(tm.time()-start) 67 | 68 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv') 69 | precise_time = pd.DataFrame(precise_time) 70 | knn_time = pd.DataFrame(knn_time) 71 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False) 72 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False) 73 | 74 | ##### Novel Genes Expression Patterns #### 75 | Common_data = RNA_data_scaled[np.intersect1d(Starmap_data_scaled.columns,RNA_data_scaled.columns)] 76 | #genes_to_impute = ["Tesc","Pvrl3","Sox10","Grm2","Tcrb"] 77 | genes_to_impute = np.setdiff1d(RNA_data.columns,Starmap_data.columns) 78 | Imp_New_Genes = pd.DataFrame(np.zeros((Starmap_data.shape[0],len(genes_to_impute))),columns=genes_to_impute) 79 | 80 | from principal_vectors import PVComputation 81 | 82 | n_factors = 50 83 | n_pv = 50 84 | dim_reduction = 'pca' 85 | dim_reduction_target = 'pca' 86 | 87 | pv_FISH_RNA = PVComputation( 88 | n_factors = n_factors, 89 | n_pv = n_pv, 90 | dim_reduction = dim_reduction, 91 | dim_reduction_target = dim_reduction_target 92 | ) 93 | 94 | pv_FISH_RNA.fit(Common_data,Starmap_data_scaled[Common_data.columns]) 95 | 96 | S = pv_FISH_RNA.source_components_.T 97 | 98 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 99 | S = 
S[:,0:Effective_n_pv] 100 | 101 | Common_data_t = Common_data.dot(S) 102 | FISH_exp_t = Starmap_data_scaled[Common_data.columns].dot(S) 103 | 104 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto', 105 | metric = 'cosine').fit(Common_data_t) 106 | distances, indices = nbrs.kneighbors(FISH_exp_t) 107 | 108 | for j in range(0,Starmap_data.shape[0]): 109 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 110 | weights = weights/(len(weights)-1) 111 | Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]]) 112 | 113 | Imp_New_Genes.to_csv('Results/SpaGE_New_genes.csv') -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/SpaGE/Precise_output.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('STARmap_AllenVISp/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | import matplotlib 8 | matplotlib.rcParams['pdf.fonttype'] = 42 9 | matplotlib.rcParams['ps.fonttype'] = 42 10 | import matplotlib.pyplot as plt 11 | import seaborn as sns 12 | import sys 13 | sys.path.insert(1,'SpaGE/') 14 | from principal_vectors import PVComputation 15 | 16 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f: 17 | datadict = pickle.load(f) 18 | 19 | Starmap_data = datadict['Starmap_data'] 20 | Starmap_data_scaled = datadict['Starmap_data_scaled'] 21 | labels = datadict['labels'] 22 | qhulls = datadict['qhulls'] 23 | coords = datadict['coords'] 24 | del datadict 25 | 26 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f: 27 | datadict = pickle.load(f) 28 | 29 | RNA_data = datadict['RNA_data'] 30 | RNA_data_scaled = datadict['RNA_data_scaled'] 31 | del datadict 32 | 33 | Common_data = RNA_data_scaled[np.intersect1d(Starmap_data_scaled.columns,RNA_data_scaled.columns)] 34 | 35 | n_factors = 50 36 | n_pv = 50 37 | n_pv_display = 50 38 | dim_reduction = 
'pca' 39 | dim_reduction_target = 'pca' 40 | 41 | pv_FISH_RNA = PVComputation( 42 | n_factors = n_factors, 43 | n_pv = n_pv, 44 | dim_reduction = dim_reduction, 45 | dim_reduction_target = dim_reduction_target 46 | ) 47 | 48 | pv_FISH_RNA.fit(Common_data,Starmap_data_scaled[Common_data.columns]) 49 | 50 | fig = plt.figure() 51 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 52 | center=0, vmax=1., vmin=0) 53 | plt.xlabel('Starmap',fontsize=18, color='black') 54 | plt.ylabel('Allen_VISp',fontsize=18, color='black') 55 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 56 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 57 | plt.gca().set_ylim([n_pv_display,0]) 58 | plt.show() 59 | 60 | plt.figure() 61 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 62 | center=0, vmax=1., vmin=0) 63 | for i in range(n_pv_display-1): 64 | plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black') 65 | 66 | plt.xlabel('Starmap',fontsize=18, color='black') 67 | plt.ylabel('Allen_VISp',fontsize=18, color='black') 68 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 69 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 70 | plt.gca().set_ylim([n_pv_display,0]) 71 | plt.show() 72 | 73 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns) 74 | Importance.sort_values(ascending=False,inplace=True) 75 | Importance.index[0:50] 76 | 77 | ### Technology specific Processes 78 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 79 | 80 | # explained variance RNA 81 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 82 | # explained variance spatial 83 | 
np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 84 | -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/SpaGE/Starmap_data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('STARmap_AllenVISp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | from viz import GetQHulls 9 | 10 | counts = np.load('data/Starmap/visual_1020/20180505_BY3_1kgenes/cell_barcode_count.npy') 11 | Genes = pd.read_csv('data/Starmap/visual_1020/20180505_BY3_1kgenes/genes.csv',header=None) 12 | Genes = (Genes.iloc[:,0]) 13 | counts = pd.DataFrame(data=counts,columns=Genes) 14 | Starmap_data = counts.T 15 | del Genes, counts 16 | 17 | cell_count = np.sum(Starmap_data,axis=0) 18 | def Log_Norm(x): 19 | return np.log(((x/np.sum(x))*np.median(cell_count)) + 1) 20 | 21 | Starmap_data = Starmap_data.apply(Log_Norm,axis=0) 22 | Starmap_data_scaled = pd.DataFrame(data=st.zscore(Starmap_data.T),index = Starmap_data.columns,columns=Starmap_data.index) 23 | 24 | labels = np.load('data/Starmap/visual_1020/20180505_BY3_1kgenes/labels.npz')["labels"] 25 | qhulls,coords = GetQHulls(labels) 26 | 27 | datadict = dict() 28 | datadict['Starmap_data'] = Starmap_data.T 29 | datadict['Starmap_data_scaled'] = Starmap_data_scaled 30 | datadict['labels'] = labels 31 | datadict['qhulls'] = qhulls 32 | datadict['coords'] = coords 33 | 34 | with open('data/SpaGE_pkl/Starmap.pkl','wb') as f: 35 | pickle.dump(datadict, f) 36 | -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/SpaGE/dimensionality_reduction.py: -------------------------------------------------------------------------------- 1 | """ Dimensionality Reduction 2 | @author: Soufiane Mourragui 3 | This module extracts the domain-specific factors from the high-dimensional omics 4 | 
dataset. Several methods are implemented here and can be called directly 5 | by string name from the main method. All the methods 6 | use the scikit-learn implementation. 7 | Notes 8 | ------- 9 | - 10 | 11 | References 12 | ------- 13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python. 14 | Journal of Machine Learning Research 15 | """ 16 | 17 | import numpy as np 18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA 19 | from sklearn.cross_decomposition import PLSRegression 20 | 21 | 22 | def process_dim_reduction(method='pca', n_dim=10): 23 | """ 24 | Default linear dimensionality reduction method. For each method, return a 25 | BaseEstimator instance corresponding to the method given as input. 26 | Attributes 27 | ------- 28 | method: str, default to 'pca' 29 | Method used for dimensionality reduction. 30 | Implemented: 'pca', 'ica', 'fa' (Factor Analysis), 31 | 'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA), 'pls' (Partial Least Squares). 32 | 33 | n_dim: int, default to 10 34 | Number of domain-specific factors to compute. 35 | Return values 36 | ------- 37 | Classifier, i.e. BaseEstimator instance 38 | """ 39 | 40 | if method.lower() == 'pca': 41 | clf = PCA(n_components=n_dim) 42 | 43 | elif method.lower() == 'ica': 44 | print('ICA') 45 | clf = FastICA(n_components=n_dim) 46 | 47 | elif method.lower() == 'fa': 48 | clf = FactorAnalysis(n_components=n_dim) 49 | 50 | elif method.lower() == 'nmf': 51 | clf = NMF(n_components=n_dim) 52 | 53 | elif method.lower() == 'sparsepca': 54 | clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1) 55 | 56 | elif method.lower() == 'pls': 57 | clf = PLS(n_components=n_dim) 58 | 59 | else: 60 | raise NameError('%s is not an implemented method'%(method)) 61 | 62 | return clf 63 | 64 | 65 | class PLS(): 66 | """ 67 | Implement PLS to make it compliant with the other dimensionality 68 | reduction methods. 69 | (Simple class rewriting). 
70 | """ 71 | def __init__(self, n_components=10): 72 | self.clf = PLSRegression(n_components) 73 | 74 | def get_components_(self): 75 | return self.clf.x_weights_.transpose() 76 | 77 | def set_components_(self, x): 78 | pass 79 | 80 | components_ = property(get_components_, set_components_) 81 | 82 | def fit(self, X, y): 83 | self.clf.fit(X,y) 84 | return self 85 | 86 | def transform(self, X): 87 | return self.clf.transform(X) 88 | 89 | def predict(self, X): 90 | return self.clf.predict(X) -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/SpaGE/viz.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from scipy.spatial import ConvexHull 3 | from skimage.transform import downscale_local_mean 4 | from matplotlib.patches import Polygon 5 | from matplotlib.collections import PatchCollection 6 | from skimage.measure import regionprops 7 | import numpy as np 8 | import matplotlib.pyplot as plt 9 | 10 | def plot_heatmap_with_labels(data_obj, degenes,cmap,show_axis=True, font_size=10): 11 | g = plt.GridSpec(2,1, wspace=0.01, hspace=0.01, height_ratios=[0.5,10]) 12 | ax = plt.subplot(g[0]) 13 | ax.imshow(np.expand_dims(np.sort(data_obj._clusts),1).T,aspect='auto',interpolation='none',cmap=cmap) 14 | ax.axis('off') 15 | ax = plt.subplot(g[1]) 16 | data_obj.plot_heatmap(list(degenes), fontsize=font_size,use_imshow=False,ax=ax) 17 | if not show_axis: 18 | plt.axis('off') 19 | 20 | def GetQHulls(labels): 21 | labels += 1 22 | Nlabels = labels.max() 23 | hulls = [] 24 | coords = [] 25 | num_cells = 0 26 | print('blah') 27 | for i in range(Nlabels):#enumerate(regionprops(labels)): 28 | print(i,"/",Nlabels) 29 | curr_coords = np.argwhere(labels==i) 30 | # size threshold of > 1000 pixels and < 100000 31 | if curr_coords.shape[0] < 100000 and curr_coords.shape[0] > 1000: 32 | num_cells += 1 33 | hulls.append(ConvexHull(curr_coords)) 34 | 
coords.append(curr_coords) 35 | print("Used %d / %d" % (num_cells, Nlabels)) 36 | return hulls, coords 37 | 38 | 39 | def hull_to_polygon(hull): 40 | cent = np.mean(hull.points, 0) 41 | pts = [] 42 | for pt in hull.points[hull.simplices]: 43 | pts.append(pt[0].tolist()) 44 | pts.append(pt[1].tolist()) 45 | pts.sort(key=lambda p: np.arctan2(p[1] - cent[1], 46 | p[0] - cent[0])) 47 | pts = pts[0::2] # Deleting duplicates 48 | pts.insert(len(pts), pts[0]) 49 | k =1.1 50 | poly = Polygon(k*(np.array(pts)- cent) + cent,edgecolor='k', linewidth=1) 51 | #poly.set_capstyle('round') 52 | return poly 53 | 54 | def plot_poly_cells_expression(nissl, hulls, expr, cmap, good_cells=None,width=2, height=9,figscale=10,alpha=1): 55 | figscale = 10 56 | plt.figure(figsize=(figscale*width/float(height),figscale)) 57 | polys = [hull_to_polygon(h) for h in hulls] 58 | if good_cells is not None: 59 | polys = [p for i,p in enumerate(polys) if i in good_cells] 60 | p = PatchCollection(polys,alpha=alpha, cmap=cmap,linewidths=0) 61 | p.set_array(expr) 62 | p.set_clim(vmin=0, vmax=expr.max()) 63 | plt.gca().add_collection(p) 64 | plt.imshow(nissl.T, cmap=plt.cm.gray_r,alpha=0.15) 65 | plt.axis('off') 66 | 67 | def plot_poly_cells_expression_99(nissl, hulls, expr, cmap, good_cells=None,width=2, height=9,figscale=10,alpha=1): 68 | figscale = 10 69 | plt.figure(figsize=(figscale*width/float(height),figscale)) 70 | polys = [hull_to_polygon(h) for h in hulls] 71 | if good_cells is not None: 72 | polys = [p for i,p in enumerate(polys) if i in good_cells] 73 | p = PatchCollection(polys,alpha=alpha, cmap=cmap,linewidths=0) 74 | p.set_array(expr) 75 | p.set_clim(vmin=0, vmax=np.percentile(expr,99)) 76 | plt.gca().add_collection(p) 77 | plt.imshow(nissl.T, cmap=plt.cm.gray_r,alpha=0.15) 78 | plt.axis('off') 79 | 80 | def plot_poly_cells_expression_40(nissl, hulls, expr, cmap, good_cells=None,width=2, height=9,figscale=10,alpha=1): 81 | figscale = 10 82 | 
plt.figure(figsize=(figscale*width/float(height),figscale)) 83 | polys = [hull_to_polygon(h) for h in hulls] 84 | if good_cells is not None: 85 | polys = [p for i,p in enumerate(polys) if i in good_cells] 86 | p = PatchCollection(polys,alpha=alpha, cmap=cmap,linewidths=0) 87 | p.set_array(expr) 88 | p.set_clim(vmin=np.percentile(expr,40), vmax=np.percentile(expr,99)) 89 | plt.gca().add_collection(p) 90 | plt.imshow(nissl.T, cmap=plt.cm.gray_r,alpha=0.15) 91 | plt.axis('off') 92 | 93 | def plot_poly_cells_cluster(nissl, hulls, colors, cmap, good_cells=None,width=2, height=9,figscale=10, rescale_colors=False,alpha=1,vmin=None,vmax=None): 94 | figscale = 10 95 | plt.figure(figsize=(figscale*width/float(height),figscale)) 96 | polys = [hull_to_polygon(h) for h in hulls] 97 | if good_cells is not None: 98 | polys = [p for i,p in enumerate(polys) if i in good_cells] 99 | p = PatchCollection(polys,alpha=alpha, cmap=cmap,edgecolor='k', linewidth=0.5) 100 | if vmin is not None or vmax is not None: 101 | p.set_array(colors) 102 | p.set_clim(vmin=vmin,vmax=vmax) 103 | else: 104 | if rescale_colors: 105 | p.set_array(colors+1) 106 | p.set_clim(vmin=0, vmax=max(colors+1)) 107 | else: 108 | p.set_array(colors) 109 | p.set_clim(vmin=0, vmax=max(colors)) 110 | nissl = (nissl > 0).astype(int) 111 | plt.imshow(nissl.T,cmap=plt.cm.gray_r,alpha=0.15) 112 | plt.gca().add_collection(p) 113 | plt.axis('off') 114 | return polys 115 | 116 | def plot_cells_cluster(nissl, coords, good_cells, colors, cmap, width=2, height=9,figscale=100, vmin=None,vmax=None): 117 | figscale = 10 118 | plt.figure(figsize=(figscale*width/float(height),figscale)) 119 | img = -1*np.ones_like(nissl) 120 | curr_coords = [coords[k] for k in range(len(coords)) if k in good_cells] 121 | for i,c in enumerate(curr_coords): 122 | for k in c: 123 | if k[0] < img.shape[0] and k[1] < img.shape[1]: 124 | img[k[0],k[1]] = colors[i] 125 | plt.imshow(img.T,cmap=cmap,vmin=-1,vmax=colors.max()) 126 | plt.axis('off') 127 | 128 | def 
get_cells_and_clusts_for_experiment(analysis_obj, expt_id): 129 | good_cells = analysis_obj._meta.index[(analysis_obj._meta["orig_ident"]==expt_id)].values 130 | colors = analysis_obj._clusts[analysis_obj._meta["orig_ident"]==expt_id] 131 | return good_cells, colors 132 | -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/Starmap_plots.R: -------------------------------------------------------------------------------- 1 | setwd("STARmap_AllenVISp/") 2 | library(Seurat) 3 | library(ggplot2) 4 | 5 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds") 6 | starmap.imputed <- readRDS("data/seurat_objects/20180505_BY3_1kgenes_imputed.rds") 7 | allen <- readRDS("data/seurat_objects/allen_brain.rds") 8 | 9 | DefaultAssay(starmap.imputed) <- "RNA" 10 | genes.leaveout <- intersect(rownames(starmap),rownames(allen)) 11 | 12 | # Original 13 | for(i in 1:length(genes.leaveout)) { 14 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) { 15 | qcut = 'q40' 16 | } else { 17 | qcut = 0 18 | } 19 | p <- PolyFeaturePlot(starmap.imputed, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme() 20 | ggsave(plot = p, filename = paste0('Figures/Original/', genes.leaveout[[i]],'.pdf')) 21 | } 22 | 23 | # Seurat_Predicted 24 | Seurat_Predicted <- read.csv(file = 'Results/Seurat_LeaveOneOut.csv',header = TRUE, check.names = FALSE, row.names = 1) 25 | Seurat_Predicted <- Seurat_Predicted[genes.leaveout,] 26 | colnames(Seurat_Predicted) <- colnames(starmap.imputed@assays$ss2[,]) 27 | starmap.imputed[['ss2']] <- CreateAssayObject(data = as.matrix(Seurat_Predicted)) 28 | DefaultAssay(starmap.imputed) <- 'ss2' 29 | for(i in 1:length(genes.leaveout)) { 30 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) { 31 | qcut = 'q40' 32 | } else { 33 | qcut = 0 34 | } 35 | p <- PolyFeaturePlot(starmap.imputed, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + 
SpatialTheme() 36 | ggsave(plot = p, filename = paste0('Figures/Seurat_Predicted/', genes.leaveout[[i]],'.pdf')) 37 | } 38 | 39 | # SpaGE_Predicted 40 | SpaGE_Predicted <- read.csv(file = 'Results/SpaGE_LeaveOneOut.csv',header = TRUE, check.names = FALSE, row.names = 1) 41 | SpaGE_Predicted <- t(SpaGE_Predicted) 42 | SpaGE_Predicted <- SpaGE_Predicted[genes.leaveout,] 43 | colnames(SpaGE_Predicted) <- colnames(starmap@assays$RNA[,]) 44 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(SpaGE_Predicted)) 45 | DefaultAssay(starmap) <- 'ss2' 46 | for(i in 1:length(genes.leaveout)) { 47 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) { 48 | qcut = 'q40' 49 | } else { 50 | qcut = 0 51 | } 52 | p <- PolyFeaturePlot(starmap, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme() 53 | ggsave(plot = p, filename = paste0('Figures/SpaGE_Predicted/', genes.leaveout[[i]],'.pdf')) 54 | } 55 | 56 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds") 57 | 58 | # Liger_Predicted 59 | Liger_Predicted <- read.csv(file = 'Results/Liger_LeaveOneOut.csv',header = TRUE, check.names = FALSE, row.names = 1) 60 | Liger_Predicted <- Liger_Predicted[genes.leaveout,] 61 | colnames(Liger_Predicted) <- colnames(starmap@assays$RNA[,]) 62 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(Liger_Predicted)) 63 | DefaultAssay(starmap) <- 'ss2' 64 | for(i in 1:length(genes.leaveout)) { 65 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) { 66 | qcut = 'q40' 67 | } else { 68 | qcut = 0 69 | } 70 | p <- PolyFeaturePlot(starmap, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme() 71 | ggsave(plot = p, filename = paste0('Figures/Liger_Predicted/', genes.leaveout[[i]],'.pdf')) 72 | } 73 | 74 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds") 75 | 76 | # gimVI_Predicted 77 | gimVI_Predicted <- read.csv(file = 'Results/gimVI_LeaveOneOut.csv',header = TRUE, check.names = 
FALSE, row.names = 1) 78 | gimVI_Predicted <- t(gimVI_Predicted) 79 | gimVI_Predicted <- gimVI_Predicted[toupper(genes.leaveout),] 80 | colnames(gimVI_Predicted) <- colnames(starmap@assays$RNA[,]) 81 | rownames(gimVI_Predicted) <- genes.leaveout 82 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(gimVI_Predicted)) 83 | DefaultAssay(starmap) <- 'ss2' 84 | for(i in 1:length(genes.leaveout)) { 85 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) { 86 | qcut = 'q40' 87 | } else { 88 | qcut = 0 89 | } 90 | p <- PolyFeaturePlot(starmap, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme() 91 | ggsave(plot = p, filename = paste0('Figures/gimVI_Predicted/', genes.leaveout[[i]],'.pdf')) 92 | } 93 | 94 | 95 | ### New genes 96 | # show genes not in the starmap dataset 97 | new.genes <- c('Tesc', 'Pvrl3', 'Sox10', 'Grm2', 'Tcrb') 98 | starmap.imputed <- readRDS("data/seurat_objects/20180505_BY3_1kgenes_imputed.rds") 99 | DefaultAssay(starmap.imputed) <- "ss2" 100 | 101 | for(i in new.genes) { 102 | p <- PolyFeaturePlot(starmap.imputed, i, flip.coords = TRUE, min.cutoff = 0, max.cutoff = 'q99') + SpatialTheme() 103 | ggsave(plot = p, filename = paste0('Figures/Seurat_Predicted/New_', i,'.pdf')) 104 | } 105 | 106 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds") 107 | #SpaGE 108 | SpaGE_New <- read.csv(file = 'Results/SpaGE_New_genes.csv',header = TRUE, check.names = FALSE, row.names = 1) 109 | SpaGE_New <- t(SpaGE_New) 110 | new.genes <- c('Tesc', 'Pvrl3', 'Sox10', 'Grm2', 'Tcrb','Ttyh2','Cldn11','Tmem88b') 111 | SpaGE_New2 <- SpaGE_New[new.genes,] 112 | colnames(SpaGE_New2) <- colnames(starmap@assays$RNA[,]) 113 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(SpaGE_New2)) 114 | DefaultAssay(starmap) <- 'ss2' 115 | for(i in new.genes) { 116 | p <- PolyFeaturePlot(starmap, i, flip.coords = TRUE, min.cutoff = 0, max.cutoff = 'q99') + SpatialTheme() 117 | ggsave(plot = p, filename = 
paste0('Figures/SpaGE_Predicted/New_', i,'.pdf')) 118 | } 119 | 120 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds") 121 | # Liger 122 | Liger_New <- read.csv(file = 'Results/Liger_New_genes.csv',header = TRUE, check.names = FALSE, row.names = 1) 123 | Liger_New <- Liger_New[new.genes,] 124 | colnames(Liger_New) <- colnames(starmap@assays$RNA[,]) 125 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(Liger_New)) 126 | DefaultAssay(starmap) <- 'ss2' 127 | for(i in new.genes) { 128 | p <- PolyFeaturePlot(starmap, i, flip.coords = TRUE, min.cutoff = 0, max.cutoff = 'q99') + SpatialTheme() 129 | ggsave(plot = p, filename = paste0('Figures/Liger_Predicted/New_', i,'.pdf')) 130 | } 131 | 132 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds") 133 | # gimVI 134 | gimVI_New <- read.csv(file = 'Results/gimVI_New_genes.csv',header = TRUE, check.names = FALSE, row.names = 1) 135 | gimVI_New <- t(gimVI_New) 136 | gimVI_New <- gimVI_New[toupper(new.genes),] 137 | colnames(gimVI_New) <- colnames(starmap@assays$RNA[,]) 138 | rownames(gimVI_New) <- new.genes 139 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(gimVI_New)) 140 | DefaultAssay(starmap) <- 'ss2' 141 | for(i in new.genes) { 142 | p <- PolyFeaturePlot(starmap, i, flip.coords = TRUE, min.cutoff = 0, max.cutoff = 'q99') + SpatialTheme() 143 | ggsave(plot = p, filename = paste0('Figures/gimVI_Predicted/New_', i,'.pdf')) 144 | } -------------------------------------------------------------------------------- /benchmark/STARmap_AllenVISp/gimVI/gimVI.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('STARmap_AllenVISp/') 3 | 4 | from scvi.dataset import CsvDataset 5 | from scvi.models import JVAE, Classifier 6 | from scvi.inference import JVAETrainer 7 | import numpy as np 8 | import pandas as pd 9 | import copy 10 | import torch 11 | import time as tm 12 | 13 | ### STARmap data 14 | Starmap_data = 
CsvDataset('data/gimVI_data/STARmap_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 15 | 16 | ### AllenVISp 17 | RNA_data = CsvDataset('data/gimVI_data/Allen_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 18 | 19 | ### Leave-one-out validation 20 | Gene_set = np.intersect1d(Starmap_data.gene_names,RNA_data.gene_names) 21 | Starmap_data.X = Starmap_data.X[:,np.reshape(np.vstack([np.argwhere(i==Starmap_data.gene_names) for i in Gene_set]),-1)] 22 | Starmap_data.gene_names = Gene_set 23 | Common_data = copy.deepcopy(RNA_data) 24 | Common_data.gene_names = Gene_set 25 | Common_data.X = Common_data.X[:,np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in Gene_set]),-1)] 26 | Imp_Genes = pd.DataFrame(columns=Gene_set) 27 | gimVI_time = [] 28 | 29 | for i in Gene_set: 30 | print(i) 31 | # Create copy of the fish dataset with hidden genes 32 | data_spatial_partial = copy.deepcopy(Starmap_data) 33 | data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(Starmap_data.gene_names,i)) 34 | data_spatial_partial.batch_indices += Common_data.n_batches 35 | 36 | datasets = [Common_data, data_spatial_partial] 37 | generative_distributions = ["zinb", "nb"] 38 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(i==Common_data.gene_names) for i in data_spatial_partial.gene_names]),-1)] 39 | n_inputs = [d.nb_genes for d in datasets] 40 | total_genes = Common_data.nb_genes 41 | n_batches = sum([d.n_batches for d in datasets]) 42 | 43 | model_library_size = [True, False] 44 | 45 | n_latent = 8 46 | kappa = 5 47 | 48 | start = tm.time() 49 | torch.manual_seed(0) 50 | 51 | model = JVAE( 52 | n_inputs, 53 | total_genes, 54 | gene_mappings, 55 | generative_distributions, 56 | model_library_size, 57 | n_layers_decoder_individual=0, 58 | n_layers_decoder_shared=0, 59 | n_layers_encoder_individual=1, 60 | n_layers_encoder_shared=1, 61 | dim_hidden_encoder=64, 62 | dim_hidden_decoder_shared=64, 63 | dropout_rate_encoder=0.2, 64 | 
dropout_rate_decoder=0.2, 65 | n_batch=n_batches, 66 | n_latent=n_latent, 67 | ) 68 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 69 | 70 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa) 71 | trainer.train(n_epochs=200) 72 | _,Imputed = trainer.get_imputed_values(normalized=True) 73 | Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1) 74 | Imp_Genes[i] = Imputed 75 | gimVI_time.append(tm.time()-start) 76 | 77 | Imp_Genes = Imp_Genes.fillna(0) 78 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv') 79 | gimVI_time = pd.DataFrame(gimVI_time) 80 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False) 81 | 82 | ### New genes 83 | Imp_New_Genes = pd.DataFrame(columns=["TESC","PVRL3","SOX10","GRM2","TCRB"]) 84 | 85 | # Create copy of the fish dataset with hidden genes 86 | data_spatial_partial = copy.deepcopy(Starmap_data) 87 | data_spatial_partial.filter_genes_by_attribute(Starmap_data.gene_names) 88 | data_spatial_partial.batch_indices += RNA_data.n_batches 89 | 90 | datasets = [RNA_data, data_spatial_partial] 91 | generative_distributions = ["zinb", "nb"] 92 | gene_mappings = [slice(None), np.reshape(np.vstack(np.argwhere(i==RNA_data.gene_names) for i in data_spatial_partial.gene_names),-1)] 93 | n_inputs = [d.nb_genes for d in datasets] 94 | total_genes = RNA_data.nb_genes 95 | n_batches = sum([d.n_batches for d in datasets]) 96 | 97 | model_library_size = [True, False] 98 | 99 | n_latent = 8 100 | kappa = 5 101 | 102 | torch.manual_seed(0) 103 | 104 | model = JVAE( 105 | n_inputs, 106 | total_genes, 107 | gene_mappings, 108 | generative_distributions, 109 | model_library_size, 110 | n_layers_decoder_individual=0, 111 | n_layers_decoder_shared=0, 112 | n_layers_encoder_individual=1, 113 | n_layers_encoder_shared=1, 114 | dim_hidden_encoder=64, 115 | dim_hidden_decoder_shared=64, 116 | dropout_rate_encoder=0.2, 117 | dropout_rate_decoder=0.2, 118 | n_batch=n_batches, 119 | 
n_latent=n_latent, 120 | ) 121 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 122 | 123 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa) 124 | trainer.train(n_epochs=200) 125 | 126 | for i in ["TESC","PVRL3","SOX10","GRM2","TCRB"]: 127 | _,Imputed = trainer.get_imputed_values(normalized=True) 128 | Imputed = np.reshape(Imputed[:,np.argwhere(i==RNA_data.gene_names)[0]],-1) 129 | Imp_New_Genes[i] = Imputed 130 | 131 | Imp_New_Genes.to_csv('Results/gimVI_New_genes.csv') 132 | -------------------------------------------------------------------------------- /benchmark/Timing_Evaluation.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib 4 | matplotlib.rcParams['pdf.fonttype'] = 42 5 | matplotlib.rcParams['ps.fonttype'] = 42 6 | import matplotlib.pyplot as plt 7 | from matplotlib.lines import Line2D 8 | 9 | ### Timing 10 | experiments = ['STARmap_AllenVISp','osmFISH_Ziesel','osmFISH_AllenVISp', 11 | 'osmFISH_AllenSSp','MERFISH_Moffit'] 12 | # SpaGE 13 | SpaGE_Avg_Time = pd.Series(index = experiments) 14 | for i in experiments: 15 | SpaGE_Precise_Time = np.array(pd.read_csv(i + '/Results/SpaGE_PreciseTime.csv', 16 | header=0,sep=',')).reshape(-1) 17 | SpaGE_knn_Time = np.array(pd.read_csv(i +'/Results/SpaGE_knnTime.csv', 18 | header=0,sep=',')).reshape(-1) 19 | 20 | SpaGE_total_time = SpaGE_Precise_Time + SpaGE_knn_Time 21 | SpaGE_Avg_Time[i] = np.mean(SpaGE_total_time) 22 | del SpaGE_Precise_Time,SpaGE_knn_Time,SpaGE_total_time 23 | 24 | # gimVI 25 | gimVI_Avg_Time = pd.Series(index = experiments) 26 | for i in experiments: 27 | gimVI_Time = np.array(pd.read_csv(i+'/Results/gimVI_Time.csv',header=0,sep=',')).reshape(-1) 28 | gimVI_Avg_Time[i] = np.mean(gimVI_Time) 29 | del gimVI_Time 30 | 31 | # Seurat 32 | Seurat_Avg_Time = pd.Series(index = experiments) 33 | for i in experiments: 34 | Seurat_anchor_Time = 
np.array(pd.read_csv(i+'/Results/Seurat_anchor_time.csv', 35 | header=0,sep=',')).reshape(-1) 36 | Seurat_transfer_Time = np.array(pd.read_csv(i+'/Results/Seurat_transfer_time.csv' 37 | ,header=0,sep=',')).reshape(-1) 38 | 39 | Seurat_total_time = Seurat_anchor_Time + Seurat_transfer_Time 40 | Seurat_Avg_Time[i] = np.mean(Seurat_total_time) 41 | del Seurat_anchor_Time,Seurat_transfer_Time,Seurat_total_time 42 | 43 | # Liger 44 | Liger_Avg_Time = pd.Series(index = experiments) 45 | for i in experiments: 46 | Liger_NMF_Time = np.array(pd.read_csv(i+'/Results/Liger_NMF_time.csv', 47 | header=0,sep=',')).reshape(-1) 48 | Liger_knn_Time = np.array(pd.read_csv(i+'/Results/Liger_knn_time.csv', 49 | header=0,sep=',')).reshape(-1) 50 | 51 | Liger_total_time = Liger_NMF_Time + Liger_knn_Time 52 | Liger_Avg_Time[i] = np.mean(Liger_total_time) 53 | del Liger_NMF_Time,Liger_knn_Time,Liger_total_time 54 | 55 | plt.style.use('ggplot') 56 | plt.figure(figsize=(9, 3)) 57 | plt.plot([1,2,3,4,5],SpaGE_Avg_Time,color='black',marker='s',linewidth=3) 58 | plt.plot([1,2,3,4,5],Seurat_Avg_Time,color='blue',marker='s',linewidth=3) 59 | plt.plot([1,2,3,4,5],Liger_Avg_Time,color='red',marker='s',linewidth=3) 60 | plt.plot([1,2,3,4,5],gimVI_Avg_Time,color='purple',marker='s',linewidth=3) 61 | #plt.yscale('log') 62 | plt.xticks((1,2,3,4,5),('STARmap_AllenVISp\n(1549,14249)','osmFISH_Zeisel\n(3405,1691)', 63 | 'osmFISH_AllenVISp\n(3405,14249)','osmFISH_AllenSSp\n(3405,5613)', 64 | 'MERFISH_Moffit\n(64373,31299)'),size=10) 65 | plt.yticks(size=8) 66 | plt.ylabel('Average computation time (seconds)',size=12) 67 | colors = ['black','blue', 'red', 'purple'] 68 | lines = [Line2D([0], [0], color=c, linewidth=3, marker='s') for c in colors] 69 | labels = ['SpaGE','Seurat', 'Liger','gimVI'] 70 | plt.legend(lines, labels) 71 | plt.show() -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/Liger/LIGER.R: 
-------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenSSp/") 2 | library(liger) 3 | library(hdf5r) 4 | library(methods) 5 | 6 | # allen SSp 7 | allen <- read.table(file = "data/Allen_SSp/SSp_exons_matrix.csv", 8 | row.names = 1, sep = ',', stringsAsFactors = FALSE, header = TRUE) 9 | allen <- as.matrix(x = allen) 10 | 11 | Genes_count = rowSums(allen > 0) 12 | allen <- allen[Genes_count>=10,] 13 | 14 | meta.data <- read.csv(file = "data/Allen_SSp/AllenSSp_metadata.csv", 15 | row.names = 1, stringsAsFactors = FALSE) 16 | ok.cells <- colnames(allen[,meta.data$class_label != 'Exclude']) 17 | allen <- allen[,ok.cells] 18 | 19 | # osmFISH 20 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom") 21 | mat <- osm[['matrix']][,] 22 | colnames(mat) <- osm[['row_attrs']][['Gene']][] 23 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][]) 24 | x_dim <- osm[['col_attrs']][['X']][] 25 | y_dim <- osm[['col_attrs']][['Y']][] 26 | region <- osm[['col_attrs']][['Region']][] 27 | cluster <- osm[['col_attrs']][['ClusterName']][] 28 | osm$close_all() 29 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim) 30 | rownames(spatial) <- rownames(mat) 31 | spatial <- as.matrix(spatial) 32 | osmFISH <- t(mat) 33 | osmFISH <- osmFISH[,!is.element(region,c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))] 34 | 35 | Gene_set <- intersect(rownames(osmFISH),rownames(allen)) 36 | 37 | #### New genes prediction 38 | Ligerex <- createLiger(list(SMSC_FISH = osmFISH, SMSC_RNA = allen)) 39 | Ligerex <- normalize(Ligerex) 40 | Ligerex@var.genes <- Gene_set 41 | Ligerex <- scaleNotCenter(Ligerex) 42 | 43 | # suggestK(Ligerex, k.test= seq(5,30,5)) # K = 20 44 | # suggestLambda(Ligerex, k = 20) # Lambda = 20 45 | 46 | Ligerex <- optimizeALS(Ligerex,k = 20, lambda = 20) 47 | Ligerex <- quantileAlignSNF(Ligerex) 48 | 49 | Imputation <- imputeKNN(Ligerex,reference = 'SMSC_RNA', 
queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE) 50 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2') 51 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(Imputation@norm.data$SMSC_FISH)[2]) 52 | rownames(Imp_New_genes) <- new.genes 53 | for(i in 1:length(new.genes)) { 54 | Imp_New_genes[new.genes[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[new.genes[i],]) 55 | } 56 | write.csv(Imp_New_genes,file = 'Results/Liger_New_genes.csv') 57 | 58 | # leave-one-out validation 59 | genes.leaveout <- intersect(rownames(osmFISH),rownames(allen)) 60 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH)[2]) 61 | rownames(Imp_genes) <- genes.leaveout 62 | colnames(Imp_genes) <- colnames(osmFISH) 63 | NMF_time <- vector(mode= "numeric") 64 | knn_time <- vector(mode= "numeric") 65 | 66 | for(i in 1:length(genes.leaveout)) { 67 | print(i) 68 | start_time <- Sys.time() 69 | Ligerex.leaveout <- createLiger(list(SMSC_FISH = osmFISH[-which(rownames(osmFISH) %in% genes.leaveout[i]),], SMSC_RNA = allen[rownames(osmFISH),])) 70 | Ligerex.leaveout <- normalize(Ligerex.leaveout) 71 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i]) 72 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout) 73 | if(dim(Ligerex.leaveout@norm.data$SMSC_FISH)[2]!=dim(osmFISH)[2]){ 74 | next 75 | } 76 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 20, lambda = 20) 77 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout) 78 | end_time <- Sys.time() 79 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 80 | 81 | start_time <- Sys.time() 82 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE, knn_k = 30) 83 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[genes.leaveout[i],]) 84 | end_time <- Sys.time() 85 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 86 | } 87 | write.csv(Imp_genes,file 
= 'Results/Liger_LeaveOneOut.csv') 88 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE) 89 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE) 90 | -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/Seurat/allen_brain.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenSSp/") 2 | library(Seurat) 3 | 4 | allen <- read.table(file = "data/Allen_SSp/SSp_exons_matrix.csv", 5 | row.names = 1,sep = ',', stringsAsFactors = FALSE, header = TRUE) 6 | allen <- as.matrix(x = allen) 7 | 8 | meta.data <- read.csv(file = "data/Allen_SSp/AllenSSp_metadata.csv", 9 | row.names = 1, stringsAsFactors = FALSE) 10 | 11 | al <- CreateSeuratObject(counts = allen, project = 'SSp', min.cells = 10) 12 | ok.cells <- colnames(al[,meta.data$class_label != 'Exclude']) 13 | al <- al[, ok.cells] 14 | al <- NormalizeData(object = al) 15 | al <- FindVariableFeatures(object = al, nfeatures = 2000) 16 | al <- ScaleData(object = al) 17 | al <- RunPCA(object = al, npcs = 50, verbose = FALSE) 18 | al <- RunUMAP(object = al, dims = 1:50, nneighbors = 5) 19 | saveRDS(object = al, file = paste0("data/seurat_objects/","allen_brain_SSp.rds")) 20 | -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/Seurat/impute_osmFISH.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenSSp/") 2 | library(Seurat) 3 | 4 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds") 5 | allen <- readRDS("data/seurat_objects/allen_brain_SSp.rds") 6 | 7 | #Project on allen labels 8 | i2 <- FindTransferAnchors( 9 | reference = allen, 10 | query = osmFISH, 11 | features = rownames(osmFISH), 12 | reduction = 'cca', 13 | reference.assay = 'RNA', 14 | query.assay = 'RNA' 15 | ) 16 | 17 | refdata <- GetAssayData( 18 | object = allen, 19 | assay = 'RNA', 20 | slot 
= 'data' 21 | ) 22 | 23 | imputation <- TransferData( 24 | anchorset = i2, 25 | refdata = refdata, 26 | weight.reduction = 'pca' 27 | ) 28 | 29 | osmFISH[['ss2']] <- imputation 30 | saveRDS(osmFISH, 'data/seurat_objects/osmFISH_Cortex_imputed.rds') -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/Seurat/osmFISH.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenSSp/") 2 | library(Seurat) 3 | library(hdf5r) 4 | library(methods) 5 | 6 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom") 7 | mat <- osm[['matrix']][,] 8 | colnames(mat) <- osm[['row_attrs']][['Gene']][] 9 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][]) 10 | x_dim <- osm[['col_attrs']][['X']][] 11 | y_dim <- osm[['col_attrs']][['Y']][] 12 | region <- osm[['col_attrs']][['Region']][] 13 | cluster <- osm[['col_attrs']][['ClusterName']][] 14 | osm$close_all() 15 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim) 16 | rownames(spatial) <- rownames(mat) 17 | spatial <- as.matrix(spatial) 18 | mat <- t(mat) 19 | 20 | osm_seurat <- CreateSeuratObject(counts = mat, project = 'osmFISH', assay = 'RNA', min.cells = -1, min.features = -1) 21 | names(region) <- colnames(osm_seurat) 22 | names(cluster) <- colnames(osm_seurat) 23 | osm_seurat <- AddMetaData(osm_seurat, region, col.name = 'region') 24 | osm_seurat <- AddMetaData(osm_seurat, cluster, col.name = 'cluster') 25 | osm_seurat[['spatial']] <- CreateDimReducObject(embeddings = spatial, key = 'spatial', assay = 'RNA') 26 | Idents(osm_seurat) <- 'region' 27 | osm_seurat <- SubsetData(osm_seurat, ident.remove = c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle')) 28 | total.counts = colSums(x = as.matrix(osm_seurat@assays$RNA@counts)) 29 | osm_seurat <- NormalizeData(osm_seurat, scale.factor = median(x = total.counts)) 30 | osm_seurat <- ScaleData(osm_seurat) 31 | 
saveRDS(object = osm_seurat, file = 'data/seurat_objects/osmFISH_Cortex.rds') -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/Seurat/osmFISH_integration.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenSSp/") 2 | library(Seurat) 3 | library(ggplot2) 4 | 5 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds") 6 | osmFISH.imputed <- readRDS("data/seurat_objects/osmFISH_Cortex_imputed.rds") 7 | allen <- readRDS("data/seurat_objects/allen_brain_SSp.rds") 8 | 9 | genes.leaveout <- intersect(rownames(osmFISH),rownames(allen)) 10 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH@assays$RNA)[2]) 11 | rownames(Imp_genes) <- genes.leaveout 12 | anchor_time <- vector(mode= "numeric") 13 | Transfer_time <- vector(mode= "numeric") 14 | 15 | run_imputation <- function(ref.obj, query.obj, feature.remove) { 16 | message(paste0('removing ', feature.remove)) 17 | features <- setdiff(rownames(query.obj), feature.remove) 18 | DefaultAssay(ref.obj) <- 'RNA' 19 | DefaultAssay(query.obj) <- 'RNA' 20 | 21 | start_time <- Sys.time() 22 | anchors <- FindTransferAnchors( 23 | reference = ref.obj, 24 | query = query.obj, 25 | features = features, 26 | dims = 1:30, 27 | reduction = 'cca' 28 | ) 29 | end_time <- Sys.time() 30 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 31 | 32 | refdata <- GetAssayData( 33 | object = ref.obj, 34 | assay = 'RNA', 35 | slot = 'data' 36 | ) 37 | 38 | start_time <- Sys.time() 39 | imputation <- TransferData( 40 | anchorset = anchors, 41 | refdata = refdata, 42 | weight.reduction = 'pca' 43 | ) 44 | query.obj[['seq']] <- imputation 45 | end_time <- Sys.time() 46 | Transfer_time <<- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 47 | return(query.obj) 48 | } 49 | 50 | for(i in 1:length(genes.leaveout)) { 51 | imputed.ss2 <- run_imputation(ref.obj 
= allen, query.obj = osmFISH, feature.remove = genes.leaveout[[i]]) 52 | osmFISH[['ss2']] <- imputed.ss2[, colnames(osmFISH)][['seq']] 53 | Imp_genes[genes.leaveout[[i]],] = as.vector(osmFISH@assays$ss2[genes.leaveout[i],]) 54 | } 55 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv') 56 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE) 57 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE) 58 | 59 | # show genes not in the osmFISH dataset 60 | DefaultAssay(osmFISH.imputed) <- "ss2" 61 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2') 62 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(osmFISH.imputed@assays$ss2)[2]) 63 | rownames(Imp_New_genes) <- new.genes 64 | for(i in 1:length(new.genes)) { 65 | Imp_New_genes[new.genes[[i]],] = as.vector(osmFISH.imputed@assays$ss2[new.genes[i],]) 66 | } 67 | write.csv(Imp_New_genes,file = 'Results/Seurat_New_genes.csv') -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/SpaGE/Allen_data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenSSp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | 9 | RNA_data = pd.read_csv('data/Allen_SSp/SSp_exons_matrix.csv', 10 | header=0,index_col=0,sep=',') 11 | 12 | # filter lowly expressed genes 13 | Genes_count = np.sum(RNA_data > 0, axis=1) 14 | RNA_data = RNA_data.loc[Genes_count >=10,:] 15 | del Genes_count 16 | 17 | # filter low quality cells 18 | meta_data = pd.read_csv('data/Allen_SSp/AllenSSp_metadata.csv', 19 | header=0,sep=',') 20 | HighQualityCells = meta_data['class_label'] != 'Exclude' 21 | RNA_data = RNA_data.iloc[:,np.where(HighQualityCells)[0]] 22 | del meta_data, HighQualityCells 23 | 24 | def Log_Norm(x): 25 | return np.log(((x/np.sum(x))*1000000) + 1) 26 | 27 | RNA_data = 
RNA_data.apply(Log_Norm,axis=0) 28 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index) 29 | 30 | datadict = dict() 31 | datadict['RNA_data'] = RNA_data.T 32 | datadict['RNA_data_scaled'] = RNA_data_scaled 33 | 34 | with open('data/SpaGE_pkl/Allen_SSp.pkl','wb') as f: 35 | pickle.dump(datadict, f) 36 | -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/SpaGE/Integration.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenSSp/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.neighbors import NearestNeighbors 8 | import time as tm 9 | 10 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f: 11 | datadict = pickle.load(f) 12 | 13 | osmFISH_data = datadict['osmFISH_data'] 14 | osmFISH_data_scaled = datadict['osmFISH_data_scaled'] 15 | osmFISH_meta= datadict['osmFISH_meta'] 16 | del datadict 17 | 18 | with open ('data/SpaGE_pkl/Allen_SSp.pkl', 'rb') as f: 19 | datadict = pickle.load(f) 20 | 21 | RNA_data = datadict['RNA_data'] 22 | RNA_data_scaled = datadict['RNA_data_scaled'] 23 | del datadict 24 | 25 | #### Leave One Out Validation #### 26 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns) 28 | precise_time = [] 29 | knn_time = [] 30 | for i in Common_data.columns: 31 | print(i) 32 | start = tm.time() 33 | from principal_vectors import PVComputation 34 | 35 | n_factors = 30 36 | n_pv = 30 37 | dim_reduction = 'pca' 38 | dim_reduction_target = 'pca' 39 | 40 | pv_FISH_RNA = PVComputation( 41 | n_factors = n_factors, 42 | n_pv = n_pv, 43 | dim_reduction = dim_reduction, 44 | dim_reduction_target = dim_reduction_target 45 | ) 46 | 47 | pv_FISH_RNA.fit(Common_data.drop(i,axis=1),osmFISH_data_scaled[Common_data.columns].drop(i,axis=1)) 
48 | 49 | S = pv_FISH_RNA.source_components_.T 50 | 51 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 52 | S = S[:,0:Effective_n_pv] 53 | 54 | Common_data_t = Common_data.drop(i,axis=1).dot(S) 55 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S) 56 | precise_time.append(tm.time()-start) 57 | 58 | start = tm.time() 59 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t) 60 | distances, indices = nbrs.kneighbors(FISH_exp_t) 61 | 62 | Imp_Gene = np.zeros(osmFISH_data.shape[0]) 63 | for j in range(0,osmFISH_data.shape[0]): 64 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 65 | weights = weights/(len(weights)-1) 66 | Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights)) 67 | Imp_Gene[np.isnan(Imp_Gene)] = 0 68 | Imp_Genes[i] = Imp_Gene 69 | knn_time.append(tm.time()-start) 70 | 71 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv') 72 | precise_time = pd.DataFrame(precise_time) 73 | knn_time = pd.DataFrame(knn_time) 74 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False) 75 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False) 76 | 77 | #### Novel Genes Expression Patterns #### 78 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 79 | genes_to_impute = ["Tesc","Pvrl3","Grm2"] 80 | Imp_New_Genes = pd.DataFrame(np.zeros((osmFISH_data.shape[0],len(genes_to_impute))),columns=genes_to_impute) 81 | 82 | from principal_vectors import PVComputation 83 | 84 | n_factors = 30 85 | n_pv = 30 86 | dim_reduction = 'pca' 87 | dim_reduction_target = 'pca' 88 | 89 | pv_FISH_RNA = PVComputation( 90 | n_factors = n_factors, 91 | n_pv = n_pv, 92 | dim_reduction = dim_reduction, 93 | dim_reduction_target = dim_reduction_target 94 | ) 95 | 96 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns]) 97 | 98 | S = 
pv_FISH_RNA.source_components_.T 99 | 100 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 101 | S = S[:,0:Effective_n_pv] 102 | 103 | Common_data_t = Common_data.dot(S) 104 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].dot(S) 105 | 106 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto', 107 | metric = 'cosine').fit(Common_data_t) 108 | distances, indices = nbrs.kneighbors(FISH_exp_t) 109 | 110 | for j in range(0,osmFISH_data.shape[0]): 111 | 112 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 113 | weights = weights/(len(weights)-1) 114 | Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]]) 115 | 116 | Imp_New_Genes.to_csv('Results/SpaGE_New_genes.csv') -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/SpaGE/Precise_output.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenSSp/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | import matplotlib 8 | matplotlib.rcParams['pdf.fonttype'] = 42 9 | matplotlib.rcParams['ps.fonttype'] = 42 10 | import matplotlib.pyplot as plt 11 | import seaborn as sns 12 | import sys 13 | sys.path.insert(1,'SpaGE/') 14 | from principal_vectors import PVComputation 15 | 16 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f: 17 | datadict = pickle.load(f) 18 | 19 | osmFISH_data = datadict['osmFISH_data'] 20 | osmFISH_data_scaled = datadict['osmFISH_data_scaled'] 21 | osmFISH_meta= datadict['osmFISH_meta'] 22 | del datadict 23 | 24 | with open ('data/SpaGE_pkl/Allen_SSp.pkl', 'rb') as f: 25 | datadict = pickle.load(f) 26 | 27 | RNA_data = datadict['RNA_data'] 28 | RNA_data_scaled = datadict['RNA_data_scaled'] 29 | del datadict 30 | 31 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 
32 | 33 | n_factors = 30 34 | n_pv = 30 35 | n_pv_display = 30 36 | dim_reduction = 'pca' 37 | dim_reduction_target = 'pca' 38 | 39 | pv_FISH_RNA = PVComputation( 40 | n_factors = n_factors, 41 | n_pv = n_pv, 42 | dim_reduction = dim_reduction, 43 | dim_reduction_target = dim_reduction_target 44 | ) 45 | 46 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns]) 47 | 48 | fig = plt.figure() 49 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 50 | center=0, vmax=1., vmin=0) 51 | plt.xlabel('osmFISH',fontsize=18, color='black') 52 | plt.ylabel('Allen_SSp',fontsize=18, color='black') 53 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 54 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 55 | plt.gca().set_ylim([n_pv_display,0]) 56 | plt.show() 57 | 58 | plt.figure() 59 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 60 | center=0, vmax=1., vmin=0) 61 | for i in range(n_pv_display-1): 62 | plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black') 63 | 64 | plt.xlabel('osmFISH',fontsize=18, color='black') 65 | plt.ylabel('Allen_SSp',fontsize=18, color='black') 66 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 67 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 68 | plt.gca().set_ylim([n_pv_display,0]) 69 | plt.show() 70 | 71 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns) 72 | Importance.sort_values(ascending=False,inplace=True) 73 | Importance.index[0:30] 74 | 75 | ### Technology specific Processes 76 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 77 | 78 | # explained variance RNA 79 | 
np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 80 | # explained variance spatial 81 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 82 | -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/SpaGE/dimensionality_reduction.py: -------------------------------------------------------------------------------- 1 | """ Dimensionality Reduction 2 | @author: Soufiane Mourragui 3 | This module extracts the domain-specific factors from the high-dimensional omics 4 | dataset. Several methods are implemented here and they can be directly 5 | called by string name from the main method. All the methods 6 | use the scikit-learn implementation. 7 | Notes 8 | ------- 9 | - 10 | 11 | References 12 | ------- 13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python. 14 | Journal of Machine Learning Research 15 | """ 16 | 17 | import numpy as np 18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA 19 | from sklearn.cross_decomposition import PLSRegression 20 | 21 | 22 | def process_dim_reduction(method='pca', n_dim=10): 23 | """ 24 | Default linear dimensionality reduction method. For each method, return a 25 | BaseEstimator instance corresponding to the method given as input. 26 | Attributes 27 | ------- 28 | method: str, default to 'pca' 29 | Method used for dimensionality reduction. 30 | Implemented: 'pca', 'ica', 'fa' (Factor Analysis), 31 | 'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA). 32 | 33 | n_dim: int, default to 10 34 | Number of domain-specific factors to compute. 35 | Return values 36 | ------- 37 | Classifier, i.e. 
BaseEstimator instance 38 | """ 39 | 40 | if method.lower() == 'pca': 41 | clf = PCA(n_components=n_dim) 42 | 43 | elif method.lower() == 'ica': 44 | print('ICA') 45 | clf = FastICA(n_components=n_dim) 46 | 47 | elif method.lower() == 'fa': 48 | clf = FactorAnalysis(n_components=n_dim) 49 | 50 | elif method.lower() == 'nmf': 51 | clf = NMF(n_components=n_dim) 52 | 53 | elif method.lower() == 'sparsepca': 54 | clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1) 55 | 56 | elif method.lower() == 'pls': 57 | clf = PLS(n_components=n_dim) 58 | 59 | else: 60 | raise NameError('%s is not an implemented method'%(method)) 61 | 62 | return clf 63 | 64 | 65 | class PLS(): 66 | """ 67 | Implement PLS to make it compliant with the other dimensionality 68 | reduction methodology. 69 | (Simple class rewriting). 70 | """ 71 | def __init__(self, n_components=10): 72 | self.clf = PLSRegression(n_components) 73 | 74 | def get_components_(self): 75 | return self.clf.x_weights_.transpose() 76 | 77 | def set_components_(self, x): 78 | pass 79 | 80 | components_ = property(get_components_, set_components_) 81 | 82 | def fit(self, X, y): 83 | self.clf.fit(X,y) 84 | return self 85 | 86 | def transform(self, X): 87 | return self.clf.transform(X) 88 | 89 | def predict(self, X): 90 | return self.clf.predict(X) -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/SpaGE/osmFISH_data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenSSp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | import loompy 9 | 10 | ## Read loom data 11 | ds = loompy.connect('data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom') 12 | 13 | FISH_Genes = ds.ra['Gene'] 14 | 15 | colAtr = ds.ca.keys() 16 | 17 | df = pd.DataFrame() 18 | for i in colAtr: 19 | df[i] = ds.ca[i] 20 | 21 | osmFISH_meta = 
df.iloc[np.where(df.Valid == 1)[0], :] 22 | osmFISH_data = ds[:,:] 23 | osmFISH_data = osmFISH_data[:,np.where(df.Valid == 1)[0]] 24 | osmFISH_data = pd.DataFrame(data= osmFISH_data, index= FISH_Genes) 25 | 26 | del ds, colAtr, i, df, FISH_Genes 27 | 28 | Cortex_Regions = ['Layer 2-3 lateral', 'Layer 2-3 medial', 'Layer 3-4', 29 | 'Layer 4','Layer 5', 'Layer 6', 'Pia Layer 1'] 30 | Cortical = np.array([i in Cortex_Regions for i in osmFISH_meta.Region]) 31 | 32 | osmFISH_meta = osmFISH_meta.iloc[Cortical,:] 33 | osmFISH_data = osmFISH_data.iloc[:,Cortical] 34 | 35 | cell_count = np.sum(osmFISH_data,axis=0) 36 | def Log_Norm(x): 37 | return np.log(((x/np.sum(x))*np.median(cell_count)) + 1) 38 | 39 | osmFISH_data = osmFISH_data.apply(Log_Norm,axis=0) 40 | osmFISH_data_scaled = pd.DataFrame(data=st.zscore(osmFISH_data.T),index = osmFISH_data.columns,columns=osmFISH_data.index) 41 | 42 | datadict = dict() 43 | datadict['osmFISH_data'] = osmFISH_data.T 44 | datadict['osmFISH_data_scaled'] = osmFISH_data_scaled 45 | datadict['osmFISH_meta'] = osmFISH_meta 46 | 47 | with open('data/SpaGE_pkl/osmFISH_Cortex.pkl','wb') as f: 48 | pickle.dump(datadict, f) 49 | -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenSSp/gimVI/gimVI.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenSSp/') 3 | 4 | from scvi.dataset import CsvDataset 5 | from scvi.models import JVAE, Classifier 6 | from scvi.inference import JVAETrainer 7 | import numpy as np 8 | import pandas as pd 9 | import copy 10 | import torch 11 | import time as tm 12 | 13 | ### osmFISH data 14 | osmFISH_data = CsvDataset('data/gimVI_data/osmFISH_Cortex_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 15 | 16 | ### RNA 17 | RNA_data = CsvDataset('data/gimVI_data/Allen_SSp_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 18 | 19 | ### Leave-one-out validation 20 | Common_data 
= copy.deepcopy(RNA_data) 21 | Common_data.gene_names = osmFISH_data.gene_names 22 | Common_data.X = Common_data.X[:,np.reshape(np.vstack(np.argwhere(i==RNA_data.gene_names) for i in osmFISH_data.gene_names),-1)] 23 | Gene_set = np.intersect1d(osmFISH_data.gene_names,Common_data.gene_names) 24 | Imp_Genes = pd.DataFrame(columns=Gene_set) 25 | gimVI_time = [] 26 | 27 | for i in Gene_set: 28 | print(i) 29 | # Create copy of the fish dataset with hidden genes 30 | data_spatial_partial = copy.deepcopy(osmFISH_data) 31 | data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(osmFISH_data.gene_names,i)) 32 | data_spatial_partial.batch_indices += Common_data.n_batches 33 | 34 | if(data_spatial_partial.X.shape[0] != osmFISH_data.X.shape[0]): 35 | continue 36 | 37 | datasets = [Common_data, data_spatial_partial] 38 | generative_distributions = ["zinb", "nb"] 39 | gene_mappings = [slice(None), np.reshape(np.vstack(np.argwhere(i==Common_data.gene_names) for i in data_spatial_partial.gene_names),-1)] 40 | n_inputs = [d.nb_genes for d in datasets] 41 | total_genes = Common_data.nb_genes 42 | n_batches = sum([d.n_batches for d in datasets]) 43 | 44 | model_library_size = [True, False] 45 | 46 | n_latent = 8 47 | kappa = 5 48 | 49 | start = tm.time() 50 | torch.manual_seed(0) 51 | 52 | model = JVAE( 53 | n_inputs, 54 | total_genes, 55 | gene_mappings, 56 | generative_distributions, 57 | model_library_size, 58 | n_layers_decoder_individual=0, 59 | n_layers_decoder_shared=0, 60 | n_layers_encoder_individual=1, 61 | n_layers_encoder_shared=1, 62 | dim_hidden_encoder=64, 63 | dim_hidden_decoder_shared=64, 64 | dropout_rate_encoder=0.2, 65 | dropout_rate_decoder=0.2, 66 | n_batch=n_batches, 67 | n_latent=n_latent, 68 | ) 69 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 70 | 71 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa) 72 | trainer.train(n_epochs=200) 73 | _,Imputed = trainer.get_imputed_values(normalized=True) 74 | 
Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1) 75 | Imp_Genes[i] = Imputed 76 | gimVI_time.append(tm.time()-start) 77 | 78 | Imp_Genes = Imp_Genes.fillna(0) 79 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv') 80 | gimVI_time = pd.DataFrame(gimVI_time) 81 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False) 82 | 83 | ### New genes 84 | Imp_New_Genes = pd.DataFrame(columns=["TESC","PVRL3","GRM2"]) 85 | 86 | # Create copy of the fish dataset with hidden genes 87 | data_spatial_partial = copy.deepcopy(osmFISH_data) 88 | data_spatial_partial.filter_genes_by_attribute(osmFISH_data.gene_names) 89 | data_spatial_partial.batch_indices += RNA_data.n_batches 90 | 91 | datasets = [RNA_data, data_spatial_partial] 92 | generative_distributions = ["zinb", "nb"] 93 | gene_mappings = [slice(None), np.reshape(np.vstack(np.argwhere(i==RNA_data.gene_names) for i in data_spatial_partial.gene_names),-1)] 94 | n_inputs = [d.nb_genes for d in datasets] 95 | total_genes = RNA_data.nb_genes 96 | n_batches = sum([d.n_batches for d in datasets]) 97 | 98 | model_library_size = [True, False] 99 | 100 | n_latent = 8 101 | kappa = 5 102 | 103 | torch.manual_seed(0) 104 | 105 | model = JVAE( 106 | n_inputs, 107 | total_genes, 108 | gene_mappings, 109 | generative_distributions, 110 | model_library_size, 111 | n_layers_decoder_individual=0, 112 | n_layers_decoder_shared=0, 113 | n_layers_encoder_individual=1, 114 | n_layers_encoder_shared=1, 115 | dim_hidden_encoder=64, 116 | dim_hidden_decoder_shared=64, 117 | dropout_rate_encoder=0.2, 118 | dropout_rate_decoder=0.2, 119 | n_batch=n_batches, 120 | n_latent=n_latent, 121 | ) 122 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 123 | 124 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa) 125 | trainer.train(n_epochs=200) 126 | 127 | for i in ["TESC","PVRL3","GRM2"]: 128 | _,Imputed = trainer.get_imputed_values(normalized=True) 129 | Imputed = 
np.reshape(Imputed[:,np.argwhere(i==RNA_data.gene_names)[0]],-1) 130 | Imp_New_Genes[i] = Imputed 131 | 132 | Imp_New_Genes.to_csv('Results/gimVI_New_genes.csv') 133 | -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/Liger/LIGER.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenVISp/") 2 | library(liger) 3 | library(hdf5r) 4 | library(methods) 5 | 6 | # allen VISp 7 | allen <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv", 8 | row.names = 1, sep = ',', stringsAsFactors = FALSE, header = TRUE) 9 | allen <- as.matrix(x = allen) 10 | genes <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv", 11 | sep = ',', stringsAsFactors = FALSE, header = TRUE) 12 | rownames(x = allen) <- make.unique(names = genes$gene_symbol) 13 | meta.data <- read.csv(file = "data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv", 14 | row.names = 1, stringsAsFactors = FALSE) 15 | 16 | Genes_count = rowSums(allen > 0) 17 | allen <- allen[Genes_count>=10,] 18 | 19 | low.q.cells <- rownames(x = meta.data[meta.data$class %in% c('Low Quality', 'No Class'), ]) 20 | ok.cells <- rownames(x = meta.data)[!(rownames(x = meta.data) %in% low.q.cells)] 21 | allen <-allen[,ok.cells] 22 | 23 | # osmFISH 24 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom") 25 | mat <- osm[['matrix']][,] 26 | colnames(mat) <- osm[['row_attrs']][['Gene']][] 27 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][]) 28 | x_dim <- osm[['col_attrs']][['X']][] 29 | y_dim <- osm[['col_attrs']][['Y']][] 30 | region <- osm[['col_attrs']][['Region']][] 31 | cluster <- osm[['col_attrs']][['ClusterName']][] 32 | osm$close_all() 33 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim) 34 | rownames(spatial) <- rownames(mat) 35 | spatial <- as.matrix(spatial) 36 | osmFISH <- t(mat) 37 | osmFISH <- 
osmFISH[,!is.element(region,c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))] 38 | 39 | Gene_set <- intersect(rownames(osmFISH),rownames(allen)) 40 | 41 | #### New genes prediction 42 | Ligerex <- createLiger(list(SMSC_FISH = osmFISH, SMSC_RNA = allen)) 43 | Ligerex <- normalize(Ligerex) 44 | Ligerex@var.genes <- Gene_set 45 | Ligerex <- scaleNotCenter(Ligerex) 46 | 47 | # suggestK(Ligerex, k.test= seq(5,30,5)) # K = 20 48 | # suggestLambda(Ligerex, k = 20) # Lambda = 20 49 | 50 | Ligerex <- optimizeALS(Ligerex,k = 20, lambda = 20) 51 | Ligerex <- quantileAlignSNF(Ligerex) 52 | 53 | Imputation <- imputeKNN(Ligerex,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE) 54 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2') 55 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(Imputation@norm.data$SMSC_FISH)[2]) 56 | rownames(Imp_New_genes) <- new.genes 57 | for(i in 1:length(new.genes)) { 58 | Imp_New_genes[new.genes[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[new.genes[i],]) 59 | } 60 | write.csv(Imp_New_genes,file = 'Results/Liger_New_genes.csv') 61 | 62 | # leave-one-out validation 63 | genes.leaveout <- intersect(rownames(osmFISH),rownames(allen)) 64 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH)[2]) 65 | rownames(Imp_genes) <- genes.leaveout 66 | colnames(Imp_genes) <- colnames(osmFISH) 67 | NMF_time <- vector(mode= "numeric") 68 | knn_time <- vector(mode= "numeric") 69 | 70 | for(i in 1:length(genes.leaveout)) { 71 | print(i) 72 | start_time <- Sys.time() 73 | Ligerex.leaveout <- createLiger(list(SMSC_FISH = osmFISH[-which(rownames(osmFISH) %in% genes.leaveout[i]),], SMSC_RNA = allen[rownames(osmFISH),])) 74 | Ligerex.leaveout <- normalize(Ligerex.leaveout) 75 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i]) 76 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout) 77 | if(dim(Ligerex.leaveout@norm.data$SMSC_FISH)[2]!=dim(osmFISH)[2]){ 78 | 
next 79 | } 80 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 20, lambda = 20) 81 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout) 82 | end_time <- Sys.time() 83 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 84 | 85 | start_time <- Sys.time() 86 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE, knn_k = 30) 87 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[genes.leaveout[i],]) 88 | end_time <- Sys.time() 89 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 90 | } 91 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv') 92 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE) 93 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE) -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/Seurat/allen_brain.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenVISp/") 2 | library(Seurat) 3 | 4 | allen <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv", 5 | row.names = 1,sep = ',', stringsAsFactors = FALSE, header = TRUE) 6 | allen <- as.matrix(x = allen) 7 | genes <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv", 8 | sep = ',', stringsAsFactors = FALSE, header = TRUE) 9 | rownames(x = allen) <- make.unique(names = genes$gene_symbol) 10 | meta.data <- read.csv(file = "data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv", 11 | row.names = 1, stringsAsFactors = FALSE) 12 | 13 | al <- CreateSeuratObject(counts = allen, project = 'VISp', meta.data = meta.data, min.cells = 10) 14 | low.q.cells <- rownames(x = meta.data[meta.data$class %in% c('Low Quality', 'No Class'), ]) 15 | ok.cells <- rownames(x = meta.data)[!(rownames(x = meta.data) %in% low.q.cells)] 16 | al <- al[, 
ok.cells] 17 | al <- NormalizeData(object = al) 18 | al <- FindVariableFeatures(object = al, nfeatures = 2000) 19 | al <- ScaleData(object = al) 20 | al <- RunPCA(object = al, npcs = 50, verbose = FALSE) 21 | al <- RunUMAP(object = al, dims = 1:50, nneighbors = 5) 22 | saveRDS(object = al, file = paste0("data/seurat_objects/","allen_brain.rds")) -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/Seurat/impute_osmFISH.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenVISp/") 2 | library(Seurat) 3 | 4 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds") 5 | allen <- readRDS("data/seurat_objects/allen_brain.rds") 6 | 7 | #Project on allen labels 8 | i2 <- FindTransferAnchors( 9 | reference = allen, 10 | query = osmFISH, 11 | features = rownames(osmFISH), 12 | reduction = 'cca', 13 | reference.assay = 'RNA', 14 | query.assay = 'RNA' 15 | ) 16 | 17 | refdata <- GetAssayData( 18 | object = allen, 19 | assay = 'RNA', 20 | slot = 'data' 21 | ) 22 | 23 | imputation <- TransferData( 24 | anchorset = i2, 25 | refdata = refdata, 26 | weight.reduction = 'pca' 27 | ) 28 | 29 | osmFISH[['ss2']] <- imputation 30 | saveRDS(osmFISH, 'data/seurat_objects/osmFISH_Cortex_imputed.rds') -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/Seurat/osmFISH.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenVISp/") 2 | library(Seurat) 3 | library(hdf5r) 4 | library(methods) 5 | 6 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom") 7 | mat <- osm[['matrix']][,] 8 | colnames(mat) <- osm[['row_attrs']][['Gene']][] 9 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][]) 10 | x_dim <- osm[['col_attrs']][['X']][] 11 | y_dim <- osm[['col_attrs']][['Y']][] 12 | region <- osm[['col_attrs']][['Region']][] 13 | cluster <- 
osm[['col_attrs']][['ClusterName']][] 14 | osm$close_all() 15 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim) 16 | rownames(spatial) <- rownames(mat) 17 | spatial <- as.matrix(spatial) 18 | mat <- t(mat) 19 | 20 | osm_seurat <- CreateSeuratObject(counts = mat, project = 'osmFISH', assay = 'RNA', min.cells = -1, min.features = -1) 21 | names(region) <- colnames(osm_seurat) 22 | names(cluster) <- colnames(osm_seurat) 23 | osm_seurat <- AddMetaData(osm_seurat, region, col.name = 'region') 24 | osm_seurat <- AddMetaData(osm_seurat, cluster, col.name = 'cluster') 25 | osm_seurat[['spatial']] <- CreateDimReducObject(embeddings = spatial, key = 'spatial', assay = 'RNA') 26 | Idents(osm_seurat) <- 'region' 27 | osm_seurat <- SubsetData(osm_seurat, ident.remove = c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle')) 28 | total.counts = colSums(x = as.matrix(osm_seurat@assays$RNA@counts)) 29 | osm_seurat <- NormalizeData(osm_seurat, scale.factor = median(x = total.counts)) 30 | osm_seurat <- ScaleData(osm_seurat) 31 | saveRDS(object = osm_seurat, file = 'data/seurat_objects/osmFISH_Cortex.rds') -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/Seurat/osmFISH_integration.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_AllenVISp/") 2 | library(Seurat) 3 | library(ggplot2) 4 | 5 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds") 6 | osmFISH.imputed <- readRDS("data/seurat_objects/osmFISH_Cortex_imputed.rds") 7 | allen <- readRDS("data/seurat_objects/allen_brain.rds") 8 | 9 | genes.leaveout <- intersect(rownames(osmFISH),rownames(allen)) 10 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH@assays$RNA)[2]) 11 | rownames(Imp_genes) <- genes.leaveout 12 | anchor_time <- vector(mode= "numeric") 13 | Transfer_time <- vector(mode= "numeric") 14 | 15 | run_imputation <- function(ref.obj, 
query.obj, feature.remove) { 16 | message(paste0('removing ', feature.remove)) 17 | features <- setdiff(rownames(query.obj), feature.remove) 18 | DefaultAssay(ref.obj) <- 'RNA' 19 | DefaultAssay(query.obj) <- 'RNA' 20 | 21 | start_time <- Sys.time() 22 | anchors <- FindTransferAnchors( 23 | reference = ref.obj, 24 | query = query.obj, 25 | features = features, 26 | dims = 1:30, 27 | reduction = 'cca' 28 | ) 29 | end_time <- Sys.time() 30 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 31 | 32 | refdata <- GetAssayData( 33 | object = ref.obj, 34 | assay = 'RNA', 35 | slot = 'data' 36 | ) 37 | 38 | start_time <- Sys.time() 39 | imputation <- TransferData( 40 | anchorset = anchors, 41 | refdata = refdata, 42 | weight.reduction = 'pca' 43 | ) 44 | query.obj[['seq']] <- imputation 45 | end_time <- Sys.time() 46 | Transfer_time <<- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 47 | return(query.obj) 48 | } 49 | 50 | for(i in 1:length(genes.leaveout)) { 51 | imputed.ss2 <- run_imputation(ref.obj = allen, query.obj = osmFISH, feature.remove = genes.leaveout[[i]]) 52 | osmFISH[['ss2']] <- imputed.ss2[, colnames(osmFISH)][['seq']] 53 | Imp_genes[genes.leaveout[[i]],] = as.vector(osmFISH@assays$ss2[genes.leaveout[i],]) 54 | } 55 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv') 56 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE) 57 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE) 58 | 59 | # show genes not in the osmFISH dataset 60 | DefaultAssay(osmFISH.imputed) <- "ss2" 61 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2') 62 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(osmFISH.imputed@assays$ss2)[2]) 63 | rownames(Imp_New_genes) <- new.genes 64 | for(i in 1:length(new.genes)) { 65 | Imp_New_genes[new.genes[[i]],] = as.vector(osmFISH.imputed@assays$ss2[new.genes[i],]) 66 | } 67 | write.csv(Imp_New_genes,file = 
'Results/Seurat_New_genes.csv') -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/SpaGE/Allen_data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenVISp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | 9 | RNA_data = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv', 10 | header=0,index_col=0,sep=',') 11 | 12 | Genes = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv', 13 | header=0,sep=',') 14 | RNA_data.index = Genes.gene_symbol 15 | del Genes 16 | 17 | # filter lowly expressed genes (detected in fewer than 10 cells) 18 | Genes_count = np.sum(RNA_data > 0, axis=1) 19 | RNA_data = RNA_data.loc[Genes_count >=10,:] 20 | del Genes_count 21 | 22 | # filter low quality cells 23 | meta_data = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv', 24 | header=0,sep=',') 25 | HighQualityCells = (meta_data['class'] != 'No Class') & (meta_data['class'] != 'Low Quality') 26 | RNA_data = RNA_data.iloc[:,np.where(HighQualityCells)[0]] 27 | del meta_data, HighQualityCells 28 | 29 | def Log_Norm(x): 30 | return np.log(((x/np.sum(x))*1000000) + 1) 31 | 32 | RNA_data = RNA_data.apply(Log_Norm,axis=0) 33 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index) 34 | 35 | datadict = dict() 36 | datadict['RNA_data'] = RNA_data.T 37 | datadict['RNA_data_scaled'] = RNA_data_scaled 38 | 39 | with open('data/SpaGE_pkl/Allen_VISp.pkl','wb') as f: 40 | pickle.dump(datadict, f) -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/SpaGE/Integration.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenVISp/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | from
sklearn.neighbors import NearestNeighbors 8 | import time as tm 9 | 10 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f: 11 | datadict = pickle.load(f) 12 | 13 | osmFISH_data = datadict['osmFISH_data'] 14 | osmFISH_data_scaled = datadict['osmFISH_data_scaled'] 15 | osmFISH_meta= datadict['osmFISH_meta'] 16 | del datadict 17 | 18 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f: 19 | datadict = pickle.load(f) 20 | 21 | RNA_data = datadict['RNA_data'] 22 | RNA_data_scaled = datadict['RNA_data_scaled'] 23 | del datadict 24 | 25 | #### Leave One Out Validation #### 26 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns) 28 | precise_time = [] 29 | knn_time = [] 30 | for i in Common_data.columns: 31 | print(i) 32 | start = tm.time() 33 | from principal_vectors import PVComputation 34 | 35 | n_factors = 30 36 | n_pv = 30 37 | dim_reduction = 'pca' 38 | dim_reduction_target = 'pca' 39 | 40 | pv_FISH_RNA = PVComputation( 41 | n_factors = n_factors, 42 | n_pv = n_pv, 43 | dim_reduction = dim_reduction, 44 | dim_reduction_target = dim_reduction_target 45 | ) 46 | 47 | pv_FISH_RNA.fit(Common_data.drop(i,axis=1),osmFISH_data_scaled[Common_data.columns].drop(i,axis=1)) 48 | 49 | S = pv_FISH_RNA.source_components_.T 50 | 51 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 52 | S = S[:,0:Effective_n_pv] 53 | 54 | Common_data_t = Common_data.drop(i,axis=1).dot(S) 55 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S) 56 | precise_time.append(tm.time()-start) 57 | 58 | start = tm.time() 59 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t) 60 | distances, indices = nbrs.kneighbors(FISH_exp_t) 61 | 62 | Imp_Gene = np.zeros(osmFISH_data.shape[0]) 63 | for j in range(0,osmFISH_data.shape[0]): 64 | weights = 
1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 65 | weights = weights/(len(weights)-1) 66 | Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights)) 67 | Imp_Gene[np.isnan(Imp_Gene)] = 0 68 | Imp_Genes[i] = Imp_Gene 69 | knn_time.append(tm.time()-start) 70 | 71 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv') 72 | precise_time = pd.DataFrame(precise_time) 73 | knn_time = pd.DataFrame(knn_time) 74 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False) 75 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False) 76 | 77 | #### Novel Genes Expression Patterns #### 78 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 79 | genes_to_impute = ["Tesc","Pvrl3","Grm2"] 80 | Imp_New_Genes = pd.DataFrame(np.zeros((osmFISH_data.shape[0],len(genes_to_impute))),columns=genes_to_impute) 81 | 82 | from principal_vectors import PVComputation 83 | 84 | n_factors = 30 85 | n_pv = 30 86 | dim_reduction = 'pca' 87 | dim_reduction_target = 'pca' 88 | 89 | pv_FISH_RNA = PVComputation( 90 | n_factors = n_factors, 91 | n_pv = n_pv, 92 | dim_reduction = dim_reduction, 93 | dim_reduction_target = dim_reduction_target 94 | ) 95 | 96 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns]) 97 | 98 | S = pv_FISH_RNA.source_components_.T 99 | 100 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 101 | S = S[:,0:Effective_n_pv] 102 | 103 | Common_data_t = Common_data.dot(S) 104 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].dot(S) 105 | 106 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto', 107 | metric = 'cosine').fit(Common_data_t) 108 | distances, indices = nbrs.kneighbors(FISH_exp_t) 109 | 110 | for j in range(0,osmFISH_data.shape[0]): 111 | 112 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 113 | weights = weights/(len(weights)-1) 114 | 
Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]]) 115 | 116 | Imp_New_Genes.to_csv('Results/SpaGE_New_genes.csv') -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/SpaGE/Precise_output.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenVISp/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | import matplotlib 8 | matplotlib.rcParams['pdf.fonttype'] = 42 9 | matplotlib.rcParams['ps.fonttype'] = 42 10 | import matplotlib.pyplot as plt 11 | from matplotlib import cm 12 | import seaborn as sns 13 | import sys 14 | sys.path.insert(1,'SpaGE/') 15 | from principal_vectors import PVComputation 16 | 17 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f: 18 | datadict = pickle.load(f) 19 | 20 | osmFISH_data = datadict['osmFISH_data'] 21 | osmFISH_data_scaled = datadict['osmFISH_data_scaled'] 22 | osmFISH_meta= datadict['osmFISH_meta'] 23 | del datadict 24 | 25 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f: 26 | datadict = pickle.load(f) 27 | 28 | RNA_data = datadict['RNA_data'] 29 | RNA_data_scaled = datadict['RNA_data_scaled'] 30 | del datadict 31 | 32 | all_centroids = osmFISH_meta[['X','Y']] 33 | cmap = cm.get_cmap('viridis',20) 34 | 35 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 36 | 37 | n_factors = 30 38 | n_pv = 30 39 | n_pv_display = 30 40 | dim_reduction = 'pca' 41 | dim_reduction_target = 'pca' 42 | 43 | pv_FISH_RNA = PVComputation( 44 | n_factors = n_factors, 45 | n_pv = n_pv, 46 | dim_reduction = dim_reduction, 47 | dim_reduction_target = dim_reduction_target 48 | ) 49 | 50 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns]) 51 | 52 | fig = plt.figure() 53 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], 
cmap='seismic_r', 54 | center=0, vmax=1., vmin=0) 55 | plt.xlabel('osmFISH',fontsize=18, color='black') 56 | plt.ylabel('Allen_VISp',fontsize=18, color='black') 57 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 58 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 59 | plt.gca().set_ylim([n_pv_display,0]) 60 | plt.show() 61 | 62 | plt.figure() 63 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 64 | center=0, vmax=1., vmin=0) 65 | for i in range(n_pv_display-1): 66 | plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black') 67 | 68 | plt.xlabel('osmFISH',fontsize=18, color='black') 69 | plt.ylabel('Allen_VISp',fontsize=18, color='black') 70 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 71 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 72 | plt.gca().set_ylim([n_pv_display,0]) 73 | plt.show() 74 | 75 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns) 76 | Importance.sort_values(ascending=False,inplace=True) 77 | Importance.index[0:30] 78 | 79 | ### Technology specific Processes 80 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 81 | 82 | # explained variance RNA 83 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 84 | # explained variance spatial 85 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 86 | -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/SpaGE/dimensionality_reduction.py: -------------------------------------------------------------------------------- 1 | """ Dimensionality Reduction 2 | @author: Soufiane Mourragui 3 | This module extracts the domain-specific factors from the 
high-dimensional omics 4 | dataset. Several methods are implemented here and can be called directly 5 | by string name from the main method. All the methods 6 | use the scikit-learn implementation. 7 | Notes 8 | ------- 9 | - 10 | 11 | References 12 | ------- 13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python. 14 | Journal of Machine Learning Research 15 | """ 16 | 17 | import numpy as np 18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA 19 | from sklearn.cross_decomposition import PLSRegression 20 | 21 | 22 | def process_dim_reduction(method='pca', n_dim=10): 23 | """ 24 | Default linear dimensionality reduction method. For each method, return a 25 | BaseEstimator instance corresponding to the method given as input. 26 | Attributes 27 | ------- 28 | method: str, default to 'pca' 29 | Method used for dimensionality reduction. 30 | Implemented: 'pca', 'ica', 'fa' (Factor Analysis), 31 | 'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA). 32 | 33 | n_dim: int, default to 10 34 | Number of domain-specific factors to compute. 35 | Return values 36 | ------- 37 | Classifier, i.e. BaseEstimator instance 38 | """ 39 | 40 | if method.lower() == 'pca': 41 | clf = PCA(n_components=n_dim) 42 | 43 | elif method.lower() == 'ica': 44 | print('ICA') 45 | clf = FastICA(n_components=n_dim) 46 | 47 | elif method.lower() == 'fa': 48 | clf = FactorAnalysis(n_components=n_dim) 49 | 50 | elif method.lower() == 'nmf': 51 | clf = NMF(n_components=n_dim) 52 | 53 | elif method.lower() == 'sparsepca': 54 | clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1) 55 | 56 | elif method.lower() == 'pls': 57 | clf = PLS(n_components=n_dim) 58 | 59 | else: 60 | raise NameError('%s is not an implemented method'%(method)) 61 | 62 | return clf 63 | 64 | 65 | class PLS(): 66 | """ 67 | Implement PLS to make it compliant with the other dimensionality 68 | reduction methods.
69 | (Simple class rewriting). 70 | """ 71 | def __init__(self, n_components=10): 72 | self.clf = PLSRegression(n_components) 73 | 74 | def get_components_(self): 75 | return self.clf.x_weights_.transpose() 76 | 77 | def set_components_(self, x): 78 | pass 79 | 80 | components_ = property(get_components_, set_components_) 81 | 82 | def fit(self, X, y): 83 | self.clf.fit(X,y) 84 | return self 85 | 86 | def transform(self, X): 87 | return self.clf.transform(X) 88 | 89 | def predict(self, X): 90 | return self.clf.predict(X) -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/SpaGE/osmFISH_data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenVISp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | import loompy 9 | 10 | ## Read loom data 11 | ds = loompy.connect('data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom') 12 | 13 | FISH_Genes = ds.ra['Gene'] 14 | 15 | colAtr = ds.ca.keys() 16 | 17 | df = pd.DataFrame() 18 | for i in colAtr: 19 | df[i] = ds.ca[i] 20 | 21 | osmFISH_meta = df.iloc[np.where(df.Valid == 1)[0], :] 22 | osmFISH_data = ds[:,:] 23 | osmFISH_data = osmFISH_data[:,np.where(df.Valid == 1)[0]] 24 | osmFISH_data = pd.DataFrame(data= osmFISH_data, index= FISH_Genes) 25 | 26 | del ds, colAtr, i, df, FISH_Genes 27 | 28 | Cortex_Regions = ['Layer 2-3 lateral', 'Layer 2-3 medial', 'Layer 3-4', 29 | 'Layer 4','Layer 5', 'Layer 6', 'Pia Layer 1'] 30 | Cortical = np.array([i in Cortex_Regions for i in osmFISH_meta.Region]) # boolean mask; a list is used because NumPy stacking functions no longer accept generators 31 | 32 | osmFISH_meta = osmFISH_meta.iloc[Cortical,:] 33 | osmFISH_data = osmFISH_data.iloc[:,Cortical] 34 | 35 | cell_count = np.sum(osmFISH_data,axis=0) 36 | def Log_Norm(x): 37 | return np.log(((x/np.sum(x))*np.median(cell_count)) + 1) 38 | 39 | osmFISH_data = osmFISH_data.apply(Log_Norm,axis=0) 40 | osmFISH_data_scaled =
pd.DataFrame(data=st.zscore(osmFISH_data.T),index = osmFISH_data.columns,columns=osmFISH_data.index) 41 | 42 | datadict = dict() 43 | datadict['osmFISH_data'] = osmFISH_data.T 44 | datadict['osmFISH_data_scaled'] = osmFISH_data_scaled 45 | datadict['osmFISH_meta'] = osmFISH_meta 46 | 47 | with open('data/SpaGE_pkl/osmFISH_Cortex.pkl','wb') as f: 48 | pickle.dump(datadict, f) 49 | -------------------------------------------------------------------------------- /benchmark/osmFISH_AllenVISp/gimVI/gimVI.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_AllenVISp/') 3 | 4 | from scvi.dataset import CsvDataset 5 | from scvi.models import JVAE, Classifier 6 | from scvi.inference import JVAETrainer 7 | import numpy as np 8 | import pandas as pd 9 | import copy 10 | import torch 11 | import time as tm 12 | 13 | ### osmFISH data 14 | osmFISH_data = CsvDataset('data/gimVI_data/osmFISH_Cortex_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 15 | 16 | ### RNA 17 | RNA_data = CsvDataset('data/gimVI_data/Allen_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 18 | 19 | ### Leave-one-out validation 20 | Common_data = copy.deepcopy(RNA_data) 21 | Common_data.gene_names = osmFISH_data.gene_names 22 | Common_data.X = Common_data.X[:,np.reshape(np.vstack(np.argwhere(i==RNA_data.gene_names) for i in osmFISH_data.gene_names),-1)] 23 | Gene_set = np.intersect1d(osmFISH_data.gene_names,Common_data.gene_names) 24 | Imp_Genes = pd.DataFrame(columns=Gene_set) 25 | gimVI_time = [] 26 | 27 | for i in Gene_set: 28 | print(i) 29 | # Create copy of the fish dataset with hidden genes 30 | data_spatial_partial = copy.deepcopy(osmFISH_data) 31 | data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(osmFISH_data.gene_names,i)) 32 | data_spatial_partial.batch_indices += Common_data.n_batches 33 | 34 | if(data_spatial_partial.X.shape[0] != osmFISH_data.X.shape[0]): 35 | continue 36 | 37 | datasets = 
[Common_data, data_spatial_partial] 38 | generative_distributions = ["zinb", "nb"] 39 | gene_mappings = [slice(None), np.reshape(np.vstack(np.argwhere(i==Common_data.gene_names) for i in data_spatial_partial.gene_names),-1)] 40 | n_inputs = [d.nb_genes for d in datasets] 41 | total_genes = Common_data.nb_genes 42 | n_batches = sum([d.n_batches for d in datasets]) 43 | 44 | model_library_size = [True, False] 45 | 46 | n_latent = 8 47 | kappa = 5 48 | 49 | start = tm.time() 50 | torch.manual_seed(0) 51 | 52 | model = JVAE( 53 | n_inputs, 54 | total_genes, 55 | gene_mappings, 56 | generative_distributions, 57 | model_library_size, 58 | n_layers_decoder_individual=0, 59 | n_layers_decoder_shared=0, 60 | n_layers_encoder_individual=1, 61 | n_layers_encoder_shared=1, 62 | dim_hidden_encoder=64, 63 | dim_hidden_decoder_shared=64, 64 | dropout_rate_encoder=0.2, 65 | dropout_rate_decoder=0.2, 66 | n_batch=n_batches, 67 | n_latent=n_latent, 68 | ) 69 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 70 | 71 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa) 72 | trainer.train(n_epochs=200) 73 | _,Imputed = trainer.get_imputed_values(normalized=True) 74 | Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1) 75 | Imp_Genes[i] = Imputed 76 | gimVI_time.append(tm.time()-start) 77 | 78 | Imp_Genes = Imp_Genes.fillna(0) 79 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv') 80 | gimVI_time = pd.DataFrame(gimVI_time) 81 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False) 82 | 83 | ### New genes 84 | Imp_New_Genes = pd.DataFrame(columns=["TESC","PVRL3","GRM2"]) 85 | 86 | # Create copy of the fish dataset with hidden genes 87 | data_spatial_partial = copy.deepcopy(osmFISH_data) 88 | data_spatial_partial.filter_genes_by_attribute(osmFISH_data.gene_names) 89 | data_spatial_partial.batch_indices += RNA_data.n_batches 90 | 91 | datasets = [RNA_data, data_spatial_partial] 92 | generative_distributions = 
["zinb", "nb"] 93 | gene_mappings = [slice(None), np.reshape(np.vstack(np.argwhere(i==RNA_data.gene_names) for i in data_spatial_partial.gene_names),-1)] 94 | n_inputs = [d.nb_genes for d in datasets] 95 | total_genes = RNA_data.nb_genes 96 | n_batches = sum([d.n_batches for d in datasets]) 97 | 98 | model_library_size = [True, False] 99 | 100 | n_latent = 8 101 | kappa = 5 102 | 103 | torch.manual_seed(0) 104 | 105 | model = JVAE( 106 | n_inputs, 107 | total_genes, 108 | gene_mappings, 109 | generative_distributions, 110 | model_library_size, 111 | n_layers_decoder_individual=0, 112 | n_layers_decoder_shared=0, 113 | n_layers_encoder_individual=1, 114 | n_layers_encoder_shared=1, 115 | dim_hidden_encoder=64, 116 | dim_hidden_decoder_shared=64, 117 | dropout_rate_encoder=0.2, 118 | dropout_rate_decoder=0.2, 119 | n_batch=n_batches, 120 | n_latent=n_latent, 121 | ) 122 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 123 | 124 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa) 125 | trainer.train(n_epochs=200) 126 | 127 | for i in ["TESC","PVRL3","GRM2"]: 128 | _,Imputed = trainer.get_imputed_values(normalized=True) 129 | Imputed = np.reshape(Imputed[:,np.argwhere(i==RNA_data.gene_names)[0]],-1) 130 | Imp_New_Genes[i] = Imputed 131 | 132 | Imp_New_Genes.to_csv('Results/gimVI_New_genes.csv') 133 | -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/Liger/LIGER.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_Ziesel/") 2 | library(liger) 3 | library(hdf5r) 4 | library(methods) 5 | 6 | # Zeisel SMSC 7 | Zeisel <- read.delim(file = "data/Zeisel/expression_mRNA_17-Aug-2014.txt",header = FALSE) 8 | meta.data <- Zeisel[1:10,] 9 | rownames(meta.data) <- meta.data[,2] 10 | meta.data <- meta.data[,-c(1,2)] 11 | meta.data <- as.data.frame(t(meta.data)) 12 | 13 | Zeisel <- Zeisel[12:19983,] 14 | gene_names <- Zeisel[,1] 
15 | Zeisel <- Zeisel[,-c(1,2)] 16 | Zeisel <- apply(Zeisel,2,as.numeric) 17 | Zeisel <- as.matrix(Zeisel) 18 | rownames(Zeisel) <- gene_names 19 | Zeisel <- Zeisel[,meta.data$tissue=='sscortex'] 20 | meta.data <- meta.data[meta.data$tissue=='sscortex',] 21 | 22 | colnames(Zeisel) <- paste0('SMSC_',c(1:dim(Zeisel)[2])) 23 | rownames(meta.data) <- paste0('SMSC_',c(1:dim(meta.data)[1])) 24 | 25 | gene_count <- rowSums(Zeisel>0) 26 | Zeisel <- Zeisel[gene_count >= 10,] 27 | 28 | # osmFISH 29 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom") 30 | mat <- osm[['matrix']][,] 31 | colnames(mat) <- osm[['row_attrs']][['Gene']][] 32 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][]) 33 | x_dim <- osm[['col_attrs']][['X']][] 34 | y_dim <- osm[['col_attrs']][['Y']][] 35 | region <- osm[['col_attrs']][['Region']][] 36 | cluster <- osm[['col_attrs']][['ClusterName']][] 37 | osm$close_all() 38 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim) 39 | rownames(spatial) <- rownames(mat) 40 | spatial <- as.matrix(spatial) 41 | osmFISH <- t(mat) 42 | osmFISH <- osmFISH[,!is.element(region,c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))] 43 | 44 | Gene_set <- intersect(rownames(osmFISH),rownames(Zeisel)) 45 | 46 | #### New genes prediction 47 | Ligerex <- createLiger(list(SMSC_FISH = osmFISH, SMSC_RNA = Zeisel)) 48 | Ligerex <- normalize(Ligerex) 49 | Ligerex@var.genes <- Gene_set 50 | Ligerex <- scaleNotCenter(Ligerex) 51 | 52 | # suggestK(Ligerex, k.test= seq(5,30,5)) # K = 20 53 | # suggestLambda(Ligerex, k = 20) # Lambda = 20 54 | 55 | Ligerex <- optimizeALS(Ligerex,k = 20, lambda = 20) 56 | Ligerex <- quantileAlignSNF(Ligerex) 57 | 58 | Imputation <- imputeKNN(Ligerex,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE) 59 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2') 60 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(Imputation@norm.data$SMSC_FISH)[2]) 
61 | rownames(Imp_New_genes) <- new.genes 62 | for(i in 1:length(new.genes)) { 63 | Imp_New_genes[new.genes[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[new.genes[i],]) 64 | } 65 | write.csv(Imp_New_genes,file = 'Results/Liger_New_genes.csv') 66 | 67 | # leave-one-out validation 68 | genes.leaveout <- intersect(rownames(osmFISH),rownames(Zeisel)) 69 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH)[2]) 70 | rownames(Imp_genes) <- genes.leaveout 71 | colnames(Imp_genes) <- colnames(osmFISH) 72 | NMF_time <- vector(mode= "numeric") 73 | knn_time <- vector(mode= "numeric") 74 | 75 | for(i in 1:length(genes.leaveout)) { 76 | print(i) 77 | start_time <- Sys.time() 78 | Ligerex.leaveout <- createLiger(list(SMSC_FISH = osmFISH[-which(rownames(osmFISH) %in% genes.leaveout[i]),], SMSC_RNA = Zeisel[rownames(osmFISH),])) 79 | Ligerex.leaveout <- normalize(Ligerex.leaveout) 80 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i]) 81 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout) 82 | if(dim(Ligerex.leaveout@norm.data$SMSC_FISH)[2]!=dim(osmFISH)[2]){ 83 | next 84 | } 85 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 20, lambda = 20) 86 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout) 87 | end_time <- Sys.time() 88 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 89 | 90 | start_time <- Sys.time() 91 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE, knn_k = 30) 92 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[genes.leaveout[i],]) 93 | end_time <- Sys.time() 94 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 95 | } 96 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv') 97 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE) 98 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE) 
-------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/Seurat/Zeisel_Cortex.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_Ziesel/") 2 | library(Seurat) 3 | 4 | Zeisel <- read.delim(file = "data/Zeisel/expression_mRNA_17-Aug-2014.txt",header = FALSE) 5 | 6 | meta.data <- Zeisel[1:10,] 7 | rownames(meta.data) <- meta.data[,2] 8 | meta.data <- meta.data[,-c(1,2)] 9 | meta.data <- as.data.frame(t(meta.data)) 10 | 11 | Zeisel <- Zeisel[12:19983,] 12 | gene_names <- Zeisel[,1] 13 | Zeisel <- Zeisel[,-c(1,2)] 14 | Zeisel <- apply(Zeisel,2,as.numeric) 15 | Zeisel <- as.matrix(Zeisel) 16 | rownames(Zeisel) <- gene_names 17 | Zeisel <- Zeisel[,meta.data$tissue=='sscortex'] 18 | meta.data <- meta.data[meta.data$tissue=='sscortex',] 19 | 20 | colnames(Zeisel) <- paste0('SMSC_',c(1:dim(Zeisel)[2])) 21 | rownames(meta.data) <- paste0('SMSC_',c(1:dim(meta.data)[1])) 22 | 23 | Zeisel <- CreateSeuratObject(counts = Zeisel, project = 'SMSC', meta.data = meta.data, min.cells = 10) 24 | Zeisel <- NormalizeData(object = Zeisel) 25 | Zeisel <- FindVariableFeatures(object = Zeisel, nfeatures = 2000) 26 | Zeisel <- ScaleData(object = Zeisel) 27 | Zeisel <- RunPCA(object = Zeisel, npcs = 50, verbose = FALSE) 28 | Zeisel <- RunUMAP(object = Zeisel, dims = 1:50, nneighbors = 5) 29 | saveRDS(object = Zeisel, file = paste0("data/seurat_objects/","Zeisel_SMSC.rds")) -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/Seurat/impute_osmFISH.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_Ziesel/") 2 | library(Seurat) 3 | 4 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds") 5 | Zeisel <- readRDS("data/seurat_objects/Zeisel_SMSC.rds") 6 | 7 | #Project on Zeisel labels 8 | i2 <- FindTransferAnchors( 9 | reference = Zeisel, 10 | query = osmFISH, 11 | features = 
rownames(osmFISH), 12 | reduction = 'cca', 13 | reference.assay = 'RNA', 14 | query.assay = 'RNA' 15 | ) 16 | 17 | refdata <- GetAssayData( 18 | object = Zeisel, 19 | assay = 'RNA', 20 | slot = 'data' 21 | ) 22 | 23 | imputation <- TransferData( 24 | anchorset = i2, 25 | refdata = refdata, 26 | weight.reduction = 'pca' 27 | ) 28 | 29 | osmFISH[['ss2']] <- imputation 30 | saveRDS(osmFISH, 'data/seurat_objects/osmFISH_Cortex_imputed.rds') -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/Seurat/osmFISH.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_Ziesel/") 2 | library(Seurat) 3 | library(hdf5r) 4 | library(methods) 5 | 6 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom") 7 | mat <- osm[['matrix']][,] 8 | colnames(mat) <- osm[['row_attrs']][['Gene']][] 9 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][]) 10 | x_dim <- osm[['col_attrs']][['X']][] 11 | y_dim <- osm[['col_attrs']][['Y']][] 12 | region <- osm[['col_attrs']][['Region']][] 13 | cluster <- osm[['col_attrs']][['ClusterName']][] 14 | osm$close_all() 15 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim) 16 | rownames(spatial) <- rownames(mat) 17 | spatial <- as.matrix(spatial) 18 | mat <- t(mat) 19 | 20 | osm_seurat <- CreateSeuratObject(counts = mat, project = 'osmFISH', assay = 'RNA', min.cells = -1, min.features = -1) 21 | names(region) <- colnames(osm_seurat) 22 | names(cluster) <- colnames(osm_seurat) 23 | osm_seurat <- AddMetaData(osm_seurat, region, col.name = 'region') 24 | osm_seurat <- AddMetaData(osm_seurat, cluster, col.name = 'cluster') 25 | osm_seurat[['spatial']] <- CreateDimReducObject(embeddings = spatial, key = 'spatial', assay = 'RNA') 26 | Idents(osm_seurat) <- 'region' 27 | osm_seurat <- SubsetData(osm_seurat, ident.remove = c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle')) 28 | total.counts 
= colSums(x = as.matrix(osm_seurat@assays$RNA@counts)) 29 | osm_seurat <- NormalizeData(osm_seurat, scale.factor = median(x = total.counts)) 30 | osm_seurat <- ScaleData(osm_seurat) 31 | saveRDS(object = osm_seurat, file = 'data/seurat_objects/osmFISH_Cortex.rds') -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/Seurat/osmFISH_integration.R: -------------------------------------------------------------------------------- 1 | setwd("osmFISH_Ziesel/") 2 | library(Seurat) 3 | library(ggplot2) 4 | 5 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds") 6 | osmFISH.imputed <- readRDS("data/seurat_objects/osmFISH_Cortex_imputed.rds") 7 | Zeisel <- readRDS("data/seurat_objects/Zeisel_SMSC.rds") 8 | 9 | genes.leaveout <- intersect(rownames(osmFISH),rownames(Zeisel)) 10 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH@assays$RNA)[2]) 11 | rownames(Imp_genes) <- genes.leaveout 12 | anchor_time <- vector(mode= "numeric") 13 | Transfer_time <- vector(mode= "numeric") 14 | 15 | run_imputation <- function(ref.obj, query.obj, feature.remove) { 16 | message(paste0('removing ', feature.remove)) 17 | features <- setdiff(rownames(query.obj), feature.remove) 18 | DefaultAssay(ref.obj) <- 'RNA' 19 | DefaultAssay(query.obj) <- 'RNA' 20 | 21 | start_time <- Sys.time() 22 | anchors <- FindTransferAnchors( 23 | reference = ref.obj, 24 | query = query.obj, 25 | features = features, 26 | dims = 1:30, 27 | reduction = 'cca' 28 | ) 29 | end_time <- Sys.time() 30 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 31 | 32 | refdata <- GetAssayData( 33 | object = ref.obj, 34 | assay = 'RNA', 35 | slot = 'data' 36 | ) 37 | 38 | start_time <- Sys.time() 39 | imputation <- TransferData( 40 | anchorset = anchors, 41 | refdata = refdata, 42 | weight.reduction = 'pca' 43 | ) 44 | query.obj[['seq']] <- imputation 45 | end_time <- Sys.time() 46 | Transfer_time <<- 
c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs'))) 47 | return(query.obj) 48 | } 49 | 50 | for(i in 1:length(genes.leaveout)) { 51 | imputed.ss2 <- run_imputation(ref.obj = Zeisel, query.obj = osmFISH, feature.remove = genes.leaveout[[i]]) 52 | osmFISH[['ss2']] <- imputed.ss2[, colnames(osmFISH)][['seq']] 53 | Imp_genes[genes.leaveout[[i]],] = as.vector(osmFISH@assays$ss2[genes.leaveout[i],]) 54 | } 55 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv') 56 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE) 57 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE) 58 | 59 | # show genes not in the osmFISH dataset 60 | DefaultAssay(osmFISH.imputed) <- "ss2" 61 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2') 62 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(osmFISH.imputed@assays$ss2)[2]) 63 | rownames(Imp_New_genes) <- new.genes 64 | for(i in 1:length(new.genes)) { 65 | Imp_New_genes[new.genes[[i]],] = as.vector(osmFISH.imputed@assays$ss2[new.genes[i],]) 66 | } 67 | write.csv(Imp_New_genes,file = 'Results/Seurat_New_genes.csv') -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/SpaGE/Integration.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_Ziesel/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.neighbors import NearestNeighbors 8 | import time as tm 9 | 10 | with open ('data/SpaGE_pkl/Ziesel.pkl', 'rb') as f: 11 | datadict = pickle.load(f) 12 | 13 | RNA_data = datadict['RNA_data'] 14 | RNA_data_scaled = datadict['RNA_data_scaled'] 15 | del datadict 16 | 17 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f: 18 | datadict = pickle.load(f) 19 | 20 | osmFISH_data = datadict['osmFISH_data'] 21 | osmFISH_data_scaled = datadict['osmFISH_data_scaled'] 22 | osmFISH_meta= 
datadict['osmFISH_meta'] 23 | del datadict 24 | 25 | #### Leave One Out Validation #### 26 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns) 28 | precise_time = [] 29 | knn_time = [] 30 | for i in Common_data.columns: 31 | print(i) 32 | start = tm.time() 33 | from principal_vectors import PVComputation 34 | 35 | n_factors = 30 36 | n_pv = 30 37 | dim_reduction = 'pca' 38 | dim_reduction_target = 'pca' 39 | 40 | pv_FISH_RNA = PVComputation( 41 | n_factors = n_factors, 42 | n_pv = n_pv, 43 | dim_reduction = dim_reduction, 44 | dim_reduction_target = dim_reduction_target 45 | ) 46 | 47 | pv_FISH_RNA.fit(Common_data.drop(i,axis=1),osmFISH_data_scaled[Common_data.columns].drop(i,axis=1)) 48 | 49 | S = pv_FISH_RNA.source_components_.T 50 | 51 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 52 | S = S[:,0:Effective_n_pv] 53 | 54 | Common_data_t = Common_data.drop(i,axis=1).dot(S) 55 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S) 56 | precise_time.append(tm.time()-start) 57 | 58 | start = tm.time() 59 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t) 60 | distances, indices = nbrs.kneighbors(FISH_exp_t) 61 | 62 | Imp_Gene = np.zeros(osmFISH_data.shape[0]) 63 | for j in range(0,osmFISH_data.shape[0]): 64 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 65 | weights = weights/(len(weights)-1) 66 | Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights)) 67 | Imp_Gene[np.isnan(Imp_Gene)] = 0 68 | Imp_Genes[i] = Imp_Gene 69 | knn_time.append(tm.time()-start) 70 | 71 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv') 72 | precise_time = pd.DataFrame(precise_time) 73 | knn_time = pd.DataFrame(knn_time) 74 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False) 75 | 
knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False) 76 | 77 | #### Novel Genes Expression Patterns #### 78 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 79 | genes_to_impute = ["Tesc","Pvrl3","Grm2"] 80 | Imp_New_Genes = pd.DataFrame(np.zeros((osmFISH_data.shape[0],len(genes_to_impute))),columns=genes_to_impute) 81 | 82 | from principal_vectors import PVComputation 83 | 84 | n_factors = 30 85 | n_pv = 30 86 | dim_reduction = 'pca' 87 | dim_reduction_target = 'pca' 88 | 89 | pv_FISH_RNA = PVComputation( 90 | n_factors = n_factors, 91 | n_pv = n_pv, 92 | dim_reduction = dim_reduction, 93 | dim_reduction_target = dim_reduction_target 94 | ) 95 | 96 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns]) 97 | 98 | S = pv_FISH_RNA.source_components_.T 99 | 100 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 101 | S = S[:,0:Effective_n_pv] 102 | 103 | Common_data_t = Common_data.dot(S) 104 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].dot(S) 105 | 106 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto', 107 | metric = 'cosine').fit(Common_data_t) 108 | distances, indices = nbrs.kneighbors(FISH_exp_t) 109 | 110 | for j in range(0,osmFISH_data.shape[0]): 111 | 112 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 113 | weights = weights/(len(weights)-1) 114 | Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]]) 115 | 116 | Imp_New_Genes.to_csv('Results/SpaGE_New_genes.csv') -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/SpaGE/Linnarson_data_preprocessing.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_Ziesel/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import scipy.stats as st 7 | import pickle 8 | import loompy 
9 | 10 | ## Read RNA seq data 11 | RNA_exp = pd.read_csv('data/Zeisel/expression_mRNA_17-Aug-2014.txt',header=None,sep='\t') 12 | 13 | RNA_data = RNA_exp.iloc[11:19984,:] 14 | RNA_data = RNA_data.drop([1],axis=1) 15 | RNA_data.set_index(0,inplace=True) 16 | RNA_data.columns = range(0,RNA_data.shape[1]) 17 | RNA_data = RNA_data.astype('float') 18 | 19 | RNA_meta = RNA_exp.iloc[0:10,:] 20 | RNA_meta = RNA_meta.drop([0],axis=1) 21 | RNA_meta.set_index(1,inplace=True) 22 | RNA_meta.columns = range(0,RNA_meta.shape[1]) 23 | RNA_meta = RNA_meta.T 24 | del RNA_exp 25 | 26 | RNA_data = RNA_data.loc[:,RNA_meta['tissue']=='sscortex'] 27 | RNA_meta = RNA_meta.loc[RNA_meta['tissue']=='sscortex',:] 28 | 29 | RNA_data.columns = range(0,RNA_data.shape[1]) 30 | RNA_meta.index = range(0,RNA_meta.shape[0]) 31 | 32 | # filter lowly expressed genes 33 | Genes_count = np.sum(RNA_data > 0, axis=1) 34 | RNA_data = RNA_data.loc[Genes_count >=10,:] 35 | del Genes_count 36 | 37 | def Log_Norm_cpm(x): 38 | return np.log(((x/np.sum(x))*1000000) + 1) 39 | 40 | RNA_data = RNA_data.apply(Log_Norm_cpm,axis=0) 41 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index) 42 | 43 | datadict = dict() 44 | datadict['RNA_data'] = RNA_data.T 45 | datadict['RNA_data_scaled'] = RNA_data_scaled 46 | datadict['RNA_meta'] = RNA_meta 47 | 48 | with open('data/SpaGE_pkl/Ziesel.pkl','wb') as f: 49 | pickle.dump(datadict, f) 50 | 51 | ## Read loom data 52 | ds = loompy.connect('data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom') 53 | 54 | FISH_Genes = ds.ra['Gene'] 55 | 56 | colAtr = ds.ca.keys() 57 | 58 | df = pd.DataFrame() 59 | for i in colAtr: 60 | df[i] = ds.ca[i] 61 | 62 | osmFISH_meta = df.iloc[np.where(df.Valid == 1)[0], :] 63 | osmFISH_data = ds[:,:] 64 | osmFISH_data = osmFISH_data[:,np.where(df.Valid == 1)[0]] 65 | osmFISH_data = pd.DataFrame(data= osmFISH_data, index= FISH_Genes) 66 | 67 | del ds, colAtr, i, df, FISH_Genes 68 | 69 | Cortex_Regions 
= ['Layer 2-3 lateral', 'Layer 2-3 medial', 'Layer 3-4', 70 | 'Layer 4','Layer 5', 'Layer 6', 'Pia Layer 1'] 71 | Cortical = np.array([i in Cortex_Regions for i in osmFISH_meta.Region]) 72 | 73 | osmFISH_meta = osmFISH_meta.iloc[Cortical,:] 74 | osmFISH_data = osmFISH_data.iloc[:,Cortical] 75 | 76 | cell_count = np.sum(osmFISH_data,axis=0) 77 | def Log_Norm(x): 78 | return np.log(((x/np.sum(x))*np.median(cell_count)) + 1) 79 | 80 | osmFISH_data = osmFISH_data.apply(Log_Norm,axis=0) 81 | osmFISH_data_scaled = pd.DataFrame(data=st.zscore(osmFISH_data.T),index = osmFISH_data.columns,columns=osmFISH_data.index) 82 | 83 | datadict = dict() 84 | datadict['osmFISH_data'] = osmFISH_data.T 85 | datadict['osmFISH_data_scaled'] = osmFISH_data_scaled 86 | datadict['osmFISH_meta'] = osmFISH_meta 87 | 88 | with open('data/SpaGE_pkl/osmFISH_Cortex.pkl','wb') as f: 89 | pickle.dump(datadict, f) 90 | -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/SpaGE/Precise_output.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_Ziesel/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | import matplotlib 8 | matplotlib.rcParams['pdf.fonttype'] = 42 9 | matplotlib.rcParams['ps.fonttype'] = 42 10 | import matplotlib.pyplot as plt 11 | import seaborn as sns 12 | import sys 13 | sys.path.insert(1,'SpaGE/') 14 | from principal_vectors import PVComputation 15 | 16 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f: 17 | datadict = pickle.load(f) 18 | 19 | osmFISH_data = datadict['osmFISH_data'] 20 | osmFISH_data_scaled = datadict['osmFISH_data_scaled'] 21 | osmFISH_meta= datadict['osmFISH_meta'] 22 | del datadict 23 | 24 | with open ('data/SpaGE_pkl/Ziesel.pkl', 'rb') as f: 25 | datadict = pickle.load(f) 26 | 27 | RNA_data = datadict['RNA_data'] 28 | RNA_data_scaled = datadict['RNA_data_scaled'] 29 | RNA_meta = datadict['RNA_meta'] 30 | del 
datadict 31 | 32 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)] 33 | 34 | n_factors = 30 35 | n_pv = 30 36 | n_pv_display = 30 37 | dim_reduction = 'pca' 38 | dim_reduction_target = 'pca' 39 | 40 | pv_FISH_RNA = PVComputation( 41 | n_factors = n_factors, 42 | n_pv = n_pv, 43 | dim_reduction = dim_reduction, 44 | dim_reduction_target = dim_reduction_target 45 | ) 46 | 47 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns]) 48 | 49 | fig = plt.figure() 50 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 51 | center=0, vmax=1., vmin=0) 52 | plt.xlabel('osmFISH',fontsize=18, color='black') 53 | plt.ylabel('Ziesel_RNA',fontsize=18, color='black') 54 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 55 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 56 | plt.gca().set_ylim([n_pv_display,0]) 57 | plt.show() 58 | 59 | plt.figure() 60 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r', 61 | center=0, vmax=1., vmin=0) 62 | for i in range(n_pv_display-1): 63 | plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black') 64 | 65 | plt.xlabel('osmFISH',fontsize=18, color='black') 66 | plt.ylabel('Ziesel_RNA',fontsize=18, color='black') 67 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12) 68 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal') 69 | plt.gca().set_ylim([n_pv_display,0]) 70 | plt.show() 71 | 72 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns) 73 | Importance.sort_values(ascending=False,inplace=True) 74 | Importance.index[0:30] 75 | 76 | ### Technology specific Processes 77 | Effective_n_pv = 
sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 78 | 79 | # explained variance RNA 80 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 81 | # explained variance spatial 82 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100 83 | -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/SpaGE/dimensionality_reduction.py: -------------------------------------------------------------------------------- 1 | """ Dimensionality Reduction 2 | @author: Soufiane Mourragui 3 | This module extracts the domain-specific factors from the high-dimensional omics 4 | dataset. Several methods are implemented here and can be directly 5 | called by string name from the main method. All the methods 6 | use the scikit-learn implementation. 7 | Notes 8 | ------- 9 | - 10 | 11 | References 12 | ------- 13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python. 14 | Journal of Machine Learning Research 15 | """ 16 | 17 | import numpy as np 18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA 19 | from sklearn.cross_decomposition import PLSRegression 20 | 21 | 22 | def process_dim_reduction(method='pca', n_dim=10): 23 | """ 24 | Default linear dimensionality reduction method. For each method, return a 25 | BaseEstimator instance corresponding to the method given as input. 26 | Attributes 27 | ------- 28 | method: str, default to 'pca' 29 | Method used for dimensionality reduction. 30 | Implemented: 'pca', 'ica', 'fa' (Factor Analysis), 31 | 'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA), 'pls' (Partial Least Squares). 32 | 33 | n_dim: int, default to 10 34 | Number of domain-specific factors to compute. 35 | Return values 36 | ------- 37 | Classifier, i.e. 
BaseEstimator instance 38 | """ 39 | 40 | if method.lower() == 'pca': 41 | clf = PCA(n_components=n_dim) 42 | 43 | elif method.lower() == 'ica': 44 | print('ICA') 45 | clf = FastICA(n_components=n_dim) 46 | 47 | elif method.lower() == 'fa': 48 | clf = FactorAnalysis(n_components=n_dim) 49 | 50 | elif method.lower() == 'nmf': 51 | clf = NMF(n_components=n_dim) 52 | 53 | elif method.lower() == 'sparsepca': 54 | clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1) 55 | 56 | elif method.lower() == 'pls': 57 | clf = PLS(n_components=n_dim) 58 | 59 | else: 60 | raise NameError('%s is not an implemented method'%(method)) 61 | 62 | return clf 63 | 64 | 65 | class PLS(): 66 | """ 67 | Implement PLS to make it compliant with the other dimensionality 68 | reduction methodology. 69 | (Simple class rewriting). 70 | """ 71 | def __init__(self, n_components=10): 72 | self.clf = PLSRegression(n_components) 73 | 74 | def get_components_(self): 75 | return self.clf.x_weights_.transpose() 76 | 77 | def set_components_(self, x): 78 | pass 79 | 80 | components_ = property(get_components_, set_components_) 81 | 82 | def fit(self, X, y): 83 | self.clf.fit(X,y) 84 | return self 85 | 86 | def transform(self, X): 87 | return self.clf.transform(X) 88 | 89 | def predict(self, X): 90 | return self.clf.predict(X) -------------------------------------------------------------------------------- /benchmark/osmFISH_Ziesel/gimVI/gimVI.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('osmFISH_Ziesel/') 3 | 4 | from scvi.dataset import CsvDataset 5 | from scvi.models import JVAE, Classifier 6 | from scvi.inference import JVAETrainer 7 | import numpy as np 8 | import pandas as pd 9 | import copy 10 | import torch 11 | import time as tm 12 | 13 | ### osmFISH data 14 | osmFISH_data = CsvDataset('data/gimVI_data/osmFISH_Cortex_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 15 | 16 | ### RNA 17 | RNA_data 
= CsvDataset('data/gimVI_data/RNA_Cortex_scvi.csv', save_path = "", sep = ",", gene_by_cell = False) 18 | 19 | ### Leave-one-out validation 20 | Common_data = copy.deepcopy(RNA_data) 21 | Common_data.gene_names = osmFISH_data.gene_names 22 | Common_data.X = Common_data.X[:,np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in osmFISH_data.gene_names]),-1)] 23 | Gene_set = np.intersect1d(osmFISH_data.gene_names,Common_data.gene_names) 24 | Imp_Genes = pd.DataFrame(columns=Gene_set) 25 | gimVI_time = [] 26 | 27 | for i in Gene_set: 28 | print(i) 29 | # Create copy of the fish dataset with hidden genes 30 | data_spatial_partial = copy.deepcopy(osmFISH_data) 31 | data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(osmFISH_data.gene_names,i)) 32 | data_spatial_partial.batch_indices += Common_data.n_batches 33 | 34 | if(data_spatial_partial.X.shape[0] != osmFISH_data.X.shape[0]): 35 | continue 36 | 37 | datasets = [Common_data, data_spatial_partial] 38 | generative_distributions = ["zinb", "nb"] 39 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(i==Common_data.gene_names) for i in data_spatial_partial.gene_names]),-1)] 40 | n_inputs = [d.nb_genes for d in datasets] 41 | total_genes = Common_data.nb_genes 42 | n_batches = sum([d.n_batches for d in datasets]) 43 | 44 | model_library_size = [True, False] 45 | 46 | n_latent = 8 47 | kappa = 5 48 | 49 | start = tm.time() 50 | torch.manual_seed(0) 51 | 52 | model = JVAE( 53 | n_inputs, 54 | total_genes, 55 | gene_mappings, 56 | generative_distributions, 57 | model_library_size, 58 | n_layers_decoder_individual=0, 59 | n_layers_decoder_shared=0, 60 | n_layers_encoder_individual=1, 61 | n_layers_encoder_shared=1, 62 | dim_hidden_encoder=64, 63 | dim_hidden_decoder_shared=64, 64 | dropout_rate_encoder=0.2, 65 | dropout_rate_decoder=0.2, 66 | n_batch=n_batches, 67 | n_latent=n_latent, 68 | ) 69 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 70 | 71 | trainer = JVAETrainer(model, 
discriminator, datasets, 0.95, frequency=1, kappa=kappa) 72 | trainer.train(n_epochs=200) 73 | _,Imputed = trainer.get_imputed_values(normalized=True) 74 | Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1) 75 | Imp_Genes[i] = Imputed 76 | gimVI_time.append(tm.time()-start) 77 | 78 | Imp_Genes = Imp_Genes.fillna(0) 79 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv') 80 | gimVI_time = pd.DataFrame(gimVI_time) 81 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False) 82 | 83 | ### New genes 84 | Imp_New_Genes = pd.DataFrame(columns=["TESC","PVRL3","GRM2"]) 85 | 86 | # Create copy of the fish dataset with hidden genes 87 | data_spatial_partial = copy.deepcopy(osmFISH_data) 88 | data_spatial_partial.filter_genes_by_attribute(osmFISH_data.gene_names) 89 | data_spatial_partial.batch_indices += RNA_data.n_batches 90 | 91 | datasets = [RNA_data, data_spatial_partial] 92 | generative_distributions = ["zinb", "nb"] 93 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in data_spatial_partial.gene_names]),-1)] 94 | n_inputs = [d.nb_genes for d in datasets] 95 | total_genes = RNA_data.nb_genes 96 | n_batches = sum([d.n_batches for d in datasets]) 97 | 98 | model_library_size = [True, False] 99 | 100 | n_latent = 8 101 | kappa = 5 102 | 103 | torch.manual_seed(0) 104 | 105 | model = JVAE( 106 | n_inputs, 107 | total_genes, 108 | gene_mappings, 109 | generative_distributions, 110 | model_library_size, 111 | n_layers_decoder_individual=0, 112 | n_layers_decoder_shared=0, 113 | n_layers_encoder_individual=1, 114 | n_layers_encoder_shared=1, 115 | dim_hidden_encoder=64, 116 | dim_hidden_decoder_shared=64, 117 | dropout_rate_encoder=0.2, 118 | dropout_rate_decoder=0.2, 119 | n_batch=n_batches, 120 | n_latent=n_latent, 121 | ) 122 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True) 123 | 124 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa) 125 | 
trainer.train(n_epochs=200) 126 | 127 | for i in ["TESC","PVRL3","GRM2"]: 128 | _,Imputed = trainer.get_imputed_values(normalized=True) 129 | Imputed = np.reshape(Imputed[:,np.argwhere(i==RNA_data.gene_names)[0]],-1) 130 | Imp_New_Genes[i] = Imputed 131 | 132 | Imp_New_Genes.to_csv('Results/gimVI_New_genes.csv') 133 | -------------------------------------------------------------------------------- /benchmark/seqFISH_AllenVISp/DownSampling/DownSampling.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('seqFISH_AllenVISp/') 3 | 4 | import pickle 5 | import numpy as np 6 | import pandas as pd 7 | from sklearn.neighbors import NearestNeighbors 8 | import scipy.stats as st 9 | import sys 10 | sys.path.insert(1,'Scripts/SpaGE/') 11 | from principal_vectors import PVComputation 12 | 13 | with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f: 14 | datadict = pickle.load(f) 15 | 16 | seqFISH_data = datadict['seqFISH_data'] 17 | seqFISH_data_scaled = datadict['seqFISH_data_scaled'] 18 | seqFISH_meta= datadict['seqFISH_meta'] 19 | del datadict 20 | 21 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f: 22 | datadict = pickle.load(f) 23 | 24 | RNA_data = datadict['RNA_data'] 25 | RNA_data_scaled = datadict['RNA_data_scaled'] 26 | del datadict 27 | 28 | Gene_Order = np.intersect1d(seqFISH_data.columns,RNA_data.columns) 29 | 30 | ### SpaGE 31 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut.csv',header=0,index_col=0,sep=',') 32 | #SpaGE_New = pd.read_csv('Results/SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 33 | 34 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order] 35 | 36 | SpaGE_seqCorr = pd.Series(index = Gene_Order) 37 | for i in Gene_Order: 38 | SpaGE_seqCorr[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed[i])[0] 39 | SpaGE_seqCorr[np.isnan(SpaGE_seqCorr)] = 0 40 | 41 | SpaGE_seqCorr.sort_values(ascending=False,inplace=True) 42 | test_set = SpaGE_seqCorr.index[0:50] 43 | 44 | Common_data = 
RNA_data[Gene_Order] 45 | Common_data = Common_data.drop(columns=test_set) 46 | 47 | Corr = np.corrcoef(Common_data.T) 48 | #for i in range(0,Corr.shape[0]): 49 | # Corr[i,i]=0 50 | 51 | #plt.hist(np.abs(np.reshape(Corr,-1)),bins=np.arange(0,1.05,0.05)) 52 | #plt.show() 53 | # 0.7 54 | removed_genes = [] 55 | for i in range(0,Corr.shape[0]): 56 | for j in range(i+1,Corr.shape[0]): 57 | if(np.abs(Corr[i,j]) > 0.7): 58 | Vi = np.var(Common_data.iloc[:,i]) 59 | Vj = np.var(Common_data.iloc[:,j]) 60 | if(Vi > Vj): 61 | removed_genes.append(Common_data.columns[j]) 62 | else: 63 | removed_genes.append(Common_data.columns[i]) 64 | removed_genes= np.unique(removed_genes) 65 | 66 | Common_data = Common_data.drop(columns=removed_genes) 67 | Variance = np.var(Common_data) 68 | Variance.sort_values(ascending=False,inplace=True) 69 | Variance = pd.concat([Variance, pd.Series(0,index=removed_genes)]) 70 | 71 | #### Novel Genes Expression Patterns #### 72 | genes_to_impute = test_set 73 | for i in [10,30,50,100,200,500,1000,2000,5000,7000,len(Variance)]: 74 | print(i) 75 | Imp_New_Genes = pd.DataFrame(np.zeros((seqFISH_data.shape[0],len(genes_to_impute))),columns=genes_to_impute) 76 | 77 | if(i>=50): 78 | n_factors = 50 79 | n_pv = 50 80 | else: 81 | n_factors = i 82 | n_pv = i 83 | 84 | dim_reduction = 'pca' 85 | dim_reduction_target = 'pca' 86 | 87 | pv_FISH_RNA = PVComputation( 88 | n_factors = n_factors, 89 | n_pv = n_pv, 90 | dim_reduction = dim_reduction, 91 | dim_reduction_target = dim_reduction_target 92 | ) 93 | 94 | source_data = RNA_data_scaled[Variance.index[0:i]] 95 | target_data = seqFISH_data_scaled[Variance.index[0:i]] 96 | 97 | pv_FISH_RNA.fit(source_data,target_data) 98 | 99 | S = pv_FISH_RNA.source_components_.T 100 | 101 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3) 102 | S = S[:,0:Effective_n_pv] 103 | 104 | RNA_data_t = source_data.dot(S) 105 | FISH_exp_t = target_data.dot(S) 106 | 107 | nbrs = NearestNeighbors(n_neighbors=50, 
algorithm='auto', 108 | metric = 'cosine').fit(RNA_data_t) 109 | distances, indices = nbrs.kneighbors(FISH_exp_t) 110 | 111 | for j in range(0,seqFISH_data.shape[0]): 112 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1])) 113 | weights = weights/(len(weights)-1) 114 | Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]]) 115 | 116 | Imp_New_Genes.to_csv('Results/' + str(i) +'SpaGE_New_genes.csv') 117 | -------------------------------------------------------------------------------- /benchmark/seqFISH_AllenVISp/DownSampling/DownSampling_evaluation.py: -------------------------------------------------------------------------------- 1 | import os 2 | os.chdir('seqFISH_AllenVISp/') 3 | 4 | import numpy as np 5 | import pandas as pd 6 | import matplotlib 7 | matplotlib.use('qt5agg') 8 | matplotlib.rcParams['pdf.fonttype'] = 42 9 | matplotlib.rcParams['ps.fonttype'] = 42 10 | import matplotlib.pyplot as plt 11 | import scipy.stats as st 12 | import pickle 13 | 14 | ### Original data 15 | with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f: 16 | datadict = pickle.load(f) 17 | 18 | seqFISH_data = datadict['seqFISH_data'] 19 | del datadict 20 | 21 | test_set = pd.read_csv('Results/10SpaGE_New_genes.csv',header=0,index_col=0,sep=',').columns 22 | seqFISH_data = seqFISH_data[test_set] 23 | 24 | ### SpaGE 25 | #10 26 | SpaGE_imputed_10 = pd.read_csv('Results/10SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 27 | 28 | SpaGE_Corr_10 = pd.Series(index = test_set) 29 | for i in test_set: 30 | SpaGE_Corr_10[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_10[i])[0] 31 | 32 | #30 33 | SpaGE_imputed_30 = pd.read_csv('Results/30SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 34 | 35 | SpaGE_Corr_30 = pd.Series(index = test_set) 36 | for i in test_set: 37 | SpaGE_Corr_30[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_30[i])[0] 38 | 39 | #50 40 | SpaGE_imputed_50 = 
pd.read_csv('Results/50SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 41 | 42 | SpaGE_Corr_50 = pd.Series(index = test_set) 43 | for i in test_set: 44 | SpaGE_Corr_50[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_50[i])[0] 45 | 46 | #100 47 | SpaGE_imputed_100 = pd.read_csv('Results/100SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 48 | 49 | SpaGE_Corr_100 = pd.Series(index = test_set) 50 | for i in test_set: 51 | SpaGE_Corr_100[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_100[i])[0] 52 | 53 | #200 54 | SpaGE_imputed_200 = pd.read_csv('Results/200SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 55 | 56 | SpaGE_Corr_200 = pd.Series(index = test_set) 57 | for i in test_set: 58 | SpaGE_Corr_200[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_200[i])[0] 59 | 60 | #500 61 | SpaGE_imputed_500 = pd.read_csv('Results/500SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 62 | 63 | SpaGE_Corr_500 = pd.Series(index = test_set) 64 | for i in test_set: 65 | SpaGE_Corr_500[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_500[i])[0] 66 | 67 | #1000 68 | SpaGE_imputed_1000 = pd.read_csv('Results/1000SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 69 | 70 | SpaGE_Corr_1000 = pd.Series(index = test_set) 71 | for i in test_set: 72 | SpaGE_Corr_1000[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_1000[i])[0] 73 | 74 | #2000 75 | SpaGE_imputed_2000 = pd.read_csv('Results/2000SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 76 | 77 | SpaGE_Corr_2000 = pd.Series(index = test_set) 78 | for i in test_set: 79 | SpaGE_Corr_2000[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_2000[i])[0] 80 | 81 | #5000 82 | SpaGE_imputed_5000 = pd.read_csv('Results/5000SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 83 | 84 | SpaGE_Corr_5000 = pd.Series(index = test_set) 85 | for i in test_set: 86 | SpaGE_Corr_5000[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_5000[i])[0] 87 | 88 | #7000 89 | SpaGE_imputed_7000 = 
pd.read_csv('Results/7000SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 90 | 91 | SpaGE_Corr_7000 = pd.Series(index = test_set) 92 | for i in test_set: 93 | SpaGE_Corr_7000[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_7000[i])[0] 94 | 95 | #9701 96 | SpaGE_imputed_9701 = pd.read_csv('Results/9701SpaGE_New_genes.csv',header=0,index_col=0,sep=',') 97 | 98 | SpaGE_Corr_9701 = pd.Series(index = test_set) 99 | for i in test_set: 100 | SpaGE_Corr_9701[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_9701[i])[0] 101 | 102 | ### Comparison plots 103 | plt.style.use('ggplot') 104 | plt.figure(figsize=(9, 3)) 105 | plt.boxplot([SpaGE_Corr_10, SpaGE_Corr_30, SpaGE_Corr_50, 106 | SpaGE_Corr_100,SpaGE_Corr_200,SpaGE_Corr_500, 107 | SpaGE_Corr_1000,SpaGE_Corr_2000,SpaGE_Corr_5000, 108 | SpaGE_Corr_7000,SpaGE_Corr_9701]) 109 | 110 | y = SpaGE_Corr_10 111 | x = np.random.normal(1, 0.05, len(y)) 112 | plt.plot(x, y, 'g.', alpha=0.2) 113 | y = SpaGE_Corr_30 114 | x = np.random.normal(2, 0.05, len(y)) 115 | plt.plot(x, y, 'g.', alpha=0.2) 116 | y = SpaGE_Corr_50 117 | x = np.random.normal(3, 0.05, len(y)) 118 | plt.plot(x, y, 'g.', alpha=0.2) 119 | y = SpaGE_Corr_100 120 | x = np.random.normal(4, 0.05, len(y)) 121 | plt.plot(x, y, 'g.', alpha=0.2) 122 | y = SpaGE_Corr_200 123 | x = np.random.normal(5, 0.05, len(y)) 124 | plt.plot(x, y, 'g.', alpha=0.2) 125 | y = SpaGE_Corr_500 126 | x = np.random.normal(6, 0.05, len(y)) 127 | plt.plot(x, y, 'g.', alpha=0.2) 128 | y = SpaGE_Corr_1000 129 | x = np.random.normal(7, 0.05, len(y)) 130 | plt.plot(x, y, 'g.', alpha=0.2) 131 | y = SpaGE_Corr_2000 132 | x = np.random.normal(8, 0.05, len(y)) 133 | plt.plot(x, y, 'g.', alpha=0.2) 134 | y = SpaGE_Corr_5000 135 | x = np.random.normal(9, 0.05, len(y)) 136 | plt.plot(x, y, 'g.', alpha=0.2) 137 | y = SpaGE_Corr_7000 138 | x = np.random.normal(10, 0.05, len(y)) 139 | plt.plot(x, y, 'g.', alpha=0.2) 140 | y = SpaGE_Corr_9701 141 | x = np.random.normal(11, 0.05, len(y)) 142 | plt.plot(x, y, 'g.', 
         alpha=0.2)


plt.xticks((1,2,3,4,5,6,7,8,9,10,11),
           ('10','30','50','100','200','500','1000','2000','5000','7000','9701'),size=10)
plt.yticks(size=8)
plt.xlabel('Number of shared genes',size=12)
plt.gca().set_ylim([-0.3,0.8])
plt.ylabel('Spearman Correlation',size=12)
plt.show()

--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/Performance_evaluation.py:
--------------------------------------------------------------------------------
import os
os.chdir('seqFISH_AllenVISp/')


import numpy as np
import pandas as pd
import pickle
import matplotlib
matplotlib.use('qt5agg')
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['ps.fonttype'] = 42
import matplotlib.pyplot as plt
#from matplotlib import cm
import scipy.stats as st

with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f:
    datadict = pickle.load(f)

seqFISH_data = datadict['seqFISH_data']
seqFISH_meta= datadict['seqFISH_meta']
del datadict

with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
    datadict = pickle.load(f)

RNA_data = datadict['RNA_data']
del datadict

Gene_Order = np.intersect1d(seqFISH_data.columns,RNA_data.columns)

### SpaGE
SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut.csv',header=0,index_col=0,sep=',')

SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]

# An explicit float dtype keeps np.isnan usable on the Series
SpaGE_seqCorr = pd.Series(index = Gene_Order, dtype=float)
for i in Gene_Order:
    SpaGE_seqCorr[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed[i])[0]
SpaGE_seqCorr[np.isnan(SpaGE_seqCorr)] = 0

# Move to the sibling STARmap benchmark directory
# (assumes the benchmark/ layout, with the dataset folders side by side)
os.chdir('../STARmap_AllenVISp/')

with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f:
    datadict = pickle.load(f)

coords = datadict['coords']
Starmap_data = datadict['Starmap_data']
del datadict

Gene_Order = np.intersect1d(Starmap_data.columns,RNA_data.columns)

### SpaGE
SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut_cutoff.csv',header=0,index_col=0,sep=',')

SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]

SpaGE_starCorr = pd.Series(index = Gene_Order, dtype=float)
for i in Gene_Order:
    SpaGE_starCorr[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed[i])[0]

def Compare_Correlations(X,Y):
    fig, ax = plt.subplots(figsize=(4.5, 4.5))
    ax.scatter(X, Y, s=1)
    ax.axvline(linestyle='--',color='gray')
    ax.axhline(linestyle='--',color='gray')
    plt.gca().set_ylim([-0.5,1])
    lims = [np.min([ax.get_xlim(), ax.get_ylim()]),
            np.max([ax.get_xlim(), ax.get_ylim()])]
    ax.plot(lims, lims, 'k-')
    ax.set_aspect('equal')
    ax.set_xlim(lims)
    ax.set_ylim(lims)
    plt.xticks(size=8)
    plt.yticks(size=8)
    plt.show()

Starmap_seq_genes = np.intersect1d(Starmap_data.columns,seqFISH_data.columns)
Compare_Correlations(SpaGE_starCorr[Starmap_seq_genes],SpaGE_seqCorr[Starmap_seq_genes])
plt.xlabel('Spearman Correlation STARmap',size=12)
plt.ylabel('Spearman Correlation seqFISH',size=12)
plt.show()

fig, ax = plt.subplots(figsize=(3.7, 4.5))
ax.boxplot([SpaGE_starCorr[Starmap_seq_genes],SpaGE_seqCorr[Starmap_seq_genes]],widths=0.5)

y = SpaGE_starCorr[Starmap_seq_genes]
x = np.random.normal(1, 0.05, len(y))
plt.plot(x, y, 'g.', markersize=1, alpha=0.2)
y = SpaGE_seqCorr[Starmap_seq_genes]
x = np.random.normal(2, 0.05, len(y))
plt.plot(x, y, 'g.', markersize=1, alpha=0.2)

plt.xticks((1,2),('STARmap','seqFISH'),size=12)
plt.yticks(size=8)
plt.gca().set_ylim([-0.4,0.8])
plt.ylabel('Spearman Correlation',size=12)
#ax.set_aspect(aspect=3)
# Paired Wilcoxon signed-rank test on the shared genes
_,p_val = st.wilcoxon(SpaGE_starCorr[Starmap_seq_genes],SpaGE_seqCorr[Starmap_seq_genes])
plt.text(2,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
plt.show()

# Move to the sibling osmFISH benchmark directory (same layout assumption)
os.chdir('../osmFISH_AllenVISp/')

with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f:
    datadict = pickle.load(f)

osmFISH_data = datadict['osmFISH_data']
del datadict

Gene_Order = osmFISH_data.columns

### SpaGE
SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut_cutoff.csv',header=0,index_col=0,sep=',')

SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]

SpaGE_osmCorr = pd.Series(index = Gene_Order, dtype=float)
for i in Gene_Order:
    SpaGE_osmCorr[i] = st.spearmanr(osmFISH_data[i],SpaGE_imputed[i])[0]

def Compare_Correlations(X,Y):
    fig, ax = plt.subplots(figsize=(4.5, 4.5))
    ax.scatter(X, Y, s=25)
    ax.axvline(linestyle='--',color='gray')
    ax.axhline(linestyle='--',color='gray')
    plt.gca().set_ylim([-0.5,1])
    lims = [np.min([ax.get_xlim(), ax.get_ylim()]),
            np.max([ax.get_xlim(), ax.get_ylim()])]
    ax.plot(lims, lims, 'k-')
    ax.set_aspect('equal')
    ax.set_xlim(lims)
    ax.set_ylim(lims)
    plt.xticks(size=8)
    plt.yticks(size=8)
    plt.show()

osm_seq_genes = np.intersect1d(osmFISH_data.columns,seqFISH_data.columns)
Compare_Correlations(SpaGE_osmCorr[osm_seq_genes],SpaGE_seqCorr[osm_seq_genes])
plt.xlabel('Spearman Correlation osmFISH',size=12)
plt.ylabel('Spearman Correlation seqFISH',size=12)
plt.show()

fig, ax = plt.subplots(figsize=(3.7, 4.5))
ax.boxplot([SpaGE_osmCorr[osm_seq_genes],SpaGE_seqCorr[osm_seq_genes]],widths=0.5)

y = SpaGE_osmCorr[osm_seq_genes]
x = np.random.normal(1, 0.05, len(y))
plt.plot(x, y, 'g.', alpha=0.2)
y = SpaGE_seqCorr[osm_seq_genes]
x = np.random.normal(2, 0.05, len(y))
plt.plot(x, y, 'g.', alpha=0.2)

plt.xticks((1,2),('osmFISH','seqFISH'),size=12)
plt.yticks(size=8)
plt.gca().set_ylim([-0.5,1])
plt.ylabel('Spearman Correlation',size=12)
#ax.set_aspect(aspect=3)
_,p_val = st.wilcoxon(SpaGE_osmCorr[osm_seq_genes],SpaGE_seqCorr[osm_seq_genes])
plt.text(2,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
plt.show()

--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/SpaGE/Integration.py:
--------------------------------------------------------------------------------
import os
os.chdir('seqFISH_AllenVISp/')

import pickle
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
import time as tm

with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f:
    datadict = pickle.load(f)

seqFISH_data = datadict['seqFISH_data']
seqFISH_data_scaled = datadict['seqFISH_data_scaled']
seqFISH_meta= datadict['seqFISH_meta']
del datadict

with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
    datadict = pickle.load(f)

RNA_data = datadict['RNA_data']
RNA_data_scaled = datadict['RNA_data_scaled']
del datadict

#### Leave One Out Validation ####
Common_data = RNA_data_scaled[np.intersect1d(seqFISH_data_scaled.columns,RNA_data_scaled.columns)]
Imp_Genes = pd.DataFrame(columns=Common_data.columns)
precise_time = []
knn_time = []
for i in Common_data.columns:
    print(i)
    start = tm.time()
    from principal_vectors import PVComputation

    n_factors = 50
    n_pv = 50
    dim_reduction = 'pca'
    dim_reduction_target = 'pca'

    pv_FISH_RNA = PVComputation(
            n_factors = n_factors,
            n_pv = n_pv,
            dim_reduction = dim_reduction,
            dim_reduction_target = dim_reduction_target
    )

    # Fit on all shared genes except the held-out gene i
    pv_FISH_RNA.fit(Common_data.drop(i,axis=1),seqFISH_data_scaled[Common_data.columns].drop(i,axis=1))

    S = pv_FISH_RNA.source_components_.T

    # Keep only principal vectors with cosine similarity above 0.3
    Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
    S = S[:,0:Effective_n_pv]

    Common_data_t = Common_data.drop(i,axis=1).dot(S)
    FISH_exp_t = seqFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S)
    precise_time.append(tm.time()-start)

    start = tm.time()
    nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t)
    distances, indices = nbrs.kneighbors(FISH_exp_t)

    Imp_Gene = np.zeros(seqFISH_data.shape[0])
    for j in range(0,seqFISH_data.shape[0]):
        # Weight the neighbours with cosine distance < 1 by their relative distance
        weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
        weights = weights/(len(weights)-1)
        Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights))
    Imp_Gene[np.isnan(Imp_Gene)] = 0
    Imp_Genes[i] = Imp_Gene
    knn_time.append(tm.time()-start)

Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv')
precise_time = pd.DataFrame(precise_time)
knn_time = pd.DataFrame(knn_time)
precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False)
knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False)

--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/SpaGE/Precise_output.py:
--------------------------------------------------------------------------------
import os
os.chdir('seqFISH_AllenVISp')

import pickle
import numpy as np
import pandas as pd
import matplotlib
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['ps.fonttype'] = 42
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
import sys
sys.path.insert(1,'Scripts/SpaGE/')
from principal_vectors import PVComputation

with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f:
    datadict = pickle.load(f)

seqFISH_data = datadict['seqFISH_data']
seqFISH_data_scaled = datadict['seqFISH_data_scaled']
seqFISH_meta= datadict['seqFISH_meta']
del datadict

with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
    datadict = pickle.load(f)

RNA_data = datadict['RNA_data']
RNA_data_scaled = datadict['RNA_data_scaled']
del datadict

Common_data = RNA_data_scaled[np.intersect1d(seqFISH_data_scaled.columns,RNA_data_scaled.columns)]

n_factors = 50
n_pv = 50
n_pv_display = 50
dim_reduction = 'pca'
dim_reduction_target = 'pca'

pv_FISH_RNA = PVComputation(
        n_factors = n_factors,
        n_pv = n_pv,
        dim_reduction = dim_reduction,
        dim_reduction_target = dim_reduction_target
)

pv_FISH_RNA.fit(Common_data,seqFISH_data_scaled[Common_data.columns])

# Cosine similarity between the initial factors of the two datasets
fig = plt.figure()
sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
            center=0, vmax=1., vmin=0)
plt.xlabel('seqFISH',fontsize=18, color='black')
plt.ylabel('Allen_VISp',fontsize=18, color='black')
plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
plt.gca().set_ylim([n_pv_display,0])
plt.show()

# Cosine similarity between the principal vectors, annotated on the diagonal
plt.figure()
sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
            center=0, vmax=1., vmin=0)
for i in range(n_pv_display-1):
    plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black')

plt.xlabel('seqFISH',fontsize=18, color='black')
plt.ylabel('Allen_VISp',fontsize=18, color='black')
plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
plt.gca().set_ylim([n_pv_display,0])
plt.show()

# Gene importance: sum of squared loadings across the source principal vectors
Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns)
Importance.sort_values(ascending=False,inplace=True)
Importance.index[0:50]

### Technology specific Processes
Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)

# explained variance RNA
np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
# explained variance spatial
np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100

--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
""" Dimensionality Reduction
@author: Soufiane Mourragui
This module extracts the domain-specific factors from the high-dimensional omics
dataset. Several methods are implemented here and can be called directly
by their string name from the main method. All the methods
use the scikit-learn implementation.
Notes
-------
    -

References
-------
    [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
    Journal of Machine Learning Research
"""

import numpy as np
from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
from sklearn.cross_decomposition import PLSRegression


def process_dim_reduction(method='pca', n_dim=10):
    """
    Default linear dimensionality reduction method. For each method, return a
    BaseEstimator instance corresponding to the method given as input.
    Attributes
    -------
    method: str, default to 'pca'
        Method used for dimensionality reduction.
        Implemented: 'pca', 'ica', 'fa' (Factor Analysis),
        'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA),
        'pls' (Partial Least Squares).

    n_dim: int, default to 10
        Number of domain-specific factors to compute.
    Return values
    -------
    Classifier, i.e. BaseEstimator instance
    """

    if method.lower() == 'pca':
        clf = PCA(n_components=n_dim)

    elif method.lower() == 'ica':
        print('ICA')
        clf = FastICA(n_components=n_dim)

    elif method.lower() == 'fa':
        clf = FactorAnalysis(n_components=n_dim)

    elif method.lower() == 'nmf':
        clf = NMF(n_components=n_dim)

    elif method.lower() == 'sparsepca':
        clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)

    elif method.lower() == 'pls':
        clf = PLS(n_components=n_dim)

    else:
        raise NameError('%s is not an implemented method'%(method))

    return clf


class PLS():
    """
    Wrap PLSRegression so that it exposes the same interface as the other
    dimensionality reduction methods.
    (Simple class rewriting).
    """
    def __init__(self, n_components=10):
        self.clf = PLSRegression(n_components=n_components)

    def get_components_(self):
        return self.clf.x_weights_.transpose()

    def set_components_(self, x):
        pass

    components_ = property(get_components_, set_components_)

    def fit(self, X, y):
        self.clf.fit(X,y)
        return self

    def transform(self, X):
        return self.clf.transform(X)

    def predict(self, X):
        return self.clf.predict(X)
--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/SpaGE/seqFISH_data_preprocessing.py:
--------------------------------------------------------------------------------
import os
os.chdir('seqFISH_AllenVISp/')

import numpy as np
import pandas as pd
import scipy.stats as st
import pickle

seqFISH_data = pd.read_csv('data/seqFISH/sourcedata/cortex_svz_counts.csv',header=0)
seqFISH_meta = pd.read_csv('data/seqFISH/sourcedata/cortex_svz_cellcentroids.csv',header=0)

# Keep only the cells located in the cortex
seqFISH_data = seqFISH_data.iloc[np.where(seqFISH_meta['Region'] == 'Cortex')[0],:]
seqFISH_meta = seqFISH_meta.iloc[np.where(seqFISH_meta['Region'] == 'Cortex')[0],:]

# Genes as rows, cells as columns
seqFISH_data = seqFISH_data.T

cell_count = np.sum(seqFISH_data,axis=0)
def Log_Norm(x):
    # Scale each cell to the median total count, then log-transform
    return np.log(((x/np.sum(x))*np.median(cell_count)) + 1)

seqFISH_data = seqFISH_data.apply(Log_Norm,axis=0)
# Z-score each gene across cells (cells as rows, genes as columns)
seqFISH_data_scaled = pd.DataFrame(data=st.zscore(seqFISH_data.T),index = seqFISH_data.columns,columns=seqFISH_data.index)

datadict = dict()
datadict['seqFISH_data'] = seqFISH_data.T
datadict['seqFISH_data_scaled'] = seqFISH_data_scaled
datadict['seqFISH_meta'] = seqFISH_meta

with open('data/SpaGE_pkl/seqFISH_Cortex.pkl','wb') as f:
    pickle.dump(datadict, f)

--------------------------------------------------------------------------------
/dimensionality_reduction.py:
--------------------------------------------------------------------------------
""" Dimensionality Reduction
@author: Soufiane Mourragui
This module extracts the domain-specific factors from the high-dimensional omics
dataset. Several methods are implemented here and can be called directly
by their string name from the main method. All the methods
use the scikit-learn implementation.
Notes
-------
    -

References
-------
    [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
    Journal of Machine Learning Research
"""

import numpy as np
from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
from sklearn.cross_decomposition import PLSRegression


def process_dim_reduction(method='pca', n_dim=10):
    """
    Default linear dimensionality reduction method. For each method, return a
    BaseEstimator instance corresponding to the method given as input.
    Attributes
    -------
    method: str, default to 'pca'
        Method used for dimensionality reduction.
        Implemented: 'pca', 'ica', 'fa' (Factor Analysis),
        'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA),
        'pls' (Partial Least Squares).

    n_dim: int, default to 10
        Number of domain-specific factors to compute.
    Return values
    -------
    Classifier, i.e. BaseEstimator instance
    """

    if method.lower() == 'pca':
        clf = PCA(n_components=n_dim)

    elif method.lower() == 'ica':
        print('ICA')
        clf = FastICA(n_components=n_dim)

    elif method.lower() == 'fa':
        clf = FactorAnalysis(n_components=n_dim)

    elif method.lower() == 'nmf':
        clf = NMF(n_components=n_dim)

    elif method.lower() == 'sparsepca':
        clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)

    elif method.lower() == 'pls':
        clf = PLS(n_components=n_dim)

    else:
        raise NameError('%s is not an implemented method'%(method))

    return clf


class PLS():
    """
    Wrap PLSRegression so that it exposes the same interface as the other
    dimensionality reduction methods.
    (Simple class rewriting).
    """
    def __init__(self, n_components=10):
        self.clf = PLSRegression(n_components=n_components)

    def get_components_(self):
        return self.clf.x_weights_.transpose()

    def set_components_(self, x):
        pass

    components_ = property(get_components_, set_components_)

    def fit(self, X, y):
        self.clf.fit(X,y)
        return self

    def transform(self, X):
        return self.clf.transform(X)

    def predict(self, X):
        return self.clf.predict(X)
--------------------------------------------------------------------------------