├── LICENSE
├── New_predicted_genes.png
├── README.md
├── SpaGE
│   ├── dimensionality_reduction.py
│   ├── main.py
│   └── principal_vectors.py
├── SpaGE_Tutorial.ipynb
├── benchmark
│   ├── MERFISH_Moffit
│   │   ├── Liger
│   │   │   └── LIGER.R
│   │   ├── Performance_evaluation.py
│   │   ├── Seurat
│   │   │   ├── MERFISH.R
│   │   │   ├── MERFISH_integration.R
│   │   │   └── Moffit_RNA.R
│   │   ├── SpaGE
│   │   │   ├── Integration.py
│   │   │   ├── MERFISH_data_preprocessing.py
│   │   │   ├── Moffit_RNA_preprocessing.py
│   │   │   ├── Precise_output.py
│   │   │   ├── dimensionality_reduction.py
│   │   │   └── principal_vectors.py
│   │   └── gimVI
│   │       └── gimVI.py
│   ├── Performance_evaluation_osmFISH.py
│   ├── Precise_output_all.py
│   ├── STARmap_AllenVISp
│   │   ├── DownSampling
│   │   │   ├── DownSampling.py
│   │   │   └── DownSampling_evaluation.py
│   │   ├── Liger
│   │   │   └── LIGER.R
│   │   ├── Performance_evaluation.py
│   │   ├── Seurat
│   │   │   ├── allen_brain.R
│   │   │   ├── impute_starmap.R
│   │   │   ├── starmap.R
│   │   │   └── starmap_integration.R
│   │   ├── SpaGE
│   │   │   ├── Allen_data_preprocessing.py
│   │   │   ├── Integration.py
│   │   │   ├── Precise_output.py
│   │   │   ├── Starmap_data_preprocessing.py
│   │   │   ├── dimensionality_reduction.py
│   │   │   ├── principal_vectors.py
│   │   │   └── viz.py
│   │   ├── Starmap_plots.R
│   │   └── gimVI
│   │       └── gimVI.py
│   ├── Timing_Evaluation.py
│   ├── osmFISH_AllenSSp
│   │   ├── Liger
│   │   │   └── LIGER.R
│   │   ├── Performance_evaluation.py
│   │   ├── Seurat
│   │   │   ├── allen_brain.R
│   │   │   ├── impute_osmFISH.R
│   │   │   ├── osmFISH.R
│   │   │   └── osmFISH_integration.R
│   │   ├── SpaGE
│   │   │   ├── Allen_data_preprocessing.py
│   │   │   ├── Integration.py
│   │   │   ├── Precise_output.py
│   │   │   ├── dimensionality_reduction.py
│   │   │   ├── osmFISH_data_preprocessing.py
│   │   │   └── principal_vectors.py
│   │   └── gimVI
│   │       └── gimVI.py
│   ├── osmFISH_AllenVISp
│   │   ├── Liger
│   │   │   └── LIGER.R
│   │   ├── Performance_evaluation.py
│   │   ├── Seurat
│   │   │   ├── allen_brain.R
│   │   │   ├── impute_osmFISH.R
│   │   │   ├── osmFISH.R
│   │   │   └── osmFISH_integration.R
│   │   ├── SpaGE
│   │   │   ├── Allen_data_preprocessing.py
│   │   │   ├── Integration.py
│   │   │   ├── Precise_output.py
│   │   │   ├── dimensionality_reduction.py
│   │   │   ├── osmFISH_data_preprocessing.py
│   │   │   └── principal_vectors.py
│   │   └── gimVI
│   │       └── gimVI.py
│   ├── osmFISH_Ziesel
│   │   ├── Liger
│   │   │   └── LIGER.R
│   │   ├── Performance_evaluation.py
│   │   ├── Seurat
│   │   │   ├── Zeisel_Cortex.R
│   │   │   ├── impute_osmFISH.R
│   │   │   ├── osmFISH.R
│   │   │   └── osmFISH_integration.R
│   │   ├── SpaGE
│   │   │   ├── Integration.py
│   │   │   ├── Linnarson_data_preprocessing.py
│   │   │   ├── Precise_output.py
│   │   │   ├── dimensionality_reduction.py
│   │   │   └── principal_vectors.py
│   │   └── gimVI
│   │       └── gimVI.py
│   └── seqFISH_AllenVISp
│       ├── DownSampling
│       │   ├── DownSampling.py
│       │   └── DownSampling_evaluation.py
│       ├── Performance_evaluation.py
│       └── SpaGE
│           ├── Integration.py
│           ├── Precise_output.py
│           ├── dimensionality_reduction.py
│           ├── principal_vectors.py
│           └── seqFISH_data_preprocessing.py
├── dimensionality_reduction.py
└── principal_vectors.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 tabdelaal
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/New_predicted_genes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tabdelaal/SpaGE/edd0fb7086eca791c6589020893eb4f034b195c7/New_predicted_genes.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SpaGE
2 | ## Predicting whole-transcriptome expression of spatial transcriptomics data through integration with scRNA-seq data
3 |
4 | ### Implementation description
5 |
6 | The Python implementation is in the 'SpaGE' folder. The ```SpaGE``` function takes as input i) two single-cell datasets, spatial transcriptomics and scRNA-seq, ii) the number of principal vectors *(PVs)*, and iii) optionally, the set of genes unmeasured in the spatial data for which predictions are obtained from the scRNA-seq data. The function returns the predicted expression of these unmeasured genes across all spatial cells.
7 |
8 | For a full description, see the ```SpaGE``` function docstring in ```main.py```.
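
As a rough illustration of the expected inputs, the sketch below builds two toy DataFrames (cells × genes) and derives the shared genes and the default prediction targets the same way `main.py` does. The gene names, matrix sizes and random values are made up for illustration, and the final call is left commented out because it assumes the `SpaGE` package is importable from the repository root.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy inputs (hypothetical gene names and sizes); rows are cells, columns are genes.
spatial_data = pd.DataFrame(rng.random((5, 3)), columns=["GeneA", "GeneB", "GeneC"])
rna_data = pd.DataFrame(rng.random((8, 5)),
                        columns=["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"])

# Genes measured in both datasets are the ones used for the alignment.
shared_genes = np.intersect1d(spatial_data.columns, rna_data.columns)

# Default prediction targets: genes present only in the scRNA-seq data.
genes_to_predict = np.setdiff1d(rna_data.columns, spatial_data.columns)

# n_pv should not exceed the number of shared genes.
n_pv = min(2, len(shared_genes))

print(list(shared_genes), list(genes_to_predict), n_pv)

# With the repository on the Python path, the call would look like:
# from SpaGE.main import SpaGE
# imputed = SpaGE(spatial_data, rna_data, n_pv=n_pv, genes_to_predict=genes_to_predict)
# `imputed` would have one row per spatial cell and one column per predicted gene.
```

Both inputs are expected to be normalized already; per the function docstring, z-scoring is done inside `SpaGE`.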
9 |
10 | ### Tutorial
11 |
12 | The ```SpaGE_Tutorial``` notebook is a step-by-step example showing how to validate SpaGE on the spatially measured genes, and how to use SpaGE to predict new spatial gene patterns.
13 |
14 |
15 |
16 |
17 |
18 | ### Experiments code description
19 |
20 | The 'benchmark' folder contains the scripts to reproduce the results shown in the pre-print. The benchmark folder has five subfolders, each corresponding to one dataset pair and containing the scripts to run SpaGE, Seurat-v3, Liger and gimVI. Additionally, we provide evaluation scripts to calculate and compare the performance of all four methods, and to reproduce all the figures in the paper.
21 |
22 | ### Datasets
23 |
24 | All datasets used are publicly available. For convenience, they can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.3967291).
25 |
26 | For citation and further information please refer to:
27 | "SpaGE: Spatial Gene Enhancement using scRNA-seq", [NAR](https://academic.oup.com/nar/article/48/18/e107/5909530)
28 |
--------------------------------------------------------------------------------
/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and can be selected by
5 | their string name in the main method. All the methods
6 | use the scikit-learn implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 |     """
24 |     Default linear dimensionality reduction method. For each method, return a
25 |     BaseEstimator instance corresponding to the method given as input.
26 |     Attributes
27 |     -------
28 |     method: str, default to 'pca'
29 |         Method used for dimensionality reduction.
30 |         Implemented: 'pca', 'ica', 'fa' (Factor Analysis),
31 |         'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA), 'pls'.
32 |
33 |     n_dim: int, default to 10
34 |         Number of domain-specific factors to compute.
35 |     Return values
36 |     -------
37 |     Classifier, i.e. BaseEstimator instance
38 |     """
39 |
40 |     if method.lower() == 'pca':
41 |         clf = PCA(n_components=n_dim)
42 |
43 |     elif method.lower() == 'ica':
44 |         print('ICA')
45 |         clf = FastICA(n_components=n_dim)
46 |
47 |     elif method.lower() == 'fa':
48 |         clf = FactorAnalysis(n_components=n_dim)
49 |
50 |     elif method.lower() == 'nmf':
51 |         clf = NMF(n_components=n_dim)
52 |
53 |     elif method.lower() == 'sparsepca':
54 |         clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 |     elif method.lower() == 'pls':
57 |         clf = PLS(n_components=n_dim)
58 |
59 |     else:
60 |         raise NameError('%s is not an implemented method'%(method))
61 |
62 |     return clf
63 |
64 |
65 | class PLS():
66 |     """
67 |     Implement PLS to make it compliant with the other dimensionality
68 |     reduction methodology.
69 |     (Simple class rewriting).
70 |     """
71 |     def __init__(self, n_components=10):
72 |         self.clf = PLSRegression(n_components)
73 |
74 |     def get_components_(self):
75 |         return self.clf.x_weights_.transpose()
76 |
77 |     def set_components_(self, x):
78 |         pass
79 |
80 |     components_ = property(get_components_, set_components_)
81 |
82 |     def fit(self, X, y):
83 |         self.clf.fit(X,y)
84 |         return self
85 |
86 |     def transform(self, X):
87 |         return self.clf.transform(X)
88 |
89 |     def predict(self, X):
90 |         return self.clf.predict(X)
--------------------------------------------------------------------------------
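Downstream code relies on the estimator returned by `process_dim_reduction` exposing a `components_` attribute after fitting. The following minimal sketch, which assumes scikit-learn is installed and uses random toy data, shows what that attribute looks like for the default `'pca'` branch. It constructs `PCA` directly rather than importing the module above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))      # toy data: 20 samples, 6 features

n_dim = 3
clf = PCA(n_components=n_dim)     # same estimator the 'pca' branch constructs
clf.fit(X)

# One row per factor, one column per input feature.
factors = clf.components_
print(factors.shape)

# PCA factors are orthonormal, so this Gram matrix is close to the identity.
gram = factors @ factors.T
```

The other branches return estimators with the same `components_` layout, and the `PLS` wrapper emulates it through a property.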
/SpaGE/main.py:
--------------------------------------------------------------------------------
1 | """ SpaGE [1]
2 | @author: Tamim Abdelaal
3 | This function integrates two single-cell datasets, spatial and scRNA-seq, and
4 | enhances the spatial data by predicting the expression of the spatially
5 | unmeasured genes from the scRNA-seq data.
6 | The integration is performed using the domain adaptation method PRECISE [2].
7 |
8 | References
9 | -------
10 | [1] Abdelaal T., Mourragui S., Mahfouz A., Reinders M.J.T. (2020)
11 | SpaGE: Spatial Gene Enhancement using scRNA-seq
12 | [2] Mourragui S., Loog M., Reinders M.J.T., Wessels L.F.A. (2019)
13 | PRECISE: A domain adaptation approach to transfer predictors of drug response
14 | from pre-clinical models to tumors
15 | """
16 |
17 | import numpy as np
18 | import pandas as pd
19 | import scipy.stats as st
20 | from sklearn.neighbors import NearestNeighbors
21 | from SpaGE.principal_vectors import PVComputation
22 |
23 | def SpaGE(Spatial_data,RNA_data,n_pv,genes_to_predict=None):
24 |     """
25 |     @author: Tamim Abdelaal
26 |     This function integrates two single-cell datasets, spatial and scRNA-seq,
27 |     and enhances the spatial data by predicting the expression of the spatially
28 |     unmeasured genes from the scRNA-seq data.
29 |
30 |     Parameters
31 |     -------
32 |     Spatial_data : Dataframe
33 |         Normalized Spatial data matrix (cells X genes).
34 |     RNA_data : Dataframe
35 |         Normalized scRNA-seq data matrix (cells X genes).
36 |     n_pv : int
37 |         Number of principal vectors to find from the independently computed
38 |         principal components, and used to align both datasets. This should
39 |         be <= number of shared genes between the two datasets.
40 |     genes_to_predict : str array
41 |         List of gene names missing from the spatial data, to be predicted
42 |         from the scRNA-seq data. Default is the set of genes (columns)
43 |         present in the scRNA-seq data but absent from the spatial data.
44 |
45 |     Return
46 |     -------
47 |     Imp_Genes: Dataframe
48 |         Matrix containing the predicted gene expressions for the spatial
49 |         cells. Rows are equal to the number of spatial data rows (cells),
50 |         and columns are equal to genes_to_predict.
51 |     """
52 |
53 |     if genes_to_predict is None:
54 |         genes_to_predict = np.setdiff1d(RNA_data.columns,Spatial_data.columns)
55 |
56 |     RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data,axis=0),
57 |                                    index = RNA_data.index,columns=RNA_data.columns)
58 |     Spatial_data_scaled = pd.DataFrame(data=st.zscore(Spatial_data,axis=0),
59 |                                        index = Spatial_data.index,columns=Spatial_data.columns)
60 |     Common_data = RNA_data_scaled[np.intersect1d(Spatial_data_scaled.columns,RNA_data_scaled.columns)]
61 |
62 |     Y_train = RNA_data[genes_to_predict]
63 |
64 |     Imp_Genes = pd.DataFrame(np.zeros((Spatial_data.shape[0],len(genes_to_predict))),
65 |                              columns=genes_to_predict)
66 |
67 |     pv_Spatial_RNA = PVComputation(
68 |             n_factors = n_pv,
69 |             n_pv = n_pv,
70 |             dim_reduction = 'pca',
71 |             dim_reduction_target = 'pca'
72 |     )
73 |
74 |     pv_Spatial_RNA.fit(Common_data,Spatial_data_scaled[Common_data.columns])
75 |
76 |     S = pv_Spatial_RNA.source_components_.T
77 |
78 |     Effective_n_pv = sum(np.diag(pv_Spatial_RNA.cosine_similarity_matrix_) > 0.3)
79 |     S = S[:,0:Effective_n_pv]
80 |
81 |     Common_data_projected = Common_data.dot(S)
82 |     Spatial_data_projected = Spatial_data_scaled[Common_data.columns].dot(S)
83 |
84 |     nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',
85 |                             metric = 'cosine').fit(Common_data_projected)
86 |     distances, indices = nbrs.kneighbors(Spatial_data_projected)
87 |
88 |     for j in range(0,Spatial_data.shape[0]):
89 |
90 |         weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
91 |         weights = weights/(len(weights)-1)
92 |         Imp_Genes.iloc[j,:] = np.dot(weights,Y_train.iloc[indices[j,:][distances[j,:] < 1]])
93 |
94 |     return Imp_Genes
95 |
--------------------------------------------------------------------------------
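The neighbor weighting at the end of `SpaGE` can be checked in isolation. The following is a minimal sketch, not the repository's code: `neighbor_weights` is a hypothetical helper, the distances and expression values are invented, and the guard for fewer than two usable neighbors is an addition that `main.py` does not have.

```python
import numpy as np

def neighbor_weights(distances):
    """Weights for one spatial cell from its cosine distances to scRNA-seq neighbors."""
    d = np.asarray(distances, dtype=float)
    d = d[d < 1]                       # keep neighbors with positive cosine similarity
    if d.size == 0:
        return d                       # no usable neighbor
    if d.size == 1 or d.sum() == 0:
        # Guard added for this sketch: avoids dividing by zero below.
        return np.full(d.size, 1.0 / d.size)
    w = 1 - d / d.sum()                # closer neighbors get larger weights
    return w / (d.size - 1)            # raw weights sum to k - 1, so this normalizes to 1

# Four neighbors; the last one has distance >= 1 and is discarded.
w = neighbor_weights([0.1, 0.2, 0.7, 1.3])

# Expression of one gene in the three retained neighbors (made-up values).
expr = np.array([2.0, 4.0, 10.0])
imputed = np.dot(w, expr)
print(w, imputed)
```

Because the raw weights `1 - d_i / sum(d)` add up to `k - 1` for `k` retained neighbors, dividing by `k - 1` makes the imputed value a weighted average of the neighbors' expression.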
/benchmark/MERFISH_Moffit/Liger/LIGER.R:
--------------------------------------------------------------------------------
1 | setwd("MERFISH_Moffit/")
2 | library(liger)
3 | library(Seurat)
4 | library(ggplot2)
5 |
6 | # Moffit RNA
7 | Moffit <- Read10X("data/Moffit_RNA/GSE113576/")
8 |
9 | Moffit <- as.matrix(Moffit)
10 | Genes_count = rowSums(Moffit > 0)
11 | Moffit <- Moffit[Genes_count>=10,]
12 |
13 | # MERFISH
14 | MERFISH <- read.csv(file = "data/MERFISH/Moffitt_and_Bambah-Mukku_et_al_merfish_all_cells.csv", header = TRUE)
15 | MERFISH_1 <- MERFISH[MERFISH['Animal_ID']==1,]
16 | MERFISH_1 <- MERFISH_1[MERFISH_1['Cell_class']!='Ambiguous',]
17 | MERFISH_meta <- MERFISH_1[,c(1:9)]
18 | MERFISH_data <- MERFISH_1[,c(10:170)]
19 | drops <- c('Blank_1','Blank_2','Blank_3','Blank_4','Blank_5','Fos')
20 | MERFISH_data <- MERFISH_data[ , !(colnames(MERFISH_data) %in% drops)]
21 | MERFISH_data <- t(MERFISH_data)
22 |
23 | Gene_set <- intersect(rownames(MERFISH_data),rownames(Moffit))
24 |
25 | #### New genes prediction
26 | Ligerex <- createLiger(list(MERFISH = MERFISH_data, Moffit_RNA = Moffit))
27 | Ligerex <- normalize(Ligerex)
28 | Ligerex@var.genes <- Gene_set
29 | Ligerex <- scaleNotCenter(Ligerex)
30 |
31 | # suggestK(Ligerex) # K = 25
32 | # suggestLambda(Ligerex, k = 25)
33 |
34 | Ligerex <- optimizeALS(Ligerex,k = 25)
35 | Ligerex <- quantileAlignSNF(Ligerex)
36 |
37 | # leave-one-out validation
38 | genes.leaveout <- Gene_set
39 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(MERFISH_data)[2])
40 | rownames(Imp_genes) <- genes.leaveout
41 | colnames(Imp_genes) <- colnames(MERFISH_data)
42 | NMF_time <- vector(mode= "numeric")
43 | knn_time <- vector(mode= "numeric")
44 |
45 | for(i in 1:length(genes.leaveout)) {
46 | print(i)
47 | start_time <- Sys.time()
48 | Ligerex.leaveout <- createLiger(list(MERFISH = MERFISH_data[-which(rownames(MERFISH_data) %in% genes.leaveout[i]),], Moffit_RNA = Moffit))
49 | Ligerex.leaveout <- normalize(Ligerex.leaveout)
50 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i])
51 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout)
52 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 25)
53 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout)
54 | end_time <- Sys.time()
55 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
56 |
57 | start_time <- Sys.time()
58 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'Moffit_RNA', queries = list('MERFISH'), norm = TRUE, scale = FALSE)
59 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$MERFISH[genes.leaveout[i],])
60 | end_time <- Sys.time()
61 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
62 | }
63 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv')
64 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE)
65 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE)
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/Performance_evaluation.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('MERFISH_Moffit/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import pickle
7 | import matplotlib
8 | matplotlib.use('qt5agg')
9 | matplotlib.rcParams['pdf.fonttype'] = 42
10 | matplotlib.rcParams['ps.fonttype'] = 42
11 | import matplotlib.pyplot as plt
12 | import scipy.stats as st
13 |
14 | with open ('data/SpaGE_pkl/MERFISH.pkl', 'rb') as f:
15 |     datadict = pickle.load(f)
16 |
17 | MERFISH_data = datadict['MERFISH_data']
18 | del datadict
19 |
20 | with open ('data/SpaGE_pkl/Moffit_RNA.pkl', 'rb') as f:
21 |     datadict = pickle.load(f)
22 |
23 | RNA_data = datadict['RNA_data']
24 | del datadict
25 |
26 | Gene_Order = np.intersect1d(MERFISH_data.columns,RNA_data.columns)
27 |
28 | ### SpaGE
29 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut.csv',header=0,index_col=0,sep=',')
30 |
31 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]
32 |
33 | SpaGE_Corr = pd.Series(index = Gene_Order, dtype = float)
34 | for i in Gene_Order:
35 |     SpaGE_Corr[i] = st.spearmanr(MERFISH_data[i],SpaGE_imputed[i])[0]
36 |
37 | ### gimVI
38 | gimVI_imputed = pd.read_csv('Results/gimVI_LeaveOneOut.csv',header=0,index_col=0,sep=',')
39 | gimVI_imputed = gimVI_imputed.drop(columns='AVPR2')
40 |
41 | gimVI_imputed = gimVI_imputed.loc[:,[x.upper() for x in np.array(Gene_Order,dtype='str')]]
42 |
43 | gimVI_Corr = pd.Series(index = Gene_Order, dtype = float)
44 | for i in Gene_Order:
45 |     gimVI_Corr[i] = st.spearmanr(MERFISH_data[i],gimVI_imputed[str(i).upper()])[0]
46 | gimVI_Corr[np.isnan(gimVI_Corr)] = 0
47 |
48 |
49 | ### Seurat
50 | Seurat_imputed = pd.read_csv('Results/Seurat_LeaveOneOut.csv',header=0,index_col=0,sep=',').T
51 |
52 | Seurat_imputed = Seurat_imputed.loc[:,Gene_Order]
53 |
54 | Seurat_Corr = pd.Series(index = Gene_Order, dtype = float)
55 | for i in Gene_Order:
56 |     Seurat_Corr[i] = st.spearmanr(MERFISH_data[i],Seurat_imputed[i])[0]
57 |
58 | ### Liger
59 | Liger_imputed = pd.read_csv('Results/Liger_LeaveOneOut.csv',header=0,index_col=0,sep=',').T
60 |
61 | Liger_imputed = Liger_imputed.loc[:,Gene_Order]
62 |
63 | Liger_Corr = pd.Series(index = Gene_Order, dtype = float)
64 | for i in Gene_Order:
65 |     Liger_Corr[i] = st.spearmanr(MERFISH_data[i],Liger_imputed[i])[0]
66 | Liger_Corr[np.isnan(Liger_Corr)] = 0
67 |
68 | ### Comparison plots
69 | plt.style.use('ggplot')
70 | fig, ax = plt.subplots(figsize=(3.7, 5.5))
71 | ax.boxplot([SpaGE_Corr,Seurat_Corr, Liger_Corr,gimVI_Corr])
72 |
73 | y = SpaGE_Corr
74 | x = np.random.normal(1, 0.05, len(y))
75 | plt.plot(x, y, 'g.', markersize=3, alpha=0.2)
76 | y = Seurat_Corr
77 | x = np.random.normal(2, 0.05, len(y))
78 | plt.plot(x, y, 'g.', markersize=3, alpha=0.2)
79 | y = Liger_Corr
80 | x = np.random.normal(3, 0.05, len(y))
81 | plt.plot(x, y, 'g.', markersize=3, alpha=0.2)
82 | y = gimVI_Corr
83 | x = np.random.normal(4, 0.05, len(y))
84 | plt.plot(x, y, 'g.', markersize=3, alpha=0.2)
85 |
86 | plt.xticks((1,2,3,4),('SpaGE', 'Seurat', 'Liger','gimVI'),size=12)
87 | plt.yticks(size=8)
88 | plt.gca().set_ylim([-0.5,1])
89 | plt.ylabel('Spearman Correlation',size=12)
90 | ax.set_aspect(aspect=3)
91 | _,p_val = st.wilcoxon(SpaGE_Corr,Seurat_Corr)
92 | plt.text(2,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
93 | _,p_val = st.wilcoxon(SpaGE_Corr,Liger_Corr)
94 | plt.text(3,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
95 | _,p_val = st.wilcoxon(SpaGE_Corr,gimVI_Corr)
96 | plt.text(4,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
97 | plt.show()
98 |
99 | def Compare_Correlations(X,Y):
100 |     fig, ax = plt.subplots(figsize=(4.5, 4.5))
101 |     ax.scatter(X, Y, s=5)
102 |     ax.axvline(linestyle='--',color='gray')
103 |     ax.axhline(linestyle='--',color='gray')
104 |     plt.gca().set_ylim([-0.5,1])
105 |     lims = [np.min([ax.get_xlim(), ax.get_ylim()]),
106 |             np.max([ax.get_xlim(), ax.get_ylim()])]
107 |     ax.plot(lims, lims, 'k-')
108 |     ax.set_aspect('equal')
109 |     ax.set_xlim(lims)
110 |     ax.set_ylim(lims)
111 |     plt.xticks(size=8)
112 |     plt.yticks(size=8)
113 |     # plt.show() is left to the caller so that axis labels can be set first
114 |
115 |
116 | Compare_Correlations(Seurat_Corr,SpaGE_Corr)
117 | plt.xlabel('Spearman Correlation Seurat',size=12)
118 | plt.ylabel('Spearman Correlation SpaGE',size=12)
119 | plt.show()
120 |
121 | Compare_Correlations(Liger_Corr,SpaGE_Corr)
122 | plt.xlabel('Spearman Correlation Liger',size=12)
123 | plt.ylabel('Spearman Correlation SpaGE',size=12)
124 | plt.show()
125 |
126 | Compare_Correlations(gimVI_Corr,SpaGE_Corr)
127 | plt.xlabel('Spearman Correlation gimVI',size=12)
128 | plt.ylabel('Spearman Correlation SpaGE',size=12)
129 | plt.show()
130 |
--------------------------------------------------------------------------------
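The per-gene evaluation in the script above boils down to one Spearman correlation per shared gene, with undefined correlations set to zero as is done there for gimVI and Liger. The following minimal sketch uses invented gene names and values and assumes SciPy and pandas are installed. It builds the result Series from a dict with an explicit float dtype.

```python
import pandas as pd
import scipy.stats as st

# Toy measured and imputed expression (cells x genes); values are made up.
measured = pd.DataFrame({"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 4], "g3": [1, 2, 3, 4]})
imputed = pd.DataFrame({"g1": [10, 20, 30, 40], "g2": [4, 3, 2, 1], "g3": [0, 0, 0, 0]})

# One Spearman correlation per gene; index 0 of the result is the coefficient.
corr = pd.Series(
    {g: st.spearmanr(measured[g], imputed[g])[0] for g in measured.columns},
    dtype=float,
)

# A constant prediction (g3) has no defined rank correlation and yields NaN;
# treat it as 0.
corr = corr.fillna(0)
print(corr)
```

SciPy may emit a warning for the constant column; that is expected here and is the case the `fillna(0)` step handles.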
/benchmark/MERFISH_Moffit/Seurat/MERFISH.R:
--------------------------------------------------------------------------------
1 | setwd("MERFISH_Moffit/")
2 | library(Seurat)
3 | library(Matrix)
4 |
5 | MERFISH <- read.csv(file = "data/MERFISH/Moffitt_and_Bambah-Mukku_et_al_merfish_all_cells.csv", header = TRUE)
6 | MERFISH_1 <- MERFISH[MERFISH['Animal_ID']==1,]
7 | MERFISH_1 <- MERFISH_1[MERFISH_1['Cell_class']!='Ambiguous',]
8 | MERFISH_meta <- MERFISH_1[,c(1:9)]
9 | MERFISH_data <- MERFISH_1[,c(10:170)]
10 | drops <- c('Blank_1','Blank_2','Blank_3','Blank_4','Blank_5','Fos')
11 | MERFISH_data <- MERFISH_data[ , !(colnames(MERFISH_data) %in% drops)]
12 | MERFISH_data <- t(MERFISH_data)
13 |
14 | MERFISH_seurat <- CreateSeuratObject(counts = MERFISH_data, project = 'MERFISH', assay = 'RNA', meta.data = MERFISH_meta ,min.cells = -1, min.features = -1)
15 | total.counts = colSums(x = as.matrix(MERFISH_seurat@assays$RNA@counts))
16 | MERFISH_seurat <- NormalizeData(MERFISH_seurat, scale.factor = median(x = total.counts))
17 | MERFISH_seurat <- ScaleData(MERFISH_seurat)
18 | saveRDS(object = MERFISH_seurat, file = 'data/seurat_objects/MERFISH.rds')
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/Seurat/MERFISH_integration.R:
--------------------------------------------------------------------------------
1 | setwd("MERFISH_Moffit/")
2 | library(Seurat)
3 | library(ggplot2)
4 |
5 | MERFISH <- readRDS("data/seurat_objects/MERFISH.rds")
6 | Moffit <- readRDS("data/seurat_objects/Moffit_RNA.rds")
7 |
8 | genes.leaveout <- intersect(rownames(MERFISH),rownames(Moffit))
9 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(MERFISH@assays$RNA)[2])
10 | rownames(Imp_genes) <- genes.leaveout
11 | anchor_time <- vector(mode= "numeric")
12 | Transfer_time <- vector(mode= "numeric")
13 |
14 | run_imputation <- function(ref.obj, query.obj, feature.remove) {
15 | message(paste0('removing ', feature.remove))
16 | features <- setdiff(rownames(query.obj), feature.remove)
17 | DefaultAssay(ref.obj) <- 'RNA'
18 | DefaultAssay(query.obj) <- 'RNA'
19 |
20 | start_time <- Sys.time()
21 | anchors <- FindTransferAnchors(
22 | reference = ref.obj,
23 | query = query.obj,
24 | features = features,
25 | dims = 1:30,
26 | reduction = 'cca'
27 | )
28 | end_time <- Sys.time()
29 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
30 |
31 | refdata <- GetAssayData(
32 | object = ref.obj,
33 | assay = 'RNA',
34 | slot = 'data'
35 | )
36 |
37 | start_time <- Sys.time()
38 | imputation <- TransferData(
39 | anchorset = anchors,
40 | refdata = refdata,
41 | weight.reduction = 'pca'
42 | )
43 | query.obj[['seq']] <- imputation
44 | end_time <- Sys.time()
45 | Transfer_time <<- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
46 | return(query.obj)
47 | }
48 |
49 | for(i in 1:length(genes.leaveout)) {
50 | imputed.ss2 <- run_imputation(ref.obj = Moffit, query.obj = MERFISH, feature.remove = genes.leaveout[[i]])
51 | MERFISH[['ss2']] <- imputed.ss2[, colnames(MERFISH)][['seq']]
52 | Imp_genes[genes.leaveout[[i]],] = as.vector(MERFISH@assays$ss2[genes.leaveout[i],])
53 | }
54 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv')
55 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE)
56 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE)
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/Seurat/Moffit_RNA.R:
--------------------------------------------------------------------------------
1 | setwd("MERFISH_Moffit/")
2 | library(Seurat)
3 |
4 | Moffit <- Read10X("data/Moffit_RNA/GSE113576/")
5 |
6 | Mo <- CreateSeuratObject(counts = Moffit, project = 'POR', min.cells = 10)
7 | Mo <- NormalizeData(object = Mo)
8 | Mo <- FindVariableFeatures(object = Mo, nfeatures = 2000)
9 | Mo <- ScaleData(object = Mo)
10 | Mo <- RunPCA(object = Mo, npcs = 50, verbose = FALSE)
11 | Mo <- RunUMAP(object = Mo, dims = 1:50, n.neighbors = 5)
12 | saveRDS(object = Mo, file = paste0("data/seurat_objects/","Moffit_RNA.rds"))
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/SpaGE/Integration.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('MERFISH_Moffit/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.neighbors import NearestNeighbors
8 | import time as tm
9 |
10 | with open ('data/SpaGE_pkl/MERFISH.pkl', 'rb') as f:
11 |     datadict = pickle.load(f)
12 |
13 | MERFISH_data = datadict['MERFISH_data']
14 | MERFISH_data_scaled = datadict['MERFISH_data_scaled']
15 | MERFISH_meta = datadict['MERFISH_meta']
16 | del datadict
17 |
18 | with open ('data/SpaGE_pkl/Moffit_RNA.pkl', 'rb') as f:
19 |     datadict = pickle.load(f)
20 |
21 | RNA_data = datadict['RNA_data']
22 | RNA_data_scaled = datadict['RNA_data_scaled']
23 | del datadict
24 |
25 | #### Leave One Out Validation ####
26 | Common_data = RNA_data_scaled[np.intersect1d(MERFISH_data_scaled.columns,RNA_data_scaled.columns)]
27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns)
28 | precise_time = []
29 | knn_time = []
30 | for i in Common_data.columns:
31 |     print(i)
32 |     start = tm.time()
33 |     from principal_vectors import PVComputation
34 |
35 |     n_factors = 50
36 |     n_pv = 50
37 |     dim_reduction = 'pca'
38 |     dim_reduction_target = 'pca'
39 |
40 |     pv_FISH_RNA = PVComputation(n_factors = n_factors,n_pv = n_pv,dim_reduction = dim_reduction,dim_reduction_target = dim_reduction_target)
41 |
42 |     pv_FISH_RNA.fit(Common_data.drop(i,axis=1),MERFISH_data_scaled[Common_data.columns].drop(i,axis=1))
43 |
44 |     S = pv_FISH_RNA.source_components_.T
45 |
46 |     Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
47 |     S = S[:,0:Effective_n_pv]
48 |
49 |     Common_data_t = Common_data.drop(i,axis=1).dot(S)
50 |     FISH_exp_t = MERFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S)
51 |     precise_time.append(tm.time()-start)
52 |
53 |     start = tm.time()
54 |     nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t)
55 |     distances, indices = nbrs.kneighbors(FISH_exp_t)
56 |
57 |     Imp_Gene = np.zeros(MERFISH_data.shape[0])
58 |     for j in range(0,MERFISH_data.shape[0]):
59 |         weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
60 |         weights = weights/(len(weights)-1)
61 |         Imp_Gene[j] = np.sum(np.multiply(RNA_data[i].iloc[indices[j,:][distances[j,:] < 1]],weights))
62 |     Imp_Gene[np.isnan(Imp_Gene)] = 0
63 |     Imp_Genes[i] = Imp_Gene
64 |     knn_time.append(tm.time()-start)
65 |
66 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv')
67 | precise_time = pd.DataFrame(precise_time)
68 | knn_time = pd.DataFrame(knn_time)
69 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False)
70 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False)
--------------------------------------------------------------------------------
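Before projecting, `Integration.py` keeps only the principal vectors whose cosine similarity between the two datasets exceeds 0.3. The following minimal sketch reproduces that selection on invented numbers. The component matrix and similarity values are random stand-ins rather than output of `PVComputation`, and taking the first `effective_n_pv` columns assumes the similarities are sorted in decreasing order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 5 principal vectors over 8 genes and a made-up similarity matrix.
source_components = rng.normal(size=(5, 8))            # rows: PVs, columns: genes
cosine_similarity = np.diag([0.95, 0.80, 0.45, 0.20, 0.05])

threshold = 0.3                                         # cutoff used in the script
effective_n_pv = int(np.sum(np.diag(cosine_similarity) > threshold))

# One column per retained principal vector.
S = source_components.T[:, :effective_n_pv]

# Projecting cells (rows) onto the retained vectors gives the aligned space
# in which the nearest-neighbor search is run.
cells = rng.normal(size=(4, 8))                         # 4 cells x 8 genes
projected = cells @ S
print(effective_n_pv, projected.shape)
```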
/benchmark/MERFISH_Moffit/SpaGE/MERFISH_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('MERFISH_Moffit/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 |
9 | MERFISH = pd.read_csv('data/MERFISH/Moffitt_and_Bambah-Mukku_et_al_merfish_all_cells.csv')
10 | #Select 1st replicate, Naive female
11 | MERFISH_1 = MERFISH.loc[MERFISH['Animal_ID']==1,:]
12 |
13 | # Remove cells with an ambiguous class label
14 | MERFISH_1 = MERFISH_1.loc[MERFISH_1['Cell_class']!='Ambiguous',:]
15 | MERFISH_meta = MERFISH_1.iloc[:,0:9]
16 | MERFISH_data = MERFISH_1.iloc[:,9:171]
17 | MERFISH_data = MERFISH_data.drop(columns = ['Blank_1','Blank_2','Blank_3','Blank_4','Blank_5','Fos'])  # remove Blank_1 to Blank_5 and Fos --> 155 genes
18 | del MERFISH, MERFISH_1
19 |
20 | MERFISH_data = MERFISH_data.T
21 |
22 | cell_count = np.sum(MERFISH_data,axis=0)
23 | def Log_Norm(x):
24 |     return np.log(((x/np.sum(x))*np.median(cell_count)) + 1)
25 |
26 | MERFISH_data = MERFISH_data.apply(Log_Norm,axis=0)
27 | MERFISH_data_scaled = pd.DataFrame(data=st.zscore(MERFISH_data.T),index = MERFISH_data.columns,columns=MERFISH_data.index)
28 |
29 | datadict = dict()
30 | datadict['MERFISH_data'] = MERFISH_data.T
31 | datadict['MERFISH_data_scaled'] = MERFISH_data_scaled
32 | datadict['MERFISH_meta'] = MERFISH_meta
33 |
34 | with open('data/SpaGE_pkl/MERFISH.pkl','wb') as f:
35 |     pickle.dump(datadict, f)
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/SpaGE/Moffit_RNA_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('MERFISH_Moffit/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 | import scipy.io as io
9 |
10 | genes = pd.read_csv('data/Moffit_RNA/GSE113576/genes.tsv',sep='\t',header=None)
11 | barcodes = pd.read_csv('data/Moffit_RNA/GSE113576/barcodes.tsv',sep='\t',header=None)
12 |
13 | genes = np.array(genes.loc[:,1])
14 | barcodes = np.array(barcodes.loc[:,0])
15 | RNA_data = io.mmread('data/Moffit_RNA/GSE113576/matrix.mtx')
16 | RNA_data = RNA_data.todense()
17 | RNA_data = pd.DataFrame(RNA_data,index=genes,columns=barcodes)
18 |
19 | # filter lowly expressed genes
20 | Genes_count = np.sum(RNA_data > 0, axis=1)
21 | RNA_data = RNA_data.loc[Genes_count >=10,:]
22 | del Genes_count
23 |
24 | def Log_Norm(x):
25 |     return np.log(((x/np.sum(x))*1000000) + 1)
26 |
27 | RNA_data = RNA_data.apply(Log_Norm,axis=0)
28 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index)
29 |
30 | datadict = dict()
31 | datadict['RNA_data'] = RNA_data.T
32 | datadict['RNA_data_scaled'] = RNA_data_scaled
33 |
34 | with open('data/SpaGE_pkl/Moffit_RNA.pkl','wb') as f:
35 |     pickle.dump(datadict, f, protocol=4)
--------------------------------------------------------------------------------
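The two preprocessing scripts above apply the same log-normalization with different scale factors: one million for the scRNA-seq data and the median total count per cell for MERFISH. The following minimal NumPy sketch shows the formula on a toy genes × cells matrix. `log_norm` and the counts are illustrative; the scripts themselves apply the formula column by column with `DataFrame.apply`.

```python
import numpy as np

def log_norm(counts, scale_factor):
    """Scale each cell (column) to `scale_factor` total counts, then take log(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)   # total counts per cell
    return np.log(counts / totals * scale_factor + 1)

# Toy count matrix, genes x cells (made-up values); cell totals are 4 and 10.
counts = np.array([[1, 0],
                   [3, 5],
                   [0, 5]])

# scRNA-seq style: counts per million.
cpm = log_norm(counts, 1_000_000)

# MERFISH style: scale to the median total count across cells (7 here).
median_scaled = log_norm(counts, np.median(counts.sum(axis=0)))
print(cpm, median_scaled)
```

Zero counts stay at zero after the transform. A cell with no counts at all would produce NaN, so such cells need to be filtered beforehand.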
/benchmark/MERFISH_Moffit/SpaGE/Precise_output.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('MERFISH_Moffit/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | import matplotlib
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | import seaborn as sns
12 | import sys
13 | sys.path.insert(1,'SpaGE/')
14 | from principal_vectors import PVComputation
15 |
16 | with open ('data/SpaGE_pkl/MERFISH.pkl', 'rb') as f:
17 |     datadict = pickle.load(f)
18 |
19 | MERFISH_data = datadict['MERFISH_data']
20 | MERFISH_data_scaled = datadict['MERFISH_data_scaled']
21 | MERFISH_meta = datadict['MERFISH_meta']
22 | del datadict
23 |
24 | with open ('data/SpaGE_pkl/Moffit_RNA.pkl', 'rb') as f:
25 |     datadict = pickle.load(f)
26 |
27 | RNA_data = datadict['RNA_data']
28 | RNA_data_scaled = datadict['RNA_data_scaled']
29 | del datadict
30 |
31 | Common_data = RNA_data_scaled[np.intersect1d(MERFISH_data_scaled.columns,RNA_data_scaled.columns)]
32 |
33 | n_factors = 50
34 | n_pv = 50
35 | n_pv_display = 50
36 | dim_reduction = 'pca'
37 | dim_reduction_target = 'pca'
38 |
39 | pv_FISH_RNA = PVComputation(
40 | n_factors = n_factors,
41 | n_pv = n_pv,
42 | dim_reduction = dim_reduction,
43 | dim_reduction_target = dim_reduction_target
44 | )
45 |
46 | pv_FISH_RNA.fit(Common_data,MERFISH_data_scaled[Common_data.columns])
47 |
48 | fig = plt.figure()
49 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
50 | center=0, vmax=1., vmin=0)
51 | plt.xlabel('MERFISH',fontsize=18, color='black')
52 | plt.ylabel('scRNA-seq',fontsize=18, color='black')
53 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
54 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
55 | plt.gca().set_ylim([n_pv_display,0])
56 | plt.show()
57 |
58 | plt.figure()
59 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
60 | center=0, vmax=1., vmin=0)
61 | for i in range(n_pv_display-1):
62 | plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black')
63 |
64 | plt.xlabel('MERFISH',fontsize=18, color='black')
65 | plt.ylabel('scRNA-seq',fontsize=18, color='black')
66 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
67 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
68 | plt.gca().set_ylim([n_pv_display,0])
69 | plt.show()
70 |
71 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns)
72 | Importance.sort_values(ascending=False,inplace=True)
73 | Importance.index[0:50]
74 |
75 | ### Technology specific Processes
76 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
77 |
78 | # explained variance RNA
79 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
80 | # explained variance spatial
81 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
82 |
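
The script above inspects `cosine_similarity_matrix_` and keeps principal vectors whose diagonal similarity exceeds 0.3. A common way to obtain such vectors is to take the SVD of the product of the two PCA bases. The sketch below assumes that construction; the internals of `PVComputation` are not shown in this section, so the functions `pca_components` and `principal_vectors` and the random data are hypothetical illustrations only.

```python
import numpy as np

def pca_components(X, n_factors):
    """Top PCA loadings of X (samples x genes) as orthonormal rows."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_factors]

def principal_vectors(source, target, n_factors):
    """Align two PCA bases; the singular values are cosines of the principal angles."""
    Ps = pca_components(source, n_factors)
    Pt = pca_components(target, n_factors)
    U, cosines, Vt = np.linalg.svd(Ps @ Pt.T)
    source_pv = U.T @ Ps  # rotated source factors
    target_pv = Vt @ Pt   # rotated target factors
    return source_pv, target_pv, cosines

# Made-up data: the target is a noisy subset of the source.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 20))
target = source[:100] + 0.1 * rng.normal(size=(100, 20))

source_pv, target_pv, cosines = principal_vectors(source, target, n_factors=5)
effective_n_pv = int(np.sum(cosines > 0.3))  # same threshold as the script above

# Sanity case: identical datasets give perfectly aligned subspaces.
_, _, same_cosines = principal_vectors(source, source, n_factors=5)
```

With this construction the cross-similarity matrix `source_pv @ target_pv.T` is diagonal, which matches the kind of heatmap the second plot in the script is expected to show.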
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and they can be
5 | called directly by string name from the main method. All the methods
6 | use the scikit-learn implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 | """
24 | Default linear dimensionality reduction method. For each method, return a
25 | BaseEstimator instance corresponding to the method given as input.
26 | Parameters
27 | -------
28 | method: str, default to 'pca'
29 | Method used for dimensionality reduction.
30 | Implemented: 'pca', 'ica', 'fa' (Factor Analysis),
31 | 'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA), 'pls' (PLS).
32 |
33 | n_dim: int, default to 10
34 | Number of domain-specific factors to compute.
35 | Return values
36 | -------
37 | Classifier, i.e. BaseEstimator instance
38 | """
39 |
40 | if method.lower() == 'pca':
41 | clf = PCA(n_components=n_dim)
42 |
43 | elif method.lower() == 'ica':
44 | print('ICA')
45 | clf = FastICA(n_components=n_dim)
46 |
47 | elif method.lower() == 'fa':
48 | clf = FactorAnalysis(n_components=n_dim)
49 |
50 | elif method.lower() == 'nmf':
51 | clf = NMF(n_components=n_dim)
52 |
53 | elif method.lower() == 'sparsepca':
54 | clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 | elif method.lower() == 'pls':
57 | clf = PLS(n_components=n_dim)
58 |
59 | else:
60 | raise NameError('%s is not an implemented method'%(method))
61 |
62 | return clf
63 |
64 |
65 | class PLS():
66 | """
67 | Wrap PLSRegression so that it exposes the same interface as the other
68 | dimensionality reduction methods.
69 | (Simple class rewriting).
70 | """
71 | def __init__(self, n_components=10):
72 | self.clf = PLSRegression(n_components)
73 |
74 | def get_components_(self):
75 | return self.clf.x_weights_.transpose()
76 |
77 | def set_components_(self, x):
78 | pass
79 |
80 | components_ = property(get_components_, set_components_)
81 |
82 | def fit(self, X, y):
83 | self.clf.fit(X,y)
84 | return self
85 |
86 | def transform(self, X):
87 | return self.clf.transform(X)
88 |
89 | def predict(self, X):
90 | return self.clf.predict(X)
--------------------------------------------------------------------------------
/benchmark/MERFISH_Moffit/gimVI/gimVI.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('MERFISH_Moffit/')
3 |
4 | from scvi.dataset import CsvDataset
5 | from scvi.models import JVAE, Classifier
6 | from scvi.inference import JVAETrainer
7 | import numpy as np
8 | import pandas as pd
9 | import copy
10 | import torch
11 | import time as tm
12 |
13 | ### MERFISH data
14 | MERFISH_data = CsvDataset('data/gimVI_data/MERFISH_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
15 |
16 | ### Moffit scRNA-seq
17 | RNA_data = CsvDataset('data/gimVI_data/Moffit_RNA_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
18 |
19 | ### Leave-one-out validation
20 | Gene_set = np.intersect1d(MERFISH_data.gene_names,RNA_data.gene_names)
21 | MERFISH_data.gene_names = Gene_set
22 | MERFISH_data.X = MERFISH_data.X[:,np.reshape(np.vstack([np.argwhere(i==MERFISH_data.gene_names) for i in Gene_set]),-1)]
23 | Common_data = copy.deepcopy(RNA_data)
24 | Common_data.gene_names = Gene_set
25 | Common_data.X = Common_data.X[:,np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in Gene_set]),-1)]
26 | Imp_Genes = pd.DataFrame(columns=Gene_set)
27 | gimVI_time = []
28 |
29 | for i in Gene_set:
30 | print(i)
31 | # Create copy of the fish dataset with hidden genes
32 | data_spatial_partial = copy.deepcopy(MERFISH_data)
33 | data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(MERFISH_data.gene_names,i))
34 | data_spatial_partial.batch_indices += Common_data.n_batches
35 |
36 | if(data_spatial_partial.X.shape[0] != MERFISH_data.X.shape[0]):
37 | continue
38 |
39 | datasets = [Common_data, data_spatial_partial]
40 | generative_distributions = ["zinb", "nb"]
41 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(i==Common_data.gene_names) for i in data_spatial_partial.gene_names]),-1)]
42 | n_inputs = [d.nb_genes for d in datasets]
43 | total_genes = Common_data.nb_genes
44 | n_batches = sum([d.n_batches for d in datasets])
45 |
46 | model_library_size = [True, False]
47 |
48 | n_latent = 8
49 | kappa = 5
50 |
51 | start = tm.time()
52 | torch.manual_seed(0)
53 |
54 | model = JVAE(
55 | n_inputs,
56 | total_genes,
57 | gene_mappings,
58 | generative_distributions,
59 | model_library_size,
60 | n_layers_decoder_individual=0,
61 | n_layers_decoder_shared=0,
62 | n_layers_encoder_individual=1,
63 | n_layers_encoder_shared=1,
64 | dim_hidden_encoder=64,
65 | dim_hidden_decoder_shared=64,
66 | dropout_rate_encoder=0.2,
67 | dropout_rate_decoder=0.2,
68 | n_batch=n_batches,
69 | n_latent=n_latent,
70 | )
71 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
72 |
73 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
74 | trainer.train(n_epochs=200)
75 | _,Imputed = trainer.get_imputed_values(normalized=True)
76 | Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1)
77 | Imp_Genes[i] = Imputed
78 | gimVI_time.append(tm.time()-start)
79 |
80 | Imp_Genes = Imp_Genes.fillna(0)
81 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv')
82 | gimVI_time = pd.DataFrame(gimVI_time)
83 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False)
84 |
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/DownSampling/DownSampling.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('STARmap_AllenVISp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.neighbors import NearestNeighbors
8 | import sys
9 | sys.path.insert(1,'Scripts/SpaGE/')
10 | from principal_vectors import PVComputation
11 |
12 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f:
13 | datadict = pickle.load(f)
14 |
15 | Starmap_data = datadict['Starmap_data']
16 | Starmap_data_scaled = datadict['Starmap_data_scaled']
17 | coords = datadict['coords']
18 | del datadict
19 |
20 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
21 | datadict = pickle.load(f)
22 |
23 | RNA_data = datadict['RNA_data']
24 | RNA_data_scaled = datadict['RNA_data_scaled']
25 | del datadict
26 |
27 | all_centroids = np.vstack([c.mean(0) for c in coords])
28 |
29 | def Moran_I(SpatialData,XYmap):
30 | XYnbrs = NearestNeighbors(n_neighbors=5, algorithm='auto',metric = 'euclidean').fit(XYmap)
31 | XYdistances, XYindices = XYnbrs.kneighbors(XYmap)
32 | W = np.zeros((SpatialData.shape[0],SpatialData.shape[0]))
33 | for i in range(0,SpatialData.shape[0]):
34 | W[i,XYindices[i,:]]=1
35 |
36 | for i in range(0,SpatialData.shape[0]):
37 | W[i,i]=0
38 |
39 | I = pd.Series(index=SpatialData.columns)
40 | for k in SpatialData.columns:
41 | X_minus_mean = np.array(SpatialData[k] - np.mean(SpatialData[k]))
42 | X_minus_mean = np.reshape(X_minus_mean,(len(X_minus_mean),1))
43 | Nom = np.sum(np.multiply(W,np.matmul(X_minus_mean,X_minus_mean.T)))
44 | Den = np.sum(np.multiply(X_minus_mean,X_minus_mean))
45 | I[k] = (len(SpatialData[k])/np.sum(W))*(Nom/Den)
46 | return(I)
47 |
48 | Moran_Is = Moran_I(Starmap_data,all_centroids)
49 |
50 | Gene_Order = np.intersect1d(Starmap_data.columns,RNA_data.columns)
51 | Moran_Is = Moran_Is[Gene_Order]
52 |
53 | Moran_Is.sort_values(ascending=False,inplace=True)
54 | test_set = Moran_Is.index[0:50]
55 |
56 | Common_data = RNA_data[np.intersect1d(Starmap_data.columns,RNA_data.columns)]
57 | Common_data = Common_data.drop(columns=test_set)
58 |
59 | Corr = np.corrcoef(Common_data.T)
60 | removed_genes = []
61 | for i in range(0,Corr.shape[0]):
62 | for j in range(i+1,Corr.shape[0]):
63 | if(np.abs(Corr[i,j]) > 0.7):
64 | Vi = np.var(Common_data.iloc[:,i])
65 | Vj = np.var(Common_data.iloc[:,j])
66 | if(Vi > Vj):
67 | removed_genes.append(Common_data.columns[j])
68 | else:
69 | removed_genes.append(Common_data.columns[i])
70 | removed_genes= np.unique(removed_genes)
71 |
72 | Common_data = Common_data.drop(columns=removed_genes)
73 | Variance = np.var(Common_data)
74 | Variance.sort_values(ascending=False,inplace=True)
75 | Variance = pd.concat([Variance, pd.Series(0,index=removed_genes)])
76 |
77 | genes_to_impute = test_set
78 | for i in [10,30,50,100,200,500,len(Variance)]:
79 | Imp_New_Genes = pd.DataFrame(np.zeros((Starmap_data.shape[0],len(genes_to_impute))),columns=genes_to_impute)
80 |
81 | if(i>=50):
82 | n_factors = 50
83 | n_pv = 50
84 | else:
85 | n_factors = i
86 | n_pv = i
87 |
88 | dim_reduction = 'pca'
89 | dim_reduction_target = 'pca'
90 |
91 | pv_FISH_RNA = PVComputation(
92 | n_factors = n_factors,
93 | n_pv = n_pv,
94 | dim_reduction = dim_reduction,
95 | dim_reduction_target = dim_reduction_target
96 | )
97 |
98 | source_data = RNA_data_scaled[Variance.index[0:i]]
99 | target_data = Starmap_data_scaled[Variance.index[0:i]]
100 |
101 | pv_FISH_RNA.fit(source_data,target_data)
102 |
103 | S = pv_FISH_RNA.source_components_.T
104 |
105 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
106 | S = S[:,0:Effective_n_pv]
107 |
108 | RNA_data_t = source_data.dot(S)
109 | FISH_exp_t = target_data.dot(S)
110 |
111 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',
112 | metric = 'cosine').fit(RNA_data_t)
113 | distances, indices = nbrs.kneighbors(FISH_exp_t)
114 |
115 | for j in range(0,Starmap_data.shape[0]):
116 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
117 | weights = weights/(len(weights)-1)
118 | Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]])
119 |
120 | Imp_New_Genes.to_csv('Results/' + str(i) +'SpaGE_New_genes.csv')
121 |
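
The inner loop of the script above imputes each spatial cell as a weighted average of its scRNA-seq neighbors, keeping only neighbors with cosine distance below 1. The weighting can be sketched as a standalone function. The name `impute_cell`, the toy inputs, and the fallback for fewer than two usable neighbors are assumptions added for illustration; the script itself has no such fallback.

```python
import numpy as np

def impute_cell(distances, neighbor_expr):
    """Weighted average of neighbor expression for one spatial cell.

    distances:     cosine distances to the k nearest scRNA-seq cells
    neighbor_expr: k x genes expression of those cells
    """
    distances = np.asarray(distances, dtype=float)
    neighbor_expr = np.asarray(neighbor_expr, dtype=float)

    keep = distances < 1  # drop neighbors with non-positive cosine similarity
    d = distances[keep]

    if d.size < 2 or d.sum() == 0:
        # Assumption (not in the source): fall back to a plain mean, or zeros
        # when no neighbor passes the filter, to avoid dividing by zero.
        if d.size == 0:
            return np.zeros(neighbor_expr.shape[1])
        return neighbor_expr[keep].mean(axis=0)

    # Closer neighbors get larger weights; the weights sum to 1.
    weights = (1 - d / d.sum()) / (d.size - 1)
    return weights @ neighbor_expr[keep]

# Toy example: the third neighbor is rejected because its distance is >= 1.
imputed = impute_cell([0.1, 0.3, 1.2],
                      [[1.0, 10.0], [3.0, 30.0], [100.0, 1000.0]])
imputed_three = impute_cell([0.2, 0.2, 0.6], [[1.0], [2.0], [3.0]])
```

Dividing by `d.size - 1` is what normalizes the weights: the terms `1 - d / d.sum()` always add up to the number of kept neighbors minus one.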
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/DownSampling/DownSampling_evaluation.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('STARmap_AllenVISp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import matplotlib
7 | matplotlib.use('qt5agg')
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | import scipy.stats as st
12 | import pickle
13 |
14 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f:
15 | datadict = pickle.load(f)
16 |
17 | Starmap_data = datadict['Starmap_data']
18 | del datadict
19 |
20 | test_set = pd.read_csv('Results/10SpaGE_New_genes.csv',header=0,index_col=0,sep=',').columns
21 | Starmap_data = Starmap_data[test_set]
22 |
23 | ### SpaGE
24 | #10
25 | SpaGE_imputed_10 = pd.read_csv('Results/10SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
26 |
27 | SpaGE_Corr_10 = pd.Series(index = test_set)
28 | for i in test_set:
29 | SpaGE_Corr_10[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_10[i])[0]
30 |
31 | #30
32 | SpaGE_imputed_30 = pd.read_csv('Results/30SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
33 |
34 | SpaGE_Corr_30 = pd.Series(index = test_set)
35 | for i in test_set:
36 | SpaGE_Corr_30[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_30[i])[0]
37 |
38 | #50
39 | SpaGE_imputed_50 = pd.read_csv('Results/50SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
40 |
41 | SpaGE_Corr_50 = pd.Series(index = test_set)
42 | for i in test_set:
43 | SpaGE_Corr_50[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_50[i])[0]
44 |
45 | #100
46 | SpaGE_imputed_100 = pd.read_csv('Results/100SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
47 |
48 | SpaGE_Corr_100 = pd.Series(index = test_set)
49 | for i in test_set:
50 | SpaGE_Corr_100[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_100[i])[0]
51 |
52 | #200
53 | SpaGE_imputed_200 = pd.read_csv('Results/200SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
54 |
55 | SpaGE_Corr_200 = pd.Series(index = test_set)
56 | for i in test_set:
57 | SpaGE_Corr_200[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_200[i])[0]
58 |
59 | #500
60 | SpaGE_imputed_500 = pd.read_csv('Results/500SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
61 |
62 | SpaGE_Corr_500 = pd.Series(index = test_set)
63 | for i in test_set:
64 | SpaGE_Corr_500[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_500[i])[0]
65 |
66 | #944
67 | SpaGE_imputed_944 = pd.read_csv('Results/944SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
68 |
69 | SpaGE_Corr_944 = pd.Series(index = test_set)
70 | for i in test_set:
71 | SpaGE_Corr_944[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed_944[i])[0]
72 |
73 | ### Comparison plots
74 | plt.style.use('ggplot')
75 | plt.figure(figsize=(9, 3))
76 | plt.boxplot([SpaGE_Corr_10, SpaGE_Corr_30, SpaGE_Corr_50,
77 | SpaGE_Corr_100,SpaGE_Corr_200,SpaGE_Corr_500,SpaGE_Corr_944])
78 |
79 | y = SpaGE_Corr_10
80 | x = np.random.normal(1, 0.05, len(y))
81 | plt.plot(x, y, 'g.', alpha=0.2)
82 | y = SpaGE_Corr_30
83 | x = np.random.normal(2, 0.05, len(y))
84 | plt.plot(x, y, 'g.', alpha=0.2)
85 | y = SpaGE_Corr_50
86 | x = np.random.normal(3, 0.05, len(y))
87 | plt.plot(x, y, 'g.', alpha=0.2)
88 | y = SpaGE_Corr_100
89 | x = np.random.normal(4, 0.05, len(y))
90 | plt.plot(x, y, 'g.', alpha=0.2)
91 | y = SpaGE_Corr_200
92 | x = np.random.normal(5, 0.05, len(y))
93 | plt.plot(x, y, 'g.', alpha=0.2)
94 | y = SpaGE_Corr_500
95 | x = np.random.normal(6, 0.05, len(y))
96 | plt.plot(x, y, 'g.', alpha=0.2)
97 | y = SpaGE_Corr_944
98 | x = np.random.normal(7, 0.05, len(y))
99 | plt.plot(x, y, 'g.', alpha=0.2)
100 |
101 |
102 | plt.xticks((1,2,3,4,5,6,7),('10','30', '50','100','200','500','944'),size=10)
103 | plt.yticks(size=8)
104 | plt.xlabel('Number of shared genes',size=12)
105 | plt.gca().set_ylim([-0.3,0.8])
106 | plt.ylabel('Spearman Correlation',size=12)
107 | plt.show()
108 |
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/Liger/LIGER.R:
--------------------------------------------------------------------------------
1 | setwd("STARmap_AllenVISp/")
2 | library(liger)
3 | library(Seurat)
4 | library(ggplot2)
5 |
6 | # allen VISp
7 | allen <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv",
8 | row.names = 1, sep = ',', stringsAsFactors = FALSE, header = TRUE)
9 | allen <- as.matrix(x = allen)
10 | genes <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv",
11 | sep = ',', stringsAsFactors = FALSE, header = TRUE)
12 | rownames(x = allen) <- make.unique(names = genes$gene_symbol)
13 | meta.data <- read.csv(file = "data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv",
14 | row.names = 1, stringsAsFactors = FALSE)
15 |
16 | Genes_count = rowSums(allen > 0)
17 | allen <- allen[Genes_count>=10,]
18 |
19 | low.q.cells <- rownames(x = meta.data[meta.data$class %in% c('Low Quality', 'No Class'), ])
20 | ok.cells <- rownames(x = meta.data)[!(rownames(x = meta.data) %in% low.q.cells)]
21 | allen <-allen[,ok.cells]
22 |
23 | # STARmap
24 | counts <- read.table(file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/cell_barcode_count.csv",
25 | sep = ",", stringsAsFactors = FALSE)
26 | gene.names <- read.table(file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/genes.csv",
27 | sep = ",", stringsAsFactors = FALSE)
28 |
29 | qhulls <- read.table(file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/qhulls.tsv",
30 | sep = '\t', col.names = c('cell', 'y', 'x'), stringsAsFactors = FALSE)
31 | centroids <- read.table(file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/centroids.tsv",
32 | sep = "\t", col.names = c("spatial2", "spatial1"), stringsAsFactors = FALSE)
33 | colnames(counts) <- gene.names$V1
34 | rownames(counts) <- paste0('starmap', seq(1:nrow(counts)))
35 | counts <- as.matrix(counts)
36 | rownames(centroids) <- rownames(counts)
37 | centroids <- as.matrix(centroids)
38 | counts <- t(counts)
39 |
40 | Gene_set <- intersect(rownames(counts),rownames(allen))
41 |
42 | #### New genes prediction
43 | Ligerex <- createLiger(list(STARmap = counts, AllenVISp = allen))
44 | Ligerex <- normalize(Ligerex)
45 | Ligerex@var.genes <- Gene_set
46 | Ligerex <- scaleNotCenter(Ligerex)
47 |
48 | # suggestK(Ligerex) # K = 25
49 | # suggestLambda(Ligerex, k = 25) # Lambda = 35
50 |
51 | Ligerex <- optimizeALS(Ligerex,k = 25, lambda = 35)
52 | Ligerex <- quantileAlignSNF(Ligerex)
53 |
54 | Imputation <- imputeKNN(Ligerex,reference = 'AllenVISp', queries = list('STARmap'), norm = TRUE, scale = FALSE)
55 |
56 | new.genes <- c('Tesc', 'Pvrl3', 'Sox10', 'Grm2', 'Tcrb')
57 | Imp_New_genes <- matrix(0,nrow= length(new.genes),ncol = dim(Imputation@norm.data$STARmap)[2])
58 | rownames(Imp_New_genes) <- new.genes
59 |
60 | for (i in c(1:length(new.genes))){
61 | Imp_New_genes[new.genes[[i]],] = as.vector(Imputation@norm.data$STARmap[new.genes[i],])
62 | }
63 | write.csv(Imp_New_genes,file = 'Results/Liger_New_genes.csv')
64 |
65 | # leave-one-out validation
66 | genes.leaveout <- Gene_set
67 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(counts)[2])
68 | rownames(Imp_genes) <- genes.leaveout
69 | colnames(Imp_genes) <- colnames(counts)
70 | NMF_time <- vector(mode= "numeric")
71 | knn_time <- vector(mode= "numeric")
72 |
73 | for(i in 1:length(genes.leaveout)) {
74 | print(i)
75 | start_time <- Sys.time()
76 | Ligerex.leaveout <- createLiger(list(STARmap = counts[-which(rownames(counts) %in% genes.leaveout[i]),], AllenVISp = allen))
77 | Ligerex.leaveout <- normalize(Ligerex.leaveout)
78 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i])
79 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout)
80 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 25, lambda = 35)
81 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout)
82 | end_time <- Sys.time()
83 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
84 |
85 | start_time <- Sys.time()
86 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'AllenVISp', queries = list('STARmap'), norm = TRUE, scale = FALSE)
87 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$STARmap[genes.leaveout[i],])
88 | end_time <- Sys.time()
89 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
90 | }
91 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv')
92 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE)
93 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE)
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/Performance_evaluation.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('STARmap_AllenVISp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import pickle
7 | import matplotlib
8 | matplotlib.use('qt5agg')
9 | matplotlib.rcParams['pdf.fonttype'] = 42
10 | matplotlib.rcParams['ps.fonttype'] = 42
11 | import matplotlib.pyplot as plt
12 | import scipy.stats as st
13 | from sklearn.neighbors import NearestNeighbors
14 | from matplotlib import cm
15 |
16 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f:
17 | datadict = pickle.load(f)
18 |
19 | coords = datadict['coords']
20 | Starmap_data = datadict['Starmap_data']
21 | del datadict
22 |
23 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
24 | datadict = pickle.load(f)
25 |
26 | RNA_data = datadict['RNA_data']
27 | del datadict
28 |
29 | all_centroids = np.vstack([c.mean(0) for c in coords])
30 |
31 | plt.style.use('dark_background')
32 | cmap = cm.get_cmap('viridis',20)
33 |
34 | def Moran_I(SpatialData,XYmap):
35 | XYnbrs = NearestNeighbors(n_neighbors=5, algorithm='auto',metric = 'euclidean').fit(XYmap)
36 | XYdistances, XYindices = XYnbrs.kneighbors(XYmap)
37 | W = np.zeros((SpatialData.shape[0],SpatialData.shape[0]))
38 | for i in range(0,SpatialData.shape[0]):
39 | W[i,XYindices[i,:]]=1
40 |
41 | for i in range(0,SpatialData.shape[0]):
42 | W[i,i]=0
43 |
44 | I = pd.Series(index=SpatialData.columns)
45 | for k in SpatialData.columns:
46 | X_minus_mean = np.array(SpatialData[k] - np.mean(SpatialData[k]))
47 | X_minus_mean = np.reshape(X_minus_mean,(len(X_minus_mean),1))
48 | Nom = np.sum(np.multiply(W,np.matmul(X_minus_mean,X_minus_mean.T)))
49 | Den = np.sum(np.multiply(X_minus_mean,X_minus_mean))
50 | I[k] = (len(SpatialData[k])/np.sum(W))*(Nom/Den)
51 | return(I)
52 |
53 | Moran_Is = Moran_I(Starmap_data,all_centroids)
54 |
55 | Gene_Order = np.intersect1d(Starmap_data.columns,RNA_data.columns)
56 |
57 | Moran_Is = Moran_Is[Gene_Order]
58 |
59 | ### SpaGE
60 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut.csv',header=0,index_col=0,sep=',')
61 |
62 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]
63 |
64 | SpaGE_Corr = pd.Series(index = Gene_Order)
65 | for i in Gene_Order:
66 | SpaGE_Corr[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed[i])[0]
67 |
68 | ### gimVI
69 | gimVI_imputed = pd.read_csv('Results/gimVI_LeaveOneOut.csv',header=0,index_col=0,sep=',')
70 |
71 | gimVI_imputed.columns = Gene_Order
72 |
73 | gimVI_Corr = pd.Series(index = Gene_Order)
74 | for i in Gene_Order:
75 | gimVI_Corr[i] = st.spearmanr(Starmap_data[i],gimVI_imputed[i])[0]
76 | gimVI_Corr[np.isnan(gimVI_Corr)] = 0
77 |
78 | ### Seurat
79 | Seurat_imputed = pd.read_csv('Results/Seurat_LeaveOneOut.csv',header=0,index_col=0,sep=',').T
80 |
81 | Seurat_imputed = Seurat_imputed.loc[:,Gene_Order]
82 | Seurat_imputed.index = range(0,Seurat_imputed.shape[0])
83 | cell_labels = pd.read_csv('data/Starmap/visual_1020/20180505_BY3_1kgenes/class_labels.csv',
84 | header=0,sep=',')
85 | Starmap_data_Seurat = Starmap_data.drop(np.where(cell_labels['ClusterName']=='HPC')[0],axis=0)
86 |
87 | Seurat_Corr = pd.Series(index = Gene_Order)
88 | for i in Gene_Order:
89 | Seurat_Corr[i] = st.spearmanr(Starmap_data_Seurat[i],Seurat_imputed[i])[0]
90 |
91 | ### Liger
92 | Liger_imputed = pd.read_csv('Results/Liger_LeaveOneOut.csv',header=0,index_col=0,sep=',').T
93 |
94 | Liger_imputed = Liger_imputed.loc[:,Gene_Order]
95 | Liger_imputed.index = range(0,Liger_imputed.shape[0])
96 |
97 | Liger_Corr = pd.Series(index = Gene_Order)
98 | for i in Gene_Order:
99 | Liger_Corr[i] = st.spearmanr(Starmap_data[i],Liger_imputed[i])[0]
100 | Liger_Corr[np.isnan(Liger_Corr)] = 0
101 |
102 | ### Comparison plots
103 | plt.style.use('ggplot')
104 | fig, ax = plt.subplots(figsize=(3.7, 5.5))
105 | ax.boxplot([SpaGE_Corr,Seurat_Corr, Liger_Corr,gimVI_Corr])
106 |
107 | y = SpaGE_Corr
108 | x = np.random.normal(1, 0.05, len(y))
109 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2)
110 | y = Seurat_Corr
111 | x = np.random.normal(2, 0.05, len(y))
112 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2)
113 | y = Liger_Corr
114 | x = np.random.normal(3, 0.05, len(y))
115 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2)
116 | y = gimVI_Corr
117 | x = np.random.normal(4, 0.05, len(y))
118 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2)
119 |
120 | plt.xticks((1,2,3,4),('SpaGE', 'Seurat', 'Liger','gimVI'),size=12)
121 | plt.yticks(size=8)
122 | plt.gca().set_ylim([-0.5,1])
123 | plt.ylabel('Spearman Correlation',size=12)
124 | ax.set_aspect(aspect=3)
125 | _,p_val = st.wilcoxon(SpaGE_Corr,Seurat_Corr)
126 | plt.text(2,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
127 | _,p_val = st.wilcoxon(SpaGE_Corr,Liger_Corr)
128 | plt.text(3,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
129 | _,p_val = st.wilcoxon(SpaGE_Corr,gimVI_Corr)
130 | plt.text(4,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
131 | plt.show()
132 |
133 | def Compare_Correlations(X,Y):
134 | fig, ax = plt.subplots(figsize=(5.2, 5.2))
135 | cmap = Moran_Is
136 | ax.axvline(linestyle='--',color='gray')
137 | ax.axhline(linestyle='--',color='gray')
138 | im = ax.scatter(X, Y, s=1, c=cmap)
139 | im.set_cmap('seismic')
140 | plt.gca().set_ylim([-0.5,1])
141 | lims = [np.min([ax.get_xlim(), ax.get_ylim()]),
142 | np.max([ax.get_xlim(), ax.get_ylim()])]
143 | ax.plot(lims, lims, 'k-')
144 | ax.set_aspect('equal')
145 | ax.set_xlim(lims)
146 | ax.set_ylim(lims)
147 | plt.xticks((-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),size=8)
148 | plt.yticks((-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),size=8)
149 | cbar = plt.colorbar(im)
150 | cbar.ax.tick_params(labelsize=8)
151 | cbar.ax.set_ylabel("Moran's I statistic",fontsize=12)
152 | plt.show()
153 |
154 | Compare_Correlations(Seurat_Corr,SpaGE_Corr)
155 | plt.xlabel('Spearman Correlation Seurat',size=12)
156 | plt.ylabel('Spearman Correlation SpaGE',size=12)
157 | plt.show()
158 |
159 | Compare_Correlations(Liger_Corr,SpaGE_Corr)
160 | plt.xlabel('Spearman Correlation Liger',size=12)
161 | plt.ylabel('Spearman Correlation SpaGE',size=12)
162 | plt.show()
163 |
164 | Compare_Correlations(gimVI_Corr,SpaGE_Corr)
165 | plt.xlabel('Spearman Correlation gimVI',size=12)
166 | plt.ylabel('Spearman Correlation SpaGE',size=12)
167 | plt.show()
168 |
169 | def Correlation_vs_Moran(X,Y):
170 | fig, ax = plt.subplots(figsize=(4.8, 4.8))
171 | ax.scatter(X, Y, s=1)
172 | Corr = st.spearmanr(X,Y)[0]
173 | plt.text(np.mean(plt.gca().get_xlim()),np.min(plt.gca().get_ylim()),'%1.3f'%Corr,color='black',size=9)
174 | plt.xticks(size=8)
175 | plt.yticks(size=8)
176 | plt.axis('scaled')
177 | plt.gca().set_ylim([-0.5,1])
178 | plt.gca().set_xlim([-0.2,1])
179 | plt.show()
180 |
181 |
182 | Correlation_vs_Moran(Moran_Is,SpaGE_Corr)
183 | plt.xlabel("Moran's I",size=12)
184 | plt.ylabel('Spearman Correlation SpaGE',size=12)
185 | plt.show()
186 |
187 | Correlation_vs_Moran(Moran_Is,Seurat_Corr)
188 | plt.xlabel("Moran's I",size=12)
189 | plt.ylabel('Spearman Correlation Seurat',size=12)
190 | plt.show()
191 |
192 | Correlation_vs_Moran(Moran_Is,Liger_Corr)
193 | plt.xlabel("Moran's I",size=12)
194 | plt.ylabel('Spearman Correlation Liger',size=12)
195 | plt.show()
196 |
197 | Correlation_vs_Moran(Moran_Is,gimVI_Corr)
198 | plt.xlabel("Moran's I",size=12)
199 | plt.ylabel('Spearman Correlation gimVI',size=12)
200 | plt.show()
201 |
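
The `Moran_I` function in the script above scores spatial autocorrelation with a binary nearest-neighbor weight matrix. A vectorized NumPy sketch of the same statistic for a single gene is given below. The function name `morans_i`, the brute-force neighbor search, and the grid data are assumptions for illustration. The script asks scikit-learn for 5 neighbors and then zeroes the diagonal, which typically leaves 4 true neighbors per cell, so the sketch defaults to 4.

```python
import numpy as np

def morans_i(values, coords, n_neighbors=4):
    """Moran's I for one gene with a binary k-nearest-neighbor weight matrix."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(values)

    # Pairwise squared Euclidean distances; exclude each cell from its own neighbors.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :n_neighbors]

    W = np.zeros((n, n))
    W[np.arange(n)[:, None], nn] = 1  # W[i, j] = 1 if j is a neighbor of i

    z = values - values.mean()
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# Made-up 6 x 6 grid of cell centroids.
xs, ys = np.meshgrid(np.arange(6), np.arange(6))
grid = np.column_stack([xs.ravel(), ys.ravel()])

smooth = morans_i(grid[:, 0], grid)                   # expression follows x: clustered
checker = morans_i((grid[:, 0] + grid[:, 1]) % 2, grid)  # alternating pattern: dispersed
```

Values near 1 indicate spatially clustered expression and negative values indicate a dispersed pattern. The brute-force distance matrix is quadratic in the number of cells, so a neighbor index such as the one used in the script is preferable for large datasets.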
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/Seurat/allen_brain.R:
--------------------------------------------------------------------------------
1 | setwd("STARmap_AllenVISp/")
2 | library(Seurat)
3 |
4 | allen <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv",
5 | row.names = 1,sep = ',', stringsAsFactors = FALSE, header = TRUE)
6 | allen <- as.matrix(x = allen)
7 | genes <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv",
8 | sep = ',', stringsAsFactors = FALSE, header = TRUE)
9 | rownames(x = allen) <- make.unique(names = genes$gene_symbol)
10 | meta.data <- read.csv(file = "data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv",
11 | row.names = 1, stringsAsFactors = FALSE)
12 |
13 | al <- CreateSeuratObject(counts = allen, project = 'VISp', meta.data = meta.data, min.cells = 10)
14 | low.q.cells <- rownames(x = meta.data[meta.data$class %in% c('Low Quality', 'No Class'), ])
15 | ok.cells <- rownames(x = meta.data)[!(rownames(x = meta.data) %in% low.q.cells)]
16 | al <- al[, ok.cells]
17 | al <- NormalizeData(object = al)
18 | al <- FindVariableFeatures(object = al, nfeatures = 2000)
19 | al <- ScaleData(object = al)
20 | al <- RunPCA(object = al, npcs = 50, verbose = FALSE)
21 | al <- RunUMAP(object = al, dims = 1:50, n.neighbors = 5)
22 | saveRDS(object = al, file = paste0("data/seurat_objects/","allen_brain.rds"))
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/Seurat/impute_starmap.R:
--------------------------------------------------------------------------------
1 | setwd("STARmap_AllenVISp/")
2 | library(Seurat)
3 |
4 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds")
5 | allen <- readRDS("data/seurat_objects/allen_brain.rds")
6 |
7 | # remove HPC from starmap
8 | class_labels <- read.table(
9 | file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/class_labels.csv",
10 | sep = ",",
11 | header = TRUE,
12 | stringsAsFactors = FALSE
13 | )
14 |
15 | class_labels$cellname <- paste0('starmap', rownames(class_labels))
16 |
17 | class_labels$ClusterName <- ifelse(is.na(class_labels$ClusterName), 'Other', class_labels$ClusterName)
18 |
19 | hpc <- class_labels[class_labels$ClusterName == 'HPC', ]$cellname
20 |
21 | accept.cells <- setdiff(colnames(starmap), hpc)
22 |
23 | starmap <- starmap[, accept.cells]
24 |
25 | starmap@misc$spatial <- starmap@misc$spatial[starmap@misc$spatial$cell %in% accept.cells, ]
26 |
27 | #Project on allen labels
28 | i2 <- FindTransferAnchors(
29 | reference = allen,
30 | query = starmap,
31 | features = rownames(starmap),
32 | reduction = 'cca',
33 | reference.assay = 'RNA',
34 | query.assay = 'RNA'
35 | )
36 |
37 | refdata <- GetAssayData(
38 | object = allen,
39 | assay = 'RNA',
40 | slot = 'data'
41 | )
42 |
43 | imputation <- TransferData(
44 | anchorset = i2,
45 | refdata = refdata,
46 | weight.reduction = 'pca'
47 | )
48 |
49 | starmap[['ss2']] <- imputation
50 | saveRDS(starmap, 'data/seurat_objects/20180505_BY3_1kgenes_imputed.rds')
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/Seurat/starmap.R:
--------------------------------------------------------------------------------
1 | setwd("STARmap_AllenVISp/")
2 | library(Seurat)
3 | library(Matrix)
4 |
5 | read_data <- function(base_path, project) {
6 | counts <- read.table(
7 | file = paste0(base_path, "cell_barcode_count.csv"),
8 | sep = ",",
9 | stringsAsFactors = FALSE
10 | )
11 | gene.names <- read.table(
12 | file = paste0(base_path, "genes.csv"),
13 | sep = ",",
14 | stringsAsFactors = FALSE
15 | )
16 | qhulls <- read.table(
17 | file = paste0(base_path, "qhulls.tsv"),
18 | sep = '\t',
19 | col.names = c('cell', 'y', 'x'),
20 | stringsAsFactors = FALSE
21 | )
22 | centroids <- read.table(
23 | file = paste0(base_path, "centroids.tsv"),
24 | sep = "\t",
25 | col.names = c("spatial2", "spatial1"),
26 | stringsAsFactors = FALSE
27 | )
28 | colnames(x = counts) <- gene.names$V1
29 | rownames(x = counts) <- paste0('starmap', seq_len(nrow(x = counts)))
30 | counts <- as.matrix(x = counts)
31 | rownames(x = centroids) <- rownames(x = counts)
32 | centroids <- as.matrix(x = centroids)
33 | total.counts = rowSums(x = counts)
34 |
35 | obj <- CreateSeuratObject(
36 | counts = t(x = counts),
37 | project = project,
38 | min.cells = -1,
39 | min.features = -1
40 | )
41 | obj <- NormalizeData(
42 | object = obj,
43 | scale.factor = median(x = total.counts)
44 | )
45 | obj <- ScaleData(
46 | object = obj,
47 | features = rownames(x = obj)
48 | )
49 | obj <- RunPCA(
50 | object = obj,
51 | features = rownames(x = obj),
52 | npcs = 30
53 | )
54 | obj <- RunUMAP(
55 | object = obj,
56 | dims = 1:30
57 | )
58 | obj[['spatial']] <- CreateDimReducObject(
59 | embeddings = centroids,
60 | assay = "RNA",
61 | key = 'spatial'
62 | )
63 | qhulls$cell <- paste0('starmap', qhulls$cell)
64 | obj@misc[['spatial']] <- qhulls
65 | return(obj)
66 | }
67 |
68 | # experiments <- c("visual_1020/20180505_BY3_1kgenes/", "visual_1020/20180410-BY3_1kgenes/")
69 | experiments <- c("data/Starmap/visual_1020/20180505_BY3_1kgenes/")
70 |
71 | for(i in 1:length(experiments)) {
72 | project.name <- unlist(x = strsplit(x = experiments[[i]], "/"))[[4]]
73 | dat <- read_data(base_path = experiments[[i]],project = project.name)
74 | saveRDS(object = dat, file = paste0("data/seurat_objects/", project.name, ".rds"))
75 | }
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/Seurat/starmap_integration.R:
--------------------------------------------------------------------------------
1 | setwd("STARmap_AllenVISp/")
2 | library(Seurat)
3 | library(ggplot2)
4 |
5 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds")
6 | starmap.imputed <- readRDS("data/seurat_objects/20180505_BY3_1kgenes_imputed.rds")
7 | allen <- readRDS("data/seurat_objects/allen_brain.rds")
8 |
9 | # remove HPC from starmap
10 | class_labels <- read.table(
11 | file = "data/Starmap/visual_1020/20180505_BY3_1kgenes/class_labels.csv",
12 | sep = ",",
13 | header = TRUE,
14 | stringsAsFactors = FALSE
15 | )
16 |
17 | class_labels$cellname <- paste0('starmap', rownames(class_labels))
18 |
19 | class_labels$ClusterName <- ifelse(is.na(class_labels$ClusterName), 'Other', class_labels$ClusterName)
20 |
21 | hpc <- class_labels[class_labels$ClusterName == 'HPC', ]$cellname
22 |
23 | accept.cells <- setdiff(colnames(starmap), hpc)
24 |
25 | starmap <- starmap[, accept.cells]
26 |
27 | starmap@misc$spatial <- starmap@misc$spatial[starmap@misc$spatial$cell %in% accept.cells, ]
28 |
29 | genes.leaveout <- intersect(rownames(starmap),rownames(allen))
30 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(starmap@assays$RNA)[2])
31 | rownames(Imp_genes) <- genes.leaveout
32 | anchor_time <- vector(mode= "numeric")
33 | Transfer_time <- vector(mode= "numeric")
34 |
35 | run_imputation <- function(ref.obj, query.obj, feature.remove) {
36 | message(paste0('removing ', feature.remove))
37 | features <- setdiff(rownames(query.obj), feature.remove)
38 | DefaultAssay(ref.obj) <- 'RNA'
39 | DefaultAssay(query.obj) <- 'RNA'
40 |
41 | start_time <- Sys.time()
42 | anchors <- FindTransferAnchors(
43 | reference = ref.obj,
44 | query = query.obj,
45 | features = features,
46 | dims = 1:30,
47 | reduction = 'cca'
48 | )
49 | end_time <- Sys.time()
50 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
51 |
52 | refdata <- GetAssayData(
53 | object = ref.obj,
54 | assay = 'RNA',
55 | slot = 'data'
56 | )
57 |
58 | start_time <- Sys.time()
59 | imputation <- TransferData(
60 | anchorset = anchors,
61 | refdata = refdata,
62 | weight.reduction = 'pca'
63 | )
64 | query.obj[['seq']] <- imputation
65 | end_time <- Sys.time()
66 | Transfer_time <<- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
67 | return(query.obj)
68 | }
69 |
70 | for(i in 1:length(genes.leaveout)) {
71 | imputed.ss2 <- run_imputation(ref.obj = allen, query.obj = starmap, feature.remove = genes.leaveout[[i]])
72 | starmap[['ss2']] <- imputed.ss2[, colnames(starmap)][['seq']]
73 | Imp_genes[genes.leaveout[[i]],] = as.vector(starmap@assays$ss2[genes.leaveout[i],])
74 | }
75 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv')
76 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE)
77 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE)
78 |
79 | # show genes not in the starmap dataset
80 | DefaultAssay(starmap.imputed) <- "ss2"
81 | new.genes <- c('Tesc', 'Pvrl3', 'Sox10', 'Grm2', 'Tcrb')
82 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(starmap.imputed@assays$ss2)[2])
83 | rownames(Imp_New_genes) <- new.genes
84 | for(i in 1:length(new.genes)) {
85 | Imp_New_genes[new.genes[[i]],] = as.vector(starmap.imputed@assays$ss2[new.genes[i],])
86 | }
87 | write.csv(Imp_New_genes,file = 'Results/Seurat_New_genes.csv')
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/SpaGE/Allen_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('STARmap_AllenVISp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 |
9 | RNA_data = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv',
10 | header=0,index_col=0,sep=',')
11 |
12 | Genes = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv',
13 | header=0,sep=',')
14 | RNA_data.index = Genes.gene_symbol
15 | del Genes
16 |
17 | # filter lowly expressed genes
18 | Genes_count = np.sum(RNA_data > 0, axis=1)
19 | RNA_data = RNA_data.loc[Genes_count >=10,:]
20 | del Genes_count
21 |
22 | # filter low quality cells
23 | meta_data = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv',
24 | header=0,sep=',')
25 | HighQualityCells = (meta_data['class'] != 'No Class') & (meta_data['class'] != 'Low Quality')
26 | RNA_data = RNA_data.iloc[:,np.where(HighQualityCells)[0]]
27 | del meta_data, HighQualityCells
28 |
29 | def Log_Norm(x):
30 |     return np.log(((x/np.sum(x))*1000000) + 1)
31 |
32 | RNA_data = RNA_data.apply(Log_Norm,axis=0)
33 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index)
34 |
35 | datadict = dict()
36 | datadict['RNA_data'] = RNA_data.T
37 | datadict['RNA_data_scaled'] = RNA_data_scaled
38 |
39 | with open('data/SpaGE_pkl/Allen_VISp.pkl','wb') as f:
40 |     pickle.dump(datadict, f)
41 |
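The normalisation in `Allen_data_preprocessing.py` can be illustrated on plain lists. This is only a sketch under stated assumptions, not the script itself: `log_norm` and `zscore` are hypothetical helper names, the scale factor of 1,000,000 and the population standard deviation (the `scipy.stats.zscore` default, `ddof=0`) are taken from the code above, and returning zeros for a constant gene is an assumption made here for safety (`scipy.stats.zscore` would yield NaN instead).

```python
import math


def log_norm(cell_counts, scale=1_000_000):
    """Scale one cell's counts to `scale` total counts, then apply log(x + 1)."""
    total = sum(cell_counts)
    return [math.log(c / total * scale + 1) for c in cell_counts]


def zscore(values):
    """Standardise one gene across cells using the population std (ddof=0)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        # Assumption: a constant gene maps to zeros instead of NaN.
        return [0.0] * n
    return [(v - mean) / sd for v in values]


# Toy example: one cell with two genes, and one gene measured in three cells.
normalised_cell = log_norm([1, 3])
scaled_gene = zscore([1.0, 2.0, 3.0])
```

In the script the first step is applied per cell (column-wise) and the second per gene across cells.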
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/SpaGE/Integration.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('STARmap_AllenVISp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.neighbors import NearestNeighbors
8 | import time as tm
9 |
10 | with open('data/SpaGE_pkl/Starmap.pkl', 'rb') as f:
11 |     datadict = pickle.load(f)
12 |
13 | Starmap_data = datadict['Starmap_data']
14 | Starmap_data_scaled = datadict['Starmap_data_scaled']
15 | labels = datadict['labels']
16 | qhulls = datadict['qhulls']
17 | coords = datadict['coords']
18 | del datadict
19 |
20 | with open('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
21 |     datadict = pickle.load(f)
22 |
23 | RNA_data = datadict['RNA_data']
24 | RNA_data_scaled = datadict['RNA_data_scaled']
25 | del datadict
26 |
27 | #### Leave One Out Validation ####
28 | Common_data = RNA_data_scaled[np.intersect1d(Starmap_data_scaled.columns,RNA_data_scaled.columns)]
29 | Imp_Genes = pd.DataFrame(columns=Common_data.columns)
30 | precise_time = []
31 | knn_time = []
32 | for i in Common_data.columns:
33 |     print(i)
34 |     start = tm.time()
35 |     from principal_vectors import PVComputation
36 |
37 |     n_factors = 50
38 |     n_pv = 50
39 |     dim_reduction = 'pca'
40 |     dim_reduction_target = 'pca'
41 |
42 |     pv_FISH_RNA = PVComputation(n_factors = n_factors,n_pv = n_pv,dim_reduction = dim_reduction,dim_reduction_target = dim_reduction_target)
43 |
44 |     pv_FISH_RNA.fit(Common_data.drop(i,axis=1),Starmap_data_scaled[Common_data.columns].drop(i,axis=1))
45 |
46 |     S = pv_FISH_RNA.source_components_.T
47 |
48 |     Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
49 |     S = S[:,0:Effective_n_pv]
50 |
51 |     Common_data_t = Common_data.drop(i,axis=1).dot(S)
52 |     FISH_exp_t = Starmap_data_scaled[Common_data.columns].drop(i,axis=1).dot(S)
53 |     precise_time.append(tm.time()-start)
54 |
55 |     start = tm.time()
56 |     nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t)
57 |     distances, indices = nbrs.kneighbors(FISH_exp_t)
58 |
59 |     Imp_Gene = np.zeros(Starmap_data.shape[0])
60 |     for j in range(0,Starmap_data.shape[0]):
61 |         weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
62 |         weights = weights/(len(weights)-1)
63 |         Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights))
64 |     Imp_Gene[np.isnan(Imp_Gene)] = 0
65 |     Imp_Genes[i] = Imp_Gene
66 |     knn_time.append(tm.time()-start)
67 |
68 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv')
69 | precise_time = pd.DataFrame(precise_time)
70 | knn_time = pd.DataFrame(knn_time)
71 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False)
72 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False)
73 |
74 | ##### Novel Genes Expression Patterns ####
75 | Common_data = RNA_data_scaled[np.intersect1d(Starmap_data_scaled.columns,RNA_data_scaled.columns)]
76 | #genes_to_impute = ["Tesc","Pvrl3","Sox10","Grm2","Tcrb"]
77 | genes_to_impute = np.setdiff1d(RNA_data.columns,Starmap_data.columns)
78 | Imp_New_Genes = pd.DataFrame(np.zeros((Starmap_data.shape[0],len(genes_to_impute))),columns=genes_to_impute)
79 |
80 | from principal_vectors import PVComputation
81 |
82 | n_factors = 50
83 | n_pv = 50
84 | dim_reduction = 'pca'
85 | dim_reduction_target = 'pca'
86 |
87 | pv_FISH_RNA = PVComputation(
88 | n_factors = n_factors,
89 | n_pv = n_pv,
90 | dim_reduction = dim_reduction,
91 | dim_reduction_target = dim_reduction_target
92 | )
93 |
94 | pv_FISH_RNA.fit(Common_data,Starmap_data_scaled[Common_data.columns])
95 |
96 | S = pv_FISH_RNA.source_components_.T
97 |
98 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
99 | S = S[:,0:Effective_n_pv]
100 |
101 | Common_data_t = Common_data.dot(S)
102 | FISH_exp_t = Starmap_data_scaled[Common_data.columns].dot(S)
103 |
104 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',
105 | metric = 'cosine').fit(Common_data_t)
106 | distances, indices = nbrs.kneighbors(FISH_exp_t)
107 |
108 | for j in range(0,Starmap_data.shape[0]):
109 |     weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
110 |     weights = weights/(len(weights)-1)
111 |     Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]])
112 |
113 | Imp_New_Genes.to_csv('Results/SpaGE_New_genes.csv')
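The neighbour weighting used in both loops of `Integration.py` can be sketched without NumPy. This is an illustration under stated assumptions, not the script's own code: `knn_weights` and `impute_value` are hypothetical names and the toy distances and expression values are made up. Neighbours with cosine distance below 1 are kept, each gets weight `(1 - d / sum(d)) / (n - 1)`, and the weights of `n` kept neighbours therefore sum to 1. Returning 0 when fewer than two neighbours survive mirrors the `NaN -> 0` replacement in the leave-one-out loop.

```python
def knn_weights(distances, max_distance=1.0):
    """Return (kept_indices, weights) for neighbours closer than max_distance."""
    kept = [k for k, d in enumerate(distances) if d < max_distance]
    total = sum(distances[k] for k in kept)
    if len(kept) < 2 or total == 0:
        # Assumption: treat this case as "no usable neighbours".
        return kept, []
    weights = [(1 - distances[k] / total) / (len(kept) - 1) for k in kept]
    return kept, weights


def impute_value(distances, neighbour_expression, max_distance=1.0):
    """Weighted average of the kept neighbours' expression for one gene."""
    kept, weights = knn_weights(distances, max_distance)
    if not weights:
        return 0.0
    return sum(w * neighbour_expression[k] for w, k in zip(weights, kept))


# Toy example: the fourth neighbour has distance >= 1 and is ignored.
distances = [0.1, 0.3, 0.6, 1.2]
expression = [2.0, 4.0, 6.0, 100.0]
kept, weights = knn_weights(distances)
imputed = impute_value(distances, expression)
```

Because closer neighbours have smaller `d / sum(d)`, they receive larger weights, and the distant outlier value of 100 never enters the average.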
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/SpaGE/Precise_output.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('STARmap_AllenVISp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | import matplotlib
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | import seaborn as sns
12 | import sys
13 | sys.path.insert(1,'SpaGE/')
14 | from principal_vectors import PVComputation
15 |
16 | with open('data/SpaGE_pkl/Starmap.pkl', 'rb') as f:
17 |     datadict = pickle.load(f)
18 |
19 | Starmap_data = datadict['Starmap_data']
20 | Starmap_data_scaled = datadict['Starmap_data_scaled']
21 | labels = datadict['labels']
22 | qhulls = datadict['qhulls']
23 | coords = datadict['coords']
24 | del datadict
25 |
26 | with open('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
27 |     datadict = pickle.load(f)
28 |
29 | RNA_data = datadict['RNA_data']
30 | RNA_data_scaled = datadict['RNA_data_scaled']
31 | del datadict
32 |
33 | Common_data = RNA_data_scaled[np.intersect1d(Starmap_data_scaled.columns,RNA_data_scaled.columns)]
34 |
35 | n_factors = 50
36 | n_pv = 50
37 | n_pv_display = 50
38 | dim_reduction = 'pca'
39 | dim_reduction_target = 'pca'
40 |
41 | pv_FISH_RNA = PVComputation(
42 | n_factors = n_factors,
43 | n_pv = n_pv,
44 | dim_reduction = dim_reduction,
45 | dim_reduction_target = dim_reduction_target
46 | )
47 |
48 | pv_FISH_RNA.fit(Common_data,Starmap_data_scaled[Common_data.columns])
49 |
50 | fig = plt.figure()
51 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
52 | center=0, vmax=1., vmin=0)
53 | plt.xlabel('Starmap',fontsize=18, color='black')
54 | plt.ylabel('Allen_VISp',fontsize=18, color='black')
55 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
56 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
57 | plt.gca().set_ylim([n_pv_display,0])
58 | plt.show()
59 |
60 | plt.figure()
61 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
62 | center=0, vmax=1., vmin=0)
63 | for i in range(n_pv_display-1):
64 |     plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black')
65 |
66 | plt.xlabel('Starmap',fontsize=18, color='black')
67 | plt.ylabel('Allen_VISp',fontsize=18, color='black')
68 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
69 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
70 | plt.gca().set_ylim([n_pv_display,0])
71 | plt.show()
72 |
73 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns)
74 | Importance.sort_values(ascending=False,inplace=True)
75 | Importance.index[0:50]
76 |
77 | ### Technology specific Processes
78 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
79 |
80 | # explained variance RNA
81 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
82 | # explained variance spatial
83 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
84 |
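The `Effective_n_pv` step counts the principal-vector pairs whose cosine similarity exceeds 0.3 and then keeps the first that many components. The sketch below uses hypothetical similarity values and helper names (`effective_n_pv`, `is_non_increasing`). It shows that counting and then slicing the leading columns only selects the intended vectors when the similarities are in non-increasing order, which is an assumption about the `PVComputation` output and is not verified here.

```python
def effective_n_pv(similarities, threshold=0.3):
    """Count principal-vector pairs whose similarity exceeds the threshold."""
    return sum(1 for s in similarities if s > threshold)


def is_non_increasing(values):
    """True when every value is >= the one that follows it."""
    return all(a >= b for a, b in zip(values, values[1:]))


# Hypothetical diagonal of a cosine similarity matrix, already sorted.
similarities = [0.95, 0.80, 0.42, 0.31, 0.12, 0.05]
k = effective_n_pv(similarities)
kept_columns = list(range(k))  # equivalent of S[:, 0:k]

# With unsorted input the count is still 1, but column 0 would be the wrong one.
unsorted_similarities = [0.2, 0.9]
unsorted_is_safe = is_non_increasing(unsorted_similarities)
```

A cheap guard such as `is_non_increasing` before slicing would make the ordering assumption explicit.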
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/SpaGE/Starmap_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('STARmap_AllenVISp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 | from viz import GetQHulls
9 |
10 | counts = np.load('data/Starmap/visual_1020/20180505_BY3_1kgenes/cell_barcode_count.npy')
11 | Genes = pd.read_csv('data/Starmap/visual_1020/20180505_BY3_1kgenes/genes.csv',header=None)
12 | Genes = (Genes.iloc[:,0])
13 | counts = pd.DataFrame(data=counts,columns=Genes)
14 | Starmap_data = counts.T
15 | del Genes, counts
16 |
17 | cell_count = np.sum(Starmap_data,axis=0)
18 | def Log_Norm(x):
19 |     return np.log(((x/np.sum(x))*np.median(cell_count)) + 1)
20 |
21 | Starmap_data = Starmap_data.apply(Log_Norm,axis=0)
22 | Starmap_data_scaled = pd.DataFrame(data=st.zscore(Starmap_data.T),index = Starmap_data.columns,columns=Starmap_data.index)
23 |
24 | labels = np.load('data/Starmap/visual_1020/20180505_BY3_1kgenes/labels.npz')["labels"]
25 | qhulls,coords = GetQHulls(labels)
26 |
27 | datadict = dict()
28 | datadict['Starmap_data'] = Starmap_data.T
29 | datadict['Starmap_data_scaled'] = Starmap_data_scaled
30 | datadict['labels'] = labels
31 | datadict['qhulls'] = qhulls
32 | datadict['coords'] = coords
33 |
34 | with open('data/SpaGE_pkl/Starmap.pkl','wb') as f:
35 |     pickle.dump(datadict, f)
36 |
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and can be selected
5 | by their string name from the main method. All the methods
6 | use the scikit-learn implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 |     """
24 |     Default linear dimensionality reduction method. For each method, return a
25 |     BaseEstimator instance corresponding to the method given as input.
26 |     Parameters
27 |     -------
28 |     method: str, default to 'pca'
29 |         Method used for dimensionality reduction.
30 |         Implemented: 'pca', 'ica', 'fa' (Factor Analysis),
31 |         'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA),
32 |         'pls' (Partial Least Squares).
33 |     n_dim: int, default to 10
34 |         Number of domain-specific factors to compute.
35 |     Return values
36 |     -------
37 |     Estimator, i.e. BaseEstimator instance (the PLS wrapper class for 'pls')
38 |     """
39 |
40 |     if method.lower() == 'pca':
41 |         clf = PCA(n_components=n_dim)
42 |
43 |     elif method.lower() == 'ica':
44 |         print('ICA')
45 |         clf = FastICA(n_components=n_dim)
46 |
47 |     elif method.lower() == 'fa':
48 |         clf = FactorAnalysis(n_components=n_dim)
49 |
50 |     elif method.lower() == 'nmf':
51 |         clf = NMF(n_components=n_dim)
52 |
53 |     elif method.lower() == 'sparsepca':
54 |         clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 |     elif method.lower() == 'pls':
57 |         clf = PLS(n_components=n_dim)
58 |
59 |     else:
60 |         raise NameError('%s is not an implemented method'%(method))
61 |
62 |     return clf
63 |
64 |
65 | class PLS():
66 |     """
67 |     Implement PLS to make it compliant with the other dimensionality
68 |     reduction methodology.
69 |     (Simple class rewriting).
70 |     """
71 |     def __init__(self, n_components=10):
72 |         self.clf = PLSRegression(n_components=n_components)
73 |
74 |     def get_components_(self):
75 |         return self.clf.x_weights_.transpose()
76 |
77 |     def set_components_(self, x):
78 |         pass
79 |
80 |     components_ = property(get_components_, set_components_)
81 |
82 |     def fit(self, X, y):
83 |         self.clf.fit(X,y)
84 |         return self
85 |
86 |     def transform(self, X):
87 |         return self.clf.transform(X)
88 |
89 |     def predict(self, X):
90 |         return self.clf.predict(X)
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/SpaGE/viz.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 | from scipy.spatial import ConvexHull
3 | from skimage.transform import downscale_local_mean
4 | from matplotlib.patches import Polygon
5 | from matplotlib.collections import PatchCollection
6 | from skimage.measure import regionprops
7 | import numpy as np
8 | import matplotlib.pyplot as plt
9 |
10 | def plot_heatmap_with_labels(data_obj, degenes,cmap,show_axis=True, font_size=10):
11 |     g = plt.GridSpec(2,1, wspace=0.01, hspace=0.01, height_ratios=[0.5,10])
12 |     ax = plt.subplot(g[0])
13 |     ax.imshow(np.expand_dims(np.sort(data_obj._clusts),1).T,aspect='auto',interpolation='none',cmap=cmap)
14 |     ax.axis('off')
15 |     ax = plt.subplot(g[1])
16 |     data_obj.plot_heatmap(list(degenes), fontsize=font_size,use_imshow=False,ax=ax)
17 |     if not show_axis:
18 |         plt.axis('off')
19 |
20 | def GetQHulls(labels):
21 |     labels += 1
22 |     Nlabels = labels.max()
23 |     hulls = []
24 |     coords = []
25 |     num_cells = 0
26 |     # Note: `labels += 1` above modifies the caller's array in place.
27 |     for i in range(Nlabels):#enumerate(regionprops(labels)):
28 |         print(i,"/",Nlabels)
29 |         curr_coords = np.argwhere(labels==i)
30 |         # size threshold: keep regions with > 1000 and < 100000 pixels
31 |         if curr_coords.shape[0] < 100000 and curr_coords.shape[0] > 1000:
32 |             num_cells += 1
33 |             hulls.append(ConvexHull(curr_coords))
34 |             coords.append(curr_coords)
35 |     print("Used %d / %d" % (num_cells, Nlabels))
36 |     return hulls, coords
37 |
38 |
39 | def hull_to_polygon(hull):
40 |     cent = np.mean(hull.points, 0)
41 |     pts = []
42 |     for pt in hull.points[hull.simplices]:
43 |         pts.append(pt[0].tolist())
44 |         pts.append(pt[1].tolist())
45 |     pts.sort(key=lambda p: np.arctan2(p[1] - cent[1],
46 |                                     p[0] - cent[0]))
47 |     pts = pts[0::2] # Deleting duplicates
48 |     pts.insert(len(pts), pts[0])
49 |     k = 1.1
50 |     poly = Polygon(k*(np.array(pts)- cent) + cent,edgecolor='k', linewidth=1)
51 |     #poly.set_capstyle('round')
52 |     return poly
53 |
54 | def plot_poly_cells_expression(nissl, hulls, expr, cmap, good_cells=None,width=2, height=9,figscale=10,alpha=1):
55 |     figscale = 10
56 |     plt.figure(figsize=(figscale*width/float(height),figscale))
57 |     polys = [hull_to_polygon(h) for h in hulls]
58 |     if good_cells is not None:
59 |         polys = [p for i,p in enumerate(polys) if i in good_cells]
60 |     p = PatchCollection(polys,alpha=alpha, cmap=cmap,linewidths=0)
61 |     p.set_array(expr)
62 |     p.set_clim(vmin=0, vmax=expr.max())
63 |     plt.gca().add_collection(p)
64 |     plt.imshow(nissl.T, cmap=plt.cm.gray_r,alpha=0.15)
65 |     plt.axis('off')
66 |
67 | def plot_poly_cells_expression_99(nissl, hulls, expr, cmap, good_cells=None,width=2, height=9,figscale=10,alpha=1):
68 |     figscale = 10
69 |     plt.figure(figsize=(figscale*width/float(height),figscale))
70 |     polys = [hull_to_polygon(h) for h in hulls]
71 |     if good_cells is not None:
72 |         polys = [p for i,p in enumerate(polys) if i in good_cells]
73 |     p = PatchCollection(polys,alpha=alpha, cmap=cmap,linewidths=0)
74 |     p.set_array(expr)
75 |     p.set_clim(vmin=0, vmax=np.percentile(expr,99))
76 |     plt.gca().add_collection(p)
77 |     plt.imshow(nissl.T, cmap=plt.cm.gray_r,alpha=0.15)
78 |     plt.axis('off')
79 |
80 | def plot_poly_cells_expression_40(nissl, hulls, expr, cmap, good_cells=None,width=2, height=9,figscale=10,alpha=1):
81 |     figscale = 10
82 |     plt.figure(figsize=(figscale*width/float(height),figscale))
83 |     polys = [hull_to_polygon(h) for h in hulls]
84 |     if good_cells is not None:
85 |         polys = [p for i,p in enumerate(polys) if i in good_cells]
86 |     p = PatchCollection(polys,alpha=alpha, cmap=cmap,linewidths=0)
87 |     p.set_array(expr)
88 |     p.set_clim(vmin=np.percentile(expr,40), vmax=np.percentile(expr,99))
89 |     plt.gca().add_collection(p)
90 |     plt.imshow(nissl.T, cmap=plt.cm.gray_r,alpha=0.15)
91 |     plt.axis('off')
92 |
93 | def plot_poly_cells_cluster(nissl, hulls, colors, cmap, good_cells=None,width=2, height=9,figscale=10, rescale_colors=False,alpha=1,vmin=None,vmax=None):
94 |     figscale = 10
95 |     plt.figure(figsize=(figscale*width/float(height),figscale))
96 |     polys = [hull_to_polygon(h) for h in hulls]
97 |     if good_cells is not None:
98 |         polys = [p for i,p in enumerate(polys) if i in good_cells]
99 |     p = PatchCollection(polys,alpha=alpha, cmap=cmap,edgecolor='k', linewidth=0.5)
100 |     if vmin is not None or vmax is not None:
101 |         p.set_array(colors)
102 |         p.set_clim(vmin=vmin,vmax=vmax)
103 |     else:
104 |         if rescale_colors:
105 |             p.set_array(colors+1)
106 |             p.set_clim(vmin=0, vmax=max(colors+1))
107 |         else:
108 |             p.set_array(colors)
109 |             p.set_clim(vmin=0, vmax=max(colors))
110 |     nissl = (nissl > 0).astype(int)
111 |     plt.imshow(nissl.T,cmap=plt.cm.gray_r,alpha=0.15)
112 |     plt.gca().add_collection(p)
113 |     plt.axis('off')
114 |     return polys
115 |
116 | def plot_cells_cluster(nissl, coords, good_cells, colors, cmap, width=2, height=9,figscale=100, vmin=None,vmax=None):
117 |     figscale = 10
118 |     plt.figure(figsize=(figscale*width/float(height),figscale))
119 |     img = -1*np.ones_like(nissl)
120 |     curr_coords = [coords[k] for k in range(len(coords)) if k in good_cells]
121 |     for i,c in enumerate(curr_coords):
122 |         for k in c:
123 |             if k[0] < img.shape[0] and k[1] < img.shape[1]:
124 |                 img[k[0],k[1]] = colors[i]
125 |     plt.imshow(img.T,cmap=cmap,vmin=-1,vmax=colors.max())
126 |     plt.axis('off')
127 |
128 | def get_cells_and_clusts_for_experiment(analysis_obj, expt_id):
129 |     good_cells = analysis_obj._meta.index[(analysis_obj._meta["orig_ident"]==expt_id)].values
130 |     colors = analysis_obj._clusts[analysis_obj._meta["orig_ident"]==expt_id]
131 |     return good_cells, colors
132 |
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/Starmap_plots.R:
--------------------------------------------------------------------------------
1 | setwd("STARmap_AllenVISp/")
2 | library(Seurat)
3 | library(ggplot2)
4 |
5 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds")
6 | starmap.imputed <- readRDS("data/seurat_objects/20180505_BY3_1kgenes_imputed.rds")
7 | allen <- readRDS("data/seurat_objects/allen_brain.rds")
8 |
9 | DefaultAssay(starmap.imputed) <- "RNA"
10 | genes.leaveout <- intersect(rownames(starmap),rownames(allen))
11 |
12 | # Original
13 | for(i in 1:length(genes.leaveout)) {
14 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) {
15 | qcut = 'q40'
16 | } else {
17 | qcut = 0
18 | }
19 | p <- PolyFeaturePlot(starmap.imputed, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme()
20 | ggsave(plot = p, filename = paste0('Figures/Original/', genes.leaveout[[i]],'.pdf'))
21 | }
22 |
23 | # Seurat_Predicted
24 | Seurat_Predicted <- read.csv(file = 'Results/Seurat_LeaveOneOut.csv',header = TRUE, check.names = FALSE, row.names = 1)
25 | Seurat_Predicted <- Seurat_Predicted[genes.leaveout,]
26 | colnames(Seurat_Predicted) <- colnames(starmap.imputed@assays$ss2[,])
27 | starmap.imputed[['ss2']] <- CreateAssayObject(data = as.matrix(Seurat_Predicted))
28 | DefaultAssay(starmap.imputed) <- 'ss2'
29 | for(i in 1:length(genes.leaveout)) {
30 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) {
31 | qcut = 'q40'
32 | } else {
33 | qcut = 0
34 | }
35 | p <- PolyFeaturePlot(starmap.imputed, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme()
36 | ggsave(plot = p, filename = paste0('Figures/Seurat_Predicted/', genes.leaveout[[i]],'.pdf'))
37 | }
38 |
39 | # SpaGE_Predicted
40 | SpaGE_Predicted <- read.csv(file = 'Results/SpaGE_LeaveOneOut.csv',header = TRUE, check.names = FALSE, row.names = 1)
41 | SpaGE_Predicted <- t(SpaGE_Predicted)
42 | SpaGE_Predicted <- SpaGE_Predicted[genes.leaveout,]
43 | colnames(SpaGE_Predicted) <- colnames(starmap@assays$RNA[,])
44 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(SpaGE_Predicted))
45 | DefaultAssay(starmap) <- 'ss2'
46 | for(i in 1:length(genes.leaveout)) {
47 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) {
48 | qcut = 'q40'
49 | } else {
50 | qcut = 0
51 | }
52 | p <- PolyFeaturePlot(starmap, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme()
53 | ggsave(plot = p, filename = paste0('Figures/SpaGE_Predicted/', genes.leaveout[[i]],'.pdf'))
54 | }
55 |
56 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds")
57 |
58 | # Liger_Predicted
59 | Liger_Predicted <- read.csv(file = 'Results/Liger_LeaveOneOut.csv',header = TRUE, check.names = FALSE, row.names = 1)
60 | Liger_Predicted <- Liger_Predicted[genes.leaveout,]
61 | colnames(Liger_Predicted) <- colnames(starmap@assays$RNA[,])
62 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(Liger_Predicted))
63 | DefaultAssay(starmap) <- 'ss2'
64 | for(i in 1:length(genes.leaveout)) {
65 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) {
66 | qcut = 'q40'
67 | } else {
68 | qcut = 0
69 | }
70 | p <- PolyFeaturePlot(starmap, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme()
71 | ggsave(plot = p, filename = paste0('Figures/Liger_Predicted/', genes.leaveout[[i]],'.pdf'))
72 | }
73 |
74 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds")
75 |
76 | # gimVI_Predicted
77 | gimVI_Predicted <- read.csv(file = 'Results/gimVI_LeaveOneOut.csv',header = TRUE, check.names = FALSE, row.names = 1)
78 | gimVI_Predicted <- t(gimVI_Predicted)
79 | gimVI_Predicted <- gimVI_Predicted[toupper(genes.leaveout),]
80 | colnames(gimVI_Predicted) <- colnames(starmap@assays$RNA[,])
81 | rownames(gimVI_Predicted) <- genes.leaveout
82 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(gimVI_Predicted))
83 | DefaultAssay(starmap) <- 'ss2'
84 | for(i in 1:length(genes.leaveout)) {
85 | if (genes.leaveout[[i]] %in% c('Bsg', 'Rab3c')) {
86 | qcut = 'q40'
87 | } else {
88 | qcut = 0
89 | }
90 | p <- PolyFeaturePlot(starmap, genes.leaveout[[i]], flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme()
91 | ggsave(plot = p, filename = paste0('Figures/gimVI_Predicted/', genes.leaveout[[i]],'.pdf'))
92 | }
93 |
94 |
95 | ### New genes
96 | # show genes not in the starmap dataset
97 | new.genes <- c('Tesc', 'Pvrl3', 'Sox10', 'Grm2', 'Tcrb')
98 | starmap.imputed <- readRDS("data/seurat_objects/20180505_BY3_1kgenes_imputed.rds")
99 | DefaultAssay(starmap.imputed) <- "ss2"
100 |
101 | for(i in new.genes) {
102 | p <- PolyFeaturePlot(starmap.imputed, i, flip.coords = TRUE, min.cutoff = 0, max.cutoff = 'q99') + SpatialTheme()
103 | ggsave(plot = p, filename = paste0('Figures/Seurat_Predicted/New_', i,'.pdf'))
104 | }
105 |
106 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds")
107 | #SpaGE
108 | SpaGE_New <- read.csv(file = 'Results/SpaGE_New_genes.csv',header = TRUE, check.names = FALSE, row.names = 1)
109 | SpaGE_New <- t(SpaGE_New)
110 | new.genes <- c('Tesc', 'Pvrl3', 'Sox10', 'Grm2', 'Tcrb','Ttyh2','Cldn11','Tmem88b')
111 | SpaGE_New2 <- SpaGE_New[new.genes,]
112 | colnames(SpaGE_New2) <- colnames(starmap@assays$RNA[,])
113 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(SpaGE_New2))
114 | DefaultAssay(starmap) <- 'ss2'
115 | for(i in new.genes) {
116 | p <- PolyFeaturePlot(starmap, i, flip.coords = TRUE, min.cutoff = 0, max.cutoff = 'q99') + SpatialTheme()
117 | ggsave(plot = p, filename = paste0('Figures/SpaGE_Predicted/New_', i,'.pdf'))
118 | }
119 |
120 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds")
121 | # Liger
122 | Liger_New <- read.csv(file = 'Results/Liger_New_genes.csv',header = TRUE, check.names = FALSE, row.names = 1)
123 | Liger_New <- Liger_New[new.genes,]
124 | colnames(Liger_New) <- colnames(starmap@assays$RNA[,])
125 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(Liger_New))
126 | DefaultAssay(starmap) <- 'ss2'
127 | for(i in new.genes) {
128 | p <- PolyFeaturePlot(starmap, i, flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme()
129 | ggsave(plot = p, filename = paste0('Figures/Liger_Predicted/New_', i,'.pdf'))
130 | }
131 |
132 | starmap <- readRDS("data/seurat_objects/20180505_BY3_1kgenes.rds")
133 | # gimVI
134 | gimVI_New <- read.csv(file = 'Results/gimVI_New_genes.csv',header = TRUE, check.names = FALSE, row.names = 1)
135 | gimVI_New <- t(gimVI_New)
136 | gimVI_New <- gimVI_New[toupper(new.genes),]
137 | colnames(gimVI_New) <- colnames(starmap@assays$RNA[,])
138 | rownames(gimVI_New) <- new.genes
139 | starmap[['ss2']] <- CreateAssayObject(data = as.matrix(gimVI_New))
140 | DefaultAssay(starmap) <- 'ss2'
141 | for(i in new.genes) {
142 | p <- PolyFeaturePlot(starmap, i, flip.coords = TRUE, min.cutoff = qcut, max.cutoff = 'q99') + SpatialTheme()
143 | ggsave(plot = p, filename = paste0('Figures/gimVI_Predicted/New_', i,'.pdf'))
144 | }
--------------------------------------------------------------------------------
/benchmark/STARmap_AllenVISp/gimVI/gimVI.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('STARmap_AllenVISp/')
3 |
4 | from scvi.dataset import CsvDataset
5 | from scvi.models import JVAE, Classifier
6 | from scvi.inference import JVAETrainer
7 | import numpy as np
8 | import pandas as pd
9 | import copy
10 | import torch
11 | import time as tm
12 |
13 | ### STARmap data
14 | Starmap_data = CsvDataset('data/gimVI_data/STARmap_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
15 |
16 | ### AllenVISp
17 | RNA_data = CsvDataset('data/gimVI_data/Allen_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
18 |
19 | ### Leave-one-out validation
20 | Gene_set = np.intersect1d(Starmap_data.gene_names,RNA_data.gene_names)
21 | gene_idx = np.reshape(np.vstack([np.argwhere(i==Starmap_data.gene_names) for i in Gene_set]),-1)  # look up columns before gene_names is overwritten
22 | Starmap_data.gene_names, Starmap_data.X = Gene_set, Starmap_data.X[:,gene_idx]
23 | Common_data = copy.deepcopy(RNA_data)
24 | Common_data.gene_names = Gene_set
25 | Common_data.X = Common_data.X[:,np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in Gene_set]),-1)]
26 | Imp_Genes = pd.DataFrame(columns=Gene_set)
27 | gimVI_time = []
28 |
29 | for i in Gene_set:
30 |     print(i)
31 |     # Create copy of the fish dataset with hidden genes
32 |     data_spatial_partial = copy.deepcopy(Starmap_data)
33 |     data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(Starmap_data.gene_names,i))
34 |     data_spatial_partial.batch_indices += Common_data.n_batches
35 |
36 |     datasets = [Common_data, data_spatial_partial]
37 |     generative_distributions = ["zinb", "nb"]
38 |     gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(g==Common_data.gene_names) for g in data_spatial_partial.gene_names]),-1)]
39 |     n_inputs = [d.nb_genes for d in datasets]
40 |     total_genes = Common_data.nb_genes
41 |     n_batches = sum([d.n_batches for d in datasets])
42 |
43 |     model_library_size = [True, False]
44 |
45 |     n_latent = 8
46 |     kappa = 5
47 |
48 |     start = tm.time()
49 |     torch.manual_seed(0)
50 |
51 |     model = JVAE(
52 |         n_inputs,
53 |         total_genes,
54 |         gene_mappings,
55 |         generative_distributions,
56 |         model_library_size,
57 |         n_layers_decoder_individual=0,
58 |         n_layers_decoder_shared=0,
59 |         n_layers_encoder_individual=1,
60 |         n_layers_encoder_shared=1,
61 |         dim_hidden_encoder=64,
62 |         dim_hidden_decoder_shared=64,
63 |         dropout_rate_encoder=0.2,
64 |         dropout_rate_decoder=0.2,
65 |         n_batch=n_batches,
66 |         n_latent=n_latent,
67 |     )
68 |     discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
69 |
70 |     trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
71 |     trainer.train(n_epochs=200)
72 |     _,Imputed = trainer.get_imputed_values(normalized=True)
73 |     Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1)
74 |     Imp_Genes[i] = Imputed
75 |     gimVI_time.append(tm.time()-start)
76 |
77 | Imp_Genes = Imp_Genes.fillna(0)
78 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv')
79 | gimVI_time = pd.DataFrame(gimVI_time)
80 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False)
81 |
82 | ### New genes
83 | Imp_New_Genes = pd.DataFrame(columns=["TESC","PVRL3","SOX10","GRM2","TCRB"])
84 |
85 | # Create copy of the fish dataset (all genes are kept; nothing is hidden here)
86 | data_spatial_partial = copy.deepcopy(Starmap_data)
87 | data_spatial_partial.filter_genes_by_attribute(Starmap_data.gene_names)
88 | data_spatial_partial.batch_indices += RNA_data.n_batches
89 |
90 | datasets = [RNA_data, data_spatial_partial]
91 | generative_distributions = ["zinb", "nb"]
92 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in data_spatial_partial.gene_names]),-1)]
93 | n_inputs = [d.nb_genes for d in datasets]
94 | total_genes = RNA_data.nb_genes
95 | n_batches = sum([d.n_batches for d in datasets])
96 |
97 | model_library_size = [True, False]
98 |
99 | n_latent = 8
100 | kappa = 5
101 |
102 | torch.manual_seed(0)
103 |
104 | model = JVAE(
105 | n_inputs,
106 | total_genes,
107 | gene_mappings,
108 | generative_distributions,
109 | model_library_size,
110 | n_layers_decoder_individual=0,
111 | n_layers_decoder_shared=0,
112 | n_layers_encoder_individual=1,
113 | n_layers_encoder_shared=1,
114 | dim_hidden_encoder=64,
115 | dim_hidden_decoder_shared=64,
116 | dropout_rate_encoder=0.2,
117 | dropout_rate_decoder=0.2,
118 | n_batch=n_batches,
119 | n_latent=n_latent,
120 | )
121 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
122 |
123 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
124 | trainer.train(n_epochs=200)
125 |
126 | for i in ["TESC","PVRL3","SOX10","GRM2","TCRB"]:
127 |     _,Imputed = trainer.get_imputed_values(normalized=True)
128 |     Imputed = np.reshape(Imputed[:,np.argwhere(i==RNA_data.gene_names)[0]],-1)
129 |     Imp_New_Genes[i] = Imputed
130 |
131 | Imp_New_Genes.to_csv('Results/gimVI_New_genes.csv')
132 |
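The `gene_mappings` entries in this script record, for each gene measured in the spatial dataset, the column it occupies in the reference gene list. A minimal plain-Python sketch of that lookup follows. `gene_index_mapping` is a hypothetical helper (it is not part of scVI or of this repository), and it assumes gene names are unique in the reference.

```python
def gene_index_mapping(reference_genes, query_genes):
    """Return, for each query gene, its column index in the reference list.

    Hypothetical helper for illustration only. Assumes every query gene
    occurs exactly once in reference_genes; a missing gene raises KeyError.
    """
    # Build the name -> position table once instead of scanning per gene.
    position = {gene: idx for idx, gene in enumerate(reference_genes)}
    return [position[gene] for gene in query_genes]


if __name__ == "__main__":
    reference = ["CUX2", "GRM2", "SOX10", "TESC"]
    spatial = ["SOX10", "CUX2"]
    print(gene_index_mapping(reference, spatial))
```

A dictionary lookup fails loudly on a gene that is absent from the reference, whereas stacking `np.argwhere` results silently contributes an empty row for it.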
--------------------------------------------------------------------------------
/benchmark/Timing_Evaluation.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import matplotlib
4 | matplotlib.rcParams['pdf.fonttype'] = 42
5 | matplotlib.rcParams['ps.fonttype'] = 42
6 | import matplotlib.pyplot as plt
7 | from matplotlib.lines import Line2D
8 |
9 | ### Timing
10 | experiments = ['STARmap_AllenVISp','osmFISH_Ziesel','osmFISH_AllenVISp',
11 | 'osmFISH_AllenSSp','MERFISH_Moffit']
12 | # SpaGE
13 | SpaGE_Avg_Time = pd.Series(index = experiments, dtype = float)
14 | for i in experiments:
15 |     SpaGE_Precise_Time = np.array(pd.read_csv(i + '/Results/SpaGE_PreciseTime.csv',
16 |                                               header=0,sep=',')).reshape(-1)
17 |     SpaGE_knn_Time = np.array(pd.read_csv(i +'/Results/SpaGE_knnTime.csv',
18 |                                           header=0,sep=',')).reshape(-1)
19 |
20 |     SpaGE_total_time = SpaGE_Precise_Time + SpaGE_knn_Time
21 |     SpaGE_Avg_Time[i] = np.mean(SpaGE_total_time)
22 |     del SpaGE_Precise_Time,SpaGE_knn_Time,SpaGE_total_time
23 |
24 | # gimVI
25 | gimVI_Avg_Time = pd.Series(index = experiments, dtype = float)
26 | for i in experiments:
27 |     gimVI_Time = np.array(pd.read_csv(i+'/Results/gimVI_Time.csv',header=0,sep=',')).reshape(-1)
28 |     gimVI_Avg_Time[i] = np.mean(gimVI_Time)
29 |     del gimVI_Time
30 |
31 | # Seurat
32 | Seurat_Avg_Time = pd.Series(index = experiments, dtype = float)
33 | for i in experiments:
34 |     Seurat_anchor_Time = np.array(pd.read_csv(i+'/Results/Seurat_anchor_time.csv',
35 |                                               header=0,sep=',')).reshape(-1)
36 |     Seurat_transfer_Time = np.array(pd.read_csv(i+'/Results/Seurat_transfer_time.csv',
37 |                                                 header=0,sep=',')).reshape(-1)
38 |
39 |     Seurat_total_time = Seurat_anchor_Time + Seurat_transfer_Time
40 |     Seurat_Avg_Time[i] = np.mean(Seurat_total_time)
41 |     del Seurat_anchor_Time,Seurat_transfer_Time,Seurat_total_time
42 |
43 | # Liger
44 | Liger_Avg_Time = pd.Series(index = experiments, dtype = float)
45 | for i in experiments:
46 |     Liger_NMF_Time = np.array(pd.read_csv(i+'/Results/Liger_NMF_time.csv',
47 |                                           header=0,sep=',')).reshape(-1)
48 |     Liger_knn_Time = np.array(pd.read_csv(i+'/Results/Liger_knn_time.csv',
49 |                                           header=0,sep=',')).reshape(-1)
50 |
51 |     Liger_total_time = Liger_NMF_Time + Liger_knn_Time
52 |     Liger_Avg_Time[i] = np.mean(Liger_total_time)
53 |     del Liger_NMF_Time,Liger_knn_Time,Liger_total_time
54 |
55 | plt.style.use('ggplot')
56 | plt.figure(figsize=(9, 3))
57 | plt.plot([1,2,3,4,5],SpaGE_Avg_Time,color='black',marker='s',linewidth=3)
58 | plt.plot([1,2,3,4,5],Seurat_Avg_Time,color='blue',marker='s',linewidth=3)
59 | plt.plot([1,2,3,4,5],Liger_Avg_Time,color='red',marker='s',linewidth=3)
60 | plt.plot([1,2,3,4,5],gimVI_Avg_Time,color='purple',marker='s',linewidth=3)
61 | #plt.yscale('log')
62 | plt.xticks((1,2,3,4,5),('STARmap_AllenVISp\n(1549,14249)','osmFISH_Zeisel\n(3405,1691)',
63 |            'osmFISH_AllenVISp\n(3405,14249)','osmFISH_AllenSSp\n(3405,5613)',
64 |            'MERFISH_Moffit\n(64373,31299)'),size=10)
65 | plt.yticks(size=8)
66 | plt.ylabel('Average computation time (seconds)',size=12)
67 | colors = ['black','blue', 'red', 'purple']
68 | lines = [Line2D([0], [0], color=c, linewidth=3, marker='s') for c in colors]
69 | labels = ['SpaGE','Seurat', 'Liger','gimVI']
70 | plt.legend(lines, labels)
71 | plt.show()
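Each method's reported time in this script is the mean, over leave-one-out runs, of the sum of its two stage timings (for example projection plus kNN for SpaGE). A small stdlib-only sketch of that computation follows. `average_total_time` is a hypothetical helper, and it assumes both stage lists hold one entry per run in the same order.

```python
from statistics import mean


def average_total_time(stage_a_seconds, stage_b_seconds):
    """Mean per-run total time for a two-stage method.

    Hypothetical helper for illustration only. Both lists must contain
    one timing per run, in the same run order.
    """
    if len(stage_a_seconds) != len(stage_b_seconds):
        raise ValueError("stage timing lists must have the same length")
    # Sum the two stages per run first, then average across runs.
    return mean(a + b for a, b in zip(stage_a_seconds, stage_b_seconds))


if __name__ == "__main__":
    print(average_total_time([1.0, 2.0], [3.0, 4.0]))
```

The explicit length check makes a mismatch between the two timing files visible instead of relying on array broadcasting rules.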
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/Liger/LIGER.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenSSp/")
2 | library(liger)
3 | library(hdf5r)
4 | library(methods)
5 |
6 | # Allen SSp
7 | allen <- read.table(file = "data/Allen_SSp/SSp_exons_matrix.csv",
8 | row.names = 1, sep = ',', stringsAsFactors = FALSE, header = TRUE)
9 | allen <- as.matrix(x = allen)
10 |
11 | Genes_count = rowSums(allen > 0)
12 | allen <- allen[Genes_count>=10,]
13 |
14 | meta.data <- read.csv(file = "data/Allen_SSp/AllenSSp_metadata.csv",
15 | row.names = 1, stringsAsFactors = FALSE)
16 | ok.cells <- colnames(allen[,meta.data$class_label != 'Exclude'])
17 | allen <- allen[,ok.cells]
18 |
19 | # osmFISH
20 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom")
21 | mat <- osm[['matrix']][,]
22 | colnames(mat) <- osm[['row_attrs']][['Gene']][]
23 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][])
24 | x_dim <- osm[['col_attrs']][['X']][]
25 | y_dim <- osm[['col_attrs']][['Y']][]
26 | region <- osm[['col_attrs']][['Region']][]
27 | cluster <- osm[['col_attrs']][['ClusterName']][]
28 | osm$close_all()
29 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim)
30 | rownames(spatial) <- rownames(mat)
31 | spatial <- as.matrix(spatial)
32 | osmFISH <- t(mat)
33 | osmFISH <- osmFISH[,!is.element(region,c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))]
34 |
35 | Gene_set <- intersect(rownames(osmFISH),rownames(allen))
36 |
37 | #### New genes prediction
38 | Ligerex <- createLiger(list(SMSC_FISH = osmFISH, SMSC_RNA = allen))
39 | Ligerex <- normalize(Ligerex)
40 | Ligerex@var.genes <- Gene_set
41 | Ligerex <- scaleNotCenter(Ligerex)
42 |
43 | # suggestK(Ligerex, k.test= seq(5,30,5)) # K = 20
44 | # suggestLambda(Ligerex, k = 20) # Lambda = 20
45 |
46 | Ligerex <- optimizeALS(Ligerex,k = 20, lambda = 20)
47 | Ligerex <- quantileAlignSNF(Ligerex)
48 |
49 | Imputation <- imputeKNN(Ligerex,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE)
50 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2')
51 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(Imputation@norm.data$SMSC_FISH)[2])
52 | rownames(Imp_New_genes) <- new.genes
53 | for(i in 1:length(new.genes)) {
54 | Imp_New_genes[new.genes[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[new.genes[i],])
55 | }
56 | write.csv(Imp_New_genes,file = 'Results/Liger_New_genes.csv')
57 |
58 | # leave-one-out validation
59 | genes.leaveout <- intersect(rownames(osmFISH),rownames(allen))
60 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH)[2])
61 | rownames(Imp_genes) <- genes.leaveout
62 | colnames(Imp_genes) <- colnames(osmFISH)
63 | NMF_time <- vector(mode= "numeric")
64 | knn_time <- vector(mode= "numeric")
65 |
66 | for(i in 1:length(genes.leaveout)) {
67 | print(i)
68 | start_time <- Sys.time()
69 | Ligerex.leaveout <- createLiger(list(SMSC_FISH = osmFISH[-which(rownames(osmFISH) %in% genes.leaveout[i]),], SMSC_RNA = allen[rownames(osmFISH),]))
70 | Ligerex.leaveout <- normalize(Ligerex.leaveout)
71 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i])
72 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout)
73 | if(dim(Ligerex.leaveout@norm.data$SMSC_FISH)[2]!=dim(osmFISH)[2]){
74 | next
75 | }
76 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 20, lambda = 20)
77 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout)
78 | end_time <- Sys.time()
79 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
80 |
81 | start_time <- Sys.time()
82 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE, knn_k = 30)
83 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[genes.leaveout[i],])
84 | end_time <- Sys.time()
85 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
86 | }
87 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv')
88 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE)
89 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE)
90 |
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/Seurat/allen_brain.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenSSp/")
2 | library(Seurat)
3 |
4 | allen <- read.table(file = "data/Allen_SSp/SSp_exons_matrix.csv",
5 | row.names = 1,sep = ',', stringsAsFactors = FALSE, header = TRUE)
6 | allen <- as.matrix(x = allen)
7 |
8 | meta.data <- read.csv(file = "data/Allen_SSp/AllenSSp_metadata.csv",
9 | row.names = 1, stringsAsFactors = FALSE)
10 |
11 | al <- CreateSeuratObject(counts = allen, project = 'SSp', min.cells = 10)
12 | ok.cells <- colnames(al[,meta.data$class_label != 'Exclude'])
13 | al <- al[, ok.cells]
14 | al <- NormalizeData(object = al)
15 | al <- FindVariableFeatures(object = al, nfeatures = 2000)
16 | al <- ScaleData(object = al)
17 | al <- RunPCA(object = al, npcs = 50, verbose = FALSE)
18 | al <- RunUMAP(object = al, dims = 1:50, n.neighbors = 5)
19 | saveRDS(object = al, file = paste0("data/seurat_objects/","allen_brain_SSp.rds"))
20 |
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/Seurat/impute_osmFISH.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenSSp/")
2 | library(Seurat)
3 |
4 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds")
5 | allen <- readRDS("data/seurat_objects/allen_brain_SSp.rds")
6 |
7 | #Project on allen labels
8 | i2 <- FindTransferAnchors(
9 | reference = allen,
10 | query = osmFISH,
11 | features = rownames(osmFISH),
12 | reduction = 'cca',
13 | reference.assay = 'RNA',
14 | query.assay = 'RNA'
15 | )
16 |
17 | refdata <- GetAssayData(
18 | object = allen,
19 | assay = 'RNA',
20 | slot = 'data'
21 | )
22 |
23 | imputation <- TransferData(
24 | anchorset = i2,
25 | refdata = refdata,
26 | weight.reduction = 'pca'
27 | )
28 |
29 | osmFISH[['ss2']] <- imputation
30 | saveRDS(osmFISH, 'data/seurat_objects/osmFISH_Cortex_imputed.rds')
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/Seurat/osmFISH.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenSSp/")
2 | library(Seurat)
3 | library(hdf5r)
4 | library(methods)
5 |
6 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom")
7 | mat <- osm[['matrix']][,]
8 | colnames(mat) <- osm[['row_attrs']][['Gene']][]
9 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][])
10 | x_dim <- osm[['col_attrs']][['X']][]
11 | y_dim <- osm[['col_attrs']][['Y']][]
12 | region <- osm[['col_attrs']][['Region']][]
13 | cluster <- osm[['col_attrs']][['ClusterName']][]
14 | osm$close_all()
15 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim)
16 | rownames(spatial) <- rownames(mat)
17 | spatial <- as.matrix(spatial)
18 | mat <- t(mat)
19 |
20 | osm_seurat <- CreateSeuratObject(counts = mat, project = 'osmFISH', assay = 'RNA', min.cells = -1, min.features = -1)
21 | names(region) <- colnames(osm_seurat)
22 | names(cluster) <- colnames(osm_seurat)
23 | osm_seurat <- AddMetaData(osm_seurat, region, col.name = 'region')
24 | osm_seurat <- AddMetaData(osm_seurat, cluster, col.name = 'cluster')
25 | osm_seurat[['spatial']] <- CreateDimReducObject(embeddings = spatial, key = 'spatial', assay = 'RNA')
26 | Idents(osm_seurat) <- 'region'
27 | osm_seurat <- SubsetData(osm_seurat, ident.remove = c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))
28 | total.counts = colSums(x = as.matrix(osm_seurat@assays$RNA@counts))
29 | osm_seurat <- NormalizeData(osm_seurat, scale.factor = median(x = total.counts))
30 | osm_seurat <- ScaleData(osm_seurat)
31 | saveRDS(object = osm_seurat, file = 'data/seurat_objects/osmFISH_Cortex.rds')
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/Seurat/osmFISH_integration.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenSSp/")
2 | library(Seurat)
3 | library(ggplot2)
4 |
5 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds")
6 | osmFISH.imputed <- readRDS("data/seurat_objects/osmFISH_Cortex_imputed.rds")
7 | allen <- readRDS("data/seurat_objects/allen_brain_SSp.rds")
8 |
9 | genes.leaveout <- intersect(rownames(osmFISH),rownames(allen))
10 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH@assays$RNA)[2])
11 | rownames(Imp_genes) <- genes.leaveout
12 | anchor_time <- vector(mode= "numeric")
13 | Transfer_time <- vector(mode= "numeric")
14 |
15 | run_imputation <- function(ref.obj, query.obj, feature.remove) {
16 | message(paste0('removing ', feature.remove))
17 | features <- setdiff(rownames(query.obj), feature.remove)
18 | DefaultAssay(ref.obj) <- 'RNA'
19 | DefaultAssay(query.obj) <- 'RNA'
20 |
21 | start_time <- Sys.time()
22 | anchors <- FindTransferAnchors(
23 | reference = ref.obj,
24 | query = query.obj,
25 | features = features,
26 | dims = 1:30,
27 | reduction = 'cca'
28 | )
29 | end_time <- Sys.time()
30 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
31 |
32 | refdata <- GetAssayData(
33 | object = ref.obj,
34 | assay = 'RNA',
35 | slot = 'data'
36 | )
37 |
38 | start_time <- Sys.time()
39 | imputation <- TransferData(
40 | anchorset = anchors,
41 | refdata = refdata,
42 | weight.reduction = 'pca'
43 | )
44 | query.obj[['seq']] <- imputation
45 | end_time <- Sys.time()
46 | Transfer_time <<- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
47 | return(query.obj)
48 | }
49 |
50 | for(i in 1:length(genes.leaveout)) {
51 | imputed.ss2 <- run_imputation(ref.obj = allen, query.obj = osmFISH, feature.remove = genes.leaveout[[i]])
52 | osmFISH[['ss2']] <- imputed.ss2[, colnames(osmFISH)][['seq']]
53 | Imp_genes[genes.leaveout[[i]],] = as.vector(osmFISH@assays$ss2[genes.leaveout[i],])
54 | }
55 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv')
56 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE)
57 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE)
58 |
59 | # show genes not in the osmFISH dataset
60 | DefaultAssay(osmFISH.imputed) <- "ss2"
61 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2')
62 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(osmFISH.imputed@assays$ss2)[2])
63 | rownames(Imp_New_genes) <- new.genes
64 | for(i in 1:length(new.genes)) {
65 | Imp_New_genes[new.genes[[i]],] = as.vector(osmFISH.imputed@assays$ss2[new.genes[i],])
66 | }
67 | write.csv(Imp_New_genes,file = 'Results/Seurat_New_genes.csv')
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/SpaGE/Allen_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenSSp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 |
9 | RNA_data = pd.read_csv('data/Allen_SSp/SSp_exons_matrix.csv',
10 | header=0,index_col=0,sep=',')
11 |
12 | # filter lowly expressed genes
13 | Genes_count = np.sum(RNA_data > 0, axis=1)
14 | RNA_data = RNA_data.loc[Genes_count >=10,:]
15 | del Genes_count
16 |
17 | # filter low quality cells
18 | meta_data = pd.read_csv('data/Allen_SSp/AllenSSp_metadata.csv',
19 | header=0,sep=',')
20 | HighQualityCells = meta_data['class_label'] != 'Exclude'
21 | RNA_data = RNA_data.iloc[:,np.where(HighQualityCells)[0]]
22 | del meta_data, HighQualityCells
23 |
24 | def Log_Norm(x):
25 |     return np.log(((x/np.sum(x))*1000000) + 1)
26 |
27 | RNA_data = RNA_data.apply(Log_Norm,axis=0)
28 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index)
29 |
30 | datadict = dict()
31 | datadict['RNA_data'] = RNA_data.T
32 | datadict['RNA_data_scaled'] = RNA_data_scaled
33 |
34 | with open('data/SpaGE_pkl/Allen_SSp.pkl','wb') as f:
35 |     pickle.dump(datadict, f)
36 |
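The preprocessing in this script has two steps: each cell is scaled to counts per million and log-transformed (`Log_Norm`), and each gene is then z-scored across cells (`st.zscore`, which uses the population standard deviation by default). The sketch below reproduces both steps on plain lists. The function names and the `scale` parameter are illustrative assumptions, and the code assumes every cell has a non-zero total count and every gene a non-zero variance.

```python
import math
from statistics import mean, pstdev


def log_norm(cell_counts, scale=1_000_000):
    """log(1 + counts scaled to `scale` per cell); assumes a non-zero total."""
    total = sum(cell_counts)
    return [math.log(c / total * scale + 1) for c in cell_counts]


def zscore(gene_values):
    """Centre and scale one gene across cells (population std, ddof = 0)."""
    mu = mean(gene_values)
    sigma = pstdev(gene_values)
    return [(v - mu) / sigma for v in gene_values]


if __name__ == "__main__":
    # One cell with three genes, scaled to 4 counts so the numbers stay small.
    print(log_norm([0, 1, 3], scale=4))
    # One gene measured in three cells.
    print(zscore([1.0, 2.0, 3.0]))
```

The spatial preprocessing script follows the same pattern but replaces the fixed factor of one million with the median total count per cell.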
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/SpaGE/Integration.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenSSp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.neighbors import NearestNeighbors
8 | import time as tm
9 |
10 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f:
11 |     datadict = pickle.load(f)
12 |
13 | osmFISH_data = datadict['osmFISH_data']
14 | osmFISH_data_scaled = datadict['osmFISH_data_scaled']
15 | osmFISH_meta= datadict['osmFISH_meta']
16 | del datadict
17 |
18 | with open ('data/SpaGE_pkl/Allen_SSp.pkl', 'rb') as f:
19 |     datadict = pickle.load(f)
20 |
21 | RNA_data = datadict['RNA_data']
22 | RNA_data_scaled = datadict['RNA_data_scaled']
23 | del datadict
24 |
25 | #### Leave One Out Validation ####
26 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns)
28 | precise_time = []
29 | knn_time = []
30 | for i in Common_data.columns:
31 |     print(i)
32 |     start = tm.time()
33 |     from principal_vectors import PVComputation
34 |
35 |     n_factors = 30
36 |     n_pv = 30
37 |     dim_reduction = 'pca'
38 |     dim_reduction_target = 'pca'
39 |
40 |     pv_FISH_RNA = PVComputation(
41 |         n_factors = n_factors,
42 |         n_pv = n_pv,
43 |         dim_reduction = dim_reduction,
44 |         dim_reduction_target = dim_reduction_target
45 |     )
46 |
47 |     pv_FISH_RNA.fit(Common_data.drop(i,axis=1),osmFISH_data_scaled[Common_data.columns].drop(i,axis=1))
48 |
49 |     S = pv_FISH_RNA.source_components_.T
50 |
51 |     Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
52 |     S = S[:,0:Effective_n_pv]
53 |
54 |     Common_data_t = Common_data.drop(i,axis=1).dot(S)
55 |     FISH_exp_t = osmFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S)
56 |     precise_time.append(tm.time()-start)
57 |
58 |     start = tm.time()
59 |     nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t)
60 |     distances, indices = nbrs.kneighbors(FISH_exp_t)
61 |
62 |     Imp_Gene = np.zeros(osmFISH_data.shape[0])
63 |     for j in range(0,osmFISH_data.shape[0]):
64 |         weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
65 |         weights = weights/(len(weights)-1)
66 |         Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights))
67 |     Imp_Gene[np.isnan(Imp_Gene)] = 0
68 |     Imp_Genes[i] = Imp_Gene
69 |     knn_time.append(tm.time()-start)
70 |
71 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv')
72 | precise_time = pd.DataFrame(precise_time)
73 | knn_time = pd.DataFrame(knn_time)
74 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False)
75 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False)
76 |
77 | #### Novel Genes Expression Patterns ####
78 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
79 | genes_to_impute = ["Tesc","Pvrl3","Grm2"]
80 | Imp_New_Genes = pd.DataFrame(np.zeros((osmFISH_data.shape[0],len(genes_to_impute))),columns=genes_to_impute)
81 |
82 | from principal_vectors import PVComputation
83 |
84 | n_factors = 30
85 | n_pv = 30
86 | dim_reduction = 'pca'
87 | dim_reduction_target = 'pca'
88 |
89 | pv_FISH_RNA = PVComputation(
90 | n_factors = n_factors,
91 | n_pv = n_pv,
92 | dim_reduction = dim_reduction,
93 | dim_reduction_target = dim_reduction_target
94 | )
95 |
96 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns])
97 |
98 | S = pv_FISH_RNA.source_components_.T
99 |
100 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
101 | S = S[:,0:Effective_n_pv]
102 |
103 | Common_data_t = Common_data.dot(S)
104 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].dot(S)
105 |
106 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',
107 | metric = 'cosine').fit(Common_data_t)
108 | distances, indices = nbrs.kneighbors(FISH_exp_t)
109 |
110 | for j in range(0,osmFISH_data.shape[0]):
111 |
112 |     weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
113 |     weights = weights/(len(weights)-1)
114 |     Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]])
115 |
116 | Imp_New_Genes.to_csv('Results/SpaGE_New_genes.csv')
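The imputation step in this script keeps, for each spatial cell, only the reference neighbours whose cosine distance is below 1, gives each a weight of `1 - d / sum(d)`, and divides by `n - 1`. Because the raw weights sum to `n - 1`, that division makes the final weights sum to 1. The sketch below restates the arithmetic for a single cell and a single gene in plain Python. `knn_weighted_impute` is a hypothetical name, and returning 0 for degenerate cases mirrors the NaN-to-zero replacement in the leave-one-out loop.

```python
def knn_weighted_impute(distances, values):
    """Impute one gene for one spatial cell from its reference neighbours.

    distances: cosine distances to the k nearest reference cells
    values:    expression of the target gene in those same cells
    Hypothetical helper for illustration only.
    """
    # Keep only neighbours with positive cosine similarity (distance < 1).
    kept = [(d, v) for d, v in zip(distances, values) if d < 1]
    total = sum(d for d, _ in kept)
    # With fewer than two kept neighbours, or all-zero distances, the
    # original arithmetic gives NaN (or an empty sum); return 0 instead.
    if len(kept) < 2 or total == 0:
        return 0.0
    # Closer neighbours get larger weights; dividing by (n - 1) makes
    # the weights sum to 1.
    weights = [(1 - d / total) / (len(kept) - 1) for d, _ in kept]
    return sum(w * v for w, (_, v) in zip(weights, kept))


if __name__ == "__main__":
    # The fourth neighbour (distance 1.2) is discarded.
    print(knn_weighted_impute([0.2, 0.3, 0.5, 1.2], [10, 20, 30, 1000]))
```

In the example the kept distances sum to 1.0, so the weights are 0.4, 0.35 and 0.25.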
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/SpaGE/Precise_output.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenSSp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | import matplotlib
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | import seaborn as sns
12 | import sys
13 | sys.path.insert(1,'SpaGE/')
14 | from principal_vectors import PVComputation
15 |
16 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f:
17 |     datadict = pickle.load(f)
18 |
19 | osmFISH_data = datadict['osmFISH_data']
20 | osmFISH_data_scaled = datadict['osmFISH_data_scaled']
21 | osmFISH_meta= datadict['osmFISH_meta']
22 | del datadict
23 |
24 | with open ('data/SpaGE_pkl/Allen_SSp.pkl', 'rb') as f:
25 |     datadict = pickle.load(f)
26 |
27 | RNA_data = datadict['RNA_data']
28 | RNA_data_scaled = datadict['RNA_data_scaled']
29 | del datadict
30 |
31 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
32 |
33 | n_factors = 30
34 | n_pv = 30
35 | n_pv_display = 30
36 | dim_reduction = 'pca'
37 | dim_reduction_target = 'pca'
38 |
39 | pv_FISH_RNA = PVComputation(
40 | n_factors = n_factors,
41 | n_pv = n_pv,
42 | dim_reduction = dim_reduction,
43 | dim_reduction_target = dim_reduction_target
44 | )
45 |
46 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns])
47 |
48 | fig = plt.figure()
49 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
50 | center=0, vmax=1., vmin=0)
51 | plt.xlabel('osmFISH',fontsize=18, color='black')
52 | plt.ylabel('Allen_SSp',fontsize=18, color='black')
53 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
54 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
55 | plt.gca().set_ylim([n_pv_display,0])
56 | plt.show()
57 |
58 | plt.figure()
59 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
60 | center=0, vmax=1., vmin=0)
61 | for i in range(n_pv_display-1):
62 |     plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black')
63 |
64 | plt.xlabel('osmFISH',fontsize=18, color='black')
65 | plt.ylabel('Allen_SSp',fontsize=18, color='black')
66 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
67 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
68 | plt.gca().set_ylim([n_pv_display,0])
69 | plt.show()
70 |
71 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns)
72 | Importance.sort_values(ascending=False,inplace=True)
73 | Importance.index[0:30]
74 |
75 | ### Technology specific Processes
76 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
77 |
78 | # explained variance RNA
79 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
80 | # explained variance spatial
81 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
82 |
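`Effective_n_pv` in this script counts the principal-vector pairs whose cosine similarity (the diagonal of the similarity matrix) exceeds 0.3. A plain-Python sketch of that count follows. `effective_n_pv` is a hypothetical helper name, and the matrix is assumed to be a square list of lists.

```python
def effective_n_pv(similarity_matrix, threshold=0.3):
    """Count diagonal entries strictly above the threshold.

    Hypothetical helper for illustration only; expects a square matrix
    given as a list of rows.
    """
    return sum(
        1 for k, row in enumerate(similarity_matrix) if row[k] > threshold
    )


if __name__ == "__main__":
    sims = [
        [0.9, 0.0, 0.0],
        [0.0, 0.5, 0.0],
        [0.0, 0.0, 0.2],
    ]
    print(effective_n_pv(sims))
```

The integration script then keeps the first `Effective_n_pv` components, which selects exactly the pairs above the threshold only if the diagonal is in non-increasing order; that ordering is assumed here rather than checked.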
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and they can be called directly
5 | by their string name from the main method. All the methods use the
6 | scikit-learn implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 |     """
24 |     Default linear dimensionality reduction method. For each method, return a
25 |     BaseEstimator instance corresponding to the method given as input.
26 |     Attributes
27 |     -------
28 |     method: str, default to 'pca'
29 |         Method used for dimensionality reduction.
30 |         Implemented: 'pca', 'ica', 'fa' (Factor Analysis),
31 |         'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA).
32 |
33 |     n_dim: int, default to 10
34 |         Number of domain-specific factors to compute.
35 |     Return values
36 |     -------
37 |     Classifier, i.e. BaseEstimator instance
38 |     """
39 |
40 |     if method.lower() == 'pca':
41 |         clf = PCA(n_components=n_dim)
42 |
43 |     elif method.lower() == 'ica':
44 |         print('ICA')
45 |         clf = FastICA(n_components=n_dim)
46 |
47 |     elif method.lower() == 'fa':
48 |         clf = FactorAnalysis(n_components=n_dim)
49 |
50 |     elif method.lower() == 'nmf':
51 |         clf = NMF(n_components=n_dim)
52 |
53 |     elif method.lower() == 'sparsepca':
54 |         clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 |     elif method.lower() == 'pls':
57 |         clf = PLS(n_components=n_dim)
58 |
59 |     else:
60 |         raise NameError('%s is not an implemented method'%(method))
61 |
62 |     return clf
63 |
64 |
65 | class PLS():
66 | """
67 | Implement PLS to make it compliant with the other dimensionality
68 | reduction methodology.
69 | (Simple class rewritting).
70 | """
71 | def __init__(self, n_components=10):
72 | self.clf = PLSRegression(n_components)
73 |
74 | def get_components_(self):
75 | return self.clf.x_weights_.transpose()
76 |
77 | def set_components_(self, x):
78 | pass
79 |
80 | components_ = property(get_components_, set_components_)
81 |
82 | def fit(self, X, y):
83 | self.clf.fit(X,y)
84 | return self
85 |
86 | def transform(self, X):
87 | return self.clf.transform(X)
88 |
89 | def predict(self, X):
90 | return self.clf.predict(X)
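
The if/elif chain in `process_dim_reduction` maps a method name to an estimator. A table-driven variant of the same idea is sketched below. It is only an illustration: `make_reducer` and `registry` are hypothetical names, and the lambdas are stand-ins that return tuples so the sketch runs without scikit-learn installed. In the real module each entry would construct the matching estimator.

```python
def make_reducer(method, n_dim, registry):
    """Look up a factory by case-insensitive name and build the reducer."""
    key = method.lower()
    if key not in registry:
        # Same exception type as the module above, for drop-in behaviour.
        raise NameError('%s is not an implemented method' % method)
    return registry[key](n_dim)


# Hypothetical stand-ins: each factory would normally return an estimator,
# e.g. PCA(n_components=n) for the 'pca' entry.
registry = {
    'pca': lambda n: ('PCA', n),
    'fa': lambda n: ('FactorAnalysis', n),
}

reducer = make_reducer('PCA', 10, registry)
print(reducer)
```

Adding a method then means adding one dictionary entry rather than another branch.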
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/SpaGE/osmFISH_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenSSp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 | import loompy
9 |
10 | ## Read loom data
11 | ds = loompy.connect('data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom')
12 |
13 | FISH_Genes = ds.ra['Gene']
14 |
15 | colAtr = ds.ca.keys()
16 |
17 | df = pd.DataFrame()
18 | for i in colAtr:
19 |     df[i] = ds.ca[i]
20 |
21 | osmFISH_meta = df.iloc[np.where(df.Valid == 1)[0], :]
22 | osmFISH_data = ds[:,:]
23 | osmFISH_data = osmFISH_data[:,np.where(df.Valid == 1)[0]]
24 | osmFISH_data = pd.DataFrame(data= osmFISH_data, index= FISH_Genes)
25 |
26 | del ds, colAtr, i, df, FISH_Genes
27 |
28 | Cortex_Regions = ['Layer 2-3 lateral', 'Layer 2-3 medial', 'Layer 3-4',
29 | 'Layer 4','Layer 5', 'Layer 6', 'Pia Layer 1']
30 | Cortical = np.array([i in Cortex_Regions for i in osmFISH_meta.Region])
31 |
32 | osmFISH_meta = osmFISH_meta.iloc[Cortical,:]
33 | osmFISH_data = osmFISH_data.iloc[:,Cortical]
34 |
35 | cell_count = np.sum(osmFISH_data,axis=0)
36 | def Log_Norm(x):
37 |     return np.log(((x/np.sum(x))*np.median(cell_count)) + 1)
38 |
39 | osmFISH_data = osmFISH_data.apply(Log_Norm,axis=0)
40 | osmFISH_data_scaled = pd.DataFrame(data=st.zscore(osmFISH_data.T),index = osmFISH_data.columns,columns=osmFISH_data.index)
41 |
42 | datadict = dict()
43 | datadict['osmFISH_data'] = osmFISH_data.T
44 | datadict['osmFISH_data_scaled'] = osmFISH_data_scaled
45 | datadict['osmFISH_meta'] = osmFISH_meta
46 |
47 | with open('data/SpaGE_pkl/osmFISH_Cortex.pkl','wb') as f:
48 |     pickle.dump(datadict, f)
49 |
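
The normalisation in the script above divides each cell by its total count, rescales by the median total count across cells, takes `log(x + 1)`, and then z-scores each gene across cells. The following dependency-free sketch reproduces those two steps on toy data. The function names and the counts are illustrative assumptions, not part of the repository.

```python
import math
import statistics


def log_normalize(counts_by_cell):
    """Library-size normalise each cell to the median total, then log(x + 1)."""
    totals = [sum(cell) for cell in counts_by_cell]
    scale = statistics.median(totals)  # median total count per cell
    return [
        [math.log(c / total * scale + 1) for c in cell]
        for cell, total in zip(counts_by_cell, totals)
    ]


def zscore_by_gene(cells):
    """Standardise each gene (column) across cells, population std (ddof=0)."""
    n_genes = len(cells[0])
    columns = [[cell[g] for cell in cells] for g in range(n_genes)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(cell[g] - means[g]) / stds[g] for g in range(n_genes)]
        for cell in cells
    ]


# Toy data: 3 cells x 2 genes (hypothetical counts).
counts = [[2, 2], [1, 3], [6, 2]]
normalized = log_normalize(counts)
scaled = zscore_by_gene(normalized)
```

The sketch assumes every cell has a non-zero total and every gene varies across cells. Otherwise the divisions fail, which is where the NumPy version would produce `nan` or `inf` instead.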
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenSSp/gimVI/gimVI.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenSSp/')
3 |
4 | from scvi.dataset import CsvDataset
5 | from scvi.models import JVAE, Classifier
6 | from scvi.inference import JVAETrainer
7 | import numpy as np
8 | import pandas as pd
9 | import copy
10 | import torch
11 | import time as tm
12 |
13 | ### osmFISH data
14 | osmFISH_data = CsvDataset('data/gimVI_data/osmFISH_Cortex_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
15 |
16 | ### RNA
17 | RNA_data = CsvDataset('data/gimVI_data/Allen_SSp_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
18 |
19 | ### Leave-one-out validation
20 | Common_data = copy.deepcopy(RNA_data)
21 | Common_data.gene_names = osmFISH_data.gene_names
22 | Common_data.X = Common_data.X[:,np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in osmFISH_data.gene_names]),-1)]
23 | Gene_set = np.intersect1d(osmFISH_data.gene_names,Common_data.gene_names)
24 | Imp_Genes = pd.DataFrame(columns=Gene_set)
25 | gimVI_time = []
26 |
27 | for i in Gene_set:
28 |     print(i)
29 |     # Create copy of the fish dataset with hidden genes
30 |     data_spatial_partial = copy.deepcopy(osmFISH_data)
31 |     data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(osmFISH_data.gene_names,i))
32 |     data_spatial_partial.batch_indices += Common_data.n_batches
33 |
34 |     if(data_spatial_partial.X.shape[0] != osmFISH_data.X.shape[0]):
35 |         continue
36 |
37 |     datasets = [Common_data, data_spatial_partial]
38 |     generative_distributions = ["zinb", "nb"]
39 |     gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(g==Common_data.gene_names) for g in data_spatial_partial.gene_names]),-1)]
40 |     n_inputs = [d.nb_genes for d in datasets]
41 |     total_genes = Common_data.nb_genes
42 |     n_batches = sum([d.n_batches for d in datasets])
43 |
44 |     model_library_size = [True, False]
45 |
46 |     n_latent = 8
47 |     kappa = 5
48 |
49 |     start = tm.time()
50 |     torch.manual_seed(0)
51 |
52 |     model = JVAE(
53 |         n_inputs,
54 |         total_genes,
55 |         gene_mappings,
56 |         generative_distributions,
57 |         model_library_size,
58 |         n_layers_decoder_individual=0,
59 |         n_layers_decoder_shared=0,
60 |         n_layers_encoder_individual=1,
61 |         n_layers_encoder_shared=1,
62 |         dim_hidden_encoder=64,
63 |         dim_hidden_decoder_shared=64,
64 |         dropout_rate_encoder=0.2,
65 |         dropout_rate_decoder=0.2,
66 |         n_batch=n_batches,
67 |         n_latent=n_latent,
68 |     )
69 |     discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
70 |
71 |     trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
72 |     trainer.train(n_epochs=200)
73 |     _,Imputed = trainer.get_imputed_values(normalized=True)
74 |     Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1)
75 |     Imp_Genes[i] = Imputed
76 |     gimVI_time.append(tm.time()-start)
77 |
78 | Imp_Genes = Imp_Genes.fillna(0)
79 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv')
80 | gimVI_time = pd.DataFrame(gimVI_time)
81 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False)
82 |
83 | ### New genes
84 | Imp_New_Genes = pd.DataFrame(columns=["TESC","PVRL3","GRM2"])
85 |
86 | # Create copy of the fish dataset with hidden genes
87 | data_spatial_partial = copy.deepcopy(osmFISH_data)
88 | data_spatial_partial.filter_genes_by_attribute(osmFISH_data.gene_names)
89 | data_spatial_partial.batch_indices += RNA_data.n_batches
90 |
91 | datasets = [RNA_data, data_spatial_partial]
92 | generative_distributions = ["zinb", "nb"]
93 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in data_spatial_partial.gene_names]),-1)]
94 | n_inputs = [d.nb_genes for d in datasets]
95 | total_genes = RNA_data.nb_genes
96 | n_batches = sum([d.n_batches for d in datasets])
97 |
98 | model_library_size = [True, False]
99 |
100 | n_latent = 8
101 | kappa = 5
102 |
103 | torch.manual_seed(0)
104 |
105 | model = JVAE(
106 | n_inputs,
107 | total_genes,
108 | gene_mappings,
109 | generative_distributions,
110 | model_library_size,
111 | n_layers_decoder_individual=0,
112 | n_layers_decoder_shared=0,
113 | n_layers_encoder_individual=1,
114 | n_layers_encoder_shared=1,
115 | dim_hidden_encoder=64,
116 | dim_hidden_decoder_shared=64,
117 | dropout_rate_encoder=0.2,
118 | dropout_rate_decoder=0.2,
119 | n_batch=n_batches,
120 | n_latent=n_latent,
121 | )
122 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
123 |
124 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
125 | trainer.train(n_epochs=200)
126 |
127 | for i in ["TESC","PVRL3","GRM2"]:
128 |     _,Imputed = trainer.get_imputed_values(normalized=True)
129 |     Imputed = np.reshape(Imputed[:,np.argwhere(i==RNA_data.gene_names)[0]],-1)
130 |     Imp_New_Genes[i] = Imputed
131 |
132 | Imp_New_Genes.to_csv('Results/gimVI_New_genes.csv')
133 |
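
The `gene_mappings` expressions above look up, for every spatial gene, the column index of the same gene in the scRNA-seq gene list. A minimal sketch of that lookup using a dictionary is shown below. `map_genes` and the toy gene lists are hypothetical and only illustrate the indexing, not the scVI API.

```python
def map_genes(query_genes, reference_genes):
    """Return, for each query gene, its column index in the reference list."""
    position = {gene: idx for idx, gene in enumerate(reference_genes)}
    missing = [g for g in query_genes if g not in position]
    if missing:
        raise KeyError('genes absent from the reference: %s' % missing)
    return [position[g] for g in query_genes]


# Toy gene lists (illustrative only).
reference_genes = ['GAD2', 'SOX10', 'TESC', 'PLP1']
spatial_genes = ['PLP1', 'GAD2']
mapping = map_genes(spatial_genes, reference_genes)
print(mapping)
```

Unlike the `np.argwhere` version, this sketch fails loudly when a spatial gene is missing from the reference. If the reference contains duplicate names, it keeps the last occurrence.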
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/Liger/LIGER.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenVISp/")
2 | library(liger)
3 | library(hdf5r)
4 | library(methods)
5 |
6 | # allen VISp
7 | allen <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv",
8 | row.names = 1, sep = ',', stringsAsFactors = FALSE, header = TRUE)
9 | allen <- as.matrix(x = allen)
10 | genes <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv",
11 | sep = ',', stringsAsFactors = FALSE, header = TRUE)
12 | rownames(x = allen) <- make.unique(names = genes$gene_symbol)
13 | meta.data <- read.csv(file = "data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv",
14 | row.names = 1, stringsAsFactors = FALSE)
15 |
16 | Genes_count = rowSums(allen > 0)
17 | allen <- allen[Genes_count>=10,]
18 |
19 | low.q.cells <- rownames(x = meta.data[meta.data$class %in% c('Low Quality', 'No Class'), ])
20 | ok.cells <- rownames(x = meta.data)[!(rownames(x = meta.data) %in% low.q.cells)]
21 | allen <-allen[,ok.cells]
22 |
23 | # osmFISH
24 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom")
25 | mat <- osm[['matrix']][,]
26 | colnames(mat) <- osm[['row_attrs']][['Gene']][]
27 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][])
28 | x_dim <- osm[['col_attrs']][['X']][]
29 | y_dim <- osm[['col_attrs']][['Y']][]
30 | region <- osm[['col_attrs']][['Region']][]
31 | cluster <- osm[['col_attrs']][['ClusterName']][]
32 | osm$close_all()
33 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim)
34 | rownames(spatial) <- rownames(mat)
35 | spatial <- as.matrix(spatial)
36 | osmFISH <- t(mat)
37 | osmFISH <- osmFISH[,!is.element(region,c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))]
38 |
39 | Gene_set <- intersect(rownames(osmFISH),rownames(allen))
40 |
41 | #### New genes prediction
42 | Ligerex <- createLiger(list(SMSC_FISH = osmFISH, SMSC_RNA = allen))
43 | Ligerex <- normalize(Ligerex)
44 | Ligerex@var.genes <- Gene_set
45 | Ligerex <- scaleNotCenter(Ligerex)
46 |
47 | # suggestK(Ligerex, k.test= seq(5,30,5)) # K = 20
48 | # suggestLambda(Ligerex, k = 20) # Lambda = 20
49 |
50 | Ligerex <- optimizeALS(Ligerex,k = 20, lambda = 20)
51 | Ligerex <- quantileAlignSNF(Ligerex)
52 |
53 | Imputation <- imputeKNN(Ligerex,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE)
54 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2')
55 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(Imputation@norm.data$SMSC_FISH)[2])
56 | rownames(Imp_New_genes) <- new.genes
57 | for(i in 1:length(new.genes)) {
58 | Imp_New_genes[new.genes[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[new.genes[i],])
59 | }
60 | write.csv(Imp_New_genes,file = 'Results/Liger_New_genes.csv')
61 |
62 | # leave-one-out validation
63 | genes.leaveout <- intersect(rownames(osmFISH),rownames(allen))
64 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH)[2])
65 | rownames(Imp_genes) <- genes.leaveout
66 | colnames(Imp_genes) <- colnames(osmFISH)
67 | NMF_time <- vector(mode= "numeric")
68 | knn_time <- vector(mode= "numeric")
69 |
70 | for(i in 1:length(genes.leaveout)) {
71 | print(i)
72 | start_time <- Sys.time()
73 | Ligerex.leaveout <- createLiger(list(SMSC_FISH = osmFISH[-which(rownames(osmFISH) %in% genes.leaveout[i]),], SMSC_RNA = allen[rownames(osmFISH),]))
74 | Ligerex.leaveout <- normalize(Ligerex.leaveout)
75 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i])
76 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout)
77 | if(dim(Ligerex.leaveout@norm.data$SMSC_FISH)[2]!=dim(osmFISH)[2]){
78 | next
79 | }
80 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 20, lambda = 20)
81 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout)
82 | end_time <- Sys.time()
83 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
84 |
85 | start_time <- Sys.time()
86 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE, knn_k = 30)
87 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[genes.leaveout[i],])
88 | end_time <- Sys.time()
89 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
90 | }
91 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv')
92 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE)
93 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE)
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/Seurat/allen_brain.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenVISp/")
2 | library(Seurat)
3 |
4 | allen <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv",
5 | row.names = 1,sep = ',', stringsAsFactors = FALSE, header = TRUE)
6 | allen <- as.matrix(x = allen)
7 | genes <- read.table(file = "data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv",
8 | sep = ',', stringsAsFactors = FALSE, header = TRUE)
9 | rownames(x = allen) <- make.unique(names = genes$gene_symbol)
10 | meta.data <- read.csv(file = "data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv",
11 | row.names = 1, stringsAsFactors = FALSE)
12 |
13 | al <- CreateSeuratObject(counts = allen, project = 'VISp', meta.data = meta.data, min.cells = 10)
14 | low.q.cells <- rownames(x = meta.data[meta.data$class %in% c('Low Quality', 'No Class'), ])
15 | ok.cells <- rownames(x = meta.data)[!(rownames(x = meta.data) %in% low.q.cells)]
16 | al <- al[, ok.cells]
17 | al <- NormalizeData(object = al)
18 | al <- FindVariableFeatures(object = al, nfeatures = 2000)
19 | al <- ScaleData(object = al)
20 | al <- RunPCA(object = al, npcs = 50, verbose = FALSE)
21 | al <- RunUMAP(object = al, dims = 1:50, n.neighbors = 5)
22 | saveRDS(object = al, file = paste0("data/seurat_objects/","allen_brain.rds"))
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/Seurat/impute_osmFISH.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenVISp/")
2 | library(Seurat)
3 |
4 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds")
5 | allen <- readRDS("data/seurat_objects/allen_brain.rds")
6 |
7 | #Project on allen labels
8 | i2 <- FindTransferAnchors(
9 | reference = allen,
10 | query = osmFISH,
11 | features = rownames(osmFISH),
12 | reduction = 'cca',
13 | reference.assay = 'RNA',
14 | query.assay = 'RNA'
15 | )
16 |
17 | refdata <- GetAssayData(
18 | object = allen,
19 | assay = 'RNA',
20 | slot = 'data'
21 | )
22 |
23 | imputation <- TransferData(
24 | anchorset = i2,
25 | refdata = refdata,
26 | weight.reduction = 'pca'
27 | )
28 |
29 | osmFISH[['ss2']] <- imputation
30 | saveRDS(osmFISH, 'data/seurat_objects/osmFISH_Cortex_imputed.rds')
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/Seurat/osmFISH.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenVISp/")
2 | library(Seurat)
3 | library(hdf5r)
4 | library(methods)
5 |
6 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom")
7 | mat <- osm[['matrix']][,]
8 | colnames(mat) <- osm[['row_attrs']][['Gene']][]
9 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][])
10 | x_dim <- osm[['col_attrs']][['X']][]
11 | y_dim <- osm[['col_attrs']][['Y']][]
12 | region <- osm[['col_attrs']][['Region']][]
13 | cluster <- osm[['col_attrs']][['ClusterName']][]
14 | osm$close_all()
15 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim)
16 | rownames(spatial) <- rownames(mat)
17 | spatial <- as.matrix(spatial)
18 | mat <- t(mat)
19 |
20 | osm_seurat <- CreateSeuratObject(counts = mat, project = 'osmFISH', assay = 'RNA', min.cells = -1, min.features = -1)
21 | names(region) <- colnames(osm_seurat)
22 | names(cluster) <- colnames(osm_seurat)
23 | osm_seurat <- AddMetaData(osm_seurat, region, col.name = 'region')
24 | osm_seurat <- AddMetaData(osm_seurat, cluster, col.name = 'cluster')
25 | osm_seurat[['spatial']] <- CreateDimReducObject(embeddings = spatial, key = 'spatial', assay = 'RNA')
26 | Idents(osm_seurat) <- 'region'
27 | osm_seurat <- SubsetData(osm_seurat, ident.remove = c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))
28 | total.counts = colSums(x = as.matrix(osm_seurat@assays$RNA@counts))
29 | osm_seurat <- NormalizeData(osm_seurat, scale.factor = median(x = total.counts))
30 | osm_seurat <- ScaleData(osm_seurat)
31 | saveRDS(object = osm_seurat, file = 'data/seurat_objects/osmFISH_Cortex.rds')
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/Seurat/osmFISH_integration.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_AllenVISp/")
2 | library(Seurat)
3 | library(ggplot2)
4 |
5 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds")
6 | osmFISH.imputed <- readRDS("data/seurat_objects/osmFISH_Cortex_imputed.rds")
7 | allen <- readRDS("data/seurat_objects/allen_brain.rds")
8 |
9 | genes.leaveout <- intersect(rownames(osmFISH),rownames(allen))
10 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH@assays$RNA)[2])
11 | rownames(Imp_genes) <- genes.leaveout
12 | anchor_time <- vector(mode= "numeric")
13 | Transfer_time <- vector(mode= "numeric")
14 |
15 | run_imputation <- function(ref.obj, query.obj, feature.remove) {
16 | message(paste0('removing ', feature.remove))
17 | features <- setdiff(rownames(query.obj), feature.remove)
18 | DefaultAssay(ref.obj) <- 'RNA'
19 | DefaultAssay(query.obj) <- 'RNA'
20 |
21 | start_time <- Sys.time()
22 | anchors <- FindTransferAnchors(
23 | reference = ref.obj,
24 | query = query.obj,
25 | features = features,
26 | dims = 1:30,
27 | reduction = 'cca'
28 | )
29 | end_time <- Sys.time()
30 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
31 |
32 | refdata <- GetAssayData(
33 | object = ref.obj,
34 | assay = 'RNA',
35 | slot = 'data'
36 | )
37 |
38 | start_time <- Sys.time()
39 | imputation <- TransferData(
40 | anchorset = anchors,
41 | refdata = refdata,
42 | weight.reduction = 'pca'
43 | )
44 | query.obj[['seq']] <- imputation
45 | end_time <- Sys.time()
46 | Transfer_time <<- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
47 | return(query.obj)
48 | }
49 |
50 | for(i in 1:length(genes.leaveout)) {
51 | imputed.ss2 <- run_imputation(ref.obj = allen, query.obj = osmFISH, feature.remove = genes.leaveout[[i]])
52 | osmFISH[['ss2']] <- imputed.ss2[, colnames(osmFISH)][['seq']]
53 | Imp_genes[genes.leaveout[[i]],] = as.vector(osmFISH@assays$ss2[genes.leaveout[i],])
54 | }
55 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv')
56 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE)
57 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE)
58 |
59 | # show genes not in the osmFISH dataset
60 | DefaultAssay(osmFISH.imputed) <- "ss2"
61 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2')
62 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(osmFISH.imputed@assays$ss2)[2])
63 | rownames(Imp_New_genes) <- new.genes
64 | for(i in 1:length(new.genes)) {
65 | Imp_New_genes[new.genes[[i]],] = as.vector(osmFISH.imputed@assays$ss2[new.genes[i],])
66 | }
67 | write.csv(Imp_New_genes,file = 'Results/Seurat_New_genes.csv')
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/SpaGE/Allen_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenVISp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 |
9 | RNA_data = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_exon-matrix.csv',
10 | header=0,index_col=0,sep=',')
11 |
12 | Genes = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_genes-rows.csv',
13 | header=0,sep=',')
14 | RNA_data.index = Genes.gene_symbol
15 | del Genes
16 |
17 | # filter lowly expressed genes (keep genes detected in at least 10 cells)
18 | Genes_count = np.sum(RNA_data > 0, axis=1)
19 | RNA_data = RNA_data.loc[Genes_count >=10,:]
20 | del Genes_count
21 |
22 | # filter low quality cells
23 | meta_data = pd.read_csv('data/Allen_VISp/mouse_VISp_2018-06-14_samples-columns.csv',
24 | header=0,sep=',')
25 | HighQualityCells = (meta_data['class'] != 'No Class') & (meta_data['class'] != 'Low Quality')
26 | RNA_data = RNA_data.iloc[:,np.where(HighQualityCells)[0]]
27 | del meta_data, HighQualityCells
28 |
29 | def Log_Norm(x):
30 |     return np.log(((x/np.sum(x))*1000000) + 1)
31 |
32 | RNA_data = RNA_data.apply(Log_Norm,axis=0)
33 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index)
34 |
35 | datadict = dict()
36 | datadict['RNA_data'] = RNA_data.T
37 | datadict['RNA_data_scaled'] = RNA_data_scaled
38 |
39 | with open('data/SpaGE_pkl/Allen_VISp.pkl','wb') as f:
40 |     pickle.dump(datadict, f)
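
The gene filter in the script above keeps a gene only if it has a non-zero count in at least 10 cells. The sketch below shows the same rule on a small genes-by-cells table in plain Python. `filter_genes`, the gene names and the counts are illustrative assumptions, and the threshold is lowered to 3 so the toy data exercises both outcomes.

```python
def filter_genes(counts, gene_names, min_cells=10):
    """Keep genes detected (count > 0) in at least `min_cells` cells.

    `counts` is a genes x cells nested list, matching the orientation of
    the exon matrix before it is transposed.
    """
    kept = [
        (name, row)
        for name, row in zip(gene_names, counts)
        if sum(1 for c in row if c > 0) >= min_cells
    ]
    names = [name for name, _ in kept]
    rows = [row for _, row in kept]
    return names, rows


genes = ['geneA', 'geneB', 'geneC']
counts = [
    [5, 0, 3, 1],  # detected in 3 cells
    [0, 0, 0, 2],  # detected in 1 cell
    [1, 1, 1, 1],  # detected in 4 cells
]
kept_names, kept_counts = filter_genes(counts, genes, min_cells=3)
print(kept_names)
```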
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/SpaGE/Integration.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenVISp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.neighbors import NearestNeighbors
8 | import time as tm
9 |
10 | with open('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f:
11 |     datadict = pickle.load(f)
12 |
13 | osmFISH_data = datadict['osmFISH_data']
14 | osmFISH_data_scaled = datadict['osmFISH_data_scaled']
15 | osmFISH_meta= datadict['osmFISH_meta']
16 | del datadict
17 |
18 | with open('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
19 |     datadict = pickle.load(f)
20 |
21 | RNA_data = datadict['RNA_data']
22 | RNA_data_scaled = datadict['RNA_data_scaled']
23 | del datadict
24 |
25 | #### Leave One Out Validation ####
26 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns)
28 | precise_time = []
29 | knn_time = []
30 | for i in Common_data.columns:
31 |     print(i)
32 |     start = tm.time()
33 |     from principal_vectors import PVComputation
34 |
35 |     n_factors = 30
36 |     n_pv = 30
37 |     dim_reduction = 'pca'
38 |     dim_reduction_target = 'pca'
39 |
40 |     pv_FISH_RNA = PVComputation(
41 |         n_factors = n_factors,
42 |         n_pv = n_pv,
43 |         dim_reduction = dim_reduction,
44 |         dim_reduction_target = dim_reduction_target
45 |     )
46 |
47 |     pv_FISH_RNA.fit(Common_data.drop(i,axis=1),osmFISH_data_scaled[Common_data.columns].drop(i,axis=1))
48 |
49 |     S = pv_FISH_RNA.source_components_.T
50 |
51 |     Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)  # keep PVs with cosine similarity > 0.3
52 |     S = S[:,0:Effective_n_pv]
53 |
54 |     Common_data_t = Common_data.drop(i,axis=1).dot(S)
55 |     FISH_exp_t = osmFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S)
56 |     precise_time.append(tm.time()-start)
57 |
58 |     start = tm.time()
59 |     nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t)
60 |     distances, indices = nbrs.kneighbors(FISH_exp_t)
61 |
62 |     Imp_Gene = np.zeros(osmFISH_data.shape[0])
63 |     for j in range(0,osmFISH_data.shape[0]):
64 |         weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))  # closer neighbours get larger weights
65 |         weights = weights/(len(weights)-1)  # weights now sum to 1
66 |         Imp_Gene[j] = np.sum(np.multiply(RNA_data[i].iloc[indices[j,:][distances[j,:] < 1]],weights))
67 |     Imp_Gene[np.isnan(Imp_Gene)] = 0
68 |     Imp_Genes[i] = Imp_Gene
69 |     knn_time.append(tm.time()-start)
70 |
71 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv')
72 | precise_time = pd.DataFrame(precise_time)
73 | knn_time = pd.DataFrame(knn_time)
74 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False)
75 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False)
76 |
77 | #### Novel Genes Expression Patterns ####
78 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
79 | genes_to_impute = ["Tesc","Pvrl3","Grm2"]
80 | Imp_New_Genes = pd.DataFrame(np.zeros((osmFISH_data.shape[0],len(genes_to_impute))),columns=genes_to_impute)
81 |
82 | from principal_vectors import PVComputation
83 |
84 | n_factors = 30
85 | n_pv = 30
86 | dim_reduction = 'pca'
87 | dim_reduction_target = 'pca'
88 |
89 | pv_FISH_RNA = PVComputation(
90 | n_factors = n_factors,
91 | n_pv = n_pv,
92 | dim_reduction = dim_reduction,
93 | dim_reduction_target = dim_reduction_target
94 | )
95 |
96 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns])
97 |
98 | S = pv_FISH_RNA.source_components_.T
99 |
100 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
101 | S = S[:,0:Effective_n_pv]
102 |
103 | Common_data_t = Common_data.dot(S)
104 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].dot(S)
105 |
106 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',
107 | metric = 'cosine').fit(Common_data_t)
108 | distances, indices = nbrs.kneighbors(FISH_exp_t)
109 |
110 | for j in range(0,osmFISH_data.shape[0]):
111 |
112 |     weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
113 |     weights = weights/(len(weights)-1)
114 |     Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]])
115 |
116 | Imp_New_Genes.to_csv('Results/SpaGE_New_genes.csv')
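
The imputation step above weights each retained neighbour by `1 - d / sum(d)` and divides by `k - 1`, where `k` is the number of neighbours with cosine distance below 1. Since the unnormalised weights sum to `k - 1`, the final weights sum to 1. The sketch below works through one spatial cell in plain Python. The function names and the numbers are illustrative assumptions.

```python
def neighbor_weights(distances):
    """Weights 1 - d/sum(d), divided by (k - 1) so that they sum to 1."""
    total = sum(distances)
    k = len(distances)
    return [(1 - d / total) / (k - 1) for d in distances]


def impute_from_neighbors(distances, values, max_distance=1.0):
    """Weighted average of neighbour expression, ignoring distant neighbours."""
    kept = [(d, v) for d, v in zip(distances, values) if d < max_distance]
    kept_d = [d for d, _ in kept]
    # Fewer than two usable neighbours (or all-zero distances) give nan in
    # the NumPy version, which the leave-one-out loop then replaces by 0.
    if len(kept) < 2 or sum(kept_d) == 0:
        return 0.0
    weights = neighbor_weights(kept_d)
    return sum(w * v for w, (_, v) in zip(weights, kept))


# One spatial cell with four neighbours; the last one is too far (d >= 1).
distances = [0.1, 0.3, 0.6, 1.2]
expression = [4.0, 2.0, 1.0, 9.0]
weights = neighbor_weights(distances[:3])
imputed = impute_from_neighbors(distances, expression)
```

Here the weights come out as roughly 0.45, 0.35 and 0.20, so the nearest neighbour contributes most to the imputed value.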
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/SpaGE/Precise_output.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenVISp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | import matplotlib
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | from matplotlib import cm
12 | import seaborn as sns
13 | import sys
14 | sys.path.insert(1,'SpaGE/')
15 | from principal_vectors import PVComputation
16 |
17 | with open('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f:
18 |     datadict = pickle.load(f)
19 |
20 | osmFISH_data = datadict['osmFISH_data']
21 | osmFISH_data_scaled = datadict['osmFISH_data_scaled']
22 | osmFISH_meta= datadict['osmFISH_meta']
23 | del datadict
24 |
25 | with open('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
26 |     datadict = pickle.load(f)
27 |
28 | RNA_data = datadict['RNA_data']
29 | RNA_data_scaled = datadict['RNA_data_scaled']
30 | del datadict
31 |
32 | all_centroids = osmFISH_meta[['X','Y']]
33 | cmap = plt.get_cmap('viridis',20)
34 |
35 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
36 |
37 | n_factors = 30
38 | n_pv = 30
39 | n_pv_display = 30
40 | dim_reduction = 'pca'
41 | dim_reduction_target = 'pca'
42 |
43 | pv_FISH_RNA = PVComputation(
44 | n_factors = n_factors,
45 | n_pv = n_pv,
46 | dim_reduction = dim_reduction,
47 | dim_reduction_target = dim_reduction_target
48 | )
49 |
50 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns])
51 |
52 | fig = plt.figure()
53 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
54 | center=0, vmax=1., vmin=0)
55 | plt.xlabel('osmFISH',fontsize=18, color='black')
56 | plt.ylabel('Allen_VISp',fontsize=18, color='black')
57 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
58 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
59 | plt.gca().set_ylim([n_pv_display,0])
60 | plt.show()
61 |
62 | plt.figure()
63 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
64 | center=0, vmax=1., vmin=0)
65 | for i in range(n_pv_display-1):
66 |     plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black')
67 |
68 | plt.xlabel('osmFISH',fontsize=18, color='black')
69 | plt.ylabel('Allen_VISp',fontsize=18, color='black')
70 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
71 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
72 | plt.gca().set_ylim([n_pv_display,0])
73 | plt.show()
74 |
75 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns)
76 | Importance.sort_values(ascending=False,inplace=True)
77 | Importance.index[0:30]
78 |
79 | ### Technology specific Processes
80 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
81 |
82 | # explained variance RNA
83 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
84 | # explained variance spatial
85 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
86 |
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and they can be called directly
5 | by string name from the main method. All the methods
6 | use the scikit-learn implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 |     """
24 |     Select a linear dimensionality reduction method. For each method, return an
25 |     estimator instance corresponding to the method given as input.
26 |     Parameters
27 |     -------
28 |     method: str, default to 'pca'
29 |         Method used for dimensionality reduction.
30 |         Implemented: 'pca', 'ica', 'fa' (Factor Analysis), 'pls' (Partial Least Squares),
31 |         'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA).
32 |
33 |     n_dim: int, default to 10
34 |         Number of domain-specific factors to compute.
35 |     Returns
36 |     -------
37 |     Estimator instance exposing fit and transform
38 |     """
39 |
40 |     if method.lower() == 'pca':
41 |         clf = PCA(n_components=n_dim)
42 |
43 |     elif method.lower() == 'ica':
44 |         print('ICA')
45 |         clf = FastICA(n_components=n_dim)
46 |
47 |     elif method.lower() == 'fa':
48 |         clf = FactorAnalysis(n_components=n_dim)
49 |
50 |     elif method.lower() == 'nmf':
51 |         clf = NMF(n_components=n_dim)
52 |
53 |     elif method.lower() == 'sparsepca':
54 |         clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 |     elif method.lower() == 'pls':
57 |         clf = PLS(n_components=n_dim)
58 |
59 |     else:
60 |         raise NameError('%s is not an implemented method'%(method))
61 |
62 |     return clf
63 |
64 |
65 | class PLS():
66 |     """
67 |     Wrap PLSRegression to make it compliant with the other dimensionality
68 |     reduction methods.
69 |     (Simple class rewriting).
70 |     """
71 |     def __init__(self, n_components=10):
72 |         self.clf = PLSRegression(n_components=n_components)
73 |
74 |     def get_components_(self):
75 |         return self.clf.x_weights_.transpose()
76 |
77 |     def set_components_(self, x):
78 |         pass
79 |
80 |     components_ = property(get_components_, set_components_)
81 |
82 |     def fit(self, X, y):
83 |         self.clf.fit(X,y)
84 |         return self
85 |
86 |     def transform(self, X):
87 |         return self.clf.transform(X)
88 |
89 |     def predict(self, X):
90 |         return self.clf.predict(X)
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/SpaGE/osmFISH_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenVISp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 | import loompy
9 |
10 | ## Read loom data
11 | ds = loompy.connect('data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom')
12 |
13 | FISH_Genes = ds.ra['Gene']
14 |
15 | colAtr = ds.ca.keys()
16 |
17 | df = pd.DataFrame()
18 | for i in colAtr:
19 | df[i] = ds.ca[i]
20 |
21 | osmFISH_meta = df.iloc[np.where(df.Valid == 1)[0], :]
22 | osmFISH_data = ds[:,:]
23 | osmFISH_data = osmFISH_data[:,np.where(df.Valid == 1)[0]]
24 | osmFISH_data = pd.DataFrame(data= osmFISH_data, index= FISH_Genes)
25 |
26 | del ds, colAtr, i, df, FISH_Genes
27 |
28 | Cortex_Regions = ['Layer 2-3 lateral', 'Layer 2-3 medial', 'Layer 3-4',
29 | 'Layer 4','Layer 5', 'Layer 6', 'Pia Layer 1']
30 | Cortical = np.array([i in Cortex_Regions for i in osmFISH_meta.Region])
31 |
32 | osmFISH_meta = osmFISH_meta.iloc[Cortical,:]
33 | osmFISH_data = osmFISH_data.iloc[:,Cortical]
34 |
35 | cell_count = np.sum(osmFISH_data,axis=0)
36 | def Log_Norm(x):
37 | return np.log(((x/np.sum(x))*np.median(cell_count)) + 1)
38 |
39 | osmFISH_data = osmFISH_data.apply(Log_Norm,axis=0)
40 | osmFISH_data_scaled = pd.DataFrame(data=st.zscore(osmFISH_data.T),index = osmFISH_data.columns,columns=osmFISH_data.index)
41 |
42 | datadict = dict()
43 | datadict['osmFISH_data'] = osmFISH_data.T
44 | datadict['osmFISH_data_scaled'] = osmFISH_data_scaled
45 | datadict['osmFISH_meta'] = osmFISH_meta
46 |
47 | with open('data/SpaGE_pkl/osmFISH_Cortex.pkl','wb') as f:
48 | pickle.dump(datadict, f)
49 |
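The normalization in this script can be illustrated on a toy matrix. The following is a minimal sketch, assuming a genes-by-cells count matrix in which every cell has a non-zero total; `log_normalize`, `zscore_genes`, and the toy counts are hypothetical names introduced here and are not part of the repository.

```python
import numpy as np

def log_normalize(counts, scale=None):
    """Scale each cell (column) to `scale` total counts, then take log(x + 1).

    counts: genes x cells array. If `scale` is None, the median of the
    per-cell totals is used, as in Log_Norm above.
    Assumption: no cell has a total count of zero.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)
    if scale is None:
        scale = np.median(totals)
    return np.log(counts / totals * scale + 1.0)

def zscore_genes(norm):
    """Return a cells x genes matrix with every gene standardized (ddof=0)."""
    cells_by_genes = np.asarray(norm, dtype=float).T
    return (cells_by_genes - cells_by_genes.mean(axis=0)) / cells_by_genes.std(axis=0)

# Toy data (hypothetical): 3 genes x 3 cells, every cell sums to 10.
toy_counts = np.array([[1, 4, 0],
                       [3, 0, 5],
                       [6, 6, 5]])
toy_norm = log_normalize(toy_counts)
toy_scaled = zscore_genes(toy_norm)
```

Because every toy cell already sums to the median total, the normalization reduces to `log(counts + 1)` here; with real data the per-cell scaling changes the values before the logarithm is taken.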
--------------------------------------------------------------------------------
/benchmark/osmFISH_AllenVISp/gimVI/gimVI.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_AllenVISp/')
3 |
4 | from scvi.dataset import CsvDataset
5 | from scvi.models import JVAE, Classifier
6 | from scvi.inference import JVAETrainer
7 | import numpy as np
8 | import pandas as pd
9 | import copy
10 | import torch
11 | import time as tm
12 |
13 | ### osmFISH data
14 | osmFISH_data = CsvDataset('data/gimVI_data/osmFISH_Cortex_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
15 |
16 | ### RNA
17 | RNA_data = CsvDataset('data/gimVI_data/Allen_data_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
18 |
19 | ### Leave-one-out validation
20 | Common_data = copy.deepcopy(RNA_data)
21 | Common_data.gene_names = osmFISH_data.gene_names
22 | Common_data.X = Common_data.X[:,np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in osmFISH_data.gene_names]),-1)]
23 | Gene_set = np.intersect1d(osmFISH_data.gene_names,Common_data.gene_names)
24 | Imp_Genes = pd.DataFrame(columns=Gene_set)
25 | gimVI_time = []
26 |
27 | for i in Gene_set:
28 | print(i)
29 | # Create copy of the fish dataset with hidden genes
30 | data_spatial_partial = copy.deepcopy(osmFISH_data)
31 | data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(osmFISH_data.gene_names,i))
32 | data_spatial_partial.batch_indices += Common_data.n_batches
33 |
34 | if(data_spatial_partial.X.shape[0] != osmFISH_data.X.shape[0]):
35 | continue
36 |
37 | datasets = [Common_data, data_spatial_partial]
38 | generative_distributions = ["zinb", "nb"]
39 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(g==Common_data.gene_names) for g in data_spatial_partial.gene_names]),-1)]
40 | n_inputs = [d.nb_genes for d in datasets]
41 | total_genes = Common_data.nb_genes
42 | n_batches = sum([d.n_batches for d in datasets])
43 |
44 | model_library_size = [True, False]
45 |
46 | n_latent = 8
47 | kappa = 5
48 |
49 | start = tm.time()
50 | torch.manual_seed(0)
51 |
52 | model = JVAE(
53 | n_inputs,
54 | total_genes,
55 | gene_mappings,
56 | generative_distributions,
57 | model_library_size,
58 | n_layers_decoder_individual=0,
59 | n_layers_decoder_shared=0,
60 | n_layers_encoder_individual=1,
61 | n_layers_encoder_shared=1,
62 | dim_hidden_encoder=64,
63 | dim_hidden_decoder_shared=64,
64 | dropout_rate_encoder=0.2,
65 | dropout_rate_decoder=0.2,
66 | n_batch=n_batches,
67 | n_latent=n_latent,
68 | )
69 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
70 |
71 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
72 | trainer.train(n_epochs=200)
73 | _,Imputed = trainer.get_imputed_values(normalized=True)
74 | Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1)
75 | Imp_Genes[i] = Imputed
76 | gimVI_time.append(tm.time()-start)
77 |
78 | Imp_Genes = Imp_Genes.fillna(0)
79 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv')
80 | gimVI_time = pd.DataFrame(gimVI_time)
81 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False)
82 |
83 | ### New genes
84 | Imp_New_Genes = pd.DataFrame(columns=["TESC","PVRL3","GRM2"])
85 |
86 | # Create copy of the fish dataset (all genes are kept; nothing is hidden here)
87 | data_spatial_partial = copy.deepcopy(osmFISH_data)
88 | data_spatial_partial.filter_genes_by_attribute(osmFISH_data.gene_names)
89 | data_spatial_partial.batch_indices += RNA_data.n_batches
90 |
91 | datasets = [RNA_data, data_spatial_partial]
92 | generative_distributions = ["zinb", "nb"]
93 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(i==RNA_data.gene_names) for i in data_spatial_partial.gene_names]),-1)]
94 | n_inputs = [d.nb_genes for d in datasets]
95 | total_genes = RNA_data.nb_genes
96 | n_batches = sum([d.n_batches for d in datasets])
97 |
98 | model_library_size = [True, False]
99 |
100 | n_latent = 8
101 | kappa = 5
102 |
103 | torch.manual_seed(0)
104 |
105 | model = JVAE(
106 | n_inputs,
107 | total_genes,
108 | gene_mappings,
109 | generative_distributions,
110 | model_library_size,
111 | n_layers_decoder_individual=0,
112 | n_layers_decoder_shared=0,
113 | n_layers_encoder_individual=1,
114 | n_layers_encoder_shared=1,
115 | dim_hidden_encoder=64,
116 | dim_hidden_decoder_shared=64,
117 | dropout_rate_encoder=0.2,
118 | dropout_rate_decoder=0.2,
119 | n_batch=n_batches,
120 | n_latent=n_latent,
121 | )
122 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
123 |
124 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
125 | trainer.train(n_epochs=200)
126 |
127 | for i in ["TESC","PVRL3","GRM2"]:
128 | _,Imputed = trainer.get_imputed_values(normalized=True)
129 | Imputed = np.reshape(Imputed[:,np.argwhere(i==RNA_data.gene_names)[0]],-1)
130 | Imp_New_Genes[i] = Imputed
131 |
132 | Imp_New_Genes.to_csv('Results/gimVI_New_genes.csv')
133 |
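The gene-mapping step in this script looks up, for every spatial gene, its column position in the RNA gene list. Below is a minimal sketch of that lookup, assuming gene names are unique strings; `gene_indices` is a hypothetical helper, not part of scVI or of this repository. Unlike the `np.argwhere` construction in the script, which silently skips genes that are absent from the reference, this version raises an error so that a misaligned mapping is noticed early.

```python
import numpy as np

def gene_indices(query_genes, reference_genes):
    """Return the position of each query gene inside reference_genes.

    Hypothetical helper: raises KeyError if a query gene is missing.
    If the reference contains duplicates, the first occurrence is used.
    """
    position = {}
    for idx, gene in enumerate(reference_genes):
        position.setdefault(gene, idx)  # keep the first occurrence only
    missing = [g for g in query_genes if g not in position]
    if missing:
        raise KeyError("genes not found in reference: %s" % missing)
    return np.array([position[g] for g in query_genes], dtype=int)

# Toy usage with made-up gene names.
example_mapping = gene_indices(["b", "a"], ["a", "b", "c"])
```

The resulting integer array can be used to reorder the columns of a cells-by-genes matrix, for example `X[:, example_mapping]`.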
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/Liger/LIGER.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_Ziesel/")
2 | library(liger)
3 | library(hdf5r)
4 | library(methods)
5 |
6 | # Zeisel SMSC
7 | Zeisel <- read.delim(file = "data/Zeisel/expression_mRNA_17-Aug-2014.txt",header = FALSE)
8 | meta.data <- Zeisel[1:10,]
9 | rownames(meta.data) <- meta.data[,2]
10 | meta.data <- meta.data[,-c(1,2)]
11 | meta.data <- as.data.frame(t(meta.data))
12 |
13 | Zeisel <- Zeisel[12:19983,]
14 | gene_names <- Zeisel[,1]
15 | Zeisel <- Zeisel[,-c(1,2)]
16 | Zeisel <- apply(Zeisel,2,as.numeric)
17 | Zeisel <- as.matrix(Zeisel)
18 | rownames(Zeisel) <- gene_names
19 | Zeisel <- Zeisel[,meta.data$tissue=='sscortex']
20 | meta.data <- meta.data[meta.data$tissue=='sscortex',]
21 |
22 | colnames(Zeisel) <- paste0('SMSC_',c(1:dim(Zeisel)[2]))
23 | rownames(meta.data) <- paste0('SMSC_',c(1:dim(meta.data)[1]))
24 |
25 | gene_count <- rowSums(Zeisel>0)
26 | Zeisel <- Zeisel[gene_count >= 10,]
27 |
28 | # osmFISH
29 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom")
30 | mat <- osm[['matrix']][,]
31 | colnames(mat) <- osm[['row_attrs']][['Gene']][]
32 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][])
33 | x_dim <- osm[['col_attrs']][['X']][]
34 | y_dim <- osm[['col_attrs']][['Y']][]
35 | region <- osm[['col_attrs']][['Region']][]
36 | cluster <- osm[['col_attrs']][['ClusterName']][]
37 | osm$close_all()
38 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim)
39 | rownames(spatial) <- rownames(mat)
40 | spatial <- as.matrix(spatial)
41 | osmFISH <- t(mat)
42 | osmFISH <- osmFISH[,!is.element(region,c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))]
43 |
44 | Gene_set <- intersect(rownames(osmFISH),rownames(Zeisel))
45 |
46 | #### New genes prediction
47 | Ligerex <- createLiger(list(SMSC_FISH = osmFISH, SMSC_RNA = Zeisel))
48 | Ligerex <- normalize(Ligerex)
49 | Ligerex@var.genes <- Gene_set
50 | Ligerex <- scaleNotCenter(Ligerex)
51 |
52 | # suggestK(Ligerex, k.test= seq(5,30,5)) # K = 20
53 | # suggestLambda(Ligerex, k = 20) # Lambda = 20
54 |
55 | Ligerex <- optimizeALS(Ligerex,k = 20, lambda = 20)
56 | Ligerex <- quantileAlignSNF(Ligerex)
57 |
58 | Imputation <- imputeKNN(Ligerex,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE)
59 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2')
60 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(Imputation@norm.data$SMSC_FISH)[2])
61 | rownames(Imp_New_genes) <- new.genes
62 | for(i in 1:length(new.genes)) {
63 | Imp_New_genes[new.genes[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[new.genes[i],])
64 | }
65 | write.csv(Imp_New_genes,file = 'Results/Liger_New_genes.csv')
66 |
67 | # leave-one-out validation
68 | genes.leaveout <- intersect(rownames(osmFISH),rownames(Zeisel))
69 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH)[2])
70 | rownames(Imp_genes) <- genes.leaveout
71 | colnames(Imp_genes) <- colnames(osmFISH)
72 | NMF_time <- vector(mode= "numeric")
73 | knn_time <- vector(mode= "numeric")
74 |
75 | for(i in 1:length(genes.leaveout)) {
76 | print(i)
77 | start_time <- Sys.time()
78 | Ligerex.leaveout <- createLiger(list(SMSC_FISH = osmFISH[-which(rownames(osmFISH) %in% genes.leaveout[i]),], SMSC_RNA = Zeisel[rownames(osmFISH),]))
79 | Ligerex.leaveout <- normalize(Ligerex.leaveout)
80 | Ligerex.leaveout@var.genes <- setdiff(Gene_set,genes.leaveout[i])
81 | Ligerex.leaveout <- scaleNotCenter(Ligerex.leaveout)
82 | if(dim(Ligerex.leaveout@norm.data$SMSC_FISH)[2]!=dim(osmFISH)[2]){
83 | next
84 | }
85 | Ligerex.leaveout <- optimizeALS(Ligerex.leaveout,k = 20, lambda = 20)
86 | Ligerex.leaveout <- quantileAlignSNF(Ligerex.leaveout)
87 | end_time <- Sys.time()
88 | NMF_time <- c(NMF_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
89 |
90 | start_time <- Sys.time()
91 | Imputation <- imputeKNN(Ligerex.leaveout,reference = 'SMSC_RNA', queries = list('SMSC_FISH'), norm = TRUE, scale = FALSE, knn_k = 30)
92 | Imp_genes[genes.leaveout[[i]],] = as.vector(Imputation@norm.data$SMSC_FISH[genes.leaveout[i],])
93 | end_time <- Sys.time()
94 | knn_time <- c(knn_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
95 | }
96 | write.csv(Imp_genes,file = 'Results/Liger_LeaveOneOut.csv')
97 | write.csv(NMF_time,file = 'Results/Liger_NMF_time.csv',row.names = FALSE)
98 | write.csv(knn_time,file = 'Results/Liger_knn_time.csv',row.names = FALSE)
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/Seurat/Zeisel_Cortex.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_Ziesel/")
2 | library(Seurat)
3 |
4 | Zeisel <- read.delim(file = "data/Zeisel/expression_mRNA_17-Aug-2014.txt",header = FALSE)
5 |
6 | meta.data <- Zeisel[1:10,]
7 | rownames(meta.data) <- meta.data[,2]
8 | meta.data <- meta.data[,-c(1,2)]
9 | meta.data <- as.data.frame(t(meta.data))
10 |
11 | Zeisel <- Zeisel[12:19983,]
12 | gene_names <- Zeisel[,1]
13 | Zeisel <- Zeisel[,-c(1,2)]
14 | Zeisel <- apply(Zeisel,2,as.numeric)
15 | Zeisel <- as.matrix(Zeisel)
16 | rownames(Zeisel) <- gene_names
17 | Zeisel <- Zeisel[,meta.data$tissue=='sscortex']
18 | meta.data <- meta.data[meta.data$tissue=='sscortex',]
19 |
20 | colnames(Zeisel) <- paste0('SMSC_',c(1:dim(Zeisel)[2]))
21 | rownames(meta.data) <- paste0('SMSC_',c(1:dim(meta.data)[1]))
22 |
23 | Zeisel <- CreateSeuratObject(counts = Zeisel, project = 'SMSC', meta.data = meta.data, min.cells = 10)
24 | Zeisel <- NormalizeData(object = Zeisel)
25 | Zeisel <- FindVariableFeatures(object = Zeisel, nfeatures = 2000)
26 | Zeisel <- ScaleData(object = Zeisel)
27 | Zeisel <- RunPCA(object = Zeisel, npcs = 50, verbose = FALSE)
28 | Zeisel <- RunUMAP(object = Zeisel, dims = 1:50, nneighbors = 5)
29 | saveRDS(object = Zeisel, file = paste0("data/seurat_objects/","Zeisel_SMSC.rds"))
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/Seurat/impute_osmFISH.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_Ziesel/")
2 | library(Seurat)
3 |
4 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds")
5 | Zeisel <- readRDS("data/seurat_objects/Zeisel_SMSC.rds")
6 |
7 | #Project on Zeisel labels
8 | i2 <- FindTransferAnchors(
9 | reference = Zeisel,
10 | query = osmFISH,
11 | features = rownames(osmFISH),
12 | reduction = 'cca',
13 | reference.assay = 'RNA',
14 | query.assay = 'RNA'
15 | )
16 |
17 | refdata <- GetAssayData(
18 | object = Zeisel,
19 | assay = 'RNA',
20 | slot = 'data'
21 | )
22 |
23 | imputation <- TransferData(
24 | anchorset = i2,
25 | refdata = refdata,
26 | weight.reduction = 'pca'
27 | )
28 |
29 | osmFISH[['ss2']] <- imputation
30 | saveRDS(osmFISH, 'data/seurat_objects/osmFISH_Cortex_imputed.rds')
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/Seurat/osmFISH.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_Ziesel/")
2 | library(Seurat)
3 | library(hdf5r)
4 | library(methods)
5 |
6 | osm <- H5File$new("data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom")
7 | mat <- osm[['matrix']][,]
8 | colnames(mat) <- osm[['row_attrs']][['Gene']][]
9 | rownames(mat) <- paste0('osm_', osm[['col_attrs']][['CellID']][])
10 | x_dim <- osm[['col_attrs']][['X']][]
11 | y_dim <- osm[['col_attrs']][['Y']][]
12 | region <- osm[['col_attrs']][['Region']][]
13 | cluster <- osm[['col_attrs']][['ClusterName']][]
14 | osm$close_all()
15 | spatial <- data.frame(spatial1 = x_dim, spatial2 = y_dim)
16 | rownames(spatial) <- rownames(mat)
17 | spatial <- as.matrix(spatial)
18 | mat <- t(mat)
19 |
20 | osm_seurat <- CreateSeuratObject(counts = mat, project = 'osmFISH', assay = 'RNA', min.cells = -1, min.features = -1)
21 | names(region) <- colnames(osm_seurat)
22 | names(cluster) <- colnames(osm_seurat)
23 | osm_seurat <- AddMetaData(osm_seurat, region, col.name = 'region')
24 | osm_seurat <- AddMetaData(osm_seurat, cluster, col.name = 'cluster')
25 | osm_seurat[['spatial']] <- CreateDimReducObject(embeddings = spatial, key = 'spatial', assay = 'RNA')
26 | Idents(osm_seurat) <- 'region'
27 | osm_seurat <- SubsetData(osm_seurat, ident.remove = c('Excluded','Internal Capsule Caudoputamen','White matter','Hippocampus','Ventricle'))
28 | total.counts = colSums(x = as.matrix(osm_seurat@assays$RNA@counts))
29 | osm_seurat <- NormalizeData(osm_seurat, scale.factor = median(x = total.counts))
30 | osm_seurat <- ScaleData(osm_seurat)
31 | saveRDS(object = osm_seurat, file = 'data/seurat_objects/osmFISH_Cortex.rds')
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/Seurat/osmFISH_integration.R:
--------------------------------------------------------------------------------
1 | setwd("osmFISH_Ziesel/")
2 | library(Seurat)
3 | library(ggplot2)
4 |
5 | osmFISH <- readRDS("data/seurat_objects/osmFISH_Cortex.rds")
6 | osmFISH.imputed <- readRDS("data/seurat_objects/osmFISH_Cortex_imputed.rds")
7 | Zeisel <- readRDS("data/seurat_objects/Zeisel_SMSC.rds")
8 |
9 | genes.leaveout <- intersect(rownames(osmFISH),rownames(Zeisel))
10 | Imp_genes <- matrix(0,nrow = length(genes.leaveout),ncol = dim(osmFISH@assays$RNA)[2])
11 | rownames(Imp_genes) <- genes.leaveout
12 | anchor_time <- vector(mode= "numeric")
13 | Transfer_time <- vector(mode= "numeric")
14 |
15 | run_imputation <- function(ref.obj, query.obj, feature.remove) {
16 | message(paste0('removing ', feature.remove))
17 | features <- setdiff(rownames(query.obj), feature.remove)
18 | DefaultAssay(ref.obj) <- 'RNA'
19 | DefaultAssay(query.obj) <- 'RNA'
20 |
21 | start_time <- Sys.time()
22 | anchors <- FindTransferAnchors(
23 | reference = ref.obj,
24 | query = query.obj,
25 | features = features,
26 | dims = 1:30,
27 | reduction = 'cca'
28 | )
29 | end_time <- Sys.time()
30 | anchor_time <<- c(anchor_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
31 |
32 | refdata <- GetAssayData(
33 | object = ref.obj,
34 | assay = 'RNA',
35 | slot = 'data'
36 | )
37 |
38 | start_time <- Sys.time()
39 | imputation <- TransferData(
40 | anchorset = anchors,
41 | refdata = refdata,
42 | weight.reduction = 'pca'
43 | )
44 | query.obj[['seq']] <- imputation
45 | end_time <- Sys.time()
46 | Transfer_time <<- c(Transfer_time,as.numeric(difftime(end_time,start_time,units = 'secs')))
47 | return(query.obj)
48 | }
49 |
50 | for(i in 1:length(genes.leaveout)) {
51 | imputed.ss2 <- run_imputation(ref.obj = Zeisel, query.obj = osmFISH, feature.remove = genes.leaveout[[i]])
52 | osmFISH[['ss2']] <- imputed.ss2[, colnames(osmFISH)][['seq']]
53 | Imp_genes[genes.leaveout[[i]],] = as.vector(osmFISH@assays$ss2[genes.leaveout[i],])
54 | }
55 | write.csv(Imp_genes,file = 'Results/Seurat_LeaveOneOut.csv')
56 | write.csv(anchor_time,file = 'Results/Seurat_anchor_time.csv',row.names = FALSE)
57 | write.csv(Transfer_time,file = 'Results/Seurat_transfer_time.csv',row.names = FALSE)
58 |
59 | # show genes not in the osmFISH dataset
60 | DefaultAssay(osmFISH.imputed) <- "ss2"
61 | new.genes <- c('Tesc', 'Pvrl3', 'Grm2')
62 | Imp_New_genes <- matrix(0,nrow = length(new.genes),ncol = dim(osmFISH.imputed@assays$ss2)[2])
63 | rownames(Imp_New_genes) <- new.genes
64 | for(i in 1:length(new.genes)) {
65 | Imp_New_genes[new.genes[[i]],] = as.vector(osmFISH.imputed@assays$ss2[new.genes[i],])
66 | }
67 | write.csv(Imp_New_genes,file = 'Results/Seurat_New_genes.csv')
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/SpaGE/Integration.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_Ziesel/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.neighbors import NearestNeighbors
8 | import time as tm
9 |
10 | with open ('data/SpaGE_pkl/Ziesel.pkl', 'rb') as f:
11 | datadict = pickle.load(f)
12 |
13 | RNA_data = datadict['RNA_data']
14 | RNA_data_scaled = datadict['RNA_data_scaled']
15 | del datadict
16 |
17 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f:
18 | datadict = pickle.load(f)
19 |
20 | osmFISH_data = datadict['osmFISH_data']
21 | osmFISH_data_scaled = datadict['osmFISH_data_scaled']
22 | osmFISH_meta= datadict['osmFISH_meta']
23 | del datadict
24 |
25 | #### Leave One Out Validation ####
26 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns)
28 | precise_time = []
29 | knn_time = []
30 | for i in Common_data.columns:
31 | print(i)
32 | start = tm.time()
33 | from principal_vectors import PVComputation
34 |
35 | n_factors = 30
36 | n_pv = 30
37 | dim_reduction = 'pca'
38 | dim_reduction_target = 'pca'
39 |
40 | pv_FISH_RNA = PVComputation(
41 | n_factors = n_factors,
42 | n_pv = n_pv,
43 | dim_reduction = dim_reduction,
44 | dim_reduction_target = dim_reduction_target
45 | )
46 |
47 | pv_FISH_RNA.fit(Common_data.drop(i,axis=1),osmFISH_data_scaled[Common_data.columns].drop(i,axis=1))
48 |
49 | S = pv_FISH_RNA.source_components_.T
50 |
51 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
52 | S = S[:,0:Effective_n_pv]
53 |
54 | Common_data_t = Common_data.drop(i,axis=1).dot(S)
55 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S)
56 | precise_time.append(tm.time()-start)
57 |
58 | start = tm.time()
59 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t)
60 | distances, indices = nbrs.kneighbors(FISH_exp_t)
61 |
62 | Imp_Gene = np.zeros(osmFISH_data.shape[0])
63 | for j in range(0,osmFISH_data.shape[0]):
64 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
65 | weights = weights/(len(weights)-1)
66 | Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights))
67 | Imp_Gene[np.isnan(Imp_Gene)] = 0
68 | Imp_Genes[i] = Imp_Gene
69 | knn_time.append(tm.time()-start)
70 |
71 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv')
72 | precise_time = pd.DataFrame(precise_time)
73 | knn_time = pd.DataFrame(knn_time)
74 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False)
75 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False)
76 |
77 | #### Novel Genes Expression Patterns ####
78 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
79 | genes_to_impute = ["Tesc","Pvrl3","Grm2"]
80 | Imp_New_Genes = pd.DataFrame(np.zeros((osmFISH_data.shape[0],len(genes_to_impute))),columns=genes_to_impute)
81 |
82 | from principal_vectors import PVComputation
83 |
84 | n_factors = 30
85 | n_pv = 30
86 | dim_reduction = 'pca'
87 | dim_reduction_target = 'pca'
88 |
89 | pv_FISH_RNA = PVComputation(
90 | n_factors = n_factors,
91 | n_pv = n_pv,
92 | dim_reduction = dim_reduction,
93 | dim_reduction_target = dim_reduction_target
94 | )
95 |
96 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns])
97 |
98 | S = pv_FISH_RNA.source_components_.T
99 |
100 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
101 | S = S[:,0:Effective_n_pv]
102 |
103 | Common_data_t = Common_data.dot(S)
104 | FISH_exp_t = osmFISH_data_scaled[Common_data.columns].dot(S)
105 |
106 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',
107 | metric = 'cosine').fit(Common_data_t)
108 | distances, indices = nbrs.kneighbors(FISH_exp_t)
109 |
110 | for j in range(0,osmFISH_data.shape[0]):
111 |
112 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
113 | weights = weights/(len(weights)-1)
114 | Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]])
115 |
116 | Imp_New_Genes.to_csv('Results/SpaGE_New_genes.csv')
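The neighbor weighting used in both loops of this script can be isolated into one function. This is a minimal sketch, assuming one spatial cell with its cosine distances to the retrieved RNA neighbors and the expression of one gene in those neighbors; `impute_from_neighbors` is a hypothetical name, and returning 0 when fewer than two neighbors qualify mirrors the NaN-to-zero fallback of the leave-one-out loop only.

```python
import numpy as np

def impute_from_neighbors(distances, neighbor_values):
    """Weighted average of neighbor expression for a single spatial cell.

    Only neighbors with cosine distance < 1 (positive cosine similarity)
    contribute. Each weight is (1 - d_i / sum(d)) / (n - 1), so the
    weights of the n retained neighbors sum to 1.
    """
    distances = np.asarray(distances, dtype=float)
    neighbor_values = np.asarray(neighbor_values, dtype=float)
    keep = distances < 1
    d = distances[keep]
    if d.size < 2 or d.sum() == 0:
        # Assumption: fall back to 0, as the script does for NaN results.
        return 0.0
    weights = (1 - d / d.sum()) / (d.size - 1)
    return float(np.dot(weights, neighbor_values[keep]))

# Toy usage: the third neighbor has distance >= 1 and is ignored.
toy_value = impute_from_neighbors([0.1, 0.3, 1.2], [2.0, 4.0, 100.0])
```

In the toy call the two retained neighbors receive weights 0.75 and 0.25, so the closer neighbor dominates the imputed value.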
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/SpaGE/Linnarson_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_Ziesel/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 | import loompy
9 |
10 | ## Read RNA seq data
11 | RNA_exp = pd.read_csv('data/Zeisel/expression_mRNA_17-Aug-2014.txt',header=None,sep='\t')
12 |
13 | RNA_data = RNA_exp.iloc[11:19984,:]
14 | RNA_data = RNA_data.drop([1],axis=1)
15 | RNA_data.set_index(0,inplace=True)
16 | RNA_data.columns = range(0,RNA_data.shape[1])
17 | RNA_data = RNA_data.astype('float')
18 |
19 | RNA_meta = RNA_exp.iloc[0:10,:]
20 | RNA_meta = RNA_meta.drop([0],axis=1)
21 | RNA_meta.set_index(1,inplace=True)
22 | RNA_meta.columns = range(0,RNA_meta.shape[1])
23 | RNA_meta = RNA_meta.T
24 | del RNA_exp
25 |
26 | RNA_data = RNA_data.loc[:,RNA_meta['tissue']=='sscortex']
27 | RNA_meta = RNA_meta.loc[RNA_meta['tissue']=='sscortex',:]
28 |
29 | RNA_data.columns = range(0,RNA_data.shape[1])
30 | RNA_meta.index = range(0,RNA_meta.shape[0])
31 |
32 | # filter lowly expressed genes
33 | Genes_count = np.sum(RNA_data > 0, axis=1)
34 | RNA_data = RNA_data.loc[Genes_count >=10,:]
35 | del Genes_count
36 |
37 | def Log_Norm_cpm(x):
38 | return np.log(((x/np.sum(x))*1000000) + 1)
39 |
40 | RNA_data = RNA_data.apply(Log_Norm_cpm,axis=0)
41 | RNA_data_scaled = pd.DataFrame(data=st.zscore(RNA_data.T),index = RNA_data.columns,columns=RNA_data.index)
42 |
43 | datadict = dict()
44 | datadict['RNA_data'] = RNA_data.T
45 | datadict['RNA_data_scaled'] = RNA_data_scaled
46 | datadict['RNA_meta'] = RNA_meta
47 |
48 | with open('data/SpaGE_pkl/Ziesel.pkl','wb') as f:
49 | pickle.dump(datadict, f)
50 |
51 | ## Read loom data
52 | ds = loompy.connect('data/osmFISH/osmFISH_SScortex_mouse_all_cells.loom')
53 |
54 | FISH_Genes = ds.ra['Gene']
55 |
56 | colAtr = ds.ca.keys()
57 |
58 | df = pd.DataFrame()
59 | for i in colAtr:
60 | df[i] = ds.ca[i]
61 |
62 | osmFISH_meta = df.iloc[np.where(df.Valid == 1)[0], :]
63 | osmFISH_data = ds[:,:]
64 | osmFISH_data = osmFISH_data[:,np.where(df.Valid == 1)[0]]
65 | osmFISH_data = pd.DataFrame(data= osmFISH_data, index= FISH_Genes)
66 |
67 | del ds, colAtr, i, df, FISH_Genes
68 |
69 | Cortex_Regions = ['Layer 2-3 lateral', 'Layer 2-3 medial', 'Layer 3-4',
70 | 'Layer 4','Layer 5', 'Layer 6', 'Pia Layer 1']
71 | Cortical = np.array([i in Cortex_Regions for i in osmFISH_meta.Region])
72 |
73 | osmFISH_meta = osmFISH_meta.iloc[Cortical,:]
74 | osmFISH_data = osmFISH_data.iloc[:,Cortical]
75 |
76 | cell_count = np.sum(osmFISH_data,axis=0)
77 | def Log_Norm(x):
78 | return np.log(((x/np.sum(x))*np.median(cell_count)) + 1)
79 |
80 | osmFISH_data = osmFISH_data.apply(Log_Norm,axis=0)
81 | osmFISH_data_scaled = pd.DataFrame(data=st.zscore(osmFISH_data.T),index = osmFISH_data.columns,columns=osmFISH_data.index)
82 |
83 | datadict = dict()
84 | datadict['osmFISH_data'] = osmFISH_data.T
85 | datadict['osmFISH_data_scaled'] = osmFISH_data_scaled
86 | datadict['osmFISH_meta'] = osmFISH_meta
87 |
88 | with open('data/SpaGE_pkl/osmFISH_Cortex.pkl','wb') as f:
89 | pickle.dump(datadict, f)
90 |
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/SpaGE/Precise_output.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_Ziesel/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | import matplotlib
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | import seaborn as sns
12 | import sys
13 | sys.path.insert(1,'SpaGE/')
14 | from principal_vectors import PVComputation
15 |
16 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f:
17 | datadict = pickle.load(f)
18 |
19 | osmFISH_data = datadict['osmFISH_data']
20 | osmFISH_data_scaled = datadict['osmFISH_data_scaled']
21 | osmFISH_meta= datadict['osmFISH_meta']
22 | del datadict
23 |
24 | with open ('data/SpaGE_pkl/Ziesel.pkl', 'rb') as f:
25 | datadict = pickle.load(f)
26 |
27 | RNA_data = datadict['RNA_data']
28 | RNA_data_scaled = datadict['RNA_data_scaled']
29 | RNA_meta = datadict['RNA_meta']
30 | del datadict
31 |
32 | Common_data = RNA_data_scaled[np.intersect1d(osmFISH_data_scaled.columns,RNA_data_scaled.columns)]
33 |
34 | n_factors = 30
35 | n_pv = 30
36 | n_pv_display = 30
37 | dim_reduction = 'pca'
38 | dim_reduction_target = 'pca'
39 |
40 | pv_FISH_RNA = PVComputation(
41 | n_factors = n_factors,
42 | n_pv = n_pv,
43 | dim_reduction = dim_reduction,
44 | dim_reduction_target = dim_reduction_target
45 | )
46 |
47 | pv_FISH_RNA.fit(Common_data,osmFISH_data_scaled[Common_data.columns])
48 |
49 | fig = plt.figure()
50 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
51 | center=0, vmax=1., vmin=0)
52 | plt.xlabel('osmFISH',fontsize=18, color='black')
53 | plt.ylabel('Ziesel_RNA',fontsize=18, color='black')
54 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
55 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
56 | plt.gca().set_ylim([n_pv_display,0])
57 | plt.show()
58 |
59 | plt.figure()
60 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
61 | center=0, vmax=1., vmin=0)
62 | for i in range(n_pv_display-1):
63 | plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black')
64 |
65 | plt.xlabel('osmFISH',fontsize=18, color='black')
66 | plt.ylabel('Ziesel_RNA',fontsize=18, color='black')
67 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
68 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
69 | plt.gca().set_ylim([n_pv_display,0])
70 | plt.show()
71 |
72 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns)
73 | Importance.sort_values(ascending=False,inplace=True)
74 | Importance.index[0:30]
75 |
76 | ### Technology specific Processes
77 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
78 |
79 | # explained variance RNA
80 | np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
81 | # explained variance spatial
82 | np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100
83 |
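The script above reads cosine similarities between paired principal vectors and keeps those above 0.3. The `PVComputation` class itself is not shown in this part of the repository, so the following is only a sketch of the standard principal-angles construction (SVD of the product of two orthonormal bases); it is an assumption that `principal_vectors.py` works along these lines, and `principal_vectors` below is a hypothetical function name.

```python
import numpy as np

def principal_vectors(source_factors, target_factors):
    """Pair up directions of two subspaces by decreasing cosine similarity.

    Both inputs are (n_factors, n_genes) arrays with n_genes >= n_factors.
    Returns (source_pvs, target_pvs, cosines); row i of source_pvs and
    row i of target_pvs have cosine similarity cosines[i].
    """
    # Orthonormalize each set of factors (columns of q span the subspace).
    qs, _ = np.linalg.qr(np.asarray(source_factors, dtype=float).T)
    qt, _ = np.linalg.qr(np.asarray(target_factors, dtype=float).T)
    # Singular values of qs^T qt are the cosines of the principal angles.
    u, cosines, vt = np.linalg.svd(qs.T @ qt, full_matrices=False)
    source_pvs = (qs @ u).T
    target_pvs = (qt @ vt.T).T
    return source_pvs, target_pvs, cosines

# Toy usage with random factors (hypothetical data).
rng = np.random.default_rng(0)
toy_source = rng.normal(size=(3, 6))
toy_target = rng.normal(size=(3, 6))
toy_s, toy_t, toy_cos = principal_vectors(toy_source, toy_target)
effective_n_pv = int(np.sum(toy_cos > 0.3))  # same threshold as the script
```

The threshold of 0.3 is taken from the script; the toy factors are random, so the number of retained pairs here carries no biological meaning.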
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and can be selected directly
5 | by string name in the main method. All the methods
6 | use the scikit-learn implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 | """
24 | Default linear dimensionality reduction method. For each method, return a
25 | BaseEstimator instance corresponding to the method given as input.
26 | Attributes
27 | -------
28 | method: str, default to 'pca'
29 | Method used for dimensionality reduction.
30 | Implemented: 'pca', 'ica', 'fa' (Factor Analysis),
31 | 'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA), 'pls' (Partial Least Squares).
32 |
33 | n_dim: int, default to 10
34 | Number of domain-specific factors to compute.
35 | Return values
36 | -------
37 | Classifier, i.e. BaseEstimator instance
38 | """
39 |
40 | if method.lower() == 'pca':
41 | clf = PCA(n_components=n_dim)
42 |
43 | elif method.lower() == 'ica':
44 | print('ICA')
45 | clf = FastICA(n_components=n_dim)
46 |
47 | elif method.lower() == 'fa':
48 | clf = FactorAnalysis(n_components=n_dim)
49 |
50 | elif method.lower() == 'nmf':
51 | clf = NMF(n_components=n_dim)
52 |
53 | elif method.lower() == 'sparsepca':
54 | clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 | elif method.lower() == 'pls':
57 | clf = PLS(n_components=n_dim)
58 |
59 | else:
60 | raise NameError('%s is not an implemented method'%(method))
61 |
62 | return clf
63 |
64 |
65 | class PLS():
66 | """
67 | Implement PLS to make it compliant with the other dimensionality
68 | reduction methodology.
69 | (Simple class rewriting).
70 | """
71 | def __init__(self, n_components=10):
72 | self.clf = PLSRegression(n_components)
73 |
74 | def get_components_(self):
75 | return self.clf.x_weights_.transpose()
76 |
77 | def set_components_(self, x):
78 | pass
79 |
80 | components_ = property(get_components_, set_components_)
81 |
82 | def fit(self, X, y):
83 | self.clf.fit(X,y)
84 | return self
85 |
86 | def transform(self, X):
87 | return self.clf.transform(X)
88 |
89 | def predict(self, X):
90 | return self.clf.predict(X)
--------------------------------------------------------------------------------
/benchmark/osmFISH_Ziesel/gimVI/gimVI.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('osmFISH_Ziesel/')
3 |
4 | from scvi.dataset import CsvDataset
5 | from scvi.models import JVAE, Classifier
6 | from scvi.inference import JVAETrainer
7 | import numpy as np
8 | import pandas as pd
9 | import copy
10 | import torch
11 | import time as tm
12 |
13 | ### osmFISH data
14 | osmFISH_data = CsvDataset('data/gimVI_data/osmFISH_Cortex_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
15 |
16 | ### RNA
17 | RNA_data = CsvDataset('data/gimVI_data/RNA_Cortex_scvi.csv', save_path = "", sep = ",", gene_by_cell = False)
18 |
19 | ### Leave-one-out validation
20 | Common_data = copy.deepcopy(RNA_data)
21 | Common_data.gene_names = osmFISH_data.gene_names
22 | Common_data.X = Common_data.X[:,np.reshape(np.vstack([np.argwhere(g==RNA_data.gene_names) for g in osmFISH_data.gene_names]),-1)]
23 | Gene_set = np.intersect1d(osmFISH_data.gene_names,Common_data.gene_names)
24 | Imp_Genes = pd.DataFrame(columns=Gene_set)
25 | gimVI_time = []
26 |
27 | for i in Gene_set:
28 | print(i)
29 | # Create copy of the fish dataset with hidden genes
30 | data_spatial_partial = copy.deepcopy(osmFISH_data)
31 | data_spatial_partial.filter_genes_by_attribute(np.setdiff1d(osmFISH_data.gene_names,i))
32 | data_spatial_partial.batch_indices += Common_data.n_batches
33 |
34 | if(data_spatial_partial.X.shape[0] != osmFISH_data.X.shape[0]):
35 | continue
36 |
37 | datasets = [Common_data, data_spatial_partial]
38 | generative_distributions = ["zinb", "nb"]
39 |     gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(g==Common_data.gene_names) for g in data_spatial_partial.gene_names]),-1)]
40 | n_inputs = [d.nb_genes for d in datasets]
41 | total_genes = Common_data.nb_genes
42 | n_batches = sum([d.n_batches for d in datasets])
43 |
44 | model_library_size = [True, False]
45 |
46 | n_latent = 8
47 | kappa = 5
48 |
49 | start = tm.time()
50 | torch.manual_seed(0)
51 |
52 | model = JVAE(
53 | n_inputs,
54 | total_genes,
55 | gene_mappings,
56 | generative_distributions,
57 | model_library_size,
58 | n_layers_decoder_individual=0,
59 | n_layers_decoder_shared=0,
60 | n_layers_encoder_individual=1,
61 | n_layers_encoder_shared=1,
62 | dim_hidden_encoder=64,
63 | dim_hidden_decoder_shared=64,
64 | dropout_rate_encoder=0.2,
65 | dropout_rate_decoder=0.2,
66 | n_batch=n_batches,
67 | n_latent=n_latent,
68 | )
69 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
70 |
71 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
72 | trainer.train(n_epochs=200)
73 | _,Imputed = trainer.get_imputed_values(normalized=True)
74 | Imputed = np.reshape(Imputed[:,np.argwhere(i==Common_data.gene_names)[0]],-1)
75 | Imp_Genes[i] = Imputed
76 | gimVI_time.append(tm.time()-start)
77 |
78 | Imp_Genes = Imp_Genes.fillna(0)
79 | Imp_Genes.to_csv('Results/gimVI_LeaveOneOut.csv')
80 | gimVI_time = pd.DataFrame(gimVI_time)
81 | gimVI_time.to_csv('Results/gimVI_Time.csv', index = False)
82 |
83 | ### New genes
84 | Imp_New_Genes = pd.DataFrame(columns=["TESC","PVRL3","GRM2"])
85 |
86 | # Create copy of the fish dataset (all genes kept; nothing is hidden here)
87 | data_spatial_partial = copy.deepcopy(osmFISH_data)
88 | data_spatial_partial.filter_genes_by_attribute(osmFISH_data.gene_names)
89 | data_spatial_partial.batch_indices += RNA_data.n_batches
90 |
91 | datasets = [RNA_data, data_spatial_partial]
92 | generative_distributions = ["zinb", "nb"]
93 | gene_mappings = [slice(None), np.reshape(np.vstack([np.argwhere(g==RNA_data.gene_names) for g in data_spatial_partial.gene_names]),-1)]
94 | n_inputs = [d.nb_genes for d in datasets]
95 | total_genes = RNA_data.nb_genes
96 | n_batches = sum([d.n_batches for d in datasets])
97 |
98 | model_library_size = [True, False]
99 |
100 | n_latent = 8
101 | kappa = 5
102 |
103 | torch.manual_seed(0)
104 |
105 | model = JVAE(
106 | n_inputs,
107 | total_genes,
108 | gene_mappings,
109 | generative_distributions,
110 | model_library_size,
111 | n_layers_decoder_individual=0,
112 | n_layers_decoder_shared=0,
113 | n_layers_encoder_individual=1,
114 | n_layers_encoder_shared=1,
115 | dim_hidden_encoder=64,
116 | dim_hidden_decoder_shared=64,
117 | dropout_rate_encoder=0.2,
118 | dropout_rate_decoder=0.2,
119 | n_batch=n_batches,
120 | n_latent=n_latent,
121 | )
122 | discriminator = Classifier(n_latent, 32, 2, 3, logits=True)
123 |
124 | trainer = JVAETrainer(model, discriminator, datasets, 0.95, frequency=1, kappa=kappa)
125 | trainer.train(n_epochs=200)
126 |
127 | for i in ["TESC","PVRL3","GRM2"]:
128 | _,Imputed = trainer.get_imputed_values(normalized=True)
129 | Imputed = np.reshape(Imputed[:,np.argwhere(i==RNA_data.gene_names)[0]],-1)
130 | Imp_New_Genes[i] = Imputed
131 |
132 | Imp_New_Genes.to_csv('Results/gimVI_New_genes.csv')
133 |
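The `gene_mappings` entries above locate each spatial gene inside the scRNA-seq gene list by stacking `np.argwhere` results. A minimal sketch of the same lookup with a dictionary follows, assuming gene names are unique within the reference list. `gene_indices` is a hypothetical helper, not part of scVI or of this repository, and the toy gene list (including `ACTB`) is made up for illustration.

```python
import numpy as np


def gene_indices(query_genes, reference_genes):
    """Return the position of each query gene inside reference_genes.

    Assumes every name in reference_genes is unique (hypothetical helper).
    """
    position = {gene: k for k, gene in enumerate(reference_genes)}
    missing = [gene for gene in query_genes if gene not in position]
    if missing:
        # Fail loudly instead of silently dropping genes
        raise KeyError("genes absent from the reference: %s" % missing)
    return np.array([position[gene] for gene in query_genes], dtype=int)


# Toy data: the reference plays the role of RNA_data.gene_names
reference = np.array(["GRM2", "ACTB", "TESC", "PVRL3"])
query = ["TESC", "PVRL3", "GRM2"]
idx = gene_indices(query, reference)
```

Indexing the reference with `idx` returns the query genes in query order, which is what the column selection on `Common_data.X` relies on.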
--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/DownSampling/DownSampling.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('seqFISH_AllenVISp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.neighbors import NearestNeighbors
8 | import scipy.stats as st
9 | import sys
10 | sys.path.insert(1,'Scripts/SpaGE/')
11 | from principal_vectors import PVComputation
12 |
13 | with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f:
14 | datadict = pickle.load(f)
15 |
16 | seqFISH_data = datadict['seqFISH_data']
17 | seqFISH_data_scaled = datadict['seqFISH_data_scaled']
18 | seqFISH_meta= datadict['seqFISH_meta']
19 | del datadict
20 |
21 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
22 | datadict = pickle.load(f)
23 |
24 | RNA_data = datadict['RNA_data']
25 | RNA_data_scaled = datadict['RNA_data_scaled']
26 | del datadict
27 |
28 | Gene_Order = np.intersect1d(seqFISH_data.columns,RNA_data.columns)
29 |
30 | ### SpaGE
31 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut.csv',header=0,index_col=0,sep=',')
32 | #SpaGE_New = pd.read_csv('Results/SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
33 |
34 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]
35 |
36 | SpaGE_seqCorr = pd.Series(index = Gene_Order, dtype = float)
37 | for i in Gene_Order:
38 | SpaGE_seqCorr[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed[i])[0]
39 | SpaGE_seqCorr[np.isnan(SpaGE_seqCorr)] = 0
40 |
41 | SpaGE_seqCorr.sort_values(ascending=False,inplace=True)
42 | test_set = SpaGE_seqCorr.index[0:50]
43 |
44 | Common_data = RNA_data[Gene_Order]
45 | Common_data = Common_data.drop(columns=test_set)
46 |
47 | Corr = np.corrcoef(Common_data.T)
48 | #for i in range(0,Corr.shape[0]):
49 | # Corr[i,i]=0
50 |
51 | #plt.hist(np.abs(np.reshape(Corr,-1)),bins=np.arange(0,1.05,0.05))
52 | #plt.show()
53 | # Absolute-correlation threshold used below: 0.7
54 | removed_genes = []
55 | for i in range(0,Corr.shape[0]):
56 | for j in range(i+1,Corr.shape[0]):
57 | if(np.abs(Corr[i,j]) > 0.7):
58 | Vi = np.var(Common_data.iloc[:,i])
59 | Vj = np.var(Common_data.iloc[:,j])
60 | if(Vi > Vj):
61 | removed_genes.append(Common_data.columns[j])
62 | else:
63 | removed_genes.append(Common_data.columns[i])
64 | removed_genes= np.unique(removed_genes)
65 |
66 | Common_data = Common_data.drop(columns=removed_genes)
67 | Variance = np.var(Common_data)
68 | Variance.sort_values(ascending=False,inplace=True)
69 | Variance = pd.concat([Variance, pd.Series(0,index=removed_genes)])
70 |
71 | #### Novel Genes Expression Patterns ####
72 | genes_to_impute = test_set
73 | for i in [10,30,50,100,200,500,1000,2000,5000,7000,len(Variance)]:
74 | print(i)
75 | Imp_New_Genes = pd.DataFrame(np.zeros((seqFISH_data.shape[0],len(genes_to_impute))),columns=genes_to_impute)
76 |
77 | if(i>=50):
78 | n_factors = 50
79 | n_pv = 50
80 | else:
81 | n_factors = i
82 | n_pv = i
83 |
84 | dim_reduction = 'pca'
85 | dim_reduction_target = 'pca'
86 |
87 | pv_FISH_RNA = PVComputation(
88 | n_factors = n_factors,
89 | n_pv = n_pv,
90 | dim_reduction = dim_reduction,
91 | dim_reduction_target = dim_reduction_target
92 | )
93 |
94 | source_data = RNA_data_scaled[Variance.index[0:i]]
95 | target_data = seqFISH_data_scaled[Variance.index[0:i]]
96 |
97 | pv_FISH_RNA.fit(source_data,target_data)
98 |
99 | S = pv_FISH_RNA.source_components_.T
100 |
101 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
102 | S = S[:,0:Effective_n_pv]
103 |
104 | RNA_data_t = source_data.dot(S)
105 | FISH_exp_t = target_data.dot(S)
106 |
107 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',
108 | metric = 'cosine').fit(RNA_data_t)
109 | distances, indices = nbrs.kneighbors(FISH_exp_t)
110 |
111 | for j in range(0,seqFISH_data.shape[0]):
112 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
113 | weights = weights/(len(weights)-1)
114 | Imp_New_Genes.iloc[j,:] = np.dot(weights,RNA_data[genes_to_impute].iloc[indices[j,:][distances[j,:] < 1]])
115 |
116 | Imp_New_Genes.to_csv('Results/' + str(i) +'SpaGE_New_genes.csv')
117 |
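The filtering loop in this script drops, for every gene pair whose absolute Pearson correlation exceeds 0.7, the gene with the lower variance. A minimal NumPy sketch of that rule on toy data is shown here; `drop_correlated` and the toy matrix are hypothetical and only illustrate the selection logic, not the pandas-based code above.

```python
import numpy as np


def drop_correlated(X, names, threshold=0.7):
    """Return the sorted names of columns to remove from X (cells x genes).

    For each pair with |corr| > threshold, the lower-variance column is
    marked for removal (ties remove the first column of the pair).
    """
    corr = np.corrcoef(X, rowvar=False)
    var = X.var(axis=0)
    removed = set()
    n_genes = X.shape[1]
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if abs(corr[i, j]) > threshold:
                removed.add(names[j] if var[i] > var[j] else names[i])
    return sorted(removed)


# Toy data: gene_b is an exact multiple of gene_a, gene_c is unrelated
gene_a = np.arange(10.0)
gene_b = 2.0 * gene_a
gene_c = np.array([1.0, -1.0] * 5)
X = np.column_stack([gene_a, gene_b, gene_c])
removed = drop_correlated(X, ["gene_a", "gene_b", "gene_c"])
```

Here `gene_a` and `gene_b` are perfectly correlated, so the lower-variance `gene_a` is removed and `gene_c` is kept.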
--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/DownSampling/DownSampling_evaluation.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('seqFISH_AllenVISp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import matplotlib
7 | matplotlib.use('qt5agg')
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | import scipy.stats as st
12 | import pickle
13 |
14 | ### Original data
15 | with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f:
16 | datadict = pickle.load(f)
17 |
18 | seqFISH_data = datadict['seqFISH_data']
19 | del datadict
20 |
21 | test_set = pd.read_csv('Results/10SpaGE_New_genes.csv',header=0,index_col=0,sep=',').columns
22 | seqFISH_data = seqFISH_data[test_set]
23 |
24 | ### SpaGE
25 | #10
26 | SpaGE_imputed_10 = pd.read_csv('Results/10SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
27 |
28 | SpaGE_Corr_10 = pd.Series(index = test_set)
29 | for i in test_set:
30 | SpaGE_Corr_10[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_10[i])[0]
31 |
32 | #30
33 | SpaGE_imputed_30 = pd.read_csv('Results/30SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
34 |
35 | SpaGE_Corr_30 = pd.Series(index = test_set)
36 | for i in test_set:
37 | SpaGE_Corr_30[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_30[i])[0]
38 |
39 | #50
40 | SpaGE_imputed_50 = pd.read_csv('Results/50SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
41 |
42 | SpaGE_Corr_50 = pd.Series(index = test_set)
43 | for i in test_set:
44 | SpaGE_Corr_50[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_50[i])[0]
45 |
46 | #100
47 | SpaGE_imputed_100 = pd.read_csv('Results/100SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
48 |
49 | SpaGE_Corr_100 = pd.Series(index = test_set)
50 | for i in test_set:
51 | SpaGE_Corr_100[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_100[i])[0]
52 |
53 | #200
54 | SpaGE_imputed_200 = pd.read_csv('Results/200SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
55 |
56 | SpaGE_Corr_200 = pd.Series(index = test_set)
57 | for i in test_set:
58 | SpaGE_Corr_200[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_200[i])[0]
59 |
60 | #500
61 | SpaGE_imputed_500 = pd.read_csv('Results/500SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
62 |
63 | SpaGE_Corr_500 = pd.Series(index = test_set)
64 | for i in test_set:
65 | SpaGE_Corr_500[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_500[i])[0]
66 |
67 | #1000
68 | SpaGE_imputed_1000 = pd.read_csv('Results/1000SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
69 |
70 | SpaGE_Corr_1000 = pd.Series(index = test_set)
71 | for i in test_set:
72 | SpaGE_Corr_1000[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_1000[i])[0]
73 |
74 | #2000
75 | SpaGE_imputed_2000 = pd.read_csv('Results/2000SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
76 |
77 | SpaGE_Corr_2000 = pd.Series(index = test_set)
78 | for i in test_set:
79 | SpaGE_Corr_2000[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_2000[i])[0]
80 |
81 | #5000
82 | SpaGE_imputed_5000 = pd.read_csv('Results/5000SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
83 |
84 | SpaGE_Corr_5000 = pd.Series(index = test_set)
85 | for i in test_set:
86 | SpaGE_Corr_5000[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_5000[i])[0]
87 |
88 | #7000
89 | SpaGE_imputed_7000 = pd.read_csv('Results/7000SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
90 |
91 | SpaGE_Corr_7000 = pd.Series(index = test_set)
92 | for i in test_set:
93 | SpaGE_Corr_7000[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_7000[i])[0]
94 |
95 | #9701
96 | SpaGE_imputed_9701 = pd.read_csv('Results/9701SpaGE_New_genes.csv',header=0,index_col=0,sep=',')
97 |
98 | SpaGE_Corr_9701 = pd.Series(index = test_set)
99 | for i in test_set:
100 | SpaGE_Corr_9701[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed_9701[i])[0]
101 |
102 | ### Comparison plots
103 | plt.style.use('ggplot')
104 | plt.figure(figsize=(9, 3))
105 | plt.boxplot([SpaGE_Corr_10, SpaGE_Corr_30, SpaGE_Corr_50,
106 | SpaGE_Corr_100,SpaGE_Corr_200,SpaGE_Corr_500,
107 | SpaGE_Corr_1000,SpaGE_Corr_2000,SpaGE_Corr_5000,
108 | SpaGE_Corr_7000,SpaGE_Corr_9701])
109 |
110 | y = SpaGE_Corr_10
111 | x = np.random.normal(1, 0.05, len(y))
112 | plt.plot(x, y, 'g.', alpha=0.2)
113 | y = SpaGE_Corr_30
114 | x = np.random.normal(2, 0.05, len(y))
115 | plt.plot(x, y, 'g.', alpha=0.2)
116 | y = SpaGE_Corr_50
117 | x = np.random.normal(3, 0.05, len(y))
118 | plt.plot(x, y, 'g.', alpha=0.2)
119 | y = SpaGE_Corr_100
120 | x = np.random.normal(4, 0.05, len(y))
121 | plt.plot(x, y, 'g.', alpha=0.2)
122 | y = SpaGE_Corr_200
123 | x = np.random.normal(5, 0.05, len(y))
124 | plt.plot(x, y, 'g.', alpha=0.2)
125 | y = SpaGE_Corr_500
126 | x = np.random.normal(6, 0.05, len(y))
127 | plt.plot(x, y, 'g.', alpha=0.2)
128 | y = SpaGE_Corr_1000
129 | x = np.random.normal(7, 0.05, len(y))
130 | plt.plot(x, y, 'g.', alpha=0.2)
131 | y = SpaGE_Corr_2000
132 | x = np.random.normal(8, 0.05, len(y))
133 | plt.plot(x, y, 'g.', alpha=0.2)
134 | y = SpaGE_Corr_5000
135 | x = np.random.normal(9, 0.05, len(y))
136 | plt.plot(x, y, 'g.', alpha=0.2)
137 | y = SpaGE_Corr_7000
138 | x = np.random.normal(10, 0.05, len(y))
139 | plt.plot(x, y, 'g.', alpha=0.2)
140 | y = SpaGE_Corr_9701
141 | x = np.random.normal(11, 0.05, len(y))
142 | plt.plot(x, y, 'g.', alpha=0.2)
143 |
144 |
145 | plt.xticks((1,2,3,4,5,6,7,8,9,10,11),
146 | ('10','30','50','100','200','500','1000','2000','5000','7000','9701'),size=10)
147 | plt.yticks(size=8)
148 | plt.xlabel('Number of shared genes',size=12)
149 | plt.gca().set_ylim([-0.3,0.8])
150 | plt.ylabel('Spearman Correlation',size=12)
151 | plt.show()
152 |
--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/Performance_evaluation.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('seqFISH_AllenVISp/')
3 |
4 |
5 | import numpy as np
6 | import pandas as pd
7 | import pickle
8 | import matplotlib
9 | matplotlib.use('qt5agg')
10 | matplotlib.rcParams['pdf.fonttype'] = 42
11 | matplotlib.rcParams['ps.fonttype'] = 42
12 | import matplotlib.pyplot as plt
13 | #from matplotlib import cm
14 | import scipy.stats as st
15 |
16 | with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f:
17 | datadict = pickle.load(f)
18 |
19 | seqFISH_data = datadict['seqFISH_data']
20 | seqFISH_meta= datadict['seqFISH_meta']
21 | del datadict
22 |
23 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
24 | datadict = pickle.load(f)
25 |
26 | RNA_data = datadict['RNA_data']
27 | del datadict
28 |
29 | Gene_Order = np.intersect1d(seqFISH_data.columns,RNA_data.columns)
30 |
31 | ### SpaGE
32 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut.csv',header=0,index_col=0,sep=',')
33 |
34 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]
35 |
36 | SpaGE_seqCorr = pd.Series(index = Gene_Order, dtype = float)
37 | for i in Gene_Order:
38 | SpaGE_seqCorr[i] = st.spearmanr(seqFISH_data[i],SpaGE_imputed[i])[0]
39 | SpaGE_seqCorr[np.isnan(SpaGE_seqCorr)] = 0
40 |
41 | os.chdir('../STARmap_AllenVISp/')
42 |
43 | with open ('data/SpaGE_pkl/Starmap.pkl', 'rb') as f:
44 | datadict = pickle.load(f)
45 |
46 | coords = datadict['coords']
47 | Starmap_data = datadict['Starmap_data']
48 | del datadict
49 |
50 | Gene_Order = np.intersect1d(Starmap_data.columns,RNA_data.columns)
51 |
52 | ### SpaGE
53 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut_cutoff.csv',header=0,index_col=0,sep=',')
54 |
55 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]
56 |
57 | SpaGE_starCorr = pd.Series(index = Gene_Order, dtype = float)
58 | for i in Gene_Order:
59 | SpaGE_starCorr[i] = st.spearmanr(Starmap_data[i],SpaGE_imputed[i])[0]
60 |
61 | def Compare_Correlations(X,Y):
62 | fig, ax = plt.subplots(figsize=(4.5, 4.5))
63 | ax.scatter(X, Y, s=1)
64 | ax.axvline(linestyle='--',color='gray')
65 | ax.axhline(linestyle='--',color='gray')
66 | plt.gca().set_ylim([-0.5,1])
67 | lims = [np.min([ax.get_xlim(), ax.get_ylim()]),
68 | np.max([ax.get_xlim(), ax.get_ylim()])]
69 | ax.plot(lims, lims, 'k-')
70 | ax.set_aspect('equal')
71 | ax.set_xlim(lims)
72 | ax.set_ylim(lims)
73 | plt.xticks(size=8)
74 | plt.yticks(size=8)
75 |     # plt.show() is left to the caller so that axis labels can be added first
76 |
77 | Starmap_seq_genes = np.intersect1d(Starmap_data.columns,seqFISH_data.columns)
78 | Compare_Correlations(SpaGE_starCorr[Starmap_seq_genes],SpaGE_seqCorr[Starmap_seq_genes])
79 | plt.xlabel('Spearman Correlation STARmap',size=12)
80 | plt.ylabel('Spearman Correlation seqFISH',size=12)
81 | plt.show()
82 |
83 | fig, ax = plt.subplots(figsize=(3.7, 4.5))
84 | ax.boxplot([SpaGE_starCorr[Starmap_seq_genes],SpaGE_seqCorr[Starmap_seq_genes]],widths=0.5)
85 |
86 | y = SpaGE_starCorr[Starmap_seq_genes]
87 | x = np.random.normal(1, 0.05, len(y))
88 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2)
89 | y = SpaGE_seqCorr[Starmap_seq_genes]
90 | x = np.random.normal(2, 0.05, len(y))
91 | plt.plot(x, y, 'g.', markersize=1, alpha=0.2)
92 |
93 | plt.xticks((1,2),('STARmap','seqFISH'),size=12)
94 | plt.yticks(size=8)
95 | plt.gca().set_ylim([-0.4,0.8])
96 | plt.ylabel('Spearman Correlation',size=12)
97 | #ax.set_aspect(aspect=3)
98 | _,p_val = st.wilcoxon(SpaGE_starCorr[Starmap_seq_genes],SpaGE_seqCorr[Starmap_seq_genes])
99 | plt.text(2,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
100 | plt.show()
101 |
102 | os.chdir('../osmFISH_AllenVISp/')
103 |
104 | with open ('data/SpaGE_pkl/osmFISH_Cortex.pkl', 'rb') as f:
105 | datadict = pickle.load(f)
106 |
107 | osmFISH_data = datadict['osmFISH_data']
108 | del datadict
109 |
110 | Gene_Order = osmFISH_data.columns
111 |
112 | ### SpaGE
113 | SpaGE_imputed = pd.read_csv('Results/SpaGE_LeaveOneOut_cutoff.csv',header=0,index_col=0,sep=',')
114 |
115 | SpaGE_imputed = SpaGE_imputed.loc[:,Gene_Order]
116 |
117 | SpaGE_osmCorr = pd.Series(index = Gene_Order, dtype = float)
118 | for i in Gene_Order:
119 | SpaGE_osmCorr[i] = st.spearmanr(osmFISH_data[i],SpaGE_imputed[i])[0]
120 |
121 | def Compare_Correlations(X,Y):
122 | fig, ax = plt.subplots(figsize=(4.5, 4.5))
123 | ax.scatter(X, Y, s=25)
124 | ax.axvline(linestyle='--',color='gray')
125 | ax.axhline(linestyle='--',color='gray')
126 | plt.gca().set_ylim([-0.5,1])
127 | lims = [np.min([ax.get_xlim(), ax.get_ylim()]),
128 | np.max([ax.get_xlim(), ax.get_ylim()])]
129 | ax.plot(lims, lims, 'k-')
130 | ax.set_aspect('equal')
131 | ax.set_xlim(lims)
132 | ax.set_ylim(lims)
133 | plt.xticks(size=8)
134 | plt.yticks(size=8)
135 |     # plt.show() is left to the caller so that axis labels can be added first
136 |
137 | osm_seq_genes = np.intersect1d(osmFISH_data.columns,seqFISH_data.columns)
138 | Compare_Correlations(SpaGE_osmCorr[osm_seq_genes],SpaGE_seqCorr[osm_seq_genes])
139 | plt.xlabel('Spearman Correlation osmFISH',size=12)
140 | plt.ylabel('Spearman Correlation seqFISH',size=12)
141 | plt.show()
142 |
143 | fig, ax = plt.subplots(figsize=(3.7, 4.5))
144 | ax.boxplot([SpaGE_osmCorr[osm_seq_genes],SpaGE_seqCorr[osm_seq_genes]],widths=0.5)
145 |
146 | y = SpaGE_osmCorr[osm_seq_genes]
147 | x = np.random.normal(1, 0.05, len(y))
148 | plt.plot(x, y, 'g.', alpha=0.2)
149 | y = SpaGE_seqCorr[osm_seq_genes]
150 | x = np.random.normal(2, 0.05, len(y))
151 | plt.plot(x, y, 'g.', alpha=0.2)
152 |
153 | plt.xticks((1,2),('osmFISH','seqFISH'),size=12)
154 | plt.yticks(size=8)
155 | plt.gca().set_ylim([-0.5,1])
156 | plt.ylabel('Spearman Correlation',size=12)
157 | #ax.set_aspect(aspect=3)
158 | _,p_val = st.wilcoxon(SpaGE_osmCorr[osm_seq_genes],SpaGE_seqCorr[osm_seq_genes])
159 | plt.text(2,np.max(plt.gca().get_ylim()),'%1.2e'%p_val,color='black',size=8)
160 | plt.show()
161 |
--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/SpaGE/Integration.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('seqFISH_AllenVISp/')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | from sklearn.neighbors import NearestNeighbors
8 | import time as tm
9 |
10 | with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f:
11 | datadict = pickle.load(f)
12 |
13 | seqFISH_data = datadict['seqFISH_data']
14 | seqFISH_data_scaled = datadict['seqFISH_data_scaled']
15 | seqFISH_meta= datadict['seqFISH_meta']
16 | del datadict
17 |
18 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
19 | datadict = pickle.load(f)
20 |
21 | RNA_data = datadict['RNA_data']
22 | RNA_data_scaled = datadict['RNA_data_scaled']
23 | del datadict
24 |
25 | #### Leave One Out Validation ####
26 | Common_data = RNA_data_scaled[np.intersect1d(seqFISH_data_scaled.columns,RNA_data_scaled.columns)]
27 | Imp_Genes = pd.DataFrame(columns=Common_data.columns)
28 | precise_time = []
29 | knn_time = []
30 | from principal_vectors import PVComputation
31 | for i in Common_data.columns:
32 |     print(i)
33 |     start = tm.time()
34 |
35 | n_factors = 50
36 | n_pv = 50
37 | dim_reduction = 'pca'
38 | dim_reduction_target = 'pca'
39 |
40 | pv_FISH_RNA = PVComputation(
41 | n_factors = n_factors,
42 | n_pv = n_pv,
43 | dim_reduction = dim_reduction,
44 | dim_reduction_target = dim_reduction_target
45 | )
46 |
47 | pv_FISH_RNA.fit(Common_data.drop(i,axis=1),seqFISH_data_scaled[Common_data.columns].drop(i,axis=1))
48 |
49 | S = pv_FISH_RNA.source_components_.T
50 |
51 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
52 | S = S[:,0:Effective_n_pv]
53 |
54 | Common_data_t = Common_data.drop(i,axis=1).dot(S)
55 | FISH_exp_t = seqFISH_data_scaled[Common_data.columns].drop(i,axis=1).dot(S)
56 | precise_time.append(tm.time()-start)
57 |
58 | start = tm.time()
59 | nbrs = NearestNeighbors(n_neighbors=50, algorithm='auto',metric = 'cosine').fit(Common_data_t)
60 | distances, indices = nbrs.kneighbors(FISH_exp_t)
61 |
62 | Imp_Gene = np.zeros(seqFISH_data.shape[0])
63 | for j in range(0,seqFISH_data.shape[0]):
64 | weights = 1-(distances[j,:][distances[j,:]<1])/(np.sum(distances[j,:][distances[j,:]<1]))
65 | weights = weights/(len(weights)-1)
66 | Imp_Gene[j] = np.sum(np.multiply(RNA_data[i][indices[j,:][distances[j,:] < 1]],weights))
67 | Imp_Gene[np.isnan(Imp_Gene)] = 0
68 | Imp_Genes[i] = Imp_Gene
69 | knn_time.append(tm.time()-start)
70 |
71 | Imp_Genes.to_csv('Results/SpaGE_LeaveOneOut.csv')
72 | precise_time = pd.DataFrame(precise_time)
73 | knn_time = pd.DataFrame(knn_time)
74 | precise_time.to_csv('Results/SpaGE_PreciseTime.csv', index = False)
75 | knn_time.to_csv('Results/SpaGE_knnTime.csv', index = False)
76 |
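The imputation step in this script weights each retained neighbor (cosine distance below 1) by `1 - d / sum(d)` and then divides by the number of retained neighbors minus one, so the weights sum to 1. A minimal NumPy sketch of that weighting for a single spatial cell follows; `neighbor_weights` and the toy numbers are hypothetical, and returning NaN for the degenerate cases is a choice made for this sketch only.

```python
import numpy as np


def neighbor_weights(distances):
    """Return (mask of retained neighbors, weights for those neighbors).

    Neighbors with cosine distance >= 1 are discarded. With fewer than two
    retained neighbors, or all distances equal to zero, the formula divides
    by zero, so NaN weights are returned in this sketch.
    """
    d = np.asarray(distances, dtype=float)
    keep = d < 1
    d = d[keep]
    if d.size < 2 or d.sum() == 0:
        return keep, np.full(d.size, np.nan)
    w = 1 - d / d.sum()      # closer neighbors get larger raw weights
    w = w / (d.size - 1)     # raw weights sum to (k - 1), so this normalizes
    return keep, w


# Toy data: four neighbors of one spatial cell, the last one is too far away
distances = [0.1, 0.2, 0.3, 1.2]
neighbor_expression = np.array([4.0, 2.0, 0.0, 9.0])
keep, weights = neighbor_weights(distances)
imputed = float(np.dot(weights, neighbor_expression[keep]))
```

The raw weights sum to `k - 1` because the normalized distances sum to 1, which is why dividing by `len(weights) - 1` yields a proper weighted average. The NaN cases correspond to the values the script resets to 0 afterwards.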
--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/SpaGE/Precise_output.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('seqFISH_AllenVISp')
3 |
4 | import pickle
5 | import numpy as np
6 | import pandas as pd
7 | import matplotlib
8 | matplotlib.rcParams['pdf.fonttype'] = 42
9 | matplotlib.rcParams['ps.fonttype'] = 42
10 | import matplotlib.pyplot as plt
11 | from matplotlib import cm
12 | import seaborn as sns
13 | import sys
14 | sys.path.insert(1,'Scripts/SpaGE/')
15 | from principal_vectors import PVComputation
16 |
17 | with open ('data/SpaGE_pkl/seqFISH_Cortex.pkl', 'rb') as f:
18 | datadict = pickle.load(f)
19 |
20 | seqFISH_data = datadict['seqFISH_data']
21 | seqFISH_data_scaled = datadict['seqFISH_data_scaled']
22 | seqFISH_meta= datadict['seqFISH_meta']
23 | del datadict
24 |
25 | with open ('data/SpaGE_pkl/Allen_VISp.pkl', 'rb') as f:
26 | datadict = pickle.load(f)
27 |
28 | RNA_data = datadict['RNA_data']
29 | RNA_data_scaled = datadict['RNA_data_scaled']
30 | del datadict
31 |
32 | Common_data = RNA_data_scaled[np.intersect1d(seqFISH_data_scaled.columns,RNA_data_scaled.columns)]
33 |
34 | n_factors = 50
35 | n_pv = 50
36 | n_pv_display = 50
37 | dim_reduction = 'pca'
38 | dim_reduction_target = 'pca'
39 |
40 | pv_FISH_RNA = PVComputation(
41 | n_factors = n_factors,
42 | n_pv = n_pv,
43 | dim_reduction = dim_reduction,
44 | dim_reduction_target = dim_reduction_target
45 | )
46 |
47 | pv_FISH_RNA.fit(Common_data,seqFISH_data_scaled[Common_data.columns])
48 |
49 | fig = plt.figure()
50 | sns.heatmap(pv_FISH_RNA.initial_cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
51 | center=0, vmax=1., vmin=0)
52 | plt.xlabel('seqFISH',fontsize=18, color='black')
53 | plt.ylabel('Allen_VISp',fontsize=18, color='black')
54 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
55 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
56 | plt.gca().set_ylim([n_pv_display,0])
57 | plt.show()
58 |
59 | plt.figure()
60 | sns.heatmap(pv_FISH_RNA.cosine_similarity_matrix_[:n_pv_display,:n_pv_display], cmap='seismic_r',
61 | center=0, vmax=1., vmin=0)
62 | for i in range(n_pv_display-1):
63 | plt.text(i+1,i+.7,'%1.2f'%pv_FISH_RNA.cosine_similarity_matrix_[i,i], fontsize=14,color='black')
64 |
65 | plt.xlabel('seqFISH',fontsize=18, color='black')
66 | plt.ylabel('Allen_VISp',fontsize=18, color='black')
67 | plt.xticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12)
68 | plt.yticks(np.arange(n_pv_display)+0.5, range(1, n_pv_display+1), fontsize=12, rotation='horizontal')
69 | plt.gca().set_ylim([n_pv_display,0])
70 | plt.show()
71 |
72 | Importance = pd.Series(np.sum(pv_FISH_RNA.source_components_**2,axis=0),index=Common_data.columns)
73 | Importance.sort_values(ascending=False,inplace=True)
74 | print(Importance.index[0:50])
75 |
76 | ### Technology specific Processes
77 | Effective_n_pv = sum(np.diag(pv_FISH_RNA.cosine_similarity_matrix_) > 0.3)
78 |
79 | # explained variance RNA
80 | print(np.sum(pv_FISH_RNA.source_explained_variance_ratio_[np.arange(Effective_n_pv)])*100)
81 | # explained variance spatial
82 | print(np.sum(pv_FISH_RNA.target_explained_variance_ratio_[np.arange(Effective_n_pv)])*100)
83 |
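This script keeps only the principal vectors whose diagonal cosine similarity exceeds 0.3. The `PVComputation` class itself is not shown in this part of the listing, so the following is a sketch under the assumption that it follows the standard principal-angle construction: orthonormalize both factor sets, take the SVD of their product, and read the cosines off the singular values. All names and the random toy factors are hypothetical.

```python
import numpy as np


def orthonormal_rows(A):
    """Orthonormalize the rows of A (factors x genes) with a QR decomposition."""
    q, _ = np.linalg.qr(A.T)
    return q.T


def principal_vectors(source_factors, target_factors):
    """Return (source PVs, target PVs, cosines of the principal angles)."""
    Ps = orthonormal_rows(source_factors)
    Pt = orthonormal_rows(target_factors)
    # Ps @ Pt.T = U diag(s) Vt, with s the cosines of the principal angles
    u, s, vt = np.linalg.svd(Ps @ Pt.T)
    source_pvs = u.T @ Ps
    target_pvs = vt @ Pt
    return source_pvs, target_pvs, np.clip(s, 0.0, 1.0)


# Toy data: 4 factors over 10 genes for each domain
rng = np.random.default_rng(0)
source_factors = rng.normal(size=(4, 10))
target_factors = rng.normal(size=(4, 10))
source_pvs, target_pvs, cosines = principal_vectors(source_factors, target_factors)

# Same selection rule as the script: keep pairs with cosine similarity > 0.3
effective_n_pv = int(np.sum(cosines > 0.3))
```

Under this construction the matrix `source_pvs @ target_pvs.T` is diagonal with the cosines in decreasing order, so thresholding the diagonal keeps a leading block of well-aligned directions.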
--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/SpaGE/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and can be selected directly
5 | by string name from the main method. All the methods use the scikit-learn
6 | implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 | """
24 | Default linear dimensionality reduction method. For each method, return a
25 | BaseEstimator instance corresponding to the method given as input.
26 | Attributes
27 | -------
28 | method: str, default to 'pca'
29 | Method used for dimensionality reduction.
30 |         Implemented: 'pca', 'ica', 'fa' (Factor Analysis), 'pls' (PLS regression),
31 | 'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA).
32 |
33 | n_dim: int, default to 10
34 | Number of domain-specific factors to compute.
35 | Return values
36 | -------
37 | Classifier, i.e. BaseEstimator instance
38 | """
39 |
40 | if method.lower() == 'pca':
41 | clf = PCA(n_components=n_dim)
42 |
43 | elif method.lower() == 'ica':
44 | print('ICA')
45 | clf = FastICA(n_components=n_dim)
46 |
47 | elif method.lower() == 'fa':
48 | clf = FactorAnalysis(n_components=n_dim)
49 |
50 | elif method.lower() == 'nmf':
51 | clf = NMF(n_components=n_dim)
52 |
53 | elif method.lower() == 'sparsepca':
54 | clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 | elif method.lower() == 'pls':
57 | clf = PLS(n_components=n_dim)
58 |
59 | else:
60 | raise NameError('%s is not an implemented method'%(method))
61 |
62 | return clf
63 |
64 |
65 | class PLS():
66 | """
67 | Implement PLS to make it compliant with the other dimensionality
68 | reduction methodology.
69 |     (Simple class rewriting).
70 | """
71 | def __init__(self, n_components=10):
72 | self.clf = PLSRegression(n_components)
73 |
74 | def get_components_(self):
75 | return self.clf.x_weights_.transpose()
76 |
77 | def set_components_(self, x):
78 | pass
79 |
80 | components_ = property(get_components_, set_components_)
81 |
82 | def fit(self, X, y):
83 | self.clf.fit(X,y)
84 | return self
85 |
86 | def transform(self, X):
87 | return self.clf.transform(X)
88 |
89 | def predict(self, X):
90 | return self.clf.predict(X)
--------------------------------------------------------------------------------
/benchmark/seqFISH_AllenVISp/SpaGE/seqFISH_data_preprocessing.py:
--------------------------------------------------------------------------------
1 | import os
2 | os.chdir('seqFISH_AllenVISp/')
3 |
4 | import numpy as np
5 | import pandas as pd
6 | import scipy.stats as st
7 | import pickle
8 |
9 | seqFISH_data = pd.read_csv('data/seqFISH/sourcedata/cortex_svz_counts.csv',header=0)
10 | seqFISH_meta = pd.read_csv('data/seqFISH/sourcedata/cortex_svz_cellcentroids.csv',header=0)
11 |
12 | seqFISH_data = seqFISH_data.iloc[np.where(seqFISH_meta['Region'] == 'Cortex')[0],:]
13 | seqFISH_meta = seqFISH_meta.iloc[np.where(seqFISH_meta['Region'] == 'Cortex')[0],:]
14 |
15 | seqFISH_data = seqFISH_data.T
16 |
17 | cell_count = np.sum(seqFISH_data,axis=0)
18 | def Log_Norm(x):
19 | return np.log(((x/np.sum(x))*np.median(cell_count)) + 1)
20 |
21 | seqFISH_data = seqFISH_data.apply(Log_Norm,axis=0)
22 | seqFISH_data_scaled = pd.DataFrame(data=st.zscore(seqFISH_data.T),index = seqFISH_data.columns,columns=seqFISH_data.index)
23 |
24 | datadict = dict()
25 | datadict['seqFISH_data'] = seqFISH_data.T
26 | datadict['seqFISH_data_scaled'] = seqFISH_data_scaled
27 | datadict['seqFISH_meta'] = seqFISH_meta
28 |
29 | with open('data/SpaGE_pkl/seqFISH_Cortex.pkl','wb') as f:
30 | pickle.dump(datadict, f)
31 |
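The preprocessing above scales each cell to the median total count, applies `log(x + 1)`, and then z-scores every gene across cells. A minimal NumPy sketch of those two steps on a toy cells-by-genes matrix is given here; the function names and the counts are hypothetical, and the z-score uses the population standard deviation, which is the default of `scipy.stats.zscore`.

```python
import numpy as np


def log_normalize(counts):
    """Scale each cell (row) to the median library size, then log(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    library = counts.sum(axis=1)              # total counts per cell
    scaled = counts / library[:, None] * np.median(library)
    return np.log(scaled + 1)


def zscore_columns(X):
    """Z-score each gene (column) across cells, population std (ddof=0)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


# Toy data: 4 cells x 3 genes, library sizes 4, 4, 10 and 6 (median 5)
counts = np.array([[1, 0, 3],
                   [2, 2, 0],
                   [0, 5, 5],
                   [4, 1, 1]])
normalized = log_normalize(counts)
scaled = zscore_columns(normalized)
```

After the first step every cell has the same total on the linear scale, so the z-scores reflect relative expression rather than sequencing depth. Cells with zero total counts or genes with constant expression would produce NaN values and would need to be filtered beforehand.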
--------------------------------------------------------------------------------
/dimensionality_reduction.py:
--------------------------------------------------------------------------------
1 | """ Dimensionality Reduction
2 | @author: Soufiane Mourragui
3 | This module extracts the domain-specific factors from the high-dimensional omics
4 | dataset. Several methods are implemented here and can be selected directly
5 | by string name from the main method. All the methods use the scikit-learn
6 | implementation.
7 | Notes
8 | -------
9 | -
10 |
11 | References
12 | -------
13 | [1] Pedregosa, Fabian, et al. (2011) Scikit-learn: Machine learning in Python.
14 | Journal of Machine Learning Research
15 | """
16 |
17 | import numpy as np
18 | from sklearn.decomposition import PCA, FastICA, FactorAnalysis, NMF, SparsePCA
19 | from sklearn.cross_decomposition import PLSRegression
20 |
21 |
22 | def process_dim_reduction(method='pca', n_dim=10):
23 |     """
24 |     Default linear dimensionality reduction method. For each method, return a
25 |     BaseEstimator instance corresponding to the method given as input.
26 |     Parameters
27 |     -------
28 |     method: str, default to 'pca'
29 |         Method used for dimensionality reduction.
30 |         Implemented: 'pca', 'ica', 'fa' (Factor Analysis), 'pls' (PLS),
31 |         'nmf' (Non-negative matrix factorisation), 'sparsepca' (Sparse PCA).
32 |
33 |     n_dim: int, default to 10
34 |         Number of domain-specific factors to compute.
35 |     Return values
36 |     -------
37 |     Classifier, i.e. BaseEstimator instance
38 |     """
39 |
40 |     if method.lower() == 'pca':
41 |         clf = PCA(n_components=n_dim)
42 |
43 |     elif method.lower() == 'ica':
44 |         print('ICA')
45 |         clf = FastICA(n_components=n_dim)
46 |
47 |     elif method.lower() == 'fa':
48 |         clf = FactorAnalysis(n_components=n_dim)
49 |
50 |     elif method.lower() == 'nmf':
51 |         clf = NMF(n_components=n_dim)
52 |
53 |     elif method.lower() == 'sparsepca':
54 |         clf = SparsePCA(n_components=n_dim, alpha=10., tol=1e-4, verbose=10, n_jobs=1)
55 |
56 |     elif method.lower() == 'pls':
57 |         clf = PLS(n_components=n_dim)
58 |
59 |     else:
60 |         raise NameError('%s is not an implemented method'%(method))
61 |
62 |     return clf
63 |
64 |
65 | class PLS():
66 |     """
67 |     Implement PLS to make it compliant with the other dimensionality
68 |     reduction methods.
69 |     (Simple class rewriting).
70 |     """
71 |     def __init__(self, n_components=10):
72 |         self.clf = PLSRegression(n_components=n_components)
73 |
74 |     def get_components_(self):
75 |         return self.clf.x_weights_.transpose()
76 |
77 |     def set_components_(self, x):
78 |         pass
79 |
80 |     components_ = property(get_components_, set_components_)
81 |
82 |     def fit(self, X, y):
83 |         self.clf.fit(X,y)
84 |         return self
85 |
86 |     def transform(self, X):
87 |         return self.clf.transform(X)
88 |
89 |     def predict(self, X):
90 |         return self.clf.predict(X)
--------------------------------------------------------------------------------