├── LICENSE ├── README.md ├── bin ├── LIGER_integration.R ├── alignment_score.py ├── concat_by_homology_multiple_species_by_gene_id.R ├── concat_by_homology_rligerUINMF_multiple_species.R ├── convert_format.R ├── fastMNN_integration.R ├── harmony_integration.py ├── kbet_multiple_species.R ├── rliger_integration_UINMF_multiple_species.R ├── scANVI_integration.py ├── scIB_metrics.py ├── scIB_metrics_individual.py ├── scIB_trajectory.py ├── scanorama_integration.py ├── sccaf_assessment_metadata.py ├── sccaf_kNN_distance.py ├── sccaf_projection_multiple_species.py ├── scvi_integration.py ├── seurat_CCA_integration.R ├── seurat_RPCA_integration.R └── validate_input.py ├── concat_by_homology_multiple_species.nf ├── config └── example.config ├── containers └── README.md ├── cross_species_assessment_multiple_species_individual.nf ├── cross_species_integration_multiple_species.nf ├── dockerfiles ├── ALCS.Dockerfile ├── scanpy.Dockerfile └── seurat.Dockerfile ├── envs ├── sceasy.yml ├── scib.yml └── scvi.yml └── example_metadata_nf.tsv /README.md: -------------------------------------------------------------------------------- 1 | ## BENGAL: BENchmarking strateGies for cross-species integrAtion of singLe-cell RNA sequencing data ## 2 | 3 | Author & maintainer: Yuyao Song 4 | 5 | A Nextflow DSL2 pipeline to perform cross-species single-cell RNA-seq data integration and assessment of integration results. 6 | 7 | **In Oct 2023, Yuyao updated the pipeline with containerization, improved anndata/seurat conversion, and updates to the scripts and to Nextflow.** 8 | 9 | ## This repo includes: 10 | 11 | 1) the Nextflow pipeline for cross-species integration of scRNA-seq data using various strategies 12 | 2) container paths, the dockerfiles used and conda environments 13 | 3) an example config file for running the Nextflow pipeline 14 | 15 | ## System requirements 16 | #### Hardware: 17 | This workflow is written to be executed on HPC clusters with the LSF job scheduler.
It can be adapted to other schedulers by changing the job resource syntax in the nextflow config file. If the GPU implementation of scVI/scANVI is to be used (beneficial for speeding up the integration on large datasets), GPU computing nodes are required. Please refer to the [scVI-tools site](https://scvi-tools.org/) for the respective setup. 18 | 19 | #### OS: 20 | This workflow was developed on Rocky Linux 8.5 (RHEL). In principle it should run on any Linux distribution. To run the GPU implementation of scVI/scANVI we used Nvidia Tesla V100 GPUs. 21 | 22 | ## Installation: 23 | 24 | #### Clone the source code of BENGAL: 25 | `git clone -b main git@github.com:Functional-Genomics/BENGAL.git` 26 | 27 | **If nextflow or singularity is not installed on your cluster, install them. This can take some effort and it might be worth discussing with your cluster IT managers. Please refer to the [nextflow documentation](https://www.nextflow.io/docs/latest/getstarted.html) or the [singularity documentation](https://singularity-tutorial.github.io/01-installation/).** 28 | 29 | 30 | ## Inputs 31 | The nextflow script takes one input file: a tab-separated metadata file mapping species to the paths of raw count AnnData objects, in the form of .h5ad files. See example: `example_metadata_nf.tsv`. 32 | 33 | The config file defines project directories and parameters. See example: `config/example.config` 34 | 35 | **Please change the metadata file, and the directories and parameters in the config file, as appropriate for your own application.** 36 | **Reading the instructions in the config file and changing the relevant entries is crucial for running the pipeline.** 37 | 38 | #### Input Requirements: 39 | *These requirements will be checked in the first process of the pipeline.* 40 | 41 | The raw count AnnData objects need to have the following row or column annotations. Note that the exact column name of each key is specified in the config file.
42 | 43 | 1) a `species_key` in adata.obs to store species identity. Naming should be in line with the short name in ENSEMBL, such as hsapiens, mmusculus, drerio etc. 44 | 2) a `cluster_key` in adata.obs to store cell types. If assessment is performed, this column will be used to match homologous cell types across species. Preferably, use [Cell Ontology annotation](https://obofoundry.org/ontology/cl.html). 45 | 3) `mean_counts` in adata.var computed by `sc.pp.calculate_qc_metrics` from [scanpy](https://github.com/scverse/scanpy). 46 | 47 | The .var_names of the raw count AnnData file should be ENSEMBL gene ids. 48 | The .X of the raw count AnnData file should be stored in dense matrix format, if SeuratDisk is used for .h5ad/.h5seurat conversion. 49 | 50 | 51 | ## Run instructions 52 | 53 | #### Prepare the conda environment for anndata/seurat conversion. 54 | In principle, you can use any program to perform the conversion. Since Oct 2023 we use [sceasy](https://github.com/cellgeni/sceasy). We also no longer use the h5seurat format due to challenges in converting to/from anndata. 55 | 56 | Containerizing this process did not seem necessary, so we provide a lightweight conda environment that is compatible with other parts of the pipeline. [Mamba](https://github.com/mamba-org/mamba) is recommended as a faster substitute for conda. 57 | 58 | I personally prefer creating a conda env independently of nextflow and then pointing nextflow to the absolute path of the conda env. This saves pipeline running time and makes reusing the same env and debugging easier. 59 | 60 | First create a conda environment for the conversion: 61 | 62 | `conda env create -f envs/sceasy.yml` 63 | 64 | Then put the path of your sceasy conda environment into the config file in the indicated place.
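Before launching the pipeline, it can be worth sanity-checking the metadata file described under Inputs, since a malformed file only surfaces once the first process runs. The following is a minimal sketch and not part of BENGAL. It assumes a header-less, two-column TSV (species, path to .h5ad), which matches how the concatenation script reads it with `read_tsv(..., col_names = FALSE)`. The function name `check_metadata` and the example paths are hypothetical.

```python
import csv
import io


def check_metadata(handle):
    """Return a list of problems found in a header-less two-column TSV.

    Assumed layout (not enforced by BENGAL itself in this form):
    column 1 = species short name, column 2 = path to a .h5ad file.
    """
    problems = []
    seen = set()
    for lineno, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # ignore blank lines
        if len(row) != 2:
            problems.append(f"line {lineno}: expected 2 columns, got {len(row)}")
            continue
        species, path = (cell.strip() for cell in row)
        if not species:
            problems.append(f"line {lineno}: empty species name")
        elif species in seen:
            # the R scripts look up the path by species, so names must be unique
            problems.append(f"line {lineno}: species '{species}' listed twice")
        seen.add(species)
        if not path.endswith(".h5ad"):
            problems.append(f"line {lineno}: '{path}' does not end in .h5ad")
    return problems


if __name__ == "__main__":
    # hypothetical example content; replace with open("your_metadata.tsv")
    example = "hsapiens\t/data/human_raw.h5ad\nmmusculus\t/data/mouse_raw.h5ad\n"
    print(check_metadata(io.StringIO(example)) or "metadata looks fine")
```

This only checks the shape of the file. The annotation requirements on the AnnData objects themselves are still verified by the pipeline's first process.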
65 | 66 | #### Prepare the conda environments for running scVI/scANVI and scIB 67 | 68 | These two parts are also not containerized, since the conda envs are relatively easy to set up while the respective containers would be very heavy. In the future we might consider containerizing them if necessary. 69 | 70 | `conda env create -f envs/scvi.yml` 71 | 72 | `conda env create -f envs/scib.yml` 73 | 74 | Then put the paths of your scvi and scib conda environments into the config file in the indicated place. These env files were created by following the installation instructions from [scvi](https://docs.scvi-tools.org/en/stable/installation.html) and [scib](https://scib.readthedocs.io/en/stable/installation.html) under Python 3.10.10, so if you encounter any issues, feel free to create your own envs based on their instructions. 75 | 76 | #### Pull the containers used in BENGAL. 77 | 78 | We now provide a few containers to help execute the pipeline (well deserved yay due to the complexity of building them...). Please pull these containers into a local dir and specify their paths in the config file. Here we assume you use [singularity](https://sylabs.io/) to run these containers on an HPC cluster. 79 | 80 | 1. Concatenate anndata files cross-species: `singularity pull bengal_concat.sif docker://yysong123/bengal_concat:4.2.0` 81 | 2. Python based integration: `singularity pull bengal_py.sif docker://yysong123/bengal_py:1.9.2` 82 | 3. Seurat/R based integration: `singularity pull bengal_seurat.sif docker://yysong123/bengal_seurat:4.3.0` 83 | 4. SCCAF assessment for ALCS: `singularity pull bengal_sccaf.sif docker://yysong123/bengal_sccaf:0.0.11` 84 | 85 | ### To run BENGAL: 86 | In a bash shell, check that your metadata/config files are set and run: 87 | 88 | 1) `conda activate nextflow && nextflow -C config/example.config run concat_by_homology_multiple_species.nf`. Add the flags `-with-trace -with-report report.html` if you want nextflow run stats.
89 | 2) `nextflow -C config/example.config run cross_species_integration_multiple_species.nf` 90 | 3) `nextflow -C config/example.config run cross_species_assessment_multiple_species_individual.nf` 91 | 92 | Note: add the resume flag `-resume` as appropriate to avoid re-calculation of the same data during multiple runs. 93 | 94 | ## Outputs 95 | 96 | 1) Concatenated raw count AnnData objects containing cells from all species, in the form of .h5ad files. Objects are concatenated by matching genes between species using gene homology annotation from ENSEMBL. 97 | 2) Integration results from different algorithms including: [fastMNN](https://bioconductor.org/packages/release/bioc/html/batchelor.html), [harmony](https://github.com/slowkow/harmonypy), [LIGER](https://github.com/welch-lab/liger), [LIGER-UINMF](https://github.com/welch-lab/liger), [scanorama](https://github.com/brianhie/scanorama), [scVI](https://scvi-tools.org/), [SeuratV4CCA](https://satijalab.org/seurat/) and [SeuratV4RPCA](https://satijalab.org/seurat/), in the form of AnnData (.h5ad) or Seurat (.h5seurat) objects. 98 | 3) Respective UMAP visualizations colored by species, batch or cell type. 99 | 4) Assessment metrics for each integration result. There are 4 batch correction metrics and 6 biology conservation metrics. Plots associated with the metrics are also generated for visual inspection. 100 | 5) Cross-species cell type annotation transfer results with [SCCAF](https://github.com/SCCAF/sccaf). 101 | 102 | Estimated execution time: ~6h for an integrated dataset with 100,000 cells using the resources specified in the .nf scripts. 103 | 104 | ## Citation 105 | 106 | The publication in which we described and applied BENGAL is [here](https://www.nature.com/articles/s41467-023-41855-w). Please cite it if you use BENGAL. 107 | 108 | Song, Y., Miao, Z., Brazma, A. et al. Benchmarking strategies for cross-species integration of single-cell RNA sequencing data. Nat Commun 14, 6495 (2023).
https://doi.org/10.1038/s41467-023-41855-w 109 | 110 | The BENGAL pipeline used upon publication of the paper is archived on Zenodo: 111 | 112 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8268784.svg)](https://doi.org/10.5281/zenodo.8268784) 113 | 114 | LICENSE: GPLv3 115 | 116 | NOTE: we moved this git repo from Functional-Genomics/BENGAL to Papatheodorou-Group/BENGAL on 23 Oct 2023, but redirection should happen automatically. 117 | -------------------------------------------------------------------------------- /bin/LIGER_integration.R: -------------------------------------------------------------------------------- 1 | # /usr/bin/env R 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | library(optparse) 7 | library(rliger) 8 | library(Seurat) 9 | library(SeuratWrappers) 10 | 11 | option_list <- list( 12 | make_option(c("-i", "--input_rds"), 13 | type = "character", default = NULL, 14 | help = "Path to input preprocessed rds file" 15 | ), 16 | make_option(c("-o", "--out_rds"), 17 | type = "character", default = NULL, 18 | help = "Output rds file integrated with LIGER via SeuratWrappers" 19 | ), 20 | make_option(c("-p", "--out_UMAP"), 21 | type = "character", default = NULL, 22 | help = "Output UMAP after LIGER integration" 23 | ), 24 | make_option(c("-b", "--batch_key"), 25 | type = "character", default = NULL, 26 | help = "Batch key identifier to integrate" 27 | ), 28 | make_option(c("-s", "--species_key"), 29 | type = "character", default = NULL, 30 | help = "Species key identifier" 31 | ), 32 | make_option(c("-c", "--cluster_key"), 33 | type = "character", default = NULL, 34 | help = "Cluster key for UMAP plotting" 35 | ) 36 | ) 37 | 38 | # parse input 39 | opt <- parse_args(OptionParser(option_list = option_list)) 40 | 41 | input_rds <- opt$input_rds 42 | out_rds <- opt$out_rds 43 | out_UMAP <- opt$out_UMAP 44 | batch_key <- opt$batch_key 45 | species_key <- opt$species_key 46 | cluster_key <- opt$cluster_key
47 | 48 | ## create Seurat object via rds 49 | 50 | # Convert(input_h5ad, dest = "rds", overwrite = TRUE) 51 | # input_rds <- gsub("h5ad", "rds", input_h5ad) 52 | obj <- readRDS(input_rds) 53 | 54 | obj <- NormalizeData(obj) 55 | obj <- FindVariableFeatures(obj) 56 | obj <- ScaleData(obj, split.by = batch_key, do.center = FALSE) 57 | 58 | # LIGER 59 | obj <- RunOptimizeALS(obj, k = 30, lambda = 5, split.by = batch_key) 60 | obj <- RunQuantileNorm(obj, split.by = batch_key) 61 | 62 | obj <- FindNeighbors(obj, reduction = "iNMF", dims = 1:30) 63 | obj <- FindClusters(obj, resolution = 0.4) 64 | # Dimensional reduction and plotting 65 | obj <- RunUMAP(obj, dims = 1:ncol(obj[["iNMF"]]), reduction = "iNMF", n_neighbors = 15L, min_dist = 0.3) 66 | 67 | 68 | # have to convert all factor to character, or when later converting to h5ad, the factors will be numbers 69 | i <- sapply(obj@meta.data, is.factor) 70 | obj@meta.data[i] <- lapply(obj@meta.data[i], as.character) 71 | 72 | 73 | saveRDS(obj, 74 | file = out_rds 75 | ) 76 | 77 | # iNMF embedding will be in .obsm['iNMF'] 78 | 79 | pdf(out_UMAP, height = 6, width = 10) 80 | DimPlot(obj, reduction = "umap", group.by = species_key, shuffle = TRUE, label = TRUE) 81 | DimPlot(obj, reduction = "umap", group.by = batch_key, shuffle = TRUE, label = TRUE) 82 | DimPlot(obj, reduction = "umap", group.by = cluster_key, shuffle = TRUE, label = TRUE) 83 | 84 | dev.off() 85 | -------------------------------------------------------------------------------- /bin/alignment_score.py: -------------------------------------------------------------------------------- 1 | # © EMBL-European Bioinformatics Institute, 2023 2 | # Yuyao Song 3 | 4 | K_default=20 5 | 6 | import pandas as pd 7 | import numpy as np 8 | import scanpy as sc 9 | import random 10 | import numpy 11 | import click 12 | 13 | random.seed(123) 14 | numpy.random.seed(456) 15 | 16 | 17 | @click.command() 18 | @click.argument("input_h5ad", type=click.Path(exists=True)) 19 | 
@click.argument("unintegrated_h5ad", type=click.Path(exists=False), default=None) 20 | @click.argument("out_integrated_metrics", type=click.Path(exists=False), default=None) 21 | @click.argument("out_orig_metrics", type=click.Path(exists=False), default=None) 22 | @click.option( 23 | "--species_key", type=str, default=None, help="Species key to distinguish species" 24 | ) 25 | @click.option( 26 | "--batch_key", 27 | type=str, 28 | default=None, 29 | help="Batch key on which integration is performed", 30 | ) 31 | @click.option("--integration_method", type=str, default=None, help="Integration method") 32 | @click.option( 33 | "--cluster_key", 34 | type=str, 35 | default=None, 36 | help="Cluster key in species one to use as labels to transfer to species two", 37 | ) 38 | def run_alignment_score( 39 | input_h5ad, 40 | unintegrated_h5ad, 41 | out_integrated_metrics, 42 | out_orig_metrics, 43 | species_key, 44 | batch_key, 45 | cluster_key, 46 | integration_method, 47 | ): 48 | # dictionary for method properties 49 | embedding_keys = { 50 | "harmony": "X_pca_harmony", 51 | "scanorama": "X_scanorama", 52 | "scVI": "X_scVI", 53 | "scANVI": "X_scANVI", 54 | "LIGER": "X_iNMF", 55 | "rligerUINMF": "X_inmf", 56 | "fastMNN": "X_mnn", 57 | } 58 | use_embeddings = { 59 | "harmony": True, 60 | "scanorama": True, 61 | "scVI": True, 62 | "scANVI": True, 63 | "LIGER": True, 64 | "rligerUINMF": True, 65 | "fastMNN": True, 66 | "SAMap": False, 67 | "seuratCCA": False, 68 | "seuratRPCA": False, 69 | "unintegrated": False, 70 | } 71 | from_h5seurat = { 72 | "harmony": False, 73 | "scanorama": False, 74 | "scVI": False, 75 | "scANVI": False, 76 | "LIGER": True, 77 | "rligerUINMF": True, 78 | "fastMNN": True, 79 | "SAMap": False, 80 | "seuratCCA": True, 81 | "seuratRPCA": True, 82 | "unintegrated": False, 83 | } 84 | sc.set_figure_params(dpi_save=200, frameon=False, figsize=(10, 5)) 85 | 86 | click.echo("Read anndata") 87 | input_ad = sc.read_h5ad(input_h5ad) 88 | orig_ad = 
sc.read_h5ad(unintegrated_h5ad) 89 | species_all = input_ad.obs[species_key].astype("category").cat.categories.values 90 | 91 | ## for files from h5seurat sometimes these are not stored as category 92 | 93 | input_ad.obs[species_key] = input_ad.obs[species_key].astype("category") 94 | input_ad.obs[cluster_key] = input_ad.obs[cluster_key].astype("category") 95 | input_ad.obs[batch_key] = input_ad.obs[batch_key].astype("category") 96 | 97 | # known bug - fix when convert h5Seurat to h5ad the index name error 98 | if from_h5seurat[integration_method] is True: 99 | input_ad.__dict__["_raw"].__dict__["_var"] = ( 100 | input_ad.__dict__["_raw"] 101 | .__dict__["_var"] 102 | .rename(columns={"_index": "features"}) 103 | ) 104 | 105 | use_embedding = use_embeddings[integration_method] 106 | if use_embedding is True: 107 | embedding_key = embedding_keys[integration_method] 108 | 109 | # re-calculate on integrated and unintegrated data 110 | # due to scIB hard-coding, make sure input_ad.obsp['connectivities'], input_ad.uns['neighbours'] are from the embedding 111 | # for lisi type_='knn' 112 | # LIGER embedding only have 20 dims 113 | 114 | if integration_method == "SAMap": 115 | click.echo("use SAMap KNN graph") 116 | # do nothing 117 | elif use_embedding is True: 118 | click.echo("calculate KNN graph from embedding " + embedding_key) 119 | num_pcs = min(input_ad.obsm[embedding_key].shape[1], 20) 120 | if num_pcs < 20: 121 | click.echo("using less PCs: " + str(num_pcs)) 122 | sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=num_pcs, use_rep=embedding_key) 123 | # compute knn if use embedding 124 | else: 125 | click.echo("use PCA to compute KNN graph") 126 | sc.tl.pca(input_ad, svd_solver="arpack") 127 | sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep="X_pca") 128 | embedding_key = "X_pca" 129 | 130 | # while no embedding, compute PCA and compute knn 131 | 132 | # get neighbour graph from unintegrated data 133 | sc.pp.normalize_total(orig_ad, target_sum=1e4) 134 | 
sc.pp.log1p(orig_ad) 135 | sc.pp.highly_variable_genes(orig_ad, min_mean=0.0125, max_mean=3, min_disp=0.5) 136 | sc.pp.scale(orig_ad, max_value=10) 137 | sc.tl.pca(orig_ad, svd_solver="arpack") 138 | sc.pp.neighbors(orig_ad, n_neighbors=20, n_pcs=40) 139 | sc.tl.umap(orig_ad, min_dist=0.3) 140 | sc.pl.umap(orig_ad, color=[batch_key, species_key, cluster_key]) 141 | 142 | click.echo( 143 | "Start computing various batch metrics using scIB, the integrated file is " 144 | + input_h5ad 145 | ) 146 | 147 | output_integrated = pd.DataFrame() 148 | output_orig = pd.DataFrame() 149 | 150 | click.echo("alignment score") 151 | 152 | def q(x): 153 | return np.array(list(x)) 154 | 155 | def avg_as(ad): 156 | x = q(ad.obs[batch_key]) 157 | xu = np.unique(x) 158 | a = np.zeros((xu.size,xu.size)) 159 | for i in range(xu.size): 160 | for j in range(xu.size): 161 | if i!=j: 162 | a[i,j] = ad.obsp['connectivities'][x==xu[i],:][:,x==xu[j]].sum(1).A.flatten().mean() / K_default 163 | return pd.DataFrame(data=a,index=xu,columns=xu) 164 | 165 | 166 | output_integrated.loc["Alignment_score", "value"] = avg_as(input_ad)[avg_as(input_ad) != 0].mean().mean() 167 | 168 | output_orig.loc["Alignment_score", "value"] = avg_as(orig_ad)[avg_as(orig_ad) != 0].mean().mean() 169 | 170 | output_integrated.loc["input_h5ad", "value"] = input_h5ad 171 | output_integrated.loc["unintegrated_h5ad", "value"] = unintegrated_h5ad 172 | output_integrated.loc["species_key", "value"] = species_key 173 | output_integrated.loc["batch_key", "value"] = batch_key 174 | output_integrated.loc["cluster_key", "value"] = cluster_key 175 | output_integrated.loc["integration_method", "value"] = integration_method 176 | 177 | output_integrated.T.to_csv(out_integrated_metrics) 178 | 179 | click.echo("metric of unintegrated data") 180 | output_orig.loc["input_h5ad", "value"] = unintegrated_h5ad 181 | output_orig.loc["unintegrated_h5ad", "value"] = unintegrated_h5ad 182 | output_orig.loc["species_key", "value"] = species_key 
183 | output_orig.loc["batch_key", "value"] = batch_key 184 | output_orig.loc["cluster_key", "value"] = cluster_key 185 | output_orig.loc["integration_method", "value"] = integration_method 186 | 187 | output_orig.T.to_csv(out_orig_metrics) 188 | 189 | 190 | 191 | 192 | if __name__ == "__main__": 193 | run_alignment_score() -------------------------------------------------------------------------------- /bin/concat_by_homology_multiple_species_by_gene_id.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env R 2 | 3 | ########## 4 | ## concatenate anndata object by homology 5 | ## ysong@ebi.ac.uk for CrossSpeciesIntegration pipeline 6 | ######### 7 | 8 | # © EMBL-European Bioinformatics Institute, 2023 9 | # Yuyao Song 10 | 11 | library(optparse) 12 | library(anndata) 13 | library(dplyr) 14 | library(purrr) 15 | library(readr) 16 | library(magrittr) 17 | library(tibble) 18 | library(biomaRt) 19 | 20 | option_list <- list( 21 | make_option(c("--metadata"), 22 | type = "character", default = NULL, 23 | help = "Path to a file indicate species-h5ad mapping, tab-seperated" 24 | ), 25 | make_option(c("--homology_tbl"), 26 | type = "character", default = NULL, 27 | help = "Ensembl homology tbl output" 28 | ), 29 | make_option(c("--one2one_h5ad"), 30 | type = "character", default = NULL, 31 | help = "h5ad output with only one2one orthologs" 32 | ), 33 | make_option(c("--many_higher_expr_h5ad"), 34 | type = "character", default = NULL, 35 | help = "h5ad output with one2one, one2many and many2many homologs, paralog with higher mean expression across dataset is selected to match the ortholog" 36 | ), 37 | make_option(c("--many_higher_homology_conf_h5ad"), 38 | type = "character", default = NULL, 39 | help = "h5ad output with one2one, one2many and many2many homologs, paralog with higher homology confidence across dataset is selected to match the ortholog" 40 | ) 41 | ) 42 | 43 | # parse input 44 | opt <- 
parse_args(OptionParser(option_list = option_list)) 45 | 46 | input_metadata <- opt$metadata 47 | homology_tbl_csv <- opt$homology_tbl 48 | one2one_h5ad <- opt$one2one_h5ad 49 | many_higher_expr_h5ad <- opt$many_higher_expr_h5ad 50 | many_higher_homology_conf_h5ad <- opt$many_higher_homology_conf_h5ad 51 | 52 | genes_main_chr <- list( 53 | hsapiens = as.character(c(lapply(seq(1, 22, by = 1), as.character), "X", "Y")), 54 | sscrofa = as.character(c(lapply(seq(1, 18, by = 1), as.character), "X", "Y")), 55 | mmusculus = as.character(c(lapply(seq(1, 19, by = 1), as.character), "X", "Y")), 56 | mmulatta = as.character(c(lapply(seq(1, 20, by = 1), as.character), "X", "Y")), 57 | mfascicularis = as.character(c(lapply(seq(1, 20, by = 1), as.character), "X")), 58 | dmelanogaster = c("2L", "2R", "3L", "3R", "4", "X", "Y"), 59 | xtropicalis = as.character(c(lapply(seq(1, 10, by = 1), as.character))), 60 | drerio = as.character(c(lapply(seq(1, 25, by = 1), as.character))) 61 | ) 62 | 63 | 64 | metadata <- read_tsv(input_metadata, col_names = FALSE) 65 | species_list=metadata[['X1']] 66 | message(paste0("start concatenating anndata object from ", paste(species_list, collapse = ', '))) 67 | 68 | 69 | adatas = list() 70 | for( species in metadata$X1){ 71 | adatas[[species]] = read_h5ad(as.character(metadata[which(metadata$X1 == species), 'X2'])) 72 | } 73 | 74 | for(species_now in species_list){ 75 | adatas[[species_now]]$var[[paste0(species_now, "_homolog_ensembl_gene")]] = adatas[[species_now]]$var_names 76 | } 77 | 78 | 79 | count_all <- data.frame() 80 | species_1 = species_list[1] 81 | 82 | 83 | ## version = '105' 84 | mart <- useEnsembl("ensembl", version = "105", dataset = as.character(paste(species_1, "_gene_ensembl", sep = ""))) 85 | 86 | # get genes in the main chrs of the first species 87 | genes_species_1 <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"), mart = mart) %>% 88 | dplyr::filter(chromosome_name %in% 
(genes_main_chr[species_1] %>% purrr::flatten_chr())) 89 | message(paste("start querying ensembl biomaRt for gene homology")) 90 | 91 | avail_attributes <- listAttributes(mart) %>% filter(grepl((paste(species_list[-1],collapse="|")), name)) %>% filter(grepl("homo", name)) %>% 92 | filter(!grepl("Query protein or transcript ID", description)) 93 | biomartCacheClear() 94 | 95 | homology_tbl = getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name", "start_position", "end_position", 96 | avail_attributes$name), mart = mart, filters = "ensembl_gene_id", 97 | values = genes_species_1[["ensembl_gene_id"]]) 98 | 99 | 100 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene stable ID'] = paste0(species_1, "_homolog_ensembl_gene") 101 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene name'] = paste0(species_1, "_homolog_associated_gene_name") 102 | #names(homology_tbl)[colnames(homology_tbl) == 'Chromosome/scaffold name'] = paste0(species_1, "_homolog_chromosome") 103 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene start (bp)'] = paste0(species_1, "_homolog_chrom_start") 104 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene end (bp)'] = paste0(species_1, "_homolog_chrom_end") 105 | 106 | #names(homology_tbl)[match(avail_attributes[,"description"], names(homology_tbl))] = avail_attributes[,"name"] 107 | 108 | write_csv(homology_tbl, file = homology_tbl_csv) 109 | 110 | message("start building one2one") 111 | homology_tbl[paste0(species_1, "_homolog_associated_gene_name")] = homology_tbl$external_gene_name 112 | homology_tbl[paste0(species_1, "_homolog_ensembl_gene")] = homology_tbl$ensembl_gene_id 113 | homology_tbl[paste0(species_1, "_homolog_chromosome")] = homology_tbl$chromosome_name 114 | homology_tbl[paste0(species_1, "_homolog_chrom_start")] = homology_tbl$start_position 115 | homology_tbl[paste0(species_1, "_homolog_chrom_end")] = homology_tbl$end_position 116 | one2one = homology_tbl %>% 
filter_at(vars(ends_with("homolog_orthology_type")), all_vars(. == 'ortholog_one2one')) %>% 117 | distinct(get(paste0(species_1, "_homolog_ensembl_gene")), `.keep_all` = TRUE) 118 | 119 | adatas_one2one = list() 120 | 121 | for(species_now in names(adatas)){ 122 | adatas_one2one[[species_now]] = adatas[[species_now]][, tolower(adatas[[species_now]]$var_names) %in% tolower(one2one[[paste0(species_now, "_homolog_ensembl_gene")]])] 123 | } 124 | 125 | 126 | for(species_now in species_list[-1]){ 127 | message(species_now) 128 | one2one_now = one2one %>% filter(tolower(get(paste0(species_now, "_homolog_ensembl_gene"))) %in% tolower(adatas_one2one[[species_now]]$var_names)) %>% 129 | distinct(get(paste0(species_1, "_homolog_ensembl_gene")), `.keep_all` = TRUE) 130 | adatas_one2one[[species_now]] = adatas_one2one[[species_now]][, tolower(adatas_one2one[[species_now]]$var_names) %in% tolower(one2one_now[[paste0(species_now, "_homolog_ensembl_gene")]])] 131 | one2one_now = one2one_now[match(tolower(adatas_one2one[[species_now]]$var_names), tolower(one2one_now[[paste0(species_now, '_homolog_ensembl_gene')]])), ] 132 | adatas_one2one[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] = adatas_one2one[[species_now]]$var_names 133 | adatas_one2one[[species_now]]$var[[paste0(species_1, '_homolog_ensembl_gene')]] = one2one_now[[paste0(species_1, '_homolog_ensembl_gene')]] 134 | adatas_one2one[[species_now]]$var_names = one2one_now[[paste0(species_1, '_homolog_ensembl_gene')]] 135 | adatas_one2one[[species_now]] = adatas_one2one[[species_now]][, !duplicated(adatas_one2one[[species_now]]$var_names)] 136 | } 137 | 138 | adata_one2one = concat(adatas_one2one, axis = 0L, join = 'inner', merge = 'first', label = 'batch', index_unique = '-',) 139 | 140 | i <- sapply(adata_one2one$obs, is.factor) 141 | adata_one2one$obs[i] <- lapply(adata_one2one$obs[i], as.character) 142 | 143 | j <- sapply(adata_one2one$var, is.factor) 144 | adata_one2one$var[j] <- 
lapply(adata_one2one$var[j], as.character) 145 | 146 | 147 | write_h5ad(anndata = adata_one2one, filename = one2one_h5ad, compression = "gzip") 148 | 149 | message("finish building one2one") 150 | adata_one2one 151 | 152 | message("start building many2many") 153 | message("start higher expression level") 154 | 155 | 156 | many2many = homology_tbl %>% filter_at(vars(ends_with("homolog_orthology_type")), all_vars(. != 'ortholog_one2one')) 157 | 158 | 159 | for(species_now in species_list){ 160 | print(species_now) 161 | many2many = many2many %>% filter(!is.na(get(paste0(species_now, "_homolog_ensembl_gene"))) & get(paste0(species_now, "_homolog_ensembl_gene")) != "") 162 | print(dim(many2many)) 163 | many2many = many2many %>% 164 | filter(tolower(get(paste0(species_now, "_homolog_ensembl_gene"))) %in% tolower(adatas[[species_now]]$var[[paste0(species_now, "_homolog_ensembl_gene")]])) 165 | print(dim(many2many)) 166 | } 167 | 168 | many2many_copy <- many2many %>% rowid_to_column("index") 169 | adata_many2many = AnnData(shape = list(0, 0)) 170 | 171 | while (nrow(many2many_copy) > 0) { 172 | 173 | dd <- many2many_copy %>% 174 | filter(get(paste0(species_1, "_homolog_ensembl_gene")) == levels(factor(many2many_copy[[paste0(species_1, "_homolog_ensembl_gene")]]))[1]) 175 | 176 | genes_now = dd %>% dplyr::select(ends_with("_homolog_ensembl_gene")) %>% flatten() %>% unique() %>% as.character() 177 | 178 | gene_group <- many2many_copy %>% 179 | dplyr::filter_at(vars(ends_with("_homolog_ensembl_gene")), any_vars(. 
%in% genes_now)) 180 | 181 | many2many_copy <- many2many_copy %>% 182 | filter(!index %in% gene_group$index) 183 | 184 | adatas_many2many = list() 185 | 186 | for(species_now in species_list){ 187 | adatas_many2many[[species_now]] = adatas[[species_now]][, tolower(adatas[[species_now]]$var_names) %in% tolower(gene_group[[paste0(species_now, "_homolog_ensembl_gene")]])] 188 | keep_row = adatas_many2many[[species_now]]$var %>% 189 | arrange(desc(mean_counts)) %>% 190 | slice(1) 191 | adatas_many2many[[species_now]] = adatas_many2many[[species_now]][, which(tolower(adatas_many2many[[species_now]]$var_names) == tolower(rownames(keep_row)))] 192 | } 193 | 194 | new_name = adatas_many2many[[species_1]]$var_names 195 | #message(new_name) 196 | for(species_now in species_list[-1]){ 197 | adatas_many2many[[species_now]]$var[[paste0(species_1, "_homolog_ensembl_gene")]] = new_name 198 | rownames(adatas_many2many[[species_now]]$var) = new_name 199 | } 200 | adata_add = concat(adatas_many2many, axis = 0L, join = 'inner', merge = 'first', label = 'batch', index_unique = '-') 201 | adata_add 202 | 203 | if (dim(adata_many2many)[1] == 0) { 204 | adata_many2many <- adata_add 205 | } else { 206 | adata_many2many <- concat(list(adata_many2many, adata_add), axis = 1L, join = "outer", merge = "first", index_unique = '-') 207 | } 208 | 209 | } 210 | 211 | adata_higher_expr = concat(list(adata_one2one, adata_many2many), axis = 1L, join = 'inner', merge = 'first', index_unique = '-') 212 | 213 | message("finish building higher expression level") 214 | adata_higher_expr 215 | i <- sapply(adata_higher_expr$obs, is.factor) 216 | adata_higher_expr$obs[i] <- lapply(adata_higher_expr$obs[i], as.character) 217 | 218 | j <- sapply(adata_higher_expr$var, is.factor) 219 | adata_higher_expr$var[j] <- lapply(adata_higher_expr$var[j], as.character) 220 | 221 | write_h5ad(anndata = adata_higher_expr , filename = many_higher_expr_h5ad, compression = "gzip") 222 | 223 | message("start higher homology 
confidence") 224 | 225 | ## available attributes to indicate confidence of homology 226 | ## mind the order 227 | order = metadata$X1 228 | avail_ordered = c() 229 | for (attr in c("orthology_confidence", "homolog_goc_score", "homolog_wga_coverage")){ 230 | 231 | avail_homo = c(avail_attributes$name[grepl(attr, avail_attributes$name)]) 232 | 233 | for (i in seq(1, length(order))){ 234 | avail_ordered = c(avail_ordered, avail_homo[grepl(order[i], avail_homo)]) 235 | } 236 | } 237 | 238 | avail_homo = avail_ordered 239 | 240 | many2many_copy_homo <- many2many %>% rowid_to_column("index") 241 | adata_many2many_homo = AnnData(shape = list(0, 0)) 242 | 243 | while (nrow(many2many_copy_homo) > 0) { 244 | 245 | dd <- many2many_copy_homo %>% 246 | filter(get(paste0(species_1, "_homolog_ensembl_gene")) == levels(factor(many2many_copy_homo[[paste0(species_1, "_homolog_ensembl_gene")]]))[1]) 247 | 248 | genes_now = dd %>% dplyr::select(ends_with("_homolog_ensembl_gene")) %>% flatten() %>% unique() %>% as.character() 249 | 250 | ## find a gene group with many-to-many relationships 251 | gene_group <- many2many_copy_homo %>% 252 | dplyr::filter_at(vars(ends_with("_homolog_ensembl_gene")), any_vars(. 
%in% genes_now)) 253 | 254 | if(nrow(gene_group) == 1) { 255 | gene_keep = gene_group 256 | } else { 257 | gene_keep = gene_group %>% 258 | dplyr::arrange( 259 | sapply(avail_homo, FUN = function(x) get(x)) ## keep the member of group with highest overall confidence 260 | ) %>% 261 | slice(n()) 262 | } 263 | many2many_copy_homo <- many2many_copy_homo %>% 264 | filter(!index %in% gene_group$index) 265 | 266 | adatas_many2many_homo = list() 267 | for(species_now in species_list){ 268 | 269 | adatas_many2many_homo[[species_now]] = adatas[[species_now]][, adatas[[species_now]]$var_names %in% gene_keep[[paste0(species_now, "_homolog_ensembl_gene")]]] 270 | 271 | } 272 | 273 | 274 | new_name = adatas_many2many_homo[[species_1]]$var_names 275 | #message(new_name) 276 | for(species_now in species_list[-1]){ 277 | 278 | adatas_many2many_homo[[species_now]]$var[[paste0(species_1, "_homolog_ensembl_gene")]] = new_name 279 | rownames(adatas_many2many_homo[[species_now]]$var) = new_name 280 | 281 | } 282 | adata_add = concat(adatas_many2many_homo, axis = 0L, join = 'inner', merge = 'first', label = 'batch', index_unique = '-') 283 | 284 | if (dim(adata_many2many_homo)[1] == 0) { 285 | adata_many2many_homo <- adata_add 286 | } else { 287 | adata_many2many_homo <- concat(list(adata_many2many_homo, adata_add), axis = 1L, join = "outer", merge = "first", index_unique = '-') 288 | } 289 | 290 | } 291 | 292 | 293 | adata_higher_homology_conf = concat(list(adata_one2one, adata_many2many_homo), axis = 1L, join = 'inner', merge = 'first', index_unique = '-') 294 | i <- sapply(adata_higher_homology_conf$obs, is.factor) 295 | adata_higher_homology_conf$obs[i] <- lapply(adata_higher_homology_conf$obs[i], as.character) 296 | 297 | j <- sapply(adata_higher_homology_conf$var, is.factor) 298 | adata_higher_homology_conf$var[j] <- lapply(adata_higher_homology_conf$var[j], as.character) 299 | 300 | write_h5ad(anndata = adata_higher_homology_conf , filename = many_higher_homology_conf_h5ad, 
compression = "gzip") 301 | 302 | message("finish building higher homology confidence") 303 | adata_higher_homology_conf 304 | 305 | message("finish concat by homology") 306 | -------------------------------------------------------------------------------- /bin/concat_by_homology_rligerUINMF_multiple_species.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env R 2 | 3 | ########## 4 | ## concatenate anndata object by homology for LIGER UINMF pipeline 5 | ## ysong@ebi.ac.uk for CrossSpeciesIntegration pipeline 6 | ######### 7 | 8 | # © EMBL-European Bioinformatics Institute, 2023 9 | # Yuyao Song 10 | 11 | library(optparse) 12 | library(anndata) 13 | library(dplyr) 14 | library(purrr) 15 | library(readr) 16 | library(magrittr) 17 | library(tibble) 18 | library(biomaRt) 19 | 20 | option_list <- list( 21 | make_option(c("--metadata"), 22 | type = "character", default = NULL, 23 | help = "Path to a file indicating the species-h5ad mapping, tab-separated" 24 | ), 25 | make_option(c("--homology_tbl"), 26 | type = "character", default = NULL, 27 | help = "Ensembl homology tbl output" 28 | ), 29 | make_option(c("--out_dir"), 30 | type = "character", default = NULL, 31 | help = "output dir to write h5ad files" 32 | ), 33 | make_option(c("--metadata_output"), 34 | type = "character", default = NULL, 35 | help = "Path to write the rliger UINMF ready metadata" 36 | ) 37 | ) 38 | opt <- parse_args(OptionParser(option_list = option_list)) 39 | 40 | input_metadata <- opt$metadata 41 | homology_tbl_csv <- opt$homology_tbl 42 | out_dir <- opt$out_dir 43 | metadata_output <- opt$metadata_output 44 | 45 | 46 | message("create seurat objects for rliger") 47 | 48 | 49 | genes_main_chr <- list( 50 | hsapiens = as.character(c(lapply(seq(1, 22, by = 1), as.character), "X", "Y")), 51 | sscrofa = as.character(c(lapply(seq(1, 18, by = 1), as.character), "X", "Y")), 52 | mmusculus = as.character(c(lapply(seq(1, 19, by = 1), as.character), "X", "Y")), 53 | 
mmulatta = as.character(c(lapply(seq(1, 20, by = 1), as.character), "X", "Y")), 54 | mfascicularis = as.character(c(lapply(seq(1, 20, by = 1), as.character), "X")), 55 | dmelanogaster = c("2L", "2R", "3L", "3R", "4", "X", "Y"), 56 | xtropicalis = as.character(c(lapply(seq(1, 10, by = 1), as.character))), 57 | drerio = as.character(c(lapply(seq(1, 25, by = 1), as.character))) 58 | ) 59 | 60 | metadata <- read_tsv(input_metadata, col_names = FALSE) 61 | species_list=metadata[['X1']] 62 | message(paste0("start concatenating anndata object from ", paste(species_list, collapse = ', '))) 63 | 64 | 65 | adatas = list() 66 | for( species in metadata$X1){ 67 | adatas[[species]] = read_h5ad(as.character(metadata[which(metadata$X1 == species), 'X2'])) 68 | } 69 | 70 | for(species_now in species_list){ 71 | adatas[[species_now]]$var[[paste0(species_now, "_homolog_ensembl_gene")]] = adatas[[species_now]]$var_names 72 | } 73 | 74 | 75 | count_all <- data.frame() 76 | species_1 = species_list[1] 77 | 78 | 79 | ## version = '105' 80 | mart <- useEnsembl("ensembl", version = "105", dataset = as.character(paste(species_1, "_gene_ensembl", sep = ""))) 81 | 82 | # get genes in the main chrs of the first species 83 | genes_species_1 <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"), mart = mart) %>% 84 | dplyr::filter(chromosome_name %in% (genes_main_chr[species_1] %>% purrr::flatten_chr())) 85 | message(paste("start querying ensembl biomaRt for gene homology")) 86 | 87 | avail_attributes <- listAttributes(mart) %>% filter(grepl((paste(species_list[-1],collapse="|")), name)) %>% filter(grepl("homo", name)) %>% 88 | filter(!grepl("Query protein or transcript ID", description)) 89 | biomartCacheClear() 90 | 91 | homology_tbl = getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name", "start_position", "end_position", 92 | avail_attributes$name), mart = mart, filters = "ensembl_gene_id", 93 | values = 
genes_species_1[["ensembl_gene_id"]]) 94 | 95 | 96 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene stable ID'] = paste0(species_1, "_homolog_ensembl_gene") 97 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene name'] = paste0(species_1, "_homolog_associated_gene_name") 98 | #names(homology_tbl)[colnames(homology_tbl) == 'Chromosome/scaffold name'] = paste0(species_1, "_homolog_chromosome") 99 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene start (bp)'] = paste0(species_1, "_homolog_chrom_start") 100 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene end (bp)'] = paste0(species_1, "_homolog_chrom_end") 101 | 102 | #names(homology_tbl)[match(avail_attributes[,"description"], names(homology_tbl))] = avail_attributes[,"name"] 103 | 104 | write_csv(homology_tbl, file = homology_tbl_csv) 105 | 106 | 107 | message("start building one2one") 108 | homology_tbl[paste0(species_1, "_homolog_associated_gene_name")] = homology_tbl$external_gene_name 109 | homology_tbl[paste0(species_1, "_homolog_ensembl_gene")] = homology_tbl$ensembl_gene_id 110 | homology_tbl[paste0(species_1, "_homolog_chromosome")] = homology_tbl$chromosome_name 111 | homology_tbl[paste0(species_1, "_homolog_chrom_start")] = homology_tbl$start_position 112 | homology_tbl[paste0(species_1, "_homolog_chrom_end")] = homology_tbl$end_position 113 | one2one = homology_tbl %>% filter_at(vars(ends_with("homolog_orthology_type")), all_vars(. 
== 'ortholog_one2one')) %>% 114 | distinct(get(paste0(species_1, "_homolog_ensembl_gene")), `.keep_all` = TRUE) 115 | 116 | adatas_one2one = list() 117 | 118 | for(species_now in names(adatas)){ 119 | adatas_one2one[[species_now]] = adatas[[species_now]][, tolower(adatas[[species_now]]$var_names) %in% tolower(one2one[[paste0(species_now, "_homolog_ensembl_gene")]])] 120 | } 121 | 122 | 123 | for(species_now in species_list[-1]){ 124 | message(species_now) 125 | one2one_now = one2one %>% filter(tolower(get(paste0(species_now, "_homolog_ensembl_gene"))) %in% tolower(adatas_one2one[[species_now]]$var_names)) %>% 126 | distinct(get(paste0(species_1, "_homolog_ensembl_gene")), `.keep_all` = TRUE) 127 | adatas_one2one[[species_now]] = adatas_one2one[[species_now]][, tolower(adatas_one2one[[species_now]]$var_names) %in% tolower(one2one_now[[paste0(species_now, "_homolog_ensembl_gene")]])] 128 | one2one_now = one2one_now[match(tolower(adatas_one2one[[species_now]]$var_names), tolower(one2one_now[[paste0(species_now, '_homolog_ensembl_gene')]])), ] 129 | adatas_one2one[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] = adatas_one2one[[species_now]]$var_names 130 | adatas_one2one[[species_now]]$var[[paste0(species_1, '_homolog_ensembl_gene')]] = one2one_now[[paste0(species_1, '_homolog_ensembl_gene')]] 131 | adatas_one2one[[species_now]]$var_names = one2one_now[[paste0(species_1, '_homolog_ensembl_gene')]] 132 | adatas_one2one[[species_now]] = adatas_one2one[[species_now]][, !duplicated(adatas_one2one[[species_now]]$var_names)] 133 | } 134 | 135 | adatas_one2one_uinmf = list() 136 | 137 | for(species_now in names(adatas)){ 138 | 139 | adata_unshared = adatas[[species_now]][, !(adatas[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] %in% adatas_one2one[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]])] 140 | adatas_one2one_uinmf[[species_now]] = concat(list(adatas_one2one[[species_now]], adata_unshared), axis = 1L, join = 
'inner', merge = 'first', index_unique = '-') 141 | 142 | adatas_one2one_uinmf[[species_now]] = adatas_one2one_uinmf[[species_now]][, !duplicated(adatas_one2one_uinmf[[species_now]]$var_names)] 143 | 144 | i <- sapply(adatas_one2one_uinmf[[species_now]]$obs, is.factor) 145 | adatas_one2one_uinmf[[species_now]]$obs[i] <- lapply(adatas_one2one_uinmf[[species_now]]$obs[i], as.character) 146 | j <- sapply(adatas_one2one_uinmf[[species_now]]$var, is.factor) 147 | adatas_one2one_uinmf[[species_now]]$var[j] <- lapply(adatas_one2one_uinmf[[species_now]]$var[j], as.character) 148 | 149 | write_h5ad(anndata = adatas_one2one_uinmf[[species_now]], 150 | filename = paste0(out_dir, "/", species_now, "_one2one_only_ligerUINMF.h5ad"), compression = "gzip") 151 | message(species_now) 152 | 153 | } 154 | 155 | message("finish building one2one") 156 | 157 | 158 | message("start building many2many") 159 | message("start higher expression level") 160 | 161 | 162 | many2many = homology_tbl %>% filter_at(vars(ends_with("homolog_orthology_type")), all_vars(. 
!= 'ortholog_one2one')) 163 | 164 | 165 | for(species_now in species_list){ 166 | print(species_now) 167 | many2many = many2many %>% filter(!is.na(get(paste0(species_now, "_homolog_ensembl_gene"))) & get(paste0(species_now, "_homolog_ensembl_gene")) != "") 168 | print(dim(many2many)) 169 | many2many = many2many %>% 170 | filter(tolower(get(paste0(species_now, "_homolog_ensembl_gene"))) %in% tolower(adatas[[species_now]]$var[[paste0(species_now, "_homolog_ensembl_gene")]])) 171 | print(dim(many2many)) 172 | } 173 | 174 | many2many_copy <- many2many %>% rowid_to_column("index") 175 | adata_many2many = AnnData(shape = list(0, 0)) 176 | 177 | 178 | adatas_many2many_all = list() 179 | while (nrow(many2many_copy) > 0) { 180 | 181 | if(nrow(many2many_copy) < 6000 & nrow(many2many_copy) > 5990) message(nrow(many2many_copy)) 182 | 183 | if(nrow(many2many_copy) < 4000 & nrow(many2many_copy) > 3990) message(nrow(many2many_copy)) 184 | 185 | if(nrow(many2many_copy) < 1000 & nrow(many2many_copy) > 990) message(nrow(many2many_copy)) 186 | 187 | 188 | dd <- many2many_copy %>% 189 | filter(get(paste0(species_1, "_homolog_ensembl_gene")) == levels(factor(many2many_copy[[paste0(species_1, "_homolog_ensembl_gene")]]))[1]) 190 | 191 | genes_now = dd %>% dplyr::select(ends_with("_homolog_ensembl_gene")) %>% flatten() %>% unique() %>% as.character() 192 | 193 | gene_group <- many2many_copy %>% 194 | dplyr::filter_at(vars(ends_with("_homolog_ensembl_gene")), any_vars(. 
%in% genes_now)) 195 | 196 | many2many_copy <- many2many_copy %>% 197 | filter(!index %in% gene_group$index) 198 | 199 | adatas_many2many = list() 200 | 201 | for(species_now in species_list){ 202 | adatas_many2many[[species_now]] = adatas[[species_now]][, tolower(adatas[[species_now]]$var_names) %in% tolower(gene_group[[paste0(species_now, "_homolog_ensembl_gene")]])] 203 | keep_row = adatas_many2many[[species_now]]$var %>% 204 | arrange(desc(mean_counts)) %>% 205 | slice(1) 206 | adatas_many2many[[species_now]] = adatas_many2many[[species_now]][, which(tolower(adatas_many2many[[species_now]]$var_names) == tolower(rownames(keep_row)))] 207 | } 208 | 209 | new_name = adatas_many2many[[species_1]]$var_names 210 | #message(new_name) 211 | 212 | for(species_now in species_list[-1]){ 213 | adatas_many2many[[species_now]]$var[[paste0(species_1, "_homolog_ensembl_gene")]] = new_name 214 | rownames(adatas_many2many[[species_now]]$var) = new_name 215 | } 216 | 217 | for(species_now in species_list){ 218 | if (is.null(adatas_many2many_all[[species_now]])) { 219 | adatas_many2many_all[[species_now]] <- adatas_many2many[[species_now]] 220 | } else { 221 | adatas_many2many_all[[species_now]] <- concat(list(adatas_many2many_all[[species_now]], adatas_many2many[[species_now]]), axis = 1L, join = "outer", merge = "first", index_unique = '-') 222 | } 223 | 224 | } 225 | 226 | rm(adatas_many2many) 227 | } 228 | 229 | 230 | adatas_one2one_and_many_expr = list() 231 | 232 | for(species_now in names(adatas)){ 233 | 234 | adatas_one2one_and_many_expr[[species_now]] = concat(list(adatas_one2one[[species_now]], adatas_many2many_all[[species_now]]), axis = 1L, join = 'inner', merge = 'first', index_unique = '-') 235 | message(species_now) 236 | 237 | } 238 | 239 | rm(adatas_many2many_all) 240 | 241 | adatas_one2one_higher_expr_uinmf = list() 242 | 243 | for(species_now in names(adatas)){ 244 | 245 | adata_unshared = adatas[[species_now]][, !(adatas[[species_now]]$var[[paste0(species_now, 
'_homolog_ensembl_gene')]] %in% adatas_one2one_and_many_expr[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]])] 246 | adatas_one2one_higher_expr_uinmf[[species_now]] = concat(list(adatas_one2one_and_many_expr[[species_now]], adata_unshared), axis = 1L, join = 'inner', merge = 'first', index_unique = '-') 247 | 248 | adatas_one2one_higher_expr_uinmf[[species_now]]= adatas_one2one_higher_expr_uinmf[[species_now]][, !duplicated(adatas_one2one_higher_expr_uinmf[[species_now]]$var_names)] 249 | 250 | i <- sapply(adatas_one2one_higher_expr_uinmf[[species_now]]$obs, is.factor) 251 | adatas_one2one_higher_expr_uinmf[[species_now]]$obs[i] <- lapply(adatas_one2one_higher_expr_uinmf[[species_now]]$obs[i], as.character) 252 | j <- sapply(adatas_one2one_higher_expr_uinmf[[species_now]]$var, is.factor) 253 | adatas_one2one_higher_expr_uinmf[[species_now]]$var[j] <- lapply(adatas_one2one_higher_expr_uinmf[[species_now]]$var[j], as.character) 254 | 255 | write_h5ad(anndata = adatas_one2one_higher_expr_uinmf[[species_now]], 256 | filename = paste0(out_dir, "/", species_now, "_many_higher_expr_ligerUINMF.h5ad"), compression = "gzip") 257 | message(species_now) 258 | 259 | } 260 | message("finish building higher expression level") 261 | 262 | rm(adatas_one2one_and_many_expr) 263 | rm(adatas_one2one_higher_expr_uinmf) 264 | 265 | 266 | message("start higher homology confidence") 267 | 268 | ## available attributes to indicate confidence of homology 269 | adatas_many2many_homo_all = list() 270 | 271 | ## mind the order 272 | order = metadata$X1 273 | avail_ordered = c() 274 | for (attr in c("orthology_confidence", "homolog_goc_score", "homolog_wga_coverage")){ 275 | 276 | avail_homo = c(avail_attributes$name[grepl(attr, avail_attributes$name)]) 277 | 278 | for (i in seq(1, length(order))){ 279 | avail_ordered = c(avail_ordered, avail_homo[grepl(order[i], avail_homo)]) 280 | } 281 | } 282 | 283 | avail_homo = avail_ordered 284 | 285 | many2many_copy_homo <- many2many 
%>% rowid_to_column("index") 286 | adata_many2many_homo = AnnData(shape = list(0, 0)) 287 | 288 | while (nrow(many2many_copy_homo) > 0) { 289 | 290 | 291 | if(nrow(many2many_copy_homo) < 6000 & nrow(many2many_copy_homo) > 5990) message(nrow(many2many_copy_homo)) 292 | 293 | if(nrow(many2many_copy_homo) < 4000 & nrow(many2many_copy_homo) > 3990) message(nrow(many2many_copy_homo)) 294 | 295 | if(nrow(many2many_copy_homo) < 1000 & nrow(many2many_copy_homo) > 990) message(nrow(many2many_copy_homo)) 296 | 297 | 298 | dd <- many2many_copy_homo %>% 299 | filter(get(paste0(species_1, "_homolog_ensembl_gene")) == levels(factor(many2many_copy_homo[[paste0(species_1, "_homolog_ensembl_gene")]]))[1]) 300 | 301 | genes_now = dd %>% dplyr::select(ends_with("_homolog_ensembl_gene")) %>% flatten() %>% unique() %>% as.character() 302 | 303 | ## find a gene group with many-to-many relationships 304 | gene_group <- many2many_copy_homo %>% 305 | dplyr::filter_at(vars(ends_with("_homolog_ensembl_gene")), any_vars(. 
%in% genes_now)) 306 | 307 | if(nrow(gene_group) == 1) { 308 | gene_keep = gene_group 309 | } else { 310 | gene_keep = gene_group %>% 311 | dplyr::arrange( 312 | sapply(avail_homo, FUN = function(x) get(x)) ## keep the member of group with highest overall confidence 313 | ) %>% 314 | slice(n()) 315 | } 316 | 317 | many2many_copy_homo <- many2many_copy_homo %>% 318 | filter(!index %in% gene_group$index) 319 | 320 | adatas_many2many_homo = list() 321 | for(species_now in species_list){ 322 | 323 | adatas_many2many_homo[[species_now]] = adatas[[species_now]][, adatas[[species_now]]$var_names %in% gene_keep[[paste0(species_now, "_homolog_ensembl_gene")]]] 324 | 325 | } 326 | 327 | new_name = adatas_many2many_homo[[species_1]]$var_names 328 | #message(new_name) 329 | for(species_now in species_list[-1]){ 330 | adatas_many2many_homo[[species_now]]$var[[paste0(species_1, "_homolog_ensembl_gene")]] = new_name 331 | rownames(adatas_many2many_homo[[species_now]]$var) = new_name 332 | } 333 | 334 | for(species_now in species_list){ 335 | if (is.null(adatas_many2many_homo_all[[species_now]])) { 336 | adatas_many2many_homo_all[[species_now]] <- adatas_many2many_homo[[species_now]] 337 | } else { 338 | adatas_many2many_homo_all[[species_now]] <- concat(list(adatas_many2many_homo_all[[species_now]], adatas_many2many_homo[[species_now]]), axis = 1L, join = "outer", merge = "first", index_unique = '-') 339 | } 340 | } 341 | rm(adatas_many2many_homo) 342 | 343 | } 344 | adatas_one2one_and_many_homo = list() 345 | for(species_now in names(adatas)){ 346 | 347 | adatas_one2one_and_many_homo[[species_now]] = concat(list(adatas_one2one[[species_now]], adatas_many2many_homo_all[[species_now]]), axis = 1L, join = 'inner', merge = 'first', index_unique = '-') 348 | message(species_now) 349 | 350 | } 351 | 352 | rm(adatas_many2many_homo_all) 353 | 354 | 355 | adatas_one2one_higher_homo_uinmf = list() 356 | 357 | for(species_now in names(adatas)){ 358 | 359 | adata_unshared = 
adatas[[species_now]][, !(adatas[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] %in% adatas_one2one_and_many_homo[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]])] 360 | adatas_one2one_higher_homo_uinmf[[species_now]] = concat(list(adatas_one2one_and_many_homo[[species_now]], adata_unshared), axis = 1L, join = 'inner', merge = 'first', index_unique = '-') 361 | 362 | adatas_one2one_higher_homo_uinmf[[species_now]]= adatas_one2one_higher_homo_uinmf[[species_now]][, !duplicated(adatas_one2one_higher_homo_uinmf[[species_now]]$var_names)] 363 | 364 | i <- sapply(adatas_one2one_higher_homo_uinmf[[species_now]]$obs, is.factor) 365 | adatas_one2one_higher_homo_uinmf[[species_now]]$obs[i] <- lapply(adatas_one2one_higher_homo_uinmf[[species_now]]$obs[i], as.character) 366 | j <- sapply(adatas_one2one_higher_homo_uinmf[[species_now]]$var, is.factor) 367 | adatas_one2one_higher_homo_uinmf[[species_now]]$var[j] <- lapply(adatas_one2one_higher_homo_uinmf[[species_now]]$var[j], as.character) 368 | 369 | write_h5ad(anndata = adatas_one2one_higher_homo_uinmf[[species_now]], 370 | filename = paste0(out_dir, "/", species_now, "_many_higher_homology_conf_ligerUINMF.h5ad"), compression = "gzip") 371 | message(species_now) 372 | 373 | } 374 | message("finish higher homology confidence") 375 | 376 | rm(adatas_one2one_and_many_homo) 377 | rm(adatas_one2one_higher_homo_uinmf) 378 | 379 | 380 | 381 | metadata = data.frame('species' = names(adatas)) 382 | metadata$one2one_only = paste0(out_dir, "/", metadata$species, "_one2one_only_ligerUINMF.rds") 383 | metadata$many_higher_expr = paste0(out_dir, "/", metadata$species, "_many_higher_expr_ligerUINMF.rds") 384 | metadata$many_higher_homology_conf = paste0(out_dir, "/", metadata$species, "_many_higher_homology_conf_ligerUINMF.rds") 385 | write_tsv(metadata, file = metadata_output, col_names = TRUE) 386 | -------------------------------------------------------------------------------- 
/bin/convert_format.R: -------------------------------------------------------------------------------- 1 | # /usr/bin/env R 2 | 3 | library(optparse) 4 | 5 | # © EMBL-European Bioinformatics Institute, 2023 6 | # Yuyao Song 7 | 8 | option_list <- list( 9 | make_option(c("-i", "--input_file"), 10 | type = "character", default = NULL, 11 | help = "Path to input file for converting" 12 | ), 13 | make_option(c("-o", "--output_file"), 14 | type = "character", default = NULL, 15 | help = "Output file after conversion" 16 | ), 17 | make_option(c("-t", "--type"), 18 | type = "character", default = NULL, 19 | help = "Conversion type, choose between anndata_to_seurat or seurat_to_anndata" 20 | ), 21 | make_option(c("--conda_path"), 22 | type = "character", default = NULL, 23 | help = "Path to the conda env whose python executable reticulate should use, important to match the prepared conda env!" 24 | ) 25 | ) 26 | 27 | # parse input 28 | opt <- parse_args(OptionParser(option_list = option_list)) 29 | 30 | input_file <- opt$input_file 31 | output_file <- opt$output_file 32 | type <- opt$type 33 | conda_path <- opt$conda_path 34 | 35 | # set sys env before loading reticulate 36 | Sys.setenv(RETICULATE_PYTHON=paste0(conda_path, "/bin/python3")) 37 | Sys.setenv(RETICULATE_PYTHON_ENV=conda_path) 38 | 39 | library(reticulate) 40 | library(sceasy) 41 | library(anndata) 42 | 43 | 44 | if(type == 'anndata_to_seurat'){ 45 | 46 | message(paste0("from anndata to seurat, input: ", input_file)) 47 | 48 | sceasy::convertFormat(input_file, from="anndata", to="seurat", 49 | outFile=output_file) 50 | } else if (type == 'seurat_to_anndata'){ 51 | 52 | message(paste0("from seurat to anndata, input: ", input_file)) 53 | obj <- readRDS(input_file) 54 | dir_name = dirname(input_file) 55 | ## oddly, converting from seurat to anndata needs the object loaded in memory, while the other direction works on-disk 56 | sceasy::convertFormat(obj, from="seurat", to="anndata", 57 | outFile=output_file) 58 | 59 | } else { 60 | 61 | stop("type 
must be either anndata_to_seurat or seurat_to_anndata") 62 | } 63 | -------------------------------------------------------------------------------- /bin/fastMNN_integration.R: -------------------------------------------------------------------------------- 1 | # /usr/bin/env R 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | library(optparse) 7 | library(Seurat) 8 | library(SeuratWrappers) 9 | 10 | option_list <- list( 11 | make_option(c("-i", "--input_rds"), 12 | type = "character", default = NULL, 13 | help = "Path to input preprocessed rds file" 14 | ), 15 | make_option(c("-o", "--out_rds"), 16 | type = "character", default = NULL, 17 | help = "Output rds file integrated with fastMNN from Seurat wrappers" 18 | ), 19 | make_option(c("-p", "--out_UMAP"), 20 | type = "character", default = NULL, 21 | help = "Output UMAP after fastMNN integration" 22 | ), 23 | make_option(c("-b", "--batch_key"), 24 | type = "character", default = NULL, 25 | help = "Batch key identifier to integrate" 26 | ), 27 | make_option(c("-s", "--species_key"), 28 | type = "character", default = NULL, 29 | help = "Species key identifier" 30 | ), 31 | make_option(c("-c", "--cluster_key"), 32 | type = "character", default = NULL, 33 | help = "Cluster key for UMAP plotting" 34 | ) 35 | ) 36 | 37 | # parse input 38 | opt <- parse_args(OptionParser(option_list = option_list)) 39 | 40 | input_rds <- opt$input_rds 41 | out_rds <- opt$out_rds 42 | out_UMAP <- opt$out_UMAP 43 | batch_key <- opt$batch_key 44 | species_key <- opt$species_key 45 | cluster_key <- opt$cluster_key 46 | 47 | ## create Seurat object via rds 48 | 49 | # Convert(input_h5ad, dest = "rds", overwrite = TRUE) 50 | # input_rds <- gsub("h5ad", "rds", input_h5ad) 51 | obj <- readRDS(input_rds) 52 | 53 | obj <- NormalizeData(obj) 54 | obj <- FindVariableFeatures(obj) 55 | 56 | ## run fastMNN 57 | obj <- RunFastMNN(object.list = SplitObject(obj, split.by = batch_key)) 58 | obj <- RunUMAP(obj, reduction = 
"mnn", dims = 1:30, n.neighbors = 15L, min.dist = 0.3) 59 | obj <- FindNeighbors(obj, reduction = "mnn", dims = 1:30) 60 | obj <- FindClusters(obj, resolution = 0.4) 61 | 62 | # have to convert all factors to character, otherwise the factors become numbers when later converting to h5ad 63 | i <- sapply(obj@meta.data, is.factor) 64 | obj@meta.data[i] <- lapply(obj@meta.data[i], as.character) 65 | 66 | saveRDS(obj, 67 | file = out_rds 68 | ) 69 | 70 | 71 | pdf(out_UMAP, height = 6, width = 10) 72 | DimPlot(obj, reduction = "umap", group.by = species_key, shuffle = TRUE, label = TRUE) 73 | DimPlot(obj, reduction = "umap", group.by = batch_key, shuffle = TRUE, label = TRUE) 74 | DimPlot(obj, reduction = "umap", group.by = cluster_key, shuffle = TRUE, label = TRUE) 75 | 76 | dev.off() 77 | -------------------------------------------------------------------------------- /bin/harmony_integration.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | 7 | import click 8 | import matplotlib.pyplot as plt 9 | import scanpy as sc 10 | 11 | @click.command() 12 | @click.argument("input_h5ad", type=click.Path(exists=True)) 13 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None) 14 | @click.argument("out_umap", type=click.Path(exists=False), default=None) 15 | @click.option('--batch_key', type=str, default=None, help="Batch key in identifying HVG and harmony integration") 16 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species") 17 | @click.option('--cluster_key', type=str, default=None, help="Cluster key for UMAP plotting") 18 | 19 | 20 | def run_harmony(input_h5ad, out_h5ad, out_umap, batch_key, species_key, cluster_key): 21 | click.echo('Start harmony integration') 22 | sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 6)) 23 | 
adata = sc.read_h5ad(input_h5ad) 24 | adata.var_names_make_unique() 25 | sc.pp.normalize_total(adata, target_sum=1e4) 26 | sc.pp.log1p(adata) 27 | click.echo("HVG") 28 | sc.pp.highly_variable_genes(adata, batch_key=batch_key) 29 | sc.pp.scale(adata, max_value=10) 30 | sc.tl.pca(adata) 31 | sc.pp.neighbors(adata, use_rep='X_pca', n_neighbors=15, n_pcs=40) 32 | sc.tl.umap(adata, min_dist=0.3) ## to match min_dist in seurat 33 | adata.obsm['X_umapraw'] = adata.obsm['X_umap'] 34 | click.echo("Harmony") 35 | sc.external.pp.harmony_integrate(adata, key=batch_key, basis = 'X_pca') 36 | sc.pp.neighbors(adata, use_rep='X_pca_harmony', key_added = 'harmony', n_neighbors=15, n_pcs=40) 37 | sc.tl.umap(adata, neighbors_key = 'harmony', min_dist=0.3) ## to match min_dist in seurat 38 | sc.pl.umap(adata, color=[batch_key, species_key, cluster_key], ncols=1) 39 | plt.savefig(out_umap, dpi=300, bbox_inches='tight') 40 | adata.obsm['X_umapharmony'] = adata.obsm['X_umap'] 41 | click.echo("Save output") 42 | adata.write(out_h5ad) 43 | click.echo("Done harmony") 44 | 45 | 46 | if __name__ == '__main__': 47 | run_harmony() 48 | 49 | -------------------------------------------------------------------------------- /bin/kbet_multiple_species.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env R 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | library(anndata) 6 | library(mclust) 7 | library(cluster) 8 | library(optparse) 9 | library(kBET) 10 | library(tidyverse) 11 | library(FNN) 12 | 13 | 14 | option_list <- list( 15 | make_option(c("-i", "--input_h5ad"), 16 | type = "character", default = NULL, 17 | help = "Path to input integrated h5ad file" 18 | ), 19 | make_option(c("-o", "--out_csv"), 20 | type = "character", default = NULL, 21 | help = "Output csv file with various batch effect measurements" 22 | ), 23 | make_option(c("-m", "--method"), 24 | type = "character", default = NULL, 25 | help = "Integration 
method used, have an effect on use embedding or use count" 26 | ), 27 | make_option(c("-b", "--batch_key"), 28 | type = "character", default = NULL, 29 | help = "Batch key identifier to integrate" 30 | ), 31 | make_option(c("-c", "--cluster_key"), 32 | type = "character", default = NULL, 33 | help = "Cell type cluster key" 34 | ), 35 | make_option(c("-s", "--species_key"), 36 | type = "character", default = NULL, 37 | help = "Species key identifier" 38 | ) 39 | ) 40 | 41 | opt <- parse_args(OptionParser(option_list = option_list)) 42 | 43 | input_h5ad <- opt$input_h5ad 44 | out_csv <- opt$out_csv 45 | method <- opt$method 46 | batch_key <- opt$batch_key 47 | cluster_key <- opt$cluster_key 48 | species_key <- opt$species_key 49 | 50 | message("read in h5ad") 51 | ad <- read_h5ad(input_h5ad) 52 | 53 | ## identify shared cell types for measuring batch effect 54 | species_all = levels(factor(ad$obs[[species_key]])) 55 | cell_types = list() 56 | for(species in species_all){ 57 | cell_types[[species]] = levels(factor(ad$obs[ad$obs[[species_key]] == species, cluster_key])) 58 | } 59 | 60 | ct_shared=Reduce(intersect, cell_types) 61 | message(paste0("shared cell types include ", paste(ct_shared, collapse = ", "))) 62 | 63 | ad <- ad[ad$obs[[cluster_key]] %in% ct_shared, ] 64 | 65 | # get the matrix to compute batch effect 66 | # these methods return embedding 67 | if (method == "harmony") { 68 | data <- data.frame(ad$obsm[["X_pca_harmony"]], row.names = ad$obs_names) 69 | do_pca <- FALSE 70 | } 71 | if (method == "scanorama") { 72 | data <- data.frame(ad$obsm[["X_scanorama"]], row.names = ad$obs_names) 73 | do_pca <- FALSE 74 | } 75 | if (method == "scVI") { 76 | data <- data.frame(ad$obsm[["X_scVI"]], row.names = ad$obs_names) 77 | do_pca <- FALSE 78 | } 79 | if (method == "LIGER") { 80 | data <- data.frame(ad$obsm[["X_iNMF"]], row.names = ad$obs_names) 81 | do_pca <- FALSE 82 | } 83 | if (method == "rligerUINMF") { 84 | data <- data.frame(ad$obsm[["X_inmf"]], row.names = 
ad$obs_names) 85 | do_pca <- FALSE 86 | } 87 | if (method == "fastMNN") { 88 | data <- data.frame(ad$obsm[["X_mnn"]], row.names = ad$obs_names) 89 | do_pca <- FALSE 90 | } 91 | 92 | if (method == "SAMap") { 93 | data <- data.frame(ad$obsm[["wPCA"]], row.names = ad$obs_names) 94 | do_pca <- FALSE 95 | } 96 | 97 | # seurat return a pseudo-count matrix that is after the integration 98 | if (method %in% c("seuratCCA", "seuratRPCA", "unintegrated")) { 99 | data <- data.frame(as.matrix((ad$X)), row.names = ad$obs_names) 100 | do_pca <- TRUE 101 | } # sparse matrix to dense 102 | 103 | batch <- ad$obs[[batch_key]] 104 | 105 | # if only 2 batches use pval in linear model, if multiple batches use pval in ANOVA F test in PC regression 106 | if (length(levels(factor(as.character(batch)))) > 2) { 107 | use_pval <- "p.value.F.test" 108 | } else { 109 | use_pval <- "p.value.lm" 110 | } 111 | 112 | kbet_per_ct <- data.frame() 113 | for (ct in levels(factor(ad$obs[[cluster_key]]))) { 114 | message(ct) 115 | ad_ct <- ad[ad$obs[[cluster_key]] == ct, ] 116 | data_ct <- data[ad_ct$obs_names, ] 117 | batch <- ad_ct$obs[[batch_key]] 118 | k0 <- floor(mean(table(batch))) # neighbourhood size: mean batch size 119 | knn <- get.knn(data_ct, k = k0, algorithm = "cover_tree") 120 | # now run kBET with pre-defined nearest neighbours. 
121 | batch.estimate <- suppressWarnings(kBET(data_ct, batch, k = k0, knn = knn, plot = FALSE, do.pca = do_pca)) 122 | 123 | if (any(is.na(batch.estimate))) { 124 | message(paste0("cell type ", ct, " has fewer than 10 cells, skipping cell-type kBET")) 125 | next 126 | } 127 | 128 | add <- batch.estimate$summary["mean", ] 129 | add$cell_type <- ct 130 | 131 | if(nrow(kbet_per_ct) == 0){ 132 | 133 | kbet_per_ct <- add %>% pivot_longer(cols = 1:3, names_to = "measure", values_to = "value") 134 | 135 | } else { 136 | kbet_per_ct <- rbind(kbet_per_ct, add %>% pivot_longer(cols = 1:3, names_to = "measure", values_to = "value")) 137 | } 138 | } 139 | 140 | 141 | message("write summary") 142 | 143 | summary <- kbet_per_ct %>% pivot_wider(id_cols = cell_type, names_from = measure, values_from = value) 144 | summary$method <- method 145 | summary$input_h5ad <- input_h5ad 146 | write_csv(summary, out_csv) 147 | -------------------------------------------------------------------------------- /bin/rliger_integration_UINMF_multiple_species.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | library(optparse) 7 | library(rliger) 8 | library(scCustomize) # liger to seurat function keep_metadata 9 | library(SeuratDisk) 10 | 11 | 12 | option_list <- list( 13 | make_option(c("--metadata"), 14 | type = "character", default = NULL, 15 | help = "Path to a tab-separated file mapping each species to its rds files" 16 | ), 17 | make_option(c("--basename"), 18 | type = "character", default = NULL, 19 | help = "Basename of file to save" 20 | ), 21 | make_option(c("--out_dir"), 22 | type = "character", default = NULL, 23 | help = "output dir to write rds files" 24 | ), 25 | make_option(c("--cluster_key"), 26 | type = "character", default = NULL, 27 | help = "Cell type cluster key" 28 | ) 29 | ) 30 | opt <- parse_args(OptionParser(option_list = 
option_list)) 31 | 32 | metadata_path <- opt$metadata 33 | basename <- opt$basename 34 | out_dir <- opt$out_dir 35 | cluster_key <- opt$cluster_key 36 | 37 | 38 | metadata = read.table(metadata_path, sep = '\t', header=TRUE) 39 | metadata = as.data.frame(metadata) 40 | 41 | for(type in colnames(metadata)[-1]){ 42 | 43 | message(type) 44 | 45 | obj_list = list() 46 | for(i in seq(1, nrow(metadata))){ 47 | 48 | obj_list[[i]] = readRDS(metadata[i, type]) 49 | } 50 | 51 | liger <- seuratToLiger(obj_list, remove.missing = FALSE, renormalize = FALSE, names = metadata$species) 52 | 53 | meta_all = data.frame() 54 | keep_cols = Reduce(intersect, lapply(obj_list, FUN = function(x) colnames(x@meta.data))) 55 | 56 | for(i in seq(1, nrow(metadata))){ 57 | message(i) 58 | obj_list[[i]]@meta.data$cell_id = rownames(obj_list[[i]]@meta.data) 59 | use = obj_list[[i]]@meta.data[, c("cell_id", keep_cols)] 60 | meta_all = rbind(meta_all, use) 61 | 62 | } 63 | 64 | meta_new = meta_all[match(rownames(liger@cell.data), rownames(meta_all)), ] 65 | meta_new = cbind(meta_new, liger@cell.data) 66 | meta_new = meta_new[match(rownames(liger@cell.data), rownames(meta_new)), ] 67 | 68 | liger@cell.data <- meta_new 69 | 70 | species.liger <- normalize(liger) 71 | species.liger <- rliger::selectGenes(species.liger, var.thres = 0.3, unshared = TRUE, unshared.datasets = list(1, 2), unshared.thresh = 0.3) 72 | species.liger <- scaleNotCenter(species.liger) 73 | species.liger <- optimizeALS(species.liger, lambda = 5, use.unshared = TRUE, thresh = 1e-10, k = 30) 74 | species.liger <- quantile_norm(species.liger, ref_dataset = metadata$species[1]) 75 | species.liger <- louvainCluster(species.liger) 76 | species.liger <- runUMAP(species.liger) 77 | 78 | seurat_obj <- scCustomize::Liger_to_Seurat( 79 | species.liger, 80 | nms = names(species.liger@H), 81 | renormalize = TRUE, 82 | use.liger.genes = TRUE, 83 | by.dataset = FALSE, 84 | keep_meta = TRUE, 85 | reduction_label = "umap", # in line with the 
X_umap in scanpy 86 | seurat_assay = "RNA" 87 | ) 88 | 89 | k <- sapply(seurat_obj@meta.data, is.factor) 90 | seurat_obj@meta.data[k] <- lapply(seurat_obj@meta.data[k], as.character) # known bug in seurat to h5ad for factors 91 | 92 | message("save pdf") 93 | pdf(paste0(out_dir,"/", basename, "_", type, "_rligerUINMF_integrated_UMAP.pdf"), height = 10, width = 12) 94 | print(DimPlot(object = seurat_obj, reduction = 'umap', group.by = c('species', 'cell_ontology_mapped'), ncol = 1)) # explicit print: plots are not auto-printed inside a for loop 95 | dev.off() 96 | 97 | 98 | message("save seurat object") 99 | 100 | saveRDS(seurat_obj, 101 | file = paste0(out_dir,"/", basename, "_", type, "_rligerUINMF_integrated.rds") 102 | ) 103 | 104 | message(paste0("finish ", type)) 105 | 106 | } 107 | 108 | -------------------------------------------------------------------------------- /bin/scANVI_integration.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | 7 | import click 8 | import matplotlib.pyplot as plt 9 | import scanpy as sc 10 | import scvi 11 | 12 | @click.command() 13 | @click.argument("input_h5ad", type=click.Path(exists=True)) 14 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None) 15 | @click.argument("out_umap", type=click.Path(exists=False), default=None) 16 | @click.argument("out_scanvi", type=click.Path(exists=False), default=None) 17 | @click.option('--batch_key', type=str, default=None, help="Batch key for identifying HVGs and for scANVI integration") 18 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species") 19 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species one to use as labels to transfer to species two") 20 | 21 | 22 | def run_scANVI(input_h5ad, out_h5ad, out_umap, out_scanvi, batch_key, species_key, cluster_key): 23 | click.echo('Start scANVI integration') 24 | 25 | 
sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 6)) 26 | 27 | adata = sc.read_h5ad(input_h5ad) 28 | adata.var_names_make_unique() 29 | sc.pp.highly_variable_genes( 30 | adata, 31 | flavor="seurat_v3", 32 | n_top_genes=2000, 33 | ##layer="counts", 34 | batch_key=batch_key, 35 | subset=True 36 | ) 37 | 38 | adata.layers["counts"] = adata.X.copy() 39 | sc.pp.normalize_total(adata, target_sum=1e4) 40 | sc.pp.log1p(adata) 41 | adata.raw = adata 42 | 43 | 44 | click.echo("setup scANVI model") 45 | 46 | 47 | scvi.model.SCANVI.setup_anndata(adata, labels_key=cluster_key, unlabeled_category = "Unknown", batch_key=batch_key, ) 48 | vae = scvi.model.SCANVI(adata) 49 | vae.train() 50 | adata.obsm["X_scANVI"] = vae.get_latent_representation() 51 | adata.obsm["X_mde_scanvi"] = scvi.model.utils.mde(adata.obsm["X_scANVI"]) 52 | 53 | sc.pl.embedding( 54 | adata, basis="X_mde_scanvi", color=[batch_key, species_key, cluster_key], ncols=1, frameon=False 55 | ) 56 | plt.savefig(out_scanvi, dpi=300, bbox_inches='tight') 57 | 58 | num_pcs = min(adata.obsm['X_scANVI'].shape[1], 40) 59 | if num_pcs < 40: 60 | click.echo("using fewer PCs: " + str(num_pcs)) 61 | 62 | sc.pp.neighbors(adata, use_rep="X_scANVI", key_added='scANVI', n_neighbors=15, n_pcs=num_pcs) 63 | sc.tl.umap(adata, neighbors_key='scANVI', min_dist=0.3) ## to match min_dist in seurat 64 | sc.pl.umap(adata, neighbors_key='scANVI', color=[batch_key, species_key, cluster_key], ncols=1) 65 | plt.savefig(out_umap, dpi=300, bbox_inches='tight') 66 | 67 | adata.obsm['X_umapscANVI'] = adata.obsm['X_umap'] 68 | 69 | click.echo("Save output") 70 | adata.write(out_h5ad) 71 | click.echo("Done scANVI") 72 | 73 | 74 | if __name__ == '__main__': 75 | run_scANVI() 76 | 77 | -------------------------------------------------------------------------------- /bin/scIB_metrics.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 
2023 4 | # Yuyao Song 5 | 6 | 7 | import click 8 | import pandas as pd 9 | import scanpy as sc 10 | import scib 11 | 12 | 13 | @click.command() 14 | 15 | @click.argument("input_h5ad", type=click.Path(exists=True)) 16 | @click.argument("unintegrated_h5ad", type=click.Path(exists=False), default=None) 17 | @click.argument("out_csv", type=click.Path(exists=False), default=None) 18 | @click.argument("out_silhouette", type=click.Path(exists=False), default=None) 19 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None) 20 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species") 21 | @click.option('--batch_key', type=str, default=None, help="Batch key on which integration is performed") 22 | @click.option('--integration_method', type=str, default=None, help="Integration method") 23 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species one to use as labels to transfer to species two") 24 | 25 | 26 | def run_scIB_metrics(input_h5ad, unintegrated_h5ad, out_csv, out_silhouette, out_h5ad, species_key, batch_key, cluster_key, integration_method): 27 | # dictionary for method properties 28 | embedding_keys={"harmony": "X_pca_harmony", "scanorama": "X_scanorama", "scVI": "X_scVI", "LIGER": "X_iNMF", "rligerUINMF":"X_inmf", "fastMNN": "X_mnn", "SAMap": "wPCA" } 29 | use_embeddings={"harmony": True, "scanorama": True, "scVI": True, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": True , "seuratCCA": False, "seuratRPCA": False, "unintegrated": False} 30 | from_h5seurat={"harmony": False, "scanorama": False, "scVI": False, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": False , "seuratCCA": True, "seuratRPCA": True, "unintegrated": False} 31 | sc.set_figure_params(dpi_save=200, frameon=False, figsize=(10, 5)) 32 | click.echo("Read anndata") 33 | input_ad = sc.read_h5ad(input_h5ad) 34 | orig_ad = sc.read_h5ad(unintegrated_h5ad) 35 | 
species_all=input_ad.obs[species_key].astype("category").cat.categories.values 36 | # known bug - fix when convert h5Seurat to h5ad the index name error 37 | if from_h5seurat[integration_method] is True: 38 | input_ad.__dict__['_raw'].__dict__['_var'] = input_ad.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'}) 39 | 40 | use_embedding=use_embeddings[integration_method] 41 | if use_embedding is True: 42 | embedding_key=embedding_keys[integration_method] 43 | 44 | 45 | # register color in .uns 46 | sc.pl.umap(input_ad, color = cluster_key) 47 | color = dict(zip(input_ad.obs[cluster_key].cat.categories.to_list(), input_ad.uns[cluster_key+'_colors'])) 48 | # get neighbours on integrated and unintegrated data 49 | # for lisi type_='knn' 50 | ## LIGER embedding only have 20 dims 51 | 52 | if integration_method == 'SAMap': 53 | click.echo("use SAMap KNN graph") 54 | ## do nothing 55 | elif use_embedding is True: 56 | click.echo("calculate KNN graph from embedding " + embedding_key) 57 | sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep=embedding_key) 58 | ## compute knn if use embedding 59 | else: 60 | click.echo("use PCA to compute KNN graph") 61 | sc.tl.pca(input_ad, svd_solver='arpack') 62 | sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep='X_pca') 63 | embedding_key='X_pca' 64 | ## while no embedding, compute PCA and compute knn 65 | 66 | ## get neighbour graph from unintegrated data 67 | sc.pp.normalize_total(orig_ad, target_sum=1e4) 68 | sc.pp.log1p(orig_ad) 69 | sc.pp.highly_variable_genes(orig_ad, min_mean=0.0125, max_mean=3, min_disp=0.5) 70 | sc.pp.scale(orig_ad, max_value=10) 71 | sc.tl.pca(orig_ad, svd_solver='arpack') 72 | sc.pp.neighbors(orig_ad, n_neighbors=20, n_pcs=40) 73 | 74 | click.echo("Start computing various batch metrics using scIB "+input_h5ad) 75 | click.echo("silhouette_batch") 76 | 77 | ## silhouette_batch 78 | silb = scib.metrics.silhouette_batch(input_ad, batch_key = batch_key, group_key = 
cluster_key, embed = embedding_key, metric='euclidean', 79 | return_all=True, scale=True, verbose=True) 80 | click.echo("PC regression") 81 | 82 | ## pcr 83 | 84 | pcr = scib.metrics.pcr_comparison(adata_pre = orig_ad, adata_post = input_ad, covariate = batch_key, 85 | embed = embedding_key, n_comps=50, scale=True, verbose=True) 86 | ## click.echo("HVG overlap") 87 | ## HVG overlap is disabled: it is unclear how methods that only generate an embedding would change the HVGs 88 | ## hvg_overlap 89 | 90 | ## hvg = scib.metrics.hvg_overlap(adata_pre = orig_ad, adata_post = input_ad, batch = batch_key, n_hvg=500, verbose=True) 91 | 92 | ## sil on embedding 93 | click.echo("silhouette on embedding") 94 | 95 | sil = scib.metrics.silhouette(adata = input_ad, group_key = cluster_key, embed = embedding_key, metric='euclidean', scale=True) 96 | 97 | ## nmi and ari cluster vs label, graph_conn, clisi and ilisi 98 | click.echo("NMI, ARI, graph connectivity, cLISI and iLISI") 99 | 100 | res = scib.metrics.metrics(adata = orig_ad, adata_int = input_ad, batch_key=batch_key, label_key=cluster_key, embed=embedding_key, 101 | cluster_key='cluster_scIB', cluster_nmi=out_csv+"_nmi_opt_cluster.csv", type_='knn', 102 | isolated_labels_asw_=False, 103 | silhouette_=False, 104 | hvg_score_=False, 105 | graph_conn_=True, 106 | pcr_=False, 107 | isolated_labels_f1_=True, 108 | nmi_=True, 109 | ari_=True, 110 | kBET_=True, 111 | ilisi_=True, 112 | clisi_=True, 113 | cell_cycle_=False, trajectory_=False) 114 | 115 | ## collect results and write output 116 | output=pd.DataFrame() 117 | output.loc['NMI_cluster/label', 'value'] = res.loc['NMI_cluster/label', 0] 118 | output.loc['ARI_cluster/label', 'value'] = res.loc['ARI_cluster/label', 0] 119 | output.loc['iLISI', 'value'] = res.loc['iLISI', 0] 120 | output.loc['cLISI', 'value'] = res.loc['cLISI', 0] 121 | output.loc['graph_conn', 'value'] = res.loc['graph_conn', 0] 122 | #output.loc['HVG_overlap', 'value'] = hvg 123 | output.loc['pcr', 'value'] = pcr 124 | 
output.loc['silhouette', 'value'] = sil 125 | output.loc['silhouette_batch', 'value'] = silb[0] 126 | output.loc['input_h5ad', 'value'] = input_h5ad 127 | output.loc['unintegrated_h5ad', 'value'] = unintegrated_h5ad 128 | output.loc['species_key', 'value'] = species_key 129 | output.loc['batch_key', 'value'] = batch_key 130 | output.loc['cluster_key', 'value'] = cluster_key 131 | output.loc['integration_method', 'value'] = integration_method 132 | 133 | output.T.to_csv( out_csv) 134 | 135 | silb[1]['input_h5ad'] = input_h5ad 136 | silb[1]['unintegrated_h5ad'] = unintegrated_h5ad 137 | silb[1]['species_key'] = species_key 138 | silb[1]['batch_key'] = batch_key 139 | silb[1]['cluster_key'] = cluster_key 140 | silb[1]['integration_method'] = integration_method 141 | silb[1].to_csv( out_silhouette) 142 | 143 | input_ad.write(out_h5ad, compression = 'gzip') 144 | click.echo("finish batch metrics") 145 | 146 | 147 | if __name__ == '__main__': 148 | run_scIB_metrics() 149 | -------------------------------------------------------------------------------- /bin/scIB_metrics_individual.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | 7 | import os 8 | 9 | import click 10 | import pandas as pd 11 | import scanpy as sc 12 | import random 13 | import scib 14 | import numpy 15 | 16 | # set R for kBET 17 | import os 18 | 19 | ## set seed 20 | random.seed(123) 21 | numpy.random.seed(456) 22 | 23 | @click.command() 24 | @click.argument("input_h5ad", type=click.Path(exists=True)) 25 | @click.argument("unintegrated_h5ad", type=click.Path(exists=False), default=None) 26 | @click.argument("out_integrated_metrics", type=click.Path(exists=False), default=None) 27 | @click.argument("out_integrated_basw", type=click.Path(exists=False), default=None) 28 | @click.argument("out_orig_metrics", type=click.Path(exists=False), default=None) 29 | 
@click.argument("out_orig_basw", type=click.Path(exists=False), default=None) 30 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None) 31 | @click.option( 32 | "--species_key", type=str, default=None, help="Species key to distinguish species" 33 | ) 34 | @click.option( 35 | "--batch_key", 36 | type=str, 37 | default=None, 38 | help="Batch key on which integration is performed", 39 | ) 40 | @click.option("--integration_method", type=str, default=None, help="Integration method") 41 | @click.option( 42 | "--cluster_key", 43 | type=str, 44 | default=None, 45 | help="Cluster key in species one to use as labels to transfer to species two", 46 | ) 47 | @click.option( 48 | "--num_cores", 49 | type=int, 50 | default=1, 51 | help="Number of cores for graph LISI scores", 52 | ) 53 | @click.option( 54 | "--conda_path", 55 | type=str, 56 | default=None, 57 | help="scIB conda path", 58 | ) 59 | 60 | 61 | def run_scIB_metrics( 62 | input_h5ad, 63 | unintegrated_h5ad, 64 | out_integrated_metrics, 65 | out_integrated_basw, 66 | out_orig_metrics, 67 | out_orig_basw, 68 | out_h5ad, 69 | species_key, 70 | batch_key, 71 | cluster_key, 72 | integration_method, 73 | num_cores, 74 | conda_path 75 | ): 76 | 77 | 78 | os.environ['R_HOME'] = f"{conda_path}/lib/R" 79 | os.environ['PATH'] = f"{conda_path}/bin:" + os.environ['PATH'] 80 | os.environ['R_LIBS_USER'] = f"{conda_path}/lib/R/library" 81 | os.environ['R_LIBS'] = f"{conda_path}/lib/R/library" 82 | os.environ['LD_LIBRARY_PATH'] = f"{conda_path}/lib" 83 | 84 | # dictionary for method properties 85 | embedding_keys = { 86 | "harmony": "X_pca_harmony", 87 | "scanorama": "X_scanorama", 88 | "scVI": "X_scVI", 89 | "scANVI": "X_scANVI", 90 | "LIGER": "X_inmf", 91 | "rligerUINMF": "X_inmf", 92 | "fastMNN": "X_mnn", 93 | } 94 | use_embeddings = { 95 | "harmony": True, 96 | "scanorama": True, 97 | "scVI": True, 98 | "scANVI": True, 99 | "LIGER": True, 100 | "rligerUINMF": True, 101 | "fastMNN": True, 102 | "seuratCCA": False, 
103 | "seuratRPCA": False, 104 | "unintegrated": False, 105 | } 106 | from_h5seurat = { 107 | "harmony": False, 108 | "scanorama": False, 109 | "scVI": False, 110 | "scANVI": False, 111 | "LIGER": True, 112 | "rligerUINMF": True, 113 | "fastMNN": True, 114 | "seuratCCA": True, 115 | "seuratRPCA": True, 116 | "unintegrated": False, 117 | } 118 | sc.set_figure_params(dpi_save=200, frameon=False, figsize=(10, 5)) 119 | 120 | click.echo("Read anndata") 121 | input_ad = sc.read_h5ad(input_h5ad) 122 | orig_ad = sc.read_h5ad(unintegrated_h5ad) 123 | species_all = input_ad.obs[species_key].astype("category").cat.categories.values 124 | 125 | ## for files from h5seurat sometimes these are not stored as category 126 | 127 | input_ad.obs[species_key] = input_ad.obs[species_key].astype("category") 128 | input_ad.obs[cluster_key] = input_ad.obs[cluster_key].astype("category") 129 | input_ad.obs[batch_key] = input_ad.obs[batch_key].astype("category") 130 | 131 | # known bug - fix when convert h5Seurat to h5ad the index name error 132 | #if from_h5seurat[integration_method] is True: 133 | # input_ad.__dict__["_raw"].__dict__["_var"] = ( 134 | # input_ad.__dict__["_raw"] 135 | # .__dict__["_var"] 136 | # .rename(columns={"_index": "features"}) 137 | # ) 138 | 139 | use_embedding = use_embeddings[integration_method] 140 | if use_embedding is True: 141 | embedding_key = embedding_keys[integration_method] 142 | 143 | # re-calculate on integrated and unintegrated data 144 | # due to scIB hard-coding, make sure input_ad.obsp['connectivities'], input_ad.uns['neighbours'] are from the embedding 145 | # for lisi type_='knn' 146 | # LIGER embedding only have 20 dims 147 | 148 | 149 | if use_embedding is True: 150 | click.echo("calculate KNN graph from embedding " + embedding_key) 151 | num_pcs = min(input_ad.obsm[embedding_key].shape[1], 20) 152 | if num_pcs < 20: 153 | click.echo("using less PCs: " + str(num_pcs)) 154 | sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=num_pcs, 
use_rep=embedding_key) 155 | # compute knn if use embedding 156 | else: 157 | click.echo("use PCA to compute KNN graph") 158 | sc.tl.pca(input_ad, svd_solver="arpack") 159 | sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep="X_pca") 160 | embedding_key = "X_pca" 161 | 162 | # while no embedding, compute PCA and compute knn 163 | 164 | # get neighbour graph from unintegrated data 165 | sc.pp.normalize_total(orig_ad, target_sum=1e4) 166 | sc.pp.log1p(orig_ad) 167 | sc.pp.highly_variable_genes(orig_ad, min_mean=0.0125, max_mean=3, min_disp=0.5) 168 | sc.pp.scale(orig_ad, max_value=10) 169 | sc.tl.pca(orig_ad, svd_solver="arpack") 170 | sc.pp.neighbors(orig_ad, n_neighbors=20, n_pcs=40) 171 | sc.tl.umap(orig_ad, min_dist=0.3) 172 | sc.pl.umap(orig_ad, color=[batch_key, species_key, cluster_key]) 173 | 174 | click.echo( 175 | "Start computing various batch metrics using scIB, the integrated file is " 176 | + input_h5ad 177 | ) 178 | 179 | output_integrated = pd.DataFrame() 180 | output_integrated_basw = pd.DataFrame() 181 | output_orig = pd.DataFrame() 182 | output_orig_basw = pd.DataFrame() 183 | 184 | click.echo("PC regression") 185 | 186 | output_integrated.loc["PCR", "value"] = scib.metrics.pc_regression( 187 | input_ad.obsm[embedding_key], covariate=input_ad.obs[species_key], n_comps=50 188 | ) 189 | output_orig.loc["PCR", "value"] = scib.metrics.pc_regression( 190 | orig_ad.obsm["X_pca"], 191 | covariate=orig_ad.obs[species_key], 192 | pca_var=orig_ad.uns["pca"]["variance"], 193 | ) 194 | 195 | click.echo("Silhouette batch") 196 | 197 | integrated_basw = scib.metrics.silhouette_batch( 198 | input_ad, 199 | batch_key=batch_key, 200 | group_key=cluster_key, 201 | embed=embedding_key, 202 | metric="euclidean", 203 | return_all=True, 204 | scale=True, 205 | verbose=False, 206 | ) 207 | 208 | output_integrated.loc["bASW", "value"] = integrated_basw[0] 209 | 210 | orig_basw = scib.metrics.silhouette_batch( 211 | orig_ad, 212 | batch_key=batch_key, 213 | 
group_key=cluster_key, 214 | embed="X_pca", 215 | metric="euclidean", 216 | return_all=True, 217 | scale=True, 218 | verbose=False, 219 | ) 220 | 221 | output_orig.loc["bASW", "value"] = orig_basw[0] 222 | 223 | click.echo("Graph connectivity") 224 | 225 | output_integrated.loc["GC", "value"] = scib.metrics.graph_connectivity( 226 | input_ad, label_key=cluster_key 227 | ) 228 | output_orig.loc["GC", "value"] = scib.metrics.graph_connectivity( 229 | orig_ad, label_key=cluster_key 230 | ) 231 | 232 | # click.echo("graph iLISI") 233 | 234 | # output_integrated.loc["iLISI", "value"] = scib.metrics.ilisi_graph( 235 | # input_ad, 236 | # batch_key, 237 | # type_="embed", 238 | # use_rep=embedding_key, 239 | # k0=50, 240 | # subsample=None, 241 | # n_cores=num_cores, 242 | # scale=True, 243 | # verbose=True, 244 | # ) 245 | 246 | # output_orig.loc["iLISI", "value"] = scib.metrics.ilisi_graph( 247 | # orig_ad, 248 | # batch_key, 249 | # type_="full", 250 | # k0=50, 251 | # subsample=None, 252 | # n_cores=num_cores, 253 | # scale=True, 254 | # verbose=True, 255 | # ) 256 | 257 | click.echo("kBET") 258 | 259 | output_integrated.loc["kBET", "value"] = scib.metrics.kBET( 260 | input_ad, 261 | batch_key=batch_key, 262 | label_key=cluster_key, 263 | type_="knn", ## for equal treatment 264 | scaled=True, 265 | return_df=False, 266 | verbose=True, 267 | ) 268 | 269 | output_orig.loc["kBET", "value"] = scib.metrics.kBET( 270 | orig_ad, 271 | batch_key=batch_key, 272 | label_key=cluster_key, 273 | type_="full", 274 | scaled=True, 275 | return_df=False, 276 | verbose=True, 277 | ) 278 | 279 | # click.echo("cLISI") 280 | 281 | # output_integrated.loc["cLISI", "value"] = scib.metrics.clisi_graph( 282 | # input_ad, 283 | # label_key=cluster_key, 284 | # type_="embed", 285 | # use_rep=embedding_key, 286 | # k0=50, 287 | # subsample=None, 288 | # scale=True, 289 | # n_cores=num_cores, 290 | # verbose=True, 291 | # ) 292 | 293 | # output_orig.loc["cLISI", "value"] = 
scib.metrics.clisi_graph( 294 | # orig_ad, 295 | # label_key=cluster_key, 296 | # type_="full", 297 | # k0=50, 298 | # subsample=None, 299 | # scale=True, 300 | # n_cores=num_cores, 301 | # verbose=True, 302 | # ) 303 | 304 | click.echo("NMI") 305 | click.echo("clustering optimization with leiden") 306 | 307 | scib.me.cluster_optimal_resolution( 308 | input_ad, 309 | cluster_key="cluster", 310 | label_key=cluster_key, 311 | cluster_function=sc.tl.leiden, 312 | ) 313 | scib.me.cluster_optimal_resolution( 314 | orig_ad, 315 | cluster_key="cluster", 316 | label_key=cluster_key, 317 | cluster_function=sc.tl.leiden, 318 | ) 319 | 320 | output_integrated.loc["NMI", "value"] = scib.me.nmi( 321 | input_ad, cluster_key="cluster", label_key=cluster_key 322 | ) 323 | output_integrated.loc["ARI", "value"] = scib.me.ari( 324 | input_ad, cluster_key="cluster", label_key=cluster_key 325 | ) 326 | 327 | output_orig.loc["NMI", "value"] = scib.me.nmi( 328 | orig_ad, cluster_key="cluster", label_key=cluster_key 329 | ) 330 | output_orig.loc["ARI", "value"] = scib.me.ari( 331 | orig_ad, cluster_key="cluster", label_key=cluster_key 332 | ) 333 | 334 | click.echo("Silhouette cell type") 335 | 336 | output_integrated.loc["cASW", "value"] = scib.me.silhouette( 337 | input_ad, label_key=cluster_key, embed=embedding_key 338 | ) 339 | 340 | output_orig.loc["cASW", "value"] = scib.me.silhouette( 341 | orig_ad, label_key=cluster_key, embed="X_pca" 342 | ) 343 | 344 | click.echo("Isolated label F1") 345 | 346 | output_integrated.loc["iso_F1", "value"] = scib.me.isolated_labels_f1( 347 | input_ad, embed=None, batch_key=batch_key, label_key=cluster_key 348 | ) 349 | 350 | output_orig.loc["iso_F1", "value"] = scib.me.isolated_labels_f1( 351 | orig_ad, embed=None, batch_key=batch_key, label_key=cluster_key 352 | ) 353 | 354 | click.echo("write results") 355 | click.echo("metrics of integrated data") 356 | 357 | output_integrated.loc["input_h5ad", "value"] = input_h5ad 358 | 
output_integrated.loc["unintegrated_h5ad", "value"] = unintegrated_h5ad 359 | output_integrated.loc["species_key", "value"] = species_key 360 | output_integrated.loc["batch_key", "value"] = batch_key 361 | output_integrated.loc["cluster_key", "value"] = cluster_key 362 | output_integrated.loc["integration_method", "value"] = integration_method 363 | 364 | output_integrated.T.to_csv(out_integrated_metrics) 365 | 366 | click.echo("writing clustering optimized integrated data") 367 | input_ad.write(out_h5ad, compression="gzip") 368 | 369 | click.echo("metric of unintegrated data") 370 | output_orig.loc["input_h5ad", "value"] = unintegrated_h5ad 371 | output_orig.loc["unintegrated_h5ad", "value"] = unintegrated_h5ad 372 | output_orig.loc["species_key", "value"] = species_key 373 | output_orig.loc["batch_key", "value"] = batch_key 374 | output_orig.loc["cluster_key", "value"] = cluster_key 375 | output_orig.loc["integration_method", "value"] = integration_method 376 | 377 | output_orig.T.to_csv(out_orig_metrics) 378 | 379 | click.echo("cell type bASW of integrated data") 380 | 381 | integrated_basw[1]["input_h5ad"] = input_h5ad 382 | integrated_basw[1]["unintegrated_h5ad"] = unintegrated_h5ad 383 | integrated_basw[1]["species_key"] = species_key 384 | integrated_basw[1]["batch_key"] = batch_key 385 | integrated_basw[1]["cluster_key"] = cluster_key 386 | integrated_basw[1]["integration_method"] = integration_method 387 | integrated_basw[1].to_csv(out_integrated_basw) 388 | 389 | orig_basw[1]["input_h5ad"] = unintegrated_h5ad 390 | orig_basw[1]["unintegrated_h5ad"] = unintegrated_h5ad 391 | orig_basw[1]["species_key"] = species_key 392 | orig_basw[1]["batch_key"] = batch_key 393 | orig_basw[1]["cluster_key"] = cluster_key 394 | orig_basw[1]["integration_method"] = integration_method 395 | orig_basw[1].to_csv(out_orig_basw) 396 | 397 | click.echo("finish scIB metrics calculation") 398 | 399 | 400 | if __name__ == "__main__": 401 | run_scIB_metrics() 402 | 
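The environment variables set at the top of `run_scIB_metrics` point the kBET step at the R installation inside the scIB conda environment. A minimal sketch of the same setup follows, assuming a POSIX system; `r_env_from_conda` is a hypothetical helper that is not part of BENGAL, and the conda prefix in the usage lines is a placeholder. It builds each path with `os.path.join`, so a missing or trailing slash in `conda_path` cannot produce a malformed path.

```python
import os


def r_env_from_conda(conda_path, current_path=""):
    """Return the R-related environment variables for a conda prefix.

    Hypothetical helper (not part of BENGAL). It mirrors the variables set
    at the top of run_scIB_metrics, but joins path components explicitly.
    """
    # normpath drops a trailing slash, so "/env" and "/env/" behave the same
    prefix = os.path.normpath(conda_path)
    r_home = os.path.join(prefix, "lib", "R")
    r_library = os.path.join(r_home, "library")
    bin_dir = os.path.join(prefix, "bin")
    # prepend the environment's bin directory to any existing PATH
    path = bin_dir + os.pathsep + current_path if current_path else bin_dir
    return {
        "R_HOME": r_home,
        "PATH": path,
        "R_LIBS_USER": r_library,
        "R_LIBS": r_library,
        "LD_LIBRARY_PATH": os.path.join(prefix, "lib"),
    }


if __name__ == "__main__":
    # "/opt/conda/envs/scib/" is a placeholder prefix for illustration only
    env = r_env_from_conda("/opt/conda/envs/scib/", os.environ.get("PATH", ""))
    for name, value in env.items():
        print(f"{name}={value}")
```

Returning a dictionary rather than writing to `os.environ` directly keeps the helper free of side effects; a caller could apply it with `os.environ.update(...)` before importing anything that starts R.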
-------------------------------------------------------------------------------- /bin/scIB_trajectory.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | 7 | import click 8 | import pandas as pd 9 | import scanpy as sc 10 | import numpy as np 11 | import scib 12 | 13 | 14 | @click.command() 15 | 16 | @click.argument("input_h5ad", type=click.Path(exists=True)) 17 | @click.argument("unintegrated_h5ad", type=click.Path(exists=False), default=None) 18 | @click.argument("out_csv", type=click.Path(exists=False), default=None) 19 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None) 20 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species") 21 | @click.option('--batch_key', type=str, default=None, help="Batch key on which integration is performed") 22 | @click.option('--integration_method', type=str, default=None, help="Integration method") 23 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species one to use as labels to transfer to species two") 24 | @click.option('--root_cell', type=str, default=None, help="Root cell in trajectory, should be one category in adata.obs[cluster_key]") 25 | 26 | 27 | def run_scIB_trajectory(input_h5ad, unintegrated_h5ad, out_csv, out_h5ad, species_key, batch_key, cluster_key, integration_method, root_cell): 28 | # dictionary for method properties 29 | embedding_keys={"scANVI": "X_scANVI", "harmony": "X_pca_harmony", "scanorama": "X_scanorama", "scVI": "X_scVI", "LIGER": "X_iNMF", "rligerUINMF":"X_inmf", "fastMNN": "X_mnn", "SAMap": "wPCA" } 30 | use_embeddings={"scANVI": True, "harmony": True, "scanorama": True, "scVI": True, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": False , "seuratCCA": False, "seuratRPCA": False, "unintegrated": False} 31 | from_h5seurat={"scANVI": False, "harmony": 
False, "scanorama": False, "scVI": False, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": False , "seuratCCA": True, "seuratRPCA": True, "unintegrated": False} 32 | sc.set_figure_params(dpi_save=200, frameon=False, figsize=(10, 5)) 33 | click.echo("Read anndata") 34 | input_ad = sc.read_h5ad(input_h5ad) 35 | orig_ad = sc.read_h5ad(unintegrated_h5ad) 36 | 37 | species_all=input_ad.obs[species_key].astype("category").cat.categories.values 38 | # known bug - fix when convert h5Seurat to h5ad the index name error 39 | # if from_h5seurat[integration_method] is True: 40 | # input_ad.__dict__['_raw'].__dict__['_var'] = input_ad.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'}) 41 | 42 | use_embedding=use_embeddings[integration_method] 43 | if use_embedding is True: 44 | embedding_key=embedding_keys[integration_method] 45 | 46 | # register color in .uns 47 | sc.pl.umap(input_ad, color = cluster_key) 48 | color = dict(zip(input_ad.obs[cluster_key].cat.categories.to_list(), input_ad.uns[cluster_key+'_colors'])) 49 | 50 | # get neighbours on integrated and unintegrated data 51 | # for lisi type_='knn' 52 | ## LIGER embedding only have 20 dims 53 | 54 | if integration_method == 'SAMap': 55 | click.echo("use SAMap KNN graph") 56 | 57 | ## do nothing 58 | elif use_embedding is True: 59 | click.echo("calculate KNN graph from embedding " + embedding_key) 60 | num_pcs = min(input_ad.obsm[embedding_key].shape[1], 20) 61 | if num_pcs < 20: 62 | click.echo("using less PCs: " + str(num_pcs)) 63 | sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=num_pcs, use_rep=embedding_key, key_added=integration_method) 64 | ## compute knn if use embedding 65 | else: 66 | click.echo("use PCA to compute KNN graph") 67 | sc.tl.pca(input_ad, svd_solver='arpack') 68 | sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep='X_pca', key_added=integration_method) 69 | embedding_key='X_pca' 70 | ## while no embedding, compute PCA and compute knn 71 | 72 | ## get 
neighbour graph from unintegrated data 73 | ## use PCA to get initial neighbours for diffusion map 74 | ## use diffusion map to compute neighbours - denoise 75 | if from_h5seurat[integration_method] is True: 76 | orig_ad.obs_names = orig_ad.obs_names.str.replace("-", "_") 77 | ## seurat does not convert - to _ which does not match integrated data 78 | 79 | sc.pp.normalize_total(orig_ad, target_sum=1e4) 80 | sc.pp.log1p(orig_ad) 81 | sc.pp.highly_variable_genes(orig_ad, min_mean=0.0125, max_mean=3, min_disp=0.5) 82 | sc.pp.scale(orig_ad, max_value=10) 83 | sc.tl.pca(orig_ad, svd_solver='arpack') 84 | sc.pp.neighbors(orig_ad, n_neighbors=20, n_pcs=40, use_rep='X_pca') 85 | sc.tl.diffmap(orig_ad) 86 | sc.pp.neighbors(orig_ad, n_neighbors=10, use_rep='X_diffmap') 87 | 88 | ## set root cell for diffusion map pseudo-time 89 | orig_ad.uns['iroot'] = np.flatnonzero(orig_ad.obs[cluster_key] == root_cell)[0] 90 | sc.tl.dpt(orig_ad) 91 | 92 | ## integrated data 93 | ## use integrated embedding to get initial neighbours for diffusion map 94 | ## use diffusion map neighbours 95 | if integration_method == 'SAMap': 96 | click.echo("use SAMap KNN graph for diffmap") 97 | sc.tl.diffmap(input_ad, neighbors_key=None) 98 | 99 | else: 100 | click.echo("calculate diffmap for " + integration_method) 101 | sc.tl.diffmap(input_ad, neighbors_key=integration_method) 102 | ## neighbours with diffmap need not have a key 103 | sc.pp.neighbors(input_ad, n_neighbors=10, use_rep='X_diffmap') 104 | input_ad.uns['iroot'] = np.flatnonzero(input_ad.obs[cluster_key] == root_cell)[0] 105 | sc.tl.dpt(input_ad) 106 | 107 | ## scIB trajectory conservation score 108 | ## per-species conservation: set batch_key='species' 109 | score_batch = scib.metrics.trajectory_conservation(adata_pre=orig_ad, adata_post=input_ad, label_key=cluster_key, pseudotime_key='dpt_pseudotime', batch_key=species_key) 110 | score = scib.metrics.trajectory_conservation(adata_pre=orig_ad, adata_post=input_ad, label_key=cluster_key, 
pseudotime_key='dpt_pseudotime', batch_key=None) 111 | 112 | output=pd.DataFrame() 113 | 114 | output.loc['trajectory_conservation_score_batch', 'value'] = score_batch 115 | output.loc['trajectory_conservation_score_none', 'value'] = score 116 | output.loc['input_h5ad', 'value'] = input_h5ad 117 | output.loc['unintegrated_h5ad', 'value'] = unintegrated_h5ad 118 | output.loc['species_key', 'value'] = species_key 119 | output.loc['batch_key', 'value'] = batch_key 120 | output.loc['cluster_key', 'value'] = cluster_key 121 | output.loc['root_cell', 'value'] = root_cell 122 | output.loc['integration_method', 'value'] = integration_method 123 | output.T.to_csv(out_csv) 124 | input_ad.write(out_h5ad, compression = 'gzip') 125 | click.echo("finish trajectory conservation metrics") 126 | 127 | 128 | if __name__ == '__main__': 129 | run_scIB_trajectory() 130 | -------------------------------------------------------------------------------- /bin/scanorama_integration.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | import click 7 | import matplotlib.pyplot as plt 8 | import scanpy as sc 9 | 10 | 11 | @click.command() 12 | @click.argument("input_h5ad", type=click.Path(exists=True)) 13 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None) 14 | @click.argument("out_umap", type=click.Path(exists=False), default=None) 15 | @click.option('--batch_key', type=str, default=None, help="Batch key used when identifying HVGs") 16 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species; used as the batch for scanorama integration") 17 | @click.option('--cluster_key', type=str, default=None, help="Cluster key used to colour the output UMAP") 18 | 19 | 20 | def run_scanorama(input_h5ad, out_h5ad, out_umap, batch_key, species_key, cluster_key): 21 | click.echo('Start scanorama integration')
22 | sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 6)) 23 | adata = sc.read_h5ad(input_h5ad) 24 | adata.var_names_make_unique() 25 | sc.pp.normalize_total(adata, target_sum=1e4) 26 | sc.pp.log1p(adata) 27 | click.echo("HVG") 28 | sc.pp.highly_variable_genes(adata, batch_key=batch_key) 29 | sc.pp.scale(adata, max_value=10) 30 | sc.tl.pca(adata) 31 | sc.pp.neighbors(adata, use_rep='X_pca', n_neighbors=15, n_pcs=40) 32 | sc.tl.umap(adata, min_dist=0.3) ## to match min_dist in seurat 33 | adata.obsm['X_umapraw'] = adata.obsm['X_umap'] 34 | click.echo("Scanorama") 35 | sc.external.pp.scanorama_integrate(adata, key=species_key, basis='X_pca', adjusted_basis='X_scanorama') 36 | sc.pp.neighbors(adata, use_rep='X_scanorama', key_added = 'scanorama', n_neighbors=15, n_pcs=40) 37 | sc.tl.umap(adata, neighbors_key = 'scanorama', min_dist=0.3) ## to match min_dist in seurat 38 | sc.pl.umap(adata, color=[batch_key, species_key, cluster_key], ncols=1, show=False) ## keep the figure open so that it can be saved 39 | plt.savefig(out_umap, dpi=300, bbox_inches='tight') 40 | adata.obsm['X_umapscanorama'] = adata.obsm['X_umap'] 41 | click.echo("Save output") 42 | adata.write(out_h5ad) 43 | click.echo("Done scanorama") 44 | 45 | 46 | if __name__ == '__main__': 47 | run_scanorama() 48 | 49 | -------------------------------------------------------------------------------- /bin/sccaf_assessment_metadata.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | 7 | import click 8 | import matplotlib.pyplot as plt 9 | from sklearn import metrics 10 | import pandas as pd 11 | import scanpy as sc 12 | from SCCAF import * 13 | 14 | @click.command() 15 | @click.argument("input_metadata", type=click.Path(exists=True)) 16 | @click.argument("out_auc", type=click.Path(exists=False), default=None) 17 | @click.argument("out_acc_csv", type=click.Path(exists=False), default=None) 18 |
@click.option('--integration_method', type=str, default=None, help="Integration method") 19 | @click.option('--use_embedding', type=bool, default=False, help="Whether to use an embedding for SCCAF assessment; the default (False) uses the count matrix") 20 | @click.option('--embedding_key', type=str, default=None, help="If an embedding is used, the key in input_h5ad.obsm on which to run the SCCAF assessment") 21 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in input_h5ad.obs to use as the label for the SCCAF assessment") 22 | 23 | def run_sccaf_assessment(input_metadata, out_auc, use_embedding, embedding_key, cluster_key, out_acc_csv, integration_method): 24 | 25 | def get_cell_type_auc(clf, y_test, y_prob): 26 | rc_aucs = [] # AUC of the ROC curve 27 | rp_aucs = [] # AUC of the precision-recall curve 28 | fprs = [] #FPR 29 | tprs = [] #TPR 30 | prss = [] #Precision 31 | recs = [] #Recall 32 | for i, cell_type in enumerate(clf.classes_): 33 | fpr, tpr, _ = metrics.roc_curve(y_test == cell_type, y_prob[:, i]) 34 | prs, rec, _ = metrics.precision_recall_curve(y_test == cell_type, y_prob[:, i]) 35 | fprs.append(fpr) 36 | tprs.append(tpr) 37 | prss.append(prs) 38 | recs.append(rec) 39 | rc_aucs.append(metrics.auc(fpr, tpr)) 40 | rp_aucs.append(metrics.auc(rec, prs)) 41 | tbl = pd.DataFrame(data=list(zip(clf.classes_, rc_aucs, rp_aucs)), columns=['cell_type', "ROC_AUC", "PR_AUC"]) ## value order must match the column order 42 | return tbl 43 | 44 | meta = pd.read_csv(input_metadata, delimiter='\t', header=None) 45 | for i in range(0, meta.shape[0]): 46 | species=meta.loc[i, 0] 47 | input_h5ad=meta.loc[i, 1] 48 | print(species) 49 | input_ad = sc.read_h5ad(input_h5ad) 50 | 51 | sc.pp.normalize_total(input_ad, target_sum=1e4) 52 | sc.pp.log1p(input_ad) 53 | sc.pp.highly_variable_genes(input_ad, min_mean=0.0125, max_mean=3, min_disp=0.5) 54 | input_ad.raw = input_ad 55 | sc.pp.scale(input_ad, max_value=10) 56 | sc.tl.pca(input_ad, svd_solver='arpack') 57 | sc.pp.neighbors(input_ad, n_neighbors=10, n_pcs=40) 58 |
sc.tl.umap(input_ad, min_dist=0.3) 59 | sc.pl.umap(input_ad, color = [cluster_key]) 60 | sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 5)) 61 | ##input_ad = input_ad[~(input_ad.obs.cell_type=='T_cell'), ] 62 | 63 | if use_embedding is True and embedding_key not in input_ad.obsm.keys(): 64 | raise ValueError("`input_ad.obsm['%s']` doesn't exist. Please set the embedding key when using an embedding."%(embedding_key)) 65 | if use_embedding is True and embedding_key is not None: 66 | input_matrix = input_ad.obsm[embedding_key] 67 | else: 68 | input_matrix = input_ad.X 69 | 70 | colors = input_ad.uns[cluster_key+"_colors"] 71 | y_prob, y_pred, y_test, clf, cvsm, acc = SCCAF_assessment(input_matrix, input_ad.obs[cluster_key], n=200) 72 | aucs = plot_roc(y_prob, y_test, clf, cvsm=cvsm, acc=acc, colors=colors) 73 | 74 | plt.savefig(out_auc+"_"+species+".png", dpi=200, bbox_inches='tight') 75 | 76 | tbl1=get_cell_type_auc(clf, y_test, y_prob) 77 | 78 | tbl1['test_acc'] = acc 79 | tbl1['CV_acc'] = cvsm 80 | tbl1['type_label'] = 'original' 81 | tbl1['from_species'] = species 82 | tbl1['to_species'] = species 83 | tbl1['integration_method'] = integration_method 84 | tbl1['input_file'] = input_h5ad 85 | tbl1['key_use'] = cluster_key 86 | tbl1['adj_rand_score'] = 'NaN' 87 | tbl1['pct_cell_type_kept'] = 'NaN' 88 | 89 | tbl1.to_csv(out_acc_csv+"_"+species+".csv", index=False, header=True) 90 | 91 | if __name__ == '__main__': 92 | run_sccaf_assessment() 93 | -------------------------------------------------------------------------------- /bin/sccaf_kNN_distance.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | import pandas as pd 7 | import numpy as np 8 | import click 9 | import scanpy as sc 10 | from scipy.sparse import issparse 11 | 12 | # scikit-learn scalers, model selection utilities and classifiers 13 | 14 | from sklearn.preprocessing import
MinMaxScaler, MaxAbsScaler 15 | from sklearn.model_selection import cross_val_score, train_test_split 16 | from sklearn.linear_model import LogisticRegression, SGDClassifier 17 | from sklearn.naive_bayes import GaussianNB 18 | from sklearn.gaussian_process import GaussianProcessClassifier 19 | from sklearn.tree import DecisionTreeClassifier 20 | from sklearn.ensemble import RandomForestClassifier 21 | from sklearn.svm import SVC 22 | 23 | from sklearn.neighbors import KNeighborsClassifier 24 | 25 | np.random.seed(seed=123) 26 | 27 | @click.command() 28 | @click.argument("input_h5ad", type=click.Path(exists=True)) 29 | @click.argument("out_acc_csv", type=click.Path(exists=False), default=None) 30 | @click.option('--integration_method', type=str, default=None, help="Integration method") 31 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in input_h5ad.obs to use as the label to calculate SCCAF assessment") 32 | @click.option('--n_neighbors', type=int, default=15, help="Number of neighbours for kNN calculation and assessment") 33 | 34 | 35 | 36 | 37 | def anndata_knn_acc(input_h5ad, cluster_key, out_acc_csv, integration_method, n_neighbors=15): 38 | 39 | click.echo("Read input h5ad") 40 | input_ad = sc.read_h5ad(input_h5ad) 41 | 42 | # dictionary for method properties 43 | embedding_keys = { 44 | "harmony": "X_pca_harmony", 45 | "scanorama": "X_scanorama", 46 | "scVI": "X_scVI", 47 | "scANVI": "X_scANVI", 48 | "LIGER": "X_iNMF", 49 | "rligerUINMF": "X_inmf", 50 | "fastMNN": "X_mnn", 51 | } 52 | use_embeddings = { 53 | "harmony": True, 54 | "scanorama": True, 55 | "scVI": True, 56 | "scANVI": True, 57 | "LIGER": True, 58 | "rligerUINMF": True, 59 | "fastMNN": True, 60 | "SAMap": False, 61 | "seuratCCA": False, 62 | "seuratRPCA": False, 63 | "unintegrated": False, 64 | "per_species": False, 65 | } 66 | 67 | 68 | use_embedding = use_embeddings[integration_method] 69 | 70 | if use_embedding is True: 71 | embedding_key = 
embedding_keys[integration_method] 72 | 73 | ## prepare the connectivity matrix 74 | 75 | if integration_method == "SAMap": 76 | click.echo("use SAMap KNN graph") 77 | # do nothing 78 | elif use_embedding is True: 79 | click.echo(f"Calculate KNN graph from embedding {embedding_key}") 80 | num_pcs = min(input_ad.obsm[embedding_key].shape[1], 40) 81 | if num_pcs < 40: 82 | click.echo(f"using {str(num_pcs)} PCs") 83 | sc.pp.neighbors(input_ad, n_neighbors=n_neighbors, n_pcs=num_pcs, use_rep=embedding_key, key_added=integration_method) 84 | 85 | # compute knn if use embedding 86 | elif integration_method == "per_species": 87 | click.echo('calculate NN graph from raw count for per species data') 88 | sc.pp.normalize_total(input_ad, target_sum=1e4) 89 | sc.pp.log1p(input_ad) 90 | sc.pp.highly_variable_genes(input_ad, min_mean=0.0125, max_mean=3, min_disp=0.5) 91 | input_ad.raw = input_ad 92 | sc.pp.scale(input_ad, max_value=10) 93 | sc.tl.pca(input_ad, svd_solver='arpack') 94 | sc.pp.neighbors(input_ad, n_neighbors=n_neighbors, n_pcs=40, use_rep="X_pca") 95 | embedding_key = "X_pca" 96 | elif integration_method == 'unintegrated': 97 | click.echo("use PCA to compute KNN graph for unintegrated data") 98 | 99 | sc.pp.normalize_total(input_ad, target_sum=1e4) 100 | sc.pp.log1p(input_ad) 101 | sc.pp.highly_variable_genes(input_ad, min_mean=0.0125, max_mean=3, min_disp=0.5) 102 | sc.pp.scale(input_ad, max_value=10) 103 | sc.tl.pca(input_ad, svd_solver="arpack") 104 | sc.pp.neighbors(input_ad, n_neighbors=n_neighbors, n_pcs=40, use_rep="X_pca") 105 | embedding_key = "X_pca" 106 | 107 | else: 108 | click.echo("use PCA to compute KNN graph for corrected counts output") 109 | sc.pp.scale(input_ad, max_value=10) 110 | sc.tl.pca(input_ad, svd_solver="arpack") 111 | sc.pp.neighbors(input_ad, n_neighbors=n_neighbors, n_pcs=40, use_rep="X_pca") 112 | embedding_key = "X_pca" 113 | 114 | # take the distance matrix as input for kNN classifier 115 | # in theory the "distance" slot is 
correct, while if we use "connectivities" the output is exactly the same 116 | # this is because a non-weighted majority vote is performed, essentially, the most frequent label of the neighbours 117 | # so the weight or distance does not matter numerically. I could also use weight='distance' 118 | if use_embedding and f"{integration_method}_distances" in input_ad.obsp.keys(): 119 | X_use = input_ad.obsp[f"{integration_method}_distances"] 120 | distances_use = f"{integration_method}_distances" 121 | 122 | elif not use_embedding and "distances" in input_ad.obsp.keys(): 123 | X_use = input_ad.obsp["distances"] 124 | distances_use = "distances" 125 | elif integration_method == 'SAMap': 126 | X_use = input_ad.obsp['connectivities'] 127 | distances_use = "connectivities_SAMap" 128 | # in SAMap, connectivities is just 1 if neighbor, 0 if not 129 | else: 130 | raise ValueError("Error in neighbours calculation") 131 | 132 | # minmaxscaler requires dense input 133 | 134 | ## use half data for training and half for testing 135 | 136 | click.echo(f"testing on k={n_neighbors}") 137 | 138 | y_prob, y_pred, y_test, clf, cvsm, acc = self_projection(X_use, input_ad.obs[cluster_key], n=0, fraction=0.5, classifier="KNN", K_knn=n_neighbors) 139 | 140 | pd.DataFrame(data={"test_accuracy":acc, "CV_accuracy":cvsm, "integration_method": integration_method, "input_file": input_h5ad, "connectivities": distances_use}, index=[0]).to_csv(out_acc_csv, index=False) 141 | 142 | #return X_use, y_prob, y_pred, y_test, clf, cvsm, acc 143 | 144 | 145 | def train_test_split_per_type_square(X, y, frac=0.5): 146 | """ 147 | This function is identical to train_test_split, but can split the data either based on number of cells or by fraction. 
148 | 149 | Input 150 | ----- 151 | X: `numpy.array` or sparse matrix 152 | the feature matrix 153 | y: `list of string/int` 154 | the class assignments 155 | frac: `float` optional (default: 0.5) 156 | Fraction of cells sampled from each class for the training set. 157 | 0.5 means half of the cells of every class are used for training; 158 | the remaining cells form the test set. 159 | X must be square (cells by cells), e.g. a precomputed distance matrix. 160 | 161 | return 162 | ----- 163 | X_train, X_test, y_train, y_test 164 | """ 165 | df = pd.DataFrame(y) 166 | df.index = np.arange(len(y)) 167 | df.columns = ['class'] 168 | c_idx = df.groupby('class').apply(lambda x: x.sample(frac=frac)).index.get_level_values(None) 169 | d_idx = ~np.isin(np.arange(len(y)), c_idx) 170 | # the test matrix is n_test by n_indexed(trained); labels are selected by position, not by index label 171 | return X[c_idx, :][:, c_idx], X[d_idx, :][:, c_idx], np.asarray(y)[c_idx], np.asarray(y)[d_idx] 172 | 173 | 174 | def inverse_weight(arr): 175 | result = np.array(arr, dtype=float) ## work on a copy so that the input distances are not modified 176 | for i in range(len(arr)): 177 | for j in range(len(arr[i])): 178 | result[i][j] = 1 / arr[i][j] 179 | return result 180 | 181 | ## make distance huge when not neighbours 182 | def replace_zeros(arr): 183 | for i in range(len(arr)): 184 | for j in range(len(arr[i])): 185 | if arr[i][j] == 0: 186 | arr[i][j] = 1e5 187 | return arr 188 | 189 | 190 | def self_projection(X, 191 | cell_types, 192 | classifier="LR", 193 | K_knn=15, 194 | metric_knn='precomputed', 195 | penalty='l1', 196 | sparsity=0.5, 197 | fraction=0.5, 198 | random_state=1, 199 | solver='liblinear', 200 | n=0, 201 | cv=5, 202 | whole=False, 203 | n_jobs=None): 204 | # n = 100 should be good. 205 | """ 206 | This is the core function for running self-projection. 207 | 208 | Input 209 | ----- 210 | X: `numpy.array` or sparse matrix 211 | the expression matrix, e.g. ad.raw.X.
212 | cell_types: `list of String/int` 213 | the cell clustering assignment 214 | classifier: `String` optional (default: 'LR') 215 | the machine learning model: "LR" (logistic regression) or "KNN" (k-nearest neighbour).\ 216 | The other SCCAF options, "RF" (Random Forest), "GNB" (Gaussian Naive Bayes), "SVM" (Support Vector Machine) and "DT" (Decision Tree), are not implemented in this script. 217 | K_knn: `int` optional (default: 15) 218 | the "k" in a KNN classifier if used 219 | metric_knn: `String` optional (default: 'precomputed') 220 | the distance metric for the KNN classifier; the default uses a precomputed distance matrix in sklearn.neighbors.KNeighborsClassifier 221 | penalty: `String` optional (default: 'l1') 222 | the regularization penalty of the logistic regression. Use 'l1' or 'l2'. 223 | sparsity: `float` optional (default: 0.5) 224 | The sparsity parameter (C in sklearn.linear_model.LogisticRegression) for the logistic regression model. 225 | fraction: `float` optional (default: 0.5) 226 | Fraction of data included in the training set. 0.5 means use half of the data for training, 227 | if half of the data is fewer than the maximum number of cells (n). 228 | random_state: `int` optional (default: 1) 229 | random_state parameter for logistic regression. 230 | n: `int` optional (default: 0) 231 | Maximum number of cells included in the training set for each cluster of cells. 232 | Only fraction is used to split the dataset if n is 0. 233 | cv: `int` optional (default: 5) 234 | number of folds for cross-validation on the training set. 235 | 0 means no cross-validation. 236 | whole: `bool` optional (default: False) 237 | whether to measure the performance on the whole dataset (training and test sets together). 238 | n_jobs: `int` optional, number of parallel jobs to use with the different classifiers (default: None, which scikit-learn treats as a single job unless a joblib backend context is active).
239 | 240 | return 241 | ----- 242 | y_prob, y_pred, y_test, clf, cvsm, accuracy_test 243 | y_prob: `matrix of float` 244 | prediction probability 245 | y_pred: `list of string/int` 246 | predicted clustering of the test set 247 | y_test: `list of string/int` 248 | real clustering of the test set 249 | clf: the classifier model; cvsm: mean cross-validation accuracy; accuracy_test: accuracy on the hold-out set. 250 | """ 251 | # split the data into training and testing 252 | 253 | if classifier == 'KNN': 254 | scaler = MaxAbsScaler() 255 | 256 | # for kNN classifier, we need to normalize the data to unify the distance measurement between different algorithms 257 | X = scaler.fit_transform(X) 258 | assert X.shape[0] == X.shape[1], "Matrix is not square, is a connectivity matrix used as input for kNN classifier?" 259 | 260 | if issparse(X): 261 | X = X.todense() 262 | click.echo("to dense after maxabs scaling") 263 | 264 | X_train, X_test, y_train, y_test = train_test_split_per_type_square(X, cell_types, frac=fraction) 265 | X_train = replace_zeros(np.array(X_train)) 266 | X_test = replace_zeros(np.array(X_test)) 267 | 268 | # fraction is the share of cells per class used for training 269 | # set the classifier 270 | 271 | if classifier == 'LR': 272 | clf = LogisticRegression(random_state=random_state, penalty=penalty, C=sparsity, multi_class="ovr", solver=solver) 273 | elif classifier == 'KNN': 274 | clf = KNeighborsClassifier(n_neighbors=K_knn, metric=metric_knn, n_jobs=n_jobs, weights=inverse_weight) 275 | 276 | 277 | # mean cross validation score 278 | cvsm = 0 279 | if cv > 0: 280 | cvs = cross_val_score(clf, X_train, np.array(y_train), cv=cv, scoring='accuracy', n_jobs=n_jobs) 281 | cvsm = cvs.mean() 282 | print("Mean CV accuracy: %.4f" % cvsm) 283 | 284 | # accuracy on the training set and on the hold-out set 285 | clf.fit(X_train, y_train) 286 | accuracy = clf.score(X_train, y_train) 287 | print("Accuracy on the training set: %.4f" % accuracy) 288 | accuracy_test = clf.score(X_test, y_test) 289 | print("Accuracy on the hold-out set: %.4f" % accuracy_test) 290 | 291 | # accuracy of the whole dataset 292 | if
whole: 293 | accuracy = clf.score(X, cell_types) 294 | print("Accuracy on the whole set: %.4f" % accuracy) 295 | 296 | # get predicted probability on the test set 297 | y_prob = None 298 | #if not classifier in ['SH', 'PCP']: 299 | y_prob = clf.predict_proba(X_test) 300 | y_pred = clf.predict(X_test) 301 | 302 | return y_prob, y_pred, y_test, clf, cvsm, accuracy_test 303 | 304 | if __name__ == '__main__': 305 | anndata_knn_acc() 306 | -------------------------------------------------------------------------------- /bin/sccaf_projection_multiple_species.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | 7 | import click 8 | from typing import List 9 | 10 | import matplotlib.pyplot as plt 11 | from matplotlib.backends.backend_pdf import PdfPages 12 | import pandas as pd 13 | import scanpy as sc 14 | import anndata as ad 15 | from anndata import AnnData 16 | import itertools 17 | from SCCAF import * 18 | from sklearn.metrics.cluster import adjusted_rand_score 19 | from sklearn import metrics 20 | 21 | 22 | @click.command() 23 | 24 | @click.argument("input_h5ad", type=click.Path(exists=True)) 25 | @click.argument("out_projection_h5ads", type=click.Path(exists=False), default=None) 26 | @click.argument("out_figures", type=click.Path(exists=False), default=None) 27 | @click.argument("out_acc_csv", type=click.Path(exists=False), default=None) 28 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species") 29 | @click.option('--integration_method', type=str, default=None, help="Integration method") 30 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species to use to assess label preservation") 31 | @click.option('--projection_key', type=str, default=None, help="Projection key in species one to use as labels to transfer to species two") 32 | 33 | 34 | def
run_sccaf_projection(input_h5ad, species_key, cluster_key, projection_key, integration_method, out_projection_h5ads, out_figures, out_acc_csv): 35 | # dictionary for method properties 36 | embedding_keys={"harmony": "X_pca_harmony", "scanorama": "X_scanorama", "scVI": "X_scVI", "LIGER": "X_inmf", "rligerUINMF":"X_inmf", "fastMNN": "X_mnn", "SAMap": "wPCA", "scANVI": "X_scANVI"} 37 | use_embeddings={"harmony": True, "scanorama": True, "scVI": True, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": True , "seuratCCA": False, "seuratRPCA": False, "scANVI": True, "unintegrated": False} 38 | from_h5seurat={"harmony": False, "scanorama": False, "scVI": False, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": False , "seuratCCA": True, "seuratRPCA": True, "scANVI": False, "unintegrated": False} 39 | sc.set_figure_params(dpi_save=200, frameon=False, figsize=(11, 6)) 40 | input_ad = sc.read_h5ad(input_h5ad) 41 | species_all=input_ad.obs[species_key].astype("category").cat.categories.values 42 | acc_summary=pd.DataFrame() 43 | 44 | if integration_method == 'unintegrated': 45 | 46 | ## raw concatenated data needs preprocessing 47 | sc.pp.normalize_total(input_ad, target_sum=1e4) 48 | sc.pp.log1p(input_ad) 49 | sc.pp.highly_variable_genes(input_ad, min_mean=0.0125, max_mean=3, min_disp=0.5) 50 | input_ad.raw = input_ad 51 | sc.pp.scale(input_ad, max_value=10) 52 | sc.tl.pca(input_ad, svd_solver='arpack') 53 | sc.pp.neighbors(input_ad, n_neighbors=10, n_pcs=40) 54 | sc.tl.umap(input_ad, min_dist=0.3) 55 | 56 | # register color in .uns 57 | sc.pl.umap(input_ad, color = cluster_key) 58 | sc.pl.umap(input_ad, color = projection_key) 59 | 60 | click.echo("Start SCCAF projection workflow with input: "+input_h5ad) 61 | use_embedding=use_embeddings[integration_method] 62 | if use_embedding is True: 63 | embedding_key=embedding_keys[integration_method] 64 | 65 | 66 | # known bug: the index name is wrong after converting h5Seurat to h5ad; the disabled lines below rename it 67 | #if
from_h5seurat[integration_method] is True: 68 | # input_ad.__dict__['_raw'].__dict__['_var'] = input_ad.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'}) 69 | # click.echo("From h5seurat") 70 | 71 | def get_cell_type_auc(clf, y_test, y_prob): 72 | rc_aucs = [] # AUC of the ROC curve 73 | rp_aucs = [] # AUC of the precision-recall curve 74 | fprs = [] #FPR 75 | tprs = [] #TPR 76 | prss = [] #Precision 77 | recs = [] #Recall 78 | for i, cell_type in enumerate(clf.classes_): 79 | fpr, tpr, _ = metrics.roc_curve(y_test == cell_type, y_prob[:, i]) 80 | prs, rec, _ = metrics.precision_recall_curve(y_test == cell_type, y_prob[:, i]) 81 | fprs.append(fpr) 82 | tprs.append(tpr) 83 | prss.append(prs) 84 | recs.append(rec) 85 | rc_aucs.append(metrics.auc(fpr, tpr)) 86 | rp_aucs.append(metrics.auc(rec, prs)) 87 | tbl = pd.DataFrame(data=list(zip(clf.classes_, rc_aucs, rp_aucs)), columns=['cell_type', "ROC_AUC", "PR_AUC"]) ## value order must match the column order 88 | return tbl 89 | 90 | ##def unique_combinations(elements: list[str]) -> list[tuple[str, str]]: 91 | def unique_combinations(elements) -> list: 92 | ## Precondition: `elements` does not contain duplicates. 93 | ## Postcondition: Returns unique combinations of length 2 from `elements`. 94 | 95 | return list(itertools.combinations(elements, 2)) 96 | 97 | species_pairs=unique_combinations(species_all) 98 | 99 | with PdfPages(out_figures) as pdf: 100 | for species_pair in species_pairs: 101 | species_one=species_pair[0] 102 | species_two=species_pair[1] 103 | 104 | adx1 = input_ad[input_ad.obs[species_key]==species_one, :] 105 | adx2 = input_ad[input_ad.obs[species_key]==species_two, :] 106 | 107 | if use_embedding is True and embedding_key not in input_ad.obsm.keys(): 108 | raise ValueError("`adata.obsm['%s']` doesn't exist.
Please assign the embedding key if use embedding."%(embedding_key)) 109 | if use_embedding is True and embedding_key is not None: 110 | input_matrix_1 = adx1.obsm[embedding_key] 111 | input_matrix_2 = adx2.obsm[embedding_key] 112 | else: 113 | input_matrix_1 = adx1.X 114 | input_matrix_2 = adx2.X 115 | click.echo("running projection between "+species_one+" and "+species_two) 116 | 117 | colors1 = adx1.uns[cluster_key+"_colors"] 118 | 119 | y_prob1, y_pred1, y_test1, clf1, cvsm1, acc1 = SCCAF_assessment(input_matrix_1, adx1.obs[cluster_key], n=200) 120 | aucs1 = plot_roc(y_prob1, y_test1, clf1, cvsm=cvsm1, acc=acc1, colors=colors1, title=species_one+"_"+integration_method+"_original_label") 121 | pdf.attach_note("AUC of original label in "+species_one+" by "+integration_method) 122 | pdf.savefig(dpi=200, bbox_inches='tight') 123 | plt.close() 124 | 125 | tbl1=get_cell_type_auc(clf1, y_test1, y_prob1) 126 | 127 | tbl1[['test_acc']] = acc1 128 | tbl1[['CV_acc']] = cvsm1 129 | tbl1[['type_label']] = 'original' 130 | tbl1['from_species'] = species_one 131 | tbl1['to_species'] = species_one 132 | tbl1['integration_method'] = integration_method 133 | tbl1['input_file'] = input_h5ad 134 | tbl1['key_use'] = cluster_key 135 | tbl1['adj_rand_score'] = 'NaN' 136 | tbl1['pct_cell_type_kept'] = 'NaN' 137 | 138 | 139 | colors2 = adx2.uns[cluster_key+"_colors"] 140 | 141 | y_prob2, y_pred2, y_test2, clf2, cvsm2, acc2 = SCCAF_assessment(input_matrix_2, adx2.obs[cluster_key], n=200) 142 | aucs2 = plot_roc(y_prob2, y_test2, clf2, cvsm=cvsm2, acc=acc2, colors=colors2, title=species_two+"_"+integration_method+"_original_label") 143 | pdf.attach_note("AUC of original label in "+species_two+" by "+integration_method) 144 | pdf.savefig(dpi=200, bbox_inches='tight') 145 | plt.close() 146 | 147 | tbl2=get_cell_type_auc(clf2, y_test2, y_prob2) 148 | 149 | tbl2['test_acc'] = acc2 150 | tbl2[['CV_acc']] = cvsm2 151 | tbl2[['type_label']] = 'original' 152 | tbl2['from_species'] = species_two 
153 | tbl2['to_species'] = species_two 154 | tbl2['integration_method'] = integration_method 155 | tbl2['input_file'] = input_h5ad 156 | tbl2['key_use'] = cluster_key 157 | tbl2['adj_rand_score'] = 'NaN' 158 | tbl2['pct_cell_type_kept'] = 'NaN' 159 | 160 | 161 | # run projection only on shared cell types 162 | adx1 = input_ad[input_ad.obs[species_key]==species_one, :] 163 | adx2 = input_ad[input_ad.obs[species_key]==species_two, :] 164 | 165 | a = adx1.obs[projection_key].cat.categories.tolist() 166 | b = adx2.obs[projection_key].cat.categories.tolist() 167 | 168 | shared_ct = list(set(a) & set(b)) 169 | ##shared_ct.remove('T_cell') 170 | adx1 = adx1[adx1.obs[projection_key].isin(shared_ct), :] 171 | adx2 = adx2[adx2.obs[projection_key].isin(shared_ct), :] 172 | 173 | color = dict(zip(input_ad.obs[projection_key].cat.categories.to_list(), input_ad.uns[projection_key+'_colors'])) 174 | 175 | color_palette = {key: value for key, value in color.items() if key in shared_ct} 176 | 177 | if use_embedding is True and not embedding_key in input_ad.obsm.keys(): 178 | raise ValueError("`adata.obsm['%s']` doesn't exist. 
Please assign the embedding key if use embedding."%(embedding_key)) 179 | if use_embedding is True and embedding_key is not None: 180 | input_matrix_1 = adx1.obsm[embedding_key] 181 | input_matrix_2 = adx2.obsm[embedding_key] 182 | else: 183 | input_matrix_1 = adx1.X 184 | input_matrix_2 = adx2.X 185 | 186 | 187 | ## species one label inferred from species two 188 | y_prob2, y_pred2, y_test2, clf2, cvsm2, acc2 = SCCAF_assessment(input_matrix_2, adx2.obs[projection_key], n=200) 189 | adx1.obs['logit_inferred'] = clf2.predict(input_matrix_1) 190 | pct_1 = len(set(adx1.obs['logit_inferred'].astype("category").cat.categories.tolist()) & set(adx1.obs[projection_key].astype("category").cat.categories.tolist())) / len(adx1.obs[projection_key].astype("category").cat.categories.tolist()) 191 | 192 | umap1 = sc.pl.umap(adx1, color=[projection_key,"logit_inferred"], palette=color_palette, title = [species_one+' original label', species_one+' inferred label by '+species_two], ncols=2, frameon=False, show=False) 193 | pdf.attach_note("UMAP of species " + species_one + " original and transferred label") 194 | pdf.savefig(dpi=200, bbox_inches='tight') 195 | plt.close() 196 | 197 | ars_1 = adjusted_rand_score(adx1.obs['logit_inferred'], adx1.obs[projection_key]) 198 | 199 | # sometimes the projection doesn't work at all and there will be only one cell type left after the projection 200 | if len(set(adx1.obs['logit_inferred'].astype("category").cat.categories.tolist())) <= 2: 201 | click.echo("projection doesn't work, return NA") 202 | tbl3=pd.DataFrame(data=list(zip(['logit_not_working'], ["NaN"], ["NaN"])), columns=['cell_type', "ROC_AUC", "PR_AUC"]) 203 | 204 | tbl3['test_acc'] = 'NaN' 205 | tbl3[['CV_acc']] = 'NaN' 206 | tbl3[['type_label']] = 'logit_inferred' 207 | tbl3['from_species'] = species_two 208 | tbl3['to_species'] = species_one 209 | tbl3['integration_method'] = integration_method 210 | tbl3['input_file'] = input_h5ad 211 | tbl3['key_use'] = projection_key 212 | 
tbl3['adj_rand_score'] = ars_1 213 | tbl3['pct_cell_type_kept'] = pct_1 214 | 215 | else: 216 | click.echo("projection worked") 217 | colors3 = adx1.uns['logit_inferred_colors'] 218 | y_prob3, y_pred3, y_test3, clf3, cvsm3, acc3 = SCCAF_assessment(input_matrix_1, adx1.obs['logit_inferred'],n=200) 219 | aucs3 = plot_roc(y_prob3, y_test3, clf3, cvsm=cvsm3, acc=acc3, colors=colors3) 220 | pdf.attach_note("AUC of logit inferred label in "+species_one) 221 | pdf.savefig(dpi=200, bbox_inches='tight') 222 | plt.close() 223 | 224 | tbl3=get_cell_type_auc(clf3, y_test3, y_prob3) 225 | 226 | tbl3['test_acc'] = acc3 227 | tbl3[['CV_acc']] = cvsm3 228 | tbl3[['type_label']] = 'logit_inferred' 229 | tbl3['from_species'] = species_two 230 | tbl3['to_species'] = species_one 231 | tbl3['integration_method'] = integration_method 232 | tbl3['input_file'] = input_h5ad 233 | tbl3['key_use'] = projection_key 234 | tbl3['adj_rand_score'] = ars_1 235 | tbl3['pct_cell_type_kept'] = pct_1 236 | 237 | 238 | 239 | y_prob1, y_pred1, y_test1, clf1, cvsm1, acc1 = SCCAF_assessment(input_matrix_1, adx1.obs[projection_key], n=200) 240 | 241 | adx2.obs['logit_inferred'] = clf1.predict(input_matrix_2) 242 | pct_2 = len(set(adx2.obs['logit_inferred'].astype("category").cat.categories.tolist()) & set(adx2.obs[projection_key].astype("category").cat.categories.tolist())) / len(adx2.obs[projection_key].astype("category").cat.categories.tolist()) 243 | umap2 = sc.pl.umap(adx2, color=[projection_key,"logit_inferred"], palette=color_palette, title = [species_two+' original label', species_two+' inferred label by '+species_one], ncols=2, frameon=False, show=False) 244 | pdf.attach_note("UMAP of species " + species_two + " original and transferred label") 245 | pdf.savefig(dpi=200, bbox_inches='tight') 246 | plt.close() 247 | 248 | ars_2 = adjusted_rand_score(adx2.obs['logit_inferred'], adx2.obs[projection_key]) 249 | 250 | if len(set(adx2.obs['logit_inferred'].astype("category").cat.categories.tolist())) <= 
2: 251 | click.echo("projection doesn't work, return NA") 252 | tbl4=pd.DataFrame(data=list(zip(['logit_not_working'], ["NaN"], ["NaN"])), columns=['cell_type', "ROC_AUC", "PR_AUC"]) 253 | 254 | tbl4['test_acc'] = 'NaN' 255 | tbl4[['CV_acc']] = 'NaN' 256 | tbl4[['type_label']] = 'logit_inferred' 257 | tbl4['from_species'] = species_one 258 | tbl4['to_species'] = species_two 259 | tbl4['integration_method'] = integration_method 260 | tbl4['input_file'] = input_h5ad 261 | tbl4['key_use'] = projection_key 262 | tbl4['adj_rand_score'] = ars_2 263 | tbl4['pct_cell_type_kept'] = pct_2 264 | 265 | else: 266 | click.echo("projection worked") 267 | colors4 = adx2.uns['logit_inferred_colors'] 268 | y_prob4, y_pred4, y_test4, clf4, cvsm4, acc4 = SCCAF_assessment(input_matrix_2, adx2.obs['logit_inferred'],n=200) 269 | aucs4 = plot_roc(y_prob4, y_test4, clf4, cvsm=cvsm4, acc=acc4, colors=colors4) 270 | pdf.attach_note("AUC of logit inferred label in " + species_two) 271 | pdf.savefig(dpi=200, bbox_inches='tight') 272 | plt.close() 273 | 274 | tbl4=get_cell_type_auc(clf4, y_test4, y_prob4) 275 | 276 | tbl4['test_acc'] = acc4 277 | tbl4[['CV_acc']] = cvsm4 278 | tbl4[['type_label']] = 'logit_inferred' 279 | tbl4['from_species'] = species_one 280 | tbl4['to_species'] = species_two 281 | tbl4['integration_method'] = integration_method 282 | tbl4['input_file'] = input_h5ad 283 | tbl4['key_use'] = projection_key 284 | tbl4['adj_rand_score'] = ars_2 285 | tbl4['pct_cell_type_kept'] = pct_2 286 | 287 | acc_summary = pd.concat([acc_summary, tbl1, tbl2, tbl3, tbl4], axis=0, ignore_index=True) 288 | adata_out = ad.concat([adx1, adx2], join='inner', merge='same', label='sccaf_projection', keys=[species_one+"_from_"+species_one+"_"+species_two+"_"+integration_method, species_two+"_from_"+species_one+"_"+species_two+"_"+integration_method]) 289 | 290 | adata_out.write(out_projection_h5ads+"_"+species_one+"_"+species_two+".h5ad", compression = 'gzip') 291 | click.echo("finish projection 
between "+species_one+" and "+species_two) 292 | 293 | click.echo("write acc summary for all") 294 | acc_summary.to_csv(out_acc_csv, index=False) 295 | 296 | if __name__ == '__main__': 297 | run_sccaf_projection() 298 | 299 | 300 | -------------------------------------------------------------------------------- /bin/scvi_integration.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | import click 7 | import matplotlib.pyplot as plt 8 | import scanpy as sc 9 | import scvi 10 | 11 | 12 | 13 | @click.command() 14 | @click.argument("input_h5ad", type=click.Path(exists=True)) 15 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None) 16 | @click.argument("out_umap", type=click.Path(exists=False), default=None) 17 | @click.option('--batch_key', type=str, default=None, help="Batch key used for HVG selection and scVI integration") 18 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species") 19 | @click.option('--cluster_key', type=str, default=None, help="Cluster key used to color the UMAP plot") 20 | 21 | 22 | def run_scVI(input_h5ad, out_h5ad, out_umap, batch_key, species_key, cluster_key): 23 | click.echo('Start scVI integration - use cpu mode') 24 | sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 6)) 25 | adata = sc.read_h5ad(input_h5ad) 26 | adata.var_names_make_unique() 27 | sc.pp.highly_variable_genes( 28 | adata, 29 | flavor="seurat_v3", 30 | n_top_genes=2000, 31 | ##layer="counts", 32 | batch_key=batch_key, 33 | subset=True 34 | ) 35 | 36 | adata.layers["counts"] = adata.X.copy() 37 | sc.pp.normalize_total(adata, target_sum=1e4) 38 | sc.pp.log1p(adata) 39 | adata.raw = adata 40 | #sc.pp.scale(adata, max_value=10) 41 | #sc.tl.pca(adata) 42 | #sc.pp.neighbors(adata) 43 | #sc.tl.umap(adata) 44 | #adata.obsm['X_umapraw'] = 
adata.obsm['X_umap'] 45 | 46 | click.echo("setup scVI model") 47 | scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key=batch_key) 48 | vae = scvi.model.SCVI(adata, n_layers=2, n_latent=40, gene_likelihood="nb") 49 | vae.train() 50 | adata.obsm["X_scVI"] = vae.get_latent_representation() 51 | sc.pp.neighbors(adata, use_rep="X_scVI", key_added='scVI', n_neighbors=15, n_pcs=40) 52 | sc.tl.umap(adata, neighbors_key='scVI', min_dist=0.3) ## to match min_dist in seurat 53 | sc.pl.umap(adata, neighbors_key='scVI', color=[batch_key, species_key, cluster_key], ncols=1) 54 | plt.savefig(out_umap, dpi=300, bbox_inches='tight') 55 | adata.obsm['X_umapscVI'] = adata.obsm['X_umap'] 56 | click.echo("Save output") 57 | adata.write(out_h5ad) 58 | click.echo("Done scVI") 59 | 60 | 61 | if __name__ == '__main__': 62 | run_scVI() 63 | 64 | -------------------------------------------------------------------------------- /bin/seurat_CCA_integration.R: -------------------------------------------------------------------------------- 1 | # /usr/bin/env R 2 | 3 | library(Seurat) 4 | library(optparse) 5 | 6 | 7 | option_list <- list( 8 | make_option(c("-i", "--input_rds"), 9 | type = "character", default = NULL, 10 | help = "Path to input preprocessed rds file" 11 | ), 12 | make_option(c("-o", "--out_rds"), 13 | type = "character", default = NULL, 14 | help = "Output Seurat CCA integrated rds file" 15 | ), 16 | make_option(c("-p", "--out_UMAP"), 17 | type = "character", default = NULL, 18 | help = "Output UMAP after Seurat CCA integration" 19 | ), 20 | make_option(c("-b", "--batch_key"), 21 | type = "character", default = NULL, 22 | help = "Batch key identifier to integrate" 23 | ), 24 | make_option(c("-s", "--species_key"), 25 | type = "character", default = NULL, 26 | help = "Species key identifier" 27 | ), 28 | make_option(c("-c", "--cluster_key"), 29 | type = "character", default = NULL, 30 | help = "Cluster key for UMAP plotting" 31 | ) 32 | ) 33 | 34 | # parse input 35 | opt 
<- parse_args(OptionParser(option_list = option_list)) 36 | 37 | input_rds <- opt$input_rds 38 | out_rds <- opt$out_rds 39 | out_UMAP <- opt$out_UMAP 40 | batch_key <- opt$batch_key 41 | species_key <- opt$species_key 42 | cluster_key <- opt$cluster_key 43 | 44 | ## create Seurat object via rds 45 | 46 | # Convert(input_h5ad, dest = "rds", overwrite = TRUE) 47 | # input_rds <- gsub("h5ad", "rds", input_h5ad) 48 | input <- readRDS(input_rds) 49 | 50 | object.list <- SplitObject(input, split.by = batch_key) 51 | 52 | # normalize and identify variable features for each dataset independently 53 | object.list <- lapply(X = object.list, FUN = function(x) { 54 | x <- NormalizeData(x) 55 | x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000) 56 | }) 57 | 58 | features <- SelectIntegrationFeatures(object.list = object.list) 59 | 60 | input.anchors <- FindIntegrationAnchors(object.list = object.list, anchor.features = features) 61 | 62 | input.combined <- IntegrateData(anchorset = input.anchors) 63 | 64 | DefaultAssay(input.combined) <- "integrated" 65 | 66 | input.combined <- ScaleData(input.combined, verbose = FALSE) 67 | input.combined <- RunPCA(input.combined, npcs = 50, verbose = FALSE) 68 | input.combined <- RunUMAP(input.combined, reduction = "pca", dims = 1:30, n.neighbors = 15L, min.dist = 0.3) 69 | input.combined <- FindNeighbors(input.combined, reduction = "pca", dims = 1:30) 70 | input.combined <- FindClusters(input.combined, resolution = 0.4) 71 | 72 | saveRDS(input.combined, 73 | file = out_rds 74 | ) 75 | 76 | pdf(out_UMAP, height = 6, width = 10) 77 | DimPlot(input.combined, reduction = "umap", group.by = species_key, shuffle = TRUE, label = TRUE) 78 | DimPlot(input.combined, reduction = "umap", group.by = batch_key, shuffle = TRUE, label = TRUE) 79 | DimPlot(input.combined, reduction = "umap", group.by = cluster_key, shuffle = TRUE, label = TRUE) 80 | 81 | dev.off() 82 | 
-------------------------------------------------------------------------------- /bin/seurat_RPCA_integration.R: -------------------------------------------------------------------------------- 1 | # /usr/bin/env R 2 | 3 | library(Seurat) 4 | library(optparse) 5 | 6 | 7 | option_list <- list( 8 | make_option(c("-i", "--input_rds"), 9 | type = "character", default = NULL, 10 | help = "Path to input preprocessed rds file" 11 | ), 12 | make_option(c("-o", "--out_rds"), 13 | type = "character", default = NULL, 14 | help = "Output Seurat RPCA integrated rds file" 15 | ), 16 | make_option(c("-p", "--out_UMAP"), 17 | type = "character", default = NULL, 18 | help = "Output UMAP after Seurat RPCA integration" 19 | ), 20 | make_option(c("-b", "--batch_key"), 21 | type = "character", default = NULL, 22 | help = "Batch key identifier to integrate" 23 | ), 24 | make_option(c("-s", "--species_key"), 25 | type = "character", default = NULL, 26 | help = "Species key identifier" 27 | ), 28 | make_option(c("-c", "--cluster_key"), 29 | type = "character", default = NULL, 30 | help = "Cluster key for UMAP plotting" 31 | ) 32 | ) 33 | 34 | # parse input 35 | opt <- parse_args(OptionParser(option_list = option_list)) 36 | 37 | input_rds <- opt$input_rds 38 | out_rds <- opt$out_rds 39 | out_UMAP <- opt$out_UMAP 40 | batch_key <- opt$batch_key 41 | species_key <- opt$species_key 42 | cluster_key <- opt$cluster_key 43 | 44 | input <- readRDS(input_rds) 45 | 46 | object.list <- SplitObject(input, split.by = batch_key) 47 | 48 | # normalize and identify variable features for each dataset independently 49 | object.list <- lapply(X = object.list, FUN = function(x) { 50 | x <- NormalizeData(x) 51 | x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000) 52 | }) 53 | 54 | features <- SelectIntegrationFeatures(object.list = object.list) 55 | 56 | # rpca workflow 57 | object.list <- lapply(X = object.list, FUN = function(x) { 58 | x <- ScaleData(x, features = features, verbose = 
FALSE) 59 | x <- RunPCA(x, features = features, verbose = FALSE) 60 | }) 61 | 62 | anchors <- FindIntegrationAnchors(object.list = object.list, reduction = "rpca", dims = 1:50) 63 | input.combined <- IntegrateData(anchorset = anchors, dims = 1:50) 64 | 65 | DefaultAssay(input.combined) <- "integrated" 66 | 67 | 68 | input.combined <- ScaleData(input.combined, verbose = FALSE) 69 | input.combined <- RunPCA(input.combined, npcs = 50, verbose = FALSE) 70 | input.combined <- RunUMAP(input.combined, reduction = "pca", dims = 1:30, n.neighbors = 15L, min.dist = 0.3) 71 | input.combined <- FindNeighbors(input.combined, reduction = "pca", dims = 1:30) 72 | input.combined <- FindClusters(input.combined, resolution = 0.4) 73 | 74 | saveRDS(input.combined, 75 | file = out_rds 76 | ) 77 | 78 | pdf(out_UMAP, height = 6, width = 10) 79 | DimPlot(input.combined, reduction = "umap", group.by = species_key, shuffle = TRUE, label = TRUE) 80 | DimPlot(input.combined, reduction = "umap", group.by = batch_key, shuffle = TRUE, label = TRUE) 81 | DimPlot(input.combined, reduction = "umap", group.by = cluster_key, shuffle = TRUE, label = TRUE) 82 | 83 | dev.off() 84 | -------------------------------------------------------------------------------- /bin/validate_input.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # © EMBL-European Bioinformatics Institute, 2023 4 | # Yuyao Song 5 | 6 | # ysong update pydantic to v2 for containerization Oct 2023 7 | 8 | import click 9 | from pydantic import StringConstraints, Field, ConfigDict, BaseModel, ValidationError 10 | import pandas as pd 11 | import numpy as np 12 | import scanpy as sc 13 | from typing import List 14 | from typing_extensions import Annotated 15 | 16 | @click.command() 17 | @click.argument("input_metadata", type=click.Path(exists=True)) 18 | @click.option('--batch_key', type=str, default=None, help="Column in adata.obs for batch label") 19 | 
@click.option('--species_key', type=str, default=None, help="Column in adata.obs for species label") 20 | @click.option('--cluster_key', type=str, default=None, help="Column in adata.obs for cell type label") 21 | 22 | def validate_adata_input(input_metadata, batch_key, cluster_key, species_key): 23 | 24 | ## define classes to validate 25 | meta = pd.read_csv(input_metadata, sep = '\t', header=None) 26 | 27 | class adata_obs_for_csi(BaseModel): 28 | species_key: pd.Series 29 | cluster_key: pd.Series 30 | batch_key: pd.Series 31 | model_config = ConfigDict(arbitrary_types_allowed=True) 32 | 33 | class adata_X_for_csi(BaseModel): 34 | X: np.ndarray 35 | model_config = ConfigDict(arbitrary_types_allowed=True) 36 | 37 | class adata_var_for_csi(BaseModel): 38 | mean_counts: pd.Series 39 | var_names: List[Annotated[str, Field(pattern="^ENS[A-Z]{3}[GP][0-9]{11}$|^ENSG[0-9]{11}(\.[0-9]+_[A-Z]+_[A-Z]+)?$")]] # require ensembl gene id as var_names 40 | model_config = ConfigDict(arbitrary_types_allowed=True) 41 | 42 | ## validate 43 | 44 | for i in range(0, meta.shape[0]): 45 | print('validating input for species '+ meta.iloc[i, 0]) 46 | print('input anndata path is '+ meta.iloc[i, 1]) 47 | 48 | species_now = meta.iloc[i, 0] 49 | 50 | print('read in adata') 51 | ad_now = sc.read_h5ad(meta.iloc[i, 1]) 52 | 53 | ## validate required fields in adata.obs 54 | for key in {species_key, batch_key, cluster_key}: 55 | try: 56 | ad_now.obs[key] 57 | except KeyError as ke: 58 | print(str(ke) + ' does not exist in ' + species_now + ' adata.obs') 59 | raise 60 | else: 61 | try: 62 | adata_obs_for_csi(species_key = ad_now.obs[species_key], batch_key = ad_now.obs[batch_key], cluster_key = ad_now.obs[cluster_key] ) 63 | print('obs test pass for ' + species_now) 64 | except ValidationError as ve: 65 | print(ve.json()) 66 | raise 67 | ## validate adata.X is a dense matrix in np.array and all positive values (i.e. 
non-scaled data) 68 | try: 69 | adata_X_for_csi(X = ad_now.X) 70 | print('count matrix test pass for ' + species_now) 71 | except ValidationError as e: 72 | print(e.json()) 73 | raise 74 | else: 75 | if not all(ad_now.X.flatten() >= 0): 76 | raise ValueError('adata.X contains negative values, please make sure the raw count matrix is in adata.X') 77 | 78 | ## validate required fields in adata.var, and adata.var_names are ensembl gene ids 79 | try: 80 | ad_now.var['mean_counts'] 81 | except KeyError as ke: 82 | print(str(ke) + ' does not exist in ' + species_now + ' adata.var') 83 | raise 84 | else: 85 | try: 86 | adata_var_for_csi(mean_counts = ad_now.var['mean_counts'], var_names = ad_now.var_names.values.tolist()) 87 | print('var test pass for ' + species_now) 88 | except ValidationError as e: 89 | print(e.json()) 90 | raise 91 | 92 | print('input validation complete') 93 | 94 | if __name__ == '__main__': 95 | validate_adata_input() 96 | -------------------------------------------------------------------------------- /concat_by_homology_multiple_species.nf: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env nextflow 2 | nextflow.enable.dsl=2 3 | 4 | // © EMBL-European Bioinformatics Institute, 2023 5 | // Updated upon containerization of BENGAL envs 6 | // Yuyao Song 7 | // Oct 2023 8 | 9 | // data requirements: batch_key, species_key, sample_key, mapped in config file, present in adata.obs 10 | // data requirements: mean_counts in adata.var from scanpy QC 11 | // raw h5ad file naming: 12 | // adata.var_names are ensembl gene ids 13 | 14 | log.info """ 15 | =========================================================== 16 | 17 | 18 | Cross-species integration and assessment - nextflow pipeline 19 | Use singularity containers for cluster execution 20 | - check input format 21 | - concatenate input anndata 22 | Author: ysong@ebi.ac.uk 23 | Initial date: Mar 2022 24 | Latest date: Oct 2023 25 | 26 | 
=========================================================== 27 | 28 | """ 29 | .stripIndent() 30 | 31 | 32 | 33 | process validate_adata_input { 34 | 35 | // labels for cluster options and containers 36 | // no need to change here, adjust as per dataset in config file 37 | 38 | label 'validate' 39 | label 'regular_resource' 40 | 41 | input: 42 | tuple val(basename), path(metadata) 43 | 44 | script: 45 | """ 46 | python ${projectDir}/bin/validate_input.py ${metadata} --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} 47 | 48 | """ 49 | 50 | } 51 | 52 | 53 | process concat_by_homology { 54 | 55 | label 'concat' 56 | label 'regular_resource' 57 | 58 | publishDir "${params.results}/results/h5ad_homology_concat", mode: 'copy' 59 | 60 | input: 61 | 62 | tuple val(basename), path(metadata) 63 | 64 | output: 65 | 66 | path "*.h5ad", emit: concat_h5ad 67 | path "${basename}_homology_tbl.csv" 68 | 69 | 70 | script: 71 | """ 72 | Rscript ${projectDir}/bin/concat_by_homology_multiple_species_by_gene_id.R --metadata ${metadata} \ 73 | --one2one_h5ad ${basename}_one2one_only.h5ad --many_higher_expr_h5ad ${basename}_many_higher_expr.h5ad \ 74 | --many_higher_homology_conf_h5ad ${basename}_many_higher_homology_conf.h5ad --homology_tbl ${basename}_homology_tbl.csv 75 | """ 76 | } 77 | 78 | process concat_by_homology_rliger_uinmf { 79 | 80 | label 'concat' 81 | label 'regular_resource' 82 | 83 | publishDir "${params.results}/results/rligerUINMF/h5ad_homology_concat", mode: 'copy' 84 | 85 | input: 86 | 87 | tuple val(basename), path(metadata) 88 | 89 | output: 90 | path "homology_tbl.csv" 91 | path "${basename}_liger.tsv", emit: concat_h5ad_rliger_paths 92 | path "*.h5ad", emit: concat_h5ad_rliger 93 | 94 | script: 95 | """ 96 | mkdir -p ${params.results}/results/rligerUINMF/h5ad_homology_concat && \ 97 | Rscript ${projectDir}/bin/concat_by_homology_rligerUINMF_multiple_species.R --metadata ${metadata} \ 98 | --out_dir . 
--homology_tbl homology_tbl.csv \ 99 | --metadata_output ${basename}_liger.tsv 100 | 101 | """ 102 | } 103 | 104 | process convert_format_h5ad { 105 | 106 | label 'convert' 107 | label 'regular_resource' 108 | 109 | publishDir "${params.results}/results/h5ad_homology_concat", mode: 'copy' 110 | 111 | input: 112 | tuple val(basename), path(h5ad_concat) 113 | 114 | output: 115 | path "*.rds" 116 | 117 | script: 118 | """ 119 | Rscript ${projectDir}/bin/convert_format.R \ 120 | -i ${h5ad_concat} -o ${basename}.rds -t anndata_to_seurat \ 121 | --conda_path ${params.sceasy_conda} 122 | 123 | """ 124 | 125 | 126 | } 127 | 128 | process convert_format_rliger_uinmf { 129 | 130 | label 'convert' 131 | label 'regular_resource' 132 | 133 | publishDir "${params.results}/results/rligerUINMF/h5ad_homology_concat", mode: 'copy' 134 | 135 | input: 136 | tuple val(basename), path(h5ad_species) 137 | 138 | output: 139 | path "*.rds" 140 | 141 | script: 142 | """ 143 | Rscript ${projectDir}/bin/convert_format.R \ 144 | -i ${h5ad_species} -o ${basename}.rds -t anndata_to_seurat \ 145 | --conda_path ${params.sceasy_conda} 146 | 147 | """ 148 | 149 | 150 | } 151 | 152 | 153 | workflow { 154 | 155 | metadata_ch = Channel.fromPath(params.input_metadata) 156 | .map { file -> tuple(file.baseName, file) } 157 | //metadata_ch.view() 158 | validate_adata_input(metadata_ch) 159 | concat_by_homology(metadata_ch) 160 | concat_by_homology_rliger_uinmf(metadata_ch) 161 | 162 | concat_h5ad_ch = concat_by_homology.out.concat_h5ad.flatten().filter( ~/.*h5ad$/ ).map { file -> tuple(file.baseName, file) } 163 | convert_format_h5ad(concat_h5ad_ch) 164 | concat_rliger_ch = Channel.fromPath("${params.results}/results/rligerUINMF/h5ad_homology_concat/*.h5ad").map { file -> tuple(file.baseName, file) }.view() 165 | convert_format_rliger_uinmf(concat_rliger_ch) 166 | 167 | } 168 | -------------------------------------------------------------------------------- /config/example.config: 
-------------------------------------------------------------------------------- 1 | // © EMBL-European Bioinformatics Institute, 2023 2 | // Updated example nextflow config file for the BENGAL pipeline 3 | // Yuyao Song 4 | // Oct 2023 5 | 6 | // sections marked with CHANGE_PER_RUN should be adjusted per execution 7 | // sections marked with CHANGE_PER_SETUP should be adjusted for each new setup on the cluster, etc. 8 | 9 | // CHANGE_PER_RUN: directories for project root, work and results 10 | 11 | projectDir='/some/path/NEXTFLOW/BENGAL' // directory containing the cloned repo 12 | workDir='/some/path/work' 13 | params.results='/some/path/results' 14 | 15 | // CHANGE_PER_RUN: params specific to dataset, please change according to your task 16 | // input variables 17 | 18 | // if each dataset is batchy, set batch_key to the actual batch 19 | // if the datasets do not appear batchy and cell types are balanced between batches, set batch_key to species_key 20 | 21 | // which column stores batch to integrate 22 | params.batch_key='species' 23 | 24 | // which column stores species names 25 | params.species_key='species' 26 | 27 | // which column specifies cell types matched between species 28 | params.cluster_key='cell_ontology_mapped' 29 | 30 | // which column stores cell types to perform SCCAF assessment, usually the same as cluster_key 31 | params.cluster_key_sccaf='cell_ontology_mapped' 32 | 33 | // which column stores cell types to perform SCCAF annotation transfer, usually the same as cluster_key 34 | params.projection_key_sccaf='cell_ontology_mapped' 35 | 36 | // input metadata file that maps species names to the respective raw count .h5ad files 37 | params.input_metadata='/some/path/data/heart_hs_mf_metadata_nf.tsv' 38 | 39 | // set task name to metadata file basename 40 | params.task_name='heart_hs_mf_metadata_nf' 41 | 42 | // CHANGE_PER_SETUP: paths to the prepared sceasy, scvi and scib conda envs 43 | params.sceasy_conda='/some/conda/path/anaconda3/envs/sceasy' 44 | 
params.scvi_conda='/some/conda/path/anaconda3/envs/scvi-tools' 45 | params.scib_conda='/some/conda/path/anaconda3/envs/scib' 46 | 47 | 48 | // Only if running trajectory conservation score 49 | //params.root_cell = 'Blastula' 50 | 51 | // CHANGE_PER_SETUP: cluster execution 52 | 53 | process.executor = 'lsf' 54 | executor { 55 | name = 'lsf' 56 | queueSize = 2000 57 | } 58 | 59 | // enable use of conda envs 60 | conda.enabled = true 61 | 62 | // CHANGE_PER_SETUP: singularity container paths and cluster resource options 63 | 64 | singularity { 65 | enabled = true 66 | } 67 | 68 | process { 69 | withLabel: validate { 70 | container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_py.sif" 71 | } 72 | withLabel: concat { 73 | container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_concat.sif" 74 | } 75 | withLabel: convert { 76 | conda = "${params.sceasy_conda}" 77 | } 78 | withLabel: scanpy_based { 79 | container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_py.sif" 80 | } 81 | withLabel: 'seurat_based|R_based' { 82 | container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_seurat.sif" 83 | } 84 | withLabel: scvi_based { 85 | conda = "${params.scvi_conda}" 86 | } 87 | withLabel: scIB_based { 88 | conda = "${params.scib_conda}" 89 | } 90 | withLabel: sccaf { 91 | container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_sccaf.sif" 92 | } 93 | withLabel: regular_resource { 94 | cpus = 4 95 | queue = 'research' 96 | memory = '50GB' 97 | cache = 'lenient' 98 | } 99 | withLabel: regular_intg_resource { 100 | cpus = 12 101 | queue = 'research' 102 | memory = '200GB' 103 | cache = 'lenient' 104 | } 105 | withLabel: bigmem_intg_resource { 106 | cpus = 12 107 | queue = 'bigmem' 108 | memory = '400GB' 109 | cache = 'lenient' 110 | } 111 | withLabel: GPU_intg_resource { 112 | queue = 'gpu' 113 | clusterOptions = ' -gpu "num=2:j_exclusive=no" -P gpu -n 4 ' 114 | memory = '50GB' 115 | cache = 'lenient' 116 | } 117 | 118 | } 119 
| 120 | 121 | // params for pipeline, no need to change from here onwards 122 | 123 | params.liger_metadata="${params.results}/results/rligerUINMF/h5ad_homology_concat/${params.task_name}_liger.tsv" 124 | params.homology_concat_h5ad="${params.results}/results/h5ad_homology_concat/*.h5ad" 125 | params.homology_concat_rds="${params.results}/results/h5ad_homology_concat/*.rds" 126 | params.integrated_h5ad="${params.results}/results/*/cross_species/integrated_h5ad/*.h5ad" 127 | params.integrated_rds="${params.results}/results/*/cross_species/integrated_h5ad/*.rds" 128 | 129 | // nextflow trace settings, get run stats by adding -with-trace upon execution 130 | 131 | trace { 132 | enabled = true 133 | file = "${params.results}/nf_trace/${params.task_name}_trace.txt" 134 | fields = 'task_id,hash,native_id,name,status,exit,submit,duration,realtime,%cpu,peak_rss,peak_vmem,rchar,wchar' 135 | } 136 | 137 | -------------------------------------------------------------------------------- /containers/README.md: -------------------------------------------------------------------------------- 1 | ## Containers for BENGAL nextflow pipeline 2 | 3 | Yuyao Song 4 | 5 | We provide a docker container for running SCCAF assessment. 6 | 7 | For cluster execution, we convert the docker container into a singularity container. 8 | 9 | Pull the SCCAF container as follows: 10 | 11 | `singularity pull sccaf.sif docker://yysong123/intgpy:sccaf` 12 | 13 | This container is built for linux/amd64 machines. 
14 | -------------------------------------------------------------------------------- /cross_species_assessment_multiple_species_individual.nf: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env nextflow 2 | 3 | 4 | nextflow.enable.dsl=2 5 | 6 | 7 | // data requirements: batch_key, species_key, sample_key in adata.obs 8 | // data requirements: mean_count in adata.var from scanpy QC 9 | // name the raw h5ad file using the same string as in homology table 10 | // adata.var have column 'gene_name' as gene_name, be unique! 11 | // adata.obs and adata.var cannot contain NA and the column names cannot contain "-" 12 | // adata.var.cluster_key factor is ordered to be the same across two datasets with the same cell type // todo 13 | 14 | log.info """ 15 | =========================================================== 16 | 17 | Cross-species integration and assessment - nextflow pipeline 18 | - Integrate scRNA-seq data from multiple species 19 | - SCCAF projection on integrated data to assess integration quality - this workflow 20 | - harmony, scanorama, scVI - python based 21 | - Seurat CCA, Seurat RPCA, fastMNN, LIGER, LIGER-UINMF - r based 22 | - SAMap 23 | Author: ysong@ebi.ac.uk 24 | Mar 2022 25 | 26 | =========================================================== 27 | 28 | """ 29 | .stripIndent() 30 | 31 | 32 | metadata_ch = channel.fromPath(params.input_metadata) 33 | 34 | all_integrated_h5ad_mapped_ch = channel 35 | .fromPath(params.integrated_h5ad) 36 | .map { file -> tuple(file.baseName.split("_")[-2], file.baseName, file) } 37 | 38 | all_orig_h5ad_mapped_ch = channel 39 | .fromPath(params.integrated_h5ad) 40 | .map { file -> (params.homology_concat_h5ad - "*.h5ad" + file.baseName - file.baseName.split("_")[-2] - '__integrated' + ".h5ad") } 41 | .unique() 42 | 43 | all_orig_h5ad_with_base_mapped_ch = channel 44 | .fromPath(params.homology_concat_h5ad) 45 | .map { file -> tuple(params.task_name, file) } 46 | .unique() 47 | 
48 | 49 | all_integrated_and_orig_h5ad_mapped_ch = channel 50 | .fromPath(params.integrated_h5ad) 51 | .map { file -> tuple(file.baseName.split("_")[-2], file.baseName, (params.homology_concat_h5ad - "*.h5ad" + file.baseName - file.baseName.split("_")[-2] - '__integrated' + ".h5ad"), file) } 52 | 53 | 54 | all_integrated_rds_mapped_ch = channel 55 | .fromPath(params.integrated_rds) 56 | .map { file -> tuple(file.baseName, file, file.getParent()) } 57 | // method name, file basename, file 58 | 59 | process copy_for_rliger { 60 | 61 | label 'regular_resource' 62 | 63 | publishDir "${params.results}/results/h5ad_homology_concat", mode: 'copy' 64 | 65 | input: 66 | tuple val(basename), path(unintegrated_h5ad) 67 | 68 | output: 69 | val true, emit: signal 70 | 71 | shell: 72 | ''' 73 | if ! [ -n "\$(find !{params.results}/results/h5ad_homology_concat -type f -regex '.*liger.*')" ] 74 | then 75 | rliger_file=$(echo !{unintegrated_h5ad} | sed "s/!{basename}/!{basename}_liger/g") && echo ${rliger_file} && cp !{unintegrated_h5ad} ${rliger_file} 76 | fi 77 | ''' 78 | 79 | } 80 | 81 | process convert_format_rds { 82 | 83 | label 'convert' 84 | label 'regular_resource' 85 | 86 | //publishDir "${out_dir}/", mode: 'copy' 87 | 88 | input: 89 | tuple val(basename), path(integrated_rds), path(out_dir) 90 | 91 | //output: 92 | //path "*.h5ad" 93 | //directly write to results out_dir, not elegant, but works 94 | 95 | script: 96 | 97 | """ 98 | Rscript ${projectDir}/bin/convert_format.R \ 99 | -i ${integrated_rds} -o ${out_dir}/${basename}.h5ad -t seurat_to_anndata \ 100 | --conda_path ${params.sceasy_conda} 101 | 102 | """ 103 | 104 | 105 | } 106 | 107 | 108 | 109 | process sccaf_assessment { 110 | 111 | label 'regular_resource' 112 | label 'sccaf_based' 113 | 114 | publishDir "${params.results}/results/per_species", mode: 'copy' 115 | 116 | input: 117 | path(metadata) 118 | 119 | output: 120 | path '*' 121 | 122 | script: 123 | """ 124 | python 
${projectDir}/bin/sccaf_assessment_metadata.py ${metadata} ${metadata}_SCCAF_AUC ${metadata}_SCCAF_accuracy_summary \ 125 | --use_embedding True --embedding_key X_pca \ 126 | --integration_method unintegrated \ 127 | --cluster_key ${params.cluster_key} 128 | """ 129 | 130 | } 131 | 132 | 133 | process sccaf_projection { 134 | 135 | label 'regular_resource' 136 | label 'sccaf_based' 137 | 138 | 139 | publishDir "${params.results}/results/${method}/cross_species/SCCAF_projection/", mode: 'copy' 140 | 141 | 142 | input: 143 | tuple val(method), val(basename), path(cross_species_integrated_h5ad) 144 | 145 | output: 146 | path '*' 147 | 148 | script: 149 | """ 150 | python ${projectDir}/bin/sccaf_projection_multiple_species.py \ 151 | --species_key ${params.species_key} --cluster_key ${params.cluster_key_sccaf} --projection_key ${params.projection_key_sccaf} \ 152 | --integration_method ${method} ${cross_species_integrated_h5ad} \ 153 | ${basename}_SCCAF_projection_result.h5ad \ 154 | ${basename}_SCCAF_projection_figures.pdf \ 155 | ${basename}_SCCAF_accuracy_summary.csv 156 | """ 157 | } 158 | 159 | 160 | 161 | process batch_metrics{ 162 | 163 | label 'regular_intg_resource' 164 | label 'scIB_based' 165 | 166 | publishDir "${params.results}/batch_metrics/cross_species", mode: 'copy' 167 | 168 | input: 169 | 170 | tuple val(ready), val(method), val(basename), path(unintegrated_h5ad), path(cross_species_integrated_h5ad) 171 | 172 | output: 173 | path "*" 174 | 175 | script: 176 | """ 177 | python ${projectDir}/bin/scIB_metrics_individual.py \ 178 | ${cross_species_integrated_h5ad} \ 179 | ${unintegrated_h5ad} \ 180 | ${basename}_scIB_metrics.csv ${basename}_cell_type_basw.csv \ 181 | ${basename}_orig_scIB_metrics.csv ${basename}_orig_cell_type_basw.csv \ 182 | ${basename}_scIB.h5ad \ 183 | --integration_method ${method} --batch_key ${params.batch_key} \ 184 | --species_key ${params.species_key} --cluster_key ${params.cluster_key} \ 185 | --num_cores 4 --conda_path 
${params.scib_conda} 186 | 187 | """ 188 | 189 | 190 | 191 | } 192 | 193 | 194 | 195 | process trajectory_metrics{ 196 | 197 | label 'regular_resource' 198 | label 'scIB_based' 199 | 200 | publishDir "${params.results}/batch_metrics/cross_species", mode: 'copy' 201 | 202 | input: 203 | tuple val(method), val(basename), path(unintegrated_h5ad), path(cross_species_integrated_h5ad) 204 | 205 | output: 206 | path "*" 207 | 208 | script: 209 | """ 210 | python ${projectDir}/bin/scIB_trajectory.py \ 211 | ${cross_species_integrated_h5ad} ${unintegrated_h5ad} \ 212 | ${basename}_trajectory_metrics.csv \ 213 | ${basename}_trajectory_scIB.h5ad \ 214 | --integration_method ${method} --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} --root_cell ${params.root_cell} 215 | 216 | """ 217 | 218 | } 219 | 220 | 221 | workflow { 222 | 223 | //all_integrated_rds_mapped_ch.view() 224 | 225 | convert_format_rds(all_integrated_rds_mapped_ch) 226 | 227 | copy_for_rliger(all_orig_h5ad_with_base_mapped_ch) 228 | sccaf_assessment(metadata_ch) 229 | sccaf_projection(all_integrated_h5ad_mapped_ch) 230 | ch_test = copy_for_rliger.out.signal.combine(all_integrated_and_orig_h5ad_mapped_ch).unique() 231 | batch_metrics(ch_test) 232 | //trajectory_metrics(all_integrated_and_orig_h5ad_mapped_ch) 233 | } 234 | -------------------------------------------------------------------------------- /cross_species_integration_multiple_species.nf: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env nextflow 2 | nextflow.enable.dsl=2 3 | 4 | // © EMBL-European Bioinformatics Institute, 2023 5 | 6 | // Updated upon containerization of BENGAL envs 7 | // Yuyao Song 8 | // Oct 2023 9 | 10 | 11 | log.info """ 12 | =========================================================== 13 | 14 | Cross-species integration and assessment - nextflow pipeline 15 | - Integrate scRNA-seq data from multiple species - this workflow 16 
| - SCCAF projection on integrated data to assess integration quality 17 | - harmony, scanorama, scVI - python based 18 | - Seurat CCA, Seurat RPCA, fastMNN, LIGER, LIGER-UINMF - r based 19 | Author: ysong@ebi.ac.uk 20 | Mar 2022 21 | 22 | =========================================================== 23 | 24 | """ 25 | .stripIndent() 26 | 27 | 28 | process harmony_integration { 29 | 30 | label 'scanpy_based' 31 | label 'regular_intg_resource' 32 | 33 | publishDir "${params.results}/results/harmony/cross_species/integrated_h5ad", mode: 'copy' 34 | 35 | input: 36 | tuple val(baseName), path(file) 37 | 38 | output: 39 | path "${baseName}_harmony_integrated.h5ad", emit: harmony_cross_species_h5ad_ch 40 | path "${baseName}_harmony_integrated_UMAP.png", emit: harmony_cross_species_umap_ch 41 | 42 | script: 43 | """ 44 | python ${projectDir}/bin/harmony_integration.py \ 45 | --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} \ 46 | ${file} ${baseName}_harmony_integrated.h5ad ${baseName}_harmony_integrated_UMAP.png 47 | """ 48 | } 49 | 50 | 51 | process scanorama_integration { 52 | 53 | label 'scanpy_based' 54 | label 'regular_intg_resource' 55 | 56 | publishDir "${params.results}/results/scanorama/cross_species/integrated_h5ad", mode: 'copy' 57 | 58 | input: 59 | tuple val(baseName), path(file) 60 | 61 | output: 62 | path "${baseName}_scanorama_integrated.h5ad", emit: scanorama_cross_species_h5ad_ch 63 | path "${baseName}_scanorama_integrated_UMAP.png", emit: scanorama_cross_species_umap_ch 64 | 65 | script: 66 | """ 67 | python ${projectDir}/bin/scanorama_integration.py \ 68 | --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} \ 69 | ${file} ${baseName}_scanorama_integrated.h5ad ${baseName}_scanorama_integrated_UMAP.png 70 | """ 71 | } 72 | 73 | 74 | // scVI uses GPU 75 | process scvi_integration { 76 | 77 | label 'scvi_based' 78 | label 'GPU_intg_resource' 79 | 80 | 
publishDir "${params.results}/results/scVI/cross_species/integrated_h5ad", mode: 'copy' 81 | 82 | input: 83 | tuple val(baseName), path(file) 84 | 85 | output: 86 | path "${baseName}_scVI_integrated.h5ad", emit: scvi_cross_species_h5ad_ch 87 | path "${baseName}_scVI_integrated_UMAP.png", emit: scvi_cross_species_umap_ch 88 | 89 | script: 90 | """ 91 | python ${projectDir}/bin/scvi_integration.py \ 92 | --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} \ 93 | ${file} ${baseName}_scVI_integrated.h5ad ${baseName}_scVI_integrated_UMAP.png 94 | """ 95 | } 96 | // scANVI uses GPU 97 | process scanvi_integration { 98 | 99 | 100 | label 'scvi_based' 101 | label 'GPU_intg_resource' 102 | 103 | publishDir "${params.results}/results/scANVI/cross_species/integrated_h5ad", mode: 'copy' 104 | 105 | input: 106 | tuple val(baseName), path(file) 107 | 108 | output: 109 | path "${baseName}_scANVI_integrated.h5ad", emit: scanvi_cross_species_h5ad_ch 110 | path "${baseName}_scANVI_integrated_UMAP.png", emit: scanvi_cross_species_umap_ch 111 | 112 | script: 113 | """ 114 | python ${projectDir}/bin/scANVI_integration.py \ 115 | --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} \ 116 | ${file} ${baseName}_scANVI_integrated.h5ad ${baseName}_scANVI_integrated_UMAP.png ${baseName}_scANVI_integrated_mde.png 117 | """ 118 | } 119 | 120 | 121 | 122 | process seurat_CCA_integration{ 123 | 124 | 125 | label 'seurat_based' 126 | label 'bigmem_intg_resource' 127 | 128 | publishDir "${params.results}/results/seuratCCA/cross_species/integrated_h5ad/", mode: 'copy' 129 | 130 | input: 131 | tuple val(baseName), path(file) 132 | 133 | output: 134 | path "${baseName}_seuratCCA_integrated.rds", emit: seuratCCA_cross_species_rds_ch 135 | path "${baseName}_seuratCCA_integrated_UMAP.pdf" , emit: seuratCCA_cross_species_umap_ch 136 | 137 | script: 138 | """ 139 | Rscript 
${projectDir}/bin/seurat_CCA_integration.R -i ${file} -o ${baseName}_seuratCCA_integrated.rds -p ${baseName}_seuratCCA_integrated_UMAP.pdf -b ${params.batch_key} -s ${params.species_key} -c ${params.cluster_key} 140 | """ 141 | } 142 | 143 | 144 | process seurat_RPCA_integration{ 145 | 146 | label 'seurat_based' 147 | label 'bigmem_intg_resource' 148 | 149 | publishDir "${params.results}/results/seuratRPCA/cross_species/integrated_h5ad", mode: 'copy' 150 | 151 | input: 152 | tuple val(baseName), path(file) 153 | 154 | output: 155 | path "${baseName}_seuratRPCA_integrated.rds" , emit: seuratRPCA_cross_species_rds_ch 156 | path "${baseName}_seuratRPCA_integrated_UMAP.pdf" , emit: seuratRPCA_cross_species_umap_ch 157 | 158 | script: 159 | """ 160 | Rscript ${projectDir}/bin/seurat_RPCA_integration.R -i ${file} -o ${baseName}_seuratRPCA_integrated.rds -p ${baseName}_seuratRPCA_integrated_UMAP.pdf -b ${params.batch_key} -s ${params.species_key} -c ${params.cluster_key} 161 | """ 162 | } 163 | 164 | process fastMNN_integration{ 165 | 166 | label 'R_based' 167 | label 'regular_intg_resource' 168 | 169 | publishDir "${params.results}/results/fastMNN/cross_species/integrated_h5ad", mode: 'copy' 170 | 171 | input: 172 | tuple val(baseName), path(file) 173 | 174 | output: 175 | path "${baseName}_fastMNN_integrated.rds" , emit: fastMNN_cross_species_rds_ch 176 | path "${baseName}_fastMNN_integrated_UMAP.pdf" , emit: fastMNN_cross_species_UMAP_ch 177 | 178 | script: 179 | """ 180 | Rscript ${projectDir}/bin/fastMNN_integration.R -i ${file} -o ${baseName}_fastMNN_integrated.rds -p ${baseName}_fastMNN_integrated_UMAP.pdf -b ${params.batch_key} -s ${params.species_key} -c ${params.cluster_key} 181 | """ 182 | } 183 | 184 | 185 | process liger_integration{ 186 | 187 | label 'R_based' 188 | label 'regular_intg_resource' 189 | 190 | publishDir "${params.results}/results/LIGER/cross_species/integrated_h5ad", mode: 'copy' 191 | 192 | input: 193 | tuple val(baseName), path(file) 194 | 
195 | output: 196 | path "${baseName}_LIGER_integrated.rds", emit: liger_cross_species_rds_ch 197 | path "${baseName}_LIGER_integrated_UMAP.pdf", emit: liger_cross_species_UMAP_ch 198 | 199 | script: 200 | """ 201 | Rscript ${projectDir}/bin/LIGER_integration.R -i ${file} -o ${baseName}_LIGER_integrated.rds -p ${baseName}_LIGER_integrated_UMAP.pdf -b ${params.batch_key} -s ${params.species_key} -c ${params.cluster_key} 202 | """ 203 | } 204 | 205 | 206 | process rligerUINMF_integration{ 207 | 208 | label 'R_based' 209 | label 'regular_intg_resource' 210 | 211 | publishDir "${params.results}/results/rligerUINMF/cross_species/integrated_h5ad", mode: 'copy' 212 | 213 | input: 214 | tuple val(baseName), path(metadata) 215 | 216 | output: 217 | path "*_rligerUINMF_integrated.rds" 218 | path "*_rligerUINMF_integrated_UMAP.pdf" 219 | 220 | script: 221 | """ 222 | mkdir -p ${params.results}/results/rligerUINMF/cross_species/integrated_h5ad && \ 223 | Rscript ${projectDir}/bin/rliger_integration_UINMF_multiple_species.R \ 224 | --basename ${baseName} \ 225 | --metadata ${params.liger_metadata} \ 226 | --out_dir . 
\ 227 | --cluster_key ${params.cluster_key} 228 | """ 229 | } 230 | 231 | // nextflow will search for the output file in the cwd 232 | // so set the out_dir in the script to cwd, then nextflow will copy the results to the publishDir anyway 233 | 234 | workflow { 235 | 236 | all_homology_h5ad_mapped_ch = Channel.fromPath(params.homology_concat_h5ad) 237 | .map { file -> tuple(file.baseName, file) } 238 | 239 | all_homology_rds_mapped_ch = Channel.fromPath(params.homology_concat_rds) 240 | .map { file -> tuple(file.baseName, file) } 241 | 242 | concatenated_h5ad = all_homology_h5ad_mapped_ch 243 | concatenated_h5ad.view() 244 | 245 | harmony_integration(concatenated_h5ad) 246 | scanorama_integration(concatenated_h5ad) 247 | scvi_integration(concatenated_h5ad) 248 | scanvi_integration(concatenated_h5ad) 249 | 250 | concatenated_rds = all_homology_rds_mapped_ch 251 | seurat_CCA_integration(concatenated_rds) 252 | seurat_RPCA_integration(concatenated_rds) 253 | fastMNN_integration(concatenated_rds) 254 | liger_integration(concatenated_rds) 255 | 256 | liger_metadata = Channel.fromPath(params.liger_metadata) 257 | .map { file -> tuple(file.baseName, file) } 258 | liger_metadata.view() 259 | rligerUINMF_integration(liger_metadata) 260 | 261 | } 262 | -------------------------------------------------------------------------------- /dockerfiles/ALCS.Dockerfile: -------------------------------------------------------------------------------- 1 | # syntax=docker/dockerfile:1 2 | 3 | 4 | FROM python:3.7.12-slim-buster 5 | 6 | 7 | RUN apt-get update && apt-get -y install gcc && apt-get -y install g++ \ 8 | && apt-get -y install wget && apt-get -y install autoconf && apt-get -y install automake && apt-get -y install libxml2-dev && \ 9 | wget https://github.com/Kitware/CMake/releases/download/v3.22.3/cmake-3.22.3-linux-x86_64.sh && \ 10 | cp cmake-3.22.3-linux-x86_64.sh /opt/ && chmod +x /opt/cmake-3.22.3-linux-x86_64.sh && \ 11 | cd /opt/ &&yes y | bash 
/opt/cmake-3.22.3-linux-x86_64.sh && ln -s /opt/cmake-3.22.3-linux-x86_64/bin/* /usr/local/bin && \ 12 | apt-get install -y git 13 | 14 | RUN pip3 install click matplotlib pandas anndata scikit-learn 15 | 16 | RUN git clone https://github.com/YY-SONG0718/sccaf && cd sccaf && pip3 install . 17 | 18 | RUN pip3 install scanpy==1.9.1 19 | 20 | ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 21 | -------------------------------------------------------------------------------- /dockerfiles/scanpy.Dockerfile: -------------------------------------------------------------------------------- 1 | FROM --platform=linux/amd64 gcfntnu/scanpy:1.9.2 2 | 3 | LABEL maintainer="Yuyao Song" 4 | 5 | CMD ["echo", "Container for harmony and scanorama integration in BENGAL"] 6 | 7 | RUN pip3 install harmonypy scanorama click 8 | 9 | RUN pip3 install pydantic 10 | 11 | ENTRYPOINT ["python"] 12 | 13 | -------------------------------------------------------------------------------- /dockerfiles/seurat.Dockerfile: -------------------------------------------------------------------------------- 1 | # Dockerfile for Seurat 4.3.0 2 | FROM rocker/r-ver:4.2.0 3 | 4 | # Set global R options 5 | RUN echo "options(repos = 'https://cloud.r-project.org')" > $(R --no-echo --no-save -e "cat(Sys.getenv('R_HOME'))")/etc/Rprofile.site 6 | ENV RETICULATE_MINICONDA_ENABLED=FALSE 7 | 8 | # Install Seurat's system dependencies 9 | RUN apt-get update 10 | RUN apt-get install -y \ 11 | libhdf5-dev \ 12 | libcurl4-openssl-dev \ 13 | libssl-dev \ 14 | libpng-dev \ 15 | libboost-all-dev \ 16 | libxml2-dev \ 17 | openjdk-8-jdk \ 18 | python3-dev \ 19 | python3-pip \ 20 | wget \ 21 | git \ 22 | libfftw3-dev \ 23 | libgsl-dev \ 24 | pkg-config 25 | 26 | RUN apt-get install -y llvm-10 27 | 28 | # Install system library for rgeos 29 | RUN apt-get install -y libgeos-dev 30 | 31 | # Install UMAP 32 | RUN LLVM_CONFIG=/usr/lib/llvm-10/bin/llvm-config pip3 install llvmlite 33 | RUN pip3 install numpy 34 | RUN pip3
install umap-learn 35 | 36 | # Install FIt-SNE 37 | RUN git clone --branch v1.2.1 https://github.com/KlugerLab/FIt-SNE.git 38 | RUN g++ -std=c++11 -O3 FIt-SNE/src/sptree.cpp FIt-SNE/src/tsne.cpp FIt-SNE/src/nbodyfft.cpp -o bin/fast_tsne -pthread -lfftw3 -lm 39 | 40 | # Install bioconductor dependencies & suggests 41 | RUN R --no-echo --no-restore --no-save -e "install.packages('BiocManager')" 42 | RUN R --no-echo --no-restore --no-save -e "BiocManager::install(c('multtest', 'S4Vectors', 'SummarizedExperiment', 'SingleCellExperiment', 'MAST', 'DESeq2', 'BiocGenerics', 'GenomicRanges', 'IRanges', 'rtracklayer', 'monocle', 'Biobase', 'limma', 'glmGamPoi'))" 43 | 44 | # Install CRAN suggests 45 | RUN R --no-echo --no-restore --no-save -e "install.packages(c('VGAM', 'R.utils', 'metap', 'Rfast2', 'ape', 'enrichR', 'mixtools'))" 46 | 47 | # Install spatstat 48 | RUN R --no-echo --no-restore --no-save -e "install.packages(c('spatstat.explore', 'spatstat.geom'))" 49 | 50 | # Install hdf5r 51 | RUN R --no-echo --no-restore --no-save -e "install.packages('hdf5r')" 52 | 53 | # Install latest Matrix 54 | RUN R --no-echo --no-restore --no-save -e "install.packages('Matrix')" 55 | 56 | # Install rgeos 57 | RUN R --no-echo --no-restore --no-save -e "install.packages('rgeos')" 58 | 59 | # Install Seurat 60 | RUN R --no-echo --no-restore --no-save -e "install.packages('remotes')" 61 | RUN R --no-echo --no-restore --no-save -e "install.packages('Seurat')" 62 | 63 | # Install SeuratDisk 64 | RUN R --no-echo --no-restore --no-save -e "remotes::install_github('mojaveazure/seurat-disk')" 65 | 66 | # BENGAL specific install 67 | 68 | RUN R --no-echo --no-restore --no-save -e "install.packages('optparse')" 69 | 70 | RUN R --no-echo --no-restore --no-save -e "BiocManager::install('biomaRt')" 71 | 72 | 73 | 74 | 75 | -------------------------------------------------------------------------------- /envs/sceasy.yml: 
-------------------------------------------------------------------------------- 1 | name: sceasy 2 | channels: 3 | - bioconda 4 | - conda-forge 5 | - r 6 | - defaults 7 | dependencies: 8 | - _libgcc_mutex=0.1 9 | - _openmp_mutex=4.5 10 | - _r-mutex=1.0.1 11 | - alsa-lib=1.2.10 12 | - aom=3.6.1 13 | - argcomplete=3.1.2 14 | - attr=2.5.1 15 | - binutils_impl_linux-64=2.40 16 | - bioconductor-batchelor=1.16.0 17 | - bioconductor-beachmat=2.16.0 18 | - bioconductor-biobase=2.60.0 19 | - bioconductor-biocgenerics=0.46.0 20 | - bioconductor-biocio=1.10.0 21 | - bioconductor-biocneighbors=1.18.0 22 | - bioconductor-biocparallel=1.34.2 23 | - bioconductor-biocsingular=1.16.0 24 | - bioconductor-data-packages=20230718 25 | - bioconductor-delayedarray=0.26.6 26 | - bioconductor-delayedmatrixstats=1.22.1 27 | - bioconductor-genomeinfodb=1.36.1 28 | - bioconductor-genomeinfodbdata=1.2.10 29 | - bioconductor-genomicranges=1.52.0 30 | - bioconductor-hdf5array=1.28.1 31 | - bioconductor-iranges=2.34.1 32 | - bioconductor-limma=3.56.2 33 | - bioconductor-loomexperiment=1.18.0 34 | - bioconductor-matrixgenerics=1.12.2 35 | - bioconductor-residualmatrix=1.10.0 36 | - bioconductor-rhdf5=2.44.0 37 | - bioconductor-rhdf5filters=1.12.1 38 | - bioconductor-rhdf5lib=1.22.0 39 | - bioconductor-s4arrays=1.0.4 40 | - bioconductor-s4vectors=0.38.1 41 | - bioconductor-scaledmatrix=1.8.1 42 | - bioconductor-scuttle=1.10.1 43 | - bioconductor-singlecellexperiment=1.22.0 44 | - bioconductor-sparsematrixstats=1.12.2 45 | - bioconductor-summarizedexperiment=1.30.2 46 | - bioconductor-xvector=0.40.0 47 | - bioconductor-zlibbioc=1.46.0 48 | - blosc=1.21.5 49 | - bwidget=1.9.14 50 | - bzip2=1.0.8 51 | - c-ares=1.19.1 52 | - ca-certificates=2023.7.22 53 | - cached-property=1.5.2 54 | - cached_property=1.5.2 55 | - cairo=1.16.0 56 | - cfitsio=4.3.0 57 | - curl=8.3.0 58 | - dav1d=1.2.1 59 | - dbus=1.13.6 60 | - expat=2.5.0 61 | - ffmpeg=6.0.0 62 | - fftw=3.3.10 63 | - font-ttf-dejavu-sans-mono=2.37 64 | 
- font-ttf-inconsolata=3.000 65 | - font-ttf-source-code-pro=2.038 66 | - font-ttf-ubuntu=0.83 67 | - fontconfig=2.14.2 68 | - fonts-conda-ecosystem=1 69 | - fonts-conda-forge=1 70 | - freeglut=3.2.2 71 | - freetype=2.12.1 72 | - freexl=2.0.0 73 | - fribidi=1.0.10 74 | - gcc_impl_linux-64=13.2.0 75 | - geos=3.12.0 76 | - geotiff=1.7.1 77 | - gettext=0.21.1 78 | - gfortran_impl_linux-64=13.2.0 79 | - giflib=5.2.1 80 | - glib=2.78.0 81 | - glib-tools=2.78.0 82 | - glpk=5.0 83 | - gmp=6.2.1 84 | - gnutls=3.7.8 85 | - graphite2=1.3.13 86 | - gsl=2.7 87 | - gst-plugins-base=1.22.6 88 | - gstreamer=1.22.6 89 | - gxx_impl_linux-64=13.2.0 90 | - h5py=3.9.0 91 | - harfbuzz=8.2.1 92 | - hdf4=4.2.15 93 | - hdf5=1.14.2 94 | - icu=73.2 95 | - ihm=0.41 96 | - imp=2.19.0 97 | - jasper=4.0.0 98 | - jq=1.5 99 | - json-c=0.17 100 | - kealib=1.5.2 101 | - kernel-headers_linux-64=2.6.32 102 | - keyutils=1.6.1 103 | - krb5=1.21.2 104 | - lame=3.100 105 | - lcms2=2.15 106 | - ld_impl_linux-64=2.40 107 | - lerc=4.0.0 108 | - libabseil=20230802.1 109 | - libaec=1.1.2 110 | - libarchive=3.7.2 111 | - libass=0.17.1 112 | - libblas=3.9.0 113 | - libboost=1.82.0 114 | - libboost-headers=1.82.0 115 | - libcap=2.69 116 | - libcblas=3.9.0 117 | - libclang=15.0.7 118 | - libclang13=15.0.7 119 | - libcrc32c=1.1.2 120 | - libcups=2.3.3 121 | - libcurl=8.3.0 122 | - libdeflate=1.19 123 | - libdrm=2.4.114 124 | - libedit=3.1.20191231 125 | - libev=4.33 126 | - libevent=2.1.12 127 | - libexpat=2.5.0 128 | - libffi=3.4.2 129 | - libflac=1.4.3 130 | - libgcc=7.2.0 131 | - libgcc-devel_linux-64=13.2.0 132 | - libgcc-ng=13.2.0 133 | - libgcrypt=1.10.1 134 | - libgdal=3.7.2 135 | - libgfortran-ng=13.2.0 136 | - libgfortran5=13.2.0 137 | - libglib=2.78.0 138 | - libglu=9.0.0 139 | - libgomp=13.2.0 140 | - libgoogle-cloud=2.12.0 141 | - libgpg-error=1.47 142 | - libgrpc=1.57.0 143 | - libhwloc=2.9.3 144 | - libiconv=1.17 145 | - libidn2=2.3.4 146 | - libjpeg-turbo=2.1.5.1 147 | - libkml=1.3.0 148 | - 
liblapack=3.9.0 149 | - liblapacke=3.9.0 150 | - libllvm15=15.0.7 151 | - libnetcdf=4.9.2 152 | - libnghttp2=1.52.0 153 | - libnsl=2.0.0 154 | - libogg=1.3.4 155 | - libopenblas=0.3.24 156 | - libopencv=4.8.1 157 | - libopenvino=2023.1.0 158 | - libopenvino-auto-batch-plugin=2023.1.0 159 | - libopenvino-auto-plugin=2023.1.0 160 | - libopenvino-hetero-plugin=2023.1.0 161 | - libopenvino-intel-cpu-plugin=2023.1.0 162 | - libopenvino-intel-gpu-plugin=2023.1.0 163 | - libopenvino-ir-frontend=2023.1.0 164 | - libopenvino-onnx-frontend=2023.1.0 165 | - libopenvino-paddle-frontend=2023.1.0 166 | - libopenvino-pytorch-frontend=2023.1.0 167 | - libopenvino-tensorflow-frontend=2023.1.0 168 | - libopenvino-tensorflow-lite-frontend=2023.1.0 169 | - libopus=1.3.1 170 | - libpciaccess=0.17 171 | - libpng=1.6.39 172 | - libpq=15.4 173 | - libprotobuf=4.23.4 174 | - librttopo=1.1.0 175 | - libsanitizer=13.2.0 176 | - libsndfile=1.2.2 177 | - libspatialite=5.1.0 178 | - libsqlite=3.43.0 179 | - libssh2=1.11.0 180 | - libstdcxx-devel_linux-64=13.2.0 181 | - libstdcxx-ng=13.2.0 182 | - libsystemd0=254 183 | - libtasn1=4.19.0 184 | - libtiff=4.6.0 185 | - libudunits2=2.2.28 186 | - libunistring=0.9.10 187 | - libuuid=2.38.1 188 | - libva=2.20.0 189 | - libvorbis=1.3.7 190 | - libvpx=1.13.1 191 | - libwebp-base=1.3.2 192 | - libxcb=1.15 193 | - libxkbcommon=1.5.0 194 | - libxml2=2.11.5 195 | - libzip=1.10.1 196 | - libzlib=1.2.13 197 | - lz4-c=1.9.4 198 | - lzo=2.10 199 | - make=4.3 200 | - minizip=4.0.1 201 | - mpfr=4.2.0 202 | - mpg123=1.32.3 203 | - mpi=1.0 204 | - mpich=4.1.2 205 | - msgpack-python=1.0.6 206 | - mysql-common=8.0.33 207 | - mysql-libs=8.0.33 208 | - natsort=8.4.0 209 | - ncurses=6.4 210 | - nettle=3.8.1 211 | - nspr=4.35 212 | - nss=3.94 213 | - numpy=1.26.0 214 | - ocl-icd=2.3.1 215 | - ocl-icd-system=1.0.0 216 | - openh264=2.3.1 217 | - openjpeg=2.5.0 218 | - openssl=3.1.3 219 | - p11-kit=0.24.1 220 | - packaging=23.2 221 | - pandas=2.1.1 222 | - pandoc=3.1.3 223 
| - pango=1.50.14 224 | - pcre2=10.40 225 | - pip=23.2.1 226 | - pixman=0.42.2 227 | - poppler=23.08.0 228 | - poppler-data=0.4.12 229 | - postgresql=15.4 230 | - proj=9.3.0 231 | - protobuf=4.23.4 232 | - pthread-stubs=0.4 233 | - pugixml=1.13 234 | - pulseaudio-client=16.1 235 | - python=3.9.18 236 | - python-dateutil=2.8.2 237 | - python-tzdata=2023.3 238 | - python_abi=3.9 239 | - pytz=2023.3.post1 240 | - pyyaml=6.0.1 241 | - qt-main=5.15.8 242 | - r-abind=1.4_5 243 | - r-anndata=0.7.5.4 244 | - r-askpass=1.2.0 245 | - r-assertthat=0.2.1 246 | - r-base=4.3.1 247 | - r-base64enc=0.1_3 248 | - r-bh=1.81.0_1 249 | - r-biglm=0.9_2.1 250 | - r-bitops=1.0_7 251 | - r-boot=1.3_28.1 252 | - r-brio=1.1.3 253 | - r-bslib=0.5.1 254 | - r-cachem=1.0.8 255 | - r-callr=3.7.3 256 | - r-catools=1.18.2 257 | - r-class=7.3_22 258 | - r-classint=0.4_10 259 | - r-cli=3.6.1 260 | - r-cluster=2.1.4 261 | - r-codetools=0.2_19 262 | - r-colorspace=2.1_0 263 | - r-commonmark=1.9.0 264 | - r-cowplot=1.1.1 265 | - r-cpp11=0.4.6 266 | - r-crayon=1.5.2 267 | - r-crosstalk=1.2.0 268 | - r-curl=5.1.0 269 | - r-data.table=1.14.8 270 | - r-dbi=1.1.3 271 | - r-deldir=1.0_9 272 | - r-desc=1.4.2 273 | - r-diffobj=0.3.5 274 | - r-digest=0.6.33 275 | - r-dplyr=1.1.3 276 | - r-dqrng=0.3.1 277 | - r-e1071=1.7_13 278 | - r-ellipsis=0.3.2 279 | - r-evaluate=0.22 280 | - r-fansi=1.0.4 281 | - r-farver=2.1.1 282 | - r-fastmap=1.1.1 283 | - r-fitdistrplus=1.1_11 284 | - r-fnn=1.1.3.2 285 | - r-fontawesome=0.5.2 286 | - r-formatr=1.14 287 | - r-fs=1.6.3 288 | - r-furrr=0.3.1 289 | - r-futile.logger=1.4.3 290 | - r-futile.options=1.0.1 291 | - r-future=1.33.0 292 | - r-future.apply=1.11.0 293 | - r-generics=0.1.3 294 | - r-getopt=1.20.4 295 | - r-ggplot2=3.4.3 296 | - r-ggrepel=0.9.3 297 | - r-ggridges=0.5.4 298 | - r-globals=0.16.2 299 | - r-glue=1.6.2 300 | - r-goftest=1.2_3 301 | - r-gplots=3.1.3 302 | - r-gridextra=2.3 303 | - r-grr=0.9.5 304 | - r-gtable=0.3.4 305 | - r-gtools=3.9.4 306 | - 
r-here=1.0.1 307 | - r-hexbin=1.28.3 308 | - r-highr=0.10 309 | - r-htmltools=0.5.6 310 | - r-htmlwidgets=1.6.2 311 | - r-httpuv=1.6.11 312 | - r-httr=1.4.7 313 | - r-hunspell=3.0.3 314 | - r-ica=1.0_3 315 | - r-igraph=1.5.1 316 | - r-irlba=2.3.5.1 317 | - r-isoband=0.2.7 318 | - r-jquerylib=0.1.4 319 | - r-jsonlite=1.8.7 320 | - r-kernsmooth=2.23_22 321 | - r-knitr=1.44 322 | - r-labeling=0.4.3 323 | - r-lambda.r=1.2.4 324 | - r-later=1.3.1 325 | - r-lattice=0.21_9 326 | - r-lazyeval=0.2.2 327 | - r-leiden=0.4.3 328 | - r-leidenbase=0.1.18 329 | - r-lifecycle=1.0.3 330 | - r-listenv=0.9.0 331 | - r-lmtest=0.9_40 332 | - r-lobstr=1.1.2 333 | - r-lsei=1.3_0 334 | - r-magrittr=2.0.3 335 | - r-mass=7.3_60 336 | - r-matrix=1.6_1.1 337 | - r-matrix.utils=0.9.8 338 | - r-matrixstats=1.0.0 339 | - r-memoise=2.0.1 340 | - r-mgcv=1.9_0 341 | - r-mime=0.12 342 | - r-miniui=0.1.1.1 343 | - r-monocle3=1.0.0 344 | - r-munsell=0.5.0 345 | - r-nlme=3.1_163 346 | - r-npsurv=0.5_0 347 | - r-openssl=2.1.1 348 | - r-optparse=1.7.3 349 | - r-parallelly=1.36.0 350 | - r-patchwork=1.1.3 351 | - r-pbapply=1.7_2 352 | - r-pbmcapply=1.5.1 353 | - r-pheatmap=1.0.12 354 | - r-pillar=1.9.0 355 | - r-pkgbuild=1.4.2 356 | - r-pkgconfig=2.0.3 357 | - r-pkgload=1.3.3 358 | - r-plotly=4.10.2 359 | - r-plyr=1.8.9 360 | - r-png=0.1_8 361 | - r-polyclip=1.10_6 362 | - r-praise=1.0.0 363 | - r-prettyunits=1.2.0 364 | - r-processx=3.8.2 365 | - r-progressr=0.14.0 366 | - r-promises=1.2.1 367 | - r-proxy=0.4_27 368 | - r-pryr=0.1.6 369 | - r-ps=1.7.5 370 | - r-pscl=1.5.5.1 371 | - r-purrr=1.0.2 372 | - r-r6=2.5.1 373 | - r-rann=2.6.1 374 | - r-rappdirs=0.3.3 375 | - r-raster=3.6_23 376 | - r-rcolorbrewer=1.1_3 377 | - r-rcpp=1.0.11 378 | - r-rcppannoy=0.0.21 379 | - r-rcpparmadillo=0.12.6.4.0 380 | - r-rcppeigen=0.3.3.9.3 381 | - r-rcpphnsw=0.5.0 382 | - r-rcppparallel=5.1.6 383 | - r-rcppprogress=0.4.2 384 | - r-rcpptoml=0.2.2 385 | - r-rcurl=1.98_1.12 386 | - r-rematch2=2.1.2 387 | - r-reshape2=1.4.4 
388 | - r-reticulate=1.32.0 389 | - r-rhpcblasctl=0.23_42 390 | - r-rlang=1.1.1 391 | - r-rmarkdown=2.25 392 | - r-rocr=1.0_11 393 | - r-rprojroot=2.0.3 394 | - r-rsample=1.2.0 395 | - r-rsvd=1.0.5 396 | - r-rtsne=0.16 397 | - r-s2=1.1.4 398 | - r-sass=0.4.7 399 | - r-scales=1.2.1 400 | - r-scattermore=1.2 401 | - r-sceasy=0.0.7 402 | - r-sctransform=0.4.0 403 | - r-seurat=4.4.0 404 | - r-seuratobject=4.1.4 405 | - r-sf=1.0_14 406 | - r-shiny=1.7.5 407 | - r-sitmo=2.0.2 408 | - r-slam=0.1_50 409 | - r-slider=0.3.0 410 | - r-snow=0.4_4 411 | - r-sourcetools=0.1.7_1 412 | - r-sp=2.1_0 413 | - r-spatstat.data=3.0_1 414 | - r-spatstat.explore=3.2_3 415 | - r-spatstat.geom=3.2_5 416 | - r-spatstat.random=3.1_6 417 | - r-spatstat.sparse=3.0_2 418 | - r-spatstat.utils=3.0_3 419 | - r-spdata=2.3.0 420 | - r-spdep=1.2_8 421 | - r-speedglm=0.3_5 422 | - r-spelling=2.2.1 423 | - r-stringi=1.7.12 424 | - r-stringr=1.5.0 425 | - r-survival=3.5_7 426 | - r-sys=3.4.2 427 | - r-tensor=1.5 428 | - r-terra=1.7_46 429 | - r-testthat=3.2.0 430 | - r-tibble=3.2.1 431 | - r-tidyr=1.3.0 432 | - r-tidyselect=1.2.0 433 | - r-tinytex=0.47 434 | - r-units=0.8_4 435 | - r-utf8=1.2.3 436 | - r-uwot=0.1.16 437 | - r-vctrs=0.6.3 438 | - r-viridis=0.6.4 439 | - r-viridislite=0.4.2 440 | - r-waldo=0.5.1 441 | - r-warp=0.2.0 442 | - r-withr=2.5.1 443 | - r-wk=0.8.0 444 | - r-xfun=0.40 445 | - r-xml2=1.3.5 446 | - r-xtable=1.8_4 447 | - r-yaml=2.3.7 448 | - r-zoo=1.8_12 449 | - re2=2023.03.02 450 | - readline=8.2 451 | - rmf=1.5.1 452 | - scipy=1.11.3 453 | - sed=4.8 454 | - setuptools=68.2.2 455 | - six=1.16.0 456 | - snappy=1.1.10 457 | - sqlite=3.43.0 458 | - svt-av1=1.7.0 459 | - sysroot_linux-64=2.12 460 | - tbb=2021.10.0 461 | - tiledb=2.16.3 462 | - tk=8.6.13 463 | - tktable=2.10 464 | - toml=0.10.2 465 | - tomlkit=0.12.1 466 | - tzcode=2023c 467 | - tzdata=2023c 468 | - udunits2=2.2.28 469 | - uriparser=0.9.7 470 | - wheel=0.41.2 471 | - x264=1!164.3095 472 | - x265=3.5 473 | - 
xcb-util=0.4.0 474 | - xcb-util-image=0.4.0 475 | - xcb-util-keysyms=0.4.0 476 | - xcb-util-renderutil=0.3.9 477 | - xcb-util-wm=0.4.1 478 | - xerces-c=3.2.4 479 | - xkeyboard-config=2.40 480 | - xmltodict=0.13.0 481 | - xorg-fixesproto=5.0 482 | - xorg-inputproto=2.3.2 483 | - xorg-kbproto=1.0.7 484 | - xorg-libice=1.1.1 485 | - xorg-libsm=1.2.4 486 | - xorg-libx11=1.8.6 487 | - xorg-libxau=1.0.11 488 | - xorg-libxdmcp=1.1.3 489 | - xorg-libxext=1.3.4 490 | - xorg-libxfixes=5.0.3 491 | - xorg-libxi=1.7.10 492 | - xorg-libxrender=0.9.11 493 | - xorg-libxt=1.3.0 494 | - xorg-renderproto=0.11.1 495 | - xorg-xextproto=7.3.0 496 | - xorg-xf86vidmodeproto=2.3.1 497 | - xorg-xproto=7.0.31 498 | - xz=5.2.6 499 | - yaml=0.2.5 500 | - yq=3.2.3 501 | - zlib=1.2.13 502 | - zstd=1.5.5 503 | - pip: 504 | - anndata==0.10.1 505 | - array-api-compat==1.4 506 | - exceptiongroup==1.1.3 507 | -------------------------------------------------------------------------------- /envs/scib.yml: -------------------------------------------------------------------------------- 1 | name: scib-pipeline-R4.0 2 | channels: 3 | - bioconda 4 | - conda-forge 5 | - r 6 | - defaults 7 | dependencies: 8 | - _libgcc_mutex=0.1 9 | - _openmp_mutex=4.5 10 | - _r-mutex=1.0.1 11 | - absl-py=2.0.0 12 | - aioeasywebdav=2.4.0 13 | - aiohttp=3.8.6 14 | - aiosignal=1.3.1 15 | - amply=0.1.6 16 | - anndata=0.8.0 17 | - anndata2ri=1.3.1 18 | - anyio=3.7.1 19 | - appdirs=1.4.4 20 | - argon2-cffi=23.1.0 21 | - argon2-cffi-bindings=21.2.0 22 | - asttokens=2.4.1 23 | - async-timeout=4.0.3 24 | - atk-1.0=2.38.0 25 | - attmap=0.13.2 26 | - attrs=23.1.0 27 | - backcall=0.2.0 28 | - backports=1.0 29 | - backports.functools_lru_cache=1.6.5 30 | - backports.zoneinfo=0.2.1 31 | - bbknn=1.5.1 32 | - bcrypt=4.0.1 33 | - beautifulsoup4=4.12.2 34 | - binutils_impl_linux-64=2.40 35 | - bioconductor-beachmat=2.6.4 36 | - bioconductor-biobase=2.50.0 37 | - bioconductor-biocgenerics=0.36.0 38 | - bioconductor-biocneighbors=1.8.2 39 | - 
bioconductor-biocparallel=1.24.1 40 | - bioconductor-biocsingular=1.6.0 41 | - bioconductor-bluster=1.0.0 42 | - bioconductor-delayedarray=0.16.3 43 | - bioconductor-delayedmatrixstats=1.12.3 44 | - bioconductor-edger=3.32.1 45 | - bioconductor-genomeinfodb=1.26.4 46 | - bioconductor-genomeinfodbdata=1.2.4 47 | - bioconductor-genomicranges=1.42.0 48 | - bioconductor-hdf5array=1.18.1 49 | - bioconductor-iranges=2.24.1 50 | - bioconductor-limma=3.46.0 51 | - bioconductor-matrixgenerics=1.2.1 52 | - bioconductor-rhdf5=2.34.0 53 | - bioconductor-rhdf5filters=1.2.0 54 | - bioconductor-rhdf5lib=1.12.1 55 | - bioconductor-s4vectors=0.28.1 56 | - bioconductor-scater=1.18.6 57 | - bioconductor-scran=1.18.5 58 | - bioconductor-scuttle=1.0.4 59 | - bioconductor-singlecellexperiment=1.12.0 60 | - bioconductor-sparsematrixstats=1.2.1 61 | - bioconductor-summarizedexperiment=1.20.0 62 | - bioconductor-xvector=0.30.0 63 | - bioconductor-zlibbioc=1.36.0 64 | - blas=1.1 65 | - bleach=6.1.0 66 | - blinker=1.6.3 67 | - boto3=1.28.76 68 | - botocore=1.31.76 69 | - bottleneck=1.3.7 70 | - brotli=1.1.0 71 | - brotli-bin=1.1.0 72 | - brotli-python=1.1.0 73 | - bwidget=1.9.14 74 | - bzip2=1.0.8 75 | - c-ares=1.20.1 76 | - ca-certificates=2023.7.22 77 | - cached-property=1.5.2 78 | - cached_property=1.5.2 79 | - cachetools=5.3.2 80 | - cairo=1.16.0 81 | - certifi=2023.7.22 82 | - cffi=1.16.0 83 | - chardet=5.2.0 84 | - charset-normalizer=3.3.2 85 | - chex=0.1.81 86 | - click=8.1.7 87 | - coin-or-cbc=2.10.10 88 | - coin-or-cgl=0.60.7 89 | - coin-or-clp=1.17.8 90 | - coin-or-osi=0.108.8 91 | - coin-or-utils=2.11.9 92 | - coincbc=2.10.10 93 | - colorama=0.4.6 94 | - comm=0.1.4 95 | - configargparse=1.7 96 | - connection_pool=0.0.3 97 | - contourpy=1.1.1 98 | - cryptography=41.0.5 99 | - curl=7.86.0 100 | - cycler=0.11.0 101 | - datrie=0.8.2 102 | - debugpy=1.8.0 103 | - decorator=5.1.1 104 | - defusedxml=0.7.1 105 | - dm-tree=0.1.7 106 | - docrep=0.3.2 107 | - docutils=0.20.1 108 | - 
dpath=2.1.6 109 | - dropbox=11.36.2 110 | - dunamai=1.19.0 111 | - entrypoints=0.4 112 | - et_xmlfile=1.1.0 113 | - exceptiongroup=1.1.3 114 | - executing=2.0.1 115 | - expat=2.5.0 116 | - fbpca=1.0 117 | - filechunkio=1.8 118 | - filelock=3.13.1 119 | - flax=0.6.1 120 | - font-ttf-dejavu-sans-mono=2.37 121 | - font-ttf-inconsolata=3.000 122 | - font-ttf-source-code-pro=2.038 123 | - font-ttf-ubuntu=0.83 124 | - fontconfig=2.14.2 125 | - fonts-conda-ecosystem=1 126 | - fonts-conda-forge=1 127 | - fonttools=4.43.1 128 | - freetype=2.12.1 129 | - fribidi=1.0.10 130 | - frozenlist=1.4.0 131 | - fsspec=2023.1.0 132 | - ftputil=5.0.4 133 | - future=0.18.3 134 | - gcc=13.2.0 135 | - gcc_impl_linux-64=13.2.0 136 | - gdk-pixbuf=2.42.10 137 | - geos=3.11.0 138 | - geosketch=1.2 139 | - get_version=3.5.4 140 | - gettext=0.21.1 141 | - gfortran_impl_linux-64=13.2.0 142 | - giflib=5.2.1 143 | - git=2.42.0 144 | - gitdb=4.0.11 145 | - gitpython=3.1.40 146 | - glib=2.78.0 147 | - glib-tools=2.78.0 148 | - glpk=5.0 149 | - gmp=6.2.1 150 | - google-api-core=2.12.0 151 | - google-api-python-client=2.106.0 152 | - google-auth=2.23.0 153 | - google-auth-httplib2=0.1.1 154 | - google-auth-oauthlib=0.4.6 155 | - google-cloud-core=2.3.3 156 | - google-cloud-storage=2.11.0 157 | - google-crc32c=1.1.2 158 | - google-resumable-media=2.6.0 159 | - googleapis-common-protos=1.61.0 160 | - graphite2=1.3.13 161 | - graphviz=8.0.3 162 | - grpc-cpp=1.48.1 163 | - grpcio=1.48.1 164 | - gsl=2.7 165 | - gtk2=2.24.33 166 | - gts=0.7.6 167 | - gxx=13.2.0 168 | - gxx_impl_linux-64=13.2.0 169 | - h5py=3.8.0 170 | - harfbuzz=6.0.0 171 | - hdf5=1.12.2 172 | - httplib2=0.22.0 173 | - humanfriendly=10.0 174 | - icu=70.1 175 | - idna=3.4 176 | - importlib-metadata=6.8.0 177 | - importlib_metadata=6.8.0 178 | - importlib_resources=6.0.0 179 | - iniconfig=2.0.0 180 | - intel-openmp=2022.1.0 181 | - intervaltree=2.1.0 182 | - ipykernel=6.16.2 183 | - ipython=8.17.2 184 | - ipython_genutils=0.2.0 185 | - 
ipywidgets=8.1.1 186 | - jax=0.3.25 187 | - jaxlib=0.3.25 188 | - jedi=0.19.1 189 | - jinja2=3.1.2 190 | - jmespath=1.0.1 191 | - joblib=1.3.2 192 | - jpeg=9e 193 | - jsonschema=4.17.3 194 | - jupyter_client=7.4.9 195 | - jupyter_core=5.5.0 196 | - jupyter_server=1.23.4 197 | - jupyterlab_pygments=0.2.2 198 | - jupyterlab_widgets=3.0.9 199 | - kernel-headers_linux-64=2.6.32 200 | - keyutils=1.6.1 201 | - kiwisolver=1.4.5 202 | - krb5=1.19.3 203 | - lcms2=2.14 204 | - ld_impl_linux-64=2.40 205 | - lerc=4.0.0 206 | - libabseil=20220623.0 207 | - libblas=3.9.0 208 | - libbrotlicommon=1.1.0 209 | - libbrotlidec=1.1.0 210 | - libbrotlienc=1.1.0 211 | - libcblas=3.9.0 212 | - libcrc32c=1.1.2 213 | - libcurl=7.86.0 214 | - libdeflate=1.14 215 | - libedit=3.1.20191231 216 | - libev=4.33 217 | - libexpat=2.5.0 218 | - libffi=3.4.2 219 | - libgcc-devel_linux-64=13.2.0 220 | - libgcc-ng=13.2.0 221 | - libgd=2.3.3 222 | - libgfortran-ng=13.2.0 223 | - libgfortran5=13.2.0 224 | - libgit2=1.5.1 225 | - libglib=2.78.0 226 | - libgomp=13.2.0 227 | - libhwloc=2.9.1 228 | - libiconv=1.17 229 | - liblapack=3.9.0 230 | - liblapacke=3.9.0 231 | - libllvm11=11.1.0 232 | - libllvm14=14.0.6 233 | - libnghttp2=1.55.1 234 | - libnsl=2.0.1 235 | - libopenblas=0.3.24 236 | - libpng=1.6.39 237 | - libprotobuf=3.21.8 238 | - librsvg=2.54.4 239 | - libsanitizer=13.2.0 240 | - libsodium=1.0.18 241 | - libsqlite=3.44.0 242 | - libssh2=1.11.0 243 | - libstdcxx-devel_linux-64=13.2.0 244 | - libstdcxx-ng=13.2.0 245 | - libtiff=4.4.0 246 | - libtool=2.4.7 247 | - libuuid=2.38.1 248 | - libwebp=1.3.2 249 | - libwebp-base=1.3.2 250 | - libxcb=1.13 251 | - libxml2=2.10.3 252 | - libzlib=1.2.13 253 | - lightning-utilities=0.9.0 254 | - llvmlite=0.40.1 255 | - logmuse=0.2.6 256 | - make=4.3 257 | - markdown=3.5.1 258 | - markdown-it-py=2.2.0 259 | - markupsafe=2.1.3 260 | - matplotlib-base=3.8.1 261 | - matplotlib-inline=0.1.6 262 | - mdurl=0.1.0 263 | - mistune=3.0.1 264 | - mkl=2022.1.0 265 | - 
msgpack-python=1.0.6 266 | - multidict=6.0.4 267 | - multipledispatch=0.6.0 268 | - munkres=1.1.4 269 | - natsort=8.4.0 270 | - nbclassic=1.0.0 271 | - nbclient=0.7.0 272 | - nbconvert=7.6.0 273 | - nbconvert-core=7.6.0 274 | - nbconvert-pandoc=7.6.0 275 | - nbformat=5.8.0 276 | - ncurses=6.4 277 | - nest-asyncio=1.5.8 278 | - networkx=2.7 279 | - ninja=1.11.1 280 | - nomkl=3.0 281 | - notebook=6.5.6 282 | - notebook-shim=0.2.3 283 | - numba=0.57.1 284 | - numexpr=2.8.7 285 | - numpy=1.24.4 286 | - numpyro=0.13.2 287 | - oauth2client=4.1.3 288 | - oauthlib=3.2.2 289 | - openblas=0.3.24 290 | - openjpeg=2.5.0 291 | - openpyxl=3.1.2 292 | - openssl=3.1.4 293 | - opt_einsum=3.3.0 294 | - optax=0.1.5 295 | - packaging=23.2 296 | - pandas=1.5.2 297 | - pandoc=2.19.2 298 | - pandocfilters=1.5.0 299 | - pango=1.50.14 300 | - paramiko=3.3.1 301 | - parso=0.8.3 302 | - patsy=0.5.3 303 | - pcre2=10.40 304 | - peppy=0.35.7 305 | - perl=5.32.1 306 | - pexpect=4.8.0 307 | - pickleshare=0.7.5 308 | - pillow=9.2.0 309 | - pip=23.3.1 310 | - pixman=0.42.2 311 | - pkgutil-resolve-name=1.3.10 312 | - plac=1.4.1 313 | - platformdirs=3.11.0 314 | - pluggy=1.3.0 315 | - ply=3.11 316 | - prettytable=3.7.0 317 | - prometheus_client=0.18.0 318 | - prompt-toolkit=3.0.39 319 | - prompt_toolkit=3.0.39 320 | - protobuf=4.21.8 321 | - psutil=5.9.5 322 | - pthread-stubs=0.4 323 | - ptyprocess=0.7.0 324 | - pulp=2.7.0 325 | - pure_eval=0.2.2 326 | - pyasn1=0.5.0 327 | - pyasn1-modules=0.3.0 328 | - pycparser=2.21 329 | - pydeprecate=0.3.1 330 | - pygments=2.16.1 331 | - pyjwt=2.8.0 332 | - pynacl=1.5.0 333 | - pynndescent=0.5.10 334 | - pyopenssl=23.2.0 335 | - pyparsing=3.1.1 336 | - pyro-api=0.1.2 337 | - pyro-ppl=1.8.6 338 | - pyrsistent=0.20.0 339 | - pysftp=0.2.9 340 | - pysocks=1.7.1 341 | - pytest=7.4.3 342 | - python=3.10.10 343 | - python-annoy=1.17.2 344 | - python-dateutil=2.8.2 345 | - python-fastjsonschema=2.18.1 346 | - python-irodsclient=1.1.9 347 | - python-tzdata=2023.3 348 | - 
python_abi=3.10 349 | - pytorch=1.12.1 350 | - pytorch-lightning=1.5.8 351 | - pytz=2023.3.post1 352 | - pytz-deprecation-shim=0.1.0.post0 353 | - pyu2f=0.1.5 354 | - pyyaml=6.0.1 355 | - pyzmq=24.0.1 356 | - r-abind=1.4_5 357 | - r-askpass=1.1 358 | - r-assertthat=0.2.1 359 | - r-backports=1.4.1 360 | - r-base=4.0.5 361 | - r-base64enc=0.1_3 362 | - r-beeswarm=0.4.0 363 | - r-bh=1.78.0_0 364 | - r-bit=4.0.4 365 | - r-bit64=4.0.5 366 | - r-bitops=1.0_7 367 | - r-blob=1.2.3 368 | - r-boot=1.3_28 369 | - r-brew=1.0_7 370 | - r-brio=1.1.3 371 | - r-broom=1.0.1 372 | - r-bslib=0.4.0 373 | - r-cachem=1.0.6 374 | - r-callr=3.7.2 375 | - r-caret=6.0_93 376 | - r-catools=1.18.2 377 | - r-cellranger=1.1.0 378 | - r-class=7.3_20 379 | - r-cli=3.4.1 380 | - r-clipr=0.8.0 381 | - r-cluster=2.1.3 382 | - r-codetools=0.2_18 383 | - r-colorspace=2.0_3 384 | - r-commonmark=1.8.0 385 | - r-cowplot=1.1.1 386 | - r-cpp11=0.4.2 387 | - r-crayon=1.5.1 388 | - r-credentials=1.3.2 389 | - r-crosstalk=1.2.0 390 | - r-crul=1.3 391 | - r-curl=4.3.2 392 | - r-data.table=1.14.2 393 | - r-dbi=1.1.3 394 | - r-dbplyr=2.2.1 395 | - r-deldir=1.0_6 396 | - r-desc=1.4.2 397 | - r-devtools=2.4.4 398 | - r-diffobj=0.3.5 399 | - r-digest=0.6.29 400 | - r-downlit=0.4.2 401 | - r-dplyr=1.0.10 402 | - r-dqrng=0.3.0 403 | - r-dtplyr=1.2.2 404 | - r-e1071=1.7_11 405 | - r-ellipsis=0.3.2 406 | - r-essentials=4.0 407 | - r-evaluate=0.16 408 | - r-fansi=1.0.3 409 | - r-farver=2.1.1 410 | - r-fastmap=1.1.0 411 | - r-fitdistrplus=1.1_8 412 | - r-fnn=1.1.3.1 413 | - r-fontawesome=0.3.0 414 | - r-forcats=0.5.2 415 | - r-foreach=1.5.2 416 | - r-foreign=0.8_82 417 | - r-formatr=1.12 418 | - r-fs=1.5.2 419 | - r-futile.logger=1.4.3 420 | - r-futile.options=1.0.1 421 | - r-future=1.28.0 422 | - r-future.apply=1.9.1 423 | - r-gargle=1.2.1 424 | - r-generics=0.1.3 425 | - r-gert=1.5.0 426 | - r-ggbeeswarm=0.6.0 427 | - r-ggplot2=3.3.6 428 | - r-ggrepel=0.9.1 429 | - r-ggridges=0.5.4 430 | - r-gh=1.3.1 431 | - 
r-gistr=0.9.0 432 | - r-gitcreds=0.1.2 433 | - r-glmnet=4.1_2 434 | - r-globals=0.16.1 435 | - r-glue=1.6.2 436 | - r-goftest=1.2_3 437 | - r-googledrive=2.0.0 438 | - r-googlesheets4=1.0.1 439 | - r-gower=1.0.0 440 | - r-gplots=3.1.3 441 | - r-gridextra=2.3 442 | - r-gtable=0.3.1 443 | - r-gtools=3.9.3 444 | - r-hardhat=1.2.0 445 | - r-haven=2.5.0 446 | - r-here=1.0.1 447 | - r-hexbin=1.28.2 448 | - r-highr=0.9 449 | - r-hms=1.1.2 450 | - r-htmltools=0.5.3 451 | - r-htmlwidgets=1.5.4 452 | - r-httpcode=0.3.0 453 | - r-httpuv=1.6.6 454 | - r-httr=1.4.4 455 | - r-ica=1.0_3 456 | - r-ids=1.0.1 457 | - r-igraph=1.3.4 458 | - r-ini=0.3.1 459 | - r-ipred=0.9_13 460 | - r-irdisplay=1.1 461 | - r-irkernel=1.3 462 | - r-irlba=2.3.5 463 | - r-isoband=0.2.5 464 | - r-iterators=1.0.14 465 | - r-jquerylib=0.1.4 466 | - r-jsonlite=1.8.0 467 | - r-kernsmooth=2.23_20 468 | - r-knitr=1.40 469 | - r-labeling=0.4.2 470 | - r-lambda.r=1.2.4 471 | - r-later=1.2.0 472 | - r-lattice=0.20_45 473 | - r-lava=1.6.10 474 | - r-lazyeval=0.2.2 475 | - r-leiden=0.4.3 476 | - r-lifecycle=1.0.2 477 | - r-listenv=0.8.0 478 | - r-lmtest=0.9_40 479 | - r-lobstr=1.1.2 480 | - r-locfit=1.5_9.4 481 | - r-lsei=1.3_0 482 | - r-lubridate=1.8.0 483 | - r-magrittr=2.0.3 484 | - r-maps=3.4.0 485 | - r-mass=7.3_58.1 486 | - r-matrix=1.4_1 487 | - r-matrixstats=0.62.0 488 | - r-memoise=2.0.1 489 | - r-mgcv=1.8_40 490 | - r-mime=0.12 491 | - r-miniui=0.1.1.1 492 | - r-modelmetrics=1.2.2.2 493 | - r-modelr=0.1.9 494 | - r-munsell=0.5.0 495 | - r-nlme=3.1_159 496 | - r-nnet=7.3_17 497 | - r-npsurv=0.5_0 498 | - r-numderiv=2016.8_1.1 499 | - r-openssl=2.0.3 500 | - r-parallelly=1.32.1 501 | - r-patchwork=1.1.2 502 | - r-pbapply=1.5_0 503 | - r-pbdzmq=0.3_7 504 | - r-pillar=1.8.1 505 | - r-pkgbuild=1.3.1 506 | - r-pkgconfig=2.0.3 507 | - r-pkgdown=2.0.6 508 | - r-pkgload=1.3.0 509 | - r-plotly=4.10.0 510 | - r-plyr=1.8.7 511 | - r-png=0.1_7 512 | - r-polyclip=1.10_0 513 | - r-praise=1.0.0 514 | - 
r-prettyunits=1.1.1 515 | - r-proc=1.18.0 516 | - r-processx=3.7.0 517 | - r-prodlim=2019.11.13 518 | - r-profvis=0.3.7 519 | - r-progress=1.2.2 520 | - r-progressr=0.11.0 521 | - r-promises=1.2.0.1 522 | - r-proxy=0.4_27 523 | - r-pryr=0.1.5 524 | - r-ps=1.7.1 525 | - r-purrr=0.3.4 526 | - r-quantmod=0.4.20 527 | - r-r6=2.5.1 528 | - r-ragg=0.3.1 529 | - r-randomforest=4.6_14 530 | - r-rann=2.6.1 531 | - r-rappdirs=0.3.3 532 | - r-rbokeh=0.5.2 533 | - r-rcmdcheck=1.4.0 534 | - r-rcolorbrewer=1.1_3 535 | - r-rcpp=1.0.9 536 | - r-rcppannoy=0.0.19 537 | - r-rcpparmadillo=0.11.2.3.1 538 | - r-rcppeigen=0.3.3.9.2 539 | - r-rcpphnsw=0.4.1 540 | - r-rcppparallel=5.1.5 541 | - r-rcppprogress=0.4.2 542 | - r-rcpptoml=0.1.7 543 | - r-rcurl=1.98_1.8 544 | - r-readr=2.1.2 545 | - r-readxl=1.4.1 546 | - r-recipes=1.0.1 547 | - r-recommended=4.0 548 | - r-rematch=1.0.1 549 | - r-rematch2=2.1.2 550 | - r-remotes=2.4.2 551 | - r-repr=1.1.4 552 | - r-reprex=2.0.2 553 | - r-reshape2=1.4.4 554 | - r-reticulate=1.30 555 | - r-rgeos=0.5_9 556 | - r-rlang=1.0.6 557 | - r-rmarkdown=2.16 558 | - r-rocr=1.0_11 559 | - r-roxygen2=7.2.1 560 | - r-rpart=4.1.16 561 | - r-rprojroot=2.0.3 562 | - r-rspectra=0.16_1 563 | - r-rstudioapi=0.14 564 | - r-rsvd=1.0.5 565 | - r-rtsne=0.16 566 | - r-rversions=2.1.2 567 | - r-rvest=1.0.3 568 | - r-sass=0.4.2 569 | - r-scales=1.2.1 570 | - r-scattermore=0.8 571 | - r-sctransform=0.3.4 572 | - r-selectr=0.4_2 573 | - r-sessioninfo=1.2.2 574 | - r-seurat=3.2.3 575 | - r-seuratobject=4.1.1 576 | - r-shape=1.4.6 577 | - r-shiny=1.7.2 578 | - r-sitmo=2.0.2 579 | - r-snow=0.4_4 580 | - r-sourcetools=0.1.7 581 | - r-sp=1.5_0 582 | - r-spatial=7.3_15 583 | - r-spatstat=1.64_1 584 | - r-spatstat.data=2.2_0 585 | - r-spatstat.utils=2.3_1 586 | - r-squarem=2021.1 587 | - r-statmod=1.4.37 588 | - r-stringi=1.7.8 589 | - r-stringr=1.4.1 590 | - r-survival=3.4_0 591 | - r-sys=3.4 592 | - r-systemfonts=1.0.4 593 | - r-tensor=1.5 594 | - r-testthat=3.1.4 595 | - 
r-tibble=3.1.8 596 | - r-tidyr=1.2.1 597 | - r-tidyselect=1.1.2 598 | - r-tidyverse=1.3.2 599 | - r-timedate=4021.104 600 | - r-tinytex=0.42 601 | - r-triebeard=0.3.0 602 | - r-ttr=0.24.3 603 | - r-tzdb=0.3.0 604 | - r-urlchecker=1.0.1 605 | - r-urltools=1.7.3 606 | - r-usethis=2.1.6 607 | - r-utf8=1.2.2 608 | - r-uuid=1.1_0 609 | - r-uwot=0.1.14 610 | - r-vctrs=0.4.1 611 | - r-vipor=0.4.5 612 | - r-viridis=0.6.2 613 | - r-viridislite=0.4.1 614 | - r-vroom=1.5.7 615 | - r-waldo=0.4.0 616 | - r-whisker=0.4 617 | - r-withr=2.5.0 618 | - r-xfun=0.33 619 | - r-xml2=1.3.3 620 | - r-xopen=1.0.0 621 | - r-xtable=1.8_4 622 | - r-xts=0.12.1 623 | - r-yaml=2.3.5 624 | - r-zip=2.2.1 625 | - r-zoo=1.8_11 626 | - re2=2022.06.01 627 | - readline=8.2 628 | - requests=2.31.0 629 | - requests-oauthlib=1.3.1 630 | - reretry=0.11.8 631 | - rich=13.6.0 632 | - rsa=4.9 633 | - s3transfer=0.7.0 634 | - scanorama=1.7 635 | - scanpy=1.9.2 636 | - scikit-learn=1.3.2 637 | - scipy=1.11.3 638 | - scvi-tools=0.16.1 639 | - seaborn=0.13.0 640 | - seaborn-base=0.13.0 641 | - sed=4.8 642 | - send2trash=1.8.2 643 | - session-info=1.0.0 644 | - setuptools=68.2.2 645 | - setuptools-scm=8.0.4 646 | - simplegeneric=0.8.1 647 | - six=1.16.0 648 | - slacker=0.14.0 649 | - sleef=3.5.1 650 | - smart_open=6.4.0 651 | - smmap=5.0.0 652 | - snakemake=7.24.0 653 | - snakemake-minimal=7.24.0 654 | - sniffio=1.3.0 655 | - sortedcontainers=2.4.0 656 | - soupsieve=2.3.2.post1 657 | - sqlite=3.44.0 658 | - stack_data=0.6.2 659 | - statsmodels=0.14.0 660 | - stdlib-list=0.8.0 661 | - stone=3.3.1 662 | - stopit=1.1.2 663 | - sysroot_linux-64=2.12 664 | - tabulate=0.9.0 665 | - tensorboard=2.11.2 666 | - tensorboard-data-server=0.6.1 667 | - tensorboard-plugin-wit=1.8.1 668 | - terminado=0.17.1 669 | - threadpoolctl=3.1.0 670 | - throttler=1.2.2 671 | - tinycss2=1.2.1 672 | - tk=8.6.13 673 | - tktable=2.10 674 | - tomli=2.0.1 675 | - toolz=0.12.0 676 | - toposort=1.7 677 | - torchmetrics=1.0.3 678 | - tornado=6.3.3 
679 |   - tqdm=4.66.1
680 |   - traitlets=5.9.0
681 |   - typing-extensions=4.7.1
682 |   - typing_extensions=4.7.1
683 |   - tzdata=2023c
684 |   - tzlocal=5.2
685 |   - ubiquerg=0.6.3
686 |   - umap-learn=0.5.4
687 |   - unicodedata2=15.1.0
688 |   - uritemplate=4.1.1
689 |   - urllib3=1.26.18
690 |   - veracitools=0.1.3
691 |   - wcwidth=0.2.9
692 |   - webencodings=0.5.1
693 |   - websocket-client=1.6.1
694 |   - werkzeug=2.2.3
695 |   - wheel=0.41.3
696 |   - widgetsnbextension=4.0.9
697 |   - wrapt=1.15.0
698 |   - xorg-kbproto=1.0.7
699 |   - xorg-libice=1.0.10
700 |   - xorg-libsm=1.2.3
701 |   - xorg-libx11=1.8.4
702 |   - xorg-libxau=1.0.11
703 |   - xorg-libxdmcp=1.1.3
704 |   - xorg-libxext=1.3.4
705 |   - xorg-libxrender=0.9.10
706 |   - xorg-libxt=1.3.0
707 |   - xorg-renderproto=0.11.1
708 |   - xorg-xextproto=7.3.0
709 |   - xorg-xproto=7.0.31
710 |   - xz=5.2.6
711 |   - yaml=0.2.5
712 |   - yarl=1.9.2
713 |   - yte=1.5.1
714 |   - zeromq=4.3.5
715 |   - zipp=3.15.0
716 |   - zlib=1.2.13
717 |   - zstd=1.5.5
718 |   - pip:
719 |     - deprecated==1.2.14
720 |     - igraph==0.10.8
721 |     - leidenalg==0.10.1
722 |     - pydot==1.4.2
723 |     - rpy2==3.5.14
724 |     - scib==1.1.4
725 |     - scikit-misc==0.3.0
726 |     - tbb==2021.10.0
727 |     - texttable==1.7.0
728 |
--------------------------------------------------------------------------------
/envs/scvi.yml:
--------------------------------------------------------------------------------
1 | name: scvi
2 | channels:
3 |   - bioconda
4 |   - conda-forge
5 |   - r
6 |   - defaults
7 | dependencies:
8 |   - _libgcc_mutex=0.1=conda_forge
9 |   - _openmp_mutex=4.5=2_gnu
10 |   - anndata=0.9.2=pyhd8ed1ab_0
11 |   - brotli=1.1.0=hd590300_1
12 |   - brotli-bin=1.1.0=hd590300_1
13 |   - bzip2=1.0.8=h7f98852_4
14 |   - c-ares=1.20.1=hd590300_0
15 |   - ca-certificates=2023.7.22=hbcca054_0
16 |   - cached-property=1.5.2=hd8ed1ab_1
17 |   - cached_property=1.5.2=pyha770c72_1
18 |   - certifi=2023.7.22=pyhd8ed1ab_0
19 |   - colorama=0.4.6=pyhd8ed1ab_0
20 |   - contourpy=1.1.1=py310hd41b1e2_1
21 |   - cudatoolkit=11.8.0=h4ba93d1_12
22 |   - 
cycler=0.12.1=pyhd8ed1ab_0 23 | - fonttools=4.43.1=py310h2372a71_0 24 | - freetype=2.12.1=h267a509_2 25 | - h5py=3.10.0=nompi_py310ha2ad45a_100 26 | - hdf5=1.14.2=nompi_h4f84152_100 27 | - icu=73.2=h59595ed_0 28 | - importlib-metadata=6.8.0=pyha770c72_0 29 | - importlib_metadata=6.8.0=hd8ed1ab_0 30 | - joblib=1.3.2=pyhd8ed1ab_0 31 | - keyutils=1.6.1=h166bdaf_0 32 | - kiwisolver=1.4.5=py310hd41b1e2_1 33 | - krb5=1.21.2=h659d440_0 34 | - lcms2=2.15=hb7c19ff_3 35 | - ld_impl_linux-64=2.40=h41732ed_0 36 | - lerc=4.0.0=h27087fc_0 37 | - libaec=1.1.2=h59595ed_1 38 | - libblas=3.9.0=18_linux64_openblas 39 | - libbrotlicommon=1.1.0=hd590300_1 40 | - libbrotlidec=1.1.0=hd590300_1 41 | - libbrotlienc=1.1.0=hd590300_1 42 | - libcblas=3.9.0=18_linux64_openblas 43 | - libcurl=8.4.0=hca28451_0 44 | - libdeflate=1.19=hd590300_0 45 | - libedit=3.1.20191231=he28a2e2_2 46 | - libev=4.33=h516909a_1 47 | - libffi=3.4.2=h7f98852_5 48 | - libgcc-ng=13.2.0=h807b86a_2 49 | - libgfortran-ng=13.2.0=h69a702a_2 50 | - libgfortran5=13.2.0=ha4646dd_2 51 | - libgomp=13.2.0=h807b86a_2 52 | - libhwloc=2.9.3=default_h554bfaf_1009 53 | - libiconv=1.17=h166bdaf_0 54 | - libjpeg-turbo=3.0.0=hd590300_1 55 | - liblapack=3.9.0=18_linux64_openblas 56 | - libllvm14=14.0.6=hcd5def8_4 57 | - libnghttp2=1.52.0=h61bc06f_0 58 | - libnsl=2.0.0=hd590300_1 59 | - libopenblas=0.3.24=pthreads_h413a1c8_0 60 | - libpng=1.6.39=h753d276_0 61 | - libsqlite=3.43.2=h2797004_0 62 | - libssh2=1.11.0=h0841786_0 63 | - libstdcxx-ng=13.2.0=h7e041cc_2 64 | - libtiff=4.6.0=ha9c0a0a_2 65 | - libuuid=2.38.1=h0b41bf4_0 66 | - libwebp-base=1.3.2=hd590300_0 67 | - libxcb=1.15=h0b41bf4_0 68 | - libxml2=2.11.5=h232c23b_1 69 | - libzlib=1.2.13=hd590300_5 70 | - llvmlite=0.40.1=py310h1b8f574_0 71 | - matplotlib-base=3.8.0=py310h62c0568_2 72 | - munkres=1.0.7=py_1 73 | - natsort=8.4.0=pyhd8ed1ab_0 74 | - ncurses=6.4=hcb278e6_0 75 | - networkx=3.1=pyhd8ed1ab_0 76 | - numba=0.57.1=py310h0f6aa51_0 77 | - numpy=1.24.4=py310ha4c1d20_0 78 | - 
openjpeg=2.5.0=h488ebb8_3 79 | - openssl=3.1.3=hd590300_0 80 | - packaging=23.2=pyhd8ed1ab_0 81 | - pandas=1.5.3=py310h9b08913_1 82 | - patsy=0.5.3=pyhd8ed1ab_0 83 | - pillow=10.0.1=py310h01dd4db_2 84 | - pip=23.2.1=pyhd8ed1ab_0 85 | - pthread-stubs=0.4=h36c2ea0_1001 86 | - pynndescent=0.5.10=pyh1a96a4e_0 87 | - pyparsing=3.1.1=pyhd8ed1ab_0 88 | - python=3.10.10=he550d4f_0_cpython 89 | - python-dateutil=2.8.2=pyhd8ed1ab_0 90 | - python-tzdata=2023.3=pyhd8ed1ab_0 91 | - python_abi=3.10=4_cp310 92 | - pytz=2023.3.post1=pyhd8ed1ab_0 93 | - readline=8.2=h8228510_1 94 | - scanpy=1.9.2=pyhd8ed1ab_0 95 | - scikit-learn=1.3.1=py310h1fdf081_1 96 | - scipy=1.11.3=py310hb13e2d6_1 97 | - seaborn=0.13.0=hd8ed1ab_0 98 | - seaborn-base=0.13.0=pyhd8ed1ab_0 99 | - session-info=1.0.0=pyhd8ed1ab_0 100 | - setuptools=68.2.2=pyhd8ed1ab_0 101 | - six=1.16.0=pyh6c4a22f_0 102 | - statsmodels=0.14.0=py310h1f7b6fc_2 103 | - stdlib-list=0.8.0=pyhd8ed1ab_0 104 | - tbb=2021.10.0=h00ab1b0_1 105 | - threadpoolctl=3.2.0=pyha21a80b_0 106 | - tk=8.6.13=h2797004_0 107 | - tqdm=4.66.1=pyhd8ed1ab_0 108 | - tzdata=2023c=h71feb2d_0 109 | - umap-learn=0.5.4=py310hff52083_0 110 | - unicodedata2=15.1.0=py310h2372a71_0 111 | - wheel=0.41.2=pyhd8ed1ab_0 112 | - xorg-libxau=1.0.11=hd590300_0 113 | - xorg-libxdmcp=1.1.3=h7f98852_0 114 | - xz=5.2.6=h166bdaf_0 115 | - zipp=3.17.0=pyhd8ed1ab_0 116 | - zstd=1.5.5=hfc55251_0 117 | - pip: 118 | - absl-py==2.0.0 119 | - aiohttp==3.8.6 120 | - aiosignal==1.3.1 121 | - annotated-types==0.6.0 122 | - anyio==3.7.1 123 | - arrow==1.3.0 124 | - async-timeout==4.0.3 125 | - attrs==23.1.0 126 | - backoff==2.2.1 127 | - beautifulsoup4==4.12.2 128 | - blessed==1.20.0 129 | - charset-normalizer==3.3.0 130 | - chex==0.1.7 131 | - click==8.1.7 132 | - contextlib2==21.6.0 133 | - croniter==1.4.1 134 | - dateutils==0.6.12 135 | - deepdiff==6.6.0 136 | - dm-tree==0.1.8 137 | - docrep==0.3.2 138 | - etils==1.5.0 139 | - exceptiongroup==1.1.3 140 | - fastapi==0.103.2 141 | - 
filelock==3.12.4 142 | - flax==0.7.4 143 | - frozenlist==1.4.0 144 | - fsspec==2023.9.2 145 | - h11==0.14.0 146 | - idna==3.4 147 | - importlib-resources==6.1.0 148 | - inquirer==3.1.3 149 | - itsdangerous==2.1.2 150 | - jax==0.4.18 151 | - jaxlib==0.4.18 152 | - jinja2==3.1.2 153 | - lightning==2.0.9.post0 154 | - lightning-cloud==0.5.39 155 | - lightning-utilities==0.9.0 156 | - markdown-it-py==3.0.0 157 | - markupsafe==2.1.3 158 | - mdurl==0.1.2 159 | - ml-collections==0.1.1 160 | - ml-dtypes==0.3.1 161 | - mpmath==1.3.0 162 | - msgpack==1.0.7 163 | - mudata==0.2.3 164 | - multidict==6.0.4 165 | - multipledispatch==1.0.0 166 | - nest-asyncio==1.5.8 167 | - numpyro==0.13.2 168 | - nvidia-cublas-cu12==12.1.3.1 169 | - nvidia-cuda-cupti-cu12==12.1.105 170 | - nvidia-cuda-nvrtc-cu12==12.1.105 171 | - nvidia-cuda-runtime-cu12==12.1.105 172 | - nvidia-cudnn-cu12==8.9.2.26 173 | - nvidia-cufft-cu12==11.0.2.54 174 | - nvidia-curand-cu12==10.3.2.106 175 | - nvidia-cusolver-cu12==11.4.5.107 176 | - nvidia-cusparse-cu12==12.1.0.106 177 | - nvidia-nccl-cu12==2.18.1 178 | - nvidia-nvjitlink-cu12==12.2.140 179 | - nvidia-nvtx-cu12==12.1.105 180 | - opt-einsum==3.3.0 181 | - optax==0.1.7 182 | - orbax-checkpoint==0.4.1 183 | - ordered-set==4.1.0 184 | - protobuf==4.24.4 185 | - psutil==5.9.5 186 | - pydantic==2.1.1 187 | - pydantic-core==2.4.0 188 | - pygments==2.16.1 189 | - pyjwt==2.8.0 190 | - pymde==0.1.18 191 | - pyro-api==0.1.2 192 | - pyro-ppl==1.8.6 193 | - python-editor==1.0.4 194 | - python-multipart==0.0.6 195 | - pytorch-lightning==2.0.9.post0 196 | - pyyaml==6.0.1 197 | - readchar==4.0.5 198 | - requests==2.31.0 199 | - rich==13.6.0 200 | - scvi-tools==1.0.3 201 | - sniffio==1.3.0 202 | - soupsieve==2.5 203 | - sparse==0.14.0 204 | - starlette==0.27.0 205 | - starsessions==1.3.0 206 | - sympy==1.12 207 | - tensorstore==0.1.45 208 | - toolz==0.12.0 209 | - torch==2.1.0 210 | - torchmetrics==1.2.0 211 | - torchvision==0.16.0 212 | - traitlets==5.11.2 213 | - 
triton==2.1.0
214 |     - types-python-dateutil==2.8.19.14
215 |     - typing-extensions==4.8.0
216 |     - urllib3==2.0.6
217 |     - uvicorn==0.23.2
218 |     - wcwidth==0.2.8
219 |     - websocket-client==1.6.4
220 |     - websockets==11.0.3
221 |     - xarray==2023.9.0
222 |     - yarl==1.9.2
223 |
--------------------------------------------------------------------------------
/example_metadata_nf.tsv:
--------------------------------------------------------------------------------
1 | drerio /some/path/drerio_embryo_final_gene_id.h5ad
2 | xtropicalis /some/path/xtropicalis_embryo_final_gene_id.h5ad
3 |
--------------------------------------------------------------------------------