├── LICENSE
├── README.md
├── bin
    ├── LIGER_integration.R
    ├── alignment_score.py
    ├── concat_by_homology_multiple_species_by_gene_id.R
    ├── concat_by_homology_rligerUINMF_multiple_species.R
    ├── convert_format.R
    ├── fastMNN_integration.R
    ├── harmony_integration.py
    ├── kbet_multiple_species.R
    ├── rliger_integration_UINMF_multiple_species.R
    ├── scANVI_integration.py
    ├── scIB_metrics.py
    ├── scIB_metrics_individual.py
    ├── scIB_trajectory.py
    ├── scanorama_integration.py
    ├── sccaf_assessment_metadata.py
    ├── sccaf_kNN_distance.py
    ├── sccaf_projection_multiple_species.py
    ├── scvi_integration.py
    ├── seurat_CCA_integration.R
    ├── seurat_RPCA_integration.R
    └── validate_input.py
├── concat_by_homology_multiple_species.nf
├── config
    └── example.config
├── containers
    └── README.md
├── cross_species_assessment_multiple_species_individual.nf
├── cross_species_integration_multiple_species.nf
├── dockerfiles
    ├── ALCS.Dockerfile
    ├── scanpy.Dockerfile
    └── seurat.Dockerfile
├── envs
    ├── sceasy.yml
    ├── scib.yml
    └── scvi.yml
└── example_metadata_nf.tsv


/README.md:
--------------------------------------------------------------------------------
  1 | ## BENGAL: BENchmarking strateGies for cross-species integrAtion of singLe-cell RNA sequencing data ##
  2 | 
  3 | Author&maintainer: Yuyao Song <ysong@ebi.ac.uk>
  4 | 
  5 | A Nextflow DSL2 pipeline to perform cross-species single-cell RNA-seq data integration and assessment of integration results.
  6 | 
  7 | **In Oct 2023, Yuyao updated the pipeline: containerization, improved anndata/Seurat conversion, and updates to the scripts and to Nextflow.**
  8 | 
  9 | ## This repo includes:
 10 | 
 11 | 1) the Nextflow pipeline for cross-species integration of scRNA-seq data using various strategies
 12 | 2) container paths, the dockerfiles used, and conda environments
 13 | 3) an example config file for running the Nextflow pipeline
 14 | 
 15 | ## System requirements
 16 | #### Hardware:
 17 | This workflow is written to be executed on HPC clusters with the LSF job scheduler. It can be adapted to other schedulers by changing the job resource syntax in the Nextflow config file. If the GPU implementation of scVI/scANVI is to be used (beneficial for speeding up integration on large datasets), GPU computing nodes are required; please refer to the [scVI-tools site](https://scvi-tools.org/) for the respective setup.
 18 | 
 19 | #### OS:
 20 | This workflow was developed on Rocky Linux 8.5 (RHEL); in theory it can be run on any Linux distribution. To run the GPU implementation of scVI/scANVI we used Nvidia Tesla V100 GPUs.
 21 | 
 22 | ## Installation:
 23 | 
 24 | #### Clone the source code of BENGAL:
 25 | `git clone -b main git@github.com:Functional-Genomics/BENGAL.git`
 26 | 
 27 | **If Nextflow or Singularity is not installed on your cluster, install them. This can take some effort, and it may be worth discussing with your cluster IT managers. Please refer to the [nextflow documentation](https://www.nextflow.io/docs/latest/getstarted.html) or the [singularity documentation](https://singularity-tutorial.github.io/01-installation/).**
 28 | 
 29 | 
 30 | ## Inputs
 31 | The Nextflow script takes one input file: a tab-separated metadata file mapping species to the paths of raw count AnnData objects, in the form of .h5ad files. See example: `example_metadata_nf.tsv`.
 32 | 
 33 | The config file defines project directories and parameters. See example: `config/example.config`
 34 | 
 35 | **Please change the metadata file, and the directories and parameters in the config file, as appropriate for your own application.**
 36 | **Reading the instructions in the config file and changing the relevant entries is crucial for running the pipeline.**
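
As a quick sanity check before launching the pipeline, the metadata file can be parsed with a few lines of Python. This is a minimal sketch and not part of BENGAL: `read_metadata` is a hypothetical helper, the paths are placeholders, and it assumes the file has no header row and two columns (species, then path), which is how the concatenation script reads it.

```python
import csv
import io


def read_metadata(handle):
    """Return {species: h5ad_path} from a headerless, two-column TSV."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected species<TAB>path, got: {row!r}")
        species, path = row[0].strip(), row[1].strip()
        if species in mapping:
            raise ValueError(f"duplicate species: {species}")
        mapping[species] = path
    return mapping


# Hypothetical file content; the paths are placeholders.
example = "hsapiens\t/path/to/human.h5ad\nmmusculus\t/path/to/mouse.h5ad\n"
metadata = read_metadata(io.StringIO(example))
print(metadata)
```

For a real file, pass an open handle instead of the `io.StringIO` stand-in, for example `open("example_metadata_nf.tsv", newline="")`. The row order is preserved, which matters because the concatenation script treats the first species as the reference for the homology query.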
 37 | 
 38 | #### Input Requirements:
 39 | *These requirements will be checked in the first process of the pipeline.*
 40 | 
 41 | The raw count AnnData objects need to have the following row or column annotations. Note that the exact column name of each key is specified in the config file.
 42 | 
 43 | 1) a `species_key` in adata.obs to store species identity. Naming should be in line with the short name in ENSEMBL, such as hsapiens; mmusculus; drerio etc.
 44 | 2) a `cluster_key` in adata.obs to store cell types. If assessment is performed, this column will be used to match homologous cell types across species. Preferably, use [Cell Ontology annotation](https://obofoundry.org/ontology/cl.html). 
 45 | 3) `mean_counts` in adata.var computed by `sc.pp.calculate_qc_metrics` from [scanpy](https://github.com/scverse/scanpy).
 46 | 
 47 | The .var_names of the raw count AnnData file should be ENSEMBL gene ids.
 48 | The .X of the raw count AnnData file should be stored in dense matrix format, if SeuratDisk is used for .h5ad/.h5seurat conversion.
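
The annotation requirements above can be checked by hand before submitting a run. The sketch below is an illustration only: `missing_annotations` is a hypothetical helper, the key names `species` and `cell_type` are assumptions standing in for whatever is set in the config file, and it does not reproduce the pipeline's own validation step.

```python
def missing_annotations(obs_columns, var_columns, species_key, cluster_key):
    """List the required annotations that are absent from obs/var."""
    missing = []
    if species_key not in obs_columns:
        missing.append(f"obs['{species_key}']")
    if cluster_key not in obs_columns:
        missing.append(f"obs['{cluster_key}']")
    # mean_counts is the column written by sc.pp.calculate_qc_metrics
    if "mean_counts" not in var_columns:
        missing.append("var['mean_counts']")
    return missing


# Stand-in column lists; with a real object these would be
# adata.obs.columns and adata.var.columns.
obs_columns = ["species", "cell_type", "batch"]
var_columns = ["n_cells_by_counts"]
problems = missing_annotations(obs_columns, var_columns, "species", "cell_type")
print(problems)
```

Here the result flags `var['mean_counts']`, which would mean `sc.pp.calculate_qc_metrics` still has to be run on that object.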
 49 | 
 50 | 
 51 | ## Run instructions
 52 | 
 53 | #### Prepare the conda environment for anndata/seurat conversion.
 54 | In principle, you can use any program to perform the conversion. Since Oct 2023 we use [sceasy](https://github.com/cellgeni/sceasy). We also no longer use the h5seurat format due to challenges in converting to/from anndata.
 55 | 
 56 | It didn't seem so necessary to containerize this process so we provide a light conda environment that is compatible with other parts of the pipeline. [Mamba](https://github.com/mamba-org/mamba) is recommended as a faster substitute for conda. 
 57 | 
 58 | I personally prefer creating a conda env independently of Nextflow and then pointing Nextflow to the absolute path of that env. This saves pipeline running time and makes reusing the same env and debugging easier.
 59 | 
 60 | First create a conda environment for the conversion:
 61 | 
 62 | `conda env create -f envs/sceasy.yml`
 63 | 
 64 | Then put the path of your sceasy conda environment into the config file in the indicated place.
 65 | 
 66 | #### Prepare the conda environment for running scVI/scANVI and scIB
 67 | 
 68 | These two parts are also not containerized since the conda env is relatively easy to set up while the respective container will be very heavy. In the future we might consider containerizing it if necessary.
 69 | 
 70 | `conda env create -f envs/scvi.yml`
 71 | 
 72 | `conda env create -f envs/scib.yml`
 73 | 
 74 | Then put the paths of your scvi and scib conda environments into the config file in the indicated place. These env files were created by following the installation instructions from [scvi](https://docs.scvi-tools.org/en/stable/installation.html) and [scib](https://scib.readthedocs.io/en/stable/installation.html) under Python 3.10.10, so if you encounter any issues, feel free to create your own envs based on their instructions.
 75 | 
 76 | #### Pull the containers used in BENGAL. 
 77 | 
 78 | We now provide a few containers to help execute the pipeline (well deserved yay due to the complexity of building them...). Please pull these containers into a local directory and specify their paths in the config file. Here we assume you use [singularity](https://sylabs.io/) to run these containers on an HPC cluster.
 79 | 
 80 | 1. Concatenate anndata files cross-species: `singularity pull bengal_concat.sif docker://yysong123/bengal_concat:4.2.0`
 81 | 2. Python based integration: `singularity pull bengal_py.sif docker://yysong123/bengal_py:1.9.2`
 82 | 3. Seurat/R based integration: `singularity pull bengal_seurat.sif docker://yysong123/bengal_seurat:4.3.0`
 83 | 4. SCCAF assessment for ALCS: `singularity pull bengal_sccaf.sif docker://yysong123/bengal_sccaf:0.0.11` 
 84 | 
 85 | ### To run BENGAL:
 86 | In a bash shell, check that your metadata and config files are set, then run the three stages in order:
 87 | 
 88 | 1) `conda activate nextflow && nextflow -C config/example.config run concat_by_homology_multiple_species.nf`. Add flag `-with-trace -with-report report.html` if you want nextflow run stats.
 89 | 2) `nextflow -C config/example.config run cross_species_integration_multiple_species.nf`
 90 | 3) `nextflow -C config/example.config run cross_species_assessment_multiple_species_individual.nf`
 91 | 
 92 | Note: add resume flag `-resume` as appropriate to avoid re-calculation of the same data during multiple runs.
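
If you prefer to drive the three stages from a script, the commands above can be assembled programmatically. This is a sketch under assumptions: `build_commands` is a hypothetical helper, it only builds the argument lists from the commands shown in this README, and it does not activate any conda environment.

```python
import shlex

# The three stages, in the order they must run.
STAGES = [
    "concat_by_homology_multiple_species.nf",
    "cross_species_integration_multiple_species.nf",
    "cross_species_assessment_multiple_species_individual.nf",
]


def build_commands(config="config/example.config", resume=False):
    """Return one argv list per stage, optionally with -resume appended."""
    commands = []
    for script in STAGES:
        argv = ["nextflow", "-C", config, "run", script]
        if resume:
            argv.append("-resume")
        commands.append(argv)
    return commands


for argv in build_commands(resume=True):
    print(shlex.join(argv))
```

To execute rather than print, each list could be passed to `subprocess.run(argv, check=True)`, so that a failing stage stops the sequence before the next one starts.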
 93 | 
 94 | ## Outputs
 95 | 
 96 | 1) Concatenated raw count AnnData objects containing cells from all species, in the form of .h5ad files. Objects are concatenated by matching genes between species using gene homology annotation from ENSEMBL.  
 97 | 2) Integration results from different algorithms including: [fastMNN](https://bioconductor.org/packages/release/bioc/html/batchelor.html), [harmony](https://github.com/slowkow/harmonypy), [LIGER](https://github.com/welch-lab/liger), [LIGER-UINMF](https://github.com/welch-lab/liger), [scanorama](https://github.com/brianhie/scanorama), [scVI](https://scvi-tools.org/), [SeuratV4CCA](https://satijalab.org/seurat/) and [SeuratV4RPCA](https://satijalab.org/seurat/), in the form of AnnData (.h5ad) or Seurat (.h5seurat) objects.
 98 | 3) Respective UMAP visualizations colored by species, batch, or cell type.
 99 | 4) Assessment metrics for each integration result. There are 4 batch correction metrics and 6 biology conservation metrics. Plots associated with the metrics are also generated for visual inspection.
100 | 5) Cross-species cell type annotation transfer results with [SCCAF](https://github.com/SCCAF/sccaf).
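
The gene matching behind output 1 can be illustrated with a small example. The sketch below is a simplification, not the pipeline's code: `one2one_map` is a hypothetical helper, the gene ids are made-up placeholders, and only the column naming (`ensembl_gene_id`, `<species>_homolog_ensembl_gene`, `<species>_homolog_orthology_type`) follows the biomaRt attributes used in the concatenation script.

```python
def one2one_map(homology_rows, type_keys, source_key, target_key):
    """Map target-species gene ids to source-species ids, keeping only rows
    whose orthology type is one2one for every listed species."""
    mapping = {}
    for row in homology_rows:
        if all(row[key] == "ortholog_one2one" for key in type_keys):
            mapping[row[target_key]] = row[source_key]
    return mapping


# Made-up homology table: H1 has a single mouse ortholog, H2 has two.
rows = [
    {"ensembl_gene_id": "GENE_H1",
     "mmusculus_homolog_ensembl_gene": "GENE_M1",
     "mmusculus_homolog_orthology_type": "ortholog_one2one"},
    {"ensembl_gene_id": "GENE_H2",
     "mmusculus_homolog_ensembl_gene": "GENE_M2",
     "mmusculus_homolog_orthology_type": "ortholog_one2many"},
    {"ensembl_gene_id": "GENE_H2",
     "mmusculus_homolog_ensembl_gene": "GENE_M3",
     "mmusculus_homolog_orthology_type": "ortholog_one2many"},
]

mapping = one2one_map(
    rows,
    type_keys=["mmusculus_homolog_orthology_type"],
    source_key="ensembl_gene_id",
    target_key="mmusculus_homolog_ensembl_gene",
)

# Rename the mouse genes to the reference ids and drop unmatched ones,
# so both species share one gene axis before concatenation.
mouse_genes = ["GENE_M1", "GENE_M2", "GENE_M3"]
shared = [mapping[gene] for gene in mouse_genes if gene in mapping]
print(mapping, shared)
```

Genes with one2many or many2many relationships are dropped in this sketch; the pipeline's other two outputs keep them by choosing one paralog per group, by higher mean expression or by higher homology confidence.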
101 | 
102 | Estimated execution time: ~6h for an integrated dataset with 100,000 cells, using the resources specified in the .nf scripts.
103 | 
104 | ## Citation
105 | 
106 | The publication in which we described and applied BENGAL is [here](https://www.nature.com/articles/s41467-023-41855-w). Please cite it if you use BENGAL.
107 | 
108 | Song, Y., Miao, Z., Brazma, A. et al. Benchmarking strategies for cross-species integration of single-cell RNA sequencing data. Nat Commun 14, 6495 (2023). https://doi.org/10.1038/s41467-023-41855-w
109 | 
110 | The BENGAL pipeline used upon publication of the paper is archived in zenodo:
111 | 
112 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8268784.svg)](https://doi.org/10.5281/zenodo.8268784)
113 | 
114 | LICENSE: GPLv3 license
115 | 
116 | NOTE: we moved this git repo from Functional-Genomics/BENGAL to Papatheodorou-Group/BENGAL on 23 Oct 2023, but redirection should happen automatically.
117 | 


--------------------------------------------------------------------------------
/bin/LIGER_integration.R:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env Rscript
 2 | 
 3 | # © EMBL-European Bioinformatics Institute, 2023
 4 | # Yuyao Song <ysong@ebi.ac.uk>
 5 | 
 6 | library(optparse)
 7 | library(rliger)
 8 | library(Seurat)
 9 | library(SeuratWrappers)
10 | 
11 | option_list <- list(
12 |   make_option(c("-i", "--input_rds"),
13 |     type = "character", default = NULL,
14 |     help = "Path to input preprocessed rds file"
15 |   ),
16 |   make_option(c("-o", "--out_rds"),
17 |     type = "character", default = NULL,
18 |     help = "Output LIGER (via SeuratWrappers) integrated rds file"
19 |   ),
20 |   make_option(c("-p", "--out_UMAP"),
21 |     type = "character", default = NULL,
22 |     help = "Output UMAP after LIGER integration"
23 |   ),
24 |   make_option(c("-b", "--batch_key"),
25 |     type = "character", default = NULL,
26 |     help = "Batch key identifier to integrate"
27 |   ),
28 |   make_option(c("-s", "--species_key"),
29 |     type = "character", default = NULL,
30 |     help = "Species key identifier"
31 |   ),
32 |   make_option(c("-c", "--cluster_key"),
33 |     type = "character", default = NULL,
34 |     help = "Cluster key for UMAP plotting"
35 |   )
36 | )
37 | 
38 | # parse input
39 | opt <- parse_args(OptionParser(option_list = option_list))
40 | 
41 | input_rds <- opt$input_rds
42 | out_rds <- opt$out_rds
43 | out_UMAP <- opt$out_UMAP
44 | batch_key <- opt$batch_key
45 | species_key <- opt$species_key
46 | cluster_key <- opt$cluster_key
47 | 
48 | ## create Seurat object via rds
49 | 
50 | # Convert(input_h5ad, dest = "rds", overwrite = TRUE)
51 | # input_rds <- gsub("h5ad", "rds", input_h5ad)
52 | obj <- readRDS(input_rds)
53 | 
54 | obj <- NormalizeData(obj)
55 | obj <- FindVariableFeatures(obj)
56 | obj <- ScaleData(obj, split.by = batch_key, do.center = FALSE)
57 | 
58 | # LIGER
59 | obj <- RunOptimizeALS(obj, k = 30, lambda = 5, split.by = batch_key)
60 | obj <- RunQuantileNorm(obj, split.by = batch_key)
61 | 
62 | obj <- FindNeighbors(obj, reduction = "iNMF", dims = 1:30)
63 | obj <- FindClusters(obj, resolution = 0.4)
64 | # Dimensional reduction and plotting
65 | obj <- RunUMAP(obj, dims = 1:ncol(obj[["iNMF"]]), reduction = "iNMF", n.neighbors = 15L, min.dist = 0.3)
66 | 
67 | 
68 | # convert all factors to character; otherwise they become integer codes when later converting to h5ad
69 | i <- sapply(obj@meta.data, is.factor)
70 | obj@meta.data[i] <- lapply(obj@meta.data[i], as.character)
71 | 
72 | 
73 | saveRDS(obj,
74 |   file = out_rds
75 | )
76 | 
77 | # iNMF embedding will be in .obsm['iNMF']
78 | 
79 | pdf(out_UMAP, height = 6, width = 10)
80 | DimPlot(obj, reduction = "umap", group.by = species_key, shuffle = TRUE, label = TRUE)
81 | DimPlot(obj, reduction = "umap", group.by = batch_key, shuffle = TRUE, label = TRUE)
82 | DimPlot(obj, reduction = "umap", group.by = cluster_key, shuffle = TRUE, label = TRUE)
83 | 
84 | dev.off()
85 | 


--------------------------------------------------------------------------------
/bin/alignment_score.py:
--------------------------------------------------------------------------------
  1 | # © EMBL-European Bioinformatics Institute, 2023
  2 | # Yuyao Song <ysong@ebi.ac.uk>
  3 | 
  4 | K_default=20
  5 | 
  6 | import pandas as pd
  7 | import numpy as np
  8 | import scanpy as sc
  9 | import random
 10 | import numpy
 11 | import click
 12 | 
 13 | random.seed(123)
 14 | numpy.random.seed(456)
 15 | 
 16 | 
 17 | @click.command()
 18 | @click.argument("input_h5ad", type=click.Path(exists=True))
 19 | @click.argument("unintegrated_h5ad", type=click.Path(exists=False), default=None)
 20 | @click.argument("out_integrated_metrics", type=click.Path(exists=False), default=None)
 21 | @click.argument("out_orig_metrics", type=click.Path(exists=False), default=None)
 22 | @click.option(
 23 |     "--species_key", type=str, default=None, help="Species key to distinguish species"
 24 | )
 25 | @click.option(
 26 |     "--batch_key",
 27 |     type=str,
 28 |     default=None,
 29 |     help="Batch key on which integration is performed",
 30 | )
 31 | @click.option("--integration_method", type=str, default=None, help="Integration method")
 32 | @click.option(
 33 |     "--cluster_key",
 34 |     type=str,
 35 |     default=None,
 36 |     help="Cluster key in species one to use as labels to transfer to species two",
 37 | )
 38 | def run_alignment_score(
 39 |     input_h5ad,
 40 |     unintegrated_h5ad,
 41 |     out_integrated_metrics,
 42 |     out_orig_metrics,
 43 |     species_key,
 44 |     batch_key,
 45 |     cluster_key,
 46 |     integration_method,
 47 | ):
 48 |     # dictionary for method properties
 49 |     embedding_keys = {
 50 |         "harmony": "X_pca_harmony",
 51 |         "scanorama": "X_scanorama",
 52 |         "scVI": "X_scVI",
 53 |         "scANVI": "X_scANVI",
 54 |         "LIGER": "X_iNMF",
 55 |         "rligerUINMF": "X_inmf",
 56 |         "fastMNN": "X_mnn",
 57 |     }
 58 |     use_embeddings = {
 59 |         "harmony": True,
 60 |         "scanorama": True,
 61 |         "scVI": True,
 62 |         "scANVI": True,
 63 |         "LIGER": True,
 64 |         "rligerUINMF": True,
 65 |         "fastMNN": True,
 66 |         "SAMap": False,
 67 |         "seuratCCA": False,
 68 |         "seuratRPCA": False,
 69 |         "unintegrated": False,
 70 |     }
 71 |     from_h5seurat = {
 72 |         "harmony": False,
 73 |         "scanorama": False,
 74 |         "scVI": False,
 75 |         "scANVI": False,
 76 |         "LIGER": True,
 77 |         "rligerUINMF": True,
 78 |         "fastMNN": True,
 79 |         "SAMap": False,
 80 |         "seuratCCA": True,
 81 |         "seuratRPCA": True,
 82 |         "unintegrated": False,
 83 |     }
 84 |     sc.set_figure_params(dpi_save=200, frameon=False, figsize=(10, 5))
 85 | 
 86 |     click.echo("Read anndata")
 87 |     input_ad = sc.read_h5ad(input_h5ad)
 88 |     orig_ad = sc.read_h5ad(unintegrated_h5ad)
 89 |     species_all = input_ad.obs[species_key].astype("category").cat.categories.values
 90 | 
 91 |     ## for files from h5seurat sometimes these are not stored as category
 92 | 
 93 |     input_ad.obs[species_key] = input_ad.obs[species_key].astype("category")
 94 |     input_ad.obs[cluster_key] = input_ad.obs[cluster_key].astype("category")  
 95 |     input_ad.obs[batch_key] = input_ad.obs[batch_key].astype("category")
 96 |     
 97 |     # known bug: converting h5Seurat to h5ad can leave an "_index" column in raw.var; rename it
 98 |     if from_h5seurat[integration_method] and input_ad.raw is not None:
 99 |         input_ad.__dict__["_raw"].__dict__["_var"] = (
100 |             input_ad.__dict__["_raw"]
101 |             .__dict__["_var"]
102 |             .rename(columns={"_index": "features"})
103 |         )
104 | 
105 |     use_embedding = use_embeddings[integration_method]
106 |     if use_embedding is True:
107 |         embedding_key = embedding_keys[integration_method]
108 | 
109 |     # re-calculate on integrated and unintegrated data
110 |     # due to scIB hard-coding, make sure input_ad.obsp['connectivities'] and input_ad.uns['neighbors'] are from the embedding
111 |     # for lisi type_='knn'
112 |     # LIGER embedding only has 20 dims
113 | 
114 |     if integration_method == "SAMap":
115 |         click.echo("use SAMap KNN graph")
116 |         # do nothing
117 |     elif use_embedding is True:
118 |         click.echo("calculate KNN graph from embedding " + embedding_key)
119 |         num_pcs = min(input_ad.obsm[embedding_key].shape[1], 20)
120 |         if num_pcs < 20:
121 |             click.echo("using fewer PCs: " + str(num_pcs))
122 |         sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=num_pcs, use_rep=embedding_key)
123 |         # compute knn if use embedding
124 |     else:
125 |         click.echo("use PCA to compute KNN graph")
126 |         sc.tl.pca(input_ad, svd_solver="arpack")
127 |         sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep="X_pca")
128 |         embedding_key = "X_pca"
129 | 
130 |     # when there is no embedding, PCA is computed and used for the KNN graph
131 | 
132 |     # get neighbour graph from unintegrated data
133 |     sc.pp.normalize_total(orig_ad, target_sum=1e4)
134 |     sc.pp.log1p(orig_ad)
135 |     sc.pp.highly_variable_genes(orig_ad, min_mean=0.0125, max_mean=3, min_disp=0.5)
136 |     sc.pp.scale(orig_ad, max_value=10)
137 |     sc.tl.pca(orig_ad, svd_solver="arpack")
138 |     sc.pp.neighbors(orig_ad, n_neighbors=20, n_pcs=40)
139 |     sc.tl.umap(orig_ad, min_dist=0.3)
140 |     sc.pl.umap(orig_ad, color=[batch_key, species_key, cluster_key])
141 | 
142 |     click.echo(
143 |         "Start computing various batch metrics using scIB, the integrated file is "
144 |         + input_h5ad
145 |     )
146 | 
147 |     output_integrated = pd.DataFrame()
148 |     output_orig = pd.DataFrame()
149 | 
150 |     click.echo("alignment score")
151 |     
152 |     def q(x):
153 |         return np.array(list(x))
154 | 
155 |     def avg_as(ad):
156 |         x = q(ad.obs[batch_key])
157 |         xu = np.unique(x)
158 |         a = np.zeros((xu.size,xu.size))
159 |         for i in range(xu.size):
160 |             for j in range(xu.size):
161 |                 if i!=j:
162 |                     a[i,j] = ad.obsp['connectivities'][x==xu[i],:][:,x==xu[j]].sum(1).A.flatten().mean() / K_default
163 |         return pd.DataFrame(data=a,index=xu,columns=xu)
164 | 
165 | 
166 |     output_integrated.loc["Alignment_score", "value"] = avg_as(input_ad)[avg_as(input_ad) != 0].mean().mean()
167 |     
168 |     output_orig.loc["Alignment_score", "value"] = avg_as(orig_ad)[avg_as(orig_ad) != 0].mean().mean()
169 | 
170 |     output_integrated.loc["input_h5ad", "value"] = input_h5ad
171 |     output_integrated.loc["unintegrated_h5ad", "value"] = unintegrated_h5ad
172 |     output_integrated.loc["species_key", "value"] = species_key
173 |     output_integrated.loc["batch_key", "value"] = batch_key
174 |     output_integrated.loc["cluster_key", "value"] = cluster_key
175 |     output_integrated.loc["integration_method", "value"] = integration_method
176 | 
177 |     output_integrated.T.to_csv(out_integrated_metrics)
178 | 
179 |     click.echo("metric of unintegrated data")
180 |     output_orig.loc["input_h5ad", "value"] = unintegrated_h5ad
181 |     output_orig.loc["unintegrated_h5ad", "value"] = unintegrated_h5ad
182 |     output_orig.loc["species_key", "value"] = species_key
183 |     output_orig.loc["batch_key", "value"] = batch_key
184 |     output_orig.loc["cluster_key", "value"] = cluster_key
185 |     output_orig.loc["integration_method", "value"] = integration_method
186 | 
187 |     output_orig.T.to_csv(out_orig_metrics)
188 | 
189 |     
190 |     
191 | 
192 | if __name__ == "__main__":
193 |     run_alignment_score()


--------------------------------------------------------------------------------
/bin/concat_by_homology_multiple_species_by_gene_id.R:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env R
  2 | 
  3 | ##########
  4 | ## concatenate anndata object by homology
  5 | ## ysong@ebi.ac.uk for CrossSpeciesIntegration pipeline
  6 | #########
  7 | 
  8 | # © EMBL-European Bioinformatics Institute, 2023
  9 | # Yuyao Song <ysong@ebi.ac.uk>
 10 | 
 11 | library(optparse)
 12 | library(anndata)
 13 | library(dplyr)
 14 | library(purrr)
 15 | library(readr)
 16 | library(magrittr)
 17 | library(tibble)
 18 | library(biomaRt)
 19 | 
 20 | option_list <- list(
 21 |   make_option(c("--metadata"),
 22 |     type = "character", default = NULL,
 23 |     help = "Path to a tab-separated file mapping each species to its h5ad file"
 24 |   ),
 25 |   make_option(c("--homology_tbl"),
 26 |       type = "character", default = NULL,
 27 |       help = "Ensembl homology tbl output"
 28 |   ),
 29 |   make_option(c("--one2one_h5ad"),
 30 |     type = "character", default = NULL,
 31 |     help = "h5ad output with only one2one orthologs"
 32 |   ),
 33 |   make_option(c("--many_higher_expr_h5ad"),
 34 |     type = "character", default = NULL,
 35 |     help = "h5ad output with one2one, one2many and many2many homologs, paralog with higher mean expression across dataset is selected to match the ortholog"
 36 |   ),
 37 |   make_option(c("--many_higher_homology_conf_h5ad"),
 38 |     type = "character", default = NULL,
 39 |     help = "h5ad output with one2one, one2many and many2many homologs, paralog with higher homology confidence across dataset is selected to match the ortholog"
 40 |   )
 41 | )
 42 | 
 43 | # parse input
 44 | opt <- parse_args(OptionParser(option_list = option_list))
 45 | 
 46 | input_metadata <- opt$metadata
 47 | homology_tbl_csv <- opt$homology_tbl
 48 | one2one_h5ad <- opt$one2one_h5ad
 49 | many_higher_expr_h5ad <- opt$many_higher_expr_h5ad
 50 | many_higher_homology_conf_h5ad <- opt$many_higher_homology_conf_h5ad
 51 | 
 52 | genes_main_chr <- list(
 53 |   hsapiens = as.character(c(lapply(seq(1, 22, by = 1), as.character), "X", "Y")),
 54 |   sscrofa = as.character(c(lapply(seq(1, 18, by = 1), as.character), "X", "Y")),
 55 |   mmusculus = as.character(c(lapply(seq(1, 19, by = 1), as.character), "X", "Y")),
 56 |   mmulatta = as.character(c(lapply(seq(1, 20, by = 1), as.character), "X", "Y")),
 57 |   mfascicularis = as.character(c(lapply(seq(1, 20, by = 1), as.character), "X")),
 58 |   dmelanogaster = c("2L", "2R", "3L", "3R", "4", "X", "Y"),
 59 |   xtropicalis = as.character(c(lapply(seq(1, 10, by = 1), as.character))),
 60 |   drerio = as.character(c(lapply(seq(1, 25, by = 1), as.character)))
 61 | )
 62 | 
 63 | 
 64 | metadata <- read_tsv(input_metadata, col_names = FALSE)
 65 | species_list=metadata[['X1']]
 66 | message(paste0("start concatenating anndata object from ", paste(species_list, collapse = ', ')))
 67 | 
 68 | 
 69 | adatas = list()
 70 | for( species in metadata$X1){
 71 |     adatas[[species]] = read_h5ad(as.character(metadata[which(metadata$X1 == species), 'X2']))
 72 | }
 73 | 
 74 | for(species_now in species_list){
 75 |     adatas[[species_now]]$var[[paste0(species_now, "_homolog_ensembl_gene")]] = adatas[[species_now]]$var_names
 76 | }
 77 | 
 78 | 
 79 | count_all <- data.frame()
 80 | species_1 = species_list[1]
 81 | 
 82 | 
 83 | ## version = '105'
 84 | mart <- useEnsembl("ensembl", version = "105", dataset = as.character(paste(species_1, "_gene_ensembl", sep = "")))
 85 | 
 86 | # get genes in the main chrs of the first species
 87 | genes_species_1 <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"), mart = mart) %>%
 88 |   dplyr::filter(chromosome_name %in% (genes_main_chr[species_1] %>% purrr::flatten_chr()))
 89 | message(paste("start querying ensembl biomaRt for gene homology"))
 90 | 
 91 | avail_attributes <- listAttributes(mart) %>% filter(grepl((paste(species_list[-1],collapse="|")), name)) %>% filter(grepl("homo", name)) %>%
 92 | filter(!grepl("Query protein or transcript ID", description))
 93 | biomartCacheClear()
 94 | 
 95 | homology_tbl = getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name", "start_position", "end_position",
 96 |     avail_attributes$name), mart = mart, filters = "ensembl_gene_id",
 97 |   values = genes_species_1[["ensembl_gene_id"]])
 98 | 
 99 | 
100 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene stable ID'] = paste0(species_1, "_homolog_ensembl_gene")
101 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene name'] = paste0(species_1, "_homolog_associated_gene_name")
102 | #names(homology_tbl)[colnames(homology_tbl) == 'Chromosome/scaffold name'] = paste0(species_1, "_homolog_chromosome")
103 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene start (bp)'] = paste0(species_1, "_homolog_chrom_start")
104 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene end (bp)'] = paste0(species_1, "_homolog_chrom_end")
105 | 
106 | #names(homology_tbl)[match(avail_attributes[,"description"], names(homology_tbl))] = avail_attributes[,"name"]
107 | 
108 | write_csv(homology_tbl, file = homology_tbl_csv)
109 | 
110 | message("start building one2one")
111 | homology_tbl[paste0(species_1, "_homolog_associated_gene_name")] = homology_tbl$external_gene_name
112 | homology_tbl[paste0(species_1, "_homolog_ensembl_gene")] = homology_tbl$ensembl_gene_id
113 | homology_tbl[paste0(species_1, "_homolog_chromosome")] = homology_tbl$chromosome_name
114 | homology_tbl[paste0(species_1, "_homolog_chrom_start")] = homology_tbl$start_position
115 | homology_tbl[paste0(species_1, "_homolog_chrom_end")] = homology_tbl$end_position
116 | one2one = homology_tbl %>% filter_at(vars(ends_with("homolog_orthology_type")), all_vars(. == 'ortholog_one2one')) %>%
117 | distinct(get(paste0(species_1, "_homolog_ensembl_gene")), `.keep_all` = TRUE)
118 | 
119 | adatas_one2one = list()
120 | 
121 | for(species_now in names(adatas)){
122 |     adatas_one2one[[species_now]] = adatas[[species_now]][, tolower(adatas[[species_now]]$var_names) %in% tolower(one2one[[paste0(species_now, "_homolog_ensembl_gene")]])]
123 | }
124 | 
125 | 
126 | for(species_now in species_list[-1]){
127 |     message(species_now)
128 |      one2one_now = one2one %>% filter(tolower(get(paste0(species_now, "_homolog_ensembl_gene"))) %in% tolower(adatas_one2one[[species_now]]$var_names)) %>%
129 | distinct(get(paste0(species_1, "_homolog_ensembl_gene")), `.keep_all` = TRUE)
130 |      adatas_one2one[[species_now]] = adatas_one2one[[species_now]][, tolower(adatas_one2one[[species_now]]$var_names) %in% tolower(one2one_now[[paste0(species_now, "_homolog_ensembl_gene")]])]
131 |      one2one_now = one2one_now[match(tolower(adatas_one2one[[species_now]]$var_names), tolower(one2one_now[[paste0(species_now, '_homolog_ensembl_gene')]])), ]
132 |      adatas_one2one[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] = adatas_one2one[[species_now]]$var_names
133 |      adatas_one2one[[species_now]]$var[[paste0(species_1, '_homolog_ensembl_gene')]] = one2one_now[[paste0(species_1, '_homolog_ensembl_gene')]]
134 |      adatas_one2one[[species_now]]$var_names = one2one_now[[paste0(species_1, '_homolog_ensembl_gene')]]
135 |      adatas_one2one[[species_now]] = adatas_one2one[[species_now]][, !duplicated(adatas_one2one[[species_now]]$var_names)]
136 | }
137 | 
138 | adata_one2one = concat(adatas_one2one, axis = 0L, join = 'inner', merge = 'first', label = 'batch', index_unique = '-')
139 | 
140 | i <- sapply(adata_one2one$obs, is.factor)
141 | adata_one2one$obs[i] <- lapply(adata_one2one$obs[i], as.character)
142 | 
143 | j <- sapply(adata_one2one$var, is.factor)
144 | adata_one2one$var[j] <- lapply(adata_one2one$var[j], as.character)
145 | 
146 | 
147 | write_h5ad(anndata = adata_one2one, filename = one2one_h5ad, compression = "gzip")
148 | 
149 | message("finish building one2one")
150 | adata_one2one
151 | 
152 | message("start building many2many")
153 | message("start higher expression level")
154 | 
155 | 
156 | many2many = homology_tbl %>% filter_at(vars(ends_with("homolog_orthology_type")), all_vars(. != 'ortholog_one2one'))
157 | 
158 | 
159 | for(species_now in species_list){
160 |     print(species_now)
161 |     many2many = many2many %>% filter(!is.na(get(paste0(species_now, "_homolog_ensembl_gene"))) & get(paste0(species_now, "_homolog_ensembl_gene")) != "")
162 |     print(dim(many2many))
163 |     many2many = many2many %>%
164 |     filter(tolower(get(paste0(species_now, "_homolog_ensembl_gene"))) %in% tolower(adatas[[species_now]]$var[[paste0(species_now, "_homolog_ensembl_gene")]]))
165 |     print(dim(many2many))
166 | }
167 | 
168 | many2many_copy <- many2many %>% rowid_to_column("index")
169 | adata_many2many = AnnData(shape = list(0, 0))
170 | 
171 | while (nrow(many2many_copy) > 0) {
172 | 
173 |     dd <- many2many_copy %>%
174 |         filter(get(paste0(species_1, "_homolog_ensembl_gene"))  == levels(factor(many2many_copy[[paste0(species_1, "_homolog_ensembl_gene")]]))[1])
175 | 
176 |     genes_now = dd %>% dplyr::select(ends_with("_homolog_ensembl_gene")) %>% flatten() %>% unique() %>% as.character()
177 | 
178 |     gene_group <- many2many_copy %>%
179 |         dplyr::filter_at(vars(ends_with("_homolog_ensembl_gene")), any_vars(. %in% genes_now))
180 | 
181 |     many2many_copy <- many2many_copy %>%
182 |     filter(!index %in% gene_group$index)
183 | 
184 |     adatas_many2many = list()
185 | 
186 |     for(species_now in species_list){
187 |             adatas_many2many[[species_now]] = adatas[[species_now]][, tolower(adatas[[species_now]]$var_names) %in% tolower(gene_group[[paste0(species_now, "_homolog_ensembl_gene")]])]
188 |             keep_row = adatas_many2many[[species_now]]$var %>%
189 |             arrange(desc(mean_counts)) %>%
190 |             slice(1)
191 |             adatas_many2many[[species_now]] = adatas_many2many[[species_now]][, which(tolower(adatas_many2many[[species_now]]$var_names) == tolower(rownames(keep_row)))]
192 |         }
193 | 
194 |     new_name = adatas_many2many[[species_1]]$var_names
195 |     #message(new_name)
196 |     for(species_now in species_list[-1]){
197 |     adatas_many2many[[species_now]]$var[[paste0(species_1, "_homolog_ensembl_gene")]] = new_name
198 |     rownames(adatas_many2many[[species_now]]$var) = new_name
199 | }
200 |     adata_add = concat(adatas_many2many, axis = 0L, join = 'inner', merge = 'first', label = 'batch', index_unique = '-')
201 |     adata_add
202 |     
203 |   if (dim(adata_many2many)[1] == 0) {
204 |     adata_many2many <- adata_add
205 |   } else {
206 |     adata_many2many <- concat(list(adata_many2many, adata_add), axis = 1L, join = "outer", merge = "first", index_unique = '-')
207 |   }
208 | 
209 | }
210 | 
211 | adata_higher_expr = concat(list(adata_one2one, adata_many2many), axis = 1L, join = 'inner', merge = 'first', index_unique = '-')
212 | 
213 | message("finish building higher expression level")
214 | adata_higher_expr
215 | i <- sapply(adata_higher_expr$obs, is.factor)
216 | adata_higher_expr$obs[i] <- lapply(adata_higher_expr$obs[i], as.character)
217 | 
218 | j <- sapply(adata_higher_expr$var, is.factor)
219 | adata_higher_expr$var[j] <- lapply(adata_higher_expr$var[j], as.character)
220 | 
221 | write_h5ad(anndata = adata_higher_expr , filename = many_higher_expr_h5ad, compression = "gzip")
222 | 
223 | message("start higher homology confidence")
224 | 
225 | ## available attributes to indicate confidence of homology
226 | ## mind the order
227 | order = metadata$X1
228 | avail_ordered = c()
229 | for (attr in c("orthology_confidence", "homolog_goc_score", "homolog_wga_coverage")){
230 | 
231 |     avail_homo = c(avail_attributes$name[grepl(attr, avail_attributes$name)])
232 | 
233 |     for (i in seq(1, length(order))){
234 |    avail_ordered = c(avail_ordered, avail_homo[grepl(order[i], avail_homo)])
235 |   }
236 | }
237 | 
238 | avail_homo = avail_ordered
239 | 
240 | many2many_copy_homo <- many2many %>% rowid_to_column("index")
241 | adata_many2many_homo = AnnData(shape = list(0, 0))
242 | 
243 | while (nrow(many2many_copy_homo) > 0) {
244 | 
245 |     dd <- many2many_copy_homo %>%
246 |         filter(get(paste0(species_1, "_homolog_ensembl_gene"))  == levels(factor(many2many_copy_homo[[paste0(species_1, "_homolog_ensembl_gene")]]))[1])
247 | 
248 |     genes_now = dd %>% dplyr::select(ends_with("_homolog_ensembl_gene")) %>% flatten() %>% unique() %>% as.character()
249 | 
250 |     ## find a gene group with many-to-many relationships
251 |     gene_group <- many2many_copy_homo %>%
252 |         dplyr::filter_at(vars(ends_with("_homolog_ensembl_gene")), any_vars(. %in% genes_now))
253 | 
254 |     if(nrow(gene_group) == 1) {
255 |         gene_keep = gene_group
256 |     } else {
257 |     gene_keep = gene_group %>%
258 |         dplyr::arrange(
259 |         sapply(avail_homo, FUN = function(x) get(x)) ## keep the member of group with highest overall confidence
260 |         ) %>%
261 |             slice(n())
262 |     }
263 |     many2many_copy_homo <- many2many_copy_homo %>%
264 |             filter(!index %in% gene_group$index)
265 | 
266 |     adatas_many2many_homo = list()
267 |     for(species_now in species_list){
268 | 
269 |     adatas_many2many_homo[[species_now]] = adatas[[species_now]][, adatas[[species_now]]$var_names %in% gene_keep[[paste0(species_now, "_homolog_ensembl_gene")]]]
270 | 
271 | }
272 | 
273 | 
274 |     new_name = adatas_many2many_homo[[species_1]]$var_names
275 |     #message(new_name)
276 |     for(species_now in species_list[-1]){
277 | 
278 |     adatas_many2many_homo[[species_now]]$var[[paste0(species_1, "_homolog_ensembl_gene")]] = new_name
279 |     rownames(adatas_many2many_homo[[species_now]]$var) = new_name
280 | 
281 | }
282 |     adata_add = concat(adatas_many2many_homo, axis = 0L, join = 'inner', merge = 'first', label = 'batch', index_unique = '-')
283 | 
284 |       if (dim(adata_many2many_homo)[1] == 0) {
285 |     adata_many2many_homo <- adata_add
286 |   } else {
287 |     adata_many2many_homo <- concat(list(adata_many2many_homo, adata_add), axis = 1L, join = "outer", merge = "first", index_unique = '-')
288 |   }
289 | 
290 | }
291 | 
292 | 
293 | adata_higher_homology_conf = concat(list(adata_one2one, adata_many2many_homo), axis = 1L, join = 'inner', merge = 'first', index_unique = '-')
294 | i <- sapply(adata_higher_homology_conf$obs, is.factor)
295 | adata_higher_homology_conf$obs[i] <- lapply(adata_higher_homology_conf$obs[i], as.character)
296 | 
297 | j <- sapply(adata_higher_homology_conf$var, is.factor)
298 | adata_higher_homology_conf$var[j] <- lapply(adata_higher_homology_conf$var[j], as.character)
299 | 
300 | write_h5ad(anndata = adata_higher_homology_conf , filename = many_higher_homology_conf_h5ad, compression = "gzip")
301 | 
302 | message("finish building higher homology confidence")
303 | adata_higher_homology_conf
304 | 
305 | message("finish concat by homology")
306 | 
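
The many-to-many handling in the script above repeatedly peels one homology group off the table and then keeps, per species, the member with the highest `mean_counts`. A minimal pandas sketch of that idea follows. The helper names `pop_gene_group` and `pick_highest_expressed` and the toy IDs are hypothetical, and the sketch matches IDs case-sensitively, unlike the `tolower()` matching in the R code.

```python
import pandas as pd


def pop_gene_group(table, id_cols, seed_col):
    """Split `table` into one homology group and the remaining rows."""
    # Seed with the alphabetically first ID of the reference species,
    # similar in spirit to levels(factor(...))[1] in the R script.
    seed = sorted(table[seed_col].unique())[0]
    seed_rows = table[table[seed_col] == seed]
    # Every gene ID, from any species, that appears in the seed rows.
    genes_now = set(seed_rows[id_cols].to_numpy().ravel())
    # A row joins the group if any of its ID columns hits one of those genes.
    in_group = table[id_cols].isin(list(genes_now)).any(axis=1)
    return table[in_group], table[~in_group]


def pick_highest_expressed(gene_ids, mean_counts):
    """Return the gene with the largest mean count among `gene_ids`."""
    return mean_counts[mean_counts.index.isin(gene_ids)].idxmax()


# Toy homology table (hypothetical IDs): H1 maps to M1 and M2, and M2 also maps to H2.
id_cols = ["hsapiens_homolog_ensembl_gene", "mmusculus_homolog_ensembl_gene"]
homology = pd.DataFrame(
    {
        "hsapiens_homolog_ensembl_gene": ["H1", "H1", "H2", "H3"],
        "mmusculus_homolog_ensembl_gene": ["M1", "M2", "M2", "M3"],
    }
)
human_mean = pd.Series({"H1": 5.0, "H2": 9.0, "H3": 1.0})
mouse_mean = pd.Series({"M1": 2.0, "M2": 0.5, "M3": 4.0})

group, rest = pop_gene_group(homology, id_cols, id_cols[0])
best_human = pick_highest_expressed(group[id_cols[0]], human_mean)
best_mouse = pick_highest_expressed(group[id_cols[1]], mouse_mean)
print(group.index.tolist(), rest.index.tolist(), best_human, best_mouse)
```

As in the R loop, the group is expanded only once from the seed rows, so a chain of homologs longer than one hop can end up split across two iterations.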


--------------------------------------------------------------------------------
/bin/concat_by_homology_rligerUINMF_multiple_species.R:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env Rscript
  2 | 
  3 | ##########
  4 | ## concatenate anndata object by homology for LIGER UINMF pipeline
  5 | ## ysong@ebi.ac.uk for CrossSpeciesIntegration pipeline
  6 | #########
  7 | 
  8 | # © EMBL-European Bioinformatics Institute, 2023
  9 | # Yuyao Song <ysong@ebi.ac.uk>
 10 | 
 11 | library(optparse)
 12 | library(anndata)
 13 | library(dplyr)
 14 | library(purrr)
 15 | library(readr)
 16 | library(magrittr)
 17 | library(tibble)
 18 | library(biomaRt)
 19 | 
 20 | option_list <- list(
 21 |   make_option(c("--metadata"),
 22 |     type = "character", default = NULL,
 23 |     help = "Path to a tab-separated file giving the species-to-h5ad mapping"
 24 |   ),
 25 |   make_option(c("--homology_tbl"),
 26 |       type = "character", default = NULL,
 27 |       help = "Ensembl homology tbl output"
 28 |   ),
 29 |   make_option(c("--out_dir"),
 30 |     type = "character", default = NULL,
 31 |     help = "output dir to write h5ad files"
 32 |   ),
 33 |   make_option(c("--metadata_output"),
 34 |     type = "character", default = NULL,
 35 |     help = "Path to write the rliger UINMF-ready metadata table"
 36 |   )
 37 | )
 38 | opt <- parse_args(OptionParser(option_list = option_list))
 39 | 
 40 | input_metadata <- opt$metadata
 41 | homology_tbl_csv <- opt$homology_tbl
 42 | out_dir <- opt$out_dir
 43 | metadata_output <- opt$metadata_output
 44 | 
 45 | 
 46 | message("create seurat objects for rliger")
 47 | 
 48 | 
 49 | genes_main_chr <- list(
 50 |   hsapiens = as.character(c(lapply(seq(1, 22, by = 1), as.character), "X", "Y")),
 51 |   sscrofa = as.character(c(lapply(seq(1, 18, by = 1), as.character), "X", "Y")),
 52 |   mmusculus = as.character(c(lapply(seq(1, 19, by = 1), as.character), "X", "Y")),
 53 |   mmulatta = as.character(c(lapply(seq(1, 20, by = 1), as.character), "X", "Y")),
 54 |   mfascicularis = as.character(c(lapply(seq(1, 20, by = 1), as.character), "X")),
 55 |   dmelanogaster = c("2L", "2R", "3L", "3R", "4", "X", "Y"),
 56 |   xtropicalis = as.character(c(lapply(seq(1, 10, by = 1), as.character))),
 57 |   drerio = as.character(c(lapply(seq(1, 25, by = 1), as.character)))
 58 | )
 59 | 
 60 | metadata <- read_tsv(input_metadata, col_names = FALSE)
 61 | species_list=metadata[['X1']]
 62 | message(paste0("start concatenating anndata object from ", paste(species_list, collapse = ', ')))
 63 | 
 64 | 
 65 | adatas = list()
 66 | for( species in metadata$X1){
 67 |     adatas[[species]] = read_h5ad(as.character(metadata[which(metadata$X1 == species), 'X2']))
 68 | }
 69 | 
 70 | for(species_now in species_list){
 71 |     adatas[[species_now]]$var[[paste0(species_now, "_homolog_ensembl_gene")]] = adatas[[species_now]]$var_names
 72 | }
 73 | 
 74 | 
 75 | count_all <- data.frame()
 76 | species_1 = species_list[1]
 77 | 
 78 | 
 79 | ## version = '105'
 80 | mart <- useEnsembl("ensembl", version = "105", dataset = as.character(paste(species_1, "_gene_ensembl", sep = "")))
 81 | 
 82 | # get genes in the main chrs of the first species
 83 | genes_species_1 <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"), mart = mart) %>%
 84 |   dplyr::filter(chromosome_name %in% (genes_main_chr[species_1] %>% purrr::flatten_chr()))
 85 | message(paste("start querying ensembl biomaRt for gene homology"))
 86 | 
 87 | avail_attributes <- listAttributes(mart) %>% filter(grepl((paste(species_list[-1],collapse="|")), name)) %>% filter(grepl("homo", name)) %>%
 88 | filter(!grepl("Query protein or transcript ID", description))
 89 | biomartCacheClear()
 90 | 
 91 | homology_tbl = getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name", "start_position", "end_position",
 92 |     avail_attributes$name), mart = mart, filters = "ensembl_gene_id",
 93 |   values = genes_species_1[["ensembl_gene_id"]])
 94 | 
 95 | 
 96 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene stable ID'] = paste0(species_1, "_homolog_ensembl_gene")
 97 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene name'] = paste0(species_1, "_homolog_associated_gene_name")
 98 | #names(homology_tbl)[colnames(homology_tbl) == 'Chromosome/scaffold name'] = paste0(species_1, "_homolog_chromosome")
 99 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene start (bp)'] = paste0(species_1, "_homolog_chrom_start")
100 | #names(homology_tbl)[colnames(homology_tbl) == 'Gene end (bp)'] = paste0(species_1, "_homolog_chrom_end")
101 | 
102 | #names(homology_tbl)[match(avail_attributes[,"description"], names(homology_tbl))] = avail_attributes[,"name"]
103 | 
104 | write_csv(homology_tbl, file = homology_tbl_csv)
105 | 
106 | 
107 | message("start building one2one")
108 | homology_tbl[paste0(species_1, "_homolog_associated_gene_name")] = homology_tbl$external_gene_name
109 | homology_tbl[paste0(species_1, "_homolog_ensembl_gene")] = homology_tbl$ensembl_gene_id
110 | homology_tbl[paste0(species_1, "_homolog_chromosome")] = homology_tbl$chromosome_name
111 | homology_tbl[paste0(species_1, "_homolog_chrom_start")] = homology_tbl$start_position
112 | homology_tbl[paste0(species_1, "_homolog_chrom_end")] = homology_tbl$end_position
113 | one2one = homology_tbl %>% filter_at(vars(ends_with("homolog_orthology_type")), all_vars(. == 'ortholog_one2one')) %>%
114 | distinct(get(paste0(species_1, "_homolog_ensembl_gene")), `.keep_all` = TRUE)
115 | 
116 | adatas_one2one = list()
117 | 
118 | for(species_now in names(adatas)){
119 |     adatas_one2one[[species_now]] = adatas[[species_now]][, tolower(adatas[[species_now]]$var_names) %in% tolower(one2one[[paste0(species_now, "_homolog_ensembl_gene")]])]
120 | }
121 | 
122 | 
123 | for(species_now in species_list[-1]){
124 |     message(species_now)
125 |      one2one_now = one2one %>% filter(tolower(get(paste0(species_now, "_homolog_ensembl_gene"))) %in% tolower(adatas_one2one[[species_now]]$var_names)) %>%
126 | distinct(get(paste0(species_1, "_homolog_ensembl_gene")), `.keep_all` = TRUE)
127 |      adatas_one2one[[species_now]] = adatas_one2one[[species_now]][, tolower(adatas_one2one[[species_now]]$var_names) %in% tolower(one2one_now[[paste0(species_now, "_homolog_ensembl_gene")]])]
128 |      one2one_now = one2one_now[match(tolower(adatas_one2one[[species_now]]$var_names), tolower(one2one_now[[paste0(species_now, '_homolog_ensembl_gene')]])), ]
129 |      adatas_one2one[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] = adatas_one2one[[species_now]]$var_names
130 |      adatas_one2one[[species_now]]$var[[paste0(species_1, '_homolog_ensembl_gene')]] = one2one_now[[paste0(species_1, '_homolog_ensembl_gene')]]
131 |      adatas_one2one[[species_now]]$var_names = one2one_now[[paste0(species_1, '_homolog_ensembl_gene')]]
132 |      adatas_one2one[[species_now]] = adatas_one2one[[species_now]][, !duplicated(adatas_one2one[[species_now]]$var_names)]
133 | }
134 | 
135 | adatas_one2one_uinmf = list()
136 | 
137 | for(species_now in names(adatas)){
138 | 
139 |     adata_unshared = adatas[[species_now]][, !(adatas[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] %in% adatas_one2one[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]])]
140 |     adatas_one2one_uinmf[[species_now]] = concat(list(adatas_one2one[[species_now]], adata_unshared), axis = 1L, join = 'inner', merge = 'first', index_unique = '-')
141 | 
142 |     adatas_one2one_uinmf[[species_now]] = adatas_one2one_uinmf[[species_now]][, !duplicated(adatas_one2one_uinmf[[species_now]]$var_names)]
143 | 
144 |     i <- sapply(adatas_one2one_uinmf[[species_now]]$obs, is.factor)
145 |     adatas_one2one_uinmf[[species_now]]$obs[i] <- lapply(adatas_one2one_uinmf[[species_now]]$obs[i], as.character)
146 |     j <- sapply(adatas_one2one_uinmf[[species_now]]$var, is.factor)
147 |     adatas_one2one_uinmf[[species_now]]$var[j] <- lapply(adatas_one2one_uinmf[[species_now]]$var[j], as.character)
148 | 
149 |     write_h5ad(anndata = adatas_one2one_uinmf[[species_now]],
150 |                filename = paste0(out_dir, "/", species_now, "_one2one_only_ligerUINMF.h5ad"), compression = "gzip")
151 |     message(species_now)
152 | 
153 |     }
154 | 
155 | message("finish building one2one")
156 | 
157 | 
158 | message("start building many2many")
159 | message("start higher expression level")
160 | 
161 | 
162 | many2many = homology_tbl %>% filter_at(vars(ends_with("homolog_orthology_type")), all_vars(. != 'ortholog_one2one'))
163 | 
164 | 
165 | for(species_now in species_list){
166 |     print(species_now)
167 |     many2many = many2many %>% filter(!is.na(get(paste0(species_now, "_homolog_ensembl_gene"))) & get(paste0(species_now, "_homolog_ensembl_gene")) != "")
168 |     print(dim(many2many))
169 |     many2many = many2many %>%
170 |     filter(tolower(get(paste0(species_now, "_homolog_ensembl_gene"))) %in% tolower(adatas[[species_now]]$var[[paste0(species_now, "_homolog_ensembl_gene")]]))
171 |     print(dim(many2many))
172 | }
173 | 
174 | many2many_copy <- many2many %>% rowid_to_column("index")
175 | adata_many2many = AnnData(shape = list(0, 0))
176 | 
177 | 
178 | adatas_many2many_all = list()
179 | while (nrow(many2many_copy) > 0) {
180 | 
181 |     if(nrow(many2many_copy) < 6000 & nrow(many2many_copy) > 5990) message(nrow(many2many_copy))
182 |     
183 |     if(nrow(many2many_copy) < 4000 & nrow(many2many_copy) > 3990) message(nrow(many2many_copy))
184 | 
185 |     if(nrow(many2many_copy) < 1000 & nrow(many2many_copy) > 990) message(nrow(many2many_copy))
186 | 
187 | 
188 |     dd <- many2many_copy %>%
189 |         filter(get(paste0(species_1, "_homolog_ensembl_gene"))  == levels(factor(many2many_copy[[paste0(species_1, "_homolog_ensembl_gene")]]))[1])
190 | 
191 |     genes_now = dd %>% dplyr::select(ends_with("_homolog_ensembl_gene")) %>% flatten() %>% unique() %>% as.character()
192 | 
193 |     gene_group <- many2many_copy %>%
194 |         dplyr::filter_at(vars(ends_with("_homolog_ensembl_gene")), any_vars(. %in% genes_now))
195 | 
196 |     many2many_copy <- many2many_copy %>%
197 |     filter(!index %in% gene_group$index)
198 | 
199 |     adatas_many2many = list()
200 | 
201 |     for(species_now in species_list){
202 |             adatas_many2many[[species_now]] = adatas[[species_now]][, tolower(adatas[[species_now]]$var_names) %in% tolower(gene_group[[paste0(species_now, "_homolog_ensembl_gene")]])]
203 |             keep_row = adatas_many2many[[species_now]]$var %>%
204 |             arrange(desc(mean_counts)) %>%
205 |             slice(1)
206 |             adatas_many2many[[species_now]] = adatas_many2many[[species_now]][, which(tolower(adatas_many2many[[species_now]]$var_names) == tolower(rownames(keep_row)))]
207 |         }
208 | 
209 |     new_name = adatas_many2many[[species_1]]$var_names
210 |     #message(new_name)
211 | 
212 |     for(species_now in species_list[-1]){
213 |     adatas_many2many[[species_now]]$var[[paste0(species_1, "_homolog_ensembl_gene")]] = new_name
214 |     rownames(adatas_many2many[[species_now]]$var) = new_name
215 | }
216 | 
217 |     for(species_now in species_list){
218 |     if (is.null(adatas_many2many_all[[species_now]])) {
219 |     adatas_many2many_all[[species_now]] <- adatas_many2many[[species_now]]
220 |   } else {
221 |     adatas_many2many_all[[species_now]] <- concat(list(adatas_many2many_all[[species_now]], adatas_many2many[[species_now]]), axis = 1L, join = "outer", merge = "first", index_unique = '-')
222 |   }
223 | 
224 | }
225 | 
226 |   rm(adatas_many2many)
227 | }
228 | 
229 | 
230 | adatas_one2one_and_many_expr = list()
231 | 
232 | for(species_now in names(adatas)){
233 | 
234 |     adatas_one2one_and_many_expr[[species_now]] = concat(list(adatas_one2one[[species_now]], adatas_many2many_all[[species_now]]), axis = 1L, join = 'inner', merge = 'first', index_unique = '-')
235 |     message(species_now)
236 | 
237 |     }
238 | 
239 | rm(adatas_many2many_all)
240 | 
241 | adatas_one2one_higher_expr_uinmf = list()
242 | 
243 | for(species_now in names(adatas)){
244 | 
245 |     adata_unshared = adatas[[species_now]][, !(adatas[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] %in% adatas_one2one_and_many_expr[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]])]
246 |     adatas_one2one_higher_expr_uinmf[[species_now]] = concat(list(adatas_one2one_and_many_expr[[species_now]], adata_unshared), axis = 1L, join = 'inner', merge = 'first', index_unique = '-')
247 | 
248 |     adatas_one2one_higher_expr_uinmf[[species_now]]= adatas_one2one_higher_expr_uinmf[[species_now]][, !duplicated(adatas_one2one_higher_expr_uinmf[[species_now]]$var_names)]
249 | 
250 |     i <- sapply(adatas_one2one_higher_expr_uinmf[[species_now]]$obs, is.factor)
251 |     adatas_one2one_higher_expr_uinmf[[species_now]]$obs[i] <- lapply(adatas_one2one_higher_expr_uinmf[[species_now]]$obs[i], as.character)
252 |     j <- sapply(adatas_one2one_higher_expr_uinmf[[species_now]]$var, is.factor)
253 |     adatas_one2one_higher_expr_uinmf[[species_now]]$var[j] <- lapply(adatas_one2one_higher_expr_uinmf[[species_now]]$var[j], as.character)
254 | 
255 |     write_h5ad(anndata = adatas_one2one_higher_expr_uinmf[[species_now]],
256 |                filename = paste0(out_dir, "/", species_now, "_many_higher_expr_ligerUINMF.h5ad"), compression = "gzip")
257 |     message(species_now)
258 | 
259 |     }
260 | message("finish building higher expression level")
261 | 
262 | rm(adatas_one2one_and_many_expr)
263 | rm(adatas_one2one_higher_expr_uinmf)
264 | 
265 | 
266 | message("start higher homology confidence")
267 | 
268 | ## available attributes to indicate confidence of homology
269 | adatas_many2many_homo_all = list()
270 | 
271 | ## mind the order
272 | order = metadata$X1
273 | avail_ordered = c()
274 | for (attr in c("orthology_confidence", "homolog_goc_score", "homolog_wga_coverage")){
275 | 
276 |     avail_homo = c(avail_attributes$name[grepl(attr, avail_attributes$name)])
277 | 
278 |     for (i in seq(1, length(order))){
279 |    avail_ordered = c(avail_ordered, avail_homo[grepl(order[i], avail_homo)])
280 |   }
281 | }
282 | 
283 | avail_homo = avail_ordered
284 | 
285 | many2many_copy_homo <- many2many %>% rowid_to_column("index")
286 | adata_many2many_homo = AnnData(shape = list(0, 0))
287 | 
288 | while (nrow(many2many_copy_homo) > 0) {
289 |     
290 | 
291 |     if(nrow(many2many_copy_homo) < 6000 & nrow(many2many_copy_homo) > 5990) message(nrow(many2many_copy_homo))
292 | 
293 |     if(nrow(many2many_copy_homo) < 4000 & nrow(many2many_copy_homo) > 3990) message(nrow(many2many_copy_homo))
294 | 
295 |     if(nrow(many2many_copy_homo) < 1000 & nrow(many2many_copy_homo) > 990) message(nrow(many2many_copy_homo))
296 | 
297 | 
298 |     dd <- many2many_copy_homo %>%
299 |         filter(get(paste0(species_1, "_homolog_ensembl_gene"))  == levels(factor(many2many_copy_homo[[paste0(species_1, "_homolog_ensembl_gene")]]))[1])
300 | 
301 |     genes_now = dd %>% dplyr::select(ends_with("_homolog_ensembl_gene")) %>% flatten() %>% unique() %>% as.character()
302 | 
303 |     ## find a gene group with many-to-many relationships
304 |     gene_group <- many2many_copy_homo %>%
305 |         dplyr::filter_at(vars(ends_with("_homolog_ensembl_gene")), any_vars(. %in% genes_now))
306 | 
307 |     if(nrow(gene_group) == 1) {
308 |         gene_keep = gene_group
309 |     } else {
310 |     gene_keep = gene_group %>%
311 |         dplyr::arrange(
312 |         sapply(avail_homo, FUN = function(x) get(x)) ## keep the member of group with highest overall confidence
313 |         ) %>%
314 |             slice(n())
315 |     }
316 | 
317 |     many2many_copy_homo <- many2many_copy_homo %>%
318 |             filter(!index %in% gene_group$index)
319 | 
320 |     adatas_many2many_homo = list()
321 |     for(species_now in species_list){
322 | 
323 |     adatas_many2many_homo[[species_now]] = adatas[[species_now]][, adatas[[species_now]]$var_names %in% gene_keep[[paste0(species_now, "_homolog_ensembl_gene")]]]
324 | 
325 | }
326 | 
327 |     new_name = adatas_many2many_homo[[species_1]]$var_names
328 |     #message(new_name)
329 |     for(species_now in species_list[-1]){
330 |     adatas_many2many_homo[[species_now]]$var[[paste0(species_1, "_homolog_ensembl_gene")]] = new_name
331 |     rownames(adatas_many2many_homo[[species_now]]$var) = new_name
332 | }
333 | 
334 |     for(species_now in species_list){
335 |     if (is.null(adatas_many2many_homo_all[[species_now]])) {
336 |     adatas_many2many_homo_all[[species_now]] <- adatas_many2many_homo[[species_now]]
337 |   } else {
338 |     adatas_many2many_homo_all[[species_now]] <- concat(list(adatas_many2many_homo_all[[species_now]], adatas_many2many_homo[[species_now]]), axis = 1L, join = "outer", merge = "first", index_unique = '-')
339 |   }
340 | }
341 |   rm(adatas_many2many_homo)
342 | 
343 | }
344 | adatas_one2one_and_many_homo = list()
345 | for(species_now in names(adatas)){
346 | 
347 |     adatas_one2one_and_many_homo[[species_now]] = concat(list(adatas_one2one[[species_now]], adatas_many2many_homo_all[[species_now]]), axis = 1L, join = 'inner', merge = 'first', index_unique = '-')
348 |     message(species_now)
349 | 
350 |     }
351 | 
352 | rm(adatas_many2many_homo_all)
353 | 
354 | 
355 | adatas_one2one_higher_homo_uinmf = list()
356 | 
357 | for(species_now in names(adatas)){
358 | 
359 |     adata_unshared = adatas[[species_now]][, !(adatas[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]] %in% adatas_one2one_and_many_homo[[species_now]]$var[[paste0(species_now, '_homolog_ensembl_gene')]])]
360 |     adatas_one2one_higher_homo_uinmf[[species_now]] = concat(list(adatas_one2one_and_many_homo[[species_now]], adata_unshared), axis = 1L, join = 'inner', merge = 'first', index_unique = '-')
361 | 
362 |     adatas_one2one_higher_homo_uinmf[[species_now]]= adatas_one2one_higher_homo_uinmf[[species_now]][, !duplicated(adatas_one2one_higher_homo_uinmf[[species_now]]$var_names)]
363 | 
364 |     i <- sapply(adatas_one2one_higher_homo_uinmf[[species_now]]$obs, is.factor)
365 |     adatas_one2one_higher_homo_uinmf[[species_now]]$obs[i] <- lapply(adatas_one2one_higher_homo_uinmf[[species_now]]$obs[i], as.character)
366 |     j <- sapply(adatas_one2one_higher_homo_uinmf[[species_now]]$var, is.factor)
367 |     adatas_one2one_higher_homo_uinmf[[species_now]]$var[j] <- lapply(adatas_one2one_higher_homo_uinmf[[species_now]]$var[j], as.character)
368 | 
369 |     write_h5ad(anndata = adatas_one2one_higher_homo_uinmf[[species_now]],
370 |                filename = paste0(out_dir, "/", species_now, "_many_higher_homology_conf_ligerUINMF.h5ad"), compression = "gzip")
371 |     message(species_now)
372 | 
373 |     }
374 | message("finish higher homology confidence")
375 | 
376 | rm(adatas_one2one_and_many_homo)
377 | rm(adatas_one2one_higher_homo_uinmf)
378 | 
379 | 
380 | 
381 | metadata = data.frame('species' = names(adatas))
382 | metadata$one2one_only = paste0(out_dir, "/", metadata$species, "_one2one_only_ligerUINMF.rds")
383 | metadata$many_higher_expr = paste0(out_dir, "/", metadata$species, "_many_higher_expr_ligerUINMF.rds")
384 | metadata$many_higher_homology_conf = paste0(out_dir, "/", metadata$species, "_many_higher_homology_conf_ligerUINMF.rds")
385 | write_tsv(metadata, file = metadata_output, col_names = TRUE)
386 | 
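
The UINMF variant above writes one file per species in which shared orthologs carry the reference species' gene IDs and all remaining genes keep their own IDs as unshared features. The sketch below illustrates that feature layout on a plain pandas cells-by-genes table. The function `build_uinmf_matrix` and the mapping `to_reference` are hypothetical names for illustration, not part of the pipeline.

```python
import pandas as pd


def build_uinmf_matrix(counts, to_reference):
    """Rename shared genes to reference IDs and append the unshared genes.

    counts: cells x genes DataFrame for one species.
    to_reference: dict mapping this species' gene IDs to reference-species IDs.
    """
    shared_cols = [g for g in counts.columns if g in to_reference]
    unshared_cols = [g for g in counts.columns if g not in to_reference]
    shared = counts[shared_cols].rename(columns=to_reference)
    unshared = counts[unshared_cols]
    combined = pd.concat([shared, unshared], axis=1)
    # Keep the first occurrence of any duplicated feature name,
    # comparable to the !duplicated(var_names) step in the R script.
    return combined.loc[:, ~combined.columns.duplicated()]


# Toy mouse counts (hypothetical IDs): M1 and M2 have orthologs, M9 does not.
mouse_counts = pd.DataFrame(
    [[1, 0, 3], [0, 2, 4]],
    index=["cell1", "cell2"],
    columns=["M1", "M2", "M9"],
)
result = build_uinmf_matrix(mouse_counts, {"M1": "H1", "M2": "H2"})
print(result.columns.tolist())
```

Placing the shared block first keeps the shared features in the same position for every species, while the trailing columns differ per species.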


--------------------------------------------------------------------------------
/bin/convert_format.R:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env Rscript
 2 | 
 3 | library(optparse)
 4 | 
 5 | # © EMBL-European Bioinformatics Institute, 2023
 6 | # Yuyao Song <ysong@ebi.ac.uk>
 7 | 
 8 | option_list <- list(
 9 |   make_option(c("-i", "--input_file"),
10 |     type = "character", default = NULL,
11 |     help = "Path to input file for converting"
12 |   ),
13 |   make_option(c("-o", "--output_file"),
14 |     type = "character", default = NULL,
15 |     help = "Output file after conversion"
16 |   ),
17 |   make_option(c("-t", "--type"),
18 |     type = "character", default = NULL,
19 |     help = "Conversion type, choose between anndata_to_seurat or seurat_to_anndata"
20 |   ),
21 |     make_option(c("--conda_path"),
22 |     type = "character", default = NULL,
23 |     help = "Path to the conda environment whose Python executable reticulate should use; must match the prepared conda env"
24 |   )
25 | )
26 | 
27 | # parse input
28 | opt <- parse_args(OptionParser(option_list = option_list))
29 | 
30 | input_file <- opt$input_file
31 | output_file <- opt$output_file
32 | type <- opt$type
33 | conda_path <- opt$conda_path
34 | 
35 | # set sys env before loading reticulate
36 | Sys.setenv(RETICULATE_PYTHON=paste0(conda_path, "/bin/python3"))
37 | Sys.setenv(RETICULATE_PYTHON_ENV=conda_path)
38 | 
39 | library(reticulate)
40 | library(sceasy)
41 | library(anndata)
42 | 
43 | 
44 | if(type == 'anndata_to_seurat'){
45 | 
46 |     message(paste0("from anndata to seurat, input: ", input_file))
47 | 
48 |     sceasy::convertFormat(input_file, from="anndata", to="seurat",
49 |                        outFile=output_file)
50 | } else if (type == 'seurat_to_anndata'){
51 | 
52 |     message(paste0("from seurat to anndata, input: ", input_file))
53 |     obj <- readRDS(input_file)
54 |     dir_name = dirname(input_file)
55 |     ## oddly, converting from seurat to anndata requires loading the object, while the other direction works on-disk
56 |     sceasy::convertFormat(obj, from="seurat", to="anndata",
57 |                        outFile=output_file)
58 | 
59 | } else {
60 | 
61 |     stop("type must be either anndata_to_seurat or seurat_to_anndata")
62 | }
63 | 


--------------------------------------------------------------------------------
/bin/fastMNN_integration.R:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env Rscript
 2 | 
 3 | # © EMBL-European Bioinformatics Institute, 2023
 4 | # Yuyao Song <ysong@ebi.ac.uk>
 5 | 
 6 | library(optparse)
 7 | library(Seurat)
 8 | library(SeuratWrappers)
 9 | 
10 | option_list <- list(
11 |   make_option(c("-i", "--input_rds"),
12 |     type = "character", default = NULL,
13 |     help = "Path to input preprocessed rds file"
14 |   ),
15 |   make_option(c("-o", "--out_rds"),
16 |     type = "character", default = NULL,
17 |     help = "Output fastMNN from Seurat wrappers integrated rds file"
18 |   ),
19 |   make_option(c("-p", "--out_UMAP"),
20 |     type = "character", default = NULL,
21 |     help = "Output UMAP after fastMNN integration"
22 |   ),
23 |   make_option(c("-b", "--batch_key"),
24 |     type = "character", default = NULL,
25 |     help = "Batch key identifier to integrate"
26 |   ),
27 |   make_option(c("-s", "--species_key"),
28 |     type = "character", default = NULL,
29 |     help = "Species key identifier"
30 |   ),
31 |   make_option(c("-c", "--cluster_key"),
32 |     type = "character", default = NULL,
33 |     help = "Cluster key for UMAP plotting"
34 |   )
35 | )
36 | 
37 | # parse input
38 | opt <- parse_args(OptionParser(option_list = option_list))
39 | 
40 | input_rds <- opt$input_rds
41 | out_rds <- opt$out_rds
42 | out_UMAP <- opt$out_UMAP
43 | batch_key <- opt$batch_key
44 | species_key <- opt$species_key
45 | cluster_key <- opt$cluster_key
46 | 
47 | ## create Seurat object via rds
48 | 
49 | # Convert(input_h5ad, dest = "rds", overwrite = TRUE)
50 | # input_rds <- gsub("h5ad", "rds", input_h5ad)
51 | obj <- readRDS(input_rds)
52 | 
53 | obj <- NormalizeData(obj)
54 | obj <- FindVariableFeatures(obj)
55 | 
56 | ## run fastMNN
57 | obj <- RunFastMNN(object.list = SplitObject(obj, split.by = batch_key))
58 | obj <- RunUMAP(obj, reduction = "mnn", dims = 1:30, n.neighbors = 15L, min.dist = 0.3)
59 | obj <- FindNeighbors(obj, reduction = "mnn", dims = 1:30)
60 | obj <- FindClusters(obj, resolution = 0.4)
61 | 
62 | # convert all factors to character; otherwise they are stored as numeric codes when later converted to h5ad
63 | i <- sapply(obj@meta.data, is.factor)
64 | obj@meta.data[i] <- lapply(obj@meta.data[i], as.character)
65 | 
66 | saveRDS(obj,
67 |   file= out_rds,
68 | )
69 | 
70 | 
71 | pdf(out_UMAP, height = 6, width = 10)
72 | DimPlot(obj, reduction = "umap", group.by = species_key, shuffle = TRUE, label = TRUE)
73 | DimPlot(obj, reduction = "umap", group.by = batch_key, shuffle = TRUE, label = TRUE)
74 | DimPlot(obj, reduction = "umap", group.by = cluster_key, shuffle = TRUE, label = TRUE)
75 | 
76 | dev.off()
77 | 
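
`RunFastMNN` builds on mutual nearest neighbours (MNN) between batches. The brute-force NumPy sketch below only illustrates what an MNN pair is. It is not the fastMNN implementation, which also works in a reduced-dimension space and computes correction vectors. The function name and toy coordinates are hypothetical.

```python
import numpy as np


def mutual_nearest_neighbours(a, b, k=1):
    """Return (i, j) pairs where a[i] and b[j] are in each other's k nearest neighbours."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # For each cell in a: indices of its k nearest cells in b.
    nn_ab = np.argsort(dist, axis=1)[:, :k]
    # For each cell in b: indices of its k nearest cells in a.
    nn_ba = np.argsort(dist, axis=0)[:k, :].T
    pairs = []
    for i in range(len(a)):
        for j in nn_ab[i]:
            if i in nn_ba[j]:
                pairs.append((i, int(j)))
    return pairs


# Two toy batches in 2D; the third cell of batch_b has no mutual partner.
batch_a = [[0.0, 0.0], [10.0, 10.0]]
batch_b = [[0.0, 1.0], [10.0, 11.0], [5.0, 6.0]]
pairs = mutual_nearest_neighbours(batch_a, batch_b, k=1)
print(pairs)
```

The quadratic distance matrix is acceptable for a toy example only. Real datasets need an approximate or tree-based neighbour search.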


--------------------------------------------------------------------------------
/bin/harmony_integration.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # © EMBL-European Bioinformatics Institute, 2023
 4 | # Yuyao Song <ysong@ebi.ac.uk>
 5 | 
 6 | 
 7 | import click
 8 | import matplotlib.pyplot as plt
 9 | import scanpy as sc
10 | 
11 | @click.command()
12 | @click.argument("input_h5ad", type=click.Path(exists=True))
13 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None)
14 | @click.argument("out_umap", type=click.Path(exists=False), default=None)
15 | @click.option('--batch_key', type=str, default=None, help="Batch key used for HVG selection and Harmony integration")
16 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species")
17 | @click.option('--cluster_key', type=str, default=None, help="Cluster key used to color the UMAP plot")
18 | 
19 | 
20 | def run_harmony(input_h5ad, out_h5ad, out_umap, batch_key, species_key, cluster_key):
21 |     click.echo('Start harmony integration')
22 |     sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 6))
23 |     adata = sc.read_h5ad(input_h5ad)
24 |     adata.var_names_make_unique()
25 |     sc.pp.normalize_total(adata, target_sum=1e4)
26 |     sc.pp.log1p(adata)
27 |     click.echo("HVG")
28 |     sc.pp.highly_variable_genes(adata, batch_key=batch_key)
29 |     sc.pp.scale(adata, max_value=10)
30 |     sc.tl.pca(adata)
31 |     sc.pp.neighbors(adata, use_rep='X_pca', n_neighbors=15, n_pcs=40)
32 |     sc.tl.umap(adata, min_dist=0.3) ## to match min_dist in seurat
33 |     adata.obsm['X_umapraw'] = adata.obsm['X_umap']
34 |     click.echo("Harmony")
35 |     sc.external.pp.harmony_integrate(adata, key=batch_key, basis = 'X_pca')
36 |     sc.pp.neighbors(adata, use_rep='X_pca_harmony', key_added = 'harmony', n_neighbors=15, n_pcs=40)
37 |     sc.tl.umap(adata, neighbors_key = 'harmony', min_dist=0.3) ## to match min_dist in seurat
38 |     sc.pl.umap(adata, color=[batch_key, species_key, cluster_key], ncols=1)
39 |     plt.savefig(out_umap, dpi=300,  bbox_inches='tight')
40 |     adata.obsm['X_umapharmony'] = adata.obsm['X_umap']
41 |     click.echo("Save output")
42 |     adata.write(out_h5ad)
43 |     click.echo("Done harmony")
44 | 
45 | 
46 | if __name__ == '__main__':
47 |     run_harmony()
48 | 
49 | 
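
The Harmony script starts with `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`. The NumPy sketch below shows the arithmetic these two steps correspond to on a dense cells-by-genes matrix. The function name `normalize_log1p` is hypothetical, and the guard for all-zero cells is an assumption of this sketch rather than a statement about scanpy's internals.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption of this sketch: cells with zero total counts stay all-zero
    # instead of producing a division by zero.
    safe_totals = np.where(totals == 0, 1.0, totals)
    scaled = counts / safe_totals * target_sum
    return np.log1p(scaled)


# Toy matrix: three cells, three genes; the last cell is empty.
toy_counts = np.array([[1, 1, 2], [0, 5, 5], [0, 0, 0]])
normalized = normalize_log1p(toy_counts)
# Undoing the log recovers per-cell totals of target_sum (or 0 for the empty cell).
row_totals = np.expm1(normalized).sum(axis=1)
print(row_totals)
```

After this step, cells with different sequencing depths are on a comparable scale before highly variable gene selection and PCA.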


--------------------------------------------------------------------------------
/bin/kbet_multiple_species.R:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env Rscript
  2 | 
  3 | # © EMBL-European Bioinformatics Institute, 2023
  4 | # Yuyao Song <ysong@ebi.ac.uk>
  5 | library(anndata)
  6 | library(mclust)
  7 | library(cluster)
  8 | library(optparse)
  9 | library(kBET)
 10 | library(tidyverse)
 11 | library(FNN)
 12 | 
 13 | 
 14 | option_list <- list(
 15 |   make_option(c("-i", "--input_h5ad"),
 16 |     type = "character", default = NULL,
 17 |     help = "Path to input integrated h5ad file"
 18 |   ),
 19 |   make_option(c("-o", "--out_csv"),
 20 |     type = "character", default = NULL,
 21 |     help = "Output csv file with various batch effect measurements"
 22 |   ),
 23 |   make_option(c("-m", "--method"),
 24 |     type = "character", default = NULL,
 25 |     help = "Integration method used; determines whether the embedding or the count matrix is used"
 26 |   ),
 27 |   make_option(c("-b", "--batch_key"),
 28 |     type = "character", default = NULL,
 29 |     help = "Batch key identifier to integrate"
 30 |   ),
 31 |   make_option(c("-c", "--cluster_key"),
 32 |     type = "character", default = NULL,
 33 |     help = "Cell type cluster key"
 34 |   ),
 35 |   make_option(c("-s", "--species_key"),
 36 |     type = "character", default = NULL,
 37 |     help = "Species key identifier"
 38 |   )
 39 | )
 40 | 
 41 | opt <- parse_args(OptionParser(option_list = option_list))
 42 | 
 43 | input_h5ad <- opt$input_h5ad
 44 | out_csv <- opt$out_csv
 45 | method <- opt$method
 46 | batch_key <- opt$batch_key
 47 | cluster_key <- opt$cluster_key
 48 | species_key <- opt$species_key
 49 | 
 50 | message("read in h5ad")
 51 | ad <- read_h5ad(input_h5ad)
 52 | 
 53 | ## identify shared cell types for measuring batch effect
 54 | species_all = levels(factor(ad$obs[[species_key]]))
 55 | cell_types = list()
 56 | for(species in species_all){
 57 |     cell_types[[species]] = levels(factor(ad$obs[ad$obs[[species_key]] == species, cluster_key]))
 58 | }
 59 | 
 60 | ct_shared=Reduce(intersect, cell_types)
 61 | message(paste0("shared cell types include ", paste(ct_shared, collapse = ", ")))
 62 | 
 63 | ad <- ad[ad$obs[[cluster_key]] %in% ct_shared, ]
 64 | 
 65 | # get the matrix to compute batch effect
 66 | # these methods return embedding
 67 | if (method == "harmony") {
 68 |   data <- data.frame(ad$obsm[["X_pca_harmony"]], row.names = ad$obs_names)
 69 |   do_pca <- FALSE
 70 | }
 71 | if (method == "scanorama") {
 72 |   data <- data.frame(ad$obsm[["X_scanorama"]], row.names = ad$obs_names)
 73 |   do_pca <- FALSE
 74 | }
 75 | if (method == "scVI") {
 76 |   data <- data.frame(ad$obsm[["X_scVI"]], row.names = ad$obs_names)
 77 |   do_pca <- FALSE
 78 | }
 79 | if (method == "LIGER") {
 80 |   data <- data.frame(ad$obsm[["X_iNMF"]], row.names = ad$obs_names)
 81 |   do_pca <- FALSE
 82 | }
 83 | if (method == "rligerUINMF") {
 84 |   data <- data.frame(ad$obsm[["X_inmf"]], row.names = ad$obs_names)
 85 |   do_pca <- FALSE
 86 | }
 87 | if (method == "fastMNN") {
 88 |   data <- data.frame(ad$obsm[["X_mnn"]], row.names = ad$obs_names)
 89 |   do_pca <- FALSE
 90 | }
 91 | 
 92 | if (method == "SAMap") {
 93 |   data <- data.frame(ad$obsm[["wPCA"]], row.names = ad$obs_names)
 94 |   do_pca <- FALSE
 95 | }
 96 | 
 97 | # Seurat returns a pseudo-count matrix after integration, so kBET runs PCA on it
 98 | if (method %in% c("seuratCCA", "seuratRPCA", "unintegrated")) {
 99 |   data <- data.frame(as.matrix((ad$X)), row.names = ad$obs_names) # sparse matrix to dense
100 |   do_pca <- TRUE
101 | }
102 | 
103 | batch <- ad$obs[[batch_key]]
104 | 
105 | # if only 2 batches use pval in linear model, if multiple batches use pval in ANOVA F test in PC regression
106 | if (length(levels(factor(as.character(batch)))) > 2) {
107 |   use_pval <- "p.value.F.test"
108 | } else {
109 |   use_pval <- "p.value.lm"
110 | }
111 | 
112 | kbet_per_ct <- data.frame()
113 | for (ct in levels(factor(ad$obs[[cluster_key]]))) {
114 |   message(ct)
115 |   ad_ct <- ad[ad$obs[[cluster_key]] == ct, ]
116 |   data_ct <- data[ad_ct$obs_names, ]
117 |   batch <- ad_ct$obs[[batch_key]]
118 |   k0 <- floor(mean(table(batch))) # neighbourhood size: mean batch size
119 |   knn <- get.knn(data_ct, k = k0, algorithm = "cover_tree")
120 |   # now run kBET with pre-defined nearest neighbours.
121 |   batch.estimate <- suppressWarnings(kBET(data_ct, batch, k = k0, knn = knn, plot = FALSE, do.pca = do_pca))
122 | 
123 |   if (any(is.na(batch.estimate))) {
124 |     message(paste0("cell type ", ct, " has fewer than 10 cells, skipping cell-type kBET"))
125 |     next
126 |   }
127 | 
128 |   add <- batch.estimate$summary["mean", ]
129 |   add$cell_type <- ct
130 | 
131 |   if(nrow(kbet_per_ct) == 0){
132 | 
133 |       kbet_per_ct <- add %>% pivot_longer(cols = 1:3, names_to = "measure", values_to = "value")
134 | 
135 |   } else {
136 |   kbet_per_ct <- rbind(kbet_per_ct, add %>% pivot_longer(cols = 1:3, names_to = "measure", values_to = "value"))
137 |   }
138 | }
139 | 
140 | 
141 | message("write summary")
142 | 
143 | summary <- kbet_per_ct %>% pivot_wider(id_cols = cell_type, names_from = measure, values_from = value)
144 | summary$method <- method
145 | summary$input_h5ad <- input_h5ad
146 | write_csv(summary, out_csv)
147 | 
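
The shared-cell-type filter and the neighbourhood size used in `kbet_multiple_species.R` can be illustrated outside R. The following is a minimal Python sketch, not the pipeline's implementation: the helper names `shared_cell_types` and `mean_batch_size` and the toy labels are hypothetical, and only the two rules themselves (intersect cell types across species; take the floor of the mean batch size as `k0`) come from the script.

```python
from collections import Counter
from functools import reduce
from math import floor


def shared_cell_types(species_labels, cell_type_labels):
    """Return the sorted cell types that occur in every species."""
    per_species = {}
    for species, cell_type in zip(species_labels, cell_type_labels):
        per_species.setdefault(species, set()).add(cell_type)
    # Intersect the per-species sets, like Reduce(intersect, cell_types) in R.
    return sorted(reduce(set.intersection, per_species.values()))


def mean_batch_size(batch_labels):
    """Neighbourhood size k0: floor of the mean number of cells per batch."""
    counts = Counter(batch_labels)
    return floor(sum(counts.values()) / len(counts))


# Toy per-cell annotations (hypothetical values, not pipeline data).
species = ["human", "human", "human", "mouse", "mouse", "mouse"]
cell_types = ["T cell", "B cell", "NK cell", "T cell", "B cell", "Mast cell"]
batches = ["h1", "h1", "h2", "m1", "m1", "m1"]

shared = shared_cell_types(species, cell_types)
k0 = mean_batch_size(batches)
print(shared, k0)
```

In the R script the same `k0` is recomputed per cell type, because the batch vector is subset to one cell type before `table(batch)` is taken.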


--------------------------------------------------------------------------------
/bin/rliger_integration_UINMF_multiple_species.R:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env Rscript
  2 | 
  3 | # © EMBL-European Bioinformatics Institute, 2023
  4 | # Yuyao Song <ysong@ebi.ac.uk>
  5 | 
  6 | library(optparse)
  7 | library(rliger)
  8 | library(scCustomize) # liger to seurat function keep_metadata
  9 | library(SeuratDisk)
 10 | 
 11 | 
 12 | option_list <- list(
 13 |   make_option(c("--metadata"),
 14 |     type = "character", default = NULL,
 15 |     help = "Path to a tab-separated file that maps each species to its rds files"
 16 |   ),
 17 |   make_option(c("--basename"),
 18 |       type = "character", default = NULL,
 19 |       help = "Basename of file to save"
 20 |   ),
 21 |   make_option(c("--out_dir"),
 22 |     type = "character", default = NULL,
 23 |     help = "output dir to write rds files"
 24 |   ),
 25 |   make_option(c("--cluster_key"),
 26 |     type = "character", default = NULL,
 27 |     help = "Cell type cluster key"
 28 |   )
 29 | )
 30 | opt <- parse_args(OptionParser(option_list = option_list))
 31 | 
 32 | metadata_path <- opt$metadata
 33 | basename <- opt$basename
 34 | out_dir <- opt$out_dir
 35 | cluster_key <- opt$cluster_key
 36 | 
 37 | 
 38 | metadata = read.table(metadata_path, sep = '\t', header=TRUE)
 39 | metadata = as.data.frame(metadata)
 40 | 
 41 | for(type in colnames(metadata)[-1]){
 42 | 
 43 | message(type)
 44 | 
 45 | obj_list = list()
 46 | for(i in seq(1, nrow(metadata))){
 47 | 
 48 |     obj_list[[i]] = readRDS(metadata[i, type])
 49 | }
 50 | 
 51 | liger <- seuratToLiger(obj_list, remove.missing = FALSE, renormalize = FALSE, names = metadata$species)
 52 | 
 53 | meta_all = data.frame()
 54 | keep_cols = Reduce(intersect, lapply(obj_list, FUN = function(x) colnames(x@meta.data)))
 55 | 
 56 | for(i in seq(1, nrow(metadata))){
 57 |     message(i)
 58 |     obj_list[[i]]@meta.data$cell_id = rownames(obj_list[[i]]@meta.data)
 59 |     use = obj_list[[i]]@meta.data[, c("cell_id", keep_cols)]
 60 |     meta_all = rbind(meta_all, use)
 61 | 
 62 | }
 63 | 
 64 | meta_new = meta_all[match(rownames(liger@cell.data), rownames(meta_all)), ]
 65 | meta_new = cbind(meta_new, liger@cell.data)
 66 | meta_new = meta_new[match(rownames(liger@cell.data), rownames(meta_new)), ]
 67 | 
 68 | liger@cell.data <- meta_new
 69 | 
 70 | species.liger <- normalize(liger)
 71 | species.liger <- rliger::selectGenes(species.liger, var.thres = 0.3, unshared = TRUE, unshared.datasets = list(1, 2), unshared.thresh = 0.3)
 72 | species.liger <- scaleNotCenter(species.liger)
 73 | species.liger <- optimizeALS(species.liger, lambda = 5, use.unshared = TRUE, thresh = 1e-10, k = 30)
 74 | species.liger <- quantile_norm(species.liger, ref_dataset = metadata$species[1])
 75 | species.liger <- louvainCluster(species.liger)
 76 | species.liger <- runUMAP(species.liger)
 77 | 
 78 | seurat_obj <- scCustomize::Liger_to_Seurat(
 79 |   species.liger,
 80 |   nms = names(species.liger@H),
 81 |   renormalize = TRUE,
 82 |   use.liger.genes = TRUE,
 83 |   by.dataset = FALSE,
 84 |   keep_meta = TRUE,
 85 |   reduction_label = "umap", # in line with the X_umap in scanpy
 86 |   seurat_assay = "RNA"
 87 | )
 88 | 
 89 | k <- sapply(seurat_obj@meta.data, is.factor)
 90 | seurat_obj@meta.data[k] <- lapply(seurat_obj@meta.data[k], as.character) # known bug in seurat to h5ad for factors
 91 | 
 92 | message("save pdf")
 93 | pdf(paste0(out_dir,"/", basename, "_", type, "_rligerUINMF_integrated_UMAP.pdf"), height = 10, width = 12)
 94 | print(DimPlot(object = seurat_obj, reduction = 'umap', group.by = c('species', 'cell_ontology_mapped'), ncol = 1)) # explicit print: plots are not auto-printed inside a for loop
 95 | dev.off()
 96 | 
 97 | 
 98 | message("save seurat object")
 99 | 
100 | saveRDS(seurat_obj,
101 |   file = paste0(out_dir,"/", basename, "_", type, "_rligerUINMF_integrated.rds")
102 | )
103 | 
104 | message(paste0("finish ", type))
105 | 
106 | }
107 | 
108 | 
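
The metadata step in the script above reorders the combined per-cell table so that its rows follow the cell order of the liger object (`match(rownames(liger@cell.data), rownames(meta_all))`). A minimal Python sketch of that reordering is given here for illustration; `align_rows` and the toy cell ids are hypothetical and stand in for the R data frames.

```python
def align_rows(target_ids, rows_by_id):
    """Return the rows ordered like target_ids; ids without a row give None."""
    return [rows_by_id.get(cell_id) for cell_id in target_ids]


# Hypothetical cell ids and metadata rows.
liger_order = ["cell_3", "cell_1", "cell_2"]
meta_by_id = {
    "cell_1": {"species": "human", "cell_type": "T cell"},
    "cell_2": {"species": "mouse", "cell_type": "B cell"},
    "cell_3": {"species": "mouse", "cell_type": "T cell"},
}

aligned = align_rows(liger_order, meta_by_id)
print([row["species"] for row in aligned])
```

As with `match()` in R, an id that is missing from the table yields an empty entry rather than an error, so it is worth checking for gaps before binding the columns together.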


--------------------------------------------------------------------------------
/bin/scANVI_integration.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # © EMBL-European Bioinformatics Institute, 2023
 4 | # Yuyao Song <ysong@ebi.ac.uk>
 5 | 
 6 | 
 7 | import click
 8 | import matplotlib.pyplot as plt
 9 | import scanpy as sc
10 | import scvi
11 | 
12 | @click.command()
13 | @click.argument("input_h5ad", type=click.Path(exists=True))
14 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None)
15 | @click.argument("out_umap", type=click.Path(exists=False), default=None)
16 | @click.argument("out_scanvi", type=click.Path(exists=False), default=None)
17 | @click.option('--batch_key', type=str, default=None, help="Batch key used for HVG selection and scANVI integration")
18 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species")
19 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species one to use as labels to transfer to species two")
20 | 
21 | 
22 | def run_scANVI(input_h5ad, out_h5ad, out_umap, out_scanvi, batch_key, species_key, cluster_key):
23 |     click.echo('Start scANVI integration')
24 | 
25 |     sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 6))
26 | 
27 |     adata = sc.read_h5ad(input_h5ad)
28 |     adata.var_names_make_unique()
29 |     sc.pp.highly_variable_genes(
30 |         adata,
31 |         flavor="seurat_v3",
32 |         n_top_genes=2000,
33 |         ##layer="counts",
34 |         batch_key=batch_key,
35 |         subset=True
36 |     )
37 | 
38 |     adata.layers["counts"] = adata.X.copy()
39 |     sc.pp.normalize_total(adata, target_sum=1e4)
40 |     sc.pp.log1p(adata)
41 |     adata.raw = adata
42 | 
43 | 
44 |     click.echo("setup scANVI model")
45 | 
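46 |     # scANVI models raw counts: use the counts layer saved above, not the log-normalised adata.X
47 |     scvi.model.SCANVI.setup_anndata(adata, layer="counts", labels_key=cluster_key, unlabeled_category="Unknown", batch_key=batch_key)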
48 |     vae = scvi.model.SCANVI(adata)
49 |     vae.train()
50 |     adata.obsm["X_scANVI"] = vae.get_latent_representation()
51 |     adata.obsm["X_mde_scanvi"] = scvi.model.utils.mde(adata.obsm["X_scANVI"])
52 | 
53 |     sc.pl.embedding(
54 |     adata, basis="X_mde_scanvi", color=[batch_key, species_key, cluster_key], ncols=1, frameon=False
55 |     )
56 |     plt.savefig(out_scanvi, dpi=300,  bbox_inches='tight')
57 | 
58 |     num_pcs = min(adata.obsm['X_scANVI'].shape[1], 40)
59 |     if num_pcs < 40:
60 |         click.echo("using fewer dimensions: " + str(num_pcs))
61 | 
62 |     sc.pp.neighbors(adata, use_rep="X_scANVI", key_added='scANVI', n_neighbors=15, n_pcs=num_pcs)
63 |     sc.tl.umap(adata, neighbors_key='scANVI', min_dist=0.3) ## to match min_dist in seurat
64 |     sc.pl.umap(adata, neighbors_key='scANVI', color=[batch_key, species_key, cluster_key], ncols=1)
65 |     plt.savefig(out_umap, dpi=300,  bbox_inches='tight')
66 | 
67 |     adata.obsm['X_umapscANVI'] = adata.obsm['X_umap']
68 | 
69 |     click.echo("Save output")
70 |     adata.write(out_h5ad)
71 |     click.echo("Done scANVI")
72 | 
73 | 
74 | if __name__ == '__main__':
75 |     run_scANVI()
76 | 
77 | 
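
The script above passes `unlabeled_category="Unknown"` to scANVI, which treats every cell carrying that label as unlabeled. The script itself does not relabel any cells, so the sketch below is an assumption about how one might prepare the label column for label transfer from one species to another; `mask_query_labels` and the toy values are hypothetical and are not part of the pipeline.

```python
def mask_query_labels(species, labels, query_species, unlabeled="Unknown"):
    """Replace the label of every cell from query_species with the unlabeled marker."""
    return [
        unlabeled if cell_species == query_species else label
        for cell_species, label in zip(species, labels)
    ]


# Toy per-cell annotations (hypothetical values).
species = ["human", "human", "mouse", "mouse"]
labels = ["T cell", "B cell", "T cell", "B cell"]

masked = mask_query_labels(species, labels, query_species="mouse")
print(masked)
```

The masked list would then be stored in the column named by `--cluster_key` before the model is set up, while the original labels are kept in a separate column for evaluation.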


--------------------------------------------------------------------------------
/bin/scIB_metrics.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # © EMBL-European Bioinformatics Institute, 2023
  4 | # Yuyao Song <ysong@ebi.ac.uk>
  5 | 
  6 | 
  7 | import click
  8 | import pandas as pd
  9 | import scanpy as sc
 10 | import scib
 11 | 
 12 | 
 13 | @click.command()
 14 | 
 15 | @click.argument("input_h5ad", type=click.Path(exists=True))
 16 | @click.argument("unintegrated_h5ad", type=click.Path(exists=False), default=None)
 17 | @click.argument("out_csv", type=click.Path(exists=False), default=None)
 18 | @click.argument("out_silhouette", type=click.Path(exists=False), default=None)
 19 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None)
 20 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species")
 21 | @click.option('--batch_key', type=str, default=None, help="Batch key on which integration is performed")
 22 | @click.option('--integration_method', type=str, default=None, help="Integration method")
 23 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species one to use as labels to transfer to species two")
 24 | 
 25 | 
 26 | def run_scIB_metrics(input_h5ad, unintegrated_h5ad, out_csv, out_silhouette, out_h5ad, species_key, batch_key, cluster_key, integration_method):
 27 |     # dictionary for method properties
 28 |     embedding_keys={"harmony": "X_pca_harmony", "scanorama": "X_scanorama", "scVI": "X_scVI", "LIGER": "X_iNMF", "rligerUINMF":"X_inmf", "fastMNN": "X_mnn", "SAMap": "wPCA" }
 29 |     use_embeddings={"harmony": True, "scanorama": True, "scVI": True, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": True , "seuratCCA": False, "seuratRPCA": False, "unintegrated": False}
 30 |     from_h5seurat={"harmony": False, "scanorama": False, "scVI": False, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": False , "seuratCCA": True, "seuratRPCA": True, "unintegrated": False}
 31 |     sc.set_figure_params(dpi_save=200, frameon=False, figsize=(10, 5))
 32 |     click.echo("Read anndata")
 33 |     input_ad = sc.read_h5ad(input_h5ad)
 34 |     orig_ad = sc.read_h5ad(unintegrated_h5ad)
 35 |     species_all=input_ad.obs[species_key].astype("category").cat.categories.values
 36 |     # known bug: converting h5Seurat to h5ad leaves the raw var index in a column named '_index'; rename it so the object can be written
 37 |     if from_h5seurat[integration_method] is True:
 38 |         input_ad.__dict__['_raw'].__dict__['_var'] = input_ad.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'})
 39 | 
 40 |     use_embedding=use_embeddings[integration_method]
 41 |     if use_embedding is True:
 42 |         embedding_key=embedding_keys[integration_method]
 43 | 
 44 | 
 45 |     # register color in .uns
 46 |     sc.pl.umap(input_ad, color = cluster_key)
 47 |     color = dict(zip(input_ad.obs[cluster_key].cat.categories.to_list(), input_ad.uns[cluster_key+'_colors']))
 48 |     # get neighbours on integrated and unintegrated data
 49 |     # for lisi type_='knn'
 50 |     ## LIGER embedding only has 20 dims
 51 | 
 52 |     if integration_method == 'SAMap':
 53 |         click.echo("use SAMap KNN graph")
 54 |         ## do nothing
 55 |     elif use_embedding is True:
 56 |         click.echo("calculate KNN graph from embedding " + embedding_key)
 57 |         sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep=embedding_key)
 58 |         ## compute knn if use embedding
 59 |     else:
 60 |         click.echo("use PCA to compute KNN graph")
 61 |         sc.tl.pca(input_ad, svd_solver='arpack')
 62 |         sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep='X_pca')
 63 |         embedding_key='X_pca'
 64 |     ## while no embedding, compute PCA and compute knn
 65 | 
 66 |     ## get neighbour graph from unintegrated data
 67 |     sc.pp.normalize_total(orig_ad, target_sum=1e4)
 68 |     sc.pp.log1p(orig_ad)
 69 |     sc.pp.highly_variable_genes(orig_ad, min_mean=0.0125, max_mean=3, min_disp=0.5)
 70 |     sc.pp.scale(orig_ad, max_value=10)
 71 |     sc.tl.pca(orig_ad, svd_solver='arpack')
 72 |     sc.pp.neighbors(orig_ad, n_neighbors=20, n_pcs=40)
 73 | 
 74 |     click.echo("Start computing various batch metrics using scIB "+input_h5ad)
 75 |     click.echo("silhouette_batch")
 76 | 
 77 |     ## silhouette_batch
 78 |     silb = scib.metrics.silhouette_batch(input_ad, batch_key = batch_key, group_key = cluster_key, embed = embedding_key, metric='euclidean',
 79 |                                          return_all=True, scale=True, verbose=True)
 80 |     click.echo("PC regression")
 81 | 
 82 |     ## pcr
 83 | 
 84 |     pcr = scib.metrics.pcr_comparison(adata_pre = orig_ad, adata_post = input_ad, covariate = batch_key,
 85 |                                       embed = embedding_key, n_comps=50, scale=True, verbose=True)
 86 |     ## click.echo("HVG overlap")
 87 |     ## I don't understand how methods that generate an embedding change HVG
 88 |     ## hvg_overlap
 89 | 
 90 |     ## hvg = scib.metrics.hvg_overlap(adata_pre = orig_ad, adata_post = input_ad, batch = batch_key, n_hvg=500, verbose=True)
 91 | 
 92 |     ## sil on embedding
 93 |     click.echo("silhouette on embedding")
 94 | 
 95 |     sil = scib.metrics.silhouette(adata = input_ad, group_key = cluster_key, embed = embedding_key, metric='euclidean', scale=True)
 96 | 
 97 |     ## nmi and ari cluster vs label, graph_conn, clisi and ilisi
 98 |     click.echo("NMI, ARI, graph connectivity, cLISI and iLISI")
 99 | 
100 |     res = scib.metrics.metrics(adata =  orig_ad, adata_int = input_ad, batch_key=batch_key, label_key=cluster_key, embed=embedding_key,
101 |                                cluster_key='cluster_scIB', cluster_nmi=out_csv+"_nmi_opt_cluster.csv", type_='knn',
102 |                                isolated_labels_asw_=False,
103 |                                silhouette_=False,
104 |                                hvg_score_=False,
105 |                                graph_conn_=True,
106 |                                pcr_=False,
107 |                                isolated_labels_f1_=True,
108 |                                nmi_=True,
109 |                                ari_=True,
110 |                                kBET_=True,
111 |                                ilisi_=True,
112 |                                clisi_=True,
113 |                                cell_cycle_=False, trajectory_=False)
114 | 
115 |     ## collect results and write output
116 |     output=pd.DataFrame()
117 |     output.loc['NMI_cluster/label', 'value'] = res.loc['NMI_cluster/label', 0]
118 |     output.loc['ARI_cluster/label', 'value'] = res.loc['ARI_cluster/label', 0]
119 |     output.loc['iLISI', 'value'] = res.loc['iLISI', 0]
120 |     output.loc['cLISI', 'value'] = res.loc['cLISI', 0]
121 |     output.loc['graph_conn', 'value'] = res.loc['graph_conn', 0]
122 |     #output.loc['HVG_overlap', 'value'] = hvg
123 |     output.loc['pcr', 'value'] = pcr
124 |     output.loc['silhouette', 'value'] = sil
125 |     output.loc['silhouette_batch', 'value'] = silb[0]
126 |     output.loc['input_h5ad', 'value'] = input_h5ad
127 |     output.loc['unintegrated_h5ad', 'value'] = unintegrated_h5ad
128 |     output.loc['species_key', 'value'] = species_key
129 |     output.loc['batch_key', 'value'] = batch_key
130 |     output.loc['cluster_key', 'value'] = cluster_key
131 |     output.loc['integration_method', 'value'] = integration_method
132 | 
133 |     output.T.to_csv( out_csv)
134 | 
135 |     silb[1]['input_h5ad'] = input_h5ad
136 |     silb[1]['unintegrated_h5ad'] = unintegrated_h5ad
137 |     silb[1]['species_key'] = species_key
138 |     silb[1]['batch_key'] = batch_key
139 |     silb[1]['cluster_key'] = cluster_key
140 |     silb[1]['integration_method'] = integration_method
141 |     silb[1].to_csv( out_silhouette)
142 | 
143 |     input_ad.write(out_h5ad, compression = 'gzip')
144 |     click.echo("finish batch metrics")
145 | 
146 | 
147 | if __name__ == '__main__':
148 |     run_scIB_metrics()
149 | 
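
The metrics scripts decide per method whether to read a stored embedding or to fall back to a PCA of the expression matrix, and they cap the number of dimensions at what the embedding actually has. The following is a minimal sketch of that decision logic in plain Python; `resolve_embedding` and `capped_dims` are hypothetical helpers, the key table is copied from `scIB_metrics.py`, and raising `ValueError` for an unknown method is an assumption (the script would raise a `KeyError` from the dictionary lookup).

```python
# Embedding keys per integration method, as listed in scIB_metrics.py.
EMBEDDING_KEYS = {
    "harmony": "X_pca_harmony",
    "scanorama": "X_scanorama",
    "scVI": "X_scVI",
    "LIGER": "X_iNMF",
    "rligerUINMF": "X_inmf",
    "fastMNN": "X_mnn",
    "SAMap": "wPCA",
}
# Methods whose output is an expression matrix rather than an embedding.
COUNT_BASED_METHODS = {"seuratCCA", "seuratRPCA", "unintegrated"}


def resolve_embedding(method):
    """Return (obsm key, needs_pca) for an integration method name."""
    if method in EMBEDDING_KEYS:
        return EMBEDDING_KEYS[method], False
    if method in COUNT_BASED_METHODS:
        return "X_pca", True
    raise ValueError(f"unknown integration method: {method!r}")


def capped_dims(n_dims, requested=20):
    """Never ask the neighbour search for more dimensions than exist."""
    return min(n_dims, requested)


print(resolve_embedding("harmony"), resolve_embedding("seuratCCA"), capped_dims(10))
```

Keeping the lookup in one function makes an unsupported method name fail early with a readable message instead of partway through the metric computation.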


--------------------------------------------------------------------------------
/bin/scIB_metrics_individual.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # © EMBL-European Bioinformatics Institute, 2023
  4 | # Yuyao Song <ysong@ebi.ac.uk>
  5 | 
  6 | 
  7 | import os
  8 | 
  9 | import click
 10 | import pandas as pd
 11 | import scanpy as sc
 12 | import random
 13 | import scib
 14 | import numpy
 15 | 
 16 | # R for kBET is configured inside run_scIB_metrics() from --conda_path (os is already imported above)
 17 | 
 18 | 
 19 | ## set seed 
 20 | random.seed(123)
 21 | numpy.random.seed(456)
 22 | 
 23 | @click.command()
 24 | @click.argument("input_h5ad", type=click.Path(exists=True))
 25 | @click.argument("unintegrated_h5ad", type=click.Path(exists=False), default=None)
 26 | @click.argument("out_integrated_metrics", type=click.Path(exists=False), default=None)
 27 | @click.argument("out_integrated_basw", type=click.Path(exists=False), default=None)
 28 | @click.argument("out_orig_metrics", type=click.Path(exists=False), default=None)
 29 | @click.argument("out_orig_basw", type=click.Path(exists=False), default=None)
 30 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None)
 31 | @click.option(
 32 |     "--species_key", type=str, default=None, help="Species key to distinguish species"
 33 | )
 34 | @click.option(
 35 |     "--batch_key",
 36 |     type=str,
 37 |     default=None,
 38 |     help="Batch key on which integration is performed",
 39 | )
 40 | @click.option("--integration_method", type=str, default=None, help="Integration method")
 41 | @click.option(
 42 |     "--cluster_key",
 43 |     type=str,
 44 |     default=None,
 45 |     help="Cluster key in species one to use as labels to transfer to species two",
 46 | )
 47 | @click.option(
 48 |     "--num_cores",
 49 |     type=int,
 50 |     default=1,
 51 |     help="Number of cores for graph LISI scores",
 52 | )
 53 | @click.option(
 54 |     "--conda_path",
 55 |     type=str,
 56 |     default=None,
 57 |     help="scIB conda path",
 58 | )
 59 | 
 60 | 
 61 | def run_scIB_metrics(
 62 |     input_h5ad,
 63 |     unintegrated_h5ad,
 64 |     out_integrated_metrics,
 65 |     out_integrated_basw,
 66 |     out_orig_metrics,
 67 |     out_orig_basw,
 68 |     out_h5ad,
 69 |     species_key,
 70 |     batch_key,
 71 |     cluster_key,
 72 |     integration_method,
 73 |     num_cores,
 74 |     conda_path
 75 | ):
 76 |     
 77 | 
 78 |     os.environ['R_HOME'] = f"{conda_path}/lib/R"
 79 |     os.environ['PATH'] = f"{conda_path}/bin:" + os.environ['PATH']
 80 |     os.environ['R_LIBS_USER'] = f"{conda_path}/lib/R/library"
 81 |     os.environ['R_LIBS'] = f"{conda_path}/lib/R/library"
 82 |     os.environ['LD_LIBRARY_PATH'] = f"{conda_path}/lib"
 83 | 
 84 |     # dictionary for method properties
 85 |     embedding_keys = {
 86 |         "harmony": "X_pca_harmony",
 87 |         "scanorama": "X_scanorama",
 88 |         "scVI": "X_scVI",
 89 |         "scANVI": "X_scANVI",
 90 |         "LIGER": "X_inmf",
 91 |         "rligerUINMF": "X_inmf",
 92 |         "fastMNN": "X_mnn",
 93 |     }
 94 |     use_embeddings = {
 95 |         "harmony": True,
 96 |         "scanorama": True,
 97 |         "scVI": True,
 98 |         "scANVI": True,
 99 |         "LIGER": True,
100 |         "rligerUINMF": True,
101 |         "fastMNN": True,
102 |         "seuratCCA": False,
103 |         "seuratRPCA": False,
104 |         "unintegrated": False,
105 |     }
106 |     from_h5seurat = {
107 |         "harmony": False,
108 |         "scanorama": False,
109 |         "scVI": False,
110 |         "scANVI": False,
111 |         "LIGER": True,
112 |         "rligerUINMF": True,
113 |         "fastMNN": True,
114 |         "seuratCCA": True,
115 |         "seuratRPCA": True,
116 |         "unintegrated": False,
117 |     }
118 |     sc.set_figure_params(dpi_save=200, frameon=False, figsize=(10, 5))
119 | 
120 |     click.echo("Read anndata")
121 |     input_ad = sc.read_h5ad(input_h5ad)
122 |     orig_ad = sc.read_h5ad(unintegrated_h5ad)
123 |     species_all = input_ad.obs[species_key].astype("category").cat.categories.values
124 | 
125 |     ## for files from h5seurat sometimes these are not stored as category
126 | 
127 |     input_ad.obs[species_key] = input_ad.obs[species_key].astype("category")
128 |     input_ad.obs[cluster_key] = input_ad.obs[cluster_key].astype("category")  
129 |     input_ad.obs[batch_key] = input_ad.obs[batch_key].astype("category")
130 |     
131 |     # known bug - fix when convert h5Seurat to h5ad the index name error
132 |     #if from_h5seurat[integration_method] is True:
133 |     #    input_ad.__dict__["_raw"].__dict__["_var"] = (
134 |     #        input_ad.__dict__["_raw"]
135 |     #        .__dict__["_var"]
136 |     #        .rename(columns={"_index": "features"})
137 |     #    )
138 | 
139 |     use_embedding = use_embeddings[integration_method]
140 |     if use_embedding is True:
141 |         embedding_key = embedding_keys[integration_method]
142 | 
143 |     # re-calculate on integrated and unintegrated data
144 |     # due to scIB hard-coding, make sure input_ad.obsp['connectivities'] and input_ad.uns['neighbors'] are from the embedding
145 |     # for lisi type_='knn'
146 |     # LIGER embedding only has 20 dims
147 | 
148 | 
149 |     if use_embedding is True:
150 |         click.echo("calculate KNN graph from embedding " + embedding_key)
151 |         num_pcs = min(input_ad.obsm[embedding_key].shape[1], 20)
152 |         if num_pcs < 20:
153 |             click.echo("using fewer dimensions: " + str(num_pcs))
154 |         sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=num_pcs, use_rep=embedding_key)
155 |         # compute knn if use embedding
156 |     else:
157 |         click.echo("use PCA to compute KNN graph")
158 |         sc.tl.pca(input_ad, svd_solver="arpack")
159 |         sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep="X_pca")
160 |         embedding_key = "X_pca"
161 | 
162 |     # while no embedding, compute PCA and compute knn
163 | 
164 |     # get neighbour graph from unintegrated data
165 |     sc.pp.normalize_total(orig_ad, target_sum=1e4)
166 |     sc.pp.log1p(orig_ad)
167 |     sc.pp.highly_variable_genes(orig_ad, min_mean=0.0125, max_mean=3, min_disp=0.5)
168 |     sc.pp.scale(orig_ad, max_value=10)
169 |     sc.tl.pca(orig_ad, svd_solver="arpack")
170 |     sc.pp.neighbors(orig_ad, n_neighbors=20, n_pcs=40)
171 |     sc.tl.umap(orig_ad, min_dist=0.3)
172 |     sc.pl.umap(orig_ad, color=[batch_key, species_key, cluster_key])
173 | 
174 |     click.echo(
175 |         "Start computing various batch metrics using scIB, the integrated file is "
176 |         + input_h5ad
177 |     )
178 | 
179 |     output_integrated = pd.DataFrame()
180 |     output_integrated_basw = pd.DataFrame()
181 |     output_orig = pd.DataFrame()
182 |     output_orig_basw = pd.DataFrame()
183 | 
184 |     click.echo("PC regression")
185 | 
186 |     output_integrated.loc["PCR", "value"] = scib.metrics.pc_regression(
187 |         input_ad.obsm[embedding_key], covariate=input_ad.obs[species_key], n_comps=50
188 |     )
189 |     output_orig.loc["PCR", "value"] = scib.metrics.pc_regression(
190 |         orig_ad.obsm["X_pca"],
191 |         covariate=orig_ad.obs[species_key],
192 |         pca_var=orig_ad.uns["pca"]["variance"],
193 |     )
194 | 
195 |     click.echo("Silhouette batch")
196 | 
197 |     integrated_basw = scib.metrics.silhouette_batch(
198 |         input_ad,
199 |         batch_key=batch_key,
200 |         group_key=cluster_key,
201 |         embed=embedding_key,
202 |         metric="euclidean",
203 |         return_all=True,
204 |         scale=True,
205 |         verbose=False,
206 |     )
207 | 
208 |     output_integrated.loc["bASW", "value"] = integrated_basw[0]
209 | 
210 |     orig_basw = scib.metrics.silhouette_batch(
211 |         orig_ad,
212 |         batch_key=batch_key,
213 |         group_key=cluster_key,
214 |         embed="X_pca",
215 |         metric="euclidean",
216 |         return_all=True,
217 |         scale=True,
218 |         verbose=False,
219 |     )
220 | 
221 |     output_orig.loc["bASW", "value"] = orig_basw[0]
222 | 
223 |     click.echo("Graph connectivity")
224 | 
225 |     output_integrated.loc["GC", "value"] = scib.metrics.graph_connectivity(
226 |         input_ad, label_key=cluster_key
227 |     )
228 |     output_orig.loc["GC", "value"] = scib.metrics.graph_connectivity(
229 |         orig_ad, label_key=cluster_key
230 |     )
231 | 
232 |     # click.echo("graph iLISI")
233 | 
234 |     # output_integrated.loc["iLISI", "value"] = scib.metrics.ilisi_graph(
235 |     #     input_ad,
236 |     #     batch_key,
237 |     #     type_="embed",
238 |     #     use_rep=embedding_key,
239 |     #     k0=50,
240 |     #     subsample=None,
241 |     #     n_cores=num_cores,
242 |     #     scale=True,
243 |     #     verbose=True,
244 |     # )
245 | 
246 |     # output_orig.loc["iLISI", "value"] = scib.metrics.ilisi_graph(
247 |     #     orig_ad,
248 |     #     batch_key,
249 |     #     type_="full",
250 |     #     k0=50,
251 |     #     subsample=None,
252 |     #     n_cores=num_cores,
253 |     #     scale=True,
254 |     #     verbose=True,
255 |     # )
256 | 
257 |     click.echo("kBET")
258 | 
259 |     output_integrated.loc["kBET", "value"] = scib.metrics.kBET(
260 |         input_ad,
261 |         batch_key=batch_key,
262 |         label_key=cluster_key,
263 |         type_="knn", ## for equal treatment
264 |         scaled=True,
265 |         return_df=False,
266 |         verbose=True,
267 |     )
268 | 
269 |     output_orig.loc["kBET", "value"] = scib.metrics.kBET(
270 |         orig_ad,
271 |         batch_key=batch_key,
272 |         label_key=cluster_key,
273 |         type_="full",
274 |         scaled=True,
275 |         return_df=False,
276 |         verbose=True,
277 |     )
278 | 
279 |     # click.echo("cLISI")
280 | 
281 |     # output_integrated.loc["cLISI", "value"] = scib.metrics.clisi_graph(
282 |     #     input_ad,
283 |     #     label_key=cluster_key,
284 |     #     type_="embed",
285 |     #     use_rep=embedding_key,
286 |     #     k0=50,
287 |     #     subsample=None,
288 |     #     scale=True,
289 |     #     n_cores=num_cores,
290 |     #     verbose=True,
291 |     # )
292 | 
293 |     # output_orig.loc["cLISI", "value"] = scib.metrics.clisi_graph(
294 |     #     orig_ad,
295 |     #     label_key=cluster_key,
296 |     #     type_="full",
297 |     #     k0=50,
298 |     #     subsample=None,
299 |     #     scale=True,
300 |     #     n_cores=num_cores,
301 |     #     verbose=True,
302 |     # )
303 | 
304 |     click.echo("NMI")
305 |     click.echo("clustering optimization with leiden")
306 | 
307 |     scib.me.cluster_optimal_resolution(
308 |         input_ad,
309 |         cluster_key="cluster",
310 |         label_key=cluster_key,
311 |         cluster_function=sc.tl.leiden,
312 |     )
313 |     scib.me.cluster_optimal_resolution(
314 |         orig_ad,
315 |         cluster_key="cluster",
316 |         label_key=cluster_key,
317 |         cluster_function=sc.tl.leiden,
318 |     )
319 | 
320 |     output_integrated.loc["NMI", "value"] = scib.me.nmi(
321 |         input_ad, cluster_key="cluster", label_key=cluster_key
322 |     )
323 |     output_integrated.loc["ARI", "value"] = scib.me.ari(
324 |         input_ad, cluster_key="cluster", label_key=cluster_key
325 |     )
326 | 
327 |     output_orig.loc["NMI", "value"] = scib.me.nmi(
328 |         orig_ad, cluster_key="cluster", label_key=cluster_key
329 |     )
330 |     output_orig.loc["ARI", "value"] = scib.me.ari(
331 |         orig_ad, cluster_key="cluster", label_key=cluster_key
332 |     )
333 | 
334 |     click.echo("Silhouette cell type")
335 | 
336 |     output_integrated.loc["cASW", "value"] = scib.me.silhouette(
337 |         input_ad, label_key=cluster_key, embed=embedding_key
338 |     )
339 | 
340 |     output_orig.loc["cASW", "value"] = scib.me.silhouette(
341 |         orig_ad, label_key=cluster_key, embed="X_pca"
342 |     )
343 |     
344 |     click.echo("Isolated label F1")
345 | 
346 |     output_integrated.loc["iso_F1", "value"] = scib.me.isolated_labels_f1(
347 |         input_ad, embed=None, batch_key=batch_key, label_key=cluster_key
348 |     )
349 | 
350 |     output_orig.loc["iso_F1", "value"] = scib.me.isolated_labels_f1(
351 |         orig_ad, embed=None, batch_key=batch_key, label_key=cluster_key
352 |     )
353 | 
354 |     click.echo("write results")
355 |     click.echo("metrics of integrated data")
356 | 
357 |     output_integrated.loc["input_h5ad", "value"] = input_h5ad
358 |     output_integrated.loc["unintegrated_h5ad", "value"] = unintegrated_h5ad
359 |     output_integrated.loc["species_key", "value"] = species_key
360 |     output_integrated.loc["batch_key", "value"] = batch_key
361 |     output_integrated.loc["cluster_key", "value"] = cluster_key
362 |     output_integrated.loc["integration_method", "value"] = integration_method
363 | 
364 |     output_integrated.T.to_csv(out_integrated_metrics)
365 | 
366 |     click.echo("writing clustering optimized integrated data")
367 |     input_ad.write(out_h5ad, compression="gzip")
368 | 
369 |     click.echo("metric of unintegrated data")
370 |     output_orig.loc["input_h5ad", "value"] = unintegrated_h5ad
371 |     output_orig.loc["unintegrated_h5ad", "value"] = unintegrated_h5ad
372 |     output_orig.loc["species_key", "value"] = species_key
373 |     output_orig.loc["batch_key", "value"] = batch_key
374 |     output_orig.loc["cluster_key", "value"] = cluster_key
375 |     output_orig.loc["integration_method", "value"] = integration_method
376 | 
377 |     output_orig.T.to_csv(out_orig_metrics)
378 | 
379 |     click.echo("cell type bASW of integrated data")
380 | 
381 |     integrated_basw[1]["input_h5ad"] = input_h5ad
382 |     integrated_basw[1]["unintegrated_h5ad"] = unintegrated_h5ad
383 |     integrated_basw[1]["species_key"] = species_key
384 |     integrated_basw[1]["batch_key"] = batch_key
385 |     integrated_basw[1]["cluster_key"] = cluster_key
386 |     integrated_basw[1]["integration_method"] = integration_method
387 |     integrated_basw[1].to_csv(out_integrated_basw)
388 | 
389 |     orig_basw[1]["input_h5ad"] = unintegrated_h5ad
390 |     orig_basw[1]["unintegrated_h5ad"] = unintegrated_h5ad
391 |     orig_basw[1]["species_key"] = species_key
392 |     orig_basw[1]["batch_key"] = batch_key
393 |     orig_basw[1]["cluster_key"] = cluster_key
394 |     orig_basw[1]["integration_method"] = integration_method
395 |     orig_basw[1].to_csv(out_orig_basw)
396 | 
397 |     click.echo("finish scIB metrics calculation")
398 | 
399 | 
400 | if __name__ == "__main__":
401 |     run_scIB_metrics()
402 | 
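
Note on the NMI/ARI rows written by the script above: both compare the resolution-optimized Leiden clusters with the annotated labels in `cluster_key`. As a rough illustration of what the ARI value measures, here is a minimal standard-library sketch of the adjusted Rand index. The function name `adjusted_rand_index` is hypothetical, and this is not the code path that `scib.me.ari` takes; it only reproduces the textbook pair-counting formula.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells (illustrative helper)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_true)
    # Count cell pairs that share a contingency cell, a true label, or a predicted label.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    expected = pairs_true * pairs_pred / total_pairs
    max_index = (pairs_true + pairs_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. every cell in a single cluster in both labelings.
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


# Cluster names do not matter, only the grouping of cells.
same_grouping = adjusted_rand_index(["a", "a", "b", "b"], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of 1 means the clustering reproduces the annotation exactly, values near 0 mean agreement is at chance level, and negative values mean agreement is worse than chance.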


--------------------------------------------------------------------------------
/bin/scIB_trajectory.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # © EMBL-European Bioinformatics Institute, 2023
  4 | # Yuyao Song <ysong@ebi.ac.uk>
  5 | 
  6 | 
  7 | import click
  8 | import pandas as pd
  9 | import scanpy as sc
 10 | import numpy as np
 11 | import scib
 12 | 
 13 | 
 14 | @click.command()
 15 | 
 16 | @click.argument("input_h5ad", type=click.Path(exists=True))
 17 | @click.argument("unintegrated_h5ad", type=click.Path(exists=False), default=None)
 18 | @click.argument("out_csv", type=click.Path(exists=False), default=None)
 19 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None)
 20 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species")
 21 | @click.option('--batch_key', type=str, default=None, help="Batch key on which integration is performed")
 22 | @click.option('--integration_method', type=str, default=None, help="Integration method")
 23 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species one to use as labels to transfer to species two")
 24 | @click.option('--root_cell', type=str, default=None, help="Root cell in trajectory, should be one category in adata.obs[cluster_key]")
 25 | 
 26 | 
 27 | def run_scIB_trajectory(input_h5ad, unintegrated_h5ad, out_csv,  out_h5ad, species_key, batch_key, cluster_key, integration_method, root_cell):
 28 |     # dictionary for method properties
 29 |     embedding_keys={"scANVI": "X_scANVI", "harmony": "X_pca_harmony", "scanorama": "X_scanorama", "scVI": "X_scVI", "LIGER": "X_iNMF", "rligerUINMF":"X_inmf", "fastMNN": "X_mnn", "SAMap": "wPCA" }
 30 |     use_embeddings={"scANVI": True, "harmony": True, "scanorama": True, "scVI": True, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": False , "seuratCCA": False, "seuratRPCA": False, "unintegrated": False}
 31 |     from_h5seurat={"scANVI": False, "harmony": False, "scanorama": False, "scVI": False, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": False , "seuratCCA": True, "seuratRPCA": True, "unintegrated": False}
 32 |     sc.set_figure_params(dpi_save=200, frameon=False, figsize=(10, 5))
 33 |     click.echo("Read anndata")
 34 |     input_ad = sc.read_h5ad(input_h5ad)
 35 |     orig_ad = sc.read_h5ad(unintegrated_h5ad)
 36 | 
 37 |     species_all=input_ad.obs[species_key].astype("category").cat.categories.values
 38 |     # known bug: converting h5Seurat to h5ad can leave a wrong index name in raw.var; the disabled fix below renames it
 39 |     # if from_h5seurat[integration_method] is True:
 40 |     #    input_ad.__dict__['_raw'].__dict__['_var'] = input_ad.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'})
 41 | 
 42 |     use_embedding=use_embeddings[integration_method]
 43 |     if use_embedding is True:
 44 |         embedding_key=embedding_keys[integration_method]
 45 | 
 46 |     # register color in .uns
 47 |     sc.pl.umap(input_ad, color = cluster_key)
 48 |     color = dict(zip(input_ad.obs[cluster_key].cat.categories.to_list(), input_ad.uns[cluster_key+'_colors']))
 49 | 
 50 |     # get neighbours on integrated and unintegrated data
 51 |     # for lisi type_='knn'
 52 |     ## LIGER embeddings have only 20 dims
 53 | 
 54 |     if integration_method == 'SAMap':
 55 |         click.echo("use SAMap KNN graph")
 56 | 
 57 |         ## do nothing
 58 |     elif use_embedding is True:
 59 |         click.echo("calculate KNN graph from embedding " + embedding_key)
 60 |         num_pcs = min(input_ad.obsm[embedding_key].shape[1], 20)
 61 |         if num_pcs < 20:
 62 |             click.echo("using fewer PCs: " + str(num_pcs))
 63 |         sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=num_pcs, use_rep=embedding_key, key_added=integration_method)        
 64 |         ## compute knn if use embedding
 65 |     else:
 66 |         click.echo("use PCA to compute KNN graph")
 67 |         sc.tl.pca(input_ad, svd_solver='arpack')
 68 |         sc.pp.neighbors(input_ad, n_neighbors=20, n_pcs=20, use_rep='X_pca', key_added=integration_method)
 69 |         embedding_key='X_pca'
 70 |         ## while no embedding, compute PCA and compute knn
 71 | 
 72 |     ## get neighbour graph from unintegrated data
 73 |     ## use PCA to get initial neighbours for diffusion map
 74 |     ## use diffusion map to compute neighbours - denoise
 75 |     if from_h5seurat[integration_method] is True:
 76 |         orig_ad.obs_names = orig_ad.obs_names.str.replace("-", "_")
 77 |         ## seurat does not convert - to _ which does not match integrated data
 78 | 
 79 |     sc.pp.normalize_total(orig_ad, target_sum=1e4)
 80 |     sc.pp.log1p(orig_ad)
 81 |     sc.pp.highly_variable_genes(orig_ad, min_mean=0.0125, max_mean=3, min_disp=0.5)
 82 |     sc.pp.scale(orig_ad, max_value=10)
 83 |     sc.tl.pca(orig_ad, svd_solver='arpack')
 84 |     sc.pp.neighbors(orig_ad, n_neighbors=20, n_pcs=40, use_rep='X_pca')
 85 |     sc.tl.diffmap(orig_ad)
 86 |     sc.pp.neighbors(orig_ad, n_neighbors=10, use_rep='X_diffmap')
 87 | 
 88 |     ## set root cell for diffusion map pseudo-time
 89 |     orig_ad.uns['iroot'] = np.flatnonzero(orig_ad.obs[cluster_key]  == root_cell)[0]
 90 |     sc.tl.dpt(orig_ad)
 91 | 
 92 |     ## integrated data
 93 |     ## use integrated embedding to get initial neighbours for diffusion map
 94 |     ## use diffusion map neighbours
 95 |     if integration_method == 'SAMap':
 96 |         click.echo("use SAMap KNN graph for diffmap")
 97 |         sc.tl.diffmap(input_ad, neighbors_key=None)
 98 | 
 99 |     else:
100 |         click.echo("calculate diffmap for " + integration_method)
101 |         sc.tl.diffmap(input_ad, neighbors_key=integration_method)
102 |     ## neighbours with diffmap need not have a key
103 |     sc.pp.neighbors(input_ad, n_neighbors=10, use_rep='X_diffmap')
104 |     input_ad.uns['iroot'] = np.flatnonzero(input_ad.obs[cluster_key]  == root_cell)[0]
105 |     sc.tl.dpt(input_ad)
106 | 
107 |     ## scIB trajectory conservation score
108 |     ## per-species conservation: set batch_key='species'
109 |     score_batch = scib.metrics.trajectory_conservation(adata_pre=orig_ad, adata_post=input_ad, label_key=cluster_key, pseudotime_key='dpt_pseudotime', batch_key=species_key)
110 |     score = scib.metrics.trajectory_conservation(adata_pre=orig_ad, adata_post=input_ad, label_key=cluster_key, pseudotime_key='dpt_pseudotime', batch_key=None)
111 | 
112 |     output=pd.DataFrame()
113 | 
114 |     output.loc['trajectory_conservation_score_batch', 'value'] = score_batch
115 |     output.loc['trajectory_conservation_score_none', 'value'] = score
116 |     output.loc['input_h5ad', 'value'] = input_h5ad
117 |     output.loc['unintegrated_h5ad', 'value'] = unintegrated_h5ad
118 |     output.loc['species_key', 'value'] = species_key
119 |     output.loc['batch_key', 'value'] = batch_key
120 |     output.loc['cluster_key', 'value'] = cluster_key
121 |     output.loc['root_cell', 'value'] = root_cell
122 |     output.loc['integration_method', 'value'] = integration_method
123 |     output.T.to_csv(out_csv)
124 |     input_ad.write(out_h5ad, compression = 'gzip')
125 |     click.echo("finish trajectory conservation metrics")
126 | 
127 | 
128 | if __name__ == '__main__':
129 |     run_scIB_trajectory()
130 | 
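
Note on root-cell selection in the script above: `iroot` is set to the position of the first cell whose `cluster_key` label equals `--root_cell`, and `np.flatnonzero(...)[0]` raises a bare `IndexError` when no cell carries that label. The sketch below shows the same lookup with an explicit error message. It is a standard-library illustration only; the helper name `first_index_of_label` is hypothetical and not part of the pipeline.

```python
def first_index_of_label(labels, root_label):
    """Return the position of the first cell annotated with root_label (illustrative helper)."""
    for position, label in enumerate(labels):
        if label == root_label:
            return position
    # Fail with a message that names the missing category instead of an IndexError.
    raise ValueError(
        f"root cell label {root_label!r} not found; available labels: {sorted(set(labels))}"
    )


cell_labels = ["progenitor", "HSC", "HSC", "erythrocyte"]
root_position = first_index_of_label(cell_labels, "HSC")
```

Because the integrated and unintegrated objects are searched separately, the chosen root cell can differ between them when their cell order differs; only the label is guaranteed to match.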


--------------------------------------------------------------------------------
/bin/scanorama_integration.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # © EMBL-European Bioinformatics Institute, 2023
 4 | # Yuyao Song <ysong@ebi.ac.uk>
 5 | 
 6 | import click
 7 | import matplotlib.pyplot as plt
 8 | import scanpy as sc
 9 | 
10 | 
11 | @click.command()
12 | @click.argument("input_h5ad", type=click.Path(exists=True))
13 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None)
14 | @click.argument("out_umap", type=click.Path(exists=False), default=None)
15 | @click.option('--batch_key', type=str, default=None, help="Batch key used when identifying HVGs (scanorama integration itself runs on species_key)")
16 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species")
17 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species one to use as labels to transfer to species two")
18 | 
19 | 
20 | def run_scanorama(input_h5ad, out_h5ad, out_umap, batch_key, species_key, cluster_key):
21 |     click.echo('Start scanorama integration')
22 |     sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 6))
23 |     adata = sc.read_h5ad(input_h5ad)
24 |     adata.var_names_make_unique()
25 |     sc.pp.normalize_total(adata, target_sum=1e4)
26 |     sc.pp.log1p(adata)
27 |     click.echo("HVG")
28 |     sc.pp.highly_variable_genes(adata, batch_key=batch_key)
29 |     sc.pp.scale(adata, max_value=10)
30 |     sc.tl.pca(adata)
31 |     sc.pp.neighbors(adata, use_rep='X_pca', n_neighbors=15, n_pcs=40)
32 |     sc.tl.umap(adata, min_dist=0.3) ## to match min_dist in seurat
33 |     adata.obsm['X_umapraw'] = adata.obsm['X_umap']
34 |     click.echo("Scanorama")
35 |     sc.external.pp.scanorama_integrate(adata, key=species_key, basis='X_pca', adjusted_basis='X_scanorama')
36 |     sc.pp.neighbors(adata, use_rep='X_scanorama', key_added = 'scanorama', n_neighbors=15, n_pcs=40)
37 |     sc.tl.umap(adata, neighbors_key = 'scanorama', min_dist=0.3) ## to match min_dist in seurat
38 |     sc.pl.umap(adata, color=[batch_key, species_key, cluster_key], ncols=1)
39 |     plt.savefig(out_umap, dpi=300,  bbox_inches='tight')
40 |     adata.obsm['X_umapscanorama'] = adata.obsm['X_umap']
41 |     click.echo("Save output")
42 |     adata.write(out_h5ad)
43 |     click.echo("Done scanorama")
44 | 
45 | 
46 | if __name__ == '__main__':
47 |     run_scanorama()
48 | 
49 | 


--------------------------------------------------------------------------------
/bin/sccaf_assessment_metadata.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # © EMBL-European Bioinformatics Institute, 2023
 4 | # Yuyao Song <ysong@ebi.ac.uk>
 5 | 
 6 | 
 7 | import click
 8 | import matplotlib.pyplot as plt
 9 | from sklearn import metrics
10 | import pandas as pd
11 | import scanpy as sc
12 | from SCCAF import *
13 | 
14 | @click.command()
15 | @click.argument("input_metadata", type=click.Path(exists=True))
16 | @click.argument("out_auc", type=click.Path(exists=False), default=None)
17 | @click.argument("out_acc_csv", type=click.Path(exists=False), default=None)
18 | @click.option('--integration_method', type=str, default=None, help="Integration method")
19 | @click.option('--use_embedding', type=bool, default=False, help="Whether use embedding for SCCAF assessment, default False to use count matrix")
20 | @click.option('--embedding_key', type=str, default=None, help="If use embedding, the embedding key in input_h5ad.obsm to calculate SCCAF assessment")
21 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in input_h5ad.obs to use as the label to calculate SCCAF assessment")
22 | 
23 | def run_sccaf_assessment(input_metadata, out_auc, use_embedding, embedding_key, cluster_key, out_acc_csv, integration_method):
24 | 
25 |     def get_cell_type_auc(clf, y_test, y_prob):
26 |         rc_aucs = [] # area under the ROC curve
27 |         rp_aucs = [] # area under the precision-recall curve
28 |         fprs = [] #FPR
29 |         tprs = [] #TPR
30 |         prss = [] #Precision
31 |         recs = [] #Recall
32 |         for i, cell_type in enumerate(clf.classes_):
33 |             fpr, tpr, _ = metrics.roc_curve(y_test == cell_type, y_prob[:, i])
34 |             prs, rec, _ = metrics.precision_recall_curve(y_test == cell_type, y_prob[:, i])
35 |             fprs.append(fpr)
36 |             tprs.append(tpr)
37 |             prss.append(prs)
38 |             recs.append(rec)
39 |             rc_aucs.append(metrics.auc(fpr, tpr))
40 |             rp_aucs.append(metrics.auc(rec, prs))
41 |         tbl = pd.DataFrame(data=list(zip(clf.classes_, rc_aucs, rp_aucs)), columns=['cell_type', "ROC_AUC", "PR_AUC"])
42 |         return tbl
43 | 
44 |     meta = pd.read_csv(input_metadata, delimiter='\t', header=None)
45 |     for i in range(0, meta.shape[0]):
46 |         species=meta.loc[i, 0]
47 |         input_h5ad=meta.loc[i, 1]
48 |         print(species)
49 |         input_ad = sc.read_h5ad(input_h5ad)
50 | 
51 |         sc.pp.normalize_total(input_ad, target_sum=1e4)
52 |         sc.pp.log1p(input_ad)
53 |         sc.pp.highly_variable_genes(input_ad, min_mean=0.0125, max_mean=3, min_disp=0.5)
54 |         input_ad.raw = input_ad
55 |         sc.pp.scale(input_ad, max_value=10)
56 |         sc.tl.pca(input_ad, svd_solver='arpack')
57 |         sc.pp.neighbors(input_ad, n_neighbors=10, n_pcs=40)
58 |         sc.tl.umap(input_ad, min_dist=0.3)
59 |         sc.pl.umap(input_ad, color = [cluster_key])
60 |         sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 5))
61 |         ##input_ad = input_ad[~(input_ad.obs.cell_type=='T_cell'), ]
62 | 
63 |         if use_embedding is True and embedding_key not in input_ad.obsm.keys():
64 |             raise ValueError("`input_ad.obsm['%s']` doesn't exist. Please assign the embedding key if use embedding."%(embedding_key))
65 |         if use_embedding is True and embedding_key is not None:
66 |             input_matrix = input_ad.obsm[embedding_key]
67 |         else:
68 |             input_matrix = input_ad.X
69 | 
70 |         colors = input_ad.uns[cluster_key+"_colors"]
71 |         y_prob, y_pred, y_test, clf, cvsm, acc = SCCAF_assessment(input_matrix, input_ad.obs[cluster_key], n=200)
72 |         aucs = plot_roc(y_prob, y_test, clf, cvsm=cvsm, acc=acc, colors=colors)
73 | 
74 |         plt.savefig(out_auc+"_"+species+".png", dpi=200, bbox_inches='tight')
75 | 
76 |         tbl1=get_cell_type_auc(clf, y_test, y_prob)
77 | 
78 |         tbl1[['test_acc']] = acc
79 |         tbl1[['CV_acc']] = cvsm
80 |         tbl1[['type_label']] = 'original'
81 |         tbl1['from_species'] = species
82 |         tbl1['to_species'] = species
83 |         tbl1['integration_method'] = integration_method
84 |         tbl1['input_file'] = input_h5ad
85 |         tbl1['key_use'] = cluster_key
86 |         tbl1['adj_rand_score'] = 'NaN'
87 |         tbl1['pct_cell_type_kept'] = 'NaN'
88 | 
89 |         tbl1.to_csv(out_acc_csv+"_"+species+".csv", index=False, header=True)
90 | 
91 | if __name__ == '__main__':
92 |     run_sccaf_assessment()
93 | 
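
Note on the per-cell-type AUC table produced by `get_cell_type_auc` above: each cell type is scored one-vs-rest, using the classifier's probability for that type as the ranking score. As a rough illustration, the sketch below computes a one-vs-rest ROC AUC with the pairwise (Mann-Whitney) formulation, which gives the same value as the area under the ROC curve. The helper name `roc_auc_one_vs_rest` is hypothetical; the script itself uses `sklearn.metrics.roc_curve` and `sklearn.metrics.auc`, and this quadratic-time version is only suitable for small inputs.

```python
def roc_auc_one_vs_rest(y_true, scores, positive_label):
    """ROC AUC for one class against all others (illustrative helper)."""
    positives = [s for y, s in zip(y_true, scores) if y == positive_label]
    negatives = [s for y, s in zip(y_true, scores) if y != positive_label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative cell")
    # Fraction of (positive, negative) pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


cell_types = ["T_cell", "T_cell", "B_cell", "B_cell"]
t_cell_probability = [0.9, 0.2, 0.8, 0.1]
auc_t_cell = roc_auc_one_vs_rest(cell_types, t_cell_probability, "T_cell")
```

In this toy input three of the four T-cell/B-cell pairs are ranked correctly, so the AUC is 0.75.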


--------------------------------------------------------------------------------
/bin/sccaf_kNN_distance.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # © EMBL-European Bioinformatics Institute, 2023
  4 | # Yuyao Song <ysong@ebi.ac.uk>
  5 | 
  6 | import pandas as pd
  7 | import numpy as np
  8 | import click
  9 | import scanpy as sc
 10 | from scipy.sparse import issparse
 11 | 
 12 | # for reading/saving clf model
 13 | 
 14 | from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
 15 | from sklearn.model_selection import cross_val_score, train_test_split
 16 | from sklearn.linear_model import LogisticRegression, SGDClassifier
 17 | from sklearn.naive_bayes import GaussianNB
 18 | from sklearn.gaussian_process import GaussianProcessClassifier
 19 | from sklearn.tree import DecisionTreeClassifier
 20 | from sklearn.ensemble import RandomForestClassifier
 21 | from sklearn.svm import SVC
 22 | 
 23 | from sklearn.neighbors import KNeighborsClassifier
 24 | 
 25 | np.random.seed(seed=123)
 26 | 
 27 | @click.command()
 28 | @click.argument("input_h5ad", type=click.Path(exists=True))
 29 | @click.argument("out_acc_csv", type=click.Path(exists=False), default=None)
 30 | @click.option('--integration_method', type=str, default=None, help="Integration method")
 31 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in input_h5ad.obs to use as the label to calculate SCCAF assessment")
 32 | @click.option('--n_neighbors', type=int, default=15, help="Number of neighbours for kNN calculation and assessment")
 33 | 
 34 | 
 35 |     
 36 |     
 37 | def anndata_knn_acc(input_h5ad, cluster_key, out_acc_csv, integration_method, n_neighbors=15):
 38 |     
 39 |     click.echo("Read input h5ad")
 40 |     input_ad = sc.read_h5ad(input_h5ad)
 41 |     
 42 |     # dictionary for method properties
 43 |     embedding_keys = {
 44 |         "harmony": "X_pca_harmony",
 45 |         "scanorama": "X_scanorama",
 46 |         "scVI": "X_scVI",
 47 |         "scANVI": "X_scANVI",
 48 |         "LIGER": "X_iNMF",
 49 |         "rligerUINMF": "X_inmf",
 50 |         "fastMNN": "X_mnn",
 51 |     }
 52 |     use_embeddings = {
 53 |         "harmony": True,
 54 |         "scanorama": True,
 55 |         "scVI": True,
 56 |         "scANVI": True,
 57 |         "LIGER": True,
 58 |         "rligerUINMF": True,
 59 |         "fastMNN": True,
 60 |         "SAMap": False,
 61 |         "seuratCCA": False,
 62 |         "seuratRPCA": False,
 63 |         "unintegrated": False,
 64 |         "per_species": False,
 65 |     }
 66 | 
 67 |     
 68 |     use_embedding = use_embeddings[integration_method]
 69 |     
 70 |     if use_embedding is True:
 71 |         embedding_key = embedding_keys[integration_method]
 72 |         
 73 |     ## prepare the connectivity matrix
 74 |     
 75 |     if integration_method == "SAMap":
 76 |         click.echo("use SAMap KNN graph")
 77 |         # do nothing
 78 |     elif use_embedding is True:
 79 |         click.echo(f"Calculate KNN graph from embedding {embedding_key}")
 80 |         num_pcs = min(input_ad.obsm[embedding_key].shape[1], 40)
 81 |         if num_pcs < 40:
 82 |             click.echo(f"using {str(num_pcs)} PCs")
 83 |         sc.pp.neighbors(input_ad, n_neighbors=n_neighbors, n_pcs=num_pcs, use_rep=embedding_key, key_added=integration_method)
 84 |         
 85 |         # compute knn if use embedding
 86 |     elif integration_method == "per_species":
 87 |         click.echo('calculate NN graph from raw count for per species data')
 88 |         sc.pp.normalize_total(input_ad, target_sum=1e4)
 89 |         sc.pp.log1p(input_ad)
 90 |         sc.pp.highly_variable_genes(input_ad, min_mean=0.0125, max_mean=3, min_disp=0.5)
 91 |         input_ad.raw = input_ad
 92 |         sc.pp.scale(input_ad, max_value=10)
 93 |         sc.tl.pca(input_ad, svd_solver='arpack')
 94 |         sc.pp.neighbors(input_ad, n_neighbors=n_neighbors, n_pcs=40, use_rep="X_pca")
 95 |         embedding_key = "X_pca"
 96 |     elif integration_method == 'unintegrated':
 97 |         click.echo("use PCA to compute KNN graph for unintegrated data")
 98 |         
 99 |         sc.pp.normalize_total(input_ad, target_sum=1e4)
100 |         sc.pp.log1p(input_ad)
101 |         sc.pp.highly_variable_genes(input_ad, min_mean=0.0125, max_mean=3, min_disp=0.5)
102 |         sc.pp.scale(input_ad, max_value=10)
103 |         sc.tl.pca(input_ad, svd_solver="arpack")
104 |         sc.pp.neighbors(input_ad, n_neighbors=n_neighbors, n_pcs=40, use_rep="X_pca")
105 |         embedding_key = "X_pca"
106 |         
107 |     else:
108 |         click.echo("use PCA to compute KNN graph for corrected counts output")
109 |         sc.pp.scale(input_ad, max_value=10)
110 |         sc.tl.pca(input_ad, svd_solver="arpack")
111 |         sc.pp.neighbors(input_ad, n_neighbors=n_neighbors, n_pcs=40, use_rep="X_pca")
112 |         embedding_key = "X_pca"
113 |     
114 |     # take the distance matrix as input for kNN classifier
115 |     # in theory the "distance" slot is correct, while if we use "connectivities" the output is exactly the same
116 |     # this is because a non-weighted majority vote is performed, essentially, the most frequent label of the neighbours
117 |     # so the weight or distance does not matter numerically. I could also use weight='distance'
118 |     if use_embedding and f"{integration_method}_distances" in input_ad.obsp.keys():
119 |         X_use = input_ad.obsp[f"{integration_method}_distances"]
120 |         distances_use = f"{integration_method}_distances"
121 |         
122 |     elif not use_embedding and "distances" in input_ad.obsp.keys():
123 |         X_use = input_ad.obsp["distances"]
124 |         distances_use = "distances"
125 |     elif integration_method == 'SAMap':
126 |         X_use = input_ad.obsp['connectivities']
127 |         distances_use = "connectivities_SAMap"
128 |         # in SAMap, connectivities is just 1 if neighbor, 0 if not
129 |     else:
130 |         raise ValueError("Error in neighbours calculation")
131 |      
132 |         # minmaxscaler requires dense input
133 |         
134 |     ## use half data for training and half for testing
135 |     
136 |     click.echo(f"testing on k={n_neighbors}")
137 |         
138 |     y_prob, y_pred, y_test, clf, cvsm, acc = self_projection(X_use, input_ad.obs[cluster_key], n=0, fraction=0.5, classifier="KNN", K_knn=n_neighbors)
139 |     
140 |     pd.DataFrame(data={"test_accuracy":acc, "CV_accuracy":cvsm, "integration_method": integration_method, "input_file": input_h5ad, "connectivities": distances_use}, index=[0]).to_csv(out_acc_csv, index=False)
141 |     
142 |     #return X_use, y_prob, y_pred, y_test, clf, cvsm, acc
143 | 
144 | 
145 | def train_test_split_per_type_square(X, y, frac=0.5):
146 |     """
147 |     Split a square cell-by-cell matrix into training and test sets by sampling a fraction of cells within each class.
148 | 
149 |     Input
150 |     -----
151 |     X: `numpy.array` or sparse matrix
152 |         the feature matrix
153 |     y: `list of string/int`
154 |         the class assignments
155 |     frac: `float` optional (default: 0.5)
156 |         Fraction of cells sampled from each class for the training set.
157 |         0.5 means half of the cells of each class are used for training;
158 |         the remaining cells form the test set. The test matrix holds
159 |         test-cell rows against training-cell columns.
160 | 
161 |     return
162 |     -----
163 |     X_train, X_test, y_train, y_test
164 |     """
165 |     df = pd.DataFrame(y)
166 |     df.index = np.arange(len(y))
167 |     df.columns = ['class']
168 |     c_idx = df.groupby('class').apply(lambda x: x.sample(frac=frac)).index.get_level_values(None)
169 |     d_idx = ~np.isin(np.arange(len(y)), c_idx)
170 |     # the test matrix is n_test by n_indexed(trained)
171 |     return X[c_idx, :][:, c_idx], X[d_idx, :][:, c_idx], y[c_idx], y[d_idx]
172 | 
173 | 
174 | def inverse_weight(arr):
175 |     result = arr
176 |     for i in range(len(arr)):
177 |         for j in range(len(arr[i])):
178 |                 result[i][j] = 1 / arr[i][j]
179 |     return result
180 | 
181 | ## make distance huge when not neighbours
182 | def replace_zeros(arr):
183 |     for i in range(len(arr)):
184 |         for j in range(len(arr[i])):
185 |             if arr[i][j] == 0:
186 |                 arr[i][j] = 1e5
187 |     return arr
188 | 
189 | 
190 | def self_projection(X,
191 |                     cell_types,
192 |                     classifier="LR",
193 |                     K_knn=15,
194 |                     metric_knn='precomputed',
195 |                     penalty='l1',
196 |                     sparsity=0.5,
197 |                     fraction=0.5,
198 |                     random_state=1,
199 |                     solver='liblinear',
200 |                     n=0,
201 |                     cv=5,
202 |                     whole=False,
203 |                     n_jobs=None):
204 |     # n = 100 should be good.
205 |     """
206 |     This is the core function for running self-projection.
207 | 
208 |     Input
209 |     -----
210 |     X: `numpy.array` or sparse matrix
211 |         the expression matrix, e.g. ad.raw.X.
212 |     cell_types: `list of String/int`
213 |         the cell clustering assignment
214 |     classifier: `String` optional (default: 'LR')
215 |         a machine learning model in "LR" (logistic regression), "KNN" (k-nearest neighbour),\
216 |         "RF" (Random Forest), "GNB" (Gaussian Naive Bayes), "SVM" (Support Vector Machine) and "DT" (Decision Tree).
217 |     K_knn:  `int` optional (default: 15)
218 |         the "k" in a KNN classifier if used
219 |     metric_knn: `String` optional (default: 'precomputed')
220 |         the distance metric for KNN classifier, default is to use a precomputed distance matrix in sklearn.neighbors.KNeighborsClassifier
221 |     penalty: `String` optional (default: 'l1')
222 |         the regularization mode of logistic regression. Use 'l1' or 'l2'.
223 |     sparsity: `float` optional (default: 0.5)
224 |         The sparsity parameter (C in sklearn.linear_model.LogisticRegression) for the logistic regression model.
225 |     fraction: `float` optional (default: 0.5)
226 |         Fraction of data included in the training set. 0.5 means use half of the data for training,
227 |         if half of the data is fewer than maximum number of cells (n).
228 |     random_state: `int` optional (default: 1)
229 |         random_state parameter for logistic regression.
230 |     n: `int` optional (default: 0)
231 |         Maximum number of cell included in the training set for each cluster of cells.
232 |         only fraction is used to split the dataset if n is 0.
233 |     cv: `int` optional (default: 5)
234 |         fold for cross-validation on the training set.
235 |         0 means no cross-validation.
236 |     whole: `bool` optional (default: False)
237 |         if measure the performance on the whole dataset (include training and test).
238 |     n_jobs: `int` optional, number of threads to use with the different classifiers (default: None - unlimited).
239 | 
240 |     return
241 |     -----
242 |     y_prob, y_pred, y_test, clf, cvsm, accuracy_test
243 |     y_prob: `matrix of float`
244 |         prediction probability
245 |     y_pred: `list of string/int`
246 |         predicted clustering of the test set
247 |     y_test: `list of string/int`
248 |         real clustering of the test set
249 |     clf: the classifier model.
250 |     """
251 |     # split the data into training and testing
252 |     
253 |     if classifier == 'KNN':
254 |         scaler = MaxAbsScaler()
255 |         
256 |         # for kNN classifier, we need to normalize the data to unify the distance measurement between different algorithms
257 |         X = scaler.fit_transform(X)
258 |         assert X.shape[0] == X.shape[1], "Matrix is not square, is a connectivity matrix used as input for kNN classifier?"
259 | 
260 |     if issparse(X):
261 |         X = X.toarray()
262 |         click.echo("to dense after maxabs scaling")
263 |         
264 |     X_train, X_test, y_train, y_test = train_test_split_per_type_square(X, cell_types, frac=fraction)
265 |     X_train = replace_zeros(np.array(X_train))
266 |     X_test = replace_zeros(np.array(X_test))
267 | 
268 |     # fraction is the share of cells per class used for training
269 |     # set the classifier
270 | 
271 |     if classifier == 'LR':
272 |         clf = LogisticRegression(random_state=1, penalty=penalty, C=sparsity, multi_class="ovr", solver=solver)
273 |     elif classifier == 'KNN':
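274 |         clf = KNeighborsClassifier(n_neighbors=K_knn, metric=metric_knn, n_jobs=n_jobs, weights=inverse_weight)
275 |     else:
276 |         raise ValueError(f"classifier '{classifier}' is not implemented in this script; use 'LR' or 'KNN'")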
277 |     # mean cross validation score
278 |     cvsm = 0
279 |     if cv > 0:
280 |         cvs = cross_val_score(clf, X_train, np.array(y_train), cv=cv, scoring='accuracy', n_jobs=n_jobs)
281 |         cvsm = cvs.mean()
282 |         print("Mean CV accuracy: %.4f" % cvsm)
283 | 
284 |     # accuracy on cross validation and on test set
285 |     clf.fit(X_train, y_train)
286 |     accuracy = clf.score(X_train, y_train)
287 |     print("Accuracy on the training set: %.4f" % accuracy)
288 |     accuracy_test = clf.score(X_test, y_test)
289 |     print("Accuracy on the hold-out set: %.4f" % accuracy_test)
290 | 
291 |     # accuracy of the whole dataset
292 |     if whole:
293 |         accuracy = clf.score(X, cell_types)
294 |         print("Accuracy on the whole set: %.4f" % accuracy)
295 | 
296 |     # get predicted probability on the test set
297 |     y_prob = None
298 |     #if not classifier in ['SH', 'PCP']:
299 |     y_prob = clf.predict_proba(X_test)
300 |     y_pred = clf.predict(X_test)
301 | 
302 |     return y_prob, y_pred, y_test, clf, cvsm, accuracy_test
303 | 
304 | if __name__ == '__main__':
305 |     anndata_knn_acc()
306 | 


--------------------------------------------------------------------------------
/bin/sccaf_projection_multiple_species.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | # © EMBL-European Bioinformatics Institute, 2023
  4 | # Yuyao Song <ysong@ebi.ac.uk>
  5 | 
  6 | 
  7 | import click
  8 | from typing import List
  9 | 
 10 | import matplotlib.pyplot as plt
 11 | from matplotlib.backends.backend_pdf import PdfPages
 12 | import pandas as pd
 13 | import scanpy as sc
 14 | import anndata as ad
 15 | from anndata import AnnData
 16 | import itertools
 17 | from SCCAF import *
 18 | from sklearn.metrics.cluster import adjusted_rand_score
 19 | from sklearn import metrics
 20 | 
 21 | 
 22 | @click.command()
 23 | 
 24 | @click.argument("input_h5ad", type=click.Path(exists=True))
 25 | @click.argument("out_projection_h5ads", type=click.Path(exists=False), default=None)
 26 | @click.argument("out_figures", type=click.Path(exists=False), default=None)
 27 | @click.argument("out_acc_csv", type=click.Path(exists=False), default=None)
 28 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species")
 29 | @click.option('--integration_method', type=str, default=None, help="Integration method")
 30 | @click.option('--cluster_key', type=str, default=None, help="Cluster key in species to use to assess label preservation")
 31 | @click.option('--projection_key', type=str, default=None, help="Projection key in species one to use as labels to transfer to species two")
 32 | 
 33 | 
 34 | def run_sccaf_projection(input_h5ad, species_key,  cluster_key, projection_key, integration_method, out_projection_h5ads, out_figures, out_acc_csv):
 35 | # dictionary for method properties
 36 |     embedding_keys={"harmony": "X_pca_harmony", "scanorama": "X_scanorama", "scVI": "X_scVI", "LIGER": "X_inmf", "rligerUINMF":"X_inmf", "fastMNN": "X_mnn", "SAMap": "wPCA", "scANVI": "X_scANVI"}
 37 |     use_embeddings={"harmony": True, "scanorama": True, "scVI": True, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": True , "seuratCCA": False, "seuratRPCA": False, "scANVI": True, "unintegrated": False}
 38 |     from_h5seurat={"harmony": False, "scanorama": False, "scVI": False, "LIGER": True, "rligerUINMF":True, "fastMNN": True, "SAMap": False , "seuratCCA": True, "seuratRPCA": True, "scANVI": False, "unintegrated": False}
 39 |     sc.set_figure_params(dpi_save=200, frameon=False, figsize=(11, 6))
 40 |     input_ad = sc.read_h5ad(input_h5ad)
 41 |     species_all=input_ad.obs[species_key].astype("category").cat.categories.values
 42 |     acc_summary=pd.DataFrame()
 43 |      
 44 |     if integration_method == 'unintegrated':
 45 | 
 46 |         ## raw concatenated data needs preprocessing
 47 |         sc.pp.normalize_total(input_ad, target_sum=1e4)
 48 |         sc.pp.log1p(input_ad)
 49 |         sc.pp.highly_variable_genes(input_ad, min_mean=0.0125, max_mean=3, min_disp=0.5)
 50 |         input_ad.raw = input_ad
 51 |         sc.pp.scale(input_ad, max_value=10)
 52 |         sc.tl.pca(input_ad, svd_solver='arpack')
 53 |         sc.pp.neighbors(input_ad, n_neighbors=10, n_pcs=40)
 54 |         sc.tl.umap(input_ad, min_dist=0.3)
 55 | 
 56 |     # register color in .uns
 57 |     sc.pl.umap(input_ad, color = cluster_key)
 58 |     sc.pl.umap(input_ad, color = projection_key)
 59 | 
 60 |     click.echo("Start SCCAF projection workflow with input: "+input_h5ad)
 61 |     use_embedding=use_embeddings[integration_method]
 62 |     if use_embedding is True:
 63 |         embedding_key=embedding_keys[integration_method]
 64 | 
 65 | 
 66 |     # known bug: converting h5Seurat to h5ad can leave a wrong index name in raw.var; the disabled fix below renames it
 67 |     #if from_h5seurat[integration_method] is True:
 68 |     #    input_ad.__dict__['_raw'].__dict__['_var'] = input_ad.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'})
 69 |     #    click.echo("From h5seurat")
 70 | 
 71 |     def get_cell_type_auc(clf, y_test, y_prob):
 72 |         rc_aucs = [] #AUC
 73 |         rp_aucs = [] # AUC from recall precision
 74 |         fprs = [] #FPR
 75 |         tprs = [] #TPR
 76 |         prss = [] #Precision
 77 |         recs = [] #Recall
 78 |         for i, cell_type in enumerate(clf.classes_):
 79 |             fpr, tpr, _ = metrics.roc_curve(y_test == cell_type, y_prob[:, i])
 80 |             prs, rec, _ = metrics.precision_recall_curve(y_test == cell_type, y_prob[:, i])
 81 |             fprs.append(fpr)
 82 |             tprs.append(tpr)
 83 |             prss.append(prs)
 84 |             recs.append(rec)
 85 |             rc_aucs.append(metrics.auc(fpr, tpr))
 86 |             rp_aucs.append(metrics.auc(rec, prs))
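 87 |         # rc_aucs holds the ROC AUCs and rp_aucs the precision-recall AUCs, so zip them in column order
 87 |         tbl = pd.DataFrame(data=list(zip(clf.classes_, rc_aucs, rp_aucs)), columns=['cell_type', "ROC_AUC", "PR_AUC"])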
 88 |         return tbl
 89 | 
 90 |     ##def unique_combinations(elements: list[str]) -> list[tuple[str, str]]:
 91 |     def unique_combinations(elements) -> list:
 92 |     ## Precondition: `elements` does not contain duplicates.
 93 |     ## Postcondition: Returns unique combinations of length 2 from `elements`.
 94 | 
 95 |         return list(itertools.combinations(elements, 2))
 96 | 
 97 |     species_pairs=unique_combinations(species_all)
 98 | 
 99 |     with PdfPages(out_figures) as pdf:
100 |         for species_pair in species_pairs:
101 |             species_one=species_pair[0]
102 |             species_two=species_pair[1]
103 | 
104 |             adx1 = input_ad[input_ad.obs[species_key]==species_one, :]
105 |             adx2 = input_ad[input_ad.obs[species_key]==species_two, :]
106 | 
107 |             if use_embedding is True and not embedding_key in input_ad.obsm.keys():
108 |                 raise ValueError("`adata.obsm['%s']` doesn't exist. Please assign the embedding key if use embedding."%(embedding_key))
109 |             if use_embedding is True and embedding_key is not None:
110 |                 input_matrix_1 = adx1.obsm[embedding_key]
111 |                 input_matrix_2 = adx2.obsm[embedding_key]
112 |             else:
113 |                 input_matrix_1 = adx1.X
114 |                 input_matrix_2 = adx2.X
115 |             click.echo("running projection between "+species_one+" and "+species_two)
116 | 
117 |             colors1 = adx1.uns[cluster_key+"_colors"]
118 | 
119 |             y_prob1, y_pred1, y_test1, clf1, cvsm1, acc1 = SCCAF_assessment(input_matrix_1, adx1.obs[cluster_key], n=200)
120 |             aucs1 = plot_roc(y_prob1, y_test1, clf1, cvsm=cvsm1, acc=acc1, colors=colors1, title=species_one+"_"+integration_method+"_original_label")
121 |             pdf.attach_note("AUC of original label in "+species_one+" by "+integration_method)
122 |             pdf.savefig(dpi=200, bbox_inches='tight')
123 |             plt.close()
124 | 
125 |             tbl1=get_cell_type_auc(clf1, y_test1, y_prob1)
126 | 
127 |             tbl1[['test_acc']] = acc1
128 |             tbl1[['CV_acc']] = cvsm1
129 |             tbl1[['type_label']] = 'original'
130 |             tbl1['from_species'] = species_one
131 |             tbl1['to_species'] = species_one
132 |             tbl1['integration_method'] = integration_method
133 |             tbl1['input_file'] = input_h5ad
134 |             tbl1['key_use'] = cluster_key
135 |             tbl1['adj_rand_score'] = 'NaN'
136 |             tbl1['pct_cell_type_kept'] = 'NaN'
137 | 
138 | 
139 |             colors2 = adx2.uns[cluster_key+"_colors"]
140 | 
141 |             y_prob2, y_pred2, y_test2, clf2, cvsm2, acc2 = SCCAF_assessment(input_matrix_2, adx2.obs[cluster_key], n=200)
142 |             aucs2 = plot_roc(y_prob2, y_test2, clf2, cvsm=cvsm2, acc=acc2, colors=colors2, title=species_two+"_"+integration_method+"_original_label")
143 |             pdf.attach_note("AUC of original label in "+species_two+" by "+integration_method)
144 |             pdf.savefig(dpi=200, bbox_inches='tight')
145 |             plt.close()
146 | 
147 |             tbl2=get_cell_type_auc(clf2, y_test2, y_prob2)
148 | 
149 |             tbl2['test_acc'] = acc2
150 |             tbl2[['CV_acc']] = cvsm2
151 |             tbl2[['type_label']] = 'original'
152 |             tbl2['from_species'] = species_two
153 |             tbl2['to_species'] = species_two
154 |             tbl2['integration_method'] = integration_method
155 |             tbl2['input_file'] = input_h5ad
156 |             tbl2['key_use'] = cluster_key
157 |             tbl2['adj_rand_score'] = 'NaN'
158 |             tbl2['pct_cell_type_kept'] = 'NaN'
159 | 
160 | 
161 |             # run projection only on shared cell types
162 |             adx1 = input_ad[input_ad.obs[species_key]==species_one, :]
163 |             adx2 = input_ad[input_ad.obs[species_key]==species_two, :]
164 | 
165 |             a = adx1.obs[projection_key].cat.categories.tolist()
166 |             b = adx2.obs[projection_key].cat.categories.tolist()
167 | 
168 |             shared_ct =  list(set(a) & set(b))
169 |             ##shared_ct.remove('T_cell')
170 |             adx1 = adx1[adx1.obs[projection_key].isin(shared_ct), :]
171 |             adx2 = adx2[adx2.obs[projection_key].isin(shared_ct), :]
172 | 
173 |             color = dict(zip(input_ad.obs[projection_key].cat.categories.to_list(), input_ad.uns[projection_key+'_colors']))
174 | 
175 |             color_palette = {key: value for key, value in color.items() if key in shared_ct}
176 | 
177 |             if use_embedding is True and not embedding_key in input_ad.obsm.keys():
178 |                 raise ValueError("`adata.obsm['%s']` doesn't exist. Please assign the embedding key if use embedding."%(embedding_key))
179 |             if use_embedding is True and embedding_key is not None:
180 |                 input_matrix_1 = adx1.obsm[embedding_key]
181 |                 input_matrix_2 = adx2.obsm[embedding_key]
182 |             else:
183 |                 input_matrix_1 = adx1.X
184 |                 input_matrix_2 = adx2.X
185 | 
186 | 
187 |             ## species one label inferred from species two
188 |             y_prob2, y_pred2, y_test2, clf2, cvsm2, acc2 = SCCAF_assessment(input_matrix_2, adx2.obs[projection_key], n=200)
189 |             adx1.obs['logit_inferred'] = clf2.predict(input_matrix_1)
190 |             pct_1 = len(set(adx1.obs['logit_inferred'].astype("category").cat.categories.tolist()) & set(adx1.obs[projection_key].astype("category").cat.categories.tolist())) / len(adx1.obs[projection_key].astype("category").cat.categories.tolist())
191 | 
192 |             umap1 = sc.pl.umap(adx1, color=[projection_key,"logit_inferred"], palette=color_palette, title = [species_one+' original label', species_one+' inferred label by '+species_two], ncols=2, frameon=False, show=False)
193 |             pdf.attach_note("UMAP of species " + species_one + " original and transferred label")
194 |             pdf.savefig(dpi=200, bbox_inches='tight')
195 |             plt.close()
196 | 
197 |             ars_1  = adjusted_rand_score(adx1.obs['logit_inferred'], adx1.obs[projection_key])
198 | 
199 |             # sometimes the projection fails and at most two cell types are left after the projection
200 |             if len(set(adx1.obs['logit_inferred'].astype("category").cat.categories.tolist())) <= 2:
201 |                 click.echo("projection doesn't work, return NA")
202 |                 tbl3=pd.DataFrame(data=list(zip(['logit_not_working'], ["NaN"], ["NaN"])), columns=['cell_type', "ROC_AUC", "PR_AUC"])
203 | 
204 |                 tbl3['test_acc'] = 'NaN'
205 |                 tbl3[['CV_acc']] = 'NaN'
206 |                 tbl3[['type_label']] = 'logit_inferred'
207 |                 tbl3['from_species'] = species_two
208 |                 tbl3['to_species'] = species_one
209 |                 tbl3['integration_method'] = integration_method
210 |                 tbl3['input_file'] = input_h5ad
211 |                 tbl3['key_use'] = projection_key
212 |                 tbl3['adj_rand_score'] = ars_1
213 |                 tbl3['pct_cell_type_kept'] = pct_1
214 | 
215 |             else:
216 |                 click.echo("projection worked")
217 |                 colors3 = adx1.uns['logit_inferred_colors']
218 |                 y_prob3, y_pred3, y_test3, clf3, cvsm3, acc3 = SCCAF_assessment(input_matrix_1, adx1.obs['logit_inferred'],n=200)
219 |                 aucs3 = plot_roc(y_prob3, y_test3, clf3, cvsm=cvsm3, acc=acc3, colors=colors3)
220 |                 pdf.attach_note("AUC of logit inferred label in "+species_one)
221 |                 pdf.savefig(dpi=200, bbox_inches='tight')
222 |                 plt.close()
223 | 
224 |                 tbl3=get_cell_type_auc(clf3, y_test3, y_prob3)
225 | 
226 |                 tbl3['test_acc'] = acc3
227 |                 tbl3[['CV_acc']] = cvsm3
228 |                 tbl3[['type_label']] = 'logit_inferred'
229 |                 tbl3['from_species'] = species_two
230 |                 tbl3['to_species'] = species_one
231 |                 tbl3['integration_method'] = integration_method
232 |                 tbl3['input_file'] = input_h5ad
233 |                 tbl3['key_use'] = projection_key
234 |                 tbl3['adj_rand_score'] = ars_1
235 |                 tbl3['pct_cell_type_kept'] = pct_1
236 | 
237 | 
238 | 
239 |             y_prob1, y_pred1, y_test1, clf1, cvsm1, acc1 = SCCAF_assessment(input_matrix_1, adx1.obs[projection_key], n=200)
240 | 
241 |             adx2.obs['logit_inferred'] = clf1.predict(input_matrix_2)
242 |             pct_2 = len(set(adx2.obs['logit_inferred'].astype("category").cat.categories.tolist()) & set(adx2.obs[projection_key].astype("category").cat.categories.tolist())) / len(adx2.obs[projection_key].astype("category").cat.categories.tolist())
243 |             umap2 = sc.pl.umap(adx2, color=[projection_key,"logit_inferred"], palette=color_palette, title = [species_two+' original label', species_two+' inferred label by '+species_one], ncols=2, frameon=False, show=False)
244 |             pdf.attach_note("UMAP of species " + species_two + " original and transferred label")
245 |             pdf.savefig(dpi=200, bbox_inches='tight')
246 |             plt.close()
247 | 
248 |             ars_2  = adjusted_rand_score(adx2.obs['logit_inferred'], adx2.obs[projection_key])
249 | 
250 |             if len(set(adx2.obs['logit_inferred'].astype("category").cat.categories.tolist())) <= 2:
251 |                 click.echo("projection doesn't work, return NA")
252 |                 tbl4=pd.DataFrame(data=list(zip(['logit_not_working'], ["NaN"], ["NaN"])), columns=['cell_type', "ROC_AUC", "PR_AUC"])
253 | 
254 |                 tbl4['test_acc'] = 'NaN'
255 |                 tbl4[['CV_acc']] = 'NaN'
256 |                 tbl4[['type_label']] = 'logit_inferred'
257 |                 tbl4['from_species'] = species_one
258 |                 tbl4['to_species'] = species_two
259 |                 tbl4['integration_method'] = integration_method
260 |                 tbl4['input_file'] = input_h5ad
261 |                 tbl4['key_use'] = projection_key
262 |                 tbl4['adj_rand_score'] = ars_2
263 |                 tbl4['pct_cell_type_kept'] = pct_2
264 | 
265 |             else:
266 |                 click.echo("projection worked")
267 |                 colors4 = adx2.uns['logit_inferred_colors']
268 |                 y_prob4, y_pred4, y_test4, clf4, cvsm4, acc4 = SCCAF_assessment(input_matrix_2, adx2.obs['logit_inferred'],n=200)
269 |                 aucs4 = plot_roc(y_prob4, y_test4, clf4, cvsm=cvsm4, acc=acc4, colors=colors4)
270 |                 pdf.attach_note("AUC of logit inferred label in " + species_two)
271 |                 pdf.savefig(dpi=200, bbox_inches='tight')
272 |                 plt.close()
273 | 
274 |                 tbl4=get_cell_type_auc(clf4, y_test4, y_prob4)
275 | 
276 |                 tbl4['test_acc'] = acc4
277 |                 tbl4[['CV_acc']] = cvsm4
278 |                 tbl4[['type_label']] = 'logit_inferred'
279 |                 tbl4['from_species'] = species_one
280 |                 tbl4['to_species'] = species_two
281 |                 tbl4['integration_method'] = integration_method
282 |                 tbl4['input_file'] = input_h5ad
283 |                 tbl4['key_use'] = projection_key
284 |                 tbl4['adj_rand_score'] = ars_2
285 |                 tbl4['pct_cell_type_kept'] = pct_2
286 | 
287 |             acc_summary = pd.concat([acc_summary, tbl1, tbl2, tbl3, tbl4], axis=0, ignore_index=True)
288 |             adata_out = ad.concat([adx1, adx2], join='inner', merge='same', label='sccaf_projection', keys=[species_one+"_from_"+species_one+"_"+species_two+"_"+integration_method, species_two+"_from_"+species_one+"_"+species_two+"_"+integration_method])
289 |                 
290 |             adata_out.write(out_projection_h5ads+"_"+species_one+"_"+species_two+".h5ad", compression = 'gzip')
291 |             click.echo("finish projection between "+species_one+" and "+species_two)
292 | 
293 |     click.echo("write acc summary for all")
294 |     acc_summary.to_csv(out_acc_csv, index=False)
295 | 
296 | if __name__ == '__main__':
297 |     run_sccaf_projection()
298 | 
299 | 
300 | 
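
The projection script above loops over every unordered species pair and reports `pct_cell_type_kept`, the share of original label categories that still appear among the transferred labels. The following is a minimal standalone sketch of those two steps, assuming plain Python lists and sets instead of the pandas categoricals used in the script. The helper names `unique_species_pairs` and `pct_cell_type_kept` and all input values are hypothetical and chosen only for illustration.

```python
import itertools


def unique_species_pairs(species):
    """Return every unordered pair of species (each pair appears once)."""
    return list(itertools.combinations(species, 2))


def pct_cell_type_kept(original_labels, inferred_labels):
    """Fraction of original label categories that also occur among the inferred labels."""
    original = set(original_labels)
    return len(original & set(inferred_labels)) / len(original)


# Hypothetical inputs, not taken from any BENGAL dataset.
species = ["human", "mouse", "zebrafish"]
pairs = unique_species_pairs(species)

original = ["T_cell", "B_cell", "NK_cell", "T_cell"]
inferred = ["T_cell", "T_cell", "B_cell", "T_cell"]  # NK_cell is lost in the transfer
kept = pct_cell_type_kept(original, inferred)

print(pairs)
print(round(kept, 3))
```

With three species the loop visits three pairs, and each pair is projected in both directions inside the loop body, so the summary table receives two `logit_inferred` groups of rows per pair.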


--------------------------------------------------------------------------------
/bin/scvi_integration.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # © EMBL-European Bioinformatics Institute, 2023
 4 | # Yuyao Song <ysong@ebi.ac.uk>
 5 | 
 6 | import click
 7 | import matplotlib.pyplot as plt
 8 | import scanpy as sc
 9 | import scvi
10 | 
11 | 
12 | 
13 | @click.command()
14 | @click.argument("input_h5ad", type=click.Path(exists=True))
15 | @click.argument("out_h5ad", type=click.Path(exists=False), default=None)
16 | @click.argument("out_umap", type=click.Path(exists=False), default=None)
17 | @click.option('--batch_key', type=str, default=None, help="Batch key used for HVG selection and scVI integration")
18 | @click.option('--species_key', type=str, default=None, help="Species key to distinguish species")
19 | @click.option('--cluster_key', type=str, default=None, help="Cluster key used to color the UMAP plot")
20 | 
21 | 
22 | def run_scVI(input_h5ad, out_h5ad, out_umap, batch_key, species_key, cluster_key):
23 |     click.echo('Start scVI integration - use cpu mode')
24 |     sc.set_figure_params(dpi_save=300, frameon=False, figsize=(10, 6))
25 |     adata = sc.read_h5ad(input_h5ad)
26 |     adata.var_names_make_unique()
27 |     sc.pp.highly_variable_genes(
28 |         adata,
29 |         flavor="seurat_v3",
30 |         n_top_genes=2000,
31 |         ##layer="counts",
32 |         batch_key=batch_key,
33 |         subset=True
34 |     )
35 | 
36 |     adata.layers["counts"] = adata.X.copy()
37 |     sc.pp.normalize_total(adata, target_sum=1e4)
38 |     sc.pp.log1p(adata)
39 |     adata.raw = adata
40 |     #sc.pp.scale(adata, max_value=10)
41 |     #sc.tl.pca(adata)
42 |     #sc.pp.neighbors(adata)
43 |     #sc.tl.umap(adata)
44 |     #adata.obsm['X_umapraw'] = adata.obsm['X_umap']
45 | 
46 |     click.echo("setup scVI model")
47 |     scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key=batch_key)
48 |     vae = scvi.model.SCVI(adata, n_layers=2, n_latent=40, gene_likelihood="nb")
49 |     vae.train()
50 |     adata.obsm["X_scVI"] = vae.get_latent_representation()
51 |     sc.pp.neighbors(adata, use_rep="X_scVI", key_added='scVI', n_neighbors=15, n_pcs=40)
52 |     sc.tl.umap(adata, neighbors_key='scVI', min_dist=0.3) ## to match min_dist in seurat
53 |     sc.pl.umap(adata, neighbors_key='scVI', color=[batch_key, species_key, cluster_key], ncols=1)
54 |     plt.savefig(out_umap, dpi=300,  bbox_inches='tight')
55 |     adata.obsm['X_umapscVI'] = adata.obsm['X_umap']
56 |     click.echo("Save output")
57 |     adata.write(out_h5ad)
58 |     click.echo("Done scVI")
59 | 
60 | 
61 | if __name__ == '__main__':
62 |     run_scVI()
63 | 
64 | 


--------------------------------------------------------------------------------
/bin/seurat_CCA_integration.R:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env Rscript
 2 | 
 3 | library(Seurat)
 4 | library(optparse)
 5 | 
 6 | 
 7 | option_list <- list(
 8 |   make_option(c("-i", "--input_rds"),
 9 |     type = "character", default = NULL,
10 |     help = "Path to input preprocessed rds file"
11 |   ),
12 |   make_option(c("-o", "--out_rds"),
13 |     type = "character", default = NULL,
14 |     help = "Output Seurat CCA integrated rds file"
15 |   ),
16 |   make_option(c("-p", "--out_UMAP"),
17 |     type = "character", default = NULL,
18 |     help = "Output UMAP after Seurat CCA integration"
19 |   ),
20 |   make_option(c("-b", "--batch_key"),
21 |     type = "character", default = NULL,
22 |     help = "Batch key identifier to integrate"
23 |   ),
24 |   make_option(c("-s", "--species_key"),
25 |     type = "character", default = NULL,
26 |     help = "Species key identifier"
27 |   ),
28 |   make_option(c("-c", "--cluster_key"),
29 |     type = "character", default = NULL,
30 |     help = "Cluster key for UMAP plotting"
31 |   )
32 | )
33 | 
34 | # parse input
35 | opt <- parse_args(OptionParser(option_list = option_list))
36 | 
37 | input_rds <- opt$input_rds
38 | out_rds <- opt$out_rds
39 | out_UMAP <- opt$out_UMAP
40 | batch_key <- opt$batch_key
41 | species_key <- opt$species_key
42 | cluster_key <- opt$cluster_key
43 | 
44 | ## create Seurat object via rds
45 | 
46 | # Convert(input_h5ad, dest = "rds", overwrite = TRUE)
47 | # input_rds <- gsub("h5ad", "rds", input_h5ad)
48 | input <- readRDS(input_rds)
49 | 
50 | object.list <- SplitObject(input, split.by = batch_key)
51 | 
52 | # normalize and identify variable features for each dataset independently
53 | object.list <- lapply(X = object.list, FUN = function(x) {
54 |   x <- NormalizeData(x)
55 |   x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
56 | })
57 | 
58 | features <- SelectIntegrationFeatures(object.list = object.list)
59 | 
60 | input.anchors <- FindIntegrationAnchors(object.list = object.list, anchor.features = features)
61 | 
62 | input.combined <- IntegrateData(anchorset = input.anchors)
63 | 
64 | DefaultAssay(input.combined) <- "integrated"
65 | 
66 | input.combined <- ScaleData(input.combined, verbose = FALSE)
67 | input.combined <- RunPCA(input.combined, npcs = 50, verbose = FALSE)
68 | input.combined <- RunUMAP(input.combined, reduction = "pca", dims = 1:30, n.neighbors = 15L, min.dist = 0.3)
69 | input.combined <- FindNeighbors(input.combined, reduction = "pca", dims = 1:30)
70 | input.combined <- FindClusters(input.combined, resolution = 0.4)
71 | 
72 | saveRDS(input.combined,
73 |   file = out_rds
74 | )
75 | 
76 | pdf(out_UMAP, height = 6, width = 10)
77 | DimPlot(input.combined, reduction = "umap", group.by = species_key, shuffle = TRUE, label = TRUE)
78 | DimPlot(input.combined, reduction = "umap", group.by = batch_key, shuffle = TRUE, label = TRUE)
79 | DimPlot(input.combined, reduction = "umap", group.by = cluster_key, shuffle = TRUE, label = TRUE)
80 | 
81 | dev.off()
82 | 


--------------------------------------------------------------------------------
/bin/seurat_RPCA_integration.R:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env Rscript
 2 | 
 3 | library(Seurat)
 4 | library(optparse)
 5 | 
 6 | 
 7 | option_list <- list(
 8 |   make_option(c("-i", "--input_rds"),
 9 |     type = "character", default = NULL,
10 |     help = "Path to input preprocessed rds file"
11 |   ),
12 |   make_option(c("-o", "--out_rds"),
13 |     type = "character", default = NULL,
14 |     help = "Output Seurat RPCA integrated rds file"
15 |   ),
16 |   make_option(c("-p", "--out_UMAP"),
17 |     type = "character", default = NULL,
18 |     help = "Output UMAP after Seurat RPCA integration"
19 |   ),
20 |   make_option(c("-b", "--batch_key"),
21 |     type = "character", default = NULL,
22 |     help = "Batch key identifier to integrate"
23 |   ),
24 |   make_option(c("-s", "--species_key"),
25 |     type = "character", default = NULL,
26 |     help = "Species key identifier"
27 |   ),
28 |   make_option(c("-c", "--cluster_key"),
29 |     type = "character", default = NULL,
30 |     help = "Cluster key for UMAP plotting"
31 |   )
32 | )
33 | 
34 | # parse input
35 | opt <- parse_args(OptionParser(option_list = option_list))
36 | 
37 | input_rds <- opt$input_rds
38 | out_rds <- opt$out_rds
39 | out_UMAP <- opt$out_UMAP
40 | batch_key <- opt$batch_key
41 | species_key <- opt$species_key
42 | cluster_key <- opt$cluster_key
43 | 
44 | input <- readRDS(input_rds)
45 | 
46 | object.list <- SplitObject(input, split.by = batch_key)
47 | 
48 | # normalize and identify variable features for each dataset independently
49 | object.list <- lapply(X = object.list, FUN = function(x) {
50 |   x <- NormalizeData(x)
51 |   x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
52 | })
53 | 
54 | features <- SelectIntegrationFeatures(object.list = object.list)
55 | 
56 | # rpca workflow
57 | object.list <- lapply(X = object.list, FUN = function(x) {
58 |   x <- ScaleData(x, features = features, verbose = FALSE)
59 |   x <- RunPCA(x, features = features, verbose = FALSE)
60 | })
61 | 
62 | anchors <- FindIntegrationAnchors(object.list = object.list, reduction = "rpca", dims = 1:50)
63 | input.combined <- IntegrateData(anchorset = anchors, dims = 1:50)
64 | 
65 | DefaultAssay(input.combined) <- "integrated"
66 | 
67 | 
68 | input.combined <- ScaleData(input.combined, verbose = FALSE)
69 | input.combined <- RunPCA(input.combined, npcs = 50, verbose = FALSE)
70 | input.combined <- RunUMAP(input.combined, reduction = "pca", dims = 1:30, n.neighbors = 15L, min.dist = 0.3)
71 | input.combined <- FindNeighbors(input.combined, reduction = "pca", dims = 1:30)
72 | input.combined <- FindClusters(input.combined, resolution = 0.4)
73 | 
74 | saveRDS(input.combined,
75 |   file = out_rds
76 | )
77 | 
78 | pdf(out_UMAP, height = 6, width = 10)
79 | DimPlot(input.combined, reduction = "umap", group.by = species_key, shuffle = TRUE, label = TRUE)
80 | DimPlot(input.combined, reduction = "umap", group.by = batch_key, shuffle = TRUE, label = TRUE)
81 | DimPlot(input.combined, reduction = "umap", group.by = cluster_key, shuffle = TRUE, label = TRUE)
82 | 
83 | dev.off()
84 | 


--------------------------------------------------------------------------------
/bin/validate_input.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | # © EMBL-European Bioinformatics Institute, 2023
 4 | # Yuyao Song <ysong@ebi.ac.uk>
 5 | 
 6 | # ysong update pydantic to v2 for containerization Oct 2023
 7 | 
 8 | import click
 9 | from pydantic import StringConstraints, Field, ConfigDict, BaseModel, ValidationError
10 | import pandas as pd
11 | import numpy as np
12 | import scanpy as sc
13 | from typing import List
14 | from typing_extensions import Annotated
15 | 
16 | @click.command()
17 | @click.argument("input_metadata", type=click.Path(exists=True))
18 | @click.option('--batch_key', type=str, default=None, help="Column in adata.obs for batch label")
19 | @click.option('--species_key', type=str, default=None, help="Column in adata.obs for species label")
20 | @click.option('--cluster_key', type=str, default=None, help="Column in adata.obs for cell type label")
21 | 
22 | def validate_adata_input(input_metadata, batch_key, cluster_key, species_key):
23 | 
24 |     ## define classes to validate
25 |     meta = pd.read_csv(input_metadata, sep = '\t', header=None)
26 | 
27 |     class adata_obs_for_csi(BaseModel):
28 |         species_key: pd.Series
29 |         cluster_key: pd.Series
30 |         batch_key: pd.Series
31 |         model_config = ConfigDict(arbitrary_types_allowed=True)
32 | 
33 |     class adata_X_for_csi(BaseModel):
34 |         X: np.ndarray
35 |         model_config = ConfigDict(arbitrary_types_allowed=True)
36 | 
37 |     class adata_var_for_csi(BaseModel):
38 |         mean_counts: pd.Series
39 |         var_names: List[Annotated[str, Field(pattern=r"^ENS[A-Z]{3}[GP][0-9]{11}$|^ENSG[0-9]{11}(\.[0-9]+_[A-Z]+_[A-Z]+)?$")]] # require ensembl gene id as var_names
40 |         model_config = ConfigDict(arbitrary_types_allowed=True)
41 | 
42 |     ## validate
43 | 
44 |     for i in range(0, meta.shape[0]):
45 |         print('validating input for species '+ meta.iloc[i, 0])
46 |         print('input anndata path is '+ meta.iloc[i, 1])
47 | 
48 |         species_now = meta.iloc[i, 0]
49 | 
50 |         print('read in adata')
51 |         ad_now = sc.read_h5ad(meta.iloc[i, 1])
52 | 
53 |         ## validate required fields in adata.obs
54 |         for key in {species_key, batch_key, cluster_key}:
55 |             try:
56 |                 ad_now.obs[key]
57 |             except KeyError as ke:
58 |                 print(str(ke) + ' does not exist in ' + species_now + ' adata.obs')
59 |                 raise
60 |             else:
61 |                 try:
62 |                     adata_obs_for_csi(species_key = ad_now.obs[species_key], batch_key = ad_now.obs[batch_key], cluster_key = ad_now.obs[cluster_key] )
63 |                     print('obs test pass for ' + species_now)
64 |                 except ValidationError as ve:
65 |                     print(ve.json())
66 |                     raise
67 |         ## validate adata.X is a dense matrix in np.array and all values are non-negative (i.e. non-scaled data)
68 |         try:
69 |             adata_X_for_csi(X = ad_now.X)
70 |             print('count matrix test pass for ' + species_now)
71 |         except ValidationError as e:
72 |             print(e.json())
73 |         else:
74 |             if not np.all(ad_now.X >= 0):
75 |                 raise ValueError('values in adata.X are not all non-negative, please make sure raw count matrix is in adata.X')
76 | 
77 |         ## validate required fields in adata.var, and adata.var_names are ensembl gene ids
78 |         try:
79 |             ad_now.var['mean_counts']
80 |         except KeyError as ke:
81 |             print(str(ke) + ' does not exist in ' + species_now + ' adata.var')
82 |             raise
83 |         else:
84 |             try:
85 |                 adata_var_for_csi(mean_counts = ad_now.var['mean_counts'], var_names = ad_now.var_names.values.tolist())
86 |                 print('var test pass for ' + species_now)
87 |             except ValidationError as e:
88 |                 print(e.json())
89 |                 raise
90 | 
91 |     print('input validation complete')
92 | 
93 | if __name__ == '__main__':
94 |     validate_adata_input()
95 | 
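
The `var_names` pattern in `adata_var_for_csi` decides which gene identifiers pass validation. As a rough illustration, the sketch below applies the same pattern with Python's standard `re` module. This assumes `re` agrees with pydantic's regex engine for this pattern, which should hold for such a simple expression but is not verified here. The helper `is_accepted` and the example identifiers are hypothetical.

```python
import re

# Pattern copied from adata_var_for_csi.var_names in validate_input.py.
ENSEMBL_ID = re.compile(
    r"^ENS[A-Z]{3}[GP][0-9]{11}$|^ENSG[0-9]{11}(\.[0-9]+_[A-Z]+_[A-Z]+)?$"
)


def is_accepted(var_name):
    """Return True if var_name matches the Ensembl ID pattern."""
    return ENSEMBL_ID.search(var_name) is not None


# Illustrative identifiers only; they need not exist in any annotation release.
examples = [
    "ENSG00000139618",          # human-style gene ID
    "ENSMUSG00000017167",       # three-letter species code followed by G
    "ENSG00000139618.5_PAR_Y",  # versioned ID with a two-part suffix
    "ENSG00000139618.5",        # plain version suffix, not covered by the pattern
    "TP53",                     # gene symbol
]
results = {name: is_accepted(name) for name in examples}

for name, ok in results.items():
    print(name, "accepted" if ok else "rejected")
```

Under this reading, gene symbols and IDs carrying only a plain `.version` suffix are rejected, so such suffixes would need to be stripped from `adata.var_names` before running the validator.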


--------------------------------------------------------------------------------
/concat_by_homology_multiple_species.nf:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env nextflow
  2 | nextflow.enable.dsl=2
  3 | 
  4 | // © EMBL-European Bioinformatics Institute, 2023
  5 | // Updated upon containerization of BENGAL envs
  6 | // Yuyao Song <ysong@ebi.ac.uk>
  7 | // Oct 2023
  8 | 
  9 | // data requirements: batch_key, species_key, cluster_key, mapped in config file, present in adata.obs
 10 | // data requirements: mean_counts in adata.var from scanpy QC
 11 | // raw h5ad file naming: <species>
 12 | // adata.var_names are ensembl gene ids
 13 | 
 14 | log.info """
 15 |          ===========================================================
 16 | 
 17 | 
 18 |          Cross-species integration and assessment - nextflow pipeline
 19 |          Use singularity containers for cluster execution
 20 |          - check input format
 21 |          - concatenate input anndata
 22 |          Author: ysong@ebi.ac.uk
 23 |          Initial date: Mar 2022
 24 |          Latest date: Oct 2023
 25 | 
 26 |          ===========================================================
 27 | 
 28 |          """
 29 |          .stripIndent()
 30 | 
 31 | 
 32 | 
 33 | process validate_adata_input {
 34 | 
 35 |     // labels for cluster options and containers
 36 |     // no need to change here, adjust as per dataset in config file
 37 | 
 38 |     label 'validate'
 39 |     label 'regular_resource'
 40 | 
 41 |     input:
 42 |     tuple val(basename), path(metadata)
 43 | 
 44 |     script:
 45 |     """
 46 |     python ${projectDir}/bin/validate_input.py ${metadata} --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key}
 47 | 
 48 |     """
 49 | 
 50 | }
 51 | 
 52 | 
 53 | process concat_by_homology {
 54 | 
 55 |     label 'concat'
 56 |     label 'regular_resource'
 57 | 
 58 |     publishDir "${params.results}/results/h5ad_homology_concat", mode: 'copy'
 59 | 
 60 |     input:
 61 | 
 62 |     tuple val(basename), path(metadata)
 63 | 
 64 |     output:
 65 | 
 66 |     path "*.h5ad", emit: concat_h5ad
 67 |     path "${basename}_homology_tbl.csv"
 68 | 
 69 | 
 70 |     script:
 71 |     """
 72 |     Rscript ${projectDir}/bin/concat_by_homology_multiple_species_by_gene_id.R --metadata ${metadata} \
 73 |     --one2one_h5ad ${basename}_one2one_only.h5ad --many_higher_expr_h5ad ${basename}_many_higher_expr.h5ad \
 74 |     --many_higher_homology_conf_h5ad ${basename}_many_higher_homology_conf.h5ad --homology_tbl ${basename}_homology_tbl.csv
 75 |     """
 76 | }
 77 | 
 78 | process concat_by_homology_rliger_uinmf {
 79 | 
 80 |     label 'concat'
 81 |     label 'regular_resource'
 82 | 
 83 |     publishDir "${params.results}/results/rligerUINMF/h5ad_homology_concat", mode: 'copy'
 84 | 
 85 |     input:
 86 | 
 87 |     tuple val(basename), path(metadata)
 88 | 
 89 |     output:
 90 |     path "homology_tbl.csv"
 91 |     path "${basename}_liger.tsv", emit: concat_h5ad_rliger_paths
 92 |     path "*.h5ad", emit: concat_h5ad_rliger
 93 | 
 94 |     script:
 95 |     """
 96 |     mkdir -p ${params.results}/results/rligerUINMF/h5ad_homology_concat && \
 97 |     Rscript ${projectDir}/bin/concat_by_homology_rligerUINMF_multiple_species.R --metadata ${metadata} \
 98 |     --out_dir .  --homology_tbl homology_tbl.csv \
 99 |     --metadata_output ${basename}_liger.tsv
100 | 
101 |     """
102 | }
103 | 
104 | process convert_format_h5ad {
105 | 
106 |     label 'convert'
107 |     label 'regular_resource'
108 | 
109 |     publishDir "${params.results}/results/h5ad_homology_concat", mode: 'copy'
110 | 
111 |     input: 
112 |     tuple val(basename), path(h5ad_concat)
113 | 
114 |     output:
115 |     path "*.rds"
116 | 
117 |     script:
118 |     """
119 |     Rscript ${projectDir}/bin/convert_format.R \
120 |     -i ${h5ad_concat} -o ${basename}.rds -t anndata_to_seurat \
121 |     --conda_path ${params.sceasy_conda}
122 | 
123 |     """
124 | 
125 | 
126 | }
127 | 
128 | process convert_format_rliger_uinmf {
129 | 
130 |     label 'convert'
131 |     label 'regular_resource'
132 | 
133 |     publishDir "${params.results}/results/rligerUINMF/h5ad_homology_concat", mode: 'copy'
134 | 
135 |     input: 
136 |     tuple val(basename), path(h5ad_species)
137 | 
138 |     output:
139 |     path "*.rds"
140 | 
141 |     script:
142 |     """
143 |     Rscript ${projectDir}/bin/convert_format.R \
144 |     -i ${h5ad_species} -o ${basename}.rds -t anndata_to_seurat \
145 |     --conda_path ${params.sceasy_conda}
146 | 
147 |     """
148 | 
149 | 
150 | }
151 | 
152 | 
153 | workflow {
154 | 
155 |     metadata_ch = Channel.fromPath(params.input_metadata)
156 |                                          .map { file -> tuple(file.baseName, file) }
157 |     //metadata_ch.view()
158 |     validate_adata_input(metadata_ch)
159 |     concat_by_homology(metadata_ch)
160 |     concat_by_homology_rliger_uinmf(metadata_ch)
161 | 
162 |     concat_h5ad_ch = concat_by_homology.out.concat_h5ad.flatten().filter( ~/.*h5ad$/ ).map { file -> tuple(file.baseName, file) }
163 |     convert_format_h5ad(concat_h5ad_ch)
164 |     concat_rliger_ch = concat_by_homology_rliger_uinmf.out.concat_h5ad_rliger.flatten().map { file -> tuple(file.baseName, file) }.view()
165 |     convert_format_rliger_uinmf(concat_rliger_ch)
166 | 
167 | }
168 | 
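The output names requested from `concat_by_homology` all derive from the metadata basename. A minimal Python sketch of that naming convention, assuming Nextflow's `file.baseName` behaves like `pathlib.Path.stem`; the helper name `expected_concat_outputs` is hypothetical and not part of the pipeline:

```python
from pathlib import Path


def expected_concat_outputs(metadata_path):
    """List the file names concat_by_homology is asked to write.

    Mirrors the --one2one_h5ad, --many_higher_expr_h5ad,
    --many_higher_homology_conf_h5ad and --homology_tbl arguments.
    """
    # Assumption: file.baseName == file name without its last extension
    basename = Path(metadata_path).stem
    suffixes = [
        "_one2one_only.h5ad",
        "_many_higher_expr.h5ad",
        "_many_higher_homology_conf.h5ad",
        "_homology_tbl.csv",
    ]
    return [basename + suffix for suffix in suffixes]


if __name__ == "__main__":
    for name in expected_concat_outputs("/some/path/data/heart_hs_mf_metadata_nf.tsv"):
        print(name)
```

Only the three `.h5ad` files go on to `convert_format_h5ad`; the homology table is published but not converted.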


--------------------------------------------------------------------------------
/config/example.config:
--------------------------------------------------------------------------------
  1 | // © EMBL-European Bioinformatics Institute, 2023
  2 | // Updated example Nextflow config file for the BENGAL pipeline
  3 | // Yuyao Song <ysong@ebi.ac.uk>
  4 | // Oct 2023
  5 | 
  6 | // sections marked with CHANGE_PER_RUN should be adjusted per execution
  7 | // sections marked with CHANGE_PER_SETUP should be adjusted for each new setup on the cluster, etc.
  8 | 
  9 | // CHANGE_PER_RUN: directories for project root, work and results 
 10 | 
 11 | projectDir='/some/path/NEXTFLOW/BENGAL' // directory containing the cloned repo
 12 | workDir='/some/path/work'
 13 | params.results='/some/path/results'
 14 | 
 15 | // CHANGE_PER_RUN: params specific to dataset, please change according to your task
 16 | // input variables
 17 | 
 18 | // if each dataset shows batch effects, set batch_key to the column storing the actual batch
 19 | // if the datasets do not appear batchy and cell types are balanced between batches, set batch_key to species_key
 20 | 
 21 | // which column stores batch to integrate
 22 | params.batch_key='species'
 23 | 
 24 | // which column stores species names
 25 | params.species_key='species'
 26 | 
 27 | // which column specifies cell types matched between species
 28 | params.cluster_key='cell_ontology_mapped'
 29 | 
 30 | // which column stores cell types to perform SCCAF assessment, usually the same as cluster_key
 31 | params.cluster_key_sccaf='cell_ontology_mapped'
 32 | 
 33 | // which column stores cell types to perform SCCAF annotation transfer, usually the same as cluster_key
 34 | params.projection_key_sccaf='cell_ontology_mapped'
 35 | 
 36 | // input metadata file that maps species names to the respective raw count .h5ad files
 37 | params.input_metadata='/some/path/data/heart_hs_mf_metadata_nf.tsv'
 38 | 
 39 | // set task name to metadata file basename
 40 | params.task_name='heart_hs_mf_metadata_nf'
 41 | 
 42 | // CHANGE_PER_SETUP: the sceasy, scvi and scib conda env that is prepared
 43 | params.sceasy_conda='/some/conda/path/anaconda3/envs/sceasy'
 44 | params.scvi_conda='/some/conda/path/anaconda3/envs/scvi-tools'
 45 | params.scib_conda='/some/conda/path/anaconda3/envs/scib'
 46 | 
 47 | 
 48 | // Only if running trajectory conservation score
 49 | //params.root_cell = 'Blastula'
 50 | 
 51 | // CHANGE_PER_SETUP: cluster execution
 52 | 
 53 | process.executor = 'lsf'
 54 | executor {
 55 |     name = 'lsf'
 56 |     queueSize = 2000
 57 | }
 58 | 
 59 | // enable use of conda envs
 60 | conda.enabled = true
 61 | 
 62 | // CHANGE_PER_SETUP: singularity container paths and cluster resource options
 63 | 
 64 | singularity {
 65 |     enabled = true
 66 | }
 67 | 
 68 | process {
 69 |     withLabel: validate {
 70 |         container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_py.sif"
 71 |     }
 72 |     withLabel: concat {
 73 |         container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_concat.sif"
 74 |     }
 75 |     withLabel: convert {
 76 |         conda = "${params.sceasy_conda}"
 77 |     }
 78 |     withLabel: scanpy_based {
 79 |         container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_py.sif"
 80 |     }
 81 |     withLabel: 'seurat_based|R_based' {
 82 |         container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_seurat.sif"
 83 |     }
 84 |     withLabel: scvi_based {
 85 |         conda = "${params.scvi_conda}"
 86 |     }
 87 |     withLabel: scIB_based {
 88 |         conda = "${params.scib_conda}"
 89 |     }
 90 |     withLabel: sccaf_based {
 91 |         container = "/some/path/Singularity/CONTAINERS/CROSS_SPECIES/bengal_sccaf.sif"
 92 |     }
 93 |     withLabel: regular_resource {
 94 |         cpus = 4
 95 |         queue = 'research'
 96 |         memory = '50GB'
 97 |         cache = 'lenient'
 98 |     }
 99 |     withLabel: regular_intg_resource {
100 |         cpus = 12
101 |         queue = 'research'
102 |         memory = '200GB'
103 |         cache = 'lenient'
104 |     }
105 |     withLabel: bigmem_intg_resource {
106 |         cpus = 12
107 |         queue = 'bigmem'
108 |         memory = '400GB'
109 |         cache = 'lenient'
110 |     }
111 |     withLabel: GPU_intg_resource {
112 |         queue = 'gpu'
113 |         clusterOptions = ' -gpu "num=2:j_exclusive=no" -P gpu -n 4 '
114 |         memory = '50GB'
115 |         cache = 'lenient'
116 |     }
117 | 
118 | }
119 | 
120 | 
121 | // params for pipeline, no need to change from here onwards
122 | 
123 | params.liger_metadata="${params.results}/results/rligerUINMF/h5ad_homology_concat/${params.task_name}_liger.tsv"
124 | params.homology_concat_h5ad="${params.results}/results/h5ad_homology_concat/*.h5ad"
125 | params.homology_concat_rds="${params.results}/results/h5ad_homology_concat/*.rds"
126 | params.integrated_h5ad="${params.results}/results/*/cross_species/integrated_h5ad/*.h5ad"
127 | params.integrated_rds="${params.results}/results/*/cross_species/integrated_h5ad/*.rds"
128 | 
129 | // Nextflow trace settings: run stats are written to the file below (same effect as adding -with-trace upon execution)
130 | 
131 | trace {
132 |     enabled = true
133 |     file = "${params.results}/nf_trace/${params.task_name}_trace.txt"
134 |     fields = 'task_id,hash,native_id,name,status,exit,submit,duration,realtime,%cpu,peak_rss,peak_vmem,rchar,wchar'
135 | }
136 | 
137 | 
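The config asks for `params.task_name` to equal the metadata file basename. `params.liger_metadata` is built from `task_name`, while the concat workflow names that file from the metadata `file.baseName`, so a mismatch would point the integration workflow at a file that was never written. A small Python sketch of that consistency check, assuming `baseName` corresponds to `pathlib.Path.stem`; both helper names are hypothetical and not part of BENGAL:

```python
from pathlib import Path


def task_name_matches(input_metadata, task_name):
    """True when task_name equals the metadata file name without its extension."""
    return Path(input_metadata).stem == task_name


def liger_metadata_path(results, task_name):
    """Rebuild params.liger_metadata the way the example config composes it."""
    return f"{results}/results/rligerUINMF/h5ad_homology_concat/{task_name}_liger.tsv"


if __name__ == "__main__":
    metadata = "/some/path/data/heart_hs_mf_metadata_nf.tsv"
    task = "heart_hs_mf_metadata_nf"
    if not task_name_matches(metadata, task):
        raise SystemExit("params.task_name does not match the metadata basename")
    print(liger_metadata_path("/some/path/results", task))
```

Running a check like this before launching the integration workflow is optional; it only guards against a hand-edited `task_name` drifting from the metadata file name.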


--------------------------------------------------------------------------------
/containers/README.md:
--------------------------------------------------------------------------------
 1 | ## Containers for BENGAL nextflow pipeline
 2 | 
 3 | Yuyao Song <ysong@ebi.ac.uk>
 4 | 
 5 | We provide a docker container for running SCCAF assessment. 
 6 | 
 7 | For cluster execution, we convert the docker container into a singularity container. 
 8 | 
 9 | Pull SCCAF containers as follows:
10 | 
11 | `singularity pull sccaf.sif docker://yysong123/intgpy:sccaf`
12 | 
13 | This container is built for linux/amd64 machines.
14 | 


--------------------------------------------------------------------------------
/cross_species_assessment_multiple_species_individual.nf:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env nextflow
  2 | 
  3 | 
  4 | nextflow.enable.dsl=2
  5 | 
  6 | 
  7 | // data requirements: batch_key, species_key, sample_key in adata.obs
  8 | // data requirements: mean_counts in adata.var from scanpy QC
  9 | // name the raw h5ad file using the same string as in homology table
 10 | // adata.var must have a column 'gene_name' holding gene names, which must be unique
 11 | // adata.obs and adata.var cannot contain NA and the column names cannot contain "-"
 12 | // adata.var.cluster_key factor is ordered to be the same across two datasets with the same cell type // todo
 13 | 
 14 | log.info """
 15 |          ===========================================================
 16 | 
 17 |          Cross-species integration and assessment - nextflow pipeline
 18 |          - Integrate scRNA-seq data from multiple species
 19 |          - SCCAF projection on integrated data to assess integration quality - this workflow
 20 |          - harmony, scanorama, scVI - python based
 21 |          - Seurat CCA, Seurat RPCA, fastMNN, LIGER, LIGER-UINMF - R based
 22 |          - SAMap
 23 |          Author: ysong@ebi.ac.uk
 24 |          Mar 2022
 25 | 
 26 |          ===========================================================
 27 | 
 28 |          """
 29 |          .stripIndent()
 30 | 
 31 | 
 32 | metadata_ch = channel.fromPath(params.input_metadata)
 33 | 
 34 | all_integrated_h5ad_mapped_ch = channel
 35 |                 .fromPath(params.integrated_h5ad)
 36 |                 .map { file -> tuple(file.baseName.split("_")[-2], file.baseName, file) }
 37 | 
 38 | all_orig_h5ad_mapped_ch = channel
 39 |                 .fromPath(params.integrated_h5ad)
 40 |                 .map { file -> (params.homology_concat_h5ad - "*.h5ad" + file.baseName - file.baseName.split("_")[-2] - '__integrated'  + ".h5ad") }
 41 |                 .unique()
 42 | 
 43 | all_orig_h5ad_with_base_mapped_ch = channel
 44 |                 .fromPath(params.homology_concat_h5ad)
 45 |                 .map { file -> tuple(params.task_name, file) }
 46 |                 .unique()
 47 | 
 48 | 
 49 | all_integrated_and_orig_h5ad_mapped_ch = channel
 50 |                 .fromPath(params.integrated_h5ad)
 51 |                 .map { file -> tuple(file.baseName.split("_")[-2], file.baseName, (params.homology_concat_h5ad - "*.h5ad" + file.baseName - file.baseName.split("_")[-2] - '__integrated'  + ".h5ad"), file) }
 52 | 
 53 | 
 54 | all_integrated_rds_mapped_ch = channel
 55 |                 .fromPath(params.integrated_rds)
 56 |                 .map { file -> tuple(file.baseName, file, file.getParent()) }
 57 | // method name, file basename, file
 58 | 
 59 | process copy_for_rliger {
 60 | 
 61 |     label 'regular_resource'
 62 | 
 63 |     publishDir "${params.results}/results/h5ad_homology_concat", mode: 'copy'
 64 | 
 65 |     input:
 66 |     tuple val(basename), path(unintegrated_h5ad)
 67 | 
 68 |     output:
 69 |     val true, emit: signal
 70 | 
 71 |     shell:
 72 |     '''
 73 |     if ! [ -n "\$(find !{params.results}/results/h5ad_homology_concat -type f -regex '.*liger.*')" ]
 74 |     then 
 75 |         rliger_file=$(echo !{unintegrated_h5ad}  | sed "s/!{basename}/!{basename}_liger/g") && echo ${rliger_file} && cp !{unintegrated_h5ad} ${rliger_file}
 76 |     fi
 77 |     '''
 78 | 
 79 | }
 80 | 
 81 | process convert_format_rds {
 82 | 
 83 |     label 'convert'
 84 |     label 'regular_resource'
 85 | 
 86 |     //publishDir "${out_dir}/", mode: 'copy'
 87 | 
 88 |     input: 
 89 |     tuple val(basename), path(integrated_rds), path(out_dir)
 90 | 
 91 |     //output:
 92 |     //path "*.h5ad"
 93 |     //directly write to results out_dir, not elegant, but works
 94 | 
 95 |     script:
 96 | 
 97 |     """
 98 |     Rscript ${projectDir}/bin/convert_format.R \
 99 |     -i ${integrated_rds} -o ${out_dir}/${basename}.h5ad -t seurat_to_anndata \
100 |     --conda_path ${params.sceasy_conda}
101 | 
102 |     """
103 | 
104 | 
105 | }
106 | 
107 | 
108 | 
109 | process sccaf_assessment {
110 | 
111 |     label 'regular_resource'
112 |     label 'sccaf_based'
113 | 
114 |     publishDir "${params.results}/results/per_species", mode: 'copy'
115 | 
116 |     input:
117 |     path(metadata)
118 | 
119 |     output:
120 |     path '*'
121 | 
122 |     script:
123 |     """
124 |     python ${projectDir}/bin/sccaf_assessment_metadata.py ${metadata} ${metadata}_SCCAF_AUC ${metadata}_SCCAF_accuracy_summary \
125 |     --use_embedding True --embedding_key X_pca \
126 |     --integration_method unintegrated \
127 |     --cluster_key ${params.cluster_key}
128 |     """
129 | 
130 | }
131 | 
132 | 
133 | process sccaf_projection {
134 | 
135 |     label 'regular_resource'
136 |     label 'sccaf_based'
137 | 
138 | 
139 |     publishDir "${params.results}/results/${method}/cross_species/SCCAF_projection/", mode: 'copy'
140 | 
141 | 
142 |     input:
143 |     tuple val(method), val(basename), path(cross_species_integrated_h5ad)
144 | 
145 |     output:
146 |     path '*'
147 | 
148 |     script:
149 |     """
150 |     python ${projectDir}/bin/sccaf_projection_multiple_species.py \
151 |     --species_key ${params.species_key} --cluster_key ${params.cluster_key_sccaf} --projection_key ${params.projection_key_sccaf} \
152 |     --integration_method ${method} ${cross_species_integrated_h5ad} \
153 |     ${basename}_SCCAF_projection_result.h5ad \
154 |     ${basename}_SCCAF_projection_figures.pdf \
155 |     ${basename}_SCCAF_accuracy_summary.csv
156 |     """
157 | }
158 | 
159 | 
160 | 
161 | process batch_metrics{
162 | 
163 |     label 'regular_intg_resource'
164 |     label 'scIB_based'
165 | 
166 |     publishDir "${params.results}/batch_metrics/cross_species", mode: 'copy'
167 | 
168 |     input:
169 |     
170 |     tuple val(ready), val(method), val(basename), path(unintegrated_h5ad), path(cross_species_integrated_h5ad)
171 | 
172 |     output:
173 |     path "*"
174 |     
175 |     script:
176 |     """
177 |     python ${projectDir}/bin/scIB_metrics_individual.py \
178 |     ${cross_species_integrated_h5ad} \
179 |     ${unintegrated_h5ad} \
180 |     ${basename}_scIB_metrics.csv ${basename}_cell_type_basw.csv \
181 |     ${basename}_orig_scIB_metrics.csv ${basename}_orig_cell_type_basw.csv \
182 |     ${basename}_scIB.h5ad \
183 |     --integration_method ${method} --batch_key ${params.batch_key} \
184 |     --species_key ${params.species_key} --cluster_key ${params.cluster_key} \
185 |     --num_cores 4 --conda_path ${params.scib_conda}
186 | 
187 |     """
188 | 
189 | 
190 | 
191 | }
192 | 
193 | 
194 | 
195 | process trajectory_metrics{
196 | 
197 |     label 'regular_resource'
198 |     label 'scIB_based'
199 | 
200 |     publishDir "${params.results}/batch_metrics/cross_species", mode: 'copy'
201 | 
202 |     input:
203 |     tuple val(method), val(basename), path(unintegrated_h5ad), path(cross_species_integrated_h5ad)
204 | 
205 |     output:
206 |     path "*"
207 | 
208 |     script:
209 |     """
210 |     python ${projectDir}/bin/scIB_trajectory.py \
211 |     ${cross_species_integrated_h5ad} ${unintegrated_h5ad} \
212 |     ${basename}_trajectory_metrics.csv \
213 |     ${basename}_trajectory_scIB.h5ad \
214 |     --integration_method ${method} --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} --root_cell ${params.root_cell}
215 | 
216 |     """
217 | 
218 | }
219 | 
220 | 
221 | workflow {
222 | 
223 |     //all_integrated_rds_mapped_ch.view()
224 | 
225 |     convert_format_rds(all_integrated_rds_mapped_ch)
226 | 
227 |     copy_for_rliger(all_orig_h5ad_with_base_mapped_ch)
228 |     sccaf_assessment(metadata_ch)
229 |     sccaf_projection(all_integrated_h5ad_mapped_ch)
230 |     ch_test = copy_for_rliger.out.signal.combine(all_integrated_and_orig_h5ad_mapped_ch).unique()
231 |     batch_metrics(ch_test)
232 |     //trajectory_metrics(all_integrated_and_orig_h5ad_mapped_ch)
233 | }
234 | 
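The channel definitions at the top of this workflow recover the integration method and the matching unintegrated file purely from the integrated file name. A Python sketch of that string logic, assuming Groovy's `String - String` removes the first occurrence of the right-hand string; the function name `parse_integrated_name` is hypothetical:

```python
def parse_integrated_name(base_name, concat_glob):
    """Derive (method, unintegrated h5ad path) from an integrated file basename.

    base_name is expected to look like '<input>_<method>_integrated'.
    concat_glob plays the role of params.homology_concat_h5ad.
    """
    # file.baseName.split("_")[-2]
    method = base_name.split("_")[-2]
    # params.homology_concat_h5ad - "*.h5ad" + file.baseName
    path = concat_glob.replace("*.h5ad", "", 1) + base_name
    # ... - method - '__integrated' + ".h5ad"
    path = path.replace(method, "", 1)
    path = path.replace("__integrated", "", 1)
    return method, path + ".h5ad"


if __name__ == "__main__":
    method, orig = parse_integrated_name(
        "heart_one2one_only_harmony_integrated",
        "/some/path/results/results/h5ad_homology_concat/*.h5ad",
    )
    print(method)
    print(orig)
```

Because only the first occurrence of the method name is removed, a results path that itself contains the method name (for example a directory called `harmony`) would likely be altered instead of the file name. The same holds for method names containing an underscore, since the method is taken as a single `_`-separated token.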


--------------------------------------------------------------------------------
/cross_species_integration_multiple_species.nf:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env nextflow
  2 | nextflow.enable.dsl=2
  3 | 
  4 | // © EMBL-European Bioinformatics Institute, 2023
  5 | 
  6 | // Updated upon containerization of BENGAL envs
  7 | // Yuyao Song <ysong@ebi.ac.uk>
  8 | // Oct 2023
  9 | 
 10 | 
 11 | log.info """
 12 |          ===========================================================
 13 | 
 14 |          Cross-species integration and assessment - nextflow pipeline
 15 |          - Integrate scRNA-seq data from multiple species - this workflow
 16 |          - SCCAF projection on integrated data to assess integration quality
 17 |          - harmony, scanorama, scVI - python based
 18 |          - Seurat CCA, Seurat RPCA, fastMNN, LIGER, LIGER-UINMF - R based
 19 |          Author: ysong@ebi.ac.uk
 20 |          Mar 2022
 21 | 
 22 |          ===========================================================
 23 | 
 24 |          """
 25 |          .stripIndent()
 26 | 
 27 | 
 28 | process harmony_integration {
 29 |     
 30 |     label 'scanpy_based'
 31 |     label 'regular_intg_resource'
 32 | 
 33 |     publishDir "${params.results}/results/harmony/cross_species/integrated_h5ad", mode: 'copy'
 34 |    
 35 |     input:
 36 |     tuple val(baseName), path(file)
 37 | 
 38 |     output:
 39 |     path "${baseName}_harmony_integrated.h5ad", emit: harmony_cross_species_h5ad_ch
 40 |     path "${baseName}_harmony_integrated_UMAP.png", emit: harmony_cross_species_umap_ch
 41 | 
 42 |     script:
 43 |     """
 44 |     python ${projectDir}/bin/harmony_integration.py \
 45 |     --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} \
 46 |     ${file} ${baseName}_harmony_integrated.h5ad ${baseName}_harmony_integrated_UMAP.png
 47 |     """
 48 | }
 49 | 
 50 | 
 51 | process scanorama_integration {
 52 | 
 53 |     label 'scanpy_based'
 54 |     label 'regular_intg_resource'
 55 | 
 56 |     publishDir "${params.results}/results/scanorama/cross_species/integrated_h5ad", mode: 'copy'
 57 |     
 58 |     input:
 59 |     tuple val(baseName), path(file)
 60 | 
 61 |     output:
 62 |     path "${baseName}_scanorama_integrated.h5ad", emit: scanorama_cross_species_h5ad_ch
 63 |     path "${baseName}_scanorama_integrated_UMAP.png", emit: scanorama_cross_species_umap_ch
 64 | 
 65 |     script:
 66 |     """
 67 |     python ${projectDir}/bin/scanorama_integration.py \
 68 |     --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} \
 69 |     ${file} ${baseName}_scanorama_integrated.h5ad ${baseName}_scanorama_integrated_UMAP.png
 70 |     """
 71 | }
 72 | 
 73 | 
 74 | // scVI uses GPU
 75 | process scvi_integration {
 76 | 
 77 |     label 'scvi_based'
 78 |     label 'GPU_intg_resource'
 79 | 
 80 |     publishDir "${params.results}/results/scVI/cross_species/integrated_h5ad", mode: 'copy'
 81 | 
 82 |     input:
 83 |     tuple val(baseName), path(file)
 84 | 
 85 |     output:
 86 |     path "${baseName}_scVI_integrated.h5ad", emit: scvi_cross_species_h5ad_ch
 87 |     path "${baseName}_scVI_integrated_UMAP.png", emit: scvi_cross_species_umap_ch
 88 | 
 89 |     script:
 90 |     """
 91 |     python ${projectDir}/bin/scvi_integration.py \
 92 |     --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} \
 93 |     ${file} ${baseName}_scVI_integrated.h5ad ${baseName}_scVI_integrated_UMAP.png
 94 |     """
 95 | }
 96 | // scANVI uses GPU
 97 | process scanvi_integration {
 98 | 
 99 | 
100 |     label 'scvi_based'
101 |     label 'GPU_intg_resource'
102 | 
103 |     publishDir "${params.results}/results/scANVI/cross_species/integrated_h5ad", mode: 'copy'
104 | 
105 |     input:
106 |     tuple val(baseName), path(file)
107 | 
108 |     output:
109 |     path "${baseName}_scANVI_integrated.h5ad", emit: scanvi_cross_species_h5ad_ch
110 |     path "${baseName}_scANVI_integrated_UMAP.png", emit: scanvi_cross_species_umap_ch
111 | 
112 |     script:
113 |     """
114 |     python ${projectDir}/bin/scANVI_integration.py \
115 |     --batch_key ${params.batch_key} --species_key ${params.species_key} --cluster_key ${params.cluster_key} \
116 |     ${file} ${baseName}_scANVI_integrated.h5ad ${baseName}_scANVI_integrated_UMAP.png ${baseName}_scANVI_integrated_mde.png
117 |     """
118 | }
119 | 
120 | 
121 | 
122 | process seurat_CCA_integration{
123 | 
124 | 
125 |     label 'seurat_based'
126 |     label 'bigmem_intg_resource'
127 | 
128 |     publishDir "${params.results}/results/seuratCCA/cross_species/integrated_h5ad/", mode: 'copy'
129 | 
130 |     input:
131 |     tuple val(baseName), path(file)
132 | 
133 |     output:
134 |     path "${baseName}_seuratCCA_integrated.rds", emit: seuratCCA_cross_species_rds_ch
135 |     path "${baseName}_seuratCCA_integrated_UMAP.pdf" , emit: seuratCCA_cross_species_umap_ch
136 | 
137 |     script:
138 |     """
139 |     Rscript ${projectDir}/bin/seurat_CCA_integration.R -i ${file} -o ${baseName}_seuratCCA_integrated.rds -p ${baseName}_seuratCCA_integrated_UMAP.pdf -b ${params.batch_key} -s ${params.species_key} -c ${params.cluster_key}
140 |     """
141 | }
142 | 
143 | 
144 | process seurat_RPCA_integration{
145 | 
146 |     label 'seurat_based'
147 |     label 'bigmem_intg_resource'
148 | 
149 |     publishDir "${params.results}/results/seuratRPCA/cross_species/integrated_h5ad", mode: 'copy'
150 |    
151 |     input:
152 |     tuple val(baseName), path(file)
153 | 
154 |     output:
155 |     path "${baseName}_seuratRPCA_integrated.rds" , emit: seuratRPCA_cross_species_rds_ch
156 |     path "${baseName}_seuratRPCA_integrated_UMAP.pdf" , emit: seuratRPCA_cross_species_umap_ch
157 | 
158 |     script:
159 |     """
160 |     Rscript ${projectDir}/bin/seurat_RPCA_integration.R -i ${file} -o ${baseName}_seuratRPCA_integrated.rds -p ${baseName}_seuratRPCA_integrated_UMAP.pdf -b ${params.batch_key} -s ${params.species_key} -c ${params.cluster_key}
161 |     """
162 | }
163 | 
164 | process fastMNN_integration{
165 | 
166 |     label 'R_based'
167 |     label 'regular_intg_resource'
168 | 
169 |     publishDir "${params.results}/results/fastMNN/cross_species/integrated_h5ad", mode: 'copy'
170 | 
171 |     input:
172 |     tuple val(baseName), path(file)
173 | 
174 |     output:
175 |     path "${baseName}_fastMNN_integrated.rds" , emit: fastMNN_cross_species_rds_ch
176 |     path "${baseName}_fastMNN_integrated_UMAP.pdf" , emit: fastMNN_cross_species_UMAP_ch
177 | 
178 |     script:
179 |     """
180 |     Rscript ${projectDir}/bin/fastMNN_integration.R -i ${file} -o ${baseName}_fastMNN_integrated.rds -p ${baseName}_fastMNN_integrated_UMAP.pdf -b ${params.batch_key} -s ${params.species_key} -c ${params.cluster_key}
181 |     """
182 | }
183 | 
184 | 
185 | process liger_integration{
186 | 
187 |     label 'R_based'
188 |     label 'regular_intg_resource'
189 | 
190 |     publishDir "${params.results}/results/LIGER/cross_species/integrated_h5ad", mode: 'copy'
191 | 
192 |     input:
193 |     tuple val(baseName), path(file)
194 | 
195 |     output:
196 |     path "${baseName}_LIGER_integrated.rds", emit: liger_cross_species_rds_ch
197 |     path "${baseName}_LIGER_integrated_UMAP.pdf", emit: liger_cross_species_UMAP_ch
198 | 
199 |     script:
200 |     """
201 |     Rscript ${projectDir}/bin/LIGER_integration.R -i ${file} -o ${baseName}_LIGER_integrated.rds -p ${baseName}_LIGER_integrated_UMAP.pdf -b ${params.batch_key} -s ${params.species_key} -c ${params.cluster_key}
202 |     """
203 | }
204 | 
205 | 
206 | process rligerUINMF_integration{
207 | 
208 |     label 'R_based'
209 |     label 'regular_intg_resource'
210 | 
211 |     publishDir "${params.results}/results/rligerUINMF/cross_species/integrated_h5ad", mode: 'copy'
212 | 
213 |     input:
214 |     tuple val(baseName), path(metadata)
215 | 
216 |     output:
217 |     path "*_rligerUINMF_integrated.rds"
218 |     path "*_rligerUINMF_integrated_UMAP.pdf"
219 | 
220 |     script:
221 |     """
222 |     mkdir -p ${params.results}/results/rligerUINMF/cross_species/integrated_h5ad && \
223 |     Rscript ${projectDir}/bin/rliger_integration_UINMF_multiple_species.R \
224 |     --basename ${baseName} \
225 |     --metadata ${params.liger_metadata} \
226 |     --out_dir .  \
227 |     --cluster_key ${params.cluster_key}
228 |     """
229 | }
230 | 
231 | // Nextflow looks for declared output files in the task's cwd,
232 | // so out_dir in the script is set to the cwd; Nextflow then copies the results to the publishDir
233 | 
234 | workflow {
235 | 
236 |     all_homology_h5ad_mapped_ch = Channel.fromPath(params.homology_concat_h5ad)
237 |                                          .map { file -> tuple(file.baseName, file) }
238 | 
239 |     all_homology_rds_mapped_ch = Channel.fromPath(params.homology_concat_rds)
240 |                                          .map { file -> tuple(file.baseName, file) }
241 | 
242 |     concatenated_h5ad = all_homology_h5ad_mapped_ch
243 |     concatenated_h5ad.view()
244 | 
245 |     harmony_integration(concatenated_h5ad)
246 |     scanorama_integration(concatenated_h5ad)
247 |     scvi_integration(concatenated_h5ad)
248 |     scanvi_integration(concatenated_h5ad)
249 | 
250 |     concatenated_rds = all_homology_rds_mapped_ch
251 |     seurat_CCA_integration(concatenated_rds)
252 |     seurat_RPCA_integration(concatenated_rds)
253 |     fastMNN_integration(concatenated_rds)
254 |     liger_integration(concatenated_rds)
255 | 
256 |     liger_metadata = Channel.fromPath(params.liger_metadata)
257 |                                          .map { file -> tuple(file.baseName, file) }
258 |     liger_metadata.view()
259 |     rligerUINMF_integration(liger_metadata)
260 | 
261 | }
262 | 


--------------------------------------------------------------------------------
/dockerfiles/ALCS.Dockerfile:
--------------------------------------------------------------------------------
 1 | # syntax=docker/dockerfile:1
 2 | 
 3 | 
 4 | FROM python:3.7.12-slim-buster
 5 | 
 6 | 
 7 | RUN apt-get update && apt-get -y install gcc && apt-get -y install g++ \
 8 | && apt-get -y install wget && apt-get -y install autoconf && apt-get -y install automake && apt-get -y install libxml2-dev && \
 9 | wget https://github.com/Kitware/CMake/releases/download/v3.22.3/cmake-3.22.3-linux-x86_64.sh && \
10 | cp cmake-3.22.3-linux-x86_64.sh /opt/ && chmod +x /opt/cmake-3.22.3-linux-x86_64.sh && \
11 | cd /opt/ && yes y | bash /opt/cmake-3.22.3-linux-x86_64.sh && ln -s /opt/cmake-3.22.3-linux-x86_64/bin/* /usr/local/bin && \
12 | apt-get install -y git
13 | 
14 | RUN pip3 install click matplotlib pandas anndata scikit-learn
15 | 
16 | RUN git clone https://github.com/YY-SONG0718/sccaf && cd sccaf && pip3 install .
17 | 
18 | RUN pip3 install scanpy==1.9.1
19 | 
20 | ENV PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
21 | 


--------------------------------------------------------------------------------
/dockerfiles/scanpy.Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM --platform=linux/amd64 gcfntnu/scanpy:1.9.2
 2 | 
 3 | LABEL maintainer="Yuyao Song"
 4 | 
 5 | CMD ["echo", "Container for harmony and scanorama integration in BENGAL"]
 6 | 
 7 | RUN pip3 install harmonypy scanorama click
 8 | 
 9 | RUN pip3 install pydantic
10 | 
11 | ENTRYPOINT ["python"]
12 | 
13 | 


--------------------------------------------------------------------------------
/dockerfiles/seurat.Dockerfile:
--------------------------------------------------------------------------------
 1 | # Dockerfile for Seurat 4.3.0
 2 | FROM rocker/r-ver:4.2.0
 3 | 
 4 | # Set global R options
 5 | RUN echo "options(repos = 'https://cloud.r-project.org')" > $(R --no-echo --no-save -e "cat(Sys.getenv('R_HOME'))")/etc/Rprofile.site
 6 | ENV RETICULATE_MINICONDA_ENABLED=FALSE
 7 | 
 8 | # Install Seurat's system dependencies
 9 | RUN apt-get update
10 | RUN apt-get install -y \
11 |     libhdf5-dev \
12 |     libcurl4-openssl-dev \
13 |     libssl-dev \
14 |     libpng-dev \
15 |     libboost-all-dev \
16 |     libxml2-dev \
17 |     openjdk-8-jdk \
18 |     python3-dev \
19 |     python3-pip \
20 |     wget \
21 |     git \
22 |     libfftw3-dev \
23 |     libgsl-dev \
24 |     pkg-config
25 | 
26 | RUN apt-get install -y llvm-10
27 | 
28 | # Install system library for rgeos
29 | RUN apt-get install -y libgeos-dev
30 | 
31 | # Install UMAP
32 | RUN LLVM_CONFIG=/usr/lib/llvm-10/bin/llvm-config pip3 install llvmlite
33 | RUN pip3 install numpy
34 | RUN pip3 install umap-learn
35 | 
36 | # Install FIt-SNE
37 | RUN git clone --branch v1.2.1 https://github.com/KlugerLab/FIt-SNE.git
38 | RUN g++ -std=c++11 -O3 FIt-SNE/src/sptree.cpp FIt-SNE/src/tsne.cpp FIt-SNE/src/nbodyfft.cpp  -o bin/fast_tsne -pthread -lfftw3 -lm
39 | 
40 | # Install Bioconductor dependencies & suggests
41 | RUN R --no-echo --no-restore --no-save -e "install.packages('BiocManager')"
42 | RUN R --no-echo --no-restore --no-save -e "BiocManager::install(c('multtest', 'S4Vectors', 'SummarizedExperiment', 'SingleCellExperiment', 'MAST', 'DESeq2', 'BiocGenerics', 'GenomicRanges', 'IRanges', 'rtracklayer', 'monocle', 'Biobase', 'limma', 'glmGamPoi'))"
43 | 
44 | # Install CRAN suggests
45 | RUN R --no-echo --no-restore --no-save -e "install.packages(c('VGAM', 'R.utils', 'metap', 'Rfast2', 'ape', 'enrichR', 'mixtools'))"
46 | 
47 | # Install spatstat
48 | RUN R --no-echo --no-restore --no-save -e "install.packages(c('spatstat.explore', 'spatstat.geom'))"
49 | 
50 | # Install hdf5r
51 | RUN R --no-echo --no-restore --no-save -e "install.packages('hdf5r')"
52 | 
53 | # Install latest Matrix
54 | RUN R --no-echo --no-restore --no-save -e "install.packages('Matrix')"
55 | 
56 | # Install rgeos
57 | RUN R --no-echo --no-restore --no-save -e "install.packages('rgeos')"
58 | 
59 | # Install Seurat
60 | RUN R --no-echo --no-restore --no-save -e "install.packages('remotes')"
61 | RUN R --no-echo --no-restore --no-save -e "install.packages('Seurat')"
62 | 
63 | # Install SeuratDisk
64 | RUN R --no-echo --no-restore --no-save -e "remotes::install_github('mojaveazure/seurat-disk')"
65 | 
66 | # BENGAL specific install
67 | 
68 | RUN R --no-echo --no-restore --no-save -e "install.packages('optparse')"
69 | 
70 | RUN R --no-echo --no-restore --no-save -e "BiocManager::install('biomaRt')"
71 | 
72 | 
73 | 
74 | 
75 | 


--------------------------------------------------------------------------------
/envs/sceasy.yml:
--------------------------------------------------------------------------------
  1 | name: sceasy
  2 | channels:
  3 |   - bioconda
  4 |   - conda-forge
  5 |   - r
  6 |   - defaults
  7 | dependencies:
  8 |   - _libgcc_mutex=0.1
  9 |   - _openmp_mutex=4.5
 10 |   - _r-mutex=1.0.1
 11 |   - alsa-lib=1.2.10
 12 |   - aom=3.6.1
 13 |   - argcomplete=3.1.2
 14 |   - attr=2.5.1
 15 |   - binutils_impl_linux-64=2.40
 16 |   - bioconductor-batchelor=1.16.0
 17 |   - bioconductor-beachmat=2.16.0
 18 |   - bioconductor-biobase=2.60.0
 19 |   - bioconductor-biocgenerics=0.46.0
 20 |   - bioconductor-biocio=1.10.0
 21 |   - bioconductor-biocneighbors=1.18.0
 22 |   - bioconductor-biocparallel=1.34.2
 23 |   - bioconductor-biocsingular=1.16.0
 24 |   - bioconductor-data-packages=20230718
 25 |   - bioconductor-delayedarray=0.26.6
 26 |   - bioconductor-delayedmatrixstats=1.22.1
 27 |   - bioconductor-genomeinfodb=1.36.1
 28 |   - bioconductor-genomeinfodbdata=1.2.10
 29 |   - bioconductor-genomicranges=1.52.0
 30 |   - bioconductor-hdf5array=1.28.1
 31 |   - bioconductor-iranges=2.34.1
 32 |   - bioconductor-limma=3.56.2
 33 |   - bioconductor-loomexperiment=1.18.0
 34 |   - bioconductor-matrixgenerics=1.12.2
 35 |   - bioconductor-residualmatrix=1.10.0
 36 |   - bioconductor-rhdf5=2.44.0
 37 |   - bioconductor-rhdf5filters=1.12.1
 38 |   - bioconductor-rhdf5lib=1.22.0
 39 |   - bioconductor-s4arrays=1.0.4
 40 |   - bioconductor-s4vectors=0.38.1
 41 |   - bioconductor-scaledmatrix=1.8.1
 42 |   - bioconductor-scuttle=1.10.1
 43 |   - bioconductor-singlecellexperiment=1.22.0
 44 |   - bioconductor-sparsematrixstats=1.12.2
 45 |   - bioconductor-summarizedexperiment=1.30.2
 46 |   - bioconductor-xvector=0.40.0
 47 |   - bioconductor-zlibbioc=1.46.0
 48 |   - blosc=1.21.5
 49 |   - bwidget=1.9.14
 50 |   - bzip2=1.0.8
 51 |   - c-ares=1.19.1
 52 |   - ca-certificates=2023.7.22
 53 |   - cached-property=1.5.2
 54 |   - cached_property=1.5.2
 55 |   - cairo=1.16.0
 56 |   - cfitsio=4.3.0
 57 |   - curl=8.3.0
 58 |   - dav1d=1.2.1
 59 |   - dbus=1.13.6
 60 |   - expat=2.5.0
 61 |   - ffmpeg=6.0.0
 62 |   - fftw=3.3.10
 63 |   - font-ttf-dejavu-sans-mono=2.37
 64 |   - font-ttf-inconsolata=3.000
 65 |   - font-ttf-source-code-pro=2.038
 66 |   - font-ttf-ubuntu=0.83
 67 |   - fontconfig=2.14.2
 68 |   - fonts-conda-ecosystem=1
 69 |   - fonts-conda-forge=1
 70 |   - freeglut=3.2.2
 71 |   - freetype=2.12.1
 72 |   - freexl=2.0.0
 73 |   - fribidi=1.0.10
 74 |   - gcc_impl_linux-64=13.2.0
 75 |   - geos=3.12.0
 76 |   - geotiff=1.7.1
 77 |   - gettext=0.21.1
 78 |   - gfortran_impl_linux-64=13.2.0
 79 |   - giflib=5.2.1
 80 |   - glib=2.78.0
 81 |   - glib-tools=2.78.0
 82 |   - glpk=5.0
 83 |   - gmp=6.2.1
 84 |   - gnutls=3.7.8
 85 |   - graphite2=1.3.13
 86 |   - gsl=2.7
 87 |   - gst-plugins-base=1.22.6
 88 |   - gstreamer=1.22.6
 89 |   - gxx_impl_linux-64=13.2.0
 90 |   - h5py=3.9.0
 91 |   - harfbuzz=8.2.1
 92 |   - hdf4=4.2.15
 93 |   - hdf5=1.14.2
 94 |   - icu=73.2
 95 |   - ihm=0.41
 96 |   - imp=2.19.0
 97 |   - jasper=4.0.0
 98 |   - jq=1.5
 99 |   - json-c=0.17
100 |   - kealib=1.5.2
101 |   - kernel-headers_linux-64=2.6.32
102 |   - keyutils=1.6.1
103 |   - krb5=1.21.2
104 |   - lame=3.100
105 |   - lcms2=2.15
106 |   - ld_impl_linux-64=2.40
107 |   - lerc=4.0.0
108 |   - libabseil=20230802.1
109 |   - libaec=1.1.2
110 |   - libarchive=3.7.2
111 |   - libass=0.17.1
112 |   - libblas=3.9.0
113 |   - libboost=1.82.0
114 |   - libboost-headers=1.82.0
115 |   - libcap=2.69
116 |   - libcblas=3.9.0
117 |   - libclang=15.0.7
118 |   - libclang13=15.0.7
119 |   - libcrc32c=1.1.2
120 |   - libcups=2.3.3
121 |   - libcurl=8.3.0
122 |   - libdeflate=1.19
123 |   - libdrm=2.4.114
124 |   - libedit=3.1.20191231
125 |   - libev=4.33
126 |   - libevent=2.1.12
127 |   - libexpat=2.5.0
128 |   - libffi=3.4.2
129 |   - libflac=1.4.3
130 |   - libgcc=7.2.0
131 |   - libgcc-devel_linux-64=13.2.0
132 |   - libgcc-ng=13.2.0
133 |   - libgcrypt=1.10.1
134 |   - libgdal=3.7.2
135 |   - libgfortran-ng=13.2.0
136 |   - libgfortran5=13.2.0
137 |   - libglib=2.78.0
138 |   - libglu=9.0.0
139 |   - libgomp=13.2.0
140 |   - libgoogle-cloud=2.12.0
141 |   - libgpg-error=1.47
142 |   - libgrpc=1.57.0
143 |   - libhwloc=2.9.3
144 |   - libiconv=1.17
145 |   - libidn2=2.3.4
146 |   - libjpeg-turbo=2.1.5.1
147 |   - libkml=1.3.0
148 |   - liblapack=3.9.0
149 |   - liblapacke=3.9.0
150 |   - libllvm15=15.0.7
151 |   - libnetcdf=4.9.2
152 |   - libnghttp2=1.52.0
153 |   - libnsl=2.0.0
154 |   - libogg=1.3.4
155 |   - libopenblas=0.3.24
156 |   - libopencv=4.8.1
157 |   - libopenvino=2023.1.0
158 |   - libopenvino-auto-batch-plugin=2023.1.0
159 |   - libopenvino-auto-plugin=2023.1.0
160 |   - libopenvino-hetero-plugin=2023.1.0
161 |   - libopenvino-intel-cpu-plugin=2023.1.0
162 |   - libopenvino-intel-gpu-plugin=2023.1.0
163 |   - libopenvino-ir-frontend=2023.1.0
164 |   - libopenvino-onnx-frontend=2023.1.0
165 |   - libopenvino-paddle-frontend=2023.1.0
166 |   - libopenvino-pytorch-frontend=2023.1.0
167 |   - libopenvino-tensorflow-frontend=2023.1.0
168 |   - libopenvino-tensorflow-lite-frontend=2023.1.0
169 |   - libopus=1.3.1
170 |   - libpciaccess=0.17
171 |   - libpng=1.6.39
172 |   - libpq=15.4
173 |   - libprotobuf=4.23.4
174 |   - librttopo=1.1.0
175 |   - libsanitizer=13.2.0
176 |   - libsndfile=1.2.2
177 |   - libspatialite=5.1.0
178 |   - libsqlite=3.43.0
179 |   - libssh2=1.11.0
180 |   - libstdcxx-devel_linux-64=13.2.0
181 |   - libstdcxx-ng=13.2.0
182 |   - libsystemd0=254
183 |   - libtasn1=4.19.0
184 |   - libtiff=4.6.0
185 |   - libudunits2=2.2.28
186 |   - libunistring=0.9.10
187 |   - libuuid=2.38.1
188 |   - libva=2.20.0
189 |   - libvorbis=1.3.7
190 |   - libvpx=1.13.1
191 |   - libwebp-base=1.3.2
192 |   - libxcb=1.15
193 |   - libxkbcommon=1.5.0
194 |   - libxml2=2.11.5
195 |   - libzip=1.10.1
196 |   - libzlib=1.2.13
197 |   - lz4-c=1.9.4
198 |   - lzo=2.10
199 |   - make=4.3
200 |   - minizip=4.0.1
201 |   - mpfr=4.2.0
202 |   - mpg123=1.32.3
203 |   - mpi=1.0
204 |   - mpich=4.1.2
205 |   - msgpack-python=1.0.6
206 |   - mysql-common=8.0.33
207 |   - mysql-libs=8.0.33
208 |   - natsort=8.4.0
209 |   - ncurses=6.4
210 |   - nettle=3.8.1
211 |   - nspr=4.35
212 |   - nss=3.94
213 |   - numpy=1.26.0
214 |   - ocl-icd=2.3.1
215 |   - ocl-icd-system=1.0.0
216 |   - openh264=2.3.1
217 |   - openjpeg=2.5.0
218 |   - openssl=3.1.3
219 |   - p11-kit=0.24.1
220 |   - packaging=23.2
221 |   - pandas=2.1.1
222 |   - pandoc=3.1.3
223 |   - pango=1.50.14
224 |   - pcre2=10.40
225 |   - pip=23.2.1
226 |   - pixman=0.42.2
227 |   - poppler=23.08.0
228 |   - poppler-data=0.4.12
229 |   - postgresql=15.4
230 |   - proj=9.3.0
231 |   - protobuf=4.23.4
232 |   - pthread-stubs=0.4
233 |   - pugixml=1.13
234 |   - pulseaudio-client=16.1
235 |   - python=3.9.18
236 |   - python-dateutil=2.8.2
237 |   - python-tzdata=2023.3
238 |   - python_abi=3.9
239 |   - pytz=2023.3.post1
240 |   - pyyaml=6.0.1
241 |   - qt-main=5.15.8
242 |   - r-abind=1.4_5
243 |   - r-anndata=0.7.5.4
244 |   - r-askpass=1.2.0
245 |   - r-assertthat=0.2.1
246 |   - r-base=4.3.1
247 |   - r-base64enc=0.1_3
248 |   - r-bh=1.81.0_1
249 |   - r-biglm=0.9_2.1
250 |   - r-bitops=1.0_7
251 |   - r-boot=1.3_28.1
252 |   - r-brio=1.1.3
253 |   - r-bslib=0.5.1
254 |   - r-cachem=1.0.8
255 |   - r-callr=3.7.3
256 |   - r-catools=1.18.2
257 |   - r-class=7.3_22
258 |   - r-classint=0.4_10
259 |   - r-cli=3.6.1
260 |   - r-cluster=2.1.4
261 |   - r-codetools=0.2_19
262 |   - r-colorspace=2.1_0
263 |   - r-commonmark=1.9.0
264 |   - r-cowplot=1.1.1
265 |   - r-cpp11=0.4.6
266 |   - r-crayon=1.5.2
267 |   - r-crosstalk=1.2.0
268 |   - r-curl=5.1.0
269 |   - r-data.table=1.14.8
270 |   - r-dbi=1.1.3
271 |   - r-deldir=1.0_9
272 |   - r-desc=1.4.2
273 |   - r-diffobj=0.3.5
274 |   - r-digest=0.6.33
275 |   - r-dplyr=1.1.3
276 |   - r-dqrng=0.3.1
277 |   - r-e1071=1.7_13
278 |   - r-ellipsis=0.3.2
279 |   - r-evaluate=0.22
280 |   - r-fansi=1.0.4
281 |   - r-farver=2.1.1
282 |   - r-fastmap=1.1.1
283 |   - r-fitdistrplus=1.1_11
284 |   - r-fnn=1.1.3.2
285 |   - r-fontawesome=0.5.2
286 |   - r-formatr=1.14
287 |   - r-fs=1.6.3
288 |   - r-furrr=0.3.1
289 |   - r-futile.logger=1.4.3
290 |   - r-futile.options=1.0.1
291 |   - r-future=1.33.0
292 |   - r-future.apply=1.11.0
293 |   - r-generics=0.1.3
294 |   - r-getopt=1.20.4
295 |   - r-ggplot2=3.4.3
296 |   - r-ggrepel=0.9.3
297 |   - r-ggridges=0.5.4
298 |   - r-globals=0.16.2
299 |   - r-glue=1.6.2
300 |   - r-goftest=1.2_3
301 |   - r-gplots=3.1.3
302 |   - r-gridextra=2.3
303 |   - r-grr=0.9.5
304 |   - r-gtable=0.3.4
305 |   - r-gtools=3.9.4
306 |   - r-here=1.0.1
307 |   - r-hexbin=1.28.3
308 |   - r-highr=0.10
309 |   - r-htmltools=0.5.6
310 |   - r-htmlwidgets=1.6.2
311 |   - r-httpuv=1.6.11
312 |   - r-httr=1.4.7
313 |   - r-hunspell=3.0.3
314 |   - r-ica=1.0_3
315 |   - r-igraph=1.5.1
316 |   - r-irlba=2.3.5.1
317 |   - r-isoband=0.2.7
318 |   - r-jquerylib=0.1.4
319 |   - r-jsonlite=1.8.7
320 |   - r-kernsmooth=2.23_22
321 |   - r-knitr=1.44
322 |   - r-labeling=0.4.3
323 |   - r-lambda.r=1.2.4
324 |   - r-later=1.3.1
325 |   - r-lattice=0.21_9
326 |   - r-lazyeval=0.2.2
327 |   - r-leiden=0.4.3
328 |   - r-leidenbase=0.1.18
329 |   - r-lifecycle=1.0.3
330 |   - r-listenv=0.9.0
331 |   - r-lmtest=0.9_40
332 |   - r-lobstr=1.1.2
333 |   - r-lsei=1.3_0
334 |   - r-magrittr=2.0.3
335 |   - r-mass=7.3_60
336 |   - r-matrix=1.6_1.1
337 |   - r-matrix.utils=0.9.8
338 |   - r-matrixstats=1.0.0
339 |   - r-memoise=2.0.1
340 |   - r-mgcv=1.9_0
341 |   - r-mime=0.12
342 |   - r-miniui=0.1.1.1
343 |   - r-monocle3=1.0.0
344 |   - r-munsell=0.5.0
345 |   - r-nlme=3.1_163
346 |   - r-npsurv=0.5_0
347 |   - r-openssl=2.1.1
348 |   - r-optparse=1.7.3
349 |   - r-parallelly=1.36.0
350 |   - r-patchwork=1.1.3
351 |   - r-pbapply=1.7_2
352 |   - r-pbmcapply=1.5.1
353 |   - r-pheatmap=1.0.12
354 |   - r-pillar=1.9.0
355 |   - r-pkgbuild=1.4.2
356 |   - r-pkgconfig=2.0.3
357 |   - r-pkgload=1.3.3
358 |   - r-plotly=4.10.2
359 |   - r-plyr=1.8.9
360 |   - r-png=0.1_8
361 |   - r-polyclip=1.10_6
362 |   - r-praise=1.0.0
363 |   - r-prettyunits=1.2.0
364 |   - r-processx=3.8.2
365 |   - r-progressr=0.14.0
366 |   - r-promises=1.2.1
367 |   - r-proxy=0.4_27
368 |   - r-pryr=0.1.6
369 |   - r-ps=1.7.5
370 |   - r-pscl=1.5.5.1
371 |   - r-purrr=1.0.2
372 |   - r-r6=2.5.1
373 |   - r-rann=2.6.1
374 |   - r-rappdirs=0.3.3
375 |   - r-raster=3.6_23
376 |   - r-rcolorbrewer=1.1_3
377 |   - r-rcpp=1.0.11
378 |   - r-rcppannoy=0.0.21
379 |   - r-rcpparmadillo=0.12.6.4.0
380 |   - r-rcppeigen=0.3.3.9.3
381 |   - r-rcpphnsw=0.5.0
382 |   - r-rcppparallel=5.1.6
383 |   - r-rcppprogress=0.4.2
384 |   - r-rcpptoml=0.2.2
385 |   - r-rcurl=1.98_1.12
386 |   - r-rematch2=2.1.2
387 |   - r-reshape2=1.4.4
388 |   - r-reticulate=1.32.0
389 |   - r-rhpcblasctl=0.23_42
390 |   - r-rlang=1.1.1
391 |   - r-rmarkdown=2.25
392 |   - r-rocr=1.0_11
393 |   - r-rprojroot=2.0.3
394 |   - r-rsample=1.2.0
395 |   - r-rsvd=1.0.5
396 |   - r-rtsne=0.16
397 |   - r-s2=1.1.4
398 |   - r-sass=0.4.7
399 |   - r-scales=1.2.1
400 |   - r-scattermore=1.2
401 |   - r-sceasy=0.0.7
402 |   - r-sctransform=0.4.0
403 |   - r-seurat=4.4.0
404 |   - r-seuratobject=4.1.4
405 |   - r-sf=1.0_14
406 |   - r-shiny=1.7.5
407 |   - r-sitmo=2.0.2
408 |   - r-slam=0.1_50
409 |   - r-slider=0.3.0
410 |   - r-snow=0.4_4
411 |   - r-sourcetools=0.1.7_1
412 |   - r-sp=2.1_0
413 |   - r-spatstat.data=3.0_1
414 |   - r-spatstat.explore=3.2_3
415 |   - r-spatstat.geom=3.2_5
416 |   - r-spatstat.random=3.1_6
417 |   - r-spatstat.sparse=3.0_2
418 |   - r-spatstat.utils=3.0_3
419 |   - r-spdata=2.3.0
420 |   - r-spdep=1.2_8
421 |   - r-speedglm=0.3_5
422 |   - r-spelling=2.2.1
423 |   - r-stringi=1.7.12
424 |   - r-stringr=1.5.0
425 |   - r-survival=3.5_7
426 |   - r-sys=3.4.2
427 |   - r-tensor=1.5
428 |   - r-terra=1.7_46
429 |   - r-testthat=3.2.0
430 |   - r-tibble=3.2.1
431 |   - r-tidyr=1.3.0
432 |   - r-tidyselect=1.2.0
433 |   - r-tinytex=0.47
434 |   - r-units=0.8_4
435 |   - r-utf8=1.2.3
436 |   - r-uwot=0.1.16
437 |   - r-vctrs=0.6.3
438 |   - r-viridis=0.6.4
439 |   - r-viridislite=0.4.2
440 |   - r-waldo=0.5.1
441 |   - r-warp=0.2.0
442 |   - r-withr=2.5.1
443 |   - r-wk=0.8.0
444 |   - r-xfun=0.40
445 |   - r-xml2=1.3.5
446 |   - r-xtable=1.8_4
447 |   - r-yaml=2.3.7
448 |   - r-zoo=1.8_12
449 |   - re2=2023.03.02
450 |   - readline=8.2
451 |   - rmf=1.5.1
452 |   - scipy=1.11.3
453 |   - sed=4.8
454 |   - setuptools=68.2.2
455 |   - six=1.16.0
456 |   - snappy=1.1.10
457 |   - sqlite=3.43.0
458 |   - svt-av1=1.7.0
459 |   - sysroot_linux-64=2.12
460 |   - tbb=2021.10.0
461 |   - tiledb=2.16.3
462 |   - tk=8.6.13
463 |   - tktable=2.10
464 |   - toml=0.10.2
465 |   - tomlkit=0.12.1
466 |   - tzcode=2023c
467 |   - tzdata=2023c
468 |   - udunits2=2.2.28
469 |   - uriparser=0.9.7
470 |   - wheel=0.41.2
471 |   - x264=1!164.3095
472 |   - x265=3.5
473 |   - xcb-util=0.4.0
474 |   - xcb-util-image=0.4.0
475 |   - xcb-util-keysyms=0.4.0
476 |   - xcb-util-renderutil=0.3.9
477 |   - xcb-util-wm=0.4.1
478 |   - xerces-c=3.2.4
479 |   - xkeyboard-config=2.40
480 |   - xmltodict=0.13.0
481 |   - xorg-fixesproto=5.0
482 |   - xorg-inputproto=2.3.2
483 |   - xorg-kbproto=1.0.7
484 |   - xorg-libice=1.1.1
485 |   - xorg-libsm=1.2.4
486 |   - xorg-libx11=1.8.6
487 |   - xorg-libxau=1.0.11
488 |   - xorg-libxdmcp=1.1.3
489 |   - xorg-libxext=1.3.4
490 |   - xorg-libxfixes=5.0.3
491 |   - xorg-libxi=1.7.10
492 |   - xorg-libxrender=0.9.11
493 |   - xorg-libxt=1.3.0
494 |   - xorg-renderproto=0.11.1
495 |   - xorg-xextproto=7.3.0
496 |   - xorg-xf86vidmodeproto=2.3.1
497 |   - xorg-xproto=7.0.31
498 |   - xz=5.2.6
499 |   - yaml=0.2.5
500 |   - yq=3.2.3
501 |   - zlib=1.2.13
502 |   - zstd=1.5.5
503 |   - pip:
504 |       - anndata==0.10.1
505 |       - array-api-compat==1.4
506 |       - exceptiongroup==1.1.3
507 | 


--------------------------------------------------------------------------------
/envs/scib.yml:
--------------------------------------------------------------------------------
  1 | name: scib-pipeline-R4.0
  2 | channels:
  3 |   - bioconda
  4 |   - conda-forge
  5 |   - r
  6 |   - defaults
  7 | dependencies:
  8 |   - _libgcc_mutex=0.1
  9 |   - _openmp_mutex=4.5
 10 |   - _r-mutex=1.0.1
 11 |   - absl-py=2.0.0
 12 |   - aioeasywebdav=2.4.0
 13 |   - aiohttp=3.8.6
 14 |   - aiosignal=1.3.1
 15 |   - amply=0.1.6
 16 |   - anndata=0.8.0
 17 |   - anndata2ri=1.3.1
 18 |   - anyio=3.7.1
 19 |   - appdirs=1.4.4
 20 |   - argon2-cffi=23.1.0
 21 |   - argon2-cffi-bindings=21.2.0
 22 |   - asttokens=2.4.1
 23 |   - async-timeout=4.0.3
 24 |   - atk-1.0=2.38.0
 25 |   - attmap=0.13.2
 26 |   - attrs=23.1.0
 27 |   - backcall=0.2.0
 28 |   - backports=1.0
 29 |   - backports.functools_lru_cache=1.6.5
 30 |   - backports.zoneinfo=0.2.1
 31 |   - bbknn=1.5.1
 32 |   - bcrypt=4.0.1
 33 |   - beautifulsoup4=4.12.2
 34 |   - binutils_impl_linux-64=2.40
 35 |   - bioconductor-beachmat=2.6.4
 36 |   - bioconductor-biobase=2.50.0
 37 |   - bioconductor-biocgenerics=0.36.0
 38 |   - bioconductor-biocneighbors=1.8.2
 39 |   - bioconductor-biocparallel=1.24.1
 40 |   - bioconductor-biocsingular=1.6.0
 41 |   - bioconductor-bluster=1.0.0
 42 |   - bioconductor-delayedarray=0.16.3
 43 |   - bioconductor-delayedmatrixstats=1.12.3
 44 |   - bioconductor-edger=3.32.1
 45 |   - bioconductor-genomeinfodb=1.26.4
 46 |   - bioconductor-genomeinfodbdata=1.2.4
 47 |   - bioconductor-genomicranges=1.42.0
 48 |   - bioconductor-hdf5array=1.18.1
 49 |   - bioconductor-iranges=2.24.1
 50 |   - bioconductor-limma=3.46.0
 51 |   - bioconductor-matrixgenerics=1.2.1
 52 |   - bioconductor-rhdf5=2.34.0
 53 |   - bioconductor-rhdf5filters=1.2.0
 54 |   - bioconductor-rhdf5lib=1.12.1
 55 |   - bioconductor-s4vectors=0.28.1
 56 |   - bioconductor-scater=1.18.6
 57 |   - bioconductor-scran=1.18.5
 58 |   - bioconductor-scuttle=1.0.4
 59 |   - bioconductor-singlecellexperiment=1.12.0
 60 |   - bioconductor-sparsematrixstats=1.2.1
 61 |   - bioconductor-summarizedexperiment=1.20.0
 62 |   - bioconductor-xvector=0.30.0
 63 |   - bioconductor-zlibbioc=1.36.0
 64 |   - blas=1.1
 65 |   - bleach=6.1.0
 66 |   - blinker=1.6.3
 67 |   - boto3=1.28.76
 68 |   - botocore=1.31.76
 69 |   - bottleneck=1.3.7
 70 |   - brotli=1.1.0
 71 |   - brotli-bin=1.1.0
 72 |   - brotli-python=1.1.0
 73 |   - bwidget=1.9.14
 74 |   - bzip2=1.0.8
 75 |   - c-ares=1.20.1
 76 |   - ca-certificates=2023.7.22
 77 |   - cached-property=1.5.2
 78 |   - cached_property=1.5.2
 79 |   - cachetools=5.3.2
 80 |   - cairo=1.16.0
 81 |   - certifi=2023.7.22
 82 |   - cffi=1.16.0
 83 |   - chardet=5.2.0
 84 |   - charset-normalizer=3.3.2
 85 |   - chex=0.1.81
 86 |   - click=8.1.7
 87 |   - coin-or-cbc=2.10.10
 88 |   - coin-or-cgl=0.60.7
 89 |   - coin-or-clp=1.17.8
 90 |   - coin-or-osi=0.108.8
 91 |   - coin-or-utils=2.11.9
 92 |   - coincbc=2.10.10
 93 |   - colorama=0.4.6
 94 |   - comm=0.1.4
 95 |   - configargparse=1.7
 96 |   - connection_pool=0.0.3
 97 |   - contourpy=1.1.1
 98 |   - cryptography=41.0.5
 99 |   - curl=7.86.0
100 |   - cycler=0.11.0
101 |   - datrie=0.8.2
102 |   - debugpy=1.8.0
103 |   - decorator=5.1.1
104 |   - defusedxml=0.7.1
105 |   - dm-tree=0.1.7
106 |   - docrep=0.3.2
107 |   - docutils=0.20.1
108 |   - dpath=2.1.6
109 |   - dropbox=11.36.2
110 |   - dunamai=1.19.0
111 |   - entrypoints=0.4
112 |   - et_xmlfile=1.1.0
113 |   - exceptiongroup=1.1.3
114 |   - executing=2.0.1
115 |   - expat=2.5.0
116 |   - fbpca=1.0
117 |   - filechunkio=1.8
118 |   - filelock=3.13.1
119 |   - flax=0.6.1
120 |   - font-ttf-dejavu-sans-mono=2.37
121 |   - font-ttf-inconsolata=3.000
122 |   - font-ttf-source-code-pro=2.038
123 |   - font-ttf-ubuntu=0.83
124 |   - fontconfig=2.14.2
125 |   - fonts-conda-ecosystem=1
126 |   - fonts-conda-forge=1
127 |   - fonttools=4.43.1
128 |   - freetype=2.12.1
129 |   - fribidi=1.0.10
130 |   - frozenlist=1.4.0
131 |   - fsspec=2023.1.0
132 |   - ftputil=5.0.4
133 |   - future=0.18.3
134 |   - gcc=13.2.0
135 |   - gcc_impl_linux-64=13.2.0
136 |   - gdk-pixbuf=2.42.10
137 |   - geos=3.11.0
138 |   - geosketch=1.2
139 |   - get_version=3.5.4
140 |   - gettext=0.21.1
141 |   - gfortran_impl_linux-64=13.2.0
142 |   - giflib=5.2.1
143 |   - git=2.42.0
144 |   - gitdb=4.0.11
145 |   - gitpython=3.1.40
146 |   - glib=2.78.0
147 |   - glib-tools=2.78.0
148 |   - glpk=5.0
149 |   - gmp=6.2.1
150 |   - google-api-core=2.12.0
151 |   - google-api-python-client=2.106.0
152 |   - google-auth=2.23.0
153 |   - google-auth-httplib2=0.1.1
154 |   - google-auth-oauthlib=0.4.6
155 |   - google-cloud-core=2.3.3
156 |   - google-cloud-storage=2.11.0
157 |   - google-crc32c=1.1.2
158 |   - google-resumable-media=2.6.0
159 |   - googleapis-common-protos=1.61.0
160 |   - graphite2=1.3.13
161 |   - graphviz=8.0.3
162 |   - grpc-cpp=1.48.1
163 |   - grpcio=1.48.1
164 |   - gsl=2.7
165 |   - gtk2=2.24.33
166 |   - gts=0.7.6
167 |   - gxx=13.2.0
168 |   - gxx_impl_linux-64=13.2.0
169 |   - h5py=3.8.0
170 |   - harfbuzz=6.0.0
171 |   - hdf5=1.12.2
172 |   - httplib2=0.22.0
173 |   - humanfriendly=10.0
174 |   - icu=70.1
175 |   - idna=3.4
176 |   - importlib-metadata=6.8.0
177 |   - importlib_metadata=6.8.0
178 |   - importlib_resources=6.0.0
179 |   - iniconfig=2.0.0
180 |   - intel-openmp=2022.1.0
181 |   - intervaltree=2.1.0
182 |   - ipykernel=6.16.2
183 |   - ipython=8.17.2
184 |   - ipython_genutils=0.2.0
185 |   - ipywidgets=8.1.1
186 |   - jax=0.3.25
187 |   - jaxlib=0.3.25
188 |   - jedi=0.19.1
189 |   - jinja2=3.1.2
190 |   - jmespath=1.0.1
191 |   - joblib=1.3.2
192 |   - jpeg=9e
193 |   - jsonschema=4.17.3
194 |   - jupyter_client=7.4.9
195 |   - jupyter_core=5.5.0
196 |   - jupyter_server=1.23.4
197 |   - jupyterlab_pygments=0.2.2
198 |   - jupyterlab_widgets=3.0.9
199 |   - kernel-headers_linux-64=2.6.32
200 |   - keyutils=1.6.1
201 |   - kiwisolver=1.4.5
202 |   - krb5=1.19.3
203 |   - lcms2=2.14
204 |   - ld_impl_linux-64=2.40
205 |   - lerc=4.0.0
206 |   - libabseil=20220623.0
207 |   - libblas=3.9.0
208 |   - libbrotlicommon=1.1.0
209 |   - libbrotlidec=1.1.0
210 |   - libbrotlienc=1.1.0
211 |   - libcblas=3.9.0
212 |   - libcrc32c=1.1.2
213 |   - libcurl=7.86.0
214 |   - libdeflate=1.14
215 |   - libedit=3.1.20191231
216 |   - libev=4.33
217 |   - libexpat=2.5.0
218 |   - libffi=3.4.2
219 |   - libgcc-devel_linux-64=13.2.0
220 |   - libgcc-ng=13.2.0
221 |   - libgd=2.3.3
222 |   - libgfortran-ng=13.2.0
223 |   - libgfortran5=13.2.0
224 |   - libgit2=1.5.1
225 |   - libglib=2.78.0
226 |   - libgomp=13.2.0
227 |   - libhwloc=2.9.1
228 |   - libiconv=1.17
229 |   - liblapack=3.9.0
230 |   - liblapacke=3.9.0
231 |   - libllvm11=11.1.0
232 |   - libllvm14=14.0.6
233 |   - libnghttp2=1.55.1
234 |   - libnsl=2.0.1
235 |   - libopenblas=0.3.24
236 |   - libpng=1.6.39
237 |   - libprotobuf=3.21.8
238 |   - librsvg=2.54.4
239 |   - libsanitizer=13.2.0
240 |   - libsodium=1.0.18
241 |   - libsqlite=3.44.0
242 |   - libssh2=1.11.0
243 |   - libstdcxx-devel_linux-64=13.2.0
244 |   - libstdcxx-ng=13.2.0
245 |   - libtiff=4.4.0
246 |   - libtool=2.4.7
247 |   - libuuid=2.38.1
248 |   - libwebp=1.3.2
249 |   - libwebp-base=1.3.2
250 |   - libxcb=1.13
251 |   - libxml2=2.10.3
252 |   - libzlib=1.2.13
253 |   - lightning-utilities=0.9.0
254 |   - llvmlite=0.40.1
255 |   - logmuse=0.2.6
256 |   - make=4.3
257 |   - markdown=3.5.1
258 |   - markdown-it-py=2.2.0
259 |   - markupsafe=2.1.3
260 |   - matplotlib-base=3.8.1
261 |   - matplotlib-inline=0.1.6
262 |   - mdurl=0.1.0
263 |   - mistune=3.0.1
264 |   - mkl=2022.1.0
265 |   - msgpack-python=1.0.6
266 |   - multidict=6.0.4
267 |   - multipledispatch=0.6.0
268 |   - munkres=1.1.4
269 |   - natsort=8.4.0
270 |   - nbclassic=1.0.0
271 |   - nbclient=0.7.0
272 |   - nbconvert=7.6.0
273 |   - nbconvert-core=7.6.0
274 |   - nbconvert-pandoc=7.6.0
275 |   - nbformat=5.8.0
276 |   - ncurses=6.4
277 |   - nest-asyncio=1.5.8
278 |   - networkx=2.7
279 |   - ninja=1.11.1
280 |   - nomkl=3.0
281 |   - notebook=6.5.6
282 |   - notebook-shim=0.2.3
283 |   - numba=0.57.1
284 |   - numexpr=2.8.7
285 |   - numpy=1.24.4
286 |   - numpyro=0.13.2
287 |   - oauth2client=4.1.3
288 |   - oauthlib=3.2.2
289 |   - openblas=0.3.24
290 |   - openjpeg=2.5.0
291 |   - openpyxl=3.1.2
292 |   - openssl=3.1.4
293 |   - opt_einsum=3.3.0
294 |   - optax=0.1.5
295 |   - packaging=23.2
296 |   - pandas=1.5.2
297 |   - pandoc=2.19.2
298 |   - pandocfilters=1.5.0
299 |   - pango=1.50.14
300 |   - paramiko=3.3.1
301 |   - parso=0.8.3
302 |   - patsy=0.5.3
303 |   - pcre2=10.40
304 |   - peppy=0.35.7
305 |   - perl=5.32.1
306 |   - pexpect=4.8.0
307 |   - pickleshare=0.7.5
308 |   - pillow=9.2.0
309 |   - pip=23.3.1
310 |   - pixman=0.42.2
311 |   - pkgutil-resolve-name=1.3.10
312 |   - plac=1.4.1
313 |   - platformdirs=3.11.0
314 |   - pluggy=1.3.0
315 |   - ply=3.11
316 |   - prettytable=3.7.0
317 |   - prometheus_client=0.18.0
318 |   - prompt-toolkit=3.0.39
319 |   - prompt_toolkit=3.0.39
320 |   - protobuf=4.21.8
321 |   - psutil=5.9.5
322 |   - pthread-stubs=0.4
323 |   - ptyprocess=0.7.0
324 |   - pulp=2.7.0
325 |   - pure_eval=0.2.2
326 |   - pyasn1=0.5.0
327 |   - pyasn1-modules=0.3.0
328 |   - pycparser=2.21
329 |   - pydeprecate=0.3.1
330 |   - pygments=2.16.1
331 |   - pyjwt=2.8.0
332 |   - pynacl=1.5.0
333 |   - pynndescent=0.5.10
334 |   - pyopenssl=23.2.0
335 |   - pyparsing=3.1.1
336 |   - pyro-api=0.1.2
337 |   - pyro-ppl=1.8.6
338 |   - pyrsistent=0.20.0
339 |   - pysftp=0.2.9
340 |   - pysocks=1.7.1
341 |   - pytest=7.4.3
342 |   - python=3.10.10
343 |   - python-annoy=1.17.2
344 |   - python-dateutil=2.8.2
345 |   - python-fastjsonschema=2.18.1
346 |   - python-irodsclient=1.1.9
347 |   - python-tzdata=2023.3
348 |   - python_abi=3.10
349 |   - pytorch=1.12.1
350 |   - pytorch-lightning=1.5.8
351 |   - pytz=2023.3.post1
352 |   - pytz-deprecation-shim=0.1.0.post0
353 |   - pyu2f=0.1.5
354 |   - pyyaml=6.0.1
355 |   - pyzmq=24.0.1
356 |   - r-abind=1.4_5
357 |   - r-askpass=1.1
358 |   - r-assertthat=0.2.1
359 |   - r-backports=1.4.1
360 |   - r-base=4.0.5
361 |   - r-base64enc=0.1_3
362 |   - r-beeswarm=0.4.0
363 |   - r-bh=1.78.0_0
364 |   - r-bit=4.0.4
365 |   - r-bit64=4.0.5
366 |   - r-bitops=1.0_7
367 |   - r-blob=1.2.3
368 |   - r-boot=1.3_28
369 |   - r-brew=1.0_7
370 |   - r-brio=1.1.3
371 |   - r-broom=1.0.1
372 |   - r-bslib=0.4.0
373 |   - r-cachem=1.0.6
374 |   - r-callr=3.7.2
375 |   - r-caret=6.0_93
376 |   - r-catools=1.18.2
377 |   - r-cellranger=1.1.0
378 |   - r-class=7.3_20
379 |   - r-cli=3.4.1
380 |   - r-clipr=0.8.0
381 |   - r-cluster=2.1.3
382 |   - r-codetools=0.2_18
383 |   - r-colorspace=2.0_3
384 |   - r-commonmark=1.8.0
385 |   - r-cowplot=1.1.1
386 |   - r-cpp11=0.4.2
387 |   - r-crayon=1.5.1
388 |   - r-credentials=1.3.2
389 |   - r-crosstalk=1.2.0
390 |   - r-crul=1.3
391 |   - r-curl=4.3.2
392 |   - r-data.table=1.14.2
393 |   - r-dbi=1.1.3
394 |   - r-dbplyr=2.2.1
395 |   - r-deldir=1.0_6
396 |   - r-desc=1.4.2
397 |   - r-devtools=2.4.4
398 |   - r-diffobj=0.3.5
399 |   - r-digest=0.6.29
400 |   - r-downlit=0.4.2
401 |   - r-dplyr=1.0.10
402 |   - r-dqrng=0.3.0
403 |   - r-dtplyr=1.2.2
404 |   - r-e1071=1.7_11
405 |   - r-ellipsis=0.3.2
406 |   - r-essentials=4.0
407 |   - r-evaluate=0.16
408 |   - r-fansi=1.0.3
409 |   - r-farver=2.1.1
410 |   - r-fastmap=1.1.0
411 |   - r-fitdistrplus=1.1_8
412 |   - r-fnn=1.1.3.1
413 |   - r-fontawesome=0.3.0
414 |   - r-forcats=0.5.2
415 |   - r-foreach=1.5.2
416 |   - r-foreign=0.8_82
417 |   - r-formatr=1.12
418 |   - r-fs=1.5.2
419 |   - r-futile.logger=1.4.3
420 |   - r-futile.options=1.0.1
421 |   - r-future=1.28.0
422 |   - r-future.apply=1.9.1
423 |   - r-gargle=1.2.1
424 |   - r-generics=0.1.3
425 |   - r-gert=1.5.0
426 |   - r-ggbeeswarm=0.6.0
427 |   - r-ggplot2=3.3.6
428 |   - r-ggrepel=0.9.1
429 |   - r-ggridges=0.5.4
430 |   - r-gh=1.3.1
431 |   - r-gistr=0.9.0
432 |   - r-gitcreds=0.1.2
433 |   - r-glmnet=4.1_2
434 |   - r-globals=0.16.1
435 |   - r-glue=1.6.2
436 |   - r-goftest=1.2_3
437 |   - r-googledrive=2.0.0
438 |   - r-googlesheets4=1.0.1
439 |   - r-gower=1.0.0
440 |   - r-gplots=3.1.3
441 |   - r-gridextra=2.3
442 |   - r-gtable=0.3.1
443 |   - r-gtools=3.9.3
444 |   - r-hardhat=1.2.0
445 |   - r-haven=2.5.0
446 |   - r-here=1.0.1
447 |   - r-hexbin=1.28.2
448 |   - r-highr=0.9
449 |   - r-hms=1.1.2
450 |   - r-htmltools=0.5.3
451 |   - r-htmlwidgets=1.5.4
452 |   - r-httpcode=0.3.0
453 |   - r-httpuv=1.6.6
454 |   - r-httr=1.4.4
455 |   - r-ica=1.0_3
456 |   - r-ids=1.0.1
457 |   - r-igraph=1.3.4
458 |   - r-ini=0.3.1
459 |   - r-ipred=0.9_13
460 |   - r-irdisplay=1.1
461 |   - r-irkernel=1.3
462 |   - r-irlba=2.3.5
463 |   - r-isoband=0.2.5
464 |   - r-iterators=1.0.14
465 |   - r-jquerylib=0.1.4
466 |   - r-jsonlite=1.8.0
467 |   - r-kernsmooth=2.23_20
468 |   - r-knitr=1.40
469 |   - r-labeling=0.4.2
470 |   - r-lambda.r=1.2.4
471 |   - r-later=1.2.0
472 |   - r-lattice=0.20_45
473 |   - r-lava=1.6.10
474 |   - r-lazyeval=0.2.2
475 |   - r-leiden=0.4.3
476 |   - r-lifecycle=1.0.2
477 |   - r-listenv=0.8.0
478 |   - r-lmtest=0.9_40
479 |   - r-lobstr=1.1.2
480 |   - r-locfit=1.5_9.4
481 |   - r-lsei=1.3_0
482 |   - r-lubridate=1.8.0
483 |   - r-magrittr=2.0.3
484 |   - r-maps=3.4.0
485 |   - r-mass=7.3_58.1
486 |   - r-matrix=1.4_1
487 |   - r-matrixstats=0.62.0
488 |   - r-memoise=2.0.1
489 |   - r-mgcv=1.8_40
490 |   - r-mime=0.12
491 |   - r-miniui=0.1.1.1
492 |   - r-modelmetrics=1.2.2.2
493 |   - r-modelr=0.1.9
494 |   - r-munsell=0.5.0
495 |   - r-nlme=3.1_159
496 |   - r-nnet=7.3_17
497 |   - r-npsurv=0.5_0
498 |   - r-numderiv=2016.8_1.1
499 |   - r-openssl=2.0.3
500 |   - r-parallelly=1.32.1
501 |   - r-patchwork=1.1.2
502 |   - r-pbapply=1.5_0
503 |   - r-pbdzmq=0.3_7
504 |   - r-pillar=1.8.1
505 |   - r-pkgbuild=1.3.1
506 |   - r-pkgconfig=2.0.3
507 |   - r-pkgdown=2.0.6
508 |   - r-pkgload=1.3.0
509 |   - r-plotly=4.10.0
510 |   - r-plyr=1.8.7
511 |   - r-png=0.1_7
512 |   - r-polyclip=1.10_0
513 |   - r-praise=1.0.0
514 |   - r-prettyunits=1.1.1
515 |   - r-proc=1.18.0
516 |   - r-processx=3.7.0
517 |   - r-prodlim=2019.11.13
518 |   - r-profvis=0.3.7
519 |   - r-progress=1.2.2
520 |   - r-progressr=0.11.0
521 |   - r-promises=1.2.0.1
522 |   - r-proxy=0.4_27
523 |   - r-pryr=0.1.5
524 |   - r-ps=1.7.1
525 |   - r-purrr=0.3.4
526 |   - r-quantmod=0.4.20
527 |   - r-r6=2.5.1
528 |   - r-ragg=0.3.1
529 |   - r-randomforest=4.6_14
530 |   - r-rann=2.6.1
531 |   - r-rappdirs=0.3.3
532 |   - r-rbokeh=0.5.2
533 |   - r-rcmdcheck=1.4.0
534 |   - r-rcolorbrewer=1.1_3
535 |   - r-rcpp=1.0.9
536 |   - r-rcppannoy=0.0.19
537 |   - r-rcpparmadillo=0.11.2.3.1
538 |   - r-rcppeigen=0.3.3.9.2
539 |   - r-rcpphnsw=0.4.1
540 |   - r-rcppparallel=5.1.5
541 |   - r-rcppprogress=0.4.2
542 |   - r-rcpptoml=0.1.7
543 |   - r-rcurl=1.98_1.8
544 |   - r-readr=2.1.2
545 |   - r-readxl=1.4.1
546 |   - r-recipes=1.0.1
547 |   - r-recommended=4.0
548 |   - r-rematch=1.0.1
549 |   - r-rematch2=2.1.2
550 |   - r-remotes=2.4.2
551 |   - r-repr=1.1.4
552 |   - r-reprex=2.0.2
553 |   - r-reshape2=1.4.4
554 |   - r-reticulate=1.30
555 |   - r-rgeos=0.5_9
556 |   - r-rlang=1.0.6
557 |   - r-rmarkdown=2.16
558 |   - r-rocr=1.0_11
559 |   - r-roxygen2=7.2.1
560 |   - r-rpart=4.1.16
561 |   - r-rprojroot=2.0.3
562 |   - r-rspectra=0.16_1
563 |   - r-rstudioapi=0.14
564 |   - r-rsvd=1.0.5
565 |   - r-rtsne=0.16
566 |   - r-rversions=2.1.2
567 |   - r-rvest=1.0.3
568 |   - r-sass=0.4.2
569 |   - r-scales=1.2.1
570 |   - r-scattermore=0.8
571 |   - r-sctransform=0.3.4
572 |   - r-selectr=0.4_2
573 |   - r-sessioninfo=1.2.2
574 |   - r-seurat=3.2.3
575 |   - r-seuratobject=4.1.1
576 |   - r-shape=1.4.6
577 |   - r-shiny=1.7.2
578 |   - r-sitmo=2.0.2
579 |   - r-snow=0.4_4
580 |   - r-sourcetools=0.1.7
581 |   - r-sp=1.5_0
582 |   - r-spatial=7.3_15
583 |   - r-spatstat=1.64_1
584 |   - r-spatstat.data=2.2_0
585 |   - r-spatstat.utils=2.3_1
586 |   - r-squarem=2021.1
587 |   - r-statmod=1.4.37
588 |   - r-stringi=1.7.8
589 |   - r-stringr=1.4.1
590 |   - r-survival=3.4_0
591 |   - r-sys=3.4
592 |   - r-systemfonts=1.0.4
593 |   - r-tensor=1.5
594 |   - r-testthat=3.1.4
595 |   - r-tibble=3.1.8
596 |   - r-tidyr=1.2.1
597 |   - r-tidyselect=1.1.2
598 |   - r-tidyverse=1.3.2
599 |   - r-timedate=4021.104
600 |   - r-tinytex=0.42
601 |   - r-triebeard=0.3.0
602 |   - r-ttr=0.24.3
603 |   - r-tzdb=0.3.0
604 |   - r-urlchecker=1.0.1
605 |   - r-urltools=1.7.3
606 |   - r-usethis=2.1.6
607 |   - r-utf8=1.2.2
608 |   - r-uuid=1.1_0
609 |   - r-uwot=0.1.14
610 |   - r-vctrs=0.4.1
611 |   - r-vipor=0.4.5
612 |   - r-viridis=0.6.2
613 |   - r-viridislite=0.4.1
614 |   - r-vroom=1.5.7
615 |   - r-waldo=0.4.0
616 |   - r-whisker=0.4
617 |   - r-withr=2.5.0
618 |   - r-xfun=0.33
619 |   - r-xml2=1.3.3
620 |   - r-xopen=1.0.0
621 |   - r-xtable=1.8_4
622 |   - r-xts=0.12.1
623 |   - r-yaml=2.3.5
624 |   - r-zip=2.2.1
625 |   - r-zoo=1.8_11
626 |   - re2=2022.06.01
627 |   - readline=8.2
628 |   - requests=2.31.0
629 |   - requests-oauthlib=1.3.1
630 |   - reretry=0.11.8
631 |   - rich=13.6.0
632 |   - rsa=4.9
633 |   - s3transfer=0.7.0
634 |   - scanorama=1.7
635 |   - scanpy=1.9.2
636 |   - scikit-learn=1.3.2
637 |   - scipy=1.11.3
638 |   - scvi-tools=0.16.1
639 |   - seaborn=0.13.0
640 |   - seaborn-base=0.13.0
641 |   - sed=4.8
642 |   - send2trash=1.8.2
643 |   - session-info=1.0.0
644 |   - setuptools=68.2.2
645 |   - setuptools-scm=8.0.4
646 |   - simplegeneric=0.8.1
647 |   - six=1.16.0
648 |   - slacker=0.14.0
649 |   - sleef=3.5.1
650 |   - smart_open=6.4.0
651 |   - smmap=5.0.0
652 |   - snakemake=7.24.0
653 |   - snakemake-minimal=7.24.0
654 |   - sniffio=1.3.0
655 |   - sortedcontainers=2.4.0
656 |   - soupsieve=2.3.2.post1
657 |   - sqlite=3.44.0
658 |   - stack_data=0.6.2
659 |   - statsmodels=0.14.0
660 |   - stdlib-list=0.8.0
661 |   - stone=3.3.1
662 |   - stopit=1.1.2
663 |   - sysroot_linux-64=2.12
664 |   - tabulate=0.9.0
665 |   - tensorboard=2.11.2
666 |   - tensorboard-data-server=0.6.1
667 |   - tensorboard-plugin-wit=1.8.1
668 |   - terminado=0.17.1
669 |   - threadpoolctl=3.1.0
670 |   - throttler=1.2.2
671 |   - tinycss2=1.2.1
672 |   - tk=8.6.13
673 |   - tktable=2.10
674 |   - tomli=2.0.1
675 |   - toolz=0.12.0
676 |   - toposort=1.7
677 |   - torchmetrics=1.0.3
678 |   - tornado=6.3.3
679 |   - tqdm=4.66.1
680 |   - traitlets=5.9.0
681 |   - typing-extensions=4.7.1
682 |   - typing_extensions=4.7.1
683 |   - tzdata=2023c
684 |   - tzlocal=5.2
685 |   - ubiquerg=0.6.3
686 |   - umap-learn=0.5.4
687 |   - unicodedata2=15.1.0
688 |   - uritemplate=4.1.1
689 |   - urllib3=1.26.18
690 |   - veracitools=0.1.3
691 |   - wcwidth=0.2.9
692 |   - webencodings=0.5.1
693 |   - websocket-client=1.6.1
694 |   - werkzeug=2.2.3
695 |   - wheel=0.41.3
696 |   - widgetsnbextension=4.0.9
697 |   - wrapt=1.15.0
698 |   - xorg-kbproto=1.0.7
699 |   - xorg-libice=1.0.10
700 |   - xorg-libsm=1.2.3
701 |   - xorg-libx11=1.8.4
702 |   - xorg-libxau=1.0.11
703 |   - xorg-libxdmcp=1.1.3
704 |   - xorg-libxext=1.3.4
705 |   - xorg-libxrender=0.9.10
706 |   - xorg-libxt=1.3.0
707 |   - xorg-renderproto=0.11.1
708 |   - xorg-xextproto=7.3.0
709 |   - xorg-xproto=7.0.31
710 |   - xz=5.2.6
711 |   - yaml=0.2.5
712 |   - yarl=1.9.2
713 |   - yte=1.5.1
714 |   - zeromq=4.3.5
715 |   - zipp=3.15.0
716 |   - zlib=1.2.13
717 |   - zstd=1.5.5
718 |   - pip:
719 |       - deprecated==1.2.14
720 |       - igraph==0.10.8
721 |       - leidenalg==0.10.1
722 |       - pydot==1.4.2
723 |       - rpy2==3.5.14
724 |       - scib==1.1.4
725 |       - scikit-misc==0.3.0
726 |       - tbb==2021.10.0
727 |       - texttable==1.7.0
728 | 


--------------------------------------------------------------------------------
/envs/scvi.yml:
--------------------------------------------------------------------------------
  1 | name: scvi
  2 | channels:
  3 |   - bioconda
  4 |   - conda-forge
  5 |   - r
  6 |   - defaults
  7 | dependencies:
  8 |   - _libgcc_mutex=0.1=conda_forge
  9 |   - _openmp_mutex=4.5=2_gnu
 10 |   - anndata=0.9.2=pyhd8ed1ab_0
 11 |   - brotli=1.1.0=hd590300_1
 12 |   - brotli-bin=1.1.0=hd590300_1
 13 |   - bzip2=1.0.8=h7f98852_4
 14 |   - c-ares=1.20.1=hd590300_0
 15 |   - ca-certificates=2023.7.22=hbcca054_0
 16 |   - cached-property=1.5.2=hd8ed1ab_1
 17 |   - cached_property=1.5.2=pyha770c72_1
 18 |   - certifi=2023.7.22=pyhd8ed1ab_0
 19 |   - colorama=0.4.6=pyhd8ed1ab_0
 20 |   - contourpy=1.1.1=py310hd41b1e2_1
 21 |   - cudatoolkit=11.8.0=h4ba93d1_12
 22 |   - cycler=0.12.1=pyhd8ed1ab_0
 23 |   - fonttools=4.43.1=py310h2372a71_0
 24 |   - freetype=2.12.1=h267a509_2
 25 |   - h5py=3.10.0=nompi_py310ha2ad45a_100
 26 |   - hdf5=1.14.2=nompi_h4f84152_100
 27 |   - icu=73.2=h59595ed_0
 28 |   - importlib-metadata=6.8.0=pyha770c72_0
 29 |   - importlib_metadata=6.8.0=hd8ed1ab_0
 30 |   - joblib=1.3.2=pyhd8ed1ab_0
 31 |   - keyutils=1.6.1=h166bdaf_0
 32 |   - kiwisolver=1.4.5=py310hd41b1e2_1
 33 |   - krb5=1.21.2=h659d440_0
 34 |   - lcms2=2.15=hb7c19ff_3
 35 |   - ld_impl_linux-64=2.40=h41732ed_0
 36 |   - lerc=4.0.0=h27087fc_0
 37 |   - libaec=1.1.2=h59595ed_1
 38 |   - libblas=3.9.0=18_linux64_openblas
 39 |   - libbrotlicommon=1.1.0=hd590300_1
 40 |   - libbrotlidec=1.1.0=hd590300_1
 41 |   - libbrotlienc=1.1.0=hd590300_1
 42 |   - libcblas=3.9.0=18_linux64_openblas
 43 |   - libcurl=8.4.0=hca28451_0
 44 |   - libdeflate=1.19=hd590300_0
 45 |   - libedit=3.1.20191231=he28a2e2_2
 46 |   - libev=4.33=h516909a_1
 47 |   - libffi=3.4.2=h7f98852_5
 48 |   - libgcc-ng=13.2.0=h807b86a_2
 49 |   - libgfortran-ng=13.2.0=h69a702a_2
 50 |   - libgfortran5=13.2.0=ha4646dd_2
 51 |   - libgomp=13.2.0=h807b86a_2
 52 |   - libhwloc=2.9.3=default_h554bfaf_1009
 53 |   - libiconv=1.17=h166bdaf_0
 54 |   - libjpeg-turbo=3.0.0=hd590300_1
 55 |   - liblapack=3.9.0=18_linux64_openblas
 56 |   - libllvm14=14.0.6=hcd5def8_4
 57 |   - libnghttp2=1.52.0=h61bc06f_0
 58 |   - libnsl=2.0.0=hd590300_1
 59 |   - libopenblas=0.3.24=pthreads_h413a1c8_0
 60 |   - libpng=1.6.39=h753d276_0
 61 |   - libsqlite=3.43.2=h2797004_0
 62 |   - libssh2=1.11.0=h0841786_0
 63 |   - libstdcxx-ng=13.2.0=h7e041cc_2
 64 |   - libtiff=4.6.0=ha9c0a0a_2
 65 |   - libuuid=2.38.1=h0b41bf4_0
 66 |   - libwebp-base=1.3.2=hd590300_0
 67 |   - libxcb=1.15=h0b41bf4_0
 68 |   - libxml2=2.11.5=h232c23b_1
 69 |   - libzlib=1.2.13=hd590300_5
 70 |   - llvmlite=0.40.1=py310h1b8f574_0
 71 |   - matplotlib-base=3.8.0=py310h62c0568_2
 72 |   - munkres=1.0.7=py_1
 73 |   - natsort=8.4.0=pyhd8ed1ab_0
 74 |   - ncurses=6.4=hcb278e6_0
 75 |   - networkx=3.1=pyhd8ed1ab_0
 76 |   - numba=0.57.1=py310h0f6aa51_0
 77 |   - numpy=1.24.4=py310ha4c1d20_0
 78 |   - openjpeg=2.5.0=h488ebb8_3
 79 |   - openssl=3.1.3=hd590300_0
 80 |   - packaging=23.2=pyhd8ed1ab_0
 81 |   - pandas=1.5.3=py310h9b08913_1
 82 |   - patsy=0.5.3=pyhd8ed1ab_0
 83 |   - pillow=10.0.1=py310h01dd4db_2
 84 |   - pip=23.2.1=pyhd8ed1ab_0
 85 |   - pthread-stubs=0.4=h36c2ea0_1001
 86 |   - pynndescent=0.5.10=pyh1a96a4e_0
 87 |   - pyparsing=3.1.1=pyhd8ed1ab_0
 88 |   - python=3.10.10=he550d4f_0_cpython
 89 |   - python-dateutil=2.8.2=pyhd8ed1ab_0
 90 |   - python-tzdata=2023.3=pyhd8ed1ab_0
 91 |   - python_abi=3.10=4_cp310
 92 |   - pytz=2023.3.post1=pyhd8ed1ab_0
 93 |   - readline=8.2=h8228510_1
 94 |   - scanpy=1.9.2=pyhd8ed1ab_0
 95 |   - scikit-learn=1.3.1=py310h1fdf081_1
 96 |   - scipy=1.11.3=py310hb13e2d6_1
 97 |   - seaborn=0.13.0=hd8ed1ab_0
 98 |   - seaborn-base=0.13.0=pyhd8ed1ab_0
 99 |   - session-info=1.0.0=pyhd8ed1ab_0
100 |   - setuptools=68.2.2=pyhd8ed1ab_0
101 |   - six=1.16.0=pyh6c4a22f_0
102 |   - statsmodels=0.14.0=py310h1f7b6fc_2
103 |   - stdlib-list=0.8.0=pyhd8ed1ab_0
104 |   - tbb=2021.10.0=h00ab1b0_1
105 |   - threadpoolctl=3.2.0=pyha21a80b_0
106 |   - tk=8.6.13=h2797004_0
107 |   - tqdm=4.66.1=pyhd8ed1ab_0
108 |   - tzdata=2023c=h71feb2d_0
109 |   - umap-learn=0.5.4=py310hff52083_0
110 |   - unicodedata2=15.1.0=py310h2372a71_0
111 |   - wheel=0.41.2=pyhd8ed1ab_0
112 |   - xorg-libxau=1.0.11=hd590300_0
113 |   - xorg-libxdmcp=1.1.3=h7f98852_0
114 |   - xz=5.2.6=h166bdaf_0
115 |   - zipp=3.17.0=pyhd8ed1ab_0
116 |   - zstd=1.5.5=hfc55251_0
117 |   - pip:
118 |       - absl-py==2.0.0
119 |       - aiohttp==3.8.6
120 |       - aiosignal==1.3.1
121 |       - annotated-types==0.6.0
122 |       - anyio==3.7.1
123 |       - arrow==1.3.0
124 |       - async-timeout==4.0.3
125 |       - attrs==23.1.0
126 |       - backoff==2.2.1
127 |       - beautifulsoup4==4.12.2
128 |       - blessed==1.20.0
129 |       - charset-normalizer==3.3.0
130 |       - chex==0.1.7
131 |       - click==8.1.7
132 |       - contextlib2==21.6.0
133 |       - croniter==1.4.1
134 |       - dateutils==0.6.12
135 |       - deepdiff==6.6.0
136 |       - dm-tree==0.1.8
137 |       - docrep==0.3.2
138 |       - etils==1.5.0
139 |       - exceptiongroup==1.1.3
140 |       - fastapi==0.103.2
141 |       - filelock==3.12.4
142 |       - flax==0.7.4
143 |       - frozenlist==1.4.0
144 |       - fsspec==2023.9.2
145 |       - h11==0.14.0
146 |       - idna==3.4
147 |       - importlib-resources==6.1.0
148 |       - inquirer==3.1.3
149 |       - itsdangerous==2.1.2
150 |       - jax==0.4.18
151 |       - jaxlib==0.4.18
152 |       - jinja2==3.1.2
153 |       - lightning==2.0.9.post0
154 |       - lightning-cloud==0.5.39
155 |       - lightning-utilities==0.9.0
156 |       - markdown-it-py==3.0.0
157 |       - markupsafe==2.1.3
158 |       - mdurl==0.1.2
159 |       - ml-collections==0.1.1
160 |       - ml-dtypes==0.3.1
161 |       - mpmath==1.3.0
162 |       - msgpack==1.0.7
163 |       - mudata==0.2.3
164 |       - multidict==6.0.4
165 |       - multipledispatch==1.0.0
166 |       - nest-asyncio==1.5.8
167 |       - numpyro==0.13.2
168 |       - nvidia-cublas-cu12==12.1.3.1
169 |       - nvidia-cuda-cupti-cu12==12.1.105
170 |       - nvidia-cuda-nvrtc-cu12==12.1.105
171 |       - nvidia-cuda-runtime-cu12==12.1.105
172 |       - nvidia-cudnn-cu12==8.9.2.26
173 |       - nvidia-cufft-cu12==11.0.2.54
174 |       - nvidia-curand-cu12==10.3.2.106
175 |       - nvidia-cusolver-cu12==11.4.5.107
176 |       - nvidia-cusparse-cu12==12.1.0.106
177 |       - nvidia-nccl-cu12==2.18.1
178 |       - nvidia-nvjitlink-cu12==12.2.140
179 |       - nvidia-nvtx-cu12==12.1.105
180 |       - opt-einsum==3.3.0
181 |       - optax==0.1.7
182 |       - orbax-checkpoint==0.4.1
183 |       - ordered-set==4.1.0
184 |       - protobuf==4.24.4
185 |       - psutil==5.9.5
186 |       - pydantic==2.1.1
187 |       - pydantic-core==2.4.0
188 |       - pygments==2.16.1
189 |       - pyjwt==2.8.0
190 |       - pymde==0.1.18
191 |       - pyro-api==0.1.2
192 |       - pyro-ppl==1.8.6
193 |       - python-editor==1.0.4
194 |       - python-multipart==0.0.6
195 |       - pytorch-lightning==2.0.9.post0
196 |       - pyyaml==6.0.1
197 |       - readchar==4.0.5
198 |       - requests==2.31.0
199 |       - rich==13.6.0
200 |       - scvi-tools==1.0.3
201 |       - sniffio==1.3.0
202 |       - soupsieve==2.5
203 |       - sparse==0.14.0
204 |       - starlette==0.27.0
205 |       - starsessions==1.3.0
206 |       - sympy==1.12
207 |       - tensorstore==0.1.45
208 |       - toolz==0.12.0
209 |       - torch==2.1.0
210 |       - torchmetrics==1.2.0
211 |       - torchvision==0.16.0
212 |       - traitlets==5.11.2
213 |       - triton==2.1.0
214 |       - types-python-dateutil==2.8.19.14
215 |       - typing-extensions==4.8.0
216 |       - urllib3==2.0.6
217 |       - uvicorn==0.23.2
218 |       - wcwidth==0.2.8
219 |       - websocket-client==1.6.4
220 |       - websockets==11.0.3
221 |       - xarray==2023.9.0
222 |       - yarl==1.9.2
223 | 


--------------------------------------------------------------------------------
/example_metadata_nf.tsv:
--------------------------------------------------------------------------------
1 | drerio	/some/path/drerio_embryo_final_gene_id.h5ad
2 | xtropicalis	/some/path/xtropicalis_embryo_final_gene_id.h5ad
3 | 
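
The example metadata file above has two tab-separated columns with no header row: a species name and the path to that species' `.h5ad` file. Below is a minimal sketch of how such a file could be read and sanity-checked before launching the pipeline. The function name `read_species_metadata` and its checks are hypothetical illustrations, not the pipeline's own code, and the paths are the placeholders from the example file.

```python
import csv
import io


def read_species_metadata(handle):
    """Return a {species: h5ad_path} dict from a two-column, tab-separated file.

    Hypothetical helper: assumes no header row, one species per line.
    """
    mapping = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Skip empty or whitespace-only lines (e.g. a trailing newline).
        if not any(cell.strip() for cell in row):
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        species, path = (cell.strip() for cell in row)
        if species in mapping:
            raise ValueError(f"line {line_no}: duplicate species '{species}'")
        mapping[species] = path
    return mapping


# Placeholder content copied from the example metadata file.
example = (
    "drerio\t/some/path/drerio_embryo_final_gene_id.h5ad\n"
    "xtropicalis\t/some/path/xtropicalis_embryo_final_gene_id.h5ad\n"
    "\n"
)
metadata = read_species_metadata(io.StringIO(example))
for species, path in metadata.items():
    print(species, path)
```

To read a file on disk instead, pass an open file handle, for example `open("example_metadata_nf.tsv", newline="")`.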


--------------------------------------------------------------------------------