├── LINGER.PNG ├── README.md ├── conti1_1024.yml ├── doc └── driver.md └── docs ├── ATAC.png ├── Benchmark.md ├── GRN_head.png ├── GRN_infer.md ├── H1_E2F6_trans.jpg ├── LINGER.png ├── PBMC.md ├── PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png ├── PBMCs_heatmap_activity.png ├── PBMCs_ttest.png ├── POU5F1_KO_Diff_Umap_NANOG.png ├── POU5F1_KO_Diff_exp_Umap_NANOG.png ├── POU5F1_KO_Differentiation_Umap.png ├── PWM.jpg ├── RNA.png ├── RNA_ds.jpg ├── S_TG.png ├── TFactivity.md ├── User_guide.md ├── adata_ATAC.png ├── adata_RNA.png ├── barcode_mm10.png ├── box_plot_ATF1_activity_0_Others.png ├── box_plot_ATF1_expression_0_Others.png ├── box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png ├── box_plot_Erg_activity_1_Others.png ├── box_plot_Erg_expression_1_Others.png ├── downstream.md ├── driver.md ├── driver_epi.png ├── driver_trans.png ├── feature_engineering.jpg ├── feature_engineering.png ├── genomemap.jpg ├── h5_input.md ├── heatmap_activity.png ├── heatmap_activity_mm10.png ├── label.png ├── label_PBMC.png ├── metadata_ds.jpg ├── module_result.png ├── motifmatch.png ├── original.png ├── perturb.md ├── perturb.png ├── pvalue_all.png ├── scNN.md ├── scNN_newSpecies.md ├── trans_pr_curveMYC.png ├── trans_roc_curveMYC.png ├── ttest.png ├── ttest_mm10.png ├── tutorial2.md └── tvalue_all.png /LINGER.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/LINGER.PNG -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LINGER 2 | ## Introduction 3 | LINGER (LIfelong neural Network for GEne Regulation) is a novel method to infer GRNs from single-cell multiome data built on top of [PyTorch](https://pytorch.org/). 
4 | 5 | LINGER incorporates both 1) atlas-scale external bulk data across diverse cellular contexts and 2) the knowledge of transcription factor (TF) motif matching to cis-regulatory elements as a manifold regularization to address the challenge of limited data and extensive parameter space in GRN inference. 6 | ## Analysis tasks for single cell multiome data 7 | - Infer gene regulatory network 8 | - Benchmark gene regulatory network 9 | - Explainable dimensionality reduction (transcription factor activity, available for single cell or bulk RNA-seq data) 10 | - In silico perturbation 11 | 12 | In the user guide, we provide an overview of each task. 13 | ## Basic installation 14 | LINGER can be installed with pip: 15 | ```sh 16 | conda create -n LINGER python==3.10.0 17 | conda activate LINGER 18 | pip install LingerGRN==1.105 19 | conda install bioconda::bedtools # Requirement 20 | ``` 21 | ## Documentation 22 | 23 | We provide several tutorials and a user guide. If you find our tool useful for your research, please consider citing the LINGER manuscript. 
24 | 25 | | | | | 26 | |:-------------------------:|:-------------------------:|:-------------------------:| 27 | | [User guide](https://github.com/Durenlab/LINGER/blob/main/docs/User_guide.md) | [PBMCs tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md) |[H1 cell line tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/GRN_infer.md) | 28 | |[GRN benchmark](https://github.com/Durenlab/LINGER/blob/main/docs/Benchmark.md) | [In silico perturbation](https://github.com/Durenlab/LINGER/blob/main/docs/perturb.md) | [Other species](https://github.com/Durenlab/LINGER/blob/main/docs/scNN.md) | 29 | |[Downstream analysis-Module detection](https://github.com/Durenlab/LINGER/blob/main/docs/downstream.md)|[Downstream analysis-TF Driver score](https://github.com/Durenlab/LINGER/blob/main/docs/driver.md)|| 30 | 31 | 32 | ## Reference 33 | > If you use LINGER, please cite: 34 | > 35 | > [Yuan, Qiuyue, and Zhana Duren. "Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data." 
Nature Biotechnology (2024): 1-11.](https://doi.org/10.1038/s41587-024-02182-7) 36 | -------------------------------------------------------------------------------- /conti1_1024.yml: -------------------------------------------------------------------------------- 1 | name: null 2 | channels: 3 | - conda-forge 4 | - anaconda 5 | - bioconda 6 | - defaults 7 | dependencies: 8 | - _libgcc_mutex=0.1=main 9 | - _openmp_mutex=5.1=1_gnu 10 | - asttokens=2.0.5=pyhd3eb1b0_0 11 | - backcall=0.2.0=pyhd3eb1b0_0 12 | - blas=1.0=openblas 13 | - bottleneck=1.3.5=py310ha9d4c09_0 14 | - bzip2=1.0.8=h7b6447c_0 15 | - ca-certificates=2023.08.22=h06a4308_0 16 | - cloudpickle=2.2.1=pyhd8ed1ab_0 17 | - colorama=0.4.6=pyhd8ed1ab_0 18 | - comm=0.1.2=py310h06a4308_0 19 | - debugpy=1.6.7=py310h6a678d5_0 20 | - decorator=5.1.1=pyhd3eb1b0_0 21 | - exceptiongroup=1.0.4=py310h06a4308_0 22 | - executing=0.8.3=pyhd3eb1b0_0 23 | - ipykernel=6.25.0=py310h2f386ee_0 24 | - ipython=8.15.0=py310h06a4308_0 25 | - jedi=0.18.1=py310h06a4308_1 26 | - joblib=1.3.2=pyhd8ed1ab_0 27 | - jupyter_client=8.1.0=py310h06a4308_0 28 | - jupyter_core=5.3.0=py310h06a4308_0 29 | - ld_impl_linux-64=2.38=h1181459_1 30 | - libblas=3.9.0=16_linux64_openblas 31 | - libcblas=3.9.0=16_linux64_openblas 32 | - libffi=3.3=he6710b0_2 33 | - libgcc-ng=11.2.0=h1234567_1 34 | - libgfortran-ng=11.2.0=h00389a5_1 35 | - libgfortran5=11.2.0=h1234567_1 36 | - libgomp=11.2.0=h1234567_1 37 | - liblapack=3.9.0=16_linux64_openblas 38 | - libllvm14=14.0.6=hef93074_0 39 | - libopenblas=0.3.21=h043d6bf_0 40 | - libsodium=1.0.18=h7b6447c_0 41 | - libstdcxx-ng=11.2.0=h1234567_1 42 | - libuuid=1.41.5=h5eee18b_0 43 | - llvmlite=0.40.0=py310he621ea3_0 44 | - matplotlib-inline=0.1.6=py310h06a4308_0 45 | - ncurses=6.4=h6a678d5_0 46 | - nest-asyncio=1.5.6=py310h06a4308_0 47 | - numba=0.57.1=py310h1128e8f_0 48 | - numexpr=2.8.7=py310h286c3b5_0 49 | - openssl=1.1.1w=h7f8727e_0 50 | - packaging=23.1=py310h06a4308_0 51 | - pandas=2.0.3=py310h1128e8f_0 52 | 
- parso=0.8.3=pyhd3eb1b0_0 53 | - pexpect=4.8.0=pyhd3eb1b0_3 54 | - pickleshare=0.7.5=pyhd3eb1b0_1003 55 | - pip=23.2.1=py310h06a4308_0 56 | - platformdirs=3.10.0=py310h06a4308_0 57 | - prompt-toolkit=3.0.36=py310h06a4308_0 58 | - psutil=5.9.0=py310h5eee18b_0 59 | - ptyprocess=0.7.0=pyhd3eb1b0_2 60 | - pure_eval=0.2.2=pyhd3eb1b0_0 61 | - pygments=2.15.1=py310h06a4308_1 62 | - python=3.10.0=h12debd9_5 63 | - python-dateutil=2.8.2=pyhd3eb1b0_0 64 | - python-tzdata=2023.3=pyhd3eb1b0_0 65 | - python_abi=3.10=2_cp310 66 | - pytz=2023.3.post1=py310h06a4308_0 67 | - pyzmq=25.1.0=py310h6a678d5_0 68 | - readline=8.2=h5eee18b_0 69 | - scikit-learn=1.3.0=py310h1128e8f_0 70 | - scipy=1.11.3=py310heeff2f4_0 71 | - setuptools=68.0.0=py310h06a4308_0 72 | - shap=0.42.1=py310h1128e8f_0 73 | - six=1.16.0=pyhd3eb1b0_1 74 | - slicer=0.0.7=pyhd8ed1ab_0 75 | - sqlite=3.41.2=h5eee18b_0 76 | - stack_data=0.2.0=pyhd3eb1b0_0 77 | - tbb=2021.8.0=hdb19cb5_0 78 | - threadpoolctl=3.2.0=pyha21a80b_0 79 | - tk=8.6.12=h1ccaba5_0 80 | - tornado=6.3.2=py310h5eee18b_0 81 | - tqdm=4.66.1=pyhd8ed1ab_0 82 | - traitlets=5.7.1=py310h06a4308_0 83 | - tzdata=2023c=h04d1e81_0 84 | - wcwidth=0.2.5=pyhd3eb1b0_0 85 | - wheel=0.41.2=py310h06a4308_0 86 | - xz=5.4.2=h5eee18b_0 87 | - zeromq=4.3.4=h2531618_0 88 | - zlib=1.2.13=h5eee18b_0 89 | - pip: 90 | - certifi==2023.7.22 91 | - charset-normalizer==3.3.0 92 | - filelock==3.12.4 93 | - fsspec==2023.9.2 94 | - idna==3.4 95 | - jinja2==3.1.2 96 | - markupsafe==2.1.3 97 | - mpmath==1.3.0 98 | - networkx==3.1 99 | - numpy==1.24.0 100 | - nvidia-cublas-cu12==12.1.3.1 101 | - nvidia-cuda-cupti-cu12==12.1.105 102 | - nvidia-cuda-nvrtc-cu12==12.1.105 103 | - nvidia-cuda-runtime-cu12==12.1.105 104 | - nvidia-cudnn-cu12==8.9.2.26 105 | - nvidia-cufft-cu12==11.0.2.54 106 | - nvidia-curand-cu12==10.3.2.106 107 | - nvidia-cusolver-cu12==11.4.5.107 108 | - nvidia-cusparse-cu12==12.1.0.106 109 | - nvidia-nccl-cu12==2.18.1 110 | - nvidia-nvjitlink-cu12==12.2.140 111 | - 
nvidia-nvtx-cu12==12.1.105 112 | - pillow==10.0.1 113 | - requests==2.31.0 114 | - sympy==1.12 115 | - torch==2.1.0 116 | - torchaudio==2.1.0 117 | - torchvision==0.16.0 118 | - triton==2.1.0 119 | - typing-extensions==4.8.0 120 | - urllib3==2.0.6 121 | prefix: /zfs/durenlab/palmetto/Kaya/env/conti1 122 | -------------------------------------------------------------------------------- /doc/driver.md: -------------------------------------------------------------------------------- 1 | ## Driver Score 2 | We identify driver TFs underlying epigenetic and transcriptomics change between control and AUD using a correlation model. We normalized the GRN and then calculated the Pearson Correlation Coefficient (PCC) between expression or chromatin accessibility fold change and the regulatory strength of TGs or REs for each TF. 3 | ### Transcriptomics driver score 4 | ```python 5 | import pandas as pd 6 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0) 7 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove the mitochondrion, if the species is mouse, replace 'MT-' with 'mt-' 8 | import scanpy as sc 9 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad') 10 | label_all = adata_RNA.obs[['barcode','sample','label']] 11 | label_all.index = label_all['barcode'] 12 | metadata = label_all.loc[TG_pseudobulk.columns] 13 | metadata.columns = ['barcode','group','celltype'] 14 | GRN='trans_regulatory' 15 | adjust_method='bonferroni' 16 | corr_method='pearsonr' 17 | import numpy as np 18 | C_result_RNA_sp,P_result_RNA_sp,Q_result_RNA_sp=driver_score(TG_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method) 19 | K=3 # We choose the top K positive and negative TFs to save to the txt file for visualization purposes. 
20 | C_result_RNA_sp_r,Q_result_RNA_sp_r=driver_result(C_result_RNA_sp,Q_result_RNA_sp,K) 21 | C_result_RNA_sp_r.to_csv('C_result_RNA_sp_r.txt',sep='\t') 22 | Q_result_RNA_sp_r.to_csv('Q_result_RNA_sp_r.txt',sep='\t') 23 | ``` 24 | The adjust_method parameter sets the p-value adjustment method. Choose one of the following: 25 | - bonferroni : one-step correction 26 | - sidak : one-step correction 27 | - holm-sidak : step down method using Sidak adjustments 28 | - holm : step-down method using Bonferroni adjustments 29 | - simes-hochberg : step-up method (independent) 30 | - hommel : closed method based on Simes tests (non-negative) 31 | - fdr_bh : Benjamini/Hochberg (non-negative) 32 | - fdr_by : Benjamini/Yekutieli (negative) 33 | - fdr_tsbh : two stage fdr correction (non-negative) 34 | - fdr_tsbky : two stage fdr correction (non-negative) 35 | ### Visualization 36 | ```python 37 | import os 38 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path 39 | import rpy2.robjects as robjects 40 | from rpy2.robjects import r 41 | # Import the R plotting package (ggplot2 as an example) 42 | r('library(ggplot2)') 43 | r('library(grid)') 44 | # Create data in R environment through Python 45 | r(''' 46 | dataP=read.table('Q_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE) 47 | dataT=read.table('C_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE) 48 | sort_TF=rownames(dataT) 49 | library(tidyr) 50 | dataP=-log10(dataP) 51 | print(paste0('maximum of -log10P:',max(dataP))) 52 | dataP[dataP>40]=40 53 | dataP1=dataP 54 | dataP1$TF=rownames(dataP) 55 | longdiff0 <- gather(dataP1, sample, value,-TF) 56 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ] 57 | dataT1=dataT 58 | dataT1$TF=rownames(dataT) 59 | longdiff1=gather(dataT1, sample, value,-TF) 60 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ] 61 | colnames(longdiff1)=c('TF','celltype','PCC') 62 | longdiff1$P=longdiff0_s$value 63 | 
longdiff1$TF=factor(longdiff1$TF,levels=rev(sort_TF)) 64 | library(egg) 65 | limits0=c(2,ceiling(dataP)) 66 | range0 = c(1,4) 67 | breaks0 = c(2,(ceiling(dataP)-2)*1/4+2,(ceiling(dataP)-2)*2/4+2,(ceiling(dataP)-2)*3/4+2,ceiling(dataP)) 68 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+ 69 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) + 70 | scale_size_continuous(limits = c(4, 40), range = c(1,5), breaks = c(4,10,20,30,40)) + 71 | labs( x= "cell type", y = "TF", fill = "") + theme_article()+ 72 | theme(legend.key=element_blank(), 73 | axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1), 74 | legend.position = "right") + 75 | scale_fill_gradient2(midpoint=0, low="blue", mid="white", 76 | high="red", space ="Lab" ) 77 | 78 | 79 | ''') 80 | ``` 81 | The figure is saved to driver_trans.pdf. 82 |
83 | Image 84 |
85 | 86 | ### Epigenetic driver score 87 | ```python 88 | RE_pseudobulk=pd.read_csv('data/RE_pseudobulk.tsv',sep=',',header=0,index_col=0) 89 | K=5 90 | GRN='TF_RE_binding' 91 | adjust_method='bonferroni' 92 | corr_method='pearsonr' 93 | C_result_RE,P_result_RE,Q_result_RE=driver_score(RE_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method) 94 | C_result_RE_r,Q_result_RE_r=driver_result(C_result_RE,Q_result_RE,K) 95 | C_result_RE_r.to_csv('C_result_RE_r.txt',sep='\t') 96 | Q_result_RE_r.to_csv('Q_result_RE_r.txt',sep='\t') 97 | ``` 98 | ### Visualization 99 | ```python 100 | import os 101 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path 102 | import rpy2.robjects as robjects 103 | from rpy2.robjects import r 104 | # Import the R plotting package (ggplot2 as an example) 105 | r('library(ggplot2)') 106 | r('library(grid)') 107 | # Create data in R environment through Python 108 | r(''' 109 | dataP=read.table('Q_result_RE_r.txt',sep='\t',row.names=1,header=TRUE) 110 | dataT=read.table('C_result_RE_r.txt',sep='\t',row.names=1,header=TRUE) 111 | sort_TF=rownames(dataT) 112 | library(tidyr) 113 | dataP=-log10(dataP) 114 | print(paste0('maxinum of -log10P:',max(dataP))) 115 | maxP=100 116 | dataP[dataP>100]=100 117 | dataP1=dataP 118 | dataP1$TF=rownames(dataP) 119 | longdiff0 <- gather(dataP1, sample, value,-TF) 120 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ] 121 | dataT1=dataT 122 | dataT1$TF=rownames(dataT) 123 | longdiff1=gather(dataT1, sample, value,-TF) 124 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ] 125 | colnames(longdiff1)=c('TF','celltype','PCC') 126 | longdiff1$P=longdiff0_s$value 127 | longdiff1$TF=factor(longdiff1$TF,levels=(sort_TF)) 128 | library(egg) 129 | cutoff=2 130 | maxp=ceiling(max(dataP)) 131 | print(maxp) 132 | limits0=c(cutoff,maxp) 133 | print(limits0) 134 | range0 = c(1,4) 135 | numbreak=5 136 | d=ceiling((maxp-cutoff)/(numbreak-1)) 137 
| print(d) 138 | breaks0= seq(from = cutoff, to = cutoff+d*(numbreak-1), length.out=numbreak) 139 | 140 | print(range0) 141 | print(breaks0) 142 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+ 143 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) + 144 | scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) + 145 | labs( x= "cell type", y = "TF", fill = "") + theme_article()+ 146 | theme(legend.key=element_blank(), 147 | axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1), 148 | legend.position = "right") + 149 | scale_fill_gradient2(midpoint=0, low="blue", mid="white", 150 | high="red", space ="Lab" ) 151 | 152 | 153 | ''') 154 | r(''' 155 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1) 156 | library(pheatmap) 157 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ]) 158 | colnames(anno1)=c('TG') 159 | rownames(anno1)=rownames(dataP) 160 | anno1[is.na(anno1)]=0 161 | anno1[,1]=paste0('M',anno1[,1]) 162 | anno1$name=rownames(anno1) 163 | longdiff0 <- gather(anno1, sample, value,-name) 164 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1)) 165 | library("RColorBrewer") 166 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb', 167 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d') 168 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d') 169 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) + 170 | geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) + 171 | theme_article()+theme(text = element_text(size = 9), legend.position = "left", 172 | ) 173 | ''') 174 | r(''' 175 | widths <- c(4.5, 4.5+dim(dataP)[2]) 176 | print(unique(longdiff0$value)) 177 | pdf('driver_epi.pdf',width=6/16*(9+dim(dataP)[2]),height= 
dim(anno1)[1]/10+1.5) 178 | #print(p) 179 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths) 180 | dev.off() 181 | ''') 182 | ``` 183 | The figure is saved to driver_epi.pdf. 184 |
185 | Image 186 |
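The correlation model described at the top of this file can be illustrated with a small sketch. It is not the `driver_score` implementation: the normalization step and the p-value adjustment are left out, and the gene names, TF names, and numbers below are invented for illustration. It only shows the core idea, the Pearson correlation between a fold-change vector and each TF's regulatory strengths over the same target genes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy inputs (not LINGER output): fold change per target gene,
# and a TG x TF matrix of regulatory strengths.
fold_change = pd.Series([1.0, 2.0, 3.0, 4.0], index=["G1", "G2", "G3", "G4"])
grn = pd.DataFrame(
    {"TF_A": [0.1, 0.2, 0.3, 0.4], "TF_B": [0.9, 0.1, 0.8, 0.2]},
    index=fold_change.index,
)

def pcc_per_tf(grn, fold_change):
    """Pearson correlation between each TF column and the fold-change vector."""
    fc = fold_change.loc[grn.index].to_numpy()
    return grn.apply(lambda col: np.corrcoef(col.to_numpy(), fc)[0, 1])

scores = pcc_per_tf(grn, fold_change)
print(scores)
```

A strongly positive score means the TF's strongest targets are the genes that go up the most. In this toy case `TF_A` tracks the fold change exactly and `TF_B` is negatively correlated with it.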
187 | -------------------------------------------------------------------------------- /docs/ATAC.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ATAC.png -------------------------------------------------------------------------------- /docs/Benchmark.md: -------------------------------------------------------------------------------- 1 | # Benchmark gene regulatory network 2 | The example input we provide is an in-silico mixture of H1, BJ, GM12878, and K562 cell lines from SNARE-seq data. We benchmark the GRN of the H1 cell line as an example. 3 | 4 | ## Download the ground truth data 5 | The ground truth data are the putative TF targets that [Cistrome](http://cistrome.org/) derives from ChIP-seq data. Here, we take the ChIP-seq data of E2F6 in the H1 cell line as an example. We download the ground truth from Cistrome as follows (46177_gene_score_5fold.txt). 6 |
7 | Image 8 |
9 | ## Prepare GRN file 10 | You can provide a list of predicted GRN files. Two formats are supported: 11 | - list: three columns, namely target gene (TG), TF, and regulation strength (score) 12 |
13 | Image 14 |
15 | - matrix: rows are gene names, columns are TF names, and values are regulation strengths. 16 | 17 | Here, we give cell type specific and cell population GRN files as examples (both in the 'matrix' format described above). 18 | 19 | ## ROC and PR curve 20 | ```python 21 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' 22 | TFName = 'MYC' 23 | Method_name=['H1','cell population'] 24 | Infer_trans=[outdir+'cell_type_specific_trans_regulatory_0.txt',outdir+'cell_population_trans_regulatory.txt'] 25 | Groundtruth='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/groundtruth/45691_gene_score_5fold.txt' 26 | from LingerGRN import Benchmk 27 | Benchmk.bm_trans(TFName,Method_name,Groundtruth,Infer_trans,outdir,'matrix') 28 | ``` 29 |
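The 'list' and 'matrix' layouts described under "Prepare GRN file" carry the same information, so one can be converted into the other. A minimal pandas sketch, assuming hypothetical column names `TG`, `TF`, and `score` and toy values that are not taken from any LINGER output:

```python
import pandas as pd

# Hypothetical toy GRN in 'list' format: target gene, TF, score.
grn_list = pd.DataFrame(
    {
        "TG": ["GENE1", "GENE1", "GENE2"],
        "TF": ["MYC", "E2F6", "MYC"],
        "score": [0.8, 0.1, 0.3],
    }
)

# list -> matrix: rows are target genes, columns are TFs.
# TG/TF pairs that are absent from the list are filled with 0.
grn_matrix = grn_list.pivot(index="TG", columns="TF", values="score").fillna(0.0)

# matrix -> list: one row per (TG, TF) pair, including the zero-filled ones.
grn_back = grn_matrix.reset_index().melt(
    id_vars="TG", var_name="TF", value_name="score"
)

print(grn_matrix)
print(grn_back)
```

Filling missing pairs with 0 is a choice made for this sketch; whether a missing edge should count as zero strength depends on how the GRN was produced.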
30 | Image 31 |
32 |
33 | Image 34 |
35 | 36 | The result will be automatically saved in the outdir with name trans_roc_curve+TFName+.pdf and trans_pr_curve+TFName+.pdf. 37 | -------------------------------------------------------------------------------- /docs/GRN_head.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/GRN_head.png -------------------------------------------------------------------------------- /docs/GRN_infer.md: -------------------------------------------------------------------------------- 1 | # Construct the gene regulatory network 2 | # Load data 3 | ## Download the general gene regulatory network 4 | We provide the general gene regulatory network. Please download the data first. 5 | ```sh 6 | Datadir=/path/to/LINGER/ # directory to store the data; please use an absolute path. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/ 7 | mkdir $Datadir 8 | cd $Datadir 9 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1lAlzjU5BYbpbr4RHMlAGDOh9KWdCMQpS' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1lAlzjU5BYbpbr4RHMlAGDOh9KWdCMQpS" -O data_bulk.tar.gz && rm -rf /tmp/cookies.txt 10 | ``` 11 | Then unpack the archive: 12 | ```sh 13 | tar -xzf data_bulk.tar.gz 14 | ``` 15 | 16 | ## Input data 17 | The input data should be in anndata format. In this example, we need to convert the following data types to anndata. 18 | - Single-cell multiome data including gene expression (RNA.txt in our example) and chromatin accessibility (ATAC.txt in our example). 19 | - Cell annotation/cell type label if you need the cell type specific gene regulatory network (label.txt in our example). 
20 | ### RNA-seq 21 | Rows are gene symbols, columns are cell barcodes, and values are read counts. Here is our example: 22 |
23 | Image 24 |
25 | 26 | ### ATAC-seq 27 | Rows are regulatory elements/genomic regions, columns are cell barcodes in the same order as in the RNA-seq data, and values are read counts. Here is our example: 28 |
29 | Image 30 |
31 | 32 | ### Cell annotation/cell type label 33 | Rows are cell barcodes, in the same order as in the RNA-seq data; there is one column, 'Annotation', which holds the cell type label. The label can be a number or a string. Here is our example: 34 |
35 | Image 36 |
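Because the three inputs must share one barcode order, a quick consistency check before building the anndata objects can catch mismatches early. The sketch below is an illustration only: the toy tables, the peak naming style, and the helper `check_inputs` are hypothetical, and it assumes the barcodes are held in the index of the label table.

```python
import pandas as pd

# Hypothetical toy inputs following the layout described above.
barcodes = ["cell1", "cell2", "cell3"]
RNA = pd.DataFrame([[3, 0, 1], [0, 2, 5]],
                   index=["GENE1", "GENE2"], columns=barcodes)
ATAC = pd.DataFrame([[1, 0, 0], [0, 4, 2]],
                    index=["chr1:100-600", "chr2:50-550"], columns=barcodes)
label = pd.DataFrame({"Annotation": ["0", "1", "0"]}, index=barcodes)

def check_inputs(RNA, ATAC, label):
    """Raise ValueError if the barcode order differs between the inputs."""
    if list(RNA.columns) != list(ATAC.columns):
        raise ValueError("RNA and ATAC barcodes differ or are ordered differently")
    if list(RNA.columns) != list(label.index):
        raise ValueError("label rows do not follow the RNA barcode order")
    if "Annotation" not in label.columns:
        raise ValueError("label needs an 'Annotation' column")
    return True

print(check_inputs(RNA, ATAC, label))
```

If the check fails only because of ordering, reindexing the ATAC columns and the label rows by `RNA.columns` is usually enough to align them.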
37 | 38 | ### Provided input example 39 | You can download the example input datasets into a certain directory. This sc multiome data of an in-silico mixture of H1, BJ, GM12878, and K562 cell lines from droplet-based single-nucleus chromatin accessibility and mRNA expression sequencing (SNARE-seq) data. 40 | 41 | ```sh 42 | Input_dir=/path/to/dir/ 43 | # The input data directory. Example: Input_dir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/ 44 | cd $Input_dir 45 | #ATAC-seq 46 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1qmMudeixeRbYS8LCDJEuWxlAgeM0hC1r' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1qmMudeixeRbYS8LCDJEuWxlAgeM0hC1r" -O ATAC.txt && rm -rf /tmp/cookies.txt 47 | #RNA-seq 48 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1dP4ITjQZiVDa52xfDTo5c14f9H0MsEGK' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1dP4ITjQZiVDa52xfDTo5c14f9H0MsEGK" -O RNA.txt && rm -rf /tmp/cookies.txt 49 | #label 50 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZeEp5GnWfQJxuAY0uK9o8s_uAvFsNPI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZeEp5GnWfQJxuAY0uK9o8s_uAvFsNPI5" -O label.txt && rm -rf /tmp/cookies.txt 51 | ``` 52 | ## LINGER 53 | ### Install 54 | ```sh 55 | conda create -n LINGER python==3.10.0 56 | conda activate LINGER 57 | pip install LingerGRN==1.92 58 | conda install bioconda::bedtools #Requirement 59 | ``` 60 | ### Preprocess 61 | 
The method has 2 options: 62 | 1. baseline; 63 | ```python 64 | method='baseline' # this corresponds to bulkNN 65 | ``` 66 | 2. LINGER; 67 | ```python 68 | method='LINGER' 69 | ``` 70 | #### Transfer sc-multiome data to anndata 71 | ```python 72 | import pandas as pd 73 | label=pd.read_csv('data/label.txt',sep='\t',header=0,index_col=None) 74 | RNA=pd.read_csv('data/RNA.txt',sep='\t',header=0,index_col=0) 75 | ATAC=pd.read_csv('data/ATAC.txt',sep='\t',header=0,index_col=0) 76 | from scipy.sparse import csc_matrix 77 | # Convert the NumPy array to a sparse csc_matrix 78 | matrix = csc_matrix(pd.concat([RNA,ATAC],axis=0).values) 79 | features=pd.DataFrame(RNA.index.tolist()+ATAC.index.tolist(),columns=[1]) 80 | K=RNA.shape[0] 81 | N=K+ATAC.shape[0] 82 | types = ['Gene Expression' if i < K else 'Peaks' for i in range(0, N)] # rows 0..K-1 are genes, the rest are peaks 83 | features[2]=types 84 | barcodes=pd.DataFrame(RNA.columns.values,columns=[0]) 85 | from LingerGRN.preprocess import * 86 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC 87 | ``` 88 | #### Remove low-count cells and genes 89 | ```python 90 | import scanpy as sc 91 | sc.pp.filter_cells(adata_RNA, min_genes=200) 92 | sc.pp.filter_genes(adata_RNA, min_cells=3) 93 | sc.pp.filter_cells(adata_ATAC, min_genes=200) 94 | sc.pp.filter_genes(adata_ATAC, min_cells=3) 95 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values)) 96 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values) 97 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]] 98 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values) 99 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]] 100 | ``` 101 | #### Generate the pseudo-bulk/metacell 102 | ```python 103 | from LingerGRN.pseudo_bulk import * 104 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # 
sample is generated from cell barcode 105 | tempsample=samplelist[0] 106 | TG_pseudobulk=pd.DataFrame([]) 107 | RE_pseudobulk=pd.DataFrame([]) 108 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100) 109 | for tempsample in samplelist: 110 | adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample] 111 | adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample] 112 | TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk) 113 | TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1) 114 | RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1) 115 | RE_pseudobulk[RE_pseudobulk > 100] = 100 116 | 117 | import os 118 | if not os.path.exists('data/'): 119 | os.mkdir('data/') 120 | adata_ATAC.write('data/adata_ATAC.h5ad') 121 | adata_RNA.write('data/adata_RNA.h5ad') 122 | TG_pseudobulk=TG_pseudobulk.fillna(0) 123 | RE_pseudobulk=RE_pseudobulk.fillna(0) 124 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None) 125 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv') 126 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv') 127 | ``` 128 | ### Training model 129 | Overlap the region with general GRN: 130 | ```python 131 | from LingerGRN.preprocess import * 132 | Datadir='/path/to/LINGER/'# This directory should be the same as Datadir defined in the above 'Download the general gene regulatory network' section 133 | GRNdir=Datadir+'data_bulk/' 134 | genome='hg38' 135 | outdir='/path/to/output/' #output dir 136 | preprocess(TG_pseudobulk,RE_pseudobulk,GRNdir,genome,method,outdir) 137 | ``` 138 | Train for the LINGER model. 
139 | ```python 140 | import LingerGRN.LINGER_tr as LINGER_tr 141 | activef='ReLU' # activation function, chosen from 'ReLU','sigmoid','tanh' 142 | LINGER_tr.training(GRNdir,method,outdir,activef) 143 | ``` 144 | ### Cell population gene regulatory network 145 | #### TF binding potential 146 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score. 147 | ```python 148 | import LingerGRN.LL_net as LL_net 149 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir) 150 | ``` 151 | 152 | #### *cis*-regulatory network 153 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score. 154 | ```python 155 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir) 156 | ``` 157 | #### *trans*-regulatory network 158 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score. 159 | ```python 160 | LL_net.trans_reg(GRNdir,method,outdir) 161 | ``` 162 | 163 | ### Cell type specific gene regulatory network 164 | There are 2 options: 165 | 1. infer the GRN for one specific cell type listed in label.txt; 166 | ```python 167 | celltype='0' # use a string to specify your cell type 168 | ``` 169 | 2. infer GRNs for all cell types. 170 | ```python 171 | celltype='all' 172 | ``` 173 | Please make sure that 'all' is not a cell type in your data. 174 | 175 | #### TF binding potential 176 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential. 177 | ```python 178 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir) 179 | ``` 180 | 181 | #### *cis*-regulatory network 182 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score. 
183 | ```python 184 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir) 185 | ``` 186 | 187 | #### *trans*-regulatory network 188 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score. 189 | ```python 190 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir) 191 | ``` 192 | 193 | ## Note 194 | - The cell type specific GRN is based on the output of the cell population GRN. 195 | - To try both method options, use 2 separate output directories. 196 | ## Identify driver regulators by TF activity 197 | ### Instruction 198 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators. 199 | 200 | ### Prepare 201 | We need a *trans*-regulatory network. Choose the network that best matches your data. 202 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can choose a GRN from various tissues. 203 | ```python 204 | network = 'general' 205 | ``` 206 | 2. If your gene expression data match the cell population GRN, you can set 207 | ```python 208 | network = 'cell population' 209 | ``` 210 | 3. If your gene expression data match a certain cell type, you can set network to the name of that cell type. 211 | ```python 212 | network = '0' # 0 is the name of one cell type 213 | ``` 214 | 215 | ### Calculate TF activity 216 | The input is gene expression data. It can be the scRNA-seq data from the sc multiome data, or other sc or bulk RNA-seq data that match the GRN. 
Rows of the gene expression data are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk). 217 | 218 | ```python 219 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/'# this directory should be the same as the Datadir used to download the general GRN 220 | GRNdir=Datadir+'data_bulk/' 221 | genome='hg38' 222 | from LingerGRN.TF_activity import * 223 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir 224 | import anndata 225 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad') 226 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome) 227 | ``` 228 | Visualize the TF activity heatmap by cluster. If you want to save the heatmap to outdir, please set 'save=True'. The output is 'heatmap_activity.png'. 229 | ```python 230 | save=True 231 | heatmap_cluster(TF_activity,adata_RNA,save,outdir) 232 | ``` 233 |
234 | Image 235 |
236 | 237 | ### Identify driver regulator 238 | We use a t-test on TF activity to find the TFs that differ in a certain cell type. 239 | 1. You can select one cell type of the gene expression data with 240 | ```python 241 | celltype='0' 242 | ``` 243 | 2. Or, you can obtain the result for all cell types. 244 | ```python 245 | celltype='all' 246 | ``` 247 | 248 | For example, 249 | 250 | ```python 251 | celltype='0' 252 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype) 253 | t_test_results 254 | ``` 255 | 256 |
257 | Image 258 |
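The comparison behind this table can be sketched as a two-sample t-test per TF: activity in the chosen cell type versus activity in all other cells. The block below is a minimal illustration using only the Python standard library. The toy values, the helper name `welch_t`, and the choice of Welch's (unequal variance) statistic are assumptions made for illustration; this is not a description of how `master_regulator` is implemented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Toy TF activity per cell (hypothetical values) and the cell type label of each cell.
activity = {
    'ATF1': [2.1, 1.9, 2.3, 0.4, 0.6, 0.5],
    'ERG':  [0.5, 0.7, 0.6, 0.6, 0.5, 0.7],
}
labels = ['0', '0', '0', '1', '1', '2']
celltype = '0'

t_values = {}
for tf, values in activity.items():
    in_type = [v for v, lab in zip(values, labels) if lab == celltype]
    others = [v for v, lab in zip(values, labels) if lab != celltype]
    t_values[tf] = welch_t(in_type, others)

# Rank TFs by t statistic: large positive values suggest higher activity in `celltype`.
ranked = sorted(t_values, key=t_values.get, reverse=True)
print(ranked)
```

In this toy data ATF1 separates the two groups clearly while ERG does not, so ATF1 ranks first. A real analysis would also convert the statistic to a p-value and correct for testing many TFs.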
259 | 260 | Visualize the differential activity and expression. You can compare two cell types with each other, or one cell type against all other cells. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'. 261 | 262 | ```python 263 | TFName='ATF1' 264 | datatype='activity' 265 | celltype1='0' 266 | celltype2='Others' 267 | save=True 268 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir) # TF_activity is the output of regulon() above 269 | ``` 270 | 271 |
272 | Image 273 |
274 | 275 | For gene expression data, the boxplot is: 276 | ```python 277 | datatype='expression' 278 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir) # TF_activity is the output of regulon() above 279 | ``` 280 | 281 |
282 | Image 283 |
284 | -------------------------------------------------------------------------------- /docs/H1_E2F6_trans.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/H1_E2F6_trans.jpg -------------------------------------------------------------------------------- /docs/LINGER.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/LINGER.png -------------------------------------------------------------------------------- /docs/PBMC.md: -------------------------------------------------------------------------------- 1 | # PBMCs Tutorial 2 | ## Instruction 3 | This tutorial delineates a computational framework for constructing gene regulatory networks (GRNs) from single-cell multiome data. We provide 2 options to do this: '**baseline**' and '**LINGER**'. The first is a naive method that combines the prior GRNs with features from the single-cell data, offering a rapid approach. LINGER integrates the comprehensive gene regulatory profile from external bulk data. As shown in the following figure, LINGER uses lifelong machine learning (continuous learning) based on neural network (NN) models, which leverages the knowledge learned in previous tasks to help learn a new task better. 4 |
5 | Image 6 |
7 | 8 | After constructing the GRNs for the cell population, we infer the cell type specific one using the feature engineering approach. As shown in the following figure, we combine the single cell data ($O, E$, and $C$ in the figure) and the prior gene regulatory network structure with the parameters $\alpha,\beta,d,B$, and $\gamma$. 9 | 10 | ![Image Alt Text](feature_engineering.jpg) 11 | 12 | ## Download the general gene regulatory network 13 | We provide the general gene regulatory network; please download the data first. 14 | ```sh 15 | Datadir=/path/to/LINGER/ # directory to store the data; please use an absolute path. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/ 16 | mkdir -p "$Datadir" 17 | cd "$Datadir" 18 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b" -O data_bulk.tar.gz && rm -rf /tmp/cookies.txt 19 | ``` 20 | or use the following link: [https://drive.google.com/file/d/1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b/view?usp=sharing](https://drive.google.com/file/d/1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b/view?usp=sharing) 21 | 22 | Then extract the archive: 23 | ```sh 24 | tar -xzf data_bulk.tar.gz 25 | ``` 26 | 27 | ## Prepare the input data 28 | The input data are the feature matrix from 10x sc-multiome data and the cell annotation/cell type labels: 29 | - Single-cell multiome data including matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz. 30 | - Cell annotation/cell type label if you need the cell type-specific gene regulatory network (PBMC_label.txt in our example). 31 |
32 | Image 33 |
34 | 35 | If the input data is 10X h5 file or h5ad file from scanpy, please follow the instruction [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md) . 36 | 37 | ### sc data 38 | We download the data using shell command line. 39 | ```sh 40 | mkdir -p data 41 | wget -O data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz 42 | tar -xzvf data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz 43 | mv filtered_feature_bc_matrix data/ 44 | gzip -d data/filtered_feature_bc_matrix/* 45 | ``` 46 | We provide the cell annotation as following: 47 | ```sh 48 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_" -O PBMC_label.txt && rm -rf /tmp/cookies.txt 49 | mv PBMC_label.txt data/ 50 | ``` 51 | ## LINGER 52 | ### Install 53 | ```sh 54 | conda create -n LINGER python==3.10.0 55 | conda activate LINGER 56 | pip install LingerGRN==1.105 57 | conda install bioconda::bedtools #Requirement 58 | ``` 59 | For the following step, we run the code in python. 60 | ### Preprocess 61 | There are 2 options for the method we introduced above: 62 | 1. baseline; 63 | ```python 64 | method='baseline' # this method is corresponding to bulkNN described in the paper 65 | ``` 66 | 2. LINGER; 67 | ```python 68 | method='LINGER' 69 | ``` 70 | #### Transfer the sc-multiome data to anndata 71 | 72 | We will transfer sc-multiome data to the anndata format and filter the cell barcode by the cell type label. 
73 | ```python 74 | import scanpy as sc 75 | #set some figure parameters for nice display inside jupyternotebooks. 76 | %matplotlib inline 77 | sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white') 78 | sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3) 79 | sc.logging.print_header() 80 | #results_file = "scRNA/pbmc10k.h5ad" 81 | import scipy 82 | import pandas as pd 83 | matrix=scipy.io.mmread('data/filtered_feature_bc_matrix/matrix.mtx') 84 | features=pd.read_csv('data/filtered_feature_bc_matrix/features.tsv',sep='\t',header=None) 85 | barcodes=pd.read_csv('data/filtered_feature_bc_matrix/barcodes.tsv',sep='\t',header=None) 86 | label=pd.read_csv('data/PBMC_label.txt',sep='\t',header=0) 87 | from LingerGRN.preprocess import * 88 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC 89 | ``` 90 | #### Remove low counts cells and genes 91 | ```python 92 | import scanpy as sc 93 | sc.pp.filter_cells(adata_RNA, min_genes=200) 94 | sc.pp.filter_genes(adata_RNA, min_cells=3) 95 | sc.pp.filter_cells(adata_ATAC, min_genes=200) 96 | sc.pp.filter_genes(adata_ATAC, min_cells=3) 97 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values)) 98 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values) 99 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]] 100 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values) 101 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]] 102 | ``` 103 | #### Generate the pseudo-bulk/metacell: 104 | ```python 105 | from LingerGRN.pseudo_bulk import * 106 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode 107 | tempsample=samplelist[0] 108 | TG_pseudobulk=pd.DataFrame([]) 109 | RE_pseudobulk=pd.DataFrame([]) 110 | singlepseudobulk = 
(adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100) 111 | for tempsample in samplelist: 112 | adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample] 113 | adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample] 114 | TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk) 115 | TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1) 116 | RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1) 117 | RE_pseudobulk[RE_pseudobulk > 100] = 100 118 | 119 | import os 120 | if not os.path.exists('data/'): 121 | os.mkdir('data/') 122 | adata_ATAC.write('data/adata_ATAC.h5ad') 123 | adata_RNA.write('data/adata_RNA.h5ad') 124 | TG_pseudobulk=TG_pseudobulk.fillna(0) 125 | RE_pseudobulk=RE_pseudobulk.fillna(0) 126 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None) 127 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv') 128 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv') 129 | ``` 130 | ### Training model 131 | Overlap the region with general GRN: 132 | ```python 133 | from LingerGRN.preprocess import * 134 | Datadir='/path/to/LINGER/'# This directory should be the same as Datadir defined in the above 'Download the general gene regulatory network' section 135 | GRNdir=Datadir+'data_bulk/' 136 | genome='hg38' 137 | outdir='/path/to/output/' #output dir 138 | preprocess(TG_pseudobulk,RE_pseudobulk,GRNdir,genome,method,outdir) 139 | ``` 140 | Train for the LINGER model. 141 | ```python 142 | import LingerGRN.LINGER_tr as LINGER_tr 143 | activef='ReLU' # active function chose from 'ReLU','sigmoid','tanh' 144 | LINGER_tr.training(GRNdir,method,outdir,activef,'Human') 145 | ``` 146 | 147 | 148 | ### Cell population gene regulatory network 149 | #### TF binding potential 150 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score. 
151 | ```python 152 | import LingerGRN.LL_net as LL_net 153 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir) 154 | ``` 155 | 156 | #### *cis*-regulatory network 157 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score. 158 | ```python 159 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir) 160 | ``` 161 | #### *trans*-regulatory network 162 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score. 163 | ```python 164 | LL_net.trans_reg(GRNdir,method,outdir,genome) 165 | ``` 166 | 167 | ### Cell type specific gene regulatory network 168 | There are 2 options: 169 | 1. infer GRN for a specific cell type, which is in the label.txt; 170 | ```python 171 | celltype='CD56 (bright) NK cells' #use a string to assign your cell type 172 | ``` 173 | 2. infer GRNs for all cell types. 174 | ```python 175 | celltype='all' 176 | ``` 177 | Please make sure that 'all' is not a cell type in your data. 178 | 179 | #### TF binding potential 180 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential. 181 | ```python 182 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version 183 | ``` 184 | 185 | #### *cis*-regulatory network 186 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score. 187 | ```python 188 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir) 189 | ``` 190 | 191 | #### *trans*-regulatory network 192 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score. 
193 | ```python 194 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir) 195 | ``` 196 | ## Identify driver regulators by TF activity 197 | ### Instruction 198 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators. 199 | 200 | ### Prepare 201 | You can choose the *trans*-regulatory network that matches your data best. 202 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can use the general GRN, which is derived from various tissues. 203 | ```python 204 | network = 'general' 205 | ``` 206 | 2. If your gene expression data match the cell population GRN, set 207 | ```python 208 | network = 'cell population' 209 | ``` 210 | 3. If your gene expression data match a certain cell type, set network to the name of that cell type. 211 | ```python 212 | network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type 213 | ``` 214 | 215 | ### Calculate TF activity 216 | The input is gene expression data. It can be the scRNA-seq data from the sc multiome data, or other sc or bulk RNA-seq data that match the GRN. Rows of the gene expression matrix are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
217 | 218 | ```python 219 | 220 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/'# should be the same Datadir used when downloading the general GRN 221 | GRNdir=Datadir+'data_bulk/' 222 | genome='hg38' 223 | from LingerGRN.TF_activity import * 224 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir 225 | import anndata 226 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad') 227 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome) 228 | ``` 229 | Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'. 230 | ```python 231 | save=True 232 | heatmap_cluster(TF_activity,adata_RNA,save,outdir) 233 | ``` 234 |
235 | Image 236 |
237 | 238 | ### Identify driver regulator 239 | We use a t-test on TF activity to find the TFs that are differentially active in a given cell type. 240 | 1. To test a single cell type, set celltype to its name in the gene expression data: 241 | ```python 242 | celltype='CD56 (bright) NK cells' 243 | ``` 244 | 2. Or obtain the results for all cell types: 245 | ```python 246 | celltype='all' 247 | ``` 248 | 249 | For example, 250 | 251 | ```python 252 | celltype='CD56 (bright) NK cells' 253 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype) 254 | t_test_results 255 | ``` 256 | 257 |
258 | Image 259 |
260 | 261 | Visualize the differential activity and expression. You can compare two cell types with each other, or one cell type against all other cells. To save the box plot to `outdir`, set `save=True`. The output is `box_plot_{TFName}_{datatype}_{celltype1}_{celltype2}.png`. 262 | 263 | ```python 264 | TFName='ATF1' 265 | datatype='activity' 266 | celltype1='CD56 (bright) NK cells' 267 | celltype2='Others' 268 | save=True 269 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir) 270 | ``` 271 | 272 |
273 | Image 274 |
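How the two groups in such a box plot are formed can be sketched as follows. The sketch reads `celltype2='Others'` as "every cell not labelled `celltype1`", while any other value selects that cell type directly. The helper name `split_groups` and the toy values are hypothetical, and the handling of 'Others' is an assumption based on the description above, not the `box_comp` source.

```python
def split_groups(values, labels, celltype1, celltype2):
    """Return the per-cell values of the two groups compared in a box plot.

    'Others' (assumed convention) means all cells whose label is not celltype1.
    """
    group1 = [v for v, lab in zip(values, labels) if lab == celltype1]
    if celltype2 == 'Others':
        group2 = [v for v, lab in zip(values, labels) if lab != celltype1]
    else:
        group2 = [v for v, lab in zip(values, labels) if lab == celltype2]
    return group1, group2

# Toy ATF1 activity per cell with hypothetical cell type labels.
values = [1.8, 2.0, 0.3, 0.5, 0.9]
labels = ['CD56 (bright) NK cells', 'CD56 (bright) NK cells',
          'B cells', 'B cells', 'Monocytes']

# One cell type against all other cells.
nk, rest = split_groups(values, labels, 'CD56 (bright) NK cells', 'Others')
# One cell type against another specific cell type.
nk_again, b_cells = split_groups(values, labels, 'CD56 (bright) NK cells', 'B cells')
print(len(nk), len(rest), len(b_cells))
```

The same split applies whether the plotted values are TF activity or gene expression; only the `values` vector changes.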
275 | 276 | For gene expression data, the boxplot is: 277 | ```python 278 | datatype='expression' 279 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir) 280 | ``` 281 | 282 |
283 | Image 284 |
285 | 286 | ## Note 287 | - The cell type specific GRN is based on the output of the cell population GRN. 288 | - To try both method options, create a separate output directory for each. 289 | 290 | 291 | -------------------------------------------------------------------------------- /docs/PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png -------------------------------------------------------------------------------- /docs/PBMCs_heatmap_activity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_heatmap_activity.png -------------------------------------------------------------------------------- /docs/PBMCs_ttest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_ttest.png -------------------------------------------------------------------------------- /docs/POU5F1_KO_Diff_Umap_NANOG.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Diff_Umap_NANOG.png -------------------------------------------------------------------------------- /docs/POU5F1_KO_Diff_exp_Umap_NANOG.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Diff_exp_Umap_NANOG.png -------------------------------------------------------------------------------- /docs/POU5F1_KO_Differentiation_Umap.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Differentiation_Umap.png -------------------------------------------------------------------------------- /docs/PWM.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PWM.jpg -------------------------------------------------------------------------------- /docs/RNA.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/RNA.png -------------------------------------------------------------------------------- /docs/RNA_ds.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/RNA_ds.jpg -------------------------------------------------------------------------------- /docs/S_TG.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/S_TG.png -------------------------------------------------------------------------------- /docs/TFactivity.md: -------------------------------------------------------------------------------- 1 | # Identify driver regulators by TF activity 2 | ## Instruction 3 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. 
By comparing TF activity between cases and controls, we identified driver regulators. 4 | 5 | ## Prepare 6 | We need a *trans*-regulatory network; choose the one that matches your data best. 7 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can use the general GRN, which is derived from various tissues. 8 | ```python 9 | network = 'general' 10 | ``` 11 | 2. If your gene expression data match the cell population GRN, set 12 | ```python 13 | network = 'cell population' 14 | ``` 15 | 3. If your gene expression data match a certain cell type, set network to the name of that cell type. 16 | ```python 17 | network = '0' # 0 is the name of one cell type 18 | ``` 19 | 20 | ## Calculate TF activity 21 | The input is gene expression data. It can be the scRNA-seq data from the sc multiome data, or other sc or bulk RNA-seq data that match the GRN. Rows of the gene expression matrix are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk). 22 | 23 | ```python 24 | 25 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/'# should be the same Datadir used when downloading the general GRN 26 | GRNdir=Datadir+'data_bulk/' 27 | genome='hg38' 28 | from LingerGRN.TF_activity import * 29 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir 30 | import anndata 31 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad') 32 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome) 33 | ``` 34 | Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'. 35 | ```python 36 | save=True 37 | heatmap_cluster(TF_activity,adata_RNA,save,outdir) 38 | ``` 39 |
40 | Image 41 |
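The three `network` choices from the Prepare step can be related to the trans-regulatory files named in the GRN inference tutorials. The helper below is a hypothetical illustration (the name `trans_network_file` is not part of LingerGRN); it only rebuilds the output file names quoted in those tutorials and makes no claim about how `regulon` locates its input.

```python
import os

def trans_network_file(network, outdir):
    """Map a `network` choice to the trans-regulatory file named in the tutorials.

    Returns None for 'general': that network ships with the downloaded bulk data,
    and its file name is not stated in this documentation.
    """
    if network == 'general':
        return None
    if network == 'cell population':
        return os.path.join(outdir, 'cell_population_trans_regulatory.txt')
    # Any other value is treated as a cell type name.
    return os.path.join(outdir, f'cell_type_specific_trans_regulatory_{network}.txt')

for choice in ['general', 'cell population', '0']:
    print(choice, '->', trans_network_file(choice, 'output'))
```

A check like this is a quick way to confirm that the cell population or cell type specific GRN has been generated in `outdir` before estimating TF activity from it.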
42 | 43 | ## Identify driver regulator 44 | We use a t-test on TF activity to find the TFs that are differentially active in a given cell type. 45 | 1. To test a single cell type, set celltype to its name in the gene expression data: 46 | ```python 47 | celltype='0' 48 | ``` 49 | 2. Or obtain the results for all cell types: 50 | ```python 51 | celltype='all' 52 | ``` 53 | 54 | For example, 55 | 56 | ```python 57 | celltype='0' 58 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype) 59 | t_test_results 60 | ``` 61 |
63 | Image 64 |
65 | 66 | Visualize the differential activity and expression. You can compare two cell types with each other, or one cell type against all other cells. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'. 67 | 68 | ```python 69 | TFName='ATF1' 70 | datatype='activity' 71 | celltype1='0' 72 | celltype2='Others' 73 | save=True 74 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir) # same call as in the PBMCs tutorial; adata_RNA and TF_activity are defined above 75 | ``` 76 | 77 |
78 | Image 79 |
80 | 81 | For gene expression data, the boxplot is: 82 | ```python 83 | datatype='expression' 84 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir) # same call as in the PBMCs tutorial; adata_RNA and TF_activity are defined above 85 | ``` 86 | 87 |
88 | Image 89 |
90 | -------------------------------------------------------------------------------- /docs/User_guide.md: -------------------------------------------------------------------------------- 1 | # LINGER 2 | LINGER (LIfelong neural Network for GEne Regulation) is an interpretable artificial intelligence model designed to analyze sc multiome data (RNA-seq count and chromatin accessibility count data) that can subsequently be used for many downstream tasks. 3 | 4 | The advantages of LINGER are: 5 | - incorporates large-scale external bulk data and neural networks 6 | - integrates TF-RE motif matching knowledge 7 | - achieves high accuracy of gene regulatory network (GRN) inference 8 | - enables the estimation of TF activity solely from gene expression data 9 | 10 | ## Method overview 11 | LINGER uses neural network architectures to model transcriptional regulation, predicting target gene (TG) expression levels from transcription factor (TF) and regulatory element (RE) profiles. The loss function is composed of 3 parts: 12 | - Prediction accuracy: mean squared error (MSE) with L1 regularization 13 | - Elastic weight consolidation (EWC) loss for lifelong learning, which incorporates large-scale external bulk data 14 | - Manifold regularization, which integrates prior biological knowledge of TF-RE interactions 15 | 16 |
17 | Image 18 |
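The three parts of the loss listed above can be written down in a few lines. The sketch below uses generic textbook forms: mean squared error with an L1 penalty, an EWC penalty that pulls parameters toward the values learned on bulk data (weighted by Fisher information), and a graph-based manifold penalty that keeps linked features close. The function name `composite_loss`, the weights `lam_*`, the flat parameter vector, and the exact form of each term are assumptions for illustration; the precise formulation used by LINGER is not given in this guide.

```python
def composite_loss(y_true, y_pred, theta, theta_bulk, fisher, edges,
                   lam_l1=0.01, lam_ewc=1.0, lam_manifold=0.1):
    """Sum of three generic penalty families (hypothetical weights).

    theta      : current parameters (flat list)
    theta_bulk : parameters learned on external bulk data
    fisher     : per-parameter Fisher information estimated on the bulk task
    edges      : (i, j, a_ij) links from a prior graph, e.g. TF-RE motif matches
    """
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    l1 = lam_l1 * sum(abs(w) for w in theta)
    # EWC: quadratic pull toward the bulk-trained parameters.
    ewc = 0.5 * lam_ewc * sum(f * (w - w0) ** 2
                              for f, w, w0 in zip(fisher, theta, theta_bulk))
    # Manifold term: small when parameters linked in the prior graph are similar.
    manifold = lam_manifold * sum(a * (theta[i] - theta[j]) ** 2 for i, j, a in edges)
    return mse + l1 + ewc + manifold

y_true, y_pred = [1.0, 2.0], [1.0, 2.0]
theta_bulk, fisher = [1.0, 1.0], [2.0, 0.5]
edges = [(0, 1, 1.0)]  # the two parameters are linked in the prior graph

loss_close = composite_loss(y_true, y_pred, [1.0, 1.0], theta_bulk, fisher, edges)
loss_moved = composite_loss(y_true, y_pred, [1.0, -1.0], theta_bulk, fisher, edges)
print(loss_close, loss_moved)
```

With identical predictions, moving the second parameter away from its bulk value and from its linked neighbour raises the loss through the EWC and manifold terms only, which is the intended effect of both penalties.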
19 | 20 | 21 | ## Neural network model structure 22 | LINGER trains individual models for each gene using a neural network architecture that includes a single input layer and two fully connected hidden layers. The input layer has dimension equal to the number of features, containing all TFs and REs within 1Mb from the transcription start site (TSS) for the gene to be predicted. The first hidden layer has 64 neurons with rectified linear unit (ReLU) activation. The second hidden layer has 16 neurons with ReLU activation. The output layer is a single neuron, which outputs a real value for gene expression prediction. 23 | 24 | ## Tasks 25 | ### Cell population gene regulatory network inference 26 | We use the average of the absolute Shapley values across samples to infer the regulation strength of TFs and REs on target genes, generating the RE-TG *cis*-regulatory strength and the TF-TG *trans*-regulatory strength. To generate the TF-RE binding strength, we use the weights from the input layer (TFs and REs) to all nodes in the second layer of the NN model as the embedding of the TF or RE. The TF-RE binding strength is calculated as the Pearson correlation coefficient (PCC) between the TF and RE embeddings. 27 | ### Cell type specific gene regulatory network 28 | After constructing the GRNs for the cell population, we infer the cell type specific one using the feature engineering approach. As shown in the following figure, we combine the single cell data ($O, E$, and $C$ in the figure) and the prior gene regulatory network structure with the parameters $\alpha,\beta,d,B$, and $\gamma$. 29 | 30 | ![Image Alt Text](feature_engineering.jpg) 31 | 32 | ### Benchmark gene regulatory network 33 | We systematically assess the performance of the GRN using two metrics: AUC and AUPR ratio.
34 | - *trans*: the ground truth include and knock down 35 | - *cis*: the ground truth is HiC and eQTL 36 | - TF-RE: the ground truth is ChIP-seq data from 37 | ### Transcription factor activity 38 | Assuming the GRN structure is consistent across individuals, we employ LINGER inferred GRNs from sc-multiome data of a single individual to estimate TF activity using gene expression data alone from the same or other individuals. By comparing TF activity between cases and controls, we identify driver regulators. 39 | ### In silico perturbation 40 | We predict gene expression when one TF, or several TFs together, are knocked out. We can then predict the target genes affected by the perturbation. 41 | ## Tutorials 42 | - [PBMCs](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md) 43 | - [H1 cell line](https://github.com/Durenlab/LINGER/blob/main/docs/GRN_infer.md) 44 | - [Non-human species](https://github.com/Durenlab/LINGER/blob/main/docs/scNN.md) 45 | -------------------------------------------------------------------------------- /docs/adata_ATAC.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/adata_ATAC.png -------------------------------------------------------------------------------- /docs/adata_RNA.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/adata_RNA.png -------------------------------------------------------------------------------- /docs/barcode_mm10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/barcode_mm10.png -------------------------------------------------------------------------------- /docs/box_plot_ATF1_activity_0_Others.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_activity_0_Others.png -------------------------------------------------------------------------------- /docs/box_plot_ATF1_expression_0_Others.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_expression_0_Others.png -------------------------------------------------------------------------------- /docs/box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png -------------------------------------------------------------------------------- /docs/box_plot_Erg_activity_1_Others.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_Erg_activity_1_Others.png -------------------------------------------------------------------------------- /docs/box_plot_Erg_expression_1_Others.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_Erg_expression_1_Others.png -------------------------------------------------------------------------------- /docs/downstream.md: -------------------------------------------------------------------------------- 1 | # Downstream analysis - Module detection 2 | ## Regulatory Module 3 | For this analysis, we first detect key TF-TG subnetworks (modules) from the cell population TF–TG trans-regulation. 
Then, we identify the regulatory modules that are differentially expressed between the case and control groups. 4 | ### Detect Module 5 | #### Input 6 | - pseudobulk gene expression: [TG_pseudobulk]; please make sure batch effects have been removed from the data 7 | - metadata including case and control in column 'group' and cell type annotation in column 'celltype': [metadata]. Note that the case is 1 and the control is 0. 8 | - LINGER outdir including a trans-regulatory network, 'cell_population_trans_regulatory.txt' 9 | - GWAS data file (optional). 10 | 11 | This is an example of the input. 12 | ```python 13 | import pandas as pd 14 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0) 15 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove mitochondrial genes; if the species is mouse, replace 'MT-' with 'mt-' 16 | import scanpy as sc 17 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad') 18 | label_all = adata_RNA.obs[['barcode','sample','label']] 19 | label_all.index = label_all['barcode'] 20 | metadata = label_all.loc[TG_pseudobulk.columns] 21 | metadata.columns = ['barcode','group','celltype'] 22 | outdir = 'output/' 23 | GWASfile = ['AUD_gene.txt','AUD_gene2.txt']# each GWAS file is a gene list with no header (optional) 24 | ``` 25 | ```python 26 | TG_pseudobulk 27 | ``` 28 |
29 | Image 30 |
31 | 32 | ```python 33 | metadata 34 | ``` 35 |
36 | Image 37 |
38 | 39 | #### Output 40 | ```python 41 | K=10 # K is the number of modules, a tuning parameter 42 | from LingerGRN import Compare 43 | Module_result=Compare.Module_trans(outdir,metadata,TG_pseudobulk,K,GWASfile) 44 | ``` 45 | The output is a Module_result object. There are 3 items in this object: 46 | - S_TG, which represents the module assigned to each gene; 47 | - pvalue_all, the p-value of the differential module t-test comparing the case and control groups; 48 | - tvalue_all, the t-value of the t-test; a positive value means group 1 is more active, and a negative value means group 0 is more active. 49 | ```python 50 | Module_result.S_TG 51 | ``` 52 |
53 | Image 54 |
55 | 56 | ```python 57 | Module_result.pvalue_all 58 | ``` 59 |
60 | Image 61 |
62 | 63 | ```python 64 | Module_result.tvalue_all 65 | ``` 66 |
67 | Image 68 |
69 | 70 | Save the result to files. 71 | ```python 72 | Module_result.pvalue_all.to_csv('pvalue_all.txt',sep='\t') 73 | Module_result.tvalue_all.to_csv('tvalue_all.txt',sep='\t') 74 | temp=Module_result.S_TG 75 | temp=temp[temp['Module']>0] 76 | temp.to_csv('Module.txt',sep='\t') 77 | ``` 78 | #### Visualize 79 | Please ensure that the r packages: ggplot2, grid, tidyr, egg are well-installed. Note that 'cutoff' is a parameter, representing the cutoff of -log10(p-value). We suggest 'cutoff = 2' as a default. 80 | ```python 81 | # Import the rpy2 components needed 82 | import os 83 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path 84 | import rpy2.robjects as robjects 85 | from rpy2.robjects import r 86 | # Import the R plotting package (ggplot2 as an example) 87 | r('library(ggplot2)') 88 | r('library(grid)') 89 | # Create data in R environment through Python 90 | r(''' 91 | library(tidyr) 92 | dataP=read.table('pvalue_all.txt',sep='\t',header=TRUE,row.names=1) 93 | dataT=read.table('tvalue_all.txt',sep='\t',header=TRUE,row.names=1) 94 | dataP=-log10(dataP) 95 | dataP$TF=rownames(dataP) 96 | dataT$TF=rownames(dataT) 97 | longdiff0 <- gather(dataP, sample, value,-TF) 98 | longdiff1 <- gather(dataT, sample, value,-TF) 99 | colnames(longdiff1)=c('TF','celltype','T') 100 | longdiff1$P=longdiff0$value 101 | longdiff1$TF=factor(longdiff1$TF,levels=rev(longdiff1$TF[1:10])) 102 | print(longdiff1[1:10,]) 103 | ''') 104 | # R code for plotting using ggplot 105 | r(''' 106 | cutoff=1 # here 107 | maxp=ceiling(max(longdiff1$P)) 108 | limits0=c(cutoff,maxp) 109 | range0=c(1,(maxp-cutoff+1))*4/(maxp-cutoff+1) 110 | breaks0=(cutoff+1):(maxp-1) 111 | print(limits0) 112 | print(range0) 113 | print(breaks0) 114 | ''') 115 | 116 | # Print the plot to display it 117 | r(''' 118 | library(ggplot2) 119 | library(egg) 120 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+ 121 | geom_point(aes(size = P, fill = T), alpha = 1, shape 
= 21) + 122 | scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) + 123 | labs( x= "cell type", y = "Module", fill = "") + theme_article()+ 124 | theme(legend.key=element_blank(), 125 | axis.text.x = element_text(colour = "black", size = 9, face = "bold", angle = 90, vjust = 0.3, hjust = 1), 126 | axis.text.y = element_text(colour = "black", face = "bold", size = 11), 127 | legend.text = element_text(size = 9, face ="bold", colour ="black"), 128 | legend.title = element_text(size = 9, face = "bold"), 129 | panel.background = element_blank(), panel.border = element_rect(colour = "black", fill = NA, size = 1.2), 130 | legend.position = "right") + 131 | scale_fill_gradient2(midpoint=0, low="blue", mid="white", 132 | high="red", space ="Lab" ) 133 | ''') 134 | r("pdf('module_result.pdf',width=1.5+dim(dataP)[2]/3,height=3)")# change the height and the width of the figure 135 | r('print(p)') 136 | r('dev.off()') 137 | ``` 138 | The figure is saved to module_result.pdf. 139 |
140 | Image 141 |
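
The size-scale arithmetic in the R plotting code above can be checked on its own. The following is a minimal Python sketch of the same `limits0`, `range0` and `breaks0` computation, assuming hypothetical p-values; the function name `size_scale` is illustrative and not part of LINGER. It matches the R expressions when the maximum -log10(p-value) exceeds `cutoff + 1`.

```python
import math

def size_scale(neg_log10_p, cutoff=2):
    """Reproduce the limits/range/breaks arithmetic of the R snippet (illustrative helper)."""
    maxp = math.ceil(max(neg_log10_p))        # R: maxp=ceiling(max(longdiff1$P))
    span = maxp - cutoff + 1
    limits = (cutoff, maxp)                   # R: limits0=c(cutoff,maxp)
    size_range = (4 / span, 4.0)              # R: range0=c(1,span)*4/span
    breaks = list(range(cutoff + 1, maxp))    # R: breaks0=(cutoff+1):(maxp-1)
    return limits, size_range, breaks

# Hypothetical p-values, converted to -log10 scale as in the tutorial
p_values = [0.5, 0.03, 1e-4, 2e-7]
neg_log10_p = [-math.log10(p) for p in p_values]
limits, size_range, breaks = size_scale(neg_log10_p, cutoff=2)
print(limits, size_range, breaks)
```

With `cutoff = 2`, entries whose -log10(p-value) falls below 2 lie outside the size limits, so they should not appear as dots in the figure.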
142 | 143 | -------------------------------------------------------------------------------- /docs/driver.md: -------------------------------------------------------------------------------- 1 | # Downstream analysis- TF driver score 2 | ## Driver Score 3 | We identify driver TFs underlying epigenetic and transcriptomic change between control and case using a correlation model. We normalized the GRN and then calculated the Pearson Correlation Coefficient (PCC) between expression or chromatin accessibility fold change and the regulatory strength of TGs or REs for each TF. 4 | ### Request 5 | Please complete the following tutorials 6 | - [PBMCs tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md) 7 | - [Downstream analysis-Module detection](https://github.com/Durenlab/LINGER/blob/main/docs/downstream.md) 8 | ### Transcriptomics driver score 9 | ```python 10 | import pandas as pd 11 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0) 12 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove the mitochondrion, if the species is mouse, replace 'MT-' with 'mt-' 13 | import scanpy as sc 14 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad') 15 | label_all = adata_RNA.obs[['barcode','sample','label']] 16 | label_all.index = label_all['barcode'] 17 | metadata = label_all.loc[TG_pseudobulk.columns] 18 | metadata.columns = ['barcode','group','celltype'] 19 | GRN='trans_regulatory' 20 | adjust_method='bonferroni' 21 | corr_method='pearsonr' 22 | import numpy as np 23 | from LingerGRN import Compare 24 | C_result_RNA_sp,P_result_RNA_sp,Q_result_RNA_sp=Compare.driver_score(TG_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method) 25 | K=3 # We choose the top K positive and negative TFs to save to the txt file for visualization purposes. 
26 | C_result_RNA_sp_r,Q_result_RNA_sp_r=Compare.driver_result(C_result_RNA_sp,Q_result_RNA_sp,K) 27 | C_result_RNA_sp_r.to_csv('C_result_RNA_sp_r.txt',sep='\t') 28 | Q_result_RNA_sp_r.to_csv('Q_result_RNA_sp_r.txt',sep='\t') 29 | ``` 30 | The adjust_method is the p-value adjust method, you could choose one from the following: 31 | - bonferroni : one-step correction 32 | - sidak : one-step correction 33 | - holm-sidak : step down method using Sidak adjustments 34 | - holm : step-down method using Bonferroni adjustments 35 | - simes-hochberg : step-up method (independent) 36 | - hommel : closed method based on Simes tests (non-negative) 37 | - fdr_bh : Benjamini/Hochberg (non-negative) 38 | - fdr_by : Benjamini/Yekutieli (negative) 39 | - fdr_tsbh : two stage fdr correction (non-negative) 40 | - fdr_tsbky : two stage fdr correction (non-negative) 41 | ### visualize 42 | ```python 43 | import os 44 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path 45 | import rpy2.robjects as robjects 46 | from rpy2.robjects import r 47 | # Import the R plotting package (ggplot2 as an example) 48 | r('library(ggplot2)') 49 | r('library(grid)') 50 | # Create data in R environment through Python 51 | r(''' 52 | dataP=read.table('Q_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE) 53 | dataT=read.table('C_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE) 54 | sort_TF=rownames(dataT) 55 | library(tidyr) 56 | dataP=-log10(dataP) 57 | print(paste0('maxinum of -log10P:',max(dataP))) 58 | dataP[dataP>40]=40 59 | dataP1=dataP 60 | dataP1$TF=rownames(dataP) 61 | longdiff0 <- gather(dataP1, sample, value,-TF) 62 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ] 63 | dataT1=dataT 64 | dataT1$TF=rownames(dataT) 65 | longdiff1=gather(dataT1, sample, value,-TF) 66 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ] 67 | colnames(longdiff1)=c('TF','celltype','PCC') 68 | longdiff1$P=longdiff0_s$value 
69 | longdiff1$TF=factor(longdiff1$TF,levels=rev(sort_TF)) 70 | library(egg) 71 | limits0=c(2,ceiling(dataP)) 72 | range0 = c(1,4) 73 | breaks0 = c(2,(ceiling(dataP)-2)*1/4+2,(ceiling(dataP)-2)*2/4+2,(ceiling(dataP)-2)*3/4+2,ceiling(dataP)) 74 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+ 75 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) + 76 | scale_size_continuous(limits = c(4, 40), range = c(1,5), breaks = c(4,10,20,30,40)) + 77 | labs( x= "cell type", y = "TF", fill = "") + theme_article()+ 78 | theme(legend.key=element_blank(), 79 | axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1), 80 | legend.position = "right") + 81 | scale_fill_gradient2(midpoint=0, low="blue", mid="white", 82 | high="red", space ="Lab" ) 83 | 84 | 85 | ''') 86 | r(''' 87 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1) 88 | library(pheatmap) 89 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ]) 90 | colnames(anno1)=c('TG') 91 | rownames(anno1)=rownames(dataP) 92 | anno1[is.na(anno1)]=0 93 | anno1[,1]=paste0('M',anno1[,1]) 94 | anno1$name=rownames(anno1) 95 | longdiff0 <- gather(anno1, sample, value,-name) 96 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1)) 97 | library("RColorBrewer") 98 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb', 99 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d') 100 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d') 101 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) + 102 | geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) + 103 | theme_article()+theme(text = element_text(size = 9), legend.position = "left") 104 | ''') 105 | r(''' 106 | widths <- c(4.5, 4.5+dim(dataP)[2]) 107 | print(unique(longdiff0$value)) 108 | 
pdf('driver_trans.pdf',width=6/16*(9+dim(dataP)[2]),height= dim(anno1)[1]/10+0.5) 109 | #print(heatmap_plot) 110 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths) 111 | dev.off() 112 | ''') 113 | ``` 114 | The figure is saved to driver_trans.pdf. 115 |
116 | Image 117 |
118 | 119 | ### Epigenetic driver score 120 | ```python 121 | RE_pseudobulk=pd.read_csv('data/RE_pseudobulk.tsv',sep=',',header=0,index_col=0) 122 | K=5 123 | GRN='TF_RE_binding' 124 | adjust_method='bonferroni' 125 | corr_method='pearsonr' 126 | C_result_RE,P_result_RE,Q_result_RE=Compare.driver_score(RE_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method) 127 | C_result_RE_r,Q_result_RE_r=Compare.driver_result(C_result_RE,Q_result_RE,K) 128 | C_result_RE_r.to_csv('C_result_RE_r.txt',sep='\t') 129 | Q_result_RE_r.to_csv('Q_result_RE_r.txt',sep='\t') 130 | ``` 131 | ### Visualization 132 | ```python 133 | import os 134 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path 135 | import rpy2.robjects as robjects 136 | from rpy2.robjects import r 137 | # Import the R plotting package (ggplot2 as an example) 138 | r('library(ggplot2)') 139 | r('library(grid)') 140 | # Create data in R environment through Python 141 | r(''' 142 | dataP=read.table('Q_result_RE_r.txt',sep='\t',row.names=1,header=TRUE) 143 | dataT=read.table('C_result_RE_r.txt',sep='\t',row.names=1,header=TRUE) 144 | sort_TF=rownames(dataT) 145 | library(tidyr) 146 | dataP=-log10(dataP) 147 | print(paste0('maxinum of -log10P:',max(dataP))) 148 | maxP=100 149 | dataP[dataP>100]=100 150 | dataP1=dataP 151 | dataP1$TF=rownames(dataP) 152 | longdiff0 <- gather(dataP1, sample, value,-TF) 153 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ] 154 | dataT1=dataT 155 | dataT1$TF=rownames(dataT) 156 | longdiff1=gather(dataT1, sample, value,-TF) 157 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ] 158 | colnames(longdiff1)=c('TF','celltype','PCC') 159 | longdiff1$P=longdiff0_s$value 160 | longdiff1$TF=factor(longdiff1$TF,levels=(sort_TF)) 161 | library(egg) 162 | cutoff=2 163 | maxp=ceiling(max(dataP)) 164 | print(maxp) 165 | limits0=c(cutoff,maxp) 166 | print(limits0) 167 | range0 = c(1,4) 168 | numbreak=5 169 | 
d=ceiling((maxp-cutoff)/(numbreak-1)) 170 | print(d) 171 | breaks0= seq(from = cutoff, to = cutoff+d*(numbreak-1), length.out=numbreak) 172 | 173 | print(range0) 174 | print(breaks0) 175 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+ 176 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) + 177 | scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) + 178 | labs( x= "cell type", y = "TF", fill = "") + theme_article()+ 179 | theme(legend.key=element_blank(), 180 | axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1), 181 | legend.position = "right") + 182 | scale_fill_gradient2(midpoint=0, low="blue", mid="white", 183 | high="red", space ="Lab" ) 184 | 185 | 186 | ''') 187 | r(''' 188 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1) 189 | library(pheatmap) 190 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ]) 191 | colnames(anno1)=c('TG') 192 | rownames(anno1)=rownames(dataP) 193 | anno1[is.na(anno1)]=0 194 | anno1[,1]=paste0('M',anno1[,1]) 195 | anno1$name=rownames(anno1) 196 | longdiff0 <- gather(anno1, sample, value,-name) 197 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1)) 198 | library("RColorBrewer") 199 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb', 200 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d') 201 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d') 202 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) + 203 | geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) + 204 | theme_article()+theme(text = element_text(size = 9), legend.position = "left", 205 | ) 206 | ''') 207 | r(''' 208 | widths <- c(4.5, 4.5+dim(dataP)[2]) 209 | print(unique(longdiff0$value)) 210 | 
pdf('driver_epi.pdf',width=6/16*(9+dim(dataP)[2]),height= dim(anno1)[1]/10+1.5) 211 | #print(p) 212 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths) 213 | dev.off() 214 | ''') 215 | ``` 216 | The figure is saved to driver_epi.pdf. 217 |
218 | Image 219 |
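
Both driver-score calls above take an `adjust_method` argument. The method names listed earlier coincide with those accepted by `multipletests` in statsmodels; whether LINGER calls that function internally is an assumption. As a rough illustration of what two of the options do, here is a plain-Python sketch of the Bonferroni and Benjamini/Hochberg adjustments on hypothetical p-values.

```python
def bonferroni(p_values):
    """One-step correction: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Step-up FDR adjustment: p * m / rank, made monotone from the largest p-value down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four TFs
raw_p = [0.001, 0.02, 0.03, 0.5]
q_bonferroni = bonferroni(raw_p)
q_bh = benjamini_hochberg(raw_p)
print(q_bonferroni)
print(q_bh)
```

Bonferroni is the more conservative of the two, so fewer TFs pass a fixed threshold than with `fdr_bh`.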
220 | -------------------------------------------------------------------------------- /docs/driver_epi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/driver_epi.png -------------------------------------------------------------------------------- /docs/driver_trans.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/driver_trans.png -------------------------------------------------------------------------------- /docs/feature_engineering.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/feature_engineering.jpg -------------------------------------------------------------------------------- /docs/feature_engineering.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/feature_engineering.png -------------------------------------------------------------------------------- /docs/genomemap.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/genomemap.jpg -------------------------------------------------------------------------------- /docs/h5_input.md: -------------------------------------------------------------------------------- 1 | # h5ad file as input 2 | ## case1. 
10x filtered feature barcode matrix 3 | ### download h5 file 4 | ```sh 5 | wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5 6 | ``` 7 | ### get the input data for LINGER 8 | ```python 9 | import scanpy as sc 10 | adata = sc.read_10x_h5('pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5', gex_only=False) 11 | import scipy.sparse as sp 12 | import pandas as pd 13 | matrix=adata.X.T 14 | adata.var['gene_ids']=adata.var.index 15 | features=pd.DataFrame(adata.var['gene_ids'].values.tolist(),columns=[1]) 16 | features[2]=adata.var['feature_types'].values 17 | barcodes=pd.DataFrame(adata.obs_names,columns=[0]) 18 | from LingerGRN.preprocess import * 19 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC; label is the cell annotation table and must be loaded beforehand (read as in case2) 20 | ``` 21 | ## case2. separate RNA and ATAC h5ad files 22 | ### Read H5AD file as an AnnData object 23 | ```python 24 | import scanpy as sc 25 | adata_RNA = sc.read_h5ad('rna.h5ad') 26 | adata_ATAC=sc.read_h5ad('ATAC.h5ad') 27 | import pandas as pd 28 | label=pd.read_csv('label.txt',sep='\t',header=0) 29 | ``` 30 | ```python 31 | adata_RNA 32 | ``` 33 | 34 |
35 | Image 36 |
37 | 38 | ```python 39 | adata_ATAC 40 | ``` 41 | 42 |
43 | Image 44 |
45 | 46 | ```python 47 | label 48 | ``` 49 |
50 | Image 51 |
52 | 53 | ### get the input data for LINGER 54 | 55 | ```python 56 | import scipy.sparse as sp 57 | matrix=sp.vstack([adata_RNA.X.T, adata_ATAC.X.T]) 58 | features=pd.DataFrame(adata_RNA.var['gene_ids'].values.tolist()+adata_ATAC.var['gene_ids'].values.tolist(),columns=[1]) 59 | K=adata_RNA.shape[1] 60 | N=K+adata_ATAC.shape[1] 61 | types = ['Gene Expression' if i <= K-1 else 'Peaks' for i in range(0, N)] 62 | features[2]=types 63 | barcodes=pd.DataFrame(adata_RNA.obs['barcode'].values,columns=[0]) 64 | from LingerGRN.preprocess import * 65 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC 66 | ``` 67 | Then you could go back to the PBMC tutorial and continue with the 'Remove low counts cells and genes' step. 68 | -------------------------------------------------------------------------------- /docs/heatmap_activity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/heatmap_activity.png -------------------------------------------------------------------------------- /docs/heatmap_activity_mm10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/heatmap_activity_mm10.png -------------------------------------------------------------------------------- /docs/label.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/label.png -------------------------------------------------------------------------------- /docs/label_PBMC.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/label_PBMC.png 
-------------------------------------------------------------------------------- /docs/metadata_ds.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/metadata_ds.jpg -------------------------------------------------------------------------------- /docs/module_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/module_result.png -------------------------------------------------------------------------------- /docs/motifmatch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/motifmatch.png -------------------------------------------------------------------------------- /docs/original.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/original.png -------------------------------------------------------------------------------- /docs/perturb.md: -------------------------------------------------------------------------------- 1 | # In silico perturbation 2 | Here, we use gene regulatory networks inferred from single-cell multi-omics data to perform in silico transcription factor (TF) perturbations, simulating the consequent changes, such as target gene expression and differentiation direction, using only unperturbed wild-type data. 3 | 4 | ## Predict the original gene expression 5 | We first predict the gene expression with the LINGER neural network model. We use this prediction to represent the wild-type transcriptional profile. 
6 | ```python 7 | from LingerGRN.perturb import * 8 | # in silico perturbation 9 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir 10 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/'# use the same Datadir as in the GRN inference tutorial 11 | GRNdir=Datadir+'data_bulk/' 12 | Input_dir= '/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/'# input data dir 13 | chrall,data_merge,Exp,Opn,Target,idx,TFname=load_data_ptb(Input_dir,outdir,GRNdir) 14 | original=get_simulation(outdir,chrall,data_merge,GRNdir,Exp,Opn,Target,idx) 15 | original.to_csv(outdir+'original.txt',sep='\t') 16 | original 17 | ``` 18 |
19 | Image 20 |
21 | 22 | ## Predict the gene expression after TF perturbation 23 | We predict the gene expression after knocking out one TF, or several TFs together. We take POU5F1, a master regulator of the H1 cell line (stem cell), as an example. 24 | ```python 25 | TFko='POU5F1'# multiple TFs: TFko=['TF1','TF2',...] (a list of TF names) 26 | import pandas as pd 27 | Exp_df=pd.DataFrame(Exp,index=TFname) 28 | Exp1=Exp_df.copy() 29 | Exp1.loc[TFko]=0 30 | perturb=get_simulation(outdir,chrall,data_merge,GRNdir,Exp1.values,Opn,Target,idx) 31 | perturb.to_csv(outdir+TFko+'.txt',sep='\t') 32 | perturb 33 | ``` 34 |
35 | Image 36 |
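
The knockout step amounts to setting the expression of the chosen TF rows to zero before rerunning the simulation. The sketch below shows that idea for one or several TFs, with a plain dictionary standing in for the pandas DataFrame `Exp_df`. The helper `knock_out`, the TF names other than POU5F1 and all values are hypothetical. It also builds a joined label, since concatenating `outdir+TFko+'.txt'` only works when `TFko` is a single string.

```python
def knock_out(expression, tf_ko):
    """Return a copy of `expression` with the chosen TF rows set to zero (illustrative helper)."""
    targets = [tf_ko] if isinstance(tf_ko, str) else list(tf_ko)
    missing = [tf for tf in targets if tf not in expression]
    if missing:
        raise KeyError(f"TFs not found in expression table: {missing}")
    # Copy every row so the wild-type table stays untouched
    perturbed = {tf: list(values) for tf, values in expression.items()}
    for tf in targets:
        perturbed[tf] = [0.0] * len(perturbed[tf])
    label = "_".join(targets)  # usable as a file-name stem for the output
    return perturbed, label

# Hypothetical TF-by-cell expression values
expression = {"POU5F1": [3.1, 2.7], "SOX2": [1.4, 1.9], "MYC": [0.8, 0.2]}
perturbed, label = knock_out(expression, ["POU5F1", "SOX2"])
print(label, perturbed)
```

Working on a copy mirrors `Exp1=Exp_df.copy()` in the tutorial, so the wild-type profile stays available for comparison.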
37 | 38 | ## Differential expression for single cell 39 | We visualize the differential expression of the target gene. We take NANOG, a target gene of POU5F1, as an example. We set save = True to save the figure to outdir (file name: knockout TF+'_KO_Diff_exp_Umap_'+target gene+'.png'). The cell types of clusters 0 to 3 are H1, BJ, K562, and GM12878, respectively. 40 | ```python 41 | embedding,D=umap_embedding(outdir,Target,original,perturb,Input_dir) 42 | TG='NANOG' 43 | save=True 44 | diff_umap(TFko,TG,save,outdir,embedding,perturb,original,Input_dir) 45 | ``` 46 |
47 | Image 48 |
49 | 50 | The above result suggests that NANOG expression decreases in the H1 cell line after POU5F1 knockout. 51 | ## Differentiation prediction 52 | 53 | We project the original and perturbed gene expression into the same embedding space. The difference between the two embeddings then represents the predicted differentiation direction after the perturbation. The figure will be saved as knockout TF+'_KO_Differentiation_Umap.png'. The cell types of clusters 0 to 3 are H1, BJ, K562, and GM12878, respectively. 54 | ```python 55 | save=True 56 | Umap_direct(TFko,Input_dir,embedding,D,save,outdir) 57 | ``` 58 |
59 | Image 60 |
61 | 62 | The POU5F1 knockout simulation yields a potential shift of the H1 cell line. 63 | -------------------------------------------------------------------------------- /docs/perturb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/perturb.png -------------------------------------------------------------------------------- /docs/pvalue_all.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/pvalue_all.png -------------------------------------------------------------------------------- /docs/scNN.md: -------------------------------------------------------------------------------- 1 | # Other species tutorial 2 | We support the following species. For other species, we provide another [tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/scNN_newSpecies.md). 3 | |genome_short | species | species_ensembl | genome | 4 | | --- | --- | --- | --- | 5 | | canFam3 | dog | Canis_lupus_familiaris |CanFam3 | 6 | | danRer11|zebrafish|Danio_rerio|GRCz11| 7 | |danRer10|zebrafish|Danio_rerio|GRCz10| 8 | | dm6|fly|Drosophila_melanogaster|BDGP6| 9 | | dm3|fly|Drosophila_melanogaster|BDGP5| 10 | | rheMac8|rhesus|Macaca_mulatta|Mmul_8| 11 | |mm10|mouse|Mus_musculus|GRCm38| 12 | |mm9|mouse|Mus_musculus|NCBIM37| 13 | |rn5|rat|Rattus_norvegicus|Rnor_5| 14 | |rn6|rat|Rattus_norvegicus|Rnor_6| 15 | |susScr3|pig|Sus_scrofa|Sscrofa10| 16 | |susScr11|pig|Sus_scrofa|Sscrofa11| 17 | |fr3|fugu|Takifugu_rubripes|FUGU5| 18 | |xenTro9|frog|Xenopus_tropicalis|Xenopus_tropicalis_v9| 19 | |tair10|Arabidopsis|Arabidopsis_thaliana|Tair10| 20 | ## Download the provided data 21 | We provide the TSS locations for the above genomes and the motif information. 
22 | ```sh 23 | Datadir=/path/to/LINGER/# the directory to store the data, please use the absolute directory. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/ 24 | mkdir $Datadir 25 | cd $Datadir 26 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1Dog5JTS_SNIoa5aohgZmOWXrTUuAKHXV' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Dog5JTS_SNIoa5aohgZmOWXrTUuAKHXV" -O provide_data.tar.gz && rm -rf /tmp/cookies.txt 27 | ``` 28 | Then unzip, 29 | ```sh 30 | tar -xzf provide_data.tar.gz 31 | ``` 32 | ## Prepare the input data 33 | We take the sc data of mm10 as an example. The data is from the published paper (FOXA2 drives lineage plasticity and KIT pathway 34 | activation in neuroendocrine prostate cancer). 35 | The input data is the feature matrix from 10x sc-multiome data and Cell annotation/cell type label which includes: 36 | - Single-cell multiome data including matrix.mtx, features.tsv/features.txt, and barcodes.tsv/barcodes.txt 37 | - Cell annotation/cell type label if you need the cell type-specific gene regulatory network (label.txt in our example). 38 |
39 | Image 40 |
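
Before converting the input files, it can help to confirm that the features file contains both feature types expected for multiome data. The sketch below counts the values in the third tab-separated column, assuming the usual 10x layout in which that column holds `Gene Expression` or `Peaks`. The helper name and the three example rows are hypothetical.

```python
import csv
import io
from collections import Counter

def count_feature_types(handle):
    """Count the entries of the third tab-separated column (the feature type)."""
    reader = csv.reader(handle, delimiter="\t")
    return Counter(row[2] for row in reader if len(row) >= 3)

# Hypothetical three-row excerpt standing in for data/features.txt
example = io.StringIO(
    "gene_id_1\tGeneA\tGene Expression\n"
    "gene_id_2\tGeneB\tGene Expression\n"
    "chr1:100-600\tchr1:100-600\tPeaks\n"
)
counts = count_feature_types(example)
missing = {"Gene Expression", "Peaks"} - set(counts)
print(dict(counts), "missing:", sorted(missing))
```

For a real file, pass `open('data/features.txt', newline='')` instead of the in-memory example. A non-empty `missing` set would indicate that the RNA or the ATAC part is absent.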
41 | 42 | If the input data is 10X h5 file or h5ad file from scanpy, please follow the instruction [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md) . 43 | 44 | ### sc data 45 | We download the data using the shell command line. 46 | ```sh 47 | mkdir -p data 48 | cd data 49 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0" -O mm10_data.tar.gz && rm -rf /tmp/cookies.txt 50 | tar -xzvf mm10_data.tar.gz 51 | mv mm10_data/* ./ 52 | cd ../ 53 | ``` 54 | We provide the cell annotation as follows: 55 | ```sh 56 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4" -O mm10_label.txt && rm -rf /tmp/cookies.txt 57 | mv mm10_label.txt data/ 58 | ``` 59 | ## LINGER 60 | ### Install 61 | ```sh 62 | conda create -n LINGER python==3.10.0 63 | conda activate LINGER 64 | pip install LingerGRN==1.97 65 | conda install bioconda::bedtools #Requirement 66 | ``` 67 | ### Install homer 68 | Check whether homer is installed 69 | ```sh 70 | which homer # run this in the command line 71 | ``` 72 | If homer is not installed, use Conda to install it 73 | ```sh 74 | conda install bioconda::homer 75 | ``` 76 | #### install genome 77 | You can check the installed genome 78 | ```sh 79 | dir=$(which homer) 80 | dir_path=$(dirname "$dir") 81 | ls $dir_path/../share/homer/data/genomes/ # this is the installed genomes, if no genome is 
installed, there will be an error 'No such file or directory' 82 | ``` 83 | If the genome is not installed, use the following shell script to install it. 84 | ```sh 85 | genome='mm10' 86 | perl $dir_path/../share/homer/configureHomer.pl -install $genome 87 | ``` 88 | For the following step, we run the code in python. 89 | #### Transfer the sc-multiome data to anndata 90 | We will transfer sc-multiome data to the anndata format and filter the cell barcode by the cell type label. 91 | ```python 92 | import scanpy as sc 93 | #set some figure parameters for nice display inside jupyternotebooks. 94 | %matplotlib inline 95 | sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white') 96 | sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3) 97 | sc.logging.print_header() 98 | #results_file = "scRNA/pbmc10k.h5ad" 99 | import scipy 100 | import pandas as pd 101 | matrix=scipy.io.mmread('data/matrix.mtx') 102 | features=pd.read_csv('data/features.txt',sep='\t',header=None) 103 | barcodes=pd.read_csv('data/barcodes.txt',sep='\t',header=None) 104 | label=pd.read_csv('data/mm10_label.txt',sep='\t',header=0) 105 | from LingerGRN.preprocess import * 106 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC 107 | ``` 108 | #### Remove low counts cells and genes 109 | ```python 110 | import scanpy as sc 111 | sc.pp.filter_cells(adata_RNA, min_genes=200) 112 | sc.pp.filter_genes(adata_RNA, min_cells=3) 113 | sc.pp.filter_cells(adata_ATAC, min_genes=200) 114 | sc.pp.filter_genes(adata_ATAC, min_cells=3) 115 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values)) 116 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values) 117 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]] 118 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values) 119 
| adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]] 120 | ``` 121 | #### Generate the pseudo-bulk/metacell: 122 | ```python 123 | from LingerGRN.pseudo_bulk import * 124 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode 125 | tempsample=samplelist[0] 126 | TG_pseudobulk=pd.DataFrame([]) 127 | RE_pseudobulk=pd.DataFrame([]) 128 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100) 129 | for tempsample in samplelist: 130 | adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample] 131 | adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample] 132 | TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk) 133 | TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1) 134 | RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1) 135 | RE_pseudobulk[RE_pseudobulk > 100] = 100 136 | 137 | import os 138 | if not os.path.exists('data/'): 139 | os.mkdir('data/') 140 | adata_ATAC.write('data/adata_ATAC.h5ad') 141 | adata_RNA.write('data/adata_RNA.h5ad') 142 | TG_pseudobulk=TG_pseudobulk.fillna(0) 143 | RE_pseudobulk=RE_pseudobulk.fillna(0) 144 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None) 145 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv') 146 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv') 147 | ``` 148 | ### Training model 149 | Overlap the region with general GRN: 150 | ```python 151 | Datadir='/path/to/LINGER/'# This directory should be the same as Datadir defined in the above 'Download the general gene regulatory network' section 152 | GRNdir=Datadir+'provide_data/' 153 | genome='mm10' 154 | outdir='/path/to/output/' #output dir 155 | activef='ReLU' 156 | method='scNN' 157 | import torch 158 | import subprocess 159 | import os 160 | import LingerGRN.LINGER_tr as LINGER_tr 161 | LINGER_tr.get_TSS(GRNdir,genome,200000) # Here, 200000 
represents the largest distance from a regulatory element to the TG; other distances are supported 162 | LINGER_tr.RE_TG_dis(outdir) 163 | ``` 164 | Train the LINGER model. 165 | ```python 166 | import LingerGRN.LINGER_tr as LINGER_tr 167 | activef='ReLU' # activation function, chosen from 'ReLU','sigmoid','tanh' 168 | genomemap=pd.read_csv(GRNdir+'genome_map_homer.txt',sep='\t') 169 | genomemap.index=genomemap['genome_short'] 170 | species=genomemap.loc[genome]['species_ensembl'] 171 | LINGER_tr.training(GRNdir,method,outdir,activef,species) 172 | ``` 173 | ### Cell population gene regulatory network 174 | #### TF binding potential 175 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score. 176 | ```python 177 | import LingerGRN.LL_net as LL_net 178 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir) 179 | ``` 180 | 181 | #### *cis*-regulatory network 182 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score. 183 | ```python 184 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir) 185 | ``` 186 | #### *trans*-regulatory network 187 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score. 188 | ```python 189 | LL_net.trans_reg(GRNdir,method,outdir,genome) 190 | ``` 191 | 192 | ### Cell type specific gene regulatory network 193 | There are 2 options: 194 | 1. infer the GRN for a specific cell type listed in label.txt; 195 | ```python 196 | celltype='1' #use a string to assign your cell type 197 | ``` 198 | 2. infer GRNs for all cell types. 199 | ```python 200 | celltype='all' 201 | ``` 202 | Please make sure that 'all' is not a cell type in your data. 
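
The requirement that 'all' must not be a cell type name can be checked before running the cell type specific steps. The following is a small sketch of such a check, assuming the labels are available as a list. The helper `resolve_celltypes` is hypothetical and not part of LingerGRN.

```python
def resolve_celltypes(celltype, labels):
    """Validate `celltype` against the annotation and return the cell types to process."""
    available = {str(x) for x in labels}
    if "all" in available:
        raise ValueError("'all' is reserved; please rename the cell type called 'all'.")
    if celltype == "all":
        return sorted(available)
    if celltype not in available:
        raise ValueError(f"cell type {celltype!r} not found in the labels")
    return [celltype]

# Hypothetical labels, e.g. the label column of label.txt
labels = [1, 2, 1, 3]
print(resolve_celltypes("1", labels))    # a single cell type, given as a string
print(resolve_celltypes("all", labels))  # every cell type in the annotation
```

Labels are converted to strings first because the tutorial passes the cell type as a string (`celltype='1'`) even when the annotation uses numeric cluster ids.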
203 | #### Motif matching 204 | ```python 205 | command='paste data/Peaks.bed data/Peaks.txt > data/region.txt' 206 | subprocess.run(command, shell=True) 207 | import pandas as pd 208 | genome_map=pd.read_csv(GRNdir+'genome_map_homer.txt',sep='\t',header=0) 209 | genome_map.index=genome_map['genome_short'] 210 | command='findMotifsGenome.pl data/region.txt '+genome+' ./. -size given -find '+GRNdir+'all_motif_rmdup_'+genome_map.loc[genome]['Motif']+'> '+outdir+'MotifTarget.bed' 211 | subprocess.run(command, shell=True) 212 | ``` 213 | #### TF binding potential 214 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential. 215 | ```python 216 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version 217 | ``` 218 | 219 | #### *cis*-regulatory network 220 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score. 221 | ```python 222 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method) 223 | ``` 224 | 225 | #### *trans*-regulatory network 226 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score. 227 | ```python 228 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir) 229 | ``` 230 | ## Identify driver regulators by TF activity 231 | ### Instruction 232 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators. 

### Prepare
We need a *trans*-regulatory network; choose the network that best matches your data.
1. If your gene expression data match the cell population GRN, set
```python
network = 'cell population'
```
2. If your gene expression data match a certain cell type, set network to the name of that cell type.
```python
network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type
```

### Calculate TF activity
The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. In the gene expression data, rows are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).

```python
from LingerGRN.TF_activity import *
TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
```
Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'.
```python
save=True
heatmap_cluster(TF_activity,adata_RNA,save,outdir)
```
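The `network` value chosen in the Prepare step corresponds to one of the *trans*-regulatory output files named earlier in this tutorial. The sketch below only illustrates that naming; how `regulon` locates the file internally is not shown here, and `trans_network_file` is a hypothetical helper, not a LingerGRN function.

```python
import os

def trans_network_file(outdir, network):
    """Map the `network` choice to the trans-regulatory output file name."""
    if network == 'cell population':
        name = 'cell_population_trans_regulatory.txt'
    else:
        # any other value is treated as the name of a cell type
        name = f'cell_type_specific_trans_regulatory_{network}.txt'
    return os.path.join(outdir, name)

print(trans_network_file('/path/to/output/', 'cell population'))
print(trans_network_file('/path/to/output/', 'CD56 (bright) NK cells'))
```

Checking that this file exists before computing TF activity is a quick way to confirm that the matching GRN step has been run.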

### Identify driver regulator
We use a t-test on TF activity to find the differential TFs of a certain cell type.
1. You can select a certain cell type of the gene expression data by
```python
celltype='1'
```
2. Or, you can obtain the result for all cell types.
```python
celltype='all'
```

For example,

```python
celltype='1'
t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
t_test_results
```
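To make the comparison concrete, here is a minimal sketch of a two-sample t statistic for one TF, comparing cells of the chosen cell type against all other cells. Which t-test variant `master_regulator` uses is not stated here, so Welch's statistic is an assumption, and the activity values and labels are toy data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group, others):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(group) / len(group) + variance(others) / len(others))
    return (mean(group) - mean(others)) / standard_error

# Toy activity values of one TF, one value per cell, with a label per cell
activity = [2.0, 4.0, 6.0, 1.0, 2.0, 3.0]
labels = ['1', '1', '1', '0', '0', '0']
celltype = '1'

in_type = [a for a, lab in zip(activity, labels) if lab == celltype]
others = [a for a, lab in zip(activity, labels) if lab != celltype]
t_value = welch_t(in_type, others)
print(round(t_value, 3))  # 1.549
```

A positive value means the TF is more active in the selected cell type than in the rest; repeating this per TF yields one statistic per regulator.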

Visualize the differential activity and expression. You can compare two cell types, or one cell type against all other cell types. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.

```python
TFName='Erg'
datatype='activity'
celltype1='1'
celltype2='Others'
save=True
box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
```
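The output file name follows directly from the arguments. As a small sketch of the naming pattern (the helper `box_plot_filename` is hypothetical, and joining the name with `outdir` is an assumption about where the file is written):

```python
import os

def box_plot_filename(outdir, TFName, datatype, celltype1, celltype2):
    """Build the expected box plot file name from the box_comp arguments."""
    name = f"box_plot_{TFName}_{datatype}_{celltype1}_{celltype2}.png"
    return os.path.join(outdir, name)

print(box_plot_filename('/path/to/output/', 'Erg', 'activity', '1', 'Others'))
```

With the settings above this gives a file named box_plot_Erg_activity_1_Others.png.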

For gene expression data, the boxplot is:
```python
datatype='expression'
box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
```

--------------------------------------------------------------------------------
/docs/scNN_newSpecies.md:
--------------------------------------------------------------------------------
# Other species not in the list
Here, we provide a tutorial for species that are not included in the given list but are available in the basic Homer installation. You can choose one genome version from the following table based on your data.

| **Genome Version** | **Release** | **Description** |
|---------------------|-------------|-----------------------------------------------------------|
| tair10 | v6.0 | Arabidopsis genome and annotation (tair10) |
| dm6 | v7.0 | Fly genome and annotation for UCSC dm6 |
| xenTro3 | v6.4 | Frog genome and annotation for UCSC xenTro3 |
| danRer11 | v7.0 | Zebrafish genome and annotation for UCSC danRer11 |
| rn7 | v7.0 | Rat genome and annotation for UCSC rn7 |
| canFam3 | v6.4 | Dog genome and annotation for UCSC canFam3 |
| anoGam3 | v7.0 | Mosquito genome and annotation for UCSC anoGam3 |
| gorGor3 | v6.4 | Human genome and annotation for UCSC gorGor3 |
| strPur2 | v7.0 | Urchin genome and annotation for UCSC strPur2 |
| rn5 | v6.4 | Rat genome and annotation for UCSC rn5 |
| rheMac10 | v7.0 | Rhesus genome and annotation for UCSC rheMac10 |
| danRer7 | v6.4 | Zebrafish genome and annotation for UCSC danRer7 |
| canFam5 | v7.0 | Dog genome and annotation for UCSC canFam5 |
| gorGor6 | v7.0 | Human genome and annotation for UCSC gorGor6 |
| canFam6 | v7.0 | Dog genome and annotation for UCSC canFam6 |
| apiMel2 | v7.0 | Bee genome and annotation for UCSC apiMel2 |
| gorGor5 | v6.4 | Human genome and annotation for UCSC gorGor5 |
| panTro4 | v6.4 | Human genome and annotation for UCSC panTro4 |
| xenTro7 | v6.4 | Frog genome and annotation for UCSC xenTro7 |
| rheMac2 | v6.4 | Rhesus genome and annotation for UCSC rheMac2 |
| gorGor4 | v6.4 | Human genome and annotation for UCSC gorGor4 |
| panTro5 | v6.4 | Human genome and annotation for UCSC panTro5 |
| panPan2 | v6.4 | Human genome and annotation for UCSC panPan2 |
| patens.ASM242v1 | v5.10 | Patens genome and annotation (patens.ASM242v1) |
| panTro6 | v7.0 | Human genome and annotation for UCSC panTro6 |
| AGPv3 | v5.10 | Corn genome and annotation (AGPv3) |
| xenTro9 | v6.4 | Frog genome and annotation for UCSC xenTro9 |
| papAnu2 | v6.4 | Human genome and annotation for UCSC papAnu2 |
| corn.AGPv3 | v5.10 | Corn genome and annotation (corn.AGPv3) |
| petMar2 | v6.4 | Lamprey genome and annotation for UCSC petMar2 |
| panPan1 | v6.4 | Human genome and annotation for UCSC panPan1 |
| fr3 | v7.0 | Fugu genome and annotation for UCSC fr3 |
| sacCer3 | v7.0 | Yeast genome and annotation for UCSC sacCer3 |
| mm9 | v7.0 | Mouse genome and annotation for UCSC mm9 |
| susScr3 | v6.4 | Pig genome and annotation for UCSC susScr3 |
| hg17 | v7.0 | Human genome and annotation for UCSC hg17 |
| panTro3 | v6.4 | Human genome and annotation for UCSC panTro3 |
| tetNig2 | v7.0 | Fugu genome and annotation for UCSC tetNig2 |
| mm10 | v7.0 | Mouse genome and annotation for UCSC mm10 |
| taeGut2 | v7.0 | Zebrafinch genome and annotation for UCSC taeGut2 |
| hg18 | v7.0 | Human genome and annotation for UCSC hg18 |
| susScr11 | v7.0 | Pig genome and annotation for UCSC susScr11 |
| galGal5 | v6.4 | Chicken genome and annotation for UCSC galGal5 |
| hg38 | v7.0 | Human genome and annotation for UCSC hg38 |
| mm8 | v6.4 | Mouse genome and annotation for UCSC mm8 |
| xenLae2 | v7.0 | Frog genome and annotation for UCSC xenLae2 |
| rn6 | v7.0 | Rat genome and annotation for UCSC rn6 |
| ce11 | v7.0 | Worm genome and annotation for UCSC ce11 |
| dm3 | v7.0 | Fly genome and annotation for UCSC dm3 |
| rheMac3 | v6.4 | Rhesus genome and annotation for UCSC rheMac3 |
| rn4 | v6.4 | Rat genome and annotation for UCSC rn4 |
| rice.IRGSP1.0 | v5.10 | Rice genome and annotation (rice.IRGSP1.0) |
| panPan3 | v7.0 | Human genome and annotation for UCSC panPan3 |
| galGal6 | v7.0 | Chicken genome and annotation for UCSC galGal6 |
| danRer10 | v7.0 | Zebrafish genome and annotation for UCSC danRer10 |
| ci2 | v7.0 | Ciona genome and annotation for UCSC ci2 |
| petMar3 | v7.0 | Lamprey genome and annotation for UCSC petMar3 |
| sacCer2 | v7.0 | Yeast genome and annotation for UCSC sacCer2 |
| hg19 | v7.0 | Human genome and annotation for UCSC hg19 |
| xenTro10 | v7.0 | Frog genome and annotation for UCSC xenTro10 |
| rheMac8 | v6.4 | Rhesus genome and annotation for UCSC rheMac8 |
| mm39 | v7.0 | Mouse genome and annotation for UCSC mm39 |
| ce10 | v7.0 | Worm genome and annotation for UCSC ce10 |
| aplCal1 | v6.4 | Seahare genome and annotation for UCSC aplCal1 |
| xenTro2 | v6.4 | Frog genome and annotation for UCSC xenTro2 |
| papAnu4 | v7.0 | Human genome and annotation for UCSC papAnu4 |
| ce6 | v7.0 | Worm genome and annotation for UCSC ce6 |
| ci3 | v7.0 | Ciona genome and annotation for UCSC ci3 |
| apiMel3 | v7.0 | Bee genome and annotation for UCSC apiMel3 |
| anoGam1 | v7.0 | Mosquito genome and annotation for UCSC anoGam1 |
| galGal4 | v6.4 | Chicken genome and annotation for UCSC galGal4 |

## Prepare the input data
We take the sc data of mm10 as an example. The data is from the published paper, [FOXA2 drives lineage plasticity and KIT pathway activation in neuroendocrine prostate cancer](https://www.cell.com/cancer-cell/pdf/S1535-6108(22)00502-5.pdf).
The input data are the feature matrix from 10x sc-multiome data and the cell annotation/cell type label, which include:
- Single-cell multiome data including matrix.mtx, features.tsv/features.txt, and barcodes.tsv/barcodes.txt

If the input data is a 10X h5 file or an h5ad file from scanpy, please follow the instructions in [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md).

Here, we provide sc-multiome data of mm10. We download the data using the shell command line.
```sh
mkdir -p data
cd data
wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0" -O mm10_data.tar.gz && rm -rf /tmp/cookies.txt
tar -xzvf mm10_data.tar.gz
mv mm10_data/* ./
cd ../
```
We provide the cell annotation as follows:
```sh
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4" -O mm10_label.txt && rm -rf /tmp/cookies.txt
mv mm10_label.txt data/label.txt
```
- Cell annotation/cell type label if you need the cell type-specific gene regulatory network (label.txt in our example).
- gtf file or the URL of the gtf file describing gene annotation, '*.gtf'
- PWM matrix file of motifs, 'all_motif.txt'
- Motif-TF match file, 'MotifMatch.txt', mapping motifs to TFs
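Before running the pipeline it can help to confirm that the inputs listed above are in place. The following is a minimal sketch under the layout used in this tutorial (sc data in `GRNdir/data/`, PWM and motif match files in `GRNdir`); the helper `missing_inputs` is hypothetical and not part of LingerGRN, and the demo runs in a temporary directory rather than on real data.

```python
import tempfile
from pathlib import Path

def missing_inputs(grn_dir, pwm_file='all_motif.txt', motif_match_file='MotifMatch.txt'):
    """Return the expected input files that are not present under grn_dir."""
    grn_dir = Path(grn_dir)
    expected = [
        grn_dir / 'data' / 'matrix.mtx',
        grn_dir / 'data' / 'features.txt',
        grn_dir / 'data' / 'barcodes.txt',
        grn_dir / 'data' / 'label.txt',
        grn_dir / pwm_file,
        grn_dir / motif_match_file,
    ]
    return [str(p) for p in expected if not p.is_file()]

# Demo in a temporary directory where only an empty label.txt exists
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'data').mkdir()
    (Path(tmp) / 'data' / 'label.txt').write_text('')
    missing = missing_inputs(tmp)

print(len(missing))  # 5
```

The gtf file is left out of the check because it may be given as a URL rather than a local path.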

## LINGER
### Install
```sh
conda create -n LINGER python==3.10.0
conda activate LINGER
pip install LingerGRN==1.97
conda install bioconda::bedtools #Requirement
```
### Install homer
Check whether homer is installed
```sh
which homer # run this in the command line
```
If homer is not installed, use Conda to install it
```sh
conda install bioconda::homer
```
#### Install genome
You can check which genomes are installed
```sh
dir=$(which homer)
dir_path=$(dirname "$dir")
ls $dir_path/../share/homer/data/genomes/ # lists the installed genomes; if no genome is installed, there will be an error 'No such file or directory'
```
If the genome is not installed, use the following shell script to install it.
```sh
genome='mm10'
perl $dir_path/../share/homer/configureHomer.pl -install $genome
```
For the following steps, we run the code in Python.
#### Input
Note that PWM_file and MotifMatch_file are located in GRNdir.
```python
GRNdir='/path/to/current_dir/' # The dir we currently use; the sc data is in GRNdir/data/
gtf_file='/path/to/gtfdir/*.gtf' # gtf, PWM matrix, and Motif-TF match file should be included in the GRNdir
PWM_file='all_motif.txt'
MotifMatch_file='MotifMatch.txt'
genome='mm10' # the genome of your data
outdir='/path/to/output/' #output dir
species='New'
```
#### Transfer the sc-multiome data to anndata
We will convert the sc-multiome data to the anndata format and filter the cell barcodes by the cell type label.
```python
import scanpy as sc
#set some figure parameters for nice display inside Jupyter notebooks.
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white')
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
#results_file = "scRNA/pbmc10k.h5ad"
import scipy
import pandas as pd
matrix=scipy.io.mmread('data/matrix.mtx')
features=pd.read_csv('data/features.txt',sep='\t',header=None)
barcodes=pd.read_csv('data/barcodes.txt',sep='\t',header=None)
label=pd.read_csv('data/label.txt',sep='\t',header=0)
from LingerGRN.preprocess import *
adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
```
#### Remove low counts cells and genes
```python
import scanpy as sc
sc.pp.filter_cells(adata_RNA, min_genes=200)
sc.pp.filter_genes(adata_RNA, min_cells=3)
sc.pp.filter_cells(adata_ATAC, min_genes=200)
sc.pp.filter_genes(adata_ATAC, min_cells=3)
selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values))
barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]
barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
```
#### Generate the pseudo-bulk/metacell:
```python
from LingerGRN.pseudo_bulk import *
samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode
tempsample=samplelist[0]
TG_pseudobulk=pd.DataFrame([])
RE_pseudobulk=pd.DataFrame([])
singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
for tempsample in samplelist:
    adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
    adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
    TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)
    TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
    RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
    RE_pseudobulk[RE_pseudobulk > 100] = 100

import os
if not os.path.exists('data/'):
    os.mkdir('data/')
adata_ATAC.write('data/adata_ATAC.h5ad')
adata_RNA.write('data/adata_RNA.h5ad')
TG_pseudobulk=TG_pseudobulk.fillna(0)
RE_pseudobulk=RE_pseudobulk.fillna(0)
pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
```
### Training model
Overlap the region with general GRN:
```python
activef='ReLU'
method='scNN'
import torch
import subprocess
import os
import LingerGRN.LINGER_tr as LINGER_tr
LINGER_tr.get_TSS_ensembl(genome,gtf_file)
LINGER_tr.get_TSS(GRNdir,genome,200000) # Here, 200000 represents the largest distance from a regulatory element to the TG. Other distances are supported
LINGER_tr.RE_TG_dis(outdir)
```
Train the LINGER model.
```python
import LingerGRN.LINGER_tr as LINGER_tr
activef='ReLU' # activation function, chosen from 'ReLU', 'sigmoid', 'tanh'
LINGER_tr.training(GRNdir,method,outdir,activef,species)
```
### Cell population gene regulatory network
#### TF binding potential
The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
```python
import LingerGRN.LL_net as LL_net
LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
```

#### *cis*-regulatory network
The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
```python
LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
```
#### *trans*-regulatory network
The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
```python
LL_net.trans_reg(GRNdir,method,outdir,genome)
```

### Cell type specific gene regulatory network
There are 2 options:
1. Infer the GRN for a specific cell type, which must be listed in label.txt:
```python
celltype='1' # use a string to assign your cell type
```
2. Infer GRNs for all cell types:
```python
celltype='all'
```
Please make sure that 'all' is not the name of a cell type in your data.
#### Motif matching
```python
import subprocess
# Combine the peak coordinates and peak names into one region file
command='paste data/Peaks.bed data/Peaks.txt > data/region.txt'
subprocess.run(command, shell=True)
# Scan the regions for motif occurrences with Homer, using the PWM file in GRNdir
command='findMotifsGenome.pl data/region.txt '+genome+' ./. -size given -find '+GRNdir+PWM_file+' > '+outdir+'MotifTarget.bed'
subprocess.run(command, shell=True)
```
#### TF binding potential
The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential.
```python
LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version
```

#### *cis*-regulatory network
The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
```python
LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)
```

#### *trans*-regulatory network
The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
```python
LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
```
## Identify driver regulators by TF activity
### Instruction
TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER-inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators.

### Prepare
We need a *trans*-regulatory network; choose the network that best matches your data.
1. If your gene expression data match the cell population GRN, set
```python
network = 'cell population'
```
2. If your gene expression data match a certain cell type, set network to the name of that cell type.
```python
network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type
```

### Calculate TF activity
The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. In the gene expression data, rows are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).

```python
from LingerGRN.TF_activity import *
TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
```
Visualize the TF activity heatmap by cluster.
To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'.
```python
save=True
heatmap_cluster(TF_activity,adata_RNA,save,outdir)
```

### Identify driver regulator
We use a t-test on TF activity to find the differential TFs of a certain cell type.
1. You can select a certain cell type of the gene expression data by
```python
celltype='1'
```
2. Or, you can obtain the result for all cell types.
```python
celltype='all'
```

For example,

```python
celltype='1'
t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
t_test_results
```

Visualize the differential activity and expression. You can compare two cell types, or one cell type against all other cell types. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.

```python
TFName='Erg'
datatype='activity'
celltype1='1'
celltype2='Others'
save=True
box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
```

For gene expression data, the boxplot is:
```python
datatype='expression'
box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
```

--------------------------------------------------------------------------------
/docs/trans_pr_curveMYC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/trans_pr_curveMYC.png
--------------------------------------------------------------------------------
/docs/trans_roc_curveMYC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/trans_roc_curveMYC.png
--------------------------------------------------------------------------------
/docs/ttest.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ttest.png
--------------------------------------------------------------------------------
/docs/ttest_mm10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ttest_mm10.png
--------------------------------------------------------------------------------
/docs/tutorial2.md:
--------------------------------------------------------------------------------
# Identify driver regulators by TF activity
## Instruction
TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER-inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators.

## Prepare
We need a *trans*-regulatory network; choose the network that best matches your data.
1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can choose a GRN from various tissues.
```python
network = 'general'
```
2. If your gene expression data match the cell population GRN, set
```python
network = 'cell population'
```
3. If your gene expression data match a certain cell type, set network to the name of that cell type.
```python
network = '0' # 0 is the name of one cell type
```
## Calculate TF activity
```python
Input_dir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/Input/'
RNA_file='RNA.txt'
GRNdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data_bulk/'
genome='hg38'
from TF_activity import *
regulon_score=regulon(Input_dir,RNA_file,GRNdir,network,genome)
```
## Identify driver regulator
We use a t-test on TF activity to find the differential TFs of a certain cell type.
1. You can select a certain cell type by
```python
celltype='0'
```
2. Or, you can obtain the result for all cell types.
```python
celltype='all'
```
```python
labels='label.txt'
t_test_results=master_regulator(regulon_score,Input_dir,labels,celltype)
t_test_results
```

--------------------------------------------------------------------------------
/docs/tvalue_all.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/tvalue_all.png
--------------------------------------------------------------------------------