├── LINGER.PNG
├── README.md
├── conti1_1024.yml
├── doc
│   └── driver.md
└── docs
    ├── ATAC.png
    ├── Benchmark.md
    ├── GRN_head.png
    ├── GRN_infer.md
    ├── H1_E2F6_trans.jpg
    ├── LINGER.png
    ├── PBMC.md
    ├── PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png
    ├── PBMCs_heatmap_activity.png
    ├── PBMCs_ttest.png
    ├── POU5F1_KO_Diff_Umap_NANOG.png
    ├── POU5F1_KO_Diff_exp_Umap_NANOG.png
    ├── POU5F1_KO_Differentiation_Umap.png
    ├── PWM.jpg
    ├── RNA.png
    ├── RNA_ds.jpg
    ├── S_TG.png
    ├── TFactivity.md
    ├── User_guide.md
    ├── adata_ATAC.png
    ├── adata_RNA.png
    ├── barcode_mm10.png
    ├── box_plot_ATF1_activity_0_Others.png
    ├── box_plot_ATF1_expression_0_Others.png
    ├── box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png
    ├── box_plot_Erg_activity_1_Others.png
    ├── box_plot_Erg_expression_1_Others.png
    ├── downstream.md
    ├── driver.md
    ├── driver_epi.png
    ├── driver_trans.png
    ├── feature_engineering.jpg
    ├── feature_engineering.png
    ├── genomemap.jpg
    ├── h5_input.md
    ├── heatmap_activity.png
    ├── heatmap_activity_mm10.png
    ├── label.png
    ├── label_PBMC.png
    ├── metadata_ds.jpg
    ├── module_result.png
    ├── motifmatch.png
    ├── original.png
    ├── perturb.md
    ├── perturb.png
    ├── pvalue_all.png
    ├── scNN.md
    ├── scNN_newSpecies.md
    ├── trans_pr_curveMYC.png
    ├── trans_roc_curveMYC.png
    ├── ttest.png
    ├── ttest_mm10.png
    ├── tutorial2.md
    └── tvalue_all.png
/LINGER.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/LINGER.PNG
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # LINGER
2 | ## Introduction
3 | LINGER (LIfelong neural Network for GEne Regulation) is a novel method to infer GRNs from single-cell multiome data built on top of [PyTorch](https://pytorch.org/).
4 |
5 | LINGER incorporates both 1) atlas-scale external bulk data across diverse cellular contexts and 2) the knowledge of transcription factor (TF) motif matching to cis-regulatory elements as a manifold regularization to address the challenge of limited data and extensive parameter space in GRN inference.
6 | ## Analysis tasks for single cell multiome data
7 | - Infer gene regulatory network
8 | - Benchmark gene regulatory network
9 | - Explainable dimensionality reduction (transcription factor activity, available for single cell or bulk RNA-seq data)
10 | - In silico perturbation
11 |
12 | In the user guide, we provide an overview of each task.
13 | ## Basic installation
14 | LINGER can be installed with pip:
15 | ```sh
16 | conda create -n LINGER python==3.10.0
17 | conda activate LINGER
18 | pip install LingerGRN==1.105
19 | conda install bioconda::bedtools # Requirement
20 | ```
21 | ## Documentation
22 |
23 | We provide several tutorials and a user guide. If you find our tool useful for your research, please consider citing the LINGER manuscript.
24 |
25 | | | | |
26 | |:-------------------------:|:-------------------------:|:-------------------------:|
27 | | [User guide](https://github.com/Durenlab/LINGER/blob/main/docs/User_guide.md) | [PBMCs tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md) |[H1 cell line tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/GRN_infer.md) |
28 | |[GRN benchmark](https://github.com/Durenlab/LINGER/blob/main/docs/Benchmark.md) | [In silico perturbation](https://github.com/Durenlab/LINGER/blob/main/docs/perturb.md) | [Other species](https://github.com/Durenlab/LINGER/blob/main/docs/scNN.md) |
29 | |[Downstream analysis-Module detection](https://github.com/Durenlab/LINGER/blob/main/docs/downstream.md)|[Downstream analysis-TF Driver score](https://github.com/Durenlab/LINGER/blob/main/docs/driver.md)||
30 |
31 |
32 | ## Reference
33 | > If you use LINGER, please cite:
34 | >
35 | > [Yuan, Qiuyue, and Zhana Duren. "Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data." Nature Biotechnology (2024): 1-11.](https://doi.org/10.1038/s41587-024-02182-7)
36 |
--------------------------------------------------------------------------------
/conti1_1024.yml:
--------------------------------------------------------------------------------
1 | name: null
2 | channels:
3 | - conda-forge
4 | - anaconda
5 | - bioconda
6 | - defaults
7 | dependencies:
8 | - _libgcc_mutex=0.1=main
9 | - _openmp_mutex=5.1=1_gnu
10 | - asttokens=2.0.5=pyhd3eb1b0_0
11 | - backcall=0.2.0=pyhd3eb1b0_0
12 | - blas=1.0=openblas
13 | - bottleneck=1.3.5=py310ha9d4c09_0
14 | - bzip2=1.0.8=h7b6447c_0
15 | - ca-certificates=2023.08.22=h06a4308_0
16 | - cloudpickle=2.2.1=pyhd8ed1ab_0
17 | - colorama=0.4.6=pyhd8ed1ab_0
18 | - comm=0.1.2=py310h06a4308_0
19 | - debugpy=1.6.7=py310h6a678d5_0
20 | - decorator=5.1.1=pyhd3eb1b0_0
21 | - exceptiongroup=1.0.4=py310h06a4308_0
22 | - executing=0.8.3=pyhd3eb1b0_0
23 | - ipykernel=6.25.0=py310h2f386ee_0
24 | - ipython=8.15.0=py310h06a4308_0
25 | - jedi=0.18.1=py310h06a4308_1
26 | - joblib=1.3.2=pyhd8ed1ab_0
27 | - jupyter_client=8.1.0=py310h06a4308_0
28 | - jupyter_core=5.3.0=py310h06a4308_0
29 | - ld_impl_linux-64=2.38=h1181459_1
30 | - libblas=3.9.0=16_linux64_openblas
31 | - libcblas=3.9.0=16_linux64_openblas
32 | - libffi=3.3=he6710b0_2
33 | - libgcc-ng=11.2.0=h1234567_1
34 | - libgfortran-ng=11.2.0=h00389a5_1
35 | - libgfortran5=11.2.0=h1234567_1
36 | - libgomp=11.2.0=h1234567_1
37 | - liblapack=3.9.0=16_linux64_openblas
38 | - libllvm14=14.0.6=hef93074_0
39 | - libopenblas=0.3.21=h043d6bf_0
40 | - libsodium=1.0.18=h7b6447c_0
41 | - libstdcxx-ng=11.2.0=h1234567_1
42 | - libuuid=1.41.5=h5eee18b_0
43 | - llvmlite=0.40.0=py310he621ea3_0
44 | - matplotlib-inline=0.1.6=py310h06a4308_0
45 | - ncurses=6.4=h6a678d5_0
46 | - nest-asyncio=1.5.6=py310h06a4308_0
47 | - numba=0.57.1=py310h1128e8f_0
48 | - numexpr=2.8.7=py310h286c3b5_0
49 | - openssl=1.1.1w=h7f8727e_0
50 | - packaging=23.1=py310h06a4308_0
51 | - pandas=2.0.3=py310h1128e8f_0
52 | - parso=0.8.3=pyhd3eb1b0_0
53 | - pexpect=4.8.0=pyhd3eb1b0_3
54 | - pickleshare=0.7.5=pyhd3eb1b0_1003
55 | - pip=23.2.1=py310h06a4308_0
56 | - platformdirs=3.10.0=py310h06a4308_0
57 | - prompt-toolkit=3.0.36=py310h06a4308_0
58 | - psutil=5.9.0=py310h5eee18b_0
59 | - ptyprocess=0.7.0=pyhd3eb1b0_2
60 | - pure_eval=0.2.2=pyhd3eb1b0_0
61 | - pygments=2.15.1=py310h06a4308_1
62 | - python=3.10.0=h12debd9_5
63 | - python-dateutil=2.8.2=pyhd3eb1b0_0
64 | - python-tzdata=2023.3=pyhd3eb1b0_0
65 | - python_abi=3.10=2_cp310
66 | - pytz=2023.3.post1=py310h06a4308_0
67 | - pyzmq=25.1.0=py310h6a678d5_0
68 | - readline=8.2=h5eee18b_0
69 | - scikit-learn=1.3.0=py310h1128e8f_0
70 | - scipy=1.11.3=py310heeff2f4_0
71 | - setuptools=68.0.0=py310h06a4308_0
72 | - shap=0.42.1=py310h1128e8f_0
73 | - six=1.16.0=pyhd3eb1b0_1
74 | - slicer=0.0.7=pyhd8ed1ab_0
75 | - sqlite=3.41.2=h5eee18b_0
76 | - stack_data=0.2.0=pyhd3eb1b0_0
77 | - tbb=2021.8.0=hdb19cb5_0
78 | - threadpoolctl=3.2.0=pyha21a80b_0
79 | - tk=8.6.12=h1ccaba5_0
80 | - tornado=6.3.2=py310h5eee18b_0
81 | - tqdm=4.66.1=pyhd8ed1ab_0
82 | - traitlets=5.7.1=py310h06a4308_0
83 | - tzdata=2023c=h04d1e81_0
84 | - wcwidth=0.2.5=pyhd3eb1b0_0
85 | - wheel=0.41.2=py310h06a4308_0
86 | - xz=5.4.2=h5eee18b_0
87 | - zeromq=4.3.4=h2531618_0
88 | - zlib=1.2.13=h5eee18b_0
89 | - pip:
90 |   - certifi==2023.7.22
91 |   - charset-normalizer==3.3.0
92 |   - filelock==3.12.4
93 |   - fsspec==2023.9.2
94 |   - idna==3.4
95 |   - jinja2==3.1.2
96 |   - markupsafe==2.1.3
97 |   - mpmath==1.3.0
98 |   - networkx==3.1
99 |   - numpy==1.24.0
100 |   - nvidia-cublas-cu12==12.1.3.1
101 |   - nvidia-cuda-cupti-cu12==12.1.105
102 |   - nvidia-cuda-nvrtc-cu12==12.1.105
103 |   - nvidia-cuda-runtime-cu12==12.1.105
104 |   - nvidia-cudnn-cu12==8.9.2.26
105 |   - nvidia-cufft-cu12==11.0.2.54
106 |   - nvidia-curand-cu12==10.3.2.106
107 |   - nvidia-cusolver-cu12==11.4.5.107
108 |   - nvidia-cusparse-cu12==12.1.0.106
109 |   - nvidia-nccl-cu12==2.18.1
110 |   - nvidia-nvjitlink-cu12==12.2.140
111 |   - nvidia-nvtx-cu12==12.1.105
112 |   - pillow==10.0.1
113 |   - requests==2.31.0
114 |   - sympy==1.12
115 |   - torch==2.1.0
116 |   - torchaudio==2.1.0
117 |   - torchvision==0.16.0
118 |   - triton==2.1.0
119 |   - typing-extensions==4.8.0
120 |   - urllib3==2.0.6
121 | prefix: /zfs/durenlab/palmetto/Kaya/env/conti1
122 |
--------------------------------------------------------------------------------
/doc/driver.md:
--------------------------------------------------------------------------------
1 | ## Driver Score
2 | We identify driver TFs underlying epigenetic and transcriptomic changes between control and AUD using a correlation model. We normalize the GRN and then calculate the Pearson correlation coefficient (PCC) between the expression or chromatin accessibility fold change and the regulatory strength of TGs or REs for each TF.
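The correlation idea can be illustrated with a minimal sketch. This is not the internals of `driver_score`, which are not shown here; the function name `pearson` and the toy values are hypothetical and only show how a PCC between fold changes and one TF's regulatory strengths is obtained.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical toy values: fold change of four target genes, and the
# regulatory strength of one TF on those same genes (same order).
fold_change = [1.0, 2.0, 3.0, 4.0]
tf_strength = [0.1, 0.2, 0.3, 0.4]
score = pearson(fold_change, tf_strength)
print(round(score, 6))
```

A TF whose strong targets are also the genes with the largest fold change gets a PCC near 1; a TF whose strong targets change in the opposite direction gets a PCC near -1.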
3 | ### Transcriptomics driver score
4 | ```python
5 | import pandas as pd
6 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0)
7 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove the mitochondrion, if the species is mouse, replace 'MT-' with 'mt-'
8 | import scanpy as sc
9 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad')
10 | label_all = adata_RNA.obs[['barcode','sample','label']]
11 | label_all.index = label_all['barcode']
12 | metadata = label_all.loc[TG_pseudobulk.columns]
13 | metadata.columns = ['barcode','group','celltype']
14 | GRN='trans_regulatory'
15 | adjust_method='bonferroni'
16 | corr_method='pearsonr'
17 | import numpy as np
18 | C_result_RNA_sp,P_result_RNA_sp,Q_result_RNA_sp=driver_score(TG_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method)
19 | K=3 # We choose the top K positive and negative TFs to save to the txt file for visualization purposes.
20 | C_result_RNA_sp_r,Q_result_RNA_sp_r=driver_result(C_result_RNA_sp,Q_result_RNA_sp,K)
21 | C_result_RNA_sp_r.to_csv('C_result_RNA_sp_r.txt',sep='\t')
22 | Q_result_RNA_sp_r.to_csv('Q_result_RNA_sp_r.txt',sep='\t')
23 | ```
24 | The adjust_method is the p-value adjustment method; you can choose one of the following:
25 | - bonferroni : one-step correction
26 | - sidak : one-step correction
27 | - holm-sidak : step down method using Sidak adjustments
28 | - holm : step-down method using Bonferroni adjustments
29 | - simes-hochberg : step-up method (independent)
30 | - hommel : closed method based on Simes tests (non-negative)
31 | - fdr_bh : Benjamini/Hochberg (non-negative)
32 | - fdr_by : Benjamini/Yekutieli (negative)
33 | - fdr_tsbh : two stage fdr correction (non-negative)
34 | - fdr_tsbky : two stage fdr correction (non-negative)
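To make the effect of two of these options concrete, here is a small pure-Python sketch of the `bonferroni` and `fdr_bh` adjustments. The option names coincide with the `method` values of `statsmodels.stats.multitest.multipletests`; whether `driver_score` calls that function internally is an assumption. The helper names and the toy p-values below are hypothetical.

```python
def bonferroni(pvals):
    """One-step correction: multiply each p-value by the number of tests."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]

def fdr_bh(pvals):
    """Benjamini/Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw p-values
print(bonferroni(pvals))
print(fdr_bh(pvals))
```

Bonferroni is the more conservative of the two: it controls the family-wise error rate, while Benjamini/Hochberg controls the false discovery rate.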
35 | ### Visualization
36 | ```python
37 | import os
38 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path
39 | import rpy2.robjects as robjects
40 | from rpy2.robjects import r
41 | # Import the R plotting package (ggplot2 as an example)
42 | r('library(ggplot2)')
43 | r('library(grid)')
44 | # Create data in R environment through Python
45 | r('''
46 | dataP=read.table('Q_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE)
47 | dataT=read.table('C_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE)
48 | sort_TF=rownames(dataT)
49 | library(tidyr)
50 | dataP=-log10(dataP)
51 | print(paste0('maximum of -log10P:',max(dataP)))
52 | dataP[dataP>40]=40
53 | dataP1=dataP
54 | dataP1$TF=rownames(dataP)
55 | longdiff0 <- gather(dataP1, sample, value,-TF)
56 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ]
57 | dataT1=dataT
58 | dataT1$TF=rownames(dataT)
59 | longdiff1=gather(dataT1, sample, value,-TF)
60 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ]
61 | colnames(longdiff1)=c('TF','celltype','PCC')
62 | longdiff1$P=longdiff0_s$value
63 | longdiff1$TF=factor(longdiff1$TF,levels=rev(sort_TF))
64 | library(egg)
65 | limits0=c(2,ceiling(dataP))
66 | range0 = c(1,4)
67 | breaks0 = c(2,(ceiling(dataP)-2)*1/4+2,(ceiling(dataP)-2)*2/4+2,(ceiling(dataP)-2)*3/4+2,ceiling(dataP))
68 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
69 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) +
70 | scale_size_continuous(limits = c(4, 40), range = c(1,5), breaks = c(4,10,20,30,40)) +
71 | labs( x= "cell type", y = "TF", fill = "") + theme_article()+
72 | theme(legend.key=element_blank(),
73 | axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1),
74 | legend.position = "right") +
75 | scale_fill_gradient2(midpoint=0, low="blue", mid="white",
76 | high="red", space ="Lab" )
77 | ggsave('driver_trans.pdf', plot=p) # save the figure; the size is left at the ggsave defaults
78 |
79 | ''')
80 | ```
81 | The figure is saved to driver_trans.pdf.
82 |
83 |

84 |
85 |
86 | ### Epigenetic driver score
87 | ```python
88 | RE_pseudobulk=pd.read_csv('data/RE_pseudobulk.tsv',sep=',',header=0,index_col=0)
89 | K=5
90 | GRN='TF_RE_binding'
91 | adjust_method='bonferroni'
92 | corr_method='pearsonr'
93 | C_result_RE,P_result_RE,Q_result_RE=driver_score(RE_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method)
94 | C_result_RE_r,Q_result_RE_r=driver_result(C_result_RE,Q_result_RE,K)
95 | C_result_RE_r.to_csv('C_result_RE_r.txt',sep='\t')
96 | Q_result_RE_r.to_csv('Q_result_RE_r.txt',sep='\t')
97 | ```
98 | ### Visualization
99 | ```python
100 | import os
101 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path
102 | import rpy2.robjects as robjects
103 | from rpy2.robjects import r
104 | # Import the R plotting package (ggplot2 as an example)
105 | r('library(ggplot2)')
106 | r('library(grid)')
107 | # Create data in R environment through Python
108 | r('''
109 | dataP=read.table('Q_result_RE_r.txt',sep='\t',row.names=1,header=TRUE)
110 | dataT=read.table('C_result_RE_r.txt',sep='\t',row.names=1,header=TRUE)
111 | sort_TF=rownames(dataT)
112 | library(tidyr)
113 | dataP=-log10(dataP)
114 | print(paste0('maximum of -log10P:',max(dataP)))
115 | maxP=100
116 | dataP[dataP>100]=100
117 | dataP1=dataP
118 | dataP1$TF=rownames(dataP)
119 | longdiff0 <- gather(dataP1, sample, value,-TF)
120 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ]
121 | dataT1=dataT
122 | dataT1$TF=rownames(dataT)
123 | longdiff1=gather(dataT1, sample, value,-TF)
124 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ]
125 | colnames(longdiff1)=c('TF','celltype','PCC')
126 | longdiff1$P=longdiff0_s$value
127 | longdiff1$TF=factor(longdiff1$TF,levels=(sort_TF))
128 | library(egg)
129 | cutoff=2
130 | maxp=ceiling(max(dataP))
131 | print(maxp)
132 | limits0=c(cutoff,maxp)
133 | print(limits0)
134 | range0 = c(1,4)
135 | numbreak=5
136 | d=ceiling((maxp-cutoff)/(numbreak-1))
137 | print(d)
138 | breaks0= seq(from = cutoff, to = cutoff+d*(numbreak-1), length.out=numbreak)
139 |
140 | print(range0)
141 | print(breaks0)
142 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
143 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) +
144 | scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) +
145 | labs( x= "cell type", y = "TF", fill = "") + theme_article()+
146 | theme(legend.key=element_blank(),
147 | axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1),
148 | legend.position = "right") +
149 | scale_fill_gradient2(midpoint=0, low="blue", mid="white",
150 | high="red", space ="Lab" )
151 |
152 |
153 | ''')
154 | r('''
155 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1)
156 | library(pheatmap)
157 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ])
158 | colnames(anno1)=c('TG')
159 | rownames(anno1)=rownames(dataP)
160 | anno1[is.na(anno1)]=0
161 | anno1[,1]=paste0('M',anno1[,1])
162 | anno1$name=rownames(anno1)
163 | longdiff0 <- gather(anno1, sample, value,-name)
164 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1))
165 | library("RColorBrewer")
166 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb',
167 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d')
168 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d')
169 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) +
170 | geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) +
171 | theme_article()+theme(text = element_text(size = 9), legend.position = "left",
172 | )
173 | ''')
174 | r('''
175 | widths <- c(4.5, 4.5+dim(dataP)[2])
176 | print(unique(longdiff0$value))
177 | pdf('driver_epi.pdf',width=6/16*(9+dim(dataP)[2]),height= dim(anno1)[1]/10+1.5)
178 | #print(p)
179 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths)
180 | dev.off()
181 | ''')
182 | ```
183 | The figure is saved to driver_epi.pdf.
184 |
185 |

186 |
187 |
--------------------------------------------------------------------------------
/docs/ATAC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ATAC.png
--------------------------------------------------------------------------------
/docs/Benchmark.md:
--------------------------------------------------------------------------------
1 | # Benchmark gene regulatory network
2 | The example input we provide is an in-silico mixture of H1, BJ, GM12878, and K562 cell lines from SNARE-seq data. We benchmark the GRN of the H1 cell line as an example.
3 |
4 | ## Download the ground truth data
5 | The ground truth data are the putative TF targets from [Cistrome](http://cistrome.org/), derived from ChIP-seq data. Here, we take the ChIP-seq data of E2F6 in the H1 cell line as an example. We download the ground truth from Cistrome as follows (46177_gene_score_5fold.txt).
6 |
7 |

8 |
9 | ## Prepare GRN file
10 | You can provide a list of predicted GRN files. We support 2 formats:
11 | - list: There are 3 columns, Target gene (TG), TF, and regulation strength (score)
12 |
13 |

14 |
15 | - matrix: The rows are gene names, the columns are TF names, and the values are regulation strengths.
16 |
17 | Here, we give cell type specific and cell population GRN files as examples (the format here is 'matrix' described above).
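The relationship between the two formats can be sketched as follows. This is an illustration only, not code from LINGER: the function `list_to_matrix`, the gene names, and the choice to fill missing TG-TF pairs with 0 are all assumptions.

```python
def list_to_matrix(edges):
    """Convert (TG, TF, score) triples into a nested dict matrix[TG][TF]."""
    genes = sorted({tg for tg, _, _ in edges})
    tfs = sorted({tf for _, tf, _ in edges})
    # Pairs absent from the list are filled with 0.0 (an assumption).
    matrix = {tg: {tf: 0.0 for tf in tfs} for tg in genes}
    for tg, tf, score in edges:
        matrix[tg][tf] = score
    return genes, tfs, matrix

# Hypothetical 'list' format records: target gene, TF, regulation strength.
edges = [("GENE1", "MYC", 0.8), ("GENE2", "MYC", 0.1), ("GENE1", "E2F6", 0.3)]
genes, tfs, matrix = list_to_matrix(edges)

# Print the 'matrix' format: rows are genes, columns are TFs.
print("\t".join([""] + tfs))
for tg in genes:
    print("\t".join([tg] + [str(matrix[tg][tf]) for tf in tfs]))
```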
18 |
19 | ## ROC and PR curves
20 | ```python
21 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/'
22 | TFName = 'MYC'
23 | Method_name=['H1','cell population']
24 | Infer_trans=[outdir+'cell_type_specific_trans_regulatory_0.txt',outdir+'cell_population_trans_regulatory.txt']
25 | Groundtruth='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/groundtruth/45691_gene_score_5fold.txt'
26 | from LingerGRN import Benchmk
27 | Benchmk.bm_trans(TFName,Method_name,Groundtruth,Infer_trans,outdir,'matrix')
28 | ```
29 |
30 |

31 |
32 |
33 |

34 |
35 |
36 | The results are automatically saved in outdir with the names trans_roc_curve+TFName+.pdf and trans_pr_curve+TFName+.pdf.
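For readers unfamiliar with the metric behind the ROC curve, the following minimal sketch computes the area under the ROC curve from predicted scores and binary ground truth labels. It is a generic illustration, not the computation inside `Benchmk.bm_trans`, which is not shown here; the function name and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive is scored higher; ties count as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical data: predicted trans-regulatory scores of one TF for four
# genes, and whether each gene is a ground truth target (1) or not (0).
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(auc)
```

An AUC of 0.5 corresponds to random ranking, and 1.0 means every true target is ranked above every non-target.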
37 |
--------------------------------------------------------------------------------
/docs/GRN_head.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/GRN_head.png
--------------------------------------------------------------------------------
/docs/GRN_infer.md:
--------------------------------------------------------------------------------
1 | # Construct the gene regulatory network
2 | # Load data
3 | ## Download the general gene regulatory network
4 | We provide the general gene regulatory network; please download the data first.
5 | ```sh
6 | Datadir=/path/to/LINGER/ # the directory to store the data; please use an absolute path. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/
7 | mkdir -p $Datadir
8 | cd $Datadir
9 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1lAlzjU5BYbpbr4RHMlAGDOh9KWdCMQpS' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1lAlzjU5BYbpbr4RHMlAGDOh9KWdCMQpS" -O data_bulk.tar.gz && rm -rf /tmp/cookies.txt
10 | ```
11 | Then extract the archive:
12 | ```sh
13 | tar -xzf data_bulk.tar.gz
14 | ```
15 |
16 | ## Input data
17 | The input data should be in anndata format. In this example, we need to convert the following data types to anndata.
18 | - Single-cell multiome data including gene expression (RNA.txt in our example) and chromatin accessibility (ATAC.txt in our example).
19 | - Cell annotation/cell type label if you need the cell type specific gene regulatory network (label.txt in our example).
20 | ### RNA-seq
21 | The rows of the RNA-seq data are gene symbols, the columns are barcodes, and the values are counts. Here is our example:
22 |
23 |

24 |
25 |
26 | ### ATAC-seq
27 | The rows are regulatory elements/genomic regions; the columns are barcodes, in the same order as in the RNA-seq data; the values are counts. Here is our example:
28 |
29 |

30 |
31 |
32 | ### Cell annotation/cell type label
33 | The rows are cell barcodes, in the same order as in the RNA-seq data; there is one column, 'Annotation', which is the cell type label. It can be a number or a string. Here is our example:
34 |
35 |

36 |
37 |
38 | ### Provided input example
39 | You can download the example input datasets into a directory of your choice. This is sc multiome data from an in-silico mixture of H1, BJ, GM12878, and K562 cell lines from droplet-based single-nucleus chromatin accessibility and mRNA expression sequencing (SNARE-seq) data.
40 |
41 | ```sh
42 | Input_dir=/path/to/dir/
43 | # The input data directory. Example: Input_dir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/
44 | cd $Input_dir
45 | #ATAC-seq
46 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1qmMudeixeRbYS8LCDJEuWxlAgeM0hC1r' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1qmMudeixeRbYS8LCDJEuWxlAgeM0hC1r" -O ATAC.txt && rm -rf /tmp/cookies.txt
47 | #RNA-seq
48 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1dP4ITjQZiVDa52xfDTo5c14f9H0MsEGK' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1dP4ITjQZiVDa52xfDTo5c14f9H0MsEGK" -O RNA.txt && rm -rf /tmp/cookies.txt
49 | #label
50 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZeEp5GnWfQJxuAY0uK9o8s_uAvFsNPI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZeEp5GnWfQJxuAY0uK9o8s_uAvFsNPI5" -O label.txt && rm -rf /tmp/cookies.txt
51 | ```
52 | ## LINGER
53 | ### Install
54 | ```sh
55 | conda create -n LINGER python==3.10.0
56 | conda activate LINGER
57 | pip install LingerGRN==1.92
58 | conda install bioconda::bedtools #Requirement
59 | ```
60 | ### Preprocess
61 | There are 2 options of the method we introduced above:
62 | 1. baseline;
63 | ```python
64 | method='baseline' # this corresponds to bulkNN
65 | ```
66 | 2. LINGER;
67 | ```python
68 | method='LINGER'
69 | ```
70 | #### Transfer sc-multiome data to anndata
71 | ```python
72 | import pandas as pd
73 | label=pd.read_csv('data/label.txt',sep='\t',header=0,index_col=None)
74 | RNA=pd.read_csv('data/RNA.txt',sep='\t',header=0,index_col=0)
75 | ATAC=pd.read_csv('data/ATAC.txt',sep='\t',header=0,index_col=0)
76 | from scipy.sparse import csc_matrix
77 | # Convert the NumPy array to a sparse csc_matrix
78 | matrix = csc_matrix(pd.concat([RNA,ATAC],axis=0).values)
79 | features=pd.DataFrame(RNA.index.tolist()+ATAC.index.tolist(),columns=[1])
80 | K=RNA.shape[0]
81 | N=K+ATAC.shape[0]
82 | types = ['Gene Expression' if i < K else 'Peaks' for i in range(0, N)] # the first K rows (indices 0..K-1) are genes, the rest are peaks
83 | features[2]=types
84 | barcodes=pd.DataFrame(RNA.columns.values,columns=[0])
85 | from LingerGRN.preprocess import *
86 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
87 | ```
88 | #### Remove low-count cells and genes
89 | ```python
90 | import scanpy as sc
91 | sc.pp.filter_cells(adata_RNA, min_genes=200)
92 | sc.pp.filter_genes(adata_RNA, min_cells=3)
93 | sc.pp.filter_cells(adata_ATAC, min_genes=200)
94 | sc.pp.filter_genes(adata_ATAC, min_cells=3)
95 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values))
96 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
97 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]
98 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
99 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
100 | ```
101 | #### Generate the pseudo-bulk/metacell
102 | ```python
103 | from LingerGRN.pseudo_bulk import *
104 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode
105 | tempsample=samplelist[0]
106 | TG_pseudobulk=pd.DataFrame([])
107 | RE_pseudobulk=pd.DataFrame([])
108 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
109 | for tempsample in samplelist:
110 |     adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
111 |     adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
112 |     TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)
113 |     TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
114 |     RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
115 | RE_pseudobulk[RE_pseudobulk > 100] = 100
116 |
117 | import os
118 | if not os.path.exists('data/'):
119 |     os.mkdir('data/')
120 | adata_ATAC.write('data/adata_ATAC.h5ad')
121 | adata_RNA.write('data/adata_RNA.h5ad')
122 | TG_pseudobulk=TG_pseudobulk.fillna(0)
123 | RE_pseudobulk=RE_pseudobulk.fillna(0)
124 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
125 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
126 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
127 | ```
128 | ### Training model
129 | Overlap the region with general GRN:
130 | ```python
131 | from LingerGRN.preprocess import *
132 | Datadir='/path/to/LINGER/'# This directory should be the same as Datadir defined in the above 'Download the general gene regulatory network' section
133 | GRNdir=Datadir+'data_bulk/'
134 | genome='hg38'
135 | outdir='/path/to/output/' #output dir
136 | preprocess(TG_pseudobulk,RE_pseudobulk,GRNdir,genome,method,outdir)
137 | ```
138 | Train the LINGER model.
139 | ```python
140 | import LingerGRN.LINGER_tr as LINGER_tr
141 | activef='ReLU' # activation function, chosen from 'ReLU','sigmoid','tanh'
142 | LINGER_tr.training(GRNdir,method,outdir,activef)
143 | ```
144 | ### Cell population gene regulatory network
145 | #### TF binding potential
146 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
147 | ```python
148 | import LingerGRN.LL_net as LL_net
149 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
150 | ```
151 |
152 | #### *cis*-regulatory network
153 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
154 | ```python
155 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
156 | ```
157 | #### *trans*-regulatory network
158 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
159 | ```python
160 | LL_net.trans_reg(GRNdir,method,outdir)
161 | ```
162 |
163 | ### Cell type specific gene regulatory network
164 | There are 2 options:
165 | 1. infer GRN for a specific cell type, which is in the label.txt;
166 | ```python
167 | celltype='0'#use a string to assign your cell type
168 | ```
169 | 2. infer GRNs for all cell types.
170 | ```python
171 | celltype='all'
172 | ```
173 | Please make sure that 'all' is not a cell type in your data.
174 |
175 | #### TF binding potential
176 | The output is 'cell_type_specific_TF_RE_binding_{*celltype*}.txt', a matrix of the TF-RE binding potential.
177 | ```python
178 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir)
179 | ```
180 |
181 | #### *cis*-regulatory network
182 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
183 | ```python
184 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir)
185 | ```
186 |
187 | #### *trans*-regulatory network
188 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
189 | ```python
190 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
191 | ```
192 |
193 | ## Note
194 | - The cell type specific GRN is based on the output of the cell population GRN.
195 | - If you want to try both method options, create 2 separate output directories.
196 | ## Identify driver regulators by TF activity
197 | ### Introduction
198 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators.
199 |
200 | ### Prepare
201 | We need a *trans*-regulatory network; you can choose the network that best matches your data.
202 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can choose a GRN from various tissues.
203 | ```python
204 | network = 'general'
205 | ```
206 | 2. If your gene expression data are matched with cell population GRN, you can set
207 | ```python
208 | network = 'cell population'
209 | ```
210 | 3. If your gene expression data are matched with certain cell type, you can set network to the name of this cell type.
211 | ```python
212 | network = '0' # 0 is the name of one cell type
213 | ```
214 |
215 | ### Calculate TF activity
216 | The input is gene expression data. It could be the scRNA-seq data from the sc multiome data, or other sc or bulk RNA-seq data that match the GRN. The rows of the gene expression data are genes, the columns are samples, and the values are read counts (sc) or FPKM/RPKM (bulk).
217 |
218 | ```python
219 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/'# this directory should be the same as the Datadir defined above
220 | GRNdir=Datadir+'data_bulk/'
221 | genome='hg38'
222 | from LingerGRN.TF_activity import *
223 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir
224 | import anndata
225 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad')
226 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
227 | ```
228 | Visualize the TF activity heatmap by cluster. If you want to save the heatmap to outdir, please set 'save=True'. The output is 'heatmap_activity.png'.
229 | ```python
230 | save=True
231 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
232 | ```
233 |
234 |

235 |
236 |
237 | ### Identify driver regulator
238 | We use a t-test to find the TFs with differential activity in a certain cell type.
239 | 1. You can assign a certain cell type of the gene expression data by
240 | ```python
241 | celltype='0'
242 | ```
243 | 2. Or, you can obtain the result for all cell types.
244 | ```python
245 | celltype='all'
246 | ```
247 |
248 | For example,
249 |
250 | ```python
251 | celltype='0'
252 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
253 | t_test_results
254 | ```
255 |
256 |
257 |

258 |
259 |
260 | Visualize the differential activity and expression. You can compare two cell types, or one cell type against all others. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.
261 |
262 | ```python
263 | TFName='ATF1'
264 | datatype='activity'
265 | celltype1='0'
266 | celltype2='Others'
267 | save=True
268 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
269 | ```
270 |
271 |
272 |

273 |
274 |
275 | For gene expression data, the boxplot is:
276 | ```python
277 | datatype='expression'
278 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
279 | ```
280 |
281 |
282 |

283 |
284 |
--------------------------------------------------------------------------------
/docs/H1_E2F6_trans.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/H1_E2F6_trans.jpg
--------------------------------------------------------------------------------
/docs/LINGER.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/LINGER.png
--------------------------------------------------------------------------------
/docs/PBMC.md:
--------------------------------------------------------------------------------
1 | # PBMCs Tutorial
2 | ## Instruction
3 | This tutorial delineates a computational framework for constructing gene regulatory networks (GRNs) from single-cell multiome data. We provide 2 options to do this: '**baseline**' and '**LINGER**'. The first is a naive method that combines the prior GRNs with features from the single-cell data, offering a rapid approach. LINGER integrates the comprehensive gene regulatory profile from external bulk data. As shown in the following figure, LINGER uses lifelong machine learning (continual learning) based on neural network (NN) models, which leverages the knowledge learned in previous tasks to help learn the new task better.
4 |
5 |

6 |
7 |
8 | After constructing the GRNs for the cell population, we infer the cell type specific one using the feature engineering approach. As shown in the following figure, we combine the single cell data ($O, E$, and $C$ in the figure) and the prior gene regulatory network structure with the parameters $\alpha,\beta,d,B$, and $\gamma$.
9 |
10 | 
11 |
12 | ## Download the general gene regulatory network
13 | We provide the general gene regulatory network; please download the data first.
14 | ```sh
15 | Datadir=/path/to/LINGER/ # the directory to store the data; please use an absolute path. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/
16 | mkdir -p "$Datadir"
17 | cd "$Datadir"
18 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b" -O data_bulk.tar.gz && rm -rf /tmp/cookies.txt
19 | ```
20 | or use the following link: [https://drive.google.com/file/d/1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b/view?usp=sharing](https://drive.google.com/file/d/1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b/view?usp=sharing)
21 |
22 | Then unzip,
23 | ```sh
24 | tar -xzf data_bulk.tar.gz
25 | ```
26 |
27 | ## Prepare the input data
28 | The input data are the feature matrix from 10x sc-multiome data and the cell annotation (cell type labels), which include:
29 | - Single-cell multiome data including matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz.
30 | - Cell annotation/cell type label if you need the cell type-specific gene regulatory network (PBMC_label.txt in our example).
31 |
32 |

33 |
34 |
35 | If the input data is a 10x h5 file or an h5ad file from scanpy, please follow the instructions in [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md).
36 |
37 | ### sc data
38 | We download the data from the shell command line.
39 | ```sh
40 | mkdir -p data
41 | wget -O data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz
42 | tar -xzvf data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz
43 | mv filtered_feature_bc_matrix data/
44 | gzip -d data/filtered_feature_bc_matrix/*
45 | ```
46 | We provide the cell annotation as follows:
47 | ```sh
48 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_" -O PBMC_label.txt && rm -rf /tmp/cookies.txt
49 | mv PBMC_label.txt data/
50 | ```
51 | ## LINGER
52 | ### Install
53 | ```sh
54 | conda create -n LINGER python==3.10.0
55 | conda activate LINGER
56 | pip install LingerGRN==1.105
57 | conda install bioconda::bedtools #Requirement
58 | ```
59 | For the following steps, we run the code in Python.
60 | ### Preprocess
61 | There are 2 options for the method we introduced above:
62 | 1. baseline;
63 | ```python
64 | method='baseline' # this method corresponds to bulkNN described in the paper
65 | ```
66 | 2. LINGER;
67 | ```python
68 | method='LINGER'
69 | ```
70 | #### Convert the sc-multiome data to anndata
71 |
72 | We will convert the sc-multiome data to the anndata format and filter the cell barcodes by the cell type label.
73 | ```python
74 | import scanpy as sc
75 | # set some figure parameters for nice display inside Jupyter notebooks
76 | %matplotlib inline
77 | sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white')
78 | sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
79 | sc.logging.print_header()
80 | #results_file = "scRNA/pbmc10k.h5ad"
81 | import scipy.io  # import the io submodule explicitly so scipy.io.mmread is available
82 | import pandas as pd
83 | matrix=scipy.io.mmread('data/filtered_feature_bc_matrix/matrix.mtx')
84 | features=pd.read_csv('data/filtered_feature_bc_matrix/features.tsv',sep='\t',header=None)
85 | barcodes=pd.read_csv('data/filtered_feature_bc_matrix/barcodes.tsv',sep='\t',header=None)
86 | label=pd.read_csv('data/PBMC_label.txt',sep='\t',header=0)
87 | from LingerGRN.preprocess import *
88 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
89 | ```
90 | #### Remove low counts cells and genes
91 | ```python
92 | import scanpy as sc
93 | sc.pp.filter_cells(adata_RNA, min_genes=200)
94 | sc.pp.filter_genes(adata_RNA, min_cells=3)
95 | sc.pp.filter_cells(adata_ATAC, min_genes=200)
96 | sc.pp.filter_genes(adata_ATAC, min_cells=3)
97 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values))
98 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
99 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]
100 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
101 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
102 | ```
103 | #### Generate the pseudo-bulk/metacell:
104 | ```python
105 | from LingerGRN.pseudo_bulk import *
106 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode
107 | tempsample=samplelist[0]
108 | TG_pseudobulk=pd.DataFrame([])
109 | RE_pseudobulk=pd.DataFrame([])
110 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
111 | for tempsample in samplelist:
112 |     adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
113 |     adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
114 |     TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)
115 |     TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
116 |     RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
117 | RE_pseudobulk[RE_pseudobulk > 100] = 100
118 |
119 | import os
120 | if not os.path.exists('data/'):
121 |     os.mkdir('data/')
122 | adata_ATAC.write('data/adata_ATAC.h5ad')
123 | adata_RNA.write('data/adata_RNA.h5ad')
124 | TG_pseudobulk=TG_pseudobulk.fillna(0)
125 | RE_pseudobulk=RE_pseudobulk.fillna(0)
126 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
127 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
128 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
129 | ```
130 | ### Training model
131 | Overlap the region with general GRN:
132 | ```python
133 | from LingerGRN.preprocess import *
134 | Datadir='/path/to/LINGER/'# This directory should be the same as Datadir defined in the above 'Download the general gene regulatory network' section
135 | GRNdir=Datadir+'data_bulk/'
136 | genome='hg38'
137 | outdir='/path/to/output/' #output dir
138 | preprocess(TG_pseudobulk,RE_pseudobulk,GRNdir,genome,method,outdir)
139 | ```
140 | Train the LINGER model.
141 | ```python
142 | import LingerGRN.LINGER_tr as LINGER_tr
143 | activef='ReLU' # activation function, chosen from 'ReLU', 'sigmoid', 'tanh'
144 | LINGER_tr.training(GRNdir,method,outdir,activef,'Human')
145 | ```
146 |
147 |
148 | ### Cell population gene regulatory network
149 | #### TF binding potential
150 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
151 | ```python
152 | import LingerGRN.LL_net as LL_net
153 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
154 | ```
155 |
156 | #### *cis*-regulatory network
157 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
158 | ```python
159 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
160 | ```
161 | #### *trans*-regulatory network
162 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
163 | ```python
164 | LL_net.trans_reg(GRNdir,method,outdir,genome)
165 | ```
166 |
167 | ### Cell type specific gene regulatory network
168 | There are 2 options:
169 | 1. infer the GRN for a specific cell type, which must appear in the label file (PBMC_label.txt in our example);
170 | ```python
171 | celltype='CD56 (bright) NK cells' #use a string to assign your cell type
172 | ```
173 | 2. infer GRNs for all cell types.
174 | ```python
175 | celltype='all'
176 | ```
177 | Please make sure that 'all' is not a cell type in your data.
178 |
179 | #### TF binding potential
180 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential.
181 | ```python
182 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version
183 | ```
184 |
185 | #### *cis*-regulatory network
186 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
187 | ```python
188 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir)
189 | ```
190 |
191 | #### *trans*-regulatory network
192 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
193 | ```python
194 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
195 | ```
196 | ## Identify driver regulators by TF activity
197 | ### Instruction
198 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators.
199 |
200 | ### Prepare
201 | You can choose a *trans*-regulatory network that matches your data best.
202 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can choose a general GRN derived from various tissues.
203 | ```python
204 | network = 'general'
205 | ```
206 | 2. If your gene expression data are matched with cell population GRN, you can set
207 | ```python
208 | network = 'cell population'
209 | ```
210 | 3. If your gene expression data are matched with certain cell type, you can set network to the name of this cell type.
211 | ```python
212 | network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type
213 | ```
214 |
215 | ### Calculate TF activity
216 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. In the gene expression data, rows are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
217 |
218 | ```python
219 |
220 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/' # this directory should be the same as the Datadir defined in the 'Download the general gene regulatory network' section
221 | GRNdir=Datadir+'data_bulk/'
222 | genome='hg38'
223 | from LingerGRN.TF_activity import *
224 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir
225 | import anndata
226 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad')
227 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
228 | ```
229 | Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'.
230 | ```python
231 | save=True
232 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
233 | ```
234 |
235 |

236 |
237 |
238 | ### Identify driver regulator
239 | We use a t-test on TF activity to find the TFs that are differentially active in a given cell type.
240 | 1. You can assign a certain cell type of the gene expression data by
241 | ```python
242 | celltype='CD56 (bright) NK cells'
243 | ```
244 | 2. Or, you can obtain the result for all cell types.
245 | ```python
246 | celltype='all'
247 | ```
248 |
249 | For example,
250 |
251 | ```python
252 | celltype='CD56 (bright) NK cells'
253 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
254 | t_test_results
255 | ```
256 |
257 |
258 |

259 |
260 |
261 | Visualize the differential activity and expression. You can compare two cell types, or one cell type against all others. To save the box plot to outdir, set `save=True`. The output is `box_plot_{TFName}_{datatype}_{celltype1}_{celltype2}.png`.
262 |
263 | ```python
264 | TFName='ATF1'
265 | datatype='activity'
266 | celltype1='CD56 (bright) NK cells'
267 | celltype2='Others'
268 | save=True
269 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
270 | ```
271 |
272 |
273 |
 NK cells_Others.png)
274 |
275 |
276 | For gene expression data, the boxplot is:
277 | ```python
278 | datatype='expression'
279 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
280 | ```
281 |
282 |
283 |
 NK cells_Others.png)
284 |
285 |
286 | ## Note
287 | - The cell type specific GRN is based on the output of the cell population GRN.
288 | - To try both method options, create a separate output directory for each.
289 |
290 |
291 |
--------------------------------------------------------------------------------
/docs/PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png
--------------------------------------------------------------------------------
/docs/PBMCs_heatmap_activity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_heatmap_activity.png
--------------------------------------------------------------------------------
/docs/PBMCs_ttest.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_ttest.png
--------------------------------------------------------------------------------
/docs/POU5F1_KO_Diff_Umap_NANOG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Diff_Umap_NANOG.png
--------------------------------------------------------------------------------
/docs/POU5F1_KO_Diff_exp_Umap_NANOG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Diff_exp_Umap_NANOG.png
--------------------------------------------------------------------------------
/docs/POU5F1_KO_Differentiation_Umap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Differentiation_Umap.png
--------------------------------------------------------------------------------
/docs/PWM.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PWM.jpg
--------------------------------------------------------------------------------
/docs/RNA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/RNA.png
--------------------------------------------------------------------------------
/docs/RNA_ds.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/RNA_ds.jpg
--------------------------------------------------------------------------------
/docs/S_TG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/S_TG.png
--------------------------------------------------------------------------------
/docs/TFactivity.md:
--------------------------------------------------------------------------------
1 | # Identify driver regulators by TF activity
2 | ## Instruction
3 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators.
4 |
5 | ## Prepare
6 | We need a *trans*-regulatory network; choose the one that best matches your data.
7 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can choose a general GRN derived from various tissues.
8 | ```python
9 | network = 'general'
10 | ```
11 | 2. If your gene expression data are matched with cell population GRN, you can set
12 | ```python
13 | network = 'cell population'
14 | ```
15 | 3. If your gene expression data are matched with certain cell type, you can set network to the name of this cell type.
16 | ```python
17 | network = '0' # 0 is the name of one cell type
18 | ```
19 |
20 | ## Calculate TF activity
21 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. In the gene expression data, rows are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
22 |
23 | ```python
24 |
25 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/' # this directory should be the same as the Datadir used when downloading the general GRN
26 | GRNdir=Datadir+'data_bulk/'
27 | genome='hg38'
28 | from LingerGRN.TF_activity import *
29 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir
30 | import anndata
31 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad')
32 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
33 | ```
34 | Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'.
35 | ```python
36 | save=True
37 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
38 | ```
39 |
40 |

41 |
42 |
43 | ## Identify driver regulator
44 | We use a t-test on TF activity to find the TFs that are differentially active in a given cell type.
45 | 1. You can assign a certain cell type of the gene expression data by
46 | ```python
47 | celltype='0'
48 | ```
49 | 2. Or, you can obtain the result for all cell types.
50 | ```python
51 | celltype='all'
52 | ```
53 |
54 | For example,
55 |
56 | ```python
57 | celltype='0'
58 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
59 | t_test_results
60 | ```
61 |
62 |
63 |

64 |
65 |
66 | Visualize the differential activity and expression. You can compare two cell types, or one cell type against all others. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.
67 |
68 | ```python
69 | TFName='ATF1'
70 | datatype='activity'
71 | celltype1='0'
72 | celltype2='Others'
73 | save=True
74 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
75 | ```
76 |
77 |
78 |

79 |
80 |
81 | For gene expression data, the boxplot is:
82 | ```python
83 | datatype='expression'
84 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
85 | ```
86 |
87 |
88 |

89 |
90 |
--------------------------------------------------------------------------------
/docs/User_guide.md:
--------------------------------------------------------------------------------
1 | # LINGER
2 | LINGER (LIfelong neural Network for GEne Regulation) is an interpretable artificial intelligence model designed to analyze sc-multiome data (RNA-seq count and chromatin accessibility count data) that can subsequently be used for many downstream tasks.
3 |
4 | The advantages of LINGER are:
5 | - incorporate large-scale external bulk data and neural networks
6 | - integrate TF-RE motif matching knowledge
7 | - high accuracy of gene regulatory network (GRN) inference
8 | - enable the estimation of TF activity solely from gene expression data
9 |
10 | ## Method overview
11 | LINGER uses neural network architectures to model transcriptional regulation, predicting target gene (TG) expression levels from transcription factor (TF) and regulatory element (RE) profiles. The loss function is composed of 3 parts:
12 | - Prediction accuracy: mean squared error (MSE) with L1 regularization
13 | - Elastic weight consolidation (EWC) loss of lifelong learning, to incorporate large-scale external bulk data
14 | - Manifold regularization, to integrate prior biological understanding of TF-RE interactions
15 |
16 |
17 |

18 |
19 |
20 |
21 | ## Neural network model structure
22 | LINGER trains individual models for each gene using a neural network architecture that includes a single input layer and two fully connected hidden layers. The input layer has dimension equal to the number of features, containing all TFs and REs within 1Mb from the transcription start site (TSS) for the gene to be predicted. The first hidden layer has 64 neurons with rectified linear unit (ReLU) activation. The second hidden layer has 16 neurons with ReLU activation. The output layer is a single neuron, which outputs a real value for gene expression prediction.
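
The layer sizes described above can be sketched as a plain forward pass. This is a minimal illustration, not LINGER's code: LINGER itself is built on PyTorch, while NumPy is used here only to keep the sketch small. The helper names `init_params` and `forward`, the random initialization, and the toy input sizes are all assumptions.

```python
import numpy as np

def init_params(n_features, rng):
    """Random weights for input -> 64 -> 16 -> 1 (layer sizes from the text)."""
    sizes = [n_features, 64, 16, 1]
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: ReLU after each hidden layer, linear output neuron."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU activation
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)      # one real value per sample

rng = np.random.default_rng(0)
n_samples, n_features = 5, 30           # toy sizes, not LINGER defaults
x = rng.normal(size=(n_samples, n_features))  # stand-in for TF and RE profiles
params = init_params(n_features, rng)
y_hat = forward(params, x)              # predicted expression, shape (n_samples,)
```

One such model is trained per target gene, so `n_features` differs from gene to gene depending on how many REs fall within 1Mb of its TSS.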
23 |
24 | ## Tasks
25 | ### Cell population gene regulatory network inference
26 | We use the average of the absolute Shapley values across samples to infer the regulation strength of TFs and REs on target genes, generating the RE-TG *cis*-regulatory strength and the TF-TG *trans*-regulatory strength. To generate the TF-RE binding strength, we use the weights from the input layer (TFs and REs) to all nodes in the second layer of the NN model as the embedding of each TF or RE. The TF-RE binding strength is the Pearson correlation coefficient (PCC) between the TF and RE embeddings.
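
Both quantities reduce to a few array operations. The sketch below uses random stand-in data and is not LINGER's implementation: the array shapes, the split of features into TFs and REs, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Shapley values for one target gene: (n_samples, n_features).
shap_values = rng.normal(size=(100, 6))
strength = np.abs(shap_values).mean(axis=0)  # one regulation strength per TF/RE

# Hypothetical input-layer weights: one 64-dimensional embedding per input feature.
W1 = rng.normal(size=(6, 64))
tf_idx, re_idx = [0, 1], [2, 3, 4, 5]  # assumed split of features into TFs and REs

# np.corrcoef treats rows as variables, so this gives all pairwise PCCs.
corr = np.corrcoef(W1)
tf_re_binding = corr[np.ix_(tf_idx, re_idx)]  # shape: (n_TFs, n_REs)
```

Taking the absolute value before averaging keeps activating and repressing contributions from cancelling out across samples.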
27 | ### Cell type specific gene regulatory network
28 | After constructing the GRNs for the cell population, we infer the cell type specific one using the feature engineering approach. As shown in the following figure, we combine the single cell data ($O, E$, and $C$ in the figure) and the prior gene regulatory network structure with the parameters $\alpha,\beta,d,B$, and $\gamma$.
29 |
30 | 
31 |
32 | ### Benchmark gene regulatory network
33 | We systematically assess the performance of the GRN by the metrics, AUC and AUPR ratio.
34 | - *trans*: the ground truth include and knock down
35 | - *cis*: the ground truth is HiC and eQTL
36 | - TF-RE: the ground truth is ChIP-seq data from
37 | ### Transcription factor activity
38 | Assuming the GRN structure is consistent across individuals, we employ LINGER inferred GRNs from sc-multiome data of a single individual to estimate TF activity using gene expression data alone from the same or other individuals. By comparing TF activity between cases and controls, we identify driver regulators.
39 | ### In silico perturbation
40 | We predict gene expression when one TF, or several TFs together, are knocked out. We can then predict the target gene expression after the perturbation.
41 | ## Tutorials
42 | - [PBMCs](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md)
43 | - [H1 cell line](https://github.com/Durenlab/LINGER/blob/main/docs/GRN_infer.md)
44 | - [Non-human species](https://github.com/Durenlab/LINGER/blob/main/docs/scNN.md)
45 |
--------------------------------------------------------------------------------
/docs/adata_ATAC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/adata_ATAC.png
--------------------------------------------------------------------------------
/docs/adata_RNA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/adata_RNA.png
--------------------------------------------------------------------------------
/docs/barcode_mm10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/barcode_mm10.png
--------------------------------------------------------------------------------
/docs/box_plot_ATF1_activity_0_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_activity_0_Others.png
--------------------------------------------------------------------------------
/docs/box_plot_ATF1_expression_0_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_expression_0_Others.png
--------------------------------------------------------------------------------
/docs/box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png
--------------------------------------------------------------------------------
/docs/box_plot_Erg_activity_1_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_Erg_activity_1_Others.png
--------------------------------------------------------------------------------
/docs/box_plot_Erg_expression_1_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_Erg_expression_1_Others.png
--------------------------------------------------------------------------------
/docs/downstream.md:
--------------------------------------------------------------------------------
1 | # Downstream analysis - Module detection
2 | ## Regulatory Module
3 | For this analysis, we first detect key TF-TG subnetworks (modules) from the cell population TF–TG trans-regulation. Then, we identify the differential regulatory modules that are differentially expressed between the case and control groups.
4 | ### Detect Module
5 | #### Input
6 | - pseudobulk gene expression: [TG_pseudobulk]; please make sure batch effects have already been removed from the data
7 | - metadata including case and control in column 'group' and cell type annotation in column 'celltype': [metadata]. Note that the case is 1 and the control is 0.
8 | - LINGER outdir including a trans-regulatory network, 'cell_population_trans_regulatory.txt',
9 | - GWAS data file (optional).
10 |
11 | This is an example of the input.
12 | ```python
13 | import pandas as pd
14 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0)
15 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove mitochondrial genes; if the species is mouse, replace 'MT-' with 'mt-'
16 | import scanpy as sc
17 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad')
18 | label_all = adata_RNA.obs[['barcode','sample','label']]
19 | label_all.index = label_all['barcode']
20 | metadata = label_all.loc[TG_pseudobulk.columns]
21 | metadata.columns = ['barcode','group','celltype']
22 | outdir = 'output/'
23 | GWASfile = ['AUD_gene.txt','AUD_gene2.txt'] # each GWAS file is a gene list with no header (optional)
24 | ```
25 | ```python
26 | TG_pseudobulk
27 | ```
28 |
29 |

30 |
31 |
32 | ```python
33 | metadata
34 | ```
35 |
36 |

37 |
38 |
39 | #### Output
40 | ```python
41 | K=10 # K is the number of modules, a tuning parameter
42 | from LingerGRN import Compare
43 | Module_result=Compare.Module_trans(outdir,metadata,TG_pseudobulk,K,GWASfile)
44 | ```
45 | The output is a Module_result object. There are 3 items in this object:
46 | - S_TG, which represents the module assigned to each gene;
47 | - pvalue_all, the p-value of the differential module t-test comparing the case and control groups;
48 | - tvalue_all, the t-value of the t-test; a positive value means group 1 is more active, and a negative value means group 0 is more active.
49 | ```python
50 | Module_result.S_TG
51 | ```
52 |
53 |

54 |
55 |
56 | ```python
57 | Module_result.pvalue_all
58 | ```
59 |
60 |

61 |
62 |
63 | ```python
64 | Module_result.tvalue_all
65 | ```
66 |
67 |

68 |
69 |
70 | Save the result to files.
71 | ```python
72 | Module_result.pvalue_all.to_csv('pvalue_all.txt',sep='\t')
73 | Module_result.tvalue_all.to_csv('tvalue_all.txt',sep='\t')
74 | temp=Module_result.S_TG
75 | temp=temp[temp['Module']>0]
76 | temp.to_csv('Module.txt',sep='\t')
77 | ```
78 | #### Visualize
79 | Please make sure the R packages ggplot2, grid, tidyr, and egg are installed. The parameter 'cutoff' is the cutoff on -log10(p-value). We suggest 'cutoff = 2' as a default.
80 | ```python
81 | # Import the rpy2 components needed
82 | import os
83 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path
84 | import rpy2.robjects as robjects
85 | from rpy2.robjects import r
86 | # Import the R plotting package (ggplot2 as an example)
87 | r('library(ggplot2)')
88 | r('library(grid)')
89 | # Create data in R environment through Python
90 | r('''
91 | library(tidyr)
92 | dataP=read.table('pvalue_all.txt',sep='\t',header=TRUE,row.names=1)
93 | dataT=read.table('tvalue_all.txt',sep='\t',header=TRUE,row.names=1)
94 | dataP=-log10(dataP)
95 | dataP$TF=rownames(dataP)
96 | dataT$TF=rownames(dataT)
97 | longdiff0 <- gather(dataP, sample, value,-TF)
98 | longdiff1 <- gather(dataT, sample, value,-TF)
99 | colnames(longdiff1)=c('TF','celltype','T')
100 | longdiff1$P=longdiff0$value
101 | longdiff1$TF=factor(longdiff1$TF,levels=rev(longdiff1$TF[1:10]))
102 | print(longdiff1[1:10,])
103 | ''')
104 | # R code for plotting using ggplot
105 | r('''
106 | cutoff=1 # cutoff on -log10(p-value); change it here (the text above suggests 2 as a default)
107 | maxp=ceiling(max(longdiff1$P))
108 | limits0=c(cutoff,maxp)
109 | range0=c(1,(maxp-cutoff+1))*4/(maxp-cutoff+1)
110 | breaks0=(cutoff+1):(maxp-1)
111 | print(limits0)
112 | print(range0)
113 | print(breaks0)
114 | ''')
115 |
116 | # Print the plot to display it
117 | r('''
118 | library(ggplot2)
119 | library(egg)
120 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
121 | geom_point(aes(size = P, fill = T), alpha = 1, shape = 21) +
122 | scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) +
123 | labs( x= "cell type", y = "Module", fill = "") + theme_article()+
124 | theme(legend.key=element_blank(),
125 | axis.text.x = element_text(colour = "black", size = 9, face = "bold", angle = 90, vjust = 0.3, hjust = 1),
126 | axis.text.y = element_text(colour = "black", face = "bold", size = 11),
127 | legend.text = element_text(size = 9, face ="bold", colour ="black"),
128 | legend.title = element_text(size = 9, face = "bold"),
129 | panel.background = element_blank(), panel.border = element_rect(colour = "black", fill = NA, size = 1.2),
130 | legend.position = "right") +
131 | scale_fill_gradient2(midpoint=0, low="blue", mid="white",
132 | high="red", space ="Lab" )
133 | ''')
134 | r("pdf('module_result.pdf',width=1.5+dim(dataP)[2]/3,height=3)")# change the height and the width of the figure
135 | r('print(p)')
136 | r('dev.off()')
137 | ```
138 | The figure is saved to module_result.pdf.
139 |
140 |

141 |
142 |
143 |
--------------------------------------------------------------------------------
/docs/driver.md:
--------------------------------------------------------------------------------
1 | # Downstream analysis- TF driver score
2 | ## Driver Score
3 | We identify driver TFs underlying epigenetic and transcriptomic changes between control and case using a correlation model. We normalize the GRN and then calculate the Pearson correlation coefficient (PCC) between the expression or chromatin accessibility fold change and the regulatory strength of TGs or REs for each TF.
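The correlation model described above can be sketched for a single TF as follows. This is an illustration under stated assumptions, not the code behind `Compare.driver_score`: the gene names, the fold changes (taken here on a log scale), and the regulatory strengths are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (norm_x * norm_y)

# Hypothetical values for one TF: case-vs-control log fold change of each target gene
# and the (normalized) regulatory strength of the TF on that gene.
log_fold_change = {'G1': 1.2, 'G2': 0.8, 'G3': -0.5, 'G4': -1.0}
regulatory_strength = {'G1': 0.9, 'G2': 0.6, 'G3': 0.1, 'G4': 0.0}

genes = sorted(log_fold_change)
driver_score = pearson([log_fold_change[g] for g in genes],
                       [regulatory_strength[g] for g in genes])
print(round(driver_score, 3))
```

A score close to 1 means the genes most strongly regulated by the TF are also the ones most up-regulated in the case group; a score close to -1 means they are the most down-regulated.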
4 | ### Prerequisites
5 | Please complete the following tutorials first:
6 | - [PBMCs tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md)
7 | - [Downstream analysis-Module detection](https://github.com/Durenlab/LINGER/blob/main/docs/downstream.md)
8 | ### Transcriptomics driver score
9 | ```python
10 | import pandas as pd
11 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0)
12 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove mitochondrial genes; if the species is mouse, replace 'MT-' with 'mt-'
13 | import scanpy as sc
14 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad')
15 | label_all = adata_RNA.obs[['barcode','sample','label']]
16 | label_all.index = label_all['barcode']
17 | metadata = label_all.loc[TG_pseudobulk.columns]
18 | metadata.columns = ['barcode','group','celltype']
19 | GRN='trans_regulatory'
20 | adjust_method='bonferroni'
21 | corr_method='pearsonr'
22 | outdir = 'output/' # LINGER output directory, as in the module detection tutorial
23 | from LingerGRN import Compare
24 | C_result_RNA_sp,P_result_RNA_sp,Q_result_RNA_sp=Compare.driver_score(TG_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method)
25 | K=3 # We choose the top K positive and negative TFs to save to the txt file for visualization purposes.
26 | C_result_RNA_sp_r,Q_result_RNA_sp_r=Compare.driver_result(C_result_RNA_sp,Q_result_RNA_sp,K)
27 | C_result_RNA_sp_r.to_csv('C_result_RNA_sp_r.txt',sep='\t')
28 | Q_result_RNA_sp_r.to_csv('Q_result_RNA_sp_r.txt',sep='\t')
29 | ```
30 | adjust_method is the p-value adjustment method. Choose one of the following:
31 | - bonferroni : one-step correction
32 | - sidak : one-step correction
33 | - holm-sidak : step down method using Sidak adjustments
34 | - holm : step-down method using Bonferroni adjustments
35 | - simes-hochberg : step-up method (independent)
36 | - hommel : closed method based on Simes tests (non-negative)
37 | - fdr_bh : Benjamini/Hochberg (non-negative)
38 | - fdr_by : Benjamini/Yekutieli (negative)
39 | - fdr_tsbh : two stage fdr correction (non-negative)
40 | - fdr_tsbky : two stage fdr correction (non-negative)
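To make the difference between these corrections concrete, here is a minimal pure-Python sketch of two of them, Bonferroni and Benjamini/Hochberg. The method names listed above match those accepted by statsmodels' `multipletests`; that LINGER calls it internally is an assumption. The functions and p-values below are illustrative only.

```python
def bonferroni(pvalues):
    """One-step correction: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Step-up FDR correction: p * m / rank, made monotone from the largest p down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.02, 0.03, 0.5]          # hypothetical raw p-values
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni is the most conservative choice, which is why the adjusted values it returns are never smaller than the Benjamini/Hochberg ones.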
41 | ### Visualize
42 | ```python
43 | import os
44 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path
45 | import rpy2.robjects as robjects
46 | from rpy2.robjects import r
47 | # Import the R plotting package (ggplot2 as an example)
48 | r('library(ggplot2)')
49 | r('library(grid)')
50 | # Create data in R environment through Python
51 | r('''
52 | dataP=read.table('Q_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE)
53 | dataT=read.table('C_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE)
54 | sort_TF=rownames(dataT)
55 | library(tidyr)
56 | dataP=-log10(dataP)
57 | print(paste0('maximum of -log10P:',max(dataP)))
58 | dataP[dataP>40]=40
59 | dataP1=dataP
60 | dataP1$TF=rownames(dataP)
61 | longdiff0 <- gather(dataP1, sample, value,-TF)
62 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ]
63 | dataT1=dataT
64 | dataT1$TF=rownames(dataT)
65 | longdiff1=gather(dataT1, sample, value,-TF)
66 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ]
67 | colnames(longdiff1)=c('TF','celltype','PCC')
68 | longdiff1$P=longdiff0_s$value
69 | longdiff1$TF=factor(longdiff1$TF,levels=rev(sort_TF))
70 | library(egg)
71 | limits0=c(2,ceiling(max(dataP)))
72 | range0 = c(1,4)
73 | breaks0 = c(2,(ceiling(max(dataP))-2)*1/4+2,(ceiling(max(dataP))-2)*2/4+2,(ceiling(max(dataP))-2)*3/4+2,ceiling(max(dataP)))
74 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
75 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) +
76 | scale_size_continuous(limits = c(4, 40), range = c(1,5), breaks = c(4,10,20,30,40)) +
77 | labs( x= "cell type", y = "TF", fill = "") + theme_article()+
78 | theme(legend.key=element_blank(),
79 | axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1),
80 | legend.position = "right") +
81 | scale_fill_gradient2(midpoint=0, low="blue", mid="white",
82 | high="red", space ="Lab" )
83 |
84 |
85 | ''')
86 | r('''
87 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1)
88 | library(pheatmap)
89 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ])
90 | colnames(anno1)=c('TG')
91 | rownames(anno1)=rownames(dataP)
92 | anno1[is.na(anno1)]=0
93 | anno1[,1]=paste0('M',anno1[,1])
94 | anno1$name=rownames(anno1)
95 | longdiff0 <- gather(anno1, sample, value,-name)
96 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1))
97 | library("RColorBrewer")
98 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb',
99 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d')
100 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d')
101 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) +
102 | geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) +
103 | theme_article()+theme(text = element_text(size = 9), legend.position = "left")
104 | ''')
105 | r('''
106 | widths <- c(4.5, 4.5+dim(dataP)[2])
107 | print(unique(longdiff0$value))
108 | pdf('driver_trans.pdf',width=6/16*(9+dim(dataP)[2]),height= dim(anno1)[1]/10+0.5)
109 | #print(heatmap_plot)
110 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths)
111 | dev.off()
112 | ''')
113 | ```
114 | The figure is saved to driver_trans.pdf.
115 |
116 |

117 |
118 |
119 | ### Epigenetic driver score
120 | ```python
121 | RE_pseudobulk=pd.read_csv('data/RE_pseudobulk.tsv',sep=',',header=0,index_col=0)
122 | K=5
123 | GRN='TF_RE_binding'
124 | adjust_method='bonferroni'
125 | corr_method='pearsonr'
126 | C_result_RE,P_result_RE,Q_result_RE=Compare.driver_score(RE_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method)
127 | C_result_RE_r,Q_result_RE_r=Compare.driver_result(C_result_RE,Q_result_RE,K)
128 | C_result_RE_r.to_csv('C_result_RE_r.txt',sep='\t')
129 | Q_result_RE_r.to_csv('Q_result_RE_r.txt',sep='\t')
130 | ```
131 | ### Visualization
132 | ```python
133 | import os
134 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R' # Replace with your actual R home path
135 | import rpy2.robjects as robjects
136 | from rpy2.robjects import r
137 | # Import the R plotting package (ggplot2 as an example)
138 | r('library(ggplot2)')
139 | r('library(grid)')
140 | # Create data in R environment through Python
141 | r('''
142 | dataP=read.table('Q_result_RE_r.txt',sep='\t',row.names=1,header=TRUE)
143 | dataT=read.table('C_result_RE_r.txt',sep='\t',row.names=1,header=TRUE)
144 | sort_TF=rownames(dataT)
145 | library(tidyr)
146 | dataP=-log10(dataP)
147 | print(paste0('maximum of -log10P:',max(dataP)))
148 | maxP=100
149 | dataP[dataP>100]=100
150 | dataP1=dataP
151 | dataP1$TF=rownames(dataP)
152 | longdiff0 <- gather(dataP1, sample, value,-TF)
153 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ]
154 | dataT1=dataT
155 | dataT1$TF=rownames(dataT)
156 | longdiff1=gather(dataT1, sample, value,-TF)
157 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ]
158 | colnames(longdiff1)=c('TF','celltype','PCC')
159 | longdiff1$P=longdiff0_s$value
160 | longdiff1$TF=factor(longdiff1$TF,levels=(sort_TF))
161 | library(egg)
162 | cutoff=2
163 | maxp=ceiling(max(dataP))
164 | print(maxp)
165 | limits0=c(cutoff,maxp)
166 | print(limits0)
167 | range0 = c(1,4)
168 | numbreak=5
169 | d=ceiling((maxp-cutoff)/(numbreak-1))
170 | print(d)
171 | breaks0= seq(from = cutoff, to = cutoff+d*(numbreak-1), length.out=numbreak)
172 |
173 | print(range0)
174 | print(breaks0)
175 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
176 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) +
177 | scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) +
178 | labs( x= "cell type", y = "TF", fill = "") + theme_article()+
179 | theme(legend.key=element_blank(),
180 | axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1),
181 | legend.position = "right") +
182 | scale_fill_gradient2(midpoint=0, low="blue", mid="white",
183 | high="red", space ="Lab" )
184 |
185 |
186 | ''')
187 | r('''
188 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1)
189 | library(pheatmap)
190 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ])
191 | colnames(anno1)=c('TG')
192 | rownames(anno1)=rownames(dataP)
193 | anno1[is.na(anno1)]=0
194 | anno1[,1]=paste0('M',anno1[,1])
195 | anno1$name=rownames(anno1)
196 | longdiff0 <- gather(anno1, sample, value,-name)
197 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1))
198 | library("RColorBrewer")
199 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb',
200 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d')
201 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d')
202 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) +
203 | geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) +
204 | theme_article()+theme(text = element_text(size = 9), legend.position = "left"
205 | )
206 | ''')
207 | r('''
208 | widths <- c(4.5, 4.5+dim(dataP)[2])
209 | print(unique(longdiff0$value))
210 | pdf('driver_epi.pdf',width=6/16*(9+dim(dataP)[2]),height= dim(anno1)[1]/10+1.5)
211 | #print(p)
212 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths)
213 | dev.off()
214 | ''')
215 | ```
216 | The figure is saved to driver_epi.pdf.
217 |
218 |

219 |
220 |
--------------------------------------------------------------------------------
/docs/driver_epi.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/driver_epi.png
--------------------------------------------------------------------------------
/docs/driver_trans.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/driver_trans.png
--------------------------------------------------------------------------------
/docs/feature_engineering.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/feature_engineering.jpg
--------------------------------------------------------------------------------
/docs/feature_engineering.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/feature_engineering.png
--------------------------------------------------------------------------------
/docs/genomemap.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/genomemap.jpg
--------------------------------------------------------------------------------
/docs/h5_input.md:
--------------------------------------------------------------------------------
1 | # h5/h5ad file as input
2 | ## Case 1. 10x filtered feature barcode matrix
3 | ### Download the h5 file
4 | ```sh
5 | wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
6 | ```
7 | ### get the input data for LINGER
8 | ```python
9 | import scanpy as sc
10 | adata = sc.read_10x_h5('pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5', gex_only=False)
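11 | import pandas as pd
12 | label=pd.read_csv('label.txt',sep='\t',header=0) # cell annotation table; 'label.txt' is a placeholder file name, as in case 2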
13 | matrix=adata.X.T
14 | adata.var['gene_ids']=adata.var.index
15 | features=pd.DataFrame(adata.var['gene_ids'].values.tolist(),columns=[1])
16 | features[2]=adata.var['feature_types'].values
17 | barcodes=pd.DataFrame(adata.obs_names,columns=[0])
18 | from LingerGRN.preprocess import *
19 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
20 | ```
21 | ## Case 2. Separate RNA and ATAC h5ad files
22 | ### Read H5AD file as an AnnData object
23 | ```python
24 | import scanpy as sc
25 | adata_RNA = sc.read_h5ad('rna.h5ad')
26 | adata_ATAC=sc.read_h5ad('ATAC.h5ad')
27 | import pandas as pd
28 | label=pd.read_csv('label.txt',sep='\t',header=0)
29 | ```
30 | ```python
31 | adata_RNA
32 | ```
33 |
34 |
35 |

36 |
37 |
38 | ```python
39 | adata_ATAC
40 | ```
41 |
42 |
43 |

44 |
45 |
46 | ```python
47 | label
48 | ```
49 |
50 |

51 |
52 |
53 | ### get the input data for LINGER
54 |
55 | ```python
56 | import scipy.sparse as sp
57 | matrix=sp.vstack([adata_RNA.X.T, adata_ATAC.X.T])
58 | features=pd.DataFrame(adata_RNA.var['gene_ids'].values.tolist()+adata_ATAC.var['gene_ids'].values.tolist(),columns=[1])
59 | K=adata_RNA.shape[1]
60 | N=K+adata_ATAC.shape[1]
61 | types = ['Gene Expression' if i <= K-1 else 'Peaks' for i in range(0, N)]
62 | features[2]=types
63 | barcodes=pd.DataFrame(adata_RNA.obs['barcode'].values,columns=[0])
64 | from LingerGRN.preprocess import *
65 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
66 | ```
67 | You can then return to the PBMC tutorial and continue from the 'Remove low counts cells and genes' step.
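The feature-type labelling in case 2 relies on the RNA rows being stacked before the ATAC rows. A minimal sketch of that labelling, with a hypothetical helper `feature_types` and toy sizes, shows that it is equivalent to the list comprehension used in the tutorial code:

```python
def feature_types(n_genes, n_peaks):
    """Label the first n_genes rows as genes and the remaining n_peaks rows as peaks."""
    return ['Gene Expression'] * n_genes + ['Peaks'] * n_peaks

# Toy sizes for illustration: 3 genes followed by 2 peaks
K, n_peaks = 3, 2
N = K + n_peaks
types_from_tutorial = ['Gene Expression' if i <= K - 1 else 'Peaks' for i in range(0, N)]
print(feature_types(K, n_peaks) == types_from_tutorial)
```

If the ATAC matrix were stacked first, the labels would have to be swapped accordingly.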
68 |
--------------------------------------------------------------------------------
/docs/heatmap_activity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/heatmap_activity.png
--------------------------------------------------------------------------------
/docs/heatmap_activity_mm10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/heatmap_activity_mm10.png
--------------------------------------------------------------------------------
/docs/label.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/label.png
--------------------------------------------------------------------------------
/docs/label_PBMC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/label_PBMC.png
--------------------------------------------------------------------------------
/docs/metadata_ds.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/metadata_ds.jpg
--------------------------------------------------------------------------------
/docs/module_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/module_result.png
--------------------------------------------------------------------------------
/docs/motifmatch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/motifmatch.png
--------------------------------------------------------------------------------
/docs/original.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/original.png
--------------------------------------------------------------------------------
/docs/perturb.md:
--------------------------------------------------------------------------------
1 | # In silico perturbation
2 | Here, we use gene regulatory networks inferred from single-cell multi-omics data to perform in silico transcription factor (TF) perturbations, simulating the consequent changes, such as in target gene expression and differentiation direction, using only unperturbed wild-type data.
3 |
4 | ## Predict the original gene expression
5 | We first predict gene expression with the LINGER neural network model. We use this prediction to represent the wild-type transcriptional profile.
6 | ```python
7 | from LingerGRN.perturb import *
8 | # in silico perturbation
9 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir
10 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/'# this directory should be the same as the Datadir used for GRN inference
11 | GRNdir=Datadir+'data_bulk/'
12 | Input_dir= '/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/'# input data dir
13 | chrall,data_merge,Exp,Opn,Target,idx,TFname=load_data_ptb(Input_dir,outdir,GRNdir)
14 | original=get_simulation(outdir,chrall,data_merge,GRNdir,Exp,Opn,Target,idx)
15 | original.to_csv(outdir+'original.txt',sep='\t')
16 | original
17 | ```
18 |
19 |

20 |
21 |
22 | ## Predict the gene expression after TF perturbation
23 | We predict the gene expression after knocking out one TF or several TFs together. We take POU5F1, a master regulator of the H1 cell line (stem cell), as an example.
24 | ```python
25 | TFko='POU5F1' # for multiple TFs, use a list: TFko=['TF1','TF2',...] (then adapt the output file name below)
26 | import pandas as pd
27 | Exp_df=pd.DataFrame(Exp,index=TFname)
28 | Exp1=Exp_df.copy()
29 | Exp1.loc[TFko]=0
30 | perturb=get_simulation(outdir,chrall,data_merge,GRNdir,Exp1.values,Opn,Target,idx)
31 | perturb.to_csv(outdir+TFko+'.txt',sep='\t')
32 | perturb
33 | ```
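The knockout step itself only zeroes the expression rows of the chosen TFs before the simulation is rerun. A dependency-free sketch of that idea, using plain lists instead of the pandas DataFrame above, is given below; `knock_out` and the toy matrix are hypothetical and not part of LingerGRN.

```python
def knock_out(expression, tf_names, targets):
    """Return a copy of a TF x sample matrix with the rows of the target TFs set to 0."""
    if isinstance(targets, str):
        targets = [targets]               # accept a single TF name as well as a list
    missing = [tf for tf in targets if tf not in tf_names]
    if missing:
        raise KeyError(f'TFs not found: {missing}')
    target_set = set(targets)
    return [[0.0] * len(row) if name in target_set else list(row)
            for name, row in zip(tf_names, expression)]

# Toy TF x sample expression matrix (values are made up)
tf_names = ['POU5F1', 'NANOG', 'SOX2']
expression = [[5.0, 4.0], [3.0, 2.5], [1.0, 1.5]]

perturbed = knock_out(expression, tf_names, ['POU5F1', 'SOX2'])
print(perturbed)  # [[0.0, 0.0], [3.0, 2.5], [0.0, 0.0]]
```

Working on a copy keeps the wild-type matrix intact, so the original and perturbed predictions can be compared afterwards.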
34 |
35 |

36 |
37 |
38 | ## Differential expression for single cell
39 | We visualize the differential expression of the target gene, taking NANOG, a target gene of POU5F1, as an example. Set save = True to save the figure to outdir (knockout TF+'_KO_Diff_exp_Umap_'+target gene+'.png'). Clusters 0 to 3 correspond to the H1, BJ, K562, and GM12878 cell types, respectively.
40 | ```python
41 | embedding,D=umap_embedding(outdir,Target,original,perturb,Input_dir)
42 | TG='NANOG'
43 | save=True
44 | diff_umap(TFko,TG,save,outdir,embedding,perturb,original,Input_dir)
45 | ```
46 |
47 |

48 |
49 |
50 | The above result suggests that NANOG expression decreases in the H1 cell line after POU5F1 knockout.
51 | ## Differentiation prediction
52 |
53 | We project the original and the perturbed gene expression into the same embedding space. The difference between the two embeddings then represents the predicted differentiation direction after the perturbation. The figure will be saved as knockout TF+'_KO_Differentiation_Umap.png'. Clusters 0 to 3 correspond to the H1, BJ, K562, and GM12878 cell types, respectively.
54 | ```python
55 | save=True
56 | Umap_direct(TFko,Input_dir,embedding,D,save,outdir)
57 | ```
58 |
59 |

60 |
61 |
62 | The POU5F1 knockout simulation yields a potential shift of the H1 cell line.
63 |
--------------------------------------------------------------------------------
/docs/perturb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/perturb.png
--------------------------------------------------------------------------------
/docs/pvalue_all.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/pvalue_all.png
--------------------------------------------------------------------------------
/docs/scNN.md:
--------------------------------------------------------------------------------
1 | # Other species tutorial
2 | We support the following species. For other species, we provide another [tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/scNN_newSpecies.md).
3 | |genome_short | species | species_ensembl | genome |
4 | | --- | --- | --- | --- |
5 | | canFam3 | dog | Canis_lupus_familiaris |CanFam3 |
6 | | danRer11|zebrafish|Danio_rerio|GRCz11|
7 | |danRer10|zebrafish|Danio_rerio|GRCz10|
8 | | dm6|fly|Drosophila_melanogaster|BDGP6|
9 | | dm3|fly|Drosophila_melanogaster|BDGP5|
10 | | rheMac8|rhesus|Macaca_mulatta|Mmul_8|
11 | |mm10|mouse|Mus_musculus|GRCm38|
12 | |mm9|mouse|Mus_musculus|NCBIM37|
13 | |rn5|rat|Rattus_norvegicus|Rnor_5|
14 | |rn6|rat|Rattus_norvegicus|Rnor_6|
15 | |susScr3|pig|Sus_scrofa|Sscrofa10|
16 | |susScr11|pig|Sus_scrofa|Sscrofa11|
17 | |fr3|fugu|Takifugu_rubripes|FUGU5|
18 | |xenTro9|frog|Xenopus_tropicalis|Xenopus_tropicalis_v9|
19 | |tair10|Arabidopsis|Arabidopsis_thaliana|Tair10|
20 | ## Download the provided data
21 | We provide the TSS locations for the above genomes and the motif information.
22 | ```sh
23 | Datadir=/path/to/LINGER/ # the directory to store the data; please use an absolute path. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/
24 | mkdir $Datadir
25 | cd $Datadir
26 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1Dog5JTS_SNIoa5aohgZmOWXrTUuAKHXV' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Dog5JTS_SNIoa5aohgZmOWXrTUuAKHXV" -O provide_data.tar.gz && rm -rf /tmp/cookies.txt
27 | ```
28 | Then extract the archive:
29 | ```sh
30 | tar -xzf provide_data.tar.gz
31 | ```
32 | ## Prepare the input data
33 | We take the mm10 (mouse) single-cell data as an example. The data is from the published paper (FOXA2 drives lineage plasticity and KIT pathway
34 | activation in neuroendocrine prostate cancer).
35 | The input consists of the feature matrix from 10x sc-multiome data and the cell annotation/cell type label:
36 | - Single-cell multiome data including matrix.mtx, features.tsv/features.txt, and barcodes.tsv/barcodes.txt
37 | - Cell annotation/cell type label if you need the cell type-specific gene regulatory network (label.txt in our example).
38 |
39 |

40 |
41 |
42 | If the input data is a 10x h5 file or an h5ad file from scanpy, please follow the instructions in [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md).
43 |
44 | ### sc data
45 | We download the data using the shell command line.
46 | ```sh
47 | mkdir -p data
48 | cd data
49 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0" -O mm10_data.tar.gz && rm -rf /tmp/cookies.txt
50 | tar -xzvf mm10_data.tar.gz
51 | mv mm10_data/* ./
52 | cd ../
53 | ```
54 | We provide the cell annotation as follows:
55 | ```sh
56 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4" -O mm10_label.txt && rm -rf /tmp/cookies.txt
57 | mv mm10_label.txt data/
58 | ```
59 | ## LINGER
60 | ### Install
61 | ```sh
62 | conda create -n LINGER python==3.10.0
63 | conda activate LINGER
64 | pip install LingerGRN==1.97
65 | conda install bioconda::bedtools #Requirement
66 | ```
67 | ### Install homer
68 | Check whether homer is installed
69 | ```sh
70 | which homer # run this in the command line
71 | ```
72 | If homer is not installed, use Conda to install it
73 | ```sh
74 | conda install bioconda::homer
75 | ```
76 | #### install genome
77 | You can check the installed genome
78 | ```sh
79 | dir=$(which homer)
80 | dir_path=$(dirname "$dir")
81 | ls $dir_path/../share/homer/data/genomes/ # this is the installed genomes, if no genome is installed, there will be an error 'No such file or directory'
82 | ```
83 | If the genome is not installed, use the following shell script to install it.
84 | ```sh
85 | genome='mm10'
86 | perl $dir_path/../share/homer/configureHomer.pl -install $genome
87 | ```
88 | The following steps are run in Python.
89 | #### Convert the sc-multiome data to AnnData
90 | We convert the sc-multiome data to the AnnData format and filter the cell barcodes by the cell type label.
91 | ```python
92 | import scanpy as sc
93 | # set some figure parameters for nice display inside Jupyter notebooks.
94 | %matplotlib inline
95 | sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white')
96 | sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
97 | sc.logging.print_header()
98 | #results_file = "scRNA/pbmc10k.h5ad"
99 | import scipy.io # mmread lives in the scipy.io submodule
100 | import pandas as pd
101 | matrix=scipy.io.mmread('data/matrix.mtx')
102 | features=pd.read_csv('data/features.txt',sep='\t',header=None)
103 | barcodes=pd.read_csv('data/barcodes.txt',sep='\t',header=None)
104 | label=pd.read_csv('data/mm10_label.txt',sep='\t',header=0)
105 | from LingerGRN.preprocess import *
106 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
107 | ```
108 | #### Remove low counts cells and genes
109 | ```python
110 | import scanpy as sc
111 | sc.pp.filter_cells(adata_RNA, min_genes=200)
112 | sc.pp.filter_genes(adata_RNA, min_cells=3)
113 | sc.pp.filter_cells(adata_ATAC, min_genes=200)
114 | sc.pp.filter_genes(adata_ATAC, min_cells=3)
115 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values))
116 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
117 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]
118 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
119 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
120 | ```
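The barcode matching above keeps only the cells that pass filtering in both modalities. Because Python sets have no guaranteed order, the order of `selected_barcode` can differ between runs. The sketch below, with a hypothetical helper and made-up barcodes, shows an order-preserving alternative; it is an optional variation, not what the tutorial code does.

```python
def shared_barcodes(rna_barcodes, atac_barcodes):
    """Barcodes present in both lists, kept in the order of the RNA list."""
    atac_set = set(atac_barcodes)         # set membership tests are O(1)
    return [barcode for barcode in rna_barcodes if barcode in atac_set]

# Made-up barcodes for illustration
rna = ['AAAC-1', 'AAAG-1', 'AACT-1']
atac = ['AACT-1', 'AAAC-1', 'TTTG-1']
print(shared_barcodes(rna, atac))  # ['AAAC-1', 'AACT-1']
```

Either approach selects the same cells; only the row order of the resulting AnnData objects differs.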
121 | #### Generate the pseudo-bulk/metacell:
122 | ```python
123 | from LingerGRN.pseudo_bulk import *
124 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode
125 | tempsample=samplelist[0]
126 | TG_pseudobulk=pd.DataFrame([])
127 | RE_pseudobulk=pd.DataFrame([])
128 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
129 | for tempsample in samplelist:
130 |     adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
131 |     adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
132 |     TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)
133 |     TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
134 |     RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
135 |     RE_pseudobulk[RE_pseudobulk > 100] = 100
136 |
137 | import os
138 | if not os.path.exists('data/'):
139 |     os.mkdir('data/')
140 | adata_ATAC.write('data/adata_ATAC.h5ad')
141 | adata_RNA.write('data/adata_RNA.h5ad')
142 | TG_pseudobulk=TG_pseudobulk.fillna(0)
143 | RE_pseudobulk=RE_pseudobulk.fillna(0)
144 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
145 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
146 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
147 | ```
148 | ### Training model
149 | Overlap the region with general GRN:
150 | ```python
151 | Datadir='/path/to/LINGER/'# This directory should be the same as Datadir defined in the above 'Download the provided data' section
152 | GRNdir=Datadir+'provide_data/'
153 | genome='mm10'
154 | outdir='/path/to/output/' #output dir
155 | activef='ReLU'
156 | method='scNN'
157 | import torch
158 | import subprocess
159 | import os
160 | import LingerGRN.LINGER_tr as LINGER_tr
161 | LINGER_tr.get_TSS(GRNdir,genome,200000) # Here, 200000 is the maximum distance (bp) between a regulatory element and the TG. Other distances are supported
162 | LINGER_tr.RE_TG_dis(outdir)
163 | ```
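To make the distance argument concrete, here is a small sketch of what a 200 kb window around a TSS means. It assumes the distance is measured from the midpoint of the regulatory element to the TSS, which is an assumption for illustration and may differ from how LINGER measures it. `within_window` and the coordinates are hypothetical.

```python
def within_window(re_start, re_end, tss, max_dist=200000):
    """Return True if the region midpoint lies within max_dist bp of the TSS."""
    midpoint = (re_start + re_end) // 2  # assumed reference point of the region
    return abs(midpoint - tss) <= max_dist

# Hypothetical regions (start, end) on the same chromosome as the TSS.
regions = [(100_000, 100_500), (450_000, 450_600)]
tss = 200_000

kept = [r for r in regions if within_window(*r, tss)]
print(kept)
```

A larger `max_dist` keeps more distal elements per target gene, at the cost of more candidate RE-TG pairs.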
164 | Train the LINGER model:
165 | ```python
166 | import LingerGRN.LINGER_tr as LINGER_tr
167 | activef='ReLU' # activation function, chosen from 'ReLU','sigmoid','tanh'
168 | genomemap=pd.read_csv(GRNdir+'genome_map_homer.txt',sep='\t')
169 | genomemap.index=genomemap['genome_short']
170 | species=genomemap.loc[genome]['species_ensembl']
171 | LINGER_tr.training(GRNdir,method,outdir,activef,species)
172 | ```
173 | ### Cell population gene regulatory network
174 | #### TF binding potential
175 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
176 | ```python
177 | import LingerGRN.LL_net as LL_net
178 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
179 | ```
180 |
181 | #### *cis*-regulatory network
182 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
183 | ```python
184 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
185 | ```
186 | #### *trans*-regulatory network
187 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
188 | ```python
189 | LL_net.trans_reg(GRNdir,method,outdir,genome)
190 | ```
191 |
192 | ### Cell type specific gene regulatory network
193 | There are 2 options:
194 | 1. infer GRN for a specific cell type, which is in the label.txt;
195 | ```python
196 | celltype='1' #use a string to assign your cell type
197 | ```
198 | 2. infer GRNs for all cell types.
199 | ```python
200 | celltype='all'
201 | ```
202 | Please make sure that 'all' is not a cell type in your data.
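Because `'all'` is reserved, it can help to check the labels before running the cell type specific steps. The sketch below is a hypothetical helper, not part of LingerGRN. It takes any iterable of labels, for example the label column read from label.txt (the column name is not assumed here).

```python
def check_reserved_label(labels, reserved="all"):
    """Raise if the reserved keyword is used as a cell type; else return the sorted cell types."""
    cell_types = sorted({str(label) for label in labels})
    if reserved in cell_types:
        raise ValueError(f"'{reserved}' is reserved and cannot be a cell type name")
    return cell_types

# Hypothetical labels, one per cell.
print(check_reserved_label(["1", "0", "1", "2"]))
```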
203 | #### Motif matching
204 | ```python
205 | command='paste data/Peaks.bed data/Peaks.txt > data/region.txt'
206 | subprocess.run(command, shell=True)
207 | import pandas as pd
208 | genome_map=pd.read_csv(GRNdir+'genome_map_homer.txt',sep='\t',header=0)
209 | genome_map.index=genome_map['genome_short']
210 | command='findMotifsGenome.pl data/region.txt '+genome+' ./. -size given -find '+GRNdir+'all_motif_rmdup_'+genome_map.loc[genome]['Motif']+'> '+outdir+'MotifTarget.bed'
211 | subprocess.run(command, shell=True)
212 | ```
213 | #### TF binding potential
214 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential.
215 | ```python
216 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version
217 | ```
218 |
219 | #### *cis*-regulatory network
220 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
221 | ```python
222 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)
223 | ```
224 |
225 | #### *trans*-regulatory network
226 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
227 | ```python
228 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
229 | ```
230 | ## Identify driver regulators by TF activity
231 | ### Instruction
232 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators.
233 |
234 | ### Prepare
235 | We need a *trans*-regulatory network; choose the network that best matches your data.
236 | 1. If your gene expression data are matched with cell population GRN, you can set
237 | ```python
238 | network = 'cell population'
239 | ```
240 | 2. If your gene expression data are matched with a certain cell type, you can set network to the name of this cell type.
241 | ```python
242 | network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type
243 | ```
244 |
245 | ### Calculate TF activity
246 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. In the gene expression matrix, rows are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
247 |
248 | ```python
249 | from LingerGRN.TF_activity import *
250 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
251 | ```
252 | Visualize the TF activity heatmap by cluster. If you want to save the heatmap to outdir, please set 'save=True'. The output is 'heatmap_activity.png'.
253 | ```python
254 | save=True
255 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
256 | ```
257 |
258 |

259 |
260 |
261 | ### Identify driver regulator
262 | We use a t-test on TF activity to find the differentially active TFs of a certain cell type.
263 | 1. You can assign a certain cell type of the gene expression data by
264 | ```python
265 | celltype='1'
266 | ```
267 | 2. Or, you can obtain the result for all cell types.
268 | ```python
269 | celltype='all'
270 | ```
271 |
272 | For example,
273 |
274 | ```python
275 | celltype='1'
276 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
277 | t_test_results
278 | ```
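As an illustration of the comparison behind this step, the sketch below computes a two-sample t statistic for one TF between a cell type and all other cells. It uses Welch's form (unequal variances), which is an assumption: the exact test variant used by `master_regulator` is not stated here. The activity values are made up.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group, others):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(group) / len(group) + variance(others) / len(others))
    return (mean(group) - mean(others)) / standard_error

# Hypothetical TF activity values for cells in the cell type and in all other cells.
activity_in_type = [1.0, 2.0, 3.0]
activity_in_others = [2.0, 3.0, 4.0]

t = welch_t(activity_in_type, activity_in_others)
print(round(t, 4))
```

A positive statistic means higher activity in the chosen cell type; a negative one means lower activity.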
279 |
280 |
281 |

282 |
283 |
284 | Visualize the differential activity and expression. You can compare two cell types, or one cell type against all others. If you want to save the box plot to outdir, please set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.
285 |
286 | ```python
287 | TFName='Erg'
288 | datatype='activity'
289 | celltype1='1'
290 | celltype2='Others'
291 | save=True
292 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
293 | ```
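The output file name follows directly from the arguments. The helper below is a hypothetical convenience, not part of LingerGRN. It only rebuilds the name pattern described above so the saved plot can be located in `outdir`.

```python
def box_plot_filename(tf_name, datatype, celltype1, celltype2):
    """Rebuild the documented pattern: box_plot_<TF>_<datatype>_<celltype1>_<celltype2>.png"""
    return f"box_plot_{tf_name}_{datatype}_{celltype1}_{celltype2}.png"

name = box_plot_filename("Erg", "activity", "1", "Others")
print(name)
```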
294 |
295 |
296 |

297 |
298 |
299 | For gene expression data, the boxplot is:
300 | ```python
301 | datatype='expression'
302 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
303 | ```
304 |
305 |
306 |

307 |
308 |
309 |
310 |
--------------------------------------------------------------------------------
/docs/scNN_newSpecies.md:
--------------------------------------------------------------------------------
1 | # Other species not in the list
2 | Here, we provide a tutorial for species that are not included in the provided list but are available in the basic HOMER installation. Choose one genome version from the following table based on your data.
3 | | **Genome Version** | **Release** | **Description** |
4 | |---------------------|-------------|-----------------------------------------------------------|
5 | | tair10 | v6.0 | Arabidopsis genome and annotation (tair10) |
6 | | dm6 | v7.0 | Fly genome and annotation for UCSC dm6 |
7 | | xenTro3 | v6.4 | Frog genome and annotation for UCSC xenTro3 |
8 | | danRer11 | v7.0 | Zebrafish genome and annotation for UCSC danRer11 |
9 | | rn7 | v7.0 | Rat genome and annotation for UCSC rn7 |
10 | | canFam3 | v6.4 | Dog genome and annotation for UCSC canFam3 |
11 | | anoGam3 | v7.0 | Mosquito genome and annotation for UCSC anoGam3 |
12 | | gorGor3 | v6.4 | Human genome and annotation for UCSC gorGor3 |
13 | | strPur2 | v7.0 | Urchin genome and annotation for UCSC strPur2 |
14 | | rn5 | v6.4 | Rat genome and annotation for UCSC rn5 |
15 | | rheMac10 | v7.0 | Rhesus genome and annotation for UCSC rheMac10 |
16 | | danRer7 | v6.4 | Zebrafish genome and annotation for UCSC danRer7 |
17 | | canFam5 | v7.0 | Dog genome and annotation for UCSC canFam5 |
18 | | gorGor6 | v7.0 | Human genome and annotation for UCSC gorGor6 |
19 | | canFam6 | v7.0 | Dog genome and annotation for UCSC canFam6 |
20 | | apiMel2 | v7.0 | Bee genome and annotation for UCSC apiMel2 |
21 | | gorGor5 | v6.4 | Human genome and annotation for UCSC gorGor5 |
22 | | panTro4 | v6.4 | Human genome and annotation for UCSC panTro4 |
23 | | xenTro7 | v6.4 | Frog genome and annotation for UCSC xenTro7 |
24 | | rheMac2 | v6.4 | Rhesus genome and annotation for UCSC rheMac2 |
25 | | gorGor4 | v6.4 | Human genome and annotation for UCSC gorGor4 |
26 | | panTro5 | v6.4 | Human genome and annotation for UCSC panTro5 |
27 | | panPan2 | v6.4 | Human genome and annotation for UCSC panPan2 |
28 | | patens.ASM242v1 | v5.10 | Patens genome and annotation (patens.ASM242v1) |
29 | | panTro6 | v7.0 | Human genome and annotation for UCSC panTro6 |
30 | | AGPv3 | v5.10 | Corn genome and annotation (AGPv3) |
31 | | xenTro9 | v6.4 | Frog genome and annotation for UCSC xenTro9 |
32 | | papAnu2 | v6.4 | Human genome and annotation for UCSC papAnu2 |
33 | | corn.AGPv3 | v5.10 | Corn genome and annotation (corn.AGPv3) |
34 | | petMar2 | v6.4 | Lamprey genome and annotation for UCSC petMar2 |
35 | | panPan1 | v6.4 | Human genome and annotation for UCSC panPan1 |
36 | | fr3 | v7.0 | Fugu genome and annotation for UCSC fr3 |
37 | | sacCer3 | v7.0 | Yeast genome and annotation for UCSC sacCer3 |
38 | | mm9 | v7.0 | Mouse genome and annotation for UCSC mm9 |
39 | | susScr3 | v6.4 | Pig genome and annotation for UCSC susScr3 |
40 | | hg17 | v7.0 | Human genome and annotation for UCSC hg17 |
41 | | panTro3 | v6.4 | Human genome and annotation for UCSC panTro3 |
42 | | tetNig2 | v7.0 | Fugu genome and annotation for UCSC tetNig2 |
43 | | mm10 | v7.0 | Mouse genome and annotation for UCSC mm10 |
44 | | taeGut2 | v7.0 | Zebrafinch genome and annotation for UCSC taeGut2 |
45 | | hg18 | v7.0 | Human genome and annotation for UCSC hg18 |
46 | | susScr11 | v7.0 | Pig genome and annotation for UCSC susScr11 |
47 | | galGal5 | v6.4 | Chicken genome and annotation for UCSC galGal5 |
48 | | hg38 | v7.0 | Human genome and annotation for UCSC hg38 |
49 | | mm8 | v6.4 | Mouse genome and annotation for UCSC mm8 |
50 | | xenLae2 | v7.0 | Frog genome and annotation for UCSC xenLae2 |
51 | | rn6 | v7.0 | Rat genome and annotation for UCSC rn6 |
52 | | ce11 | v7.0 | Worm genome and annotation for UCSC ce11 |
53 | | dm3 | v7.0 | Fly genome and annotation for UCSC dm3 |
54 | | rheMac3 | v6.4 | Rhesus genome and annotation for UCSC rheMac3 |
55 | | rn4 | v6.4 | Rat genome and annotation for UCSC rn4 |
56 | | rice.IRGSP1.0 | v5.10 | Rice genome and annotation (rice.IRGSP1.0) |
57 | | panPan3 | v7.0 | Human genome and annotation for UCSC panPan3 |
58 | | galGal6 | v7.0 | Chicken genome and annotation for UCSC galGal6 |
59 | | danRer10 | v7.0 | Zebrafish genome and annotation for UCSC danRer10 |
60 | | ci2 | v7.0 | Ciona genome and annotation for UCSC ci2 |
61 | | petMar3 | v7.0 | Lamprey genome and annotation for UCSC petMar3 |
62 | | sacCer2 | v7.0 | Yeast genome and annotation for UCSC sacCer2 |
63 | | hg19 | v7.0 | Human genome and annotation for UCSC hg19 |
64 | | xenTro10 | v7.0 | Frog genome and annotation for UCSC xenTro10 |
65 | | rheMac8 | v6.4 | Rhesus genome and annotation for UCSC rheMac8 |
66 | | mm39 | v7.0 | Mouse genome and annotation for UCSC mm39 |
67 | | ce10 | v7.0 | Worm genome and annotation for UCSC ce10 |
68 | | aplCal1 | v6.4 | Seahare genome and annotation for UCSC aplCal1 |
69 | | xenTro2 | v6.4 | Frog genome and annotation for UCSC xenTro2 |
70 | | papAnu4 | v7.0 | Human genome and annotation for UCSC papAnu4 |
71 | | ce6 | v7.0 | Worm genome and annotation for UCSC ce6 |
72 | | ci3 | v7.0 | Ciona genome and annotation for UCSC ci3 |
73 | | apiMel3 | v7.0 | Bee genome and annotation for UCSC apiMel3 |
74 | | anoGam1 | v7.0 | Mosquito genome and annotation for UCSC anoGam1 |
75 | | galGal4 | v6.4 | Chicken genome and annotation for UCSC galGal4 |
76 |
77 | ## Prepare the input data
78 | We take the sc data of mm10 as an example. The data are from the published paper [FOXA2 drives lineage plasticity and KIT pathway
79 | activation in neuroendocrine prostate cancer](https://www.cell.com/cancer-cell/pdf/S1535-6108(22)00502-5.pdf).
80 | The input data are the feature matrix from 10x sc-multiome data and the cell annotation/cell type label, which include:
81 | - Single-cell multiome data including matrix.mtx, features.tsv/features.txt, and barcodes.tsv/barcodes.txt
82 |
83 | If the input data is a 10x h5 file or an h5ad file from scanpy, please follow the instructions in [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md).
84 |
85 | Here, we provide sc-multiome data of mm10. We download the data using the shell command line.
86 | ```sh
87 | mkdir -p data
88 | cd data
89 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0" -O mm10_data.tar.gz && rm -rf /tmp/cookies.txt
90 | tar -xzvf mm10_data.tar.gz
91 | mv mm10_data/* ./
92 | cd ../
93 | ```
94 | We provide the cell annotation as follows:
95 | ```sh
96 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4" -O mm10_label.txt && rm -rf /tmp/cookies.txt
97 | mv mm10_label.txt data/label.txt
98 | ```
99 | - Cell annotation/cell type label if you need the cell type-specific gene regulatory network (label.txt in our example).
100 |
101 |

102 |
103 | - gtf file or the URL of the gtf file describing gene annotation, '*.gtf'
104 | - PWM matrix file of motifs, 'all_motif.txt'
105 |
106 |

107 |
108 | - Motif-TF match file, 'MotifMatch.txt', mapping motif and TFs
109 |
110 |

111 |
112 |
113 | ## LINGER
114 | ### Install
115 | ```sh
116 | conda create -n LINGER python==3.10.0
117 | conda activate LINGER
118 | pip install LingerGRN==1.97
119 | conda install bioconda::bedtools #Requirement
120 | ```
121 | ### Install homer
122 | Check whether homer is installed
123 | ```sh
124 | which homer # run this in the command line
125 | ```
126 | If homer is not installed, use Conda to install it
127 | ```sh
128 | conda install bioconda::homer
129 | ```
130 | #### Install genome
131 | You can check the installed genome
132 | ```sh
133 | dir=$(which homer)
134 | dir_path=$(dirname "$dir")
135 | ls $dir_path/../share/homer/data/genomes/ # this is the installed genomes, if no genome is installed, there will be an error 'No such file or directory'
136 | ```
137 | If the genome is not installed, use the following shell script to install it.
138 | ```sh
139 | genome='mm10'
140 | perl $dir_path/../share/homer/configureHomer.pl -install $genome
141 | ```
142 | The following steps are run in Python.
143 | #### Input
144 | Note that PWM_file and MotifMatch_file are located at the GRNdir.
145 | ```python
146 | GRNdir='/path/to/current_dir/' # The dir we currently use; the sc data is in GRNdir/data/
147 | gtf_file='/path/to/gtfdir/*.gtf'# gtf, PWM matrix, and Motif-TF match file should be included in the GRNdir
148 | PWM_file='all_motif.txt'
149 | MotifMatch_file='MotifMatch.txt'
150 | genome='mm10' # the genome of your data
151 | outdir='/path/to/output/' #output dir
152 | species='New'
153 | ```
154 | #### Transfer the sc-multiome data to anndata
155 | We will transfer sc-multiome data to the anndata format and filter the cell barcode by the cell type label.
156 | ```python
157 | import scanpy as sc
158 | #set some figure parameters for nice display inside Jupyter notebooks.
159 | sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white')
160 | sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
161 | sc.logging.print_header()
162 | #results_file = "scRNA/pbmc10k.h5ad"
163 | import scipy.io # mmread lives in the scipy.io submodule
164 | import pandas as pd
165 | matrix=scipy.io.mmread('data/matrix.mtx')
166 | features=pd.read_csv('data/features.txt',sep='\t',header=None)
167 | barcodes=pd.read_csv('data/barcodes.txt',sep='\t',header=None)
168 | label=pd.read_csv('data/label.txt',sep='\t',header=0)
169 | from LingerGRN.preprocess import *
170 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
171 | ```
172 | #### Remove low-count cells and genes
173 | ```python
174 | import scanpy as sc
175 | sc.pp.filter_cells(adata_RNA, min_genes=200)
176 | sc.pp.filter_genes(adata_RNA, min_cells=3)
177 | sc.pp.filter_cells(adata_ATAC, min_genes=200)
178 | sc.pp.filter_genes(adata_ATAC, min_cells=3)
179 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values))
180 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
181 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]
182 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
183 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
184 | ```
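The block above keeps only barcodes present in both modalities. Because `set` intersection has no defined order, the resulting cell order can vary between runs. The following sketch, with a hypothetical helper name and toy barcodes, shows an order-preserving variant of the same intersection using only the standard library.

```python
def shared_barcodes(rna_barcodes, atac_barcodes):
    """Barcodes present in both lists, kept in the order they appear in rna_barcodes."""
    atac_set = set(atac_barcodes)
    return [barcode for barcode in rna_barcodes if barcode in atac_set]

# Toy barcodes standing in for adata_RNA.obs['barcode'] and adata_ATAC.obs['barcode'].
rna = ["AAAC-1", "AAAG-1", "AACT-1"]
atac = ["AACT-1", "AAAC-1", "TTTG-1"]

selected = shared_barcodes(rna, atac)
print(selected)
```

Using a list like this for `selected_barcode` would give the same set of cells as the tutorial code, in a reproducible order.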
185 | #### Generate the pseudo-bulk/metacell:
186 | ```python
187 | from LingerGRN.pseudo_bulk import *
188 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode
189 | tempsample=samplelist[0]
190 | TG_pseudobulk=pd.DataFrame([])
191 | RE_pseudobulk=pd.DataFrame([])
192 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
193 | for tempsample in samplelist:
194 |     adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
195 |     adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
196 |     TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)
197 |     TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
198 |     RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
199 |     RE_pseudobulk[RE_pseudobulk > 100] = 100
200 |
201 | import os
202 | if not os.path.exists('data/'):
203 |     os.mkdir('data/')
204 | adata_ATAC.write('data/adata_ATAC.h5ad')
205 | adata_RNA.write('data/adata_RNA.h5ad')
206 | TG_pseudobulk=TG_pseudobulk.fillna(0)
207 | RE_pseudobulk=RE_pseudobulk.fillna(0)
208 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
209 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
210 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
211 | ```
212 | ### Training model
213 | Overlap the region with general GRN:
214 | ```python
215 | activef='ReLU'
216 | method='scNN'
217 | import torch
218 | import subprocess
219 | import os
220 | import LingerGRN.LINGER_tr as LINGER_tr
221 | LINGER_tr.get_TSS_ensembl(genome,gtf_file)
222 | LINGER_tr.get_TSS(GRNdir,genome,200000) # Here, 200000 is the maximum distance (bp) between a regulatory element and the TG. Other distances are supported
223 | LINGER_tr.RE_TG_dis(outdir)
224 | ```
225 | Train the LINGER model:
226 | ```python
227 | import LingerGRN.LINGER_tr as LINGER_tr
228 | activef='ReLU' # activation function, chosen from 'ReLU','sigmoid','tanh'
229 | LINGER_tr.training(GRNdir,method,outdir,activef,species)
230 | ```
231 | ### Cell population gene regulatory network
232 | #### TF binding potential
233 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
234 | ```python
235 | import LingerGRN.LL_net as LL_net
236 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
237 | ```
238 |
239 | #### *cis*-regulatory network
240 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
241 | ```python
242 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
243 | ```
244 | #### *trans*-regulatory network
245 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
246 | ```python
247 | LL_net.trans_reg(GRNdir,method,outdir,genome)
248 | ```
249 |
250 | ### Cell type specific gene regulatory network
251 | There are 2 options:
252 | 1. infer GRN for a specific cell type, which is in the label.txt;
253 | ```python
254 | celltype='1' #use a string to assign your cell type
255 | ```
256 | 2. infer GRNs for all cell types.
257 | ```python
258 | celltype='all'
259 | ```
260 | Please make sure that 'all' is not a cell type in your data.
261 | #### Motif matching
262 | ```python
263 | command='paste data/Peaks.bed data/Peaks.txt > data/region.txt'
264 | subprocess.run(command, shell=True)
265 | import pandas as pd
266 | command='findMotifsGenome.pl data/region.txt '+genome+' ./. -size given -find '+GRNdir+PWM_file+'> '+outdir+'MotifTarget.bed'
267 | subprocess.run(command, shell=True)
268 | ```
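The HOMER call above is assembled by string concatenation and run through the shell. As a hedged alternative sketch, the same arguments can be passed as a list, with the output redirected through a file handle, so that paths containing spaces do not need quoting. The helper names are hypothetical, and only the arguments already used in the command above appear here.

```python
import subprocess

def build_motif_command(region_file, genome, motif_file):
    """Argument list equivalent to the findMotifsGenome.pl command string used above."""
    return ["findMotifsGenome.pl", region_file, genome, "./.",
            "-size", "given", "-find", motif_file]

def run_motif_matching(region_file, genome, motif_file, out_file):
    """Run HOMER and write its standard output to out_file (replaces the shell '>')."""
    with open(out_file, "w") as handle:
        subprocess.run(build_motif_command(region_file, genome, motif_file),
                       stdout=handle, check=True)

# Build (but do not run) the command with the placeholder paths from this tutorial.
cmd = build_motif_command("data/region.txt", "mm10", "/path/to/current_dir/all_motif.txt")
print(" ".join(cmd))
```

With `check=True`, a failing HOMER run raises an error instead of silently leaving an empty `MotifTarget.bed`.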
269 | #### TF binding potential
270 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential.
271 | ```python
272 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version
273 | ```
274 |
275 | #### *cis*-regulatory network
276 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
277 | ```python
278 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)
279 | ```
280 |
281 | #### *trans*-regulatory network
282 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
283 | ```python
284 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
285 | ```
286 | ## Identify driver regulators by TF activity
287 | ### Instruction
288 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators.
289 |
290 | ### Prepare
291 | We need a *trans*-regulatory network; choose the network that best matches your data.
292 | 1. If your gene expression data are matched with cell population GRN, you can set
293 | ```python
294 | network = 'cell population'
295 | ```
296 | 2. If your gene expression data are matched with a certain cell type, you can set network to the name of this cell type.
297 | ```python
298 | network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type
299 | ```
300 |
301 | ### Calculate TF activity
302 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. In the gene expression matrix, rows are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
303 |
304 | ```python
305 | from LingerGRN.TF_activity import *
306 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
307 | ```
308 | Visualize the TF activity heatmap by cluster. If you want to save the heatmap to outdir, please set 'save=True'. The output is 'heatmap_activity.png'.
309 | ```python
310 | save=True
311 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
312 | ```
313 |
314 |

315 |
316 |
317 | ### Identify driver regulator
318 | We use a t-test on TF activity to find the differentially active TFs of a certain cell type.
319 | 1. You can assign a certain cell type of the gene expression data by
320 | ```python
321 | celltype='1'
322 | ```
323 | 2. Or, you can obtain the result for all cell types.
324 | ```python
325 | celltype='all'
326 | ```
327 |
328 | For example,
329 |
330 | ```python
331 | celltype='1'
332 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
333 | t_test_results
334 | ```
335 |
336 |
337 |

338 |
339 |
340 | Visualize the differential activity and expression. You can compare two cell types, or one cell type against all others. If you want to save the box plot to outdir, please set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.
341 |
342 | ```python
343 | TFName='Erg'
344 | datatype='activity'
345 | celltype1='1'
346 | celltype2='Others'
347 | save=True
348 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
349 | ```
350 |
351 |
352 |

353 |
354 |
355 | For gene expression data, the boxplot is:
356 | ```python
357 | datatype='expression'
358 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
359 | ```
360 |
361 |
362 |

363 |
364 |
365 |
366 |
--------------------------------------------------------------------------------
/docs/trans_pr_curveMYC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/trans_pr_curveMYC.png
--------------------------------------------------------------------------------
/docs/trans_roc_curveMYC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/trans_roc_curveMYC.png
--------------------------------------------------------------------------------
/docs/ttest.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ttest.png
--------------------------------------------------------------------------------
/docs/ttest_mm10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ttest_mm10.png
--------------------------------------------------------------------------------
/docs/tutorial2.md:
--------------------------------------------------------------------------------
1 | # Identify driver regulators by TF activity
2 | ## Instruction
3 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators.
4 |
5 | ## Prepare
6 | We need a *trans*-regulatory network; choose the network that best matches your data.
7 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can choose a GRN from various tissues.
8 | ```python
9 | network = 'general'
10 | ```
11 | 2. If your gene expression data are matched with cell population GRN, you can set
12 | ```python
13 | network = 'cell population'
14 | ```
15 | 3. If your gene expression data are matched with a certain cell type, you can set network to the name of this cell type.
16 | ```python
17 | network = '0' # 0 is the name of one cell type
18 | ```
19 | ## Calculate TF activity
20 | ```python
21 | Input_dir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/Input/'
22 | RNA_file='RNA.txt'
23 | GRNdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data_bulk/'
24 | genome='hg38'
25 | from TF_activity import *
26 | regulon_score=regulon(Input_dir,RNA_file,GRNdir,network,genome)
27 | ```
28 | ## Identify driver regulator
29 | We use a t-test on TF activity to find the differentially active TFs of a certain cell type.
30 | 1. You can assign a certain cell type by
31 | ```python
32 | celltype='0'
33 | ```
34 | 2. Or, you can obtain the result for all cell types.
35 | ```python
36 | celltype='all'
37 | ```
38 | ```python
39 | labels='label.txt'
40 | t_test_results=master_regulator(regulon_score,Input_dir,labels,celltype)
41 | t_test_results
42 | ```
43 |
44 |

45 |
46 |
--------------------------------------------------------------------------------
/docs/tvalue_all.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/tvalue_all.png
--------------------------------------------------------------------------------