├── LINGER.PNG
├── README.md
├── conti1_1024.yml
├── doc
    └── driver.md
└── docs
    ├── ATAC.png
    ├── Benchmark.md
    ├── GRN_head.png
    ├── GRN_infer.md
    ├── H1_E2F6_trans.jpg
    ├── LINGER.png
    ├── PBMC.md
    ├── PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png
    ├── PBMCs_heatmap_activity.png
    ├── PBMCs_ttest.png
    ├── POU5F1_KO_Diff_Umap_NANOG.png
    ├── POU5F1_KO_Diff_exp_Umap_NANOG.png
    ├── POU5F1_KO_Differentiation_Umap.png
    ├── PWM.jpg
    ├── RNA.png
    ├── RNA_ds.jpg
    ├── S_TG.png
    ├── TFactivity.md
    ├── User_guide.md
    ├── adata_ATAC.png
    ├── adata_RNA.png
    ├── barcode_mm10.png
    ├── box_plot_ATF1_activity_0_Others.png
    ├── box_plot_ATF1_expression_0_Others.png
    ├── box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png
    ├── box_plot_Erg_activity_1_Others.png
    ├── box_plot_Erg_expression_1_Others.png
    ├── downstream.md
    ├── driver.md
    ├── driver_epi.png
    ├── driver_trans.png
    ├── feature_engineering.jpg
    ├── feature_engineering.png
    ├── genomemap.jpg
    ├── h5_input.md
    ├── heatmap_activity.png
    ├── heatmap_activity_mm10.png
    ├── label.png
    ├── label_PBMC.png
    ├── metadata_ds.jpg
    ├── module_result.png
    ├── motifmatch.png
    ├── original.png
    ├── perturb.md
    ├── perturb.png
    ├── pvalue_all.png
    ├── scNN.md
    ├── scNN_newSpecies.md
    ├── trans_pr_curveMYC.png
    ├── trans_roc_curveMYC.png
    ├── ttest.png
    ├── ttest_mm10.png
    ├── tutorial2.md
    └── tvalue_all.png


/LINGER.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/LINGER.PNG


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # LINGER
 2 | ## Introduction
 3 | LINGER (LIfelong neural Network for GEne Regulation) is a novel method to infer GRNs from single-cell multiome data built on top of [PyTorch](https://pytorch.org/).
 4 | 
 5 | LINGER incorporates both 1) atlas-scale external bulk data across diverse cellular contexts and 2) the knowledge of transcription factor (TF) motif matching to cis-regulatory elements as a manifold regularization to address the challenge of limited data and extensive parameter space in GRN inference.
 6 | ## Analysis tasks for single cell multiome data
 7 | - Infer gene regulatory network
 8 | - Benchmark gene regulatory network
 9 | - Explainable dimensionality reduction (transcription factor activity, available for single cell or bulk RNA-seq data)
10 | - In silico perturbation
11 | 
12 | In the user guide, we provide an overview of each task. 
13 | ## Basic installation
14 | LINGER can be installed with pip:
15 | ```sh
16 | conda create -n LINGER python==3.10.0
17 | conda activate LINGER
18 | pip install LingerGRN==1.105
19 | conda install bioconda::bedtools # Requirement
20 | ```
21 | ## Documentation
22 | 
23 | We provide several tutorials and a user guide. If you find our tool useful for your research, please consider citing the LINGER manuscript.
24 | 
25 | |                           |                           |                           |
26 | |:-------------------------:|:-------------------------:|:-------------------------:|
27 | | [User guide](https://github.com/Durenlab/LINGER/blob/main/docs/User_guide.md) | [PBMCs tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md) |[H1 cell line tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/GRN_infer.md)  |
28 | |[GRN benchmark](https://github.com/Durenlab/LINGER/blob/main/docs/Benchmark.md)  | [In silico perturbation](https://github.com/Durenlab/LINGER/blob/main/docs/perturb.md) | [Other species](https://github.com/Durenlab/LINGER/blob/main/docs/scNN.md) |
29 | |[Downstream analysis-Module detection](https://github.com/Durenlab/LINGER/blob/main/docs/downstream.md)|[Downstream analysis-TF Driver score](https://github.com/Durenlab/LINGER/blob/main/docs/driver.md)||
30 |     
31 | 
32 | ## Reference
33 | > If you use LINGER, please cite:
34 | > 
35 | > [Yuan, Qiuyue, and Zhana Duren. "Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data." Nature Biotechnology (2024): 1-11.](https://doi.org/10.1038/s41587-024-02182-7)
36 | 


--------------------------------------------------------------------------------
/conti1_1024.yml:
--------------------------------------------------------------------------------
  1 | name: null
  2 | channels:
  3 |   - conda-forge
  4 |   - anaconda
  5 |   - bioconda
  6 |   - defaults
  7 | dependencies:
  8 |   - _libgcc_mutex=0.1=main
  9 |   - _openmp_mutex=5.1=1_gnu
 10 |   - asttokens=2.0.5=pyhd3eb1b0_0
 11 |   - backcall=0.2.0=pyhd3eb1b0_0
 12 |   - blas=1.0=openblas
 13 |   - bottleneck=1.3.5=py310ha9d4c09_0
 14 |   - bzip2=1.0.8=h7b6447c_0
 15 |   - ca-certificates=2023.08.22=h06a4308_0
 16 |   - cloudpickle=2.2.1=pyhd8ed1ab_0
 17 |   - colorama=0.4.6=pyhd8ed1ab_0
 18 |   - comm=0.1.2=py310h06a4308_0
 19 |   - debugpy=1.6.7=py310h6a678d5_0
 20 |   - decorator=5.1.1=pyhd3eb1b0_0
 21 |   - exceptiongroup=1.0.4=py310h06a4308_0
 22 |   - executing=0.8.3=pyhd3eb1b0_0
 23 |   - ipykernel=6.25.0=py310h2f386ee_0
 24 |   - ipython=8.15.0=py310h06a4308_0
 25 |   - jedi=0.18.1=py310h06a4308_1
 26 |   - joblib=1.3.2=pyhd8ed1ab_0
 27 |   - jupyter_client=8.1.0=py310h06a4308_0
 28 |   - jupyter_core=5.3.0=py310h06a4308_0
 29 |   - ld_impl_linux-64=2.38=h1181459_1
 30 |   - libblas=3.9.0=16_linux64_openblas
 31 |   - libcblas=3.9.0=16_linux64_openblas
 32 |   - libffi=3.3=he6710b0_2
 33 |   - libgcc-ng=11.2.0=h1234567_1
 34 |   - libgfortran-ng=11.2.0=h00389a5_1
 35 |   - libgfortran5=11.2.0=h1234567_1
 36 |   - libgomp=11.2.0=h1234567_1
 37 |   - liblapack=3.9.0=16_linux64_openblas
 38 |   - libllvm14=14.0.6=hef93074_0
 39 |   - libopenblas=0.3.21=h043d6bf_0
 40 |   - libsodium=1.0.18=h7b6447c_0
 41 |   - libstdcxx-ng=11.2.0=h1234567_1
 42 |   - libuuid=1.41.5=h5eee18b_0
 43 |   - llvmlite=0.40.0=py310he621ea3_0
 44 |   - matplotlib-inline=0.1.6=py310h06a4308_0
 45 |   - ncurses=6.4=h6a678d5_0
 46 |   - nest-asyncio=1.5.6=py310h06a4308_0
 47 |   - numba=0.57.1=py310h1128e8f_0
 48 |   - numexpr=2.8.7=py310h286c3b5_0
 49 |   - openssl=1.1.1w=h7f8727e_0
 50 |   - packaging=23.1=py310h06a4308_0
 51 |   - pandas=2.0.3=py310h1128e8f_0
 52 |   - parso=0.8.3=pyhd3eb1b0_0
 53 |   - pexpect=4.8.0=pyhd3eb1b0_3
 54 |   - pickleshare=0.7.5=pyhd3eb1b0_1003
 55 |   - pip=23.2.1=py310h06a4308_0
 56 |   - platformdirs=3.10.0=py310h06a4308_0
 57 |   - prompt-toolkit=3.0.36=py310h06a4308_0
 58 |   - psutil=5.9.0=py310h5eee18b_0
 59 |   - ptyprocess=0.7.0=pyhd3eb1b0_2
 60 |   - pure_eval=0.2.2=pyhd3eb1b0_0
 61 |   - pygments=2.15.1=py310h06a4308_1
 62 |   - python=3.10.0=h12debd9_5
 63 |   - python-dateutil=2.8.2=pyhd3eb1b0_0
 64 |   - python-tzdata=2023.3=pyhd3eb1b0_0
 65 |   - python_abi=3.10=2_cp310
 66 |   - pytz=2023.3.post1=py310h06a4308_0
 67 |   - pyzmq=25.1.0=py310h6a678d5_0
 68 |   - readline=8.2=h5eee18b_0
 69 |   - scikit-learn=1.3.0=py310h1128e8f_0
 70 |   - scipy=1.11.3=py310heeff2f4_0
 71 |   - setuptools=68.0.0=py310h06a4308_0
 72 |   - shap=0.42.1=py310h1128e8f_0
 73 |   - six=1.16.0=pyhd3eb1b0_1
 74 |   - slicer=0.0.7=pyhd8ed1ab_0
 75 |   - sqlite=3.41.2=h5eee18b_0
 76 |   - stack_data=0.2.0=pyhd3eb1b0_0
 77 |   - tbb=2021.8.0=hdb19cb5_0
 78 |   - threadpoolctl=3.2.0=pyha21a80b_0
 79 |   - tk=8.6.12=h1ccaba5_0
 80 |   - tornado=6.3.2=py310h5eee18b_0
 81 |   - tqdm=4.66.1=pyhd8ed1ab_0
 82 |   - traitlets=5.7.1=py310h06a4308_0
 83 |   - tzdata=2023c=h04d1e81_0
 84 |   - wcwidth=0.2.5=pyhd3eb1b0_0
 85 |   - wheel=0.41.2=py310h06a4308_0
 86 |   - xz=5.4.2=h5eee18b_0
 87 |   - zeromq=4.3.4=h2531618_0
 88 |   - zlib=1.2.13=h5eee18b_0
 89 |   - pip:
 90 |     - certifi==2023.7.22
 91 |     - charset-normalizer==3.3.0
 92 |     - filelock==3.12.4
 93 |     - fsspec==2023.9.2
 94 |     - idna==3.4
 95 |     - jinja2==3.1.2
 96 |     - markupsafe==2.1.3
 97 |     - mpmath==1.3.0
 98 |     - networkx==3.1
 99 |     - numpy==1.24.0
100 |     - nvidia-cublas-cu12==12.1.3.1
101 |     - nvidia-cuda-cupti-cu12==12.1.105
102 |     - nvidia-cuda-nvrtc-cu12==12.1.105
103 |     - nvidia-cuda-runtime-cu12==12.1.105
104 |     - nvidia-cudnn-cu12==8.9.2.26
105 |     - nvidia-cufft-cu12==11.0.2.54
106 |     - nvidia-curand-cu12==10.3.2.106
107 |     - nvidia-cusolver-cu12==11.4.5.107
108 |     - nvidia-cusparse-cu12==12.1.0.106
109 |     - nvidia-nccl-cu12==2.18.1
110 |     - nvidia-nvjitlink-cu12==12.2.140
111 |     - nvidia-nvtx-cu12==12.1.105
112 |     - pillow==10.0.1
113 |     - requests==2.31.0
114 |     - sympy==1.12
115 |     - torch==2.1.0
116 |     - torchaudio==2.1.0
117 |     - torchvision==0.16.0
118 |     - triton==2.1.0
119 |     - typing-extensions==4.8.0
120 |     - urllib3==2.0.6
121 | prefix: /zfs/durenlab/palmetto/Kaya/env/conti1
122 | 


--------------------------------------------------------------------------------
/doc/driver.md:
--------------------------------------------------------------------------------
  1 | ## Driver Score
  2 | We identify driver TFs underlying epigenetic and transcriptomic changes between control and AUD using a correlation model. We normalize the GRN and then calculate the Pearson correlation coefficient (PCC) between the expression or chromatin accessibility fold change and the regulatory strength of TGs or REs for each TF.
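
To make the correlation model concrete, here is a minimal sketch of the PCC step using only the standard library. The function name `driver_pcc` and the toy values are hypothetical and are not part of LingerGRN; the package's `driver_score` may normalize and filter differently.

```python
import math

def driver_pcc(fold_change, regulatory_strength):
    """Pearson correlation between fold change and a TF's regulatory strength.

    Both arguments are dicts keyed by gene (or RE) name; only shared keys are used.
    """
    shared = sorted(set(fold_change) & set(regulatory_strength))
    x = [fold_change[g] for g in shared]
    y = [regulatory_strength[g] for g in shared]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data, made up for illustration only.
log2fc = {"G1": 1.2, "G2": -0.4, "G3": 0.9, "G4": -1.1}
tf_strength = {"G1": 0.8, "G2": 0.1, "G3": 0.7, "G4": 0.0}
score = driver_pcc(log2fc, tf_strength)
```

A score close to 1 means the genes this TF regulates most strongly are also the ones that increase most between groups; a score close to -1 means they decrease most.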
  3 | ### Transcriptomics driver score
  4 | ```python
  5 | import pandas as pd
  6 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0)
  7 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove mitochondrial genes; for mouse, replace 'MT-' with 'mt-'
  8 | import scanpy as sc
  9 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad')
 10 | label_all = adata_RNA.obs[['barcode','sample','label']]
 11 | label_all.index = label_all['barcode']
 12 | metadata = label_all.loc[TG_pseudobulk.columns]
 13 | metadata.columns = ['barcode','group','celltype']
 14 | GRN='trans_regulatory'
 15 | adjust_method='bonferroni' 
 16 | corr_method='pearsonr'
 17 | import numpy as np
 18 | C_result_RNA_sp,P_result_RNA_sp,Q_result_RNA_sp=driver_score(TG_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method)
 19 | K=3 # We choose the top K positive and negative TFs to save to the txt file for visualization purposes.
 20 | C_result_RNA_sp_r,Q_result_RNA_sp_r=driver_result(C_result_RNA_sp,Q_result_RNA_sp,K)
 21 | C_result_RNA_sp_r.to_csv('C_result_RNA_sp_r.txt',sep='\t')
 22 | Q_result_RNA_sp_r.to_csv('Q_result_RNA_sp_r.txt',sep='\t')
 23 | ```
 24 | `adjust_method` is the p-value adjustment method. Choose one of the following:
 25 | - bonferroni : one-step correction
 26 | - sidak : one-step correction
 27 | - holm-sidak : step down method using Sidak adjustments
 28 | - holm : step-down method using Bonferroni adjustments
 29 | - simes-hochberg : step-up method (independent)
 30 | - hommel : closed method based on Simes tests (non-negative)
 31 | - fdr_bh : Benjamini/Hochberg (non-negative)
 32 | - fdr_by : Benjamini/Yekutieli (negative)
 33 | - fdr_tsbh : two stage fdr correction (non-negative)
 34 | - fdr_tsbky : two stage fdr correction (non-negative)
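
For intuition about what these corrections do, the sketch below implements two of them, `bonferroni` and `fdr_bh`, in plain Python. It is illustrative only: the function names are hypothetical, and it is an assumption that LINGER delegates the adjustment to a statistics library rather than implementing it this way.

```python
def bonferroni(pvalues):
    """One-step correction: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Step-up FDR correction (Benjamini/Hochberg)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values, made up for illustration only.
pvals = [0.01, 0.04, 0.03, 0.005]
q_bonferroni = bonferroni(pvals)
q_bh = benjamini_hochberg(pvals)
```

Bonferroni is the most conservative choice; the FDR methods usually keep more TFs at the same cutoff.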
 35 | ### Visualization
 36 | ```python
 37 | import os
 38 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R'  # Replace with your actual R home path
 39 | import rpy2.robjects as robjects
 40 | from rpy2.robjects import r
 41 | # Import the R plotting package (ggplot2 as an example)
 42 | r('library(ggplot2)')
 43 | r('library(grid)')
 44 | # Create data in R environment through Python
 45 | r('''
 46 | dataP=read.table('Q_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE)
 47 | dataT=read.table('C_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE)
 48 | sort_TF=rownames(dataT)
 49 | library(tidyr)
 50 | dataP=-log10(dataP)
 51 | print(paste0('maximum of -log10P:',max(dataP)))
 52 | dataP[dataP>40]=40
 53 | dataP1=dataP
 54 | dataP1$TF=rownames(dataP)
 55 | longdiff0 <- gather(dataP1, sample, value,-TF)
 56 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ]
 57 | dataT1=dataT
 58 | dataT1$TF=rownames(dataT)
 59 | longdiff1=gather(dataT1, sample, value,-TF)
 60 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ]
 61 | colnames(longdiff1)=c('TF','celltype','PCC')
 62 | longdiff1$P=longdiff0_s$value
 63 | longdiff1$TF=factor(longdiff1$TF,levels=rev(sort_TF))
 64 | library(egg)
 65 | limits0=c(2,ceiling(dataP))
 66 | range0 = c(1,4)
 67 | breaks0 = c(2,(ceiling(dataP)-2)*1/4+2,(ceiling(dataP)-2)*2/4+2,(ceiling(dataP)-2)*3/4+2,ceiling(dataP))
 68 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
 69 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) + 
 70 |   scale_size_continuous(limits = c(4, 40), range = c(1,5), breaks = c(4,10,20,30,40)) + 
 71 |   labs( x= "cell type", y = "TF", fill = "")  + theme_article()+
 72 |   theme(legend.key=element_blank(), 
 73 |   axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1), 
 74 |   legend.position = "right") + 
 75 |   scale_fill_gradient2(midpoint=0, low="blue", mid="white",
 76 |                      high="red", space ="Lab" )
 77 | 
 78 | 
 79 | ''')
 80 | ```
 81 | The figure is saved to driver_trans.pdf.
 82 | <div style="text-align: right">
 83 |   <img src="driver_trans.png" alt="Image" width="400">
 84 | </div>
 85 | 
 86 | ### Epigenetic driver score
 87 | ```python
 88 | RE_pseudobulk=pd.read_csv('data/RE_pseudobulk.tsv',sep=',',header=0,index_col=0)
 89 | K=5
 90 | GRN='TF_RE_binding'
 91 | adjust_method='bonferroni'
 92 | corr_method='pearsonr'
 93 | C_result_RE,P_result_RE,Q_result_RE=driver_score(RE_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method)
 94 | C_result_RE_r,Q_result_RE_r=driver_result(C_result_RE,Q_result_RE,K)
 95 | C_result_RE_r.to_csv('C_result_RE_r.txt',sep='\t')
 96 | Q_result_RE_r.to_csv('Q_result_RE_r.txt',sep='\t')
 97 | ```
 98 | ### Visualization
 99 | ```python
100 | import os
101 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R'  # Replace with your actual R home path
102 | import rpy2.robjects as robjects
103 | from rpy2.robjects import r
104 | # Import the R plotting package (ggplot2 as an example)
105 | r('library(ggplot2)')
106 | r('library(grid)')
107 | # Create data in R environment through Python
108 | r('''
109 | dataP=read.table('Q_result_RE_r.txt',sep='\t',row.names=1,header=TRUE)
110 | dataT=read.table('C_result_RE_r.txt',sep='\t',row.names=1,header=TRUE)
111 | sort_TF=rownames(dataT)
112 | library(tidyr)
113 | dataP=-log10(dataP)
114 | print(paste0('maximum of -log10P:',max(dataP)))
115 | maxP=100
116 | dataP[dataP>100]=100
117 | dataP1=dataP
118 | dataP1$TF=rownames(dataP)
119 | longdiff0 <- gather(dataP1, sample, value,-TF)
120 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ]
121 | dataT1=dataT
122 | dataT1$TF=rownames(dataT)
123 | longdiff1=gather(dataT1, sample, value,-TF)
124 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ]
125 | colnames(longdiff1)=c('TF','celltype','PCC')
126 | longdiff1$P=longdiff0_s$value
127 | longdiff1$TF=factor(longdiff1$TF,levels=(sort_TF))
128 | library(egg)
129 | cutoff=2
130 | maxp=ceiling(max(dataP))
131 | print(maxp)
132 | limits0=c(cutoff,maxp)
133 | print(limits0)
134 | range0 = c(1,4)
135 | numbreak=5
136 | d=ceiling((maxp-cutoff)/(numbreak-1))
137 | print(d)
138 | breaks0= seq(from = cutoff, to = cutoff+d*(numbreak-1),  length.out=numbreak)
139 | 
140 | print(range0)
141 | print(breaks0)
142 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
143 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) + 
144 |   scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) + 
145 |   labs( x= "cell type", y = "TF", fill = "")  + theme_article()+
146 |   theme(legend.key=element_blank(), 
147 |   axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1), 
148 |   legend.position = "right") + 
149 |   scale_fill_gradient2(midpoint=0, low="blue", mid="white",
150 |                      high="red", space ="Lab" )
151 | 
152 | 
153 | ''')
154 | r('''
155 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1)
156 | library(pheatmap)
157 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ])
158 | colnames(anno1)=c('TG')
159 | rownames(anno1)=rownames(dataP)
160 | anno1[is.na(anno1)]=0
161 | anno1[,1]=paste0('M',anno1[,1])
162 | anno1$name=rownames(anno1)
163 | longdiff0 <- gather(anno1, sample, value,-name)
164 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1))
165 | library("RColorBrewer")
166 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb',
167 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d')
168 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d')
169 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) +
170 |   geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) +
171 |   theme_article()+theme(text = element_text(size = 9), legend.position = "left",
172 |  )
173 |   ''')
174 | r('''
175 | widths <- c(4.5, 4.5+dim(dataP)[2]) 
176 | print(unique(longdiff0$value))
177 | pdf('driver_epi.pdf',width=6/16*(9+dim(dataP)[2]),height= dim(anno1)[1]/10+1.5)
178 | #print(p)
179 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths)
180 | dev.off()
181 | ''')
182 | ```
183 | The figure is saved to driver_epi.pdf.
184 | <div style="text-align: right">
185 |   <img src="driver_epi.png" alt="Image" width="400">
186 | </div>
187 | 


--------------------------------------------------------------------------------
/docs/ATAC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ATAC.png


--------------------------------------------------------------------------------
/docs/Benchmark.md:
--------------------------------------------------------------------------------
 1 | # Benchmark gene regulatory network
 2 | The example input we provide is an in-silico mixture of H1, BJ, GM12878, and K562 cell lines from SNARE-seq data. We benchmark the GRN of the H1 cell line as an example.
 3 | 
 4 | ## Download the ground truth data
 5 | The ground truth data are the putative TF targets that [Cistrome](http://cistrome.org/) derives from ChIP-seq data. Here, we take the ChIP-seq data of E2F6 in the H1 cell line as an example. We download the ground truth from Cistrome as follows (46177_gene_score_5fold.txt).
 6 | <div style="text-align: right">
 7 |   <img src="H1_E2F6_trans.jpg" alt="Image" width="700">
 8 | </div>
 9 | ## Prepare GRN file
10 | You can provide a list of predicted GRN files. We support two formats:
11 | - list: There are 3 columns, Target gene (TG), TF, and regulation strength (score)
12 | <div style="text-align: right">
13 |   <img src="GRN_head.png" alt="Image" width="300">
14 | </div>
15 | - matrix: each row is a gene name, each column is a TF name, and each value is the regulation strength.
16 | 
17 | Here, we give cell type specific and cell population GRN files as examples (the format here is 'matrix' described above).
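
The two formats carry the same information. The following sketch converts 'list' rows into the 'matrix' layout using only the standard library. The helper `list_to_matrix` is hypothetical and not part of LingerGRN, and filling missing gene/TF pairs with 0 is an assumption made for this illustration.

```python
from collections import defaultdict

def list_to_matrix(edges):
    """Convert (TG, TF, score) rows into a gene-by-TF matrix.

    Returns (genes, tfs, matrix) where matrix[i][j] is the score of tfs[j] on genes[i].
    Pairs that never appear in the list are filled with 0.0 (an assumption).
    """
    scores = defaultdict(dict)
    for tg, tf, score in edges:
        scores[tg][tf] = float(score)
    genes = sorted(scores)
    tfs = sorted({tf for row in scores.values() for tf in row})
    matrix = [[scores[g].get(tf, 0.0) for tf in tfs] for g in genes]
    return genes, tfs, matrix

# Toy edges, made up for illustration only.
edges = [("GENE1", "MYC", 0.9), ("GENE2", "MYC", 0.1), ("GENE1", "E2F6", 0.3)]
genes, tfs, matrix = list_to_matrix(edges)
```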
18 | 
19 | ## ROC and PR curves
20 | ```python
21 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/'
22 | TFName = 'MYC'
23 | Method_name=['H1','cell population']
24 | Infer_trans=[outdir+'cell_type_specific_trans_regulatory_0.txt',outdir+'cell_population_trans_regulatory.txt']
25 | Groundtruth='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/groundtruth/45691_gene_score_5fold.txt'
26 | from LingerGRN import Benchmk
27 | Benchmk.bm_trans(TFName,Method_name,Groundtruth,Infer_trans,outdir,'matrix')
28 | ```
29 | <div style="text-align: right">
30 |   <img src="trans_roc_curveMYC.png" alt="Image" width="300">
31 | </div>
32 | <div style="text-align: right">
33 |   <img src="trans_pr_curveMYC.png" alt="Image" width="300">
34 | </div>
35 | 
36 | The results are automatically saved in outdir as trans_roc_curve+TFName+.pdf and trans_pr_curve+TFName+.pdf.
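
As background for reading these curves, here is a small sketch of how an ROC curve and its area can be computed from predicted scores and binary ground truth labels. It is not the `Benchmk.bm_trans` implementation: the function names are hypothetical, the labels are assumed to be already binarized, and tied scores are not handled specially.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points from sweeping a threshold over descending scores."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a curve given as (x, y) points, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy example: predicted trans-regulatory scores and ground truth labels (made up).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auroc = auc(roc_points(scores, labels))
```

An AUROC of 0.5 corresponds to random ranking, so curves from different GRNs are best compared against that baseline.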
37 | 


--------------------------------------------------------------------------------
/docs/GRN_head.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/GRN_head.png


--------------------------------------------------------------------------------
/docs/GRN_infer.md:
--------------------------------------------------------------------------------
  1 | # Construct the gene regulatory network
  2 | # Load data
  3 | ## Download the general gene regulatory network 
  4 | We provide the general gene regulatory network. Please download the data first.
  5 | ```sh
  6 | Datadir=/path/to/LINGER/ # the directory to store the data; please use an absolute path. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/
  7 | mkdir -p $Datadir
  8 | cd $Datadir
  9 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1lAlzjU5BYbpbr4RHMlAGDOh9KWdCMQpS'  -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1lAlzjU5BYbpbr4RHMlAGDOh9KWdCMQpS" -O data_bulk.tar.gz && rm -rf /tmp/cookies.txt
 10 | ```
 11 | Then extract the archive:
 12 | ```sh
 13 | tar -xzf data_bulk.tar.gz
 14 | ```
 15 | 
 16 | ## Input data
 17 | The input data should be in AnnData format. In this example, we need to convert the following data types to AnnData:
 18 | - Single-cell multiome data including gene expression (RNA.txt in our example) and chromatin accessibility (ATAC.txt in our example).
 19 | - Cell annotation/cell type label if you need the cell type specific gene regulatory network (label.txt in our example).
 20 | ### RNA-seq
 21 | In the RNA-seq matrix, each row is a gene symbol, each column is a cell barcode, and each value is a read count. Here is our example:
 22 | <div style="text-align: right">
 23 |   <img src="RNA.png" alt="Image" width="500">
 24 | </div>
 25 | 
 26 | ### ATAC-seq
 27 | Each row is a regulatory element/genomic region; each column is a cell barcode, in the same order as in the RNA-seq data; each value is a read count. Here is our example:
 28 | <div style="text-align: right">
 29 |   <img src="ATAC.png" alt="Image" width="500">
 30 | </div>
 31 | 
 32 | ### Cell annotation/cell type label
 33 | Each row is a cell barcode, in the same order as in the RNA-seq data. There is one column, 'Annotation', which holds the cell type label; it can be a number or a string. Here is our example:
 34 | <div style="text-align: right">
 35 |   <img src="label.png" alt="Image" width="250">
 36 | </div>
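
Because all three inputs must list barcodes in the same order, a quick check before building the AnnData objects can catch misaligned files early. This is a minimal sketch with a hypothetical helper, `check_barcode_order`, that is not part of LingerGRN; it assumes the barcodes have already been read into plain lists.

```python
def check_barcode_order(rna_barcodes, atac_barcodes, label_barcodes):
    """Raise ValueError unless the three barcode lists are identical in order."""
    if not (len(rna_barcodes) == len(atac_barcodes) == len(label_barcodes)):
        raise ValueError("RNA, ATAC and label inputs have different numbers of cells")
    for i, (r, a, l) in enumerate(zip(rna_barcodes, atac_barcodes, label_barcodes)):
        if not (r == a == l):
            raise ValueError(f"barcode mismatch at position {i}: {r!r}, {a!r}, {l!r}")
    return True

# Toy barcodes, made up for illustration only.
rna = ["AAAC-1", "AAAG-1", "AACT-1"]
atac = ["AAAC-1", "AAAG-1", "AACT-1"]
label = ["AAAC-1", "AAAG-1", "AACT-1"]
ok = check_barcode_order(rna, atac, label)
```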
 37 | 
 38 | ### Provided input example 
 39 | You can download the example input datasets into a directory of your choice. This is sc multiome data from an in-silico mixture of H1, BJ, GM12878, and K562 cell lines, taken from droplet-based single-nucleus chromatin accessibility and mRNA expression sequencing (SNARE-seq) data.
 40 | 
 41 | ```sh
 42 | Input_dir=/path/to/dir/
 43 | # The input data directory. Example: Input_dir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/
 44 | cd $Input_dir
 45 | #ATAC-seq
 46 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1qmMudeixeRbYS8LCDJEuWxlAgeM0hC1r'  -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1qmMudeixeRbYS8LCDJEuWxlAgeM0hC1r" -O ATAC.txt && rm -rf /tmp/cookies.txt
 47 | #RNA-seq
 48 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1dP4ITjQZiVDa52xfDTo5c14f9H0MsEGK' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1dP4ITjQZiVDa52xfDTo5c14f9H0MsEGK" -O RNA.txt && rm -rf /tmp/cookies.txt
 49 | #label
 50 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZeEp5GnWfQJxuAY0uK9o8s_uAvFsNPI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZeEp5GnWfQJxuAY0uK9o8s_uAvFsNPI5" -O label.txt && rm -rf /tmp/cookies.txt
 51 | ```
 52 | ## LINGER 
 53 | ### Install
 54 | ```sh
 55 | conda create -n LINGER python==3.10.0
 56 | conda activate LINGER
 57 | pip install LingerGRN==1.92
 58 | conda install bioconda::bedtools #Requirement
 59 | ```
 60 | ### Preprocess
 61 | There are 2 options of the method we introduced above:
 62 | 1. baseline;
 63 | ```python
 64 | method='baseline' # this is corresponding to bulkNN
 65 | ```
 66 | 2. LINGER;
 67 | ```python
 68 | method='LINGER'
 69 | ```
 70 | #### Transfer sc-multiome data to anndata
 71 | ```python
 72 | import pandas as pd
 73 | label=pd.read_csv('data/label.txt',sep='\t',header=0,index_col=None)
 74 | RNA=pd.read_csv('data/RNA.txt',sep='\t',header=0,index_col=0)
 75 | ATAC=pd.read_csv('data/ATAC.txt',sep='\t',header=0,index_col=0)
 76 | from scipy.sparse import csc_matrix
 77 | # Convert the NumPy array to a sparse csc_matrix
 78 | matrix = csc_matrix(pd.concat([RNA,ATAC],axis=0).values)
 79 | features=pd.DataFrame(RNA.index.tolist()+ATAC.index.tolist(),columns=[1])
 80 | K=RNA.shape[0]
 81 | N=K+ATAC.shape[0]
 82 | types = ['Gene Expression' if i < K else 'Peaks' for i in range(0, N)]  # the first K rows (indices 0..K-1) are genes
 83 | features[2]=types
 84 | barcodes=pd.DataFrame(RNA.columns.values,columns=[0])
 85 | from LingerGRN.preprocess import *
 86 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
 87 | ```
 88 | #### Remove low counts cells and genes 
 89 | ```python
 90 | import scanpy as sc
 91 | sc.pp.filter_cells(adata_RNA, min_genes=200)
 92 | sc.pp.filter_genes(adata_RNA, min_cells=3)
 93 | sc.pp.filter_cells(adata_ATAC, min_genes=200)
 94 | sc.pp.filter_genes(adata_ATAC, min_cells=3)
 95 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values))
 96 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
 97 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]
 98 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
 99 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
100 | ```
101 | #### Generate the pseudo-bulk/metacell
102 | ```python
103 | from LingerGRN.pseudo_bulk import *
104 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode 
105 | tempsample=samplelist[0]
106 | TG_pseudobulk=pd.DataFrame([])
107 | RE_pseudobulk=pd.DataFrame([])
108 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
109 | for tempsample in samplelist:
110 |     adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
111 |     adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
112 |     TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)                
113 |     TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
114 |     RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
115 |     RE_pseudobulk[RE_pseudobulk > 100] = 100
116 | 
117 | import os
118 | if not os.path.exists('data/'):
119 |     os.mkdir('data/')
120 | adata_ATAC.write('data/adata_ATAC.h5ad')
121 | adata_RNA.write('data/adata_RNA.h5ad')
122 | TG_pseudobulk=TG_pseudobulk.fillna(0)
123 | RE_pseudobulk=RE_pseudobulk.fillna(0)
124 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
125 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
126 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
127 | ```
128 | ### Training model
129 | Overlap the region with general GRN:
130 | ```python
131 | from LingerGRN.preprocess import *
132 | Datadir='/path/to/LINGER/'# This directory should be the same as Datadir defined in the above 'Download the general gene regulatory network' section
133 | GRNdir=Datadir+'data_bulk/'
134 | genome='hg38'
135 | outdir='/path/to/output/' #output dir
136 | preprocess(TG_pseudobulk,RE_pseudobulk,GRNdir,genome,method,outdir)
137 | ```
138 | Train the LINGER model.
139 | ```python
140 | import LingerGRN.LINGER_tr as LINGER_tr
141 | activef='ReLU' # activation function, chosen from 'ReLU', 'sigmoid', 'tanh'
142 | LINGER_tr.training(GRNdir,method,outdir,activef)
143 | ```
144 | ### Cell population gene regulatory network
145 | #### TF binding potential
146 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
147 | ```python
148 | import LingerGRN.LL_net as LL_net
149 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
150 | ```
151 | 
152 | #### *cis*-regulatory network
153 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
154 | ```python
155 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
156 | ```
157 | #### *trans*-regulatory network
158 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
159 | ```python
160 | LL_net.trans_reg(GRNdir,method,outdir)
161 | ```
162 | 
163 | ### Cell type specific gene regulatory network
164 | There are 2 options:
165 | 1. infer GRN for a specific cell type, which is in the label.txt;
166 | ```python
167 | celltype='0'#use a string to assign your cell type
168 | ```
169 | 2. infer GRNs for all cell types.
170 | ```python
171 | celltype='all'
172 | ```
173 | Please make sure that 'all' is not a cell type in your data.
174 | 
175 | #### TF binding potential
176 | The output is 'cell_type_specific_TF_RE_binding_{*celltype*}.txt', a matrix of the TF-RE binding potential.
177 | ```python
178 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir)
179 | ```
180 | 
181 | #### *cis*-regulatory network
182 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
183 | ```python
184 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir)
185 | ```
186 | 
187 | #### *trans*-regulatory network
188 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
189 | ```python
190 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
191 | ```
192 | 
193 | ## Note
194 | - The cell type specific GRN is based on the output of the cell population GRN.
195 | - To try both method options, create a separate output directory for each.
196 | ## Identify driver regulators by TF activity
197 | ### Instruction
198 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators. 
199 | 
200 | ### Prepare
201 | We need a *trans*-regulatory network. Choose the network that best matches your data.
202 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can use the general GRN from various tissues.
203 | ```python
204 | network = 'general'
205 | ```
206 | 2. If your gene expression data are matched with cell population GRN, you can set
207 | ```python
208 | network = 'cell population'
209 | ```
210 | 3. If your gene expression data are matched with certain cell type, you can set network to the name of this cell type.
211 | ```python
212 | network = '0' # 0 is the name of one cell type
213 | ```
214 | 
215 | ### Calculate TF activity
216 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. Rows of the gene expression data are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
217 | 
218 | ```python
219 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/' # this should be the same Datadir used to download the general GRN
220 | GRNdir=Datadir+'data_bulk/'
221 | genome='hg38'
222 | from LingerGRN.TF_activity import *
223 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir
224 | import anndata
225 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad')
226 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
227 | ```
228 | Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'.
229 | ```python
230 | save=True
231 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
232 | ```
233 | <div style="text-align: right">
234 |   <img src="heatmap_activity.png" alt="Image" width="500">
235 | </div>
236 | 
237 | ### Identify driver regulator
238 | We use a t-test on TF activity to find the TFs that are differentially active in a given cell type.
239 | 1. You can assign a certain cell type of the gene expression data by
240 | ```python
241 | celltype='0'
242 | ```
243 | 2. Or, you can obtain the result for all cell types.
244 | ```python
245 | celltype='all'
246 | ```
247 | 
248 | For example,
249 | 
250 | ```python
251 | celltype='0'
252 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
253 | t_test_results
254 | ```
255 | 
256 | <div style="text-align: right">
257 |   <img src="ttest.png" alt="Image" width="300">
258 | </div>
259 | 
260 | Visualize the differential activity and expression. You can compare 2 different cell types, or one cell type with all other cell types. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.
261 | 
262 | ```python
263 | TFName='ATF1'
264 | datatype='activity'
265 | celltype1='0'
266 | celltype2='Others'
267 | save=True
268 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir) # TF_activity is the output of regulon() above
269 | ```
270 | 
271 | <div style="text-align: right">
272 |   <img src="box_plot_ATF1_activity_0_Others.png" alt="Image" width="300">
273 | </div>
274 | 
275 | For gene expression data, the boxplot is:
276 | ```python
277 | datatype='expression'
278 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir) # TF_activity is the output of regulon() above
279 | ```
280 | 
281 | <div style="text-align: right">
282 |   <img src="box_plot_ATF1_expression_0_Others.png" alt="Image" width="300">
283 | </div>
284 | 


--------------------------------------------------------------------------------
/docs/H1_E2F6_trans.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/H1_E2F6_trans.jpg


--------------------------------------------------------------------------------
/docs/LINGER.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/LINGER.png


--------------------------------------------------------------------------------
/docs/PBMC.md:
--------------------------------------------------------------------------------
  1 | # PBMCs Tutorial
  2 | ## Instruction
  3 | This tutorial delineates a computational framework for constructing gene regulatory networks (GRNs) from single-cell multiome data. We provide 2 options to do this: '**baseline**' and '**LINGER**'. The first is a naive method that combines the prior GRNs with features from the single-cell data, offering a rapid approach. LINGER integrates the comprehensive gene regulatory profile from external bulk data. As shown in the following figure, LINGER uses lifelong machine learning (continual learning) based on neural network (NN) models, which leverages the knowledge learned in previous tasks to help learn the new task better.
  4 | <div style="text-align: right">
  5 |   <img src="LINGER.png" alt="Image" width="400">
  6 | </div>
  7 | 
  8 | After constructing the GRNs for the cell population, we infer the cell type specific GRNs using a feature engineering approach. As shown in the following figure, we combine the single-cell data ($O, E$, and $C$ in the figure) and the prior gene regulatory network structure with the parameters $\alpha,\beta,d,B$, and $\gamma$.
  9 | 
 10 | ![Image Alt Text](feature_engineering.jpg)
 11 | 
 12 | ## Download the general gene regulatory network 
 13 | We provide the general gene regulatory network. Please download the data first.
 14 | ```sh
 15 | Datadir=/path/to/LINGER/ # the directory to store the data; please use an absolute path. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/
 16 | mkdir -p "$Datadir"
 17 | cd "$Datadir"
 18 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b'  -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b" -O data_bulk.tar.gz && rm -rf /tmp/cookies.txt
 19 | ```
 20 | or use the following link: [https://drive.google.com/file/d/1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b/view?usp=sharing](https://drive.google.com/file/d/1jwRgRHPJrKABOk7wImKONTtUupV7yJ9b/view?usp=sharing)
 21 | 
 22 | Then extract the archive:
 23 | ```sh
 24 | tar -xzf data_bulk.tar.gz
 25 | ```
 26 | 
 27 | ## Prepare the input data
 28 | The input data are the feature matrix from 10x sc-multiome data and the cell annotation (cell type labels), which include:
 29 | - Single-cell multiome data including matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz.
 30 | - Cell annotation/cell type label if you need the cell type-specific gene regulatory network (PBMC_label.txt in our example).
 31 | <div style="text-align: right">
 32 |   <img src="label_PBMC.png" alt="Image" width="300">
 33 | </div>  
 34 | 
 35 | If the input data is a 10x h5 file or an h5ad file from scanpy, please follow the instructions in [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md).
 36 | 
 37 | ### sc data
 38 | We download the data from the shell command line.
 39 | ```sh
 40 | mkdir -p data
 41 | wget -O data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz
 42 | tar -xzvf data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.tar.gz
 43 | mv filtered_feature_bc_matrix data/
 44 | gzip -d data/filtered_feature_bc_matrix/*
 45 | ```
 46 | We provide the cell annotation as follows:
 47 | ```sh
 48 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=17PXkQJr8fk0h90dCkTi3RGPmFNtDqHO_" -O PBMC_label.txt && rm -rf /tmp/cookies.txt
 49 | mv PBMC_label.txt data/
 50 | ```
 51 | ## LINGER 
 52 | ### Install
 53 | ```sh
 54 | conda create -n LINGER python==3.10.0
 55 | conda activate LINGER
 56 | pip install LingerGRN==1.105
 57 | conda install bioconda::bedtools #Requirement
 58 | ```
 59 | For the following steps, we run the code in Python.
 60 | ### Preprocess
 61 | There are 2 options for the method we introduced above:
 62 | 1. baseline;
 63 | ```python
 64 | method='baseline' # this method corresponds to bulkNN described in the paper
 65 | ```
 66 | 2. LINGER;
 67 | ```python
 68 | method='LINGER'
 69 | ```
 70 | #### Transfer the sc-multiome data to anndata  
 71 | 
 72 | We convert the sc-multiome data to the anndata format and filter the cell barcodes by the cell type labels.
 73 | ```python
 74 | import scanpy as sc
 75 | #set some figure parameters for nice display inside jupyternotebooks.
 76 | %matplotlib inline
 77 | sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white')
 78 | sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
 79 | sc.logging.print_header()
 80 | #results_file = "scRNA/pbmc10k.h5ad"
 81 | import scipy.io  # import the io submodule explicitly so scipy.io.mmread is available
 82 | import pandas as pd
 83 | matrix=scipy.io.mmread('data/filtered_feature_bc_matrix/matrix.mtx')
 84 | features=pd.read_csv('data/filtered_feature_bc_matrix/features.tsv',sep='\t',header=None)
 85 | barcodes=pd.read_csv('data/filtered_feature_bc_matrix/barcodes.tsv',sep='\t',header=None)
 86 | label=pd.read_csv('data/PBMC_label.txt',sep='\t',header=0)
 87 | from LingerGRN.preprocess import *
 88 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
 89 | ```
 90 | #### Remove low counts cells and genes
 91 | ```python
 92 | import scanpy as sc
 93 | sc.pp.filter_cells(adata_RNA, min_genes=200)
 94 | sc.pp.filter_genes(adata_RNA, min_cells=3)
 95 | sc.pp.filter_cells(adata_ATAC, min_genes=200)
 96 | sc.pp.filter_genes(adata_ATAC, min_cells=3)
 97 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values))
 98 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values)
 99 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]]
100 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values)
101 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
102 | ```
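
The barcode matching step above keeps only the cells that pass filtering in both modalities and selects them in the same order from each object. A minimal sketch of that logic on toy data, independent of AnnData (the barcodes below are made up, and the variable names are ours):

```python
# Toy barcodes that survived filtering in each modality (hypothetical values).
rna_barcodes = ["AAAC-1", "AAAG-1", "AACT-1", "AAGG-1"]
atac_barcodes = ["AAAG-1", "AAGG-1", "ACCA-1", "AAAC-1"]

# Cells present in both modalities.
shared = set(rna_barcodes) & set(atac_barcodes)

# Use one fixed order (here: RNA order) so row i refers to the same cell in both.
selected = [bc for bc in rna_barcodes if bc in shared]

# Positional indices of the selected barcodes in each modality.
rna_pos = {bc: i for i, bc in enumerate(rna_barcodes)}
atac_pos = {bc: i for i, bc in enumerate(atac_barcodes)}
rna_idx = [rna_pos[bc] for bc in selected]
atac_idx = [atac_pos[bc] for bc in selected]

print(selected)
print(rna_idx, atac_idx)
```

The tutorial code applies the same barcode list to both objects, so the two stay aligned. The sketch uses RNA order only to make the result reproducible.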
103 | #### Generate the pseudo-bulk/metacell:
104 | ```python
105 | from LingerGRN.pseudo_bulk import *
106 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode 
107 | tempsample=samplelist[0]
108 | TG_pseudobulk=pd.DataFrame([])
109 | RE_pseudobulk=pd.DataFrame([])
110 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
111 | for tempsample in samplelist:
112 |     adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
113 |     adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
114 |     TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)                
115 |     TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
116 |     RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
117 |     RE_pseudobulk[RE_pseudobulk > 100] = 100
118 | 
119 | import os
120 | if not os.path.exists('data/'):
121 |     os.mkdir('data/')
122 | adata_ATAC.write('data/adata_ATAC.h5ad')
123 | adata_RNA.write('data/adata_RNA.h5ad')
124 | TG_pseudobulk=TG_pseudobulk.fillna(0)
125 | RE_pseudobulk=RE_pseudobulk.fillna(0)
126 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
127 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
128 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
129 | ```
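
To make the bookkeeping in the loop above concrete, here is a small sketch of what the column-wise `pd.concat`, the cap at 100, and `fillna(0)` do to per-sample tables. It does not reproduce how `pseudo_bulk` builds metacells. The region names, column names, and values are hypothetical.

```python
import pandas as pd

# Toy per-sample pseudo-bulk tables: rows are regions, columns are metacells.
sample1 = pd.DataFrame({"s1_m1": [3.0, 150.0], "s1_m2": [0.0, 20.0]},
                       index=["chr1:100-600", "chr1:900-1400"])
sample2 = pd.DataFrame({"s2_m1": [7.0, 1.0]},
                       index=["chr1:100-600", "chr2:50-550"])

# Column-wise concatenation aligns rows by index;
# regions missing in a sample become NaN.
combined = pd.concat([sample1, sample2], axis=1)

# Cap extreme values at 100 and replace missing entries with 0.
# clip(upper=100) has the same effect as the boolean-mask assignment in the tutorial.
combined = combined.clip(upper=100).fillna(0)
print(combined)
```

Regions seen in only one sample end up with zeros in the other sample's columns, which is why the tutorial fills missing values before writing the tables.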
130 | ### Training model
131 | Overlap the regions with the general GRN:
132 | ```python
133 | from LingerGRN.preprocess import *
134 | Datadir='/path/to/LINGER/'# This directory should be the same as Datadir defined in the above 'Download the general gene regulatory network' section
135 | GRNdir=Datadir+'data_bulk/'
136 | genome='hg38'
137 | outdir='/path/to/output/' #output dir
138 | preprocess(TG_pseudobulk,RE_pseudobulk,GRNdir,genome,method,outdir)
139 | ```
140 | Train the LINGER model.
141 | ```python
142 | import LingerGRN.LINGER_tr as LINGER_tr
143 | activef='ReLU' # activation function, chosen from 'ReLU','sigmoid','tanh'
144 | LINGER_tr.training(GRNdir,method,outdir,activef,'Human')
145 | ```
146 | 
147 | 
148 | ### Cell population gene regulatory network
149 | #### TF binding potential
150 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
151 | ```python
152 | import LingerGRN.LL_net as LL_net
153 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
154 | ```
155 | 
156 | #### *cis*-regulatory network
157 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
158 | ```python
159 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
160 | ```
161 | #### *trans*-regulatory network
162 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
163 | ```python
164 | LL_net.trans_reg(GRNdir,method,outdir,genome)
165 | ```
166 | 
167 | ### Cell type specific gene regulatory network
168 | There are 2 options:
169 | 1. infer GRN for a specific cell type, which is in the label.txt;
170 | ```python
171 | celltype='CD56 (bright) NK cells' #use a string to assign your cell type
172 | ```
173 | 2. infer GRNs for all cell types.
174 | ```python
175 | celltype='all'
176 | ```
177 | Please make sure that 'all' is not a cell type in your data.
178 | 
179 | #### TF binding potential
180 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential.
181 | ```python
182 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version
183 | ```
184 | 
185 | #### *cis*-regulatory network
186 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
187 | ```python
188 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir)
189 | ```
190 | 
191 | #### *trans*-regulatory network
192 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
193 | ```python
194 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
195 | ```
196 | ## Identify driver regulators by TF activity
197 | ### Instruction
198 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators. 
199 | 
200 | ### Prepare
201 | You can choose a *trans*-regulatory network that matches your data best.
202 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can use the general GRN from various tissues.
203 | ```python
204 | network = 'general'
205 | ```
206 | 2. If your gene expression data are matched with cell population GRN, you can set
207 | ```python
208 | network = 'cell population'
209 | ```
210 | 3. If your gene expression data are matched with certain cell type, you can set network to the name of this cell type.
211 | ```python
212 | network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type
213 | ```
214 | 
215 | ### Calculate TF activity
216 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. Rows of the gene expression data are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
217 | 
218 | ```python
219 | 
220 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/' # this should be the same Datadir used to download the general GRN
221 | GRNdir=Datadir+'data_bulk/'
222 | genome='hg38'
223 | from LingerGRN.TF_activity import *
224 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir
225 | import anndata
226 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad')
227 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
228 | ```
229 | Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'.
230 | ```python
231 | save=True
232 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
233 | ```
234 | <div style="text-align: right">
235 |   <img src="PBMCs_heatmap_activity.png" alt="Image" width="500">
236 | </div>
237 | 
238 | ### Identify driver regulator
239 | We use a t-test on TF activity to find the TFs that are differentially active in a given cell type.
240 | 1. You can assign a certain cell type of the gene expression data by
241 | ```python
242 | celltype='CD56 (bright) NK cells'
243 | ```
244 | 2. Or, you can obtain the result for all cell types.
245 | ```python
246 | celltype='all'
247 | ```
248 | 
249 | For example,
250 | 
251 | ```python
252 | celltype='CD56 (bright) NK cells'
253 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
254 | t_test_results
255 | ```
256 | 
257 | <div style="text-align: right">
258 |   <img src="PBMCs_ttest.png" alt="Image" width="300">
259 | </div>
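
To illustrate the kind of comparison behind this table, the sketch below ranks TFs by a one-versus-rest t statistic on toy activity values. It uses Welch's t statistic as an assumption. Which t-test variant `master_regulator` applies is not stated here, and the names `welch_t` and `rank_tfs` and all values are hypothetical.

```python
import math
import statistics

def welch_t(group, rest):
    """Welch's t statistic for two independent samples with unequal variances."""
    m1, m2 = statistics.mean(group), statistics.mean(rest)
    v1, v2 = statistics.variance(group), statistics.variance(rest)
    return (m1 - m2) / math.sqrt(v1 / len(group) + v2 / len(rest))

# Toy TF activity values per cell, with a cell type label for each cell.
activity = {"TF_X": [2.1, 1.9, 2.3, 0.2, 0.1, 0.3, 0.0],
            "TF_Y": [0.5, 0.4, 0.6, 0.5, 0.6, 0.4, 0.5]}
labels = ["NK", "NK", "NK", "B", "B", "T", "T"]

def rank_tfs(activity, labels, celltype):
    """Rank TFs by t statistic for `celltype` versus all other cells."""
    scores = {}
    for tf, values in activity.items():
        group = [v for v, lab in zip(values, labels) if lab == celltype]
        rest = [v for v, lab in zip(values, labels) if lab != celltype]
        scores[tf] = welch_t(group, rest)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_tfs(activity, labels, "NK")
print(ranking)
```

In this toy data, `TF_X` is much more active in the "NK" cells than elsewhere and ranks first, while `TF_Y` has the same mean activity in both groups and gets a t statistic close to zero.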
260 | 
261 | Visualize the differential activity and expression. You can compare 2 different cell types, or one cell type with all others. To save the box plot to `outdir`, set `save=True`. The output is `box_plot_<TFName>_<datatype>_<celltype1>_<celltype2>.png`.
262 | 
263 | ```python
264 | TFName='ATF1'
265 | datatype='activity'
266 | celltype1='CD56 (bright) NK cells'
267 | celltype2='Others'
268 | save=True
269 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
270 | ```
271 | 
272 | <div style="text-align: right">
273 |   <img src="PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png" alt="Image" width="300">
274 | </div>
275 | 
276 | For gene expression data, the boxplot is:
277 | ```python
278 | datatype='expression'
279 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
280 | ```
281 | 
282 | <div style="text-align: right">
283 |   <img src="box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png" alt="Image" width="300">
284 | </div>
285 | 
286 | ## Note
287 | - The cell type specific GRN is based on the output of the cell population GRN.
288 | - To try both method options, create a separate output directory for each.
289 | 
290 | 
291 | 


--------------------------------------------------------------------------------
/docs/PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_box_plot_ATF1_activity_CD56 (bright) NK cells_Others.png


--------------------------------------------------------------------------------
/docs/PBMCs_heatmap_activity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_heatmap_activity.png


--------------------------------------------------------------------------------
/docs/PBMCs_ttest.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PBMCs_ttest.png


--------------------------------------------------------------------------------
/docs/POU5F1_KO_Diff_Umap_NANOG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Diff_Umap_NANOG.png


--------------------------------------------------------------------------------
/docs/POU5F1_KO_Diff_exp_Umap_NANOG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Diff_exp_Umap_NANOG.png


--------------------------------------------------------------------------------
/docs/POU5F1_KO_Differentiation_Umap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/POU5F1_KO_Differentiation_Umap.png


--------------------------------------------------------------------------------
/docs/PWM.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/PWM.jpg


--------------------------------------------------------------------------------
/docs/RNA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/RNA.png


--------------------------------------------------------------------------------
/docs/RNA_ds.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/RNA_ds.jpg


--------------------------------------------------------------------------------
/docs/S_TG.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/S_TG.png


--------------------------------------------------------------------------------
/docs/TFactivity.md:
--------------------------------------------------------------------------------
 1 | # Identify driver regulators by TF activity
 2 | ## Instruction
 3 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employed LINGER inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimated TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identified driver regulators. 
 4 | 
 5 | ## Prepare
 6 | We need a *trans*-regulatory network. Choose the network that best matches your data.
 7 | 1. If no single-cell data are available to infer the cell population and cell type specific GRNs, you can use the general GRN from various tissues.
 8 | ```python
 9 | network = 'general'
10 | ```
11 | 2. If your gene expression data are matched with cell population GRN, you can set
12 | ```python
13 | network = 'cell population'
14 | ```
15 | 3. If your gene expression data are matched with certain cell type, you can set network to the name of this cell type.
16 | ```python
17 | network = '0' # 0 is the name of one cell type
18 | ```
19 | 
20 | ## Calculate TF activity
21 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other sc or bulk RNA-seq data that match the GRN. Rows of the gene expression data are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
22 | 
23 | ```python
24 | 
25 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/' # this should be the same Datadir used to download the general GRN
26 | GRNdir=Datadir+'data_bulk/'
27 | genome='hg38'
28 | from LingerGRN.TF_activity import *
29 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir
30 | import anndata
31 | adata_RNA=anndata.read_h5ad('data/adata_RNA.h5ad')
32 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
33 | ```
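
For intuition only, the sketch below shows one generic way to score TF activity from expression data and a *trans*-regulatory matrix: a weighted average of z-scored target gene expression. This is an assumption for illustration and not necessarily what `regulon` computes. The function `toy_tf_activity`, the gene and TF names, and all values are hypothetical.

```python
import numpy as np

# Toy expression matrix: rows are genes, columns are samples (as described above).
genes = ["G1", "G2", "G3", "G4"]
expr = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0],
                 [5.0, 5.0, 2.0],
                 [0.0, 1.0, 0.0]])

# Toy trans-regulatory scores: rows are TFs, columns are target genes.
tfs = ["TF_A", "TF_B"]
grn = np.array([[0.9, 0.8, 0.0, 0.0],
                [0.0, 0.0, 0.7, 0.1]])

def toy_tf_activity(expr, grn):
    """Weighted average of z-scored target expression, one row per TF."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    z = (expr - mean) / np.where(std == 0, 1.0, std)  # avoid division by zero
    weights = grn / grn.sum(axis=1, keepdims=True)    # each TF's weights sum to 1
    return weights @ z                                # shape: (n_TFs, n_samples)

activity = toy_tf_activity(expr, grn)
print(activity.round(2))
```

Here the targets of `TF_A` rise from the first to the last sample, so its toy activity rises as well, even though the expression of `TF_A` itself is never used.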
34 | Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'.
35 | ```python
36 | save=True
37 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
38 | ```
39 | <div style="text-align: right">
40 |   <img src="heatmap_activity.png" alt="Image" width="500">
41 | </div>
42 | 
43 | ## Identify driver regulator
44 | We use a t-test on TF activity to find the TFs that are differentially active in a given cell type.
45 | 1. You can assign a certain cell type of the gene expression data by
46 | ```python
47 | celltype='0'
48 | ```
49 | 2. Or, you can obtain the result for all cell types.
50 | ```python
51 | celltype='all'
52 | ```
53 | 
54 | For example,
55 | 
56 | ```python
57 | celltype='0'
58 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
59 | t_test_results
60 | ```
61 | 
62 | <div style="text-align: right">
63 |   <img src="ttest.png" alt="Image" width="300">
64 | </div>
65 | 
66 | Visualize the differential activity and expression. You can compare 2 different cell types, or one cell type with all other cell types. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.
67 | 
68 | ```python
69 | TFName='ATF1'
70 | datatype='activity'
71 | celltype1='0'
72 | celltype2='Others'
73 | save=True
74 | box_comp(TFName,labels,celltype1,celltype2,datatype,RNA_file,TF_activity,save,outdir)
75 | ```
76 | 
77 | <div style="text-align: right">
78 |   <img src="box_plot_ATF1_activity_0_Others.png" alt="Image" width="300">
79 | </div>
80 | 
81 | For gene expression data, the boxplot is:
82 | ```python
83 | datatype='expression'
84 | box_comp(TFName,labels,celltype1,celltype2,datatype,RNA_file,TF_activity,save,outdir)
85 | ```
86 | 
87 | <div style="text-align: right">
88 |   <img src="box_plot_ATF1_expression_0_Others.png" alt="Image" width="300">
89 | </div>
90 | 


--------------------------------------------------------------------------------
/docs/User_guide.md:
--------------------------------------------------------------------------------
 1 | # LINGER
 2 | LINGER (LIfelong neural Network for GEne Regulation) is an interpretable artificial intelligence model designed to analyze sc-multiome data (RNA-seq count and chromatin accessibility count data) that can subsequently be used for many downstream tasks.
 3 | 
 4 | The advantages of LINGER are:
 5 | - incorporate large-scale external bulk data and neural networks
 6 | - integrate TF-RE motif matching knowledge 
 7 | - high accuracy of gene regulatory network (GRN) inference  
 8 | - enable the estimation of TF activity solely from gene expression data
 9 | 
10 | ## Method overview
11 | LINGER uses neural network architectures to model transcriptional regulation, predicting target gene (TG) expression levels from transcription factor (TF) and regulatory element (RE) profiles. The loss function is composed of 3 parts:
12 | - Prediction accuracy: mean squared error (MSE) with L1 regularization
13 | - Elastic weight consolidation (EWC) loss for lifelong learning, to incorporate large-scale external bulk data
14 | - Manifold regularization, to integrate prior biological understanding of TF-RE interactions
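
As a rough sketch of how three such terms can be combined, the function below adds an MSE term with an L1 penalty, an EWC penalty, and a graph-Laplacian manifold penalty in their generic textbook forms. The exact formulation and weighting used by LINGER are not given in this guide, so the function `toy_loss`, its argument names, and the `lam_*` weights are all assumptions.

```python
import numpy as np

def toy_loss(y_true, y_pred, theta, theta_bulk, fisher, w_in, laplacian,
             lam_l1=0.01, lam_ewc=0.1, lam_manifold=0.1):
    """Sum of the three parts listed above, in generic textbook forms."""
    mse = np.mean((y_true - y_pred) ** 2)
    l1 = lam_l1 * np.sum(np.abs(theta))
    # EWC: penalize moving away from the bulk-trained parameters,
    # weighted by a diagonal Fisher information estimate.
    ewc = 0.5 * lam_ewc * np.sum(fisher * (theta - theta_bulk) ** 2)
    # Manifold regularization: trace(W^T L W) is small when features linked
    # in the prior graph have similar input-layer weights.
    manifold = lam_manifold * np.trace(w_in.T @ laplacian @ w_in)
    return mse + l1 + ewc + manifold

y = np.array([1.0, 2.0])
theta = np.array([1.0, -1.0])
fisher = np.ones(2)

# Prior graph with one edge between feature 0 and feature 1: L = D - A.
adjacency = np.array([[0.0, 1.0], [1.0, 0.0]])
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

w_same = np.array([[1.0, 0.0], [1.0, 0.0]])   # linked features share weights
w_diff = np.array([[1.0, 0.0], [0.0, 0.0]])   # linked features differ

loss_same = toy_loss(y, y, theta, theta, fisher, w_same, laplacian)
loss_diff = toy_loss(y, y, theta, theta, fisher, w_diff, laplacian)
print(loss_same, loss_diff)
```

With perfect predictions and parameters equal to the bulk-trained ones, only the L1 term remains in `loss_same`. `loss_diff` is larger because the two linked features have different weights.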
15 | 
16 | <div style="text-align: right">
17 |   <img src="LINGER.png" alt="Image" width="400">
18 | </div>
19 | 
20 | 
21 | ## Neural network model structure
22 | LINGER trains individual models for each gene using a neural network architecture that includes a single input layer and two fully connected hidden layers. The input layer has dimension equal to the number of features, containing all TFs and REs within 1Mb from the transcription start site (TSS) for the gene to be predicted. The first hidden layer has 64 neurons with rectified linear unit (ReLU) activation. The second hidden layer has 16 neurons with ReLU activation. The output layer is a single neuron, which outputs a real value for gene expression prediction.
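
The layer sizes described above can be sketched as a plain forward pass. This is a minimal NumPy illustration with random weights, not LINGER's PyTorch implementation. The helper names and `n_features = 30` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, params):
    """Input -> 64 (ReLU) -> 16 (ReLU) -> 1 (linear output)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(x @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    return h2 @ w3 + b3

n_features = 30   # hypothetical: number of TFs plus REs near the gene's TSS
params = [init_layer(n_features, 64), init_layer(64, 16), init_layer(16, 1)]

x = rng.normal(size=(5, n_features))   # 5 toy samples
y_hat = forward(x, params)
print(y_hat.shape)
```

Each target gene gets its own set of `params`, since one model is trained per gene and the input dimension depends on how many TFs and REs are relevant to that gene.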
23 | 
24 | ## Tasks
25 | ### Cell population gene regulatory network inference
26 | We use the average of the absolute Shapley values across samples to infer the regulation strength of TFs and REs on target genes, generating the RE-TG *cis*-regulatory strength and the TF-TG *trans*-regulatory strength. To generate the TF-RE binding strength, we use the weights from the input layer (TFs and REs) to all nodes in the second layer of the NN model as the embedding of the TF or RE. The TF-RE binding strength is calculated as the PCC between the TF and RE embeddings.
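
The two summaries described above can be sketched on toy numbers. The Shapley values and embeddings below are made up rather than computed from a trained model, and the feature names are hypothetical.

```python
import numpy as np

# Toy Shapley values for one target gene: rows are samples, columns are features.
features = ["TF_A", "TF_B", "RE_1"]
shap_values = np.array([[ 0.4, -0.1,  0.2],
                        [-0.6,  0.1,  0.3],
                        [ 0.5,  0.0, -0.1]])

# Regulation strength: mean absolute Shapley value across samples.
strength = np.abs(shap_values).mean(axis=0)
print(dict(zip(features, strength.round(3))))

# Toy embeddings: each row holds one feature's weights to the hidden-layer nodes.
embedding = np.array([[ 0.9, -0.2,  0.4,  0.1],   # TF_A
                      [-0.5,  0.3, -0.1,  0.0],   # TF_B
                      [ 0.8, -0.1,  0.5,  0.2]])  # RE_1

# Pearson correlation (PCC) between every pair of embedding rows.
pcc = np.corrcoef(embedding)
tf_re_binding = {tf: pcc[i, 2] for i, tf in enumerate(features[:2])}
print(tf_re_binding)
```

Taking the absolute value before averaging keeps strong effects from cancelling out when their sign varies across samples, as for `TF_A` in the toy data.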
27 | ### Cell type specific gene regulatory network
28 | After constructing the GRNs for the cell population, we infer the cell type specific GRNs using a feature engineering approach. As shown in the following figure, we combine the single-cell data ($O, E$, and $C$ in the figure) and the prior gene regulatory network structure with the parameters $\alpha,\beta,d,B$, and $\gamma$.
29 | 
30 | ![Image Alt Text](feature_engineering.jpg)
31 | 
32 | ### Benchmark gene regulatory network
33 | We systematically assess the performance of the GRN using two metrics: AUC and AUPR ratio.
34 | - *trans*: the ground truth include and knock down
35 | - *cis*: the ground truth is HiC and eQTL
36 | - TF-RE: the ground truth is ChIP-seq data from
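
A small sketch of the two metrics on toy predictions follows. It assumes that the AUPR ratio means the AUPR divided by the fraction of positives (the expected AUPR of a random predictor), which is a common convention but is not defined in this guide. AUPR is estimated here by average precision, and the scores and labels are made up.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Area under the precision-recall curve, estimated as average precision."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def aupr_ratio(scores, labels):
    """AUPR relative to the fraction of positives (assumed definition)."""
    baseline = sum(labels) / len(labels)
    return average_precision(scores, labels) / baseline

# Toy predicted regulatory scores and ground-truth edges (1 = true interaction).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
truth = [1, 0, 1, 0, 0, 0]

auc = auroc(scores, truth)
ratio = aupr_ratio(scores, truth)
print(auc, ratio)
```

A ratio above 1 means the ranking recovers true edges better than random ordering would at the same positive rate, which makes results comparable across ground-truth sets of different density.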
37 | ### Transcription factor activity
38 | Assuming the GRN structure is consistent across individuals, we employ LINGER-inferred GRNs from sc-multiome data of a single individual to estimate TF activity using gene expression data alone, from the same or other individuals. By comparing TF activity between cases and controls, we identify driver regulators.
39 | ### In silico perturbation
40 | We predict gene expression when one TF, or several TFs together, are knocked out. We can then predict how the target genes change after the perturbation.
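
The idea can be sketched with a toy predictor: set the expression of the knocked-out TF to zero, rerun the prediction, and compare with the unperturbed prediction. The linear `predict` function below is a hypothetical stand-in for a trained model, and all names and values are made up.

```python
import numpy as np

def predict(tf_expr, weights, bias):
    """Toy stand-in for a trained model: linear map from TF to TG expression."""
    return tf_expr @ weights + bias

tfs = ["TF_A", "TF_B", "TF_C"]
targets = ["TG_1", "TG_2"]
weights = np.array([[ 1.0, 0.0],    # TF_A activates TG_1
                    [ 0.5, 2.0],    # TF_B activates both targets
                    [-1.0, 0.0]])   # TF_C represses TG_1
bias = np.array([0.1, 0.2])

tf_expr = np.array([[2.0, 1.0, 0.5]])   # one toy sample

def knock_out(tf_expr, tf_names, knocked):
    """Return a copy with the expression of the knocked-out TFs set to zero."""
    perturbed = tf_expr.copy()
    idx = [tf_names.index(name) for name in knocked]
    perturbed[:, idx] = 0.0
    return perturbed

baseline = predict(tf_expr, weights, bias)
after_ko = predict(knock_out(tf_expr, tfs, ["TF_B"]), weights, bias)
delta = after_ko - baseline
print(dict(zip(targets, delta[0])))
```

Passing several names in `knocked` simulates knocking out several TFs together. The original expression matrix is left unchanged so that different perturbations can be compared against the same baseline.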
41 | ## Tutorials
42 | - [PBMCs](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md)
43 | - [H1 cell line](https://github.com/Durenlab/LINGER/blob/main/docs/GRN_infer.md)
44 | - [Non-human species](https://github.com/Durenlab/LINGER/blob/main/docs/scNN.md)
45 | 


--------------------------------------------------------------------------------
/docs/adata_ATAC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/adata_ATAC.png


--------------------------------------------------------------------------------
/docs/adata_RNA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/adata_RNA.png


--------------------------------------------------------------------------------
/docs/barcode_mm10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/barcode_mm10.png


--------------------------------------------------------------------------------
/docs/box_plot_ATF1_activity_0_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_activity_0_Others.png


--------------------------------------------------------------------------------
/docs/box_plot_ATF1_expression_0_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_expression_0_Others.png


--------------------------------------------------------------------------------
/docs/box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_ATF1_expression_CD56 (bright) NK cells_Others.png


--------------------------------------------------------------------------------
/docs/box_plot_Erg_activity_1_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_Erg_activity_1_Others.png


--------------------------------------------------------------------------------
/docs/box_plot_Erg_expression_1_Others.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/box_plot_Erg_expression_1_Others.png


--------------------------------------------------------------------------------
/docs/downstream.md:
--------------------------------------------------------------------------------
  1 | # Downstream analysis - Module detection
  2 | ## Regulatory Module
  3 | For this analysis, we first detect key TF-TG subnetworks (modules) from the cell population TF–TG trans-regulation. Then, we identify the differential regulatory modules that are differentially expressed between the case and control groups.
  4 | ### Detect Module
  5 | #### Input
  6 | - Pseudobulk gene expression: [TG_pseudobulk]. Please make sure batch effects have been removed from the data.
  7 | - Metadata with case/control status in column 'group' and cell type annotation in column 'celltype': [metadata]. Note that case is 1 and control is 0.
  8 | - LINGER outdir containing the trans-regulatory network file 'cell_population_trans_regulatory.txt'.
  9 | - GWAS data file (optional).
 10 | 
 11 | This is an example of the input.
 12 | ```python
 13 | import pandas as pd
 14 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0)
 15 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove mitochondrial genes; if the species is mouse, replace 'MT-' with 'mt-'
 16 | import scanpy as sc
 17 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad')
 18 | label_all = adata_RNA.obs[['barcode','sample','label']]
 19 | label_all.index = label_all['barcode']
 20 | metadata = label_all.loc[TG_pseudobulk.columns]
 21 | metadata.columns = ['barcode','group','celltype']
 22 | outdir = 'output/'
 23 | GWASfile = ['AUD_gene.txt','AUD_gene2.txt'] # optional: each GWAS file is a gene list with no header
 24 | ```
 25 | ```python
 26 | TG_pseudobulk
 27 | ```
 28 | <div style="text-align: right">
 29 |   <img src="RNA_ds.jpg" alt="Image" width="600">
 30 | </div>
 31 | 
 32 | ```python
 33 | metadata 
 34 | ```
 35 | <div style="text-align: right">
 36 |   <img src="metadata_ds.jpg" alt="Image" width="300">
 37 | </div>
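
Before running module detection, it can help to confirm that the inputs have the layout described above. The following is a minimal sketch on toy data, not part of LINGER: `check_inputs` is a hypothetical helper, and the gene and sample names are made up.

```python
import pandas as pd

# Toy pseudobulk matrix: genes x pseudobulk samples (all names are made up).
TG_pseudobulk = pd.DataFrame(
    [[1.0, 2.0, 0.5, 0.0], [0.0, 3.0, 1.5, 2.0]],
    index=["GENE_A", "GENE_B"],
    columns=["s1", "s2", "s3", "s4"],
)
# Toy metadata: one row per pseudobulk sample, in the same order as the columns above.
metadata = pd.DataFrame(
    {
        "barcode": ["s1", "s2", "s3", "s4"],
        "group": [0, 1, 0, 1],          # 1 = case, 0 = control
        "celltype": ["B", "B", "T", "T"],
    },
    index=["s1", "s2", "s3", "s4"],
)

def check_inputs(expr, meta):
    """Hypothetical helper: return a list of problems found in the inputs."""
    problems = []
    for col in ("group", "celltype"):
        if col not in meta.columns:
            problems.append(f"missing column: {col}")
    if "group" in meta.columns and not set(meta["group"]).issubset({0, 1}):
        problems.append("group must be 0 (control) or 1 (case)")
    if list(meta.index) != list(expr.columns):
        problems.append("metadata rows do not match expression columns")
    return problems

problems = check_inputs(TG_pseudobulk, metadata)
print(problems)  # an empty list means the layout looks consistent
```

The same helper can be called on the real `TG_pseudobulk` and `metadata` objects created above.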
 38 | 
 39 | #### Output
 40 | ```python
 41 | K=10 # K is the number of modules, a tuning parameter
 42 | from LingerGRN import Compare
 43 | Module_result=Compare.Module_trans(outdir,metadata,TG_pseudobulk,K,GWASfile)
 44 | ```
 45 | The output is a Module_result object with three items:
 46 | - S_TG, the module assigned to each gene;
 47 | - pvalue_all, the p-value of the differential module t-test comparing the case and control groups;
 48 | - tvalue_all, the t-value of the t-test; a positive value means the module is more active in group 1 (case), and a negative value means it is more active in group 0 (control).
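
To make the sign convention concrete, here is a minimal sketch of a two-sample t-test on simulated module activity. It uses `scipy.stats.ttest_ind` on made-up numbers and only illustrates the interpretation of the t-value; it is not LINGER's internal implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated activity of one module in one cell type (made-up values).
case = rng.normal(loc=1.0, scale=0.5, size=20)     # group 1
control = rng.normal(loc=0.0, scale=0.5, size=20)  # group 0

# Case is passed first, so a positive t-value means "more active in case".
t_value, p_value = stats.ttest_ind(case, control)
print(t_value > 0, p_value < 0.05)

# Swapping the two groups flips the sign of the t-value but not the p-value.
t_swapped, p_swapped = stats.ttest_ind(control, case)
```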
 49 | ```python
 50 | Module_result.S_TG
 51 | ```
 52 | <div style="text-align: right">
 53 |   <img src="S_TG.png" alt="Image" width="100">
 54 | </div>
 55 | 
 56 | ```python
 57 | Module_result.pvalue_all
 58 | ```
 59 | <div style="text-align: right">
 60 |   <img src="pvalue_all.png" alt="Image" width="400">
 61 | </div>
 62 | 
 63 | ```python
 64 | Module_result.tvalue_all
 65 | ```
 66 | <div style="text-align: right">
 67 |   <img src="tvalue_all.png" alt="Image" width="400">
 68 | </div>
 69 | 
 70 | Save the result to files.
 71 | ```python
 72 | Module_result.pvalue_all.to_csv('pvalue_all.txt',sep='\t')
 73 | Module_result.tvalue_all.to_csv('tvalue_all.txt',sep='\t')
 74 | temp=Module_result.S_TG
 75 | temp=temp[temp['Module']>0]
 76 | temp.to_csv('Module.txt',sep='\t')
 77 | ```
 78 | #### Visualize
 79 | Please ensure that the R packages ggplot2, grid, tidyr, and egg are installed. Note that 'cutoff' is a parameter that sets the threshold on -log10(p-value). We suggest 'cutoff = 2' as a default.
 80 | ```python
 81 | # Import the rpy2 components needed
 82 | import os
 83 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R'  # Replace with your actual R home path
 84 | import rpy2.robjects as robjects
 85 | from rpy2.robjects import r
 86 | # Import the R plotting package (ggplot2 as an example)
 87 | r('library(ggplot2)')
 88 | r('library(grid)')
 89 | # Create data in R environment through Python
 90 | r('''
 91 | library(tidyr)
 92 | dataP=read.table('pvalue_all.txt',sep='\t',header=TRUE,row.names=1)
 93 | dataT=read.table('tvalue_all.txt',sep='\t',header=TRUE,row.names=1)
 94 | dataP=-log10(dataP)
 95 | dataP$TF=rownames(dataP)
 96 | dataT$TF=rownames(dataT)
 97 | longdiff0 <- gather(dataP, sample, value,-TF)
 98 | longdiff1 <- gather(dataT, sample, value,-TF)
 99 | colnames(longdiff1)=c('TF','celltype','T')
100 | longdiff1$P=longdiff0$value
101 | longdiff1$TF=factor(longdiff1$TF,levels=rev(longdiff1$TF[1:10]))
102 | print(longdiff1[1:10,])
103 | ''')
104 | # R code for plotting using ggplot
105 | r('''
106 | cutoff=1 # cutoff of -log10(p-value); adjust here
107 | maxp=ceiling(max(longdiff1$P))
108 | limits0=c(cutoff,maxp)
109 | range0=c(1,(maxp-cutoff+1))*4/(maxp-cutoff+1)
110 | breaks0=(cutoff+1):(maxp-1)
111 | print(limits0)
112 | print(range0)
113 | print(breaks0)
114 | ''')
115 | 
116 | # Build the bubble plot; it is written to a PDF below
117 | r('''
118 | library(ggplot2)
119 | library(egg)
120 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
121 | geom_point(aes(size = P, fill = T), alpha = 1, shape = 21) + 
122 |   scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) + 
123 |   labs( x= "cell type", y = "Module", fill = "")  + theme_article()+
124 |   theme(legend.key=element_blank(), 
125 |   axis.text.x = element_text(colour = "black", size = 9, face = "bold", angle = 90, vjust = 0.3, hjust = 1), 
126 |   axis.text.y = element_text(colour = "black", face = "bold", size = 11), 
127 |   legend.text = element_text(size = 9, face ="bold", colour ="black"), 
128 |   legend.title = element_text(size = 9, face = "bold"), 
129 |   panel.background = element_blank(), panel.border = element_rect(colour = "black", fill = NA, size = 1.2), 
130 |   legend.position = "right") +  
131 |   scale_fill_gradient2(midpoint=0, low="blue", mid="white",
132 |                      high="red", space ="Lab" )
133 | ''')
134 | r("pdf('module_result.pdf',width=1.5+dim(dataP)[2]/3,height=3)")# change the height and the width of the figure
135 | r('print(p)')
136 | r('dev.off()')
137 | ```
138 | The figure is saved to module_result.pdf.
139 | <div style="text-align: right">
140 |   <img src="module_result.png" alt="Image" width="300">
141 | </div>
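
The same cutoff can also be applied in Python to list the significant module and cell type pairs without plotting. This is a minimal sketch on toy tables that mimic the shape of `pvalue_all` and `tvalue_all` (modules in rows, cell types in columns); the values are made up. A cutoff of 2 on -log10(p-value) corresponds to p = 0.01.

```python
import numpy as np
import pandas as pd

# Toy tables: rows are modules, columns are cell types (values are made up).
pvalue_all = pd.DataFrame({"B": [0.5, 0.0001], "T": [0.02, 0.2]}, index=["M1", "M2"])
tvalue_all = pd.DataFrame({"B": [0.7, 4.1], "T": [-2.5, 1.3]}, index=["M1", "M2"])

cutoff = 2  # threshold on -log10(p-value), i.e. p <= 0.01
neg_log_p = -np.log10(pvalue_all)

hits = []
for module in pvalue_all.index:
    for celltype in pvalue_all.columns:
        if neg_log_p.loc[module, celltype] >= cutoff:
            # Positive t-value: more active in case; negative: more active in control.
            direction = "case" if tvalue_all.loc[module, celltype] > 0 else "control"
            hits.append((module, celltype, direction))

print(hits)
```

With the real results, replace the toy tables with `Module_result.pvalue_all` and `Module_result.tvalue_all`.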
142 | 
143 | 


--------------------------------------------------------------------------------
/docs/driver.md:
--------------------------------------------------------------------------------
  1 | # Downstream analysis - TF driver score
  2 | ## Driver Score
  3 | We identify driver TFs underlying epigenetic and transcriptomic changes between control and case using a correlation model. We normalize the GRN and then calculate the Pearson Correlation Coefficient (PCC) between the fold change in expression or chromatin accessibility and the regulatory strength of TGs or REs for each TF.
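
The correlation idea can be sketched on toy numbers as follows. This is an illustration only, assuming one fold-change value per target gene and one regulatory-strength value per TF and target gene; the GRN normalization step is omitted, and the arrays are made up rather than taken from LINGER output.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up log fold change (case vs control) for six target genes.
fold_change = np.array([1.2, 0.8, -0.5, 0.1, -1.0, 0.6])

# Made-up regulatory strength of two TFs on the same six genes.
strength_tf1 = np.array([0.9, 0.7, -0.2, 0.0, -0.8, 0.5])  # tracks the fold change
strength_tf2 = -strength_tf1                               # opposite pattern

# The driver score of a TF is the PCC between fold change and its regulatory strength.
r1, p1 = pearsonr(fold_change, strength_tf1)
r2, p2 = pearsonr(fold_change, strength_tf2)
print(round(r1, 3), round(r2, 3))
```

A strongly positive PCC suggests that the TF's targets go up in the case group, and a strongly negative PCC suggests the opposite.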
  4 | ### Prerequisites
  5 | Please complete the following tutorials first:
  6 | - [PBMCs tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/PBMC.md)
  7 | - [Downstream analysis-Module detection](https://github.com/Durenlab/LINGER/blob/main/docs/downstream.md)
  8 | ### Transcriptomics driver score
  9 | ```python
 10 | import pandas as pd
 11 | TG_pseudobulk = pd.read_csv('data/TG_pseudobulk.tsv',sep=',',header=0,index_col=0)
 12 | TG_pseudobulk = TG_pseudobulk[~TG_pseudobulk.index.str.startswith('MT-')] # remove mitochondrial genes; if the species is mouse, replace 'MT-' with 'mt-'
 13 | import scanpy as sc
 14 | adata_RNA = sc.read_h5ad('data/adata_RNA.h5ad')
 15 | label_all = adata_RNA.obs[['barcode','sample','label']]
 16 | label_all.index = label_all['barcode']
 17 | metadata = label_all.loc[TG_pseudobulk.columns]
 18 | metadata.columns = ['barcode','group','celltype']
 19 | GRN='trans_regulatory'
 20 | adjust_method='bonferroni' 
 21 | corr_method='pearsonr'
 22 | outdir = 'output/' # LINGER output directory; assumed to be the same as in the module detection tutorial
 23 | from LingerGRN import Compare
 24 | C_result_RNA_sp,P_result_RNA_sp,Q_result_RNA_sp=Compare.driver_score(TG_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method)
 25 | K=3 # We choose the top K positive and negative TFs to save to the txt file for visualization purposes.
 26 | C_result_RNA_sp_r,Q_result_RNA_sp_r=Compare.driver_result(C_result_RNA_sp,Q_result_RNA_sp,K)
 27 | C_result_RNA_sp_r.to_csv('C_result_RNA_sp_r.txt',sep='\t')
 28 | Q_result_RNA_sp_r.to_csv('Q_result_RNA_sp_r.txt',sep='\t')
 29 | ```
 30 | adjust_method is the p-value adjustment method. Choose one of the following:
 31 | - bonferroni : one-step correction
 32 | - sidak : one-step correction
 33 | - holm-sidak : step down method using Sidak adjustments
 34 | - holm : step-down method using Bonferroni adjustments
 35 | - simes-hochberg : step-up method (independent)
 36 | - hommel : closed method based on Simes tests (non-negative)
 37 | - fdr_bh : Benjamini/Hochberg (non-negative)
 38 | - fdr_by : Benjamini/Yekutieli (negative)
 39 | - fdr_tsbh : two stage fdr correction (non-negative)
 40 | - fdr_tsbky : two stage fdr correction (non-negative)
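
The option names above appear to follow the `method` argument of `multipletests` in statsmodels; that LINGER passes them through unchanged is an assumption. To show what two of these corrections do, here is a minimal NumPy sketch of the Bonferroni and Benjamini/Hochberg adjustments on made-up p-values. The helper functions are hypothetical and written only for illustration.

```python
import numpy as np

def bonferroni(pvals):
    """One-step correction: multiply each p-value by the number of tests."""
    p = np.asarray(pvals, dtype=float)
    return np.clip(p * p.size, 0, 1)

def fdr_bh(pvals):
    """Benjamini/Hochberg step-up adjustment."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(ranked, 0, 1)
    return q

pvals = [0.001, 0.01, 0.03, 0.2]  # made-up raw p-values
q_bonf = bonferroni(pvals)
q_bh = fdr_bh(pvals)
print(q_bonf, q_bh)
```

Bonferroni is the more conservative of the two, so it typically reports fewer driver TFs at the same significance level.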
 41 | ### Visualize
 42 | ```python
 43 | import os
 44 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R'  # Replace with your actual R home path
 45 | import rpy2.robjects as robjects
 46 | from rpy2.robjects import r
 47 | # Import the R plotting package (ggplot2 as an example)
 48 | r('library(ggplot2)')
 49 | r('library(grid)')
 50 | # Create data in R environment through Python
 51 | r('''
 52 | dataP=read.table('Q_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE)
 53 | dataT=read.table('C_result_RNA_sp_r.txt',sep='\t',row.names=1,header=TRUE)
 54 | sort_TF=rownames(dataT)
 55 | library(tidyr)
 56 | dataP=-log10(dataP)
 57 | print(paste0('maximum of -log10P:',max(dataP)))
 58 | dataP[dataP>40]=40
 59 | dataP1=dataP
 60 | dataP1$TF=rownames(dataP)
 61 | longdiff0 <- gather(dataP1, sample, value,-TF)
 62 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ]
 63 | dataT1=dataT
 64 | dataT1$TF=rownames(dataT)
 65 | longdiff1=gather(dataT1, sample, value,-TF)
 66 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ]
 67 | colnames(longdiff1)=c('TF','celltype','PCC')
 68 | longdiff1$P=longdiff0_s$value
 69 | longdiff1$TF=factor(longdiff1$TF,levels=rev(sort_TF))
 70 | library(egg)
 71 | limits0=c(2,ceiling(max(dataP)))
 72 | range0 = c(1,4)
 73 | breaks0 = seq(2,ceiling(max(dataP)),length.out=5)
 74 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
 75 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) + 
 76 |   scale_size_continuous(limits = c(4, 40), range = c(1,5), breaks = c(4,10,20,30,40)) + 
 77 |   labs( x= "cell type", y = "TF", fill = "")  + theme_article()+
 78 |   theme(legend.key=element_blank(), 
 79 |   axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1), 
 80 |   legend.position = "right") + 
 81 |   scale_fill_gradient2(midpoint=0, low="blue", mid="white",
 82 |                      high="red", space ="Lab" )
 83 | 
 84 | 
 85 | ''')
 86 | r('''
 87 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1)
 88 | library(pheatmap)
 89 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ])
 90 | colnames(anno1)=c('TG')
 91 | rownames(anno1)=rownames(dataP)
 92 | anno1[is.na(anno1)]=0
 93 | anno1[,1]=paste0('M',anno1[,1])
 94 | anno1$name=rownames(anno1)
 95 | longdiff0 <- gather(anno1, sample, value,-name)
 96 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1))
 97 | library("RColorBrewer")
 98 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb',
 99 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d')
100 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d')
101 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) +
102 |   geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) +
103 |   theme_article()+theme(text = element_text(size = 9), legend.position = "left")
104 |   ''')
105 | r('''
106 | widths <- c(4.5, 4.5+dim(dataP)[2]) 
107 | print(unique(longdiff0$value))
108 | pdf('driver_trans.pdf',width=6/16*(9+dim(dataP)[2]),height= dim(anno1)[1]/10+0.5)
109 | #print(heatmap_plot)
110 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths)
111 | dev.off()
112 | ''')
113 | ```
114 | The figure is saved to driver_trans.pdf.
115 | <div style="text-align: right">
116 |   <img src="driver_trans.png" alt="Image" width="400">
117 | </div>
118 | 
119 | ### Epigenetic driver score
120 | ```python
121 | RE_pseudobulk=pd.read_csv('data/RE_pseudobulk.tsv',sep=',',header=0,index_col=0)
122 | K=5
123 | GRN='TF_RE_binding'
124 | adjust_method='bonferroni'
125 | corr_method='pearsonr'
126 | C_result_RE,P_result_RE,Q_result_RE=Compare.driver_score(RE_pseudobulk,metadata,GRN,outdir,adjust_method,corr_method)
127 | C_result_RE_r,Q_result_RE_r=Compare.driver_result(C_result_RE,Q_result_RE,K)
128 | C_result_RE_r.to_csv('C_result_RE_r.txt',sep='\t')
129 | Q_result_RE_r.to_csv('Q_result_RE_r.txt',sep='\t')
130 | ```
131 | ### Visualization
132 | ```python
133 | import os
134 | os.environ['R_HOME'] = '/data2/duren_lab/Kaya/conda_envs/LINGER/lib/R'  # Replace with your actual R home path
135 | import rpy2.robjects as robjects
136 | from rpy2.robjects import r
137 | # Import the R plotting package (ggplot2 as an example)
138 | r('library(ggplot2)')
139 | r('library(grid)')
140 | # Create data in R environment through Python
141 | r('''
142 | dataP=read.table('Q_result_RE_r.txt',sep='\t',row.names=1,header=TRUE)
143 | dataT=read.table('C_result_RE_r.txt',sep='\t',row.names=1,header=TRUE)
144 | sort_TF=rownames(dataT)
145 | library(tidyr)
146 | dataP=-log10(dataP)
147 | print(paste0('maximum of -log10P:',max(dataP)))
148 | maxP=100
149 | dataP[dataP>100]=100
150 | dataP1=dataP
151 | dataP1$TF=rownames(dataP)
152 | longdiff0 <- gather(dataP1, sample, value,-TF)
153 | longdiff0_s <- longdiff0[order(longdiff0$TF, longdiff0$sample), ]
154 | dataT1=dataT
155 | dataT1$TF=rownames(dataT)
156 | longdiff1=gather(dataT1, sample, value,-TF)
157 | longdiff1=longdiff1[order(longdiff1$TF, longdiff1$sample), ]
158 | colnames(longdiff1)=c('TF','celltype','PCC')
159 | longdiff1$P=longdiff0_s$value
160 | longdiff1$TF=factor(longdiff1$TF,levels=(sort_TF))
161 | library(egg)
162 | cutoff=2
163 | maxp=ceiling(max(dataP))
164 | print(maxp)
165 | limits0=c(cutoff,maxp)
166 | print(limits0)
167 | range0 = c(1,4)
168 | numbreak=5
169 | d=ceiling((maxp-cutoff)/(numbreak-1))
170 | print(d)
171 | breaks0= seq(from = cutoff, to = cutoff+d*(numbreak-1),  length.out=numbreak)
172 | 
173 | print(range0)
174 | print(breaks0)
175 | p=ggplot(longdiff1,aes(x = celltype, y = TF))+
176 | geom_point(aes(size = P, fill = PCC), alpha = 1, shape = 21) + 
177 |   scale_size_continuous(limits = limits0, range = range0, breaks = breaks0) + 
178 |   labs( x= "cell type", y = "TF", fill = "")  + theme_article()+
179 |   theme(legend.key=element_blank(), 
180 |   axis.text.x = element_text( size = 9, face = "bold", angle = 0, vjust = 0.3, hjust = 1), 
181 |   legend.position = "right") + 
182 |   scale_fill_gradient2(midpoint=0, low="blue", mid="white",
183 |                      high="red", space ="Lab" )
184 | 
185 | 
186 | ''')
187 | r('''
188 | annotation_row=read.table('Module.txt',sep='\t',header=TRUE,row.names=1)
189 | library(pheatmap)
190 | anno1=data.frame(annotation_row[match(rownames(dataP), rownames(annotation_row)), ])
191 | colnames(anno1)=c('TG')
192 | rownames(anno1)=rownames(dataP)
193 | anno1[is.na(anno1)]=0
194 | anno1[,1]=paste0('M',anno1[,1])
195 | anno1$name=rownames(anno1)
196 | longdiff0 <- gather(anno1, sample, value,-name)
197 | longdiff0$name=factor(longdiff0$name,levels=rownames(anno1))
198 | library("RColorBrewer")
199 | ann_colors = c('M0'="gray", 'M1'='#ffe901','M2'="#be3223",'M3'='#098ec4','M4'='#ffe901','M5'='#f8c9cb','M6'='#f8c9cb',
200 | 'M7'='#b2d68c','M8'='#f2f1f6','M9'='#c7a7d2','M10'='#fcba5d')
201 | #ann_colors = list("gray", '#ffe901',"#be3223",'#098ec4','#ffe901','#f8c9cb','#f8c9cb','#b2d68c','#f2f1f6','#c7a7d2','#fcba5d')
202 | heatmap_plot <- ggplot(longdiff0, aes(x = sample, y =name , fill = value)) +
203 |   geom_tile(width = 0.9, height = 0.9) + scale_fill_manual(values=ann_colors) +
204 |   theme_article()+theme(text = element_text(size = 9), legend.position = "left",
205 |  )
206 |   ''')
207 | r('''
208 | widths <- c(4.5, 4.5+dim(dataP)[2]) 
209 | print(unique(longdiff0$value))
210 | pdf('driver_epi.pdf',width=6/16*(9+dim(dataP)[2]),height= dim(anno1)[1]/10+1.5)
211 | #print(p)
212 | grid.arrange(heatmap_plot, p, ncol = 2,widths = widths)
213 | dev.off()
214 | ''')
215 | ```
216 | The figure is saved to driver_epi.pdf.
217 | <div style="text-align: right">
218 |   <img src="driver_epi.png" alt="Image" width="400">
219 | </div>
220 | 


--------------------------------------------------------------------------------
/docs/driver_epi.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/driver_epi.png


--------------------------------------------------------------------------------
/docs/driver_trans.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/driver_trans.png


--------------------------------------------------------------------------------
/docs/feature_engineering.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/feature_engineering.jpg


--------------------------------------------------------------------------------
/docs/feature_engineering.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/feature_engineering.png


--------------------------------------------------------------------------------
/docs/genomemap.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/genomemap.jpg


--------------------------------------------------------------------------------
/docs/h5_input.md:
--------------------------------------------------------------------------------
 1 | # h5ad file as input
 2 | ## Case 1. 10x filtered feature barcode matrix
 3 | ### Download the h5 file
 4 | ```sh
 5 | wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
 6 | ```
 7 | ### get the input data for LINGER
 8 | ```python
 9 | import scanpy as sc
10 | adata = sc.read_10x_h5('pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5', gex_only=False)
11 | import scipy.sparse as sp
12 | import pandas as pd
13 | matrix=adata.X.T
14 | adata.var['gene_ids']=adata.var.index
15 | features=pd.DataFrame(adata.var['gene_ids'].values.tolist(),columns=[1])
16 | features[2]=adata.var['feature_types'].values
17 | barcodes=pd.DataFrame(adata.obs_names,columns=[0])
18 | from LingerGRN.preprocess import *
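19 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label) # label is the cell annotation table, which must be loaded first (for example as in case 2); adata_RNA and adata_ATAC are the scRNA and scATAC AnnData objects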
20 | ```
21 | ## Case 2. Separate RNA and ATAC h5ad files
22 | ### Read H5AD file as an AnnData object
23 | ```python
24 | import scanpy as sc
25 | adata_RNA = sc.read_h5ad('rna.h5ad')
26 | adata_ATAC=sc.read_h5ad('ATAC.h5ad')
27 | import pandas as pd
28 | label=pd.read_csv('label.txt',sep='\t',header=0)
29 | ```
30 | ```python
31 | adata_RNA
32 | ```
33 | 
34 | <div style="text-align: right">
35 |   <img src="adata_RNA.png" alt="Image" width="500">
36 | </div>
37 | 
38 | ```python
39 | adata_ATAC
40 | ```
41 | 
42 | <div style="text-align: right">
43 |   <img src="adata_ATAC.png" alt="Image" width="500">
44 | </div>
45 | 
46 | ```python
47 | label
48 | ```
49 | <div style="text-align: right">
50 |   <img src="label_PBMC.png" alt="Image" width="300">
51 | </div>
52 | 
53 | ### get the input data for LINGER
54 | 
55 | ```python
56 | import scipy.sparse as sp
57 | matrix=sp.vstack([adata_RNA.X.T, adata_ATAC.X.T])
58 | features=pd.DataFrame(adata_RNA.var['gene_ids'].values.tolist()+adata_ATAC.var['gene_ids'].values.tolist(),columns=[1])
59 | K=adata_RNA.shape[1]
60 | N=K+adata_ATAC.shape[1]
61 | types = ['Gene Expression' if i <= K-1 else 'Peaks' for i in range(0, N)]
62 | features[2]=types
63 | barcodes=pd.DataFrame(adata_RNA.obs['barcode'].values,columns=[0])
64 | from LingerGRN.preprocess import *
65 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
66 | ```
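
To see the layout that the code above builds, here is a minimal sketch on toy matrices. It assumes, as in the block above, that cells are rows and features are columns in each AnnData `X`; the feature names and counts are made up, and `get_adata` itself is not called.

```python
import pandas as pd
import scipy.sparse as sp

# Toy count matrices with cells in rows (made-up values).
rna = sp.csr_matrix([[1, 0, 2], [0, 3, 0]])  # 2 cells x 3 genes
atac = sp.csr_matrix([[1, 1], [0, 1]])       # 2 cells x 2 peaks

# Transpose to features x cells and stack genes on top of peaks.
matrix = sp.vstack([rna.T, atac.T])

K = rna.shape[1]       # number of genes
N = K + atac.shape[1]  # total number of features
types = ['Gene Expression' if i < K else 'Peaks' for i in range(N)]

features = pd.DataFrame(['G1', 'G2', 'G3', 'chr1:1-100', 'chr1:200-300'], columns=[1])
features[2] = types
print(matrix.shape)
print(features)
```

The rows of `features` must stay in the same order as the rows of `matrix`, which is why genes are listed before peaks in both.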
67 | You can then return to the PBMC tutorial and continue with the 'Remove low counts cells and genes' step.
68 | 


--------------------------------------------------------------------------------
/docs/heatmap_activity.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/heatmap_activity.png


--------------------------------------------------------------------------------
/docs/heatmap_activity_mm10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/heatmap_activity_mm10.png


--------------------------------------------------------------------------------
/docs/label.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/label.png


--------------------------------------------------------------------------------
/docs/label_PBMC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/label_PBMC.png


--------------------------------------------------------------------------------
/docs/metadata_ds.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/metadata_ds.jpg


--------------------------------------------------------------------------------
/docs/module_result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/module_result.png


--------------------------------------------------------------------------------
/docs/motifmatch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/motifmatch.png


--------------------------------------------------------------------------------
/docs/original.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/original.png


--------------------------------------------------------------------------------
/docs/perturb.md:
--------------------------------------------------------------------------------
 1 | # In silico perturbation
 2 | Here, we use gene regulatory networks inferred from single-cell multi-omics data to perform in silico transcription factor (TF) perturbations, simulating the resulting changes, such as target gene expression and differentiation direction, using only unperturbed wild-type data. 
 3 | 
 4 | ## Predict the original gene expression
 5 | We first predict gene expression with the LINGER neural network model. We use this prediction to represent the wild-type transcriptional profile.
 6 | ```python
 7 | from LingerGRN.perturb import *
 8 | # in silico perturbation
 9 | outdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/output/' #output dir
10 | Datadir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/'# this should be the same Datadir that was used when running LINGER
11 | GRNdir=Datadir+'data_bulk/'
12 | Input_dir= '/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/LINGER/examples/'# input data dir
13 | chrall,data_merge,Exp,Opn,Target,idx,TFname=load_data_ptb(Input_dir,outdir,GRNdir)
14 | original=get_simulation(outdir,chrall,data_merge,GRNdir,Exp,Opn,Target,idx)
15 | original.to_csv(outdir+'original.txt',sep='\t')
16 | original
17 | ```
18 | <div style="text-align: right">
19 |   <img src="original.png" alt="Image" width="500">
20 | </div>
21 | 
22 | ## Predict the gene expression after TF perturbation
23 | We predict gene expression after knocking out one TF or several TFs together. We take POU5F1, a master regulator of the H1 cell line (stem cells), as an example.
24 | ```python
25 | TFko='POU5F1'# for multiple TFs, use a list: TFko=['TF1','TF2',...] (the output file name below then needs to be built from the list)
26 | import pandas as pd
27 | Exp_df=pd.DataFrame(Exp,index=TFname)
28 | Exp1=Exp_df.copy()
29 | Exp1.loc[TFko]=0
30 | perturb=get_simulation(outdir,chrall,data_merge,GRNdir,Exp1.values,Opn,Target,idx)
31 | perturb.to_csv(outdir+TFko+'.txt',sep='\t')
32 | perturb
33 | ```
34 | <div style="text-align: right">
35 |   <img src="perturb.png" alt="Image" width="500">
36 | </div>
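
Knocking out several TFs together follows the same pattern. The sketch below uses a toy TF-by-cell expression matrix in place of `Exp` and `TFname`; the TF list, the values, and the way the file tag is built are illustrative assumptions, and `get_simulation` is not called.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for TFname and Exp (TFs in rows, cells in columns; values are made up).
TFname = ['POU5F1', 'SOX2', 'NANOG', 'MYC']
Exp = np.arange(1, 13, dtype=float).reshape(4, 3)
Exp_df = pd.DataFrame(Exp, index=TFname)

TFko = ['POU5F1', 'SOX2']  # TFs to knock out together
Exp1 = Exp_df.copy()       # copy so the wild-type matrix stays untouched
Exp1.loc[TFko] = 0         # set the expression of all knocked-out TFs to zero

# A list cannot be concatenated into a file name directly, so join the TF names first.
tag = '_'.join(TFko)
print(Exp1)
print(tag)
```

`Exp1.values` can then be passed to `get_simulation` in place of `Exp`, and `tag` can be used when building the output file name.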
37 | 
38 | ## Differential expression for single cell
39 | We visualize the differential expression of a target gene, taking NANOG, a target gene of POU5F1, as an example. We set save = True to save the figure to outdir (Knockout TF+'_KO_Diff_exp_Umap_'+Target gene.png). The cell types of clusters 0 to 3 are H1, BJ, K562, and GM12878, respectively. 
40 | ```python
41 | embedding,D=umap_embedding(outdir,Target,original,perturb,Input_dir)
42 | TG='NANOG'
43 | save=True
44 | diff_umap(TFko,TG,save,outdir,embedding,perturb,original,Input_dir)
45 | ```
46 | <div style="text-align: right">
47 |   <img src="POU5F1_KO_Diff_exp_Umap_NANOG.png" alt="Image" width="300">
48 | </div>
49 | 
50 | The above result suggests that NANOG expression decreases in the H1 cell line after POU5F1 knockout.
51 | ## Differentiation prediction
52 | 
53 | We project the original and perturbed gene expression into the same embedding space. The difference between the two embeddings then represents the predicted differentiation direction after the perturbation. The figure will be saved as Knockout TF+'KO_Differentiation_Umap.png'. The cell types of clusters 0 to 3 are H1, BJ, K562, and GM12878, respectively. 
54 | ```python
55 | save=True
56 | Umap_direct(TFko,Input_dir,embedding,D,save,outdir)
57 | ```
58 | <div style="text-align: right">
59 |   <img src="POU5F1_KO_Differentiation_Umap.png" alt="Image" width="300">
60 | </div>
61 | 
62 | The POU5F1 knockout simulation yields a potential shift of the H1 cell line.
63 | 


--------------------------------------------------------------------------------
/docs/perturb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/perturb.png


--------------------------------------------------------------------------------
/docs/pvalue_all.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/pvalue_all.png


--------------------------------------------------------------------------------
/docs/scNN.md:
--------------------------------------------------------------------------------
  1 | # Other species tutorial
  2 | We support the following species. For other species, we provide another [tutorial](https://github.com/Durenlab/LINGER/blob/main/docs/scNN_newSpecies.md).
  3 | |genome_short  | species | species_ensembl | genome |
  4 | | --- | --- | --- | --- |
  5 | | canFam3 | dog | Canis_lupus_familiaris |CanFam3  | 
  6 | | danRer11|zebrafish|Danio_rerio|GRCz11|
  7 | |danRer10|zebrafish|Danio_rerio|GRCz10|
  8 | | dm6|fly|Drosophila_melanogaster|BDGP6|
  9 | | dm3|fly|Drosophila_melanogaster|BDGP5|
 10 | | rheMac8|rhesus|Macaca_mulatta|Mmul_8|
 11 | |mm10|mouse|Mus_musculus|GRCm38|
 12 | |mm9|mouse|Mus_musculus|NCBIM37|
 13 | |rn5|rat|Rattus_norvegicus|Rnor_5|
 14 | |rn6|rat|Rattus_norvegicus|Rnor_6|
 15 | |susScr3|pig|Sus_scrofa|Sscrofa10|
 16 | |susScr11|pig|Sus_scrofa|Sscrofa11|
 17 | |fr3|fugu|Takifugu_rubripes|FUGU5|
 18 | |xenTro9|frog|Xenopus_tropicalis|Xenopus_tropicalis_v9|
 19 | |tair10|Arabidopsis|Arabidopsis_thaliana|Tair10|
 20 | ## Download the provided data 
 21 | We provide the TSS location for the above genome and the motif information.
 22 | ```sh
 23 | Datadir=/path/to/LINGER/ # the directory to store the data; please use an absolute path. Example: Datadir=/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data/
 24 | mkdir -p $Datadir
 25 | cd $Datadir
 26 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1Dog5JTS_SNIoa5aohgZmOWXrTUuAKHXV'  -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Dog5JTS_SNIoa5aohgZmOWXrTUuAKHXV" -O provide_data.tar.gz && rm -rf /tmp/cookies.txt
 27 | ```
 28 | Then extract the archive:
 29 | ```sh
 30 | tar -xzf provide_data.tar.gz
 31 | ```
 32 | ## Prepare the input data
 33 | We use mouse (mm10) single-cell data as an example. The data is from the published paper (FOXA2 drives lineage plasticity and KIT pathway
 34 | activation in neuroendocrine prostate cancer).
 35 | The input data is the feature matrix from 10x sc-multiome data together with the cell annotation/cell type label: 
 36 | - Single-cell multiome data including matrix.mtx, features.tsv/features.txt, and barcodes.tsv/barcodes.txt
 37 | - Cell annotation/cell type label if you need the cell type-specific gene regulatory network (label.txt in our example).
 38 | <div style="text-align: right">
 39 |   <img src="barcode_mm10.png" alt="Image" width="300">
 40 | </div>  
 41 | 
 42 | If the input data is a 10x h5 file or an h5ad file from scanpy, please follow the instructions in [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md).
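A quick pre-flight check can catch a missing or misnamed file before any matrix is loaded. The sketch below is not part of LINGER: `missing_inputs` is a hypothetical helper that only looks for the three file names listed above and accepts either the `.tsv` or the `.txt` variant.

```python
from pathlib import Path

# Each required input and the file names accepted for it (taken from the list above).
REQUIRED_INPUTS = {
    "matrix": ["matrix.mtx"],
    "features": ["features.tsv", "features.txt"],
    "barcodes": ["barcodes.tsv", "barcodes.txt"],
}

def missing_inputs(data_dir):
    """Return the names of required inputs with no matching file in data_dir."""
    data_dir = Path(data_dir)
    return [
        name
        for name, candidates in REQUIRED_INPUTS.items()
        if not any((data_dir / candidate).is_file() for candidate in candidates)
    ]

# An empty list means all three inputs were found in the directory.
print(missing_inputs("data"))
```

Running this from the directory that contains `data/` should print an empty list once the download step has finished.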
 43 | 
 44 | ### sc data
 45 | We download the data using the shell command line.
 46 | ```sh
 47 | mkdir -p data
 48 | cd data
 49 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0'  -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0" -O mm10_data.tar.gz && rm -rf /tmp/cookies.txt
 50 | tar -xzvf mm10_data.tar.gz
 51 | mv mm10_data/* ./
 52 | cd ../
 53 | ```
 54 | We provide the cell annotation as follows:
 55 | ```sh
 56 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4" -O mm10_label.txt && rm -rf /tmp/cookies.txt
 57 | mv mm10_label.txt data/
 58 | ```
 59 | ## LINGER 
 60 | ### Install
 61 | ```sh
 62 | conda create -n LINGER python==3.10.0
 63 | conda activate LINGER
 64 | pip install LingerGRN==1.97
 65 | conda install bioconda::bedtools #Requirement
 66 | ```
 67 | ### Install homer
 68 | Check whether homer is installed
 69 | ```sh
 70 | which homer # run this in the command line
 71 | ```
 72 | If homer is not installed, use Conda to install it
 73 | ```sh
 74 | conda install bioconda::homer
 75 | ```
 76 | #### Install genome
 77 | You can check which genomes are installed:
 78 | ```sh
 79 | dir=$(which homer)
 80 | dir_path=$(dirname "$dir")
 81 | ls $dir_path/../share/homer/data/genomes/  # lists the installed genomes; if no genome is installed, ls fails with 'No such file or directory'
 82 | ```
 83 | If the genome is not installed, use the following shell script to install it.
 84 | ```sh
 85 | genome='mm10'
 86 | perl $dir_path/../share/homer/configureHomer.pl -install $genome
 87 | ```
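The same check can be done from Python, which is convenient when the rest of the workflow already runs in a notebook. This is a minimal sketch that mirrors the shell commands above; `homer_genome_dir` and `installed_genomes` are hypothetical helper names, and the directory layout is assumed to be the one used in the `ls` command.

```python
import os
import shutil

def homer_genome_dir(homer_bin):
    """Map the path of the homer executable to <prefix>/share/homer/data/genomes."""
    bin_dir = os.path.dirname(homer_bin)
    return os.path.normpath(
        os.path.join(bin_dir, "..", "share", "homer", "data", "genomes")
    )

def installed_genomes(homer_bin):
    """List installed genome names; an empty list means none are installed."""
    genome_dir = homer_genome_dir(homer_bin)
    if not os.path.isdir(genome_dir):
        return []
    return sorted(os.listdir(genome_dir))

homer_bin = shutil.which("homer")  # None when homer is not on PATH
if homer_bin is None:
    print("homer not found on PATH")
else:
    print(installed_genomes(homer_bin))
```

If the genome you need is missing from the printed list, install it with `configureHomer.pl` as shown above.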
 88 | The following steps are run in Python.
 89 | #### Transfer the sc-multiome data to anndata  
 90 | We will transfer sc-multiome data to the anndata format and filter the cell barcode by the cell type label.
 91 | ```python
 92 | import scanpy as sc
 93 | # Set figure parameters for display inside Jupyter notebooks; the %matplotlib line below is an IPython magic, so remove it when running a plain Python script.
 94 | %matplotlib inline
 95 | sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white')
 96 | sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
 97 | sc.logging.print_header()
 98 | #results_file = "scRNA/pbmc10k.h5ad"
 99 | import scipy.io  # the io submodule provides mmread
100 | import pandas as pd
101 | matrix=scipy.io.mmread('data/matrix.mtx')
102 | features=pd.read_csv('data/features.txt',sep='\t',header=None)
103 | barcodes=pd.read_csv('data/barcodes.txt',sep='\t',header=None)
104 | label=pd.read_csv('data/mm10_label.txt',sep='\t',header=0)
105 | from LingerGRN.preprocess import *
106 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
107 | ```
108 | #### Remove low-count cells and genes
109 | ```python
110 | import scanpy as sc
111 | sc.pp.filter_cells(adata_RNA, min_genes=200)
112 | sc.pp.filter_genes(adata_RNA, min_cells=3)
113 | sc.pp.filter_cells(adata_ATAC, min_genes=200)
114 | sc.pp.filter_genes(adata_ATAC, min_cells=3)
115 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values)) # barcodes that pass filtering in both RNA and ATAC
116 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values) # barcode -> row position in adata_RNA
117 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]] # keep the shared cells, in the same order for both objects
118 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values) # barcode -> row position in adata_ATAC
119 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
120 | ```
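The last five lines of the block above keep only the cells that survive filtering in both modalities and put them in the same order in both objects. The toy sketch below shows that logic on plain lists. It assumes barcodes are unique, and unlike `list(set(...) & set(...))` it keeps the RNA order, which is a choice made for this illustration and not something LINGER requires. The barcodes are made up.

```python
def shared_barcodes(rna_barcodes, atac_barcodes):
    """Barcodes present in both lists, in the order they appear in rna_barcodes."""
    atac_set = set(atac_barcodes)
    return [b for b in rna_barcodes if b in atac_set]

def positions(barcodes, selected):
    """Row index of each selected barcode within barcodes."""
    index = {b: i for i, b in enumerate(barcodes)}
    return [index[b] for b in selected]

rna = ["AAAC-1", "AAAG-1", "AACT-1", "AAGG-1"]   # cells left after RNA filtering
atac = ["AAGG-1", "AAAC-1", "ACCC-1"]            # cells left after ATAC filtering

selected = shared_barcodes(rna, atac)
print(selected)                   # ['AAAC-1', 'AAGG-1']
print(positions(rna, selected))   # [0, 3]
print(positions(atac, selected))  # [1, 0]
```

Indexing each object with its own position list yields two objects whose rows refer to the same cells in the same order.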
121 | #### Generate the pseudo-bulk/metacell:
122 | ```python
123 | from LingerGRN.pseudo_bulk import *
124 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode 
125 | tempsample=samplelist[0]
126 | TG_pseudobulk=pd.DataFrame([])
127 | RE_pseudobulk=pd.DataFrame([])
128 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
129 | for tempsample in samplelist:
130 |     adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
131 |     adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
132 |     TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)                
133 |     TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
134 |     RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
135 |     RE_pseudobulk[RE_pseudobulk > 100] = 100
136 | 
137 | import os
138 | if not os.path.exists('data/'):
139 |     os.mkdir('data/')
140 | adata_ATAC.write('data/adata_ATAC.h5ad')
141 | adata_RNA.write('data/adata_RNA.h5ad')
142 | TG_pseudobulk=TG_pseudobulk.fillna(0)
143 | RE_pseudobulk=RE_pseudobulk.fillna(0)
144 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
145 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
146 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
147 | ```
148 | ### Training model
149 | Overlap the region with general GRN:
150 | ```python
151 | Datadir='/path/to/LINGER/' # This directory should be the same as the Datadir defined in the 'Download the provided data' section above
152 | GRNdir=Datadir+'provide_data/'
153 | genome='mm10'
154 | outdir='/path/to/output/' #output dir
155 | activef='ReLU' 
156 | method='scNN'
157 | import torch
158 | import subprocess
159 | import os
160 | import LingerGRN.LINGER_tr as LINGER_tr
161 | LINGER_tr.get_TSS(GRNdir,genome,200000) # 200000 is the maximum distance (in bp) between a regulatory element and its target gene (TG); other distances are supported
162 | LINGER_tr.RE_TG_dis(outdir)
163 | ```
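To make the distance parameter concrete, the sketch below shows one way a 200000 bp limit can be turned into a window around a TSS and used to decide whether a region is close enough. This is an illustration only: how `get_TSS` and `RE_TG_dis` measure distance internally is not shown in this document, so the overlap rule and the coordinates here are assumptions.

```python
MAX_DISTANCE = 200000  # same value as passed to get_TSS above

def tss_window(tss, max_distance=MAX_DISTANCE):
    """Half-open window [start, end) around a TSS, clipped at the chromosome start."""
    return max(0, tss - max_distance), tss + max_distance

def overlaps(region_start, region_end, window):
    """True if the half-open region [region_start, region_end) overlaps the window."""
    win_start, win_end = window
    return region_start < win_end and region_end > win_start

window = tss_window(1_000_000)
print(window)                                  # (800000, 1200000)
print(overlaps(1_150_000, 1_150_500, window))  # True
print(overlaps(1_250_000, 1_250_500, window))  # False
```

A smaller distance narrows the window and reduces the number of candidate region-gene pairs.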
164 | Train the LINGER model.
165 | ```python
166 | import LingerGRN.LINGER_tr as LINGER_tr
167 | activef='ReLU' # activation function, chosen from 'ReLU', 'sigmoid', 'tanh'
168 | genomemap=pd.read_csv(GRNdir+'genome_map_homer.txt',sep='\t')
169 | genomemap.index=genomemap['genome_short']
170 | species=genomemap.loc[genome]['species_ensembl']
171 | LINGER_tr.training(GRNdir,method,outdir,activef,species)
172 | ```
173 | ### Cell population gene regulatory network
174 | #### TF binding potential
175 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
176 | ```python
177 | import LingerGRN.LL_net as LL_net
178 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
179 | ```
180 | 
181 | #### *cis*-regulatory network
182 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
183 | ```python
184 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
185 | ```
186 | #### *trans*-regulatory network
187 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
188 | ```python
189 | LL_net.trans_reg(GRNdir,method,outdir,genome)
190 | ```
191 | 
192 | ### Cell type specific gene regulatory network
193 | There are 2 options:
194 | 1. Infer the GRN for a specific cell type that appears in label.txt:
195 | ```python
196 | celltype='1' #use a string to assign your cell type
197 | ```
198 | 2. Infer GRNs for all cell types:
199 | ```python
200 | celltype='all'
201 | ```
202 | Please make sure that 'all' is not a cell type in your data.
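
Because `'all'` is treated as a keyword, it is worth checking the label file for a clash before running the cell type specific steps. The following is a minimal sketch using only the standard library. The helper names and the column names in the toy table are assumptions, so adjust `label_column` to match the header of your own label file.

```python
import csv
import io

def cell_types(label_file, label_column):
    """Collect the distinct values of label_column from a tab-separated file with a header."""
    reader = csv.DictReader(label_file, delimiter="\t")
    return {row[label_column] for row in reader}

def check_reserved(types, reserved="all"):
    """Raise if the reserved keyword is also used as a cell type name."""
    if reserved in types:
        raise ValueError(
            f"'{reserved}' is a cell type name; rename it before using celltype='{reserved}'"
        )
    return sorted(types)

# Toy label table; the column names here are assumptions.
toy = io.StringIO("barcode_use\tlabel\nAAAC-1\t0\nAAAG-1\t1\nAACT-1\t1\n")
print(check_reserved(cell_types(toy, "label")))  # ['0', '1']
```

For a real file, pass an open file handle, for example `open('data/mm10_label.txt', newline='')`, in place of the `io.StringIO` object.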
203 | #### Motif matching
204 | ```python
205 | command='paste data/Peaks.bed data/Peaks.txt > data/region.txt'
206 | subprocess.run(command, shell=True)
207 | import pandas as pd
208 | genome_map=pd.read_csv(GRNdir+'genome_map_homer.txt',sep='\t',header=0)
209 | genome_map.index=genome_map['genome_short']
210 | command='findMotifsGenome.pl data/region.txt '+genome+' ./. -size given -find '+GRNdir+'all_motif_rmdup_'+genome_map.loc[genome]['Motif']+'> '+outdir+'MotifTarget.bed'
211 | subprocess.run(command, shell=True)
212 | ```
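The HOMER call above is assembled by string concatenation and run through the shell. An alternative sketch is shown below: it passes the same arguments as a list and redirects standard output from Python, which avoids quoting problems when paths contain spaces. The helper names are hypothetical and `motifs.txt` is a placeholder for the motif file path built above. The flags are exactly those used in the command above.

```python
import subprocess

def find_motifs_command(region_file, genome, motif_file, work_dir="./."):
    """Argument list equivalent to the findMotifsGenome.pl call above, without the redirect."""
    return [
        "findMotifsGenome.pl", region_file, genome, work_dir,
        "-size", "given",
        "-find", motif_file,
    ]

def run_motif_matching(region_file, genome, motif_file, out_file):
    """Run HOMER and write its standard output to out_file; raise if HOMER fails."""
    cmd = find_motifs_command(region_file, genome, motif_file)
    with open(out_file, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)

# Only builds and prints the command; nothing is executed here.
print(" ".join(find_motifs_command("data/region.txt", "mm10", "motifs.txt")))
```

With `check=True`, a failed HOMER run raises an exception instead of silently leaving an empty `MotifTarget.bed`.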
213 | #### TF binding potential
214 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential.
215 | ```python
216 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version
217 | ```
218 | 
219 | #### *cis*-regulatory network
220 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
221 | ```python
222 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)
223 | ```
224 | 
225 | #### *trans*-regulatory network
226 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
227 | ```python
228 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
229 | ```
230 | ## Identify driver regulators by TF activity
231 | ### Introduction
232 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employ LINGER-inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimate TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identify driver regulators.
233 | 
234 | ### Prepare
235 | We need a *trans*-regulatory network; choose the one that best matches your data.
236 | 1. If your gene expression data match the cell population GRN, set
237 | ```python
238 | network = 'cell population'
239 | ```
240 | 2. If your gene expression data match a certain cell type, set network to the name of that cell type.
241 | ```python
242 | network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type
243 | ```
244 | 
245 | ### Calculate TF activity
246 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other single-cell or bulk RNA-seq data that match the GRN. Rows of the gene expression matrix are genes, columns are samples, and values are read counts (sc) or FPKM/RPKM (bulk).
247 | 
248 | ```python
249 | from LingerGRN.TF_activity import *
250 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
251 | ```
252 | Visualize the TF activity heatmap by cluster. If you want to save the heatmap to outdir, please set 'save=True'. The output is 'heatmap_activity.png'.
253 | ```python
254 | save=True
255 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
256 | ```
257 | <div style="text-align: right">
258 |   <img src="heatmap_activity_mm10.png" alt="Image" width="500">
259 | </div>
260 | 
261 | ### Identify driver regulator
262 | We use a t-test to find TFs whose activity differs in a certain cell type.
263 | 1. You can select a cell type from the gene expression data with
264 | ```python
265 | celltype='1'
266 | ```
267 | 2. Or, you can obtain the result for all cell types.
268 | ```python
269 | celltype='all'
270 | ```
271 | 
272 | For example,
273 | 
274 | ```python
275 | celltype='1'
276 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
277 | t_test_results
278 | ```
279 | 
280 | <div style="text-align: right">
281 |   <img src="ttest_mm10.png" alt="Image" width="300">
282 | </div>
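To show what a t-test on TF activity compares, the sketch below computes a t statistic for one TF between the cells of one cell type and all other cells. It uses Welch's variant, which allows unequal variances. Which variant `master_regulator` uses is not stated in this document, so treat this as an illustration of the idea and not as a reimplementation. The activity values are made up.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

# Toy activity values for one TF; these numbers are made up for illustration.
activity_in_celltype = [2.0, 2.5, 3.0, 2.5]
activity_in_others = [1.0, 1.5, 1.0, 1.5]

t = welch_t(activity_in_celltype, activity_in_others)
print(round(t, 3))  # 5.0
```

A large positive value means the TF is more active in the selected cell type than in the remaining cells.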
283 | 
284 | Visualize the differential activity and expression. You can compare two cell types, or one cell type against all others. If you want to save the box plot to outdir, please set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.
285 | 
286 | ```python
287 | TFName='Erg'
288 | datatype='activity'
289 | celltype1='1'
290 | celltype2='Others'
291 | save=True
292 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
293 | ```
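The output file name follows the pattern described above. The helper below is hypothetical and only spells that pattern out for the values used in this example.

```python
def box_plot_filename(tf_name, datatype, celltype1, celltype2):
    """File name pattern described above for the saved box plot."""
    return f"box_plot_{tf_name}_{datatype}_{celltype1}_{celltype2}.png"

print(box_plot_filename("Erg", "activity", "1", "Others"))
# box_plot_Erg_activity_1_Others.png
```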
294 | 
295 | <div style="text-align: right">
296 |   <img src="box_plot_Erg_activity_1_Others.png" alt="Image" width="300">
297 | </div>
298 | 
299 | For gene expression data, the boxplot is:
300 | ```python
301 | datatype='expression'
302 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
303 | ```
304 | 
305 | <div style="text-align: right">
306 |   <img src="box_plot_Erg_expression_1_Others.png" alt="Image" width="300">
307 | </div>
308 | 
309 | 
310 | 


--------------------------------------------------------------------------------
/docs/scNN_newSpecies.md:
--------------------------------------------------------------------------------
  1 | # Other species not in the list
  2 | Here, we provide a tutorial for the species not included in the given list but in the basic Homer installation. You can choose one genome version from the following table based on your data.
  3 | | **Genome Version**  | **Release** | **Description**                                           |
  4 | |---------------------|-------------|-----------------------------------------------------------|
  5 | | tair10              | v6.0        | Arabidopsis genome and annotation (tair10)                |
  6 | | dm6                 | v7.0        | Fly genome and annotation for UCSC dm6                    |
  7 | | xenTro3             | v6.4        | Frog genome and annotation for UCSC xenTro3               |
  8 | | danRer11            | v7.0        | Zebrafish genome and annotation for UCSC danRer11         |
  9 | | rn7                 | v7.0        | Rat genome and annotation for UCSC rn7                    |
 10 | | canFam3             | v6.4        | Dog genome and annotation for UCSC canFam3                |
 11 | | anoGam3             | v7.0        | Mosquito genome and annotation for UCSC anoGam3           |
 12 | | gorGor3             | v6.4        | Gorilla genome and annotation for UCSC gorGor3            |
 13 | | strPur2             | v7.0        | Urchin genome and annotation for UCSC strPur2             |
 14 | | rn5                 | v6.4        | Rat genome and annotation for UCSC rn5                    |
 15 | | rheMac10            | v7.0        | Rhesus genome and annotation for UCSC rheMac10            |
 16 | | danRer7             | v6.4        | Zebrafish genome and annotation for UCSC danRer7          |
 17 | | canFam5             | v7.0        | Dog genome and annotation for UCSC canFam5                |
 18 | | gorGor6             | v7.0        | Gorilla genome and annotation for UCSC gorGor6            |
 19 | | canFam6             | v7.0        | Dog genome and annotation for UCSC canFam6                |
 20 | | apiMel2             | v7.0        | Bee genome and annotation for UCSC apiMel2                |
 21 | | gorGor5             | v6.4        | Gorilla genome and annotation for UCSC gorGor5            |
 22 | | panTro4             | v6.4        | Chimp genome and annotation for UCSC panTro4              |
 23 | | xenTro7             | v6.4        | Frog genome and annotation for UCSC xenTro7               |
 24 | | rheMac2             | v6.4        | Rhesus genome and annotation for UCSC rheMac2             |
 25 | | gorGor4             | v6.4        | Gorilla genome and annotation for UCSC gorGor4            |
 26 | | panTro5             | v6.4        | Chimp genome and annotation for UCSC panTro5              |
 27 | | panPan2             | v6.4        | Bonobo genome and annotation for UCSC panPan2             |
 28 | | patens.ASM242v1     | v5.10       | Patens genome and annotation (patens.ASM242v1)            |
 29 | | panTro6             | v7.0        | Chimp genome and annotation for UCSC panTro6              |
 30 | | AGPv3               | v5.10       | Corn genome and annotation (AGPv3)                        |
 31 | | xenTro9             | v6.4        | Frog genome and annotation for UCSC xenTro9               |
 32 | | papAnu2             | v6.4        | Baboon genome and annotation for UCSC papAnu2             |
 33 | | corn.AGPv3          | v5.10       | Corn genome and annotation (corn.AGPv3)                   |
 34 | | petMar2             | v6.4        | Lamprey genome and annotation for UCSC petMar2            |
 35 | | panPan1             | v6.4        | Bonobo genome and annotation for UCSC panPan1             |
 36 | | fr3                 | v7.0        | Fugu genome and annotation for UCSC fr3                   |
 37 | | sacCer3             | v7.0        | Yeast genome and annotation for UCSC sacCer3              |
 38 | | mm9                 | v7.0        | Mouse genome and annotation for UCSC mm9                  |
 39 | | susScr3             | v6.4        | Pig genome and annotation for UCSC susScr3                |
 40 | | hg17                | v7.0        | Human genome and annotation for UCSC hg17                 |
 41 | | panTro3             | v6.4        | Chimp genome and annotation for UCSC panTro3              |
 42 | | tetNig2             | v7.0        | Tetraodon genome and annotation for UCSC tetNig2          |
 43 | | mm10                | v7.0        | Mouse genome and annotation for UCSC mm10                 |
 44 | | taeGut2             | v7.0        | Zebrafinch genome and annotation for UCSC taeGut2         |
 45 | | hg18                | v7.0        | Human genome and annotation for UCSC hg18                 |
 46 | | susScr11            | v7.0        | Pig genome and annotation for UCSC susScr11               |
 47 | | galGal5             | v6.4        | Chicken genome and annotation for UCSC galGal5            |
 48 | | hg38                | v7.0        | Human genome and annotation for UCSC hg38                 |
 49 | | mm8                 | v6.4        | Mouse genome and annotation for UCSC mm8                  |
 50 | | xenLae2             | v7.0        | Frog genome and annotation for UCSC xenLae2               |
 51 | | rn6                 | v7.0        | Rat genome and annotation for UCSC rn6                    |
 52 | | ce11                | v7.0        | Worm genome and annotation for UCSC ce11                  |
 53 | | dm3                 | v7.0        | Fly genome and annotation for UCSC dm3                    |
 54 | | rheMac3             | v6.4        | Rhesus genome and annotation for UCSC rheMac3             |
 55 | | rn4                 | v6.4        | Rat genome and annotation for UCSC rn4                    |
 56 | | rice.IRGSP1.0       | v5.10       | Rice genome and annotation (rice.IRGSP1.0)                |
 57 | | panPan3             | v7.0        | Bonobo genome and annotation for UCSC panPan3             |
 58 | | galGal6             | v7.0        | Chicken genome and annotation for UCSC galGal6            |
 59 | | danRer10            | v7.0        | Zebrafish genome and annotation for UCSC danRer10         |
 60 | | ci2                 | v7.0        | Ciona genome and annotation for UCSC ci2                  |
 61 | | petMar3             | v7.0        | Lamprey genome and annotation for UCSC petMar3            |
 62 | | sacCer2             | v7.0        | Yeast genome and annotation for UCSC sacCer2              |
 63 | | hg19                | v7.0        | Human genome and annotation for UCSC hg19                 |
 64 | | xenTro10            | v7.0        | Frog genome and annotation for UCSC xenTro10              |
 65 | | rheMac8             | v6.4        | Rhesus genome and annotation for UCSC rheMac8             |
 66 | | mm39                | v7.0        | Mouse genome and annotation for UCSC mm39                 |
 67 | | ce10                | v7.0        | Worm genome and annotation for UCSC ce10                  |
 68 | | aplCal1             | v6.4        | Seahare genome and annotation for UCSC aplCal1            |
 69 | | xenTro2             | v6.4        | Frog genome and annotation for UCSC xenTro2               |
 70 | | papAnu4             | v7.0        | Baboon genome and annotation for UCSC papAnu4             |
 71 | | ce6                 | v7.0        | Worm genome and annotation for UCSC ce6                   |
 72 | | ci3                 | v7.0        | Ciona genome and annotation for UCSC ci3                  |
 73 | | apiMel3             | v7.0        | Bee genome and annotation for UCSC apiMel3                |
 74 | | anoGam1             | v7.0        | Mosquito genome and annotation for UCSC anoGam1           |
 75 | | galGal4             | v6.4        | Chicken genome and annotation for UCSC galGal4            |
 76 | 
 77 | ## Prepare the input data
 78 | We take the mm10 single-cell data as an example. The data is from the published paper [FOXA2 drives lineage plasticity and KIT pathway
 79 | activation in neuroendocrine prostate cancer](https://www.cell.com/cancer-cell/pdf/S1535-6108(22)00502-5.pdf).
 80 | The input consists of the feature matrix from 10x sc-multiome data, the cell annotation (cell type label), and the annotation and motif files listed below:
 81 | - Single-cell multiome data including matrix.mtx, features.tsv/features.txt, and barcodes.tsv/barcodes.txt
 82 | 
 83 |   If the input data is a 10x h5 file or an h5ad file from scanpy, please follow the instructions in [h5/h5ad file as input](https://github.com/Durenlab/LINGER/blob/main/docs/h5_input.md).
 84 |   
 85 |   Here, we provide sc-multiome data of mm10. We download the data using the shell command line.
 86 | ```sh
 87 | mkdir -p data
 88 | cd data
 89 | wget --load-cookies /tmp/cookies.txt "https://drive.usercontent.google.com/download?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.usercontent.google.com/download?id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0'  -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1PDOmtO2oL-YVxKQY26jL91SAFedDALA0" -O mm10_data.tar.gz && rm -rf /tmp/cookies.txt
 90 | tar -xzvf mm10_data.tar.gz
 91 | mv mm10_data/* ./
 92 | cd ../
 93 | ```
 94 | We provide the cell annotation as follows:
 95 | ```sh
 96 | wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1nFm5shjcDuDYhA8YGzAnYoYVQ_29_Yj4" -O mm10_label.txt && rm -rf /tmp/cookies.txt
 97 | mv mm10_label.txt data/label.txt
 98 | ```
 99 | - Cell annotation/cell type label if you need the cell type-specific gene regulatory network (label.txt in our example).
100 |   <div style="text-align: right">
101 |   <img src="barcode_mm10.png" alt="Image" width="260">
102 |   </div>  
103 | - gtf file or the URL of the gtf file describing gene annotation, '*.gtf'
104 | - PWM matrix file of motifs, 'all_motif.txt'
105 |   <div style="text-align: right">
106 |   <img src="PWM.jpg" alt="Image" width="400">
107 |   </div>  
108 | - Motif-TF match file, 'MotifMatch.txt', mapping motif and TFs
109 |   <div style="text-align: right">
110 |   <img src="motifmatch.png" alt="Image" width="300">
111 |   </div>  
112 | 
113 | ## LINGER 
114 | ### Install
115 | ```sh
116 | conda create -n LINGER python==3.10.0
117 | conda activate LINGER
118 | pip install LingerGRN==1.97
119 | conda install bioconda::bedtools #Requirement
120 | ```
121 | ### Install homer
122 | Check whether homer is installed
123 | ```sh
124 | which homer # run this in the command line
125 | ```
126 | If homer is not installed, use Conda to install it
127 | ```sh
128 | conda install bioconda::homer
129 | ```
130 | #### Install genome
131 | You can check the installed genome
132 | ```sh
133 | dir=$(which homer)
134 | dir_path=$(dirname "$dir")
135 | ls $dir_path/../share/homer/data/genomes/  # lists the installed genomes; if no genome is installed, ls fails with 'No such file or directory'
136 | ```
137 | If the genome is not installed, use the following shell script to install it.
138 | ```sh
139 | genome='mm10'
140 | perl $dir_path/../share/homer/configureHomer.pl -install $genome
141 | ```
142 | The following steps are run in Python.
143 | #### Input 
144 | Note that PWM_file and MotifMatch_file are located in GRNdir.
145 | ```python
146 | GRNdir='/path/to/current_dir/' # The dir we currently use; the sc data is in GRNdir/data/
147 | gtf_file='/path/to/gtfdir/*.gtf' # the gtf, PWM matrix, and Motif-TF match files should be placed in GRNdir
148 | PWM_file='all_motif.txt'
149 | MotifMatch_file='MotifMatch.txt'
150 | genome='mm10' # the genome of your data
151 | outdir='/path/to/output/' #output dir
152 | species='New'
153 | ```
154 | #### Transfer the sc-multiome data to anndata  
155 | We will transfer sc-multiome data to the anndata format and filter the cell barcode by the cell type label.
156 | ```python
157 | import scanpy as sc
158 | # Set figure parameters for display inside Jupyter notebooks.
159 | sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white')
160 | sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
161 | sc.logging.print_header()
162 | #results_file = "scRNA/pbmc10k.h5ad"
163 | import scipy.io  # the io submodule provides mmread
164 | import pandas as pd
165 | matrix=scipy.io.mmread('data/matrix.mtx')
166 | features=pd.read_csv('data/features.txt',sep='\t',header=None)
167 | barcodes=pd.read_csv('data/barcodes.txt',sep='\t',header=None)
168 | label=pd.read_csv('data/label.txt',sep='\t',header=0)
169 | from LingerGRN.preprocess import *
170 | adata_RNA,adata_ATAC=get_adata(matrix,features,barcodes,label)# adata_RNA and adata_ATAC are scRNA and scATAC
171 | ```
172 | #### Remove low-count cells and genes
173 | ```python
174 | import scanpy as sc
175 | sc.pp.filter_cells(adata_RNA, min_genes=200)
176 | sc.pp.filter_genes(adata_RNA, min_cells=3)
177 | sc.pp.filter_cells(adata_ATAC, min_genes=200)
178 | sc.pp.filter_genes(adata_ATAC, min_cells=3)
179 | selected_barcode=list(set(adata_RNA.obs['barcode'].values)&set(adata_ATAC.obs['barcode'].values)) # barcodes that pass filtering in both RNA and ATAC
180 | barcode_idx=pd.DataFrame(range(adata_RNA.shape[0]), index=adata_RNA.obs['barcode'].values) # barcode -> row position in adata_RNA
181 | adata_RNA = adata_RNA[barcode_idx.loc[selected_barcode][0]] # keep the shared cells, in the same order for both objects
182 | barcode_idx=pd.DataFrame(range(adata_ATAC.shape[0]), index=adata_ATAC.obs['barcode'].values) # barcode -> row position in adata_ATAC
183 | adata_ATAC = adata_ATAC[barcode_idx.loc[selected_barcode][0]]
184 | ```
185 | #### Generate the pseudo-bulk/metacell:
186 | ```python
187 | from LingerGRN.pseudo_bulk import *
188 | samplelist=list(set(adata_ATAC.obs['sample'].values)) # sample is generated from cell barcode 
189 | tempsample=samplelist[0]
190 | TG_pseudobulk=pd.DataFrame([])
191 | RE_pseudobulk=pd.DataFrame([])
192 | singlepseudobulk = (adata_RNA.obs['sample'].unique().shape[0]*adata_RNA.obs['sample'].unique().shape[0]>100)
193 | for tempsample in samplelist:
194 |     adata_RNAtemp=adata_RNA[adata_RNA.obs['sample']==tempsample]
195 |     adata_ATACtemp=adata_ATAC[adata_ATAC.obs['sample']==tempsample]
196 |     TG_pseudobulk_temp,RE_pseudobulk_temp=pseudo_bulk(adata_RNAtemp,adata_ATACtemp,singlepseudobulk)                
197 |     TG_pseudobulk=pd.concat([TG_pseudobulk, TG_pseudobulk_temp], axis=1)
198 |     RE_pseudobulk=pd.concat([RE_pseudobulk, RE_pseudobulk_temp], axis=1)
199 |     RE_pseudobulk[RE_pseudobulk > 100] = 100
200 | 
201 | import os
202 | if not os.path.exists('data/'):
203 |     os.mkdir('data/')
204 | adata_ATAC.write('data/adata_ATAC.h5ad')
205 | adata_RNA.write('data/adata_RNA.h5ad')
206 | TG_pseudobulk=TG_pseudobulk.fillna(0)
207 | RE_pseudobulk=RE_pseudobulk.fillna(0)
208 | pd.DataFrame(adata_ATAC.var['gene_ids']).to_csv('data/Peaks.txt',header=None,index=None)
209 | TG_pseudobulk.to_csv('data/TG_pseudobulk.tsv')
210 | RE_pseudobulk.to_csv('data/RE_pseudobulk.tsv')
211 | ```
212 | ### Training model
213 | Overlap the region with general GRN:
214 | ```python
215 | activef='ReLU' 
216 | method='scNN'
217 | import torch
218 | import subprocess
219 | import os
220 | import LingerGRN.LINGER_tr as LINGER_tr
221 | LINGER_tr.get_TSS_ensembl(genome,gtf_file)
222 | LINGER_tr.get_TSS(GRNdir,genome,200000) # 200000 is the maximum distance (in bp) between a regulatory element and its target gene (TG); other distances are supported
223 | LINGER_tr.RE_TG_dis(outdir)
224 | ```
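For a species outside the provided list, TSS positions come from the gtf file. The sketch below shows the general idea of reading a TSS from GTF gene records: coordinates are 1-based and inclusive, and the TSS is the start of a gene on the `+` strand and the end of a gene on the `-` strand. It is not the implementation of `get_TSS_ensembl`, whose internals are not shown in this document. The function name and the two toy records are made up.

```python
def gene_tss(gtf_lines):
    """Yield (chrom, tss, strand, attributes) for every 'gene' record in GTF lines.

    GTF coordinates are 1-based and inclusive; the TSS is the start of a
    '+' strand gene and the end of a '-' strand gene.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        yield chrom, (start if strand == "+" else end), strand, fields[8]

# Two made-up records in GTF layout (not real annotations).
toy_gtf = [
    '#!genome-build toy\n',
    'chr1\ttoy\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n',
    'chr1\ttoy\texon\t1000\t1200\t.\t+\t.\tgene_id "G1";\n',
    'chr2\ttoy\tgene\t2000\t9000\t.\t-\t.\tgene_id "G2"; gene_name "GeneB";\n',
]
for chrom, tss, strand, _ in gene_tss(toy_gtf):
    print(chrom, tss, strand)
# chr1 1000 +
# chr2 9000 -
```

If your gtf file has no `gene` records, the same rule can be applied to transcript records instead.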
225 | Train the LINGER model.
226 | ```python
227 | import LingerGRN.LINGER_tr as LINGER_tr
228 | activef='ReLU' # activation function, chosen from 'ReLU', 'sigmoid', 'tanh'
229 | LINGER_tr.training(GRNdir,method,outdir,activef,species)
230 | ```
231 | ### Cell population gene regulatory network
232 | #### TF binding potential
233 | The output is 'cell_population_TF_RE_binding.txt', a matrix of the TF-RE binding score.
234 | ```python
235 | import LingerGRN.LL_net as LL_net
236 | LL_net.TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
237 | ```
238 | 
239 | #### *cis*-regulatory network
240 | The output is 'cell_population_cis_regulatory.txt' with 3 columns: region, target gene, cis-regulatory score.
241 | ```python
242 | LL_net.cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,method,outdir)
243 | ```
244 | #### *trans*-regulatory network
245 | The output is 'cell_population_trans_regulatory.txt', a matrix of the trans-regulatory score.
246 | ```python
247 | LL_net.trans_reg(GRNdir,method,outdir,genome)
248 | ```
249 | 
250 | ### Cell type specific gene regulatory network
251 | There are 2 options:
252 | 1. Infer the GRN for a specific cell type that appears in label.txt:
253 | ```python
254 | celltype='1' #use a string to assign your cell type
255 | ```
256 | 2. Infer GRNs for all cell types:
257 | ```python
258 | celltype='all'
259 | ```
260 | Please make sure that 'all' is not a cell type in your data.
261 | #### Motif matching
262 | ```python
263 | command='paste data/Peaks.bed data/Peaks.txt > data/region.txt'
264 | subprocess.run(command, shell=True)
265 | import pandas as pd
266 | command='findMotifsGenome.pl data/region.txt '+genome+' ./. -size given -find '+GRNdir+PWM_file+'> '+outdir+'MotifTarget.bed'
267 | subprocess.run(command, shell=True)
268 | ```
269 | #### TF binding potential
270 | The output is 'cell_population_TF_RE_binding_*celltype*.txt', a matrix of the TF-RE binding potential.
271 | ```python
272 | LL_net.cell_type_specific_TF_RE_binding(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)# different from the previous version
273 | ```
274 | 
275 | #### *cis*-regulatory network
276 | The output is 'cell_type_specific_cis_regulatory_{*celltype*}.txt' with 3 columns: region, target gene, cis-regulatory score.
277 | ```python
278 | LL_net.cell_type_specific_cis_reg(GRNdir,adata_RNA,adata_ATAC,genome,celltype,outdir,method)
279 | ```
280 | 
281 | #### *trans*-regulatory network
282 | The output is 'cell_type_specific_trans_regulatory_{*celltype*}.txt', a matrix of the trans-regulatory score.
283 | ```python
284 | LL_net.cell_type_specific_trans_reg(GRNdir,adata_RNA,celltype,outdir)
285 | ```
286 | ## Identify driver regulators by TF activity
287 | ### Introduction
288 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we employ LINGER-inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimate TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identify driver regulators.
289 | 
290 | ### Prepare
291 | We need a *trans*-regulatory network; choose the one that best matches your data.
292 | 1. If your gene expression data match the cell population GRN, set
293 | ```python
294 | network = 'cell population'
295 | ```
296 | 2. If your gene expression data match a certain cell type, you can set network to the name of that cell type.
297 | ```python
298 | network = 'CD56 (bright) NK cells' # CD56 (bright) NK cells is the name of one cell type
299 | ```
300 | 
301 | ### Calculate TF activity
302 | The input is gene expression data. It can be the scRNA-seq data from the sc-multiome data, or other single-cell or bulk RNA-seq data that match the GRN. Rows of the gene expression data are genes, columns are samples, and the values are read counts (sc) or FPKM/RPKM (bulk).
303 | 
304 | ```python
305 | from LingerGRN.TF_activity import *
306 | TF_activity=regulon(outdir,adata_RNA,GRNdir,network,genome)
307 | ```
308 | Visualize the TF activity heatmap by cluster. To save the heatmap to outdir, set 'save=True'. The output is 'heatmap_activity.png'.
309 | ```python
310 | save=True
311 | heatmap_cluster(TF_activity,adata_RNA,save,outdir)
312 | ```
313 | <div style="text-align: right">
314 |   <img src="heatmap_activity_mm10.png" alt="Image" width="500">
315 | </div>
316 | 
317 | ### Identify driver regulator
318 | We use a t-test on TF activity to find the differential TFs of a certain cell type.
319 | 1. You can select a certain cell type in the gene expression data by
320 | ```python
321 | celltype='1'
322 | ```
323 | 2. Or, you can obtain the result for all cell types.
324 | ```python
325 | celltype='all'
326 | ```
327 | 
328 | For example,
329 | 
330 | ```python
331 | celltype='1'
332 | t_test_results=master_regulator(TF_activity,adata_RNA,celltype)
333 | t_test_results
334 | ```
335 | 
336 | <div style="text-align: right">
337 |   <img src="ttest_mm10.png" alt="Image" width="300">
338 | </div>
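
The idea behind a one-versus-rest comparison of TF activity can be sketched with the standard library alone. This is a minimal illustration, not necessarily what `master_regulator` does internally: the function names `welch_t` and `rank_tfs` and the toy data are hypothetical, and only the t statistic is computed (a library routine such as `scipy.stats.ttest_ind` would also return p-values).

```python
import math
from statistics import mean, variance


def welch_t(group, rest):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se2 = variance(group) / len(group) + variance(rest) / len(rest)
    return (mean(group) - mean(rest)) / math.sqrt(se2)


def rank_tfs(activity, labels, celltype):
    """Rank TFs by t statistic of `celltype` cells versus all other cells.

    activity: dict mapping TF name -> list of per-cell activity values
    labels:   list of cell-type labels, one per cell, in the same order
    """
    scores = {}
    for tf, values in activity.items():
        inside = [v for v, lab in zip(values, labels) if lab == celltype]
        outside = [v for v, lab in zip(values, labels) if lab != celltype]
        scores[tf] = welch_t(inside, outside)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: 'TF_up' is more active in cell type '1', 'TF_flat' is not.
toy_labels = ["1", "1", "1", "2", "2", "2"]
toy_activity = {
    "TF_up": [2.0, 2.2, 1.8, 0.1, 0.3, 0.2],
    "TF_flat": [1.0, 1.2, 0.8, 1.1, 0.9, 1.0],
}
ranking = rank_tfs(toy_activity, toy_labels, "1")
print(ranking)
```

TFs with a large positive statistic are more active in the chosen cell type than in the remaining cells; each group needs at least two cells with non-identical values for the statistic to be defined.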
339 | 
340 | Visualize the differential activity and expression. You can compare two cell types, or one cell type against all other cell types. To save the box plot to outdir, set 'save=True'. The output is 'box_plot_'+TFName+'_'+datatype+'_'+celltype1+'_'+celltype2+'.png'.
341 | 
342 | ```python
343 | TFName='Erg'
344 | datatype='activity'
345 | celltype1='1'
346 | celltype2='Others'
347 | save=True
348 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
349 | ```
350 | 
351 | <div style="text-align: right">
352 |   <img src="box_plot_Erg_activity_1_Others.png" alt="Image" width="300">
353 | </div>
354 | 
355 | For gene expression data, the boxplot is:
356 | ```python
357 | datatype='expression'
358 | box_comp(TFName,adata_RNA,celltype1,celltype2,datatype,TF_activity,save,outdir)
359 | ```
360 | 
361 | <div style="text-align: right">
362 |   <img src="box_plot_Erg_expression_1_Others.png" alt="Image" width="300">
363 | </div>
364 | 
365 | 
366 | 


--------------------------------------------------------------------------------
/docs/trans_pr_curveMYC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/trans_pr_curveMYC.png


--------------------------------------------------------------------------------
/docs/trans_roc_curveMYC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/trans_roc_curveMYC.png


--------------------------------------------------------------------------------
/docs/ttest.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ttest.png


--------------------------------------------------------------------------------
/docs/ttest_mm10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/ttest_mm10.png


--------------------------------------------------------------------------------
/docs/tutorial2.md:
--------------------------------------------------------------------------------
 1 | # Identify driver regulators by TF activity
 2 | ## Instruction
 3 | TF activity, focusing on the DNA-binding component of TF proteins in the nucleus, is a more reliable metric than mRNA or whole protein expression for identifying driver regulators. Here, we use LINGER-inferred GRNs from sc-multiome data of a single individual. Assuming the GRN structure is consistent across individuals, we estimate TF activity using gene expression data alone. By comparing TF activity between cases and controls, we identify driver regulators.
 4 | 
 5 | ## Prepare
 6 | We need a *trans*-regulatory network; choose the network that best matches your data.
 7 | 1. If no single-cell data are available to infer the cell population and cell type-specific GRNs, you can choose a GRN from various tissues.
 8 | ```python
 9 | network = 'general'
10 | ```
11 | 2. If your gene expression data match the cell population GRN, you can set
12 | ```python
13 | network = 'cell population'
14 | ```
15 | 3. If your gene expression data match a certain cell type, you can set network to the name of that cell type.
16 | ```python
17 | network = '0' # 0 is the name of one cell type
18 | ```
19 | ## Calculate TF activity
20 | ```python
21 | Input_dir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/Input/'
22 | RNA_file='RNA.txt'
23 | GRNdir='/zfs/durenlab/palmetto/Kaya/SC_NET/code/github/combine/data_bulk/'
24 | genome='hg38'
25 | from TF_activity import *
26 | regulon_score=regulon(Input_dir,RNA_file,GRNdir,network,genome)
27 | ```
28 | ## Identify driver regulator
29 | We use a t-test on TF activity to find the differential TFs of a certain cell type.
30 | 1. You can assign a certain cell type by
31 | ```python
32 | celltype='0'
33 | ```
34 | 2. Or, you can obtain the result for all cell types.
35 | ```python
36 | celltype='all'
37 | ```
38 | ```python
39 | labels='label.txt'
40 | t_test_results=master_regulator(regulon_score,Input_dir,labels,celltype)
41 | t_test_results
42 | ```
43 | <div style="text-align: right">
44 |   <img src="ttest.png" alt="Image" width="300">
45 | </div>
46 | 


--------------------------------------------------------------------------------
/docs/tvalue_all.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Durenlab/LINGER/1ccdc3cd37d8191cc96a27f1252cc46831ee820c/docs/tvalue_all.png


--------------------------------------------------------------------------------