├── .gitattributes
├── README.md
├── emission_and_metadata
│   ├── README.md
│   ├── create_heatmap_for_each_state.Rmd
│   ├── data
│   │   ├── assay_count.csv
│   │   ├── biosample_term_name_counts.csv
│   │   ├── emissions_100.txt
│   │   ├── emissions_100_for_pheatmap.txt
│   │   ├── metadata.csv
│   │   ├── organ.txt
│   │   └── slims_metadata.txt
│   ├── draw_emission_subsetted_states.R
│   ├── draw_heatmap_emission_mm10_full_stack.Rmd
│   ├── helper.py
│   ├── prepare_emission_for_plotting.py
│   └── rank_chromMark_cellType_emission.py
├── gene_exp_analysis
│   ├── ._investigate_lecif_and_phastCons.py
│   ├── README.md
│   ├── avg_gene_exp_per_state_per_ct.txt.gz
│   ├── ct_name.csv
│   ├── get_avg_gene_per_state.py
│   ├── helper.py
│   └── plot_avg_gene_exp_heatmap.Rmd
├── helper.py
├── learn_model
│   ├── README.md
│   ├── extract_cell_mark_table.py
│   ├── helper.py
│   └── metadata_controltype_june_2021.tsv
├── neighborhood
│   ├── README.md
│   ├── draw_2DLine_neighborhood_enrichment.R
│   ├── draw_neighborhood_enrichment.R
│   ├── genome_100_RefSeqTES_neighborhood.txt
│   └── genome_100_RefSeqTSS_neighborhood.txt
├── overlap
│   ├── README.md
│   ├── create_excel_painted_overlap_enrichment_LECIF.py
│   └── helper.py
├── process_CTCF_data
│   ├── README.md
│   ├── calculate_genometric_mean_CTCF_overlap.py
│   ├── download_CTCF_data.py
│   ├── draw_2D_overlap_enrichment_CTCF.Rmd
│   ├── helper.py
│   ├── metadata_clean_for_publication.txt
│   ├── metadata_from_ENCODE.tsv
│   └── overlap_ctcf_mm10.txt
├── relationship_with_ct_spec_state
│   ├── ._15_state_annotations.csv
│   ├── README.md
│   ├── calculate_summary_sample_regions.py
│   ├── example_perCT_segment
│   │   ├── 15_state_annotations.csv
│   │   ├── P0_forebrain_15_segments.bed.gz
│   │   ├── P0_lung_15_segments.bed.gz
│   │   ├── P0_stomach_15_segments.bed.gz
│   │   ├── README.md
│   │   └── tissue_annotation.csv
│   ├── get_mm_enrichments_across_ct.py
│   ├── get_ranked_max_enriched_25_state.py
│   ├── helper.py
│   ├── mouse_random_represent_full_stack.snakefile
│   └── overlap_enrichment_with_ct_spec_state.py
├── state_annotation_processed.tsv
└── view_ucsc_genome_browser.pptx
/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # full_stack_ChromHMM_annotations for mouse (mm10)
3 | Genome annotation data from a full-stack ChromHMM model trained on 901 datasets assaying 14 chromatin marks in 26 different cell or tissue types of the **mouse** genome. The paper has been published in Genome Biology. This project extends the published Genome Biology manuscript that introduced a universal (pan-tissue-type) annotation of the **human** genome.
4 | # Download links:
5 | Full-stack genome annotations for the mm10 reference assembly can be found here:
6 | Within this folder:
7 | - File mm10_100_segments_segments.bed.gz contains a simple four-column .bed file of the **mouse** full-stack state annotation in the **mm10** assembly. The fourth column contains a state label with a prefix number that can be used to order the states. The OverlapEnrichment and NeighborhoodEnrichment commands of ChromHMM with the '-labels' option can compute enrichments for this file and order states based on the prefix number.
8 | - File mm10_100_segments_browser.bed.gz contains a browser file of the **mouse** full-stack state annotation in the **mm10** assembly. This file is compatible with the UCSC Genome Browser. Since our training data (901 input data tracks) are in mm10, mm10 is the assembly used for the original training and annotation.
9 |
10 | - A detailed description of the states can be found in the tsv file. The Excel version of these files, with more results on the states' overlap enrichments with external annotations, can be found in our paper, Additional Files 3-5.
11 |
12 | - State group meanings:
13 | ```
14 | mGapArtf - assembly gaps and artifacts
15 | mQuies - quiescent
16 | mHet - heterochromatin
17 | mZNF - Zinc finger genes
18 | mReprPC - polycomb repressed
19 | mReprPC_openC - polycomb repressed and open chromatin
20 | mOpenC - open chromatin
21 | mEnhA - active enhancers
22 | mEnhWk - weak enhancers
23 | mTxEnh - transcribed enhancers
24 | mTx - transcription
25 | mTxEx - transcription and exons
26 | mTxWk - weak transcription
27 | mBivProm - bivalent promoters
28 | mPromF - promoter flank
29 | mTSS - transcription start sites
30 | ```
31 |
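As a rough illustration of the four-column .bed layout described under the download links, the sketch below reads such a file and totals the bases covered by each state, ordered by the numeric prefix of the state label. It assumes the label is an integer prefix followed by an underscore and a group name (for example `1_mGapArtf`); that separator, the helper names `parse_segments` and `open_bed`, and the sample rows are assumptions made for illustration and are not taken from the released file.

```python
import gzip
import io

def parse_segments(handle):
    """Yield (chrom, start, end, order, label) from a four-column BED stream."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip blank lines and browser/track headers
        chrom, start, end, state = line.split("\t")[:4]
        # Assumption: the state label starts with an integer prefix followed by "_".
        prefix, _, label = state.partition("_")
        yield chrom, int(start), int(end), int(prefix), label

def open_bed(path):
    """Open a plain or gzip-compressed BED file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

# Made-up rows standing in for mm10_100_segments_segments.bed.gz.
sample = io.StringIO(
    "chr1\t0\t3000000\t1_mGapArtf\n"
    "chr1\t3000000\t3000200\t2_mQuies\n"
)
segments = list(parse_segments(sample))

# Total bases per state, printed in prefix order.
total_bp = {}
for chrom, start, end, order, label in segments:
    total_bp[(order, label)] = total_bp.get((order, label), 0) + (end - start)
for (order, label), bp in sorted(total_bp.items()):
    print(order, label, bp)
```

To run this on the real annotation, replace `sample` with `open_bed("mm10_100_segments_segments.bed.gz")` after checking that the label format matches the assumption above.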
32 | # Track hubs on UCSC genome browser:
33 | You can view the full-stack annotations for the mouse (presented in this manuscript) or the human (presented in the "sister" manuscript) by using the track hub link. The tutorial file view_ucsc_genome_browser.pptx gives detailed step-by-step instructions on how to view the full-stack annotations with the provided track hub link.
34 |
35 | # Folders:
36 | Each subfolder of this folder contains its own README file.
37 | - Folder ```relationship_with_ct_spec_state```: contains code to reproduce the plots in Fig. S6 and S7 of the manuscript.
38 | - Folder ```process_CTCF_data```: contains code to reproduce Fig. 2F of the manuscript.
39 | - Folder ```neighborhood```: contains code to reproduce Fig. 2D-E and S2 of the manuscript.
40 | - Folder ```emission_and_metadata```: contains code to reproduce Fig. 1 of the manuscript
41 | - Folder ```gene_exp_analysis```: contains code to reproduce Fig. 2C of the manuscript
42 | - Folder ```overlap```: contains scripts to calculate the enrichments of mouse full-stack states with different genome annotations.
43 | - Folder ```learn_model```: contains scripts to reproduce tabs 'trainData_HistoneChip' and 'trainData_ATACDNase' in Additional File 2 of the paper
44 |
45 | # License:
46 | All code is provided under the MIT Open Access License
47 | Copyright 2022 Ha Vu and Jason Ernst
48 |
49 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
50 |
51 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
52 |
53 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
54 |
55 | # Contact:
56 | If you run into problems, please contact Ha Vu (havu73@ucla.edu)
57 |
--------------------------------------------------------------------------------
/emission_and_metadata/README.md:
--------------------------------------------------------------------------------
1 | This folder contains code to produce Figure 1 of the manuscript.
2 |
3 | - ```helper.py```: script containing helper functions for file management.
4 | - Folder ```data```: contains the emission probability file and different metadata files that specify the color codes for different chromatin marks, cell types, etc.
5 | - ```prepare_emission_for_plotting.py```: Code to transform data from ```./data/emissions_100.txt``` to ```./data/emissions_100_for_pheatmap.txt```, with added columns showing the associated chromatin mark and cell type for each experiment.
6 | - ```rank_chromMark_cellType_emission.py```: Code to produce excel files that correspond to Fig. 1B-C.
7 | ```
8 | python rank_chromMark_cellType_emission.py <emission_fn> <output_fn> <num_top_marks>
9 | emission_fn: should be ./data/emissions_100.txt
10 | output_fn: Excel file where the ranked cell types and chromatin marks are stored
11 | num_top_marks: number of top marks to report; recommended value: 100
12 | ```
13 | - ```draw_heatmap_emission_mm10_full_stack.Rmd```: Code to produce Fig. 1A, and individual plots corresponding to individual states (not presented in the paper, but these plots are helpful in understanding the states' biological implications).
14 | - ```draw_emission_subsetted_states.R```: Code to draw emission probabilities for a user-specified subset of states. This script was not used to create any figure in the paper. It may be helpful to users, but it requires that you refer to states by their raw numbers (column ```state``` in ```../state_annotation_processed.tsv```)
15 | ```
16 | Rscript draw_emission_subsetted_states.R 100), to plot>
17 | ```
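The ranking idea behind ```rank_chromMark_cellType_emission.py``` can be pictured with the minimal sketch below. It is not the script's actual implementation: it assumes the emission file is a tab-delimited table with one row per state and one column per experiment, `top_marks_per_state` is a hypothetical helper, and the experiment names and probabilities are invented.

```python
import csv
import io

def top_marks_per_state(handle, num_top_marks):
    """Return {state: [(experiment, probability), ...]} sorted by decreasing probability."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    experiments = header[1:]  # assumption: the first column holds the state identifier
    ranked = {}
    for row in reader:
        state = row[0]
        probs = [float(x) for x in row[1:]]
        pairs = sorted(zip(experiments, probs), key=lambda p: p[1], reverse=True)
        ranked[state] = pairs[:num_top_marks]
    return ranked

# Toy table standing in for ./data/emissions_100.txt (values are made up).
toy = io.StringIO(
    "state\theart_H3K4me3\tliver_H3K27ac\tforebrain_H3K9me3\n"
    "1\t0.90\t0.40\t0.01\n"
    "2\t0.02\t0.03\t0.75\n"
)
ranked = top_marks_per_state(toy, num_top_marks=2)
for state, pairs in ranked.items():
    print(state, pairs)
```

Keeping only the top `num_top_marks` experiments per state makes it easier to see which chromatin marks and cell types dominate each state.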
18 | # License:
19 | All code is provided under the MIT Open Access License
20 | Copyright 2022 Ha Vu and Jason Ernst
21 |
22 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
23 |
24 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
25 |
26 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
27 |
28 |
--------------------------------------------------------------------------------
/emission_and_metadata/create_heatmap_for_each_state.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Create_heatmap_for_each_state"
3 | author: "Ha Vu"
4 | date: "6/11/2019"
5 | output: html_document
6 | ---
7 |
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = TRUE)
10 | library(reshape2) # provides dcast(), used below; tidyr itself is attached by library(tidyverse)
11 | library(tidyverse)
12 | library(ggplot2)
13 | library(dplyr)
14 | library(pheatmap)
15 | source('/Users/vuthaiha/Desktop/window_hoff/source/summary_analysis/draw_emission_matrix_functions.R')
16 | ```
17 | # Get the emission matrix into a dataframe
18 | ```{r}
19 | emission_fn <- "/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/emissions_100.txt"
20 | output_folder <- "/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/emission_results/state_avg_cell_group_and_mark"
21 | cm_output_folder <- "/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/emission_results/cm_based_emission"
22 | ana_output_folder <- "/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/emission_results/ana_based_emission"
23 | emission_df <- get_emission_df_from_chromHMM_emission(emission_fn)
24 | num_state <- 100
25 | num_exp <- 1032
26 | ```
27 | # get the count of each annotation category
28 | ```{r}
29 | count_chrom_mark_GROUP <- emission_df %>% select(c('mark_names', 'chrom_mark', 'GROUP')) %>% group_by(chrom_mark, GROUP) %>% tally() %>% rename('count' = 'n')
30 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% dcast(chrom_mark ~ GROUP , value.var = 'count') %>% replace(., is.na(.), 0)
31 | annot_col_df <- data.frame(cell_GROUP = colnames(count_chrom_mark_GROUP))
32 | rownames(annot_col_df) <- as.character(colnames(count_chrom_mark_GROUP))
33 | annot_row_df <- data.frame(chrom_mark = rownames(count_chrom_mark_GROUP))
34 | rownames(annot_row_df) <- as.character(rownames(count_chrom_mark_GROUP))
35 | rownames(count_chrom_mark_GROUP) <- count_chrom_mark_GROUP[,1]
36 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP[,-1]
37 | pheatmap(count_chrom_mark_GROUP, fontsize = 6, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, display_numbers = count_chrom_mark_GROUP, number_format = '%s', annotation_col = annot_col_df, annotation_row = annot_row_df, annotation_colors = list(cell_GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE), color = colorRampPalette(c('white', 'red'))(300), filename = file.path(output_folder, 'count_categories.png'))
38 | ```
39 | # draw plots for each state on its own figure
40 | ```{r}
41 | for (state_index in seq(num_state)){
42 | state_df <- emission_df %>% select(c("mark_names", paste0("S", state_index), "chrom_mark", "GROUP"))
43 | avg_emission_df <- state_df %>% select (paste0("S", state_index), 'chrom_mark', 'GROUP') %>% group_by(chrom_mark, GROUP) %>% summarise_all(.funs = funs(mean = "mean")) %>% rename(mean_emission = mean) # group by chrom_mark and GROUP, then calculate the mean emission of emission probabilities in this state --> chrom_mark, GROUP, mean_emission
44 | avg_emission_df <- avg_emission_df %>% dcast(chrom_mark ~ GROUP , value.var = 'mean_emission') %>% replace(., is.na(.), 0) # put the 3 column data frame into a data frame that are heatmap compatible
45 |
46 | chrom_mark_list <- avg_emission_df %>% select('chrom_mark') # list of chromatin marks, which will later be the row names for the heatmap
47 | avg_emission_df <- avg_emission_df[,-1] # get rid of the 'chrom_mark' column because it will become the row names
48 | row.names(avg_emission_df) <- chrom_mark_list[,1] # row names are now chrom mark so that can draw a heatmap later
49 | this_state_save_fig_fn <- file.path(output_folder, paste0("state", state_index, '_avg_emission.png') )
50 | break_list <- seq (0, 1, by = 0.01)
51 | annot_col_df <- data.frame(cell_GROUP = colnames(avg_emission_df))
52 | rownames(annot_col_df) <- as.character(colnames(avg_emission_df))
53 | annot_row_df <- data.frame(chrom_mark = rownames(avg_emission_df))
54 | rownames(annot_row_df) <- as.character(rownames(avg_emission_df))
55 | pheatmap(avg_emission_df,breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, annotation_col = annot_col_df, annotation_row = annot_row_df, display_numbers = count_chrom_mark_GROUP ,annotation_colors = list(cell_GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE), filename = this_state_save_fig_fn)
56 | print(paste("Done with state:", state_index))
57 | }
58 | ```
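
For readers who do not use R, the aggregation inside the loop above (average the emission probabilities of one state within each chromatin mark and cell group pair, then spread the result into a mark-by-group grid with missing pairs set to 0) can be sketched in Python. The records below are invented and `mean_emission_grid` is a hypothetical helper, not part of this repository.

```python
from collections import defaultdict

def mean_emission_grid(records):
    """Average probabilities per (mark, group) and return a nested dict grid[mark][group]."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for mark, group, prob in records:
        sums[(mark, group)] += prob
        counts[(mark, group)] += 1
    marks = sorted({mark for mark, _ in counts})
    groups = sorted({group for _, group in counts})
    # Pairs with no experiment get 0.0, mirroring replace(., is.na(.), 0) in the R code.
    return {
        mark: {
            group: (sums[(mark, group)] / counts[(mark, group)]
                    if (mark, group) in counts else 0.0)
            for group in groups
        }
        for mark in marks
    }

# Made-up (chrom_mark, GROUP, emission probability) records for a single state.
records = [
    ("H3K4me3", "Heart", 0.75),
    ("H3K4me3", "Heart", 0.25),
    ("H3K4me3", "Liver", 0.5),
    ("H3K27me3", "Liver", 0.125),
]
grid = mean_emission_grid(records)
for mark, row in grid.items():
    print(mark, row)
```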
59 | # get a data frame avg_emission_df: rows: 'ana_mark_comb' + states, columns: mark-GROUP combination --> cells: avg emission probabilities for each mark-GROUP combination
60 | ```{r}
61 | avg_emission_df <- emission_df %>% select(-'COLOR', -'TYPE', -'Epig_name', -'ct') # keep 'GROUP' and 'chrom_mark': both are needed by group_by() below
62 | mark_name_list <- as.character(avg_emission_df$mark_names)
63 | rownames(avg_emission_df) <- mark_name_list
64 | avg_emission_df <- avg_emission_df %>% select(-'mark_names') %>% group_by(chrom_mark, GROUP) %>% summarise_all(funs(mean)) %>% unite("ana_mark_comb",c("GROUP", "chrom_mark"), sep = "-")
65 | ana_mark_comb_list <- avg_emission_df$ana_mark_comb
66 | ```
67 |
68 | # Draw heatmap where we arrange categories by chrom marks
69 | ```{r}
70 | uniq_chrom_mark_list <- unique(emission_df$chrom_mark)
71 | plot_df <- data.frame(matrix(ncol = ncol(avg_emission_df) - 1, nrow = 0))
72 |
73 | for (cm in uniq_chrom_mark_list){
74 | this_cm_df <- avg_emission_df %>% filter(grepl(cm, ana_mark_comb))
75 | plot_df <- rbind(plot_df, this_cm_df)
76 | }
77 | ana_mark_comb_list <- plot_df$ana_mark_comb
78 | rownames(plot_df) <- ana_mark_comb_list
79 | plot_df <- plot_df %>% select(- 'ana_mark_comb')
80 | # column annotation
81 | annot_df <- data.frame(chrom_mark = sapply(rownames(plot_df), FUN = get_chrom_mark_name))
82 | annot_df['GROUP'] <- sapply(rownames(plot_df), FUN = get_mark_ct)
83 | break_list <- seq (0, 1, by = 0.01)
84 |
85 | save_fn <- file.path(output_folder, "avg_chrom_mark_GROUP_grouped_cm.png")
86 | pheatmap(t(plot_df), breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 3, angle_col = 90, annotation_col = annot_df, annotation_colors = list(GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE), cellwidth = 3.5 ,filename = save_fn)
87 | ```
88 |
89 | ```{r}
90 | break_list <- seq (0, 1, by = 0.01)
91 | uniq_chrom_mark_list <- unique(emission_df$chrom_mark)
92 | for (cm in uniq_chrom_mark_list){
93 | this_cm_df <- avg_emission_df %>% filter(grepl(cm, ana_mark_comb))
94 | ana_mark_comb_list <- this_cm_df$ana_mark_comb
95 | this_cm_df <- this_cm_df %>% select(-"ana_mark_comb")
96 | rownames(this_cm_df) <- ana_mark_comb_list
97 | annot_df <- data.frame(chrom_mark = sapply(ana_mark_comb_list, FUN = get_chrom_mark_name))
98 | annot_df['GROUP'] <- sapply(ana_mark_comb_list, FUN = get_mark_ct)
99 | save_fn <- file.path(cm_output_folder, paste0(cm, "avg_emission.png"))
100 | pheatmap(t(this_cm_df), breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 5, angle_col = 90, annotation_col = annot_df, annotation_colors = list(GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE),filename = save_fn)
101 | }
102 | ```
103 |
104 | ```{r}
105 | uniq_GROUP_list <- unique(emission_df$GROUP)
106 | plot_df <- data.frame(matrix(ncol = ncol(avg_emission_df) - 1, nrow = 0))
107 | for (ana in uniq_GROUP_list){
108 | this_ana_df <- avg_emission_df %>% filter(grepl(paste0(ana, "-"), ana_mark_comb))
109 | plot_df <- rbind(plot_df, this_ana_df)
110 | }
111 | ana_mark_comb_list <- plot_df$ana_mark_comb
112 | rownames(plot_df) <- ana_mark_comb_list
113 | plot_df <- plot_df %>% select(- 'ana_mark_comb')
114 | annot_df <- data.frame(chrom_mark = sapply(rownames(plot_df), FUN = get_chrom_mark_name))
115 | annot_df['GROUP'] <- sapply(rownames(plot_df), FUN = get_mark_ct)
116 |
117 | break_list <- seq (0, 1, by = 0.01)
118 |
119 | save_fn <- file.path(output_folder, "avg_chrom_mark_GROUP_grouped_ana.png")
120 | pheatmap(t(plot_df), breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 3, angle_col = 90, annotation_col = annot_df, annotation_colors = list(GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE) , cellwidth = 3.5 ,filename = save_fn)
121 | ```
122 |
123 | ```{r}
124 | break_list <- seq (0, 1, by = 0.01)
125 | uniq_GROUP_list <- unique(emission_df$GROUP)
126 | for (ana in uniq_GROUP_list){
127 | this_ana_df <- avg_emission_df %>% filter(grepl(paste0(ana, '-'), ana_mark_comb))
128 | ana_mark_comb_list <- this_ana_df$ana_mark_comb
129 | this_ana_df <- this_ana_df %>% select(-"ana_mark_comb")
130 | rownames(this_ana_df) <- ana_mark_comb_list
131 | annot_df <- data.frame(chrom_mark = sapply(ana_mark_comb_list, FUN = get_chrom_mark_name))
132 | annot_df['GROUP'] <- sapply(ana_mark_comb_list, FUN = get_mark_ct)
133 | save_fn <- file.path(ana_output_folder, paste0(ana, "avg_emission.png"))
134 | pheatmap(t(this_ana_df), breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 5, angle_col = 90, annotation_col = annot_df, annotation_colors = list(GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE),filename = save_fn)
135 | }
136 | ```
137 |
138 | # calculate the correlations between states
139 | ```{r}
140 | emission_df <- get_emission_df_from_chromHMM_emission(emission_fn)
141 | mark_name_list <- emission_df$mark_names
142 | emission_df <- emission_df %>% select(-'mark_names', -'ct', -'chrom_mark', -'GROUP', -'COLOR', -'TYPE', -'Epig_name') # keep only the per-state emission columns
143 | rownames(emission_df) <- mark_name_list
144 | cor_df <- as.data.frame(cor(emission_df))
145 | break_list <- seq (0, 1, by = 0.01)
146 | save_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/state_correlations/pearson_correlations.png'
147 | pheatmap(cor_df, breaks = break_list, fontsize = 4, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4, angle_col = 90,filename = save_fn)
148 | ```
--------------------------------------------------------------------------------
/emission_and_metadata/data/assay_count.csv:
--------------------------------------------------------------------------------
1 | mark,num_exp,color,big_group,
2 | DNase,114,#DBE680,1,
3 | H3K4me3,101,#F13712,2,
4 | H3K4me1,100,#EDF732,3,
5 | H3K27ac,93,#F7CB4D,3,
6 | H3K36me3,92,#49AC5E,4,
7 | H3K27me3,91,#A6A6A4,5,
8 | ATAC,83,#E8F484,1,
9 | H3K9me3,79,#677BF6,6,
10 | H3K9ac,72,#F2A626,2,
11 | H3K4me2,68,#F8CBAD,2,
12 | H3K79me2,6,#377A45,4,
13 | H3ac,1,#E5EAE7,7,
14 | H3K79me1,1,#9DCEA8,4,This mark is actually not in the emission file
15 | H3K79me3,1,#95B77E,4,
16 | ,902,,,
--------------------------------------------------------------------------------
/emission_and_metadata/data/biosample_term_name_counts.csv:
--------------------------------------------------------------------------------
1 | biosample_name,num_experiment,group_refined_1,group_refined_2,group_refined_3,group,explanation,color,big_group
2 | heart,83,heart,heart,heart,Heart,,#D56F80,1
3 | liver,80,liver,liver,liver,Liver,,#9BC2E6,2
4 | forebrain,73,forebrain,forebrain,brain ,Brain,,#C5912B,3
5 | midbrain,73,midbrain,midbrain,brain ,Brain,,#C5912B,3
6 | hindbrain,73,hindbrain,hindbrain,brain ,Brain,,#C5912B,3
7 | limb,54,limb ,limb ,limb ,Muscle,,#F9B6CF,22
8 | embryonic facial prominence,54,embryonic facial promincence,embryonic facial promincence,embryo,embryo,,#AF5B39,18
9 | kidney,48,kidney,kidney,kidney,kidney,,#F4B084,23
10 | neural tube,47,neural tube,neuron ,neuron ,neuron,,#FFD924,4
11 | lung,42,lung,lung,lung,lung,,#E41A1C,5
12 | stomach,37,stomach,stomach,stomach,stomach,,#E5BDB5,6
13 | intestine,36,intestine,intestine,intestine,intestine,,#D0A39B,7
14 | CH12.LX,14,B-cells,white blood cells,immunity,white blood cells,,#678C69,8
15 | MEL,13,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,Murine erythroleukemia (MEL) cell lines are erythroid progenitor cells derived from the spleens of susceptible mice infected with the Friend virus complex 1. These virally transformed cells are arrested at the proerythroblast stage of development and can be maintained in tissue culture indefinitely ,#55A354,9
16 | ES-E14,9,ESC,ESC,ESC,ESC,"Mouse E14 embryonic stem cells (ESCs) are a well-characterized and widespread used ESC line, often employed for genome-wide studies involving next generation sequencing analysis",#924965,16
17 | cerebellum,9,hindbrain,hindbrain,brain ,Brain,,#C5912B,3
18 | thymus,9,thymus,thymus,thymus,Thymus,,#C6E0B4,11
19 | spleen,7,spleen,spleen,spleen,Spleen,,#678C69,12
20 | G1E,7,ESC,ESC,ESC,ESC,"G1E cells are an immortalized Gata1 null cell line derived from embryonic stem cells [1], and the daughter cell line G1E-ER4 has been stably rescued by transduction with a virus expressing a hybrid gene encoding the GATA1-ER protein",#924965,16
21 | myocyte,6,muscle,muscle,muscle,Muscle,,#F9B6CF,22
22 | ES-Bruce4,6,ESC,ESC,ESC,ESC,An embryonic cell line isolated from C57BL/6 mouse strain. Injection of Bruce4 cells into C57BL/6 blastocysts will produce agouti chimeras.,#924965,16
23 | testis,6,testis,testis,testis,testis,,#ACB9CA,13
24 | erythroblast,6,red blood cells,red blood cells,red blood cells,red blood cells,"Erythroblast, nucleated cell occurring in red marrow as a stage or stages in the development of the red blood cell, or erythrocyte.",#55A354,9
25 | megakaryocyte,6,bone marrow cells for producing platelets,bone marrow ,bone marrow ,white blood cells,"Megakaryocytes are cells in the bone marrow responsible for making platelets, which are necessary for blood clotting",#678C69,8
26 | embryo,5,embryo,embryo,embryo,embryo,,#AF5B39,18
27 | C2C12,5,Muscle,Muscle,Muscle,Muscle,an undifferentiated cell capable of giving rise to muscle cells,#F9B6CF,22
28 | E14TG2a.4,5,ESC,ESC,ESC,ESC,,#924965,16
29 | small intestine,5,Intestine,Intestine,Intestine,intestine,,#D0A39B,7
30 | retina,4,retina,retina,retina,retina,,#F8CBAD,14
31 | brain,3,brain ,brain ,brain ,brain,,#C5912B,3
32 | splenic B cell,3,B-cells,white blood cells,white blood cells,white blood cells,"b-cells generated by the spleen, used for immunity",#678C69,8
33 | cortical plate,3,ESC/forebrain,ESC,ESC,ESC,The term cortical plate refers to the embryonic precursor of cerebral cortex. It gives rise to all layers of neocortex and allocortex,#924965,16
34 | brown adipose tissue,3,adipose,adipose,adipose,Adipose,,#0070C0,15
35 | embryonic fibroblast,3,ESC fibroblast,ESC,ESC,ESC,,#924965,16
36 | placenta,3,placenta,placenta,placenta,placenta,,#69608A,17
37 | olfactory bulb,3,forebrain,forebrain,brain ,brain,,#C5912B,3
38 | bone marrow macrophage,3,bone marrow white blood cells,bone marrow,bone marrow,bone marrow,"Bone marrow-derived macrophages (BMDM) are primary macrophages obtained by in vitro differentiation of bone marrow cells in the presence of macrophage colony-stimulating factor (M-CSF or CSF1). They are easy to obtain in high yields, can be stored by freezing, and can be obtained from genetically modified mice strains.",#375623,10
39 | cortex,2,forebrain,forebrain,brain ,Brain,,#C5912B,3
40 | bone marrow,2,bone marrow,bone marrow,bone marrow,bone marrow,,#375623,10
41 | adrenal gland,2,adrenal gland,other,other,Other,"small, triangular-shaped glands located on top of both kidneys. Adrenal glands produce hormones that help regulate your metabolism, immune system, blood pressure, response to stress and other essential functions.",#999999,19
42 | hematopoietic stem cell,2,ESC,ESC,ESC,ESC,"An immature cell that can develop into all types of blood cells, including white blood cells, red blood cells, and platelets. Hematopoietic stem cells are found in the peripheral blood and the bone marrow.",#924965,16
43 | c-Kit-positive CD71-negative TER-119-negative erythroid progenitor cells,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,"Erythroid progenitor cells are committed self-renewing stem cells that give rise to only one type of cell, namely, the erythrocytes (red blood cells).",#55A354,9
44 | erythroid progenitor cell,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,,#55A354,9
45 | megakaryocyte-erythroid progenitor cell,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,"The megakaryocyte–erythroid progenitor cell (or MEP, or hMEP to specify human) is a cell that gives rise to megakaryocytes and erythrocytes. It is derived from the common myeloid progenitor.",#55A354,9
46 | neutrophil,1,white blood cells,white blood cells,white blood cells,white blood cells,"A type of immune cell that is one of the first cell types to travel to the site of an infection. Neutrophils help fight infection by ingesting microorganisms and releasing enzymes that kill the microorganisms. A neutrophil is a type of white blood cell, a type of granulocyte, and a type of phagocyte.",#678C69,8
47 | adipocyte,1,adipose,adipose,adipose,adipose,"a cell specialized for the storage of fat, found in connective tissue",#0070C0,15
48 | activated regulatory T-cells,1,white blood cells,white blood cells,white blood cells,white blood cells,,#678C69,8
49 | cerebral cortex,2,forebrain,forebrain,brain ,brain,,#C5912B,3
50 | left cerebral cortex,1,forebrain,forebrain,brain ,brain,,#C5912B,3
51 | "naive thymus-derived CD4-positive, alpha-beta T cell",1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
52 | NIH3T3,1,ESC fibroblast,fibroblast,fibroblast,fibroblast,3T3 cells are several cell lines of mouse embryonic fibroblasts,#8DAFC3,20
53 | frontal cortex,1,forebrain,forebrain,brain ,brain,,#C5912B,3
54 | forelimb bud,1,limb bud,limb ,limb ,Muscle,The limb bud is a structure formed early in vertebrate limb development,#F9B6CF,22
55 | ES-CJ7,1,ESC,ESC,ESC,ESC,Undifferentiated embryonic stem cells were originally isolated from 129S1/SVImJ strain mice by,#924965,16
56 | WW6,1,ESC,ESC,ESC,ESC,WW6: an embryonic stem cell line with an inert genetic marker that can be traced in chimeras,#924965,16
57 | telencephalon,1,neuron ,neuron ,neuron ,neuron,"The telencephalon (basal ganglia, septum, cerebral cortex and olfactory bulb) contains two general classes of neurons: those that project axons to distant targets and those that make only local connections.",#FFD924,4
58 | yolk sac,1,embryo,embryo,embryo,embryo,"The yolk sac is a small, membranous structure situated outside of the embryo with a variety of functions during embryonic development. It attaches ventrally to the developing embryo via the yolk stalk. The yolk stalk is a term that may be used interchangeably with the vitelline duct or omphalomesenteric duct.",#AF5B39,18
59 | hindlimb bud,1,embryo,embryo,embryo,embryo,,#AF5B39,18
60 | B cell,1,B-cells,white blood cells,white blood cells,white blood cells,,#678C69,8
61 | monocyte,1,white blood cells,white blood cells,white blood cells,white blood cells,"A monocyte is a type of white blood cell and a type of phagocyte. Enlarge. Blood cells. Blood contains many types of cells: white blood cells (monocytes, lymphocytes, neutrophils, eosinophils, basophils, and macrophages), red blood cells (erythrocytes), and platelets",#678C69,8
62 | common myeloid progenitor,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,Common myeloid progenitors give rise to either megakaryocyte/erythrocyte or granulocyte/macrophage progenitors,#55A354,9
63 | c-Kit-negative CD71-positive TER-119-positive erythroid progenitor cells,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,,#55A354,9
64 | gonadal fat pad,1,adipose,adipose,adipose,adipose,,#0070C0,15
65 | regulatory T cell,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
66 | inflammation-experienced regulatory T-cells,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
67 | c-Kit-positive CD71-positive TER-119-negative erythroid progenitor cells,1,erythroid progenitor cells,erythroid progenitor cells,erythroid progenitor cells,red blood cells,,#55A354,9
68 | Patski,1,fibroblast,fibroblast,fibroblast,fibroblast,PATSKI is a female interspecific mouse fibroblast that was derived from the embryonic kidney of an M. spretus x C57BL/6J hybrid mouse such that the C57Bl/6J X chromosome (maternal) is always the inactive X. This is an adherent cell line.,#8DAFC3,20
69 | skeletal muscle tissue,1,muscle,muscle,muscle,muscle,,#F9B6CF,22
70 | mesoderm,1,embryo,embryo,embryo,embryo,"the middle layer of an embryo in early development, between the endoderm and ectoderm.",#AF5B39,18
71 | large intestine,1,intestine,intestine,intestine,intestine,,#D0A39B,7
72 | CD4-positive helper T cell,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
73 | MEL-GATA-1-ER,1,other,other,other,other,cancer cell line: mouse erythroid leukemia (https://web.expasy.org/cellosaurus/CVCL_Y480),#999999,19
74 | 416B,1,other,other,other,other,transformed cell line. Haematopoiesis is the formation of blood cellular components,#999999,19
75 | right cerebral cortex,1,forebrain,forebrain,brain ,brain,,#C5912B,3
76 | induced T-regulatory cell,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
77 | A20,1,B-cells,white blood cells,white blood cells,white blood cells,B lymphocyte,#678C69,8
78 | activated regulatory T-cell,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
79 | 3T3-L1,1,fibroblast,fibroblast,fibroblast,fibroblast,,#8DAFC3,20
80 | fat pad,1,adipose,adipose,adipose,adipose,,#0070C0,15
81 | R1,1,ESC,ESC,ESC,esC,,#924965,16
82 | MN1,1,neuron ,neuron ,neuron ,neuron,MN1 is a cholinergic motor neuron cell line derived from a fusion of N18TG2 with embryonic mouse spinal cord motor neurons,#FFD924,4
83 | fibroblast of lung,1,fibroblast,fibroblast,fibroblast,fibroblast,,#8DAFC3,20
84 | ZHBTc4,1,ESC,ESC,ESC,eSC,"ZHBTc4 undifferentiated mouse embryonic stem cells originated from a male mouse of the 129/Ola strain, and received as frozen ampoules from D. Levasseur (University of Iowa). These cells lack functional endogenous Oct4 alleles and harbor a regulatable Oct4 transgene.",#924965,16
85 | "CD4-positive, CD25-positive, alpha-beta regulatory T cell",1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
86 | megakaryocyte progenitor cell,1,myeloid progenitor,other,other,other,"The megakaryocyte–erythroid progenitor cell (or MEP, or hMEP to specify human) is a cell that gives rise to megakaryocytes and erythrocytes. It is derived from the common myeloid progenitor.",#999999,19
87 | 3134,1,other,other,other,other,cancer cell line: https://web.expasy.org/cellosaurus/CVCL_H641,#999999,19
88 | Muller cell,1,retina,retina,retina,retina,"Müller cells are the principal glial cells of the retina, assuming many of the functions carried out by astrocytes, oligodendrocytes and ependymal cells in other CNS regio",#F8CBAD,14
89 | c-Kit-positive CD71-positive TER-119-positive erythroid progenitor cells,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,,#55A354,9
90 | granulocyte monocyte progenitor cell,1,white blood cells,white blood cells,white blood cells,white blood cells,"Granulocyte-monocyte progenitor (GMP) cells play a vital role in the immune system by maturing into a variety of white blood cells, including neutrophils and macrophages, depending on exposure to cytokines such as various types of colony stimulating factors",#678C69,8
91 | gastrocnemius,1,muscle,muscle,muscle,muscle,,#F9B6CF,22
--------------------------------------------------------------------------------
/emission_and_metadata/data/organ.txt:
--------------------------------------------------------------------------------
1 | spleen #678C69 18
2 | lymph_node #E2EFDA 19
3 | musculature #F9B6CF 11
4 | embryo #0070C0 25
5 | liver #9BC2E6 5
6 | lung #E41A1C 6
7 | limb #F182BC 10
8 | kidney #F4B084 7
9 | intestine #D0A39B 8
10 | stomach #E5BDB5 9
11 | epithelium #7491A2 3
12 | brain #C5912B 1
13 | heart #D56F80 4
14 | immune_organ #678C69 20
15 | gonad #ACB9CA 12
16 | blood #55A354 21
17 | bone_element #375623 22
18 | eye #F8CBAD 13
19 | unknown #999999 26
20 | breast #18A7F0 14
21 | adipose_tissue #A7D5FF 15
22 | connective_tissue #90C4CE 16
23 | adrenal_gland #F4B084 17
24 | extraembryonic_component #69608A 23
25 | spinal_cord #FFD924 2
26 | placenta #786D8C 24
27 |
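The three columns above are organ group, hex color, and display order; `prepare_emission_for_plotting.py` reads them with `sep = '\t'`. Below is a minimal standard-library sketch of turning such rows into the two lookup tables, assuming the file is tab-delimited. The function name `load_organ_table` is hypothetical and not part of the repository; the sample rows are copied from the file.

```python
import csv
import io

def load_organ_table(handle):
    """Return (color_by_organ, order_by_organ) from tab-delimited rows: organ, color, order."""
    color_by_organ = {}
    order_by_organ = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines such as a trailing empty row
        organ, color, order = (field.strip() for field in row[:3])
        color_by_organ[organ] = color
        order_by_organ[organ] = int(order)
    return color_by_organ, order_by_organ

# Assumed sample input: three rows taken from organ.txt plus a trailing blank line.
sample = "brain\t#C5912B\t1\nspinal_cord\t#FFD924\t2\nepithelium\t#7491A2\t3\n\n"
colors, orders = load_organ_table(io.StringIO(sample))

# Sort organs by their numeric display order.
organs_in_display_order = sorted(orders, key=orders.get)
print(organs_in_display_order)
```

Keeping color and order in separate dictionaries mirrors the `ORGAN_GROUP_COLOR` and `ORGAN_ORDER` lookups built in the preparation script.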
--------------------------------------------------------------------------------
/emission_and_metadata/draw_emission_subsetted_states.R:
--------------------------------------------------------------------------------
1 | # Copyright 2021 Ha Vu (havu73@ucla.edu)
2 |
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 |
5 | # The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 |
7 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
8 |
9 | library(tidyverse)
10 | library(tidyr)
11 | library(dplyr)
12 | library(pheatmap)
13 | library(ggplot2)
14 | biosample_color_dict <- c('adipose'= '#AF5B39', 'bone marrow'= '#375623', 'brain'= '#C5912B', 'embryo'= '#0070C0', 'esc'= '#924965', 'fibroblast'= '#C075C3', 'heart'= '#D56F80', 'intestine'= '#D0A39B', 'kidney'= '#F4B084', 'limb'= '#F182BC', 'liver'= '#9BC2E6', 'lung'= '#E41A1C', 'muscle'= '#F9B6CF', 'neuron'= '#FFD924', 'other'= '#999999', 'placenta'= '#69608A', 'red blood cells'= '#55A354', 'retina'= '#F8CBAD', 'spleen'= '#678C69', 'stomach'= '#E5BDB5', 'testis'= '#ACB9CA', 'thymus'= '#C6E0B4', 'white blood cells'= '#678C69')
15 | assay_color_dict <- c('ATAC'= '#E8F484', 'DNase'= '#DBE680', 'H3K27ac'= '#F7CB4D', 'H3K27me3'= '#A6A6A4', 'H3K36me3'= '#49AC5E', 'H3K4me1'= '#EDF732', 'H3K4me2'= '#F8CBAD', 'H3K4me3'= '#F13712', 'H3K79me1'= '#9DCEA8', 'H3K79me2'= '#377A45', 'H3K9ac'= '#F2A626', 'H3K9me3'= '#677BF6', 'H3ac'= '#E5EAE7', 'H3K79me3'= '#95B77E')
16 | state_group_color_dict <- c('HET'= '#b19cd9', 'quiescent'= '#ffffff', 'enhancers'= '#FFA500', 'WkEnh'= '#ffff00', 'znf'= '#7fffd4', 'Tx'= '#006400', 'artifacts'= '#ffffff', 'DNase'= '#fff44f', 'ATAC'= '#f7f3b5', 'Prom'= '#ff4500', 'ReprPC and DNase'= '#d1cf90', 'exon'= '#3cb371', 'BivProm'= '#6a0dad', 'acetylations'= '#fffacd', 'ct_spec enhancers'= '#FFA500', 'ReprPC'= '#C0C0C0', 'weak promoters'= '#800080', 'TxEnh'= '#ADFF2F', 'TSS'= '#FF0000', 'others'= '#fff5ee', 'TxWk'= '#228B22', 'TxEx'= '#9BBB59') # the duplicated 'DNase' entry was removed so that names are unique
17 | num_state <- 100
18 |
19 | read_state_annot_df <- function(state_annot_fn, states_to_plot){
20 | annot_df <- as.data.frame(read.csv(state_annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = '\t'))
21 | tryCatch({
22 | annot_df <- annot_df %>% rename('state_group' = 'group')
23 | }, error = function (e) {message("Column 'group' was not found in annot_df, so column names are left unchanged")})
24 | annot_df <- annot_df[annot_df$state %in% states_to_plot,]
25 | annot_df <- annot_df %>% arrange(state_order_by_group) # order rows based on the index that we get from state_order_by_group column
26 | return(annot_df)
27 | }
28 |
29 | draw_emission_subsetted_states <- function(emission_fn, annot_fn, save_fn, states_to_plot){
30 | emission_df <- as.data.frame(read.csv(emission_fn, sep = '\t', header = TRUE)) %>% arrange(assay_big_group, assay, biosample_big_group, biosample_group, biosample_name)
31 | colnames(emission_df) <- c('experiments', paste0('S', seq(1, num_state)), colnames(emission_df)[(num_state + 2):ncol(emission_df)]) # column 1: experiments, columns 2..(num_state + 1): states, remaining columns: metadata
32 | annot_df <- read_state_annot_df(annot_fn, states_to_plot) # this step will choose only the states that we want to plot
33 | #### getting the plot_df: columns : experiments, rows: states
34 | plot_df <- emission_df %>% select(-c('assay', "biosample", "mark", "mark_color", "assay_big_group", "biosample_name", "biosample_group", "biosample_big_group", "biosample_color")) %>% select(c('experiments', paste0('S', annot_df$state))) # choosing only experiments and the states, which are ordered based on the group, all based on the annot_state_df
35 | plot_df <- as.data.frame(t(plot_df))
36 | colnames(plot_df) <- plot_df[1,] # experiments
37 | plot_df <- plot_df[-1,] # drop the experiments row, which is now the column names
38 | rownames(plot_df) <- annot_df$mneumonics
39 | plot_df <- plot_df %>% mutate_all(as.numeric)
40 | #######
41 | ####### getting the annot_exp_df for experiments ######
42 | annot_exp_df <- emission_df %>% select(c('biosample_group', 'assay'))
43 | rownames(annot_exp_df) <- emission_df$experiments
44 | ############
45 | ####### getting the annot_state_df for states ####
46 | annot_state_df <- annot_df %>% select(c('state_group'))
47 | rownames(annot_state_df) <- annot_df$mneumonics
48 | ############
49 | p <- pheatmap(plot_df, fontsize = 5, annotation_col = annot_exp_df, annotation_row = annot_state_df, annotation_colors = list(biosample_group = biosample_color_dict, assay = assay_color_dict, state_group = state_group_color_dict), cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = FALSE, fontsize_col = 3, angle_col = 90 , cellheight = 5, filename = save_fn)
50 | }
51 |
52 | args = commandArgs(trailingOnly=TRUE)
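53 | if (length(args) < 4) # three file names plus at least one state index are required
54 | {
55 |   stop("Usage: Rscript draw_emission_subsetted_states.R <emission_fn> <annot_fn> <save_fn> <state> [<state> ...]", call.=FALSE)
56 | }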
57 | emission_fn <- args[1] # output from prepare_emission_for_plotting.py
58 | annot_fn <- args[2] # where annotations of states, the state group and state index by groups are stored
59 | #annot_fn <- './data/state_annotation_processed.csv'
60 | #emission_fn <- './data/emissions/emissions_100_for_pheatmap.txt'
61 | #save_fn <- './data/emissions/emissions_100.png'
62 | save_fn <- args[3] # where the figures should be stored
63 | states_to_plot <- as.integer(args[4:length(args)])
64 | print(states_to_plot)
65 | print ('Done reading command line arguments')
66 | draw_emission_subsetted_states(emission_fn, annot_fn, save_fn, states_to_plot)
--------------------------------------------------------------------------------
/emission_and_metadata/draw_heatmap_emission_mm10_full_stack.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "draw_heatmap_emission_mm10_full_stack"
3 | author: "Ha Vu"
4 | date: "9/17/2021"
5 | output: html_document
6 | ---
7 | ```{r}
8 | library(tidyverse)
9 | library(tidyr)
10 | library(dplyr)
11 | library(pheatmap)
12 | library(ggplot2)
13 | library(reshape2) # dcast function
14 | ```
15 |
16 | ```{r}
17 | biosample_color_dict <- c('adipose'= '#AF5B39', 'bone marrow'= '#375623', 'brain'= '#C5912B', 'embryo'= '#0070C0', 'esc'= '#924965', 'fibroblast'= '#C075C3', 'heart'= '#D56F80', 'intestine'= '#D0A39B', 'kidney'= '#F4B084', 'limb'= '#F182BC', 'liver'= '#9BC2E6', 'lung'= '#E41A1C', 'muscle'= '#F9B6CF', 'neuron'= '#FFD924', 'other'= '#999999', 'placenta'= '#69608A', 'red blood cells'= '#55A354', 'retina'= '#F8CBAD', 'spleen'= '#678C69', 'stomach'= '#E5BDB5', 'testis'= '#ACB9CA', 'thymus'= '#C6E0B4', 'white blood cells'= '#678C69')
18 | organ_color_dict <- c('spleen'= '#678C69', 'lymph_node'= '#E2EFDA', 'musculature'= '#F9B6CF', 'embryo'= '#0070C0', 'liver'= '#9BC2E6', 'lung'= '#E41A1C', 'limb'= '#F182BC', 'kidney'= '#F4B084', 'intestine'= '#D0A39B', 'stomach'= '#E5BDB5', 'epithelium'= '#7491A2', 'brain'= '#C5912B', 'heart'= '#D56F80', 'immune_organ'= '#678C69', 'gonad'= '#ACB9CA', 'blood'= '#55A354', 'bone_element'= '#375623', 'eye'= '#F8CBAD', 'unknown'= '#999999', 'breast'= '#18A7F0', 'adipose_tissue'= '#A7D5FF', 'connective_tissue'= '#90C4CE', 'adrenal_gland'= '#F4B084', 'extraembryonic_component'= '#69608A', 'spinal_cord'= '#FFD924', 'placenta'= '#786D8C')
19 | assay_color_dict <- c('ATAC'= '#E8F484', 'DNase'= '#DBE680', 'H3K27ac'= '#F7CB4D', 'H3K27me3'= '#A6A6A4', 'H3K36me3'= '#49AC5E', 'H3K4me1'= '#EDF732', 'H3K4me2'= '#F8CBAD', 'H3K4me3'= '#F13712', 'H3K79me1'= '#9DCEA8', 'H3K79me2'= '#377A45', 'H3K9ac'= '#F2A626', 'H3K9me3'= '#677BF6', 'H3ac'= '#E5EAE7', 'H3K79me3'= '#95B77E')
20 | state_group_color_dict <- c('BivProm'= '#6a0dad', 'OpenC'= '#fff44f', 'HET'= '#b19cd9', 'Prom'= '#ff4500', 'ReprPC'= '#C0C0C0', 'ReprPC and openC'= '#d1cf90', 'TSS'= '#FF0000', 'Tx'= '#006400', 'TxEnh'= '#ADFF2F', 'TxEx'= '#9BBB59', 'TxWk'= '#228B22', 'WkEnh'= '#ffff00', 'ZNF'= '#7fffd4', 'artifacts'= '#fff5ee', 'enhancers'= '#FFA500', 'quiescent'= '#ffffff')
21 | num_state <- 100
22 | ```
23 |
24 | ```{r}
25 | annot_fn <- '../state_annotation_processed.tsv' # tab-separated state annotations stored at the repository root
26 | annot_df <- as.data.frame(read.csv(annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = '\t'))
27 | read_state_annot_df <- function(state_annot_fn){
28 | annot_df <- as.data.frame(read.csv(state_annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = '\t'))
29 | tryCatch({
30 | annot_df <- annot_df %>% rename('state_group' = 'group')
31 | }, error = function (e) {message("Column 'group' was not found in annot_df, so column names are left unchanged")})
32 | annot_df <- annot_df %>% arrange(state_order_by_group) # order rows based on the index that we get from state_order_by_group column
33 | return(annot_df)
34 | }
35 | annot_df <- read_state_annot_df(annot_fn)
36 | head(annot_df)
37 |
38 |
39 | calculate_gap_rows_among_state_groups <- function(state_annot_df){
40 | state_group_ordered_by_appearance <- unique(state_annot_df$state_group) # list of different state groups, ordered by how they appear in the heatmap from top to bottom
41 | count_df <- state_annot_df %>% dplyr::count(state_group)
42 | count_df <- count_df[match(state_group_ordered_by_appearance, count_df$state_group),] # order the rows such that the state_type are ordered based on state_group_ordered_by_appearance
43 | results <- cumsum(count_df$n) # cumulative sum of the count for groups of states, which will be used to generate the gaps between rows of the heatmaps
44 | return(results)
45 | }
46 | ```
47 | Draw the emission matrix right here
48 | ```{r}
49 | emission_fn <- './data/emissions_100_for_pheatmap.txt'
50 | save_fn <- './data/emissions_100.png'
51 | emission_df <- as.data.frame(read.csv(emission_fn, sep = '\t', header = TRUE)) %>% arrange(assay_big_group, assay, organ_order, organ_group)
52 | colnames(emission_df) <- c('experiments', paste0('S', seq(1,num_state)), colnames(emission_df)[(num_state + 2):ncol(emission_df)]) # column 1: experiments, columns 2..(num_state + 1): states, remaining columns: metadata
53 | annot_df <- read_state_annot_df(annot_fn)
54 | #### getting the plot_df: columns : experiments, rows: states
55 | plot_df <- emission_df %>% select(c('experiments', paste0('S', annot_df$state))) # choosing only experiments and the states, which are ordered based on the group, all based on the annot_state_df
56 | plot_df <- as.data.frame(t(plot_df))
57 | colnames(plot_df) <- plot_df[1,] # experiments
58 | plot_df <- plot_df[-1,] # drop the experiments row, which is now the column names
59 | rownames(plot_df) <- annot_df$mneumonics
60 | plot_df <- plot_df %>% mutate_all(as.numeric)
61 | #######
62 | ####### getting the annot_exp_df for experiments ######
63 | annot_exp_df <- emission_df %>% select(c('organ_group', 'assay'))
64 | rownames(annot_exp_df) <- emission_df$experiments
65 | ############
66 | ####### getting the annot_state_df for states ####
67 | annot_state_df <- annot_df %>% select(c('state_group'))
68 | rownames(annot_state_df) <- annot_df$mneumonics
69 | ############
70 | ####### getting gap row indices ####
71 | gap_row_indices <- calculate_gap_rows_among_state_groups(annot_df)
72 | ############
73 | pheatmap(plot_df, fontsize = 5, annotation_col = annot_exp_df, annotation_row = annot_state_df, annotation_colors = list(organ_group = organ_color_dict, assay = assay_color_dict, state_group = state_group_color_dict), gaps_row = gap_row_indices, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = FALSE, fontsize_col = 3, angle_col = 90 , cellheight = 5, filename = save_fn)
74 | ```
75 |
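76 | Inspect intermediate objects. Note that `annot_col_df` and `count_chrom_mark_GROUP` are created in the following chunk, so run that chunk before this one.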
77 | ```{r}
78 | colnames(emission_df)
79 | head(annot_col_df)
80 | head(count_chrom_mark_GROUP)
81 | count_chrom_mark_GROUP$mark
82 | count_chrom_mark_GROUP[,1]
83 | ```
84 |
85 | Plot the number of experiments for each combination of chromatin mark and biosample group
86 | ```{r}
87 | output_folder <- './data/one_plot_per_state/biosample_group'
88 | count_chrom_mark_GROUP <- emission_df %>% select(c('experiments', 'mark', 'biosample_group')) %>% group_by(mark, biosample_group) %>% tally() %>% rename('count' = 'n')
89 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% dcast(mark ~ biosample_group , value.var = 'count') %>% replace(., is.na(.), 0)
90 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% select(c('mark','brain', 'neuron', 'retina', 'heart', 'lung', 'liver', 'stomach', 'intestine', 'kidney', 'adipose', 'muscle', 'fibroblast', 'white blood cells', 'red blood cells', 'bone marrow', 'thymus', 'spleen', 'testis', 'esc', 'placenta', 'embryo', 'other'))
91 | rownames(count_chrom_mark_GROUP) <- count_chrom_mark_GROUP$mark
92 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP[,-1]
93 | annot_col_df <- data.frame(biosample_group = colnames(count_chrom_mark_GROUP))
94 | rownames(annot_col_df) <- as.character(colnames(count_chrom_mark_GROUP))
95 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP[c('ATAC', 'DNase', 'H3K27ac', 'H3K4me1', 'H3K9ac', 'H3K4me2', 'H3K4me3', 'H3K36me3', 'H3K79me2', 'H3K79me3', 'H3ac', 'H3K27me3', 'H3K9me3'), ]
96 | print(count_chrom_mark_GROUP)
97 | annot_row_df <- data.frame(chrom_mark = rownames(count_chrom_mark_GROUP))
98 | rownames(annot_row_df) <- as.character(rownames(count_chrom_mark_GROUP))
99 |
100 | pheatmap(count_chrom_mark_GROUP, fontsize = 6, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, display_numbers = count_chrom_mark_GROUP, number_format = '%s', annotation_col = annot_col_df, annotation_row = annot_row_df, annotation_colors = list(biosample_group = biosample_color_dict, chrom_mark = assay_color_dict), color = colorRampPalette(c('white', 'red'))(300), filename = file.path(output_folder, 'count_categories.png'))
101 | ```
102 |
103 | Draw one figure per state, with experiments grouped by the manually curated biosample groups
104 | ```{r}
105 | for (state_index in seq(num_state)){
106 | state_colname <- paste0("S", state_index)
107 | state_df <- emission_df %>% select(all_of(c(state_colname, "mark", "biosample_group"))) # all_of() makes the selection by an external character vector explicit
108 | avg_emission_df <- state_df %>% group_by(mark, biosample_group) %>% summarise_all(~ mean(.x, na.rm = TRUE)) %>% rename(mean_emission = all_of(state_colname)) # group by mark and biosample_group, then average this state's emission probabilities --> columns: mark, biosample_group, mean_emission
109 | avg_emission_df <- avg_emission_df %>% dcast(mark ~ biosample_group , value.var = 'mean_emission') %>% replace(., is.na(.), 0) # reshape the three-column data frame into a mark x biosample_group table that pheatmap can plot
110 | avg_emission_df <- avg_emission_df %>% select(c('mark','brain', 'neuron', 'retina', 'heart', 'lung', 'liver', 'stomach', 'intestine', 'kidney', 'adipose', 'muscle', 'fibroblast', 'white blood cells', 'red blood cells', 'bone marrow', 'thymus', 'spleen', 'testis', 'esc', 'placenta', 'embryo', 'other'))
111 | chrom_mark_list <- avg_emission_df %>% select('mark') # list of chromatin marks, which will later be the row names for the heatmap
112 | avg_emission_df <- avg_emission_df[,-1] # get rid of the 'chrom_mark' column because it will become the row names
113 | row.names(avg_emission_df) <- chrom_mark_list[,1] # row names are now chrom mark so that can draw a heatmap later
114 | avg_emission_df <- (avg_emission_df[c('ATAC', 'DNase', 'H3K27ac', 'H3K4me1', 'H3K9ac', 'H3K4me2', 'H3K4me3', 'H3K36me3', 'H3K79me2', 'H3K79me3', 'H3ac', 'H3K27me3', 'H3K9me3'), ]) # note that H3K79me1 is not here
115 | this_state_save_fig_fn <- file.path(output_folder, paste0("state", state_index, '_avg_emission.png') )
116 | break_list <- seq (0, 1, by = 0.01)
117 | annot_col_df <- data.frame(cell_GROUP = colnames(avg_emission_df))
118 | rownames(annot_col_df) <- as.character(colnames(avg_emission_df))
119 | annot_row_df <- data.frame(chrom_mark = rownames(avg_emission_df))
120 | rownames(annot_row_df) <- as.character(rownames(avg_emission_df))
121 | pheatmap(avg_emission_df,breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, annotation_col = annot_col_df, annotation_row = annot_row_df, display_numbers = count_chrom_mark_GROUP ,annotation_colors = list(cell_GROUP = biosample_color_dict, chrom_mark = assay_color_dict), filename = this_state_save_fig_fn)
122 | print(paste("Done with state:", state_index))
123 | }
124 | ```
125 |
126 |
127 | ```{r}
128 | output_folder <- './data/one_plot_per_state/organ_slims'
129 | count_chrom_mark_GROUP <- emission_df %>% select(c('experiments', 'mark', 'organ_group')) %>% group_by(mark, organ_group) %>% tally() %>% rename('count' = 'n')
130 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% dcast(mark ~ organ_group , value.var = 'count') %>% replace(., is.na(.), 0)
131 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% select(c('mark','brain', 'spinal_cord','eye', 'heart', 'lung', 'liver', 'stomach', 'intestine', 'kidney', 'adipose_tissue', 'breast', 'connective_tissue', 'epithelium', 'limb', 'musculature', 'blood', 'immune_organ', 'lymph_node', 'spleen', 'bone_element', 'adrenal_gland', 'gonad', 'extraembryonic_component', 'placenta', 'embryo', 'unknown'))
132 | rownames(count_chrom_mark_GROUP) <- count_chrom_mark_GROUP$mark
133 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP[,-1]
134 | annot_col_df <- data.frame(organ_group = colnames(count_chrom_mark_GROUP))
135 | rownames(annot_col_df) <- as.character(colnames(count_chrom_mark_GROUP))
136 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP[c('ATAC', 'DNase', 'H3K27ac', 'H3K4me1', 'H3K9ac', 'H3K4me2', 'H3K4me3', 'H3K36me3', 'H3K79me2', 'H3K79me3', 'H3ac', 'H3K27me3', 'H3K9me3'), ]
137 | print(count_chrom_mark_GROUP)
138 | annot_row_df <- data.frame(chrom_mark = rownames(count_chrom_mark_GROUP))
139 | rownames(annot_row_df) <- as.character(rownames(count_chrom_mark_GROUP))
140 |
141 | pheatmap(count_chrom_mark_GROUP, fontsize = 6, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, display_numbers = count_chrom_mark_GROUP, number_format = '%s', annotation_col = annot_col_df, annotation_row = annot_row_df, annotation_colors = list(organ_group = organ_color_dict, chrom_mark = assay_color_dict), color = colorRampPalette(c('white', 'red'))(300), filename = file.path(output_folder, 'count_categories.png'))
142 |
143 | for (state_index in seq(num_state)){
144 | state_colname <- paste0("S", state_index)
145 | state_df <- emission_df %>% select(all_of(c(state_colname, "mark", "organ_group"))) # all_of() makes the selection by an external character vector explicit
146 | avg_emission_df <- state_df %>% group_by(mark, organ_group) %>% summarise_all(~ mean(.x, na.rm = TRUE)) %>% rename(mean_emission = all_of(state_colname)) # group by mark and organ_group, then average this state's emission probabilities --> columns: mark, organ_group, mean_emission
147 | avg_emission_df <- avg_emission_df %>% dcast(mark ~ organ_group , value.var = 'mean_emission') %>% replace(., is.na(.), 0) # reshape the three-column data frame into a mark x organ_group table that pheatmap can plot
148 | avg_emission_df <- avg_emission_df %>% select(c('mark','brain', 'spinal_cord','eye', 'heart', 'lung', 'liver', 'stomach', 'intestine', 'kidney', 'adipose_tissue', 'breast', 'connective_tissue', 'epithelium', 'limb', 'musculature', 'blood', 'immune_organ', 'lymph_node', 'spleen', 'bone_element', 'adrenal_gland', 'gonad', 'extraembryonic_component', 'placenta', 'embryo', 'unknown'))
149 | chrom_mark_list <- avg_emission_df %>% select('mark') # list of chromatin marks, which will later be the row names for the heatmap
150 | avg_emission_df <- avg_emission_df[,-1] # get rid of the 'chrom_mark' column because it will become the row names
151 | row.names(avg_emission_df) <- chrom_mark_list[,1] # row names are now chrom mark so that can draw a heatmap later
152 | avg_emission_df <- (avg_emission_df[c('ATAC', 'DNase', 'H3K27ac', 'H3K4me1', 'H3K9ac', 'H3K4me2', 'H3K4me3', 'H3K36me3', 'H3K79me2', 'H3K79me3', 'H3ac', 'H3K27me3', 'H3K9me3'), ]) # note that H3K79me1 is not here
153 | this_state_save_fig_fn <- file.path(output_folder, paste0("state", state_index, '_avg_emission.png') )
154 | break_list <- seq (0, 1, by = 0.01)
155 | annot_col_df <- data.frame(organ_group = colnames(avg_emission_df))
156 | rownames(annot_col_df) <- as.character(colnames(avg_emission_df))
157 | annot_row_df <- data.frame(chrom_mark = rownames(avg_emission_df))
158 | rownames(annot_row_df) <- as.character(rownames(avg_emission_df))
159 | pheatmap(avg_emission_df,breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, annotation_col = annot_col_df, annotation_row = annot_row_df, display_numbers = count_chrom_mark_GROUP ,annotation_colors = list(organ_group = organ_color_dict, chrom_mark = assay_color_dict), filename = this_state_save_fig_fn)
160 | print(paste("Done with state:", state_index))
161 | }
162 | ```
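
The row gaps in the full emission heatmap come from `calculate_gap_rows_among_state_groups`: count the states in each group, keep the groups in order of first appearance, and take the cumulative sum. A minimal Python sketch of the same arithmetic follows, using only the standard library. The function name and the example group labels are illustrative assumptions, not code from this repository.

```python
from itertools import accumulate

def gap_rows_between_groups(state_groups):
    """Cumulative row counts per group, with groups ordered by first appearance."""
    counts = {}
    for group in state_groups:  # dicts preserve insertion order (Python 3.7+)
        counts[group] = counts.get(group, 0) + 1
    return list(accumulate(counts.values()))

# Hypothetical ordering of six states across three groups.
groups = ["Prom", "Prom", "TSS", "enhancers", "enhancers", "enhancers"]
print(gap_rows_between_groups(groups))  # [2, 3, 6]
```

Each value is the index of the last row in a group, which is the position after which a gap is drawn. The final value equals the total number of rows.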
--------------------------------------------------------------------------------
/emission_and_metadata/helper.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import string
4 | import os
5 | import sys
6 | import time
7 | def make_dir(directory):
8 |     try:
9 |         os.makedirs(directory)
10 |     except OSError: # typically raised because the folder already exists
11 |         print ( 'Folder ' + directory + ' is already created')
12 |
13 |
14 |
15 | def check_file_exist(fn):
16 |     if not os.path.isfile(fn):
17 |         print ( "File: " + fn + " DOES NOT EXIST")
18 |         sys.exit(1)
19 |     return
20 |
21 | def check_dir_exist(fn):
22 |     if not os.path.isdir(fn):
23 |         print ( "Directory: " + fn + " DOES NOT EXIST")
24 |         sys.exit(1)
25 |     return
26 |
27 | def create_folder_for_file(fn):
28 | last_slash_index = fn.rfind('/')
29 | if last_slash_index != -1: # path contains folder
30 | make_dir(fn[:last_slash_index])
31 | return
32 |
33 | def get_command_line_integer(arg):
34 |     try:
35 |         arg = int(arg)
36 |         return arg
37 |     except (TypeError, ValueError): # arg cannot be converted to an integer
38 |         print ( "Integer: " + str(arg) + " IS NOT VALID")
39 |         sys.exit(1)
40 |
41 |
42 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format
43 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t")
44 | # rename the org_enrichment_df so that it's easier to work with
45 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"})
46 | return enrichment_df
47 |
48 | def get_non_coding_enrichment_df (non_coding_enrichment_fn):
49 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t')
50 | if len(nc_enrichment_df.columns) != 3:
51 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.")
52 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py")
53 | exit(1)
54 | # Now, we know that the nc_enrichment_df has exactly 3 columns
55 | # change the column names
56 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"]
57 | return (nc_enrichment_df)
58 |
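`create_folder_for_file` above finds the parent folder by searching for the last `/` in the path. As a hedged alternative sketch, the same behavior can be written with `os.path.dirname` and `os.makedirs(..., exist_ok=True)`, which avoids manual string slicing and does not need a `try`/`except` for folders that already exist. The return value and the temporary paths are illustrative additions, not part of the original helper.

```python
import os
import tempfile

def create_folder_for_file(fn):
    """Create the parent directory of fn if the path has one; existing folders are fine."""
    parent = os.path.dirname(fn)
    if parent:  # an empty string means fn is a bare file name
        os.makedirs(parent, exist_ok=True)
    return parent

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output path, modeled on the folders used by the plotting notebook.
    target = os.path.join(tmp, "one_plot_per_state", "organ_slims", "state1.png")
    created = create_folder_for_file(target)
    create_folder_for_file(target)  # a second call is a no-op rather than an error
    folder_existed = os.path.isdir(created)

print(folder_existed)
```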
--------------------------------------------------------------------------------
/emission_and_metadata/prepare_emission_for_plotting.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pandas as pd
3 | import numpy as np
4 | emission_fn = './data/emissions_100.txt'
5 | emission_df = pd.read_csv(emission_fn, header = 0, index_col = None, sep = '\t')
6 | emission_df = emission_df.T # rows: experiments, columns: states
7 | emission_df.reset_index(inplace = True) # experiments are now a column
8 | emission_df.columns = emission_df.iloc[0] # first row is the states --> become headers
9 | emission_df = emission_df.rename(columns={'State (Emission order)': 'experiments'})
10 | emission_df = emission_df.drop(0, axis = 0) # after first row becomes headers, drop first row
11 | # experiments are of form hindbrain_ATAC-seq_ENCSR662KNY
12 | emission_df['assay'] = emission_df['experiments'].apply(lambda x: x.split('_')[-2].split('-')[0])
13 | emission_df['biosample'] = emission_df['experiments'].apply(lambda x: x.split('_')[0])
14 | emission_df['expID'] = emission_df['experiments'].apply(lambda x: x.split('_')[2])
15 | assay_meta_fn = './data/assay_count.csv'
16 | biosample_meta_fn = './data/biosample_term_name_counts.csv'
17 | assay_df = pd.read_csv(assay_meta_fn, header = 0, index_col = None, sep = ',')[['mark', 'color', 'big_group']]
18 | assay_df = assay_df.rename(columns = {'color': 'mark_color', 'big_group': 'assay_big_group'})
19 | print(dict(assay_df.groupby(['mark', 'mark_color']).groups.keys()))
20 | biosample_df = pd.read_csv(biosample_meta_fn, header = 0, index_col = None, sep = ',')[['biosample_name', 'group', 'big_group', 'color']]
21 | biosample_df['group'] = biosample_df['group'].apply(lambda x: x.lower())
22 | biosample_df = biosample_df.rename(columns = {'color': 'biosample_color', 'group': 'biosample_group', 'big_group': 'biosample_big_group'}) # the repeated 'color' key was removed
23 | biosample_df['biosample_group'] = biosample_df['biosample_group'].replace('', 'other')
24 | print(dict(biosample_df.groupby(['biosample_group', 'biosample_color']).groups.keys()))
25 |
26 | meta_fn = './data/slims_metadata.txt'
27 | meta_df = pd.read_csv(meta_fn, header = 0, index_col = None, sep = '\t') # 'experimentID', 'organ_slims', 'cell_slims', 'developmental_slims', 'system_slims', 'biosample_summary', 'simple_biosample_summary'
28 | meta_df['organ_group'] = meta_df['organ_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('musculature_of_body', 'musculature').replace('', 'unknown')
29 | meta_df['cell_group'] = meta_df['cell_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('', 'unknown')
30 | meta_df['development_group'] = meta_df['developmental_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('', 'unknown') # The ectoderm gives rise to the skin and the nervous system. The mesoderm specifies the development of several cell types such as bone, muscle, and connective tissue. Cells in the endoderm layer become the linings of the digestive and respiratory system, and form organs such as the liver and pancreas
31 | meta_df['system_group'] = meta_df['system_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('', 'unknown').apply(lambda x: x.split('_system')[0])
32 |
33 | # PROCESSING METADATA ABOUT THE ORGANS AND DEVELOPMENTAL STAGES
34 | organ_meta_fn = './data/organ.txt'
35 | organ_meta_df = pd.read_csv(organ_meta_fn, header = None, index_col = None, sep = '\t') # organ, color, order
36 | ORGAN_GROUP_COLOR = dict(zip(organ_meta_df[0], organ_meta_df[1])) # keys: organ, values: color
37 | print(ORGAN_GROUP_COLOR)
38 | ORGAN_ORDER = dict(zip(organ_meta_df[0], organ_meta_df[2])) # keys: organ, values: organ numbered order
39 | meta_df['organ_color'] = meta_df['organ_group'].apply(lambda x: ORGAN_GROUP_COLOR[x])
40 | meta_df['organ_order'] = meta_df['organ_group'].apply(lambda x: ORGAN_ORDER[x])
41 | meta_df.drop(labels = ['cell_slims', 'organ_slims', 'system_slims'], axis = 1, inplace = True)
42 |
43 | print(emission_df.columns)
44 | print(biosample_df.columns)
45 | emission_df = emission_df.merge(assay_df, how = 'left', left_on = 'assay', right_on = 'mark')
46 | emission_df = emission_df.merge(biosample_df, how = 'left', left_on = 'biosample', right_on = 'biosample_name')
47 | print(emission_df[emission_df['biosample_group'].isnull()])
48 | print() # blank separator line (a bare `print` does nothing in Python 3)
49 | emission_df = emission_df.merge(meta_df, how = 'left', left_on = 'expID', right_on = 'experimentID')
50 | emission_df['biosample_group'] = emission_df['biosample_group'].replace('', 'other')
51 | save_fn = './data/emissions_100_for_pheatmap.txt'
52 | emission_df.to_csv(save_fn, header = True, index = False, sep = '\t')
53 |
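The script above parses experiment names of the form `hindbrain_ATAC-seq_ENCSR662KNY` by position, and extracts the first entry of each slims column by slicing characters. The sketch below shows a more defensive version of both steps using only the standard library. It assumes the slims columns hold Python-list-like strings such as `"['brain', 'eye']"`, which is inferred from the slicing logic rather than confirmed by the data. The function names are hypothetical. Note that, unlike `x.split('_')[0]`, splitting from the right keeps any underscores inside the biosample name.

```python
import ast

def parse_experiment_name(name):
    """Split '<biosample>_<assay>-seq_<experiment ID>' from the right."""
    biosample, assay_field, exp_id = name.rsplit("_", 2)
    return {"biosample": biosample, "assay": assay_field.split("-")[0], "expID": exp_id}

def first_slim(slims_text, default="unknown"):
    """Return the first entry of a list-like string, with spaces replaced by underscores."""
    items = ast.literal_eval(slims_text)
    if not items:
        return default
    return "_".join(items[0].split())

print(parse_experiment_name("hindbrain_ATAC-seq_ENCSR662KNY"))
print(first_slim("['musculature of body', 'limb']"))
print(first_slim("[]"))
```

`ast.literal_eval` raises an error on malformed input instead of silently returning a truncated string, which can make metadata problems easier to spot.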
--------------------------------------------------------------------------------
/emission_and_metadata/rank_chromMark_cellType_emission.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import sys
4 | import os
5 | import helper
6 |
7 | def prepare_color_dict():
8 |     assay_meta_fn = './data/assay_count.csv'
9 |     organ_meta_fn = './data/organ.txt'
10 | assay_df = pd.read_csv(assay_meta_fn, header = 0, index_col = None, sep = ',')[['mark', 'color', 'big_group']]
11 | assay_df = assay_df.rename(columns = {'color': 'mark_color', 'big_group': 'assay_big_group'})
12 | assay_color_dict = dict(assay_df.groupby(['mark', 'mark_color']).groups.keys()) # keys: assay, values: colors
13 | organ_df = pd.read_csv(organ_meta_fn, header = None, index_col = None, sep = '\t')
14 | organ_df.columns = ['group', 'color', 'order']
15 | organ_df['group'] = organ_df['group'].apply(lambda x: x.lower())
16 | organ_color_dict = dict(organ_df.groupby(['group', 'color']).groups.keys()) # keys: organ group (lowercased), values: color
17 | return assay_color_dict, organ_color_dict
18 |
19 | ASSAY_COLOR_DICT, organ_COLOR_DICT = prepare_color_dict()
20 |
21 | def get_ranked_mark_name (row_data):
22 | sorted_rank = row_data.sort_values(ascending = False) # rank n (most emitted) --> rank 1 (least emitted)
23 | return pd.Series(sorted_rank.index) # the index lists the experiments, ordered from most emitted to least emitted in each state. Converted to pd.Series so that the rows can be concatenated into a data frame later.
24 |
25 | def get_ranked_exp_df (fn) : # fn should be emission_fn
26 | df = pd.read_csv(fn, header = 0, sep = '\t') # get emission df
27 | df = df.rename(columns = {'State (Emission order)' : 'state'}) # one column is state, others are all experiment names
28 | df.index = df['state'] # state column becomes the index of the dataframe
29 | df = df[df.columns[1:]] # get rid of state column so that we can rank the emission probabilities of each state
30 | rank_df = df.rank(axis = 1) # index: state, columns: experiments ordered just like in df. Cells: rank of each experiment within each state, 1: least emission --> n: highest emission
31 | rank_df = rank_df.apply(get_ranked_mark_name, axis = 1) # index; state, columns: rank 1 --> n most emitted experiment to least emitted experiments
32 | rank_df.columns = map(lambda x: 'r_' + str(x + 1), range(len(rank_df.columns)))
33 | return rank_df
34 |
35 |
36 | def color_organ_names(val):
37 | if val == "":
38 | color = organ_COLOR_DICT["NA"]
39 | else:
40 | color = organ_COLOR_DICT[val]
41 | return 'background-color: %s' % color
42 |
43 | def color_mark_names(val):
44 | if val == "":
45 | color = ASSAY_COLOR_DICT['NaN']
46 | else:
47 | color = ASSAY_COLOR_DICT[val]
48 | return 'background-color: %s' % color
49 |
50 | def read_exp_organ_dict():
51 | meta_fn = './data/slims_metadata.txt'
52 | meta_df = pd.read_csv(meta_fn, header = 0, index_col = None, sep = '\t') # 'experimentID', 'organ_slims', 'cell_slims', 'developmental_slims', 'system_slims', 'biosample_summary', 'simple_biosample_summary'
53 | meta_df['organ_group'] = meta_df['organ_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('musculature_of_body', 'musculature').replace('', 'unknown')
54 | return dict(zip(meta_df.experimentID, meta_df.organ_group))
55 |
56 | def get_painted_excel_ranked_exp(rank_df, output_fn, num_top_marks, organ_color_dict, assay_color_dict):
57 | EXP_ORGAN_DICT = read_exp_organ_dict()
58 | organ_df = rank_df.applymap(lambda x: EXP_ORGAN_DICT[x.split('_')[2]]) # first convert the data to only contain the experimentID, then convert from experimentID to ORGAN
59 | organ_df.reset_index(inplace = True)
60 | chrom_mark_df = rank_df.applymap(lambda x: x.split('_')[-2].split('-')[0]) # H3K9me3
61 | chrom_mark_df.reset_index(inplace = True)
62 | columns_to_paint = list(map(lambda x: 'r_' + str(x + 1), range(num_top_marks)))
63 | organ_df = organ_df[['state'] + columns_to_paint]
64 | chrom_mark_df = chrom_mark_df[['state'] + columns_to_paint]
65 | colored_organ_df = organ_df.style.applymap(color_organ_names, subset = columns_to_paint)
66 | colored_chrom_mark_df = chrom_mark_df.style.applymap(color_mark_names, subset = columns_to_paint)
67 | # save file
68 | writer = pd.ExcelWriter(output_fn, engine='xlsxwriter')
69 | colored_organ_df.to_excel(writer, sheet_name = 'cell_group')
70 | colored_chrom_mark_df.to_excel(writer, sheet_name = 'chrom_mark')
71 | writer.close() # ExcelWriter.save() was removed in pandas 2.0; close() also writes the file to disk
72 | print ("Done saving data into " + output_fn)
73 |
74 |
75 |
76 | def main():
77 | if len(sys.argv) != 4:
78 | usage()
79 | emission_fn = sys.argv[1]
80 | output_fn = sys.argv[2]
81 | helper.create_folder_for_file(output_fn)
82 | num_top_marks = helper.get_command_line_integer(sys.argv[3])
83 | print ('Done getting command line argument')
84 | assay_color_dict, organ_color_dict = prepare_color_dict()
85 | rank_df = get_ranked_exp_df(emission_fn) # index: states, columns: rank of experiments r_1: highest emitted, r_n: lowest emitted --> cells: names of experiments. Example: E118-H3K9me3
86 | print ("Done getting ranked data")
87 | get_painted_excel_ranked_exp(rank_df, output_fn, num_top_marks, organ_color_dict, assay_color_dict)
88 |
89 |
90 |
91 | def usage():
92 | print ("python rank_chromMark_cellType_emission.py <emission_fn> <output_fn> <num_top_marks>")
93 | print ("emission_fn: should be ./data/emissions_100.txt")
94 | print ("output_fn: excel file where the ranked cell types and chromatin marks are stored")
95 | print ("num_top_marks: number of top marks that we want to report, recommended value: 100")
96 | exit(1)
97 |
98 | main()
99 |
--------------------------------------------------------------------------------
/gene_exp_analysis/._investigate_lecif_and_phastCons.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/gene_exp_analysis/._investigate_lecif_and_phastCons.py
--------------------------------------------------------------------------------
/gene_exp_analysis/README.md:
--------------------------------------------------------------------------------
1 | This folder contains code to create figure 2C in the manuscript. Within this folder:
2 |
3 | - ```helper.py```: script containing helper functions for file management.
4 | - ```get_avg_gene_per_state.py```: this script will calculate the average gene expression per mouse full-stack state, in each available cell type. Please refer to the Methods section "Average gene expression associated with each full-stack state" for full details about how the average expression is calculated.
5 | - ```avg_gene_exp_per_state_per_ct.txt.gz```: the output of the script ```get_avg_gene_per_state.py```. Inside this file, each row corresponds to a state (raw state names: E1 --> E100), each column corresponds to a sample. This file will then be used to plot figure 2C using the R markdown code ```plot_avg_gene_exp_heatmap.Rmd```
6 | ```
7 | usage: get_avg_gene_per_state.py [-h] [--gene_exp_folder GENE_EXP_FOLDER]
8 | [--segment_fn SEGMENT_FN]
9 | [--num_chromHMM_state NUM_CHROMHMM_STATE]
10 | [--output_fn OUTPUT_FN]
11 | [--state_annot_fn STATE_ANNOT_FN]
12 |
13 | Calculating avg gene expression in each state, in each available cell
14 | types. The gene_exp_folder should be where we store BingRen's gene
15 | expression data for mouse. segment_fn should be from ChromHMM model for the
16 | mouse
17 |
18 | options:
19 | -h, --help show this help message and exit
20 | --gene_exp_folder GENE_EXP_FOLDER
21 | Where the gene exp data for different cell types
22 | are stored. Each file in this folder is named in
23 | the format -.expr, where the
24 | provided gene expression data is in FPKM unit. This
25 | data can be downloaded and unzipped from http://chr
26 | omosome.sdsc.edu/mouse/download/19-tissues-expr.zip
27 | from Shen et al., 2012, 'A map of the cis-
28 | regulatory sequences in the mouse genome', Nature.
29 | On Ha's system:
30 | /data/ENCODE/mouse/gene_exp/19-tissues-expr/.
31 | --segment_fn SEGMENT_FN
32 | segment_fn. NOTE: Because our gene expression data
33 | is in mm9, we will need to pass the segment_fn in
34 | mm9 here
35 | --num_chromHMM_state NUM_CHROMHMM_STATE
36 | number of chromHMM states
37 | --output_fn OUTPUT_FN
38 | output_fn
39 | --state_annot_fn STATE_ANNOT_FN
40 | state_annot_fn
41 | ```
42 | - ```plot_avg_gene_exp_heatmap.Rmd```: Rmarkdown file that will plot figure 2C.
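
The averaging done in ```get_avg_gene_per_state.py``` can be sketched without pandas or pybedtools as below. This is a minimal illustration, not the script itself: the function name `weighted_avg_log_fpkm` and the toy intervals are hypothetical, and the sketch assumes each input record is one already-computed overlap between a state segment and a gene. Each overlap contributes log(FPKM + 1), weighted by the fraction of the gene's 200-bp bins that fall in that state.

```python
import math
from collections import defaultdict

SEGMENT_LENGTH = 200  # bp per ChromHMM bin, same constant as in the script


def weighted_avg_log_fpkm(overlaps):
    """Hypothetical helper: overlaps is an iterable of
    (state, seg_start, seg_end, gene_start, gene_end, fpkm) tuples."""
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for state, s_start, s_end, g_start, g_end, fpkm in overlaps:
        # bins shared by the gene and the state segment, rounded up
        shared_bp = min(s_end, g_end) - max(s_start, g_start)
        shared_bins = math.ceil(shared_bp / SEGMENT_LENGTH)
        gene_bins = (g_end - g_start) / SEGMENT_LENGTH
        weight = shared_bins / gene_bins
        weighted_sum[state] += math.log(fpkm + 1) * weight
        total_weight[state] += weight
    return {s: weighted_sum[s] / total_weight[s] for s in weighted_sum}


# Toy data (made up for illustration only)
toy = [
    ("E1", 0, 400, 0, 1000, math.e - 1),    # log value 1, 2 of 5 gene bins
    ("E1", 2000, 2200, 2000, 2200, 0.0),    # log value 0, 1 of 1 gene bin
    ("E2", 400, 1000, 0, 1000, math.e - 1), # log value 1, 3 of 5 gene bins
]
result = weighted_avg_log_fpkm(toy)
print(result)
```

In the toy data, state E1 averages one highly expressed gene (weight 0.4) with one silent gene (weight 1.0), so its value is pulled towards zero, while E2 only overlaps the expressed gene.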
43 |
44 | # License:
45 | All code is provided under the MIT Open Access License
46 | Copyright 2022 Ha Vu and Jason Ernst
47 |
48 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
49 |
50 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
51 |
52 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
53 |
54 |
--------------------------------------------------------------------------------
/gene_exp_analysis/avg_gene_exp_per_state_per_ct.txt.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/gene_exp_analysis/avg_gene_exp_per_state_per_ct.txt.gz
--------------------------------------------------------------------------------
/gene_exp_analysis/ct_name.csv:
--------------------------------------------------------------------------------
1 | boneMarrow
2 | brain
3 | cerebellum
4 | cortex
5 | heart
6 | intestine
7 | intestine_
8 | kidney
9 | limb
10 | liver
11 | lung
12 | mESC
13 | mef_male
14 | olfactory
15 | placenta
16 | spleen
17 | testes
18 | thymus
19 |
--------------------------------------------------------------------------------
/gene_exp_analysis/get_avg_gene_per_state.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import pandas as pd
4 | import numpy as np
5 | import glob
6 | import pybedtools as bed
7 | import helper
8 | parser = argparse.ArgumentParser(description='Calculating avg gene expression in each state, in each available cell types. The gene_exp_folder should be where we store BingRen\'s gene expression data for mouse. segment_fn should be from ChromHMM model for the mouse')
9 | parser.add_argument('--gene_exp_folder', type=str,
10 | help='Where the gene exp data for different cell types are stored')
11 | parser.add_argument('--segment_fn', type=str,
12 | help='segment_fn. NOTE: Because our gene expression data is in mm9, we will need to pass the segment_fn in mm9 here')
13 | parser.add_argument('--num_chromHMM_state', default=100, type=int,
14 | help='number of chromHMM states')
15 | parser.add_argument('--output_fn', type=str,
16 | help = 'output_fn')
17 | parser.add_argument('--state_annot_fn', type=str,
18 | help = 'state_annot_fn')
19 | args = parser.parse_args()
20 | print (args)
21 | helper.check_dir_exist(args.gene_exp_folder)
22 | helper.check_file_exist(args.segment_fn)
23 | helper.create_folder_for_file(args.output_fn)
24 | helper.check_file_exist(args.state_annot_fn)
25 | SEGMENT_LENGTH = 200
26 |
27 | def open_gene_exp_data_one_ct(gene_exp_fn):
28 | df = pd.read_csv(gene_exp_fn, header = 0, index_col = None, sep = '\t')
29 | df = df[['chr', 'left', 'right', 'FPKM']]
30 | df['FPKM'] = df['FPKM'].apply(lambda x: np.log(x+1))
31 | df.columns = ['chrom', 'start', 'end', 'logFPKM']
32 | return df
33 |
34 | def get_avg_gene_exp_one_ct(segment_fn, gene_exp_fn, num_chromHMM_state):
35 | segment_bed = bed.BedTool(segment_fn)
36 | exp_df = open_gene_exp_data_one_ct(gene_exp_fn)
37 | exp_bed = bed.BedTool.from_dataframe(exp_df)
38 | segment_bed = segment_bed.intersect(exp_bed, wa = True, wb = True)
39 | comb_df = segment_bed.to_dataframe()
40 | comb_df.columns = ['sChrom', 'sStart', 'sEnd', 'state', 'eChrom', 'eStart', 'eEnd', 'logFPKM']
41 | comb_df['gene_length'] = comb_df['eEnd'] - comb_df['eStart']
42 | comb_df['gene_length_segments'] = comb_df['gene_length'] / SEGMENT_LENGTH
43 | comb_df['com_start'] = comb_df[['sStart', 'eStart']].max(axis = 1) #start of the region we are interested is the start of either the gene or the segment of chromatin state. Whichever is greater is the start of the intersection.
44 | comb_df['com_end'] = comb_df[['sEnd', 'eEnd']].min(axis = 1) # same argument as above with the end of the region
45 | comb_df['num_segments'] = (comb_df['com_end'] - comb_df['com_start']) / SEGMENT_LENGTH # number of segments that are shared between the genes and the chromatin state
46 | comb_df['num_segments'] = comb_df['num_segments'].apply(np.ceil) # round up the number of segments
47 | # Number of segments is the number of basepair in the region we are interested in / number of basepairs per segment bin
48 | comb_df = comb_df[[u'state', u'gene_length', u'com_start', u'com_end', u'num_segments', 'gene_length_segments', 'logFPKM']]
49 | # now onto processing the gene expression data
50 | result_avg_exp_S = pd.Series([], dtype = float)
51 | for state_index in range(num_chromHMM_state):
52 | one_based_state_index = state_index + 1
53 | this_state_org_df = (comb_df[comb_df['state'] == 'E' + str(one_based_state_index)])
54 | bp_unif_exp_S = this_state_org_df['logFPKM'] * this_state_org_df['num_segments'] / this_state_org_df['gene_length_segments'] # a pandas series where each entry correspond to a position on the genome annotated as this state
55 | total_weighted_segments = np.sum(this_state_org_df['num_segments'] / this_state_org_df['gene_length_segments']) # a number
56 | avg_exp_bp_unif = bp_unif_exp_S.sum() / total_weighted_segments
57 | result_avg_exp_S['E' + str(one_based_state_index)] = avg_exp_bp_unif
58 | return result_avg_exp_S # a pandas series with index: E1--> E100, values: avg gene expression in each state in the current cell type
59 |
60 | def get_cell_name_from_fn(fn):
61 | fn = fn.split('/')[-1].split('.expr')[0].split('.gene')[0]
62 | first_fn = fn.split('-')[0]
63 | if first_fn.endswith('1') or first_fn.endswith('2') :
64 | return first_fn
65 | last_fn = fn.split('-')[-1]
66 | if (last_fn == '1' or last_fn == '2' or last_fn == '3') and first_fn != 'heart' and first_fn != 'liver':
67 | return first_fn + last_fn
68 | if (first_fn == 'heart' or first_fn == 'liver'):
69 | return '_'.join(fn.split('-'))
70 | else:
71 | return '_'.join(fn.split('-')[:2])
72 | return ''
73 |
74 | def get_avg_gene_exp_all_ct(segment_fn, gene_exp_folder, num_chromHMM_state, output_fn):
75 | gene_exp_fn_list = glob.glob(gene_exp_folder+'/*.expr')
76 | ct_list = list(map(get_cell_name_from_fn, gene_exp_fn_list))
77 | result_index = list(map(lambda x: 'E{}'.format(x+1), range(num_chromHMM_state)))
78 | result_avg_exp_df = pd.DataFrame(columns = ct_list, index = result_index)
79 | for ct_index, ct in enumerate(ct_list):
80 | result_avg_exp_df[ct] = get_avg_gene_exp_one_ct(segment_fn, gene_exp_fn_list[ct_index], num_chromHMM_state)
81 | result_avg_exp_df.to_csv(output_fn, header = True, index = True, sep = '\t', compression = 'gzip')
82 | print('Done!')
83 | return
84 |
85 | get_avg_gene_exp_all_ct(args.segment_fn, args.gene_exp_folder, args.num_chromHMM_state, args.output_fn)
--------------------------------------------------------------------------------
/gene_exp_analysis/helper.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import string
4 | import os
5 | import sys
6 | import time
7 | def make_dir(directory):
8 | try:
9 | os.makedirs(directory)
10 | except:
11 | print ( 'Folder ' + directory + ' is already created')
12 |
13 |
14 |
15 | def check_file_exist(fn):
16 | if not os.path.isfile(fn):
17 | print ( "File: " + fn + " DOES NOT EXIST")
18 | exit(1)
19 | return
20 |
21 | def check_dir_exist(fn):
22 | if not os.path.isdir(fn):
23 | print ( "Directory: " + fn + " DOES NOT EXIST")
24 | exit(1)
25 | return
26 |
27 | def create_folder_for_file(fn):
28 | last_slash_index = fn.rfind('/')
29 | if last_slash_index != -1: # path contains folder
30 | make_dir(fn[:last_slash_index])
31 | return
32 |
33 | def get_command_line_integer(arg):
34 | try:
35 | arg = int(arg)
36 | return arg
37 | except:
38 | print ( "Integer: " + str(arg) + " IS NOT VALID")
39 | exit(1)
40 |
41 |
42 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format
43 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t")
44 | # rename the org_enrichment_df so that it's easier to work with
45 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"})
46 | return enrichment_df
47 |
48 | def get_non_coding_enrichment_df (non_coding_enrichment_fn):
49 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t')
50 | if len(nc_enrichment_df.columns) != 3:
51 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.")
52 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py")
53 | exit(1)
54 | # Now, we know that the nc_enrichment_df has exactly 3 columns
55 | # change the column names
56 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"]
57 | return (nc_enrichment_df)
58 |
--------------------------------------------------------------------------------
/gene_exp_analysis/plot_avg_gene_exp_heatmap.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "draw_avg_gene_exp_heatmap"
3 | author: "Ha Vu"
4 | date: "3/30/2022"
5 | output: html_document
6 | ---
7 |
8 | ```{r}
9 | library(ggplot2)
10 | library(dplyr)
11 | library(tidyr)
12 | library(pheatmap)
13 | library(RColorBrewer)
14 | ```
15 |
16 | ```{r}
17 | state_annot_fn <- '../state_annotation_processed.tsv'
18 | read_state_annot_df <- function(state_annot_fn){
19 | annot_df <- as.data.frame(read.csv(state_annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = '\t'))
20 | tryCatch({
21 | annot_df <- annot_df %>% rename('state_type' = 'group')
22 | }, error = function (e) {message("tried to change column names in annot_df and nothing worth worrying happened")})
23 | annot_df <- annot_df %>% arrange(state_order_by_group) # order rows based on the index that we get from state_order_by_group column
24 | return(annot_df)
25 | }
26 | state_annot_df <- read_state_annot_df(state_annot_fn)
27 | state_color_df <- unique(state_annot_df%>% select(c('state_type', 'color')))
28 | STATE_COLOR_MAPPING <- setNames(state_color_df$color, state_color_df$state_type)
29 | CELL_TYPE_COLOR_MAPPING <- c('cerebellum'= '#C5912B', 'heart'= '#D56F80', 'intestine'= '#D0A39B', 'testes'= '#ACB9CA', 'spleen'= '#678C69', 'thymus'= '#C6E0B4', 'olfactory'= '#999999', 'mESC'= '#924965', 'mef_male'= '#C075C3', 'mef'= '#C075C3', 'brain'= '#C5912B', 'lung'= '#E41A1C', 'cortex'= '#C5912B', 'limb'= '#F182BC', 'placenta'= '#69608A', 'boneMarrow'= '#375623', 'kidney'= '#F4B084', 'liver'= '#9BC2E6')
30 | calculate_gap_rows_among_state_groups <- function(state_annot_df){
31 | state_group_ordered_by_appearance <- unique(state_annot_df$state_type) # list of different state groups, ordered by how they appear in the heatmap from top to bottom
32 | count_df <- state_annot_df %>% count(state_type)
33 | count_df <- count_df[match(state_group_ordered_by_appearance, count_df$state_type),] # order the rows such that the state_type are ordered based on state_group_ordered_by_appearance
34 | results <- cumsum(count_df$n) # cumulative sum of the count for groups of states, which will be used to generate the gaps between rows of the heatmaps
35 | return(results)
36 | }
37 |
38 | get_cell_group_name <- function(cg){
39 | cg <- unlist(strsplit(cg, "_"))[1]
40 | if (endsWith(cg, '1') | endsWith(cg, '2')){
41 | cg <- substr(cg,1,nchar(cg)-1)
42 | }
43 | return(cg)
44 | }
45 |
46 | get_file_name_without_tail <- function(fn){
47 | fn_no_tail <- unlist(strsplit(fn, "[.]"))[1]
48 | return(fn_no_tail)
49 | }
50 |
51 | get_cell_type_name <- function(cg){ # just get rid of the replicate suffix (1, 2, _1 or _2) at the end of the cell type name
52 | if (endsWith(cg, '_1') | endsWith(cg, '_2')){
53 | cg <- substr(cg,1,nchar(cg)-2)
54 | }
55 |
56 | if (endsWith(cg, '1') | endsWith(cg, '2')){
57 | cg <- substr(cg,1,nchar(cg)-1)
58 | }
59 | return(cg)
60 | }
61 | ```
62 |
63 | ```{r}
64 | fn <- './avg_gene_exp_per_state_per_ct.txt.gz'
65 | df <- read.table(fn, sep = '\t', header = TRUE, row.names = 1)
66 | df <- as.data.frame(t(df))
67 | df$ct <- sapply(rownames(df), get_cell_type_name)
68 | # get the mean expression per cell type per state across the different replicates
69 | df <- df %>% group_by(ct) %>% summarise(across(everything(), mean))
70 | ct_list <- df$ct
71 | df <- df %>% select(-c('ct'))
72 | df <- as.data.frame(t(df)) # transpose again to its original form
73 | colnames(df) <- ct_list
74 | # rearrange the rows such that states of the same group are put together
75 | state_annot_df <- read_state_annot_df(state_annot_fn)
76 | df <- df %>% slice(state_annot_df$state)
77 | rownames(df) <- state_annot_df$mneumonics
78 | # create the df for annotation of rows in our heatmap
79 | rownames(state_annot_df) <- state_annot_df$mneumonics
80 | state_annot_df <- state_annot_df %>% select(state_type)
81 | gap_row_indices <- calculate_gap_rows_among_state_groups(state_annot_df)
82 | # create the df for annotations of columns in our heatmap
83 | cell_group_df <- data.frame(ct = colnames(df))
84 | cell_group_df$group <- sapply(cell_group_df$ct, get_cell_group_name)
85 | # reorder the groups of cell types such that tissues that are similar are closer to each other
86 | ordered_cg_list <- c('brain', 'cerebellum', 'cortex', 'mESC', 'mef', 'mef_male', 'placenta', 'boneMarrow', 'heart', 'intestine', 'kidney', 'liver', 'lung', 'spleen', 'testes', 'thymus', 'limb', 'olfactory')
87 | ordered_cg_df <- data.frame(ct = character(), group = character())
88 | for (cg in ordered_cg_list){
89 | this_cg_df <- cell_group_df %>% filter(group == cg)
90 | ordered_cg_df <- bind_rows(ordered_cg_df, this_cg_df)
91 | }
92 | rownames(ordered_cg_df) <- ordered_cg_df$ct
93 | ordered_cg_df <- ordered_cg_df %>% select(group)
94 | df <- df %>% select(rownames(ordered_cg_df)) # reorder the columns of df (df of average gene_exp)
95 | break_list <- seq (0.1, 2.8, by = 0.1)
96 | fn_no_tail <- get_file_name_without_tail(fn)
97 | save_fn <- paste0(fn_no_tail, ".png")
98 | pheatmap(df, breaks = break_list, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(break_list)), annotation_row = state_annot_df, annotation_col = ordered_cg_df, annotation_colors = list(state_type = STATE_COLOR_MAPPING, group = CELL_TYPE_COLOR_MAPPING), gaps_row = gap_row_indices, fontsize =4.5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, filename = save_fn, cellwidth= 5)
99 | ```
--------------------------------------------------------------------------------
/helper.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import string
4 | import os
5 | import sys
6 | import time
7 | def make_dir(directory):
8 | try:
9 | os.makedirs(directory)
10 | except:
11 | print ( 'Folder ' + directory + ' is already created')
12 |
13 |
14 |
15 | def check_file_exist(fn):
16 | if not os.path.isfile(fn):
17 | print ( "File: " + fn + " DOES NOT EXIST")
18 | exit(1)
19 | return
20 |
21 | def check_dir_exist(fn):
22 | if not os.path.isdir(fn):
23 | print ( "Directory: " + fn + " DOES NOT EXIST")
24 | exit(1)
25 | return
26 |
27 | def create_folder_for_file(fn):
28 | last_slash_index = fn.rfind('/')
29 | if last_slash_index != -1: # path contains folder
30 | make_dir(fn[:last_slash_index])
31 | return
32 |
33 | def get_command_line_integer(arg):
34 | try:
35 | arg = int(arg)
36 | return arg
37 | except:
38 | print ( "Integer: " + str(arg) + " IS NOT VALID")
39 | exit(1)
40 |
41 |
42 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format
43 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t")
44 | # rename the org_enrichment_df so that it's easier to work with
45 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"})
46 | return enrichment_df
47 |
48 | def get_non_coding_enrichment_df (non_coding_enrichment_fn):
49 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t')
50 | if len(nc_enrichment_df.columns) != 3:
51 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.")
52 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py")
53 | exit(1)
54 | # Now, we know that the nc_enrichment_df has exactly 3 columns
55 | # change the column names
56 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"]
57 | return (nc_enrichment_df)
58 |
--------------------------------------------------------------------------------
/learn_model/README.md:
--------------------------------------------------------------------------------
1 | ## NOTES
2 |
3 | This folder provides the code that was used to generate Additional File 2, tabs 'trainData_HistoneChip' and 'trainData_ATACDNase', which outline the metadata and download links of the datasets that we used to train the mouse full-stack model. Within this folder:
4 | - ```metadata_controltype_june_2021.tsv```: tsv file downloaded from the ENCODE portal for the ChIP-seq and ATAC-seq experiments, including control experiments and their files, that we are using for this study.
5 | - ```extract_cell_mark_table.py```: code to produce Additional File 2, tabs 'trainData_HistoneChip' and 'trainData_ATACDNase' in the published paper. This code will extract the metadata for the experiments that we will use for this paper, and their corresponding control files. The control experiments are needed for binarizing input data so that we can use as input to the ChromHMM LearnModel function (see Methods section of the paper).
6 | ```
7 | usage: extract_cell_mark_table.py [-h] [--ENCODE_meta_fn ENCODE_META_FN]
8 | --output_fn OUTPUT_FN
9 |
10 | This file aims at producing the supplementary file listing the input file
11 | metadata and download links.
12 |
13 | optional arguments:
14 | -h, --help show this help message and exit
15 | --ENCODE_meta_fn ENCODE_META_FN
16 | The file where we can get metadata of experiments
17 | from ENCODE
18 | --output_fn OUTPUT_FN
19 | The excel output file. Multiple tabs will be in
20 | this file. The output file is part of Additional
21 | File 2 in the published paper
22 | ```
23 | - ```helper.py```: script containing helper functions for file management.
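
The first filtering step of ```extract_cell_mark_table.py``` (keep only rows whose assay is one of the four assay types, then keep a subset of columns) can be sketched with the standard library as below. This is only an illustration under stated assumptions: the function name `filter_encode_metadata`, the shortened column list and the toy rows (accessions such as `FILE_A`) are hypothetical, and the sketch only skips whole lines starting with `#`, which is narrower than the `comment = '#'` option the script passes to pandas.

```python
import csv
import io

# Assay types kept by the script
ALL_ASSAY_LIST = {'Control ChIP-seq', 'Histone ChIP-seq', 'DNase-seq', 'ATAC-seq'}
# Shortened, illustrative subset of the columns the script keeps
COLUMNS_TO_KEEP = ['File accession', 'Output type', 'Experiment accession', 'Assay']


def filter_encode_metadata(handle):
    """Hypothetical helper: read a tab-separated metadata table and keep
    only rows with an assay of interest, restricted to COLUMNS_TO_KEEP."""
    data_lines = (line for line in handle if not line.startswith('#'))
    reader = csv.DictReader(data_lines, delimiter='\t')
    return [
        {col: row[col] for col in COLUMNS_TO_KEEP}
        for row in reader
        if row['Assay'] in ALL_ASSAY_LIST
    ]


# Toy table (made-up accessions, not real ENCODE records)
toy_tsv = (
    "File accession\tOutput type\tExperiment accession\tAssay\tLab\n"
    "FILE_A\talignments\tEXP_1\tHistone ChIP-seq\tlab_x\n"
    "FILE_B\treads\tEXP_2\tControl ChIP-seq\tlab_x\n"
    "FILE_C\talignments\tEXP_3\tRNA-seq\tlab_y\n"
)
kept = filter_encode_metadata(io.StringIO(toy_tsv))
for row in kept:
    print(row)
```

With the real file, the same function could be called on an open handle to ```metadata_controltype_june_2021.tsv```, provided the column names match those in the ENCODE download.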
24 |
25 | For the remaining steps in training the model (binarize data, learn model), we used ChromHMM v.1.23 functions BinarizeData and LearnModel. We describe the process in the Methods section of the paper. If you need to obtain the binarized data we used for training the model, please contact Dr. Jason Ernst.
26 |
27 | ## LICENSE
28 | Copyright 2022 Ha Vu (havu73@ucla.edu)
29 |
30 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
31 |
32 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
33 |
34 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/learn_model/extract_cell_mark_table.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022 Ha Vu (havu73@ucla.edu)
2 |
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 |
5 | # The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 |
7 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
8 |
9 | import pandas as pd
10 | import numpy as np
11 | import helper
12 | import os
13 | import argparse
14 | import sys
15 | sys.path.append('../emission_and_metadata') # full path on Ha's system: /u/home/h/havu73/project-ernst/source/mm10_annotations/emission_and_metadata
16 | import prepare_metadata_files as meta
17 | ################## READING INPUT COMMAND LINE ARGS ##################
18 | parser = argparse.ArgumentParser(description = 'This file aims at producing the supplementary file listing the input file metadata and download links.')
19 | parser.add_argument('--ENCODE_meta_fn', type = str, required = False, default = './metadata_controltype_june_2021.tsv', help = 'The file where we can get metadata of experiments from ENCODE')
20 | parser.add_argument('--output_fn', type = str, required = True, help = 'The excel output file. Multiple tabs will be in this file. The output file is part of Additional File 2 in the published paper')
21 | args = parser.parse_args()
22 | helper.check_file_exist(args.ENCODE_meta_fn)
23 | helper.create_folder_for_file(args.output_fn)
24 | #########################################################################################
25 | def read_filter_metadata():
26 | meta_df = pd.read_csv(args.ENCODE_meta_fn, header = 0, index_col = None, sep = '\t', comment = '#', low_memory = False)
27 | # meta_df = meta_df[meta_df['File analysis title'].str.contains('ENCODE4', na = False)] # note we need to comment out this line because the read files for different control experiments are actually not part of ENCODE4, and are all blank on this column
28 | ALL_ASSAY_LIST = ['Control ChIP-seq', 'Histone ChIP-seq', 'DNase-seq', 'ATAC-seq']
29 | meta_df = meta_df[meta_df['Assay'].isin(ALL_ASSAY_LIST)] # filter only lines whose assay is in ALL_ASSAY_LIST
30 | COLUMNS_TO_PARSE = ['File accession', 'Output type', 'Experiment accession', 'Assay', 'Biosample term name', 'Experiment target', 'File assembly', 'Run type', 'Derived from', 'Controlled by', 'File analysis title']
31 | meta_df = meta_df[COLUMNS_TO_PARSE]
32 | return meta_df
33 |
34 | def combine_with_organ_meta_df(file_df, organ_meta_df):
35 | '''
36 | file_df can be from functions produce_cell_mark_table_from_chipBamDf or from get_openC_df
37 | '''
38 | file_df = file_df.merge(organ_meta_df, how = 'left', left_on = 'Experiment accession', right_on = 'experimentID')
39 | file_df = file_df[['exp_name', 'Experiment target', 'Biosample term name', 'organ_group', 'cell_group', 'development_group', 'system_group', 'Experiment accession', 'download_link', 'File assembly', 'Run type', 'ctrl_fileCode', 'ctrl_download']]
40 | return file_df
41 |
42 | def check_similar_file_acc_code_with_jason_file(df, jason_fn):
43 | jason_df = pd.read_csv(jason_fn, header = None, sep = '\t')
44 | diff_list = np.setdiff1d(df['File accession'], jason_df[0])
45 | assert len(diff_list) == 0, 'The expected and observed list of file accession code from jason file {} IS NOT EQUAL.'.format(jason_fn)
46 | return
47 |
48 | def get_control_df(meta_df):
49 | # first filter so that we only get the Control Chip-seq experiments
50 | control_df = meta_df[meta_df['Assay'] == 'Control ChIP-seq']
51 | bam_control_df = control_df[(control_df['Output type'] == 'alignments') & (control_df['File analysis title'].str.contains('ENCODE4', na = False))] # these are control bam files
52 | jason_bamControl_fn = os.path.join(ENCODE_meta_fn, 'controlmeta_alignments.txt')
53 | check_similar_file_acc_code_with_jason_file(bam_control_df, jason_bamControl_fn)
54 | # assay is always Control ChIP-seq
55 | # Biosample term name corresponds to different cell/ tissue types
56 | # Experiment target is all blank
57 | # Run type is all blank
58 | # Controlled by is all blank
59 | # File analysis title all contains ENCODE4
60 | read_control_df = control_df[control_df['Output type'] == 'reads'] # these are read files for control samples
61 | jason_readControl_fn = os.path.join(ENCODE_meta_fn, 'controlmeta_reads.txt')
62 | check_similar_file_acc_code_with_jason_file(read_control_df, jason_readControl_fn)
63 | return bam_control_df, read_control_df
64 |
65 | def get_chip_df(meta_df):
66 | bam_chip_df = meta_df[(meta_df['Assay'] == 'Histone ChIP-seq') & (meta_df['Output type'] == 'alignments') & (meta_df['File analysis title'].str.contains('ENCODE4', na = False))] # this will result in 1397 bam files belonging to 704 unique ChIP seq experiments
67 | jason_bamChip_fn = os.path.join(ENCODE_meta_fn, 'histonemeta_alignments.txt')
68 | check_similar_file_acc_code_with_jason_file(bam_chip_df, jason_bamChip_fn)
69 | read_chip_df = meta_df[(meta_df['Assay'] == 'Histone ChIP-seq') & (meta_df['Output type'] == 'reads')]
70 | jason_readChip_fn = os.path.join(ENCODE_meta_fn, 'histonemeta_reads.txt')
71 | check_similar_file_acc_code_with_jason_file(read_chip_df, jason_readChip_fn)
72 | return bam_chip_df, read_chip_df
73 |
74 | def get_openC_df(meta_df, organ_meta_df):
75 | '''
76 | Extract bam files corresponding to DNase-seq and ATAC-seq output
77 | '''
78 | bam_DNase_df = meta_df[(meta_df['Assay'] == 'DNase-seq') & (meta_df['Output type'] == 'alignments') & (meta_df['File analysis title'].str.contains('ENCODE4', na = False))] # based on code in ENCODE_meta_fn/getmeta.sh
79 | jason_dnaseBam_fn = os.path.join(ENCODE_meta_fn, 'dnasemeta_alignments.txt')
80 | check_similar_file_acc_code_with_jason_file(bam_DNase_df, jason_dnaseBam_fn)
81 | bam_ATAC_df = meta_df[(meta_df['Assay'] == 'ATAC-seq') & (meta_df['Output type'] == 'alignments') & (meta_df['File analysis title'].str.contains('ENCODE4', na = False))] # based on code in ENCODE_meta_fn/getmeta.sh
82 | jason_ATACBam_fn = os.path.join(ENCODE_meta_fn, 'atacmeta_alignments.txt')
83 | check_similar_file_acc_code_with_jason_file(bam_ATAC_df, jason_ATACBam_fn)
84 | openc_df = pd.concat([bam_DNase_df, bam_ATAC_df], ignore_index = True)
85 | openc_df.loc[:, 'Experiment target'] = openc_df['Assay'].apply(lambda x: x.split('-seq')[0])
86 | openc_df = openc_df.sort_values(['Experiment target', 'Biosample term name', 'Experiment accession'])
87 | openc_df['download_link'] = openc_df['File accession'].apply(lambda x: 'https://www.encodeproject.org/files/{c}/@@download/{c}.bam'.format(c = x))
88 | openc_df['exp_name'] = openc_df['Biosample term name'] + '_' + openc_df['Experiment target'] + '_' + openc_df['Experiment accession']
89 | openc_df['ctrl_fileCode'] = 'uniform'
90 | openc_df['ctrl_download'] = ''
91 | openc_df = openc_df[['exp_name', 'Experiment target', 'Biosample term name', 'Experiment accession', 'download_link', 'File assembly', 'Run type', 'ctrl_fileCode', 'ctrl_download']]
92 | openc_df.reset_index(inplace = True, drop = True)
93 | openc_df = combine_with_organ_meta_df(openc_df, organ_meta_df)
94 | return openc_df
95 |
96 |
97 | def extract_file_code_one_row(entry, check_set):
98 | '''
99 | For each entry of /files/ENCFF001JZL/, /files/ENCFF309GLL/, extract the file accession code
100 | '''
101 | if pd.isnull(entry):
102 | return []
103 | code_list = entry.split(', ') # from /files/ENCFF001JZL/, /files/ENCFF309GLL/ to ['/files/ENCFF001JZL/', '/files/ENCFF309GLL/']
104 | code_list = list(map(lambda x: x[7:-1], code_list)) # to ['ENCFF001JZL', 'ENCFF309GLL']
105 | if check_set is not None:
106 | results = []
107 | for code in code_list:
108 | if code in check_set:
109 | results.append(code)
110 | return results
111 | return code_list
112 |
113 | def map_controlRead_to_controlBam(bam_control_df, read_control_df):
114 | # given the table showing the metadata of control bam and control read files, we will output a dictionary with keys: controlRead --> values list of controlBam that were derived from the control read key
115 | control_readToBam = {}
116 | set_conRead = set(read_control_df['File accession']) # a set of files that correspond to control read files
117 | bam_control_df.loc[:, 'read_file_list'] = bam_control_df['Derived from'].apply(lambda x: extract_file_code_one_row(x, set_conRead))
118 | for index, row in bam_control_df.iterrows():
119 | read_control_list = row['read_file_list']
120 | for file in read_control_list:
121 | if not (file in control_readToBam):
122 | control_readToBam[file] = []
123 | (control_readToBam[file]).append(row['File accession']) # add the control bam file accession code to the list of files derived from the control read file
124 | return control_readToBam
125 |
126 | def map_file_accession_to_fileCode_list(first_df, second_df, colname_to_extract):
127 | '''
128 | first_df should have a 'File accession' column and a 'Derived from' or 'Controlled by' column (colname_to_extract). That column lists the files that the file in 'File accession' is derived from or controlled by.
129 | second_df should have a 'File accession' column listing all the valid file accession codes; codes found in colname_to_extract that are not in this column are dropped.
130 | '''
131 | set_code_list = set(second_df['File accession']) # set of chip read files
132 | first_df.loc[:, 'read_file_list'] = first_df[colname_to_extract].apply(lambda x: extract_file_code_one_row(x, set_code_list))
133 | return dict(zip(first_df['File accession'], first_df['read_file_list']))
134 |
135 | def map_bamChip_to_bamCon(chipBam_to_chipRead, chipRead_to_conRead, conRead_to_conBam):
136 | '''
137 | Map each chipBam to a list of controlBam, given the 3 existing mappings
138 | '''
139 | results = {} # keys: chipBam, values: list of conBam
140 | for chipBam in chipBam_to_chipRead:
141 | conBam_set = set([])
142 | chipRead_list = chipBam_to_chipRead[chipBam]
143 | for chipRead in chipRead_list:
144 | conRead_list = chipRead_to_conRead[chipRead]
145 | for conRead in conRead_list:
146 | conBam_list = conRead_to_conBam[conRead]
147 | conBam_set.update(conBam_list)
148 | results[chipBam] = list(conBam_set)
149 | return results
150 |
151 |
152 | def match_conBam_to_chipBam(chipBam, chipBam_to_conBam):
153 | '''
154 | chipBam is a file accession code
155 | '''
156 | try:
157 | return chipBam_to_conBam[chipBam]
158 | except KeyError: # chipBam has no associated control bam file
159 | return []
160 |
161 | def produce_cell_mark_table_from_chipBamDf(bam_chip_df, chipBam_to_conBam, organ_meta_df):
162 | '''
163 | chipBam_to_conBam: a dictionary of keys: chipBam, values: list of conBam associated with each chipBam
164 | '''
165 | bam_chip_df = bam_chip_df.sort_values(['Experiment target', 'Biosample term name', 'Experiment accession'])
166 | bam_chip_df.loc[:, 'exp_name'] = bam_chip_df['Biosample term name'] + '_' + bam_chip_df['Experiment target'] + '_' + bam_chip_df['Experiment accession']
167 | bam_chip_df['control_bam_list'] = bam_chip_df['File accession'].apply(lambda x: match_conBam_to_chipBam(x, chipBam_to_conBam))
168 | result_df = pd.DataFrame(columns = ['exp_name', 'Experiment target', 'Biosample term name', 'Experiment accession', 'download_link', 'File assembly', 'Run type', 'ctrl_fileCode', 'ctrl_download'])
169 | for index, row in bam_chip_df.iterrows():
170 | chip_download = 'https://www.encodeproject.org/files/{c}/@@download/{c}.bam'.format(c = row['File accession'])
171 | chip_report_row = pd.Series([row['exp_name'], row['Experiment target'], row['Biosample term name'], row['Experiment accession'], chip_download, row['File assembly'], row['Run type']], index = ['exp_name', 'Experiment target', 'Biosample term name', 'Experiment accession', 'download_link', 'File assembly', 'Run type'])
172 | control_bam_list = row['control_bam_list']
173 | for ctrl_bam in control_bam_list:
174 | ctrl_download = 'https://www.encodeproject.org/files/{c}/@@download/{c}.bam'.format(c = ctrl_bam)
175 | ctrl_report_row = pd.Series([ctrl_bam, ctrl_download], index = ['ctrl_fileCode', 'ctrl_download'])
176 | output_row = pd.concat([chip_report_row, ctrl_report_row])
177 | result_df.loc[result_df.shape[0], :] = output_row
178 | result_df = combine_with_organ_meta_df(result_df, organ_meta_df)
179 | return result_df
180 |
181 |
182 | def save_output(output_fn, chip_output_df, openc_df):
183 | writer = pd.ExcelWriter(output_fn, engine = 'xlsxwriter')
184 | chip_output_df.to_excel(writer, sheet_name = 'Histone ChIP-seq')
185 | openc_df.to_excel(writer, sheet_name = 'ATAC_DNase')
186 | writer.close() # ExcelWriter.save() was removed in pandas 2.0; close() also writes the file to disk
187 | return
188 |
189 | if __name__ =='__main__':
190 | organ_meta_df = meta.get_metadata_from_jason() # output is a df outlining the cell types of each experiment. columns: 'experimentID', 'organ_group', 'cell_group', 'development_group', 'system_group'
191 | meta_df = read_filter_metadata()
192 | bam_control_df, read_control_df = get_control_df(meta_df)
193 | bam_chip_df, read_chip_df = get_chip_df(meta_df)
194 | chipBam_to_chipRead = map_file_accession_to_fileCode_list(bam_chip_df, read_chip_df, 'Derived from')
195 | chipRead_to_conRead = map_file_accession_to_fileCode_list(read_chip_df, read_control_df, 'Controlled by')
196 | conRead_to_conBam = map_controlRead_to_controlBam(bam_control_df, read_control_df)
197 | chipBam_to_conBam = map_bamChip_to_bamCon(chipBam_to_chipRead, chipRead_to_conRead, conRead_to_conBam)
198 | chip_output_df = produce_cell_mark_table_from_chipBamDf(bam_chip_df, chipBam_to_conBam, organ_meta_df)
199 | openc_df = get_openC_df(meta_df, organ_meta_df)
200 | save_output(args.output_fn, chip_output_df, openc_df)
201 |
--------------------------------------------------------------------------------
/learn_model/helper.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022 Ha Vu (havu73@ucla.edu)
2 |
3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 |
5 | # The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 |
7 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
8 |
9 |
10 | import numpy as np
11 | import pandas as pd
12 | import string
13 | import os
14 | import sys
15 | import time
16 | def make_dir(directory):
17 | try:
18 | os.makedirs(directory)
19 | except:
20 | print ( 'Folder ' + directory + ' is already created')
21 |
22 |
23 |
24 | def check_file_exist(fn):
25 | if not os.path.isfile(fn):
26 | print ( "File: " + fn + " DOES NOT EXIST")
27 | exit(1)
28 | return
29 |
30 | def check_dir_exist(fn):
31 | if not os.path.isdir(fn):
32 | print ( "Directory: " + fn + " DOES NOT EXIST")
33 | exit(1)
34 | return
35 |
36 | def create_folder_for_file(fn):
37 | last_slash_index = fn.rfind('/')
38 | if last_slash_index != -1: # path contains folder
39 | make_dir(fn[:last_slash_index])
40 | return
41 |
42 | def get_command_line_integer(arg):
43 | try:
44 | arg = int(arg)
45 | return arg
46 | except:
47 | print ( "Integer: " + str(arg) + " IS NOT VALID")
48 | exit(1)
49 |
50 |
51 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment output
52 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t")
53 | # rename the org_enrichment_df so that it's easier to work with
54 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"})
55 | return enrichment_df
56 |
57 | def get_non_coding_enrichment_df (non_coding_enrichment_fn):
58 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t')
59 | if len(nc_enrichment_df.columns) != 3:
60 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.")
61 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py")
62 | exit(1)
63 | # Now, we know that the nc_enrichment_df has exactly 3 columns
64 | # change the column names
65 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"]
66 | return (nc_enrichment_df)
67 |
--------------------------------------------------------------------------------
/neighborhood/README.md:
--------------------------------------------------------------------------------
1 | This folder contains code to obtain neighborhood enrichment analysis results of mouse full-stack states, as presented in Fig. 2D-E of the manuscript.
2 | - Files ```genome_100_RefSeqTES_neighborhood.txt``` and ```genome_100_RefSeqTSS_neighborhood.txt``` contain the output of ChromHMM neighborhood enrichment for the 100 mouse full-stack states around annotated RefSeq TSSs and TESs. The bed files of annotated TSSs and TESs are both provided with the ChromHMM software.
3 | - To use the next two code files, install the following R packages if you do not already have them. In your R terminal, use the following commands:
4 | ```
5 | install.packages("tidyverse")
6 | install.packages("tidyr")
7 | install.packages("dplyr")
8 | install.packages('ggplot2')
9 | install.packages("pheatmap")
10 | install.packages("reshape2")
11 | ```
12 | - File ```draw_2DLine_neighborhood_enrichment.R``` contains code to plot Fig. 2D-E. You can use this command:
13 | ```
14 | Rscript draw_2DLine_neighborhood_enrichment.R genome_100_RefSeqTSS_neighborhood.txt genome_100_RefSeqTSS_neighborhood_linePlot.png genome_100_RefSeqTES_neighborhood.txt genome_100_RefSeqTES_neighborhood_linePlot.png ../state_annotation_processed.tsv
15 | ```
16 |
17 | - File ```draw_neighborhood_enrichment.R``` contains code to plot Fig. S2 in the manuscript. You can use these two commands:
18 | ```
19 | Rscript draw_neighborhood_enrichment.R genome_100_RefSeqTSS_neighborhood.txt genome_100_RefSeqTSS_neighborhood_heatMap.png ../state_annotation_processed.tsv
20 |
21 | Rscript draw_neighborhood_enrichment.R genome_100_RefSeqTES_neighborhood.txt genome_100_RefSeqTES_neighborhood_heatMap.png ../state_annotation_processed.tsv
22 | ```
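
To inspect the two neighborhood files without R, the sketch below shows one way to parse them in Python. It assumes the files are tab-delimited, with the state index in the first column and one column per offset (in bp) from the anchor. The helper names `read_neighborhood` and `peak_offset` are hypothetical and not part of this repository. The sample rows are copied from ```genome_100_RefSeqTES_neighborhood.txt``` and trimmed to five offsets so that the example runs on its own.

```python
import csv
import io

# First three states of genome_100_RefSeqTES_neighborhood.txt, trimmed to the
# offsets -400..400 so the sketch runs without the file (assumed tab-delimited).
SAMPLE = (
    "State (Emission order)\t-400\t-200\t0\t200\t400\n"
    "1\t0.68012\t0.70906\t0.81518\t0.73321\t0.87310\n"
    "2\t1.19392\t0.96530\t0.77055\t0.87219\t0.88913\n"
    "3\t4.06654\t3.65228\t1.92759\t2.63786\t2.58713\n"
)


def read_neighborhood(handle):
    """Return {state: {offset_bp: enrichment}} from a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    offsets = [int(x) for x in header[1:]]  # column names are offsets in bp
    table = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        table[int(row[0])] = dict(zip(offsets, map(float, row[1:])))
    return table


def peak_offset(profile):
    """Return the offset with the highest enrichment (first one on ties)."""
    return max(profile, key=profile.get)


table = read_neighborhood(io.StringIO(SAMPLE))
for state, profile in table.items():
    print(state, peak_offset(profile), profile[0])
```

To run this on a full file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open('genome_100_RefSeqTES_neighborhood.txt')`.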
23 | ## LICENSE
24 | Copyright 2022 Ha Vu (havu73@ucla.edu)
25 |
26 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
27 |
28 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
29 |
30 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/neighborhood/draw_2DLine_neighborhood_enrichment.R:
--------------------------------------------------------------------------------
1 | library("tidyverse")
2 | library("tidyr")
3 | library("dplyr")
4 | library('ggplot2')
5 | library(pheatmap)
6 | library(reshape2)
7 | STATE_COLOR_DICT = c('quescient' = "#ffffff", "HET" = '#b19cd9', 'acetylations' = '#fffacd', 'enhancers'= '#FFA500', 'ct_spec enhancers' = '#FFA500', 'transcription' = '#006400', 'weak transcription' = '#228B22', 'weak enhancers' = '#ffff00', 'transcribed and enhancer' = '#ADFF2F', 'exon' = '#3cb371', 'promoters' = '#ff4500', 'TSS' = '#FF0000', 'weak promoters' = '#800080', 'polycomb repressed' = '#C0C0C0', 'others' = '#fff5ee', 'znf' = '#7fffd4', 'DNase' = '#fff44f')
8 | get_rid_of_bedGZ_context_name <- function(enrichment_name){
9 | if (endsWith(enrichment_name, '.bed.gz')){
10 | result <- substring(enrichment_name, 1, nchar(enrichment_name) - 7)
11 | return(result)
12 | }
13 | return(enrichment_name)
14 | }
15 |
16 | get_overlap_enrichment_df <- function(enrichment_fn){
17 | if (endsWith(enrichment_fn, '.xlsx') | endsWith(enrichment_fn, '.xls')) {
18 | print("READING AN EXCEL FILE")
19 | enrichment_df <- as.data.frame(readxl::read_excel(enrichment_fn, sheet = 1, col_names = TRUE)) # read_excel comes from readxl, which library("tidyverse") does not attach
20 | } # if this is an Excel file
21 | else{
22 | print ("READING A TEXT FILE")
23 | enrichment_df <- as.data.frame(read_tsv(enrichment_fn, col_names = TRUE))
24 | }
25 | tryCatch({
26 | enrichment_df <- enrichment_df %>% rename("state" = "state..Emission.order.", "percent_in_genome" = "Genome..", "state" = "state (Emission order)", "state (Emission order)" = "state") # rename("percent_in_genome" = "percent_in_genome")
27 | }, error = function(e) {message("Nothing bad happened")})
28 | score_colnames <- colnames(enrichment_df)[2:ncol(enrichment_df)]
29 | score_colnames <- sapply(score_colnames, get_rid_of_bedGZ_context_name)
30 | score_colnames <- c('state', score_colnames)
31 | colnames(enrichment_df) <- score_colnames
32 | num_state <- nrow(enrichment_df)
33 | tryCatch({
34 | enrichment_df <- enrichment_df %>% select(-"percent_in_genome")
35 | }, error = function(e) {message("There is not a percent_in_genome column")})
36 | enrichment_df <- as.data.frame(t(enrichment_df))
37 | colnames(enrichment_df) <- enrichment_df[1,] # first row is also header
38 | enrichment_df <- enrichment_df[-1,] # get rid of the first row
39 | enrichment_df$coord <- as.integer(rownames(enrichment_df))
40 | colnames(enrichment_df)[-length(colnames(enrichment_df))] <- sapply(colnames(enrichment_df)[-length(colnames(enrichment_df))], function (x){paste0('S',x)}) # convert columns from state_index to "S". exclude that last column because that's coord
41 | enrichment_df <- melt(enrichment_df, id = c('coord'))
42 | colnames(enrichment_df) <- c('coord', 'state', 'enrichment')
43 | enrichment_df$state <- as.character(enrichment_df$state)
44 | return(enrichment_df)
45 | }
46 |
47 | get_state_type <- function(state_annot_fn){
48 | state_annot_df <- as.data.frame(read.csv(state_annot_fn, header = T, sep = '\t')) ### HAHAHA this line may need to change depending on what state annotation we are using
49 | state_annot_df$state <- sapply(as.character(state_annot_df$state), function (x){paste0('S', x)})
50 | return(state_annot_df)
51 | }
52 |
53 | draw_2DLine_neighborhood_enrichment <- function (enrichment_fn, save_fig_fn, state_annot_fn) {
54 | # draw enrichment plot where each column (each enrichment context) is on its own color scale
55 | enrichment_df <- get_overlap_enrichment_df(enrichment_fn) # long format with columns: coord, state, enrichment
56 | print(head(enrichment_df))
57 | annot_df <- get_state_type(state_annot_fn)
58 | print(head(annot_df))
59 | enrichment_df <- enrichment_df %>% left_join(annot_df, by = 'state')
60 | color_dict <- enrichment_df$color
61 | names(color_dict) <- enrichment_df$state
62 | # color_dict has keys: states, values: the color corresponding to that state
63 | ggplot(data = enrichment_df, aes(x = coord, y = enrichment, group = state)) +
64 | geom_line(aes(color = state)) +
65 | scale_colour_manual(values = color_dict) +
66 | theme_bw() +
67 | theme(legend.position = 'None')
68 | ggsave(save_fig_fn)
69 | print(paste('DONE! Figure is saved at:', save_fig_fn))
70 | }
71 |
72 | args = commandArgs(trailingOnly=TRUE)
73 | if (length(args) < 5)
74 | {
75 | stop("wrong command line argument format", call.=FALSE)
76 | }
77 | tss_fn <- args[1] # output of ChromHMM neighborhoodEnrichment
78 | save_tss_fn <- args[2] # where the figures should be stored
79 | tes_fn <- args[3] # output of ChromHMM neighborhoodEnrichment
80 | save_tes_fn <- args[4] # where the figures should be stored
81 | state_annot_fn <- args[5] # where the state annotation are stored
82 | print ('Done getting command line arguments')
83 | # state_annot_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/state_annotations.csv'
84 | # tss_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/neighborhood_enrichment/hg19/neighborhood_enrichment_tss_hg19.txt'
85 | # save_tss_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/neighborhood_enrichment/hg19/2DLine_neighborhood_enrichment_tss_hg19.png'
86 | draw_2DLine_neighborhood_enrichment(tss_fn, save_tss_fn, state_annot_fn)
87 | # tes_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/neighborhood_enrichment/hg19/neighborhood_enrichment_tes_hg19.txt'
88 | # save_tes_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/neighborhood_enrichment/hg19/2DLine_neighborhood_enrichment_tes_hg19.png'
89 | draw_2DLine_neighborhood_enrichment(tes_fn, save_tes_fn, state_annot_fn)
--------------------------------------------------------------------------------
/neighborhood/draw_neighborhood_enrichment.R:
--------------------------------------------------------------------------------
1 | library("tidyverse")
2 | library("tidyr")
3 | library("dplyr")
4 | library('ggplot2')
5 | library(pheatmap)
6 | # STATE_COLOR_DICT = c('quescient' = "#ffffff", "HET" = '#b19cd9', 'acetylations' = '#fffacd', 'enhancers'= '#FFA500', 'ct_spec enhancers' = '#FFA500', 'transcription' = '#006400', 'weak transcription' = '#228B22', 'weak enhancers' = '#ffff00', 'transcribed and enhancer' = '#ADFF2F', 'exon' = '#3cb371', 'promoters' = '#ff4500', 'TSS' = '#FF0000', 'weak promoters' = '#800080', 'bivalent promoters' = '#7030A0', 'polycomb repressed' = '#C0C0C0', 'others' = '#fff5ee', 'znf' = '#7fffd4', 'DNase' = '#fff44f')
7 | get_rid_of_bedGZ_context_name <- function(enrichment_name){
8 | if (endsWith(enrichment_name, '.bed.gz')){
9 | result <- substring(enrichment_name, 1, nchar(enrichment_name) - 7)
10 | return(result)
11 | }
12 | return(enrichment_name)
13 | }
14 |
15 | get_overlap_enrichment_df <- function(enrichment_fn){
16 | if (endsWith(enrichment_fn, '.xlsx') | endsWith(enrichment_fn, '.xls')) {
17 | print("READING AN EXCEL FILE")
18 | enrichment_df <- as.data.frame(readxl::read_excel(enrichment_fn, sheet = 1, col_names = TRUE)) # read_excel comes from readxl, which library("tidyverse") does not attach
19 | } # if this is an Excel file
20 | else{
21 | print ("READING A TEXT FILE")
22 | enrichment_df <- as.data.frame(read_tsv(enrichment_fn, col_names = TRUE))
23 | }
24 | tryCatch({
25 | enrichment_df <- enrichment_df %>% rename("state" = "state..Emission.order.", "percent_in_genome" = "Genome..", "state" = "state (Emission order)", "state (Emission order)" = "state") # rename("percent_in_genome" = "percent_in_genome")
26 | }, error = function(e) {message("Nothing bad happened")})
27 | score_colnames <- colnames(enrichment_df)[2:ncol(enrichment_df)]
28 | score_colnames <- sapply(score_colnames, get_rid_of_bedGZ_context_name)
29 | score_colnames <- c('state', score_colnames)
30 | colnames(enrichment_df) <- score_colnames
31 | num_state <- nrow(enrichment_df)
32 | tryCatch({
33 | enrichment_df <- enrichment_df %>% select(-"percent_in_genome")
34 | }, error = function(e) {message("There is not a percent_in_genome column. Nothing bad happened")})
35 | return(enrichment_df)
36 | }
37 |
38 | read_state_annot_df <- function(states_to_include, state_annot_fn){
39 | annot_df <- as.data.frame(read.csv(state_annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = "\t"))
40 | tryCatch({
41 | annot_df <- annot_df %>% rename('group' = 'Group')
42 | }, error = function (e) {message("tried to change column names in annot_df and nothing worth worrying happened")})
43 | annot_df <- annot_df %>% arrange(state_order_by_group) # order rows based on the index that we get from state_order_by_group column
44 | annot_df <- annot_df %>% filter(state %in% states_to_include) # in case some states need to be excluded
45 | STATE_COLOR_DICT <- as.character(annot_df$color)
46 | names(STATE_COLOR_DICT) <- as.character(annot_df$group)
47 | STATE_COLOR_DICT <- STATE_COLOR_DICT[!duplicated(names(STATE_COLOR_DICT))]
48 | return_obj <- list(annot_df = annot_df, STATE_COLOR_DICT = STATE_COLOR_DICT)
49 | return(return_obj)
50 | }
51 |
52 | calculate_gap_rows_among_state_groups <- function(state_annot_df){
53 | state_group_ordered_by_appearance <- unique(state_annot_df$group) # list of different state groups, ordered by how they appear in the heatmap from top to bottom
54 | count_df <- state_annot_df %>% count(group)
55 | count_df <- count_df[match(state_group_ordered_by_appearance, count_df$group),] # order the rows such that the group are ordered based on state_group_ordered_by_appearance
56 | results <- cumsum(count_df$n) # cumulative sum of the count for groups of states, which will be used to generate the gaps between rows of the heatmaps
57 | return(results)
58 | }
59 |
60 | draw_enrichment_plot_no_color_scale <- function (enrichment_fn, save_fig_fn, states_to_include, state_annot_fn) {
61 | # draw enrichment plot where each column (each enrichment context) is on its own color scale
62 | annot_obj <- read_state_annot_df(states_to_include, state_annot_fn) # state: one-based state indices, arranged based on the groups of states, long annotations, short annotations, group, color, itemRgb, state_ordered_by_group: 0 --> 99 already, mneumonics in the format
63 | annot_df <- annot_obj$annot_df
64 | STATE_COLOR_DICT <- annot_obj$STATE_COLOR_DICT
65 | enrichment_df <- get_overlap_enrichment_df(enrichment_fn) # columns: enrichment context, rows: states
66 | enrichment_colnames <- colnames(enrichment_df)[-1] # get rid of the first item in enrichment_colnames because we want to get rid of the 'state'
67 | enrichment_df <- as.data.frame(enrichment_df[, -1]) # get rid of the first column (states) so that the enrichment data is just the enrichment values. This is necessary so that we can get input for the pheatmap function later
68 | enrichment_colnames <- as.character(seq(-2000, 2000, 200))
69 | enrichment_df <- enrichment_df %>% select(all_of(enrichment_colnames)) # all_of() selects columns named in an external character vector
70 | # rearranged_colnames <- paste("consHMM_state", seq(1, 100) , sep = "")
71 | enrichment_df <- enrichment_df %>% slice(annot_df$state) # rearrange the states based on the ordered based on the order from annot_df
72 | colnames(enrichment_df) <- enrichment_colnames
73 | rownames(enrichment_df) <- annot_df$mneumonics # row names and columns names are the names of x and y axis tick labels in the figure
74 | display_num_df <- round(enrichment_df, digits = 1)
75 | # calculate the indices of rows where we want to create gaps among groups of states
76 | gap_row_indices <- calculate_gap_rows_among_state_groups(annot_df)
77 | # prepare the annot_df for plotting as the row_annotation in the pheatmap function
78 | rownames(annot_df) <- annot_df$mneumonics
79 | annot_df <- annot_df %>% select(c('group')) ### HAHAHA this can be group or state_group depending on the state annot file that we use
80 | #pheatmap(enrichment_df, cluster_rows = FALSE, cluster_cols = TRUE, fontsize = 4.5, scale = 'none', display_numbers = display_num_df, number_format = '%.1f', color = colorRampPalette(c('white', 'red'))(300), filename = save_fig_fn, cellwidth = 10, lengend = FALSE)
81 | pheatmap(enrichment_df, cluster_rows = FALSE, cluster_cols = FALSE,annotation_row = annot_df, gaps_row = gap_row_indices, annotation_colors = list(group = STATE_COLOR_DICT), fontsize = 8, scale = 'none', color = colorRampPalette(c('white', 'blue'))(300), filename = save_fig_fn, cellwidth = 9, cellheight = 8, legend = TRUE)
82 | print(paste('DONE! Figure is saved at:', save_fig_fn))
83 | }
84 | args = commandArgs(trailingOnly=TRUE)
85 | if (length(args) < 3)
86 | {
87 | stop("wrong command line argument format", call.=FALSE)
88 | }
89 | neighborhood_enrichment_fn <- args[1] # output of ChromHMM NeighborhoodEnrichment, also input to this program
90 | save_fn <- args[2] # where the figures should be stored
91 | state_annot_fn <- args[3] # data about the state annotations
92 | if (length(args) > 3) {
93 | states_to_include <- args[4:length(args)] # states listed explicitly on the command line
94 | } else {
95 | states_to_include <- seq(1, 100) # no states given: include all 100 full-stack states
96 | }
97 | print (states_to_include)
98 | draw_enrichment_plot_no_color_scale(neighborhood_enrichment_fn, save_fn, states_to_include, state_annot_fn)
--------------------------------------------------------------------------------
/neighborhood/genome_100_RefSeqTES_neighborhood.txt:
--------------------------------------------------------------------------------
1 | State (Emission order) -2000 -1800 -1600 -1400 -1200 -1000 -800 -600 -400 -200 0 200 400 600 800 1000 1200 1400 1600 1800 2000
2 | 1 0.69942 0.73801 0.75730 0.73318 0.64636 0.71389 0.82001 0.77177 0.68012 0.70906 0.81518 0.73321 0.87310 0.79592 0.85380 1.00816 1.06605 1.04675 1.07570 1.00816 1.10464
3 | 2 0.96530 1.02457 0.99917 1.00764 1.06691 0.95683 0.95683 1.19392 1.19392 0.96530 0.77055 0.87219 0.88913 0.99921 0.99074 1.33792 1.24478 1.42260 1.25325 1.35486 1.49035
4 | 3 3.44937 3.56774 3.75373 3.96509 3.88055 4.05809 4.05809 4.12572 4.06654 3.65228 1.92759 2.63786 2.58713 2.56177 3.01832 2.95914 3.08596 3.28041 3.14514 3.21278 3.43260
5 | 4 1.87602 1.71579 1.89938 1.89938 1.88937 1.91274 2.04626 2.10301 2.13639 1.95613 1.25847 1.62906 1.75257 1.87943 1.82268 1.85272 1.84271 1.74924 1.79263 1.89278 1.86607
6 | 5 0.63535 0.64349 0.80640 0.78197 0.72495 0.57833 0.74939 0.80640 0.64349 0.82270 0.59462 0.72498 0.70869 0.76571 0.81458 0.82273 0.81458 0.95306 0.82273 0.80644 1.05081
7 | 6 0.48532 0.66180 0.83828 0.61768 0.66180 0.75004 0.92651 0.82357 0.85298 0.70592 0.67650 0.82360 1.02950 1.19128 1.05892 1.01479 1.14716 0.86772 1.22069 1.38247 1.27952
8 | 7 0.82627 0.85868 0.82627 0.89108 0.77767 1.21511 1.03689 1.23131 1.21511 1.08550 1.05309 1.31237 1.40958 1.45819 1.65261 1.37718 1.45819 1.39338 1.57160 1.45819 1.60400
9 | 8 1.82420 1.64178 1.51148 1.51148 1.95450 1.98056 1.95450 1.92844 2.26722 2.26722 1.79814 2.37155 3.07520 3.07520 3.04914 3.07520 2.65823 2.39762 2.39762 2.39762 2.37155
10 | 9 0.87475 0.83374 0.83374 1.01143 0.94309 0.87475 1.06610 0.97042 0.86108 0.75173 0.87475 1.21649 1.23016 1.03880 1.21649 1.23016 1.29850 1.36684 1.53086 1.42152 1.33951
11 | 10 1.37969 1.17064 0.96160 0.82224 1.00341 1.03128 1.10096 1.04522 1.08703 1.26820 1.05915 1.24037 1.18463 1.26825 1.36580 1.65848 1.53305 1.72816 1.72816 1.99296 2.28563
12 | 11 1.08855 1.05069 1.18320 1.06015 1.03175 1.00336 1.02229 1.01282 0.88977 1.02229 0.87084 1.01286 1.22111 1.06966 1.23058 1.39150 1.48616 1.55242 1.62815 1.51456 1.54296
13 | 12 0.98268 1.04493 1.00936 1.00936 0.91598 0.91598 0.90264 0.88930 0.95155 0.96934 1.20501 1.10278 1.19616 1.26731 1.24062 1.28954 1.31177 1.35624 1.50743 1.59191 1.42739
14 | 13 0.72184 0.73075 0.80204 0.71292 0.63272 0.76639 0.70401 0.89116 0.77531 0.90898 0.73966 0.96249 0.82881 0.90901 0.93575 0.94466 1.10508 1.24767 1.21202 1.22984 1.36352
15 | 14 0.62781 0.52891 0.54181 0.50741 0.63641 0.58481 0.47301 0.51601 0.63211 0.69231 0.70522 0.60204 0.58053 0.77835 0.73534 0.72674 0.77405 0.88585 0.91165 0.93746 0.86435
16 | 15 0.32910 0.38994 0.35814 0.38026 0.37335 0.36229 0.40100 0.39686 0.39824 0.43834 0.53375 0.43697 0.43421 0.42591 0.39687 0.45218 0.49505 0.49782 0.53930 0.51303 0.53654
17 | 16 0.33113 0.50644 0.36035 0.38957 0.41879 0.41879 0.37009 0.37983 0.40905 0.43826 0.61357 0.48698 0.46750 0.42854 0.44802 0.52594 0.48698 0.61359 0.59412 0.73047 0.89604
18 | 17 0.83523 0.85160 0.70421 0.75334 0.85160 0.73696 0.58957 0.76972 0.80247 0.68783 0.62233 0.90077 0.75337 0.78613 0.54046 0.76975 0.83526 0.86801 0.73699 0.85164 0.98266
19 | 18 0.58457 0.54206 0.52080 0.44640 0.34011 0.49954 0.41451 0.37200 0.31886 0.43577 0.57394 0.46767 0.39327 0.51019 0.45704 0.58459 0.46767 0.34013 0.42516 0.47830 0.54208
20 | 19 0.54590 0.48925 0.48410 0.47380 0.47380 0.36050 0.43260 0.39655 0.41715 0.50470 0.54590 0.46866 0.51502 0.52017 0.49956 0.54077 0.56137 0.61287 0.53047 0.61287 0.57682
21 | 20 1.67156 1.66823 1.61174 1.65827 1.70645 1.63168 1.63999 1.75962 1.97397 2.65190 2.96926 1.62510 1.52540 1.58854 1.64504 1.59685 1.65168 1.74307 1.61014 1.56195 1.53038
22 | 21 2.73577 2.60778 2.73577 2.87576 2.71177 2.96775 2.77977 2.98775 3.01575 3.93567 5.01158 2.35989 2.42389 2.47589 2.40789 2.44789 2.47589 2.24790 2.19590 2.23190 2.09990
23 | 22 3.25087 3.25809 3.42045 3.33025 3.46014 3.30500 3.47457 3.19675 3.27974 3.53591 4.12403 3.35925 3.11750 3.16079 2.85770 2.85770 2.60874 2.57266 2.57987 2.47884 2.27317
24 | 23 3.93763 4.02958 3.92164 3.90165 3.90165 3.81771 3.66580 3.67379 3.33799 3.40995 3.84969 3.53801 3.12624 2.97033 2.83441 2.57855 2.37866 2.05884 1.79899 1.85895 1.89893
25 | 24 5.66168 5.69571 5.60213 5.63616 5.18101 5.00235 5.02788 4.75989 4.33878 3.84960 3.46251 4.74731 4.29640 3.79019 3.27122 2.79054 2.52254 2.47575 2.35664 2.21201 2.04186
26 | 25 5.64941 5.65560 5.36478 5.46378 5.53804 5.64941 5.41428 5.83505 5.42047 5.05539 5.03683 6.02710 5.92190 5.53825 5.47637 5.24741 5.33405 4.65337 4.54817 3.86130 4.18927
27 | 26 11.37016 11.27793 11.84811 11.53786 12.04935 12.36799 12.67823 12.22544 12.88786 10.38072 6.45651 10.41467 9.91993 9.02269 8.78790 8.19253 7.37077 7.16113 6.72509 6.45676 5.77754
28 | 27 5.07549 5.22883 5.10616 4.93749 5.19816 5.50484 5.30550 5.29017 5.32083 4.69215 5.25950 6.13376 5.84241 5.33637 5.15236 4.49298 4.43164 4.06362 4.23230 3.94094 3.92561
29 | 28 2.02635 2.20977 2.10496 1.97395 2.28838 2.11370 2.13116 2.00015 2.14863 2.05256 2.59408 2.45443 2.41949 2.72520 2.58545 2.48937 2.26227 2.47190 2.34088 2.07884 2.00023
30 | 29 2.01174 2.04940 2.07451 2.16238 2.09334 2.05568 2.09648 2.19063 2.12158 2.08706 2.29106 2.40727 2.42297 2.38844 2.33509 2.36020 2.33823 2.22838 2.17188 2.15305 2.17188
31 | 30 2.97326 2.91120 3.07481 2.92248 3.09738 3.16508 3.08045 2.99583 2.96762 2.87170 2.73066 3.28933 3.31190 3.56579 3.65042 3.45859 3.50373 3.41345 3.52065 3.44731 3.24983
32 | 31 5.67447 5.98343 5.71567 6.23059 6.04522 6.71462 6.61164 6.50865 6.60134 5.71567 4.07821 6.03515 6.05575 5.89097 6.18964 6.16904 6.03515 5.78798 5.50991 5.65409 5.17005
33 | 32 11.17219 11.88620 12.26421 13.18822 14.84725 15.68726 16.73728 17.59829 17.07329 14.63724 9.80716 14.53280 16.31790 15.26784 14.00777 14.04978 12.51669 11.84465 10.85760 9.95455 10.16456
34 | 33 3.55846 3.63798 3.87653 3.93617 3.97593 3.81689 4.25425 4.19461 3.93617 3.14099 2.96207 3.55860 3.28027 2.92242 3.06158 2.74350 3.10135 3.28027 3.16099 2.68386 2.32601
35 | 34 0.54532 0.52513 0.51503 0.62611 0.62611 0.60592 0.58572 0.50493 0.44434 0.55542 0.49483 0.55544 0.72713 0.59584 0.68673 0.71703 0.74733 0.83822 0.75742 0.62614 0.69683
36 | 35 0.34983 0.34983 0.35557 0.37277 0.36704 0.45306 0.37851 0.40145 0.44733 0.50468 0.48174 0.46455 0.36705 0.46455 0.41293 0.47602 0.51043 0.48749 0.50470 0.54484 0.44735
37 | 36 0.41609 0.44810 0.42676 0.40542 0.35208 0.39476 0.44810 0.39476 0.41609 0.43743 0.39476 0.52280 0.59749 0.46946 0.51214 0.62950 0.65084 0.48013 0.46946 0.54414 0.52280
38 | 37 0.81697 0.70297 0.74097 0.66498 0.70297 0.56998 0.77897 0.93097 0.83597 0.74097 0.77897 0.83600 0.70300 0.79800 0.70300 0.81700 0.72200 0.83600 0.83600 0.89300 0.77900
39 | 38 1.92982 2.02028 1.99012 1.89966 1.80920 1.59813 1.59813 1.62828 1.68859 1.62828 1.44736 1.32680 1.56804 2.17113 2.38221 1.71881 1.80927 1.96005 2.14097 1.86958 2.20128
40 | 39 1.68483 1.64820 1.86796 2.08772 2.60049 2.41736 2.34411 2.27085 2.16097 1.79471 1.46507 1.50175 1.39187 1.13547 1.35524 1.39187 1.24536 1.24536 1.53838 1.53838 1.61164
41 | 40 3.80158 3.77607 3.31681 3.18925 3.16373 3.41887 3.08719 2.70448 2.24523 1.96457 1.45430 1.42884 1.68399 1.88811 1.91362 2.04120 1.99017 2.14326 2.34737 2.32186 2.37289
42 | 41 4.79146 4.66860 4.48431 4.11574 4.54574 4.07479 3.68574 3.39907 3.56288 3.89050 3.91098 3.89065 3.72683 4.32067 4.64830 4.87355 5.40596 4.93498 4.68926 5.30357 5.46739
43 | 42 4.49819 4.19181 4.08040 3.94114 3.63476 3.39801 3.24482 2.54851 2.53458 2.14465 2.21428 2.21436 2.85500 3.30066 3.42600 3.48171 3.30066 3.32851 3.45385 3.17532 3.45385
44 | 43 3.76010 3.70787 2.97674 2.76785 2.66340 1.82782 1.88005 3.81232 2.66340 1.41004 2.14117 3.34244 3.65579 3.49911 3.18576 2.87241 2.87241 3.18576 3.70802 3.91692 3.65579
45 | 44 3.41011 3.38216 3.52192 3.57782 2.85108 2.65542 3.10264 3.71758 4.22071 3.60577 3.74553 4.47245 4.47245 3.77363 3.82954 4.52836 4.47245 4.19292 4.22087 4.91970 3.63387
46 | 45 2.21007 2.12614 2.48982 2.01424 2.01424 2.21007 2.21007 2.60172 1.79043 2.18209 2.15412 1.67860 1.37085 1.56669 1.39883 1.34288 1.76253 1.79050 1.42681 1.48276 1.51074
47 | 46 3.09956 3.26863 3.01503 2.87414 2.19787 2.22605 2.05698 2.33876 2.31058 2.08516 1.69067 1.94435 1.85981 1.40895 1.57802 2.19796 2.67700 3.69144 3.88869 3.57872 3.49419
48 | 47 2.83055 2.57323 2.20154 2.63041 2.11576 1.94421 2.05858 2.28731 2.20154 2.48745 2.54463 2.43036 2.17303 1.85851 1.74414 1.31525 1.88710 1.94429 2.40177 2.57332 3.14517
49 | 48 11.32196 12.09980 13.05050 13.13693 14.34691 16.50759 17.45829 16.68045 15.55689 16.24831 12.87765 14.77962 13.74245 11.23597 10.97667 10.71738 10.28523 10.71738 9.68022 8.98877 9.59379
50 | 49 2.10794 1.64682 1.64682 1.88836 1.86640 1.73466 1.69074 2.02010 2.17381 2.08598 1.80053 2.10802 2.15193 2.34956 2.87656 2.63502 2.83265 2.89852 3.05223 3.20594 2.96440
51 | 50 1.33972 1.15026 1.17733 1.05554 1.00141 1.13673 1.38032 1.24499 1.42092 1.19086 0.90668 1.21797 1.21797 1.24504 1.19091 1.55630 1.73223 1.52923 1.73223 1.65103 1.61043
52 | 51 0.84345 0.90694 0.80717 0.86159 0.83438 0.77997 0.81624 0.71648 0.85252 0.88880 1.23343 0.79813 1.18813 1.15185 1.23348 1.20627 1.38767 1.67790 1.36953 1.55999 1.51464
53 | 52 0.85268 0.90897 0.90334 0.86394 0.86957 0.90334 0.86957 0.83580 0.81047 0.87801 1.11721 0.99061 0.97936 0.99343 1.02720 1.09474 1.16228 1.21294 1.25234 1.39587 1.39024
54 | 53 1.10162 1.19413 1.12685 1.17731 1.05958 1.13526 1.09321 1.18572 1.11844 1.16890 1.01753 1.27827 1.15212 1.37918 1.47169 1.48851 1.64829 1.58102 1.69875 1.47169 1.56420
55 | 54 2.36832 2.31932 2.67865 2.62965 2.84198 2.66232 2.95632 3.26665 3.59331 3.03798 2.13966 2.56442 2.54808 2.92376 3.10344 2.98910 4.01813 3.70779 3.93646 4.08347 4.32848
56 | 55 2.32330 2.21397 2.18664 1.94064 2.51463 2.07731 1.94064 2.43263 2.67863 2.13197 2.35064 3.66276 4.01810 3.99077 3.49876 3.66276 3.41675 4.04544 4.37344 4.04544 4.15477
57 | 56 1.30202 1.61766 1.26256 1.24284 1.44011 1.24284 1.14420 1.24284 1.12447 1.61766 1.61766 1.77555 1.75582 2.13066 1.89392 1.69664 2.17011 2.64359 2.66332 2.52522 2.34767
58 | 57 1.89439 1.64418 1.80503 1.75141 1.46547 1.91226 1.93013 2.14459 1.89439 1.98374 2.30543 1.80510 2.68084 2.85956 2.46637 3.27062 3.23488 3.16339 3.53871 3.37786 3.62807
59 | 58 1.47864 1.47864 1.57889 1.75432 1.97988 1.85457 2.00494 2.25556 2.13025 1.95482 1.27815 1.77945 1.92983 1.62908 1.75439 2.23058 1.67920 1.75439 1.80451 1.75439 1.97995
60 | 59 0.92601 0.77337 0.88531 0.88531 0.88531 1.00742 1.11936 0.94637 0.96672 0.88531 0.93619 0.92605 1.02781 0.95658 1.02781 0.95658 0.96675 1.01764 1.02781 1.33310 1.29240
61 | 60 4.18661 4.16917 4.37850 4.37850 4.65761 4.51806 4.29128 4.15173 4.18661 3.75051 3.22718 3.17497 3.57620 3.57620 3.45409 3.47154 3.38431 3.45409 3.36687 3.59365 3.75065
62 | 61 1.61407 1.51319 1.46780 1.49302 1.54850 1.42240 1.42744 1.47284 1.45771 1.44258 1.15507 1.40228 1.47290 1.47794 1.47290 1.44263 1.46281 1.48803 1.45272 1.52838 1.57882
63 | 62 1.10904 1.09160 1.01311 1.01456 0.98549 1.05090 1.05817 1.03055 1.02910 0.95787 0.87357 1.02187 1.05966 1.12507 1.08728 1.12071 1.08146 1.03495 1.05239 1.15124 1.19048
64 | 63 0.70942 0.71423 0.72144 0.72529 0.73780 0.70028 0.64209 0.61323 0.59928 0.69547 0.95471 0.78208 0.74985 0.74360 0.79554 0.79843 0.82104 0.86673 0.90136 0.88597 0.90424
65 | 64 0.45813 0.48268 0.42132 0.42950 0.40496 0.37632 0.37223 0.35587 0.48268 0.42950 0.29452 0.42952 0.47042 0.44179 0.44588 0.49088 0.57269 0.45406 0.49906 0.56860 0.51951
66 | 65 0.40829 0.41779 0.47001 0.49850 0.48900 0.58395 0.82608 0.85457 0.86881 0.62668 0.21839 0.30386 0.32760 0.39881 0.36558 0.35608 0.46053 0.38932 0.43680 0.41306 0.39407
67 | 66 0.29763 0.29123 0.27202 0.27202 0.25922 0.29123 0.32963 0.39844 0.36483 0.33123 0.23362 0.23843 0.23683 0.26243 0.27844 0.28804 0.31044 0.32804 0.33124 0.34724 0.34244
68 | 67 0.27268 0.26727 0.21858 0.20776 0.18504 0.24347 0.39171 0.42526 0.45664 0.43933 0.21101 0.18396 0.25646 0.28893 0.31165 0.30624 0.29867 0.31706 0.28568 0.29542 0.26620
69 | 68 0.29538 0.28669 0.29914 0.30243 0.31794 0.29844 0.24486 0.21407 0.22606 0.26601 0.36964 0.35673 0.33370 0.32030 0.31701 0.32171 0.31090 0.31490 0.31490 0.31184 0.32171
70 | 69 0.39681 0.35272 0.32333 0.29393 0.30863 0.33068 0.43355 0.47030 0.41151 0.41886 0.18371 0.30864 0.44827 0.39683 0.48501 0.36008 0.53645 0.52175 0.52175 0.41887 0.36743
71 | 70 0.30052 0.29739 0.32869 0.29113 0.30991 0.27861 0.50086 0.68869 0.62921 0.56660 0.29426 0.31305 0.38819 0.44453 0.46645 0.47584 0.48210 0.48836 0.57289 0.51967 0.51341
72 | 71 0.65157 0.68777 0.62055 0.72914 0.53264 0.64640 1.06527 1.03942 0.89979 0.99287 0.73431 0.73434 0.66194 0.69814 0.80674 0.74986 0.81191 0.88948 0.88948 0.92051 1.00843
73 | 72 0.67045 0.84285 0.80454 0.82369 0.97694 0.99610 0.82369 1.01525 1.01525 0.90032 0.45974 0.80457 0.99613 0.84288 1.07276 0.91951 0.82373 0.97698 1.03445 0.95782 1.01529
74 | 73 0.77648 0.62189 0.69216 0.62540 0.67108 0.69216 0.77297 0.85027 0.90297 0.83621 0.72729 0.67813 0.69570 0.82219 0.83625 0.85030 0.90652 0.92760 0.86787 0.81165 0.92760
75 | 74 0.51731 0.54345 0.49255 0.52144 0.51181 0.53382 0.47329 0.42926 0.46090 0.50355 0.55996 0.51458 0.56549 0.58062 0.57925 0.59989 0.59989 0.58613 0.65767 0.62740 0.61777
76 | 75 0.34262 0.34823 0.34441 0.33981 0.33752 0.31890 0.29338 0.29007 0.29645 0.30359 0.33624 0.34646 0.34263 0.34723 0.35871 0.36534 0.36458 0.37453 0.37019 0.39927 0.40310
77 | 76 0.25610 0.27346 0.26478 0.24525 0.20835 0.24959 0.27346 0.26478 0.24091 0.24091 0.30385 0.31037 0.31688 0.31037 0.33859 0.39068 0.37982 0.38416 0.42974 0.45579 0.46230
78 | 77 0.85446 0.70188 0.83920 0.82394 0.85446 0.83920 0.88498 0.99178 1.02230 0.93075 0.48826 0.68665 0.91553 0.88501 0.94604 0.76294 0.97656 1.12915 0.85449 1.12915 1.12915
79 | 78 3.03353 3.12546 3.12546 3.40123 2.75775 2.75775 2.20620 2.39005 2.29813 2.57390 1.74658 2.11436 1.83857 1.83857 1.65472 2.02243 1.93050 1.83857 1.37893 2.11436 2.02243
80 | 79 1.24395 1.19485 1.16211 1.32579 1.42400 1.47310 1.16211 1.06391 1.17848 1.39126 1.42400 1.62048 1.86600 1.80053 1.70232 1.76779 1.62048 1.98058 2.07879 2.02969 2.11153
81 | 80 4.81787 5.06027 5.12088 4.75726 4.60576 4.72696 4.84817 4.75726 4.39365 3.57552 2.93920 4.18170 4.24231 4.39382 4.24231 4.15140 3.54536 3.27264 3.39385 3.21203 2.57569
82 | 81 3.93605 3.90269 3.85821 4.05835 3.78038 3.45794 3.69143 3.36899 3.42458 3.22444 3.09102 3.38024 3.55814 3.08002 3.09114 2.89099 2.47958 2.52406 2.39063 2.16824 2.06817
83 | 82 4.49801 4.38168 4.73067 4.26535 4.26535 4.88577 5.19598 4.80822 4.57556 3.99392 3.37351 4.88596 4.53696 4.73085 3.95530 4.11041 3.72263 3.41242 2.98586 2.52053 2.13276
84 | 83 0.62057 0.60784 0.60784 0.60784 0.58556 0.58556 0.54737 0.60784 0.68103 0.63966 0.85288 0.76380 0.78926 0.75107 0.77335 0.73834 0.77017 0.80836 0.83700 0.73198 0.78608
85 | 84 1.27600 1.36264 1.06333 1.17361 1.20511 1.10272 1.20511 1.12635 1.15785 1.22086 1.12635 1.37057 1.26817 1.44934 1.33906 1.48085 1.50448 1.33119 1.38633 1.33119 1.19728
86 | 85 4.10988 3.90179 3.76306 3.57230 3.48560 3.34687 3.19080 3.48560 3.55496 3.90179 3.81508 4.16207 4.49157 4.56093 4.19675 4.09270 4.11004 3.86726 3.88460 3.93662 3.98865
87 | 86 5.96414 6.58417 6.76133 6.79085 6.58417 6.67275 6.64323 6.73180 6.90895 5.69841 5.04885 5.34431 4.75378 4.93094 4.39946 4.39946 4.28135 4.31088 4.16325 4.31088 4.22230
88 | 87 6.54511 6.75970 7.18889 7.40348 7.02794 7.10841 6.59875 6.03545 4.74788 4.18458 3.16526 2.81665 2.52157 2.68252 2.84348 3.13855 3.08490 2.89713 2.70935 2.68252 2.60205
89 | 88 3.49175 3.20754 2.84212 3.53235 3.57295 3.61356 4.14138 3.45115 3.00453 3.08573 3.16694 4.01973 3.53249 3.04525 3.00465 2.51741 2.76103 3.00465 2.72042 2.35499 2.39560
90 | 89 1.46703 1.25364 1.46703 1.04026 1.36033 1.70709 1.44035 1.92047 2.10718 2.18720 2.74734 1.94722 1.76050 1.54711 1.41373 1.30704 1.41373 1.04030 1.20034 1.46708 1.22702
91 | 90 0.30331 0.28619 0.29842 0.30087 0.26418 0.30576 0.31310 0.33022 0.37670 0.36202 0.49900 0.37671 0.37427 0.35714 0.37671 0.37182 0.40851 0.42074 0.43297 0.42563 0.46967
92 | 91 0.34790 0.46593 0.39760 0.43487 0.47836 0.40381 0.37896 0.38517 0.42866 0.42245 0.47215 0.37276 0.39761 0.41004 0.41004 0.39761 0.36655 0.40382 0.41625 0.60263 0.58399
93 | 92 0.37262 0.38382 0.36701 0.34180 0.33339 0.39783 0.38943 0.38382 0.39783 0.38662 0.50709 0.43427 0.41185 0.40345 0.46789 0.50431 0.50151 0.48750 0.54073 0.53793 0.49030
94 | 93 1.21580 1.06670 1.06670 1.10111 1.08964 1.41079 1.11258 1.02082 1.04376 1.03229 0.83730 1.11262 1.26173 1.26173 1.13556 1.27320 1.16997 1.34202 1.19291 1.23879 1.23879
95 | 94 0.55258 0.64205 0.73678 0.67889 0.62626 0.57890 0.54206 0.50522 0.53679 0.51574 0.62100 0.59471 0.56313 0.73154 0.71575 0.67365 0.73154 0.65260 0.69470 0.74207 0.83154
96 | 95 0.64917 0.51551 0.65871 0.62053 0.66826 0.62053 0.63962 0.71599 0.73508 0.64917 0.57279 0.70647 0.75421 0.69692 0.81149 0.76375 0.70647 0.74466 0.73511 0.74466 0.76375
97 | 96 0.43323 0.42061 0.45426 0.40378 0.41220 0.40799 0.44164 0.46688 0.45846 0.46688 0.39117 0.46689 0.48372 0.44166 0.47531 0.39118 0.40801 0.42062 0.50054 0.53419 0.58887
98 | 97 1.25930 1.39303 1.27044 1.14785 1.05870 1.02527 1.02527 1.03641 1.18129 1.11442 0.95840 1.22591 1.31507 1.50453 1.64941 1.97260 1.90573 1.83887 2.24007 2.09519 2.22893
99 | 98 1.40754 1.54321 1.93325 1.95021 1.71279 1.64496 1.50929 1.39058 1.30579 1.52625 1.33971 1.67894 1.72981 1.88244 2.03508 2.10291 2.11987 2.39121 2.22162 2.17075 2.40817
100 | 99 1.37426 1.21331 1.03998 0.99046 1.01522 0.92856 1.01522 0.80475 0.85427 0.97808 1.39902 1.30003 1.32479 1.43622 1.42384 1.39908 1.69623 1.65908 1.75813 1.79528 1.80766
101 | 100 0.77586 0.73323 0.78012 0.64797 0.63944 0.58402 0.66928 0.67781 0.69486 0.78438 0.98474 0.77162 0.86967 0.96346 1.15956 1.12546 1.13399 1.24909 1.25335 1.41535 1.38551
102 |
--------------------------------------------------------------------------------
/overlap/README.md:
--------------------------------------------------------------------------------
1 | This folder contains code that produces the Excel files associated with the different overlap enrichment analyses presented in the paper. In these analyses, for each state in the mouse full-stack annotation, we calculate the overlap fold enrichment between the state and an annotation of interest (for example, different chromosomes, different genomic contexts from RefSeq, or different classes of repeat elements). Please refer to the Methods subsection "External annotation sources" for the list of external genome annotations that we overlap with the mouse full-stack states.
2 | The overlap enrichment process has two steps: (1) run ```ChromHMM OverlapEnrichment``` to calculate the fold enrichment of overlap, and (2) run ```create_excel_painted_overlap_enrichment_LECIF.py``` to create an Excel file that visualizes the results.
3 | (1) We run ```ChromHMM OverlapEnrichment``` with ChromHMM v.1.23, following the instructions in the ChromHMM manual. The command has the following format:
4 | ```
5 | java -jar ChromHMM.jar OverlapEnrichment -b 1 -lowmem -noimage <inputsegment> <inputcoorddir> <outfileprefix>
6 | ```
7 | The ```-b 1``` flag tells ChromHMM to calculate overlap enrichment at 1-bp resolution.
8 | The ```-lowmem``` and ```-noimage``` flags tell ChromHMM to run in lower-memory mode and to skip generating the output heatmap image.
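
For reference, the fold enrichment that ChromHMM reports in step (1) is, as we read the ChromHMM manual, the ratio (C/A)/(B/D), where A is the number of bases in the state, B the number of bases in the external annotation, C the number of bases in both, and D the number of bases in the genome. The snippet below is only a minimal sketch of that ratio; the function name and the toy numbers are hypothetical and are not part of this repository or of ChromHMM.

```python
def fold_enrichment(state_bases, annot_bases, overlap_bases, genome_bases):
    """Sketch of the ratio (C/A)/(B/D); all argument names are hypothetical.

    state_bases   (A): bases assigned to the state
    annot_bases   (B): bases covered by the external annotation
    overlap_bases (C): bases that are in both the state and the annotation
    genome_bases  (D): bases in the genome
    """
    if state_bases == 0 or annot_bases == 0:
        # Assumption for this sketch: report 0 when the ratio is undefined.
        return 0.0
    observed = overlap_bases / state_bases   # fraction of the state inside the annotation
    expected = annot_bases / genome_bases    # fraction of the genome inside the annotation
    return observed / expected


# Toy numbers: half of the state overlaps an annotation that covers 12.5% of the genome.
toy_fold = fold_enrichment(
    state_bases=1_000_000,
    annot_bases=312_500_000,
    overlap_bases=500_000,
    genome_bases=2_500_000_000,
)
print(toy_fold)
```

A value above 1 means the state overlaps the annotation more than expected if the state were placed uniformly at random along the genome.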
9 | (2) We visualize the results of overlap enrichment using the following scripts:
10 | - ```helper.py```: helper functions for file management.
11 | - ```create_excel_painted_overlap_enrichment_LECIF.py```: script that creates Excel files showing the overlap enrichment results.
12 | ```
13 | python create_excel_painted_overlap_enrichment_LECIF.py <input_fn> <output_fn> <context_prefix> <state_annot_fn>
14 | input_fn: the output of chromHMM OverlapEnrichment showing the fold enrichment between each full-stack state and each external genome annotation
15 | output_fn: the name of the excel file that you want to create
16 | context_prefix: a prefix added to all the enrichment contexts in input_fn. For example, if we compute enrichments with gnomAD variants of varying MAF and the column names in input_fn look like 'maf_0_0.1', we can set context_prefix to gnomad. If no prefix is needed, pass an empty string. The prefix is used when writing comments on the states that are most enriched in each genomic context.
17 | state_annot_fn: the file with the characterization of the states; recommended: ../state_annotation_processed.tsv
18 | ```
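
To make the expected layout of input_fn concrete, here is a small sketch that ranks states within each enrichment context, similar in spirit to the rank comments the script writes. It assumes the layout handled by the script in this folder (first column: state, second column: Genome %, one column per annotation, and a final summary row); the toy values and the function name `top_states_per_context` are made up for illustration and do not come from the repository.

```python
import csv
import io

# Toy stand-in for a ChromHMM OverlapEnrichment output file (all values are made up).
TOY_OUTPUT = (
    "State (Emission order)\tGenome %\tTSS.bed.gz\texon.bed.gz\n"
    "1\t40.0\t0.5\t0.8\n"
    "2\t35.0\t12.3\t2.1\n"
    "3\t25.0\t3.4\t6.7\n"
    "Base\t100.0\t1.2\t2.5\n"
)


def top_states_per_context(text, num_top=2):
    """Return {context: [(state, fold_enrichment), ...]} with the num_top most enriched states."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = rows[0]
    state_rows = rows[1:-1]  # skip the header and the final summary row
    result = {}
    for col in range(2, len(header)):  # columns 0 and 1 are the state and Genome %
        ranked = sorted(state_rows, key=lambda row: float(row[col]), reverse=True)
        result[header[col]] = [(int(row[0]), float(row[col])) for row in ranked[:num_top]]
    return result


top = top_states_per_context(TOY_OUTPUT)
for context, states in top.items():
    print(context, states)
```

The real script does the same kind of ranking with pandas and additionally colors each column by enrichment level before writing the Excel file.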
19 |
20 | # License:
21 | All code is provided under the MIT Open Access License.
22 | Copyright 2022 Ha Vu and Jason Ernst
23 |
24 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
25 |
26 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
27 |
28 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
29 |
30 |
--------------------------------------------------------------------------------
/overlap/create_excel_painted_overlap_enrichment_LECIF.py:
--------------------------------------------------------------------------------
1 | import seaborn as sns
2 | import pandas as pd
3 | import numpy as np
4 | import os
5 | import sys
6 | import time
7 | import helper
8 |
9 | print ("This script takes a ChromHMM OverlapEnrichment output file as input and writes a copy of it in Excel format. Cells in the Excel file are colored by enrichment level within each column: colors are comparable among states within one column, but not between columns.")
10 | print ("")
11 | print ("")
12 |
13 | def get_state_annot_df(state_annot_fn):
14 |     # state_annot_fn = '../../..//ROADMAP_aligned_reads/chromHMM_model/model_100_state/figures/supp_excel/state_annotations_processed.csv'
15 |     required_columns = ['state', 'color', 'mneumonics', 'state_order_by_group', 'comments']
16 |     try: # try a comma-separated file first
17 |         state_annot_df = pd.read_csv(state_annot_fn, sep = ',', header = 0)[required_columns]
18 |     except Exception: # parse error or missing columns: fall back to a tab-separated file
19 |         state_annot_df = pd.read_csv(state_annot_fn, sep = '\t', header = 0)[required_columns]
20 |     return state_annot_df
21 |
22 | def get_enrichment_df(enrichment_fn, state_annot_fn):
23 |     state_annot_df = get_state_annot_df(state_annot_fn)
24 |     enrichment_df = pd.read_csv(enrichment_fn, sep = "\t", header = 0)
25 |     try:
26 |         enrichment_df = enrichment_df.rename(columns = {u'state (Emission order)': 'state', u'Genome %' : 'percent_in_genome', u'State (Emission order)': 'state'})
27 |     except Exception:
28 |         pass
29 |     (num_state, num_enr_cont) = (enrichment_df.shape[0] - 1, enrichment_df.shape[1] - 1)
30 |     percent_genome_of_cont = enrichment_df.iloc[num_state, 2:]
31 |     enrichment_df = enrichment_df.loc[:(num_state - 1)].copy() # drop the last row because it holds the percentage of the genome that each genomic context occupies
32 |     enrichment_df['state'] = (enrichment_df['state']).astype(str).astype(int)
33 |     enrichment_df = enrichment_df.merge(state_annot_df, how = 'left', left_on = 'state', right_on = 'state', suffixes = ('_x', '_y'))
34 |     enrichment_df = enrichment_df.sort_values(by = 'state_order_by_group')
35 |     # num_enr_cont : number of enrichment context (TSS, exon, etc.)
36 |     return enrichment_df, num_state, num_enr_cont, percent_genome_of_cont
37 |
38 | def get_rid_of_stupid_file_tail(context_name):
39 |     # first, strip a known file extension from the context name, if present
40 |     if context_name.endswith('.bed.gz'):
41 |         context_name = context_name[:-7]
42 |     elif context_name.endswith('.txt.gz'):
43 |         context_name = context_name[:-7]
44 |     elif context_name.endswith('.txt'):
45 |         context_name = context_name[:-4]
48 |     elif context_name.endswith('.bed'):
49 |         context_name = context_name[:-4]
52 |     # fix if the context name is from variant prioritization analysis
53 |     if 'non_coding' in context_name or 'whole_genome' in context_name: # that means context_name can be FATHMM_non_coding_top_0.01, so we want FATHMM instead
54 |         enr_cont_list = context_name.split('_')
55 |         enr_cont_list = enr_cont_list[:-4] # drop the last 4 items (in the example above: 'non', 'coding', 'top', '0.01')
56 |         context_name = '_'.join(enr_cont_list)
57 |     return context_name
58 |
59 | def add_comments_on_states(enrichment_df, num_state, num_enr_cont, context_prefix):
60 |     enr_cont_list = enrichment_df.columns[2:]
61 |     NUM_TOP_STATE_TO_COMMENT = 3 #7
62 |     output_comment = pd.Series([''] * enrichment_df.shape[0]) # one (initially empty) comment per row of enrichment_df rather than per state, so that any extra row (such as a base line) also gets an empty string
63 |     # output_comment will later be used as a column in enrichment_df that details which enrichment contexts each state is associated with
64 |     for enr_cont in enr_cont_list:
65 |         top_states = enrichment_df[enr_cont][:num_state].nlargest(NUM_TOP_STATE_TO_COMMENT)
66 |         top_states = np.around(top_states, decimals = 1) # round the enrichment values
67 |         comment_this_cont = list(map(lambda x: 'rank' + str(x + 1) + '_'+ context_prefix + '_' + enr_cont + ':' + str(top_states[top_states.index[x]])+ "; ", range(NUM_TOP_STATE_TO_COMMENT))) # comment strings of the form rank<rank>_<context_prefix>_<context>:<fold enrichment>;
68 |         for i in range(NUM_TOP_STATE_TO_COMMENT):
69 |             top_state_index = top_states.index[i] # the index of the state that has rank i + 1
70 |             output_comment[top_state_index] = output_comment[top_state_index] + comment_this_cont[i] # concatenate comment string
71 |     return output_comment
72 |
73 | def color_state_annotation(row_data, index_to_color_list):
74 |     results = [""] * len(row_data) # start with no format for any cell in the row ---> all white for now
75 |     state_annot_color = row_data['color']
76 |     if not pd.isna(row_data['color']):
77 |         for index in index_to_color_list:
78 |             results[index] = 'background-color: %s' % state_annot_color # paint the state annotation cells with the state's color
79 |     return results
80 |
81 | def combine_enrichment_df_with_state_annot_df(enrichment_df, enr_cont_list, state_annot_fn):
82 |     state_annot_df = get_state_annot_df(state_annot_fn)
83 |     enrichment_df['state'] = (enrichment_df['state']).astype(str).astype(int)
84 |     enrichment_df = enrichment_df.merge(state_annot_df, how = 'left', left_on = 'state', right_on = 'state', suffixes = ('_x', '_y'))
85 |     enrichment_df = enrichment_df.sort_values(by = 'state_order_by_group')
86 |     enrichment_df = enrichment_df.reset_index()
87 |     enrichment_df = enrichment_df.drop(columns = ['index'])
88 |     enrichment_df['state'] = enrichment_df['state'].astype(str).astype(int)
89 |     enrichment_df['percent_in_genome'] = enrichment_df['percent_in_genome'].astype(str).astype(float)
90 |     for enr_cont in enr_cont_list:
91 |         enrichment_df[enr_cont] = enrichment_df[enr_cont].astype(str).astype(float)
92 |     enrichment_df['comment'] = enrichment_df['comment'].astype(str)
93 |     enrichment_df['max_fold_context'] = enrichment_df['max_fold_context'].astype(str)
94 |     enrichment_df['max_enrich'] = enrichment_df['max_enrich'].astype(float)
95 |     enrichment_df['mneumonics'] = enrichment_df['mneumonics'].astype(str)
96 |     enrichment_df['color'] = enrichment_df['color'].astype(str)
97 |     enrichment_df['state_order_by_group'] = enrichment_df['state_order_by_group'].astype(int)
98 |     return enrichment_df
99 |
100 | def color_lecif_score(max_fold_context):
101 |     lecif_score_color = {1.0: '#806000', 0.9: '#BF8F00', 0.8: '#FFD966', 0.3: '#8EA9DB', 0.2: '#467CDD', 0.1: '#3A63B7', 0.7: '#FFE699', 0.6: '#FFF2CC', 0.5: '#D9E1F2', 0.4: '#B4C6E7'}
102 |     try:
103 |         color = lecif_score_color[float(max_fold_context)]
104 |     except (KeyError, ValueError, TypeError): # when the value is not a number or not in the dictionary, return a white color
105 |         color = '#ffffff' # white
106 |     # print(max_fold_context, color)
107 |     return 'background-color: %s' % color
108 |
109 | def get_enrichment_colored_df(enrichment_fn, save_fn, context_prefix, state_annot_fn):
110 |     cm = sns.light_palette("red", as_cmap=True)
111 |     enrichment_df = pd.read_csv(enrichment_fn, sep = "\t", header = 0)
112 |     try:
113 |         enrichment_df = enrichment_df.rename(columns = {u'state (Emission order)': 'state', u'Genome %' : 'percent_in_genome', u'State (Emission order)' : 'state'})
114 |     except Exception:
115 |         pass
116 |     enrichment_df = enrichment_df.fillna(0) # if there are nan enrichment values because a state is not present (such as when we create files with foreground and background), fill them with 0 so that the code that makes the colored Excel file does not crash
117 |     (num_state, num_enr_cont) = (enrichment_df.shape[0] - 1, enrichment_df.shape[1] - 1)
118 |     enr_cont_list = enrichment_df.columns[2:]
119 |     enr_cont_list = list(map(get_rid_of_stupid_file_tail, enr_cont_list)) # fix the names of the enrichment contexts: if a name ends with a file extension such as '.bed.gz', strip it
120 |     enrichment_df.columns = list(enrichment_df.columns[:2]) + list(enr_cont_list)
121 |     percent_genome_of_cont = enrichment_df.iloc[num_state, 2:]
122 |     enrichment_df = enrichment_df.loc[:(num_state-1)].copy() # drop the last row because it holds the percentage of the genome that each genomic context occupies; copy so that new columns can be added safely
123 |     # the enrichment context columns have already been renamed above (file extensions stripped)
124 |     no_state_df = enrichment_df[enrichment_df.columns[2:]] # only get the data of enrichment contexts for now, don't consider state and percent_in_genome
125 |     # now call functions to add comments to the enrichment patterns of states
126 |     # add comments about states, calculate the maximum fold enrichment for each state and what context is associated with such fold enrichment
127 |     enrichment_df['comment'] = add_comments_on_states(enrichment_df, num_state, num_enr_cont, context_prefix)
128 |     enrichment_df['max_fold_context'] = no_state_df.apply(lambda x: x.idxmax().split('_')[1], axis = 1) # name of the context that is most enriched in this state (only applies to the lecif score contexts here)
129 |     enrichment_df['max_enrich'] = no_state_df.apply(lambda x: x.max(), axis = 1) # the value of the max fold enrichment in this state
130 |     # merge in the state annotations, then add back the row with the percentage of the genome that each enrichment context occupies
131 |     enrichment_df = combine_enrichment_df_with_state_annot_df(enrichment_df, enr_cont_list, state_annot_fn)
132 |     num_remaining_columns = len(enrichment_df.columns) - 2 - len(enr_cont_list)
133 |     enrichment_df.loc[num_state] = list([0 , np.sum(enrichment_df['percent_in_genome'])]) + list(percent_genome_of_cont) + [None] * num_remaining_columns
134 |     mneumonics_index_in_row = enrichment_df.columns.get_loc('mneumonics') # zero-based column index of the mneumonics entries, which we will use to paint the right column later
135 |     print(enr_cont_list)
136 |     colored_df = enrichment_df.style.background_gradient(subset = pd.IndexSlice[:(num_state-1), [enr_cont_list[0]]], cmap = cm)
137 |     for enr_cont in enr_cont_list[1:]:
138 |         colored_df = colored_df.background_gradient(subset = pd.IndexSlice[:(num_state-1), [enr_cont]], cmap = cm)
139 |     colored_df = colored_df.apply(lambda x: color_state_annotation(x, [mneumonics_index_in_row]), axis = 1)
140 |     # color the lecif score range that is most enriched in each full-stack state
141 |     colored_df = colored_df.applymap(color_lecif_score, subset = pd.IndexSlice[:(num_state - 1), ['max_fold_context']])
142 |     colored_df.to_excel(save_fn, engine = 'openpyxl')
143 |     return colored_df
144 |
145 |
146 | def main():
147 |     if len(sys.argv) != 5:
148 |         usage()
149 |     input_fn = sys.argv[1]
150 |     helper.check_file_exist(input_fn)
151 |     output_fn = sys.argv[2]
152 |     helper.create_folder_for_file(output_fn)
153 |     context_prefix = sys.argv[3]
154 |     state_annot_fn = sys.argv[4]
155 |     print ("Done getting command line! ")
156 |
157 |     get_enrichment_colored_df(input_fn, output_fn, context_prefix, state_annot_fn)
158 |
159 |
160 | def usage():
161 |     print ("python create_excel_painted_overlap_enrichment_LECIF.py <input_fn> <output_fn> <context_prefix> <state_annot_fn>")
162 |     print ("input_fn: the output of chromHMM OverlapEnrichment showing the fold enrichment between each full-stack state and each external genome annotation")
163 |     print ("output_fn: the name of the excel file that you want to create")
164 |     print ("context_prefix: a prefix added to all the enrichment contexts in input_fn. For example, if we compute enrichments with gnomAD variants of varying MAF and the column names in input_fn look like 'maf_0_0.1', we can set context_prefix to gnomad. If no prefix is needed, pass an empty string. The prefix is used when writing comments on the states that are most enriched in each genomic context.")
165 |     print ('state_annot_fn: the file with the characterization of the states; recommended: ../state_annotation_processed.tsv')
166 |     sys.exit(1)
167 |
168 | if __name__ == '__main__':
169 | main()
170 | # input_fn = sys.argv[1]
171 | # output_fn = sys.argv[2]
172 | # get_colored_df_for_consHMM_enrichment(input_fn, output_fn)
--------------------------------------------------------------------------------
/overlap/helper.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import string
4 | import os
5 | import sys
6 | import time
7 | CHROMOSOME_LIST = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 'X', 'Y'] # note: chromosome Y is kept in this list even though it is not available in all cell types
8 |
9 | def make_dir(directory):
10 |     try:
11 |         os.makedirs(directory)
12 |     except OSError: # typically raised when the folder already exists
13 |         print ( 'Folder ' + directory + ' is already created')
14 |
15 |
16 |
17 | def check_file_exist(fn):
18 |     if not os.path.isfile(fn):
19 |         print ( "File: " + fn + " DOES NOT EXIST")
20 |         exit(1)
21 |     return
22 |
23 | def check_dir_exist(fn):
24 |     if not os.path.isdir(fn):
25 |         print ( "Directory: " + fn + " DOES NOT EXIST")
26 |         exit(1)
27 |     return
28 |
29 | def create_folder_for_file(fn):
30 | last_slash_index = fn.rfind('/')
31 | if last_slash_index != -1: # path contains folder
32 | make_dir(fn[:last_slash_index])
33 | return
34 |
35 | def get_command_line_integer(arg):
36 | try:
37 | arg = int(arg)
38 | return arg
39 | except:
40 | print ( "Integer: " + str(arg) + " IS NOT VALID")
41 | exit(1)
42 |
43 |
44 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format
45 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t")
46 | # rename the org_enrichment_df so that it's easier to work with
47 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"})
48 | return enrichment_df
49 |
50 | def get_non_coding_enrichment_df (non_coding_enrichment_fn):
51 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t')
52 | if len(nc_enrichment_df.columns) != 3:
53 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.")
54 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py")
55 | exit(1)
56 | # Now, we know that the nc_enrichment_df has exactly 3 columns
57 | # change the column names
58 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"]
59 | return (nc_enrichment_df)
60 |
--------------------------------------------------------------------------------
/process_CTCF_data/README.md:
--------------------------------------------------------------------------------
1 | This folder contains code to obtain enrichment analysis results of mouse full-stack states with CTCF elements profiled in multiple mouse cell types (obtained from the ENCODE project). The results are presented in Fig. 2F in the manuscript.
2 | - First, we obtained the metadata of CTCF experiments from the ENCODE project and saved it in the file ```metadata_from_ENCODE.tsv```.
3 | - Second, we downloaded data of CTCF peaks (bed files) based on download links provided in ```metadata_from_ENCODE.tsv```, using the script ```download_CTCF_data.py```
4 | ```
5 | usage: download_CTCF_data.py [-h] --metadata_fn METADATA_FN [--download]
6 | --download_folder DOWNLOAD_FOLDER --output_fn
7 | OUTPUT_FN
8 |
9 | This file will filter out the data needed to get the CTCF peaks in mouse
10 |
11 | optional arguments:
12 | -h, --help show this help message and exit
13 | --metadata_fn METADATA_FN
14 | metadata file that I got from ENCODE: mouse, CTCF,
15 | TF Chip-seq
16 | --download If this flag is present, we will download the data
17 | of CTCF peaks
18 | --download_folder DOWNLOAD_FOLDER
19 | Where data should be downloaded to
20 | --output_fn OUTPUT_FN
21 | The file of metadata that we will show as
22 | additional data for the paper
23 |
24 | ```
25 | Note: we include a file ```metadata_clean_for_publication.txt``` in this folder, which is the output of this second step (```download_CTCF_data.py```). However, if you want to fully replicate what we did, you will need to use the code to download the CTCF bed files from ENCODE.
26 | - Third, you will need to use ChromHMM to do overlap enrichment analysis of the chromatin states with each of the CTCF files, corresponding to CTCF from different cell/tissue types. This is the command that we used to run ChromHMM (version 1.23):
27 | ```
28 | java -jar -b 1 -lowmem -noimage
29 | ```
30 | The output of this step is provided in file ```overlap_ctcf_mm10.txt```.
31 | - Lastly, we calculated the geometric mean and standard deviation of the overlap fold enrichment across different cell types. We do not include annotated code to generate the plots here since the procedure is standard (this folder still includes code that works locally but has not been fully cleaned of local file paths). If you want clean, fully commented code for this step, please contact graduate student Ha Vu.
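
As a rough illustration of the last step, the sketch below computes a geometric mean and geometric standard deviation over one state's fold enrichments. Zeros are first replaced by the smallest non-zero value, which mirrors how `calculate_genometric_mean_CTCF_overlap.py` treats them. The sketch uses only the standard library instead of `scipy.stats`; the function name `geo_summary` and the example values are hypothetical, and it assumes at least two values with at least one of them non-zero.

```python
import math

def geo_summary(values):
    """Return (geometric mean, geometric std) of fold enrichments.

    Zeros are replaced by the smallest non-zero value so the logarithm is defined.
    """
    non_zero_min = min(v for v in values if v != 0)
    cleaned = [v if v != 0 else non_zero_min for v in values]
    logs = [math.log(v) for v in cleaned]
    mean_log = sum(logs) / len(logs)
    # sample variance of the logs (ddof = 1), the default used by scipy.stats.gstd
    var_log = sum((x - mean_log) ** 2 for x in logs) / (len(logs) - 1)
    return math.exp(mean_log), math.exp(math.sqrt(var_log))

# hypothetical fold enrichments of one state across four CTCF datasets
geomean, geostd = geo_summary([1.0, 4.0, 0.0, 16.0])
print(geomean, geostd)
```

The geometric standard deviation is a multiplicative factor (always at least 1), so a spread around the geometric mean is usually expressed as `geomean / geostd` to `geomean * geostd`.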
--------------------------------------------------------------------------------
/process_CTCF_data/calculate_genometric_mean_CTCF_overlap.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import os
4 | from scipy.stats import gmean
5 | from scipy.stats import gstd
6 | import sys
7 | import seaborn as sns
8 | sys.path.append('/Users/vuthaiha/Desktop/window_hoff/source/mm10_annotations/overlap')
9 | import create_excel_painted_overlap_enrichment_LECIF as lecif_code
10 |
11 | fn = '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/overlap_ctcf_mm10.txt'
12 | save_excel_fn = '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/overlap_ctcf_mm10.xlsx'
13 | save_csv_fn = '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/overlap_ctcf_mm10_with_calculated_mean.txt'
14 | state_annot_fn = '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/state_annotation_processed.csv'
15 | df = pd.read_csv(fn, header = 0, index_col = None, sep = '\t')
16 | percent_genome_of_cont = df.loc[df.shape[0]-1,:]
17 | df.drop(df.tail(1).index,inplace=True)
18 | enr_colname = df.columns[df.columns.str.endswith('.bed.gz')]
19 | print((df[enr_colname] == 0).sum())
20 | def calculate_gmean(x):
21 | x = x.astype(float)
22 | non_zero_min = np.min(x[x!=0])
23 | x[x==0] = non_zero_min
24 | return gmean(np.array(x))
25 |
26 | def calculate_geostd(x):
27 | x = x.astype(float)
28 | non_zero_min = np.min(x[x!=0])
29 | x[x==0] = non_zero_min
30 | return gstd(np.array(x))
31 |
32 | df['geomean'] = df.apply(lambda x: calculate_gmean(x[enr_colname]), axis = 1)
33 | df['geostd'] = df.apply(lambda x: calculate_geostd(x[enr_colname]), axis = 1)
34 | df['mean'] = df.apply(lambda x: np.mean(x[enr_colname].astype(float)), axis = 1)
35 | df['std'] = df.apply(lambda x: np.std(x[enr_colname].astype(float)), axis = 1)
36 |
37 | cm = sns.light_palette("red", as_cmap=True)
38 | try:
39 | df = df.rename(columns = {u'state (Emission order)': 'state', u'Genome %' : 'percent_in_genome', u'State (Emission order)' : 'state'})
40 | except:
41 | pass
42 | df = df.fillna(0) # if there are nan enrichment values, due to the state not being present (such as when we create files with foreground and background), we can fill it by 0 so that the code to make colorful excel would not crash.
43 | (num_state, num_enr_cont) = (df.shape[0], df.shape[1] - 1)
44 | enr_colname = list(map(lecif_code.get_rid_of_stupid_file_tail, enr_colname)) # fix the name of the enrichment context. If it contains the tail '.bed.gz' then we get rid of it
45 | df.columns = list(map(lecif_code.get_rid_of_stupid_file_tail, df.columns))
46 | # now change the column names of the enrichment contexts. If the contexts' names contain '.bed.gz' then get rid of it.
47 | state_annot_df = lecif_code.get_state_annot_df(state_annot_fn)
48 | df['state'] = (df['state']).astype(str).astype(int)
49 | df = df.merge(state_annot_df, how = 'left', left_on = 'state', right_on = 'state', suffixes = ('_x', '_y'))
50 | df = df.sort_values(by = 'state_order_by_group')
51 | df = df.reset_index(drop = True)
52 | for enr_cont in enr_colname:
53 | df[enr_cont] = df[enr_cont].astype(str).astype(float)
54 | num_remaining_columns = len(df.columns) - 2 - len(enr_colname)
55 | df.loc[df.shape[0]] = list(percent_genome_of_cont) + [None] * num_remaining_columns
56 | mneumonics_index_in_row = df.columns.get_loc('mneumonics') # zero-based column index of the mnemonics entries, which we will use to paint the right column later
57 | colored_df = df.style.background_gradient(subset = pd.IndexSlice[:(num_state-1), [enr_colname[0]]], cmap = cm)
58 | for enr_cont in enr_colname[1:] + ['geomean', 'mean']:
59 | colored_df = colored_df.background_gradient(subset = pd.IndexSlice[:(num_state-1), [enr_cont]], cmap = cm)
60 | colored_df = colored_df.apply(lambda x: lecif_code.color_state_annotation(x, [mneumonics_index_in_row]), axis = 1)
61 | colored_df.to_excel(save_excel_fn, engine = 'openpyxl')
62 | df.to_csv(save_csv_fn, header = True, index = False, sep = '\t')
63 |
--------------------------------------------------------------------------------
/process_CTCF_data/download_CTCF_data.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import helper
4 | import os
5 | import argparse
6 | import sys
7 | import requests
8 | parser = argparse.ArgumentParser(description = 'This file will filter out the data needed to get the CTCF peaks in mouse')
9 | parser.add_argument('--metadata_fn', type = str, required = True, help = 'metadata file that I got from ENCODE: mouse, CTCF, TF Chip-seq')
10 | parser.add_argument('--download', action='store_true', default = False, help = 'If this flag is present, we will download the data of CTCF peaks')
11 | parser.add_argument('--download_folder', type = str, required = True, help = 'Where data should be downloaded to')
12 | parser.add_argument('--output_fn', required = True, help = 'The file of metadata that we will show as additional data for the paper')
13 | args = parser.parse_args()
14 | helper.check_file_exist(args.metadata_fn)
15 | helper.create_folder_for_file(args.output_fn)
16 | helper.make_dir(args.download_folder)
17 | ###################################################################################
18 | def cleanup_metadata(metadata_fn, output_fn):
19 | meta_df = pd.read_csv(metadata_fn, header = 0, index_col = None, sep = '\t')
20 | meta_df.dropna(axis = 1, how = 'all',inplace = True) # drop all empty columns, only 34 of them left
21 | meta_df = meta_df[meta_df['File assembly'] == 'mm10']
22 | meta_df = meta_df[meta_df['Assay'] == 'TF ChIP-seq']
23 | meta_df = meta_df[meta_df['Biosample organism'] == 'Mus musculus']
24 | meta_df = meta_df[meta_df['Experiment target'] == 'CTCF-mouse']
25 | meta_df = meta_df[meta_df['Project'] == 'ENCODE']
26 | meta_df = meta_df[meta_df['File analysis title'].str.contains('ENCODE4', na = False)]
27 | meta_df = meta_df[meta_df['File type'] == 'bed']
28 | # I used the function metadata_df.nunique() to find the number of unique values in each column, and that helped me figure the necessary columns that contains different values for different dataset that I will download
29 | meta_df = meta_df[['File accession', 'Output type', 'Experiment accession', 'Assay', 'Donor(s)', 'Biosample term id', 'Biosample term name', 'Biosample type', 'Experiment target', 'File download URL', 'File analysis title']]
30 | print(meta_df['File accession'].unique())
31 | print(len(meta_df['File accession'].unique()))
32 | meta_df.to_csv(output_fn, header = True, index = False, sep = '\t')
33 | return meta_df
34 |
35 | def download_data_one_row(row):
36 |     biosample_term = '-'.join(row['Biosample term name'].split())
37 |     biosample_save_fn = '{id}_{term}.bed.gz'.format(id = row['File accession'], term = biosample_term)
38 |     save_fn = os.path.join(args.download_folder, biosample_save_fn)
39 |     if not os.path.isfile(save_fn): # skip files that were already downloaded
40 |         response = requests.get(row['File download URL'])
41 |         response.raise_for_status() # stop instead of saving an error page as a bed file
42 |         with open(save_fn, 'wb') as file: # the context manager closes the file even if writing fails
43 |             file.write(response.content)
44 |     return
45 |
46 | if __name__ == '__main__':
47 | meta_df = cleanup_metadata(args.metadata_fn, args.output_fn)
48 | print('Done cleaning up metadata')
49 | if args.download:
50 | meta_df.apply(download_data_one_row, axis = 1) # apply function to download data for each row (each data of the cleaned metadata)
51 | print('Done downloading data')
52 |
53 |
--------------------------------------------------------------------------------
/process_CTCF_data/draw_2D_overlap_enrichment_CTCF.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "draw_2D_overlap_enrichment_CTCF"
3 | author: "Ha Vu"
4 | date: "10/23/2022"
5 | output: html_document
6 | ---
7 |
8 | ```{r}
9 | knitr::opts_chunk$set(echo = TRUE)
10 | library(tidyr)
11 | library(tidyverse)
12 | library(ggplot2)
13 | library(dplyr)
14 | ```
15 |
16 | ```{r}
17 | overlap_fn <- '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/overlap_ctcf_mm10_with_calculated_mean.txt'
18 | state_annot_fn <- '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/state_annotation_processed.csv'
19 | ```
20 |
21 | ```{r}
22 | RowSD <- function(x) { # function that I can use to calculate the standard deviation of a list of numbers. Taken from https://stackoverflow.com/questions/29581612/r-dplyr-mutate-calculating-standard-deviation-for-each-row
23 | sqrt(rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1))
24 | }
25 |
26 | ```
27 |
28 | ```{r}
29 | state_df <- as.data.frame(read.csv(state_annot_fn, header = T, sep = '\t', check.names = F))
30 | state_df <- state_df %>% select(-c('comments', 'color', 'group', 'itemRgb'))
31 | ```
32 |
33 | ```{r}
34 | df <- read.table(overlap_fn, header = T, sep = '\t', check.names = F) # check.names should be set to F otherwise R will change the colum names with special characters
35 | #df <- df %>% slice(1 : (n()-1)) # we do not really need this anymore because
36 | df <- df %>% rename(c('state' = 'state (Emission order)', 'genome_percent' = 'Genome %'))
37 | df$state <- as.integer(df$state)
38 | df <- df %>% left_join(state_df, by = 'state')
39 | #df <- df %>% mutate(mean_overlap= mean(select(., ends_with('.bed.gz'))), std_overlap= RowSD(select(., ends_with('.bed.gz'))))
40 | df <- df %>% arrange(state_order_by_group)
41 | #show_df <- df%>% select(c('mneumonics', 'geomean', 'geostd', 'mean_overlap', 'std_overlap'))
42 | #show_df %>% write_csv('/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/mean_overlap_ctcf.csv')
43 | ```
44 |
45 | ```{r}
46 | ggplot(df, aes(x=fct_inorder(mneumonics), y = geomean)) + # factor inorder to keep the order of states in the plots https://forcats.tidyverse.org/reference/fct_inorder.html
47 | geom_line(linetype = "dashed", size = 1) +
48 | geom_point() +
49 | geom_errorbar(aes(ymin = geomean - geostd, ymax = geomean + geostd), width = 0.5, position = position_dodge(0.05)) +
50 | theme_bw() +
51 | theme(axis.text.x = element_text(size = 5, angle = 90))
52 | save_fn <- '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/geomean_overlap_ctcf.png'
53 | ggsave(save_fn, width = 10, height = 4)
54 | ```
--------------------------------------------------------------------------------
/process_CTCF_data/helper.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import string
4 | import os
5 | import sys
6 | import time
7 | def make_dir(directory):
8 |     try:
9 |         os.makedirs(directory)
10 |     except OSError: # typically raised when the folder already exists
11 |         print ( 'Folder ' + directory + ' is already created')
12 |
13 |
14 |
15 | def check_file_exist(fn):
16 |     if not os.path.isfile(fn):
17 |         print ( "File: " + fn + " DOES NOT EXIST")
18 |         exit(1)
19 |     return
20 |
21 | def check_dir_exist(fn):
22 |     if not os.path.isdir(fn):
23 |         print ( "Directory: " + fn + " DOES NOT EXIST")
24 |         exit(1)
25 |     return
26 |
27 | def create_folder_for_file(fn):
28 | last_slash_index = fn.rfind('/')
29 | if last_slash_index != -1: # path contains folder
30 | make_dir(fn[:last_slash_index])
31 | return
32 |
33 | def get_command_line_integer(arg):
34 | try:
35 | arg = int(arg)
36 | return arg
37 | except:
38 | print ( "Integer: " + str(arg) + " IS NOT VALID")
39 | exit(1)
40 |
41 |
42 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format
43 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t")
44 | # rename the org_enrichment_df so that it's easier to work with
45 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"})
46 | return enrichment_df
47 |
48 | def get_non_coding_enrichment_df (non_coding_enrichment_fn):
49 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t')
50 | if len(nc_enrichment_df.columns) != 3:
51 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.")
52 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py")
53 | exit(1)
54 | # Now, we know that the nc_enrichment_df has exactly 3 columns
55 | # change the column names
56 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"]
57 | return (nc_enrichment_df)
58 |
--------------------------------------------------------------------------------
/process_CTCF_data/metadata_clean_for_publication.txt:
--------------------------------------------------------------------------------
1 | File accession Output type Experiment accession Assay Donor(s) Biosample term id Biosample term name Biosample type Experiment target File download URL File analysis title
2 | ENCFF210BTO IDR thresholded peaks ENCSR000CED TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002108 small intestine tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF210BTO/@@download/ENCFF210BTO.bed.gz ENCODE4 v1.5.1 mm10
3 | ENCFF349BUL IDR thresholded peaks ENCSR000CDZ TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002370 thymus tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF349BUL/@@download/ENCFF349BUL.bed.gz ENCODE4 v1.5.1 mm10
4 | ENCFF502WDT IDR thresholded peaks ENCSR000ETQ TF ChIP-seq /mouse-donors/ENCDO073AAA/ EFO:0003971 MEL cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF502WDT/@@download/ENCFF502WDT.bed.gz ENCODE4 v1.5.1 mm10
5 | ENCFF065ODV conservative IDR thresholded peaks ENCSR000DIP TF ChIP-seq /mouse-donors/ENCDO073AAA/ EFO:0003971 MEL cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF065ODV/@@download/ENCFF065ODV.bed.gz ENCODE4 v1.5.1 mm10
6 | ENCFF316FYN IDR thresholded peaks ENCSR000CBN TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002037 cerebellum tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF316FYN/@@download/ENCFF316FYN.bed.gz ENCODE4 v1.5.1 mm10
7 | ENCFF097CAQ IDR thresholded peaks ENCSR000CBL TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002371 bone marrow tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF097CAQ/@@download/ENCFF097CAQ.bed.gz ENCODE4 v1.5.1 mm10
8 | ENCFF491RJK IDR thresholded peaks ENCSR000CBV TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002048 lung tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF491RJK/@@download/ENCFF491RJK.bed.gz ENCODE4 v1.5.1 mm10
9 | ENCFF089EWX IDR thresholded peaks ENCSR000CBY TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002106 spleen tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF089EWX/@@download/ENCFF089EWX.bed.gz ENCODE4 v1.5.1 mm10
10 | ENCFF575MGI IDR thresholded peaks ENCSR000CBO TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0005343 cortical plate tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF575MGI/@@download/ENCFF575MGI.bed.gz ENCODE4 v1.5.1 mm10
11 | ENCFF951GOP IDR thresholded peaks ENCSR000CBI TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000948 heart tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF951GOP/@@download/ENCFF951GOP.bed.gz ENCODE4 v1.5.1 mm10
12 | ENCFF281QMV IDR thresholded peaks ENCSR000CEH TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000955 brain tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF281QMV/@@download/ENCFF281QMV.bed.gz ENCODE4 v1.5.1 mm10
13 | ENCFF033UQX IDR thresholded peaks ENCSR000CEF TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000473 testis tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF033UQX/@@download/ENCFF033UQX.bed.gz ENCODE4 v1.5.1 mm10
14 | ENCFF504JDS conservative IDR thresholded peaks ENCSR000DIU TF ChIP-seq /mouse-donors/ENCDO324YEK/ EFO:0005233 CH12.LX cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF504JDS/@@download/ENCFF504JDS.bed.gz ENCODE4 v1.5.1 mm10
15 | ENCFF601JVE IDR thresholded peaks ENCSR000DIS TF ChIP-seq /mouse-donors/ENCDO089AAA/ EFO:0002034 G1E cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF601JVE/@@download/ENCFF601JVE.bed.gz ENCODE4 v1.5.1 mm10
16 | ENCFF805QBU IDR thresholded peaks ENCSR000CFJ TF ChIP-seq /mouse-donors/ENCDO081AAA/ CL:0002476 bone marrow macrophage primary cell CTCF-mouse https://www.encodeproject.org/files/ENCFF805QBU/@@download/ENCFF805QBU.bed.gz ENCODE4 v1.5.1 mm10
17 | ENCFF142CNG IDR thresholded peaks ENCSR000CFH TF ChIP-seq /mouse-donors/ENCDO073AAA/ EFO:0003971 MEL cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF142CNG/@@download/ENCFF142CNG.bed.gz ENCODE4 v1.5.1 mm10
18 | ENCFF611HDQ IDR thresholded peaks ENCSR000CBW TF ChIP-seq /mouse-donors/ENCDO072AAA/ CL:2000042 embryonic fibroblast primary cell CTCF-mouse https://www.encodeproject.org/files/ENCFF611HDQ/@@download/ENCFF611HDQ.bed.gz ENCODE4 v1.5.1 mm10
19 | ENCFF957SPH conservative IDR thresholded peaks ENCSR000CBJ TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002113 kidney tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF957SPH/@@download/ENCFF957SPH.bed.gz ENCODE4 v1.5.1 mm10
20 | ENCFF883UPM IDR thresholded peaks ENCSR000ERM TF ChIP-seq /mouse-donors/ENCDO324YEK/ EFO:0005233 CH12.LX cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF883UPM/@@download/ENCFF883UPM.bed.gz ENCODE4 v1.5.1 mm10
21 | ENCFF086ZTV IDR thresholded peaks ENCSR000CDX TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002264 olfactory bulb tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF086ZTV/@@download/ENCFF086ZTV.bed.gz ENCODE4 v1.5.1 mm10
22 | ENCFF136ZDK IDR thresholded peaks ENCSR397RHW TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002107 liver tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF136ZDK/@@download/ENCFF136ZDK.bed.gz ENCODE4 v1.5.1 mm10
23 | ENCFF589ZBJ IDR thresholded peaks ENCSR041SMK TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002107 liver tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF589ZBJ/@@download/ENCFF589ZBJ.bed.gz ENCODE4 v1.5.1 mm10
24 | ENCFF430PPJ IDR thresholded peaks ENCSR677HXC TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0001890 forebrain tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF430PPJ/@@download/ENCFF430PPJ.bed.gz ENCODE4 v1.5.1 mm10
25 | ENCFF198SSR IDR thresholded peaks ENCSR418SBY TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002048 lung tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF198SSR/@@download/ENCFF198SSR.bed.gz ENCODE4 v1.5.1 mm10
26 | ENCFF165LXO IDR thresholded peaks ENCSR677SIH TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002048 lung tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF165LXO/@@download/ENCFF165LXO.bed.gz ENCODE4 v1.5.1 mm10
27 | ENCFF194HVL IDR thresholded peaks ENCSR000CBU TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002107 liver tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF194HVL/@@download/ENCFF194HVL.bed.gz ENCODE4 v1.5.1 mm10
28 | ENCFF024JOZ IDR thresholded peaks ENCSR143WOK TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002113 kidney tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF024JOZ/@@download/ENCFF024JOZ.bed.gz ENCODE4 v1.5.1 mm10
29 | ENCFF660HRL IDR thresholded peaks ENCSR245OKD TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0000948 heart tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF660HRL/@@download/ENCFF660HRL.bed.gz ENCODE4 v1.8.0 mm10
30 | ENCFF143RZH IDR thresholded peaks ENCSR799YRV TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0002305 layer of hippocampus tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF143RZH/@@download/ENCFF143RZH.bed.gz ENCODE4 v1.8.0 mm10
31 | ENCFF784ASD IDR thresholded peaks ENCSR000AIJ TF ChIP-seq /mouse-donors/ENCDO321TZV/ EFO:0001098 C2C12 cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF784ASD/@@download/ENCFF784ASD.bed.gz ENCODE4 v1.5.1 mm10
32 | ENCFF327LYS IDR thresholded peaks ENCSR002ZAG TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000160 intestine tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF327LYS/@@download/ENCFF327LYS.bed.gz ENCODE4 v1.5.1 mm10
33 | ENCFF052KVO conservative IDR thresholded peaks ENCSR362VNF TF ChIP-seq /mouse-donors/ENCDO015AAA/ EFO:0007751 E14TG2a.4 cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF052KVO/@@download/ENCFF052KVO.bed.gz ENCODE4 v1.5.1 mm10
34 | ENCFF921ILM IDR thresholded peaks ENCSR877MSN TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0002305 layer of hippocampus tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF921ILM/@@download/ENCFF921ILM.bed.gz ENCODE4 v1.8.0 mm10
35 | ENCFF680EMQ IDR thresholded peaks ENCSR415XNJ TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0001388 gastrocnemius tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF680EMQ/@@download/ENCFF680EMQ.bed.gz ENCODE4 v1.8.0 mm10
36 | ENCFF649JAA IDR thresholded peaks ENCSR798CCX TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0001388 gastrocnemius tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF649JAA/@@download/ENCFF649JAA.bed.gz ENCODE4 v1.8.0 mm10
37 | ENCFF388VCL IDR thresholded peaks ENCSR642FSG TF ChIP-seq /mouse-donors/ENCDO509HIY/ NTR:0000646 left cerebral cortex tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF388VCL/@@download/ENCFF388VCL.bed.gz ENCODE4 v1.8.0 mm10
38 | ENCFF609MDN IDR thresholded peaks ENCSR491NUM TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000948 heart tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF609MDN/@@download/ENCFF609MDN.bed.gz ENCODE4 v1.5.1 mm10
39 | ENCFF999RIN IDR thresholded peaks ENCSR644VYX TF ChIP-seq /mouse-donors/ENCDO509HIY/ NTR:0000646 left cerebral cortex tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF999RIN/@@download/ENCFF999RIN.bed.gz ENCODE4 v1.8.0 mm10
40 | ENCFF089GBC IDR thresholded peaks ENCSR985ZTV TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0001891 midbrain tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF089GBC/@@download/ENCFF089GBC.bed.gz ENCODE4 v1.5.1 mm10
41 | ENCFF907XEG IDR thresholded peaks ENCSR104QEN TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000945 stomach tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF907XEG/@@download/ENCFF907XEG.bed.gz ENCODE4 v1.5.1 mm10
42 | ENCFF092NMJ IDR thresholded peaks ENCSR150RGT TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002028 hindbrain tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF092NMJ/@@download/ENCFF092NMJ.bed.gz ENCODE4 v1.5.1 mm10
43 | ENCFF637GEK IDR thresholded peaks ENCSR024KGB TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0000948 heart tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF637GEK/@@download/ENCFF637GEK.bed.gz ENCODE4 v1.8.0 mm10
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/._15_state_annotations.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/relationship_with_ct_spec_state/._15_state_annotations.csv
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/README.md:
--------------------------------------------------------------------------------
1 | This folder contains code to do analysis and create plots that replicate Figures S6 and S7 in the manuscript.
2 | - First, we obtained per-cell-type 15-chromatin state annotations for 66 reference epigenomes/cell types from Gorkin et al., 2020, with download links curated and provided by Kwon and Ernst, 2021 (in Supplementary File 1, tab g). Note: the provided links show that all bed files are named based on the format ```_15_segments.bed.gz```, and saved into a folder that is given variable name ```ct_segment_folder``` in the provided snakefile ```mouse_random_represent_full_stack.snakefile``` (look into the first few lines of the file if you are confused about what we mean here). Within this folder, we store some example bed files representing per-ct annotations into the folder ```example_perCT_segment```. Also note that the segmentation files from Gorkin et al., 2020 are sorted bed files and the code that we provide here also assumes that the per-ct segmentation files are sorted. If you apply our code to some data other than those provided by Gorkin et al., 2020, you need to sort the bed files, using the command line ```zcat | sort -k1,1 -k2,2n > ```
3 | - Second, download the file of mouse full-stack chromatin state annotation ```genome_100_segments.bed.gz```, introduced in our manuscript (see the main readme for download links)
4 | - Next, you can start running our snakemake pipeline in ```mouse_random_represent_full_stack.snakefile``` to reproduce the plots in Fig. S6 and Fig. S7. To run the snakemake pipeline to conduct the analysis sequentially (which can take a long time), you can run ```snakemake --cores 1 --snakefile mouse_random_represent_full_stack.snakefile```. To run the snakemake pipeline such that jobs can be run in parallel, you can run ```snakemake -j --cluster "" --snakefile mouse_random_represent_full_stack.snakefile```, where for our local machine settings, `````` is ```qsub -V -l h_rt=4:00:00,h_data=4G```, specifying the time and memory needed for each job in the pipeline to run.
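
Since the pipeline assumes sorted per-ct segmentation files (see the first step above), a quick check like the following sketch may be useful before running it on other data. It is not part of the repository: the function name `is_sorted_bed` is hypothetical, and it assumes tab-separated, gzipped bed files whose chromosome names should be ordered by plain byte comparison, as `sort` does under `LC_ALL=C`.

```python
import gzip

def is_sorted_bed(path):
    """Return True if a gzipped bed file is ordered by chromosome, then by start."""
    previous = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip("\n").split("\t")
            key = (fields[0], int(fields[1]))  # (chromosome, start coordinate)
            if previous is not None and key < previous:
                return False
            previous = key
    return True
```

For example, `is_sorted_bed("example_perCT_segment/P0_lung_15_segments.bed.gz")` would be expected to return `True` for the example files shipped in this folder, given that the Gorkin et al., 2020 files are described as sorted.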
5 | # Some scripts that are part of the pipeline (we already incorporated these files into the snakemake pipeline. Therefore, we won't go into details about where these scripts are used within the pipeline here)
6 | - File ```helper.py``` contains some useful functions used by most other scripts.
7 | - File ```get_mm_enrichments_across_ct.py```
8 | ```
9 | usage: get_mm_enrichments_across_ct.py [-h] [--input_folder INPUT_FOLDER] [--output_fn OUTPUT_FN] [--num_state_ct_model NUM_STATE_CT_MODEL] [--full_state_annot_fn FULL_STATE_ANNOT_FN]
10 | [--ct_state_annot_fn CT_STATE_ANNOT_FN]
11 | Create an excel of the top ct-spec states enriched in each full-stack state
12 | optional arguments:
13 | -h, --help show this help message and exit
14 | --input_folder INPUT_FOLDER
15 | where there are multiple subfolders, each containing enrichment data for different cell type specific model
16 | --output_fn OUTPUT_FN
17 | Where the files showing max, mean, min fold enrichment across all ct-state states are reported for all cell types for each state is reported
18 | --num_state_ct_model NUM_STATE_CT_MODEL
19 | number of states in the ct-spec models
20 | --full_state_annot_fn FULL_STATE_ANNOT_FN
21 | file of the full-stack state annotations
22 | --ct_state_annot_fn CT_STATE_ANNOT_FN
23 | file of the ct-spec state annotations
24 | ```
25 | - File ```get_ranked_max_enriched_25_state.py```
26 | ```
27 | usage: get_ranked_max_enriched_25_state.py [-h] [--input_folder INPUT_FOLDER] [--output_fn OUTPUT_FN] [--num_state_ct_model NUM_STATE_CT_MODEL] [--full_state_annot_fn FULL_STATE_ANNOT_FN]
28 | [--ct_state_annot_fn CT_STATE_ANNOT_FN] [--ct_group_fn CT_GROUP_FN]
29 |
30 | Create an excel of the top ct-spec states enriched in each full-stack state
31 |
32 | optional arguments:
33 | -h, --help show this help message and exit
34 | --input_folder INPUT_FOLDER
35 | where there are multiple subfolders, each containing enrichment data for different cell type specific model
36 | --output_fn OUTPUT_FN
37 | Where the ct-state with maximum fold enrichment across all ct-state states are reported for all cell types for each state is reported
38 | --num_state_ct_model NUM_STATE_CT_MODEL
39 | number of states in the ct-spec models
40 | --full_state_annot_fn FULL_STATE_ANNOT_FN
41 | file of the full-stack state annotations
42 | --ct_state_annot_fn CT_STATE_ANNOT_FN
43 | file of the ct-spec state annotations
44 | --ct_group_fn CT_GROUP_FN
45 | file of the annotations of the cell types
46 | ```
47 | - File ```calculate_summary_sample_regions.py```
48 | ```
49 | usage: calculate_summary_sample_regions.py [-h] [--all_seed_folder ALL_SEED_FOLDER] [--output_folder OUTPUT_FOLDER] [--num_state_ct_model NUM_STATE_CT_MODEL] [--full_state_annot_fn FULL_STATE_ANNOT_FN]
50 | [--ct_state_annot_fn CT_STATE_ANNOT_FN] [--ct_group_fn CT_GROUP_FN]
51 |
52 | Create an excel of the top ct-spec states enriched in each full-stack state
53 |
54 | optional arguments:
55 | -h, --help show this help message and exit
56 | --all_seed_folder ALL_SEED_FOLDER
57 | where there are multiple subfolders, each containing enrichment data for different cell type specific model
58 | --output_folder OUTPUT_FOLDER
59 | Where the ct-state with maximum fold enrichment across all ct-state states are reported for all cell types for each state is reported
60 | --num_state_ct_model NUM_STATE_CT_MODEL
61 | number of states in the ct-spec models
62 | --full_state_annot_fn FULL_STATE_ANNOT_FN
63 | file of the full-stack state annotations
64 | --ct_state_annot_fn CT_STATE_ANNOT_FN
65 | file of the ct-spec state annotations
66 | --ct_group_fn CT_GROUP_FN
67 | file of the annotations of the cell types
68 | ```
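
As a rough illustration of the kind of summary that ```get_mm_enrichments_across_ct.py``` reports, the sketch below computes the max, min and median fold enrichment across cell types for each full-stack state using pandas. The toy tables, the column name `U1`, the function name `summarize_context` and all numbers are assumptions made up for this example; the actual script reads one ```overlap_enrichment.txt``` file per subfolder of ```--input_folder```.

```python
import pandas as pd

# Hypothetical toy data: one enrichment table per cell type.
# Rows are full-stack states; 'U1' stands in for one ct-spec state column.
per_ct_tables = {
    "P0_forebrain": pd.DataFrame({"state": [1, 2, 3], "U1": [4.0, 0.5, 1.0]}),
    "P0_lung": pd.DataFrame({"state": [1, 2, 3], "U1": [2.0, 1.5, 1.0]}),
    "P0_stomach": pd.DataFrame({"state": [1, 2, 3], "U1": [3.0, 0.1, 7.0]}),
}

def summarize_context(tables, context):
    """Collect one enrichment column from every cell type and summarize it per state."""
    first = next(iter(tables.values()))
    collected = pd.DataFrame({"state": first["state"]})
    for ct_name, df in tables.items():
        collected[ct_name] = df[context].values
    # Summaries are computed from the per-cell-type values only
    values = collected.drop(columns="state")
    return pd.DataFrame({
        "state": collected["state"],
        "max_enrichment": values.max(axis=1),
        "min_enrichment": values.min(axis=1),
        "median_enrichment": values.median(axis=1),
    })

summary = summarize_context(per_ct_tables, "U1")
print(summary)
```

Each row of `summary` corresponds to one full-stack state, and each summary column aggregates over the three toy cell types.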
69 |
70 | # License:
71 | All code is provided under the MIT Open Access License
72 | Copyright 2022 Ha Vu and Jason Ernst
73 |
74 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
75 |
76 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
77 |
78 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
79 |
80 |
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/example_perCT_segment/15_state_annotations.csv:
--------------------------------------------------------------------------------
1 | state mneumonics Meaning color
2 | 1 1_PrA Promoter Active #FF0000
3 | 2 2_PrW Promoter Weak #FF7D4D
4 | 3 3_PrB Promoter Bivalent #7030A0
5 | 4 4_PrF Promoter Flanking #FF7D4D
6 | 5 5_EnSd Enhancer Strong TSS-dist #FFA500
7 | 6 6_EnSp Enhancer Strong TSS-prox #FFA500
8 | 7 7_EnW Enhancer Weak, TSS-dist #FFFF00
9 | 8 8_EnPd Enhancer Poised, TSS-dist #E2D27D
10 | 9 9_EnPp Enhancer Poised, TSS-prox #E2D27D
11 | 10 10_TrS Transcription Strong #008000
12 | 11 11_TrP Transcription Permission #DFE7DE
13 | 12 12_TrI Transcription Initiation #008000
14 | 13 13_HcP Polycomb-assoc. #808080
15 | 14 14_HcH Heterochromatin H3K9me3 #B19CD9
16 | 15 15_Ns No signal (Quiescent) #FFFFFF
17 |
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/example_perCT_segment/P0_forebrain_15_segments.bed.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/relationship_with_ct_spec_state/example_perCT_segment/P0_forebrain_15_segments.bed.gz
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/example_perCT_segment/P0_lung_15_segments.bed.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/relationship_with_ct_spec_state/example_perCT_segment/P0_lung_15_segments.bed.gz
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/example_perCT_segment/P0_stomach_15_segments.bed.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/relationship_with_ct_spec_state/example_perCT_segment/P0_stomach_15_segments.bed.gz
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/example_perCT_segment/README.md:
--------------------------------------------------------------------------------
1 | - Each ```.bed.gz``` file in this folder represents the chromatin state annotation for one cell type (sample) from Gorkin et al., 2020. For demonstration purposes, we provide only 3 files, each restricted to chromosome 19. To replicate how we obtained Fig. S6 and Fig. S7 in Kwon and Ernst, 2021, you will have to download the raw data provided in the paper.
2 | - File ```tissue_annotation.csv``` contains metadata about the tissue types from Gorkin et al., 2020, so that we can color and group the cell types that have per-ct annotations. Columns: tissue_stage, ct, stage, color, group. These are needed later for plotting and for grouping the cell types by tissue group, along with their color codes.
3 |
4 | - File ```15_state_annotations.csv``` contains metadata about the 15 states of the per-ct annotations, so that we can color the per-ct states as in Fig. S7 in the manuscript.
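
A minimal sketch of how these two annotation tables could be turned into color lookups for plotting is shown below. It assumes the files are tab-separated (the scripts in the parent folder read them with `sep = '\t'`); the in-memory excerpts and the variable names `sample_color`, `sample_group` and `state_color` are hypothetical and only serve the example.

```python
import io
import pandas as pd

# Hypothetical excerpts of the two tables, assumed to be tab-separated
tissue_txt = (
    "tissue_stage\tct\tstage\tcolor\tgroup\n"
    "P0_forebrain\tforebrain\tP0\t#C5912B\tBrain\n"
    "P0_lung\tlung\tP0\t#E41A1C\tlung\n"
    "P0_stomach\tstomach\tP0\t#E5BDB5\tstomach\n"
)
state_txt = (
    "state\tmneumonics\tMeaning\tcolor\n"
    "1\t1_PrA\tPromoter Active\t#FF0000\n"
    "15\t15_Ns\tNo signal (Quiescent)\t#FFFFFF\n"
)

tissue_df = pd.read_csv(io.StringIO(tissue_txt), sep="\t")
state_df = pd.read_csv(io.StringIO(state_txt), sep="\t")

# Map each sample (tissue_stage) to its plotting color and tissue group
sample_color = dict(zip(tissue_df["tissue_stage"], tissue_df["color"]))
sample_group = dict(zip(tissue_df["tissue_stage"], tissue_df["group"]))
# Map each per-ct state name (column 'mneumonics') to its color
state_color = dict(zip(state_df["mneumonics"], state_df["color"]))

print(sample_color["P0_lung"], sample_group["P0_forebrain"], state_color["1_PrA"])
```

To use the real files, replace the `io.StringIO(...)` arguments with the paths to ```tissue_annotation.csv``` and ```15_state_annotations.csv```.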
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/example_perCT_segment/tissue_annotation.csv:
--------------------------------------------------------------------------------
1 | tissue_stage ct stage color group
2 | e11.5_facial-prominence facial-prominence e11.5 #124c78
3 | e11.5_forebrain forebrain e11.5 #C5912B Brain
4 | e11.5_heart heart e11.5 #D56F80 Heart
5 | e11.5_hindbrain hindbrain e11.5 #C5912B Brain
6 | e11.5_limb limb e11.5 #F9B6CF Muscle
7 | e11.5_liver liver e11.5 #9BC2E6 Liver
8 | e11.5_midbrain midbrain e11.5 #C5912B Brain
9 | e11.5_neural-tube neural-tube e11.5 #FFD924 neuron
10 | e12.5_facial-prominence facial-prominence e12.5 #124c78
11 | e12.5_forebrain forebrain e12.5 #C5912B Brain
12 | e12.5_heart heart e12.5 #D56F80 Heart
13 | e12.5_hindbrain hindbrain e12.5 #C5912B Brain
14 | e12.5_limb limb e12.5 #F9B6CF Muscle
15 | e12.5_liver liver e12.5 #9BC2E6 Liver
16 | e12.5_midbrain midbrain e12.5 #C5912B Brain
17 | e12.5_neural-tube neural-tube e12.5 #FFD924 neuron
18 | e13.5_facial-prominence facial-prominence e13.5 #124c78
19 | e13.5_forebrain forebrain e13.5 #C5912B Brain
20 | e13.5_heart heart e13.5 #D56F80 Heart
21 | e13.5_hindbrain hindbrain e13.5 #C5912B Brain
22 | e13.5_limb limb e13.5 #F9B6CF Muscle
23 | e13.5_liver liver e13.5 #9BC2E6 Liver
24 | e13.5_midbrain midbrain e13.5 #C5912B Brain
25 | e13.5_neural-tube neural-tube e13.5 #FFD924 neuron
26 | e14.5_facial-prominence facial-prominence e14.5 #124c78
27 | e14.5_forebrain forebrain e14.5 #C5912B Brain
28 | e14.5_heart heart e14.5 #D56F80 Heart
29 | e14.5_hindbrain hindbrain e14.5 #C5912B Brain
30 | e14.5_intestine intestine e14.5 #D0A39B intestine
31 | e14.5_kidney kidney e14.5 #F4B084 kidney
32 | e14.5_limb limb e14.5 #F9B6CF Muscle
33 | e14.5_liver liver e14.5 #9BC2E6 Liver
34 | e14.5_lung lung e14.5 #E41A1C lung
35 | e14.5_midbrain midbrain e14.5 #C5912B Brain
36 | e14.5_neural-tube neural-tube e14.5 #FFD924 neuron
37 | e14.5_stomach stomach e14.5 #E5BDB5 stomach
38 | e15.5_facial-prominence facial-prominence e15.5 #124c78
39 | e15.5_forebrain forebrain e15.5 #C5912B Brain
40 | e15.5_heart heart e15.5 #D56F80 Heart
41 | e15.5_hindbrain hindbrain e15.5 #C5912B Brain
42 | e15.5_intestine intestine e15.5 #D0A39B intestine
43 | e15.5_kidney kidney e15.5 #F4B084 kidney
44 | e15.5_limb limb e15.5 #F9B6CF Muscle
45 | e15.5_liver liver e15.5 #9BC2E6 Liver
46 | e15.5_lung lung e15.5 #E41A1C lung
47 | e15.5_midbrain midbrain e15.5 #C5912B Brain
48 | e15.5_neural-tube neural-tube e15.5 #FFD924 neuron
49 | e15.5_stomach stomach e15.5 #E5BDB5 stomach
50 | e16.5_forebrain forebrain e16.5 #C5912B Brain
51 | e16.5_heart heart e16.5 #D56F80 Heart
52 | e16.5_hindbrain hindbrain e16.5 #C5912B Brain
53 | e16.5_intestine intestine e16.5 #D0A39B intestine
54 | e16.5_kidney kidney e16.5 #F4B084 kidney
55 | e16.5_liver liver e16.5 #9BC2E6 Liver
56 | e16.5_lung lung e16.5 #E41A1C lung
57 | e16.5_midbrain midbrain e16.5 #C5912B Brain
58 | e16.5_stomach stomach e16.5 #E5BDB5 stomach
59 | P0_forebrain forebrain P0 #C5912B Brain
60 | P0_heart heart P0 #D56F80 Heart
61 | P0_hindbrain hindbrain P0 #C5912B Brain
62 | P0_intestine intestine P0 #D0A39B intestine
63 | P0_kidney kidney P0 #F4B084 kidney
64 | P0_liver liver P0 #9BC2E6 Liver
65 | P0_lung lung P0 #E41A1C lung
66 | P0_midbrain midbrain P0 #C5912B Brain
67 | P0_stomach stomach P0 #E5BDB5 stomach
68 |
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/get_mm_enrichments_across_ct.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import seaborn as sns
3 | import numpy as np
4 | import os
5 | import helper
6 | import glob
7 | import argparse
8 | parser = argparse.ArgumentParser(description = 'Create an excel of the top ct-spec states enriched in each full-stack state')
9 | parser.add_argument('--input_folder', type=str,
10 | help = "where there are multiple subfolders, each containing enrichment data for different cell type specific model")
11 | parser.add_argument('--output_fn', type=str,
12 | help = 'Where the files showing max, mean, min fold enrichment across all ct-state states are reported for all cell types for each state is reported')
13 | parser.add_argument('--num_state_ct_model', type=int,
14 | help = 'number of states in the ct-spec models')
15 | parser.add_argument('--full_state_annot_fn', type=str,
16 | help = 'file of the full-stack state annotations')
17 | parser.add_argument('--ct_state_annot_fn', type=str,
18 | help = 'file of the ct-spec state annotations')
19 | args = parser.parse_args()
20 | print(args)
21 | helper.check_dir_exist(args.input_folder)
22 | helper.create_folder_for_file(args.output_fn)
23 | helper.check_file_exist(args.full_state_annot_fn)
24 | helper.check_file_exist(args.ct_state_annot_fn)
25 |
26 | def get_ct_spec_state_annot(ct_state_annot_fn):
27 | df = pd.read_csv(ct_state_annot_fn, header = 0, index_col = None, sep = '\t') # state, mneumonics, meaning, color
28 | CT_STATE_MNEUNOMICS_LIST = list(df.mneumonics)
29 | CT_STATE_COLOR_DICT = pd.Series(df.color.values, index = df.mneumonics).to_dict() # keys: state mneumonics, values: state color
30 | return CT_STATE_MNEUNOMICS_LIST, CT_STATE_COLOR_DICT
31 |
32 | CT_STATE_MNEUNOMICS_LIST, CT_STATE_COLOR_DICT = get_ct_spec_state_annot(args.ct_state_annot_fn)
33 |
34 | def get_full_state_annot_df(full_state_annot_fn):
35 | df = pd.read_csv(full_state_annot_fn, header = 0, index_col = None, sep = '\t')
36 | FULL_STATE_COLOR_DICT = pd.Series(df.color.values, index = df.mneumonics)
37 | df = df[['state', 'mneumonics', 'state_order_by_group']]
38 | return FULL_STATE_COLOR_DICT, df
39 |
40 | FULL_STATE_COLOR_DICT, FULL_STATE_ANNOT_DF = get_full_state_annot_df(args.full_state_annot_fn) #FULL_STATE_ANNOT_DF: state, mneumonics, state_order_by_group, FULL_STATE_COLOR_DICT: keys mneumonics, values color
41 |
42 | def color_full_state_names(val):
43 | if val == "":
44 | color = '#FFFFFF'
45 | else:
46 | color = FULL_STATE_COLOR_DICT[val]
47 | return 'background-color: %s' % color
48 |
49 | def get_rid_of_stupid_file_tail(context_name):
50 | if context_name.endswith('.bed.gz'):
51 | return(context_name[:-7])
52 | else:
53 | return(context_name)
54 |
55 |
56 | def get_one_enrichment_ct_model_df(fn):
57 | df = pd.read_csv(fn, sep = '\t', header = 0)
58 | df = df.rename(columns = {'state (Emission order)' : 'state', 'Genome %': 'percent_in_genome'}) # rename some columns so that it is easier to write column names later
59 | context_name_list = list(map(get_rid_of_stupid_file_tail, df.columns[2:])) # get the names of the enrichment contexts, stripping the '.bed.gz' suffix if present
60 | df.columns = ['state', 'percent_in_genome'] + context_name_list
61 | df = df.drop([df.shape[0] - 1], axis = 0) # drop that last row, which is about the percentage of genome that each enrichment context occupies
62 | return df
63 |
64 | def get_mmm_enrichment_one_genome_context(all_ct_df_list, all_ct_list, state_context_colName):
65 | # state_context_colName: name of the column that we are looking at so that we know what column to look at for each data frame
66 | this_context_df = pd.DataFrame({'state' : (all_ct_df_list[0])['state']})
67 | # collect data from each of the cell type specific models --> columns: cell type specific models, rows: 100 full stack states
68 | for df_index, df in enumerate(all_ct_df_list):
69 | ct_name = all_ct_list[df_index] # all_ct_list and all_ct_df_list are parallel lists, i.e. elements at the same index correspond to the same cell type. We tried zipping the two lists, but that raised an error, so we index into all_ct_list instead.
70 | this_context_df[ct_name] = df[state_context_colName]
71 | this_context_df['max_enrichment'] = (this_context_df.drop('state', axis = 1)).max(axis = 1)
72 | this_context_df['min_enrichment'] = (this_context_df.drop('state', axis = 1)).min(axis = 1)
73 | this_context_df['median_enrichment'] = (this_context_df.drop('state', axis = 1)).median(axis = 1)
74 | return this_context_df
75 |
76 | def get_25_state_annot(state):
77 | # '1' --> ''1_TssA''
78 | state_index = int(state) - 1 # zero-based
79 | return CT_STATE_MNEUNOMICS_LIST[state_index]
80 |
81 | def get_CT_STATE_COLOR_DICT(state):
82 | # '1_TssA' --> 'background-color: <color of 1_TssA>'
83 | # CT_STATE_COLOR_DICT is keyed by the full state mneumonics (e.g. '1_TssA'), not by the integer state index
84 | color = CT_STATE_COLOR_DICT[state]
85 | return 'background-color: %s' % color
86 |
87 | def paint_excel_mmm_enrichment(enrichment_df, num_state_ct_model):
88 | # here enrichment_df is actually max_enrichment_df from get_all_ct_model_mmm_enrichment
89 | cm = sns.light_palette("red", as_cmap=True)
90 | (num_state, num_enr_cont) = (enrichment_df.shape[0], enrichment_df.shape[1] - 1)
91 | state_colName_list = list(map(lambda x: 'U' + str(x + 1), range(num_state_ct_model))) #CUSTOM: this is a line of code that could be improved for better customization
92 | enrichment_df['state'] = enrichment_df['state'].astype(int)
93 | enrichment_df = enrichment_df.merge(FULL_STATE_ANNOT_DF, how = 'left', left_on = 'state', right_on = 'state')
94 | enrichment_df = enrichment_df[['state', 'percent_in_genome'] + state_colName_list + ['mneumonics', 'state_order_by_group']] # get the data frame to display columns in the expected order
95 | enrichment_df.columns = ['state', 'percent_in_genome'] + CT_STATE_MNEUNOMICS_LIST + ['mneumonics', 'state_order_by_group'] # rename the columns so that instead of 'state1.bed.gz' we have '1_TssA'
96 | enrichment_df = enrichment_df.sort_values(by = 'state_order_by_group')
97 | colored_df = enrichment_df.style.background_gradient(subset = pd.IndexSlice[:, CT_STATE_MNEUNOMICS_LIST], cmap = cm) # color the enrichment values of the ct-spec state columns with a white-to-red gradient
98 | colored_df = colored_df.applymap(color_full_state_names, subset = pd.IndexSlice[:, ['mneumonics']])
99 | return colored_df
100 |
101 |
102 | def get_all_ct_model_mmm_enrichment(input_folder, output_fn, num_state_ct_model):
103 | all_ct_fn_list = glob.glob(input_folder + '/*/overlap_enrichment.txt')
104 | all_ct_list = list(map(lambda x: x.split('/')[-2], all_ct_fn_list)) # from '/path/to/E129/overlap_enrichment.txt' to 'E129'
105 | all_ct_df_list = list(map(lambda x: get_one_enrichment_ct_model_df(x), all_ct_fn_list))
106 | all_ct_df_list = list(all_ct_df_list)
107 | # up until here, we have finished getting all the data from all the overlap enrichment files of all cell types into data frame --> all_ct_df_list [index] --> data frame of the cell type
108 | print ("Done getting all enrichment data from all cell types")
109 | # create data frames that will report the max, median and min enrichment over all the cell types
110 | max_enrichment_df = pd.DataFrame({'state': (all_ct_df_list[0])['state']}) # for each full-stack state and each enrichment context, the highest enrichment across cell types
111 | # same layout as max_enrichment_df, but for the median
112 | median_enrichment_df = pd.DataFrame({'state': (all_ct_df_list[0])['state']}) # for each full-stack state and each enrichment context, the median enrichment across cell types
113 | # same layout as max_enrichment_df, but for the min
114 | min_enrichment_df = pd.DataFrame({'state': (all_ct_df_list[0])['state']}) # for each full-stack state and each enrichment context, the lowest enrichment across cell types
115 | # now loop through each of the contexts that we care about
116 | enrichment_context_list = ['percent_in_genome'] + list((all_ct_df_list[0]).columns[2:]) # this is because we also want to report the max, median, min of percent_in_genome_of data from all ct-spec enrichments
117 | print (enrichment_context_list)
118 | for enr_context in enrichment_context_list:
119 | this_context_mmm_enrichment_df = get_mmm_enrichment_one_genome_context(all_ct_df_list, all_ct_list, enr_context)
120 | max_enrichment_df[enr_context] = this_context_mmm_enrichment_df['max_enrichment']
121 | min_enrichment_df[enr_context] = this_context_mmm_enrichment_df['min_enrichment']
122 | median_enrichment_df[enr_context] = this_context_mmm_enrichment_df['median_enrichment']
123 | print ("Done getting all the necessary data!")
124 | max_colored_df = paint_excel_mmm_enrichment(max_enrichment_df, num_state_ct_model)
125 | min_colored_df = paint_excel_mmm_enrichment(min_enrichment_df, num_state_ct_model)
126 | median_colored_df = paint_excel_mmm_enrichment(median_enrichment_df, num_state_ct_model)
127 | # now save into 3 sheets
128 | writer = pd.ExcelWriter(output_fn, engine='xlsxwriter')
129 | max_colored_df.to_excel(writer, sheet_name='max')
130 | min_colored_df.to_excel(writer, sheet_name='min')
131 | median_colored_df.to_excel(writer, sheet_name='median')
132 | writer.close() # close() writes the workbook to disk; ExcelWriter.save() was removed in pandas 2.0
133 | print ("Done getting the mmm_enrichment_df!")
134 |
135 | get_all_ct_model_mmm_enrichment(args.input_folder, args.output_fn, args.num_state_ct_model)
136 |
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/get_ranked_max_enriched_25_state.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import seaborn as sns
3 | import numpy as np
4 | import os
5 | import sys
6 | import helper
7 | import glob
8 | import argparse
9 | parser = argparse.ArgumentParser(description = 'Create an excel of the top ct-spec states enriched in each full-stack state')
10 | parser.add_argument('--input_folder', type=str,
11 | help = "where there are multiple subfolders, each containing enrichment data for different cell type specific model")
12 | parser.add_argument('--output_fn', type=str,
13 | help = 'Where the ct-state with maximum fold enrichment across all ct-state states are reported for all cell types for each state is reported')
14 | parser.add_argument('--num_state_ct_model', type=int,
15 | help = 'number of states in the ct-spec models')
16 | parser.add_argument('--full_state_annot_fn', type=str,
17 | help = 'file of the full-stack state annotations')
18 | parser.add_argument('--ct_state_annot_fn', type=str,
19 | help = 'file of the ct-spec state annotations')
20 | parser.add_argument('--ct_group_fn', type=str,
21 | help = 'file of the annotations of the cell types')
22 | args = parser.parse_args()
23 | print(args)
24 | helper.check_dir_exist(args.input_folder)
25 | helper.create_folder_for_file(args.output_fn)
26 | helper.check_file_exist(args.full_state_annot_fn)
27 | helper.check_file_exist(args.ct_state_annot_fn)
28 | helper.check_file_exist(args.ct_group_fn)
29 |
30 | def get_full_state_annot_df(full_state_annot_fn):
31 | df = pd.read_csv(full_state_annot_fn, header = 0, index_col = None, sep = '\t')
32 | FULL_STATE_COLOR_DICT = pd.Series(df.color.values, index = df.mneumonics)
33 | df = df[['state', 'mneumonics', 'state_order_by_group']]
34 | return FULL_STATE_COLOR_DICT, df
35 |
36 | FULL_STATE_COLOR_DICT, FULL_STATE_ANNOT_DF = get_full_state_annot_df(args.full_state_annot_fn) #FULL_STATE_ANNOT_DF: state, mneumonics, state_order_by_group, FULL_STATE_COLOR_DICT: keys mneumonics, values color
37 |
38 | def get_celltype_color_map(ct_group_fn):
39 | df = pd.read_csv(ct_group_fn, header = 0, index_col = None, sep = '\t')
40 | CELLTYPE_COLOR_MAP = pd.Series(df.color.values, index = df.tissue_stage).to_dict() # convert two columns in to a dictionary: keys: cell type, values: color corresponding to that tissue
41 | CELLTYPE_COLOR_MAP['NA'] = '#E5B8E8'
42 | CELLTYPE_CELL_GROUP_MAP = pd.Series(df.group.values, index = df.tissue_stage).to_dict() # keys: cell type name, values: cell group
43 | group_color_df = (df[['group', 'color']]).drop_duplicates()
44 | CELL_GROUP_COLOR_CODE_DICT = pd.Series(group_color_df.color.values, index = group_color_df.group).to_dict() # keys: group, values: color
45 | return CELLTYPE_COLOR_MAP, CELLTYPE_CELL_GROUP_MAP, CELL_GROUP_COLOR_CODE_DICT, df
46 |
47 | CELLTYPE_COLOR_MAP, CELLTYPE_CELL_GROUP_MAP, CELL_GROUP_COLOR_CODE_DICT, CT_DF = get_celltype_color_map(args.ct_group_fn)
48 |
49 | def get_ct_spec_state_annot(ct_state_annot_fn):
50 | df = pd.read_csv(ct_state_annot_fn, header = 0, index_col = None, sep = '\t') # state, mneumonics, meaning, color
51 | CT_STATE_MNEUNOMICS_LIST = list(df.mneumonics)
52 | CT_STATE_COLOR_DICT = pd.Series(df.color.values, index = df.mneumonics).to_dict() # keys: state mneumonics, values: state color
53 | return CT_STATE_MNEUNOMICS_LIST, CT_STATE_COLOR_DICT
54 |
55 | CT_STATE_MNEUNOMICS_LIST, CT_STATE_COLOR_DICT = get_ct_spec_state_annot(args.ct_state_annot_fn)
56 |
57 | def get_rid_of_stupid_file_tail(context_name):
58 | if context_name.endswith('.bed.gz'):
59 | return(context_name[:-7])
60 | else:
61 | return(context_name)
62 |
63 | def color_full_state_names(val):
64 | if val == "":
65 | color = '#FFFFFF'
66 | else:
67 | color = FULL_STATE_COLOR_DICT[val]
68 | return 'background-color: %s' % color
69 |
70 | def color_cell_group_names(val):
71 | if val == "":
72 | color = '#FFFFFF'
73 | else:
74 | color = CELLTYPE_COLOR_MAP[val]
75 | return 'background-color: %s' % color
76 |
77 | def color_tissue_names(val):
78 | if val == "":
79 | color = '#FFFFFF'
80 | else:
81 | color = CELL_GROUP_COLOR_CODE_DICT[val]
82 | return 'background-color: %s' % color
83 |
84 | def get_one_enrichment_ct_model_df(fn, num_state_ct_model):
85 | df = pd.read_csv(fn, sep = '\t', header = 0)
86 | df = df.rename(columns = {'state (Emission order)' : 'state', 'Genome %': 'percent_in_genome'}) # rename some columns so that it is easier to write column names later
87 | state_colName_list = list(map(lambda x: 'U' + str(x + 1), range(num_state_ct_model))) #CUSTOM: this is a line of code that could be improved for better customization
88 | df = df[['state', 'percent_in_genome'] + state_colName_list] # get the data frame to display columns in the expected order
89 | df.columns = ['state', 'percent_in_genome'] + CT_STATE_MNEUNOMICS_LIST # rename the columns so that instead of 'state1.bed.gz' we have '1_TssA'
90 | df['max_enrichment'] = (df.drop(['state', 'percent_in_genome'], axis = 1)).max(axis = 1) # find the maximum enrichment values in this row
91 | df['max_enrichment_context'] = (df.drop(['state', 'percent_in_genome'], axis = 1)).idxmax(axis = 1)
92 | (nrow, ncol) = df.shape
93 | df = df.drop(nrow - 1) # drop the last row, which is the 'Base' row with percentage that each enrichment context occupies the genome
94 | return df
95 |
96 | def get_25_state_annot(state):
97 | # 'state_1' --> ''1_TssA''
98 | state_index = int(state.split('_')[1]) - 1 # zero-based
99 | return CT_STATE_MNEUNOMICS_LIST[state_index]
100 |
101 | def get_CT_STATE_COLOR_DICT(state):
102 | # '1_TssA' --> 'red'
103 | color = CT_STATE_COLOR_DICT[state]
104 | return 'background-color: %s' % color
105 |
106 | def get_max_enrichment_25_state_df_ordered_time_stamp(max_enrich_25_state_all_ct_df):
107 | """
108 | max_enrich_25_state_all_ct_df: rows: full stack states, columns: state of the 25-state system that is most enriched with the full-stack states WITHIN a cell type. Cell type are the column names
109 | --> a table where samples of the same developmental stage are juxtaposed, ordered by tissue within each stage. The output dataframe will be colored properly
110 | """
111 | cell_group_df = pd.DataFrame(columns = ['tissue_stage']) # rows: the cell types that are used in this analysis (colnames of max_enrich_25_state_all_ct_df), columns :
112 | cell_group_df['tissue_stage'] = max_enrich_25_state_all_ct_df.columns[1:] # we skip the first column because that's 'state'
113 | cell_group_df = pd.merge(cell_group_df, CT_DF, how = 'left', left_on = 'tissue_stage', right_on = 'tissue_stage') # merge the cell types so that we can get the information that we want
114 | # Now let's get the count of the number of cell types that are of each cell groups, and then get the unique cell groups, ordered by descending counts
115 | unique_cell_groups = ['forebrain', 'midbrain', 'hindbrain', 'neural-tube', 'facial-prominence', 'limb', 'intestine', 'stomach', 'liver', 'kidney', 'lung', 'heart']#CUSTOM: change this line of code for better customization
116 | unique_stages = ['P0', 'e10.5', 'e11.5', 'e12.5', 'e13.5', 'e14.5', 'e15.5', 'e16.5']#CUSTOM: change this line of code for better customization
117 | cell_group_df['ct'] = pd.Categorical(cell_group_df['ct'], unique_cell_groups) # so that we can easily sort them later based on this custom order
118 | # list of cell types, arranged such that those of the same stage are juxtaposed
119 | rearranged_tissue_stages = []
120 | for stage in unique_stages:
121 | this_cg_df = cell_group_df[cell_group_df['stage'] == stage] # keep only the rows of this stage
122 | this_cg_df = this_cg_df.sort_values(['ct'], ignore_index = True)
123 | rearranged_tissue_stages += list(this_cg_df['tissue_stage'])
124 | # now we will rearrange the columns of max_enrich_25_state_all_ct_df based on the order that we just got from rearranged_tissue_stages
125 | max_enrich_25_state_all_ct_df['state'] = max_enrich_25_state_all_ct_df['state'].astype(int)
126 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df.merge(FULL_STATE_ANNOT_DF, how = 'left', left_on = 'state', right_on = 'state')
127 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df[['state'] + rearranged_tissue_stages + ['mneumonics', 'state_order_by_group']]
128 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df.sort_values('state_order_by_group', ignore_index = True)
129 | last_row_index = max_enrich_25_state_all_ct_df.shape[0]
130 | max_enrich_25_state_all_ct_df.loc[last_row_index] = [''] + rearranged_tissue_stages + ['', ''] # add one more row that specifies the tissue_stage name of each of the cell types
131 | colored_25_state_df = max_enrich_25_state_all_ct_df.style.applymap(color_cell_group_names, subset = pd.IndexSlice[last_row_index, :]) # color the row that contain the cell_group of cell types
132 | colored_25_state_df = colored_25_state_df.applymap(get_CT_STATE_COLOR_DICT, subset = pd.IndexSlice[:(last_row_index - 1) , colored_25_state_df.columns[1:max_enrich_25_state_all_ct_df.shape[1]-2]]) # color the states of the 25-system that is most enriched in each of the full-stack states, exclude the first column and the last two columns
133 | colored_25_state_df = colored_25_state_df.applymap(color_full_state_names, subset = pd.IndexSlice[:,['mneumonics']]) # color full stack state names
134 | return colored_25_state_df
135 |
136 |
137 | def get_max_enrichment_25_state_df_ordered_cell_group(max_enrich_25_state_all_ct_df):
138 | """
139 | max_enrich_25_state_all_ct_df: rows: full stack states, columns: state of the 25-state system that is most enriched with the full-stack states WITHIN a cell type. Cell type are the column names
140 | --> a plot where cell types are juxtaposed based on the cell groups they belong to. The output dataframe will be colored properly
141 | """
142 | cell_group_df = pd.DataFrame(columns = ['tissue_stage']) # rows: the cell types that are used in this analysis (colnames of max_enrich_25_state_all_ct_df), columns :
143 | cell_group_df['tissue_stage'] = max_enrich_25_state_all_ct_df.columns[1:] # we skip the first column because that's 'state'
144 | cell_group_df = pd.merge(cell_group_df, CT_DF, how = 'left', left_on = 'tissue_stage', right_on = 'tissue_stage') # merge the cell types so that we can get the information that we want
145 | # Now let's get the count of the number of cell types that are of each cell groups, and then get the unique cell groups, ordered by descending counts
146 | unique_cell_groups = ['forebrain', 'midbrain', 'hindbrain', 'neural-tube', 'facial-prominence', 'limb', 'intestine', 'stomach', 'liver', 'kidney', 'lung', 'heart']#CUSTOM: change this line of code for better customization
147 | unique_stages = [ 'P0', 'e10.5', 'e11.5', 'e12.5', 'e13.5', 'e14.5', 'e15.5', 'e16.5']#CUSTOM: change this line of code for better customization
148 | cell_group_df['stage'] = pd.Categorical(cell_group_df['stage'], unique_stages) # so that we can easily sort them later based on this custom order
149 | # list of cell types, arranged such that those of the same cell_groups are juxtaposed
150 | rearranged_tissue_stages = []
151 | for cg in unique_cell_groups:
152 | this_cg_df = cell_group_df[cell_group_df['ct'] == cg] # filter out rows with this cellgroup
153 | this_cg_df = this_cg_df.sort_values(['stage'], ignore_index = True)
154 | rearranged_tissue_stages += list(this_cg_df['tissue_stage'])
155 | max_enrich_25_state_all_ct_df['state'] = max_enrich_25_state_all_ct_df['state'].astype(int) # so that we can merge with the df of full stack state annotation
156 | # now we will rearrange the columns of max_enrich_25_state_all_ct_df based on the order that we just got from rearranged_tissue_stage
157 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df.merge(FULL_STATE_ANNOT_DF, how = 'left', left_on = 'state', right_on = 'state')
158 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df[['state'] + rearranged_tissue_stages + ['mneumonics', 'state_order_by_group']]
159 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df.sort_values('state_order_by_group', ignore_index = True)
160 | last_row_index = max_enrich_25_state_all_ct_df.shape[0]
161 | max_enrich_25_state_all_ct_df.loc[last_row_index] = [''] + rearranged_tissue_stages + ['', ''] # add one more row that specifies the cell_group of each of the cell types
162 | colored_25_state_df = max_enrich_25_state_all_ct_df.style.applymap(color_cell_group_names, subset = pd.IndexSlice[last_row_index, :]) # color the row that contain the cell_group of cell types
163 | colored_25_state_df = colored_25_state_df.applymap(get_CT_STATE_COLOR_DICT, subset = pd.IndexSlice[:(last_row_index - 1) , colored_25_state_df.columns[1:max_enrich_25_state_all_ct_df.shape[1]-2]]) # color the states of the 25-system that is most enriched in each of the full-stack states, exclude the first column and the last two columns
164 | colored_25_state_df = colored_25_state_df.applymap(color_full_state_names, subset = pd.IndexSlice[:,['mneumonics']]) # color full stack state names
165 | return colored_25_state_df
166 |
167 |
168 | def get_rank_25_state_df(max_enrich_25_state_all_ct_df, max_enrich_value_all_ct_df, output_fn):
169 | (num_full_state, num_ct) = max_enrich_25_state_all_ct_df.shape # nrow, ncol
170 | # color 25_state_cell_group_df: df where cell types of the same cell group are juxtaposed
171 | colored_25_state_cell_group_df = get_max_enrichment_25_state_df_ordered_cell_group(max_enrich_25_state_all_ct_df)
172 | # color colored_25_state_stage_df: df where the samples of the same time stamps are juxtaposed
173 | colored_25_state_stage_df = get_max_enrichment_25_state_df_ordered_time_stamp(max_enrich_25_state_all_ct_df)
174 | # save the excel
175 | writer = pd.ExcelWriter(output_fn, engine='xlsxwriter')
176 | colored_25_state_cell_group_df.to_excel(writer, sheet_name = 'cell_group_25_state')
177 | colored_25_state_stage_df.to_excel(writer, sheet_name = 'stage_25_state')
178 | writer.close() # ExcelWriter.save() was removed in pandas 2.0; close() writes the file and releases the handle
179 |
180 |
181 |
182 | def get_ranked_max_enriched_25_state(input_folder, output_fn, num_state_ct_model):
183 | all_ct_fn_list = glob.glob(input_folder + '/*/overlap_enrichment.txt')
184 | all_ct_list = list(map(lambda x: x.split('/')[-2], all_ct_fn_list)) # from '/path/to/E129/overlap_enrichment.txt' to 'E129'
185 | all_ct_df_list = list(map(lambda x: get_one_enrichment_ct_model_df(x, num_state_ct_model), all_ct_fn_list))
186 |
187 | max_enrich_value_all_ct_df = pd.DataFrame({'state' : (all_ct_df_list[0])['state']}) # create a data frame with only a column called state that are the states in the full-stack model. This data frame store the enrichment values
188 | max_enrich_25_state_all_ct_df = pd.DataFrame({'state' : (all_ct_df_list[0])['state']}) # this data frame stores the names of states that are most enriched with each of the full-stack state in each of the cell type that we look at
189 | for ct_index, ct in enumerate(all_ct_list):
190 | this_ct_fn = all_ct_fn_list[ct_index] # get the overlap_enrichment file associated with this cell type
191 | this_ct_enrichment_df = get_one_enrichment_ct_model_df(this_ct_fn, num_state_ct_model) # get the enrichment data frame associated with this cell type
192 | max_enrich_value_all_ct_df[ct] = this_ct_enrichment_df['max_enrichment'] # store the values of maximum enrichment in this cell type
193 | max_enrich_25_state_all_ct_df[ct] = this_ct_enrichment_df['max_enrichment_context'] # store the names of the state that is most enriched with each of the full-stack state for this one cell type
194 | print ("Done getting data from all cell types " )
195 | get_rank_25_state_df(max_enrich_25_state_all_ct_df, max_enrich_value_all_ct_df, output_fn)
196 | print ("Done ranking enrichment data across cell types ")
197 |
198 | get_ranked_max_enriched_25_state(args.input_folder, args.output_fn, args.num_state_ct_model)
199 |
200 |
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/helper.py:
--------------------------------------------------------------------------------
1 | import string
2 | import os
3 | import sys
4 | import time
5 | def make_dir(directory):
6 | try:
7 | os.makedirs(directory)
8 | except:
9 | print ( 'Folder ' + directory + ' is already created')
10 |
11 |
12 |
13 | def check_file_exist(fn):
14 | if not os.path.isfile(fn):
15 | print ( "File: " + fn + " DOES NOT EXIST")
16 | exit(1)
17 | return
18 |
19 | def check_dir_exist(fn):
20 | if not os.path.isdir(fn):
21 | print ( "Directory: " + fn + " DOES NOT EXIST")
22 | exit(1)
23 | return
24 |
25 | def create_folder_for_file(fn):
26 | last_slash_index = fn.rfind('/') # rfind, not find: we need the LAST slash so that nested folders (a/b/c.txt -> a/b) are created
27 | if last_slash_index != -1: # path contains folder
28 | make_dir(fn[:last_slash_index])
29 | return
30 |
31 | def get_command_line_integer(arg):
32 | try:
33 | arg = int(arg)
34 | return arg
35 | except:
36 | print ( "Integer: " + str(arg) + " IS NOT VALID")
37 | exit(1)
38 |
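
The `create_folder_for_file` helper above exists to make sure the parent directory of an output file is present before writing. A minimal alternative sketch, assuming only the standard library, is shown below; it is not the repository's implementation, and the example path is hypothetical.

```python
import os
import tempfile

def create_folder_for_file(fn):
    """Create the parent directory of fn (if it has one); fn itself is not created."""
    parent = os.path.dirname(fn)  # '' when fn has no folder component
    if parent:
        os.makedirs(parent, exist_ok=True)  # no error if the folder already exists
    return parent

# Usage on a throwaway directory (hypothetical output path)
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'output', 'summary', 'max_enriched_states.xlsx')
    created = create_folder_for_file(target)
    parent_exists = os.path.isdir(created)
    file_exists = os.path.exists(target)
```

Using `os.path.dirname` avoids manual slash searching, and `exist_ok=True` removes the need for a bare `except`.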
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/mouse_random_represent_full_stack.snakefile:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | full_stack_segment_fn = 'genome_100_segments.bed.gz' # should be replaced with where you store the 100-state chromatin state annotation
4 | ct_segment_folder = './example_perCT_segment/' # should be replaced with the folder path to where you store the per_ct annotations. NOTE: the code assumes that your per-ct annotations (bed files) are sorted; if not, you need to use the command line: zcat | sort -k1,1 -k2,2n >
5 | ct_list = glob.glob(ct_segment_folder + '/*_15_segments.bed.gz') # we assume that within the folder ct_segment_folder, the files are named based on format _15_segments.bed.gz
6 | ct_list = list(map(lambda x: x.split('/')[-1].split('_15_segments.bed.gz')[0], ct_list))
7 | NUM_SAMPLE_SEGMENT_PER_STATE = 100
8 | output_folder = './output/random_represent_with_15state' # output folder of analysis of the estimated overlap probability for full-stack states with per-ct states
9 | overlap_output_folder = './output/overlap_with_ct_segment' # output folder of the analysis of overlap enrichment of full-stack states with per-ct states
10 | ct_state_annot_fn = './example_perCT_segment/15_state_annotations.csv' # where we can get the metadata of the states within the per-ct chromatin state model
11 | full_state_annot_fn = '../state_annotation_processed.tsv' # path to the file of full-stack states' metadata (state characterizations); the repository ships this file with a .tsv extension
12 | ct_group_fn = './example_perCT_segment/tissue_annotation.csv'
13 | num_state_ct_model = 15
14 |
15 | seed_list = [800, 922, 23, 204, 132, 992, 60, 650, 761, 154, 432, 760, 999, 969, 955, 986, 981, 246, 35, 438, 116]
16 | rule all:
17 | input:
18 | os.path.join(output_folder, 'summary', 'avg_prop_stateCT_per_group.xlsx'),
19 | os.path.join(overlap_output_folder, 'summary', 'max_enriched_states.xlsx'),
20 |
21 | rule sample_segment_equal_per_state:
22 | # sample regions on the genome such that for each state, the number of sampled regions are similar
23 | input:
24 | full_stack_segment_fn, # from raw data
25 | output:
26 | (os.path.join(output_folder, 'seed_{seed}', 'temp_sample_region.bed.gz'))
27 | shell:
28 | """
29 | python sample_region_for_state_representation.py {input} {NUM_SAMPLE_SEGMENT_PER_STATE} {output} {wildcards.seed}
30 | """
31 |
32 | rule overlap_sample_region_with_ct_segment:
33 | input:
34 | (os.path.join(output_folder, 'seed_{seed}', 'temp_sample_region.bed.gz')), # from rule sample_segment_equal_per_state
35 | expand(os.path.join(ct_segment_folder, '{ct}_15_segments.bed.gz'), ct = ct_list) # Note we made the assumption that these files are all sorted 'sort -k1,1 -k2,2n'
36 | output:
37 | os.path.join(output_folder, 'seed_{seed}', 'sample_segment_fullStack_ctState.bed.gz'),
38 | params:
39 | ct_list_string = " ".join(ct_list),
40 | output_no_gz = os.path.join(output_folder, 'seed_{seed}', 'sample_segment_fullStack_ctState.bed')
41 | shell:
42 | """
43 | command="zcat {input[0]} | sort -k1,1 -k2,2n " # first sort the file of sample regions, where each state appear NUM_SAMPLE_SEGMENT_PER_STATE times
44 | output_header="chrom\\tstart\\tend\\tfull_stack"
45 | rm -f {output} # so we can overwrite it
46 | rm -f {params.output_no_gz} # so we can overwrite it
47 | for ct in {params.ct_list_string}
48 | do
49 | ct_segment_fn={ct_segment_folder}/${{ct}}_15_segments.bed.gz
50 | command="$command | bedtools map -a stdin -b ${{ct_segment_fn}} -c 4 -o collapse"
51 | output_header="${{output_header}}\\t${{ct}}"
52 | done
53 | command="$command >> {params.output_no_gz} "
54 | # now onto the writing part
55 | echo -e $output_header > {params.output_no_gz} # write the header first # need the -e flag for tabs to be considered seriously
56 | eval $command # write the content
57 | gzip {params.output_no_gz} # gzip
58 | """
59 |
60 | rule calculate_overlap_enrichment_with_ct_segment:
61 | input:
62 | full_stack_segment_fn,
63 | os.path.join(ct_segment_folder, '{ct}_15_segments.bed.gz'),
64 | output:
65 | os.path.join(overlap_output_folder, '{ct}', 'overlap_enrichment.txt'),
66 | params:
67 | code='overlap_enrichment_with_ct_spec_state.py', # script in this folder; run snakemake from relationship_with_ct_spec_state/ or replace with an absolute path
68 | num_other_states = 15,
69 | shell:
70 | """
71 | python {params.code} --reference_segment_fn {full_stack_segment_fn} --other_segment_fn {input[1]} --num_other_states {params.num_other_states} --output_fn {output}
72 | """
73 |
74 | rule calculate_summary_overlap_with_ct_segment:
75 | input:
76 | expand(os.path.join(overlap_output_folder, '{ct}', 'overlap_enrichment.txt'), ct = ct_list),
77 | output:
78 | os.path.join(overlap_output_folder, 'summary', 'max_enriched_states.xlsx'),
79 | shell:
80 | """
81 | python get_ranked_max_enriched_25_state.py --input_folder {overlap_output_folder} --output_fn {output} --num_state_ct_model {num_state_ct_model} --full_state_annot_fn {full_state_annot_fn} --ct_state_annot_fn {ct_state_annot_fn} --ct_group_fn {ct_group_fn}
82 | """
83 |
84 | rule calculate_summary_prob_overlap_with_ct_segment:
85 | input:
86 | expand(os.path.join(output_folder, 'seed_{seed}', 'sample_segment_fullStack_ctState.bed.gz'), seed = seed_list)
87 | output:
88 | os.path.join(output_folder, 'summary', 'avg_prop_stateCT_per_group.xlsx')
89 | params:
90 | prob_summ_folder = os.path.join(output_folder, 'summary')
91 | shell:
92 | """
93 | python calculate_summary_sample_regions.py --all_seed_folder {output_folder} --output_folder {params.prob_summ_folder} --num_state_ct_model {num_state_ct_model} --full_state_annot_fn {full_state_annot_fn} --ct_state_annot_fn {ct_state_annot_fn} --ct_group_fn {ct_group_fn}
94 | """
--------------------------------------------------------------------------------
/relationship_with_ct_spec_state/overlap_enrichment_with_ct_spec_state.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import pandas as pd
4 | import numpy as np
5 | import glob
6 | import pybedtools as bed
7 | import helper
8 | parser = argparse.ArgumentParser(description='Calculate the overlap enrichment between the reference chromatin state annotation and the states of another segmentation. This is useful when we want to calculate the overlap enrichment between full-stack states and the states in other ct-spec models. The output file replicates the format of ChromHMM OverlapEnrichment')
9 | parser.add_argument('--reference_segment_fn', type=str,
10 | help='The segmentation fn that we will use to calculate the state overlap enrichment for. The rows of output correspond to states in this file')
11 | parser.add_argument('--other_segment_fn', type=str,
12 | help='other_segment_fn. The columns of output will be the states in this file')
13 | parser.add_argument('--num_other_states', default=100, type=int,
14 | help='number of chromHMM states in other_segment_fn')
15 | parser.add_argument('--output_fn', type=str,
16 | help = 'output_fn. Will look exactly like the output of ChromHMM OverlapEnrichment')
17 | args = parser.parse_args()
18 | print (args)
19 | helper.check_file_exist(args.reference_segment_fn)
20 | helper.check_file_exist(args.other_segment_fn)
21 | helper.create_folder_for_file(args.output_fn)
22 |
23 | def calculate_percentage_in_genome(reference_segment_fn):
24 | ref_segment_df = pd.read_csv(reference_segment_fn, header = None, index_col = None, sep = '\t')
25 | ref_segment_df.columns = ['chrom', 'start', 'end', 'state']
26 | result_df = pd.DataFrame()
27 | all_ref_states = np.unique(ref_segment_df['state'])
28 | all_ref_states = list(map(lambda x: int(x[1:]), all_ref_states)) # a list of unique state numbers
29 | all_ref_states = np.sort(all_ref_states) # 1--> 100, for full-stack states
30 | result_df['state (Emission order)'] = all_ref_states
31 | result_df.index = list(map(lambda x: 'E{}'.format(x), result_df['state (Emission order)']))
32 | ref_segment_df['length'] = ref_segment_df['end'] - ref_segment_df['start']
33 | gw_length = np.sum(ref_segment_df['length'])
34 | state_cover = ref_segment_df.groupby('state')['length'].sum()
35 | state_perc = state_cover/ gw_length * 100.0 # percentage of the genome that is in each state
36 | result_df['Genome %'] = state_perc
37 | result_df['num_bp_in_state'] = state_cover
38 | return result_df, gw_length
39 |
40 | def calculate_overlap_enrichment_all_states(reference_segment_fn, other_segment_fn, num_other_states, output_fn):
41 | ref_segment_bed = bed.BedTool(reference_segment_fn)
42 | result_df, gw_length = calculate_percentage_in_genome(reference_segment_fn) # three columns so far: state (Emission order), Genome % and num_bp_in_state (dropped before writing the output)
43 | other_bed = bed.BedTool(other_segment_fn)
44 | inter_bed = ref_segment_bed.intersect(other_bed, wa = True, wb= True)
45 | inter_df = inter_bed.to_dataframe()
46 | inter_df.columns = ['ref_chrom', 'ref_start', 'ref_end', 'ref_state', 'other_chrom', 'other_start', 'other_end', 'other_state']
47 | inter_df['inter_length'] = inter_df.apply(lambda x: min(x['ref_end'], x['other_end']) - max(x['ref_start'], x['other_start']),axis = 1) # length of intersection between the reference and the other segmentation
48 | inter_df = inter_df.groupby(['ref_state', 'other_state'])['inter_length'].sum()
49 | inter_df = inter_df.to_frame().reset_index() # ref_state, other_state, inter_length
50 | inter_df = inter_df.pivot_table(values = 'inter_length', index = 'ref_state', columns = 'other_state') # columns: states in the other_bed, rows: states in ref_bed
51 | inter_df = inter_df.fillna(0)
52 | fract_gene_in_context = (inter_df.sum(axis = 0))/ gw_length # fraction of the genome that is in each of the contexts of the columns
53 | inter_df = inter_df.divide(result_df['num_bp_in_state'], axis = 0).divide(fract_gene_in_context, axis = 1) # FE = (#MS/#S) / (#M/#G)
54 | result_df = result_df.merge(inter_df, left_index = True, right_index = True)
55 | result_df = result_df.drop('num_bp_in_state', axis = 1) # get rid of this column to make the output look like ChromHMM output
56 | result_df.loc[result_df.shape[0]] = ['Base', 100] + list(fract_gene_in_context*100.0) # the last row of result_df should show the percentage of the genome that is in each of the context
57 | result_df.to_csv(output_fn, header = True, index = False, sep = '\t')
58 | print('Done!')
59 | return
60 |
61 | calculate_overlap_enrichment_all_states(args.reference_segment_fn, args.other_segment_fn, args.num_other_states, args.output_fn)
62 |
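
The fold-enrichment step in `calculate_overlap_enrichment_all_states` follows FE = (#MS/#S) / (#M/#G). The sketch below reproduces just that arithmetic on a toy overlap table, as a hedged illustration only: the function name `fold_enrichment`, the two-state data, and the assumption that the two segmentations cover the same bases are all made up for the example.

```python
import pandas as pd

def fold_enrichment(overlap_bp, state_bp, genome_bp):
    """FE[s, m] = (overlap_bp[s, m] / state_bp[s]) / (context_bp[m] / genome_bp)."""
    # fraction of the genome covered by each column state (context) of the other segmentation
    context_fraction = overlap_bp.sum(axis=0) / genome_bp
    # divide rows by the size of each reference state, then columns by the context fraction
    return overlap_bp.divide(state_bp, axis=0).divide(context_fraction, axis=1)

# Toy data: rows are reference (full-stack) states, columns are per-cell-type states
overlap_bp = pd.DataFrame({'1': [80.0, 20.0], '2': [20.0, 80.0]}, index=['E1', 'E2'])
state_bp = overlap_bp.sum(axis=1)  # assumes both segmentations cover the same bases
genome_bp = int(state_bp.sum())
fe = fold_enrichment(overlap_bp, state_bp, genome_bp)
print(fe)
```

Here E1 spends 80% of its bases in context `1`, which covers 50% of the toy genome, so its enrichment for that context is 0.8 / 0.5 = 1.6.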
--------------------------------------------------------------------------------
/view_ucsc_genome_browser.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/view_ucsc_genome_browser.pptx
--------------------------------------------------------------------------------