├── .gitattributes
├── README.md
├── emission_and_metadata
    ├── README.md
    ├── create_heatmap_for_each_state.Rmd
    ├── data
    │   ├── assay_count.csv
    │   ├── biosample_term_name_counts.csv
    │   ├── emissions_100.txt
    │   ├── emissions_100_for_pheatmap.txt
    │   ├── metadata.csv
    │   ├── organ.txt
    │   └── slims_metadata.txt
    ├── draw_emission_subsetted_states.R
    ├── draw_heatmap_emission_mm10_full_stack.Rmd
    ├── helper.py
    ├── prepare_emission_for_plotting.py
    └── rank_chromMark_cellType_emission.py
├── gene_exp_analysis
    ├── ._investigate_lecif_and_phastCons.py
    ├── README.md
    ├── avg_gene_exp_per_state_per_ct.txt.gz
    ├── ct_name.csv
    ├── get_avg_gene_per_state.py
    ├── helper.py
    └── plot_avg_gene_exp_heatmap.Rmd
├── helper.py
├── learn_model
    ├── README.md
    ├── extract_cell_mark_table.py
    ├── helper.py
    └── metadata_controltype_june_2021.tsv
├── neighborhood
    ├── README.md
    ├── draw_2DLine_neighborhood_enrichment.R
    ├── draw_neighborhood_enrichment.R
    ├── genome_100_RefSeqTES_neighborhood.txt
    └── genome_100_RefSeqTSS_neighborhood.txt
├── overlap
    ├── README.md
    ├── create_excel_painted_overlap_enrichment_LECIF.py
    └── helper.py
├── process_CTCF_data
    ├── README.md
    ├── calculate_genometric_mean_CTCF_overlap.py
    ├── download_CTCF_data.py
    ├── draw_2D_overlap_enrichment_CTCF.Rmd
    ├── helper.py
    ├── metadata_clean_for_publication.txt
    ├── metadata_from_ENCODE.tsv
    └── overlap_ctcf_mm10.txt
├── relationship_with_ct_spec_state
    ├── ._15_state_annotations.csv
    ├── README.md
    ├── calculate_summary_sample_regions.py
    ├── example_perCT_segment
    │   ├── 15_state_annotations.csv
    │   ├── P0_forebrain_15_segments.bed.gz
    │   ├── P0_lung_15_segments.bed.gz
    │   ├── P0_stomach_15_segments.bed.gz
    │   ├── README.md
    │   └── tissue_annotation.csv
    ├── get_mm_enrichments_across_ct.py
    ├── get_ranked_max_enriched_25_state.py
    ├── helper.py
    ├── mouse_random_represent_full_stack.snakefile
    └── overlap_enrichment_with_ct_spec_state.py
├── state_annotation_processed.tsv
└── view_ucsc_genome_browser.pptx

/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# full_stack_ChromHMM_annotations for mouse (mm10)
Genome annotation data from the full-stack ChromHMM model trained on 901 datasets assaying 14 chromatin marks in 26 different cell or tissue types of the **mouse** genome. The paper has been published in Genome Biology. This project extends the manuscript published in Genome Biology that introduced a universal (pan-tissue-type) annotation of the **human** genome.
# Download links:
Data of full-stack genome annotations for the mm10 reference assembly can be found here:
Within this folder:
- File mm10_100_segments_segments.bed.gz contains a simple four-column .bed file of the **mouse** full-stack state annotation in the **mm10** assembly. The fourth column contains a state label with a prefix number that can be used to order the states. The OverlapEnrichment and NeighborhoodEnrichment commands of ChromHMM, run with the '-labels' option, can compute enrichments for this file and order states based on the prefix number.
- File mm10_100_segments_browser.bed.gz contains a browser file of the **mouse** full-stack state annotation in the **mm10** assembly. This file is compatible with the UCSC genome browser. Because our training data (901 input data tracks) are in mm10, mm10 is the assembly used for the original training and annotation.

- A detailed description of the states can be found in the tsv file. The Excel version of this file, with more results on the states' overlap enrichments with external annotations, can be found in our paper, Additional Files 3-5.

- State group meanings:
```
mGapArtf - assembly gaps and artifacts
mQuies - quiescent
mHet - heterochromatin
mZNF - Zinc finger genes
mReprPC - polycomb repressed
mReprPC_openC - polycomb repressed and open chromatin
mOpenC - open chromatin
mEnhA - active enhancers
mEnhWk - weak enhancers
mTxEnh - transcribed enhancers
mTx - transcription
mTxEx - transcription and exons
mTxWk - weak transcription
mBivProm - bivalent promoters
mPromF - promoter flank
mTSS - transcription start sites
```

# Track hubs on UCSC genome browser:
You can view the full-stack annotations for the mouse (presented in this manuscript) or the human (presented in the "sister" manuscript) by using the track hub link. The tutorial file view_ucsc_genome_browser.pptx gives detailed step-by-step instructions on how to view the full-stack annotations with the provided track hub link.

# Folders:
Each subfolder inside this folder contains its own readme file.
- Folder ```relationship_with_ct_spec_state```: contains code to reproduce the plots in Fig. S6 and S7 of the manuscript.
- Folder ```process_CTCF_data```: contains code to reproduce Fig. 2F of the manuscript.
- Folder ```neighborhood```: contains code to reproduce Fig. 2D-E and S2 of the manuscript.
- Folder ```emission_and_metadata```: contains code to reproduce Fig. 1 of the manuscript.
- Folder ```gene_exp_analysis```: contains code to reproduce Fig. 2C of the manuscript.
- Folder ```overlap```: contains scripts to calculate the enrichments of mouse full-stack states with different genome annotations.
- Folder ```learn_model```: contains scripts to reproduce tabs 'trainData_HistoneChip' and 'trainData_ATACDNase' in Additional File 2 of the paper.

# License:
All code is provided under the MIT Open Access License
Copyright 2022 Ha Vu and Jason Ernst

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

# Contact:
If you run into problems, please contact Ha Vu (havu73@ucla.edu)

--------------------------------------------------------------------------------
/emission_and_metadata/README.md:
--------------------------------------------------------------------------------
This folder contains code to produce Figure 1 of the manuscript.

- ```helper.py```: script containing helper functions for file management.
- Folder ```data```: contains the emission probability file and different metadata files that specify the color codes for different chromatin marks, cell types, etc.
- ```prepare_emission_for_plotting.py```: Code to transform data from ```./data/emissions_100.txt``` to ```./data/emissions_100_for_pheatmap.txt```, with added columns showing the associated chromatin mark and cell type for each experiment.
- ```rank_chromMark_cellType_emission.py```: Code to produce Excel files that correspond to Fig. 1B-C.
```
python rank_chromMark_cellType_emission.py
emission_fn: should be ./data/emissions_100.txt
output_fn: Excel file where the output data of ranked cell types and chromatin marks are stored
num_top_marks: number of top marks that we want to report, recommended value: 100
```
- ```draw_heatmap_emission_mm10_full_stack.Rmd```: Code to produce Fig. 1A and plots for individual states (not presented in the paper, but these plots are helpful in understanding the states' biological implications).
- ```draw_emission_subsetted_states.R```: Code to draw emission probabilities for a user-specified subset of states. This script was not used to create any figure in the paper.
It may be helpful to users, but it requires that you refer to states by their raw numbers (column ```state``` in ```../state_annotation_processed.tsv```).
```
Rscript draw_emission_subsetted_states.R 100), to plot>
```
# License:
All code is provided under the MIT Open Access License
Copyright 2022 Ha Vu and Jason Ernst

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
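
# Illustration: ranking experiments per state
The ranking that this README attributes to ```rank_chromMark_cellType_emission.py``` can be sketched roughly as follows. This is a minimal illustration and not the script in this folder: the table layout (one row per state, one tab-separated column per experiment), the function names `read_emission_table` and `rank_top_experiments`, and the toy experiment names are all assumptions made for the example. Only the parameter name `num_top_marks` comes from the usage text above.

```python
import csv
import io


def read_emission_table(handle):
    """Parse a tab-delimited emission table into {state: {experiment: prob}}.

    Assumed layout (hypothetical, not confirmed by this README): a header row,
    then one row per state, with the state identifier in the first column.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    experiments = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        state = row[0]
        table[state] = dict(zip(experiments, map(float, row[1:])))
    return table


def rank_top_experiments(table, num_top_marks):
    """For each state, keep the experiments with the highest emission probability."""
    ranked = {}
    for state, probs in table.items():
        ordered = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        ranked[state] = ordered[:num_top_marks]
    return ranked


# Toy input standing in for ./data/emissions_100.txt (all values are made up).
toy = io.StringIO(
    "state\theart_H3K4me3\tliver_H3K27ac\tlung_H3K9me3\n"
    "1\t0.90\t0.40\t0.01\n"
    "2\t0.02\t0.05\t0.80\n"
)
ranked = rank_top_experiments(read_emission_table(toy), num_top_marks=2)
print(ranked["1"])
```

With a real emission file, the per-state lists could then be written out to a spreadsheet; that step is omitted here because the output layout of the actual script is not described in this README.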


--------------------------------------------------------------------------------
/emission_and_metadata/create_heatmap_for_each_state.Rmd:
--------------------------------------------------------------------------------
---
title: "Create_heatmap_for_each_state"
author: "Ha Vu"
date: "6/11/2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(reshape2) # provides dcast(); loaded explicitly in case the sourced helper file does not load it
library(pheatmap)
source('/Users/vuthaiha/Desktop/window_hoff/source/summary_analysis/draw_emission_matrix_functions.R')
```
# Get the emission matrix into a dataframe
```{r}
emission_fn <- "/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/emissions_100.txt"
output_folder <- "/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/emission_results/state_avg_cell_group_and_mark"
cm_output_folder <- "/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/emission_results/cm_based_emission"
ana_output_folder <- "/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/emission_results/ana_based_emission"
emission_df <- get_emission_df_from_chromHMM_emission(emission_fn)
num_state <- 100
num_exp <- 1032
```
# get the count of each annotation category
```{r}
count_chrom_mark_GROUP <- emission_df %>% select(c('mark_names', 'chrom_mark', 'GROUP')) %>% group_by(chrom_mark, GROUP) %>% tally() %>% rename('count' = 'n')
count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% dcast(chrom_mark ~ GROUP , value.var = 'count') %>% replace(., is.na(.), 0)
# set the row names and drop the 'chrom_mark' column first, so that the annotation data frames below match the heatmap matrix
rownames(count_chrom_mark_GROUP) <- count_chrom_mark_GROUP[,1]
count_chrom_mark_GROUP <- count_chrom_mark_GROUP[,-1]
annot_col_df <- data.frame(cell_GROUP = colnames(count_chrom_mark_GROUP))
rownames(annot_col_df) <- as.character(colnames(count_chrom_mark_GROUP))
annot_row_df <- data.frame(chrom_mark = rownames(count_chrom_mark_GROUP))
rownames(annot_row_df) <- as.character(rownames(count_chrom_mark_GROUP))
pheatmap(count_chrom_mark_GROUP, fontsize = 6, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, display_numbers = count_chrom_mark_GROUP, number_format = '%s', annotation_col = annot_col_df, annotation_row = annot_row_df, annotation_colors = list(cell_GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE), color = colorRampPalette(c('white', 'red'))(300), filename = file.path(output_folder, 'count_categories.png'))
```
# draw plots for each state on its own figure
```{r}
for (state_index in seq(num_state)){
  state_df <- emission_df %>% select(c("mark_names", paste0("S", state_index), "chrom_mark", "GROUP"))
  avg_emission_df <- state_df %>% select (paste0("S", state_index), 'chrom_mark', 'GROUP') %>% group_by(chrom_mark, GROUP) %>% summarise_all(.funs = funs(mean = "mean")) %>% rename(mean_emission = mean) # group by chrom_mark and GROUP, then calculate the mean of the emission probabilities in this state --> chrom_mark, GROUP, mean_emission
  avg_emission_df <- avg_emission_df %>% dcast(chrom_mark ~ GROUP , value.var = 'mean_emission') %>% replace(., is.na(.), 0) # reshape the 3-column data frame into a heatmap-compatible data frame

  chrom_mark_list <- avg_emission_df %>% select('chrom_mark') # list of chromatin marks, which will later be the row names for the heatmap
  avg_emission_df <- avg_emission_df[,-1] # get rid of the 'chrom_mark' column because it will become the row names
  row.names(avg_emission_df) <- chrom_mark_list[,1] # row names are now chromatin marks so that we can draw a heatmap later
  this_state_save_fig_fn <- file.path(output_folder, paste0("state", state_index, '_avg_emission.png') )
  break_list <- seq (0, 1, by = 0.01)
  annot_col_df <- data.frame(cell_GROUP = colnames(avg_emission_df))
  rownames(annot_col_df) <- as.character(colnames(avg_emission_df))
  annot_row_df <- data.frame(chrom_mark = rownames(avg_emission_df))
  rownames(annot_row_df) <- as.character(rownames(avg_emission_df))
  pheatmap(avg_emission_df,breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, annotation_col = annot_col_df, annotation_row = annot_row_df, display_numbers = count_chrom_mark_GROUP ,annotation_colors = list(cell_GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE), filename = this_state_save_fig_fn)
  print(paste("Done with state:", state_index))
}
```
# get a data frame avg_emission_df: rows: 'ana_mark_comb' + states, columns: mark-GROUP combination --> cells: avg emission probabilities for each mark-GROUP combination
```{r}
# keep the GROUP column here: it is needed by group_by() below
avg_emission_df <- emission_df %>% select(-'COLOR', -'TYPE', -'Epig_name', -'ct')
mark_name_list <- as.character(avg_emission_df$mark_names)
rownames(avg_emission_df) <- mark_name_list
avg_emission_df <- avg_emission_df %>% select(-'mark_names') %>% group_by(chrom_mark, GROUP) %>% summarise_all(funs(mean)) %>% unite("ana_mark_comb",c("GROUP", "chrom_mark"), sep = "-")
ana_mark_comb_list <- avg_emission_df$ana_mark_comb
```

# Draw heatmap where we arrange categories by chrom marks
```{r}
uniq_chrom_mark_list <- unique(emission_df$chrom_mark)
plot_df <- data.frame(matrix(ncol = ncol(avg_emission_df) - 1, nrow = 0))

for (cm in uniq_chrom_mark_list){
  this_cm_df <- avg_emission_df %>% filter(grepl(cm, ana_mark_comb))
  plot_df <- rbind(plot_df, this_cm_df)
}
ana_mark_comb_list <- plot_df$ana_mark_comb
rownames(plot_df) <- ana_mark_comb_list
plot_df <- plot_df %>% select(- 'ana_mark_comb')
# column annotation
annot_df <- data.frame(chrom_mark = sapply(rownames(plot_df), FUN = get_chrom_mark_name))
annot_df['GROUP'] <- sapply(rownames(plot_df), FUN = get_mark_ct)
break_list <- seq (0, 1, by = 0.01)

save_fn <- file.path(output_folder, "avg_chrom_mark_GROUP_grouped_cm.png")
pheatmap(t(plot_df), breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 3, angle_col = 90, annotation_col = annot_df, annotation_colors = list(GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE), cellwidth = 3.5 ,filename = save_fn)
```

```{r}
break_list <- seq (0, 1, by = 0.01)
uniq_chrom_mark_list <- unique(emission_df$chrom_mark)
for (cm in uniq_chrom_mark_list){
  this_cm_df <- avg_emission_df %>% filter(grepl(cm, ana_mark_comb))
  ana_mark_comb_list <- this_cm_df$ana_mark_comb
  this_cm_df <- this_cm_df %>% select(-"ana_mark_comb")
  rownames(this_cm_df) <- ana_mark_comb_list
  annot_df <- data.frame(chrom_mark = sapply(ana_mark_comb_list, FUN = get_chrom_mark_name))
  annot_df['GROUP'] <- sapply(ana_mark_comb_list, FUN = get_mark_ct)
  save_fn <- file.path(cm_output_folder, paste0(cm, "avg_emission.png"))
  pheatmap(t(this_cm_df), breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 5, angle_col = 90, annotation_col = annot_df, annotation_colors = list(GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE),filename = save_fn)
}
```

```{r}
uniq_GROUP_list <- unique(emission_df$GROUP)
plot_df <- data.frame(matrix(ncol = ncol(avg_emission_df) - 1, nrow = 0))
for (ana in uniq_GROUP_list){
  this_ana_df <- avg_emission_df %>% filter(grepl(paste0(ana, "-"), ana_mark_comb))
  plot_df <- rbind(plot_df, this_ana_df)
}
ana_mark_comb_list <- plot_df$ana_mark_comb
rownames(plot_df) <- ana_mark_comb_list
plot_df <- plot_df %>% select(- 'ana_mark_comb')
annot_df <- data.frame(chrom_mark = sapply(rownames(plot_df), FUN = get_chrom_mark_name))
annot_df['GROUP'] <- sapply(rownames(plot_df), FUN = get_mark_ct)

break_list <- seq (0, 1, by = 0.01)

save_fn <- file.path(output_folder, "avg_chrom_mark_GROUP_grouped_ana.png")
pheatmap(t(plot_df), breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 3, angle_col = 90, annotation_col = annot_df, annotation_colors = list(GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE) , cellwidth = 3.5 ,filename = save_fn)
```

```{r}
break_list <- seq (0, 1, by = 0.01)
uniq_GROUP_list <- unique(emission_df$GROUP)
for (ana in uniq_GROUP_list){
  this_ana_df <- avg_emission_df %>% filter(grepl(paste0(ana, '-'), ana_mark_comb))
  ana_mark_comb_list <- this_ana_df$ana_mark_comb
  this_ana_df <- this_ana_df %>% select(-"ana_mark_comb")
  rownames(this_ana_df) <- ana_mark_comb_list
  annot_df <- data.frame(chrom_mark = sapply(ana_mark_comb_list, FUN = get_chrom_mark_name))
  annot_df['GROUP'] <- sapply(ana_mark_comb_list, FUN = get_mark_ct)
  save_fn <- file.path(ana_output_folder, paste0(ana, "avg_emission.png"))
  pheatmap(t(this_ana_df), breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 5, angle_col = 90, annotation_col = annot_df, annotation_colors = list(GROUP = CELL_GROUP_COLOR_CODE, chrom_mark = CHROM_MARK_COLOR_CODE),filename = save_fn)
}
```

# calculate the correlations between states
```{r}
emission_df <- get_emission_df_from_chromHMM_emission(emission_fn)
mark_name_list <- emission_df$mark_names
emission_df <- emission_df %>% select(-'mark_names', -'ct', -'chrom_mark', -'GROUP', -'COLOR', -'TYPE', -'Epig_name',
  -'GROUP')
rownames(emission_df) <- mark_name_list
cor_df <- as.data.frame(cor(emission_df))
break_list <- seq (0, 1, by = 0.01)
save_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/state_correlations/pearson_correlations.png'
pheatmap(cor_df, breaks = break_list, fontsize = 4, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4, angle_col = 90,filename = save_fn)
```
--------------------------------------------------------------------------------
/emission_and_metadata/data/assay_count.csv:
--------------------------------------------------------------------------------
mark,num_exp,color,big_group,
DNase,114,#DBE680,1,
H3K4me3,101,#F13712,2,
H3K4me1,100,#EDF732,3,
H3K27ac,93,#F7CB4D,3,
H3K36me3,92,#49AC5E,4,
H3K27me3,91,#A6A6A4,5,
ATAC,83,#E8F484,1,
H3K9me3,79,#677BF6,6,
H3K9ac,72,#F2A626,2,
H3K4me2,68,#F8CBAD,2,
H3K79me2,6,#377A45,4,
H3ac,1,#E5EAE7,7,
H3K79me1,1,#9DCEA8,4,This mark is actually not in the emission file
H3K79me3,1,#95B77E,4,
,902,,,
--------------------------------------------------------------------------------
/emission_and_metadata/data/biosample_term_name_counts.csv:
--------------------------------------------------------------------------------
biosample_name,num_experiment,group_refined_1,group_refined_2,group_refined_3,group,explanation,color,big_group
heart,83,heart,heart,heart,Heart,,#D56F80,1
liver,80,liver,liver,liver,Liver,,#9BC2E6,2
forebrain,73,forebrain,forebrain,brain ,Brain,,#C5912B,3
midbrain,73,midbrain,midbrain,brain ,Brain,,#C5912B,3
hindbrain,73,hindbrain,hindbrain,brain ,Brain,,#C5912B,3
limb,54,limb ,limb ,limb ,Muscle,,#F9B6CF,22
embryonic facial prominence,54,embryonic facial promincence,embryonic facial promincence,embryo,embryo,,#AF5B39,18
kidney,48,kidney,kidney,kidney,kidney,,#F4B084,23
neural tube,47,neural tube,neuron ,neuron ,neuron,,#FFD924,4
lung,42,lung,lung,lung,lung,,#E41A1C,5
stomach,37,stomach,stomach,stomach,stomach,,#E5BDB5,6
intestine,36,intestine,intestine,intestine,intestine,,#D0A39B,7
CH12.LX,14,B-cells,white blood cells,immunity,white blood cells,,#678C69,8
MEL,13,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,Murine erythroleukemia (MEL) cell lines are erythroid progenitor cells derived from the spleens of susceptible mice infected with the Friend virus complex 1. These virally transformed cells are arrested at the proerythroblast stage of development and can be maintained in tissue culture indefinitely ,#55A354,9
ES-E14,9,ESC,ESC,ESC,ESC,"Mouse E14 embryonic stem cells (ESCs) are a well-characterized and widespread used ESC line, often employed for genome-wide studies involving next generation sequencing analysis",#924965,16
cerebellum,9,hindbrain,hindbrain,brain ,Brain,,#C5912B,3
thymus,9,thymus,thymus,thymus,Thymus,,#C6E0B4,11
spleen,7,spleen,spleen,spleen,Spleen,,#678C69,12
G1E,7,ESC,ESC,ESC,ESC,"G1E cells are an immortalized Gata1 null cell line derived from embryonic stem cells [1], and the daughter cell line G1E-ER4 has been stably rescued by transduction with a virus expressing a hybrid gene encoding the GATA1-ER protein",#924965,16
myocyte,6,muscle,muscle,muscle,Muscle,,#F9B6CF,22
ES-Bruce4,6,ESC,ESC,ESC,ESC,An embryonic cell line isolated from C57BL/6 mouse strain. Injection of Bruce4 cells into C57BL/6 blastocysts will produce agouti chimeras.,#924965,16
testis,6,testis,testis,testis,testis,,#ACB9CA,13
erythroblast,6,red blood cells,red blood cells,red blood cells,red blood cells,"Erythroblast, nucleated cell occurring in red marrow as a stage or stages in the development of the red blood cell, or erythrocyte.",#55A354,9
megakaryocyte,6,bone marrow cells for producing platelets,bone marrow ,bone marrow ,white blood cells,"Megakaryocytes are cells in the bone marrow responsible for making platelets, which are necessary for blood clotting",#678C69,8
embryo,5,embryo,embryo,embryo,embryo,,#AF5B39,18
C2C12,5,Muscle,Muscle,Muscle,Muscle,an undifferentiated cell capable of giving rise to muscle cells,#F9B6CF,22
E14TG2a.4,5,ESC,ESC,ESC,ESC,,#924965,16
small intestine,5,Intestine,Intestine,Intestine,intestine,,#D0A39B,7
retina,4,retina,retina,retina,retina,,#F8CBAD,14
brain,3,brain ,brain ,brain ,brain,,#C5912B,3
splenic B cell,3,B-cells,white blood cells,white blood cells,white blood cells,"b-cells generated by the spleen, used for immunity",#678C69,8
cortical plate,3,ESC/forebrain,ESC,ESC,ESC,The term cortical plate refers to the embryonic precursor of cerebral cortex. It gives rise to all layers of neocortex and allocortex,#924965,16
brown adipose tissue,3,adipose,adipose,adipose,Adipose,,#0070C0,15
embryonic fibroblast,3,ESC fibroblast,ESC,ESC,ESC,,#924965,16
placenta,3,placenta,placenta,placenta,placenta,,#69608A,17
olfactory bulb,3,forebrain,forebrain,brain ,brain,,#C5912B,3
bone marrow macrophage,3,bone marrow white blood cells,bone marrow,bone marrow,bone marrow,"Bone marrow-derived macrophages (BMDM) are primary macrophages obtained by in vitro differentiation of bone marrow cells in the presence of macrophage colony-stimulating factor (M-CSF or CSF1). They are easy to obtain in high yields, can be stored by freezing, and can be obtained from genetically modified mice strains.",#375623,10
cortex,2,forebrain,forebrain,brain ,Brain,,#C5912B,3
bone marrow,2,bone marrow,bone marrow,bone marrow,bone marrow,,#375623,10
adrenal gland,2,adrenal gland,other,other,Other,"small, triangular-shaped glands located on top of both kidneys. Adrenal glands produce hormones that help regulate your metabolism, immune system, blood pressure, response to stress and other essential functions.",#999999,19
hematopoietic stem cell,2,ESC,ESC,ESC,ESC,"An immature cell that can develop into all types of blood cells, including white blood cells, red blood cells, and platelets. Hematopoietic stem cells are found in the peripheral blood and the bone marrow.",#924965,16
c-Kit-positive CD71-negative TER-119-negative erythroid progenitor cells,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,"Erythroid progenitor cells are committed self-renewing stem cells that give rise to only one type of cell, namely, the erythrocytes (red blood cells).",#55A354,9
erythroid progenitor cell,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,,#55A354,9
megakaryocyte-erythroid progenitor cell,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,"The megakaryocyte–erythroid progenitor cell (or MEP, or hMEP to specify human) is a cell that gives rise to megakaryocytes and erythrocytes. It is derived from the common myeloid progenitor.",#55A354,9
neutrophil,1,white blood cells,white blood cells,white blood cells,white blood cells,"A type of immune cell that is one of the first cell types to travel to the site of an infection. Neutrophils help fight infection by ingesting microorganisms and releasing enzymes that kill the microorganisms. A neutrophil is a type of white blood cell, a type of granulocyte, and a type of phagocyte.",#678C69,8
adipocyte,1,adipose,adipose,adipose,adipose,"a cell specialized for the storage of fat, found in connective tissue",#0070C0,15
activated regulatory T-cells,1,white blood cells,white blood cells,white blood cells,white blood cells,,#678C69,8
cerebral cortex,2,forebrain,forebrain,brain ,brain,,#C5912B,3
left cerebral cortex,1,forebrain,forebrain,brain ,brain,,#C5912B,3
"naive thymus-derived CD4-positive, alpha-beta T cell",1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
NIH3T3,1,ESC fibroblast,fibroblast,fibroblast,fibroblast,3T3 cells are several cell lines of mouse embryonic fibroblasts,#8DAFC3,20
frontal cortex,1,forebrain,forebrain,brain ,brain,,#C5912B,3
forelimb bud,1,limb bud,limb ,limb ,Muscle,The limb bud is a structure formed early in vertebrate limb development,#F9B6CF,22
ES-CJ7,1,ESC,ESC,ESC,ESC,Undifferentiated embryonic stem cells were originally isolated from 129S1/SVImJ strain mice by,#924965,16
WW6,1,ESC,ESC,ESC,ESC,WW6: an embryonic stem cell line with an inert genetic marker that can be traced in chimeras,#924965,16
telencephalon,1,neuron ,neuron ,neuron ,neuron,"The telencephalon (basal ganglia, septum, cerebral cortex and olfactory bulb) contains two general classes of neurons: those that project axons to distant targets and those that make only local connections.",#FFD924,4
yolk sac,1,embryo,embryo,embryo,embryo,"The yolk sac is a small, membranous structure situated outside of the embryo with a variety of functions during embryonic development. It attaches ventrally to the developing embryo via the yolk stalk. The yolk stalk is a term that may be used interchangeably with the vitelline duct or omphalomesenteric duct.",#AF5B39,18
hindlimb bud,1,embryo,embryo,embryo,embryo,,#AF5B39,18
B cell,1,B-cells,white blood cells,white blood cells,white blood cells,,#678C69,8
monocyte,1,white blood cells,white blood cells,white blood cells,white blood cells,"A monocyte is a type of white blood cell and a type of phagocyte. Enlarge. Blood cells. Blood contains many types of cells: white blood cells (monocytes, lymphocytes, neutrophils, eosinophils, basophils, and macrophages), red blood cells (erythrocytes), and platelets",#678C69,8
common myeloid progenitor,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,Common myeloid progenitors give rise to either megakaryocyte/erythrocyte or granulocyte/macrophage progenitors,#55A354,9
c-Kit-negative CD71-positive TER-119-positive erythroid progenitor cells,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,,#55A354,9
gonadal fat pad,1,adipose,adipose,adipose,adipose,,#0070C0,15
regulatory T cell,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
inflammation-experienced regulatory T-cells,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
c-Kit-positive CD71-positive TER-119-negative erythroid progenitor cells,1,erythroid progenitor cells,erythroid progenitor cells,erythroid progenitor cells,red blood cells,,#55A354,9
Patski,1,fibroblast,fibroblast,fibroblast,fibroblast,PATSKI is a female interspecific mouse fibroblast that was derived from the embryonic kidney of an M. spretus x C57BL/6J hybrid mouse such that the C57Bl/6J X chromosome (maternal) is always the inactive X. This is an adherent cell line.,#8DAFC3,20
skeletal muscle tissue,1,muscle,muscle,muscle,muscle,,#F9B6CF,22
mesoderm,1,embryo,embryo,embryo,embryo,"the middle layer of an embryo in early development, between the endoderm and ectoderm.",#AF5B39,18
large intestine,1,intestine,intestine,intestine,intestine,,#D0A39B,7
CD4-positive helper T cell,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
MEL-GATA-1-ER,1,other,other,other,other,cancer cell ine Mouse erythroid leukemia (https://web.expasy.org/cellosaurus/CVCL_Y480),#999999,19
416B,1,other,other,other,other,transformed cell lineHaematopoiesis is the formation of blood cellular component,#999999,19
right cerebral cortex,1,forebrain,forebrain,brain ,brain,,#C5912B,3
induced T-regulatory cell,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
A20,1,B-cells,white blood cells,white blood cells,white blood cells,B lymphocyte,#678C69,8
activated regulatory T-cell,1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
3T3-L1,1,fibroblast,fibroblast,fibroblast,fibroblast,,#8DAFC3,20
fat pad,1,adipose,adipose,adipose,adipose,,#0070C0,15
R1,1,ESC,ESC,ESC,esC,,#924965,16
MN1,1,neuron ,neuron ,neuron ,neuron,MN1 is a cholinergic motor neuron cell line derived from a fusion of N18TG2 with embryonic mouse spinal cord motor neurons,#FFD924,4
fibroblast of lung,1,fibroblast,fibroblast,fibroblast,fibroblast,,#8DAFC3,20
ZHBTc4,1,ESC,ESC,ESC,eSC,"ZHBTc4 undifferentiated mouse embryonic stem cells originated from a male mouse of the 129/Ola strain, and received as frozen ampoules from D. Levasseur (University of Iowa). These cells lack functional endogenous Oct4 alleles and harbor a regulatable Oct4 transgene.",#924965,16
"CD4-positive, CD25-positive, alpha-beta regulatory T cell",1,T cells,white blood cells,white blood cells,white blood cells,,#678C69,8
megakaryocyte progenitor cell,1,myeloid progenitor,other,other,other,"The megakaryocyte–erythroid progenitor cell (or MEP, or hMEP to specify human) is a cell that gives rise to megakaryocytes and erythrocytes. It is derived from the common myeloid progenitor.",#999999,19
3134,1,other,other,other,other,cancer cell line: https://web.expasy.org/cellosaurus/CVCL_H641,#999999,19
Muller cell,1,retina,retina,retina,retina,"Müller cells are the principal glial cells of the retina, assuming many of the functions carried out by astrocytes, oligodendrocytes and ependymal cells in other CNS regio",#F8CBAD,14
c-Kit-positive CD71-positive TER-119-positive erythroid progenitor cells,1,erythroid progenitor cells,erythroid progenitor cells,red blood cells,red blood cells,,#55A354,9
granulocyte monocyte progenitor cell,1,white blood cells,white blood cells,white blood cells,white blood cells,"Granulocyte-monocyte progenitor (GMP) cells play a vital role in the immune system by maturing into a variety of white blood cells, including neutrophils and macrophages, depending on exposure to cytokines such as various types of colony stimulating factors",#678C69,8
gastrocnemius,1,muscle,muscle,muscle,muscle,,#F9B6CF,22
--------------------------------------------------------------------------------
/emission_and_metadata/data/organ.txt:
--------------------------------------------------------------------------------
spleen #678C69 18
lymph_node #E2EFDA 19
musculature #F9B6CF 11
embryo #0070C0 25
liver #9BC2E6 5
lung #E41A1C 6
limb #F182BC 10
kidney #F4B084 7
intestine #D0A39B 8
stomach #E5BDB5 9
epithelium #7491A2 3
brain #C5912B 1
heart #D56F80 4
immune_organ
#678C69 20 15 | gonad #ACB9CA 12 16 | blood #55A354 21 17 | bone_element #375623 22 18 | eye #F8CBAD 13 19 | unknown #999999 26 20 | breast #18A7F0 14 21 | adipose_tissue #A7D5FF 15 22 | connective_tissue #90C4CE 16 23 | adrenal_gland #F4B084 17 24 | extraembryonic_component #69608A 23 25 | spinal_cord #FFD924 2 26 | placenta #786D8C 24 27 | -------------------------------------------------------------------------------- /emission_and_metadata/draw_emission_subsetted_states.R: -------------------------------------------------------------------------------- 1 | # Copyright 2021 Ha Vu (havu73@ucla.edu) 2 | 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | # The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
8 | 9 | library(tidyverse) 10 | library(tidyr) 11 | library(dplyr) 12 | library(pheatmap) 13 | library(ggplot2) 14 | biosample_color_dict <- c('adipose'= '#AF5B39', 'bone marrow'= '#375623', 'brain'= '#C5912B', 'embryo'= '#0070C0', 'esc'= '#924965', 'fibroblast'= '#C075C3', 'heart'= '#D56F80', 'intestine'= '#D0A39B', 'kidney'= '#F4B084', 'limb'= '#F182BC', 'liver'= '#9BC2E6', 'lung'= '#E41A1C', 'muscle'= '#F9B6CF', 'neuron'= '#FFD924', 'other'= '#999999', 'placenta'= '#69608A', 'red blood cells'= '#55A354', 'retina'= '#F8CBAD', 'spleen'= '#678C69', 'stomach'= '#E5BDB5', 'testis'= '#ACB9CA', 'thymus'= '#C6E0B4', 'white blood cells'= '#678C69') 15 | assay_color_dict <- c('ATAC'= '#E8F484', 'DNase'= '#DBE680', 'H3K27ac'= '#F7CB4D', 'H3K27me3'= '#A6A6A4', 'H3K36me3'= '#49AC5E', 'H3K4me1'= '#EDF732', 'H3K4me2'= '#F8CBAD', 'H3K4me3'= '#F13712', 'H3K79me1'= '#9DCEA8', 'H3K79me2'= '#377A45', 'H3K9ac'= '#F2A626', 'H3K9me3'= '#677BF6', 'H3ac'= '#E5EAE7', 'H3K79me3'= '#95B77E') 16 | state_group_color_dict <- c('HET'= '#b19cd9', 'quiescent'= '#ffffff', 'enhancers'= '#FFA500', 'WkEnh'= '#ffff00', 'znf'= '#7fffd4', 'Tx'= '#006400', 'artifacts'= '#ffffff', 'DNase'= '#fff44f', 'ATAC'= '#f7f3b5', 'Prom'= '#ff4500', 'ReprPC and DNase'= '#d1cf90', 'DNase'= '#fff44f', 'exon'= '#3cb371', 'BivProm'= '#6a0dad', 'acetylations'= '#fffacd', 'ct_spec enhancers'= '#FFA500', 'ReprPC'= '#C0C0C0', 'weak promoters'= '#800080', 'TxEnh'= '#ADFF2F', 'TSS'= '#FF0000', 'others'= '#fff5ee', 'TxWk'= '#228B22', 'TxEx'= '#9BBB59') 17 | num_state <- 100 18 | 19 | read_state_annot_df <- function(state_annot_fn, states_to_plot){ 20 | annot_df <- as.data.frame(read.csv(state_annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = '\t')) 21 | tryCatch({ 22 | annot_df <- annot_df %>% rename('state_group' = 'group') 23 | }, error = function (e) {message("tried to change column names in annot_df and nothing worth worrying happened")}) 24 | annot_df <- annot_df[annot_df$state %in% states_to_plot,] 25 | annot_df 
<- annot_df %>% arrange(state_order_by_group) # order rows based on the index that we get from state_order_by_group column 26 | return(annot_df) 27 | } 28 | 29 | draw_emission_subsetted_states <- function(emission_fn, annot_fn, save_fn, states_to_plot){ 30 | emission_df <- as.data.frame(read.csv(emission_fn, sep = '\t', header = TRUE)) %>% arrange(assay_big_group, assay, biosample_big_group, biosample_group, biosample_name) 31 | colnames(emission_df) <- c('experiments', paste0('S', seq(1, num_state)), colnames(emission_df)[102:length(colnames(emission_df))]) 32 | annot_df <- read_state_annot_df(annot_fn, states_to_plot) # this step will choose only the states that we want to plot 33 | #### getting the plot_df: columns : experiments, rows: states 34 | plot_df <- emission_df %>% select(-c('assay', "biosample", "mark", "mark_color", "assay_big_group", "biosample_name", "biosample_group", "biosample_big_group", "biosample_color")) %>% select(c('experiments', paste0('S', annot_df$state))) # choosing only experiments and the states, which are ordered based on the group, all based on the annot_state_df 35 | plot_df <- as.data.frame(t(plot_df)) 36 | colnames(plot_df) <- plot_df[1,] # experiments 37 | plot_df <- plot_df[-1,] # get rid of the of experiments row 38 | rownames(plot_df) <- annot_df$mneumonics 39 | plot_df <- plot_df %>% mutate_all(as.numeric) 40 | ####### 41 | ####### getting the annot_exp_df for experiments ###### 42 | annot_exp_df <- emission_df %>% select(c('biosample_group', 'assay')) 43 | rownames(annot_exp_df) <- emission_df$experiments 44 | ############ 45 | ####### getting the annot_state_df for states #### 46 | annot_state_df <- annot_df %>% select(c('state_group')) 47 | rownames(annot_state_df) <- annot_df$mneumonics 48 | ############ 49 | p <- pheatmap(plot_df, fontsize = 5, annotation_col = annot_exp_df, annotation_row = annot_state_df, annotation_colors = list(biosample_group = biosample_color_dict, assay = assay_color_dict, state_group = 
state_group_color_dict), cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = FALSE, fontsize_col = 3, angle_col = 90 , cellheight = 5, filename = save_fn) 50 | } 51 | 52 | args = commandArgs(trailingOnly=TRUE) 53 | if (length(args) < 4) # three file names plus at least one state index are required 54 | { 55 | stop("expected arguments: emission_fn annot_fn save_fn state_1 [state_2 ...]", call.=FALSE) 56 | } 57 | emission_fn <- args[1] # output from prepare_emission_for_plotting.py 58 | annot_fn <- args[2] # where annotations of states, the state group and state index by groups are stored 59 | #annot_fn <- './data/state_annotation_processed.csv' 60 | #emission_fn <- './data/emissions/emissions_100_for_pheatmap.txt' 61 | #save_fn <- './data/emissions/emissions_100.png' 62 | save_fn <- args[3] # where the figures should be stored 63 | states_to_plot <- as.integer(args[4:length(args)]) # indices of the states to plot 64 | print(states_to_plot) 65 | print ('Done parsing command line arguments') 66 | draw_emission_subsetted_states(emission_fn, annot_fn, save_fn, states_to_plot) -------------------------------------------------------------------------------- /emission_and_metadata/draw_heatmap_emission_mm10_full_stack.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "draw_heatmap_emission_mm10_full_stack" 3 | author: "Ha Vu" 4 | date: "9/17/2021" 5 | output: html_document 6 | --- 7 | ```{r} 8 | library(tidyverse) 9 | library(tidyr) 10 | library(dplyr) 11 | library(pheatmap) 12 | library(ggplot2) 13 | library(reshape2) # dcast function 14 | ``` 15 | 16 | ```{r} 17 | biosample_color_dict <- c('adipose'= '#AF5B39', 'bone marrow'= '#375623', 'brain'= '#C5912B', 'embryo'= '#0070C0', 'esc'= '#924965', 'fibroblast'= '#C075C3', 'heart'= '#D56F80', 'intestine'= '#D0A39B', 'kidney'= '#F4B084', 'limb'= '#F182BC', 'liver'= '#9BC2E6', 'lung'= '#E41A1C', 'muscle'= '#F9B6CF', 'neuron'= '#FFD924', 'other'= '#999999', 'placenta'= '#69608A', 'red blood cells'= '#55A354', 'retina'= '#F8CBAD', 'spleen'= '#678C69', 'stomach'= '#E5BDB5', 'testis'= 
'#ACB9CA', 'thymus'= '#C6E0B4', 'white blood cells'= '#678C69') 18 | organ_color_dict <- c('spleen'= '#678C69', 'lymph_node'= '#E2EFDA', 'musculature'= '#F9B6CF', 'embryo'= '#0070C0', 'liver'= '#9BC2E6', 'lung'= '#E41A1C', 'limb'= '#F182BC', 'kidney'= '#F4B084', 'intestine'= '#D0A39B', 'stomach'= '#E5BDB5', 'epithelium'= '#7491A2', 'brain'= '#C5912B', 'heart'= '#D56F80', 'immune_organ'= '#678C69', 'gonad'= '#ACB9CA', 'blood'= '#55A354', 'bone_element'= '#375623', 'eye'= '#F8CBAD', 'unknown'= '#999999', 'breast'= '#18A7F0', 'adipose_tissue'= '#A7D5FF', 'connective_tissue'= '#90C4CE', 'adrenal_gland'= '#F4B084', 'extraembryonic_component'= '#69608A', 'spinal_cord'= '#FFD924', 'placenta'= '#786D8C') 19 | assay_color_dict <- c('ATAC'= '#E8F484', 'DNase'= '#DBE680', 'H3K27ac'= '#F7CB4D', 'H3K27me3'= '#A6A6A4', 'H3K36me3'= '#49AC5E', 'H3K4me1'= '#EDF732', 'H3K4me2'= '#F8CBAD', 'H3K4me3'= '#F13712', 'H3K79me1'= '#9DCEA8', 'H3K79me2'= '#377A45', 'H3K9ac'= '#F2A626', 'H3K9me3'= '#677BF6', 'H3ac'= '#E5EAE7', 'H3K79me3'= '#95B77E') 20 | state_group_color_dict <- c('BivProm'= '#6a0dad', 'OpenC'= '#fff44f', 'HET'= '#b19cd9', 'Prom'= '#ff4500', 'ReprPC'= '#C0C0C0', 'ReprPC and openC'= '#d1cf90', 'TSS'= '#FF0000', 'Tx'= '#006400', 'TxEnh'= '#ADFF2F', 'TxEx'= '#9BBB59', 'TxWk'= '#228B22', 'WkEnh'= '#ffff00', 'ZNF'= '#7fffd4', 'artifacts'= '#fff5ee', 'enhancers'= '#FFA500', 'quiescent'= '#ffffff') 21 | num_state <- 100 22 | ``` 23 | 24 | ```{r} 25 | annot_fn <- '..//state_annotation_processed.csv' 26 | annot_df <- as.data.frame(read.csv(annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = '\t')) 27 | read_state_annot_df <- function(state_annot_fn){ 28 | annot_df <- as.data.frame(read.csv(state_annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = '\t')) 29 | tryCatch({ 30 | annot_df <- annot_df %>% rename('state_group' = 'group') 31 | }, error = function (e) {message("tried to change column names in annot_df and nothing worth worrying happened")}) 32 | annot_df <- annot_df 
%>% arrange(state_order_by_group) # order rows based on the index that we get from state_order_by_group column 33 | return(annot_df) 34 | } 35 | annot_df <- read_state_annot_df(annot_fn) 36 | head(annot_df) 37 | 38 | 39 | calculate_gap_rows_among_state_groups <- function(state_annot_df){ 40 | state_group_ordered_by_appearance <- unique(state_annot_df$state_group) # list of different state groups, ordered by how they appear in the heatmap from top to bottom 41 | count_df <- state_annot_df %>% dplyr::count(state_group) 42 | count_df <- count_df[match(state_group_ordered_by_appearance, count_df$state_group),] # order the rows such that the state_type are ordered based on state_group_ordered_by_appearance 43 | results <- cumsum(count_df$n) # cumulative sum of the count for groups of states, which will be used to generate the gaps between rows of the heatmaps 44 | return(results) 45 | } 46 | ``` 47 | Draw the emission matrix right here 48 | ```{r} 49 | emission_fn <- './data/emissions_100_for_pheatmap.txt' 50 | save_fn <- './data/emissions_100.png' 51 | emission_df <- as.data.frame(read.csv(emission_fn, sep = '\t', header = TRUE)) %>% arrange(assay_big_group, assay, organ_order, organ_group) 52 | colnames(emission_df) <- c('experiments', paste0('S', seq(1,num_state)), colnames(emission_df)[102:length(colnames(emission_df))]) 53 | annot_df <- read_state_annot_df(annot_fn) 54 | #### getting the plot_df: columns : experiments, rows: states 55 | plot_df <- emission_df %>% select(c('experiments', paste0('S', annot_df$state))) # choosing only experiments and the states, which are ordered based on the group, all based on the annot_state_df 56 | plot_df <- as.data.frame(t(plot_df)) 57 | colnames(plot_df) <- plot_df[1,] # experiments 58 | plot_df <- plot_df[-1,] # get rid of the of experiments row 59 | rownames(plot_df) <- annot_df$mneumonics 60 | plot_df <- plot_df %>% mutate_all(as.numeric) 61 | ####### 62 | ####### getting the annot_exp_df for experiments ###### 63 | 
annot_exp_df <- emission_df %>% select(c('organ_group', 'assay')) 64 | rownames(annot_exp_df) <- emission_df$experiments 65 | ############ 66 | ####### getting the annot_state_df for states #### 67 | annot_state_df <- annot_df %>% select(c('state_group')) 68 | rownames(annot_state_df) <- annot_df$mneumonics 69 | ############ 70 | ####### getting gap row indices #### 71 | gap_row_indices <- calculate_gap_rows_among_state_groups(annot_df) 72 | ############ 73 | pheatmap(plot_df, fontsize = 5, annotation_col = annot_exp_df, annotation_row = annot_state_df, annotation_colors = list(organ_group = organ_color_dict, assay = assay_color_dict, state_group = state_group_color_dict), gaps_row = gap_row_indices, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = FALSE, fontsize_col = 3, angle_col = 90 , cellheight = 5, filename = save_fn) 74 | ``` 75 | 76 | Draw emission matrix for each state 77 | ```{r} 78 | colnames(emission_df) 79 | head(annot_col_df) 80 | head(count_chrom_mark_GROUP) 81 | count_chrom_mark_GROUP$mark 82 | count_chrom_mark_GROUP[,1] 83 | ``` 84 | 85 | Plot to count the number of experiments for each combination of mark- biosample group 86 | ```{r} 87 | output_folder <- './data/one_plot_per_state/biosample_group' 88 | count_chrom_mark_GROUP <- emission_df %>% select(c('experiments', 'mark', 'biosample_group')) %>% group_by(mark, biosample_group) %>% tally() %>% rename('count' = 'n') 89 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% dcast(mark ~ biosample_group , value.var = 'count') %>% replace(., is.na(.), 0) 90 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% select(c('mark','brain', 'neuron', 'retina', 'heart', 'lung', 'liver', 'stomach', 'intestine', 'kidney', 'adipose', 'muscle', 'fibroblast', 'white blood cells', 'red blood cells', 'bone marrow', 'thymus', 'spleen', 'testis', 'esc', 'placenta', 'embryo', 'other')) 91 | rownames(count_chrom_mark_GROUP) <- count_chrom_mark_GROUP$mark 92 | count_chrom_mark_GROUP <- 
count_chrom_mark_GROUP[,-1] 93 | annot_col_df <- data.frame(biosample_group = colnames(count_chrom_mark_GROUP)) 94 | rownames(annot_col_df) <- as.character(colnames(count_chrom_mark_GROUP)) 95 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP[c('ATAC', 'DNase', 'H3K27ac', 'H3K4me1', 'H3K9ac', 'H3K4me2', 'H3K4me3', 'H3K36me3', 'H3K79me2', 'H3K79me3', 'H3ac', 'H3K27me3', 'H3K9me3'), ] 96 | print(count_chrom_mark_GROUP) 97 | annot_row_df <- data.frame(chrom_mark = rownames(count_chrom_mark_GROUP)) 98 | rownames(annot_row_df) <- as.character(rownames(count_chrom_mark_GROUP)) 99 | 100 | pheatmap(count_chrom_mark_GROUP, fontsize = 6, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, display_numbers = count_chrom_mark_GROUP, number_format = '%s', annotation_col = annot_col_df, annotation_row = annot_row_df, annotation_colors = list(biosample_group = biosample_color_dict, chrom_mark = assay_color_dict), color = colorRampPalette(c('white', 'red'))(300), filename = file.path(output_folder, 'count_categories.png')) 101 | ``` 102 | 103 | # draw plots for each state on its own figure, grouped by my own metadata of biosamples 104 | ```{r} 105 | for (state_index in seq(num_state)){ 106 | state_colname <- paste0("S", state_index) 107 | state_df <- emission_df %>% select(c(state_colname, "mark", "biosample_group")) 108 | avg_emission_df <- state_df %>% group_by(mark, biosample_group) %>% summarise_all(~ mean(.x, na.rm = TRUE)) %>% rename(mean_emission = state_colname) # group by chrom_mark and GROUP, then calculate the mean emission of emission probabilities in this state --> chrom_mark, GROUP, mean_emission 109 | avg_emission_df <- avg_emission_df %>% dcast(mark ~ biosample_group , value.var = 'mean_emission') %>% replace(., is.na(.), 0) # put the 3 column data frame into a data frame that are heatmap compatible 110 | avg_emission_df <- avg_emission_df %>% select(c('mark','brain', 'neuron', 'retina', 'heart', 'lung', 'liver', 
'stomach', 'intestine', 'kidney', 'adipose', 'muscle', 'fibroblast', 'white blood cells', 'red blood cells', 'bone marrow', 'thymus', 'spleen', 'testis', 'esc', 'placenta', 'embryo', 'other')) 111 | chrom_mark_list <- avg_emission_df %>% select('mark') # list of chromatin marks, which will later be the row names for the heatmap 112 | avg_emission_df <- avg_emission_df[,-1] # get rid of the 'chrom_mark' column because it will become the row names 113 | row.names(avg_emission_df) <- chrom_mark_list[,1] # row names are now chrom mark so that can draw a heatmap later 114 | avg_emission_df <- (avg_emission_df[c('ATAC', 'DNase', 'H3K27ac', 'H3K4me1', 'H3K9ac', 'H3K4me2', 'H3K4me3', 'H3K36me3', 'H3K79me2', 'H3K79me3', 'H3ac', 'H3K27me3', 'H3K9me3'), ]) # note that H3K79me1 is not here 115 | this_state_save_fig_fn <- file.path(output_folder, paste0("state", state_index, '_avg_emission.png') ) 116 | break_list <- seq (0, 1, by = 0.01) 117 | annot_col_df <- data.frame(cell_GROUP = colnames(avg_emission_df)) 118 | rownames(annot_col_df) <- as.character(colnames(avg_emission_df)) 119 | annot_row_df <- data.frame(chrom_mark = rownames(avg_emission_df)) 120 | rownames(annot_row_df) <- as.character(rownames(avg_emission_df)) 121 | pheatmap(avg_emission_df,breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, annotation_col = annot_col_df, annotation_row = annot_row_df, display_numbers = count_chrom_mark_GROUP ,annotation_colors = list(cell_GROUP = biosample_color_dict, chrom_mark = assay_color_dict), filename = this_state_save_fig_fn) 122 | print(paste("Done with state:", state_index)) 123 | } 124 | ``` 125 | 126 | 127 | ```{r} 128 | output_folder <- './data/one_plot_per_state/organ_slims' 129 | count_chrom_mark_GROUP <- emission_df %>% select(c('experiments', 'mark', 'organ_group')) %>% group_by(mark, organ_group) %>% tally() %>% rename('count' = 'n') 130 | count_chrom_mark_GROUP <- 
count_chrom_mark_GROUP %>% dcast(mark ~ organ_group , value.var = 'count') %>% replace(., is.na(.), 0) 131 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP %>% select(c('mark','brain', 'spinal_cord','eye', 'heart', 'lung', 'liver', 'stomach', 'intestine', 'kidney', 'adipose_tissue', 'breast', 'connective_tissue', 'epithelium', 'limb', 'musculature', 'blood', 'immune_organ', 'lymph_node', 'spleen', 'bone_element', 'adrenal_gland', 'gonad', 'extraembryonic_component', 'placenta', 'embryo', 'unknown')) 132 | rownames(count_chrom_mark_GROUP) <- count_chrom_mark_GROUP$mark 133 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP[,-1] 134 | annot_col_df <- data.frame(organ_group = colnames(count_chrom_mark_GROUP)) 135 | rownames(annot_col_df) <- as.character(colnames(count_chrom_mark_GROUP)) 136 | count_chrom_mark_GROUP <- count_chrom_mark_GROUP[c('ATAC', 'DNase', 'H3K27ac', 'H3K4me1', 'H3K9ac', 'H3K4me2', 'H3K4me3', 'H3K36me3', 'H3K79me2', 'H3K79me3', 'H3ac', 'H3K27me3', 'H3K9me3'), ] 137 | print(count_chrom_mark_GROUP) 138 | annot_row_df <- data.frame(chrom_mark = rownames(count_chrom_mark_GROUP)) 139 | rownames(annot_row_df) <- as.character(rownames(count_chrom_mark_GROUP)) 140 | 141 | pheatmap(count_chrom_mark_GROUP, fontsize = 6, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, display_numbers = count_chrom_mark_GROUP, number_format = '%s', annotation_col = annot_col_df, annotation_row = annot_row_df, annotation_colors = list(organ_group = organ_color_dict, chrom_mark = assay_color_dict), color = colorRampPalette(c('white', 'red'))(300), filename = file.path(output_folder, 'count_categories.png')) 142 | 143 | for (state_index in seq(num_state)){ 144 | state_colname <- paste0("S", state_index) 145 | state_df <- emission_df %>% select(c(state_colname, "mark", "organ_group")) 146 | avg_emission_df <- state_df %>% group_by(mark, organ_group) %>% summarise_all(~ mean(.x, na.rm = TRUE)) %>% rename(mean_emission = 
state_colname) # group by chrom_mark and GROUP, then calculate the mean emission of emission probabilities in this state --> chrom_mark, GROUP, mean_emission 147 | avg_emission_df <- avg_emission_df %>% dcast(mark ~ organ_group , value.var = 'mean_emission') %>% replace(., is.na(.), 0) # put the 3 column data frame into a data frame that are heatmap compatible 148 | avg_emission_df <- avg_emission_df %>% select(c('mark','brain', 'spinal_cord','eye', 'heart', 'lung', 'liver', 'stomach', 'intestine', 'kidney', 'adipose_tissue', 'breast', 'connective_tissue', 'epithelium', 'limb', 'musculature', 'blood', 'immune_organ', 'lymph_node', 'spleen', 'bone_element', 'adrenal_gland', 'gonad', 'extraembryonic_component', 'placenta', 'embryo', 'unknown')) 149 | chrom_mark_list <- avg_emission_df %>% select('mark') # list of chromatin marks, which will later be the row names for the heatmap 150 | avg_emission_df <- avg_emission_df[,-1] # get rid of the 'chrom_mark' column because it will become the row names 151 | row.names(avg_emission_df) <- chrom_mark_list[,1] # row names are now chrom mark so that can draw a heatmap later 152 | avg_emission_df <- (avg_emission_df[c('ATAC', 'DNase', 'H3K27ac', 'H3K4me1', 'H3K9ac', 'H3K4me2', 'H3K4me3', 'H3K36me3', 'H3K79me2', 'H3K79me3', 'H3ac', 'H3K27me3', 'H3K9me3'), ]) # note that H3K79me1 is not here 153 | this_state_save_fig_fn <- file.path(output_folder, paste0("state", state_index, '_avg_emission.png') ) 154 | break_list <- seq (0, 1, by = 0.01) 155 | annot_col_df <- data.frame(organ_group = colnames(avg_emission_df)) 156 | rownames(annot_col_df) <- as.character(colnames(avg_emission_df)) 157 | annot_row_df <- data.frame(chrom_mark = rownames(avg_emission_df)) 158 | rownames(annot_row_df) <- as.character(rownames(avg_emission_df)) 159 | pheatmap(avg_emission_df,breaks = break_list, fontsize = 5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, fontsize_col = 4.5, angle_col = 90, annotation_col = annot_col_df, 
annotation_row = annot_row_df, display_numbers = count_chrom_mark_GROUP ,annotation_colors = list(organ_group = organ_color_dict, chrom_mark = assay_color_dict), filename = this_state_save_fig_fn) 160 | print(paste("Done with state:", state_index)) 161 | } 162 | ``` -------------------------------------------------------------------------------- /emission_and_metadata/helper.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import string 4 | import os 5 | import sys 6 | import time 7 | def make_dir(directory): 8 | try: 9 | os.makedirs(directory) 10 | except OSError: # raised, among other cases, when the folder already exists 11 | print('Folder ' + directory + ' is already created') 12 | 13 | 14 | 15 | def check_file_exist(fn): 16 | if not os.path.isfile(fn): 17 | print("File: " + fn + " DOES NOT EXIST") 18 | sys.exit(1) 19 | return 20 | 21 | def check_dir_exist(fn): 22 | if not os.path.isdir(fn): 23 | print("Directory: " + fn + " DOES NOT EXIST") 24 | sys.exit(1) 25 | return 26 | 27 | def create_folder_for_file(fn): 28 | last_slash_index = fn.rfind('/') 29 | if last_slash_index != -1: # path contains folder 30 | make_dir(fn[:last_slash_index]) 31 | return 32 | 33 | def get_command_line_integer(arg): 34 | try: 35 | arg = int(arg) 36 | return arg 37 | except (TypeError, ValueError): 38 | print("Integer: " + str(arg) + " IS NOT VALID") 39 | sys.exit(1) 40 | 41 | 42 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the output format of ChromHMM OverlapEnrichment 43 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t") 44 | # rename the columns of enrichment_df so that it's easier to work with 45 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"}) 46 | return enrichment_df 47 | 48 | def get_non_coding_enrichment_df (non_coding_enrichment_fn): 49 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t') 50 | if len(nc_enrichment_df.columns) != 3: 51 | print ( "Number of columns in a 
non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.") 52 | print ( "Exiting, from helper.py") 53 | sys.exit(1) 54 | # Now, we know that the nc_enrichment_df has exactly 3 columns 55 | # change the column names 56 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"] 57 | return (nc_enrichment_df) 58 | -------------------------------------------------------------------------------- /emission_and_metadata/prepare_emission_for_plotting.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import numpy as np 4 | emission_fn = './data/emissions_100.txt' 5 | emission_df = pd.read_csv(emission_fn, header = 0, index_col = None, sep = '\t') 6 | emission_df = emission_df.T # rows: experiments, columns: states 7 | emission_df.reset_index(inplace = True) # experiment names are now a column 8 | emission_df.columns = emission_df.iloc[0] # first row holds the states --> becomes the header 9 | emission_df = emission_df.rename(columns={'State (Emission order)': 'experiments'}) 10 | emission_df = emission_df.drop(0, axis = 0) # after first row becomes headers, drop first row 11 | # experiment names are of the form hindbrain_ATAC-seq_ENCSR662KNY 12 | emission_df['assay'] = emission_df['experiments'].apply(lambda x: x.split('_')[-2].split('-')[0]) 13 | emission_df['biosample'] = emission_df['experiments'].apply(lambda x: x.split('_')[0]) 14 | emission_df['expID'] = emission_df['experiments'].apply(lambda x: x.split('_')[2]) 15 | assay_meta_fn = './data/assay_count.csv' 16 | biosample_meta_fn = './data/biosample_term_name_counts.csv' 17 | assay_df = pd.read_csv(assay_meta_fn, header = 0, index_col = None, sep = ',')[['mark', 'color', 'big_group']] 18 | assay_df = assay_df.rename(columns = {'color': 'mark_color', 'big_group': 'assay_big_group'}) 19 | print(dict(assay_df.groupby(['mark', 'mark_color']).groups.keys())) 20 | 
biosample_df = pd.read_csv(biosample_meta_fn, header = 0, index_col = None, sep = ',')[['biosample_name', 'group', 'big_group', 'color']] 21 | biosample_df['group'] = biosample_df['group'].apply(lambda x: x.lower()) 22 | biosample_df = biosample_df.rename(columns = {'color': 'biosample_color', 'group': 'biosample_group', 'big_group': 'biosample_big_group', 'color': 'biosample_color'}) 23 | biosample_df['biosample_group'] = biosample_df['biosample_group'].replace('', 'other') 24 | print(dict(biosample_df.groupby(['biosample_group', 'biosample_color']).groups.keys())) 25 | 26 | meta_fn = './data/slims_metadata.txt' 27 | meta_df = pd.read_csv(meta_fn, header = 0, index_col = None, sep = '\t') # 'experimentID', 'organ_slims', 'cell_slims', 'developmental_slims', 'system_slims', 'biosample_summary', 'simple_biosample_summary' 28 | meta_df['organ_group'] = meta_df['organ_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('musculature_of_body', 'musculature').replace('', 'unknown') 29 | meta_df['cell_group'] = meta_df['cell_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('', 'unknown') 30 | meta_df['development_group'] = meta_df['developmental_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('', 'unknown') # The ectoderm gives rise to the skin and the nervous system. The mesoderm specifies the development of several cell types such as bone, muscle, and connective tissue. 
Cells in the endoderm layer become the linings of the digestive and respiratory system, and form organs such as the liver and pancreas 31 | meta_df['system_group'] = meta_df['system_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('', 'unknown').apply(lambda x: x.split('_system')[0]) 32 | 33 | # PROCESSING METADATA ABOUT THE ORGANS AND DEVELOPMENTAL STAGES 34 | organ_meta_fn = './data/organ.txt' 35 | organ_meta_df = pd.read_csv(organ_meta_fn, header = None, index_col = None, sep = '\t') # organ, color, order 36 | ORGAN_GROUP_COLOR = dict(zip(organ_meta_df[0], organ_meta_df[1])) # keys: organ, values: color 37 | print(ORGAN_GROUP_COLOR) 38 | ORGAN_ORDER = dict(zip(organ_meta_df[0], organ_meta_df[2])) # keys: organ, values: organ numbered order 39 | meta_df['organ_color'] = meta_df['organ_group'].apply(lambda x: ORGAN_GROUP_COLOR[x]) 40 | meta_df['organ_order'] = meta_df['organ_group'].apply(lambda x: ORGAN_ORDER[x]) 41 | meta_df.drop(labels = ['cell_slims', 'organ_slims', 'system_slims'], axis = 1, inplace = True) 42 | 43 | print(emission_df.columns) 44 | print(biosample_df.columns) 45 | emission_df = emission_df.merge(assay_df, how = 'left', left_on = 'assay', right_on = 'mark') 46 | emission_df = emission_df.merge(biosample_df, how = 'left', left_on = 'biosample', right_on = 'biosample_name') 47 | print(emission_df[emission_df['biosample_group'].isnull()]) # experiments whose biosample has no metadata match 48 | print() 49 | emission_df = emission_df.merge(meta_df, how = 'left', left_on = 'expID', right_on = 'experimentID') 50 | emission_df['biosample_group'] = emission_df['biosample_group'].replace('', 'other') 51 | save_fn = './data/emissions_100_for_pheatmap.txt' 52 | emission_df.to_csv(save_fn, header = True, index = False, sep = '\t') 53 | -------------------------------------------------------------------------------- /emission_and_metadata/rank_chromMark_cellType_emission.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 
2 | import numpy as np 3 | import sys 4 | import os 5 | import helper 6 | 7 | def prepare_color_dict(): 8 | assay_meta_fn = './data/assay_count.csv' 9 | organ_meta_fn = './data/organ.txt' 10 | assay_df = pd.read_csv(assay_meta_fn, header = 0, index_col = None, sep = ',')[['mark', 'color', 'big_group']] 11 | assay_df = assay_df.rename(columns = {'color': 'mark_color', 'big_group': 'assay_big_group'}) 12 | assay_color_dict = dict(assay_df.groupby(['mark', 'mark_color']).groups.keys()) # keys: assay, values: colors 13 | organ_df = pd.read_csv(organ_meta_fn, header = None, index_col = None, sep = '\t') 14 | organ_df.columns = ['group', 'color', 'order'] 15 | organ_df['group'] = organ_df['group'].apply(lambda x: x.lower()) 16 | organ_color_dict = dict(organ_df.groupby(['group', 'color']).groups.keys()) # keys: organ group, values: color 17 | return assay_color_dict, organ_color_dict 18 | 19 | ASSAY_COLOR_DICT, organ_COLOR_DICT = prepare_color_dict() 20 | 21 | def get_ranked_mark_name (row_data): 22 | sorted_rank = row_data.sort_values(ascending = False) # ranks in descending order: n (most emitted) --> 1 (least emitted) 23 | return pd.Series(sorted_rank.index) # get the index, which is the list of experiments, ordered from most emitted to least emitted in each state. Convert to pd.Series so that we can concatenate into a data frame later. 24 | 25 | def get_ranked_exp_df (fn) : # fn should be emission_fn 26 | df = pd.read_csv(fn, header = 0, sep = '\t') # get emission df 27 | df = df.rename(columns = {'State (Emission order)' : 'state'}) # one column is state, others are all experiment names 28 | df.index = df['state'] # state column becomes the index of the dataframe 29 | df = df[df.columns[1:]] # get rid of state column so that we can rank the emission probabilities of each state 30 | rank_df = df.rank(axis = 1) # index: state, columns: experiments ordered just like in df. 
Cells: rank of each experiment within each state, 1: least emission --> n: highest emission 31 | rank_df = rank_df.apply(get_ranked_mark_name, axis = 1) # index; state, columns: rank 1 --> n most emitted experiment to least emitted experiments 32 | rank_df.columns = map(lambda x: 'r_' + str(x + 1), range(len(rank_df.columns))) 33 | return rank_df 34 | 35 | 36 | def color_organ_names(val): 37 | if val == "": 38 | color = organ_COLOR_DICT["NA"] 39 | else: 40 | color = organ_COLOR_DICT[val] 41 | return 'background-color: %s' % color 42 | 43 | def color_mark_names(val): 44 | if val == "": 45 | color = ASSAY_COLOR_DICT['NaN'] 46 | else: 47 | color = ASSAY_COLOR_DICT[val] 48 | return 'background-color: %s' % color 49 | 50 | def read_exp_organ_dict(): 51 | meta_fn = './data/slims_metadata.txt' 52 | meta_df = pd.read_csv(meta_fn, header = 0, index_col = None, sep = '\t') # 'experimentID', 'organ_slims', 'cell_slims', 'developmental_slims', 'system_slims', 'biosample_summary', 'simple_biosample_summary' 53 | meta_df['organ_group'] = meta_df['organ_slims'].apply(lambda x: '_'.join(x[1:-1].split(',')[0][1:-1].split())).replace('musculature_of_body', 'musculature').replace('', 'unknown') 54 | return dict(zip(meta_df.experimentID, meta_df.organ_group)) 55 | 56 | def get_painted_excel_ranked_exp(rank_df, output_fn, num_top_marks, organ_color_dict, assay_color_dict): 57 | EXP_ORGAN_DICT = read_exp_organ_dict() 58 | organ_df = rank_df.applymap(lambda x: EXP_ORGAN_DICT[x.split('_')[2]]) # first convert the data to only contain the experimentID, then convert from experimentID to ORGAN 59 | organ_df.reset_index(inplace = True) 60 | chrom_mark_df = rank_df.applymap(lambda x: x.split('_')[-2].split('-')[0]) # H3K9me3 61 | chrom_mark_df.reset_index(inplace = True) 62 | columns_to_paint = list(map(lambda x: 'r_' + str(x + 1), range(num_top_marks))) 63 | organ_df = organ_df[['state'] + columns_to_paint] 64 | chrom_mark_df = chrom_mark_df[['state'] + columns_to_paint] 65 | colored_organ_df 
= organ_df.style.applymap(color_organ_names, subset = columns_to_paint) 66 | colored_chrom_mark_df = chrom_mark_df.style.applymap(color_mark_names, subset = columns_to_paint) 67 | # save file 68 | writer = pd.ExcelWriter(output_fn, engine='xlsxwriter') 69 | colored_organ_df.to_excel(writer, sheet_name = 'cell_group') 70 | colored_chrom_mark_df.to_excel(writer, sheet_name = 'chrom_mark') 71 | writer.close() # close() writes the workbook; ExcelWriter.save() was removed in pandas 2.0 72 | print ("Done saving data into " + output_fn) 73 | 74 | 75 | 76 | def main(): 77 | if len(sys.argv) != 4: 78 | usage() 79 | emission_fn = sys.argv[1] 80 | output_fn = sys.argv[2] 81 | helper.create_folder_for_file(output_fn) 82 | num_top_marks = helper.get_command_line_integer(sys.argv[3]) 83 | print ('Done getting command line argument') 84 | assay_color_dict, organ_color_dict = prepare_color_dict() 85 | rank_df = get_ranked_exp_df(emission_fn) # index: states, columns: rank of experiments r_1: highest emitted, r_n: lowest emitted --> cells: names of experiments. Example: E118-H3K9me3 86 | print ("Done getting ranked data") 87 | get_painted_excel_ranked_exp(rank_df, output_fn, num_top_marks, organ_color_dict, assay_color_dict) 88 | 89 | 90 | 91 | def usage(): 92 | print ("python rank_chromMark_cellType_emission.py <emission_fn> <output_fn> <num_top_marks>") 93 | print ("emission_fn: should be ./data/emissions_100.txt") 94 | print ("output_fn: excel file where the output data of ranked cell type and chrom marks are stored") 95 | print ("num_top_marks: number of top marks that we want to report, recommended value: 100") 96 | sys.exit(1) 97 | 98 | main() 99 | -------------------------------------------------------------------------------- /gene_exp_analysis/._investigate_lecif_and_phastCons.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/gene_exp_analysis/._investigate_lecif_and_phastCons.py 
-------------------------------------------------------------------------------- /gene_exp_analysis/README.md: -------------------------------------------------------------------------------- 1 | This folder contains code to create figure 2C in the manuscript. Within this folder: 2 | 3 | - ```helper.py```: script containing helper functions for file management. 4 | - ```get_avg_gene_per_state.py```: this script calculates the average gene expression per mouse full-stack state, in each available cell type. Please refer to the Methods section "Average gene expression associated with each full-stack state" for full details about how the average expression is calculated. 5 | - ```avg_gene_exp_per_state_per_ct.txt.gz```: the output of the script ```get_avg_gene_per_state.py```. Inside this file, each row corresponds to a state (raw state names: E1 --> E100) and each column corresponds to a sample. This file is then used to plot figure 2C with the R markdown code ```plot_avg_gene_exp_heatmap.Rmd``` 6 | ``` 7 | usage: get_avg_gene_per_state.py [-h] [--gene_exp_folder GENE_EXP_FOLDER] 8 | [--segment_fn SEGMENT_FN] 9 | [--num_chromHMM_state NUM_CHROMHMM_STATE] 10 | [--output_fn OUTPUT_FN] 11 | [--state_annot_fn STATE_ANNOT_FN] 12 | 13 | Calculating avg gene expression in each state, in each available cell 14 | types. The gene_exp_folder should be where we store BingRen's gene 15 | expression data for mouse. segment_fn should be from ChromHMM model for the 16 | mouse 17 | 18 | options: 19 | -h, --help show this help message and exit 20 | --gene_exp_folder GENE_EXP_FOLDER 21 | Where there gene exp data for different cell types 22 | are stored. Each file in this folder is named in 23 | the format -.expr, where the 24 | provided gene expression data is in FPKM unit. 
This 25 | data can be downloaded and unzipped from http://chr 26 | omosome.sdsc.edu/mouse/download/19-tissues-expr.zip 27 | from Shen et al., 2012, 'A map of the cis- 28 | regulatory sequences in the mouse genome', Nature. 29 | On Ha's system: 30 | /data/ENCODE/mouse/gene_exp/19-tissues-expr/. 31 | --segment_fn SEGMENT_FN 32 | segment_fn. NOTE: Because our gene expression data 33 | is in mm9, we will need to pass the segment_fn in 34 | mm9 here 35 | --num_chromHMM_state NUM_CHROMHMM_STATE 36 | number of chromHMM states 37 | --output_fn OUTPUT_FN 38 | output_fn 39 | --state_annot_fn STATE_ANNOT_FN 40 | state_annot_fn 41 | ``` 42 | - ```plot_avg_gene_exp_heatmap.Rmd```: R Markdown file that plots figure 2C. 43 | 44 | # License: 45 | All code is provided under the MIT Open Access License 46 | Copyright 2022 Ha Vu and Jason Ernst 47 | 48 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 49 | 50 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 51 | 52 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
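
# Example: overlap-weighted average on toy data

The averaging that ```get_avg_gene_per_state.py``` performs for one cell type can be sketched on a small toy table. This is a minimal illustration under stated assumptions, not the script itself: the column names `overlap_bp`, `gene_length` and `FPKM`, the function `weighted_avg_log_fpkm`, and the toy values are hypothetical and exist only for this example. What follows the script is the 200-bp bin size, the ln(FPKM + 1) transform, and the weight of each gene-state overlap (number of overlapping bins, rounded up, divided by the gene length in bins).

```python
import numpy as np
import pandas as pd

SEGMENT_LENGTH = 200  # bin size in bp, same value as SEGMENT_LENGTH in get_avg_gene_per_state.py


def weighted_avg_log_fpkm(overlap_df):
    """Average ln(FPKM + 1) per state, weighting each gene-state overlap.

    overlap_df is a hypothetical toy table with one row per (state, gene)
    overlap and the columns: state, overlap_bp, gene_length, FPKM.
    """
    log_fpkm = np.log(overlap_df['FPKM'] + 1)
    # number of 200-bp bins shared by the gene and the state segment, rounded up
    num_segments = np.ceil(overlap_df['overlap_bp'] / SEGMENT_LENGTH)
    # gene length expressed in bins
    gene_length_segments = overlap_df['gene_length'] / SEGMENT_LENGTH
    weight = num_segments / gene_length_segments
    weighted_sum = (log_fpkm * weight).groupby(overlap_df['state']).sum()
    total_weight = weight.groupby(overlap_df['state']).sum()
    return weighted_sum / total_weight


# Toy data (assumed values): two genes overlap state E1, one gene overlaps E2.
toy = pd.DataFrame({
    'state': ['E1', 'E1', 'E2'],
    'overlap_bp': [400, 200, 1000],
    'gene_length': [2000, 1000, 1000],
    'FPKM': [0.0, np.e - 1, np.e ** 2 - 1],
})
avg = weighted_avg_log_fpkm(toy)
print(avg)
```

In this toy table both E1 overlaps carry the same weight (0.2), so the E1 average is the plain mean of their ln(FPKM + 1) values, while E2 is covered by a single gene and simply takes that gene's value.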
53 | 54 | -------------------------------------------------------------------------------- /gene_exp_analysis/avg_gene_exp_per_state_per_ct.txt.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/gene_exp_analysis/avg_gene_exp_per_state_per_ct.txt.gz -------------------------------------------------------------------------------- /gene_exp_analysis/ct_name.csv: -------------------------------------------------------------------------------- 1 | boneMarrow 2 | brain 3 | cerebellum 4 | cortex 5 | heart 6 | intestine 7 | intestine_ 8 | kidney 9 | limb 10 | liver 11 | lung 12 | mESC 13 | mef_male 14 | olfactory 15 | placenta 16 | spleen 17 | testes 18 | thymus 19 | -------------------------------------------------------------------------------- /gene_exp_analysis/get_avg_gene_per_state.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import pandas as pd 4 | import numpy as np 5 | import glob 6 | import pybedtools as bed 7 | import helper 8 | parser = argparse.ArgumentParser(description='Calculating avg gene expression in each state, in each available cell types. The gene_exp_folder should be where we store BingRen\'s gene expression data for mouse. segment_fn should be from ChromHMM model for the mouse') 9 | parser.add_argument('--gene_exp_folder', type=str, 10 | help='Where there gene exp data for different cell types are stored') 11 | parser.add_argument('--segment_fn', type=str, 12 | help='segment_fn. 
NOTE: Because our gene expression data is in mm9, we will need to pass the segment_fn in mm9 here') 13 | parser.add_argument('--num_chromHMM_state', default=100, type=int, 14 | help='number of chromHMM states') 15 | parser.add_argument('--output_fn', type=str, 16 | help = 'output_fn') 17 | parser.add_argument('--state_annot_fn', type=str, 18 | help = 'state_annot_fn') 19 | args = parser.parse_args() 20 | print (args) 21 | helper.check_dir_exist(args.gene_exp_folder) 22 | helper.check_file_exist(args.segment_fn) 23 | helper.create_folder_for_file(args.output_fn) 24 | helper.check_file_exist(args.state_annot_fn) 25 | SEGMENT_LENGTH = 200 26 | 27 | def open_gene_exp_data_one_ct(gene_exp_fn): 28 | df = pd.read_csv(gene_exp_fn, header = 0, index_col = None, sep = '\t') 29 | df = df[['chr', 'left', 'right', 'FPKM']] 30 | df['FPKM'] = df['FPKM'].apply(lambda x: np.log(x+1)) 31 | df.columns = ['chrom', 'start', 'end', 'logFPKM'] 32 | return df 33 | 34 | def get_avg_gene_exp_one_ct(segment_fn, gene_exp_fn, num_chromHMM_state): 35 | segment_bed = bed.BedTool(segment_fn) 36 | exp_df = open_gene_exp_data_one_ct(gene_exp_fn) 37 | exp_bed = bed.BedTool.from_dataframe(exp_df) 38 | segment_bed = segment_bed.intersect(exp_bed, wa = True, wb = True) 39 | comb_df = segment_bed.to_dataframe() 40 | comb_df.columns = ['sChrom', 'sStart', 'sEnd', 'state', 'eChrom', 'eStart', 'eEnd', 'logFPKM'] 41 | comb_df['gene_length'] = comb_df['eEnd'] - comb_df['eStart'] 42 | comb_df['gene_length_segments'] = comb_df['gene_length'] / SEGMENT_LENGTH 43 | comb_df['com_start'] = comb_df[['sStart', 'eStart']].max(axis = 1) #start of the region we are interested is the start of either the gene or the segment of chromatin state. Whichever is greater is the start of the intersection. 
44 | comb_df['com_end'] = comb_df[['sEnd', 'eEnd']].min(axis = 1) # same argument as above with the end of the region 45 | comb_df['num_segments'] = (comb_df['com_end'] - comb_df['com_start']) / SEGMENT_LENGTH # number of segments that are shared between the genes and the chromatin state 46 | comb_df['num_segments'] = comb_df['num_segments'].apply(np.ceil) # round up the number of segments 47 | # Number of segments is the number of basepair in the region we are interested in / number of basepairs per segment bin 48 | comb_df = comb_df[[u'state', u'gene_length', u'com_start', u'com_end', u'num_segments', 'gene_length_segments', 'logFPKM']] 49 | # now onto processing the gene expression data 50 | result_avg_exp_S = pd.Series([], dtype = float) 51 | for state_index in range(num_chromHMM_state): 52 | one_based_state_index = state_index + 1 53 | this_state_org_df = (comb_df[comb_df['state'] == 'E' + str(one_based_state_index)]) 54 | bp_unif_exp_S = this_state_org_df['logFPKM'] * this_state_org_df['num_segments'] / this_state_org_df['gene_length_segments'] # a pandas series where each entry correspond to a position on the genome annotated as this state 55 | total_weighted_segments = np.sum(this_state_org_df['num_segments'] / this_state_org_df['gene_length_segments']) # a number 56 | avg_exp_bp_unif = bp_unif_exp_S.sum() / total_weighted_segments 57 | result_avg_exp_S['E' + str(one_based_state_index)] = avg_exp_bp_unif 58 | return result_avg_exp_S # a pandas series with index: E1--> E100, values: avg gene expression in each state in the current cell type 59 | 60 | def get_cell_name_from_fn(fn): 61 | fn = fn.split('/')[-1].split('.expr')[0].split('.gene')[0] 62 | first_fn = fn.split('-')[0] 63 | if first_fn.endswith('1') or first_fn.endswith('2') : 64 | return first_fn 65 | last_fn = fn.split('-')[-1] 66 | if (last_fn == '1' or last_fn == '2' or last_fn == '3') and first_fn != 'heart' and first_fn != 'liver': 67 | return first_fn + last_fn 68 | if (first_fn == 'heart' or 
first_fn == 'liver'): 69 | return '_'.join(fn.split('-')) 70 | else: 71 | return '_'.join(fn.split('-')[:2]) 72 | return '' 73 | 74 | def get_avg_gene_exp_all_ct(segment_fn, gene_exp_folder, num_chromHMM_state, output_fn): 75 | gene_exp_fn_list = glob.glob(gene_exp_folder+'/*.expr') 76 | ct_list = list(map(get_cell_name_from_fn, gene_exp_fn_list)) 77 | result_index = list(map(lambda x: 'E{}'.format(x+1), range(num_chromHMM_state))) 78 | result_avg_exp_df = pd.DataFrame(columns = ct_list, index = result_index) 79 | for ct_index, ct in enumerate(ct_list): 80 | result_avg_exp_df[ct] = get_avg_gene_exp_one_ct(segment_fn, gene_exp_fn_list[ct_index], num_chromHMM_state) 81 | result_avg_exp_df.to_csv(output_fn, header = True, index = True, sep = '\t', compression = 'gzip') 82 | print('Done!') 83 | return 84 | 85 | get_avg_gene_exp_all_ct(args.segment_fn, args.gene_exp_folder, args.num_chromHMM_state, args.output_fn) -------------------------------------------------------------------------------- /gene_exp_analysis/helper.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import string 4 | import os 5 | import sys 6 | import time 7 | def make_dir(directory): 8 | try: 9 | os.makedirs(directory) 10 | except: 11 | print ( 'Folder' + directory + ' is already created') 12 | 13 | 14 | 15 | def check_file_exist(fn): 16 | if not os.path.isfile(fn): 17 | print ( "File: " + fn + " DOES NOT EXISTS") 18 | exit(1) 19 | return 20 | 21 | def check_dir_exist(fn): 22 | if not os.path.isdir(fn): 23 | print ( "Directory: " + fn + " DOES NOT EXISTS") 24 | exit(1) 25 | return 26 | 27 | def create_folder_for_file(fn): 28 | last_slash_index = fn.rfind('/') 29 | if last_slash_index != -1: # path contains folder 30 | make_dir(fn[:last_slash_index]) 31 | return 32 | 33 | def get_command_line_integer(arg): 34 | try: 35 | arg = int(arg) 36 | return arg 37 | except: 38 | print ( "Integer: " + str(arg) + " IS NOT 
VALID") 39 | exit(1) 40 | 41 | 42 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format 43 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t") 44 | # rename the org_enrichment_df so that it's easier to work with 45 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"}) 46 | return enrichment_df 47 | 48 | def get_non_coding_enrichment_df (non_coding_enrichment_fn): 49 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t') 50 | if len(nc_enrichment_df.columns) != 3: 51 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.") 52 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py") 53 | exit(1) 54 | # Now, we know that the nc_enrichment_df has exactly 3 columns 55 | # change the column names 56 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"] 57 | return (nc_enrichment_df) 58 | -------------------------------------------------------------------------------- /gene_exp_analysis/plot_avg_gene_exp_heatmap.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "draw_avg_gene_exp_heatmap" 3 | author: "Ha Vu" 4 | date: "3/30/2022" 5 | output: html_document 6 | --- 7 | 8 | ```{r} 9 | library(ggplot2) 10 | library(dplyr) 11 | library(tidyr) 12 | library(pheatmap) 13 | library(RColorBrewer) 14 | ``` 15 | 16 | ```{r} 17 | state_annot_fn <- '../state_annotation_processed.tsv' 18 | read_state_annot_df <- function(state_annot_fn){ 19 | annot_df <- as.data.frame(read.csv(state_annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = '\t')) 20 | tryCatch({ 21 | annot_df <- annot_df %>% rename('state_type' = 'group') 22 | }, error = function (e) {message("could not rename column 'group' to 'state_type' in annot_df; this is safe to ignore")}) 23 | annot_df <- 
annot_df %>% arrange(state_order_by_group) # order rows based on the index that we get from state_order_by_group column 24 | return(annot_df) 25 | } 26 | state_annot_df <- read_state_annot_df(state_annot_fn) 27 | state_color_df <- unique(state_annot_df%>% select(c('state_type', 'color'))) 28 | STATE_COLOR_MAPPING <- setNames(state_color_df$color, state_color_df$state_type) 29 | CELL_TYPE_COLOR_MAPPING <- c('cerebellum'= '#C5912B', 'heart'= '#D56F80', 'intestine'= '#D0A39B', 'testes'= '#ACB9CA', 'spleen'= '#678C69', 'thymus'= '#C6E0B4', 'olfactory'= '#999999', 'mESC'= '#924965', 'mef_male'= '#C075C3', 'mef'= '#C075C3', 'brain'= '#C5912B', 'lung'= '#E41A1C', 'cortex'= '#C5912B', 'limb'= '#F182BC', 'placenta'= '#69608A', 'boneMarrow'= '#375623', 'kidney'= '#F4B084', 'liver'= '#9BC2E6') 30 | calculate_gap_rows_among_state_groups <- function(state_annot_df){ 31 | state_group_ordered_by_appearance <- unique(state_annot_df$state_type) # list of different state groups, ordered by how they appear in the heatmap from top to bottom 32 | count_df <- state_annot_df %>% count(state_type) 33 | count_df <- count_df[match(state_group_ordered_by_appearance, count_df$state_type),] # order the rows such that the state_type are ordered based on state_group_ordered_by_appearance 34 | results <- cumsum(count_df$n) # cumulative sum of the count for groups of states, which will be used to generate the gaps between rows of the heatmaps 35 | return(results) 36 | } 37 | 38 | get_cell_group_name <- function(cg){ 39 | cg <- unlist(strsplit(cg, "_"))[1] 40 | if (endsWith(cg, '1') | endsWith(cg, '2')){ 41 | cg <- substr(cg,1,nchar(cg)-1) 42 | } 43 | return(cg) 44 | } 45 | 46 | get_file_name_without_tail <- function(fn){ 47 | fn_no_tail <- unlist(strsplit(fn, "[.]"))[1] 48 | return(fn_no_tail) 49 | } 50 | 51 | get_cell_type_name <- function(cg){ # jsut get rid of the 1 or 2 at the end of the cell group name 52 | if (endsWith(cg, '_1') | endsWith(cg, '_2')){ 53 | cg <- substr(cg,1,nchar(cg)-2) 54 | 
} 55 | 56 | if (endsWith(cg, '1') | endsWith(cg, '2')){ 57 | cg <- substr(cg,1,nchar(cg)-1) 58 | } 59 | return(cg) 60 | } 61 | ``` 62 | 63 | ```{r} 64 | fn <- './avg_gene_exp_per_state_per_ct.txt.gz' 65 | df <- read.table(fn, sep = '\t', header = TRUE, row.names = 1) 66 | df <- as.data.frame(t(df)) 67 | df$ct <- sapply(rownames(df), get_cell_type_name) 68 | # get the mean expression per cell type per state across the different replicates 69 | df <- df %>% group_by(ct) %>% summarise(across(everything(), mean)) 70 | ct_list <- df$ct 71 | df <- df %>% select(-c('ct')) 72 | df <- as.data.frame(t(df)) # transpose again to its original form 73 | colnames(df) <- ct_list 74 | # rearrange the rows such that states of the same group are put together 75 | state_annot_df <- read_state_annot_df(state_annot_fn) 76 | df <- df %>% slice(state_annot_df$state) 77 | rownames(df) <- state_annot_df$mneumonics 78 | # create the df for annotationof rows in our heatmap 79 | rownames(state_annot_df) <- state_annot_df$mneumonics 80 | state_annot_df <- state_annot_df %>% select(state_type) 81 | gap_row_indices <- calculate_gap_rows_among_state_groups(state_annot_df) 82 | # create the df for annotations of columns in our heatmap 83 | cell_group_df <- data.frame(ct = colnames(df)) 84 | cell_group_df$group <- sapply(cell_group_df$ct, get_cell_group_name) 85 | # reorder the groups of cell types such that tissues that are similar are closer to each other 86 | ordered_cg_list <- c('brain', 'cerebellum', 'cortex', 'mESC', 'mef', 'mef_male', 'placenta', 'boneMarrow', 'heart', 'intestine', 'kidney', 'liver', 'lung', 'spleen', 'testes', 'thymus', 'limb', 'olfactory') 87 | ordered_cg_df <- data.frame(ct = character(), group = character()) 88 | for (cg in ordered_cg_list){ 89 | this_cg_df <- cell_group_df %>% filter(group == cg) 90 | ordered_cg_df <- bind_rows(ordered_cg_df, this_cg_df) 91 | } 92 | rownames(ordered_cg_df) <- ordered_cg_df$ct 93 | ordered_cg_df <- ordered_cg_df %>% select(group) 94 | df 
<- df %>% select(rownames(ordered_cg_df)) # reorder the columns of df (df of average gene_exp) 95 | break_list <- seq (0.1, 2.8, by = 0.1) 96 | fn_no_tail <- get_file_name_without_tail(fn) 97 | save_fn <- paste0(fn_no_tail, ".png") 98 | pheatmap(df, breaks = break_list, color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(break_list)), annotation_row = state_annot_df, annotation_col = ordered_cg_df, annotation_colors = list(state_type = STATE_COLOR_MAPPING, group = CELL_TYPE_COLOR_MAPPING), gaps_row = gap_row_indices, fontsize =4.5, cluster_rows = FALSE, cluster_cols = FALSE, show_colnames = TRUE, filename = save_fn, cellwidth= 5) 99 | ``` -------------------------------------------------------------------------------- /helper.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import string 4 | import os 5 | import sys 6 | import time 7 | def make_dir(directory): 8 | try: 9 | os.makedirs(directory) 10 | except: 11 | print ( 'Folder' + directory + ' is already created') 12 | 13 | 14 | 15 | def check_file_exist(fn): 16 | if not os.path.isfile(fn): 17 | print ( "File: " + fn + " DOES NOT EXISTS") 18 | exit(1) 19 | return 20 | 21 | def check_dir_exist(fn): 22 | if not os.path.isdir(fn): 23 | print ( "Directory: " + fn + " DOES NOT EXISTS") 24 | exit(1) 25 | return 26 | 27 | def create_folder_for_file(fn): 28 | last_slash_index = fn.rfind('/') 29 | if last_slash_index != -1: # path contains folder 30 | make_dir(fn[:last_slash_index]) 31 | return 32 | 33 | def get_command_line_integer(arg): 34 | try: 35 | arg = int(arg) 36 | return arg 37 | except: 38 | print ( "Integer: " + str(arg) + " IS NOT VALID") 39 | exit(1) 40 | 41 | 42 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format 43 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t") 44 | # rename the org_enrichment_df so that it's easier to work with 
45 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"}) 46 | return enrichment_df 47 | 48 | def get_non_coding_enrichment_df (non_coding_enrichment_fn): 49 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t') 50 | if len(nc_enrichment_df.columns) != 3: 51 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.") 52 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py") 53 | exit(1) 54 | # Now, we know that the nc_enrichment_df has exactly 3 columns 55 | # change the column names 56 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"] 57 | return (nc_enrichment_df) 58 | -------------------------------------------------------------------------------- /learn_model/README.md: -------------------------------------------------------------------------------- 1 | ## NOTES 2 | 3 | This folder provides the code that was used to generate Additional File 2, tabs 'trainData_HistoneChip' and 'trainData_ATACDNase', which outline the metadata and download links of the datasets that we used to train the mouse full-stack model. Within this folder: 4 | - ```metadata_controltype_june_2021.tsv```: tsv file downloaded from the ENCODE portal for the ChIP-seq and ATAC-seq experiments, including control experiments and their files, that we use in this study. 5 | - ```extract_cell_mark_table.py```: code to produce Additional File 2, tabs 'trainData_HistoneChip' and 'trainData_ATACDNase' in the published paper. This code extracts the metadata for the experiments used in this paper and their corresponding control files. The control experiments are needed to binarize the input data so that it can be used as input to the ChromHMM LearnModel function (see Methods section of the paper). 
6 | ``` 7 | usage: extract_cell_mark_table.py [-h] [--ENCODE_meta_fn ENCODE_META_FN] 8 | --output_fn OUTPUT_FN 9 | 10 | This file aims at producing the supplementary file listing the input file 11 | metadata and download links. 12 | 13 | optional arguments: 14 | -h, --help show this help message and exit 15 | --ENCODE_meta_fn ENCODE_META_FN 16 | The file where we can get metadata of experiments 17 | from ENCODE 18 | --output_fn OUTPUT_FN 19 | The excel output file. Multiple tabs will be in 20 | this file. The output file is part of Additional 21 | File 2 in the published paper 22 | ``` 23 | - ```helper.py```: script containing helper functions for file managements. 24 | 25 | For the rest of the steps in training the model (binarize data, learn model), we used ChromHMM v.1.23 functions BinarizeData and LearnModel. We described the process in Methods section of the paper. If you need to obtain the binarized data we used for training the model, please contact Dr. Jason Ernst. 26 | 27 | ## LICENSE 28 | Copyright 2022 Ha Vu (havu73@ucla.edu) 29 | 30 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 31 | 32 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 33 | 34 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /learn_model/extract_cell_mark_table.py: -------------------------------------------------------------------------------- 1 | # Copyright 2022 Ha Vu (havu73@ucla.edu) 2 | 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | # The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
8 | 9 | import pandas as pd 10 | import numpy as np 11 | import helper 12 | import os 13 | import argparse 14 | import sys 15 | sys.path.append('../emission_and_metadata') # full path on Ha's system: /u/home/h/havu73/project-ernst/source/mm10_annotations/emission_and_metadata 16 | import prepare_metadata_files as meta 17 | ################## READING INPUT COMMAND LINE ARGS ################## 18 | parser = argparse.ArgumentParser(description = 'This file aims at producing the supplementary file listing the input file metadata and download links.') 19 | parser.add_argument('--ENCODE_meta_fn', type = str, required = False, default = './metadata_controltype_june_2021.tsv', help = 'The file where we can get metadata of experiments from ENCODE') 20 | parser.add_argument('--output_fn', type = str, required = True, help = 'The excel output file. Multiple tabs will be in this file. The output file is part of Additional File 2 in the published paper') 21 | args = parser.parse_args() 22 | helper.check_file_exist(args.ENCODE_meta_fn) 23 | helper.create_folder_for_file(args.output_fn) 24 | ######################################################################################### 25 | def read_filter_metadata(): 26 | meta_df = pd.read_csv(args.ENCODE_meta_fn, header = 0, index_col = None, sep = '\t', comment = '#', low_memory = False) 27 | # meta_df = meta_df[meta_df['File analysis title'].str.contains('ENCODE4', na = False)] # note we need to comment out this line because the read files for different control experiments are actually not part of ENCODE4, and are all blank on this column 28 | ALL_ASSAY_LIST = ['Control ChIP-seq', 'Histone ChIP-seq', 'DNase-seq', 'ATAC-seq'] 29 | meta_df = meta_df[meta_df['Assay'].isin(ALL_ASSAY_LIST)] # filter only lines whose assay is in ALL_ASSAY_LIST 30 | COLUMNS_TO_PARSE = ['File accession', 'Output type', 'Experiment accession', 'Assay', 'Biosample term name', 'Experiment target', 'File assembly', 'Run type', 'Derived from', 'Controlled by', 
'File analysis title'] 31 | meta_df = meta_df[COLUMNS_TO_PARSE] 32 | return meta_df 33 | 34 | def combine_with_organ_meta_df(file_df, organ_meta_df): 35 | ''' 36 | file_df can be from functions produce_cell_mark_table_from_chipBamDf or from get_openC_df 37 | ''' 38 | file_df = file_df.merge(organ_meta_df, how = 'left', left_on = 'Experiment accession', right_on = 'experimentID') 39 | file_df = file_df[['exp_name', 'Experiment target', 'Biosample term name', 'organ_group', 'cell_group', 'development_group', 'system_group', 'Experiment accession', 'download_link', 'File assembly', 'Run type', 'ctrl_fileCode', 'ctrl_download']] 40 | return file_df 41 | 42 | def check_similar_file_acc_code_with_jason_file(df, jason_fn): 43 | jason_df = pd.read_csv(jason_fn, header = None, sep = '\t') 44 | diff_list = np.setdiff1d(df['File accession'], jason_df[0]) 45 | assert len(diff_list) == 0, 'The expected and observed list of file accession code from jason file {} IS NOT EQUAL.'.format(jason_fn) 46 | return 47 | 48 | def get_control_df(meta_df): 49 | # first filter so that we only get the Control Chip-seq experiments 50 | control_df = meta_df[meta_df['Assay'] == 'Control ChIP-seq'] 51 | bam_control_df = control_df[(control_df['Output type'] == 'alignments') & (control_df['File analysis title'].str.contains('ENCODE4', na = False))] # these are control bam files 52 | jason_bamControl_fn = os.path.join(ENCODE_meta_fn, 'controlmeta_alignments.txt') 53 | check_similar_file_acc_code_with_jason_file(bam_control_df, jason_bamControl_fn) 54 | # assay is always Control ChIP-seq 55 | # Biosample term name corresponds to different cell/ tissue types 56 | # Experiment target is all blank 57 | # Run type is all blank 58 | # Controlled by is all blank 59 | # File analysis title all contains ENCODE4 60 | read_control_df = control_df[control_df['Output type'] == 'reads'] # these are read files for control samples 61 | jason_readControl_fn = os.path.join(ENCODE_meta_fn, 'controlmeta_reads.txt') 62 
| check_similar_file_acc_code_with_jason_file(read_control_df, jason_readControl_fn) 63 | return bam_control_df, read_control_df 64 | 65 | def get_chip_df(meta_df): 66 | bam_chip_df = meta_df[(meta_df['Assay'] == 'Histone ChIP-seq') & (meta_df['Output type'] == 'alignments') & (meta_df['File analysis title'].str.contains('ENCODE4', na = False))] # this will result in 1397 bam files belonging to 704 unique ChIP seq experiments 67 | jason_bamChip_fn = os.path.join(ENCODE_meta_fn, 'histonemeta_alignments.txt') 68 | check_similar_file_acc_code_with_jason_file(bam_chip_df, jason_bamChip_fn) 69 | read_chip_df = meta_df[(meta_df['Assay'] == 'Histone ChIP-seq') & (meta_df['Output type'] == 'reads')] 70 | jason_readChip_fn = os.path.join(ENCODE_meta_fn, 'histonemeta_reads.txt') 71 | check_similar_file_acc_code_with_jason_file(read_chip_df, jason_readChip_fn) 72 | return bam_chip_df, read_chip_df 73 | 74 | def get_openC_df(meta_df, organ_meta_df): 75 | ''' 76 | Extract bam files corresponding DNase and ATAC-seq output 77 | ''' 78 | bam_DNase_df = meta_df[(meta_df['Assay'] == 'DNase-seq') & (meta_df['Output type'] == 'alignments') & (meta_df['File analysis title'].str.contains('ENCODE4', na = False))] # based on code in ENCODE_meta_fn/getmeta.sh 79 | jason_dnaseBam_fn = os.path.join(ENCODE_meta_fn, 'dnasemeta_alignments.txt') 80 | check_similar_file_acc_code_with_jason_file(bam_DNase_df, jason_dnaseBam_fn) 81 | bam_ATAC_df = meta_df[(meta_df['Assay'] == 'ATAC-seq') & (meta_df['Output type'] == 'alignments') & (meta_df['File analysis title'].str.contains('ENCODE4', na = False))] # based on code in ENCODE_meta_fn/getmeta.sh 82 | jason_ATACBam_fn = os.path.join(ENCODE_meta_fn, 'atacmeta_alignments.txt') 83 | check_similar_file_acc_code_with_jason_file(bam_ATAC_df, jason_ATACBam_fn) 84 | openc_df = pd.concat([bam_DNase_df, bam_ATAC_df], ignore_index = True) 85 | openc_df.loc[:, 'Experiment target'] = openc_df['Assay'].apply(lambda x: x.split('-seq')[0]) 86 | openc_df = 
openc_df.sort_values(['Experiment target', 'Biosample term name', 'Experiment accession']) 87 | openc_df['download_link'] = openc_df['File accession'].apply(lambda x: 'https://www.encodeproject.org/files/{c}/@@download/{c}.bam'.format(c = x)) 88 | openc_df['exp_name'] = openc_df['Biosample term name'] + '_' + openc_df['Experiment target'] + '_' + openc_df['Experiment accession'] 89 | openc_df['ctrl_fileCode'] = 'uniform' 90 | openc_df['ctrl_download'] = '' 91 | openc_df = openc_df[['exp_name', 'Experiment target', 'Biosample term name', 'Experiment accession', 'download_link', 'File assembly', 'Run type', 'ctrl_fileCode', 'ctrl_download']] 92 | openc_df.reset_index(inplace = True, drop = True) 93 | openc_df = combine_with_organ_meta_df(openc_df, organ_meta_df) 94 | return openc_df 95 | 96 | 97 | def extract_file_code_one_row(entry, check_set): 98 | ''' 99 | For each entry of /files/ENCFF001JZL/, /files/ENCFF309GLL/, extract the file accession code 100 | ''' 101 | if pd.isnull(entry): 102 | return [] 103 | code_list = entry.split(', ') # from /files/ENCFF001JZL/, /files/ENCFF309GLL/ to ['/files/ENCFF001JZL/', '/files/ENCFF309GLL/'] 104 | code_list = list(map(lambda x: x[7:-1], code_list)) # to ['ENCFF001JZL', 'ENCFF309GLL'] 105 | if check_set is not None: 106 | results = [] 107 | for code in code_list: 108 | if code in check_set: 109 | results.append(code) 110 | return results 111 | return code_list 112 | 113 | def map_controlRead_to_controlBam(bam_control_df, read_control_df): 114 | # given the table showing the metadata of control bam and control read files, we will output a dictionary with keys: controlRead --> values list of controlBam that were derived from the control read key 115 | control_readToBam = {} 116 | set_conRead = set(read_control_df['File accession']) # a set of files that correspond to control read files 117 | bam_control_df.loc[:, 'read_file_list'] = bam_control_df['Derived from'].apply(lambda x: extract_file_code_one_row(x, set_conRead)) 118 | for 
index, row in bam_control_df.iterrows(): 119 | read_control_list = row['read_file_list'] 120 | for file in read_control_list: 121 | if not (file in control_readToBam): 122 | control_readToBam[file] = [] 123 | (control_readToBam[file]).append(row['File accession']) # add the control bam file accession code to the list of files derived from the control read file 124 | return control_readToBam 125 | 126 | def map_file_accession_to_fileCode_list(first_df, second_df, colname_to_extract): 127 | ''' 128 | first_df should have columns file accession code, and column Derived from or Controlled by (colname_to_extract). These columns should contain data of the files that the file in 'File accession' are derivied from or controlled by. 129 | second_df should have column 'File accession' that contains all the legal files to extract files of columns 'Derived from' and 'Controlled by' 130 | ''' 131 | set_code_list = set(second_df['File accession']) # set of chip read files 132 | first_df.loc[:, 'read_file_list'] = first_df[colname_to_extract].apply(lambda x: extract_file_code_one_row(x, set_code_list)) 133 | return dict(zip(first_df['File accession'], first_df['read_file_list'])) 134 | 135 | def map_bamChip_to_bamCon(chipBam_to_chipRead, chipRead_to_conRead, conRead_to_conBam): 136 | ''' 137 | mappings from chipBam to a list of controlBam, given the 3 exiting mappings 138 | ''' 139 | results = {} # keys: chipBam, values: list of conBam 140 | for chipBam in chipBam_to_chipRead: 141 | conBam_set = set([]) 142 | chipRead_list = chipBam_to_chipRead[chipBam] 143 | for chipRead in chipRead_list: 144 | conRead_list = chipRead_to_conRead[chipRead] 145 | for conRead in conRead_list: 146 | conBam_list = conRead_to_conBam[conRead] 147 | conBam_set.update(conBam_list) 148 | results[chipBam] = list(conBam_set) 149 | return results 150 | 151 | 152 | def match_conBam_to_chipBam(chipBam, chipBam_to_conBam): 153 | ''' 154 | chipBam is a file accession code 155 | ''' 156 | try: 157 | return 
chipBam_to_conBam[chipBam] 158 | except: 159 | return [] 160 | 161 | def produce_cell_mark_table_from_chipBamDf(bam_chip_df, chipBam_to_conBam, organ_meta_df): 162 | ''' 163 | chipBam_to_conBam: a dictionary of keys: chipBam, values: list of conBam associated with each chipBam 164 | ''' 165 | bam_chip_df = bam_chip_df.sort_values(['Experiment target', 'Biosample term name', 'Experiment accession']) 166 | bam_chip_df.loc[:, 'exp_name'] = bam_chip_df['Biosample term name'] + '_' + bam_chip_df['Experiment target'] + '_' + bam_chip_df['Experiment accession'] 167 | bam_chip_df['control_bam_list'] = bam_chip_df['File accession'].apply(lambda x: match_conBam_to_chipBam(x, chipBam_to_conBam)) 168 | result_df = pd.DataFrame(columns = ['exp_name', 'Experiment target', 'Biosample term name', 'Experiment accession', 'download_link', 'File assembly', 'Run type', 'ctrl_fileCode', 'ctrl_download']) 169 | for index, row in bam_chip_df.iterrows(): 170 | chip_download = 'https://www.encodeproject.org/files/{c}/@@download/{c}.bam'.format(c = row['File accession']) 171 | chip_report_row = pd.Series([row['exp_name'], row['Experiment target'], row['Biosample term name'], row['Experiment accession'], chip_download, row['File assembly'], row['Run type']], index = ['exp_name', 'Experiment target', 'Biosample term name', 'Experiment accession', 'download_link', 'File assembly', 'Run type']) 172 | control_bam_list = row['control_bam_list'] 173 | for ctrl_bam in control_bam_list: 174 | ctrl_download = 'https://www.encodeproject.org/files/{c}/@@download/{c}.bam'.format(c = ctrl_bam) 175 | ctrl_report_row = pd.Series([ctrl_bam, ctrl_download], index = ['ctrl_fileCode', 'ctrl_download']) 176 | output_row = pd.concat([chip_report_row, ctrl_report_row]) 177 | result_df.loc[result_df.shape[0], :] = output_row 178 | result_df = combine_with_organ_meta_df(result_df, organ_meta_df) 179 | return result_df 180 | 181 | 182 | def save_output(output_fn, chip_output_df, openc_df): 183 | writer = 
pd.ExcelWriter(output_fn, engine = 'xlsxwriter') 184 | chip_output_df.to_excel(writer, sheet_name = 'Histone ChIP-seq') 185 | openc_df.to_excel(writer, sheet_name = 'ATAC_DNase') 186 | writer.close() # ExcelWriter.save() was removed in pandas 2.0; close() writes the file and releases the handle 187 | return 188 | 189 | if __name__ =='__main__': 190 | organ_meta_df = meta.get_metadata_from_jason() # output is a df outlining the cell types of each experiment. columns: 'experimentID', 'organ_group', 'cell_group', 'development_group', 'system_group' 191 | meta_df = read_filter_metadata() 192 | bam_control_df, read_control_df = get_control_df(meta_df) 193 | bam_chip_df, read_chip_df = get_chip_df(meta_df) 194 | chipBam_to_chipRead = map_file_accession_to_fileCode_list(bam_chip_df, read_chip_df, 'Derived from') 195 | chipRead_to_conRead = map_file_accession_to_fileCode_list(read_chip_df, read_control_df, 'Controlled by') 196 | conRead_to_conBam = map_controlRead_to_controlBam(bam_control_df, read_control_df) 197 | chipBam_to_conBam = map_bamChip_to_bamCon(chipBam_to_chipRead, chipRead_to_conRead, conRead_to_conBam) 198 | chip_output_df = produce_cell_mark_table_from_chipBamDf(bam_chip_df, chipBam_to_conBam, organ_meta_df) 199 | openc_df = get_openC_df(meta_df, organ_meta_df) 200 | save_output(args.output_fn, chip_output_df, openc_df) 201 | -------------------------------------------------------------------------------- /learn_model/helper.py: -------------------------------------------------------------------------------- 1 | # Copyright 2022 Ha Vu (havu73@ucla.edu) 2 | 3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | # The above copyright notice and this permission notice shall 
be included in all copies or substantial portions of the Software. 6 | 7 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 8 | 9 | 10 | import numpy as np 11 | import pandas as pd 12 | import string 13 | import os 14 | import sys 15 | import time 16 | def make_dir(directory): 17 | try: 18 | os.makedirs(directory) 19 | except: 20 | print ( 'Folder' + directory + ' is already created') 21 | 22 | 23 | 24 | def check_file_exist(fn): 25 | if not os.path.isfile(fn): 26 | print ( "File: " + fn + " DOES NOT EXISTS") 27 | exit(1) 28 | return 29 | 30 | def check_dir_exist(fn): 31 | if not os.path.isdir(fn): 32 | print ( "Directory: " + fn + " DOES NOT EXISTS") 33 | exit(1) 34 | return 35 | 36 | def create_folder_for_file(fn): 37 | last_slash_index = fn.rfind('/') 38 | if last_slash_index != -1: # path contains folder 39 | make_dir(fn[:last_slash_index]) 40 | return 41 | 42 | def get_command_line_integer(arg): 43 | try: 44 | arg = int(arg) 45 | return arg 46 | except: 47 | print ( "Integer: " + str(arg) + " IS NOT VALID") 48 | exit(1) 49 | 50 | 51 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format 52 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t") 53 | # rename the org_enrichment_df so that it's easier to work with 54 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"}) 55 | return enrichment_df 56 | 57 | def get_non_coding_enrichment_df (non_coding_enrichment_fn): 58 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = 
'\t') 59 | if len(nc_enrichment_df.columns) != 3: 60 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.") 61 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py") 62 | exit(1) 63 | # Now, we know that the nc_enrichment_df has exactly 3 columns 64 | # change the column names 65 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"] 66 | return (nc_enrichment_df) 67 | -------------------------------------------------------------------------------- /neighborhood/README.md: -------------------------------------------------------------------------------- 1 | This folder contains code to obtain neighborhood enrichment analysis results of mouse full-stack states, as presented in Fig. 2D-E of the manuscript. 2 | - Files ```genome_100_RefSeqTES_neighborhood.txt``` and ```genome_100_RefSeqTSS_neighborhood.txt``` contain the output of ChromHMM NeighborhoodEnrichment for the 100 mouse full-stack states around annotated RefSeq TES and TSS, respectively. The BED files of annotated TSS and TES are both provided with the ChromHMM software. 3 | - You will need to install some R packages, if you do not already have them, in order to use the next two code files. In your R terminal, use the following commands: 4 | ``` 5 | install.packages("tidyverse") 6 | install.packages("tidyr") 7 | install.packages("dplyr") 8 | install.packages('ggplot2') 9 | install.packages("pheatmap") 10 | install.packages("reshape2") 11 | ``` 12 | - File ```draw_2DLine_neighborhood_enrichment.R``` contains code to plot Fig. 2D-E. 
You can use this command: 13 | ``` 14 | Rscript draw_2DLine_neighborhood_enrichment.R genome_100_RefSeqTSS_neighborhood.txt genome_100_RefSeqTSS_neighborhood_linePlot.png genome_100_RefSeqTES_neighborhood.txt genome_100_RefSeqTES_neighborhood_linePlot.png ../state_annotation_processed.tsv 15 | ``` 16 | 17 | - File ```draw_neighborhood_enrichment.R``` contains code to plot Fig. S2 in the manuscript. You can use these two commands: 18 | ``` 19 | Rscript draw_neighborhood_enrichment.R genome_100_RefSeqTSS_neighborhood.txt genome_100_RefSeqTSS_neighborhood_heatMap.png ../state_annotation_processed.tsv 20 | 21 | Rscript draw_neighborhood_enrichment.R genome_100_RefSeqTES_neighborhood.txt genome_100_RefSeqTES_neighborhood_heatMap.png ../state_annotation_processed.tsv 22 | ``` 23 | ## LICENSE 24 | Copyright 2022 Ha Vu (havu73@ucla.edu) 25 | 26 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 27 | 28 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 29 | 30 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
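
For readers who want to inspect the two neighborhood enrichment files in this folder without R, the following is a minimal Python sketch of the reshaping step that the plotting scripts perform: turning the wide table (one row per state, one column per offset from the anchor) into long `(state, offset, enrichment)` records. It assumes the files are tab-separated, as implied by the R scripts reading them with `read_tsv`. The function name `read_neighborhood_enrichment` and the three-column `SAMPLE` excerpt are illustrative only and are not part of this repository.

```python
import csv
import io


def read_neighborhood_enrichment(handle):
    """Return a list of (state, offset_bp, enrichment) tuples.

    `handle` is any open text stream holding a tab-separated table whose
    first column is the state index and whose remaining column headers
    are integer offsets (in bp) relative to the anchor position.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    offsets = [int(col) for col in header[1:]]
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        state = "S" + row[0]  # same "S<index>" naming as the R scripts
        for offset, value in zip(offsets, row[1:]):
            records.append((state, offset, float(value)))
    return records


# Abbreviated excerpt (three offsets, two states) of the TES table,
# written with explicit tabs; hypothetical stand-in for the real file.
SAMPLE = (
    "State (Emission order)\t-200\t0\t200\n"
    "1\t0.70906\t0.81518\t0.73321\n"
    "2\t0.96530\t0.77055\t0.87219\n"
)

records = read_neighborhood_enrichment(io.StringIO(SAMPLE))
for record in records:
    print(record)
```

To run it on a real file, one could pass `open("genome_100_RefSeqTES_neighborhood.txt", newline="")` instead of the in-memory sample.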
-------------------------------------------------------------------------------- /neighborhood/draw_2DLine_neighborhood_enrichment.R: -------------------------------------------------------------------------------- 1 | library("tidyverse") 2 | library("tidyr") 3 | library("dplyr") 4 | library('ggplot2') 5 | library(pheatmap) 6 | library(reshape2) 7 | STATE_COLOR_DICT = c('quescient' = "#ffffff", "HET" = '#b19cd9', 'acetylations' = '#fffacd', 'enhancers'= '#FFA500', 'ct_spec enhancers' = '#FFA500', 'transcription' = '#006400', 'weak transcription' = '#228B22', 'weak enhancers' = '#ffff00', 'transcribed and enhancer' = '#ADFF2F', 'exon' = '#3cb371', 'promoters' = '#ff4500', 'TSS' = '#FF0000', 'weak promoters' = '#800080', 'polycomb repressed' = '#C0C0C0', 'others' = '#fff5ee', 'znf' = '#7fffd4', 'DNase' = '#fff44f') 8 | get_rid_of_bedGZ_context_name <- function(enrichment_name){ 9 | if (endsWith(enrichment_name, '.bed.gz')){ 10 | result <- substring(enrichment_name, 1, nchar(enrichment_name) - 7) 11 | return(result) 12 | } 13 | return(enrichment_name) 14 | } 15 | 16 | get_overlap_enrichment_df <- function(enrichment_fn){ 17 | if (endsWith(enrichment_fn, '.xlsx') | endsWith(enrichment_fn, '.xls')) { 18 | print("READING AN EXCEL FILE") 19 | enrichment_df <- as.data.frame(read_excel(enrichment_fn, sheet = 1, col_names = TRUE)) 20 | } # if this is a excel file 21 | else{ 22 | print ("READING A TEXT FILE") 23 | enrichment_df <- as.data.frame(read_tsv(enrichment_fn, col_names = TRUE)) 24 | } 25 | tryCatch({ 26 | enrichment_df <- enrichment_df %>% rename("state" = "state..Emission.order.", "percent_in_genome" = "Genome..", "state" = "state (Emission order)", "state (Emission order)" = "state") # rename("percent_in_genome" = "percent_in_genome") 27 | }, error = function(e) {message("Nothing bad happended")}) 28 | score_colnames <- colnames(enrichment_df)[2:ncol(enrichment_df)] 29 | score_colnames <- sapply(score_colnames, get_rid_of_bedGZ_context_name) 30 | score_colnames <- 
c('state', score_colnames) 31 | colnames(enrichment_df) <- score_colnames 32 | num_state <- nrow(enrichment_df) 33 | tryCatch({ 34 | enrichment_df <- enrichment_df %>% select(-"percent_in_genome") 35 | }, error = function(e) {message("There is not a percent_in_genome column")}) 36 | enrichment_df <- as.data.frame(t(enrichment_df)) 37 | colnames(enrichment_df) <- enrichment_df[1,] # first row is also header 38 | enrichment_df <- enrichment_df[-1,] # get rid of the first row 39 | enrichment_df$coord <- as.integer(rownames(enrichment_df)) 40 | colnames(enrichment_df)[-length(colnames(enrichment_df))] <- sapply(colnames(enrichment_df)[-length(colnames(enrichment_df))], function (x){paste0('S',x)}) # convert columns from state_index to "S". exclude that last column because that's coord 41 | enrichment_df <- melt(enrichment_df, id = c('coord')) 42 | colnames(enrichment_df) <- c('coord', 'state', 'enrichment') 43 | enrichment_df$state <- as.character(enrichment_df$state) 44 | return(enrichment_df) 45 | } 46 | 47 | get_state_type <- function(state_annot_fn){ 48 | state_annot_df <- as.data.frame(read.csv(state_annot_fn, header = T, sep = '\t')) ### HAHAHA this line may need to change depending on what state annotation we are using 49 | state_annot_df$state <- sapply(as.character(state_annot_df$state), function (x){paste0('S', x)}) 50 | return(state_annot_df) 51 | } 52 | 53 | draw_2DLine_neighborhood_enrichment <- function (enrichment_fn, save_fig_fn, state_annot_fn) { 54 | # draw enrichment plot where each column (each enrichment context) is on its own color scale 55 | enrichment_df <- get_overlap_enrichment_df(enrichment_fn) # columns: enrichment contex, rows: states 56 | print(head(enrichment_df)) 57 | annot_df <- get_state_type(state_annot_fn) 58 | print(head(annot_df)) 59 | enrichment_df <- enrichment_df %>% left_join(annot_df, by = 'state') 60 | color_dict <- enrichment_df$color 61 | names(color_dict) <- enrichment_df$state 62 | #color_dict has keys: states, values: 
the color corresponding to that state 63 | ggplot(data = enrichment_df, aes(x = coord, y = enrichment, group = state)) + 64 | geom_line(aes(color = state)) + 65 | scale_colour_manual(values = color_dict) + 66 | theme_bw() + 67 | theme(legend.position = 'none') 68 | ggsave(save_fig_fn) 69 | print(paste('DONE! Figure is saved at:', save_fig_fn)) 70 | } 71 | 72 | args = commandArgs(trailingOnly=TRUE) 73 | if (length(args) < 5) 74 | { 75 | stop("wrong command line argument format", call.=FALSE) 76 | } 77 | tss_fn <- args[1] # output of ChromHMM neighborhoodEnrichment 78 | save_tss_fn <- args[2] # where the figures should be stored 79 | tes_fn <- args[3] # output of ChromHMM neighborhoodEnrichment 80 | save_tes_fn <- args[4] # where the figures should be stored 81 | state_annot_fn <- args[5] # where the state annotation are stored 82 | print ('Done getting command line arguments') 83 | # state_annot_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/state_annotations.csv' 84 | # tss_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/neighborhood_enrichment/hg19/neighborhood_enrichment_tss_hg19.txt' 85 | # save_tss_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/neighborhood_enrichment/hg19/2DLine_neighborhood_enrichment_tss_hg19.png' 86 | draw_2DLine_neighborhood_enrichment(tss_fn, save_tss_fn, state_annot_fn) 87 | # tes_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/neighborhood_enrichment/hg19/neighborhood_enrichment_tes_hg19.txt' 88 | # save_tes_fn <- '/Users/vuthaiha/Desktop/window_hoff/ROADMAP_aligned_reads/chromHMM_model/model_100_state/neighborhood_enrichment/hg19/2DLine_neighborhood_enrichment_tes_hg19.png' 89 | draw_2DLine_neighborhood_enrichment(tes_fn, save_tes_fn, state_annot_fn) -------------------------------------------------------------------------------- 
/neighborhood/draw_neighborhood_enrichment.R: -------------------------------------------------------------------------------- 1 | library("tidyverse") 2 | library("tidyr") 3 | library("dplyr") 4 | library('ggplot2') 5 | library(pheatmap) 6 | # STATE_COLOR_DICT = c('quescient' = "#ffffff", "HET" = '#b19cd9', 'acetylations' = '#fffacd', 'enhancers'= '#FFA500', 'ct_spec enhancers' = '#FFA500', 'transcription' = '#006400', 'weak transcription' = '#228B22', 'weak enhancers' = '#ffff00', 'transcribed and enhancer' = '#ADFF2F', 'exon' = '#3cb371', 'promoters' = '#ff4500', 'TSS' = '#FF0000', 'weak promoters' = '#800080', 'bivalent promoters' = '#7030A0', 'polycomb repressed' = '#C0C0C0', 'others' = '#fff5ee', 'znf' = '#7fffd4', 'DNase' = '#fff44f') 7 | get_rid_of_bedGZ_context_name <- function(enrichment_name){ 8 | if (endsWith(enrichment_name, '.bed.gz')){ 9 | result <- substring(enrichment_name, 1, nchar(enrichment_name) - 7) 10 | return(result) 11 | } 12 | return(enrichment_name) 13 | } 14 | 15 | get_overlap_enrichment_df <- function(enrichment_fn){ 16 | if (endsWith(enrichment_fn, '.xlsx') | endsWith(enrichment_fn, '.xls')) { 17 | print("READING AN EXCEL FILE") 18 | enrichment_df <- as.data.frame(read_excel(enrichment_fn, sheet = 1, col_names = TRUE)) 19 | } # if this is a excel file 20 | else{ 21 | print ("READING A TEXT FILE") 22 | enrichment_df <- as.data.frame(read_tsv(enrichment_fn, col_names = TRUE)) 23 | } 24 | tryCatch({ 25 | enrichment_df <- enrichment_df %>% rename("state" = "state..Emission.order.", "percent_in_genome" = "Genome..", "state" = "state (Emission order)", "state (Emission order)" = "state") # rename("percent_in_genome" = "percent_in_genome") 26 | }, error = function(e) {message("Nothing bad happended")}) 27 | score_colnames <- colnames(enrichment_df)[2:ncol(enrichment_df)] 28 | score_colnames <- sapply(score_colnames, get_rid_of_bedGZ_context_name) 29 | score_colnames <- c('state', score_colnames) 30 | colnames(enrichment_df) <- score_colnames 
31 | num_state <- nrow(enrichment_df) 32 | tryCatch({ 33 | enrichment_df <- enrichment_df %>% select(-"percent_in_genome") 34 | }, error = function(e) {message("There is not a percent_in_genome column. Nothing bad happend")}) 35 | return(enrichment_df) 36 | } 37 | 38 | read_state_annot_df <- function(states_to_include, state_annot_fn){ 39 | annot_df <- as.data.frame(read.csv(state_annot_fn, header = TRUE, stringsAsFactors = FALSE, sep = "\t")) 40 | tryCatch({ 41 | annot_df <- annot_df %>% rename('group' = 'Group') 42 | }, error = function (e) {message("tried to change column names in annot_df and nothing worth worrying happened")}) 43 | annot_df <- annot_df %>% arrange(state_order_by_group) # order rows based on the index that we get from state_order_by_group column 44 | annot_df <- annot_df %>% filter(state %in% states_to_include) # in case some states need to be excluded 45 | STATE_COLOR_DICT <- as.character(annot_df$color) 46 | names(STATE_COLOR_DICT) <- as.character(annot_df$group) 47 | STATE_COLOR_DICT <- STATE_COLOR_DICT[!duplicated(names(STATE_COLOR_DICT))] 48 | return_obj <- list(annot_df = annot_df, STATE_COLOR_DICT = STATE_COLOR_DICT) 49 | return(return_obj) 50 | } 51 | 52 | calculate_gap_rows_among_state_groups <- function(state_annot_df){ 53 | state_group_ordered_by_appearance <- unique(state_annot_df$group) # list of different state groups, ordered by how they appear in the heatmap from top to bottom 54 | count_df <- state_annot_df %>% count(group) 55 | count_df <- count_df[match(state_group_ordered_by_appearance, count_df$group),] # order the rows such that the group are ordered based on state_group_ordered_by_appearance 56 | results <- cumsum(count_df$n) # cumulative sum of the count for groups of states, which will be used to generate the gaps between rows of the heatmaps 57 | return(results) 58 | } 59 | 60 | draw_enrichment_plot_no_color_scale <- function (enrichment_fn, save_fig_fn, states_to_include, state_annot_fn) { 61 | # draw enrichment plot 
where each column (each enrichment context) is on its own color scale 62 | annot_obj <- read_state_annot_df(states_to_include, state_annot_fn) # state: one-based state indices, arranged based on the groups of states, long annotations, short annotations, group, color, itemRgb, state_ordered_by_group: 0 --> 99 already, mneumonics in the format 63 | annot_df <- annot_obj$annot_df 64 | STATE_COLOR_DICT <- annot_obj$STATE_COLOR_DICT 65 | enrichment_df <- get_overlap_enrichment_df(enrichment_fn) # columns: enrichment contex, rows: states 66 | enrichment_colnames <- colnames(enrichment_df)[-1] # get rid of the first item in enrichment_colnames because we want to get rid of the 'state' 67 | enrichment_df <- as.data.frame(enrichment_df[, -1]) # get rid of the first column (states) so that the enrichment data is just the enrichment values. This is nescessary to that we can get input for pheatmap function later 68 | enrichment_colnames <- as.character(seq(-2000, 2000, 200)) 69 | enrichment_df <- enrichment_df %>% select(enrichment_colnames) 70 | # rearranged_colnames <- paste("consHMM_state", seq(1, 100) , sep = "") 71 | enrichment_df <- enrichment_df %>% slice(annot_df$state) # rearrange the states based on the ordered based on the order from annot_df 72 | colnames(enrichment_df) <- enrichment_colnames 73 | rownames(enrichment_df) <- annot_df$mneumonics # row names and columns names are the names of x and y axis tick labels in the figure 74 | display_num_df <- round(enrichment_df, digits = 1) 75 | # calculate the indices of rows where we want to create gaps among groups of states 76 | gap_row_indices <- calculate_gap_rows_among_state_groups(annot_df) 77 | # prepare the annot_df for plotting as the row_annotation in the pheatmap function 78 | rownames(annot_df) <- annot_df$mneumonics 79 | annot_df <- annot_df %>% select(c('group')) ### HAHAHA this can be group or state_group depending on the state annot file that we use 80 | #pheatmap(enrichment_df, cluster_rows = FALSE, 
cluster_cols = TRUE, fontsize = 4.5, scale = 'none', display_numbers = display_num_df, number_format = '%.1f', color = colorRampPalette(c('white', 'red'))(300), filename = save_fig_fn, cellwidth = 10, legend = FALSE) 81 | pheatmap(enrichment_df, cluster_rows = FALSE, cluster_cols = FALSE,annotation_row = annot_df, gaps_row = gap_row_indices, annotation_colors = list(group = STATE_COLOR_DICT), fontsize = 8, scale = 'none', color = colorRampPalette(c('white', 'blue'))(300), filename = save_fig_fn, cellwidth = 9, cellheight = 8, legend = TRUE) 82 | print(paste('DONE! Figure is saved at:', save_fig_fn)) 83 | } 84 | args = commandArgs(trailingOnly=TRUE) 85 | if (length(args) < 3) 86 | { 87 | stop("wrong command line argument format", call.=FALSE) 88 | } 89 | neighborhood_enrichment_fn <- args[1] # output of ChromHMM NeighborhoodEnrichment, also input to this program 90 | save_fn <- args[2] # where the figures should be stored 91 | state_annot_fn <- args[3] # data about the state annotations 92 | states_to_include <- seq(1, 100) # default: include all 100 states when no state indices are given 93 | if (length(args) >= 4){ 94 | states_to_include <- as.integer(args[4:length(args)]) # optional state indices; args[4:length(args)] is only valid when at least 4 arguments exist 95 | } 96 | print(length(states_to_include)) 97 | print (states_to_include) 98 | draw_enrichment_plot_no_color_scale(neighborhood_enrichment_fn, save_fn, states_to_include, state_annot_fn) -------------------------------------------------------------------------------- /neighborhood/genome_100_RefSeqTES_neighborhood.txt: -------------------------------------------------------------------------------- 1 | State (Emission order) -2000 -1800 -1600 -1400 -1200 -1000 -800 -600 -400 -200 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2 | 1 0.69942 0.73801 0.75730 0.73318 0.64636 0.71389 0.82001 0.77177 0.68012 0.70906 0.81518 0.73321 0.87310 0.79592 0.85380 1.00816 1.06605 1.04675 1.07570 1.00816 1.10464 3 | 2 0.96530 1.02457 0.99917 1.00764 1.06691 0.95683 0.95683 1.19392 1.19392 0.96530 0.77055 0.87219 0.88913 0.99921 0.99074 1.33792 1.24478 1.42260 1.25325 1.35486 
1.49035 4 | 3 3.44937 3.56774 3.75373 3.96509 3.88055 4.05809 4.05809 4.12572 4.06654 3.65228 1.92759 2.63786 2.58713 2.56177 3.01832 2.95914 3.08596 3.28041 3.14514 3.21278 3.43260 5 | 4 1.87602 1.71579 1.89938 1.89938 1.88937 1.91274 2.04626 2.10301 2.13639 1.95613 1.25847 1.62906 1.75257 1.87943 1.82268 1.85272 1.84271 1.74924 1.79263 1.89278 1.86607 6 | 5 0.63535 0.64349 0.80640 0.78197 0.72495 0.57833 0.74939 0.80640 0.64349 0.82270 0.59462 0.72498 0.70869 0.76571 0.81458 0.82273 0.81458 0.95306 0.82273 0.80644 1.05081 7 | 6 0.48532 0.66180 0.83828 0.61768 0.66180 0.75004 0.92651 0.82357 0.85298 0.70592 0.67650 0.82360 1.02950 1.19128 1.05892 1.01479 1.14716 0.86772 1.22069 1.38247 1.27952 8 | 7 0.82627 0.85868 0.82627 0.89108 0.77767 1.21511 1.03689 1.23131 1.21511 1.08550 1.05309 1.31237 1.40958 1.45819 1.65261 1.37718 1.45819 1.39338 1.57160 1.45819 1.60400 9 | 8 1.82420 1.64178 1.51148 1.51148 1.95450 1.98056 1.95450 1.92844 2.26722 2.26722 1.79814 2.37155 3.07520 3.07520 3.04914 3.07520 2.65823 2.39762 2.39762 2.39762 2.37155 10 | 9 0.87475 0.83374 0.83374 1.01143 0.94309 0.87475 1.06610 0.97042 0.86108 0.75173 0.87475 1.21649 1.23016 1.03880 1.21649 1.23016 1.29850 1.36684 1.53086 1.42152 1.33951 11 | 10 1.37969 1.17064 0.96160 0.82224 1.00341 1.03128 1.10096 1.04522 1.08703 1.26820 1.05915 1.24037 1.18463 1.26825 1.36580 1.65848 1.53305 1.72816 1.72816 1.99296 2.28563 12 | 11 1.08855 1.05069 1.18320 1.06015 1.03175 1.00336 1.02229 1.01282 0.88977 1.02229 0.87084 1.01286 1.22111 1.06966 1.23058 1.39150 1.48616 1.55242 1.62815 1.51456 1.54296 13 | 12 0.98268 1.04493 1.00936 1.00936 0.91598 0.91598 0.90264 0.88930 0.95155 0.96934 1.20501 1.10278 1.19616 1.26731 1.24062 1.28954 1.31177 1.35624 1.50743 1.59191 1.42739 14 | 13 0.72184 0.73075 0.80204 0.71292 0.63272 0.76639 0.70401 0.89116 0.77531 0.90898 0.73966 0.96249 0.82881 0.90901 0.93575 0.94466 1.10508 1.24767 1.21202 1.22984 1.36352 15 | 14 0.62781 0.52891 0.54181 0.50741 0.63641 0.58481 0.47301 
0.51601 0.63211 0.69231 0.70522 0.60204 0.58053 0.77835 0.73534 0.72674 0.77405 0.88585 0.91165 0.93746 0.86435 16 | 15 0.32910 0.38994 0.35814 0.38026 0.37335 0.36229 0.40100 0.39686 0.39824 0.43834 0.53375 0.43697 0.43421 0.42591 0.39687 0.45218 0.49505 0.49782 0.53930 0.51303 0.53654 17 | 16 0.33113 0.50644 0.36035 0.38957 0.41879 0.41879 0.37009 0.37983 0.40905 0.43826 0.61357 0.48698 0.46750 0.42854 0.44802 0.52594 0.48698 0.61359 0.59412 0.73047 0.89604 18 | 17 0.83523 0.85160 0.70421 0.75334 0.85160 0.73696 0.58957 0.76972 0.80247 0.68783 0.62233 0.90077 0.75337 0.78613 0.54046 0.76975 0.83526 0.86801 0.73699 0.85164 0.98266 19 | 18 0.58457 0.54206 0.52080 0.44640 0.34011 0.49954 0.41451 0.37200 0.31886 0.43577 0.57394 0.46767 0.39327 0.51019 0.45704 0.58459 0.46767 0.34013 0.42516 0.47830 0.54208 20 | 19 0.54590 0.48925 0.48410 0.47380 0.47380 0.36050 0.43260 0.39655 0.41715 0.50470 0.54590 0.46866 0.51502 0.52017 0.49956 0.54077 0.56137 0.61287 0.53047 0.61287 0.57682 21 | 20 1.67156 1.66823 1.61174 1.65827 1.70645 1.63168 1.63999 1.75962 1.97397 2.65190 2.96926 1.62510 1.52540 1.58854 1.64504 1.59685 1.65168 1.74307 1.61014 1.56195 1.53038 22 | 21 2.73577 2.60778 2.73577 2.87576 2.71177 2.96775 2.77977 2.98775 3.01575 3.93567 5.01158 2.35989 2.42389 2.47589 2.40789 2.44789 2.47589 2.24790 2.19590 2.23190 2.09990 23 | 22 3.25087 3.25809 3.42045 3.33025 3.46014 3.30500 3.47457 3.19675 3.27974 3.53591 4.12403 3.35925 3.11750 3.16079 2.85770 2.85770 2.60874 2.57266 2.57987 2.47884 2.27317 24 | 23 3.93763 4.02958 3.92164 3.90165 3.90165 3.81771 3.66580 3.67379 3.33799 3.40995 3.84969 3.53801 3.12624 2.97033 2.83441 2.57855 2.37866 2.05884 1.79899 1.85895 1.89893 25 | 24 5.66168 5.69571 5.60213 5.63616 5.18101 5.00235 5.02788 4.75989 4.33878 3.84960 3.46251 4.74731 4.29640 3.79019 3.27122 2.79054 2.52254 2.47575 2.35664 2.21201 2.04186 26 | 25 5.64941 5.65560 5.36478 5.46378 5.53804 5.64941 5.41428 5.83505 5.42047 5.05539 5.03683 6.02710 5.92190 5.53825 5.47637 
5.24741 5.33405 4.65337 4.54817 3.86130 4.18927 27 | 26 11.37016 11.27793 11.84811 11.53786 12.04935 12.36799 12.67823 12.22544 12.88786 10.38072 6.45651 10.41467 9.91993 9.02269 8.78790 8.19253 7.37077 7.16113 6.72509 6.45676 5.77754 28 | 27 5.07549 5.22883 5.10616 4.93749 5.19816 5.50484 5.30550 5.29017 5.32083 4.69215 5.25950 6.13376 5.84241 5.33637 5.15236 4.49298 4.43164 4.06362 4.23230 3.94094 3.92561 29 | 28 2.02635 2.20977 2.10496 1.97395 2.28838 2.11370 2.13116 2.00015 2.14863 2.05256 2.59408 2.45443 2.41949 2.72520 2.58545 2.48937 2.26227 2.47190 2.34088 2.07884 2.00023 30 | 29 2.01174 2.04940 2.07451 2.16238 2.09334 2.05568 2.09648 2.19063 2.12158 2.08706 2.29106 2.40727 2.42297 2.38844 2.33509 2.36020 2.33823 2.22838 2.17188 2.15305 2.17188 31 | 30 2.97326 2.91120 3.07481 2.92248 3.09738 3.16508 3.08045 2.99583 2.96762 2.87170 2.73066 3.28933 3.31190 3.56579 3.65042 3.45859 3.50373 3.41345 3.52065 3.44731 3.24983 32 | 31 5.67447 5.98343 5.71567 6.23059 6.04522 6.71462 6.61164 6.50865 6.60134 5.71567 4.07821 6.03515 6.05575 5.89097 6.18964 6.16904 6.03515 5.78798 5.50991 5.65409 5.17005 33 | 32 11.17219 11.88620 12.26421 13.18822 14.84725 15.68726 16.73728 17.59829 17.07329 14.63724 9.80716 14.53280 16.31790 15.26784 14.00777 14.04978 12.51669 11.84465 10.85760 9.95455 10.16456 34 | 33 3.55846 3.63798 3.87653 3.93617 3.97593 3.81689 4.25425 4.19461 3.93617 3.14099 2.96207 3.55860 3.28027 2.92242 3.06158 2.74350 3.10135 3.28027 3.16099 2.68386 2.32601 35 | 34 0.54532 0.52513 0.51503 0.62611 0.62611 0.60592 0.58572 0.50493 0.44434 0.55542 0.49483 0.55544 0.72713 0.59584 0.68673 0.71703 0.74733 0.83822 0.75742 0.62614 0.69683 36 | 35 0.34983 0.34983 0.35557 0.37277 0.36704 0.45306 0.37851 0.40145 0.44733 0.50468 0.48174 0.46455 0.36705 0.46455 0.41293 0.47602 0.51043 0.48749 0.50470 0.54484 0.44735 37 | 36 0.41609 0.44810 0.42676 0.40542 0.35208 0.39476 0.44810 0.39476 0.41609 0.43743 0.39476 0.52280 0.59749 0.46946 0.51214 0.62950 0.65084 0.48013 0.46946 
0.54414 0.52280 38 | 37 0.81697 0.70297 0.74097 0.66498 0.70297 0.56998 0.77897 0.93097 0.83597 0.74097 0.77897 0.83600 0.70300 0.79800 0.70300 0.81700 0.72200 0.83600 0.83600 0.89300 0.77900 39 | 38 1.92982 2.02028 1.99012 1.89966 1.80920 1.59813 1.59813 1.62828 1.68859 1.62828 1.44736 1.32680 1.56804 2.17113 2.38221 1.71881 1.80927 1.96005 2.14097 1.86958 2.20128 40 | 39 1.68483 1.64820 1.86796 2.08772 2.60049 2.41736 2.34411 2.27085 2.16097 1.79471 1.46507 1.50175 1.39187 1.13547 1.35524 1.39187 1.24536 1.24536 1.53838 1.53838 1.61164 41 | 40 3.80158 3.77607 3.31681 3.18925 3.16373 3.41887 3.08719 2.70448 2.24523 1.96457 1.45430 1.42884 1.68399 1.88811 1.91362 2.04120 1.99017 2.14326 2.34737 2.32186 2.37289 42 | 41 4.79146 4.66860 4.48431 4.11574 4.54574 4.07479 3.68574 3.39907 3.56288 3.89050 3.91098 3.89065 3.72683 4.32067 4.64830 4.87355 5.40596 4.93498 4.68926 5.30357 5.46739 43 | 42 4.49819 4.19181 4.08040 3.94114 3.63476 3.39801 3.24482 2.54851 2.53458 2.14465 2.21428 2.21436 2.85500 3.30066 3.42600 3.48171 3.30066 3.32851 3.45385 3.17532 3.45385 44 | 43 3.76010 3.70787 2.97674 2.76785 2.66340 1.82782 1.88005 3.81232 2.66340 1.41004 2.14117 3.34244 3.65579 3.49911 3.18576 2.87241 2.87241 3.18576 3.70802 3.91692 3.65579 45 | 44 3.41011 3.38216 3.52192 3.57782 2.85108 2.65542 3.10264 3.71758 4.22071 3.60577 3.74553 4.47245 4.47245 3.77363 3.82954 4.52836 4.47245 4.19292 4.22087 4.91970 3.63387 46 | 45 2.21007 2.12614 2.48982 2.01424 2.01424 2.21007 2.21007 2.60172 1.79043 2.18209 2.15412 1.67860 1.37085 1.56669 1.39883 1.34288 1.76253 1.79050 1.42681 1.48276 1.51074 47 | 46 3.09956 3.26863 3.01503 2.87414 2.19787 2.22605 2.05698 2.33876 2.31058 2.08516 1.69067 1.94435 1.85981 1.40895 1.57802 2.19796 2.67700 3.69144 3.88869 3.57872 3.49419 48 | 47 2.83055 2.57323 2.20154 2.63041 2.11576 1.94421 2.05858 2.28731 2.20154 2.48745 2.54463 2.43036 2.17303 1.85851 1.74414 1.31525 1.88710 1.94429 2.40177 2.57332 3.14517 49 | 48 11.32196 12.09980 13.05050 13.13693 
14.34691 16.50759 17.45829 16.68045 15.55689 16.24831 12.87765 14.77962 13.74245 11.23597 10.97667 10.71738 10.28523 10.71738 9.68022 8.98877 9.59379 50 | 49 2.10794 1.64682 1.64682 1.88836 1.86640 1.73466 1.69074 2.02010 2.17381 2.08598 1.80053 2.10802 2.15193 2.34956 2.87656 2.63502 2.83265 2.89852 3.05223 3.20594 2.96440 51 | 50 1.33972 1.15026 1.17733 1.05554 1.00141 1.13673 1.38032 1.24499 1.42092 1.19086 0.90668 1.21797 1.21797 1.24504 1.19091 1.55630 1.73223 1.52923 1.73223 1.65103 1.61043 52 | 51 0.84345 0.90694 0.80717 0.86159 0.83438 0.77997 0.81624 0.71648 0.85252 0.88880 1.23343 0.79813 1.18813 1.15185 1.23348 1.20627 1.38767 1.67790 1.36953 1.55999 1.51464 53 | 52 0.85268 0.90897 0.90334 0.86394 0.86957 0.90334 0.86957 0.83580 0.81047 0.87801 1.11721 0.99061 0.97936 0.99343 1.02720 1.09474 1.16228 1.21294 1.25234 1.39587 1.39024 54 | 53 1.10162 1.19413 1.12685 1.17731 1.05958 1.13526 1.09321 1.18572 1.11844 1.16890 1.01753 1.27827 1.15212 1.37918 1.47169 1.48851 1.64829 1.58102 1.69875 1.47169 1.56420 55 | 54 2.36832 2.31932 2.67865 2.62965 2.84198 2.66232 2.95632 3.26665 3.59331 3.03798 2.13966 2.56442 2.54808 2.92376 3.10344 2.98910 4.01813 3.70779 3.93646 4.08347 4.32848 56 | 55 2.32330 2.21397 2.18664 1.94064 2.51463 2.07731 1.94064 2.43263 2.67863 2.13197 2.35064 3.66276 4.01810 3.99077 3.49876 3.66276 3.41675 4.04544 4.37344 4.04544 4.15477 57 | 56 1.30202 1.61766 1.26256 1.24284 1.44011 1.24284 1.14420 1.24284 1.12447 1.61766 1.61766 1.77555 1.75582 2.13066 1.89392 1.69664 2.17011 2.64359 2.66332 2.52522 2.34767 58 | 57 1.89439 1.64418 1.80503 1.75141 1.46547 1.91226 1.93013 2.14459 1.89439 1.98374 2.30543 1.80510 2.68084 2.85956 2.46637 3.27062 3.23488 3.16339 3.53871 3.37786 3.62807 59 | 58 1.47864 1.47864 1.57889 1.75432 1.97988 1.85457 2.00494 2.25556 2.13025 1.95482 1.27815 1.77945 1.92983 1.62908 1.75439 2.23058 1.67920 1.75439 1.80451 1.75439 1.97995 60 | 59 0.92601 0.77337 0.88531 0.88531 0.88531 1.00742 1.11936 0.94637 0.96672 0.88531 
0.93619 0.92605 1.02781 0.95658 1.02781 0.95658 0.96675 1.01764 1.02781 1.33310 1.29240 61 | 60 4.18661 4.16917 4.37850 4.37850 4.65761 4.51806 4.29128 4.15173 4.18661 3.75051 3.22718 3.17497 3.57620 3.57620 3.45409 3.47154 3.38431 3.45409 3.36687 3.59365 3.75065 62 | 61 1.61407 1.51319 1.46780 1.49302 1.54850 1.42240 1.42744 1.47284 1.45771 1.44258 1.15507 1.40228 1.47290 1.47794 1.47290 1.44263 1.46281 1.48803 1.45272 1.52838 1.57882 63 | 62 1.10904 1.09160 1.01311 1.01456 0.98549 1.05090 1.05817 1.03055 1.02910 0.95787 0.87357 1.02187 1.05966 1.12507 1.08728 1.12071 1.08146 1.03495 1.05239 1.15124 1.19048 64 | 63 0.70942 0.71423 0.72144 0.72529 0.73780 0.70028 0.64209 0.61323 0.59928 0.69547 0.95471 0.78208 0.74985 0.74360 0.79554 0.79843 0.82104 0.86673 0.90136 0.88597 0.90424 65 | 64 0.45813 0.48268 0.42132 0.42950 0.40496 0.37632 0.37223 0.35587 0.48268 0.42950 0.29452 0.42952 0.47042 0.44179 0.44588 0.49088 0.57269 0.45406 0.49906 0.56860 0.51951 66 | 65 0.40829 0.41779 0.47001 0.49850 0.48900 0.58395 0.82608 0.85457 0.86881 0.62668 0.21839 0.30386 0.32760 0.39881 0.36558 0.35608 0.46053 0.38932 0.43680 0.41306 0.39407 67 | 66 0.29763 0.29123 0.27202 0.27202 0.25922 0.29123 0.32963 0.39844 0.36483 0.33123 0.23362 0.23843 0.23683 0.26243 0.27844 0.28804 0.31044 0.32804 0.33124 0.34724 0.34244 68 | 67 0.27268 0.26727 0.21858 0.20776 0.18504 0.24347 0.39171 0.42526 0.45664 0.43933 0.21101 0.18396 0.25646 0.28893 0.31165 0.30624 0.29867 0.31706 0.28568 0.29542 0.26620 69 | 68 0.29538 0.28669 0.29914 0.30243 0.31794 0.29844 0.24486 0.21407 0.22606 0.26601 0.36964 0.35673 0.33370 0.32030 0.31701 0.32171 0.31090 0.31490 0.31490 0.31184 0.32171 70 | 69 0.39681 0.35272 0.32333 0.29393 0.30863 0.33068 0.43355 0.47030 0.41151 0.41886 0.18371 0.30864 0.44827 0.39683 0.48501 0.36008 0.53645 0.52175 0.52175 0.41887 0.36743 71 | 70 0.30052 0.29739 0.32869 0.29113 0.30991 0.27861 0.50086 0.68869 0.62921 0.56660 0.29426 0.31305 0.38819 0.44453 0.46645 0.47584 0.48210 0.48836 
0.57289 0.51967 0.51341 72 | 71 0.65157 0.68777 0.62055 0.72914 0.53264 0.64640 1.06527 1.03942 0.89979 0.99287 0.73431 0.73434 0.66194 0.69814 0.80674 0.74986 0.81191 0.88948 0.88948 0.92051 1.00843 73 | 72 0.67045 0.84285 0.80454 0.82369 0.97694 0.99610 0.82369 1.01525 1.01525 0.90032 0.45974 0.80457 0.99613 0.84288 1.07276 0.91951 0.82373 0.97698 1.03445 0.95782 1.01529 74 | 73 0.77648 0.62189 0.69216 0.62540 0.67108 0.69216 0.77297 0.85027 0.90297 0.83621 0.72729 0.67813 0.69570 0.82219 0.83625 0.85030 0.90652 0.92760 0.86787 0.81165 0.92760 75 | 74 0.51731 0.54345 0.49255 0.52144 0.51181 0.53382 0.47329 0.42926 0.46090 0.50355 0.55996 0.51458 0.56549 0.58062 0.57925 0.59989 0.59989 0.58613 0.65767 0.62740 0.61777 76 | 75 0.34262 0.34823 0.34441 0.33981 0.33752 0.31890 0.29338 0.29007 0.29645 0.30359 0.33624 0.34646 0.34263 0.34723 0.35871 0.36534 0.36458 0.37453 0.37019 0.39927 0.40310 77 | 76 0.25610 0.27346 0.26478 0.24525 0.20835 0.24959 0.27346 0.26478 0.24091 0.24091 0.30385 0.31037 0.31688 0.31037 0.33859 0.39068 0.37982 0.38416 0.42974 0.45579 0.46230 78 | 77 0.85446 0.70188 0.83920 0.82394 0.85446 0.83920 0.88498 0.99178 1.02230 0.93075 0.48826 0.68665 0.91553 0.88501 0.94604 0.76294 0.97656 1.12915 0.85449 1.12915 1.12915 79 | 78 3.03353 3.12546 3.12546 3.40123 2.75775 2.75775 2.20620 2.39005 2.29813 2.57390 1.74658 2.11436 1.83857 1.83857 1.65472 2.02243 1.93050 1.83857 1.37893 2.11436 2.02243 80 | 79 1.24395 1.19485 1.16211 1.32579 1.42400 1.47310 1.16211 1.06391 1.17848 1.39126 1.42400 1.62048 1.86600 1.80053 1.70232 1.76779 1.62048 1.98058 2.07879 2.02969 2.11153 81 | 80 4.81787 5.06027 5.12088 4.75726 4.60576 4.72696 4.84817 4.75726 4.39365 3.57552 2.93920 4.18170 4.24231 4.39382 4.24231 4.15140 3.54536 3.27264 3.39385 3.21203 2.57569 82 | 81 3.93605 3.90269 3.85821 4.05835 3.78038 3.45794 3.69143 3.36899 3.42458 3.22444 3.09102 3.38024 3.55814 3.08002 3.09114 2.89099 2.47958 2.52406 2.39063 2.16824 2.06817 83 | 82 4.49801 4.38168 4.73067 4.26535 
4.26535 4.88577 5.19598 4.80822 4.57556 3.99392 3.37351 4.88596 4.53696 4.73085 3.95530 4.11041 3.72263 3.41242 2.98586 2.52053 2.13276 84 | 83 0.62057 0.60784 0.60784 0.60784 0.58556 0.58556 0.54737 0.60784 0.68103 0.63966 0.85288 0.76380 0.78926 0.75107 0.77335 0.73834 0.77017 0.80836 0.83700 0.73198 0.78608 85 | 84 1.27600 1.36264 1.06333 1.17361 1.20511 1.10272 1.20511 1.12635 1.15785 1.22086 1.12635 1.37057 1.26817 1.44934 1.33906 1.48085 1.50448 1.33119 1.38633 1.33119 1.19728 86 | 85 4.10988 3.90179 3.76306 3.57230 3.48560 3.34687 3.19080 3.48560 3.55496 3.90179 3.81508 4.16207 4.49157 4.56093 4.19675 4.09270 4.11004 3.86726 3.88460 3.93662 3.98865 87 | 86 5.96414 6.58417 6.76133 6.79085 6.58417 6.67275 6.64323 6.73180 6.90895 5.69841 5.04885 5.34431 4.75378 4.93094 4.39946 4.39946 4.28135 4.31088 4.16325 4.31088 4.22230 88 | 87 6.54511 6.75970 7.18889 7.40348 7.02794 7.10841 6.59875 6.03545 4.74788 4.18458 3.16526 2.81665 2.52157 2.68252 2.84348 3.13855 3.08490 2.89713 2.70935 2.68252 2.60205 89 | 88 3.49175 3.20754 2.84212 3.53235 3.57295 3.61356 4.14138 3.45115 3.00453 3.08573 3.16694 4.01973 3.53249 3.04525 3.00465 2.51741 2.76103 3.00465 2.72042 2.35499 2.39560 90 | 89 1.46703 1.25364 1.46703 1.04026 1.36033 1.70709 1.44035 1.92047 2.10718 2.18720 2.74734 1.94722 1.76050 1.54711 1.41373 1.30704 1.41373 1.04030 1.20034 1.46708 1.22702 91 | 90 0.30331 0.28619 0.29842 0.30087 0.26418 0.30576 0.31310 0.33022 0.37670 0.36202 0.49900 0.37671 0.37427 0.35714 0.37671 0.37182 0.40851 0.42074 0.43297 0.42563 0.46967 92 | 91 0.34790 0.46593 0.39760 0.43487 0.47836 0.40381 0.37896 0.38517 0.42866 0.42245 0.47215 0.37276 0.39761 0.41004 0.41004 0.39761 0.36655 0.40382 0.41625 0.60263 0.58399 93 | 92 0.37262 0.38382 0.36701 0.34180 0.33339 0.39783 0.38943 0.38382 0.39783 0.38662 0.50709 0.43427 0.41185 0.40345 0.46789 0.50431 0.50151 0.48750 0.54073 0.53793 0.49030 94 | 93 1.21580 1.06670 1.06670 1.10111 1.08964 1.41079 1.11258 1.02082 1.04376 1.03229 0.83730 1.11262 
1.26173 1.26173 1.13556 1.27320 1.16997 1.34202 1.19291 1.23879 1.23879 95 | 94 0.55258 0.64205 0.73678 0.67889 0.62626 0.57890 0.54206 0.50522 0.53679 0.51574 0.62100 0.59471 0.56313 0.73154 0.71575 0.67365 0.73154 0.65260 0.69470 0.74207 0.83154 96 | 95 0.64917 0.51551 0.65871 0.62053 0.66826 0.62053 0.63962 0.71599 0.73508 0.64917 0.57279 0.70647 0.75421 0.69692 0.81149 0.76375 0.70647 0.74466 0.73511 0.74466 0.76375 97 | 96 0.43323 0.42061 0.45426 0.40378 0.41220 0.40799 0.44164 0.46688 0.45846 0.46688 0.39117 0.46689 0.48372 0.44166 0.47531 0.39118 0.40801 0.42062 0.50054 0.53419 0.58887 98 | 97 1.25930 1.39303 1.27044 1.14785 1.05870 1.02527 1.02527 1.03641 1.18129 1.11442 0.95840 1.22591 1.31507 1.50453 1.64941 1.97260 1.90573 1.83887 2.24007 2.09519 2.22893 99 | 98 1.40754 1.54321 1.93325 1.95021 1.71279 1.64496 1.50929 1.39058 1.30579 1.52625 1.33971 1.67894 1.72981 1.88244 2.03508 2.10291 2.11987 2.39121 2.22162 2.17075 2.40817 100 | 99 1.37426 1.21331 1.03998 0.99046 1.01522 0.92856 1.01522 0.80475 0.85427 0.97808 1.39902 1.30003 1.32479 1.43622 1.42384 1.39908 1.69623 1.65908 1.75813 1.79528 1.80766 101 | 100 0.77586 0.73323 0.78012 0.64797 0.63944 0.58402 0.66928 0.67781 0.69486 0.78438 0.98474 0.77162 0.86967 0.96346 1.15956 1.12546 1.13399 1.24909 1.25335 1.41535 1.38551 102 | -------------------------------------------------------------------------------- /overlap/README.md: -------------------------------------------------------------------------------- 1 | This folder contains code that produces excel files associated with different overlap enrichment analysis presented in the paper. In these analyses, for each state in the mouse full-stack annotation, we calculate the overlap fold-enrichment between the state and an annotation of interests (for example, with different chromosomes, different genome contexts from RefSeq, different classes of repeat elements). 
Please refer to the Method subsection "External annotation sources" for a list of the external genome annotations that we overlap with the mouse full-stack states. 2 | The overlap enrichment analysis has two steps: (1) run ```ChromHMM OverlapEnrichment``` to calculate the fold enrichment of each overlap, and (2) run ```create_excel_painted_overlap_enrichment_LECIF.py``` to create an Excel file that visualizes the results. 3 | (1) We run ```ChromHMM OverlapEnrichment``` with ChromHMM v.1.23 following the manual's instructions. In particular, this is the format of the command to run ChromHMM: 4 | ``` 5 | java -jar OverlapEnrichment -b 1 -lowmem -noimage 6 | ``` 7 | The ```-b 1``` flag tells ChromHMM to calculate overlap enrichment at 1bp resolution. 8 | The ```-lowmem``` and ```-noimage``` flags tell ChromHMM to use its lower-memory mode and to skip generating the output heatmap file. 9 | (2) We visualize the results of overlap enrichment using the following scripts: 10 | - ```helper.py```: helper functions for file management. 11 | - ```create_excel_painted_overlap_enrichment_LECIF.py```: creates Excel files showing the overlap enrichment results 12 | ``` 13 | python create_excel_painted_overlap_enrichment_LECIF.py input_fn output_fn context_prefix state_annot_fn 14 | input_fn: the output of ChromHMM OverlapEnrichment showing the fold enrichment between each full-stack state and each external genome annotation 15 | output_fn: the name of the Excel file that you want to create 16 | context_prefix: the prefix for all the enrichment contexts in this input_fn. For example, if we compute enrichments with gnomAD variants of varying MAF and the column names in input_fn look like 'maf_0_0.1', we can set context_prefix to gnomad. If no prefix is needed, pass an empty string. This is useful when writing comments on the states that are most enriched in each genomic context.
17 | state_annot_fn: the file with the characterization of states, recommended ../state_annotation_processed.tsv 18 | ``` 19 | 20 | # License: 21 | All code is provided under the MIT Open Access License 22 | Copyright 2022 Ha Vu and Jason Ernst 23 | 24 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 25 | 26 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 27 | 28 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 29 | 30 | -------------------------------------------------------------------------------- /overlap/create_excel_painted_overlap_enrichment_LECIF.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import pandas as pd 3 | import numpy as np 4 | import os 5 | import sys 6 | import time 7 | import helper 8 | 9 | print ("This script takes as input the ChromHMM OverlapEnrichment output file and creates a copy of it in Excel format. The Excel file is colored based on the levels of enrichment in each column. Colors are not comparable between columns.
However, the colors of cells within one column shows the enrichment levels comparable among states.") 10 | print ("") 11 | print ("") 12 | 13 | def get_state_annot_df(state_annot_fn): 14 | # state_annot_fn = '../../..//ROADMAP_aligned_reads/chromHMM_model/model_100_state/figures/supp_excel/state_annotations_processed.csv' 15 | try: 16 | state_annot_df = pd.read_csv(state_annot_fn, sep = ',', header = 0) 17 | except: 18 | state_annot_df = pd.read_csv(state_annot_fn, sep = '\t', header = 0) 19 | state_annot_df = state_annot_df[['state', 'color', 'mneumonics', 'state_order_by_group', 'comments']] 20 | return state_annot_df 21 | 22 | def get_enrichment_df(enrichment_fn, state_annot_fn): 23 | state_annot_df = get_state_annot_df(state_annot_fn) 24 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t", header = 0) 25 | try: 26 | enrichment_df = enrichment_df.rename(columns = {u'state (Emission order)': 'state', u'Genome %' : 'percent_in_genome', u'State (Emission order)': 'state'}) 27 | except: 28 | pass 29 | (num_state, num_enr_cont) = (enrichment_df.shape[0] - 1, enrichment_df.shape[1] - 1) 30 | percent_genome_of_cont = enrichment_df.iloc[num_state, 2:] 31 | enrichment_df = enrichment_df.loc[:(num_state - 1)] # rid of the last row because that is the row about percentage each genomic context occupies 32 | enrichment_df['state'] = (enrichment_df['state']).astype(str).astype(int) 33 | enrichment_df = enrichment_df.merge(state_annot_df, how = 'left', left_on = 'state', right_on = 'state', suffixes = ('_x', '_y')) 34 | enrichment_df = enrichment_df.sort_values(by = 'state_order_by_group') 35 | # num_enr_cont : number of enrichment context (TSS, exon, etc.) 
36 | return enrichment_df, num_state, num_enr_cont, percent_genome_of_cont 37 | 38 | def get_rid_of_stupid_file_tail(context_name): 39 | # first, get rid of stupid_file_tail 40 | if context_name.endswith('.bed.gz'): 41 | context_name = context_name[:-7] 42 | elif context_name.endswith('.txt.gz'): 43 | context_name = context_name[:-7] 44 | elif context_name.endswith('.txt'): 45 | context_name = context_name[:-4] 46 | elif context_name.endswith('.txt'): 47 | context_name = context_name[:-4] 48 | elif context_name.endswith('.bed'): 49 | context_name = context_name[:-4] 50 | else: 51 | context_name = context_name 52 | # fix if the context name is from variant prioritization analysis 53 | if 'non_coding' in context_name or 'whole_genome' in context_name: # that means context_name can be FATHMM_non_coding_top_0.01, so we want FATHMM instead 54 | enr_cont_list = context_name.split('_') 55 | enr_cont_list = enr_cont_list[:-4] # get rid of the last 4 items because 56 | context_name = '_'.join(enr_cont_list) 57 | return(context_name) 58 | 59 | def add_comments_on_states(enrichment_df, num_state, num_enr_cont, context_prefix): 60 | enr_cont_list = enrichment_df.columns[2:] 61 | NUM_TOP_STATE_TO_COMMENT = 3 #7 62 | output_comment = pd.Series([''] * enrichment_df.shape[0]) # we choose the number of rows in this enrichment_df as the size of output_comment, instead of num_state because there's one last line that's the base line and we want that line to be an empty string too. 
63 | # output_comment will be later used as a column in the enrichment_df to details what enrichment context each state is associated with 64 | for enr_cont in enr_cont_list: 65 | top_states = enrichment_df[enr_cont][:num_state].nlargest(NUM_TOP_STATE_TO_COMMENT) 66 | top_states = np.around(top_states, decimals = 1) # round the values of enrichment 67 | comment_this_cont = list(map(lambda x: 'rank' + str(x + 1) + '_'+ context_prefix + '_' + enr_cont + ':' + str(top_states[top_states.index[x]])+ "; ", range(NUM_TOP_STATE_TO_COMMENT))) # the string that specify the content of the comment: rank_: 68 | for i in range(NUM_TOP_STATE_TO_COMMENT): 69 | top_state_index = top_states.index[i] # the index of the state that has rank i 70 | output_comment[top_state_index] = output_comment[top_state_index] + comment_this_cont[i] # concatenate comment string 71 | return output_comment 72 | 73 | def color_state_annotation(row_data, index_to_color_list): 74 | results = [""] * len(row_data) # current paint all the cells in the rows with nothing, no format yet---> all white for now 75 | state_annot_color = row_data['color'] 76 | if not pd.isna(row_data['color']): 77 | for index in index_to_color_list: 78 | results[index] = 'background-color: %s' % state_annot_color # the third cell from the left is the state annotation cells 79 | return results 80 | 81 | def combine_enrichment_df_with_state_annot_df(enrichment_df, enr_cont_list, state_annot_fn): 82 | state_annot_df = get_state_annot_df(state_annot_fn) 83 | enrichment_df['state'] = (enrichment_df['state']).astype(str).astype(int) 84 | enrichment_df = enrichment_df.merge(state_annot_df, how = 'left', left_on = 'state', right_on = 'state', suffixes = ('_x', '_y')) 85 | enrichment_df = enrichment_df.sort_values(by = 'state_order_by_group') 86 | enrichment_df = enrichment_df.reset_index() 87 | enrichment_df = enrichment_df.drop(columns = ['index']) 88 | enrichment_df['state'] = enrichment_df['state'].astype(str).astype(int) 89 | 
enrichment_df['percent_in_genome'] = enrichment_df['percent_in_genome'].astype(str).astype(float) 90 | for enr_cont in enr_cont_list: 91 | enrichment_df[enr_cont] = enrichment_df[enr_cont].astype(str).astype(float) 92 | enrichment_df['comment'] = enrichment_df['comment'].astype(str) 93 | enrichment_df['max_fold_context'] = enrichment_df['max_fold_context'].astype(str) 94 | enrichment_df['max_enrich'] = enrichment_df['max_enrich'].astype(float) 95 | enrichment_df['mneumonics'] = enrichment_df['mneumonics'].astype(str) 96 | enrichment_df['color'] = enrichment_df['color'].astype(str) 97 | enrichment_df['state_order_by_group'] = enrichment_df['state_order_by_group'].astype(int) 98 | return enrichment_df 99 | 100 | def color_lecif_score(max_fold_context): 101 | lecif_score_color = {1.0: '#806000', 0.9: '#BF8F00', 0.8: '#FFD966', 0.3: '#8EA9DB', 0.2: '#467CDD', 0.1: '#3A63B7', 0.7: '#FFE699', 0.6: '#FFF2CC', 0.5: '#D9E1F2', 0.4: '#B4C6E7'} 102 | try: 103 | color = lecif_score_color[float(max_fold_context)] 104 | except: # when the word is not in the dictionary, return a white color 105 | color = '#ffffff' # white 106 | #print max_fold_context, color 107 | return 'background-color: %s' % color 108 | 109 | def get_enrichment_colored_df(enrichment_fn, save_fn, context_prefix, state_annot_fn): 110 | cm = sns.light_palette("red", as_cmap=True) 111 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t", header = 0) 112 | try: 113 | enrichment_df = enrichment_df.rename(columns = {u'state (Emission order)': 'state', u'Genome %' : 'percent_in_genome', u'State (Emission order)' : 'state'}) 114 | except: 115 | pass 116 | enrichment_df = enrichment_df.fillna(0) # if there are nan enrichment values, due to the state not being present (such as when we create files with foreground and background), we can fill it by 0 so that the code to make colorful excel would not crash. 
117 | (num_state, num_enr_cont) = (enrichment_df.shape[0] - 1, enrichment_df.shape[1] - 1) 118 | enr_cont_list = enrichment_df.columns[2:] 119 | enr_cont_list = list(map(get_rid_of_stupid_file_tail, enr_cont_list)) # fix the name of the enrichment context. If it contains the tail '.bed.gz' then we get rid of it 120 | enrichment_df.columns = list(enrichment_df.columns[:2]) + list(enr_cont_list) 121 | percent_genome_of_cont = enrichment_df.iloc[num_state, 2:] 122 | enrichment_df = enrichment_df.loc[:(num_state-1)] # rid of the last row because that is the row about percentage each genomic context occupies 123 | # now change the column names of the enrichment contexts. If the contexts' names contain '.bed.gz' then get rid of it. 124 | no_state_df = enrichment_df[enrichment_df.columns[2:]] # only get the data of enrichment contexts for now, don't consider state and percent_in_genome 125 | # now call functions to add comments to the enrichment patterns of states 126 | # add comments about states, calculate the maximum fold enrichment for each state and what context is associated with such fold enrichment 127 | enrichment_df['comment'] = add_comments_on_states(enrichment_df, num_state, num_enr_cont, context_prefix) 128 | enrichment_df['max_fold_context'] = no_state_df.apply(lambda x: x.idxmax().split('_')[1], axis = 1) # name of the context that are most enriched in this state (only applied for lecif score context here) 129 | enrichment_df['max_enrich'] = no_state_df.apply(lambda x: x.max(), axis = 1) # the value of the max fold enrichment in this state 130 | # now add back data associated with the percentage in genome of each of the 131 | enrichment_df = combine_enrichment_df_with_state_annot_df(enrichment_df, enr_cont_list, state_annot_fn) 132 | num_remaining_columns = len(enrichment_df.columns) - 2 - len(enr_cont_list) 133 | enrichment_df.loc[num_state] = list([0 , np.sum(enrichment_df['percent_in_genome'])]) + list(percent_genome_of_cont) + [None] * 
num_remaining_columns 134 | mneumonics_index_in_row = enrichment_df.columns.get_loc('mneumonics') # column index, zero-based of the menumonics entries, which we will use to paint the right column later 135 | print(enr_cont_list) 136 | colored_df = enrichment_df.style.background_gradient(subset = pd.IndexSlice[:(num_state-1), [enr_cont_list[0]]], cmap = cm) 137 | for enr_cont in enr_cont_list[1:]: 138 | colored_df = colored_df.background_gradient(subset = pd.IndexSlice[:(num_state-1), [enr_cont]], cmap = cm) 139 | colored_df = colored_df.apply(lambda x: color_state_annotation(x, [mneumonics_index_in_row]), axis = 1) 140 | # add a function to color the lecif score range that is most enriched with each full stack state 141 | colored_df = colored_df.applymap(color_lecif_score, subset = pd.IndexSlice[:(num_state - 1), ['max_fold_context']]) 142 | colored_df.to_excel(save_fn, engine = 'openpyxl') 143 | return colored_df 144 | 145 | 146 | def main(): 147 | if len(sys.argv) != 5: 148 | usage() 149 | input_fn = sys.argv[1] 150 | helper.check_file_exist(input_fn) 151 | output_fn = sys.argv[2] 152 | helper.create_folder_for_file(output_fn) 153 | context_prefix = sys.argv[3] 154 | state_annot_fn = sys.argv[4] 155 | print ("Done getting command line! ") 156 | 157 | get_enrichment_colored_df(input_fn, output_fn, context_prefix, state_annot_fn) 158 | 159 | 160 | def usage(): 161 | print ("python create_excel_painted_overlap_enrichment_output.py ") 162 | print ("input_fn: the output of chromHMM OverlapEnrichment showing the fold enrichment between each full-stack state and each external genome annotation") 163 | print ("output_fn: the name of the excel file that you want to create") 164 | print ("context_prefix: the prefix to all the enrichment contexts in this input_fn. For example, if we do enrichments with gnomad variants of varying maf, the column names in input_fn are 'maf_0_0.1', we can specify context_prefix to gnomad. If we dont need such prefix, specify to empty string. 
This is useful when we try to write comment on states that are most enriched in each genomic context.") 165 | print ('state_annot_fn: where we get all the data of characterization of states, recommended ../state_annotation_processed.tsv') 166 | exit(1) 167 | 168 | if __name__ == '__main__': 169 | main() 170 | # input_fn = sys.argv[1] 171 | # output_fn = sys.argv[2] 172 | # get_colored_df_for_consHMM_enrichment(input_fn, output_fn) -------------------------------------------------------------------------------- /overlap/helper.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import string 4 | import os 5 | import sys 6 | import time 7 | CHROMOSOME_LIST = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', 'X', 'Y'] # chromosome Y is listed here even though it is not available in all cell types 8 | 9 | def make_dir(directory): 10 | try: 11 | os.makedirs(directory) 12 | except FileExistsError: 13 | print ( 'Folder ' + directory + ' is already created') 14 | 15 | 16 | 17 | def check_file_exist(fn): 18 | if not os.path.isfile(fn): 19 | print ( "File: " + fn + " DOES NOT EXIST") 20 | exit(1) 21 | return 22 | 23 | def check_dir_exist(fn): 24 | if not os.path.isdir(fn): 25 | print ( "Directory: " + fn + " DOES NOT EXIST") 26 | exit(1) 27 | return 28 | 29 | def create_folder_for_file(fn): 30 | last_slash_index = fn.rfind('/') 31 | if last_slash_index != -1: # path contains folder 32 | make_dir(fn[:last_slash_index]) 33 | return 34 | 35 | def get_command_line_integer(arg): 36 | try: 37 | arg = int(arg) 38 | return arg 39 | except (TypeError, ValueError): 40 | print ( "Integer: " + str(arg) + " IS NOT VALID") 41 | exit(1) 42 | 43 | 44 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows ChromHMM OverlapEnrichment's output format 45 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t") 46 | # rename the columns so that they are easier to work with 47 |
enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"}) 48 | return enrichment_df 49 | 50 | def get_non_coding_enrichment_df (non_coding_enrichment_fn): 51 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t') 52 | if len(nc_enrichment_df.columns) != 3: 53 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.") 54 | print ( "Exiting, from helper.py") 55 | exit(1) 56 | # Now, we know that the nc_enrichment_df has exactly 3 columns 57 | # change the column names 58 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"] 59 | return (nc_enrichment_df) 60 | -------------------------------------------------------------------------------- /process_CTCF_data/README.md: -------------------------------------------------------------------------------- 1 | This folder contains code to obtain enrichment analysis results of mouse full-stack states with CTCF elements profiled in multiple mouse cell types (obtained from the ENCODE project). The results are presented in Fig. 2F in the manuscript. 2 | - First, we obtained the metadata of CTCF experiments from the ENCODE project. We downloaded the metadata and recorded it in the file ```metadata_from_ENCODE.tsv```.
- Second, we downloaded the CTCF peak data (bed files) from the download links provided in ```metadata_from_ENCODE.tsv```, using the script ```download_CTCF_data.py```.
```
usage: download_CTCF_data.py [-h] --metadata_fn METADATA_FN [--download]
                             --download_folder DOWNLOAD_FOLDER --output_fn
                             OUTPUT_FN

This file will filter out the data needed to get the CTCF peaks in mouse

optional arguments:
  -h, --help            show this help message and exit
  --metadata_fn METADATA_FN
                        metadata file that I got from ENCODE: mouse, CTCF,
                        TF Chip-seq
  --download            If this flag is present, we will download the data
                        of CTCF peaks
  --download_folder DOWNLOAD_FOLDER
                        Where data should be downloaded to
  --output_fn OUTPUT_FN
                        The file of metadata that we will show as
                        additional data for the paper
```
Note: we include a file ```metadata_clean_for_publication.txt``` in this folder, which is the output of this second step (```download_CTCF_data.py```). However, if you want to fully replicate what we did, you will need to run the script with the ```--download``` flag to download the CTCF bed files from ENCODE.
- Third, use ChromHMM to run an overlap enrichment analysis of the chromatin states against each of the CTCF files, which correspond to CTCF peaks from different cell/tissue types. This is the command that we used to run ChromHMM (version 1.23):
```
java -jar -b 1 -lowmem -noimage
```
The output of this step is provided in the file ```overlap_ctcf_mm10.txt```.
- Lastly, we calculated the geometric mean and standard deviation of the overlap fold enrichment across cell types. We do not include annotated code to generate the plots here since the procedure is standard (this folder still includes code that works locally but has not been fully cleaned of local file paths). If you would like clean, fully commented code for this step, please contact graduate student Ha Vu.
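The geometric-mean step described in the last bullet can be sketched in plain Python as follows. This is a minimal illustration, not the script shipped in this folder: the function names (`replace_zeros`, `geometric_mean`, `geometric_std`) are hypothetical, and the sketch assumes that zero fold enrichments are replaced by the smallest non-zero value in the same row before taking logs, as ```calculate_genometric_mean_CTCF_overlap.py``` appears to do. The `n - 1` denominator in the geometric standard deviation is also an assumption, chosen to mirror what we understand to be the default of `scipy.stats.gstd`.

```python
import math


def replace_zeros(values):
    """Replace zeros with the smallest non-zero value so that logs are defined.

    Assumes at least one value is non-zero; min() raises ValueError otherwise.
    """
    values = [float(v) for v in values]
    non_zero_min = min(v for v in values if v != 0)
    return [v if v != 0 else non_zero_min for v in values]


def geometric_mean(values):
    """Geometric mean: exponential of the mean of the log values."""
    logs = [math.log(v) for v in replace_zeros(values)]
    return math.exp(sum(logs) / len(logs))


def geometric_std(values):
    """Geometric standard deviation: exponential of the sample std of the logs.

    Uses an n - 1 denominator (assumption), so it needs at least two values.
    """
    logs = [math.log(v) for v in replace_zeros(values)]
    mean_log = sum(logs) / len(logs)
    variance = sum((x - mean_log) ** 2 for x in logs) / (len(logs) - 1)
    return math.exp(math.sqrt(variance))


if __name__ == '__main__':
    # Hypothetical fold enrichments of one state across three CTCF datasets.
    fold_enrichments = [0.0, 2.0, 8.0]
    print(geometric_mean(fold_enrichments))
    print(geometric_std(fold_enrichments))
```

Replacing zeros with the row minimum keeps a single dataset with no overlap from driving the geometric mean of a state to zero. Note that the geometric standard deviation is a multiplicative factor, so a value of 1 means no spread.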
-------------------------------------------------------------------------------- /process_CTCF_data/calculate_genometric_mean_CTCF_overlap.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import os 4 | from scipy.stats import gmean 5 | from scipy.stats import gstd 6 | import sys 7 | import seaborn as sns 8 | sys.path.append('/Users/vuthaiha/Desktop/window_hoff/source/mm10_annotations/overlap') 9 | import create_excel_painted_overlap_enrichment_LECIF as lecif_code 10 | 11 | fn = '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/overlap_ctcf_mm10.txt' 12 | save_excel_fn = '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/overlap_ctcf_mm10.xlsx' 13 | save_csv_fn = '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/overlap_ctcf_mm10_with_calculated_mean.txt' 14 | state_annot_fn = '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/state_annotation_processed.csv' 15 | df = pd.read_csv(fn, header = 0, index_col = None, sep = '\t') 16 | percent_genome_of_cont = df.loc[df.shape[0]-1,:] 17 | df.drop(df.tail(1).index,inplace=True) 18 | enr_colname = df.columns[df.columns.str.endswith('.bed.gz')] 19 | print((df[enr_colname] == 0).sum()) 20 | def calculate_gmean(x): 21 | x = x.astype(float) 22 | non_zero_min = np.min(x[x!=0]) 23 | x[x==0] = non_zero_min 24 | return gmean(np.array(x)) 25 | 26 | def calculate_geostd(x): 27 | x = x.astype(float) 28 | non_zero_min = np.min(x[x!=0]) 29 | x[x==0] = non_zero_min 30 | return gstd(np.array(x)) 31 | 32 | df['geomean'] = df.apply(lambda x: calculate_gmean(x[enr_colname]), axis = 1) 33 | df['geostd'] = df.apply(lambda x: calculate_geostd(x[enr_colname]), axis = 1) 34 | df['mean'] = df.apply(lambda x: np.mean(x[enr_colname].astype(float)), axis = 1) 35 | df['std'] = df.apply(lambda x: np.std(x[enr_colname].astype(float)), 
axis = 1) 36 | 37 | cm = sns.light_palette("red", as_cmap=True) 38 | try: 39 | df = df.rename(columns = {u'state (Emission order)': 'state', u'Genome %' : 'percent_in_genome', u'State (Emission order)' : 'state'}) 40 | except: 41 | pass 42 | df = df.fillna(0) # if there are nan enrichment values, due to the state not being present (such as when we create files with foreground and background), we can fill it by 0 so that the code to make colorful excel would not crash. 43 | (num_state, num_enr_cont) = (df.shape[0], df.shape[1] - 1) 44 | enr_colname = list(map(lecif_code.get_rid_of_stupid_file_tail, enr_colname)) # fix the name of the enrichment context. If it contains the tail '.bed.gz' then we get rid of it 45 | df.columns = list(map(lecif_code.get_rid_of_stupid_file_tail, df.columns)) 46 | # now change the column names of the enrichment contexts. If the contexts' names contain '.bed.gz' then get rid of it. 47 | state_annot_df = lecif_code.get_state_annot_df(state_annot_fn) 48 | df['state'] = (df['state']).astype(str).astype(int) 49 | df = df.merge(state_annot_df, how = 'left', left_on = 'state', right_on = 'state', suffixes = ('_x', '_y')) 50 | df = df.sort_values(by = 'state_order_by_group') 51 | df = df.reset_index(drop = True) 52 | for enr_cont in enr_colname: 53 | df[enr_cont] = df[enr_cont].astype(str).astype(float) 54 | num_remaining_columns = len(df.columns) - 2 - len(enr_colname) 55 | df.loc[df.shape[0]] = list(percent_genome_of_cont) + [None] * num_remaining_columns 56 | mneumonics_index_in_row = df.columns.get_loc('mneumonics') # column index, zero-based of the menumonics entries, which we will use to paint the right column later 57 | colored_df = df.style.background_gradient(subset = pd.IndexSlice[:(num_state-1), [enr_colname[0]]], cmap = cm) 58 | for enr_cont in enr_colname[1:] + ['geomean', 'mean']: 59 | colored_df = colored_df.background_gradient(subset = pd.IndexSlice[:(num_state-1), [enr_cont]], cmap = cm) 60 | colored_df = colored_df.apply(lambda 
x: lecif_code.color_state_annotation(x, [mneumonics_index_in_row]), axis = 1) 61 | colored_df.to_excel(save_excel_fn, engine = 'openpyxl') 62 | df.to_csv(save_csv_fn, header = True, index = False, sep = '\t') 63 | -------------------------------------------------------------------------------- /process_CTCF_data/download_CTCF_data.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | import helper 4 | import os 5 | import argparse 6 | import sys 7 | import requests 8 | parser = argparse.ArgumentParser(description = 'This file will filter out the data needed to get the CTCF peaks in mouse') 9 | parser.add_argument('--metadata_fn', type = str, required = True, help = 'metadata file that I got from ENCODE: mouse, CTCF, TF Chip-seq') 10 | parser.add_argument('--download', action='store_true', default = False, help = 'If this flag is present, we will download the data of CTCF peaks') 11 | parser.add_argument('--download_folder', type = str, required = True, help = 'Where data should be downloaded to') 12 | parser.add_argument('--output_fn', required = True, help = 'The file of metadata that we will show as additional data for the paper') 13 | args = parser.parse_args() 14 | helper.check_file_exist(args.metadata_fn) 15 | helper.create_folder_for_file(args.output_fn) 16 | helper.make_dir(args.download_folder) 17 | ################################################################################### 18 | def cleanup_metadata(metadata_fn, output_fn): 19 | meta_df = pd.read_csv(metadata_fn, header = 0, index_col = None, sep = '\t') 20 | meta_df.dropna(axis = 1, how = 'all',inplace = True) # drop all empty columns, only 34 of them left 21 | meta_df = meta_df[meta_df['File assembly'] == 'mm10'] 22 | meta_df = meta_df[meta_df['Assay'] == 'TF ChIP-seq'] 23 | meta_df = meta_df[meta_df['Biosample organism'] == 'Mus musculus'] 24 | meta_df = meta_df[meta_df['Experiment target'] == 'CTCF-mouse'] 25 | 
meta_df = meta_df[meta_df['Project'] == 'ENCODE'] 26 | meta_df = meta_df[meta_df['File analysis title'].str.contains('ENCODE4', na = False)] 27 | meta_df = meta_df[meta_df['File type'] == 'bed'] 28 | # I used the function metadata_df.nunique() to find the number of unique values in each column, and that helped me figure the necessary columns that contains different values for different dataset that I will download 29 | meta_df = meta_df[['File accession', 'Output type', 'Experiment accession', 'Assay', 'Donor(s)', 'Biosample term id', 'Biosample term name', 'Biosample type', 'Experiment target', 'File download URL', 'File analysis title']] 30 | print(meta_df['File accession'].unique()) 31 | print(len(meta_df['File accession'].unique())) 32 | meta_df.to_csv(output_fn, header = True, index = False, sep = '\t') 33 | return meta_df 34 | 35 | def download_data_one_row(row): 36 | biosample_term = '-'.join(row['Biosample term name'].split()) 37 | biosample_save_fn = '{id}_{term}.bed.gz'.format(id = row['File accession'], term = biosample_term) 38 | save_fn = os.path.join(args.download_folder, biosample_save_fn) 39 | if not os.path.isfile(save_fn): 40 | response = requests.get(row['File download URL']) 41 | file = open(save_fn, 'wb') 42 | file.write(response.content) 43 | file.close() 44 | return 45 | 46 | if __name__ == '__main__': 47 | meta_df = cleanup_metadata(args.metadata_fn, args.output_fn) 48 | print('Done cleaning up metadata') 49 | if args.download: 50 | meta_df.apply(download_data_one_row, axis = 1) # apply function to download data for each row (each data of the cleaned metadata) 51 | print('Done downloading data') 52 | 53 | -------------------------------------------------------------------------------- /process_CTCF_data/draw_2D_overlap_enrichment_CTCF.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "draw_2D_overlap_enrichment_CTCF" 3 | author: "Ha Vu" 4 | date: "10/23/2022" 5 | output: html_document 6 
| --- 7 | 8 | ```{r} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | library(tidyr) 11 | library(tidyverse) 12 | library(ggplot2) 13 | library(dplyr) 14 | ``` 15 | 16 | ```{r} 17 | overlap_fn <- '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/overlap_ctcf_mm10_with_calculated_mean.txt' 18 | state_annot_fn <- '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/state_annotation_processed.csv' 19 | ``` 20 | 21 | ```{r} 22 | RowSD <- function(x) { # function that I can use to calculate the standard deviation of a list of numbers. Taken from https://stackoverflow.com/questions/29581612/r-dplyr-mutate-calculating-standard-deviation-for-each-row 23 | sqrt(rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)) 24 | } 25 | 26 | ``` 27 | 28 | ```{r} 29 | state_df <- as.data.frame(read.csv(state_annot_fn, header = T, sep = '\t', check.names = F)) 30 | state_df <- state_df %>% select(-c('comments', 'color', 'group', 'itemRgb')) 31 | ``` 32 | 33 | ```{r} 34 | df <- read.table(overlap_fn, header = T, sep = '\t', check.names = F) # check.names should be set to F otherwise R will change the colum names with special characters 35 | #df <- df %>% slice(1 : (n()-1)) # we do not really need this anymore because 36 | df <- df %>% rename(c('state' = 'state (Emission order)', 'genome_percent' = 'Genome %')) 37 | df$state <- as.integer(df$state) 38 | df <- df %>% left_join(state_df, by = 'state') 39 | #df <- df %>% mutate(mean_overlap= mean(select(., ends_with('.bed.gz'))), std_overlap= RowSD(select(., ends_with('.bed.gz')))) 40 | df <- df %>% arrange(state_order_by_group) 41 | #show_df <- df%>% select(c('mneumonics', 'geomean', 'geostd', 'mean_overlap', 'std_overlap')) 42 | #show_df %>% write_csv('/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/mean_overlap_ctcf.csv') 43 | ``` 44 | 45 | ```{r} 46 | ggplot(df, aes(x=fct_inorder(mneumonics), y = geomean)) + # factor inorder to keep the order of states in 
the plots https://forcats.tidyverse.org/reference/fct_inorder.html 47 | geom_line(linetype = "dashed", size = 1) + 48 | geom_point() + 49 | geom_errorbar(aes(ymin = geomean - geostd, ymax = geomean + geostd), width = 0.5, position = position_dodge(0.05)) + 50 | theme_bw() + 51 | theme(axis.text.x = element_text(size = 5, angle = 90)) 52 | save_fn <- '/Users/vuthaiha/Desktop/window_hoff/full_stacked_mouse/from_jason_061921/overlap/ctcf/geomean_overlap_ctcf.png' 53 | ggsave(save_fn, width = 10, height = 4) 54 | ``` -------------------------------------------------------------------------------- /process_CTCF_data/helper.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import string 4 | import os 5 | import sys 6 | import time 7 | def make_dir(directory): 8 | try: 9 | os.makedirs(directory) 10 | except: 11 | print ( 'Folder' + directory + ' is already created') 12 | 13 | 14 | 15 | def check_file_exist(fn): 16 | if not os.path.isfile(fn): 17 | print ( "File: " + fn + " DOES NOT EXISTS") 18 | exit(1) 19 | return 20 | 21 | def check_dir_exist(fn): 22 | if not os.path.isdir(fn): 23 | print ( "Directory: " + fn + " DOES NOT EXISTS") 24 | exit(1) 25 | return 26 | 27 | def create_folder_for_file(fn): 28 | last_slash_index = fn.rfind('/') 29 | if last_slash_index != -1: # path contains folder 30 | make_dir(fn[:last_slash_index]) 31 | return 32 | 33 | def get_command_line_integer(arg): 34 | try: 35 | arg = int(arg) 36 | return arg 37 | except: 38 | print ( "Integer: " + str(arg) + " IS NOT VALID") 39 | exit(1) 40 | 41 | 42 | def get_enrichment_df (enrichment_fn): # enrichment_fn follows the format of ChromHMM OverlapEnrichment's format 43 | enrichment_df = pd.read_csv(enrichment_fn, sep = "\t") 44 | # rename the org_enrichment_df so that it's easier to work with 45 | enrichment_df = enrichment_df.rename(columns = {"state (Emission order)": "state", "Genome %": "percent_in_genome"}) 46 | 
return enrichment_df 47 | 48 | def get_non_coding_enrichment_df (non_coding_enrichment_fn): 49 | nc_enrichment_df = pd.read_csv(non_coding_enrichment_fn, sep = '\t') 50 | if len(nc_enrichment_df.columns) != 3: 51 | print ( "Number of columns in a non_coding_enrichment_fn should be 3. The provided file has " + str(len(nc_enrichment_df.columns)) + " columns.") 52 | print ( "Exiting, from ChromHMM_untilities_common_functions_helper.py") 53 | exit(1) 54 | # Now, we know that the nc_enrichment_df has exactly 3 columns 55 | # change the column names 56 | nc_enrichment_df.columns = ["state", "percent_in_genome", "non_coding"] 57 | return (nc_enrichment_df) 58 | -------------------------------------------------------------------------------- /process_CTCF_data/metadata_clean_for_publication.txt: -------------------------------------------------------------------------------- 1 | File accession Output type Experiment accession Assay Donor(s) Biosample term id Biosample term name Biosample type Experiment target File download URL File analysis title 2 | ENCFF210BTO IDR thresholded peaks ENCSR000CED TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002108 small intestine tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF210BTO/@@download/ENCFF210BTO.bed.gz ENCODE4 v1.5.1 mm10 3 | ENCFF349BUL IDR thresholded peaks ENCSR000CDZ TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002370 thymus tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF349BUL/@@download/ENCFF349BUL.bed.gz ENCODE4 v1.5.1 mm10 4 | ENCFF502WDT IDR thresholded peaks ENCSR000ETQ TF ChIP-seq /mouse-donors/ENCDO073AAA/ EFO:0003971 MEL cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF502WDT/@@download/ENCFF502WDT.bed.gz ENCODE4 v1.5.1 mm10 5 | ENCFF065ODV conservative IDR thresholded peaks ENCSR000DIP TF ChIP-seq /mouse-donors/ENCDO073AAA/ EFO:0003971 MEL cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF065ODV/@@download/ENCFF065ODV.bed.gz ENCODE4 v1.5.1 mm10 6 | ENCFF316FYN IDR 
thresholded peaks ENCSR000CBN TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002037 cerebellum tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF316FYN/@@download/ENCFF316FYN.bed.gz ENCODE4 v1.5.1 mm10 7 | ENCFF097CAQ IDR thresholded peaks ENCSR000CBL TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002371 bone marrow tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF097CAQ/@@download/ENCFF097CAQ.bed.gz ENCODE4 v1.5.1 mm10 8 | ENCFF491RJK IDR thresholded peaks ENCSR000CBV TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002048 lung tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF491RJK/@@download/ENCFF491RJK.bed.gz ENCODE4 v1.5.1 mm10 9 | ENCFF089EWX IDR thresholded peaks ENCSR000CBY TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002106 spleen tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF089EWX/@@download/ENCFF089EWX.bed.gz ENCODE4 v1.5.1 mm10 10 | ENCFF575MGI IDR thresholded peaks ENCSR000CBO TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0005343 cortical plate tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF575MGI/@@download/ENCFF575MGI.bed.gz ENCODE4 v1.5.1 mm10 11 | ENCFF951GOP IDR thresholded peaks ENCSR000CBI TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000948 heart tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF951GOP/@@download/ENCFF951GOP.bed.gz ENCODE4 v1.5.1 mm10 12 | ENCFF281QMV IDR thresholded peaks ENCSR000CEH TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000955 brain tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF281QMV/@@download/ENCFF281QMV.bed.gz ENCODE4 v1.5.1 mm10 13 | ENCFF033UQX IDR thresholded peaks ENCSR000CEF TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000473 testis tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF033UQX/@@download/ENCFF033UQX.bed.gz ENCODE4 v1.5.1 mm10 14 | ENCFF504JDS conservative IDR thresholded peaks ENCSR000DIU TF ChIP-seq /mouse-donors/ENCDO324YEK/ EFO:0005233 CH12.LX cell line CTCF-mouse 
https://www.encodeproject.org/files/ENCFF504JDS/@@download/ENCFF504JDS.bed.gz ENCODE4 v1.5.1 mm10 15 | ENCFF601JVE IDR thresholded peaks ENCSR000DIS TF ChIP-seq /mouse-donors/ENCDO089AAA/ EFO:0002034 G1E cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF601JVE/@@download/ENCFF601JVE.bed.gz ENCODE4 v1.5.1 mm10 16 | ENCFF805QBU IDR thresholded peaks ENCSR000CFJ TF ChIP-seq /mouse-donors/ENCDO081AAA/ CL:0002476 bone marrow macrophage primary cell CTCF-mouse https://www.encodeproject.org/files/ENCFF805QBU/@@download/ENCFF805QBU.bed.gz ENCODE4 v1.5.1 mm10 17 | ENCFF142CNG IDR thresholded peaks ENCSR000CFH TF ChIP-seq /mouse-donors/ENCDO073AAA/ EFO:0003971 MEL cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF142CNG/@@download/ENCFF142CNG.bed.gz ENCODE4 v1.5.1 mm10 18 | ENCFF611HDQ IDR thresholded peaks ENCSR000CBW TF ChIP-seq /mouse-donors/ENCDO072AAA/ CL:2000042 embryonic fibroblast primary cell CTCF-mouse https://www.encodeproject.org/files/ENCFF611HDQ/@@download/ENCFF611HDQ.bed.gz ENCODE4 v1.5.1 mm10 19 | ENCFF957SPH conservative IDR thresholded peaks ENCSR000CBJ TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002113 kidney tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF957SPH/@@download/ENCFF957SPH.bed.gz ENCODE4 v1.5.1 mm10 20 | ENCFF883UPM IDR thresholded peaks ENCSR000ERM TF ChIP-seq /mouse-donors/ENCDO324YEK/ EFO:0005233 CH12.LX cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF883UPM/@@download/ENCFF883UPM.bed.gz ENCODE4 v1.5.1 mm10 21 | ENCFF086ZTV IDR thresholded peaks ENCSR000CDX TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002264 olfactory bulb tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF086ZTV/@@download/ENCFF086ZTV.bed.gz ENCODE4 v1.5.1 mm10 22 | ENCFF136ZDK IDR thresholded peaks ENCSR397RHW TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002107 liver tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF136ZDK/@@download/ENCFF136ZDK.bed.gz ENCODE4 v1.5.1 mm10 23 | ENCFF589ZBJ IDR 
thresholded peaks ENCSR041SMK TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002107 liver tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF589ZBJ/@@download/ENCFF589ZBJ.bed.gz ENCODE4 v1.5.1 mm10 24 | ENCFF430PPJ IDR thresholded peaks ENCSR677HXC TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0001890 forebrain tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF430PPJ/@@download/ENCFF430PPJ.bed.gz ENCODE4 v1.5.1 mm10 25 | ENCFF198SSR IDR thresholded peaks ENCSR418SBY TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002048 lung tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF198SSR/@@download/ENCFF198SSR.bed.gz ENCODE4 v1.5.1 mm10 26 | ENCFF165LXO IDR thresholded peaks ENCSR677SIH TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002048 lung tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF165LXO/@@download/ENCFF165LXO.bed.gz ENCODE4 v1.5.1 mm10 27 | ENCFF194HVL IDR thresholded peaks ENCSR000CBU TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002107 liver tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF194HVL/@@download/ENCFF194HVL.bed.gz ENCODE4 v1.5.1 mm10 28 | ENCFF024JOZ IDR thresholded peaks ENCSR143WOK TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002113 kidney tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF024JOZ/@@download/ENCFF024JOZ.bed.gz ENCODE4 v1.5.1 mm10 29 | ENCFF660HRL IDR thresholded peaks ENCSR245OKD TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0000948 heart tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF660HRL/@@download/ENCFF660HRL.bed.gz ENCODE4 v1.8.0 mm10 30 | ENCFF143RZH IDR thresholded peaks ENCSR799YRV TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0002305 layer of hippocampus tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF143RZH/@@download/ENCFF143RZH.bed.gz ENCODE4 v1.8.0 mm10 31 | ENCFF784ASD IDR thresholded peaks ENCSR000AIJ TF ChIP-seq /mouse-donors/ENCDO321TZV/ EFO:0001098 C2C12 cell line CTCF-mouse 
https://www.encodeproject.org/files/ENCFF784ASD/@@download/ENCFF784ASD.bed.gz ENCODE4 v1.5.1 mm10 32 | ENCFF327LYS IDR thresholded peaks ENCSR002ZAG TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000160 intestine tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF327LYS/@@download/ENCFF327LYS.bed.gz ENCODE4 v1.5.1 mm10 33 | ENCFF052KVO conservative IDR thresholded peaks ENCSR362VNF TF ChIP-seq /mouse-donors/ENCDO015AAA/ EFO:0007751 E14TG2a.4 cell line CTCF-mouse https://www.encodeproject.org/files/ENCFF052KVO/@@download/ENCFF052KVO.bed.gz ENCODE4 v1.5.1 mm10 34 | ENCFF921ILM IDR thresholded peaks ENCSR877MSN TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0002305 layer of hippocampus tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF921ILM/@@download/ENCFF921ILM.bed.gz ENCODE4 v1.8.0 mm10 35 | ENCFF680EMQ IDR thresholded peaks ENCSR415XNJ TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0001388 gastrocnemius tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF680EMQ/@@download/ENCFF680EMQ.bed.gz ENCODE4 v1.8.0 mm10 36 | ENCFF649JAA IDR thresholded peaks ENCSR798CCX TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0001388 gastrocnemius tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF649JAA/@@download/ENCFF649JAA.bed.gz ENCODE4 v1.8.0 mm10 37 | ENCFF388VCL IDR thresholded peaks ENCSR642FSG TF ChIP-seq /mouse-donors/ENCDO509HIY/ NTR:0000646 left cerebral cortex tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF388VCL/@@download/ENCFF388VCL.bed.gz ENCODE4 v1.8.0 mm10 38 | ENCFF609MDN IDR thresholded peaks ENCSR491NUM TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000948 heart tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF609MDN/@@download/ENCFF609MDN.bed.gz ENCODE4 v1.5.1 mm10 39 | ENCFF999RIN IDR thresholded peaks ENCSR644VYX TF ChIP-seq /mouse-donors/ENCDO509HIY/ NTR:0000646 left cerebral cortex tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF999RIN/@@download/ENCFF999RIN.bed.gz ENCODE4 v1.8.0 mm10 40 | 
ENCFF089GBC IDR thresholded peaks ENCSR985ZTV TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0001891 midbrain tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF089GBC/@@download/ENCFF089GBC.bed.gz ENCODE4 v1.5.1 mm10 41 | ENCFF907XEG IDR thresholded peaks ENCSR104QEN TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0000945 stomach tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF907XEG/@@download/ENCFF907XEG.bed.gz ENCODE4 v1.5.1 mm10 42 | ENCFF092NMJ IDR thresholded peaks ENCSR150RGT TF ChIP-seq /mouse-donors/ENCDO956IXV/ UBERON:0002028 hindbrain tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF092NMJ/@@download/ENCFF092NMJ.bed.gz ENCODE4 v1.5.1 mm10 43 | ENCFF637GEK IDR thresholded peaks ENCSR024KGB TF ChIP-seq /mouse-donors/ENCDO509HIY/ UBERON:0000948 heart tissue CTCF-mouse https://www.encodeproject.org/files/ENCFF637GEK/@@download/ENCFF637GEK.bed.gz ENCODE4 v1.8.0 mm10 -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/._15_state_annotations.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/relationship_with_ct_spec_state/._15_state_annotations.csv -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/README.md: -------------------------------------------------------------------------------- 1 | This folder contains code to do analysis and create plots that replicate Figures S6 and S7 in the manuscript. 2 | - First, we obtained per-cell-type 15-chromatin state annotations for 66 reference epigenomes/cell types from Gorkin et al., 2020 , with download links curated and provided by Kwon and Ernst, 2021 (in Supplementary File 1, tab g). 
Note: the provided links show that all bed files are named following the format ```_15_segments.bed.gz``` and saved into a folder whose path is given by the variable ```ct_segment_folder``` in the provided snakefile ```mouse_random_represent_full_stack.snakefile``` (look at the first few lines of the file if you are unsure what we mean here). Within this folder, we store some example bed files representing per-ct annotations in the folder ```example_perCT_segment```. Also note that the segmentation files from Gorkin et al., 2020 are sorted bed files, and the code that we provide here assumes that the per-ct segmentation files are sorted. If you apply our code to data other than those provided by Gorkin et al., 2020, you need to sort the bed files by chromosome and then by start coordinate, using the command line ```zcat | sort -k1,1 -k2,2n > ```
- Second, download the file of mouse full-stack chromatin state annotation ```genome_100_segments.bed.gz```, introduced in our manuscript (see the main readme for download links).
- Next, you can start running our snakemake pipeline in ```mouse_random_represent_full_stack.snakefile``` to reproduce the plots in Fig. S6 and Fig. S7. To run the snakemake pipeline sequentially (which can take a long time), you can run ```snakemake --cores 1 --snakefile mouse_random_represent_full_stack.snakefile```. To run the snakemake pipeline such that jobs can be run in parallel, you can run ```snakemake -j --cluster "" --snakefile mouse_random_represent_full_stack.snakefile```, where for our local machine settings, `````` is ```qsub -V -l h_rt=4:00:00,h_data=4G```, specifying the time and memory needed for each job in the pipeline to run.
# Some scripts that are part of the pipeline (we already incorporated these files into the snakemake pipeline, so we do not go into detail here about where these scripts are used within the pipeline)
- File ```helper.py``` contains some useful functions used by most other scripts.
7 | - File ```get_mm_enrichments_across_ct.py``` 8 | ``` 9 | usage: get_mm_enrichments_across_ct.py [-h] [--input_folder INPUT_FOLDER] [--output_fn OUTPUT_FN] [--num_state_ct_model NUM_STATE_CT_MODEL] [--full_state_annot_fn FULL_STATE_ANNOT_FN] 10 | [--ct_state_annot_fn CT_STATE_ANNOT_FN] 11 | Create an excel of the top ct-spec states enriched in each full-stack state 12 | optional arguments: 13 | -h, --help show this help message and exit 14 | --input_folder INPUT_FOLDER 15 | where there are multiple subfolders, each containing enrichment data for different cell type specific model 16 | --output_fn OUTPUT_FN 17 | Where the files showing max, mean, min fold enrichment across all ct-state states are reported for all cell types for each state is reported 18 | --num_state_ct_model NUM_STATE_CT_MODEL 19 | number of states in the ct-spec models 20 | --full_state_annot_fn FULL_STATE_ANNOT_FN 21 | file of the full-stack state annotations 22 | --ct_state_annot_fn CT_STATE_ANNOT_FN 23 | file of the ct-spec state annotations 24 | ``` 25 | - File ```get_ranked_max_enriched_25_state.py``` 26 | ``` 27 | usage: get_ranked_max_enriched_25_state.py [-h] [--input_folder INPUT_FOLDER] [--output_fn OUTPUT_FN] [--num_state_ct_model NUM_STATE_CT_MODEL] [--full_state_annot_fn FULL_STATE_ANNOT_FN] 28 | [--ct_state_annot_fn CT_STATE_ANNOT_FN] [--ct_group_fn CT_GROUP_FN] 29 | 30 | Create an excel of the top ct-spec states enriched in each full-stack state 31 | 32 | optional arguments: 33 | -h, --help show this help message and exit 34 | --input_folder INPUT_FOLDER 35 | where there are multiple subfolders, each containing enrichment data for different cell type specific model 36 | --output_fn OUTPUT_FN 37 | Where the ct-state with maximum fold enrichment across all ct-state states are reported for all cell types for each state is reported 38 | --num_state_ct_model NUM_STATE_CT_MODEL 39 | number of states in the ct-spec models 40 | --full_state_annot_fn FULL_STATE_ANNOT_FN 41 | file of the 
full-stack state annotations 42 | --ct_state_annot_fn CT_STATE_ANNOT_FN 43 | file of the ct-spec state annotations 44 | --ct_group_fn CT_GROUP_FN 45 | file of the annotations of the cell types 46 | ``` 47 | - File ```calculate_summary_sample_regions.py``` 48 | ``` 49 | usage: calculate_summary_sample_regions.py [-h] [--all_seed_folder ALL_SEED_FOLDER] [--output_folder OUTPUT_FOLDER] [--num_state_ct_model NUM_STATE_CT_MODEL] [--full_state_annot_fn FULL_STATE_ANNOT_FN] 50 | [--ct_state_annot_fn CT_STATE_ANNOT_FN] [--ct_group_fn CT_GROUP_FN] 51 | 52 | Create an excel of the top ct-spec states enriched in each full-stack state 53 | 54 | optional arguments: 55 | -h, --help show this help message and exit 56 | --all_seed_folder ALL_SEED_FOLDER 57 | where there are multiple subfolders, each containing enrichment data for different cell type specific model 58 | --output_folder OUTPUT_FOLDER 59 | Where the ct-state with maximum fold enrichment across all ct-state states are reported for all cell types for each state is reported 60 | --num_state_ct_model NUM_STATE_CT_MODEL 61 | number of states in the ct-spec models 62 | --full_state_annot_fn FULL_STATE_ANNOT_FN 63 | file of the full-stack state annotations 64 | --ct_state_annot_fn CT_STATE_ANNOT_FN 65 | file of the ct-spec state annotations 66 | --ct_group_fn CT_GROUP_FN 67 | file of the annotations of the cell types 68 | ``` 69 | 70 | # License: 71 | All code is provided under the MIT Open Access License 72 | Copyright 2022 Ha Vu and Jason Ernst 73 | 74 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 75 | 76 | The above copyright notice and 
this permission notice shall be included in all copies or substantial portions of the Software. 77 | 78 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 79 | 80 | -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/example_perCT_segment/15_state_annotations.csv: -------------------------------------------------------------------------------- 1 | state mneumonics Meaning color 2 | 1 1_PrA Promoter Active #FF0000 3 | 2 2_PrW Promoter Weak #FF7D4D 4 | 3 3_PrB Promoter Bivalent #7030A0 5 | 4 4_PrF Promoter Flanking #FF7D4D 6 | 5 5_EnSd Enhancer Strong TSS-dist #FFA500 7 | 6 6_EnSp Enhancer Strong TSS-prox #FFA500 8 | 7 7_EnW Enhancer Weak, TSS-dist #FFFF00 9 | 8 8_EnPd Enhancer Poised, TSS-dist #E2D27D 10 | 9 9_EnPp Enhancer Poised, TSS-prox #E2D27D 11 | 10 10_TrS Transcription Strong #008000 12 | 11 11_TrP Transcription Permission #DFE7DE 13 | 12 12_TrI Transcription Initiation #008000 14 | 13 13_HcP Polycomb-assoc. 
#808080 15 | 14 14_HcH Heterochromatin H3K9me3 #B19CD9 16 | 15 15_Ns No signal (Quiescent) #FFFFFF 17 | -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/example_perCT_segment/P0_forebrain_15_segments.bed.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/relationship_with_ct_spec_state/example_perCT_segment/P0_forebrain_15_segments.bed.gz -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/example_perCT_segment/P0_lung_15_segments.bed.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/relationship_with_ct_spec_state/example_perCT_segment/P0_lung_15_segments.bed.gz -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/example_perCT_segment/P0_stomach_15_segments.bed.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/relationship_with_ct_spec_state/example_perCT_segment/P0_stomach_15_segments.bed.gz -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/example_perCT_segment/README.md: -------------------------------------------------------------------------------- 1 | - Each ```.bed.gz``` file in this folder represents the chromatin state annotation for one cell type (sample) from Gorkin et al., 2020 . We provide just 3 files, and only data for chromosome 19, for the sake of demonstration. If you want to replicate what we did to obtain Fig. 
S6+7 in Kwon and Ernst, 2021, you will have to download the raw data provided in the paper. 2 | - File ```tissue_annotation.csv``` shows metadata about different tissue types from Gorkin et al., 2020, so that we can color and group the cell types that have per-ct annotations. Columns: tissue_stage, ct, stage, color, group --> needed later for plotting and grouping the cell types by their tissue groups, along with color codes. 3 | 4 | - File ```15_state_annotations.csv``` shows metadata about 15 different states from the per-ct annotations, so that we can color the per-ct states as in Fig. S7 in the manuscript. -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/example_perCT_segment/tissue_annotation.csv: -------------------------------------------------------------------------------- 1 | tissue_stage ct stage color group 2 | e11.5_facial-prominence facial-prominence e11.5 #124c78 3 | e11.5_forebrain forebrain e11.5 #C5912B Brain 4 | e11.5_heart heart e11.5 #D56F80 Heart 5 | e11.5_hindbrain hindbrain e11.5 #C5912B Brain 6 | e11.5_limb limb e11.5 #F9B6CF Muscle 7 | e11.5_liver liver e11.5 #9BC2E6 Liver 8 | e11.5_midbrain midbrain e11.5 #C5912B Brain 9 | e11.5_neural-tube neural-tube e11.5 #FFD924 neuron 10 | e12.5_facial-prominence facial-prominence e12.5 #124c78 11 | e12.5_forebrain forebrain e12.5 #C5912B Brain 12 | e12.5_heart heart e12.5 #D56F80 Heart 13 | e12.5_hindbrain hindbrain e12.5 #C5912B Brain 14 | e12.5_limb limb e12.5 #F9B6CF Muscle 15 | e12.5_liver liver e12.5 #9BC2E6 Liver 16 | e12.5_midbrain midbrain e12.5 #C5912B Brain 17 | e12.5_neural-tube neural-tube e12.5 #FFD924 neuron 18 | e13.5_facial-prominence facial-prominence e13.5 #124c78 19 | e13.5_forebrain forebrain e13.5 #C5912B Brain 20 | e13.5_heart heart e13.5 #D56F80 Heart 21 | e13.5_hindbrain hindbrain e13.5 #C5912B Brain 22 | e13.5_limb limb e13.5 #F9B6CF Muscle 23 | e13.5_liver liver e13.5 #9BC2E6 Liver 24 | e13.5_midbrain midbrain 
e13.5 #C5912B Brain 25 | e13.5_neural-tube neural-tube e13.5 #FFD924 neuron 26 | e14.5_facial-prominence facial-prominence e14.5 #124c78 27 | e14.5_forebrain forebrain e14.5 #C5912B Brain 28 | e14.5_heart heart e14.5 #D56F80 Heart 29 | e14.5_hindbrain hindbrain e14.5 #C5912B Brain 30 | e14.5_intestine intestine e14.5 #D0A39B intestine 31 | e14.5_kidney kidney e14.5 #F4B084 kidney 32 | e14.5_limb limb e14.5 #F9B6CF Muscle 33 | e14.5_liver liver e14.5 #9BC2E6 Liver 34 | e14.5_lung lung e14.5 #E41A1C lung 35 | e14.5_midbrain midbrain e14.5 #C5912B Brain 36 | e14.5_neural-tube neural-tube e14.5 #FFD924 neuron 37 | e14.5_stomach stomach e14.5 #E5BDB5 stomach 38 | e15.5_facial-prominence facial-prominence e15.5 #124c78 39 | e15.5_forebrain forebrain e15.5 #C5912B Brain 40 | e15.5_heart heart e15.5 #D56F80 Heart 41 | e15.5_hindbrain hindbrain e15.5 #C5912B Brain 42 | e15.5_intestine intestine e15.5 #D0A39B intestine 43 | e15.5_kidney kidney e15.5 #F4B084 kidney 44 | e15.5_limb limb e15.5 #F9B6CF Muscle 45 | e15.5_liver liver e15.5 #9BC2E6 Liver 46 | e15.5_lung lung e15.5 #E41A1C lung 47 | e15.5_midbrain midbrain e15.5 #C5912B Brain 48 | e15.5_neural-tube neural-tube e15.5 #FFD924 neuron 49 | e15.5_stomach stomach e15.5 #E5BDB5 stomach 50 | e16.5_forebrain forebrain e16.5 #C5912B Brain 51 | e16.5_heart heart e16.5 #D56F80 Heart 52 | e16.5_hindbrain hindbrain e16.5 #C5912B Brain 53 | e16.5_intestine intestine e16.5 #D0A39B intestine 54 | e16.5_kidney kidney e16.5 #F4B084 kidney 55 | e16.5_liver liver e16.5 #9BC2E6 Liver 56 | e16.5_lung lung e16.5 #E41A1C lung 57 | e16.5_midbrain midbrain e16.5 #C5912B Brain 58 | e16.5_stomach stomach e16.5 #E5BDB5 stomach 59 | P0_forebrain forebrain P0 #C5912B Brain 60 | P0_heart heart P0 #D56F80 Heart 61 | P0_hindbrain hindbrain P0 #C5912B Brain 62 | P0_intestine intestine P0 #D0A39B intestine 63 | P0_kidney kidney P0 #F4B084 kidney 64 | P0_liver liver P0 #9BC2E6 Liver 65 | P0_lung lung P0 #E41A1C lung 66 | P0_midbrain midbrain P0 #C5912B 
Brain 67 | P0_stomach stomach P0 #E5BDB5 stomach 68 | -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/get_mm_enrichments_across_ct.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import seaborn as sns 3 | import numpy as np 4 | import os 5 | import helper 6 | import glob 7 | import argparse 8 | parser = argparse.ArgumentParser(description = 'Create an excel of the top ct-spec states enriched in each full-stack state') 9 | parser.add_argument('--input_folder', type=str, 10 | help = "where there are multiple subfolders, each containing enrichment data for different cell type specific model") 11 | parser.add_argument('--output_fn', type=str, 12 | help = 'Where the files showing max, mean, min fold enrichment across all ct-state states are reported for all cell types for each state is reported') 13 | parser.add_argument('--num_state_ct_model', type=int, 14 | help = 'number of states in the ct-spec models') 15 | parser.add_argument('--full_state_annot_fn', type=str, 16 | help = 'file of the full-stack state annotations') 17 | parser.add_argument('--ct_state_annot_fn', type=str, 18 | help = 'file of the ct-spec state annotations') 19 | args = parser.parse_args() 20 | print(args) 21 | helper.check_dir_exist(args.input_folder) 22 | helper.create_folder_for_file(args.output_fn) 23 | helper.check_file_exist(args.full_state_annot_fn) 24 | helper.check_file_exist(args.ct_state_annot_fn) 25 | 26 | def get_ct_spec_state_annot(ct_state_annot_fn): 27 | df = pd.read_csv(ct_state_annot_fn, header = 0, index_col = None, sep = '\t') # state, mneumonics, meaning, color 28 | CT_STATE_MNEUNOMICS_LIST = list(df.mneumonics) 29 | CT_STATE_COLOR_DICT = pd.Series(df.color.values, index = df.mneumonics).to_dict() # keys: state mneumonics, values: state color 30 | return CT_STATE_MNEUNOMICS_LIST, CT_STATE_COLOR_DICT 31 | 32 | CT_STATE_MNEUNOMICS_LIST, 
CT_STATE_COLOR_DICT = get_ct_spec_state_annot(args.ct_state_annot_fn) 33 | 34 | def get_full_state_annot_df(full_state_annot_fn): 35 | df = pd.read_csv(full_state_annot_fn, header = 0, index_col = None, sep = '\t') 36 | FULL_STATE_COLOR_DICT = pd.Series(df.color.values, index = df.mneumonics) 37 | df = df[['state', 'mneumonics', 'state_order_by_group']] 38 | return FULL_STATE_COLOR_DICT, df 39 | 40 | FULL_STATE_COLOR_DICT, FULL_STATE_ANNOT_DF = get_full_state_annot_df(args.full_state_annot_fn) #FULL_STATE_ANNOT_DF: state, mneumonics, state_order_by_group, FULL_STATE_COLOR_DICT: keys mneumonics, values color 41 | 42 | def color_full_state_names(val): 43 | if val == "": 44 | color = '#FFFFFF' 45 | else: 46 | color = FULL_STATE_COLOR_DICT[val] 47 | return 'background-color: %s' % color 48 | 49 | def get_rid_of_stupid_file_tail(context_name): 50 | if context_name.endswith('.bed.gz'): 51 | return(context_name[:-7]) 52 | else: 53 | return(context_name) 54 | 55 | 56 | def get_one_enrichment_ct_model_df(fn): 57 | df = pd.read_csv(fn, sep = '\t', header = 0) 58 | df = df.rename(columns = {'state (Emission order)' : 'state', 'Genome %': 'percent_in_genome'}) # rename some columns so that it is easier to write column names later 59 | context_name_list = list(map(get_rid_of_stupid_file_tail, df.columns[2:])) # get the names of enrichment contexts. 
If a context name ends with .bed.gz, that suffix is stripped 60 | df.columns = ['state', 'percent_in_genome'] + context_name_list 61 | df = df.drop([df.shape[0] - 1], axis = 0) # drop the last row, which reports the percentage of the genome that each enrichment context occupies 62 | return df 63 | 64 | def get_mmm_enrichment_one_genome_context(all_ct_df_list, all_ct_list, state_context_colName): 65 | # state_context_colName: name of the column to read from each cell type's data frame 66 | this_context_df = pd.DataFrame({'state' : (all_ct_df_list[0])['state']}) 67 | # collect data from each of the cell type specific models --> columns: cell type specific models, rows: 100 full-stack states 68 | for df_index, df in enumerate(all_ct_df_list): 69 | ct_name = all_ct_list[df_index] # all_ct_list and all_ct_df_list are parallel lists, i.e. the elements at the same index correspond to the same cell type. We tried zipping the two lists, but that gave an error message, so this code is not the most graceful. 
70 | this_context_df[ct_name] = df[state_context_colName] 71 | this_context_df['max_enrichment'] = (this_context_df.drop('state', axis = 1)).max(axis = 1) 72 | this_context_df['min_enrichment'] = (this_context_df.drop('state', axis = 1)).min(axis = 1) 73 | this_context_df['median_enrichment'] = (this_context_df.drop('state', axis = 1)).median(axis = 1) 74 | return this_context_df 75 | 76 | def get_25_state_annot(state): 77 | # '1' --> '1_TssA' 78 | state_index = int(state) - 1 # zero-based 79 | return CT_STATE_MNEUNOMICS_LIST[state_index] 80 | 81 | def get_CT_STATE_COLOR_DICT(state): 82 | # '1_TssA' --> 'red' 83 | # CT_STATE_COLOR_DICT is keyed by the state mneumonics (e.g. '1_TssA'), not by a numeric index 84 | color = CT_STATE_COLOR_DICT[state] 85 | return 'background-color: %s' % color 86 | 87 | def paint_excel_mmm_enrichment(enrichment_df, num_state_ct_model): 88 | # here enrichment_df is one of the max, min or median enrichment data frames from get_all_ct_model_mmm_enrichment 89 | cm = sns.light_palette("red", as_cmap=True) 90 | (num_state, num_enr_cont) = (enrichment_df.shape[0], enrichment_df.shape[1] - 1) 91 | state_colName_list = list(map(lambda x: 'U' + str(x + 1), range(num_state_ct_model))) #CUSTOM: this is a line of code that could be improved for better customization 92 | enrichment_df['state'] = enrichment_df['state'].astype(int) 93 | enrichment_df = enrichment_df.merge(FULL_STATE_ANNOT_DF, how = 'left', left_on = 'state', right_on = 'state') 94 | enrichment_df = enrichment_df[['state', 'percent_in_genome'] + state_colName_list + ['mneumonics', 'state_order_by_group']] # get the data frame to display columns in the expected order 95 | enrichment_df.columns = ['state', 'percent_in_genome'] + CT_STATE_MNEUNOMICS_LIST + ['mneumonics', 'state_order_by_group'] # rename the columns so that instead of 'state1.bed.gz' we have '1_TssA' 96 | enrichment_df = enrichment_df.sort_values(by = 'state_order_by_group') 97 | colored_df = enrichment_df.style.background_gradient(subset = pd.IndexSlice[:, CT_STATE_MNEUNOMICS_LIST], 
cmap = cm) # color the enrichment values into a red-white gradient in the first enrichment context 98 | colored_df = colored_df.applymap(color_full_state_names, subset = pd.IndexSlice[:, ['mneumonics']]) 99 | return colored_df 100 | 101 | 102 | def get_all_ct_model_mmm_enrichment(input_folder, output_fn, num_state_ct_model): 103 | all_ct_fn_list = glob.glob(input_folder + '/*/overlap_enrichment.txt') 104 | all_ct_list = list(map(lambda x: x.split('/')[-2], all_ct_fn_list)) # from '/path/to/E129/overlap_enrichment.txt' to 'E129' 105 | all_ct_df_list = list(map(lambda x: get_one_enrichment_ct_model_df(x), all_ct_fn_list)) 106 | all_ct_df_list = list(all_ct_df_list) 107 | # up until here, we have finished getting all the data from all the overlap enrichment files of all cell types into data frame --> all_ct_df_list [index] --> data frame of the cell type 108 | print ("Done getting all enrichment data from all cell types") 109 | # create data frames that will report the max, median and min enrichment over all the cell types 110 | max_enrichment_df = pd.DataFrame({'state': (all_ct_df_list[0])['state']}) # for each full-stack state and each enrichment context, the highest enrichment across all cell types 111 | # same layout, reporting the median enrichment 112 | median_enrichment_df = pd.DataFrame({'state': (all_ct_df_list[0])['state']}) # for each full-stack state and each enrichment context, the median enrichment across all cell types 113 | # same layout, reporting the lowest enrichment 114 | min_enrichment_df = pd.DataFrame({'state': (all_ct_df_list[0])['state']}) # for each full-stack state and each enrichment context, the lowest enrichment across all cell types 
115 | # now loop through each of the context that we care about 116 | enrichment_context_list = ['percent_in_genome'] + list((all_ct_df_list[0]).columns[2:]) # this is because we also want to report the max, median, min of percent_in_genome_of data from all ct-spec enrichments 117 | print (enrichment_context_list) 118 | for enr_context in enrichment_context_list: 119 | this_context_mmm_enrichment_df = get_mmm_enrichment_one_genome_context(all_ct_df_list, all_ct_list, enr_context) 120 | max_enrichment_df[enr_context] = this_context_mmm_enrichment_df['max_enrichment'] 121 | min_enrichment_df[enr_context] = this_context_mmm_enrichment_df['min_enrichment'] 122 | median_enrichment_df[enr_context] = this_context_mmm_enrichment_df['median_enrichment'] 123 | print ("Done getting all the necessary data!") 124 | max_colored_df = paint_excel_mmm_enrichment(max_enrichment_df, num_state_ct_model) 125 | min_colored_df = paint_excel_mmm_enrichment(min_enrichment_df, num_state_ct_model) 126 | median_colored_df = paint_excel_mmm_enrichment(median_enrichment_df, num_state_ct_model) 127 | # now save into 3 sheets 128 | writer = pd.ExcelWriter(output_fn, engine='xlsxwriter') 129 | max_colored_df.to_excel(writer, sheet_name='max') 130 | min_colored_df.to_excel(writer, sheet_name='min') 131 | median_colored_df.to_excel(writer, sheet_name='median') 132 | writer.close() # ExcelWriter.save() was removed in pandas 2.0; close() writes the file 133 | print ("Done getting the mmm_enrichment_df!") 134 | 135 | get_all_ct_model_mmm_enrichment(args.input_folder, args.output_fn, args.num_state_ct_model) 136 | -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/get_ranked_max_enriched_25_state.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import seaborn as sns 3 | import numpy as np 4 | import os 5 | import sys 6 | import helper 7 | import glob 8 | import argparse 9 | parser = argparse.ArgumentParser(description = 'Create an excel of the top 
ct-spec states enriched in each full-stack state') 10 | parser.add_argument('--input_folder', type=str, 11 | help = "where there are multiple subfolders, each containing enrichment data for different cell type specific model") 12 | parser.add_argument('--output_fn', type=str, 13 | help = 'Where the ct-state with maximum fold enrichment across all ct-state states are reported for all cell types for each state is reported') 14 | parser.add_argument('--num_state_ct_model', type=int, 15 | help = 'number of states in the ct-spec models') 16 | parser.add_argument('--full_state_annot_fn', type=str, 17 | help = 'file of the full-stack state annotations') 18 | parser.add_argument('--ct_state_annot_fn', type=str, 19 | help = 'file of the ct-spec state annotations') 20 | parser.add_argument('--ct_group_fn', type=str, 21 | help = 'file of the annotations of the cell types') 22 | args = parser.parse_args() 23 | print(args) 24 | helper.check_dir_exist(args.input_folder) 25 | helper.create_folder_for_file(args.output_fn) 26 | helper.check_file_exist(args.full_state_annot_fn) 27 | helper.check_file_exist(args.ct_state_annot_fn) 28 | helper.check_file_exist(args.ct_group_fn) 29 | 30 | def get_full_state_annot_df(full_state_annot_fn): 31 | df = pd.read_csv(full_state_annot_fn, header = 0, index_col = None, sep = '\t') 32 | FULL_STATE_COLOR_DICT = pd.Series(df.color.values, index = df.mneumonics) 33 | df = df[['state', 'mneumonics', 'state_order_by_group']] 34 | return FULL_STATE_COLOR_DICT, df 35 | 36 | FULL_STATE_COLOR_DICT, FULL_STATE_ANNOT_DF = get_full_state_annot_df(args.full_state_annot_fn) #FULL_STATE_ANNOT_DF: state, mneumonics, state_order_by_group, FULL_STATE_COLOR_DICT: keys mneumonics, values color 37 | 38 | def get_celltype_color_map(ct_group_fn): 39 | df = pd.read_csv(ct_group_fn, header = 0, index_col = None, sep = '\t') 40 | CELLTYPE_COLOR_MAP = pd.Series(df.color.values, index = df.tissue_stage).to_dict() # convert two columns in to a dictionary: keys: cell type, 
values: color corresponding to that tissue 41 | CELLTYPE_COLOR_MAP['NA'] = '#E5B8E8' 42 | CELLTYPE_CELL_GROUP_MAP = pd.Series(df.group.values, index = df.tissue_stage).to_dict() # keys: cell type name, values: cell group 43 | group_color_df = (df[['group', 'color']]).drop_duplicates() 44 | CELL_GROUP_COLOR_CODE_DICT = pd.Series(group_color_df.color.values, index = group_color_df.group).to_dict() # keys: group, values: color 45 | return CELLTYPE_COLOR_MAP, CELLTYPE_CELL_GROUP_MAP, CELL_GROUP_COLOR_CODE_DICT, df 46 | 47 | CELLTYPE_COLOR_MAP, CELLTYPE_CELL_GROUP_MAP, CELL_GROUP_COLOR_CODE_DICT, CT_DF = get_celltype_color_map(args.ct_group_fn) 48 | 49 | def get_ct_spec_state_annot(ct_state_annot_fn): 50 | df = pd.read_csv(ct_state_annot_fn, header = 0, index_col = None, sep = '\t') # state, mneumonics, meaning, color 51 | CT_STATE_MNEUNOMICS_LIST = list(df.mneumonics) 52 | CT_STATE_COLOR_DICT = pd.Series(df.color.values, index = df.mneumonics).to_dict() # keys: state mneumonics, values: state color 53 | return CT_STATE_MNEUNOMICS_LIST, CT_STATE_COLOR_DICT 54 | 55 | CT_STATE_MNEUNOMICS_LIST, CT_STATE_COLOR_DICT = get_ct_spec_state_annot(args.ct_state_annot_fn) 56 | 57 | def get_rid_of_stupid_file_tail(context_name): 58 | if context_name.endswith('.bed.gz'): 59 | return(context_name[:-7]) 60 | else: 61 | return(context_name) 62 | 63 | def color_full_state_names(val): 64 | if val == "": 65 | color = '#FFFFFF' 66 | else: 67 | color = FULL_STATE_COLOR_DICT[val] 68 | return 'background-color: %s' % color 69 | 70 | def color_cell_group_names(val): 71 | if val == "": 72 | color = '#FFFFFF' 73 | else: 74 | color = CELLTYPE_COLOR_MAP[val] 75 | return 'background-color: %s' % color 76 | 77 | def color_tissue_names(val): 78 | if val == "": 79 | color = '#FFFFFF' 80 | else: 81 | color = CELL_GROUP_COLOR_CODE_DICT[val] 82 | return 'background-color: %s' % color 83 | 84 | def get_one_enrichment_ct_model_df(fn, num_state_ct_model): 85 | df = pd.read_csv(fn, sep = '\t', header = 0) 86 
| df = df.rename(columns = {'state (Emission order)' : 'state', 'Genome %': 'percent_in_genome'}) # rename some columns so that it is easier to write column names later 87 | state_colName_list = list(map(lambda x: 'U' + str(x + 1), range(num_state_ct_model))) #CUSTOM: this is a line of code that could be improved for better customization 88 | df = df[['state', 'percent_in_genome'] + state_colName_list] # get the data frame to display columns in the expected order 89 | df.columns = ['state', 'percent_in_genome'] + CT_STATE_MNEUNOMICS_LIST # rename the columns so that instead of 'state1.bed.gz' we have '1_TssA' 90 | df['max_enrichment'] = (df.drop(['state', 'percent_in_genome'], axis = 1)).max(axis = 1) # find the maximum enrichment values in this row 91 | df['max_enrichment_context'] = (df.drop(['state', 'percent_in_genome'], axis = 1)).idxmax(axis = 1) 92 | (nrow, ncol) = df.shape 93 | df = df.drop(nrow - 1) # drop the last row, which is the 'Base' row with percentage that each enrichment context occupies the genome 94 | return df 95 | 96 | def get_25_state_annot(state): 97 | # 'state_1' --> ''1_TssA'' 98 | state_index = int(state.split('_')[1]) - 1 # zero-based 99 | return CT_STATE_MNEUNOMICS_LIST[state_index] 100 | 101 | def get_CT_STATE_COLOR_DICT(state): 102 | # '1_TssA' --> 'red' 103 | color = CT_STATE_COLOR_DICT[state] 104 | return 'background-color: %s' % color 105 | 106 | def get_max_enrichment_25_state_df_ordered_time_stamp(max_enrich_25_state_all_ct_df): 107 | """ 108 | max_enrich_25_state_all_ct_df: rows: full stack states, columns: state of the 25-state system that is most enriched with the full-stack states WITHIN a cell type. Cell type are the column names 109 | --> a plot where cell types are juxtaposed based on the cell groups they belong to. 
The output dataframe will be colored properly 110 | """ 111 | cell_group_df = pd.DataFrame(columns = ['tissue_stage']) # rows: the cell types that are used in this analysis (colnames of max_enrich_25_state_all_ct_df), columns : 112 | cell_group_df['tissue_stage'] = max_enrich_25_state_all_ct_df.columns[1:] # we skip the first column because that's 'state' 113 | cell_group_df = pd.merge(cell_group_df, CT_DF, how = 'left', left_on = 'tissue_stage', right_on = 'tissue_stage') # merge the cell types so that we can get the information that we want 114 | # Now let's get the count of the number of cell types that are of each cell groups, and then get the unique cell groups, ordered by descending counts 115 | unique_cell_groups = ['forebrain', 'midbrain', 'hindbrain', 'neural-tube', 'facial-prominence', 'limb', 'intestine', 'stomach', 'liver', 'kidney', 'lung', 'heart']#CUSTOM: change this line of code for better customization 116 | unique_stages = ['P0', 'e10.5', 'e11.5', 'e12.5', 'e13.5', 'e14.5', 'e15.5', 'e16.5']#CUSTOM: change this line of code for better customization 117 | cell_group_df['ct'] = pd.Categorical(cell_group_df['ct'], unique_cell_groups) # so that we can easily sort them later based on this custom order 118 | # list of cell types, arranged such that those of the same cell_groups are juxtaposed 119 | rearranged_tissue_stages = [] 120 | for stage in unique_stages: 121 | this_cg_df = cell_group_df[cell_group_df['stage'] == stage] # filter out rows with this cellgroup 122 | this_cg_df = this_cg_df.sort_values(['ct'], ignore_index = True) 123 | rearranged_tissue_stages += list(this_cg_df['tissue_stage']) 124 | # now we will rearrange the columns of max_enrich_25_state_all_ct_df based on the order that we just got from rearranged_tissue_stages 125 | max_enrich_25_state_all_ct_df['state'] = max_enrich_25_state_all_ct_df['state'].astype(int) 126 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df.merge(FULL_STATE_ANNOT_DF, how = 'left', left_on = 
'state', right_on = 'state') 127 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df[['state'] + rearranged_tissue_stages + ['mneumonics', 'state_order_by_group']] 128 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df.sort_values('state_order_by_group', ignore_index = True) 129 | last_row_index = max_enrich_25_state_all_ct_df.shape[0] 130 | max_enrich_25_state_all_ct_df.loc[last_row_index] = [''] + rearranged_tissue_stages + ['', ''] # add one more row that specific the cell_group of each of the cell types 131 | colored_25_state_df = max_enrich_25_state_all_ct_df.style.applymap(color_cell_group_names, subset = pd.IndexSlice[last_row_index, :]) # color the row that contain the cell_group of cell types 132 | colored_25_state_df = colored_25_state_df.applymap(get_CT_STATE_COLOR_DICT, subset = pd.IndexSlice[:(last_row_index - 1) , colored_25_state_df.columns[1:max_enrich_25_state_all_ct_df.shape[1]-2]]) # color the states of the 25-system that is most enriched in each of the full-stack states, exclude the first column and the last two columns 133 | colored_25_state_df = colored_25_state_df.applymap(color_full_state_names, subset = pd.IndexSlice[:,['mneumonics']]) # color full stack state names 134 | return colored_25_state_df 135 | 136 | 137 | def get_max_enrichment_25_state_df_ordered_cell_group(max_enrich_25_state_all_ct_df): 138 | """ 139 | max_enrich_25_state_all_ct_df: rows: full stack states, columns: state of the 25-state system that is most enriched with the full-stack states WITHIN a cell type. Cell type are the column names 140 | --> a plot where cell types are juxtaposed based on the cell groups they belong to. 
The output dataframe will be colored properly 141 | """ 142 | cell_group_df = pd.DataFrame(columns = ['tissue_stage']) # rows: the cell types that are used in this analysis (colnames of max_enrich_25_state_all_ct_df), columns : 143 | cell_group_df['tissue_stage'] = max_enrich_25_state_all_ct_df.columns[1:] # we skip the first column because that's 'state' 144 | cell_group_df = pd.merge(cell_group_df, CT_DF, how = 'left', left_on = 'tissue_stage', right_on = 'tissue_stage') # merge the cell types so that we can get the information that we want 145 | # Now let's get the count of the number of cell types that are of each cell groups, and then get the unique cell groups, ordered by descending counts 146 | unique_cell_groups = ['forebrain', 'midbrain', 'hindbrain', 'neural-tube', 'facial-prominence', 'limb', 'intestine', 'stomach', 'liver', 'kidney', 'lung', 'heart']#CUSTOM: change this line of code for better customization 147 | unique_stages = [ 'P0', 'e10.5', 'e11.5', 'e12.5', 'e13.5', 'e14.5', 'e15.5', 'e16.5']#CUSTOM: change this line of code for better customization 148 | cell_group_df['stage'] = pd.Categorical(cell_group_df['stage'], unique_stages) # so that we can easily sort them later based on this custom order 149 | # list of cell types, arranged such that those of the same cell_groups are juxtaposed 150 | rearranged_tissue_stages = [] 151 | for cg in unique_cell_groups: 152 | this_cg_df = cell_group_df[cell_group_df['ct'] == cg] # filter out rows with this cellgroup 153 | this_cg_df = this_cg_df.sort_values(['stage'], ignore_index = True) 154 | rearranged_tissue_stages += list(this_cg_df['tissue_stage']) 155 | max_enrich_25_state_all_ct_df['state'] = max_enrich_25_state_all_ct_df['state'].astype(int) # so that we can merge with the df of full stack state annotation 156 | # now we will rearrange the columns of max_enrich_25_state_all_ct_df based on the order that we just got from rearranged_tissue_stage 157 | max_enrich_25_state_all_ct_df = 
max_enrich_25_state_all_ct_df.merge(FULL_STATE_ANNOT_DF, how = 'left', left_on = 'state', right_on = 'state') 158 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df[['state'] + rearranged_tissue_stages + ['mneumonics', 'state_order_by_group']] 159 | max_enrich_25_state_all_ct_df = max_enrich_25_state_all_ct_df.sort_values('state_order_by_group', ignore_index = True) 160 | last_row_index = max_enrich_25_state_all_ct_df.shape[0] 161 | max_enrich_25_state_all_ct_df.loc[last_row_index] = [''] + rearranged_tissue_stages + ['', ''] # add one more row that specific the cell_group of each of the cell types 162 | colored_25_state_df = max_enrich_25_state_all_ct_df.style.applymap(color_cell_group_names, subset = pd.IndexSlice[last_row_index, :]) # color the row that contain the cell_group of cell types 163 | colored_25_state_df = colored_25_state_df.applymap(get_CT_STATE_COLOR_DICT, subset = pd.IndexSlice[:(last_row_index - 1) , colored_25_state_df.columns[1:max_enrich_25_state_all_ct_df.shape[1]-2]]) # color the states of the 25-system that is most enriched in each of the full-stack states, exclude the first column and the last two columns 164 | colored_25_state_df = colored_25_state_df.applymap(color_full_state_names, subset = pd.IndexSlice[:,['mneumonics']]) # color full stack state names 165 | return colored_25_state_df 166 | 167 | 168 | def get_rank_25_state_df(max_enrich_25_state_all_ct_df, max_enrich_value_all_ct_df, output_fn): 169 | (num_full_state, num_ct) = max_enrich_25_state_all_ct_df.shape # nrow, ncol 170 | # color 25_state_cell_group_df: df where cell types of the same cell group are juxtaposed 171 | colored_25_state_cell_group_df = get_max_enrichment_25_state_df_ordered_cell_group(max_enrich_25_state_all_ct_df) 172 | # color colored_25_state_stage_df: df where the samples of the same time stamps are juxtaposed 173 | colored_25_state_stage_df = get_max_enrichment_25_state_df_ordered_time_stamp(max_enrich_25_state_all_ct_df) 174 | # save the excel 
175 | writer = pd.ExcelWriter(output_fn, engine='xlsxwriter') 176 | colored_25_state_cell_group_df.to_excel(writer, sheet_name = 'cell_group_25_state') 177 | colored_25_state_stage_df.to_excel(writer, sheet_name = 'stage_25_state') 178 | writer.close() # ExcelWriter.save() was removed in pandas 2.0; close() writes the file 179 | 180 | 181 | 182 | def get_ranked_max_enriched_25_state(input_folder, output_fn, num_state_ct_model): 183 | all_ct_fn_list = glob.glob(input_folder + '/*/overlap_enrichment.txt') 184 | all_ct_list = list(map(lambda x: x.split('/')[-2], all_ct_fn_list)) # from '/path/to/E129/overlap_enrichment.txt' to 'E129' 185 | all_ct_df_list = list(map(lambda x: get_one_enrichment_ct_model_df(x, num_state_ct_model), all_ct_fn_list)) 186 | 187 | max_enrich_value_all_ct_df = pd.DataFrame({'state' : (all_ct_df_list[0])['state']}) # create a data frame with only a column called state that are the states in the full-stack model. This data frame store the enrichment values 188 | max_enrich_25_state_all_ct_df = pd.DataFrame({'state' : (all_ct_df_list[0])['state']}) # this data frame stores the names of states that are most enriched with each of the full-stack state in each of the cell type that we look at 189 | for ct_index, ct in enumerate(all_ct_list): 190 | this_ct_fn = all_ct_fn_list[ct_index] # get the overlap_enrichment file associated with this cell type 191 | this_ct_enrichment_df = get_one_enrichment_ct_model_df(this_ct_fn, num_state_ct_model) # get the enrichment data frame associated with this cell type 192 | max_enrich_value_all_ct_df[ct] = this_ct_enrichment_df['max_enrichment'] # store the values of maximum enrichment in this cell type 193 | max_enrich_25_state_all_ct_df[ct] = this_ct_enrichment_df['max_enrichment_context'] # store the names of the state that is most enriched with each of the full-stack state for this one cell type 194 | print ("Done getting data from all cell types " ) 195 | get_rank_25_state_df(max_enrich_25_state_all_ct_df, max_enrich_value_all_ct_df, output_fn) 196 | print ("Done ranking enrichment 
data across cell types ") 197 | 198 | get_ranked_max_enriched_25_state(args.input_folder, args.output_fn, args.num_state_ct_model) 199 | 200 | -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/helper.py: -------------------------------------------------------------------------------- 1 | import string 2 | import os 3 | import sys 4 | import time 5 | def make_dir(directory): 6 | try: 7 | os.makedirs(directory) 8 | except FileExistsError: 9 | print ( 'Folder ' + directory + ' is already created') 10 | 11 | 12 | 13 | def check_file_exist(fn): 14 | if not os.path.isfile(fn): 15 | print ( "File: " + fn + " DOES NOT EXIST") 16 | sys.exit(1) 17 | return 18 | 19 | def check_dir_exist(fn): 20 | if not os.path.isdir(fn): 21 | print ( "Directory: " + fn + " DOES NOT EXIST") 22 | sys.exit(1) 23 | return 24 | 25 | def create_folder_for_file(fn): 26 | last_slash_index = fn.rfind('/') # the LAST slash separates the folder path from the file name 27 | if last_slash_index > 0: # path contains a folder 28 | make_dir(fn[:last_slash_index]) 29 | return 30 | 31 | def get_command_line_integer(arg): 32 | try: 33 | arg = int(arg) 34 | return arg 35 | except (TypeError, ValueError): 36 | print ( "Integer: " + str(arg) + " IS NOT VALID") 37 | sys.exit(1) 38 | -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/mouse_random_represent_full_stack.snakefile: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | full_stack_segment_fn = 'genome_100_segments.bed.gz' # should be replaced with where you store the 100-state chromatin state annotation 4 | ct_segment_folder = './example_perCT_segment/' # should be replaced with the folder path to where you store the per_ct annotations. 
NOTE: the code assumes that your per-ct annotations (bed files) are sorted; if not, you need to sort them with the command line: zcat | sort -k1,1 -k2,2n > 5 | ct_list = glob.glob(ct_segment_folder + '/*_15_segments.bed.gz') # we assume that within the folder ct_segment_folder, the files are named based on format _15_segments.bed.gz 6 | ct_list = list(map(lambda x: x.split('/')[-1].split('_15_segments.bed.gz')[0], ct_list)) 7 | NUM_SAMPLE_SEGMENT_PER_STATE = 100 8 | output_folder = './output//random_represent_with_15state' # output folder of analysis of the estimated overlap probability for full-stack states with per-ct states 9 | overlap_output_folder = './output//overlap_with_ct_segment' # output folder of the analysis of overlap enrichment of full-stack states with per-ct states 10 | ct_state_annot_fn = './example_perCT_segment/15_state_annotations.csv' # where we can get the metadata of the states within the per-ct chromatin state model 11 | full_state_annot_fn = '../state_annotation_processed.csv' # path to the file of full-stack states' metadata (state characterizations) 12 | ct_group_fn = './example_perCT_segment/tissue_annotation.csv' 13 | num_state_ct_model = 15 14 | 15 | seed_list = [800, 922, 23, 204, 132, 992, 60, 650, 761, 154, 432, 760, 999, 969, 955, 986, 981, 246, 35, 438, 116] 16 | rule all: 17 | input: 18 | os.path.join(output_folder, 'summary', 'avg_prop_stateCT_per_group.xlsx'), 19 | os.path.join(overlap_output_folder, 'summary', 'max_enriched_states.xlsx'), 20 | 21 | rule sample_segment_equal_per_state: 22 | # sample regions on the genome such that, for each state, the number of sampled regions is similar 23 | input: 24 | full_stack_segment_fn, # from raw data 25 | output: 26 | (os.path.join(output_folder, 'seed_{seed}', 'temp_sample_region.bed.gz')) 27 | shell: 28 | """ 29 | python sample_region_for_state_representation.py {input} {NUM_SAMPLE_SEGMENT_PER_STATE} {output} {wildcards.seed} 30 | """ 31 | 32 | rule overlap_sample_region_with_ct_segment: 33 | 
input: 34 | (os.path.join(output_folder, 'seed_{seed}', 'temp_sample_region.bed.gz')), # from rule sample_segment_equal_per_state 35 | expand(os.path.join(ct_segment_folder, '{ct}_15_segments.bed.gz'), ct = ct_list) # Note we made the assumption that these files are all sorted 'sort -k1,1 -k2,2n' 36 | output: 37 | os.path.join(output_folder, 'seed_{seed}', 'sample_segment_fullStack_ctState.bed.gz'), 38 | params: 39 | ct_list_string = " ".join(ct_list), 40 | output_no_gz = os.path.join(output_folder, 'seed_{seed}', 'sample_segment_fullStack_ctState.bed') 41 | shell: 42 | """ 43 | command="zcat {input[0]} | sort -k1,1 -k2,2n " # first sort the file of sample regions, where each state appears NUM_SAMPLE_SEGMENT_PER_STATE times 44 | output_header="chrom\\tstart\\tend\\tfull_stack" 45 | rm -f {output} # so we can overwrite it 46 | rm -f {params.output_no_gz} # so we can overwrite it 47 | for ct in {params.ct_list_string} 48 | do 49 | ct_segment_fn={ct_segment_folder}/${{ct}}_15_segments.bed.gz 50 | command="$command | bedtools map -a stdin -b ${{ct_segment_fn}} -c 4 -o collapse" 51 | output_header="${{output_header}}\\t${{ct}}" 52 | done 53 | command="$command >> {params.output_no_gz} " 54 | # now onto the writing part 55 | echo -e $output_header > {params.output_no_gz} # write the header first # need the -e flag so that the tab escapes are interpreted as tabs 56 | eval $command # write the content 57 | gzip {params.output_no_gz} # gzip 58 | """ 59 | 60 | rule calculate_overlap_enrichment_with_ct_segment: 61 | input: 62 | full_stack_segment_fn, 63 | os.path.join(ct_segment_folder, '{ct}_15_segments.bed.gz'), 64 | output: 65 | os.path.join(overlap_output_folder, '{ct}', 'overlap_enrichment.txt'), 66 | params: 67 | code='/Users/vuthaiha/Desktop/window_hoff/source/25state_enrichments/overlap_enrichment_with_ct_spec_state.py', 68 | num_other_states = 15, 69 | shell: 70 | """ 71 | python {params.code} --reference_segment_fn {full_stack_segment_fn} --other_segment_fn {input[1]} 
--num_other_states {params.num_other_states} --output_fn {output} 72 | """ 73 | 74 | rule calculate_summary_overlap_with_ct_segment: 75 | input: 76 | expand(os.path.join(overlap_output_folder, '{ct}', 'overlap_enrichment.txt'), ct = ct_list), 77 | output: 78 | os.path.join(overlap_output_folder, 'summary', 'max_enriched_states.xlsx'), 79 | shell: 80 | """ 81 | python /Users/vuthaiha/Desktop/window_hoff/source/mm10_annotations/relationship_with_ct_spec_state/get_ranked_max_enriched_25_state.py --input_folder {overlap_output_folder} --output_fn {output} --num_state_ct_model {num_state_ct_model} --full_state_annot_fn {full_state_annot_fn} --ct_state_annot_fn {ct_state_annot_fn} --ct_group_fn {ct_group_fn} 82 | """ 83 | 84 | rule calculate_summary_prob_overlap_with_ct_segment: 85 | input: 86 | expand(os.path.join(output_folder, 'seed_{seed}', 'sample_segment_fullStack_ctState.bed.gz'), seed = seed_list) 87 | output: 88 | os.path.join(output_folder, 'summary', 'avg_prop_stateCT_per_group.xlsx') 89 | params: 90 | prob_summ_folder = os.path.join(output_folder, 'summary') 91 | shell: 92 | """ 93 | python /Users/vuthaiha/Desktop/window_hoff/source/mm10_annotations/relationship_with_ct_spec_state/calculate_summary_sample_regions.py --all_seed_folder {output_folder} --output_folder {params.prob_summ_folder} --num_state_ct_model {num_state_ct_model} --full_state_annot_fn {full_state_annot_fn} --ct_state_annot_fn {ct_state_annot_fn} --ct_group_fn {ct_group_fn} 94 | """ -------------------------------------------------------------------------------- /relationship_with_ct_spec_state/overlap_enrichment_with_ct_spec_state.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import pandas as pd 4 | import numpy as np 5 | import glob 6 | import pybedtools as bed 7 | import helper 8 | parser = argparse.ArgumentParser(description='Calculating the enrichment between the reference chromatin state annotation and another 
segmentation states. This is useful when we want to calculate the overlap enrichment between full stack states and the states in other ct-spec models. Here, we will output the files in a way that replicates output of ChromHMM OverlapEnrichment') 9 | parser.add_argument('--reference_segment_fn', type=str, 10 | help='The segmentation fn that we will use to calculate the state overlap enrichment for. The rows of output correspond to states in this file') 11 | parser.add_argument('--other_segment_fn', type=str, 12 | help='other_segment_fn. The columns of output will be the states in this file') 13 | parser.add_argument('--num_other_states', default=100, type=int, 14 | help='number of chromHMM states in other_segment_fn') 15 | parser.add_argument('--output_fn', type=str, 16 | help = 'output_fn. Will look exactly like the output of ChromHMM OverlapEnrichment') 17 | args = parser.parse_args() 18 | print (args) 19 | helper.check_file_exist(args.reference_segment_fn) 20 | helper.check_file_exist(args.other_segment_fn) 21 | helper.create_folder_for_file(args.output_fn) 22 | 23 | def calculate_percentage_in_genome(reference_segment_fn): 24 | ref_segment_df = pd.read_csv(reference_segment_fn, header = None, index_col = None, sep = '\t') 25 | ref_segment_df.columns = ['chrom', 'start', 'end', 'state'] 26 | result_df = pd.DataFrame() 27 | all_ref_states = np.unique(ref_segment_df['state']) 28 | all_ref_states = list(map(lambda x: int(x[1:]), all_ref_states)) # a list of unique state numbers 29 | all_ref_states = np.sort(all_ref_states) # 1--> 100, for full-stack states 30 | result_df['state (Emission order)'] = all_ref_states 31 | result_df.index = list(map(lambda x: 'E{}'.format(x), result_df['state (Emission order)'])) 32 | ref_segment_df['length'] = ref_segment_df['end'] - ref_segment_df['start'] 33 | gw_length = np.sum(ref_segment_df['length']) 34 | state_cover = ref_segment_df.groupby('state')['length'].sum() 35 | state_perc = state_cover/ gw_length * 100.0 # percentage of 
the genome that is in each state 36 | result_df['Genome %'] = state_perc 37 | result_df['num_bp_in_state'] = state_cover 38 | return result_df, gw_length 39 | 40 | def calculate_overlap_enrichment_all_states(reference_segment_fn, other_segment_fn, num_other_states, output_fn): 41 | ref_segment_bed = bed.BedTool(reference_segment_fn) 42 | result_df, gw_length = calculate_percentage_in_genome(reference_segment_fn) # three columns so far: state (Emission order), Genome % and num_bp_in_state 43 | other_bed = bed.BedTool(other_segment_fn) 44 | inter_bed = ref_segment_bed.intersect(other_bed, wa = True, wb= True) 45 | inter_df = inter_bed.to_dataframe() 46 | inter_df.columns = ['ref_chrom', 'ref_start', 'ref_end', 'ref_state', 'other_chrom', 'other_start', 'other_end', 'other_state'] 47 | inter_df['inter_length'] = np.minimum(inter_df['ref_end'], inter_df['other_end']) - np.maximum(inter_df['ref_start'], inter_df['other_start']) # length of intersection between the reference and the other segmentation (vectorized over all rows) 48 | inter_df = inter_df.groupby(['ref_state', 'other_state'])['inter_length'].sum() 49 | inter_df = inter_df.to_frame().reset_index() # ref_state, other_state, inter_length 50 | inter_df = inter_df.pivot_table(values = 'inter_length', index = 'ref_state', columns = 'other_state') # columns: states in the other_bed, rows: states in ref_bed 51 | inter_df = inter_df.fillna(0) 52 | fract_gene_in_context = (inter_df.sum(axis = 0))/ gw_length # fraction of the genome that is in each of the contexts of the columns 53 | inter_df = inter_df.divide(result_df['num_bp_in_state'], axis = 0).divide(fract_gene_in_context, axis = 1) # FE = (#MS/#S) / (#M/#G): #MS = bp in both states, #S = bp in the reference state, #M = bp in the other state, #G = bp in the genome 54 | result_df = result_df.merge(inter_df, left_index = True, right_index = True) 55 | result_df = result_df.drop('num_bp_in_state', axis = 1) # get rid of this column to make the output look like ChromHMM output 56 | result_df.loc[result_df.shape[0]] = ['Base', 100] + list(fract_gene_in_context*100.0) # the last row of result_df should show the percentage of 
the genome that is in each of the contexts 57 | result_df.to_csv(output_fn, header = True, index = False, sep = '\t') 58 | print('Done!') 59 | return 60 | 61 | calculate_overlap_enrichment_all_states(args.reference_segment_fn, args.other_segment_fn, args.num_other_states, args.output_fn) 62 | -------------------------------------------------------------------------------- /view_ucsc_genome_browser.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ernstlab/mouse_fullStack_annotations/9d022338af972a07636e8c7e5dcb2ada35295e64/view_ucsc_genome_browser.pptx --------------------------------------------------------------------------------
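As a rough illustration of the fold-enrichment formula used in overlap_enrichment_with_ct_spec_state.py, FE = (#MS/#S) / (#M/#G), here is a minimal sketch in plain Python. The function name `fold_enrichment`, the tuple-based segment format and the toy coordinates are assumptions made for illustration and are not part of the repository. The sketch assumes half-open coordinates and that both segmentations cover the same genome, and it replaces the pybedtools intersection with a brute-force loop that is only suitable for tiny inputs.

```python
from collections import defaultdict


def fold_enrichment(ref_segments, other_segments):
    """Return {(ref_state, other_state): FE}, with FE = (#MS/#S) / (#M/#G).

    Hypothetical helper for illustration only. Each segment is a tuple
    (chrom, start, end, state) with half-open coordinates.
    """
    # #G: total number of bases covered by the reference segmentation
    genome_bp = sum(end - start for _, start, end, _ in ref_segments)
    ref_bp = defaultdict(int)    # #S: bases in each reference state
    other_bp = defaultdict(int)  # #M: bases in each other state (within the reference)
    both_bp = defaultdict(int)   # #MS: bases shared by a pair of states

    for _, start, end, state in ref_segments:
        ref_bp[state] += end - start

    # Brute-force O(n*m) overlap; the repository script uses bedtools instead.
    for r_chrom, r_start, r_end, r_state in ref_segments:
        for o_chrom, o_start, o_end, o_state in other_segments:
            if r_chrom != o_chrom:
                continue
            overlap = min(r_end, o_end) - max(r_start, o_start)
            if overlap > 0:
                both_bp[(r_state, o_state)] += overlap
                other_bp[o_state] += overlap

    return {
        (r_state, o_state): (bp / ref_bp[r_state]) / (other_bp[o_state] / genome_bp)
        for (r_state, o_state), bp in both_bp.items()
    }


# Toy example: two reference states and two per-cell-type states on one chromosome.
ref = [("chr1", 0, 100, "E1"), ("chr1", 100, 200, "E2")]
other = [("chr1", 0, 50, "U1"), ("chr1", 50, 200, "U2")]
fe = fold_enrichment(ref, other)
```

In this sketch, pairs of states that never overlap are simply absent from the returned dictionary, whereas the script above fills them with 0 before dividing.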