├── HiChIP_analysis.pdf ├── LICENSE ├── LICENSE.md ├── README.md ├── UCSC_browser └── makeUCSCtrackHub.py ├── _config.yml ├── images ├── CTCF.png ├── HTSeq-extend200.png ├── RNApol2.png ├── bam2bigwig1.png ├── bam2bigwig2.png ├── bam2bigwig3.png ├── choose_diff_tool.png ├── cross-correlation.png ├── extend_200.png ├── fastqc_1.png ├── fastqc_10.png ├── fastqc_11.png ├── fastqc_2.png ├── fastqc_3.png ├── fastqc_4.png ├── fastqc_5.png ├── fastqc_6.png ├── fastqc_7.png ├── fastqc_8.png ├── fastqc_9.png ├── input_control.png ├── meta-heatmap.png ├── phred_score.png ├── shift.png ├── snakemake_flow.png ├── super-enhancer-plot.png └── variablePeaks.png ├── part0.1_fastqc.md ├── part0.2_mapping_to_genome.md ├── part0.3_downsampling_bam.md ├── part0_quality_control.md ├── part1.1_MACS2_parallel_peak_calling.md ├── part1.2_convert_bam2_bigwig.md ├── part1.3_MACS2_peak_calling_details.md ├── part1_peak_calling.md ├── part2_Preparing-ChIP-seq-count-table.md ├── part3.1_Differential_binding_DiffBind_lib_size.md ├── part3_Differential_binding_by_DESeq2.md └── snakemake_ChIPseq_pipeline ├── README.md ├── Snakefile ├── config.yaml ├── msub_cluster.py ├── rawfastqs ├── sampleA │ ├── sampleA_L001.fastq.gz │ ├── sampleA_L002.fastq.gz │ └── sampleA_L003.fastq.gz ├── sampleB │ ├── sampleB_L001.fastq.gz │ ├── sampleB_L002.fastq.gz │ └── sampleB_L003.fastq.gz ├── sampleG1 │ ├── sampleG1_L001.fastq.gz │ ├── sampleG1_L002.fastq.gz │ └── sampleG1_L003.fastq.gz └── sampleG2 │ ├── sampleG2_L001.fastq.gz │ ├── sampleG2_L002.fastq.gz │ └── sampleG2_L003.fastq.gz └── snakemake_notes.md /HiChIP_analysis.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/HiChIP_analysis.pdf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT 
License (MIT) 2 | 3 | Copyright (c) 2015 Ming Tang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Ming Tang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ChIP-seq-analysis 2 | 3 | ### Snakemake pipelines 4 | 5 | I developed a Snakemake-based ChIP-seq pipeline, [pyflow-ChIPseq](https://github.com/crazyhottommy/pyflow-ChIPseq), 6 | and an ATAC-seq pipeline, [pyflow-ATACseq](https://github.com/crazyhottommy/pyflow-ATACseq). 7 | 8 | ### Resources for ChIP-seq 9 | 1.
[ENCODE: Encyclopedia of DNA Elements](https://www.encodeproject.org/) [ENCODExplorer](https://www.bioconductor.org/packages/release/bioc/html/ENCODExplorer.html): A compilation of metadata from ENCODE; a Bioconductor package to access the ENCODE metadata and download the raw files. 10 | 2. [ENCODE Factorbook](https://www.encodeproject.org/) 11 | 3. [ChromNet ChIP-seq interactions](http://chromnet.cs.washington.edu/#/?search=&threshold=0.5) 12 | paper: [Learning the human chromatin network using all ENCODE ChIP-seq datasets](http://biorxiv.org/content/early/2015/08/04/023911) 13 | 4. [The International Human Epigenome Consortium (IHEC) epigenome data portal](http://epigenomesportal.ca/ihec/index.html?as=1) 14 | 5. [GEO](http://www.ncbi.nlm.nih.gov/gds/?term=). Sequences are in .sra format; use sratools to dump them into fastq. 15 | 6. [European Nucleotide Archive](http://www.ebi.ac.uk/ena). Sequences are available in fastq format. 16 | 7. [Databases and software from Shirley Liu's lab at Harvard](http://liulab.dfci.harvard.edu/WEBSITE/software.htm) 17 | 8. [Blueprint epigenome](http://dcc.blueprint-epigenome.eu/#/home) 18 | 9. [A collection of tools and papers for nucleosome positioning and TF ChIP-seq](http://generegulation.info/) 19 | 10. [review paper: Deciphering ENCODE](http://www.cell.com/trends/genetics/fulltext/S0168-9525(16)00017-2) 20 | 11. [EpiFactors](http://epifactors.autosome.ru/) is a database for epigenetic factors, corresponding genes and products. 21 | 12. [biostar handbook](https://read.biostarhandbook.com/). My [ChIP-seq chapter](https://read.biostarhandbook.com/chip-seq/chip-seq-analysis.html) came out in April 2017! 22 | 13. [ReMap 2018](http://tagc.univ-mrs.fr/remap/) An integrative ChIP-seq analysis of regulatory regions. The ReMap atlas consists of 80 million peaks from 485 transcription factors (TFs), transcription coactivators (TCAs) and chromatin-remodeling factors (CRFs) from public data sets.
The atlas is available to browse or download either for a given TF or cell line, or for the entire dataset. 23 | 14. [GTRD: Gene Transcription Regulation Database](https://gtrd.biouml.org/#!) 24 | 25 | ### Papers on ChIP-seq 26 | 1. [ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia](http://www.ncbi.nlm.nih.gov/pubmed/22955991) 27 | 2. [Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003326) 28 | 3. [Systematic evaluation of factors influencing ChIP-seq fidelity](http://www.nature.com/nmeth/journal/v9/n6/full/nmeth.1985.html) 29 | 4. [ChIP–seq: advantages and challenges of a maturing technology](http://www.nature.com/nrg/journal/v10/n10/abs/nrg2641.html) 30 | 5. [ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions](http://www.nature.com/nrg/journal/v13/n12/abs/nrg3306.html) 31 | 6. [Beyond library size: a field guide to NGS normalization](http://biorxiv.org/content/early/2014/06/19/006403) 32 | 7. [ENCODE paper portal](http://www.nature.com/encode/threads) 33 | 8. [Enhancer discovery and characterization](http://www.nature.com/encode/threads/enhancer-discovery-and-characterization) 34 | 9. 2016 review [Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation](http://bib.oxfordjournals.org/content/early/2016/03/15/bib.bbw023.full) 35 | 10. [bioinformatics paper: Features that define the best ChIP-seq peak calling algorithms](http://bib.oxfordjournals.org/content/early/2016/05/10/bib.bbw035.short?rss=1) compares different peak callers for TFs and histones. 36 | 11. [Systematic comparison of monoclonal versus polyclonal antibodies for mapping histone modifications by ChIP-seq](http://biorxiv.org/content/early/2016/05/19/054387) The binding patterns for H3K27ac differed substantially between polyclonal and monoclonal antibodies.
However, this was most likely due to the distinct immunogen used rather than the clonality of the antibody. Altogether, we found that monoclonal antibodies as a class perform as well as polyclonal antibodies. Accordingly, we recommend the use of monoclonal antibodies in ChIP-seq experiments. 37 | 12. A nice small review: [Unraveling the 3D genome: genomics tools for multiscale exploration](http://www.cell.com/trends/genetics/pdf/S0168-9525(15)00063-3.pdf) 38 | 13. Three very interesting papers: [Developmental biology: Panoramic views of the early epigenome](http://www.nature.com/nature/journal/v537/n7621/full/nature19468.html) 39 | 14. [ChIP off the old block: Beyond chromatin immunoprecipitation](https://www.sciencemag.org/features/2018/12/chip-old-block-beyond-chromatin-immunoprecipitation). A nice review of the past and future of ChIP-seq. 40 | 15. [Histone Modifications: Insights into Their Influence on Gene Expression](https://www.sciencedirect.com/science/article/pii/S0092867418310481) 41 | **Protocols** 42 | 1. [A computational pipeline for comparative ChIP-seq analyses](http://www.ncbi.nlm.nih.gov/pubmed/22179591) 43 | 2. [Identifying ChIP-seq enrichment using MACS](http://www.nature.com/nprot/journal/v7/n9/full/nprot.2012.101.html) 44 | 3. [Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells](http://www.ncbi.nlm.nih.gov/pubmed/24743992) 45 | 4. [ENCODE tutorials](http://www.genome.gov/27553900) 46 | 5. [A User's Guide to the Encyclopedia of DNA Elements (ENCODE)](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001046) 47 | 6. [A toolbox of immunoprecipitation-grade monoclonal antibodies to human transcription factors](https://www.nature.com/articles/nmeth.4632) The data portal: https://proteincapture.org/ 48 | 49 | ### Quality Control 50 | Data downloaded from GEO are usually raw fastq files. One needs to do quality control (QC) on them.
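As a quick reminder of what the QC tools below are reading, each base in a fastq record carries a Phred+33-encoded quality character (Q = ASCII code − 33). A minimal pure-Python sketch; the quality string here is made up for illustration:

```python
# Decode a FASTQ Phred+33 quality string into per-base quality scores,
# the same values fastqc summarizes in its per-base quality plot.

def phred33(qual: str) -> list[int]:
    """Phred+33 encoding: quality Q is stored as the character chr(Q + 33)."""
    return [ord(c) - 33 for c in qual]

def mean_quality(qual: str) -> float:
    """Mean per-base quality of one read."""
    scores = phred33(qual)
    return sum(scores) / len(scores)

qual_line = "IIIIHHGG!!"  # made-up quality line: 'I' encodes Q40, '!' encodes Q0
print(phred33(qual_line))       # [40, 40, 40, 40, 39, 39, 38, 38, 0, 0]
print(mean_quality(qual_line))  # 31.4
```

A quality drop toward the 3' end of reads, as in this toy string, is exactly the pattern fastqc flags in its per-base quality module.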
51 | 52 | * [fastqc](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) 53 | * [multiqc](http://multiqc.info/) Aggregates results from bioinformatics analyses across many samples into a single report. Could be very useful to summarize the QC reports. 54 | 55 | 56 | ### Peak calling 57 | 58 | Be careful with the peaks you get: 59 | [Active promoters give rise to false positive ‘Phantom Peaks’ in ChIP-seq experiments](http://nar.oxfordjournals.org/content/early/2015/06/27/nar.gkv637.long) 60 | 61 | It is good to have controls for your ChIP-seq experiments. A DNA input control (no antibody is applied) is preferred. 62 | An IgG control is also fine, but because so little DNA is recovered, you might get many duplicated reads due to PCR artifacts. 63 | 64 | **For cancer cells, an input control can be used to correct for copy-number bias.** 65 | 66 | * [tools used by the IHEC consortium](http://ihec-epigenomes.org/research/tools/) 67 | 68 | [A quote from Tao Liu](https://groups.google.com/forum/#!searchin/macs-announcement/h3k27ac/macs-announcement/9_LB5EsjS_Y/nwgsPN8lR-kJ), who developed MACS1/2: 69 | 70 | >I remember in a PloS One paper last year by Elizabeth G. Wilbanks et al., authors pointed out the best way to sort results in MACS is by -10*log10(pvalue) then fold enrichment. I agree with them. You don't have to worry about FDR too much if your input data are far more than ChIP data. MACS1.4 calculates FDR by swapping samples, so if your input signal has some strong bias somewhere in the genome, your FDR result would be bad. Bad FDR may mean something but it's just secondary. 71 | 72 | 1. The most popular peak caller, by Tao Liu: [MACS2](https://github.com/taoliu/MACS/). The `--broad` flag now supports broad peak calling as well. 73 | 74 | 2. [TF ChIP-seq peak calling using the Irreproducibility Discovery Rate (IDR) framework](https://github.com/nboley/idr) and many [Software Tools Used to Create the ENCODE Resource](https://genome.ucsc.edu/ENCODE/encodeTools.html) 75 | 3.
[SICER](http://home.gwu.edu/~wpeng/Software.htm) for broad histone modification ChIP-seq 76 | 4. [HOMER](http://homer.salk.edu/homer/ngs/peaks.html) can also be used to call transcription factor ChIP-seq peaks and histone 77 | modification ChIP-seq peaks. 78 | 5. [MUSIC](https://github.com/gersteinlab/MUSIC) 79 | 6. [permseq](https://github.com/keleslab/permseq) R package for mapping protein-DNA interactions in highly repetitive regions of the genome with prior-enhanced read mapping. [Paper](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4618727/pdf/pcbi.1004491.pdf) in PLoS Computational Biology. 80 | 7. [Ritornello](http://www.biorxiv.org/content/early/2015/12/11/034090): High fidelity control-free ChIP-seq peak calling. No input is required! 81 | 8. Tumor samples are heterogeneous, containing different cell types. [MixChIP: a probabilistic method for cell type specific protein-DNA binding analysis](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0834-3) 82 | 9. [Detecting broad domains and narrow peaks in ChIP-seq data with hiddenDomains](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0991-z) [tool](http://hiddendomains.sourceforge.net/) 83 | 10. [BroadPeak: a novel algorithm for identifying broad peaks in diffuse ChIP-seq datasets](http://bioinformatics.oxfordjournals.org/content/29/4/492) 84 | 11. [epic: diffuse domain ChIP-Seq caller based on SICER](https://github.com/endrebak/epic). It is a rewrite of SICER for faster processing on more CPUs. (Will try it for broad peaks for sure.) 85 | The [epic2](https://github.com/biocore-ntnu/epic2) paper is out: https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz232/5421513?redirectedFrom=fulltext 86 | 12. [Cistrome](http://cistrome.org/Cistrome/Cistrome_Project.html): The best place for wet-lab scientists to check binding sites. Developed by Shirley Liu's lab at Harvard. 87 | 13.
[ChIP-Atlas](http://chip-atlas.org/) is an integrative and comprehensive database for visualizing and making use of public ChIP-seq data. ChIP-Atlas covers almost all public ChIP-seq data submitted to the SRA (Sequence Read Archive) in NCBI, DDBJ, or ENA, and is based on over 78,000 experiments. 88 | 14. [A map of direct TF-DNA interactions in the human genome](https://unibind.uio.no/) UniBind is a comprehensive map of direct interactions between transcription factors (TFs) and DNA. High-confidence TF binding site predictions were obtained from uniform processing of thousands of ChIP-seq data sets using the ChIP-eat software. 89 | 15. [Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-Seq peak callers](http://biorxiv.org/content/early/2016/12/01/090704) tool on [github](https://github.com/tengmx/gcapc) 90 | 16. [SUPERmerge](https://github.com/Bohdan-Khomtchouk/SUPERmerge): ChIP-seq coverage island analysis algorithm for broad histone marks 91 | 17. [PeakRanger](http://ranger.sourceforge.net/manual1.18.html) I have heard that it is good for broad peaks of H3K9me3 and H3K27me3. 92 | 93 | 94 | **Different parameters using the same program can produce drastically different sets of peaks, especially for histone modifications with variable enrichment lengths and gaps between peaks.
One needs to make a valid argument for the parameters used.** 95 | 96 | An example of different parameters for homer `findPeaks`: 97 | ![](./images/variablePeaks.png) 98 | 99 | ### Tutorial 100 | 101 | * [tutorial by Simon van Heeringen at bioinfosummer](https://github.com/simonvh/bioinfosummer) 102 | 103 | ### Binding does not imply functionality 104 | 105 | * [A significant proportion of transcription-factor binding sites may be nonfunctional](http://judgestarling.tumblr.com/post/64874995999/hypotheses-about-the-functionality-of) A post from Judge Starling 106 | 107 | * Several papers have shown that changes in adjacent TF binding correlate poorly with changes in gene expression: 108 | [Extensive Divergence of Transcription Factor Binding in Drosophila Embryos with Highly Conserved Gene Expression](http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1003748) 109 | [Transcription Factors Bind Thousands of Active and Inactive Regions in the Drosophila Blastoderm](http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.0060027) 110 | 111 | [The Functional Consequences of Variation in Transcription Factor Binding](http://arxiv.org/abs/1310.5166) 112 | >" On average, 14.7% of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. " 113 | 114 | * paper [A large portion of the ChIP-seq signal does not correspond to true binding](http://www.ncbi.nlm.nih.gov/pubmed/26388941?dopt=Abstract&utm_source=dlvr.it&utm_medium=twitter) 115 | * [BIDCHIPS: Bias-Decomposition of ChIP-seq Signals](http://www.perkinslab.ca/Software.html) 116 | **Mappability, GC-content and chromatin accessibility** all affect ChIP-seq read counts.
117 | * [ChIP bias as a function of cross-linking time](http://link.springer.com/article/10.1007%2Fs10577-015-9509-1) 118 | 119 | >We analyzed the dependence of the ChIP signal on the duration of formaldehyde cross-linking time for two proteins: DNA topoisomerase 1 (Top1) that is functionally associated with the double helix in vivo, especially with active chromatin, and green fluorescent protein (GFP) that has no known bona fide interactions with DNA. With short time of formaldehyde fixation, only Top1 immunoprecipitation efficiently recovered DNA from active promoters, whereas prolonged fixation augmented non-specific recovery of GFP dramatizing the need to optimize ChIP protocols to minimize the time of cross-linking, especially for abundant nuclear proteins. Thus, ChIP is a powerful approach to study the localization of protein on the genome when care is taken to manage potential artifacts. 120 | 121 | 122 | ### Gene set enrichment analysis for ChIP-seq peaks 123 | 124 | [The Gene Ontology Handbook](http://link.springer.com/protocol/10.1007%2F978-1-4939-3743-1_13) Read it for the basics of GO. 125 | 126 | 1. [Broad Enrich](http://broad-enrich.med.umich.edu/) 127 | 2. [ChIP Enrich](http://chip-enrich.med.umich.edu/) 128 | 3. [GREAT](http://bejerano.stanford.edu/great/public/html/) predicts functions of cis-regulatory regions. 129 | 4. [ENCODE ChIP-seq significance tool](http://encodeqt.simple-encode.org/). Given a list of genes, co-regulating TFs will be identified. 130 | 5. [cscan](http://159.149.160.51/cscan/) similar to the ENCODE significance tool. 131 | 6. [CompGO: an R package for comparing and visualizing Gene Ontology enrichment differences between DNA binding experiments](http://www.biomedcentral.com/1471-2105/16/275) 132 | 7. [Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool](http://amp.pharm.mssm.edu/Enrichr/) 133 | 8. [GeNets](http://www.broadinstitute.org/genets#computations) from the Broad. Looks very promising. 134 | 9.
[Bioconductor EnrichmentBrowser](https://www.bioconductor.org/packages/3.3/bioc/html/EnrichmentBrowser.html) 135 | 10. [clusterProfiler](http://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html) by Guangchuang Yu, the author of `ChIPseeker`. 136 | 11. [fgsea bioconductor package](http://bioconductor.org/packages/devel/bioc/html/fgsea.html) Fast Gene Set Enrichment Analysis. 137 | 12. [paper: A Comparison of Gene Set Analysis Methods in Terms of Sensitivity, Prioritization and Specificity](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0079217#pone-0079217-t001) 138 | 13. [UniBind Enrichment Analysis](https://unibind.uio.no/enrichment/) predicts which sets of TFBSs from the UniBind database are enriched in a set of given genomic regions. Enrichment computations are performed using the LOLA tool. 139 | 14. [BEHST](https://www.biorxiv.org/content/10.1101/168427v1) from the Hoffman group: genomic set enrichment analysis enhanced through integration of chromatin long-range interactions 140 | 15. [ChEA3: transcription factor enrichment analysis by orthogonal omics integration](https://amp.pharm.mssm.edu/chea3) 141 | 142 | 143 | ### Chromatin state Segmentation 144 | 1. [ChromHMM](http://compbio.mit.edu/ChromHMM/) from Manolis Kellis at MIT. 145 | >In ChromHMM the raw reads are assigned to non-overlapping bins of 200 bps and a sample-specific threshold is used to transform the count data to binary values 146 | 147 | 2. [Segway](https://www.pmgenomics.ca/hoffmanlab/proj/segway/) from the Hoffman lab. Base-pair resolution. Takes longer to run. 148 | 3. [epicseg](https://github.com/lamortenera/epicseg) published in Genome Biology in 2015. Similar speed to ChromHMM. 149 | 4. [Spectacle: fast chromatin state annotation using spectral learning](https://github.com/jiminsong/Spectacle). Also published in Genome Biology in 2015. 150 | 5.
[chromstaR](http://biorxiv.org/content/early/2016/02/04/038612): Tracking combinatorial chromatin state dynamics in space and time 151 | 6. [epilogos](http://epilogos.broadinstitute.org/) visualization and analysis of chromatin state model data. 152 | 7. [Accurate promoter and enhancer identification in 127 ENCODE and Roadmap Epigenomics cell types and tissues by GenoSTAN](http://biorxiv.org/content/early/2016/02/24/041020) 153 | 8. [StatePaintR](https://github.com/Simon-Coetzee/StatePaintR) StateHub-StatePaintR: rules-based chromatin state annotations. 154 | 9. [IDEAS](https://github.com/yuzhang123/IDEAS/): an integrative and discriminative epigenome annotation system. http://sites.stat.psu.edu/~yzz2/IDEAS/ 155 | 156 | ### deep learning in ChIP-seq 157 | * [Coda](https://github.com/kundajelab/coda) uses convolutional neural networks to learn a mapping from noisy to high-quality ChIP-seq data. These trained networks can then be used to remove noise and improve the quality of new ChIP-seq data. From Anshul Kundaje's lab. 158 | * [DeepChrome](https://github.com/QData/DeepChrome) is a unified CNN framework that automatically learns combinatorial interactions among histone modification marks to predict gene expression. (Is it really better than a simple linear model?) 159 | * [deep learning in biology](https://github.com/hussius/deeplearning-biology) 160 | * [gReLU](https://github.com/Genentech/gReLU) is a Python library to train, interpret, and apply deep learning models to DNA sequences. 161 | * [tangermeme](https://github.com/jmschrei/tangermeme) is an extension of the MEME suite concept to biological sequence analysis when you have a collection of sequences and a predictive model.
162 | * [Dissecting the cis-regulatory syntax of transcription initiation with deep learning](https://www.biorxiv.org/content/10.1101/2024.05.28.596138v1) 163 | * [Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage](https://www.biorxiv.org/content/10.1101/2024.05.28.596078v1) 164 | 165 | ### Peak annotation 166 | 167 | 1. Homer [`annotatePeak`](http://homer.salk.edu/homer/ngs/annotation.html) 168 | 2. Bioconductor package [ChIPseeker](http://bioconductor.org/packages/release/bioc/html/ChIPseeker.html) by [Guangchuang Yu](http://ygc.name/) 169 | See an important post by him on 0- vs 1-based [coordinates](http://ygc.name/2015/08/07/parsing-bed-coordinates/). 170 | 171 | >Most of the software for ChIP annotation doesn't considered this issue when annotating peak (0-based) to transcript (1-based). To my knowledge, only HOMER consider this issue. After I figure this out, I have updated ChIPseeker (version >= 1.4.3) to fix the issue. 172 | 173 | 3. Bioconductor package [ChIPpeakAnno](http://bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html). 174 | 175 | 4. [annotatr](https://github.com/rcavalcante/annotatr/) Annotation of Genomic Regions to Genomic Annotations. 176 | 177 | 5. [geneXtendeR](https://bioconductor.org/packages/release/bioc/html/geneXtendeR.html) computes optimal gene extensions tailored to the broadness of the specific epigenetic mark (e.g., H3K9me1, H3K27me3), as determined by a user-supplied ChIP-seq peak input file.
As such, geneXtendeR maximizes the signal-to-noise ratio of locating genes closest to and directly under peaks. 178 | 179 | * [DNAshapeR predicts DNA shape features in an ultra-fast, high-throughput manner from genomic sequencing data](http://tsupeichiu.github.io/DNAshapeR/) 180 | 181 | ### Differential peak detection 182 | Look at a [post](http://andre-rendeiro.me/2015/04/03/chipseq_diffbind_analysis/) and [another](http://crazyhottommy.blogspot.com/2013/10/compare-chip-seq-data-for-different.html) describing different tools. 183 | A review paper: [A comprehensive comparison of tools for differential ChIP-seq analysis](http://bib.oxfordjournals.org/content/early/2016/01/12/bib.bbv110.short?rss=1) 184 | 185 | ![](./images/choose_diff_tool.png) 186 | 187 | 188 | * [ATAC-seq normalization method can significantly affect differential accessibility analysis and interpretation](https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-020-00342-y) 189 | * [Comparison of differential accessibility analysis strategies for ATAC-seq data](https://www.nature.com/articles/s41598-020-66998-4) https://github.com/Zhang-lab/BeCorrect to correct batch effects in the bedgraph files. 190 | 191 | 1. [MultiGPS](http://mahonylab.org/software/multigps/) 192 | 193 | 2. [PePr](https://github.com/shawnzhangyx/PePr). It can also call peaks. 194 | 195 | 3. [histoneHMM](http://histonehmm.molgen.mpg.de/) 196 | 197 | 4. [diffreps](https://github.com/shenlab-sinai/diffreps) for histone marks. Developed by Li Shen's lab at Mount Sinai, which also developed [ngs.plot](https://github.com/shenlab-sinai/ngsplot). 198 | 199 | 5. [diffbind bioconductor package](http://bioconductor.org/packages/release/bioc/html/DiffBind.html). Internally uses the RNA-seq tools edgeR or DESeq. Most likely, I will use this tool. 200 | 201 | 6. [ChIPComp](http://web1.sph.emory.edu/users/hwu30/software/ChIPComp.html). Very little tutorial material. Now it is on Bioconductor. 202 | 203 | 7.
[csaw bioconductor package](http://bioconductor.org/packages/release/bioc/html/csaw.html). Tutorial [here](https://www.bioconductor.org/help/course-materials/2015/BioC2015/csaw_lab.html) 204 | 205 | 8. [chromDiff](http://compbio.mit.edu/ChromDiff/Download.html). Also from Manolis Kellis at MIT. As with ChromHMM, the documentation is not that detailed. Will give it a try. 206 | 9. [MACS2 can detect differential peaks as well](https://github.com/taoliu/MACS/wiki/Call-differential-binding-events) 207 | 10. paper [Identifying differential transcription factor binding in ChIP-seq](http://journal.frontiersin.org/article/10.3389/fgene.2015.00169/full) 208 | 209 | 210 | ### Motif enrichment 211 | 1. [HOMER](http://homer.salk.edu/homer/ngs/peakMotifs.html). It has really detailed documentation. It can also be used to call peaks. 212 | 213 | For TF ChIP-seq, one can usually find the summit of the peak (macs14 will report the summit) and extend the summit by 100-500 bp on both sides. One can then use those small 100-500 bp regions to do motif analysis. Usually, one should recover the motif of the ChIPed TF in the ChIP-seq experiment if it is a DNA-binding protein. 214 | 215 | It is trickier to do motif analysis using histone modification ChIP-seq. For example, the average peak size of H3K27ac is 2~3 kb. If one wants to find TF binding motifs from H3K27ac ChIP-seq data, it is good to narrow down the region a bit. MEME and many other motif finding tools require that the DNA sequences be short (~500 bp). One way is to use `findPeaks` in HOMER with the `-nfr` (nucleosome-free region) flag turned on, and then do motif analysis in those regions. 216 | 217 | Suggestions for finding motifs from histone modification ChIP-seq data, from the HOMER page: 218 | >Since you are looking at a region, you do not necessarily want to center the peak on the specific position with the highest tag density, which may be at the edge of the region.
Besides, in the case of histone modifications at enhancers, the highest signal will usually be found on nucleosomes surrounding the center of the enhancer, which is where the functional sequences and transcription factor binding sites reside. Consider H3K4me marks surrounding distal PU.1 transcription factor peaks. Typically, adding the -center option moves peaks further away from the functional sequence in these scenarios. 219 | 220 | Another strategy, similar to `-nfr`, was developed in this paper: [Dissecting neural differentiation regulatory networks through epigenetic footprinting](http://www.ncbi.nlm.nih.gov/pubmed/25533951). In the methods section of the paper, the authors computed a depletion score within the peaks and used the footprinted regions to do motif analysis. (Thanks to [kadir](https://twitter.com/canerakdemir) for pointing out the paper.) 221 | 222 | http://homer.ucsd.edu/homer/ngs/peakMotifs.html 223 | 224 | >Region Size ("-size <#>", "-size <#>,<#>", "-size given", default: 200) 225 | The size of the region used for motif finding is important. If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, **500-1000 bp is probably a good idea (i.e. H3K4me or H3/H4 acetylated regions)**. In theory, HOMER can work with very large regions (i.e. 10kb), but with the larger the regions comes more sequence and longer execution time. These regions will be based off the center of the peaks. If you prefer an offset, you can specify "-size -300,100" to search a region of size 400 that is centered 100 bp upstream of the peak center (useful if doing motif finding on putative TSS regions). If you have variable length regions, use the option "-size given" and HOMER will use the exact regions that were used as input.
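The summit-extension step described above (take each peak summit, extend it on both sides, and feed the fixed-width windows to a motif finder) can be sketched in a few lines of Python. The summit coordinates here are made up; real ones would come from the MACS summit output:

```python
# Extend peak summits into fixed-width windows for motif finding.
# BED intervals are 0-based, half-open; clip windows at the chromosome start.

def extend_summit(chrom: str, summit: int, flank: int = 100) -> tuple[str, int, int]:
    """Return a window of up to 2*flank bp centered on the summit."""
    start = max(0, summit - flank)
    end = summit + flank
    return (chrom, start, end)

# Hypothetical summits; a real list would be parsed from the MACS output file.
summits = [("chr1", 10500), ("chr2", 40)]
windows = [extend_summit(c, s) for c, s in summits]
print(windows)  # [('chr1', 10400, 10600), ('chr2', 0, 140)]
```

One could then write these windows to a BED file and extract sequences (e.g. with bedtools getfasta) before running MEME or HOMER; the 100 bp flank here is just the lower end of the 100-500 bp range mentioned above.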
226 | 227 | I just found [PARE](http://spundhir.github.io/PARE/). PARE is a computational method to Predict Active Regulatory Elements, specifically enhancers and promoters. H3K27ac and H3K4me can be used to define active enhancers. 228 | 229 | 2. [MEME suite](http://meme.ebi.edu.au/meme/index.html). It is probably the most popular motif finding tool in the literature. [protocol: Motif-based analysis of large nucleotide data sets using MEME-ChIP](http://www.nature.com/nprot/journal/v9/n6/full/nprot.2014.083.html) 230 | 3. [MEME R package](https://snystrom.github.io/memes-manual/) 231 | 4. [JASPAR database](http://jaspar.binf.ku.dk/) 232 | 5. [Pscan-ChIP](http://159.149.160.51/pscan_chip_dev/) 233 | 6. [MotifMap](http://motifmap.ics.uci.edu/#MotifSearch) 234 | 7. [RSAT](http://rsat01.biologie.ens.fr/rsa-tools/index.html) Regulatory Sequence Analysis Tools. 235 | 8. [ENCODE TF motif database](http://compbio.mit.edu/encode-motifs/) 236 | 9. [oPOSSUM](http://opossum.cisreg.ca/oPOSSUM3/) is a web-based system for the detection of over-represented conserved transcription factor binding sites and binding site combinations in sets of genes or sequences. 237 | 10. my post [how to get a genome-wide motif bed file](http://crazyhottommy.blogspot.com/2014/02/how-to-get-genome-wide-motif-bed-file.html) 238 | 11. Many other tools [here](http://omictools.com/motif-discovery-c84-p1.html) 239 | 12. [A review of ensemble methods for de novo motif discovery in ChIP-Seq data](http://bib.oxfordjournals.org/content/early/2015/04/17/bib.bbv022.abstract) 240 | 13. [melina2](http://melina2.hgc.jp/public/index.html). If you only have one sequence and want to know what TFs might bind 241 | there, this is a very useful tool. 242 | 14. [STEME](https://pypi.python.org/pypi/STEME/). A Python library for motif analysis. STEME started life as an approximation to the Expectation-Maximisation algorithm for the type of model used in motif finders such as MEME.
**STEME’s EM approximation runs an order of magnitude more quickly than the MEME implementation for typical parameter settings**. STEME has now developed into a fully-fledged motif finder in its own right. 243 | 13. [CENTIPEDE: Transcription factor footprinting and binding site prediction](http://centipede.uchicago.edu/). [Tutorial](https://github.com/slowkow/CENTIPEDE.tutorial) 244 | 14. [msCentipede: Modeling Heterogeneity across Genomic Sites and Replicates Improves Accuracy in the Inference of Transcription Factor Binding](http://rajanil.github.io/msCentipede/) 245 | 15. [DiffLogo: A comparative visualisation of sequence motifs](http://bioconductor.org/packages/release/bioc/html/DiffLogo.html) 246 | 16. [Weeder (version: 2.0)](http://159.149.160.51/modtools/) 247 | 17. [MCAST: scanning for cis-regulatory motif clusters](http://bioinformatics.oxfordjournals.org/content/early/2016/01/14/bioinformatics.btv750.short?rss=1) Part of MEME suite. 248 | 18. [Sequence-based Discovery of Regulons](http://iregulon.aertslab.org/) iRegulon detects the TF, the targets and the motifs/tracks from a set of genes. 249 | 19. [Regulatory genomic toolbox](http://www.regulatory-genomics.org/motif-analysis/introduction/) 250 | 20. [Parse TF motifs from public databases, read into R, and scan using 'rtfbs'](https://github.com/Danko-Lab/rtfbs_db) 251 | 21. [Romulus: Robust multi-state identification of transcription factor binding sites from DNase-seq data](https://github.com/ajank/Romulus): Romulus is a computational method to accurately identify individual transcription factor binding sites from genome sequence information and cell-type--specific experimental data, such as DNase-seq. It combines the strengths of its predecessors, CENTIPEDE and Wellington, while keeping the number of free parameters in the model robustly low. The method is unique in allowing for multiple binding states for a single transcription factor, differing in their cut profile and overall number of DNase I cuts. 
24. [moca](https://github.com/saketkc/moca): Tool for motif conservation analysis.
25. [gimmemotifs](https://github.com/simonvh/gimmemotifs) Suite of motif tools, including a motif prediction pipeline for ChIP-seq experiments. Looks very useful, will take a look!
26. [YAMDA](https://github.com/daquang/YAMDA): thousand-fold speedup of EM-based motif discovery using deep learning libraries and GPUs.
27. [motif clustering](https://www.vierstra.org/resources/motif_clustering)
28. [RSAT matrix-clustering: dynamic exploration and redundancy reduction of transcription factor binding motif collections](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737723/)


### Super-enhancer identification

The fancy "super-enhancer" term was first introduced by [Richard Young](http://younglab.wi.mit.edu/publications.htm) at the Whitehead Institute. Basically, super-enhancers are enhancers that span large genomic regions (~12.5 kb). The concept of the super-enhancer is not new. One of the most famous examples is the Locus Control Region (LCR) that controls globin gene expression, which has been known for decades.

A review in Nature Genetics: [What are super-enhancers?](http://www.nature.com/ng/journal/v47/n1/full/ng.3167.html)

[paper: Genetic dissection of the α-globin super-enhancer in vivo](http://www.nature.com/ng/journal/v48/n8/full/ng.3605.html)

> By generating a series of mouse models, deleting each of the five regulatory elements of the α-globin super-enhancer individually and in informative combinations, we demonstrate that each constituent enhancer seems to act independently and in an additive fashion with respect to hematological phenotype, gene expression, chromatin structure and chromosome conformation, without clear evidence of synergistic or higher-order effects.
[paper: Hierarchy within the mammary STAT5-driven Wap super-enhancer](http://www.nature.com/ng/journal/v48/n8/full/ng.3606.html)
[paper: Enhancers and super-enhancers have an equivalent regulatory role in embryonic stem cells through regulation of single or multiple genes](http://m.genome.cshlp.org/content/early/2016/11/28/gr.210930.116.abstract)

From the [HOMER page](http://homer.salk.edu/homer/ngs/peaks.html):
**How finding super enhancers works:**

>Super enhancer discovery in HOMER emulates the original strategy used by the Young lab. First, peaks are found just like in any other ChIP-Seq data set. Then, peaks found within a given distance are 'stitched' together into larger regions (by default this is set at 12.5 kb). The super enhancer signal of each of these regions is then determined by the total normalized number of reads minus the number of normalized reads in the input. These regions are then sorted by their score, normalized to the highest score and the number of putative enhancer regions, and then super enhancers are identified as the regions past the point where the slope is greater than 1.

Example of a super enhancer plot:
![](./images/super-enhancer-plot.png)

>In the plot above, all of the peaks past 0.95 or so would be considered "super enhancers", while the ones below would be "typical" enhancers. If the slope threshold of 1 seems arbitrary to you, well... it is! This part is probably the 'weakest link' in the super enhancer definition. However, the concept is still very useful. Please keep in mind that most enhancers probably fall on a continuum between typical and super enhancer status, so don't bother fighting over the precise number of super enhancers in a given sample and instead look for useful trends in the data.
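The ranking-and-slope idea described above can be sketched in a few lines of Python. This is a minimal illustration of the "slope > 1" cutoff on rank-vs-signal curves (both axes scaled to [0, 1]), not the actual HOMER or ROSE code:

```python
def call_super_enhancers(signals):
    """Rank stitched regions by signal; flag those past the slope > 1 point.

    With rank and signal both scaled to [0, 1], slope 1 is the diagonal;
    regions where the scaled curve rises faster than the diagonal are
    called super enhancers.
    """
    ranked = sorted(signals)                     # ascending input-corrected signal
    n, top = len(ranked), max(ranked)
    scaled = [s / top for s in ranked]           # normalize signal to [0, 1]
    for i in range(1, n):
        # slope between consecutive points: dy / dx, with dx = 1/n on rank axis
        slope = (scaled[i] - scaled[i - 1]) * n
        if slope > 1:
            return ranked[i:]                    # everything past the cutoff
    return []
```

On a toy input like `[1, 1, 1, 1, 1, 1, 10, 20]` the curve stays flat until the last two regions, which get called as super enhancers; a perfectly linear ranking never exceeds slope 1 and returns none, which is consistent with the "continuum" caveat in the quote.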
**Using ROSE from the Young lab**
[ROSE: RANK ORDERING OF SUPER-ENHANCERS](http://younglab.wi.mit.edu/super_enhancer_code.html)

**[imPROSE](https://github.com/asntech/improse) - Integrated Methods for Prediction of Super-Enhancers**

**[CREAM](https://github.com/bhklab/CREAM) (Clustering of Functional Regions Analysis Method) is a new method for identification of clusters of functional regions (COREs) within chromosomes.** Published in Genome Research by the Mathieu Lupien group. Paper: Identifying clusters of cis-regulatory elements underpinning TAD structures and lineage-specific regulatory networks.


### Bedgraph, bigwig manipulation tools
[WiggleTools](https://github.com/Ensembl/WiggleTools)
[bigwig tool](https://github.com/CRG-Barcelona/bwtool/wiki)
[bigwig-python](https://github.com/brentp/bw-python)
[samtools](http://www.htslib.org/)
[bedtools](http://bedtools.readthedocs.org/en/latest/) my all-time favorite tool, from Aaron Quinlan's lab. Great documentation!
[pyBedGraph](https://www.biorxiv.org/content/10.1101/709683v1): a Python package for fast operations on 1-dimensional genomic signal tracks.
[pyBigwig](https://github.com/deeptools/pyBigWig)
[Hosting bigWig for UCSC visualization](http://crazyhottommy.blogspot.com/2014/02/hosting-bigwig-by-dropbox-for-ucsc.html)
[My first play with GRO-seq data, from sam to bedgraph for visualization](http://crazyhottommy.blogspot.com/2013/10/my-first-play-with-gro-seq-data-from.html)
[convert bam file to bigwig file and visualize in UCSC genome browser in a Box (GBiB)](http://crazyhottommy.blogspot.com/2014/10/convert-bam-file-to-bigwig-file-and.html).
[megadepth](https://github.com/ChristopherWilks/megadepth) is pretty fast, can access bigWig files from the web, works on macOS, Linux & Windows, and is also available via [Bioconductor](http://www.bioconductor.org/packages/release/bioc/html/megadepth.html), which makes it easy to use in R, for example for quantifying expression of custom regions from recount3 data.

[Bigtools: a high-performance BigWig and BigBed library in Rust](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae350/7688332?login=false)


### Peaks overlapping significance test
[The genomic association tester (GAT)](https://github.com/AndreasHeger/gat)
[poverlap](https://github.com/brentp/poverlap) from Brent Pedersen. He now works with Aaron Quinlan at the University of Utah.
[Genometric Correlation (GenometriCorr): an R package for spatial correlation of genome-wide interval datasets](http://genometricorr.sourceforge.net/)
[Location overlap analysis for enrichment of genomic ranges](http://bioconductor.org/packages/release/bioc/html/LOLA.html) bioconductor package.
[regioneR](http://bioconductor.org/packages/release/bioc/html/regioneR.html) Association analysis of genomic regions based on permutation tests.
[similaRpeak](http://bioconductor.org/packages/devel/bioc/html/similaRpeak.html): Metrics to estimate a level of similarity between two ChIP-Seq profiles.

### RNA-seq data integration
[Beta](http://cistrome.org/BETA/) from Shirley Liu's lab at Harvard, Tao Liu's previous lab.


### Heatmap, meta-plot

Many papers draw meta-plots and heatmaps over certain genomic regions (2 kb around the TSS, gene body, etc.) using ChIP-seq data.

See an example from ngs.plot:
![](./images/meta-heatmap.png)


**Tools**

1. [deeptools](https://github.com/fidelram/deepTools). It can do many other things and has good documentation. It can also generate heatmaps, but I personally use [ngs.plot](https://github.com/shenlab-sinai/ngsplot), which is easy to use (developed at Mount Sinai).

2. You can also draw heatmaps using R: just count the ChIP-seq reads in each bin (using either HOMER or bedtools) and draw with the heatmap.2 function. See [here](http://crazyhottommy.blogspot.com/2013/08/how-to-make-heatmap-based-on-chip-seq.html) and [here](http://crazyhottommy.blogspot.com/2013/04/how-to-make-tss-plot-using-rna-seq-and.html). Those are my pretty old blog posts; I now have a much better idea of how to make those graphs from scratch.

3. You can also use the bioconductor package [Genomation](http://www.bioconductor.org/packages/release/bioc/vignettes/genomation/inst/doc/GenomationManual-knitr.html). It is very versatile.
4. [ChAsE](http://www.epigenomes.ca/tools.html)
5. [Metaseq](https://pythonhosted.org/metaseq/example_session.html)
6. [EnrichedHeatmap](https://github.com/jokergoo/EnrichedHeatmap) from Zuguang Gu, based on his own package `ComplexHeatmap`. This is now my default go-to because of the flexibility of the package and the great user support. Thx!
7. [A biostar post discussing the tools: Visualizations of ChIP-Seq data using Heatmaps](https://www.biostars.org/p/180314/)
8. [A bioconductor package to produce metagene plots](http://bioconductor.org/packages/release/bioc/html/metagene.html)
9. [Fluff is a Python package that contains several scripts to produce pretty, publication-quality figures for next-generation sequencing experiments](https://github.com/simonvh/fluff). I just found it 09/01/2016. Looks promising, especially for identifying dynamic changes.
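The count-reads-in-bins idea behind all of these tools can be sketched directly. Below is a toy Python illustration (made-up coverage values standing in for a bigWig, not tied to any tool above): each region around a TSS is cut into fixed-width bins, per-region bin averages form the heatmap matrix, and the column means give the meta-plot profile.

```python
def metaplot_matrix(coverage, tss_list, flank=2000, nbins=40):
    """Build a regions x bins matrix of mean coverage around each TSS.

    coverage: dict mapping genomic position -> signal (toy stand-in for
    a bigWig track). Row i is the binned profile for tss_list[i];
    averaging the columns gives the meta-plot curve.
    """
    width = (2 * flank) // nbins
    matrix = []
    for tss in tss_list:
        row = []
        for b in range(nbins):
            start = tss - flank + b * width
            vals = [coverage.get(p, 0.0) for p in range(start, start + width)]
            row.append(sum(vals) / width)   # mean signal in this bin
        matrix.append(row)
    # column means across regions = the meta-plot profile
    meta = [sum(r[b] for r in matrix) / len(matrix) for b in range(nbins)]
    return matrix, meta
```

As the caveat below notes, the `meta` averages can hide heterogeneity across regions, which is exactly why keeping the full `matrix` (one row per region, as in a heatmap) is more informative than the averaged curve alone.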
**One caveat is that the meta-plot (on the left) is an average view of ChIP-seq tag enrichment and may not reflect the real biological meaning for individual cases.**

See a post from Lior Pachter: [How to average genome-wide data](https://liorpachter.wordpress.com/2015/07/13/how-to-average-genome-wide-data/)

I replied to the post:
>For ChIP-seq, in addition to the average plot, a heatmap with each region in a row should make comparisons clearer (although not quantitative). A box plot (or a histogram) is better in this case. I am really uncomfortable averaging the signal, as a single value (the mean) is not a good description of the distribution.

By Meromit Singer:
>thanks for the paper ref! Indeed, an additional important issue with averaging is that one could be looking at the aggregation of several (possibly very distinct) clusters. Another thing we should all keep in mind if we choose to make such plots..

A paper from Genome Research: [Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements](http://m.genome.cshlp.org/content/22/9/1735.full)

### Enhancer databases
* [FANTOM project](http://fantom.gsc.riken.jp/5/) CAGE for promoters and enhancers.
* [DENdb: database of integrated human enhancers](http://www.cbrc.kaust.edu.sa/dendb/)
* [VISTA enhancer browser](http://enhancer.lbl.gov/)
* [Super-enhancer database](http://www.bio-bigdata.com/SEA/)
* [Genome-wide identification and characterization of HOT regions in the human genome](http://biorxiv.org/content/early/2016/01/07/036152.abstract)
* [EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types](http://www.enhanceratlas.org/)
* [review: Computational Tools for Stem Cell Biology](http://www.sciencedirect.com/science/article/pii/S0167779916300567)
* [Integrative analysis of 10,000 epigenomic maps across 800 samples for regulatory genomics and disease dissection](https://www.biorxiv.org/content/10.1101/810291v2) from the Manolis Kellis group.
* [Index and biological spectrum of accessible DNA elements in the human genome](https://www.biorxiv.org/content/10.1101/822510v1) DHS sites from the John A. Stamatoyannopoulos group.

### Interesting Enhancer papers
* [Multiplex enhancer-reporter assays uncover unsophisticated TP53 enhancer logic](http://genome.cshlp.org/content/26/7/882)

### Enhancer target prediction

* [DoRothEA: collection of human and mouse regulons](https://github.com/saezlab/dorothea) DoRothEA is a gene regulatory network containing signed transcription factor (TF) - target gene interactions. DoRothEA regulons, the collections of a TF and its transcriptional targets, were curated and collected from different types of evidence for both human and mouse. A confidence level was assigned to each TF-target interaction based on the amount of supporting evidence.
* [Assessing Computational Methods for Transcription Factor Target Gene Identification Based on ChIP-seq Data](http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003342#pcbi.1003342.s019)
* [Protein binding and methylation on looping chromatin accurately predict distal regulatory interactions](http://biorxiv.org/content/early/2015/07/09/022293)
* [i-cisTarget](http://gbiomed.kuleuven.be/apps/lcb/i-cisTarget/)
* [protocol: iRegulon and i-cisTarget: Reconstructing Regulatory Networks Using Motif and Track Enrichment](http://www.ncbi.nlm.nih.gov/pubmed/26678384?dopt=Abstract&utm_source=dlvr.it&utm_medium=twitter)
* [Model-based Analysis of Regulation of Gene Expression: MARGE](http://cistrome.org/MARGE/) from Shirley Liu's lab. MARGE is a robust methodology that leverages a comprehensive library of genome-wide H3K27ac ChIP-seq profiles to predict key regulated genes and cis-regulatory regions in human or mouse.
* [PrESSto: Promoter Enhancer Slider Selector Tool](http://pressto.binf.ku.dk/)
* [TargetFinder](https://github.com/shwhalen/targetfinder). paper: [Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin](http://www.nature.com/ng/journal/v48/n5/full/ng.3539.html)
* [C3D](https://github.com/mlupien/C3D) Cross Cell-type Correlation in DNaseI hypersensitivity. Calculates correlations between open regions of chromatin based on DNase I hypersensitivity signals. Regions with high correlations are candidates for 3D interactions. It also performs association tests on each candidate and adjusts p-values.
* [ABC](https://www.biorxiv.org/content/10.1101/529990v1) Activity-by-Contact model of enhancer specificity from thousands of CRISPR perturbations. Blog post: https://jesseengreitz.wordpress.com/2019/02/10/preprint-activity-by-contact-model/
* [A curated benchmark of enhancer-gene interactions for evaluating enhancer-target gene prediction methods](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1924-8) "We use BENGI to test several published computational methods for linking enhancers with genes, including signal correlation and the TargetFinder and PEP supervised learning methods. We find that while TargetFinder is the best-performing method, it is only modestly better than a baseline distance method for most benchmark datasets when trained and tested with the same cell type and that TargetFinder often does not outperform the distance method when applied across cell types."


### Allele-specific analysis
* [WASP: allele-specific software for robust molecular quantitative trait locus discovery](http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3582.html)
* [ABC (Allele-specific Binding from ChIP-Seq)](https://github.com/mlupien/ABC/)
* [SNPsplit: Allele-specific splitting of alignments between genomes with known SNP genotypes](http://f1000research.com/articles/5-1479/v2)
* [BaalChIP](http://bioconductor.org/packages/devel/bioc/html/BaalChIP.html): Bayesian analysis of allele-specific transcription factor binding in cancer genomes. A bioconductor package; seems to be very useful.

### SNP effects on TF binding
* [RegulomeDB](http://www.regulomedb.org/) Use RegulomeDB to identify DNA features and regulatory elements in non-coding regions of the human genome by entering a dbSNP id, chromosome regions or single nucleotides.
* [motifbreakR](http://bioconductor.org/packages/devel/bioc/html/motifbreakR.html) A package for predicting the disruptiveness of single nucleotide polymorphisms on transcription factor binding sites.
* [GERV: A Statistical Method for Generative Evaluation of Regulatory Variants for Transcription Factor Binding](http://www.ncbi.nlm.nih.gov/pubmed/26476779?dopt=Abstract&utm_source=dlvr.it&utm_medium=twitter) From the same group as above.
* [PRIME: Predicted Regulatory Impact of a Mutation in an Enhancer](https://github.com/aertslab/primescore)
* [paper: Which Genetics Variants in DNase-Seq Footprints Are More Likely to Alter Binding?](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005875) [website](http://genome.grid.wayne.edu/centisnps/)
* [paper: Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo](http://www.nature.com/ng/journal/v47/n12/abs/ng.3432.html)
* [A Science paper: Survey of variation in human transcription factors reveals prevalent DNA binding changes](http://science.sciencemag.org/content/351/6280/1450)
* paper: [Estimating the functional impact of INDELs in transcription factor binding sites: a genome-wide landscape](http://biorxiv.org/content/early/2016/06/07/057604)
* [paper: Mutational Biases Drive Elevated Rates of Substitution at Regulatory Sites across Cancer Types](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006207)
* [paper: Recurrent promoter mutations in melanoma are defined by an extended context-specific mutational signature](http://biorxiv.org/content/early/2016/08/12/069351)
* [sasquatch](https://github.com/rschwess/sasquatch) Predicting the impact of regulatory SNPs from cell- and tissue-specific DNase footprints.

### co-occurring TFs

* In-silico search for co-occurring transcription factors: [INSECT](http://bioinformatics.ibioba-mpsp-conicet.gov.ar:84/INSECT/index.html)
* [INSECT 2](http://bioinformatics.ibioba-mpsp-conicet.gov.ar/INSECT2/)
* CO-factors associated with Uniquely-bound GEnomic Regions: [COUGER](http://couger.oit.duke.edu/)
### Conservation of the peak underlying DNA sequences
* [bioconductor annotation package phastCons100way.UCSC.hg19](https://bioconductor.org/packages/release/data/annotation/html/phastCons100way.UCSC.hg19.html) see this [post](http://bioinfoblog.it/2015/12/are-fitness-genes-more-conserved-my-first-30-minutes-attempt/) on how to use it.


### Integration of different data sets

[methylPipe and compEpiTools: a suite of R packages for the integrative analysis of epigenomics data](http://www.ncbi.nlm.nih.gov/pubmed/26415965?dopt=Abstract&utm_source=dlvr.it&utm_medium=twitter)

[Copy number information from targeted sequencing using off-target reads](https://bioconductor.org/packages/release/bioc/html/CopywriteR.html) bioconductor CopywriteR package.

[3CPET](http://www.bioconductor.org/packages/release/bioc/html/R3CPET.html): Finding Co-factor Complexes in ChIA-PET experiments using a Hierarchical Dirichlet Process

## New single/few cell epigenomics

* [GeF-seq: A Simple Procedure for Base Pair Resolution ChIP-seq](https://www.ncbi.nlm.nih.gov/m/pubmed/30109604/)

* [Ultra-low input CUT&RUN (uliCUT&RUN) enables interrogation of TF binding from low cell numbers](https://www.biorxiv.org/content/early/2018/03/21/286351)

* [We describe Cleavage Under Targets and Release Using Nuclease (CUT&RUN), a chromatin profiling strategy in which antibody-targeted controlled cleavage by micrococcal nuclease releases specific protein-DNA complexes into the supernatant for paired-end DNA sequencing](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5310842/) another CUT&RUN paper. Maybe useful for scChIP-seq?
* [Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state](https://www.nature.com/articles/nbt.3383)

* [Calling Cards enable multiplexed identification of the genomic targets of DNA-binding proteins](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083092/) this could potentially work with single cells.

* [Ultra-parallel ChIP-seq by barcoding of intact nuclei](https://www.biorxiv.org/content/early/2018/03/05/276469) works with as few as 1000 cells.

* [single-cell chromatin overall omic-scale landscape sequencing (scCOOL-seq) to generate a genome-wide map of DNA methylation and chromatin accessibility at single-cell resolution](https://www.nature.com/articles/s41556-018-0123-2)

* [High-Throughput ChIPmentation: freely scalable, single day ChIPseq data generation from very low cell-numbers](https://www.biorxiv.org/content/early/2018/09/27/426957)

* [CUT&Tag for efficient epigenomic profiling of small samples and single cells](https://www.biorxiv.org/content/10.1101/568915v1)

* [CUT&Tag Data Processing and Analysis Tutorial](https://yezhengstat.github.io/CUTTag_tutorial/) protocols.io link: https://www.protocols.io/view/cut-amp-tag-data-processing-and-analysis-tutorial-bjk2kkye

* [Simultaneous quantification of protein-DNA contacts and transcriptomes in single cells](https://www.biorxiv.org/content/10.1101/529388v1) scDamID&T.

* [Self-reporting transposons enable simultaneous readout of gene expression and transcription factor binding in single cells](https://www.biorxiv.org/content/10.1101/538553v2) piggyBac transposase.

* [Mapping Histone Modifications in Low Cell Number and Single Cells Using Antibody-guided Chromatin Tagmentation (ACT-seq)](https://www.biorxiv.org/content/10.1101/571208v2) by the Keji Zhao group.
* [Single-cell chromatin immunocleavage sequencing (scChIC-seq) to profile histone modification](https://www.nature.com/articles/s41592-019-0361-7) by the Keji Zhao group.

* [CoBATCH for high-throughput single-cell epigenomic profiling](https://www.biorxiv.org/content/10.1101/590661v1) Protein A fused to Tn5 transposase is enriched through specific antibodies at genomic regions, and Tn5 generates indexed chromatin fragments ready for library preparation and sequencing.

* [High-throughput single-cell ChIP-seq identifies heterogeneity of chromatin states in breast cancer](https://www.nature.com/articles/s41588-019-0424-9)

### ChIP-exo

* [Characterizing protein-DNA binding event subtypes in ChIP-exo data](https://www.biorxiv.org/content/early/2018/02/16/266536)
* [paper: Simplified ChIP-exo assays](https://www.nature.com/articles/s41467-018-05265-7)

### ATAC-seq

>Some may notice that the peaks produced look both like peaks produced from the TF ChIP-seq pipeline as well as the histone ChIP-seq pipeline. This is intentional, as ATAC-seq data looks both like TF data (narrow peaks of signal) as well as histone data (broader regions of openness).

* paper [From reads to insight: a hitchhiker's guide to ATAC-seq data analysis](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3)
* [ATACseqQC](http://bioconductor.org/packages/release/bioc/html/ATACseqQC.html) a bioconductor package for quality control of ATAC-seq data.
* [RASQUAL](https://github.com/dg13/rasqual) (Robust Allele Specific QUAntification and quality controL) maps QTLs for sequence-based cellular traits by combining population and allele-specific signals. [paper: Fine-mapping cellular QTLs with RASQUAL and ATAC-seq](http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3467.html)
* [ATAC-seq Forum](https://sites.google.com/site/atacseqpublic/home?pli=1)
* [Single-cell ATAC-Seq](http://cole-trapnell-lab.github.io/projects/sc-atac/)
* [A rapid and robust method for single cell chromatin accessibility profiling](https://www.biorxiv.org/content/early/2018/04/27/309831)
* [Global Prediction of Chromatin Accessibility Using RNA-seq from Small Number of Cells](http://www.biorxiv.org/content/early/2016/01/01/035816) from RNA-seq to DNA accessibility. [tool on github](https://github.com/WeiqiangZhou/BIRD)
* [NucleoATAC](https://github.com/GreenleafLab/NucleoATAC): Python package for calling nucleosomes using ATAC-Seq data.
* [chromVAR: Inferring transcription factor variation from single-cell epigenomic data](http://biorxiv.org/content/early/2017/02/21/110346) scATAC-seq.
* [ENCODE ATAC-seq guidelines](https://www.encodeproject.org/data-standards/atac-seq/)
* [Brockman](https://carldeboer.github.io/brockman.html) is a suite of command line tools and R functions to convert genomics data into DNA k-mer words representing the regions associated with a chromatin mark, and then analyze these k-mer sets to see how samples differ from each other. This approach is primarily intended for single-cell genomics data, and was tested most extensively on single-cell ATAC-seq data.
* [Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets via protocol-specific bias modeling](https://www.biorxiv.org/content/early/2018/03/19/284364)
* [msCentipede](http://rajanil.github.io/msCentipede/) is an algorithm for accurately inferring transcription factor binding sites using chromatin accessibility data (DNase-seq, ATAC-seq) and is written in Python 2.x and Cython.
* The Differential ATAC-seq Toolkit [(DAStk)](https://biof-git.colorado.edu/dowelllab/DAStk) is a set of scripts to aid in analyzing differential ATAC-seq data.
* [Identification of Transcription Factor Binding Sites using ATAC-seq](https://www.biorxiv.org/content/early/2018/07/17/362863) We propose HINT-ATAC, a footprinting method that addresses ATAC-seq-specific protocol artifacts.
* [HMMRATAC](https://github.com/LiuLabUB/HMMRATAC) splits a single ATAC-seq dataset into nucleosome-free and nucleosome-enriched signals, learns the unique chromatin structure around accessible regions, and then predicts accessible regions across the entire genome. We show that HMMRATAC outperforms the popular peak-calling algorithms on published human and mouse ATAC-seq datasets.

### DNase-seq
* [pyDNase](https://github.com/jpiper/pyDNase) - a library for analyzing DNase-seq data. [paper: Wellington-bootstrap: differential DNase-seq footprinting identifies cell-type determining transcription factors](http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2081-4)
* [paper: Analysis of computational footprinting methods for DNase sequencing experiments](http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3772.html) [tool](http://www.regulatory-genomics.org/hint/introduction/)
* [paper: A practical guide for DNase-seq data analysis: from data management to common applications](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby057/5053117)
* Two Nature primers: [Genome-wide footprinting: ready for prime time?](http://www.nature.com/nmeth/journal/v13/n3/full/nmeth.3766.html) [Genomic footprinting](http://www.nature.com/nmeth/journal/v13/n3/full/nmeth.3768.html)
* [PING](http://bioconductor.org/packages/release/bioc/html/PING.html) bioconductor package: Probabilistic inference for Nucleosome Positioning with MNase-based or Sonicated Short-read Data.
* [Basset](https://github.com/davek44/Basset) Convolutional neural network analysis for predicting DNA sequence activity.
* [Analysis of optimized DNase-seq reveals intrinsic bias in transcription factor footprint identification](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4018771/)

### Chromatin Interaction data (ChIA-PET, Hi-C)
* [ChIA-PET2](https://github.com/GuipengLi/ChIA-PET2) a versatile and flexible pipeline for analysing different variants of ChIA-PET data.
* [TopDom: An efficient and Deterministic Method for identifying Topological Domains in Genomes](http://zhoulab.usc.edu/TopDom/)
* [DBPnet: Inferring cooperation of DNA binding proteins in 3D genome](http://wanglab.ucsd.edu/star/DBPnet/index.html)
* [Systematic identification of cooperation between DNA binding proteins in 3D space](http://www.nature.com/ncomms/2016/160727/ncomms12249/full/ncomms12249.html)
* [DiffHiC](https://www.bioconductor.org/packages/release/bioc/html/diffHic.html) package, maintained by Aaron Lun, who is also the author of [csaw](https://bioconductor.org/packages/release/bioc/html/csaw.html) and [InteractionSet](https://github.com/LTLA/InteractionSet).
* [protocol: Practical Analysis of Genome Contact Interaction Experiments](http://link.springer.com/protocol/10.1007/978-1-4939-3578-9_9)
* [4D genome: a general repository for chromatin interaction data](http://4dgenome.research.chop.edu/)
* [CCSI](http://songyanglab.sysu.edu.cn/ccsi/search.php): a database providing chromatin–chromatin spatial interaction information. Only hg38 for human and mm10 for mouse.
* [LOGIQA](http://www.ngs-qc.org/logiqa/) is a database hosting local and global quality scores assessed over long-range interaction assays (e.g. Hi-C). Based on the concept applied by the NGS-QC Generator to ChIP-seq and related datasets, LOGIQA infers quality indicators by comparing multiple sequence-read random sampling assays.
505 | * [Computational Identification of Genomic Features That Influence 3D Chromatin Domain Formation](http://journals.plos.org/ploscompbiol/article?id=info:doi/10.1371/journal.pcbi.1004908) 506 | * [Feng Yue's lab](http://promoter.bx.psu.edu/hi-c/view.php) in PSU developed tools for Hi-C, 4C 507 | * [QuIN: A Web Server for Querying and Visualizing Chromatin Interaction Networks](https://quin.jax.org/) 508 | * [paper: Three-dimensional disorganization of the cancer genome occurs coincident with long-range genetic and epigenetic alterations](http://genome.cshlp.org/content/26/6/719) 509 | * [Exploring long-range genome interactions using the WashU Epigenome Browse](http://www.nature.com/nmeth/journal/v10/n5/full/nmeth.2440.html) 510 | * [MAPPING OF LONG-RANGE CHROMATIN INTERACTIONS BY PROXIMITY LIGATION ASSISTED CHIP-SEQ](http://biorxiv.org/content/early/2016/09/09/074294) 511 | * [HiChIP: Efficient and sensitive analysis of protein-directed genome architecture](http://biorxiv.org/content/early/2016/09/08/073619) HiChIP improves the yield of conformation-informative reads by over 10-fold and lowers input requirement over 100-fold relative to ChIA-PE 512 | * [A Compendium of Chromatin Contact Maps Reveals Spatially Active Regions in the Human Genome](http://www.cell.com/cell-reports/fulltext/S2211-1247(16)31481-4?elsca1=etoc&elsca2=email&elsca3=2211-1247_20161115_17_8_&elsca4=Cell%20Press%7CWebinar) paper from Bing Ren's group. 21 tissue-specific TADs. 513 | 514 | ### Caleb's take on HiChIP analysis 515 | From Caleb, the author of hichipper https://twitter.com/CalebLareau/status/1098312702651523077 thx! 516 | 517 | In HiChIP data analyses, there are two primary problems that we are trying to solve. A) Which anchors (i.e. genomic loci) should be used as a feature set and B) which loops (i.e. interactions between pairs of loci) are important in the data. 
2/n 518 | 519 | Depending on what you are hoping to use your data for, there are a variety of ways to think about anchors and loops. Two uses of HiChIP that come to mind are "which gene is this enhancer talking to" and "which loops are differential between my cell type/condition of interest" 3/n 520 | 521 | When Martin and I wrote hichipper, we envisioned the second question seeing more use (i.e. building out a framework for differential loop calling), so we wanted a pre-processing pipeline that was as inclusive of potential loops as possible and that could be subsetted downstream 4/n 522 | 523 | To these ends, we reported an improved version of anchor detection from HiChIP data by modeling the restriction enzyme cut bias explicitly, which helped identify high-quality anchors from the data itself 5/n 524 | 525 | (we achieve this by re-parametrizing MACS2 peak calling by essentially fitting a loess curve to the data in the previous picture) 6/n 526 | 527 | Unfortunately, based on user feedback, this modified background winds up with very, very conservative peak calling if the library preparations are sub-par. Thus, the safest way to approach HiChIP data analyses is often to use a pre-defined anchor set 7/n 528 | 529 | These can come from a complementary ATAC-seq or ChIP-seq dataset for the conditions that you are interested in. From what I've seen, you can supply a bed file to hichipper or other tools directly. Hichipper does some other modifications by default to this bed file, FYI 8/n 530 | 531 | In terms of the second problem of identifying loops, hichipper didn't make any revolutionary progress. We recommend some level of CPM-based filtering + mango FDR calculation (implemented in hichipper) for identifying single-library significant loops. 9/n 532 | 533 | Where I've personally done the most is in getting multiple libraries from multiple conditions and using some sort of between-replicate logic to filter to a reasonable (~10,000-20,000) number of loops (see e.g.
https://github.com/caleblareau/k562-hichip …) 10/n 534 | 535 | 536 | Other tools (that I admittedly have not tried) use a variety of statistical techniques to (probably more intelligently, from what I can tell) merge anchors or filter loops for analyses. A brief rundown of those that I'm aware of (not exhaustive)-- 11/n 537 | 538 | MAPS (https://www.biorxiv.org/content/biorxiv/early/2018/09/08/411835.full.pdf …) uses a measure of reproducibility with ChIP-seq to define a normalization and significance basis for loop calling. Given HiChIP-specific restriction enzyme bias, this seems sensible 12/n 539 | 540 | FitHiChIP (https://www.biorxiv.org/content/early/2018/10/29/376194.full.pdf …) provides automatic merging of nearby anchors to solve the "hairball" problem, which is clearly shown in Fig. 1. When I compared hichipper to FitHiC, the bias regression seemed to perform well, but I ran into memory issues with high- 13/n 541 | 542 | resolution (i.e. ~2.5kb) HiChIP data, which the authors have apparently solved in FitHiChIP. 14/n 543 | 544 | Additionally, there is CID, which uses a density-based method to further collapse anchors to solve the "hairball" problem. 15/n 545 | 546 | There are certainly other tools out there, but from my experience, any of these four (hichipper, MAPS, FitHiChIP, and CID) will probably give you something sensible (again acknowledging that I myself haven't actually run these other 3 tools) 16/n 547 | 548 | And if you're still reading this, I'll be a bit more specific about how I view hichipper pros/cons from both my own use and others in the community: hichipper provides the most "vanilla" functionality to give sensible yet exhaustive anchors and loops.
17/n 549 | 550 | I prefer it this way because I find that for each data set, I have to apply variable downstream thresholds and cutoffs, because the assay is so variable depending on which experimentalist performs the protocol and the biological question often varies so much 18/n 551 | 552 | This may be a negative for individuals new to bioinformatics or HiChIP data but seemingly a positive for someone more experienced in working with related data. It's not obvious to me which other tools may be more applicable to a novice 19/n 553 | 554 | Hope this helps paint a picture-- do let me know what you find if you compare tools! I think that it would be useful for the community. 20/20 555 | 556 | 557 | -------------------------------------------------------------------------------- /UCSC_browser/makeUCSCtrackHub.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | 3 | __author__ = 'tommy' 4 | 5 | # check the tutorial here https://pythonhosted.org/trackhub/tutorial.html 6 | # import the components we'll be using 7 | # 01/12/2016 8 | 9 | import argparse 10 | 11 | parser = argparse.ArgumentParser() 12 | parser.add_argument("--hub_name", help="Required. the name of your track hub") 13 | parser.add_argument("--base_url", help="Required. the path to the folder where the files are accessible from the web; " 14 | "make sure to add the trailing slash") 15 | parser.add_argument("--input_dir", help="Required. the folder where the files are stored on your local computer") 16 | parser.add_argument("--output_dir", help="the folder where the hub track files are generated, default is " 17 | "the same as input_dir", default=".") 18 | parser.add_argument("--email", help="Required. your contact email") 19 | parser.add_argument("--composite_track_name", help="Required.
the name of your composite track") 20 | 21 | args = parser.parse_args() 22 | 23 | assert args.hub_name is not None, "please provide the hub_name" 24 | assert args.base_url is not None, "please provide the base_url" 25 | assert args.composite_track_name is not None, "please provide the composite track name" 26 | assert args.email is not None, "please provide your email" 27 | assert args.input_dir is not None, "please provide the path to the bigwig and bigbed files on your local computer" 28 | 29 | 30 | from trackhub import Hub, GenomesFile, Genome, TrackDb 31 | 32 | hub = Hub( 33 | hub='%s' % args.hub_name, 34 | short_label='%s' % args.hub_name, 35 | long_label='%s ChIP-seq hub' % args.hub_name, 36 | email='%s' % args.email) 37 | 38 | genomes_file = GenomesFile() 39 | genome = Genome('hg19') 40 | trackdb = TrackDb() 41 | 42 | # Bottom-up 43 | genome.add_trackdb(trackdb) 44 | genomes_file.add_genome(genome) 45 | hub.add_genomes_file(genomes_file) 46 | 47 | # make a composite track 48 | from trackhub import CompositeTrack 49 | 50 | composite = CompositeTrack( 51 | name="%s" % args.composite_track_name, 52 | short_label="%s" % args.composite_track_name, 53 | long_label=" %s ChIP-seq" % args.composite_track_name, 54 | tracktype="bigWig") 55 | 56 | # After the composite track has been created, we can incrementally add additional parameters. 57 | # The same method can be used for all classes derived from Track: CompositeTrack, ViewTrack, 58 | # and of course Track itself: 59 | 60 | composite.add_params(dragAndDrop='subtracks', visibility='full') 61 | 62 | # The next part of the hierarchy is a ViewTrack object. Both ViewTrack and CompositeTrack are subclasses of the more generic Track class, 63 | so they act in much the same way.
This should look familiar, but a notable difference is the addition of the view kwarg 64 | from trackhub import ViewTrack 65 | 66 | bed_view = ViewTrack( 67 | name="bedViewTrack", 68 | view="Bed", 69 | visibility="squish", 70 | tracktype="bigBed 3", 71 | short_label="beds", 72 | long_label="Beds") 73 | 74 | signal_view = ViewTrack( 75 | name="signalViewTrack", 76 | view="Signal", 77 | visibility="full", 78 | tracktype="bigWig 0 10000", 79 | short_label="signal", 80 | long_label="Signal") 81 | 82 | # Add these new view tracks to composite: 83 | 84 | composite.add_view(bed_view) 85 | composite.add_view(signal_view) 86 | 87 | 88 | # We can make changes to the created views without having to add them again to the composite. 89 | # For example, here we add configureable on to each view and print composite to make sure the changes show up: 90 | 91 | 92 | for view in composite.views: 93 | view.add_params(configurable="on") 94 | 95 | 96 | ## add the bigwig files and bed files to the track 97 | 98 | import os 99 | import glob 100 | from trackhub import Track 101 | 102 | os.chdir("%s" % args.input_dir) 103 | # A quick function to return the number in the middle of filenames -- this 104 | # will become the key into the subgroup dictionaries above 105 | # def num_from_fn(fn): 106 | #return os.path.basename(fn).split('.')[0].split('-')[-1] 107 | 108 | # Make the bigBed tracks 109 | 110 | def make_bigBed_tracks(bed_view, url_base): 111 | for bb in glob.glob('*.bigBed'): 112 | basename = os.path.basename(bb) 113 | label = bb.replace('.bigBed', '') 114 | track = Track( 115 | name='peak_%s' % label, 116 | tracktype='bigBed 3', 117 | url=url_base + basename, 118 | local_fn=bb, 119 | shortLabel='peaks %s' % label, 120 | longLabel='peaks %s' % label) 121 | 122 | # add this track to the bed view 123 | bed_view.add_tracks(track) 124 | 125 | # Make the bigWig tracks 126 | 127 | def make_bigWig_tracks(signal_view, url_base): 128 | for bw in glob.glob('*.bw'): 129 | label = 
bw.replace('.bw', '') 130 | basename = os.path.basename(bw) 131 | track = Track( 132 | name='signal_%s' % label, 133 | tracktype='bigWig', 134 | url=url_base + basename, 135 | local_fn=bw, 136 | shortLabel='signal %s' % label, 137 | longLabel='signal %s' % label) 138 | 139 | # add this track to the signal view 140 | signal_view.add_tracks(track) 141 | 142 | make_bigBed_tracks(bed_view, args.base_url) 143 | make_bigWig_tracks(signal_view, args.base_url) 144 | trackdb.add_tracks(composite) 145 | 146 | print trackdb 147 | 148 | os.chdir('%s' % args.output_dir) 149 | 150 | results = hub.render() 151 | 152 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-cayman -------------------------------------------------------------------------------- /images/CTCF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/CTCF.png -------------------------------------------------------------------------------- /images/HTSeq-extend200.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/HTSeq-extend200.png -------------------------------------------------------------------------------- /images/RNApol2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/RNApol2.png -------------------------------------------------------------------------------- /images/bam2bigwig1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/bam2bigwig1.png -------------------------------------------------------------------------------- /images/bam2bigwig2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/bam2bigwig2.png -------------------------------------------------------------------------------- /images/bam2bigwig3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/bam2bigwig3.png -------------------------------------------------------------------------------- /images/choose_diff_tool.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/choose_diff_tool.png -------------------------------------------------------------------------------- /images/cross-correlation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/cross-correlation.png -------------------------------------------------------------------------------- /images/extend_200.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/extend_200.png -------------------------------------------------------------------------------- /images/fastqc_1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_1.png -------------------------------------------------------------------------------- /images/fastqc_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_10.png -------------------------------------------------------------------------------- /images/fastqc_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_11.png -------------------------------------------------------------------------------- /images/fastqc_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_2.png -------------------------------------------------------------------------------- /images/fastqc_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_3.png -------------------------------------------------------------------------------- /images/fastqc_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_4.png -------------------------------------------------------------------------------- /images/fastqc_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_5.png 
-------------------------------------------------------------------------------- /images/fastqc_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_6.png -------------------------------------------------------------------------------- /images/fastqc_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_7.png -------------------------------------------------------------------------------- /images/fastqc_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_8.png -------------------------------------------------------------------------------- /images/fastqc_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/fastqc_9.png -------------------------------------------------------------------------------- /images/input_control.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/input_control.png -------------------------------------------------------------------------------- /images/meta-heatmap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/meta-heatmap.png -------------------------------------------------------------------------------- /images/phred_score.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/phred_score.png -------------------------------------------------------------------------------- /images/shift.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/shift.png -------------------------------------------------------------------------------- /images/snakemake_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/snakemake_flow.png -------------------------------------------------------------------------------- /images/super-enhancer-plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/super-enhancer-plot.png -------------------------------------------------------------------------------- /images/variablePeaks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/images/variablePeaks.png -------------------------------------------------------------------------------- /part0.1_fastqc.md: -------------------------------------------------------------------------------- 1 | ### Quality control with fastqc 2 | 3 | For raw sequencing files in fastq format, [fastqc](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is used to do quality control. 4 | 5 | read my [blog post](http://crazyhottommy.blogspot.com/2014/06/quality-control-of-your-fastq-file-my.html) 6 | 7 | Let's start. 
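As a side note, the quality encoding can also be checked programmatically by scanning the quality characters of a few reads. A rough heuristic sketch (the ASCII cutoffs below are common rules of thumb, not something from this repo):

```python
# Heuristic guess of FASTQ quality encoding (phred33 vs phred64).
# Characters with ASCII < 59 (below ';') essentially never occur under phred64;
# characters with ASCII > 74 (above 'J') rarely occur under phred33.
def guess_phred_offset(quality_strings):
    lo = min(ord(min(q)) for q in quality_strings)
    hi = max(ord(max(q)) for q in quality_strings)
    if lo < 59:
        return 33
    if hi > 74:
        return 64
    return None  # ambiguous -- inspect more reads

# '#' (ASCII 35) in a quality string immediately implies phred33
print(guess_phred_offset(["#1:AABDDH"]))  # → 33
```

A character like `#` can only appear under phred33, which is how a read can be classified at a glance.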
8 | Inspecting the first several reads of the fastq (this is ChIP-seq data), I found the quality scores are encoded as [phred33](https://en.wikipedia.org/wiki/Phred_quality_score) 9 | 10 | ``` 11 | @SRR866627_nutlin_p53.1 HWI-ST571:161:D0YP4ACXX:3:1101:1437:2055 length=51 12 | NCGAAAGACTGCTGGCCGACGTCGAGGTCCCGATTGTCGGCGTCGGCGGCA 13 | +SRR866627_nutlin_p53.1 HWI-ST571:161:D0YP4ACXX:3:1101:1437:2055 length=51 14 | #1:AABDDH
>TruSeq_Universal_Adapter 43 | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT 44 | >TruSeq_Adapter_Index_1 45 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG 46 | >TruSeq_Adapter_Index_2 47 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG 48 | >TruSeq_Adapter_Index_3 49 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG 50 | >TruSeq_Adapter_Index_4 51 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG 52 | >TruSeq_Adapter_Index_5 53 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG 54 | >TruSeq_Adapter_Index_6 55 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG 56 | >TruSeq_Adapter_Index_7 57 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG 58 | >TruSeq_Adapter_Index_8 59 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG 60 | >TruSeq_Adapter_Index_9 61 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG 62 | >TruSeq_Adapter_Index_10 63 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG 64 | >TruSeq_Adapter_Index_11 65 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG 66 | >TruSeq_Adapter_Index_12 67 | GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG 68 | ``` 69 | After that, I checked the file with fastqc again: 70 | ![](./images/fastqc_5.png) 71 | ![](./images/fastqc_6.png) 72 | ![](./images/fastqc_7.png) 73 | 74 | The per-base quality was improved and the per-base GC content became more normal (should be ~25% for each of the A, T, C, G bases; it looks like the
sequences are more A/T rich). More importantly, no overrepresented sequences were found anymore. Now the fastq file is ready for subsequent mapping by bowtie. 75 | 76 | 77 | ### RNA-seq read quality control 78 | Some time ago, I quality-controlled an RNA-seq dataset, found many weird things, and asked on [biostar](https://www.biostars.org/p/99327/#99451). 79 | fastqc output: 80 | 81 | ![](./images/fastqc_8.png) 82 | ![](./images/fastqc_9.png) 83 | ![](./images/fastqc_10.png) 84 | ![](./images/fastqc_11.png) 85 | 86 | 87 | Take home messages: 88 | 1) 89 | "The weird per-base content is extremely normal and is typically referred to as the "random" hexamer effect, since things aren't actually amplified in a uniform manner. Most people will have this and it's nothing to worry about. BTW, don't let anyone talk you into trimming that off, it's the legitimate sequence." 90 | 91 | 2) 92 | "Duplicates are expected in RNAseq. Firstly, your samples are probably full of rRNAs, even if you performed ribodepletion. Secondly, any highly expressed gene is going to have apparent PCR duplication due solely to it being highly expressed (there is a ceiling on the coverage you can get before running into this, it's pretty high with paired-end reads but not infinite)." 93 | 94 | 3) 95 | Several posts on whether duplicates need to be removed or not: 96 | https://www.biostars.org/p/55648/ 97 | https://www.biostars.org/p/66831/ 98 | https://www.biostars.org/p/14283/ 99 | 100 | 101 | ### [Tutorial: Revisiting the FastQC read duplication report](https://www.biostars.org/p/107402/) 102 | 103 | Quote from Istvan Albert: 104 | >The new plots now contain two different curves and the meaning of the percentage has also changed. The explanations in the docs are a little bit lacking, so to make sure I got it right I wrote a python implementation (see the end) that produces the same plots.
105 | 106 | >I found it helpful to use the term "distinct" sequences rather than unique sequences, as this latter term seems to imply to some that those sequences are present only once in the data. So distinct sequences are defined as the largest subset of sequences where no two sequences are identical. 107 | 108 | >Thus distinct sequences = number of singletons (sequences that appear only once) + number of doubles (number of sequences that appear twice but each double will be counted only once) + number of triplets (sequences that appear three times but each will be counted once) ... and so on. 109 | 110 | >The percentage in the title is computed as distinct/total * 100 111 | 112 | >The blue line represents the counts of all the sequences that are duplicated at a given rate. The percentage is computed relative to the total number of reads. 113 | 114 | >The red line represents the number of distinct sequences that are duplicated at a given rate. The percentage is computed relative to the total number of distinct sequences in the data. 115 | 116 | >Let's take two examples where each contains 20 reads: 117 | 118 | >Case 1: 10 unique reads + 5 reads each present twice (duplicates) 119 | Case 2: 10 unique reads + 1 read present 10 times 120 | Case 1, shown in the upper plot, will lead to 15 distinct reads and thus 15/20=75% remaining; the number of singletons is 1x10 =10 and the number of doubles is 5x2 =10, therefore the blue line has a plateau at those rates. The 15 distinct sequences are distributed as 10 singletons and 5 duplicates, so 10/15=66% and 5/15=33% is the slope of the red line. 121 | 122 | >Case 2 will produce 11 distinct reads and therefore 11/20=55% will be the percent remaining reads. Again the total number of reads is equally distributed between the two cases but this time the peak will be at 10 since we have one read duplicated 10 times and that produces 10 sequences.
But there are 11 total groups, where 10/11=91% are singletons and 1/11=9% of the groups form at a duplication rate of 10x. 123 | 124 | 125 | [Another post](http://proteo.me.uk/2013/09/a-new-way-to-look-at-duplication-in-fastqc-v0-11/) by Simon Andrews, the author of fastqc. 126 | -------------------------------------------------------------------------------- /part0.2_mapping_to_genome.md: -------------------------------------------------------------------------------- 1 | I usually use bowtie1 for mapping short 36-bp reads to the human genome. 2 | bowtie2 is better for longer reads. 3 | 4 | bowtie1: 5 | ```bash 6 | bowtie -p 10 --best --chunkmbs 200 path/to/ref/genome -q my.fastq -S | samtools view -bS - > unsorted.bam 7 | ``` 8 | 9 | >BWA-MEM is recommended for query sequences longer than ~70bp for a variety of error rates (or sequence divergence). 10 | > Generally, BWA-MEM is more tolerant with errors given longer query sequences as the chance of missing all seeds is small. 11 | > As is shown above, with non-default settings, BWA-MEM works with Oxford Nanopore reads with a sequencing error rate over 20%. 12 | 13 | 14 | Use [Teaser](https://github.com/Cibiv/Teaser) to test which mapper works the best for you. 15 | -------------------------------------------------------------------------------- /part0.3_downsampling_bam.md: -------------------------------------------------------------------------------- 1 | ### Downsampling reads to a certain number 2 | 3 | When comparing different ChIP-seq data sets or analyzing a set of ChIP-seq data sets together (e.g. ChromHMM analysis), it is desirable 4 | to subsample the deeply sequenced ones to a certain number of reads (say 15 million or 30 million).
5 | 6 | In the paper [Integrative analysis of 111 reference human epigenomes](http://www.nature.com/nature/journal/v518/n7539/full/nature14248.html): 7 | 8 | >To avoid artificial differences in signal strength due to differences in sequencing 9 | depth, all consolidated histone mark data sets (except the additional histone marks in 10 | the seven deeply profiled epigenomes, Fig. 2j) were uniformly subsampled to a 11 | **maximum depth of 30 million reads** (the median read depth over all consolidated 12 | samples). For the seven deeply profiled reference epigenomes (Fig. 2j), histone mark 13 | data sets were subsampled to a maximum of 45 million reads (median depth). The 14 | consolidated DNase-seq data sets were subsampled to a maximum depth of 50 15 | million reads (median depth). **These uniformly subsampled data sets were then used 16 | for all further processing steps (peak calling, signal coverage tracks, chromatin states)**. 17 | 18 | After reading several posts [here](https://www.biostars.org/p/76791/) and [here](https://groups.google.com/forum/#!topic/bedtools-discuss/gf0KeAJN2Cw), it seems `samtools` and `sambamba` are the tools to use, but they both subsample a proportion of reads rather than an exact number. 20 | 21 | ```bash 22 | time samtools view -s 3.6 -b my.bam -o subsample.bam 23 | real 6m9.141s 24 | user 5m59.842s 25 | sys 0m8.912s 26 | 27 | time sambamba view -f bam -t 10 --subsampling-seed=3 -s 0.6 my.bam -o subsample.bam 28 | real 1m34.937s 29 | user 11m55.222s 30 | sys 0m29.872s 31 | ``` 32 | `-s 3.6` sets a seed of 3 and keeps 60% of the reads in samtools. 33 | Using multiple CPUs with `sambamba` is much faster, and an index file is generated on the fly. 34 | 35 | If one wants to get, say, 15 million reads, one needs to run `samtools flagstat` or `samtools idxstats` to get the total number of reads, 36 | and then calculate the proportion by: `15 million/total = proportion`.
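The arithmetic above can be scripted; a minimal sketch (the total read count here is made up — in practice it would come from `samtools idxstats` or `samtools flagstat`):

```python
# Compute the -s argument for samtools subsampling:
# the integer part is the random seed, the fractional part is the
# proportion of reads to keep.
target = 15_000_000          # desired read count after downsampling
total_reads = 42_000_000     # hypothetical total from `samtools idxstats`
seed = 3                     # any integer seed

proportion = target / total_reads
s_arg = f"{seed}.{round(proportion * 1000):03d}"  # seed 3 + ~35.7% kept
print(s_arg)  # → 3.357
```

The resulting string is what gets passed to `samtools view -s`; `sambamba` takes the seed and fraction as separate flags instead.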
37 | 38 | `samtools idxstats` is much faster when the bam is sorted and indexed: 39 | >Retrieve and print stats in the index file. The output is TAB delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. 40 | 41 | Total number of reads: `samtools idxstats example.bam | cut -f3 | awk 'BEGIN {total=0} {total += $1} END {print total}'` 42 | 43 | Finally, feed the proportion to the `-s` flag. One might want to remove unmapped reads and duplicated reads from the bam file before downsampling. One might also need to sort the subsampled bam file again and index it. 44 | -------------------------------------------------------------------------------- /part0_quality_control.md: -------------------------------------------------------------------------------- 1 | ## Quality control of the ChIP-seq data. 2 | 3 | First read these papers: 4 | 5 | [ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia](http://www.ncbi.nlm.nih.gov/pubmed/22955991) 6 | [Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003326) 7 | [Sequencing depth and coverage: key considerations in genomic analyses](http://www.nature.com/nrg/journal/v15/n2/full/nrg3642.html) 8 | [Impact of sequencing depth in ChIP-seq experiments](http://nar.oxfordjournals.org/content/early/2014/03/05/nar.gku178) 9 | [Systematic evaluation of factors influencing ChIP-seq fidelity](http://www.nature.com/nmeth/journal/v9/n6/full/nmeth.1985.html) 10 | [Impact of artifact removal on ChIP quality metrics in ChIP-seq and ChIP-exo data](http://www.ncbi.nlm.nih.gov/pubmed/24782889) 11 | [Large-Scale Quality Analysis of Published ChIP-seq Data](http://www.g3journal.org/content/4/2/209.full) 12 | 13 | 14 | 15 | ## Guidelines from Shirley Liu's lab 16 | According to a [guideline](http://cistrome.org/chilin/_downloads/instructions.pdf) from Shirley Liu's lab, I
summarize the metrics below (there are many metrics; we can just use some of them): 17 | 18 | ### peak calling independent statistics 19 | 1. Fastq reads median quality score >= 25. This can be obtained with [FASTQC from the Babraham Institute](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Many other good tools, like Bismark for DNA methylation data mapping and SeqMonk, a pretty cool GUI alternative to IGV, are from this institute as well. According to Kadir, the sequencing core members will do initial quality control on the fastq files and will flag a file if its quality is bad. In addition, they will trim off the adaptors when de-multiplexing. 20 | 21 | 2. Uniquely mapped reads. According to ENCODE best practice, for most transcription factors (TFs), ~10 million uniquely mapped reads are good enough; for histone modifications, ~20 million uniquely mapped reads are recommended. The more reads one sequences, the more peaks will show up. However, the peak number will saturate once a certain number of reads (say ~30 million for TFs) are sequenced. **A good uniquely mapped ratio is ≥ 60%.** bowtie1 will output this number, or we can get it with samtools. 22 | 23 | **Duplicated reads** are reads with the same coordinates for the 5' and 3' ends (often due to PCR artifacts). 24 | A special case is MNase-seq, in which an enzyme is used to cut the DNA and the nucleosome-bound sequences are then sequenced. One would expect to find many duplicated reads. See [here](https://ethanomics.wordpress.com/2012/01/06/to-filter-or-not-to-filter-duplicate-reads-chip-seq/) 25 | 26 | A read can map to multiple places because of repetitive sequences in the genome, so the aligner (BWA, bowtie) cannot assign it uniquely to one place. 27 | For ChIP-seq, we only want the **uniquely mapped reads**.
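As a toy illustration of what MAPQ-based filtering does (a pure-Python sketch mimicking `samtools view -q 10`; the SAM records below are fabricated):

```python
# Filter SAM records by mapping quality, mimicking `samtools view -q 10`.
def filter_by_mapq(sam_lines, min_mapq=10):
    kept = []
    for line in sam_lines:
        if line.startswith("@"):                      # always keep header lines
            kept.append(line)
        elif int(line.split("\t")[4]) >= min_mapq:    # SAM column 5 is MAPQ
            kept.append(line)
    return kept

# Fabricated records: one header, one confidently mapped read (MAPQ 42),
# and one multi-mapper (MAPQ 0) that should be dropped.
sam = ["@HD\tVN:1.6",
       "r1\t0\tchr1\t100\t42\t36M\t*\t0\t0\tACGT\tFFFF",
       "r2\t0\tchr1\t200\t0\t36M\t*\t0\t0\tACGT\tFFFF"]
print(len(filter_by_mapq(sam)))  # → 2 (header + the MAPQ-42 read)
```

In practice one would of course run samtools on the bam directly; the sketch just shows which field the `-q` cutoff acts on.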
28 | 29 | 30 | [Print uniquely mapped reads](https://wikis.utexas.edu/display/bioiteam/Samtools+tricks) 31 | `samtools view -bq 1 ` 32 | 33 | It looks like one should [abandon](https://www.biostars.org/p/101533/#101537) the concept of uniquely mapped reads 34 | >Bowtie2 will give an alignment a MAPQ score of 0 or 1 if it can map equally well to more than one location. Further, there's not always a perfect correspondence between the MAPQ of a read and the summary metrics it prints at the end (I'd need to go through the code again to determine how it arrives at the printed summary metrics, that's not documented anywhere). Finally, you would be well served to completely abandon the concept of "uniquely mapped". It is never useful and is always misleading, since you're simply lying to yourself by labeling something unique. You're better served by simply filtering on a meaningful MAPQ (5 or 10 are often reasonable choices), which has the benefit of actually doing what you want, namely filtering according to the likelihood that an alignment is correct. 35 | > 36 | 37 | Simply filter the bam by MAPQ (mapping quality of the reads); 5 or 10 is usually reasonable: 38 | 39 | `samtools view -b -q 10 foo.bam > foo.filtered.bam` 40 | or if you only want the number: 41 | `samtools view -c -q 10 foo.bam` 42 | 43 | ### peak calling dependent statistics 44 | 45 | 4. Peak number for each replicate called by MACS2 with fixed extension size (~200bp) and qvalue cutoff. A good peak number depends on your experiment. 46 | 5. Peak number for each replicate called by MACS2 where the fold change is ≥ 10. 47 | 6. Peak number for each replicate called by MACS2 where the fold change is ≥ 20. 48 | 7. Replicate read correlation is the whole-genome read Pearson correlation for all replicates at resolution 146 bp. A good correlation score is ≥ 0.6. 49 | 50 | Details: one can bin the genome into small bins, say 1000 bp each, and then count how many reads fall in each bin.
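The correlation step itself is simple once the per-bin counts exist. A minimal Python sketch of the Pearson calculation (the bin counts below are made up for illustration; in practice they would come from `bedtools multicov` or deepTools output):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length vectors of bin counts."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# toy read counts per 1 kb bin for two hypothetical replicates
rep1 = [12, 0, 45, 3, 88, 7, 23, 0, 5, 61]
rep2 = [10, 1, 50, 2, 80, 9, 20, 0, 7, 55]

r = pearson(rep1, rep2)
print(round(r, 3))  # per the guideline above, replicates with r >= 0.6 pass
```

The same function applied to rank-transformed counts would give the Spearman correlation.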
51 | For replicates, a Pearson correlation or Spearman correlation can be calculated. We can do it by using bedtools [makewindows](http://bedtools.readthedocs.org/en/latest/content/tools/makewindows.html) for binning, and [multicov](http://bedtools.readthedocs.org/en/latest/content/tools/multicov.html) for counting the reads. 52 | 53 | [Deeptools](https://github.com/fidelram/deepTools/wiki/QC) has a command to calculate these numbers as well. 54 | 55 | It takes a long time to run, because it calculates the read counts in each bin across the genome. One alternative is to just 56 | count reads in the peaks called by MACS2, which is much faster. 57 | 58 | 8. Replicate peak overlap number: how many peaks overlap among replicates. 59 | 60 | 9. Top peaks not overlapping blacklist regions ratio is the fraction of the merged top 5000 peaks (ordered by MACS2 -log(qvalue)) which do not overlap with [blacklist regions](https://sites.google.com/site/anshulkundaje/projects/blacklists). This is expected to be ≥ 90%. We will remove peaks that overlap with blacklist regions using bedtools anyway. 61 | 62 | 10. Top peaks overlapping union DHS number (ratio) is the fraction of the merged top 5000 peaks (ordered by MACS2 -log(qvalue)) which overlap with union DHS regions. Union DHS regions are obtained from ENCODE II UW DNase-seq hypersensitive regions. The union DHS regions were collected from 122 human datasets or 53 mouse datasets; we do not have union DHS for other species. Union DHS generation consists of three steps: 1. for peaks longer than 300 bp, trim the MACS2 peak to 300 bp around the MACS2 summit; 2. if shorter than 300 bp, preserve the original length; 3. merge peaks that overlap each other. This is expected to be ≥ 70%. 63 | 64 | Also read this paper from the John Stamatoyannopoulos group: [The accessible chromatin landscape of the human genome](http://www.ncbi.nlm.nih.gov/pubmed/22955617). It has DHS sites in the supplementary materials. 65 | 66 | 11.
Top peaks conservation plot is the Phastcons conservation scores distribution around +/- 2kb of the top 5000 merged peak summits. Phastcons conservation scores are from the placental mammals multiple alignment. For TFs and active histone marks, the plot should show a sharp peak in the center. 67 | 68 | I do not have a script for this yet. I think one can just download the phastcons scores from the UCSC genome browser, and then 69 | plot the score around the summits of the peaks. I will look into this if necessary. 70 | 71 | 12. Top peaks motif analysis is the motif analysis performed on the top 5000 merged peak summits. I will cover motif-enrichment analysis in another markdown file in the repo. Basically, MEME suite and HOMER will be used. 72 | 73 | ## ENCODE guidelines 74 | 75 | From the paper [ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia](http://www.ncbi.nlm.nih.gov/pubmed/22955991) 76 | 77 | ### Evaluation of ChIP-seq quality 78 | **Cross-correlation analysis** 79 | >A very useful ChIP-seq quality metric that is **independent of peak calling** is strand cross-correlation. It is based on the fact that **a high-quality ChIP-seq experiment produces significant clustering of enriched DNA sequence tags at locations bound by the protein of interest, and that the sequence tag density accumulates on forward and reverse strands centered around the binding site.** As illustrated in Figure 5D, these “true signal” sequence tags are positioned at a distance from the binding site center that depends on the fragment size distribution (Kharchenko et al. 2008). A control experiment, such as sequenced input DNA, lacks this pattern of shifted stranded tag densities (Supplemental Fig. S1). This has made it possible to develop a metric that quantifies fragment clustering (IP enrichment) based on the correlation between genome-wide stranded tag densities (A Kundaje, Y Jung, P Kharchenko, B Wold, A Sidow, S Batzoglou, and P Park, in prep.).
**It is computed as the Pearson linear correlation between the Crick strand and the Watson strand, after shifting Watson by k base pairs (Fig. 5E)**. This typically produces two peaks when cross-correlation is plotted against the shift value: **a peak of enrichment corresponding to the predominant fragment length** and **a peak corresponding to the read length (“phantom” peak)** (Fig. 4E; Heinz et al. 2010; A Kundaje, Y Jung, P Kharchenko, B Wold, A Sidow, S Batzoglou, and P Park, in prep.). 80 | 81 | ![](./images/shift.png) 82 | 83 | ![](./images/cross-correlation.png) 84 | 85 | 86 | >The **normalized ratio between the fragment-length cross-correlation peak and the background cross-correlation (normalized strand coefficient, NSC)** and the **ratio between the fragment-length peak and the read-length peak (relative strand correlation, RSC)** (Fig. 4G), are strong metrics for assessing signal-to-noise ratios in a ChIP-seq experiment. High-quality ChIP-seq data sets tend to have a larger fragment-length peak compared with the read-length peak, whereas failed ones and inputs have little or no such peak (Figs. 4G, 5A,B; Fig. 7, below). In general, we observe a continuum between the two extremes, and **broad-source data sets are expected to have flatter cross-correlation profiles than point-sources, even when they are of very high quality.** As expected, the NSC/RSC and FRiP metrics are strongly and positively correlated for the majority of experiments (Fig. 4F). As with the other quality metrics, even high-quality data sets generated for factors with few genuine binding sites tend to produce relatively low NSCs. 87 | 88 | >These measures form the basis for one of the current quality standards for ENCODE data sets. **We repeat replicates with NSC values <1.05 and RSC values <0.8** and, if additional replicates produce low values, we include a note with the reported data set (Box 3). 
We illustrate the application of our ChIP-seq quality metrics to a failed pair of replicates in Figure 5, A–E. Initially, two EGR1 ChIP-seq replicates were generated in the K562 cell line. Based on the cross-correlation profiles, FRiP score, and number of called regions, these replicates were flagged as marginal in quality. The experiments were repeated, with all quality control metrics improving considerably. On this basis, the superior measurements replaced the initial ones in the ENCODE database. 89 | 90 | 91 | ### Understanding NSC and RSC 92 | From Anshul: 93 | 94 | #### Normalized strand cross-correlation coefficient (NSC) 95 | 96 | Genome-wide correlation between + and - strand read counts when shifted by fraglen/2 relative to background. Represents enrichment of clustered ChIP fragments around target sites. 97 | 98 | Input-DNA values are used as a reference for calibration. 99 | 100 | Diffused marks such as H3K9me3 inherently have lower signal to noise ratios and hence NSC compared to strong active marks such as H3K4me3. 101 | 102 | **Samples with very low seq. depth can have abnormally high NSC** since there is a significant depletion of 'background', i.e. these samples tend to have higher specificity but very low sensitivity. 103 | 104 | #### Relative strand cross-correlation coefficient (RSC) 105 | 106 | Relative enrichment of fragment-length cross-correlation to read-length cross-correlation (phantom peak). 107 | 108 | The read-length cross-correlation is a baseline correlation that is entirely due to an inherent mappability bias for reads to be separated on the + and - strand by read length. 109 | 110 | The fragment-length cross-correlation (which is due to clustering of relatively fixed-size fragments around target sites) should be able to beat the read-length correlation for highly enriched datasets with sufficient localized target sites. So an RSC value > 0.8 is desirable in general.
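In terms of the cross-correlation profile, both metrics reduce to simple ratios. A toy Python sketch (the correlation values, read length and fragment length below are invented for illustration):

```python
# made-up cross-correlation values at selected strand shifts (bp -> correlation)
cc = {0: 0.020, 50: 0.045, 100: 0.030, 200: 0.080, 300: 0.028, 500: 0.015}

read_len, frag_len = 50, 200
cc_min = min(cc.values())   # background (minimum) cross-correlation
cc_frag = cc[frag_len]      # fragment-length peak
cc_read = cc[read_len]      # read-length ("phantom") peak

nsc = cc_frag / cc_min                         # fragment peak over background
rsc = (cc_frag - cc_min) / (cc_read - cc_min)  # fragment peak over phantom peak

print(f"NSC={nsc:.2f} RSC={rsc:.2f}")
```

These are the same formulas the phantompeakqualtools output documents as COL4/COL8 and (COL4-COL8)/(COL6-COL8) further below.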
111 | 112 | **Marks that tend to be enriched at repeat-like regions and those that have low signal to noise ratios with diffused genome-wide patterns can have stronger read-length peaks and RSC values < 0.8** 113 | 114 | #### A very [old post](https://groups.google.com/forum/#!msg/macs-announcement/XawMJuBLYrc/ErL5oWVUWdYJ) by Anshul in the MACS Google group back in 2012 115 | 116 | >A useful way of estimating fragment length (different from how MACS does it) is to compute a strand cross-correlation profile of read start density on the + and - strand i.e. **you compute the number of read starts at each position on the + strand and separately on the - strand for each chromosome. Then simply shift these vectors wrt each other and compute the correlation for each shift.** You can then plot a cross-correlation profile as the cross-correlation values on the y-axis and the shift that you used to compute the correlation on the x-axis. This is the cross-correlation profile for the dataset. Due to the 'shift' phenomenon of reads on the + and - strand around true binding sites, one would get a peak in the cross-correlation profile at the predominant fragment length. 117 | 118 | **Side notes from me: for reads starting at EACH position, there are three conditions: only one read starts there (no duplicates), no read starts there (no reads mapping), or several reads start there (potential PCR duplicates). So in reality, the read counts are calculated by implicit aggregation over 5 bp sliding windows.** 119 | 120 | 121 | >For a really strong ChIP-seq dataset such as say CTCF in human cells (great antibody and 45-60K peaks typically), the cross-correlation profile looks like what u see in the attached Figure CTCF.pdf. Notice the RED vertical line which is the dominant peak at the true peak shift. Also notice the little bump (the blue vertical line). This is at read-length.
122 | 123 | ![](./images/CTCF.png) 124 | 125 | >At the other extreme, lets take a control dataset (input DNA). The cross-correlation profile is shown in CONTROL.pdf. Now notice how the strongest peak is the blue line (read length) and there is basically almost no other significant peak in the profile. The absence of a peak shud be expected since unlike a ChIP-seq dataset for input DNA one expects no significant clustering of fragments around specific target sites (except potentially weak biases in open chromatin regions depending on the protocol used). Now the read-length peak occurs due to unique mappability properties of the mapped reads. **If a position 'i' on the + strand in the genome is uniquely mappable (i.e. a read starting at 'i' on the + strand maps uniquely), it implies that the position 'i+readlength-1' is also uniquely mappable on the - strand (ie. a read starting at i+readlength-1 on the - strand maps uniquely to that position)**. So in the input dataset or in random scattering of reads to uniquely mappable locations (in a genome made up of unmappable, multimappable locations and unique mappable locations), **there is a greater odds of finding reads starting on the + and - strand separated by read-length than any other shift.** Which is why the cross-correlation profile peaks at read-length compared to other values of strand-shift and the cross-correlation at the true fragment length/peak-shift is washed away since there are is no significant +/- strand read density shift in the input dataset. 126 | 127 | ![](./images/input_control.png) 128 | 129 | >Now take a look at what you get for some a ChIP-seq dataset that is an inbetween case. 130 | 131 | >POL2B.pdf : has few peaks (just about 3000 detectable ones in the human genome), this particular antibody is not very efficient (there are other POL2 antibodies that are very effective) and these are broad scattered peaks (following elongation patterns of POL2). 
Notice how you now have 2 peaks in the cross-correlation profile. One at the true peak shift (~185-200 bp) thats the one marked in red and the other at read length (the one marked in blue). For such weaker datasets, the read-length peak starts to dominate. Depending on the data quality characteristics of the dataset, the read-length peak scales relative to the true fragment length peak. 132 | 133 | ![](./images/RNApol2.png) 134 | 135 | >So long story short, MACS effectively tends to just pick up just the strongest peak in the cross-correlation profile (although it uses a different method of estimating the peak-shift) and for datasets that have the properties listed at the top of this email, basically it picks up the read length. For strong datasets, it picks up the true shift. What one needs to do is find the peak in the cross-correlation profile ignoring any peak at read-length (which may be stronger or weaker than the other peaks in the profile). This always gives reliable estimates of fragment length (d/peak-shift). We have confirmed this using paired-end sequencing on a variety of different TFs and histone marks with different binding characteristics and ubiqiuity (where you can actually observe the distribution of fragment lengths for comparison). We have seen this phenomenon in a large number of datasets (ENCODE and modENCODE datasets). We have a paper in press right now that deals with this phenomenon as well as how it can be used as a useful data quality measure. Once it is published I can send a link to those interested. 136 | 137 | >If you would like to have some code that computes the fragment length based on the cross-correlation method shoot me an email. I am hesitant to link it here without Tao's permission, since it uses the code-base from another peak caller. You can then use the --shift-size parameter set to 1/2 the estimated fragment length with --no-model. You will notice significantly better results with a correctly estimated 'd'. 
138 | 139 | >I think at some point, it might be useful to have this cross-correlation method incorporated within MACS so as to make the d estimation more robust (which is probably one of the only unstable aspects of an otherwise fantastic peak caller). 140 | 141 | >Thanks, 142 | >Anshul. 143 | 144 | 145 | #### See a discussion on [biostars](https://www.biostars.org/p/18548/) 146 | 147 | 148 | ### Calculate fragment length, NSC and RSC by [phantompeakqualtools](https://code.google.com/p/phantompeakqualtools/) 149 | 150 | ``` 151 | =========================== 152 | GENERAL USAGE 153 | =========================== 154 | Usage: Rscript run_spp.R <options> 155 | =========================== 156 | TYPICAL USAGE 157 | =========================== 158 | (1) Determine strand cross-correlation peak / predominant fragment length OR print out quality measures 159 | 160 | Rscript run_spp.R -c=<tagAlign/BAMfile> -savp -out=<outFile> 161 | 162 | -out=<outFile> will create and/or append to a file named <outFile> several important characteristics of the dataset. 163 | The file contains 11 tab delimited columns 164 | 165 | COL1: Filename: tagAlign/BAM filename 166 | COL2: numReads: effective sequencing depth i.e. total number of mapped reads in input file 167 | COL3: estFragLen: comma separated strand cross-correlation peak(s) in decreasing order of correlation. 168 | The top 3 local maxima locations that are within 90% of the maximum cross-correlation value are output. 169 | In almost all cases, the top (first) value in the list represents the predominant fragment length.
170 | If you want to keep only the top value simply run 171 | sed -r 's/,[^\t]+//g' <outFile> > <newOutFile> 172 | COL4: corr_estFragLen: comma separated strand cross-correlation value(s) in decreasing order (col2 follows the same order) 173 | COL5: phantomPeak: Read length/phantom peak strand shift 174 | COL6: corr_phantomPeak: Correlation value at phantom peak 175 | COL7: argmin_corr: strand shift at which cross-correlation is lowest 176 | COL8: min_corr: minimum value of cross-correlation 177 | COL9: Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8 178 | COL10: Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8) 179 | COL11: QualityTag: Quality tag based on thresholded RSC (codes: -2:veryLow,-1:Low,0:Medium,1:High,2:veryHigh) 180 | 181 | You can run the program on multiple datasets in parallel and append all the quality information to the same <outFile> for a summary analysis. 182 | 183 | Qtag is a thresholded version of RSC. 184 | ``` 185 | **We need column 3 (fragment length), column 9 (NSC) and column 10 (RSC)** of the outFile. 186 | 187 | >NSC values range from a minimum of 1 to larger positive numbers. 1.1 is the critical threshold. 188 | Datasets with NSC values much less than 1.1 (< 1.05) tend to have low signal to noise or few peaks (this could be biological, e.g. a factor that truly binds only a few sites in a particular tissue type, OR it could be due to poor quality) 189 | 190 | >RSC values range from 0 to larger positive values. 1 is the critical threshold. 191 | RSC values significantly lower than 1 (< 0.8) tend to have low signal to noise. The low scores can be due to failed and poor quality ChIP, low read sequence quality and hence lots of mismappings, shallow sequencing depth (significantly below saturation) or a combination of these. Like the NSC, datasets with few binding sites (< 200), which is biologically justifiable, also show low RSC scores.
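Extracting the three numbers we care about from the 11-column outFile takes only a few lines of Python (the record below is fabricated to match the documented column layout):

```python
def parse_spp_line(line):
    """Extract fragment length, NSC and RSC from one run_spp.R output line."""
    cols = line.rstrip("\n").split("\t")
    frag_len = int(cols[2].split(",")[0])  # COL3: keep the top (predominant) value
    nsc = float(cols[8])                   # COL9
    rsc = float(cols[9])                   # COL10
    return frag_len, nsc, rsc

# fabricated example record with the 11 tab-delimited columns
record = "\t".join(["sample.bam", "12300000", "230,180,95", "0.21,0.20,0.19",
                    "50", "0.18", "1500", "0.15", "1.042", "1.014", "1"])
frag_len, nsc, rsc = parse_spp_line(record)
print(frag_len, nsc, rsc)
```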
192 | 193 | #### NOTES: 194 | 195 | - It is **EXTREMELY important to filter out multi-mapping reads from the BAM/tagAlign files**. A large number of multimapping reads can severely affect the phantom peak coefficient and peak calling results. 196 | 197 | - If a dataset seems to have high PCR bottlenecking, then you might want to actually clamp the number of unique mapping reads per position to 1 or up to 5. If not, the phantom peak coefficient can be artificially good. 198 | 199 | - For the IDR rescue strategy, one needs to pool reads from replicates and then shuffle and subsample the mapped reads to create two balanced pseudoReplicates. This is much easier to implement on tagAlign/BED read-mapping files using the unix 'shuf' command. So it is recommended to use the tagAlign format. 200 | 201 | - In most cases, you can simply use the maximum reported strand correlation peak as the predominant fragment length. 202 | However, it is useful to manually take a look at the cross-correlation plot to make sure the selected max peak is not an artifact. 203 | 204 | - Also, if there are problems with library size-selection, a dataset's cross-correlation profile can have multiple strong cross-correlation peaks. This is currently not autodetected. 205 | 206 | 207 | 208 | `Rscript run_spp.R -c=<tagAlign/BAMfile> -savp -out=<outFile>` 209 | 210 | It is better to specify a temp directory, and later you can remove everything in the temp dir. 211 | 212 | `time Rscript /scratch/genomic_med/mtang1/softwares/phantompeakqualtools/run_spp.R -c=sample.sorted.bam -savp -out=sample-cross-correlation.txt -tmpdir="."` 213 | 214 | 215 | `real 35m10.321s 216 | user 36m8.622s 217 | sys 0m13.043s` 218 | 219 | A pdf file will be produced like Fig. 4G above. 220 | >In almost all cases, the top (first) value in the list represents the predominant fragment length.
221 | If you want to keep only the top value simply run 222 | sed -r 's/,[^\t]+//g' outFile > newOutFile 223 | 224 | 225 | For this particular bam file: 226 | fragment length: 230 227 | NSC: 1.042359 228 | RSC: 1.014053 229 | 230 | **These are very robust metrics for ChIP-seq quality evaluation.** 231 | 232 | 233 | ### The thresholds for TF ChIP-seq and histone modification ChIP-seq are different. 234 | 235 | I had a conversation with Anshul Kundaje (an ENCODE person and author of phantompeakqualtools) on Twitter: 236 | 237 | For ChIP-seq quality control with phantompeakqualtools, do you repeat the experiment only when BOTH NSC < 1.05 AND RSC < 0.8? @anshul thx 238 | 239 | >Dataset is flagged when both NSC & RSC below these thresholds for TF ChIP-seq (different from histones) 240 | > 241 | 242 | Thx. what's the cut-off for histone ChIP-seq? and sometimes, we have NSC > 1.05 but RSC < 0.8, or NSC < 1.05 but RSC > 0.8. 243 | > 244 | > Depends on histone. See [here](https://docs.google.com/spreadsheets/d/1yikGx4MsO9Ei36b64yOy9Vb6oPC5IBGlFbYEt-N6gOM/edit#gid=15) for approx. distribution of NSC/RSC for histones (Col AZ-BR) For **broad marks NSC usually in the range of 1.03. RSC in the range of 0.4-0.8. Narrow marks more similar to TF range**. 245 | 246 | do you have a note on what histone-modifications are broad and narrow? For H3K4me3, it can be broad as well. see this paper [Broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor-suppressor genes](http://www.ncbi.nlm.nih.gov/pubmed/26301496) 247 | 248 | >Yes. H3K4me3 can have broad domains but **overall** it is more predominantly a narrow mark from signal-to-noise perspective. I dont have a comprehensive list. Shud create one. For now ask me specific mark - I can tell u signal-to-noise categorization. 249 | 250 | thx. I am quality controlling (narrow?) H3K4me3, H3K4me, H3K27ac. and (broad?) H3K9me3, H3K27me3, H3K79me2.
Looking forward to the list 251 | 252 | >this categorization is correct. Although H3K79me2 usually has high SNR between narrow and broad marks. 253 | 254 | ### Measuring global ChIP enrichment (FRiP) 255 | 256 | >For point-source data sets, **we calculate the fraction of all mapped reads that fall into peak regions identified by a peak-calling algorithm** (Ji et al. 2008). Typically, a minority of reads in ChIP-seq experiments occur in significantly enriched genomic regions (i.e., peaks); the remainder of the read represents background. The fraction of reads falling within peak regions is therefore a useful and simple first-cut metric for the success of the immunoprecipitation, and is called FRiP (fraction of reads in peaks). In general, FRiP values correlate positively and linearly with the number of called regions, although there are exceptions, such as REST (also known as NRSF) and GABP, which yield a more limited number of called regions but display very high enrichment (Fig. 4C). Most (787 of 1052) ENCODE data sets have a FRiP enrichment of 1% or more when peaks are called using MACS with default parameters. **The ENCODE Consortium scrutinizes experiments in which the FRiP falls below 1%.** 257 | 258 | >The 1% FRiP guideline works well when there are thousands to tens of thousands of called occupancy sites in a large mammalian genome. However, passing this threshold does not automatically mean that an experiment is successful and **a FRiP below the threshold does not automatically mean failure.** For example, ZNF274 and human RNA polymerase III have very few true binding sites (Frietze et al. 2010; Raha et al. 2010), and a FRiP of <1% is obtained. At the other extreme, ChIP experiments using antibody/factor pairs capable of generating very high enrichment (such as REST and GABP mentioned above) and/or binding-site numbers (CTCF, RAD21, and others) can result in FRiP scores that exceed those obtained for most factors (Fig. 
5C), even for experiments that are suboptimal. **Thus, FRiP is very useful for comparing results obtained with the same antibody across cell lines or with different antibodies against the same factor.** **FRiP is sensitive to the specifics of peak calling, including the way the algorithm delineates regions of enrichment and the parameters and thresholds used. Thus, all FRiP values that are compared should be derived from peaks uniformly called by a single algorithm and parameter set.** 259 | 260 | According to this paragraph from the paper, we need to call peaks first, and then count how many reads fall into the called peaks. This number divided by the total mapped read number is the **FRiP**. 261 | 262 | 263 | 264 | 265 | ### Consistency of replicates: Analysis using IDR 266 | to be continued... 267 | 268 | 269 | 270 | 271 | -------------------------------------------------------------------------------- /part1.1_MACS2_parallel_peak_calling.md: -------------------------------------------------------------------------------- 1 | 2 | I want to use MACS2 to call ChIP-seq peaks for 10 samples (each with IP and input control) with 4 different sets of parameters. That's a lot of commands to type. 3 | 4 | 5 | First, put the sample names (prefix) into a file: 6 | `ls -1 *bam | sort | sed -r 's/-[A-G]{1}-NC.sorted.bam//g' | sort | uniq > sample_name.txt` 7 | 8 | `A` is IP. `G` is input control. 9 | The prefix could be anything that tags each experiment. 10 | 11 | ```bash 12 | #!
/bin/bash 13 | 14 | ## put the unique sample names into an array 15 | sample_files=($(cut -f 1 sample_name.txt)) 16 | 17 | # print out all the elements in the array 18 | echo "${sample_files[@]}" 19 | 20 | ## loop over the samples and call peaks with macs2 21 | 22 | for file in "${sample_files[@]}" 23 | do 24 | IP_bam="${file}"-A-NC.sorted.bam 25 | Input_bam="${file}"-G-NC.sorted.bam 26 | # call regular sharp peaks 27 | macs2 callpeak -t "$IP_bam" -c "$Input_bam" -g hs -n "${file}"-A-NC-regular-model -q 0.01 28 | macs2 callpeak -t "$IP_bam" -c "$Input_bam" -g hs -n "${file}"-A-NC-regular-nomodel -q 0.01 --nomodel --extsize 146 29 | 30 | # call broad peaks 31 | macs2 callpeak -t "$IP_bam" -c "$Input_bam" --broad -g hs --broad-cutoff 0.1 -n "${file}"-A-NC-broad-model 32 | macs2 callpeak -t "$IP_bam" -c "$Input_bam" --broad -g hs --broad-cutoff 0.1 -n "${file}"-A-NC-broad-nomodel --nomodel --extsize 146 33 | done 34 | ``` 35 | This is not good, because the script will loop over all the bam files and 36 | call peaks one after another. It does not take advantage of the multi-core 37 | computing cluster. 38 | Assume we have 10 distinct names in the sample_name.txt and each peak calling takes 15 mins. Total time will be: 4 x 10 x 15 = 600 mins = 10 hours! 39 | 40 | **We could do things like this in the above script**: 41 | `macs2 callpeak -t "$IP_bam" -c "$Input_bam" -g hs -n "${file}"-A-NC-regular-model -q 0.01 &` 42 | 43 | Adding `&` at the end puts the program in the background and starts the next command immediately, but it will launch all the jobs at once, regardless of how many CPUs are available. This is not good citizen behavior on a shared computing cluster.
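Outside the shell, the same idea — run the jobs concurrently but cap how many run at once — can be sketched in Python with `multiprocessing.Pool` (a harmless `echo` stands in for the real `macs2` invocation; the sample names are the ones from the snakemake example data):

```python
import subprocess
from multiprocessing import Pool

samples = ["sampleA", "sampleB", "sampleG1", "sampleG2"]

def call_peaks(sample):
    # stand-in for the real command, e.g.
    # macs2 callpeak -t {sample}-A-NC.sorted.bam -c {sample}-G-NC.sorted.bam ...
    cmd = ["echo", f"calling peaks for {sample}"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

if __name__ == "__main__":
    # like xargs -P 2: at most 2 jobs run at any moment
    with Pool(processes=2) as pool:
        for line in pool.map(call_peaks, samples):
            print(line)
```

The pool size plays the role of the `-P` flag below: jobs beyond the cap wait until a worker frees up.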
44 | 45 | 46 | **Alternatively, we can use the `xargs` `-P` flag to restrict the number of processes run in parallel.** 47 | 48 | ```bash 49 | ### peak calling can be parallelized by xargs 50 | 51 | ### call regular sharp peaks with model 52 | cat sample_name.txt | xargs -P 6 -I{} macs2 callpeak -t {}-A-NC.sorted.bam \ 53 | -c {}-G-NC.sorted.bam -g hs -n {}-A-NC-sharp-model -q 0.01 --outdir {}-A-NC-sharp-model-peaks 54 | 55 | 56 | ### call regular sharp peaks without model 57 | 58 | cat sample_name.txt | xargs -P 6 -I{} macs2 callpeak -t {}-A-NC.sorted.bam \ 59 | -c {}-G-NC.sorted.bam -g hs -n {}-A-NC-regular-nomodel -q 0.01 \ 60 | --nomodel --extsize 146 --outdir {}-A-NC-sharp-nomodel-peaks 61 | 62 | 63 | ### call broad peaks with model 64 | cat sample_name.txt | xargs -P 6 -I{} macs2 callpeak -t {}-A-NC.sorted.bam \ 65 | -c {}-G-NC.sorted.bam --broad -g hs --broad-cutoff 0.1 -n {}-A-NC-broad-model \ 66 | --outdir {}-A-NC-broad-model-peaks 67 | 68 | ### call broad peaks without model 69 | cat sample_name.txt | xargs -P 6 -I{} macs2 callpeak -t {}-A-NC.sorted.bam \ 70 | -c {}-G-NC.sorted.bam --broad -g hs --broad-cutoff 0.1 -n {}-A-NC-broad-nomodel \ 71 | --nomodel --extsize 146 --outdir {}-A-NC-broad-nomodel-peaks 72 | ``` 73 | 74 | However, the standard error streams will be mixed together and hard to track for each peak-calling run. 75 | 76 | From Bioinformatics Data Skills by Vince Buffalo, page 420: 77 | >One stumbling block beginners frequently encounter is trying to use pipes and redirects with xargs. This won't work, as the shell process that reads your xargs command will interpret pipes and redirects as what to do with xargs's output, not as part of the command run by xargs. 78 | 79 | 80 | One can put the macs2 peak calling command in a bash script `script.sh` redirecting the stderr to a file, and then feed it into `xargs`. 81 | 82 | ```bash 83 | #!
/bin/bash 84 | 85 | macs2 callpeak -t "$1"-A-NC.sorted.bam \ 86 | -c "$1"-G-NC.sorted.bam --broad -g hs --broad-cutoff 0.1 -n "$1"-A-NC-broad-nomodel \ 87 | --nomodel --extsize 146 --outdir "$1"-A-NC-broad-nomodel-peaks \ 88 | 2> "$1"-A-NC-broad-nomodel-peaks.stderr 89 | 90 | ``` 91 | Use `-n 1` to pass one input argument at a time. 92 | 93 | ```bash 94 | cat sample_name.txt | xargs -P 6 -n 1 bash script.sh 95 | 96 | ``` 97 | 98 | Finally, use `GNU parallel`, which works with pipes and redirection: 99 | 100 | ```bash 101 | ### peak calling can be parallelized by GNU parallel 102 | 103 | ### call regular sharp peaks with model 104 | cat sample_name.txt | parallel --max-procs=12 'macs2 callpeak -t {}-A-NC.sorted.bam \ 105 | -c {}-G-NC.sorted.bam -g hs -n {}-A-NC-sharp-model -q 0.01 --outdir {}-A-NC-sharp-model-peaks 2> {}-A-NC-sharp-model.stderr' 106 | 107 | 108 | ### call regular sharp peaks without model 109 | 110 | cat sample_name.txt | parallel --max-procs=12 'macs2 callpeak -t {}-A-NC.sorted.bam \ 111 | -c {}-G-NC.sorted.bam -g hs -n {}-A-NC-sharp-nomodel -q 0.01 \ 112 | --nomodel --extsize 146 --outdir {}-A-NC-sharp-nomodel-peaks 2> {}-A-NC-sharp-nomodel.stderr' 113 | 114 | 115 | ### call broad peaks with model 116 | cat sample_name.txt | parallel --max-procs=12 'macs2 callpeak -t {}-A-NC.sorted.bam \ 117 | -c {}-G-NC.sorted.bam --broad -g hs --broad-cutoff 0.1 -n {}-A-NC-broad-model -q 0.01 \ 118 | --outdir {}-A-NC-broad-model-peaks 2> {}-A-NC-broad-model.stderr' 119 | 120 | ### call broad peaks without model 121 | cat sample_name.txt | parallel --max-procs=12 'macs2 callpeak -t {}-A-NC.sorted.bam \ 122 | -c {}-G-NC.sorted.bam --broad -g hs --broad-cutoff 0.1 -n {}-A-NC-broad-nomodel -q 0.01 \ 123 | --nomodel --extsize 146 --outdir {}-A-NC-broad-nomodel-peaks 2> {}-A-NC-broad-nomodel.stderr' 124 | 125 | ``` 126 | **Within 30 mins, I finished peak calling for 10 x 4 = 40 MACS2 runs.** 127 | 128 | Read the tutorial on [biostars](https://www.biostars.org/p/63816/)
129 | and a more bioinformatics-centered tutorial by [Pierre Lindenbaum](http://figshare.com/articles/GNU_parallel_for_Bioinformatics_my_notebook/822138) 130 | A post by Stephen Turner: [find | xargs ... Like a Boss](http://www.gettinggeneticsdone.com/2012/03/find-xargs-like-boss.html) 131 | -------------------------------------------------------------------------------- /part1.2_convert_bam2_bigwig.md: -------------------------------------------------------------------------------- 1 | ### Convert bam to bigwig for ChIP-seq bam 2 | 3 | [Bigwig](http://genome.ucsc.edu/goldenpath/help/bigWig.html) is very good for visualization in IGV and the UCSC genome browser. There are many tools to convert bam to bigwig. 4 | 5 | Make sure you understand the other two closely related file formats: 6 | 7 | * [bedgraph](http://genome.ucsc.edu/goldenpath/help/bedgraph.html): The bedGraph format is an older format used to display sparse data or data that contains elements of varying size. 8 | * [wig file](http://genome.ucsc.edu/goldenpath/help/wiggle.html): The wiggle (WIG) format is an older format for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be **equally sized**. 9 | 10 | See my old blog posts: 11 | [My first play with GRO-seq data, from sam to bedgraph for visualization](http://crazyhottommy.blogspot.com/2013/10/my-first-play-with-gro-seq-data-from.html) 12 | [hosting bigwig by dropbox for UCSC visualization](http://crazyhottommy.blogspot.com/2014/02/hosting-bigwig-by-dropbox-for-ucsc.html) 13 | [MeDIP-seq and histone modification ChIP-seq analysis](http://crazyhottommy.blogspot.com/2014/01/medip-seq-and-histone-modification-chip.html) 14 | bedtools genomeCoverage: [convert bam file to bigwig file and visualize in UCSC genome browser in a Box (GBiB)](http://crazyhottommy.blogspot.com/2014/10/convert-bam-file-to-bigwig-file-and.html) 15 | 16 | MACS2 outputs bedgraph files as well, but the files are big.
In addition, **extending the reads to 200bp** will exceed the chromosome ends in some cases. If you load the bedgraph to UCSC, you will get an error complaining about this. One needs to use bedClip to work around it. 17 | 18 | [Fix the bedGraph and convert them to bigWig files](https://github.com/taoliu/MACS/wiki/Build-Signal-Track#Fix_the_bedGraph_and_convert_them_to_bigWig_files) 19 | [discussion on the macs google group](https://groups.google.com/forum/#!searchin/macs-announcement/bedgraph$20extend/macs-announcement/yefHwueKbiY/UsfWvFrdBh0J) 20 | 21 | ### How are MACS1/2 output wig/bedgraph files produced? 22 | 23 | From [Tao Liu](https://groups.google.com/forum/#!searchin/macs-announcement/bedgraph$20extend/macs-announcement/g29v40hMaIs/GREAyDqNxB8J): 24 | >MACS uses 'd' to extend each tag before piling them up. As for + tag, extend it to the right for d size, and for - tag, **extend it to the left for d size**. But, before that, MACS will filter out redundant reads first. So if you used something like genomeCoverageBed tool on your alignment file directly, you would see a lot of inconsistencies. This method doesn't change between versions, however the way to filter out redundant reads changes. Before in MACS<1.4, MACS only keep 1 tag at the exactly same position, and after MACS1.4, MACS by default uses a binomial test to decide how many reads at the same position can be accepted. 25 | > 26 | >MACS version <2, doesn't do any normalization on the .wig files. However in my own research, I'd prefer to divide the fragment pileup (value in wiggle file) by total number of reads in million.
27 | 28 | >You can save the following content into a perl script such as spmr.pl: 29 | 30 | ``` 31 | #!/usr/bin/perl -w 32 | 33 | open(IN,"$ARGV[1]"); 34 | open(OUT,">$ARGV[2]"); 35 | 36 | while(<IN>) { 37 | if (/^\d+/){ 38 | chomp; 39 | @_=split; 40 | printf OUT "%d\t%.4f\n",$_[0],$_[1]/$ARGV[0]; 41 | } 42 | else{ 43 | print OUT $_; 44 | } 45 | } 46 | ``` 47 | >Then if your data has 12.3 million reads, and you want to get the pileup per 10 million reads, run: 48 | 49 | >`$ perl spmr.pl 1.23 some.wig some.wig.sp10mr` 50 | 51 | From [Tao Liu](https://groups.google.com/forum/#!searchin/macs-announcement/Sophia$20scaled$20--SPMR/macs-announcement/LZpliDkdN-8/5oAws6EHkHEJ) 52 | 53 | >Wiggle or BedGraph file is in plain text format. So you can easily manipulate the file and divide the score column by a denominator from sequencing depth. I usually use sequencing depth in million reads after redundant reads being filtered out. You can check MACS log or xls output for these numbers. 54 | 55 | 56 | >For example, you have 3,123,456 reads after filtering. You can do: 57 | 58 | >`$ awk -v OFS="\t" 'NF==2{print $1,$2/3.123456};END {print}' yourwig.wig` for variableStep wig. 59 | 60 | >Or 61 | 62 | >`$ awk -v OFS="\t" 'NF==4{print $1,$2,$3,$4/3.123456};END {print}' yourbedgraph.bdg` for bedGraph. 63 | 64 | MACS2 only produces bedgraph files, not wig files, any more. 65 | 66 | **Are MACS1/2 output wig/bedgraph files scaled/normalized?** 67 | 68 | >MACS14: No. They are raw pileup with tag extension. Scaling is only used while calling peaks. Note that control track is generated by same tag extension as treatment. So it's not exactly the same as local bias from MACS, which is a maximum of average tag number from a small window(1kbp) and a larger window(10kbp). In brief, they are raw fragment pileup. 69 | 70 | >MACS2: Yes. They are scaled. Especially with current --SPMR option, you will get values of signal per million reads.
And control track is consistent with local bias calculated from MACS. 71 | 72 | 73 | Check several subcommands from MACS2: 74 | 75 | 1. `predictd`, which can predict the best lag of + and - strand tags, through x-correlation; 76 | 2. `filterdup`, which can filter duplicate reads and convert any format into BED; 77 | 3. `pileup`, which can extend + and - strand tags into fragments of given length, then pile them up into a bedGraph file. 78 | 79 | ### Why do we need to extend the reads to fragment length/200bp? 80 | 81 | Because in a real experiment, we fragment the genome into small fragments of ~200bp, and pull down the protein-bound DNA with antibodies. However, we only sequence the first 36bp (50bp or 100bp, depending on your library). To recapitulate the real experiment, we need to extend the reads to the fragment size. 82 | 83 | That's what MACS model building does: MACS calculates the length `d` to which the reads need to be extended. 84 | 85 | Three examples of extending reads: 86 | 1. [HTSeq TSS plot](http://www-huber.embl.de/users/anders/HTSeq/doc/tss.html) 87 | ![](./images/HTSeq-extend200.png) 88 | 2. paper [Integrative analysis of 111 reference human epigenomes](http://www.nature.com/nature/journal/v518/n7539/full/nature14248.html) Methods part: 89 | >Mappability filtering, pooling and subsampling. **The raw Release 9 read alignment files contain reads that are pre-extended to 200 bp**. However, there were significant differences in the original read lengths across the Release 9 raw data sets reflecting differences between centres and changes of sequencing technology during the course of the project (36 bp, 50 bp, 76 bp and 100 bp).
To avoid artificial differences due to mappability, for each consolidated data set the **raw mapped reads were uniformly truncated to 36 bp and then refiltered using a 36-bp custom mappability track to only retain reads that map to positions (taking strand into account) at which the corresponding 36-mers starting at those positions are unique in the genome.** Filtered data sets were then merged across technical/biological replicates, and where necessary to obtain a single consolidated sample for every histone mark or DNase-seq in each standardized epigenome. 90 | 91 | 92 | 3. paper [Impact of artifact removal on ChIP quality metrics in ChIP-seq and ChIP-exo data](http://www.ncbi.nlm.nih.gov/pubmed/24782889) 93 | 94 | "IGV screenshot of an example CTCF ChIP signal showing the distribution of Watson and Crick signal around the CTCF motif and the distribution of Watson and Crick signal following extension of reads to the expected fragment length." 95 | 96 | ![](./images/extend_200.png) 97 | 98 | 99 | 100 | ## Keep in mind when you convert ChIP-seq bam to bigwig files: 101 | * extend the reads to 200bp, `d` predicted by MACS, or fragment length predicted by [phantomPeakqualtools](https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part0_quality_control.md#calculate-fragment-length-nsc-and-rsc-by-phantompeakqualtools) 102 | * normalize (to library size, RPKM or 1x genomic content like `deeptools`. see below) 103 | 104 | ### Using bedtools 105 | 106 | Install `bedClip` and `bedGraphToBigWig` [UCSC utilities](http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/) first.
107 | 108 | **One can convert the bam to bed, then use bedtools slop to extend the reads at the 3' end to 200bp, and then feed them into bedtools genomecov. See this biostars [post](https://www.biostars.org/p/49163/).** 109 | 110 | for 36bp single-end ChIP-seq reads: 111 | ``` 112 | bamToBed -i input.bam | slopBed -i - -g genome_file_of_chr_sizes -s -r 164 | bedToBam -i - -g genome_file_of_chr_sizes > output_extended.bam 113 | ``` 114 | 115 | ```bash 116 | #! /bin/bash 117 | 118 | for bam in *extended.bam 119 | do 120 | echo $bam 121 | genomeCoverageBed -ibam $bam -bg -g hg19.genome.info > $(basename $bam .bam).bdg 122 | done 123 | ``` 124 | Convert bedgraph to bigwig; credits go to [Tao Liu](https://gist.github.com/taoliu/2469050): 125 | 126 | ```bash 127 | #!/bin/bash 128 | 129 | # this script is from Tao Liu https://gist.github.com/taoliu/2469050 130 | # check commands: slopBed, bedGraphToBigWig and bedClip 131 | 132 | which bedtools &>/dev/null || { echo "bedtools not found! Download bedTools: "; exit 1; } 133 | which bedGraphToBigWig &>/dev/null || { echo "bedGraphToBigWig not found! Download: "; exit 1; } 134 | which bedClip &>/dev/null || { echo "bedClip not found! Download: "; exit 1; } 135 | 136 | # end of checking 137 | 138 | if [ $# -lt 2 ];then 139 | echo "Need 2 parameters! " 140 | exit 141 | fi 142 | 143 | F=$1 144 | G=$2 145 | 146 | bedtools slop -i ${F} -g ${G} -b 0 | bedClip stdin ${G} ${F}.clip 147 | 148 | bedGraphToBigWig ${F}.clip ${G} ${F/bdg/bw} 149 | 150 | rm -f ${F}.clip 151 | 152 | ``` 153 | 154 | ### Using Deeptools 155 | 156 | 157 | I personally like to convert the bam files directly to [bigwig](https://genome.ucsc.edu/goldenPath/help/bigWig.html) files using [deeptools](https://github.com/fidelram/deepTools). Using 10bp as a bin size, I get a bigwig file of 205Mb and you can directly load it into IGV.
`bamCoverage -b ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam --normalizeTo1x 2451960000 --missingDataAsZero yes --binSize 10 --fragmentLength 200 -o panc1_H3k27acRep1_deeptool_normalized.bw` 159 | 160 | `--fragmentLength 200` will extend the reads at the 3' end to 200bp, which is more reasonable for ChIP-seq data. We only sequence the first (36)bp of the DNA fragment pulled down by antibodies. 161 | See [here](https://www.biostars.org/p/49775/#158050) 162 | 163 | I really like the demonstration of how coverage files are computed by the `deeptools` [author](https://docs.google.com/file/d/0B8DPnFM4SLr2UjdYNkQ0dElEMm8/edit?usp=sharing): 164 | ![](./images/bam2bigwig1.png) 165 | 166 | **reads will be extended to 200bp before counting** 167 | 168 | ![](./images/bam2bigwig2.png) 169 | 170 | **Which normalization do you want to use: RPKM (like RNA-seq) or 1x coverage?** 171 | ![](./images/bam2bigwig3.png) 172 | 173 | RPKM: 174 | reads per kilobase per million reads 175 | The formula is: RPKM (per bin) = number of reads per bin / (number of mapped reads (in millions) * bin length (kb)) 176 | 177 | RPGC: 178 | reads per genomic content 179 | used to normalize reads to 1x depth of coverage 180 | sequencing depth is defined as: (total number of mapped reads * fragment length) / effective genome size 181 | 182 | 183 | ### Using HTSeq 184 | 185 | HTSeq is a python library that is designed for NGS sequencing analysis. 186 | The [`HTSeq-count`](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) program is widely used for RNA-seq counting.
187 | 188 | ```python 189 | import HTSeq 190 | 191 | alignment_file = HTSeq.SAM_Reader("SRR817000.sam") 192 | # HTSeq also has a BAM_Reader function to handle the bam file 193 | 194 | # initialize a Genomic Array (a class defined in the HTSeq package to deal with NGS data, 195 | # it allows transparent access of the data through the GenomicInterval object) 196 | # more reading http://www-huber.embl.de/users/anders/HTSeq/doc/genomic.html#genomic 197 | 198 | fragmentsize = 200 199 | 200 | coverage = HTSeq.GenomicArray("auto", stranded = True, typecode = 'i') 201 | 202 | # go through the alignment file, add count by 1 if in that Interval there is a read mapped there 203 | 204 | for alignment in alignment_file: 205 | if alignment.aligned: 206 | # extend to 200bp 207 | alignment.iv.length = fragmentsize 208 | coverage[alignment.iv] += 1 209 | 210 | # it takes a while to construct this coverage array since python goes through every read in the big SAM file 211 | 212 | # write the file to bedgraph 213 | coverage.write_bedgraph_file("plus.wig", "+") 214 | coverage.write_bedgraph_file("minus.wig", "-") 215 | 216 | ``` 217 | We get the wig files (note: these wigs are at 1-base resolution and strand-specific), and then can convert wig to bigwig using UCSC [`wigToBigWig`](http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/). 218 | 219 | ### Correlation between replicates 220 | 221 | You need the UCSC tool [`wigCorrelate`](http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/wigCorrelate): 223 | 224 | 225 | Run it: 226 | 227 | `wigCorrelate H3K36me1_EE_rep1_FE.bw H3K36me1_EE_rep2_FE.bw` 228 | 229 | ### Compare the bigwig files between IP and input control. 230 | 231 | Use [`bamCompare`](https://github.com/fidelram/deepTools/wiki/Normalizations) from `deeptools` as well. 232 | 233 | >--scaleFactorsMethod 234 | >Here, you can choose how you would like to normalize to account for variation in sequencing depths.
We provide: 235 | the simple normalization total read count 236 | the more sophisticated signal extraction (SES) method proposed by Diaz et al. for the normalization of ChIP-seq samples. We recommend to use SES only for those cases where the distinction between input and ChIP is very clear in the bamFingerprint plots. This is usually the case for transcription factors and sharply defined histone marks such as H3K4me3. 237 | 238 | >--ratio 239 | >Here, you get to choose how you want the two input files to be compared, e.g. by taking the ratio or by subtracting the second BAM file from the first BAM file etc. In case you do want to subtract one sample from the other, you will have to choose whether you want to normalize to 1x coverage (--normalizeTo1x) or to Reads Per Kilobase per Million reads (--normalizeUsingRPKM; similar to RNA-seq normalization schemes). 240 | 241 | Or use [bwtools](https://github.com/CRG-Barcelona/bwtool). 242 | 243 | -------------------------------------------------------------------------------- /part1.3_MACS2_peak_calling_details.md: -------------------------------------------------------------------------------- 1 | ### How to set MACS2 peak calling parameters 2 | 3 | 4 | From the paper [Integrative analysis of 111 reference human epigenomes](http://www.nature.com/nature/journal/v518/n7539/full/nature14248.html): 5 | 6 | >Peak calling. For the histone ChIP-seq data, the MACSv2.0.10 peak caller was used to compare ChIP-seq signal to a corresponding whole-cell extract (WCE) sequenced control to identify narrow regions of enrichment (peaks) that pass a Poisson P value threshold 0.01, broad domains that pass a broad-peak Poisson P value of 0.1 and gapped peaks which are broad domains (P < 0.1) that include at least one narrow peak (P < 0.01) (https://github.com/taoliu/MACS/). 
Fragment lengths for each data set were pre-estimated using strand cross-correlation analysis and the [SPP peak caller package](https://code.google.com/p/phantompeakqualtools/) and these fragment length estimates were explicitly used as parameters in the MACS2 program (–shift-size = fragment_length/2). 7 | > 8 | 9 | MACS2 is used to call broad and narrow peaks for **histone ChIP-seq:** 10 | >MACSv2.0.10 was also used to call narrow peaks using the same settings specified above for the histone mark narrow peak calling. 11 | 12 | >Narrow peaks and broad domains were also generated for the unconsolidated, 36-bp mappability filtered histone mark ChIP-seq and DNase-seq Release 9 data sets using MACSv2.0.10 with the same settings as specified above. 13 | 14 | 15 | **The description actually is not accurate in the paper**. for MACS2 16 | >--extsize EXTSIZE The arbitrary extension size in bp. When nomodel is true, MACS will use this value as fragment size to extend each read towards 3' end, then pile them up. **It's exactly twice the number of obsolete SHIFTSIZE.** In previous language, each read is moved 5'->3' direction to middle of fragment by 1/2 d, then extended to both direction with 1/2 d. This is equivalent to say each read is extended towards 5'->3' into a d size fragment. DEFAULT: 200. EXTSIZE and SHIFT can be combined when necessary. Check SHIFT option. 
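The paper's `--shift-size = fragment_length/2` wording and the `--extsize` help text describe the same quantity from two angles: `extsize` is the full fragment size `d`, and the obsolete `shiftsize` was half of it. A minimal sketch of the arithmetic (the helper name is mine, not part of MACS):

```python
# The obsolete MACS --shiftsize and the MACS2 --extsize describe the same
# fragment-size model: extsize = 2 * shiftsize = estimated fragment length d.

def macs_extension_params(fragment_length):
    """Return (shiftsize, extsize) for a given fragment-length estimate."""
    shiftsize = fragment_length // 2  # old MACS style: half of d
    extsize = 2 * shiftsize          # MACS2 style: the full fragment size
    return shiftsize, extsize

# e.g. a 147bp nucleosome footprint gives shiftsize 73 and extsize 146
print(macs_extension_params(147))  # -> (73, 146)
```

This is why a `--shiftsize 74` in older commands corresponds to `--extsize` of roughly the nucleosome size in MACS2.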
17 | > 18 | 19 | In the ENCODE [ChIP-seq github page](https://github.com/crazyhottommy/chip-seq-pipeline/blob/master/dnanexus/macs2/src/macs2.py) I found: 20 | 21 | ```python 22 | 23 | #=========================================== 24 | # Generate narrow peaks and preliminary signal tracks 25 | #============================================ 26 | 27 | command = 'macs2 callpeak ' + \ 28 | '-t %s -c %s ' %(experiment.name, control.name) + \ 29 | '-f BED -n %s/%s ' %(peaks_dirname, prefix) + \ 30 | '-g %s -p 1e-2 --nomodel --shift 0 --extsize %s --keep-dup all -B --SPMR' %(genomesize, fraglen) 31 | ``` 32 | 33 | ```python 34 | #=========================================== 35 | # Generate Broad and Gapped Peaks 36 | #============================================ 37 | 38 | command = 'macs2 callpeak ' + \ 39 | '-t %s -c %s ' %(experiment.name, control.name) + \ 40 | '-f BED -n %s/%s ' %(peaks_dirname, prefix) + \ 41 | '-g %s -p 1e-2 --broad --nomodel --shift 0 --extsize %s --keep-dup all' %(genomesize, fraglen) 42 | 43 | 44 | ``` 45 | 46 | The fraglen is from [strand cross-correlation analysis](https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part0_quality_control.md#calculate-fragment-length-nsc-and-rsc-by-phantompeakqualtools) 47 | 48 | 49 | ```python 50 | 51 | #Extract the fragment length estimate from column 3 of the cross-correlation scores file 52 | with open(xcor_scores_input.name,'r') as fh: 53 | firstline = fh.readline() 54 | fraglen = firstline.split()[2] #third column 55 | print "Fraglen %s" %(fraglen) 56 | ``` 57 | 58 | ### Conclusions 59 | 60 | We are using `deeptools` for bigwig production, so we do not specify `-B`(output bedgraph) and `-SPMR`(for normalized bedgraph). 61 | 62 | For each histone-modification ChIP-seq, we will have two sets of peaks (broad and narrow). 
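The ENCODE snippet above takes the third column of the cross-correlation scores file verbatim; in phantompeakqualtools output that column can hold several comma-separated fragment-length estimates (best first), so a slightly more defensive parse is useful. A sketch, with an illustrative input line rather than a real scores file:

```python
def top_fragment_length(line):
    """Return the best fragment-length estimate from one xcor scores line.

    The third tab-separated column may look like "195,215,100"; the first
    value is the highest-scoring estimate, which is what --extsize needs.
    """
    fields = line.rstrip("\n").split("\t")
    return int(fields[2].split(",")[0])

# illustrative line: file name, read count, comma-separated estimates
line = "sample.tagAlign.gz\t21341349\t195,215,100"
print(top_fragment_length(line))  # -> 195
```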
63 | 64 | Use `--nomodel` and provide the `--extsize` of either 147 bp or the fragment length predicted by [strand cross-correlation analysis](https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part0_quality_control.md#calculate-fragment-length-nsc-and-rsc-by-phantompeakqualtools) 65 | 66 | for narrow peaks: 67 | `macs2 callpeak -t IP.bam -c Input.bam -n test -p 0.01 --nomodel --extsize fragment_length --keep-dup all -g hs` 68 | 69 | for broad regions: 70 | `macs2 callpeak -t IP.bam -c Input.bam -n test --broad -p 0.01 --nomodel --extsize fragment_length --keep-dup all -g hs` 71 | 72 | It turns out that ENCODE intentionally uses a relaxed p value of 0.01 for calling peaks and then filters the peaks afterwards by [IDR](https://sites.google.com/site/anshulkundaje/projects/idr). 73 | In my experience, I would set a q value of 0.01 ([the q value controls the false discovery rate](http://crazyhottommy.blogspot.com/2015/03/understanding-p-value-multiple.html)) for narrow peaks and a q value of 0.05 for broad peaks. 74 | 75 | Please check this [issue](https://github.com/taoliu/MACS/issues/76) for MACS2: 76 | 77 | Jgarthur: 78 | >I ran MACS2 (2.1.0.20140616) to call peaks on chromatin accessibility data (ATAC-seq) with the following options: 79 | 80 | `-t reads.bam -f BAM -g mm --nomodel --shift -100 --extsize 200 -p 1e-4 --broad` 81 | 82 | >I wanted to check sensitivity to the p-value cutoff specified by -p (which I believe controls narrow peak calling before merging to broad peaks?). Running with, e.g., -p 1e-3 or -p 1e-10 gave identical output to the first run. The output is different than using -q 0.01, however. Is this the intended behavior? 83 | 84 | 85 | Tao Liu: 86 | >Broad peak cutoff is controlled by '--broad-cutoff'. The '-p' option controls the narrower regions inside broad regions. BTW, in the newest release, there is a new tool to give you p-value cutoff analysis "macs2 callpeak --cutoff-analysis" (without using --broad mode).
It will try pvalue cutoff from 1 to 1e-10 and collect how many peaks and bps can be called as enriched regions. You may want to give it a try. 87 | 88 | Jgarthur: 89 | >Am I correct in saying that in --broad mode, the value set by -q or -p does not matter at all, but whether one sets -q or -p determines the meaning of --broad-cutoff as a q-value or p-value threshold? 90 | 91 | >My own testing indicates, e.g., that "--broad -q {x} --broad-cutoff .01" is unaffected by the choice of x, though the output still displays "# qvalue cutoff = {x}" 92 | 93 | Tao Liu: 94 | >Yes. You are right. But -q or -p also determines the narrower calls inside broad regions. MACS2 broad mode does a 2-level peak calling and embed stronger/narrower calls in weaker/broader calls. 95 | 96 | >Thanks for reminding me this output issue. I will fix that so it will be displayed as '#qvalue cutoff for narrow region = {x}' and '# qvalue cutoff for broad region = {y}' will be correctly displayed. 97 | 98 | -------------------------------------------------------------------------------- /part1_peak_calling.md: -------------------------------------------------------------------------------- 1 | 2 | ### software installation and data source 3 | 4 | **install MACS2:** 06/11/2015 5 | `sudo -H pip install MACS2` 6 | macs2 version: 2.1.0.20150420 7 | 8 | **install NCIS to estimate the scaling factor:** 9 | Download the package [here](http://www.biomedcentral.com/1471-2105/13/199): 10 | `wget http://www.biomedcentral.com/content/supplementary/1471-2105-13-199-s2.gz` 11 | 12 | It is a .gz source package; you can open RStudio and install it: 13 | `install.packages("~/NCIS/1471-2105-13-199-s2.gz", repos = NULL, type="source")` 14 | For help: 15 | `library(NCIS); ?NCIS` 16 | 17 | See a post here on using NCIS before MACS peak calling: 18 | [Adding a custom normalization to MACS](http://searchvoidstar.tumblr.com/post/52594053877/adding-a-custom-normalization-to-macs).
When using NCIS, one needs to deduplicate the reads first; see a post on the [MACS google group](https://groups.google.com/forum/#!searchin/macs-announcement/NCIS/macs-announcement/cGzpyez57dI/Kga4YM6ukGYJ). Then the estimated scaling factor between ChIP and input control will be fed into MACS2 with the `--ratio` flag: 20 | `macs2 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1 --ratio 1.4` 21 | 22 | #### Data source 23 | **downloaded the data from the ENCODE project** on 06/11/2015 24 | [here](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/) for MCF7 and panc1 cells. H3k27ac ChIP-seq, duplicates for each cell line. I chose these data sets because the SYDH Histone tracks contain input DNAs. The Broad Histone tracks do not have input DNA sequenced. 25 | 26 | **MCF7 cells** 27 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam` 28 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam.bai` 29 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam.bai` 30 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam` 31 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7H3k27acUcdPk.narrowPeak.gz` 32 | 33 | **input DNA** 34 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam` 35 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam.bai` 36 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7InputUcdAlnRep2.bam` 37 | `wget
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistoneMcf7InputUcdAlnRep2.bam.bai` 38 | 39 | **Panc1 cells** 40 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam` 41 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam.bai` 42 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam` 43 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam.bai` 44 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak.gz` 45 | 46 | **input DNA** 47 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam` 48 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam.bai` 49 | 50 | #### project layout 51 | It is always good to organize your project. For me, the layout is like this: 52 | 53 | `diff_ChIP_test` is the project main folder and it contains four sub-folders: 54 | `data`, `doc` , `results` and `scripts`. 55 | Data are downloaded into the `data` folder, scripts are in the `scripts` folder and the output from the scripts are in the `results` folder. 56 | Further Reading: [A Quick Guide to Organizing Computational Biology Projects](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424) 57 | 58 | 59 | ### Is macs2 OK for broad peaks such as H3K27ac? 
60 | A discussion [here](https://groups.google.com/forum/#!searchin/macs-announcement/macs$20for$20broad/macs-announcement/LVkBpm-2oRM/gMT_g-DS4b0J) 61 | Yet another note from a user argues that MACS2 can be used for broad peak calling, but by choosing the correct [arguments](https://groups.google.com/forum/#!searchin/macs-announcement/h3k27ac/macs-announcement/9_LB5EsjS_Y/nwgsPN8lR-kJ): 62 | >While I can't speak to much of the first paragraph of your message, I wanted to let you know that when it comes to histone modification analysis, using the --call-subpeaks option in the command line has proven to be a phenomenally successful way to use MACS to handle modifications (right now, I'm working on H3K4me1) which have both sharp defined peaks but also very broad "peaks" which are more like domains of H3K4me1 signals clustered together. This obviously created many situations, in part because MACS will automatically combine peaks which are separated by a distance of 10bp or less into one called peak, where artefactually broad regions were called (like 30kb peaks... yeah right!). 63 | 64 | >So many people have argued, including in a recent review in Nat Immunology about ChIP-seq and peak-calling, that modifications with broad distribution (even H3K27Ac can be broad in our hands) should not be analyzed with MACS; using various biological end-points as testing I have found that this is not true and that MACS outperforms ZINBA (specifically designed for broad histone modifications) WHEN the subpeaks function is called for detection of histone peaks with broad, narrow, or both types of signal distribution. 65 | 66 | 67 | It seems to me that MACS2 has evolved a lot to deal with broad peaks compared with the widely used MACS14. Although other tools such as SICER are designed specifically for histone modifications, I am still going to use MACS2 for H3K27ac ChIP-seq peak calling.
68 | 69 | Further Reading: 70 | [Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003326) 71 | [ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia](http://www.ncbi.nlm.nih.gov/pubmed/22955991) 72 | 73 | ### starting pilot analysis 74 | 75 | I am going to use a single ChIP bam file and one input file to do some initial testing with different parameters of MACS2. 76 | 77 | ``` 78 | --broad 79 | 80 | When this flag is on, MACS will try to composite broad regions in BED12 ( a gene-model-like format ) by putting nearby highly enriched regions into a broad region with loose cutoff. The broad region is controlled by another cutoff through --broad-cutoff. The maximum length of broad region length is 4 times of d from MACS. DEFAULT: False 81 | ``` 82 | **call peaks with macs2 using --broad, building model:** 83 | `macs2 callpeak -t ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n panc1H3k27acRep1 --outdir panc1H3k27acRep1_with_model_broad ` 84 | 85 | **call peaks with macs2 using --broad, bypass the model:** 86 | `macs2 callpeak -t ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n panc1H3k27acRep1 --outdir panc1H3k27acRep1_without_model_broad --nomodel --extsize 146` 87 | 88 | `--nomodel` and `--extsize 146` tell MACS2 to use 146bp as the fragment size to pile up sequencing reads. 89 | 90 | For model building, 91 | NAME_model.r is an R script in the output which you can use to produce a PDF image of the model based on your data.
Run it with: 92 | 93 | `$ Rscript NAME_model.r` 94 | 95 | The model will estimate the fragment size `d`. 96 | 97 | A note from [Tao Liu](https://groups.google.com/forum/#!searchin/macs-announcement/h3k27ac/macs-announcement/9_LB5EsjS_Y/nwgsPN8lR-kJ): 98 | >If the d is not small ~ < 2*tag size (for those tag size < 50bp), and the model image in PDF shows clean bimodal shape, d may be good. And several bp differences on d shouldn't affect the peak detection on general transcription factor ChIP-seq much. 99 | 100 | >However, for Pol2 or histone marks, things may be different. Pol2 is moving so it's not appropriate to say there is a fixed fragment size. I don't know the correct answer. For histone mark ChIP-seq, since they would have a underlying characteristic 147bp resolution for a nucleosome size, you can simply skip model building and use "--shiftsize 74 --nomodel" instead. Also if you want, you can try other software like SICER and NPS. 101 | 102 | 103 | --extsize EXTSIZE The arbitrary extension size in bp. When nomodel is 104 | true, MACS will use this value as fragment size to 105 | extend each read towards 3' end, then pile them up. 106 | **It's exactly twice the number of obsolete SHIFTSIZE.** 107 | In previous language, each read is moved 5'->3' 108 | direction to middle of fragment by 1/2 d, then 109 | extended to both direction with 1/2 d. This is 110 | equivalent to say each read is extended towards 5'->3' 111 | into a d size fragment. DEFAULT: 200. EXTSIZE and 112 | SHIFT can be combined when necessary. Check SHIFT 113 | option.
114 | 115 | 116 | **call peaks with macs2 not using --broad, bypass the model, produce bedgraph file -B, and --call-summits** 117 | `macs2 callpeak -t ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam -g hs -q 0.01 -n panc1H3k27acRep1_regular --call-summits -B --outdir panc1H3k27acRep1_without_model_regular --nomodel --extsize 146` 118 | 119 | **short note from me** 120 | The bedgraph file generated by macs2 is very huge (1Gb for this particular case), because it contains decimals(?). If you want to visualize it in IGV, you need to get a TDF file first. Remember to change the suffix .bdg to .bedgraph, otherwise IGV will not recognize the file format. 121 | 122 | I personally like to convert the bam files directly to [bigwig](https://genome.ucsc.edu/goldenPath/help/bigWig.html) files using [deeptools](https://github.com/fidelram/deepTools). Using 10bp as a bin size, I get a bigwig file of 205Mb and you can directly load it into IGV. 123 | `bamCoverage -b ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam --normalizeTo1x 2451960000 --missingDataAsZero yes --binSize 10 --fragmentLength 200 -o panc1_H3k27acRep1_deeptool_normalized.bw` 124 | 125 | `--fragmentLength 200` will extend the reads at the 3' end to 200bp, which is more reasonable for ChIP-seq data. We only sequence the first (36)bp of the DNA fragment pulled down by antibodies. 126 | See [here](https://www.biostars.org/p/49775/#158050) 127 | 128 | ### Results from pilot analysis 129 | #### 1. peak numbers 130 | First, let's look at how many peaks are produced by three different settings of macs2 arguments. 131 | 1. use --broad, build model 132 | it produced **54223** broadpeaks. 133 | 2. use --broad, bypass model 134 | it produced **66288** broadpeaks. 135 | 3. regular peak calling will produce a summit.bed file and narrow peaks 136 | it produced **182569** narrowpeaks.
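The three counts above come from counting records in each output peak file; a small sketch that tabulates them in one pass (directory layout as used above; the peak file names follow the MACS2 `-n` prefix convention, so adjust if your `-n` differs):

```python
import os

def count_peaks(lines):
    """Each non-empty line of a broadPeak/narrowPeak file is one peak."""
    return sum(1 for line in lines if line.strip())

# output directories from the three pilot runs above (hypothetical layout)
peak_files = [
    "panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak",
    "panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak",
    "panc1H3k27acRep1_without_model_regular/panc1H3k27acRep1_regular_peaks.narrowPeak",
]
for path in peak_files:
    if os.path.exists(path):  # skip runs that have not finished yet
        with open(path) as fh:
            print(path, count_peaks(fh))
```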
137 | 138 | **Using --broad definitely improves the identification of peaks (or, more appropriately, enriched regions).** 139 | You can find the narrow peaks in the gappedPeak file: 140 | 141 | >NAME_peaks.gappedPeak is in BED12+3 format which contains both the broad region and narrow peaks. The 5th column is 10*-log10qvalue, to be more compatible to show grey levels on UCSC browser. The 7th is the start of the first narrow peak in the region, and the 8th column is the end. The 9th column should be RGB color key, however, we keep 0 here to use the default color, so change it if you want. The 10th column tells how many blocks including the starting 1bp and ending 1bp of broad regions. The 11th column shows the length of each blocks, and 12th for the starts of each blocks. 13th: fold-change, 14th: -log10pvalue, 15th: -log10qvalue. The file can be loaded directly to UCSC genome browser. 142 | 143 | 144 | The narrowPeak file downloaded from the ENCODE website contains 73953 narrow peaks. 145 | `zcat ../data/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak.gz | wc -l` 147 | 148 | Take home messages for now: 149 | 1. Using different tools to call peaks will produce different numbers of peaks. 150 | 2. Using the same tool with different settings will produce different numbers of peaks. 151 | **There is no consensus on which arguments to use. It depends on your data type and purpose.
The bottom line is that you need to make sense of the data and find biological significance in it.** 152 | 153 | **As long as you document how you did the analysis (so that other people can reproduce your work) and can convince people you are doing it reasonably, you are fine.** 154 | 155 | I will stick to these arguments for all my subsequent analyses: 156 | `macs2 callpeak -t ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n panc1H3k27acRep1 --outdir panc1H3k27acRep1_without_model_broad --nomodel --extsize 146` 157 | 158 | #### 2. peak quality in terms of qvalues (FDR) and overlap 159 | Overlap the two peak sets with [bedtools](http://bedtools.readthedocs.org/en/latest/index.html): 160 | 161 | `bedtools intersect -a panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak -b panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak -wa | cut -f1-3 | sort | uniq | wc -l` 163 | 45228 164 | 165 | `bedtools intersect -a panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak -b panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak -wb | cut -f1-3 | sort | uniq | wc -l` 166 | 61507 167 | The two peak sets largely overlap with each other.
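A side note on the `cut -f1-3 | sort | uniq` idiom used above: a peak in `-a` can overlap several peaks in `-b` and is then reported once per overlap by `bedtools intersect`, so the pipeline deduplicates the intervals before counting. A toy example (hypothetical intervals, no bedtools needed):

```shell
# two reports of the same interval collapse into one before counting
printf 'chr1\t100\t200\tpeakX\nchr1\t100\t200\tpeakY\nchr2\t50\t80\tpeakZ\n' \
    | cut -f1-3 | sort | uniq | wc -l
```

The count is 2: the duplicated chr1 interval is counted only once.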
168 | 169 | Compare with the narrowPeak set downloaded from ENCODE: 170 | no model: 171 | `bedtools intersect -a panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak -b ../data/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak -wa | cut -f1-3 | sort | uniq | wc -l` 172 | 35722 173 | 174 | built model: 175 | `bedtools intersect -a panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak -b ../data/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak -wa | cut -f1-3 | sort | uniq | wc -l` 176 | 28943 177 | 178 | **It seems that bypassing model building gives peaks more concordant with the ENCODE narrowPeak set.** 179 | 180 | #### check FDRs 181 | columns of the broadPeak file: 182 | 7th: fold-change, 8th: -log10pvalue, 9th: -log10qvalue 183 | How many peaks have an FDR below 0.01 (-log10qvalue > 2) and a fold-change above 2? 184 | peaks without model building: 185 | `cat panc1H3k27acRep1_without_model_broad/panc1H3k27acRep1_peaks.broadPeak | awk '$9 >2 && $7 >2' | wc -l` 186 | 41136 187 | 188 | peaks after building the model: 189 | `cat panc1H3k27acRep1_with_model_broad/panc1H3k27acRep1_peaks.broadPeak | awk '$9 >2 && $7 >2' | wc -l` 190 | 35487 191 | 192 | **It seems that bypassing model building gives more confident peaks.** 193 | 194 | I did some exploratory analysis using R and published it at [Rpubs](http://rpubs.com/crazyhottommy/ChIP-seq-peak-distribution) 195 | ### manual visualization of the called peaks 196 | 197 | 198 | ### call peaks for all the samples 199 | 200 | #### calculate a scaling factor for each ChIP bam file using the NCIS library 201 | By default, macs2 scales the **"larger dataset towards the smaller dataset"**; alternatively, one can **call peaks with macs2 using the `--ratio` flag.** 202 | The NCIS library takes the ChIP bam and the input control bam file as its two arguments.
I need to write an R script and execute it on the command line with `Rscript`: 203 | [how to get command line arguments in R](http://stackoverflow.com/questions/2151212/how-can-i-read-command-line-parameters-from-an-r-script/2151627#2151627) 204 | 205 | ```r 206 | ## calculate the scaling factor using NCIS for ChIP-seq data 207 | # see links https://groups.google.com/forum/#!searchin/macs-announcement/NCIS/macs-announcement/0EF4cQF09FI/2-zlu2rqfOkJ 208 | # http://searchvoidstar.tumblr.com/post/52594053877/adding-a-custom-normalization-to-macs 209 | # Ming Tang 06/15/2015 210 | 211 | library(NCIS) 212 | 213 | library(ShortRead) 214 | 215 | options(echo=TRUE) # set to FALSE if you do not want to see the commands in the output 216 | args <- commandArgs(trailingOnly = TRUE) 217 | print(args) 218 | # trailingOnly=TRUE means that only your arguments are returned, check: 219 | # print(commandArgs(trailingOnly=FALSE)) 220 | 221 | ChIP_bam <- args[1] 222 | input_control_bam <- args[2] 223 | 224 | # NCIS uses the AlignedRead object from the ShortRead package; however, it is recommended 225 | # to use the GenomicAlignments package to read in bam files: 226 | # ga_ChIP <- readGAlignments(ChIP_bam) 227 | # ga_input <- readGAlignments(input_control_bam) 228 | # However, the resulting GenomicAlignments object is not recognized by NCIS, 229 | # so I have to use the legacy readAligned function from the ShortRead package. 230 | # it takes around 15 mins to finish 231 | 232 | ga_ChIP <- readAligned(ChIP_bam, type="BAM") 233 | ga_input <- readAligned(input_control_bam, type="BAM") 234 | 235 | res <- NCIS(ga_ChIP, ga_input, data.type="AlignedRead") 236 | res 237 | res$est 238 | res$r.seq.depth 239 | ``` 240 | To use it, save the R script as `NCIS_scaling.r` and run it in the terminal: 241 | `Rscript NCIS_scaling.r ChIP.bam control.bam` 242 | It will output the scaling factor of ChIP/control. Note that the MACS2 flag `--ratio` is also for ChIP/control.
243 | ``` 244 | --ratio RATIO When set, use a custom scaling ratio of ChIP/control 245 | (e.g. calculated using NCIS) for linear scaling. 246 | DEFAULT: ignore 247 | ``` 248 | **scaling factor for all four samples** 249 | 250 | `Rscript NCIS_scaling.r ../data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam ../data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam` 251 | `Rscript NCIS_scaling.r ../data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam ../data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam` 252 | `Rscript NCIS_scaling.r ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam` 253 | `Rscript NCIS_scaling.r ../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam ../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam` 254 | 255 | 256 | 257 | The scaling factor and the sequencing depth ratio are both for ChIP/control 258 | 259 | | file_name | scaling factor| seq depth ratio | 260 | | ------------- |:-------------:| -----:| 261 | | Mcf7Rep1 | 1.761996 | 2.091474 | 262 | | Mcf7Rep2 | 1.709105 | 1.917619 | 263 | | panc1Rep1 | 0.5861799 | 1.110217 | 264 | | panc1Rep2 | 0.5160342 | 0.8233171| 265 | 266 | we can get a rough idea of the size of each bam file: 267 | 268 | `ls -sh data/*` 269 | `4.0K data/README.md` 270 | `1.3G data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam` 271 | `6.2M data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam.bai` 272 | `1.2G data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam` 273 | `6.1M data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam.bai` 274 | `468K data/wgEncodeSydhHistoneMcf7H3k27acUcdPk.narrowPeak.gz` 275 | `773M data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam` 276 | `5.9M data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam.bai` 277 | `718M data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam` 278 | `6.1M data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam.bai` 279 | `522M data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam` 280 | `5.9M data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam.bai` 281 | `4.4M
data/wgEncodeSydhHistonePanc1H3k27acUcdPk.narrowPeak` 282 | `678M data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam` 283 | `6.1M data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam.bai` 284 | 285 | 286 | **MACS2 peak calling with --ratio** 287 | `macs2 callpeak -t data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam -c data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n Mcf7H3k27acUcdAlnRep1_ratio --outdir results/Mcf7H3k27acUcdAlnRep1_ratio --nomodel --extsize 146 --ratio 1.761996` 288 | `macs2 callpeak -t data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep2.bam -c data/wgEncodeSydhHistoneMcf7InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n Mcf7H3k27acUcdAlnRep2_ratio --outdir results/Mcf7H3k27acUcdAlnRep2_ratio --nomodel --extsize 146 --ratio 1.709105` 289 | `macs2 callpeak -t data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam -c data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n Panc1H3k27acUcdAlnRep1_ratio --outdir results/Panc1H3k27acUcdAlnRep1_ratio --nomodel --extsize 146 --ratio 0.5861799` 290 | `macs2 callpeak -t data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep2.bam -c data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam --broad -g hs --broad-cutoff 0.1 -n Panc1H3k27acUcdAlnRep2_ratio --outdir results/Panc1H3k27acUcdAlnRep2_ratio --nomodel --extsize 146 --ratio 0.5160342` 291 | 292 | 293 | **MACS2 peak calling without --ratio** 294 | 295 | To bulk process the bam files downloaded from ENCODE, one can write a bash script. 296 | 297 | ```bash 298 | #! /bin/bash 299 | 300 | set -e 301 | set -u 302 | set -o pipefail -o errexit -o nounset 303 | 304 | # we loop over the ChIP bam files 305 | for bam in ../data/*H3k27ac*bam 306 | do 307 | # strip out only the meaningful filename to be used for output 308 | file_name=$(echo "$bam" | sed -E "s/..\/data\/wg.+Histone(.+)(H3k27ac.+).bam/\1\2/") 309 | 310 | # need to retain the ../data/ path.
it could be simply: sed -E "s/H3k27ac/Input/" if 311 | # every bam file has an input control 312 | input_control=$(echo "$bam" | sed -E "s/(wg.+Histone)(.+)(H3k27ac.+).bam/\1\2InputUcdAlnRep1.bam/") 313 | 314 | echo "processing ${file_name} bam file" 315 | echo "the input control file is ${input_control}" 316 | echo "calling peaks with macs2" 317 | macs2 callpeak -t "$bam" -c "${input_control}" --broad -g hs --broad-cutoff 0.1 -n "${file_name}" --outdir ../results/"${file_name}" --nomodel --extsize 146 318 | 319 | done 320 | 321 | 322 | ``` 323 | 324 | The sed regular expressions caused me some headache. 325 | For the `+` operator to work in sed, one needs to turn on the `-E` flag for extended regular expressions on macOS, or `-r` on GNU sed. 326 | To capture part of the pattern, use \1; with `-E`, the parentheses do not need to be escaped. 327 | If you wanted to keep the first word of a line and delete the rest of the line: 328 | `sed 's/\([a-z]*\).*/\1/'` 329 | or turn on the -E flag: 330 | `sed -E 's/([a-z]*).*/\1/'` 331 | see a very good tutorial on [sed](http://www.grymoire.com/Unix/Sed.html). 332 | 333 | After executing the bash script, 4 folders are created in the `results` folder: `Mcf7H3k27acUcdAlnRep1`, `Mcf7H3k27acUcdAlnRep2`, `Panc1H3k27acUcdAlnRep1` and `Panc1H3k27acUcdAlnRep2`. 334 | 335 | 336 | ### Peak number with and without --ratio 337 | | file_name | with --ratio | without --ratio | overlapping | 338 | | ------------- |:-------------:| -----------------:|--------:| 339 | | Mcf7Rep1 | 25744 | 18572 | 18572 | 340 | | Mcf7Rep2 | 17564 | 12257 | 12256 | 341 | | panc1Rep1 | 348968 | 66288 | 66282 | 342 | | panc1Rep2 | 71033 | 54312 | 54229 | 343 | 344 | It looks like including --ratio generally increases the peak number. Panc1Rep1 in particular has 5 times more peaks with --ratio than without. 345 | It will be interesting to check the quality of the extra peaks gained by adding --ratio.
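As an aside, the two sed substitutions in the bash script above can be sanity-checked on a sample filename without touching any bam files:

```shell
bam="../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam"

# strip the path and the wgEncodeSydhHistone prefix
echo "$bam" | sed -E "s/..\/data\/wg.+Histone(.+)(H3k27ac.+).bam/\1\2/"

# derive the matching input control file, keeping the ../data/ path
echo "$bam" | sed -E "s/(wg.+Histone)(.+)(H3k27ac.+).bam/\1\2InputUcdAlnRep1.bam/"
```

The first command prints `Panc1H3k27acUcdAlnRep1` and the second prints `../data/wgEncodeSydhHistonePanc1InputUcdAlnRep1.bam`, confirming that group `\1` and `\2` capture the pieces of the filename we want.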
I manually checked several peaks and found that the newly found peaks are mostly very weak ones. **I decided to do my subsequent analysis with the peaks obtained from MACS2 without --ratio.** 346 | 347 | 348 | ### filter peaks against the blacklists 349 | [blacklists](https://sites.google.com/site/anshulkundaje/projects/blacklists) 350 | 351 | >Functional genomics experiments based on next-gen sequencing (e.g. ChIP-seq, MNase-seq, DNase-seq, FAIRE-seq) that measure biochemical activity of various elements in the genome often produce artifact signal in certain regions of the genome. It is important to keep track of and filter artifact regions that tend to show artificially high signal (excessive unstructured anomalous reads mapping). Below is a list of comprehensive empirical blacklists identified by the ENCODE and modENCODE consortia. Note that these blacklists were empirically derived from large compendia of data using a combination of automated heuristics and manual curation. These blacklists are applicable to functional genomic data based on short-read sequencing (20-100bp reads). These are not directly applicable to RNA-seq or any other transcriptome data types. The blacklisted regions typically appear uniquely mappable, 352 | so simple mappability filters do not remove them. These regions are often found at specific types of repeats such as centromeres, telomeres and satellite repeats. It is especially important to remove these regions when computing measures of similarity, such as Pearson correlation between genome-wide tracks, which are especially affected by outliers. 353 | 354 | download the hg19 blacklist into the `data` folder by 355 | `wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz` 356 | 357 | create a new folder `broad_peaks_all` inside the `results` folder, and copy all the broad peaks to this folder.
358 | ` ls ./broad_peaks_all` 359 | `Mcf7H3k27acUcdAlnRep1_peaks.broadPeak` `Panc1H3k27acUcdAlnRep1_peaks.broadPeak` 360 | `Mcf7H3k27acUcdAlnRep2_peaks.broadPeak` `Panc1H3k27acUcdAlnRep2_peaks.broadPeak` 361 | 362 | ```bash 363 | #! /bin/bash 364 | 365 | ## filter the broad peaks from MACS output against the blacklist regions 366 | ## bedtools intersect -v 367 | 368 | set -e 369 | set -u 370 | set -o pipefail -o errexit -o nounset 371 | 372 | 373 | for peak in ../results/broad_peaks_all/*broadPeak 374 | do 375 | file_name=$(basename "$peak" .broadPeak) 376 | bedtools intersect -a "$peak" -b ../data/wgEncodeDacMapabilityConsensusExcludable.bed -v \ 377 | > ../results/broad_peaks_all/"${file_name}".filtered.bed 378 | done 379 | ``` 380 | Save the bash script as `filter_blacklist.sh`, make it executable, and run it from the `script` folder with `./filter_blacklist.sh`: 381 | 382 | ```bash 383 | chmod u+x filter_blacklist.sh 384 | ``` 385 | After filtering: 386 | `wc -l ./broad_peaks_all/*` 387 | `18572 Mcf7H3k27acUcdAlnRep1_peaks.broadPeak` 388 | `18548 Mcf7H3k27acUcdAlnRep1_peaks.filtered.bed` 389 | `12257 Mcf7H3k27acUcdAlnRep2_peaks.broadPeak` 390 | `12239 Mcf7H3k27acUcdAlnRep2_peaks.filtered.bed` 391 | `66288 Panc1H3k27acUcdAlnRep1_peaks.broadPeak` 392 | `66248 Panc1H3k27acUcdAlnRep1_peaks.filtered.bed` 393 | `54312 Panc1H3k27acUcdAlnRep2_peaks.broadPeak` 394 | `54235 Panc1H3k27acUcdAlnRep2_peaks.filtered.bed` 395 | Then remove all the broadPeak files: 396 | `rm *broadPeak` 397 | 398 | ### merge peaks 399 | After getting the filtered peaks, I need to merge all the peaks into a superset that contains all the peaks. I will first select the peaks that overlap between biological replicates, and then merge all four sets using bedtools.
400 | `bedtools intersect -a Mcf7H3k27acUcdAlnRep1_peaks.filtered.bed -b Mcf7H3k27acUcdAlnRep2_peaks.filtered.bed -wa | cut -f1-3 | sort | uniq > Mcf7Rep1_peaks.bed` 401 | `bedtools intersect -a Mcf7H3k27acUcdAlnRep1_peaks.filtered.bed -b Mcf7H3k27acUcdAlnRep2_peaks.filtered.bed -wb | cut -f1-3 | sort | uniq > Mcf7Rep2_peaks.bed` 402 | `bedtools intersect -a Panc1H3k27acUcdAlnRep1_peaks.filtered.bed -b Panc1H3k27acUcdAlnRep2_peaks.filtered.bed -wa | cut -f1-3 | sort | uniq > Panc1Rep1_peaks.bed` 403 | `bedtools intersect -a Panc1H3k27acUcdAlnRep1_peaks.filtered.bed -b Panc1H3k27acUcdAlnRep2_peaks.filtered.bed -wb | cut -f1-3 | sort | uniq > Panc1Rep2_peaks.bed` 404 | 405 | `wc -l *` 406 | `18548 Mcf7H3k27acUcdAlnRep1_peaks.filtered.bed` 407 | `12239 Mcf7H3k27acUcdAlnRep2_peaks.filtered.bed` 408 | `9679 Mcf7Rep1_peaks.bed` 409 | `11319 Mcf7Rep2_peaks.bed` 410 | `66248 Panc1H3k27acUcdAlnRep1_peaks.filtered.bed` 411 | `54235 Panc1H3k27acUcdAlnRep2_peaks.filtered.bed` 412 | `35218 Panc1Rep1_peaks.bed` 413 | `44358 Panc1Rep2_peaks.bed` 414 | 415 | 416 | `rm *filtered*` 417 | `cat *bed | sort -k1,1 -k2,2n | bedtools merge | tee merge.bed | wc -l` 418 | 39046 419 | **We got a final superset containing 39046 peaks.** 420 | -------------------------------------------------------------------------------- /part2_Preparing-ChIP-seq-count-table.md: -------------------------------------------------------------------------------- 1 | ### Preparing ChIP-seq count table 2 | 3 | [Continuing from part1](https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part1_peak_calling.md), I've got a `merge.bed` file containing the merged peaks, and I will count how many reads fall in those peaks using bedtools multicov and featureCounts from the subread package. 4 | 5 | #### Count by bedtools 6 | Make a bed file adding a peak id as the fourth column.
7 | This bed file will be used for bedtools multicov: 8 | `cat merge.bed | awk '{$3=$3"\t""peak_"NR}1' OFS="\t" > bed_for_multicov.bed` 9 | 10 | count one bam file: 11 | `time bedtools multicov -bams ../../data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam -bed 1000_bedtools.bed > counts_multicov.txt` 12 | 13 | count multiple bam files: 14 | 15 | 16 | Make sure the header of each bam file is identical, otherwise an error saying "Can not open bam file" will show up. 17 | See [here](https://groups.google.com/forum/#!msg/bedtools-discuss/_LNuoRWHn50/14MaqyzyzXsJ) and [here](https://github.com/arq5x/bedtools2/issues/52). For bam files downloaded from UCSC, some headers contain chrY, some do not, and the order of chrX, chrY, chrM may be different. 18 | 19 | change a bam header: 20 | `samtools view -H ../../data/wgEncodeSydhHistonePanc1H3k27acUcdAlnRep1.bam > bam_header.txt` 21 | `samtools reheader bam_header.txt ../../data/wgEncodeSydhHistoneMcf7H3k27acUcdAlnRep1.bam > Mcf7H3K27acRep1_reheader.bam` 22 | `samtools index Mcf7H3K27acRep1_reheader.bam` 23 | 24 | Do the same for the other bam files. 25 | 26 | `time bedtools multicov -bams ../../data/*bam -bed bed_for_multicov.bed > counts_bedtools.txt` 27 | 28 | `550.24s user 17.86s system 97% cpu 9:42.88 total` 29 | It is pretty fast, counting 6 bam files against a bed file containing 30 | ~40,000 entries. 31 | 32 | The columns of counts are in the same order as the input bam files. 33 | 34 | #### Count by featureCounts 35 | Make a SAF file for featureCounts in the [subread package](http://bioinf.wehi.edu.au/featureCounts/) 36 | 37 | add a peak id in the first column and strand info in the last column: 38 | 39 | `awk '{print "peak_"NR, $0, "."}' OFS="\t" merge.bed > subread.saf` 40 | 41 | featureCounts assumes that the annotation file is a GTF file by default; featureCounts is usually used to count RNA-seq reads. Check the help message for other flags such as `-f`, `-t` and `-g`.
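Both awk transformations above can be checked on a toy interval. The SAF record here is built with an explicit `print`, which keeps the output strictly tab-delimited; the expected SAF columns (GeneID, Chr, Start, End, Strand) are my assumption from the featureCounts SAF layout:

```shell
# 1) append a peak id as the 4th column (bed_for_multicov.bed style)
printf 'chr1\t100\t200\n' | awk '{$3=$3"\t""peak_"NR}1' OFS="\t"

# 2) build a SAF record: GeneID, Chr, Start, End, Strand
printf 'chr1\t100\t200\n' | awk '{print "peak_"NR, $0, "."}' OFS="\t"
```

The first command prints `chr1	100	200	peak_1` and the second prints `peak_1	chr1	100	200	.` (tab-separated).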
use `-T` to specify how many threads you want to use; the default is 1. 42 | It is a faster alternative to [htseq-count](http://www-huber.embl.de/users/anders/HTSeq/doc/count.html), which is widely used for gene-level RNA-seq counts. 43 | 44 | `time featureCounts -a subread.saf -F SAF -o counts_subread.txt ../../data/*bam -T 4` 45 | 46 | `274.74s user 3.16s system 417% cpu 1:06.55 total` 47 | 48 | **Using 4 CPUs, it is faster than bedtools.** 49 | 50 | The output file contains two lines of header: 51 | the first line is the command used to generate the output; the second line 52 | is the header composed of Geneid, Chr, Start, End, Strand, Length, and the names of 53 | the input bam files. 54 | 55 | Strand will be all "+" even though the peaks have no strandedness. 56 | 57 | Besides the `counts_subread.txt` file, another `counts_subread.txt.summary` 58 | file will be generated detailing how many reads are assigned or not. 59 | 60 | 61 | #### Join the two counts tables to compare the differences 62 | Join the two counts tables on the common peak id column, using `csvjoin` from [csvkit](http://csvkit.readthedocs.org/en/latest/index.html#) 63 | 64 | `csvjoin -t -c1,4 <(cat counts_subread.txt | sed '1,2d') counts_bedtools.txt | cut -d"," -f1-12,17-22 > joined_counts_table.csv` 65 | 66 | Now, we can load the csv file into R to do some exploration. I put the exploration on [Rpubs](http://rpubs.com/crazyhottommy/ChIP-seq-counts). 67 | 68 | 69 | 70 | 71 | -------------------------------------------------------------------------------- /part3.1_Differential_binding_DiffBind_lib_size.md: -------------------------------------------------------------------------------- 1 | 2 | ### library size and normalization for ChIP-seq 3 | I have discussed how to use DESeq2 to do differential binding for ChIP-seq [here](https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part3_Differential_binding_by_DESeq2.md).
4 | I am experimenting with [`DiffBind`](http://bioconductor.org/packages/release/bioc/html/DiffBind.html) to do the same thing; it internally uses edgeR, DESeq and DESeq2. 5 | The author `Rory Stark` is very responsive on the [bioconductor support site](https://support.bioconductor.org/) and has answered several of my questions. 6 | 7 | Today, I am going to keep a note here about normalizing ChIP-seq data. 8 | If one compares ChIP-seq versus RNA-seq data, they are, in the end, both count data. For RNA-seq, we usually get a read count table for 9 | the counts in the exons (their union representing a gene); for ChIP-seq, we get a read count table for counts within the peaks. The peaks have to 10 | be identified by other tools such as MACS first. The count data follow a **(negative) binomial distribution**. That's why tools such as DESeq2, which were developed for RNA-seq, can be used for ChIP-seq. 11 | 12 | After we get a count table, the next issue is normalization. If you are interested, read this paper [Beyond library size: a field guide to NGS normalization](http://biorxiv.org/content/early/2014/06/19/006403). 13 | In the `DiffBind` package, the counts table is obtained with the function `dba.count` (see `?dba.count`). 14 | 15 | There are several ways to specify how the counts are normalized for the binding affinity matrix: 16 | ``` 17 | score 18 | which score to use in the binding affinity matrix. Note that all raw read counts are maintained for use by dba.analyze, regardless of how this is set.
One of: 19 | DBA_SCORE_READS raw read count for interval using only reads from ChIP 20 | DBA_SCORE_READS_FOLD raw read count for interval from ChIP divided by read count for interval from control 21 | DBA_SCORE_READS_MINUS raw read count for interval from ChIP minus read count for interval from control 22 | DBA_SCORE_RPKM RPKM for interval using only reads from ChIP 23 | DBA_SCORE_RPKM_FOLD RPKM for interval from ChIP divided by RPKM for interval from control 24 | DBA_SCORE_TMM_READS_FULL TMM normalized (using edgeR), using ChIP read counts and Full Library size 25 | DBA_SCORE_TMM_READS_EFFECTIVE TMM normalized (using edgeR), using ChIP read counts and Effective Library size 26 | DBA_SCORE_TMM_MINUS_FULL TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Full Library size 27 | DBA_SCORE_TMM_MINUS_EFFECTIVE TMM normalized (using edgeR), using ChIP read counts minus Control read counts and Effective Library size 28 | DBA_SCORE_TMM_READS_FULL_CPM same as DBA_SCORE_TMM_READS_FULL, but reported in counts-per-million. 29 | DBA_SCORE_TMM_READS_EFFECTIVE_CPM same as DBA_SCORE_TMM_READS_EFFECTIVE, but reported in counts-per-million. 30 | DBA_SCORE_TMM_MINUS_FULL_CPM same as DBA_SCORE_TMM_MINUS_FULL, but reported in counts-per-million. 31 | DBA_SCORE_TMM_MINUS_EFFECTIVE_CPM same as DBA_SCORE_TMM_MINUS_EFFECTIVE, but reported in counts-per-million. 32 | ``` 33 | 34 | `DBA_SCORE_TMM_READS_FULL` vs `DBA_SCORE_TMM_READS_EFFECTIVE`: 35 | 36 | `DiffBind` lets you choose between full library size and effective library size for trimmed mean of M values (`TMM`) normalization, which was proposed by [Mark D Robinson](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25) 37 | for RNA-seq. 38 | 39 | **Full library size** is the number of reads in the bam files. 40 | **Effective library size** is the number of reads mapped in the exons or within the peaks, i.e., the column sums of the count matrix.
41 | >Note that effective library size (bFullLibrarySize =FALSE) may be more appropriate for situations when the overall signal (binding rate) is expected to be directly comparable 42 | between the samples. 43 | 44 | If one wants to subtract the input reads, one can use `DBA_SCORE_TMM_MINUS_FULL` and `DBA_SCORE_TMM_MINUS_EFFECTIVE` 45 | 46 | **No matter what score you choose, for differential binding analysis in `DiffBind`, it is always the raw counts that are used for the binding matrix**. 47 | DiffBind (by default) subtracts the input raw reads for subsequent analysis. Whether or not this is good was discussed [here](https://support.bioconductor.org/p/72098/#72127). 48 | 49 | For example, if one uses `DESeq2`, the details are as follows: 50 | >For each contrast, a separate analysis is performed. First, a matrix of counts is constructed for the contrast, with columns 51 | for all the samples in the first group, followed by columns for all the samples in the second group. **The raw read count** is 52 | used for this matrix; **if the bSubControl parameter is set to TRUE (as it is by default), the raw number of reads in the 53 | control sample (if available) will be subtracted**. **Next the library size is computed for each sample for use in subsequent 54 | normalization**. By default, **this is the total number of reads in peaks (the sum of each column)**. Alternatively, if the 55 | bFullLibrarySize parameter is set to TRUE, the total number of reads in the library (calculated from the source 56 | BAM/BED file) is used. The first step concludes with a call to DESeq2’s DESeqDataSetFromMatrix function, which 57 | returns a DESeqDataSet object. If bFullLibrarySize is set to TRUE, then sizeFactors is called with the number of reads in the BAM/BED files for 58 | each ChIP sample, divided by the minimum of these; otherwise, `estimateSizeFactors` is invoked. 59 | `estimateDispersions` is then called with the `DESeqDataSet` object and `fitType` set to local.
Next the model is 60 | fitted and tested using nbinomWaldTest 61 | 62 | `estimateSizeFactors` in DESeq2: 63 | >Given a matrix or data frame of count data, this function estimates the size factors as follows: Each column is divided by 64 | the geometric means of the rows. The median (or, if requested, another location estimator) of these ratios 65 | (skipping the genes with a geometric mean of zero) is used as the size factor for this column. 66 | 67 | 68 | 69 | -------------------------------------------------------------------------------- /part3_Differential_binding_by_DESeq2.md: -------------------------------------------------------------------------------- 1 | ### Differential binding by DESeq2 2 | 3 | There are many tools for detecting differential binding in ChIP-seq data. See [here](https://github.com/crazyhottommy/ChIP-seq-analysis#differential-peak-detection). 4 | 5 | DiffBind is very popular and internally it uses edgeR or DESeq1/2. Given that I have gotten raw counts from part2, I will use DESeq2 for differential peak detection. 6 | 7 | In this way, I have more flexible control over the counts table. From here on, the workflow is pretty much like an RNA-seq differential gene expression workflow. 8 | 9 | Let's normalize the raw counts by sequencing depth (library size) first, then **subtract the input control tags from the IP samples** (**this is wrong, or at least not recommended!** see below), and then use the DESeq2 `estimateSizeFactors` function to do the normalization. 10 | 11 | See my post on the [bioconductor support site](https://support.bioconductor.org/p/72098/#72173) 12 | 13 | From DiffBind author Rory Stark: 14 | >I'm not sure what you mean by wanting to "take it the other way"? Meaning, why not use the DiffBind package, which does exactly this type of analysis? Is the idea is that you want to really control each step yourself? 15 | 16 | >If you want to subtract the control reads from the ChIP reads, you should do a simple scaling first.
Rounding to the nearest integer avoids the DESeq2 issue (you also have to check for negative values, as it may be possible that there are more reads in the control than in the ChIP for some merged consensus peaks for some samples). DiffBind does this by default. In your case, it would multiply the control read counts by the ratio of ChIP:Control reads (computed individually for each sample) -- 0.50 in all of your examples -- and round the result before subtracting. 17 | 18 | >In an experiment like you describe, where the same tissue type is used for all the samples, subtracting the control reads shouldn't make much difference to the results as the same control is used for all the samples in each group, unless there are significant differences between the Inputs for each sample group. This should only happen if the treatment had a big impact on the open chromatin (or be the result of a technical issue in the ChIP). The Input controls are most important for the MACS peak calling step. 19 | 20 | From Ryan C. Thompson: 21 | >You might want to take a look at the csaw package, which adapts the edgeR method with all the necessary modifications for unbiased ChIP-Seq differential binding analysis. **Also, I would not recommend subtracting input from ChIP counts, since all the count-based methods assume that you are analyzing absolute counts, not "counts in excess of background**". 22 | > 23 | 24 | From DESeq2 author Michael Love: 25 | >**You should definitely never subtract or divide the counts you provide to DESeq2.** I would not use the input counts at all for differential binding. I would just compare treated counts vs untreated ChIP counts. But I would also recommend to take a look at the DiffBind and csaw vignettes and workflows, at the least to understand the best practices they've set out.
26 | 27 | From DiffBind author Rory Stark again: 28 | >While it is certainly the case that altering the read counts using control reads violates an essential assumption underlying DESeq2 (namely the use of unadulterated read counts), ignoring the control tracks can also lead to incorrect results. This is because the binding matrix may include read counts for enriched regions (peaks) in samples where they were not originally identified as enriched compared to the control. As DESeq2 will have no way of detecting an anomaly in the Input control for that sample in that region, the results may be misleading. This is most likely to occur in experiments involving different cell types. 29 | 30 | >There are alternative ways to incorporate the Input controls. For example, instead of scaling the control read counts, **the control libraries can be down-sampled to the same level as each corresponding ChIP sample prior to peak calling (MACS2 does this) and counting.** This is what we do in our processing pipelines. This still involves altering the ChIP read counts via subtraction however, and in practice down-sampling and scaling almost always yield the same results. 31 | 32 | >The other method is the more aggressive use of blacklists. **Generating blacklists based on every Input control**, and removing reads/peaks from every sample that overlap any blacklisted area, can eliminate false positives in those regions where there is an anomaly in an Input control. Gordon Brown developed the [GreyListChIP](http://www.bioconductor.org/packages/release/bioc/html/GreyListChIP.html) package for this purpose. 33 | > 34 | 35 | From edgeR author Aaron Lun: 36 | >From what I can see, there are two choices; either we get erroneous DB calls because of differences in chromatin state and input coverage, or we get errors due to distorted modelling of the mean-variance relationship after input subtraction.
Our (i.e., the edgeR development team's) opinion is that the latter is more dangerous for routine analyses of ChIP-seq count data. Inaccurate variance modelling will affect the inferences for every genomic region via EB shrinkage, while spurious calls due to differences in chromatin state should be limited to a smaller number of regions. Rory's blacklist idea might be useful here, as we would be able to screen out such problematic regions before they cause any headaches during interpretation. 37 | 38 | In a word, subtracting input control counts from the IP counts violates the DESeq2 and edgeR assumptions and should not be done. The blacklist method may be the way to go. 39 | 40 | There is a nice tutorial using Bioconductor in F1000Research: [From reads to regions: a Bioconductor workflow to detect differential binding in ChIP-seq data](http://f1000research.com/articles/4-1080/v1) 41 | 42 | -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/README.md: -------------------------------------------------------------------------------- 1 | ### Why use snakemake 2 | [Snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) is a python3-based pipeline building tool (a python variant of GNU [make](https://www.gnu.org/software/make/)) specialized for bioinformatics. My notes on managing different versions of python are [here](https://github.com/crazyhottommy/RNA-seq-analysis/blob/master/use_multiple_version_python.md). You can write arbitrary python code inside the Snakefile. I use snakemake to simplify the tedious pre-processing of large genomic data sets and for the sake of reproducibility. You can find many other [tools](https://github.com/pditommaso/awesome-pipeline) for this purpose. 3 | 4 | ### Key features of snakemake 5 | 6 | * Snakemake automatically creates missing directories.
7 | 8 | * Wildcards and input functions 9 | 10 | To access wildcards in a shell command: `{wildcards.sample}` 11 | 12 | `{wildcards}` matching is greedy (`.+`): 13 | `{sample}.fastq` matches `sampleA.fastq`, but because the underlying regular expression also matches `/`, even `whateverfolder/sampleA.fastq` matches as well. 14 | 15 | One needs to think about snakemake in a bottom-up way: snakemake first looks for the requested output files, substitutes the file names into the `{wildcards}`, finds a rule that can create the output, and then looks for the input files defined by those `{wildcards}`. 16 | 17 | `wildcards` can be pretty confusing. Read the posts in the Google group by [searching for wildcards](https://groups.google.com/forum/#!searchin/snakemake/wildcards) 18 | 19 | Read the threads below: 20 | https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/YfHgx6P5se4/CRk-d151GBwJ 21 | https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/FsdT4ioRyNY/LCm6Xj8dIAAJ 22 | https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/JAcOdGgWR_g/1nT9nsNkCgAJ 23 | https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/KorE6c-OZg4/LVO_0jDBHlUJ 24 | 25 | 26 | [quote](https://groups.google.com/forum/#!searchin/snakemake/glob_wildcards/snakemake/FsdT4ioRyNY/LCm6Xj8dIAAJ) from the snakemake developer: 27 | >The generation of the final target is nearly always the complex part of the workflow. You have to determine, which files exist and somehow tell Snakemake which files you want it to generate.
`glob_wildcards` and `expand` helps a lot - it was really painful in the versions before it was implemented, but often you have to write some custom code, if it gets complicated 28 | 29 | #### Read the following 30 | [snakemake book](https://www.gitbook.com/book/endrebak/the-snakemake-book/details) 31 | [flexible bioinformatics pipelines with snakemake](http://watson.nci.nih.gov/~sdavis/blog/flexible_bioinformatics_pipelines_with_snakemake/) 32 | [Build bioinformatics pipelines with Snakemake](https://slowkow.com/notes/snakemake-tutorial/) 33 | [snakemake ChIP-seq pipeline example](https://hpc.nih.gov/apps/snakemake.html) 34 | [submit all the jobs immediately](https://bitbucket.org/snakemake/snakemake/issues/28/clustering-jobs-with-snakemake) 35 | [cluster job submission wrappers](https://groups.google.com/forum/#!searchin/snakemake/dependencies/snakemake/1QelazgzilY/oBgZoP19BL4J) 36 | [snakemake-parallel-bwa](https://github.com/inodb/snakemake-parallel-bwa) 37 | [RNA-seq snakemake example](http://www.annotathon.org/courses/ABD/practical/snakemake/snake_intro.html) 38 | [functions as inputs and derived parameters](https://groups.google.com/forum/#!msg/Snakemake/0tLS6KrXA5E/Oe5umTdluq4J) 39 | [snakemake FAQ](https://bitbucket.org/snakemake/snakemake/wiki/FAQ) 40 | [snakemake tutorial from the developer](http://snakemake.bitbucket.org/snakemake-tutorial.htm) 41 | 42 | ### examples 43 | https://github.com/slowkow/snakefiles/blob/master/bsub.py 44 | https://github.com/broadinstitute/viral-ngs/tree/master/pipes 45 | 46 | ### jobscript 47 | An example of the jobscript 48 | 49 | ```bash 50 | #!/bin/bash 51 | # properties = {properties} 52 | . 
/etc/profile.d/modules.sh 53 | module purge 54 | module load snakemake/python3.2/2.5.2.2 55 | {workflow.snakemakepath} --snakefile {workflow.snakefile} \ 56 | --force -j{cores} \ 57 | --directory {workdir} --nocolor --notemp --quiet --nolock {job.output} \ 58 | && touch "{jobfinished}" || touch "{jobfailed}" 59 | exit 0 60 | ``` 61 | 62 | 63 | ### [A working snakemake pipeline for ChIP-seq](https://github.com/crazyhottommy/ChIP-seq-analysis/tree/master/snakemake_ChIPseq_pipeline) 64 | 65 | The folder structure is like this: 66 | 67 | ``` 68 | ├── README.md 69 | ├── Snakefile 70 | ├── config.yaml 71 | └── rawfastqs 72 | ├── sampleA 73 | │   ├── sampleA_L001.fastq.gz 74 | │   ├── sampleA_L002.fastq.gz 75 | │   └── sampleA_L003.fastq.gz 76 | ├── sampleB 77 | │   ├── sampleB_L001.fastq.gz 78 | │   ├── sampleB_L002.fastq.gz 79 | │   └── sampleB_L003.fastq.gz 80 | ├── sampleG1 81 | │   ├── sampleG1_L001.fastq.gz 82 | │   ├── sampleG1_L002.fastq.gz 83 | │   └── sampleG1_L003.fastq.gz 84 | └── sampleG2 85 | ├── sampleG2_L001.fastq.gz 86 | ├── sampleG2_L002.fastq.gz 87 | └── sampleG2_L003.fastq.gz 88 | 89 | ``` 90 | 91 | There is a folder named `rawfastqs` containing all the raw fastqs. Each sample subfolder contains multiple fastq files from different lanes. 92 | 93 | In this example, I have two control (Input) samples and two corresponding case (IP) samples. 94 | 95 | ``` 96 | CONTROLS = ["sampleG1","sampleG2"] 97 | CASES = ["sampleA", "sampleB"] 98 | ``` 99 | These are put in lists inside the `Snakefile`. If there are many more samples, 100 | the lists should be generated programmatically with Python. 101 | 102 | 103 | ```bash 104 | ## dry run 105 | snakemake -np 106 | 107 | ## workflow diagram 108 | snakemake --forceall --dag | dot -Tpng | display 109 | 110 | ``` 111 | ![](../images/snakemake_flow.png) 112 | 113 | 114 | ### To Do: 115 | 116 | * Make the pipeline more flexible, e.g. make the folder name containing the raw fastqs configurable; it is hard-coded for now.
117 | * Write a wrapper script for submitting jobs to `moab`, figuring out dependencies and `--immediate-submit`. 118 | 119 | `snakemake -k -j 1000 --forceall --cluster-config cluster.json --cluster "msub -V -N '{rule.name}_{wildcards.sample}' -l nodes={cluster.nodes}:ppn={cluster.cpu} -l mem={cluster.mem} -l walltime={cluster.time} -m {cluster.EmailNotice} -M {cluster.email}"` 120 | 121 | `snakemake -k -j 1000 --forceall --cluster-config cluster.json --cluster "msub -V -l nodes={cluster.nodes}:ppn={cluster.cpu} -l mem={cluster.mem} -l walltime={cluster.time} -m {cluster.EmailNotice} -M {cluster.email}"` 122 | -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/Snakefile: -------------------------------------------------------------------------------- 1 | shell.prefix("set -eo pipefail; ") 2 | 3 | configfile: "config.yaml" 4 | 5 | localrules: all 6 | # localrules will let the rule run locally rather than submitting to cluster 7 | # computing nodes, this is for very small jobs 8 | 9 | ## list all samples 10 | CONTROLS = ["sampleG1","sampleG2"] 11 | CASES = ["sampleA", "sampleB"] 12 | 13 | 14 | ## list BAM files 15 | CONTROL_BAM = expand("04aln/{sample}.sorted.bam", sample=CONTROLS) 16 | CASE_BAM = expand("04aln/{sample}.sorted.bam", sample=CASES) 17 | 18 | ## create targets for peak calling: the rule call_peaks will be invoked to generate these bed files 19 | ## note: the "zip" function allows the correct pairing of the BAM files 20 | ALL_PEAKS = expand("05peak/{case}_vs_{control}_peaks.bed", zip, case=CASES, control=CONTROLS) 21 | ALL_SAMPLES = CONTROLS + CASES 22 | ALL_BAM = CONTROL_BAM + CASE_BAM 23 | ALL_FASTQ = expand("01seq/{sample}.fastq", sample = ALL_SAMPLES) 24 | ALL_CLEAN_FASTQ = expand("03seqClean/{sample}_clean.fastq", sample = ALL_SAMPLES) 25 | ALL_FASTQC = expand("02fqc/{sample}_fastqc.zip", sample = ALL_SAMPLES) 26 | 27 | rule all: 28 | input: ALL_FASTQC + ALL_BAM + ALL_PEAKS + ALL_CLEAN_FASTQ 29 |
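## Illustration (comment only, not part of the original pipeline): the zip-ed
## expand() above pairs cases and controls element-wise, so ALL_PEAKS is
##   ["05peak/sampleA_vs_sampleG1_peaks.bed",
##    "05peak/sampleB_vs_sampleG2_peaks.bed"]
## Without zip, expand() would take the Cartesian product and also create the
## unwanted sampleA_vs_sampleG2 and sampleB_vs_sampleG1 targets.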
30 | ## for each sample, there are multiple fastq.gz from different lanes, merge them 31 | ## because the number of fastq.gz files in the folder is not fixed, it needs to be 32 | ## determined by a function 33 | 34 | import glob 35 | def get_fastqs(wildcards): 36 | return glob.glob("rawfastqs/"+ wildcards.sample+"/"+ wildcards.sample + "_L00[0-9].fastq.gz") 37 | 38 | rule merge_fastqs: 39 | input: get_fastqs 40 | output: "01seq/{sample}.fastq" 41 | log: "00log/{sample}_unzip" 42 | threads: 1 43 | message: "merging fastqs gunzip -c {input} > {output}" 44 | shell: "gunzip -c {input} > {output} 2> {log}" 45 | 46 | rule fastqc: 47 | input: "01seq/{sample}.fastq" 48 | output: "02fqc/{sample}_fastqc.zip", "02fqc/{sample}_fastqc.html" 49 | log: "00log/{sample}_fastqc" 50 | threads: 2 51 | resources: 52 | mem = 2 , 53 | time = 20 54 | message: "fastqc {input}: {threads} / {resources.mem}" 55 | shell: 56 | """ 57 | module load fastqc 58 | fastqc -o 02fqc -f fastq --noextract {input[0]} 59 | """ 60 | 61 | ## use trimmomatic to trim low quality bases and adaptors 62 | rule clean_fastq: 63 | input: "01seq/{sample}.fastq" 64 | output: temp("03seqClean/{sample}_clean.fastq") 65 | log: "00log/{sample}_clean_fastq" 66 | threads: 4 67 | resources: 68 | mem= 2, 69 | time= 30 70 | message: "clean_fastq {input}: {threads} threads / {resources.mem}" 71 | shell: 72 | """ 73 | module load trimmomatic 74 | trimmomatic SE {input} {output} \ 75 | ILLUMINACLIP:Truseq_adaptor.fa:2:30:10 LEADING:3 \ 76 | TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 2> {log} 77 | """ 78 | 79 | 80 | rule align: 81 | input: "03seqClean/{sample}_clean.fastq" 82 | output: temp("04aln/{sample}.unsorted.bam") 83 | threads: 10 84 | params: bowtie = "--chunkmbs 320 --best " 85 | resources: 86 | mem = 10, 87 | time = 45 88 | message: "aligning {input}: {threads} threads / {resources.mem}" 89 | log: "00log/{sample}.align" 90 | shell: 91 | """ 92 | module load bowtie/1.1.1 samtools/1.2 93 | bowtie {params.bowtie}
--threads={threads} {config[idx_bt1]} {input} 2> {log} \ 94 | | samtools view -Sb - \ 95 | > {output} 96 | """ 97 | 98 | rule sort_bam: 99 | input: "04aln/{sample}.unsorted.bam" 100 | output: "04aln/{sample}.sorted.bam" 101 | log: "00log/{sample}.sort_bam" 102 | threads: 10 103 | resources: 104 | mem = 12, 105 | time = 15 106 | message: "sort_bam {input}: {threads} threads / {resources.mem}" 107 | shell: 108 | """ 109 | module load samtools/1.2 110 | samtools sort -m 1G -@ {threads} -O bam -T {output}.tmp {input} > {output} 2> {log} 111 | """ 112 | 113 | rule index_bam: 114 | input: "04aln/{sample}.sorted.bam" 115 | output: "04aln/{sample}.sorted.bam.bai" 116 | log: "00log/{sample}.index_bam" 117 | threads: 1 118 | resources: 119 | mem = 500, 120 | time = 10 121 | message: "index_bam {input}: {threads} threads / {resources.mem}" 122 | shell: 123 | """ 124 | module load samtools/1.2 125 | samtools index {input} 2> {log} 126 | """ 127 | 128 | rule flagstat_bam: 129 | input: "04aln/{sample}.sorted.bam" 130 | output: "04aln/{sample}.sorted.bam.flagstat" 131 | log: "00log/{sample}.flagstat_bam" 132 | threads: 1 133 | resources: 134 | mem = 500, 135 | time = 10 136 | message: "flagstat_bam {input}: {threads} threads / {resources.mem}" 137 | shell: 138 | """ 139 | module load samtools/1.2 140 | samtools flagstat {input} > {output} 2> {log} 141 | """ 142 | 143 | rule call_peaks: 144 | input: control = "04aln/{control_id}.sorted.bam", case="04aln/{case_id}.sorted.bam" 145 | output: bed="05peak/{case_id}_vs_{control_id}_peaks.bed" 146 | log: "00log/{case_id}_vs_{control_id}_call_peaks.log" 147 | params: 148 | name = "{case_id}_vs_{control_id}" 149 | resources: 150 | mem = 4, 151 | time = 30 152 | 153 | message: "call_peaks macs14 {input}: {threads} threads / {resources.mem}" 154 | shell: 155 | """ 156 | module load macs14 R 157 | macs14 -t {input.case} \ 158 | -c {input.control} -f BAM -g {config[macs_g]} \ 159 | --outdir 05peak -n {params.name} -p 10e-5 &> {log} 160 | cd
05peak && Rscript {params.name}_model.r 161 | """ 162 | -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/config.yaml: -------------------------------------------------------------------------------- 1 | 2 | idx_bt1: /scratch/genomic_med/apps/annot/indexes/bowtie/hg19 3 | macs_g: hs 4 | -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/msub_cluster.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | 4 | ## In order to submit all the jobs to the moab queuing system, one needs to write a wrapper. 5 | ## This wrapper is inspired by Daniel Park https://github.com/broadinstitute/viral-ngs/blob/master/pipes/Broad_LSF/cluster-submitter.py 6 | ## I asked him questions on the snakemake google group and he kindly answered: https://groups.google.com/forum/#!topic/snakemake/1QelazgzilY 7 | 8 | import sys 9 | import re 10 | from snakemake.utils import read_job_properties 11 | 12 | ## snakemake will generate a jobscript containing all the (shell) commands from your Snakefile. 13 | ## I think that's something baked into snakemake's code itself. It passes the jobscript as the last parameter.
14 | ## https://bitbucket.org/snakemake/snakemake/wiki/Documentation#markdown-header-job-properties 15 | 16 | jobscript = sys.argv[-1] 17 | props = read_job_properties(jobscript) 18 | 19 | ## Minimal sketch of the actual submission step (an assumption, not the original code): 20 | ## resource values are read from the "cluster" entry that snakemake adds to the job 21 | ## properties when --cluster-config is used, mirroring the msub flags in the README. 22 | import subprocess 23 | cluster = props.get("cluster", {}) 24 | cmd = ["msub", "-V", 25 | "-l", "nodes={}:ppn={}".format(cluster.get("nodes", 1), cluster.get("cpu", 1)), 26 | "-l", "mem={}".format(cluster.get("mem", "4gb")), 27 | "-l", "walltime={}".format(cluster.get("time", "01:00:00")), 28 | jobscript] 29 | ## msub takes the job script path as its final argument 30 | subprocess.check_call(cmd) -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleA/sampleA_L001.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleA/sampleA_L001.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleA/sampleA_L002.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleA/sampleA_L002.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleA/sampleA_L003.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleA/sampleA_L003.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleB/sampleB_L001.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleB/sampleB_L001.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleB/sampleB_L002.fastq.gz:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleB/sampleB_L002.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleB/sampleB_L003.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleB/sampleB_L003.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleG1/sampleG1_L001.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleG1/sampleG1_L001.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleG1/sampleG1_L002.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleG1/sampleG1_L002.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleG1/sampleG1_L003.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleG1/sampleG1_L003.fastq.gz -------------------------------------------------------------------------------- 
/snakemake_ChIPseq_pipeline/rawfastqs/sampleG2/sampleG2_L001.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleG2/sampleG2_L001.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleG2/sampleG2_L002.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleG2/sampleG2_L002.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/rawfastqs/sampleG2/sampleG2_L003.fastq.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crazyhottommy/ChIP-seq-analysis/d3f966dee8460502ee8e438e7a082425eb0c0a76/snakemake_ChIPseq_pipeline/rawfastqs/sampleG2/sampleG2_L003.fastq.gz -------------------------------------------------------------------------------- /snakemake_ChIPseq_pipeline/snakemake_notes.md: -------------------------------------------------------------------------------- 1 | 2 | see a [post](https://groups.google.com/forum/#!topic/snakemake/iDnr3PIcsfE) 3 | 4 | >Apart from the rule declarations, Snakefiles are plain Python. In your return statement in myfunc, you take the value of the wildcards 5 | object and put braces around it. Braces around an object in Python create 6 | a set containing that object. But you just want the value, without wrapping it in a set. Hence, the solution is to remove the braces. 
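The point about braces can be checked directly in plain Python (a standalone illustration, not from the original post):

```python
wildcard_value = "sampleA"

# Braces build a one-element set, which is why the input function broke:
wrapped = {wildcard_value}
print(type(wrapped))   # <class 'set'>

# Dropping the braces returns the bare string, which is what snakemake expects
# from an input function:
bare = wildcard_value
print(type(bare))      # <class 'str'>
```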
7 | 8 | You should put `{}` around the wildcards within quotes, like so 9 | `"{wildcards.kittens}"` 10 | 11 | If you are using wildcards within code you do not need the curly braces, so you can just do `for kitten in wildcards.kittens: print(kitten)` 14 | 15 | Useful parameters: 16 | 17 | ``` 18 | --keep-going, -k Go on with independent jobs if a job fails. 19 | ``` 20 | ### Test functions in a Python console 21 | 22 | ```python 23 | from snakemake.io import glob_wildcards, expand 24 | ``` 25 | ### Python versions 26 | `snakemake` is Python3-based; if you want to execute Python2 commands, you have to activate a Python2 environment. 27 | In a future release, conda environments will be baked into snakemake so you can specify an environment inside a rule. 28 | See [this issue and pull request: Integration of conda package management into Snakemake](https://bitbucket.org/snakemake/snakemake/pull-requests/92/wip-integration-of-conda-package/diff) 29 | 30 | ```python 31 | rule a: 32 | output: 33 | "test.out" 34 | environment: 35 | "envs/samtools.yaml" 36 | shell: 37 | "samtools --help > {output}" 38 | 39 | ``` 40 | 41 | with `envs/samtools.yaml` being e.g. 42 | 43 | ``` 44 | channels: 45 | - bioconda 46 | - r 47 | dependencies: 48 | - samtools ==1.3 49 | ``` 50 | 51 | Some useful threads: 52 | * [unifying resources and cluster config](https://bitbucket.org/snakemake/snakemake/issues/279/unifying-resources-and-cluster-config) 53 | --------------------------------------------------------------------------------