├── Detailed-Steps-Mapping-PRO-seq.md ├── MetaGeneATAC.md ├── MetaGenePlots.md ├── PRO-seq.md ├── PausingIndex.md ├── README.md └── etc ├── ATACmeta.png ├── BLAT1.png ├── BLAT2.png ├── BioHPCpwd.png ├── FastQC-ac.png ├── FastQC.png ├── PI-by-stage.png ├── PausingIndex.png ├── chroseq.lnmeta.png ├── chroseq.tssmeta.png └── proseq.png /Detailed-Steps-Mapping-PRO-seq.md: -------------------------------------------------------------------------------- 1 | BioMG 7810 practical day 1: Mapping PRO-seq, GRO-seq, and ChRO-seq data. 2 | ======================================================================== 3 | 4 | Global Run-On and Sequencing (GRO-seq), Precision Run-On and Sequencing (PRO-seq), and Chromatin Run-On and Sequencing (ChRO-seq) are technologies for mapping the location and orientation of actively transcribing RNA polymerase I, II, and III (Pol) across the genome. All run-on technologies provide a genome-wide readout of gene and lincRNA transcription, as well as the location and relative activities of the active enhancers and promoters that regulate gene expression. 5 | 6 | Learning goals 7 | -------------- 8 | 9 | * Get comfortable in a LINUX command line environment. 10 | * Practice mapping ChRO-seq data to a reference genome. 11 | * Understand *WHY* we perform each step. 12 | 13 | Finalize your BioHPC reservation 14 | -------------------------------- 15 | 16 | Four of you had accounts. For those who did not, find the e-mail invitation from BioHPC (you should have received this yesterday, Nov. 26th). 17 | 18 | 19 | Fig. 1: E-mail from BioHPC. Select the link and set your password. 20 | 21 | Your user LabID is your netID. Please choose a password following the instructions in the web browser. 22 | 23 | Log into the high performance compute server 24 | -------------------------------------------- 25 | 26 | Secure shell (a.k.a. SSH) is a protocol that lets you log into a computer and control it remotely.
We are logging into high performance computers (HPC) hosted by the BioHPC service, as part of the BioHPC cloud: https://biohpc.cornell.edu/Default.aspx 27 | 28 | An SSH client called "PuTTY" is installed on the computers in B30B. 29 | 30 | * Select the Windows start menu and type "putty". 31 | 32 | * Under Hostname, enter the host name assigned to you (see the table below), followed by .biohpc.cornell.edu 33 | 34 | * Select Open and enter your username (your netID) and your password. 35 | 36 | Server assignments are:
37 | cbsumm22: apb248 mc2698 agc94 ad986
38 | cbsumm23: yj386 chk63 mgl77 mip25
39 | cbsumm26: is372 mvs44 ldw64 mjg75
40 | cbsumm27: kmw264 yx438 hz543
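If you are working from your own Mac or Linux machine, the built-in OpenSSH client does the same job as PuTTY. A minimal sketch — the server name below is a placeholder (substitute your assignment from the table above), and `netID` stands in for your own login:

```shell
# Build the full host name from your assigned server (placeholder value).
server="cbsumm22"
host="${server}.biohpc.cornell.edu"
echo "${host}"
# Then connect with (uncomment and replace netID with your own):
# ssh netID@"${host}"
```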
41 | 42 | If you are assigned cbsumm22, for example, type: 43 | ``` 44 | cbsumm22.biohpc.cornell.edu 45 | ``` 46 | 47 | Hint: You can download PuTTY for your own computer from this URL: https://www.putty.org/. If you have a Mac, you can use the command line utility. Look for tutorials online. 48 | 49 | Look at the raw ChRO-seq data in fastq format 50 | --------------------------------------------- 51 | 52 | The next step is to use LINUX commands to navigate to the raw ChRO-seq data. I have added this to /workdir/data. To get there and view the data, please enter the following commands: 53 | 54 | ``` 55 | [dankoc@cbsumm22 ~]$ cd /workdir/data/fastq 56 | [dankoc@cbsumm22 fastq]$ ls -lha 57 | total 1.9G 58 | drwxrwxr-x 2 dankoc dankoc 35 Nov 26 19:05 . 59 | drwxrwxr-x 4 dankoc dankoc 41 Nov 26 18:32 .. 60 | -rw-rw-r-- 1 dankoc dankoc 1.9G Nov 26 19:06 LZ_R4.fastq.gz 61 | ``` 62 | 63 | Now look into the fastq file using the "zless" LINUX command: 64 | 65 | ``` 66 | [dankoc@cbsumm22 fastq]$ zless LZ_R4.fastq.gz 67 | ``` 68 | 69 | This opens a window which shows you the raw data: 70 | 71 | ``` 72 | @NS500503:579:HTMFNBGX3:1:11101:7482:1050 1:N:0:GATCAG 73 | CGGGANGGTGACTGCAATGACATGCTGTTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGATCAGATCTCGTAT 74 | + 75 | AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEE6EEEEEE 76 | @NS500503:579:HTMFNBGX3:1:11101:7717:1050 1:N:0:GATCAG 77 | GTTAGNAAAGCAGGAGGATTATTTTTGGTAGCCTACTTAAATTCATGTTTTGCTTAGGTAACCATACAGTTGAGTG 78 | + 79 | AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA 80 | @NS500503:579:HTMFNBGX3:1:11101:17223:1050 1:N:0:GATCAG 81 | CGAGANACATATGTGCTATGGCATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGATCAGATCTCGTATGCCGT 82 | + 83 | AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEEEAEEE 108 | 109 | 110 | Notice that while some of the reads map, many don't map as-is.
Moreover, manual inspection shows you two aspects of read mapping: 111 | * First, most reads start mapping at position 7 in the read. 112 | * Second, many reads map only the 5' portion of the read. 113 | 114 | Why do you think this is? 115 | 116 | Quality control reads 117 | --------------------- 118 | 119 | Check the QC on these fastq files using the software program fastqc. 120 | 121 | Note that this program will write a summary of the fastq qualities. You need to put this summary in a directory that we can write to. Permissions in LINUX frequently do not allow writing to directories with shared information (nor should they!). Create a working directory: 122 | 123 | ``` 124 | [dankoc@cbsumm27 workdir]$ mkdir /workdir/dankoc 125 | [dankoc@cbsumm27 workdir]$ cd /workdir/dankoc/ 126 | ``` 127 | 128 | Note: instead of using /workdir/dankoc/ you should name the directory /workdir/{your netID}/. Make sure to change the commands throughout to write to your own folder instead of dankoc!! 129 | 130 | Then run fastqc: 131 | 132 | ``` 133 | [dankoc@cbsumm27 dankoc]$ fastqc -o /workdir/dankoc /workdir/data/fastq/LZ_R4.fastq.gz 134 | Started analysis of LZ_R4.fastq.gz 135 | Approx 5% complete for LZ_R4.fastq.gz 136 | Approx 10% complete for LZ_R4.fastq.gz 137 | Approx 15% complete for LZ_R4.fastq.gz 138 | Approx 20% complete for LZ_R4.fastq.gz 139 | Approx 25% complete for LZ_R4.fastq.gz 140 | Approx 30% complete for LZ_R4.fastq.gz 141 | Approx 35% complete for LZ_R4.fastq.gz 142 | Approx 40% complete for LZ_R4.fastq.gz 143 | Approx 45% complete for LZ_R4.fastq.gz 144 | Approx 50% complete for LZ_R4.fastq.gz 145 | Approx 55% complete for LZ_R4.fastq.gz 146 | Approx 60% complete for LZ_R4.fastq.gz 147 | Approx 65% complete for LZ_R4.fastq.gz 148 | Approx 70% complete for LZ_R4.fastq.gz 149 | Approx 75% complete for LZ_R4.fastq.gz 150 | Approx 80% complete for LZ_R4.fastq.gz 151 | Approx 85% complete for LZ_R4.fastq.gz 152 | Approx 90% complete for LZ_R4.fastq.gz 153 |
Approx 95% complete for LZ_R4.fastq.gz 154 | Analysis complete for LZ_R4.fastq.gz 155 | ``` 156 | 157 | To see the fastqc analysis, download those files using a secure file transfer protocol (SFTP) client. The client "filezilla" is on the computers in Mann Library B30B. Select start, type filezilla, and set filezilla to connect to your assigned server. Note that you have to use the SFTP protocol. 158 | 159 | Note also that I have not yet gone to Mann library to try this step. 160 | 161 | When you view the fastqc file, it will look something like this: 162 | 163 | 164 | 165 | The base qualities represented in that graph look excellent (our sequencing core typically does a nice job). 166 | 167 | Scroll down and note the fraction of reads contaminated with adapters: 168 | 169 | 170 | 171 | A lot of PRO-seq-type data (and other short read data too) looks like this. There are a lot of relatively short inserts in the library, and the Illumina NextSeq500 has a very strong bias for shorter fragments. Whatever you think your insert size distribution looks like, Illumina will sequence the shorter fragments first! 172 | 173 | Therefore we will have to trim adapters. Note that if we did not already know what sequence to trim, we could use the adapter identified by fastqc! 174 | 175 | Remove PCR duplicates 176 | -------------------------- 177 | 178 | The first step in processing ChRO-seq data is to remove PCR duplicates (deduplicate the reads). I deduplicate using prinseq-lite.pl. 179 | 180 | First, you have to copy prinseq-lite.pl into your working directory. 181 | 182 | ``` 183 | cp /workdir/data/prinseq-lite.pl /workdir/dankoc/ 184 | ``` 185 | 186 | Then, you can use the various prinseq-lite options to perform the deduplication, and trim off the first 6 bases from the read. 187 | 188 | Note that what follows is a complicated set of instructions. This program is a bit archaic, so I'm not going to go into much detail explaining this line. If you don't understand this in detail, don't fret!
Better examples are coming that you should be better equipped to understand in their entirety. 189 | 190 | ``` 191 | cd /workdir/dankoc 192 | zcat /workdir/data/fastq/LZ_R4.fastq.gz | ./prinseq-lite.pl -derep 1 -fastq stdin -out_format 3 -out_good stdout -out_bad null 2> /workdir/dankoc/pcr_dups.txt | \ 193 | ./prinseq-lite.pl -trim_left 6 -fastq stdin -out_format 3 -out_good stdout -out_bad null -min_len 15 | gzip > /workdir/dankoc/LZ_R4.no-PCR-dups.fastq.gz 194 | Input and filter stats: 195 | Input sequences: 42,740,174 196 | Input bases: 3,248,253,224 197 | Input mean length: 76.00 198 | Good sequences: 42,740,174 (100.00%) 199 | Good bases: 2,991,812,180 200 | Good mean length: 70.00 201 | Bad sequences: 0 (0.00%) 202 | Sequences filtered by specified parameters: 203 | none 204 | ``` 205 | 206 | That line should create a new .fastq.gz file in your working directory (mine is: /workdir/dankoc/). It also created a text file, pcr_dups.txt, which contains information about how many PCR duplicates were identified and removed: 207 | 208 | ``` 209 | [dankoc@cbsumm27 dankoc]$ cat pcr_dups.txt 210 | Input and filter stats: 211 | Input sequences: 47,534,859 212 | Input bases: 3,612,649,284 213 | Input mean length: 76.00 214 | Good sequences: 42,740,174 (89.91%) 215 | Good bases: 3,248,253,224 216 | Good mean length: 76.00 217 | Bad sequences: 4,794,685 (10.09%) 218 | Bad bases: 364,396,060 219 | Bad mean length: 76.00 220 | Sequences filtered by specified parameters: 221 | derep: 4794685 222 | ``` 223 | 224 | The output fastq file (LZ_R4.no-PCR-dups.fastq.gz) can be used in the next step! 225 | 226 | Trim adapters 227 | ------------- 228 | 229 | Our next goal is to trim adapters, while leaving all non-adapter sequence untouched. We will trim adapters using the software program "cutadapt".
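Before running cutadapt, it may help to see what 3' adapter trimming means conceptually: find the first occurrence of the adapter in each read and delete it plus everything after it. A toy shell sketch of that idea using exact string matching (unlike cutadapt, this is not error-tolerant, and the read here is made up; the adapter sequence is the one we trim in the command below):

```shell
adapter="TGGAATTCTCGGGTGCCAAGG"
# A toy read: insert sequence, then the adapter, then downstream library sequence.
read="GGTGACTGCAATGACATGCTGT${adapter}AACTCCAGTCAC"
# Delete from the first adapter match to the end of the read.
trimmed="${read%%${adapter}*}"
echo "${trimmed}"   # GGTGACTGCAATGACATGCTGT
```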
230 | 231 | To see how to use this program, call the program without any options: 232 | 233 | ``` 234 | [dankoc@cbsumm22 data]$ cutadapt 235 | cutadapt version 1.16 236 | Copyright (C) 2010-2017 Marcel Martin 237 | 238 | cutadapt removes adapter sequences from high-throughput sequencing reads. 239 | 240 | Usage: 241 | cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq 242 | 243 | For paired-end reads: 244 | cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq 245 | 246 | Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard 247 | characters are supported. The reverse complement is *not* automatically 248 | searched. All reads from input.fastq will be written to output.fastq with the 249 | adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter 250 | sequences can be given (use further -a options), but only the best-matching 251 | adapter will be removed. 252 | 253 | Input may also be in FASTA format. Compressed input and output is supported and 254 | auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for 255 | standard input/output. Without the -o option, output is sent to standard output. 256 | 257 | Citation: 258 | 259 | Marcel Martin. Cutadapt removes adapter sequences from high-throughput 260 | sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011. 261 | http://dx.doi.org/10.14806/ej.17.1.200 262 | 263 | Use "cutadapt --help" to see all command-line options. 264 | See http://cutadapt.readthedocs.io/ for full documentation. 265 | 266 | cutadapt: error: At least one parameter needed: name of a FASTA or FASTQ file. 
267 | ``` 268 | 269 | To make cutadapt work, we need to specify the: 270 | * Input fastq file (input.fastq) 271 | * Adapter sequence (-a ADAPTER) 272 | * Output file (-o output.fastq) 273 | 274 | 275 | Run this as follows: 276 | 277 | ``` 278 | [dankoc@cbsumm22 data]$ cutadapt -a TGGAATTCTCGGGTGCCAAGG -z -e 0.10 --minimum-length=15 --output=LZ_R4.no-PCR-dups.no-Adapters.fastq.gz LZ_R4.no-PCR-dups.fastq.gz 279 | This is cutadapt 1.16 with Python 3.6.1 280 | Command line parameters: -a TGGAATTCTCGGGTGCCAAGG -z -e 0.10 --minimum-length=15 --output=LZ_R4.no-PCR-dups.no-Adapters.fastq.gz LZ_R4.no-PCR-dups.fastq.gz 281 | Running on 1 core 282 | Trimming 1 adapter with at most 10.0% errors in single-end mode ... 283 | Finished in 888.86 s (21 us/read; 2.89 M reads/minute). 284 | 285 | === Summary === 286 | 287 | Total reads processed: 42,740,174 288 | Reads with adapters: 39,544,375 (92.5%) 289 | Reads that were too short: 1,287,179 (3.0%) 290 | Reads written (passing filters): 41,452,995 (97.0%) 291 | 292 | Total basepairs processed: 2,991,812,180 bp 293 | Total written (filtered): 1,530,434,687 bp (51.2%) 294 | 295 | === Adapter 1 === 296 | 297 | Sequence: TGGAATTCTCGGGTGCCAAGG; Type: regular 3'; Length: 21; Trimmed: 39544375 times. 298 | 299 | No. 
of allowed errors: 300 | 0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2 301 | 302 | Bases preceding removed adapters: 303 | A: 14.5% 304 | C: 50.0% 305 | G: 8.9% 306 | T: 26.3% 307 | none/other: 0.3% 308 | 309 | Overview of removed sequences 310 | length count expect max.err error counts 311 | 3 257830 667815.2 0 257830 312 | 4 222128 166953.8 0 222128 313 | 5 221401 41738.5 0 221401 314 | 6 225189 10434.6 0 225189 315 | 7 236872 2608.7 0 236872 316 | 8 240668 652.2 0 240668 317 | 9 265207 163.0 0 263620 1587 318 | 10 285301 40.8 1 275947 9354 319 | 11 300955 10.2 1 290697 10258 320 | 12 317065 2.5 1 306226 10839 321 | 13 336865 0.6 1 324269 12596 322 | 14 353719 0.2 1 339814 13905 323 | 15 368055 0.0 1 351799 16256 324 | 16 390184 0.0 1 373033 17151 325 | 17 409857 0.0 1 390850 19007 326 | 18 428369 0.0 1 408019 20220 130 327 | 19 450155 0.0 1 427985 21864 306 328 | 20 471537 0.0 2 441747 24955 4835 329 | 21 493105 0.0 2 463368 25061 4676 330 | 22 522386 0.0 2 492115 25756 4515 331 | 23 545160 0.0 2 514271 26394 4495 332 | 24 563192 0.0 2 531642 27057 4493 333 | 25 593036 0.0 2 559794 28482 4760 334 | 26 619266 0.0 2 585012 29415 4839 335 | 27 639210 0.0 2 604625 29922 4663 336 | 28 666983 0.0 2 630913 31177 4893 337 | 29 698123 0.0 2 660362 32627 5134 338 | 30 725655 0.0 2 686379 33901 5375 339 | 31 751377 0.0 2 712010 34325 5042 340 | 32 765386 0.0 2 725969 34396 5021 341 | 33 784165 0.0 2 744621 34554 4990 342 | 34 820110 0.0 2 780207 35181 4722 343 | 35 850916 0.0 2 809539 36314 5063 344 | 36 890172 0.0 2 847372 37944 4856 345 | 37 942634 0.0 2 897212 40301 5121 346 | 38 986344 0.0 2 936554 43934 5856 347 | 39 1028666 0.0 2 978616 44129 5921 348 | 40 1079438 0.0 2 1026400 46842 6196 349 | 41 1135827 0.0 2 1080585 48915 6327 350 | 42 1216544 0.0 2 1159792 50528 6224 351 | 43 1310655 0.0 2 1249154 54866 6635 352 | 44 1405163 0.0 2 1339787 58293 7083 353 | 45 1479306 0.0 2 1412826 59730 6750 354 | 46 1553448 0.0 2 1479859 65533 8056 355 | 47 1612051 0.0 2 1536252 67783 8016 
356 | 48 1604135 0.0 2 1530348 66045 7742 357 | 49 1576522 0.0 2 1504822 64359 7341 358 | 50 1459342 0.0 2 1393907 58885 6550 359 | 51 1208041 0.0 2 1154098 48606 5337 360 | 52 877897 0.0 2 838192 35883 3822 361 | 53 546745 0.0 2 522130 22382 2233 362 | 54 310862 0.0 2 296835 12705 1322 363 | 55 213977 0.0 2 204002 8922 1053 364 | 56 186416 0.0 2 177561 7988 867 365 | 57 163987 0.0 2 156562 6673 752 366 | 58 139292 0.0 2 133083 5571 638 367 | 59 123801 0.0 2 118571 4734 496 368 | 60 108009 0.0 2 103273 4307 429 369 | 61 108722 0.0 2 104057 4210 455 370 | 62 103555 0.0 2 99258 3933 364 371 | 63 91135 0.0 2 87329 3485 321 372 | 64 71053 0.0 2 67868 2904 281 373 | 65 45819 0.0 2 43809 1816 194 374 | 66 20209 0.0 2 19387 754 68 375 | 67 10985 0.0 2 10462 485 38 376 | 68 5114 0.0 2 4841 250 23 377 | 69 7818 0.0 2 7241 533 44 378 | 70 101264 0.0 2 63611 32321 5332 379 | 380 | ``` 381 | 382 | This produces a new fastq.gz file in your working directory (LZ_R4.no-PCR-dups.no-Adapters.fastq.gz). Look at this file using zless, and you'll see that reads are all sorts of different lengths. 
383 | 384 | ``` 385 | [dankoc@cbsumm27 dankoc]$ zless LZ_R4.no-PCR-dups.no-Adapters.fastq.gz 386 | 387 | @NS500503:579:HTMFNBGX3:1:11101:7482:1050 1:N:0:GATCAG 388 | GGTGACTGCAATGACATGCTGT 389 | +NS500503:579:HTMFNBGX3:1:11101:7482:1050 1:N:0:GATCAG 390 | EEEEEEEEEEEEEEEEEEEEEE 391 | @NS500503:579:HTMFNBGX3:1:11101:7717:1050 1:N:0:GATCAG 392 | AAAGCAGGAGGATTATTTTTGGTAGCCTACTTAAATTCATGTTTTGCTTAGGTAACCATACAGTTGAGTG 393 | +NS500503:579:HTMFNBGX3:1:11101:7717:1050 1:N:0:GATCAG 394 | EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA 395 | @NS500503:579:HTMFNBGX3:1:11101:17223:1050 1:N:0:GATCAG 396 | ACATATGTGCTATGGCA 397 | +NS500503:579:HTMFNBGX3:1:11101:17223:1050 1:N:0:GATCAG 398 | EEEEEEEEEEEEEEEEE 399 | @NS500503:579:HTMFNBGX3:1:11101:18540:1052 1:N:0:GATCAG 400 | ATAACCTATGCGTGACTCTCAGCACAGTGAATTTTGTT 401 | +NS500503:579:HTMFNBGX3:1:11101:18540:1052 1:N:0:GATCAG 402 | EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 403 | @NS500503:579:HTMFNBGX3:1:11101:20114:1052 1:N:0:GATCAG 404 | AGGTGCTCTGGTCCTTCCTCCAGTGTGTATGC 405 | +NS500503:579:HTMFNBGX3:1:11101:20114:1052 1:N:0:GATCAG 406 | EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 407 | @NS500503:579:HTMFNBGX3:1:11101:5639:1052 1:N:0:GATCAG 408 | CGAGCAAGCCGGGACATAAGCCAGGGACGGGGGAAT 409 | +NS500503:579:HTMFNBGX3:1:11101:5639:1052 1:N:0:GATCAG 410 | EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 411 | @NS500503:579:HTMFNBGX3:1:11101:16452:1052 1:N:0:GATCAG 412 | CTCAGACCCCAGAACTGTACACTAATAACTGTGTATTAATTTC 413 | +NS500503:579:HTMFNBGX3:1:11101:16452:1052 1:N:0:GATCAG 414 | EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 472 | 473 | Usage: bwa [options] 474 | 475 | Command: index index sequences in the FASTA format 476 | mem BWA-MEM algorithm 477 | fastmap identify super-maximal exact matches 478 | pemerge merge overlapping paired ends (EXPERIMENTAL) 479 | aln gapped/ungapped alignment 480 | samse generate alignment (single ended) 481 | sampe generate alignment (paired ended) 482 | bwasw BWA-SW for long queries 483 | 484 | shm manage indices 
in shared memory 485 | fa2pac convert FASTA to PAC format 486 | pac2bwt generate BWT from PAC 487 | pac2bwtgen alternative algorithm for generating BWT 488 | bwtupdate update .bwt to the new format 489 | bwt2sa generate SA from BWT and Occ 490 | 491 | Note: To use BWA, you need to first index the genome with `bwa index'. 492 | There are three alignment algorithms in BWA: `mem', `bwasw', and 493 | `aln/samse/sampe'. If you are not sure which to use, try `bwa mem' 494 | first. Please `man ./bwa.1' for the manual. 495 | 496 | ``` 497 | 498 | BWA works in two steps. First, we need to generate a compressed file that represents the mouse genome using very efficient machine language. This compressed representation of the genome is called an "index". 499 | 500 | I have placed a text copy of the mouse reference genome (version mm10) here: /workdir/data/mm10/mm10.rRNA.fa.gz 501 | 502 | Note that BWA does not allow you to adjust the output directory of this file, so you have to copy it to your working directory where you have permission to write the index. 503 | 504 | Use the following commands: 505 | 506 | ``` 507 | [dankoc@cbsumm27 mm10]$ cd /workdir/data/mm10 508 | [dankoc@cbsumm27 mm10]$ cp mm10.rRNA.fa.gz /workdir/dankoc 509 | [dankoc@cbsumm27 mm10]$ cd /workdir/dankoc 510 | [dankoc@cbsumm27 dankoc]$ bwa index /workdir/data/mm10/mm10.rRNA.fa.gz 511 | [bwa_index] Pack FASTA... 32.40 sec 512 | [bwa_index] Construct BWT for the packed sequence... 513 | [BWTIncCreate] textLength=5461829546, availableWord=396314452 514 | [BWTIncConstructFromPacked] 10 iterations done. 99999994 characters processed. 515 | [BWTIncConstructFromPacked] 20 iterations done. 199999994 characters processed. 516 | [BWTIncConstructFromPacked] 30 iterations done. 299999994 characters processed. 517 | [BWTIncConstructFromPacked] 40 iterations done. 399999994 characters processed. 518 | [BWTIncConstructFromPacked] 50 iterations done. 499999994 characters processed. 
519 | [BWTIncConstructFromPacked] 60 iterations done. 599999994 characters processed. 520 | [BWTIncConstructFromPacked] 70 iterations done. 699999994 characters processed. 521 | [BWTIncConstructFromPacked] 80 iterations done. 799999994 characters processed. 522 | [BWTIncConstructFromPacked] 90 iterations done. 899999994 characters processed. 523 | [BWTIncConstructFromPacked] 100 iterations done. 999999994 characters processed. 524 | [BWTIncConstructFromPacked] 110 iterations done. 1099999994 characters processed. 525 | [BWTIncConstructFromPacked] 120 iterations done. 1199999994 characters processed. 526 | [BWTIncConstructFromPacked] 130 iterations done. 1299999994 characters processed. 527 | [BWTIncConstructFromPacked] 140 iterations done. 1399999994 characters processed. 528 | [BWTIncConstructFromPacked] 150 iterations done. 1499999994 characters processed. 529 | [BWTIncConstructFromPacked] 160 iterations done. 1599999994 characters processed. 530 | [BWTIncConstructFromPacked] 170 iterations done. 1699999994 characters processed. 531 | [BWTIncConstructFromPacked] 180 iterations done. 1799999994 characters processed. 532 | [BWTIncConstructFromPacked] 190 iterations done. 1899999994 characters processed. 533 | [BWTIncConstructFromPacked] 200 iterations done. 1999999994 characters processed. 534 | [BWTIncConstructFromPacked] 210 iterations done. 2099999994 characters processed. 535 | [BWTIncConstructFromPacked] 220 iterations done. 2199999994 characters processed. 536 | [BWTIncConstructFromPacked] 230 iterations done. 2299999994 characters processed. 537 | [BWTIncConstructFromPacked] 240 iterations done. 2399999994 characters processed. 538 | [BWTIncConstructFromPacked] 250 iterations done. 2499999994 characters processed. 539 | [BWTIncConstructFromPacked] 260 iterations done. 2599999994 characters processed. 540 | [BWTIncConstructFromPacked] 270 iterations done. 2699999994 characters processed. 541 | [BWTIncConstructFromPacked] 280 iterations done. 
2799999994 characters processed. 542 | [BWTIncConstructFromPacked] 290 iterations done. 2899999994 characters processed. 543 | [BWTIncConstructFromPacked] 300 iterations done. 2999999994 characters processed. 544 | [BWTIncConstructFromPacked] 310 iterations done. 3099999994 characters processed. 545 | [BWTIncConstructFromPacked] 320 iterations done. 3199999994 characters processed. 546 | [BWTIncConstructFromPacked] 330 iterations done. 3299999994 characters processed. 547 | [BWTIncConstructFromPacked] 340 iterations done. 3399999994 characters processed. 548 | [BWTIncConstructFromPacked] 350 iterations done. 3499999994 characters processed. 549 | [BWTIncConstructFromPacked] 360 iterations done. 3599999994 characters processed. 550 | [BWTIncConstructFromPacked] 370 iterations done. 3699999994 characters processed. 551 | [BWTIncConstructFromPacked] 380 iterations done. 3799999994 characters processed. 552 | [BWTIncConstructFromPacked] 390 iterations done. 3899999994 characters processed. 553 | [BWTIncConstructFromPacked] 400 iterations done. 3999999994 characters processed. 554 | [BWTIncConstructFromPacked] 410 iterations done. 4099999994 characters processed. 555 | [BWTIncConstructFromPacked] 420 iterations done. 4199999994 characters processed. 556 | [BWTIncConstructFromPacked] 430 iterations done. 4299999994 characters processed. 557 | [BWTIncConstructFromPacked] 440 iterations done. 4399999994 characters processed. 558 | [BWTIncConstructFromPacked] 450 iterations done. 4499999994 characters processed. 559 | [BWTIncConstructFromPacked] 460 iterations done. 4599999994 characters processed. 560 | [BWTIncConstructFromPacked] 470 iterations done. 4699999994 characters processed. 561 | [BWTIncConstructFromPacked] 480 iterations done. 4799899178 characters processed. 562 | [BWTIncConstructFromPacked] 490 iterations done. 4892727610 characters processed. 563 | [BWTIncConstructFromPacked] 500 iterations done. 4975229770 characters processed. 
564 | [BWTIncConstructFromPacked] 510 iterations done. 5048553898 characters processed. 565 | [BWTIncConstructFromPacked] 520 iterations done. 5113720554 characters processed. 566 | [BWTIncConstructFromPacked] 530 iterations done. 5171636842 characters processed. 567 | [BWTIncConstructFromPacked] 540 iterations done. 5223108954 characters processed. 568 | [BWTIncConstructFromPacked] 550 iterations done. 5268853482 characters processed. 569 | [BWTIncConstructFromPacked] 560 iterations done. 5309507322 characters processed. 570 | [BWTIncConstructFromPacked] 570 iterations done. 5345636538 characters processed. 571 | [BWTIncConstructFromPacked] 580 iterations done. 5377744234 characters processed. 572 | [BWTIncConstructFromPacked] 590 iterations done. 5406277626 characters processed. 573 | [BWTIncConstructFromPacked] 600 iterations done. 5431634122 characters processed. 574 | [BWTIncConstructFromPacked] 610 iterations done. 5454167018 characters processed. 575 | [bwt_gen] Finished constructing BWT in 614 iterations. 576 | [bwa_index] 2339.11 seconds elapse. 577 | [bwa_index] Update BWT... 15.82 sec 578 | [bwa_index] Pack forward-only FASTA... 24.65 sec 579 | [bwa_index] Construct SA from BWT and Occ... 774.60 sec 580 | [main] Version: 0.7.13-r1126 581 | [main] CMD: bwa index /workdir/data/mm10/mm10.rRNA.fa.gz 582 | [main] Real time: 3192.431 sec; CPU: 3186.585 sec 583 | [dankoc@cbsumm27 dankoc]$ ls -lha mm10.rRNA.fa* 584 | -rw-rw-r-- 1 dankoc dankoc 830M Nov 26 18:36 mm10.rRNA.fa.gz 585 | -rw-rw-r-- 1 dankoc dankoc 12K Nov 26 19:23 mm10.rRNA.fa.gz.amb 586 | -rw-rw-r-- 1 dankoc dankoc 2.9K Nov 26 19:23 mm10.rRNA.fa.gz.ann 587 | -rw-rw-r-- 1 dankoc dankoc 2.6G Nov 26 19:23 mm10.rRNA.fa.gz.bwt 588 | -rw-rw-r-- 1 dankoc dankoc 652M Nov 26 19:23 mm10.rRNA.fa.gz.pac 589 | -rw-rw-r-- 1 dankoc dankoc 1.3G Nov 26 19:36 mm10.rRNA.fa.gz.sa 590 | ``` 591 | 592 | Note that five additional files have appeared in this directory. Some of which are larger than the fastq file. 
These files are the index, representing the mouse genome in a machine language that allows finding matching sequences very efficiently. 593 | 594 | You will refer to this genome index in subsequent commands as "mm10.rRNA.fa.gz". Note that the fasta file itself is not a part of the index. You can now delete your private copy if you wish. 595 | 596 | Next, align reads in the trimmed fastq.gz file to this mm10 reference genome: 597 | 598 | ``` 599 | [dankoc@cbsumm27 dankoc]$ bwa aln -t 10 mm10.rRNA.fa.gz LZ_R4.no-PCR-dups.no-Adapters.fastq.gz > LZ_R4.sai 600 | [bwa_aln] 17bp reads: max_diff = 2 601 | [bwa_aln] 38bp reads: max_diff = 3 602 | [bwa_aln] 64bp reads: max_diff = 4 603 | [bwa_aln] 93bp reads: max_diff = 5 604 | [bwa_aln] 124bp reads: max_diff = 6 605 | [bwa_aln] 157bp reads: max_diff = 7 606 | [bwa_aln] 190bp reads: max_diff = 8 607 | [bwa_aln] 225bp reads: max_diff = 9 608 | [bwa_aln_core] calculate SA coordinate... 43.04 sec 609 | [bwa_aln_core] write to the disk... 0.05 sec 610 | [bwa_aln_core] 262144 sequences have been processed. 611 | [bwa_aln_core] calculate SA coordinate... 44.20 sec 612 | [bwa_aln_core] write to the disk... 0.05 sec 613 | [bwa_aln_core] 524288 sequences have been processed. 614 | ... [MANY MORE LINES LIKE THIS] 615 | [bwa_aln_core] calculate SA coordinate... 40.86 sec 616 | [bwa_aln_core] write to the disk... 0.07 sec 617 | [bwa_aln_core] 41418752 sequences have been processed. 618 | [bwa_aln_core] calculate SA coordinate... 5.02 sec 619 | [bwa_aln_core] write to the disk... 0.01 sec 620 | [bwa_aln_core] 41452995 sequences have been processed.
621 | [main] Version: 0.7.13-r1126 622 | [main] CMD: bwa aln -t 10 mm10.rRNA.fa.gz LZ_R4.no-PCR-dups.no-Adapters.fastq.gz 623 | [main] Real time: 824.319 sec; CPU: 6761.728 sec 624 | ``` 625 | 626 | And convert the resulting alignments (.sai format) into the more standard SAM format: 627 | 628 | ``` 629 | [dankoc@cbsumm27 dankoc]$ bwa samse -n 1 -f LZ_R4.sam mm10.rRNA.fa.gz LZ_R4.sai LZ_R4.no-PCR-dups.no-Adapters.fastq.gz 630 | [bwa_aln_core] convert to sequence coordinate... 4.35 sec 631 | [bwa_aln_core] refine gapped alignments... 0.63 sec 632 | [bwa_aln_core] print alignments... 0.35 sec 633 | [bwa_aln_core] 262144 sequences have been processed. 634 | [bwa_aln_core] convert to sequence coordinate... 4.18 sec 635 | [bwa_aln_core] refine gapped alignments... 0.61 sec 636 | [bwa_aln_core] print alignments... 0.34 sec 637 | [bwa_aln_core] 524288 sequences have been processed. 638 | [bwa_aln_core] convert to sequence coordinate... 4.03 sec 639 | [bwa_aln_core] refine gapped alignments... 0.60 sec 640 | [bwa_aln_core] print alignments... 0.35 sec 641 | [bwa_aln_core] 786432 sequences have been processed. 642 | [bwa_aln_core] convert to sequence coordinate... 4.00 sec 643 | ... 644 | [main] Version: 0.7.13-r1126 645 | [main] CMD: bwa samse -n 1 -f LZ_R4.sam mm10.rRNA.fa.gz LZ_R4.sai LZ_R4.no-PCR-dups.no-Adapters.fastq.gz 646 | [main] Real time: 878.276 sec; CPU: 878.285 sec 647 | ``` 648 | 649 | The output of this is a SAM file. See here for details on the SAM specification: https://en.wikipedia.org/wiki/SAM_(file_format) 650 | 651 | Note that SAM files are human readable text files, and are therefore very inefficient for long term storage. In the next step we will convert this SAM file into a machine-readable (but not human readable) file format called BAM. We will then sort the BAM file for downstream processing. Both of these steps are completed using a program called samtools.
652 | 653 | ``` 654 | [dankoc@cbsumm27 dankoc]$ samtools view -b -S LZ_R4.sam > LZ_R4.bam 655 | [dankoc@cbsumm27 dankoc]$ samtools sort -@ 10 LZ_R4.bam -o LZ_R4.sort.bam 656 | [bam_sort_core] merging from 10 files and 10 in-memory blocks... 657 | ``` 658 | 659 | Notice how much larger the SAM file is than the BAMs: 660 | 661 | ``` 662 | [dankoc@cbsumm27 dankoc]$ ls -lha *.bam 663 | -rw-rw-r-- 1 dankoc dankoc 1.9G Nov 27 12:34 LZ_R4.bam 664 | -rw-rw-r-- 1 dankoc dankoc 1.6G Nov 27 12:38 LZ_R4.sort.bam 665 | [dankoc@cbsumm27 dankoc]$ ls -lha *.sam 666 | -rw-rw-r-- 1 dankoc dankoc 7.9G Nov 27 12:25 LZ_R4.sam 667 | ``` 668 | 669 | The BAM file represents the location of reads that map to the reference genome. That's it - you've done it! 670 | 671 | Note: See the reference for BWA here: 672 | https://academic.oup.com/bioinformatics/article/25/14/1754/225615 673 | 674 | Convert BAM files into bigWig format 675 | ------------------------------------ 676 | 677 | Much of what we will do on Thursday uses a more compact format known as bigWig. BigWig is a binary format that represents each position in the genome with >0 counts, and the number of counts at that position. 678 | 679 | There are several steps to convert a BAM file into bigWig format. 680 | 681 | First, write out a BED file. 682 | 683 | ``` 684 | [dankoc@cbsumm27 dankoc]$ bedtools bamtobed -i LZ_R4.sort.bam | awk 'BEGIN{OFS="\t"} ($5 > 20){print $0}' | grep -v "rRNA" | \ 685 | awk 'BEGIN{OFS="\t"} ($6 == "+") {print $1,$2,$2+1,$4,$5,$6}; ($6 == "-") {print $1,$3-1,$3,$4,$5,$6}' | gzip > LZ_R4.bed.gz 686 | ``` 687 | 688 | Note that this command uses awk. Awk is a programming language that allows some very fast operations on text without writing to disk. Instead, it operates from a pipe (see here: https://www.geeksforgeeks.org/piping-in-unix-or-linux/). 689 | 690 | The first awk statement ``` awk 'BEGIN{OFS="\t"} ($5 > 20){print $0}' ``` takes reads that pass a stringent alignment quality filter. 
Scores are in PHRED format (see here: https://en.wikipedia.org/wiki/Phred_quality_score), so a score above 20 denotes a <1% chance of error. 691 | 692 | The second statement ``` grep -v "rRNA" ``` removes reads that map to the rRNA (Pol I) transcription unit. 693 | 694 | The third statement is where most of the magic happens ``` awk 'BEGIN{OFS="\t"} ($6 == "+") {print $1,$2,$2+1,$4,$5,$6}; ($6 == "-") {print $1,$3-1,$3,$4,$5,$6}' ```. 695 | 696 | Note that we take only a single position. Run-on and sequencing methods map the location of the RNA polymerase, in many cases at nearly single-nucleotide resolution. Therefore, it is logical to represent the coordinate of RNA polymerase using the genomic position that best represents the polymerase location, rather than representing the entire read. leChRO-seq measures the position at which the RNA emerges from the polymerase exit channel at approximately single-nucleotide resolution. We therefore represent this position using the 5' end of the mapped read. 697 | 698 | Next, count the number of reads starting in each position. This is done using the genomecov command in the bedtools suite. 699 | 700 | ``` 701 | [dankoc@cbsumm27 dankoc]$ bedtools genomecov -bg -i LZ_R4.bed.gz -g /workdir/data/mm10.chromInfo -strand + > LZ_R4_plus.bedGraph 702 | [dankoc@cbsumm27 dankoc]$ bedtools genomecov -bg -i LZ_R4.bed.gz -g /workdir/data/mm10.chromInfo -strand - > LZ_R4_minus.bedGraph 703 | ``` 704 | 705 | Note that a new file is required as input (mm10.chromInfo). This file represents the size of every chromosome in the mouse genome. It's human readable. Have a look at what's inside using ``` cat /workdir/data/mm10.chromInfo ```. 706 | 707 | Also notice that this line is run twice in order to split off reads mapping to the + and - strands. Remember that leChRO-seq resolves the strand onto which RNA polymerase is engaged, and we need to capture that information when we are analyzing the data.
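To see concretely what that single-base awk statement does, you can feed it a couple of made-up BED records (coordinates invented for illustration): the plus-strand read collapses to its left-most (5') base, and the minus-strand read collapses to its right-most base.

```shell
# Two toy 76 bp reads, one per strand (chrom, start, end, name, MAPQ, strand).
printf 'chr1\t100\t176\tread1\t37\t+\nchr1\t200\t276\tread2\t37\t-\n' | \
awk 'BEGIN{OFS="\t"} ($6 == "+") {print $1,$2,$2+1,$4,$5,$6}; ($6 == "-") {print $1,$3-1,$3,$4,$5,$6}'
# read1 (+) becomes chr1 100 101; read2 (-) becomes chr1 275 276
```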
708 | 709 | The final step is to convert these bedGraph files into the bigWig format. The two formats represent essentially the same data; they differ only in how it is stored. BigWig files store an index that allows a computer program to request data from just a portion of the genome without reading the entire file. This speeds up some analyses. 710 | 711 | The conversion to bigWig is done using the bedGraphToBigWig program, one of the Kent Source utilities written by the authors of the UCSC genome browser. 712 | 713 | ``` 714 | [dankoc@cbsumm27 dankoc]$ /workdir/data/bedGraphToBigWig LZ_R4_plus.bedGraph /workdir/data/mm10.chromInfo LZ_R4_plus.bw 715 | [dankoc@cbsumm27 dankoc]$ /workdir/data/bedGraphToBigWig LZ_R4_minus.bedGraph /workdir/data/mm10.chromInfo LZ_R4_minus.bw 716 | ``` 717 | 718 | Note: BED, bedGraph, and bigWig are all standard file formats in bioinformatics. A great resource for learning about these file formats can be found here: https://genome.ucsc.edu/FAQ/FAQformat.html 719 | 720 | Look at the mapped data using a genome browser 721 | ---------------------------------------------- 722 | 723 | Viewing mapped reads on a genome browser is the most informative way to determine many features of an experiment, and should be the first analysis after read mapping. Please try to view the bigWig files that you generated using a genome browser. 724 | 725 | I suggest starting with IGV. Download the IGV genome browser here: http://software.broadinstitute.org/software/igv/home, and follow the instructions to make it run. 726 | 727 | Download your bigWig files using FileZilla, or another SFTP client of your choice. Once you have the bigWig files downloaded, you can just drag and drop them into IGV. 728 | 729 | Look in particular at the pattern of RNA polymerase on genes. We will pick up the discussion here on Thursday. 
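To build intuition for why bigWig's index matters for browsing, consider what answering "show me chr1:100-200" costs on a plain bedGraph: a scan of the whole text file. The sketch below does exactly that linear scan with awk on made-up intervals; a bigWig answers the same query from its index without reading everything, which is what keeps browsers like IGV responsive:

```shell
# Toy bedGraph: chrom, start, end, count. Extract intervals overlapping
# chr1:100-200 by testing every line -- the full-file scan that bigWig's
# internal index avoids.
printf 'chr1\t90\t110\t5\nchr1\t150\t160\t2\nchr2\t10\t20\t7\n' > toy.bedGraph
awk -v c=chr1 -v s=100 -v e=200 '($1==c && $3>s && $2<e)' toy.bedGraph
# chr1	90	110	5
# chr1	150	160	2
rm toy.bedGraph
```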
730 | 731 | 732 | Notes/ thoughts 733 | --------------- 734 | 735 | Note that for tasks that you do all the time, it is much *much* better to have a pipeline set up that automates each of these steps. The PRO-seq, GRO-seq, and ChRO-seq pipeline that we use in my lab can be found here: 736 | https://github.com/Danko-Lab/proseq2.0 737 | 738 | This program is essentially just a "shell script" which automates the commands that you just entered manually. To use it, you execute a single command which takes in all of the required information and provides the location of the BAM and bigWig files at the end. 739 | 740 | -------------------------------------------------------------------------------- /MetaGeneATAC.md: -------------------------------------------------------------------------------- 1 | BioMG 7810 practical day 2: Meta gene for ATAC-seq. 2 | =================================================== 3 | 4 | One model of shutdown is that paused Pol II functions as a nucleosome exclusion factor, of sorts, preventing nucleosomes from encroaching on DNA. If this model is true, we may expect to see nucleosomes fill in on regulatory DNA, resulting in a loss of the nucleosome-free region. 5 | 6 | To test this hypothesis, Adriana Alexander performed ATAC-seq (see here: https://www.nature.com/articles/nmeth.2688) in each of these three separate stages of prophase I. Here we are going to use these data to ask how chromatin accessibility to the Tn5 transposase changes across these three stages of prophase I. 7 | 8 | Adding ATAC-seq 9 | --------------- 10 | 11 | I have placed bigWig files representing ATAC-seq counts here: ```/workdir/data/bigWigs/atacseq```. 12 | 13 | Next, we edit the same meta-gene script we used to write out ChRO-seq meta plots near the TSS (```writeMetaPlots.ChROseq.R```). Note that this is a slightly more complicated modification than simply changing directories, adjusting ylims, tweaking colors, etc. 
14 | 15 | Download this script using FileZilla (if you don't have it left over from before), and open it using Notepad++. 16 | 17 | Note that ATAC-seq data, at least in this representation, does not provide information about DNA strand. One of the two metaPlot functions on lines 9-19 takes single-stranded data. 18 | 19 | ``` 20 | metaPlot <- function(bed, HP_FILE, path="./", stp= 100, halfWindow= 10000, ...) { 21 | bed <- center.bed(bed, halfWindow, halfWindow) 22 | HP <- load.bigWig(paste(path, HP_FILE, sep="")) 23 | H_meta <- metaprofile.bigWig(bed, HP, step=stp) 24 | 25 | N = length(H_meta$middle) 26 | x = 1:N*stp 27 | 28 | plot.metaprofile(H_meta, X0= 0, ...) 29 | abline(v=halfWindow, lty="dotted") 30 | } 31 | ``` 32 | 33 | We are simply going to run this function on the ATAC-seq bigWig files. Proceed down to line 67, and enter the following: 34 | 35 | ``` 36 | pth="/workdir/data/bigWigs/atacseq/" 37 | LZatac <- metaPlot(tss[,1:3], "LZ_merged.bw", main="L/Z ATAC-seq", path=pth, stp=stp, halfWindow=hW) 38 | ``` 39 | 40 | Those two lines change the path, and direct the script to analyze the merged L/Z ATAC-seq file. 41 | 42 | Hint: When I'm making changes on more than one file, I prefer to work in interactive mode with an open R session, and type the commands in as I go. When I'm done, I paste them into the script. 43 | 44 | How would you instruct the script to analyze the Pachytene and Diplotene conditions? Simply add additional lines that will analyze the P and D ATAC-seq bigWig files! 45 | 46 | ``` 47 | Patac <- metaPlot(tss, "P_merged.bw", main="P ATAC-seq", path=pth, stp=stp, halfWindow=hW) 48 | Datac <- metaPlot(tss, "D_merged.bw", main="D ATAC-seq", path=pth, stp=stp, halfWindow=hW) 49 | ``` 50 | 51 | Each of those lines will write the plots for each condition into this PDF: ```PausePlots.raw-ProphaseI-ChRO-seq.pdf```. 
To put them on the same plot, add the following commands: 52 | 53 | ``` 54 | ylims <- c(0, max(LZatac$data$middle)/LZatac$signal*10^8) 55 | pdf("ATAC-seq_TSS.pdf") 56 | plot(-100000, -1000000, xlim= c(0,NROW(LZatac$data$middle)), ylim=ylims, xlab="Distance to TSS", ylab="ATAC-seq Signal") 57 | lines(LZatac$data$middle/LZatac$signal*10^8, col= "#cb6751") 58 | lines(Patac$data$middle/Patac$signal*10^8, col= "#9e6ebd") 59 | lines(Datac$data$middle/Datac$signal*10^8, col= "#7aa457") 60 | dev.off() 61 | ``` 62 | 63 | Finally, download and view ```ATAC-seq_TSS.pdf```. 64 | 65 | The resulting figure is here: 66 | 67 | -------------------------------------------------------------------------------- /MetaGenePlots.md: -------------------------------------------------------------------------------- 1 | BioMG 7810 practical day 2: Meta gene plots. 2 | ============================================ 3 | 4 | Visualizing raw or processed data can be a very informative way to understand the biology of a process. Yesterday's homework was to look at the raw data. Today, we are going to start by generating a "meta-plot", or a summary of data near a specific type of genetic feature. 5 | 6 | Scaled meta gene function 7 | ------------------------- 8 | 9 | First, we are going to look at transcription over the entire gene. I have placed an R script that generates meta-gene plots over an entire gene region here: /workdir/data/ScaledMetaPlotFunctions.r. [Credit for this file: Greg Booth in John Lis' lab; Andre Martins; others.] 10 | 11 | Download this script using FileZilla, and open it using Notepad++. 12 | 13 | The top of the file contains R functions that are used to generate the meta plot (e.g., mround, scale_params, collect.many.scaled, etc.). These look something like this: 14 | ``` 15 | function_name = function(parameters){ 16 | instructions... 
17 | } 18 | ``` 19 | 20 | For instance, the scale_params function: 21 | 22 | ``` 23 | scale_params = function(start, end, step=50, buffer=2000){ 24 | length = end-start 25 | if (length <1000){ 26 | buffer = buffer 27 | step1 = step 28 | scaleFact = 1 29 | } 30 | else{ 31 | length = ((end-start)-2*buffer)/2 32 | buffer = mround(length, base = 500) 33 | step1 = (buffer/500) 34 | scaleFact = step1/step # scale factor is divided by your scaled window count to adjust as if it were a 10bp window (i.e. window size = 5bp will have a scale factor of 2) 35 | } 36 | return(c(buffer,step1,scaleFact)) 37 | } 38 | ``` 39 | 40 | The only lines that you will need to change specify which files are used; these are found at the bottom of the file. Scroll down to the bottom. The changes that you need to make are: 41 | * Line 250: This points to the location of gene annotations. I have placed this file in ```/workdir/data/final_tus.txt```. 42 | * Line 258: This line sets the path to the bigWig files that are used in the analysis. In this analysis we are going to use spike-in normalized bigWigs. I have placed these here: ```/workdir/data/bigWigs/norm```. 43 | * Lines 259-261: Change the bigWig file names, as appropriate. 44 | * Line 270: Normalized values for this new dataset may not match those the script was written for. Remove the ylim argument. Also change the file names. 45 | 46 | After making these changes, upload the script to your working directory. 
Run it using: 47 | ``` 48 | [dankoc@cbsumm22 dankoc]$ R --no-save < ScaledMetaPlotFunctions.r 49 | ``` 50 | 51 | Note that two new PDF files were added to your directory: 52 | ``` 53 | [dankoc@cbsumm22 dankoc]$ ls -lha *.pdf 54 | -rw-rw-r-- 1 dankoc dankoc 110K Nov 29 12:12 MetaGene.LenNorm-0-50.pdf 55 | -rw-rw-r-- 1 dankoc dankoc 118K Nov 29 12:12 MetaGene.LenNorm.pdf 56 | ``` 57 | 58 | Download the resulting PDF files and look at them (mine is below): 59 | 60 | 61 | 62 | Metagene Focused on TSS 63 | ----------------------- 64 | 65 | Note that one of the major differences in this analysis was the paused peak. Next, focus on this and make a meta plot without normalizing to gene length. I have placed a script to complete this task here: ```/workdir/data/writeMetaPlots.ChROseq.R```. 66 | 67 | Again, please download this script and edit the following lines: 68 | * Line 3: Change the gene annotation file. 69 | * Line 51: Change the path to the bigWig files. 70 | 71 | Then upload the file and run it using: 72 | ``` 73 | [dankoc@cbsumm22 dankoc]$ R --no-save < writeMetaPlots.ChROseq.R 74 | ``` 75 | 76 | Again, download and look at the result. 77 | 78 | 79 | 80 | What do you make of this? We're about to discuss it! -------------------------------------------------------------------------------- /PRO-seq.md: -------------------------------------------------------------------------------- 1 | Analyzing PRO-seq data. 2 | ======================= 3 | 4 | Global Run-On and Sequencing (GRO-seq) and Precision Run-On and Sequencing (PRO-seq) are technologies for mapping 5 | the location and orientation of actively transcribing RNA polymerase I, II, and III (Pol) across the genome. Both 6 | technologies are particularly powerful because they provide a genome-wide readout of gene and lincRNA transcription, 7 | as well as the location and relative activities of active enhancers and promoters that regulate transcription. 8 | 9 | The Danko lab generally uses PRO-seq. 
This tutorial describes the basic analysis pipeline used in the Danko lab. 10 | 11 | Basic experimental design 12 | ------------------------- 13 | 14 | 15 | 16 | GRO-seq and PRO-seq were developed by Leighton Core and Hojoong Kwak while working in John Lis' lab at Cornell. A schematic of the PRO-seq protocol is shown in Fig. 1 (left). PRO-seq begins with nuclei isolated from a cell population of interest (step #1). RNA polymerases are halted by depleting rNTP monomers. Engaged RNA polymerases incorporate a single biotin-labeled rNTP (step #2) in a run-on reaction that is conducted in the presence of detergents which remove impediments to Pol II transcription and prevent new initiation. Incorporation of a biotin-labeled nucleotide stalls RNA polymerase after a single nucleotide is incorporated, resulting in a single-nucleotide-resolution map of the location and orientation of actively transcribing RNA polymerases across the genome. Nascent RNAs are purified using streptavidin beads, amplified by PCR, and deep sequenced using Illumina technology (step #3). 17 | 18 | 

19 | 20 | Read mapping 21 | ------------ 22 | New version of read mapping is HERE 23 | 24 | The Danko lab pipeline for aligning PRO-seq data can be found here: https://github.com/Danko-Lab/utils/tree/master/proseq. The script automates three routine pre-processing and alignment steps: pre-processing reads and trimming the sequencing adapter (cutadapt), mapping reads to a reference genome (BWA), and converting BAM files into bedGraph and bigWig formats (Kent source). After running this script, users should have processed data files in the specified output directory. 25 | 26 | Note that the process for GRO-seq data is very similar in most respects, but does not require switching the strand of reads after mapping. We will post a pipeline for aligning GRO-seq data to GitHub soon. 27 | 28 | To run our pipeline, users must first download the script files and install the dependencies indicated in the README.md. Two additional options are required to run proseqMapper.bsh: a path to a BWA index file (generated using the 'bwa index' command), and the path to the chromInfo file for the genome of choice. 29 | 30 | The proseqMapper.bsh script is run using: 31 | 32 | ``` bash proseqMapper.bsh --bwa-index=/bwa/index/file/bwa-index --chrom-info=/chrom/info/file/chrom.info``` 33 | 34 | For help with proseqMapper.bsh, or to see a complete list of options, type: 35 | 36 | ``` 37 | $ bash proseqMapper.bsh --help 38 | 39 | Preprocesses and aligns PRO-seq data. 40 | 41 | Takes *.fastq.gz in the current working directory as input and writes 42 | BAM and bedGraph files to the current working directory as output. 43 | 44 | bash proseqMapper.bsh [options] [files] 45 | 46 | options: 47 | -h, --help show brief help. 48 | 49 | -i, --bwa-index=PATH path to the BWA index of the target genome (i.e., bwa index). 50 | -c, --chrom-info=PATH location of the chromInfo table. 51 | -T, --tmp=PATH path to a temporary storage directory. 
52 | -o, --output-dir=DIR specify a directory to store output in. 53 | -I, --fastq=PREFIX Prefix for input files. 54 | 55 | -q, --qc writes out standard QC plots. 56 | -b6, --barcode6 collapses unique reads and trims a 6bp barcode. 57 | -G, --map5=TRUE|FALSE maps the 5' end of reads, for GRO-seq [default == FALSE] 58 | 59 | ``` 60 | 61 | The script requires two genome parameters: a BWA index (--bwa-index) and a chromosome size table (--chrom-info). For GRO-seq data, '-G' should be specified. 62 | 63 | Notes for **CBSUdanko** users: 64 | ``` 65 | (1) BWA index for hg19: 66 | CBSUdanko:/storage/data/short_read_index/hg19/bwa.rRNA-0.7.5a-r405/hg19.rRNA.bwt 67 | (2) Chromosome table for hg19: 68 | CBSUdanko:/storage/data/hg19/hg19.chromInfo 69 | (3) Add UCSC tools to your path (.bashrc file) 70 | export PATH=$PATH:/home/lac334/ucsc_tools 71 | ``` 72 | 73 | PRO-seq quality control (QC) 74 | ---------------------------- 75 | 76 | So you've got libraries back and you want to know whether they will work for you! There are two QC metrics that we find useful to determine the quality of PRO-seq libraries: 77 | 78 | * First, we calculate the size distribution of sequence fragments after trimming adapters. Degraded libraries will tend to have a shorter length distribution, and this will show up when viewing the histograms (below). 79 | * Second, we compute a metric that relates to library complexity, or the number of unique sequence fragments represented as the entire library is sequenced. This was initially developed by [Corcra](https://github.com/corcra/bed-metric). 80 | 81 | Both of these quality metrics are integrated into the PRO-seq alignment pipeline using the option -q or --qc. 82 | 83 | Alternatively, the QC script can be run separately. 
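The first of those metrics can be sketched in a few lines of shell. The FASTQ records below are made up; on real data you would pipe in your adapter-trimmed reads (e.g., with `zcat sample.fastq.gz` — the file name here is illustrative) instead of `printf`:

```shell
# In FASTQ, the sequence is every 4th line starting at line 2. Print each
# read's length, then tally identical lengths into a histogram.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTA\n+\nIIIII\n@r3\nACGTACGT\n+\nIIIIIIII\n' \
  | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c
# -> one read of length 5, two reads of length 8
```

As described above, a degraded library shifts this histogram toward shorter lengths.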
84 | 85 | Identifying regulatory elements using dREG 86 | ------------------------------------------ 87 | 88 | dREG takes two inputs: the bigWig files produced by mapping PRO-seq reads to the reference genome, and a pre-trained support vector regression (SVR) model. 89 | 90 | * PRO-seq files are required to be in the bigWig format standard created by the UCSC (more information can be found here: http://genome.ucsc.edu/goldenPath/help/bigWig.html). 91 | * We suggest using this script to prepare your PRO-seq, GRO-seq, or ChRO-seq data in bigWig format: 92 | https://github.com/Danko-Lab/tutorials/blob/master/PRO-seq.md#read-mapping 93 | 94 | * The SVR model contained in the dREG package (under dREG_model/asvm.RData) is a simple model. Users are advised to follow it with the dREG-HD package when possible. 95 | 96 | * A well-trained model can be downloaded from FTP; users are advised to do peak calling directly using this model: 97 | ftp://cbsuftp.tc.cornell.edu/danko/hub/dreg.models/asvm.gdm.6.6M.20170828.rdata. 98 | 99 | To do peak calling with dREG, type: 100 | 101 | bash run_peakcalling.bsh plus_strand.bw minus_strand.bw out_prefix asvm.RData [nthreads] [GPU] 102 | 103 | plus_strand.bw -- PRO-seq data (plus strand) formatted as a bigWig file. 104 | minus_strand.bw -- PRO-seq data (minus strand) formatted as a bigWig file. 105 | out_prefix -- The prefix of the output file. 106 | asvm.RData -- The path to the RData file containing the pre-trained SVM. 107 | [nthreads] -- [optional, default=1] The number of threads to use. 108 | [GPU] -- [optional, default=_blank_] GPU can be used with the aid of the Rgtsvm package. 109 | 110 | For more information see the dREG usage instructions, here: https://github.com/Danko-Lab/dREG/blob/master/README.md 111 | 112 | Data visualization 113 | ------------------ 114 | 115 | Coming soon. 116 | 117 | Working with data using the bigWig package for R 118 | ------------------------------------------------ 119 | 120 | Coming soon. 
121 | 122 | Transcription Unit Identification 123 | --------------------------------- 124 | 125 | Coming soon. 126 | 127 | Testing for changes between conditions 128 | -------------------------------------- 129 | 130 | Coming soon. 131 | 132 | Useful references 133 | ----------------- 134 | 135 | * GRO-seq: http://www.sciencemag.org/content/322/5909/1845.long 136 | * PRO-seq: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3974810/ 137 | * dREG: http://www.nature.com/nmeth/journal/v12/n5/full/nmeth.3329.html 138 | -------------------------------------------------------------------------------- /PausingIndex.md: -------------------------------------------------------------------------------- 1 | BioMG 7810 practical day 2: Pausing index plots. 2 | ================================================ 3 | 4 | Motivated by our observations examining RNA polymerase distribution using meta-gene plots, we next ask a specific question: Does the ratio of pause:body transcription change during prophase I? 5 | 6 | What is the 'pausing index'? 7 | ---------------------------- 8 | 9 | The pausing index is defined as the ratio of reads in the pause peak to reads in the gene body: [Pause]/[Gene body]. 10 | 11 | 12 | 13 | 

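To make the definition concrete, here is a toy calculation with made-up read counts. Each window's count is first converted to a density (reads per bp) so that the ratio is not biased by window length; the window sizes below are purely illustrative, not the ones used in the scripts for this practical:

```shell
# Toy pausing-index calculation with made-up counts and window sizes.
awk 'BEGIN {
  pause_reads = 50;  pause_len = 250      # reads in a window near the TSS
  body_reads = 200;  body_len = 10000     # reads across the rest of the gene
  pi = (pause_reads / pause_len) / (body_reads / body_len)
  printf "pausing index = %g\n", pi
}'
# pausing index = 10
```

Here the read density in the pause window is 10x that of the gene body, i.e., a clearly paused gene.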
14 | Note that the window definitions are arbitrary. Many different definitions have been used in the literature, and there are valid arguments for all of them. The most important aspect of this type of analysis is that your results are consistent regardless of the definition of pausing index. 15 | 16 | You can show from first principles that the pausing index is approximately inversely proportional to the rate at which Pol II is "released" from a paused state into productive elongation (the slower the release, the more Pol II accumulates in the pause, and the higher the index). I will post pictures of this if I have time before class. 17 | 18 | Let's compute the 'pausing index' 19 | --------------------------------- 20 | 21 | First, we need to get counts per kilobase in the pause and gene body windows. I have placed an R script that will help you do this here: ```/workdir/data/getCounts.R```. 22 | 23 | Download that script and use Notepad++ to edit the following lines: 24 | * Line 5: Point to the location of gene annotations (```/local/data/final_tus.txt```). 25 | * Line 23: Edit the path to point to the raw bigWig files (```/workdir/data/bigWigs/raw```). 26 | 27 | Use FileZilla to upload the modified file into your working directory, and run it: 28 | 29 | ``` 30 | [dankoc@cbsumm22 dankoc]$ R --no-save < getCounts.R 31 | ``` 32 | 33 | If that command completes successfully, it will write two RData files that can be read by R. These RData files contain the raw and normalized counts for each biological replicate and each condition. 34 | 35 | ``` 36 | [dankoc@cbsumm22 dankoc]$ ls -lha *.RData 37 | -rw-rw-r-- 1 lz539 danko 7.0M Sep 16 17:27 data-counts.RData 38 | -rw-rw-r-- 1 lz539 danko 8.8M Sep 16 17:30 data-rpkms.RData 39 | ``` 40 | 41 | The next step is to use these raw counts to plot the pausing index. 42 | 43 | To ease the transition for those not familiar with R, I have also provided a script that can write out violin plots (see here: ```/workdir/data/pauseIndexCDFs.R```). 
I would encourage those with R experience to poke around and look at the data in several different ways. To work with this script, download it to your machine and make the following changes: 44 | * Line 26: Specify the path to your own ```data-rpkms.RData``` file. 45 | 46 | If it completes successfully, that script will create a new PDF file ```PI-Vioin-Plot.pdf```. Here is the resulting graph: 47 | 48 | 49 | 50 | What do you interpret this to mean? 51 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # tutorials 2 | Tutorials covering various topics in genomic data analysis. 3 | 4 | PRO-seq: https://github.com/Danko-Lab/tutorials/blob/master/PRO-seq.md 5 | -------------------------------------------------------------------------------- /etc/ATACmeta.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/ATACmeta.png -------------------------------------------------------------------------------- /etc/BLAT1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/BLAT1.png -------------------------------------------------------------------------------- /etc/BLAT2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/BLAT2.png -------------------------------------------------------------------------------- /etc/BioHPCpwd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/BioHPCpwd.png 
-------------------------------------------------------------------------------- /etc/FastQC-ac.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/FastQC-ac.png -------------------------------------------------------------------------------- /etc/FastQC.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/FastQC.png -------------------------------------------------------------------------------- /etc/PI-by-stage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/PI-by-stage.png -------------------------------------------------------------------------------- /etc/PausingIndex.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/PausingIndex.png -------------------------------------------------------------------------------- /etc/chroseq.lnmeta.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/chroseq.lnmeta.png -------------------------------------------------------------------------------- /etc/chroseq.tssmeta.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/chroseq.tssmeta.png -------------------------------------------------------------------------------- /etc/proseq.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Danko-Lab/tutorials/5748f0c374aacc3be74d5bdfba927936ffd9f3bc/etc/proseq.png --------------------------------------------------------------------------------