├── README.md ├── lab_lessons ├── README.md ├── Lab8_mapping.md ├── Lab10_bacterial_genome_assembly.md ├── Lab3_hmmer.md ├── Lab4_fastq.md ├── Lab1_unix.md ├── Lab2_blast.md ├── Lab6_khmer.md ├── Lab5_trimming.md ├── Lab9_euk.genome.assembly.md └── Lab7_transcriptome_assembly.md └── student_code ├── transrate.txt ├── shortunwrapped.txt ├── bless_assembly_no_norm (6).txt ├── LK_bless_norm.txt ├── sga_assembly_no_norm.txt ├── bless_assembly_with_norm (5).txt ├── sga_assembly_with_norm.txt └── unwrapped.txt /README.md: -------------------------------------------------------------------------------- 1 | Gen711 2 | ====== 3 | -------------------------------------------------------------------------------- /lab_lessons/README.md: -------------------------------------------------------------------------------- 1 | This folder contains the (version 1) lab lessons for Gen711/811, released under a CC-BY license. 2 | 3 | This class was taught for the first time in Fall 2014 at the University of New Hampshire to a class of 25 (half undergrad, half grad) with NO programming experience. We spent 2 hours per week in the computer lab doing these labs. The course website is here: http://genomebio.org/Gen711/ 4 | 5 | Please feel free to fork/send me pull requests, or otherwise incorporate as you see fit. 
6 | -------------------------------------------------------------------------------- /student_code/transrate.txt: -------------------------------------------------------------------------------- 1 | #Transrate for epididymus 2 | transrate -a /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta \ 3 | -r /home/lauren/mus_protein_db/Mus_musculus.GRCm38.pep.all.fa \ 4 | -l /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq \ 5 | -i /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq \ 6 | -o /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL -t 24 7 | 8 | 9 | 10 | #The elements of this code are as follows 11 | -a ASSEMBLY (fasta file) 12 | -r REFERENCE (fasta file) 13 | -l LEFT READS (numbered R1) 14 | -i RIGHT READS (numbered R2) 15 | -o OUTPUT FILE (.FULL) -------------------------------------------------------------------------------- /student_code/shortunwrapped.txt: -------------------------------------------------------------------------------- 1 | #shortunwrapped.txt 2 | #I created this intermediate file to confirm that the awk command was working, which it is. unfortunately the sed command is not! 3 | #The functionality of this file is that it will tell you how many files are created by the awk lines, therefore you should exec 4 | #this shortunwrapped program before you can do your unwrapped program in order to determine how many temporary file lines (xa_) #that you must write into your unwrapped program 5 | 6 | #so the following sed command is useless: 7 | sed ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta > \ 8 | /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta 9 | 10 | #Filter based on score. 
This is the command that works and will tell you how many files you are creating: 11 | awk -F "," '.3>$17{next}1' /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv | \ 12 | awk -F "," '{print $1}' | sed '1,1d' | split -l 9000 -------------------------------------------------------------------------------- /student_code/bless_assembly_no_norm (6).txt: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -rRsf 2 | all: /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq.gz \ 3 | /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq.gz \ 4 | /mnt/data3/lah/mattsclass/no_norm/testes.bless_no_norm_trinity.fasta 5 | 6 | #############################BLESS############################################## 7 | 8 | /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq.gz:/mnt/data3/lah/mattsclass/testes.R1.fastq /mnt/data3/lah/mattsclass/testes.R2.f$ 9 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'` 10 | echo Results will be in a file named *corrected.fastq.gz 11 | echo Settings used: bless kmerlength = 25 12 | bless -kmerlength 25 -read1 /mnt/data3/lah/mattsclass/testes.R1.fastq -read2 /mnt/data3/lah/mattsclass/testes.R2.fastq -verify -notrim -prefix /mnt/data3/lah/mattsclass/no_norm/testes 13 | gzip /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq & 14 | 15 | 16 | #######################Trimmomatic/Trinity########################## 17 | 18 | /mnt/data3/lah/mattsclass/no_norm/testes.bless_no_norm_trinity.fasta:\ 19 | /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz 20 | Trinity --seqType fq --JM 50G --trimmomatic --left /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz --right /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz --CPU 12 --output testes.ble$ 21 | 
--quality_trimming_params "ILLUMINACLIP:/opt/trinity/trinity-plugins/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25" 22 | -------------------------------------------------------------------------------- /student_code/LK_bless_norm.txt: -------------------------------------------------------------------------------- 1 | 2 | #!/usr/bin/make -rRsf 3 | 4 | ########################################### 5 | ### -usage 'bless_assembly_with_norm.mk RUN=run CPU=8 MEM=15' 6 | ### -RUN= name of run 7 | ### 8 | ############################################ 9 | 10 | #$@ 11 | 12 | 13 | MEM=5 14 | CPU=5 15 | RUN=run 16 | #READ1=/home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq 17 | #READ2=/home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq 18 | 19 | 20 | 21 | all:/home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz \ 22 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz \ 23 | /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta 24 | 25 | #############################BLESS############################################## 26 | 27 | /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz:\ 28 | /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq 29 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'` 30 | echo Results will be in a file named *corrected.fastq.gz 31 | echo Settings used: bless kmerlength = 25 32 | bless -kmerlength 25 -read1 /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq \ 33 | -read2 /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq -verify -notrim -prefix epi 34 | gzip /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq 35 | 36 | 37 | ##########################khmer############################### 38 | 
39 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz:\ 40 | /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz 41 | echo BEGIN NORMALIZATION `date +'%a %d%b%Y %H:%M:%S'` 42 | echo Settings used: normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 43 | interleave-reads.py /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz \ /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz -o /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq 44 | normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 -out /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq \ /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq 45 | gzip /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq 46 | 47 | 48 | #######################Trimmomatic/Trinity########################## 49 | 50 | /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta: \ 51 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz 52 | Trinity --seqType fq --JM 50G --trimmomatic --single $< --CPU 12 --output /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta: \ 53 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25" 54 | -------------------------------------------------------------------------------- /student_code/sga_assembly_no_norm.txt: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -rRsf 2 | 3 | ### -usage 'sga_assembly_no_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/Documents/testes.R1.fastq.gz 4 | 5 | READ2=/home/lauren/Documents/testes.R2.fastq.gz' 6 | ### -RUN= name of run 7 | 8 | MEM=5 9 | CPU=5 10 | RUN=run 11 | READ1=/home/lauren/Documents/testes.R1.fastq.gz 12 | READ2=/home/lauren/Documents/testes.R2.fastq.gz 13 | 14 | SHELL=/bin/bash -o 
pipefail 15 | # SGA version 16 | SGA=sga-0.10.12 17 | DWGSIM=dwgsim 18 | REPORT=sga-preqc-report.py 19 | 20 | 21 | #change the data 22 | #This re-names my samples: 23 | #samp1 := p_eremicus 24 | 25 | 26 | #Below after the all command are the final output files from SGA and normalization and trinity: 27 | #Must be added in order of completion!!! 28 | all:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz \ 29 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt \ 30 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz \ 31 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity.fasta 32 | 33 | 34 | #################################SGA#################################### 35 | 36 | # Pre-process the dataset: recall that NEED gz file form for this step 37 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz:/home/lauren/Documents/testes.R1.fastq.gz \ /home/lauren/Documents/testes.R2.fastq.gz 38 | sga preprocess --pe-mode 1 /home/lauren/Documents/testes.R1.fastq.gz /home/lauren/Documents/testes.R2.fastq.gz > \ 39 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq 40 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq 41 | 42 | 43 | # Build the FM-index 44 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 45 | cd /home/lauren/Documents/p_eremicus/ && sga index -a ropebwt -t 8 --no-reverse $< 46 | 47 | 48 | # Make the preqc file for the short read set 49 | #%.preqc: %.bwt %.fastq.gz 50 | # $(SGA) preqc -t 8 $(patsubst %.bwt, %.fastq.gz, $<) > $@ 51 | 52 | # Final PDF report 53 | #main_report.pdf: p_eremicus.preqc 54 | # python $(REPORT) $+ 55 | # mv preqc_report.pdf $@ 56 | 57 | 58 | # SGA correction 59 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 60 | cd /home/lauren/Documents/p_eremicus/ && sga correct -k 41 
--discard --learn -t 8 -o \ /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 61 | 62 | 63 | 64 | 65 | #######################Trimmomatic/Trinity########################## 66 | 67 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity.fasta:/home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz 68 | Trinity --seqType fq --JM 50G --trimmomatic \ 69 | --single $< \ 70 | --CPU $(CPU) --output /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity \ 71 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 \ LEADING:2 TRAILING:2 MINLEN:25" -------------------------------------------------------------------------------- /student_code/bless_assembly_with_norm (5).txt: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -rRsf 2 | 3 | ########################################### 4 | ### -usage 'bless_assembly_with_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/transcriptome/Pero360T.1.fastq.gz READ2=/home/lauren/transcriptome/Pero360T.2.fastq.gz READ3=/location/of/read3.fastq READ4=/location/of/read4.fastq ' 5 | ### -RUN= name of run 6 | ### 7 | ############################################ 8 | 9 | #$@ 10 | 11 | ##files we need are p_eremics.READNAME.fastq_1&2 12 | ##### mus_musculus.READNAME.fastq_1&2 13 | ##### file directories /mnt/data3/lah/mattsclass/p_eremicus #where output gets put 14 | ##### /mnt/data3/lah/mattsclass/p_eremicus/raw #reads will be here 15 | ##### /mnt/data3/lah/mattsclass/mus_musculus #where output gets put 16 | ##### /mnt/data3/lah/mattsclass/mus_musculus/raw #reads will be here 17 | ##### Run in mattsclass folder?? 
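The `-usage` comment at the top of this file relies on make's command-line variable overrides (`RUN=`, `CPU=`, `MEM=`). A minimal, self-contained sketch of that pattern, using a hypothetical `demo.mk` that is not part of this repo:

```shell
# Hypothetical demo of the RUN=/CPU=/MEM= override pattern these makefiles use.
# Defaults live in the makefile; `make VAR=value` overrides them from the shell.
printf 'CPU=5\nMEM=5\nall:\n\t@echo CPU=$(CPU) MEM=$(MEM)\n' > demo.mk
make -f demo.mk               # uses the defaults: prints CPU=5 MEM=5
make -f demo.mk CPU=8 MEM=15  # overridden: prints CPU=8 MEM=15
```

Command-line assignments beat the `CPU=5`/`MEM=5` lines in the file, which is why the hard-coded defaults below are harmless when the usage string is followed.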
18 | 19 | 20 | MEM=5 21 | CPU=5 22 | RUN=run 23 | #READ1=/mnt/data3/lah/mattsclass/testes.R1.fastq 24 | #READ2=/mnt/data3/lah/mattsclass/testes.R2.fastq 25 | 26 | 27 | all:/mnt/data3/lah/mattsclass/testes.1.bless_corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.bless_corrected.fastq.gz /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz.1 /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz.2 /mnt/data3/lah/mattsclass/testes.bless_norm_trinity.fasta 28 | #all output files in order of correction 29 | ##ADD OUTPUT AS WE'RE GOING!!#### 30 | 31 | 32 | #############################BLESS############################################## 33 | 34 | 35 | 36 | #all:/mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz 37 | 38 | 39 | /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz:\ #output files# 40 | /mnt/data3/lah/mattsclass/testes.R1.fastq /mnt/data3/lah/mattsclass/testes.R2.fastq #input files (raw reads) 41 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'` 42 | echo Results will be in a file named *corrected.fastq.gz 43 | echo Settings used: bless kmerlength = 25 44 | bless -kmerlength 25 -read1 testes.R1.fastq -read2 testes.R2.fastq -verify -notrim -prefix /mnt/data3/lah/mattsclass/testes 45 | gzip /mnt/data3/lah/mattsclass/testes.1.corrected.fastq /mnt/data3/lah/mattsclass/testes.2.corrected.fastq & 46 | 47 | 48 | ##########################khmer############################### 49 | 50 | /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz:\ #output file 51 | /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz #input files from bless (interleave these) 52 | echo BEGIN NORMALIZATION `date +'%a %d%b%Y %H:%M:%S'` 53 | echo Settings used: normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 54 | interleave-reads.py /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz 
/mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz -o /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq 55 | normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 --out bless_corrected.inter.norm.fastq 56 | gzip bless_corrected.inter.norm.fasta & 57 | 58 | 59 | #######################Trimmomatic/Trinity########################## 60 | 61 | /mnt/data3/lah/mattsclass/testes.bless_norm_trinity.fasta: \ 62 | /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz 63 | Trinity --seqType fq --JM 50G --trimmomatic --single $< --CPU 12 --output /mnt/data3/lah/mattsclass/bless_norm_trinity.fasta \ 64 | --quality_trimming_params "ILLUMINACLIP:/opt/trinity/trinity-plugins/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25" 65 | -------------------------------------------------------------------------------- /student_code/sga_assembly_with_norm.txt: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -rRsf 2 | 3 | ### -usage 'sga_assembly_no_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/Documents/testes.R1.fastq.gz 4 | 5 | READ2=/home/lauren/Documents/testes.R2.fastq.gz' 6 | ### -RUN= name of run 7 | 8 | MEM=5 9 | CPU=5 10 | RUN=run 11 | READ1=/home/lauren/Documents/testes.R1.fastq.gz 12 | READ2=/home/lauren/Documents/testes.R2.fastq.gz 13 | 14 | SHELL=/bin/bash -o pipefail 15 | # SGA version 16 | SGA=sga-0.10.12 17 | DWGSIM=dwgsim 18 | REPORT=sga-preqc-report.py 19 | 20 | 21 | #change the data 22 | #This re-names my samples: 23 | #samp1 := p_eremicus 24 | 25 | 26 | #Below after the all command are the final output files from SGA and normalization and trinity: 27 | #Must be added in order of completion!!! 
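The "must be added in order of completion" comment above describes how these pipelines are wired together: `all` lists every final output, and each rule names its own inputs, so make runs the steps in dependency order. A toy, self-contained sketch of the pattern (hypothetical targets, not the real pipeline):

```shell
# Hypothetical two-step chain: `all` lists the final outputs, and each rule
# declares its prerequisite, so make builds step1.txt before step2.txt.
printf 'all: step1.txt step2.txt\nstep1.txt:\n\techo one > step1.txt\nstep2.txt: step1.txt\n\tcat step1.txt > step2.txt\n' > chain.mk
make -f chain.mk
```

Because each target is a real file, re-running make skips any step whose output already exists, which is the point of writing the pipeline this way.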
28 | all:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz \ 29 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt \ 30 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz \ 31 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz \ 32 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity.fasta 33 | 34 | 35 | #################################SGA#################################### 36 | 37 | # Pre-process the dataset: recall that NEED gz file form for this step 38 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz:/home/lauren/Documents/testes.R1.fastq.gz \ /home/lauren/Documents/testes.R2.fastq.gz 39 | sga preprocess --pe-mode 1 /home/lauren/Documents/testes.R1.fastq.gz /home/lauren/Documents/testes.R2.fastq.gz > \ 40 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq 41 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq 42 | 43 | 44 | # Build the FM-index 45 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 46 | cd /home/lauren/Documents/p_eremicus/ && sga index -a ropebwt -t 8 --no-reverse $< 47 | 48 | # Make the preqc file for the short read set 49 | #%.preqc: %.bwt %.fastq.gz 50 | # $(SGA) preqc -t 8 $(patsubst %.bwt, %.fastq.gz, $<) > $@ 51 | 52 | # Final PDF report 53 | #main_report.pdf: p_eremicus.preqc 54 | # python $(REPORT) $+ 55 | # mv preqc_report.pdf $@ 56 | 57 | 58 | # SGA correction 59 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 60 | cd /home/lauren/Documents/p_eremicus/ && sga correct -k 41 --discard --learn -t 8 -o \ /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 61 | 62 | 63 | ##########################khmer############################### 64 | 65 | 
/home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz:\ 66 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz 67 | cd /home/lauren/Documents/p_eremicus/ && normalize-by-median.py -k 25 -C 50 -N 4 -x 15e9 --out \ 68 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq \ 69 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz 70 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq 71 | 72 | 73 | #######################Trimmomatic/Trinity########################## 74 | 75 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity.fasta:\ 76 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz 77 | Trinity --seqType fq --JM 50G --trimmomatic \ 78 | --single $< \ 79 | --CPU $(CPU) --output /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity \ 80 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 \ LEADING:2 TRAILING:2 MINLEN:25" 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | -------------------------------------------------------------------------------- /lab_lessons/Lab8_mapping.md: -------------------------------------------------------------------------------- 1 | Lab 8: Read Mapping 2 | -- 3 | 4 | --- 5 | 6 | During this lab, we will acquaint ourselves with read mapping using BWA. You will: 7 | 8 | 1. Install software and download data 9 | 10 | 2. Use sra-toolkit to extract fastQ reads 11 | 12 | 3. Map reads to a reference assembly 13 | 14 | 4. Look at mapping quality 15 | 16 | - 17 | 18 | The BWA manual: http://bio-bwa.sourceforge.net/ 19 | 20 | Flag info: http://broadinstitute.github.io/picard/explain-flags.html 21 | 22 | --- 23 | 24 | > Step 1: Launch an AMI. For this exercise, we will use a c3.2xlarge (yet another instance type). 
Remember to change the permission of your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 25 | 26 | 27 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 28 | 29 | 30 | --- 31 | 32 | > Update Software 33 | 34 | 35 | sudo bash 36 | apt-get update 37 | 38 | 39 | --- 40 | 41 | > Install updates 42 | 43 | 44 | apt-get -y upgrade 45 | 46 | 47 | --- 48 | 49 | > Install other software 50 | 51 | 52 | apt-get -y install subversion tmux git curl samtools gcc make g++ python-dev unzip dh-autoreconf default-jre zlib1g-dev 53 | 54 | 55 | --- 56 | 57 | 58 | cd $HOME 59 | git clone https://github.com/lh3/bwa.git 60 | cd bwa 61 | make -j4 62 | PATH=$PATH:$(pwd) 63 | 64 | 65 | --- 66 | 67 | 68 | cd $HOME 69 | wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.4.2/sratoolkit.2.4.2-ubuntu64.tar.gz 70 | tar -zxf sratoolkit.2.4.2-ubuntu64.tar.gz 71 | PATH=$PATH:/home/ubuntu/sratoolkit.2.4.2-ubuntu64/bin 72 | 73 | 74 | 75 | > Download data 76 | 77 | 78 | mkdir /mnt/data 79 | cd /mnt/data 80 | wget http://datadryad.org/bitstream/handle/10255/dryad.72141/brain.final.fasta 81 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR157/SRR1575395/SRR1575395.sra 82 | 83 | 84 | > Convert SRA format into fastQ (takes a few minutes) 85 | 86 | 87 | cd /mnt/data 88 | fastq-dump --split-files --split-spot SRR1575395.sra 89 | 90 | 91 | > Map reads!! (20 minutes) 92 | 93 | 94 | mkdir /mnt/mapping 95 | cd /mnt/mapping 96 | tmux new -s mapping 97 | bwa index -p index /mnt/data/brain.final.fasta 98 | bwa mem -t8 index /mnt/data/SRR1575395_1.fastq /mnt/data/SRR1575395_2.fastq > brain.sam 99 | 100 | 101 | > Look at the SAM file. 102 | 103 | 104 | 105 | #Take a quick general look. 106 | 107 | head brain.sam 108 | tail brain.sam 109 | 110 | #Count how many reads are in the fastq files. `grep -c` counts the number of occurrences of the pattern, which in this case is `^@`. 
I am looking for lines that begin with (specified by `^`) the @ character. 111 | 112 | grep -c ^@ ../data/SRR1575395_1.fastq ../data/SRR1575395_2.fastq 113 | 114 | #count the number of reads mapping with Flag 65/67. The first part of this command, `awk`, pulls out the second column of the file, and counts everything that has either 65 or 67. What do these flags correspond to? 115 | 116 | awk '{print $2}' brain.sam | grep ^6 | grep -c '65\|67' 117 | 118 | #why do we need the `grep ^6` thing in there... try `awk '{print $2}' brain.sam | grep '65\|67' | wc -l` 119 | 120 | #what about this?? 121 | 122 | awk '{print $2}' brain.sam | grep '^65\|^67' | wc -l 123 | 124 | 125 | > Can you pull out the number of mismatches by targeting the NM tag in column 12? 126 | 127 | 128 | 129 | #I'm giving you the last bit of the awk code. You have to figure out the first awk command and the first grep command. This will send the number of mismatches to a file `mismatches.txt`. Can you download it to your USB drive or hard drive and plot the results, find the mean number of mismatches, etc?? 130 | 131 | awk | grep | awk -F ":" '{print $3}' > mismatches.txt 132 | 133 | -------------------------------------------------------------------------------- /lab_lessons/Lab10_bacterial_genome_assembly.md: -------------------------------------------------------------------------------- 1 | Lab 10: Bacterial Genome Assembly 2 | -- 3 | 4 | --- 5 | 6 | During this lab, we will acquaint ourselves with genome assembly using SPAdes. We will assemble the genome of E. coli. The data are taken from here: https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly/blob/master/Sources.md. 7 | 8 | 1. Install software and download data 9 | 10 | 2. Error correct, quality and adapter trim data sets. 11 | 12 | 3. 
Assemble 13 | 14 | - 15 | 16 | The SPAdes manuscript: http://www.ncbi.nlm.nih.gov/pubmed/22506599 17 | The SPAdes manual: http://spades.bioinf.spbau.ru/release3.1.1/manual.html 18 | SPAdes website: http://bioinf.spbau.ru/spades 19 | ABySS webpage: https://github.com/bcgsc/abyss 20 | 21 | - 22 | 23 | > Step 1: Launch and AMI. For this exercise, we will use a c3.2xlarge (note different instance type). Remember to change the permission of your key code `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 24 | 25 | 26 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 27 | 28 | 29 | --- 30 | 31 | > Update Software 32 | 33 | 34 | sudo bash 35 | apt-get update 36 | 37 | 38 | --- 39 | 40 | > Install updates 41 | 42 | 43 | apt-get -y upgrade 44 | 45 | 46 | --- 47 | 48 | > Install other software 49 | 50 | 51 | apt-get -y install subversion tmux git curl libncurses5-dev gcc make g++ python-dev unzip dh-autoreconf zlib1g-dev libboost1.55-dev sparsehash openmpi* 52 | 53 | 54 | 55 | --- 56 | 57 | > Install SPAdes 58 | 59 | 60 | cd $HOME 61 | wget http://spades.bioinf.spbau.ru/release3.1.1/SPAdes-3.1.1-Linux.tar.gz 62 | tar -zxf SPAdes-3.1.1-Linux.tar.gz 63 | cd SPAdes-3.1.1-Linux 64 | PATH=$PATH:$(pwd)/bin 65 | 66 | 67 | --- 68 | 69 | > Install ABySS 70 | 71 | 72 | cd $HOME 73 | git clone https://github.com/bcgsc/abyss.git 74 | cd abyss 75 | ./autogen.sh 76 | ./configure --enable-maxk=128 --prefix=/usr/local/ --with-mpi=/usr/lib/openmpi/ 77 | make -j4 78 | make all install 79 | 80 | 81 | - 82 | 83 | > Install a script for assembly evaluation. 84 | 85 | 86 | git clone https://github.com/lexnederbragt/sequencetools.git 87 | cd sequencetools/ 88 | PATH=$PATH:$(pwd) 89 | 90 | 91 | > Download and unpack the data 92 | 93 | 94 | cd /mnt 95 | wget https://s3.amazonaws.com/gen711/ecoli_data.tar.gz 96 | tar -zxf ecoli_data.tar.gz 97 | 98 | 99 | > Assembly. 
Try this with different data combos (with mate pair data, without, with MinION data and without, etc). Remember to name your assemblies something different using the `-o` flag. SPAdes has a built-in error correction tool (remove `--only-assembler`). Does double error correction seem to make a difference? 100 | 101 | 102 | mkdir /mnt/spades 103 | cd /mnt/spades 104 | 105 | spades.py -t 8 -m 15 --only-assembler --mp1-rf -k 127 \ 106 | --pe1-1 /mnt/ecoli_pe.1.fq \ 107 | --pe1-2 /mnt/ecoli_pe.2.fq \ 108 | --mp1-1 /mnt/nextera.1.fq \ 109 | --mp1-2 /mnt/nextera.2.fq \ 110 | --pacbio /mnt/minion.data.fasta \ 111 | -o Ecoli_all_data 112 | 113 | 114 | --- 115 | 116 | > Evaluate Assemblies 117 | 118 | 119 | abyss-fac Ecoli_all_data/scaffolds.fasta 120 | 121 | #take a closer look. 122 | 123 | assemblathon_stats.pl Ecoli_all_data/scaffolds.fasta 124 | 125 | 126 | 127 | 128 | > Assembling with ABySS (optional) 129 | 130 | 131 | mkdir /mnt/abyss 132 | cd /mnt/abyss 133 | 134 | abyss-pe np=8 k=127 name=ecoli lib='pe1' mp='mp1' long='minion' \ 135 | pe1='/mnt/ecoli_pe.1.fq /mnt/ecoli_pe.2.fq' \ 136 | mp1='/mnt/nextera.1.fq /mnt/nextera.2.fq' \ 137 | minion='/mnt/minion.data.fasta' mp1_l=30 138 | 139 | 140 | -------------------------------------------------------------------------------- /lab_lessons/Lab3_hmmer.md: -------------------------------------------------------------------------------- 1 | Lab 3: HMMER 2 | 3 | During this lab, we will acquaint ourselves with the software package HMMER. Your objectives are: 4 | 5 | - 6 | 7 | 1. Familiarize yourself with the software, how to execute it, how to visualize results. 8 | 9 | 2. Characterize a few conserved domains in your dataset. 10 | 11 | The HMMER manual: ftp://selab.janelia.org/pub/software/hmmer3/3.1b1/Userguide.pdf 12 | 13 | The HMMER webpage: http://hmmer.janelia.org/ 14 | 15 | --- 16 | 17 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance. 
Remember to change the permission of your key code `chmod 400 ~/Downloads/your.pem` (change your.pem to whatever you named it) 18 | 19 | 20 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 21 | 22 | 23 | --- 24 | 25 | 26 | sudo bash 27 | apt-get update 28 | apt-get -y upgrade 29 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre 30 | 31 | 32 | - 33 | 34 | > Ok, for this lab we are going to use HMMER 35 | 36 | - 37 | 38 | 39 | cd $HOME 40 | wget http://selab.janelia.org/software/hmmer3/3.1b1/hmmer-3.1b1-linux-intel-x86_64.tar.gz 41 | tar -zxf hmmer-3.1b1-linux-intel-x86_64.tar.gz 42 | cd hmmer-3.1b1-linux-intel-x86_64/ 43 | ./configure 44 | make && make all install 45 | make check 46 | 47 | 48 | --- 49 | 50 | - 51 | 52 | > You will download one of the 5 different datasets (use the same dataset). Do you remember how to use the `wget` and `gzip` commands from last week? Also, download Swissprot and Pfam-A 53 | 54 | - 55 | 56 | 57 | cd /mnt 58 | 59 | #download your dataset 60 | 61 | dataset1= https://www.dropbox.com/s/srfk4o2bh1qmq6l/dataset1.fa.gz 62 | dataset2= https://www.dropbox.com/s/977n0ibznzuor22/dataset2.fa.gz 63 | dataset3= https://www.dropbox.com/s/8s2h7sm6xtoky6q/dataset3.fa.gz 64 | dataset4= https://www.dropbox.com/s/qth3mjrianb48a6/dataset4.fa.gz 65 | dataset5= https://www.dropbox.com/s/quexoxfh6ttmudo/dataset5.fa.gz 66 | 67 | #download the SwissProt database 68 | 69 | wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz 70 | 71 | #download the Pfam-A database 72 | 73 | wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz 74 | 75 | 76 | > we are going to run HMMER to identify conserved protein domains. This will take a little while, and we'll use `tmux` to allow us to do this in the background, and continue to work on other things. 
77 | 78 | 79 | gzip -d *gz 80 | tmux new -s pfam 81 | hmmpress Pfam-A.hmm #this is analogous to 'makeblastdb' 82 | hmmscan -E 1e-3 --domtblout dataset.pfam --cpu 4 Pfam-A.hmm dataset1.fa 83 | ctl-b d 84 | top -c #see that hmmscan is running.. 85 | 86 | 87 | > The neat thing about HMMER is that it can be used as a replacement for blastP or PSI-blast. 88 | 89 | 90 | #blastp-like search. HBB_HUMAN is a hemoglobin beta protein sequence. 91 | 92 | phmmer --domtblout hbb.phmmer -E 1e-5 \ 93 | /home/ubuntu/hmmer-3.1b1-linux-intel-x86_64/tutorial/HBB_HUMAN \ 94 | uniprot_sprot.fasta 95 | 96 | #PSI-blast-like 97 | 98 | jackhmmer --domtblout hbb.jackhmmer -E 1e-5 \ 99 | /home/ubuntu/hmmer-3.1b1-linux-intel-x86_64/tutorial/HBB_HUMAN \ 100 | uniprot_sprot.fasta 101 | 102 | #you can look at the results using `more hbb.phmmer` or `more hbb.jackhmmer`. Try blasting a few of the results using the BLAST web interface. 103 | 104 | 105 | > Now let's look at the Pfam results. This analysis may still be running, but we can look at it while it's still in progress. 106 | 107 | 108 | more dataset.pfam 109 | #There are a bunch of columns in this table - what do they mean? 110 | 111 | #Try to extract all the hits to a specific domain. Google a few domains (column 1) to see if any seem interesting. 112 | 113 | #for instance, find all occurrences of ABC_tran 114 | grep ABC_tran dataset.pfam 115 | 116 | #use grep to count the number of matches. Copy this number down. 117 | 118 | grep -c ABC_tran dataset.pfam 119 | 120 | #Find all the contigs that have an ABC_tran domain. 121 | 122 | grep ABC_tran dataset.pfam | awk '{print $4}' | sort | uniq 123 | 124 | 125 | > Just for fun, check on the Pfam search to see what it is doing... 
126 | 127 | 128 | tmux attach -t pfam 129 | ctl-b d 130 | 131 | -------------------------------------------------------------------------------- /lab_lessons/Lab4_fastq.md: -------------------------------------------------------------------------------- 1 | Lab 4: Processing fastQ and fastA 2 | -- 3 | 4 | During this lab, we will acquaint ourselves with the software packages FastQC and JellyFish. Your objectives are: 5 | 6 | - 7 | 8 | 1. Familiarize yourself with the software, how to execute it, how to visualize results. 9 | 10 | 2. Characterize the sequence quality of your dataset. 11 | 12 | The FastQC manual: http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc 13 | 14 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf 15 | 16 | --- 17 | 18 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance. Remember to change the permission of your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 19 | 20 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 21 | 22 | 23 | --- 24 | 25 | > Update Software 26 | 27 | sudo bash 28 | apt-get update 29 | 30 | 31 | --- 32 | 33 | > Install updates 34 | 35 | apt-get -y upgrade 36 | 37 | 38 | --- 39 | 40 | > Install other software 41 | 42 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre 43 | 44 | 45 | - 46 | 47 | > Ok, for this lab we are going to use FastQC. There is a version available on apt-get, but it is an old version, and we want to make sure that we have the most recent version. Make sure you know what each of these commands does, rather than blindly copying and pasting. 48 | 49 | - 50 | 51 | cd $HOME 52 | wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip 53 | unzip fastqc_v0.11.2.zip 54 | cd FastQC/ 55 | chmod +x fastqc 56 | PATH=$PATH:$(pwd) 57 | 58 | 59 | --- 60 | 61 | > Download the data and uncompress it. 
What does the `-cd` flag mean with respect to gzip?
62 |
63 | -
64 |
65 | cd /mnt
66 | wget https://s3.amazonaws.com/gen711/Pero360B.1.fastq.gz
67 | wget https://s3.amazonaws.com/gen711/Pero360B.2.fastq.gz
68 | gzip -cd /mnt/Pero360B.1.fastq.gz > /mnt/Pero360B.1.fastq &
69 | gzip -cd /mnt/Pero360B.2.fastq.gz > /mnt/Pero360B.2.fastq &
70 |
71 |
72 | ---
73 | > Install Fastool, a neat and fast tool for fastQ --> fastA conversion
74 |
75 | -
76 |
77 | cd $HOME
78 | git clone https://github.com/fstrozzi/Fastool.git
79 | cd Fastool/
80 | make
81 | PATH=$PATH:$(pwd)
82 |
83 |
84 | ---
85 | > Use Fastool to convert from fastQ to fastA
86 |
87 | -
88 |
89 | cd /mnt
90 | fastool --to-fasta Pero360B.1.fastq > Pero360B.1.fasta &
91 | fastool --to-fasta Pero360B.2.fastq > Pero360B.2.fasta &
92 |
93 |
94 | ---
95 | > While Fastool is working, let's install JellyFish. Again, make sure you know what each of these commands does, rather than just copying and pasting.
96 |
97 | -
98 |
99 | cd $HOME
100 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
101 | tar -zxf jellyfish-2.1.3.tar.gz
102 | cd jellyfish-2.1.3/
103 | ./configure
104 | make
105 | PATH=$PATH:$(pwd)/bin
106 |
107 |
108 | ---
109 | > Run FastQC. Make sure to look at the manual to see what the different outputs mean.
110 |
111 | cd /mnt
112 | fastqc -t 4 Pero360B.1.fastq Pero360B.2.fastq
113 |
114 |
115 | ---
116 | > Run Jellyfish. Make sure to look at the manual.
117 |
118 | cd /mnt
119 | mkdir jelly
120 | cd jelly
121 | jellyfish count -F2 -m 25 -s 200M -t 4 -C ../Pero360B.1.fasta ../Pero360B.2.fasta
122 | jellyfish histo mer_counts.jf > Pero360B.histo
123 | head -50 Pero360B.histo
124 |
125 |
126 | ---
127 | > Open up a new terminal window using command-t
128 |
129 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/*zip ~/Downloads/
130 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
131 |
132 |
133 | > Now, on your Mac, find the files you just downloaded - double click the zip files and that should unzip them. Click on the `html` file, which will open up your browser. Look at the results. Try to figure out what each plot means.
134 |
135 | -
136 |
137 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
138 |
139 | -
140 |
141 | > OPEN RSTUDIO
142 |
143 |
144 | #Import Data
145 | histo <- read.table("~/Downloads/Pero360B.histo", quote="\"")
146 | head(histo)
147 |
148 | #Plot
149 | plot(histo$V2 ~ histo$V1, type='h')
150 |
151 | #That one sucks, but what does it tell you about the kmer distribution?
152 |
153 | #Maybe this one is better?
154 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,100))
155 |
156 | #Better. What is xlim? Maybe we can still improve?
157 |
158 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,500), ylim=c(0,1000000))
159 |
160 | #Final plot
161 |
162 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,500), ylim=c(0,1000000),
163 | col='blue', frame.plot=F, xlab='25-mer frequency', ylab='Count',
164 | main='Kmer distribution in brain sample before quality trimming')
165 |
166 |
167 |
168 | > Done?
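> An aside before wrapping up: Fastool is fast, but the fastQ to fastA conversion itself is simple enough that plain awk can do it. A minimal sketch, assuming standard 4-line fastQ records; the `demo.fastq` file below is made up for illustration, not the Pero360B data:

```shell
# fastQ records are 4 lines each: @header, sequence, +, qualities.
# Make a tiny fake fastQ file to demonstrate on:
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' > demo.fastq

# Keep line 1 of every record (header, with @ swapped for >) and
# line 2 (the sequence); drop the + and quality lines:
awk 'NR % 4 == 1 {sub(/^@/, ">"); print} NR % 4 == 2 {print}' demo.fastq > demo.fasta

cat demo.fasta
#>read1
#ACGT
#>read2
#TTGG
```

The same one-liner would apply to the real files (e.g. `awk ... Pero360B.1.fastq > Pero360B.1.fasta`), though Fastool will be faster on a 20GB file.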
--------------------------------------------------------------------------------
/student_code/unwrapped.txt:
--------------------------------------------------------------------------------
1 | #Make trinity assembly file 'unwrapped'
2 |
3 | #Purpose of this is to filter contigs out of the epididymus assembly that have a contig score <0.3
4 | #I need to do this because my epididymus transcriptome assembly had a low transrate score of 0.192
5 | #and we hope that by filtering out these low quality contigs, this may improve the transrate score
6 | #assembly: /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta
7 | #csv contigs: /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv
8 |
9 |
10 | #This sed command does not work; the input is the trinity.fasta file and the output is unwrapped_epi.fasta
11 |
12 | sed ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta > \
13 | /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta
14 |
15 | #Filter based on score. This command does work:
16 |
17 | awk -F "," '.3>$17{next}1' /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv | \
18 | awk -F "," '{print $1}' | sed '1,1d' | split -l 9000
19 |
20 | #This will give you a bunch of files xaa, xab, xac, etc.
Each of them contains
21 | #the names of the 'good contigs'. Now we need to retrieve them from the original fasta file.
22 | #The number of temporary files below matches the number of xa_ files generated, as determined by
23 | #running the shortunwrapped program first:
24 |
25 | for i in $(cat xaa); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp1.fa; done &
26 | for i in $(cat xab); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp2.fa; done &
27 | for i in $(cat xac); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp3.fa; done &
28 | for i in $(cat xad); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp4.fa; done &
29 | for i in $(cat xae); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp5.fa; done &
30 | for i in $(cat xaf); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp6.fa; done &
31 | for i in $(cat xag); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp7.fa; done &
32 | for i in $(cat xah); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp8.fa; done &
33 | for i in $(cat xai); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp9.fa; done &
34 | for i in $(cat xaj); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp10.fa; done &
35 | for i in $(cat xak); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp11.fa; done &
36 | for i in $(cat xal); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp12.fa; done &
37 | for i in $(cat xam); do
grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp13.fa; done & 38 | for i in $(cat xan); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp14.fa; done & 39 | for i in $(cat xao); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp15.fa; done & 40 | for i in $(cat xap); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp16.fa; done & 41 | for i in $(cat xaq); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp17.fa; done & 42 | for i in $(cat xar); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp18.fa; done & 43 | for i in $(cat xas); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp19.fa; done & 44 | for i in $(cat xat); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp20.fa; done & 45 | for i in $(cat xau); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp21.fa; done & 46 | for i in $(cat xav); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp22.fa; done & 47 | for i in $(cat xaw); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp23.fa; done & 48 | for i in $(cat xax); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp24.fa; done & 49 | for i in $(cat xay); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp25.fa; done & 50 | for i in $(cat xaz); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp26.fa; done & 51 | for i in $(cat xba); do grep -A1 --max-count=1 -w $i 
/home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp27.fa; done &
52 |
53 | #One command for each of the xa* files. I do it like this to save
54 | #time. Each xa* file is being processed on a different core.
55 |
56 | #Lastly, concatenate all the temporary files together and get rid of the now unneeded files:
57 |
58 | #The > sign indicates that all temp files are concatenated into a NEW file: CAT_unwrapped_epi.fasta
59 | #However, the temp* is not liked by the command line!
60 | cat temp* > /home/lauren/Documents/NYGenomeCenter/CAT_unwrapped_epi.fasta
61 | rm temp* x*
62 |
63 |
--------------------------------------------------------------------------------
/lab_lessons/Lab1_unix.md:
--------------------------------------------------------------------------------
1 | Lab 1
2 | --
3 |
4 | During this lab, we will acquaint ourselves with the Unix terminal, learn how to access data, install software, and find things. *It is absolutely critical that you master these skills*, so please ask questions if confused.
5 |
6 | > Step 1: Launch an AMI. For this exercise, a t1.micro will be sufficient.
7 |
8 |
9 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
10 |
11 |
12 |
13 | > The machine you are using is Linux Ubuntu: Ubuntu is an operating system you can use (I do) on your laptop or desktop. One of the nice things about this OS is the ability to update the software easily. The command `sudo apt-get update` checks a server for updates to existing software.
14 |
15 |
16 | sudo apt-get update
17 |
18 |
19 | >The upgrade command actually installs any of the required updates.
20 |
21 | sudo apt-get upgrade
22 |
23 | >OK, what are these commands? `sudo` is the command that tells the computer that we have admin privileges. Try running the commands without the sudo -- it will complain that you don't have admin privileges or something like that.
*Careful here, using sudo means that you can do something really bad to your own computer -- like delete everything*, so use with caution. It's not a big worry when using AWS, as this is a virtual machine -- fixing your worst mistake is as easy as just terminating the instance and restarting.
24 |
25 | -
26 |
27 | > So now that we have updated the software, let's see how to add new software. Same basic command, but instead of the `update` or `upgrade` command, we're using `install`. EASY!!
28 |
29 | -
30 | sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip \
31 | default-jre
32 |
33 | -
34 |
35 | >After you run this command, try to install something else: R (a stats package - more on this wonderful software later). The package is named `r-base-core`. See if you can install it!! Installing software on Linux is easy (so long as there is a downloadable package - more on what to do when no such package exists later in lab)
36 |
37 | -
38 |
39 | >BTW, did you notice the `\` at the end of line 1 in the above code snippet?? That is a special character we use to break up a single line of code over 2 or more lines. You'll see me use this a lot!
40 |
41 | -
42 |
43 | >OK, let's try our hands at navigating around on the command line - it is not scary!
44 |
45 | Important UNIX rules
46 | --
47 |
48 | * Everything is case sensitive. Gen711 is not the same as gen711
49 | * Spaces in file names should be avoided
50 | * The unix $PATH is the collection of locations where the computer looks for executables (programs)
51 | * Folders and Files are all you have. If you want to access one of these, you need to tell the computer *EXACTLY* where it is. `/home/macmanes/gen711/exam1_key.txt` will work (assuming you've spelled things correctly, and that the file really exists in that location), but `exam1_key.txt` may not.
52 |
53 | * Lines that begin with a `#` are comments.
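>The $PATH and "tell the computer *EXACTLY* where it is" rules can be seen in action with a couple of commands. A minimal sketch (the folder and file names here are made up for the demo):

```shell
# The shell searches the folders listed in $PATH to find programs;
# 'command -v' reports the absolute path of the one it found:
command -v ls

# Absolute vs. relative paths: a relative name only works from the
# right directory, while an absolute path works from anywhere.
mkdir -p /tmp/gen711_demo
printf 'found me\n' > /tmp/gen711_demo/exam1_key.txt

cd /tmp/gen711_demo
cat exam1_key.txt                    # relative: works, we are in the folder
cd /tmp
cat /tmp/gen711_demo/exam1_key.txt   # absolute: works from anywhere

rm -r /tmp/gen711_demo
```

Try running `cat exam1_key.txt` from your home directory instead -- the "No such file or directory" error is the computer telling you it looked in the wrong place, not that the file is gone.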
54 |
55 | Basic shell commands
56 | --
57 |
58 | >the `pwd` command returns your current location.
59 |
60 | pwd
61 |
62 | -
63 |
64 | >the `ls` command lists the files and folders present in your current directory. Try `ls -lt` and `ls -lth`. *What is the difference between these commands?* Try typing `man ls` to learn about all the different flags.
65 |
66 | ls -l
67 |
68 | -
69 |
70 | >create a file
71 |
72 | nano hello.txt
73 | #The nano text editor will appear -> type something
74 | This is my 1st unix file
75 | CTL-x
76 | y
77 | #typing n would get rid of the text you just wrote.
78 |
79 | -
80 |
81 | >look at the file; there are several ways to do this
82 |
83 | head -5 hello.txt #this shows you the 1st 5 lines of the file
84 | more hello.txt #this shows you the whole file, 1 screen at a time. Space bar to advance, q to quit
85 |
86 | -
87 |
88 | >make a copy of the file, using a different name, then remove it.
89 |
90 | cp hello.txt bye.txt
91 | ls -lth
92 | rm bye.txt
93 | ls -lth
94 |
95 | -
96 |
97 | >move the file (or rename it). What is the difference between `mv` and `cp`???
98 |
99 | mv hello.txt bye.txt
100 | ls -lth
101 |
102 | -
103 |
104 | >make a folder (directory), make a file inside a folder.
105 |
106 | mkdir testfolder
107 | ls -lth
108 | #make a folder inside that folder
109 | mkdir testfolder/inside_test
110 | #make a file
111 | nano testfolder/inside_test/inside.txt
112 | head testfolder/inside_test/inside.txt
113 | rm testfolder/inside_test/inside.txt
114 |
115 | >there are a few other commands that you should be familiar with: `sort`, `cat`, `clear`, `tail`, `history`. Try googling and using `man` to figure them out.
116 |
117 | Downloading Data and Stuff
118 | --
119 |
120 | >download something from the web. You're using the `wget` command. You're downloading the SwissProt database.
See http://www.ebi.ac.uk/uniprot
121 |
122 | mkdir swissprot
123 | cd swissprot
124 | wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz
125 |
126 | -
127 |
128 | >It will take a few minutes to download. After it's downloaded, you'll need to extract it. Files ending in `.gz` are compressed, just like `.zip`, which is a type of file compression you may be more familiar with.
129 |
130 | gzip -d uniprot_sprot.fasta.gz
131 |
132 | -
133 |
134 | >Can you tell me what type of file this is? Use the commands we used above to look at the 1st few lines.
135 |
136 | ???
137 |
138 | >There is some info complementary to this material here: http://swcarpentry.github.io/2014-08-21-upenn/novice/ref/01-shell.html
--------------------------------------------------------------------------------
/lab_lessons/Lab2_blast.md:
--------------------------------------------------------------------------------
1 | Lab 2: BLAST
2 | --
3 |
4 | During this lab, we will acquaint ourselves with the software package BLAST. Your objectives are:
5 |
6 | -
7 |
8 | 1. Familiarize yourself with the software, how to execute it, how to visualize results.
9 |
10 | 2. Regarding your dataset, tell me how some of these genes are related to their homologous copies.
11 |
12 | -
13 |
14 | ---
15 |
16 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance.
17 |
18 |
19 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
20 |
21 | -
22 |
23 | > The machine you are using is Linux Ubuntu: Ubuntu is an operating system you can use (I do) on your laptop or desktop. One of the nice things about this OS is the ability to update the software easily. The command `sudo apt-get update` checks a server for updates to existing software.
24 |
25 | -
26 |
27 |
28 | sudo apt-get update
29 |
30 | -
31 |
32 | > The upgrade command actually installs any of the required updates.
33 |
34 |
35 | sudo apt-get -y upgrade
36 |
37 |
38 | > OK, what are these commands? `sudo` is the command that tells the computer that we have admin privileges. Try running the commands without the sudo -- it will complain that you don't have admin privileges or something like that. *Careful here, using sudo means that you can do something really bad to your own computer -- like delete everything*, so use with caution. It's not a big worry when using AWS, as this is a virtual machine -- fixing your worst mistake is as easy as just terminating the instance and restarting.
39 |
40 | -
41 |
42 | > So now that we have updated the software, let's see how to add new software. Same basic command, but instead of the `update` or `upgrade` command, we're using `install`. EASY!!
43 |
44 | -
45 |
46 |
47 | sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip \
48 | default-jre
49 |
50 | -
51 |
52 | > ok, for this lab we are going to use BLAST, which is available as a package entitled `ncbi-blast+`
53 |
54 | -
55 |
56 |
57 | sudo apt-get -y install ???
58 |
59 | -
60 |
61 | > to get a feel for the different options, type `blastp -help`. Which type of blast does this correspond to? Look at the help info for blastp and tblastx
62 |
63 | -
64 |
65 | > Let's go root
66 |
67 |
68 | sudo bash
69 |
70 | ---
71 |
72 | Install mafft and RAxML
73 | --
74 |
75 | > Let's install mafft so that we can do an alignment (http://mafft.cbrc.jp/alignment/software/)
76 |
77 |
78 | cd $HOME
79 | wget http://mafft.cbrc.jp/alignment/software/mafft-7.164-without-extensions-src.tgz
80 | tar -zxf mafft-7.164-without-extensions-src.tgz
81 | cd mafft-7.164-without-extensions/core
82 | sudo make && sudo make install
83 | PATH=$PATH:/home/ubuntu/mafft-7.164-without-extensions/core
84 |
85 | -
86 |
87 | > Now let's install RAxML so that we can make a phylogeny.
()
88 |
89 |
90 | cd $HOME
91 | git clone https://github.com/stamatak/standard-RAxML.git
92 | cd standard-RAxML/
93 | make -f Makefile.PTHREADS.gcc
94 | PATH=$PATH:/home/ubuntu/standard-RAxML
95 |
96 | -
97 |
98 | > remember, for blasting we need both some data (a query) and a database. Let's start with the data 1st. You will have one of the 5 different datasets. Do you remember how to use the `wget` and `gzip` commands from last week?
99 |
100 | -
101 |
102 |
103 | cd /mnt
104 | dataset1=https://www.dropbox.com/s/srfk4o2bh1qmq6l/dataset1.fa.gz
105 | dataset2=https://www.dropbox.com/s/977n0ibznzuor22/dataset2.fa.gz
106 | dataset3=https://www.dropbox.com/s/8s2h7sm6xtoky6q/dataset3.fa.gz
107 | dataset4=https://www.dropbox.com/s/qth3mjrianb48a6/dataset4.fa.gz
108 | dataset5=https://www.dropbox.com/s/quexoxfh6ttmudo/dataset5.fa.gz
109 |
110 | -
111 |
112 | > Now let's download the database. For this exercise we will use Swissprot: `ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz`
113 |
114 | > unzip this file using `gzip -d`
115 |
116 | ---
117 |
118 | Make blast database and blast
119 | --
120 |
121 | -
122 |
123 | > make a blast database
124 |
125 |
126 | makeblastdb -in uniprot_sprot.fasta -out uniprot -dbtype prot
127 |
128 | > Now we are ready to blast.
129 |
130 |
131 | head -n20 dataset1.fa > test.fa
132 | blastp -evalue 1e-10 -num_threads 4 -db uniprot -query test.fa -outfmt 6
133 |
134 | > You will see the results in a table with 12 columns. Use `blastp -help` to see what the results mean.
135 |
136 | > Test out some of the blast options. Try changing the word size `-word_size`, scoring matrix, evalue, cost to open or extend a gap. See how these changes affect the results.
137 |
138 | > After you've done this, you should make a file containing the query and the hits.
139 |
140 |
141 | grep -A4 -w AAseq_1 dataset1.fa
142 |
143 | #use the dataset you have, and substitute your contig for AAseq_1
144 | #increase -A4 until the whole contig is displayed.
145 | #copy and paste it into nano.
146 | #do the same for the database matches.
147 |
148 | grep -A4 'sp|Q6GZX4|001R_FRG3G' uniprot_sprot.fasta
149 |
150 | ---
151 |
152 | mafft
153 | --
154 |
155 | > Align the proteins using mafft
156 |
157 |
158 | mafft --reorder --bl 80 --auto for.align > for.tree
159 |
160 | ---
161 |
162 | RAxML
163 | --
164 |
165 | > Make a phylogeny
166 |
167 |
168 | raxmlHPC-PTHREADS -help
169 | raxmlHPC-PTHREADS -f a -m PROTCATBLOSUM62 -T 4 -x 34 -N 100 -n tree -s for.tree -p 35
170 |
171 | > Copy phylogeny and view online.
172 |
173 |
174 | more RAxML_bipartitionsBranchLabels.tree
175 |
176 | #copy this info.
177 |
178 | > Visualize tree on website: http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
--------------------------------------------------------------------------------
/lab_lessons/Lab6_khmer.md:
--------------------------------------------------------------------------------
1 | Lab 6: khmer
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with digital normalization. You will:
7 |
8 | 1. Install software and download data
9 |
10 | 2. Quality and adapter trim data sets.
11 |
12 | 3. Apply digital normalization to the dataset.
13 |
14 | 4. Count and compare kmers and kmer distributions in the normalized and un-normalized dataset.
15 |
16 | 5. Plot in RStudio.
17 |
18 | -
19 |
20 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf
21 |
22 | The Khmer manual: http://khmer.readthedocs.org/en/v1.1/
23 |
24 | ---
25 |
26 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance.
Remember to change the permissions of your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
27 |
28 |
29 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
30 |
31 |
32 | ---
33 |
34 | > Update Software
35 |
36 |
37 | sudo bash
38 | apt-get update
39 |
40 |
41 | ---
42 |
43 | > Install updates
44 |
45 |
46 | apt-get -y upgrade
47 |
48 |
49 | ---
50 |
51 | > Install other software
52 |
53 |
54 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre python-pip zlib1g-dev
55 |
56 |
57 | ---
58 |
59 | > Install Trimmomatic
60 |
61 |
62 | cd $HOME
63 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip
64 | unzip Trimmomatic-0.32.zip
65 | cd Trimmomatic-0.32
66 | chmod +x trimmomatic-0.32.jar
67 |
68 |
69 | ---
70 |
71 | > Install Jellyfish
72 |
73 |
74 | cd $HOME
75 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
76 | tar -zxf jellyfish-2.1.3.tar.gz
77 | cd jellyfish-2.1.3/
78 | ./configure
79 | make -j4
80 | PATH=$PATH:$(pwd)/bin
81 |
82 |
83 | ---
84 |
85 | > Install Khmer
86 |
87 |
88 | cd $HOME
89 | pip install screed pysam
90 | git clone https://github.com/ged-lab/khmer.git
91 | cd khmer
92 | make -j4
93 | make install
94 | PATH=$PATH:$(pwd)/scripts
95 |
96 |
97 | ---
98 | > Download data. For this lab, we'll be using a smaller dataset that consists of 10 million paired-end reads.
99 |
100 | -
101 |
102 |
103 | cd /mnt
104 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_1.fastq.gz
105 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_2.fastq.gz
106 |
107 |
108 | ---
109 |
110 | > Trim low-quality bases and adapters from the dataset. These files will form the basis of all our subsequent analyses.
111 |
112 | -
113 |
114 |
115 | mkdir /mnt/trimming
116 | cd /mnt/trimming
117 |
118 | #paste the below lines together as 1 command
119 |
120 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
121 | -threads 4 -baseout P2.trimmed.fastQ \
122 | /mnt/raw.10M.SRR797058_1.fastq.gz \
123 | /mnt/raw.10M.SRR797058_2.fastq.gz \
124 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \
125 | SLIDINGWINDOW:4:2 \
126 | LEADING:2 \
127 | TRAILING:2 \
128 | MINLEN:25
129 |
130 |
131 | ---
132 | > Run Jellyfish on the un-normalized dataset.
133 |
134 |
135 | mkdir /mnt/jelly
136 | cd /mnt/jelly
137 |
138 | jellyfish count -m 25 -F2 -s 700M -t 4 -C -o trimmed.jf /mnt/trimming/P2.trimmed.fastQ_1P /mnt/trimming/P2.trimmed.fastQ_2P
139 | jellyfish histo trimmed.jf -o trimmed.histo
140 |
141 |
142 | ---
143 |
144 | > Run Khmer
145 |
146 |
147 | mkdir /mnt/khmer
148 | cd /mnt/khmer
149 | interleave-reads.py /mnt/trimming/P2.trimmed.fastQ_1P /mnt/trimming/P2.trimmed.fastQ_2P -o interleaved.fq
150 | normalize-by-median.py -p -x 15e8 -k 25 -C 50 --out khmer_normalized.fq interleaved.fq
151 |
152 |
153 | ---
154 |
155 | > Run Jellyfish on the normalized dataset.
156 |
157 |
158 | cd /mnt/jelly
159 |
160 | jellyfish count -m 25 -s 700M -t 4 -C -o khmer.jf /mnt/khmer/khmer_normalized.fq
161 | jellyfish histo khmer.jf -o khmer.histo
162 |
163 |
164 | > Open up a new terminal window using command-t
165 |
166 |
167 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
168 |
169 |
170 | > Now, on your Mac, find the `.histo` files you just downloaded.
171 |
172 | -
173 |
174 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
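> Before plotting, you can sanity-check a `.histo` file from the shell: column 1 is the kmer frequency and column 2 is the number of distinct kmers seen at that frequency. A minimal sketch on a fabricated three-line file (the real `trimmed.histo` and `khmer.histo` have the same two-column layout, just many more rows):

```shell
# A tiny fake histogram: 100 distinct kmers seen once, 40 seen twice,
# 10 seen three times.
printf '1 100\n2 40\n3 10\n' > demo.histo

# The first row is the count of singleton (frequency-1) kmers - many
# of these are sequencing errors:
head -1 demo.histo
#1 100

# Total number of distinct kmers = the sum of column 2:
awk '{sum += $2} END {print sum}' demo.histo
#150
```

Running the awk line on both of your real files gives you the same two numbers that the first barplot in the R code below will show you graphically.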
175 |
176 | -
177 |
178 | > OPEN RSTUDIO
179 |
180 |
181 | #Import both histogram datasets:
182 |
183 | khmer <- read.table("~/Downloads/khmer.histo", quote="\"")
184 | trim <- read.table("~/Downloads/trimmed.histo", quote="\"")
185 |
186 | #What does this plot show you??
187 |
188 | barplot(c(trim$V2[1],khmer$V2[1]),
189 | names=c('Non-normalized', 'C50 Normalized'),
190 | main='Number of unique kmers')
191 |
192 | # plot differences between non-unique kmers
193 |
194 | plot(khmer$V2[10:300] - trim$V2[10:300], type='l',
195 | xlim=c(10,300), xaxs="i", yaxs="i", frame.plot=F,
196 | ylim=c(-10000,60000), col='red', xlab='kmer frequency',
197 | lwd=4, ylab='count',
198 | main='Diff in 25mer counts of \n normalized vs. un-normalized datasets')
199 | abline(h=0)
200 |
201 |
202 |
203 | ---
204 |
205 |
206 |
207 | -
208 |
209 | -
210 |
211 | > What do the analyses of kmer counts tell you?
--------------------------------------------------------------------------------
/lab_lessons/Lab5_trimming.md:
--------------------------------------------------------------------------------
1 | Lab 5: Trimming
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with the software packages FastQC and JellyFish. Your objectives are:
7 |
8 | -
9 |
10 | 1. Familiarize yourself with the software, how to execute it, how to visualize results.
11 |
12 | 2. Characterize the sequence quality of your dataset.
13 |
14 | The FastQC manual: http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
15 |
16 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf
17 |
18 | ---
19 |
20 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance.
Remember to change the permission of your key code `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 21 | 22 | 23 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 24 | 25 | 26 | --- 27 | 28 | > Update Software 29 | 30 | 31 | sudo bash 32 | apt-get update 33 | 34 | 35 | --- 36 | 37 | > Install updates 38 | 39 | 40 | apt-get -y upgrade 41 | 42 | 43 | --- 44 | 45 | > Install other software 46 | 47 | 48 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre 49 | 50 | 51 | - 52 | 53 | > Ok, for this lab we are going to use FastQC. There is a version available on apt-get, but it is an old version and we want to make sure that we have the most updated version.. Make sure you know what each of these commands does, rather than blindly copying and pasting..  54 | 55 | - 56 | 57 | 58 | cd $HOME 59 | wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip 60 | unzip fastqc_v0.11.2.zip 61 | cd FastQC/ 62 | chmod +x fastqc 63 | PATH=$PATH:$(pwd) 64 | 65 | 66 | --- 67 | 68 | > Install Trimmomatic 69 | 70 | 71 | cd $HOME 72 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip 73 | unzip Trimmomatic-0.32.zip 74 | cd Trimmomatic-0.32 75 | chmod +x trimmomatic-0.32.jar 76 | 77 | 78 | --- 79 | 80 | > Install Jellyfish 81 | 82 | 83 | cd $HOME 84 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz 85 | tar -zxf jellyfish-2.1.3.tar.gz 86 | cd jellyfish-2.1.3/ 87 | ./configure 88 | make 89 | PATH=$PATH:$(pwd)/bin 90 | 91 | 92 | --- 93 | 94 | > Download data. For this lab, we'll be using only 1 sequencing file. 95 | 96 | - 97 | 98 | 99 | cd /mnt 100 | wget https://s3.amazonaws.com/gen711/Pero360B.2.fastq.gz 101 | 102 | 103 | --- 104 | 105 | > Do 3 different trimming levels between 2 and 40. This one is trimming at a Phred score of 30 (BAD!!!) 
When you run your commands, you'll need to change the numbers in `LEADING:30` `TRAILING:30` `SLIDINGWINDOW:4:30` and `Pero360B.trim.Phred30.fastq` to whatever trimming level you want to use.
106 |
107 |
108 | mkdir /mnt/trimming
109 | cd /mnt/trimming
110 |
111 | #paste the below lines together as 1 command
112 |
113 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar SE \
114 | -threads 4 \
115 | ../Pero360B.2.fastq.gz \
116 | Pero360B.trim.Phred30.fastq \
117 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \
118 | SLIDINGWINDOW:4:30 \
119 | LEADING:30 \
120 | TRAILING:30 \
121 | MINLEN:25
122 |
123 |
124 |
125 | ---
126 | > After Trimmomatic is done, run FastQC. You'll have to change the numbers to match the levels you trimmed at.
127 |
128 |
129 | cd /mnt
130 | fastqc -t 4 Pero360B.2.fastq.gz
131 | fastqc -t 4 trimming/Pero360B.trim.Phred2.fastq
132 | fastqc -t 4 trimming/Pero360B.trim.Phred15.fastq
133 | fastqc -t 4 trimming/Pero360B.trim.Phred30.fastq
134 |
135 |
136 | ---
137 | > Run Jellyfish.
138 |
139 |
140 | mkdir /mnt/jelly
141 | cd /mnt/jelly
142 |
143 | # You'll have to run these commands 4 separate times -
144 | # once for each different trimmed dataset, and once for the raw dataset.
145 | # Change the names of the input and output files..
146 |
147 | jellyfish count -m 25 -s 200M -t 4 -C -o trim30.jf ../trimming/Pero360B.trim.Phred30.fastq
148 | jellyfish histo trim30.jf -o trim30.histo
149 |
150 |
151 | ---
152 | > Open up a new terminal window using command-t
153 |
154 |
155 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/*zip ~/Downloads/
156 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
157 |
158 |
159 | > Now, on your Mac, find the files you just downloaded - double click the zip files and that should unzip them. Click on the `html` file, which will open up your browser. Look at the results.
Try to figure out what each plot means.
160 |
161 | -
162 |
163 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
164 |
165 | -
166 |
167 | > OPEN RSTUDIO
168 |
169 |
170 | #Import all 3 histogram datasets: this is the code for importing 1 of them..
171 |
172 | trim2 <- read.table("~/Downloads/trim2.histo", quote="\"")
173 |
174 | #Plot: Make sure to change the names to match what you import.
175 | #What does this plot show you??
176 |
177 | barplot(c(trim2$V2[1],trim15$V2[1],trim30$V2[1]),
178 | names=c('Phred2', 'Phred15', 'Phred30'),
179 | main='Number of unique kmers')
180 |
181 | # plot differences between non-unique kmers
182 |
183 | plot(trim2$V2[2:30] - trim30$V2[2:30], type='l',
184 | xlim=c(2,20), xaxs="i", yaxs="i", frame.plot=F,
185 | ylim=c(0,2000000), col='red', xlab='kmer frequency',
186 | lwd=4, ylab='count',
187 | main='Diff in 25mer counts of freq 2 to 20 \n Phred2 vs. Phred30')
188 |
189 |
190 |
191 |
192 | > Look at the FastQC plots across the different trimming levels. Anything surprising?
193 |
194 | > What do the analyses of kmer counts tell you?
--------------------------------------------------------------------------------
/lab_lessons/Lab9_euk.genome.assembly.md:
--------------------------------------------------------------------------------
1 | Lab 9: Genome assembly
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with Genome Assembly using SPAdes. We will assemble the genome of Plasmodium falciparum. The data are taken from this paper: http://www.nature.com/ncomms/2014/140909/ncomms5754/full/ncomms5754.html?WT.ec_id=JA-NCOMMS-20140919.
7 | As it stands right now, I think that you will do all the preprocessing steps this week, then the assembly next. Once you have done all the steps, `gzip` compress the files and download them to your USB drive, or the Mac HD. I can provide you with these files next week if issues arise.
8 |
9 | 1.
Install software and download data 10 | 11 | 2. Error correct, quality trim, and adapter trim the data sets. 12 | 13 | 3. (next week) Assemble 14 | 15 | - 16 | 17 | The SPAdes manuscript: http://www.ncbi.nlm.nih.gov/pubmed/22506599 18 | The SPAdes manual: http://spades.bioinf.spbau.ru/release3.1.1/manual.html 19 | SPAdes website: http://bioinf.spbau.ru/spades 20 | 21 | > Step 1: Launch an AMI. For this exercise, we will use a c3.8xlarge (note different instance type). Remember to change the permissions of your key file: `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 22 | 23 | 24 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 25 | 26 | 27 | --- 28 | 29 | > Update Software 30 | 31 | 32 | sudo bash 33 | apt-get update 34 | 35 | 36 | --- 37 | 38 | > Install updates 39 | 40 | 41 | apt-get -y upgrade 42 | 43 | 44 | --- 45 | 46 | > Install other software 47 | 48 | 49 | apt-get -y install subversion tmux git curl bowtie libncurses5-dev samtools gcc make g++ python-dev unzip dh-autoreconf default-jre python-pip zlib1g-dev 50 | 51 | 52 | --- 53 | 54 | > Install Lighter, software for error correction.
55 | 56 | 57 | cd $HOME 58 | git clone https://github.com/mourisl/Lighter.git 59 | cd Lighter 60 | make -j8 61 | PATH=$PATH:$(pwd) 62 | 63 | 64 | --- 65 | 66 | > Install Trimmomatic 67 | 68 | 69 | cd $HOME 70 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip 71 | unzip Trimmomatic-0.32.zip 72 | cd Trimmomatic-0.32 73 | chmod +x trimmomatic-0.32.jar 74 | 75 | 76 | --- 77 | 78 | > Install SRAtoolkit 79 | 80 | 81 | cd $HOME 82 | wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.4.2/sratoolkit.2.4.2-ubuntu64.tar.gz 83 | tar -zxf sratoolkit.2.4.2-ubuntu64.tar.gz 84 | PATH=$PATH:/home/ubuntu/sratoolkit.2.4.2-ubuntu64/bin 85 | 86 | 87 | --- 88 | 89 | > Install SPAdes 90 | 91 | 92 | wget http://spades.bioinf.spbau.ru/release3.1.1/SPAdes-3.1.1-Linux.tar.gz 93 | tar -zxf SPAdes-3.1.1-Linux.tar.gz 94 | cd SPAdes-3.1.1-Linux 95 | PATH=$PATH:$(pwd)/bin 96 | 97 | 98 | --- 99 | 100 | > Download 3.5kb MP library 101 | 102 | 103 | cd /mnt 104 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR022/ERR022558/ERR022558.sra 105 | 106 | 107 | > Download 10kb MP library 108 | 109 | 110 | cd /mnt 111 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR022/ERR022557/ERR022557.sra 112 | 113 | 114 | > Download PE library #1 115 | 116 | 117 | cd /mnt 118 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR019/ERR019273/ERR019273.sra 119 | 120 | 121 | --- 122 | 123 | > Download PE library #2 124 | 125 | 126 | cd /mnt 127 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR019/ERR019275/ERR019275.sra 128 | 129 | 130 | --- 131 | 132 | > Extract FASTQ from SRA format. 133 | 134 | 135 | cd /mnt 136 | 137 | #this is a basic for loop. Copy it all as 1 command.
138 | 139 | for i in *.sra; do 140 | fastq-dump --split-files --split-spot $i; 141 | rm $i; 142 | done 143 | 144 | 145 | --- 146 | 147 | > Error Correct Data 148 | 149 | 150 | mkdir /mnt/ec 151 | cd /mnt/ec 152 | lighter -r /mnt/ERR019273_1.fastq -r /mnt/ERR019273_2.fastq -t 32 -k 21 45000000 .1 153 | lighter -r /mnt/ERR022557_1.fastq -r /mnt/ERR022557_2.fastq -t 32 -k 21 45000000 .1 154 | lighter -r /mnt/ERR022558_1.fastq -r /mnt/ERR022558_2.fastq -t 32 -k 21 45000000 .1 155 | lighter -r /mnt/ERR019275_1.fastq -r /mnt/ERR019275_2.fastq -t 32 -k 21 45000000 .1 156 | 157 | #remove the raw files. 158 | 159 | rm *fastq & 160 | 161 | 162 | > Trim the data: 163 | 164 | 165 | mkdir /mnt/trim 166 | cd /mnt/trim 167 | #paste the below lines together as 1 command 168 | 169 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 170 | -threads 32 -baseout PE_lib1.fq \ 171 | /mnt/ec/ERR019273_1.cor.fq \ 172 | /mnt/ec/ERR019273_2.cor.fq \ 173 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \ 174 | SLIDINGWINDOW:4:2 \ 175 | LEADING:2 \ 176 | TRAILING:2 \ 177 | MINLEN:25 178 | 179 | #paste the below lines together as 1 command 180 | 181 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 182 | -threads 32 -baseout PE_lib2.fq \ 183 | /mnt/ec/ERR019275_1.cor.fq \ 184 | /mnt/ec/ERR019275_2.cor.fq \ 185 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \ 186 | SLIDINGWINDOW:4:2 \ 187 | LEADING:2 \ 188 | TRAILING:2 \ 189 | MINLEN:25 190 | 191 | #paste the below lines together as 1 command 192 | 193 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 194 | -threads 32 -baseout MP10000.fq \ 195 | /mnt/ec/ERR022557_1.cor.fq \ 196 | /mnt/ec/ERR022557_2.cor.fq \ 197 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \ 198 | SLIDINGWINDOW:4:2 \ 199 | LEADING:2 \ 200 | TRAILING:2 \ 201 | MINLEN:25 202 | 203 | #paste the below lines together as 1 command 204 | 205 | java -Xmx10g -jar
$HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 206 | -threads 32 -baseout MP3500.fq \ 207 | /mnt/ec/ERR022558_1.cor.fq \ 208 | /mnt/ec/ERR022558_2.cor.fq \ 209 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \ 210 | SLIDINGWINDOW:4:2 \ 211 | LEADING:2 \ 212 | TRAILING:2 \ 213 | MINLEN:25 214 | 215 | 216 | 217 | 218 | --- 219 | 220 | 221 | cd /mnt/ec 222 | 223 | #remove the corrected files. 224 | 225 | rm *fq 226 | 227 | 228 | 229 | > Assembly. You may want to do this next week. Alternatively, you can put it in a tmux window and let it run; you'd have to log in later, however, to download the assembled genome. 230 | 231 | 232 | mkdir /mnt/spades 233 | cd /mnt/spades 234 | 235 | spades.py -t 32 -m 60 \ 236 | --pe1-1 /mnt/trim/PE_lib1_1P.fq \ 237 | --pe1-2 /mnt/trim/PE_lib1_2P.fq \ 238 | --pe2-1 /mnt/trim/PE_lib2_1P.fq \ 239 | --pe2-2 /mnt/trim/PE_lib2_2P.fq \ 240 | --mp1-1 /mnt/trim/MP3500_1P.fq \ 241 | --mp1-2 /mnt/trim/MP3500_2P.fq \ 242 | --mp2-1 /mnt/trim/MP10000_1P.fq \ 243 | --mp2-2 /mnt/trim/MP10000_2P.fq \ 244 | -o Pfal --only-assembler 245 | 246 | -------------------------------------------------------------------------------- /lab_lessons/Lab7_transcriptome_assembly.md: -------------------------------------------------------------------------------- 1 | Lab 7: Transcriptome assembly 2 | --- 3 | 4 | -- 5 | 6 | During this lab, we will acquaint ourselves with de novo transcriptome assembly using Trinity. You will: 7 | 8 | 1. Install software and download data 9 | 10 | 2. Error correct, quality trim, and adapter trim the data sets. 11 | 12 | 3. Apply digital normalization to the dataset. 13 | 14 | 4. Trinity assembly 15 | 16 | 5. Because the above steps will take a few hours, I am providing you with 2 datasets: one is the 10 million read dataset you used last week.
The other is that same 10M read dataset that I have error corrected, quality/adapter trimmed, normalized, and subsampled to 0.5 million reads (I did this so that the assembly could be done in a reasonable amount of time). For people who are going to do de novo transcriptome projects, and for students who will use something like this in their own research, it is probably worth going through the whole pipeline at some point. 17 | 18 | - 19 | 20 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf 21 | 22 | The Khmer manual: http://khmer.readthedocs.org/en/v1.1/ 23 | 24 | Trinity reference material: http://trinityrnaseq.sourceforge.net/ 25 | 26 | --- 27 | 28 | > Step 1: Launch an AMI. For this exercise, we will use an m3.2xlarge (note different instance type). Remember to change the permissions of your key file: `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 29 | 30 | 31 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 32 | 33 | 34 | --- 35 | 36 | > Update Software 37 | 38 | 39 | sudo bash 40 | apt-get update 41 | 42 | 43 | --- 44 | 45 | > Install updates 46 | 47 | 48 | apt-get -y upgrade 49 | 50 | 51 | --- 52 | 53 | > Install other software 54 | 55 | 56 | apt-get -y install subversion tmux git curl bowtie libncurses5-dev samtools gcc make g++ python-dev unzip dh-autoreconf default-jre python-pip zlib1g-dev 57 | 58 | 59 | --- 60 | 61 | > Install Lighter, software for error correction.
62 | 63 | 64 | cd $HOME 65 | git clone https://github.com/mourisl/Lighter.git 66 | cd Lighter && make -j8 67 | PATH=$PATH:$(pwd) 68 | 69 | 70 | --- 71 | 72 | > Install Trimmomatic 73 | 74 | 75 | cd $HOME 76 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip 77 | unzip Trimmomatic-0.32.zip 78 | cd Trimmomatic-0.32 79 | chmod +x trimmomatic-0.32.jar 80 | 81 | 82 | --- 83 | 84 | > Install Trinity 85 | 86 | 87 | cd $HOME 88 | svn checkout svn://svn.code.sf.net/p/trinityrnaseq/code/trunk trinityrnaseq-code 89 | cd trinityrnaseq-code 90 | make -j8 91 | PATH=$PATH:$(pwd) 92 | 93 | 94 | --- 95 | 96 | > Install Khmer 97 | 98 | 99 | cd $HOME 100 | pip install screed pysam 101 | git clone https://github.com/ged-lab/khmer.git 102 | cd khmer 103 | make -j8 104 | make install 105 | PATH=$PATH:$(pwd)/scripts 106 | 107 | 108 | --- 109 | > Download data. For this lab, these data are to be used by people wanting to do the whole pipeline. Most people will want the other dataset linked below. 110 | 111 | 112 | cd /mnt 113 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_1.fastq.gz 114 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_2.fastq.gz 115 | 116 | 117 | > Alternatively, you can download the pre-corrected, trimmed, normalized datasets. Sadly, I had to subsample this dataset severely (to 500,000 reads) so that we could assemble it in a lab period... 118 | 119 | 120 | cd /mnt 121 | wget https://www.dropbox.com/s/eo3wrx6lvngq3ja/ec.P2.C25.left.fq.gz 122 | wget https://www.dropbox.com/s/eycchg3m2my2ag2/ec.P2.C25.right.fq.gz 123 | 124 | 125 | --- 126 | 127 | > Error correct (do this step if you are working with the raw data only). Note you will have to uncompress the data if you are doing these steps. I chose the software 'lighter' because it is 1. probably good and 2. fast! It is written by Ben Langmead, the author of several of the PowerPoint lectures I posted last week.
128 | 129 | 130 | mkdir /mnt/ec 131 | cd /mnt/ec 132 | lighter -r /mnt/raw.10M.SRR797058_1.fastq -r /mnt/raw.10M.SRR797058_2.fastq -t 8 -k 25 100000000 .1 133 | 134 | 135 | --- 136 | 137 | > Trim (do this step if you are working with the raw data only) 138 | 139 | 140 | mkdir /mnt/trimming 141 | cd /mnt/trimming 142 | 143 | #paste the below lines together as 1 command 144 | 145 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 146 | -threads 8 -baseout ec.P2trim.fastQ \ 147 | /mnt/ec/raw.10M.SRR797058_1.cor.fq \ 148 | /mnt/ec/raw.10M.SRR797058_2.cor.fq \ 149 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \ 150 | SLIDINGWINDOW:4:2 \ 151 | LEADING:2 \ 152 | TRAILING:2 \ 153 | MINLEN:25 154 | 155 | 156 | --- 157 | 158 | > Run Khmer (do this step if you are working with the raw data only) 159 | 160 | 161 | mkdir /mnt/khmer 162 | cd /mnt/khmer 163 | interleave-reads.py /mnt/trimming/ec.P2trim.fastQ_1P /mnt/trimming/ec.P2trim.fastQ_2P -o interleaved.fq 164 | normalize-by-median.py -p -x 15e8 -k 25 -C 25 --out khmer_normalized.fq interleaved.fq 165 | split-paired-reads.py khmer_normalized.fq 166 | 167 | 168 | --- 169 | 170 | > Run Trinity - everybody do this. If you are running with the raw data, you'll have to change the names of the input files. Note that I am using `--min_kmer_cov 2` in the command below. This is only so that you can get through the assembly in a short amount of time. DO NOT USE THIS OPTION IN 'REAL LIFE' AS IT WILL MAKE YOUR ASSEMBLY WORSE!!! This should take ~30 minutes, so use this time to talk to your group members, or whatever else. 171 | 172 | 173 | mkdir /mnt/trinity 174 | cd /mnt/trinity 175 | Trinity --seqType fq --JM 20G --min_kmer_cov 2 \ 176 | --left /mnt/ec.P2.C25.left.fq \ 177 | --right /mnt/ec.P2.C25.right.fq \ 178 | --CPU 8 --output ec.P2trim.C25 --group_pairs_distance 999 --inchworm_cpu 8 179 | 180 | 181 | --- 182 | 183 | > Generate length-based stats from your assembly. What do these mean?
184 | 185 | 186 | $HOME/trinityrnaseq-code/util/TrinityStats.pl ec.P2trim.C25/Trinity.fasta 187 | 188 | 189 | 190 | > Let's look for coding sequences. Before we can do this, we need to install a Perl module using the cpan command. 191 | 192 | 193 | cpan URI::Escape 194 | $HOME/trinityrnaseq-code/trinity-plugins/TransDecoder_r20140704/TransDecoder --CPU 8 -t ec.P2trim.C25/Trinity.fasta 195 | 196 | 197 | > This will take a few minutes. Once done, you will have a file of amino acid sequences, and a file of coding sequences. Look at how many coding sequences you found, and how many were complete (have a start and stop codon) vs. fragmented in one way or another. What do these numbers mean?? What would you hope these numbers would look like? What does `grep -c` do? 198 | 199 | 200 | $HOME/trinityrnaseq-code/util/TrinityStats.pl Trinity.fasta.transdecoder.pep 201 | grep -c complete Trinity.fasta.transdecoder.pep 202 | grep -c internal Trinity.fasta.transdecoder.pep 203 | grep -c 5prime Trinity.fasta.transdecoder.pep 204 | grep -c 3prime Trinity.fasta.transdecoder.pep 205 | 206 | --------------------------------------------------------------------------------
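A footnote on the `grep -c` question at the end of Lab 7: `grep -c` counts matching *lines*, not total matches, which works here because each predicted ORF gets one header line. Here is a minimal sketch you can run anywhere; the file `toy.pep` and its headers are made up for illustration, not real TransDecoder output:

```shell
# grep -c reports the number of LINES that contain the pattern.
# Counting header lines that carry each completeness tag therefore
# counts ORFs per class. Toy headers for illustration only:
printf '>m.1 type:complete\n>m.2 type:internal\n>m.3 type:complete\n>m.4 type:5prime_partial\n' > toy.pep
grep -c complete toy.pep   # -> 2 (two header lines contain "complete")
grep -c internal toy.pep   # -> 1
grep -c 5prime toy.pep     # -> 1
```

Note that if a pattern appeared twice on one line, `grep -c` would still count that line once.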