├── README.md ├── lab_lessons ├── README.md ├── Lab8_mapping.md ├── Lab10_bacterial_genome_assembly.md ├── Lab3_hmmer.md ├── Lab4_fastq.md ├── Lab1_unix.md ├── Lab2_blast.md ├── Lab6_khmer.md ├── Lab5_trimming.md ├── Lab9_euk.genome.assembly.md └── Lab7_transcriptome_assembly.md └── student_code ├── transrate.txt ├── shortunwrapped.txt ├── bless_assembly_no_norm (6).txt ├── LK_bless_norm.txt ├── sga_assembly_no_norm.txt ├── bless_assembly_with_norm (5).txt ├── sga_assembly_with_norm.txt └── unwrapped.txt /README.md: -------------------------------------------------------------------------------- 1 | Gen711 2 | ====== 3 | -------------------------------------------------------------------------------- /lab_lessons/README.md: -------------------------------------------------------------------------------- 1 | This folder contains the (version 1) lab lessons for Gen711/811, released under a CC-BY license. 2 | 3 | This class was taught for the first time in Fall 2014 at the University of New Hampshire to a class of 25 (half undergrad, half grad) with NO programming experience. We spent 2 hours per week in the computer lab doing these labs. The course website is here: http://genomebio.org/Gen711/ 4 | 5 | Please feel free to fork/send me pull requests, or otherwise incorporate as you see fit. 
6 | -------------------------------------------------------------------------------- /student_code/transrate.txt: -------------------------------------------------------------------------------- 1 | #Transrate for epididymus 2 | transrate -a /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta \ 3 | -r /home/lauren/mus_protein_db/Mus_musculus.GRCm38.pep.all.fa \ 4 | -l /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq \ 5 | -i /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq \ 6 | -o /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL -t 24 7 | 8 | 9 | 10 | #The elements of this code are as follows 11 | -a ASSEMBLY (fasta file) 12 | -r REFERENCE (fasta file) 13 | -l LEFT READS (numbered R1) 14 | -i RIGHT READS (numbered R2) 15 | -o OUTPUT FILE (.FULL) -------------------------------------------------------------------------------- /student_code/shortunwrapped.txt: -------------------------------------------------------------------------------- 1 | #shortunwrapped.txt 2 | #I created this intermediate file to confirm that the awk command was working, which it is. unfortunately the sed command is not! 3 | #The functionality of this file is that it will tell you how many files are created by the awk lines, therefore you should exec 4 | #this shortunwrapped program before you can do your unwrapped program in order to determine how many temporary file lines (xa_) #that you must write into your unwrapped program 5 | 6 | #so the following sed command is useless: 7 | sed ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta > \ 8 | /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta 9 | 10 | #Filter based on score. 
This is the command that works and will tell you how many files you are creating: 11 | awk -F "," '.3>$17{next}1' /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv | \ 12 | awk -F "," '{print $1}' | sed '1,1d' | split -l 9000 -------------------------------------------------------------------------------- /student_code/bless_assembly_no_norm (6).txt: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -rRsf 2 | all: /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq.gz \ 3 | /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq.gz \ 4 | /mnt/data3/lah/mattsclass/no_norm/testes.bless_no_norm_trinity.fasta 5 | 6 | #############################BLESS############################################## 7 | 8 | /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq.gz:/mnt/data3/lah/mattsclass/testes.R1.fastq /mnt/data3/lah/mattsclass/testes.R2.f$ 9 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'` 10 | echo Results will be in a file named *corrected.fastq.gz 11 | echo Settings used: bless kmerlength = 25 12 | bless -kmerlength 25 -read1 /mnt/data3/lah/mattsclass/testes.R1.fastq -read2 /mnt/data3/lah/mattsclass/testes.R2.fastq -verify -notrim -prefix /mnt/data3/lah/mattsclass/no_norm/testes 13 | gzip /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq & 14 | 15 | 16 | #######################Trimmomatic/Trinity########################## 17 | 18 | /mnt/data3/lah/mattsclass/no_norm/testes.bless_no_norm_trinity.fasta:\ 19 | /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz 20 | Trinity --seqType fq --JM 50G --trimmomatic --left /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz --right /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz --CPU 12 --output testes.ble$ 21 | 
--quality_trimming_params "ILLUMINACLIP:/opt/trinity/trinity-plugins/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25" 22 | -------------------------------------------------------------------------------- /student_code/LK_bless_norm.txt: -------------------------------------------------------------------------------- 1 | 2 | #!/usr/bin/make -rRsf 3 | 4 | ########################################### 5 | ### -usage 'bless_assembly_with_norm.mk RUN=run CPU=8 MEM=15' 6 | ### -RUN= name of run 7 | ### 8 | ############################################ 9 | 10 | #$@ 11 | 12 | 13 | MEM=5 14 | CPU=5 15 | RUN=run 16 | #READ1=/home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq 17 | #READ2=/home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq 18 | 19 | 20 | 21 | all:/home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz \ 22 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz \ 23 | /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta 24 | 25 | #############################BLESS############################################## 26 | 27 | /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz:\ 28 | /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq 29 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'` 30 | echo Results will be in a file named *corrected.fastq.gz 31 | echo Settings used: bless kmerlength = 25 32 | bless -kmerlength 25 -read1 /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq \ 33 | -read2 /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq -verify -notrim -prefix epi 34 | gzip /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq 35 | 36 | 37 | ##########################khmer############################### 38 | 
39 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz:\ 40 | /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz 41 | echo BEGIN NORMALIZATION `date +'%a %d%b%Y %H:%M:%S'` 42 | echo Settings used: normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 43 | interleave-reads.py /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz \ /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz -o /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq 44 | normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 -out /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq \ /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq 45 | gzip /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq 46 | 47 | 48 | #######################Trimmomatic/Trinity########################## 49 | 50 | /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta: \ 51 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz 52 | Trinity --seqType fq --JM 50G --trimmomatic --single $< --CPU 12 --output /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta: \ 53 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25" 54 | -------------------------------------------------------------------------------- /student_code/sga_assembly_no_norm.txt: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -rRsf 2 | 3 | ### -usage 'sga_assembly_no_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/Documents/testes.R1.fastq.gz 4 | 5 | READ2=/home/lauren/Documents/testes.R2.fastq.gz' 6 | ### -RUN= name of run 7 | 8 | MEM=5 9 | CPU=5 10 | RUN=run 11 | READ1=/home/lauren/Documents/testes.R1.fastq.gz 12 | READ2=/home/lauren/Documents/testes.R2.fastq.gz 13 | 14 | SHELL=/bin/bash -o 
pipefail 15 | # SGA version 16 | SGA=sga-0.10.12 17 | DWGSIM=dwgsim 18 | REPORT=sga-preqc-report.py 19 | 20 | 21 | #change the data 22 | #This re-names my samples: 23 | #samp1 := p_eremicus 24 | 25 | 26 | #Below after the all command are the final output files from SGA and normalization and trinity: 27 | #Must be added in order of completion!!! 28 | all:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz \ 29 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt \ 30 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz \ 31 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity.fasta 32 | 33 | 34 | #################################SGA#################################### 35 | 36 | # Pre-process the dataset: recall that NEED gz file form for this step 37 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz:/home/lauren/Documents/testes.R1.fastq.gz \ /home/lauren/Documents/testes.R2.fastq.gz 38 | sga preprocess --pe-mode 1 /home/lauren/Documents/testes.R1.fastq.gz /home/lauren/Documents/testes.R2.fastq.gz > \ 39 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq 40 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq 41 | 42 | 43 | # Build the FM-index 44 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 45 | cd /home/lauren/Documents/p_eremicus/ && sga index -a ropebwt -t 8 --no-reverse $< 46 | 47 | 48 | # Make the preqc file for the short read set 49 | #%.preqc: %.bwt %.fastq.gz 50 | # $(SGA) preqc -t 8 $(patsubst %.bwt, %.fastq.gz, $<) > $@ 51 | 52 | # Final PDF report 53 | #main_report.pdf: p_eremicus.preqc 54 | # python $(REPORT) $+ 55 | # mv preqc_report.pdf $@ 56 | 57 | 58 | # SGA correction 59 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 60 | cd /home/lauren/Documents/p_eremicus/ && sga correct -k 41 
--discard --learn -t 8 -o \ /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 61 | 62 | 63 | 64 | 65 | #######################Trimmomatic/Trinity########################## 66 | 67 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity.fasta:/home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz 68 | Trinity --seqType fq --JM 50G --trimmomatic \ 69 | --single $< \ 70 | --CPU $(CPU) --output /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity \ 71 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 \ LEADING:2 TRAILING:2 MINLEN:25" -------------------------------------------------------------------------------- /student_code/bless_assembly_with_norm (5).txt: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -rRsf 2 | 3 | ########################################### 4 | ### -usage 'bless_assembly_with_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/transcriptome/Pero360T.1.fastq.gz READ2=/home/lauren/transcriptome/Pero360T.2.fastq.gz READ3=/location/of/read3.fastq READ4=/location/of/read4.fastq ' 5 | ### -RUN= name of run 6 | ### 7 | ############################################ 8 | 9 | #$@ 10 | 11 | ##files we need are p_eremics.READNAME.fastq_1&2 12 | ##### mus_musculus.READNAME.fastq_1&2 13 | ##### file directories /mnt/data3/lah/mattsclass/p_eremicus #where output gets put 14 | ##### /mnt/data3/lah/mattsclass/p_eremicus/raw #reads will be here 15 | ##### /mnt/data3/lah/mattsclass/mus_musculus #where output gets put 16 | ##### /mnt/data3/lah/mattsclass/mus_musculus/raw #reads will be here 17 | ##### Run in mattsclass folder?? 
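The `-usage` comment at the top of this file relies on make's command-line variable overrides (`RUN=`, `CPU=`, `MEM=`). A minimal, self-contained sketch of that pattern, using a hypothetical `demo.mk` that is not part of this repo:

```shell
# Hypothetical demo of the RUN=/CPU=/MEM= override pattern these makefiles use.
# Defaults live in the makefile; `make VAR=value` overrides them from the shell.
printf 'CPU=5\nMEM=5\nall:\n\t@echo CPU=$(CPU) MEM=$(MEM)\n' > demo.mk
make -f demo.mk               # uses the defaults: prints CPU=5 MEM=5
make -f demo.mk CPU=8 MEM=15  # overridden: prints CPU=8 MEM=15
```

Command-line assignments beat the `CPU=5`/`MEM=5` lines in the file, which is why the hard-coded defaults below are harmless when the usage string is followed.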
18 | 19 | 20 | MEM=5 21 | CPU=5 22 | RUN=run 23 | #READ1=/mnt/data3/lah/mattsclass/testes.R1.fastq 24 | #READ2=/mnt/data3/lah/mattsclass/testes.R2.fastq 25 | 26 | 27 | all:/mnt/data3/lah/mattsclass/testes.1.bless_corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.bless_corrected.fastq.gz /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz.1 /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz.2 /mnt/data3/lah/mattsclass/testes.bless_norm_trinity.fasta 28 | #all output files in order of correction 29 | ##ADD OUTPUT AS WE'RE GOING!!#### 30 | 31 | 32 | #############################BLESS############################################## 33 | 34 | 35 | 36 | #all:/mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz 37 | 38 | 39 | /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz:\ #output files# 40 | /mnt/data3/lah/mattsclass/testes.R1.fastq /mnt/data3/lah/mattsclass/testes.R2.fastq #input files (raw reads) 41 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'` 42 | echo Results will be in a file named *corrected.fastq.gz 43 | echo Settings used: bless kmerlength = 25 44 | bless -kmerlength 25 -read1 testes.R1.fastq -read2 testes.R2.fastq -verify -notrim -prefix /mnt/data3/lah/mattsclass/testes 45 | gzip /mnt/data3/lah/mattsclass/testes.1.corrected.fastq /mnt/data3/lah/mattsclass/testes.2.corrected.fastq & 46 | 47 | 48 | ##########################khmer############################### 49 | 50 | /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz:\ #output file 51 | /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz #input files from bless (interleave these) 52 | echo BEGIN NORMALIZATION `date +'%a %d%b%Y %H:%M:%S'` 53 | echo Settings used: normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 54 | interleave-reads.py /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz 
/mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz -o /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq 55 | normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 --out bless_corrected.inter.norm.fastq 56 | gzip bless_corrected.inter.norm.fasta & 57 | 58 | 59 | #######################Trimmomatic/Trinity########################## 60 | 61 | /mnt/data3/lah/mattsclass/testes.bless_norm_trinity.fasta: \ 62 | /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz 63 | Trinity --seqType fq --JM 50G --trimmomatic --single $< --CPU 12 --output /mnt/data3/lah/mattsclass/bless_norm_trinity.fasta \ 64 | --quality_trimming_params "ILLUMINACLIP:/opt/trinity/trinity-plugins/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25" 65 | -------------------------------------------------------------------------------- /student_code/sga_assembly_with_norm.txt: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -rRsf 2 | 3 | ### -usage 'sga_assembly_no_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/Documents/testes.R1.fastq.gz 4 | 5 | READ2=/home/lauren/Documents/testes.R2.fastq.gz' 6 | ### -RUN= name of run 7 | 8 | MEM=5 9 | CPU=5 10 | RUN=run 11 | READ1=/home/lauren/Documents/testes.R1.fastq.gz 12 | READ2=/home/lauren/Documents/testes.R2.fastq.gz 13 | 14 | SHELL=/bin/bash -o pipefail 15 | # SGA version 16 | SGA=sga-0.10.12 17 | DWGSIM=dwgsim 18 | REPORT=sga-preqc-report.py 19 | 20 | 21 | #change the data 22 | #This re-names my samples: 23 | #samp1 := p_eremicus 24 | 25 | 26 | #Below after the all command are the final output files from SGA and normalization and trinity: 27 | #Must be added in order of completion!!! 
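The "must be added in order of completion" comment above describes how these pipelines are wired together: `all` lists every final output, and each rule names its own inputs, so make runs the steps in dependency order. A toy, self-contained sketch of the pattern (hypothetical targets, not the real pipeline):

```shell
# Hypothetical two-step chain: `all` lists the final outputs, and each rule
# declares its prerequisite, so make builds step1.txt before step2.txt.
printf 'all: step1.txt step2.txt\nstep1.txt:\n\techo one > step1.txt\nstep2.txt: step1.txt\n\tcat step1.txt > step2.txt\n' > chain.mk
make -f chain.mk
```

Because each target is a real file, re-running make skips any step whose output already exists, which is the point of writing the pipeline this way.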
28 | all:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz \ 29 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt \ 30 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz \ 31 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz \ 32 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity.fasta 33 | 34 | 35 | #################################SGA#################################### 36 | 37 | # Pre-process the dataset: recall that NEED gz file form for this step 38 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz:/home/lauren/Documents/testes.R1.fastq.gz \ /home/lauren/Documents/testes.R2.fastq.gz 39 | sga preprocess --pe-mode 1 /home/lauren/Documents/testes.R1.fastq.gz /home/lauren/Documents/testes.R2.fastq.gz > \ 40 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq 41 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq 42 | 43 | 44 | # Build the FM-index 45 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 46 | cd /home/lauren/Documents/p_eremicus/ && sga index -a ropebwt -t 8 --no-reverse $< 47 | 48 | # Make the preqc file for the short read set 49 | #%.preqc: %.bwt %.fastq.gz 50 | # $(SGA) preqc -t 8 $(patsubst %.bwt, %.fastq.gz, $<) > $@ 51 | 52 | # Final PDF report 53 | #main_report.pdf: p_eremicus.preqc 54 | # python $(REPORT) $+ 55 | # mv preqc_report.pdf $@ 56 | 57 | 58 | # SGA correction 59 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 60 | cd /home/lauren/Documents/p_eremicus/ && sga correct -k 41 --discard --learn -t 8 -o \ /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz 61 | 62 | 63 | ##########################khmer############################### 64 | 65 | 
/home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz:\ 66 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz 67 | cd /home/lauren/Documents/p_eremicus/ && normalize-by-median.py -k 25 -C 50 -N 4 -x 15e9 --out \ 68 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq \ 69 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz 70 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq 71 | 72 | 73 | #######################Trimmomatic/Trinity########################## 74 | 75 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity.fasta:\ 76 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz 77 | Trinity --seqType fq --JM 50G --trimmomatic \ 78 | --single $< \ 79 | --CPU $(CPU) --output /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity \ 80 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 \ LEADING:2 TRAILING:2 MINLEN:25" 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | -------------------------------------------------------------------------------- /lab_lessons/Lab8_mapping.md: -------------------------------------------------------------------------------- 1 | Lab 8: Read Mapping 2 | -- 3 | 4 | --- 5 | 6 | During this lab, we will acquaint ourselves with read mapping using BWA. You will: 7 | 8 | 1. Install software and download data 9 | 10 | 2. Use sra-toolkit to extract fastQ reads 11 | 12 | 3. Map reads to a reference assembly 13 | 14 | 4. Look at mapping quality 15 | 16 | - 17 | 18 | The BWA manual: http://bio-bwa.sourceforge.net/ 19 | 20 | Flag info: http://broadinstitute.github.io/picard/explain-flags.html 21 | 22 | --- 23 | 24 | > Step 1: Launch an AMI. For this exercise, we will use a c3.2xlarge (yet another instance type). 
Remember to change the permission of your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 25 | 26 | 27 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 28 | 29 | 30 | --- 31 | 32 | > Update Software 33 | 34 | 35 | sudo bash 36 | apt-get update 37 | 38 | 39 | --- 40 | 41 | > Install updates 42 | 43 | 44 | apt-get -y upgrade 45 | 46 | 47 | --- 48 | 49 | > Install other software 50 | 51 | 52 | apt-get -y install subversion tmux git curl samtools gcc make g++ python-dev unzip dh-autoreconf default-jre zlib1g-dev 53 | 54 | 55 | --- 56 | 57 | 58 | cd $HOME 59 | git clone https://github.com/lh3/bwa.git 60 | cd bwa 61 | make -j4 62 | PATH=$PATH:$(pwd) 63 | 64 | 65 | --- 66 | 67 | 68 | cd $HOME 69 | wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.4.2/sratoolkit.2.4.2-ubuntu64.tar.gz 70 | tar -zxf sratoolkit.2.4.2-ubuntu64.tar.gz 71 | PATH=$PATH:/home/ubuntu/sratoolkit.2.4.2-ubuntu64/bin 72 | 73 | 74 | 75 | > Download data 76 | 77 | 78 | mkdir /mnt/data 79 | cd /mnt/data 80 | wget http://datadryad.org/bitstream/handle/10255/dryad.72141/brain.final.fasta 81 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR157/SRR1575395/SRR1575395.sra 82 | 83 | 84 | > Convert SRA format into fastQ (takes a few minutes) 85 | 86 | 87 | cd /mnt/data 88 | fastq-dump --split-files --split-spot SRR1575395.sra 89 | 90 | 91 | > Map reads!! (20 minutes) 92 | 93 | 94 | mkdir /mnt/mapping 95 | cd /mnt/mapping 96 | tmux new -s mapping 97 | bwa index -p index /mnt/data/brain.final.fasta 98 | bwa mem -t8 index /mnt/data/SRR1575395_1.fastq /mnt/data/SRR1575395_2.fastq > brain.sam 99 | 100 | 101 | > Look at the SAM file. 102 | 103 | 104 | 105 | #Take a quick general look. 106 | 107 | head brain.sam 108 | tail brain.sam 109 | 110 | #Count how many reads are in the fastq files. `grep -c` counts the number of occurrences of the pattern, which in this case is `^@`. 
I am looking for lines that begin with (specified by `^`) the @ character. 111 | 112 | grep -c ^@ ../data/SRR1575395_1.fastq ../data/SRR1575395_2.fastq 113 | 114 | #count the number of reads mapping with Flag 65/67. The first part of this command, `awk`, pulls out the second column of the file, and counts everything that has either 65 or 67. What do these flags correspond to? 115 | 116 | awk '{print $2}' brain.sam | grep ^6 | grep -c '65\|67' 117 | 118 | #why do we need the `grep ^6` thing in there... try `awk '{print $2}' brain.sam | grep '65\|67' | wc -l` 119 | 120 | #what about this?? 121 | 122 | awk '{print $2}' brain.sam | grep '^65\|^67' | wc -l 123 | 124 | 125 | > Can you pull out the number of mismatches by targeting the NM tag in column 12? 126 | 127 | 128 | 129 | #I'm giving you the last bit of the awk code. You have to figure out the first awk command and the first grep command. This will send the number of mismatches to a file `mismatches.txt`. Can you download it to your USB drive or hard drive and plot the results, find the mean number of mismatches, etc?? 130 | 131 | awk | grep | awk -F ":" '{print $3}' > mismatches.txt 132 | 133 | -------------------------------------------------------------------------------- /lab_lessons/Lab10_bacterial_genome_assembly.md: -------------------------------------------------------------------------------- 1 | Lab 10: Bacterial Genome Assembly 2 | -- 3 | 4 | --- 5 | 6 | During this lab, we will acquaint ourselves with genome assembly using SPAdes. We will assemble the genome of E. coli. The data are taken from here: https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly/blob/master/Sources.md. 7 | 8 | 1. Install software and download data 9 | 10 | 2. Error correct, quality and adapter trim data sets. 11 | 12 | 3. 
Assemble 13 | 14 | - 15 | 16 | The SPAdes manuscript: http://www.ncbi.nlm.nih.gov/pubmed/22506599 17 | The SPAdes manual: http://spades.bioinf.spbau.ru/release3.1.1/manual.html 18 | SPAdes website: http://bioinf.spbau.ru/spades 19 | ABySS webpage: https://github.com/bcgsc/abyss 20 | 21 | - 22 | 23 | > Step 1: Launch and AMI. For this exercise, we will use a c3.2xlarge (note different instance type). Remember to change the permission of your key code `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 24 | 25 | 26 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 27 | 28 | 29 | --- 30 | 31 | > Update Software 32 | 33 | 34 | sudo bash 35 | apt-get update 36 | 37 | 38 | --- 39 | 40 | > Install updates 41 | 42 | 43 | apt-get -y upgrade 44 | 45 | 46 | --- 47 | 48 | > Install other software 49 | 50 | 51 | apt-get -y install subversion tmux git curl libncurses5-dev gcc make g++ python-dev unzip dh-autoreconf zlib1g-dev libboost1.55-dev sparsehash openmpi* 52 | 53 | 54 | 55 | --- 56 | 57 | > Install SPAdes 58 | 59 | 60 | cd $HOME 61 | wget http://spades.bioinf.spbau.ru/release3.1.1/SPAdes-3.1.1-Linux.tar.gz 62 | tar -zxf SPAdes-3.1.1-Linux.tar.gz 63 | cd SPAdes-3.1.1-Linux 64 | PATH=$PATH:$(pwd)/bin 65 | 66 | 67 | --- 68 | 69 | > Install ABySS 70 | 71 | 72 | cd $HOME 73 | git clone https://github.com/bcgsc/abyss.git 74 | cd abyss 75 | ./autogen.sh 76 | ./configure --enable-maxk=128 --prefix=/usr/local/ --with-mpi=/usr/lib/openmpi/ 77 | make -j4 78 | make all install 79 | 80 | 81 | - 82 | 83 | > Install a script for assembly evaluation. 84 | 85 | 86 | git clone https://github.com/lexnederbragt/sequencetools.git 87 | cd sequencetools/ 88 | PATH=$PATH:$(pwd) 89 | 90 | 91 | > Download and unpack the data 92 | 93 | 94 | cd /mnt 95 | wget https://s3.amazonaws.com/gen711/ecoli_data.tar.gz 96 | tar -zxf ecoli_data.tar.gz 97 | 98 | 99 | > Assembly. 
Try this with different data combos (with mate pair data, without, with MinION data and without, etc). Remember to name your assemblies something different using the `-o` flag. SPAdes has a built-in error correction tool (remove `--only-assembler`). Does double error correction seem to make a difference? 100 | 101 | 102 | mkdir /mnt/spades 103 | cd /mnt/spades 104 | 105 | spades.py -t 8 -m 15 --only-assembler --mp1-rf -k 127 \ 106 | --pe1-1 /mnt/ecoli_pe.1.fq \ 107 | --pe1-2 /mnt/ecoli_pe.2.fq \ 108 | --mp1-1 /mnt/nextera.1.fq \ 109 | --mp1-2 /mnt/nextera.2.fq \ 110 | --pacbio /mnt/minion.data.fasta \ 111 | -o Ecoli_all_data 112 | 113 | 114 | --- 115 | 116 | > Evaluate Assemblies 117 | 118 | 119 | abyss-fac Ecoli_all_data/scaffolds.fasta 120 | 121 | #take a closer look. 122 | 123 | assemblathon_stats.pl Ecoli_all_data/scaffolds.fasta 124 | 125 | 126 | 127 | 128 | > Assembling with ABySS (optional) 129 | 130 | 131 | mkdir /mnt/abyss 132 | cd /mnt/abyss 133 | 134 | abyss-pe np=8 k=127 name=ecoli lib='pe1' mp='mp1' long='minion' \ 135 | pe1='/mnt/ecoli_pe.1.fq /mnt/ecoli_pe.2.fq' \ 136 | mp1='/mnt/nextera.1.fq /mnt/nextera.2.fq' \ 137 | minion='/mnt/minion.data.fasta' mp1_l=30 138 | 139 | 140 | -------------------------------------------------------------------------------- /lab_lessons/Lab3_hmmer.md: -------------------------------------------------------------------------------- 1 | Lab 3: HMMER 2 | 3 | During this lab, we will acquaint ourselves with the software package HMMER. Your objectives are: 4 | 5 | - 6 | 7 | 1. Familiarize yourself with the software, how to execute it, how to visualize results. 8 | 9 | 2. Characterize a few conserved domains in your dataset. 10 | 11 | The HMMER manual: ftp://selab.janelia.org/pub/software/hmmer3/3.1b1/Userguide.pdf 12 | 13 | The HMMER webpage: http://hmmer.janelia.org/ 14 | 15 | --- 16 | 17 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance. 
Remember to change the permission of your key code `chmod 400 ~/Downloads/your.pem` (change your.pem to whatever you named it) 18 | 19 | 20 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 21 | 22 | 23 | --- 24 | 25 | 26 | sudo bash 27 | apt-get update 28 | apt-get -y upgrade 29 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre 30 | 31 | 32 | - 33 | 34 | > Ok, for this lab we are going to use HMMER 35 | 36 | - 37 | 38 | 39 | cd $HOME 40 | wget http://selab.janelia.org/software/hmmer3/3.1b1/hmmer-3.1b1-linux-intel-x86_64.tar.gz 41 | tar -zxf hmmer-3.1b1-linux-intel-x86_64.tar.gz 42 | cd hmmer-3.1b1-linux-intel-x86_64/ 43 | ./configure 44 | make && make all install 45 | make check 46 | 47 | 48 | --- 49 | 50 | - 51 | 52 | > You will download one of the 5 different datasets (use the same dataset). Do you remember how to use the `wget` and `gzip` commands from last week? Also, download Swissprot and Pfam-A 53 | 54 | - 55 | 56 | 57 | cd /mnt 58 | 59 | #download your dataset 60 | 61 | dataset1= https://www.dropbox.com/s/srfk4o2bh1qmq6l/dataset1.fa.gz 62 | dataset2= https://www.dropbox.com/s/977n0ibznzuor22/dataset2.fa.gz 63 | dataset3= https://www.dropbox.com/s/8s2h7sm6xtoky6q/dataset3.fa.gz 64 | dataset4= https://www.dropbox.com/s/qth3mjrianb48a6/dataset4.fa.gz 65 | dataset5= https://www.dropbox.com/s/quexoxfh6ttmudo/dataset5.fa.gz 66 | 67 | #download the SwissProt database 68 | 69 | wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz 70 | 71 | #download the Pfam-A database 72 | 73 | wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz 74 | 75 | 76 | > we are going to run HMMER to identify conserved protein domains. This will take a little while, and we'll use `tmux` to allow us to do this in the background, and continue to work on other things. 
77 | 78 | 79 | gzip -d *gz 80 | tmux new -s pfam 81 | hmmpress Pfam-A.hmm #this is analogous to 'makeblastdb' 82 | hmmscan -E 1e-3 --domtblout dataset.pfam --cpu 4 Pfam-A.hmm dataset1.fa 83 | ctl-b d 84 | top -c #see that hmmscan is running.. 85 | 86 | 87 | > The neat thing about HMMER is that it can be used as a replacement for blastP or PSI-blast. 88 | 89 | 90 | #blastp-like search. HBB_HUMAN is a hemoglobin beta protein sequence. 91 | 92 | phmmer --domtblout hbb.phmmer -E 1e-5 \ 93 | /home/ubuntu/hmmer-3.1b1-linux-intel-x86_64/tutorial/HBB_HUMAN \ 94 | uniprot_sprot.fasta 95 | 96 | #PSI-blast-like 97 | 98 | jackhmmer --domtblout hbb.jackhmmer -E 1e-5 \ 99 | /home/ubuntu/hmmer-3.1b1-linux-intel-x86_64/tutorial/HBB_HUMAN \ 100 | uniprot_sprot.fasta 101 | 102 | #you can look at the results using `more hbb.phmmer` or `more hbb.jackhmmer`. Try blasting a few of the results using the BLAST web interface. 103 | 104 | 105 | > Now let's look at the Pfam results. This analysis may still be running, but we can look at it while it's still in progress. 106 | 107 | 108 | more dataset.pfam 109 | #There are a bunch of columns in this table - what do they mean? 110 | 111 | #Try to extract all the hits to a specific domain. Google a few domains (column 1) to see if any seem interesting. 112 | 113 | #for instance, find all occurrences of ABC_tran 114 | grep ABC_tran dataset.pfam 115 | 116 | #use grep to count the number of matches. Copy this number down. 117 | 118 | grep -c ABC_tran dataset.pfam 119 | 120 | #Find all the contigs that have an ABC_tran domain. 121 | 122 | grep ABC_tran dataset.pfam | awk '{print $4}' | sort | uniq 123 | 124 | 125 | > Just for fun, check on the Pfam search to see what it is doing... 
126 | 127 | 128 | tmux attach -t pfam 129 | ctl-b d 130 | 131 | -------------------------------------------------------------------------------- /lab_lessons/Lab4_fastq.md: -------------------------------------------------------------------------------- 1 | Lab 4: Processing fastQ and fastA 2 | -- 3 | 4 | During this lab, we will acquaint ourselves with the software packages FastQC and JellyFish. Your objectives are: 5 | 6 | - 7 | 8 | 1. Familiarize yourself with the software, how to execute it, how to visualize results. 9 | 10 | 2. Characterize the sequence quality of your dataset. 11 | 12 | The FastQC manual: http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc 13 | 14 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf 15 | 16 | --- 17 | 18 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance. Remember to change the permission of your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 19 | 20 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 21 | 22 | 23 | --- 24 | 25 | > Update Software 26 | 27 | sudo bash 28 | apt-get update 29 | 30 | 31 | --- 32 | 33 | > Install updates 34 | 35 | apt-get -y upgrade 36 | 37 | 38 | --- 39 | 40 | > Install other software 41 | 42 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre 43 | 44 | 45 | - 46 | 47 | > Ok, for this lab we are going to use FastQC. There is a version available on apt-get, but it is an old version, and we want to make sure that we have the most recent version. Make sure you know what each of these commands does, rather than blindly copying and pasting. 48 | 49 | - 50 | 51 | cd $HOME 52 | wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip 53 | unzip fastqc_v0.11.2.zip 54 | cd FastQC/ 55 | chmod +x fastqc 56 | PATH=$PATH:$(pwd) 57 | 58 | 59 | --- 60 | 61 | > Download the data and uncompress it. 
What does the `-cd` flag mean with respect to gzip?
62 |
63 | -
64 |
65 | cd /mnt
66 | wget https://s3.amazonaws.com/gen711/Pero360B.1.fastq.gz
67 | wget https://s3.amazonaws.com/gen711/Pero360B.2.fastq.gz
68 | gzip -cd /mnt/Pero360B.1.fastq.gz > /mnt/Pero360B.1.fastq &
69 | gzip -cd /mnt/Pero360B.2.fastq.gz > /mnt/Pero360B.2.fastq &
70 |
71 |
72 | ---
73 | > Install Fastool, a neat and fast tool for fastQ --> fastA conversion
74 |
75 | -
76 |
77 | cd $HOME
78 | git clone https://github.com/fstrozzi/Fastool.git
79 | cd Fastool/
80 | make
81 | PATH=$PATH:$(pwd)
82 |
83 |
84 | ---
85 | > Use Fastool to convert from fastQ to fastA
86 |
87 | -
88 |
89 | cd /mnt
90 | fastool --to-fasta Pero360B.1.fastq > Pero360B.1.fasta &
91 | fastool --to-fasta Pero360B.2.fastq > Pero360B.2.fasta &
92 |
93 |
94 | ---
95 | > While Fastool is working, let's install JellyFish. Again, make sure you know what each of these commands does, rather than just copying and pasting.
96 |
97 | -
98 |
99 | cd $HOME
100 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
101 | tar -zxf jellyfish-2.1.3.tar.gz
102 | cd jellyfish-2.1.3/
103 | ./configure
104 | make
105 | PATH=$PATH:$(pwd)/bin
106 |
107 |
108 | ---
109 | > Run FastQC. Make sure to look at the manual to see what the different outputs mean.
110 |
111 | cd /mnt
112 | fastqc -t 4 Pero360B.1.fastq Pero360B.2.fastq
113 |
114 |
115 | ---
116 | > Run Jellyfish. Make sure to look at the manual.
117 |
118 | cd /mnt
119 | mkdir jelly
120 | cd jelly
121 | jellyfish count -F2 -m 25 -s 200M -t 4 -C ../Pero360B.1.fasta ../Pero360B.2.fasta
122 | jellyfish histo mer_counts.jf > Pero360B.histo
123 | head -50 Pero360B.histo
124 |
125 |
126 | ---
127 | > Open up a new terminal window using command-t
128 |
129 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/*zip ~/Downloads/
130 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
131 |
132 |
133 | > Now, on your Mac, find the files you just downloaded - double click the zip files and that should unzip them. Click on the `html` file, which will open up your browser. Look at the results. Try to figure out what each plot means.
134 |
135 | -
136 |
137 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
138 |
139 | -
140 |
141 | > OPEN RSTUDIO
142 |
143 |
144 | #Import Data
145 | histo <- read.table("~/Downloads/Pero360B.histo", quote="\"")
146 | head(histo)
147 |
148 | #Plot
149 | plot(histo$V2 ~ histo$V1, type='h')
150 |
151 | #That one sucks, but what does it tell you about the kmer distribution?
152 |
153 | #Maybe this one is better?
154 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,100))
155 |
156 | #Better. What is xlim? Maybe we can still improve?
157 |
158 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,500), ylim=c(0,1000000))
159 |
160 | #Final plot
161 |
162 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,500), ylim=c(0,1000000),
163 | col='blue', frame.plot=F, xlab='25-mer frequency', ylab='Count',
164 | main='Kmer distribution in brain sample before quality trimming')
165 |
166 |
167 |
168 | > Done?
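> An aside before wrapping up: Fastool is fast, but the fastQ to fastA conversion itself is simple enough that plain awk can do it. A minimal sketch, assuming standard 4-line fastQ records; the `demo.fastq` file below is made up for illustration, not the Pero360B data:

```shell
# fastQ records are 4 lines each: @header, sequence, +, qualities.
# Make a tiny fake fastQ file to demonstrate on:
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' > demo.fastq

# Keep line 1 of every record (header, with @ swapped for >) and
# line 2 (the sequence); drop the + and quality lines:
awk 'NR % 4 == 1 {sub(/^@/, ">"); print} NR % 4 == 2 {print}' demo.fastq > demo.fasta

cat demo.fasta
#>read1
#ACGT
#>read2
#TTGG
```

The same one-liner would apply to the real files (e.g. `awk ... Pero360B.1.fastq > Pero360B.1.fasta`), though Fastool will be faster on a 20GB file.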
--------------------------------------------------------------------------------
/student_code/unwrapped.txt:
--------------------------------------------------------------------------------
1 | #Make trinity assembly file 'unwrapped'
2 |
3 | #Purpose of this is to filter contigs out of the epididymus assembly that have a contig score <0.3
4 | #I need to do this because my epididymus transcriptome assembly had a low transrate score of 0.192
5 | #and we hope that by filtering out these low quality contigs, this may improve the transrate score
6 | #assembly: /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta
7 | #csv contigs: /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv
8 |
9 |
10 | #This sed command does not work; the input is the trinity.fasta file and the output is unwrapped_epi.fasta
11 |
12 | sed ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta > \
13 | /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta
14 |
15 | #Filter based on score. This command does work:
16 |
17 | awk -F "," '.3>$17{next}1' /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv | \
18 | awk -F "," '{print $1}' | sed '1,1d' | split -l 9000
19 |
20 | #This will give you a bunch of files xaa, xab, xac, etc.
Each of them contains
21 | #the names of the 'good contigs'. Now we need to retrieve them from the original fasta file.
22 | #The number of temporary files below matches the number of xa_ files generated, as determined by
23 | #running the shortunwrapped program first:
24 |
25 | for i in $(cat xaa); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp1.fa; done &
26 | for i in $(cat xab); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp2.fa; done &
27 | for i in $(cat xac); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp3.fa; done &
28 | for i in $(cat xad); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp4.fa; done &
29 | for i in $(cat xae); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp5.fa; done &
30 | for i in $(cat xaf); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp6.fa; done &
31 | for i in $(cat xag); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp7.fa; done &
32 | for i in $(cat xah); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp8.fa; done &
33 | for i in $(cat xai); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp9.fa; done &
34 | for i in $(cat xaj); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp10.fa; done &
35 | for i in $(cat xak); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp11.fa; done &
36 | for i in $(cat xal); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp12.fa; done &
37 | for i in $(cat xam); do
grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp13.fa; done & 38 | for i in $(cat xan); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp14.fa; done & 39 | for i in $(cat xao); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp15.fa; done & 40 | for i in $(cat xap); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp16.fa; done & 41 | for i in $(cat xaq); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp17.fa; done & 42 | for i in $(cat xar); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp18.fa; done & 43 | for i in $(cat xas); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp19.fa; done & 44 | for i in $(cat xat); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp20.fa; done & 45 | for i in $(cat xau); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp21.fa; done & 46 | for i in $(cat xav); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp22.fa; done & 47 | for i in $(cat xaw); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp23.fa; done & 48 | for i in $(cat xax); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp24.fa; done & 49 | for i in $(cat xay); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp25.fa; done & 50 | for i in $(cat xaz); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp26.fa; done & 51 | for i in $(cat xba); do grep -A1 --max-count=1 -w $i 
/home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp27.fa; done &
52 |
53 | #One command for each of the xa* files. I do it like this to save
54 | #time. Each xa* file is being processed on a different core.
55 |
56 | #Lastly, concatenate all the temporary files together and get rid of the now unneeded files:
57 |
58 | #The > sign indicates that all temp files are concatenated into a NEW file: CAT_unwrapped_epi.fasta
59 | #However, the temp* is not liked by the command line!
60 | cat temp* > /home/lauren/Documents/NYGenomeCenter/CAT_unwrapped_epi.fasta
61 | rm temp* x*
62 |
63 |
--------------------------------------------------------------------------------
/lab_lessons/Lab1_unix.md:
--------------------------------------------------------------------------------
1 | Lab 1
2 | --
3 |
4 | During this lab, we will acquaint ourselves with the Unix terminal, learn how to access data, install software, and find things. *It is absolutely critical that you master these skills*, so please ask questions if confused.
5 |
6 | > Step 1: Launch an AMI. For this exercise, a t1.micro will be sufficient.
7 |
8 |
9 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
10 |
11 |
12 |
13 | > The machine you are using is Linux Ubuntu: Ubuntu is an operating system you can use (I do) on your laptop or desktop. One of the nice things about this OS is the ability to update the software easily. The command `sudo apt-get update` checks a server for updates to existing software.
14 |
15 |
16 | sudo apt-get update
17 |
18 |
19 | >The upgrade command actually installs any of the required updates.
20 |
21 | sudo apt-get upgrade
22 |
23 | >OK, what are these commands? `sudo` is the command that tells the computer that we have admin privileges. Try running the commands without the sudo -- it will complain that you don't have admin privileges or something like that.
*Careful here, using sudo means that you can do something really bad to your own computer -- like delete everything*, so use with caution. It's not a big worry when using AWS, as this is a virtual machine -- fixing your worst mistake is as easy as just terminating the instance and restarting.
24 |
25 | -
26 |
27 | > So now that we have updated the software, let's see how to add new software. Same basic command, but instead of the `update` or `upgrade` command, we're using `install`. EASY!!
28 |
29 | -
30 | sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip \
31 | default-jre
32 |
33 | -
34 |
35 | >After you run this command, try to install something else: R (a stats package - more on this wonderful software later). The package is named `r-base-core`. See if you can install it!! Installing software on Linux is easy (so long as there is a downloadable package - more on what to do when no such package exists later in lab)
36 |
37 | -
38 |
39 | >BTW, did you notice the `\` at the end of line 1 in the above code snippet?? That is a special character we use to break up a single line of code over 2 or more lines. You'll see me use this a lot!
40 |
41 | -
42 |
43 | >OK, let's try our hands at navigating around on the command line - it is not scary!
44 |
45 | Important UNIX rules
46 | --
47 |
48 | * Everything is case sensitive. Gen711 is not the same as gen711
49 | * Spaces in file names should be avoided
50 | * The unix $PATH is the collection of locations where the computer looks for executables (programs)
51 | * Folders and Files are all you have. If you want to access one of these, you need to tell the computer *EXACTLY* where it is. `/home/macmanes/gen711/exam1_key.txt` will work (assuming you've spelled things correctly, and that the file really exists in that location), but `exam1_key.txt` may not.
52 |
53 | * Lines that begin with a `#` are comments.
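>The $PATH and "tell the computer *EXACTLY* where it is" rules can be seen in action with a couple of commands. A minimal sketch (the folder and file names here are made up for the demo):

```shell
# The shell searches the folders listed in $PATH to find programs;
# 'command -v' reports the absolute path of the one it found:
command -v ls

# Absolute vs. relative paths: a relative name only works from the
# right directory, while an absolute path works from anywhere.
mkdir -p /tmp/gen711_demo
printf 'found me\n' > /tmp/gen711_demo/exam1_key.txt

cd /tmp/gen711_demo
cat exam1_key.txt                    # relative: works, we are in the folder
cd /tmp
cat /tmp/gen711_demo/exam1_key.txt   # absolute: works from anywhere

rm -r /tmp/gen711_demo
```

Try running `cat exam1_key.txt` from your home directory instead -- the "No such file or directory" error is the computer telling you it looked in the wrong place, not that the file is gone.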
54 |
55 | Basic shell commands
56 | --
57 |
58 | >the `pwd` command returns your current location.
59 |
60 | pwd
61 |
62 | -
63 |
64 | >the `ls` command lists the files and folders present in your current directory. Try `ls -lt` and `ls -lth`. *What is the difference between these commands?* Try typing `man ls` to learn about all the different flags.
65 |
66 | ls -l
67 |
68 | -
69 |
70 | >create a file
71 |
72 | nano hello.txt
73 | #The nano text editor will appear -> type something
74 | This is my 1st unix file
75 | CTL-x
76 | y
77 | #typing n would get rid of the text you just wrote.
78 |
79 | -
80 |
81 | >look at the file; there are several ways to do this
82 |
83 | head -5 hello.txt #this shows you the 1st 5 lines of the file
84 | more hello.txt #this shows you the whole file, 1 screen at a time. Space bar to advance, q to quit
85 |
86 | -
87 |
88 | >make a copy of the file, using a different name, then remove it.
89 |
90 | cp hello.txt bye.txt
91 | ls -lth
92 | rm bye.txt
93 | ls -lth
94 |
95 | -
96 |
97 | >move the file (or rename it). What is the difference between `mv` and `cp`???
98 |
99 | mv hello.txt bye.txt
100 | ls -lth
101 |
102 | -
103 |
104 | >make a folder (directory), make a file inside a folder.
105 |
106 | mkdir testfolder
107 | ls -lth
108 | #make a folder inside that folder
109 | mkdir testfolder/inside_test
110 | #make a file
111 | nano testfolder/inside_test/inside.txt
112 | head testfolder/inside_test/inside.txt
113 | rm testfolder/inside_test/inside.txt
114 |
115 | >there are a few other commands that you should be familiar with: `sort`, `cat`, `clear`, `tail`, `history`. Try googling and using `man` to figure them out.
116 |
117 | Downloading Data and Stuff
118 | --
119 |
120 | >download something from the web. You're using the `wget` command. You're downloading the SwissProt database.
See http://www.ebi.ac.uk/uniprot
121 |
122 | mkdir swissprot
123 | cd swissprot
124 | wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz
125 |
126 | -
127 |
128 | >It will take a few minutes to download. After it's downloaded, you'll need to extract it. Files ending in `.gz` are compressed, just like `.zip`, which is a type of file compression you may be more familiar with.
129 |
130 | gzip -d uniprot_sprot.fasta.gz
131 |
132 | -
133 |
134 | >Can you tell me what type of file this is? Use the commands we used above to look at the 1st few lines.
135 |
136 | ???
137 |
138 | >There is some info complementary to this material here: http://swcarpentry.github.io/2014-08-21-upenn/novice/ref/01-shell.html
--------------------------------------------------------------------------------
/lab_lessons/Lab2_blast.md:
--------------------------------------------------------------------------------
1 | Lab 2: BLAST
2 | --
3 |
4 | During this lab, we will acquaint ourselves with the software package BLAST. Your objectives are:
5 |
6 | -
7 |
8 | 1. Familiarize yourself with the software, how to execute it, how to visualize results.
9 |
10 | 2. Regarding your dataset, tell me how some of these genes are related to their homologous copies.
11 |
12 | -
13 |
14 | ---
15 |
16 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance.
17 |
18 |
19 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
20 |
21 | -
22 |
23 | > The machine you are using is Linux Ubuntu: Ubuntu is an operating system you can use (I do) on your laptop or desktop. One of the nice things about this OS is the ability to update the software easily. The command `sudo apt-get update` checks a server for updates to existing software.
24 |
25 | -
26 |
27 |
28 | sudo apt-get update
29 |
30 | -
31 |
32 | > The upgrade command actually installs any of the required updates.
33 |
34 |
35 | sudo apt-get -y upgrade
36 |
37 |
38 | > OK, what are these commands? `sudo` is the command that tells the computer that we have admin privileges. Try running the commands without the sudo -- it will complain that you don't have admin privileges or something like that. *Careful here, using sudo means that you can do something really bad to your own computer -- like delete everything*, so use with caution. It's not a big worry when using AWS, as this is a virtual machine -- fixing your worst mistake is as easy as just terminating the instance and restarting.
39 |
40 | -
41 |
42 | > So now that we have updated the software, let's see how to add new software. Same basic command, but instead of the `update` or `upgrade` command, we're using `install`. EASY!!
43 |
44 | -
45 |
46 |
47 | sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip \
48 | default-jre
49 |
50 | -
51 |
52 | > ok, for this lab we are going to use BLAST, which is available as a package entitled `ncbi-blast+`
53 |
54 | -
55 |
56 |
57 | sudo apt-get -y install ???
58 |
59 | -
60 |
61 | > to get a feel for the different options, type `blastp -help`. Which type of blast does this correspond to? Look at the help info for blastp and tblastx
62 |
63 | -
64 |
65 | > Let's go root
66 |
67 |
68 | sudo bash
69 |
70 | ---
71 |
72 | Install mafft and RAxML
73 | --
74 |
75 | > Let's install mafft so that we can do an alignment (http://mafft.cbrc.jp/alignment/software/)
76 |
77 |
78 | cd $HOME
79 | wget http://mafft.cbrc.jp/alignment/software/mafft-7.164-without-extensions-src.tgz
80 | tar -zxf mafft-7.164-without-extensions-src.tgz
81 | cd mafft-7.164-without-extensions/core
82 | sudo make && sudo make install
83 | PATH=$PATH:/home/ubuntu/mafft-7.164-without-extensions/core
84 |
85 | -
86 |
87 | > Now let's install RAxML so that we can make a phylogeny.
()
88 |
89 |
90 | cd $HOME
91 | git clone https://github.com/stamatak/standard-RAxML.git
92 | cd standard-RAxML/
93 | make -f Makefile.PTHREADS.gcc
94 | PATH=$PATH:/home/ubuntu/standard-RAxML
95 |
96 | -
97 |
98 | > remember, for blasting we need both some data (a query) and a database. Let's start with the data 1st. You will have one of the 5 different datasets. Do you remember how to use the `wget` and `gzip` commands from last week?
99 |
100 | -
101 |
102 |
103 | cd /mnt
104 | dataset1=https://www.dropbox.com/s/srfk4o2bh1qmq6l/dataset1.fa.gz
105 | dataset2=https://www.dropbox.com/s/977n0ibznzuor22/dataset2.fa.gz
106 | dataset3=https://www.dropbox.com/s/8s2h7sm6xtoky6q/dataset3.fa.gz
107 | dataset4=https://www.dropbox.com/s/qth3mjrianb48a6/dataset4.fa.gz
108 | dataset5=https://www.dropbox.com/s/quexoxfh6ttmudo/dataset5.fa.gz
109 |
110 | -
111 |
112 | > Now let's download the database. For this exercise we will use Swissprot: `ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz`
113 |
114 | > unzip this file using `gzip -d`
115 |
116 | ---
117 |
118 | Make blast database and blast
119 | --
120 |
121 | -
122 |
123 | > make a blast database
124 |
125 |
126 | makeblastdb -in uniprot_sprot.fasta -out uniprot -dbtype prot
127 |
128 | > Now we are ready to blast.
129 |
130 |
131 | head -n20 dataset1.fa > test.fa
132 | blastp -evalue 1e-10 -num_threads 4 -db uniprot -query test.fa -outfmt 6
133 |
134 | > You will see the results in a table with 12 columns. Use `blastp -help` to see what the results mean.
135 |
136 | > Test out some of the blast options. Try changing the word size `-word_size`, scoring matrix, evalue, cost to open or extend a gap. See how these changes affect the results.
137 |
138 | > After you've done this, you should make a file containing the query and the hits.
139 |
140 |
141 | grep -A4 -w AAseq_1 dataset1.fa
142 |
143 | #use the dataset you have, and substitute your contig for AAseq_1
144 | #increase -A4 until the whole contig is displayed.
145 | #copy and paste it into nano.
146 | #do the same for the database matches.
147 |
148 | grep -A4 'sp|Q6GZX4|001R_FRG3G' uniprot_sprot.fasta
149 |
150 | ---
151 |
152 | mafft
153 | --
154 |
155 | > Align the proteins using mafft
156 |
157 |
158 | mafft --reorder --bl 80 --auto for.align > for.tree
159 |
160 | ---
161 |
162 | RAxML
163 | --
164 |
165 | > Make a phylogeny
166 |
167 |
168 | raxmlHPC-PTHREADS -help
169 | raxmlHPC-PTHREADS -f a -m PROTCATBLOSUM62 -T 4 -x 34 -N 100 -n tree -s for.tree -p 35
170 |
171 | > Copy phylogeny and view online.
172 |
173 |
174 | more RAxML_bipartitionsBranchLabels.tree
175 |
176 | #copy this info.
177 |
178 | > Visualize tree on website: http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
--------------------------------------------------------------------------------
/lab_lessons/Lab6_khmer.md:
--------------------------------------------------------------------------------
1 | Lab 6: khmer
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with digital normalization. You will:
7 |
8 | 1. Install software and download data
9 |
10 | 2. Quality and adapter trim data sets.
11 |
12 | 3. Apply digital normalization to the dataset.
13 |
14 | 4. Count and compare kmers and kmer distributions in the normalized and un-normalized dataset.
15 |
16 | 5. Plot in RStudio.
17 |
18 | -
19 |
20 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf
21 |
22 | The Khmer manual: http://khmer.readthedocs.org/en/v1.1/
23 |
24 | ---
25 |
26 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance.
Remember to change the permissions of your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
27 |
28 |
29 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
30 |
31 |
32 | ---
33 |
34 | > Update Software
35 |
36 |
37 | sudo bash
38 | apt-get update
39 |
40 |
41 | ---
42 |
43 | > Install updates
44 |
45 |
46 | apt-get -y upgrade
47 |
48 |
49 | ---
50 |
51 | > Install other software
52 |
53 |
54 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre python-pip zlib1g-dev
55 |
56 |
57 | ---
58 |
59 | > Install Trimmomatic
60 |
61 |
62 | cd $HOME
63 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip
64 | unzip Trimmomatic-0.32.zip
65 | cd Trimmomatic-0.32
66 | chmod +x trimmomatic-0.32.jar
67 |
68 |
69 | ---
70 |
71 | > Install Jellyfish
72 |
73 |
74 | cd $HOME
75 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
76 | tar -zxf jellyfish-2.1.3.tar.gz
77 | cd jellyfish-2.1.3/
78 | ./configure
79 | make -j4
80 | PATH=$PATH:$(pwd)/bin
81 |
82 |
83 | ---
84 |
85 | > Install Khmer
86 |
87 |
88 | cd $HOME
89 | pip install screed pysam
90 | git clone https://github.com/ged-lab/khmer.git
91 | cd khmer
92 | make -j4
93 | make install
94 | PATH=$PATH:$(pwd)/scripts
95 |
96 |
97 | ---
98 | > Download data. For this lab, we'll be using a smaller dataset that consists of 10 million paired-end reads.
99 |
100 | -
101 |
102 |
103 | cd /mnt
104 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_1.fastq.gz
105 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_2.fastq.gz
106 |
107 |
108 | ---
109 |
110 | > Trim low-quality bases and adapters from the dataset. These files will form the basis of all our subsequent analyses.
111 |
112 | -
113 |
114 |
115 | mkdir /mnt/trimming
116 | cd /mnt/trimming
117 |
118 | #paste the below lines together as 1 command
119 |
120 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
121 | -threads 4 -baseout P2.trimmed.fastQ \
122 | /mnt/raw.10M.SRR797058_1.fastq.gz \
123 | /mnt/raw.10M.SRR797058_2.fastq.gz \
124 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \
125 | SLIDINGWINDOW:4:2 \
126 | LEADING:2 \
127 | TRAILING:2 \
128 | MINLEN:25
129 |
130 |
131 | ---
132 | > Run Jellyfish on the un-normalized dataset.
133 |
134 |
135 | mkdir /mnt/jelly
136 | cd /mnt/jelly
137 |
138 | jellyfish count -m 25 -F2 -s 700M -t 4 -C -o trimmed.jf /mnt/trimming/P2.trimmed.fastQ_1P /mnt/trimming/P2.trimmed.fastQ_2P
139 | jellyfish histo trimmed.jf -o trimmed.histo
140 |
141 |
142 | ---
143 |
144 | > Run Khmer
145 |
146 |
147 | mkdir /mnt/khmer
148 | cd /mnt/khmer
149 | interleave-reads.py /mnt/trimming/P2.trimmed.fastQ_1P /mnt/trimming/P2.trimmed.fastQ_2P -o interleaved.fq
150 | normalize-by-median.py -p -x 15e8 -k 25 -C 50 --out khmer_normalized.fq interleaved.fq
151 |
152 |
153 | ---
154 |
155 | > Run Jellyfish on the normalized dataset.
156 |
157 |
158 | cd /mnt/jelly
159 |
160 | jellyfish count -m 25 -s 700M -t 4 -C -o khmer.jf /mnt/khmer/khmer_normalized.fq
161 | jellyfish histo khmer.jf -o khmer.histo
162 |
163 |
164 | > Open up a new terminal window using command-t
165 |
166 |
167 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
168 |
169 |
170 | > Now, on your Mac, find the `.histo` files you just downloaded.
171 |
172 | -
173 |
174 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
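> Before plotting, you can sanity-check a `.histo` file from the shell: column 1 is the kmer frequency and column 2 is the number of distinct kmers seen at that frequency. A minimal sketch on a fabricated three-line file (the real `trimmed.histo` and `khmer.histo` have the same two-column layout, just many more rows):

```shell
# A tiny fake histogram: 100 distinct kmers seen once, 40 seen twice,
# 10 seen three times.
printf '1 100\n2 40\n3 10\n' > demo.histo

# The first row is the count of singleton (frequency-1) kmers - many
# of these are sequencing errors:
head -1 demo.histo
#1 100

# Total number of distinct kmers = the sum of column 2:
awk '{sum += $2} END {print sum}' demo.histo
#150
```

Running the awk line on both of your real files gives you the same two numbers that the first barplot in the R code below will show you graphically.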
175 |
176 | -
177 |
178 | > OPEN RSTUDIO
179 |
180 |
181 | #Import both histogram datasets:
182 |
183 | khmer <- read.table("~/Downloads/khmer.histo", quote="\"")
184 | trim <- read.table("~/Downloads/trimmed.histo", quote="\"")
185 |
186 | #What does this plot show you??
187 |
188 | barplot(c(trim$V2[1],khmer$V2[1]),
189 | names=c('Non-normalized', 'C50 Normalized'),
190 | main='Number of unique kmers')
191 |
192 | # plot differences between non-unique kmers
193 |
194 | plot(khmer$V2[10:300] - trim$V2[10:300], type='l',
195 | xlim=c(10,300), xaxs="i", yaxs="i", frame.plot=F,
196 | ylim=c(-10000,60000), col='red', xlab='kmer frequency',
197 | lwd=4, ylab='count',
198 | main='Diff in 25mer counts of \n normalized vs. un-normalized datasets')
199 | abline(h=0)
200 |
201 |
202 |
203 | ---
204 |
205 |
206 |
207 | -
208 |
209 | -
210 |
211 | > What do the analyses of kmer counts tell you?
--------------------------------------------------------------------------------
/lab_lessons/Lab5_trimming.md:
--------------------------------------------------------------------------------
1 | Lab 5: Trimming
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with the software packages FastQC and JellyFish. Your objectives are:
7 |
8 | -
9 |
10 | 1. Familiarize yourself with the software, how to execute it, how to visualize results.
11 |
12 | 2. Characterize the sequence quality of your dataset.
13 |
14 | The FastQC manual: http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
15 |
16 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf
17 |
18 | ---
19 |
20 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance.
Remember to change the permission of your key code `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 21 | 22 | 23 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 24 | 25 | 26 | --- 27 | 28 | > Update Software 29 | 30 | 31 | sudo bash 32 | apt-get update 33 | 34 | 35 | --- 36 | 37 | > Install updates 38 | 39 | 40 | apt-get -y upgrade 41 | 42 | 43 | --- 44 | 45 | > Install other software 46 | 47 | 48 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre 49 | 50 | 51 | - 52 | 53 | > Ok, for this lab we are going to use FastQC. There is a version available on apt-get, but it is an old version and we want to make sure that we have the most updated version.. Make sure you know what each of these commands does, rather than blindly copying and pasting..  54 | 55 | - 56 | 57 | 58 | cd $HOME 59 | wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip 60 | unzip fastqc_v0.11.2.zip 61 | cd FastQC/ 62 | chmod +x fastqc 63 | PATH=$PATH:$(pwd) 64 | 65 | 66 | --- 67 | 68 | > Install Trimmomatic 69 | 70 | 71 | cd $HOME 72 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip 73 | unzip Trimmomatic-0.32.zip 74 | cd Trimmomatic-0.32 75 | chmod +x trimmomatic-0.32.jar 76 | 77 | 78 | --- 79 | 80 | > Install Jellyfish 81 | 82 | 83 | cd $HOME 84 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz 85 | tar -zxf jellyfish-2.1.3.tar.gz 86 | cd jellyfish-2.1.3/ 87 | ./configure 88 | make 89 | PATH=$PATH:$(pwd)/bin 90 | 91 | 92 | --- 93 | 94 | > Download data. For this lab, we'll be using only 1 sequencing file. 95 | 96 | - 97 | 98 | 99 | cd /mnt 100 | wget https://s3.amazonaws.com/gen711/Pero360B.2.fastq.gz 101 | 102 | 103 | --- 104 | 105 | > Do 3 different trimming levels between 2 and 40. This one is trimming at a Phred score of 30 (BAD!!!) 
When you run your commands, you'll need to change the numbers in `LEADING:30` `TRAILING:30` `SLIDINGWINDOW:4:30` and `Pero360B.trim.Phred30.fastq` to whatever trimming level you want to use.
106 |
107 |
108 | mkdir /mnt/trimming
109 | cd /mnt/trimming
110 |
111 | #paste the below lines together as 1 command
112 |
113 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar SE \
114 | -threads 4 \
115 | ../Pero360B.2.fastq.gz \
116 | Pero360B.trim.Phred30.fastq \
117 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \
118 | SLIDINGWINDOW:4:30 \
119 | LEADING:30 \
120 | TRAILING:30 \
121 | MINLEN:25
122 |
123 |
124 |
125 | ---
126 | > After Trimmomatic is done, run FastQC. You'll have to change the numbers to match the levels you trimmed at.
127 |
128 |
129 | cd /mnt
130 | fastqc -t 4 Pero360B.2.fastq.gz
131 | fastqc -t 4 trimming/Pero360B.trim.Phred2.fastq
132 | fastqc -t 4 trimming/Pero360B.trim.Phred15.fastq
133 | fastqc -t 4 trimming/Pero360B.trim.Phred30.fastq
134 |
135 |
136 | ---
137 | > Run Jellyfish.
138 |
139 |
140 | mkdir /mnt/jelly
141 | cd /mnt/jelly
142 |
143 | # You'll have to run these commands 4 separate times -
144 | # once for each different trimmed dataset, and once for the raw dataset.
145 | # Change the names of the input and output files..
146 |
147 | jellyfish count -m 25 -s 200M -t 4 -C -o trim30.jf ../trimming/Pero360B.trim.Phred30.fastq
148 | jellyfish histo trim30.jf -o trim30.histo
149 |
150 |
151 | ---
152 | > Open up a new terminal window using command-t
153 |
154 |
155 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/*zip ~/Downloads/
156 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
157 |
158 |
159 | > Now, on your Mac, find the files you just downloaded - double click the zip files and that should unzip them. Click on the `html` file, which will open up your browser. Look at the results.
Try to figure out what each plot means.
160 |
161 | -
162 |
163 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
164 |
165 | -
166 |
167 | > OPEN RSTUDIO
168 |
169 |
170 | #Import all 3 histogram datasets: this is the code for importing 1 of them..
171 |
172 | trim2 <- read.table("~/Downloads/trim2.histo", quote="\"")
173 |
174 | #Plot: Make sure to change the names to match what you import.
175 | #What does this plot show you??
176 |
177 | barplot(c(trim2$V2[1],trim15$V2[1],trim30$V2[1]),
178 | names=c('Phred2', 'Phred15', 'Phred30'),
179 | main='Number of unique kmers')
180 |
181 | # plot differences between non-unique kmers
182 |
183 | plot(trim2$V2[2:30] - trim30$V2[2:30], type='l',
184 | xlim=c(2,20), xaxs="i", yaxs="i", frame.plot=F,
185 | ylim=c(0,2000000), col='red', xlab='kmer frequency',
186 | lwd=4, ylab='count',
187 | main='Diff in 25mer counts of freq 2 to 20 \n Phred2 vs. Phred30')
188 |
189 |
190 |
191 |
192 | > Look at the FastQC plots across the different trimming levels. Anything surprising?
193 |
194 | > What do the analyses of kmer counts tell you?
--------------------------------------------------------------------------------
/lab_lessons/Lab9_euk.genome.assembly.md:
--------------------------------------------------------------------------------
1 | Lab 9: Genome assembly
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with Genome Assembly using SPAdes. We will assemble the genome of Plasmodium falciparum. The data are taken from this paper: http://www.nature.com/ncomms/2014/140909/ncomms5754/full/ncomms5754.html?WT.ec_id=JA-NCOMMS-20140919.
7 | As it stands right now, I think that you will do all the preprocessing steps this week, then the assembly next. Once you have done all the steps, `gzip` compress the files and download them to your USB drive, or the Mac HD. I can provide you with these files next week if issues arise.
8 |
9 | 1.
Install software and download data 10 | 11 | 2. Error correct, quality trim, and adapter trim the data sets. 12 | 13 | 3. (next week) Assemble 14 | 15 | - 16 | 17 | The SPAdes manuscript: http://www.ncbi.nlm.nih.gov/pubmed/22506599 18 | The SPAdes manual: http://spades.bioinf.spbau.ru/release3.1.1/manual.html 19 | SPAdes website: http://bioinf.spbau.ru/spades 20 | 21 | > Step 1: Launch an AMI. For this exercise, we will use a c3.8xlarge (note different instance type). Remember to change the permissions of your key file: `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 22 | 23 | 24 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 25 | 26 | 27 | --- 28 | 29 | > Update Software 30 | 31 | 32 | sudo bash 33 | apt-get update 34 | 35 | 36 | --- 37 | 38 | > Install updates 39 | 40 | 41 | apt-get -y upgrade 42 | 43 | 44 | --- 45 | 46 | > Install other software 47 | 48 | 49 | apt-get -y install subversion tmux git curl bowtie libncurses5-dev samtools gcc make g++ python-dev unzip dh-autoreconf default-jre python-pip zlib1g-dev 50 | 51 | 52 | --- 53 | 54 | > Install Lighter, software for error correction.
55 | 56 | 57 | cd $HOME 58 | git clone https://github.com/mourisl/Lighter.git 59 | cd Lighter 60 | make -j8 61 | PATH=$PATH:$(pwd) 62 | 63 | 64 | --- 65 | 66 | > Install Trimmomatic 67 | 68 | 69 | cd $HOME 70 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip 71 | unzip Trimmomatic-0.32.zip 72 | cd Trimmomatic-0.32 73 | chmod +x trimmomatic-0.32.jar 74 | 75 | 76 | --- 77 | 78 | > Install SRAtoolkit 79 | 80 | 81 | cd $HOME 82 | wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.4.2/sratoolkit.2.4.2-ubuntu64.tar.gz 83 | tar -zxf sratoolkit.2.4.2-ubuntu64.tar.gz 84 | PATH=$PATH:/home/ubuntu/sratoolkit.2.4.2-ubuntu64/bin 85 | 86 | 87 | --- 88 | 89 | > Install SPAdes 90 | 91 | 92 | wget http://spades.bioinf.spbau.ru/release3.1.1/SPAdes-3.1.1-Linux.tar.gz 93 | tar -zxf SPAdes-3.1.1-Linux.tar.gz 94 | cd SPAdes-3.1.1-Linux 95 | PATH=$PATH:$(pwd)/bin 96 | 97 | 98 | --- 99 | 100 | > Download 3.5kb MP library 101 | 102 | 103 | cd /mnt 104 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR022/ERR022558/ERR022558.sra 105 | 106 | 107 | > Download 10kb MP library 108 | 109 | 110 | cd /mnt 111 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR022/ERR022557/ERR022557.sra 112 | 113 | 114 | > Download PE library #1 115 | 116 | 117 | cd /mnt 118 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR019/ERR019273/ERR019273.sra 119 | 120 | 121 | --- 122 | 123 | > Download PE library #2 124 | 125 | 126 | cd /mnt 127 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR019/ERR019275/ERR019275.sra 128 | 129 | 130 | --- 131 | 132 | > Extract FASTQ from SRA format. 133 | 134 | 135 | cd /mnt 136 | 137 | #this is a basic for loop. Copy it all as 1 command.
138 | 139 | for i in *.sra; do 140 | fastq-dump --split-files --split-spot $i; 141 | rm $i; 142 | done 143 | 144 | 145 | --- 146 | 147 | > Error Correct Data 148 | 149 | 150 | mkdir /mnt/ec 151 | cd /mnt/ec 152 | lighter -r /mnt/ERR019273_1.fastq -r /mnt/ERR019273_2.fastq -t 32 -k 21 45000000 .1 153 | lighter -r /mnt/ERR022557_1.fastq -r /mnt/ERR022557_2.fastq -t 32 -k 21 45000000 .1 154 | lighter -r /mnt/ERR022558_1.fastq -r /mnt/ERR022558_2.fastq -t 32 -k 21 45000000 .1 155 | lighter -r /mnt/ERR019275_1.fastq -r /mnt/ERR019275_2.fastq -t 32 -k 21 45000000 .1 156 | 157 | #remove the raw files. 158 | 159 | rm *fastq & 160 | 161 | 162 | > Trim the data: 163 | 164 | 165 | mkdir /mnt/trim 166 | cd /mnt/trim 167 | #paste the below lines together as 1 command 168 | 169 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 170 | -threads 32 -baseout PE_lib1.fq \ 171 | /mnt/ec/ERR019273_1.cor.fq \ 172 | /mnt/ec/ERR019273_2.cor.fq \ 173 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \ 174 | SLIDINGWINDOW:4:2 \ 175 | LEADING:2 \ 176 | TRAILING:2 \ 177 | MINLEN:25 178 | 179 | #paste the below lines together as 1 command 180 | 181 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 182 | -threads 32 -baseout PE_lib2.fq \ 183 | /mnt/ec/ERR019275_1.cor.fq \ 184 | /mnt/ec/ERR019275_2.cor.fq \ 185 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \ 186 | SLIDINGWINDOW:4:2 \ 187 | LEADING:2 \ 188 | TRAILING:2 \ 189 | MINLEN:25 190 | 191 | #paste the below lines together as 1 command 192 | 193 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 194 | -threads 32 -baseout MP10000.fq \ 195 | /mnt/ec/ERR022557_1.cor.fq \ 196 | /mnt/ec/ERR022557_2.cor.fq \ 197 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \ 198 | SLIDINGWINDOW:4:2 \ 199 | LEADING:2 \ 200 | TRAILING:2 \ 201 | MINLEN:25 202 | 203 | #paste the below lines together as 1 command 204 | 205 | java -Xmx10g -jar
$HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 206 | -threads 32 -baseout MP3500.fq \ 207 | /mnt/ec/ERR022558_1.cor.fq \ 208 | /mnt/ec/ERR022558_2.cor.fq \ 209 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \ 210 | SLIDINGWINDOW:4:2 \ 211 | LEADING:2 \ 212 | TRAILING:2 \ 213 | MINLEN:25 214 | 215 | 216 | 217 | 218 | --- 219 | 220 | 221 | cd /mnt/ec 222 | 223 | #remove the corrected files. 224 | 225 | rm *fq 226 | 227 | 228 | 229 | > Assembly. You may want to do this next week. Alternatively, you can put it in a tmux window and let it run; you'd have to log in later, however, to download the assembled genome. 230 | 231 | 232 | mkdir /mnt/spades 233 | cd /mnt/spades 234 | 235 | spades.py -t 32 -m 60 \ 236 | --pe1-1 /mnt/trim/PE_lib1_1P.fq \ 237 | --pe1-2 /mnt/trim/PE_lib1_2P.fq \ 238 | --pe2-1 /mnt/trim/PE_lib2_1P.fq \ 239 | --pe2-2 /mnt/trim/PE_lib2_2P.fq \ 240 | --mp1-1 /mnt/trim/MP3500_1P.fq \ 241 | --mp1-2 /mnt/trim/MP3500_2P.fq \ 242 | --mp2-1 /mnt/trim/MP10000_1P.fq \ 243 | --mp2-2 /mnt/trim/MP10000_2P.fq \ 244 | -o Pfal --only-assembler 245 | 246 | -------------------------------------------------------------------------------- /lab_lessons/Lab7_transcriptome_assembly.md: -------------------------------------------------------------------------------- 1 | Lab 7: Transcriptome assembly 2 | --- 3 | 4 | -- 5 | 6 | During this lab, we will acquaint ourselves with de novo transcriptome assembly using Trinity. You will: 7 | 8 | 1. Install software and download data 9 | 10 | 2. Error correct, quality trim, and adapter trim the data sets. 11 | 12 | 3. Apply digital normalization to the dataset. 13 | 14 | 4. Trinity assembly 15 | 16 | 5. Because the above steps will take a few hours, I am providing you with 2 datasets: one is the 10 million read dataset you used last week.
The other is that same 10M read dataset that I have error corrected, quality/adapter trimmed, normalized, and subsampled to 0.5 million reads (I did this so that the assembly could be done in a reasonable amount of time). For people who are going to do de novo transcriptome projects, and for students who will use something like this in their own research, it is probably worth going through the whole pipeline at some point. 17 | 18 | - 19 | 20 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf 21 | 22 | The Khmer manual: http://khmer.readthedocs.org/en/v1.1/ 23 | 24 | Trinity reference material: http://trinityrnaseq.sourceforge.net/ 25 | 26 | --- 27 | 28 | > Step 1: Launch an AMI. For this exercise, we will use an m3.2xlarge (note different instance type). Remember to change the permissions of your key file: `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it) 29 | 30 | 31 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com 32 | 33 | 34 | --- 35 | 36 | > Update Software 37 | 38 | 39 | sudo bash 40 | apt-get update 41 | 42 | 43 | --- 44 | 45 | > Install updates 46 | 47 | 48 | apt-get -y upgrade 49 | 50 | 51 | --- 52 | 53 | > Install other software 54 | 55 | 56 | apt-get -y install subversion tmux git curl bowtie libncurses5-dev samtools gcc make g++ python-dev unzip dh-autoreconf default-jre python-pip zlib1g-dev 57 | 58 | 59 | --- 60 | 61 | > Install Lighter, software for error correction.
62 | 63 | 64 | cd $HOME 65 | git clone https://github.com/mourisl/Lighter.git 66 | cd Lighter && make -j8 67 | PATH=$PATH:$(pwd) 68 | 69 | 70 | --- 71 | 72 | > Install Trimmomatic 73 | 74 | 75 | cd $HOME 76 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip 77 | unzip Trimmomatic-0.32.zip 78 | cd Trimmomatic-0.32 79 | chmod +x trimmomatic-0.32.jar 80 | 81 | 82 | --- 83 | 84 | > Install Trinity 85 | 86 | 87 | cd $HOME 88 | svn checkout svn://svn.code.sf.net/p/trinityrnaseq/code/trunk trinityrnaseq-code 89 | cd trinityrnaseq-code 90 | make -j8 91 | PATH=$PATH:$(pwd) 92 | 93 | 94 | --- 95 | 96 | > Install Khmer 97 | 98 | 99 | cd $HOME 100 | pip install screed pysam 101 | git clone https://github.com/ged-lab/khmer.git 102 | cd khmer 103 | make -j8 104 | make install 105 | PATH=$PATH:$(pwd)/scripts 106 | 107 | 108 | --- 109 | > Download data. For this lab, these data are to be used by people wanting to do the whole pipeline. Most people will want the other dataset linked below. 110 | 111 | 112 | cd /mnt 113 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_1.fastq.gz 114 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_2.fastq.gz 115 | 116 | 117 | > Alternatively, you can download the pre-corrected, trimmed, normalized datasets. Sadly, I had to subsample this dataset severely (to 500,000 reads) so that we could assemble it in a lab period... 118 | 119 | 120 | cd /mnt 121 | wget https://www.dropbox.com/s/eo3wrx6lvngq3ja/ec.P2.C25.left.fq.gz 122 | wget https://www.dropbox.com/s/eycchg3m2my2ag2/ec.P2.C25.right.fq.gz 123 | 124 | 125 | --- 126 | 127 | > Error correct (do this step if you are working with the raw data only). Note you will have to uncompress the data if you are doing these steps. I chose the software 'lighter' because it is 1. probably good and 2. fast! It is written by Ben Langmead, the author of several of the PowerPoint lectures I posted last week.
128 | 129 | 130 | mkdir /mnt/ec 131 | cd /mnt/ec 132 | lighter -r /mnt/raw.10M.SRR797058_1.fastq -r /mnt/raw.10M.SRR797058_2.fastq -t 8 -k 25 100000000 .1 133 | 134 | 135 | --- 136 | 137 | > Trim (do this step if you are working with the raw data only) 138 | 139 | 140 | mkdir /mnt/trimming 141 | cd /mnt/trimming 142 | 143 | #paste the below lines together as 1 command 144 | 145 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \ 146 | -threads 8 -baseout ec.P2trim.fastQ \ 147 | /mnt/ec/raw.10M.SRR797058_1.cor.fq \ 148 | /mnt/ec/raw.10M.SRR797058_2.cor.fq \ 149 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \ 150 | SLIDINGWINDOW:4:2 \ 151 | LEADING:2 \ 152 | TRAILING:2 \ 153 | MINLEN:25 154 | 155 | 156 | --- 157 | 158 | > Run Khmer (do this step if you are working with the raw data only) 159 | 160 | 161 | mkdir /mnt/khmer 162 | cd /mnt/khmer 163 | interleave-reads.py /mnt/trimming/ec.P2trim.fastQ_1P /mnt/trimming/ec.P2trim.fastQ_2P -o interleaved.fq 164 | normalize-by-median.py -p -x 15e8 -k 25 -C 25 --out khmer_normalized.fq interleaved.fq 165 | split-paired-reads.py khmer_normalized.fq 166 | 167 | 168 | --- 169 | 170 | > Run Trinity - everybody do this. If you are running with the raw data, you'll have to change the names of the input files. Note that I am using `--min_kmer_cov 2` in the command below. This is only so that you can get through the assembly in a short amount of time. DO NOT USE THIS OPTION IN 'REAL LIFE' AS IT WILL MAKE YOUR ASSEMBLY WORSE!!! This should take ~30 minutes, so use this time to talk to your group members, or whatever else. 171 | 172 | 173 | mkdir /mnt/trinity 174 | cd /mnt/trinity 175 | Trinity --seqType fq --JM 20G --min_kmer_cov 2 \ 176 | --left /mnt/ec.P2.C25.left.fq \ 177 | --right /mnt/ec.P2.C25.right.fq \ 178 | --CPU 8 --output ec.P2trim.C25 --group_pairs_distance 999 --inchworm_cpu 8 179 | 180 | 181 | --- 182 | 183 | > Generate length-based stats from your assembly. What do these mean?
184 | 185 | 186 | $HOME/trinityrnaseq-code/util/TrinityStats.pl ec.P2trim.C25/Trinity.fasta 187 | 188 | 189 | 190 | > Let's look for coding sequences. Before we can do this, we need to install a Perl module using the cpan command. 191 | 192 | 193 | cpan URI::Escape 194 | $HOME/trinityrnaseq-code/trinity-plugins/TransDecoder_r20140704/TransDecoder --CPU 8 -t ec.P2trim.C25/Trinity.fasta 195 | 196 | 197 | > This will take a few minutes. Once done, you will have a file of amino acid sequences, and a file of coding sequences. Look at how many coding sequences you found, and how many were complete (have a start and stop codon) vs. fragmented in one way or another. What do these numbers mean?? What would you hope these numbers would look like? What does `grep -c` do? 198 | 199 | 200 | $HOME/trinityrnaseq-code/util/TrinityStats.pl Trinity.fasta.transdecoder.pep 201 | grep -c complete Trinity.fasta.transdecoder.pep 202 | grep -c internal Trinity.fasta.transdecoder.pep 203 | grep -c 5prime Trinity.fasta.transdecoder.pep 204 | grep -c 3prime Trinity.fasta.transdecoder.pep 205 | 206 | --------------------------------------------------------------------------------
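A footnote on the `grep -c` question at the end of Lab 7: `grep -c` counts matching *lines*, not total matches, which works here because each predicted ORF gets one header line. Here is a minimal sketch you can run anywhere; the file `toy.pep` and its headers are made up for illustration, not real TransDecoder output:

```shell
# grep -c reports the number of LINES that contain the pattern.
# Counting header lines that carry each completeness tag therefore
# counts ORFs per class. Toy headers for illustration only:
printf '>m.1 type:complete\n>m.2 type:internal\n>m.3 type:complete\n>m.4 type:5prime_partial\n' > toy.pep
grep -c complete toy.pep   # -> 2 (two header lines contain "complete")
grep -c internal toy.pep   # -> 1
grep -c 5prime toy.pep     # -> 1
```

Note that if a pattern appeared twice on one line, `grep -c` would still count that line once.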