├── README.md
├── lab_lessons
│   ├── README.md
│   ├── Lab8_mapping.md
│   ├── Lab10_bacterial_genome_assembly.md
│   ├── Lab3_hmmer.md
│   ├── Lab4_fastq.md
│   ├── Lab1_unix.md
│   ├── Lab2_blast.md
│   ├── Lab6_khmer.md
│   ├── Lab5_trimming.md
│   ├── Lab9_euk.genome.assembly.md
│   └── Lab7_transcriptome_assembly.md
└── student_code
    ├── transrate.txt
    ├── shortunwrapped.txt
    ├── bless_assembly_no_norm (6).txt
    ├── LK_bless_norm.txt
    ├── sga_assembly_no_norm.txt
    ├── bless_assembly_with_norm (5).txt
    ├── sga_assembly_with_norm.txt
    └── unwrapped.txt
/README.md:
--------------------------------------------------------------------------------
1 | Gen711
2 | ======
3 |
--------------------------------------------------------------------------------
/lab_lessons/README.md:
--------------------------------------------------------------------------------
1 | This folder contains the (version 1) lab lessons for Gen711/811, released under a CC-BY license.
2 |
3 | This class was taught for the 1st time in Fall 2014 at the University of New Hampshire to a class of 25 (half undergrad, half grad) with NO programming experience. We spent 2 hours per week in the computer lab doing these labs. The course website is here: http://genomebio.org/Gen711/
4 |
5 | Please feel free to fork/send me pull requests, or otherwise incorporate as you see fit.
6 |
--------------------------------------------------------------------------------
/student_code/transrate.txt:
--------------------------------------------------------------------------------
1 | #Transrate for epididymus
2 | transrate -a /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta \
3 | -r /home/lauren/mus_protein_db/Mus_musculus.GRCm38.pep.all.fa \
4 | -l /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq \
5 | -i /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq \
6 | -o /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL -t 24
7 |
8 |
9 |
10 | #The elements of this code are as follows
11 | -a ASSEMBLY (fasta file)
12 | -r REFERENCE (fasta file)
13 | -l LEFT READS (numbered R1)
14 | -i RIGHT READS (numbered R2)
15 | -o OUTPUT FILE (.FULL)
--------------------------------------------------------------------------------
/student_code/shortunwrapped.txt:
--------------------------------------------------------------------------------
1 | #shortunwrapped.txt
2 | #I created this intermediate file to confirm that the awk command was working (it is); unfortunately, the sed command is not!
3 | #This file tells you how many files the awk lines create: run this shortunwrapped program before your unwrapped program
4 | #to determine how many temporary file lines (xa_) you must write into your unwrapped program.
5 |
6 | #so the following sed command is useless:
7 | sed ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta > \
8 | /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta
9 |
10 | #Filter based on score. This is the command that works and will tell you how many files you are creating:
11 | awk -F "," '.3>$17{next}1' /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv | \
12 | awk -F "," '{print $1}' | sed '1,1d' | split -l 9000
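#A minimal follow-up sketch of the "how many files" check itself: once split has run, just count the xa* pieces.

ls xa* | wc -l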
--------------------------------------------------------------------------------
/student_code/bless_assembly_no_norm (6).txt:
--------------------------------------------------------------------------------
1 | #!/usr/bin/make -rRsf
2 | all: /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq.gz \
3 | /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq.gz \
4 | /mnt/data3/lah/mattsclass/no_norm/testes.bless_no_norm_trinity.fasta
5 |
6 | #############################BLESS##############################################
7 |
8 | /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq.gz:/mnt/data3/lah/mattsclass/testes.R1.fastq /mnt/data3/lah/mattsclass/testes.R2.fastq
9 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'`
10 | echo Results will be in a file named *corrected.fastq.gz
11 | echo Settings used: bless kmerlength = 25
12 | bless -kmerlength 25 -read1 /mnt/data3/lah/mattsclass/testes.R1.fastq -read2 /mnt/data3/lah/mattsclass/testes.R2.fastq -verify -notrim -prefix /mnt/data3/lah/mattsclass/no_norm/testes
13 | gzip /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq &
14 |
15 |
16 | #######################Trimmomatic/Trinity##########################
17 |
18 | /mnt/data3/lah/mattsclass/no_norm/testes.bless_no_norm_trinity.fasta:\
19 | /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq.gz
20 | Trinity --seqType fq --JM 50G --trimmomatic --left /mnt/data3/lah/mattsclass/no_norm/testes.1.corrected.fastq.gz --right /mnt/data3/lah/mattsclass/no_norm/testes.2.corrected.fastq.gz --CPU 12 --output testes.bless_no_norm_trinity \
21 | --quality_trimming_params "ILLUMINACLIP:/opt/trinity/trinity-plugins/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25"
22 |
--------------------------------------------------------------------------------
/student_code/LK_bless_norm.txt:
--------------------------------------------------------------------------------
1 |
2 | #!/usr/bin/make -rRsf
3 |
4 | ###########################################
5 | ### -usage 'bless_assembly_with_norm.mk RUN=run CPU=8 MEM=15'
6 | ### -RUN= name of run
7 | ###
8 | ############################################
9 |
10 | #$@
11 |
12 |
13 | MEM=5
14 | CPU=5
15 | RUN=run
16 | #READ1=/home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq
17 | #READ2=/home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq
18 |
19 |
20 |
21 | all:/home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz \
22 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz \
23 | /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta
24 |
25 | #############################BLESS##############################################
26 |
27 | /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz:\
28 | /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq
29 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'`
30 | echo Results will be in a file named *corrected.fastq.gz
31 | echo Settings used: bless kmerlength = 25
32 | bless -kmerlength 25 -read1 /home/lauren/Documents/NYGenomeCenter/epidiymus.R1.fastq \
33 | -read2 /home/lauren/Documents/NYGenomeCenter/epidiymus.R2.fastq -verify -notrim -prefix epi
34 | gzip /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq
35 |
36 |
37 | ##########################khmer###############################
38 |
39 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz:\
40 | /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz
41 | echo BEGIN NORMALIZATION `date +'%a %d%b%Y %H:%M:%S'`
42 | echo Settings used: normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9
43 | interleave-reads.py /home/lauren/Documents/NYGenomeCenter/epi.1.corrected.fastq.gz /home/lauren/Documents/NYGenomeCenter/epi.2.corrected.fastq.gz -o /home/lauren/Documents/NYGenomeCenter/epi_corrected_inter.fastq #interleave to a separate file so the normalization input and output differ
44 | normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 --out /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq /home/lauren/Documents/NYGenomeCenter/epi_corrected_inter.fastq
45 | gzip /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq
46 |
47 |
48 | #######################Trimmomatic/Trinity##########################
49 |
50 | /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta: \
51 | /home/lauren/Documents/NYGenomeCenter/epi_corrected_internorm.fastq.gz
52 | Trinity --seqType fq --JM 50G --trimmomatic --single $< --CPU 12 --output /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity \
53 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25"
54 |
--------------------------------------------------------------------------------
/student_code/sga_assembly_no_norm.txt:
--------------------------------------------------------------------------------
1 | #!/usr/bin/make -rRsf
2 |
3 | ### -usage 'sga_assembly_no_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/Documents/testes.R1.fastq.gz READ2=/home/lauren/Documents/testes.R2.fastq.gz'
4 |
5 |
6 | ### -RUN= name of run
7 |
8 | MEM=5
9 | CPU=5
10 | RUN=run
11 | READ1=/home/lauren/Documents/testes.R1.fastq.gz
12 | READ2=/home/lauren/Documents/testes.R2.fastq.gz
13 |
14 | SHELL=/bin/bash -o pipefail
15 | # SGA version
16 | SGA=sga-0.10.12
17 | DWGSIM=dwgsim
18 | REPORT=sga-preqc-report.py
19 |
20 |
21 | #change the data
22 | #This re-names my samples:
23 | #samp1 := p_eremicus
24 |
25 |
26 | #Below after the all command are the final output files from SGA and normalization and trinity:
27 | #Must be added in order of completion!!!
28 | all:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz \
29 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt \
30 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz \
31 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity.fasta
32 |
33 |
34 | #################################SGA####################################
35 |
36 | # Pre-process the dataset: recall that NEED gz file form for this step
37 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz:/home/lauren/Documents/testes.R1.fastq.gz /home/lauren/Documents/testes.R2.fastq.gz
38 | sga preprocess --pe-mode 1 /home/lauren/Documents/testes.R1.fastq.gz /home/lauren/Documents/testes.R2.fastq.gz > \
39 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq
40 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq
41 |
42 |
43 | # Build the FM-index
44 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz
45 | cd /home/lauren/Documents/p_eremicus/ && sga index -a ropebwt -t 8 --no-reverse $<
46 |
47 |
48 | # Make the preqc file for the short read set
49 | #%.preqc: %.bwt %.fastq.gz
50 | # $(SGA) preqc -t 8 $(patsubst %.bwt, %.fastq.gz, $<) > $@
51 |
52 | # Final PDF report
53 | #main_report.pdf: p_eremicus.preqc
54 | # python $(REPORT) $+
55 | # mv preqc_report.pdf $@
56 |
57 |
58 | # SGA correction
59 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz
60 | cd /home/lauren/Documents/p_eremicus/ && sga correct -k 41 --discard --learn -t 8 -o /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz
61 |
62 |
63 |
64 |
65 | #######################Trimmomatic/Trinity##########################
66 |
67 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity.fasta:/home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz
68 | Trinity --seqType fq --JM 50G --trimmomatic \
69 | --single $< \
70 | --CPU $(CPU) --output /home/lauren/Documents/p_eremicus/p_eremicus_sga_no_norm_trinity \
71 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25"
--------------------------------------------------------------------------------
/student_code/bless_assembly_with_norm (5).txt:
--------------------------------------------------------------------------------
1 | #!/usr/bin/make -rRsf
2 |
3 | ###########################################
4 | ### -usage 'bless_assembly_with_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/transcriptome/Pero360T.1.fastq.gz READ2=/home/lauren/transcriptome/Pero360T.2.fastq.gz READ3=/location/of/read3.fastq READ4=/location/of/read4.fastq '
5 | ### -RUN= name of run
6 | ###
7 | ############################################
8 |
9 | #$@
10 |
11 | ##files we need are p_eremics.READNAME.fastq_1&2
12 | ##### mus_musculus.READNAME.fastq_1&2
13 | ##### file directories /mnt/data3/lah/mattsclass/p_eremicus #where output gets put
14 | ##### /mnt/data3/lah/mattsclass/p_eremicus/raw #reads will be here
15 | ##### /mnt/data3/lah/mattsclass/mus_musculus #where output gets put
16 | ##### /mnt/data3/lah/mattsclass/mus_musculus/raw #reads will be here
17 | ##### Run in mattsclass folder??
18 |
19 |
20 | MEM=5
21 | CPU=5
22 | RUN=run
23 | #READ1=/mnt/data3/lah/mattsclass/testes.R1.fastq
24 | #READ2=/mnt/data3/lah/mattsclass/testes.R2.fastq
25 |
26 |
27 | all:/mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz /mnt/data3/lah/mattsclass/testes.bless_norm_trinity.fasta
28 | #all output files in order of correction
29 | ##ADD OUTPUT AS WE'RE GOING!!####
30 |
31 |
32 | #############################BLESS##############################################
33 |
34 |
35 |
36 | #all:/mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz
37 |
38 |
39 | #output files: the corrected reads; input files: the raw reads
40 | /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz:/mnt/data3/lah/mattsclass/testes.R1.fastq /mnt/data3/lah/mattsclass/testes.R2.fastq
41 | echo BEGIN ERROR CORRECTION: `date +'%a %d%b%Y %H:%M:%S'`
42 | echo Results will be in a file named *corrected.fastq.gz
43 | echo Settings used: bless kmerlength = 25
44 | bless -kmerlength 25 -read1 /mnt/data3/lah/mattsclass/testes.R1.fastq -read2 /mnt/data3/lah/mattsclass/testes.R2.fastq -verify -notrim -prefix /mnt/data3/lah/mattsclass/testes
45 | gzip /mnt/data3/lah/mattsclass/testes.1.corrected.fastq /mnt/data3/lah/mattsclass/testes.2.corrected.fastq &
46 |
47 |
48 | ##########################khmer###############################
49 |
50 | #output file: the interleaved, normalized reads; input files: the corrected reads from bless (interleave these)
51 | /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz:/mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz
52 | echo BEGIN NORMALIZATION `date +'%a %d%b%Y %H:%M:%S'`
53 | echo Settings used: normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9
54 | interleave-reads.py /mnt/data3/lah/mattsclass/testes.1.corrected.fastq.gz /mnt/data3/lah/mattsclass/testes.2.corrected.fastq.gz -o /mnt/data3/lah/mattsclass/bless_corrected.inter.fastq
55 | normalize-by-median.py -p -k 25 -C 50 -N 4 -x 15e9 --out /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq /mnt/data3/lah/mattsclass/bless_corrected.inter.fastq
56 | gzip /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq
57 |
58 |
59 | #######################Trimmomatic/Trinity##########################
60 |
61 | /mnt/data3/lah/mattsclass/testes.bless_norm_trinity.fasta: \
62 | /mnt/data3/lah/mattsclass/bless_corrected.inter.norm.fastq.gz
63 | Trinity --seqType fq --JM 50G --trimmomatic --single $< --CPU 12 --output /mnt/data3/lah/mattsclass/testes.bless_norm_trinity \
64 | --quality_trimming_params "ILLUMINACLIP:/opt/trinity/trinity-plugins/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25"
65 |
--------------------------------------------------------------------------------
/student_code/sga_assembly_with_norm.txt:
--------------------------------------------------------------------------------
1 | #!/usr/bin/make -rRsf
2 |
3 | ### -usage 'sga_assembly_with_norm.mk RUN=run CPU=8 MEM=15 READ1=/home/lauren/Documents/testes.R1.fastq.gz READ2=/home/lauren/Documents/testes.R2.fastq.gz'
4 |
5 |
6 | ### -RUN= name of run
7 |
8 | MEM=5
9 | CPU=5
10 | RUN=run
11 | READ1=/home/lauren/Documents/testes.R1.fastq.gz
12 | READ2=/home/lauren/Documents/testes.R2.fastq.gz
13 |
14 | SHELL=/bin/bash -o pipefail
15 | # SGA version
16 | SGA=sga-0.10.12
17 | DWGSIM=dwgsim
18 | REPORT=sga-preqc-report.py
19 |
20 |
21 | #change the data
22 | #This re-names my samples:
23 | #samp1 := p_eremicus
24 |
25 |
26 | #Below after the all command are the final output files from SGA and normalization and trinity:
27 | #Must be added in order of completion!!!
28 | all:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz \
29 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt \
30 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz \
31 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz \
32 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity.fasta
33 |
34 |
35 | #################################SGA####################################
36 |
37 | # Pre-process the dataset: recall that NEED gz file form for this step
38 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz:/home/lauren/Documents/testes.R1.fastq.gz /home/lauren/Documents/testes.R2.fastq.gz
39 | sga preprocess --pe-mode 1 /home/lauren/Documents/testes.R1.fastq.gz /home/lauren/Documents/testes.R2.fastq.gz > \
40 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq
41 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq
42 |
43 |
44 | # Build the FM-index
45 | /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.bwt:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz
46 | cd /home/lauren/Documents/p_eremicus/ && sga index -a ropebwt -t 8 --no-reverse $<
47 |
48 | # Make the preqc file for the short read set
49 | #%.preqc: %.bwt %.fastq.gz
50 | # $(SGA) preqc -t 8 $(patsubst %.bwt, %.fastq.gz, $<) > $@
51 |
52 | # Final PDF report
53 | #main_report.pdf: p_eremicus.preqc
54 | # python $(REPORT) $+
55 | # mv preqc_report.pdf $@
56 |
57 |
58 | # SGA correction
59 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz:/home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz
60 | cd /home/lauren/Documents/p_eremicus/ && sga correct -k 41 --discard --learn -t 8 -o /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz /home/lauren/Documents/p_eremicus/p_eremicus_preprocessed.fastq.gz
61 |
62 |
63 | ##########################khmer###############################
64 |
65 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz:\
66 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz
67 | cd /home/lauren/Documents/p_eremicus/ && normalize-by-median.py -k 25 -C 50 -N 4 -x 15e9 --out \
68 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq \
69 | /home/lauren/Documents/p_eremicus/p_eremicus_sga.fastq.gz
70 | gzip /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq
71 |
72 |
73 | #######################Trimmomatic/Trinity##########################
74 |
75 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity.fasta:\
76 | /home/lauren/Documents/p_eremicus/p_eremicus_sga_corrected_inter_norm.fastq.gz
77 | Trinity --seqType fq --JM 50G --trimmomatic \
78 | --single $< \
79 | --CPU $(CPU) --output /home/lauren/Documents/p_eremicus/p_eremicus_sga_with_norm_trinity \
80 | --quality_trimming_params "ILLUMINACLIP:/opt/trinityrnaseq-code/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:40:15 LEADING:2 TRAILING:2 MINLEN:25"
81 |
82 |
83 |
84 |
85 |
86 |
87 |
88 |
--------------------------------------------------------------------------------
/lab_lessons/Lab8_mapping.md:
--------------------------------------------------------------------------------
1 | Lab 8: Read Mapping
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with read mapping using BWA. You will:
7 |
8 | 1. Install software and download data
9 |
10 | 2. Use sra-toolkit to extract fastQ reads
11 |
12 | 3. Map reads to a reference
13 |
14 | 4. Look at mapping quality
15 |
16 | -
17 |
18 | The BWA manual: http://bio-bwa.sourceforge.net/
19 |
20 | Flag info: http://broadinstitute.github.io/picard/explain-flags.html
21 |
22 | ---
23 |
24 | > Step 1: Launch an AMI. For this exercise, we will use a c3.2xlarge (yet another instance type). Remember to change the permissions on your key: `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
25 |
26 |
27 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
28 |
29 |
30 | ---
31 |
32 | > Update Software
33 |
34 |
35 | sudo bash
36 | apt-get update
37 |
38 |
39 | ---
40 |
41 | > Install updates
42 |
43 |
44 | apt-get -y upgrade
45 |
46 |
47 | ---
48 |
49 | > Install other software
50 |
51 |
52 | apt-get -y install subversion tmux git curl samtools gcc make g++ python-dev unzip dh-autoreconf default-jre zlib1g-dev
53 |
54 |
55 | ---
56 |
57 |
58 | cd $HOME
59 | git clone https://github.com/lh3/bwa.git
60 | cd bwa
61 | make -j4
62 | PATH=$PATH:$(pwd)
63 |
64 |
65 | ---
66 |
67 |
68 | cd $HOME
69 | wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.4.2/sratoolkit.2.4.2-ubuntu64.tar.gz
70 | tar -zxf sratoolkit.2.4.2-ubuntu64.tar.gz
71 | PATH=$PATH:/home/ubuntu/sratoolkit.2.4.2-ubuntu64/bin
72 |
73 |
74 |
75 | > Download data
76 |
77 |
78 | mkdir /mnt/data
79 | cd /mnt/data
80 | wget http://datadryad.org/bitstream/handle/10255/dryad.72141/brain.final.fasta
81 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR157/SRR1575395/SRR1575395.sra
82 |
83 |
84 | > Convert SRA format into fastQ (takes a few minutes)
85 |
86 |
87 | cd /mnt/data
88 | fastq-dump --split-files --split-spot SRR1575395.sra
89 |
90 |
91 | > Map reads!! (20 minutes)
92 |
93 |
94 | mkdir /mnt/mapping
95 | cd /mnt/mapping
96 | tmux new -s mapping
97 | bwa index -p index /mnt/data/brain.final.fasta
98 | bwa mem -t8 index /mnt/data/SRR1575395_1.fastq /mnt/data/SRR1575395_2.fastq > brain.sam
99 |
100 |
101 | > Look at SAM file.
102 |
103 |
104 |
105 | #Take a quick general look.
106 |
107 | head brain.sam
108 | tail brain.sam
109 |
110 | #Count how many reads are in the fastq files. `grep -c` counts the number of occurrences of the pattern, which in this case is `^@`. I am looking for lines that begin with (specified by `^`) the @ character.
111 |
112 | grep -c ^@ ../data/SRR1575395_1.fastq ../data/SRR1575395_2.fastq
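#Note: `^@` can overcount, because FASTQ quality strings may also begin with the @ character. A
#safer sketch counts 4-line records instead; this prints the combined read count for both files:

awk 'END{print NR/4}' ../data/SRR1575395_1.fastq ../data/SRR1575395_2.fastq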
113 |
114 | #Count the number of reads mapping with flag 65/67. The 1st part of this command, `awk`, pulls out the second column of the files, and counts everything that has either 65 or 67. What do these flags correspond to?
115 |
116 | awk '{print $2}' brain.sam | grep ^6 | grep -c '65\|67'
117 |
118 | #why do we need the `grep ^6` thing in there... try `awk '{print $2}' brain.sam | grep '65\|67' | wc -l`
119 |
120 | #what about this??
121 |
122 | awk '{print $2}' brain.sam | grep '^65\|^67' | wc -l
123 |
124 |
125 | > Can you pull out the number of mismatches by targeting the NM tag in column 12?
126 |
127 |
128 |
129 | #I'm giving you the last bit of the awk code. You have to figure out the 1st awk command and the 1st grep command. This will send the number of mismatches to a file `mismatches.txt`. Can you download it to your USB drive or hard drive and plot the results, find the mean number of mismatches, etc.?
130 |
131 | awk | grep | awk -F ":" '{print $3}' > mismatches.txt
132 |
133 |
--------------------------------------------------------------------------------
/lab_lessons/Lab10_bacterial_genome_assembly.md:
--------------------------------------------------------------------------------
1 | Lab 10: Bacterial Genome Assembly
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with genome assembly using SPAdes. We will assemble the genome of E. coli. The data are taken from here: https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly/blob/master/Sources.md.
7 |
8 | 1. Install software and download data
9 |
10 | 2. Error correct, quality and adapter trim data sets.
11 |
12 | 3. Assemble
13 |
14 | -
15 |
16 | The SPAdes manuscript: http://www.ncbi.nlm.nih.gov/pubmed/22506599
17 | The SPAdes manual: http://spades.bioinf.spbau.ru/release3.1.1/manual.html
18 | SPAdes website: http://bioinf.spbau.ru/spades
19 | ABySS webpage: https://github.com/bcgsc/abyss
20 |
21 | -
22 |
23 | > Step 1: Launch an AMI. For this exercise, we will use a c3.2xlarge (note different instance type). Remember to change the permissions on your key: `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
24 |
25 |
26 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
27 |
28 |
29 | ---
30 |
31 | > Update Software
32 |
33 |
34 | sudo bash
35 | apt-get update
36 |
37 |
38 | ---
39 |
40 | > Install updates
41 |
42 |
43 | apt-get -y upgrade
44 |
45 |
46 | ---
47 |
48 | > Install other software
49 |
50 |
51 | apt-get -y install subversion tmux git curl libncurses5-dev gcc make g++ python-dev unzip dh-autoreconf zlib1g-dev libboost1.55-dev sparsehash openmpi*
52 |
53 |
54 |
55 | ---
56 |
57 | > Install SPAdes
58 |
59 |
60 | cd $HOME
61 | wget http://spades.bioinf.spbau.ru/release3.1.1/SPAdes-3.1.1-Linux.tar.gz
62 | tar -zxf SPAdes-3.1.1-Linux.tar.gz
63 | cd SPAdes-3.1.1-Linux
64 | PATH=$PATH:$(pwd)/bin
65 |
66 |
67 | ---
68 |
69 | > Install ABySS
70 |
71 |
72 | cd $HOME
73 | git clone https://github.com/bcgsc/abyss.git
74 | cd abyss
75 | ./autogen.sh
76 | ./configure --enable-maxk=128 --prefix=/usr/local/ --with-mpi=/usr/lib/openmpi/
77 | make -j4
78 | make all install
79 |
80 |
81 | -
82 |
83 | > Install a script for assembly evaluation.
84 |
85 |
86 | git clone https://github.com/lexnederbragt/sequencetools.git
87 | cd sequencetools/
88 | PATH=$PATH:$(pwd)
89 |
90 |
91 | > Download and unpack the data
92 |
93 |
94 | cd /mnt
95 | wget https://s3.amazonaws.com/gen711/ecoli_data.tar.gz
96 | tar -zxf ecoli_data.tar.gz
97 |
98 |
99 | > Assemble. Try this with different data combos (with mate pair data, without, with minION data and without, etc). Remember to name your assemblies something different using the `-o` flag. SPAdes has a built-in error correction tool (remove `--only-assembler`). Does double error correction seem to make a difference?
100 |
101 |
102 | mkdir /mnt/spades
103 | cd /mnt/spades
104 |
105 | spades.py -t 8 -m 15 --only-assembler --mp1-rf -k 127 \
106 | --pe1-1 /mnt/ecoli_pe.1.fq \
107 | --pe1-2 /mnt/ecoli_pe.2.fq \
108 | --mp1-1 /mnt/nextera.1.fq \
109 | --mp1-2 /mnt/nextera.2.fq \
110 | --pacbio /mnt/minion.data.fasta \
111 | -o Ecoli_all_data
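#A quick flag rundown (a sketch of the options used above; see the SPAdes manual for the full
#list): --only-assembler skips SPAdes' built-in read error correction, --mp1-rf sets the
#mate-pair library orientation to reverse-forward, -k 127 fixes the kmer size, and --pacbio is
#used here to pass in the long minION reads.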
112 |
113 |
114 | ---
115 |
116 | > Evaluate Assemblies
117 |
118 |
119 | abyss-fac Ecoli_all_data/scaffolds.fasta
120 |
121 | #take a closer look.
122 |
123 | assemblathon_stats.pl Ecoli_all_data/scaffolds.fasta
124 |
125 |
126 |
127 |
128 | > Assembling with ABySS (optional)
129 |
130 |
131 | mkdir /mnt/abyss
132 | cd /mnt/abyss
133 |
134 | abyss-pe np=8 k=127 name=ecoli lib='pe1' mp='mp1' long='minion' \
135 | pe1='/mnt/ecoli_pe.1.fq /mnt/ecoli_pe.2.fq' \
136 | mp1='/mnt/nextera.1.fq /mnt/nextera.2.fq' \
137 | minion='/mnt/minion.data.fasta' mp1_l=30
138 |
139 |
140 |
--------------------------------------------------------------------------------
/lab_lessons/Lab3_hmmer.md:
--------------------------------------------------------------------------------
1 | Lab 3: HMMER
2 | --
3 | During this lab, we will acquaint ourselves with the software package HMMER. Your objectives are:
4 |
5 | -
6 |
7 | 1. Familiarize yourself with the software, how to execute it, how to visualize results.
8 |
9 | 2. Regarding your dataset: characterize a few conserved domains.
10 |
11 | The HMMER manual ftp://selab.janelia.org/pub/software/hmmer3/3.1b1/Userguide.pdf
12 |
13 | The HMMER webpage: http://hmmer.janelia.org/
14 |
15 | ---
16 |
17 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance. Remember to change the permissions on your key: `chmod 400 ~/Downloads/your.pem` (change your.pem to whatever you named it)
18 |
19 |
20 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
21 |
22 |
23 | ---
24 |
25 |
26 | sudo bash
27 | apt-get update
28 | apt-get -y upgrade
29 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre
30 |
31 |
32 | -
33 |
34 | > Ok, for this lab we are going to use HMMER
35 |
36 | -
37 |
38 |
39 | cd $HOME
40 | wget http://selab.janelia.org/software/hmmer3/3.1b1/hmmer-3.1b1-linux-intel-x86_64.tar.gz
41 | tar -zxf hmmer-3.1b1-linux-intel-x86_64.tar.gz
42 | cd hmmer-3.1b1-linux-intel-x86_64/
43 | ./configure
44 | make && make all install
45 | make check
46 |
47 |
48 | ---
49 |
50 | -
51 |
52 | > You will download one of the 5 different datasets (use the same dataset you used last week). Do you remember how to use the `wget` and `gzip` commands from last week? Also, download Swissprot and Pfam-A
53 |
54 | -
55 |
56 |
57 | cd /mnt
58 |
59 | #download your dataset
60 |
61 | dataset1= https://www.dropbox.com/s/srfk4o2bh1qmq6l/dataset1.fa.gz
62 | dataset2= https://www.dropbox.com/s/977n0ibznzuor22/dataset2.fa.gz
63 | dataset3= https://www.dropbox.com/s/8s2h7sm6xtoky6q/dataset3.fa.gz
64 | dataset4= https://www.dropbox.com/s/qth3mjrianb48a6/dataset4.fa.gz
65 | dataset5= https://www.dropbox.com/s/quexoxfh6ttmudo/dataset5.fa.gz
66 |
67 | #download the SwissProt database
68 |
69 | wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
70 |
71 | #download the Pfam-A database
72 |
73 | wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
74 |
75 |
76 | > we are going to run HMMER to identify conserved protein domains. This will take a little while, and we'll use `tmux` to allow us to do this in the background, and continue to work on other things.
77 |
78 |
79 | gzip -d *gz
80 | tmux new -s pfam
81 | hmmpress Pfam-A.hmm #this is analogous to 'makeblastdb'
82 | hmmscan -E 1e-3 --domtblout dataset.pfam --cpu 4 Pfam-A.hmm dataset1.fa
83 | ctl-b d
84 | top -c #see that hmmscan is running..
85 |
86 |
87 | > the neat thing about HMMER is that it can be used as a replacement for blastP or PSI-blast.
88 |
89 |
90 | #blastp-like HBB-HUMAN is a Hemoglobin B protein sequence.
91 |
92 | phmmer --domtblout hbb.phmmer -E 1e-5 \
93 | /home/ubuntu/hmmer-3.1b1-linux-intel-x86_64/tutorial/HBB_HUMAN \
94 | uniprot_sprot.fasta
95 |
96 | #PSI-blast-like
97 |
98 | jackhmmer --domtblout hbb.jackhmmer -E 1e-5 \
99 | /home/ubuntu/hmmer-3.1b1-linux-intel-x86_64/tutorial/HBB_HUMAN \
100 | uniprot_sprot.fasta
101 |
102 | #you can look at the results using `more hbb.phmmer` or `more hbb.jackhmmer`. Try blasting a few of the results using the BLAST web interface.
103 |
104 |
105 | > Now let's look at the Pfam results. This analysis may still be running, but we can look at it while it's still in progress.
106 |
107 |
108 | more dataset.pfam
109 | #There are a bunch of columns in this table - what do they mean?
110 |
111 | #Try to extract all the hits to a specific domain. Google a few domains (column 1) to see if any seem interesting.
112 |
113 | #for instance, find all occurrences of ABC_tran
114 | grep ABC_tran dataset.pfam
115 |
116 | #use grep to count the number of matches. Copy this number down.
117 |
118 | grep -c ABC_tran dataset.pfam
119 |
120 | #Find all the contigs that have an ABC_tran domain.
121 |
122 | grep ABC_tran dataset.pfam | awk '{print $4}' | sort | uniq
123 |
124 |
125 | > Just for fun, check on the Pfam search to see what it is doing...
126 |
127 |
128 | tmux attach -t pfam
129 | ctl-b d
130 |
131 |
--------------------------------------------------------------------------------
/lab_lessons/Lab4_fastq.md:
--------------------------------------------------------------------------------
1 | Lab4: Processing fastQ and fastA
2 | --
3 |
4 | During this lab, we will acquaint ourselves with the software packages FastQC and JellyFish. Your objectives are:
5 |
6 | -
7 |
8 | 1. Familiarize yourself with the software, how to execute it, how to visualize results.
9 |
10 | 2. Regarding your dataset: characterize sequence quality.
11 |
12 | The FastQC manual: http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
13 |
14 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf
15 |
16 | ---
17 |
18 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance. Remember to change the permissions on your key: `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
19 |
20 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
21 |
22 |
23 | ---
24 |
25 | > Update Software
26 |
27 | sudo bash
28 | apt-get update
29 |
30 |
31 | ---
32 |
33 | > Install updates
34 |
35 | apt-get -y upgrade
36 |
37 |
38 | ---
39 |
40 | > Install other software
41 |
42 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre
43 |
44 |
45 | -
46 |
47 | > Ok, for this lab we are going to use FastQC. There is a version available on apt-get, but it is an old version, and we want to make sure that we have the most updated one. Make sure you know what each of these commands does, rather than blindly copying and pasting.
48 |
49 | -
50 |
51 | cd $HOME
52 | wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip
53 | unzip fastqc_v0.11.2.zip
54 | cd FastQC/
55 | chmod +x fastqc
56 | PATH=$PATH:$(pwd)
57 |
58 |
59 | ---
60 |
61 | > Download data, and uncompress them. What does the `-cd` flag mean with respect to gzip?
62 |
63 | -
64 |
65 | cd /mnt
66 | wget https://s3.amazonaws.com/gen711/Pero360B.1.fastq.gz
67 | wget https://s3.amazonaws.com/gen711/Pero360B.2.fastq.gz
68 | gzip -cd /mnt/Pero360B.1.fastq.gz > /mnt/Pero360B.1.fastq &
69 | gzip -cd /mnt/Pero360B.2.fastq.gz > /mnt/Pero360B.2.fastq &
70 |
71 |
72 | ---
73 | > Install Fastool, a neat and fast tool used for fastQ --> fastA
74 |
75 | -
76 |
77 | cd $HOME
78 | git clone https://github.com/fstrozzi/Fastool.git
79 | cd Fastool/
80 | make
81 | PATH=$PATH:$(pwd)
82 |
83 |
84 | ---
85 | > Use Fastool to convert from fastQ to fastA
86 |
87 | -
88 |
89 | cd /mnt
90 | fastool --to-fasta Pero360B.1.fastq > Pero360B.1.fasta &
91 | fastool --to-fasta Pero360B.2.fastq > Pero360B.2.fasta &
92 |
93 |
94 | ---
95 | > While Fastool is working, let's install JellyFish. Again, make sure you know what each of these commands does, rather than just copying and pasting.
96 |
97 | -
98 |
99 | cd $HOME
100 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
101 | tar -zxf jellyfish-2.1.3.tar.gz
102 | cd jellyfish-2.1.3/
103 | ./configure
104 | make
105 | PATH=$PATH:$(pwd)/bin
106 |
107 |
108 | ---
109 | > Run FastQC. Make sure to look at the manual to see what the different outputs mean.
110 |
111 | cd /mnt
112 | fastqc -t 4 Pero360B.1.fastq Pero360B.2.fastq
113 |
114 |
115 | ---
116 | > Run Jellyfish. Make sure to look at the manual.
117 |
118 | cd /mnt
119 | mkdir jelly
120 | cd jelly
121 | jellyfish count -F2 -m 25 -s 200M -t 4 -C ../Pero360B.1.fasta ../Pero360B.2.fasta
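#flag notes (per the JellyFish manual): -m 25 = count 25-mers, -s 200M = initial hash size,
#-t 4 = threads, -C = count canonical kmers (a kmer and its reverse complement are collapsed),
#-F 2 = open 2 input files at once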
122 | jellyfish histo mer_counts.jf > Pero360B.histo
123 | head -50 Pero360B.histo
124 |
125 |
126 | ---
127 | > Open up a new terminal window using Command-T
128 |
129 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/*zip ~/Downloads/
130 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
131 |
132 |
133 | > Now, on your Mac, find the files you just downloaded - for the zip files, double-click and that should unzip them. Click on the `html` file, which will open up your browser. Look at the results. Try to figure out what each plot means.
134 |
135 | -
136 |
137 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
138 |
139 | -
140 |
141 | > OPEN RSTUDIO
142 |
143 |
144 | #Import Data
145 | histo <- read.table("~/Downloads/Pero360B.histo", quote="\"")
146 | head(histo)
147 |
148 | #Plot
149 | plot(histo$V2 ~ histo$V1, type='h')
150 |
151 | #That one sucks, but what does it tell you about the kmer distribution?
152 |
153 | #Maybe this one is better?
154 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,100))
155 |
156 | #Better. what is xlim? Maybe we can still improve?
157 |
158 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,500), ylim=c(0,1000000))
159 |
160 | #Final plot
161 |
162 | plot(histo$V2 ~ histo$V1, type='h', xlim=c(0,500), ylim=c(0,1000000),
163 | col='blue', frame.plot=F, xlab='25-mer frequency', ylab='Count',
164 | main='Kmer distribution in brain sample before quality trimming')
165 |
166 |
167 |
168 | > Done?
--------------------------------------------------------------------------------
/student_code/unwrapped.txt:
--------------------------------------------------------------------------------
1 | #Make trinity assembly file 'unwrapped'
2 |
3 | #Purpose of this is to filter contigs out of the epididymus assembly that have a contig score <0.3
4 | #I need to do this because my epidiymus transcriptome assembly had a low transrate score of 0.192
5 | #and we hope that by filtering out these low quality contigs, this may improve the transrate score
6 | #assembly: /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta
7 | #csv contigs: /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv
8 |
9 |
10 | #This sed command does not work; the input is the trinity.fasta file and the intended output is unwrapped_epi.fasta
11 |
12 | sed ':begin;$!N;/[ACTGNn-]\n[ACTGNn-]/s/\n//;tbegin;P;D' /home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta > \
13 | /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta
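#A possible alternative (a sketch, assuming the same input/output paths as above): unwrap the
#fasta with awk instead, printing each accumulated sequence before the next header line:

awk '/^>/{if(s)print s; print; s=""} !/^>/{s=s $0} END{if(s)print s}' \
/home/lauren/Documents/NYGenomeCenter/epi_bless_norm_trinity.fasta:/Trinity_Fixed.fasta > \
/home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta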
14 |
15 | #Filter based on score. This command does work:
16 |
17 | awk -F "," '.3>$17{next}1' /home/lauren/Documents/NYGenomeCenter/epi.contig_score.FULL_Trinity_Fixed.fasta_contigs.csv | \
18 | awk -F "," '{print $1}' | sed '1,1d' | split -l 9000
19 |
20 | #This will give you a bunch of files xaa, xab, xac, etc. Each of them contains
21 | #the names of the 'good contigs'. Now we need to retrieve them from the original fasta file.
22 | #The number of temporary files below matches the number of xa_ files generated,
23 | #as determined by running the shortunwrapped program first:
24 |
25 | for i in $(cat xaa); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp1.fa; done &
26 | for i in $(cat xab); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp2.fa; done &
27 | for i in $(cat xac); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp3.fa; done &
28 | for i in $(cat xad); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp4.fa; done &
29 | for i in $(cat xae); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp5.fa; done &
30 | for i in $(cat xaf); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp6.fa; done &
31 | for i in $(cat xag); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp7.fa; done &
32 | for i in $(cat xah); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp8.fa; done &
33 | for i in $(cat xai); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp9.fa; done &
34 | for i in $(cat xaj); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp10.fa; done &
35 | for i in $(cat xak); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp11.fa; done &
36 | for i in $(cat xal); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp12.fa; done &
37 | for i in $(cat xam); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp13.fa; done &
38 | for i in $(cat xan); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp14.fa; done &
39 | for i in $(cat xao); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp15.fa; done &
40 | for i in $(cat xap); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp16.fa; done &
41 | for i in $(cat xaq); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp17.fa; done &
42 | for i in $(cat xar); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp18.fa; done &
43 | for i in $(cat xas); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp19.fa; done &
44 | for i in $(cat xat); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp20.fa; done &
45 | for i in $(cat xau); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp21.fa; done &
46 | for i in $(cat xav); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp22.fa; done &
47 | for i in $(cat xaw); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp23.fa; done &
48 | for i in $(cat xax); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp24.fa; done &
49 | for i in $(cat xay); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp25.fa; done &
50 | for i in $(cat xaz); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp26.fa; done &
51 | for i in $(cat xba); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp27.fa; done &
52 |
53 | #One command for each of the xa* files. I do it like this to save
54 | #time. Each xa* file is being processed on a different core.
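#An equivalent loop-based sketch (same idea, no hardcoding; one background job per xa* file):

n=0
for f in xa*; do
  n=$((n+1))
  (for i in $(cat $f); do grep -A1 --max-count=1 -w $i /home/lauren/Documents/NYGenomeCenter/unwrapped_epi.fasta >> temp$n.fa; done) &
done
wait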
55 |
56 | #Lastly, concatenate all the temporary files together and get rid of the now unneeded files:
57 |
58 | #The > sign indicates that all temp files are concatenated into a NEW file: CAT_unwrapped_epi.fasta
59 | #However, the command line does not like the temp* glob!
60 | cat temp* > /home/lauren/Documents/NYGenomeCenter/CAT_unwrapped_epi.fasta
61 | rm temp* x*
62 |
63 |
--------------------------------------------------------------------------------
/lab_lessons/Lab1_unix.md:
--------------------------------------------------------------------------------
1 | Lab 1
2 | --
3 |
4 | During this lab, we will acquaint ourselves with the Unix terminal, learn how to access data, install software, and find things. *it is absolutely critical that you master these skills*, so please ask questions if confused.
5 |
6 | > Step 1: Launch an AMI. For this exercise, a t1.micro will be sufficient.
7 |
8 |
9 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
10 |
11 |
12 |
13 | > The machine you are using is Linux Ubuntu: Ubuntu is an operating system you can use (I do) on your laptop or desktop. One of the nice things about this OS is the ability to update the software, easily. The command `sudo apt-get update` checks a server for updates to existing software.
14 |
15 |
16 | sudo apt-get update
17 |
18 |
19 | >The upgrade command actually installs any of the required updates.
20 |
21 | sudo apt-get upgrade
22 |
23 | >OK, what are these commands? `sudo` is the command that tells the computer that we have admin privileges. Try running the commands without the sudo -- it will complain that you don't have admin privileges or something like that. *Careful here, using sudo means that you can do something really bad to your own computer -- like delete everything*, so use with caution. It's not a big worry when using AWS, as this is a virtual machine- fixing your worst mistake is as easy as just terminating the instance and restarting.
24 |
25 | -
26 |
27 | > So now that we have updated the software, let's see how to add new software. Same basic command, but instead of the `update` or `upgrade` command, we're using `install`. EASY!!
28 |
29 | -
30 | sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip \
31 | default-jre
32 |
33 | -
34 |
35 | >After you run this command, try to install something else: R (a stats package - more on this wonderful software later). The package is named `r-base-core`. See if you can install it!! Installing software on Linux is easy (so long as there is a downloadable package - more on what to do when no such package exists later in lab)
36 |
37 | -
38 |
39 | >BTW, did you notice the `\` at the end of line 1 in the above code snippet?? That is a special character we use to break up a single line of code over 2 or more lines. You'll see me use this a lot!
40 |
41 | -
42 |
43 | >OK, let's try our hands at navigating around on the command line - it is not scary!
44 |
45 | Important UNIX rules
46 | --
47 |
48 | * Everything is case sensitive. Gen711 is not the same as gen711
49 | * Spaces in file names should be avoided
50 | * The unix $PATH is the collection of locations where the computer looks for executables (programs) - see the quick example after this list
51 | * Folders and Files are all you have. If you want to access one of these, you need to tell the computer *EXACTLY* where it is. `/home/macmanes/gen711/exam1_key.txt` will work (assuming you've spelled things correctly, and that the file really exists in that location), but `exam1_key.txt` may not.
52 |
53 | * Lines that begin with a `#` are comments.
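>A quick example of the $PATH idea (a sketch): print the search path, then ask where a program lives.

echo $PATH   #the list of folders the shell searches for programs
which ls     #reports which of those folders contains the `ls` program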
54 |
55 | Basic shell commands
56 | --
57 |
58 | >the `pwd` command returns your current location.
59 |
60 | pwd
61 |
62 | -
63 |
64 | >the `ls` command lists the files and folders present in your current directory. Try `ls -lt` and `ls -lth`. *What is the difference between these commands?* Try typing `man ls` to learn about all the different flags.
65 |
66 | ls -l
67 |
68 | -
69 |
70 | >create a file
71 |
72 | nano hello.txt
73 | #The nano text editor will appear -> type something
74 | This is my 1st unix file
75 | CTL-x
76 | y
77 | #typing n would get rid of the text you just wrote.
78 |
79 | -
80 |
81 | >look at the file, there are several ways to look at the file
82 |
83 | head -5 hello.txt #this shows you the 1st 5 lines of the file
84 | more hello.txt #this shows you the whole file, 1 screen at a time. Space bar to advance, q to quit
85 |
86 | -
87 |
88 | >make a copy of the file, using a different name, then remove it.
89 |
90 | cp hello.txt bye.txt
91 | ls -lth
92 | rm bye.txt
93 | ls -lth
94 |
95 | -
96 |
97 | >move the file (or rename it). What is the difference between `mv` and `cp`???
98 |
99 | mv hello.txt bye.txt
100 | ls -lth
101 |
102 | -
103 |
104 | >make a folder (directory), make a file inside a folder.
105 |
106 | mkdir testfolder
107 | ls -lth
108 | #make a folder inside that folder
109 | mkdir testfolder/inside_test
110 | #make a file
111 | nano testfolder/inside_test/inside.txt
112 | head testfolder/inside_test/inside.txt
113 | rm testfolder/inside_test/inside.txt
114 |
115 | >there are a few other commands that you should be familiar with: `sort`, `cat`, `clear`, `tail`, `history`. Try googling and using `man` to figure them out.
116 |
117 | Downloading Data and Stuff
118 | --
119 |
120 | >download something from the web. You're using the `wget` command. You're downloading the SwissProt database. See http://www.ebi.ac.uk/uniprot
121 |
122 | mkdir swissprot
123 | cd swissprot
124 | wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz
125 |
126 | -
127 |
128 | >It will take a few minutes to download. After it's downloaded, you'll need to extract it. Files ending in `.gz` are compressed, just like `.zip`, which is a type of file compression you may be more familiar with.
129 |
130 | gzip -d uniprot_sprot.fasta.gz
131 |
132 | -
133 |
134 | >Can you tell me what type of file this is? Use the commands we used above to look at the 1st few lines.
135 |
136 | ???
137 |
138 | >There is some info that is complementary to this material found here: http://swcarpentry.github.io/2014-08-21-upenn/novice/ref/01-shell.html
--------------------------------------------------------------------------------
/lab_lessons/Lab2_blast.md:
--------------------------------------------------------------------------------
1 | Lab 2: BLAST
2 | --
3 |
4 | During this lab, we will acquaint ourselves with the software package BLAST. Your objectives are:
5 |
6 | -
7 |
8 | 1. Familiarize yourself with the software, how to execute it, how to visualize results.
9 |
10 | 2. Regarding your dataset, tell me how some of these genes are related to their homologous copies.
11 |
12 | -
13 |
14 | ---
15 |
16 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance.
17 |
18 |
19 | ssh -i ~/Downloads/your.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
20 |
21 | -
22 |
23 | > The machine you are using is Linux Ubuntu: Ubuntu is an operating system you can use (I do) on your laptop or desktop. One of the nice things about this OS is the ability to update the software, easily. The command `sudo apt-get update` checks a server for updates to existing software.
24 |
25 | -
26 |
27 |
28 | sudo apt-get update
29 |
30 | -
31 |
32 | > The upgrade command actually installs any of the required updates.
33 |
34 |
35 | sudo apt-get -y upgrade
36 |
37 |
38 | > OK, what are these commands? `sudo` is the command that tells the computer that we have admin privileges. Try running the commands without the sudo -- it will complain that you don't have admin privileges or something like that. *Careful here, using sudo means that you can do something really bad to your own computer -- like delete everything*, so use with caution. It's not a big worry when using AWS, as this is a virtual machine- fixing your worst mistake is as easy as just terminating the instance and restarting.
39 |
40 | -
41 |
42 | > So now that we have updated the software, let's see how to add new software. Same basic command, but instead of the `update` or `upgrade` command, we're using `install`. EASY!!
43 |
44 | -
45 |
46 |
47 | sudo apt-get -y install tmux git curl gcc make g++ python-dev unzip \
48 | default-jre
49 |
50 | -
51 |
52 | > Ok, for this lab we are going to use BLAST, which is available as a package entitled `ncbi-blast+`
53 |
54 | -
55 |
56 |
57 | sudo apt-get -y install ???
58 |
59 | -
60 |
61 | > to get a feel for the different options, type `blastp -help`. Which type of blast does this correspond to? Look at the help info for blastp and tblastx
62 |
63 | -
64 |
65 | > Let's go root
66 |
67 |
68 | sudo bash
69 |
70 | ---
71 |
72 | Install mafft and RAxML
73 | --
74 |
75 | > Let's install mafft so that we can do an alignment (http://mafft.cbrc.jp/alignment/software/)
76 |
77 |
78 | cd $HOME
79 | wget http://mafft.cbrc.jp/alignment/software/mafft-7.164-without-extensions-src.tgz
80 | tar -zxf mafft-7.164-without-extensions-src.tgz
81 | cd mafft-7.164-without-extensions/core
82 | sudo make && sudo make install
83 | PATH=$PATH:/home/ubuntu/mafft-7.164-without-extensions/core
84 |
85 | -
86 |
87 | > Now let's install RAxML so that we can make a phylogeny (https://github.com/stamatak/standard-RAxML)
88 |
89 |
90 | cd $HOME
91 | git clone https://github.com/stamatak/standard-RAxML.git
92 | cd standard-RAxML/
93 | make -f Makefile.PTHREADS.gcc
94 | PATH=$PATH:/home/ubuntu/standard-RAxML
95 |
96 | -
97 |
98 | > remember, for blasting we need both some data (a query) and a database. Let's start with the data 1st. You will have one of the 5 different datasets. Do you remember how to use the `wget` and `gzip` commands from last week?
99 |
100 | -
101 |
102 |
103 | cd /mnt
104 | dataset1= https://www.dropbox.com/s/srfk4o2bh1qmq6l/dataset1.fa.gz
105 | dataset2= https://www.dropbox.com/s/977n0ibznzuor22/dataset2.fa.gz
106 | dataset3= https://www.dropbox.com/s/8s2h7sm6xtoky6q/dataset3.fa.gz
107 | dataset4= https://www.dropbox.com/s/qth3mjrianb48a6/dataset4.fa.gz
108 | dataset5= https://www.dropbox.com/s/quexoxfh6ttmudo/dataset5.fa.gz
109 |
110 | -
111 |
112 | > Now let's download the database. For this exercise we will use Swissprot: `ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz`
113 |
114 | > unzip this file using `gzip -d`
115 |
116 | ---
117 |
118 | Make blast database and blast
119 | --
120 |
121 | -
122 |
123 | > make a blast database
124 |
125 |
126 | makeblastdb -in uniprot_sprot.fasta -out uniprot -dbtype prot
127 |
128 | > Now we are ready to blast.
129 |
130 |
131 | head -n20 dataset1.fa > test.fa
132 | blastp -evalue 1e-10 -num_threads 4 -db uniprot -query test.fa -outfmt 6
133 |
134 | > You will see the results in a table with 12 columns. Use `blastp -help` to see what the results mean.
135 |
136 | > Test out some of the blast options. Try changing the word size `-word_size`, scoring matrix, evalue, cost to open or extend a gap. See how these changes affect the results.
137 |
138 | > After you've done this, you should make a file containing the query and the hits.
139 |
140 |
141 | grep -A4 -w AAseq_1 dataset1.fa
142 |
143 | #use the dataset you have, and substitute your contig for AAseq_1
144 | #increase -A4 until the whole contig is displayed.
145 | #copy and paste it into nano.
146 | #do the same for the database matches.
147 |
148 | grep -A4 'sp|Q6GZX4|001R_FRG3G' uniprot_sprot.fasta
149 |
150 | ---
151 |
152 | mafft
153 | --
154 |
155 | > Align the proteins using mafft
156 |
157 |
158 | mafft --reorder --bl 80 --auto for.align > for.tree
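#flag notes (see the mafft documentation): --bl 80 selects the BLOSUM80 scoring matrix,
#--auto lets mafft choose an appropriate alignment strategy, and --reorder sorts the
#output sequences by similarity rather than input order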
159 |
160 | ---
161 |
162 | RAxML
163 | --
164 |
165 | > Make a phylogeny
166 |
167 |
168 | raxmlHPC-PTHREADS -help
169 | raxmlHPC-PTHREADS -f a -m PROTCATBLOSUM62 -T 4 -x 34 -N 100 -n tree -s for.tree -p 35
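#flag sketch (per `raxmlHPC-PTHREADS -help`): -f a = rapid bootstrap plus a search for the
#best-scoring ML tree, -m = protein model, -T = threads, -x = rapid-bootstrap seed,
#-N 100 = bootstrap replicates, -n = run name, -s = input alignment, -p = parsimony seed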
170 |
171 | > Copy phylogeny and view online.
172 |
173 |
174 | more RAxML_bipartitionsBranchLabels.tree
175 |
176 | #copy this info.
177 |
178 | > Visualize tree on website: http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
--------------------------------------------------------------------------------
/lab_lessons/Lab6_khmer.md:
--------------------------------------------------------------------------------
1 | Lab 6: khmer
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with digital normalization. You will:
7 |
8 | 1. Install software and download data
9 |
10 | 2. Quality and adapter trim data sets.
11 |
12 | 3. Apply digital normalization to the dataset.
13 |
14 | 4. Count and compare kmers and kmer distributions in the normalized and un-normalized dataset.
15 |
16 | 5. Plot in RStudio.
17 |
18 | -
19 |
20 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf
21 |
22 | The Khmer manual: http://khmer.readthedocs.org/en/v1.1/
23 |
24 | ---
25 |
26 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance. Remember to change the permissions on your key: `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
27 |
28 |
29 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
30 |
31 |
32 | ---
33 |
34 | > Update Software
35 |
36 |
37 | sudo bash
38 | apt-get update
39 |
40 |
41 | ---
42 |
43 | > Install updates
44 |
45 |
46 | apt-get -y upgrade
47 |
48 |
49 | ---
50 |
51 | > Install other software
52 |
53 |
54 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre python-pip zlib1g-dev
55 |
56 |
57 | ---
58 |
59 | > Install Trimmomatic
60 |
61 |
62 | cd $HOME
63 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip
64 | unzip Trimmomatic-0.32.zip
65 | cd Trimmomatic-0.32
66 | chmod +x trimmomatic-0.32.jar
67 |
68 |
69 | ---
70 |
71 | > Install Jellyfish
72 |
73 |
74 | cd $HOME
75 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
76 | tar -zxf jellyfish-2.1.3.tar.gz
77 | cd jellyfish-2.1.3/
78 | ./configure
79 | make -j4
80 | PATH=$PATH:$(pwd)/bin
81 |
82 |
83 | ---
84 |
85 | > Install Khmer
86 |
87 |
88 | cd $HOME
89 | pip install screed pysam
90 | git clone https://github.com/ged-lab/khmer.git
91 | cd khmer
92 | make -j4
93 | make install
94 | PATH=$PATH:$(pwd)/scripts
95 |
96 |
97 | ---
98 | > Download data. For this lab, we'll be using a smaller dataset that consists of 10 million paired-end reads.
99 |
100 | -
101 |
102 |
103 | cd /mnt
104 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_1.fastq.gz
105 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_2.fastq.gz
106 |
107 |
108 | ---
109 |
110 | > Trim low quality bases and adapters from the dataset. These files will form the basis of all our subsequent analyses.
111 |
112 | -
113 |
114 |
115 | mkdir /mnt/trimming
116 | cd /mnt/trimming
117 |
118 | #paste the below lines together as 1 command
119 |
120 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
121 | -threads 4 -baseout P2.trimmed.fastQ \
122 | /mnt/raw.10M.SRR797058_1.fastq.gz \
123 | /mnt/raw.10M.SRR797058_2.fastq.gz \
124 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \
125 | SLIDINGWINDOW:4:2 \
126 | LEADING:2 \
127 | TRAILING:2 \
128 | MINLEN:25
129 |
130 |
131 | ---
132 | > Run Jellyfish on the un-normalized dataset.
133 |
134 |
135 | mkdir /mnt/jelly
136 | cd /mnt/jelly
137 |
138 | jellyfish count -m 25 -F2 -s 700M -t 4 -C -o trimmed.jf /mnt/trimming/P2.trimmed.fastQ_1P /mnt/trimming/P2.trimmed.fastQ_2P
139 | jellyfish histo trimmed.jf -o trimmed.histo
140 |
141 |
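> A quick way to peek at what Jellyfish produced. Each line of the .histo file is a kmer frequency followed by the number of distinct 25mers observed at that frequency:

head trimmed.histo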
142 | ---
143 |
144 | > Run Khmer
145 |
146 |
147 | mkdir /mnt/khmer
148 | cd /mnt/khmer
149 | interleave-reads.py /mnt/trimming/P2.trimmed.fastQ_1P /mnt/trimming/P2.trimmed.fastQ_2P -o interleaved.fq
150 | normalize-by-median.py -p -x 15e8 -k 25 -C 50 --out khmer_normalized.fq interleaved.fq
151 |
152 |
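> To see how much digital normalization shrank the dataset, count the reads in each file (a fastq record is 4 lines):

echo $(( $(wc -l < interleaved.fq) / 4 ))
echo $(( $(wc -l < khmer_normalized.fq) / 4 ))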
153 | ---
154 |
155 | > Run Jellyfish on the normalized dataset.
156 |
157 |
158 | cd /mnt/jelly
159 |
160 | jellyfish count -m 25 -s 700M -t 4 -C -o khmer.jf /mnt/khmer/khmer_normalized.fq
161 | jellyfish histo khmer.jf -o khmer.histo
162 |
163 |
164 | > Open up a new terminal window using Command-T
165 |
166 |
167 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
168 |
169 |
170 | > Now, on your Mac, confirm that the two `.histo` files you just downloaded are in `~/Downloads/`.
171 |
172 | -
173 |
174 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
175 |
176 | -
177 |
178 | > OPEN RSTUDIO
179 |
180 |
181 | #Import both histogram datasets:
182 |
183 | khmer <- read.table("~/Downloads/khmer.histo", quote="\"")
184 | trim <- read.table("~/Downloads/trimmed.histo", quote="\"")
185 |
186 | #What does this plot show you??
187 |
188 | barplot(c(trim$V2[1],khmer$V2[1]),
189 | names=c('Non-normalized', 'C50 Normalized'),
190 | main='Number of unique kmers')
191 |
192 | # plot differences between non-unique kmers
193 |
194 | plot(khmer$V2[10:300] - trim$V2[10:300], type='l',
195 | xlim=c(10,300), xaxs="i", yaxs="i", frame.plot=F,
196 | ylim=c(-10000,60000), col='red', xlab='kmer frequency',
197 | lwd=4, ylab='count',
198 | main='Diff in 25mer counts of \n normalized vs. un-normalized datasets')
199 | abline(h=0)
200 |
201 |
202 |
203 | ---
204 |
205 |
206 |
207 | -
208 |
209 | -
210 |
211 | > What do the analyses of kmer counts tell you?
--------------------------------------------------------------------------------
/lab_lessons/Lab5_trimming.md:
--------------------------------------------------------------------------------
1 | Lab 5: Trimming
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with the software packages FastQC and JellyFish. Your objectives are:
7 |
8 | -
9 |
10 | 1. Familiarize yourself with the software, how to execute it, how to visualize results.
11 |
12 | 2. Characterize the sequence quality of your dataset.
13 |
14 | The FastQC manual: http://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
15 |
16 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf
17 |
18 | ---
19 |
20 | > Step 1: Launch an AMI. For this exercise, we will use a c3.xlarge instance. Remember to change the permissions on your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
21 |
22 |
23 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
24 |
25 |
26 | ---
27 |
28 | > Update Software
29 |
30 |
31 | sudo bash
32 | apt-get update
33 |
34 |
35 | ---
36 |
37 | > Install updates
38 |
39 |
40 | apt-get -y upgrade
41 |
42 |
43 | ---
44 |
45 | > Install other software
46 |
47 |
48 | apt-get -y install tmux git curl gcc make g++ python-dev unzip default-jre
49 |
50 |
51 | -
52 |
53 | > Ok, for this lab we are going to use FastQC. There is a version available on apt-get, but it is old, and we want the most up-to-date version. Make sure you know what each of these commands does, rather than blindly copying and pasting.
54 |
55 | -
56 |
57 |
58 | cd $HOME
59 | wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip
60 | unzip fastqc_v0.11.2.zip
61 | cd FastQC/
62 | chmod +x fastqc
63 | PATH=$PATH:$(pwd)
64 |
65 |
66 | ---
67 |
68 | > Install Trimmomatic
69 |
70 |
71 | cd $HOME
72 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip
73 | unzip Trimmomatic-0.32.zip
74 | cd Trimmomatic-0.32
75 | chmod +x trimmomatic-0.32.jar
76 |
77 |
78 | ---
79 |
80 | > Install Jellyfish
81 |
82 |
83 | cd $HOME
84 | wget ftp://ftp.genome.umd.edu/pub/jellyfish/jellyfish-2.1.3.tar.gz
85 | tar -zxf jellyfish-2.1.3.tar.gz
86 | cd jellyfish-2.1.3/
87 | ./configure
88 | make
89 | PATH=$PATH:$(pwd)/bin
90 |
91 |
92 | ---
93 |
94 | > Download data. For this lab, we'll be using only 1 sequencing file.
95 |
96 | -
97 |
98 |
99 | cd /mnt
100 | wget https://s3.amazonaws.com/gen711/Pero360B.2.fastq.gz
101 |
102 |
103 | ---
104 |
105 | > Do 3 different trimming levels between 2 and 40. This example trims at a Phred score of 30 (BAD!!!). When you run your commands, you'll need to change the numbers in `LEADING:30` `TRAILING:30` `SLIDINGWINDOW:4:30` and the output name `Pero360B.trim.Phred30.fastq` to whatever trimming level you want to use; there is also a loop sketch after the command.
106 |
107 |
108 | mkdir /mnt/trimming
109 | cd /mnt/trimming
110 |
111 | #paste the below lines together as 1 command
112 |
113 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar SE \
114 | -threads 4 \
115 | ../Pero360B.2.fastq.gz \
116 | Pero360B.trim.Phred30.fastq \
117 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \
118 | SLIDINGWINDOW:4:30 \
119 | LEADING:30 \
120 | TRAILING:30 \
121 | MINLEN:25
122 |
123 |
124 |
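> If you'd rather not edit the command by hand for each level, a loop works too. A sketch, assuming you pick the levels 2, 15, and 30 used in the FastQC commands below:

for Q in 2 15 30; do
java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar SE \
-threads 4 \
../Pero360B.2.fastq.gz \
Pero360B.trim.Phred${Q}.fastq \
ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \
SLIDINGWINDOW:4:${Q} LEADING:${Q} TRAILING:${Q} MINLEN:25
done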
125 | ---
126 | > After Trimmomatic is done, run FastQC. You'll have to change the numbers to match the levels you trimmed at.
127 |
128 |
129 | cd /mnt
130 | fastqc -t 4 Pero360B.2.fastq.gz
131 | fastqc -t 4 trimming/Pero360B.trim.Phred2.fastq
132 | fastqc -t 4 trimming/Pero360B.trim.Phred15.fastq
133 | fastqc -t 4 trimming/Pero360B.trim.Phred30.fastq
134 |
135 |
136 | ---
137 | > Run Jellyfish.
138 |
139 |
140 | mkdir /mnt/jelly
141 | cd /mnt/jelly
142 |
143 | # You'll have to run these commands 4 separate times -
144 | # once for each different trimmed dataset, and once for the raw dataset.
145 | # Change the names of the input and output files..
146 |
147 | jellyfish count -m 25 -s 200M -t 4 -C -o trim30.jf ../trimming/Pero360B.trim.Phred30.fastq
148 | jellyfish histo trim30.jf -o trim30.histo
149 |
150 |
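> Note: the raw dataset is still gzipped, and jellyfish reads plain fastq. One way to count it without writing an uncompressed copy to disk (a sketch using bash process substitution):

jellyfish count -m 25 -s 200M -t 4 -C -o raw.jf <(zcat ../Pero360B.2.fastq.gz)
jellyfish histo raw.jf -o raw.histo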
151 | ---
152 | > Open up a new terminal window using Command-T
153 |
154 |
155 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/*zip ~/Downloads/
156 | scp -i ~/Downloads/????.pem ubuntu@ec2-??-???-???-??.compute-1.amazonaws.com:/mnt/jelly/*histo ~/Downloads/
157 |
158 |
159 | > Now, on your Mac, find the files you just downloaded. Double-click the zip files to unzip them, then click the `html` file, which will open in your browser. Look at the results and try to figure out what each plot means.
160 |
161 | -
162 |
163 | > Now look at the `.histo` file, which is a kmer distribution. I want you to plot the distribution using R and RStudio.
164 |
165 | -
166 |
167 | > OPEN RSTUDIO
168 |
169 |
170 | #Import all 3 histogram datasets: this is the code for importing 1 of them.
171 |
172 | trim2 <- read.table("~/Downloads/trim2.histo", quote="\"")
173 |
174 | #Plot: Make sure and change the names to match what you import.
175 | #What does this plot show you??
176 |
177 | barplot(c(trim2$V2[1],trim15$V2[1],trim30$V2[1]),
178 | names=c('Phred2', 'Phred15', 'Phred30'),
179 | main='Number of unique kmers')
180 |
181 | # plot differences between non-unique kmers
182 |
183 | plot(trim2$V2[2:30] - trim30$V2[2:30], type='l',
184 | xlim=c(2,20), xaxs="i", yaxs="i", frame.plot=F,
185 | ylim=c(0,2000000), col='red', xlab='kmer frequency',
186 | lwd=4, ylab='count',
187 | main='Diff in 25mer counts of freq 2 to 20 \n Phred2 vs. Phred30')
188 |
189 |
190 |
191 |
192 | > Look at the FastQC plots across the different trimming levels. Anything surprising?
193 |
194 | > What do the analyses of kmer counts tell you?
--------------------------------------------------------------------------------
/lab_lessons/Lab9_euk.genome.assembly.md:
--------------------------------------------------------------------------------
1 | Lab 9: Genome assembly
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with genome assembly using SPAdes. We will assemble the genome of Plasmodium falciparum. The data are taken from this paper: http://www.nature.com/ncomms/2014/140909/ncomms5754/full/ncomms5754.html?WT.ec_id=JA-NCOMMS-20140919.
7 | As it stands right now, the plan is that you will do all the preprocessing steps this week, then the assembly next week. Once you have done all the steps, `gzip` compress the files and download them to your USB drive or the Mac HD. I can provide you with these files next week if issues arise.
8 |
9 | 1. Install software and download data
10 |
11 | 2. Error correct, quality and adapter trim data sets.
12 |
13 | 3. (next week) Assemble
14 |
15 | -
16 |
17 | The SPAdes manuscript: http://www.ncbi.nlm.nih.gov/pubmed/22506599
18 | The SPAdes manual: http://spades.bioinf.spbau.ru/release3.1.1/manual.html
19 | SPAdes website: http://bioinf.spbau.ru/spades
20 |
21 | > Step 1: Launch an AMI. For this exercise, we will use a c3.8xlarge (note different instance type). Remember to change the permissions on your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
22 |
23 |
24 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
25 |
26 |
27 | ---
28 |
29 | > Update Software
30 |
31 |
32 | sudo bash
33 | apt-get update
34 |
35 |
36 | ---
37 |
38 | > Install updates
39 |
40 |
41 | apt-get -y upgrade
42 |
43 |
44 | ---
45 |
46 | > Install other software
47 |
48 |
49 | apt-get -y install subversion tmux git curl bowtie libncurses5-dev samtools gcc make g++ python-dev unzip dh-autoreconf default-jre python-pip zlib1g-dev
50 |
51 |
52 | ---
53 |
54 | > Install Lighter, software for error correction.
55 |
56 |
57 | cd $HOME
58 | git clone https://github.com/mourisl/Lighter.git
59 | cd Lighter
60 | make -j8
61 | PATH=$PATH:$(pwd)
62 |
63 |
64 | ---
65 |
66 | > Install Trimmomatic
67 |
68 |
69 | cd $HOME
70 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip
71 | unzip Trimmomatic-0.32.zip
72 | cd Trimmomatic-0.32
73 | chmod +x trimmomatic-0.32.jar
74 |
75 |
76 | ---
77 |
78 | > Install SRAtoolkit
79 |
80 |
81 | cd $HOME
82 | wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.4.2/sratoolkit.2.4.2-ubuntu64.tar.gz
83 | tar -zxf sratoolkit.2.4.2-ubuntu64.tar.gz
84 | PATH=$PATH:/home/ubuntu/sratoolkit.2.4.2-ubuntu64/bin
85 |
86 |
87 | ---
88 |
89 | > Install SPAdes
90 |
91 |
92 | wget http://spades.bioinf.spbau.ru/release3.1.1/SPAdes-3.1.1-Linux.tar.gz
93 | tar -zxf SPAdes-3.1.1-Linux.tar.gz
94 | cd SPAdes-3.1.1-Linux
95 | PATH=$PATH:$(pwd)/bin
96 |
97 |
98 | ---
99 |
100 | > Download 3.5kb MP library
101 |
102 |
103 | cd /mnt
104 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR022/ERR022558/ERR022558.sra
105 |
106 |
107 | > Download 10kb MP library
108 |
109 |
110 | cd /mnt
111 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR022/ERR022557/ERR022557.sra
112 |
113 |
114 | > Download PE library #1
115 |
116 |
117 | cd /mnt
118 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR019/ERR019273/ERR019273.sra
119 |
120 |
121 | ---
122 |
123 | > Download PE library #2
124 |
125 |
126 | cd /mnt
127 | wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/ERR/ERR019/ERR019275/ERR019275.sra
128 |
129 |
130 | ---
131 |
132 | > Extract fastQ from sra format.
133 |
134 |
135 | cd /mnt
136 |
137 | #this is a basic for loop. Copy it all as 1 command.
138 |
139 | for i in *.sra; do
140 | fastq-dump --split-files --split-spot $i;
141 | rm $i;
142 | done
143 |
144 |
145 | ---
146 |
147 | > Error Correct Data
148 |
149 |
150 | mkdir /mnt/ec
151 | cd /mnt/ec
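#a note on lighter's arguments: -k takes 3 values - the kmer length, an approximate genome size, and alpha (the kmer sampling rate)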
152 | lighter -r /mnt/ERR019273_1.fastq -r /mnt/ERR019273_2.fastq -t 32 -k 21 45000000 .1
153 | lighter -r /mnt/ERR022557_1.fastq -r /mnt/ERR022557_2.fastq -t 32 -k 21 45000000 .1
154 | lighter -r /mnt/ERR022558_1.fastq -r /mnt/ERR022558_2.fastq -t 32 -k 21 45000000 .1
155 | lighter -r /mnt/ERR019275_1.fastq -r /mnt/ERR019275_2.fastq -t 32 -k 21 45000000 .1
156 |
157 | #remove the raw files.
158 |
159 | rm *fastq &
160 |
161 |
162 | > Trim the data:
163 |
164 |
165 | mkdir /mnt/trim
166 | cd /mnt/trim
167 | #paste the below lines together as 1 command
168 |
169 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
170 | -threads 32 -baseout PE_lib1.fq \
171 | /mnt/ec/ERR019273_1.cor.fq \
172 | /mnt/ec/ERR019273_2.cor.fq \
173 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \
174 | SLIDINGWINDOW:4:2 \
175 | LEADING:2 \
176 | TRAILING:2 \
177 | MINLEN:25
178 |
179 | #paste the below lines together as 1 command
180 |
181 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
182 | -threads 32 -baseout PE_lib2.fq \
183 | /mnt/ec/ERR019275_1.cor.fq \
184 | /mnt/ec/ERR019275_2.cor.fq \
185 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \
186 | SLIDINGWINDOW:4:2 \
187 | LEADING:2 \
188 | TRAILING:2 \
189 | MINLEN:25
190 |
191 | #paste the below lines together as 1 command
192 |
193 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
194 | -threads 32 -baseout MP10000.fq \
195 | /mnt/ec/ERR022557_1.cor.fq \
196 | /mnt/ec/ERR022557_2.cor.fq \
197 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \
198 | SLIDINGWINDOW:4:2 \
199 | LEADING:2 \
200 | TRAILING:2 \
201 | MINLEN:25
202 |
203 | #paste the below lines together as 1 command
204 |
205 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
206 | -threads 32 -baseout MP3500.fq \
207 | /mnt/ec/ERR022558_1.cor.fq \
208 | /mnt/ec/ERR022558_2.cor.fq \
209 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 \
210 | SLIDINGWINDOW:4:2 \
211 | LEADING:2 \
212 | TRAILING:2 \
213 | MINLEN:25
214 |
215 |
216 |
217 |
218 | ---
219 |
220 |
221 | cd /mnt/ec
222 |
223 | #remove the corrected files; the trimmed files in /mnt/trim are what the assembler uses.
224 |
225 | rm *.cor.fq
226 |
227 |
228 |
229 | > Assembly. You may want to do this next week. Alternatively, you can put it in a tmux window and let it run; you'd have to log in later, however, to download the assembled genome.
230 |
231 |
232 | mkdir /mnt/spades
233 | cd /mnt/spades
234 |
235 | spades.py -t 32 -m 60 \
236 | --pe1-1 /mnt/trim/PE_lib1_1P.fq \
237 | --pe1-2 /mnt/trim/PE_lib1_2P.fq \
238 | --pe2-1 /mnt/trim/PE_lib2_1P.fq \
239 | --pe2-2 /mnt/trim/PE_lib2_2P.fq \
240 | --mp1-1 /mnt/trim/MP3500_1P.fq \
241 | --mp1-2 /mnt/trim/MP3500_2P.fq \
242 | --mp2-1 /mnt/trim/MP10000_1P.fq \
243 | --mp2-2 /mnt/trim/MP10000_2P.fq \
244 | -o Pfal --only-assembler
245 |
246 |
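> When SPAdes finishes, sanity check the assembly. A quick look, assuming the default SPAdes output names (contigs.fasta and scaffolds.fasta inside Pfal/):

grep -c '>' Pfal/contigs.fasta
grep -c '>' Pfal/scaffolds.fasta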
--------------------------------------------------------------------------------
/lab_lessons/Lab7_transcriptome_assembly.md:
--------------------------------------------------------------------------------
1 | Lab 7: Transcriptome assembly
2 | --
3 |
4 | ---
5 |
6 | During this lab, we will acquaint ourselves with de novo transcriptome assembly using Trinity. You will:
7 |
8 | 1. Install software and download data
9 |
10 | 2. Error correct, quality and adapter trim data sets.
11 |
12 | 3. Apply digital normalization to the dataset.
13 |
14 | 4. Trinity assembly
15 |
16 | 5. Because the above steps will take a few hours, I am providing you with 2 datasets: one is the 10 million read dataset you used last week. The other is that same 10M read dataset that I have error corrected, quality/adapter trimmed, normalized, and subsampled to 0.5 million reads (I did this so that the assembly could be done in a reasonable amount of time). Especially for people who are going to do de novo transcriptome projects, or who will use something like this in their own research, it is probably worth going through the whole pipeline at some point.
17 |
18 | -
19 |
20 | The JellyFish manual: ftp://ftp.genome.umd.edu/pub/jellyfish/JellyfishUserGuide.pdf
21 |
22 | The Khmer manual: http://khmer.readthedocs.org/en/v1.1/
23 |
24 | Trinity reference material: http://trinityrnaseq.sourceforge.net/
25 |
26 | ---
27 |
28 | > Step 1: Launch an AMI. For this exercise, we will use an m3.2xlarge (note different instance type). Remember to change the permissions on your key file `chmod 400 ~/Downloads/????.pem` (change ????.pem to whatever you named it)
29 |
30 |
31 | ssh -i ~/Downloads/?????.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com
32 |
33 |
34 | ---
35 |
36 | > Update Software
37 |
38 |
39 | sudo bash
40 | apt-get update
41 |
42 |
43 | ---
44 |
45 | > Install updates
46 |
47 |
48 | apt-get -y upgrade
49 |
50 |
51 | ---
52 |
53 | > Install other software
54 |
55 |
56 | apt-get -y install subversion tmux git curl bowtie libncurses5-dev samtools gcc make g++ python-dev unzip dh-autoreconf default-jre python-pip zlib1g-dev
57 |
58 |
59 | ---
60 |
61 | > Install Lighter, software for error correction.
62 |
63 |
64 | cd $HOME
65 | git clone https://github.com/mourisl/Lighter.git
66 | cd Lighter && make -j8
67 | PATH=$PATH:$(pwd)
68 |
69 |
70 | ---
71 |
72 | > Install Trimmomatic
73 |
74 |
75 | cd $HOME
76 | wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip
77 | unzip Trimmomatic-0.32.zip
78 | cd Trimmomatic-0.32
79 | chmod +x trimmomatic-0.32.jar
80 |
81 |
82 | ---
83 |
84 | > Install Trinity
85 |
86 |
87 | cd $HOME
88 | svn checkout svn://svn.code.sf.net/p/trinityrnaseq/code/trunk trinityrnaseq-code
89 | cd trinityrnaseq-code
90 | make -j8
91 | PATH=$PATH:$(pwd)
92 |
93 |
94 | ---
95 |
96 | > Install Khmer
97 |
98 |
99 | cd $HOME
100 | pip install screed pysam
101 | git clone https://github.com/ged-lab/khmer.git
102 | cd khmer
103 | make -j8
104 | make install
105 | PATH=$PATH:$(pwd)/scripts
106 |
107 |
108 | ---
109 | > Download data. These data are for people wanting to do the whole pipeline. Most people will want the other dataset, linked just below.
110 |
111 |
112 | cd /mnt
113 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_1.fastq.gz
114 | wget https://s3.amazonaws.com/gen711/raw.10M.SRR797058_2.fastq.gz
115 |
116 |
117 | > Alternatively, you can download the pre-corrected, trimmed, normalized datasets. Sadly, I had to subsample this dataset severely (to 500,000 reads) so that we could assemble it in a lab period...
118 |
119 |
120 | cd /mnt
121 | wget https://www.dropbox.com/s/eo3wrx6lvngq3ja/ec.P2.C25.left.fq.gz
122 | wget https://www.dropbox.com/s/eycchg3m2my2ag2/ec.P2.C25.right.fq.gz
123 |
124 |
125 | ---
126 |
127 | > Error correct (do this step only if you are working with the raw data). Note you will have to uncompress the data if you are doing these steps. I chose the software 'lighter' because it is 1. probably good and 2. fast! It is written by Ben Langmead, the author of several of the powerpoint lectures I posted last week.
128 |
129 |
130 | mkdir /mnt/ec
131 | cd /mnt/ec
132 | lighter -r /mnt/raw.10M.SRR797058_1.fastq -r /mnt/raw.10M.SRR797058_2.fastq -t 8 -k 25 100000000 .1
133 |
134 |
135 | ---
136 |
137 | > Trim (do this step if you are working with the raw data only)
138 |
139 |
140 | mkdir /mnt/trimming
141 | cd /mnt/trimming
142 |
143 | #paste the below lines together as 1 command
144 |
145 | java -Xmx10g -jar $HOME/Trimmomatic-0.32/trimmomatic-0.32.jar PE \
146 | -threads 8 -baseout ec.P2trim.fastQ \
147 | /mnt/ec/raw.10M.SRR797058_1.cor.fq \
148 | /mnt/ec/raw.10M.SRR797058_2.cor.fq \
149 | ILLUMINACLIP:$HOME/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 \
150 | SLIDINGWINDOW:4:2 \
151 | LEADING:2 \
152 | TRAILING:2 \
153 | MINLEN:25
154 |
155 |
156 | ---
157 |
158 | > Run Khmer (do this step if you are working with the raw data only)
159 |
160 |
161 | mkdir /mnt/khmer
162 | cd /mnt/khmer
163 | interleave-reads.py /mnt/trimming/ec.P2trim.fastQ_1P /mnt/trimming/ec.P2trim.fastQ_2P -o interleaved.fq
164 | normalize-by-median.py -p -x 15e8 -k 25 -C 25 --out khmer_normalized.fq interleaved.fq
165 | split-paired-reads.py khmer_normalized.fq
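#split-paired-reads.py writes khmer_normalized.fq.1 (left reads) and khmer_normalized.fq.2 (right reads);
#if you ran the full pipeline, these are the files to give Trinity as --left and --right below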
166 |
167 |
168 | ---
169 |
170 | > Run Trinity - everybody do this. If you are running with the raw data, you'll have to change the names of the input files. Note that I am using `--min_kmer_cov 2` in the command below. This is only so that you can get through the assembly in a short amount of time. DO NOT USE THIS OPTION IN 'REAL LIFE' AS IT WILL MAKE YOUR ASSEMBLY WORSE!!! This should take ~30 minutes, so use this time to talk to your group members.
171 |
172 |
173 | mkdir /mnt/trinity
174 | cd /mnt/trinity
175 | Trinity --seqType fq --JM 20G --min_kmer_cov 2 \
176 | --left /mnt/ec.P2.C25.left.fq \
177 | --right /mnt/ec.P2.C25.right.fq \
178 | --CPU 8 --output ec.P2trim.C25 --group_pairs_distance 999 --inchworm_cpu 8
179 |
180 |
181 | ---
182 |
183 | > Generate length-based stats from your assembly. What do these mean?
184 |
185 |
186 | $HOME/trinityrnaseq-code/util/TrinityStats.pl ec.P2trim.C25/Trinity.fasta
187 |
188 |
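> TrinityStats.pl reports, among other things, the number of transcripts and the N50 (the length such that half of all assembled bases sit in contigs at least that long). You can cross-check the transcript count yourself:

grep -c '>' ec.P2trim.C25/Trinity.fasta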
189 |
190 | > Let's look for coding sequences. Before we can do this, we need to install a Perl module using the cpan command.
191 |
192 |
193 | cpan URI::Escape
194 | $HOME/trinityrnaseq-code/trinity-plugins/TransDecoder_r20140704/TransDecoder --CPU 8 -t ec.P2trim.C25/Trinity.fasta
195 |
196 |
197 | > This will take a few minutes. Once done, you will have a file of amino acid sequences and a file of coding sequences. Look at how many coding sequences you found, and how many were complete (have both a start and a stop codon) vs. fragmented in one way or another. What do these numbers mean? What would you hope they would look like? What does `grep -c` do?
198 |
199 |
200 | $HOME/trinityrnaseq-code/util/TrinityStats.pl Trinity.fasta.transdecoder.pep
201 | grep -c complete Trinity.fasta.transdecoder.pep
202 | grep -c internal Trinity.fasta.transdecoder.pep
203 | grep -c 5prime Trinity.fasta.transdecoder.pep
204 | grep -c 3prime Trinity.fasta.transdecoder.pep
205 |
206 |
--------------------------------------------------------------------------------