-1 reads1.fq -2 reads2.fq -o transcripts_quant`
145 | >
146 |
147 | > **NOTE 2:** To have Salmon correct for other RNA-Seq biases you will need to specify the appropriate parameters when you run it. Before using these parameters it is advisable to assess your data using tools like [Qualimap](http://qualimap.bioinfo.cipf.es/) to look specifically for the presence of these biases in your data and decide on which parameters would be appropriate.
148 | >
149 | > To correct for the various sample-specific biases you could add the following parameters to the Salmon command:
150 | >
151 | > * `--gcBias` to learn and correct for fragment-level GC biases in the input data
152 | > * `--posBias` will enable modeling of a position-specific fragment start distribution
153 |
154 |
155 | ## Salmon output
156 |
157 | You should see that a new directory has been created, named with the string value you provided to the `-o` argument. Take a look at what is contained in this directory:
158 |
159 | $ ls -l Mov10_oe_1.subset.salmon/
160 |
161 | There is a `logs` directory, which contains all of the text that was printed to screen as Salmon was running. Additionally, there is a file called `quant.sf`.
162 |
163 | This is the **quantification file** in which each row corresponds to a transcript, listed by Ensembl ID, and the columns correspond to metrics for each transcript:
164 |
165 | ```bash
166 | Name Length EffectiveLength TPM NumReads
167 | ENST00000456328 1657 1407.000 0.000000 0.000
168 | ENST00000450305 632 382.000 0.000000 0.000
169 | ENST00000488147 1351 1101.000 0.000000 0.000
170 | ENST00000619216 68 3.000 0.000000 0.000
171 | ENST00000473358 712 462.000 0.000000 0.000
172 | ENST00000469289 535 285.000 0.000000 0.000
173 | ENST00000607096 138 5.000 0.000000 0.000
174 | ENST00000417324 1187 937.000 0.000000 0.000
175 |
176 | ....
177 |
178 | ```
179 |
180 | * The first two columns are self-explanatory, the **name** of the transcript and the **length of the transcript** in base pairs (bp).
181 | * The **effective length** accounts for the various factors that affect the usable length of the transcript (e.g. degradation, technical limitations of the sequencing platform)
182 | * Salmon outputs ‘pseudocounts’, which predict the relative abundance of different isoforms in the form of three possible metrics (FPKM, RPKM, and TPM). **TPM (transcripts per million)** is a commonly used normalization method as described in [[1]](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820677/) and is computed based on the effective length of the transcript.
183 | * Estimated **number of reads** (an estimate of the number of reads drawn from this transcript given the transcript’s relative abundance and length)
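As a toy illustration (with made-up transcript names and values, not rows from our actual `quant.sf`), TPM can be recomputed by hand from the `NumReads` and `EffectiveLength` columns:

```bash
# Hypothetical example: recompute TPM from a quant.sf-style file.
# TPM_i = (reads_i / efflen_i) / sum_j (reads_j / efflen_j) * 1e6
cat > demo_quant.sf <<'EOF'
Name Length EffectiveLength TPM NumReads
tx1 1000 750.000 0.000000 300.000
tx2 2000 1750.000 0.000000 700.000
EOF

awk 'NR > 1 { rate[$1] = $5/$3; total += $5/$3 }
     END    { for (t in rate) printf "%s\t%.2f\n", t, rate[t]/total * 1e6 }' demo_quant.sf
```

Here both transcripts have the same reads-per-effective-base rate (0.4), so each gets a TPM of 500000.00, which is why TPM values always sum to one million per sample.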
184 |
185 |
186 | ## Running Salmon on multiple samples
187 |
188 | We just ran Salmon on a single sample (and keep in mind only on a subset of chr1 from the original data). To obtain meaningful results we need to run this on **all samples for the full dataset**. To do so, we will need to create a job submission script.
189 |
190 | > *NOTE:* We are iterating over FASTQ files in the **full dataset directory**, located at `/n/groups/hbctraining/ngs-data-analysis-longcourse/rnaseq/full_dataset`
191 |
192 |
193 | ### Create a job submission script to run Salmon in serial
194 |
195 | Since Salmon only takes a single sample as input, one way in which we can do this is to use a for loop to run Salmon on all samples in serial. What this means is that Salmon will process the dataset one sample at a time.
196 |
197 | Let's start by opening up a script in `vim`:
198 |
199 | $ vim salmon_all_samples.sbatch
200 |
201 |
202 | Let's start our script with a **shebang line followed by SBATCH directives which describe the resources we are requesting from O2**. We will ask for 6 cores and take advantage of Salmon's multi-threading capabilities. Note that we also removed the `--reservation` from our SBATCH options.
203 |
204 | Next we will do the following:
205 |
206 | 1. **Create a for loop to iterate over all FASTQ samples**.
207 | 2. Inside the loop we will create a variable that stores the prefix we will use for naming output files.
208 | 3. Then we run Salmon.
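For step 2, here is how the prefix variable behaves (the file path below is an illustrative stand-in, not one of the course files):

```bash
# basename strips the leading path and the trailing .fastq suffix;
# the file does not need to exist for this to work.
fq=/tmp/full_dataset_demo/Mov10_oe_1.fastq
base=$(basename "$fq" .fastq)
echo "$base"    # prints: Mov10_oe_1
```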
209 |
210 | > **NOTE:** We have **added a couple of new parameters**. First, since we are **multithreading** with 6 cores we will use `-p 6`. Another new parameter we have added is called `--numBootstraps`. Salmon has the ability to optionally compute bootstrapped abundance estimates. **Bootstraps are required for estimation of technical variance**. We will discuss this in more detail when we talk about transcript-level differential expression analysis.
211 |
212 | The final script is shown below:
213 |
214 | ```
215 | #!/bin/bash
216 |
217 | #SBATCH -p short
218 | #SBATCH -c 6
219 | #SBATCH -t 0-12:00
220 | #SBATCH --mem 8G
221 | #SBATCH --job-name salmon_in_serial
222 | #SBATCH -o %j.out
223 | #SBATCH -e %j.err
224 |
225 | cd ~/unix_lesson/rnaseq/salmon
226 |
227 | for fq in /n/groups/hbctraining/ngs-data-analysis-longcourse/rnaseq/full_dataset/*.fastq
228 |
229 | do
230 |
231 | # create a prefix
232 | base=$(basename "$fq" .fastq)
233 |
234 | # run salmon
235 | salmon quant -i /n/groups/hbctraining/ngs-data-analysis-longcourse/rnaseq/salmon.ensembl38.idx \
236 | -l A \
237 | -r $fq \
238 | -p 6 \
239 | -o $base.salmon \
240 | --seqBias \
241 | --useVBOpt \
242 | --numBootstraps 30
243 |
244 | done
245 |
246 | ```
247 |
248 | Save and close the script. This is now ready to run.
249 |
250 | $ sbatch salmon_all_samples.sbatch
251 |
252 | Once you have run Salmon on all of your samples, you will need to decide whether you would like to perform gene-level or isoform-level analysis. The **output directory from Salmon for each sample will be required as input for any of these downstream tools**. In our standard workflow we ended up with a count matrix in which all expression data was summarized into a single file; with this alternative approach, the downstream tools for differential expression will take care of compiling it for you.
253 |
254 | ***
255 |
256 | **Exercise**
257 |
258 | We learned in the [Automation lesson](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html) that we can be more efficient by running our jobs in parallel. In this way, each sample is run as an independent job rather than waiting for the previous sample to finish. **Create a new script to run Salmon in parallel.**
259 |
260 | ***
261 |
262 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
263 |
264 |
265 |
266 |
267 |
268 |
--------------------------------------------------------------------------------
/lessons/DE_analysis.md:
--------------------------------------------------------------------------------
1 | ## Learning Objectives:
2 | -------------------
3 |
4 | * Learning how to run R scripts from the command line
5 | * Use the count matrix as input to an R script for differential expression analysis
6 | * Apply Unix commands to look at the results that are generated and extract relevant information
7 | * Familiarize yourself with various functional analysis tools for gene lists
8 |
9 |
10 | ## Differential expression analysis
11 | -------------------
12 |
13 | At the end of the workflow from the last lesson, our final end product was a count matrix. This is a matrix in which each row represents a gene (or feature) and each column corresponds to a sample. In our dataset, we have two sample classes (control and Mov10oe) and we want to assess the difference in expression between these groups on a gene-by-gene basis.
14 |
15 |
16 |
17 |
18 |
19 | _Illustration taken from slides courtesy of Dr. Paul Pavlidis, UBC_
20 |
21 | Since we know which samples belong to which group, we could just compute a fold-change for each gene and then rank genes by that value. Easy, right? Not exactly.
22 |
23 | The problem is, the **gene expression changes** we observe are not just a result of the differences between the groups that we are investigating; rather, each **is a measurement of the sum of many effects**. In a set of biological samples the transcriptional patterns can be associated not only with our experimental variable(s) but also with many extraneous factors; some that we are aware of (e.g. demographic factors, batch information) and sources that are unknown. The goal of differential expression analysis is to determine the relative role of these effects, and to **separate the “interesting” from the “uninteresting”.**
24 |
25 |
26 | ### Statistical models in R
27 |
28 | [R](https://www.r-project.org/) is a software environment for statistical computing and graphics. R is widely used in the field of bioinformatics, amongst various other disciplines.
29 |
30 |
31 |
32 |
33 |
34 | It can be installed locally on almost all operating systems (and it's free!), with numerous packages available that help increase the efficiency of data handling, manipulation and analysis. Discussing the specifics of R is outside the scope of this course. However, we encourage you to take a look at some of the R resources listed below if you are interested in learning more.
35 |
36 | R is a powerful language that can be very useful for NGS data analysis, and there are many popular packages for working with RNA-Seq count data. Some of these packages include [edgeR](https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf), [DESeq2](http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf), and [limma-voom](http://www.genomebiology.com/2014/15/2/R29). All of these tools use statistical modeling of the count data to test each gene against the null hypothesis and evaluate whether or not it is significantly differentially expressed.
37 |
38 |
39 |
40 |
41 |
42 | These methods determine, for each gene, whether the differences in expression (counts) **between groups** are significant given the amount of variation observed **within groups** (replicates). To test for significance, we need an appropriate statistical model that accurately performs normalization (to account for differences in sequencing depth, etc.) and variance modeling (to account for small numbers of replicates and a large dynamic range of expression). The details of how each package works are described thoroughly in the respective vignettes.
43 |
44 | ### Running R scripts
45 |
46 | In order to run R on O2, let's first **log on to the cluster and start an interactive session with a single core**.
47 |
48 | Once you are in an interactive session, navigate to the `rnaseq` directory:
49 |
50 | $ cd ~/unix_lesson/rnaseq
51 |
52 | We will be running an R script that uses the R package [DESeq2](http://bioconductor.org/packages/release/bioc/html/DESeq2.html) to identify differentially expressed genes. This package is available from [Bioconductor](https://www.bioconductor.org/), which is a repository of packages for the analysis of high-throughput genomic data. There are also a few other packages that are required to generate some additional figures.
53 |
54 | We first need to load the R module and the GCC compiler:
55 |
56 | ```bash
57 | $ module load gcc/6.2.0 R/3.4.1
58 | ```
59 | You can open R by simply typing `R` at the command prompt and pressing `Enter`. You are now in the R console (note that the command prompt has changed to a `>` instead of a `$`):
60 |
61 |
62 |
64 |
65 | Installing packages can be time-consuming and particularly cumbersome in a cluster environment. So rather than having you install packages, we have instructions for you to use the libraries from our installation.
66 |
67 | > **NOTE:** Packages are bundles of code that perform functions and include detailed documentation on how to use those functions. Once installed, they are referred to as _libraries_.
68 |
69 | **To use the libraries we have created for you first exit R with:**
70 |
71 | ```R
72 | q()
73 | ```
74 | You should find yourself back at the shell command prompt. The next few lines will set the environment variable `R_LIBS_USER` to let R know where the R libraries directory resides.
75 |
76 | ```bash
77 | # check if the variable is already set
78 | $ echo $R_LIBS_USER
79 |
80 | # If the above command returns nothing, then run the command below
81 | $ export R_LIBS_USER="/n/groups/hbctraining/R/library/"
82 | ```
83 |
84 | To run differential expression analysis, we are going to run a script from the `results` directory, so let's navigate there and create a directory for the results of our analysis. We will call the directory `diffexpression`:
85 |
86 | ```bash
87 | $ cd ~/unix_lesson/rnaseq/results
88 | $ mkdir diffexpression
89 | ```
90 | First, let's copy over the script file:
91 |
92 | ```bash
93 | $ cp /n/groups/hbctraining/intro_rnaseq_hpc/DESeq2_script.R diffexpression/
94 | ```
95 |
96 | The DE script will require as input **1) your count matrix file** and **2) a metadata file**. The count matrix was generated in the last lesson and is in the `counts` directory. The metadata file is a tab-delimited file which contains any information associated with our samples. Each row corresponds to a sample and each column contains some information about each sample.
97 |
98 | ```bash
99 | $ cp ~/unix_lesson/other/Mov10_rnaseq_metadata.txt diffexpression
100 | ```
101 | > **NOTE:** If you _didn't generate this file in class_ we have a pre-computed count matrix generated that you can use:
102 | >
103 | > `$ cp /n/groups/hbctraining/intro_rnaseq_hpc/counts_STAR/Mov10_rnaseq_counts_complete.txt diffexpression`
104 | >
105 |
106 | Once you have the files copied, take a quick look at the metadata using `less`.
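A minimal sketch of what a tab-delimited metadata file might look like (these sample names and column values are hypothetical, not necessarily those in `Mov10_rnaseq_metadata.txt`):

```bash
# Build a small mock metadata file: one row per sample, one column per attribute.
printf 'samplename\tsampletype\n'           >  example_metadata.txt
printf 'Mov10_oe_1\tMOV10_overexpression\n' >> example_metadata.txt
printf 'Irrel_kd_1\tsiRNA_control\n'        >> example_metadata.txt

cat example_metadata.txt
```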
107 |
108 | Now we're all set up to run our R script! Let's run it from within our `diffexpression` directory:
109 | ```bash
110 | $ cd diffexpression
111 | $ Rscript DESeq2_script.R Mov10_rnaseq_counts_complete.txt Mov10_rnaseq_metadata.txt
112 | ```
113 |
114 | > **NOTE:** You will notice chunks of code in the script that correspond to plotting figures, and these chunks have been commented out. This is because generating figures on **O2 requires the X11 system**, which the training accounts are not currently set up to use. If you are interested in learning more about using X11 applications, you can [find out more on the O2 wiki page](https://wiki.rc.hms.harvard.edu/display/O2/Using+X11+Applications+Remotely).
115 |
116 | ### Gene list exploration
117 |
118 | There are two results files generated from `DESeq2_script.R`: a full results table and a table of significant genes (at FDR < 0.05). Take a look at the significant results file and see what values have been reported:
119 |
120 | ```bash
121 | $ head DEresults_sig_table.txt
122 | ```
123 | You should have a table with 7 columns in it:
124 |
125 | 1. `Gene symbols` (this will not have a column name, due to the nature of the `write` function)
126 | 2. `baseMean`: the average normalized counts across all samples
127 | 3. `log2FoldChange`
128 | 4. `lfcse`: the standard error of the log2 FC
129 | 5. `stat`: the Wald test statistic
130 | 6. `pvalue`
131 | 7. `padj`: p-value adjusted for multiple test correction using the BH method
132 |
133 | Since we have the full table of results for all genes, we could apply a filter based on the `padj` column to keep only genes we consider significant. We could also increase the stringency by adding a fold change criterion. Alternatively, the full table can be useful for investigating groups of interesting genes that are co-regulated but did not appear in our significant list.
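Such filtering can be sketched with `awk` on a mock results table (the gene rows below are invented; column 3 is log2FoldChange and column 7 is padj):

```bash
# Mock rows (hypothetical values): gene, baseMean, log2FC, lfcSE, stat, pvalue, padj
cat > demo_results.txt <<'EOF'
MOV10 1500.2 2.5 0.3 8.3 1e-16 1e-14
ACTB 900.1 0.1 0.2 0.5 0.6 0.8
HSPA5 400.7 -1.6 0.4 -4.0 6e-5 1e-3
EOF

# Keep genes with padj < 0.05 AND |log2FoldChange| > 1
awk '$7 < 0.05 && ($3 > 1 || $3 < -1)' demo_results.txt
```

On this toy input only MOV10 and HSPA5 pass both thresholds; ACTB is dropped because its padj and fold change both fail.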
134 |
135 | Using `wc -l` find out how many genes are identified in the significant table. Keep in mind this is generated using the truncated dataset.
136 |
137 | ```bash
138 | $ wc -l DEresults_sig_table.txt
139 | ```
140 |
141 | For downstream analysis, the relevant information that we will require from this results table is the gene names and the FDR values. We can cut those columns to a new file and use that as input to some functional analysis tools.
142 |
143 | ```bash
144 | $ cut -f1,7 DEresults_sig_table.txt > Mov10_sig_genelist.txt
145 | ```
146 |
147 | Since the list we have is generated from analysis on a small subset of chromosome 1, using these genes as input to downstream tools will not provide any meaningful results. As such, **we have generated a list using the full dataset for these samples, which can be downloaded to your laptop via [this link]().**
148 |
149 |
150 | ## Differential expression analysis using pseudocounts
151 | ----------------------
152 |
153 | In the script described above, we used count data generated from the standard RNA-seq workflow as input. The instructions are below to **perform a similar analysis with the output from Salmon, but on your local laptop**. To perform this analysis, you will need to use R and Rstudio directly. We do not have a script available that works on O2.
154 |
155 | **The rest of this section assumes that you are comfortable with R and RStudio.**
156 |
157 | The output from Salmon is transcript counts, but DESeq2 is designed to work with gene counts. To bridge this gap, the developers of DESeq2 have created a package that makes the output of Salmon compatible with DESeq2. This package is called [`tximport`](https://bioconductor.org/packages/release/bioc/html/tximport.html) and is also available through Bioconductor. `tximport` imports transcript-level abundances, estimated counts and transcript lengths, and summarizes them into matrices for use with downstream gene-level analysis packages.
158 |
159 | First, you have to download the directory containing the quant.sf files for the 8 full datasets using the link below. Once you have them downloaded, continue to follow the rest of the instructions:
160 |
161 | 1. [Download Salmon files](https://www.dropbox.com/s/aw170f8zge01jpq/salmon.zip?dl=0)
162 | 2. Decompress (unzip) the zip archive and move the folder to an appropriate location (e.g. `~/Desktop`)
163 | 3. Open RStudio and select 'File' -> 'New Project' -> 'Existing Directory' and navigate to the `salmon` directory
164 | 4. Open up a new R script ('File' -> 'New File' -> 'Rscript'), and save it as `salmon_de.R`
165 |
166 | Your Rstudio interface should look something like the screenshot below:
167 |
168 |
169 |
170 | To perform this analysis you will have to install the following libraries:
171 |
172 | * `tximport`
173 | * `readr`
174 | * `DESeq2`
175 | * `biomaRt`
179 |
180 | **Step 1:** Load the required libraries:
181 |
182 | ```R
183 | # Load libraries
184 | library(tximport)
185 | library(readr)
186 | library(DESeq2)
187 | library(biomaRt) # tximport requires gene symbols as row names
188 | ```
189 |
190 | **Step 2:** Load the quantification data that was output from Salmon:
191 |
192 | ```R
193 | ## List all directories containing data
194 | samples <- list.files(path = ".", full.names = F, pattern="\\.salmon$")
195 |
196 | ## Obtain a vector of all filenames including the path
197 | files <- file.path(samples, "quant.sf")
198 |
199 | ## Since all quant files have the same name it is useful to have names for each element
200 | names(files) <- samples
201 | ```
202 |
203 | The **main objective here is to add names to our quant files which will allow us to easily discriminate between samples in the final output matrix**.
204 |
205 | **Step 3.** Create a dataframe containing Ensembl Transcript IDs and Gene symbols
206 |
207 | Our Salmon index was generated with transcript sequences listed by Ensembl IDs, but `tximport` needs to know **which genes these transcripts came from**, so we need to use the `biomaRt` package to extract this information.
208 |
209 | > *NOTE:* Keep in mind that the Ensembl IDs listed in our Salmon output contain version numbers (e.g. ENST00000632684.1). If we query Biomart with those IDs it will not return anything. Therefore, before querying Biomart in R, do not forget to strip the version numbers from the Ensembl IDs.
210 |
211 | ```R
212 | ## DO NOT RUN
213 |
214 | # Create a character vector of Ensembl IDs
215 | ids <- read.delim(files[1], sep="\t", header=T) # extract the transcript ids from one of the files
216 | ids <- as.character(ids[,1])
217 | require(stringr)
218 | ids.strip <- str_replace(ids, "\\.[0-9]+$", "") # strip the trailing version number (anchored to the end of the ID)
219 |
220 | # Create a mart object
221 | # Note that we are using an archived host, since "www.ensembl.org" gave us an error
222 | mart <- useDataset("hsapiens_gene_ensembl", useMart("ENSEMBL_MART_ENSEMBL", host="mar2016.archive.ensembl.org"))
223 |
224 | # Get official gene symbol and Ensembl gene IDs
225 | tx2gene <- getBM(
226 | filters= "ensembl_transcript_id",
227 | attributes= c("ensembl_transcript_id", "external_gene_name"),
228 | values= ids.strip,
229 | mart= mart)
230 |
231 | ```
232 |
233 | **We have already run the above code for you and saved the output in a text file which is in the salmon directory.** Load it in using:
234 |
235 | ```R
236 | tx2gene <- read.delim("tx2gene.txt",sep="\t")
237 | ```
238 |
239 | **Step 4:** Run tximport to summarize gene-level information
240 | ```R
241 | ?tximport # let's take a look at the arguments for the tximport function
242 |
243 | txi <- tximport(files, type="salmon", txIn = TRUE, txOut = FALSE, tx2gene=tx2gene, reader=read_tsv, ignoreTxVersion=TRUE)
244 | ```
245 | ### Output from `tximport`
246 |
247 | The `txi` object is a simple list with three matrices: abundance, counts, length.
248 | ```R
249 | attributes(txi)
250 | ```
251 | A final element 'countsFromAbundance' carries through the character argument used in the tximport call. The length matrix contains the average transcript length for each gene which can be used as an offset for gene-level analysis.
252 |
253 | ### Using DESeq2 for DE analysis with pseudocounts
254 |
255 | ```R
256 | ## Create a sampletable/metadata
257 |
258 | # Before we create this metadata object, let's see what the sample (column) order of the counts matrix is:
259 | colnames(txi$counts)
260 |
261 | condition=factor(c(rep("Ctl",3), rep("KD", 2), rep("OE", 3)))
262 | sampleTable <- data.frame(condition, row.names = colnames(txi$counts))
263 |
264 | ## Create a DESeqDataSet object
265 | dds <- DESeqDataSetFromTximport(txi, sampleTable, ~ condition)
266 | ```
267 |
268 | Now that you have created a DESeqDataSet object, you can complete the DE analysis using the methods in the script we ran
269 | for the counts from STAR.
270 |
271 | ### Resources for R
272 |
273 | * https://www.datacamp.com/courses/free-introduction-to-r
274 | * Software Carpentry materials: http://swcarpentry.github.io/r-novice-inflammation/
275 | * Data Carpentry materials: http://tracykteal.github.io/R-genomics/
276 | * Materials from IQSS at Harvard: http://tutorials.iq.harvard.edu/R/Rintro/Rintro.html
277 | * [swirl](http://swirlstats.com/): learn R interactively from within the R console
278 | * The free "try R" class from [Code School](http://tryr.codeschool.com)
279 | * HarvardX course ["Statistics and R for the Life Sciences"](https://courses.edx.org/courses/HarvardX/PH525.1x/1T2015/info)
280 |
281 | ***
282 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
283 |
284 |
--------------------------------------------------------------------------------
/lessons/advanced_bash.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Advanced Shell"
3 | author: "Radhika Khetani, Meeta Mistry"
4 | date: "May 9, 2018"
5 | ---
6 |
7 | ## Learning Objectives
8 |
9 | * Increasing productivity when working on the command-line and in a cluster environment
10 | * Becoming familiar with advanced bash commands and utilities
11 |
12 | ## Advanced Bash Commands and Utilities
13 |
14 | As you begin working more with the Shell, you will discover that there are mountains of different utilities at your fingertips to help increase command-line productivity. So far we have introduced you to some of the basics to help get you started. In this lesson, we will touch on some more advanced topics that can be very useful as you carry out analyses in a cluster environment.
15 |
16 | ## O2-specific utilities
17 |
18 | * [Configuring your shell](#config)
19 | * [`.bashrc` versus `.bash_profile`](#bashrc)
20 | * [Aliases](#alias)
21 | * [Symbolic links](#symlinks)
22 | * [Transferring files with `rsync`](#rsync)
23 | * [Working on `/n/scratch2/`](#nscratch)
24 |
25 | ***
26 |
27 | ## Configuring your shell
28 |
29 | In your home directory there are two hidden files `.bashrc` and `.bash_profile`. These files contain all the startup configuration and preferences for your command line interface and are loaded before your Terminal loads the shell environment. Modifying these files allow you to change your preferences for features like your command prompt, the colors of text, and adding aliases for commands you use all the time.
30 |
31 | > **NOTE:** These files begin with a dot (`.`) which makes it a hidden file. To view all hidden files in your home directory you can use:
32 | >
33 | > `$ ls -al ~/`
34 |
35 | ### `.bashrc` versus `.bash_profile`
36 |
37 | You can put configurations in either file, and you can create either if it doesn’t exist. **But why two different files? What is the difference?**
38 |
39 | The difference is that **`.bash_profile` is executed for login shells, while `.bashrc` is executed for interactive non-login shells**. It is helpful to have these separate files when there are preferences you only want to see at login and not every time you open a new terminal window. For example, suppose you would like to print some lengthy diagnostic information about your machine (load average, memory usage, current users, etc.) - the `.bash_profile` would be a good place, since you would only want it displayed once when logging in.
40 |
41 | Most of the time you don’t want to maintain two separate configuration files for login and non-login shells. For example, when you export a `$PATH` (as we had done previously), you want it to apply to both. You can do this by sourcing `.bashrc` from within your `.bash_profile` file. Take a look at your `.bash_profile` file, it has already been done for you:
42 |
43 | ```bash
44 | $ less ~/.bash_profile
45 | ```
46 |
47 | You should see the following lines:
48 |
49 | ```bash
50 | if [ -f ~/.bashrc ]; then
51 | source ~/.bashrc
52 | fi
53 | ```
54 |
55 | What this means is that if a `.bashrc` file exists, all of its configuration settings will be sourced upon logging in. Any settings you would like applied to all shell windows (login and interactive) can simply be added directly to the `.bashrc` file rather than being maintained in two separate files.
56 |
57 |
58 | ### Aliases
59 |
60 | An alias is a short name that the shell translates into another (usually longer) name or command. They are typically placed in the `.bash_profile` or `.bashrc` startup files so that they are available to all subshells. You can use the `alias` built-in command without any arguments, and the **shell will display a list of all defined aliases**:
61 |
62 | ```bash
63 | $ alias
64 | ```
65 |
66 | This should return to you the list of aliases that have been set for you, and you can see **the syntax used for setting an alias is**:
67 |
68 | ```bash
69 | alias aliasname=value
70 | ```
71 |
72 | When setting an alias **no spaces are permitted around the equal sign**. If value contains spaces or tabs, you must enclose the value within quotation marks. If you look through the list of aliases that have been set for you, `ll` is a good example of this:
73 |
74 | ```bash
75 | alias ll='ls -l'
76 | ```
77 |
78 | Since we have a modifier `-l` and there is a space required, the quotations are necessary.
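A quick self-contained check of this syntax (in a script, alias expansion must be switched on explicitly, which interactive shells do for you):

```bash
# Define the alias, then ask the shell what it resolves to.
shopt -s expand_aliases
alias ll='ls -l'
type ll    # reports that ll is aliased to 'ls -l'
```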
79 |
80 | Let's **set up our own alias**! Every time we want to start an interactive session we have to type out a lengthy command. Wouldn't it be great if we could type a short name instead? Open up the `.bashrc` file using `vim`:
81 |
82 | ```bash
83 | $ vim ~/.bashrc
84 | ```
85 |
86 | Scroll down to the heading "`# User specific aliases and functions`" and on the next line you can set your alias:
87 |
88 | ```bash
89 | alias o2i='srun --pty -p short -t 0-12:00 --mem 8G /bin/bash'
90 | ```
91 |
92 | Save and quit. Now we can source the `.bash_profile` file and test it out. Typing `o2i` at the command prompt will request an interactive session for 12 hours with 8G of memory. You can change the directives to those you use more often (e.g. add more cores, increase memory).
93 |
94 |
95 | ```bash
96 | $ source ~/.bash_profile
97 |
98 | $ o2i
99 | ```
100 |
101 | ## Symbolic links
102 |
103 | The O2 cluster supports symbolic links also known as symlinks. This is a kind of “file” that is **essentially a pointer to another file name**. Symbolic links can be made to directories or across file systems with no restrictions. You can also make a symbolic link to a name which is not the name of any file. (Opening this link will fail until a file by that name is created.) Likewise, if the symbolic link points to an existing file which is later deleted, the symbolic link continues to point to the same file name even though the name no longer names any file.
104 |
105 | Symlinks can be used in lieu of copying over large files. For example, when we began the RNA-seq part of this workshop we had copied over FASTQ files from `~/unix_lesson/raw_fastq` to `~/unix_lesson/rnaseq/raw_data`. But what we could have done instead is created symlinks to those files.
106 |
107 | The basic syntax for creating a symlink is:
108 |
109 | ```bash
110 | ln -s /path/to/file /path/to/symlink
111 | ```
112 |
113 | So if we wanted to have symlinks to our FASTQ files instead of duplicate copies, we can first remove the files that are currently there:
114 |
115 | ```bash
116 | $ rm ~/unix_lesson/rnaseq/raw_data/*
117 | ```
118 |
119 | And then we can symlink the files:
120 |
121 | ```bash
122 | $ ln -s ~/unix_lesson/raw_fastq/*.fq ~/unix_lesson/rnaseq/raw_data/
123 | ```
124 |
125 | Now, if you check the directory where we created the symlinks, you should see the filenames listed in cyan text followed by an arrow pointing to the actual file location. (_NOTE: If your files are flashing in red text, this is an indication that your links are broken, so you might want to double-check the paths._)
126 |
127 | ```bash
128 | $ ll ~/unix_lesson/rnaseq/raw_data
129 | ```
130 |
131 | ## Transferring files with `rsync`
132 |
133 | During this workshop we have mostly used Filezilla to transfer files to and from your laptop to the O2 cluster. At the end of the Alignment/Counting lesson we also introduced how to do this on the command line using `scp`. The way `scp` works is it reads the source file and writes it to the destination. It performs a plain linear copy, locally, or over a network.
134 |
135 | When **transferring large files or a large number of files `rsync` is a better command** to use. `rsync` employs a special delta transfer algorithm and a few optimizations to make the operation a lot faster. **It will check files sizes and modification timestamps** of both file(s) to be copied and the destination, and skip any further processing if they match. If the destination file(s) already exists, the delta transfer algorithm will **make sure only differences between the two are sent over.**
136 |
137 | There are many modifiers for the `rsync` command, but in the examples below we only introduce a select few that we commonly use during a file transfer.
138 |
139 | **Example 1:**
140 |
141 | ```bash
142 | rsync -t --progress /path/to/transfer/files/*.c /path/to/destination
143 | ```
144 |
145 | This command would transfer all files matching the pattern `*.c` from the transfer directory to the destination directory. If any of the files already exist at the destination, the rsync remote-update protocol is used to update the file by sending only the differences.
146 |
147 | **Example 2:**
148 |
149 | ```bash
150 | rsync -avr --progress /path/to/transfer/directory /path/to/destination
151 | ```
152 |
153 | This command would recursively transfer all files from the transfer directory into the destination directory. The files are transferred in "archive" mode (`-a`), which ensures that symbolic links, devices, attributes, permissions, ownerships, etc. are preserved in the transfer (note that `-a` already implies `-r`, so the extra `-r` is redundant but harmless). In both commands, we have added modifiers for verbosity so we have an idea of how the transfer is progressing (`-v`, `--progress`).
154 |
155 | > **NOTE:** A trailing slash on the transfer directory changes the behavior to avoid creating an additional directory level at the destination. You can think of a trailing `/` as meaning "copy the contents of this directory" as opposed to "copy the directory by name".
156 |
157 | ### Working on `/n/scratch2`
158 |
159 | Typically, the `rsync` command is used to move files between a remote computer and a local computer. But it can also be used to move files on the same computer. For example, we could use it to move files across filesystems on O2.
160 |
161 | Most HPC environments have a "scratch space" available to use. This is **a temporary filesystem with larger amounts of storage space and resources, which is ideal for running analyses**. On the O2 cluster, this is located at `/n/scratch2`. Each user is entitled to 10 TB of space in the `/n/scratch2` filesystem. You can create your own directories inside `/n/scratch2/` and put data in there. These files are not backed up and will be deleted if they are not accessed for 30 days.
162 |
163 | Scratch will not work very well with workflows that write many thousands of small files. It is designed for workflows with medium and large files (> 100 MB), making it ideal for many next-gen sequencing analyses, image analyses, and other bioinformatics workflows that use large files.
164 |
165 | When performing your analysis, you may want to take advantage of this space and will want to start by copying over your raw FASTQ files. Rather than using `cp`, the `rsync` command is beneficial since FASTQ files are large. As an example we will copy over our FASTQ files to `/n/scratch2`, but first we will need to create a directory to copy them to. You can name this directory with your user login name.
166 |
167 | ```bash
168 | $ mkdir /n/scratch2/$USER
169 | ```
170 |
171 | Now we can copy over the entire directory of FASTQ files:
172 |
173 | ```bash
174 | $ rsync -avr --progress ~/unix_lesson/raw_fastq /n/scratch2/$USER
175 | ```
176 |
177 | Take a look at the directory on scratch and see that the files transferred successfully.
178 |
179 | > **NOTE:** If you are copying files from a remote resource to your local laptop (or vice versa), the syntax will change. You will need to add the host address before specifying the path. Below is an example of a command you would **run in a Terminal on your local laptop**:
180 | >
181 |
182 | ```bash
183 | ## DO NOT RUN
184 | $ rsync -avr --progress rc_training01@transfer.rc.hms.harvard.edu:/home/rc_training01/unix_lesson/rnaseq/raw_data /path/on/local/machine
185 | ```
186 |
187 |
188 | ## General Bash commands
189 |
190 | > *These materials are adapted from training materials generated by [FAS Research Computing at Harvard University](https://www.rc.fas.harvard.edu/training/training-materials/).*
191 |
192 | * [Setting up](#setup)
193 | * [Regular expressions (regex) in `bash`](#regex)
194 | * [Reintroducing `grep`](#grep)
195 | * [`grep` examples](#example1)
196 | * [Introducing `sed`](#sed)
197 | * [`sed` examples](#example2)
198 | * [Reintroducing `awk`](#awk)
199 | * [`awk` examples](#example3)
200 |
201 | ***
202 |
203 | ## Setting up
204 |
205 | ```bash
206 | $ cd ~/unix_lesson
207 |
208 | $ cp /n/groups/hbctraining/ngs-data-analysis-longcourse/unix_lesson/bicycle.txt .
209 | ```
210 | ***
211 |
212 | ## Regular expressions (regex) in `bash`
213 |
214 | "A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings." -[Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)
215 |
216 | "The specific syntax rules vary depending on the specific implementation, programming language, or library in use. Additionally, the functionality of regex implementations can vary between versions of languages." -[Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)
217 |
218 | Below is a small subset of characters that can be used for pattern generation in `bash`.
219 |
220 | **Special Characters:**
221 |
222 | * `.` : *match any character (except newline)*
223 | * `\` : *make the next character literal*
224 | * `^` : *matches at the start of the line*
225 | * `$` : *matches at the end of the line*
226 | * `*` : *repeat the previous match zero or more times*
227 | * `?` : *the preceding character is optional*
228 | * `[ ]` : *match any one of the enclosed characters*
229 | * `[a-z]` : any one from a through z
230 | * `[aei]` : either a, e, or i
231 | * `[0-9]` : any one from 0 through 9
232 |
233 | **Examples:**
234 |
235 | * `.at` == any three-character string ending with "at", including "hat", "cat", and "bat".
236 | * `ab*c` == "ac", "abc", "abbc", "abbbc", and so on
237 | * `colou?r` == "color" or "colour"
238 | * `[hc]at` == "hat" and "cat".
239 | * `[^b]at` == all strings matched by .at except "bat".
240 | * `[^hc]at` == all strings matched by .at other than "hat" and "cat".
241 | * `^[hc]at` == "hat" and "cat", but only at the beginning of the string or line.
242 | * `[hc]at$` == "hat" and "cat", but only at the end of the string or line.
243 | * `\[.\]` == any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
244 | * `s.*` == "s" followed by zero or more characters, for example: "s" and "saw" and "seed" and "shawshank".
245 |
246 | > above examples excerpted from [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)
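
A few of these patterns can be tried directly at the prompt with `grep`, using `printf` to generate throwaway input (note that the `?` quantifier needs extended regex, enabled with `-E`):

```bash
# [hc]at matches "hat" and "cat" but not "bat"
printf 'hat\ncat\nbat\n' | grep '[hc]at'

# colou?r makes the "u" optional; -E enables extended regex for the ? quantifier
printf 'color\ncolour\ncolouur\n' | grep -E 'colou?r'
```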
247 |
248 | **Non printable characters:**
249 |
250 | * `\t` : tab
251 | * `\n` : new line (Unix)
252 | * `\s` : whitespace (space or tab)
253 |
254 | ***
255 |
256 | ## Reintroducing `grep` (GNU regex parser)
257 |
258 | As we saw yesterday, `grep` is a line-by-line parser that outputs lines matching a pattern of interest. It also allows the use of regular expressions (regex) in the specified pattern, so let's use some regular expressions with `grep`.
259 |
260 | **`grep` usage:**
261 |
262 | `cat file | grep pattern`
263 |
264 | OR
265 |
266 | `grep pattern file`
267 |
268 | **`grep` common options:**
269 |
270 | * `-c` : count the number of matching lines
271 | * `-v` : invert match, print non-matching lines
272 | * `-R` : search recursively through directories
273 | * `-o` : only print the matching part of the line
274 | * `-n` : print the line number
275 |
276 | ***
277 |
278 | ### Examples of `grep` usage
279 |
280 | ```bash
281 | $ grep -c bicycle bicycle.txt
282 |
283 | $ grep "bicycle bicycle" bicycle.txt
284 |
285 | $ grep ^bicycle bicycle.txt
286 | $ grep ^Bicycle bicycle.txt
287 |
288 | $ grep yeah$ bicycle.txt
289 |
290 | $ grep [SJ] bicycle.txt
291 |
292 | $ grep ^[SJ] bicycle.txt
293 | ```
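
These options can also be combined and tried on any small file. A self-contained illustration, using a throwaway file under `/tmp` rather than `bicycle.txt`:

```bash
# Create a small test file
printf 'bicycle race\nI want to ride my bicycle\nfat bottomed girls\n' > /tmp/lyrics.txt

grep -c bicycle /tmp/lyrics.txt    # count matching lines → 2
grep -n bicycle /tmp/lyrics.txt    # prefix each match with its line number
grep -o bicycle /tmp/lyrics.txt    # print only the matched text, once per match
grep -v bicycle /tmp/lyrics.txt    # invert the match → fat bottomed girls
```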
294 | ***
295 |
296 | ## Introducing `sed`
297 |
298 | `sed` reads a stream from stdin (or a file), matches patterns, and writes the edited text to stdout ("think amped-up Find & Replace").
299 |
300 | **`sed` usage:**
301 |
302 | `cat file | sed 'command'`
303 |
304 | OR
305 |
306 | `sed 'command' file`
307 |
308 | **`sed` common options:**
309 |
310 | * `4d` : *delete line 4*
311 | * `2,4d` : *delete lines 2-4*
312 | * `/here/d` : *delete lines matching "here"*
313 | * `/here/,/there/d` : *delete lines from a match of "here" through a match of "there"*
314 | * `s/pattern/text/` : *substitute text for the first match of pattern on each line*
315 | * `s/pattern/text/g` : *substitute text for every match of pattern (globally)*
316 | * `/pattern/a\text` : *append a line with text after lines matching pattern*
317 | * `/pattern/c\text` : *change lines matching pattern to text*
318 |
319 | ### Examples of `sed` usage
320 |
321 | ```bash
322 | $ sed '1,2d' bicycle.txt
323 |
324 | $ sed 's/Superman/Batman/' bicycle.txt
325 |
326 | $ sed 's/bicycle/car/' bicycle.txt
327 | $ sed 's/bicycle/car/g' bicycle.txt
328 |
329 | $ sed 's/.icycle/car/g' bicycle.txt
330 |
331 | $ sed 's/bi*/car/g' bicycle.txt
332 |
333 | $ sed 's/bicycle/tri*cycle/g' bicycle.txt | sed 's/tri*cycle/tricycle/g' ## does this work?
334 | $ sed 's/bicycle/tri*cycle/g' bicycle.txt | sed 's/tri\*cycle/tricycle/g'
335 |
336 | $ sed 's/\s/\t/g' bicycle.txt
337 | $ sed 's/\s/\\t/g' bicycle.txt
338 |
339 | $ sed 's/\s//g' bicycle.txt
340 | ```
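
All of the commands above print to stdout and leave the input file untouched. `sed` also has a `-i` option for editing a file in place; giving `-i` a suffix saves a backup copy of the original, a form that works on both GNU and BSD `sed`. A sketch with a throwaway file:

```bash
printf 'colour colour color\n' > /tmp/spelling.txt

# -i.bak edits the file in place and saves the original as spelling.txt.bak
sed -i.bak 's/colour/color/g' /tmp/spelling.txt

cat /tmp/spelling.txt       # → color color color
cat /tmp/spelling.txt.bak   # → colour colour color (the original)
```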
341 | ***
342 |
343 | ## Introducing `awk`
344 |
345 | `awk` is a command/scripting language that turns text into records and fields, which can be selected and displayed like a kind of ad hoc database. With `awk` you can perform many manipulations on these fields or records before they are displayed.
346 |
347 | **`awk` usage:**
348 |
349 | `cat file | awk 'command'`
350 |
351 | OR
352 |
353 | `awk 'command' file`
354 |
355 | **`awk` concepts:**
356 |
357 | *Fields:*
358 |
359 | Fields are separated by whitespace by default, or you can specify a field separator (FS). The fields are denoted $1, $2, ..., while $0 refers to the entire line. If FS is set to the empty string, the input line is split into one field per character.
360 |
361 | The `awk` program has some internal environment variables that are useful (more exist and vary by platform):
362 |
363 | * `NF` – number of fields in the current record
364 | * `NR` – number of the current record (somewhat similar to row number)
365 | * `FS` – regular expression used to separate fields; also settable via the option `-F fs` (default whitespace)
366 | * `RS` – input record separator (default newline)
367 | * `OFS` – output field separator (default blank)
368 | * `ORS` – output record separator (default newline)
369 |
370 | `awk` also supports more complex statements; some examples are below:
371 |
372 | * if (expression) statement [ else statement ]
373 | * while (expression) statement
374 | * for (expression ; expression ; expression) statement
375 | * for (var in array) statement
376 | * do statement while (expression)
377 |
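For example, a `for` loop can iterate over the fields of each record:

```bash
# Print each field of the input line on its own line, numbered
echo "chr1 exon gene" | awk '{for (i = 1; i <= NF; i++) print i, $i}'
# → 1 chr1
#   2 exon
#   3 gene
```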
378 | Please note that awk is a language in its own right, and we will only be looking at some examples of its usage.
379 |
380 | ### Examples of `awk` usage
381 |
382 | ```bash
383 | $ awk '{print $3}' reference_data/chr1-hg19_genes.gtf | head
384 |
385 | $ awk '{print $3 | "sort -u"}' reference_data/chr1-hg19_genes.gtf
386 |
387 | $ awk '{OFS = "\t" ; if ($3 == "stop_codon") print $1,$4,$5,$3,$10}' reference_data/chr1-hg19_genes.gtf | head
388 | $ awk '{OFS = "\t" ; if ($3 == "stop_codon") print $1,$4,$5,$3,$10}' reference_data/chr1-hg19_genes.gtf | sed 's/"//g' | sed 's/;//g' | head
389 |
390 | $ awk -F "\t" '{print $10}' reference_data/chr1-hg19_genes.gtf | head
391 | $ awk -F "\t" '{print $9}' reference_data/chr1-hg19_genes.gtf | head
392 |
393 | # head other/bad-reads.count.summary
394 | $ awk -F ":" 'NR > 1 {sum += $2} END {print sum}' other/bad-reads.count.summary
395 |
396 | # head ../rnaseq/results/counts/Mov10_featurecounts.Rmatrix.txt
397 | $ awk 'NR > 1 {sum += $2} END {print sum}' ../rnaseq/results/counts/Mov10_featurecounts.Rmatrix.txt
398 | ```
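
The summing idiom in the last two commands can be tried without the workshop files; here is a self-contained version with a throwaway two-column counts file:

```bash
# A tiny tab-delimited counts file with a header row
printf 'gene\tcount\ngeneA\t10\ngeneB\t20\ngeneC\t30\n' > /tmp/counts.txt

# NR > 1 skips the header line; sum column 2 and print the total at the end
awk 'NR > 1 {sum += $2} END {print sum}' /tmp/counts.txt   # → 60
```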
399 |
400 |
401 | ---
402 |
403 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
404 |
405 | * *The materials used in this lesson were derived from training materials generated by [FAS Research Computing at Harvard University](https://www.rc.fas.harvard.edu/training/training-materials/) and [HMS Research Computing](https://rc.hms.harvard.edu/)*
406 |
407 |
408 |
--------------------------------------------------------------------------------
/lessons/experimental_planning_considerations.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Library preparation, sequencing and experimental considerations"
3 | author: "Mary Piper, Meeta Mistry, Radhika Khetani"
4 | date: "Monday, October 29, 2018"
5 | ---
6 |
7 | Approximate time: 90 minutes
8 |
9 | ## Learning Objectives:
10 |
11 | * Describe the process of RNA-seq library preparation
12 | * Describe Illumina sequencing method
13 | * Discuss special considerations for experimental design
14 |
15 | # Experimental steps and considerations
16 |
17 | ## Introduction to RNA-seq
18 |
19 | RNA-seq is an exciting experimental technique that is utilized to explore and/or quantify gene expression within or between conditions.
20 |
21 |
22 | As we know, genes provide instructions to make proteins, which perform some function within the cell. Although all cells contain the same DNA sequence, muscle cells are different from nerve cells and other types of cells because of the different genes that are turned on in these cells and the different RNAs and proteins produced.
23 |
24 |
25 |
26 | Different biological processes, as well as mutations, can affect which genes are turned on and which are turned off, as well as *how much* specific genes are turned on or off.
27 |
28 | To make proteins, the DNA is transcribed into messenger RNA, or mRNA, which is translated by the ribosome into protein. However, some genes encode RNA that does not get translated into protein; these RNAs are called non-coding RNAs, or ncRNAs. Often these RNAs have a function in and of themselves and include rRNAs, tRNAs, and siRNAs, among others. All RNAs transcribed from genes are called transcripts.
29 |
30 |
31 |
32 | To be translated into proteins, the RNA must undergo processing to generate the mRNA. In the figure below, the top strand in the image represents a gene in the DNA, comprised of the untranslated regions (UTRs) and the open reading frame. Genes are transcribed into pre-mRNA, which still contains the intronic sequences. After post-transcriptional processing, the introns are spliced out and a polyA tail and 5' cap are added to yield mature mRNA transcripts, which can be translated into proteins.
33 |
34 |
35 |
36 | **While mRNA transcripts have a polyA tail, many of the non-coding RNA transcripts do not as the post-transcriptional processing is different for these transcripts.**
37 |
38 | RNA-seq data can be used to explore and/or quantify the RNA transcripts, which can be utilized for the following types of experiments:
39 |
40 | - Differential Gene Expression: *quantitative* evaluation and comparison of transcript levels
41 | - Transcriptome assembly: building the profile of transcribed regions of the genome, a *qualitative* evaluation.
42 | - Can be used to help build better gene models, and verify them using the assembly
43 | - Metatranscriptomics or community transcriptome analysis
44 |
45 |
46 | ## Illumina library preparation
47 |
48 | When starting an RNA-seq experiment, for every sample the RNA needs to be isolated and turned into a cDNA library for sequencing. Generally, ribosomal RNA represents the majority of the RNAs present in a cell, while messenger RNAs represent a small percentage of total RNA, ~2% in humans.
49 |
50 |
51 |
52 | Therefore, if we want to study the protein-coding genes, we need to enrich for mRNA or deplete the rRNA. **For differential gene expression analysis, it is best to enrich for poly(A)+ RNA, unless you are aiming to obtain information about long non-coding RNAs, in which case you should perform ribosomal RNA depletion.**
53 |
54 | The workflow for library preparation is detailed in the image below:
55 |
56 |
57 |
58 | *Image credit: [Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682](https://www.nature.com/articles/nrg3068)*
59 |
60 | Briefly, the RNA is isolated from the sample and contaminating DNA is removed, followed by either selection of the mRNA or depletion of the rRNA. The resulting RNA is fragmented then reverse transcribed into cDNA. Sequence adapters are added to the ends of the fragments and the fragments are PCR amplified if needed. Finally, the fragments are size selected (usually ~300-500bp) to finish the library.
61 |
62 | The cDNA libraries can be generated in a way that retains information about which strand of DNA the RNA was transcribed from. Libraries that retain this information are called stranded libraries, which are now standard with Illumina's TruSeq stranded RNA-Seq kits. Stranded libraries should not be any more expensive than unstranded ones, so there is little reason not to acquire this additional information.
63 |
64 | There are 3 types of cDNA libraries available:
65 |
66 | - Forward (second-strand) – reads resemble the gene sequence, i.e. the second-strand cDNA sequence
67 | - Reverse (first-strand) – reads resemble the complement of the gene sequence, i.e. the first-strand cDNA sequence (TruSeq)
68 | - Unstranded
69 |
70 | > **NOTE:** This workflow is specific to Illumina sequencing, which is currently the most utilized sequencing method. But there are other long-read methods worth noting, such as:
71 | >
72 | > - Pacific Biosciences: http://www.pacb.com/
73 | > - Oxford Nanopore (MinION): https://nanoporetech.com/
74 | > - 10X Genomics: https://www.10xgenomics.com/
75 | >
76 | > Advantages and disadvantages of these technologies can be explored in the table below:
77 | >
78 | >
79 |
80 | ## Illumina Sequencing
81 |
82 | After preparation of the libraries, sequencing can be performed to generate the nucleotide sequences of the ends of the fragments, which are called **reads**. You will have the choice of sequencing a single end of the cDNA fragments (single-end reads) or both ends of the fragments (paired-end reads).
83 |
84 |
85 |
86 | - SE - Single end dataset => Only Read1
87 | - PE - Paired-end dataset => Read1 + Read2
88 | - can be 2 separate FastQ files or just one with interleaved pairs
89 |
90 | Generally, single-end sequencing is sufficient unless it is expected that the reads will match multiple locations on the genome (e.g. organisms with many paralogous genes), assemblies are being performed, or splice isoforms are being differentiated. Be aware that paired-end reads are generally about twice as expensive.
91 |
92 | There are a variety of Illumina platforms to choose from to sequence the cDNA libraries.
93 |
94 |
95 |
96 | *Image credit: Adapted from [Illumina](https://www.illumina.com)*
97 |
98 | Differences in platform can alter the length of reads generated as well as the total number of reads sequenced per run and the amount of time required to sequence the libraries. The different platforms each use a different flow cell, which is a glass surface coated with an arrangement of paired oligos that are complementary to the adapters added to your template molecules. The flow cell is where the sequencing reactions take place.
99 |
100 |
101 |
102 | *Image credit: Adapted from [Illumina](https://www.illumina.com)*
103 |
104 |
105 | Let's explore how Illumina sequencing is performed:
106 |
107 | [Illumina Sequencing by Synthesis (video)](https://www.dropbox.com/s/f4t94tcw06f9stg/Illumina%20Sequencing%20by%20Synthesis-14840.mp4?dl=0)
108 |
109 | - Number of clusters ~= Number of reads
110 | - Number of sequencing cycles = Length of reads
111 |
112 | The number of cycles (length of the reads) will depend on the sequencing platform used as well as your preferences.
113 |
114 | Charges for sequencing are usually per lane of the flow cell, and usually you don't need one lane per sample. Multiplexing allows you to sequence multiple samples per lane with the addition of indices (within the Illumina adapter) or special barcodes (outside the Illumina adapter).
115 |
116 |
117 |
118 | ## Experimental planning considerations
119 |
120 | Understanding the steps in the experimental process of RNA extraction and preparation of RNA-Seq libraries is helpful for designing an RNA-Seq experiment, but there are a few special considerations that can greatly affect the quality of a differential expression analysis.
121 |
122 | These important considerations include:
123 |
124 | 1. Number and type of **replicates**
125 | 2. Avoiding **confounding**
126 | 3. Addressing **batch effects**
127 |
128 | We will go over each of these considerations in detail, discussing best practice and optimal design.
129 |
130 | ## Replicates
131 |
132 | Experimental replicates can be performed as **technical replicates** or **biological replicates**.
133 |
134 |
135 |
136 | *Image credit: [Klaus B., EMBO J (2015) **34**: 2727-2730](https://dx.doi.org/10.15252%2Fembj.201592958)*
137 |
138 | - **Technical replicates:** use the same biological sample to repeat the technical or experimental steps in order to accurately measure technical variation and remove it during analysis.
139 |
140 | - **Biological replicates:** use different biological samples of the same condition to measure the biological variation between samples.
141 |
142 | For mice or rats, it might be easy to determine what constitutes a different biological sample, but it's a bit more difficult to determine for cell lines. When using cell lines it's best to include as much variation between samples as possible, and [this article](http://paasp.net/accurate-design-of-in-vitro-experiments-why-does-it-matter/) gives some great recommendations for cell line replicates.
143 |
144 | In the days of microarrays, technical replicates were considered a necessity; however, with the current RNA-Seq technologies, technical variation is much lower than biological variation and **technical replicates are unnecessary**.
145 |
146 | In contrast, **biological replicates are absolutely essential**. For differential expression analysis, the more biological replicates, the better the estimates of biological variation and the more precise our estimates of the mean expression levels. This leads to more accurate modeling of our data and identification of more differentially expressed genes.
147 |
148 |
149 |
150 | *Image credit: [Liu, Y., et al., Bioinformatics (2014) **30**(3): 301–304](https://doi.org/10.1093/bioinformatics/btt688)*
151 |
152 | As the figure above illustrates, **biological replicates are of greater importance than sequencing depth**, which is the total number of reads sequenced per sample. The figure shows the relationship between sequencing depth and number of replicates on the number of differentially expressed genes identified [[1](https://academic.oup.com/bioinformatics/article/30/3/301/228651/RNA-seq-differential-expression-studies-more)]. Note that an **increase in the number of replicates tends to return more DE genes than increasing the sequencing depth**. Therefore, generally more replicates are better than higher sequencing depth, with the caveat that higher depth is required for detection of lowly expressed DE genes and for performing isoform-level differential expression.
153 |
154 | Replicates are almost always preferred to greater sequencing depth for bulk RNA-Seq. However, **guidelines depend on the experiment performed and the desired analysis**. Below we list some general guidelines for replicates and sequencing depth to help with experimental planning:
155 |
156 |
157 | - **General gene-level differential expression:**
158 |
159 | - ENCODE guidelines suggest 30 million SE reads per sample (stranded).
160 |
161 | - 15 million reads per sample is often sufficient, if there are a good number of replicates (>3).
162 |
163 | - Spend money on more biological replicates, if possible.
164 |
165 | - **Gene-level differential expression with detection of lowly-expressed genes:**
166 |
167 | - Similarly benefits from replicates more than sequencing depth.
168 |
169 | - Sequence deeper with at least 30-60 million reads depending on level of expression (start with 30 million with a good number of replicates).
170 |
171 | - **Isoform-level differential expression:**
172 |
173 | - For known isoforms, a depth of at least 30 million reads per sample, with paired-end reads, is suggested.
174 |
175 | - For novel isoforms, more depth is needed (> 60 million reads per sample).
176 |
177 | - Choose biological replicates over paired/deeper sequencing.
178 |
179 | - Perform careful QC of RNA quality. Be careful to use high-quality preparation methods and restrict the analysis to samples with high RIN scores.
180 |
181 | - **Other types of RNA analyses (intron retention, small RNA-Seq, etc.):**
182 |
183 | - Different recommendations depending on the analysis.
184 |
185 | - Almost always more biological replicates are better!
186 |
187 | > **NOTE:** The factor used to estimate the depth of sequencing for genomes is "coverage": how many times the nucleotides sequenced "cover" the genome. This metric is not exact for genomes, but it works okay. It **does not work for transcriptomes**, because the expression of genes depends on the condition being studied.
188 |
189 | ## Confounding
190 |
191 | A confounded RNA-Seq experiment is one where you **cannot distinguish the separate effects of two different sources of variation** in the data.
192 |
193 | For example, we know that sex has large effects on gene expression, and if all of our *control* mice were female and all of the *treatment* mice were male, then our treatment effect would be confounded by sex. **We could not differentiate the effect of treatment from the effect of sex.**
194 |
195 |
196 |
197 | **To AVOID confounding:**
198 |
199 | - Ensure animals in each condition are all the **same sex, age, litter, and batch**, if possible.
200 |
201 | - If not possible, then ensure to split the animals equally between conditions
202 |
203 |
204 |
205 | ## Batch effects
206 |
207 | Batch effects are a significant issue for RNA-Seq analyses, since you can see significant differences in expression due solely to the batch effect.
208 |
209 |
210 |
211 | *Image credit: [Hicks SC, et al., bioRxiv (2015)](https://www.biorxiv.org/content/early/2015/08/25/025528)*
212 |
213 | The issues generated by poor batch study design are highlighted nicely in [this paper](https://f1000research.com/articles/4-121/v1).
214 |
215 | ### How to know whether you have batches?
216 |
217 | - Were all RNA isolations performed on the same day?
218 |
219 | - Were all library preparations performed on the same day?
220 |
221 | - Did the same person perform the RNA isolation/library preparation for all samples?
222 |
223 | - Did you use the same reagents for all samples?
224 |
225 | - Did you perform the RNA isolation/library preparation in the same location?
226 |
227 | If *any* of the answers is **‘No’**, then you have batches.
228 |
229 | ### Best practices regarding batches:
230 |
231 | - Design the experiment in a way to **avoid batches**, if possible.
232 |
233 | - If unable to avoid batches:
234 |
235 | - **Do NOT confound** your experiment by batch:
236 |
237 |
238 |
239 | *Image credit: [Hicks SC, et al., bioRxiv (2015)](https://www.biorxiv.org/content/early/2015/08/25/025528)*
240 |
241 | - **DO** split replicates of the different sample groups across batches. The more replicates the better (definitely more than 2).
242 |
243 |
244 |
245 | *Image credit: [Hicks SC, et al., bioRxiv (2015)](https://www.biorxiv.org/content/early/2015/08/25/025528)*
246 |
247 | - **DO** include batch information in your **experimental metadata**. During the analysis, we can regress out the variation due to batch so it doesn’t affect our results if we have that information.
248 |
249 |
250 |
251 | ***
252 | **Exercise**
253 |
254 | Your experiment has three different treatment groups, A, B, and C. Due to the lengthy process of tissue extraction, you can only isolate the RNA from two samples at the same time. You plan to have 4 replicates per group.
255 |
256 | 1. Fill in the `RNA isolation` column of the metadata table. Since we can only prepare 2 samples at a time and we have 12 samples total, you will need to isolate RNA in 6 batches. In the `RNA isolation` column, enter one of the following values for each sample: `group1`, `group2`, `group3`, `group4`, `group5`, `group6`. Make sure to fill in the table so as to avoid confounding by batch of `RNA isolation`.
257 |
258 | 2. **BONUS:** To perform the RNA isolations more quickly, you devote two researchers to perform the RNA isolations. Fill in their initials to the `researcher` column for the samples they will prepare: use initials `AB` or `CD`.
259 |
260 | | sample | treatment | sex | replicate | RNA isolation |
261 | | --- | --- | --- | --- | --- |
262 | | sample1 | A | F | 1 |
263 | | sample2 | A | F | 2 |
264 | | sample3 | A | M | 3 |
265 | | sample4 | A | M | 4 |
266 | | sample5 | B | F | 1 |
267 | | sample6 | B | F | 2 |
268 | | sample7 | B | M | 3 |
269 | | sample8 | B | M | 4 |
270 | | sample9 | C | F | 1 |
271 | | sample10 | C | F | 2 |
272 | | sample11 | C | M | 3 |
273 | | sample12 | C | M | 4 |
274 |
275 | ***
276 |
--------------------------------------------------------------------------------
/sam.md:
--------------------------------------------------------------------------------
1 | ## samtools extras
2 |
3 | To play around with a few `samtools` commands, first change directories into the directory containing all BAM files.
4 |
5 | `$ cd ~/unix_workshop/rnaseq/results/STAR/bams`
6 |
7 | ### Write only mapped reads to file (filter out unmapped reads)
8 |
9 | `$ samtools view -b -h -F 4 Mov10_oe_1_Aligned.sortedByCoord.out.bam > Mov10_oe_1_Aligned.onlyAligned.bam`
10 |
11 | ### Create a FASTQ file containing only mapped reads
12 |
13 | `$ bamtofastq -o Mov10_oe_1_Mapped.fastq --no-unaligned Mov10_oe_1_Aligned.onlyAligned.bam`
14 |
15 | ### Index BAM file
16 |
17 | `$ samtools index Mov10_oe_1_Aligned.sortedByCoord.out.bam`
18 |
19 | ### Extract reads from a specific region of the chromosome
20 |
21 | `$ samtools view Mov10_oe_1_Aligned.sortedByCoord.out.bam chr1:200000-500000`
22 |
23 | ### Randomly subsample half of the reads into a new BAM file
24 |
25 | `$ samtools view -s 0.5 -b Mov10_oe_1_Aligned.sortedByCoord.out.bam > Mov10_oe_1_subsample.bam`
26 |
27 | ### Simple stats for alignment file
28 |
29 | `$ samtools flagstat Mov10_oe_1_Aligned.sortedByCoord.out.bam`
30 |
31 | ### Visualizing mismatches
32 |
33 | `$ samtools view -h Mov10_oe_1_Aligned.sortedByCoord.out.bam | head -n 5 | samtools fillmd -e - ~/unix_workshop/rnaseq/reference_data/chr1.fa`
34 |
35 |
--------------------------------------------------------------------------------
/schedule/2-day/README.md:
--------------------------------------------------------------------------------
1 | # Workshop Schedule (2-day)
2 |
3 | ## Day 1
4 |
5 | | Time | Topic | Instructor |
6 | |:------------------------:|:------------------------------------------------:|:--------:|
7 | |9:00 - 9:40 | [Workshop Introduction] | Radhika |
8 | |9:40 - 10:40 | [Introduction to the Shell] | Radhika |
9 | |10:40 - 10:50 | Break | |
10 | |10:50 - 11:35 | [Introduction to the Shell (cont.)] | Meeta |
11 | |11:35 - 12:15 | [Searching and Redirection] | Mary |
12 | |12:15 - 13:15 | Lunch | |
13 | |13:15 - 13:35 | [Introduction to the Vim Text Editor] | Mary |
14 | |13:35 - 14:50 | [Loops and Shell Scripts] | Meeta |
15 | |14:50 - 15:00 | Break | |
16 | |15:00 - 15:30 | [Permissions and Environment Variables] | Radhika |
17 | |15:30 - 16:00 | [Project Organization and Best Practices in Data Management] | Meeta |
18 | |16:00 - 17:00 | [Introduction to RNA-seq and Library Prep] | Radhika |
19 |
20 | ## Day 2
21 |
22 | | Time | Topic | Instructor |
23 | |:------------------------:|:----------:|:--------:|
24 | |9:00 - 9:50 | [Introduction to High-Performance Computing] | Radhika |
25 | |9:50 - 10:30 | [RNA-seq analysis workshop - Quality Assessment] | Mary |
26 | |10:30 - 10:40 | Break | |
27 | |10:40 - 11:15 | [RNA-seq analysis workshop - Quality Assessment] | Mary |
28 | |11:15 - 12:00 | [RNA-seq analysis workshop - Alignment and Counting] | Meeta |
29 | |12:00 - 13:00 | Break | |
30 | |13:00 - 13:45 | [RNA-seq analysis workshop - Alignment and Counting] | Meeta |
31 | |13:45 - 14:45 | [Automating the RNA-seq workflow] | Radhika |
32 | |14:45 - 15:00 | Break | |
33 | |15:00 - 16:30 | [Advanced concepts in bash] | Meeta/Radhika |
34 | |16:30 - 17:00 | [Wrap up + Q & A] | Radhika |
35 |
--------------------------------------------------------------------------------
/schedule/README.md:
--------------------------------------------------------------------------------
1 | # Workshop Schedule
2 |
3 | ## Day 1
4 |
5 | | Time | Topic | Instructor |
6 | |:------------------------:|:------------------------------------------------:|:--------:|
7 | |9:00 - 9:40 | [Workshop Introduction](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/raw/master/lectures/Intro_to_workshop.pdf) | Meeta |
8 | |9:40 - 10:30 | [Introduction to the Shell](https://hbctraining.github.io/Intro-to-Shell/lessons/01_the_filesystem.html) | Mary |
9 | |10:30 - 10:45 | Break | |
10 | |10:45 - 11:35 | [Introduction to the Shell (cont.)](https://hbctraining.github.io/Intro-to-Shell/lessons/01_the_filesystem.html) | Meeta |
11 | |11:35 - 12:15 | [Searching and Redirection](https://hbctraining.github.io/Intro-to-Shell/lessons/02_searching_files.html) | Mary |
12 | |12:15 - 13:15 | Lunch | |
13 | |13:15 - 13:45 | [Introduction to the Vim Text Editor](https://hbctraining.github.io/Intro-to-Shell/lessons/03_vim.html) | Mary |
14 | |13:45 - 15:00 | [Loops and Shell Scripts](https://hbctraining.github.io/Intro-to-Shell/lessons/04_loops_and_scripts.html) | Meeta |
15 | |15:00 - 15:15 | Break | |
16 | |15:15 - 15:45 | [Permissions and Environment Variables](https://hbctraining.github.io/Intro-to-Shell/lessons/05_permissions_and_environment_variables.html) | Mary |
17 | |15:45 - 17:00 | [Project Organization and Best Practices in Data Management](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/01_data_organization.html) | Meeta |
18 |
19 | ## Day 2
20 |
21 | | Time | Topic | Instructor |
22 | |:------------------------:|:----------:|:--------:|
23 | |9:00 - 9:45 | [Introduction to High-Performance Computing for HMS-RC's O2](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/raw/master/lectures/HPC_intro_O2.pdf) | Meeta |
24 | |9:45 - 10:00 | Break | |
25 | |10:00 - 11:15 | [Introduction to RNA-seq and Library Prep](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/blob/master/lectures/rna-seq_design.pdf) | Mary |
26 | |11:15 - 11:55 | [NGS workflows and data standards](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/blob/master/lectures/NGS_workflows.pdf) | Meeta |
27 | |11:55 - 12:55 | Lunch | |
28 | |12:55 - 13:50 | [Quality Assessment of Sequence Data](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/02_assessing_quality.html) | Mary |
29 | |13:50 - 14:30 | [Sequence Alignment Theory](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/blob/master/lectures/Sequence_alignment.pdf) | Meeta |
30 | |14:30 - 14:45 | Break | |
31 | |14:45 - 16:00 | [RNA-seq Alignment with STAR](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html) | Mary |
32 | |16:00 - 17:00 | [Assessing Alignment Quality](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/04_alignment_quality.html) | Meeta |
33 |
34 | ## Day 3
35 |
36 | | Time | Topic | Instructor |
37 | |:------------------------:|:----------:|:--------:|
38 | |9:00 - 10:15 | [Generating a Count Matrix](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/05_counting_reads.html) | Meeta |
39 | |10:15 - 10:45 | [Documenting Steps in the Workflow with MultiQC](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/06_multiQC.html) | Meeta |
40 | |10:45 - 11:00 | Break | |
41 | |11:00 - 12:35 | [Automating the RNA-seq workflow](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html) | Mary |
42 | |12:35 - 13:35 | Lunch | |
43 | |13:35 - 13:45 | [Alternative workflows for analyzing RNA-seq data](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/blob/master/lectures/RNAseq-analysis-methods.pdf) | Mary |
44 | |13:45 - 15:20 | [Quantifying expression using alignment-free methods (Salmon)](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/08_salmon.html) | Meeta |
45 | |15:20 - 15:35 | Break | |
46 | |15:35 - 15:45 | [Other Applications of RNA-seq](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/blob/master/lectures/other%20rnaseq%20applications.pdf) | Mary |
47 | |15:45 - 16:25 | [Troubleshooting RNA-seq Data Analysis](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/blob/master/lectures/RNA-seq_troubleshooting.pdf) | Mary |
48 | |16:25 - 17:00 | [Genome Builds and Accessing Data on GEO/SRA](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2/blob/master/lectures/Accessing_genomics_dataonline.pdf) | Meeta |
49 | | | [Wrap-up](https://www.dropbox.com/s/6diqq661xn3wgko/Wrap-up.pdf?dl=0) | Mary |
50 |
51 |
--------------------------------------------------------------------------------
/scripts/mov10_fastqc.run:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | #SBATCH -p short # partition name
4 | #SBATCH -t 0-2:00 # hours:minutes runlimit after which job will be killed
5 | #SBATCH -n 6 # number of cores requested -- this needs to be greater than or equal to the number of cores you plan to use to run your job
6 | #SBATCH --job-name rnaseq_mov10_fastqc # Job name
7 | #SBATCH -o %j.out # File to which standard out will be written
8 | #SBATCH -e %j.err # File to which standard err will be written
9 |
10 | ## Changing directories to where the fastq files are located
11 | cd ~/unix_workshop/rnaseq/raw_data
12 |
13 | ## Loading modules required for script commands
14 | module load seq/fastqc/0.11.3
15 |
16 | ## Running FASTQC
17 | fastqc -t 6 *.fq
18 |
19 | ## Moving files to our results directory
20 | mv *fastqc* ../results/fastqc/
21 |
--------------------------------------------------------------------------------
/scripts/rnaseq_analysis_on_allfiles_for-slurm.sh:
--------------------------------------------------------------------------------
1 | #! /bin/bash
2 |
3 | for fq in ~/unix_workshop/rnaseq/raw_data/*.fq
4 | do
5 |
6 | sbatch -p short -t 0-2:00 -n 6 --job-name rnaseq-workflow --wrap="sh ~/unix_workshop/rnaseq/scripts/rnaseq_analysis_on_input_file.sh $fq"
7 | sleep 1 # wait 1 second between each job submission
8 |
9 | done
10 |
--------------------------------------------------------------------------------
/scripts/rnaseq_analysis_on_input_file.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
4 | # USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
5 |
6 | # initialize a variable with an intuitive name to store the name of the input fastq file
7 |
8 | fq=$1
9 |
10 | # grab base of filename for naming outputs
11 |
12 | base=`basename $fq .subset.fq`
13 | echo "Sample name is $base"
14 |
15 | # specify the number of cores to use
16 |
17 | cores=2
18 |
19 | # directory with genome reference FASTA and index files + name of the gene annotation file
20 |
21 | genome=/groups/hbctraining/unix_workshop_other/reference_STAR/
22 | gtf=~/unix_workshop/rnaseq/reference_data/chr1-hg19_genes.gtf
23 |
24 | # make all of the output directories
25 | # The -p option means mkdir will create the whole path if it
26 | # does not exist and refrain from complaining if it does exist
27 |
28 | mkdir -p ~/unix_workshop/rnaseq/results/fastqc/
29 | mkdir -p ~/unix_workshop/rnaseq/results/STAR
30 | mkdir -p ~/unix_workshop/rnaseq/results/counts
31 |
32 | # set up output filenames and locations
33 |
34 | fastqc_out=~/unix_workshop/rnaseq/results/fastqc/
35 | align_out=~/unix_workshop/rnaseq/results/STAR/${base}_
36 | counts_input_bam=~/unix_workshop/rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
37 | counts=~/unix_workshop/rnaseq/results/counts/${base}_featurecounts.txt
38 |
39 | # set up the software environment
40 |
41 | module load seq/fastqc/0.11.3
42 | module load seq/STAR/2.5.3a
43 | module load seq/samtools/1.3
44 | PATH=/opt/bcbio/centos/bin:$PATH # for using featureCounts if not already in $PATH
45 |
46 | echo "Processing file $fq"
47 |
48 | # Run FastQC and move output to the appropriate folder
49 | fastqc $fq
50 |
51 | # Run STAR
52 | STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outFilterMultimapNmax 10 --outSAMstrandField intronMotif --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes NH HI NM MD AS
53 |
54 | # Create BAM index
55 | samtools index $counts_input_bam
56 |
57 | # Count mapped reads
58 | featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
59 |
--------------------------------------------------------------------------------
/scripts/salmon_all_samples.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | for fq in /groups/hbctraining/unix_workshop_other/full_dataset/*.fastq
4 | do
5 | base=`basename $fq .fastq`
6 | bsub -q mcore -n 6 -W 1:30 -R "rusage[mem=4000]" -J $base.mov10_salmon -o %J.$base.out -e %J.$base.err \
7 | salmon quant -i /groups/hbctraining/unix_workshop_other/salmon.ensembl37.idx/ \
8 | -p 6 -l SR -r $fq --useVBOpt --numBootstraps 30 -o $base.salmon
9 | done
10 |
--------------------------------------------------------------------------------