├── Figures
    ├── LFC.png
    ├── PCA.png
    ├── GO-UP.png
    ├── Rplot.png
    ├── FC-CPM.pdf
    ├── FC-CPM.png
    ├── FDR-hist.png
    ├── MAplot.png
    ├── REVIGO1.png
    ├── REVIGO2.png
    ├── GCbiasPlot.png
    ├── Go-output.png
    ├── REVIGO-UP.png
    ├── P-values-hist.png
    ├── REVIGO-input.png
    ├── fastp-summary.png
    ├── GSA-classification.png
    ├── multi-mapped-reads.png
    ├── multiqc-alignment.png
    ├── volcano-plot-QL-0.05.png
    ├── biological-rep-correlation.png
    ├── biological-replicates-exp-logcqn.png
    ├── PCA-vsd-no-correction-no-label(small).png
    └── PCA-vsd-remove-DONOR-no-label(small).png
├── QC-RNA-seq-LSECs.xlsx
├── star-indexes.sh
└── README.md


/Figures/LFC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/LFC.png


--------------------------------------------------------------------------------
/Figures/PCA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/PCA.png


--------------------------------------------------------------------------------
/Figures/GO-UP.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/GO-UP.png


--------------------------------------------------------------------------------
/Figures/Rplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/Rplot.png


--------------------------------------------------------------------------------
/Figures/FC-CPM.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/FC-CPM.pdf


--------------------------------------------------------------------------------
/Figures/FC-CPM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/FC-CPM.png


--------------------------------------------------------------------------------
/Figures/FDR-hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/FDR-hist.png


--------------------------------------------------------------------------------
/Figures/MAplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/MAplot.png


--------------------------------------------------------------------------------
/Figures/REVIGO1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/REVIGO1.png


--------------------------------------------------------------------------------
/Figures/REVIGO2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/REVIGO2.png


--------------------------------------------------------------------------------
/Figures/GCbiasPlot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/GCbiasPlot.png


--------------------------------------------------------------------------------
/Figures/Go-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/Go-output.png


--------------------------------------------------------------------------------
/Figures/REVIGO-UP.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/REVIGO-UP.png


--------------------------------------------------------------------------------
/QC-RNA-seq-LSECs.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/QC-RNA-seq-LSECs.xlsx


--------------------------------------------------------------------------------
/Figures/P-values-hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/P-values-hist.png


--------------------------------------------------------------------------------
/Figures/REVIGO-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/REVIGO-input.png


--------------------------------------------------------------------------------
/Figures/fastp-summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/fastp-summary.png


--------------------------------------------------------------------------------
/Figures/GSA-classification.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/GSA-classification.png


--------------------------------------------------------------------------------
/Figures/multi-mapped-reads.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/multi-mapped-reads.png


--------------------------------------------------------------------------------
/Figures/multiqc-alignment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/multiqc-alignment.png


--------------------------------------------------------------------------------
/Figures/volcano-plot-QL-0.05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/volcano-plot-QL-0.05.png


--------------------------------------------------------------------------------
/Figures/biological-rep-correlation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/biological-rep-correlation.png


--------------------------------------------------------------------------------
/Figures/biological-replicates-exp-logcqn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/biological-replicates-exp-logcqn.png


--------------------------------------------------------------------------------
/Figures/PCA-vsd-no-correction-no-label(small).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/PCA-vsd-no-correction-no-label(small).png


--------------------------------------------------------------------------------
/Figures/PCA-vsd-remove-DONOR-no-label(small).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/PCA-vsd-remove-DONOR-no-label(small).png


--------------------------------------------------------------------------------
/star-indexes.sh:
--------------------------------------------------------------------------------
 1 | #PBS -l select=1:mem=60gb:ncpus=6
 2 | #PBS -l walltime=05:00:00
 3 | #PBS -N starIndex
 4 | 
 5 | module load star
 6 | DIR=/rds/general/user/hm1412/projects/cebolalab_liver_regulomes/ephemeral/reference-genomes/GRCh38
 7 | 
 8 | STAR --runThreadN 4 --runMode genomeGenerate --genomeDir $DIR --genomeFastaFiles $DIR/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
 9 | 
10 | mv * $DIR
11 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # RNA-seq: a step-by-step analysis pipeline.
  2 | 
  3 | A step-by-step analysis pipeline for RNA-seq data from the [Cebola Lab](https://www.imperial.ac.uk/metabolism-digestion-reproduction/research/systems-medicine/genetics--genomics/regulatory-genomics-and-metabolic-disease/).
  4 | 
  5 | Correspondence: hannah.maude12@imperial.ac.uk
  6 | 
  7 | The resources and references used to build this tutorial are found at the bottom, in the [resources](#resources) section.
  8 | 
  9 | ## Table of Contents
 10 | 
 11 | *Run using command line tools (`bash`)*:
 12 | - [Pre-alignment quality control (QC)](#pre-alignment-qc)
 13 | - [Align to the reference human genome](#align-to-the-reference-genome)
 14 | - [Post-alignment QC](#post-alignment-qc)
 15 | - [Visualisation](#visualisation)
 16 | - [Quantify transcripts](#quantification)
 17 | - [Visualise tracks against the reference genome](#visualisation)
 18 | 
 19 | *Run in `R`*:
 20 | - [Differential gene expression (DGE) analysis](#differential-expression)
 21 | 
 22 | **Quality metrics**: throughout this GitHub repository, ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) icons show where QC measures are obtained. An Excel spreadsheet can be downloaded from this repository, into which these metrics can be entered and saved.
 23 | 
 24 | **Programs required:** it is recommended that the user has anaconda installed, through which all required programs can be installed. Assuming that anaconda is available, all the required programs can be installed using the following:
 25 | 
 26 | ```bash
 27 | #Install the required programs using anaconda
 28 | conda create -n RNA-seq
 29 | 
 30 | conda install -n RNA-seq -c bioconda fastqc
 31 | conda install -n RNA-seq -c bioconda fastp
 32 | conda install -n RNA-seq -c bioconda multiqc
 33 | conda install -n RNA-seq -c bioconda star
 34 | conda install -n RNA-seq -c bioconda samtools
 35 | conda install -n RNA-seq -c bioconda deeptools
 36 | conda install -n RNA-seq -c bioconda salmon
 37 | 
 38 | #For differential expression using DESeq2
 39 | conda create -n DEseq2 r-essentials r-base
 40 | 
 41 | conda install -n DEseq2 -c bioconda bioconductor-deseq2
 42 | conda install -n DEseq2 -c bioconda bioconductor-tximport
 43 | conda install -n DEseq2 -c r r-ggplot2
 44 | ```
 45 | 
 46 | ## Introduction
 47 | 
 48 | This pipeline is compatible with RNA-seq reads generated by Illumina.
 49 | 
 50 | ## Pre-alignment QC
 51 | 
 52 | > Generate QC report
 53 | 
 54 | The raw sequence data should first be assessed for quality. FastQC reports can be generated for all samples to assess sequence quality, GC content, duplication rates, length distribution, K-mer content and adapter contamination. For paired-end reads, run fastqc on both files, with the results output to the current directory:
 55 | 
 56 | ```bash
 57 | fastqc <sample>_R1.fastq.gz -d . -o .
 58 | 
 59 | fastqc <sample>_R2.fastq.gz -d . -o .
 60 | ```
 61 | 
 62 | These fastQC reports can be combined into one summary report using [multiQC](https://multiqc.info/). 
 63 | 
 64 | To extract the total number of reads from the fastQC report, run the following code (replacing `<sample>` with your file name).
 65 | 
 66 | ```bash
 67 | totalreads=$(unzip -c <sample>_fastqc.zip <sample>_fastqc/fastqc_data.txt | grep 'Total Sequences' | cut -f 2)
 68 | 
 69 | echo $totalreads
 70 | #This number will be used again later so is saved as a variable 'totalreads'
 71 | ```
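
When many samples are processed, the same extraction can be wrapped in a small function. The following is a minimal sketch rather than part of the pipeline: the function name `total_reads` is hypothetical, and it assumes the tab-separated `Total Sequences` line that FastQC writes to `fastqc_data.txt`. The demonstration runs on a made-up excerpt so that it is self-contained.

```bash
# Hypothetical helper: read fastqc_data.txt content on stdin and print the read count
total_reads() {
  awk -F '\t' '$1 == "Total Sequences" { print $2; exit }'
}

# With a real report (file names are placeholders):
#   unzip -p sample_fastqc.zip sample_fastqc/fastqc_data.txt | total_reads

# Self-contained demonstration on a made-up excerpt
demo_reads=$(printf 'Filename\tdemo.fastq.gz\nTotal Sequences\t12345\nSequence length\t100\n' | total_reads)
echo "$demo_reads"
```

Matching on the first tab-separated field, rather than using `grep`, avoids picking up other lines that happen to contain the same words.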
 72 | 
 73 | ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) **QC value:** input the total number of reads into the QC spreadsheet. 
 74 | 
 75 | > Trimming 
 76 | 
 77 | Trimming is a useful step of pre-alignment QC, which removes low-quality reads and contaminating adapter sequences (which occur when the read length is longer than the DNA insert).
 78 | 
 79 | If there is evidence of adapter contamination shown in the fastQC report (see below), specific adapter sequences can be trimmed. Here, the program fastp is used to trim the data. For **paired-end** data:
 80 | 
 81 | ```bash
 82 | #Change the -l argument to change the minimum read length allowed.
 83 | fastp -i <sample>_R1.fastq.gz -I <sample>_R2.fastq.gz -o <sample>_R1.trimmed.fastq.gz -O <sample>_R2.trimmed.fastq.gz --detect_adapter_for_pe -l 25 -j <sample>.fastp.json -h <sample>.fastp.html
 84 | ```
 85 | 
 86 | For **single-end** reads (note that adapter detection is not always as effective for single-end reads, so it is advisable to provide the adapter sequence, here the 'Illumina TruSeq Adapter Read 1'):
 87 | 
 88 | ```bash
 89 | fastp -i <sample>.fastq.gz -o <sample>-trimmed.fastq.gz -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -l 25 -j <sample>.fastp.json -h <sample>.fastp.html 
 90 | ```
 91 | 
 92 | An HTML report is generated, including the following information:
 93 | 
 94 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/fastp-summary.png" width="700">
 95 | 
 96 | Here, fastQC should be repeated to generate reports for the trimmed data, and a second multiqc report generated:
 97 | 
 98 | ```bash
 99 | fastqc <sample>_R1.trimmed.fastq.gz -d . -o .
100 | 
101 | fastqc <sample>_R2.trimmed.fastq.gz -d . -o .
102 | ```
103 | 
104 | ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) **QC value:** the number of trimmed reads can be filled in using the fastp report, or by extracting the number of reads from the trimmed fastQC files, as above, and used to fill in the QC spreadsheet.
105 | 
106 | ## Align to the reference genome
107 | 
108 | The raw RNA-seq data in `fastq` format will be aligned to the reference genome, along with a reference transcriptome, to output two alignment files: the genome alignment and the transcriptome alignment.
109 | 
110 | The reads are aligned using the splice-aware aligner [STAR](https://github.com/alexdobin/STAR). The manual is available [here](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf). The reference genome used is the GRCh38 'no-alt' assembly from NCBI, recommended by [Heng Li](http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use). The genome can be downloaded using `wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz`. This version of the GRCh38 reference genome excludes alternative contigs, which may cause fragments to map in multiple locations. The downloaded genome should be decompressed (for example with `gunzip`) and then indexed with STAR.
111 | 
112 | > Index the reference genome
113 | 
114 | Set `--sjdbOverhang` to your maximum read length minus 1 (for example, 99 for 100 bp reads). The indexing also requires a file containing gene annotation, which comes in `gtf` format. For example, ENCODE provides a gtf file with GRCh38 annotations, containing gencode gene coordinates, along with UCSC tRNAs and a PhiX spike-in. Here, we use `gencode.v36.annotation.gtf` as the most recent gene annotation file. The user should aim to use the most up-to-date reference files, while ensuring that the chromosome naming matches that of the reference genome. For example, UCSC uses the 'chr1, chr2, chr3' naming convention, while ENSEMBL uses '1, 2, 3' etc. The files suggested here are compatible.
115 | 
116 | ```bash
117 | GENOMEDIR=/path/to/indexed/genome
118 | READLENGTH=100 #Set to your maximum read length; --sjdbOverhang is READLENGTH - 1
119 | STAR --runThreadN 4 --runMode genomeGenerate --genomeDir $GENOMEDIR --genomeFastaFiles $GENOMEDIR/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --sjdbGTFfile gencode.v36.annotation.gtf --sjdbOverhang $((READLENGTH - 1))
120 | ```
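
If the maximum read length is not known in advance, it can be estimated from the `fastq` file itself. The sketch below is illustrative only: the helper name `max_read_length` and the variable `SJDB_OVERHANG` are hypothetical, and the demonstration uses a made-up two-read `fastq` so that it runs without any data.

```bash
# Hypothetical helper: print the longest read length in a FASTQ stream (stdin).
# Sequence lines are every 4th line starting from line 2.
max_read_length() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Demonstration on a made-up two-read FASTQ (reads of 8 and 6 bases)
fastq_demo=$(printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n')
READLENGTH=$(printf '%s\n' "$fastq_demo" | max_read_length)
SJDB_OVERHANG=$((READLENGTH - 1))
echo "read length: $READLENGTH, --sjdbOverhang: $SJDB_OVERHANG"
```

On real data, the helper could be fed with something like `zcat <sample>_R1.trimmed.fastq.gz | head -n 400000 | max_read_length`, which samples only the first 100,000 reads.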
121 | 
122 | > Carry out the alignment
123 | 
124 | STAR can then be run to align the `fastq` raw data to the genome. If the fastq files are in the compressed `.gz` format, the `--readFilesCommand zcat` argument is added. The output file should be unsorted, as required for the downstream quantification step using Salmon. The following options are shown according to the ENCODE recommendations. 
125 | 
126 | For **paired-end** data:
127 | 
128 | ```bash
129 | STAR --runThreadN 4 --genomeDir $GENOMEDIR --readFilesIn <sample>_R1.trimmed.fastq.gz <sample>_R2.trimmed.fastq.gz \
130 | --outFileNamePrefix <sample> --readFilesCommand zcat --outSAMtype BAM Unsorted --quantTranscriptomeBan Singleend --outFilterType BySJout \
131 | --alignSJoverhangMin 8 --outFilterMultimapNmax 20 \
132 | --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 \
133 | --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 \
134 | --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
135 | --quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD
138 | For **single-end** data:
139 | 
140 | ```bash
141 | STAR --runThreadN 4 --genomeDir $GENOMEDIR --readFilesIn <sample>-trimmed.fastq.gz \
142 | --outFileNamePrefix <sample> --readFilesCommand zcat --outSAMtype BAM Unsorted --quantTranscriptomeBan Singleend --outFilterType BySJout \
143 | --alignSJoverhangMin 8 --outFilterMultimapNmax 20 \
144 | --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 \
145 | --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 \
146 | --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
147 | --quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD
150 | *Hint: all the above code should be on one line!*
151 | 
152 | For compatibility with the downstream Salmon quantification, the `--quantMode TranscriptomeSAM` option outputs two alignment files: one to the reference genome (`Aligned.out.bam`) and one to the transcriptome (`Aligned.toTranscriptome.out.bam`).
153 | 
154 | > Merge files [optional]
155 | 
156 | At this stage, if samples have been sequenced across multiple lanes, the sample files can be combined using `samtools merge`. Various QC tools can be used to assess reproducibility and assess lane effects, such as `deeptools plotCorrelation`. The `salmon` quantification does not require files to be merged, since multiple `bam` files can be listed in the command. However, to visualise the RNA-seq data from the combined technical replicates, `bam` files can be merged at this stage. For example, if your sample was split across lanes 1, 2 and 3 (`L001`, `L002`, `L003`):
157 | 
158 | ```bash
159 | samtools merge <sample>-merged.bam <sample>_L001.bam <sample>_L002.bam <sample>_L003.bam
160 | ```
161 | 
162 | ## Post-alignment QC
163 | 
164 | The STAR alignment will have output several files with the following file names:
165 | 
166 | - Aligned.out.bam
167 | - Aligned.toTranscriptome.out.bam
168 | - Log.final.out
169 | - Log.out  
170 | - Log.progress.out
171 | - SJ.out.tab
172 | 
173 | Two files will be used in downstream analysis, the `Aligned.out.bam` for generating genome browser bigWig tracks and the `Aligned.toTranscriptome.out.bam` for quantification and differential gene expression analysis. First, the `Aligned.out.bam` will be assessed for quality and processed to generate bigWig tracks. 
174 | 
175 | > Generate QC reports using [qualimap](http://qualimap.bioinfo.cipf.es/doc_html/analysis.html) and samtools
176 | 
177 | Qualimap will be run on the `Aligned.out.bam` file (or `<sample>-merged.bam` if you have merged data).
178 | 
179 | Qualimap provides several measures of quality, including how many reads have aligned to exons versus non-coding (intronic and intergenic) regions. To do this, qualimap requires an annotation file giving the locations of the exons. The transcript annotation file, `gencode.v36.annotation.gtf`, can be downloaded from [gencode](https://www.gencodegenes.org/human/). (*Note: the most recent annotation file should be used.*)
180 | 
181 | ```bash
182 | #Sort the output bam file. The suffix of the .bam input file may be .gzAligned.out.bam, or -merged.bam. Edit this code to include the appropriate file name.
183 | samtools sort <sample>.bam > <sample>-sorted.bam
184 | samtools index <sample>-sorted.bam
185 | 
186 | samtools flagstat <sample>-sorted.bam > <sample>-sorted.flagstat
187 | 
188 | #Run qualimap to generate QC reports
189 | qualimap bamqc -bam <sample>-sorted.bam -gff gencode.v36.annotation.gtf -outdir <sample>-bamqc-qualimap-report --java-mem-size=16G
190 | 
191 | qualimap rnaseq -bam <sample>-sorted.bam -gtf gencode.v36.annotation.gtf -outdir <sample>-rnaseq-qualimap-report --java-mem-size=16G
192 | ```
193 | 
194 | ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) **QC value:** the percentage of reads aligned to exons can be extracted as follows: 
195 | 
196 | ```bash
197 | grep exonic <sample>-rnaseq-qualimap-report/rnaseq_qc_results.txt | cut -d '(' -f 2 | cut -d ')' -f1
198 | ```
199 | 
200 | ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) **QC value:** the number of aligned reads and aligned reads which were properly paired can be extracted as follows:
201 | 
202 | ```bash
203 | #The total number of reads mapped
204 | grep mapped <sample>-sorted.flagstat | head -n1 | cut -d ' ' -f1
205 | 
206 | #The total number of properly paired reads
207 | grep 'properly paired' <sample>-sorted.flagstat | head -n1 | cut -d ' ' -f1
208 | ```
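
For the QC spreadsheet it can be convenient to turn these counts into percentages in one step. The following is a rough sketch only: the flagstat text is a made-up excerpt in the layout `samtools flagstat` prints, and all variable names are hypothetical. Note that the flagstat total counts alignment records (including any secondary alignments), so these percentages are relative to that total, not to the number of input reads.

```bash
# Made-up flagstat excerpt standing in for <sample>-sorted.flagstat
flagstat_demo='2000 + 0 in total (QC-passed reads + QC-failed reads)
1900 + 0 mapped (95.00% : N/A)
1800 + 0 properly paired (90.00% : N/A)'

# Extract the first (QC-passed) count from each line of interest
total=$(printf '%s\n' "$flagstat_demo" | grep 'in total' | cut -d ' ' -f1)
mapped=$(printf '%s\n' "$flagstat_demo" | grep ' mapped (' | head -n1 | cut -d ' ' -f1)
paired=$(printf '%s\n' "$flagstat_demo" | grep 'properly paired' | head -n1 | cut -d ' ' -f1)

# Percentages to two decimal places
pct_mapped=$(awk -v m="$mapped" -v t="$total" 'BEGIN { printf "%.2f", 100 * m / t }')
pct_paired=$(awk -v p="$paired" -v t="$total" 'BEGIN { printf "%.2f", 100 * p / t }')

# One tab-separated row per sample, ready to paste into a spreadsheet
printf 'sample\ttotal\tmapped\tpct_mapped\tproperly_paired\tpct_paired\n'
printf 'demo\t%s\t%s\t%s\t%s\t%s\n' "$total" "$mapped" "$pct_mapped" "$paired" "$pct_paired"
```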
209 | 
210 | > A combined qualimap report
211 | 
212 | Qualimap `multi-bamqc` can then run QC on combined samples and replicates. This includes principal component analysis (PCA) to confirm whether technical and/or biological replicates cluster together. A text file (`samples.txt`) should be created with three columns, the first with the sample ID, the second with the full path to the bamqc results and the third with the group names. 
213 | 
214 | *Note, some versions of qualimap require the raw_data_qualimapReport directory to be renamed to raw_data.*
215 | 
216 | ```bash
217 | qualimap multi-bamqc -d samples.txt
218 | ```
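
The three-column `samples.txt` described above can be written by hand or generated with a short loop. The sketch below is one possible way to do it, under stated assumptions: the sample IDs and group names are invented for illustration, and the bamqc result directories are assumed to follow the `<sample>-bamqc-qualimap-report` naming used earlier and to sit in the current directory.

```bash
# Hypothetical sample IDs and groups, written as id:group; replace with your own
samples="ctrl_1:control ctrl_2:control treat_1:treated"

# Start from an empty file, then append one tab-separated line per sample:
# sample ID, full path to the bamqc results, group name
: > samples.txt
for entry in $samples; do
  id=${entry%%:*}
  group=${entry##*:}
  printf '%s\t%s\t%s\n' "$id" "$PWD/${id}-bamqc-qualimap-report" "$group" >> samples.txt
done

cat samples.txt
```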
219 | 
220 | The QC reports can be combined using [multiqc](https://multiqc.info/), an excellent tool for combining QC reports of multiple samples into one. Example outputs of qualimap/multiqc include the alignment positions:
221 | 
222 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/multiqc-alignment.png" width="800">
223 | 
224 | #### Remove duplicates?
225 | 
226 | It is generally recommended *not* to remove duplicates when working with RNA-seq data, unless using UMIs (unique molecular identifiers) [(Klepikova et al. 2017)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5357343/). This is because there are likely to be molecules which are natural duplicates of each other, for example originating from genes with a shared sequence in a common domain. Typically, removing duplicates does more harm than good. It is more or less impossible to remove duplicates from single-end data, and research has also suggested it may cause false negatives when applied to paired-end data. See more in [this useful blog post](https://dnatech.genomecenter.ucdavis.edu/faqs/should-i-remove-pcr-duplicates-from-my-rna-seq-data/). Generally, duplicates are not a problem so long as the *library complexity is high*.
227 | 
228 | ## Visualisation 
229 | 
230 | > Compute GC bias
231 | 
232 | GC-bias describes the bias in sequencing depth depending on the GC-content of the DNA sequence. Bias in DNA fragments, due to the GC-content and start-and-end sequences, may be increased due to preferential PCR amplification [(Benjamini and Speed, 2012)](https://academic.oup.com/nar/article/40/10/e72/2411059). A high rate of PCR duplications, for example when library complexity is low, may cause a significant GC-bias due to the preferential amplification of specific DNA fragments. This can significantly impact transcript abundance estimates. Bias in RNA-seq is explained in a handy [blog](https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/) and [video](https://youtu.be/9xskajkNJwg) by Mike Love.
233 | 
234 | It is **crucial** to correct GC-bias when comparing groups of samples which may have variable GC content dependence, for example when samples were processed in different libraries. `Salmon`, used later to generate read counts for quantification, has its own in-built method to correct for GC-bias. 
235 | 
236 | When generating `bedGraph` or `BigWig` files for visualisation, the user may opt to correct GC-bias so that coverage is corrected and appears more uniform. The `deeptools` suite includes tools to calculate GC bias and correct for it.
237 | 
238 | The reference genome file should be converted to `.2bit` format using [`faToTwoBit`](http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit).
239 | The effective genome size can be calculated using `faCount` available [here](http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/).
240 | Set the `-l` argument to your fragment length.
241 | 
242 | The input `bam` file requires an index, which can be generated using `samtools index`.
243 | 
244 | ```bash
245 | computeGCBias -b <sample>-sorted.bam --effectiveGenomeSize 3099922541 -g GCA_000001405.15_GRCh38_no_alt_analysis_set.2bit -l 100 --GCbiasFrequenciesFile <sample>.freq.txt --biasPlot <sample>.biasPlot.pdf
246 | ```
247 | 
248 | The bias plot format can be changed to png, eps, plotly or svg. If there is significant evidence of a GC bias, this can be corrected using `correctGCBias`. An example of GC bias can be seen in the plot output from `computeGCBias` below:
249 | 
250 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/GCbiasPlot.png" width="500">
251 | 
252 | Correct the GC-bias using `correctGCBias`. This tool effectively removes reads from regions with greater-than-expected coverage (GC-rich regions) and adds reads to regions with less-than-expected coverage (AT-rich regions). The methods are described by [Benjamini and Speed (2012)](https://academic.oup.com/nar/article/40/10/e72/2411059). The following code can be used:
253 | 
254 | ```bash
255 | correctGCBias -b <sample>-sorted.bam --effectiveGenomeSize 3099922541 -g GCA_000001405.15_GRCh38_no_alt_analysis_set.2bit --GCbiasFrequenciesFile <sample>.freq.txt -o <sample>.gc_corrected.bam
256 | ```
257 | 
258 | **NOTE:** When calculating the GC-bias for ChIP-seq, ATAC-seq, DNase-seq (and CUT&Tag/CUT&Run) it is recommended to filter out problematic regions. These include those with low mappability and high numbers of repeats. The compiled list of [ENCODE blacklist regions](https://www.nature.com/articles/s41598-019-45839-z) should be excluded. However, the ENCODE blacklist regions have little overlap with coding regions and this step is not necessary for RNA-seq data [(Amemiya et al, 2019)](https://www.nature.com/articles/s41598-019-45839-z).
259 | 
260 | > Generate bigwig files
261 | 
262 | The `bam` file aligned to the *genome* should be converted to `bigWig` format, which can be uploaded to genome browsers and viewed as a track. First, the bam file aligned to the reference genome may be assessed and corrected for GC bias, to achieve a more even coverage.
263 | 
264 | Coverage is normalised here to BPM (bins per million mapped reads) during conversion; deeptools describes BPM as the bin-level equivalent of TPM.
265 | 
266 | ```bash
267 | bamCoverage -b <sample>.gc_corrected.bam -o <sample>.bw --normalizeUsing BPM --samFlagExclude 512
268 | ```
269 | 
270 | There are multiple methods available for normalisation. Recent analysis by [Abrams et al. (2019)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3247-x#Sec2) advocated TPM as the most effective method. 
271 | 
272 | > Check correlation of technical and biological replicates
273 | 
274 | The correlation between `bam` files of biological and/or technical replicates can be calculated as a QC step to ensure that the expected replicates positively correlate. Deeptools [multiBamSummary](https://deeptools.readthedocs.io/en/develop/content/tools/multiBamSummary.html) and [plotCorrelation](https://deeptools.readthedocs.io/en/develop/content/tools/plotCorrelation.html) are useful tools for further investigation.
275 | 
276 | ## Quantification
277 | 
278 | The `bam` file previously aligned to the *transcriptome* by STAR will next be input into [Salmon](https://combine-lab.github.io/salmon/) in alignment-mode, in order to generate a matrix of gene counts. The Salmon documentation is available [here](https://salmon.readthedocs.io/en/latest/).
279 | 
280 | > Generate transcriptome
281 | 
282 | Salmon requires a transcriptome to be generated from the genome `fasta` and annotation `gtf` files used earlier with STAR. This can be generated using `gffread` (source package available for download [here](http://ccb.jhu.edu/software/stringtie/gff.shtml)).
283 | 
284 | ```bash 
285 | gffread -w GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa -g GCA_000001405.15_GRCh38_no_alt_analysis_set.fna gencode.v36.annotation.gtf
286 | ```
287 | 
288 | > Run Salmon
289 | 
290 | Salmon is used here with the variational Bayesian expectation maximisation (VBEM) algorithm for quantification. Quantification is described in the 2020 paper by [Deschamps-Francoeur et al.](https://www.sciencedirect.com/science/article/pii/S2001037020303032), which describes the handling of multi-mapped reads in RNA-seq data. Duplicated sequences such as pseudogenes can cause reads to align to multiple positions in the genome. Where transcripts have exons which are similar to other genomic sequences, the VBEM approach attributes reads to the most likely transcript. Technical replicates can also be combined by providing the Salmon `-a` argument with a list of bam files, with the file names separated by a space (this may not work on all queue systems; a common error is `segmentation fault (core dump)`). Here, Salmon is run ***without*** any normalisation, on each technical replicate; samples are combined and normalised in the next steps.
291 | 
292 | For **paired-end** data:
293 | 
294 | ```bash
295 | salmon quant -t GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa --libType A -a <sample>.Aligned.toTranscriptome.out.bam -o <sample>.salmon_quant --gcBias --seqBias
296 | ```
297 | 
298 | For **single-end** data:
299 | 
300 | If using single end data, add the `--fldMean` and `--fldSD` parameters to include the mean and standard deviation of the fragment lengths. If listing multiple files to be combined, the library type will need to be specified, as Salmon cannot determine it automatically (see the Salmon documentation for more information).
301 | 
302 | ```bash
303 | salmon quant -t GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa --libType ?? -a <sample>.Aligned.toTranscriptome.out.bam -o <sample>.salmon_quant  --fldMean ?? --fldSD ?? --gcBias --seqBias
304 | ```
305 | 
306 | ## Differential expression
307 | 
308 | ***All following code should be run in `R`.***
309 | 
310 | The differential expression analysis contains the following steps:
311 | 
312 | - Import count data
313 | - Import data to DEseq2
314 | - Differential gene expression
315 | - QC plots
316 | 
317 | Following these steps, [functional analysis](#functional-analysis) will be carried out to investigate differential expression of biological pathways. In this analysis, GC-normalised counts from Salmon will be input into DESeq2, which will run the standard DESeq2 normalisation. Optionally, normalisation can be carried out using cqn to correct for sample-specific biases (described at the end of this page). If cqn is the method of choice, Salmon should be run *without* the `--gcBias` flag.
318 | 
319 | To install the required packages:
320 | 
321 | ```R
322 | if (!requireNamespace("BiocManager", quietly = TRUE))
323 |   install.packages("BiocManager")
324 | 
325 | #BiocManager::install("cqn") #optional for cqn normalisation
326 | BiocManager::install("DESeq2")
327 | BiocManager::install("tximport")
328 | BiocManager::install("biomaRt")
329 | ```
330 | 
331 | > Import count data
332 | 
333 | Salmon outputs TPM values (the 'abundance', transcripts per million) and estimated counts mapped to transcripts. The counts will be combined to gene-level estimates in R. The output files from salmon, `quant.sf`, will be imported into R using `tximport` (described in detail [here](http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html) by Love, Soneson & Robinson). This will require a list of sample IDs as well as a file containing transcript-to-gene ID mappings, in order to convert the transcript-level estimates to gene-level counts.
334 | 
335 | 1) **Create a table containing the sample IDs**. The table should have at least three columns: the first with the sample IDs, the second with the path to each Salmon `quant.sf` file, and the third with the group (e.g. treatment or sample). This can be generated in Excel, for example, and saved as a tab-delimited text file called `samples.txt`. Save it without a header row, since `read.table` does not expect one by default. 
336 | 
337 | ```R
338 | #Read in the files with the sample information
339 | samples = read.table('samples.txt')
340 | ```
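
As a rough sketch, the same table could also be built directly in R rather than in Excel. The sample IDs, paths and group labels below are placeholders, not files from this repository, and the file is written to a temporary directory purely for illustration. It is written without a header row so that `read.table` reads it with its default settings.

```R
#Hypothetical sample sheet: the IDs, paths and groups are placeholders
ids = c('sampleA', 'sampleB', 'sampleC', 'sampleD')
samples.example = data.frame(id = ids,
                             path = file.path('salmon', ids, 'quant.sf'),
                             group = c('control', 'control', 'treated', 'treated'),
                             stringsAsFactors = FALSE)

#Write a tab-delimited file with no header and no row names
samples.file = file.path(tempdir(), 'samples.txt')
write.table(samples.example, samples.file, sep = '\t', quote = FALSE,
            row.names = FALSE, col.names = FALSE)

#Read it back in the same way as above; the columns are named V1, V2 and V3
samples.check = read.table(samples.file, stringsAsFactors = FALSE)
dim(samples.check)
```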
341 | 
342 | 2) **Read in the transcript to gene ID file** provided in this repository (generated from gencode v36).
343 | 
344 | ```R
345 | #Read in the gene/transcript IDs 
346 | tx2gene = read.table('tx2gene_gencodev36-unique.txt', sep = '\t')
347 | ```
348 | 
349 | 3) **Read in the count data using `tximport`**. This will combine the transcript-level counts to gene-level. 
350 | 
351 | ```R
352 | library(tximport)
353 | 
354 | #Column 2 of samples, samples[,2], contains the paths to the quant.sf files
355 | counts.imported = tximport(files = as.character(samples[,2]), type = 'salmon', tx2gene = tx2gene)
356 | ```
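
To illustrate what combining transcript-level counts to gene-level means, the toy example below sums counts per gene in base R. It is a simplified sketch with made-up transcript, gene and sample names; `tximport` does this for the estimated counts and additionally handles the abundance and length matrices.

```R
#Toy transcript-level counts (rows = transcripts, columns = samples); all names are hypothetical
tx.counts = matrix(c(10,  5,
                      0, 20,
                      1,  4,
                      3,  7),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c('tx1', 'tx2', 'tx3', 'tx4'), c('sampleA', 'sampleB')))

#Toy transcript-to-gene table in the same spirit as tx2gene (transcript ID, gene ID)
tx2gene.toy = data.frame(tx = c('tx1', 'tx2', 'tx3', 'tx4'),
                         gene = c('geneX', 'geneX', 'geneY', 'geneY'),
                         stringsAsFactors = FALSE)

#Look up the gene for each transcript, then sum the rows that share a gene
gene.of.tx = tx2gene.toy$gene[match(rownames(tx.counts), tx2gene.toy$tx)]
gene.counts = rowsum(tx.counts, group = gene.of.tx)
gene.counts
```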
357 | 
358 | To use cqn normalisation, see the optional description at the [end](#cqn-normalisation). Otherwise, the default DESeq2 normalisation will be used.
359 | 
360 | 
361 | > Import data to DESeq2 
362 | 
363 | An excellent tutorial on how DESeq2 works, including how differential expression is calculated and how dispersion is estimated, is provided in [this](https://github.com/hbctraining/DGE_workshop_salmon/blob/master/lessons/04_DGE_DESeq2_analysis.md) hbctraining lesson and in the [DESeq2 vignette](http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
364 | 
365 | The count information will be input into DESeq2. A data frame called `colData` should be generated. The row names will be the unique sample IDs, while the columns should contain the conditions being tested for differential expression, in addition to any effects to be controlled for. In the example below, the column called `condition` contains the treatment, while the column `batch` contains the donor ID. Other covariates, such as age, could also be added. 
366 | 
367 | The design, as shown below, should read `~ batch + condition`, where batch is an effect to be controlled for and condition is the condition to be tested, such as treated vs untreated or disease vs healthy. `batch` and `condition` (or your own variables with your preferred names), should be columns in `colData`.
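
A minimal sketch of how `colData` might be built is given here, assuming a sample sheet laid out like `samples.txt` (ID, path, group). The sample IDs, donor IDs and condition labels are hypothetical. The rows should be in the same order as the files passed to `tximport`, and the first factor level of `condition` is treated as the reference group.

```R
#Hypothetical sample sheet in the same layout as samples.txt (ID, path, group)
ids = c('s1', 's2', 's3', 's4')
samples.toy = data.frame(V1 = ids,
                         V2 = file.path('salmon', ids, 'quant.sf'),
                         V3 = c('untreated', 'treated', 'untreated', 'treated'),
                         stringsAsFactors = FALSE)

#Hypothetical donor IDs, one per sample, to be controlled for as 'batch'
donors = c('donor1', 'donor1', 'donor2', 'donor2')

#Row names are the sample IDs; 'untreated' is listed first so it becomes the reference level
colData = data.frame(condition = factor(samples.toy$V3, levels = c('untreated', 'treated')),
                     batch = factor(donors),
                     row.names = samples.toy$V1)
colData
```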
368 | 
369 | ```R
370 | library(DESeq2) #Import to DESeq2
371 | counts.DEseq = DESeqDataSetFromTximport(counts.imported, colData = colData, design = ~batch + condition)
372 | 
373 | #Optional, cqn only: add the cqn normalisation factors before running DESeq()
374 | #normalizationFactors(counts.DEseq) <- cqnNormFactors
375 | 
376 | dds <- DESeq(counts.DEseq)
377 | resultsNames(dds) #lists the coefficients
378 | 
379 | plotDispEsts(dds)
380 | ```
381 | 
382 | > Differential gene expression
383 | 
384 | There are several models available to calculate differential gene expression. Here, the apeglm shrinkage method will be applied to shrink large log fold-changes that have little statistical support, such as those from lowly expressed genes with highly variable counts. This hbctraining [tutorial](https://hbctraining.github.io/DGE_workshop/lessons/05_DGE_DESeq2_analysis2.html) describes the DESeq2 model fitting and hypothesis testing. 
385 | 
386 | ```R
387 | library(apeglm)
388 | 
389 | #List the names of the coefficients and choose your comparison
390 | resultsNames(dds)
391 | 
392 | #Substitute the '????' with a comparison, selected from the resultsNames(dds) shown above
393 | LFC <- lfcShrink(dds, coef = "????", type = "apeglm")
394 | ```
395 | 
396 | The `LFC` data frame contains the log2 fold-change, as well as the p-value and adjusted p-value:
397 | 
398 |  <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/LFC.png" width="800">
399 | 
400 |  Following quality control analysis, we will [explore the data](#data-exploration) to check the numbers of differentially expressed genes (DEGs), the top DEGs and pathways of differential expression.
401 | 
402 | > QC plots
403 | 
404 | Before moving on to functional analysis, such as gene set enrichment analysis, quality control should be carried out on the differential expression analyses. The types of plots which will be generated below are:
405 | 
406 | - Principal component analysis - sample clustering
407 | - Biological replicate correlation
408 | - MA plot
409 | - p-value distribution
410 | - Volcano plot
411 | 
412 | > Principal component analysis
413 | 
414 | A common component of analysing RNA-seq data is to carry out QC by testing if expected samples cluster together. One popular tool is principal component analysis (PCA) (the following steps are adapted from a [hbctraining tutorial on clustering](https://github.com/hbctraining/DGE_workshop_salmon/blob/master/lessons/03_DGE_QC_analysis.md)). Useful resources include this [blog post](https://builtin.com/data-science/step-step-explanation-principal-component-analysis) by Zakaria Jaadi and a [video](https://www.youtube.com/watch?v=_UVHneBUBW0) on PCA by StatQuest. 
415 | 
416 | **If you have few samples:**
417 | 
418 | ```R
419 | rld <- rlog(dds, blind = TRUE)
420 | rld_mat <- assay(rld)
421 | pca <- prcomp(t(rld_mat))
422 | ```
423 | 
424 | **If you have more samples** (e.g. >20), the vst transformation will be faster:
425 | 
426 | ```R
427 | vst.r <- vst(dds,blind = TRUE)
428 | vst_mat <- assay(vst.r)
429 | pca <- prcomp(t(vst_mat))
430 | ```
431 | 
432 | The results can be plotted using ggplot2. Several examples are provided below:
433 | 
434 | ```R
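435 | library(ggplot2)
436 | theme <- theme(panel.background = element_blank(), panel.border = element_rect(fill = NA), plot.title = element_text(hjust = 0.5)) #plot theme added to the plots below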
437 | z = plotPCA(vst.r, "condition")
438 | nudge <- position_nudge(y = 2,x=6)
439 | z + geom_text(aes(label = name), position = nudge) +theme 
440 | 
441 | #plotPCA from DESeq2 uses the 500 most variable genes by default:
442 | data = plotPCA(rld, intgroup = c("condition", "batch"), returnData = TRUE)
443 | p <- ggplot(data, aes(x = PC1, y = PC2, color = condition ))
444 | p <- p + geom_point() + theme 
445 | print(p)
446 | 
447 | #Alternatively, PCA can be carried out using all genes:
448 | df_out <- as.data.frame(pca$x)
449 | df_out$group <- samples[,3]
450 | 
451 | #Include the next two lines to add the PC % to the axis labels
452 | #percentage <- round(pca$sdev^2 / sum(pca$sdev^2) * 100, 2) #variance explained is based on sdev squared
453 | #percentage <- paste( colnames(df_out), paste0(" (", as.character(percentage), "%", ")"), sep="") 
454 | 
455 | p <- ggplot(df_out, aes(x = PC1, y = PC2, color = group))
456 | p <- p + geom_point() + theme #+ xlab(percentage[1]) + ylab(percentage[2])
457 | 
458 | print(p)
459 | ```
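
The percentage of variance explained by each principal component is the squared standard deviation of that component divided by the sum of the squared standard deviations. The sketch below shows this on a small random matrix standing in for `rld_mat` or `vst_mat`; the toy data and object names are illustrative only.

```R
set.seed(1) #for reproducibility

#Toy expression matrix: 50 'genes' (rows) by 4 'samples' (columns)
toy_mat <- matrix(rnorm(200), nrow = 50, ncol = 4,
                  dimnames = list(NULL, paste0('sample', 1:4)))

#prcomp expects samples in rows, hence the transpose
toy_pca <- prcomp(t(toy_mat))

#Variance explained per component (%), from the squared standard deviations
var_explained <- round(toy_pca$sdev^2 / sum(toy_pca$sdev^2) * 100, 2)
names(var_explained) <- colnames(toy_pca$x)
var_explained

#Axis labels in the same style as the ggplot2 example
axis_labels <- paste0(names(var_explained), ' (', var_explained, '%)')
```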
460 | 
461 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/PCA-vsd-no-correction-no-label(small).png" width="500">
462 | 
463 | To generate the PCA plot with any batch effects removed:
464 | 
465 | ```R
466 | #Batch effect (donor) removed
467 | assay(vst.r) <- limma::removeBatchEffect(assay(vst.r), vst.r$batch)
468 | 
469 | z=plotPCA(vst.r, "condition")
470 | nudge <- position_nudge(y = 1,x=4)
471 | z + geom_text(size=2.5, aes(label = name), position = nudge) + theme 
472 | 
473 | #An example with no labels
474 | z=plotPCA(vst.r, "condition")
475 | nudge <- position_nudge(y = 1,x=4)
476 | z + geom_text(size=2.5, aes(label = NA), position = nudge) + theme 
477 | ```
478 | 
479 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/PCA-vsd-remove-DONOR-no-label(small).png" width="600">
480 | 
481 | > MA plot
482 | 
483 | An MA plot is a scatter plot of the log fold-change between two conditions against the average gene expression (mean of normalised counts). An MA plot can be generated using the following command from DESeq2:
484 | 
485 | ```R
486 | #Add a title to reflect your comparison 
487 | plotMA(LFC, main = '???', cex = 0.5)
488 | ```
489 | 
490 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/MAplot.png" width="400">
491 | 
492 | > Distribution of p-values and FDRs
493 | 
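494 | The distribution of p-values following a differential expression analysis can indicate whether there is an enrichment of differentially expressed genes and whether the assumptions of the statistical test hold. Under the null hypothesis p-values are uniformly distributed, so a well-behaved test typically gives a roughly flat histogram with an additional peak near zero from the differentially expressed genes. 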
495 | 
496 | ```R
497 | #The distribution of p-values
498 | hist(LFC$pvalue, breaks = 50, col = 'grey', main = '???', xlab = 'p-value')
499 | 
500 | #The false-discovery rate distribution
501 | hist(LFC$padj, breaks = 50, col = 'grey', main = '???', xlab = 'Adjusted p-value')
502 | ```
503 | 
504 | The p-value distribution:
505 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/P-values-hist.png" width="800">
506 | 
507 | The false discovery rate (FDR) distribution:
508 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/FDR-hist.png" width="800">
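
To give a feel for these shapes, the simulation below compares p-values with no true effects against a mixture in which a fraction of genes have small p-values. It is only an illustration: the 80/20 split and the beta distribution used for the non-null genes are arbitrary assumptions, not values estimated from real data.

```R
set.seed(42) #for reproducibility

#No true effects: p-values are uniform between 0 and 1
p.null = runif(10000)

#Hypothetical mixture: 80% null genes plus 20% genes with small p-values
p.mixed = c(runif(8000), rbeta(2000, 0.5, 20))

#Bin the p-values as hist() would; use plot(h.null) and plot(h.mixed) to draw them
h.null = hist(p.null, breaks = seq(0, 1, by = 0.05), plot = FALSE)
h.mixed = hist(p.mixed, breaks = seq(0, 1, by = 0.05), plot = FALSE)

#Fraction of p-values falling in the first bin (p <= 0.05)
frac.null = h.null$counts[1] / sum(h.null$counts)
frac.mixed = h.mixed$counts[1] / sum(h.mixed$counts)
c(null = frac.null, mixed = frac.mixed)
```

The null histogram is roughly flat, while the mixture has a clear excess in the first bin.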
509 | 
510 | > Volcano plots
511 | 
512 | A volcano plot is a scatterplot of the -log10 (adjusted) p-value of differential expression against the log2 fold-change. The volcano plot can be designed to highlight significant genes using a p-value and fold-change cut-off.
513 | 
514 | Volcano plots are generated as described by [Ignacio González](http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf)
515 | 
516 | ```R
517 | #Allow for more space around the borders of the plot
518 | par(mar = c(5, 4, 4, 4))
519 | 
520 | #Set your log-fold-change and p-value thresholds
521 | lfc = 2
522 | pval = 0.05
523 | 
524 | tab = data.frame(logFC = LFC$log2FoldChange, negLogPval = -log10(LFC$padj))#make a data frame with the log2 fold-changes and adjusted p-values
525 | 
526 | plot(tab, pch = 16, cex = 0.4, xlab = expression(log[2]~fold~change),
527 |      ylab = expression(-log[10]~adjusted~pvalue), main = '???') #replace main = with your title
528 | 
529 | #Genes with an absolute log2 fold-change greater than 2 and adjusted p-value<0.05:
530 | signGenes = (abs(tab$logFC) > lfc & tab$negLogPval > -log10(pval))
531 | 
532 | #Colour these red
533 | points(tab[signGenes, ], pch = 16, cex = 0.5, col = "red")
534 | 
535 | #Show the cut-off lines
536 | abline(h = -log10(pval), col = "green3", lty = 2)
537 | abline(v = c(-lfc, lfc), col = "blue", lty = 2)
538 | 
539 | mtext(paste("FDR =", pval), side = 4, at = -log10(pval), cex = 0.6, line = 0.5, las = 1)
540 | mtext(c(paste("-", lfc, "fold"), paste("+", lfc, "fold")), side = 3, at = c(-lfc, lfc),
541 |       cex = 0.6, line = 0.5)
542 | ```
543 | 
544 | The resulting plot will look like this:
545 | 
546 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/volcano-plot-QL-0.05.png" width="600">
547 | 
548 | ## Data exploration
549 | 
550 | How many genes are differentially expressed? What are the top DEGs? How do I plot the expression for candidate genes?
551 | 
552 | 
553 | 1) **How many genes are differentially expressed?**
554 | 
555 | ```R
556 | #Attach the results so that columns such as padj can be referenced directly
557 | attach(as.data.frame(LFC))
558 | 
559 | #The total number of DEGs with an adjusted p-value<0.05
560 | summary(LFC, alpha=0.05)
561 | 
562 | #The total number of DEGs with an adjusted p-value<0.05 AND absolute log2 fold-change > 2
563 | sum(!is.na(padj) & padj < 0.05 & abs(log2FoldChange) >2)
564 | 
565 | #Decreased expression:
566 | sum(!is.na(padj) & padj < 0.05 & log2FoldChange <0) #any fold-change
567 | sum(!is.na(padj) & padj < 0.05 & log2FoldChange <(-2)) #log2 fold-change below -2
568 | 
569 | #Increased expression:
570 | sum(!is.na(padj) & padj < 0.05 & log2FoldChange >0) #any fold-change
571 | sum(!is.na(padj) & padj < 0.05 & log2FoldChange >2) #log2 fold-change above 2
572 | ```
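
Note that the thresholds above are on the log2 scale, so a log2 fold-change of 2 corresponds to a 4-fold change. The toy example below shows the same counting logic with explicit column references instead of `attach`; the values are made up for illustration.

```R
#Toy results table with the same column names as the DESeq2 output
toy.res = data.frame(log2FoldChange = c(3.1, -2.5, 0.4, 1.2, -0.1),
                     padj = c(0.001, 0.02, 0.5, 0.04, NA),
                     row.names = paste0('gene', 1:5))

#Significant genes: adjusted p-value below 0.05, ignoring genes with NA
sig = !is.na(toy.res$padj) & toy.res$padj < 0.05

n.sig = sum(sig)                                  #all significant genes
n.up = sum(sig & toy.res$log2FoldChange > 2)      #log2 fold-change above 2
n.down = sum(sig & toy.res$log2FoldChange < -2)   #log2 fold-change below -2

#Convert a log2 fold-change threshold to a linear fold-change
fold = 2^2
```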
573 | 
574 | 2) **What are the top genes?**
575 | 
576 | ```R
577 | #At this stage it may be useful to create a copy of the results with the gene version removed from the gene name, to make it easier for you to search for the gene name etc. 
578 | #The rownames currently appear as 'ENSG00000175197.12, ENSG00000128272.15' etc.
579 | #To change them to 'ENSG00000175197, ENSG00000128272'
580 | LFC.gene = as.data.frame(LFC)
581 | 
582 | #Some gene names are repeated if they are in the PAR region of the Y chromosome. Since dataframes cannot have duplicate row names, we will leave these gene names as they are and rename the rest.
583 | whichgenes = which(!grepl('PAR', rownames(LFC.gene)))
584 | rownames(LFC.gene)[whichgenes] = unlist(lapply(strsplit(rownames(LFC.gene)[whichgenes], '\\.'), '[[',1))
585 |  
586 | #subset the significant genes
587 | LFC.sig = LFC.gene[!is.na(LFC.gene$padj) & LFC.gene$padj < 0.05,]
588 | 
589 | #We can add a column with the HGNC gene names
590 | library(biomaRt)
591 | ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
592 | 
593 | converted <- getBM(attributes=c('hgnc_symbol','ensembl_gene_id'), filters = 'ensembl_gene_id',
594 |                  values = rownames(LFC.sig), mart = ensembl)
595 | 
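596 | #Add gene names to the LFC.sig data-frame, matching on the Ensembl gene ID (NA where there is no HGNC symbol)
597 | LFC.sig$hgnc = converted$hgnc_symbol[match(rownames(LFC.sig), converted$ensembl_gene_id)]
598 | 
599 | #View the 10 genes with the most significant (adjusted) p-values
600 | head(LFC.sig[order(LFC.sig$padj),], n = 10)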
601 | 
602 | #The largest fold-changes with a significant p-value
603 | LFC.sig[order(abs(LFC.sig$log2FoldChange), decreasing = TRUE),][1:10,] #add the [1:10,] to see the top 10 rows
604 | ```
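
The version stripping and the ID-to-symbol lookup can be checked on a small example before running them on the full results. In the sketch below the first two IDs are taken from the comments above, while the PAR ID and the gene symbols are hypothetical placeholders. `match` keeps the symbols aligned with the row order and returns `NA` where an ID has no entry.

```R
#Versioned gene IDs; the third (PAR) ID is a hypothetical placeholder
ids = c('ENSG00000175197.12', 'ENSG00000128272.15', 'ENSG00000000000.1_PAR_Y')

#Remove the version suffix, leaving PAR genes untouched to avoid duplicate names
stripped = ifelse(grepl('PAR', ids), ids, sub('\\.[0-9]+$', '', ids))

#Hypothetical lookup table in the same layout as the getBM output, in a different order
lookup = data.frame(hgnc_symbol = c('GENE_B', 'GENE_A'),
                    ensembl_gene_id = c('ENSG00000128272', 'ENSG00000175197'),
                    stringsAsFactors = FALSE)

#Align the symbols to the order of the IDs
symbols = lookup$hgnc_symbol[match(stripped, lookup$ensembl_gene_id)]
data.frame(id = stripped, hgnc = symbols)
```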
605 | 
606 | 3) **Can I plot the expression for the top genes?**
607 | 
608 | ```R
609 | #Select your chosen gene 
610 | tmp = plotCounts(dds, gene = grep('ENSG00000000003', names(dds), value = TRUE), intgroup = "condition", pch = 18, main = '??? expression', returnData = TRUE)
611 | 
612 | theme <- theme(panel.background = element_blank(), panel.border = element_rect(fill = NA),
613 |              plot.title = element_text(hjust = 0.5))
614 | 
615 | p <- ggplot(tmp, aes(x = condition, y = count)) + geom_boxplot() + 
616 |   geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize = 0.6) + ggtitle('??? expression') + theme
617 | 
618 | print(p)
619 | ```
620 | 
621 | ## Functional analysis
622 | 
623 | Functional analysis can further investigate the differential expression of each gene. Pathway analysis is a popular approach with which to investigate the differential expression of pathways, including genes with similar biological functions. This can be achieved using gene set analysis (GSA). There are many flavours of GSA. They can be categorised as shown below by [Das et al. (2020)](https://www.mdpi.com/1099-4300/22/4/427).
624 | 
625 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/GSA-classification.png" width="600">
626 | 
627 | Here, we will use one gene annotation approach and one gene set enrichment analysis (GSEA) approach. 
628 | 
629 | ### GoSeq - gene annotation
630 | 
631 | [GoSeq](https://bioconductor.org/packages/release/bioc/html/goseq.html), developed by [Young et al. (2010)](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-2-r14), tests for the enrichment of Gene Ontology terms.
632 | 
633 | ```R
634 | BiocManager::install("goseq")
635 | library(goseq)
636 | 
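637 | #Extract the differential expression data, with false discovery rate correction. topTags() is an edgeR function, so ql.groups12 is assumed to be an edgeR test result (it is not created in the DESeq2 steps above)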
638 | groups12.table <- as.data.frame(topTags(ql.groups12, n = Inf))
639 | 
640 | #Remove the version numbers from the ENSEMBL gene IDs
641 | rownames(groups12.table) <- unlist(lapply(strsplit(rownames(groups12.table), '\\.'), `[[`, 1))
642 | ```
643 | 
644 | The genes can be separated into those which show significantly increased expression and those which show significantly decreased expression. Here, the FDR threshold is set to 0.05. A minimum fold-change can also be defined. 
645 | 
646 | ```R
647 | #Decreased expression
648 | ql53.DEGs.down <- groups12.table$FDR < 0.05 & groups12.table$logFC<0
649 | names(ql53.DEGs.down) <- rownames(groups12.table)
650 | pwf.dn <- nullp(ql53.DEGs.down, "hg19", "ensGene")
651 | go.results.dn <- goseq(pwf.dn, "hg19", "ensGene")
652 | 
653 | #Increased expression
654 | ql53.DEGs.up <- groups12.table$FDR < 0.05 & groups12.table$logFC>0
655 | names(ql53.DEGs.up) <- rownames(groups12.table)
656 | pwf.up <- nullp(ql53.DEGs.up, "hg19","ensGene")
657 | go.results.up <- goseq(pwf.up, "hg19","ensGene")
658 | ```
659 | 
660 | The `go.results.up` dataframe looks like this:
661 | 
662 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/Go-output.png" width="800">
663 | 
664 | Significant results can be saved...
665 | 
666 | ```R
667 | write.table(go.results.up[go.results.up$over_represented_pvalue<0.05,1:2], 'p53-GO-up0.05.txt', quote=FALSE, sep='\t', row.names=FALSE, col.names=FALSE)
668 | ```
669 | 
670 | ...and uploaded to the REVIGO tool which collapses and summarises redundant GO terms. Copy the contents of the `p53-GO-up0.05.txt` into the [REVIGO](http://revigo.irb.hr/) box:
671 | 
672 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/REVIGO-input.png" width="500">
673 | 
674 | After running, select the 'Scatterplot & Table' tab and scroll down to 'export results to text table (csv)':
675 | 
676 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/REVIGO2.png" width="800">
677 | 
678 | Save the downloaded file as `REVIGO-UP.csv`. The package `ggplot2` can be used here to visualise the log10 p-values for GO term enrichment. To view the top 20 terms:
679 | 
680 | ```R
681 | #Read in the REVIGO output
682 | revigoUP = read.table('REVIGO-UP.csv', sep = ',', header = TRUE)
683 | 
684 | #Sort by p-value and extract the top 20
685 | revigoUP = revigoUP[order(revigoUP$log10.p.value),]
686 | revigoUP = head(revigoUP, n = 20)
687 | 
688 | #Convert the GO terms to factors for compatibility with ggplot2
689 | revigoUP$description <- factor(revigoUP$description, levels = revigoUP$description)
690 | 
691 | #Plot the barplot
692 | p <- ggplot(data = revigoUP, aes(x = log10.p.value, y = description, fill = description)) +
693 |   geom_bar(stat = "identity") 
694 | 
695 | p + scale_fill_manual(values = rep("steelblue2", nrow(revigoUP))) + theme_minimal() + theme(legend.position = "none") + 
696 |         ylab('')
697 | ```
698 | 
699 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/REVIGO-UP.png" width="800">
700 | 
701 | ### Gene Set Enrichment Analysis
702 | 
703 | Gene set enrichment analysis (GSEA) will be used to test for the altered expression of pre-defined *sets* of genes.  
704 | 
705 | ```R
706 | BiocManager::install("piano")
707 | library(piano)
708 | ```
709 | 
710 | ## Resources
711 | 
712 | Many resources were used in building this RNA-seq tutorial.
713 | 
714 | Highly recommended RNA-seq tutorial series:
715 | 
716 | - [Introduction to differential gene expression analysis](https://hbctraining.github.io/DGE_workshop/lessons/01_DGE_setup_and_overview.html#rna-seq-count-distribution)
717 | 
718 | Other RNA-seq tutorials:
719 | - [Statistical analysis of RNA-Seq data](http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf) by Ignacio González 
720 | - [Analysis of RNA-Seq data: gene-level exploratory analysis and differential expression](https://www.huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html#gene-ontology-enrichment-analysis) by Bernd Klaus and Wolfgang Huber
721 | 
722 | 
723 | - <https://vallierlab.wixsite.com/pipelines/rna-seq>
724 | - [RNA-seq workflow: gene-level exploratory analysis and differential expression by Love et al. 2019](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#running-the-differential-expression-pipeline)
725 | - https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html
726 | - The Encode pipeline for long-RNAs: https://www.encodeproject.org/data-standards/rna-seq/long-rnas/
727 | 
728 | Understanding normalisation:
729 | - https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
730 | 
731 | 
732 | 
733 | 
734 | **Preseq**: Estimates library complexity
735 | 
736 | **Picard RNAseqMetrics**: Number of reads that align to coding, intronic, UTR, intergenic, ribosomal regions, normalize gene coverage across a meta-gene body, identify 5’ or 3’ bias
737 | 
738 | **RSeQC**: Suite of tools to assess various post-alignment quality, Calculate distribution of Insert Size, Junction Annotation (% Known, % Novel read spanning splice junctions), BAM to BigWig (Visual Inspection with IGV)
739 | 
740 | ## cqn Normalisation 
741 | 
742 | The count data need to be normalised for several confounding factors. The number of reads (or fragments for paired-end data) mapped to a gene is influenced by (1) its GC content, (2) its length and (3) the total library size for the sample. There are multiple methods used for normalisation. Here, conditional quantile normalisation (cqn) is used as recommended by [Mandelboum et al. (2019)](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000481) to correct for sample-specific biases. Cqn is described by [Hansen et al. (2012)](https://academic.oup.com/biostatistics/article/13/2/204/1746212).
743 | 
744 | `cqn` requires an input of gene length, gc content and the estimated library size per sample (which it will estimate as the total sum of the counts if not provided by the user). For more guidance on how to normalise using `cqn` and import into `DESeq2`, the user is directed to [the cqn vignette](https://bioconductor.org/packages/release/bioc/vignettes/cqn/inst/doc/cqn.pdf) by Hansen & Wu and the [tximport vignette](http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html) by Love, Soneson & Robinson.
745 | 
746 | ```R
747 | #Read in the gene lengths and gc-content data frame (provided in this repository)
748 | genes.length.gc = read.table('gencode-v36-gene-length-gc.txt', sep = '\t')
749 | ```
750 | 
751 | At this stage, technical replicates can be combined if they have not been already. This is typically achieved by summing the counts. 
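
As a small illustration of summing technical replicates, the sketch below collapses the columns of a toy count matrix that belong to the same sample. The run and sample names are hypothetical, and a real analysis would apply the same idea to `counts.imported$counts`.

```R
#Toy counts: sample s1 was sequenced in two runs, sample s2 in one
tech.counts = matrix(c( 5,  7, 2,
                        1,  0, 3,
                       10, 12, 8),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c('geneA', 'geneB', 'geneC'),
                                     c('s1_run1', 's1_run2', 's2_run1')))

#Which sample each column (run) belongs to
sample.of.run = c('s1', 's1', 's2')

#rowsum() sums rows by group, so transpose, sum the runs per sample, and transpose back
combined = t(rowsum(t(tech.counts), group = sample.of.run))
combined
```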
752 | 
753 | To carry out the normalisation:
754 | 
755 | ```R
756 | library(cqn)
757 | #cqn normalisation
758 | counts = counts.imported$counts
759 | 
760 | #Exclude genes with no length information, for compatibility with cqn.
761 | counts = counts[!is.na(genes.length.gc[rownames(counts),]$length),] #logical indexing is safe even if no lengths are missing
762 | 
763 | #Extract the lengths and GC contents for genes in the same order as the counts data-frame
764 | geneslengths = genes.length.gc[rownames(counts),]$length
765 | genesgc = genes.length.gc[rownames(counts),]$gc
766 | 
767 | #Run the cqn normalisation 
768 | cqn.results <- cqn(counts, genesgc, geneslengths, lengthMethod = c("smooth"))
769 | 
770 | #Extract the offset, which will be input directly into DEseq2 to normalise the counts. 
771 | cqnoffset <- cqn.results$glm.offset
772 | cqnNormFactors <- exp(cqnoffset)
773 | 
774 | #The 'counts' object imported from tximport also contains data-frames for 'length' and 'abundance'.
775 | #These data-frames should also be subset to remove any genes excluded from the 'NA' length filter
776 | counts.imported$abundance = counts.imported$abundance[rownames(counts),]
777 | counts.imported$counts = counts.imported$counts[rownames(counts),]
778 | counts.imported$length = counts.imported$length[rownames(counts),]
779 | ```
780 | 
781 | The normalised gene expression values can be saved from the cqn output. These values will not be used for the downstream differential expression analysis; rather, they are useful for visualisation. Differential expression will be calculated within DESeq2 using a negative binomial model, to which the cqn offset will be added. 
782 | 
783 | ```R
784 | #The normalised gene expression counts can be saved as:
785 | RPKM.cqn <- cqn.results$y + cqn.results$offset
786 | ``` 
787 | 
788 | > Biological replicate correlation
789 | 
790 | The correlation between the expression of genes in two biological replicates should ideally be very high. The normalised expression values, saved above as `RPKM.cqn` will be used. 
791 | 
792 | ```R
793 | #To test the correlation between the first two samples in columns 1 and 2
794 | plot(RPKM.cqn[,1], RPKM.cqn[,2], pch = 18, cex = 0.5, xlab = colnames(RPKM.cqn)[1], ylab = colnames(RPKM.cqn)[2])
795 | 
796 | #The Pearson correlation coefficient can be calculated as:
797 | cor(RPKM.cqn[,1], RPKM.cqn[,2])
798 | 
799 | #Add it to your plot, replacing x and y with the coordinates for your legend
800 | text(x, y, labels = paste0('r=', round(cor(RPKM.cqn[,1], RPKM.cqn[,2]),2)))
801 | 
802 | #To add a regression line
803 | abline(lm(RPKM.cqn[,2] ~ RPKM.cqn[,1]), col = 'red') #y (column 2) regressed on x (column 1), matching the plot axes
804 | ```
805 | 
806 | <img src="https://github.com/CebolaLab/RNA-seq/blob/master/Figures/biological-rep-correlation.png" width="400">
807 | 


--------------------------------------------------------------------------------