├── Figures
    ├── LFC.png
    ├── PCA.png
    ├── GO-UP.png
    ├── Rplot.png
    ├── FC-CPM.pdf
    ├── FC-CPM.png
    ├── FDR-hist.png
    ├── MAplot.png
    ├── REVIGO1.png
    ├── REVIGO2.png
    ├── GCbiasPlot.png
    ├── Go-output.png
    ├── REVIGO-UP.png
    ├── P-values-hist.png
    ├── REVIGO-input.png
    ├── fastp-summary.png
    ├── GSA-classification.png
    ├── multi-mapped-reads.png
    ├── multiqc-alignment.png
    ├── volcano-plot-QL-0.05.png
    ├── biological-rep-correlation.png
    ├── biological-replicates-exp-logcqn.png
    ├── PCA-vsd-no-correction-no-label(small).png
    └── PCA-vsd-remove-DONOR-no-label(small).png
├── QC-RNA-seq-LSECs.xlsx
├── star-indexes.sh
└── README.md

/Figures/LFC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/LFC.png
--------------------------------------------------------------------------------
/Figures/PCA.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/PCA.png
--------------------------------------------------------------------------------
/Figures/GO-UP.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/GO-UP.png
--------------------------------------------------------------------------------
/Figures/Rplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/Rplot.png
--------------------------------------------------------------------------------
/Figures/FC-CPM.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/FC-CPM.pdf
--------------------------------------------------------------------------------
/Figures/FC-CPM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/FC-CPM.png
--------------------------------------------------------------------------------
/Figures/FDR-hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/FDR-hist.png
--------------------------------------------------------------------------------
/Figures/MAplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/MAplot.png
--------------------------------------------------------------------------------
/Figures/REVIGO1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/REVIGO1.png
--------------------------------------------------------------------------------
/Figures/REVIGO2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/REVIGO2.png
--------------------------------------------------------------------------------
/Figures/GCbiasPlot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/GCbiasPlot.png
--------------------------------------------------------------------------------
/Figures/Go-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/Go-output.png
--------------------------------------------------------------------------------
/Figures/REVIGO-UP.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/REVIGO-UP.png
--------------------------------------------------------------------------------
/QC-RNA-seq-LSECs.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/QC-RNA-seq-LSECs.xlsx
--------------------------------------------------------------------------------
/Figures/P-values-hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/P-values-hist.png
--------------------------------------------------------------------------------
/Figures/REVIGO-input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/REVIGO-input.png
--------------------------------------------------------------------------------
/Figures/fastp-summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/fastp-summary.png
--------------------------------------------------------------------------------
/Figures/GSA-classification.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/GSA-classification.png
--------------------------------------------------------------------------------
/Figures/multi-mapped-reads.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/multi-mapped-reads.png
--------------------------------------------------------------------------------
/Figures/multiqc-alignment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/multiqc-alignment.png
--------------------------------------------------------------------------------
/Figures/volcano-plot-QL-0.05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/volcano-plot-QL-0.05.png
--------------------------------------------------------------------------------
/Figures/biological-rep-correlation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/biological-rep-correlation.png
--------------------------------------------------------------------------------
/Figures/biological-replicates-exp-logcqn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/biological-replicates-exp-logcqn.png
--------------------------------------------------------------------------------
/Figures/PCA-vsd-no-correction-no-label(small).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/PCA-vsd-no-correction-no-label(small).png
--------------------------------------------------------------------------------
/Figures/PCA-vsd-remove-DONOR-no-label(small).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/PCA-vsd-remove-DONOR-no-label(small).png
--------------------------------------------------------------------------------
/star-indexes.sh:
--------------------------------------------------------------------------------
#PBS -l select=1:mem=60gb:ncpus=6
#PBS -l walltime=05:00:00
#PBS -N starIndex

module load star
DIR=/rds/general/user/hm1412/projects/cebolalab_liver_regulomes/ephemeral/reference-genomes/GRCh38

STAR --runThreadN 4 --runMode genomeGenerate --genomeDir $DIR --genomeFastaFiles $DIR/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

mv * $DIR

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# RNA-seq: a step-by-step analysis pipeline.

A step-by-step analysis pipeline for RNA-seq data from the [Cebola Lab](https://www.imperial.ac.uk/metabolism-digestion-reproduction/research/systems-medicine/genetics--genomics/regulatory-genomics-and-metabolic-disease/).

Correspondence: hannah.maude12@imperial.ac.uk

The resources and references used to build this tutorial are found at the bottom, in the [resources](#resources) section.

## Table of Contents

*Run using command line tools (`bash`)*:
- [Pre-alignment quality control (QC)](#pre-alignment-qc)
- [Align to the reference human genome](#align-to-the-reference-genome)
- [Post-alignment QC](#post-alignment-qc)
- [Visualisation](#visualisation)
- [Quantify transcripts](#quantification)
- [Visualise tracks against the reference genome](#visualisation)

*Run in `R`*:
- [Differential gene expression (DGE) analysis](#differential-expression)

**Quality metrics**: throughout this GitHub repository, ![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) icons show where QC measures are obtained. An Excel spreadsheet, into which these metrics can be entered and saved, can be downloaded from this repository.

**Programs required:** it is recommended that the user has anaconda installed, through which all required programs can be installed.
Assuming that anaconda is available, all the required programs can be installed using the following:

```bash
#Install the required programs using anaconda
conda create -n RNA-seq

conda install -n RNA-seq -c bioconda fastqc
conda install -n RNA-seq -c bioconda fastp
conda install -n RNA-seq -c bioconda multiqc
conda install -n RNA-seq -c bioconda star
conda install -n RNA-seq -c bioconda samtools
conda install -n RNA-seq -c bioconda deeptools
conda install -n RNA-seq -c bioconda salmon

#For differential expression using DESeq2
conda create -n DEseq2 r-essentials r-base

conda install -n DEseq2 -c bioconda bioconductor-deseq2
conda install -n DEseq2 -c bioconda bioconductor-tximport
conda install -n DEseq2 -c r r-ggplot2
```

## Introduction

This pipeline is compatible with RNA-seq reads generated by Illumina.

## Pre-alignment QC

> Generate QC report

The raw sequence data should first be assessed for quality. FastQC reports can be generated for all samples to assess sequence quality, GC content, duplication rates, length distribution, K-mer content and adapter contamination. For paired-end reads, run fastqc on both files, with the results output to the current directory:

```bash
fastqc <sample>_1.fastq.gz -d . -o .

fastqc <sample>_2.fastq.gz -d . -o .
```

These FastQC reports can be combined into one summary report using [multiQC](https://multiqc.info/).

To extract the total number of reads from the FastQC report, run the following code (using your own file name for the `_fastqc.zip` archive).

```bash
totalreads=$(unzip -c <sample>_fastqc.zip <sample>_fastqc/fastqc_data.txt | grep 'Total Sequences' | cut -f 2)

echo $totalreads
#This number will be used again later so is saved as a variable 'totalreads'
```

![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) **QC value:** input the total number of reads into the QC spreadsheet.

> Trimming

Trimming is a useful step of pre-alignment QC, which removes low quality reads and contaminating adapter sequences (which occur when the sequencing read length is longer than the DNA insert).

If there is evidence of adapter contamination shown in the FastQC report (see below), specific adapter sequences can be trimmed. Here, the program fastp is used to trim the data. For **paired-end** data:

```bash
#Change the -l argument to change the minimum read length allowed.
fastp -i <sample>_R1.fastq.gz -I <sample>_R2.fastq.gz -o <sample>_R1.trimmed.fastq.gz -O <sample>_R2.trimmed.fastq.gz --detect_adapter_for_pe -l 25 -j <sample>.fastp.json -h <sample>.fastp.html
```

For **single-end** reads (note that adapter detection is not always as effective for single-end reads, so it is advisable to provide the adapter sequence, here the 'Illumina TruSeq Adapter Read 1'):

```bash
fastp -i <sample>.fastq.gz -o <sample>-trimmed.fastq.gz -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -l 25 -j <sample>.fastp.json -h <sample>.fastp.html
```

An HTML report is generated, including the following information:

![fastp summary](https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/fastp-summary.png)

Here, FastQC should be repeated to generate reports for the trimmed data, and a second multiqc report generated:

```bash
fastqc <sample>_R1.trimmed.fastq.gz -d . -o .

fastqc <sample>_R2.trimmed.fastq.gz -d . -o .
```

![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) **QC value:** the number of trimmed reads can be taken from the fastp report, or extracted from the trimmed FastQC files as above, and entered into the QC spreadsheet.

## Align to the reference genome

The raw RNA-seq data in `fastq` format will be aligned to the reference genome, along with a reference transcriptome, to output two alignment files: the genome alignment and the transcriptome alignment.

The RNA-seq reads are aligned using the splice-aware aligner [STAR](https://github.com/alexdobin/STAR). The manual is available [here](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf). The reference genome used is the GRCh38 'no-alt' assembly from NCBI, recommended by [Heng Li](http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use). The genome can be downloaded using `wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz`. This version of the GRCh38 reference genome excludes alternative contigs, which may cause fragments to map in multiple locations. The downloaded genome should be decompressed (for example with `gunzip`) and indexed with STAR.

> Index the reference genome

Set `--sjdbOverhang` to your maximum read length minus 1. The indexing also requires a file containing gene annotation, which comes in `gtf` format. For example, ENCODE provides a gtf file with GRCh38 annotations, containing gencode gene coordinates, along with UCSC tRNAs and a PhiX spike-in. Here, we use `gencode.v36.annotation.gtf` as the most recent gene annotation file. The user should aim to use the most up-to-date reference files, while ensuring that the chromosome naming matches that of the reference genome. For example, UCSC uses the 'chr1, chr2, chr3' naming convention, while ENSEMBL uses '1, 2, 3' etc.
The files suggested here are compatible.

```bash
GENOMEDIR=/path/to/indexed/genome

#Replace <readlength-1> with the maximum read length minus 1, e.g. 99 for 100 bp reads
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir $GENOMEDIR --genomeFastaFiles $GENOMEDIR/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --sjdbGTFfile gencode.v36.annotation.gtf --sjdbOverhang <readlength-1>
```

> Carry out the alignment

STAR can then be run to align the `fastq` raw data to the genome. If the fastq files are in the compressed `.gz` format, the `--readFilesCommand zcat` argument is added. The output file should be unsorted, as required for the downstream quantification step using Salmon. The following options are shown according to the ENCODE recommendations.

For **paired-end** data:

```bash
STAR --runThreadN 4 --genomeDir $GENOMEDIR --readFilesIn <sample>_R1.trimmed.fastq.gz <sample>_R2.trimmed.fastq.gz
--outFileNamePrefix <sample>. --readFilesCommand zcat --outSAMtype BAM Unsorted --quantTranscriptomeBan Singleend --outFilterType BySJout
--alignSJoverhangMin 8 --outFilterMultimapNmax 20
--alignSJDBoverhangMin 1 --outFilterMismatchNmax 999
--outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20
--alignIntronMax 1000000 --alignMatesGapMax 1000000
--quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD
```

For **single-end** data:

```bash
STAR --runThreadN 4 --genomeDir $GENOMEDIR --readFilesIn <sample>-trimmed.fastq.gz
--outFileNamePrefix <sample>. --readFilesCommand zcat --outSAMtype BAM Unsorted --quantTranscriptomeBan Singleend --outFilterType BySJout
--alignSJoverhangMin 8 --outFilterMultimapNmax 20
--alignSJDBoverhangMin 1 --outFilterMismatchNmax 999
--outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20
--alignIntronMax 1000000 --alignMatesGapMax 1000000
--quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD
```

*Hint: all the above code should be on one line!*
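
As an alternative to typing the whole command on one line, each line can end with a backslash so that the shell joins the lines into a single command. The sketch below is a dry run under assumed placeholder values (`GENOMEDIR`, `R1`, `R2` and `PREFIX` are illustrative, and `star_paired_dry_run` is a hypothetical helper name, not part of STAR). It only prints the paired-end command with the same options as above.

```bash
# Placeholder inputs (assumptions): edit these to match your own files.
GENOMEDIR=/path/to/indexed/genome
R1=sample_R1.trimmed.fastq.gz
R2=sample_R2.trimmed.fastq.gz
PREFIX=sample.

# Hypothetical helper: prints the STAR command instead of running it.
# Every line except the last ends with a backslash, so the shell reads
# the whole block as one command.
star_paired_dry_run() {
  echo STAR --runThreadN 4 --genomeDir "$GENOMEDIR" \
    --readFilesIn "$R1" "$R2" \
    --outFileNamePrefix "$PREFIX" --readFilesCommand zcat \
    --outSAMtype BAM Unsorted --quantTranscriptomeBan Singleend \
    --outFilterType BySJout --alignSJoverhangMin 8 \
    --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 \
    --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 \
    --alignIntronMin 20 --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 --quantMode TranscriptomeSAM \
    --outSAMattributes NH HI AS NM MD
}

# Print the assembled command so it can be checked before use.
star_paired_dry_run
```

Once the printed command looks right, removing `echo` from the function runs STAR itself.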

For compatibility with the downstream Salmon quantification, the `--quantMode TranscriptomeSAM` option results in two alignment files, one aligned to the reference genome (`Aligned.*.sam/bam`) and one to the transcriptome (`Aligned.toTranscriptome.out.bam`).

> Merge files [optional]

At this stage, if samples have been sequenced across multiple lanes, the sample files can be combined using `samtools merge`. Various QC tools, such as `deeptools plotCorrelation`, can be used to assess reproducibility and lane effects. The `salmon` quantification does not require files to be merged, since multiple `bam` files can be listed in the command. However, to visualise the RNA-seq data from the combined technical replicates, `bam` files can be merged at this stage. For example, if your sample was split across lanes 1, 2 and 3 (`L001`, `L002`, `L003`):

```bash
samtools merge <sample>-merged.bam <sample>_L001.bam <sample>_L002.bam <sample>_L003.bam
```

## Post-alignment QC

The STAR alignment will have output several files with the following file names:

- Aligned.out.bam
- Aligned.toTranscriptome.out.bam
- Log.final.out
- Log.out
- Log.progress.out
- SJ.out.tab

Two files will be used in downstream analysis: `Aligned.out.bam` for generating genome browser bigWig tracks, and `Aligned.toTranscriptome.out.bam` for quantification and differential gene expression analysis. First, `Aligned.out.bam` will be assessed for quality and processed to generate bigWig tracks.

> Generate QC reports using [qualimap](http://qualimap.bioinfo.cipf.es/doc_html/analysis.html) and samtools

Qualimap will be run on the `Aligned.out.bam` file (or `-merged.bam` if you have merged data).

Qualimap will provide several measures of quality, including how many reads have aligned to exons vs non-coding intergenic regions.
To do this, qualimap requires an annotation file containing the locations of coding regions. The transcript annotation file, `gencode.v36.annotation.gtf`, can be downloaded from [gencode](https://www.gencodegenes.org/human/). (*Note: the most recent annotation file should be used.*)

```bash
#Sort the output bam file. The input file name may end in .Aligned.out.bam, .gzAligned.out.bam or -merged.bam. Edit this code to include the appropriate file name.
samtools sort <sample>.bam > <sample>-sorted.bam
samtools index <sample>-sorted.bam

samtools flagstat <sample>-sorted.bam > <sample>-sorted.flagstat

#Run qualimap to generate QC reports
qualimap bamqc -bam <sample>-sorted.bam -gff gencode.v36.annotation.gtf -outdir <sample>-bamqc-qualimap-report --java-mem-size=16G

qualimap rnaseq -bam <sample>-sorted.bam -gtf gencode.v36.annotation.gtf -outdir <sample>-rnaseq-qualimap-report --java-mem-size=16G
```

![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) **QC value:** the percentage of reads aligned to exons can be extracted as follows:

```bash
#Use the output directory given to 'qualimap rnaseq' above
cat <sample>-rnaseq-qualimap-report/rnaseq_qc_results.txt | grep exonic | cut -d '(' -f 2 | cut -d ')' -f1
```

![#f03c15](https://via.placeholder.com/15/f03c15/000000?text=+) **QC value:** the number of aligned reads, and the number of aligned reads which were properly paired, can be extracted as follows:

```bash
#The total number of reads mapped
cat <sample>-sorted.flagstat | grep mapped | head -n1 | cut -d ' ' -f1

#The total number of properly paired reads
cat <sample>-sorted.flagstat | grep 'properly paired' | head -n1 | cut -d ' ' -f1
```

> A combined qualimap report

Qualimap `multi-bamqc` can then run QC on combined samples and replicates. This includes principal component analysis (PCA) to confirm whether technical and/or biological replicates cluster together.
A text file (`samples.txt`) should be created with three columns: the first with the sample ID, the second with the full path to the bamqc results and the third with the group names.

*Note, some versions of qualimap require the raw_data_qualimapReport directory to be renamed to raw_data.*

```bash
#The -d argument takes the file describing the input data
qualimap multi-bamqc -d samples.txt
```

The QC reports can be combined using [multiqc](https://multiqc.info/), an excellent tool for combining QC reports of multiple samples into one. Example outputs of qualimap/multiqc include the alignment positions:

![multiqc alignment](https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/multiqc-alignment.png)

#### Remove duplicates?

It is generally recommended *not* to remove duplicates when working with RNA-seq data, unless using UMIs (unique molecular identifiers) [(Klepikova et al. 2017)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5357343/). This is because there are likely to be molecules which are natural duplicates of each other, for example originating from genes with a shared sequence in a common domain. Typically, removing duplicates does more harm than good. It is more or less impossible to remove duplicates from single-end data, and research has also suggested that it may cause false negatives when applied to paired-end data. See more in [this useful blog post](https://dnatech.genomecenter.ucdavis.edu/faqs/should-i-remove-pcr-duplicates-from-my-rna-seq-data/). Generally, duplicates are not a problem so long as the *library complexity is high*.

## Visualisation

> Compute GC bias

GC-bias describes the dependence of sequencing depth on the GC-content of the DNA sequence. Bias in DNA fragments, due to the GC-content and start-and-end sequences, may be increased by preferential PCR amplification [(Benjamini and Speed, 2012)](https://academic.oup.com/nar/article/40/10/e72/2411059).
A high rate of PCR duplications, for example when library complexity is low, may cause a significant GC-bias due to the preferential amplification of specific DNA fragments. This can significantly impact transcript abundance estimates. Bias in RNA-seq is explained in a handy [blog](https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/) and [video](https://youtu.be/9xskajkNJwg) by Mike Love.

It is **crucial** to correct GC-bias when comparing groups of samples which may have variable GC content dependence, for example when samples were processed in different libraries. `Salmon`, used later to generate read counts for quantification, has its own in-built method to correct for GC-bias.

When generating `bedGraph` or `bigWig` files for visualisation, the user may opt to correct GC-bias so that coverage appears more uniform. The `deeptools` suite includes tools to calculate GC bias and correct for it.

The reference genome file should be converted to `.2bit` format using [`faToTwoBit`](http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit).
The effective genome size can be calculated using `faCount`, available [here](http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/).
Set the `-l` argument to your fragment length.

The input `bam` file requires an index, which can be generated using `samtools index`.

```bash
#computeGCBias is run as its own command from the deeptools suite
computeGCBias -b <sample>-sorted.bam --effectiveGenomeSize 3099922541 -g GCA_000001405.15_GRCh38_no_alt_analysis_set.2bit -l 100 --GCbiasFrequenciesFile <sample>.freq.txt --biasPlot <sample>.biasPlot.pdf
```

The bias plot format can be changed to png, eps, plotly or svg. If there is significant evidence of a GC bias, this can be corrected using `correctGCBias`. An example of GC bias can be seen in the plot output from `computeGCBias` below:

![GC bias plot](https://raw.githubusercontent.com/CebolaLab/RNA-seq/HEAD/Figures/GCbiasPlot.png)

Correct the GC-bias using `correctGCBias`.
This tool effectively removes reads from regions with greater-than-expected coverage (GC-rich regions) and adds reads to regions with less-than-expected coverage (AT-rich regions). The methods are described by [Benjamini and Speed (2012)](https://academic.oup.com/nar/article/40/10/e72/2411059). The following code can be used:

```bash
correctGCBias -b <sample>-sorted.bam --effectiveGenomeSize 3099922541 -g GCA_000001405.15_GRCh38_no_alt_analysis_set.2bit --GCbiasFrequenciesFile <sample>.freq.txt -o <sample>.gc_corrected.bam [options]
```

**NOTE:** When calculating the GC-bias for ChIP-seq, ATAC-seq, DNase-seq (and CUT&Tag/CUT&Run) it is recommended to filter out problematic regions. These include those with low mappability and high numbers of repeats. The compiled list of [ENCODE blacklist regions](https://www.nature.com/articles/s41598-019-45839-z) should be excluded. However, the ENCODE blacklist regions have little overlap with coding regions and this step is not necessary for RNA-seq data [(Amemiya et al, 2019)](https://www.nature.com/articles/s41598-019-45839-z).

> Generate bigwig files

The `bam` file aligned to the *genome* should be converted to `bigWig` format, which can be uploaded to genome browsers and viewed as a track. First, the bam file aligned to the reference genome may be assessed and corrected for GC bias, to achieve a more even coverage.

Coverage is here normalised to BPM (bins per million mapped reads) during conversion, which deepTools describes as the equivalent of TPM in RNA-seq.

```bash
bamCoverage -b <sample>.gc_corrected.bam -o <sample>.bw --normalizeUsing BPM --samFlagExclude 512
```

There are multiple methods available for normalisation. Recent analysis by [Abrams et al. (2019)](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3247-x#Sec2) advocated TPM as the most effective method.
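
To make the BPM normalisation concrete, the sketch below works through the arithmetic by hand on three toy bins, assuming the deepTools definition of BPM (the reads in a bin divided by the total reads across all bins, in millions). The function name `bpm` and the toy counts are illustrative only. This shows the calculation and is not a reproduction of how `bamCoverage` is implemented.

```bash
# Hypothetical helper: reads one read count per bin from stdin and
# prints the BPM value for each bin (count * 1e6 / total count).
bpm() {
  awk '{ count[NR] = $1; total += $1 }
       END { for (i = 1; i <= NR; i++) printf "%.1f\n", count[i] * 1e6 / total }'
}

# Toy example: three bins containing 10, 30 and 60 reads (100 in total).
printf '10\n30\n60\n' | bpm
```

For these toy counts the three values are 100000.0, 300000.0 and 600000.0, which sum to one million. Values normalised this way are therefore comparable between samples of different sequencing depth.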

> Check correlation of technical and biological replicates

The correlation between `bam` files of biological and/or technical replicates can be calculated as a QC step to ensure that the expected replicates positively correlate. Deeptools [multiBamSummary](https://deeptools.readthedocs.io/en/develop/content/tools/multiBamSummary.html) and [plotCorrelation](https://deeptools.readthedocs.io/en/develop/content/tools/plotCorrelation.html) are useful tools for further investigation.

## Quantification

The `bam` file previously aligned to the *transcriptome* by STAR will next be input into [Salmon](https://combine-lab.github.io/salmon/) in alignment-mode, in order to generate a matrix of gene counts. The Salmon documentation is available [here](https://salmon.readthedocs.io/en/latest/).

> Generate transcriptome

Salmon requires a transcriptome to be generated from the genome `fasta` and annotation `gtf` files used earlier with STAR. This can be generated using `gffread` (source package available for download [here](http://ccb.jhu.edu/software/stringtie/gff.shtml)).

```bash
gffread -w GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa -g GCA_000001405.15_GRCh38_no_alt_analysis_set.fna gencode.v36.annotation.gtf
```

> Run Salmon

Salmon is here used with the variational Bayesian expectation maximisation (VBEM) algorithm for quantification. Quantification is described in the 2020 paper by [Deschamps-Francoeur et al.](https://www.sciencedirect.com/science/article/pii/S2001037020303032), which covers the handling of multi-mapped reads in RNA-seq data. Duplicated sequences such as pseudogenes can cause reads to align to multiple positions in the genome. Where transcripts have exons which are similar to other genomic sequences, the VBEM approach attributes reads to the most likely transcript.
Technical replicates can also be combined by providing the Salmon `-a` argument with a list of bam files, with the file names separated by a space (this may not work on all queue systems; a common error is `segmentation fault (core dump)`). Here, Salmon is run ***without*** any normalisation, on each technical replicate; samples are combined and normalised in the next steps.

For **paired-end** data:

```bash
salmon quant -t GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa --libType A -a <sample>.Aligned.toTranscriptome.out.bam -o <sample>.salmon_quant --gcBias --seqBias
```

For **single-end** data:

If using single-end data, add the `--fldMean` and `--fldSD` parameters to include the mean and standard deviation of the fragment lengths. If listing multiple files to be combined, the library type will need to be specified, as Salmon cannot determine it automatically (see the Salmon documentation for more information).

```bash
salmon quant -t GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa --libType ?? -a <sample>.Aligned.toTranscriptome.out.bam -o <sample>.salmon_quant --fldMean ?? --fldSD ?? --gcBias --seqBias
```

## Differential expression

***All following code should be run in `R`.***

The differential expression analysis contains the following steps:

- Import count data
- Import data to DESeq2
- Differential gene expression
- QC plots

Following these steps, [functional analysis](#functional-analysis) will be carried out to investigate differential expression of biological pathways. In this analysis, GC-normalised counts from Salmon will be input into DESeq2, which will run the standard DESeq2 normalisation. Optionally, normalisation can be carried out using cqn to correct for sample-specific biases (described at the end of this page). If cqn is the method of choice, Salmon should be run *without* the `--gcBias` flag.
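
Before switching to R, it can be worth confirming on the command line that every Salmon run finished and left a non-empty `quant.sf` in its output directory, since those are the files imported in the steps below. This is a minimal shell sketch, assuming the output directories follow the `.salmon_quant` naming used in the Salmon commands above. The function name `list_quant_files` is hypothetical.

```bash
# Hypothetical helper: for every *.salmon_quant directory inside the
# directory given as $1, print the sample name and the path to its
# quant.sf (tab-separated), or warn on stderr if the file is missing
# or empty.
list_quant_files() {
  for dir in "$1"/*.salmon_quant; do
    # Skip anything that is not a directory (including an unmatched glob)
    [ -d "$dir" ] || continue
    sample=$(basename "$dir" .salmon_quant)
    if [ -s "$dir/quant.sf" ]; then
      printf '%s\t%s\n' "$sample" "$dir/quant.sf"
    else
      echo "missing or empty quant.sf for $sample" >&2
    fi
  done
}

# Usage: list the quant.sf files found under the current directory
list_quant_files .
```

The two printed columns (sample ID and path to `quant.sf`) match the first two columns of the sample table described in the import step.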

To install the required packages:

```R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

#BiocManager::install("cqn") #optional for cqn normalisation
BiocManager::install("DESeq2")
BiocManager::install("tximport")
BiocManager::install("biomaRt")
BiocManager::install("apeglm") #used later for log fold-change shrinkage
```

> Import count data

The outputs from Salmon are TPM values (the 'abundance', transcripts per million) and estimated counts mapped to transcripts. The counts will be combined to gene-level estimates in R. The output files from salmon, `quant.sf`, will be imported into R using `tximport` (described in detail [here](http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html) by Love, Soneson & Robinson). This will require a list of sample IDs as well as a file containing transcript to gene ID mappings, in order to convert the transcriptome alignment to gene-level counts.

1) **Create a matrix containing the sample IDs**. The matrix should have at least three columns: the first with the sample IDs, the second with the path to the salmon `quant.sf` files, and the third with the group (e.g. treatment or sample). This can be generated in Excel, for example, and saved as a tab-delimited txt file called `samples.txt`.

```R
#Read in the files with the sample information
samples = read.table('samples.txt')
```

2) **Read in the transcript to gene ID file** provided in this repository (generated from gencode v36).

```R
#Read in the gene/transcript IDs
tx2gene = read.table('tx2gene_gencodev36-unique.txt', sep = '\t')
```

3) **Read in the count data using `tximport`**. This will combine the transcript-level counts to gene-level.

```R
library(tximport)

#Column 2 of samples, samples[,2], contains the paths to the quant.sf files
counts.imported = tximport(files = as.character(samples[,2]), type = 'salmon', tx2gene = tx2gene)
```

To use cqn normalisation, see the optional description at the [end](#cqn_normalisation). Otherwise, the default DESeq2 normalisation will be used.

> Import data to DESeq2

An excellent tutorial on how DESeq2 works, including how differential expression is calculated and how dispersion is estimated, is provided in [this](https://github.com/hbctraining/DGE_workshop_salmon/blob/master/lessons/04_DGE_DESeq2_analysis.md) hbctraining lesson and in the [DESeq2 vignette](http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).

The count information will be input into DESeq2. A data frame called `colData` should be generated. The row names will be the unique sample IDs, while the columns should contain the conditions being tested for differential expression, in addition to any effects to be controlled for. In the example below, the column called `condition` contains the treatment, while the column `batch` contains the donor ID. Other covariates, such as age, could also be added.

The design, as shown below, should read `~ batch + condition`, where `batch` is an effect to be controlled for and `condition` is the condition to be tested, such as treated vs untreated or disease vs healthy. `batch` and `condition` (or your own variables with your preferred names) should be columns in `colData`.
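
As a minimal sketch of what such a `colData` data frame could look like, the example below builds one for a hypothetical experiment with three donors, each contributing one untreated and one treated sample. The sample IDs and factor levels are invented for illustration; the rows are assumed to be in the same order as the `quant.sf` files passed to `tximport`.

```R
#Hypothetical sample sheet: three donors, each with an untreated and a treated sample
colData <- data.frame(
  condition = factor(rep(c("untreated", "treated"), times = 3),
                     levels = c("untreated", "treated")), #first level is the reference
  batch = factor(rep(c("donor1", "donor2", "donor3"), each = 2)),
  row.names = c("S1", "S2", "S3", "S4", "S5", "S6")
)

#The columns of the model matrix show which coefficients the design will estimate
design.matrix <- model.matrix(~ batch + condition, data = colData)
colnames(design.matrix)
```

Setting the factor levels explicitly makes `untreated` the reference level, so the `condition` coefficient describes treated relative to untreated. DESeq2 labels its coefficients differently from `model.matrix`, so use `resultsNames(dds)` to find the exact name to pass to later steps.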

```R
library(DESeq2)

#Import to DESeq2
counts.DEseq = DESeqDataSetFromTximport(counts.imported, colData = colData, design = ~batch + condition)

#Optional: add the normalisation factors from cqn (set these before running DESeq())
#normalizationFactors(counts.DEseq) <- cqnNormFactors

dds <- DESeq(counts.DEseq)
resultsNames(dds) #lists the coefficients

plotDispEsts(dds)
```

> Differential gene expression

There are several models available to calculate differential gene expression. Here, the apeglm shrinkage method will be applied to shrink large log fold-changes that have little statistical support, such as those from lowly expressed genes with highly variable counts. This hbctraining [tutorial](https://hbctraining.github.io/DGE_workshop/lessons/05_DGE_DESeq2_analysis2.html) describes the DESeq2 model fitting and hypothesis testing.

```R
library(apeglm)

#List the names of the coefficients and choose your comparison
resultsNames(dds)

#Substitute the '????' with a comparison, selected from the resultsNames(dds) shown above
LFC <- lfcShrink(dds, coef = "????", type = "apeglm")
```

The `LFC` object contains the log2 fold-change, as well as the p-value and adjusted p-value:

Following quality control analysis, we will [explore the data](#data-exploration) to check the numbers of differentially expressed genes (DEGs), the top DEGs and pathways of differential expression.

> QC plots

Before moving on to functional analysis, such as gene set enrichment analysis, quality control should be carried out on the differential expression analyses.
The types of plots which will be generated below are:

- Principal component analysis - sample clustering
- Biological replicate correlation
- MA plot
- p-value distribution
- Volcano plot

> Principal component analysis

A common component of analysing RNA-seq data is to carry out QC by testing whether the expected samples cluster together. One popular tool is principal component analysis (PCA); the following steps are adapted from a [hbctraining tutorial on clustering](https://github.com/hbctraining/DGE_workshop_salmon/blob/master/lessons/03_DGE_QC_analysis.md). Useful resources include this [blog post](https://builtin.com/data-science/step-step-explanation-principal-component-analysis) by Zakaria Jaadi and a [video](https://www.youtube.com/watch?v=_UVHneBUBW0) on PCA by StatQuest.

**If you have few samples:**

```R
rld <- rlog(dds, blind = TRUE)
rld_mat <- assay(rld)
pca <- prcomp(t(rld_mat))
```

**If you have more samples** (e.g. >20), the vst transformation will be faster:

```R
vst.r <- vst(dds, blind = TRUE)
vst_mat <- assay(vst.r)
pca <- prcomp(t(vst_mat))
```

The results can be plotted using ggplot2.
Several examples are provided below:

```R
library(ggplot2)

#A plain plot theme, reused in the plots below
theme <- theme(panel.background = element_blank(), panel.border = element_rect(fill = NA),
               plot.title = element_text(hjust = 0.5))

z = plotPCA(vst.r, "condition")
nudge <- position_nudge(y = 2, x = 6)
z + geom_text(aes(label = name), position = nudge) + theme

#plotPCA from DESeq2 uses the top 500 most variable genes:
data = plotPCA(rld, intgroup = c("condition", "batch"), returnData = TRUE)
p <- ggplot(data, aes(x = PC1, y = PC2, color = condition))
p <- p + geom_point() + theme
print(p)

#Alternatively, PCA can be carried out using all genes:
df_out <- as.data.frame(pca$x)
df_out$group <- samples[,3]

#Include the next two lines to add the % of variance explained to the axis labels
#percentage <- round(pca$sdev^2 / sum(pca$sdev^2) * 100, 2)
#percentage <- paste(colnames(df_out), paste0(" (", as.character(percentage), "%", ")"), sep = "")

p <- ggplot(df_out, aes(x = PC1, y = PC2, color = group))
p <- p + geom_point() + theme #+ xlab(percentage[1]) + ylab(percentage[2])

print(p)
```

To generate the PCA plot with any batch effects removed:

```R
#Batch effect (donor) removed
assay(vst.r) <- limma::removeBatchEffect(assay(vst.r), vst.r$batch)

z = plotPCA(vst.r, "condition")
nudge <- position_nudge(y = 1, x = 4)
z + geom_text(size = 2.5, aes(label = name), position = nudge) + theme

#An example with no labels
z = plotPCA(vst.r, "condition")
z + theme
```

> MA plot

An MA plot is a scatter plot of the log fold-change between two conditions against the average gene expression (mean of normalised counts).
An MA plot can be generated using the following command from DESeq2:

```R
#Add a title to reflect your comparison
plotMA(LFC, main = '???', cex = 0.5)
```

> Distribution of p-values and FDRs

The distribution of p-values following a differential expression analysis can indicate whether there is an enrichment of differentially expressed genes and whether the assumptions of the statistical test hold. If the test is well calibrated, the p-values of genes that are not differentially expressed are roughly uniform between 0 and 1, while differentially expressed genes add a peak close to 0.

```R
#The distribution of p-values
hist(LFC$pvalue, breaks = 50, col = 'grey', main = '???', xlab = 'p-value')

#The false-discovery rate distribution
hist(LFC$padj, breaks = 50, col = 'grey', main = '???', xlab = 'Adjusted p-value')
```

The p-value distribution:

The false discovery rate (FDR) distribution:

> Volcano plots

A volcano plot is a scatter plot of the p-value of differential expression against the fold-change. The volcano plot can be designed to highlight significant genes, using a p-value and a fold-change cut-off.
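
To illustrate how the two cut-offs combine, here is a small sketch with made-up values for five genes (the data frame `toy` is hypothetical and not part of the analysis). Note that the fold-change cut-off is applied on the log2 scale, so a threshold of 2 corresponds to a four-fold change, and that genes with a missing adjusted p-value are never flagged.

```R
#Made-up results for five genes; padj can be NA for genes that DESeq2 has filtered out
toy <- data.frame(log2FoldChange = c(2.5, -3.1, 0.4, 2.2, -2.8),
                  padj = c(0.001, 0.03, 0.0001, 0.2, NA),
                  row.names = paste0("gene", 1:5))

lfc <- 2     #log2 scale: |log2FC| > 2 means more than a 2^2 = 4-fold change
pval <- 0.05 #adjusted p-value threshold

#which() drops the NA comparisons, so genes without an adjusted p-value are not selected
sig <- which(abs(toy$log2FoldChange) > lfc & toy$padj < pval)
rownames(toy)[sig]
```

Only genes passing both thresholds are returned; a gene with a large fold-change but a non-significant or missing adjusted p-value is left out.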

Volcano plots are generated as described by [Ignacio González](http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf).

```R
#Allow for more space around the borders of the plot
par(mar = c(5, 4, 4, 4))

#Set your log2 fold-change and adjusted p-value thresholds
lfc = 2
pval = 0.05

#Make a data frame with the log2 fold-changes and adjusted p-values
tab = data.frame(logFC = LFC$log2FoldChange, negLogPval = -log10(LFC$padj))

plot(tab, pch = 16, cex = 0.4, xlab = expression(log[2]~fold~change),
     ylab = expression(-log[10]~adjusted~pvalue), main = '???') #replace main = with your title

#Genes with an absolute log2 fold-change greater than 2 and adjusted p-value<0.05
#(which() skips genes with a missing adjusted p-value)
signGenes = which(abs(tab$logFC) > lfc & tab$negLogPval > -log10(pval))

#Colour these red
points(tab[signGenes, ], pch = 16, cex = 0.5, col = "red")

#Show the cut-off lines
abline(h = -log10(pval), col = "green3", lty = 2)
abline(v = c(-lfc, lfc), col = "blue", lty = 2)

mtext(paste("FDR =", pval), side = 4, at = -log10(pval), cex = 0.6, line = 0.5, las = 1)
mtext(c(paste("-", lfc, "fold"), paste("+", lfc, "fold")), side = 3, at = c(-lfc, lfc),
      cex = 0.6, line = 0.5)
```

The resulting plot will look like this:

## Data exploration

How many genes are differentially expressed? What are the top DEGs? How do I plot the expression for candidate genes?

1) **How many genes are differentially expressed?**

```R
#Work with the results as a data frame
LFC.df = as.data.frame(LFC)

#A summary of the DEGs with an adjusted p-value<0.05
summary(LFC, alpha = 0.05)

#The total number of DEGs with an adjusted p-value<0.05 AND absolute log2 fold-change > 2
sum(!is.na(LFC.df$padj) & LFC.df$padj < 0.05 & abs(LFC.df$log2FoldChange) > 2)

#Decreased expression:
sum(!is.na(LFC.df$padj) & LFC.df$padj < 0.05 & LFC.df$log2FoldChange < 0) #any fold-change
sum(!is.na(LFC.df$padj) & LFC.df$padj < 0.05 & LFC.df$log2FoldChange < (-2)) #log2 fold-change below -2

#Increased expression:
sum(!is.na(LFC.df$padj) & LFC.df$padj < 0.05 & LFC.df$log2FoldChange > 0) #any fold-change
sum(!is.na(LFC.df$padj) & LFC.df$padj < 0.05 & LFC.df$log2FoldChange > 2) #log2 fold-change above 2
```

2) **What are the top genes?**

```R
#At this stage it may be useful to create a copy of the results with the gene version removed from the gene name, to make it easier to search for the gene name etc.
#The rownames currently appear as 'ENSG00000175197.12, ENSG00000128272.15' etc.
#To change them to 'ENSG00000175197, ENSG00000128272'
LFC.gene = as.data.frame(LFC)

#Some gene names are repeated if they are in the PAR region of the Y chromosome. Since data frames cannot have duplicate row names, we will leave these gene names as they are and rename the rest.
whichgenes = which(!grepl('PAR', rownames(LFC.gene)))
rownames(LFC.gene)[whichgenes] = unlist(lapply(strsplit(rownames(LFC.gene)[whichgenes], '\\.'), '[[', 1))

#Subset the significant genes and sort them by adjusted p-value
LFC.sig = LFC.gene[!is.na(LFC.gene$padj) & LFC.gene$padj < 0.05,]
LFC.sig = LFC.sig[order(LFC.sig$padj),]

#We can add a column with the HGNC gene names
library(biomaRt)
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

converted <- getBM(attributes = c('hgnc_symbol', 'ensembl_gene_id'), filters = 'ensembl_gene_id',
                   values = rownames(LFC.sig), mart = ensembl)

#Add gene names to the LFC.sig data frame
#match() keeps the row order of LFC.sig and gives NA where no HGNC symbol was returned
LFC.sig$hgnc = converted$hgnc_symbol[match(rownames(LFC.sig), converted$ensembl_gene_id)]

#View the top 10 genes with the most significant (adjusted) p-values
head(LFC.sig, n = 10)

#The largest fold-changes with a significant p-value
LFC.sig[order(abs(LFC.sig$log2FoldChange), decreasing = TRUE),][1:10,] #add the [1:10,] to see the top 10 rows
```

3) **Can I plot the expression for the top genes?**

```R
library(ggplot2)

#Select your chosen gene
tmp = plotCounts(dds, gene = grep('ENSG00000000003', names(dds), value = TRUE), intgroup = "condition", pch = 18, main = '??? expression', returnData = TRUE)

theme <- theme(panel.background = element_blank(), panel.border = element_rect(fill = NA),
               plot.title = element_text(hjust = 0.5))

p <- ggplot(tmp, aes(x = condition, y = count)) + geom_boxplot() +
  geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize = 0.6) + ggtitle('??? expression') + theme

print(p)
```

## Functional analysis

Functional analysis builds on the gene-level differential expression results. Pathway analysis is a popular approach for investigating the differential expression of pathways, i.e. groups of genes with similar biological functions.
This can be achieved using gene set analysis (GSA). There are many flavours of GSA. They can be categorised as shown below by [Das et al. (2020)](https://www.mdpi.com/1099-4300/22/4/427).

Here, we will use one gene annotation approach and one gene set enrichment analysis (GSEA) approach.

### GoSeq - gene annotation

[GoSeq](https://bioconductor.org/packages/release/bioc/html/goseq.html), developed by [Young et al. (2010)](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-2-r14), tests for the enrichment of Gene Ontology terms.

```R
BiocManager::install("goseq")
library(goseq)
library(edgeR) #provides topTags()

#Extract the differential expression data, with false discovery rate correction
#ql.groups12 is assumed here to be an edgeR test result (it is not created in the steps above)
groups12.table <- as.data.frame(topTags(ql.groups12, n = Inf))

#Remove the version numbers from the ENSEMBL gene IDs
rownames(groups12.table) <- unlist(lapply(strsplit(rownames(groups12.table), '\\.'), `[[`, 1))
```

The genes can be separated into those which show significantly increased expression and those which show significantly decreased expression. Here, the FDR threshold is set to 0.05. A minimum fold-change can also be defined.

```R
#Decreased expression
ql53.DEGs.down <- groups12.table$FDR < 0.05 & groups12.table$logFC < 0
names(ql53.DEGs.down) <- rownames(groups12.table)
pwf.dn <- nullp(ql53.DEGs.down, "hg19", "ensGene")
go.results.dn <- goseq(pwf.dn, "hg19", "ensGene")

#Increased expression
ql53.DEGs.up <- groups12.table$FDR < 0.05 & groups12.table$logFC > 0
names(ql53.DEGs.up) <- rownames(groups12.table)
pwf.up <- nullp(ql53.DEGs.up, "hg19", "ensGene")
go.results.up <- goseq(pwf.up, "hg19", "ensGene")
```

The `go.results.up` dataframe looks like this:

Significant results can be saved...

```R
write.table(go.results.up[go.results.up$over_represented_pvalue<0.05,1:2], 'p53-GO-up0.05.txt', quote=FALSE, sep='\t', row.names=FALSE, col.names=FALSE)
```

...and uploaded to the REVIGO tool, which collapses and summarises redundant GO terms. Copy the contents of `p53-GO-up0.05.txt` into the [REVIGO](http://revigo.irb.hr/) box:

After running, select the 'Scatterplot & Table' tab and scroll down to 'export results to text table (csv)':

Save the downloaded file as `REVIGO-UP.csv`. The package `ggplot2` can be used here to visualise the log10 p-values for GO term enrichment. To view the top 20 terms:

```R
library(ggplot2)

#Read in the REVIGO output
revigoUP = read.table('REVIGO-UP.csv', sep = ',', header = TRUE)

#Sort by p-value (most negative log10 p-value first) and extract the top 20
revigoUP = revigoUP[order(revigoUP$log10.p.value),]
revigoUP = head(revigoUP, n = 20)

#Convert the GO terms to factors so that ggplot2 keeps the sorted order
revigoUP$description <- factor(revigoUP$description, levels = revigoUP$description)

#Plot the barplot
p <- ggplot(data = revigoUP, aes(x = log10.p.value, y = description, fill = description)) +
  geom_bar(stat = "identity")

p + scale_fill_manual(values = rep("steelblue2", dim(revigoUP)[1])) + theme_minimal() + theme(legend.position = "none") +
  ylab('')
```

### Gene Set Enrichment Analysis

Gene set enrichment analysis (GSEA) will be used to test for the altered expression of pre-defined *sets* of genes.

```R
BiocManager::install("piano")
library(piano)
```

## Resources

Many resources were used in building this RNA-seq tutorial.

Highly recommended RNA-seq tutorial series:

- [Introduction to differential gene expression analysis](https://hbctraining.github.io/DGE_workshop/lessons/01_DGE_setup_and_overview.html#rna-seq-count-distribution)

Other RNA-seq tutorials:
- [Statistical analysis of RNA-Seq data](http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf) by Ignacio González
- [Analysis of RNA-Seq data: gene-level exploratory analysis and differential expression](https://www.huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html#gene-ontology-enrichment-analysis) by Bernd Klaus and Wolfgang Huber
- [RNA-seq workflow: gene-level exploratory analysis and differential expression by Love et al. 2019](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#running-the-differential-expression-pipeline)
- https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html
- The ENCODE pipeline for long-RNAs: https://www.encodeproject.org/data-standards/rna-seq/long-rnas/

Understanding normalisation:
- https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html

**Preseq**: Estimates library complexity

**Picard RNAseqMetrics**: Number of reads that align to coding, intronic, UTR, intergenic and ribosomal regions; normalised gene coverage across a meta-gene body; identification of 5’ or 3’ bias

**RSeQC**: Suite of tools to assess post-alignment quality: distribution of insert size, junction annotation (% known, % novel reads spanning splice junctions), BAM to BigWig (visual inspection with IGV)

## cqn Normalisation

The count data needs to be normalised for several confounding factors.
The number of reads (or fragments for paired-end data) mapped to a gene is influenced by (1) its GC content, (2) its length and (3) the total library size for the sample. There are multiple methods used for normalisation. Here, conditional quantile normalisation (cqn) is used as recommended by [Mandelboum et al. (2019)](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000481) to correct for sample-specific biases. Cqn is described by [Hansen et al. (2012)](https://academic.oup.com/biostatistics/article/13/2/204/1746212).

`cqn` requires an input of gene length, GC content and the estimated library size per sample (which it will estimate as the total sum of the counts if not provided by the user). For more guidance on how to normalise using `cqn` and import into `DESeq2`, the user is directed to [the cqn vignette](https://bioconductor.org/packages/release/bioc/vignettes/cqn/inst/doc/cqn.pdf) by Hansen & Wu and the [tximport vignette](http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html) by Love, Soneson & Robinson.

```R
#Read in the gene lengths and gc-content data frame (provided in this repository)
genes.length.gc = read.table('gencode-v36-gene-length-gc.txt', sep = '\t')
```

At this stage, technical replicates can be combined if they have not been already. This is typically achieved by summing the counts.

To carry out the normalisation:

```R
library(cqn)
#cqn normalisation
counts = counts.imported$counts

#Exclude genes with no length information, for compatibility with cqn.
#(a logical filter is used so that no genes are dropped when none have a missing length)
counts = counts[!is.na(genes.length.gc[rownames(counts),]$length),]

#Extract the lengths and GC contents for genes in the same order as the counts matrix
geneslengths = genes.length.gc[rownames(counts),]$length
genesgc = genes.length.gc[rownames(counts),]$gc

#Run the cqn normalisation
cqn.results <- cqn(counts, genesgc, geneslengths, lengthMethod = c("smooth"))

#Extract the offset, which will be input directly into DESeq2 to normalise the counts.
cqnoffset <- cqn.results$glm.offset
cqnNormFactors <- exp(cqnoffset)

#The object imported from tximport also contains matrices for 'length' and 'abundance'.
#These should also be subset to remove any genes excluded by the 'NA' length filter
counts.imported$abundance = counts.imported$abundance[rownames(counts),]
counts.imported$counts = counts.imported$counts[rownames(counts),]
counts.imported$length = counts.imported$length[rownames(counts),]
```

The normalised gene expression values can be saved from the cqn output. These values will not be used for the downstream differential expression; rather, they are useful for visualisation. Differential expression will be calculated within DESeq2 using a negative binomial model, to which the cqn offset will be added.

```R
#The normalised gene expression values can be saved as:
RPKM.cqn <- cqn.results$y + cqn.results$offset
```

> Biological replicate correlation

The correlation between the expression of genes in two biological replicates should ideally be very high. The normalised expression values, saved above as `RPKM.cqn`, will be used.

```R
#To test the correlation between the first two samples in columns 1 and 2
plot(RPKM.cqn[,1], RPKM.cqn[,2], pch = 18, cex = 0.5, xlab = colnames(RPKM.cqn)[1], ylab = colnames(RPKM.cqn)[2])

#The Pearson correlation coefficient can be calculated as:
cor(RPKM.cqn[,1], RPKM.cqn[,2])

#Add it to your plot, replacing x and y with the coordinates for your legend
text(x, y, labels = paste0('r=', round(cor(RPKM.cqn[,1], RPKM.cqn[,2]),2)))

#To add a regression line (the y-axis sample regressed on the x-axis sample)
abline(lm(RPKM.cqn[,2] ~ RPKM.cqn[,1]), col = 'red')
```
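
To check every pair of replicates at once rather than one pair at a time, a correlation matrix can be computed. The sketch below uses simulated values (the matrix `toy.expr` is made up and stands in for `RPKM.cqn`), so the numbers themselves carry no biological meaning.

```R
set.seed(1)

#Simulated log-scale expression standing in for RPKM.cqn: 200 genes x 4 samples
base <- rnorm(200, mean = 5, sd = 2)
toy.expr <- sapply(1:4, function(i) base + rnorm(200, sd = 0.3))
colnames(toy.expr) <- paste0("rep", 1:4)

#Pairwise Pearson correlations between all samples
cor.mat <- cor(toy.expr)
round(cor.mat, 3)

#The lowest off-diagonal value, i.e. the least similar pair of samples
min(cor.mat[upper.tri(cor.mat)])
```

With real data, replacing `toy.expr` with `RPKM.cqn` gives one coefficient per pair of samples; a replicate that correlates poorly with all others in its group is worth a closer look.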