├── README.md
├── _config.yml
├── activities
│   ├── day1_polls.md
│   ├── day2_polls.md
│   ├── day3_polls.md
│   ├── day4_polls.md
│   └── practice_exercise.md
├── assets
│   ├── css
│   │   └── style.scss
│   └── images
│       └── dna-sequence-1600x800.jpg
├── data
│   ├── DGE_workshop_annotations.RData
│   ├── Mov10_full_meta.txt
│   ├── annotations_ahb.csv
│   ├── multiqc_report_rnaseq.html
│   ├── multiqc_report_rnaseq.html.zip
│   ├── raw_counts_mouseKO.csv
│   └── tx2gene_grch38_ens94.txt
├── exercises
│   ├── DGE_analysis_exercises answer_key.md
│   ├── DGE_analysis_exercises.md
│   └── refresher_answers.R
├── homework
│   ├── DGE_assignment_1.R
│   ├── DGE_assignment_1_answer_key.R
│   ├── DGE_assignment_2.R
│   ├── DGE_assignment_2_answer_key.R
│   ├── DGE_assignment_3.R
│   └── exp_design_table.xlsx
├── img
│   ├── Create_RStudio_Project.gif
│   ├── Differential_expression.png
│   ├── MA_plot.png
│   ├── NB_model_formula.png
│   ├── PCA_2sample_genes.png
│   ├── PCA_2sample_influence.png
│   ├── PCA_2sample_rotate.png
│   ├── PCA_2sample_variation1.png
│   ├── PCA_2sample_variation2.png
│   ├── PCA_2sample_variation3.png
│   ├── PCA_salmon.png
│   ├── PCA_samples.png
│   ├── bad_dispersion1.png
│   ├── bad_dispersion2.png
│   ├── batch_effect.png
│   ├── batch_effect_pca.png
│   ├── cluster_summary.png
│   ├── cnetplot-2_salmon.png
│   ├── cnetplot1_salmon.png
│   ├── confounded_batch.png
│   ├── confounded_design.png
│   ├── dataset.png
│   ├── de_norm_counts_var.png
│   ├── de_replicates_img2.png
│   ├── de_theory.png
│   ├── de_variation.png
│   ├── de_workflow_salmon.png
│   ├── de_workflow_salmon_deseq1.png
│   ├── de_workflow_salmon_deseq2.png
│   ├── de_workflow_salmon_normalization.png
│   ├── de_workflow_salmon_qc.png
│   ├── degPatterns_figure.png
│   ├── degReport_clusters2.png
│   ├── degpattern_figure.tiff
│   ├── deseq2_shrunken_lfc.png
│   ├── deseq2_workflow_separate.png
│   ├── deseq2_workflow_separate_dis.png
│   ├── deseq2_workflow_separate_fit.png
│   ├── deseq2_workflow_separate_sf.png
│   ├── deseq2_workflow_separate_shr.png
│   ├── deseq_counts_distribution.png
│   ├── deseq_counts_overview.png
│   ├── deseq_dispersion1.png
│   ├── deseq_dispersion2.png
│   ├── deseq_mean_variance2.png
│   ├── deseq_median_of_ratios.png
│   ├── deseq_nb.png
│   ├── deseq_obj1.png
│   ├── deseq_obj2.png
│   ├── emapplot_salmon.png
│   ├── example_PCA_cage.png
│   ├── example_PCA_sex.png
│   ├── example_PCA_strain.png
│   ├── example_PCA_treatmentPC1.png
│   ├── example_PCA_treatmentPC3.png
│   ├── example_metadata.png
│   ├── exercise_dispersion.png
│   ├── gProfiler.png
│   ├── gene_filtering.png
│   ├── genemania.png
│   ├── go_freq.png
│   ├── go_heirarchy.jpg
│   ├── go_proportions.png
│   ├── go_proportions_table3.png
│   ├── gseaKEGGresults.png
│   ├── gsea_kegg_hsa03008.png
│   ├── gsea_kegg_hsa03040.png
│   ├── gsea_overview.png
│   ├── gsea_theory.png
│   ├── heatmap_example.png
│   ├── heatmap_sleuth.png
│   ├── hsa03008.pathview.png
│   ├── hsa03040.pathview.png
│   ├── hsa05222.png
│   ├── hypergeo.png
│   ├── illumina_sequencing_process.png
│   ├── indep_filt_scatterplus.png
│   ├── library_prep.png
│   ├── lrt_formula.png
│   ├── lrt_time_nodiff.png
│   ├── lrt_time_yesdiff.png
│   ├── maplot_unshrunken.png
│   ├── melt_wide_to_long_format.png
│   ├── meta_example.png
│   ├── metadata_batch.png
│   ├── mov10-model.png
│   ├── mov10_isoform_expression.png
│   ├── mov10_test_table.png
│   ├── mov10oe_dotplot.png
│   ├── non_confounded_design.png
│   ├── normalization_methods_composition.png
│   ├── normalization_methods_depth.png
│   ├── normalization_methods_length.png
│   ├── orgdb_annotation_databases.png
│   ├── paired_end_reads.png
│   ├── pathway_analysis.png
│   ├── pca_sleuth.png
│   ├── pheatmap_aug2020.png
│   ├── pheatmap_salmon.png
│   ├── plotCounts_ggrepel_salmon.png
│   ├── plotDispersion_salmon.png
│   ├── plot_density_filtered.png
│   ├── plot_density_unfiltered.png
│   ├── quant_screenshot.png
│   ├── replicates.png
│   ├── revigo_download_link.png
│   ├── revigo_input.png
│   ├── revigo_treemap.png
│   ├── rlog_transformation_new.png
│   ├── salmon.png
│   ├── sample_qc.png
│   ├── settingup.png
│   ├── sigOE_heatmap.png
│   ├── sig_genes_gather_salmon.png
│   ├── sleuth_abund_h5.png
│   ├── sleuth_bootstraps1.png
│   ├── sleuth_bootstraps2.png
│   ├── sleuth_cor_heatmap.png
│   ├── sleuth_formula1.png
│   ├── sleuth_formula2.png
│   ├── sleuth_mov10_results.png
│   ├── sleuth_pca_loadings.png
│   ├── sleuth_pca_variance.png
│   ├── sleuth_tech_var.png
│   ├── sleuth_workflow.png
│   ├── sleuth_workflow1.png
│   ├── sleuth_workflow2_updated.png
│   ├── spia_plot.png
│   ├── topgen_plot.png
│   ├── topgen_plot_salmon.png
│   ├── volcano_plot_1_salmon.png
│   ├── volcano_plot_2_salmon.png
│   ├── workflow-salmon-DGE-alt2.png
│   └── workflow-salmon-sleuth.png
├── lectures
│   ├── Intro_to_workshop_CFAR.pdf
│   ├── Intro_to_workshop_all.pdf
│   ├── Workshop_wrapup.pdf
│   ├── Workshop_wrapup_CFAR.pdf
│   ├── Workshop_wrapup_all.pdf
│   ├── Workshop_wrapup_dfhcc.pdf
│   ├── workshop_intro_slides.pdf
│   └── workshop_wrapup_slides.pdf
├── lessons
│   ├── 01a_RNAseq_processing_workflow.md
│   ├── 01b_DGE_setup_and_overview.md
│   ├── 01c_RNAseq_count_distribution.md
│   ├── 02_DGE_count_normalization.md
│   ├── 03_DGE_QC_analysis.md
│   ├── 04a_design_formulas.md
│   ├── 04b_DGE_DESeq2_analysis.md
│   ├── 05a_hypothesis_testing.md
│   ├── 05b_wald_test_results.md
│   ├── 05c_summarizing_results.md
│   ├── 06_DGE_visualizing_results.md
│   ├── 07_DGE_summarizing_workflow.md
│   ├── 08a_DGE_LRT_results.md
│   ├── 08b_time_course_analyses.md
│   ├── 09_sleuth.md
│   ├── 10_FA_over-representation_analysis.md
│   ├── 11_FA_functional_class_scoring.md
│   ├── AnnotationDbi_lesson.md
│   ├── DGE_visualizing_results_archived.md
│   ├── OLD_functional_analysis.md
│   ├── R_refresher.md
│   ├── experimental_planning_considerations.md
│   ├── functional_analysis_other_methods.md
│   ├── genomic_annotation.md
│   ├── pathway_topology.md
│   ├── principal_component_analysis.md
│   └── top20_genes-expression_plotting.md
├── schedule
│   ├── README.md
│   ├── README_old.md
│   └── links-to-lessons.md
└── templates
    ├── DGE_DESeq2_template.Rmd
    └── DGE_FA_template.Rmd
/README.md:
--------------------------------------------------------------------------------
1 | # Introduction to Differential Gene Expression Analysis
2 |
3 | | Audience | Computational skills required | Duration |
4 | |:----------|:----------|:----------|
5 | | Biologists | [Introduction to R](https://hbctraining.github.io/Intro-to-R-flipped/) | 4-session online workshop (~8 hours of trainer-led time)|
6 |
7 | ### Description
8 |
9 | This repository has teaching materials for a hands-on **Introduction to Differential Gene Expression Analysis** workshop. The workshop will lead participants through performing a differential gene expression analysis workflow on RNA-seq count data using R/RStudio. A working knowledge of R, or completion of the [Introduction to R workshop](https://hbctraining.github.io/Intro-to-R-flipped/), is required.
10 |
11 | **Note for Trainers:** The schedule linked below assumes that learners will spend 3-4 hours between classes reading through selected lessons and completing their exercises. The online component of the workshop focuses on additional exercises and discussion/Q&A.
12 |
13 | ### Learning Objectives
14 |
15 | - QC on count data using Principal Component Analysis (PCA) and hierarchical clustering
16 | - Using DESeq2 to obtain a list of significantly different genes
17 | - Visualizing expression patterns of differentially expressed genes
18 | - Performing functional analysis on gene lists with R-based tools
19 |
20 | ### Lessons
21 | * [Workshop schedule (trainer-led learning)](schedule/)
22 | * [Self-learning](schedule/links-to-lessons.md)
23 |
24 | ### Installation Requirements
25 |
26 | Download the most recent versions of R and RStudio for your laptop:
27 |
28 | - [R](http://lib.stat.cmu.edu/R/CRAN/) (version 4.0.0 or above)
29 | - [RStudio](https://www.rstudio.com/products/rstudio/download/#download)
30 |
31 | > **Note 1:** When installing the following packages, if you are asked to select (a/s/n) or (y/n), please select "a" or "y" as applicable.
32 |
33 | > **Note 2**: If you have a Mac with an M1 chip, download and install this tool before installing your packages: https://mac.r-project.org/tools/gfortran-12.2-universal.pkg
34 |
35 | (1) Install the below packages on your laptop from CRAN. You DO NOT have to go to the CRAN webpage; you can use the following function to install them one by one:
36 |
37 | ```r
38 | install.packages("insert_package_name_in_quotations")
39 | install.packages("insert_package_name_in_quotations")
40 | # ... and so on for each package
41 | ```
42 |
43 | Note that these package names are case sensitive!
44 |
45 | ```r
46 | BiocManager
47 | tidyverse
48 | RColorBrewer
49 | pheatmap
50 | ggrepel
51 | cowplot
52 | ```
53 |
54 | (2) Install the packages below from Bioconductor. Load BiocManager, then run BiocManager's `install()` function 10 times, once for each of the 10 packages:
55 |
56 | ```r
57 | library(BiocManager)
58 | install("insert_first_package_name_in_quotations")
59 | install("insert_second_package_name_in_quotations")
60 | # ... and so on for each package
61 | ```
62 |
63 | Note that these package names are case sensitive!
64 |
65 | ```r
66 | DESeq2
67 | clusterProfiler
68 | DOSE
69 | org.Hs.eg.db
70 | pathview
71 | DEGreport
72 | tximport
73 | AnnotationHub
74 | ensembldb
75 | apeglm
76 | ```
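
As an alternative to calling `install()` once per package, the whole list can be handled in one go. The block below is a minimal sketch, not part of the original workshop instructions: `BiocManager::install()` accepts a character vector of package names, while the `dry_run` flag and the variable names are assumptions introduced here for illustration. With `dry_run <- TRUE` the code only reports what it would install.

```r
# The 10 Bioconductor packages listed above
bioc_pkgs <- c("DESeq2", "clusterProfiler", "DOSE", "org.Hs.eg.db", "pathview",
               "DEGreport", "tximport", "AnnotationHub", "ensembldb", "apeglm")

# Assumption for this sketch: set to FALSE to actually download and install
dry_run <- TRUE

# Packages from the list that are not yet in the local library
to_install <- setdiff(bioc_pkgs, rownames(installed.packages()))

if (length(to_install) == 0) {
  message("All Bioconductor packages from the list are already installed.")
} else if (dry_run) {
  message("Would install: ", paste(to_install, collapse = ", "))
} else if (!requireNamespace("BiocManager", quietly = TRUE)) {
  message("Install BiocManager from CRAN first, then rerun this block.")
} else {
  # install() takes a character vector, so one call covers every missing package
  BiocManager::install(to_install)
}
```

Skipping packages that are already installed keeps a rerun short if the first attempt was interrupted.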
77 |
78 | > **NOTE:** The library used for the annotations associated with genes (here we are using `org.Hs.eg.db`) will change based on the organism (e.g. if studying mouse, you would need to install and load `org.Mm.eg.db`). The list of organism packages is given [here](https://github.com/hbctraining/Training-modules/raw/master/DGE-functional-analysis/img/available_annotations.png).
79 |
80 | (3) Finally, please check that all the packages were installed successfully by loading them **one at a time** using the code below:
81 |
82 | ```r
83 | library(DESeq2)
84 | library(tidyverse)
85 | library(RColorBrewer)
86 | library(pheatmap)
87 | library(ggrepel)
88 | library(cowplot)
89 | library(clusterProfiler)
90 | library(DEGreport)
91 | library(org.Hs.eg.db)
92 | library(DOSE)
93 | library(pathview)
94 | library(tximport)
95 | library(AnnotationHub)
96 | library(ensembldb)
97 | library(apeglm)
98 | ```
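
If loading the packages one at a time produces a lot of output, a quick overview of what is missing can help before going through them individually. The following is a rough sketch that complements, rather than replaces, the `library()` calls above: `requireNamespace()` returns `TRUE` or `FALSE` instead of stopping with an error, and it loads a package's namespace without attaching it. The variable names are illustrative.

```r
# All 15 packages used in the workshop (CRAN and Bioconductor)
pkgs <- c("DESeq2", "tidyverse", "RColorBrewer", "pheatmap", "ggrepel",
          "cowplot", "clusterProfiler", "DEGreport", "org.Hs.eg.db", "DOSE",
          "pathview", "tximport", "AnnotationHub", "ensembldb", "apeglm")

# TRUE for each package whose namespace can be loaded, FALSE otherwise
is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (all(is_available)) {
  message("All ", length(pkgs), " packages can be loaded.")
} else {
  message("Could not load: ", paste(pkgs[!is_available], collapse = ", "))
}
```

Any package reported here should be reinstalled and then loaded with `library()` to see the full error message.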
99 |
100 | (4) Once all packages have been loaded, run `sessionInfo()`.
101 |
102 | ```r
103 | sessionInfo()
104 | ```
105 |
106 | ---
107 |
108 | ### Citation
109 |
110 | To cite material from this course in your publications, please use:
111 |
112 | > Meeta Mistry, Mary Piper, Jihe Liu, & Radhika Khetani. (2021, May 24). hbctraining/DGE_workshop_salmon_online: Differential Gene Expression Workshop Lessons from HCBC (first release). Zenodo. https://doi.org/10.5281/zenodo.4783481. RRID:SCR_025373.
113 |
114 | A lot of time and effort went into the preparation of these materials. Citations help us understand the needs of the community, gain recognition for our work, and attract further funding to support our teaching activities. Thank you for citing this material if it helped you in your data analysis.
115 |
116 | ---
117 |
118 | *These materials have been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/) RRID:SCR_025373. These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
119 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-cayman
2 | title: Introduction to DGE
3 | google_analytics: UA-150953419-1
4 |
--------------------------------------------------------------------------------
/activities/day1_polls.md:
--------------------------------------------------------------------------------
1 | 1. Which of the following is NOT part of the library preparation process?
2 |
3 | 1. Reverse transcription
4 | 1. Fragmentation
5 | 1. **Immunoprecipitation**
6 | 1. PolyA selection
7 |
8 |
9 | 1. In Illumina's sequencing by synthesis process, each cycle incorporates a fluorophore labelled dNTP which is excited by a laser. Each dNTP has a distinct excitatory signal emission which is captured by cameras. The number of bases incorporated is equal to the total length of the cDNA fragment being sequenced. OR *"The number of bases incorporated during Illumina's sequencing by synthesis process is equal to the total length of the cDNA fragment being sequenced."*
10 |
11 | 1. True
12 | 2. **False**
13 |
14 | 1. Lightweight alignment tools are those which avoid base-to-base genomic alignment of the reads. These tools not only provide quantification estimates faster than older tools, but also show improvements in accuracy. Which of the following tools is not a lightweight alignment tool? OR *"Lightweight alignment tools avoid base-to-base genomic alignment of the reads. These tools provide quantification estimates faster than older tools, and also show improvements in accuracy. Which of the following tools is not a lightweight alignment tool?"*
15 |
16 | 1. Sailfish
17 | 1. **STAR**
18 | 1. Salmon
19 | 1. Kallisto
20 |
21 | 1. The output of a lightweight alignment tool is a BAM file.
22 |
23 | 1. True
24 | 1. **False**
25 |
26 | 1. Which of the following statements about RNA-seq quality control is not true?
27 |
28 | 1. Quality checks need to be performed on each individual sample and metrics should be compared across samples in a dataset for consistency.
29 | 1. FASTQC is a tool used to evaluate the raw sequence reads.
30 | 1. Mapping quality tools like Qualimap require genomic coordinate information as input.
31 | 1. **If a sample exhibits low quality scores at any one of the checkpoints it should be discarded before proceeding to differential gene expression analysis.**
32 |
33 | 1. For the majority of bulk RNA-seq experiments, higher sequencing depth is more important than the number of replicates.
34 |
35 | 1. True
36 | 1. **False**
37 |
--------------------------------------------------------------------------------
/activities/day2_polls.md:
--------------------------------------------------------------------------------
1 | 1. The following statements describe the characteristics of RNA-seq count data. Select the statement that is **NOT** correct.
2 | 1. Count data has large dynamic range, with no upper limit for expression.
3 | 1. For genes with low mean expression we observe a range of variance values.
4 | 1. **For genes with high mean expression, the variance is less than the mean.**
5 | 1. Count data follows a negative binomial distribution.
6 |
7 | 2. Increasing the number of replicates enables more precise estimates of group means, and increases our statistical power to call differentially expressed genes correctly.
8 | 1. **True**
9 | 1. False
10 |
11 | 3. Which of the following factors does NOT need to be taken into consideration when comparing expression of a given gene between samples?
12 |
13 | 1. Sequencing depth
14 | 1. **Gene length**
15 | 1. RNA composition
16 |
17 | 4. Which of the following normalization methods is used by DESeq2's model?
18 |
19 | 1. CPM (counts per million)
20 | 1. TPM (transcripts per kilobase million)
21 | 1. RPKM (reads per kilobase of exon per million reads/fragments mapped)
22 | 1. **median of ratios**
23 |
24 | 5. True or False: RPKM is the preferred normalization method to compare the expression of a given gene between samples.
25 | 1. True
26 | 1. **False**
27 |
28 | 6. True or False: The presence of differentially expressed genes between samples can alter the size factor (normalization factor).
29 | 1. True
30 | 1. **False**
31 |
32 | 7. Sample-level QC can be performed on normalized counts directly, but the log transformation of the normalized counts gives a better idea of similarities and differences.
33 | 1. **True**
34 | 1. False
35 |
36 | 8. When I plot PC1 vs PC2 for my data, the samples are not separating along PC1 based on my experimental question/factor. Which of the following should I NOT do?
37 | 1. Color my data points by other factors in the data systematically to identify the major source of variation.
38 | 1. Plot other PCs to identify whether the samples separate by the experimental factor along another PC.
39 | 1. Remove some samples and recreate the PCA to see if it helps with the clustering.
40 | 1. **The data are too noisy; I need to throw out the experiment and start from scratch.**
41 |
42 | 9. In the MultiQC report, one of my samples appears fairly different from the others, showing different trends in the various plots. During sample-level QC, it does not cluster with the other replicates from the same sample group. What should I do?
43 | 1. **At this point it is okay to remove it as an outlier. Once removed, redo the sample-level QC.**
44 | 2. Don't remove it as an outlier.
45 |
--------------------------------------------------------------------------------
/activities/day3_polls.md:
--------------------------------------------------------------------------------
1 | 1. Please choose the option that is **NOT TRUE**.
2 |
3 | 1. Generally, the gene dispersion estimates should decrease with increasing mean expression values.
4 | 1. Generally, the gene dispersion estimates should cluster around the line of best fit.
5 | 1. Gene dispersion plots are a good way to examine whether the data are a good fit to the DESeq2 model.
6 | 1. **A good gene dispersion plot indicates a higher likelihood of detecting a lot of differentially expressed genes.**
7 | 1. A worrisome gene dispersion plot could be a result of outlier samples or contamination.
8 | 1. A worrisome gene dispersion plot indicates that we should be more skeptical of our significant DE genes.
9 |
10 | 2. In order to set up your null hypothesis, you need to first observe all of the data points in an experiment.
11 |
12 | 1. True
13 | 1. **False**
14 |
15 | 3. The generalized linear model (GLM) is fit for each individual gene, which means we are conducting thousands of independent tests for a given experiment. This inflates our false positive rate for DE genes.
16 |
17 | 1. **True**
18 | 1. False
19 |
20 | 4. Which of the following statements about gene-level filtering is **NOT TRUE**?
21 |
22 | 1. If a gene has zero expression in all of the samples it is not tested for differential expression.
23 | 1. For independent filtering that is applied in DESeq2, the low mean threshold is empirically determined from your data.
24 | 1. **Gene-level filtering will reduce the number of genes being tested and therefore decrease the total number of differentially expressed genes that are identified.**
25 | 1. A Cook's distance is computed for each gene in each sample to help identify genes with an extreme outlier count.
26 |
27 | 5. Which of the following statements about LFC shrinkage is true?
28 |
29 | 1. LFC shrinkage uses information from the significant genes to generate more accurate estimates.
30 | 1. Shrinking the log fold changes will reduce the total number of genes identified as significant at padj < 0.05.
31 | 1. **Shrinkage of the LFC estimates is useful when the information for a gene is low, which includes low mean expression or high dispersion**.
32 | 1. LFC shrinkage is applied by default in DESeq2.
33 |
34 |
35 | 6. To identify significant differentially expressed genes, you will always need to set both an adjusted p-value threshold and a fold change threshold.
36 |
37 | 1. True
38 | 1. **False**
39 |
40 | 7. When using a heatmap to observe the differences in patterns of gene expression between sample groups, it is more informative to plot the normalized expression values rather than the scaled Z-scores for each gene.
41 |
42 | 1. True
43 | 1. **False**
44 |
45 |
--------------------------------------------------------------------------------
/activities/day4_polls.md:
--------------------------------------------------------------------------------
1 | 1. We cannot annotate our genes until we know the source and build of the reference files that were used in the analysis upstream of obtaining the pseudocounts.
2 |
3 | 1. **True**
4 | 1. False
5 |
6 | 1. Which of the following statements about AnnotationHub is NOT TRUE?
7 |
8 | 1. AnnotationHub is a resource for accessing genomic data or querying a large collection of whole-genome resources, including ENSEMBL, UCSC, and ENCODE, among many others.
9 | 1. **In AnnotationHub Ensembl and Entrez identifiers have a one-to-one mapping.**
10 | 1. AnnotationHub allows us to retrieve gene, transcript and exon-level information.
11 | 1. When searching Human EnsemblDb using AnnotationHub, there are older releases but only for the most recent build.
12 |
13 | 1. True/False. Among the functional analysis methods, GSEA is better than over-representation analysis for exploring enriched processes or pathways when there are few differentially expressed genes.
14 |
15 | 1. **True**
16 | 1. False
17 |
18 | 1. True/False. We identify RNA splicing-related terms as significant in both our over-representation and GSEA analyses. We can conclude that RNA splicing is affected by our condition of interest.
19 |
20 | 1. True
21 | 1. **False**
22 |
--------------------------------------------------------------------------------
/activities/practice_exercise.md:
--------------------------------------------------------------------------------
1 |
2 | Use the Salmon quant files to perform differential gene expression analysis with DESeq2. Report the DE genes identified for SMOC2 over-expressing samples relative to wild type using an FDR threshold of 0.05, by uploading a CSV file containing the results table output by DESeq2, but only for the significant genes.
3 |
4 | Also generate a heatmap of the top 20 differentially expressed genes and include an annotation bar assigning samples to each group. Upload this heatmap.
5 |
6 | Use the Salmon output to run Sleuth and identify differentially expressed transcripts. Use Sleuth functions to create a PCA plot. Save and upload the image. How well do samples cluster?
7 |
8 | Report the DE transcripts identified for SMOC2 over-expressing samples relative to wild type using an FDR threshold of 0.05, by uploading a CSV file containing the results table output by Sleuth, but only for the significant transcripts.
9 |
10 | Use Sleuth functions to plot a heatmap of the top 20 differentially expressed transcripts. Upload the image.
11 |
--------------------------------------------------------------------------------
/assets/css/style.scss:
--------------------------------------------------------------------------------
1 | ---
2 | ---
3 |
4 | @import "{{ site.theme }}";
5 |
6 | .page-header { color: #fff; text-align: center; background-image: url("../images/dna-sequence-1600x800.jpg"); }
7 |
8 | .main-content h1, .main-content h2, .main-content h3, .main-content h4, .main-content h5, .main-content h6 { margin-top: 2rem; margin-bottom: 1rem; font-weight: normal; color: #000000; }
9 |
--------------------------------------------------------------------------------
/assets/images/dna-sequence-1600x800.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/assets/images/dna-sequence-1600x800.jpg
--------------------------------------------------------------------------------
/data/DGE_workshop_annotations.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/data/DGE_workshop_annotations.RData
--------------------------------------------------------------------------------
/data/Mov10_full_meta.txt:
--------------------------------------------------------------------------------
1 | sampletype MOVexpr
2 | Mov10_kd_2 MOV10_knockdown low
3 | Mov10_kd_3 MOV10_knockdown low
4 | Mov10_oe_1 MOV10_overexpression high
5 | Mov10_oe_2 MOV10_overexpression high
6 | Mov10_oe_3 MOV10_overexpression high
7 | Irrel_kd_1 control normal
8 | Irrel_kd_2 control normal
9 | Irrel_kd_3 control normal
10 |
--------------------------------------------------------------------------------
/data/multiqc_report_rnaseq.html.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/data/multiqc_report_rnaseq.html.zip
--------------------------------------------------------------------------------
/exercises/DGE_analysis_exercises.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "DGE Analysis Homework"
3 | author: "Meeta Mistry, Radhika Khetani, Mary Piper"
4 | date: "October 24th, 2017"
5 | ---
6 |
7 | # Using DESeq2 for gene-level differential expression analysis
8 |
9 | - The metadata table below describes an RNA-seq experiment. The table and its associated count matrix have been loaded into R as `meta` and `counts`, respectively, and all of the appropriate libraries have been loaded for you. Use the information in the table to answer the following questions.
10 |
11 | **meta**
12 |
13 | | |Genotype |Celltype |Batch|
14 | | ------ | ------- | -------- | --- |
15 | |sample1 |Wt |typeA |second |
16 | |sample2 |Wt |typeA |second|
17 | |sample3 |Wt |typeA |first|
18 | |sample4 |KO |typeA |first|
19 | |sample5 |KO |typeA |first|
20 | |sample6 |KO |typeA |second|
21 | |sample7 |Wt |typeB |second|
22 | |sample8 |Wt |typeB |first|
23 | |sample9 |Wt |typeB |second|
24 | |sample10 |KO |typeB |first|
25 | |sample11 |KO |typeB |first|
26 | |sample12 |KO |typeB |second|
27 |
28 |
29 | **NOTE: This is an exercise in thinking about running DESeq2. You do not need to run any code in R/RStudio. Refer to the materials/lessons from class to answer the following questions.**
30 |
31 | **a.** Reorder the columns of the `counts` dataset such that `rownames(meta) == colnames(counts)`.
32 |
33 | **b.** Provide the line of code used to create a DESeqDataSet object called `dds` in which `Genotype` is the factor of interest and `Celltype` and `Batch` are other contributing sources of variation in your data.
34 |
35 | **c.** Provide the line of code required to run DESeq2 on `dds`.
36 |
37 | **d.** Provide the line of code to create a dispersion plot.
38 |
39 | **e.** Provide the line of code to return the results of a Wald test comparison for `Celltype` categories `typeA` versus `typeB` (i.e., the fold changes reported should reflect gene expression changes relative to `typeB`).
40 |
41 | **f.** Provide the line of code to return the results with log2 fold change shrinkage performed.
42 |
43 | **g.** Provide the line of code to write the results of the Wald test with shrunken log2 fold changes to file.
44 |
45 | **h.** Provide the line of code to subset the results to return those genes with adjusted p-value < 0.05 and log2FoldChange > 1.
46 |
47 | # Working with the DESeq2 results table
48 |
49 | - Using the de_script.R that we created in class for the differential expression analysis, change the thresholds for adjusted p-value and log2 fold change to the following values:
50 |
51 | ```r
52 | padj.cutoff <- 0.01
53 |
54 | lfc.cutoff <- 1.5
55 | ```
56 |
57 | Using these new cutoffs, perform the following steps:
58 |
59 | **a.** Subset `res_tableOE` to only return those rows that meet the criteria we specified above (adjusted p-values < 0.01 and log fold changes >1.5). Save the subsetted table to a data frame called `sig_table_hw_oe`. Write the code below:
60 |
61 | **b.** There is a DESeq2 function that summarizes how many genes are up- and down-regulated using our criteria for `alpha=0.01`. Use this on the `sig_table_hw_oe`. Write the code you would use, and also list how many genes are up- and down-regulated.
62 |
63 | **c.** Get the gene names from `sig_table_hw_oe` and save them to a vector called `sigOE_hw`. Write the code below:
64 |
65 | **d.** Write the `sigOE_hw` vector of gene names to a file called `sigOE_hw.txt` using the `write()` function. Ensure the genes are listed in a single column. Write the code below.
66 |
67 | # Visualizing Results
68 |
69 | - For the genes that are differentially expressed in the knockdown versus control comparison (`res_tableKD`), plot an expression heatmap using normalized counts and `pheatmap()` following the instructions below. Write the code you would use to create the heatmap.
70 |
71 | **a.** The heatmap should only include control and knockdown samples.
72 |
73 | **b.** Set up a heat.colors vector using a palette of your choice from brewer.pal (make sure it is different from the one used in class).
74 |
75 | **c.** Plot the heatmap without clustering the columns.
76 |
77 | **d.** Scale expression values by row.
78 |
79 | # Use significant gene lists to find overlaps between the two comparisons
80 |
81 | - Using the original cutoff values, perform the following steps:
82 |
83 | ```r
84 | padj.cutoff <- 0.05
85 |
86 | lfc.cutoff <- 0.58
87 | ```
88 |
89 | **a.** Create separate vectors with gene names for up-regulated genes and down-regulated genes from `res_tableOE` and save as `up_OE` and `down_OE`, respectively. Write the code below:
90 |
91 | **b.** Create separate vectors with gene names for up-regulated genes and down-regulated genes from `res_tableKD` and save as `up_KD` and `down_KD`, respectively. Write the code below:
92 |
93 | **c.** Test for overlaps between the lists:
94 |
95 | - How many, and which genes in `up_OE` are also in `down_KD`?
96 |
97 | - How many, and which genes in `up_KD` are also in `down_OE`?
98 |
99 | # Using Salmon abundance estimates with DESeq2
100 |
101 | - We have materials for using counts from quasialignment tools, such as Salmon, to perform the DE analysis. The 'tximport' package in R is needed to summarize the transcript-level estimates generated from Salmon into pseudocounts.
102 |
103 | Open up RStudio and create a project for using Salmon estimates with DESeq2 (i.e. ~/Desktop/salmon should be your working directory).
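
To make the idea of summarizing transcript-level estimates to gene level concrete, here is a toy sketch in base R. It is a deliberate simplification and not the `tximport` implementation: it only sums estimated counts per gene, whereas `tximport` also takes transcript length into account. The matrix values, the sample and transcript names, and the `tx2gene` table are all made up for illustration.

```r
# Toy transcript-level estimated counts: rows are transcripts, columns are samples
tx_counts <- matrix(
  c(10.2, 4.1, 0.0, 7.4,   # sample1
    12.6, 3.1, 1.0, 6.2),  # sample2
  nrow = 4,
  dimnames = list(c("tx1", "tx2", "tx3", "tx4"), c("sample1", "sample2"))
)

# Hypothetical transcript-to-gene lookup table
tx2gene <- data.frame(
  tx   = c("tx1", "tx2", "tx3", "tx4"),
  gene = c("geneA", "geneA", "geneB", "geneB")
)

# Look up the gene for each row of the count matrix, then sum rows per gene
gene_of_tx  <- tx2gene$gene[match(rownames(tx_counts), tx2gene$tx)]
gene_counts <- rowsum(tx_counts, group = gene_of_tx)

print(gene_counts)
```

The result has one row per gene and one column per sample, which is the shape DESeq2 expects for its count input.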
104 |
105 | **a.** Follow the [markdown](https://github.com/hbctraining/Intro-to-rnaseq-hpc-orchestra/blob/master/lessons/DE_analysis.md#differential-expression-analysis-using-pseudocounts) for the Salmon section to create the DESeqDataSet object.
106 |
107 | **b.** Run DESeq2 on the DESeqDataSet object (`dds`) you created.
108 |
109 | **c.** Use the `results()` function to extract a results table for the OE vs Ctl comparison. Be sure to provide the appropriate contrasts and to perform shrinkage.
110 |
111 | **d.** Use the `results()` function to extract a results table for the KD vs Ctl comparison. Be sure to provide the appropriate contrasts and to perform shrinkage.
112 |
113 | **e.** Use the `summary()` function using `alpha=0.05` on the OE results table.
114 |
115 | - Report the number of genes that are up and down-regulated:
116 |
117 | - Report how many genes are removed due to independent filtering (HINT: these are genes that are filtered due to low mean count):
118 |
119 | - Report how many genes were removed due to an extreme outlier count (HINT: these 'outliers' are identified based on Cook's distance):
120 |
121 |
122 | **f.** Use the `summary()` function using `alpha=0.05` on the KD results table.
123 |
124 | - Report the number of genes that are up and down-regulated:
125 |
126 | - Report how many genes are removed due to independent filtering (HINT: these are genes that are filtered due to low mean count)
127 |
128 | - Report how many genes were removed due to an extreme outlier count (HINT: these 'outliers' are identified based on Cook's distance)
129 |
130 | **g.** Create a vector called `criteria_oe`, which contains the indexes for which rows in the OE results table have `padj < 0.05` AND `abs(logFC) > 0.58`
131 |
132 | **h.** Subset the OE results into a new table using the `criteria_oe` vector:
133 |
134 | **i.** Create a vector called `criteria_kd`, which contains the indexes for which rows in the KD results table have `padj < 0.05` AND `abs(logFC) > 0.58`
135 |
136 | **j.** Subset the KD results into a new table using the `criteria_kd` vector:
137 |
138 | **k.** The linked datasets below contain the significant genes identified with a different workflow (STAR alignment + featureCounts + DESeq2) on the same samples. Download them via these links:
139 |
140 | [Download OE genes (870 genes)](https://wiki.harvard.edu/confluence/download/attachments/216318985/Mov10_oe_2017.txt?version=1&modificationDate=1498507515000&api=v2)
141 |
142 | [Download KD genes (689 genes)](https://wiki.harvard.edu/confluence/download/attachments/216318985/Mov10_kd_2017.txt?version=1&modificationDate=1498507515000&api=v2)
143 |
144 | Load these gene lists in as `star_KD` and `star_OE`, using the `scan()` function in R (write your code below):
145 |
146 | **l.** Find the OE genes that overlap between your subsetted results table (using the Salmon abundance estimates) and the `star_OE` gene list.
147 |
148 | - How many genes overlap?
149 |
150 | **m.** Find the KD genes that overlap between your subsetted results table (using the Salmon abundance estimates) and the `star_KD` gene list.
151 |
152 | - How many genes overlap?
153 |
154 |
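Steps **g** through **m** share one pattern: build an index of the rows that pass both thresholds, subset with it, then intersect the resulting gene names with an external list. The following is only a minimal sketch on made-up data, not the answer for the workshop dataset. `toy_res`, the gene names, and the temporary file are hypothetical stand-ins, and the fold-change column is assumed to be named `log2FoldChange`, as in DESeq2 results tables.

```r
# Toy stand-in for a results table (hypothetical values, not workshop data)
toy_res <- data.frame(
  log2FoldChange = c(1.20, -0.10, -0.90, 0.70, 2.00),
  padj           = c(0.01, 0.20, 0.03, 0.60, NA),
  row.names      = c("geneA", "geneB", "geneC", "geneD", "geneE")
)

# which() converts the logical test into row indexes and drops NA values
criteria_toy <- which(toy_res$padj < 0.05 & abs(toy_res$log2FoldChange) > 0.58)
toy_sig <- toy_res[criteria_toy, ]

# Write a hypothetical gene list to a temporary file, then read it with scan()
toy_file <- tempfile(fileext = ".txt")
writeLines(c("geneA", "geneD", "geneZ"), toy_file)
toy_list <- scan(toy_file, what = "character", quiet = TRUE)

# Genes present in both the filtered table and the external list
toy_overlap <- intersect(rownames(toy_sig), toy_list)
length(toy_overlap)
```

Wrapping the test in `which()` is a design choice: genes with `padj` set to `NA` (for example, those removed by independent filtering) would otherwise come through as all-`NA` rows in the subset.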
--------------------------------------------------------------------------------
/exercises/refresher_answers.R:
--------------------------------------------------------------------------------
1 | ## Setting up
2 |
3 | # 1. Let’s create a new project directory for this review:
4 | #
5 | # - Create a new project called `R_refresher`
6 |
7 | # File -> New Project -> New Directory -> New Project -> Name it "R_refresher" -> Create Project
8 |
9 | # - Create a new R script called `reviewing_R.R`
10 |
11 | # File -> New File -> R script. Then, File -> Save as... -> Name it "reviewing_R" -> Save
12 |
13 | # - Create the following folders in the project directory - `data`, `figures`
14 |
15 | # With the Files tab selected in the Files/Plots/Packages/Help window, click "New folder" and name it "data" or "figures" then click "OK".
16 |
17 | # - Download a counts file to the `data` folder by [right-clicking here](https://github.com/hbctraining/DGE_workshop_salmon/blob/master/data/raw_counts_mouseKO.csv?raw=true)
18 |
19 | # Right click the hyperlink and select "Download Linked File As" or "Save Link As". Navigate to your data folder within your R_refresher directory and click "Save".
20 |
21 | # 2. Now that we have our directory structure set up, let's load our libraries and read in our data:
22 | #
23 | # - Load the `tidyverse` library
24 | library(tidyverse)
25 |
26 | # - Use `read.csv()` to read in the downloaded file and save it in the object/variable `counts`
27 | counts <- read.csv("data/raw_counts_mouseKO.csv")
28 |
29 | # - What is the syntax for a function?
30 |
31 | # Object <- read.csv("path/to/file.txt")
32 |
33 | # - How do we get help for using a function?
34 |
35 | # ?read.csv
36 |
37 | # - What is the data structure of `counts`?
38 | class(counts)
39 |
40 | # This will tell us that the counts object is a data frame.
41 |
42 | # - What main data structures are available in R?
43 |
44 | # Data frames, vectors, lists, matrices and factors
45 |
46 | # - What are the data types of the columns?
47 | str(counts)
48 |
49 | # We can see that there are eight numeric columns.
50 |
51 | # - What data types are available in R?
52 |
53 | # Numeric, character, integer, logical, complex and raw
54 |
55 | # ## Creating vectors/factors and dataframes
56 | #
57 | # 3. We are performing RNA-Seq on cancer samples with genotypes of p53 wildtype (WT) and knockout (KO). You have 8 samples total, with 4 replicates per genotype. Write the R code you would use to construct your metadata table as described below.
58 | #
59 | # - Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process to go faster, try exploring the `rep()` function).
60 |
61 | sex <- rep(c("M","F"), 4)
62 | stage <- c(1,2,2,1,2,1,1,2)
63 | genotype <- c(rep("KO", 4), rep("WT",4))
64 | myc <- c(23,4,45,90,34,35,9, 10)
65 |
66 | # - Put them together into a dataframe called `meta`.
67 |
68 | meta <- data.frame(sex, stage, genotype,myc)
69 | View(meta)
70 |
71 | # - Use the `rownames()` function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process to go faster, try exploring the `paste0()` function).
72 |
73 | rownames(meta) <- c(paste0(rep("KO", 4), 1:4), paste0(rep("WT",4), 1:4))
74 |
75 | # Your finished metadata table should have information for the variables `sex`, `stage`, `genotype`, and `myc` levels:
76 | #
77 | # | |sex | stage | genotype | myc |
78 | # |:--:|:--: | :--: | :------: | :--: |
79 | # |KO1 | M |1 |KO |23|
80 | # |KO2| F |2 |KO |4|
81 | # |KO3 |M |2 |KO |45|
82 | # |KO4 |F |1 |KO |90|
83 | # |WT1| M |2 |WT |34|
84 | # |WT2| F| 1| WT| 35|
85 | # |WT3| M| 1| WT| 9|
86 | # |WT4| F| 2| WT| 10|
87 |
88 |
89 |
90 | #### Exploring data
91 | #
92 | # Now that we have created our metadata data frame, it's often a good idea to get some descriptive statistics about the data before performing any analyses.
93 | #
94 | # - Summarize the contents of the `meta` object, how many data types are represented?
95 | str(meta)
96 |
97 | # We can see that two columns are numeric and two are characters.
98 |
99 | # - Check that the row names in the `meta` data frame are identical to the column names in `counts` (content and order).
100 | all(rownames(meta) %in% colnames(counts))
101 | all(rownames(meta) == colnames(counts))
102 |
103 | # Both of these should return TRUE: the first confirms that every sample name is present, and the second confirms that the order matches.
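# The two checks above test different things, which is easiest to see on a small case.
# A minimal sketch, assuming made-up sample names (`meta_names` and `counts_names`
# are hypothetical and not part of the exercise data):

```r
# Toy sample names (hypothetical): same content, different order
meta_names   <- c("KO1", "KO2", "WT1", "WT2")
counts_names <- c("WT1", "WT2", "KO1", "KO2")

content_ok <- all(meta_names %in% counts_names)  # TRUE: every name is present
order_ok   <- all(meta_names == counts_names)    # FALSE: positions differ

# identical() requires the same names in the same order and the same length
same <- identical(meta_names, counts_names)
```

# Because `%in%` ignores order, it can pass while the element-wise comparison fails,
# so the two checks are complementary rather than interchangeable.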
104 |
105 | # - Convert the existing `stage` column into a factor data type
106 | meta$stage <- factor(meta$stage)
107 |
108 | str(meta)
109 |
110 | # This should show you that the `stage` column is now a factor.
111 |
112 | #
113 | # ## Extracting data
114 | #
115 | # 4. Using the `meta` data frame created in the previous question, perform the following exercises (questions **DO NOT** build upon each other):
116 | #
117 | # - return only the `genotype` and `sex` columns using `[]`:
118 | meta[,c(3,1)]
119 | #Or
120 | meta[,c("genotype","sex")]
121 |
122 | # - return the `genotype` values for samples 1, 7, and 8 using `[]`:
123 | meta[c(1,7,8),3]
124 | # Or
125 | meta[c(1,7,8),"genotype"]
126 |
127 | # - use `filter()` to return all data for those samples with genotype `WT`:
128 | filter(meta, genotype == "WT")
129 | # OR
130 | meta %>% filter(genotype == "WT")
131 |
132 | # - use `filter()`/`select()`to return only the `stage` and `genotype` columns for those samples with `myc` > 50:
133 | meta %>%
134 | filter(myc > 50) %>%
135 | select(stage, genotype)
136 |
137 | # Or
138 |
139 | select(filter(meta, myc > 50), stage, genotype)
140 |
141 | # - add a column called `pre_treatment` to the beginning of the dataframe with the values T, F, T, F, T, F, T, F
142 | pre_treatment <- c(T, F, T, F, T, F, T, F)
143 |
144 | # Or
145 |
146 | pre_treatment <- rep(c(TRUE, FALSE), 4) # logical values, matching c(T, F, ...) above
147 |
148 | meta <- cbind(pre_treatment, meta)
149 |
150 | # - why might this design be problematic?
151 |
152 | # All of the Male samples are True for the pre-treatment and all of the females are False. Thus, we have confounded our design.
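# One way to spot this kind of confounding is to cross-tabulate the two variables.
# A minimal sketch that rebuilds the two columns as standalone vectors
# (`toy_sex`, `toy_pre`, and `toy_tab` are hypothetical names used only for illustration):

```r
# Rebuild the two columns as plain vectors (same values as in the exercise)
toy_sex <- rep(c("M", "F"), 4)
toy_pre <- rep(c(TRUE, FALSE), 4)

# Cross-tabulate: empty cells show combinations that were never sampled
toy_tab <- table(sex = toy_sex, pre_treatment = toy_pre)
toy_tab
```

# If each row of the table has a single non-zero cell, one variable fully predicts
# the other and their effects cannot be separated.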
153 |
154 | # - Using `%>%` create a tibble of the `meta` object and call it `meta_tb` (make sure you don't lose the rownames!)
155 | # - change the names of the columns to: "A", "B", "C", "D", "E":
156 | #
157 | meta_tb <- meta %>%
158 | rownames_to_column(var="sampleIDs") %>%
159 |         as_tibble() # as.tibble() is deprecated in favor of as_tibble()
160 |
161 | colnames(meta_tb)[2:6] <- LETTERS[1:5]
162 |
163 | # ## Visualizing data
164 | #
165 | # 5. Often it is easier to see the patterns or nature of our data when we explore it visually with a variety of graphics. Let's use ggplot2 to explore differences in the expression of the Myc gene based on genotype.
166 | #
167 | # - Plot a boxplot of the expression of Myc for the KO and WT samples using `theme_minimal()` and give the plot new axes names and a centered title.
168 | #
169 | ggplot(meta) +
170 | geom_boxplot(aes(x = genotype, y = myc)) +
171 | theme_minimal() +
172 | ggtitle("Myc expression") +
173 | ylab("Myc level") +
174 | xlab("Genotype") +
175 | theme(plot.title = element_text(hjust=0.5, size = rel(2)))
176 |
177 | ### Preparing for downstream analysis tools
178 | #
179 | # 6. Many different statistical tools or analytical packages expect all data needed as input to be in the structure of a list. Let's create a list of our count and metadata in preparation for a downstream analysis.
180 | #
181 | # - Create a list called `project1` with the `meta` and `counts` objects, as well as a new vector with all the sample names extracted from one of the 2 data frames.
182 | #
183 | project1 <- list(meta, counts, rownames(meta))
184 | project1
185 |
--------------------------------------------------------------------------------
/homework/DGE_assignment_1.R:
--------------------------------------------------------------------------------
1 | #### Assignment 1 ####
2 |
3 | #### RNA-seq counts distribution
4 | # 1. Evaluate the relationship between mean and variance for the control replicates (Irrel_kd samples). Note the differences or similarities in the plot compared to the one using the overexpression replicates.
5 |
6 |
7 | # 2. An RNA-seq experiment was conducted on mouse forebrain to evaluate the effect of increasing concentrations of a treatment. For each of the five different concentrations we have n = 5 mice for a total of 25 samples. If we observed little to no variability between replicates, what might this suggest about our samples?
8 |
9 |
10 | # 3. What type of mean-variance relationship would you expect to see for this dataset?
11 |
12 |
13 | #### Count normalization
14 | # 1. Suppose we have sample names matching in the counts matrix and metadata file, but they are in different order. Write the line(s) of code to create a new matrix with columns re-ordered such that they are identical to the row names of the metadata.
15 |
16 |
17 | #### Sample-level QC
18 | # 1. What does the above plot tell you about the similarity of samples?
19 |
20 |
21 | # 2. Does it fit the expectation from the experimental design?
22 |
23 |
24 | # 3. What do you think the %variance information (in the axis titles) tells you about the data in the context of the PCA?
25 |
26 |
--------------------------------------------------------------------------------
/homework/DGE_assignment_1_answer_key.R:
--------------------------------------------------------------------------------
1 | #### Assignment 1 ####
2 |
3 | #### RNA-seq counts distribution
4 | # 1. Evaluate the relationship between mean and variance for the control replicates (Irrel_kd samples). Note the differences or similarities in the plot compared to the one using the overexpression replicates.
5 |
6 | mean_counts_ctrl <- apply(data[,1:3], 1, mean) #select column 1 to 3, which correspond to Irrel_kd samples
7 | variance_counts_ctrl <- apply(data[,1:3], 1, var)
8 | df_ctrl <- data.frame(mean_counts_ctrl, variance_counts_ctrl)
9 | ggplot(df_ctrl) +
10 | geom_point(aes(x=mean_counts_ctrl, y=variance_counts_ctrl)) +
11 | scale_y_log10(limits = c(1,1e9)) +
12 | scale_x_log10(limits = c(1,1e9)) +
13 | geom_abline(intercept = 0, slope = 1, color="red")
14 |
15 | # Ans: The plot of mean and variance for the control replicates is similar to that of overexpression replicates shown in the lesson.
16 |
17 | # 2. An RNA-seq experiment was conducted on mouse forebrain to evaluate the effect of increasing concentrations of a treatment. For each of the five different concentrations we have n = 5 mice for a total of 25 samples. If we observed little to no variability between replicates, what might this suggest about our samples?
18 |
19 | # Ans: The lack of variability between replicates suggests that we are possibly dealing with technical replicates. With true biological replicates we expect some amount of variability. If you have technical replicates, DESeq2 is not a good fit, because it uses the negative binomial (NB) distribution to account for overdispersion, and there is no overdispersion to model here.
20 |
21 | # 3. What type of mean-variance relationship would you expect to see for this dataset?
22 | # Ans: mean == variance. A Poisson would be more appropriate.
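# The contrast between the two distributions can be checked with a quick simulation.
# A minimal sketch, assuming arbitrary parameters (the seed, the mean of 100, and
# size = 2 are illustrative choices, not values from the workshop data):

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable
n <- 1e5

# Poisson counts: the variance is expected to track the mean
pois_counts <- rpois(n, lambda = 100)
pois_ratio  <- var(pois_counts) / mean(pois_counts)

# Negative binomial counts with the same mean; size = 2 is an arbitrary choice.
# The expected variance is mu + mu^2 / size, far larger than the mean.
nb_counts <- rnbinom(n, mu = 100, size = 2)
nb_ratio  <- var(nb_counts) / mean(nb_counts)

round(c(poisson = pois_ratio, neg_binomial = nb_ratio), 2)
```

# A variance-to-mean ratio near 1 is what the Poisson answer above describes; a ratio
# well above 1 is the overdispersion that the negative binomial model is meant to absorb.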
23 |
24 | #### Count normalization
25 | # 1. Suppose we have sample names matching in the counts matrix and metadata file, but they are in different order. Write the line(s) of code to create a new matrix with columns re-ordered such that they are identical to the row names of the metadata.
26 |
27 | idx <- match(rownames(meta), colnames(data))
28 | data_reordered <- data[, idx]
29 |
30 | ## OR
31 |
32 | data_reordered <- data[, match(rownames(meta), colnames(data))]
33 |
34 | ## OR
35 |
36 | txi$counts <- txi$counts[, match(rownames(meta), colnames(txi$counts))]
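# How `match()` produces the reordering index is easier to follow on a tiny example.
# A minimal sketch on made-up objects (`toy_counts` and `toy_meta` are hypothetical
# stand-ins for `data` and `meta`):

```r
# Toy counts matrix and metadata (hypothetical values)
toy_counts <- matrix(1:6, nrow = 2,
                     dimnames = list(c("gene1", "gene2"), c("s3", "s1", "s2")))
toy_meta <- data.frame(group = c("ctl", "ctl", "trt"),
                       row.names = c("s1", "s2", "s3"))

# For each metadata row name, find its column position in the counts matrix
toy_idx <- match(rownames(toy_meta), colnames(toy_counts))
toy_reordered <- toy_counts[, toy_idx]

# Confirm that the order now agrees before moving on
stopifnot(all(colnames(toy_reordered) == rownames(toy_meta)))
```

# The argument order matters: the first argument to `match()` is the target order
# and the second is the vector being searched.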
37 |
38 | #### Sample-level QC
39 | # 1. What does the above plot tell you about the similarity of samples?
40 | # Ans: Samples from different experimental groups are different, while replicates within the same group are similar.
41 |
42 | # 2. Does it fit the expectation from the experimental design?
43 | # Ans: Yes, it does.
44 |
45 | # 3. What do you think the %variance information (in the axis titles) tells you about the data in the context of the PCA?
46 | # Ans: PC1 is associated with 37% of the variance in the data, and PC2 is associated with 16%.
47 | # *You won't see the %variance information when you make PCA plots without the plotPCA() function. But, there are tools that let you explore this.*
48 |
--------------------------------------------------------------------------------
/homework/DGE_assignment_2.R:
--------------------------------------------------------------------------------
1 | #### Assignment 2 ####
2 |
3 | #### Design formula
4 | # 1. How would the design formula be structured to perform the following analyses?
5 |
6 | # 2. Test for the effect of treatment.
7 |
8 | # 3. Test for the effect of genotype, while regressing out the variation due to treatment.
9 |
10 | # 4. Test for the effect of genotype on the treatment effects.
11 |
12 | #### Hypothesis testing
13 | # 1. What is an appropriate hypothesis test if you are testing for expression differences across the developmental stages?
14 |
15 | # 2. Provide the line of code used to create the dds object.
16 |
17 | # 3. Provide the line of code used to run DESeq2.
18 |
19 | # 4. The results of the differential expression analysis run identifies a group of genes that spike in expression between the first and second timepoints with no change in expression thereafter. How would we go about obtaining fold changes for these genes?
20 |
21 | #### Description of steps for DESeq2
22 | # 1. Given the dispersion plot below, would you have any concerns regarding the fit of your data to the model?
23 | # If not, what aspects of the plot makes you feel confident about your data?
24 | # If so, what are your concerns? What would you do to address them?
25 |
26 | #### Wald test results
27 | # MOV10 Differential Expression Analysis: Control versus Knockdown
28 | # Now that we have the overexpression results, do the same for the Control vs. Knockdown samples.
29 |
30 | # 1. Create a contrast vector called contrast_kd.
31 |
32 | # 2. Use the contrast vector in results() to extract a results table and store it in a variable called res_tableKD.
33 |
34 | # 3. Shrink the LFC estimates using lfcShrink() and assign it back to res_tableKD.
35 |
36 | #### Summarizing results and extracting significant gene lists
37 | # MOV10 Differential Expression Analysis: Control versus Knockdown
38 |
39 | # 1. Using the same p-adjusted threshold as above (padj.cutoff < 0.05), subset res_tableKD to report the number of genes that are up- and down-regulated in Mov10_knockdown compared to control.
40 |
41 | # 2. How many genes are differentially expressed in the Knockdown compared to Control? How does this compare to the overexpression significant gene list (in terms of numbers)?
42 |
43 |
--------------------------------------------------------------------------------
/homework/DGE_assignment_2_answer_key.R:
--------------------------------------------------------------------------------
1 | #### Assignment 2 ####
2 |
3 | #### Design formula
4 | ## Exercise
5 | # 1. Suppose you wanted to study the expression differences between the two age groups in the metadata shown above, and major sources of variation were sex and treatment, how would the design formula be written?
6 | # Ans: design = ~ sex + treatment + age
7 |
8 | # 2. Based on our Mov10 metadata dataframe, which factors could we include in our design formula?
9 | # Ans: design = ~ sampletype
10 |
11 | # 3. What would you do if you wanted to include a factor in your design formula that is not in your metadata?
12 | # Ans: Add that factor into your metadata.
13 |
14 | ## Exercise: How would the design formula be structured to perform the following analyses?
15 | # 1. Test for the effect of treatment.
16 | # Ans: design = ~ treatment
17 |
18 | # 2. Test for the effect of genotype, while regressing out the variation due to treatment.
19 | # Ans: design = ~ treatment + genotype
20 |
21 | # 3. Test for the effect of genotype on the treatment effects.
22 | # Ans: design = ~ genotype + treatment + genotype:treatment
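# What each design formula estimates can be inspected with base R's `model.matrix()`.
# A minimal sketch, assuming a made-up 2 x 2 design (`toy_design` and its factor levels
# are hypothetical; DESeq2 itself is not called here):

```r
# Hypothetical 2 x 2 design with two replicates per group
toy_design <- data.frame(
  genotype  = factor(rep(c("WT", "KO"), each = 4), levels = c("WT", "KO")),
  treatment = factor(rep(rep(c("untreated", "treated"), each = 2), 2),
                     levels = c("untreated", "treated"))
)

# Additive model: one coefficient per factor plus the intercept
mm_additive <- model.matrix(~ treatment + genotype, data = toy_design)

# Interaction model: adds a term for how the treatment effect differs by genotype
mm_interaction <- model.matrix(~ genotype + treatment + genotype:treatment,
                               data = toy_design)

colnames(mm_additive)
colnames(mm_interaction)
```

# The extra column in the interaction model is the term that would be tested when
# asking whether genotype changes the treatment effect.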
23 |
24 | #### Hypothesis testing
25 | # 1. What is an appropriate hypothesis test if you are testing for expression differences across the developmental stages?
26 | # Ans: Likelihood ratio test, because there are more than two groups for comparison.
27 |
28 | # 2. Provide the line of code used to create the dds object.
29 | dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ sex + developmental_stage) # here we assume that the metadata includes two columns: sex and developmental_stage
30 |
31 | # 3. Provide the line of code used to run DESeq2.
32 | dds_lrt <- DESeq(dds, test="LRT", reduced = ~ sex) # since the full model is "sex + developmental_stage", the reduced model is then "sex". We don't need to use "1" here, because our reduced model has one factor.
33 |
34 | # 4. The results of the differential expression analysis run identifies a group of genes that spike in expression between the first and second timepoints with no change in expression thereafter. How would we go about obtaining fold changes for these genes?
35 | # Ans: We could use a Wald test to compare the groups we are interested in. It is not uncommon to run both tests, as you recall the LRT does not provide fold changes, and sometimes it can be helpful to further reduce down to a set of higher confidence genes by identifying those with higher fold change values.
36 |
37 | #### Description of steps for DESeq2
38 | # 1. Given the dispersion plot below, would you have any concerns regarding the fit of your data to the model?
39 | # If not, what aspects of the plot makes you feel confident about your data?
40 | # If so, what are your concerns? What would you do to address them?
41 | # Ans: Yes, there are some concerns. The data do not scatter around the fitted curve, and the normalized counts are restricted to a small range. I would double-check the QC of my samples to make sure that there is no contamination and that there are no outliers.
42 |
43 | #### Wald test results
44 | # MOV10 Differential Expression Analysis: Control versus Knockdown
45 | # Now that we have the overexpression results, do the same for the Control vs. Knockdown samples.
46 |
47 | # 1. Create a contrast vector called contrast_kd.
48 | contrast_kd <- c("sampletype", "MOV10_knockdown", "control")
49 |
50 | # 2. Use the contrast vector in results() to extract a results table and store it in a variable called res_tableKD.
51 | res_tableKD <- results(dds, contrast=contrast_kd, alpha = 0.05)
52 |
53 | # 3. Shrink the LFC estimates using lfcShrink() and assign it back to res_tableKD.
54 |
55 | res_tableKD <- lfcShrink(dds, coef="sampletype_MOV10_knockdown_vs_control", type="apeglm")
56 |
57 | #### Summarizing results and extracting significant gene lists
58 | # MOV10 Differential Expression Analysis: Control versus Knockdown
59 |
60 | # 1. Using the same p-adjusted threshold as above (padj.cutoff < 0.05), subset res_tableKD to report the number of genes that are up- and down-regulated in Mov10_knockdown compared to control.
61 | res_tableKD_tb <- res_tableKD %>%
62 | data.frame() %>%
63 | rownames_to_column(var="gene") %>%
64 | as_tibble()
65 |
66 | sigKD <- res_tableKD_tb %>%
67 | filter(padj < padj.cutoff)
68 |
69 | # 2. How many genes are differentially expressed in the Knockdown compared to Control? How does this compare to the overexpression significant gene list (in terms of numbers)?
70 | # Ans: There are 2,827 genes differentially expressed in the Knockdown compared to Control, and 4,774 genes differentially expressed in the Overexpression compared to Control. Therefore, fewer genes are present in the Knockdown significant gene list.
71 |
72 |
--------------------------------------------------------------------------------
/homework/DGE_assignment_3.R:
--------------------------------------------------------------------------------
1 | #### Assignment 3 ####
2 |
3 |
--------------------------------------------------------------------------------
/homework/exp_design_table.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/homework/exp_design_table.xlsx
--------------------------------------------------------------------------------
/img/Create_RStudio_Project.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/Create_RStudio_Project.gif
--------------------------------------------------------------------------------
/img/Differential_expression.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/Differential_expression.png
--------------------------------------------------------------------------------
/img/MA_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/MA_plot.png
--------------------------------------------------------------------------------
/img/NB_model_formula.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/NB_model_formula.png
--------------------------------------------------------------------------------
/img/PCA_2sample_genes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/PCA_2sample_genes.png
--------------------------------------------------------------------------------
/img/PCA_2sample_influence.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/PCA_2sample_influence.png
--------------------------------------------------------------------------------
/img/PCA_2sample_rotate.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/PCA_2sample_rotate.png
--------------------------------------------------------------------------------
/img/PCA_2sample_variation1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/PCA_2sample_variation1.png
--------------------------------------------------------------------------------
/img/PCA_2sample_variation2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/PCA_2sample_variation2.png
--------------------------------------------------------------------------------
/img/PCA_2sample_variation3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/PCA_2sample_variation3.png
--------------------------------------------------------------------------------
/img/PCA_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/PCA_salmon.png
--------------------------------------------------------------------------------
/img/PCA_samples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/PCA_samples.png
--------------------------------------------------------------------------------
/img/bad_dispersion1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/bad_dispersion1.png
--------------------------------------------------------------------------------
/img/bad_dispersion2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/bad_dispersion2.png
--------------------------------------------------------------------------------
/img/batch_effect.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/batch_effect.png
--------------------------------------------------------------------------------
/img/batch_effect_pca.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/batch_effect_pca.png
--------------------------------------------------------------------------------
/img/cluster_summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/cluster_summary.png
--------------------------------------------------------------------------------
/img/cnetplot-2_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/cnetplot-2_salmon.png
--------------------------------------------------------------------------------
/img/cnetplot1_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/cnetplot1_salmon.png
--------------------------------------------------------------------------------
/img/confounded_batch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/confounded_batch.png
--------------------------------------------------------------------------------
/img/confounded_design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/confounded_design.png
--------------------------------------------------------------------------------
/img/dataset.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/dataset.png
--------------------------------------------------------------------------------
/img/de_norm_counts_var.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_norm_counts_var.png
--------------------------------------------------------------------------------
/img/de_replicates_img2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_replicates_img2.png
--------------------------------------------------------------------------------
/img/de_theory.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_theory.png
--------------------------------------------------------------------------------
/img/de_variation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_variation.png
--------------------------------------------------------------------------------
/img/de_workflow_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_workflow_salmon.png
--------------------------------------------------------------------------------
/img/de_workflow_salmon_deseq1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_workflow_salmon_deseq1.png
--------------------------------------------------------------------------------
/img/de_workflow_salmon_deseq2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_workflow_salmon_deseq2.png
--------------------------------------------------------------------------------
/img/de_workflow_salmon_normalization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_workflow_salmon_normalization.png
--------------------------------------------------------------------------------
/img/de_workflow_salmon_qc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/de_workflow_salmon_qc.png
--------------------------------------------------------------------------------
/img/degPatterns_figure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/degPatterns_figure.png
--------------------------------------------------------------------------------
/img/degReport_clusters2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/degReport_clusters2.png
--------------------------------------------------------------------------------
/img/degpattern_figure.tiff:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/degpattern_figure.tiff
--------------------------------------------------------------------------------
/img/deseq2_shrunken_lfc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq2_shrunken_lfc.png
--------------------------------------------------------------------------------
/img/deseq2_workflow_separate.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq2_workflow_separate.png
--------------------------------------------------------------------------------
/img/deseq2_workflow_separate_dis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq2_workflow_separate_dis.png
--------------------------------------------------------------------------------
/img/deseq2_workflow_separate_fit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq2_workflow_separate_fit.png
--------------------------------------------------------------------------------
/img/deseq2_workflow_separate_sf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq2_workflow_separate_sf.png
--------------------------------------------------------------------------------
/img/deseq2_workflow_separate_shr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq2_workflow_separate_shr.png
--------------------------------------------------------------------------------
/img/deseq_counts_distribution.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_counts_distribution.png
--------------------------------------------------------------------------------
/img/deseq_counts_overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_counts_overview.png
--------------------------------------------------------------------------------
/img/deseq_dispersion1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_dispersion1.png
--------------------------------------------------------------------------------
/img/deseq_dispersion2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_dispersion2.png
--------------------------------------------------------------------------------
/img/deseq_mean_variance2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_mean_variance2.png
--------------------------------------------------------------------------------
/img/deseq_median_of_ratios.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_median_of_ratios.png
--------------------------------------------------------------------------------
/img/deseq_nb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_nb.png
--------------------------------------------------------------------------------
/img/deseq_obj1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_obj1.png
--------------------------------------------------------------------------------
/img/deseq_obj2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/deseq_obj2.png
--------------------------------------------------------------------------------
/img/emapplot_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/emapplot_salmon.png
--------------------------------------------------------------------------------
/img/example_PCA_cage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/example_PCA_cage.png
--------------------------------------------------------------------------------
/img/example_PCA_sex.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/example_PCA_sex.png
--------------------------------------------------------------------------------
/img/example_PCA_strain.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/example_PCA_strain.png
--------------------------------------------------------------------------------
/img/example_PCA_treatmentPC1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/example_PCA_treatmentPC1.png
--------------------------------------------------------------------------------
/img/example_PCA_treatmentPC3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/example_PCA_treatmentPC3.png
--------------------------------------------------------------------------------
/img/example_metadata.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/example_metadata.png
--------------------------------------------------------------------------------
/img/exercise_dispersion.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/exercise_dispersion.png
--------------------------------------------------------------------------------
/img/gProfiler.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/gProfiler.png
--------------------------------------------------------------------------------
/img/gene_filtering.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/gene_filtering.png
--------------------------------------------------------------------------------
/img/genemania.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/genemania.png
--------------------------------------------------------------------------------
/img/go_freq.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/go_freq.png
--------------------------------------------------------------------------------
/img/go_heirarchy.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/go_heirarchy.jpg
--------------------------------------------------------------------------------
/img/go_proportions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/go_proportions.png
--------------------------------------------------------------------------------
/img/go_proportions_table3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/go_proportions_table3.png
--------------------------------------------------------------------------------
/img/gseaKEGGresults.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/gseaKEGGresults.png
--------------------------------------------------------------------------------
/img/gsea_kegg_hsa03008.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/gsea_kegg_hsa03008.png
--------------------------------------------------------------------------------
/img/gsea_kegg_hsa03040.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/gsea_kegg_hsa03040.png
--------------------------------------------------------------------------------
/img/gsea_overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/gsea_overview.png
--------------------------------------------------------------------------------
/img/gsea_theory.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/gsea_theory.png
--------------------------------------------------------------------------------
/img/heatmap_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/heatmap_example.png
--------------------------------------------------------------------------------
/img/heatmap_sleuth.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/heatmap_sleuth.png
--------------------------------------------------------------------------------
/img/hsa03008.pathview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/hsa03008.pathview.png
--------------------------------------------------------------------------------
/img/hsa03040.pathview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/hsa03040.pathview.png
--------------------------------------------------------------------------------
/img/hsa05222.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/hsa05222.png
--------------------------------------------------------------------------------
/img/hypergeo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/hypergeo.png
--------------------------------------------------------------------------------
/img/illumina_sequencing_process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/illumina_sequencing_process.png
--------------------------------------------------------------------------------
/img/indep_filt_scatterplus.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/indep_filt_scatterplus.png
--------------------------------------------------------------------------------
/img/library_prep.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/library_prep.png
--------------------------------------------------------------------------------
/img/lrt_formula.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/lrt_formula.png
--------------------------------------------------------------------------------
/img/lrt_time_nodiff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/lrt_time_nodiff.png
--------------------------------------------------------------------------------
/img/lrt_time_yesdiff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/lrt_time_yesdiff.png
--------------------------------------------------------------------------------
/img/maplot_unshrunken.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/maplot_unshrunken.png
--------------------------------------------------------------------------------
/img/melt_wide_to_long_format.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/melt_wide_to_long_format.png
--------------------------------------------------------------------------------
/img/meta_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/meta_example.png
--------------------------------------------------------------------------------
/img/metadata_batch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/metadata_batch.png
--------------------------------------------------------------------------------
/img/mov10-model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/mov10-model.png
--------------------------------------------------------------------------------
/img/mov10_isoform_expression.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/mov10_isoform_expression.png
--------------------------------------------------------------------------------
/img/mov10_test_table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/mov10_test_table.png
--------------------------------------------------------------------------------
/img/mov10oe_dotplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/mov10oe_dotplot.png
--------------------------------------------------------------------------------
/img/non_confounded_design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/non_confounded_design.png
--------------------------------------------------------------------------------
/img/normalization_methods_composition.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/normalization_methods_composition.png
--------------------------------------------------------------------------------
/img/normalization_methods_depth.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/normalization_methods_depth.png
--------------------------------------------------------------------------------
/img/normalization_methods_length.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/normalization_methods_length.png
--------------------------------------------------------------------------------
/img/orgdb_annotation_databases.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/orgdb_annotation_databases.png
--------------------------------------------------------------------------------
/img/paired_end_reads.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/paired_end_reads.png
--------------------------------------------------------------------------------
/img/pathway_analysis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/pathway_analysis.png
--------------------------------------------------------------------------------
/img/pca_sleuth.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/pca_sleuth.png
--------------------------------------------------------------------------------
/img/pheatmap_aug2020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/pheatmap_aug2020.png
--------------------------------------------------------------------------------
/img/pheatmap_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/pheatmap_salmon.png
--------------------------------------------------------------------------------
/img/plotCounts_ggrepel_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/plotCounts_ggrepel_salmon.png
--------------------------------------------------------------------------------
/img/plotDispersion_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/plotDispersion_salmon.png
--------------------------------------------------------------------------------
/img/plot_density_filtered.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/plot_density_filtered.png
--------------------------------------------------------------------------------
/img/plot_density_unfiltered.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/plot_density_unfiltered.png
--------------------------------------------------------------------------------
/img/quant_screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/quant_screenshot.png
--------------------------------------------------------------------------------
/img/replicates.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/replicates.png
--------------------------------------------------------------------------------
/img/revigo_download_link.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/revigo_download_link.png
--------------------------------------------------------------------------------
/img/revigo_input.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/revigo_input.png
--------------------------------------------------------------------------------
/img/revigo_treemap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/revigo_treemap.png
--------------------------------------------------------------------------------
/img/rlog_transformation_new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/rlog_transformation_new.png
--------------------------------------------------------------------------------
/img/salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/salmon.png
--------------------------------------------------------------------------------
/img/sample_qc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sample_qc.png
--------------------------------------------------------------------------------
/img/settingup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/settingup.png
--------------------------------------------------------------------------------
/img/sigOE_heatmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sigOE_heatmap.png
--------------------------------------------------------------------------------
/img/sig_genes_gather_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sig_genes_gather_salmon.png
--------------------------------------------------------------------------------
/img/sleuth_abund_h5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_abund_h5.png
--------------------------------------------------------------------------------
/img/sleuth_bootstraps1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_bootstraps1.png
--------------------------------------------------------------------------------
/img/sleuth_bootstraps2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_bootstraps2.png
--------------------------------------------------------------------------------
/img/sleuth_cor_heatmap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_cor_heatmap.png
--------------------------------------------------------------------------------
/img/sleuth_formula1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_formula1.png
--------------------------------------------------------------------------------
/img/sleuth_formula2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_formula2.png
--------------------------------------------------------------------------------
/img/sleuth_mov10_results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_mov10_results.png
--------------------------------------------------------------------------------
/img/sleuth_pca_loadings.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_pca_loadings.png
--------------------------------------------------------------------------------
/img/sleuth_pca_variance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_pca_variance.png
--------------------------------------------------------------------------------
/img/sleuth_tech_var.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_tech_var.png
--------------------------------------------------------------------------------
/img/sleuth_workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_workflow.png
--------------------------------------------------------------------------------
/img/sleuth_workflow1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_workflow1.png
--------------------------------------------------------------------------------
/img/sleuth_workflow2_updated.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/sleuth_workflow2_updated.png
--------------------------------------------------------------------------------
/img/spia_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/spia_plot.png
--------------------------------------------------------------------------------
/img/topgen_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/topgen_plot.png
--------------------------------------------------------------------------------
/img/topgen_plot_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/topgen_plot_salmon.png
--------------------------------------------------------------------------------
/img/volcano_plot_1_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/volcano_plot_1_salmon.png
--------------------------------------------------------------------------------
/img/volcano_plot_2_salmon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/volcano_plot_2_salmon.png
--------------------------------------------------------------------------------
/img/workflow-salmon-DGE-alt2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/workflow-salmon-DGE-alt2.png
--------------------------------------------------------------------------------
/img/workflow-salmon-sleuth.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/img/workflow-salmon-sleuth.png
--------------------------------------------------------------------------------
/lectures/Intro_to_workshop_CFAR.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/lectures/Intro_to_workshop_CFAR.pdf
--------------------------------------------------------------------------------
/lectures/Intro_to_workshop_all.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/lectures/Intro_to_workshop_all.pdf
--------------------------------------------------------------------------------
/lectures/Workshop_wrapup.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/lectures/Workshop_wrapup.pdf
--------------------------------------------------------------------------------
/lectures/Workshop_wrapup_CFAR.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/lectures/Workshop_wrapup_CFAR.pdf
--------------------------------------------------------------------------------
/lectures/Workshop_wrapup_all.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/lectures/Workshop_wrapup_all.pdf
--------------------------------------------------------------------------------
/lectures/Workshop_wrapup_dfhcc.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/lectures/Workshop_wrapup_dfhcc.pdf
--------------------------------------------------------------------------------
/lectures/workshop_intro_slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/lectures/workshop_intro_slides.pdf
--------------------------------------------------------------------------------
/lectures/workshop_wrapup_slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hbctraining/Intro-to-DGE/6cd3009c4d8f785205bae8da07afff8ce527dda5/lectures/workshop_wrapup_slides.pdf
--------------------------------------------------------------------------------
/lessons/01a_RNAseq_processing_workflow.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "From raw sequence reads to count matrix"
3 | author: "Meeta Mistry, Radhika Khetani, Mary Piper"
4 | date: "June 9, 2020"
5 | ---
6 |
7 | Approximate time: 40 minutes
8 |
9 | ## Learning objectives
10 |
11 | * Understand the different steps of the RNA-seq workflow, from RNA extraction to assessing the expression levels of genes.
12 |
13 | # RNA-seq workflow
14 |
15 | To perform Differential Gene Expression analysis, we need to start with a matrix of counts representing the levels of gene expression. It is important to understand how the count matrix is generated, before diving into the statistical analysis.
16 |
17 | In this lesson we will briefly discuss the RNA-processing pipeline for bulk RNA-seq, and the **different steps we take to go from raw sequencing reads to a gene expression count matrix**.
18 |
19 |
20 |
21 |
22 |
23 | ## 1. RNA Extraction and library preparation
24 |
25 | Before RNA can be sequenced, it must first be extracted and separated from its cellular environment, then prepared into a cDNA library. There are a number of steps involved, which are outlined in the figure below, and in parallel there are various quality checks implemented to make sure we have good quality RNA to move forward with. We briefly describe some of these steps below.
26 |
27 | **a. Enriching for RNA.** Once the sample has been treated with DNAse to remove any contaminating DNA sequence, the sample undergoes either selection of the mRNA (polyA selection) or depletion of the rRNA.
28 |
29 | Generally, ribosomal RNA represents the majority of the RNAs present in a cell, while messenger RNAs represent a small percentage of total RNA, ~2% in humans. Therefore, if we want to study the protein-coding genes, we need to enrich for mRNA or deplete the rRNA. For differential gene expression analysis, it is best to enrich for Poly(A)+, unless you are aiming to obtain information about long non-coding RNAs, in which case ribosomal RNA depletion is recommended.
30 |
31 | > **RNA Quality check**: It is essential to check the integrity of the extracted RNA prior to starting the cDNA library preparation. Traditionally, RNA integrity was assessed via gel electrophoresis by visual inspection of the ribosomal RNA bands, but that method is time-consuming and imprecise. The Bioanalyzer system from Agilent will rapidly assess RNA integrity and calculate an RNA Integrity Number (RIN), which facilitates the interpretation and reproducibility of RNA quality. RIN, essentially, provides a means by which RNA quality from different samples can be compared to each other in a standardized manner.
32 |
33 | **b. Fragmentation and size selection.** The remaining RNA molecules are then fragmented. This is done via chemical, enzymatic (e.g., RNAses) or physical processes (e.g., mechanical shearing). These fragments then undergo size selection to retain only those fragments within a size range that Illumina sequencing machines can handle best, i.e., between 150 and 300 bp.
34 |
35 | > **Fragment size quality check**: After size selection/exclusion the fragment size distribution should be assessed to ensure that it is unimodal and well-defined.
36 |
37 | **c. Reverse transcribe RNA into double-stranded cDNA.** Information about which strand a fragment originated from can be preserved by creating stranded libraries. The most commonly used method incorporates deoxy-UTP during the synthesis
38 | of the second cDNA strand (for details see Levin et al. (2010)). Once double-stranded cDNA fragments are generated, sequence adapters are ligated to the ends. (Size selection can be performed here instead of at the RNA-level.)
39 |
40 | **d. PCR amplification.** Libraries are usually PCR amplified when the amount of starting material is low and/or to increase the number of cDNA molecules to an amount sufficient for sequencing. Run as few amplification cycles as possible to avoid PCR artefacts.
41 |
42 |
43 |
44 |
45 |
46 | *Image source: [Zeng and Mortazavi, 2012](https://pubmed.ncbi.nlm.nih.gov/22910383/)*
47 |
48 | ## 2. Sequencing (Illumina)
49 |
50 | Sequencing of the cDNA libraries will generate **reads**. Reads correspond to the nucleotide sequences of the ends of each of the cDNA fragments in the library. You will have the choice of sequencing either a single end of the cDNA fragments (single-end reads) or both ends of the fragments (paired-end reads).
51 |
52 |
53 |
54 |
55 |
56 | - SE - Single end dataset => Only Read1
57 | - PE - Paired-end dataset => Read1 + Read2
58 | - can be 2 separate FastQ files or just one with interleaved pairs
59 |
60 | Generally single-end sequencing is sufficient unless it is expected that the reads will match multiple locations on the genome (e.g. organisms with many paralogous genes), assemblies are being performed, or for splice isoform differentiation. Be aware that paired-end reads are generally 2x more expensive.
61 |
62 | ### Sequencing-by-synthesis
63 |
64 | Illumina sequencing technology uses a sequencing-by-synthesis approach. **To explore sequencing by synthesis in more depth, please watch [this linked video on Illumina's YouTube channel](https://www.youtube.com/watch?v=fCd6B5HRaZ8).**
65 |
66 | We have provided a brief explanation of the steps below:
67 |
68 | _**Cluster growth**_: The DNA fragments in the cDNA library are denatured and hybridized to the glass flowcell (adapter complementarity). Each fragment is then clonally amplified, forming a cluster of double-stranded DNA. This step is necessary to ensure that the sequencing signal will be strong enough to be detected/captured unambiguously for each base of each fragment.
69 |
70 | * **Number of clusters ~= Number of reads**
71 |
72 | _**Sequencing:**_ The sequencing of the fragment ends is based on fluorophore labelled dNTPs with reversible terminator elements. In each sequencing cycle, a base is incorporated into every cluster and excited by a laser.
73 |
74 | ***Image acquisition:*** Each dNTP has a distinct excitatory signal emission which is captured by cameras.
75 |
76 | ***Base calling:*** The base calling program will then generate the sequence of bases, **i.e. reads**, for each fragment/cluster by assessing the images captured during the many sequencing cycles. In addition to calling the base in every position, the base caller will also report the certainty with which it was able to make the call (quality information).
77 |
78 | * **Number of sequencing cycles = Length of reads**
79 |
80 |
81 |
82 |
83 |
84 |
85 | ## 3. Quality control of raw sequencing data (FastQC)
86 |
87 | The raw reads obtained from the sequencer are stored as **[FASTQ files](https://en.wikipedia.org/wiki/FASTQ_format)**. The FASTQ file format is the de facto file format for sequence reads generated from next-generation sequencing technologies.
88 |
89 | Each FASTQ file is a text file which represents sequence readouts for a sample. Each read is represented by 4 lines as shown below:
90 |
91 | ```
92 | @HWI-ST330:304:H045HADXX:1:1101:1111:61397
93 | CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
94 | +
95 | @?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################
96 | ```
97 |
98 | |Line|Description|
99 | |----|-----------|
100 | |1|Always begins with '@' and then information about the read|
101 | |2|The actual DNA sequence|
102 | |3|Always begins with a '+' and sometimes the same info as in line 1|
103 | |4|Has a string of characters which represent the quality scores; must have same number of characters as line 2|
104 |
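The quality string on line 4 stores one score per base of line 2. As a minimal sketch, assuming the common Phred+33 encoding (the lesson does not state which encoding the example read uses), the characters can be decoded in base R as follows. The string is a truncated copy of the example above, and the variable names are illustrative only.

```r
# Minimal sketch: decode FASTQ quality characters into Phred scores.
# Assumption: Phred+33 encoding (not stated in the lesson).
qual <- "@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI####"  # truncated example string

# The ASCII code of each character minus 33 gives the Phred quality score Q
phred <- utf8ToInt(qual) - 33

# Q relates to the probability that the base call is wrong: P = 10^(-Q / 10)
error_prob <- 10^(-phred / 10)

# Show the first few characters alongside their scores and error probabilities
head(data.frame(char = strsplit(qual, "")[[1]], phred, error_prob))
```

Under this assumption, the run of `#` characters at the end of the read corresponds to very low scores, which matches the stretch of uncalled `N` bases in the sequence line.
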
105 | [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is a commonly used software that provides a simple way to do some **quality control checks on raw sequence data**.
106 |
107 | The main functions include:
108 |
109 | * Providing a quick overview to tell you in which areas there may be problems
110 | * Summary graphs and tables to quickly assess your data
111 | * Export of results to an HTML based permanent report
112 |
113 |
114 | ## 4. Quantify expression
115 |
116 | Once we have explored the quality of our raw reads, we can move on to quantifying expression at the transcript level. The goal of this step is to **identify which transcript each of the reads originated from and the total number of reads associated with each transcript**.
117 |
118 | Tools that have been found to be most accurate for this step in the analysis are referred to as **lightweight alignment tools**, which include:
119 | * [Kallisto](https://pachterlab.github.io/kallisto/about),
120 | * [Sailfish](http://www.nature.com/nbt/journal/v32/n5/full/nbt.2862.html) and
121 | * [Salmon](https://combine-lab.github.io/salmon/)
122 |
123 | Each of the tools in the list above works slightly differently from the others. However, common to all of them is that **they avoid base-to-base genomic alignment of the reads**. Genomic alignment is a step performed by older splice-aware alignment tools such as [STAR](https://academic.oup.com/bioinformatics/article/29/1/15/272537) and [HISAT2](https://daehwankimlab.github.io/hisat2/). In comparison to these tools, the lightweight alignment tools not only provide quantification estimates **faster** (typically more than 20 times faster), but also provide **improvements in accuracy** [[1](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0734-x)].
124 |
125 | **We will use the expression estimates, often referred to as 'pseudocounts', obtained from [Salmon](https://combine-lab.github.io/salmon/)** as the starting point for the differential gene expression analysis.
126 |
127 |
128 |
129 |
130 |
131 | ## 5. Quality control of aligned sequence reads (STAR/Qualimap)
132 |
133 | As mentioned above, the differential gene expression analysis will use transcript/gene pseudocounts generated by Salmon. However, to perform some basic quality checks on the sequencing data, it is important to align the reads to the whole genome. Either STAR or HISAT2 can perform this step and generate a [BAM](https://samtools.github.io/hts-specs/SAMv1.pdf) file that can be used for QC.
134 |
135 | A tool called [Qualimap](http://qualimap.bioinfo.cipf.es/doc_html/intro.html) **explores the features of aligned reads in the context of the genomic region they map to**, hence providing an overall view of the data quality (as an HTML file). Various quality metrics assessed by Qualimap include:
136 |
137 | * DNA or rRNA contamination
138 | * 5'-3' biases
139 | * Coverage biases
140 |
141 | ## 6. Quality control: aggregating results with MultiQC
142 |
143 | Throughout the workflow we have performed various steps of quality checks on our data. You will need **to do this for every sample in your dataset**, making sure these metrics are consistent across the samples for a given experiment. Outlier samples should be flagged for further investigation and potential removal.
144 |
145 | Manually tracking these metrics and browsing through multiple HTML reports (FastQC, Qualimap) and log files (Salmon, STAR) for each sample is tedious and prone to errors. **[MultiQC](https://multiqc.info/) is a tool which aggregates results from several tools and generates a single HTML report** with plots to visualize and compare various QC metrics between the samples. Assessment of the QC metrics may result in the removal of samples before proceeding to the next step, if necessary.
146 |
147 | ***
148 |
149 | Once the QC has been performed on all the samples, we are ready to get started with Differential Gene Expression analysis with [DESeq2](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)!
150 |
151 |
152 |
153 |
154 |
155 |
156 | ***
157 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
158 |
--------------------------------------------------------------------------------
/lessons/04a_design_formulas.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Gene-level differential expression analysis with DESeq2"
3 | author: "Meeta Mistry, Radhika Khetani, Mary Piper"
4 | date: "Wednesday, May 27th, 2020"
5 | ---
6 |
7 | Approximate time: 30 minutes
8 |
9 | ## Learning Objectives
10 |
11 | * Demonstrate the use of the design formula with simple and complex designs
12 | * Construct R code to execute the differential expression analysis workflow with DESeq2
13 |
14 |
15 | # Differential expression analysis with DESeq2
16 |
17 | The final step in the differential expression analysis workflow is **fitting the raw counts to the NB model and performing the statistical test** for differentially expressed genes. In this step we essentially want to determine whether the mean expression levels of different sample groups are significantly different.
18 |
19 |
20 |
21 |
22 |
23 |
24 |
25 | The [DESeq2 paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8) was published in 2014, but the package is continually updated and available for use in R through Bioconductor. It builds on good ideas for dispersion estimation and use of Generalized Linear Models from the [DSS](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4005660/) and [edgeR](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818/) methods.
26 |
27 | Differential expression analysis with DESeq2 involves multiple steps as displayed in the flowchart below in blue. Briefly, DESeq2 will model the raw counts, using normalization factors (size factors) to account for differences in library depth. Then, it will estimate the gene-wise dispersions and shrink these estimates to generate more accurate estimates of dispersion to model the counts. Finally, DESeq2 will fit the negative binomial model and perform hypothesis testing using the Wald test or Likelihood Ratio Test.
28 |
29 |
30 |
31 |
32 |
33 | > **NOTE:** DESeq2 is actively maintained by the developers and continuously being updated. As such, it is important that you note the version you are working with. Recently, there have been some rather **big changes implemented** that impact the output. To find out more detail about the specific **modifications made to methods described in the original 2014 paper**, take a look at [this section in the DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#methods-changes-since-the-2014-deseq2-paper).
34 | >
35 | > Additional details on the statistical concepts underlying DESeq2 are elucidated nicely in Rafael Irizarry's [materials](https://rafalab.github.io/pages/harvardx.html) for the EdX course, "Data Analysis for the Life Sciences Series".
36 |
37 | ## Running DESeq2
38 |
39 | Prior to performing the differential expression analysis, it is a good idea to know what **sources of variation** are present in your data, either by exploration during the QC and/or prior knowledge. Once you know the major sources of variation, you can remove them prior to analysis or control for them in the statistical model by including them in your **design formula**.
40 |
41 | **This step is critical as each additional factor in your design formula reduces your power. HOWEVER, failing to include important sources of variation can give you inaccurate results.**
42 |
43 | ### Design formula
44 |
45 | A design formula tells the statistical software which sources of variation to test for. This includes both your factor of interest as well as any additional covariates that are sources of variation. For example, if you know that sex is a significant source of variation in your data, then `sex` should be included in your model. **The design formula should have all of the factors in your metadata that account for major sources of variation in your data.**
46 |
47 | For example, suppose you have the following metadata:
48 |
49 |
50 |
51 |
52 |
53 | If you want to examine the expression differences between treatments, and you know that major sources of variation include `sex` and `age`, then your design formula would be:
54 |
55 | `design = ~ sex + age + treatment`
56 |
57 | The tilde (`~`) should always precede your factors and tells DESeq2 to model the counts using the following formula. Note the **factors included in the design formula need to match the column names in the metadata**.
58 |
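To see how a design formula relates to the metadata columns, the following base R sketch may help. The data frame `meta_example` and its values are hypothetical (they are not the workshop metadata), and `model.matrix()` is used here only to display the coefficients a formula implies; it is not a required step in the DESeq2 workflow.

```r
# Hypothetical metadata (names and values are illustrative only)
meta_example <- data.frame(
  sex       = factor(c("F", "F", "M", "M", "F", "F", "M", "M")),
  age       = factor(c("young", "old", "young", "old", "young", "old", "young", "old")),
  treatment = factor(c("ctrl", "ctrl", "ctrl", "ctrl", "trt", "trt", "trt", "trt")),
  row.names = paste0("sample", 1:8)
)

# Every term in the formula must be a column of the metadata
design <- ~ sex + age + treatment
stopifnot(all(all.vars(design) %in% colnames(meta_example)))

# model.matrix() shows the coefficients that the formula implies
mm <- model.matrix(design, data = meta_example)
colnames(mm)
```

Each non-reference factor level gets its own column, and the last column corresponds to the `treatment` term listed last in the formula.
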
59 | > #### Does the order of variables matter?
60 | > In short, the order of variables in your design formula will not change the final results (i.e. the coefficients returned are always the same). Typically, it has been best practice to list the variable that is your main effect in the last position of your design formula. In this way, the default result that is returned to you when using the `results()` function will be for your main effect.
61 |
62 | ***
63 |
64 |
65 | **Exercises**
66 |
67 | 1. Suppose you wanted to study the expression differences between the two age groups in the metadata shown above, and major sources of variation were `sex` and `treatment`, how would the design formula be written?
68 | 2. Based on our **Mov10** `metadata` dataframe, which factors could we include in our design formula?
69 | 3. What would you do if you wanted to include a factor in your design formula that is not in your metadata?
70 |
71 | ***
72 |
73 | #### Complex designs
74 |
75 | DESeq2 also allows for the analysis of complex designs. You can explore interactions or 'the difference of differences' by specifying them in the design formula. For example, if you wanted to explore the effect of sex on the treatment effect, you could specify it in the design formula as follows:
76 |
77 | `design = ~ sex + age + treatment + sex:treatment`
78 |
79 | There are additional recommendations for complex designs in the [DESeq2 vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions). In addition, [Limma documentation](http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf) offers additional insight into creating more complex design formulas.
80 |
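As a rough illustration of what the interaction term adds, the base R sketch below builds a small hypothetical metadata table with `expand.grid()` and inspects the resulting model matrix. The factor levels are assumptions chosen for the example.

```r
# Hypothetical, fully crossed metadata (levels are illustrative only)
meta_example <- expand.grid(
  sex       = c("F", "M"),
  age       = c("old", "young"),
  treatment = c("ctrl", "trt"),
  stringsAsFactors = TRUE
)

# Main effects plus the sex-by-treatment interaction
mm <- model.matrix(~ sex + age + treatment + sex:treatment, data = meta_example)
colnames(mm)
```

The extra `sexM:treatmenttrt` column is the 'difference of differences': how much the treatment effect in the `M` group differs from the treatment effect in the reference `F` group.
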
81 | > _**NOTE:** Need help figuring out **what information should be present in your metadata?** We have [additional materials](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/experimental_planning_considerations.html) highlighting bulk RNA-seq planning considerations. Please take a look at these materials before starting an experiment to help with proper experimental design._
82 |
83 | ### MOV10 DE analysis
84 |
85 | Now that we know how to specify the model to DESeq2, we can run the differential expression pipeline on the **raw counts**.
86 |
87 | **To get our differential expression results from our raw count data, we only need to run 2 lines of code!**
88 |
89 | First we create a DESeqDataSet as we did in the ['Count normalization'](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html#2-create-deseq2-object) lesson and specify the `txi` object which contains our raw counts, the metadata variable, and provide our design formula:
90 |
91 | ```r
92 | ## Create DESeqDataSet object
93 | dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ sampletype)
94 | ```
95 |
96 | Then, to run the actual differential expression analysis, we use a single call to the function `DESeq()`.
97 |
98 | ```r
99 | ## Run analysis
100 | dds <- DESeq(dds)
101 | ```
102 |
103 | By re-assigning the results of the function back to the same variable name (`dds`), we can fill in the `slots` of our `DESeqDataSet` object.
104 |
105 |
106 |
107 |
108 |
109 | **Everything from normalization to linear modeling was carried out by the use of a single function!** This function will print out a message for the various steps it performs:
110 |
111 | ```
112 | estimating size factors
113 | estimating dispersions
114 | gene-wise dispersion estimates
115 | mean-dispersion relationship
116 | final dispersion estimates
117 | fitting model and testing
118 | ```
119 |
120 | We will discuss what is occurring in each of these steps in the next few lessons, but the code to execute these steps is encompassed in the two lines above.
121 |
122 | > **NOTE:** There are individual functions available in DESeq2 that would allow us to carry out each step in the workflow in a step-wise manner, rather than a single call. We demonstrated one example when generating size factors to create a normalized matrix. By calling `DESeq()`, the individual functions for each step are run for you.
123 |
124 | ***
125 |
126 | **Exercise**
127 |
128 | Let's suppose our experiment has the following metadata:
129 |
130 | | | **genotype** | **treatment** |
131 | | :---: | :---: | :---: |
132 | | **sample1** | WT | ev |
133 | | **sample2** | WT | ev |
134 | | **sample3** | WT | ev |
135 | | **sample4** | WT | ev |
136 | | **sample5** | KO_geneA | ev |
137 | | **sample6** | KO_geneA | ev |
138 | | **sample7** | KO_geneA | ev |
139 | | **sample8** | KO_geneA | ev |
140 | | **sample9** | WT | treated |
141 | | **sample10** | WT | treated |
142 | | **sample11** | WT | treated |
143 | | **sample12** | WT | treated |
144 | | **sample13** | KO_geneA | treated |
145 | | **sample14** | KO_geneA | treated |
146 | | **sample15** | KO_geneA | treated |
147 | | **sample16** | KO_geneA | treated |
148 |
149 | How would the design formula be structured to perform the following analyses?
150 |
151 | 1. Test for the effect of `treatment`.
152 |
153 | 2. Test for the effect of `genotype`, while regressing out the variation due to `treatment`.
154 |
155 | 3. Test for the effect of `genotype` on the `treatment` effects.
156 |
157 |
158 | ***
159 |
--------------------------------------------------------------------------------
/lessons/05a_hypothesis_testing.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Hypothesis testing and multiple testing"
3 | author: "Meeta Mistry, Radhika Khetani, Mary Piper"
4 | date: "June 2nd, 2020"
5 | ---
6 |
7 | Approximate time: 60 minutes
8 |
9 | ## Learning Objectives
10 |
11 | * Describe the process of model fitting
12 | * Compare two methods for hypothesis testing (Wald test vs. LRT)
13 | * Recognize the importance of multiple test correction
14 | * Identify different methods for multiple test correction
15 |
16 |
17 | # DESeq2: Model fitting and Hypothesis testing
18 |
19 | The final step in the DESeq2 workflow is taking the counts for each gene and fitting it to the model and testing for differential expression.
20 |
21 |
22 |
23 |
24 |
25 | ## Generalized Linear Model
26 |
27 | As described [earlier](01c_RNAseq_count_distribution.md), the count data generated by RNA-seq exhibits overdispersion (variance > mean) and the statistical distribution used to model the counts needs to account for this. As such, DESeq2 uses a **negative binomial distribution to model the RNA-seq counts using the equation below**:
28 |
29 |
30 |
31 |
32 |
33 | The two parameters required are the **size factor, and the dispersion estimate**. Next, a generalized linear model (GLM) of the NB family is used to fit the data. Modeling is a mathematically formalized way to approximate how the data behaves given a set of parameters.
34 |
35 | > "_In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value." ([Wikipedia](https://en.wikipedia.org/wiki/Generalized_linear_model))_.
36 |
37 | After the model is fit, coefficients are estimated for each sample group along with their standard error. The coefficients are the estimates for the **log2 fold changes**, and will be used as input for hypothesis testing.
38 |
39 |
40 | ## Hypothesis testing
41 |
42 | The first step in hypothesis testing is to set up a **null hypothesis** for each gene. In our case, the null hypothesis is that there is **no differential expression across the two sample groups (LFC == 0)**. Notice that we can do this without observing any data, because it is based on a thought experiment. Second, we use a statistical test to determine whether the observed data provide enough evidence to reject the null hypothesis.
43 |
44 | ### Wald test
45 |
46 | In DESeq2, the **Wald test is the default used for hypothesis testing when comparing two groups**. The Wald test is a test usually performed on parameters that have been estimated by maximum likelihood. In our case we are testing each gene model coefficient (LFC) which was derived using parameters like dispersion, which were estimated using maximum likelihood. If there are more than 2 sample classes within a variable (for example, if you had low, medium, and high treatment levels) then DESeq2 will generate two pairwise comparisons when low is set as the control (see [here](https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/05b_wald_test_results.html) for more info): low vs. medium, and low vs. high.
47 |
48 | DESeq2 implements the Wald test by:
49 | * Taking the LFC and dividing it by its standard error, resulting in a z-statistic
50 | * The z-statistic is compared to a standard normal distribution, and a p-value is computed reporting the probability that a z-statistic at least as extreme as the observed value would occur by chance if the null hypothesis were true
51 | * If the p-value is small we reject the null hypothesis and state that there is evidence against the null (i.e. the gene is differentially expressed).
52 |
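The arithmetic behind these steps can be sketched in base R for a single gene. The LFC and standard error below are hypothetical values chosen for illustration, not output from the Mov10 analysis, and this is not how DESeq2 is called in practice (`DESeq()` does this for every gene).

```r
# Hypothetical values for one gene (not taken from the Mov10 dataset)
lfc    <- 1.2   # estimated log2 fold change
lfc_se <- 0.4   # standard error of the estimate

# Wald statistic: the estimate divided by its standard error
z <- lfc / lfc_se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

c(z = z, p_value = p_value)
```

A larger standard error for the same LFC gives a smaller z-statistic and a larger p-value, which is why noisy genes with large fold changes are not necessarily significant.
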
53 | The **model fit and Wald test were already run previously as part of the `DESeq()` function**:
54 |
55 | ```r
56 | ## DO NOT RUN THIS CODE
57 |
58 | ## Create DESeqDataSet object
59 | dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ sampletype)
60 |
61 | ## Run analysis
62 | dds <- DESeq(dds)
63 | ```
64 |
65 | ### Likelihood ratio test (LRT)
66 |
67 | DESeq2 also offers the Likelihood Ratio Test (LRT) as an alternative **hypothesis test for when we are comparing more than two sample classes**. Rather than evaluating whether a gene's expression is up- or down-regulated in one class compared to another, the LRT **identifies genes that are changing in expression in any direction across the different sample classes**.
68 |
69 | _How does this compare to the Wald test?_
70 |
71 | The **Wald test** (default) only **estimates one model per gene** and evaluates the null hypothesis that LFC == 0.
72 |
73 | The **Likelihood Ratio Test** is also performed on parameters that have been estimated by maximum likelihood. For this test **two models are estimated per gene; the fit of one model is compared to the fit of the other model.**
74 |
75 |
76 |
77 |
78 |
79 | * m1 is the reduced model (i.e. the design formula with your main factor term removed)
80 | * m2 is the full model (i.e. the full design formula you provided when creating your `dds` object)
81 |
82 | > _This type of test can be especially useful in analyzing time course experiments_.
83 |
84 | Here, we are evaluating the **null hypothesis that the full model fits just as well as the reduced model**. If we reject the null hypothesis, this suggests that there is a significant amount of variation explained by the full model (and our main factor of interest), therefore the gene is differentially expressed across the different levels. DESeq2 implements the LRT by using an Analysis of Deviance (ANODEV) to compare the two model fits. The LR statistic follows a chi-squared distribution, which can be used to calculate an associated p-value.
85 |
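The comparison of the two model fits can be sketched with base R for a single gene. The log-likelihood values are hypothetical, and the degrees of freedom assume that the factor removed from the full model has three levels (so two coefficients are dropped); adjust `df_diff` for other designs.

```r
# Hypothetical log-likelihoods for one gene (illustrative values only)
loglik_full    <- -120.3   # full model, e.g. ~ sampletype
loglik_reduced <- -126.8   # reduced model, e.g. ~ 1

# LR statistic: twice the difference in log-likelihoods
lr_stat <- 2 * (loglik_full - loglik_reduced)

# Degrees of freedom = number of coefficients dropped from the full model.
# Assumption: the removed factor has 3 levels, so 2 coefficients are dropped.
df_diff <- 2

# p-value from the upper tail of the chi-squared distribution
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

c(lr_stat = lr_stat, p_value = p_value)
```

Note that the statistic carries no sign, which is why the LRT reports that a gene changes across the levels but not the direction of the change.
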
86 | To use the LRT, we use the `DESeq()` function but this time adding two arguments:
87 |
88 | 1. specifying that we want to use the LRT test
89 | 2. the 'reduced' model
90 |
91 | ```r
92 | # The full model was specified previously with the `design = ~ sampletype`:
93 | # dds <- DESeqDataSetFromTximport(txi, colData = meta, ~ sampletype)
94 |
95 | # Likelihood ratio test
96 | dds_lrt <- DESeq(dds, test="LRT", reduced = ~ 1)
97 | ```
98 |
99 | Since our 'full' model only has one factor (`sampletype`), the 'reduced' model (removing that factor) leaves us with nothing in our design formula. DESeq2 cannot fit a model with nothing in the design formula, and so in the scenario where you have no additional covariates the intercept is modeled using the syntax `~ 1`.
100 |
101 | ***
102 |
103 | **Exercise**
104 |
105 | You are studying brain maturation and growth patterns in mouse cortex and have obtained RNA-seq data for a total of 24 mice. These samples were acquired at 2 developmental stages (3 dpf and 10 dpf) and with or without treatment using a growth inhibitor (Monoamine oxidase (MAO) inhibitors). For each developmental stage and treatment combination you have 6 replicates. You also have sex information for these mice (12 males and 12 females).
106 |
107 | 1. What steps are necessary to take to decide what your model should be?
108 | 2. What is an appropriate hypothesis test if you are testing for expression differences across the developmental stages?
109 | 3. Provide the line of code used to create the `dds` object.
110 | 4. Provide the line of code used to run DESeq2.
111 | 5. Would you use a different hypothesis test if you had 3 developmental timepoints?
112 |
113 | ***
114 |
115 | ## Multiple test correction
116 |
117 | Regardless of whether we use the Wald test or the LRT, each gene that has been tested will be associated with a p-value. It is this result which we use to determine which genes are considered significantly differentially expressed. However, **we cannot use the p-value directly.**
118 |
119 | ### What does the p-value mean?
120 |
121 | For a single gene, a significance cut-off of p < 0.05 means that, if the gene is not truly differentially expressed, there is a 5% chance of calling it significant anyway (a false positive). For example, if we test 20,000 genes for differential expression, at p < 0.05 we would expect to find 1,000 genes by chance alone. If we found 3,000 genes to be differentially expressed in total, roughly one third of them would be false positives! We would not want to sift through our "significant" genes to identify which ones are true positives.
122 |
123 | Each p-value is the result of a single test (single gene), so the more genes we test, the more false positives we accumulate. **This is the multiple testing problem.**
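
The arithmetic behind this example can be written out in a few lines of R. This is only a sketch of the worst case in which no gene is truly changing; the variable names are illustrative.

```r
# Expected false positives if every tested gene is truly unchanged
n_genes <- 20000      # number of genes tested
alpha   <- 0.05       # per-gene p-value cut-off
expected_fp <- n_genes * alpha
expected_fp

# Share of a hypothetical list of 3,000 "significant" genes
# that could be false positives
n_called <- 3000
fp_fraction <- expected_fp / n_called
fp_fraction
```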
124 |
125 | ### Correcting the p-value for multiple testing
126 |
127 | There are a few common approaches for multiple test correction:
128 |
129 | - **Bonferroni:** The adjusted p-value is calculated by: p-value * m (m = total number of tests). **This is a very conservative approach with a high probability of false negatives**, so is generally not recommended.
130 | - **FDR/Benjamini-Hochberg:** Benjamini and Hochberg (1995) defined the concept of False Discovery Rate (FDR) and created an algorithm to control the expected FDR below a specified level given a list of independent p-values. [More info about BH](https://www.statisticshowto.com/benjamini-hochberg-procedure/).
131 | - **Q-value / Storey method:** The minimum FDR that can be attained when calling that feature significant. For example, if gene X has a q-value of 0.013 it means that 1.3% of genes that show p-values at least as small as gene X are false positives.
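
Base R's `p.adjust()` implements both the Bonferroni and the Benjamini-Hochberg corrections. Here is a small sketch with five made-up p-values (hypothetical, not taken from the MOV10 data) to show how differently the two methods behave:

```r
# Five made-up p-values, one per "gene" (illustrative only)
pvals <- c(0.001, 0.008, 0.02, 0.03, 0.6)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(pvals, method = "bonferroni")

# Benjamini-Hochberg: controls the expected false discovery rate
p_bh <- p.adjust(pvals, method = "BH")

data.frame(pvals, p_bonf, p_bh)

# Number of genes called significant at 0.05 under each method
sum(p_bonf < 0.05)
sum(p_bh < 0.05)
```

With these values Bonferroni keeps two genes while BH keeps four, which illustrates why Bonferroni is considered the more conservative choice.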
132 |
133 | DESeq2 helps reduce the number of genes tested by removing those genes unlikely to be significantly DE prior to testing, such as those with low numbers of counts and outlier samples ([gene-level QC](05b_wald_test_results.md#gene-level-filtering)). However, multiple test correction is also implemented to reduce the False Discovery Rate using an implementation of the Benjamini-Hochberg procedure.
134 |
135 | **So what does FDR < 0.05 mean?**
136 |
137 | By setting the FDR cutoff to < 0.05, we're saying that the proportion of false positives we expect amongst our differentially expressed genes is 5%. For example, if you call 500 genes as differentially expressed with an FDR cutoff of 0.05, you expect 25 of them to be false positives.
138 |
139 |
140 | ---
141 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
142 |
143 | *Some materials and hands-on activities were adapted from [RNA-seq workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/#de) on the Bioconductor website*
144 |
145 | ***
146 |
--------------------------------------------------------------------------------
/lessons/05c_summarizing_results.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Summarizing results from the Wald test"
3 | author: "Meeta Mistry, Radhika Khetani, Mary Piper"
4 | date: "June 1, 2020"
5 | ---
6 |
7 | Approximate time: 20 minutes
8 |
9 | ## Learning Objectives
10 |
11 | * Evaluate the number of differentially expressed genes produced for each comparison
12 | * Construct R objects containing significant genes from each comparison
13 |
14 |
15 | ## Summarizing results
16 |
17 | To summarize the results table, a handy function in DESeq2 is `summary()`. Confusingly, it has the same name as the function used to inspect data frames. When called with a DESeq2 results table as input, this function summarizes the results using a default threshold of padj < 0.1. However, since we set the `alpha` argument to 0.05 when creating our results table, we will use the same threshold here: FDR < 0.05 (padj/FDR is used even though the output says `p-value < 0.05`). Let's start with the OE vs control results:
18 |
19 | ```r
20 | ## Summarize results
21 | summary(res_tableOE, alpha = 0.05)
22 | ```
23 |
24 | In addition to the number of genes up- and down-regulated at the specified threshold, **the function also reports the number of genes that were tested (genes with non-zero total read count), and the number of genes not included in multiple test correction due to a low mean count**.
25 |
26 |
27 | ## Extracting significant differentially expressed genes
28 |
29 | Let's first create variables that contain our threshold criteria. We will only be using the adjusted p-values in our criteria:
30 |
31 | ```r
32 | ### Set thresholds
33 | padj.cutoff <- 0.05
34 | ```
35 |
36 | We can easily subset the results table to only include those that are significant using the `filter()` function, but first we will convert the results table into a tibble:
37 |
38 | ```r
39 | # Create a tibble of results
40 | res_tableOE_tb <- res_tableOE %>%
41 | data.frame() %>%
42 | rownames_to_column(var="gene") %>%
43 | as_tibble()
44 | ```
45 |
46 | Now we can subset that table to only keep the significant genes using our pre-defined thresholds:
47 |
48 | ```r
49 | # Subset the tibble to keep only significant genes
50 | sigOE <- res_tableOE_tb %>%
51 | dplyr::filter(padj < padj.cutoff)
52 | ```
53 |
54 | ```r
55 | # Take a quick look at this tibble
56 | sigOE
57 | ```
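
One detail worth knowing about this subsetting: `filter()` silently drops rows where `padj` is `NA` (for example, genes removed by independent filtering). The sketch below uses a small made-up tibble to show this and to count significant genes by direction; `toy_res_tb` and the other `toy_*` names are hypothetical and not part of the workshop data.

```r
library(dplyr)
library(tibble)

# Made-up results table (illustrative only)
toy_res_tb <- tibble(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  log2FoldChange = c(1.2, -0.8, 0.3, 2.0),
  padj = c(0.001, 0.03, 0.2, NA)
)

padj_cutoff_toy <- 0.05

# geneC fails the threshold; geneD is dropped because its padj is NA
toy_sig <- toy_res_tb %>%
  dplyr::filter(padj < padj_cutoff_toy)

# Count significant genes by direction of change
toy_direction <- toy_sig %>%
  dplyr::mutate(direction = ifelse(log2FoldChange > 0, "up", "down")) %>%
  dplyr::count(direction)

toy_direction
```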
58 |
59 | ***
60 |
61 | **Exercise**
62 |
63 | **MOV10 Differential Expression Analysis: Control versus Knockdown**
64 |
65 | 1. Using the same p-adjusted threshold as above (`padj < 0.05`), subset `res_tableKD` to report the number of genes that are up- and down-regulated in Mov10_knockdown compared to control.
66 | 2. How many genes are differentially expressed in the Knockdown compared to Control? How does this compare to the overexpression significant gene list (in terms of numbers)?
67 |
68 | ***
69 |
70 |
71 | Now that we have extracted the significant results, we are ready for visualization!
72 |
73 |
74 |
75 | ---
76 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
77 |
78 | *Some materials and hands-on activities were adapted from [RNA-seq workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/#de) on the Bioconductor website*
79 |
80 | ***
81 |
--------------------------------------------------------------------------------
/lessons/06_DGE_visualizing_results.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Advanced visualizations"
3 | author: "Meeta Mistry, Mary Piper, Radhika Khetani"
4 | date: "December 4th, 2019"
5 | ---
6 |
7 | Approximate time: 75 minutes
8 |
9 | ## Learning Objectives
10 |
11 | * Set up results data for application of visualization techniques
12 | * Describe different data visualizations useful for exploring results from a DGE analysis
13 | * Create a volcano plot to evaluate relationship amongst DGE statistics
14 | * Create a heatmap to illustrate expression changes of differentially expressed genes
15 |
16 | ## Visualizing the results
17 |
18 | When we are working with large amounts of data it can be useful to display that information graphically to gain more insight. During this lesson, we will get you started with some basic and more advanced plots commonly used when exploring differential gene expression data; many of these plots can be helpful in visualizing other types of data as well.
19 |
20 | We will be working with three different data objects we have already created in earlier lessons:
21 |
22 | - Metadata for our samples (a dataframe): `meta`
23 | - Normalized expression data for every gene in each of our samples (a matrix): `normalized_counts`
24 | - Tibble versions of the DESeq2 results we generated in the last lesson: `res_tableOE_tb` and `res_tableKD_tb`
25 |
26 | First, let's create a metadata tibble from the data frame (don't lose the row names!)
27 |
28 | ```r
29 | mov10_meta <- meta %>%
30 | rownames_to_column(var="samplename") %>%
31 | as_tibble()
32 | ```
33 |
34 | Next, let's bring in a column with gene symbols to the `normalized_counts` object, so we can use them to label our plots. Ensembl IDs are great for many things, but the gene symbols are much more recognizable to us, as biologists.
35 |
36 | ```r
37 | # DESeq2 creates a matrix when you use the counts() function
38 | ## First convert normalized_counts to a data frame and transfer the row names to a new column called "gene"
39 | normalized_counts <- counts(dds, normalized=T) %>%
40 | data.frame() %>%
41 | rownames_to_column(var="gene")
42 |
43 | # Next, merge the normalized counts data frame (by Ensembl ID) with a subset of the annotations in the tx2gene data frame (only the columns for Ensembl gene IDs and gene symbols)
44 | grch38annot <- tx2gene %>%
45 | dplyr::select(ensgene, symbol) %>%
46 | dplyr::distinct()
47 |
48 | ## This will bring in a column of gene symbols
49 | normalized_counts <- merge(normalized_counts, grch38annot, by.x="gene", by.y="ensgene")
50 |
51 | # Now create a tibble for the normalized counts
52 | normalized_counts <- normalized_counts %>%
53 | as_tibble()
54 |
55 | normalized_counts
56 | ```
57 |
58 | >**NOTE:** A possible alternative to the above:
59 | >
60 | > ```r
61 | > normalized_counts <- counts(dds, normalized=T) %>%
62 | > data.frame() %>%
63 | > rownames_to_column(var="gene") %>%
64 | > as_tibble() %>%
65 | > left_join(grch38annot, by=c("gene" = "ensgene"))
66 | > ```
67 |
68 | ### Plotting significant DE genes
69 |
70 | One way to visualize results would be to simply plot the expression data for a handful of genes. We could do that by picking out specific genes of interest or selecting a range of genes.
71 |
72 | #### **Using DESeq2 `plotCounts()` to plot expression of a single gene**
73 |
74 | To pick out a specific gene of interest to plot, for example MOV10, we can use the `plotCounts()` function from DESeq2. `plotCounts()` requires that the gene specified matches the original input to DESeq2, which in our case was Ensembl IDs.
75 |
76 | ```r
77 | # Find the Ensembl ID of MOV10
78 | grch38annot[grch38annot$symbol == "MOV10", "ensgene"]
79 |
80 | # Plot expression for single gene
81 | plotCounts(dds, gene="ENSG00000155363", intgroup="sampletype")
82 | ```
83 |
84 |
85 |
86 | > This DESeq2 function only allows for plotting the counts of a single gene at a time, and is not flexible regarding the appearance.
87 |
88 | #### **Using ggplot2 to plot expression of a single gene**
89 |
90 | If you wish to change the appearance of this plot, we can save the output of `plotCounts()` to a variable specifying the `returnData=TRUE` argument, then use `ggplot()`:
91 |
92 | ```r
93 | # Save plotcounts to a data frame object
94 | d <- plotCounts(dds, gene="ENSG00000155363", intgroup="sampletype", returnData=TRUE)
95 |
96 | # What is the data output of plotCounts()?
97 | d %>% View()
98 |
99 | # Plot the MOV10 normalized counts, using the sample names (rownames(d)) as labels
100 | ggplot(d, aes(x = sampletype, y = count, color = sampletype)) +
101 | geom_point(position=position_jitter(w = 0.1,h = 0)) +
102 | geom_text_repel(aes(label = rownames(d))) +
103 | theme_bw() +
104 | ggtitle("MOV10") +
105 | theme(plot.title = element_text(hjust = 0.5))
106 | ```
107 |
108 | > Note that in the plot below (code above), we are using `geom_text_repel()` from the `ggrepel` package to label our individual points on the plot.
109 |
110 |
111 |
112 | > If you are interested in plotting the expression of multiple genes all together, please refer to [the short lesson linked here](top20_genes-expression_plotting.md) where we demo this for the top 20 most significantly differentially expressed genes.
113 |
114 | ### Heatmap
115 |
116 | In addition to plotting subsets, we could also extract the normalized values of *all* the significant genes and plot a heatmap of their expression using `pheatmap()`.
117 |
118 | ```r
119 | ### Extract normalized expression for significant genes from the OE and control samples (2:4 and 7:9)
120 | norm_OEsig <- normalized_counts[,c(1:4,7:9)] %>%
121 | dplyr::filter(gene %in% sigOE$gene)
122 | ```
123 |
124 | Now let's draw the heatmap using `pheatmap`:
125 |
126 | ```r
127 | ### Set a color palette
128 | heat_colors <- brewer.pal(6, "YlOrRd")
129 |
130 | ### Run pheatmap using the metadata data frame for the annotation
131 | pheatmap(norm_OEsig[2:7],
132 | color = heat_colors,
133 | cluster_rows = T,
134 | show_rownames = F,
135 | annotation = meta,
136 | border_color = NA,
137 | fontsize = 10,
138 | scale = "row",
139 | fontsize_row = 10,
140 | height = 20)
141 | ```
142 |
143 |
144 |
145 | > *NOTE:* There are several additional arguments we have included in the function for aesthetics. One important one is `scale="row"`, in which Z-scores are plotted, rather than the actual normalized count value.
146 | >
147 | > Z-scores are computed on a gene-by-gene basis by subtracting the mean and then dividing by the standard deviation. In `pheatmap()`, this row scaling is applied **before the clustering**, so it affects both the color visualization and the clustering of rows and columns.
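
To make the row-wise Z-score concrete, here is a minimal sketch on a made-up 2 x 3 matrix (not the MOV10 counts), using base R's `scale()`. The `toy_*` names are hypothetical.

```r
# Made-up expression matrix: 2 genes x 3 samples (illustrative only)
toy_counts <- matrix(c(1,  2,  3,
                       10, 20, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("geneA", "geneB"),
                                     c("s1", "s2", "s3")))

# scale() works on columns, so transpose, scale, and transpose back
# to get per-gene (row-wise) Z-scores
toy_z <- t(scale(t(toy_counts)))
toy_z
```

Both genes end up with the same Z-scores even though their raw values differ tenfold, which is why row scaling makes expression patterns comparable across genes in a heatmap.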
148 |
149 | ### Volcano plot
150 |
151 | The above plot is great for looking at the expression levels of a good number of genes, but for more of a global view there are other plots we can draw. A commonly used one is a volcano plot, in which the -log10 transformed adjusted p-values are plotted on the y-axis and the log2 fold change values on the x-axis.
152 |
153 | To generate a volcano plot, we first need a column in our results data indicating whether or not the gene is considered differentially expressed based on p-adjusted values; we will also include a log2 fold change threshold here.
154 |
155 | ```r
156 | ## Obtain logical vector where TRUE values denote padj values < 0.05 and fold change > 1.5 in either direction (log2(1.5) is approximately 0.58)
157 |
158 | res_tableOE_tb <- res_tableOE_tb %>%
159 | dplyr::mutate(threshold_OE = padj < 0.05 & abs(log2FoldChange) >= 0.58)
160 | ```
161 |
162 | Now we can start plotting. The `geom_point` object is most applicable, as this is essentially a scatter plot:
163 |
164 | ```r
165 | ## Volcano plot
166 | ggplot(res_tableOE_tb) +
167 | geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = threshold_OE)) +
168 | ggtitle("Mov10 overexpression") +
169 | xlab("log2 fold change") +
170 | ylab("-log10 adjusted p-value") +
171 | #scale_y_continuous(limits = c(0,50)) +
172 | theme(legend.position = "none",
173 | plot.title = element_text(size = rel(1.5), hjust = 0.5),
174 | axis.title = element_text(size = rel(1.25)))
175 | ```
176 |
177 |
178 |
179 | This is a great way to get an overall picture of what is going on, but what if we also wanted to know where the top 10 genes (lowest padj) in our DE list are located on this plot? We could label those dots with the gene name on the Volcano plot using `geom_text_repel()`.
180 |
181 | First, we need to order the `res_tableOE_tb` tibble by `padj`, and add an additional column to it that includes only those gene names we want to use to label the plot.
182 |
183 | ```r
184 | ## Add all the gene symbols as a column from the grch38 table using bind_cols()
185 | res_tableOE_tb <- bind_cols(res_tableOE_tb, symbol=grch38annot$symbol[match(res_tableOE_tb$gene, grch38annot$ensgene)])
186 |
187 | ## Create an empty column to indicate which genes to label
188 | res_tableOE_tb <- res_tableOE_tb %>% dplyr::mutate(genelabels = "")
189 |
190 | ## Sort by padj values
191 | res_tableOE_tb <- res_tableOE_tb %>% dplyr::arrange(padj)
192 |
193 | ## Populate the genelabels column with contents of the gene symbols column for the first 10 rows, i.e. the top 10 most significant genes
194 | res_tableOE_tb$genelabels[1:10] <- as.character(res_tableOE_tb$symbol[1:10])
195 |
196 | View(res_tableOE_tb)
197 | ```
198 |
199 | Next, we plot it as before with an additional layer for `geom_text_repel()` wherein we can specify the column of gene labels we just created.
200 |
201 | ```r
202 | ggplot(res_tableOE_tb, aes(x = log2FoldChange, y = -log10(padj))) +
203 | geom_point(aes(colour = threshold_OE)) +
204 | geom_text_repel(aes(label = genelabels)) +
205 | ggtitle("Mov10 overexpression") +
206 | xlab("log2 fold change") +
207 | ylab("-log10 adjusted p-value") +
208 | theme(legend.position = "none",
209 | plot.title = element_text(size = rel(1.5), hjust = 0.5),
210 | axis.title = element_text(size = rel(1.25)))
211 | ```
212 |
213 |
214 |
215 | ***
216 |
217 | > ### An R package for visualization of DGE results
218 | >
219 | > The Bioconductor package [`DEGreport`](https://bioconductor.org/packages/release/bioc/html/DEGreport.html) can use the DESeq2 results output to make the top20 genes and the volcano plots generated above with far fewer lines of code. The caveat of these functions is that you lose the ability to customize plots as we have demonstrated above.
220 | >
221 | > If you are interested, the example code below shows how you can use DEGreport to create similar plots. **Note that this is example code, do not run.**
222 |
223 | > ```r
224 | > DEGreport::degPlot(dds = dds, res = res, n = 20, xs = "type", group = "condition") # dds object is output from DESeq2
225 | >
226 | > DEGreport::degVolcano(
227 | > data.frame(res[,c("log2FoldChange","padj")]), # table - 2 columns
228 | > plot_text = data.frame(res[1:10,c("log2FoldChange","padj","id")])) # table to add names
229 | >
230 | > # Available in the newer version for R 3.4
231 | > DEGreport::degPlotWide(dds = dds, genes = row.names(res)[1:5], group = "condition")
232 | > ```
233 |
234 |
235 | ***
236 |
237 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
238 |
239 | * *Materials and hands-on activities were adapted from [RNA-seq workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/#de) on the Bioconductor website*
240 |
--------------------------------------------------------------------------------
/lessons/07_DGE_summarizing_workflow.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Summary of DGE workflow"
3 | author: "Mary Piper"
4 | date: "June 8, 2017"
5 | ---
6 |
7 | Approximate time: 15 minutes
8 |
9 | ## Learning Objectives
10 |
11 | * Identify the R commands needed to run a complete differential expression analysis using DESeq2
12 |
13 | ## Summary of differential expression analysis workflow
14 |
15 | We have detailed the various steps in a differential expression analysis workflow, providing theory with example code. To provide a more succinct reference for the code needed to run a DGE analysis, we have summarized the steps in an analysis below:
16 |
17 | 1. Obtaining gene-level counts from Salmon using tximport
18 |
19 | ```r
20 | # Run tximport
21 | txi <- tximport(files,
22 | type="salmon",
23 | tx2gene=t2g,
24 | countsFromAbundance = "lengthScaledTPM")
25 |
26 | # "files" is a vector wherein each element is the path to the salmon quant.sf file, and each element is named with the name of the sample.
27 | # "t2g" is a 2 column data frame which contains transcript IDs mapped to geneIDs (in that order)
28 | ```
29 |
30 | 2. Creating the dds object:
31 |
32 | ```r
33 | # Check that the row names of the metadata equal the column names of the **raw counts** data
34 | all(colnames(txi$counts) == rownames(metadata))
35 |
36 | # Create DESeq2Dataset object
37 | dds <- DESeqDataSetFromTximport(txi,
38 | colData = metadata,
39 | design = ~ condition)
40 | ```
41 |
42 | 3. Exploratory data analysis (PCA & hierarchical clustering) - identifying outliers and sources of variation in the data:
43 |
44 | ```r
45 | # Transform counts for data visualization
46 | rld <- rlog(dds,
47 | blind=TRUE)
48 |
49 | # Plot PCA
50 | plotPCA(rld,
51 | intgroup="condition")
52 |
53 | # Extract the rlog matrix from the object and compute pairwise correlation values
54 | rld_mat <- assay(rld)
55 | rld_cor <- cor(rld_mat)
56 |
57 | # Plot heatmap
58 | pheatmap(rld_cor,
59 | annotation = metadata)
60 | ```
61 |
62 | 4. Run DESeq2:
63 |
64 | ```r
65 | # **Optional step** - Re-create DESeq2 dataset if the design formula has changed after QC analysis to include other sources of variation using "dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ covariate + condition)"
66 |
67 | # Run DESeq2 differential expression analysis
68 | dds <- DESeq(dds)
69 |
70 | # **Optional step** - Output normalized counts to save as a file to access outside RStudio using "normalized_counts <- counts(dds, normalized=TRUE)"
71 | ```
72 |
73 | 5. Check the fit of the dispersion estimates:
74 |
75 | ```r
76 | # Plot dispersion estimates
77 | plotDispEsts(dds)
78 | ```
79 |
80 | 6. Create contrasts to perform Wald testing on the shrunken log2 foldchanges between specific conditions:
81 |
82 | ```r
83 | # Specify contrast for comparison of interest
84 | contrast <- c("condition", "level_to_compare", "base_level")
85 |
86 | # Output results of Wald test for contrast
87 | res <- results(dds,
88 | contrast = contrast,
89 | alpha = 0.05)
90 |
91 | # Shrink the log2 fold changes to be more accurate
92 | res <- lfcShrink(dds,
93 | coef = "condition_level_to_compare_vs_base_level",
94 | type = "apeglm")
95 | # The coef depends on what your contrast was, and should be identical to one of the names stored in resultsNames(dds)
96 | ```
97 |
98 | 7. Output significant results:
99 |
100 | ```r
101 | # Set thresholds
102 | padj.cutoff <- 0.05
103 |
104 | # Turn the results object into a tibble for use with tidyverse functions
105 | res_tbl <- res %>%
106 | data.frame() %>%
107 | rownames_to_column(var="gene") %>%
108 | as_tibble()
109 |
110 | # Subset the significant results
111 | sig_res <- dplyr::filter(res_tbl,
112 | padj < padj.cutoff)
113 | ```
114 |
115 | 8. Visualize results: volcano plots, heatmaps, normalized counts plots of top genes, etc.
116 |
117 | 9. Perform analysis to extract functional significance of results: GO or KEGG enrichment, GSEA, etc.
118 |
119 | 10. Make sure to output the versions of all tools used in the DE analysis:
120 |
121 | ```r
122 | sessionInfo()
123 | ```
124 |
125 | For better reproducibility, it can help to create **RMarkdown reports**, which save all code, results, and visualizations as nicely formatted html reports. We have a very basic example of a report [linked here](https://www.dropbox.com/s/4bq0chxze6dogba/workshop-example.html?dl=0). To create these reports we have [additional materials](https://hbctraining.github.io/Training-modules/Rmarkdown/) available.
126 |
--------------------------------------------------------------------------------
/lessons/08a_DGE_LRT_results.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "DGE analysis using LRT in DESeq2"
3 | author: "Meeta Mistry and Mary Piper"
4 | date: "June 14, 2017"
5 | ---
6 |
7 | Approximate time: 60 minutes
8 |
9 | ## Learning Objectives
10 |
11 | * Apply the Likelihood Ratio Test (LRT) for hypothesis testing
12 | * Compare results generated from the LRT to results obtained using the Wald test
13 | * Identify shared expression profiles from the LRT significant gene list
14 |
15 |
16 | ## Exploring results from the Likelihood ratio test (LRT)
17 |
18 | DESeq2 also offers the Likelihood Ratio Test as an alternative **when evaluating expression change across more than two levels**. Genes identified as significant are those that are changing in expression in any direction across the different factor levels.
19 |
20 | Generally, this test will result in a larger number of genes than the individual pair-wise comparisons. While the LRT is a test of significance for differences of any level(s) of the factor, one should not expect it to be exactly equal to the union of sets of genes using Wald tests (although we do expect a high degree of overlap).
21 |
22 | ## The `results()` table
23 |
24 | To extract the results from our `dds_lrt` object we can use the same `results()` function we had used with the Wald test. _There is no need for contrasts since we are not making a pair-wise comparison._
25 |
26 | > **NOTE:** In an [earlier lesson](05a_hypothesis_testing.md#likelihood-ratio-test-lrt) on hypothesis testing, we had you create the object `dds_lrt`. If you are **having trouble finding the object**, please run the code: `dds_lrt <- DESeq(dds, test="LRT", reduced = ~ 1)`
27 |
28 | ```r
29 | # Extract results for LRT
30 | res_LRT <- results(dds_lrt)
31 | ```
32 |
33 | Let's take a look at the results table:
34 |
35 | ```r
36 | # View results for LRT
37 | res_LRT
38 | ```
39 |
40 | ```
41 | log2 fold change (MLE): sampletype MOV10 overexpression vs control
42 | LRT p-value: '~ sampletype' vs '~ 1'
43 | DataFrame with 57761 rows and 6 columns
44 | baseMean log2FoldChange lfcSE stat pvalue padj
45 |
46 | ENSG00000000003 3525.88347786355 -0.438245423329571 0.0774607246185232 40.4611749305021 1.63669402960044e-09 3.14070461117016e-08
47 | ENSG00000000005 26.2489043110535 0.0292079869376247 0.441128912409409 1.61898836146221 0.445083140923522 0.58866891597654
48 | ENSG00000000419 1478.25124052691 0.383635036119846 0.113760957175207 11.3410110249776 0.00344612277761083 0.0122924772964227
49 | ENSG00000000457 518.42202383345 0.228970583496456 0.102331174090148 14.6313920603898 0.000665018279181725 0.00304543241149833
50 | ENSG00000000460 1159.77613645835 -0.269138203013482 0.0814992499897986 25.0394477225533 3.65386933066256e-06 3.23415706764646e-05
51 | ... ... ... ... ... ...
52 | ```
53 |
54 | The results table output looks similar to the Wald test results, with identical columns to what we observed previously.
55 |
56 | ### Why are fold changes reported for an LRT test?
57 |
58 | For analyses using the likelihood ratio test, the p-values are determined solely by the difference in deviance between the full and reduced model formula. **A single log2 fold change is printed in the results table for consistency with other results table outputs, but is not associated with the actual test.**
59 |
60 | **Columns relevant to the LRT test:**
61 |
62 | * `baseMean`: mean of normalized counts for all samples
63 | * `stat`: the difference in deviance between the reduced model and the full model
64 | * `pvalue`: the stat value is compared to a chi-squared distribution to generate a pvalue
65 | * `padj`: BH adjusted p-values
66 |
67 | **Additional columns:**
68 |
69 | * `log2FoldChange`: log2 fold change
70 | * `lfcSE`: standard error
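
As a check on how `stat` and `pvalue` relate, the sketch below recomputes p-values from the first two `stat` values printed in the table above. It assumes 2 degrees of freedom, because the full model `~ sampletype` has three levels and therefore two more coefficients than the reduced model `~ 1`. The `*_example` names are illustrative.

```r
# `stat` values copied from the first two rows of the results table above
stat_example <- c(ENSG00000000003 = 40.4611749305021,
                  ENSG00000000005 = 1.61898836146221)

# Assumed degrees of freedom: 3 sampletype levels minus the intercept-only model
df_example <- 2

# Upper tail of the chi-squared distribution gives the LRT p-value
pvalue_example <- pchisq(stat_example, df = df_example, lower.tail = FALSE)
pvalue_example
```

The results (approximately 1.64e-09 and 0.445) agree with the `pvalue` column of the table.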
71 |
72 | > **NOTE:** Printed at the top of the results table are the two sample groups used to generate the log2 fold change values that we observe in the results table. This can be controlled using the `name` argument; the value provided to `name` must be an element of `resultsNames(dds)`.
73 |
74 | ## Identifying significant genes
75 |
76 | When filtering significant genes from the LRT we threshold only the `padj` column. _How many genes are significant at `padj < 0.05`?_
77 |
78 | ```r
79 | # Create a tibble for LRT results
80 | res_LRT_tb <- res_LRT %>%
81 | data.frame() %>%
82 | rownames_to_column(var="gene") %>%
83 | as_tibble()
84 |
85 | # Subset to return genes with padj < 0.05
86 | sigLRT_genes <- res_LRT_tb %>%
87 | dplyr::filter(padj < padj.cutoff)
88 |
89 | # Get number of significant genes
90 | nrow(sigLRT_genes)
91 |
92 | # Compare to numbers we had from Wald test
93 | nrow(sigOE)
94 | nrow(sigKD)
95 |
96 | ```
97 |
98 | The number of significant genes observed from the LRT is quite high. This list includes genes that can be changing in any direction across the three factor levels (control, knockdown, overexpression). To reduce the number of significant genes, we can increase the stringency of our FDR threshold (`padj.cutoff`).
99 |
100 | ***
101 |
102 | **Exercise**
103 |
104 | 1. Compare the resulting gene list from the LRT test to the gene lists from the Wald test comparisons.
105 | 1. How many of the `sigLRT_genes` overlap with the significant genes in `sigOE`?
106 | 1. How many of the `sigLRT_genes` overlap with the significant genes in `sigKD`?
107 |
108 | ***
109 |
110 | ## Identifying clusters of genes with shared expression profiles
111 |
112 | We now have this list of ~7K significant genes that we know are changing in some way across the three different sample groups. What do we do next?
113 |
114 | A good next step is to identify groups of genes that share a pattern of expression change across the sample groups (levels). To do this we will be using a clustering tool called `degPatterns` from the 'DEGreport' package. The `degPatterns` tool uses a **hierarchical clustering approach based on pair-wise correlations** between genes, then cuts the hierarchical tree to generate groups of genes with similar expression profiles. The tool cuts the tree in a way that optimizes the diversity of the clusters, such that the variability between clusters is greater than the variability within clusters.
115 |
116 | Before we begin clustering, we will **first subset our rlog transformed normalized counts** to retain only the differentially expressed genes (padj < 0.05). In our case, it may take some time to run the clustering on 7K genes, and so for class demonstration purposes we will subset to keep only the top 1000 genes sorted by p-adjusted value.
117 |
118 | > #### Where do I get rlog transformed counts?
119 | > This rlog transformation was applied in an [earlier lesson](03_DGE_QC_analysis.md#transform-normalized-counts-for-the-mov10-dataset) when we performed QC analysis. If you **do not see this in your environment**, run the following code:
120 | >
121 | > ```
122 | > ### Transform counts for data visualization
123 | > rld <- rlog(dds, blind=TRUE)
124 | > rld_mat <- assay(rld)
125 | > ```
126 | >
127 |
128 | ```r
129 | # Subset results for faster cluster finding (for classroom demo purposes)
130 | clustering_sig_genes <- sigLRT_genes %>%
131 | arrange(padj) %>%
132 | head(n=1000)
133 |
134 |
135 | # Obtain rlog values for those significant genes
136 | cluster_rlog <- rld_mat[clustering_sig_genes$gene, ]
137 | ```
138 |
139 | The rlog transformed counts for the significant genes are input to `degPatterns` along with a few additional arguments:
140 |
141 | * `metadata`: the metadata dataframe that corresponds to samples
142 | * `time`: character column name in metadata that will be used as variable that changes
143 | * `col`: character column name in metadata to separate samples
144 |
145 | ```r
146 | # Use the `degPatterns` function from the 'DEGreport' package to show gene clusters across sample groups
147 | clusters <- degPatterns(cluster_rlog, metadata = meta, time = "sampletype", col=NULL)
148 | ```
149 |
150 | Once the clustering has finished running, you will get your command prompt back in the console and you should see a figure appear in your plot window. The genes have been clustered into four different groups. For each group of genes, we have a boxplot illustrating expression change across the different sample groups. A line graph is overlaid to illustrate the trend in expression change.
151 |
152 |
153 |
154 |
155 |
156 |
157 |
158 | Suppose we are interested in the genes which show a decreased expression in the knockdown samples and increase in the overexpression. According to the plot there are 275 genes which share this expression profile. To find out what these genes are let's explore the output. What type of data structure is the `clusters` output?
159 |
160 | ```r
161 | # What type of data structure is the `clusters` output?
162 | class(clusters)
163 | ```
164 | We can see what objects are stored in the list by using `names(clusters)`. There is a dataframe stored inside. This is the main result so let's take a look at it. The first column contains the genes, and the second column contains the cluster number to which they belong.
165 |
166 | ```r
167 | # Let's see what is stored in the `df` component
168 | head(clusters$df)
169 | ```
170 |
171 | Since we are interested in Group 1, we can filter the dataframe to keep only those genes:
172 |
173 | ```r
174 | # Extract the Group 1 genes
175 | group1 <- clusters$df %>%
176 |   dplyr::filter(cluster == 1)
177 | ```
178 |
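As a rough illustration of how a gene-to-cluster table can be summarized, the sketch below uses a small hand-made data frame in place of `clusters$df`. The column names `genes` and `cluster` and all values are assumptions for this toy example; check `head(clusters$df)` for the real column names.

```r
# Toy stand-in for `clusters$df` (hypothetical gene IDs and cluster labels)
toy_df <- data.frame(
  genes   = c("gene1", "gene2", "gene3", "gene4", "gene5"),
  cluster = c(1, 2, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Number of genes assigned to each cluster
cluster_sizes <- table(toy_df$cluster)
cluster_sizes

# Gene IDs belonging to cluster 1, as a character vector
toy_group1_genes <- toy_df$genes[toy_df$cluster == 1]
toy_group1_genes
```

A plain character vector of gene IDs like `toy_group1_genes` is usually the most convenient form to hand to annotation or functional analysis tools.
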
179 | After extracting a group of genes, we can use annotation packages to obtain additional information. We can also use these lists of genes as input to downstream functional analysis tools to obtain more biological insight and see whether the groups of genes share a specific function.
180 |
181 |
182 | ---
183 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
184 |
185 | * *Materials and hands-on activities were adapted from [RNA-seq workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/#de) on the Bioconductor website*
186 |
187 |
--------------------------------------------------------------------------------
/lessons/08b_time_course_analyses.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Time course analysis with DESeq2"
3 | author: "Meeta Mistry and Mary Piper"
4 | date: "June 17, 2020"
5 | ---
6 |
7 | Approximate time: 20 minutes
8 |
9 | ## Learning Objectives
10 |
11 | * Discuss time course analyses with DESeq2
12 |
13 | ## Time course analyses with LRT
14 |
15 | Despite the popularity of static measurement of gene expression, time-course capturing of biological processes is essential to reflect their dynamic nature, particularly when patterns are complex and are not simply ascending or descending. When working with this type of data, the Likelihood Ratio Test (LRT) is especially helpful. We can use the LRT to explore whether there are any significant differences across a series of timepoints and further evaluate differences observed between sample classes.
16 |
17 | For example, suppose we have an experiment looking at the effect of treatment over time on mice of two different genotypes. We could use a design formula for our 'full model' that would include the major sources of variation in our data: `genotype`, `treatment`, `time`, and our main condition of interest, which is the difference in the effect of treatment over time (`treatment:time`).
18 |
19 | > **NOTE:** This is just example code for our hypothetical experiment. You **should not run this code**.
20 |
21 | ```r
22 | ## DO NOT RUN
23 |
24 | full_model <- ~ genotype + treatment + time + treatment:time
25 | ```
26 |
27 | To perform the LRT, we also need to provide a reduced model, that is, the full model without the `treatment:time` term:
28 |
29 | ```r
30 | ## DO NOT RUN
31 |
32 | reduced_model <- ~ genotype + treatment + time
33 | ```
34 |
35 | Then, we could run the LRT by using the following code:
36 |
37 | ```r
38 | ## DO NOT RUN
39 |
40 | dds <- DESeqDataSetFromMatrix(countData = raw_counts, colData = metadata, design = ~ genotype + treatment + time + treatment:time)
41 |
42 | dds_lrt_time <- DESeq(dds, test="LRT", reduced = ~ genotype + treatment + time)
43 | ```
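
The logic of comparing a full and a reduced model can be sketched for a single gene with base R. This is only an illustration of a likelihood ratio test on nested models: it uses a Poisson GLM from `stats::glm()` rather than the negative binomial model that DESeq2 fits, and the counts, factor levels and object names are all made up.

```r
# Hypothetical counts for one gene: 2 treatments x 4 time points x 2 replicates
toy <- data.frame(
  count = c(10, 11, 20, 22, 40, 38, 80, 83,   # control: rises over time
            10, 9, 12, 11, 9, 10, 11, 12),    # treated: stays flat
  treatment = factor(rep(c("control", "treated"), each = 8)),
  time = factor(rep(rep(1:4, each = 2), times = 2))
)

# Full model includes the treatment:time interaction; reduced model drops it
full_fit    <- glm(count ~ treatment + time + treatment:time,
                   family = poisson, data = toy)
reduced_fit <- glm(count ~ treatment + time,
                   family = poisson, data = toy)

# Likelihood ratio statistic = difference in deviance between nested models
lr_stat   <- deviance(reduced_fit) - deviance(full_fit)
lr_df     <- df.residual(reduced_fit) - df.residual(full_fit)
lr_pvalue <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
lr_pvalue
```

Because the time profile differs between the two treatments in this toy gene, dropping the interaction term makes the fit much worse and the p-value is small. A gene whose time profile is the same in both groups would give a large p-value.
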
44 | To understand what kind of gene expression patterns will be identified as differentially expressed, we have a few examples below. In the plots below we have Time on the x-axis and gene expression on the y-axis. In this dataset there are two samples for each time point, one having undergone some treatment (red) and the other without (blue).
45 |
46 | For this figure, we are depicting the type of **genes that will not be identified as differentially expressed.** Here, we observe that GeneX is differentially expressed between the time points; however, there is no difference in that expression pattern between the treatment groups.
47 |
48 |
49 |
50 |
51 |
52 | The type of **gene expression patterns we do expect** the LRT to return are those that exhibit differences in the effect of treatment over time. In the example below, GeneX displays a different expression pattern over time for the two treatment groups.
53 |
54 |
55 |
56 |
57 |
58 | Continuing with our example dataset, after running the LRT we can determine the set of significant genes using a threshold of `padj` < 0.05. The next step would be to sort those genes into groups based on shared expression patterns, and we could do this using `degPatterns()`. Here, you will notice that we make use of the `col` argument since we have two groups that we are comparing to one another.
59 |
60 | ```r
61 | clusters <- degPatterns(cluster_rlog, metadata = meta, time="time", col="treatment")
62 | ```
63 |
64 | Depending on what type of shared expression profiles exist in your data, you can then extract the groups of genes associated with the patterns of interest and move on to functional analysis for each of the gene groups of interest.
65 |
66 |
67 | ---
71 |
--------------------------------------------------------------------------------
/lessons/AnnotationDbi_lesson.md:
--------------------------------------------------------------------------------
1 | ## AnnotationDbi
2 |
3 | AnnotationDbi is an R package that provides an interface for connecting to and querying various annotation databases using SQLite data storage. The AnnotationDbi packages can query the *OrgDb*, *TxDb*, *EnsDb*, *GO.db*, and *BioMart* annotations. There is helpful [documentation](https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf) available to reference when extracting data from any of these databases.
4 |
5 | ### org.Hs.eg.db
6 |
7 | There is a plethora of organism-specific *OrgDb* packages, such as `org.Hs.eg.db` for human and `org.Mm.eg.db` for mouse, and a list of organism databases can be found [here](https://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb). These databases are best for converting gene IDs or obtaining GO information for current genome builds, but not for older genome builds. These packages provide the current builds corresponding to the release date of the package, and update every 6 months. If a package is not available for your organism of interest, you can create your own using *AnnotationHub*.
8 |
9 | ```r
10 | # Load libraries
11 | library(org.Hs.eg.db)
12 | library(AnnotationDbi)
13 |
14 | # Check object metadata
15 | org.Hs.eg.db
16 | ```
17 |
18 | We can see the metadata for the database by just typing the name of the database, including the species, last updates for the different source information, and the source URLs. Note that the KEGG data from this database was last updated in 2011, so it may not be the best source for KEGG pathway information.
19 |
20 | ```r
21 | OrgDb object:
22 | | DBSCHEMAVERSION: 2.1
23 | | Db type: OrgDb
24 | | Supporting package: AnnotationDbi
25 | | DBSCHEMA: HUMAN_DB
26 | | ORGANISM: Homo sapiens
27 | | SPECIES: Human
28 | | EGSOURCEDATE: 2018-Oct11
29 | | EGSOURCENAME: Entrez Gene
30 | | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
31 | | CENTRALID: EG
32 | | TAXID: 9606
33 | | GOSOURCENAME: Gene Ontology
34 | | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
35 | | GOSOURCEDATE: 2018-Oct10
36 | | GOEGSOURCEDATE: 2018-Oct11
37 | | GOEGSOURCENAME: Entrez Gene
38 | | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
39 | | KEGGSOURCENAME: KEGG GENOME
40 | | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
41 | | KEGGSOURCEDATE: 2011-Mar15
42 | | GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
43 | | GPSOURCEURL:
44 | | GPSOURCEDATE: 2018-Oct2
45 | | ENSOURCEDATE: 2018-Oct05
46 | | ENSOURCENAME: Ensembl
47 | | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
48 | | UPSOURCENAME: Uniprot
49 | | UPSOURCEURL: http://www.UniProt.org/
50 | | UPSOURCEDATE: Thu Oct 18 05:22:10 2018
51 | ```
52 |
53 | We can easily extract information from this database using *AnnotationDbi* with the methods: `columns`, `keys`, `keytypes`, and `select`. For example, we will use our `org.Hs.eg.db` database to acquire information, but know that the same methods work for the *TxDb*, *GO.db*, *EnsDb*, and *BioMart* annotations.
54 |
55 | ```r
56 | # Return the gene symbol, Entrez ID and gene name for a set of Ensembl gene IDs
57 | annotations_orgDb <- AnnotationDbi::select(org.Hs.eg.db, # database
58 |                                            keys = res_tableOE_tb$gene,  # data to use for retrieval
59 |                                            columns = c("SYMBOL", "ENTREZID","GENENAME"), # information to retrieve for given data
60 |                                            keytype = "ENSEMBL") # type of data given in 'keys' argument
61 | ```
62 |
63 | We started with about 57K genes in our results table, and the dimensions of our resulting annotation data frame also look quite similar. Let's take a peek to see if we actually returned annotations for each individual Ensembl gene ID that went into the query:
64 |
65 | ```r
66 | length(which(is.na(annotations_orgDb$SYMBOL)))
67 | ```
68 |
69 | Looks like more than half of the input genes did not return any annotations. This is because the OrgDb family of databases is primarily based on mapping using Entrez Gene identifiers. If you look at some of the Ensembl IDs from our query that returned NA, these map to pseudogenes (e.g., [ENSG00000265439](https://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000265439;r=6:44209766-44210063;t=ENST00000580735)) or non-coding RNAs (e.g., [ENSG00000265425](http://useast.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000265425;r=18:68427030-68436918;t=ENST00000577835)). The difference is due to the fact that each database implements different computational approaches for generating the gene builds. Let's get rid of those NA entries:
70 |
71 | ```r
72 | # Determine the indices for the non-NA genes
73 | non_na_idx <- which(is.na(annotations_orgDb$SYMBOL) == FALSE)
74 |
75 | # Return only the genes with annotations using indices
76 | annotations_orgDb <- annotations_orgDb[non_na_idx, ]
77 | ```
78 |
79 | You may have also noted the *warning* returned: *'select()' returned 1:many mapping between keys and columns*. This is always going to happen when converting between different gene IDs (e.g., one gene ID can map to more than one identifier in another database). Unless we would like to keep multiple mappings for a single gene, we probably want to de-duplicate our data before using it.
80 |
81 | ```r
82 | # Determine the indices for the non-duplicated genes
83 | non_duplicates_idx <- which(duplicated(annotations_orgDb$SYMBOL) == FALSE)
84 |
85 | # Return only the non-duplicated genes using indices
86 | annotations_orgDb <- annotations_orgDb[non_duplicates_idx, ]
87 | ```
88 |
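The two clean-up steps above (dropping `NA` symbols, then dropping repeated symbols) can be tried out on a tiny hand-made table. This is a sketch only: the IDs and symbols below are placeholders, not real Ensembl or HGNC identifiers, and logical indexing with `!is.na()` and `!duplicated()` is used as an equivalent, shorter form of the `which(... == FALSE)` pattern.

```r
# Toy annotation table with one missing symbol and one 1:many mapping
toy_annot <- data.frame(
  ENSEMBL = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_C", "ENSG_D"),
  SYMBOL  = c("GENE1", NA, "GENE2", "GENE2", "GENE3"),
  stringsAsFactors = FALSE
)

# Keep only rows that have a gene symbol
toy_annot <- toy_annot[!is.na(toy_annot$SYMBOL), ]

# Keep only the first row for each gene symbol
toy_annot <- toy_annot[!duplicated(toy_annot$SYMBOL), ]

toy_annot
```

Note that `duplicated()` keeps the first occurrence of each symbol, so the order of the rows determines which of several mappings survives.
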
89 | ### EnsDb.Hsapiens.v86
90 |
91 | To generate the Ensembl annotations, the *EnsDb* database can also be easily queried using AnnotationDbi. You will need to decide the release of Ensembl you would like to query. We know that our data is for GRCh38, and the most current *EnsDb* release for GRCh38 in Bioconductor is release 86, so we can install this database. All Ensembl releases are listed [here](http://useast.ensembl.org/info/website/archives/index.html). **NOTE: this is not the most current release of GRCh38 in the Ensembl database, but it's as current as we can obtain through AnnotationDbi.**
92 |
93 | Since we are using *AnnotationDbi* to query the database, we can use the same functions that we used previously:
94 |
95 | ```r
96 | # Load the library
97 | library(EnsDb.Hsapiens.v86)
98 |
99 | # Check object metadata
100 | EnsDb.Hsapiens.v86
101 |
102 | # Explore the fields that can be used as keys
103 | keytypes(EnsDb.Hsapiens.v86)
104 | ```
105 |
106 | Now we can return all gene IDs for our gene list:
107 |
108 | ```r
109 | # Return the gene symbol, Entrez ID and gene biotype for a set of Ensembl gene IDs
110 | annotations_edb <- AnnotationDbi::select(EnsDb.Hsapiens.v86,
111 |                                          keys = res_tableOE_tb$gene,
112 |                                          columns = c("SYMBOL", "ENTREZID","GENEBIOTYPE"),
113 |                                          keytype = "GENEID")
114 | ```
115 |
116 | We can check for NA entries, and find that there are none:
117 |
118 | ```r
119 | length(which(is.na(annotations_edb$SYMBOL)))
120 | ```
121 |
122 | Then we can again deduplicate, to remove the gene symbols which appear more than once:
123 |
124 | ```r
125 | # Determine the indices for the non-duplicated genes
126 | non_duplicates_idx <- which(duplicated(annotations_edb$SYMBOL) == FALSE)
127 |
128 | # Return only the non-duplicated genes using indices
129 | annotations_edb <- annotations_edb[non_duplicates_idx, ]
130 | ```
131 |
132 | > **NOTE:** In this case we used the same build but a slightly older release, and we found little discrepancy. If your analysis was conducted using an older genome build (e.g., hg19) but you used a newer build for annotation, some genes may be found to be not annotated (NA). Some of the genes have changed names between versions (due to updates and patches), so they may not be present in the newer version of the database.
133 |
134 | ---
136 |
--------------------------------------------------------------------------------
/lessons/DGE_visualizing_results_archived.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "DEG visualization of results"
3 | author: "Meeta Mistry"
4 | date: "October 17, 2016"
5 | ---
6 |
7 | Approximate time: 75 minutes
8 |
9 | ## Learning Objectives
10 |
11 | * Exploring our significant genes using data visualization
12 | * Using volcano plots to evaluate relationships between DEG statistics
13 | * Plotting expression of significant genes using heatmaps
14 |
15 | ## Extracting significant genes
16 |
17 | The FDR threshold on its own doesn't appear to be reducing the number of significant genes. With large significant gene lists it can be hard to extract meaningful biological relevance. To help increase stringency, one can also **add a fold change threshold**. The `summary()` function doesn't have an argument for a fold change threshold, so we will apply one ourselves.
18 |
19 | > *NOTE:* the `results()` function does have an option to add a fold change threshold and subset the data this way. Take a look at the help manual using `?results` and see what argument would be required. However, rather than subsetting the results, we want to return the whole dataset and simply identify which genes meet our criteria.
20 |
21 | Let's first create variables that contain our threshold criteria:
22 |
23 | ```r
24 | ### Set thresholds
25 | padj.cutoff <- 0.05
26 | lfc.cutoff <- 0.58
27 | ```
28 |
29 | The `lfc.cutoff` is set to 0.58; remember that we are working with log2 fold changes, so this translates to an actual fold change of approximately 1.5 (2^0.58 ≈ 1.5), which is pretty reasonable. Let's create a vector that helps us identify the genes that meet our criteria:
30 |
31 | ```r
32 | threshold <- res_tableOE$padj < padj.cutoff & abs(res_tableOE$log2FoldChange) > lfc.cutoff
33 | ```
34 |
35 | We now have a logical vector of values that has a length which is equal to the total number of genes in the dataset. The elements that have a `TRUE` value correspond to genes that meet the criteria (and `FALSE` means it fails). **How many genes are differentially expressed in the Overexpression compared to Control, given our criteria specified above?** Does this reduce our results?
36 |
37 | ```r
38 | length(which(threshold))
39 | ```
40 |
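The thresholding step can be checked on a toy table. This sketch uses made-up values and hypothetical object names (`toy_res`, `toy_threshold`); it also shows why `length(which(...))` is used for counting: DESeq2 results can contain `NA` adjusted p-values (for example, for genes removed by independent filtering), and `which()` silently skips them.

```r
# Toy results table (made-up values); the third gene has no adjusted p-value
toy_res <- data.frame(
  padj           = c(0.001, 0.2, NA, 0.03, 0.04),
  log2FoldChange = c(1.2, 2.0, 3.0, -0.8, 0.1)
)

padj_cutoff_toy <- 0.05
lfc_cutoff_toy  <- 0.58

# A log2 fold change of 0.58 corresponds to a fold change of roughly 1.5
2^lfc_cutoff_toy

# Logical vector: significant AND large enough change in either direction
toy_threshold <- toy_res$padj < padj_cutoff_toy &
  abs(toy_res$log2FoldChange) > lfc_cutoff_toy
toy_threshold

# Count the genes that pass; NA entries are ignored by which()
n_pass <- length(which(toy_threshold))
n_pass
```

Using `sum(toy_threshold)` instead would return `NA` here unless `na.rm = TRUE` is supplied.
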
41 | To add this vector to our results table we can use the `$` notation to create the column on the left hand side of the assignment operator, and then assign the vector to it instead of using `cbind()`:
42 |
43 | ```r
44 | res_tableOE$threshold <- threshold
45 | ```
46 |
47 | Now we can easily subset the results table to only include those that are significant using the `subset()` function:
48 |
49 | ```r
50 | sigOE <- data.frame(subset(res_tableOE, threshold == TRUE))
51 | ```
52 |
53 | Using the same thresholds as above (`padj.cutoff = 0.05` and `lfc.cutoff = 0.58`), create a threshold vector to report the number of genes that are up- and down-regulated in Mov10_knockdown compared to control.
54 |
55 | ```r
56 | threshold_KD <- res_tableKD$padj < padj.cutoff & abs(res_tableKD$log2FoldChange) > lfc.cutoff
57 | ```
58 |
59 | Take this new threshold vector and add it as a new column called `threshold` to the `res_tableKD` which contains a logical vector denoting genes as being differentially expressed or not. **How many genes are differentially expressed in the Knockdown compared to Control?** Subset the data to keep only the significant genes.
60 |
61 | ```r
62 | res_tableKD$threshold <- threshold_KD
63 |
64 | sigKD <- data.frame(subset(res_tableKD, threshold == TRUE))
65 | ```
66 |
67 |
68 | ## Visualizing the results
69 |
70 | When we are working with large amounts of data it can be useful to display that information graphically to gain more insight. Visualization deserves an entire course of its own, but during this lesson we will get you started with some basic plots commonly used when exploring differential gene expression data.
71 |
72 | One way to visualize results would be to simply plot the expression data for a handful of our top genes. We could do that by picking out specific genes of interest, for example Mov10:
73 |
74 | ```r
75 | # Plot expression for single gene
76 | plotCounts(dds, gene="MOV10", intgroup="sampletype")
77 | ```
78 |
79 | 
80 |
81 | ### Volcano plot
82 |
83 | The above plot would be great to validate a select few genes, but for more of a global view there are other plots we can draw. A commonly used one is a volcano plot, in which the -log10 transformed adjusted p-values are plotted on the y-axis and the log2 fold change values on the x-axis. There is no built-in function for the volcano plot in DESeq2, but we can easily draw it using `ggplot2`. First, we will need to create a `data.frame` object from the results, which are currently stored in a `DESeqResults` object:
84 |
85 | ```r
86 | # Create dataframe for plotting
87 | df <- data.frame(res_tableOE)
88 | ```
89 |
90 | Now we can start plotting. The `geom_point` object is most applicable, as this is essentially a scatter plot:
91 |
92 | ```r
93 | # Volcano plot
94 | ggplot(df) +
95 |   geom_point(aes(x=log2FoldChange, y=-log10(padj), colour=threshold)) +
96 |   xlim(c(-2,2)) +
97 |   ggtitle('Mov10 overexpression') +
98 |   xlab("log2 fold change") +
99 |   ylab("-log10 adjusted p-value") +
100 |   theme(legend.position = "none",
101 |         plot.title = element_text(size = rel(1.5)),
102 |         axis.title = element_text(size = rel(1.5)),
103 |         axis.text = element_text(size = rel(1.25)))
105 |
106 | 
107 |
108 |
109 | ### Heatmap
110 |
111 | Alternatively, we could extract only the genes that are identified as significant and then plot the expression of those genes using a heatmap. We can use the genes from the subsetted data frames to select the corresponding rows from the normalized data matrix:
112 |
113 | ```r
114 | ### Extract normalized expression for significant genes
115 | norm_OEsig <- normalized_counts[rownames(sigOE),]
116 | ```
117 |
118 | Now let's draw the heatmap using `pheatmap`:
119 |
120 | ```r
121 | ### Annotate our heatmap (optional)
122 | annotation <- data.frame(sampletype=meta[,'sampletype'],
123 | row.names=row.names(meta))
124 |
125 | ### Set a color palette
126 | heat.colors <- brewer.pal(6, "YlOrRd")
127 |
128 | ### Run pheatmap
129 | pheatmap(norm_OEsig, color = heat.colors, cluster_rows = TRUE, show_rownames = FALSE,
130 |          annotation = annotation, border_color = NA, fontsize = 10, scale = "row",
131 |          fontsize_row = 10, height = 20)
132 | ```
133 |
134 | 
135 |
136 | > *NOTE:* There are several additional arguments we have included in the function for aesthetics. One important one is `scale="row"`, in which Z-scores are plotted, rather than the actual normalized count value. Z-scores are computed on a gene-by-gene basis by subtracting the mean and then dividing by the standard deviation. The Z-scores are computed **after the clustering**, so that it only affects the graphical aesthetics and the color visualization is improved.
137 |
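The gene-by-gene Z-score described in the note can be reproduced with base R. This is a sketch of the calculation only, on a made-up two-gene matrix; it is not a claim about how `pheatmap` implements `scale="row"` internally.

```r
# Toy normalized counts: two hypothetical genes with very different magnitudes
toy_counts <- rbind(
  geneA = c(10, 20, 30, 40),
  geneB = c(1000, 1100, 900, 1000)
)

# scale() works on columns, so transpose, scale, and transpose back
# to subtract each gene's mean and divide by its standard deviation
toy_z <- t(scale(t(toy_counts)))

round(toy_z, 2)
rowMeans(toy_z)       # each gene is now centered at 0
apply(toy_z, 1, sd)   # and has a standard deviation of 1
```

After scaling, both genes span a comparable range, so a lowly expressed gene such as `geneA` is not washed out by a highly expressed one in the color scale.
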
138 | ***
139 |
140 | **Exercise**
141 |
142 | 1. Generate two figures for the KD-control comparison: a volcano plot and a heatmap.
143 | 2. Save both images to file.
144 |
145 | ***
146 | > **NOTE:** Advanced visualization methods for DE analysis results are available as [additional material](https://github.com/hbctraining/Training-modules/blob/master/Visualization_in_R/lessons/03_advanced_visualizations.md).
147 |
151 |
--------------------------------------------------------------------------------
/lessons/R_refresher.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "R refresher"
3 | author: "Meeta Mistry, Radhika Khetani, Mary Piper, Jihe Liu"
4 | ---
5 |
6 | Approximate time: 45 minutes
7 |
8 | ## Learning Objectives
9 |
10 | * Describe the various data types and data structures (including tibbles) used by R
11 | * Use functions in R and describe how to get help with arguments
12 | * Describe how to install and use packages in R
13 | * Use the pipe (%>%) from the dplyr package
14 | * Describe the syntax used by ggplot2 for making plots
15 |
16 | ## Setting up
17 |
18 | 1. Let’s create a new project directory for this review:
19 |
20 | - Create a new project called `R_refresher`
21 | - Create a new R script called `reviewing_R.R`
22 | - Create the following folders in the project directory - `data`, `figures`
23 | - Download a counts file to the `data` folder by [right-clicking here](https://github.com/hbctraining/DGE_workshop_salmon/blob/master/data/raw_counts_mouseKO.csv?raw=true)
24 |
25 | 2. Now that we have our directory structure setup, let's load our libraries and read in our data:
26 |
27 | - Load the `tidyverse` library
28 | - Use `read.csv()` to read in the downloaded file and save it in the object/variable `counts`
29 | - What is the syntax for a function?
30 | - How do we get help for using a function?
31 | - What is the data structure of `counts`?
32 | - What main data structures are available in R?
33 | - What are the data types of the columns?
34 | - What data types are available in R?
35 |
36 | ## Creating vectors/factors and dataframes
37 |
38 | 3. We are performing RNA-Seq on cancer samples with genotypes of p53 wildtype (WT) and knockout (KO). You have 8 samples total, with 4 replicates per genotype. Write the R code you would use to construct your metadata table as described below.
39 |
40 | - Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process to go faster try exploring the `rep()` function).
41 | - Put them together into a dataframe called `meta`.
42 | - Use the `rownames()` function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process to go faster try exploring the `paste0()` function).
43 |
44 | Your finished metadata table should have information for the variables `sex`, `stage`, `genotype`, and `myc` levels:
45 |
46 | |     | sex | stage | genotype | myc |
47 | |:---:|:---:|:-----:|:--------:|:---:|
48 | | KO1 | M | 1 | KO | 23 |
49 | | KO2 | F | 2 | KO | 4 |
50 | | KO3 | M | 2 | KO | 45 |
51 | | KO4 | F | 1 | KO | 90 |
52 | | WT1 | M | 2 | WT | 34 |
53 | | WT2 | F | 1 | WT | 35 |
54 | | WT3 | M | 1 | WT | 9 |
55 | | WT4 | F | 2 | WT | 10 |
56 |
57 | ## Exploring data
58 |
59 | Now that we have created our metadata data frame, it's often a good idea to get some descriptive statistics about the data before performing any analyses.
60 |
61 | - Summarize the contents of the `meta` object. How many data types are represented?
62 | - Check that the row names in the `meta` data frame are identical to the column names in `counts` (content and order).
63 | - Convert the existing `stage` column into a factor data type
64 |
65 | ## Extracting data
66 |
67 | 4. Using the `meta` data frame created in the previous question, perform the following exercises (questions **DO NOT** build upon each other):
68 |
69 | - return only the `genotype` and `sex` columns using `[]`:
70 | - return the `genotype` values for samples 1, 7, and 8 using `[]`:
71 | - use `filter()` to return all data for those samples with genotype `WT`:
72 | - use `filter()`/`select()`to return only the `stage` and `genotype` columns for those samples with `myc` > 50:
73 | - add a column called `pre_treatment` to the beginning of the dataframe with the values T, F, T, F, T, F, T, F
74 | - why might this design be problematic?
75 | - Using `%>%` create a tibble of the `meta` object and call it `meta_tb` (make sure you don't lose the rownames!)
76 | - change the names of the columns to: "A", "B", "C", "D", "E":
77 |
78 | ## Visualizing data
79 |
80 | 5. Often it is easier to see the patterns or nature of our data when we explore it visually with a variety of graphics. Let's use ggplot2 to explore differences in the expression of the Myc gene based on genotype.
81 |
82 | - Plot a boxplot of the expression of Myc for the KO and WT samples using `theme_minimal()` and give the plot new axes names and a centered title.
83 |
84 | ## Preparing for downstream analysis tools
85 |
86 | 6. Many different statistical tools or analytical packages expect all data needed as input to be in the structure of a list. Let's create a list of our count and metadata in preparation for a downstream analysis.
87 |
88 | - Create a list called `project1` with the `meta` and `counts` objects, as well as a new vector with all the sample names extracted from one of the 2 data frames.
89 |
90 |
91 | [Rscript with answers](https://hbctraining.github.io/DGE_workshop_salmon_online/exercises/refresher_answers.R)
92 |
--------------------------------------------------------------------------------
/lessons/experimental_planning_considerations.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Experimental design considerations"
3 | author: "Mary Piper, Meeta Mistry, Radhika Khetani"
4 | date: "Thursday, June 20, 2019"
5 | ---
6 |
7 | Approximate time: 50 minutes
8 |
9 | ## Learning Objectives:
10 |
11 | * Describe the importance of replicates for RNA-seq differential expression experiments
12 | * Explain the relationship between the number of biological replicates, sequencing depth and the differentially expressed genes identified
13 | * Demonstrate how to design an RNA-seq experiment that avoids confounding and batch effects
14 |
15 | ## Experimental planning considerations
16 |
17 | Understanding the steps in the experimental process of RNA extraction and preparation of RNA-Seq libraries is helpful for designing an RNA-Seq experiment, but there are special considerations that should be highlighted that can greatly affect the quality of a differential expression analysis.
18 |
19 | These important considerations include:
20 |
21 | 1. Number and type of **replicates**
22 | 2. Avoiding **confounding**
23 | 3. Addressing **batch effects**
24 |
25 | We will go over each of these considerations in detail, discussing best practice and optimal design.
26 |
27 | ## Replicates
28 |
29 | Experimental replicates can be performed as **technical replicates** or **biological replicates**.
30 |
31 |
32 |
33 | *Image credit: [Klaus B., EMBO J (2015) **34**: 2727-2730](https://dx.doi.org/10.15252%2Fembj.201592958)*
34 |
35 | - **Technical replicates:** use the same biological sample to repeat the technical or experimental steps in order to accurately measure technical variation and remove it during analysis.
36 |
37 | - **Biological replicates:** use different biological samples of the same condition to measure the biological variation between samples.
38 |
39 | In the days of microarrays, technical replicates were considered a necessity; however, with the current RNA-Seq technologies, technical variation is much lower than biological variation and **technical replicates are unnecessary**.
40 |
41 | In contrast, **biological replicates are absolutely essential** for differential expression analysis. For mice or rats, it might be easy to determine what constitutes a different biological sample, but it's a bit **more difficult to determine for cell lines**. [This article](http://paasp.net/accurate-design-of-in-vitro-experiments-why-does-it-matter/) gives some great recommendations for cell line replicates.
42 |
43 | For differential expression analysis, the more biological replicates, the better the estimates of biological variation and the more precise our estimates of the mean expression levels. This leads to more accurate modeling of our data and identification of more differentially expressed genes.
44 |
45 |
46 |
47 | *Image credit: [Liu, Y., et al., Bioinformatics (2014) **30**(3): 301–304](https://doi.org/10.1093/bioinformatics/btt688)*
48 |
49 | As the figure above illustrates, **biological replicates are of greater importance than sequencing depth**, which is the total number of reads sequenced per sample. The figure shows the relationship between sequencing depth and number of replicates on the number of differentially expressed genes identified [[1](https://academic.oup.com/bioinformatics/article/30/3/301/228651/RNA-seq-differential-expression-studies-more)]. Note that an **increase in the number of replicates tends to return more DE genes than increasing the sequencing depth**. Therefore, generally more replicates are better than higher sequencing depth, with the caveat that higher depth is required for detection of lowly expressed DE genes and for performing isoform-level differential expression.
50 |
51 | > **Sample pooling:** Try to avoid pooling of individuals/experiments, if possible; however, if absolutely necessary, then each pooled set of samples would count as a **single replicate**. To ensure similar amounts of variation between replicates, you would want to pool the **same number of individuals** for each pooled set of samples.
52 | >
53 | > *For example, if you need at least 3 individuals to get enough material for your `control` replicate and at least 5 individuals to get enough material for your `treatment` replicate, you would pool 5 individuals for the `control` and 5 individuals for the `treatment` conditions. You would also make sure that the individuals that are pooled in both conditions are similar in sex, age, etc.*
54 |
55 | Replicates are almost always preferred to greater sequencing depth for bulk RNA-Seq. However, **guidelines depend on the experiment performed and the desired analysis**. Below we list some general guidelines for replicates and sequencing depth to help with experimental planning:
56 |
57 | - **General gene-level differential expression:**
58 |
59 | - ENCODE guidelines suggest 30 million SE reads per sample (stranded).
60 |
61 | - 15 million reads per sample is often sufficient, if there are a good number of replicates (>3).
62 |
63 | - Spend money on more biological replicates, if possible.
64 |
65 | - Generally recommended to have read length >= 50 bp
66 |
67 | - **Gene-level differential expression with detection of lowly-expressed genes:**
68 |
69 | - Similarly benefits from replicates more than sequencing depth.
70 |
71 | - Sequence deeper with at least 30-60 million reads depending on level of expression (start with 30 million with a good number of replicates).
72 |
73 | - Generally recommended to have read length >= 50 bp
74 |
75 | - **Isoform-level differential expression:**
76 |
77 | - Of known isoforms, suggested to have a depth of at least 30 million reads per sample and paired-end reads.
78 |
79 | - Of novel isoforms should have more depth (> 60 million reads per sample).
80 |
81 | - Choose biological replicates over paired/deeper sequencing.
82 |
83 | - Generally recommended to have read length >= 50 bp, but longer is better as the reads will be more likely to cross exon junctions
84 |
85 | - Perform careful QC of RNA quality. Be careful to use high quality preparation methods and restrict analysis to samples with high RIN (RNA integrity number) values.
86 |
87 | - **Other types of RNA analyses (intron retention, small RNA-Seq, etc.):**
88 |
89 | - Different recommendations depending on the analysis.
90 |
91 | - Almost always more biological replicates are better!
92 |
93 | > **NOTE:** The factor used to estimate the depth of sequencing for genomes is "coverage": how many times the sequenced nucleotides "cover" the genome. This metric is not exact for genomes (whole genome sequencing), but it is good enough and is used extensively. However, the metric **does not work for transcriptomes** because even though you may know what % of the genome has transcriptional activity, the expression of the genes is highly variable.
94 |
95 | ## Confounding
96 |
97 | A confounded RNA-Seq experiment is one where you **cannot distinguish the separate effects of two different sources of variation** in the data.
98 |
99 | For example, we know that sex has large effects on gene expression, and if all of our *control* mice were female and all of the *treatment* mice were male, then our treatment effect would be confounded by sex. **We could not differentiate the effect of treatment from the effect of sex.**
100 |
101 |
102 |
103 | **To AVOID confounding:**
104 |
105 | - Ensure animals in each condition are all the **same sex, age, litter, and batch**, if possible.
106 |
107 | - If not possible, then ensure that the animals are split equally between conditions.
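
A quick way to look for this kind of confounding before any analysis is to cross-tabulate the variables in the metadata. The sketch below uses base R only; the two metadata tables and the `has_empty_cells()` helper are hypothetical and made up for illustration, not part of the workshop data.

```r
# Hypothetical metadata: every control mouse is female, every treated mouse is male
meta_confounded <- data.frame(
  treatment = rep(c("control", "treatment"), each = 4),
  sex       = rep(c("F", "M"), each = 4)
)

# A balanced alternative: each treatment group has two females and two males
meta_balanced <- data.frame(
  treatment = rep(c("control", "treatment"), each = 4),
  sex       = rep(c("F", "M"), times = 4)
)

# Cross-tabulate the two variables; a count of zero marks a
# treatment/sex combination that was never sampled
tab_confounded <- table(meta_confounded$treatment, meta_confounded$sex)
tab_balanced   <- table(meta_balanced$treatment, meta_balanced$sex)

print(tab_confounded)
print(tab_balanced)

# Hypothetical helper: TRUE when some combination of levels never occurs
has_empty_cells <- function(tab) any(tab == 0)

has_empty_cells(tab_confounded)
has_empty_cells(tab_balanced)
```

Empty cells are a warning sign rather than proof of a problem, but in a two-group design like this one they mean the treatment effect cannot be separated from the effect of sex.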
108 |
109 |
110 |
111 | ## Batch effects
112 |
113 | Batch effects are a significant issue for RNA-Seq analyses, since you can see significant differences in expression due solely to the batch effect.
114 |
115 |
116 |
117 | *Image credit: [Hicks SC, et al., bioRxiv (2015)](https://www.biorxiv.org/content/early/2015/08/25/025528)*
118 |
119 | The issues generated by poor batch study design are highlighted nicely in [this paper](https://f1000research.com/articles/4-121/v1).
120 |
121 | ### How to know whether you have batches?
122 |
123 | - Were all RNA isolations performed on the same day?
124 |
125 | - Were all library preparations performed on the same day?
126 |
127 | - Did the same person perform the RNA isolation/library preparation for all samples?
128 |
129 | - Did you use the same reagents for all samples?
130 |
131 | - Did you perform the RNA isolation/library preparation in the same location?
132 |
133 | If *any* of the answers is **‘No’**, then you have batches.
134 |
135 | ### Best practices regarding batches:
136 |
137 | - Design the experiment in a way to **avoid batches**, if possible.
138 |
139 | - If unable to avoid batches:
140 |
141 | - **Do NOT confound** your experiment by batch:
142 |
143 |
144 |
145 | *Image credit: [Hicks SC, et al., bioRxiv (2015)](https://www.biorxiv.org/content/early/2015/08/25/025528)*
146 |
147 | - **DO** split replicates of the different sample groups across batches. The more replicates the better (definitely more than 2).
148 |
149 |
150 |
151 | *Image credit: [Hicks SC, et al., bioRxiv (2015)](https://www.biorxiv.org/content/early/2015/08/25/025528)*
152 |
153 | - **DO** include batch information in your **experimental metadata**. If we have that information and batch is not confounded with the condition of interest, we can regress out the variation due to batch during the analysis so that it doesn't affect our results.
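
To make "regressing out" batch more concrete, here is a minimal sketch with an ordinary linear model in base R. All values are simulated and noise-free for a single hypothetical gene, and the column names (`condition`, `batch`, `expression`) are assumptions chosen for this example. In DESeq2 the analogous step is adding batch to the design formula (for example `~ batch + condition`) rather than calling `lm()`.

```r
# Toy metadata: replicates of each condition are split across both batches
meta <- data.frame(
  condition = factor(rep(c("control", "treated"), times = 4)),
  batch     = factor(rep(c("batch1", "batch2"), each = 4))
)

# Simulated expression for one gene: baseline 5, batch2 adds 2, treatment adds 1
meta$expression <- 5 + 2 * (meta$batch == "batch2") + 1 * (meta$condition == "treated")

# Model WITHOUT batch: the batch shift is left in the unexplained variation
fit_no_batch <- lm(expression ~ condition, data = meta)

# Model WITH batch: the batch shift is absorbed by its own coefficient
fit_batch <- lm(expression ~ batch + condition, data = meta)

# Estimated effects (intercept, batch2, treated)
coef(fit_batch)

# Residual sum of squares with and without the batch term
rss_no_batch <- sum(resid(fit_no_batch)^2)
rss_batch    <- sum(resid(fit_batch)^2)
c(without_batch = rss_no_batch, with_batch = rss_batch)
```

Because the replicates of each condition are spread across both batches, the model can estimate the batch shift and the treatment effect separately. If all controls were in one batch and all treated samples in the other, the two terms could not be told apart.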
154 |
155 |
156 |
157 | > **NOTE:** *The sample preparation of cell line "biological" replicates "should be performed as independently as possible" (as batches), "meaning that cell culture media should be prepared freshly for each experiment, different frozen cell stocks and growth factor batches, etc. should be used [[2](http://paasp.net/accurate-design-of-in-vitro-experiments-why-does-it-matter/)]." However, preparation across all conditions should be performed at the same time.*
158 |
159 | ***
160 | **Exercise**
161 |
162 | Your experiment has three different treatment groups, A, B, and C. Due to the lengthy process of tissue extraction, you can only isolate the RNA from two samples at the same time. You plan to have 4 replicates per group.
163 |
164 | 1. Fill in the `RNA isolation` column of the metadata table. Since we can only prepare 2 samples at a time and we have 12 samples total, you will need to isolate RNA in 6 batches. In the `RNA isolation` column, enter one of the following values for each sample: `group1`, `group2`, `group3`, `group4`, `group5`, `group6`. Make sure to fill in the table so as to avoid confounding by batch of `RNA isolation`.
165 |
166 | Click [here](https://github.com/hbctraining/Intro-to-rnaseq-hpc-salmon/blob/master/data/exp_design_table.xlsx?raw=true) to download the below table as an **Excel file**.
167 |
168 | 2. **BONUS:** To perform the RNA isolations more quickly, you devote two researchers to perform the RNA isolations. Create a `researcher` column and fill in the researchers' initials for the samples they will prepare: use initials `AB` or `CD`.
169 |
170 | | sample | treatment | sex | replicate | RNA isolation |
171 | | --- | --- | --- | --- | --- |
172 | | sample1 | A | F | 1 |
173 | | sample2 | A | F | 2 |
174 | | sample3 | A | M | 3 |
175 | | sample4 | A | M | 4 |
176 | | sample5 | B | F | 1 |
177 | | sample6 | B | F | 2 |
178 | | sample7 | B | M | 3 |
179 | | sample8 | B | M | 4 |
180 | | sample9 | C | F | 1 |
181 | | sample10 | C | F | 2 |
182 | | sample11 | C | M | 3 |
183 | | sample12 | C | M | 4 |
184 |
185 | [Answer Key](https://www.dropbox.com/scl/fi/qifgipe1esic1avu3an7c/exp_design_table.xlsx?rlkey=9hq0834lzl6jnrfn58wfz1qh0&dl=1)
186 |
187 | ***
188 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
189 |
--------------------------------------------------------------------------------
/lessons/functional_analysis_other_methods.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Other Functional Analysis Methods for RNA-seq - gProfileR"
3 | author: "Mary Piper"
4 | date: "Tuesday, October 10th, 2017"
5 | ---
6 |
7 | ## gProfileR
8 |
9 | [gProfileR](http://biit.cs.ut.ee/gprofiler/index.cgi) is another tool for performing ORA, similar to clusterProfiler. gProfileR considers multiple sources of functional evidence, including Gene Ontology terms, biological pathways, regulatory motifs of transcription factors and microRNAs, human disease annotations and protein-protein interactions. The user selects the organism and the sources of evidence to test. There are also additional parameters to change various thresholds and tweak the stringency to the desired level.
10 |
11 | The GO terms output by gProfileR are generally quite similar to those output by clusterProfiler, but there are small differences due to the different algorithms used by the tools.
12 |
13 | 
14 |
15 | You can use gProfiler for a wide selection of organisms, and the tool accepts your gene list as input. If your gene list is ordered (e.g. by padj values), then gProfiler will take the order of the genes into account when outputting enriched terms or pathways.
16 |
17 | In addition, a large number (70%) of the functional annotations of GO terms are determined using _in silico_ methods to infer function from electronic annotation (IEA). While these annotations can offer valuable information, the information is of lower confidence than experimental and computational studies, and these functional annotations can be easily filtered out.
18 |
19 | The color codes in the gProfiler output represent the quality of the evidence for the functional annotation. For example, weaker evidence is depicted in blue, while strong evidence generated by direct experiment is shown with red or orange. Similar coloring is used for pathway information, with well-researched pathway information shown in black, opposed to lighter colors. Grey coloring suggests an unknown gene product or annotation. For more information, please see the [gProfiler paper](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1933153/).
20 |
21 | Also, due to the hierarchical structure of GO terms, you may return many terms that seem redundant since they are child and parent terms. gProfiler allows for 'hierarchical filtering', returning only the best term per parent term.
22 |
23 | We encourage you to explore gProfiler online; for today's class we will be demonstrating how to run it using the R package.
24 |
25 | #### Running gProfiler
26 |
27 | We can run gProfileR relatively easily from R, by loading the library and running the `gprofiler` function.
28 |
29 | ```r
30 | # Install the packages if needed (BiocManager::install() replaces the retired biocLite() script)
31 | BiocManager::install(c("gProfileR", "treemap"))
32 | library(gProfileR)
33 | library(treemap)
34 |
35 | ## Running gprofiler to identify enriched processes among significant genes
36 |
37 | gprofiler_results_oe <- gprofiler(query = sigOE_genes,
38 | organism = "hsapiens",
39 | ordered_query = F,
40 | exclude_iea = F,
41 | max_p_value = 0.05,
42 | max_set_size = 0,
43 | correction_method = "fdr",
44 | hier_filtering = "none",
45 | domain_size = "annotated",
46 | custom_bg = all_genes)
47 |
48 | ```
49 |
50 | Let's save the gProfiler results to file:
51 |
52 | ```r
53 | ## Subset and reorder gProfiler results to only include columns of interest
54 | gprofiler_results_oe_reordered <- gprofiler_results_oe[, c("term.id", "domain", "term.name", "p.value", "overlap.size", "term.size", "intersection")]
55 |
56 | ## Order the results by p-adjusted value
57 | gprofiler_results_oe_reordered <- gprofiler_results_oe_reordered[order(gprofiler_results_oe_reordered$p.value), ]
58 |
59 | ## Extract only the 'GO' terms from the results
60 | gprofiler_results_oe_GOs <- gprofiler_results_oe_reordered[grep('GO:', gprofiler_results_oe_reordered$term.id), ]
61 |
62 | ## Write the enriched GO results to file
63 | write.csv(gprofiler_results_oe_GOs,
64 | "results/gprofiler_MOV10_oe.csv")
65 | ```
66 |
67 | Now, extract only those lines in the gProfiler results with GO term accession numbers and associated padj values for downstream analyses:
68 |
69 | ```r
70 | ## Extract only GO IDs and p-values for downstream analysis
71 |
72 | GOpval_oe <- gprofiler_results_oe_GOs[ , c("term.id", "p.value")]
73 |
74 | write.table(GOpval_oe, "results/GOs_oe.txt", quote=FALSE, row.names = FALSE, col.names = FALSE)
75 |
76 | ```
77 |
78 | ### REVIGO
79 |
80 | [REVIGO](http://revigo.irb.hr/) is a web-based tool that can take our list of GO terms, collapse redundant terms by semantic similarity, and summarize them graphically.
81 |
82 | 
83 |
84 | Open `GOs_oe.txt` and copy and paste the contents into the REVIGO search box, and submit.
85 |
86 | After the program runs, there may not be output to the screen, but you can click on the `Treemap` tab. At the bottom of the `Treemap` tab should be a link to an R script to create the treemap; click to download the script.
87 |
88 | 
89 |
90 | In RStudio, pull-down the `File` menu and choose `Open File`, then navigate to the `REVIGO_treemap.r` script to open. In the `REVIGO_treemap.r` script tab, scroll down the script to the end and replace with the following:
91 |
92 | ```r
93 | ## by default, outputs to a PDF file (rename to an appropriate path and file name)
94 | pdf(file = "results/revigo_treemap.pdf", width = 16, height = 9)
95 | ## change the `tmPlot()` command to `treemap()`
96 | treemap(
97 |   stuff,
98 |   index = c("representative", "description"),
99 |   vSize = "abslog10pvalue",
100 |   type = "categorical",
101 |   vColor = "representative",
102 |   title = "REVIGO Gene Ontology treemap",
103 |   inflate.labels = FALSE, # set this to TRUE for space-filling group labels - good for posters
104 |   lowerbound.cex.labels = 0, # try to draw as many labels as possible (still, some small squares may not get a label)
105 |   bg.labels = "#CCCCCCAA", # define background color of group labels
106 |   # "#CCCCCC00" is fully transparent, "#CCCCCCAA" is semi-transparent grey, NA is opaque
107 |   position.legend = "none"
108 | )
109 | dev.off() # close the PDF device so the file is written to disk
110 | ```
111 |
112 |
113 |
114 |
115 |
116 | ### Gene set enrichment analysis using GAGE and Pathview
117 |
118 | Gene set enrichment analysis can also be performed with the [GAGE (Generally Applicable Gene-set Enrichment for Pathway Analysis)](http://bioconductor.org/packages/release/bioc/html/gage.html) and [Pathview](http://bioconductor.org/packages/release/bioc/html/pathview.html) tools, which use a slightly different type of algorithm than over-representation analysis.
119 |
120 | "GAGE assumes a gene set comes from a different distribution than the background and uses two-sample t-test to account for the gene set specific variance as well as the background variance. The two-sample t-test used by GAGE identifies gene sets with modest but consistent changes in gene expression level."[[2](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-161)]
121 |
122 | Pathview allows for the integration of the data generated by GAGE and visualization of the pathways from the dataset.
123 |
124 |
125 | #### Exploring enrichment of KEGG pathways
126 | To get started with GAGE and Pathview analysis, we need to load multiple libraries:
127 |
128 | ```r
129 | # Install packages if this is your first time using them
130 |
131 | # BiocManager::install() replaces the retired biocLite() script; package names go in one character vector
132 | BiocManager::install(c('gage', 'pathview', 'gageData', 'biomaRt', 'org.Hs.eg.db', 'DOSE', 'SPIA'))
133 |
134 | # Loading the packages needed for GAGE and Pathview analysis
135 |
136 | library(gage)
137 | library(pathview)
138 | library(gageData)
139 | library(org.Hs.eg.db)
140 | ```
141 |
142 | To determine whether pathways in our dataset are enriched, we need to first obtain the gene sets to test:
143 |
144 | ```r
145 | # Create datasets with KEGG gene sets to test
146 |
147 | kegg_human <- kegg.gsets(species = "human", id.type = "kegg")
148 | names(kegg_human)
149 |
150 | kegg.gs <- kegg_human$kg.sets[kegg_human$sigmet.idx]
151 | head(kegg.gs)
152 | ```
153 |
154 | Now that we have our pathways to test, we need to bring in our own data. We will use the log2 fold changes output by differential expression analysis to determine whether particular pathways are enriched. GAGE requires that the genes have Entrez IDs, so we will use these IDs for the analysis.
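
The GAGE code in this lesson expects an object called `foldchanges`: a named numeric vector of log2 fold changes whose names are Entrez IDs. As a rough sketch of how such a vector could be built, assume a results table with `entrezid` and `log2FoldChange` columns. The toy table below is made up for illustration, and dropping duplicated IDs by keeping the first occurrence is just one possible choice.

```r
# Toy stand-in for a DE results table annotated with Entrez IDs (values are made up)
res_entrez <- data.frame(
  entrezid       = c("4343", "1017", NA, "7157", "1017"),
  log2FoldChange = c(2.1, -0.8, 0.3, NA, -0.5),
  stringsAsFactors = FALSE
)

# Keep rows that have both an Entrez ID and a fold change
keep <- !is.na(res_entrez$entrezid) & !is.na(res_entrez$log2FoldChange)
res_clean <- res_entrez[keep, ]

# Drop duplicated Entrez IDs, keeping the first occurrence of each
res_clean <- res_clean[!duplicated(res_clean$entrezid), ]

# Named numeric vector: values are log2 fold changes, names are Entrez IDs
foldchanges <- res_clean$log2FoldChange
names(foldchanges) <- res_clean$entrezid

foldchanges
```

With real data, the same steps would be applied to your own annotated results table before passing `foldchanges` to `gage()`.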
155 |
156 | > A useful tutorial for using GAGE and Pathview is available from Stephen Turner on R-bloggers: [http://www.r-bloggers.com/tutorial-rna-seq-differential-expression-pathway-analysis-with-sailfish-deseq2-gage-and-pathview/](http://www.r-bloggers.com/tutorial-rna-seq-differential-expression-pathway-analysis-with-sailfish-deseq2-gage-and-pathview/)
157 |
158 | The [GAGE vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/gage/inst/doc/gage.pdf) provides detailed information on running the analysis. Note that you can run the analysis and look for pathways with genes statistically only up- or down-regulated. Alternatively, you can explore statistically perturbed pathways, which are enriched in genes that may be either up- or down- regulated. For KEGG pathways, looking at both types of pathways could be useful.
159 |
160 | ```r
161 | # Run GAGE
162 |
163 | keggres = gage(foldchanges, gsets=kegg.gs, same.dir=T)
164 | names(keggres)
165 |
166 | head(keggres$greater) #Pathways that are up-regulated
167 |
168 | head(keggres$less) #Pathways that are down-regulated
169 |
170 | # Select the pathways that are significantly up-regulated (q-value < 0.05)
171 |
172 | sel_up <- keggres$greater[, "q.val"] < 0.05 & !is.na(keggres$greater[, "q.val"])
173 |
174 | path_ids_up <- rownames(keggres$greater)[sel_up]
175 | path_ids_up
176 |
177 | # Get the pathway IDs for the significantly up-regulated pathways
178 |
179 | keggresids = substr(path_ids_up, start=1, stop=8)
180 | keggresids
181 | ```
182 | Now that we have the IDs for the pathways that are significantly up-regulated in our dataset, we can visualize these pathways and the genes identified from our dataset causing these pathways to be enriched using [Pathview](https://www.bioconductor.org/packages/devel/bioc/vignettes/pathview/inst/doc/pathview.pdf).
183 |
184 | ```r
185 | # Run Pathview
186 |
187 | ## Use Pathview to view significant up-regulated pathways
188 |
189 | pathview(gene.data = foldchanges, pathway.id=keggresids, species="human", kegg.dir="results/")
190 | ```
191 | 
192 |
193 | #### Exploring enrichment of biological processes using GO terms in GAGE
194 | Using the GAGE tool, we can identify significantly enriched gene ontology terms for biological process and molecular function based on the log2 fold changes for all genes. While gProfileR is an overlap statistic analysis tool which uses a threshold (adjusted p<0.05 here) to define which genes are analyzed for GO enrichment, gene set enrichment analysis tools like GAGE use a list of genes (here ranked by logFC) without using a threshold. This allows GAGE to use more information to identify enriched biological processes. The introduction to GSEA goes into more detail about the advantages of this approach: [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1239896/](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1239896/).
195 |
196 | ```r
197 | #Acquire datasets
198 |
199 | data(go.sets.hs)
200 | head(names(go.sets.hs))
201 |
202 | data(go.subs.hs)
203 | names(go.subs.hs)
204 | head(go.subs.hs$MF)
205 |
206 | #Use gage to explore enriched biological processes
207 | #Biological process
208 |
209 | go_bp_sets = go.sets.hs[go.subs.hs$BP]
210 | ```
211 |
212 | > If we wanted to identify enriched molecular functions we would use the code: `go.sets.hs[go.subs.hs$MF]`
213 |
214 |
215 | ```r
216 | # Run GAGE
217 | go_bp_res = gage(foldchanges, gsets=go_bp_sets, same.dir=T)
218 | class(go_bp_res)
219 | names(go_bp_res)
220 | head(go_bp_res$greater)
221 | go_df_enriched <- data.frame(go_bp_res$greater)
222 |
223 | GO_enriched_BP <- subset(go_df_enriched, q.val < 0.05)
224 | GO_enriched_BP
225 |
226 | write.table(GO_enriched_BP, "Mov10_GAGE_GO_BP.txt", quote=F)
227 | ```
228 |
229 |
230 | ***
231 | *This lesson has been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
232 |
233 |
--------------------------------------------------------------------------------
/lessons/pathway_topology.md:
--------------------------------------------------------------------------------
1 | ### SPIA
2 | The [SPIA (Signaling Pathway Impact Analysis)](http://bioconductor.org/packages/release/bioc/html/SPIA.html) tool can be used to integrate the lists of differentially expressed genes, their fold changes, and pathway topology to identify affected pathways. The blog post from [Getting Genetics Done](https://gettinggeneticsdone.blogspot.com/2012/03/pathway-analysis-for-high-throughput.html) provides a step-by-step procedure for using and understanding SPIA.
3 |
4 | ```r
5 | # Set-up
6 |
7 | BiocManager::install("SPIA")
8 | library(SPIA)
9 |
10 | ## Significant genes is a vector of fold changes where the names are ENTREZ gene IDs. The background set is a vector of all the genes represented on the platform.
11 |
12 | background_entrez <- res_entrez$entrezid
13 |
14 | sig_res_entrez <- res_entrez[which(res_entrez$padj < 0.05), ]
15 |
16 | sig_entrez <- sig_res_entrez$log2FoldChange
17 |
18 | names(sig_entrez) <- sig_res_entrez$entrezid
19 |
20 | head(sig_entrez)
21 | ```
22 |
23 |
24 | Now that we have our background and significant genes in the appropriate format, we can run SPIA:
25 |
26 | ```r
27 |
28 | spia_result <- spia(de=sig_entrez, all=background_entrez, organism="hsa")
29 |
30 | head(spia_result, n=20)
31 | ```
32 |
33 | SPIA outputs a table showing significantly dysregulated pathways based on over-representation and signaling perturbations accumulation. The table shows the following information:
34 |
35 | - `pSize`: the number of genes on the pathway
36 | - `NDE`: the number of DE genes per pathway
37 | - `tA`: the observed total perturbation accumulation in the pathway
38 | - `pNDE`: the probability to observe at least NDE genes on the pathway using a hypergeometric model (similar to ORA)
39 | - `pPERT`: the probability to observe a total accumulation more extreme than tA only by chance
40 | - `pG`: the p-value obtained by combining pNDE and pPERT
41 | - `pGFdr` and `pGFWER` are the False Discovery Rate and Bonferroni adjusted global p-values, respectively
42 | - `Status`: gives the direction in which the pathway is perturbed (activated or inhibited)
43 | - `KEGGLINK` gives a web link to the KEGG website that **displays the pathway image** with the differentially expressed genes highlighted in red
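
To see how the global p-value `pG` can relate to `pNDE` and `pPERT`, the sketch below applies Fisher's method for combining two independent p-values, which, to our understanding, is what SPIA uses by default. The helper `combine_fisher()` is hypothetical (it is not a function from the SPIA package), and the input p-values are made up.

```r
# Hypothetical helper: Fisher's method for two p-values,
# written in closed form as k - k * log(k) with k = p_nde * p_pert
combine_fisher <- function(p_nde, p_pert) {
  k <- p_nde * p_pert
  k - k * log(k)
}

# Example: both lines of evidence are borderline on their own
p_g <- combine_fisher(0.05, 0.05)
p_g

# The same value from the chi-squared formulation of Fisher's method:
# -2 * sum(log(p)) follows a chi-squared distribution with 2 * 2 = 4 degrees of freedom
p_g_chisq <- pchisq(-2 * log(0.05 * 0.05), df = 4, lower.tail = FALSE)
p_g_chisq
```

The combined p-value is smaller than either input, which illustrates how a pathway with modest over-representation and modest perturbation can still stand out once both lines of evidence are considered together.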
44 |
45 | We can view the significantly dysregulated pathways by viewing the over-representation and perturbations for each pathway.
46 |
47 | ```r
48 | plotP(spia_result, threshold=0.05)
49 | ```
50 |
51 | 
52 |
53 | In this plot, each pathway is a point and the coordinates are the log of pNDE (using a hypergeometric model) and the p-value from perturbations, pPERT. The oblique lines in the plot show the significance regions based on the combined evidence.
54 |
55 | If we choose to explore the significant genes from our dataset occurring in these pathways, we can subset our SPIA results:
56 |
57 | ```r
58 | ## Look at pathway 03013 and view kegglink
59 | subset(spia_result, ID == "03013")
60 | ```
61 |
62 | Then, by clicking on the KEGGLINK, we can view the genes within our dataset from these perturbed pathways:
63 |
64 | 
65 |
--------------------------------------------------------------------------------
/lessons/principal_component_analysis.md:
--------------------------------------------------------------------------------
1 | #### Principal Component Analysis (PCA)
2 |
3 | Principal Component Analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset (dimensionality reduction). Details regarding PCA are given below (based on [materials from StatQuest](https://www.youtube.com/watch?v=_UVHneBUBW0)); if you would like a more thorough description, we encourage you to explore [StatQuest's video](https://www.youtube.com/watch?v=_UVHneBUBW0).
4 |
5 | If you had two samples and wanted to plot the counts of one sample versus another, you could plot the counts of one sample on the x-axis and the other sample on the y-axis as shown below:
6 |
7 |
8 |
9 | You could draw a line through the data in the direction representing the most variation, which is on the diagonal in this example. The maximum variation in the data is between the two endpoints of this line.
10 |
11 | We also see the genes vary somewhat above and below the line. We could draw another line through the data representing the second-largest amount of variation in the data.
12 |
13 |
14 |
15 |
16 | The genes near the ends of the line, which would include those genes with the highest variation between samples (high expression in one sample and low expression in the other), have the greatest influence on the direction of the line.
17 |
18 |
19 |
20 | For example, a small change in the value of *Gene C* would greatly change the direction of the line, whereas a small change in *Gene A* or *Gene D* would have little effect.
21 |
22 |
23 |
24 | We could just rotate the entire plot and view the lines representing the variation as left-to-right and up-and-down. We see that most of the variation in the data is left-to-right, and the second most variation in the data is up-and-down. These axes that represent the variation are "Principal Components", with PC1 representing the most variation in the data and PC2 representing the second most variation in the data.
25 |
26 | If we had three samples, then we would have an extra direction in which we could have variation. Therefore, if we have *N* samples we would have *N*-directions of variation or principal components.
27 |
28 |
29 |
30 | We could give quantitative scores to genes based on how much they influence PC1 and PC2. Genes with little influence would get scores near zero, while genes with more influence would receive larger scores. Genes on opposite ends of the lines have a large influence, so they would receive large scores, but with opposite signs.
31 |
32 |
33 |
34 | To generate a score per sample, we combine the read counts for all genes. To calculate the scores, we do the following:
35 |
36 | Sample1 PC1 score = (read count * influence) + ... for all genes
37 |
38 | Using the counts in the table for each gene (assuming we had only 4 genes total) we could calculate PC1 and PC2 values for each sample as follows:
39 |
40 | Sample1 PC1 score = (4 * -2) + (1 * -10) + (8 * 8) + (5 * 1) = 51
41 | Sample1 PC2 score = (4 * 0.5) + (1 * 1) + (8 * -5) + (5 * 6) = -7
42 |
43 | Sample2 PC1 score = (5 * -2) + (4 * -10) + (8 * 8) + (7 * 1) = 21
44 | Sample2 PC2 score = (5 * 0.5) + (4 * 1) + (8 * -5) + (7 * 6) = 8.5
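
The same arithmetic can be written as a matrix multiplication, which is how it scales to thousands of genes. The sketch below reuses the counts and influence values from the calculation above; the gene labels `gene1` to `gene4` are placeholders, since only the numbers are given here.

```r
# Read counts: one row per sample, one column per gene (values from the example above)
counts <- matrix(
  c(4, 1, 8, 5,
    5, 4, 8, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Sample1", "Sample2"),
                  c("gene1", "gene2", "gene3", "gene4"))
)

# Influence of each gene on PC1 and PC2: one row per gene, one column per PC
influence <- matrix(
  c(-2, -10, 8, 1,    # PC1 influence scores
    0.5, 1, -5, 6),   # PC2 influence scores
  ncol = 2,
  dimnames = list(c("gene1", "gene2", "gene3", "gene4"),
                  c("PC1", "PC2"))
)

# Each sample score is the sum over genes of (read count * influence)
scores <- counts %*% influence
scores
```

Each row of `scores` holds the PC1 and PC2 values for one sample, matching the four values calculated by hand.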
45 |
46 | The scores would then be plotted to examine whether the samples exhibit similar variation across all genes:
47 |
48 |
49 |
50 | Since genes with the greatest variation between samples will have the greatest influence on the principal components, we hope our condition of interest explains this variation (e.g. high counts in one condition and low counts in the other). With PC1 representing the most variation in the data and PC2 representing the second most variation in the data, we can visualize how similar the variation of genes is between samples.
--------------------------------------------------------------------------------
/lessons/top20_genes-expression_plotting.md:
--------------------------------------------------------------------------------
1 | #### Using `ggplot2` to plot multiple genes (e.g. top 20)
2 |
3 | Often it is helpful to check the expression of multiple genes of interest at the same time. This often first requires some data wrangling.
4 |
5 | We are going to plot the normalized count values for the **top 20 differentially expressed genes (by padj values)**.
6 |
7 | To do this, we first need to determine the gene names of our top 20 genes by ordering our results and extracting the top 20 genes (by padj values):
8 |
9 | ```r
10 | ## Order results by padj values
11 | top20_sigOE_genes <- res_tableOE_tb %>%
12 | arrange(padj) %>% #Arrange rows by padj values
13 | pull(gene) %>% #Extract character vector of ordered genes
14 | head(n=20) #Extract the first 20 genes
15 | ```
16 |
17 | Then, we can extract the normalized count values for these top 20 genes:
18 |
19 | ```r
20 | ## normalized counts for top 20 significant genes
21 | top20_sigOE_norm <- normalized_counts %>%
22 | filter(gene %in% top20_sigOE_genes)
23 | ```
24 |
25 | Now that we have the normalized counts for each of the top 20 genes for all 8 samples, to plot using `ggplot()`, we need to gather the counts for all samples into a single column to allow us to give ggplot the one column with the values we want it to plot.
26 |
27 | The `gather()` function in the **tidyr** package will perform this operation and will output the normalized counts for all genes for *Mov10_oe_1* listed in the first 20 rows, followed by the normalized counts for *Mov10_oe_2* in the next 20 rows, so on and so forth.
28 |
29 |
30 |
31 | ```r
32 | # Gathering the columns to have normalized counts to a single column
33 | gathered_top20_sigOE <- top20_sigOE_norm %>%
34 | gather(colnames(top20_sigOE_norm)[2:9], key = "samplename", value = "normalized_counts")
35 |
36 | ## check the column header in the "gathered" data frame
37 | View(gathered_top20_sigOE)
38 | ```
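
To make the shape of the gathered data concrete, here is a small base R sketch of the same wide-to-long reshaping on a toy table. It deliberately avoids tidyr so that it runs on its own; the table, the column names and the values are made up for illustration, and in the lesson itself `gather()` does this work.

```r
# Toy wide table: one row per gene, one column per sample (values are made up)
wide <- data.frame(
  gene    = c("geneA", "geneB"),
  sample1 = c(10, 20),
  sample2 = c(30, 40),
  stringsAsFactors = FALSE
)

# Every column except `gene` holds counts for one sample
sample_cols <- setdiff(colnames(wide), "gene")

# Long table: all genes for sample1 first, then all genes for sample2
long <- data.frame(
  gene              = rep(wide$gene, times = length(sample_cols)),
  samplename        = rep(sample_cols, each = nrow(wide)),
  normalized_counts = unlist(wide[sample_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

long
```

The long table has one row per gene and sample combination, with a single `normalized_counts` column that a plotting function can map to one axis.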
39 |
40 | Now, if we want our counts colored by sample group, then we need to combine the metadata information with the gathered normalized counts data into the same data frame for input to `ggplot()`:
41 |
42 | ```r
43 | gathered_top20_sigOE <- inner_join(mov10_meta, gathered_top20_sigOE)
44 | ```
45 |
46 | The `inner_join()` will merge 2 data frames with respect to the "samplename" column, i.e. a column with the same column name in both data frames.
47 |
48 | Now that we have a data frame in a format that can be utilised by ggplot easily, let's plot!
49 |
50 | ```r
51 | ## plot using ggplot2
52 | ggplot(gathered_top20_sigOE) +
53 |   geom_point(aes(x = symbol, y = normalized_counts, color = sampletype)) +
54 |   scale_y_log10() +
55 |   xlab("Genes") +
56 |   ylab("log10 Normalized Counts") +
57 |   ggtitle("Top 20 Significant DE Genes") +
58 |   theme_bw() +
59 |   theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
60 |   theme(plot.title = element_text(hjust = 0.5))
61 | ```
62 |
63 |
64 |
--------------------------------------------------------------------------------
/schedule/README.md:
--------------------------------------------------------------------------------
1 | # Workshop Schedule
2 |
3 | ## Pre-reading
4 |
5 | 1. [Workflow (raw data to counts)](../lessons/01a_RNAseq_processing_workflow.md)
6 | 1. [Experimental design considerations](../lessons/experimental_planning_considerations.md)
7 |
8 | ## Day 1
9 |
10 | | Time | Topic | Instructor |
11 | |:------------------------:|:------------------------------------------------:|:--------:|
12 | | 10:00 - 10:30 | [Workshop Introduction](../lectures/workshop_intro_slides.pdf) | Will |
13 | | 10:30 - 11:00 | RNA-seq pre-reading discussion | All |
14 | | 11:00 - 11:45 | [Intro to DGE / setting up DGE analysis](../lessons/01b_DGE_setup_and_overview.md) | Meeta |
15 | | 11:45 - 12:00 | Overview of self-learning materials and homework submission | Will |
16 |
17 | ### Before the next class:
18 |
19 | 1. Please **study the contents** and **work through all the code** within the following lessons:
20 |
21 | * [RNA-seq counts distribution](../lessons/01c_RNAseq_count_distribution.md)
22 |
23 | Click here for a preview of this lesson
24 |
Starting with the count matrix, we want to explore some characteristics of the RNA-seq data and evaluate the appropriate model to use.
This lesson will cover:
25 | - Describing characteristics of the RNA-seq count data
26 | - Understanding different statistical methods to model the count data
27 | - Explaining the benefits of biological replicates
28 |
29 |
30 | * [Count normalization](../lessons/02_DGE_count_normalization.md)
31 |
32 | Click here for a preview of this lesson
33 |
Count normalization is an important data pre-processing step before the differential expression analysis.
This lesson will cover:
34 | - Describing "uninteresting factors" to consider during normalization
35 | - Understanding different normalization methods and their corresponding use cases
36 | - Generating a matrix of normalized counts using DESeq2's median of ratios method
37 |
38 |
39 | * [Sample-level QC](../lessons/03_DGE_QC_analysis.md) (PCA and hierarchical clustering)
40 |
41 | Click here for a preview of this lesson
42 |
42 | Next, we want to check the quality of the count data to make sure that the samples are of good quality.
43 |
This lesson will cover:
44 | - Understanding the importance of similarity analysis between samples
45 | - Describing Principal Component Analysis (PCA) and interpreting PCA plots from RNA-seq data
46 | - Performing hierarchical clustering and plotting correlation metrics
47 |
48 |
49 | 2. **Complete the exercises**:
50 | * Each lesson above contains exercises; please go through each of them.
51 | * **Copy over** your solutions into the [Google Form](https://docs.google.com/forms/d/e/1FAIpQLScUcYzyiM_dAsgQdNx9ECzCX3lKrTHTwmUKux9u8VyP2JDLNQ/viewform?usp=sf_link) **the day before the next class**.
52 |
53 | ### Questions?
54 | * ***If you get stuck due to an error*** while running code in the lesson, [email us](mailto:hbctraining@hsph.harvard.edu)
55 |
56 | ---
57 |
58 | ## Day 2
59 |
60 | | Time | Topic | Instructor |
61 | |:------------------------:|:------------------------------------------------:|:--------:|
62 | | 10:00 - 11:00 | Self-learning lessons discussion | All |
63 | | 11:00 - 11:30 | [Design formulas](../lessons/04a_design_formulas.md) | Will |
64 | | 11:30 - 12:00 | [Hypothesis testing and multiple test correction](../lessons/05a_hypothesis_testing.md) | Meeta |
65 |
66 | ### Before the next class:
67 |
68 | I. Please **study the contents** and **work through all the code** within the following lessons:
69 | 1. [Description of steps for DESeq2](../lessons/04b_DGE_DESeq2_analysis.md)
70 |
71 | Click here for a preview of this lesson
72 |
The R code required to perform differential gene expression analysis is actually quite simple. Running the `DESeq()` function will carry out the various steps involved. It is important that you have some knowledge of what is happening under the hood, to be able to fully understand and interpret the results.
In this lesson you will:
73 | - Examine size factors and learn about sources that cause observed variation in values
74 | - Explore the gene-wise dispersion estimates as they relate back to the mean-variance relationship
75 | - Critically evaluate a dispersion plot
76 |
77 |
78 | 2. [Wald test results](../lessons/05b_wald_test_results.md)
79 |
80 | Click here for a preview of this lesson
81 |
We have run the analysis, and now it's time to explore the results!
In this lesson you will:
82 | - Learn how to extract results for specific group comparisons
83 | - Explore the information presented in the results table (different statistics and their importance)
84 | - Understand the different levels of filtering that are applied in DESeq2 by default (and why they are important)
85 |
86 |
87 |
88 | 3. [Summarizing results and extracting significant gene lists](../lessons/05c_summarizing_results.md)
89 |
90 | Click here for a preview of this lesson
91 |
Once you have your results, it is useful to summarize the information. Here, we get a snapshot of the number of differentially expressed genes that are identified from the different comparisons.
92 |
93 |
94 | 4. [Visualization](../lessons/06_DGE_visualizing_results.md)
95 |
96 | Click here for a preview of this lesson
97 |
A picture is worth a thousand words. In our case, a figure is worth a thousand (or 30 thousand) data points. When working with large scale data, it can be helpful to visualize results and get a big picture perspective of your findings.
In this lesson you will:
98 | - Explore different plots for data visualization
99 | - Create a volcano plot to evaluate the relationship between different statistics from the results table
100 | - Create a heatmap for visualization of differentially expressed genes
101 |
102 |
103 | II. **Complete the exercises**:
104 | * Each lesson above contains exercises; please go through each of them.
105 | * **Copy over** your solutions into the [Google Form](https://docs.google.com/forms/d/e/1FAIpQLSfVELkIcVN4wyJ2aNrowgxiuat5uUXCXACj8QN4MfTK5Yr-Zw/viewform?usp=sf_link) **the day before the next class**.
106 |
107 | ### Questions?
108 | * ***If you get stuck due to an error*** while running code in the lesson, [email us](mailto:hbctraining@hsph.harvard.edu)
109 |
110 | ---
111 |
112 | ## Day 3
113 |
114 | | Time | Topic | Instructor |
115 | |:------------------------:|:------------------------------------------------:|:--------:|
116 | | 10:00 - 11:15 | Self-learning lessons discussion | All |
117 | | 11:15 - 12:00 | [Likelihood Ratio Test results](../lessons/08a_DGE_LRT_results.md) | Meeta |
118 |
119 | ### Before the next class:
120 |
121 | 1. Please **study the contents** and **work through all the code** within the following lessons:
122 | * [Time course analysis](../lessons/08b_time_course_analyses.md)
123 |
124 | Click here for a preview of this lesson
125 |
Sometimes we are interested in how a gene changes over time. The Likelihood Ratio Test (LRT) is particularly well-suited for this task.
This lesson will cover:
126 | - Designing a LRT for a time-course analysis in DESeq2
127 | - Identifying patterns in our list of differentially expressed genes
128 |
129 | * [Gene annotation](../lessons/genomic_annotation.md)
130 |
131 | Click here for a preview of this lesson
132 |
Next-generation analyses rely on annotations to provide a description for defining genes, transcripts and/or proteins. These annotations are often stored in publicly available databases.
This lesson will cover:
133 | - Describing the various annotation databases
134 | - Accessing annotations from one of these databases using R
135 |
136 | * [Functional analysis - over-representation analysis](../lessons/10_FA_over-representation_analysis.md)
137 |
138 | Click here for a preview of this lesson
139 |
Oftentimes after completing an RNA-seq experiment, you will be left with a list of differentially expressed transcripts. You may be interested in knowing if these transcripts are enriched in certain biologically-relevant contexts.
This lesson will cover:
140 | - Describing how functional enrichment tools yield statistically enriched functional categories or interactions
141 | - Identifying enriched Gene Ontology terms using the R package, clusterProfiler
142 |
143 | * [Functional analysis - functional class scoring / GSEA](../lessons/11_FA_functional_class_scoring.md)
144 |
145 | Click here for a preview of this lesson
146 |
While some functional analyses focus on large changes in a select few genes, functional class scoring (FCS) focuses on weaker but coordinated changes in sets of functionally related genes (i.e., pathways) that can also have significant effects.
This lesson will cover:
147 | - Designing a GSEA analysis using GO and/or KEGG gene sets
148 | - Evaluating the results of a GSEA analysis
149 | - Discussing other tools and resources for identifying genes of novel pathways or networks
150 |
151 |
152 | 2. **There is no assignment submission, but please use this [Google form](https://forms.gle/QHPktomJZysJe7Zz8) to ask us questions!**
153 |
154 | ### Questions?
155 | * ***If you get stuck due to an error*** while running code in the lesson, [email us](mailto:hbctraining@hsph.harvard.edu)
156 |
157 | ---
158 |
159 | ## Day 4
160 |
161 | | Time | Topic | Instructor |
162 | |:------------------------:|:------------------------------------------------:|:--------:|
163 | | 10:00 - 11:00 | Questions about self-learning lessons | All |
164 | | 11:00 - 11:15 | [Summarizing workflow](../lessons/07_DGE_summarizing_workflow.md) | Will |
165 | | 11:15 - 11:45 | Discussion, Q & A | All |
166 | | 11:45 - 12:00 | [Wrap Up](../lectures/workshop_wrapup_slides.pdf) | Will |
167 |
168 | ## Answer keys
169 | * [Day 1 Answer Key](../homework/DGE_assignment_1_answer_key.R)
170 | * [Day 2 Answer Key](../homework/DGE_assignment_2_answer_key.R)
171 |
172 | ## Resources
173 | We have covered the inner workings of DESeq2 in a fair amount of detail, so that you have a good understanding of what is going on under the hood when using this package. For more information on the topics covered, we encourage you to take a look at the following resources:
174 |
175 | * [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#theory-behind-deseq2)
176 | * GitHub book on [RNA-seq gene level analysis](http://genomicsclass.github.io/book/pages/rnaseq_gene_level.html)
177 | * [Bioconductor support site](https://support.bioconductor.org/t/deseq2/) (posts tagged with `deseq2`)
178 | * [Enrichment analysis book](https://yulab-smu.top/biomedical-knowledge-mining-book/enrichment-overview.html)
179 | * [Visualization: Functional (Enrichment) analysis](https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html)
180 |
181 | ## Building on this workshop
182 | * [Single-cell RNA-seq workshop](https://hbctraining.github.io/scRNA-seq/)
183 | * [RMarkdown](https://hbctraining.github.io/Training-modules/Rmarkdown/)
184 | * [ggplot2 for functional analysis](https://hbctraining.github.io/Training-modules/Tidyverse_ggplot2/lessons/03_ggplot2.html)
185 |
186 | ## Other helpful links
187 | * [Online hbctraining learning materials](https://hbctraining.github.io/main/table_of_training_README.html)
188 | * [hbctraining webpage](https://hbctraining.github.io/main)
189 |
--------------------------------------------------------------------------------
/schedule/README_old.md:
--------------------------------------------------------------------------------
1 | # Workshop schedule
2 |
3 | **Note 1:** The *Basic Data Skills* [Introduction to R](https://hbctraining.github.io/Intro-to-R/schedules/1.5-day.html) workshop is a prerequisite.
4 |
5 | **Note 2:** In this workshop we use expression data generated using Salmon; the training materials for that portion of the workflow were covered in the *Advanced Topics* [Introduction to (bulk) RNA-seq using High-Performance Computing](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/schedule/) workshop.
6 |
7 |
8 | ## Day1
9 |
10 | | Time | Topic | Instructor |
11 | |:-----------:|:----------:|:--------:|
12 | | 09:00 - 09:15 | [Introduction to Workshop](../lectures/Intro_to_workshop.pdf) | Radhika |
13 | | 09:15 - 10:00 | [R Refresher Session](../lessons/R_refresher.md) | Mary |
14 | | 10:00 - 10:15 | Coffee | |
15 | | 10:15 - 11:25 | [Overview of DGE Analysis Workflow](../lessons/01_DGE_setup_and_overview.md) | Meeta |
16 | | 11:25 - 12:15 |[Setting up for DGE Analysis: Count Normalization](../lessons/02_DGE_count_normalization.md) | Jihe |
17 | | 12:15 - 13:15 | Lunch | |
18 | | 13:15 - 14:40 | [Setting up for DGE Analysis: Count Data QC](../lessons/03_DGE_QC_analysis.md) | Radhika |
19 | | 14:40 - 15:10 | [Overview of DESeq2 Methods](../lessons/04_DGE_DESeq2_analysis.md) | Mary |
20 | | 15:10 - 15:25 | Coffee | |
21 | | 15:25 - 15:55 | [Overview of DESeq2 Methods](../lessons/04_DGE_DESeq2_analysis.md) | Mary |
22 | | 15:55 - 17:00 | [DGE Analysis: Pairwise Comparisons (Wald Test)](../lessons/05_DGE_DESeq2_analysis2.md)| Meeta |
23 |
24 |
25 | ## Day2
26 |
27 | | Time | Topic | Instructor |
28 | |:-----------:|:----------:|:--------:|
29 | | 9:00 - 9:15 | [DGE Analysis: Workflow Summarization](../lessons/07_DGE_summarizing_workflow.md) | Mary |
30 | | 9:15 - 10:15 | [DGE Analysis: Visualization](../lessons/06_DGE_visualizing_results.md) | Radhika |
31 | | 10:15 - 10:30 | Coffee | |
32 | | 10:30 - 11:00 | [DGE Analysis: Visualization (cont'd)](../lessons/06_DGE_visualizing_results.md) | Radhika |
33 | | 11:00 - 12:00 | [DGE Analysis: Likelihood Ratio Test](../lessons/08_DGE_LRT.md) | Meeta |
34 | | 12:00 - 13:00 | Lunch | |
35 | | 13:00 - 14:00 | [Gene Annotations](../lessons/genomic_annotation.md) | Meeta |
36 | | 14:00 - 14:30 | [Functional Analysis](../lessons/functional_analysis_2019.md) | Jihe |
37 | | 14:30 - 14:45 | Coffee | |
38 | | 14:45 - 16:15 | [Functional Analysis (cont'd)](../lessons/functional_analysis_2019.md) | Mary |
39 | | 16:15 - 17:00 | [Wrap-up ](../lectures/Workshop_wrapup.pdf) | Radhika |
40 |
41 |
42 |
--------------------------------------------------------------------------------
/schedule/links-to-lessons.md:
--------------------------------------------------------------------------------
1 | # Introduction to Differential Gene Expression Analysis
2 |
3 | ## Learning Objectives
4 |
5 | - Explain and interpret QC on count data using Principal Component Analysis (PCA) and hierarchical clustering
6 | - Implement DESeq2 to obtain a list of significantly different genes
7 | - Perform functional analysis on gene lists with R-based tools
8 |
9 | ## Installations
10 |
11 | [Follow the instructions linked here](../README.md#installation-requirements) to download R and RStudio + Install Packages from CRAN and Bioconductor
12 |
13 | ## Lessons
14 |
15 | ### Part 1 (Getting Started)
16 | 1. [Workflow (raw data to counts)](../lessons/01a_RNAseq_processing_workflow.md)
17 | 1. [Experimental design considerations](../lessons/experimental_planning_considerations.md)
18 | 1. [Intro to DGE / setting up DGE analysis](../lessons/01b_DGE_setup_and_overview.md)
19 |
20 | ***
21 |
22 | ### Part II (QC and setting up for DESeq2)
23 | 1. [RNA-seq counts distribution](../lessons/01c_RNAseq_count_distribution.md)
24 | 1. [Count normalization](../lessons/02_DGE_count_normalization.md)
25 | 1. [Sample-level QC](../lessons/03_DGE_QC_analysis.md) (PCA and hierarchical clustering)
26 | 1. [Design formulas](../lessons/04a_design_formulas.md)
27 | 1. [Hypothesis testing and multiple test correction](../lessons/05a_hypothesis_testing.md)
28 |
29 | ***
30 |
31 | ### Part III (DESeq2)
32 | 1. [Description of steps for DESeq2](../lessons/04b_DGE_DESeq2_analysis.md)
33 | 1. [Wald test results](../lessons/05b_wald_test_results.md)
34 | 1. [Summarizing results and extracting significant gene lists](../lessons/05c_summarizing_results.md)
35 | 1. [Visualization](../lessons/06_DGE_visualizing_results.md)
36 | 1. [Likelihood Ratio Test results](../lessons/08a_DGE_LRT_results.md)
37 | 1. [Time course analysis](../lessons/08b_time_course_analyses.md)
38 |
39 | ***
40 |
41 | ### Part IV (Functional Analysis)
42 | 1. [Gene annotation](../lessons/genomic_annotation.md)
43 | 1. [Functional analysis - over-representation analysis](../lessons/10_FA_over-representation_analysis.md)
44 | 1. [Functional analysis - functional class scoring / GSEA](../lessons/11_FA_functional_class_scoring.md)
45 |
46 | ***
47 |
48 | [Workflow Summary](../lessons/07_DGE_summarizing_workflow.md)
49 |
50 | ***
51 |
52 | ## Building on this workshop
53 | * [Single-cell RNA-seq workshop](https://hbctraining.github.io/scRNA-seq/)
54 | * [RMarkdown](https://hbctraining.github.io/Training-modules/Rmarkdown/)
55 | * [ggplot2 for functional analysis](https://hbctraining.github.io/Training-modules/Tidyverse_ggplot2/lessons/ggplot2.html)
56 |
57 | ## Resources
58 | * [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#theory-behind-deseq2)
59 | * GitHub book on [RNA-seq gene level analysis](http://genomicsclass.github.io/book/pages/rnaseq_gene_level.html)
60 | * [Bioconductor support site](https://support.bioconductor.org/t/deseq2/) (posts tagged with `deseq2`)
61 | * [Functional analysis visualization](https://yulab-smu.top/biomedical-knowledge-mining-book/enrichplot.html)
62 |
63 | ****
64 |
65 | *These materials have been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
66 |
--------------------------------------------------------------------------------