├── README.md
├── gsea.Rmd
└── ora.Rmd


/README.md:
--------------------------------------------------------------------------------
# r-notebooks
R Markdown Notebooks to support the following tutorials:

Gene Set Enrichment Analysis
https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/

Over-Representation Analysis
https://learn.gencore.bio.nyu.edu/rna-seq-analysis/over-representation-analysis/


--------------------------------------------------------------------------------
/gsea.Rmd:
--------------------------------------------------------------------------------
---
title: "Gene Set Enrichment Analysis with ClusterProfiler"
author: "Mohammed Khalfan"
date: "5/19/2019"
output:
  html_document:
    df_print: paged
---

This R Notebook describes the implementation of gene set enrichment analysis (GSEA) using the clusterProfiler package. For more information please see the full documentation here: https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html


# Install and load required packages
```{r, message=F, warning=F}
#BiocManager::install("clusterProfiler", version = "3.8")
#BiocManager::install("pathview")
#BiocManager::install("enrichplot")
library(clusterProfiler)
library(enrichplot)
# we use ggplot2 to add x axis labels (ex: ridgeplot)
library(ggplot2)
```

# Annotations
I'm using *D. melanogaster* data, so I install and load the annotation "org.Dm.eg.db" below. See all annotations available here: http://bioconductor.org/packages/release/BiocViews.html#___OrgDb (there are 19 presently available).

```{r, message=F, warning=F}
# SET THE DESIRED ORGANISM HERE
organism = "org.Dm.eg.db"
#BiocManager::install(organism, character.only = TRUE)
library(organism, character.only = TRUE)
```

# Prepare Input
```{r}
# reading in data from deseq2
df = read.csv("drosphila_example_de.csv", header=TRUE)

# we want the log2 fold change
original_gene_list <- df$log2FoldChange

# name the vector
names(original_gene_list) <- df$X

# omit any NA values
gene_list <- na.omit(original_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)
```

## Gene Set Enrichment
Params:

**keyType** This is the source of the annotation (gene ids). The options vary for each annotation. In the example of *org.Dm.eg.db*, the options are:

"ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
"ENZYME" "EVIDENCE" "EVIDENCEALL" "FLYBASE" "FLYBASECG" "FLYBASEPROT"
"GENENAME" "GO" "GOALL" "MAP" "ONTOLOGY" "ONTOLOGYALL"
"PATH" "PMID" "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"

Check which options are available with the `keytypes` command, for example `keytypes(org.Dm.eg.db)`.

**ont** one of "BP", "MF", "CC" or "ALL"
**nPerm** number of permutations. The more permutations you set, the more accurate your results are, but the longer the analysis takes to run.
**minGSSize** minimum size of a gene set to be analyzed.
**maxGSSize** maximum size of a gene set to be tested.
**pvalueCutoff** p-value cutoff.
**pAdjustMethod** one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"

```{r}
gse <- gseGO(geneList = gene_list,
             ont = "ALL",
             keyType = "ENSEMBL",
             nPerm = 10000,
             minGSSize = 3,
             maxGSSize = 800,
             pvalueCutoff = 0.05,
             verbose = TRUE,
             OrgDb = organism,
             pAdjustMethod = "none")
```

# Output
## Table of results
```{r}
head(gse)
```

## Dotplot
```{r echo=TRUE, fig.width=15, fig.height=8}
require(DOSE)
dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)
```

## Enrichment plot map:
The enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets tend to cluster together, making it easy to identify functional modules.
```{r echo=TRUE}
emapplot(gse, showCategory = 10)
```

## Category Netplot
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
```{r fig.width=18}
# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(gse, categorySize="pvalue", foldChange=gene_list, showCategory = 3)
```

## Ridgeplot
Helpful to interpret up/down-regulated pathways.
```{r fig.width=18, fig.height=12}
ridgeplot(gse) + labs(x = "enrichment distribution")
```

## GSEA Plot
The traditional method for visualizing a GSEA result.

Params:
**Gene Set** Integer. Corresponds to the position of the gene set in the gse object. The first gene set is 1, the second gene set is 2, etc.
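
Because the gene set is identified by its position in the result table, the index can be looked up from a term description instead of being counted by hand. The sketch below uses a small made-up data frame standing in for the `gse` results; the IDs and descriptions are hypothetical, and only the `Description` column is taken from the notebook.

```r
# Toy stand-in for the result table (hypothetical IDs and descriptions)
res <- data.frame(
  ID = c("GO:0000001", "GO:0000002", "GO:0000003"),
  Description = c("term A", "term B", "term C"),
  stringsAsFactors = FALSE
)

# Position of the gene set whose description matches exactly
idx <- which(res$Description == "term B")
print(idx)
```

The resulting `idx` is the kind of value that would be passed as `geneSetID`, with `res$Description[idx]` as the plot title.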

```{r fig.height=6}
# Use the `Gene Set` param for the index in the title, and as the value for geneSetID
gseaplot(gse, by = "all", title = gse$Description[1], geneSetID = 1)
```

## PubMed trend of enriched terms
Plots the trend in the number (or proportion) of publications for each term, based on query results from PubMed Central.
```{r fig.width=10}
terms <- gse$Description[1:3]
pmcplot(terms, 2010:2018, proportion=FALSE)
```


# KEGG Gene Set Enrichment Analysis
For KEGG pathway enrichment using the `gseKEGG()` function, we need to convert id types. We can use the `bitr` function for this (included in clusterProfiler). It is normal for this call to produce some messages / warnings.

In the `bitr` function, the param `fromType` should be the same as `keyType` from the `gseGO` function above (the annotation source). This param is used again in the next two steps: creating `dedup_ids` and `df2`.

`toType` in the `bitr` function has to be one of the available options from `keytypes(org.Dm.eg.db)` and must map to one of 'kegg', 'ncbi-geneid', 'ncbi-proteinid' or 'uniprot', because `gseKEGG()` only accepts one of these 4 options as its `keyType` parameter. In the case of org.Dm.eg.db, none of those 4 names is listed as a keytype, but 'ENTREZID' is the same as ncbi-geneid for org.Dm.eg.db, so we use this for `toType`.

As our initial input, we use `original_gene_list` which we created above.
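
The conversion only works if each converted ID ends up on the row of the gene it belongs to. As a minimal base-R sketch, with a made-up mapping table standing in for real `bitr` output (all IDs and fold changes below are hypothetical), `match()` can attach converted IDs by key rather than by row position:

```r
# Toy differential-expression table (hypothetical values)
de <- data.frame(
  X = c("FBgn0000001", "FBgn0000002", "FBgn0000003", "FBgn0000004"),
  log2FoldChange = c(1.5, -2.0, 0.3, 2.7),
  stringsAsFactors = FALSE
)

# Toy mapping table standing in for bitr() output: FBgn0000003 is
# unmapped and FBgn0000002 maps to two ENTREZ IDs
map <- data.frame(
  ENSEMBL  = c("FBgn0000004", "FBgn0000002", "FBgn0000002", "FBgn0000001"),
  ENTREZID = c("104", "102", "999", "101"),
  stringsAsFactors = FALSE
)

# Keep the first ENTREZ ID for each source ID
map <- map[!duplicated(map$ENSEMBL), ]

# Attach ENTREZ IDs by key; unmapped genes get NA and are dropped
de$ENTREZID <- map$ENTREZID[match(de$X, map$ENSEMBL)]
de_mapped <- de[!is.na(de$ENTREZID), ]

# Build a ranked, named vector in the same shape as kegg_gene_list
ranked <- sort(setNames(de_mapped$log2FoldChange, de_mapped$ENTREZID),
               decreasing = TRUE)
print(ranked)
```

Matching by key gives the same result whatever order the mapping table arrives in, which is why it is used here instead of copying the ID column across directly.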

## Prepare Input
```{r}
# Convert gene IDs for gseKEGG function
# We will lose some genes here because not all IDs will be converted
ids <- bitr(names(original_gene_list), fromType = "ENSEMBL", toType = "ENTREZID", OrgDb=organism)

# remove duplicate IDs (here I use "ENSEMBL", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),]

# Create a new dataframe df2 which has only the genes which were successfully mapped using the bitr function above
df2 = df[df$X %in% dedup_ids$ENSEMBL,]

# Create a new column in df2 with the corresponding ENTREZ IDs
# (matched by ID rather than by row position, so row order cannot misalign them)
df2$Y = dedup_ids$ENTREZID[match(df2$X, dedup_ids$ENSEMBL)]

# Create a vector of the gene universe
kegg_gene_list <- df2$log2FoldChange

# Name vector with ENTREZ ids
names(kegg_gene_list) <- df2$Y

# omit any NA values
kegg_gene_list <- na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)
```
## Create gseKEGG object

**organism** KEGG Organism Code: The full list is here: https://www.genome.jp/kegg/catalog/org_list.html (need the 3 letter code). I define this as `kegg_organism` first, because it is used again below when making the pathview plots.
**nPerm** number of permutations. The more permutations you set, the more accurate your results are, but the longer the analysis takes to run.
**minGSSize** minimum size of a gene set to be analyzed.
**maxGSSize** maximum size of a gene set to be tested.
**pvalueCutoff** p-value cutoff.
**pAdjustMethod** one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none".
**keyType** one of 'kegg', 'ncbi-geneid', 'ncbi-proteinid' or 'uniprot'.
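
The `pAdjustMethod` options are the method names accepted by base R's `p.adjust()`. The sketch below uses made-up p-values (not results from this dataset) to illustrate how the choice of method changes the number of gene sets that pass a 0.05 cutoff. With "none", raw p-values are compared to the cutoff directly.

```r
# Hypothetical raw p-values for five gene sets
p_raw <- c(0.001, 0.012, 0.020, 0.045, 0.200)

# Adjust with three of the listed methods
adjusted <- data.frame(
  raw        = p_raw,
  none       = p.adjust(p_raw, method = "none"),
  BH         = p.adjust(p_raw, method = "BH"),
  bonferroni = p.adjust(p_raw, method = "bonferroni")
)
print(adjusted)

# Number of gene sets passing a 0.05 cutoff under each method
passing <- colSums(adjusted[, c("none", "BH", "bonferroni")] < 0.05)
print(passing)
```

In this toy case four sets pass without adjustment, three with "BH" and one with "bonferroni", so the method is worth choosing deliberately when many gene sets are tested.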
```{r}
kegg_organism = "dme"
kk2 <- gseKEGG(geneList     = kegg_gene_list,
               organism     = kegg_organism,
               nPerm        = 10000,
               minGSSize    = 3,
               maxGSSize    = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "none",
               keyType      = "ncbi-geneid")
```

```{r}
head(kk2, 10)
```

## Dotplot
```{r echo=TRUE}
dotplot(kk2, showCategory = 10, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign)
```

## Enrichment plot map:
The enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets tend to cluster together, making it easy to identify functional modules.
```{r echo=TRUE}
emapplot(kk2)
```

## Category Netplot:
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
```{r fig.width=12}
# categorySize can be either 'pvalue' or 'geneNum'
# foldChange must be named with the same IDs as kk2 (ENTREZ), so use kegg_gene_list
cnetplot(kk2, categorySize="pvalue", foldChange=kegg_gene_list)
```

## Ridgeplot
Helpful to interpret up/down-regulated pathways.
```{r fig.width=18, fig.height=12}
ridgeplot(kk2) + labs(x = "enrichment distribution")
```

## GSEA Plot
The traditional method for visualizing a GSEA result.

Params:
**Gene Set** Integer. Corresponds to the position of the gene set in the kk2 object. The first gene set is 1, the second gene set is 2, etc. Default: 1

```{r fig.height=6}
# Use the `Gene Set` param for the index in the title, and as the value for geneSetID
gseaplot(kk2, by = "all", title = kk2$Description[1], geneSetID = 1)
```

# Pathview
This will create a PNG and a *different* PDF of the enriched KEGG pathway.

Params:
**gene.data** This is `kegg_gene_list` created above
**pathway.id** The user needs to enter this. Enriched pathways + the pathway ID are provided in the gseKEGG output table (above).
**species** Same as `organism` above in `gseKEGG`, which we defined as `kegg_organism`

```{r, message=F, warning=F, echo = TRUE}
library(pathview)

# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="dme04130", species = kegg_organism)

# Produce a different plot (PDF) (not displayed here)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="dme04130", species = kegg_organism, kegg.native = F)
```
```{r pressure, echo=TRUE, fig.cap="KEGG Native Enriched Pathway Plot", out.width = '100%'}
knitr::include_graphics("dme04130.pathview.png")
```


--------------------------------------------------------------------------------
/ora.Rmd:
--------------------------------------------------------------------------------
---
title: "Over-Representation Analysis with ClusterProfiler"
author: "Mohammed Khalfan"
date: "5/15/2019"
output:
  html_document:
    df_print: paged
---

This R Notebook describes the implementation of over-representation analysis using the clusterProfiler package. For more information please see the full documentation here: https://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html


# Install and load packages
```{r, message=F, warning=F}
#BiocManager::install("clusterProfiler", version = "3.8")
#BiocManager::install("pathview")
#install.packages("wordcloud")
library(clusterProfiler)
library(wordcloud)
```

# Annotations
I'm using *D. melanogaster* data, so I install and load the annotation "org.Dm.eg.db" below.
See all annotations available here: http://bioconductor.org/packages/release/BiocViews.html#___OrgDb (there are 19 presently available).

```{r, message=F, warning=F}
organism = "org.Dm.eg.db"
#BiocManager::install(organism, character.only = TRUE)
library(organism, character.only = TRUE)
```

# Prepare Input

```{r}
# reading in input from deseq2
df = read.csv("drosphila_example_de.csv", header=TRUE)

# we want the log2 fold change
original_gene_list <- df$log2FoldChange

# name the vector
names(original_gene_list) <- df$X

# omit any NA values
gene_list <- na.omit(original_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
gene_list = sort(gene_list, decreasing = TRUE)

# Extract significant results (padj < 0.05)
sig_genes_df = subset(df, padj < 0.05)

# From significant results, we want to filter on log2 fold change
genes <- sig_genes_df$log2FoldChange

# Name the vector
names(genes) <- sig_genes_df$X

# omit NA values
genes <- na.omit(genes)

# filter on min absolute log2 fold change (|log2FoldChange| > 2)
genes <- names(genes)[abs(genes) > 2]
```

# Create enrichGO object
Params:

**Ontology** Options: ["BP", "MF", "CC"]
**keyType** This is the source of the annotation (gene ids). The options vary for each annotation. In the example of *org.Dm.eg.db*, the options are:

"ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID"
"ENZYME" "EVIDENCE" "EVIDENCEALL" "FLYBASE" "FLYBASECG" "FLYBASEPROT"
"GENENAME" "GO" "GOALL" "MAP" "ONTOLOGY" "ONTOLOGYALL"
"PATH" "PMID" "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"

Check which options are available with the `keytypes` command, for example `keytypes(org.Dm.eg.db)`.
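
Over-representation analysis asks whether a gene set contains more of the genes of interest than would be expected by chance, which is conventionally tested with the hypergeometric distribution. The base-R sketch below works through one gene set with made-up counts (not taken from this dataset). It illustrates the idea and does not reproduce clusterProfiler's internals.

```r
# Hypothetical counts for one gene set
N <- 10000  # genes in the universe (background)
M <- 200    # universe genes annotated to the gene set
n <- 500    # genes of interest (e.g. significant DE genes)
k <- 25     # genes of interest that fall in the gene set

# Expected overlap if the genes of interest were drawn at random
expected <- n * M / N

# P(overlap >= k): upper tail of the hypergeometric distribution
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

cat("expected overlap:", expected, "\n")
cat("observed overlap:", k, "\n")
cat("p-value:", signif(p_value, 3), "\n")
```

In these terms, the `universe` argument plays the role of the background `N` and the `gene` argument supplies the genes of interest.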

## Create the object
```{r}
go_enrich <- enrichGO(gene = genes,
                      universe = names(gene_list),
                      OrgDb = organism,
                      keyType = 'ENSEMBL',
                      readable = T,
                      ont = "BP",
                      pvalueCutoff = 0.05,
                      qvalueCutoff = 0.10)
```

# Output
## Table of results
```{r}
head(go_enrich)
```

## Upset Plot
Emphasizes the genes overlapping among different gene sets.
```{r fig.width=18, fig.height=12}
#BiocManager::install("enrichplot")
library(enrichplot)
upsetplot(go_enrich)
```

## Wordcloud

```{r fig.width=28, fig.height=26}
wcdf <- read.table(text=go_enrich$GeneRatio, sep = "/")[1]
wcdf$term <- go_enrich[,2]
wordcloud(words = wcdf$term, freq = wcdf$V1, scale=(c(4, .1)), colors=brewer.pal(8, "Dark2"), max.words = 25)
```


## Barplot

```{r echo=TRUE}
barplot(go_enrich,
        drop = TRUE,
        showCategory = 10,
        title = "GO Biological Pathways",
        font.size = 8)
```

## Dotplot
```{r echo=TRUE}
dotplot(go_enrich)
```

## Enrichment plot map:
The enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets tend to cluster together, making it easy to identify functional modules.
```{r echo=TRUE}
emapplot(go_enrich)
```

## Enriched GO induced graph:

```{r fig.width=12}
goplot(go_enrich, showCategory = 10)
```

## Category Netplot
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
```{r fig.width=12}
# categorySize can be either 'pvalue' or 'geneNum'
cnetplot(go_enrich, categorySize="pvalue", foldChange=gene_list)
```

## KEGG Pathway Enrichment
For KEGG pathway enrichment using the `enrichKEGG()` function, we need to convert id types. We can use the `bitr` function for this (included in clusterProfiler). It is normal for this call to produce some messages / warnings.

In the `bitr` function, the param `fromType` should be the same as `keyType` from the `enrichGO` function above (the annotation source). This param is used again in the next two steps: creating `dedup_ids` and `df2`.

`toType` in the `bitr` function has to be one of the available options from `keytypes(org.Dm.eg.db)` and must map to one of 'kegg', 'ncbi-geneid', 'ncbi-proteinid' or 'uniprot', because `enrichKEGG()` only accepts one of these 4 options as its `keyType` parameter. In the case of org.Dm.eg.db, none of those 4 names is listed as a keytype, but 'ENTREZID' is the same as ncbi-geneid for org.Dm.eg.db, so we use this for `toType`.

As our initial input, we use `original_gene_list` which we created above.
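
Since some genes are lost in conversion, it can help to count how many before going further. A minimal base-R sketch, using a made-up table in the two-column shape that `bitr` returns (all IDs are hypothetical), tallies the unmapped and multi-mapped IDs:

```r
# Hypothetical input IDs and a toy mapping table in the shape bitr() returns
input_ids <- c("FBgn0000001", "FBgn0000002", "FBgn0000003", "FBgn0000004")
ids <- data.frame(
  ENSEMBL  = c("FBgn0000001", "FBgn0000002", "FBgn0000002", "FBgn0000004"),
  ENTREZID = c("101", "102", "999", "104"),
  stringsAsFactors = FALSE
)

# IDs with no mapping at all
unmapped <- setdiff(input_ids, ids$ENSEMBL)

# IDs that map to more than one target ID
multi <- unique(ids$ENSEMBL[duplicated(ids$ENSEMBL)])

cat("unmapped:", length(unmapped), "of", length(input_ids), "\n")
cat("multi-mapped:", length(multi), "\n")
```

A large unmapped fraction would suggest that `fromType` does not match the IDs in the input file.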

## Prepare Data
```{r}
# Convert gene IDs for enrichKEGG function
# We will lose some genes here because not all IDs will be converted
ids <- bitr(names(original_gene_list), fromType = "ENSEMBL", toType = "ENTREZID", OrgDb="org.Dm.eg.db")

# remove duplicate IDs (here I use "ENSEMBL", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),]

# Create a new dataframe df2 which has only the genes which were successfully mapped using the bitr function above
df2 = df[df$X %in% dedup_ids$ENSEMBL,]

# Create a new column in df2 with the corresponding ENTREZ IDs
# (matched by ID rather than by row position, so row order cannot misalign them)
df2$Y = dedup_ids$ENTREZID[match(df2$X, dedup_ids$ENSEMBL)]

# Create a vector of the gene universe
kegg_gene_list <- df2$log2FoldChange

# Name vector with ENTREZ ids
names(kegg_gene_list) <- df2$Y

# omit any NA values
kegg_gene_list <- na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)

# Extract significant results from df2
kegg_sig_genes_df = subset(df2, padj < 0.05)

# From significant results, we want to filter on log2 fold change
kegg_genes <- kegg_sig_genes_df$log2FoldChange

# Name the vector with the CONVERTED ID!
names(kegg_genes) <- kegg_sig_genes_df$Y

# omit NA values
kegg_genes <- na.omit(kegg_genes)

# filter on absolute log2 fold change (PARAMETER)
kegg_genes <- names(kegg_genes)[abs(kegg_genes) > 2]
```
## Create enrichKEGG object
**organism** KEGG Organism Code: The full list is here: https://www.genome.jp/kegg/catalog/org_list.html (need the 3 letter code). I define this as `kegg_organism` first, because it is used again below when making the pathview plots.
**keyType** one of 'kegg', 'ncbi-geneid', 'ncbi-proteinid' or 'uniprot'.
```{r echo=TRUE}
kegg_organism = "dme"
kk <- enrichKEGG(gene = kegg_genes,
                 universe = names(kegg_gene_list),
                 organism = kegg_organism,
                 pvalueCutoff = 0.05,
                 keyType = "ncbi-geneid")
head(kk)
```

## Barplot
```{r echo=TRUE}
barplot(kk,
        showCategory = 10,
        title = "Enriched Pathways",
        font.size = 8)
```

## Dotplot
```{r echo=TRUE}
dotplot(kk,
        showCategory = 10,
        title = "Enriched Pathways",
        font.size = 8)
```

## Category Netplot:
The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network (helpful to see which genes are involved in enriched pathways and genes that may belong to multiple annotation categories).
```{r fig.width=12}
# categorySize can be either 'pvalue' or 'geneNum'
# foldChange must be named with the same IDs as kk (ENTREZ), so use kegg_gene_list
cnetplot(kk, categorySize="pvalue", foldChange=kegg_gene_list)
```

# Pathview
This will create a PNG and a *different* PDF of the enriched KEGG pathway.

Params:
**gene.data** This is `gene_list` created above (named with the original gene IDs, which is why `gene.idtype` is set below)
**pathway.id** The user needs to enter this. Enriched pathways + the pathway ID are provided in the enrichKEGG output table (above).
**species** Same as `organism` above in `enrichKEGG`, which we defined as `kegg_organism`
**gene.idtype** The entry of `gene.idtype.list` corresponding to your keytype, selected by index number (the first index is 1)
```{r, message=F, warning=F, echo = TRUE}
library(pathview)

# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data=gene_list, pathway.id="dme04080", species = "dme", gene.idtype=gene.idtype.list[3])

# Produce a different plot (PDF) (not displayed here)
dme <- pathview(gene.data=gene_list, pathway.id="dme04080", species = "dme", gene.idtype=gene.idtype.list[3], kegg.native = F)
```
```{r pressure, echo=TRUE, fig.cap="KEGG Native Enriched Pathway Plot", out.width = '100%'}
knitr::include_graphics("dme04080.pathview.png")
```


--------------------------------------------------------------------------------