├── LICENSE ├── NAMESPACE ├── man └── figures │ └── bioconductor.png ├── .Rbuildignore ├── pkgdown └── extra.css ├── inst ├── references.bib └── templates │ └── howto_template.qmd ├── LICENSE.md ├── _pkgdown.yml ├── README.md ├── vignettes ├── how-to-read-mass-spectrometry-data.qmd ├── how-to-load-gene-from-gff-gtf.qmd ├── how-to-read-big-bam-file-in-chunks.qmd ├── how-to-retrieve-gene-model-from-annotationhub.qmd ├── how-to-read-single-end-reads-from-bam-file.qmd ├── how-to-compute-read-coverage.qmd ├── how-to-read-paired-end-reads-from-bam-file.qmd ├── how-to-use-tidy-principles-for-rna-seq-analysis.qmd ├── how-to-get-exon-intron-sequence-for-gene.qmd ├── how-to-read-gene-sets-from-gmt-files.qmd ├── how-to-use-tidy-principles-for-granges-manipulation.qmd ├── how-to-compute-sequence-composition-for-genomic-regions.qmd └── how-to-extract-promoter-sequences.qmd ├── DESCRIPTION └── .github └── workflows └── R-CMD-check.yaml /LICENSE: -------------------------------------------------------------------------------- 1 | YEAR: 2025 2 | COPYRIGHT HOLDER: BiocHowTo authors 3 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | -------------------------------------------------------------------------------- /man/figures/bioconductor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Bioconductor/BiocHowTo/devel/man/figures/bioconductor.png -------------------------------------------------------------------------------- /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^LICENSE\.md$ 4 | ^vignettes/*_files$ 5 | ^_pkgdown.yml$ 6 | ^\.github$ 7 | -------------------------------------------------------------------------------- /pkgdown/extra.css: 
-------------------------------------------------------------------------------- 1 | .navbar { 2 | background: linear-gradient(to right, #3a85ab, #4ca139) !important; 3 | background-image: linear-gradient(to right, #3a85ab, #4ca139) !important; 4 | border: none !important; 5 | } 6 | 7 | -------------------------------------------------------------------------------- /inst/references.bib: -------------------------------------------------------------------------------- 1 | @article{Serizay2020Oct, 2 | author = {Serizay, Jacques and Dong, Yan and J{\ifmmode\ddot{a}\else\"{a}\fi}nes, J{\ifmmode\ddot{u}\else\"{u}\fi}rgen and Chesney, Michael and Cerrato, Chiara and Ahringer, Julie}, 3 | title = {{Distinctive regulatory architectures of germline-active and somatic genes in C. elegans}}, 4 | journal = {Genome Res.}, 5 | volume = {30}, 6 | number = {12}, 7 | pages = {1752--1765}, 8 | year = {2020}, 9 | month = oct, 10 | issn = {1088-9051}, 11 | publisher = {Cold Spring Harbor Lab}, 12 | doi = {10.1101/gr.265934.120} 13 | } 14 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # MIT License 2 | 3 | Copyright (c) 2025 BiocHowTo authors 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /_pkgdown.yml: -------------------------------------------------------------------------------- 1 | url: https://bioconductor.github.io/BiocHowTo/ 2 | template: 3 | bootstrap: 5 4 | params: 5 | bootswatch: zephyr 6 | 7 | navbar: 8 | structure: 9 | left: [home, articles] 10 | right: [search, github] 11 | 12 | articles: 13 | - title: Interacting with BAM files 14 | contents: 15 | - how-to-read-single-end-reads-from-bam-file 16 | - how-to-read-paired-end-reads-from-bam-file 17 | - how-to-read-big-bam-file-in-chunks 18 | - how-to-compute-read-coverage 19 | 20 | - title: Mass spectrometry 21 | contents: 22 | - how-to-read-mass-spectrometry-data 23 | 24 | - title: Tidy principles in Bioconductor 25 | contents: 26 | - how-to-use-tidy-principles-for-granges-manipulation 27 | - how-to-use-tidy-principles-for-rna-seq-analysis 28 | 29 | - title: Accessing genomic annotation data 30 | contents: 31 | - how-to-retrieve-gene-model-from-annotationhub 32 | - how-to-load-gene-from-gff-gtf 33 | - how-to-get-exon-intron-sequence-for-gene 34 | - how-to-read-gene-sets-from-gmt-files 35 | 36 | - title: Working with biological sequences 37 | contents: 38 | - how-to-compute-sequence-composition-for-genomic-regions 39 | - how-to-extract-promoter-sequences 40 | -------------------------------------------------------------------------------- /inst/templates/howto_template.qmd: 
-------------------------------------------------------------------------------- 1 | --- 2 | title: EDIT:HowToTitle 3 | author: EDIT:HowToAuthor 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{EDIT:HowToTitle} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | embed-resources: true 18 | --- 19 | 20 | ```{r} 21 | #| echo: false 22 | 23 | suppressPackageStartupMessages({ 24 | library(BiocStyle) 25 | }) 26 | ``` 27 | 28 | EDIT:Short introduction 29 | 30 | # Bioconductor packages used in this document 31 | 32 | EDIT:Add list of packages used in the HowTo. Use `r Biocpkg("PkgName")` to 33 | refer to Bioconductor packages, and similarly for other package sources 34 | (see https://bioconductor.org/packages/release/bioc/vignettes/BiocStyle/inst/doc/AuthoringRmdVignettes.html#4_Style_macros). Example: 35 | * `r Biocpkg("pasillaBamSubset")` 36 | 37 | # EDIT:Main section 38 | 39 | EDIT:Here, put the code and explanations for your HowTo. Keep in mind that 40 | HowTos should be short, and address a well-defined, specific task using 41 | Bioconductor. 42 | 43 | # Further reading 44 | 45 | EDIT:Add a (short) list of useful resources about the topic. 46 | 47 | # Session info 48 | 49 |
50 | 51 | Click to display session info 52 | 53 | ```{r} 54 | sessionInfo() 55 | ``` 56 |
57 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 3 | Bioconductor logo 4 | 5 |
6 | 7 | # BiocHowTo 8 | 9 |
10 | 11 | Welcome to the Bioconductor HowTo collection! The HowTos are short, 12 | stand-alone documents, each focused on how to address a well-defined question 13 | using one or more Bioconductor packages. 14 | 15 | For a list of the available HowTo documents, see the [Articles](articles/index.html) tab. 16 | 17 | ## How to contribute 18 | 19 | 1. [Fork this repository](https://github.com/Bioconductor/BiocHowTo/fork) and create a new branch. 20 | 2. Copy and rename the `howto_template.qmd` file from `inst/templates` to 21 | the `vignettes` directory, and edit the copy accordingly. Please choose the 22 | name for the new vignette in such a way that it is unique and clearly indicates 23 | the content. Note that the title of the vignette should be added both to the 24 | `title` field and the `%\VignetteIndexEntry` in the YAML section. 25 | 3. Test the vignette locally in a fresh R session, to make sure that it is 26 | self-contained and runs without errors. 27 | 4. Add your name to the `Author` list in the `DESCRIPTION` file. 28 | 5. If your HowTo is using packages that are not already included among the 29 | dependencies of `BiocHowTo`, please add them to the list of `Imports` in the 30 | `DESCRIPTION` file. One way to do this is via the `usethis` package: 31 | 32 | ``` 33 | usethis::use_package("name_of_new_dependency") 34 | ``` 35 | 36 | 6. Add the new vignette to a suitable section under 'Articles' in the 37 | `_pkgdown.yml` file. 38 | 7. Push the changes to your forked repository and open a pull request to 39 | the `devel` branch of the parent repository. 40 | 8. All submissions will be reviewed by the Bioconductor Training Committee. 41 | 42 | 43 | ## How to suggest a new topic 44 | 45 | To suggest a new topic for a HowTo, open an issue and provide some more 46 | details of your suggestion. If you would like to take on one of the open 47 | issues, either assign yourself (if possible), or comment in the issue, and 48 | one of the administrators can assign you. 
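As a rough sketch of steps 2, 3 and 5 under "How to contribute", the commands below copy the template into `vignettes` and show where rendering and dependency registration would follow. The vignette file name is a hypothetical placeholder, the paths assume the working directory is the root of your fork, and rendering with `quarto::quarto_render()` is only one possible way to test locally.

```r
# Hypothetical name for the new HowTo; pick one that is unique and descriptive.
new_vignette <- file.path("vignettes", "how-to-do-one-specific-task.qmd")
template <- file.path("inst", "templates", "howto_template.qmd")

# Step 2: copy the template, but only when run from the repository root and
# when the target does not exist yet. Edit the copy by hand afterwards.
if (file.exists(template) && !file.exists(new_vignette)) {
    file.copy(template, new_vignette)
}

# Step 3 (after editing): render in a fresh R session to check that the
# vignette is self-contained. Assumes the quarto R package and the Quarto CLI.
# quarto::quarto_render(new_vignette)

# Step 5: register any new dependency under Imports in DESCRIPTION.
# usethis::use_package("name_of_new_dependency")
```

The render and `use_package()` calls are left commented out because they only make sense once the `EDIT:` placeholders in the copy have been replaced.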
49 | -------------------------------------------------------------------------------- /vignettes/how-to-read-mass-spectrometry-data.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to read mass spectrometry data 3 | author: Laurent Gatto 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to read mass spectrometry data} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | # Bioconductor packages used in this document 28 | 29 | * `r Biocpkg("Spectra")` 30 | * `r Biocpkg("mzR")` 31 | * `r Biocpkg("MsDataHub")` 32 | 33 | # How to read mass spectrometry data 34 | 35 | Let's first get some example raw mass spectrometry data from the 36 | *MsDataHub* package. Below, we download two _sciex_ mzML files that 37 | represent profile-mode LC-MS data of pooled human serum samples and 38 | create a vector of file names, `fls`: 39 | 40 | ```{r} 41 | library(MsDataHub) 42 | fls <- c(X20171016_POOL_POS_1_105.134.mzML(), 43 | X20171016_POOL_POS_3_105.134.mzML()) 44 | fls 45 | ``` 46 | 47 | We can now use the `Spectra()` constructor function to create an 48 | object of class `Spectra` that contains the raw data. Note that the 49 | `mzR` package is needed to read the content of the mzML files. 50 | 51 | 52 | ```{r} 53 | suppressPackageStartupMessages({ 54 | library("Spectra") 55 | library("mzR") 56 | }) 57 | ``` 58 | 59 | ```{r} 60 | sp <- Spectra(fls) 61 | sp 62 | ``` 63 | 64 | The object contains `r length(sp)` spectra from both files. 65 | 66 | # Further reading 67 | 68 | - The [R for Mass Spectrometry](https://rformassspectrometry.github.io/book/) book. 
69 | - The [Large-scale data handling and processing with 70 | Spectra](https://rformassspectrometry.github.io/Spectra/articles/Spectra-large-scale.html) 71 | vignette from the *Spectra* package. 72 | 73 | # Session info 74 | 75 |
76 | 77 | Click to display session info 78 | 79 | ```{r} 80 | sessionInfo() 81 | ``` 82 |
83 | -------------------------------------------------------------------------------- /vignettes/how-to-load-gene-from-gff-gtf.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to load a gene model from a GFF or GTF file 3 | author: Bioconductor Core Team 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to load a gene model from a GFF or GTF file} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | A **gene model** is essentially a set of annotations that 28 | describes the genomic locations of the known genes, transcripts, exons, and CDS, for 29 | a given organism. The standardized file format to hold gene models is a 30 | **[GFF or GTF](https://useast.ensembl.org/info/website/upload/gff.html)**. 31 | In Bioconductor, gene model information is typically represented as a TxDb 32 | object but also sometimes as a GRanges or GRangesList object. We can use the 33 | `makeTxDbFromGFF()` function from the `r Biocpkg("txdbmaker")` package to 34 | import a GFF or GTF file as a *TxDb* object. 35 | 36 | # Bioconductor packages used in this document 37 | 38 | * `r Biocpkg("txdbmaker")` 39 | 40 | # How to load a gene model from a GFF or GTF file 41 | 42 | We will use a small .gff3 file provided by the `r Biocpkg("txdbmaker")` package. 43 | 44 | ```{r, warning=FALSE, message=FALSE} 45 | suppressPackageStartupMessages({ 46 | library(txdbmaker) 47 | }) 48 | 49 | gff_file <- system.file("extdata", "GFF3_files", "a.gff3", package="txdbmaker") 50 | 51 | txdb <- makeTxDbFromGFF(gff_file, format="gff3") 52 | txdb 53 | ``` 54 | 55 | 56 | See `?makeTxDbFromGFF` in the `r Biocpkg("txdbmaker")` package for more information. 
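To get a first impression of what the imported *TxDb* holds, one can list its sequence names and transcripts before extracting specific features. The sketch below repeats the import so that it stands alone. It assumes that attaching txdbmaker also makes the accessors `seqlevels()`, `transcripts()` and `transcriptsBy()` available, as the vignette's own use of `exonsBy()` suggests.

```r
suppressPackageStartupMessages({
    library(txdbmaker)
})

# Same example file and import call as in the chunk above
gff_file <- system.file("extdata", "GFF3_files", "a.gff3", package = "txdbmaker")
txdb <- makeTxDbFromGFF(gff_file, format = "gff3")

# Names of the sequences (chromosomes/contigs) annotated in the gene model
seqlevels(txdb)

# All transcripts as a single GRanges object
tx <- transcripts(txdb)
tx

# Transcripts grouped by gene, returned as a GRangesList
tx_by_gene <- transcriptsBy(txdb, by = "gene")
tx_by_gene
```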
57 | 58 | Extract the exon coordinates grouped by gene from this gene model: 59 | 60 | ```{r} 61 | exonsBy(txdb, by="gene") 62 | ``` 63 | 64 | # Session info 65 | 66 |
67 | 68 | Click to display session info 69 | 70 | ```{r} 71 | sessionInfo() 72 | ``` 73 |
74 | 75 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: BiocHowTo 2 | Type: Package 3 | Title: A Collection of How To Documents For Bioconductor 4 | Version: 0.1.0 5 | Authors@R: c( 6 | person( 7 | "Charlotte", "Soneson", 8 | email = "charlottesoneson@gmail.com", 9 | role = c("aut", "cre") 10 | ), 11 | person( 12 | "Hervé", "Pagès", 13 | role = "aut" 14 | ), 15 | person( 16 | "Dan", "Tenenbaum", 17 | role = "aut" 18 | ), 19 | person( 20 | "Valerie", "Obenchain", 21 | role = "aut" 22 | ), 23 | person( 24 | "Marc", "Carlson", 25 | role = "aut" 26 | ), 27 | person( 28 | "Martin", "Morgan", 29 | role = "aut" 30 | ), 31 | person( 32 | "James", "Hester", 33 | role = "aut" 34 | ), 35 | person( 36 | "Vince", "Carey", 37 | role = "aut" 38 | ), 39 | person( 40 | "Laurent", "Gatto", 41 | role = "aut" 42 | ), 43 | person( 44 | "Pedro", "Baldoni", 45 | role = "aut" 46 | ), 47 | person( 48 | "Jenny", "Drnevich", 49 | role = "aut" 50 | ), 51 | person( 52 | "Robert", "Castelo", 53 | role = "aut" 54 | ), 55 | person( 56 | "Michael", "Stadler", 57 | role = "aut" 58 | ), 59 | person( 60 | "Fabricio", "Almeida-Silva", 61 | role = "aut" 62 | ), 63 | person( 64 | "Jacques", "Serizay", 65 | role = "aut" 66 | ) 67 | ) 68 | Description: This package provides a crowd-sourced collection of 'How To' 69 | documents related to Bioconductor packages. 
70 | License: MIT + file LICENSE 71 | Encoding: UTF-8 72 | LazyData: false 73 | Suggests: 74 | knitr, 75 | rmarkdown 76 | RoxygenNote: 7.3.2 77 | VignetteBuilder: 78 | quarto 79 | Imports: 80 | quarto, 81 | BiocStyle, 82 | GenomicAlignments, 83 | pasillaBamSubset, 84 | txdbmaker, 85 | mzR, 86 | MsDataHub, 87 | Spectra, 88 | AnnotationHub, 89 | TxDb.Hsapiens.UCSC.hg38.knownGene, 90 | BSgenome.Hsapiens.UCSC.hg38, 91 | GenomicFeatures, 92 | Biostrings, 93 | Rsamtools, 94 | GSEABase, 95 | GSVA, 96 | rtracklayer, 97 | GenomicRanges, 98 | BSgenome, 99 | tidyomics, 100 | airway, 101 | readxl 102 | URL: https://github.com/bioconductor/BiocHowTo, 103 | https://bioconductor.github.io/BiocHowTo/ 104 | -------------------------------------------------------------------------------- /vignettes/how-to-read-big-bam-file-in-chunks.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to read a BAM file in chunks 3 | author: Bioconductor Core Team 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to read a BAM file in chunks} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | This HowTo has been adapted from the list of HowTos provided in the 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 
30 | 31 | # Bioconductor packages used in this document 32 | 33 | * `r Biocpkg("pasillaBamSubset")` 34 | * `r Biocpkg("GenomicAlignments")` 35 | * `r Biocpkg("Rsamtools")` 36 | 37 | # How to read a BAM file in chunks 38 | 39 | A large BAM file can be iterated through in chunks, in order to reduce the 40 | memory usage, by setting a `yieldSize` for the `BamFile` object. For 41 | illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data package. 42 | 43 | ```{r} 44 | suppressPackageStartupMessages({ 45 | library(pasillaBamSubset) 46 | library(Rsamtools) 47 | }) 48 | # Path to a bam file with single-end reads 49 | (un1 <- untreated1_chr4()) 50 | bf <- BamFile(un1, yieldSize = 100000) 51 | ``` 52 | 53 | Iteration through a BAM file requires that the file be opened, repeatedly 54 | queried inside a loop, then closed. Repeated calls to 55 | `GenomicAlignments::readGAlignments` without opening the file first result in 56 | the same 100000 records returned each time (with a `yieldSize` of 100000). 57 | As an example, let's calculate the coverage for the bam file above. 58 | 59 | ```{r} 60 | suppressPackageStartupMessages({ 61 | library(GenomicAlignments) 62 | }) 63 | open(bf) 64 | cvg <- NULL 65 | repeat { 66 | chunk <- readGAlignments(bf) 67 | if (length(chunk) == 0L) { 68 | break 69 | } 70 | chunk_cvg <- coverage(chunk) 71 | if (is.null(cvg)) { 72 | cvg <- chunk_cvg 73 | } else { 74 | cvg <- cvg + chunk_cvg 75 | } 76 | } 77 | close(bf) 78 | cvg 79 | ``` 80 | 81 | # Session info 82 | 83 |
84 | 85 | Click to display session info 86 | 87 | ```{r} 88 | sessionInfo() 89 | ``` 90 |
91 | 92 | -------------------------------------------------------------------------------- /vignettes/how-to-retrieve-gene-model-from-annotationhub.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to retrieve a gene model from AnnotationHub 3 | author: Bioconductor Core Team 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to retrieve a gene model from AnnotationHub} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | This HowTo has been adapted from the list of HowTos provided in the 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 30 | 31 | # Bioconductor packages used in this document 32 | 33 | * `r Biocpkg("AnnotationHub")` 34 | 35 | # How to retrieve a gene model from *AnnotationHub* 36 | 37 | When a gene model is not available as a `GRanges` or `GRangesList` object or as 38 | a Bioconductor data package, it may be available on `r Biocpkg("AnnotationHub")`. 39 | In this HowTo, we will look for a gene model for *Drosophila melanogaster* 40 | on `r Biocpkg("AnnotationHub")`. Create a `hub` object and then filter on *Drosophila 41 | melanogaster*: 42 | 43 | ```{r} 44 | suppressPackageStartupMessages({ 45 | library(AnnotationHub) 46 | }) 47 | ### Internet connection required! 48 | hub <- AnnotationHub() 49 | hub <- subset(hub, hub$species=='Drosophila melanogaster') 50 | ``` 51 | 52 | There are `r length(hub)` files that match *Drosophila melanogaster*.
53 | If you look at the metadata in hub, you can see that the 5th record 54 | represents a GRanges object from UCSC RefSeq Gene track. 55 | 56 | ```{r} 57 | length(hub) 58 | head(names(hub)) 59 | head(hub$title, n=10) 60 | ## then look at a specific slice of the hub object. 61 | hub[5] 62 | ``` 63 | So you can retrieve that dm3 file as a GRanges like this: 64 | 65 | ```{r} 66 | gr <- hub[[names(hub)[5]]] 67 | summary(gr) 68 | ``` 69 | 70 | The metadata fields contain the details of file origin and content. 71 | 72 | ```{r} 73 | metadata(gr) 74 | ``` 75 | Split the `GRanges` object by gene name to get a `GRangesList` object of 76 | transcript ranges grouped by gene. 77 | 78 | ```{r} 79 | txbygn <- split(gr, gr$name) 80 | ``` 81 | 82 | You can now use `txbygn` with the `summarizeOverlaps` function 83 | to prepare a table of read counts for RNA-Seq differential gene expression. 84 | 85 | # Further reading 86 | 87 | Note that before passing `txbygn` to `summarizeOverlaps`, 88 | you should confirm that the seqlevels (chromosome names) in it match those 89 | in the BAM file. See `?renameSeqlevels`, `?keepSeqlevels` 90 | and `?seqlevels` for examples of renaming seqlevels. 91 | 92 | # Session info 93 | 94 |
95 | 96 | Click to display session info 97 | 98 | ```{r} 99 | sessionInfo() 100 | ``` 101 |
102 | -------------------------------------------------------------------------------- /vignettes/how-to-read-single-end-reads-from-bam-file.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to read single-end reads from a BAM file 3 | author: Bioconductor Core Team 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to read single-end reads from a BAM file} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | This HowTo has been adapted from the list of HowTos provided in the 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 30 | 31 | # Bioconductor packages used in this document 32 | 33 | * `r Biocpkg("pasillaBamSubset")` 34 | * `r Biocpkg("GenomicAlignments")` 35 | * `r Biocpkg("Rsamtools")` 36 | 37 | # How to read single-end reads from a BAM file 38 | 39 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data 40 | package. 41 | 42 | ```{r} 43 | suppressPackageStartupMessages({ 44 | library(pasillaBamSubset) 45 | }) 46 | # Path to a bam file with single-end reads 47 | (un1 <- untreated1_chr4()) 48 | ``` 49 | 50 | Several functions are available for reading BAM files into R: 51 | 52 | * `GenomicAlignments::readGAlignments()` 53 | * `GenomicAlignments::readGAlignmentPairs()` 54 | * `GenomicAlignments::readGAlignmentsList()` 55 | * `Rsamtools::scanBam()` 56 | 57 | `scanBam` is a low-level function that returns a list of lists and is not 58 | discussed further here. See `?scanBam` in the `r Biocpkg("Rsamtools")` package 59 | for more information. 
Single-end reads can be loaded with the `readGAlignments` 60 | function from the `r Biocpkg("GenomicAlignments")` package. 61 | 62 | ```{r} 63 | suppressPackageStartupMessages({ 64 | library(GenomicAlignments) 65 | }) 66 | gal <- readGAlignments(un1) 67 | class(gal) 68 | gal 69 | ``` 70 | 71 | Data subsets can be specified by genomic position, field names, or flag 72 | criteria using `Rsamtools::ScanBamParam`. Here, as an example, we import records 73 | that overlap positions 1 to 5000 on the negative strand of chromosome 4, with 74 | `flag` and `cigar` as metadata columns. 75 | 76 | ```{r} 77 | suppressPackageStartupMessages({ 78 | library(Rsamtools) 79 | }) 80 | what <- c("flag", "cigar") 81 | which <- GRanges("chr4", IRanges(1, 5000)) 82 | flag <- scanBamFlag(isMinusStrand = TRUE) 83 | param <- ScanBamParam(which = which, what = what, flag = flag) 84 | neg <- readGAlignments(un1, param = param) 85 | neg 86 | ``` 87 | 88 | Another approach to subsetting the data is to use `Rsamtools::filterBam`. 89 | This function creates a new BAM file of records passing user-defined criteria. 90 | See `?filterBam` in the `r Biocpkg("Rsamtools")` package for more information. 91 | 92 | # Session info 93 | 94 |
95 | 96 | Click to display session info 97 | 98 | ```{r} 99 | sessionInfo() 100 | ``` 101 |
102 | 103 | -------------------------------------------------------------------------------- /vignettes/how-to-compute-read-coverage.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to compute read coverage 3 | author: Bioconductor Core Team 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to compute read coverage} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | This HowTo has been adapted from the list of HowTos provided in the 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 30 | 31 | # Bioconductor packages used in this document 32 | 33 | * `r Biocpkg("pasillaBamSubset")` 34 | * `r Biocpkg("GenomicAlignments")` 35 | 36 | # How to compute read coverage 37 | 38 | By "read coverage", we mean the number of reads that cover a given genomic position. 39 | Computing the read coverage generally consists in computing the coverage at 40 | each position in the genome. 41 | This can be done with the `coverage` function from `r Biocpkg("GenomicAlignments")`. 42 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data 43 | package. 44 | 45 | ```{r} 46 | suppressPackageStartupMessages({ 47 | library(pasillaBamSubset) 48 | }) 49 | # Path to a bam file with single-end reads 50 | (un1 <- untreated1_chr4()) 51 | ``` 52 | 53 | Single-end reads can be read with the `readGAlignments` function from 54 | `r Biocpkg("GenomicAlignments")`. 
55 | 56 | ```{r} 57 | suppressPackageStartupMessages({ 58 | library(GenomicAlignments) 59 | }) 60 | (reads <- readGAlignments(un1)) 61 | ``` 62 | 63 | The coverage can then be calculated using the `coverage` function from 64 | `r Biocpkg("GenomicAlignments")`: 65 | 66 | ```{r} 67 | (cvg <- coverage(reads)) 68 | ``` 69 | 70 | We can extract the coverage on a specific chromosome: 71 | 72 | ```{r} 73 | cvg$chr4 74 | ``` 75 | 76 | We can also perform calculations on the returned object, e.g. to get the 77 | average and maximum coverage on chromosome 4: 78 | 79 | ```{r} 80 | mean(cvg$chr4) 81 | max(cvg$chr4) 82 | ``` 83 | 84 | We can use the `slice` function to find the genomic regions where the coverage 85 | is greater than or equal to a given threshold (here, 500). 86 | 87 | ```{r} 88 | (chr4_highcov <- slice(cvg$chr4, lower = 500)) 89 | ``` 90 | 91 | The *weight* of each such region can be defined as the number of aligned 92 | nucleotides that belong to the region (i.e. the area under the coverage curve). 93 | It can be obtained with `sum`: 94 | 95 | ```{r} 96 | sum(chr4_highcov) 97 | ``` 98 | 99 | Note that `coverage` and `slice` are generic functions with methods for 100 | different types of objects. See `?coverage` or `?slice` for more information. 101 | 102 | 103 | # Further reading 104 | 105 | - The `r Biocpkg("GenomicAlignments")` vignette. 106 | 107 | # Session info 108 | 109 |
110 | 111 | Click to display session info 112 | 113 | ```{r} 114 | sessionInfo() 115 | ``` 116 |
117 | -------------------------------------------------------------------------------- /vignettes/how-to-read-paired-end-reads-from-bam-file.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to read paired-end reads from a BAM file 3 | author: Bioconductor Core Team 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to read paired-end reads from a BAM file} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | This HowTo has been adapted from the list of HowTos provided in the 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 30 | 31 | # Bioconductor packages used in this document 32 | 33 | * `r Biocpkg("pasillaBamSubset")` 34 | * `r Biocpkg("GenomicAlignments")` 35 | 36 | # How to read paired-end reads from a BAM file 37 | 38 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data 39 | package. 40 | 41 | ```{r} 42 | suppressPackageStartupMessages({ 43 | library(pasillaBamSubset) 44 | }) 45 | # Path to a bam file with paired-end reads 46 | (un3 <- untreated3_chr4()) 47 | ``` 48 | 49 | Paired-end reads can be loaded with the `GenomicAlignments::readGAlignmentPairs` 50 | or `GenomicAlignments::readGAlignmentsList` functions from the 51 | `r Biocpkg("GenomicAlignments")` package. These functions use the same mate 52 | pairing algorithm but output different objects. 
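Before loading pairs, it can be useful to confirm that a BAM file actually contains paired-end data. The sketch below uses `Rsamtools::testPairedEndBam()` for this. The check is a suggestion rather than part of the original HowTo, and it attaches `r Biocpkg("Rsamtools")` explicitly, which the package list above does not mention.

```r
suppressPackageStartupMessages({
    library(pasillaBamSubset)
    library(Rsamtools)
})

# Path to the paired-end example file used in this HowTo
un3 <- untreated3_chr4()

# Returns a single logical: TRUE if the file contains paired-end reads
is_paired <- testPairedEndBam(un3)
is_paired
```

If the check returns `FALSE`, the single-end reader `readGAlignments()` is the more appropriate choice.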
53 | 54 | Let's start with `GenomicAlignments::readGAlignmentPairs`: 55 | 56 | ```{r} 57 | suppressPackageStartupMessages({ 58 | library(GenomicAlignments) 59 | }) 60 | gapairs <- readGAlignmentPairs(un3) 61 | class(gapairs) 62 | gapairs 63 | ``` 64 | 65 | The `GAlignmentPairs` class holds only pairs; reads with no mate or with 66 | ambiguous pairing are discarded. Each list element holds exactly 2 records 67 | (a mated pair). Records can be accessed as the first and last segments in a 68 | template or as left and right alignments. See `?GAlignmentPairs` in the 69 | `r Biocpkg("GenomicAlignments")` package for more information. 70 | 71 | For `readGAlignmentsList`, mate pairing is performed when `asMates` is set to 72 | `TRUE` on the `BamFile` object; otherwise, records are treated as single-end. 73 | 74 | ```{r} 75 | galist <- readGAlignmentsList(BamFile(un3, asMates = TRUE)) 76 | galist 77 | ``` 78 | 79 | `GAlignmentsList` is a more general ‘list-like’ structure that holds mate 80 | pairs as well as non-mates (i.e., singletons, records with unmapped mates, etc.). 81 | A `mate_status` metadata column (accessed with `mcols`) indicates which 82 | records were paired. 83 | 84 | Non-mated reads are returned grouped by `QNAME`, and each group can contain any number of 85 | records. Here the non-mate groups range in size from 1 to 9. 86 | 87 | ```{r} 88 | non_mates <- galist[unlist(mcols(galist)$mate_status) == "unmated"] 89 | table(elementNROWS(non_mates)) 90 | ``` 91 | 92 | # Session info 93 | 94 |
95 | 96 | Click to display session info 97 | 98 | ```{r} 99 | sessionInfo() 100 | ``` 101 |
102 | 103 | -------------------------------------------------------------------------------- /vignettes/how-to-use-tidy-principles-for-rna-seq-analysis.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to use tidy principles for RNA-seq data analysis 3 | author: Jacques Serizay 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to use tidy principles for RNA-seq data analysis} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | Tidy data principles are a set of principles for structuring data in a way 28 | that makes it easier to manipulate and analyze. They have been popularized by the 29 | `tidyverse` collection of R packages, which includes `dplyr` for data manipulation 30 | and `ggplot2` for data visualization. The tidy principles can also be applied to 31 | biological data, such as RNA-seq data, to facilitate analysis and visualization, 32 | thanks to the `tidyomics` meta-package. 33 | 34 | # Bioconductor packages used in this document 35 | 36 | * `r Biocpkg("tidyomics")` 37 | 38 | # How to manipulate SummarizedExperiment objects with tidy principles 39 | 40 | For illustration, we use data from the `r Biocpkg("airway")` package. 41 | 42 | ```{r} 43 | library(tidyomics) 44 | data(airway, package="airway") 45 | 46 | airway 47 | ``` 48 | 49 | How the `airway` object is printed is controlled by the `tidySummarizedExperiment` package, 50 | and by default it is displayed as a `tibble`-like object. Note that this 51 | behavior can be changed by setting `options(restore_SummarizedExperiment_show = TRUE)`.
52 | 53 | ```{r} 54 | # Turn off the tibble visualisation 55 | options("restore_SummarizedExperiment_show" = TRUE) 56 | airway 57 | 58 | # Turn on the tibble visualisation 59 | options("restore_SummarizedExperiment_show" = FALSE) 60 | airway 61 | ``` 62 | 63 | The `airway` object is a `SummarizedExperiment` object, and the 64 | `tidySummarizedExperiment` package adds an invisible layer to abstract the `SE` object as a 65 | `tibble`. This allows us to use the `dplyr` package to manipulate the data, and eventually 66 | to pipe the result directly into `ggplot2` for visualization. 67 | 68 | ```{r} 69 | airway |> 70 | filter(albut == "untrt") |> 71 | group_by(.feature, dex) |> 72 | summarise(tot_counts = sum(counts)) |> 73 | pivot_wider(names_from = dex, values_from = tot_counts) |> 74 | slice_sample(n = 10000) |> 75 | ggplot(aes(x = trt, y = untrt)) + 76 | geom_density_2d(color = "black", linewidth = 0.5) + 77 | geom_point(alpha = 0.1, size = 0.3) + 78 | scale_x_log10() + 79 | scale_y_log10() + 80 | annotation_logticks(sides = 'bl') + 81 | labs(x = "Dex treated", y = "Dex untreated") + 82 | theme_bw() 83 | ``` 84 | 85 | Because these tidy methods are defined for `SummarizedExperiment` objects, 86 | most classes built on top of `SummarizedExperiment`, such as `DESeqDataSet` or `SingleCellExperiment`, 87 | can also be manipulated with tidy principles. This has prompted the development of 88 | other packages (included in the `tidyomics` meta-package) that extend the tidy principles to 89 | other analyses, such as single-cell transcriptomics analysis.
90 | 91 | ```{r} 92 | data(pbmc_small, package="tidySingleCellExperiment") 93 | pbmc_small 94 | 95 | ## We can still manipulate the data with standard SummarizedExperiment methods 96 | counts(pbmc_small)[1:5, 1:4] 97 | 98 | ## But we can also use tidy principles 99 | pbmc_small |> 100 | extract(file, "sample", "../data/([a-z0-9]+)/outs.+") |> 101 | rename(annotation = letter.idents) |> 102 | ggplot(aes(sample, nCount_RNA, fill=sample)) + 103 | geom_boxplot(outlier.shape=NA) + 104 | geom_jitter(width=0.1) + 105 | facet_wrap( ~ annotation) 106 | ``` 107 | 108 | # Further reading 109 | 110 | To read more about the `tidyomics` packages, please refer to the 111 | [tidyomics](https://github.com/tidyomics) GitHub organization page. 112 | 113 | # Session info 114 | 115 |
116 | 117 | Click to display session info 118 | 119 | ```{r} 120 | sessionInfo() 121 | ``` 122 |
123 | 124 | -------------------------------------------------------------------------------- /vignettes/how-to-get-exon-intron-sequence-for-gene.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to get the exon and intron sequences of a given gene 3 | author: Bioconductor Core Team 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to get the exon and intron sequences of a given gene} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | This HowTo has been adapted from the list of HowTos provided in the 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 30 | 31 | # Bioconductor packages used in this document 32 | 33 | * `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")` 34 | * `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")` 35 | * `r Biocpkg("GenomicFeatures")` 36 | * `r Biocpkg("Biostrings")` 37 | 38 | # How to get the exon and intron sequences of a given gene 39 | 40 | The exon and intron sequences of a gene are essentially the DNA sequences of 41 | the introns and exons of all known transcripts of the gene. The first task, 42 | therefore, is to identify all transcripts associated with the gene of interest. 43 | Our example gene is the human `TRAK2`, which is involved in regulation of 44 | endosome-to-lysosome trafficking of membrane cargo. The Entrez gene id is 45 | '66008'. 46 | 47 | ```{r} 48 | trak2 <- "66008" 49 | ``` 50 | 51 | The `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")` data package contains the 52 | gene models corresponding to the UCSC 'Known Genes' track. 
53 | 54 | ```{r} 55 | suppressPackageStartupMessages({ 56 | library(TxDb.Hsapiens.UCSC.hg38.knownGene) 57 | }) 58 | txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene 59 | ``` 60 | 61 | The transcript ranges for all the genes in the gene model can be extracted with 62 | the `transcriptsBy` function from the `r Biocpkg("GenomicFeatures")` package. 63 | They will be returned in a named `GRangesList` object containing all the 64 | transcripts grouped by gene. In order to keep only the transcripts of the 65 | `TRAK2` gene we will subset the `GRangesList` object using the `[[` operator. 66 | 67 | ```{r} 68 | suppressPackageStartupMessages({ 69 | library(GenomicFeatures) 70 | }) 71 | (trak2_txs <- transcriptsBy(txdb, by = "gene")[[trak2]]) 72 | ``` 73 | 74 | `trak2_txs` is a `GRanges` object with one range per transcript in the `TRAK2` 75 | gene. The transcript names are stored in the `tx_name` metadata column. We will 76 | need them to subset the extracted intron and exon regions: 77 | 78 | ```{r} 79 | (trak2_tx_names <- mcols(trak2_txs)$tx_name) 80 | ``` 81 | 82 | The exon and intron genomic ranges for all the transcripts in the gene model 83 | can be extracted with the `exonsBy` and `intronsByTranscript` functions, 84 | respectively. Both functions return a `GRangesList` object. Then we keep only 85 | the exon and intron for the transcripts of the `TRAK2` gene by subsetting 86 | each `GRangesList` object by the `TRAK2` transcript names. 87 | 88 | Extract the exon regions: 89 | 90 | ```{r} 91 | trak2_exbytx <- exonsBy(txdb, "tx", use.names = TRUE)[trak2_tx_names] 92 | elementNROWS(trak2_exbytx) 93 | ``` 94 | 95 | ... and the intron regions: 96 | 97 | ```{r} 98 | trak2_inbytx <- intronsByTranscript(txdb, use.names = TRUE)[trak2_tx_names] 99 | elementNROWS(trak2_inbytx) 100 | ``` 101 | 102 | Next we want the DNA sequences for these exons and introns. 
The `getSeq` 103 | function from the `r Biocpkg("Biostrings")` package can be used to query a 104 | `BSgenome` object with a set of genomic ranges and retrieve the corresponding 105 | DNA sequences. 106 | 107 | ```{r} 108 | suppressPackageStartupMessages({ 109 | library(BSgenome.Hsapiens.UCSC.hg38) 110 | }) 111 | ``` 112 | 113 | Extract the exon sequences: 114 | 115 | ```{r} 116 | (trak2_ex_seqs <- getSeq(Hsapiens, trak2_exbytx)) 117 | ``` 118 | 119 | ... and the intron sequences: 120 | 121 | ```{r} 122 | (trak2_in_seqs <- getSeq(Hsapiens, trak2_inbytx)) 123 | ``` 124 | 125 | Each list element is a `DNAStringSet` object, e.g.: 126 | 127 | ```{r} 128 | trak2_in_seqs[[1]] 129 | ``` 130 | 131 | Each sequence here corresponds to a single intron. 132 | 133 | # Further reading 134 | 135 | - The `r Biocpkg("GenomicFeatures")` vignette. 136 | 137 | # Session info 138 | 139 |
140 | 141 | Click to display session info 142 | 143 | ```{r} 144 | sessionInfo() 145 | ``` 146 |
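As a complement to the `exonsBy()` and `intronsByTranscript()` steps above, the relationship between the two can be sketched in base R: within one transcript, each intron is the gap between two consecutive exons. The exon coordinates below are made up for illustration (they are not `TRAK2` coordinates), and this is only a toy picture of the idea, not how `r Biocpkg("GenomicFeatures")` is implemented.

```{r}
# Made-up exon coordinates for a single hypothetical transcript
exons <- data.frame(
    start = c(1000, 1500, 2200),
    end   = c(1200, 1700, 2600)
)
exons <- exons[order(exons$start), ]
n_exons <- nrow(exons)

# Each intron spans the gap between two consecutive exons
introns <- data.frame(
    start = exons$end[-n_exons] + 1,
    end   = exons$start[-1] - 1
)
introns$width <- introns$end - introns$start + 1
introns
```

A transcript with `n` exons therefore has `n - 1` introns, which is a quick sanity check when comparing `elementNROWS()` of the exon and intron lists.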
147 | -------------------------------------------------------------------------------- /.github/workflows/R-CMD-check.yaml: -------------------------------------------------------------------------------- 1 | on: 2 | push: 3 | pull_request: 4 | branches: 5 | - devel 6 | - main 7 | schedule: 8 | - cron: '0 9 * * 2' 9 | 10 | name: R-CMD-check 11 | 12 | jobs: 13 | R-CMD-check: 14 | runs-on: ${{ matrix.config.os }} 15 | container: ${{ matrix.config.image }} 16 | 17 | name: ${{ matrix.config.os }} (${{ matrix.config.bioc }} - ${{ matrix.config.image }} 18 | 19 | strategy: 20 | fail-fast: false 21 | matrix: 22 | config: 23 | - {os: macos-13, bioc: 'devel'} 24 | 25 | env: 26 | R_REMOTES_NO_ERRORS_FROM_WARNINGS: true 27 | CRAN: ${{ matrix.config.cran }} 28 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} 29 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 30 | 31 | steps: 32 | - name: Check out repo 33 | uses: actions/checkout@v3 34 | 35 | - name: Set up R and install BiocManager 36 | uses: grimbough/bioc-actions/setup-bioc@v1 37 | if: matrix.config.image == null 38 | with: 39 | bioc-version: ${{ matrix.config.bioc }} 40 | 41 | - name: Install Quarto 42 | uses: quarto-dev/quarto-actions/setup@v2 43 | 44 | - name: Install pandoc 45 | uses: r-lib/actions/setup-pandoc@v2 46 | if: matrix.config.image == null 47 | 48 | - name: Install remotes and rmarkdown 49 | run: | 50 | install.packages(c('remotes', 'rmarkdown')) 51 | shell: Rscript {0} 52 | 53 | - name: Query dependencies 54 | run: | 55 | saveRDS(remotes::dev_package_deps(dependencies = TRUE, repos = c(getOption('repos'), BiocManager::repositories())), 'depends.Rds', version = 2) 56 | shell: Rscript {0} 57 | 58 | - name: Cache R packages 59 | if: runner.os != 'Windows' && matrix.config.image == null 60 | uses: actions/cache@v4 61 | with: 62 | path: ${{ env.R_LIBS_USER }} 63 | key: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}-${{ hashFiles('depends.Rds') }} 64 | restore-keys: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}- 65 | 66 | 
- name: Install system dependencies (Linux) 67 | if: runner.os == 'Linux' 68 | env: 69 | RHUB_PLATFORM: linux-x86_64-ubuntu-gcc 70 | uses: r-lib/actions/setup-r-dependencies@v2 71 | with: 72 | extra-packages: any::rcmdcheck 73 | pak-version: devel 74 | 75 | - name: Install system dependencies (macOS) 76 | if: runner.os == 'macOS' 77 | run: | 78 | brew install harfbuzz 79 | brew install fribidi 80 | 81 | - name: Install dependencies 82 | run: | 83 | local_deps <- remotes::local_package_deps(dependencies = TRUE) 84 | deps <- remotes::dev_package_deps(dependencies = TRUE, repos = BiocManager::repositories()) 85 | BiocManager::install(local_deps[local_deps %in% deps$package[deps$diff != 0]], Ncpu = 2L) 86 | remotes::install_cran('rcmdcheck', Ncpu = 2L) 87 | shell: Rscript {0} 88 | 89 | - name: Session info 90 | run: | 91 | options(width = 100) 92 | pkgs <- installed.packages()[, "Package"] 93 | sessioninfo::session_info(pkgs, include_base = TRUE) 94 | shell: Rscript {0} 95 | 96 | - name: Build, Install, Check 97 | id: build-install-check 98 | uses: grimbough/bioc-actions/build-install-check@v1 99 | 100 | - name: Upload install log if the build/install/check step fails 101 | if: always() && (steps.build-install-check.outcome == 'failure') 102 | uses: actions/upload-artifact@v4 103 | with: 104 | name: install-log 105 | path: | 106 | ${{ steps.build-install-check.outputs.install-log }} 107 | 108 | - name: Upload check results 109 | if: failure() 110 | uses: actions/upload-artifact@v4 111 | with: 112 | name: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}-results 113 | path: ${{ steps.build-install-check.outputs.check-dir }} 114 | 115 | # - name: Run BiocCheck 116 | # uses: grimbough/bioc-actions/run-BiocCheck@v1 117 | # with: 118 | # arguments: '--no-check-bioc-views --no-check-bioc-help' 119 | # error-on: 'error' 120 | # 121 | # - name: Test coverage 122 | # if: matrix.config.os == 'macOS-latest' && matrix.config.pandoc == null 123 | # run: | 124 | # 
install.packages("covr") 125 | # covr::codecov(token = "${{secrets.CODECOV_TOKEN}}") 126 | # shell: Rscript {0} 127 | # 128 | - name: Deploy 129 | if: github.event_name == 'push' && github.ref == 'refs/heads/devel' && runner.os == 'macOS' && matrix.config.bioc == 'devel' 130 | run: | 131 | R CMD INSTALL . 132 | Rscript -e "remotes::install_dev('pkgdown'); pkgdown::deploy_to_branch(new_process = FALSE)" 133 | shell: bash 134 | -------------------------------------------------------------------------------- /vignettes/how-to-read-gene-sets-from-gmt-files.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to read gene sets from GMT files 3 | author: Robert Castelo 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to read gene sets from GMT files} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | --- 18 | 19 | ```{r} 20 | #| echo: false 21 | 22 | suppressPackageStartupMessages({ 23 | library(BiocStyle) 24 | }) 25 | ``` 26 | 27 | A **gene set** is a simple yet useful way to define pathways without regard 28 | to the specific molecular interactions. It can also represent the signature 29 | of a characteristic pattern of gene expression associated with a cell type 30 | or a disease for diagnostic or prognostic purposes. An important source of 31 | gene sets is the 32 | [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb), 33 | which stores them as plain text files following the so-called 34 | [_gene matrix transposed_ (GMT) format](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29). 35 | In the GMT format, each line stores a gene set with the following values 36 | separated by tabs: 37 | 38 | * A unique gene set identifier.
39 | * A gene set description. 40 | * One or more gene identifiers. 41 | 42 | Because each gene set may consist of a different number of genes, each 43 | line in a GMT file may contain a different number of tab-separated values. This 44 | means that the GMT format is not a tabular format, and therefore cannot be directly 45 | read with base R functions such as `read.table()` or `read.csv()`. 46 | 47 | # Bioconductor packages used in this document 48 | 49 | * `r Biocpkg("GSEABase")` 50 | * `r Biocpkg("GSVA")` 51 | 52 | # How to read gene sets from GMT files 53 | 54 | The package `r Biocpkg("GSEABase")` provides object classes and methods to represent 55 | and manipulate gene sets, including reading GMT files with the function `getGmt()`. 56 | Here is a first example in which we read a GMT file from MSigDB: 57 | 58 | ```{r, message=FALSE} 59 | library(GSEABase) 60 | 61 | URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release", 62 | "2023.2.Hs/c7.immunesigdb.v2023.2.Hs.symbols.gmt", sep="/") 63 | genesets <- getGmt(URL, geneIdType=SymbolIdentifier()) 64 | genesets 65 | length(genesets) 66 | ``` 67 | In addition to the location of the GMT file, we provided the argument 68 | `geneIdType=SymbolIdentifier()`. This argument adds the necessary metadata that 69 | later on allows other software to figure out what kind of gene identifiers are 70 | used in this collection of gene sets, and to attempt mapping them to other types of 71 | identifiers, if necessary. While this argument is optional, we should always 72 | try to provide it. 73 | 74 | The output of the `getGmt()` function is a `GeneSetCollection` object, which 75 | behaves much like a list object, and many Bioconductor packages analysing 76 | gene sets may expect that kind of object as input.
However, we can coerce it to 77 | a regular `list` object with the function `geneIds()`: 78 | 79 | ```{r} 80 | genesets.list <- geneIds(genesets) 81 | head(lapply(genesets.list, head, 3), 3) 82 | ``` 83 | By definition, GMT files should have unique gene set identifiers. If we 84 | attempt to use `getGmt()` to read a malformed GMT file with duplicated 85 | identifiers, it will raise an error: 86 | 87 | ```{r, error=TRUE} 88 | URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release", 89 | "7.5/c2.all.v7.5.symbols.gmt", sep="/") 90 | genesets <- getGmt(URL, geneIdType=SymbolIdentifier()) 91 | ``` 92 | So if we still want to read the contents of that GMT file, we would have to 93 | download it and edit the lines with duplicated identifiers. An alternative is 94 | to use the function `readGMT()` of the `r Biocpkg("GSVA")` package: 95 | 96 | ```{r, message=FALSE} 97 | library(GSVA) 98 | 99 | genesets <- readGMT(URL) 100 | genesets 101 | ``` 102 | As we can see, it has detected the presence of duplicated gene set identifiers and 103 | applied a deduplication policy, which can be changed through the parameter 104 | `dedupUse`; see its help page with `?readGMT` for further deduplication 105 | policies. Like `getGmt()`, it also takes a parameter `geneIdType`, but by default 106 | it will automatically detect the type of gene identifier and fill in the 107 | corresponding metadata in the resulting `GeneSetCollection` object, so that in 108 | principle we need not specify it. 109 | 110 | # Further reading 111 | 112 | - The [Introduction to GSEABase](https://bioconductor.org/packages/release/bioc/vignettes/GSEABase/inst/doc/GSEABase.html) 113 | vignette from the `r Biocpkg("GSEABase")` package. 114 | - The help pages of the `getGmt()` and `readGMT()` functions in the 115 | `r Biocpkg("GSEABase")` and `r Biocpkg("GSVA")` packages, respectively.
116 | - The Bioconductor package `r Biocpkg("msigdb")` provides an API for 117 | directly querying and downloading MSigDB GMT files as `GeneSetCollection` 118 | objects. 119 | - Similarly to `r Biocpkg("msigdb")`, the `r CRANpkg("msigdbr")` CRAN 120 | package provides an API to query and download MSigDB GMT files, but as 121 | [tidy](https://r4ds.hadley.nz/data-tidy.html) `data.frame` objects 122 | with one gene per row. 123 | 124 | # Session info 125 | 126 |
127 | 128 | Click to display session info 129 | 130 | ```{r} 131 | sessionInfo() 132 | ``` 133 |
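To make the GMT layout described at the start of this HowTo more tangible, here is a minimal base R sketch that writes two made-up gene sets to a temporary file and parses them line by line. The set names, descriptions and file are hypothetical, and the sketch deliberately ignores everything that `getGmt()` and `readGMT()` handle for you (identifier metadata, duplicated set names, remote files), so it is meant for understanding the format rather than as a replacement for those functions.

```{r}
# Two made-up gene sets in GMT layout: identifier, description, then genes
gmt_lines <- c(
    "SET_A\tfirst toy set\tTP53\tBRCA1\tEGFR",
    "SET_B\tsecond toy set\tMYC\tTP53"
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Read whole lines and split on tabs; lines may have different lengths
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)

# Field 1 is the identifier, field 2 the description, the rest are genes
toy_sets <- lapply(fields, function(x) x[-(1:2)])
names(toy_sets) <- vapply(fields, `[`, character(1), 1)
toy_descriptions <- vapply(fields, `[`, character(1), 2)

toy_sets
lengths(toy_sets)
```

Reading whole lines with `readLines()` sidesteps the problem that the number of fields differs between lines, which is what prevents `read.table()` from reading such a file directly.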
134 | -------------------------------------------------------------------------------- /vignettes/how-to-use-tidy-principles-for-granges-manipulation.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to use tidy principles for genomic ranges manipulation 3 | author: Jacques Serizay 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to use tidy principles for genomic ranges manipulation} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | bibliography: ../inst/references.bib 18 | --- 19 | 20 | ```{r} 21 | #| echo: false 22 | 23 | suppressPackageStartupMessages({ 24 | library(BiocStyle) 25 | }) 26 | ``` 27 | 28 | Tidy data principles are a set of principles for structuring data in a way 29 | that makes it easier to manipulate and analyze. They have been popularized by the 30 | `tidyverse` collection of R packages, which includes `dplyr` for data manipulation 31 | and `ggplot2` for data visualization. The tidy principles can also be applied to 32 | biological data, such as genomic ranges, to facilitate analysis and visualization, 33 | thanks to the `tidyomics` meta-package. 34 | 35 | # Bioconductor packages used in this document 36 | 37 | * `r Biocpkg("tidyomics")` 38 | 39 | # How to manipulate GRanges objects with tidy principles 40 | 41 | For illustration, we are going to download an `xlsx` file provided as 42 | Supplementary Data in [@Serizay2020Oct], and manipulate it using tidy principles. 43 | 44 | Bioconductor good practices recommend using 45 | the `BiocFileCache` package to download and cache files from remote servers.
46 | 47 | ```{r} 48 | library(BiocFileCache) 49 | 50 | # Create a BiocFileCache instance 51 | bfc <- BiocFileCache() 52 | 53 | # Download the xlsx file from the Genome Research website and cache it locally using BiocFileCache 54 | xlsx_url <- "https://genome.cshlp.org/content/suppl/2020/11/16/gr.265934.120.DC1/Supplemental_Table_S2.xlsx" 55 | xlsx_file <- bfcrpath(bfc, xlsx_url) 56 | 57 | # The file is now downloaded and available 58 | xlsx_file 59 | ``` 60 | 61 | Excel files can be read into tibbles in R thanks to the `read_xlsx()` function 62 | from the `readxl` package. 63 | 64 | ```{r} 65 | library(readxl) 66 | tbl <- read_xlsx(xlsx_file, sheet = "ATAC_metrics", skip = 2) 67 | tbl 68 | ``` 69 | 70 | We convert it into a `GRanges` object directly using `as_granges()` from the `plyranges` package, 71 | using [tidy evaluation as in the `dplyr` package](https://dplyr.tidyverse.org/articles/programming.html). 72 | 73 | ```{r} 74 | library(tidyomics) 75 | gr <- tbl |> 76 | as_granges(seqnames = chrom_ce11, start = start_ce11, end = end_ce11) 77 | gr 78 | ``` 79 | 80 | Let's say we are not interested in all the metadata columns, and we want to keep only the 81 | `annot_dev_age_tissues` and `Annotation` columns, and the columns ending with `_RPM`. 82 | We can use the `select()` function from `dplyr` to select these columns, 83 | just like we would do with a `data.frame` or `tibble`. 84 | 85 | ```{r} 86 | gr <- gr |> 87 | select(annot_dev_age_tissues, Annotation, ends_with("_RPM")) 88 | gr 89 | ``` 90 | 91 | In fact, most verbs from the `dplyr` package can be used with `GRanges` objects, 92 | such as `filter()`, `mutate()`, `arrange()`, `group_by()` and `summarise()`.
93 | 94 | ```{r} 95 | gr |> 96 | mutate(start = start - 1000, end = end + 1000) |> 97 | filter(start > 1e6) |> 98 | arrange(Intest._RPM) |> 99 | group_by(annot_dev_age_tissues, Annotation) |> 100 | summarize(n = plyranges::n()) 101 | ``` 102 | 103 | Note that the `summarize()` function automatically returns a `DataFrame`, not a `GRanges` object. 104 | 105 | To enhance compatibility with the `tidyverse` ecosystem, the `plyranges` package 106 | provides an `as_tibble()` method for `GRanges` objects, which converts them 107 | into tibbles. This allows you to pass `GRanges` objects to `ggplot` for visualization. 108 | 109 | ```{r} 110 | gr |> 111 | filter(Annotation %in% c("Intest.", "Hypod.", "Soma", "Ubiq.")) |> 112 | as_tibble() |> 113 | ggplot(aes(x = `Intest._RPM`, y = `Hypod._RPM`, color = Annotation)) + 114 | geom_point() + 115 | labs(x = "Intestine RPM", y = "Hypodermis RPM") + 116 | facet_grid(Annotation~annot_dev_age_tissues) + 117 | theme_bw() + 118 | theme(legend.position = "bottom") 119 | ``` 120 | 121 | Finally, the `join` concept introduced by the `dplyr` package can also be applied to `GRanges` objects. 122 | In this case, the operation refers to joining overlapping, nearest, or preceding/following ranges, 123 | using the `join_*_*()` family of functions from the `plyranges` package.
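To make the idea of an overlap join concrete, here is a minimal base R sketch of what a left overlap join computes, using plain data frames rather than `GRanges` objects. The tables `peaks` and `regions`, their coordinates and their column names are made up for illustration, and `plyranges` does not work this way internally; the sketch only mirrors the kind of result one would expect from `join_overlap_left()`.

```{r}
# Made-up query intervals and annotation regions
peaks <- data.frame(
    seqnames = c("chrI", "chrI", "chrII"),
    start    = c(100, 5000, 300),
    end      = c(200, 5100, 400),
    peak_id  = c("p1", "p2", "p3")
)
regions <- data.frame(
    seqnames = c("chrI", "chrI"),
    start    = c(1, 4000),
    end      = c(1000, 6000),
    part     = c("arm", "center")
)

# Pair every peak with every region located on the same sequence
candidates <- merge(peaks, regions, by = "seqnames",
                    suffixes = c("", ".region"))

# Two closed intervals overlap when each one starts before the other ends
is_hit <- candidates$start <= candidates$end.region &
    candidates$start.region <= candidates$end
hits <- candidates[is_hit, c("peak_id", "part")]

# Left join: keep every peak, with NA where no region overlaps
overlap_left <- merge(peaks, hits, by = "peak_id", all.x = TRUE)
overlap_left
```

The peak on `chrII` has no overlapping region, so it is kept with a missing `part`, which is the behaviour that distinguishes a left join from an inner join.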
124 | 125 | As an example, we can compare annotations from [@Serizay2020Oct] to chromosome 126 | arms/center 127 | 128 | ```{r} 129 | arms <- GRanges( 130 | seqnames = rep(c("chrI", "chrII", "chrIII", "chrIV", "chrV", "chrX"), each = 3), 131 | ranges = IRanges( 132 | start = c( 133 | 1, 4550001, 10750001, 134 | 1, 4550001, 10750001, 135 | 1, 4450001, 10150001, 136 | 1, 4750001, 12650001, 137 | 1, 4850001, 16250001, 138 | 1, 2850001, 14250001 139 | ), 140 | end = c( 141 | 4550000, 10750000, 15072423, 142 | 4550000, 10750000, 15279345, 143 | 4450000, 10150000, 13783700, 144 | 4750000, 12650000, 17493793, 145 | 4850000, 16250000, 20924149, 146 | 2850000, 14250000, 17718866 147 | ) 148 | ), 149 | strand = "*", 150 | part = rep(c("arm", "center", "arm"), 6) 151 | ) 152 | 153 | gr |> 154 | join_overlap_left(arms) |> 155 | group_by(Annotation, part) |> 156 | filter(Annotation %in% c("Germline", "Neurons")) |> 157 | summarize(n = plyranges::n()) 158 | ``` 159 | 160 | # Further reading 161 | 162 | To read more about the `tidyomics` packages, please refer to the 163 | [tidyomics](https://github.com/tidyomics) GitHub organization page. 164 | 165 | More tutorials are provided by the `tidyomics` GitHub organization in the 166 | [tidy-ranges-tutorial](https://tidyomics.github.io/tidy-ranges-tutorial/) website. 167 | 168 | # References 169 | 170 | ::: {#refs} 171 | ::: 172 | 173 | # Session info 174 | 175 |
176 | 177 | Click to display session info 178 | 179 | ```{r} 180 | sessionInfo() 181 | ``` 182 |
183 | 184 | -------------------------------------------------------------------------------- /vignettes/how-to-compute-sequence-composition-for-genomic-regions.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to compute sequence composition for genomic regions 3 | author: Michael Stadler 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to compute sequence composition for genomic regions} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | embed-resources: true 18 | --- 19 | 20 | ```{r} 21 | #| echo: false 22 | 23 | suppressPackageStartupMessages({ 24 | library(BiocStyle) 25 | }) 26 | ``` 27 | 28 | This HowTo shows how to fetch sequences for given regions from a reference and 29 | compute composition features for them. 30 | 31 | # Bioconductor packages used in this document 32 | 33 | * `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")` 34 | * `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")` 35 | * `r Biocpkg("GenomicFeatures")` 36 | * `r Biocpkg("Biostrings")` 37 | 38 | # How to compute sequence composition for genomic regions 39 | 40 | Mammalian genomes contain regulatory regions with a high content of CpG 41 | dinucleotides, so called *CpG islands*, a side effect indirectly caused by 42 | the methylation of cytosines in CpG contexts. Many transcripts contain 43 | a CpG island in their promoter region, near the start of the transcript. 44 | In this HowTo, we will identify such transcripts by analyzing the CpG 45 | composition of promoter sequences. 
46 | 47 | We first obtain the coordinates of promoter regions from the 48 | `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")` package, using the 49 | `promoters` function from the `r Biocpkg("GenomicFeatures")` package: 50 | 51 | ```{r} 52 | suppressPackageStartupMessages({ 53 | library(GenomicFeatures) 54 | library(TxDb.Hsapiens.UCSC.hg38.knownGene) 55 | }) 56 | txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene 57 | prom <- promoters(txdb, upstream = 1000, downstream = 1000) 58 | prom 59 | ``` 60 | 61 | Here, promoter regions are defined as 2000 base pair regions centered on 62 | transcript start sites (the 5'-end of a transcript). The warning message above 63 | is caused by start sites that are near the boundary of a chromosome, such that 64 | the promoter region will extend beyond its beginning or end. By calling `trim` 65 | on the `GRanges` object, all regions are truncated to the limits of the 66 | chromosomes, resulting in a few promoters shorter than 2000 base pairs: 67 | 68 | ```{r} 69 | summary(width(prom) == 2000) 70 | prom <- trim(prom) 71 | summary(width(prom) == 2000) 72 | ``` 73 | 74 | It may be useful to now subset the extracted promoters, for example to only 75 | retain promoters on autosomes, or only a single promoter per gene. 
76 | Here, we will remove redundant promoters resulting from transcripts with 77 | identical start sites: 78 | 79 | ```{r} 80 | summary(duplicated(prom)) 81 | prom <- prom[!duplicated(prom)] 82 | ``` 83 | 84 | Next, we extract the sequences of the promoters from the reference genome 85 | assembly in the `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")` package: 86 | 87 | ```{r} 88 | suppressPackageStartupMessages({ 89 | library(Biostrings) 90 | library(BSgenome.Hsapiens.UCSC.hg38) 91 | }) 92 | promseq <- getSeq(BSgenome.Hsapiens.UCSC.hg38, prom) 93 | promseq 94 | ``` 95 | 96 | Finally, we calculate some sequence features: The percent of G and C bases, and 97 | the ratio of observed over expected CpG dinucleotides (where the expected 98 | frequency of a dinucleotide is the product of the frequencies of the two 99 | mononucleotides). We obtain the mono- and dinucleotide frequencies from the 100 | sequences using the `oligonucleotideFrequency` function from the 101 | `r Biocpkg("Biostrings")` package: 102 | 103 | ```{r} 104 | mo <- oligonucleotideFrequency(promseq, width = 1, as.prob = TRUE) 105 | di <- oligonucleotideFrequency(promseq, width = 2, as.prob = TRUE) 106 | 107 | head(mo) 108 | head(di) 109 | 110 | percentGC <- 100 * (mo[, "C"] + mo[, "G"]) 111 | expectedCpG <- mo[, "C"] * mo[, "G"] 112 | obsExpRatioCpG <- di[, "CG"] / expectedCpG 113 | ``` 114 | 115 | Finally, we can look at the distribution of these sequence features. 
116 | The percentage of G+C bases in promoters spans a broad range from about 30% to 117 | over 70%: 118 | 119 | ```{r} 120 | #| fig.width: 6 121 | #| fig.height: 6 122 | hist(percentGC, 60) 123 | ``` 124 | 125 | Defining CpG island promoters using this distribution would be hard, but is 126 | much easier using the observed over expected ratio of CpGs, which shows a 127 | much clearer bimodal structure: 128 | 129 | ```{r} 130 | #| fig.width: 6 131 | #| fig.height: 6 132 | hist(obsExpRatioCpG, 60) 133 | abline(v = 0.45, col = 2, lty = 2) 134 | ``` 135 | 136 | Interestingly, almost all ratios are below 1.0, indicating that CpG islands 137 | are in fact not enriched for CpG dinucleotides, but rather lose them at a 138 | lower rate than other genomic regions. 139 | 140 | Plotting the ratio versus the GC percentage furthermore reveals that CpG island 141 | promoters are also slightly GC-richer: 142 | 143 | ```{r} 144 | #| fig.width: 6 145 | #| fig.height: 6 146 | plot(obsExpRatioCpG, percentGC, pch = "*", col = "#22222205") 147 | abline(v = 0.45, col = 2, lty = 2) 148 | ``` 149 | 150 | # Further reading 151 | 152 | More examples of how to represent and work with biological sequences are 153 | contained in the vignettes of the `r Biocpkg("Biostrings")` package, 154 | available for example from `vignette(package = "Biostrings")`. 155 | 156 | If no `BSgenome` object is available for your genome of interest, you can also 157 | use alternative sources to extract your sequences from, such as `DNAStringSet` 158 | objects or files in fasta or 2bit format. A list of inputs supported by the 159 | `getSeq` function can be obtained using: 160 | 161 | ```{r} 162 | showMethods(getSeq) 163 | ``` 164 | 165 | For more information about CpG islands, see: 166 | 167 | - Bird A, Taggart M, Frommer M, Miller OJ, Macleod D. A fraction of the mouse 168 | genome that is derived from islands of nonmethylated, CpG-rich DNA. 169 | Cell. 1985 Jan;40(1):91-9. doi: 10.1016/0092-8674(85)90312-5. 
PMID: 2981636. 170 | - Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. 171 | J Mol Biol. 1987 Jul 20;196(2):261-82. 172 | doi: 10.1016/0022-2836(87)90689-9. PMID: 3656447. 173 | 174 | # Session info 175 | 176 | <details>
177 | <summary> 178 | Click to display session info 179 | </summary> 180 | ```{r} 181 | sessionInfo() 182 | ``` 183 | </details>
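# Appendix: the CpG ratio on a toy sequence

The observed-over-expected CpG ratio used in this HowTo can be checked by hand on a short sequence. The sketch below uses only base R and a made-up 12-base sequence; `toy_seq` and all other names in it are assumptions for illustration, not data or objects from the analysis above. It counts dinucleotides in overlapping windows, which is our understanding of how `oligonucleotideFrequency()` counts by default.

```r
# Toy sequence, made up for illustration (not taken from the promoters above)
toy_seq <- "ACGCGTTACGGC"
bases <- strsplit(toy_seq, "")[[1]]

# Mononucleotide frequencies as a named numeric vector
mono <- vapply(
    c(A = "A", C = "C", G = "G", T = "T"),
    function(b) mean(bases == b),
    numeric(1)
)

# Dinucleotides from all overlapping windows of width 2
n <- length(bases)
dinucs <- paste0(bases[-n], bases[-1])
obs_cpg <- mean(dinucs == "CG")

# Expected CpG frequency: product of the C and G frequencies
exp_cpg <- mono[["C"]] * mono[["G"]]

toy_percent_gc <- 100 * (mono[["C"]] + mono[["G"]])
toy_ratio <- obs_cpg / exp_cpg

toy_percent_gc
toy_ratio
```

The toy sequence contains 3 CpGs in 11 windows (observed frequency 3/11), and C and G each make up 4 of the 12 bases (expected frequency 1/9), so the ratio is 27/11, or about 2.45.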
184 | -------------------------------------------------------------------------------- /vignettes/how-to-extract-promoter-sequences.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: How to extract promoter sequences 3 | author: Fabricio Almeida-Silva 4 | date: "`r Sys.Date()`" 5 | vignette: > 6 | %\VignetteIndexEntry{How to extract promoter sequences} 7 | %\VignetteEngine{quarto::html} 8 | %\VignetteEncoding{UTF-8} 9 | knitr: 10 | opts_chunk: 11 | collapse: true 12 | comment: '#>' 13 | format: 14 | html: 15 | toc: true 16 | html-math-method: mathjax 17 | embed-resources: true 18 | --- 19 | 20 | ```{r} 21 | #| echo: false 22 | 23 | suppressPackageStartupMessages({ 24 | library(BiocStyle) 25 | }) 26 | ``` 27 | 28 | Promoters are DNA sequences typically located upstream of a 29 | gene's transcription start site (TSS), and they contain binding 30 | sites for transcription factors. 31 | Researchers working on gene regulation often want to extract 32 | promoter sequences for all genes or a set of genes. With these sequences, 33 | researchers can, for example, look for known transcription factor 34 | binding sites (TFBS), or predict TFBS *de novo*. This HowTo will demonstrate 35 | how to extract promoter sequences for any gene(s) in a genome. 36 | 37 | # Bioconductor packages used in this document 38 | 39 | * `r Biocpkg("Biostrings")` 40 | * `r Biocpkg("rtracklayer")` 41 | * `r Biocpkg("GenomicRanges")` 42 | * `r Biocpkg("txdbmaker")` 43 | * `r Biocpkg("GenomicFeatures")` 44 | * `r Biocpkg("BSgenome")` 45 | 46 | # How to extract promoter sequences 47 | 48 | We will start by loading the required packages. 49 | 50 | ```{r} 51 | #| message: false 52 | 53 | library(Biostrings) 54 | library(GenomicRanges) 55 | library(rtracklayer) 56 | library(txdbmaker) 57 | library(GenomicFeatures) 58 | library(BSgenome) 59 | ``` 60 | 61 | Then, we will get our example data set. 
Here, we will use data for 62 | the model plant *Arabidopsis thaliana*. We will start by obtaining 63 | genome sequences and gene annotation from [Ensembl Plants](https://plants.ensembl.org). 64 | 65 | ::: {.callout-note} 66 | 67 | ## Working with model and non-model organisms 68 | 69 | Here, we're obtaining data from [Ensembl Plants](https://plants.ensembl.org) 70 | to explicitly show how this can be done for **any** organism. However, 71 | if you're lucky enough to work with a model organism (e.g., human or mouse), 72 | you can find curated genome sequences and gene annotation 73 | in `r Biocpkg("AnnotationHub")` (see, for instance, 74 | [this](https://bioconductor.github.io/BiocHowTo/articles/how-to-retrieve-gene-model-from-annotationhub.html) HowTo article). 75 | 76 | ::: 77 | 78 | ```{r} 79 | # Read genomic data from Ensembl Plants 80 | ## Genome 81 | genome <- Biostrings::readDNAStringSet( 82 | "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz" 83 | ) 84 | names(genome) <- gsub(" .*", "", names(genome)) # keep only the chromosome name 85 | 86 | ## Annotation 87 | annotation <- rtracklayer::import( 88 | "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.61.gff3.gz" 89 | ) 90 | ``` 91 | 92 | Genome sequences are stored in `DNAStringSet` objects, and 93 | gene annotation (i.e., start and end coordinates of each gene) is stored in 94 | `GRanges` objects. This is what these objects look like: 95 | 96 | ```{r} 97 | # Take a look at the genome and annotation 98 | genome 99 | annotation 100 | ``` 101 | 102 | Note that our `GRanges` object contains ranges for multiple types of features, 103 | including genes, mRNAs, exons, and even entire chromosomes. 104 | Since we're only interested in the promoter regions of genes, we will subset 105 | `annotation` to keep only 'gene' ranges.
106 | 107 | ```{r} 108 | # Keep only gene ranges 109 | genes <- annotation[annotation$type == "gene"] 110 | genes 111 | ``` 112 | 113 | Next, we can use the `promoters()` function from 114 | `r Biocpkg("GenomicRanges")` to extract the genomic coordinates of promoters. 115 | By default, promoters start 2000 bp upstream of the transcription start 116 | site (TSS) and end 200 bp downstream of the TSS, but you can change these 117 | distances with the `upstream` and `downstream` arguments of `promoters()`. 118 | This matters because, for example, some transcription factor families in plants 119 | (e.g., HSF and C2C2-GATA) bind exclusively to regions downstream of the TSS, which 120 | is contrary to the common view of TFBS being located upstream of TSSs. 121 | Importantly, if a gene has multiple TSSs (e.g., because of different 122 | transcripts starting at different positions), the most upstream TSS (i.e., the 5' end of the gene range) is 123 | used, not a cleverly chosen 'representative' or 'canonical' transcript. 124 | 125 | ```{r} 126 | # Get promoter regions from a `GRanges` object 127 | prom_genes <- GenomicRanges::promoters(genes) 128 | 129 | prom_genes 130 | ``` 131 | 132 | Alternatively, if you want to extract promoters for each transcript separately, 133 | you can subset the `GRanges` object to extract only ranges of 134 | type 'transcript', and then use `promoters()` as demonstrated above. 135 | Note that GFF3 files from some databases do not contain a 'transcript' type. 136 | In these cases, you will most likely find transcript-level ranges in rows of 137 | type 'mRNA'. This is the case for [Ensembl Plants](https://plants.ensembl.org), 138 | from which we obtained our example data. 139 | To extract promoters for each *A.
thaliana* transcript, you would run: 140 | 141 | ```{r} 142 | # Extract transcript ranges from a GRanges object 143 | tx_ranges <- annotation[annotation$type == "mRNA"] # or 'transcript' (if any) 144 | 145 | # Extract promoters for each transcript separately 146 | prom_tx <- promoters(tx_ranges) 147 | prom_tx 148 | ``` 149 | 150 | ::: {.callout-tip} 151 | 152 | ## Extracting promoter regions from `TxDb` objects 153 | 154 | If you have a `TxDb` object with transcript coordinates and metadata, 155 | you can extract promoter regions for each transcript with the `promoters()` 156 | function from the `r Biocpkg("GenomicFeatures")` package. Note that the function names 157 | are the same, but `GenomicFeatures::promoters()` takes `TxDb` objects as input, 158 | while `GenomicRanges::promoters()` takes `GRanges` objects as input. 159 | 160 | ```{r} 161 | # Create a `TxDb` object from a `GRanges` object 162 | txdb <- txdbmaker::makeTxDbFromGRanges(annotation) 163 | 164 | # Get promoter regions for all transcripts 165 | prom_tx2 <- GenomicFeatures::promoters(txdb) 166 | 167 | prom_tx2 168 | ``` 169 | 170 | ::: 171 | 172 | Finally, you can extract sequences given a genome and some coordinates with 173 | the `getSeq()` function from `r Biocpkg("BSgenome")`. 174 | 175 | ```{r} 176 | # Extract promoter sequences (first 1,000 genes only, for demonstration purposes) 177 | prom_seqs <- getSeq(genome, prom_genes[1:1000]) 178 | 179 | prom_seqs 180 | ``` 181 | 182 | 183 | # Further reading 184 | 185 | To learn more about how to work with sequences and ranges using Bioconductor 186 | packages, look at the vignettes of the `r Biocpkg("Biostrings")` and 187 | `r Biocpkg("GenomicRanges")` packages. 188 | 189 | # Session info 190 | 191 | <details>
192 | <summary> 193 | Click to display session info 194 | </summary> 195 | ```{r} 196 | sessionInfo() 197 | ``` 198 | </details>
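# Appendix: how promoter coordinates depend on strand

Because `promoters()` is strand-aware, "upstream" means lower coordinates for plus-strand genes and higher coordinates for minus-strand genes. The base R sketch below reproduces that arithmetic as we understand it from the `r Biocpkg("GenomicRanges")` documentation, using the default distances described in this HowTo. `promoter_coords()` is a hypothetical helper written for this illustration and is not part of any package; the two genes are made up.

```r
# Hypothetical helper (not part of any package): promoter coordinates from
# gene coordinates, mirroring the defaults upstream = 2000, downstream = 200
promoter_coords <- function(start, end, strand,
                            upstream = 2000, downstream = 200) {
    on_minus <- strand == "-"
    # The TSS is the gene start on the plus strand and the gene end on the
    # minus strand
    tss <- ifelse(on_minus, end, start)
    data.frame(
        start = ifelse(on_minus, tss - downstream + 1, tss - upstream),
        end = ifelse(on_minus, tss + upstream, tss + downstream - 1),
        strand = strand
    )
}

# Two made-up genes with identical coordinates on opposite strands
toy_prom <- promoter_coords(
    start = c(5000, 5000),
    end = c(8000, 8000),
    strand = c("+", "-")
)
toy_prom
```

Both toy promoters are 2200 bp wide, but they sit at opposite ends of the gene range: 3000 to 5199 for the plus-strand gene and 7801 to 10000 for the minus-strand gene. The real function operates on `GRanges` objects rather than plain vectors.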
199 | --------------------------------------------------------------------------------