├── LICENSE
├── NAMESPACE
├── man
│   └── figures
│       └── bioconductor.png
├── .Rbuildignore
├── pkgdown
│   └── extra.css
├── inst
│   ├── references.bib
│   └── templates
│       └── howto_template.qmd
├── LICENSE.md
├── _pkgdown.yml
├── README.md
├── vignettes
│   ├── how-to-read-mass-spectrometry-data.qmd
│   ├── how-to-load-gene-from-gff-gtf.qmd
│   ├── how-to-read-big-bam-file-in-chunks.qmd
│   ├── how-to-retrieve-gene-model-from-annotationhub.qmd
│   ├── how-to-read-single-end-reads-from-bam-file.qmd
│   ├── how-to-compute-read-coverage.qmd
│   ├── how-to-read-paired-end-reads-from-bam-file.qmd
│   ├── how-to-use-tidy-principles-for-rna-seq-analysis.qmd
│   ├── how-to-get-exon-intron-sequence-for-gene.qmd
│   ├── how-to-read-gene-sets-from-gmt-files.qmd
│   ├── how-to-use-tidy-principles-for-granges-manipulation.qmd
│   ├── how-to-compute-sequence-composition-for-genomic-regions.qmd
│   └── how-to-extract-promoter-sequences.qmd
├── DESCRIPTION
└── .github
    └── workflows
        └── R-CMD-check.yaml
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2025
2 | COPYRIGHT HOLDER: BiocHowTo authors
3 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 |
--------------------------------------------------------------------------------
/man/figures/bioconductor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/BiocHowTo/devel/man/figures/bioconductor.png
--------------------------------------------------------------------------------
/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | ^LICENSE\.md$
4 | ^vignettes/.*_files$
5 | ^_pkgdown\.yml$
6 | ^\.github$
7 |
--------------------------------------------------------------------------------
/pkgdown/extra.css:
--------------------------------------------------------------------------------
1 | .navbar {
2 | background: linear-gradient(to right, #3a85ab, #4ca139) !important;
3 | background-image: linear-gradient(to right, #3a85ab, #4ca139) !important;
4 | border: none !important;
5 | }
6 |
7 |
--------------------------------------------------------------------------------
/inst/references.bib:
--------------------------------------------------------------------------------
1 | @article{Serizay2020Oct,
2 | author = {Serizay, Jacques and Dong, Yan and J{\ifmmode\ddot{a}\else\"{a}\fi}nes, J{\ifmmode\ddot{u}\else\"{u}\fi}rgen and Chesney, Michael and Cerrato, Chiara and Ahringer, Julie},
3 | title = {{Distinctive regulatory architectures of germline-active and somatic genes in C. elegans}},
4 | journal = {Genome Res.},
5 | volume = {30},
6 | number = {12},
7 | pages = {1752--1765},
8 | year = {2020},
9 | month = oct,
10 | issn = {1088-9051},
11 | publisher = {Cold Spring Harbor Lab},
12 | doi = {10.1101/gr.265934.120}
13 | }
14 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | # MIT License
2 |
3 | Copyright (c) 2025 BiocHowTo authors
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/_pkgdown.yml:
--------------------------------------------------------------------------------
1 | url: https://bioconductor.github.io/BiocHowTo/
2 | template:
3 |   bootstrap: 5
4 |   params:
5 |     bootswatch: zephyr
6 |
7 | navbar:
8 |   structure:
9 |     left: [home, articles]
10 |     right: [search, github]
11 |
12 | articles:
13 | - title: Interacting with BAM files
14 |   contents:
15 |   - how-to-read-single-end-reads-from-bam-file
16 |   - how-to-read-paired-end-reads-from-bam-file
17 |   - how-to-read-big-bam-file-in-chunks
18 |   - how-to-compute-read-coverage
19 |
20 | - title: Mass spectrometry
21 |   contents:
22 |   - how-to-read-mass-spectrometry-data
23 |
24 | - title: Tidy principles in Bioconductor
25 |   contents:
26 |   - how-to-use-tidy-principles-for-granges-manipulation
27 |   - how-to-use-tidy-principles-for-rna-seq-analysis
28 |
29 | - title: Accessing genomic annotation data
30 |   contents:
31 |   - how-to-retrieve-gene-model-from-annotationhub
32 |   - how-to-load-gene-from-gff-gtf
33 |   - how-to-get-exon-intron-sequence-for-gene
34 |   - how-to-read-gene-sets-from-gmt-files
35 |
36 | - title: Working with biological sequences
37 |   contents:
38 |   - how-to-compute-sequence-composition-for-genomic-regions
39 |   - how-to-extract-promoter-sequences
40 |
--------------------------------------------------------------------------------
/inst/templates/howto_template.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: EDIT:HowToTitle
3 | author: EDIT:HowToAuthor
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{EDIT:HowToTitle}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | embed-resources: true
18 | ---
19 |
20 | ```{r}
21 | #| echo: false
22 |
23 | suppressPackageStartupMessages({
24 | library(BiocStyle)
25 | })
26 | ```
27 |
28 | EDIT:Short introduction
29 |
30 | # Bioconductor packages used in this document
31 |
32 | EDIT:Add list of packages used in the HowTo. Use `r Biocpkg("PkgName")` to
33 | refer to Bioconductor packages, and similarly for other package sources
34 | (see https://bioconductor.org/packages/release/bioc/vignettes/BiocStyle/inst/doc/AuthoringRmdVignettes.html#4_Style_macros). Example:
35 | * `r Biocpkg("pasillaBamSubset")`
36 |
37 | # EDIT:Main section
38 |
39 | EDIT:Here, put the code and explanations for your HowTo. Keep in mind that
40 | HowTos should be short, and address a well-defined, specific task using
41 | Bioconductor.
42 |
43 | # Further reading
44 |
45 | EDIT:Add a (short) list of useful resources about the topic.
46 |
47 | # Session info
48 |
49 |
50 |
51 | Click to display session info
52 |
53 | ```{r}
54 | sessionInfo()
55 | ```
56 |
57 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 | # BiocHowTo
8 |
9 |
10 |
11 | Welcome to the Bioconductor HowTo collection! The HowTos are short,
12 | stand-alone documents, each focused on how to address a well-defined question
13 | using one or more Bioconductor packages.
14 |
15 | For a list of the available HowTo documents, see the [Articles](articles/index.html) tab.
16 |
17 | ## How to contribute
18 |
19 | 1. [Fork this repository](https://github.com/Bioconductor/BiocHowTo/fork) and create a new branch.
20 | 2. Copy and rename the `howto_template.qmd` file from `inst/templates` to
21 | the `vignettes` directory, and edit the copy accordingly. Please choose the
22 | name for the new vignette in such a way that it is unique and clearly indicates
23 | the content. Note that the title of the vignette should be added both to the
24 | `title` field and the `%\VignetteIndexEntry` in the YAML section.
25 | 3. Test the vignette locally in a fresh R session, to make sure that it is
26 | self-contained and runs without errors.
27 | 4. Add your name to the `Authors@R` field in the `DESCRIPTION` file.
28 | 5. If your HowTo is using packages that are not already included among the
29 | dependencies of `BiocHowTo`, please add them to the list of `Imports` in the
30 | `DESCRIPTION` file. One way to do this is via the `usethis` package:
31 |
32 | ```
33 | usethis::use_package("name_of_new_dependency")
34 | ```
35 |
36 | 6. Add the new vignette to a suitable section under 'Articles' in the
37 | `_pkgdown.yml` file.
38 | 7. Push the changes to your forked repository and open a pull request to
39 | the `devel` branch of the parent repository.
40 | 8. All submissions will be reviewed by the Bioconductor Training Committee.
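
Step 2 above can also be scripted. The helper below is a minimal sketch and not part of the package: `new_howto()` and its arguments are hypothetical names, and it assumes that `repo` points at the root of a local clone containing `inst/templates/howto_template.qmd` and a `vignettes` directory. Because the template uses the same `EDIT:HowToTitle` placeholder in the `title` field and in `%\VignetteIndexEntry`, a single substitution keeps the two in sync.

```r
# Hypothetical helper: copy the HowTo template into vignettes/ and fill in
# the title and author placeholders used by howto_template.qmd.
new_howto <- function(slug, title, author, repo = ".") {
  template <- file.path(repo, "inst", "templates", "howto_template.qmd")
  target <- file.path(repo, "vignettes", paste0(slug, ".qmd"))
  # Refuse to run outside a clone, or to overwrite an existing vignette
  stopifnot(file.exists(template), !file.exists(target))
  lines <- readLines(template)
  # The title placeholder occurs twice (title and VignetteIndexEntry)
  lines <- gsub("EDIT:HowToTitle", title, lines, fixed = TRUE)
  lines <- gsub("EDIT:HowToAuthor", author, lines, fixed = TRUE)
  writeLines(lines, target)
  invisible(target)
}

# Example call from the repository root (file name and author are made up):
# new_howto("how-to-do-something", "How to do something", "Jane Doe")
```

The remaining `EDIT:` markers in the copied file still need to be filled in by hand.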
41 |
42 |
43 | ## How to suggest a new topic
44 |
45 | To suggest a new topic for a HowTo, open an issue and provide some more
46 | details of your suggestion. If you would like to take on one of the open
47 | issues, either assign yourself (if possible), or comment in the issue, and
48 | one of the administrators can assign you.
49 |
--------------------------------------------------------------------------------
/vignettes/how-to-read-mass-spectrometry-data.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to read mass spectrometry data
3 | author: Laurent Gatto
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{How to read mass spectrometry data}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 | library(BiocStyle)
24 | })
25 | ```
26 |
27 | # Bioconductor packages used in this document
28 |
29 | * `r Biocpkg("Spectra")`
30 | * `r Biocpkg("mzR")`
31 | * `r Biocpkg("MsDataHub")`
32 |
33 | # How to read mass spectrometry data
34 |
35 | Let's first get some example raw mass spectrometry data from the
36 | *MsDataHub* package. Below, we download two _sciex_ mzML files that
37 | represent profile-mode LC-MS data of pooled human serum samples and
38 | create a vector of file names, `fls`:
39 |
40 | ```{r}
41 | library(MsDataHub)
42 | fls <- c(X20171016_POOL_POS_1_105.134.mzML(),
43 | X20171016_POOL_POS_3_105.134.mzML())
44 | fls
45 | ```
46 |
47 | We can now use the `Spectra()` constructor function to create an
48 | object of class `Spectra` that contains the raw data. Note that the
49 | `mzR` package is needed to read the content of the mzML files.
50 |
51 |
52 | ```{r}
53 | suppressPackageStartupMessages({
54 | library("Spectra")
55 | library("mzR")
56 | })
57 | ```
58 |
59 | ```{r}
60 | sp <- Spectra(fls)
61 | sp
62 | ```
63 |
64 | The object contains `r length(sp)` spectra from both files.
65 |
66 | # Further reading
67 |
68 | - The [R for Mass Spectrometry](https://rformassspectrometry.github.io/book/) book.
69 | - The [Large-scale data handling and processing with
70 | Spectra](https://rformassspectrometry.github.io/Spectra/articles/Spectra-large-scale.html)
71 | vignette from the *Spectra* package.
72 |
73 | # Session info
74 |
75 |
76 |
77 | Click to display session info
78 |
79 | ```{r}
80 | sessionInfo()
81 | ```
82 |
83 |
--------------------------------------------------------------------------------
/vignettes/how-to-load-gene-from-gff-gtf.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to load a gene model from a GFF or GTF file
3 | author: Bioconductor Core Team
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{How to load a gene model from a GFF or GTF file}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 | library(BiocStyle)
24 | })
25 | ```
26 |
27 | A **gene model** is essentially a set of annotations that
28 | describes the genomic locations of the known genes, transcripts, exons, and CDS for
29 | a given organism. The standard file formats for storing gene models are
30 | **[GFF and GTF](https://useast.ensembl.org/info/website/upload/gff.html)**.
31 | In Bioconductor, gene model information is typically represented as a *TxDb*
32 | object, but sometimes also as a *GRanges* or *GRangesList* object. We can use the
33 | `makeTxDbFromGFF()` function from the `r Biocpkg("txdbmaker")` package to
34 | import a GFF or GTF file as a *TxDb* object.
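
To make the file format concrete, here is a small base-R sketch of the nine tab-separated columns that GFF3 and GTF share. The records are invented for illustration (they do not come from any real annotation), and `read.delim()` is used only to show the layout. It is not a substitute for `makeTxDbFromGFF()`, which also resolves the parent/child relationships encoded in the attributes column.

```r
# Invented GFF3 records: one gene, one transcript, two exons
gff_lines <- c(
  "##gff-version 3",
  "chr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=gene1",
  "chr1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
  "chr1\ttoy\texon\t100\t300\t.\t+\t.\tID=ex1;Parent=tx1",
  "chr1\ttoy\texon\t600\t900\t.\t+\t.\tID=ex2;Parent=tx1"
)

# The nine standard columns; lines starting with '#' are directives or comments
gff <- read.delim(
  text = gff_lines, header = FALSE, comment.char = "#",
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"),
  stringsAsFactors = FALSE
)

table(gff$type)

# Coordinates are 1-based and inclusive, so width = end - start + 1
exons <- gff[gff$type == "exon", ]
exons$end - exons$start + 1L
#> [1] 201 301
```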
35 |
36 | # Bioconductor packages used in this document
37 |
38 | * `r Biocpkg("txdbmaker")`
39 |
40 | # How to load a gene model from a GFF or GTF file
41 |
42 | We will use a small .gff3 file provided by the `r Biocpkg("txdbmaker")` package.
43 |
44 | ```{r, warning=FALSE, message=FALSE}
45 | suppressPackageStartupMessages({
46 | library(txdbmaker)
47 | })
48 |
49 | gff_file <- system.file("extdata", "GFF3_files", "a.gff3", package="txdbmaker")
50 |
51 | txdb <- makeTxDbFromGFF(gff_file, format="gff3")
52 | txdb
53 | ```
54 |
55 |
56 | See `?makeTxDbFromGFF` in the `r Biocpkg("txdbmaker")` package for more information.
57 |
58 | Extract the exon coordinates grouped by gene from this gene model:
59 |
60 | ```{r}
61 | exonsBy(txdb, by="gene")
62 | ```
63 |
64 | # Session info
65 |
66 |
67 |
68 | Click to display session info
69 |
70 | ```{r}
71 | sessionInfo()
72 | ```
73 |
74 |
75 |
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: BiocHowTo
2 | Type: Package
3 | Title: A Collection of How To Documents For Bioconductor
4 | Version: 0.1.0
5 | Authors@R: c(
6 | person(
7 | "Charlotte", "Soneson",
8 | email = "charlottesoneson@gmail.com",
9 | role = c("aut", "cre")
10 | ),
11 | person(
12 | "Hervé", "Pagès",
13 | role = "aut"
14 | ),
15 | person(
16 | "Dan", "Tenenbaum",
17 | role = "aut"
18 | ),
19 | person(
20 | "Valerie", "Obenchain",
21 | role = "aut"
22 | ),
23 | person(
24 | "Marc", "Carlson",
25 | role = "aut"
26 | ),
27 | person(
28 | "Martin", "Morgan",
29 | role = "aut"
30 | ),
31 | person(
32 | "James", "Hester",
33 | role = "aut"
34 | ),
35 | person(
36 | "Vince", "Carey",
37 | role = "aut"
38 | ),
39 | person(
40 | "Laurent", "Gatto",
41 | role = "aut"
42 | ),
43 | person(
44 | "Pedro", "Baldoni",
45 | role = "aut"
46 | ),
47 | person(
48 | "Jenny", "Drnevich",
49 | role = "aut"
50 | ),
51 | person(
52 | "Robert", "Castelo",
53 | role = "aut"
54 | ),
55 | person(
56 | "Michael", "Stadler",
57 | role = "aut"
58 | ),
59 | person(
60 | "Fabricio", "Almeida-Silva",
61 | role = "aut"
62 | ),
63 | person(
64 | "Jacques", "Serizay",
65 | role = "aut"
66 | )
67 | )
68 | Description: This package provides a crowd-sourced collection of 'How To'
69 | documents related to Bioconductor packages.
70 | License: MIT + file LICENSE
71 | Encoding: UTF-8
72 | LazyData: false
73 | Suggests:
74 | knitr,
75 | rmarkdown
76 | RoxygenNote: 7.3.2
77 | VignetteBuilder:
78 | quarto
79 | Imports:
80 | quarto,
81 | BiocStyle,
82 | GenomicAlignments,
83 | pasillaBamSubset,
84 | txdbmaker,
85 | mzR,
86 | MsDataHub,
87 | Spectra,
88 | AnnotationHub,
89 | TxDb.Hsapiens.UCSC.hg38.knownGene,
90 | BSgenome.Hsapiens.UCSC.hg38,
91 | GenomicFeatures,
92 | Biostrings,
93 | Rsamtools,
94 | GSEABase,
95 | GSVA,
96 | rtracklayer,
97 | GenomicRanges,
98 | BSgenome,
99 | tidyomics,
100 | airway,
101 | readxl
102 | URL: https://github.com/bioconductor/BiocHowTo,
103 | https://bioconductor.github.io/BiocHowTo/
104 |
--------------------------------------------------------------------------------
/vignettes/how-to-read-big-bam-file-in-chunks.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to read a BAM file in chunks
3 | author: Bioconductor Core Team
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{How to read a BAM file in chunks}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 | library(BiocStyle)
24 | })
25 | ```
26 |
27 | This HowTo has been adapted from the list of HowTos provided in the
28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package.
30 |
31 | # Bioconductor packages used in this document
32 |
33 | * `r Biocpkg("pasillaBamSubset")`
34 | * `r Biocpkg("GenomicAlignments")`
35 | * `r Biocpkg("Rsamtools")`
36 |
37 | # How to read a BAM file in chunks
38 |
39 | A large BAM file can be iterated through in chunks, in order to reduce the
40 | memory usage, by setting a `yieldSize` for the `BamFile` object. For
41 | illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data package.
42 |
43 | ```{r}
44 | suppressPackageStartupMessages({
45 | library(pasillaBamSubset)
46 | library(Rsamtools)
47 | })
48 | # Path to a bam file with single-end reads
49 | (un1 <- untreated1_chr4())
50 | bf <- BamFile(un1, yieldSize = 100000)
51 | ```
52 |
53 | Iteration through a BAM file requires that the file be opened, repeatedly
54 | queried inside a loop, then closed. Repeated calls to
55 | `GenomicAlignments::readGAlignments` without opening the file first result in
56 | the same 100000 records returned each time (with a `yieldSize` of 100000).
57 | As an example, let's calculate the coverage for the bam file above.
58 |
59 | ```{r}
60 | suppressPackageStartupMessages({
61 | library(GenomicAlignments)
62 | })
63 | open(bf)
64 | cvg <- NULL
65 | repeat {
66 | chunk <- readGAlignments(bf)
67 | if (length(chunk) == 0L) {
68 | break
69 | }
70 | chunk_cvg <- coverage(chunk)
71 | if (is.null(cvg)) {
72 | cvg <- chunk_cvg
73 | } else {
74 | cvg <- cvg + chunk_cvg
75 | }
76 | }
77 | close(bf)
78 | cvg
79 | ```
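
The need to open the file before looping has a close analogy in base R, sketched below with a plain text file instead of a BAM file. This is only an illustration of the open/read/close pattern, under the assumption that the analogy carries over. It does not use or describe the internals of `BamFile`. Reading from a path starts from the top on every call, whereas an open connection remembers its position between calls.

```r
# A small text file with the numbers 1 to 7, one per line
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:7), path)

# Not opened: every call re-reads the first three lines
first <- readLines(path, n = 3)
second <- readLines(path, n = 3)
identical(first, second)
#> [1] TRUE

# Opened: successive calls advance through the file, chunk by chunk
con <- file(path, open = "r")
total <- 0
n_chunks <- 0L
repeat {
  chunk <- readLines(con, n = 3)
  if (length(chunk) == 0L) {
    break
  }
  # Accumulate a per-chunk summary, as done with the coverage above
  total <- total + sum(as.numeric(chunk))
  n_chunks <- n_chunks + 1L
}
close(con)
total
#> [1] 28
n_chunks
#> [1] 3
```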
80 |
81 | # Session info
82 |
83 |
84 |
85 | Click to display session info
86 |
87 | ```{r}
88 | sessionInfo()
89 | ```
90 |
91 |
92 |
--------------------------------------------------------------------------------
/vignettes/how-to-retrieve-gene-model-from-annotationhub.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to retrieve a gene model from AnnotationHub
3 | author: Bioconductor Core Team
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{How to retrieve a gene model from AnnotationHub}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 | library(BiocStyle)
24 | })
25 | ```
26 |
27 | This HowTo has been adapted from the list of HowTos provided in the
28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package.
30 |
31 | # Bioconductor packages used in this document
32 |
33 | * `r Biocpkg("AnnotationHub")`
34 |
35 | # How to retrieve a gene model from *AnnotationHub*
36 |
37 | When a gene model is not available as a `GRanges` or `GRangesList` object or as
38 | a Bioconductor data package, it may be available on `r Biocpkg("AnnotationHub")`.
39 | In this HowTo, we will look for a gene model for *Drosophila melanogaster*
40 | on `r Biocpkg("AnnotationHub")`. Create a 'hub' and then filter on *Drosophila
41 | melanogaster*:
42 |
43 | ```{r}
44 | suppressPackageStartupMessages({
45 | library(AnnotationHub)
46 | })
47 | ### Internet connection required!
48 | hub <- AnnotationHub()
49 | hub <- subset(hub, hub$species=='Drosophila melanogaster')
50 | ```
51 |
52 | There are `r length(hub)` files that match Drosophila melanogaster.
53 | If you look at the metadata in `hub`, you can see that the 5th record
54 | represents a `GRanges` object from the UCSC RefSeq Gene track.
55 |
56 | ```{r}
57 | length(hub)
58 | head(names(hub))
59 | head(hub$title, n=10)
60 | ## then look at a specific slice of the hub object.
61 | hub[5]
62 | ```
63 | So you can retrieve that dm3 file as a GRanges like this:
64 |
65 | ```{r}
66 | gr <- hub[[names(hub)[5]]]
67 | summary(gr)
68 | ```
69 |
70 | The metadata fields contain the details of file origin and content.
71 |
72 | ```{r}
73 | metadata(gr)
74 | ```
75 | Split the `GRanges` object by gene name to get a `GRangesList` object of
76 | transcript ranges grouped by gene.
77 |
78 | ```{r}
79 | txbygn <- split(gr, gr$name)
80 | ```
81 |
82 | You can now use `txbygn` with the `summarizeOverlaps` function
83 | to prepare a table of read counts for RNA-Seq differential gene expression.
84 |
85 | # Further reading
86 |
87 | Note that before passing `txbygn` to `summarizeOverlaps`,
88 | you should confirm that the seqlevels (chromosome names) in it match those
89 | in the BAM file. See `?renameSeqlevels`, `?keepSeqlevels`
90 | and `?seqlevels` for examples of renaming seqlevels.
91 |
92 | # Session info
93 |
94 |
95 |
96 | Click to display session info
97 |
98 | ```{r}
99 | sessionInfo()
100 | ```
101 |
102 |
--------------------------------------------------------------------------------
/vignettes/how-to-read-single-end-reads-from-bam-file.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to read single-end reads from a BAM file
3 | author: Bioconductor Core Team
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{How to read single-end reads from a BAM file}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 | library(BiocStyle)
24 | })
25 | ```
26 |
27 | This HowTo has been adapted from the list of HowTos provided in the
28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package.
30 |
31 | # Bioconductor packages used in this document
32 |
33 | * `r Biocpkg("pasillaBamSubset")`
34 | * `r Biocpkg("GenomicAlignments")`
35 | * `r Biocpkg("Rsamtools")`
36 |
37 | # How to read single-end reads from a BAM file
38 |
39 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data
40 | package.
41 |
42 | ```{r}
43 | suppressPackageStartupMessages({
44 | library(pasillaBamSubset)
45 | })
46 | # Path to a bam file with single-end reads
47 | (un1 <- untreated1_chr4())
48 | ```
49 |
50 | Several functions are available for reading BAM files into R:
51 |
52 | * `GenomicAlignments::readGAlignments()`
53 | * `GenomicAlignments::readGAlignmentPairs()`
54 | * `GenomicAlignments::readGAlignmentsList()`
55 | * `Rsamtools::scanBam()`
56 |
57 | `scanBam` is a low-level function that returns a list of lists and is not
58 | discussed further here. See `?scanBam` in the `r Biocpkg("Rsamtools")` package
59 | for more information. Single-end reads can be read with the `readGAlignments`
60 | function from the `r Biocpkg("GenomicAlignments")` package.
61 |
62 | ```{r}
63 | suppressPackageStartupMessages({
64 | library(GenomicAlignments)
65 | })
66 | gal <- readGAlignments(un1)
67 | class(gal)
68 | gal
69 | ```
70 |
71 | Data subsets can be specified by genomic position, field names, or flag
72 | criteria using `Rsamtools::ScanBamParam`. Here, as an example we import records
73 | that overlap position 1 to 5000 on the negative strand of chromosome 4, with
74 | `flag` and `cigar` as metadata columns.
75 |
76 | ```{r}
77 | suppressPackageStartupMessages({
78 | library(Rsamtools)
79 | })
80 | what <- c("flag", "cigar")
81 | which <- GRanges("chr4", IRanges(1, 5000))
82 | flag <- scanBamFlag(isMinusStrand = TRUE)
83 | param <- ScanBamParam(which = which, what = what, flag = flag)
84 | neg <- readGAlignments(un1, param = param)
85 | neg
86 | ```
87 |
88 | Another approach to subsetting the data is to use `Rsamtools::filterBam`.
89 | This function creates a new BAM file of records passing user-defined criteria.
90 | See `?filterBam` in the `r Biocpkg("Rsamtools")` package for more information.
91 |
92 | # Session info
93 |
94 |
95 |
96 | Click to display session info
97 |
98 | ```{r}
99 | sessionInfo()
100 | ```
101 |
102 |
103 |
--------------------------------------------------------------------------------
/vignettes/how-to-compute-read-coverage.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to compute read coverage
3 | author: Bioconductor Core Team
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{How to compute read coverage}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 | library(BiocStyle)
24 | })
25 | ```
26 |
27 | This HowTo has been adapted from the list of HowTos provided in the
28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package.
30 |
31 | # Bioconductor packages used in this document
32 |
33 | * `r Biocpkg("pasillaBamSubset")`
34 | * `r Biocpkg("GenomicAlignments")`
35 |
36 | # How to compute read coverage
37 |
38 | By "read coverage", we mean the number of reads that cover a given genomic position.
39 | Computing the read coverage generally consists of computing this number at
40 | each position in the genome.
41 | This can be done with the `coverage` function from `r Biocpkg("GenomicAlignments")`.
42 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data
43 | package.
44 |
45 | ```{r}
46 | suppressPackageStartupMessages({
47 | library(pasillaBamSubset)
48 | })
49 | # Path to a bam file with single-end reads
50 | (un1 <- untreated1_chr4())
51 | ```
52 |
53 | Single-end reads can be read with the `readGAlignments` function from
54 | `r Biocpkg("GenomicAlignments")`.
55 |
56 | ```{r}
57 | suppressPackageStartupMessages({
58 | library(GenomicAlignments)
59 | })
60 | (reads <- readGAlignments(un1))
61 | ```
62 |
63 | The coverage can then be calculated using the `coverage` function from
64 | `r Biocpkg("GenomicAlignments")`:
65 |
66 | ```{r}
67 | (cvg <- coverage(reads))
68 | ```
69 |
70 | We can extract the coverage on a specific chromosome:
71 |
72 | ```{r}
73 | cvg$chr4
74 | ```
75 |
76 | We can also perform calculations on the returned object, e.g. to get the
77 | average and maximum coverage on chromosome 4:
78 |
79 | ```{r}
80 | mean(cvg$chr4)
81 | max(cvg$chr4)
82 | ```
83 |
84 | We can use the `slice` function to find the genomic regions where the coverage
85 | is greater than or equal to a given threshold.
86 |
87 | ```{r}
88 | (chr4_highcov <- slice(cvg$chr4, lower = 500))
89 | ```
90 |
91 | The *weight* of such a region can be defined as the number of aligned
92 | nucleotides that belong to the region (i.e. the area under the coverage curve).
93 | It can be obtained with `sum`:
94 |
95 | ```{r}
96 | sum(chr4_highcov)
97 | ```
98 |
99 | Note that `coverage` and `slice` are generic functions with methods for
100 | different types of objects. See `?coverage` or `?slice` for more information.
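
To make these definitions concrete, the base-R sketch below recomputes coverage, high-coverage regions and their weight for three invented reads on a toy chromosome of length 10. The function `toy_coverage()` and all values are hypothetical and serve only to illustrate the concepts. This is not how `coverage` or `slice` are implemented, and the real functions return run-length encoded objects rather than plain vectors.

```r
# Coverage from read start/end positions (1-based, inclusive):
# +1 where a read starts, -1 just after it ends, then a running sum.
toy_coverage <- function(starts, ends, seqlength) {
  nbins <- seqlength + 1L
  delta <- tabulate(starts, nbins = nbins) - tabulate(ends + 1L, nbins = nbins)
  cumsum(delta)[seq_len(seqlength)]
}

# Three reads of width 4 on a chromosome of length 10
toy_cvg <- toy_coverage(starts = c(1L, 3L, 5L), ends = c(4L, 6L, 8L),
                        seqlength = 10L)
toy_cvg
#>  [1] 1 1 2 2 2 2 1 1 0 0
mean(toy_cvg)
#> [1] 1.2
max(toy_cvg)
#> [1] 2

# Regions with coverage >= 2, found from runs of consecutive positions
runs <- rle(toy_cvg >= 2L)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1L
regions <- data.frame(start = run_start[runs$values],
                      end = run_end[runs$values])

# Weight of each region: area under the coverage curve within it
regions$weight <- mapply(function(s, e) sum(toy_cvg[s:e]),
                         regions$start, regions$end)
regions
#>   start end weight
#> 1     3   6      8
```

The total of the coverage vector, `sum(toy_cvg)`, is 12, which equals the number of aligned nucleotides (three reads of width 4).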
101 |
102 |
103 | # Further reading
104 |
105 | - The `r Biocpkg("GenomicAlignments")` vignette.
106 |
107 | # Session info
108 |
109 |
110 |
111 | Click to display session info
112 |
113 | ```{r}
114 | sessionInfo()
115 | ```
116 |
117 |
--------------------------------------------------------------------------------
/vignettes/how-to-read-paired-end-reads-from-bam-file.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to read paired-end reads from a BAM file
3 | author: Bioconductor Core Team
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{How to read paired-end reads from a BAM file}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 | library(BiocStyle)
24 | })
25 | ```
26 |
27 | This HowTo has been adapted from the list of HowTos provided in the
28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package.
30 |
31 | # Bioconductor packages used in this document
32 |
33 | * `r Biocpkg("pasillaBamSubset")`
34 | * `r Biocpkg("GenomicAlignments")`
35 |
36 | # How to read paired-end reads from a BAM file
37 |
38 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data
39 | package.
40 |
41 | ```{r}
42 | suppressPackageStartupMessages({
43 | library(pasillaBamSubset)
44 | })
45 | # Path to a bam file with paired-end reads
46 | (un3 <- untreated3_chr4())
47 | ```
48 |
49 | Paired-end reads can be loaded with the `GenomicAlignments::readGAlignmentPairs`
50 | or `GenomicAlignments::readGAlignmentsList` functions from the
51 | `r Biocpkg("GenomicAlignments")` package. These functions use the same mate
52 | pairing algorithm but output different objects.
53 |
54 | Let's start with `GenomicAlignments::readGAlignmentPairs`:
55 |
56 | ```{r}
57 | suppressPackageStartupMessages({
58 | library(GenomicAlignments)
59 | })
60 | gapairs <- readGAlignmentPairs(un3)
61 | class(gapairs)
62 | gapairs
63 | ```
64 |
65 | The `GAlignmentPairs` class holds only pairs; reads with no mate or with
66 | ambiguous pairing are discarded. Each list element holds exactly 2 records
67 | (a mated pair). Records can be accessed as the first and last segments in a
68 | template or as left and right alignments. See `?GAlignmentPairs` in the
69 | `r Biocpkg("GenomicAlignments")` package for more information.
70 |
71 | For `readGAlignmentsList`, mate pairing is performed when `asMates` is set to
72 | `TRUE` on the `BamFile` object, otherwise records are treated as single-end.
73 |
74 | ```{r}
75 | galist <- readGAlignmentsList(BamFile(un3, asMates = TRUE))
76 | galist
77 | ```
78 |
79 | `GAlignmentsList` is a more general ‘list-like’ structure that holds mate
80 | pairs as well as nonmates (i.e., singletons, records with unmapped mates, etc.).
81 | A `mate_status` metadata column (accessed with `mcols`) indicates which
82 | records were paired.
83 |
84 | Non-mated reads are returned in groups defined by `QNAME`, each containing any number of
85 | records. Here the non-mate groups range in size from 1 to 9.
86 |
87 | ```{r}
88 | non_mates <- galist[unlist(mcols(galist)$mate_status) == "unmated"]
89 | table(elementNROWS(non_mates))
90 | ```
91 |
92 | # Session info
93 |
94 | <details>
95 | <summary>
96 | Click to display session info
97 | </summary>
98 | ```{r}
99 | sessionInfo()
100 | ```
101 | </details>
102 |
103 |
--------------------------------------------------------------------------------
/vignettes/how-to-use-tidy-principles-for-rna-seq-analysis.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to use tidy principles for RNA-seq data analysis
3 | author: Jacques Serizay
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 |   %\VignetteIndexEntry{How to use tidy principles for RNA-seq data analysis}
7 |   %\VignetteEngine{quarto::html}
8 |   %\VignetteEncoding{UTF-8}
9 | knitr:
10 |   opts_chunk:
11 |     collapse: true
12 |     comment: '#>'
13 | format:
14 |   html:
15 |     toc: true
16 |     html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 |     library(BiocStyle)
24 | })
25 | ```
26 |
27 | Tidy data principles are a set of principles for structuring data in a way
28 | that makes it easier to manipulate and analyze. They have been popularized by the
29 | `tidyverse` collection of R packages, which includes `dplyr` for data manipulation
30 | and `ggplot2` for data visualization. The tidy principles can also be applied to
31 | biological data, such as RNA-seq data, to facilitate analysis and visualization,
32 | thanks to the `tidyomics` meta-package.
33 |
34 | # Bioconductor packages used in this document
35 |
36 | * `r Biocpkg("tidyomics")`
37 |
38 | # How to manipulate SummarizedExperiment objects with tidy principles
39 |
40 | For illustration, we use data from the `r Biocpkg("airway")` package.
41 |
42 | ```{r}
43 | library(tidyomics)
44 | data(airway, package="airway")
45 |
46 | airway
47 | ```
48 |
49 | How the `airway` object is printed is controlled by the `tidySummarizedExperiment` package,
50 | and by default it is displayed as a `tibble`-like object. This
51 | behavior can be changed by setting the `restore_SummarizedExperiment_show` option to `TRUE`.
52 |
53 | ```{r}
54 | # Turn off the tibble visualisation
55 | options("restore_SummarizedExperiment_show" = TRUE)
56 | airway
57 |
58 | # Turn on the tibble visualisation
59 | options("restore_SummarizedExperiment_show" = FALSE)
60 | airway
61 | ```
62 |
63 | The `airway` object is a `SummarizedExperiment` object, and the
64 | `tidySummarizedExperiment` package adds an invisible layer that presents the `SE` object as a
65 | `tibble`. This allows us to use the `dplyr` package to manipulate the data, and
66 | to pipe the result directly into `ggplot2` for visualization.
67 |
68 | ```{r}
69 | airway |>
70 |     filter(albut == "untrt") |>
71 |     group_by(.feature, dex) |>
72 |     summarise(tot_counts = sum(counts)) |>
73 |     pivot_wider(names_from = dex, values_from = tot_counts) |>
74 |     slice_sample(n = 10000) |>
75 |     ggplot(aes(x = trt, y = untrt)) +
76 |     geom_density_2d(color = "black", linewidth = 0.5) +
77 |     geom_point(alpha = 0.1, size = 0.3) +
78 |     scale_x_log10() +
79 |     scale_y_log10() +
80 |     annotation_logticks(sides = "bl") +
81 |     labs(x = "Dex treated", y = "Dex untreated") +
82 |     theme_bw()
83 | ```
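Conceptually, the tibble abstraction presents the experiment as a long table with one row per feature-sample pair, carrying the assay value together with the sample annotations. The base-R sketch below builds such a table by hand from a made-up count matrix and sample table (all names and values are hypothetical) and repeats the kind of per-feature, per-condition summary used above. It illustrates the idea only and is not how `tidySummarizedExperiment` works internally.

```{r}
# Hypothetical counts: 2 genes (rows) x 3 samples (columns)
count_mat <- matrix(c(10, 0, 20, 5, 30, 1), nrow = 2,
                    dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
samples <- data.frame(.sample = c("s1", "s2", "s3"),
                      dex = c("trt", "untrt", "trt"))

# Long format: one row per (.feature, .sample) pair
long <- as.data.frame(as.table(count_mat), responseName = "counts",
                      stringsAsFactors = FALSE)
names(long)[1:2] <- c(".feature", ".sample")

# Attach the sample annotations to every row
long <- merge(long, samples, by = ".sample")

# Total counts per feature and treatment
totals <- aggregate(counts ~ .feature + dex, data = long, FUN = sum)
totals
```

Once the data are in this shape, grouping, summarising and plotting reduce to ordinary table operations, which is what makes the `dplyr` and `ggplot2` verbs applicable.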
84 |
85 | Because these tidy methods are defined for `SummarizedExperiment` objects,
86 | most classes built on top of `SummarizedExperiment`, such as `DESeqDataSet` or `SingleCellExperiment`,
87 | can also be manipulated with tidy principles. This has prompted the development of
88 | other packages (included in the `tidyomics` meta-package) that extend the tidy principles to
89 | other analyses, such as single-cell transcriptomics analysis.
90 |
91 | ```{r}
92 | data(pbmc_small, package="tidySingleCellExperiment")
93 | pbmc_small
94 |
95 | ## We can still manipulate the data with standard SummarizedExperiment methods
96 | counts(pbmc_small)[1:5, 1:4]
97 |
98 | ## But we can also use tidy principles
99 | pbmc_small |>
100 |     extract(file, "sample", "../data/([a-z0-9]+)/outs.+") |>
101 |     rename(annotation = letter.idents) |>
102 |     ggplot(aes(sample, nCount_RNA, fill = sample)) +
103 |     geom_boxplot(outlier.shape = NA) +
104 |     geom_jitter(width = 0.1) +
105 |     facet_wrap(~ annotation)
106 | ```
107 |
108 | # Further reading
109 |
110 | To read more about the `tidyomics` packages, please refer to the
111 | [tidyomics](https://github.com/tidyomics) GitHub organization page.
112 |
113 | # Session info
114 |
115 | <details>
116 | <summary>
117 | Click to display session info
118 | </summary>
119 | ```{r}
120 | sessionInfo()
121 | ```
122 | </details>
123 |
124 |
--------------------------------------------------------------------------------
/vignettes/how-to-get-exon-intron-sequence-for-gene.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to get the exon and intron sequences of a given gene
3 | author: Bioconductor Core Team
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 |   %\VignetteIndexEntry{How to get the exon and intron sequences of a given gene}
7 |   %\VignetteEngine{quarto::html}
8 |   %\VignetteEncoding{UTF-8}
9 | knitr:
10 |   opts_chunk:
11 |     collapse: true
12 |     comment: '#>'
13 | format:
14 |   html:
15 |     toc: true
16 |     html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 |     library(BiocStyle)
24 | })
25 | ```
26 |
27 | This HowTo has been adapted from the list of HowTos provided in the
28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package.
30 |
31 | # Bioconductor packages used in this document
32 |
33 | * `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")`
34 | * `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")`
35 | * `r Biocpkg("GenomicFeatures")`
36 | * `r Biocpkg("Biostrings")`
37 |
38 | # How to get the exon and intron sequences of a given gene
39 |
40 | The exon and intron sequences of a gene are essentially the DNA sequences of
41 | the introns and exons of all known transcripts of the gene. The first task,
42 | therefore, is to identify all transcripts associated with the gene of interest.
43 | Our example gene is the human `TRAK2`, which is involved in regulation of
44 | endosome-to-lysosome trafficking of membrane cargo. The Entrez gene id is
45 | '66008'.
46 |
47 | ```{r}
48 | trak2 <- "66008"
49 | ```
50 |
51 | The `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")` data package contains the
52 | gene models corresponding to the UCSC 'Known Genes' track.
53 |
54 | ```{r}
55 | suppressPackageStartupMessages({
56 |     library(TxDb.Hsapiens.UCSC.hg38.knownGene)
57 | })
58 | txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
59 | ```
60 |
61 | The transcript ranges for all the genes in the gene model can be extracted with
62 | the `transcriptsBy` function from the `r Biocpkg("GenomicFeatures")` package.
63 | They will be returned in a named `GRangesList` object containing all the
64 | transcripts grouped by gene. In order to keep only the transcripts of the
65 | `TRAK2` gene we will subset the `GRangesList` object using the `[[` operator.
66 |
67 | ```{r}
68 | suppressPackageStartupMessages({
69 |     library(GenomicFeatures)
70 | })
71 | (trak2_txs <- transcriptsBy(txdb, by = "gene")[[trak2]])
72 | ```
73 |
74 | `trak2_txs` is a `GRanges` object with one range per transcript in the `TRAK2`
75 | gene. The transcript names are stored in the `tx_name` metadata column. We will
76 | need them to subset the extracted intron and exon regions:
77 |
78 | ```{r}
79 | (trak2_tx_names <- mcols(trak2_txs)$tx_name)
80 | ```
81 |
82 | The exon and intron genomic ranges for all the transcripts in the gene model
83 | can be extracted with the `exonsBy` and `intronsByTranscript` functions,
84 | respectively. Both functions return a `GRangesList` object. Then we keep only
85 | the exons and introns for the transcripts of the `TRAK2` gene by subsetting
86 | each `GRangesList` object by the `TRAK2` transcript names.
87 |
88 | Extract the exon regions:
89 |
90 | ```{r}
91 | trak2_exbytx <- exonsBy(txdb, "tx", use.names = TRUE)[trak2_tx_names]
92 | elementNROWS(trak2_exbytx)
93 | ```
94 |
95 | ... and the intron regions:
96 |
97 | ```{r}
98 | trak2_inbytx <- intronsByTranscript(txdb, use.names = TRUE)[trak2_tx_names]
99 | elementNROWS(trak2_inbytx)
100 | ```
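Conceptually, the introns of a transcript are the gaps between its consecutive exons. The base-R sketch below illustrates this with made-up exon coordinates for a single hypothetical transcript. It only mirrors the idea and is not how `intronsByTranscript` is implemented.

```{r}
# Hypothetical exon coordinates of one transcript, sorted by position
exons <- data.frame(start = c(100, 300, 700), end = c(199, 450, 820))

# Each intron spans from one base after an exon to one base before the next
n <- nrow(exons)
introns <- data.frame(
    start = exons$end[-n] + 1,
    end   = exons$start[-1] - 1
)
introns$width <- introns$end - introns$start + 1
introns
```

A transcript with `n` exons therefore has `n - 1` introns, which should be consistent with a comparison of the two `elementNROWS()` outputs above.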
101 |
102 | Next we want the DNA sequences for these exons and introns. The `getSeq`
103 | function from the `r Biocpkg("Biostrings")` package can be used to query a
104 | `BSgenome` object with a set of genomic ranges and retrieve the corresponding
105 | DNA sequences.
106 |
107 | ```{r}
108 | suppressPackageStartupMessages({
109 |     library(BSgenome.Hsapiens.UCSC.hg38)
110 | })
111 | ```
112 |
113 | Extract the exon sequences:
114 |
115 | ```{r}
116 | (trak2_ex_seqs <- getSeq(Hsapiens, trak2_exbytx))
117 | ```
118 |
119 | ... and the intron sequences:
120 |
121 | ```{r}
122 | (trak2_in_seqs <- getSeq(Hsapiens, trak2_inbytx))
123 | ```
124 |
125 | Each list element is a `DNAStringSet` object, e.g.:
126 |
127 | ```{r}
128 | trak2_in_seqs[[1]]
129 | ```
130 |
131 | Each sequence here corresponds to a single intron.
132 |
133 | # Further reading
134 |
135 | - The `r Biocpkg("GenomicFeatures")` vignette.
136 |
137 | # Session info
138 |
139 | <details>
140 | <summary>
141 | Click to display session info
142 | </summary>
143 | ```{r}
144 | sessionInfo()
145 | ```
146 | </details>
147 |
--------------------------------------------------------------------------------
/.github/workflows/R-CMD-check.yaml:
--------------------------------------------------------------------------------
1 | on:
2 |   push:
3 |   pull_request:
4 |     branches:
5 |       - devel
6 |       - main
7 |   schedule:
8 |     - cron: '0 9 * * 2'
9 |
10 | name: R-CMD-check
11 |
12 | jobs:
13 |   R-CMD-check:
14 |     runs-on: ${{ matrix.config.os }}
15 |     container: ${{ matrix.config.image }}
16 |
17 |     name: ${{ matrix.config.os }} (${{ matrix.config.bioc }} - ${{ matrix.config.image }})
18 |
19 |     strategy:
20 |       fail-fast: false
21 |       matrix:
22 |         config:
23 |           - {os: macos-13, bioc: 'devel'}
24 |
25 |     env:
26 |       R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
27 |       CRAN: ${{ matrix.config.cran }}
28 |       GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
29 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
30 |
31 |     steps:
32 |       - name: Check out repo
33 |         uses: actions/checkout@v3
34 |
35 |       - name: Set up R and install BiocManager
36 |         uses: grimbough/bioc-actions/setup-bioc@v1
37 |         if: matrix.config.image == null
38 |         with:
39 |           bioc-version: ${{ matrix.config.bioc }}
40 |
41 |       - name: Install Quarto
42 |         uses: quarto-dev/quarto-actions/setup@v2
43 |
44 |       - name: Install pandoc
45 |         uses: r-lib/actions/setup-pandoc@v2
46 |         if: matrix.config.image == null
47 |
48 |       - name: Install remotes and rmarkdown
49 |         run: |
50 |           install.packages(c('remotes', 'rmarkdown'))
51 |         shell: Rscript {0}
52 |
53 |       - name: Query dependencies
54 |         run: |
55 |           saveRDS(remotes::dev_package_deps(dependencies = TRUE, repos = c(getOption('repos'), BiocManager::repositories())), 'depends.Rds', version = 2)
56 |         shell: Rscript {0}
57 |
58 |       - name: Cache R packages
59 |         if: runner.os != 'Windows' && matrix.config.image == null
60 |         uses: actions/cache@v4
61 |         with:
62 |           path: ${{ env.R_LIBS_USER }}
63 |           key: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}-${{ hashFiles('depends.Rds') }}
64 |           restore-keys: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}-
65 |
66 |       - name: Install system dependencies (Linux)
67 |         if: runner.os == 'Linux'
68 |         env:
69 |           RHUB_PLATFORM: linux-x86_64-ubuntu-gcc
70 |         uses: r-lib/actions/setup-r-dependencies@v2
71 |         with:
72 |           extra-packages: any::rcmdcheck
73 |           pak-version: devel
74 |
75 |       - name: Install system dependencies (macOS)
76 |         if: runner.os == 'macOS'
77 |         run: |
78 |           brew install harfbuzz
79 |           brew install fribidi
80 |
81 |       - name: Install dependencies
82 |         run: |
83 |           local_deps <- remotes::local_package_deps(dependencies = TRUE)
84 |           deps <- remotes::dev_package_deps(dependencies = TRUE, repos = BiocManager::repositories())
85 |           BiocManager::install(local_deps[local_deps %in% deps$package[deps$diff != 0]], Ncpu = 2L)
86 |           remotes::install_cran('rcmdcheck', Ncpu = 2L)
87 |         shell: Rscript {0}
88 |
89 |       - name: Session info
90 |         run: |
91 |           options(width = 100)
92 |           pkgs <- installed.packages()[, "Package"]
93 |           sessioninfo::session_info(pkgs, include_base = TRUE)
94 |         shell: Rscript {0}
95 |
96 |       - name: Build, Install, Check
97 |         id: build-install-check
98 |         uses: grimbough/bioc-actions/build-install-check@v1
99 |
100 |       - name: Upload install log if the build/install/check step fails
101 |         if: always() && (steps.build-install-check.outcome == 'failure')
102 |         uses: actions/upload-artifact@v4
103 |         with:
104 |           name: install-log
105 |           path: |
106 |             ${{ steps.build-install-check.outputs.install-log }}
107 |
108 |       - name: Upload check results
109 |         if: failure()
110 |         uses: actions/upload-artifact@v4
111 |         with:
112 |           name: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}-results
113 |           path: ${{ steps.build-install-check.outputs.check-dir }}
114 |
115 |       # - name: Run BiocCheck
116 |       #   uses: grimbough/bioc-actions/run-BiocCheck@v1
117 |       #   with:
118 |       #     arguments: '--no-check-bioc-views --no-check-bioc-help'
119 |       #     error-on: 'error'
120 |       #
121 |       # - name: Test coverage
122 |       #   if: matrix.config.os == 'macOS-latest' && matrix.config.pandoc == null
123 |       #   run: |
124 |       #     install.packages("covr")
125 |       #     covr::codecov(token = "${{secrets.CODECOV_TOKEN}}")
126 |       #   shell: Rscript {0}
127 |       #
128 |       - name: Deploy
129 |         if: github.event_name == 'push' && github.ref == 'refs/heads/devel' && runner.os == 'macOS' && matrix.config.bioc == 'devel'
130 |         run: |
131 |           R CMD INSTALL .
132 |           Rscript -e "remotes::install_dev('pkgdown'); pkgdown::deploy_to_branch(new_process = FALSE)"
133 |         shell: bash
134 |
--------------------------------------------------------------------------------
/vignettes/how-to-read-gene-sets-from-gmt-files.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to read gene sets from GMT files
3 | author: Robert Castelo
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 |   %\VignetteIndexEntry{How to read gene sets from GMT files}
7 |   %\VignetteEngine{quarto::html}
8 |   %\VignetteEncoding{UTF-8}
9 | knitr:
10 |   opts_chunk:
11 |     collapse: true
12 |     comment: '#>'
13 | format:
14 |   html:
15 |     toc: true
16 |     html-math-method: mathjax
17 | ---
18 |
19 | ```{r}
20 | #| echo: false
21 |
22 | suppressPackageStartupMessages({
23 |     library(BiocStyle)
24 | })
25 | ```
26 |
27 | A **gene set** is a simple yet useful way to define pathways without regard
28 | to the specific molecular interactions. It can also represent the signature
29 | of a characteristic pattern of gene expression associated with a cell type
30 | or a disease for diagnostic or prognostic purposes. An important source of
31 | gene sets is the
32 | [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb),
33 | which stores them as plain text files following the so-called
34 | [_gene matrix transposed_ (GMT) format](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29).
35 | In the GMT format, each line stores a gene set with the following values
36 | separated by tabs:
37 |
38 | * A unique gene set identifier.
39 | * A gene set description.
40 | * One or more gene identifiers.
41 |
42 | Because each different gene set may consist of a different number of genes, each
43 | line in a GMT file may contain a different number of tab-separated values. This
44 | means that the GMT format is not a tabular format, and therefore cannot be directly
45 | read with base R functions such as `read.table()` or `read.csv()`.
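To make the ragged layout concrete, here is a minimal base-R sketch that writes a two-line toy GMT file (the set names `SET_A` and `SET_B` and their contents are made up for illustration), shows that its lines have different numbers of fields, and parses it by hand into a named list. It is only an illustration of the format, not a replacement for a dedicated GMT reader.

```{r}
# Toy GMT content: identifier, description, then a variable number of genes
gmt_lines <- c(
    "SET_A\ttoy description A\tTP53\tBRCA1\tMYC",
    "SET_B\ttoy description B\tEGFR\tKRAS"
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# The lines have different numbers of tab-separated fields
n_fields <- count.fields(gmt_file, sep = "\t")
n_fields

# Split each line on tabs; fields 1 and 2 are metadata, the rest are genes
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)
gene_sets <- lapply(fields, function(x) x[-(1:2)])
names(gene_sets) <- vapply(fields, `[`, character(1), 1)
gene_sets
```

The natural in-memory representation is therefore a named list of character vectors rather than a rectangular table.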
46 |
47 | # Bioconductor packages used in this document
48 |
49 | * `r Biocpkg("GSEABase")`
50 | * `r Biocpkg("GSVA")`
51 |
52 | # How to read gene sets from GMT files
53 |
54 | The package `r Biocpkg("GSEABase")` provides object classes and methods to represent
55 | and manipulate gene sets, including reading GMT files with the function `getGmt()`.
56 | Here is a first example in which we read a GMT file from the MSigDB database:
57 |
58 | ```{r, message=FALSE}
59 | library(GSEABase)
60 |
61 | URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release",
62 |              "2023.2.Hs/c7.immunesigdb.v2023.2.Hs.symbols.gmt", sep="/")
63 | genesets <- getGmt(URL, geneIdType=SymbolIdentifier())
64 | genesets
65 | length(genesets)
66 | ```
67 | In addition to the location of the GMT file, we provided the argument
68 | `geneIdType=SymbolIdentifier()`. This argument adds the metadata that
69 | later allows other software to determine what kind of gene identifiers are
70 | used in this collection of gene sets and, if necessary, to map them to other
71 | types of identifiers. While this argument is optional, we should always
72 | try to provide it.
73 |
74 | The output of the `getGmt()` function is a `GeneSetCollection` object, which
75 | behaves much like a list; many Bioconductor packages that analyse
76 | gene sets expect this kind of object as input. We can also coerce it to
77 | a regular `list` object with the function `geneIds()`:
78 |
79 | ```{r}
80 | genesets.list <- geneIds(genesets)
81 | head(lapply(genesets.list, head, 3), 3)
82 | ```
83 | By definition, GMT files should have unique gene set identifiers. If we
84 | attempt to read a malformed GMT file with duplicated identifiers using
85 | `getGmt()`, it will raise an error:
86 |
87 | ```{r, error=TRUE}
88 | URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release",
89 |              "7.5/c2.all.v7.5.symbols.gmt", sep="/")
90 | genesets <- getGmt(URL, geneIdType=SymbolIdentifier())
91 | ```
92 | So if we still want to read the contents of that GMT file, we would have to
93 | download it and edit the lines with duplicated identifiers. An alternative is
94 | to use the function `readGMT()` of the `r Biocpkg("GSVA")` package:
95 |
96 | ```{r, message=FALSE}
97 | library(GSVA)
98 |
99 | genesets <- readGMT(URL)
100 | genesets
101 | ```
102 | As we can see, it has detected the presence of duplicated gene set identifiers and
103 | applied a deduplication policy, which can be changed through the parameter
104 | `deduplUse`; see its help page with `?readGMT` for further deduplication
105 | policies. Like `getGmt()`, it also takes a parameter `geneIdType`, but by default
106 | it will automatically detect the type of gene identifier and fill in the
107 | corresponding metadata in the resulting `GeneSetCollection` object, so that in
108 | principle we do not need to specify it.
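What a deduplication policy does can be sketched in base R on a toy list. The two policies below (keep the first occurrence, or take the union of the genes) are illustrative assumptions and are not taken from the `readGMT()` source, whose help page remains the reference for the options actually available.

```{r}
# Toy gene sets in which the identifier SET_A appears twice
set_names <- c("SET_A", "SET_B", "SET_A")
set_genes <- list(c("TP53", "MYC"), c("EGFR", "KRAS"), c("MYC", "BRCA1"))

# Which identifiers are duplicated?
dup_ids <- unique(set_names[duplicated(set_names)])
dup_ids

# Policy 1: keep only the first occurrence of each identifier
is_first <- !duplicated(set_names)
keep_first <- setNames(set_genes[is_first], set_names[is_first])

# Policy 2: merge duplicates by taking the union of their genes
merged <- lapply(split(seq_along(set_names), set_names),
                 function(i) unique(unlist(set_genes[i])))
keep_first
merged
```

Under the first policy `SET_A` keeps only the genes of its first line, whereas under the second it accumulates the genes of both lines, so the choice of policy can change downstream results.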
109 |
110 | # Further reading
111 |
112 | - The [Introduction to GSEABase](https://bioconductor.org/packages/release/bioc/vignettes/GSEABase/inst/doc/GSEABase.html)
113 | vignette from the `r Biocpkg("GSEABase")` package.
114 | - The help pages of the `getGmt()` and `readGMT()` functions in the
115 | `r Biocpkg("GSEABase")` and `r Biocpkg("GSVA")` packages, respectively.
116 | - The Bioconductor package `r Biocpkg("msigdb")` provides an API for
117 | directly querying and downloading MSigDB GMT files as `GeneSetCollection`
118 | objects.
119 | - Similarly to `r Biocpkg("msigdb")`, the `r CRANpkg("msigdbr")` CRAN
120 | package provides an API to query and download MSigDB GMT files, but as
121 | [tidy](https://r4ds.hadley.nz/data-tidy.html) `data.frame` objects
122 | with one gene per row.
123 |
124 | # Session info
125 |
126 | <details>
127 | <summary>
128 | Click to display session info
129 | </summary>
130 | ```{r}
131 | sessionInfo()
132 | ```
133 | </details>
134 |
--------------------------------------------------------------------------------
/vignettes/how-to-use-tidy-principles-for-granges-manipulation.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to use tidy principles for genomic ranges manipulation
3 | author: Jacques Serizay
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 |   %\VignetteIndexEntry{How to use tidy principles for genomic ranges manipulation}
7 |   %\VignetteEngine{quarto::html}
8 |   %\VignetteEncoding{UTF-8}
9 | knitr:
10 |   opts_chunk:
11 |     collapse: true
12 |     comment: '#>'
13 | format:
14 |   html:
15 |     toc: true
16 |     html-math-method: mathjax
17 | bibliography: ../inst/references.bib
18 | ---
19 |
20 | ```{r}
21 | #| echo: false
22 |
23 | suppressPackageStartupMessages({
24 |     library(BiocStyle)
25 | })
26 | ```
27 |
28 | Tidy data principles are a set of principles for structuring data in a way
29 | that makes it easier to manipulate and analyze. They have been popularized by the
30 | `tidyverse` collection of R packages, which includes `dplyr` for data manipulation
31 | and `ggplot2` for data visualization. The tidy principles can also be applied to
32 | biological data, such as genomic ranges, to facilitate analysis and visualization,
33 | thanks to the `tidyomics` meta-package.
34 |
35 | # Bioconductor packages used in this document
36 |
37 | * `r Biocpkg("tidyomics")`
38 |
39 | # How to manipulate GRanges objects with tidy principles
40 |
41 | For illustration, we are going to download an `xlsx` file provided as
42 | Supplementary Data in [@Serizay2020Oct], and manipulate it using tidy principles.
43 |
44 | Bioconductor good practices recommend using
45 | the `BiocFileCache` package to download and cache files from remote servers.
46 |
47 | ```{r}
48 | library(BiocFileCache)
49 |
50 | # Create a BiocFileCache instance
51 | bfc <- BiocFileCache()
52 |
53 | # Download the xlsx file from the journal website and cache it locally using BiocFileCache
54 | xlsx_url <- "https://genome.cshlp.org/content/suppl/2020/11/16/gr.265934.120.DC1/Supplemental_Table_S2.xlsx"
55 | xlsx_file <- bfcrpath(bfc, xlsx_url)
56 |
57 | # The file is now downloaded and available
58 | xlsx_file
59 | ```
60 |
61 | Excel files can be read into `tibbles` in `R` thanks to the `read_xlsx()` function
62 | from the `readxl` package.
63 |
64 | ```{r}
65 | library(readxl)
66 | tbl <- read_xlsx(xlsx_file, sheet = "ATAC_metrics", skip = 2)
67 | tbl
68 | ```
69 |
70 | We convert it into a `GRanges` object directly using `as_granges()` from the `plyranges` package,
71 | using [tidy evaluation as in `dplyr` package](https://dplyr.tidyverse.org/articles/programming.html).
72 |
73 | ```{r}
74 | library(tidyomics)
75 | gr <- tbl |>
76 |     as_granges(seqnames = chrom_ce11, start = start_ce11, end = end_ce11)
77 | gr
78 | ```
79 |
80 | Let's say we are not interested in all the metadata columns, and we want to keep only the
81 | `annot_dev_age_tissues`, `Annotation`, and the columns ending with `_RPM`.
82 | We can use the `select()` function from `dplyr` to select these columns,
83 | just like we would do with a `data.frame` or `tibble`.
84 |
85 | ```{r}
86 | gr <- gr |>
87 |     select(annot_dev_age_tissues, Annotation, ends_with("_RPM"))
88 | gr
89 | ```
90 |
91 | In fact, most verbs from the `dplyr` package can be used with `GRanges` objects,
92 | such as `filter()`, `mutate()`, `arrange()`, `group_by()` and `summarise()`.
93 |
94 | ```{r}
95 | gr |>
96 |     mutate(start = start - 1000, end = end + 1000) |>
97 |     filter(start > 1e6) |>
98 |     arrange(Intest._RPM) |>
99 |     group_by(annot_dev_age_tissues, Annotation) |>
100 |     summarize(n = plyranges::n())
101 | ```
102 |
103 | Note that the `summarize()` function automatically returns a `DataFrame`, not a `GRanges` object.
104 |
105 | To enhance compatibility with the `tidyverse` ecosystem, the `plyranges` package
106 | provides an `as_tibble()` method for `GRanges` objects, which converts them
107 | into `tibbles`. This allows you to pass the contents of `GRanges` objects to `ggplot()` for visualization.
108 |
109 | ```{r}
110 | gr |>
111 |     filter(Annotation %in% c("Intest.", "Hypod.", "Soma", "Ubiq.")) |>
112 |     as_tibble() |>
113 |     ggplot(aes(x = `Intest._RPM`, y = `Hypod._RPM`, color = Annotation)) +
114 |     geom_point() +
115 |     labs(x = "Intestine RPM", y = "Hypodermis RPM") +
116 |     facet_grid(Annotation ~ annot_dev_age_tissues) +
117 |     theme_bw() +
118 |     theme(legend.position = "bottom")
119 | ```
120 |
121 | Finally, the `join` concept introduced by the `dplyr` package can also be applied to `GRanges` objects.
122 | In this case, the operation refers to joining overlapping, nearest, or preceding/following ranges,
123 | using the `join_*_*()` functions from the `plyranges` package.
124 |
125 | As an example, we can compare annotations from [@Serizay2020Oct] to chromosome
126 | arms/center:
127 |
128 | ```{r}
129 | arms <- GRanges(
130 |     seqnames = rep(c("chrI", "chrII", "chrIII", "chrIV", "chrV", "chrX"), each = 3),
131 |     ranges = IRanges(
132 |         start = c(
133 |             1, 4550001, 10750001,
134 |             1, 4550001, 10750001,
135 |             1, 4450001, 10150001,
136 |             1, 4750001, 12650001,
137 |             1, 4850001, 16250001,
138 |             1, 2850001, 14250001
139 |         ),
140 |         end = c(
141 |             4550000, 10750000, 15072423,
142 |             4550000, 10750000, 15279345,
143 |             4450000, 10150000, 13783700,
144 |             4750000, 12650000, 17493793,
145 |             4850000, 16250000, 20924149,
146 |             2850000, 14250000, 17718866
147 |         )
148 |     ),
149 |     strand = "*",
150 |     part = rep(c("arm", "center", "arm"), 6)
151 | )
152 |
153 | gr |>
154 |     join_overlap_left(arms) |>
155 |     group_by(Annotation, part) |>
156 |     filter(Annotation %in% c("Germline", "Neurons")) |>
157 |     summarize(n = plyranges::n())
158 | ```
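The overlap criterion behind such a join can be sketched in base R: two ranges on the same sequence overlap when each one starts no later than the other ends. The toy tables below are made up for illustration, and the code mimics a left overlap join on plain data frames. It ignores strand and is not the `plyranges` implementation.

```{r}
# Hypothetical query ranges and annotated regions
peaks <- data.frame(id = c("p1", "p2", "p3"),
                    seqnames = c("chrI", "chrI", "chrII"),
                    start = c(100, 900, 50),
                    end = c(200, 1100, 80))
regions <- data.frame(seqnames = c("chrI", "chrI"),
                      region_start = c(1, 1001),
                      region_end = c(1000, 2000),
                      part = c("arm", "center"))

# All peak/region pairs on the same sequence, then keep overlapping pairs
pairs <- merge(peaks, regions, by = "seqnames")
hits <- pairs[pairs$start <= pairs$region_end &
              pairs$end >= pairs$region_start, c("id", "part")]

# Left join: peaks without any overlap are kept, with `part` set to NA
joined <- merge(peaks, hits, by = "id", all.x = TRUE)
joined
```

Here `p2` straddles the boundary between the two regions and therefore appears twice, while `p3` has no overlapping region and is kept with a missing `part`. This duplication of query ranges is worth keeping in mind when counting rows after an overlap join.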
159 |
160 | # Further reading
161 |
162 | To read more about the `tidyomics` packages, please refer to the
163 | [tidyomics](https://github.com/tidyomics) GitHub organization page.
164 |
165 | More tutorials are provided by the `tidyomics` GitHub organization in the
166 | [tidy-ranges-tutorial](https://tidyomics.github.io/tidy-ranges-tutorial/) website.
167 |
168 | # References
169 |
170 | ::: {#refs}
171 | :::
172 |
173 | # Session info
174 |
175 | <details>
176 | <summary>
177 | Click to display session info
178 | </summary>
179 | ```{r}
180 | sessionInfo()
181 | ```
182 | </details>
183 |
184 |
--------------------------------------------------------------------------------
/vignettes/how-to-compute-sequence-composition-for-genomic-regions.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to compute sequence composition for genomic regions
3 | author: Michael Stadler
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 |   %\VignetteIndexEntry{How to compute sequence composition for genomic regions}
7 |   %\VignetteEngine{quarto::html}
8 |   %\VignetteEncoding{UTF-8}
9 | knitr:
10 |   opts_chunk:
11 |     collapse: true
12 |     comment: '#>'
13 | format:
14 |   html:
15 |     toc: true
16 |     html-math-method: mathjax
17 | embed-resources: true
18 | ---
19 |
20 | ```{r}
21 | #| echo: false
22 |
23 | suppressPackageStartupMessages({
24 |     library(BiocStyle)
25 | })
26 | ```
27 |
28 | This HowTo shows how to fetch sequences for given regions from a reference and
29 | compute composition features for them.
30 |
31 | # Bioconductor packages used in this document
32 |
33 | * `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")`
34 | * `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")`
35 | * `r Biocpkg("GenomicFeatures")`
36 | * `r Biocpkg("Biostrings")`
37 |
38 | # How to compute sequence composition for genomic regions
39 |
40 | Mammalian genomes contain regulatory regions with a high content of CpG
41 | dinucleotides, so-called *CpG islands*, a side effect indirectly caused by
42 | the methylation of cytosines in CpG contexts. Many transcripts contain
43 | a CpG island in their promoter region, near the start of the transcript.
44 | In this HowTo, we will identify such transcripts by analyzing the CpG
45 | composition of promoter sequences.
46 |
47 | We first obtain the coordinates of promoter regions from the
48 | `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")` package, using the
49 | `promoters` function from the `r Biocpkg("GenomicFeatures")` package:
50 |
51 | ```{r}
52 | suppressPackageStartupMessages({
53 |     library(GenomicFeatures)
54 |     library(TxDb.Hsapiens.UCSC.hg38.knownGene)
55 | })
56 | txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
57 | prom <- promoters(txdb, upstream = 1000, downstream = 1000)
58 | prom
59 | ```
60 |
61 | Here, promoter regions are defined as 2000 base pair regions centered on
62 | transcript start sites (the 5'-end of a transcript). The warning message above
63 | is caused by start sites that are near the boundary of a chromosome, such that
64 | the promoter region will extend beyond its beginning or end. By calling `trim`
65 | on the `GRanges` object, all regions are truncated to the limits of the
66 | chromosomes, resulting in a few promoters shorter than 2000 base pairs:
67 |
68 | ```{r}
69 | summary(width(prom) == 2000)
70 | prom <- trim(prom)
71 | summary(width(prom) == 2000)
72 | ```
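The arithmetic behind such windows can be sketched in base R. The example below uses made-up transcripts on a hypothetical 10,000 bp chromosome, takes the transcript start site to be `start` on the plus strand and `end` on the minus strand, and clips each window to the chromosome limits, which is the effect trimming has here. It is an illustration under these assumptions, not the `promoters` implementation.

```{r}
chrom_length <- 10000  # hypothetical chromosome length
tx <- data.frame(
    start  = c(500, 4000, 9000),
    end    = c(2500, 6000, 9800),
    strand = c("+", "+", "-")
)
upstream <- 1000
downstream <- 1000

# Strand-aware window: `upstream` bases before the TSS, `downstream` from it on
plus <- tx$strand == "+"
tss <- ifelse(plus, tx$start, tx$end)
prom_start <- ifelse(plus, tss - upstream, tss - downstream + 1)
prom_end   <- ifelse(plus, tss + downstream - 1, tss + upstream)

# Clip to the chromosome, as trimming does for out-of-bound ranges
prom_start <- pmax(prom_start, 1)
prom_end   <- pmin(prom_end, chrom_length)
prom_tbl <- data.frame(tss, prom_start, prom_end,
                       width = prom_end - prom_start + 1)
prom_tbl
```

Only the middle transcript keeps the full 2000 bp window. The first is clipped at the start of the chromosome and the third, on the minus strand, at its end.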
73 |
74 | It may be useful to now subset the extracted promoters, for example to only
75 | retain promoters on autosomes, or only a single promoter per gene.
76 | Here, we will remove redundant promoters resulting from transcripts with
77 | identical start sites:
78 |
79 | ```{r}
80 | summary(duplicated(prom))
81 | prom <- prom[!duplicated(prom)]
82 | ```
83 |
84 | Next, we extract the sequences of the promoters from the reference genome
85 | assembly in the `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")` package:
86 |
87 | ```{r}
88 | suppressPackageStartupMessages({
89 |     library(Biostrings)
90 |     library(BSgenome.Hsapiens.UCSC.hg38)
91 | })
92 | promseq <- getSeq(BSgenome.Hsapiens.UCSC.hg38, prom)
93 | promseq
94 | ```
95 |
96 | We then calculate some sequence features: the percent of G and C bases, and
97 | the ratio of observed over expected CpG dinucleotides (where the expected
98 | frequency of a dinucleotide is the product of the frequencies of the two
99 | mononucleotides). We obtain the mono- and dinucleotide frequencies from the
100 | sequences using the `oligonucleotideFrequency` function from the
101 | `r Biocpkg("Biostrings")` package:
102 |
103 | ```{r}
104 | mo <- oligonucleotideFrequency(promseq, width = 1, as.prob = TRUE)
105 | di <- oligonucleotideFrequency(promseq, width = 2, as.prob = TRUE)
106 |
107 | head(mo)
108 | head(di)
109 |
110 | percentGC <- 100 * (mo[, "C"] + mo[, "G"])
111 | expectedCpG <- mo[, "C"] * mo[, "G"]
112 | obsExpRatioCpG <- di[, "CG"] / expectedCpG
113 | ```
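To make the definition concrete, the same quantities can be computed by hand in base R for a short toy sequence. The helper `cpg_obs_exp()` below is a hypothetical name introduced for illustration. It counts overlapping dinucleotides, which is the convention assumed here for the dinucleotide frequencies.

```{r}
cpg_obs_exp <- function(seq) {
    bases <- strsplit(seq, "")[[1]]
    n <- length(bases)
    # Mononucleotide frequencies of C and G
    freq_c <- mean(bases == "C")
    freq_g <- mean(bases == "G")
    # Overlapping dinucleotides: positions 1-2, 2-3, ..., (n-1)-n
    dinuc <- paste0(bases[-n], bases[-1])
    freq_cg <- mean(dinuc == "CG")
    c(percentGC = 100 * (freq_c + freq_g),
      obsExpRatioCpG = freq_cg / (freq_c * freq_g))
}

# CpG-rich toy sequence: 2 of its 5 dinucleotides are CG
cpg_obs_exp("ACGCGT")

# Similar base composition but no CG dinucleotide at all
cpg_obs_exp("ACCTGGCAGT")
```

The two toy sequences have a comparable G+C content but very different observed over expected ratios, which is why the ratio is the more informative feature for separating CpG island promoters from the rest.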
114 |
115 | Finally, we can look at the distribution of these sequence features.
116 | The percentage of G+C bases in promoters spans a broad range from about 30% to
117 | over 70%:
118 |
119 | ```{r}
120 | #| fig.width: 6
121 | #| fig.height: 6
122 | hist(percentGC, 60)
123 | ```
124 |
125 | Defining CpG island promoters using this distribution would be hard, but is
126 | much easier using the observed over expected ratio of CpGs, which shows a
127 | much clearer bimodal structure:
128 |
129 | ```{r}
130 | #| fig.width: 6
131 | #| fig.height: 6
132 | hist(obsExpRatioCpG, 60)
133 | abline(v = 0.45, col = 2, lty = 2)
134 | ```
135 |
136 | Interestingly, almost all ratios are below 1.0, indicating that CpG islands
137 | are in fact not enriched for CpG dinucleotides, but rather lose them at a
138 | lower rate than other genomic regions.
139 |
140 | Plotting the ratio versus the GC percentage furthermore reveals that CpG island
141 | promoters are also slightly GC-richer:
142 |
143 | ```{r}
144 | #| fig.width: 6
145 | #| fig.height: 6
146 | plot(obsExpRatioCpG, percentGC, pch = "*", col = "#22222205")
147 | abline(v = 0.45, col = 2, lty = 2)
148 | ```
149 |
150 | # Further reading
151 |
152 | More examples of how to represent and work with biological sequences are
153 | contained in the vignettes of the `r Biocpkg("Biostrings")` package,
154 | available for example from `vignette(package = "Biostrings")`.
155 |
156 | If no `BSgenome` object is available for your genome of interest, you can also
157 | use alternative sources to extract your sequences from, such as `DNAStringSet`
158 | objects or files in fasta or 2bit format. A list of inputs supported by the
159 | `getSeq` function can be obtained using:
160 |
161 | ```{r}
162 | showMethods(getSeq)
163 | ```
164 |
165 | For more information about CpG islands, see:
166 |
167 | - Bird A, Taggart M, Frommer M, Miller OJ, Macleod D. A fraction of the mouse
168 | genome that is derived from islands of nonmethylated, CpG-rich DNA.
169 | Cell. 1985 Jan;40(1):91-9. doi: 10.1016/0092-8674(85)90312-5. PMID: 2981636.
170 | - Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes.
171 | J Mol Biol. 1987 Jul 20;196(2):261-82.
172 | doi: 10.1016/0022-2836(87)90689-9. PMID: 3656447.
173 |
174 | # Session info
175 |
176 | <details>
177 | <summary>
178 | Click to display session info
179 | </summary>
180 | ```{r}
181 | sessionInfo()
182 | ```
183 | </details>
184 |
--------------------------------------------------------------------------------
/vignettes/how-to-extract-promoter-sequences.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: How to extract promoter sequences
3 | author: Fabricio Almeida-Silva
4 | date: "`r Sys.Date()`"
5 | vignette: >
6 | %\VignetteIndexEntry{How to extract promoter sequences}
7 | %\VignetteEngine{quarto::html}
8 | %\VignetteEncoding{UTF-8}
9 | knitr:
10 | opts_chunk:
11 | collapse: true
12 | comment: '#>'
13 | format:
14 | html:
15 | toc: true
16 | html-math-method: mathjax
17 | embed-resources: true
18 | ---
19 |
20 | ```{r}
21 | #| echo: false
22 |
23 | suppressPackageStartupMessages({
24 | library(BiocStyle)
25 | })
26 | ```
27 |
28 | Promoters are DNA sequences typically located upstream of a
29 | gene's transcription start site (TSS), and they contain binding
30 | sites for transcription factors.
31 | Researchers working on gene regulation often want to extract
32 | promoter sequences for all genes or a set of genes. With these sequences,
33 | researchers can, for example, look for known transcription factor
34 | binding sites (TFBS), or predict TFBS *de novo*. This HowTo will demonstrate
35 | how to extract promoter sequences for any gene(s) in a genome.
36 |
37 | # Bioconductor packages used in this document
38 |
39 | * `r Biocpkg("Biostrings")`
40 | * `r Biocpkg("rtracklayer")`
41 | * `r Biocpkg("GenomicRanges")`
42 | * `r Biocpkg("txdbmaker")`
43 | * `r Biocpkg("GenomicFeatures")`
44 | * `r Biocpkg("BSgenome")`
45 |
46 | # How to extract promoter sequences
47 |
48 | We will start by loading the required packages.
49 |
50 | ```{r}
51 | #| message: false
52 |
53 | library(Biostrings)
54 | library(GenomicRanges)
55 | library(rtracklayer)
56 | library(txdbmaker)
57 | library(GenomicFeatures)
58 | library(BSgenome)
59 | ```
60 |
61 | Then, we will get our example data set. Here, we will use data for
62 | the model plant *Arabidopsis thaliana*. We will start by obtaining
63 | genome sequences and gene annotation from [Ensembl Plants](https://plants.ensembl.org).
64 |
65 | ::: {.callout-note}
66 |
67 | ## Working with model and non-model organisms
68 |
69 | Here, we're obtaining data from [Ensembl Plants](https://plants.ensembl.org)
70 | to explicitly show how this can be done for **any** organism. However,
71 | if you're lucky enough to work with a model organism (e.g., human or mouse),
72 | you can find curated genome sequences and gene annotation
73 | in `r Biocpkg("AnnotationHub")` (see, for instance,
74 | [this](https://bioconductor.github.io/BiocHowTo/articles/how-to-retrieve-gene-model-from-annotationhub.html) HowTo article).
75 |
76 | :::
77 |
78 | ```{r}
79 | # Read genomic data from Ensembl Plants
80 | ## Genome
81 | genome <- Biostrings::readDNAStringSet(
82 | "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"
83 | )
84 | names(genome) <- gsub(" .*", "", names(genome)) # remove text after chr name
85 |
86 | ## Annotation
87 | annotation <- rtracklayer::import(
88 | "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.61.gff3.gz"
89 | )
90 | ```
91 |
92 | Genome sequences are stored in `DNAStringSet` objects, and
93 | gene annotation (i.e., start and end coordinates of each gene) is stored in
94 | `GRanges` objects. This is what these objects look like:
95 |
96 | ```{r}
97 | # Take a look at the genome and annotation
98 | genome
99 | annotation
100 | ```
101 |
102 | Note that our `GRanges` object contains ranges for multiple types of features,
103 | including genes, mRNAs, exons, and even entire chromosomes.
104 | Since we're only interested in promoter regions, we will subset
105 | `annotation` to keep only 'gene' ranges.
106 |
107 | ```{r}
108 | # Keep only gene ranges
109 | genes <- annotation[annotation$type == "gene"]
110 | genes
111 | ```
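
If you only need promoters for a set of genes, you can subset the gene ranges
by identifier before going further. The sketch below uses a toy `GRanges`
object with a hypothetical `gene_id` metadata column and made-up identifiers;
in real data, check `mcols(genes)` to see which column holds the gene IDs.

```{r}
library(GenomicRanges)

# Toy annotation with a hypothetical `gene_id` column (made-up coordinates)
toy_annot <- GRanges(
    seqnames = "1",
    ranges = IRanges(start = c(1000, 5000, 9000), end = c(2000, 6500, 9800)),
    strand = c("+", "-", "+"),
    gene_id = c("geneA", "geneB", "geneC")
)

# Keep only the genes of interest
genes_of_interest <- c("geneA", "geneC")
toy_subset <- toy_annot[toy_annot$gene_id %in% genes_of_interest]
toy_subset
```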
112 |
113 | Next, we can use the `promoters()` function from
114 | `r Biocpkg("GenomicRanges")` to extract the genomic coordinates of promoters.
115 | By default, promoters start 2000 bp upstream of the transcription start
116 | site (TSS) and end 200 bp downstream of it, but you can change these
117 | distances with the `upstream` and `downstream` arguments of `promoters()`.
118 | This can matter: for example, some transcription factor families in plants
119 | (e.g., HSF and C2C2-GATA) bind exclusively to regions downstream of the TSS,
120 | contrary to the common view that TFBS are located upstream of TSSs.
121 | Importantly, if a gene has multiple TSSs (e.g., because different
122 | transcripts start at different positions), the most upstream TSS is
123 | used, rather than the TSS of a 'representative' or 'canonical' transcript.
124 |
125 | ```{r}
126 | # Get promoter regions from a `GRanges` object
127 | prom_genes <- GenomicRanges::promoters(genes)
128 |
129 | prom_genes
130 | ```
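
To see how `upstream` and `downstream` interact with strand, here is a small
sketch on a toy `GRanges` object. The coordinates and the 500/100 bp window
are arbitrary choices for illustration, not recommended values.

```{r}
library(GenomicRanges)

# Two toy genes on opposite strands (made-up coordinates)
toy_genes <- GRanges(
    seqnames = "1",
    ranges = IRanges(start = c(5000, 20000), end = c(6000, 22000)),
    strand = c("+", "-")
)

# 500 bp upstream and 100 bp downstream of each TSS
toy_prom <- promoters(toy_genes, upstream = 500, downstream = 100)
toy_prom

# Plus strand: the TSS is the range start (5000), so the promoter is 4500-5099
# Minus strand: the TSS is the range end (22000), so the promoter is 21901-22500
width(toy_prom)
```

Both promoters are 600 bp wide, and the one on the minus strand extends to the
right of the gene because "upstream" is defined relative to the direction of
transcription.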
131 |
132 | Alternatively, if you want to extract promoters for each transcript separately,
133 | you can subset the `GRanges` object to extract only ranges of
134 | type 'transcript', and then use `promoters()` as demonstrated above.
135 | Note that GFF3 files from some databases do not contain a 'transcript' type.
136 | In these cases, you will most likely find transcript-level ranges in rows of
137 | type 'mRNA'. This is the case for [Ensembl Plants](https://plants.ensembl.org),
138 | from which we obtained our example data.
139 | To extract promoters for each *A. thaliana* transcript, you would run:
140 |
141 | ```{r}
142 | # Extract transcript ranges from a GRanges object
143 | tx_ranges <- annotation[annotation$type == "mRNA"] # or 'transcript' (if any)
144 |
145 | # Extract promoters for each transcript separately
146 | prom_tx <- promoters(tx_ranges)
147 | prom_tx
148 | ```
149 |
150 | ::: {.callout-tip}
151 |
152 | ## Extracting promoter regions from `TxDb` objects
153 |
154 | If you have a `TxDb` object with transcript coordinates and metadata,
155 | you can extract promoter regions for each transcript with the `promoters()`
156 | function from the `r Biocpkg("GenomicFeatures")` package. Note that the two
157 | functions share a name, but `GenomicFeatures::promoters()` takes `TxDb` objects as input,
158 | while `GenomicRanges::promoters()` takes `GRanges` objects as input.
159 |
160 | ```{r}
161 | # Create a `TxDb` object from a `GRanges` object
162 | txdb <- txdbmaker::makeTxDbFromGRanges(annotation)
163 |
164 | # Get promoter regions for all transcripts
165 | prom_tx2 <- GenomicFeatures::promoters(txdb)
166 |
167 | prom_tx2
168 | ```
169 |
170 | :::
171 |
172 | Finally, you can extract sequences given a genome and some coordinates with
173 | the `getSeq()` function from `r Biocpkg("BSgenome")`.
174 |
175 | ```{r}
176 | # Extract promoter sequences (1k genes only for demo purposes)
177 | prom_seqs <- getSeq(genome, prom_genes[1:1000])
178 |
179 | prom_seqs
180 | ```
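
Two practical follow-ups are often useful: promoters of genes close to a
chromosome end may extend beyond the sequence, and the extracted sequences
usually need names before being saved. The sketch below illustrates both on a
made-up 60-bp "genome"; the object names, the `gene_id` column, and the
temporary output file are assumptions for illustration.

```{r}
library(Biostrings)
library(GenomicRanges)
library(BSgenome)

# Toy genome: a single made-up 60-bp chromosome
toy_genome <- DNAStringSet(c(chrToy = strrep("ACGT", 15)))

# Toy gene starting close to the chromosome start
toy_gene <- GRanges("chrToy", IRanges(start = 11, end = 50),
                    strand = "+", gene_id = "geneA")

# Register chromosome lengths so out-of-bound ranges can be detected
chr_len <- setNames(width(toy_genome), names(toy_genome))
seqlengths(toy_gene) <- chr_len[seqlevels(toy_gene)]

# This promoter would start at position -9; promoters() warns about it,
# and trim() clips the range to the chromosome boundaries (1-15)
toy_prom <- suppressWarnings(promoters(toy_gene, upstream = 20, downstream = 5))
toy_prom <- trim(toy_prom)

# Extract the sequence, name it after the gene, and write it to a FASTA file
toy_seq <- getSeq(toy_genome, toy_prom)
names(toy_seq) <- toy_prom$gene_id
out_file <- tempfile(fileext = ".fa")
writeXStringSet(toy_seq, out_file)
toy_seq
```

Note that trimmed promoters are shorter than the requested window, which may
matter for analyses that assume equal-length sequences.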
181 |
182 |
183 | # Further reading
184 |
185 | To learn more about how to work with sequences and ranges using Bioconductor
186 | packages, look at the vignettes of the `r Biocpkg("Biostrings")` and
187 | `r Biocpkg("GenomicRanges")` packages.
188 |
189 | # Session info
190 |
191 | <details>
192 | <summary>
193 | Click to display session info
194 | </summary>
195 | ```{r}
196 | sessionInfo()
197 | ```
198 | </details>
199 |
--------------------------------------------------------------------------------