├── LICENSE
├── NAMESPACE
├── man
    └── figures
    │   └── bioconductor.png
├── .Rbuildignore
├── pkgdown
    └── extra.css
├── inst
    ├── references.bib
    └── templates
    │   └── howto_template.qmd
├── LICENSE.md
├── _pkgdown.yml
├── README.md
├── vignettes
    ├── how-to-read-mass-spectrometry-data.qmd
    ├── how-to-load-gene-from-gff-gtf.qmd
    ├── how-to-read-big-bam-file-in-chunks.qmd
    ├── how-to-retrieve-gene-model-from-annotationhub.qmd
    ├── how-to-read-single-end-reads-from-bam-file.qmd
    ├── how-to-compute-read-coverage.qmd
    ├── how-to-read-paired-end-reads-from-bam-file.qmd
    ├── how-to-use-tidy-principles-for-rna-seq-analysis.qmd
    ├── how-to-get-exon-intron-sequence-for-gene.qmd
    ├── how-to-read-gene-sets-from-gmt-files.qmd
    ├── how-to-use-tidy-principles-for-granges-manipulation.qmd
    ├── how-to-compute-sequence-composition-for-genomic-regions.qmd
    └── how-to-extract-promoter-sequences.qmd
├── DESCRIPTION
└── .github
    └── workflows
        └── R-CMD-check.yaml


/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2025
2 | COPYRIGHT HOLDER: BiocHowTo authors
3 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 | 
3 | 


--------------------------------------------------------------------------------
/man/figures/bioconductor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Bioconductor/BiocHowTo/devel/man/figures/bioconductor.png


--------------------------------------------------------------------------------
/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | ^LICENSE\.md$
4 | ^vignettes/.*_files$
5 | ^_pkgdown\.yml$
6 | ^\.github$
7 | 


--------------------------------------------------------------------------------
/pkgdown/extra.css:
--------------------------------------------------------------------------------
1 | .navbar {
2 |   background: linear-gradient(to right, #3a85ab, #4ca139) !important;
3 |   background-image: linear-gradient(to right, #3a85ab, #4ca139) !important;
4 |   border: none !important;
5 | }
6 | 
7 | 


--------------------------------------------------------------------------------
/inst/references.bib:
--------------------------------------------------------------------------------
 1 | @article{Serizay2020Oct,
 2 | 	author = {Serizay, Jacques and Dong, Yan and J{\ifmmode\ddot{a}\else\"{a}\fi}nes, J{\ifmmode\ddot{u}\else\"{u}\fi}rgen and Chesney, Michael and Cerrato, Chiara and Ahringer, Julie},
 3 | 	title = {{Distinctive regulatory architectures of germline-active and somatic genes in C. elegans}},
 4 | 	journal = {Genome Res.},
 5 | 	volume = {30},
 6 | 	number = {12},
 7 | 	pages = {1752--1765},
 8 | 	year = {2020},
 9 | 	month = oct,
10 | 	issn = {1088-9051},
11 | 	publisher = {Cold Spring Harbor Lab},
12 | 	doi = {10.1101/gr.265934.120}
13 | }
14 | 


--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
 1 | # MIT License
 2 | 
 3 | Copyright (c) 2025 BiocHowTo authors
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/_pkgdown.yml:
--------------------------------------------------------------------------------
 1 | url: https://bioconductor.github.io/BiocHowTo/
 2 | template:
 3 |   bootstrap: 5
 4 |   params:
 5 |     bootswatch: zephyr
 6 | 
 7 | navbar:
 8 |   structure:
 9 |     left: [home, articles]
10 |     right: [search, github]
11 | 
12 | articles:
13 | - title: Interacting with BAM files
14 |   contents:
15 |   - how-to-read-single-end-reads-from-bam-file
16 |   - how-to-read-paired-end-reads-from-bam-file
17 |   - how-to-read-big-bam-file-in-chunks
18 |   - how-to-compute-read-coverage
19 | 
20 | - title: Mass spectrometry
21 |   contents:
22 |   - how-to-read-mass-spectrometry-data
23 | 
24 | - title: Tidy principles in Bioconductor
25 |   contents:
26 |   - how-to-use-tidy-principles-for-granges-manipulation
27 |   - how-to-use-tidy-principles-for-rna-seq-analysis
28 | 
29 | - title: Accessing genomic annotation data
30 |   contents:
31 |   - how-to-retrieve-gene-model-from-annotationhub
32 |   - how-to-load-gene-from-gff-gtf
33 |   - how-to-get-exon-intron-sequence-for-gene
34 |   - how-to-read-gene-sets-from-gmt-files
35 | 
36 | - title: Working with biological sequences
37 |   contents:
38 |   - how-to-compute-sequence-composition-for-genomic-regions
39 |   - how-to-extract-promoter-sequences
40 | 


--------------------------------------------------------------------------------
/inst/templates/howto_template.qmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: EDIT:HowToTitle
 3 | author: EDIT:HowToAuthor
 4 | date: "`r Sys.Date()`"
 5 | vignette: >
 6 |   %\VignetteIndexEntry{EDIT:HowToTitle}
 7 |   %\VignetteEngine{quarto::html}
 8 |   %\VignetteEncoding{UTF-8}
 9 | knitr:
10 |     opts_chunk:
11 |         collapse: true
12 |         comment: '#>'
13 | format:
14 |     html:
15 |         toc: true
16 |         html-math-method: mathjax
17 |         embed-resources: true
18 | ---
19 | 
20 | ```{r}
21 | #| echo: false
22 | 
23 | suppressPackageStartupMessages({
24 |     library(BiocStyle)
25 | })
26 | ```
27 | 
28 | EDIT:Short introduction
29 | 
30 | # Bioconductor packages used in this document
31 | 
32 | EDIT:Add list of packages used in the HowTo. Use `r Biocpkg("PkgName")` to
33 | refer to Bioconductor packages, and similarly for other package sources
34 | (see https://bioconductor.org/packages/release/bioc/vignettes/BiocStyle/inst/doc/AuthoringRmdVignettes.html#4_Style_macros). Example:
35 | * `r Biocpkg("pasillaBamSubset")`
36 | 
37 | # EDIT:Main section
38 | 
39 | EDIT:Here, put the code and explanations for your HowTo. Keep in mind that
40 | HowTos should be short, and address a well-defined, specific task using
41 | Bioconductor.
42 | 
43 | # Further reading
44 | 
 45 | EDIT:Add a (short) list of useful resources about the topic.
46 | 
47 | # Session info
48 | 
49 | <details>
50 | <summary><b>
51 | Click to display session info
52 | </b></summary>
53 | ```{r}
54 | sessionInfo()
55 | ```
56 | </details>
57 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | <br>
 2 | 
 3 | <img src="man/figures/bioconductor.png" align="right" alt="Bioconductor logo" width="150"/>
 4 | 
 5 | <br>
 6 | 
 7 | # BiocHowTo
 8 | 
 9 | <br>
10 | 
11 | Welcome to the Bioconductor HowTo collection! The HowTos are short, 
12 | stand-alone documents, each focused on how to address a well-defined question 
13 | using one or more Bioconductor packages. 
14 | 
15 | For a list of the available HowTo documents, see the [Articles](articles/index.html) tab.
16 | 
17 | ## How to contribute
18 | 
19 | 1. [Fork this repository](https://github.com/Bioconductor/BiocHowTo/fork) and create a new branch. 
20 | 2. Copy and rename the `howto_template.qmd` file from `inst/templates` to 
21 | the `vignettes` directory, and edit the copy accordingly. Please choose the 
22 | name for the new vignette in such a way that it is unique and clearly indicates 
23 | the content. Note that the title of the vignette should be added both to the 
24 | `title` field and the `%\VignetteIndexEntry` in the YAML section.
25 | 3. Test the vignette locally in a fresh R session, to make sure that it is 
26 | self-contained and runs without errors.
 27 | 4. Add your name to the `Authors@R` field in the `DESCRIPTION` file.
28 | 5. If your HowTo is using packages that are not already included among the 
29 | dependencies of `BiocHowTo`, please add them to the list of `Imports` in the 
30 | `DESCRIPTION` file. One way to do this is via the `usethis` package:
31 | 
 32 | ```r
33 | usethis::use_package("name_of_new_dependency")
34 | ```
35 | 
36 | 6. Add the new vignette to a suitable section under 'Articles' in the 
37 | `_pkgdown.yml` file.
38 | 7. Push the changes to your forked repository and open a pull request to 
39 | the `devel` branch of the parent repository.
40 | 8. All submissions will be reviewed by the Bioconductor Training Committee. 
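Steps 2 and 3 above can be sketched in base R as follows. This is only an illustration: the helper name `new_howto()` and its arguments are hypothetical and not part of `BiocHowTo`, and the sketch assumes that the title placeholder in the template is the literal string `EDIT:HowToTitle`.

```r
# Hypothetical helper (not part of BiocHowTo): copy the template into
# vignettes/ under a new name and fill in the title placeholders.
new_howto <- function(file_name, title, pkg_root = ".") {
    template <- file.path(pkg_root, "inst", "templates", "howto_template.qmd")
    target <- file.path(pkg_root, "vignettes", paste0(file_name, ".qmd"))
    # Refuse to overwrite an existing vignette
    stopifnot(file.exists(template), !file.exists(target))
    lines <- readLines(template, warn = FALSE)
    # The title appears both in `title` and in `%\VignetteIndexEntry`
    lines <- gsub("EDIT:HowToTitle", title, lines, fixed = TRUE)
    writeLines(lines, target)
    invisible(target)
}

# Example call, run from the root of your fork:
# new_howto("how-to-do-something-specific", "How to do something specific")
# The new file can then be rendered in a fresh R session, for instance with
# quarto::quarto_render("vignettes/how-to-do-something-specific.qmd")
```

The remaining `EDIT:` placeholders (author, introduction, main section) still have to be filled in by hand.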
41 | 
42 | 
43 | ## How to suggest a new topic
44 | 
45 | To suggest a new topic for a HowTo, open an issue and provide some more
46 | details of your suggestion. If you would like to take on one of the open 
47 | issues, either assign yourself (if possible), or comment in the issue, and 
48 | one of the administrators can assign you. 
49 | 


--------------------------------------------------------------------------------
/vignettes/how-to-read-mass-spectrometry-data.qmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: How to read mass spectrometry data
 3 | author: Laurent Gatto
 4 | date: "`r Sys.Date()`"
 5 | vignette: >
 6 |   %\VignetteIndexEntry{How to read mass spectrometry data}
 7 |   %\VignetteEngine{quarto::html}
 8 |   %\VignetteEncoding{UTF-8}
 9 | knitr:
10 |     opts_chunk:
11 |         collapse: true
12 |         comment: '#>'
13 | format:
14 |     html:
15 |         toc: true
16 |         html-math-method: mathjax
17 | ---
18 | 
19 | ```{r}
20 | #| echo: false
21 | 
22 | suppressPackageStartupMessages({
23 |     library(BiocStyle)
24 | })
25 | ```
26 | 
27 | # Bioconductor packages used in this document
28 | 
29 | * `r Biocpkg("Spectra")`
30 | * `r Biocpkg("mzR")`
31 | * `r Biocpkg("MsDataHub")`
32 | 
33 | # How to read mass spectrometry data
34 | 
35 | Let's first get some example raw mass spectrometry data from the
 36 | *MsDataHub* package. Below, we download two _sciex_ mzML files that
 37 | contain profile-mode LC-MS data of pooled human serum samples and
 38 | create a vector of file names, `fls`:
39 | 
40 | ```{r}
41 | library(MsDataHub)
42 | fls <- c(X20171016_POOL_POS_1_105.134.mzML(),
43 |          X20171016_POOL_POS_3_105.134.mzML())
44 | fls
45 | ```
46 | 
47 | We can now use the `Spectra()` constructor function to create an
48 | object of class `Spectra` that contains the raw data. Note that the
49 | `mzR` package is needed to read the content of the mzML files.
50 | 
51 | 
52 | ```{r}
53 | suppressPackageStartupMessages({
54 |     library("Spectra")
55 |     library("mzR")
56 | })
57 | ```
58 | 
59 | ```{r}
60 | sp <- Spectra(fls)
61 | sp
62 | ```
63 | 
64 | The object contains `r length(sp)` spectra from both files.
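Once the data are loaded, a few accessor functions from *Spectra* give a quick overview of the content. The sketch below recreates the `sp` object so that it can be run on its own; which accessors are most useful depends on your data, so treat the selection as a suggestion.

```r
suppressPackageStartupMessages({
    library(MsDataHub)
    library(Spectra)
    library(mzR)
})
### Internet connection required to download the example files
fls <- c(X20171016_POOL_POS_1_105.134.mzML(),
         X20171016_POOL_POS_3_105.134.mzML())
sp <- Spectra(fls)

table(msLevel(sp))               # number of spectra per MS level
table(basename(dataOrigin(sp)))  # number of spectra per input file
range(rtime(sp))                 # retention time range
spectraVariables(sp)             # names of the per-spectrum variables
```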
65 | 
66 | # Further reading
67 | 
68 | - The [R for Mass Spectrometry](https://rformassspectrometry.github.io/book/) book.
69 | - The [Large-scale data handling and processing with
70 |   Spectra](https://rformassspectrometry.github.io/Spectra/articles/Spectra-large-scale.html)
71 |   vignette from the *Spectra* package.
72 | 
73 | # Session info
74 | 
75 | <details>
76 | <summary><b>
77 | Click to display session info
78 | </b></summary>
79 | ```{r}
80 | sessionInfo()
81 | ```
82 | </details>
83 | 


--------------------------------------------------------------------------------
/vignettes/how-to-load-gene-from-gff-gtf.qmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: How to load a gene model from a GFF or GTF file
 3 | author: Bioconductor Core Team
 4 | date: "`r Sys.Date()`"
 5 | vignette: >
 6 |   %\VignetteIndexEntry{How to load a gene model from a GFF or GTF file}
 7 |   %\VignetteEngine{quarto::html}
 8 |   %\VignetteEncoding{UTF-8}
 9 | knitr:
10 |     opts_chunk: 
11 |         collapse: true
12 |         comment: '#>'
13 | format:
14 |     html:
15 |         toc: true
16 |         html-math-method: mathjax
17 | ---
18 | 
19 | ```{r}
20 | #| echo: false
21 | 
22 | suppressPackageStartupMessages({
23 |     library(BiocStyle)
24 | })
25 | ```
26 | 
 27 | A **gene model** is essentially a set of annotations that describes the
 28 | genomic locations of the known genes, transcripts, exons, and CDS for
 29 | a given organism. Gene models are commonly stored in the standardized
 30 | **[GFF or GTF](https://useast.ensembl.org/info/website/upload/gff.html)** file formats.
 31 | In Bioconductor, gene model information is typically represented as a *TxDb*
 32 | object, but sometimes also as a *GRanges* or *GRangesList* object. We can use the
33 | `makeTxDbFromGFF()` function from the `r Biocpkg("txdbmaker")` package to 
34 | import a GFF or GTF file as a *TxDb* object.
35 | 
36 | # Bioconductor packages used in this document
37 | 
38 | * `r Biocpkg("txdbmaker")`
39 | 
40 | # How to load a gene model from a GFF or GTF file
41 | 
42 | We will use a small .gff3 file provided by the `r Biocpkg("txdbmaker")` package.
43 | 
44 | ```{r, warning=FALSE, message=FALSE}
45 | suppressPackageStartupMessages({
46 |     library(txdbmaker)
47 | })
48 | 
 49 | gff_file <- system.file("extdata", "GFF3_files", "a.gff3", package="txdbmaker")
 50 | 
 51 | txdb <- makeTxDbFromGFF(gff_file, format="gff3")
 52 | txdb
53 | ```
54 | 
55 | 
56 | See `?makeTxDbFromGFF` in the `r Biocpkg("txdbmaker")` package for more information.
57 | 
58 | Extract the exon coordinates grouped by gene from this gene model:
59 | 
60 | ```{r}
61 | exonsBy(txdb, by="gene")
62 | ```
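Other features can be extracted from the same *TxDb* object in a similar way. The sketch below uses extractor functions from the `GenomicFeatures` package (loaded explicitly here so that the block is self-contained) and rebuilds `txdb` from the same example file.

```r
suppressPackageStartupMessages({
    library(txdbmaker)
    library(GenomicFeatures)
})
gff_file <- system.file("extdata", "GFF3_files", "a.gff3",
                        package = "txdbmaker")
txdb <- suppressWarnings(makeTxDbFromGFF(gff_file, format = "gff3"))

# All transcripts as a single GRanges object
tx <- transcripts(txdb)
tx

# Transcripts grouped by gene, and CDS grouped by transcript
tx_by_gene <- transcriptsBy(txdb, by = "gene")
cds_by_tx <- cdsBy(txdb, by = "tx")
```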
63 | 
64 | # Session info
65 | 
66 | <details>
67 | <summary><b>
68 | Click to display session info
69 | </b></summary>
70 | ```{r}
71 | sessionInfo()
72 | ```
73 | </details>
74 | 
75 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
  1 | Package: BiocHowTo
  2 | Type: Package
  3 | Title: A Collection of How To Documents For Bioconductor
  4 | Version: 0.1.0
  5 | Authors@R: c(
  6 |     person(
  7 |         "Charlotte", "Soneson",
  8 |         email = "charlottesoneson@gmail.com",
  9 |         role = c("aut", "cre")
 10 |     ),
 11 |     person(
 12 |         "Hervé", "Pagès",
 13 |         role = "aut"
 14 |     ),
 15 |     person(
 16 |         "Dan", "Tenenbaum",
 17 |         role = "aut"
 18 |     ),
 19 |     person(
 20 |         "Valerie", "Obenchain",
 21 |         role = "aut"
 22 |     ),
 23 |     person(
 24 |         "Marc", "Carlson",
 25 |         role = "aut"
 26 |     ),
 27 |     person(
 28 |         "Martin", "Morgan",
 29 |         role = "aut"
 30 |     ),
 31 |     person(
 32 |         "James", "Hester",
 33 |         role = "aut"
 34 |     ),
 35 |     person(
 36 |         "Vince", "Carey",
 37 |         role = "aut"
 38 |     ),
 39 |     person(
 40 |         "Laurent", "Gatto",
 41 |         role = "aut"
 42 |     ),
 43 |     person(
 44 |         "Pedro", "Baldoni",
 45 |         role = "aut"
 46 |     ),
 47 |     person(
 48 |         "Jenny", "Drnevich",
 49 |         role = "aut"
 50 |     ),
 51 |     person(
 52 |         "Robert", "Castelo",
 53 |         role = "aut"
 54 |     ),
 55 |     person(
 56 |         "Michael", "Stadler",
 57 |         role = "aut"
 58 |     ),
 59 |     person(
 60 |         "Fabricio", "Almeida-Silva",
 61 |         role = "aut"
 62 |     ),
 63 |     person(
 64 |         "Jacques", "Serizay",
 65 |         role = "aut"
 66 |     )
 67 |   )
 68 | Description: This package provides a crowd-sourced collection of 'How To'
 69 |     documents related to Bioconductor packages.
 70 | License: MIT + file LICENSE
 71 | Encoding: UTF-8
 72 | LazyData: false
 73 | Suggests:
 74 |     knitr,
 75 |     rmarkdown
 76 | RoxygenNote: 7.3.2
 77 | VignetteBuilder:
 78 |     quarto
 79 | Imports:
 80 |     quarto,
 81 |     BiocStyle,
 82 |     GenomicAlignments,
 83 |     pasillaBamSubset,
 84 |     txdbmaker,
 85 |     mzR,
 86 |     MsDataHub,
 87 |     Spectra,
 88 |     AnnotationHub,
 89 |     TxDb.Hsapiens.UCSC.hg38.knownGene,
 90 |     BSgenome.Hsapiens.UCSC.hg38,
 91 |     GenomicFeatures,
 92 |     Biostrings,
 93 |     Rsamtools,
 94 |     GSEABase,
 95 |     GSVA,
 96 |     rtracklayer,
 97 |     GenomicRanges,
 98 |     BSgenome, 
 99 |     tidyomics, 
100 |     airway, 
101 |     readxl
102 | URL: https://github.com/bioconductor/BiocHowTo, 
103 |     https://bioconductor.github.io/BiocHowTo/
104 | 


--------------------------------------------------------------------------------
/vignettes/how-to-read-big-bam-file-in-chunks.qmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: How to read a BAM file in chunks
 3 | author: Bioconductor Core Team
 4 | date: "`r Sys.Date()`"
 5 | vignette: >
 6 |   %\VignetteIndexEntry{How to read a BAM file in chunks}
 7 |   %\VignetteEngine{quarto::html}
 8 |   %\VignetteEncoding{UTF-8}
 9 | knitr:
10 |     opts_chunk: 
11 |         collapse: true
12 |         comment: '#>'
13 | format:
14 |     html:
15 |         toc: true
16 |         html-math-method: mathjax
17 | ---
18 | 
19 | ```{r}
20 | #| echo: false
21 | 
22 | suppressPackageStartupMessages({
23 |     library(BiocStyle)
24 | })
25 | ```
26 | 
27 | This HowTo has been adapted from the list of HowTos provided in the 
28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 
30 | 
31 | # Bioconductor packages used in this document
32 | 
33 | * `r Biocpkg("pasillaBamSubset")`
34 | * `r Biocpkg("GenomicAlignments")`
35 | * `r Biocpkg("Rsamtools")`
36 | 
37 | # How to read a BAM file in chunks
38 | 
39 | A large BAM file can be iterated through in chunks, in order to reduce the 
40 | memory usage, by setting a `yieldSize` for the `BamFile` object. For 
41 | illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data package.
42 | 
43 | ```{r}
44 | suppressPackageStartupMessages({
45 |     library(pasillaBamSubset)
46 |     library(Rsamtools)
47 | })
48 | # Path to a bam file with single-end reads
49 | (un1 <- untreated1_chr4())
50 | bf <- BamFile(un1, yieldSize = 100000)
51 | ```
52 | 
 53 | Iterating through a BAM file requires that the file be opened, repeatedly 
 54 | queried inside a loop, and then closed. If the file is not opened first, 
 55 | repeated calls to `GenomicAlignments::readGAlignments` return the same 
 56 | 100000 records each time (with a `yieldSize` of 100000).
 57 | As an example, let's calculate the coverage for the BAM file above.
58 | 
59 | ```{r}
60 | suppressPackageStartupMessages({
61 |     library(GenomicAlignments)
62 | })
63 | open(bf)
64 | cvg <- NULL
65 | repeat {
66 |     chunk <- readGAlignments(bf)
67 |     if (length(chunk) == 0L) {
68 |         break
69 |     }
70 |     chunk_cvg <- coverage(chunk)
71 |     if (is.null(cvg)) {
72 |         cvg <- chunk_cvg
73 |     } else {
74 |         cvg <- cvg + chunk_cvg
75 |     }
76 | }
77 | close(bf)
78 | cvg
79 | ```
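The same loop can be wrapped in a function so that the file is closed even if an error occurs part-way through. The helper name `chunked_coverage()` is hypothetical; the sketch only uses the functions shown above, and it compares the result with the coverage computed from the whole file at once (feasible here because the example file is small).

```r
suppressPackageStartupMessages({
    library(pasillaBamSubset)
    library(Rsamtools)
    library(GenomicAlignments)
})

# Hypothetical helper: accumulate the coverage chunk by chunk
chunked_coverage <- function(path, yield_size = 100000L) {
    bf <- BamFile(path, yieldSize = yield_size)
    open(bf)
    on.exit(close(bf), add = TRUE)  # close the file even on error
    cvg <- NULL
    repeat {
        chunk <- readGAlignments(bf)
        if (length(chunk) == 0L) {
            break
        }
        chunk_cvg <- coverage(chunk)
        cvg <- if (is.null(cvg)) chunk_cvg else cvg + chunk_cvg
    }
    cvg
}

cvg_chunked <- chunked_coverage(untreated1_chr4())
cvg_full <- coverage(readGAlignments(untreated1_chr4()))
# Position-wise comparison, summarized per chromosome and then overall
all(all(cvg_chunked == cvg_full))
```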
80 | 
81 | # Session info
82 | 
83 | <details>
84 | <summary><b>
85 | Click to display session info
86 | </b></summary>
87 | ```{r}
88 | sessionInfo()
89 | ```
90 | </details>
91 | 
92 | 


--------------------------------------------------------------------------------
/vignettes/how-to-retrieve-gene-model-from-annotationhub.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to retrieve a gene model from AnnotationHub
  3 | author: Bioconductor Core Team
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to retrieve a gene model from AnnotationHub}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk:
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 | ---
 18 | 
 19 | ```{r}
 20 | #| echo: false
 21 | 
 22 | suppressPackageStartupMessages({
 23 |     library(BiocStyle)
 24 | })
 25 | ```
 26 | 
 27 | This HowTo has been adapted from the list of HowTos provided in the 
 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 
 30 | 
 31 | # Bioconductor packages used in this document
 32 | 
 33 | * `r Biocpkg("AnnotationHub")`
 34 | 
 35 | # How to retrieve a gene model from *AnnotationHub*
 36 | 
 37 | When a gene model is not available as a `GRanges` or `GRangesList` object or as 
  38 | a Bioconductor data package, it may be available on `r Biocpkg("AnnotationHub")`.
  39 | In this HowTo, we will look for a gene model for Drosophila melanogaster
  40 | on `r Biocpkg("AnnotationHub")`. Create a 'hub' and then filter on Drosophila 
  41 | melanogaster:
 42 | 
 43 | ```{r}
 44 | suppressPackageStartupMessages({
 45 |     library(AnnotationHub)
 46 | })
 47 | ### Internet connection required!
 48 | hub <- AnnotationHub()
 49 | hub <- subset(hub, hub$species=='Drosophila melanogaster')
 50 | ```
 51 | 
 52 | There are `r length(hub)` files that match Drosophila melanogaster. 
  53 | If you look at the metadata in `hub`, you can see that the 5th record 
  54 | represents a `GRanges` object from the UCSC RefSeq Gene track.
 55 | 
 56 | ```{r}
 57 | length(hub)
 58 | head(names(hub))
 59 | head(hub$title, n=10)
 60 | ## then look at a specific slice of the hub object.
 61 | hub[5]
 62 | ```
 63 | So you can retrieve that dm3 file as a GRanges like this:
 64 | 
 65 | ```{r}
 66 | gr <- hub[[names(hub)[5]]]
 67 | summary(gr)
 68 | ```
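Selecting a record by position relies on the current ordering of the hub, which may differ between hub snapshots. As an alternative sketch, records can be searched by keyword with `query()` and then inspected before one is chosen by its identifier. The search terms below are an assumption about how the record of interest is annotated and may need adjusting.

```r
suppressPackageStartupMessages({
    library(AnnotationHub)
})
### Internet connection required!
hub <- AnnotationHub()

# All search terms must match for a record to be returned
dm <- query(hub, c("Drosophila melanogaster", "GRanges", "RefSeq"))
dm

# Inspect the candidates, then retrieve one with hub[["AH..."]]
mcols(dm)[, c("title", "genome", "rdataclass")]
```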
 69 | 
 70 | The metadata fields contain the details of file origin and content.
 71 | 
 72 | ```{r}
 73 | metadata(gr)
 74 | ```
 75 | Split the `GRanges` object by gene name to get a `GRangesList` object of 
 76 | transcript ranges grouped by gene.
 77 | 
 78 | ```{r}
 79 | txbygn <- split(gr, gr$name)
 80 | ```
 81 | 
 82 | You can now use `txbygn` with the `summarizeOverlaps` function
 83 | to prepare a table of read counts for RNA-Seq differential gene expression.
 84 | 
 85 | # Further reading
 86 | 
 87 | Note that before passing `txbygn` to `summarizeOverlaps`,
 88 | you should confirm that the seqlevels (chromosome names) in it match those
 89 | in the BAM file. See `?renameSeqlevels`, `?keepSeqlevels`
 90 | and `?seqlevels` for examples of renaming seqlevels.
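As a minimal sketch of such a check, the toy `GRanges` object below (made up for illustration) is restricted to one chromosome and renamed from the UCSC-style `chr4` to `4`. With real data, compare `seqlevels()` of the annotation with those of the BAM file before deciding what to rename.

```r
suppressPackageStartupMessages({
    library(GenomicRanges)
    library(GenomeInfoDb)
})

# Toy ranges on two chromosomes (illustrative data only)
gr_toy <- GRanges(c("chr4", "chr4", "chrX"),
                  IRanges(c(1, 100, 50), width = 20))
seqlevels(gr_toy)

# Keep only chr4; "coarse" pruning drops the ranges on other seqlevels
gr4 <- keepSeqlevels(gr_toy, "chr4", pruning.mode = "coarse")

# Rename using a named vector: names are old levels, values are new ones
gr4 <- renameSeqlevels(gr4, c(chr4 = "4"))
seqlevels(gr4)
```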
 91 | 
 92 | # Session info
 93 | 
 94 | <details>
 95 | <summary><b>
 96 | Click to display session info
 97 | </b></summary>
 98 | ```{r}
 99 | sessionInfo()
100 | ```
101 | </details>
102 | 


--------------------------------------------------------------------------------
/vignettes/how-to-read-single-end-reads-from-bam-file.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to read single-end reads from a BAM file
  3 | author: Bioconductor Core Team
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to read single-end reads from a BAM file}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk: 
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 | ---
 18 | 
 19 | ```{r}
 20 | #| echo: false
 21 | 
 22 | suppressPackageStartupMessages({
 23 |     library(BiocStyle)
 24 | })
 25 | ```
 26 | 
 27 | This HowTo has been adapted from the list of HowTos provided in the 
 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 
 30 | 
 31 | # Bioconductor packages used in this document
 32 | 
 33 | * `r Biocpkg("pasillaBamSubset")`
 34 | * `r Biocpkg("GenomicAlignments")`
 35 | * `r Biocpkg("Rsamtools")`
 36 | 
 37 | # How to read single-end reads from a BAM file
 38 | 
 39 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data 
 40 | package.
 41 | 
 42 | ```{r}
 43 | suppressPackageStartupMessages({
 44 |     library(pasillaBamSubset)
 45 | })
 46 | # Path to a bam file with single-end reads
 47 | (un1 <- untreated1_chr4())
 48 | ```
 49 | 
 50 | Several functions are available for reading BAM files into R:
 51 | 
 52 | * `GenomicAlignments::readGAlignments()`
 53 | * `GenomicAlignments::readGAlignmentPairs()`
 54 | * `GenomicAlignments::readGAlignmentsList()`
 55 | * `Rsamtools::scanBam()`
 56 | 
 57 | `scanBam` is a low-level function that returns a list of lists and is not 
 58 | discussed further here. See `?scanBam` in the `r Biocpkg("Rsamtools")` package 
 59 | for more information. Single-end reads can be read with the `readGAlignments` 
 60 | function from the `r Biocpkg("GenomicAlignments")` package.
 61 | 
 62 | ```{r}
 63 | suppressPackageStartupMessages({
 64 |     library(GenomicAlignments)
 65 | })
 66 | gal <- readGAlignments(un1)
 67 | class(gal)
 68 | gal
 69 | ```
 70 | 
 71 | Data subsets can be specified by genomic position, field names, or flag 
 72 | criteria using `Rsamtools::ScanBamParam`. Here, as an example we import records 
 73 | that overlap position 1 to 5000 on the negative strand of chromosome 4, with 
 74 | `flag` and `cigar` as metadata columns.
 75 | 
 76 | ```{r}
 77 | suppressPackageStartupMessages({
 78 |     library(Rsamtools)
 79 | })
 80 | what <- c("flag", "cigar")
 81 | which <- GRanges("chr4", IRanges(1, 5000))
 82 | flag <- scanBamFlag(isMinusStrand = TRUE)
 83 | param <- ScanBamParam(which = which, what = what, flag = flag)
 84 | neg <- readGAlignments(un1, param = param)
 85 | neg
 86 | ```
 87 | 
 88 | Another approach to subsetting the data is to use `Rsamtools::filterBam`. 
 89 | This function creates a new BAM file of records passing user-defined criteria. 
 90 | See `?filterBam` in the `r Biocpkg("Rsamtools")` package for more information.
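A minimal sketch of this approach, reusing the region and strand criteria from the previous example, could look as follows. The filtered records are written to a temporary file, and `countBam()` reports how many records it contains.

```r
suppressPackageStartupMessages({
    library(pasillaBamSubset)
    library(Rsamtools)
})
un1 <- untreated1_chr4()

# Same selection as above: minus-strand records overlapping chr4:1-5000
param <- ScanBamParam(
    which = GRanges("chr4", IRanges(1, 5000)),
    flag = scanBamFlag(isMinusStrand = TRUE)
)

# filterBam() returns the path of the new (indexed) BAM file
dest <- filterBam(un1, tempfile(fileext = ".bam"), param = param)
countBam(dest)
```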
 91 | 
 92 | # Session info
 93 | 
 94 | <details>
 95 | <summary><b>
 96 | Click to display session info
 97 | </b></summary>
 98 | ```{r}
 99 | sessionInfo()
100 | ```
101 | </details>
102 | 
103 | 


--------------------------------------------------------------------------------
/vignettes/how-to-compute-read-coverage.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to compute read coverage
  3 | author: Bioconductor Core Team
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to compute read coverage}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk:
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 | ---
 18 | 
 19 | ```{r}
 20 | #| echo: false
 21 | 
 22 | suppressPackageStartupMessages({
 23 |     library(BiocStyle)
 24 | })
 25 | ```
 26 | 
 27 | This HowTo has been adapted from the list of HowTos provided in the 
 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 
 30 | 
 31 | # Bioconductor packages used in this document
 32 | 
 33 | * `r Biocpkg("pasillaBamSubset")`
 34 | * `r Biocpkg("GenomicAlignments")`
 35 | 
 36 | # How to compute read coverage
 37 | 
 38 | By "read coverage", we mean the number of reads that cover a given genomic position. 
  39 | Computing the read coverage generally consists of computing the coverage at 
 40 | each position in the genome.
 41 | This can be done with the `coverage` function from `r Biocpkg("GenomicAlignments")`.
 42 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data 
 43 | package.
 44 | 
 45 | ```{r}
 46 | suppressPackageStartupMessages({
 47 |     library(pasillaBamSubset)
 48 | })
 49 | # Path to a bam file with single-end reads
 50 | (un1 <- untreated1_chr4())
 51 | ```
 52 | 
 53 | Single-end reads can be read with the `readGAlignments` function from 
 54 | `r Biocpkg("GenomicAlignments")`.
 55 | 
 56 | ```{r}
 57 | suppressPackageStartupMessages({
 58 |     library(GenomicAlignments)
 59 | })
 60 | (reads <- readGAlignments(un1))
 61 | ```
 62 | 
 63 | The coverage can then be calculated using the `coverage` function from 
 64 | `r Biocpkg("GenomicAlignments")`: 
 65 | 
 66 | ```{r}
 67 | (cvg <- coverage(reads))
 68 | ```
 69 | 
 70 | We can extract the coverage on a specific chromosome: 
 71 | 
 72 | ```{r}
 73 | cvg$chr4
 74 | ```
 75 | 
 76 | We can also perform calculations on the returned object, e.g. to get the 
 77 | average and maximum coverage on chromosome 4:
 78 | 
 79 | ```{r}
 80 | mean(cvg$chr4)
 81 | max(cvg$chr4)
 82 | ```
 83 | 
 84 | We can use the `slice` function to find the genomic regions where the coverage
  85 | is greater than or equal to a given threshold.
 86 | 
 87 | ```{r}
 88 | (chr4_highcov <- slice(cvg$chr4, lower = 500))
 89 | ```
 90 | 
  91 | The *weight* of such a region can be defined as the number of aligned 
 92 | nucleotides that belong to the region (i.e. the area under the coverage curve). 
 93 | It can be obtained with `sum`:
 94 | 
 95 | ```{r}
 96 | sum(chr4_highcov)
 97 | ```
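To make the relationship between `slice` and `sum` concrete, here is a small sketch on a made-up coverage vector of 12 positions (the values are illustrative and do not come from the BAM file).

```r
suppressPackageStartupMessages({
    library(IRanges)
})

# Toy coverage: 0 at positions 1-3, 2 at 4-5, 5 at 6-9, 1 at 10, 6 at 11-12
toy_cvg <- Rle(c(0L, 2L, 5L, 1L, 6L), c(3L, 2L, 4L, 1L, 2L))

# Regions with coverage >= 5: positions 6-9 and 11-12
toy_high <- slice(toy_cvg, lower = 5)
start(toy_high)
end(toy_high)
width(toy_high)     # 4 2

# Weight of each region (area under the coverage curve): 4*5 and 2*6
sum(toy_high)       # 20 12

# Maximum coverage within each region
viewMaxs(toy_high)  # 5 6
```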
 98 | 
 99 | Note that `coverage` and `slice` are generic functions with methods for 
100 | different types of objects. See `?coverage` or `?slice` for more information.
101 | 
102 | 
103 | # Further reading
104 | 
105 | - The `r Biocpkg("GenomicAlignments")` vignette.
106 | 
107 | # Session info
108 | 
109 | <details>
110 | <summary><b>
111 | Click to display session info
112 | </b></summary>
113 | ```{r}
114 | sessionInfo()
115 | ```
116 | </details>
117 | 


--------------------------------------------------------------------------------
/vignettes/how-to-read-paired-end-reads-from-bam-file.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to read paired-end reads from a BAM file
  3 | author: Bioconductor Core Team
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to read paired-end reads from a BAM file}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk: 
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 | ---
 18 | 
 19 | ```{r}
 20 | #| echo: false
 21 | 
 22 | suppressPackageStartupMessages({
 23 |     library(BiocStyle)
 24 | })
 25 | ```
 26 | 
 27 | This HowTo has been adapted from the list of HowTos provided in the 
 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 
 30 | 
 31 | # Bioconductor packages used in this document
 32 | 
 33 | * `r Biocpkg("pasillaBamSubset")`
 34 | * `r Biocpkg("GenomicAlignments")`
 35 | 
 36 | # How to read paired-end reads from a BAM file
 37 | 
 38 | For illustration, we use data from the `r Biocpkg("pasillaBamSubset")` data 
 39 | package.
 40 | 
 41 | ```{r}
 42 | suppressPackageStartupMessages({
 43 |     library(pasillaBamSubset)
 44 | })
 45 | # Path to a bam file with paired-end reads
 46 | (un3 <- untreated3_chr4())
 47 | ```
 48 | 
 49 | Paired-end reads can be loaded with the `GenomicAlignments::readGAlignmentPairs` 
 50 | or `GenomicAlignments::readGAlignmentsList` functions from the 
 51 | `r Biocpkg("GenomicAlignments")` package. These functions use the same mate 
 52 | pairing algorithm but output different objects.
 53 | 
 54 | Let's start with `GenomicAlignments::readGAlignmentPairs`:
 55 | 
 56 | ```{r}
 57 | suppressPackageStartupMessages({
 58 |     library(GenomicAlignments)
 59 | })
 60 | gapairs <- readGAlignmentPairs(un3)
 61 | class(gapairs)
 62 | gapairs
 63 | ```
 64 | 
 65 | The `GAlignmentPairs` class holds only pairs; reads with no mate or with 
 66 | ambiguous pairing are discarded. Each list element holds exactly 2 records 
 67 | (a mated pair). Records can be accessed as the first and last segments in a 
 68 | template or as left and right alignments. See `?GAlignmentPairs` in the 
 69 | `r Biocpkg("GenomicAlignments")` package for more information.
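
As a sketch of the accessors mentioned above (`first`, `last` and `isProperPair` are described in `?GAlignmentPairs`; the object is re-created here only so that the chunk stands on its own), the two mates of each pair can be pulled out as parallel `GAlignments` objects:

```{r}
suppressPackageStartupMessages({
    library(pasillaBamSubset)
    library(GenomicAlignments)
})
gapairs <- readGAlignmentPairs(untreated3_chr4())
# First and last segment of each template, one record per pair
first(gapairs)
last(gapairs)
# Number of pairs flagged as proper pairs by the aligner
table(isProperPair(gapairs))
```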
 70 | 
 71 | For `readGAlignmentsList`, mate pairing is performed when `asMates` is set to 
 72 | `TRUE` on the `BamFile` object, otherwise records are treated as single-end.
 73 | 
 74 | ```{r}
 75 | galist <- readGAlignmentsList(BamFile(un3, asMates = TRUE))
 76 | galist
 77 | ```
 78 | 
 79 | `GAlignmentsList` is a more general ‘list-like’ structure that holds mate 
 80 | pairs as well as nonmates (e.g., singletons or records with unmapped mates). 
 81 | A `mate_status` metadata column (accessed with `mcols`) indicates which 
 82 | records were paired.
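
For instance, the pairing outcome can be summarized by tabulating this column, which is named `mate_status` in the object. This is only a sketch; `galist` is re-created here so that the chunk stands on its own:

```{r}
suppressPackageStartupMessages({
    library(pasillaBamSubset)
    library(GenomicAlignments)
})
galist <- readGAlignmentsList(BamFile(untreated3_chr4(), asMates = TRUE))
# Count list elements by pairing status
table(mcols(galist)$mate_status)
```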
 83 | 
 84 | Non-mated reads are returned as groups by `QNAME` and contain any number of 
 85 | records. Here the non-mate groups range in size from 1 to 9.
 86 | 
 87 | ```{r}
 88 | non_mates <- galist[unlist(mcols(galist)$mate_status) == "unmated"]
 89 | table(elementNROWS(non_mates))
 90 | ```
 91 | 
 92 | # Session info
 93 | 
 94 | <details>
 95 | <summary><b>
 96 | Click to display session info
 97 | </b></summary>
 98 | ```{r}
 99 | sessionInfo()
100 | ```
101 | </details>
102 | 
103 | 


--------------------------------------------------------------------------------
/vignettes/how-to-use-tidy-principles-for-rna-seq-analysis.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to use tidy principles for RNA-seq data analysis
  3 | author: Jacques Serizay
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to use tidy principles for RNA-seq data analysis}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk: 
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 | ---
 18 | 
 19 | ```{r}
 20 | #| echo: false
 21 | 
 22 | suppressPackageStartupMessages({
 23 |     library(BiocStyle)
 24 | })
 25 | ```
 26 | 
 27 | Tidy data principles are a set of principles for structuring data in a way 
 28 | that makes it easier to manipulate and analyze. They have been popularized by the
 29 | `tidyverse` collection of R packages, which includes `dplyr` for data manipulation 
 30 | and `ggplot2` for data visualization. The tidy principles can also be applied to
 31 | biological data, such as RNA-seq data, to facilitate analysis and visualization, 
 32 | thanks to the `tidyomics` meta-package.
 33 | 
 34 | # Bioconductor packages used in this document
 35 | 
 36 | * `r Biocpkg("tidyomics")`
 37 | 
 38 | # How to manipulate SummarizedExperiment objects with tidy principles
 39 | 
 40 | For illustration, we use data from the `r Biocpkg("airway")` package. 
 41 | 
 42 | ```{r}
 43 | library(tidyomics)
 44 | data(airway, package="airway")
 45 | 
 46 | airway
 47 | ```
 48 | 
 49 | How the `airway` object is printed is controlled by the `tidySummarizedExperiment` package, 
 50 | and by default it is displayed as a `tibble`-like object. Note that this 
 51 | behavior can be changed by setting `options(restore_SummarizedExperiment_show = TRUE)`. 
 52 | 
 53 | ```{r}
 54 | # Turn off the tibble visualisation
 55 | options("restore_SummarizedExperiment_show" = TRUE)
 56 | airway
 57 | 
 58 | # Turn on the tibble visualisation
 59 | options("restore_SummarizedExperiment_show" = FALSE)
 60 | airway
 61 | ```
 62 | 
 63 | The `airway` object is a `SummarizedExperiment` object, and the 
 64 | `tidySummarizedExperiment` package adds an invisible layer that abstracts the `SE` object as a
 65 | `tibble`. This allows us to use the `dplyr` package to manipulate the data, and eventually 
 66 | to pipe the result directly into `ggplot2` for visualization.
 67 | 
 68 | ```{r}
 69 | airway |> 
 70 |     filter(albut == "untrt") |> 
 71 |     group_by(.feature, dex) |> 
 72 |     summarise(tot_counts = sum(counts)) |>
 73 |     pivot_wider(names_from = dex, values_from = tot_counts) |>
 74 |     slice_sample(n = 10000) |> 
 75 |     ggplot(aes(x = trt, y = untrt)) + 
 76 |     geom_density_2d(color = "black", linewidth = 0.5) +
 77 |     geom_point(alpha = 0.1, size = 0.3) + 
 78 |     scale_x_log10() + 
 79 |     scale_y_log10() + 
 80 |     annotation_logticks(side = 'bl') +
 81 |     labs(x = "Dex treated", y = "Dex untreated") +
 82 |     theme_bw()
 83 | ```
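
A smaller sketch may help to see what the abstraction does before building a full plot. It assumes that `tidyomics` attaches `tidySummarizedExperiment` and `dplyr`, as the chunk above relies on; the name `airway_trt` is hypothetical. Filtering on a sample-level column is expected to return a `SummarizedExperiment` with fewer columns, while `as_tibble()` gives an ordinary long table:

```{r}
library(tidyomics)
data(airway, package = "airway")
# Keep only dexamethasone-treated samples
airway_trt <- airway |> filter(dex == "trt")
class(airway_trt)
dim(airway_trt)
# Explicit conversion to a plain tibble, one row per feature/sample combination
airway |> as_tibble() |> head()
```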
 84 | 
 85 | Because these tidy methods are defined for `SummarizedExperiment` objects, 
 86 | most classes built on top of `SummarizedExperiment`, such as `DESeqDataSet` or `SingleCellExperiment`, 
 87 | can also be manipulated with tidy principles. This has prompted the development of 
 88 | other packages (included in the `tidyomics` meta-package) that extend the tidy principles to
 89 | other analyses, such as single-cell transcriptomics analysis.
 90 | 
 91 | ```{r}
 92 | data(pbmc_small, package="tidySingleCellExperiment")
 93 | pbmc_small
 94 | 
 95 | ## We can still manipulate the data with standard SummarizedExperiment methods
 96 | counts(pbmc_small)[1:5, 1:4]
 97 | 
 98 | ## But we can also use tidy principles
 99 | pbmc_small |> 
100 |     extract(file, "sample", "../data/([a-z0-9]+)/outs.+") |> 
101 |     rename(annotation = letter.idents) |> 
102 |     ggplot(aes(sample, nCount_RNA, fill=sample)) +
103 |     geom_boxplot(outlier.shape=NA) +
104 |     geom_jitter(width=0.1) +
105 |     facet_wrap( ~ annotation)
106 | ```
107 | 
108 | # Further reading 
109 | 
110 | To read more about the `tidyomics` packages, please refer to the
111 | [tidyomics](https://github.com/tidyomics) GitHub organization page. 
112 | 
113 | # Session info
114 | 
115 | <details>
116 | <summary><b>
117 | Click to display session info
118 | </b></summary>
119 | ```{r}
120 | sessionInfo()
121 | ```
122 | </details>
123 | 
124 | 


--------------------------------------------------------------------------------
/vignettes/how-to-get-exon-intron-sequence-for-gene.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to get the exon and intron sequences of a given gene
  3 | author: Bioconductor Core Team
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to get the exon and intron sequences of a given gene}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk:
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 | ---
 18 | 
 19 | ```{r}
 20 | #| echo: false
 21 | 
 22 | suppressPackageStartupMessages({
 23 |     library(BiocStyle)
 24 | })
 25 | ```
 26 | 
 27 | This HowTo has been adapted from the list of HowTos provided in the 
 28 | [vignette](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)
 29 | for the `r Biocpkg("GenomicRanges")` Bioconductor package. 
 30 | 
 31 | # Bioconductor packages used in this document
 32 | 
 33 | * `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")`
 34 | * `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")`
 35 | * `r Biocpkg("GenomicFeatures")`
 36 | * `r Biocpkg("Biostrings")`
 37 | 
 38 | # How to get the exon and intron sequences of a given gene
 39 | 
 40 | The exon and intron sequences of a gene are essentially the DNA sequences of 
 41 | the introns and exons of all known transcripts of the gene. The first task, 
 42 | therefore, is to identify all transcripts associated with the gene of interest. 
 43 | Our example gene is the human `TRAK2`, which is involved in regulation of 
 44 | endosome-to-lysosome trafficking of membrane cargo. The Entrez gene id is
 45 | '66008'.
 46 | 
 47 | ```{r}
 48 | trak2 <- "66008"
 49 | ```
 50 | 
 51 | The `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")` data package contains the 
 52 | gene models corresponding to the UCSC 'Known Genes' track.
 53 | 
 54 | ```{r}
 55 | suppressPackageStartupMessages({
 56 |     library(TxDb.Hsapiens.UCSC.hg38.knownGene)
 57 | })
 58 | txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
 59 | ```
 60 | 
 61 | The transcript ranges for all the genes in the gene model can be extracted with 
 62 | the `transcriptsBy` function from the `r Biocpkg("GenomicFeatures")` package. 
 63 | They will be returned in a named `GRangesList` object containing all the 
 64 | transcripts grouped by gene. In order to keep only the transcripts of the 
 65 | `TRAK2` gene we will subset the `GRangesList` object using the `[[` operator.
 66 | 
 67 | ```{r}
 68 | suppressPackageStartupMessages({
 69 |     library(GenomicFeatures)
 70 | })
 71 | (trak2_txs <- transcriptsBy(txdb, by = "gene")[[trak2]])
 72 | ```
 73 | 
 74 | `trak2_txs` is a `GRanges` object with one range per transcript in the `TRAK2` 
 75 | gene. The transcript names are stored in the `tx_name` metadata column. We will 
 76 | need them to subset the extracted intron and exon regions:
 77 | 
 78 | ```{r}
 79 | (trak2_tx_names <- mcols(trak2_txs)$tx_name)
 80 | ```
 81 | 
 82 | The exon and intron genomic ranges for all the transcripts in the gene model 
 83 | can be extracted with the `exonsBy` and `intronsByTranscript` functions, 
 84 | respectively. Both functions return a `GRangesList` object. Then we keep only 
 85 | the exons and introns for the transcripts of the `TRAK2` gene by subsetting 
 86 | each `GRangesList` object by the `TRAK2` transcript names.
 87 | 
 88 | Extract the exon regions:
 89 | 
 90 | ```{r}
 91 | trak2_exbytx <- exonsBy(txdb, "tx", use.names = TRUE)[trak2_tx_names]
 92 | elementNROWS(trak2_exbytx)
 93 | ```
 94 | 
 95 | ... and the intron regions:
 96 | 
 97 | ```{r}
 98 | trak2_inbytx <- intronsByTranscript(txdb, use.names = TRUE)[trak2_tx_names]
 99 | elementNROWS(trak2_inbytx)
100 | ```
101 | 
102 | Next we want the DNA sequences for these exons and introns. The `getSeq` 
103 | function from the `r Biocpkg("Biostrings")` package can be used to query a 
104 | `BSgenome` object with a set of genomic ranges and retrieve the corresponding 
105 | DNA sequences.
106 | 
107 | ```{r}
108 | suppressPackageStartupMessages({
109 |     library(BSgenome.Hsapiens.UCSC.hg38)
110 | })
111 | ```
112 | 
113 | Extract the exon sequences:
114 | 
115 | ```{r}
116 | (trak2_ex_seqs <- getSeq(Hsapiens, trak2_exbytx))
117 | ```
118 | 
119 | ... and the intron sequences:
120 | 
121 | ```{r}
122 | (trak2_in_seqs <- getSeq(Hsapiens, trak2_inbytx))
123 | ```
124 | 
125 | Each list element is a `DNAStringSet` object, e.g.:
126 | 
127 | ```{r}
128 | trak2_in_seqs[[1]]
129 | ```
130 | 
131 | Each sequence here corresponds to a single intron.
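
As a quick sanity check, the length of every extracted sequence should match the width of the genomic range it was extracted from. This is a sketch that rebuilds the intron objects from above so that it can run on its own:

```{r}
suppressPackageStartupMessages({
    library(GenomicFeatures)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
    library(BSgenome.Hsapiens.UCSC.hg38)
})
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
trak2_tx_names <- mcols(transcriptsBy(txdb, by = "gene")[["66008"]])$tx_name
trak2_inbytx <- intronsByTranscript(txdb, use.names = TRUE)[trak2_tx_names]
trak2_in_seqs <- getSeq(Hsapiens, trak2_inbytx)
# Compare sequence lengths and range widths intron by intron
all(unlist(width(trak2_in_seqs)) == unlist(width(trak2_inbytx)))
```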
132 | 
133 | # Further reading
134 | 
135 | - The `r Biocpkg("GenomicFeatures")` vignette.
136 | 
137 | # Session info
138 | 
139 | <details>
140 | <summary><b>
141 | Click to display session info
142 | </b></summary>
143 | ```{r}
144 | sessionInfo()
145 | ```
146 | </details>
147 | 


--------------------------------------------------------------------------------
/.github/workflows/R-CMD-check.yaml:
--------------------------------------------------------------------------------
  1 | on:
  2 |   push:
  3 |   pull_request:
  4 |     branches:
  5 |       - devel
  6 |       - main
  7 |   schedule:
  8 |     - cron: '0 9 * * 2'
  9 | 
 10 | name: R-CMD-check
 11 | 
 12 | jobs:
 13 |   R-CMD-check:
 14 |     runs-on: ${{ matrix.config.os }}
 15 |     container: ${{ matrix.config.image }}
 16 | 
 17 |     name: ${{ matrix.config.os }} (${{ matrix.config.bioc }} - ${{ matrix.config.image }})
 18 | 
 19 |     strategy:
 20 |       fail-fast: false
 21 |       matrix:
 22 |         config:
 23 |         - {os: macos-13, bioc: 'devel'}
 24 | 
 25 |     env:
 26 |       R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
 27 |       CRAN: ${{ matrix.config.cran }}
 28 |       GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
 29 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
 30 | 
 31 |     steps:
 32 |       - name: Check out repo
 33 |         uses: actions/checkout@v3
 34 | 
 35 |       - name: Set up R and install BiocManager
 36 |         uses: grimbough/bioc-actions/setup-bioc@v1
 37 |         if: matrix.config.image == null
 38 |         with:
 39 |           bioc-version: ${{ matrix.config.bioc }}
 40 | 
 41 |       - name: Install Quarto
 42 |         uses: quarto-dev/quarto-actions/setup@v2
 43 | 
 44 |       - name: Install pandoc
 45 |         uses: r-lib/actions/setup-pandoc@v2
 46 |         if: matrix.config.image == null
 47 | 
 48 |       - name: Install remotes and rmarkdown
 49 |         run: |
 50 |           install.packages(c('remotes', 'rmarkdown'))
 51 |         shell: Rscript {0}
 52 | 
 53 |       - name: Query dependencies
 54 |         run: |
 55 |           saveRDS(remotes::dev_package_deps(dependencies = TRUE, repos = c(getOption('repos'), BiocManager::repositories())), 'depends.Rds', version = 2)
 56 |         shell: Rscript {0}
 57 | 
 58 |       - name: Cache R packages
 59 |         if: runner.os != 'Windows' && matrix.config.image == null
 60 |         uses: actions/cache@v4
 61 |         with:
 62 |           path: ${{ env.R_LIBS_USER }}
 63 |           key: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}-${{ hashFiles('depends.Rds') }}
 64 |           restore-keys: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}-
 65 | 
 66 |       - name: Install system dependencies (Linux)
 67 |         if: runner.os == 'Linux'
 68 |         env:
 69 |           RHUB_PLATFORM: linux-x86_64-ubuntu-gcc
 70 |         uses: r-lib/actions/setup-r-dependencies@v2
 71 |         with:
 72 |           extra-packages: any::rcmdcheck
 73 |           pak-version: devel
 74 | 
 75 |       - name: Install system dependencies (macOS)
 76 |         if: runner.os == 'macOS'
 77 |         run: |
 78 |           brew install harfbuzz
 79 |           brew install fribidi
 80 | 
 81 |       - name: Install dependencies
 82 |         run: |
 83 |           local_deps <- remotes::local_package_deps(dependencies = TRUE)
 84 |           deps <- remotes::dev_package_deps(dependencies = TRUE, repos = BiocManager::repositories())
 85 |           BiocManager::install(local_deps[local_deps %in% deps$package[deps$diff != 0]], Ncpu = 2L)
 86 |           remotes::install_cran('rcmdcheck', Ncpu = 2L)
 87 |         shell: Rscript {0}
 88 | 
 89 |       - name: Session info
 90 |         run: |
 91 |           options(width = 100)
 92 |           pkgs <- installed.packages()[, "Package"]
 93 |           sessioninfo::session_info(pkgs, include_base = TRUE)
 94 |         shell: Rscript {0}
 95 | 
 96 |       - name: Build, Install, Check
 97 |         id: build-install-check
 98 |         uses: grimbough/bioc-actions/build-install-check@v1
 99 | 
100 |       - name: Upload install log if the build/install/check step fails
101 |         if: always() && (steps.build-install-check.outcome == 'failure')
102 |         uses: actions/upload-artifact@v4
103 |         with:
104 |           name: install-log
105 |           path: |
106 |             ${{ steps.build-install-check.outputs.install-log }}
107 | 
108 |       - name: Upload check results
109 |         if: failure()
110 |         uses: actions/upload-artifact@v4
111 |         with:
112 |           name: ${{ runner.os }}-bioc-${{ matrix.config.bioc }}-results
113 |           path: ${{ steps.build-install-check.outputs.check-dir }}
114 | 
115 |       # - name: Run BiocCheck
116 |       #   uses: grimbough/bioc-actions/run-BiocCheck@v1
117 |       #   with:
118 |       #     arguments: '--no-check-bioc-views --no-check-bioc-help'
119 |       #     error-on: 'error'
120 |       #
121 |       # - name: Test coverage
122 |       #   if: matrix.config.os == 'macOS-latest' && matrix.config.pandoc == null
123 |       #   run: |
124 |       #     install.packages("covr")
125 |       #     covr::codecov(token = "${{secrets.CODECOV_TOKEN}}")
126 |       #   shell: Rscript {0}
127 |       #
128 |       - name: Deploy
129 |         if: github.event_name == 'push' && github.ref == 'refs/heads/devel' && runner.os == 'macOS' && matrix.config.bioc == 'devel'
130 |         run: |
131 |           R CMD INSTALL .
132 |           Rscript -e "remotes::install_dev('pkgdown'); pkgdown::deploy_to_branch(new_process = FALSE)"
133 |         shell: bash
134 | 


--------------------------------------------------------------------------------
/vignettes/how-to-read-gene-sets-from-gmt-files.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to read gene sets from GMT files
  3 | author: Robert Castelo
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to read gene sets from GMT files}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk:
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 | ---
 18 | 
 19 | ```{r}
 20 | #| echo: false
 21 | 
 22 | suppressPackageStartupMessages({
 23 |     library(BiocStyle)
 24 | })
 25 | ```
 26 | 
 27 | A **gene set** is a simple yet useful way to define pathways without regard
 28 | to the specific molecular interactions. It can also represent the signature
 29 | of a characteristic pattern of gene expression associated with a cell type
 30 | or a disease for diagnostic or prognostic purposes. An important source of
 31 | gene sets is the
 32 | [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb),
 33 | which stores them as plain text files following the so-called
 34 | [_gene matrix transposed_ (GMT) format](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29).
 35 | In the GMT format, each line stores a gene set with the following values
 36 | separated by tabs:
 37 | 
 38 |   * A unique gene set identifier.
 39 |   * A gene set description.
 40 |   * One or more gene identifiers.
 41 | 
 42 | Because each different gene set may consist of a different number of genes, each
 43 | line in a GMT file may contain a different number of tab-separated values. This
 44 | means that the GMT format is not a tabular format, and therefore cannot be directly
 45 | read with base R functions such as `read.table()` or `read.csv()`.
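
To illustrate the ragged structure, here is a minimal base R sketch that parses a made-up two-line GMT file by hand. The set names, descriptions and gene symbols are invented for illustration, and `toy_sets` is a hypothetical name. It is not a replacement for dedicated readers, which also validate the input and attach metadata:

```{r}
# Write a tiny GMT file: identifier, description, then a variable number of genes
gmt_lines <- c(
    "SET_A\tdescription of set A\tGENE1\tGENE2\tGENE3",
    "SET_B\tdescription of set B\tGENE2\tGENE4"
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Split each line on tabs; lines may yield different numbers of fields
fields <- strsplit(readLines(gmt_file), "\t", fixed = TRUE)
# Drop the identifier and description, keep the gene identifiers
toy_sets <- lapply(fields, function(x) x[-(1:2)])
names(toy_sets) <- vapply(fields, `[`, character(1), 1)
lengths(toy_sets)
```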
 46 | 
 47 | # Bioconductor packages used in this document
 48 | 
 49 | * `r Biocpkg("GSEABase")`
 50 | * `r Biocpkg("GSVA")`
 51 | 
 52 | # How to read gene sets from GMT files
 53 | 
 54 | The package `r Biocpkg("GSEABase")` provides object classes and methods to represent
 55 | and manipulate gene sets, including reading GMT files with the function `getGmt()`.
 56 | Here is a first example in which we read a GMT file from the MSigDB database:
 57 | 
 58 | ```{r, message=FALSE}
 59 | library(GSEABase)
 60 | 
 61 | URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release",
 62 |              "2023.2.Hs/c7.immunesigdb.v2023.2.Hs.symbols.gmt", sep="/")
 63 | genesets <- getGmt(URL, geneIdType=SymbolIdentifier())
 64 | genesets
 65 | length(genesets)
 66 | ```
 67 | In addition to the location of the GMT file, we provided the argument
 68 | `geneIdType=SymbolIdentifier()`. This argument adds the necessary metadata that
 69 | later on allows other software to figure out what kind of gene identifiers are
 70 | used in this collection of gene sets, to attempt mapping them to other types of
 71 | identifiers, if necessary. While this argument is optional, we should always
 72 | try to provide it.
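
The effect of this argument can be inspected on a small hand-made collection. This is a sketch with invented set names and arbitrary gene symbols, so that it runs without downloading anything; `toy_gsc` is a hypothetical name. The identifier type is stored with each `GeneSet` and can be queried with `geneIdType()`:

```{r}
suppressPackageStartupMessages({
    library(GSEABase)
})
# Two toy gene sets annotated as gene symbols
toy_gsc <- GeneSetCollection(
    GeneSet(c("TP53", "BRCA1", "EGFR"), geneIdType = SymbolIdentifier(),
            setName = "TOY_SET_A"),
    GeneSet(c("BRCA1", "MYC"), geneIdType = SymbolIdentifier(),
            setName = "TOY_SET_B")
)
names(toy_gsc)
# Identifier type recorded in the metadata of the first set
geneIdType(toy_gsc[[1]])
```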
 73 | 
 74 | The output of the `getGmt()` function is a `GeneSetCollection` object, which
 75 | behaves much like a list; many Bioconductor packages that analyse gene sets
 76 | expect this kind of object as input. However, we can coerce it to
 77 | a regular `list` object with the function `geneIds()`:
 78 | 
 79 | ```{r}
 80 | genesets.list <- geneIds(genesets)
 81 | head(lapply(genesets.list, head, 3), 3)
 82 | ```
 83 | By definition, GMT files should have unique gene set identifiers. If we
 84 | attempt to read a malformed GMT file with duplicated identifiers using
 85 | `getGmt()`, it will raise an error:
 86 | 
 87 | ```{r, error=TRUE}
 88 | URL <- paste("https://data.broadinstitute.org/gsea-msigdb/msigdb/release",
 89 |              "7.5/c2.all.v7.5.symbols.gmt", sep="/")
 90 | genesets <- getGmt(URL, geneIdType=SymbolIdentifier())
 91 | ```
 92 | So if we still want to read the contents of that GMT file, we would have to
 93 | download it and edit the lines with duplicated identifiers. An alternative is
 94 | to use the function `readGMT()` of the `r Biocpkg("GSVA")` package:
 95 | 
 96 | ```{r, message=FALSE}
 97 | library(GSVA)
 98 | 
 99 | genesets <- readGMT(URL)
100 | genesets
101 | ```
102 | As we can see, it has detected the presence of duplicated gene set identifiers and
103 | applied a deduplication policy, which can be changed through the parameter
104 | `deduplUse`; see its help page with `?readGMT` for further deduplication
105 | policies. Like `getGmt()`, it also takes a parameter `geneIdType`, but by default
106 | it will automatically detect the type of gene identifier and fill in the
107 | corresponding metadata in the resulting `GeneSetCollection` object, so that in
108 | principle we need not specify it.
110 | # Further reading
111 | 
112 | - The [Introduction to GSEABase](https://bioconductor.org/packages/release/bioc/vignettes/GSEABase/inst/doc/GSEABase.html)
113 |   vignette from the `r Biocpkg("GSEABase")` package.
114 | - The help pages of the `getGmt()` and `readGMT()` functions in the
115 |   `r Biocpkg("GSEABase")` and `r Biocpkg("GSVA")` packages, respectively.
116 | - The Bioconductor package `r Biocpkg("msigdb")` provides an API for
117 |   directly querying and downloading MSigDB GMT files as `GeneSetCollection`
118 |   objects.
119 | - Similarly to `r Biocpkg("msigdb")`, the `r CRANpkg("msigdbr")` CRAN
120 |   package provides an API to query and download MSigDB GMT files, but as
121 |   [tidy](https://r4ds.hadley.nz/data-tidy.html) `data.frame` objects
122 |   with one gene per row.
123 | 
124 | # Session info
125 | 
126 | <details>
127 | <summary><b>
128 | Click to display session info
129 | </b></summary>
130 | ```{r}
131 | sessionInfo()
132 | ```
133 | </details>
134 | 


--------------------------------------------------------------------------------
/vignettes/how-to-use-tidy-principles-for-granges-manipulation.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to use tidy principles for genomic ranges manipulation
  3 | author: Jacques Serizay
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to use tidy principles for genomic ranges manipulation}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk: 
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 | bibliography: ../inst/references.bib
 18 | ---
 19 | 
 20 | ```{r}
 21 | #| echo: false
 22 | 
 23 | suppressPackageStartupMessages({
 24 |     library(BiocStyle)
 25 | })
 26 | ```
 27 | 
 28 | Tidy data principles are a set of principles for structuring data in a way 
 29 | that makes it easier to manipulate and analyze. They have been popularized by the
 30 | `tidyverse` collection of R packages, which includes `dplyr` for data manipulation 
 31 | and `ggplot2` for data visualization. The tidy principles can also be applied to
 32 | biological data, such as genomic ranges, to facilitate analysis and visualization, 
 33 | thanks to the `tidyomics` meta-package.
 34 | 
 35 | # Bioconductor packages used in this document
 36 | 
 37 | * `r Biocpkg("tidyomics")`
 38 | 
 39 | # How to manipulate GRanges objects with tidy principles
 40 | 
 41 | For illustration, we are going to download an `xlsx` file provided as 
 42 | Supplementary Data in [@Serizay2020Oct], and manipulate it using tidy principles.
 43 | 
 44 | Good practices recommended by the Bioconductor consortium suggest using 
 45 | the `BiocFileCache` package to download and cache files from remote servers.
 46 | 
 47 | ```{r}
 48 | library(BiocFileCache)
 49 | 
 50 | # Create a BiocFileCache instance
 51 | bfc <- BiocFileCache()
 52 | 
 53 | # Download the xlsx file from the Genome Research website and cache it locally using BiocFileCache
 54 | xlsx_url <- "https://genome.cshlp.org/content/suppl/2020/11/16/gr.265934.120.DC1/Supplemental_Table_S2.xlsx"
 55 | xlsx_file <- bfcrpath(bfc, xlsx_url)
 56 | 
 57 | # The file is now downloaded and available
 58 | xlsx_file
 59 | ```
 60 | 
 61 | Excel files can be read into `tibbles` in `R` thanks to the `read_xlsx()` function 
 62 | from the `readxl` package.
 63 | 
 64 | ```{r}
 65 | library(readxl)
 66 | tbl <- read_xlsx(xlsx_file, sheet = "ATAC_metrics", skip = 2)
 67 | tbl
 68 | ```
 69 | 
 70 | We convert it into a `GRanges` object directly using `as_granges()` from the `plyranges` package, 
 71 | using [tidy evaluation as in the `dplyr` package](https://dplyr.tidyverse.org/articles/programming.html). 
 72 | 
 73 | ```{r}
 74 | library(tidyomics)
 75 | gr <- tbl |> 
 76 |     as_granges(seqnames = chrom_ce11, start = start_ce11, end = end_ce11)
 77 | gr
 78 | ```
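
The same conversion can be tried on a small hand-made table. This is a sketch; the column names `chrom`, `s`, `e` and `score` and the object names are hypothetical. The `seqnames`, `start` and `end` arguments are given as unquoted column names:

```{r}
suppressPackageStartupMessages({
    library(plyranges)
})
toy_tbl <- data.frame(
    chrom = c("chrI", "chrII"),
    s = c(100, 500),
    e = c(200, 900),
    score = c(1.5, 3)
)
toy_gr <- as_granges(toy_tbl, seqnames = chrom, start = s, end = e)
toy_gr
# Widths follow the closed-interval convention: end - start + 1
width(toy_gr)
```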
 79 | 
 80 | Let's say we are not interested in all the metadata columns, and we want to keep only the
 81 | `annot_dev_age_tissues`, `Annotation`, and the columns ending with `_RPM`. 
 82 | We can use the `select()` function from `dplyr` to select these columns, 
 83 | just like we would do with a `data.frame` or `tibble`.
 84 | 
 85 | ```{r}
 86 | gr <- gr |> 
 87 |     select(annot_dev_age_tissues, Annotation, ends_with("_RPM"))
 88 | gr
 89 | ```
 90 | 
 91 | In fact, most verbs from the `dplyr` package can be used with `GRanges` objects,
 92 | such as `filter()`, `mutate()`, `arrange()`, `group_by()` and `summarise()`. 
 93 | 
 94 | ```{r}
 95 | gr |> 
 96 |     mutate(start = start - 1000, end = end + 1000) |>
 97 |     filter(start > 1e6) |>
 98 |     arrange(Intest._RPM) |>
 99 |     group_by(annot_dev_age_tissues, Annotation) |> 
100 |     summarize(n = plyranges::n()) 
101 | ```
102 | 
103 | Note that the `summarize()` function automatically returns a `DataFrame`, not a `GRanges` object. 
104 | 
105 | To enhance compatibility with the `tidyverse` ecosystem, the `plyranges` package
106 | provides an `as_tibble()` method for `GRanges` objects, which converts them
107 | into `tibbles`. This allows you to pass `GRanges` objects to `ggplot` for visualization. 
108 | 
109 | ```{r}
110 | gr |> 
111 |     filter(Annotation %in% c("Intest.", "Hypod.", "Soma", "Ubiq.")) |>
112 |     as_tibble() |> 
113 |     ggplot(aes(x = `Intest._RPM`, y = `Hypod._RPM`, color = Annotation)) +
114 |     geom_point() +
115 |     labs(x = "Intestine RPM", y = "Hypodermis RPM") + 
116 |     facet_grid(Annotation~annot_dev_age_tissues) +
117 |     theme_bw() +
118 |     theme(legend.position = "bottom")
119 | ```
120 | 
121 | Finally, the `join` concept introduced by the `dplyr` package can also be applied to `GRanges` objects.
122 | In this case, the operation refers to joining overlapping, nearest, or preceding/following ranges, 
123 | using the `join_*_*()` functions from the `plyranges` package.
124 | 
125 | As an example, we can compare annotations from [@Serizay2020Oct] to chromosome 
126 | arms and centers:
127 | 
128 | ```{r}
129 | arms <- GRanges(
130 |     seqnames = rep(c("chrI", "chrII", "chrIII", "chrIV", "chrV", "chrX"), each = 3),
131 |     ranges = IRanges(
132 |         start = c(
133 |             1, 4550001, 10750001, 
134 |             1, 4550001, 10750001, 
135 |             1, 4450001, 10150001, 
136 |             1, 4750001, 12650001, 
137 |             1, 4850001, 16250001, 
138 |             1, 2850001, 14250001
139 |         ), 
140 |         end = c(
141 |             4550000, 10750000, 15072423, 
142 |             4550000, 10750000, 15279345,
143 |             4450000, 10150000, 13783700,
144 |             4750000, 12650000, 17493793,
145 |             4850000, 16250000, 20924149,
146 |             2850000, 14250000, 17718866
147 |         )
148 |     ),
149 |     strand = "*",
150 |     part = rep(c("arm", "center", "arm"), 6)
151 | )
152 | 
153 | gr |> 
154 |     join_overlap_left(arms) |> 
155 |     group_by(Annotation, part) |>
156 |     filter(Annotation %in% c("Germline", "Neurons")) |>
157 |     summarize(n = plyranges::n()) 
158 | ```
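
The difference between join flavours is easier to see on two tiny, made-up `GRanges` objects. This is a sketch; `toy_peaks` and `toy_regions` are hypothetical names and the coordinates are arbitrary:

```{r}
suppressPackageStartupMessages({
    library(plyranges)
})
toy_peaks <- GRanges("chrI", IRanges(start = c(100, 5000), width = 50),
                     peak = c("p1", "p2"))
toy_regions <- GRanges("chrI", IRanges(start = 1, end = 1000), part = "arm")
# Left join: both peaks are kept; p2 overlaps nothing and gets NA for `part`
join_overlap_left(toy_peaks, toy_regions)
# Inner join: only the overlapping peak p1 is returned
join_overlap_inner(toy_peaks, toy_regions)
```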
159 | 
160 | # Further reading 
161 | 
162 | To read more about the `tidyomics` packages, please refer to the
163 | [tidyomics](https://github.com/tidyomics) GitHub organization page. 
164 | 
165 | More tutorials are provided by the `tidyomics` GitHub organization in the 
166 | [tidy-ranges-tutorial](https://tidyomics.github.io/tidy-ranges-tutorial/) website. 
167 | 
168 | # References
169 | 
170 | ::: {#refs}
171 | :::
172 | 
173 | # Session info
174 | 
175 | <details>
176 | <summary><b>
177 | Click to display session info
178 | </b></summary>
179 | ```{r}
180 | sessionInfo()
181 | ```
182 | </details>
183 | 
184 | 


--------------------------------------------------------------------------------
/vignettes/how-to-compute-sequence-composition-for-genomic-regions.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to compute sequence composition for genomic regions
  3 | author: Michael Stadler
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to compute sequence composition for genomic regions}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk:
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 |         embed-resources: true
 18 | ---
 19 | 
 20 | ```{r}
 21 | #| echo: false
 22 | 
 23 | suppressPackageStartupMessages({
 24 |     library(BiocStyle)
 25 | })
 26 | ```
 27 | 
 28 | This HowTo shows how to fetch sequences for given regions from a reference and
 29 | compute composition features for them.
 30 | 
 31 | # Bioconductor packages used in this document
 32 | 
 33 | * `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")`
 34 | * `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")`
 35 | * `r Biocpkg("GenomicFeatures")`
 36 | * `r Biocpkg("Biostrings")`
 37 | 
 38 | # How to compute sequence composition for genomic regions
 39 | 
 40 | Mammalian genomes contain regulatory regions with a high content of CpG
 41 | dinucleotides, so-called *CpG islands*, an indirect consequence of cytosine
 42 | methylation: methylated CpGs are lost by mutation faster than unmethylated
 43 | ones. Many transcripts contain a CpG island in their promoter region, near
 44 | the start of the transcript. In this HowTo, we will identify such transcripts
 45 | by analyzing the CpG composition of promoter sequences.
 46 | 
 47 | We first obtain the coordinates of promoter regions from the
 48 | `r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")` package, using the
 49 | `promoters` function from the `r Biocpkg("GenomicFeatures")` package:
 50 | 
 51 | ```{r}
 52 | suppressPackageStartupMessages({
 53 |     library(GenomicFeatures)
 54 |     library(TxDb.Hsapiens.UCSC.hg38.knownGene)
 55 | })
 56 | txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
 57 | prom <- promoters(txdb, upstream = 1000, downstream = 1000)
 58 | prom
 59 | ```
 60 | 
 61 | Here, promoter regions are defined as 2000 base pair regions centered on
 62 | transcript start sites (the 5'-end of a transcript). The warning message above
 63 | is caused by start sites that are near the boundary of a chromosome, such that
 64 | the promoter region will extend beyond its beginning or end. By calling `trim`
 65 | on the `GRanges` object, all regions are truncated to the limits of the
 66 | chromosomes, resulting in a few promoters shorter than 2000 base pairs:
 67 | 
 68 | ```{r}
 69 | summary(width(prom) == 2000)
 70 | prom <- trim(prom)
 71 | summary(width(prom) == 2000)
 72 | ```
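
To see what `trim` does in isolation, here is a minimal sketch on a made-up
chromosome (`chrA`, 1000 bp long, is an assumption for illustration and not part
of the data above). The first toy promoter extends past position 1 and is
shortened by `trim`, while the second one is left untouched:

```{r}
library(GenomicRanges)

# Two hypothetical start sites on a toy 1000 bp chromosome
toy_tss <- GRanges("chrA", IRanges(start = c(5, 500), width = 1),
                   strand = "+", seqlengths = c(chrA = 1000))

# The first promoter starts before position 1 (this triggers a warning)
toy_prom <- promoters(toy_tss, upstream = 10, downstream = 10)
width(toy_prom)

# After trimming, the out-of-bound part is removed
toy_trimmed <- trim(toy_prom)
width(toy_trimmed)
start(toy_trimmed)
```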
 73 | 
 74 | It may be useful to now subset the extracted promoters, for example to only
 75 | retain promoters on autosomes, or only a single promoter per gene.
 76 | Here, we will remove redundant promoters resulting from transcripts with
 77 | identical start sites:
 78 | 
 79 | ```{r}
 80 | summary(duplicated(prom))
 81 | prom <- prom[!duplicated(prom)]
 82 | ```
 83 | 
 84 | Next, we extract the sequences of the promoters from the reference genome
 85 | assembly in the `r Biocpkg("BSgenome.Hsapiens.UCSC.hg38")` package:
 86 | 
 87 | ```{r}
 88 | suppressPackageStartupMessages({
 89 |     library(Biostrings)
 90 |     library(BSgenome.Hsapiens.UCSC.hg38)
 91 | })
 92 | promseq <- getSeq(BSgenome.Hsapiens.UCSC.hg38, prom)
 93 | promseq
 94 | ```
 95 | 
 96 | Next, we calculate some sequence features: the percentage of G and C bases, and
 97 | the ratio of observed over expected CpG dinucleotides (where the expected
 98 | frequency of a dinucleotide is the product of the frequencies of the two
 99 | mononucleotides). We obtain the mono- and dinucleotide frequencies from the
100 | sequences using the `oligonucleotideFrequency` function from the
101 | `r Biocpkg("Biostrings")` package:
102 | 
103 | ```{r}
104 | mo <- oligonucleotideFrequency(promseq, width = 1, as.prob = TRUE)
105 | di <- oligonucleotideFrequency(promseq, width = 2, as.prob = TRUE)
106 | 
107 | head(mo)
108 | head(di)
109 | 
110 | percentGC <- 100 * (mo[, "C"] + mo[, "G"])
111 | expectedCpG <- mo[, "C"] * mo[, "G"]
112 | obsExpRatioCpG <- di[, "CG"] / expectedCpG
113 | ```
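
As a sanity check of the ratio defined above, the same calculation can be run
on two short, made-up sequences (not taken from the genome) for which the
result can be verified by hand. In the first sequence, 4 of the 9 dinucleotides
are CG while C and G each make up 40% of the bases, giving a ratio of
(4/9) / (0.4 * 0.4), or about 2.8. The second sequence contains no CG
dinucleotide at all, so its ratio is 0:

```{r}
library(Biostrings)

# Two hypothetical 10-mers: one CpG-rich, one without any CpG
toyseq <- DNAStringSet(c("CGCGCGATCG", "CAGTCAGTGC"))

toy_mo <- oligonucleotideFrequency(toyseq, width = 1, as.prob = TRUE)
toy_di <- oligonucleotideFrequency(toyseq, width = 2, as.prob = TRUE)

# Observed CpG frequency divided by the product of C and G frequencies
toy_ratio <- toy_di[, "CG"] / (toy_mo[, "C"] * toy_mo[, "G"])
toy_ratio
```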
114 | 
115 | Finally, we can look at the distribution of these sequence features.
116 | The percentage of G+C bases in promoters spans a broad range from about 30% to
117 | over 70%:
118 | 
119 | ```{r}
120 | #| fig.width: 6
121 | #| fig.height: 6
122 | hist(percentGC, 60)
123 | ```
124 | 
125 | Defining CpG island promoters using this distribution would be hard, but is
126 | much easier using the observed over expected ratio of CpGs, which shows a
127 | much clearer bimodal structure:
128 | 
129 | ```{r}
130 | #| fig.width: 6
131 | #| fig.height: 6
132 | hist(obsExpRatioCpG, 60)
133 | abline(v = 0.45, col = 2, lty = 2)
134 | ```
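
The dashed line marks a cutoff read off the histogram. A minimal sketch of how
such a cutoff could be used to split promoters into two classes is shown below.
The ratio values and transcript names are invented for illustration; in
practice, the `obsExpRatioCpG` vector computed above would take their place:

```{r}
# Hypothetical observed/expected CpG ratios for four transcripts
toy_obsExp <- c(txA = 0.21, txB = 0.78, txC = 0.44, txD = 0.95)

# Same cutoff as the dashed line in the histogram
cutoff <- 0.45
toy_isCGI <- toy_obsExp > cutoff

# Number of promoters per class and names of putative CpG island promoters
table(toy_isCGI)
names(toy_obsExp)[toy_isCGI]
```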
135 | 
136 | Interestingly, almost all ratios are below 1.0, indicating that CpG islands
137 | are in fact not enriched for CpG dinucleotides, but rather lose them at a
138 | lower rate than other genomic regions.
139 | 
140 | Plotting the ratio versus the GC percentage furthermore reveals that CpG island
141 | promoters are also slightly GC-richer:
142 | 
143 | ```{r}
144 | #| fig.width: 6
145 | #| fig.height: 6
146 | plot(obsExpRatioCpG, percentGC, pch = "*", col = "#22222205")
147 | abline(v = 0.45, col = 2, lty = 2)
148 | ```
149 | 
150 | # Further reading
151 | 
152 | More examples of how to represent and work with biological sequences are
153 | contained in the vignettes of the `r Biocpkg("Biostrings")` package,
154 | which can be listed using `vignette(package = "Biostrings")`.
155 | 
156 | If no `BSgenome` object is available for your genome of interest, you can also
157 | use alternative sources to extract your sequences from, such as `DNAStringSet`
158 | objects or files in FASTA or 2bit format. A list of inputs supported by the
159 | `getSeq` function can be obtained using:
160 | 
161 | ```{r}
162 | showMethods(getSeq)
163 | ```
164 | 
165 | For more information about CpG islands, see:
166 | 
167 | - Bird A, Taggart M, Frommer M, Miller OJ, Macleod D. A fraction of the mouse
168 |   genome that is derived from islands of nonmethylated, CpG-rich DNA.
169 |   Cell. 1985 Jan;40(1):91-9. doi: 10.1016/0092-8674(85)90312-5. PMID: 2981636.
170 | - Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes.
171 |   J Mol Biol. 1987 Jul 20;196(2):261-82.
172 |   doi: 10.1016/0022-2836(87)90689-9. PMID: 3656447.
173 | 
174 | # Session info
175 | 
176 | <details>
177 | <summary><b>
178 | Click to display session info
179 | </b></summary>
180 | ```{r}
181 | sessionInfo()
182 | ```
183 | </details>
184 | 


--------------------------------------------------------------------------------
/vignettes/how-to-extract-promoter-sequences.qmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: How to extract promoter sequences
  3 | author: Fabricio Almeida-Silva
  4 | date: "`r Sys.Date()`"
  5 | vignette: >
  6 |   %\VignetteIndexEntry{How to extract promoter sequences}
  7 |   %\VignetteEngine{quarto::html}
  8 |   %\VignetteEncoding{UTF-8}
  9 | knitr:
 10 |     opts_chunk:
 11 |         collapse: true
 12 |         comment: '#>'
 13 | format:
 14 |     html:
 15 |         toc: true
 16 |         html-math-method: mathjax
 17 |         embed-resources: true
 18 | ---
 19 | 
 20 | ```{r}
 21 | #| echo: false
 22 | 
 23 | suppressPackageStartupMessages({
 24 |     library(BiocStyle)
 25 | })
 26 | ```
 27 | 
 28 | Promoters are DNA sequences typically located upstream of a 
 29 | gene's transcription start site (TSS), and they contain binding 
 30 | sites for transcription factors. 
 31 | Researchers working on gene regulation often want to extract
 32 | promoter sequences for all genes or a set of genes. With these sequences,
 33 | researchers can, for example, look for known transcription factor 
 34 | binding sites (TFBS), or predict TFBS *de novo*. This HowTo will demonstrate
 35 | how to extract promoter sequences for any gene(s) in a genome.
 36 | 
 37 | # Bioconductor packages used in this document
 38 | 
 39 | * `r Biocpkg("Biostrings")`
 40 | * `r Biocpkg("rtracklayer")`
 41 | * `r Biocpkg("GenomicRanges")`
 42 | * `r Biocpkg("txdbmaker")`
 43 | * `r Biocpkg("GenomicFeatures")`
 44 | * `r Biocpkg("BSgenome")`
 45 | 
 46 | # How to extract promoter sequences
 47 | 
 48 | We will start by loading the required packages.
 49 | 
 50 | ```{r}
 51 | #| message: false
 52 | 
 53 | library(Biostrings)
 54 | library(GenomicRanges)
 55 | library(rtracklayer)
 56 | library(txdbmaker)
 57 | library(GenomicFeatures)
 58 | library(BSgenome)
 59 | ```
 60 | 
 61 | Then, we will get our example data set. Here, we will use data for 
 62 | the model plant *Arabidopsis thaliana*. We will start by obtaining 
 63 | genome sequences and gene annotation from [Ensembl Plants](https://plants.ensembl.org).
 64 | 
 65 | ::: {.callout-note}
 66 | 
 67 | ## Working with model and non-model organisms
 68 | 
 69 | Here, we're obtaining data from [Ensembl Plants](https://plants.ensembl.org)
 70 | to explicitly show how this can be done for **any** organism. However,
 71 | if you're lucky enough to work with a model organism (e.g., human or mouse),
 72 | you can find curated genome sequences and gene annotation 
 73 | in `r Biocpkg("AnnotationHub")` (see, for instance, 
 74 | [this](https://bioconductor.github.io/BiocHowTo/articles/how-to-retrieve-gene-model-from-annotationhub.html) HowTo article).
 75 | 
 76 | :::
 77 | 
 78 | ```{r}
 79 | # Read genomic data from Ensembl Plants
 80 | ## Genome
 81 | genome <- Biostrings::readDNAStringSet(
 82 |     "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"
 83 | )
 84 | names(genome) <- gsub(" .*", "", names(genome)) # remove text after chr name
 85 | 
 86 | ## Annotation
 87 | annotation <- rtracklayer::import(
 88 |     "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-61/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.61.gff3.gz"
 89 | )
 90 | ```
 91 | 
 92 | Genome sequences are stored in `DNAStringSet` objects, and 
 93 | gene annotation (i.e., start and end coordinates of each gene) is stored in 
 94 | `GRanges` objects. This is what these objects look like:
 95 | 
 96 | ```{r}
 97 | # Take a look at the genome and annotation
 98 | genome
 99 | annotation
100 | ```
101 | 
102 | Note that our `GRanges` object contains ranges for multiple types of features,
103 | including genes, mRNAs, exons, and even entire chromosomes. 
104 | Since we want one promoter region per gene, we will subset
105 | `annotation` to keep only 'gene' ranges.
106 | 
107 | ```{r}
108 | # Keep only gene ranges
109 | genes <- annotation[annotation$type == "gene"]
110 | genes
111 | ```
112 | 
113 | Next, we can use the `promoters()` function from 
114 | `r Biocpkg("GenomicRanges")` to extract the genomic coordinates of promoters.
115 | By default, promoters start 2000 bp upstream of the transcription start
116 | site (TSS) and end 200 bp downstream of the TSS, but you can change these
117 | distances with the `upstream` and `downstream` arguments of `promoters()`.
118 | For example, some transcription factor families in plants
119 | (e.g., HSF and C2C2-GATA) bind exclusively to regions downstream of the TSS,
120 | which is contrary to the common view of TFBS being located upstream of TSSs.
121 | Importantly, if a gene has multiple TSSs (e.g., because of different
122 | transcripts starting at different positions), the most upstream TSS is
123 | used, not a cleverly chosen 'representative' or 'canonical' transcript.
124 | 
125 | ```{r}
126 | # Get promoter regions from a `GRanges` object
127 | prom_genes <- GenomicRanges::promoters(genes)
128 | 
129 | prom_genes
130 | ```
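
The effect of the `upstream` and `downstream` arguments, and of the strand, can
be checked on a small hand-made `GRanges` object. The two genes below are
hypothetical and only serve as an illustration. Note that for the gene on the
minus strand, the promoter is placed around the *end* coordinate of the range:

```{r}
library(GenomicRanges)

# Two hypothetical genes on opposite strands
toy_genes <- GRanges(
    seqnames = "1",
    ranges = IRanges(start = c(5000, 8000), end = c(6000, 9000)),
    strand = c("+", "-"),
    gene_id = c("geneA", "geneB")
)

# 500 bp upstream and 100 bp downstream of each TSS
toy_prom <- promoters(toy_genes, upstream = 500, downstream = 100)
toy_prom
```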
131 | 
132 | Alternatively, if you want to extract promoters for each transcript separately,
133 | you can subset the `GRanges` object to extract only ranges of 
134 | type 'transcript', and then use `promoters()` as demonstrated above.
135 | Note that GFF3 files from some databases do not contain a 'transcript' type.
136 | In these cases, you will most likely find transcript-level ranges in rows of 
137 | type 'mRNA'. This is the case for [Ensembl Plants](https://plants.ensembl.org), 
138 | from where we obtained our example data. 
139 | To extract promoters for each *A. thaliana* transcript, you'd execute:
140 | 
141 | ```{r}
142 | # Extract transcript ranges from a GRanges object
143 | tx_ranges <- annotation[annotation$type == "mRNA"] # or 'transcript' (if any)
144 | 
145 | # Extract promoters for each transcript separately
146 | prom_tx <- promoters(tx_ranges)
147 | prom_tx
148 | ```
149 | 
150 | ::: {.callout-tip}
151 | 
152 | ## Extracting promoter regions from `TxDb` objects
153 | 
154 | If you have a `TxDb` object with transcript coordinates and metadata,
155 | you can extract promoter regions for each transcript with the `promoters()`
156 | function from the `r Biocpkg("GenomicFeatures")` package. Note that function names
157 | are the same, but `GenomicFeatures::promoters()` takes `TxDb` objects as input, 
158 | while `GenomicRanges::promoters()` takes `GRanges` objects as input.
159 | 
160 | ```{r}
161 | # Create a `TxDb` object from a `GRanges` object
162 | txdb <- txdbmaker::makeTxDbFromGRanges(annotation)
163 | 
164 | # Get promoter regions for all transcripts
165 | prom_tx2 <- GenomicFeatures::promoters(txdb)
166 | 
167 | prom_tx2
168 | ```
169 | 
170 | :::
171 | 
172 | Finally, you can extract sequences given a genome and some coordinates with
173 | the `getSeq()` function from `r Biocpkg("BSgenome")`.
174 | 
175 | ```{r}
176 | # Extract promoter sequences (1k genes only for demo purposes)
177 | prom_seqs <- getSeq(genome, prom_genes[1:1000])
178 | 
179 | prom_seqs
180 | ```
181 | 
182 | 
183 | # Further reading
184 | 
185 | To learn more about how to work with sequences and ranges using Bioconductor 
186 | packages, look at the vignettes of the `r Biocpkg("Biostrings")` and
187 | `r Biocpkg("GenomicRanges")` packages.
188 | 
189 | # Session info
190 | 
191 | <details>
192 | <summary><b>
193 | Click to display session info
194 | </b></summary>
195 | ```{r}
196 | sessionInfo()
197 | ```
198 | </details>
199 | 


--------------------------------------------------------------------------------