├── .github
├── .gitignore
├── CODE_OF_CONDUCT.md
├── workflows
│ └── check-bioc.yml
├── ISSUE_TEMPLATE
│ └── issue_template.md
├── SUPPORT.md
└── CONTRIBUTING.md
├── vignettes
├── .gitignore
├── bibliography.bib
└── doubletrouble_vignette.Rmd
├── .Rbuildignore
├── data
├── gmax_ks.rda
├── fungi_kaks.rda
├── yeast_annot.rda
├── yeast_seq.rda
├── diamond_inter.rda
├── diamond_intra.rda
└── cds_scerevisiae.rda
├── man
├── figures
│ └── logo.png
├── cds_scerevisiae.Rd
├── diamond_intra.Rd
├── yeast_seq.Rd
├── yeast_annot.Rd
├── diamond_inter.Rd
├── gmax_ks.Rd
├── fungi_kaks.Rd
├── plot_ks_peaks.Rd
├── classify_genes.Rd
├── get_intron_counts.Rd
├── plot_duplicate_freqs.Rd
├── duplicates2counts.Rd
├── get_segmental.Rd
├── plot_rates_by_species.Rd
├── plot_ks_distro.Rd
├── pairs2kaks.Rd
├── split_pairs_by_peak.Rd
├── get_tandem_proximal.Rd
├── find_ks_peaks.Rd
├── get_anchors_list.Rd
├── get_transposed_classes.Rd
├── get_transposed.Rd
└── classify_gene_pairs.Rd
├── .gitignore
├── codecov.yml
├── tests
├── testthat
│ ├── test-data_validation.R
│ ├── test-visualization.R
│ ├── test-ka_ks_analyses.R
│ └── test-duplicate_classification.R
└── testthat.R
├── NEWS.md
├── inst
├── _pkgdown.yml
├── CITATION
└── script
│ └── data_acquisition.md
├── R
├── data_validation.R
├── data.R
├── duplicate_classification.R
├── ka_ks_analyses.R
├── utils.R
├── visualization.R
└── utils_duplicate_classification.R
├── NAMESPACE
├── DESCRIPTION
├── README.Rmd
└── README.md
/.github/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 |
--------------------------------------------------------------------------------
/vignettes/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | *.R
3 |
--------------------------------------------------------------------------------
/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^\.github$
2 | ^codecov\.yml$
3 | ^.*\.Rproj$
4 | ^\.Rproj\.user$
5 |
--------------------------------------------------------------------------------
/data/gmax_ks.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/almeidasilvaf/doubletrouble/HEAD/data/gmax_ks.rda
--------------------------------------------------------------------------------
/data/fungi_kaks.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/almeidasilvaf/doubletrouble/HEAD/data/fungi_kaks.rda
--------------------------------------------------------------------------------
/data/yeast_annot.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/almeidasilvaf/doubletrouble/HEAD/data/yeast_annot.rda
--------------------------------------------------------------------------------
/data/yeast_seq.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/almeidasilvaf/doubletrouble/HEAD/data/yeast_seq.rda
--------------------------------------------------------------------------------
/man/figures/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/almeidasilvaf/doubletrouble/HEAD/man/figures/logo.png
--------------------------------------------------------------------------------
/data/diamond_inter.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/almeidasilvaf/doubletrouble/HEAD/data/diamond_inter.rda
--------------------------------------------------------------------------------
/data/diamond_intra.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/almeidasilvaf/doubletrouble/HEAD/data/diamond_intra.rda
--------------------------------------------------------------------------------
/data/cds_scerevisiae.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/almeidasilvaf/doubletrouble/HEAD/data/cds_scerevisiae.rda
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | inst/script/*.Rmd
6 | inst/doc
7 | *.Rproj
8 | *.BiocCheck
--------------------------------------------------------------------------------
/codecov.yml:
--------------------------------------------------------------------------------
1 | comment: false
2 |
3 | coverage:
4 | status:
5 | project:
6 | default:
7 | target: auto
8 | threshold: 1%
9 | informational: true
10 | patch:
11 | default:
12 | target: auto
13 | threshold: 1%
14 | informational: true
15 |
--------------------------------------------------------------------------------
/tests/testthat/test-data_validation.R:
--------------------------------------------------------------------------------
1 |
2 | # Start tests ----
3 | test_that("check_geneid_match() flags mismatches between gene sets", {
4 |
5 | set1 <- c("gene1", "gene2A", "gene3", "gene4A")
6 | set2 <- c("gene1", "gene2", "gene3", "gene4")
7 |
8 | expect_error(check_geneid_match(set1, set2))
9 | })
10 |
--------------------------------------------------------------------------------
/tests/testthat.R:
--------------------------------------------------------------------------------
1 | # This file is part of the standard setup for testthat.
2 | # It is recommended that you do not modify it.
3 | #
4 | # Where should you do additional test configuration?
5 | # Learn more about the roles of various files in:
6 | # * https://r-pkgs.org/tests.html
7 | # * https://testthat.r-lib.org/reference/test_package.html#special-files
8 |
9 | library(testthat)
10 | library(doubletrouble)
11 |
12 | test_check("doubletrouble")
13 |
--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 |
2 | # doubletrouble 0.99.0
3 |
4 | NEW FEATURES
5 |
6 | * Added a `NEWS.md` file to track changes to the package.
7 |
8 |
9 | # doubletrouble 0.99.2
10 |
11 | CHANGES
12 |
13 | * Small change in coding style after Bioconductor peer-review (`m:n` replaced
14 | with `c(m, n)` and `seq(m,n)`)
15 |
16 | # doubletrouble 0.99.3
17 |
18 | BUG FIXES
19 |
20 | * Updated functions (e.g., get_anchor_list(), collinearity2blocks()) after
21 | update in syntenet.
22 |
23 |
--------------------------------------------------------------------------------
/man/cds_scerevisiae.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/data.R
3 | \docType{data}
4 | \name{cds_scerevisiae}
5 | \alias{cds_scerevisiae}
6 | \title{Coding sequences (CDS) of S. cerevisiae}
7 | \format{
8 | A DNAStringSet object with CDS of S. cerevisiae.
9 | }
10 | \usage{
11 | data(cds_scerevisiae)
12 | }
13 | \description{
14 | Data were obtained from Ensembl Fungi, and only CDS of primary transcripts
15 | were included.
16 | }
17 | \examples{
18 | data(cds_scerevisiae)
19 | }
20 | \keyword{datasets}
21 |
--------------------------------------------------------------------------------
/man/diamond_intra.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/data.R
3 | \docType{data}
4 | \name{diamond_intra}
5 | \alias{diamond_intra}
6 | \title{Intraspecies DIAMOND output for S. cerevisiae}
7 | \format{
8 | A list of data frames (length 1) containing the whole paranome of
9 | S. cerevisiae resulting from intragenome similarity searches.
10 | }
11 | \usage{
12 | data(diamond_intra)
13 | }
14 | \description{
15 | List obtained with \code{run_diamond()}.
16 | }
17 | \examples{
18 | data(diamond_intra)
19 | }
20 | \keyword{datasets}
21 |
--------------------------------------------------------------------------------
/inst/_pkgdown.yml:
--------------------------------------------------------------------------------
1 | template:
2 | bootstrap: 5
3 | bslib:
4 | preset: "bootstrap"
5 | font_scale: 1.0
6 | base_font:
7 | google: "Atkinson Hyperlegible" # font for better accessibility
8 | code_font:
9 | google: "Source Code Pro"
10 | primary: "#18416a" # navbar hover and sidebar: dark blue
11 | navbar-light-brand-color: "#18416a" # pkg name: white
12 | navbar-light-hover-color: "#18416a" # navbar text on hover: Bioc green
13 | pkgdown-nav-height: 78px
14 |
15 | navbar:
16 | type: light
17 |
--------------------------------------------------------------------------------
/man/yeast_seq.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/data.R
3 | \docType{data}
4 | \name{yeast_seq}
5 | \alias{yeast_seq}
6 | \title{Protein sequences of the yeast species S. cerevisiae and C. glabrata}
7 | \format{
8 | A list of AAStringSet objects with the elements
9 | \strong{Scerevisiae} and \strong{Cglabrata}.
10 | }
11 | \usage{
12 | data(yeast_seq)
13 | }
14 | \description{
15 | Data obtained from Ensembl Fungi. Only translated sequences of primary
16 | transcripts were included.
17 | }
18 | \examples{
19 | data(yeast_seq)
20 | }
21 | \keyword{datasets}
22 |
--------------------------------------------------------------------------------
/man/yeast_annot.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/data.R
3 | \docType{data}
4 | \name{yeast_annot}
5 | \alias{yeast_annot}
6 | \title{Genome annotation of the yeast species S. cerevisiae and C. glabrata}
7 | \format{
8 | A CompressedGRangesList containing
9 | the elements \strong{Scerevisiae} and \strong{Cglabrata}.
10 | }
11 | \usage{
12 | data(yeast_annot)
13 | }
14 | \description{
15 | Data obtained from Ensembl Fungi. Only annotation data protein-coding
16 | genes (with associated mRNA, exons, CDS, etc) are included.
17 | }
18 | \examples{
19 | data(yeast_annot)
20 | }
21 | \keyword{datasets}
22 |
--------------------------------------------------------------------------------
/man/diamond_inter.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/data.R
3 | \docType{data}
4 | \name{diamond_inter}
5 | \alias{diamond_inter}
6 | \title{Interspecies DIAMOND output for yeast species}
7 | \format{
8 | A list of data frames (length 1) containing the output of a
9 | DIAMOND search of S. cerevisiae against C. glabrata (outgroup).
10 | }
11 | \usage{
12 | data(diamond_inter)
13 | }
14 | \description{
15 | This list contains a similarity search of S. cerevisiae against
16 | C. glabrata, and it was obtained with \code{run_diamond()}.
17 | }
18 | \examples{
19 | data(diamond_inter)
20 | }
21 | \keyword{datasets}
22 |
--------------------------------------------------------------------------------
/man/gmax_ks.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/data.R
3 | \docType{data}
4 | \name{gmax_ks}
5 | \alias{gmax_ks}
6 | \title{Duplicate pairs and Ks values for Glycine max}
7 | \format{
8 | A data frame with the following variables:
9 | \describe{
10 | \item{dup1}{Character, duplicated gene 1.}
11 | \item{dup2}{Character, duplicated gene 2.}
12 | \item{Ks}{Numeric, Ks values.}
13 | \item{type}{Factor, duplication mode.}
14 | }
15 | }
16 | \usage{
17 | data(gmax_ks)
18 | }
19 | \description{
20 | This data set was obtained with \code{classify_gene_pairs()} followed
21 | by \code{pairs2kaks()}.
22 | }
23 | \examples{
24 | data(gmax_ks)
25 | }
26 | \keyword{datasets}
27 |
--------------------------------------------------------------------------------
/inst/CITATION:
--------------------------------------------------------------------------------
1 | citHeader("To cite doubletrouble in publications, use:")
2 |
3 | citEntry(
4 | entry = "Article",
5 | title = "doubletrouble: an R/Bioconductor package for the identification, classification, and analysis of gene and genome duplications",
6 | author = personList(
7 | as.person("Fabricio Almeida-Silva"),
8 | as.person("Yves Van de Peer")
9 | ),
10 | journal = "Bioinformatics",
11 | year = "2025",
12 | volume = "41",
13 | number = "2",
14 | pages = "btaf043",
15 | url = "https://academic.oup.com/bioinformatics/article/41/2/btaf043/7979242",
16 | doi = "10.1093/bioinformatics/btaf043",
17 | textVersion = paste(
18 | "Almeida-Silva F, Van de Peer Y",
19 | "doubletrouble: an R/Bioconductor package for the identification, classification, and analysis of gene and genome duplications.",
20 | "Bioinformatics, 41(2), btaf043. (2025). https://doi.org/10.1093/bioinformatics/btaf043"
21 | )
22 | )
--------------------------------------------------------------------------------
/man/fungi_kaks.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/data.R
3 | \docType{data}
4 | \name{fungi_kaks}
5 | \alias{fungi_kaks}
6 | \title{Duplicate pairs and Ka, Ks, and Ka/Ks values for fungi species}
7 | \format{
8 | A list of data frame with elements
9 | named \strong{saccharomyces_cerevisiae}, \strong{candida_glabrata},
10 | and \strong{schizosaccharomyces_pombe}. Each data frame contains
11 | the following variables:
12 | \describe{
13 | \item{dup1}{Character, duplicated gene 1.}
14 | \item{dup2}{Character, duplicated gene 2.}
15 | \item{Ka}{Numeric, Ka values.}
16 | \item{Ks}{Numeric, Ks values.}
17 | \item{Ka_Ks}{Numeric, Ka/Ks values.}
18 | \item{type}{Character, mode of duplication}
19 | }
20 | }
21 | \usage{
22 | data(fungi_kaks)
23 | }
24 | \description{
25 | This data set was obtained with \code{classify_gene_pairs()} followed
26 | by \code{pairs2kaks()}.
27 | }
28 | \examples{
29 | data(fungi_kaks)
30 | }
31 | \keyword{datasets}
32 |
--------------------------------------------------------------------------------
/man/plot_ks_peaks.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/visualization.R
3 | \name{plot_ks_peaks}
4 | \alias{plot_ks_peaks}
5 | \title{Plot histogram of Ks distribution with peaks}
6 | \usage{
7 | plot_ks_peaks(peaks = NULL, binwidth = 0.05)
8 | }
9 | \arguments{
10 | \item{peaks}{A list with elements \strong{mean}, \strong{sd},
11 | \strong{lambda}, and \strong{ks}, as returned by the
12 | function \code{fins_ks_peaks()}.}
13 |
14 | \item{binwidth}{Numeric scalar with binwidth for the histogram.
15 | Default: 0.05.}
16 | }
17 | \value{
18 | A ggplot object with a histogram and lines for each Ks peak.
19 | }
20 | \description{
21 | Plot histogram of Ks distribution with peaks
22 | }
23 | \examples{
24 | data(fungi_kaks)
25 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
26 | ks <- scerevisiae_kaks$Ks
27 |
28 | # Find 2 peaks in Ks distribution
29 | peaks <- find_ks_peaks(ks, npeaks = 2)
30 |
31 | # Plot
32 | plot_ks_peaks(peaks, binwidth = 0.05)
33 | }
34 |
--------------------------------------------------------------------------------
/.github/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | The Bioconductor community values
2 |
3 | * an open approach to science that promotes the sharing of ideas, code, and expertise
4 | * collaboration
5 | * diversity and inclusivity
6 | * a kind and welcoming environment
7 | * community contributions
8 |
9 | In line with these values, Bioconductor is dedicated to providing a welcoming, supportive, collegial, experience free of harassment, intimidation, and bullying regardless of:
10 |
11 | * identity: gender, gender identity and expression, sexual orientation, disability, physical appearance, ethnicity, body size, race, age, religion, etc.
12 | * intellectual position: approaches to data analysis, software preferences, coding style, scientific perspective, etc.
13 | * stage of career
14 |
15 | In order to uphold these values, members of the Bioconductor community are required to follow the Code of Conduct.The latest version of Bioconductor project Code of Conduct is available at http://bioconductor.org/about/code-of-conduct/. Please read the Code of Conduct before contributing to this project.
16 |
17 | Thank you!
18 |
--------------------------------------------------------------------------------
/.github/workflows/check-bioc.yml:
--------------------------------------------------------------------------------
1 | name: rworkflows.devel
2 | 'on':
3 | push:
4 | branches: devel
5 | pull_request:
6 | branches: devel
7 | jobs:
8 | rworkflows:
9 | permissions: write-all
10 | runs-on: ${{ matrix.config.os }}
11 | name: ${{ matrix.config.os }} (${{ matrix.config.r }})
12 | container: ${{ matrix.config.cont }}
13 | strategy:
14 | fail-fast: ${{ false }}
15 | matrix:
16 | config:
17 | - os: ubuntu-latest
18 | bioc: devel
19 | r: auto
20 | cont: ghcr.io/bioconductor/bioconductor_docker:devel
21 | rspm: ~
22 | steps:
23 | - uses: neurogenomics/rworkflows@master
24 | with:
25 | run_bioccheck: ${{ true }}
26 | run_rcmdcheck: ${{ true }}
27 | as_cran: ${{ false }}
28 | run_vignettes: ${{ false }}
29 | has_testthat: ${{ true }}
30 | run_covr: ${{ true }}
31 | run_pkgdown: ${{ true }}
32 | has_runit: ${{ false }}
33 | has_latex: ${{ false }}
34 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
35 | run_docker: ${{ false }}
36 | DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}
37 | runner_os: ${{ runner.os }}
38 | cache_version: cache-v1
39 | docker_registry: ghcr.io
--------------------------------------------------------------------------------
/man/classify_genes.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/duplicate_classification.R
3 | \name{classify_genes}
4 | \alias{classify_genes}
5 | \title{Classify genes into unique modes of duplication}
6 | \usage{
7 | classify_genes(gene_pairs_list = NULL)
8 | }
9 | \arguments{
10 | \item{gene_pairs_list}{List of classified gene pairs as returned
11 | by \code{classify_gene_pairs()}.}
12 | }
13 | \value{
14 | A list of 2-column data frames with variables \strong{gene}
15 | and \strong{type} representing gene ID and duplication type, respectively.
16 | }
17 | \description{
18 | Classify genes into unique modes of duplication
19 | }
20 | \details{
21 | If a gene is present in pairs with different duplication modes, the gene
22 | is classified into a unique mode of duplication following the order
23 | of priority indicated in the levels of the factor \strong{type}.
24 |
25 | For scheme "binary", the order is SD > SSD.
26 | For scheme "standard", the order is SD > TD > PD > DD.
27 | For scheme "extended", the order is SD > TD > PD > TRD > DD.
28 | For scheme "full", the order is SD > TD > PD > rTRD > dTRD > DD.
29 | }
30 | \examples{
31 | data(fungi_kaks)
32 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
33 |
34 | cols <- c("dup1", "dup2", "type")
35 | gene_pairs_list <- list(Scerevisiae = scerevisiae_kaks[, cols])
36 |
37 | class_genes <- classify_genes(gene_pairs_list)
38 | }
39 |
--------------------------------------------------------------------------------
/man/get_intron_counts.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/utils.R
3 | \name{get_intron_counts}
4 | \alias{get_intron_counts}
5 | \title{Get a data frame of intron counts per gene}
6 | \usage{
7 | get_intron_counts(txdb)
8 | }
9 | \arguments{
10 | \item{txdb}{A \code{TxDb} object with transcript annotations. See details below
11 | for examples on how to create \code{TxDb} objects from different kinds of input.}
12 | }
13 | \value{
14 | A data frame with intron counts per gene, with variables:
15 | \describe{
16 | \item{gene}{Character with gene IDs.}
17 | \item{introns}{Numeric with number of introns per gene.}
18 | }
19 | }
20 | \description{
21 | Get a data frame of intron counts per gene
22 | }
23 | \details{
24 | The family of functions \code{makeTxDbFrom*} from
25 | the \strong{txdbmaker} package can be used to create \code{TxDb} objects
26 | from a variety of input data types. You can create \code{TxDb} objects
27 | from e.g., \code{GRanges} objects (\code{makeTxDbFromGRanges()}),
28 | GFF files (\code{makeTxDbFromGFF()}),
29 | an Ensembl database (\code{makeTxDbFromEnsembl}), and
30 | a Biomart database (\code{makeTxDbFromBiomart}).
31 | }
32 | \examples{
33 | data(yeast_annot)
34 |
35 | # Create TxDb object from GRanges
36 | library(txdbmaker)
37 | txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]])
38 |
39 | # Get intron counts
40 | intron_counts <- get_intron_counts(txdb)
41 | }
42 |
--------------------------------------------------------------------------------
/man/plot_duplicate_freqs.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/visualization.R
3 | \name{plot_duplicate_freqs}
4 | \alias{plot_duplicate_freqs}
5 | \title{Plot frequency of duplicates per mode for each species}
6 | \usage{
7 | plot_duplicate_freqs(dup_counts, plot_type = "facet", remove_zero = TRUE)
8 | }
9 | \arguments{
10 | \item{dup_counts}{A data frame in long format with the number of
11 | duplicates per mode for each species, as returned by
12 | the function \code{duplicates2counts}.}
13 |
14 | \item{plot_type}{Character indicating how to plot frequencies. One of
15 | 'facet' (facets for each level of the variable \strong{type}),
16 | 'stack' (levels of the variable \strong{type} as stacked bars), or
17 | 'stack_percent' (levels of the variable \strong{type} as stacked bars,
18 | with x-axis representing relative frequencies). Default: 'facet'.}
19 |
20 | \item{remove_zero}{Logical indicating whether or not to remove rows
21 | with zero values. Default: TRUE.}
22 | }
23 | \value{
24 | A ggplot object.
25 | }
26 | \description{
27 | Plot frequency of duplicates per mode for each species
28 | }
29 | \examples{
30 | data(fungi_kaks)
31 |
32 | # Get unique duplicates
33 | duplicate_list <- classify_genes(fungi_kaks)
34 |
35 | # Get count table
36 | dup_counts <- duplicates2counts(duplicate_list)
37 |
38 | # Plot counts
39 | plot_duplicate_freqs(dup_counts, plot_type = "stack_percent")
40 | }
41 |
--------------------------------------------------------------------------------
/man/duplicates2counts.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/utils.R
3 | \name{duplicates2counts}
4 | \alias{duplicates2counts}
5 | \title{Get a duplicate count matrix for each genome}
6 | \usage{
7 | duplicates2counts(duplicate_list, shape = "long")
8 | }
9 | \arguments{
10 | \item{duplicate_list}{A list of data frames with the duplicated genes or
11 | gene pairs and their modes of duplication as returned
12 | by \code{classify_gene_pairs()} or \code{classify_genes()}.}
13 |
14 | \item{shape}{Character specifying the shape of the output data frame.
15 | One of "long" (data frame in the long shape, in the tidyverse sense),
16 | or "wide" (data frame in the wide shape, in the tidyverse sense).
17 | Default: "long".}
18 | }
19 | \value{
20 | If \strong{shape = "wide"}, a count matrix containing the
21 | frequency of duplicated genes (or gene pairs) by mode for each species,
22 | with species in rows and duplication modes in columns.
23 | If \strong{shape = "long"}, a data frame in long format with the following
24 | variables:
25 | \describe{
26 | \item{type}{Factor, type of duplication.}
27 | \item{n}{Numeric, number of duplicates.}
28 | \item{species}{Character, species name}
29 | }
30 | }
31 | \description{
32 | Get a duplicate count matrix for each genome
33 | }
34 | \examples{
35 | data(fungi_kaks)
36 |
37 | # Get unique duplicates
38 | duplicate_list <- classify_genes(fungi_kaks)
39 |
40 | # Get count table
41 | counts <- duplicates2counts(duplicate_list)
42 | }
43 |
--------------------------------------------------------------------------------
/man/get_segmental.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/utils_duplicate_classification.R
3 | \name{get_segmental}
4 | \alias{get_segmental}
5 | \title{Classify gene pairs derived from segmental duplications}
6 | \usage{
7 | get_segmental(anchor_pairs = NULL, pairs = NULL)
8 | }
9 | \arguments{
10 | \item{anchor_pairs}{A 2-column data frame with anchor pairs in columns 1
11 | and 2.}
12 |
13 | \item{pairs}{A 2-column data frame with all duplicate pairs. This
14 | is equivalent to the first 2 columns of the tabular output of BLAST-like
15 | programs.}
16 | }
17 | \value{
18 | A 3-column data frame with the variables:
19 | \describe{
20 | \item{dup1}{Character, duplicated gene 1}
21 | \item{dup2}{Character, duplicated gene 2}
22 | \item{type}{Factor indicating duplication types, with levels
23 | "SD" (segmental duplication) or
24 | "DD" (dispersed duplication).}
25 | }
26 | }
27 | \description{
28 | Classify gene pairs derived from segmental duplications
29 | }
30 | \examples{
31 | data(diamond_intra)
32 | data(yeast_annot)
33 | data(yeast_seq)
34 | blast_list <- diamond_intra
35 |
36 | # Get processed annotation for S. cerevisiae
37 | annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation[1]
38 |
39 | # Get list of intraspecies anchor pairs
40 | anchor_pairs <- get_anchors_list(blast_list, annotation)
41 | anchor_pairs <- anchor_pairs[[1]][, c(1, 2)]
42 |
43 | # Get duplicate pairs from DIAMOND output
44 | duplicates <- diamond_intra[[1]][, c(1, 2)]
45 | dups <- get_segmental(anchor_pairs, duplicates)
46 | }
47 |
--------------------------------------------------------------------------------
/R/data_validation.R:
--------------------------------------------------------------------------------
1 |
2 | #' Check if gene names in set 1 are present in set 2
3 | #'
4 | #' @param ref_ids Character vector of reference gene set.
5 | #' @param test_ids Character vector of test gene set.
6 | #' @param setnames Character vector of length with set names.
7 | #' Default: \code{c("gene pairs", "CDS")}
8 | #'
9 | #' @return TRUE if names match, otherwise an error is shown.
10 | #' @importFrom utils head
11 | #' @details
12 | #' This internal function can be used, for instance, to check if CDS names
13 | #' match gene IDs in the gene pair list.
14 | #' @noRd
15 | check_geneid_match <- function(
16 | ref_ids, test_ids, setnames = c("gene pairs", "CDS")
17 | ) {
18 |
19 | mismatch_ids <- ref_ids[!ref_ids %in% test_ids]
20 | mismatch_perc <- length(mismatch_ids) / length(ref_ids)
21 | mismatch_perc <- round(mismatch_perc * 100, 2)
22 |
23 | if(mismatch_perc >0) {
24 | stop(
25 | mismatch_perc, "%", " (N=", length(mismatch_ids), ") of the IDs in ", setnames[1],
26 | " were not found in ", setnames[2], ".\n",
27 | "All gene IDs in ", setnames[1], " must be in ", setnames[2],
28 | ". Did you check if gene IDs match?",
29 | "\n\nHere are some examples of nonmatching IDs (from ", setnames[1], ") :\n",
30 | paste0(head(mismatch_ids, n = 5), collapse = "\n"),
31 | "\n\nAnd here are some examples of IDs in ", setnames[2], ":\n",
32 | paste0(head(test_ids, n = 5), collapse = "\n")
33 | )
34 | }
35 |
36 | return(TRUE)
37 | }
--------------------------------------------------------------------------------
/man/plot_rates_by_species.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/visualization.R
3 | \name{plot_rates_by_species}
4 | \alias{plot_rates_by_species}
5 | \title{Plot distributions of substitution rates (Ka, Ks, or Ka/Ks) per species}
6 | \usage{
7 | plot_rates_by_species(
8 | kaks_list,
9 | rate_column = "Ks",
10 | bytype = FALSE,
11 | range = c(0, 2),
12 | fill = "deepskyblue3",
13 | color = "deepskyblue4"
14 | )
15 | }
16 | \arguments{
17 | \item{kaks_list}{A list of data frames with substitution rates per gene
18 | pair in each species as returned by \code{pairs2kaks()}.}
19 |
20 | \item{rate_column}{Character indicating the name of the column to plot.
21 | Default: "Ks".}
22 |
23 | \item{bytype}{Logical indicating whether or not to show distributions by
24 | type of duplication. Default: FALSE.}
25 |
26 | \item{range}{Numeric vector of length 2 indicating the minimum and maximum
27 | values to plot. Default: \code{c(0, 2)}.}
28 |
29 | \item{fill}{Character with color to use for the fill aesthetic. Ignored
30 | if \strong{bytype = TRUE}. Default: "deepskyblue3".}
31 |
32 | \item{color}{Character with color to use for the color aesthetic. Ignored
33 | if \strong{bytype = FALSE}. Default: "deepskyblue4".}
34 | }
35 | \value{
36 | A ggplot object.
37 | }
38 | \description{
39 | Plot distributions of substitution rates (Ka, Ks, or Ka/Ks) per species
40 | }
41 | \details{
42 | Data will be plotted using the species order of the list. To change the
43 | order of the species to plot, reorder the list elements
44 | in \strong{kaks_list}.
45 | }
46 | \examples{
47 | data(fungi_kaks)
48 |
49 | # Plot rates
50 | plot_rates_by_species(fungi_kaks, rate_column = "Ka_Ks")
51 | }
52 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 | export(classify_gene_pairs)
4 | export(classify_genes)
5 | export(duplicates2counts)
6 | export(find_ks_peaks)
7 | export(get_anchors_list)
8 | export(get_intron_counts)
9 | export(get_segmental)
10 | export(get_tandem_proximal)
11 | export(get_transposed)
12 | export(get_transposed_classes)
13 | export(pairs2kaks)
14 | export(plot_duplicate_freqs)
15 | export(plot_ks_distro)
16 | export(plot_ks_peaks)
17 | export(plot_rates_by_species)
18 | export(split_pairs_by_peak)
19 | importFrom(AnnotationDbi,select)
20 | importFrom(Biostrings,width)
21 | importFrom(GenomicFeatures,intronsByTranscript)
22 | importFrom(GenomicRanges,GRangesList)
23 | importFrom(MSA2dist,indices2kaks)
24 | importFrom(ggplot2,aes)
25 | importFrom(ggplot2,after_stat)
26 | importFrom(ggplot2,element_blank)
27 | importFrom(ggplot2,facet_grid)
28 | importFrom(ggplot2,facet_wrap)
29 | importFrom(ggplot2,geom_bar)
30 | importFrom(ggplot2,geom_boxplot)
31 | importFrom(ggplot2,geom_density)
32 | importFrom(ggplot2,geom_histogram)
33 | importFrom(ggplot2,geom_violin)
34 | importFrom(ggplot2,geom_vline)
35 | importFrom(ggplot2,ggplot)
36 | importFrom(ggplot2,ggplot_build)
37 | importFrom(ggplot2,labs)
38 | importFrom(ggplot2,scale_fill_manual)
39 | importFrom(ggplot2,scale_x_continuous)
40 | importFrom(ggplot2,scale_y_continuous)
41 | importFrom(ggplot2,stat_function)
42 | importFrom(ggplot2,theme)
43 | importFrom(ggplot2,theme_bw)
44 | importFrom(ggplot2,vars)
45 | importFrom(mclust,densityMclust)
46 | importFrom(rlang,.data)
47 | importFrom(stats,density)
48 | importFrom(stats,dnorm)
49 | importFrom(syntenet,interspecies_synteny)
50 | importFrom(syntenet,intraspecies_synteny)
51 | importFrom(utils,head)
52 | importFrom(utils,read.table)
53 |
--------------------------------------------------------------------------------
/man/plot_ks_distro.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/visualization.R
3 | \name{plot_ks_distro}
4 | \alias{plot_ks_distro}
5 | \title{Plot distribution of synonymous substitution rates (Ks)}
6 | \usage{
7 | plot_ks_distro(
8 | ks_df,
9 | min_ks = 0.01,
10 | max_ks = 2,
11 | bytype = FALSE,
12 | type_levels = NULL,
13 | plot_type = "histogram",
14 | binwidth = 0.03
15 | )
16 | }
17 | \arguments{
18 | \item{ks_df}{A data frame with Ks values for each gene pair
19 | as returned by \code{pairs2kaks()}.}
20 |
21 | \item{min_ks}{Numeric indicating the minimum Ks value to keep.
22 | Default: 0.01.}
23 |
24 | \item{max_ks}{Numeric indicating the maximum Ks value to keep.
25 | Default: 2.}
26 |
27 | \item{bytype}{Logical indicating whether or not to plot the distribution
28 | by type of duplication (requires a column named \code{type}).}
29 |
30 | \item{type_levels}{(Only valid if \strong{bytype} is not NULL) Character
31 | indicating which levels of the variable specified in
32 | parameter \strong{group_by} should be kept. By default, all levels are kept.}
33 |
34 | \item{plot_type}{Character indicating the type of plot to create.
35 | If \strong{bytype = TRUE}, possible types are "histogram" or "violin".
36 | If \strong{bytype = FALSE}, possible types are "histogram", "density",
37 | or "density_histogram". Default: "histogram".}
38 |
39 | \item{binwidth}{(Only valid if \strong{plot_type = "histogram"})
40 | Numeric indicating the bin width. Default: 0.03.}
41 | }
42 | \value{
43 | A ggplot object.
44 | }
45 | \description{
46 | Plot distribution of synonymous substitution rates (Ks)
47 | }
48 | \examples{
49 | data(fungi_kaks)
50 | ks_df <- fungi_kaks$saccharomyces_cerevisiae
51 |
52 | # Plot distro
53 | plot_ks_distro(ks_df, bytype = TRUE)
54 | }
55 |
--------------------------------------------------------------------------------
/man/pairs2kaks.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/ka_ks_analyses.R
3 | \name{pairs2kaks}
4 | \alias{pairs2kaks}
5 | \title{Calculate Ka, Ks, and Ka/Ks from duplicate gene pairs}
6 | \usage{
7 | pairs2kaks(gene_pairs_list, cds, model = "MYN", threads = 1, verbose = FALSE)
8 | }
9 | \arguments{
10 | \item{gene_pairs_list}{List of data frames containing duplicated gene pairs
11 | as returned by \code{classify_gene_pairs()}.}
12 |
13 | \item{cds}{List of DNAStringSet objects containing the coding sequences
14 | of each gene.}
15 |
16 | \item{model}{Character scalar indicating which codon model to use.
17 | Possible values are "Li", "NG86", "NG", "LWL", "LPB", "MLWL", "MLPB", "GY",
18 | "YN", "MYN", "MS", "MA", "GNG", "GLWL", "GLPB", "GMLWL", "GMLPB", "GYN",
19 | and "GMYN". Default: "MYN".}
20 |
21 | \item{threads}{Numeric indicating the number of threads to use. Default: 1.}
22 |
23 | \item{verbose}{Logical indicating whether progress messages should be
24 | printed on screen. Default: FALSE.}
25 | }
26 | \value{
27 | A list of data frames containing gene pairs and their Ka, Ks,
28 | and Ka/Ks values.
29 | }
30 | \description{
31 | Calculate Ka, Ks, and Ka/Ks from duplicate gene pairs
32 | }
33 | \examples{
34 | data(diamond_intra)
35 | data(diamond_inter)
36 | data(yeast_annot)
37 | data(yeast_seq)
38 | data(cds_scerevisiae)
39 | blast_list <- diamond_intra
40 | blast_inter <- diamond_inter
41 |
42 | pdata <- syntenet::process_input(yeast_seq, yeast_annot)
43 | annot <- pdata$annotation["Scerevisiae"]
44 |
45 | # Binary classification scheme
46 | pairs <- classify_gene_pairs(annot, blast_list)
47 | td_pairs <- pairs[[1]][pairs[[1]]$type == "TD", ]
48 | gene_pairs_list <- list(
49 | Scerevisiae = td_pairs[seq(1, 3, by = 1), ]
50 | )
51 |
52 | cds <- list(Scerevisiae = cds_scerevisiae)
53 |
54 | kaks <- pairs2kaks(gene_pairs_list, cds)
55 |
56 | }
57 |
--------------------------------------------------------------------------------
/man/split_pairs_by_peak.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/ka_ks_analyses.R
3 | \name{split_pairs_by_peak}
4 | \alias{split_pairs_by_peak}
5 | \title{Split gene pairs based on their Ks peaks}
6 | \usage{
7 | split_pairs_by_peak(ks_df, peaks, nsd = 2, binwidth = 0.05)
8 | }
9 | \arguments{
10 | \item{ks_df}{A 3-column data frame with gene pairs in columns 1 and 2,
11 | and Ks values for the gene pair in column 3.}
12 |
13 | \item{peaks}{A list with mean, standard deviation, and amplitude of Ks
14 | peaks as generated by \code{find_ks_peaks}.}
15 |
16 | \item{nsd}{Numeric with the number of standard deviations to consider
17 | for each peak.}
18 |
19 | \item{binwidth}{Numeric scalar with binwidth for the histogram.
20 | Default: 0.05.}
21 | }
22 | \value{
23 | A list with the following elements:
24 | \describe{
25 | \item{pairs}{A 4-column data frame with the variables
26 | \strong{dup1} (character), \strong{dup2} (character),
27 | \strong{ks} (numeric), and \strong{peak} (numeric),
28 | representing duplicate gene pair, Ks values, and peak ID,
29 | respectively.}
30 | \item{plot}{A ggplot object with Ks peaks as returned by
31 | \code{plot_ks_peaks}, but with dashed red lines indicating
32 | boundaries for each peak.}
33 | }
34 | }
35 | \description{
36 | The purpose of this function is to classify gene pairs by age when there
37 | are 2+ Ks peaks. This way, newer gene pairs are found within a
38 | certain number of standard deviations from the highest peak,
39 | and older genes are found close within smaller peaks.
40 | }
41 | \examples{
42 | data(fungi_kaks)
43 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
44 |
45 | # Create a data frame of duplicate pairs and Ks values
46 | ks_df <- scerevisiae_kaks[, c("dup1", "dup2", "Ks")]
47 |
48 | # Create list of peaks
49 | peaks <- find_ks_peaks(ks_df$Ks, npeaks = 2)
50 |
51 | # Split pairs
52 | spairs <- split_pairs_by_peak(ks_df, peaks)
53 | }
54 |
--------------------------------------------------------------------------------
/man/get_tandem_proximal.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/utils_duplicate_classification.R
3 | \name{get_tandem_proximal}
4 | \alias{get_tandem_proximal}
5 | \title{Classify gene pairs derived from tandem and proximal duplications}
6 | \usage{
7 | get_tandem_proximal(pairs = NULL, annotation_granges = NULL, proximal_max = 10)
8 | }
9 | \arguments{
10 | \item{pairs}{A 3-column data frame with columns \strong{dup1}, \strong{dup2},
11 | and \strong{type} indicating duplicated gene 1, duplicated gene 2, and
12 | the mode of duplication associated with the pair. This data frame
13 | is returned by \code{get_segmental()}.}
14 |
15 | \item{annotation_granges}{A processed GRanges object as in each element
16 | of the list returned by \code{syntenet::process_input()}.}
17 |
18 | \item{proximal_max}{Numeric scalar with the maximum distance (in number
19 | of genes) between two genes to consider them as proximal duplicates.
20 | Default: 10.}
21 | }
22 | \value{
23 | A 3-column data frame with the variables:
24 | \describe{
25 | \item{dup1}{Character, duplicated gene 1.}
26 | \item{dup2}{Character, duplicated gene 2.}
27 | \item{type}{Factor of duplication types, with levels
28 | "SD" (segmental duplication),
29 | "TD" (tandem duplication),
30 | "PD" (proximal duplication), and
31 | "DD" (dispersed duplication).}
32 | }
33 | }
34 | \description{
35 | Classify gene pairs derived from tandem and proximal duplications
36 | }
37 | \examples{
38 | data(yeast_annot)
39 | data(yeast_seq)
40 | data(fungi_kaks)
41 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
42 |
43 | # Get processed annotation for S. cerevisiae
44 | pdata <- annotation <- syntenet::process_input(yeast_seq, yeast_annot)
45 | annot <- pdata$annotation[[1]]
46 |
47 | # Get duplicated pairs
48 | pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")]
49 | pairs$dup1 <- paste0("Sce_", pairs$dup1)
50 | pairs$dup2 <- paste0("Sce_", pairs$dup2)
51 |
52 | # Get tandem and proximal duplicates
53 | td_pd_pairs <- get_tandem_proximal(pairs, annot)
54 |
55 | }
56 |
--------------------------------------------------------------------------------
/man/find_ks_peaks.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/ka_ks_analyses.R
3 | \name{find_ks_peaks}
4 | \alias{find_ks_peaks}
5 | \title{Find peaks in a Ks distribution with Gaussian Mixture Models}
6 | \usage{
7 | find_ks_peaks(ks, npeaks = 2, min_ks = 0.01, max_ks = 4, verbose = FALSE)
8 | }
9 | \arguments{
10 | \item{ks}{A numeric vector of Ks values.}
11 |
12 | \item{npeaks}{Numeric scalar indicating the number of peaks in
13 | the Ks distribution. If you don't know how many peaks there are,
14 | you can include a range of values, and the number of peaks that produces
15 | the lowest BIC (Bayesian Information Criterion) will be selected as the
16 | optimal. Default: 2.}
17 |
18 | \item{min_ks}{Numeric scalar with the minimum Ks value. Removing
19 | very small Ks values is generally used to avoid the incorporation of allelic
20 | and/or splice variants and to prevent the fitting of a component to infinity.
21 | Default: 0.01.}
22 |
23 | \item{max_ks}{Numeric scalar indicating the maximum Ks value. Removing
24 | very large Ks values is usually performed to account for Ks saturation.
25 | Default: 4.}
26 |
27 | \item{verbose}{Logical indicating if messages should be printed on screen.
28 | Default: FALSE.}
29 | }
30 | \value{
31 | A list with the following elements:
32 | \describe{
33 | \item{mean}{Numeric with the estimated means.}
34 | \item{sd}{Numeric with the estimated standard deviations.}
35 | \item{lambda}{Numeric with the estimated mixture weights.}
36 | \item{ks}{Numeric vector of filtered Ks distribution based on
37 | arguments passed to min_ks and max_ks.}
38 | }
39 | }
40 | \description{
41 | Find peaks in a Ks distribution with Gaussian Mixture Models
42 | }
43 | \examples{
44 | data(fungi_kaks)
45 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
46 | ks <- scerevisiae_kaks$Ks
47 |
48 | # Find 2 peaks in Ks distribution
49 | peaks <- find_ks_peaks(ks, npeaks = 2)
50 |
51 | # From 2 to 4 peaks, verbose = TRUE to show BIC values
52 | peaks <- find_ks_peaks(ks, npeaks = c(2, 3, 4), verbose = TRUE)
53 | }
54 |
--------------------------------------------------------------------------------
/man/get_anchors_list.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/utils.R
3 | \name{get_anchors_list}
4 | \alias{get_anchors_list}
5 | \title{Get a list of anchor pairs for each species}
6 | \usage{
7 | get_anchors_list(
8 | blast_list = NULL,
9 | annotation = NULL,
10 | evalue = 1e-10,
11 | anchors = 5,
12 | max_gaps = 25,
13 | collinearity_dir = NULL
14 | )
15 | }
16 | \arguments{
17 | \item{blast_list}{A list of data frames containing BLAST tabular output
18 | for intraspecies comparisons.
19 | Each list element corresponds to the BLAST output for a given species,
20 | and names of list elements must match the names of list elements in
21 | \code{annotation}. BLASTp, DIAMOND or simular programs must be run on processed
22 | sequence data as returned by \code{process_input()}.}
23 |
24 | \item{annotation}{A processed GRangesList or CompressedGRangesList object as
25 | returned by \code{syntenet::process_input()}.}
26 |
27 | \item{evalue}{Numeric scalar indicating the E-value threshold.
28 | Default: 1e-10.}
29 |
30 | \item{anchors}{Numeric indicating the minimum required number of genes
31 | to call a syntenic block, as in \code{syntenet::infer_syntenet}.
32 | Default: 5.}
33 |
34 | \item{max_gaps}{Numeric indicating the number of upstream and downstream
35 | genes to search for anchors, as in \code{syntenet::infer_syntenet}.
36 | Default: 25.}
37 |
38 | \item{collinearity_dir}{Character indicating the path to the directory
39 | where .collinearity files will be stored. If NULL, files will
40 | be stored in a subdirectory of \code{tempdir()}. Default: NULL.}
41 | }
42 | \value{
43 | A list of data frames representing intraspecies anchor pairs.
44 | }
45 | \description{
46 | Get a list of anchor pairs for each species
47 | }
48 | \examples{
49 | data(diamond_intra)
50 | data(yeast_annot)
51 | data(yeast_seq)
52 | blast_list <- diamond_intra
53 |
54 | # Get processed annotation for S. cerevisiae
55 | annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation
56 |
57 | # Get list of intraspecies anchor pairs
58 | anchorpairs <- get_anchors_list(blast_list, annotation)
59 | }
60 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/issue_template.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report or feature request
3 | about: Describe a bug you've seen or make a case for a new feature
4 | title: "[BUG] Your bug or feature request"
5 | labels: ''
6 | assignees: ''
7 | ---
8 |
9 | Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on using the appropriate tag(s) including one for this package.
10 |
11 | ## Context
12 |
13 | Provide some context for your bug report or feature request. This could be the:
14 |
15 | * link to raw code, example: https://github.com/lcolladotor/osca_LIIGH_UNAM_2020/blob/master/00-template.Rmd#L24-L28
16 | * link to a commit, example: https://github.com/lcolladotor/osca_LIIGH_UNAM_2020/commit/6aa30b22eda614d932c12997ba611ba582c435d7
17 | * link to a line of code inside a commit, example: https://github.com/lcolladotor/osca_LIIGH_UNAM_2020/commit/6aa30b22eda614d932c12997ba611ba582c435d7#diff-e265269fe4f17929940e81341b92b116R17
18 | * link to code from an R package, example: https://github.com/LieberInstitute/spatialLIBD/blob/master/R/run_app.R#L51-L55
19 |
20 | ## Code
21 |
22 | Include the code you ran and comments
23 |
24 | ```R
25 | ## prompt an error
26 | stop('hola')
27 |
28 | ## check the error trace
29 | traceback()
30 | ```
31 |
32 | ## Small reproducible example
33 |
34 | If you copy the lines of code that lead to your error, you can then run [`reprex::reprex()`](https://reprex.tidyverse.org/reference/reprex.html) which will create a small website with code you can then easily copy-paste here in a way that will be easy to work with later on.
35 |
36 | ```R
37 | ## prompt an error
38 | stop('hola')
39 | #> Error in eval(expr, envir, enclos): hola
40 |
41 | ## check the error trace
42 | traceback()
43 | #> No traceback available
44 | ```
45 |
46 |
47 | ## R session information
48 |
49 | Remember to include your full R session information.
50 |
51 | ```R
52 | options(width = 120)
53 | sessioninfo::session_info()
54 | ```
55 |
56 | The output of `sessioninfo::session_info()` includes relevant GitHub installation information and other details that are missed by `sessionInfo()`.
57 |
--------------------------------------------------------------------------------
/man/get_transposed_classes.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/utils_duplicate_classification.R
3 | \name{get_transposed_classes}
4 | \alias{get_transposed_classes}
5 | \title{Classify TRD genes as derived from either DNA transposons or retrotransposons}
6 | \usage{
7 | get_transposed_classes(pairs, intron_counts)
8 | }
9 | \arguments{
10 | \item{pairs}{A 3-column data frame with columns \strong{dup1}, \strong{dup2},
11 | and \strong{type} indicating duplicated gene 1, duplicated gene 2, and
12 | the mode of duplication associated with the pair. This data frame
13 | is returned by \code{get_transposed()}.}
14 |
15 | \item{intron_counts}{A 2-column data frame with columns \strong{gene}
16 | and \strong{introns} indicating the number of introns for each gene,
17 | as returned by \code{get_intron_counts}.}
18 | }
19 | \value{
20 | A 3-column data frame with the following variables:
21 | \describe{
22 | \item{dup1}{Character, duplicated gene 1.}
23 | \item{dup2}{Character, duplicated gene 2.}
24 | \item{type}{Factor of duplication types, with levels
25 | "SD" (segmental duplication),
26 | "TD" (tandem duplication),
27 | "PD" (proximal duplication),
28 | "dTRD" (DNA transposon-derived duplication),
29 | "rTRD" (retrotransposon-derived duplication), and
30 | "DD" (dispersed duplication).}
31 | }
32 | }
33 | \description{
34 | Classify TRD genes as derived from either DNA transposons or retrotransposons
35 | }
36 | \examples{
37 | data(diamond_inter)
38 | data(diamond_intra)
39 | data(yeast_seq)
40 | data(yeast_annot)
41 | data(fungi_kaks)
42 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
43 |
44 | # Get processed annotation
45 | pdata <- syntenet::process_input(yeast_seq, yeast_annot)
46 | annotation <- pdata$annotation
47 |
48 | # Get duplicated pairs
49 | pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")]
50 | pairs$dup1 <- paste0("Sce_", pairs$dup1)
51 | pairs$dup2 <- paste0("Sce_", pairs$dup2)
52 |
53 | # Classify pairs
54 | trd <- get_transposed(pairs, diamond_inter, annotation)
55 |
56 | # Create TxDb object from GRanges
57 | library(txdbmaker)
58 | txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]])
59 |
60 | # Get intron counts
61 | intron_counts <- get_intron_counts(txdb)
62 |
63 | # Get TRD classes
64 | trd_classes <- get_transposed_classes(trd, intron_counts)
65 |
66 | }
67 |
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: doubletrouble
2 | Title: Identification and classification of duplicated genes
3 | Version: 1.9.1
4 | Date: 2024-03-19
5 | Authors@R:
6 | c(
7 | person(given = "Fabrício",
8 | family = "Almeida-Silva",
9 | role = c("aut", "cre"),
10 | email = "fabricio_almeidasilva@hotmail.com",
11 | comment = c(ORCID = "0000-0002-5314-2964")),
12 | person(given = "Yves",
13 | family = "Van de Peer",
14 | role = "aut",
15 | email = "yves.vandepeer@psb.vib-ugent.be",
16 | comment = c(ORCID = "0000-0003-4327-3730"))
17 | )
18 | Description: doubletrouble aims to identify duplicated genes from
19 | whole-genome protein sequences and classify them based on their modes
20 | of duplication. The duplication modes are i. segmental duplication (SD);
21 | ii. tandem duplication (TD);
22 | iii. proximal duplication (PD);
23 | iv. transposed duplication (TRD) and;
24 | v. dispersed duplication (DD).
25 | Transposon-derived duplicates (TRD) can be further subdivided into
26 | rTRD (retrotransposon-derived duplication) and
27 | dTRD (DNA transposon-derived duplication).
28 | If users want a simpler classification scheme, duplicates can also be
29 | classified into SD- and SSD-derived (small-scale duplication) gene pairs.
30 | Besides classifying gene pairs, users can also classify genes, so that
31 | each gene is assigned a unique mode of duplication.
32 | Users can also calculate substitution rates per substitution site (i.e., Ka
33 | and Ks) from duplicate pairs, find peaks in Ks distributions with Gaussian
34 | Mixture Models (GMMs), and classify gene pairs into age groups based on Ks
35 | peaks.
36 | License: GPL-3
37 | URL: https://github.com/almeidasilvaf/doubletrouble
38 | BugReports: https://support.bioconductor.org/t/doubletrouble
39 | biocViews:
40 | Software,
41 | WholeGenome,
42 | ComparativeGenomics,
43 | FunctionalGenomics,
44 | Phylogenetics,
45 | Network,
46 | Classification
47 | Encoding: UTF-8
48 | LazyData: false
49 | Roxygen: list(markdown = TRUE)
50 | RoxygenNote: 7.3.2
51 | Imports:
52 | syntenet,
53 | GenomicRanges,
54 | Biostrings,
55 | mclust,
56 | MSA2dist (>= 1.1.5),
57 | ggplot2,
58 | rlang,
59 | stats,
60 | utils,
61 | AnnotationDbi,
62 | GenomicFeatures
63 | Depends:
64 | R (>= 4.2.0)
65 | Suggests:
66 | txdbmaker,
67 | testthat (>= 3.0.0),
68 | knitr,
69 | feature,
70 | patchwork,
71 | BiocStyle,
72 | rmarkdown,
73 | covr,
74 | sessioninfo
75 | Config/testthat/edition: 3
76 | VignetteBuilder: knitr
77 |
--------------------------------------------------------------------------------
/tests/testthat/test-visualization.R:
--------------------------------------------------------------------------------
1 |
2 | # Load data ----
3 | data(fungi_kaks)
4 |
5 | # Start tests ----
6 | test_that("duplicates2counts() returns counts", {
7 |
8 | duplicate_list <- classify_genes(fungi_kaks)
9 |
10 | # Get count table
11 | d1 <- duplicates2counts(duplicate_list, shape = "wide")
12 | d2 <- duplicates2counts(duplicate_list, shape = "long")
13 |
14 | expect_true(is.data.frame(d1))
15 | expect_true(is.data.frame(d2))
16 |
17 | expect_error(duplicates2counts(duplicate_list, shape = "error"))
18 | })
19 |
20 |
21 | test_that("plot_duplicate_freqs() returns a ggplot object", {
22 |
23 | duplicate_list <- classify_genes(fungi_kaks)
24 |
25 | # Get count table
26 | dup_counts <- duplicates2counts(duplicate_list)
27 |
28 | # Plot
29 | p1 <- plot_duplicate_freqs(dup_counts, plot_type = "facet")
30 | p2 <- plot_duplicate_freqs(dup_counts, plot_type = "stack")
31 | p3 <- plot_duplicate_freqs(dup_counts, plot_type = "stack_percent")
32 |
33 | expect_true("ggplot" %in% class(p1))
34 | expect_true("ggplot" %in% class(p2))
35 | expect_true("ggplot" %in% class(p3))
36 |
37 | expect_error(plot_duplicate_freqs(dup_counts, plot_type = "error"))
38 | })
39 |
40 |
41 | test_that("plot_ks_distro() returns a ggplot object", {
42 |
43 | df <- fungi_kaks$saccharomyces_cerevisiae
44 |
45 | p1 <- plot_ks_distro(df, bytype = TRUE)
46 | p2 <- plot_ks_distro(
47 | df, bytype = TRUE, plot_type = "violin", type_levels = c("SD", "All")
48 | )
49 | p3 <- plot_ks_distro(df, bytype = FALSE, plot_type = "histogram")
50 | p4 <- plot_ks_distro(df, bytype = FALSE, plot_type = "density")
51 | p5 <- plot_ks_distro(df, bytype = FALSE, plot_type = "density_histogram")
52 |
53 | expect_true("ggplot" %in% class(p1))
54 | expect_true("ggplot" %in% class(p2))
55 | expect_true("ggplot" %in% class(p3))
56 | expect_true("ggplot" %in% class(p4))
57 | expect_true("ggplot" %in% class(p5))
58 |
59 | expect_error(plot_ks_distro(df, bytype = TRUE, plot_type = "error"))
60 | expect_error(plot_ks_distro(df, bytype = FALSE, plot_type = "error"))
61 | })
62 |
63 |
64 | test_that("plot_rates_by_species() returns a ggplot object", {
65 |
66 | p1 <- plot_rates_by_species(fungi_kaks, rate_column = "Ka_Ks")
67 | p2 <- plot_rates_by_species(fungi_kaks, rate_column = "Ks", bytype = TRUE)
68 | p3 <- plot_rates_by_species(fungi_kaks, rate_column = "Ka")
69 |
70 | expect_true("ggplot" %in% class(p1))
71 | expect_true("ggplot" %in% class(p2))
72 | expect_true("ggplot" %in% class(p3))
73 |
74 | expect_error(plot_rates_by_species(fungi_kaks, rate_column = "Kw"))
75 | })
76 |
77 |
78 |
79 |
--------------------------------------------------------------------------------
/.github/SUPPORT.md:
--------------------------------------------------------------------------------
1 | # Getting help with `doubletrouble`
2 |
3 | Thanks for using `doubletrouble`!
4 | Before filing an issue, there are a few places to explore and pieces to put together to make the process as smooth as possible.
5 |
6 | ## Make a reprex
7 |
8 | Start by making a minimal **repr**oducible **ex**ample using the [reprex](https://reprex.tidyverse.org/) package.
9 | If you haven't heard of or used reprex before, you're in for a treat!
10 | Seriously, reprex will make all of your R-question-asking endeavors easier (which is a pretty insane ROI for the five to ten minutes it'll take you to learn what it's all about).
11 | For additional reprex pointers, check out the [Get help!](https://www.tidyverse.org/help/) section of the tidyverse site.
12 |
13 | ## Where to ask?
14 |
15 | Armed with your reprex, the next step is to figure out [where to ask](https://www.tidyverse.org/help/#where-to-ask). See also the [Bioconductor help](http://bioconductor.org/help/) website.
16 |
17 | * If it's a question: start with [community.rstudio.com](https://community.rstudio.com/), and/or StackOverflow. If this a Bioconductor-related question, please ask it at the [Bioconductor Support Website](https://support.bioconductor.org/) using the [appropriate package tag](https://support.bioconductor.org/t/doubletrouble) (the website will send an automatic email to the package authors). There are more people there to answer questions.
18 |
19 | * If it's a bug: you're in the right place, [file an issue](https://github.com/almeidasilvaf/doubletrouble/issues/new).
20 |
21 | * If you're not sure: let the community help you figure it out!
22 | If your problem _is_ a bug or a feature request, you can easily return here and report it.
23 |
24 | Before opening a new issue, be sure to [search issues and pull requests](https://github.com/almeidasilvaf/doubletrouble/issues) to make sure the bug hasn't been reported and/or already fixed in the development version.
25 | By default, the search will be pre-populated with `is:issue is:open`.
26 | You can [edit the qualifiers](https://help.github.com/articles/searching-issues-and-pull-requests/) (e.g. `is:pr`, `is:closed`) as needed.
27 | For example, you'd simply remove `is:open` to search _all_ issues in the repo, open or closed.
28 |
29 | ## What happens next?
30 |
31 | To be as efficient as possible, development of tidyverse packages tends to be very bursty, so you shouldn't worry if you don't get an immediate response.
32 | Typically we don't look at a repo until a sufficient quantity of issues accumulates, then there’s a burst of intense activity as we focus our efforts.
33 | That makes development more efficient because it avoids expensive context switching between problems, at the cost of taking longer to get back to you.
34 | This process makes a good reprex particularly important because it might be multiple months between your initial report and when we start working on it.
35 | If we can’t reproduce the bug, we can’t fix it!
36 |
--------------------------------------------------------------------------------
/.github/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing to doubletrouble
2 |
3 | This outlines how to propose a change to doubletrouble.
4 | For more detailed info about contributing to this, and other tidyverse packages, please see the
5 | [**development contributing guide**](https://rstd.io/tidy-contrib).
6 |
7 | ## Fixing typos
8 |
9 | You can fix typos, spelling mistakes, or grammatical errors in the documentation directly using the GitHub web interface, as long as the changes are made in the _source_ file.
10 | This generally means you'll need to edit [roxygen2 comments](https://roxygen2.r-lib.org/articles/roxygen2.html) in an `.R`, not a `.Rd` file.
11 | You can find the `.R` file that generates the `.Rd` by reading the comment in the first line.
12 |
13 | ## Bigger changes
14 |
15 | If you want to make a bigger change, it's a good idea to first file an issue and make sure someone from the team agrees that it’s needed.
16 | If you’ve found a bug, please file an issue that illustrates the bug with a minimal
17 | [reprex](https://www.tidyverse.org/help/#reprex) (this will also help you write a unit test, if needed).
18 |
19 | ### Pull request process
20 |
21 | * Fork the package and clone onto your computer. If you haven't done this before, we recommend using `usethis::create_from_github("almeidasilvaf/doubletrouble", fork = TRUE)`.
22 |
23 | * Install all development dependencies with `devtools::install_dev_deps()`, and then make sure the package passes R CMD check by running `devtools::check()`.
24 | If R CMD check doesn't pass cleanly, it's a good idea to ask for help before continuing.
25 | * Create a Git branch for your pull request (PR). We recommend using `usethis::pr_init("brief-description-of-change")`.
26 |
27 | * Make your changes, commit to git, and then create a PR by running `usethis::pr_push()`, and following the prompts in your browser.
28 | The title of your PR should briefly describe the change.
29 | The body of your PR should contain `Fixes #issue-number`.
30 |
31 | * For user-facing changes, add a bullet to the top of `NEWS.md` (i.e. just below the first header). Follow the style described in .
32 |
33 | ### Code style
34 |
35 | * New code should follow the tidyverse [style guide](https://style.tidyverse.org).
36 | You can use the [styler](https://CRAN.R-project.org/package=styler) package to apply these styles, but please don't restyle code that has nothing to do with your PR.
37 |
38 | * We use [roxygen2](https://cran.r-project.org/package=roxygen2), with [Markdown syntax](https://cran.r-project.org/web/packages/roxygen2/vignettes/rd-formatting.html), for documentation.
39 |
40 | * We use [testthat](https://cran.r-project.org/package=testthat) for unit tests.
41 | Contributions with test cases included are easier to accept.
42 |
43 | ## Code of Conduct
44 |
45 | Please note that the doubletrouble project is released with a
46 | [Contributor Code of Conduct](CODE_OF_CONDUCT.md). By contributing to this
47 | project you agree to abide by its terms.
48 |
--------------------------------------------------------------------------------
/tests/testthat/test-ka_ks_analyses.R:
--------------------------------------------------------------------------------
1 |
2 | #----Load data------------------------------------------------------------------
3 | data(fungi_kaks)
4 | data(diamond_intra)
5 | data(diamond_inter)
6 | data(yeast_annot)
7 | data(yeast_seq)
8 | data(cds_scerevisiae)
9 | blast_list <- diamond_intra
10 | blast_inter <- diamond_inter
11 |
12 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
13 |
14 | pdata <- syntenet::process_input(yeast_seq, yeast_annot)
15 | annot <- pdata$annotation["Scerevisiae"]
16 |
17 | pairs <- classify_gene_pairs(annot, blast_list)
18 | td_pairs <- pairs[[1]][pairs[[1]]$type == "TD", ]
19 | gene_pairs_list <- list(
20 | Scerevisiae = td_pairs[seq(1, 3, by = 1), ]
21 | )
22 |
23 | cds <- list(Scerevisiae = cds_scerevisiae)
24 |
25 | ks <- scerevisiae_kaks$Ks
26 |
27 | ## Simulate gene with CDS that is not a multiple of 3
28 | cds2 <- cds
29 | cds2$Scerevisiae$Q0055 <- Biostrings::subseq(
30 | cds2$Scerevisiae$Q0055, 1, length(cds2$Scerevisiae$Q0055) - 1
31 | )
32 |
33 | #----Start tests----------------------------------------------------------------
34 | test_that("pairs2kaks() returns a data frame with Ka, Ks, and Ka/Ks", {
35 |
36 | kaks <- pairs2kaks(gene_pairs_list, cds, verbose = TRUE)
37 | kaks2 <- pairs2kaks(gene_pairs_list, cds2)
38 |
39 | expect_equal(class(kaks), "list")
40 | expect_equal(class(kaks[[1]]), "data.frame")
41 | expect_equal(nrow(kaks[[1]]), nrow(gene_pairs_list[[1]]))
42 | expect_true("Ks" %in% names(kaks[[1]]))
43 | expect_true("Ka" %in% names(kaks[[1]]))
44 | expect_true("Ka_Ks" %in% names(kaks[[1]]))
45 | expect_equal(class(kaks2), "list")
46 | })
47 |
48 |
49 | test_that("find_ks_peaks() returns a list of mean, sd, amplitudes, ks vals", {
50 |
51 | expect_message(
52 | peaks <- find_ks_peaks(ks, npeaks = 1:2, verbose = TRUE)
53 | )
54 |
55 | expect_equal(class(peaks), "list")
56 | expect_equal(names(peaks), c("mean", "sd", "lambda", "ks"))
57 | expect_equal(length(peaks$mean), 2)
58 | })
59 |
60 |
61 | test_that("plot_ks_peaks() returns a ggplot object", {
62 |
63 | peaks <- list(
64 | mean = c(0.118717925754829, 0.534196999662316),
65 | sd = c(0.054568151633283, 0.227909257694474),
66 | lambda = c(0.49001230165837, 0.509987698341629)
67 | )
68 |
69 | peaks_plot <- plot_ks_peaks(peaks, binwidth = 0.05)
70 | expect_true("ggplot" %in% class(peaks_plot))
71 |
72 | })
73 |
74 | test_that("find_intersect_mixtures() returns a numeric scalar", {
75 |
76 | peaks <- find_ks_peaks(ks, npeaks = 2)
77 | peak1 <- find_ks_peaks(ks, npeaks = 1)
78 | inters <- find_intersect_mixtures(peaks)
79 |
80 | expect_error(
81 | find_intersect_mixtures(peak1)
82 | )
83 |
84 | expect_equal(class(inters), "numeric")
85 | expect_equal(length(inters), 1)
86 | })
87 |
88 | test_that("split_pairs_by_peak() returns a list", {
89 |
90 | ks_df <- scerevisiae_kaks[, c("dup1", "dup2", "Ks")]
91 | peaks <- find_ks_peaks(ks_df$Ks, npeaks = 2)
92 | peak1 <- find_ks_peaks(ks, npeaks = 1)
93 |
94 | spairs <- split_pairs_by_peak(ks_df, peaks)
95 | spairs1 <- split_pairs_by_peak(ks_df, peak1)
96 |
97 | expect_equal(class(spairs1), "list")
98 | expect_equal(class(spairs), "list")
99 | expect_equal(names(spairs), c("pairs", "plot"))
100 | expect_equal(class(spairs$pairs), "data.frame")
101 | expect_equal(ncol(spairs$pairs), 4)
102 | expect_true("ggplot" %in% class(spairs$plot))
103 |
104 | })
105 |
--------------------------------------------------------------------------------
/R/data.R:
--------------------------------------------------------------------------------
1 |
2 | #' Protein sequences of the yeast species S. cerevisiae and C. glabrata
3 | #'
4 | #' Data obtained from Ensembl Fungi. Only translated sequences of primary
5 | #' transcripts were included.
6 | #'
7 | #' @name yeast_seq
8 | #' @format A list of AAStringSet objects with the elements
9 | #' \strong{Scerevisiae} and \strong{Cglabrata}.
10 | #' @examples
11 | #' data(yeast_seq)
12 | #' @usage data(yeast_seq)
13 | "yeast_seq"
14 |
15 |
16 | #' Genome annotation of the yeast species S. cerevisiae and C. glabrata
17 | #'
18 | #' Data obtained from Ensembl Fungi. Only annotation data protein-coding
19 | #' genes (with associated mRNA, exons, CDS, etc) are included.
20 | #'
21 | #' @name yeast_annot
22 | #' @format A CompressedGRangesList containing
23 | #' the elements \strong{Scerevisiae} and \strong{Cglabrata}.
24 | #' @examples
25 | #' data(yeast_annot)
26 | #' @usage data(yeast_annot)
27 | "yeast_annot"
28 |
29 |
30 | #' Intraspecies DIAMOND output for S. cerevisiae
31 | #'
32 | #' List obtained with \code{run_diamond()}.
33 | #'
34 | #' @name diamond_intra
35 | #' @format A list of data frames (length 1) containing the whole paranome of
36 | #' S. cerevisiae resulting from intragenome similarity searches.
37 | #' @examples
38 | #' data(diamond_intra)
39 | #' @usage data(diamond_intra)
40 | "diamond_intra"
41 |
42 |
43 | #' Interspecies DIAMOND output for yeast species
44 | #'
45 | #' This list contains a similarity search of S. cerevisiae against
46 | #' C. glabrata, and it was obtained with \code{run_diamond()}.
47 | #'
48 | #' @name diamond_inter
49 | #' @format A list of data frames (length 1) containing the output of a
50 | #' DIAMOND search of S. cerevisiae against C. glabrata (outgroup).
51 | #' @examples
52 | #' data(diamond_inter)
53 | #' @usage data(diamond_inter)
54 | "diamond_inter"
55 |
56 |
57 | #' Coding sequences (CDS) of S. cerevisiae
58 | #'
59 | #' Data were obtained from Ensembl Fungi, and only CDS of primary transcripts
60 | #' were included.
61 | #'
62 | #' @name cds_scerevisiae
63 | #' @format A DNAStringSet object with CDS of S. cerevisiae.
64 | #' @examples
65 | #' data(cds_scerevisiae)
66 | #' @usage data(cds_scerevisiae)
67 | "cds_scerevisiae"
68 |
69 |
70 | #' Duplicate pairs and Ka, Ks, and Ka/Ks values for fungi species
71 | #'
72 | #' This data set was obtained with \code{classify_gene_pairs()} followed
73 | #' by \code{pairs2kaks()}.
74 | #'
75 | #' @name fungi_kaks
76 | #' @format A list of data frame with elements
77 | #' named \strong{saccharomyces_cerevisiae}, \strong{candida_glabrata},
78 | #' and \strong{schizosaccharomyces_pombe}. Each data frame contains
79 | #' the following variables:
80 | #' \describe{
81 | #' \item{dup1}{Character, duplicated gene 1.}
82 | #' \item{dup2}{Character, duplicated gene 2.}
83 | #' \item{Ka}{Numeric, Ka values.}
84 | #' \item{Ks}{Numeric, Ks values.}
85 | #' \item{Ka_Ks}{Numeric, Ka/Ks values.}
86 | #' \item{type}{Character, mode of duplication}
87 | #' }
88 | #' @examples
89 | #' data(fungi_kaks)
90 | #' @usage data(fungi_kaks)
91 | "fungi_kaks"
92 |
93 |
94 | #' Duplicate pairs and Ks values for Glycine max
95 | #'
96 | #' This data set was obtained with \code{classify_gene_pairs()} followed
97 | #' by \code{pairs2kaks()}.
98 | #'
99 | #' @name gmax_ks
100 | #' @format A data frame with the following variables:
101 | #' \describe{
102 | #' \item{dup1}{Character, duplicated gene 1.}
103 | #' \item{dup2}{Character, duplicated gene 2.}
104 | #' \item{Ks}{Numeric, Ks values.}
105 | #' \item{type}{Factor, duplication mode.}
106 | #' }
107 | #' @examples
108 | #' data(gmax_ks)
109 | #' @usage data(gmax_ks)
110 | "gmax_ks"
111 |
112 |
113 |
--------------------------------------------------------------------------------
/vignettes/bibliography.bib:
--------------------------------------------------------------------------------
1 |
2 | @book{ohno2013evolution,
3 | title={Evolution by gene duplication},
4 | author={Ohno, Susumu},
5 | year={2013},
6 | publisher={Springer Science \& Business Media}
7 | }
8 |
9 | @article{yates2022ensembl,
10 | title={Ensembl Genomes 2022: an expanding genome resource for non-vertebrates},
11 | author={Yates, Andrew D and Allen, James and Amode, Ridwan M and Azov, Andrey G and Barba, Matthieu and Becerra, Andr{\'e}s and Bhai, Jyothish and Campbell, Lahcen I and Carbajo Martinez, Manuel and Chakiachvili, Marc and others},
12 | journal={Nucleic acids research},
13 | volume={50},
14 | number={D1},
15 | pages={D996--D1003},
16 | year={2022},
17 | publisher={Oxford University Press}
18 | }
19 |
20 | @article{wang2010kaks_calculator,
21 | title={KaKs\_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies},
22 | author={Wang, Dapeng and Zhang, Yubin and Zhang, Zhang and Zhu, Jiang and Yu, Jun},
23 | journal={Genomics, proteomics \& bioinformatics},
24 | volume={8},
25 | number={1},
26 | pages={77--80},
27 | year={2010},
28 | publisher={Elsevier}
29 | }
30 |
31 | @article{qiao2019gene,
32 | title={Gene duplication and evolution in recurring polyploidization--diploidization cycles in plants},
33 | author={Qiao, Xin and Li, Qionghou and Yin, Hao and Qi, Kaijie and Li, Leiting and Wang, Runze and Zhang, Shaoling and Paterson, Andrew H},
34 | journal={Genome biology},
35 | volume={20},
36 | number={1},
37 | pages={1--23},
38 | year={2019},
39 | publisher={BioMed Central}
40 | }
41 |
42 |
43 | @article{vanneste2013inference,
44 | title={Inference of genome duplications from age distributions revisited},
45 | author={Vanneste, Kevin and Van de Peer, Yves and Maere, Steven},
46 | journal={Molecular biology and evolution},
47 | volume={30},
48 | number={1},
49 | pages={177--190},
50 | year={2013},
51 | publisher={Oxford University Press}
52 | }
53 |
54 | @article{chaudhuri1999sizer,
55 | title={SiZer for exploration of structures in curves},
56 | author={Chaudhuri, Probal and Marron, James S},
57 | journal={Journal of the American Statistical Association},
58 | volume={94},
59 | number={447},
60 | pages={807--823},
61 | year={1999},
62 | publisher={Taylor \& Francis}
63 | }
64 |
65 | @article{schmutz2010genome,
66 | title={Genome sequence of the palaeopolyploid soybean},
67 | author={Schmutz, Jeremy and Cannon, Steven B and Schlueter, Jessica and Ma, Jianxin and Mitros, Therese and Nelson, William and Hyten, David L and Song, Qijian and Thelen, Jay J and Cheng, Jianlin and others},
68 | journal={nature},
69 | volume={463},
70 | number={7278},
71 | pages={178--183},
72 | year={2010},
73 | publisher={Nature Publishing Group}
74 | }
75 |
76 |
77 | @article{tiley2018assessing,
78 | title={Assessing the performance of Ks plots for detecting ancient whole genome duplications},
79 | author={Tiley, George P and Barker, Michael S and Burleigh, J Gordon},
80 | journal={Genome biology and evolution},
81 | volume={10},
82 | number={11},
83 | pages={2882--2898},
84 | year={2018},
85 | publisher={Oxford University Press}
86 | }
87 |
88 |
89 | @article{buchfink2021sensitive,
90 | title={Sensitive protein alignments at tree-of-life scale using DIAMOND},
91 | author={Buchfink, Benjamin and Reuter, Klaus and Drost, Hajk-Georg},
92 | journal={Nature methods},
93 | volume={18},
94 | number={4},
95 | pages={366--368},
96 | year={2021},
97 | publisher={Nature Publishing Group US New York}
98 | }
99 |
100 |
101 | @article{altschul1997gapped,
102 | title={Gapped BLAST and PSI-BLAST: a new generation of protein database search programs},
103 | author={Altschul, Stephen F and Madden, Thomas L and Sch{\"a}ffer, Alejandro A and Zhang, Jinghui and Zhang, Zheng and Miller, Webb and Lipman, David J},
104 | journal={Nucleic acids research},
105 | volume={25},
106 | number={17},
107 | pages={3389--3402},
108 | year={1997},
109 | publisher={Oxford University Press}
110 | }
111 |
--------------------------------------------------------------------------------
/tests/testthat/test-duplicate_classification.R:
--------------------------------------------------------------------------------
1 | library(txdbmaker) # for makeTxDbFromGRanges()
2 |
3 | #----Load data------------------------------------------------------------------
4 | data(diamond_intra)
5 | data(fungi_kaks)
6 | data(diamond_inter)
7 | data(yeast_annot)
8 | data(yeast_seq)
9 | blast_list <- diamond_intra
10 | blast_inter <- syntenet::collapse_bidirectional_hits(
11 | diamond_inter,
12 | data.frame("Scerevisiae", "Cglabrata")
13 | )
14 |
15 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
16 |
17 | pdata <- syntenet::process_input(yeast_seq, yeast_annot)
18 | annotation <- pdata$annotation
19 | annotation_granges <- pdata$annotation[["Scerevisiae"]]
20 |
21 | ## Get anchor pairs
22 | all <- scerevisiae_kaks
23 | all$dup1 <- paste0("Sce_", all$dup1)
24 | all$dup2 <- paste0("Sce_", all$dup2)
25 | anchor_pairs <- all[all$type == "SD", 1:2]
26 | ssd <- all[all$type != "SD", 1:2]
27 |
28 | ## Get duplicate pairs from DIAMOND output
29 | duplicates <- diamond_intra[[1]][, 1:2]
30 |
31 |
32 | txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]])
33 | intron_counts <- get_intron_counts(txdb)
34 |
35 | ic_list <- list(Scerevisiae = intron_counts)
36 |
37 | #----Start tests----------------------------------------------------------------
38 | test_that("get_* functions returned classified duplicates", {
39 |
40 | # 1) get_anchors_list()
41 | anchorpairs <- get_anchors_list(blast_list, annotation)
42 |
43 | expect_equal(class(anchorpairs), "list")
44 | expect_equal(class(anchorpairs[[1]]), "data.frame")
45 | expect_equal(ncol(anchorpairs[[1]]), 2)
46 |
47 | # 2) get_segmental()
48 | dups <- get_segmental(anchor_pairs, duplicates)
49 | dups2 <- get_segmental(NULL, duplicates)
50 |
51 | expect_equal(class(dups2), "data.frame")
52 | expect_equal(ncol(dups), 3)
53 | expect_equal(names(dups), c("dup1", "dup2", "type"))
54 | expect_equal(length(unique(dups$type)), 2)
55 |
56 | # 3) get_tandem_proximal()
57 | td_pd <- get_tandem_proximal(dups, annotation_granges)
58 |
59 | expect_equal(class(td_pd), "data.frame")
60 | expect_equal(ncol(td_pd), 3)
61 | expect_equal(names(td_pd), c("dup1", "dup2", "type"))
62 | expect_equal(length(unique(td_pd$type)), 4)
63 |
64 | # 4) get_transposed
65 | trd <- get_transposed(td_pd, blast_inter, annotation)
66 |
67 | binter2 <- list(e1 = blast_inter[[1]], e2 = blast_inter[[1]])
68 | expect_error(get_transposed(td_pd, binter2, annotation))
69 |
70 | expect_equal(class(trd), "data.frame")
71 | expect_equal(ncol(trd), 3)
72 | expect_equal(names(trd), c("dup1", "dup2", "type"))
73 | expect_equal(length(unique(trd$type)), 5)
74 |
75 | # 5) get_transposed_classes
76 | trdc <- get_transposed_classes(trd, intron_counts)
77 |
78 | expect_equal(class(trdc), "data.frame")
79 | expect_equal(ncol(trdc), 3)
80 | expect_equal(names(trdc), c("dup1", "dup2", "type"))
81 | expect_equal(length(unique(trdc$type)), 6)
82 | })
83 |
84 |
85 | test_that("classify_gene_pairs() and classify_genes() return a data frame", {
86 |
87 | # 1) classify_gene_pairs()
88 | dup_full <- classify_gene_pairs(
89 | annotation = annotation,
90 | blast_list = diamond_intra,
91 | scheme = "full",
92 | blast_inter = blast_inter,
93 | intron_counts = ic_list
94 | )
95 |
96 | dup_binary <- classify_gene_pairs(
97 | annotation = annotation,
98 | blast_list = diamond_intra,
99 | scheme = "binary"
100 | )
101 |
102 | expect_equal(class(dup_full), "list")
103 | expect_equal(class(dup_full[[1]]), "data.frame")
104 | expect_equal(ncol(dup_full[[1]]), 3)
105 |
106 | expect_equal(class(dup_binary), "list")
107 | expect_equal(class(dup_binary[[1]]), "data.frame")
108 | expect_equal(ncol(dup_binary[[1]]), 3)
109 |
110 | # 2) classify_genes()
111 | dup_genes <- classify_genes(dup_full)
112 |
113 | expect_equal(class(dup_genes[[1]]), "data.frame")
114 | expect_equal(class(dup_genes), "list")
115 | expect_equal(ncol(dup_genes[[1]]), 2)
116 | })
117 |
118 |
--------------------------------------------------------------------------------
/man/get_transposed.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/utils_duplicate_classification.R
3 | \name{get_transposed}
4 | \alias{get_transposed}
5 | \title{Classify gene pairs originating from transposon-derived duplications}
6 | \usage{
7 | get_transposed(
8 | pairs,
9 | blast_inter,
10 | annotation,
11 | evalue = 1e-10,
12 | anchors = 5,
13 | max_gaps = 25,
14 | collinearity_dir = NULL,
15 | outgroup_coverage = 70
16 | )
17 | }
18 | \arguments{
19 | \item{pairs}{A 3-column data frame with columns \strong{dup1}, \strong{dup2},
20 | and \strong{type} indicating duplicated gene 1, duplicated gene 2, and
21 | the mode of duplication associated with the pair. This data frame
22 | is returned by \code{get_tandem_proximal()}.}
23 |
24 | \item{blast_inter}{A list of data frames of length 1
25 | containing BLAST tabular output for the comparison between the target
26 | species and an outgroup. Names of list elements must match the names of
27 | list elements in \code{annotation}. BLASTp, DIAMOND or simular programs must
28 | be run on processed sequence data as returned
29 | by \code{syntenet::process_input()}.}
30 |
31 | \item{annotation}{A processed GRangesList or CompressedGRangesList object as
32 | returned by \code{syntenet::process_input()}.}
33 |
34 | \item{evalue}{Numeric scalar indicating the E-value threshold.
35 | Default: 1e-10.}
36 |
37 | \item{anchors}{Numeric indicating the minimum required number of genes
38 | to call a syntenic block, as in \code{syntenet::infer_syntenet}.
39 | Default: 5.}
40 |
41 | \item{max_gaps}{Numeric indicating the number of upstream and downstream
42 | genes to search for anchors, as in \code{syntenet::infer_syntenet}.
43 | Default: 25.}
44 |
45 | \item{collinearity_dir}{Character indicating the path to the directory
46 | where .collinearity files will be stored. If NULL, files will
47 | be stored in a subdirectory of \code{tempdir()}. Default: NULL.}
48 |
49 | \item{outgroup_coverage}{Numeric indicating the minimum percentage of
50 | outgroup species to use to consider genes as transposed duplicates. Only
51 | valid if multiple outgroup species are present (see details below). Values
52 | should range from 0 to 100. Default: 70.}
53 | }
54 | \value{
55 | A 3-column data frame with the following variables:
56 | \describe{
57 | \item{dup1}{Character, duplicated gene 1.}
58 | \item{dup2}{Character, duplicated gene 2.}
59 | \item{type}{Factor of duplication types, with levels
60 | "SD" (segmental duplication),
61 | "TD" (tandem duplication),
62 | "PD" (proximal duplication),
63 | "TRD" (transposon-derived duplication), and
64 | "DD" (dispersed duplication).}
65 | }
66 | }
67 | \description{
68 | Classify gene pairs originating from transposon-derived duplications
69 | }
70 | \details{
71 | If the list of interspecies DIAMOND tables contain comparisons of the
72 | same species to multiple outgroups (e.g.,
73 | 'speciesA_speciesB', 'speciesA_speciesC'), this function will check if
74 | gene pairs are classified as transposed (i.e.,
75 | only one gene is an ancestral locus) in each of the outgroup species,
76 | and then calculate the percentage of outgroup species in which each pair
77 | is considered 'transposed'. For instance, gene pair 1 is transposed based on
78 | 30\\% of the outgroup species, gene pair is considered as transposed based
79 | on 100\\% of the outgroup species, gene pair 3 is considered as transposed
80 | based on 0\\% of the outgroup species, and so on.
81 | Parameter \strong{outgroup_coverage} lets you choose a minimum percentage
82 | cut-off to classify pairs as transposed.
83 | }
84 | \examples{
85 | # Load example data
86 | data(diamond_inter)
87 | data(yeast_seq)
88 | data(yeast_annot)
89 | data(fungi_kaks)
90 | scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
91 |
92 | # Get processed annotation
93 | pdata <- syntenet::process_input(yeast_seq, yeast_annot)
94 | annotation <- pdata$annotation
95 |
96 | # Get duplicated pairs
97 | pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")]
98 | pairs$dup1 <- paste0("Sce_", pairs$dup1)
99 | pairs$dup2 <- paste0("Sce_", pairs$dup2)
100 |
101 | # Collapse bidirectional hits
102 | compare <- data.frame(target = "Scerevisiae", outgroup = "Cglabrata")
103 | blast_inter <- syntenet::collapse_bidirectional_hits(diamond_inter, compare)
104 |
105 | # Classify pairs
106 | trd <- get_transposed(pairs, blast_inter, annotation)
107 |
108 | }
109 |
--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | output: github_document
3 | ---
4 |
5 |
6 |
7 | ```{r, include = FALSE}
8 | knitr::opts_chunk$set(
9 | collapse = TRUE,
10 | comment = "#>",
11 | fig.path = "man/figures/README-",
12 | out.width = "100%"
13 | )
14 | ```
15 |
16 | # doubletrouble
17 |
18 |
19 | [](https://github.com/almeidasilvaf/doubletrouble/issues)
20 | [](https://lifecycle.r-lib.org/articles/stages.html#stable)
21 | [](https://github.com/almeidasilvaf/doubletrouble/actions)
22 | [](https://codecov.io/gh/almeidasilvaf/doubletrouble?branch=devel)
24 |
25 |
26 | The major goal of __doubletrouble__ is to identify duplicated genes from
27 | whole-genome protein sequences and classify them based on their modes
28 | of duplication. Duplicates can be classified using four different
29 | classification schemes, which increase the complexity and level of details
30 | in a stepwise manner. The classification schemes and the duplication modes
31 | they can classify are:
32 |
33 |
34 | | Scheme | Duplication modes |
35 | |:---------|:----------------------------|
36 | | binary | SD, SSD |
37 | | standard | SD, TD, PD, DD |
38 | | extended | SD, TD, PD, TRD, DD |
39 | | full | SD, TD, PD, rTRD, dTRD, DD |
40 |
41 | *Legend:* **SD**, segmental duplication. **SSD**, small-scale duplication.
42 | **TD**, tandem duplication. **PD**, proximal duplication.
43 | **TRD**, transposon-derived duplication.
44 | **rTRD**, retrotransposon-derived duplication.
45 | **dTRD**, DNA transposon-derived duplication. **DD**, dispersed duplication.
46 |
47 |
48 | Besides classifying gene pairs, users can also classify genes, so that
49 | each gene is assigned to a unique mode of duplication.
50 |
51 | Users can also calculate substitution rates per substitution site (i.e.,
52 | $K_a$, $K_s$ and their ratios $\frac{K_a}{K_s}$) from duplicate pairs,
53 | find peaks in Ks distributions with Gaussian Mixture Models (GMMs),
54 | and classify gene pairs into age groups based on Ks peaks.
55 |
56 | ## Installation instructions
57 |
58 | Get the latest stable `R` release from [CRAN](http://cran.r-project.org/).
59 | Then install __doubletrouble__ from [Bioconductor](http://bioconductor.org/)
60 | using the following code:
61 |
62 | ```{r 'install', eval = FALSE}
63 | if (!requireNamespace("BiocManager", quietly = TRUE)) {
64 | install.packages("BiocManager")
65 | }
66 |
67 | BiocManager::install("doubletrouble")
68 | ```
69 |
70 | And the development version from [GitHub](https://github.com/almeidasilvaf/doubletrouble) with:
71 |
72 | ```{r 'install_dev', eval = FALSE}
73 | BiocManager::install("almeidasilvaf/doubletrouble")
74 | ```
75 |
76 | ## Citation
77 |
78 | Below is the citation output from using `citation('doubletrouble')` in R. Please
79 | run this yourself to check for any updates on how to cite __doubletrouble__.
80 |
81 | ```{r 'citation', eval = requireNamespace('doubletrouble')}
82 | print(citation('doubletrouble'), bibtex = TRUE)
83 | ```
84 |
85 | Please note that the __doubletrouble__ was only made possible thanks to many other R and bioinformatics software authors, which are cited either in the vignettes and/or the paper(s) describing this package.
86 |
87 | ## Code of Conduct
88 |
89 | Please note that the __doubletrouble__ project is released with
90 | a [Contributor Code of Conduct](http://bioconductor.org/about/code-of-conduct/).
91 | By contributing to this project, you agree to abide by its terms.
92 |
93 | ## Development tools
94 |
95 | * Continuous code testing is possible thanks to [GitHub actions](https://www.tidyverse.org/blog/2020/04/usethis-1-6-0/) through `r BiocStyle::CRANpkg('usethis')`, `r BiocStyle::CRANpkg('remotes')`, and `r BiocStyle::CRANpkg('rcmdcheck')` customized to use [Bioconductor's docker containers](https://www.bioconductor.org/help/docker/) and `r BiocStyle::Biocpkg('BiocCheck')`.
96 | * Code coverage assessment is possible thanks to [codecov](https://codecov.io/gh) and `r BiocStyle::CRANpkg('covr')`.
97 | * The [documentation website](http://almeidasilvaf.github.io/doubletrouble) is automatically updated thanks to `r BiocStyle::CRANpkg('pkgdown')`.
98 | * The code is styled automatically thanks to `r BiocStyle::CRANpkg('styler')`.
99 | * The documentation is formatted thanks to `r BiocStyle::CRANpkg('devtools')` and `r BiocStyle::CRANpkg('roxygen2')`.
100 |
101 | For more details, check the `dev` directory.
102 |
103 | This package was developed using `r BiocStyle::Biocpkg('biocthis')`.
104 |
105 |
106 |
--------------------------------------------------------------------------------
/man/classify_gene_pairs.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/duplicate_classification.R
3 | \name{classify_gene_pairs}
4 | \alias{classify_gene_pairs}
5 | \title{Classify duplicate gene pairs based on their modes of duplication}
6 | \usage{
7 | classify_gene_pairs(
8 | annotation = NULL,
9 | blast_list = NULL,
10 | scheme = "standard",
11 | blast_inter = NULL,
12 | intron_counts,
13 | evalue = 1e-10,
14 | anchors = 5,
15 | max_gaps = 25,
16 | proximal_max = 10,
17 | collinearity_dir = NULL,
18 | outgroup_coverage = 70
19 | )
20 | }
21 | \arguments{
22 | \item{annotation}{A processed GRangesList or CompressedGRangesList object as
23 | returned by \code{syntenet::process_input()}.}
24 |
25 | \item{blast_list}{A list of data frames containing BLAST tabular output
26 | for intraspecies comparisons.
27 | Each list element corresponds to the BLAST output for a given species,
28 | and names of list elements must match the names of list elements in
29 | \strong{annotation}. BLASTp, DIAMOND or simular programs must be run
30 | on processed sequence data as returned by \code{process_input()}.}
31 |
32 | \item{scheme}{Character indicating which classification scheme to use.
33 | One of "binary", "standard", "extended", or "full". See details below
34 | for information on what each scheme means. Default: "standard".}
35 |
36 | \item{blast_inter}{(Only valid if \code{scheme == "extended" or "full"}).
37 | A list of data frames containing BLAST tabular output
38 | for the comparison between target species and outgroups.
39 | Names of list elements must match the names of
40 | list elements in \code{annotation}. BLASTp, DIAMOND or simular programs must
41 | be run on processed sequence data as returned by \code{process_input()}.}
42 |
43 | \item{intron_counts}{(Only valid if \code{scheme == "full"}).
44 | A list of 2-column data frames with the number of
45 | introns per gene as returned by \code{get_intron_counts()}. Names
46 | of list elements must match names of \strong{annotation}.}
47 |
48 | \item{evalue}{Numeric scalar indicating the E-value threshold.
49 | Default: 1e-10.}
50 |
51 | \item{anchors}{Numeric indicating the minimum required number of genes
52 | to call a syntenic block, as in \code{syntenet::infer_syntenet}.
53 | Default: 5.}
54 |
55 | \item{max_gaps}{Numeric indicating the number of upstream and downstream
56 | genes to search for anchors, as in \code{syntenet::infer_syntenet}.
57 | Default: 25.}
58 |
59 | \item{proximal_max}{Numeric scalar with the maximum distance (in number
60 | of genes) between two genes to consider them as proximal duplicates.
61 | Default: 10.}
62 |
63 | \item{collinearity_dir}{Character indicating the path to the directory
64 | where .collinearity files will be stored. If NULL, files will
65 | be stored in a subdirectory of \code{tempdir()}. Default: NULL.}
66 |
67 | \item{outgroup_coverage}{Numeric indicating the minimum percentage of
68 | outgroup species to use to consider genes as transposed duplicates. Only
69 | valid if multiple outgroup species are present (see details below). Values
70 | should range from 0 to 100. Default: 70.}
71 | }
72 | \value{
73 | A list of 3-column data frames of duplicated gene pairs
74 | (columns 1 and 2), and their modes of duplication (column 3).
75 | }
76 | \description{
77 | Classify duplicate gene pairs based on their modes of duplication
78 | }
79 | \details{
80 | The classification schemes increase in complexity (number of classes)
81 | in the order 'binary', 'standard', 'extended', and 'full'.
82 |
83 | For classification scheme "binary", duplicates are classified into
84 | one of 'SD' (segmental duplications) or 'SSD' (small-scale duplications).
85 |
86 | For classification scheme "standard" (default), duplicates are
87 | classified into 'SD' (segmental duplication), 'TD' (tandem duplication),
88 | 'PD' (proximal duplication), and 'DD' (dispersed duplication).
89 |
90 | For classification scheme "extended", duplicates are classified into
91 | 'SD' (segmental duplication), 'TD' (tandem duplication),
92 | 'PD' (proximal duplication), 'TRD' (transposon-derived duplication),
93 | and 'DD' (dispersed duplication).
94 |
95 | Finally, for classification scheme "full", duplicates are classified into
96 | 'SD' (segmental duplication), 'TD' (tandem duplication),
97 | 'PD' (proximal duplication), 'rTRD' (retrotransposon-derived duplication),
98 | 'dTRD' (DNA transposon-derived duplication), and
99 | 'DD' (dispersed duplication).
100 | }
101 | \examples{
102 | # Load example data
103 | data(diamond_intra)
104 | data(diamond_inter)
105 | data(yeast_annot)
106 | data(yeast_seq)
107 |
108 | # Get processed annotation data
109 | annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation
110 |
111 | # Get collapsed DIAMOND inter
112 | blast_inter <- syntenet::collapse_bidirectional_hits(
113 | diamond_inter,
114 | data.frame("Scerevisiae", "Cglabrata")
115 | )
116 |
117 | # Get list of intron counts
118 | library(txdbmaker)
119 | txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges)
120 | intron_counts <- lapply(txdb_list, get_intron_counts)
121 |
122 | # Classify duplicates - full scheme
123 | dup_class <- classify_gene_pairs(
124 | annotation = annotation,
125 | blast_list = diamond_intra,
126 | scheme = "full",
127 | blast_inter = blast_inter,
128 | intron_counts = intron_counts
129 | )
130 |
131 | # Check number of gene pairs per class
132 | table(dup_class$Scerevisiae$type)
133 |
134 | }
135 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | # doubletrouble
5 |
6 |
7 |
8 | [](https://github.com/almeidasilvaf/doubletrouble/issues)
10 | [](https://lifecycle.r-lib.org/articles/stages.html#stable)
12 | [](https://github.com/almeidasilvaf/doubletrouble/actions)
13 | [](https://codecov.io/gh/almeidasilvaf/doubletrouble?branch=devel)
15 |
16 |
17 | The major goal of **doubletrouble** is to identify duplicated genes from
18 | whole-genome protein sequences and classify them based on their modes of
19 | duplication. Duplicates can be classified using four different
20 | classification schemes, which increase the complexity and level of
21 | details in a stepwise manner. The classification schemes and the
22 | duplication modes they can classify are:
23 |
24 | | Scheme | Duplication modes |
25 | |:---------|:---------------------------|
26 | | binary | SD, SSD |
27 | | standard | SD, TD, PD, DD |
28 | | extended | SD, TD, PD, TRD, DD |
29 | | full | SD, TD, PD, rTRD, dTRD, DD |
30 |
31 | *Legend:* **SD**, segmental duplication. **SSD**, small-scale
32 | duplication. **TD**, tandem duplication. **PD**, proximal duplication.
33 | **TRD**, transposon-derived duplication. **rTRD**,
34 | retrotransposon-derived duplication. **dTRD**, DNA transposon-derived
35 | duplication. **DD**, dispersed duplication.
36 |
37 | Besides classifying gene pairs, users can also classify genes, so that
38 | each gene is assigned to a unique mode of duplication.
39 |
40 | Users can also calculate substitution rates per substitution site (i.e.,
41 | $K_a$, $K_s$ and their ratios $\frac{K_a}{K_s}$) from duplicate pairs,
42 | find peaks in Ks distributions with Gaussian Mixture Models (GMMs), and
43 | classify gene pairs into age groups based on Ks peaks.
44 |
45 | ## Installation instructions
46 |
47 | Get the latest stable `R` release from
48 | [CRAN](http://cran.r-project.org/). Then install **doubletrouble** from
49 | [Bioconductor](http://bioconductor.org/) using the following code:
50 |
51 | ``` r
52 | if (!requireNamespace("BiocManager", quietly = TRUE)) {
53 | install.packages("BiocManager")
54 | }
55 |
56 | BiocManager::install("doubletrouble")
57 | ```
58 |
59 | And the development version from
60 | [GitHub](https://github.com/almeidasilvaf/doubletrouble) with:
61 |
62 | ``` r
63 | BiocManager::install("almeidasilvaf/doubletrouble")
64 | ```
65 |
66 | ## Citation
67 |
68 | Below is the citation output from using `citation('doubletrouble')` in
69 | R. Please run this yourself to check for any updates on how to cite
70 | **doubletrouble**.
71 |
72 | ``` r
73 | print(citation('doubletrouble'), bibtex = TRUE)
74 | #> To cite doubletrouble in publications, use:
75 | #>
76 | #> Almeida-Silva F, Van de Peer Y doubletrouble: an R/Bioconductor
77 | #> package for the identification, classification, and analysis of gene
78 | #> and genome duplications. Bioinformatics, 41(2), btaf043. (2025).
79 | #> https://doi.org/10.1093/bioinformatics/btaf043
80 | #>
81 | #> A BibTeX entry for LaTeX users is
82 | #>
83 | #> @Article{,
84 | #> title = {doubletrouble: an R/Bioconductor package for the identification, classification, and analysis of gene and genome duplications},
85 | #> author = {Fabricio Almeida-Silva and Yves {Van de Peer}},
86 | #> journal = {Bioinformatics},
87 | #> year = {2025},
88 | #> volume = {41},
89 | #> number = {2},
90 | #> pages = {btaf043},
91 | #> url = {https://academic.oup.com/bioinformatics/article/41/2/btaf043/7979242},
92 | #> doi = {10.1093/bioinformatics/btaf043},
93 | #> }
94 | ```
95 |
96 | Please note that the **doubletrouble** was only made possible thanks to
97 | many other R and bioinformatics software authors, which are cited either
98 | in the vignettes and/or the paper(s) describing this package.
99 |
100 | ## Code of Conduct
101 |
102 | Please note that the **doubletrouble** project is released with a
103 | [Contributor Code of
104 | Conduct](http://bioconductor.org/about/code-of-conduct/). By
105 | contributing to this project, you agree to abide by its terms.
106 |
107 | ## Development tools
108 |
109 | - Continuous code testing is possible thanks to [GitHub
110 | actions](https://www.tidyverse.org/blog/2020/04/usethis-1-6-0/)
111 | through *[usethis](https://CRAN.R-project.org/package=usethis)*,
112 | *[remotes](https://CRAN.R-project.org/package=remotes)*, and
113 | *[rcmdcheck](https://CRAN.R-project.org/package=rcmdcheck)* customized
114 | to use [Bioconductor’s docker
115 | containers](https://www.bioconductor.org/help/docker/) and
116 | *[BiocCheck](https://bioconductor.org/packages/3.19/BiocCheck)*.
117 | - Code coverage assessment is possible thanks to
118 | [codecov](https://codecov.io/gh) and
119 | *[covr](https://CRAN.R-project.org/package=covr)*.
120 | - The [documentation
121 | website](http://almeidasilvaf.github.io/doubletrouble) is
122 | automatically updated thanks to
123 | *[pkgdown](https://CRAN.R-project.org/package=pkgdown)*.
124 | - The code is styled automatically thanks to
125 | *[styler](https://CRAN.R-project.org/package=styler)*.
126 | - The documentation is formatted thanks to
127 | *[devtools](https://CRAN.R-project.org/package=devtools)* and
128 | *[roxygen2](https://CRAN.R-project.org/package=roxygen2)*.
129 |
130 | For more details, check the `dev` directory.
131 |
132 | This package was developed using
133 | *[biocthis](https://bioconductor.org/packages/3.19/biocthis)*.
134 |
--------------------------------------------------------------------------------
/R/duplicate_classification.R:
--------------------------------------------------------------------------------
1 |
2 |
3 | #' Classify duplicate gene pairs based on their modes of duplication
4 | #'
5 | #' @param annotation A processed GRangesList or CompressedGRangesList object as
6 | #' returned by \code{syntenet::process_input()}.
7 | #' @param blast_list A list of data frames containing BLAST tabular output
8 | #' for intraspecies comparisons.
9 | #' Each list element corresponds to the BLAST output for a given species,
10 | #' and names of list elements must match the names of list elements in
11 | #' \strong{annotation}. BLASTp, DIAMOND or simular programs must be run
12 | #' on processed sequence data as returned by \code{process_input()}.
13 | #' @param scheme Character indicating which classification scheme to use.
14 | #' One of "binary", "standard", "extended", or "full". See details below
15 | #' for information on what each scheme means. Default: "standard".
16 | #' @param blast_inter (Only valid if \code{scheme == "extended" or "full"}).
17 | #' A list of data frames containing BLAST tabular output
18 | #' for the comparison between target species and outgroups.
19 | #' Names of list elements must match the names of
20 | #' list elements in `annotation`. BLASTp, DIAMOND or simular programs must
21 | #' be run on processed sequence data as returned by \code{process_input()}.
22 | #' @param intron_counts (Only valid if \code{scheme == "full"}).
23 | #' A list of 2-column data frames with the number of
24 | #' introns per gene as returned by \code{get_intron_counts()}. Names
25 | #' of list elements must match names of \strong{annotation}.
26 | #' @param evalue Numeric scalar indicating the E-value threshold.
27 | #' Default: 1e-10.
28 | #' @param anchors Numeric indicating the minimum required number of genes
29 | #' to call a syntenic block, as in \code{syntenet::infer_syntenet}.
30 | #' Default: 5.
31 | #' @param max_gaps Numeric indicating the number of upstream and downstream
32 | #' genes to search for anchors, as in \code{syntenet::infer_syntenet}.
33 | #' Default: 25.
34 | #' @param proximal_max Numeric scalar with the maximum distance (in number
35 | #' of genes) between two genes to consider them as proximal duplicates.
36 | #' Default: 10.
37 | #' @param collinearity_dir Character indicating the path to the directory
38 | #' where .collinearity files will be stored. If NULL, files will
39 | #' be stored in a subdirectory of \code{tempdir()}. Default: NULL.
40 | #' @param outgroup_coverage Numeric indicating the minimum percentage of
41 | #' outgroup species to use to consider genes as transposed duplicates. Only
42 | #' valid if multiple outgroup species are present (see details below). Values
43 | #' should range from 0 to 100. Default: 70.
44 | #'
45 | #'
46 | #' @return A list of 3-column data frames of duplicated gene pairs
47 | #' (columns 1 and 2), and their modes of duplication (column 3).
48 | #'
49 | #' @details
50 | #' The classification schemes increase in complexity (number of classes)
51 | #' in the order 'binary', 'standard', 'extended', and 'full'.
52 | #'
53 | #' For classification scheme "binary", duplicates are classified into
54 | #' one of 'SD' (segmental duplications) or 'SSD' (small-scale duplications).
55 | #'
56 | #' For classification scheme "standard" (default), duplicates are
57 | #' classified into 'SD' (segmental duplication), 'TD' (tandem duplication),
58 | #' 'PD' (proximal duplication), and 'DD' (dispersed duplication).
59 | #'
60 | #' For classification scheme "extended", duplicates are classified into
61 | #' 'SD' (segmental duplication), 'TD' (tandem duplication),
62 | #' 'PD' (proximal duplication), 'TRD' (transposon-derived duplication),
63 | #' and 'DD' (dispersed duplication).
64 | #'
65 | #' Finally, for classification scheme "full", duplicates are classified into
66 | #' 'SD' (segmental duplication), 'TD' (tandem duplication),
67 | #' 'PD' (proximal duplication), 'rTRD' (retrotransposon-derived duplication),
68 | #' 'dTRD' (DNA transposon-derived duplication), and
69 | #' 'DD' (dispersed duplication).
70 | #'
71 | #' @export
72 | #' @rdname classify_gene_pairs
73 | #' @examples
74 | #' # Load example data
75 | #' data(diamond_intra)
76 | #' data(diamond_inter)
77 | #' data(yeast_annot)
78 | #' data(yeast_seq)
79 | #'
80 | #' # Get processed annotation data
81 | #' annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation
82 | #'
83 | #' # Get collapsed DIAMOND inter
84 | #' blast_inter <- syntenet::collapse_bidirectional_hits(
85 | #' diamond_inter,
86 | #' data.frame("Scerevisiae", "Cglabrata")
87 | #' )
88 | #'
89 | #' # Get list of intron counts
90 | #' library(txdbmaker)
91 | #' txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges)
92 | #' intron_counts <- lapply(txdb_list, get_intron_counts)
93 | #'
94 | #' # Classify duplicates - full scheme
95 | #' dup_class <- classify_gene_pairs(
96 | #' annotation = annotation,
97 | #' blast_list = diamond_intra,
98 | #' scheme = "full",
99 | #' blast_inter = blast_inter,
100 | #' intron_counts = intron_counts
101 | #' )
102 | #'
103 | #' # Check number of gene pairs per class
104 | #' table(dup_class$Scerevisiae$type)
105 | #'
106 | classify_gene_pairs <- function(
107 | annotation = NULL, blast_list = NULL, scheme = "standard",
108 | blast_inter = NULL, intron_counts,
109 | evalue = 1e-10, anchors = 5, max_gaps = 25, proximal_max = 10,
110 | collinearity_dir = NULL, outgroup_coverage = 70
111 | ) {
112 |
113 | anchorp <- get_anchors_list(
114 | blast_list, annotation, evalue, anchors, max_gaps, collinearity_dir
115 | )
116 |
117 | # Get duplicate pairs and filter duplicate entries
118 | pairs <- lapply(blast_list, function(x) {
119 | fpair <- x[x$evalue <= evalue, c(1, 2)]
120 | fpair <- fpair[fpair[, 1] != fpair[, 2], ]
121 | fpair <- fpair[!duplicated(t(apply(fpair, 1, sort))), ]
122 | names(fpair) <- c("dup1", "dup2")
123 | return(fpair)
124 | })
125 |
126 | dup_list <- lapply(seq_along(anchorp), function(x) {
127 | # 1) Get segmental duplicates
128 | sp <- names(anchorp)[x]
129 | p <- pairs[[grep(paste0(sp, "$"), names(pairs))]]
130 |
131 | dups <- get_segmental(anchorp[[x]], p)
132 | if(scheme == "binary") {
133 | dups$type <- gsub("DD", "SSD", dups$type)
134 | dups$type <- factor(dups$type, levels = c("SD", "SSD"))
135 | } else {
136 | # 2) Get tandem and proximal duplicates
137 | dups <- get_tandem_proximal(
138 | dups, annotation_granges = annotation[[sp]],
139 | proximal_max = proximal_max
140 | )
141 |
142 | if(scheme %in% c("extended", "full")) {
143 | # 3) Get transposed duplicates
144 | binter <- blast_inter[startsWith(names(blast_inter), paste0(sp, "_"))]
145 | if(length(binter) == 0) {
146 | message(
147 | "Could not find outgroup for species '", sp,
148 | "'. Skipping identification of TRD duplicates..."
149 | )
150 | } else {
151 | dups <- get_transposed(
152 | pairs = dups,
153 | blast_inter = binter,
154 | annotation = annotation,
155 | evalue = evalue,
156 | anchors = anchors, max_gaps = max_gaps,
157 | collinearity_dir = collinearity_dir,
158 | outgroup_coverage = outgroup_coverage
159 | )
160 |
161 | if(scheme == "full") {
162 | # 4) Get TRD classes (rTRD and dTRD)
163 | dups <- get_transposed_classes(dups, intron_counts[[sp]])
164 | }
165 | }
166 | }
167 | }
168 |
169 | return(dups)
170 | })
171 | names(dup_list) <- names(anchorp)
172 |
173 | return(dup_list)
174 | }
175 |
176 |
177 | #' Classify genes into unique modes of duplication
178 | #'
179 | #' @param gene_pairs_list List of classified gene pairs as returned
180 | #' by \code{classify_gene_pairs()}.
181 | #'
182 | #' @return A list of 2-column data frames with variables \strong{gene}
183 | #' and \strong{type} representing gene ID and duplication type, respectively.
184 | #'
185 | #' @details
186 | #' If a gene is present in pairs with different duplication modes, the gene
187 | #' is classified into a unique mode of duplication following the order
188 | #' of priority indicated in the levels of the factor \strong{type}.
189 | #'
190 | #' For scheme "binary", the order is SD > SSD.
191 | #' For scheme "standard", the order is SD > TD > PD > DD.
192 | #' For scheme "extended", the order is SD > TD > PD > TRD > DD.
193 | #' For scheme "full", the order is SD > TD > PD > rTRD > dTRD > DD.
194 | #'
195 | #' @rdname classify_genes
196 | #' @export
197 | #' @importFrom GenomicRanges GRangesList
198 | #' @examples
199 | #' data(fungi_kaks)
200 | #' scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
201 | #'
202 | #' cols <- c("dup1", "dup2", "type")
203 | #' gene_pairs_list <- list(Scerevisiae = scerevisiae_kaks[, cols])
204 | #'
205 | #' class_genes <- classify_genes(gene_pairs_list)
206 | classify_genes <- function(gene_pairs_list = NULL) {
207 |
208 | class_genes <- lapply(gene_pairs_list, function(x) {
209 |
210 | pairs_by_type <- split(x, x$type)
211 | gene_type <- Reduce(rbind, lapply(pairs_by_type, function(y) {
212 |
213 | genes_df <- NULL
214 | if(nrow(y) > 0) {
215 | genes <- unique(c(y$dup1, y$dup2))
216 | genes_df <- data.frame(gene = genes, type = y$type[1])
217 | genes_df <- genes_df[!duplicated(genes_df$gene), ]
218 | }
219 |
220 | return(genes_df)
221 | }))
222 | ref <- levels(x$type)
223 | gene_type <- gene_type[order(match(gene_type$type, ref)), ]
224 | gene_type <- gene_type[!duplicated(gene_type$gene), ]
225 | })
226 | return(class_genes)
227 | }
228 |
229 |
230 |
231 |
232 |
233 |
234 |
235 |
--------------------------------------------------------------------------------
/inst/script/data_acquisition.md:
--------------------------------------------------------------------------------
1 | Data acquisition
2 | ================
3 |
4 | # Data in data/
5 |
6 | Here, we will use genome data for two yeast species:
7 |
8 | - *Saccharomyces cerevisiae*
9 | - *Candida glabrata*
10 |
11 | Data will be obtained from Ensembl Fungi.
12 |
13 | First of all, let’s obtain a list of only protein-coding genes for each
14 | species.
15 |
16 | ``` r
17 | library(tidyverse)
18 |
19 | # Get character vector of protein coding gene IDs
20 | ## S. cerevisiae
21 | scerevisiae_coding <- as.data.frame(read_delim(
22 | "ftp://ftp.psb.ugent.be/pub/plaza/plaza_pico_03/Annotation/annotation.selected_transcript.sac.csv.gz", skip = 8, delim = ";", show_col_types = FALSE
23 | ))
24 | scerevisiae_coding <- scerevisiae_coding[scerevisiae_coding$type == "coding", 1]
25 |
26 | ## C. glabrata
27 | cglabrata_coding <- rtracklayer::import(
28 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/fungi/gff3/candida_glabrata/Candida_glabrata.GCA000002545v2.54.gff3.gz"
29 | )
30 | cglabrata_coding <- cglabrata_coding[cglabrata_coding$type == "gene", ]
31 | cglabrata_coding <- cglabrata_coding[cglabrata_coding$biotype == "protein_coding", ]
32 | ```
33 |
34 | ## yeast_annot.rda
35 |
36 | The object `yeast_annot` is a `GRangesList` object with elements
37 | *Scerevisiae* and *Cglabrata*. Only ranges for protein-coding genes are
38 | included.
39 |
40 | ``` r
41 | library(rtracklayer)
42 |
43 | # Get gene ranges
44 | scerevisiae_annot <- import(
45 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/fungi/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.54.gff3.gz"
46 | )
47 | cglabrata_annot <- import(
48 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/fungi/gff3/candida_glabrata/Candida_glabrata.GCA000002545v2.54.gff3.gz"
49 | )
50 |
51 | # Filter GRanges (include only protein-coding genes and relevant metadata)
52 | ## Combine GRanges objects in a list
53 | yeast_annot <- list(
54 | Scerevisiae = scerevisiae_annot,
55 | Cglabrata = cglabrata_annot
56 | )
57 |
58 | ## Filter data
59 | yeast_annot <- lapply(yeast_annot, function(x) {
60 |
61 | # Get ranges for coding genes, and use them to extract exons, mRNA, etc.
62 | gene_ranges <- x[x$biotype == "protein_coding" & x$type == "gene"]
63 | cranges <- subsetByOverlaps(x, gene_ranges)
64 |
65 | # Remove exons for TEs (to avoid warnings when building TxDb)
66 | te_tx <- cranges[cranges$type == "transposable_element", ]$transcript_id
67 | if(length(te_tx) > 0) {
68 | te_exonid <- paste0(rep(te_tx, each = 9), paste0("-E", 1:9))
69 | cranges <- cranges[-which(cranges$Name %in% te_exonid)]
70 | }
71 |
72 | # Remove unnecessary columns (for package size issues)
73 | cols <- c(
74 | "type", "phase", "ID", "Parent", "Name",
75 | "gene_id", "transcript_id", "exon_id", "protein_id"
76 | )
77 | cranges <- cranges[, cols]
78 |
79 | return(cranges)
80 | })
81 |
82 | yeast_annot <- GenomicRanges::GRangesList(yeast_annot)
83 |
84 | # Save data
85 | usethis::use_data(yeast_annot, compress = "xz", overwrite = TRUE)
86 | ```
87 |
88 | ## yeast_seq.rda
89 |
90 | The object `yeast_seq` is a list of `AAStringSet` objects with elements
91 | *Scerevisiae* and *Cglabrata*. Only translated sequences for primary
92 | transcripts (protein-coding only) are included.
93 |
94 | ``` r
95 | library(Biostrings)
96 |
97 | # Define small function to keep only longest isoform
98 | ensembl_longest_isoform <- function(proteome = NULL) {
99 |
100 | pnames <- gsub(".*gene:", "", names(proteome))
101 | pnames <- gsub(" .*", "", pnames)
102 |
103 | names(proteome) <- pnames
104 | proteome <- proteome[order(Biostrings::width(proteome), decreasing = TRUE),]
105 | proteome <- proteome[!duplicated(names(proteome)), ]
106 | return(proteome)
107 | }
108 |
109 | # Get proteome data
110 | scerevisiae_proteome <- readAAStringSet(
111 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/fungi/fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz"
112 | ) |> ensembl_longest_isoform()
113 |
114 | cglabrata_proteome <- readAAStringSet(
115 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/fungi/fasta/candida_glabrata/pep/Candida_glabrata.GCA000002545v2.pep.all.fa.gz"
116 | ) |> ensembl_longest_isoform()
117 |
118 | # Remove non-coding genes
119 | scerevisiae_proteome <- scerevisiae_proteome[names(scerevisiae_proteome) %in%
120 | scerevisiae_annot$gene_id, ]
121 |
122 | cglabrata_proteome <- cglabrata_proteome[names(cglabrata_proteome) %in%
123 | cglabrata_annot$gene_id]
124 |
125 | # Store AAStringSet objects in a list
126 | yeast_seq <- list(
127 | Scerevisiae = scerevisiae_proteome,
128 | Cglabrata = cglabrata_proteome
129 | )
130 |
131 | # Save object
132 | usethis::use_data(yeast_seq, compress = "xz", overwrite = TRUE)
133 | ```
134 |
135 | ## diamond_intra.rda and diamond_inter.rda
136 |
137 | The object `diamond_intra` is a list of DIAMOND data frames for
138 | intraspecies comparisons of *S. cerevisiae*, while `diamond_inter`
139 | contains the DIAMOND output of a comparison between *S. cerevisiae* and
140 | *C. glabrata*.
141 |
142 | ``` r
143 | # Load and process data
144 | data(yeast_seq)
145 | data(yeast_annot)
146 |
147 | pdata <- process_input(yeast_seq, yeast_annot)
148 |
149 | # Intraspecies DIAMOND
150 | diamond_intra <- run_diamond(
151 | seq = pdata$seq["Scerevisiae"],
152 | compare = "intraspecies",
153 | outdir = file.path(tempdir(), "diamond_intra_data"),
154 | ... = "--sensitive"
155 | )
156 |
157 | # Interspecies DIAMOND
158 | comparisons <- data.frame(
159 | species = "Scerevisiae",
160 | outgroup = "Cglabrata"
161 | )
162 |
163 | diamond_inter <- run_diamond(
164 | seq = pdata$seq,
165 | compare = comparisons,
166 | outdir = file.path(tempdir(), "diamond_inter_data"),
167 | ... = "--sensitive"
168 | )
169 |
170 | # Save data
171 | usethis::use_data(diamond_intra, compress = "xz", overwrite = TRUE)
172 | usethis::use_data(diamond_inter, compress = "xz", overwrite = TRUE)
173 | ```
174 |
175 | ## cds_scerevisiae.rda
176 |
177 | This is a `DNAStringSet object` containing the CDS of duplicated genes
178 | in the S. cerevisiae genome.
179 |
180 | ``` r
181 | library(Biostrings)
182 |
183 | # Load and process data
184 | data("yeast_seq")
185 | data("yeast_annot")
186 | pdata <- syntenet::process_input(yeast_seq, yeast_annot)
187 |
188 | data(diamond_intra)
189 |
190 | # Classify gene pairs
191 | c_standard <- classify_gene_pairs(
192 | annotation = pdata$annotation,
193 | blast_list = diamond_intra,
194 | scheme = "standard"
195 | )
196 |
197 | # Get TD-derived pairs
198 | td_pairs <- c_standard$Scerevisiae |>
199 | dplyr::filter(type == "TD")
200 | td_pairs <- unique(c(td_pairs$dup1, td_pairs$dup2))
201 | td_pairs <- gsub(".*_", "", td_pairs)
202 |
203 | # Get CDS and keep only longest isoform
204 | cds_scerevisiae_full <- readDNAStringSet(
205 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-54/fungi/fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz"
206 | ) |> ensembl_longest_isoform()
207 |
208 | # Keep only duplicated genes
209 | cds_scerevisiae <- cds_scerevisiae_full[names(cds_scerevisiae_full) %in%
210 | td_pairs]
211 |
212 | # Write, read, and export file
213 | out <- tempfile(fileext = ".fa")
214 | writeXStringSet(cds_scerevisiae, filepath = out)
215 |
216 | cds_scerevisiae <- Biostrings::readDNAStringSet(out)
217 |
218 | usethis::use_data(cds_scerevisiae, compress = "xz", overwrite = TRUE)
219 | ```
220 |
221 | ## scerevisiae_kaks.rda
222 |
223 | This object is a data frame of duplicate pairs and their Ks values.
224 |
225 | ``` r
226 | # Get all duplicated gene pairs
227 | library(Biostrings)
228 |
229 | data(yeast_seq)
230 | data(yeast_annot)
231 | data(diamond_intra)
232 | data(diamond_inter)
233 | pdata <- syntenet::process_input(yeast_seq, yeast_annot)
234 |
235 | # Classify genes into the extended scheme
236 | c_extended <- classify_gene_pairs(
237 | blast_list = diamond_intra,
238 | annotation = pdata$annotation,
239 | scheme = "extended",
240 | blast_inter = diamond_inter
241 | )
242 |
243 | # Get CDS
244 | cds <- list(Scerevisiae = cds_scerevisiae_all)
245 |
246 | # Calculate Ks values
247 | scerevisiae_kaks_list <- pairs2kaks(c_extended, cds)
248 | scerevisiae_kaks <- scerevisiae_kaks_list$Scerevisiae
249 |
250 | fungi_kaks2 <- fungi_kaks
251 | fungi_kaks2 <- lapply(fungi_kaks2, function(x) {
252 |
253 | x$Ka <- signif(x$Ka, 3)
254 | x$Ks <- signif(x$Ks, 3)
255 | x$Ka_Ks <- signif(x$Ka_Ks, 3)
256 |
257 | return(x)
258 | })
259 |
260 | usethis::use_data(scerevisiae_kaks, compress = "xz", overwrite = TRUE)
261 | ```
262 |
263 | ## gmax_ks.rda
264 |
265 | This object is a 3-column data frame of duplicate pairs and their Ks
266 | values for *Glycine max* (soybean).
267 |
268 | ``` r
269 | # Get data
270 | annot <- rtracklayer::import(
271 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-53/plants/gtf/glycine_max/Glycine_max.Glycine_max_v2.1.53.gtf.gz"
272 | )
273 | annot <- list(Gmax = annot)
274 |
275 | seq <- Biostrings::readAAStringSet(
276 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-53/plants/fasta/glycine_max/pep/Glycine_max.Glycine_max_v2.1.pep.all.fa.gz"
277 | ) |> ensembl_longest_isoform()
278 | seq <- list(Gmax = seq)
279 |
280 | cds <- Biostrings::readDNAStringSet(
281 | "http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-53/plants/fasta/glycine_max/cds/Glycine_max.Glycine_max_v2.1.cds.all.fa.gz"
282 | ) |> ensembl_longest_isoform()
283 |
284 | # Process data
285 | pdata <- syntenet::process_input(seq, annot)
286 |
287 | # Intraspecies comparison
288 | diamond_intra <- run_diamond(
289 | seq = pdata$seq["Gmax"],
290 | compare = "intraspecies",
291 | outdir = file.path(tempdir(), "diamond_intra_data"),
292 | ... = "--sensitive"
293 | )
294 |
295 | # Binary classification
296 | c_binary <- classify_gene_pairs(
297 | blast_list = diamond_intra,
298 | annotation = pdata$annotation,
299 | binary = TRUE
300 | )
301 |
302 | cds <- list(Gmax = cds)
303 |
304 | # Calculate Ks values
305 | gmax_kaks_list <- pairs2kaks(c_binary, cds)
306 | gmax_ks <- gmax_kaks_list$Gmax
307 | gmax_ks <- gmax_ks[, c("dup1", "dup2", "Ks", "type")]
308 |
309 | gmax_ks <- gmax_ks[gmax_ks$Ks <= 2, ]
310 | gmax_ks <- gmax_ks[!is.na(gmax_ks$Ks), ]
311 |
312 | gmax_ks$Ks <- signif(gmax_ks$Ks, 3) # to reduce object size
313 |
314 | usethis::use_data(gmax_ks, compress = "xz", overwrite = TRUE)
315 | ```
316 |
--------------------------------------------------------------------------------
/R/ka_ks_analyses.R:
--------------------------------------------------------------------------------
1 |
2 |
3 | #' Calculate Ka, Ks, and Ka/Ks from duplicate gene pairs
4 | #'
5 | #' @param gene_pairs_list List of data frames containing duplicated gene pairs
6 | #' as returned by \code{classify_gene_pairs()}.
7 | #' @param cds List of DNAStringSet objects containing the coding sequences
8 | #' of each gene.
9 | #' @param model Character scalar indicating which codon model to use.
10 | #' Possible values are "Li", "NG86", "NG", "LWL", "LPB", "MLWL", "MLPB", "GY",
11 | #' "YN", "MYN", "MS", "MA", "GNG", "GLWL", "GLPB", "GMLWL", "GMLPB", "GYN",
12 | #' and "GMYN". Default: "MYN".
13 | #' @param threads Numeric indicating the number of threads to use. Default: 1.
14 | #' @param verbose Logical indicating whether progress messages should be
15 | #' printed on screen. Default: FALSE.
16 | #'
17 | #' @return A list of data frames containing gene pairs and their Ka, Ks,
18 | #' and Ka/Ks values.
19 | #'
20 | #' @importFrom MSA2dist indices2kaks
21 | #' @importFrom Biostrings width
22 | #' @export
23 | #' @rdname pairs2kaks
24 | #' @examples
25 | #' data(diamond_intra)
26 | #' data(diamond_inter)
27 | #' data(yeast_annot)
28 | #' data(yeast_seq)
29 | #' data(cds_scerevisiae)
30 | #' blast_list <- diamond_intra
31 | #' blast_inter <- diamond_inter
32 | #'
33 | #' pdata <- syntenet::process_input(yeast_seq, yeast_annot)
34 | #' annot <- pdata$annotation["Scerevisiae"]
35 | #'
36 | #' # Binary classification scheme
37 | #' pairs <- classify_gene_pairs(annot, blast_list)
38 | #' td_pairs <- pairs[[1]][pairs[[1]]$type == "TD", ]
39 | #' gene_pairs_list <- list(
40 | #' Scerevisiae = td_pairs[seq(1, 3, by = 1), ]
41 | #' )
42 | #'
43 | #' cds <- list(Scerevisiae = cds_scerevisiae)
44 | #'
45 | #' kaks <- pairs2kaks(gene_pairs_list, cds)
46 | #'
47 | pairs2kaks <- function(
48 | gene_pairs_list, cds, model = "MYN", threads = 1, verbose = FALSE
49 | ) {
50 |
51 | kaks_list <- lapply(seq_along(gene_pairs_list), function(x) {
52 |
53 | # Get pairs and CDS for species x
54 | species <- names(gene_pairs_list)[x]
55 | if(verbose) { message("Calculating rates for species '", species, "'") }
56 |
57 | pairs <- gene_pairs_list[[x]]
58 | names(pairs)[c(1, 2)] <- c("dup1", "dup2")
59 | pairs$dup1 <- gsub("^[a-zA-Z]{2,5}_", "", pairs$dup1)
60 | pairs$dup2 <- gsub("^[a-zA-Z]{2,5}_", "", pairs$dup2)
61 | fcds <- cds[[species]]
62 |
63 | # Check if IDs in pairs are all present in CDS
64 | c1 <- check_geneid_match(unique(c(pairs$dup1, pairs$dup2)), names(fcds))
65 |
66 | # Remove CDS that are not multiple of 3
67 | fcds <- cds[[species]]
68 | m3 <- Biostrings::width(fcds) %% 3
69 | remove <- which(m3 != 0)
70 | if(length(remove) != 0) {
71 | message(
72 | "For species ", species, ", the lengths of ", length(remove),
73 | " CDS are not multiples of 3. Removing them..."
74 | )
75 | pairs <- pairs[!pairs$dup1 %in% names(fcds)[remove], ]
76 | pairs <- pairs[!pairs$dup2 %in% names(fcds)[remove], ]
77 | fcds <- fcds[-remove]
78 | }
79 |
80 | # Create a list of indices - vectors of length 2
81 | idx_df <- data.frame(gene = names(fcds), idx = seq_along(fcds))
82 | pairs_idx <- lapply(seq_len(nrow(pairs)), function(p) {
83 | idx <- c(
84 | idx_df$idx[idx_df$gene == pairs[p, 1]],
85 | idx_df$idx[idx_df$gene == pairs[p, 2]]
86 | )
87 |
88 | return(idx)
89 | })
90 |
91 | # Calculate rates
92 | rates <- MSA2dist::indices2kaks(
93 | cds = fcds, indices = pairs_idx,
94 | model = model,
95 | threads = threads,
96 | isMSA = FALSE,
97 | verbose = FALSE
98 | )
99 |
100 | rates <- data.frame(
101 | dup1 = rates$seq1,
102 | dup2 = rates$seq2,
103 | Ka = as.numeric(ifelse(rates$Ka == "NA", NA, rates$Ka)),
104 | Ks = as.numeric(ifelse(rates$Ks == "NA", NA, rates$Ks)),
105 | Ka_Ks = as.numeric(ifelse(rates[["Ka/Ks"]] == "NA", NA, rates[["Ka/Ks"]]))
106 | )
107 | if("type" %in% names(pairs)) {
108 | rates$type <- pairs$type
109 | }
110 |
111 | return(rates)
112 | })
113 | names(kaks_list) <- names(gene_pairs_list)
114 |
115 | return(kaks_list)
116 | }
117 |
118 |
119 | #' Find peaks in a Ks distribution with Gaussian Mixture Models
120 | #'
121 | #' @param ks A numeric vector of Ks values.
122 | #' @param npeaks Numeric scalar indicating the number of peaks in
123 | #' the Ks distribution. If you don't know how many peaks there are,
124 | #' you can include a range of values, and the number of peaks that produces
125 | #' the lowest BIC (Bayesian Information Criterion) will be selected as the
126 | #' optimal. Default: 2.
127 | #' @param min_ks Numeric scalar with the minimum Ks value. Removing
128 | #' very small Ks values is generally used to avoid the incorporation of allelic
129 | #' and/or splice variants and to prevent the fitting of a component to infinity.
130 | #' Default: 0.01.
131 | #' @param max_ks Numeric scalar indicating the maximum Ks value. Removing
132 | #' very large Ks values is usually performed to account for Ks saturation.
133 | #' Default: 4.
134 | #' @param verbose Logical indicating if messages should be printed on screen.
135 | #' Default: FALSE.
136 | #'
137 | #' @return A list with the following elements:
138 | #' \describe{
139 | #' \item{mean}{Numeric with the estimated means.}
140 | #' \item{sd}{Numeric with the estimated standard deviations.}
141 | #' \item{lambda}{Numeric with the estimated mixture weights.}
142 | #' \item{ks}{Numeric vector of filtered Ks distribution based on
143 | #' arguments passed to min_ks and max_ks.}
144 | #' }
145 | #' @importFrom mclust densityMclust
146 | #' @export
147 | #' @rdname find_ks_peaks
148 | #' @examples
149 | #' data(fungi_kaks)
150 | #' scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
151 | #' ks <- scerevisiae_kaks$Ks
152 | #'
153 | #' # Find 2 peaks in Ks distribution
154 | #' peaks <- find_ks_peaks(ks, npeaks = 2)
155 | #'
156 | #' # From 2 to 4 peaks, verbose = TRUE to show BIC values
157 | #' peaks <- find_ks_peaks(ks, npeaks = c(2, 3, 4), verbose = TRUE)
158 | find_ks_peaks <- function(ks, npeaks = 2, min_ks = 0.01, max_ks = 4,
159 | verbose = FALSE) {
160 |
161 | # Data preprocessing
162 | ks <- ks[!is.na(ks)]
163 | fks <- ks[ks >= min_ks]
164 | fks <- fks[fks <= max_ks]
165 |
166 | # Find peaks
167 | peaks <- mclust::densityMclust(
168 | fks, G = npeaks, verbose = FALSE, plot = FALSE
169 | )
170 |
171 | if(verbose & length(npeaks) > 1) {
172 | message("Optimal number of peaks: ", peaks$G)
173 | print(peaks$BIC)
174 | }
175 |
176 | # Create result list
177 | peak_list <- list(
178 | mean = peaks$parameters$mean,
179 | sd = sqrt(peaks$parameters$variance$sigmasq),
180 | lambda = peaks$parameters$pro,
181 | ks = as.numeric(peaks$data[,1])
182 | )
183 | return(peak_list)
184 | }
185 |
186 |
187 |
188 |
189 |
190 | #' Split gene pairs based on their Ks peaks
191 | #'
192 | #' The purpose of this function is to classify gene pairs by age when there
193 | #' are 2+ Ks peaks. This way, newer gene pairs are found within a
194 | #' certain number of standard deviations from the highest peak,
195 | #' and older genes are found close within smaller peaks.
196 | #'
197 | #' @param ks_df A 3-column data frame with gene pairs in columns 1 and 2,
198 | #' and Ks values for the gene pair in column 3.
199 | #' @param peaks A list with mean, standard deviation, and amplitude of Ks
200 | #' peaks as generated by \code{find_ks_peaks}.
201 | #' @param nsd Numeric with the number of standard deviations to consider
202 | #' for each peak.
203 | #' @param binwidth Numeric scalar with binwidth for the histogram.
204 | #' Default: 0.05.
205 | #'
206 | #' @return A list with the following elements:
207 | #' \describe{
208 | #' \item{pairs}{A 4-column data frame with the variables
209 | #' \strong{dup1} (character), \strong{dup2} (character),
210 | #' \strong{ks} (numeric), and \strong{peak} (numeric),
211 | #' representing duplicate gene pair, Ks values, and peak ID,
212 | #' respectively.}
213 | #' \item{plot}{A ggplot object with Ks peaks as returned by
214 | #' \code{plot_ks_peaks}, but with dashed red lines indicating
215 | #' boundaries for each peak.}
216 | #' }
217 | #'
218 | #' @importFrom ggplot2 geom_vline
219 | #' @export
220 | #' @rdname split_pairs_by_peak
221 | #' @examples
222 | #' data(fungi_kaks)
223 | #' scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
224 | #'
225 | #' # Create a data frame of duplicate pairs and Ks values
226 | #' ks_df <- scerevisiae_kaks[, c("dup1", "dup2", "Ks")]
227 | #'
228 | #' # Create list of peaks
229 | #' peaks <- find_ks_peaks(ks_df$Ks, npeaks = 2)
230 | #'
231 | #' # Split pairs
232 | #' spairs <- split_pairs_by_peak(ks_df, peaks)
233 | split_pairs_by_peak <- function(ks_df, peaks, nsd = 2, binwidth = 0.05) {
234 |
235 | names(ks_df) <- c("dup1", "dup2", "ks")
236 | npeaks <- length(peaks$mean)
237 |
238 | # Filter Ks data frame as done in find_ks_peaks()
239 | max_ks <- max(peaks$ks)
240 | min_ks <- min(peaks$ks)
241 | ks_df <- ks_df[!is.na(ks_df$ks), ]
242 | ks_df <- ks_df[ks_df$ks >= min_ks & ks_df$ks <= max_ks, ]
243 |
244 | # Get minimum, intersection points, and maximum
245 | min_boun <- peaks$mean[1] - nsd * peaks$sd[1]
246 | if(min_boun < 0) { min_boun <- 0 }
247 | max_boun <- peaks$mean[npeaks] + nsd * peaks$sd[npeaks]
248 | if(max_boun > max(ks_df$ks)) { max_boun <- max(ks_df$ks) }
249 |
250 | if(npeaks == 1) {
251 | cutpoints <- c(min_boun, max_boun)
252 | } else {
253 | inter <- find_intersect_mixtures(peaks)
254 | cutpoints <- c(min_boun, inter, max_boun)
255 | }
256 |
257 | # Plot histogram with cutpoints in "brown2" dashed lines
258 | p <- plot_ks_peaks(peaks, binwidth = binwidth)
259 | for(i in seq_along(cutpoints)) {
260 | p <- p + geom_vline(xintercept = cutpoints[i],
261 | linetype = "dashed", color = "brown2")
262 | }
263 |
264 | # Create list of intervals
265 | int_list <- lapply(seq_len(length(cutpoints)-1), function(x) {
266 | return(c(cutpoints[x], cutpoints[x+1]))
267 | })
268 |
269 | # Create list of data frames for each interval
270 | split_pairs <- Reduce(rbind, lapply(seq_along(int_list), function(x) {
271 | ivec <- int_list[[x]]
272 | pairs <- ks_df[ks_df$ks >= ivec[1] & ks_df$ks < ivec[2], ]
273 | pairs$peak <- x
274 | return(pairs)
275 | }))
276 |
277 | result_list <- list(pairs = split_pairs, plot = p)
278 | }
279 |
280 |
281 |
--------------------------------------------------------------------------------
/R/utils.R:
--------------------------------------------------------------------------------
1 |
2 | #' Get a list of anchor pairs for each species
3 | #'
4 | #' @param blast_list A list of data frames containing BLAST tabular output
5 | #' for intraspecies comparisons.
6 | #' Each list element corresponds to the BLAST output for a given species,
7 | #' and names of list elements must match the names of list elements in
8 | #' `annotation`. BLASTp, DIAMOND or simular programs must be run on processed
9 | #' sequence data as returned by \code{process_input()}.
10 | #' @param annotation A processed GRangesList or CompressedGRangesList object as
11 | #' returned by \code{syntenet::process_input()}.
12 | #' @param evalue Numeric scalar indicating the E-value threshold.
13 | #' Default: 1e-10.
14 | #' @param anchors Numeric indicating the minimum required number of genes
15 | #' to call a syntenic block, as in \code{syntenet::infer_syntenet}.
16 | #' Default: 5.
17 | #' @param max_gaps Numeric indicating the number of upstream and downstream
18 | #' genes to search for anchors, as in \code{syntenet::infer_syntenet}.
19 | #' Default: 25.
20 | #' @param collinearity_dir Character indicating the path to the directory
21 | #' where .collinearity files will be stored. If NULL, files will
22 | #' be stored in a subdirectory of \code{tempdir()}. Default: NULL.
23 | #'
24 | #' @return A list of data frames representing intraspecies anchor pairs.
25 | #' @importFrom syntenet intraspecies_synteny
26 | #' @export
27 | #' @rdname get_anchors_list
28 | #' @examples
29 | #' data(diamond_intra)
30 | #' data(yeast_annot)
31 | #' data(yeast_seq)
32 | #' blast_list <- diamond_intra
33 | #'
34 | #' # Get processed annotation for S. cerevisiae
35 | #' annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation
36 | #'
37 | #' # Get list of intraspecies anchor pairs
38 | #' anchorpairs <- get_anchors_list(blast_list, annotation)
39 | get_anchors_list <- function(
40 | blast_list = NULL, annotation = NULL,
41 | evalue = 1e-10, anchors = 5, max_gaps = 25,
42 | collinearity_dir = NULL
43 | ) {
44 |
45 | # Create output directory
46 | intradir <- collinearity_dir
47 | if(is.null(intradir)) {
48 | daytime <- format(Sys.time(), "%d_%b_%Y_%Hh%M")
49 | intradir <- file.path(tempdir(), paste0("intra_", daytime))
50 | }
51 |
52 | # Filter DIAMOND list by e-value
53 | fblast <- lapply(blast_list, function(x) return(x[x$evalue <= evalue, ]))
54 |
55 | # Get .collinearity files for intragenome comparisons
56 | col_files <- syntenet::intraspecies_synteny(
57 | blast_intra = fblast,
58 | annotation = annotation,
59 | intra_dir = intradir,
60 | anchors = anchors,
61 | max_gaps = max_gaps
62 | )
63 |
64 | # Parse files
65 | anchors <- lapply(col_files, syntenet::parse_collinearity)
66 | names(anchors) <- gsub("\\.collinearity", "", basename(col_files))
67 |
68 | return(anchors)
69 | }
70 |
71 |
72 | #' Parse .collinearity files into a data frame of syntenic blocks
73 | #'
74 | #' @param collinearity_paths Character vector of paths to .collinearity files.
75 | #'
76 | #' @return A 4-column data frame with the variables:
77 | #' \describe{
78 | #' \item{block}{Syntenic block}
79 | #' \item{anchor1}{Anchor pair 1}
80 | #' \item{anchor2}{Anchor pair 2}
81 | #' }
82 | #'
83 | #' @importFrom utils read.table
84 | #' @noRd
85 | collinearity2blocks <- function(collinearity_paths = NULL) {
86 |
87 | fname <- gsub("\\.collinearity", "", basename(collinearity_paths))
88 | names(collinearity_paths) <- fname
89 |
90 | blocks <- lapply(seq_along(collinearity_paths), function(x) {
91 | lines <- readLines(collinearity_paths[x])
92 | nlines <- length(lines[!startsWith(lines, "#")])
93 |
94 | df <- NULL
95 | if(nlines > 0) {
96 | df <- read.table(
97 | collinearity_paths[x], sep = "\t", comment.char = "#"
98 | )
99 | df <- df[, c(1, 2, 3)]
100 |
101 | # Reorder columns based on original order (not alphabetically)
102 | spp1 <- unlist(strsplit(names(collinearity_paths[x]), "_"))[1]
103 | id1 <- gsub("_.*", "", df$V2[1])
104 | new_names <- c("anchor1", "anchor2")
105 | if(!startsWith(spp1, id1)) {
106 | new_names <- c("anchor2", "anchor1")
107 | }
108 | names(df)[c(2, 3)] <- new_names
109 |
110 | # Get syntenic block IDs
111 | df$V1 <- gsub(":", "", df$V1)
112 | block_ids <- strsplit(df$V1, "-")
113 | blocks <- lapply(block_ids, function(x) return(as.numeric(x[1])))
114 |
115 | # Add syntenic block IDs to data frame
116 | df$block <- unlist(blocks)
117 | df <- df[, c("block", "anchor1", "anchor2")]
118 | }
119 | return(df)
120 | })
121 | blocks <- Reduce(rbind, blocks)
122 | return(blocks)
123 | }
124 |
125 |
126 | #' Get a data frame of intron counts per gene
127 | #'
128 | #' @param txdb A `TxDb` object with transcript annotations. See details below
129 | #' for examples on how to create `TxDb` objects from different kinds of input.
130 | #'
131 | #'
132 | #' @return A data frame with intron counts per gene, with variables:
133 | #' \describe{
134 | #' \item{gene}{Character with gene IDs.}
135 | #' \item{introns}{Numeric with number of introns per gene.}
136 | #' }
137 | #'
138 | #' @details
139 | #' The family of functions \code{makeTxDbFrom*} from
140 | #' the \strong{txdbmaker} package can be used to create `TxDb` objects
141 | #' from a variety of input data types. You can create `TxDb` objects
142 | #' from e.g., `GRanges` objects (\code{makeTxDbFromGRanges()}),
143 | #' GFF files (\code{makeTxDbFromGFF()}),
144 | #' an Ensembl database (\code{makeTxDbFromEnsembl}), and
145 | #' a Biomart database (\code{makeTxDbFromBiomart}).
146 | #'
147 | #' @rdname get_intron_counts
148 | #' @export
149 | #' @importFrom GenomicFeatures intronsByTranscript
150 | #' @importFrom AnnotationDbi select
151 | #'
152 | #' @examples
153 | #' data(yeast_annot)
154 | #'
155 | #' # Create TxDb object from GRanges
156 | #' library(txdbmaker)
157 | #' txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]])
158 | #'
159 | #' # Get intron counts
160 | #' intron_counts <- get_intron_counts(txdb)
161 | get_intron_counts <- function(txdb) {
162 |
163 | # Get a data frame with the number of introns per transcript
164 | introns_by_tx <- intronsByTranscript(txdb, use.names = TRUE)
165 |
166 | intron_counts <- data.frame(
167 | tx = names(introns_by_tx),
168 | introns = lengths(introns_by_tx)
169 | )
170 |
171 | # Create a data frame of transcript-to-gene mapping
172 | suppressMessages({
173 | tx2gene <- AnnotationDbi::select(
174 | txdb,
175 | keys = unique(intron_counts$tx),
176 | columns = "GENEID",
177 | keytype = "TXNAME"
178 | )
179 | })
180 | names(tx2gene) <- c("tx", "gene")
181 |
182 | # Create a data frame of intron counts per gene
183 | intron_counts_gene <- merge(intron_counts, tx2gene)[, c("gene", "introns")]
184 | intron_counts_gene <- intron_counts_gene[order(-intron_counts_gene$introns), ]
185 | intron_counts_gene <- intron_counts_gene[!duplicated(intron_counts_gene$gene), ]
186 | rownames(intron_counts_gene) <- NULL
187 |
188 | return(intron_counts_gene)
189 | }
190 |
191 |
192 |
193 | #' Find line intersect between pairs of Gaussian mixtures
194 | #'
195 | #' This function finds x-axis coordinate of n-1 intersections between lines
196 | #' of n Gaussian mixtures. Thus, it will find 1 intersection for Ks distros
197 | #' with 2 peaks, 2 intersections for distros with 2 peaks, and so on.
198 | #'
199 | #' @param peaks A list with elements \strong{mean}, \strong{sd},
200 | #' \strong{lambda}, and \strong{ks}, as returned by the
201 | #' function \code{fins_ks_peaks()}.
202 | #'
203 | #' @return A numeric scalar or vector with the x-axis coordinates of the
204 | #' intersections.
205 | #' @importFrom ggplot2 ggplot_build
206 | #' @noRd
207 | #' @rdname find_intersect_mixtures
208 | #' @examples
209 | #' data(fungi_kaks)
210 | #' scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
211 | #' ks <- scerevisiae_kaks$Ks
212 | #'
213 | #' # Find 2 peaks in Ks distribution
214 | #' peaks <- find_ks_peaks(ks, npeaks = 2)
215 | #'
216 | #' # Get intersects
217 | #' inter <- find_intersect_mixtures(peaks)
218 | find_intersect_mixtures <- function(peaks) {
219 |
220 | p <- plot_ks_peaks(peaks)
221 | npeaks <- length(peaks$mean)
222 | if(npeaks == 1) {
223 | stop("Cannot find intersect of peaks with only 1 peak.")
224 | }
225 |
226 | # Create list of density line indices to iterate through
227 | iteration_list <- list(
228 | c(2,3), c(3,4), c(4,5), c(5,6), c(6,7), c(7,8), c(8,9)
229 | )
230 | iteration_list <- iteration_list[seq_len(npeaks-1)]
231 |
232 | # Get intersection between density line i and density line i+1
233 | ints <- unlist(lapply(iteration_list, function(x) {
234 | l1 <- x[1]
235 | l2 <- x[2]
236 | line_df <- data.frame(
237 | x = ggplot_build(p)$data[[l1]]$x,
238 | line1 = ggplot_build(p)$data[[l1]]$y,
239 | line2 = ggplot_build(p)$data[[l2]]$y
240 | )
241 | # Get minimal distance between lines along y axis
242 | line_df$delta <- line_df$line1 - line_df$line2
243 |
244 | # Get x value for minimal delta y
245 | int <- line_df$x[which(diff(sign(diff((abs(line_df$delta))))) == 2)+1]
246 | return(int)
247 | }))
248 | return(ints)
249 | }
250 |
251 |
252 | #' Get a duplicate count matrix for each genome
253 | #'
254 | #' @param duplicate_list A list of data frames with the duplicated genes or
255 | #' gene pairs and their modes of duplication as returned
256 | #' by \code{classify_gene_pairs()} or \code{classify_genes()}.
257 | #' @param shape Character specifying the shape of the output data frame.
258 | #' One of "long" (data frame in the long shape, in the tidyverse sense),
259 | #' or "wide" (data frame in the wide shape, in the tidyverse sense).
260 | #' Default: "long".
261 | #'
262 | #' @return If \strong{shape = "wide"}, a count matrix containing the
263 | #' frequency of duplicated genes (or gene pairs) by mode for each species,
264 | #' with species in rows and duplication modes in columns.
265 | #' If \strong{shape = "long"}, a data frame in long format with the following
266 | #' variables:
267 | #' \describe{
268 | #' \item{type}{Factor, type of duplication.}
269 | #' \item{n}{Numeric, number of duplicates.}
270 | #' \item{species}{Character, species name}
271 | #' }
272 | #'
273 | #' @export
274 | #' @rdname duplicates2counts
275 | #' @examples
276 | #' data(fungi_kaks)
277 | #'
278 | #' # Get unique duplicates
279 | #' duplicate_list <- classify_genes(fungi_kaks)
280 | #'
281 | #' # Get count table
282 | #' counts <- duplicates2counts(duplicate_list)
283 | duplicates2counts <- function(duplicate_list, shape = "long") {
284 |
285 | # Get factor levels for variable `type`
286 | tlevels <- lapply(duplicate_list, function(x) return(levels(x$type)))
287 | tlevels <- tlevels[[names(sort(lengths(tlevels), decreasing = TRUE)[1])]]
288 |
289 | counts <- Reduce(rbind, lapply(seq_along(duplicate_list), function(x) {
290 |
291 | species <- names(duplicate_list)[x]
292 |
293 | dup_table <- duplicate_list[[x]]
294 | dup_table$type <- factor(dup_table$type, levels = tlevels)
295 |
296 | if(shape == "long") {
297 | final_dups <- as.data.frame(table(dup_table$type))
298 | names(final_dups) <- c("type", "n")
299 | final_dups$species <- species
300 | } else if(shape == "wide") {
301 | final_dups <- t(as.matrix(table(dup_table$type)))
302 | final_dups <- cbind(species, as.data.frame(final_dups))
303 | } else {
304 | stop("Argument 'format' must be one of 'long' or 'wide'.")
305 | }
306 |
307 | return(final_dups)
308 | }))
309 |
310 | return(counts)
311 | }
312 |
313 |
--------------------------------------------------------------------------------
/R/visualization.R:
--------------------------------------------------------------------------------
1 |
2 | #' Create a named vector with a color palette for duplication modes
3 | #'
4 | #' @return A named character vector with colors for each duplication mode.
5 | #' @noRd
6 | #'
7 | dup_palette <- function() {
8 |
9 | pal <- c(
10 | All = "gray20",
11 | SD = "#EFC000FF",
12 | TD = "#CD534CFF",
13 | PD = "#79AF97FF",
14 | TRD = "#7AA6DCFF",
15 | rTRD = "#7AA6DCFF",
16 | dTRD = "#003C67FF",
17 | DD = "#6A6599FF",
18 | SSD = "#7AA6DCFF"
19 | )
20 |
21 | return(pal)
22 | }
23 |
24 |
25 | #' Plot frequency of duplicates per mode for each species
26 | #'
27 | #' @param dup_counts A data frame in long format with the number of
28 | #' duplicates per mode for each species, as returned by
29 | #' the function \code{duplicates2counts}.
30 | #' @param plot_type Character indicating how to plot frequencies. One of
31 | #' 'facet' (facets for each level of the variable \strong{type}),
32 | #' 'stack' (levels of the variable \strong{type} as stacked bars), or
33 | #' 'stack_percent' (levels of the variable \strong{type} as stacked bars,
34 | #' with x-axis representing relative frequencies). Default: 'facet'.
35 | #' @param remove_zero Logical indicating whether or not to remove rows
36 | #' with zero values. Default: TRUE.
37 | #'
38 | #' @return A ggplot object.
39 | #'
40 | #' @importFrom ggplot2 ggplot aes geom_bar facet_wrap theme_bw theme labs
41 | #' scale_fill_manual element_blank
42 | #' @importFrom rlang .data
43 | #' @export
44 | #' @rdname plot_duplicate_freqs
45 | #' @examples
46 | #' data(fungi_kaks)
47 | #'
48 | #' # Get unique duplicates
49 | #' duplicate_list <- classify_genes(fungi_kaks)
50 | #'
51 | #' # Get count table
52 | #' dup_counts <- duplicates2counts(duplicate_list)
53 | #'
54 | #' # Plot counts
55 | #' plot_duplicate_freqs(dup_counts, plot_type = "stack_percent")
56 | plot_duplicate_freqs <- function(
57 | dup_counts, plot_type = "facet", remove_zero = TRUE
58 | ) {
59 |
60 | # Define palette
61 | pal <- dup_palette()
62 |
63 | # Remove zeros
64 | if(remove_zero) { dup_counts <- dup_counts[dup_counts$n != 0, ] }
65 |
66 | if(plot_type == "facet") {
67 | p <- ggplot(dup_counts, aes(x = .data$n, y = .data$species)) +
68 | geom_bar(
69 | aes(fill = .data$type), stat = "identity", color = "grey20",
70 | show.legend = FALSE
71 | ) +
72 | facet_wrap("type", nrow = 1, scales = "free_x") +
73 | labs(y = "", x = "Absolute frequency")
74 |
75 | } else if(plot_type == "stack") {
76 | p <- ggplot(
77 | dup_counts, aes(x = .data$n, y = .data$species, fill = .data$type)
78 | ) +
79 | geom_bar(color = "gray20", position = "stack", stat = "identity") +
80 | labs(fill = "Type", y = "", x = "Absolute frequency")
81 |
82 | } else if(plot_type == "stack_percent") {
83 | p <- ggplot(
84 | dup_counts, aes(x = .data$n, y = .data$species, fill = .data$type)
85 | ) +
86 | geom_bar(position = "fill", stat = "identity", color = "gray20") +
87 | labs(fill = "Type", y = "", x = "Relative frequency")
88 |
89 | } else {
90 | stop("Input to argument 'plot_type' must be one of 'facet', 'stack', or 'stack_percent'.")
91 | }
92 |
93 | p <- p +
94 | scale_fill_manual(values = pal) +
95 | theme_bw() +
96 | theme(panel.grid = element_blank())
97 |
98 | return(p)
99 | }
100 |
101 |
102 | #' Plot distribution of synonymous substitution rates (Ks)
103 | #'
104 | #' @param ks_df A data frame with Ks values for each gene pair
105 | #' as returned by \code{pairs2kaks()}.
106 | #' @param min_ks Numeric indicating the minimum Ks value to keep.
107 | #' Default: 0.01.
108 | #' @param max_ks Numeric indicating the maximum Ks value to keep.
109 | #' Default: 2.
110 | #' @param bytype Logical indicating whether or not to plot the distribution
111 | #' by type of duplication (requires a column named `type`).
112 | #' @param type_levels (Only valid if \strong{bytype} is not NULL) Character
113 | #' indicating which levels of the variable specified in
114 | #' parameter \strong{group_by} should be kept. By default, all levels are kept.
115 | #' @param plot_type Character indicating the type of plot to create.
116 | #' If \strong{bytype = TRUE}, possible types are "histogram" or "violin".
117 | #' If \strong{bytype = FALSE}, possible types are "histogram", "density",
118 | #' or "density_histogram". Default: "histogram".
119 | #' @param binwidth (Only valid if \strong{plot_type = "histogram"})
120 | #' Numeric indicating the bin width. Default: 0.03.
121 | #'
122 | #' @return A ggplot object.
123 | #'
124 | #' @importFrom ggplot2 ggplot aes geom_density geom_histogram facet_grid
125 | #' theme theme_bw geom_violin geom_boxplot scale_x_continuous vars
126 | #' scale_y_continuous after_stat
127 | #' @importFrom stats density
128 | #' @export
129 | #' @rdname plot_ks_distro
130 | #' @examples
131 | #' data(fungi_kaks)
132 | #' ks_df <- fungi_kaks$saccharomyces_cerevisiae
133 | #'
134 | #' # Plot distro
135 | #' plot_ks_distro(ks_df, bytype = TRUE)
136 | plot_ks_distro <- function(
137 | ks_df, min_ks = 0.01, max_ks = 2,
138 | bytype = FALSE, type_levels = NULL,
139 | plot_type = "histogram",
140 | binwidth = 0.03
141 | ) {
142 |
143 | pal <- dup_palette()
144 |
145 | # Filter Ks values
146 | filt_ks <- ks_df[ks_df$Ks >= min_ks & ks_df$Ks <= max_ks, ]
147 | filt_ks <- filt_ks[!is.na(filt_ks$Ks), ]
148 |
149 | if(bytype) {
150 | # Add level for all combined
151 | ks_all <- filt_ks
152 | ks_all$type <- factor("All", levels = "All")
153 | filt_ks <- rbind(ks_all, filt_ks)
154 |
155 | # Keep only desired levels (optional)
156 | if(!is.null(type_levels)) {
157 | filt_ks <- filt_ks[filt_ks$type %in% type_levels, ]
158 | filt_ks$type <- droplevels(filt_ks$type)
159 | }
160 |
161 | # Plot
162 | if(plot_type == "histogram") {
163 | p <- ggplot(filt_ks, aes(x = .data$Ks)) +
164 | geom_histogram(
165 | aes(fill = .data$type),
166 | color = "gray30", binwidth = binwidth, show.legend = FALSE
167 | ) +
168 | scale_fill_manual(values = pal) +
169 | facet_grid(rows = vars(.data$type), scales = "free_y") +
170 | labs(y = "Count")
171 | } else if(plot_type == "violin") {
172 | p <- ggplot(filt_ks, aes(x = .data$Ks, y = .data$type)) +
173 | geom_violin(aes(fill = .data$type), show.legend = FALSE) +
174 | scale_fill_manual(values = pal) +
175 | labs(y = "Type")
176 | } else {
177 | stop("When plotting by groups, `plot_type` must be either 'histogram' or 'violin'.")
178 | }
179 | p <- p + labs(title = "Ks distribution for gene pairs by mode")
180 | } else {
181 |
182 | if(plot_type == "histogram") {
183 | p <- ggplot(filt_ks, aes(x = .data$Ks)) +
184 | geom_histogram(
185 | fill = "#9196ca", color = "#3e57a7", binwidth = binwidth
186 | ) +
187 | labs(y = "Count")
188 | } else if(plot_type == "density") {
189 | p <- ggplot(filt_ks, aes(x = .data$Ks)) +
190 | geom_density(
191 | fill = "#9196ca", color = "#3e57a7"
192 | ) +
193 | labs(y = "Density")
194 | } else if(plot_type == "density_histogram") {
195 | p <- ggplot(filt_ks, aes(x = .data$Ks)) +
196 | geom_histogram(
197 | aes(y = after_stat(density)), alpha = 0.5,
198 | fill = "#9196ca", color = "#3e57a7", binwidth = binwidth
199 | ) +
200 | geom_density(color = "gray30", linewidth = 1) +
201 | labs(y = "Density")
202 | } else {
203 | stop("Without groups, `plot_type` must be one of 'histogram', 'density', or 'density_histogram'.")
204 | }
205 | p <- p + scale_y_continuous(expand = c(1e-2, 1e-2)) +
206 | labs(title = "Ks distribution for gene pairs")
207 | }
208 |
209 | # Polish plot
210 | p <- p +
211 | theme_bw() +
212 | theme(panel.grid = element_blank()) +
213 | scale_x_continuous(expand = c(1e-2, 1e-2)) +
214 | labs(x = expression(K[s]))
215 |
216 | return(p)
217 | }
218 |
219 |
220 | #' Plot distributions of substitution rates (Ka, Ks, or Ka/Ks) per species
221 | #'
222 | #' @param kaks_list A list of data frames with substitution rates per gene
223 | #' pair in each species as returned by \code{pairs2kaks()}.
224 | #' @param rate_column Character indicating the name of the column to plot.
225 | #' Default: "Ks".
226 | #' @param bytype Logical indicating whether or not to show distributions by
227 | #' type of duplication. Default: FALSE.
228 | #' @param range Numeric vector of length 2 indicating the minimum and maximum
229 | #' values to plot. Default: \code{c(0, 2)}.
230 | #' @param fill Character with color to use for the fill aesthetic. Ignored
231 | #' if \strong{bytype = TRUE}. Default: "deepskyblue3".
232 | #' @param color Character with color to use for the color aesthetic. Ignored
233 | #' if \strong{bytype = FALSE}. Default: "deepskyblue4".
234 | #'
235 | #' @return A ggplot object.
236 | #'
237 | #' @details
238 | #' Data will be plotted using the species order of the list. To change the
239 | #' order of the species to plot, reorder the list elements
240 | #' in \strong{kaks_list}.
241 | #'
242 | #' @importFrom ggplot2 geom_violin scale_fill_manual facet_wrap
243 | #' @export
244 | #' @rdname plot_rates_by_species
245 | #' @examples
246 | #' data(fungi_kaks)
247 | #'
248 | #' # Plot rates
249 | #' plot_rates_by_species(fungi_kaks, rate_column = "Ka_Ks")
250 | plot_rates_by_species <- function(
251 | kaks_list, rate_column = "Ks", bytype = FALSE, range = c(0, 2),
252 | fill = "deepskyblue3", color = "deepskyblue4"
253 | ) {
254 |
255 | xl <- switch(
256 | rate_column,
257 | "Ks" = expression(K[s]),
258 | "Ka" = expression(K[a]),
259 | "Ka_Ks" = expression(K[a] / K[s]),
260 | stop("Input to 'rate_column' must be one of 'Ka', 'Ks', or 'Ka_Ks'.")
261 | )
262 |
263 | # From list to data frame
264 | rate_df <- Reduce(rbind, lapply(seq_along(kaks_list), function(x) {
265 |
266 | df <- kaks_list[[x]]
267 | df$species <- names(kaks_list)[x]
268 |
269 | return(df)
270 | }))
271 | rate_df <- rate_df[!is.na(rate_df[[rate_column]]), ]
272 | rate_df <- rate_df[rate_df[[rate_column]] >= range[1] &
273 | rate_df[[rate_column]] <= range[2], ]
274 | rate_df$species <- factor(rate_df$species, levels = names(kaks_list))
275 |
276 | # Create plot
277 | if(bytype) {
278 | p <- ggplot(rate_df, aes(x = .data[[rate_column]], y = .data$species)) +
279 | geom_violin(aes(fill = .data$type), show.legend = FALSE) +
280 | scale_fill_manual(values = dup_palette()) +
281 | facet_wrap("type", nrow = 1)
282 |
283 | } else {
284 | p <- ggplot(rate_df, aes(x = .data[[rate_column]], y = .data$species)) +
285 | geom_violin(fill = fill, color = color)
286 | }
287 |
288 | # Polish the plot
289 | p <- p +
290 | theme_bw() +
291 | labs(x = xl, y = "") +
292 | theme(panel.grid = element_blank())
293 |
294 | return(p)
295 | }
296 |
297 |
298 | #' Plot histogram of Ks distribution with peaks
299 | #'
300 | #' @param peaks A list with elements \strong{mean}, \strong{sd},
301 | #' \strong{lambda}, and \strong{ks}, as returned by the
302 | #' function \code{fins_ks_peaks()}.
303 | #' @param binwidth Numeric scalar with binwidth for the histogram.
304 | #' Default: 0.05.
305 | #'
306 | #' @return A ggplot object with a histogram and lines for each Ks peak.
307 | #'
308 | #' @importFrom ggplot2 ggplot aes geom_histogram ggplot stat_function
309 | #' labs theme_bw
310 | #' @importFrom stats dnorm
311 | #' @importFrom rlang .data
312 | #' @rdname plot_ks_peaks
313 | #' @export
314 | #' @examples
315 | #' data(fungi_kaks)
316 | #' scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
317 | #' ks <- scerevisiae_kaks$Ks
318 | #'
319 | #' # Find 2 peaks in Ks distribution
320 | #' peaks <- find_ks_peaks(ks, npeaks = 2)
321 | #'
322 | #' # Plot
323 | #' plot_ks_peaks(peaks, binwidth = 0.05)
324 | plot_ks_peaks <- function(peaks = NULL, binwidth = 0.05) {
325 |
326 | ks_df <- data.frame(ks = peaks$ks)
327 |
328 | # Define color palette
329 | pal <- c(
330 | "#6A6599FF", "#79AF97FF", "#B24745FF", "#00A1D5FF",
331 | "#DF8F44FF", "#374E55FF", "#F39B7FFF", "#3C5488FF"
332 | )
333 |
334 | pal <- as.list(rev(pal[seq_along(peaks$mean)]))
335 |
336 | # Plot
337 | p <- ggplot(ks_df, aes(x = .data$ks)) +
338 | geom_histogram(binwidth = binwidth, color = "black", fill = "grey80") +
339 | mapply(function(mean, sd, lambda, n, binwidth, color) {
340 | stat_function(geom = "line", fun = function(x) {
341 | (dnorm(x, mean = mean, sd = sd)) * n * binwidth * lambda
342 | },
343 | color = color, linewidth = 1.5)
344 | }, mean = peaks$mean, sd = peaks$sd, lambda = peaks$lambda,
345 | n = length(ks_df$ks), binwidth = binwidth,
346 | color = pal) +
347 | theme_bw() +
348 | theme(panel.grid = element_blank()) +
349 | labs(title = "Ks distribution with peaks", y = "Frequency",
350 | x = "Ks values")
351 |
352 | return(p)
353 | }
--------------------------------------------------------------------------------
/R/utils_duplicate_classification.R:
--------------------------------------------------------------------------------
1 |
2 |
3 | #' Classify gene pairs derived from segmental duplications
4 | #'
5 | #' @param anchor_pairs A 2-column data frame with anchor pairs in columns 1
6 | #' and 2.
7 | #' @param pairs A 2-column data frame with all duplicate pairs. This
8 | #' is equivalent to the first 2 columns of the tabular output of BLAST-like
9 | #' programs.
10 | #'
11 | #' @return A 3-column data frame with the variables:
12 | #' \describe{
13 | #' \item{dup1}{Character, duplicated gene 1}
14 | #' \item{dup2}{Character, duplicated gene 2}
15 | #' \item{type}{Factor indicating duplication types, with levels
16 | #' "SD" (segmental duplication) or
17 | #' "DD" (dispersed duplication).}
18 | #' }
19 | #' @rdname get_segmental
20 | #' @export
21 | #' @examples
22 | #' data(diamond_intra)
23 | #' data(yeast_annot)
24 | #' data(yeast_seq)
25 | #' blast_list <- diamond_intra
26 | #'
27 | #' # Get processed annotation for S. cerevisiae
28 | #' annotation <- syntenet::process_input(yeast_seq, yeast_annot)$annotation[1]
29 | #'
30 | #' # Get list of intraspecies anchor pairs
31 | #' anchor_pairs <- get_anchors_list(blast_list, annotation)
32 | #' anchor_pairs <- anchor_pairs[[1]][, c(1, 2)]
33 | #'
34 | #' # Get duplicate pairs from DIAMOND output
35 | #' duplicates <- diamond_intra[[1]][, c(1, 2)]
36 | #' dups <- get_segmental(anchor_pairs, duplicates)
37 | get_segmental <- function(anchor_pairs = NULL, pairs = NULL) {
38 |
39 | names(pairs) <- c("dup1", "dup2")
40 | p <- pairs[pairs$dup1 != pairs$dup2, ]
41 | anchorp <- anchor_pairs
42 |
43 | duplicates <- p
44 | duplicates$type <- "DD"
45 |
46 | if(!is.null(anchorp)) {
47 | names(anchorp) <- c("anchor1", "anchor2")
48 |
49 | # Look for anchor pairs in duplicate pairs - vector-based approach
50 | p_vector <- paste0(p$dup1, p$dup2)
51 | anchor_vector <- c(
52 | paste0(anchorp$anchor1, anchorp$anchor2),
53 | paste0(anchorp$anchor2, anchorp$anchor1)
54 | )
55 |
56 | # Add column `type` with classification
57 | dup_mode <- ifelse(p_vector %in% anchor_vector, "SD", "DD")
58 | duplicates$type <- dup_mode
59 | }
60 | duplicates$type <- factor(duplicates$type, levels = c("SD", "DD"))
61 |
62 | return(duplicates)
63 | }
64 |
65 |
66 | #' Classify gene pairs derived from tandem and proximal duplications
67 | #'
68 | #' @param pairs A 3-column data frame with columns \strong{dup1}, \strong{dup2},
69 | #' and \strong{type} indicating duplicated gene 1, duplicated gene 2, and
70 | #' the mode of duplication associated with the pair. This data frame
71 | #' is returned by \code{get_segmental()}.
72 | #' @param annotation_granges A processed GRanges object as in each element
73 | #' of the list returned by \code{syntenet::process_input()}.
74 | #' @param proximal_max Numeric scalar with the maximum distance (in number
75 | #' of genes) between two genes to consider them as proximal duplicates.
76 | #' Default: 10.
77 | #'
78 | #' @return A 3-column data frame with the variables:
79 | #' \describe{
80 | #' \item{dup1}{Character, duplicated gene 1.}
81 | #' \item{dup2}{Character, duplicated gene 2.}
82 | #' \item{type}{Factor of duplication types, with levels
83 | #' "SD" (segmental duplication),
84 | #' "TD" (tandem duplication),
85 | #' "PD" (proximal duplication), and
86 | #' "DD" (dispersed duplication).}
87 | #' }
88 | #' @rdname get_tandem_proximal
89 | #' @export
90 | #' @examples
91 | #' data(yeast_annot)
92 | #' data(yeast_seq)
93 | #' data(fungi_kaks)
94 | #' scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
95 | #'
96 | #' # Get processed annotation for S. cerevisiae
97 | #' pdata <- annotation <- syntenet::process_input(yeast_seq, yeast_annot)
98 | #' annot <- pdata$annotation[[1]]
99 | #'
100 | #' # Get duplicated pairs
101 | #' pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")]
102 | #' pairs$dup1 <- paste0("Sce_", pairs$dup1)
103 | #' pairs$dup2 <- paste0("Sce_", pairs$dup2)
104 | #'
105 | #' # Get tandem and proximal duplicates
106 | #' td_pd_pairs <- get_tandem_proximal(pairs, annot)
107 | #'
108 | get_tandem_proximal <- function(
109 | pairs = NULL, annotation_granges = NULL, proximal_max = 10
110 | ) {
111 |
112 | annot <- as.data.frame(annotation_granges)
113 | annot <- annot[, c("seqnames", "gene", "start", "end")]
114 | pairs$type <- as.character(pairs$type)
115 | ssd_pairs <- pairs[pairs$type == "DD", ]
116 |
117 | # Add chromosome number and order in the chromosome for each gene
118 | annot <- annot[order(annot$seqnames, annot$start), ]
119 | annot_bychr <- split(annot, annot$seqnames)
120 | annot_order <- Reduce(rbind, lapply(annot_bychr, function(x) {
121 | x$order <- seq_len(nrow(x))
122 | return(x[, c("seqnames", "gene", "order")])
123 | }))
124 |
125 | # Create df with `dup1`, `dup2`, `type`, `chr1`, `order1`, `chr2`, `order2`
126 | ssd_pos <- merge(ssd_pairs, annot_order, by.x = "dup1", by.y = "gene")
127 | names(ssd_pos)[c(4, 5)] <- c("chr_dup1", "order_dup1")
128 | ssd_pos <- merge(
129 | ssd_pos, annot_order, sort = FALSE, by.x = "dup2", by.y = "gene"
130 | )[, c(2, 1, 3, 4, 5, 6, 7)] # dup1, dup2, type, chr1, order1, chr2, order2
131 | names(ssd_pos)[c(6, 7)] <- c("chr_dup2", "order_dup2")
132 |
133 | # Find tandem and proximal duplicates
134 | ssd_pos$dist <- abs(ssd_pos$order_dup1 - ssd_pos$order_dup2)
135 | td_idx <- which(ssd_pos$chr_dup1 == ssd_pos$chr_dup2 & ssd_pos$dist == 1)
136 | pd_idx <- which(ssd_pos$chr_dup1 == ssd_pos$chr_dup2 & ssd_pos$dist > 1 &
137 | ssd_pos$dist <= proximal_max)
138 |
139 | if(length(td_idx) > 0) { ssd_pos$type[td_idx] <- "TD" }
140 | if(length(pd_idx) > 0) { ssd_pos$type[pd_idx] <- "PD" }
141 |
142 | duplicates <- rbind(pairs[pairs$type != "DD", ], ssd_pos[, c(1, 2, 3)])
143 | l <- c("SD", "TD", "PD", "DD")
144 | duplicates$type <- factor(duplicates$type, levels = l)
145 |
146 | return(duplicates)
147 | }
148 |
149 |
150 | #' Get syntenic block ID for each gene in a gene pair
151 | #'
152 | #' @param pair A 2-column data frame with gene pairs.
153 | #' @param syn_df A 2-column data frame with gene ID in column 1, and synteny
154 | #' block ID in column 2.
155 | #'
156 | #' @return A data frame of 4 columns as below:
157 | #' \describe{
158 | #' \item{dup1}{Character, ID of duplicated gene 1.}
159 | #' \item{dup2}{Character, ID of duplicated gene 2.}
160 | #' \item{block1}{Numeric, syntenic block ID of gene 1.}
161 | #' \item{block2}{Numeric, syntenic block ID of gene 2.}
162 | #' }
163 | #'
164 | #' @noRd
165 | pairs_and_synblocks <- function(pairs, syn_df) {
166 |
167 | names(pairs)[1:2] <- c("dup1", "dup2")
168 | names(syn_df)[1:2] <- c("anchor", "block")
169 |
170 | pairs_ancestral <- merge(
171 | pairs[, c(1, 2)], syn_df, by.x = "dup1", by.y = "anchor",
172 | all.x = TRUE
173 | )
174 | pairs_ancestral <- merge(
175 | pairs_ancestral, syn_df, by.x = "dup2", by.y = "anchor",
176 | all.x = TRUE
177 | )
178 | names(pairs_ancestral)[c(3, 4)] <- c("block1", "block2")
179 |
180 | return(pairs_ancestral)
181 | }
182 |
183 |
184 | #' Classify gene pairs originating from transposon-derived duplications
185 | #'
186 | #' @param pairs A 3-column data frame with columns \strong{dup1}, \strong{dup2},
187 | #' and \strong{type} indicating duplicated gene 1, duplicated gene 2, and
188 | #' the mode of duplication associated with the pair. This data frame
189 | #' is returned by \code{get_tandem_proximal()}.
190 | #' @param blast_inter A list of data frames of length 1
191 | #' containing BLAST tabular output for the comparison between the target
192 | #' species and an outgroup. Names of list elements must match the names of
193 | #' list elements in `annotation`. BLASTp, DIAMOND or simular programs must
194 | #' be run on processed sequence data as returned
195 | #' by \code{syntenet::process_input()}.
196 | #' @param annotation A processed GRangesList or CompressedGRangesList object as
197 | #' returned by \code{syntenet::process_input()}.
198 | #' @param evalue Numeric scalar indicating the E-value threshold.
199 | #' Default: 1e-10.
200 | #' @param anchors Numeric indicating the minimum required number of genes
201 | #' to call a syntenic block, as in \code{syntenet::infer_syntenet}.
202 | #' Default: 5.
203 | #' @param max_gaps Numeric indicating the number of upstream and downstream
204 | #' genes to search for anchors, as in \code{syntenet::infer_syntenet}.
205 | #' Default: 25.
206 | #' @param collinearity_dir Character indicating the path to the directory
207 | #' where .collinearity files will be stored. If NULL, files will
208 | #' be stored in a subdirectory of \code{tempdir()}. Default: NULL.
209 | #' @param outgroup_coverage Numeric indicating the minimum percentage of
210 | #' outgroup species to use to consider genes as transposed duplicates. Only
211 | #' valid if multiple outgroup species are present (see details below). Values
212 | #' should range from 0 to 100. Default: 70.
213 | #'
214 | #'
215 | #' @return A 3-column data frame with the following variables:
216 | #' \describe{
217 | #' \item{dup1}{Character, duplicated gene 1.}
218 | #' \item{dup2}{Character, duplicated gene 2.}
219 | #' \item{type}{Factor of duplication types, with levels
220 | #' "SD" (segmental duplication),
221 | #' "TD" (tandem duplication),
222 | #' "PD" (proximal duplication),
223 | #' "TRD" (transposon-derived duplication), and
224 | #' "DD" (dispersed duplication).}
225 | #' }
226 | #'
227 | #' @details
228 | #' If the list of interspecies DIAMOND tables contain comparisons of the
229 | #' same species to multiple outgroups (e.g.,
230 | #' 'speciesA_speciesB', 'speciesA_speciesC'), this function will check if
231 | #' gene pairs are classified as transposed (i.e.,
232 | #' only one gene is an ancestral locus) in each of the outgroup species,
233 | #' and then calculate the percentage of outgroup species in which each pair
234 | #' is considered 'transposed'. For instance, gene pair 1 is transposed based on
235 | #' 30\% of the outgroup species, gene pair is considered as transposed based
236 | #' on 100\% of the outgroup species, gene pair 3 is considered as transposed
237 | #' based on 0\% of the outgroup species, and so on.
238 | #' Parameter \strong{outgroup_coverage} lets you choose a minimum percentage
239 | #' cut-off to classify pairs as transposed.
240 | #'
241 | #' @importFrom syntenet interspecies_synteny
242 | #' @export
243 | #' @rdname get_transposed
244 | #' @examples
245 | #' # Load example data
246 | #' data(diamond_inter)
247 | #' data(yeast_seq)
248 | #' data(yeast_annot)
249 | #' data(fungi_kaks)
250 | #' scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
251 | #'
252 | #' # Get processed annotation
253 | #' pdata <- syntenet::process_input(yeast_seq, yeast_annot)
254 | #' annotation <- pdata$annotation
255 | #'
256 | #' # Get duplicated pairs
257 | #' pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")]
258 | #' pairs$dup1 <- paste0("Sce_", pairs$dup1)
259 | #' pairs$dup2 <- paste0("Sce_", pairs$dup2)
260 | #'
261 | #' # Collapse bidirectional hits
262 | #' compare <- data.frame(target = "Scerevisiae", outgroup = "Cglabrata")
263 | #' blast_inter <- syntenet::collapse_bidirectional_hits(diamond_inter, compare)
264 | #'
265 | #' # Classify pairs
266 | #' trd <- get_transposed(pairs, blast_inter, annotation)
267 | #'
268 | get_transposed <- function(
269 | pairs, blast_inter, annotation,
270 | evalue = 1e-10, anchors = 5, max_gaps = 25,
271 | collinearity_dir = NULL, outgroup_coverage = 70
272 | ) {
273 |
274 | blast_inter <- lapply(blast_inter, function(x) return(x[x$evalue <= evalue, ]))
275 | pairs$type <- as.character(pairs$type)
276 | pairs_dd <- pairs[pairs$type == "DD", ]
277 |
278 | # Define directory where interspecies .collinearity files will be stored
279 | interdir <- collinearity_dir
280 | if(is.null(interdir)) {
281 | daytime <- format(Sys.time(), "%d_%b_%Y_%Hh%M")
282 | interdir <- file.path(tempdir(), paste0("inter_", daytime))
283 | }
284 |
285 | # Get name of target and outgroup species
286 | target <- unlist(lapply(names(annotation), function(x) {
287 | nfound <- sum(grepl(paste0(x, "_"), names(blast_inter)))
288 | found <- rep(x, nfound)
289 | return(found)
290 | }))
291 |
292 | outgroup <- unlist(lapply(names(annotation), function(x) {
293 | nfound <- sum(grepl(paste0(x, "$"), names(blast_inter)))
294 | found <- rep(x, nfound)
295 | return(found)
296 | }))
297 |
298 | # For each outgroup, get data frame indicating if pairs is tranposed
299 | trd_df <- lapply(seq_along(outgroup), function(n) {
300 | # Detect syntenic regions between `target` and `outgroup`
301 | syn <- syntenet::interspecies_synteny(
302 | blast_inter[n],
303 | annotation = annotation[c(target[n], outgroup[n])],
304 | inter_dir = interdir,
305 | anchors = anchors,
306 | max_gaps = max_gaps
307 | )
308 |
309 | # Read and parse interspecies synteny results
310 | parsed_syn <- collinearity2blocks(syn)[, c("anchor1", "block")]
311 | parsed_syn <- parsed_syn[!duplicated(parsed_syn$anchor1), ]
312 |
313 | pairs_ancestral <- pairs_dd[, c(1, 2)]
314 | pairs_ancestral$ancestral <- FALSE
315 | if(!is.null(parsed_syn)) {
316 |
317 | # Find TRD-derived genes (only one member of pair in ancestral loci)
318 | pairs_ancestral <- pairs_and_synblocks(pairs_dd, parsed_syn)
319 |
320 | nas <- apply(pairs_ancestral[, c(3, 4)], 1, function(x) return(sum(is.na(x))))
321 | pairs_ancestral$ancestral <- ifelse(nas == 1, TRUE, FALSE)
322 | pairs_ancestral <- pairs_ancestral[, c("dup1", "dup2", "ancestral")]
323 | }
324 |
325 | return(pairs_ancestral)
326 | })
327 | nout <- length(trd_df)
328 | trd_df <- Reduce(function(x, y) merge(x, y, by = c("dup1", "dup2"), all = TRUE), trd_df)
329 | names(trd_df)[seq(3, nout+2, 1)] <- paste0("ancestral", seq_len(nout))
330 |
331 | # Calculate percentage of outgroups in which pair is classified as transposed
332 | perc_trd <- (rowSums(trd_df[, -c(1,2), drop = FALSE]) / nout) * 100
333 |
334 | # Classify pairs as TRD if percentage >= outgroup_coverage
335 | trd_df$type <- ifelse(perc_trd >= outgroup_coverage, "TRD", "DD")
336 | final <- rbind(
337 | pairs[pairs$type != "DD", ],
338 | trd_df[, c("dup1", "dup2", "type")]
339 | )
340 | final$type <- factor(final$type, levels = c("SD", "TD", "PD", "TRD", "DD"))
341 |
342 | return(final)
343 | }
344 |
345 |
346 |
347 | #' Classify TRD genes as derived from either DNA transposons or retrotransposons
348 | #'
349 | #' @param pairs A 3-column data frame with columns \strong{dup1}, \strong{dup2},
350 | #' and \strong{type} indicating duplicated gene 1, duplicated gene 2, and
351 | #' the mode of duplication associated with the pair. This data frame
352 | #' is returned by \code{get_transposed()}.
353 | #' @param intron_counts A 2-column data frame with columns \strong{gene}
354 | #' and \strong{introns} indicating the number of introns for each gene,
355 | #' as returned by \code{get_intron_counts}.
356 | #'
357 | #' @return A 3-column data frame with the following variables:
358 | #' \describe{
359 | #' \item{dup1}{Character, duplicated gene 1.}
360 | #' \item{dup2}{Character, duplicated gene 2.}
361 | #' \item{type}{Factor of duplication types, with levels
362 | #' "SD" (segmental duplication),
363 | #' "TD" (tandem duplication),
364 | #' "PD" (proximal duplication),
365 | #' "dTRD" (DNA transposon-derived duplication),
366 | #' "rTRD" (retrotransposon-derived duplication), and
367 | #' "DD" (dispersed duplication).}
368 | #' }
369 | #'
370 | #' @rdname get_transposed_classes
371 | #' @export
372 | #' @examples
373 | #' data(diamond_inter)
374 | #' data(diamond_intra)
375 | #' data(yeast_seq)
376 | #' data(yeast_annot)
377 | #' data(fungi_kaks)
378 | #' scerevisiae_kaks <- fungi_kaks$saccharomyces_cerevisiae
379 | #'
380 | #' # Get processed annotation
381 | #' pdata <- syntenet::process_input(yeast_seq, yeast_annot)
382 | #' annotation <- pdata$annotation
383 | #'
384 | #' # Get duplicated pairs
385 | #' pairs <- scerevisiae_kaks[, c("dup1", "dup2", "type")]
386 | #' pairs$dup1 <- paste0("Sce_", pairs$dup1)
387 | #' pairs$dup2 <- paste0("Sce_", pairs$dup2)
388 | #'
389 | #' # Classify pairs
390 | #' trd <- get_transposed(pairs, diamond_inter, annotation)
391 | #'
392 | #' # Create TxDb object from GRanges
393 | #' library(txdbmaker)
394 | #' txdb <- txdbmaker::makeTxDbFromGRanges(yeast_annot[[1]])
395 | #'
396 | #' # Get intron counts
397 | #' intron_counts <- get_intron_counts(txdb)
398 | #'
399 | #' # Get TRD classes
400 | #' trd_classes <- get_transposed_classes(trd, intron_counts)
401 | #'
402 | get_transposed_classes <- function(pairs, intron_counts) {
403 |
404 | # Get TRD pairs
405 | pairs$type <- as.character(pairs$type)
406 | tpairs <- pairs[pairs$type == "TRD", ]
407 |
408 | final_pairs <- pairs
409 | if(nrow(tpairs) > 0) {
410 |
411 | id <- unique(gsub("_.*", "", tpairs$dup1))
412 | tpairs$dup1 <- gsub("^[a-zA-Z]{2,5}_", "", tpairs$dup1)
413 | tpairs$dup2 <- gsub("^[a-zA-Z]{2,5}_", "", tpairs$dup2)
414 |
415 | # Combine `tpairs` and `intron_counts`
416 | pairs_ic <- merge(
417 | tpairs, intron_counts, by.x = "dup1", by.y = "gene", all.x = TRUE
418 | )
419 | pairs_ic <- merge(
420 | pairs_ic, intron_counts, by.x = "dup2", by.y = "gene", all.x = TRUE
421 | )
422 | names(pairs_ic)[c(4,5)] <- c("introns1", "introns2")
423 |
424 | # Create a column with number of genes in pair with 0 introns
425 | pairs_ic$count <- apply(pairs_ic[, 4:5], 1, function(x) sum(x == 0))
426 |
427 | # 'rTRD' if only one gene has no introns, as 'dTRD' otherwise
428 | pairs_ic$type <- ifelse(pairs_ic$count == 1, "rTRD", "dTRD")
429 |
430 | final_pairs <- pairs_ic[, c("dup1", "dup2", "type")]
431 |
432 | # Add species IDs back
433 | final_pairs$dup1 <- paste0(id, "_", final_pairs$dup1)
434 | final_pairs$dup2 <- paste0(id, "_", final_pairs$dup2)
435 |
436 | final_pairs <- rbind(pairs[pairs$type != "TRD", ], final_pairs)
437 | }
438 |
439 | l <- c("SD", "TD", "PD", "rTRD", "dTRD", "DD")
440 | final_pairs$type <- factor(final_pairs$type, levels = l)
441 |
442 | return(final_pairs)
443 | }
444 |
445 |
--------------------------------------------------------------------------------
/vignettes/doubletrouble_vignette.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Identification and classification of duplicated genes"
3 | author:
4 | - name: Fabricio Almeida-Silva
5 | affiliation: |
6 | VIB-UGent Center for Plant Systems Biology, Ghent University,
7 | Ghent, Belgium
8 | - name: Yves Van de Peer
9 | affiliation: |
10 | VIB-UGent Center for Plant Systems Biology, Ghent University,
11 | Ghent, Belgium
12 | output:
13 | BiocStyle::html_document:
14 | toc: true
15 | number_sections: yes
16 | bibliography: bibliography.bib
17 | vignette: >
18 | %\VignetteIndexEntry{Identification and classification of duplicated genes}
19 | %\VignetteEngine{knitr::rmarkdown}
20 | %\VignetteEncoding{UTF-8}
21 | ---
22 |
23 | ```{r setup, include = FALSE}
24 | knitr::opts_chunk$set(
25 | collapse = TRUE,
26 | comment = "#>",
27 | crop = NULL
28 | )
29 | ```
30 |
31 | # Introduction
32 |
33 | Gene and genome duplications are a source of raw genetic material for evolution
34 | [@ohno2013evolution]. However, whole-genome duplications (WGD) and small-scale
35 | duplications (SSD) contribute to genome evolution in different manners. To
36 | help you explore the different contributions of WGD and SSD to evolution, we
37 | developed `r BiocStyle::Githubpkg("doubletrouble")`, a package that can be
38 | used to identify and classify duplicated genes from whole-genome
39 | protein sequences, calculate substitution rates per substitution site (i.e.,
40 | $K_a$ and $K_s$) for gene pairs, find peaks in $K_s$ distributions, and classify
41 | gene pairs by age groups.
42 |
43 | # Installation
44 |
45 | You can install `r BiocStyle::Githubpkg("doubletrouble")` from Bioconductor
46 | with the following code:
47 |
48 | ```{r installation, eval = FALSE}
49 | if(!requireNamespace("BiocManager", quietly = TRUE)) {
50 | install.packages("BiocManager")
51 | }
52 |
53 | BiocManager::install("doubletrouble")
54 |
55 | ## Check that you have a valid Bioconductor installation
56 | BiocManager::valid()
57 | ```
58 |
59 | Then, you can load the package:
60 |
61 | ```{r load_package}
62 | library(doubletrouble)
63 | ```
64 |
65 | # Input data
66 |
67 | To identify and classify duplicated gene pairs, users need two types of input
68 | data:
69 |
70 | 1. Whole-genome protein sequences (a.k.a. "proteome"), with only one protein
71 | sequence per gene (i.e., translated sequence of the primary transcript).
72 | These are typically stored in *.fasta* files.
73 |
74 | 2. Gene annotation, with genomic coordinates of all features (i.e., genes, exons,
75 | etc). These are typically stored in *.gff3/.gff/.gtf* files.
76 |
77 | 3. (Optional) Coding sequences (CDS), with only one DNA sequence sequence per
78 | gene. These are only required for users who want to calculate
79 | substitution rates (i.e., $K_a$, $K_s$, and their ratio $K_a/K_s$),
80 | and they are typically stored in *.fasta* files.
81 |
82 | In the Bioconductor ecosystem, sequences and ranges are stored in
83 | standardized S4 classes named
84 | `XStringSet` (`AAStringSet` for proteins, `DNAStringSet` for DNA) and `GRanges`,
85 | respectively. This ensures seamless interoperability across packages, which
86 | is important for both users and package developers.
87 | Thus, `r BiocStyle::Biocpkg("doubletrouble")` expects
88 | proteomes in lists of `AAStringSet` objects, and annotations in lists of
89 | `GRanges` objects. Below you can find a summary of input data types, their
90 | typical file formats, and Bioconductor class.
91 |
92 | | Input data | File format | Bioconductor class | Requirement |
93 | |:-----------|:------------|:-------------------|:------------|
94 | | Proteome | FASTA | `AAStringSet` | Mandatory |
95 | | Annotation | GFF3/GTF | `GRanges` | Mandatory |
96 | | CDS | FASTA | `DNAStringSet` | Optional |
97 |
98 | Names of list elements represent species identifiers
99 | (e.g., name, abbreviations, taxonomy IDs, or anything you like), and **must**
100 | be consistent across different lists, so correspondence can be made.
101 | For instance, suppose you have an object named `seqs` with a list of
102 | `AAStringSet` objects (proteomes for each species)
103 | named *Athaliana*, *Alyrata*, and *Bnapus*. You also have an object
104 | named `annotation` with a list of `GRanges` objects (gene annotation for
105 | each species). In this example, list names in `annotation` must also be
106 | *Athaliana*, *Alyrata*, and *Bnapus* (not necessarily in that order), so that
107 | `r BiocStyle::Biocpkg("doubletrouble")` knows that element *Athaliana*
108 | in `seqs` corresponds to element *Athaliana* in `annotation`. You can check
109 | that with:
110 |
111 | ```{r eval=FALSE}
112 | # Checking if names of lists match
113 | setequal(names(seqs), names(annotation)) # should return TRUE
114 | ```
115 |
116 | **IMPORTANT:** If you have protein sequences as FASTA files in a directory,
117 | you can read them into a list of `AAStringSet` objects with the function
118 | `fasta2AAStringSetlist()` from the Bioconductor package
119 | `r BiocStyle::Biocpkg("syntenet")`. Likewise, you can get a `GRangesList`
120 | object from GFF/GTF files with the function `gff2GRangesList()`, also
121 | from `r BiocStyle::Biocpkg("syntenet")`.
122 |
123 |
124 | # Getting to know the example data sets
125 |
126 | In this vignette, we will use data (proteome, gene annotations, and CDS) from
127 | the yeast species *Saccharomyces cerevisiae* and *Candida glabrata*,
128 | since their genomes are relatively small (and, hence, great for
129 | demonstration purposes). Our goal here is to identify and classify duplicated
130 | genes in the *S. cerevisiae* genome. The *C. glabrata* genome will be used as an outgroup to identify transposed duplicates later in this vignette.
131 |
132 |
133 | Data were obtained from Ensembl Fungi release 54 [@yates2022ensembl].
134 | While you can download these data manually from the Ensembl Fungi webpage (or
135 | through another repository such as NCBI RefSeq), here we will demonstrate how
136 | you can get data from Ensembl using the `r BiocStyle::Biocpkg("biomartr")`
137 | package.
138 |
139 | ```{r eval=FALSE}
140 | species <- c("Saccharomyces cerevisiae", "Candida glabrata")
141 |
142 | # Download data from Ensembl with {biomartr}
143 | ## Whole-genome protein sequences (.fa)
144 | fasta_dir <- file.path(tempdir(), "proteomes")
145 | fasta_files <- biomartr::getProteomeSet(
146 | db = "ensembl", organisms = species, path = fasta_dir
147 | )
148 |
149 | ## Gene annotation (.gff3)
150 | gff_dir <- file.path(tempdir(), "annotation")
151 | gff_files <- biomartr::getGFFSet(
152 | db = "ensembl", organisms = species, path = gff_dir
153 | )
154 |
155 | ## CDS (.fa)
156 | cds_dir <- file.path(tempdir(), "CDS")
157 | cds_files <- biomartr::getCDSSet(
158 | db = "ensembl", organisms = species, path = cds_dir
159 | )
160 |
161 | # Import data to the R session
162 | ## Read .fa files with proteomes as a list of AAStringSet + clean names
163 | seq <- syntenet::fasta2AAStringSetlist(fasta_dir)
164 | names(seq) <- gsub("\\..*", "", names(seq))
165 |
166 | ## Read .gff3 files as a list of GRanges
167 | annot <- syntenet::gff2GRangesList(gff_dir)
168 | names(annot) <- gsub("\\..*", "", names(annot))
169 |
170 | ## Read .fa files with CDS as a list of DNAStringSet objects
171 | cds <- lapply(cds_files, Biostrings::readDNAStringSet)
172 | names(cds) <- gsub("\\..*", "", basename(cds_files))
173 |
174 | # Process data
175 | ## Keep ranges for protein-coding genes only
176 | yeast_annot <- lapply(annot, function(x) {
177 | gene_ranges <- x[x$biotype == "protein_coding" & x$type == "gene"]
178 | gene_ranges <- IRanges::subsetByOverlaps(x, gene_ranges)
179 | return(gene_ranges)
180 | })
181 |
182 | ## Keep only longest sequence for each protein-coding gene (no isoforms)
183 | yeast_seq <- lapply(seq, function(x) {
184 | # Keep only protein-coding genes
185 | x <- x[grep("protein_coding", names(x))]
186 |
187 | # Leave only gene IDs in sequence names
188 | names(x) <- gsub(".*gene:| .*", "", names(x))
189 |
190 | # If isoforms are present (same gene ID multiple times), keep the longest
191 | x <- x[order(Biostrings::width(x), decreasing = TRUE)]
192 | x <- x[!duplicated(names(x))]
193 |
194 | return(x)
195 | })
196 | ```
197 |
198 | Note that processing might differ depending on the data source. For instance,
199 | Ensembl adds gene 'biotypes' (e.g., protein-coding, pseudogene, etc) in sequence
200 | names and in a field named *biotype* in annotation files. Other databases
201 | might add these information elsewhere.
202 |
203 | To avoid problems building this vignette (due to no/slow/unstable internet
204 | connection), the code chunk above is not executed. Instead, we ran such code
205 | and saved data in the following objects:
206 |
207 | - **yeast_seq:** A list of `AAStringSet` objects with elements
208 | named *Scerevisiae* and *Cglabrata*.
209 |
210 | - **yeast_annot:** A `GRangesList` object with elements
211 | named *Scerevisiae* and *Cglabrata*.
212 |
213 | Let's take a look at them.
214 |
215 | ```{r example_data}
216 | # Load example data
217 | data(yeast_seq)
218 | data(yeast_annot)
219 |
220 | yeast_seq
221 | yeast_annot
222 | ```
223 |
224 | # Data preparation
225 |
226 | First of all, we need to process the list of protein sequences and gene ranges
227 | to detect synteny with `r BiocStyle::Biocpkg("syntenet")`. We will do that
228 | using the function `process_input()` from
229 | the `r BiocStyle::Biocpkg("syntenet")` package.
230 |
231 | ```{r process_input}
232 | library(syntenet)
233 |
234 | # Process input data
235 | pdata <- process_input(yeast_seq, yeast_annot)
236 |
237 | # Inspect the output
238 | names(pdata)
239 | pdata$seq
240 | pdata$annotation
241 | ```
242 |
243 | The processed data are represented as a list with the elements `seq` and
244 | `annotation`, each containing the protein sequences and gene ranges for
245 | each species, respectively.
246 |
247 | Finally, we need to perform pairwise sequence similarity searches to
248 | identify the whole set of paralogous gene pairs. We can do this
249 | using the function `run_diamond()` from the `r BiocStyle::Biocpkg("syntenet")`
250 | package [^1], setting `compare = "intraspecies"` to perform only intraspecies
251 | comparisons.
252 |
253 | [^1]: **Note:** you need to have DIAMOND installed in your machine to run
254 | this function. If you don't have it, you can use
255 | the `r BiocStyle::Biocpkg("Herper")` package to install DIAMOND in a Conda
256 | environment and run DIAMOND from this virtual environment.
257 |
258 | ```{r run_diamond_intraspecies}
259 | data(diamond_intra)
260 |
261 | # Run DIAMOND in sensitive mode for S. cerevisiae only
262 | if(diamond_is_installed()) {
263 | diamond_intra <- run_diamond(
264 | seq = pdata$seq["Scerevisiae"],
265 | compare = "intraspecies",
266 | outdir = file.path(tempdir(), "diamond_intra"),
267 | ... = "--sensitive"
268 | )
269 | }
270 |
271 | # Inspect output
272 | names(diamond_intra)
273 | head(diamond_intra$Scerevisiae_Scerevisiae)
274 | ```
275 |
276 | And voilà! Now that we have the DIAMOND output and the processed annotation,
277 | you can classify the duplicated genes.
278 |
279 | # Classifying duplicated gene pairs and genes
280 |
281 | To classify duplicated gene pairs based on their modes of duplication,
282 | you will use the function `classify_gene_pairs()`. This function offers
283 | four different classification schemes, depending on how much detail you
284 | want. The classification schemes, along with the duplication modes
285 | they identify and their required input, are summarized in the table below:
286 |
287 |
288 | | Scheme | Duplication modes | Required input |
289 | |:---------|:----------------------------|:---------------------------------------------------|
290 | | binary | SD, SSD | `blast_list`, `annotation` |
291 | | standard | SD, TD, PD, DD | `blast_list`, `annotation` |
292 | | extended | SD, TD, PD, TRD, DD | `blast_list`, `annotation`, `blast_inter` |
293 | | full | SD, TD, PD, rTRD, dTRD, DD | `blast_list`, `annotation`, `blast_inter`, `intron_counts` |
294 |
295 | **Legend:** SD, segmental duplication. SSD, small-scale duplication.
296 | TD, tandem duplication. PD, proximal duplication.
297 | TRD, transposon-derived duplication. rTRD, retrotransposon-derived duplication.
298 | dTRD, DNA transposon-derived duplication. DD, dispersed duplication.
299 |
300 |
301 | As shown in the table, the minimal input objects are:
302 |
303 | - **blast_list**: A list of data frames with DIAMOND (or BLASTp, etc.) tabular
304 | output for intraspecies comparisons as returned
305 | by `syntenet::run_diamond(..., compare = 'intraspecies')`.
306 | - **annotation**: The processed annotation list (a `GRangesList` object)
307 | as returned by `syntenet::process_input()`.
308 |
309 |
310 | However, if you also want to identify transposon-derived duplicates (TRD)
311 | and further classify them as retrotransposon-derived duplicates (rTRD) or
312 | DNA transposon-derived duplicates (dTRD), you will need the following objects:
313 |
314 | - **blast_list**: A list of data frames with DIAMOND (or BLASTp, etc.) tabular
315 | output for interspecies comparisons (target species vs an outgroup) as returned
316 | by `syntenet::run_diamond(..., compare = )`.
317 | - **intron_counts**: A list of data frames with the number of introns per gene
318 | for each species, as returned by `get_intron_counts()`.
319 |
320 |
321 | Below, we demonstrate each classification scheme with examples.
322 |
323 | ## The *binary* scheme (SD vs SSD)
324 |
325 | The binary scheme classifies duplicates as derived from either
326 | segmental duplications (SD) or small-scale duplications (SSD).
327 | To identify segmental duplicates, the function `classify_gene_pairs()`
328 | performs intragenome synteny detection scans
329 | with `r BiocStyle::Biocpkg("syntenet")` and classifies any detected anchor
330 | pairs as segmental duplicates. The remaining pairs are classified as
331 | originating from small-scale duplications.
332 |
333 | This scheme can be used by specifying `scheme = "binary"` in the
334 | function `classify_gene_pairs()`.
335 |
336 | ```{r binary_classification}
337 | # Binary scheme
338 | c_binary <- classify_gene_pairs(
339 | annotation = pdata$annotation,
340 | blast_list = diamond_intra,
341 | scheme = "binary"
342 | )
343 |
344 | # Inspecting the output
345 | names(c_binary)
346 | head(c_binary$Scerevisiae)
347 |
348 | # How many pairs are there for each duplication mode?
349 | table(c_binary$Scerevisiae$type)
350 | ```
351 |
352 | The function returns a list of data frames, each containing the duplicated
353 | gene pairs and their modes of duplication for each species (here, because
354 | we have only one species, this is a list of length 1).
355 |
356 | ## The *standard* scheme (SSD → TD, PD, DD)
357 |
358 | Gene pairs derived from small-scale duplications can be further classified
359 | as originating from tandem duplications (TD, genes are adjacent to each other),
360 | proximal duplications (PD, genes are separated by only a few genes), or
361 | dispersed duplications (DD, duplicates that do not fit in any of the previous
362 | categories).
363 |
364 | This is the default classification scheme in `classify_gene_pairs()`,
365 | and it can be specified by setting `scheme = "standard"`.
366 |
367 | ```{r expanded_classification}
368 | # Standard scheme
369 | c_standard <- classify_gene_pairs(
370 | annotation = pdata$annotation,
371 | blast_list = diamond_intra,
372 | scheme = "standard"
373 | )
374 |
375 | # Inspecting the output
376 | names(c_standard)
377 | head(c_standard$Scerevisiae)
378 |
379 | # How many pairs are there for each duplication mode?
380 | table(c_standard$Scerevisiae$type)
381 | ```
382 |
383 | ## The *extended* scheme (SSD → TD, PD, TRD, DD)
384 |
385 | To find transposon-derived duplicates (TRD), the
386 | function `classify_gene_pairs()` detects syntenic regions between a target
387 | species and an outgroup species. Genes in the target species that are in
388 | syntenic regions with the outgroup are treated as *ancestral loci*. Then,
389 | if only one gene of the duplicate pair is an ancestral locus, this
390 | duplicate pair is classified as originating from transposon-derived
391 | duplications.
392 |
393 |
394 | Since finding transposon-derived duplicates requires detecting syntenic regions
395 | between a target species and an outgroup species, you will first need to
396 | perform similarity searches with DIAMOND [@buchfink2021sensitive],
397 | BLAST [@altschul1997gapped], or similar programs. This can be done with
398 | `syntenet::run_diamond(seq, compare = compare_df)`. For the parameter `compare`,
399 | you will pass a 2-column data frame specifying the comparisons to be made. [^3]
400 | Importantly, for a more accurate detection of interspecies synteny, you need to
401 | perform bidirectional similarity searches for each comparison. For instance,
402 | if you want to use `speciesA` as target species and `speciesB` as outgroup,
403 | you need to perform similarity searches in both directions:
404 | `speciesA_speciesB` and `speciesB_speciesA`.
405 |
406 | [^3]: **Pro tip:** If you want to identify and classify duplicated genes for
407 | multiple species in batch, you must include the outgroup for each of them
408 | in the comparisons data frame.
409 |
410 | Here, we will identify duplicated gene pairs for *Saccharomyces cerevisiae*
411 | using *Candida glabrata* as an outgroup. To create a data frame with the
412 | bidirectional comparisons to be made, we will use the helper function `make_bidirectional()` from the `r BiocStyle::Biocpkg("syntenet")` package.
413 |
414 | ```{r make_bidirectional_comparisons}
415 | # Create a data frame of species and outgroups for `syntenet::run_diamond()`
416 | spp_outgroup <- data.frame(
417 | species = "Scerevisiae",
418 | outgroup = "Cglabrata"
419 | )
420 | spp_outgroup
421 |
422 | # Expand the data frame to make bidirectional comparisons
423 | comparisons <- syntenet::make_bidirectional(spp_outgroup)
424 | comparisons
425 | ```
426 |
427 | Now that we have a data frame with our desired comparisons, we can pass it
428 | to `syntenet::run_diamond()`.
429 |
430 | ```{r blast_interspecies}
431 | data(diamond_inter) # load pre-computed output in case DIAMOND is not installed
432 |
433 | # Run DIAMOND for the comparisons we specified
434 | if(diamond_is_installed()) {
435 | diamond_inter <- run_diamond(
436 | seq = pdata$seq,
437 | compare = comparisons,
438 | outdir = file.path(tempdir(), "diamond_inter"),
439 | ... = "--sensitive"
440 | )
441 | }
442 |
443 | names(diamond_inter)
444 | head(diamond_inter$Scerevisiae_Cglabrata)
445 | ```
446 |
447 | As you can see, for each species-outgroup pair, `diamond_inter` contains two
448 | data frames: one named `Scerevisiae_Cglabrata`, and one named
449 | `Cglabrata_Scerevisiae`. Before actually classifying gene pairs, we will
450 | need to collapse these data frames so that we have
451 | **only one data frame per species-outgroup pair**. This can be easily done
452 | with the function `collapse_bidirectional_hits()` from
453 | `r BiocStyle::Biocpkg("syntenet")`. As input, this function takes a
454 | list of interspecies DIAMOND data frames, and a 2-column data frame indicating
455 | the target species and the outgroup species (columns 1 and 2, respectively;
456 | double-check the order of the columns!).
457 |
458 | ```{r collapse_hits}
459 | # For each species-outgroup pair, collapse bidirectional hits in one data frame
460 | diamond_inter <- syntenet::collapse_bidirectional_hits(
461 | diamond_inter, compare = spp_outgroup
462 | )
463 | names(diamond_inter)
464 | ```
465 |
466 | Then, we can pass this interspecies DIAMOND output as an argument to
467 | the parameter `blast_inter` of `classify_gene_pairs()`.
468 |
469 | ```{r full_classification}
470 | # Extended scheme
471 | c_extended <- classify_gene_pairs(
472 | annotation = pdata$annotation,
473 | blast_list = diamond_intra,
474 | scheme = "extended",
475 | blast_inter = diamond_inter
476 | )
477 |
478 | # Inspecting the output
479 | names(c_extended)
480 | head(c_extended$Scerevisiae)
481 |
482 | # How many pairs are there for each duplication mode?
483 | table(c_extended$Scerevisiae$type)
484 | ```
485 |
486 | In the example above, we used only one outgroup species (*C. glabrata*).
487 | However, since results might change depending on the chosen outgroup,
488 | you can also use multiple outgroups in the comparisons data frame, and then
489 | run interspecies DIAMOND searches as above. For instance, suppose you want
490 | to use *speciesB*, *speciesC*, and *speciesD* as outgroups to *speciesA*.
491 | In this case, your data frame of comparisons (to be passed to the `compare`
492 | argument of `syntenet::run_diamond()`) would look like the following:
493 |
494 | ```{r}
495 | # Example: multiple outgroups for the same species
496 | spp_outgroup_many <- data.frame(
497 | species = rep("speciesA", 3),
498 | outgroup = c("speciesB", "speciesC", "speciesD")
499 | )
500 |
501 | comparisons_many <- syntenet::make_bidirectional(spp_outgroup_many)
502 | comparisons_many
503 | ```
504 |
505 | When multiple outgroups are present, `classify_gene_pairs()` will check if
506 | gene pairs are classified as transposed (i.e., only one gene is an ancestral
507 | locus) in each of the outgroup species, and then calculate the percentage of
508 | outgroup species in which each pair is considered 'transposed'. For instance,
509 | you could have gene pair 1 as transposed based on 30\% of the outgroup species,
510 | gene pair 2 as transposed based on 100\% of the outgroup species,
511 | gene pair 3 based on 0\% of the outgroup species, and so on. By default,
512 | pairs are considered 'transposed' if they are classified as such
513 | in >70% of the outgroups, but you can choose a different minimum percentage
514 | cut-off using parameter `outgroup_coverage`.
515 |
516 | ## The *full* scheme (SSD → TD, PD, rTRD, dTRD, DD)
517 |
518 | Finally, the full scheme consists in classifying transposon-derived
519 | duplicates (TRD) further as originating from retrotransposons (rTRD) or
520 | DNA transposons (dTRD). To do that, the function `classify_gene_pairs()`
521 | uses the number of introns per gene to find TRD pairs for which
522 | one gene has at least 1 intron, and the other gene has no introns; if that
523 | is the case, the pair is classified as originating from the activity
524 | of retrotransposons (rTRD, i.e., the transposed gene without introns is
525 | a processed transcript that was retrotransposed back to the genome). All the
526 | other TRD pairs are classified as DNA transposon-derived duplicates (dTRD).
527 |
528 |
529 | To classify duplicates using this scheme, you will first need to create a list
530 | of data frames with the number of introns per gene for each species. This
531 | can be done with the function `get_intron_counts()`, which takes a `TxDb`
532 | object as input. `TxDb` objects store transcript annotations, and they
533 | can be created with a family of functions
534 | named `makeTxDbFrom*` from the `r BiocStyle::Biocpkg("txdbmaker")`
535 | package (see `?get_intron_counts()` for a summary of all functions).
536 |
537 |
538 | Here, we will create a list of `TxDb` objects from a list of `GRanges` objects
539 | using the function `makeTxDbFromGRanges()`
540 | from `r BiocStyle::Biocpkg("txdbmaker")`. Importantly, to create
541 | a `TxDb` from a `GRanges`, the `GRanges` object must contain genomic coordinates
542 | for all features, including transcripts, exons, etc. Because of that, we
543 | will use annotation from the example data set `yeast_annot`,
544 | which was not processed with `syntenet::process_input()`.
545 |
546 | ```{r message=FALSE}
547 | library(txdbmaker)
548 |
549 | # Create a list of `TxDb` objects from a list of `GRanges` objects
550 | txdb_list <- lapply(yeast_annot, txdbmaker::makeTxDbFromGRanges)
551 | txdb_list
552 | ```
553 |
554 | Once we have the `TxDb` objects, we can get intron counts per gene with
555 | `get_intron_counts()`.
556 |
557 | ```{r}
558 | # Get a list of data frames with intron counts per gene for each species
559 | intron_counts <- lapply(txdb_list, get_intron_counts)
560 |
561 | # Inspecting the list
562 | names(intron_counts)
563 | head(intron_counts$Scerevisiae)
564 | ```
565 |
566 | Finally, we can use this list to classify duplicates using the full scheme
567 | as follows:
568 |
569 | ```{r}
570 | # Full scheme
571 | c_full <- classify_gene_pairs(
572 | annotation = pdata$annotation,
573 | blast_list = diamond_intra,
574 | scheme = "full",
575 | blast_inter = diamond_inter,
576 | intron_counts = intron_counts
577 | )
578 |
579 | # Inspecting the output
580 | names(c_full)
581 | head(c_full$Scerevisiae)
582 |
583 | # How many pairs are there for each duplication mode?
584 | table(c_full$Scerevisiae$type)
585 | ```
586 |
587 |
588 |
589 | # Classifying genes into unique modes of duplication
590 |
591 | If you look carefully at the output of `classify_gene_pairs()`, you will notice
592 | that some genes appear in more than one duplicate pair, and these pairs can
593 | have different duplication modes assigned. There's nothing wrong with it.
594 | Consider, for example, a gene that was originated by a segmental duplication
595 | some 60 million years ago, and then it underwent a tandem duplication
596 | 5 million years ago. In the output of `classify_gene_pairs()`, you'd see
597 | this gene in two pairs, one with **SD** in the `type` column, and one
598 | with **TD**.
599 |
600 | If you want to assign each gene to a unique mode of duplication, you can
601 | use the function `classify_genes()`. This function assigns duplication modes
602 | hierarchically using factor levels in column `type` as the priority order.
603 | The priority orders for each classification scheme are:
604 |
605 | 1. **Binary:** SD > SSD.
606 | 2. **Standard:** SD > TD > PD > DD.
607 | 3. **Extended:** SD > TD > PD > TRD > DD.
608 | 4. **Full:** SD > TD > PD > rTRD > dTRD > DD.
609 |
610 | The input for `classify_genes()` is the list of gene pairs returned by
611 | `classify_gene_pairs()`.
612 |
613 | ```{r classify_genes}
614 | # Classify genes into unique modes of duplication
615 | c_genes <- classify_genes(c_full)
616 |
617 | # Inspecting the output
618 | names(c_genes)
619 | head(c_genes$Scerevisiae)
620 |
621 | # Number of genes per mode
622 | table(c_genes$Scerevisiae$type)
623 | ```
624 |
625 | # Calculating substitution rates for duplicated gene pairs
626 |
627 | You can use the function `pairs2kaks()` to calculate rates of nonsynonymous
628 | substitutions per nonsynonymous site ($K_a$), synonymouys substitutions per
629 | synonymous site ($K_s$), and their ratios ($K_a/K_s$). These rates are calculated
630 | using the Bioconductor package `r BiocStyle::Biocpkg("MSA2dist")`, which
631 | implements all codon models in KaKs_Calculator 2.0 [@wang2010kaks_calculator].
632 |
633 |
634 | For the purpose of demonstration, we will only calculate $K_a$, $K_s$,
635 | and $K_a/K_s$ for 5 TD-derived gene pairs. The CDS for TD-derived
636 | genes were obtained from Ensembl Fungi [@yates2022ensembl], and
637 | they are stored in `cds_scerevisiae`.
638 |
639 | ```{r kaks_calculation}
640 | data(cds_scerevisiae)
641 | head(cds_scerevisiae)
642 |
643 | # Store DNAStringSet object in a list
644 | cds_list <- list(Scerevisiae = cds_scerevisiae)
645 |
646 | # Keep only top five TD-derived gene pairs for demonstration purposes
647 | td_pairs <- c_full$Scerevisiae[c_full$Scerevisiae$type == "TD", ]
648 | gene_pairs <- list(Scerevisiae = td_pairs[seq(1, 5, by = 1), ])
649 |
650 | # Calculate Ka, Ks, and Ka/Ks
651 | kaks <- pairs2kaks(gene_pairs, cds_list)
652 |
653 | # Inspect the output
654 | head(kaks)
655 | ```
656 |
657 | Importantly, `pairs2kaks()` expects all genes in the gene pairs to be present
658 | in the CDS, with matching names. Species abbreviations in gene pairs (added
659 | by `r BiocStyle::Biocpkg("syntenet")`) are automatically removed, so you should
660 | not add them to the sequence names of your CDS.
661 |
662 | # Identifying and visualizing $K_s$ peaks
663 |
664 | Peaks in $K_s$ distributions typically indicate whole-genome duplication (WGD)
665 | events, and they can be identified by fitting Gaussian mixture models (GMMs) to
666 | $K_s$ distributions. In `r BiocStyle::Githubpkg("doubletrouble")`, this can be
667 | performed with the function `find_ks_peaks()`.
668 |
669 |
670 | However, because of saturation at higher $K_s$ values, only **recent WGD**
671 | events can be reliably identified from $K_s$
672 | distributions [@vanneste2013inference]. Recent WGD events are commonly found
673 | in plant species, such as maize, soybean, apple, etc.
674 | Although the genomes of yeast species have signatures of WGD,
675 | these events are ancient, so it is very hard to find evidence for them
676 | using $K_s$ distributions. [^4]
677 |
678 | [^4]: **Tip:** You might be asking yourself: "How does one identify ancient
679 | WGD, then?". A common approach is to look for syntenic blocks (i.e.,
680 | regions with conserved gene content and order) within genomes. This is what
681 | `classify_gene_pairs()` does under the hood to find SD-derived gene pairs.
682 |
683 | To demonstrate how you can find peaks in $K_s$ distributions
684 | with `find_ks_peaks()`, we will use a data frame containing $K_s$ values for
685 | duplicate pairs in the soybean (*Glycine max*) genome, which has undergone
686 | 2 WGDs events ~13 and ~58 million years ago [@schmutz2010genome].
687 | Then, we will visualize $K_s$ distributions with peaks using the function
688 | `plot_ks_peaks()`.
689 |
690 | First of all, let's look at the data and have a quick look at the distribution
691 | with the function `plot_ks_distro()` (more details on this function in the
692 | data visualization section).
693 |
694 | ```{r ks_eda}
695 | # Load data and inspect it
696 | data(gmax_ks)
697 | head(gmax_ks)
698 |
699 | # Plot distribution
700 | plot_ks_distro(gmax_ks)
701 | ```
702 |
703 | By visual inspection, we can see 2 or 3 peaks. Based on our prior knowledge,
704 | we know that 2 WGD events have occurred in the ancestral of the *Glycine* genus
705 | and in the ancestral of all Fabaceae, which seem to correspond to the
706 | peaks we see at $K_s$ values around 0.1 and 0.5, respectively. There could be
707 | a third, flattened peak at around 1.6, which would represent the WGD shared
708 | by all eudicots. Let's test which number of peaks has more support: 2 or 3.
709 |
710 | ```{r find_ks_peaks}
711 | # Find 2 and 3 peaks and test which one has more support
712 | peaks <- find_ks_peaks(gmax_ks$Ks, npeaks = c(2, 3), verbose = TRUE)
713 | names(peaks)
714 | str(peaks)
715 |
716 | # Visualize Ks distribution
717 | plot_ks_peaks(peaks)
718 | ```
719 |
720 | As we can see, the presence of 3 peaks is more supported (lowest BIC). The
721 | function returns a list with the mean, variance and amplitude
722 | of mixture components (i.e., peaks), as well as the $K_s$ distribution itself.
723 |
724 | Now, suppose you just want to get the first 2 peaks. You can do that by
725 | explictly saying to `find_ks_peaks()` how many peaks there are.
726 |
727 | ```{r find_peaks_explicit}
728 | # Find 2 peaks ignoring Ks values > 1
729 | peaks <- find_ks_peaks(gmax_ks$Ks, npeaks = 2, max_ks = 1)
730 | plot_ks_peaks(peaks)
731 | ```
732 |
733 | **Important consideration on GMMs and $K_s$ distributions:**
734 | Peaks identified with GMMs should not be blindly regarded as "the truth".
735 | Using GMMs to find peaks in $K_s$ distributions can lead to problems such as
736 | overfitting and overclustering [@tiley2018assessing]. Some general
737 | recommendations are:
738 |
739 | 1. Use your prior knowledge. If you know how many peaks there are (e.g.,
740 | based on literature evidence), just tell the number to `find_ks_peaks()`.
741 | Likewise, if you are not sure about how many peaks there are, but you know
742 | the maximum number of peaks is N, don't test for the presence of >N peaks.
743 | GMMs can incorrectly identify more peaks than the actual number.
744 |
745 | 2. Test the significance of each peak with SiZer (Significant ZERo crossings
746 | of derivatives) maps [@chaudhuri1999sizer].
747 | This can be done with the function `SiZer()` from
748 | the R package `r BiocStyle::CRANpkg("feature")`.
749 |
750 | As an example of a SiZer map, let's use `feature::SiZer()` to assess
751 | the significance of the 2 peaks we found previously.
752 |
753 | ```{r sizer}
754 | # Get numeric vector of Ks values <= 1
755 | ks <- gmax_ks$Ks[gmax_ks$Ks <= 1]
756 |
757 | # Get SiZer map
758 | feature::SiZer(ks)
759 | ```
760 |
761 | The blue regions in the SiZer map indicate significantly increasing regions
762 | of the curve, which support the 2 peaks we found.
763 |
764 | # Classifying genes by age groups
765 |
766 | Finally, you can use the peaks you obtained before to classify gene pairs
767 | by age group. Age groups are defined based on the $K_s$ peak to which pairs belong.
768 | This is useful if you want to analyze duplicate pairs
769 | from a specific WGD event, for instance. You can do this with
770 | the function `split_pairs_by_peak()`. This function returns a list containing
771 | the classified pairs in a data frame, and a ggplot object with the
772 | age boundaries highlighted in the histogram of $K_s$ values.
773 |
774 | ```{r split_by_peak}
775 | # Gene pairs without age-based classification
776 | head(gmax_ks)
777 |
778 | # Classify gene pairs by age group
779 | pairs_age_group <- split_pairs_by_peak(gmax_ks[, c(1,2,3)], peaks)
780 |
781 | # Inspecting the output
782 | names(pairs_age_group)
783 |
784 | # Take a look at the classified gene pairs
785 | head(pairs_age_group$pairs)
786 |
787 | # Visualize Ks distro with age boundaries
788 | pairs_age_group$plot
789 | ```
790 |
791 | Age groups can also be used to identify SD gene pairs that likely originated
792 | from whole-genome duplications. The rationale here is that segmental duplicates
793 | with $K_s$ values near $K_s$ peaks (indicating WGD events) were likely
794 | created by such WGDs. In a similar logic, SD pairs with $K_s$ values that
795 | are too distant from $K_s$ peaks (e.g., >2 standard deviations away from
796 | the mean) were likely created by duplications of large genomic segments, but
797 | not duplications of the entire genome.
798 |
799 | As an example, to find gene pairs in the soybean genome that likely originated
800 | from the WGD event shared by all legumes (at ~58 million years ago),
801 | you'd need to extract SD pairs in age group 2 using the following code:
802 |
803 | ```{r}
804 | # Get all pairs in age group 2
805 | pairs_ag2 <- pairs_age_group$pairs[pairs_age_group$pairs$peak == 2, c(1,2)]
806 |
807 | # Get all SD pairs
808 | sd_pairs <- gmax_ks[gmax_ks$type == "SD", c(1,2)]
809 |
810 | # Merge tables
811 | pairs_wgd_legumes <- merge(pairs_ag2, sd_pairs)
812 |
813 | head(pairs_wgd_legumes)
814 | ```
815 |
816 | # Data visualization
817 |
818 | Last but not least, `r BiocStyle::Biocpkg("doubletrouble")` provides users
819 | with graphical functions to produce publication-ready plots from the output
820 | of `classify_gene_pairs()`, `classify_genes()`, and `pairs2kaks()`.
821 | Let's take a look at them one by one.
822 |
823 | ## Visualizing the frequency of duplicates per mode
824 |
825 | To visualize the frequency of duplicated gene pairs or genes by duplication
826 | type (as returned by `classify_gene_pairs()` and `classify_genes()`,
827 | respectively), you will first need to create a data frame of counts with
828 | `duplicates2counts()`. To demonstrate how this works, we will use an
829 | example data set with duplicate pairs for 3 fungi species (and substitution
830 | rates, which will be ignored by `duplicates2counts()`).
831 |
832 | ```{r}
833 | # Load data set with pre-computed duplicates for 3 fungi species
834 | data(fungi_kaks)
835 | names(fungi_kaks)
836 | head(fungi_kaks$saccharomyces_cerevisiae)
837 |
838 | # Get a data frame of counts per mode in all species
839 | counts_table <- duplicates2counts(fungi_kaks |> classify_genes())
840 |
841 | counts_table
842 | ```
843 |
844 | Now, let's visualize the frequency of duplicate gene pairs by duplication
845 | type with the function `plot_duplicate_freqs()`. You can visualize frequencies
846 | in three different ways, as demonstrated below.
847 |
848 | ```{r}
849 | # A) Facets
850 | p1 <- plot_duplicate_freqs(counts_table)
851 |
852 | # B) Stacked barplot, absolute frequencies
853 | p2 <- plot_duplicate_freqs(counts_table, plot_type = "stack")
854 |
855 | # C) Stacked barplot, relative frequencies
856 | p3 <- plot_duplicate_freqs(counts_table, plot_type = "stack_percent")
857 |
858 | # Combine plots, one per row
859 | patchwork::wrap_plots(p1, p2, p3, nrow = 3) +
860 | patchwork::plot_annotation(tag_levels = "A")
861 | ```
862 |
863 | If you want to visually the frequency of duplicated **genes** (not gene pairs),
864 | you'd first need to classify genes into unique modes of duplication
865 | with `classify_genes()`, and then repeat the code above. For example:
866 |
867 | ```{r fig.height = 3, fig.width = 8}
868 | # Frequency of duplicated genes by mode
869 | classify_genes(fungi_kaks) |> # classify genes into unique duplication types
870 | duplicates2counts() |> # get a data frame of counts (long format)
871 | plot_duplicate_freqs() # plot frequencies
872 | ```
873 |
874 | ## Visualizing $K_s$ distributions
875 |
876 | As briefly demonstrated before, to plot a $K_s$ distribution for the
877 | whole paranome, you will use the function `plot_ks_distro()`.
878 |
879 | ```{r fig.height=3, fig.width=9}
880 | ks_df <- fungi_kaks$saccharomyces_cerevisiae
881 |
882 | # A) Histogram, whole paranome
883 | p1 <- plot_ks_distro(ks_df, plot_type = "histogram")
884 |
885 | # B) Density, whole paranome
886 | p2 <- plot_ks_distro(ks_df, plot_type = "density")
887 |
888 | # C) Histogram with density lines, whole paranome
889 | p3 <- plot_ks_distro(ks_df, plot_type = "density_histogram")
890 |
891 | # Combine plots side by side
892 | patchwork::wrap_plots(p1, p2, p3, nrow = 1) +
893 | patchwork::plot_annotation(tag_levels = "A")
894 | ```
895 |
896 | However, visualizing the distribution for the whole paranome can mask patterns
897 | that only happen for duplicates originating from particular duplication types.
898 | For instance, when looking for evidence of WGD events,
899 | visualizing the $K_s$ distribution for SD-derived pairs only can reveal
900 | whether syntenic genes cluster together, suggesting the presence of WGD history.
901 | To visualize the distribution by duplication type, use `bytype = TRUE` in
902 | `plot_ks_distro()`.
903 |
904 | ```{r fig.width = 8, fig.height=4}
905 | # A) Duplicates by type, histogram
906 | p1 <- plot_ks_distro(ks_df, bytype = TRUE, plot_type = "histogram")
907 |
908 | # B) Duplicates by type, violin
909 | p2 <- plot_ks_distro(ks_df, bytype = TRUE, plot_type = "violin")
910 |
911 | # Combine plots side by side
912 | patchwork::wrap_plots(p1, p2) +
913 | patchwork::plot_annotation(tag_levels = "A")
914 | ```
915 |
916 | ## Visualizing substitution rates by species
917 |
918 | The function `plot_rates_by_species()` can be used to show distributions of
919 | substitution rates ($K_s$, $K_a$, or their ratio $K_a/K_s$) by species.
920 | You can choose which rate you want to visualize, and whether or not to
921 | group gene pairs by duplication mode, as demonstrated below.
922 |
923 | ```{r fig.width = 6, fig.height = 4}
924 | # A) Ks for each species
925 | p1 <- plot_rates_by_species(fungi_kaks)
926 |
927 | # B) Ka/Ks by duplication type for each species
928 | p2 <- plot_rates_by_species(fungi_kaks, rate_column = "Ka_Ks", bytype = TRUE)
929 |
930 | # Combine plots - one per row
931 | patchwork::wrap_plots(p1, p2, nrow = 2) +
932 | patchwork::plot_annotation(tag_levels = "A")
933 | ```
934 |
935 | # Session information {.unnumbered}
936 |
937 | This document was created under the following conditions:
938 |
939 | ```{r session_info}
940 | sessioninfo::session_info()
941 | ```
942 |
943 | # References {.unnumbered}
944 |
945 |
--------------------------------------------------------------------------------