├── .Rbuildignore
├── .gitignore
├── .travis.yml
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── NEWS.md
├── R
│   ├── autoTranslate.R
│   ├── diopt.R
│   ├── geneHistory.R
│   ├── homologene.R
│   ├── homologeneData2.R
│   ├── import.R
│   └── updateHomologene.R
├── README.md
├── README.rmd
├── cran-comments.md
├── data-raw
│   ├── homologene2.tsv
│   ├── homologeneData.tsv
│   ├── release
│   └── taxData.tsv
├── data
│   ├── homologeneData.rda
│   ├── homologeneData2.rda
│   ├── homologeneVersion.rda
│   └── taxData.rda
├── docs
│   ├── LICENSE-text.html
│   ├── LICENSE.html
│   ├── README.html
│   ├── authors.html
│   ├── docsearch.css
│   ├── docsearch.js
│   ├── index.html
│   ├── jquery.sticky-kit.min.js
│   ├── link.svg
│   ├── news
│   │   └── index.html
│   ├── pkgdown.css
│   ├── pkgdown.js
│   ├── pkgdown.yml
│   └── reference
│       ├── autoTranslate.html
│       ├── diopt.html
│       ├── getGeneHistory.html
│       ├── getGeneInfo.html
│       ├── getHomologene.html
│       ├── homologene.html
│       ├── homologeneData.html
│       ├── homologeneData2.html
│       ├── homologeneVersion.html
│       ├── human2mouse.html
│       ├── index.html
│       ├── mouse2human.html
│       ├── reexports.html
│       ├── taxData.html
│       ├── updateHomologene.html
│       └── updateIDs.html
├── homologene.Rproj
├── man
│   ├── autoTranslate.Rd
│   ├── diopt.Rd
│   ├── getGeneHistory.Rd
│   ├── getGeneInfo.Rd
│   ├── getHomologene.Rd
│   ├── homologene.Rd
│   ├── homologeneData.Rd
│   ├── homologeneData2.Rd
│   ├── homologeneVersion.Rd
│   ├── human2mouse.Rd
│   ├── mouse2human.Rd
│   ├── reexports.Rd
│   ├── taxData.Rd
│   ├── updateHomologene.Rd
│   └── updateIDs.Rd
├── process
│   ├── autoUpdate.sh
│   ├── biomartTests.R
│   ├── dioptMemory.R
│   ├── prepHomologene.R
│   └── prepHomologene2.R
└── tests
    ├── testthat.R
    └── testthat
        ├── test_diopt.R
        ├── test_homologene.R
        ├── test_utilities.R
        └── testfiles
            └── gene_history_trimmed.tsv
/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^LICENSE\.md$
2 | ^cran-comments\.md$
3 | ^.*\.Rproj$
4 | ^\.Rproj\.user$
5 | ^\.httr-oauth$
6 | ^\.travis\.yml$
7 | ^data-raw$
8 | ^process
9 | README.rmd
10 | ^docs$
11 | ^README_cache$
12 | ^cache
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .httr-oauth
5 | auth
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | # R for travis: see documentation at https://docs.travis-ci.com/user/languages/r
2 |
3 | language: R
4 | sudo: false
5 | cache: packages
6 | r_github_packages:
7 | - jimhester/covr
8 | after_success:
9 | - Rscript -e 'covr::codecov()'
10 | warnings_are_errors: false
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: homologene
2 | Type: Package
3 | Title: Quick Access to Homologene and Gene Annotation Updates
4 | Version: 1.7.68.23.10.31
5 | Depends: R (>= 3.1.2)
6 | Imports:
7 | dplyr (>= 0.7.4),
8 | magrittr (>= 1.5),
9 | purrr (>= 0.2.5),
10 | readr (>= 1.3.1),
11 | R.utils(>= 2.8.0),
12 | assertthat (>= 0.2.1),
13 | rvest (>= 1.0.0),
14 | xml2 (>= 1.3.2)
15 | Suggests:
16 | testthat (>= 1.0.2)
17 | Date: 2023-10-31
18 | Authors@R: c(
19 | person("Ogan", "Mancarci", email = "ogan.mancarci@gmail.com", role = c("aut", "cre")),
20 | person("Leon","French", role = c('ctb')))
21 | BugReports: https://github.com/oganm/homologene/issues
22 | URL: https://github.com/oganm/homologene
23 | Description: A wrapper for the homologene database by the National Center for
24 | Biotechnology Information ('NCBI'). It allows searching for gene homologs across
25 | species. Data in this package can be found at .
26 | The package also includes an updated version of the homologene database where
27 | gene identifiers and symbols are replaced with their latest (at the time of
28 | submission) version and functions to fetch latest annotation data to keep updated.
29 | License: MIT + file LICENSE
30 | LazyData: true
31 | RoxygenNote: 7.2.3
32 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2019
2 | COPYRIGHT HOLDER: Ogan Mancarci
3 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | # MIT License
2 |
3 | Copyright (c) 2019 Ogan Mancarci
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 | export("%$%")
4 | export("%<>%")
5 | export("%>%")
6 | export(autoTranslate)
7 | export(diopt)
8 | export(getGeneHistory)
9 | export(getGeneInfo)
10 | export(getHomologene)
11 | export(homologene)
12 | export(human2mouse)
13 | export(mouse2human)
14 | export(updateHomologene)
15 | export(updateIDs)
16 | importFrom(magrittr,"%$%")
17 | importFrom(magrittr,"%<>%")
18 | importFrom(magrittr,"%>%")
19 |
--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 | # homologene 1.5.68.x
2 |
3 | * Added `diopt` function to query the DIOPT database.
4 | * Further automatic updates to homologeneData2
5 |
6 | # homologene 1.4.68.19.3.24 (since 1.1.68)
7 |
8 | * Added a `NEWS.md` file to track changes to the package.
9 | * Added `autoTranslate` function to allow automated translation of gene symbols or ids.
10 | * `homologeneData2` is added as an updated version of the original homologene database (the original database has not been updated since 2014). This database includes the latest gene symbols and identifiers for every gene included in the original database. Outside CRAN (github version), this database is updated weekly.
11 | * Version number is extended to include the last update date of homologeneData2.
12 | * `updateHomologene` function is added to allow users to create their own updated
13 | versions of homologene. Using `homologeneData2` as a baseline with this function
14 | allows faster updates.
15 | * `getGeneHistory`, `updateIDs` and `getGeneInfo` functions are added to allow users to update arbitrary gene lists with latest symbols and identifiers.
16 | * All species originally represented in the homologene database are added to the package.
--------------------------------------------------------------------------------
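The version scheme described in the NEWS entries above (a release number followed by the last update date of `homologeneData2`) can be unpacked with base R. This is only a minimal sketch: it assumes the last three components are a two-digit year, month and day, which agrees with the `Date: 2023-10-31` field in DESCRIPTION, and the variable names are illustrative.

```r
# Version string copied from the DESCRIPTION file above
version_string <- "1.7.68.23.10.31"

# package_version() splits the string into integer components
parts <- unclass(package_version(version_string))[[1]]

# Assumption: the trailing three components are yy.mm.dd of the last data update
date_parts <- utils::tail(parts, 3)
last_update <- as.Date(sprintf("20%02d-%02d-%02d",
                               date_parts[1], date_parts[2], date_parts[3]))
print(last_update)
```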
/R/autoTranslate.R:
--------------------------------------------------------------------------------
1 | #' Attempt to automatically translate a gene list
2 | #'
3 | #' @description Given a query gene list and a target gene list, the function
4 | #' tries to find the homology pairing that matches the query list to the target list. The query list
5 | #' is a short list of genes while the target list is supposed to represent a large number of genes from the target
6 | #' species. By default only the pairing with the most matching genes is returned. If \code{returnAllPossible = TRUE} then
7 | #' all possible pairings with any matches are returned. It is possible to limit the
8 | #' search by setting \code{possibleOrigins} and \code{possibleTargets}. Note that gene symbols of some species
9 | #' are more similar to each other than others. Using this with small gene lists and without providing any
10 | #' \code{possibleOrigins} or \code{possibleTargets} might return a wrong match, or multiple hits if
11 | #' \code{returnAllPossible = TRUE}.
12 | #'
13 | #' @param genes A list of genes to match the target. Symbols or NCBI ids
14 | #' @param targetGenes The target list. This list is supposed to represent a large number of genes
15 | #' from the target species.
16 | #' @param possibleOrigins Taxonomic identifiers of possible origin species
17 | #' @param possibleTargets Taxonomic identifiers of possible target species
18 | #' @param returnAllPossible if TRUE returns all possible pairings with non zero gene matches. If FALSE (default) returns the best match
19 | #' @return A data frame if \code{returnAllPossible = FALSE} and a list of data frames if \code{TRUE}
20 | #' @param db Homologene database to use.
21 | #' @export
22 | autoTranslate = function(genes,
23 | targetGenes,
24 | possibleOrigins= NULL,
25 | possibleTargets = NULL,
26 | returnAllPossible = FALSE,
27 | db = homologene::homologeneData){
28 | pairwise = db$Taxonomy %>%
29 | unique %>% utils::combn(2) %>%
30 | {cbind(.,.[c(2,1),],
31 | rbind(db$Taxonomy %>%
32 | unique,db$Taxonomy %>%
33 | unique))}
34 |
35 | if(!is.null(possibleOrigins)){
36 | possibleOrigins[possibleOrigins == 'human'] = 9606
37 | possibleOrigins[possibleOrigins == 'mouse'] = 10090
38 |
39 | pairwise = pairwise[,pairwise[1,] %in% possibleOrigins, drop = FALSE]
40 | } else{
41 | possibleOrigins = db$Taxonomy %>% unique
42 | }
43 | if(!is.null(possibleTargets)){
44 | possibleTargets[possibleTargets == 'human'] = 9606
45 | possibleTargets[possibleTargets == 'mouse'] = 10090
46 | pairwise = pairwise[,pairwise[2,] %in% possibleTargets,drop = FALSE]
47 | } else{
48 | possibleTargets = db$Taxonomy %>% unique
49 | }
50 |
51 |
52 | possibleOriginData = db %>%
53 | dplyr::filter(Taxonomy %in% possibleOrigins & (Gene.Symbol %in% genes | Gene.ID %in% genes)) %>%
54 | dplyr::group_by(Taxonomy)
55 | possibleOriginCounts = possibleOriginData %>% dplyr::summarise(n = dplyr::n())
56 |
57 | possibleTargetData = db %>%
58 | dplyr::filter(Taxonomy %in% possibleTargets & (Gene.Symbol %in% targetGenes | Gene.ID %in% targetGenes)) %>%
59 | dplyr::group_by(Taxonomy)
60 | possibleTargetCounts = possibleTargetData%>% dplyr::summarise(n = dplyr::n())
61 |
62 |
63 | pairwise = pairwise[,pairwise[1,] %in% possibleOriginCounts$Taxonomy,drop= FALSE]
64 | pairwise = pairwise[,pairwise[2,] %in% possibleTargetCounts$Taxonomy, drop = FALSE]
65 |
66 |
67 | pairwise %>% apply(2,function(taxes){
68 | homologene(genes,inTax = taxes[1],outTax = taxes[2], db = db)
69 | }) %>% {.[purrr::map_int(.,nrow)>0]} -> possibleTranslations
70 |
71 | possibleTranslations %>% sapply(function(trans){
72 | sum(c(trans[,2],trans[,4]) %in% targetGenes)
73 | }) -> translationCounts
74 |
75 | if(!returnAllPossible){
76 | translationCounts %>% which.max %>% {possibleTranslations[[.]]} -> possibleTranslations
77 | if(sum(translationCounts>0)>1){
78 | bestMatch = translationCounts %>% which.max
79 | nextBest = max(translationCounts[-bestMatch])
80 | warning('There are other pairings, best of which has ',nextBest, ' matching genes')
81 | }
82 | } else{
83 | possibleTranslations = possibleTranslations[translationCounts!=0]
84 | }
85 | return(possibleTranslations)
86 | }
87 |
--------------------------------------------------------------------------------
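The `pairwise` matrix built at the top of `autoTranslate` above enumerates every ordered (origin, target) taxon pair, including a species paired with itself. The sketch below reproduces that construction in base R on three taxon ids (9606, 10090 and 10116, taken from the species list in the `diopt` documentation). The intermediate names `forward`, `reverse` and `self` are illustrative and do not appear in the package.

```r
# Three taxon ids standing in for unique(db$Taxonomy)
taxa <- c(9606, 10090, 10116)

# Unordered pairs, one per column: row 1 = origin, row 2 = target
forward <- utils::combn(taxa, 2)

# The same pairs with origin and target swapped
reverse <- forward[c(2, 1), , drop = FALSE]

# Each species paired with itself
self <- rbind(taxa, taxa, deparse.level = 0)

# All ordered pairs: length(taxa)^2 columns in total
pairwise <- cbind(forward, reverse, self)

# Restricting the origins works as in autoTranslate: keep matching columns
from_mouse <- pairwise[, pairwise[1, ] %in% 10090, drop = FALSE]
print(from_mouse)
```

Using `drop = FALSE` keeps the result a two-row matrix even when a single pair survives the filter, which is why the original code passes it on every subsetting step.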
/R/diopt.R:
--------------------------------------------------------------------------------
1 |
2 |
3 | #' Query DIOPT database
4 | #'
5 | #' Query DIOPT database (\url{https://www.flyrnai.org/cgi-bin/DRSC_orthologs.pl}) for orthologues.
6 | #' DIOPT database uses multiple tools to find gene orthologues. Sadly they don't have an
7 | #' API so this function queries by visiting the site and filling out the form. By default
8 | #' each query will take a minimum of 10 seconds due to the \code{delay} parameter. This
9 | #' is taken from their robots.txt at the time this function was written.
10 | #' Note that DIOPT is not necessarily in sync with the homologene database as provided in this package.
11 | #'
12 | #' DIOPT does not support all species available in the homologene database. The supported
13 | #' species are:
14 | #'
15 | #' \describe{
16 | #' \item{4896}{Schizosaccharomyces pombe}
17 | #' \item{4932}{Saccharomyces cerevisiae}
18 | #' \item{6239}{Caenorhabditis elegans}
19 | #' \item{7227}{Drosophila melanogaster}
20 | #' \item{7955}{Danio rerio}
21 | #' \item{8364}{Xenopus (Silurana) tropicalis}
22 | #' \item{9606}{Homo sapiens}
23 | #' \item{10090}{Mus musculus}
24 | #' \item{10116}{Rattus norvegicus}
25 | #' \item{3702}{Arabidopsis thaliana}
26 | #' }
27 | #'
28 | #'
29 | #' @param genes A vector of gene identifiers. Anything that DIOPT accepts
30 | #' @param inTax taxid of the species that the input genes are coming from
31 | #' @param outTax taxid of the species that you are seeking homology in. 0 to query
32 | #' all species. It must be specified unless paralogue = TRUE
33 | #' @param paralogue If TRUE, searches for paralogues instead of orthologues.
34 | #' outTax cannot be specified when searching for paralogues
35 | #' @param delay How many seconds of delay should be between queries. Default is 10
36 | #' based on the robots.txt at the time this function is written.
37 | #'
38 | #' @return A data frame
39 | #' @export
40 | #'
41 | diopt = function(genes, inTax, outTax = NULL, paralogue = FALSE, delay = 10){
42 | # rtxt = robotstxt::robotstxt(domain = "flyrnai.org")
43 | # delay = rtxt$crawl_delay %>% filter(useragent =='*') %$% value %>% as.integer()
44 | session = rvest::session('https://www.flyrnai.org/cgi-bin/DRSC_orthologs.pl')
45 | # session = rvest::html_session('https://www.flyrnai.org/cgi-bin/DRSC_orthologs.pl', httr::config(ssl_verifypeer = 0L))
46 | form = rvest::html_form(session)[[1]]
47 |
48 | if(paralogue){
49 | assertthat::assert_that(is.null(outTax),msg = 'outTax cannot be specified when querying paralogues')
50 | form$fields[[1]]$attr$class = "btn btn-outline-primary"
51 | form$fields[[2]]$attr$class = "btn btn-outline-primary active"
52 | outTax = "9606"
53 | } else{
54 | assertthat::assert_that(!is.null(outTax),msg = 'outTax must be specified when querying orthologues')
55 | acceptableOutTax = form$fields$output_species$options
56 | assertthat::assert_that(outTax %in% acceptableOutTax)
57 | }
58 |
59 | acceptableInTax= form$fields$input_species$options
60 |
61 | assertthat::assert_that(inTax %in% acceptableInTax)
62 |
63 | form = rvest::html_form_set(form,
64 | input_species = inTax,
65 | output_species = outTax,
66 | gene_list = paste(genes,collapse = '\n\r'))
67 |
68 | # additional_filters = which(names(form$fields) == 'additional_filter')
69 |
70 | # additional_filter_names = form$fields[additional_filters] %>% purrr::map_chr('value')
71 |
72 | # form$fields[additional_filters][additional_filter_names %in% 'None'][[1]]$attr$checked = 'checked'
73 | # form$fields[additional_filters][additional_filter_names %in% 'NoLow'][[1]]$attr$checked = NULL
74 |
75 | values = form$fields %>% purrr::map('value')
76 | additional_filters = names(values) == 'additional_filter'
77 | noneField = values %>% purrr::map_lgl(function(x){length(x)==1&&x !='None'})
78 | form$fields = form$fields[!(additional_filters & noneField)]
79 |
80 |
81 | values = form$fields %>% purrr::map('value')
82 | search_datasets = names(values) == 'search_datasets'
83 | allField = values %>% purrr::map_lgl(function(x){length(x)==1&&x !='All'})
84 | form$fields = form$fields[!(search_datasets & allField)]
85 |
86 | # values = form$fields %>% purrr::map('value')
87 | # search_datasets = names(values) == 'search_fields'
88 | # allField = values %>% purrr::map_lgl(function(x){length(x)==1&&x !='***'})
89 | # form$fields = form$fields[!(search_datasets & allField)]
90 |
91 | Sys.sleep(delay)
92 |
93 | response = rvest::html_form_submit(form,submit = 'submit')
94 |
95 | # writeLines(ogbox::as.char(session$response),'hede.html')
96 | # utils::browseURL('hede.html')
97 | # writeBin(response$content,'hede.html')
98 | # utils::browseURL('hede.html')
99 |
100 | output = response %>%
101 | xml2::read_html() %>%
102 | rvest::html_node('#results-table') %>%
103 | rvest::html_table()
104 | return(output)
105 | }
106 |
--------------------------------------------------------------------------------
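The field-pruning step in `diopt` above keeps only the `additional_filter` entry whose value is `'None'` and leaves every other form field untouched. The sketch below shows the same logic in base R on a hand-written list. The `fields` object is a hypothetical stand-in for `form$fields` and does not reflect the real structure of the DIOPT form.

```r
# Hypothetical stand-in for form$fields: repeated names, one value each
fields <- list(
  additional_filter = list(value = "None"),
  additional_filter = list(value = "NoLow"),
  gene_list = list(value = "")
)

# Pull out the value of every field
values <- lapply(fields, `[[`, "value")

# Fields that belong to the additional_filter group
is_filter <- names(values) == "additional_filter"

# Fields with a single value other than 'None'
not_none <- vapply(values,
                   function(x) length(x) == 1 && x != "None",
                   logical(1))

# Drop only the fields that are in the group AND are not 'None'
kept <- fields[!(is_filter & not_none)]
print(names(kept))
```

Combining the two conditions with `&` is what protects unrelated fields such as `gene_list`, whose value also differs from `'None'`.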
/R/geneHistory.R:
--------------------------------------------------------------------------------
1 | #' Download gene symbol information
2 | #'
3 | #' This function downloads the gene_info file from NCBI website and returns the
4 | #' gene symbols for current IDs.
5 | #'
6 | #' @param destfile Path of the output file. If NULL a temp file will be used
7 | #' @param justRead If TRUE and destfile exists, it reads the file instead of
8 | #' downloading the latest one from NCBI
9 | #' @param chunk_size Chunk size to be used with \code{\link[readr]{read_tsv_chunked}}.
10 | #' The gene_info file is big enough to make its intake difficult. If you don't
11 | #' have large amounts of free memory you may have to reduce this number to read
12 | #' the file in smaller chunks
13 | #'
14 | #' @return A data frame with gene symbols for each current gene id
15 | #' @export
16 | #'
17 | getGeneInfo = function(destfile = NULL, justRead = FALSE,chunk_size = 1000000){
18 | if(is.null(destfile)){
19 | destfile = tempfile()
20 | }
21 | if(!(!is.null(destfile) && file.exists(destfile) && justRead)){
22 | utils::download.file('https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz',
23 | paste0(destfile,'.gz'))
24 |
25 | R.utils::gunzip(paste0(destfile,'.gz'), overwrite = TRUE)
26 | }
27 |
28 | callBack = function(x,pos){
29 | x[,c(1,2,3)]
30 | }
31 | geneInfo = readr::read_tsv_chunked(destfile,
32 | readr::DataFrameCallback$new(callBack),
33 | col_names = c('tax_id','GeneID','Symbol'),
34 | chunk_size = chunk_size, skip = 1,
35 | col_types = 'iic')
36 | return(geneInfo)
37 | }
38 |
39 |
40 | #' Download gene history file
41 | #'
42 | #' Downloads and reads the gene history file from NCBI website. This file is needed for
43 | #' other functions
44 | #'
45 | #' @param destfile Path of the output file. If NULL a temp file will be used
46 | #' @param justRead If TRUE and destfile exists, it reads the file instead of
47 | #' downloading the latest one from NCBI
48 | #'
49 | #' @return A data frame with latest gene history information
50 | #' @export
51 | #'
52 | getGeneHistory = function(destfile = NULL, justRead = FALSE){
53 | if(is.null(destfile)){
54 | destfile = tempfile()
55 | }
56 |
57 | if(!(!is.null(destfile) && file.exists(destfile) && justRead)){
58 | utils::download.file(url = "https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_history.gz",
59 | destfile = paste0(destfile,'.gz'))
60 |
61 |
62 | R.utils::gunzip(paste0(destfile,'.gz'), overwrite = TRUE)
63 | }
64 |
65 | gene_history = readr::read_tsv(destfile,
66 | col_names = c('tax_id',
67 | 'GeneID',
68 | 'Discontinued_GeneID',
69 | 'Discontinued_Symbol',
70 | 'Discontinue_Date'),skip = 1,
71 | col_types = 'icici')
72 | return(gene_history)
73 | }
74 |
75 |
76 | #' Update gene IDs
77 | #'
78 | #' Given a list of gene ids and gene history information, traces changes in the
79 | #' gene's name to get the latest valid ID
80 | #'
81 | #' @param ids Gene ids
82 | #' @param gene_history Gene history information, probably returned by \code{\link{getGeneHistory}}
83 | #'
84 | #' @return A character vector. New ids for genes that changed ids, "-" for discontinued genes, and
85 | #' the input itself for ids that have not changed.
86 | #' @export
87 | #'
88 | #' @examples
89 | #' \dontrun{
90 | #' gene_history = getGeneHistory()
91 | #' updateIDs(c("4340964", "4349034", "4332470", "4334151", "4323831"),gene_history)
92 | #' }
93 | #'
94 | updateIDs = function(ids, gene_history){
95 | # we do not filter for taxonomy information as some genes use alternative
96 | # tax ids in non homologene sources
97 | # we do filter for earliest date found to run this a little faster
98 | earlierst_date = gene_history %>%
99 | dplyr::filter(Discontinued_GeneID %in% as.integer(ids)) %$%
100 | Discontinue_Date %>%
101 | {suppressWarnings(min(.))}
102 |
103 | relevant_gene_history = gene_history %>%
104 | dplyr::filter(Discontinue_Date >= earlierst_date
105 | )
106 |
107 | # just speed things along if the input id list includes ids that
108 | # are not discontinued
109 | idsToProcess = ids %in% relevant_gene_history$Discontinued_GeneID
110 | if(sum(idsToProcess)>0){
111 | ids[idsToProcess] = ids[idsToProcess] %>% sapply(traceID,relevant_gene_history)
112 | }
113 | return(ids)
114 |
115 | }
116 |
117 |
118 |
119 | traceID = function(id,gene_history){
120 | event = gene_history %>% dplyr::filter(Discontinued_GeneID == as.integer(id))
121 | if(nrow(event)>1){
122 | # just in case. if the same ID is discontinued twice, there is a problem...
123 | return("multiple events")
124 | } else if(nrow(event) == 0){
125 | return(id)
126 | }
127 |
128 | while(TRUE){
129 | if(event$GeneID == '-'){
130 | # if this condition wasn't there, this function would have worked just fine but
131 | # looking for '-'s take much longer than looking for IDs
132 | return('-')
133 | }
134 | # see if the new ID is discontinued as well
135 | # the check for the "-"s above allows us to do an integer matching here
136 | # which is faster
137 | next_event = gene_history %>%
138 | dplyr::filter(Discontinued_GeneID == as.integer(event$GeneID))
139 | if(nrow(next_event)==0){
140 | # if not, previous ID is the right one
141 | return(event$GeneID)
142 | } else if(nrow(next_event)>1){
143 | # just in case, if the same ID is discontinued twice, there is a problem...
144 | return("multiple events")
145 | } else if(nrow(next_event) == 1){
146 | # if the new IDs is discontinued, continue the loop and check if it has a parent
147 | event = next_event
148 | }
149 | }
150 | }
151 |
152 |
153 |
154 | #' Get the latest homologene file
155 | #'
156 | #' This function downloads the latest homologene file from NCBI. Note that Homologene
157 | #' has not been updated since 2014 so the output will be identical to \code{\link{homologeneData}}
158 | #' included in this package. This function is here for futureproofing purposes.
159 | #'
160 | #' @param destfile Path of the output file. If NULL a temp file will be used
161 | #' @param justRead If TRUE and destfile exists, it reads the file instead of
162 | #' downloading the latest one from NCBI
163 | #'
164 | #' @return A data frame with homology groups, gene ids and gene symbols
165 | #' @export
166 | #'
167 | getHomologene = function(destfile = NULL, justRead = FALSE){
168 | if(is.null(destfile)){
169 | destfile = tempfile()
170 | }
171 | if(!(!is.null(destfile) && file.exists(destfile) && justRead)){
172 | utils::download.file('https://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data',
173 | destfile)
174 | }
175 |
176 | homologene = readr::read_tsv(destfile,
177 | col_names = c('HID','Taxonomy','Gene.ID','Gene.Symbol','Protein.GI','Protein.Accession'),
178 | col_types = 'iiicic')
179 |
180 | homologeneData = homologene %>%
181 | dplyr::select(HID,Gene.ID,Gene.Symbol,Taxonomy) %>%
182 | unique %>%
183 | dplyr::arrange(HID)
184 |
185 | homologeneData %<>% as.data.frame
186 | }
187 |
--------------------------------------------------------------------------------
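The chain-following idea behind `traceID` in the file above can be illustrated on a toy history table. The sketch below is a simplified base R re-implementation, not the package's code: `trace_id` is a hypothetical name, the table holds made-up ids, and only the two columns needed for tracing are included.

```r
# Toy history: 1 was replaced by 2, 2 by 3, and 4 was discontinued outright
gene_history <- data.frame(
  GeneID = c("2", "3", "-"),
  Discontinued_GeneID = c(1L, 2L, 4L),
  stringsAsFactors = FALSE
)

# Follow replacement records until an id with no further record is reached
trace_id <- function(id, history) {
  current <- as.character(id)
  repeat {
    hit <- history[history$Discontinued_GeneID == as.integer(current), ,
                   drop = FALSE]
    if (nrow(hit) == 0) return(current)            # still valid
    if (nrow(hit) > 1) return("multiple events")   # ambiguous history
    if (hit$GeneID == "-") return("-")             # discontinued, no successor
    current <- hit$GeneID                          # keep following the chain
  }
}

traced <- vapply(c(1, 4, 5), trace_id, character(1), history = gene_history)
print(traced)
```

Checking for `"-"` before the next lookup mirrors the original: it guarantees that `as.integer()` is only ever applied to a numeric id.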
/R/homologene.R:
--------------------------------------------------------------------------------
1 | #' Get homologues of given genes
2 | #' @description Given a list of genes and a taxid, returns a data frame including the genes and their corresponding homologues
3 | #' @param genes A vector of gene symbols or NCBI ids
4 | #' @param inTax taxid of the species that the input genes are coming from
5 | #' @param outTax taxid of the species that you are seeking homology
6 | #' @param db Homologene database to use.
7 | #' @export
8 | #' @examples
9 | #' homologene(c('Eno2','17441'), inTax = 10090, outTax = 9606)
10 | homologene = function(genes, inTax, outTax, db = homologene::homologeneData){
11 | genes <- unique(genes) #remove duplicates
12 | out = db %>%
13 | dplyr::filter(Taxonomy %in% inTax & (Gene.Symbol %in% genes | Gene.ID %in% genes)) %>%
14 | dplyr::select(HID,Gene.Symbol,Gene.ID)
15 | names(out)[2] = inTax
16 | names(out)[3] = paste0(inTax,'_ID')
17 |
18 | out2 = db %>% dplyr::filter(Taxonomy %in% outTax & HID %in% out$HID) %>%
19 | dplyr::select(HID,Gene.Symbol,Gene.ID)
20 | names(out2)[2] = outTax
21 | names(out2)[3] = paste0(outTax,'_ID')
22 |
23 | # merge from HID to support translate from self
24 | output = merge(out,out2,'HID') %>%
25 | dplyr::select(2,4,3,5)
26 |
27 | # preserve order with temporary column
28 | output$sortBy <- factor(output[,1], levels = genes)
29 | output <- dplyr::arrange(output, sortBy)
30 | output$sortBy <- NULL
31 | output %<>% {colnames(.)= gsub('\\.(x|y)','',colnames(.));.}
32 |
33 | return(output)
34 | }
35 |
36 | #' Mouse/human wrapper for homologene
37 | #' @param genes A vector of gene symbols or NCBI ids
38 | #' @param db Homologene database to use.
39 | #' @export
40 | #' @examples
41 | #' mouse2human(c('Eno2','17441'))
42 | mouse2human = function(genes, db = homologene::homologeneData){
43 | out = homologene(genes,10090,9606, db)
44 | names(out) = c('mouseGene', 'humanGene','mouseID','humanID')
45 | return(out)
46 | }
47 |
48 |
49 | #' Human/mouse wrapper for homologene
50 | #' @param genes A vector of gene symbols or NCBI ids
51 | #' @param db Homologene database to use.
52 | #' @export
53 | #' @examples
54 | #' human2mouse(c('ENO2','4340'))
55 | human2mouse = function(genes, db = homologene::homologeneData){
56 | out = homologene(genes,9606,10090, db)
57 | names(out) = c('humanGene','mouseGene','humanID','mouseID')
58 | return(out)
59 | }
60 |
61 |
62 | #' homologeneData
63 | #'
64 | #' List of gene homologues used by homologene functions
65 | "homologeneData"
66 |
67 |
68 | #' Version of homologene used
69 | "homologeneVersion"
70 |
71 | #' Names and ids of included species
72 | "taxData"
--------------------------------------------------------------------------------
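The lookup performed by `homologene` above amounts to two filters joined on the homology group id (`HID`). The sketch below shows that join in base R on a four-row toy table. The table is hand-written for illustration (the HIDs in particular are made up) and is not an excerpt of `homologeneData`. The names `src`, `dst` and `res` are likewise illustrative.

```r
# Toy database with the same columns as homologeneData (illustrative values)
db <- data.frame(
  HID = c(1L, 1L, 2L, 2L),
  Gene.Symbol = c("ENO2", "Eno2", "MOG", "Mog"),
  Gene.ID = c(2026L, 13807L, 4340L, 17441L),
  Taxonomy = c(9606L, 10090L, 9606L, 10090L),
  stringsAsFactors = FALSE
)

# Queries may mix symbols and ids, as in the roxygen example
genes <- c("Eno2", "17441")

# Rows of the input species that match by symbol or by id
src <- db[db$Taxonomy == 10090 &
            (db$Gene.Symbol %in% genes | db$Gene.ID %in% genes), ]

# Rows of the output species that share a homology group with the matches
dst <- db[db$Taxonomy == 9606 & db$HID %in% src$HID, ]

# Join the two sides on HID
res <- merge(src, dst, by = "HID", suffixes = c(".in", ".out"))
print(res[, c("Gene.Symbol.in", "Gene.Symbol.out", "Gene.ID.in", "Gene.ID.out")])
```

Joining on `HID` rather than on symbols is what allows a species to be translated to itself and lets one gene map to several homologues in the same group.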
/R/homologeneData2.R:
--------------------------------------------------------------------------------
1 | #' homologeneData2
2 | #'
3 | #' A modified copy of the homologene database. Homologene was last updated in 2014 and many of its gene IDs and
4 | #' symbols are out of date. Here the IDs and symbols are replaced with their most current versions.
5 | #' Last update: Tue Oct 31 18:41:52 2023
6 | "homologeneData2"
7 |
--------------------------------------------------------------------------------
/R/import.R:
--------------------------------------------------------------------------------
1 | #' @importFrom magrittr %>%
2 | #' @export
3 | magrittr::`%>%`
4 |
5 | #' @importFrom magrittr %<>%
6 | #' @export
7 | magrittr::`%<>%`
8 |
9 | #' @importFrom magrittr %$%
10 | #' @export
11 | magrittr::`%$%`
12 |
13 | utils::globalVariables(c("Taxonomy",
14 | "Gene.Symbol",
15 | "Gene.ID",
16 | "HID",
17 | "sortBy",
18 | ".",
19 | "Discontinued_GeneID",
20 | "Discontinue_Date",
21 | "Gene2FunctionDetails",
22 | "Feedback",
23 | "Alignment & Scores"))
24 |
--------------------------------------------------------------------------------
/R/updateHomologene.R:
--------------------------------------------------------------------------------
1 | #' Update homologene database
2 | #'
3 | #' Creates an updated version of the homologene database. This is done by downloading
4 | #' the latest gene annotation information and tracing changes in gene symbols and
5 | #' identifiers over history. \code{\link{homologeneData2}} was created using
6 | #' this function over the original \code{\link{homologeneData}}. This function
7 | #' requires downloading large amounts of data from the NCBI ftp servers.
8 | #'
9 | #' @param destfile Optional. Path of the output file.
10 | #' @param baseline The baseline homologene file to be used. By default uses the
11 | #' \code{\link{homologeneData2}} that is included in this package. The more ids
12 | #' to update, the more time is needed for the update which is why the default option
13 | #' uses an already updated version of the original database.
14 | #' @param gene_history A gene history data frame, possibly returned by \code{\link{getGeneHistory}}
15 | #' function. Use this if you want to have a static gene_history file to update up to a specific date.
16 | #' An up to date gene_history object can be set to update to a specific date by trimming
17 | #' rows that have recent dates. Note that the same is not possible for the gene_info.
18 | #' If not provided, the latest file will be downloaded.
19 | #' @param gene_info A gene info data frame that contains ID-symbol matches,
20 | #' possibly returned by \code{\link{getGeneInfo}}. Use this if you
21 | #' want a static version. Should be in sync with the gene_history file. Note that there is
22 | #' no easy way to track changes in gene symbols back in time so if you want to update it up
23 | #' to a specific date, make sure you don't lose that file.
24 | #'
25 | #' @return Homologene database in a data frame with updated gene IDs and symbols
26 | #' @export
27 | #'
28 | updateHomologene = function(destfile = NULL,
29 | baseline = homologene::homologeneData2,
30 | gene_history = NULL,
31 | gene_info = NULL){
32 |
33 | if(is.null(gene_history)){
34 | message('acquiring gene history data')
35 | gene_history = getGeneHistory()
36 | }
37 | # identify discontinued ids
38 | discontinued_ids = baseline %>%
39 | dplyr::filter(Gene.ID %in% gene_history$Discontinued_GeneID)
40 |
41 | unchanged_ids = baseline %>%
42 | dplyr::filter(!Gene.ID %in% gene_history$Discontinued_GeneID)
43 |
44 | # we do not filter for taxonomy information as some genes use alternative
45 | # tax ids in non homologene sources
46 | # we do filter for earliest date found to run this a little faster
47 |
48 | message('Tracing discontinued IDs. This might take a while.')
49 | discontinued_ids$Gene.ID %>% updateIDs(gene_history) ->
50 | new_ids
51 |
52 | # create a frame with new ids
53 | discontinued_fix = data.frame(HID = discontinued_ids$HID,
54 | Gene.Symbol = discontinued_ids$Gene.Symbol,
55 | Taxonomy = discontinued_ids$Taxonomy,
56 | Gene.ID = new_ids,
57 | stringsAsFactors = FALSE)
58 |
59 | discontinued_fix %<>% dplyr::filter(Gene.ID != '-')
60 |
61 | new_homo_frame =
62 | rbind(discontinued_fix,unchanged_ids) %>%
63 | dplyr::arrange(HID)
64 |
65 | new_homo_frame %<>% dplyr::mutate(
66 | Gene.ID = as.integer(Gene.ID)
67 | )
68 |
69 |
70 | if(is.null(gene_info)){
71 | message('Downloading gene symbol information')
72 | gene_info = getGeneInfo()
73 | }
74 |
75 | message('Updating gene symbols')
76 | matchToHomologene = match(new_homo_frame$Gene.ID,gene_info$GeneID)
77 |
78 | # tax information isn't really needed here. just added for testing purposes
79 | modern_frame = data.frame(modern_ids = new_homo_frame$Gene.ID,
80 | modern_symbols = gene_info$Symbol[matchToHomologene],
81 | modern_tax = gene_info$tax_id[matchToHomologene],stringsAsFactors = FALSE)
82 |
83 | new_homo_frame %<>%
84 | dplyr::mutate(Gene.Symbol = modern_frame$modern_symbols)
85 | # remove convergent gene ids with same HIDs
86 | new_homo_frame %<>% unique()
87 | if(!is.null(destfile)){
88 | utils::write.table(new_homo_frame,destfile,
89 | sep='\t', row.names=FALSE,quote = FALSE)
90 |
91 | }
92 |
93 | return(new_homo_frame)
94 | }
--------------------------------------------------------------------------------
/README.rmd:
--------------------------------------------------------------------------------
1 | ---
2 | output:
3 | github_document:
4 | html_preview: false
5 | ---
6 | ```{r setup, include=FALSE}
7 | knitr::opts_chunk$set(echo = TRUE)
8 | library(knitr)
9 | library(badger)
10 | library(magrittr)
11 | devtools::load_all()
12 | ```
13 |
14 | # homologene
15 | [](https://travis-ci.org/oganm/homologene) [](https://codecov.io/gh/oganm/homologene) `r badge_cran_release('homologene',color = '#32BD36')` `r badge_devel("oganm/homologene", "blue")`
16 |
17 | An R package that works as a wrapper for the NCBI HomoloGene database
18 |
19 | Available species are
20 |
21 | ```{r}
22 | homologene::taxData
23 | ```
24 |
25 | Installation
26 | ============
27 | ```r
28 | install.packages('homologene')
29 | ```
30 |
31 | or
32 |
33 | ```r
34 | devtools::install_github('oganm/homologene')
35 | ```
36 |
37 | Usage
38 | ===========
39 | The basic `homologene` function requires a vector of gene symbols or NCBI ids, an `inTax` and an `outTax`. In this example, `inTax` is the taxon id of *Mus musculus* while `outTax` is that of humans.
40 | ```{r}
41 | homologene(c('Eno2','Mog'), inTax = 10090, outTax = 9606)
42 |
43 | homologene(c('Eno2','17441'), inTax = 10090, outTax = 9606)
44 | ```
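
Conceptually, the lookup joins genes of the two species through their shared homology group (`HID`). The following is a minimal base R sketch of that idea on made-up rows. `toy_db` and `toy_homologene` are hypothetical names used only for illustration, and this is not the package's actual implementation; only the column names (`HID`, `Gene.Symbol`, `Taxonomy`, `Gene.ID`) are taken from the package data.

```r
# Toy stand-in for homologeneData: same four columns, invented rows.
toy_db <- data.frame(
  HID         = c(1L, 1L, 2L, 2L, 3L),
  Gene.Symbol = c("Eno2", "ENO2", "Mog", "MOG", "Abc1"),
  Taxonomy    = c(10090L, 9606L, 10090L, 9606L, 10090L),
  Gene.ID     = c(101L, 201L, 102L, 202L, 103L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: match inputs by symbol or id within inTax,
# then pair them with outTax genes that share the same HID.
toy_homologene <- function(genes, inTax, outTax, db) {
  in_rows <- db[db$Taxonomy == inTax &
                  (db$Gene.Symbol %in% genes | db$Gene.ID %in% genes), ]
  out_rows <- db[db$Taxonomy == outTax, ]
  merged <- merge(in_rows, out_rows, by = "HID", suffixes = c(".in", ".out"))
  merged[, c("Gene.Symbol.in", "Gene.Symbol.out", "Gene.ID.in", "Gene.ID.out")]
}

# Symbols and ids can be mixed in the query, as in the example above.
res <- toy_homologene(c("Eno2", "102"), inTax = 10090, outTax = 9606, db = toy_db)
print(res)
```

Genes without a partner in the target species simply produce no row in this sketch.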
45 |
46 | For mouse and human, two convenience functions exist that remove the need to provide taxonomic identifiers. Note that their column names are not the same as those of the `homologene` output.
47 | ```{r}
48 | mouse2human(c('Eno2','Mog'))
49 | human2mouse(c('ENO2','MOG','GZMH'))
50 | ```
51 |
52 |
53 | homologeneData2
54 | =================
55 | The original homologene database has not been updated since 2014.
56 | This package also includes an updated version of the homologene database that
57 | replaces gene symbols and identifiers with their latest versions. For the procedure followed for updating,
58 | see [this blog post](https://oganm.com/homologene-update/) and/or the [processing code](R/updateHomologene.R).
59 |
60 | Using the updated version can help you match genes that cannot be matched due to out-of-date annotations.
61 |
62 |
63 | ```{r}
64 | mouse2human(c('Mesd',
65 | 'Trp53rka',
66 | 'Cstdc4',
67 | 'Ifit3b'))
68 |
69 |
70 | mouse2human(c('Mesd',
71 | 'Trp53rka',
72 | 'Cstdc4',
73 | 'Ifit3b'),
74 | db = homologeneData2)
75 | ```
76 |
77 |
78 | The `homologeneData2` object that comes with the GitHub version of this package
79 | is updated weekly, but if you are using the CRAN version and want the latest
80 | annotations, or if you want to keep
81 | a frozen version of homologene, you can use the `updateHomologene` function.
82 |
83 |
84 | ```r
85 | homologeneDataVeryNew = updateHomologene() # update the homologene database with the latest identifiers
86 |
87 | mouse2human(c('Mesd',
88 | 'Trp53rka',
89 | 'Cstdc4',
90 | 'Ifit3b'),
91 | db = homologeneDataVeryNew)
92 |
93 | ```
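
The update itself boils down to a few data frame steps: split the baseline into discontinued and unchanged ids, swap in the replacement ids, drop withdrawn entries, and refresh the symbols. Below is a rough base R sketch of those steps on invented data. The objects `baseline`, `new_id_of` and `gene_info` are toy stand-ins, and the replacement ids are assumed to be already known; the real function obtains them from NCBI files.

```r
# Toy baseline with the same columns as homologeneData.
baseline <- data.frame(
  HID         = c(1L, 1L, 2L, 3L),
  Gene.Symbol = c("OldA", "B", "C", "OldD"),
  Taxonomy    = c(10090L, 9606L, 10090L, 10090L),
  Gene.ID     = c(11L, 12L, 13L, 14L),
  stringsAsFactors = FALSE
)

# Assumed outcome of id tracing: 11 was replaced by 21, 14 was withdrawn ('-').
new_id_of <- c("11" = "21", "14" = "-")

# Toy symbol table for the current ids.
gene_info <- data.frame(
  GeneID = c(12L, 13L, 21L),
  Symbol = c("B", "C", "NewA"),
  stringsAsFactors = FALSE
)

# Split into discontinued and unchanged rows.
is_discontinued <- as.character(baseline$Gene.ID) %in% names(new_id_of)
fixed     <- baseline[is_discontinued, ]
unchanged <- baseline[!is_discontinued, ]

# Swap in the new ids and drop entries without a replacement.
fixed$Gene.ID <- unname(new_id_of[as.character(fixed$Gene.ID)])
fixed <- fixed[fixed$Gene.ID != "-", ]
fixed$Gene.ID <- as.integer(fixed$Gene.ID)

# Recombine, refresh symbols from gene_info, and remove duplicated rows.
updated <- rbind(fixed, unchanged)
updated <- updated[order(updated$HID), ]
updated$Gene.Symbol <- gene_info$Symbol[match(updated$Gene.ID, gene_info$GeneID)]
updated <- unique(updated)
rownames(updated) <- NULL
print(updated)
```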
94 |
95 |
96 | Gene ID synchronization
97 | =========================
98 |
99 | The package also includes the functions that were used to create `homologeneData2`, which can update outdated gene symbols and identifiers.
100 |
101 | ```{r, cache = TRUE}
102 | library(dplyr)
103 |
104 | gene_history = getGeneHistory()
105 | oldIds = c(4340964, 4349034, 4332470, 4334151, 4323831)
106 | newIds = updateIDs(oldIds,gene_history)
107 | print(newIds)
108 | # get the latest gene symbols for the ids
109 |
110 | gene_info = getGeneInfo()
111 |
112 | gene_info %>%
113 | dplyr::filter(GeneID %in% as.integer(newIds)) # faster to match integers
114 |
115 | ```
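
To give an intuition for what tracing an id through the gene history involves, here is a small base R sketch. It assumes a history table with a `Discontinued_GeneID` column and a `GeneID` column holding the replacement id (or `'-'` when an id was withdrawn without a replacement), and it assumes the history contains no cycles. `toy_history` and `toy_update_ids` are hypothetical and are not the package's `updateIDs`.

```r
# Invented history: 1 was replaced by 2, 2 by 3, and 9 was withdrawn.
toy_history <- data.frame(
  GeneID              = c("2", "3", "-"),
  Discontinued_GeneID = c("1", "2", "9"),
  stringsAsFactors = FALSE
)

# Follow each id through the history until it is no longer listed
# as discontinued, or until it turns out to have no replacement.
toy_update_ids <- function(ids, gene_history) {
  vapply(as.character(ids), function(id) {
    repeat {
      hit <- match(id, gene_history$Discontinued_GeneID)
      if (is.na(hit)) return(id)        # id is current
      id <- gene_history$GeneID[hit]    # move to the replacement
      if (id == "-") return(id)         # withdrawn, nothing to follow
    }
  }, character(1), USE.NAMES = FALSE)
}

traced <- toy_update_ids(c(1, 3, 9, 5), toy_history)
print(traced)
```

An id may have been replaced more than once, which is why the lookup is repeated rather than done in a single step.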
116 |
117 | Querying DIOPT
118 | ==============
119 |
120 | Instead of using just homologene, one can also query the [DIOPT database](https://www.flyrnai.org/cgi-bin/DRSC_orthologs.pl). DIOPT uses multiple databases
121 | to find gene homologues/orthologues. Note that this function has a `delay` parameter
122 | that is set to 10 seconds by default. This was done to obey the `robots.txt` of their website.
123 |
124 | ```{r, cache = TRUE}
125 |
126 | diopt(c('GZMH'),inTax = 9606, outTax = 10090) %>%
127 | knitr::kable()
128 |
129 | diopt(c('Eno2','Mog'),inTax = 10090, outTax =9606) %>%
130 | knitr::kable()
131 |
132 | ```
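
If you wrap your own web queries in a similar way, a delay between requests can be sketched as below. This is only an illustration of the throttling idea: `throttled_map` is a hypothetical helper, the stand-in query function does no network access, and whether `diopt` waits before or after each request is not something this sketch claims to reproduce.

```r
# Hypothetical helper: apply query_fun to each input, pausing
# `delay` seconds between consecutive calls.
throttled_map <- function(inputs, query_fun, delay = 10) {
  results <- vector("list", length(inputs))
  for (i in seq_along(inputs)) {
    if (i > 1) Sys.sleep(delay)  # no pause needed before the first call
    results[[i]] <- query_fun(inputs[[i]])
  }
  results
}

# Stand-in for a real query; a short delay keeps the example quick.
start <- Sys.time()
out <- throttled_map(c("gzmh", "eno2"), toupper, delay = 0.2)
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
print(out)
```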
133 |
134 |
135 | Mishaps
136 | =================
137 | As of version 1.1.68, the output includes NCBI ids. Since it doesn't change any of the existing column names or their order, this shouldn't cause problems in most use cases.
138 |
139 | If you can't find a gene you are looking for, it may have synonyms. See the [geneSynonym](https://github.com/oganm/geneSynonym.git) package to find them. If you have other problems, open an issue or send a mail.
140 |
--------------------------------------------------------------------------------
/cran-comments.md:
--------------------------------------------------------------------------------
1 | ## Test environments
2 | * local ubuntu 16.04, R 3.5.2
3 | * ubuntu 14.04 (on travis-ci), R 3.5.2
4 |
5 | ## R CMD check results
6 |
7 | 0 errors | 0 warnings | 0 notes
8 |
9 | * This is a resubmission.
10 | * License file is fixed
11 | * Date is updated
12 |
--------------------------------------------------------------------------------
/data-raw/release:
--------------------------------------------------------------------------------
1 | 68
2 |
--------------------------------------------------------------------------------
/data-raw/taxData.tsv:
--------------------------------------------------------------------------------
1 | tax_id name_txt
2 | 10090 Mus musculus
3 | 10116 Rattus norvegicus
4 | 28985 Kluyveromyces lactis
5 | 318829 Magnaporthe oryzae
6 | 33169 Eremothecium gossypii
7 | 3702 Arabidopsis thaliana
8 | 4530 Oryza sativa
9 | 4896 Schizosaccharomyces pombe
10 | 4932 Saccharomyces cerevisiae
11 | 5141 Neurospora crassa
12 | 6239 Caenorhabditis elegans
13 | 7165 Anopheles gambiae
14 | 7227 Drosophila melanogaster
15 | 7955 Danio rerio
16 | 8364 Xenopus (Silurana) tropicalis
17 | 9031 Gallus gallus
18 | 9544 Macaca mulatta
19 | 9598 Pan troglodytes
20 | 9606 Homo sapiens
21 | 9615 Canis lupus familiaris
22 | 9913 Bos taurus
23 |
--------------------------------------------------------------------------------
/data/homologeneData.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oganm/homologene/9a9f99c4b596ccdd05a1ea1d7f62323bffb3b721/data/homologeneData.rda
--------------------------------------------------------------------------------
/data/homologeneData2.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oganm/homologene/9a9f99c4b596ccdd05a1ea1d7f62323bffb3b721/data/homologeneData2.rda
--------------------------------------------------------------------------------
/data/homologeneVersion.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oganm/homologene/9a9f99c4b596ccdd05a1ea1d7f62323bffb3b721/data/homologeneVersion.rda
--------------------------------------------------------------------------------
/data/taxData.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oganm/homologene/9a9f99c4b596ccdd05a1ea1d7f62323bffb3b721/data/taxData.rda
--------------------------------------------------------------------------------
/docs/LICENSE-text.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
109 |
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
110 |
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
An R package that works as a wrapper for the NCBI HomoloGene database
95 |
Available species are
96 |
97 |
Homo sapiens
98 |
Mus musculus
99 |
Rattus norvegicus
100 |
Danio rerio
101 |
Caenorhabditis elegans
102 |
Drosophila melanogaster
103 |
Rhesus macaque
104 |
105 |
More species can be added on request
106 |
107 |
108 |
109 | Installation
110 |
install.packages('homologene')
111 |
or
112 |
devtools::install_github('oganm/homologene')
113 |
114 |
115 |
116 | Usage
117 |
The basic homologene function requires a vector of gene symbols or NCBI ids, an inTax and an outTax. In this example, inTax is the taxon id of Mus musculus while outTax is that of humans.
For mouse and human, two convenience functions exist that remove the need to provide taxonomic identifiers. Note that their column names are not the same as those of the homologene output.
As of version 1.1.68, the output includes NCBI ids. Since it doesn’t change any of the existing column names or their order, this shouldn’t cause problems in most use cases. If this is an issue for you, please notify me.
144 |
If you can’t find a gene you are looking for, it may have synonyms. See the geneSynonym package to find them. If you have other problems, open an issue or send a mail.
Added a NEWS.md file to track changes to the package.
121 |
Added autoTranslate function to allow automated translation of gene symbols or ids.
122 |
123 | homologeneData2 is added as an updated version of the original homologene database (the original database has not been updated since 2014). This database includes the latest gene symbols and identifiers for every gene included in the original database. Outside CRAN (github version), this database is updated weekly.
124 |
Version number is extended to include the last update date of homologeneData2.
125 |
126 | updateHomologene function is added to allow users to create their own updated versions of homologene. Using homologeneData2 as a baseline with this function allows faster updates.
127 |
128 | getGeneHistory, updateIDs and getGeneInfo functions are added to allow users to update arbitrary gene lists with latest symbols and identifiers.
129 |
All species originally represented in the homologene database are added to the package.
Query DIOPT database (https://www.flyrnai.org/cgi-bin/DRSC_orthologs.pl) for orthologues.
118 | DIOPT database uses multiple tools to find gene orthologues. Sadly they don't have an
119 | API so this function queries by visiting the site and filling in the form. By default
120 | each query will take a minimum of 10 seconds due to the delay parameter. This
121 | is taken from their robots.txt at the time this function was written.
122 | Note that DIOPT is not necessarily in sync with the homologene database as provided in this package.
123 |
124 |
125 |
126 |
diopt(genes, inTax, outTax, delay=10)
127 |
128 |
Arguments
129 |
130 |
131 |
132 |
genes
133 |
A vector of gene identifiers. Anything that DIOPT accepts
134 |
135 |
136 |
inTax
137 |
taxid of the species that the input genes are coming from
138 |
139 |
140 |
outTax
141 |
taxid of the species in which you are seeking homologues
142 |
143 |
144 |
delay
145 |
How many seconds of delay should be between queries. Default is 10
146 | based on the robots.txt at the time this function was written.
Path of the output file. If NULL a temp file will be used
126 |
127 |
128 |
justRead
129 |
If TRUE and destfile exists, it reads the file instead of
130 | downloading the latest one from NCBI
131 |
132 |
133 |
chunk_size
134 |
Chunk size to be used with readr::read_tsv_chunked.
135 | The gene_info file is big enough to make its intake difficult. If you don't
136 | have large amounts of free memory you may have to reduce this number to read
137 | the file in smaller chunks
138 |
139 |
140 |
141 |
Value
142 |
143 |
A data frame with gene symbols for each current gene id
This function downloads the latest homologene file from NCBI. Note that Homologene
115 | has not been updated since 2014 so the output will be identical to homologeneData
116 | included in this package. This function is here for futureproofing purposes.
117 |
118 |
119 |
120 |
getHomologene(destfile=NULL, justRead=FALSE)
121 |
122 |
Arguments
123 |
124 |
125 |
126 |
destfile
127 |
Path of the output file. If NULL a temp file will be used
128 |
129 |
130 |
justRead
131 |
If TRUE and destfile exists, it reads the file instead of
132 | downloading the latest one from NCBI
133 |
134 |
135 |
136 |
Value
137 |
138 |
A data frame with homology groups, gene ids and gene symbols
A modified copy of the homologene database. Homologene was last updated in 2014 and many of its gene IDs and
115 | symbols are out of date. Here the IDs and symbols are replaced with their most current versions.
116 | Last update: Mon May 6 14:15:51 2019
117 |
118 |
119 |
120 |
homologeneData2
121 |
122 |
Format
123 |
124 |
An object of class data.frame with 269545 rows and 4 columns.
Creates an updated version of the homologene database. This is done by downloading
117 | the latest gene annotation information and tracing changes in gene symbols and
118 | identifiers over history. homologeneData2 was created using
119 | this function over the original homologeneData. This function
120 | requires downloading large amounts of data from the NCBI ftp servers.
The baseline homologene file to be used. By default uses the
138 | homologeneData2 that is included in this package. The more ids
139 | to update, the more time is needed for the update which is why the default option
140 | uses an already updated version of the original database.
141 |
142 |
143 |
gene_history
144 |
A gene history data frame, possibly returned by getGeneHistory
145 | function. Use this if you want to have a static gene_history file to update up to a specific date.
146 | An up to date gene_history object can be set to update to a specific date by trimming
147 | rows that have recent dates. Note that the same is not possible for gene_info.
148 | If not provided, the latest file will be downloaded.
149 |
150 |
151 |
gene_info
152 |
A gene info data frame that contains ID-symbol matches,
153 | possibly returned by getGeneInfo. Use this if you
154 | want a static version. Should be in sync with the gene_history file. Note that there is
155 | no easy way to track changes in gene symbols back in time so if you want to update it up
156 | to a specific date, make sure you don't lose that file.
157 |
158 |
159 |
160 |
Value
161 |
162 |
Homologene database in a data frame with updated gene IDs and symbols