├── LICENSE
├── README.md
└── scopusAPI.R

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Christopher Belter

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# scopusAPI

The functions in this file allow you to query the Scopus Search API in R and parse the results into a data frame. The file contains three functions: searchByString(), which allows you to query the API using an advanced search string; searchByID(), which allows you to search for a list of article IDs (PMIDs, DOIs, or Scopus EIDs); and extractXML(), which extracts values from the returned XML and parses them into a data frame you can work with in R.

## Before you begin

You will need to obtain a personal API key from Elsevier. You can request one at http://dev.elsevier.com/. Click on the "Get API Key" button, log in to the site, and then click on "register a new site." Enter the requested information and use your institution's home page as the site.

You will then need to copy/paste your API key into the scopusAPI.R file at lines 9 and 52, replacing the yourAPIKey text with your API key.

Note also that if your institution uses IP authentication to verify your access to Scopus, you have to make the API calls from an authenticated IP address. In practice, this means you need to make the API calls from a computer that is on your institution's network, either on campus or remotely via a VPN connection.

You will also need to install the httr and XML packages, if you haven't already done so, with the command

    install.packages(c("httr", "XML"))

Finally, there are some API limits you should be aware of.

First, if you want to download the full records of your search results (including the full author list and abstract for each document), you are limited to requesting 25 articles at a time. The searchByString() and searchByID() functions will automatically send multiple requests to the API to retrieve the full set of search results for your particular query, but they can only retrieve 25 full records at a time.

Second, you can only request the first 5,000 records using the 'offset' parameter. This used to mean that you could only request up to 5,000 records for a single search string, but Elsevier has added a new 'cursor' parameter that allows you to bypass this limit. The new version of the searchByString() function uses this cursor parameter instead of the myStart parameter to iterate through the search results. In practice, this means you shouldn't use the myStart parameter unless you really need to.

Finally, you are limited to downloading 20,000 records per week.


## The searchByString() method

This function allows you to run an advanced search through the API and download all of the search results. I recommend developing the search string in the Scopus web interface and then using that string in the API to obtain the results.

The function has eight arguments: string, content, myStart, retCount, retMax, mySort, cursor, and outfile.
* **string:** the advanced search string you want to use.
* **content:** how many fields you want to return (either "complete", which returns all available fields, or "standard", which returns an abbreviated record).
* **myStart:** which search result you want to start downloading from. Limited to the first 5,000 records for any given search string; setting this value to 5,001 or higher will result in an error. In practice, you shouldn't need to use this unless your download process was interrupted and you want to pick up at the point where the error happened.
* **retCount:** how many records you want to download per request. Limited to 25 per request for "complete" content; requests for more than 25 with the "complete" content type will return an error.
* **retMax:** the maximum number of records you want to download. The function will continue to make requests until it reaches either the total number of search results or the retMax, if specified. If unspecified, it will return all of the search results.
* **mySort:** how you want the search results to be sorted. Currently defaults to descending order by cover date ("-coverDate"), but could also be set to descending order by times cited count ("-citedby-count") or relevance ("-relevancy"). See the Scopus Search API wadl for more options.
* **cursor:** a parameter used to iterate through a set of search results beyond 5,000. You shouldn't ever change this from its default value.
* **outfile:** the file you want to save the data to.

All but two of these arguments have default values: content defaults to "complete", myStart to 0, retCount to 25, retMax to Inf, mySort to "-coverDate", and cursor to the necessary value supplied by the Elsevier API. So, you only need to specify the string and the outfile for the function to work.
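For example, a call that uses some of the optional arguments to cap the download at 1,000 records and sort them by times cited might look like this (the search string and file name below are placeholders for illustration; substitute your own):

    theXML <- searchByString(string = "TITLE-ABS-KEY(\"machine learning\") AND PUBYEAR > 2018", retMax = 1000, mySort = "-citedby-count", outfile = "mldata.xml")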
## The searchByID() method

This function allows you to search for a list of article IDs and download the matching search results. It can search for PMIDs, DOIs, or EIDs (Scopus ID numbers). The function expects the list of article IDs to be either a character vector from R (e.g. myData$scopusID) or a text file with a single article ID per line.

The function has similar arguments and default values to the searchByString() method, but it also has an "idtype" argument which requires you to specify what kind of article ID you want to search for ("pmid", "doi", or "eid").
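For example, if you already have your IDs in a data frame column, a call might look like this (myData and its doi column are hypothetical; substitute your own object):

    theXML <- searchByID(theIDs = myData$doi, idtype = "doi", outfile = "doidata.xml")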
## The extractXML() function

This function extracts selected values from XML returned by either of the above methods and formats them into a data frame. Note, however, that this function will only work for XML returned using the "application/xml" datatype. If you want to work with JSON data instead of XML, I recommend parsing the results using the jsonlite package.
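If you do want JSON, you will need to request it yourself rather than through the functions in this file, which ask the API for XML. A minimal sketch of that route, assuming you substitute your own API key and query and have installed jsonlite (the JSON field names differ from the XML, so inspect the result with str()):

    library(httr)
    library(jsonlite)
    resp <- GET("https://api.elsevier.com/content/search/scopus", query = list(apiKey = "yourAPIKey", query = "KEY(\"cryoelectron microscopy\")", httpAccept = "application/json"))
    stop_for_status(resp)
    results <- fromJSON(content(resp, as = "text"), flatten = TRUE)
    entries <- results$`search-results`$entry ## one data frame row per returned record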
## Sample workflow using the searchByString() method

Set your working directory and load the scopusAPI.R file

    setwd("C:/Users/Documents")
    source("scopusAPI.R")

Then save the search query you want to use. In this example I'm searching for documents that have the keyword "cryoelectron microscopy" and were published from 2006 to 2015.

    myQuery <- "KEY(\"cryoelectron microscopy\") AND PUBYEAR > 2005 AND PUBYEAR < 2016"

Next, run the search against the Scopus Search API and save the results in batches of 25 to a file called testdata.xml

    theXML <- searchByString(string = myQuery, outfile = "testdata.xml")

When the function finishes downloading all of the records, extract values from the XML and parse them into a data frame

    theData <- extractXML(theXML)

You can then work with the data in R or save it to a .csv file with the command

    write.csv(theData, file = "thedata.csv")
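The columns of theData are all character vectors, so you can explore them with base R. As a quick illustration (nothing here is required; it just uses the column names that extractXML() creates):

    table(theData$year) ## number of records per publication year
    theData$timescited <- as.numeric(theData$timescited) ## citation counts come back as text
    head(theData[order(-theData$timescited), c("articletitle", "year", "timescited")]) ## most cited records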
## Sample workflow using the searchByID() method

Set your working directory and load the scopusAPI.R file

    setwd("C:/Users/Documents")
    source("scopusAPI.R")

Then run the set of article IDs (in this example PMIDs) against the Scopus Search API and download the matching results

    theXML <- searchByID(theIDs = "testPMIDs.txt", idtype = "pmid", outfile = "test.xml")

Then when the function is finished, extract the values from the resulting XML

    theData <- extractXML(theXML)

You can then work with the data frame in R or save it to a .csv, as above.
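If your article IDs came from an existing data frame rather than a text file, you can also join the Scopus fields back onto that data frame by the shared ID column. A sketch, assuming a hypothetical myData object with a character pmid column:

    merged <- merge(myData, theData, by = "pmid", all.x = TRUE)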
--------------------------------------------------------------------------------
/scopusAPI.R:
--------------------------------------------------------------------------------
1 | ## version 0.4
2 | searchByString <- function(string, content = "complete", myStart = 0, retCount = 25, retMax = Inf, mySort = "-coverDate", cursor = "*", outfile) {
3 |     if (!content %in% c("complete", "standard")) {
4 |         stop("Invalid content value. Valid content values are 'complete' and 'standard'")
5 |     }
6 |     else {
7 |         ##library(httr)
8 |         ##library(XML)
9 |         key <- "yourAPIKey"
10 |         print("Retrieving records.")
11 |         theURL <- httr::GET("https://api.elsevier.com/content/search/scopus", query = list(apiKey = key, query = string, sort = mySort, httpAccept = "application/xml", view = content, count = retCount, start = myStart, cursor = cursor)) ## format the URL to be sent to the API
12 |         httr::stop_for_status(theURL) ## pass any HTTP errors to the R console
13 |         theData <- httr::content(theURL, as = "text") ## extract the content of the response
14 |         newData <- XML::xmlParse(theData, asText = TRUE) ## parse the data to extract values
15 |         resultCount <- as.numeric(XML::xpathSApply(newData, "//opensearch:totalResults", XML::xmlValue)) ## get the total number of search results for the string
16 |         cursor <- XML::xpathSApply(newData, "//cto:cursor", XML::xmlGetAttr, name = "next", namespaces = "cto")
17 |         print(paste("Found", resultCount, "records."))
18 |         retrievedCount <- retCount + myStart ## set the current number of results retrieved for the designated start and count parameters
19 |         while (resultCount > retrievedCount && retrievedCount < retMax) { ## check if it's necessary to perform multiple requests to retrieve all of the results; if so, create a loop to retrieve additional pages of results
20 |             myStart <- myStart + retCount ## add the number of records already returned to the start number
21 |             print(paste("Retrieved", retrievedCount, "of", resultCount, "records. Getting more."))
22 |             theURL <- httr::GET("https://api.elsevier.com/content/search/scopus", query = list(apiKey = key, query = string, sort = mySort, httpAccept = "application/xml", view = content, count = retCount, cursor = cursor)) ## get the next page of results
23 |             theData <- paste(theData, httr::content(theURL, as = "text")) ## paste new theURL content to theData; if there's an HTTP error, the XML of the error will be pasted to the end of theData
24 |             newData <- httr::content(theURL, as = "text")
25 |             newData <- XML::xmlParse(newData, asText = TRUE)
26 |             cursor <- XML::xpathSApply(newData, "//cto:cursor", XML::xmlGetAttr, name = "next", namespaces = "cto")
27 |             if (httr::http_error(theURL) == TRUE) { ## check if there's an HTTP error
28 |                 print("Encountered an HTTP error. Details follow.") ## alert the user to the error
29 |                 print(httr::http_status(theURL)) ## print out the error category, reason, and message
30 |                 break ## if there's an HTTP error, break out of the loop and return the data that has been retrieved
31 |             }
32 |             retrievedCount <- retrievedCount + retCount ## add the number of results retrieved in this iteration to the total number of results retrieved
33 |             Sys.sleep(1)
34 |         } ## repeat until retrievedCount >= resultCount
35 |         print(paste("Retrieved", retrievedCount, "records. Formatting and saving results."))
36 |         writeLines(theData, outfile, useBytes = TRUE) ## if there were multiple pages of results, they come back as separate XML files pasted into the single outfile; the theData XML object can't be coerced into a string to do find/replace operations, so I think it must be written to a file and then reloaded; useBytes = TRUE keeps the UTF-8 encoding of special characters like the copyright symbol so they won't throw an error later
37 |         theData <- readChar(outfile, file.info(outfile)$size) ## convert the XML results to a character vector of length 1 that can be manipulated
38 |         theData <- gsub("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "", theData, fixed = TRUE, useBytes = TRUE)
39 |         theData <- gsub("<search-results[^>]*>", "", theData, useBytes = TRUE)
40 |         theData <- gsub("</search-results>", "", theData, fixed = TRUE) ## remove all headers and footers of the separate XML files
41 |         theData <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<search-results xmlns=\"http://www.w3.org/2005/Atom\" xmlns:prism=\"http://prismstandard.org/namespaces/basic/2.0/\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">", theData, "</search-results>", sep = "\n") ## add the correct header to the beginning of the file and the correct footer to the end of the file; the root tag and namespace declarations here need to match the ones returned by the Scopus Search API
42 |         #theData <- paste(theData, "</search-results>")
43 |         writeLines(theData, outfile, useBytes = TRUE) ## save the correctly formatted XML file
44 |         print("Done")
45 |         return(theData) ## return the final, correctly formatted XML file
46 |     }
47 | }
48 | 
49 | searchByID <- function(theIDs, idtype, datatype = "application/xml", content = "complete", myStart = 0, retCount = 25, outfile) {
50 |     ##library(httr)
51 |     ##library(XML)
52 |     key <- "yourAPIKey"
53 |     if (length(theIDs) == 1) {
54 |         theIDs <- unique(scan(theIDs, what = "varchar")) ## load the list of IDs into a character vector
55 |     }
56 |     else {
57 |         theIDs <- unique(as.character(theIDs))
58 |     }
59 |     resultCount <- as.numeric(length(theIDs)) ## get the total number of IDs
60 |     idList <- split(theIDs, ceiling(seq_along(theIDs)/25)) ## split the IDs into batches of 25
61 |     theData <- " " ## create an empty character holder for the XML
62 |     retrievedCount <- 0 ## set the current number of records retrieved to zero
63 |     if (idtype == "pmid") {
64 |         idList <- lapply(mapply(paste, "PMID(", idList, collapse = ") OR "), paste, ")") ## append the correct scopus search syntax around each number in each batch of IDs
65 |     }
66 |     else if (idtype == "doi") {
67 |         idList <- lapply(mapply(paste, "DOI(", idList, collapse = ") OR "), paste, ")")
68 |     }
69 |     else if (idtype == "eid") {
70 |         idList <- lapply(mapply(paste, "EID(", idList, collapse = ") OR "), paste, ")") ## append the correct scopus search syntax around each number
71 |     }
72 |     else {
73 |         stop("Invalid idtype. Valid idtypes are 'pmid', 'doi', or 'eid'")
74 |     }
75 |     print(paste("Retrieving", resultCount, "records."))
76 |     for (i in 1:length(idList)) { ## loop through the list of search strings and return data for each one
77 |         string <- idList[[i]]
78 |         theURL <- httr::GET("https://api.elsevier.com/content/search/scopus", query = list(apiKey = key, query = string, httpAccept = "application/xml", view = content, count = retCount, start = myStart))
79 |         theData <- paste(theData, httr::content(theURL, as = "text")) ## paste new theURL content to theData
80 |         if (httr::http_error(theURL) == TRUE) { ## check if there's an HTTP error
81 |             print("Encountered an HTTP error. Details follow.") ## alert the user to the error
82 |             print(httr::http_status(theURL)) ## print out the error category, reason, and message
83 |             break ## if there's an HTTP error, break out of the loop and return the data that has been retrieved
84 |         }
85 |         Sys.sleep(1)
86 |         retrievedCount <- retrievedCount + retCount
87 |         print(paste("Retrieved", retrievedCount, "of", resultCount, "records. Getting more."))
88 |     }
89 |     print(paste("Retrieved", retrievedCount, "records. Formatting and saving results."))
90 |     writeLines(theData, outfile, useBytes = TRUE)
91 |     theData <- readChar(outfile, file.info(outfile)$size)
92 |     theData <- gsub("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "", theData, fixed = TRUE, useBytes = TRUE)
93 |     theData <- gsub("<search-results[^>]*>", "", theData, useBytes = TRUE)
94 |     theData <- gsub("</search-results>", "", theData, fixed = TRUE) ## remove all headers and footers of the separate XML files
95 |     theData <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<search-results xmlns=\"http://www.w3.org/2005/Atom\" xmlns:prism=\"http://prismstandard.org/namespaces/basic/2.0/\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">", theData, "</search-results>", sep = "\n") ## add the correct header to the beginning of the file and the correct footer to the end of the file; the root tag and namespace declarations here need to match the ones returned by the Scopus Search API
96 |     #theData <- paste(theData, "</search-results>")
97 |     writeLines(theData, outfile, useBytes = TRUE)
98 |     print("Done")
99 |     return(theData)
100 | }
101 | 
102 | extractXML <- function(theFile) {
103 |     ##library(XML)
104 |     newData <- XML::xmlParse(theFile) ## parse the XML
105 |     records <- XML::getNodeSet(newData, "//cto:entry", namespaces = "cto") ## create a list of records for missing or duplicate node handling
106 |     scopusID <- lapply(records, XML::xpathSApply, "./cto:eid", XML::xmlValue, namespaces = "cto") ## handle potentially missing eid nodes
107 |     scopusID[sapply(scopusID, is.list)] <- NA
108 |     scopusID <- unlist(scopusID)
109 |     doi <- lapply(records, XML::xpathSApply, "./prism:doi", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing doi nodes
110 |     doi[sapply(doi, is.list)] <- NA
111 |     doi <- unlist(doi)
112 |     pmid <- lapply(records, XML::xpathSApply, "./cto:pubmed-id", XML::xmlValue, namespaces = "cto") ## handle potentially missing pmid nodes: returns a list with the node value if the node is present and an empty list if the node is missing
113 |     pmid[sapply(pmid, is.list)] <- NA ## find the empty lists in pmid and set them to NA
114 |     pmid <- unlist(pmid) ## turn the pmid list into a vector
115 |     authLast <- lapply(records, XML::xpathSApply, ".//cto:surname", XML::xmlValue, namespaces = "cto") ## grab the surname and initials for each author in each record, then paste them together
116 |     authLast[sapply(authLast, is.list)] <- NA
117 |     authInit <- lapply(records, XML::xpathSApply, ".//cto:initials", XML::xmlValue, namespaces = "cto")
118 |     authInit[sapply(authInit, is.list)] <- NA
119 |     authors <- mapply(paste, authLast, authInit, collapse = "|")
120 |     authors <- sapply(strsplit(authors, "|", fixed = TRUE), unique) ## remove the duplicate author listings
121 |     authors <- sapply(authors, paste, collapse = "|")
122 |     affiliations <- lapply(records, XML::xpathSApply, ".//cto:affilname", XML::xmlValue, namespaces = "cto") ## handle multiple affiliation names
123 |     affiliations[sapply(affiliations, is.list)] <- NA
124 |     affiliations <- sapply(affiliations, paste, collapse = "|")
125 |     affiliations <- sapply(strsplit(affiliations, "|", fixed = TRUE), unique) ## remove the duplicate affiliation listings
126 |     affiliations <- sapply(affiliations, paste, collapse = "|")
127 |     countries <- lapply(records, XML::xpathSApply, ".//cto:affiliation-country", XML::xmlValue, namespaces = "cto")
128 |     countries[sapply(countries, is.list)] <- NA
129 |     countries <- sapply(countries, paste, collapse = "|")
130 |     countries <- sapply(strsplit(countries, "|", fixed = TRUE), unique) ## remove the duplicate country listings
131 |     countries <- sapply(countries, paste, collapse = "|")
132 |     year <- lapply(records, XML::xpathSApply, "./prism:coverDate", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/"))
133 |     year[sapply(year, is.list)] <- NA
134 |     year <- unlist(year)
135 |     year <- gsub("\\-..", "", year) ## extract only year from coverDate string (e.g. extract "2015" from "2015-01-01")
136 |     articletitle <- lapply(records, XML::xpathSApply, "./dc:title", XML::xmlValue, namespaces = c(dc = "http://purl.org/dc/elements/1.1/"))
137 |     articletitle[sapply(articletitle, is.list)] <- NA
138 |     articletitle <- unlist(articletitle)
139 |     journal <- lapply(records, XML::xpathSApply, "./prism:publicationName", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing journal nodes
140 |     journal[sapply(journal, is.list)] <- NA
141 |     journal <- unlist(journal)
142 |     volume <- lapply(records, XML::xpathSApply, "./prism:volume", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing volume nodes
143 |     volume[sapply(volume, is.list)] <- NA
144 |     volume <- unlist(volume)
145 |     issue <- lapply(records, XML::xpathSApply, "./prism:issueIdentifier", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing issue nodes
146 |     issue[sapply(issue, is.list)] <- NA
147 |     issue <- unlist(issue)
148 |     pages <- lapply(records, XML::xpathSApply, "./prism:pageRange", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing page range nodes
149 |     pages[sapply(pages, is.list)] <- NA
150 |     pages <- unlist(pages)
151 |     abstract <- lapply(records, XML::xpathSApply, "./dc:description", XML::xmlValue, namespaces = c(dc = "http://purl.org/dc/elements/1.1/")) ## handle potentially missing abstract nodes
152 |     abstract[sapply(abstract, is.list)] <- NA
153 |     abstract <- unlist(abstract)
154 |     keywords <- lapply(records, XML::xpathSApply, "./cto:authkeywords", XML::xmlValue, namespaces = "cto")
155 |     keywords[sapply(keywords, is.list)] <- NA
156 |     keywords <- unlist(keywords)
157 |     keywords <- gsub(" | ", "|", keywords, fixed = TRUE)
158 |     ptype <- lapply(records, XML::xpathSApply, "./cto:subtypeDescription", XML::xmlValue, namespaces = "cto")
159 |     ptype[sapply(ptype, is.list)] <- NA
160 |     ptype <- unlist(ptype)
161 |     timescited <- lapply(records, XML::xpathSApply, "./cto:citedby-count", XML::xmlValue, namespaces = "cto")
162 |     timescited[sapply(timescited, is.list)] <- NA
163 |     timescited <- unlist(timescited)
164 |     theDF <- data.frame(scopusID, doi, pmid, authors, affiliations, countries, year, articletitle, journal, volume, issue, pages, keywords, abstract, ptype, timescited, stringsAsFactors = FALSE)
165 |     return(theDF)
166 | }
--------------------------------------------------------------------------------