├── LICENSE
├── README.md
└── scopusAPI.R

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Christopher Belter

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# scopusAPI

The functions in this file allow you to query the Scopus Search API in R and parse the results into a data frame. The file contains three functions: searchByString(), which allows you to query the API using an advanced search string; searchByID(), which allows you to search for a list of article IDs (PMIDs, DOIs, or Scopus EIDs); and extractXML(), which extracts values from the returned XML and parses them into a data frame you can work with in R.

## Before you begin

You will need to obtain a personal API key from Elsevier. You can request one at http://dev.elsevier.com/. Click on the "Get API Key" button, log in to the site, and then click on "register a new site." Enter the requested information and use your institution's home page as the site.

You will then need to copy/paste your API key into the scopusAPI.R file at lines 9 and 52, replacing the yourAPIKey text with your API key.

Note also that if your institution uses IP authentication to verify your access to Scopus, you have to make the API calls from an authenticated IP address. In practice, this means you need to make the API calls from a computer that is on your institution's network, either on campus or remotely via a VPN connection.

You will also need to install the httr and XML packages, if you haven't already done so, with the command

    install.packages(c("httr", "XML"))

Finally, there are some API limits you should be aware of.

First, if you want to download the full records of your search results (including the full author list and abstract for each document), you are limited to requesting 25 articles at a time. The searchByString() and searchByID() functions will automatically send multiple requests to the API to retrieve the full set of search results for your particular query, but they can only retrieve 25 full records at a time.

Second, you can only request the first 5,000 records using the 'offset' parameter. This used to mean that you could only request up to 5,000 records for a single search string, but Elsevier has added a new 'cursor' parameter that allows you to bypass this limit. The new version of the searchByString() function uses this cursor parameter instead of the myStart parameter to iterate through the search results. In practice, this means you shouldn't use the myStart parameter unless you really need to.

Finally, you are limited to downloading 20,000 records per week.


## The searchByString() method

This function allows you to run an advanced search through the API and download all of the search results. I recommend developing the search string in the Scopus web interface and then using that string in the API to obtain the results.

The function has eight arguments: string, content, myStart, retCount, retMax, mySort, cursor, and outfile.
* **string:** the advanced search string you want to use.
* **content:** how many fields you want to return (either "complete", which returns all available fields, or "standard", which returns an abbreviated record).
* **myStart:** which search result you want to start downloading from. Limited to the first 5,000 records for any given search string; setting this value to 5,001 or higher will result in an error. In practice, you shouldn't need to use this unless your download process was interrupted and you want to pick up at the point where the error happened.
* **retCount:** how many records you want to download per request. Limited to 25 per request for "complete" content; requests for more than 25 with the "complete" content type will return an error.
* **retMax:** the maximum number of records you want to download. The function will continue to make requests until it reaches either the total number of search results or the retMax, if specified. If unspecified, it will return all of the search results.
* **mySort:** how you want the search results to be sorted. Currently defaults to descending order by cover date ("-coverDate"), but could also be set to descending order by times cited count ("-citedby-count") or relevance ("-relevancy"). See the Scopus Search API wadl for more options.
* **cursor:** a parameter used to iterate through a set of search results beyond 5,000. You shouldn't ever change this from its default value.
* **outfile:** the file you want to save the data to.

All but two of these arguments have default values: content defaults to "complete", myStart to 0, retCount to 25, retMax to Inf, mySort to "-coverDate", and cursor to the necessary value supplied by the Elsevier API. So, you only need to specify the string and the outfile for the function to work.
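For example, a call that uses some of the optional arguments to cap the download at 1,000 records and sort them by times cited might look like this (the search string and file name below are placeholders for illustration; substitute your own):

    theXML <- searchByString(string = "TITLE-ABS-KEY(\"machine learning\") AND PUBYEAR > 2018", retMax = 1000, mySort = "-citedby-count", outfile = "mldata.xml")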
## The searchByID() method

This function allows you to search for a list of article IDs and download the matching search results. It can search for PMIDs, DOIs, or EIDs (Scopus ID numbers). The function expects the list of article IDs to be either a character vector from R (e.g. myData$scopusID) or a text file with a single article ID per line.

The function has similar arguments and default values to the searchByString() method, but it also has an "idtype" argument which requires you to specify what kind of article ID you want to search for ("pmid", "doi", or "eid").
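For example, if you already have your IDs in a data frame column, a call might look like this (myData and its doi column are hypothetical; substitute your own object):

    theXML <- searchByID(theIDs = myData$doi, idtype = "doi", outfile = "doidata.xml")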
## The extractXML() function

This function extracts selected values from XML returned by either of the above methods and formats them into a data frame. Note, however, that this function will only work for XML returned using the "application/xml" datatype. If you want to work with JSON data instead of XML, I recommend parsing the results using the jsonlite package.
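If you do want JSON, you will need to request it yourself rather than through the functions in this file, which ask the API for XML. A minimal sketch of that route, assuming you substitute your own API key and query and have installed jsonlite (the JSON field names differ from the XML, so inspect the result with str()):

    library(httr)
    library(jsonlite)
    resp <- GET("https://api.elsevier.com/content/search/scopus", query = list(apiKey = "yourAPIKey", query = "KEY(\"cryoelectron microscopy\")", httpAccept = "application/json"))
    stop_for_status(resp)
    results <- fromJSON(content(resp, as = "text"), flatten = TRUE)
    entries <- results$`search-results`$entry ## one data frame row per returned record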
## Sample workflow using the searchByString() method

Set your working directory and load the scopusAPI.R file

    setwd("C:/Users/Documents")
    source("scopusAPI.R")

Then save the search query you want to use. In this example I'm searching for documents that have the keyword "cryoelectron microscopy" and were published from 2006 to 2015.

    myQuery <- "KEY(\"cryoelectron microscopy\") AND PUBYEAR > 2005 AND PUBYEAR < 2016"

Next, run the search against the Scopus Search API and save the results in batches of 25 to a file called testdata.xml

    theXML <- searchByString(string = myQuery, outfile = "testdata.xml")

When the function finishes downloading all of the records, extract values from the XML and parse them into a data frame

    theData <- extractXML(theXML)

You can then work with the data in R or save it to a .csv file with the command

    write.csv(theData, file = "thedata.csv")
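The columns of theData are all character vectors, so you can explore them with base R. As a quick illustration (nothing here is required; it just uses the column names that extractXML() creates):

    table(theData$year) ## number of records per publication year
    theData$timescited <- as.numeric(theData$timescited) ## citation counts come back as text
    head(theData[order(-theData$timescited), c("articletitle", "year", "timescited")]) ## most cited records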
## Sample workflow using the searchByID() method

Set your working directory and load the scopusAPI.R file

    setwd("C:/Users/Documents")
    source("scopusAPI.R")

Then run the set of article IDs (in this example PMIDs) against the Scopus Search API and download the matching results

    theXML <- searchByID(theIDs = "testPMIDs.txt", idtype = "pmid", outfile = "test.xml")

Then when the function is finished, extract the values from the resulting XML

    theData <- extractXML(theXML)

You can then work with the data frame in R or save it to a .csv, as above.
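If your article IDs came from an existing data frame rather than a text file, you can also join the Scopus fields back onto that data frame by the shared ID column. A sketch, assuming a hypothetical myData object with a character pmid column:

    merged <- merge(myData, theData, by = "pmid", all.x = TRUE)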
--------------------------------------------------------------------------------
/scopusAPI.R:
--------------------------------------------------------------------------------
1 | ## version 0.4
2 | searchByString <- function(string, content = "complete", myStart = 0, retCount = 25, retMax = Inf, mySort = "-coverDate", cursor = "*", outfile) {
3 |     if (!content %in% c("complete", "standard")) {
4 |         stop("Invalid content value. Valid content values are 'complete' and 'standard'")
5 |     }
6 |     else {
7 |         ##library(httr)
8 |         ##library(XML)
9 |         key <- "yourAPIKey"
10 |         print("Retrieving records.")
11 |         theURL <- httr::GET("https://api.elsevier.com/content/search/scopus", query = list(apiKey = key, query = string, sort = mySort, httpAccept = "application/xml", view = content, count = retCount, start = myStart, cursor = cursor)) ## format the URL to be sent to the API
12 |         httr::stop_for_status(theURL) ## pass any HTTP errors to the R console
13 |         theData <- httr::content(theURL, as = "text") ## extract the content of the response
14 |         newData <- XML::xmlParse(theData, asText = TRUE) ## parse the data to extract values
15 |         resultCount <- as.numeric(XML::xpathSApply(newData, "//opensearch:totalResults", XML::xmlValue)) ## get the total number of search results for the string
16 |         cursor <- XML::xpathSApply(newData, "//cto:cursor", XML::xmlGetAttr, name = "next", namespaces = "cto")
17 |         print(paste("Found", resultCount, "records."))
18 |         retrievedCount <- retCount + myStart ## set the current number of results retrieved for the designated start and count parameters
19 |         while (resultCount > retrievedCount && retrievedCount < retMax) { ## check if it's necessary to perform multiple requests to retrieve all of the results; if so, create a loop to retrieve additional pages of results
20 |             myStart <- myStart + retCount ## add the number of records already returned to the start number
21 |             print(paste("Retrieved", retrievedCount, "of", resultCount, "records. Getting more."))
22 |             theURL <- httr::GET("https://api.elsevier.com/content/search/scopus", query = list(apiKey = key, query = string, sort = mySort, httpAccept = "application/xml", view = content, count = retCount, cursor = cursor)) ## get the next page of results
23 |             theData <- paste(theData, httr::content(theURL, as = "text")) ## paste new theURL content to theData; if there's an HTTP error, the XML of the error will be pasted to the end of theData
24 |             newData <- httr::content(theURL, as = "text")
25 |             newData <- XML::xmlParse(newData, asText = TRUE)
26 |             cursor <- XML::xpathSApply(newData, "//cto:cursor", XML::xmlGetAttr, name = "next", namespaces = "cto")
27 |             if (httr::http_error(theURL) == TRUE) { ## check if there's an HTTP error
28 |                 print("Encountered an HTTP error. Details follow.") ## alert the user to the error
29 |                 print(httr::http_status(theURL)) ## print out the error category, reason, and message
30 |                 break ## if there's an HTTP error, break out of the loop and return the data that has been retrieved
31 |             }
32 |             retrievedCount <- retrievedCount + retCount ## add the number of results retrieved in this iteration to the total number of results retrieved
33 |             Sys.sleep(1)
34 |         } ## repeat until retrievedCount >= resultCount
35 |         print(paste("Retrieved", retrievedCount, "records. Formatting and saving results."))
36 |         writeLines(theData, outfile, useBytes = TRUE) ## if there were multiple pages of results, they come back as separate XML files pasted into the single outfile; the theData XML object can't be coerced into a string to do find/replace operations, so I think it must be written to a file and then reloaded; useBytes = TRUE keeps the UTF-8 encoding of special characters like the copyright symbol so they won't throw an error later
37 |         theData <- readChar(outfile, file.info(outfile)$size) ## convert the XML results to a character vector of length 1 that can be manipulated
38 |         theData <- gsub("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "", theData, fixed = TRUE, useBytes = TRUE)
39 |         theData <- gsub("<search-results[^>]*>", "", theData, useBytes = TRUE)
40 |         theData <- gsub("</search-results>", "", theData, fixed = TRUE) ## remove all headers and footers of the separate XML files
41 |         theData <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<search-results xmlns=\"http://www.w3.org/2005/Atom\" xmlns:prism=\"http://prismstandard.org/namespaces/basic/2.0/\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">", theData, "</search-results>", sep = "\n") ## add the correct header to the beginning of the file and the correct footer to the end of the file; the root tag and namespace declarations here need to match the ones returned by the Scopus Search API
42 |         #theData <- paste(theData, "</search-results>")
43 |         writeLines(theData, outfile, useBytes = TRUE) ## save the correctly formatted XML file
44 |         print("Done")
45 |         return(theData) ## return the final, correctly formatted XML file
46 |     }
47 | }
48 | 
49 | searchByID <- function(theIDs, idtype, datatype = "application/xml", content = "complete", myStart = 0, retCount = 25, outfile) {
50 |     ##library(httr)
51 |     ##library(XML)
52 |     key <- "yourAPIKey"
53 |     if (length(theIDs) == 1) {
54 |         theIDs <- unique(scan(theIDs, what = "varchar")) ## load the list of IDs into a character vector
55 |     }
56 |     else {
57 |         theIDs <- unique(as.character(theIDs))
58 |     }
59 |     resultCount <- as.numeric(length(theIDs)) ## get the total number of IDs
60 |     idList <- split(theIDs, ceiling(seq_along(theIDs)/25)) ## split the IDs into batches of 25
61 |     theData <- " " ## create an empty character holder for the XML
62 |     retrievedCount <- 0 ## set the current number of records retrieved to zero
63 |     if (idtype == "pmid") {
64 |         idList <- lapply(mapply(paste, "PMID(", idList, collapse = ") OR "), paste, ")") ## append the correct scopus search syntax around each number in each batch of IDs
65 |     }
66 |     else if (idtype == "doi") {
67 |         idList <- lapply(mapply(paste, "DOI(", idList, collapse = ") OR "), paste, ")")
68 |     }
69 |     else if (idtype == "eid") {
70 |         idList <- lapply(mapply(paste, "EID(", idList, collapse = ") OR "), paste, ")") ## append the correct scopus search syntax around each number
71 |     }
72 |     else {
73 |         stop("Invalid idtype. Valid idtypes are 'pmid', 'doi', or 'eid'")
74 |     }
75 |     print(paste("Retrieving", resultCount, "records."))
76 |     for (i in 1:length(idList)) { ## loop through the list of search strings and return data for each one
77 |         string <- idList[[i]]
78 |         theURL <- httr::GET("https://api.elsevier.com/content/search/scopus", query = list(apiKey = key, query = string, httpAccept = "application/xml", view = content, count = retCount, start = myStart))
79 |         theData <- paste(theData, httr::content(theURL, as = "text")) ## paste new theURL content to theData
80 |         if (httr::http_error(theURL) == TRUE) { ## check if there's an HTTP error
81 |             print("Encountered an HTTP error. Details follow.") ## alert the user to the error
82 |             print(httr::http_status(theURL)) ## print out the error category, reason, and message
83 |             break ## if there's an HTTP error, break out of the loop and return the data that has been retrieved
84 |         }
85 |         Sys.sleep(1)
86 |         retrievedCount <- retrievedCount + retCount
87 |         print(paste("Retrieved", retrievedCount, "of", resultCount, "records. Getting more."))
88 |     }
89 |     print(paste("Retrieved", retrievedCount, "records. Formatting and saving results."))
90 |     writeLines(theData, outfile, useBytes = TRUE)
91 |     theData <- readChar(outfile, file.info(outfile)$size)
92 |     theData <- gsub("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "", theData, fixed = TRUE, useBytes = TRUE)
93 |     theData <- gsub("<search-results[^>]*>", "", theData, useBytes = TRUE)
94 |     theData <- gsub("</search-results>", "", theData, fixed = TRUE) ## remove all headers and footers of the separate XML files
95 |     theData <- paste("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "<search-results xmlns=\"http://www.w3.org/2005/Atom\" xmlns:prism=\"http://prismstandard.org/namespaces/basic/2.0/\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">", theData, "</search-results>", sep = "\n") ## add the correct header to the beginning of the file and the correct footer to the end of the file; the root tag and namespace declarations here need to match the ones returned by the Scopus Search API
96 |     #theData <- paste(theData, "</search-results>")
97 |     writeLines(theData, outfile, useBytes = TRUE)
98 |     print("Done")
99 |     return(theData)
100 | }
101 | 
102 | extractXML <- function(theFile) {
103 |     ##library(XML)
104 |     newData <- XML::xmlParse(theFile) ## parse the XML
105 |     records <- XML::getNodeSet(newData, "//cto:entry", namespaces = "cto") ## create a list of records for missing or duplicate node handling
106 |     scopusID <- lapply(records, XML::xpathSApply, "./cto:eid", XML::xmlValue, namespaces = "cto") ## handle potentially missing eid nodes
107 |     scopusID[sapply(scopusID, is.list)] <- NA
108 |     scopusID <- unlist(scopusID)
109 |     doi <- lapply(records, XML::xpathSApply, "./prism:doi", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing doi nodes
110 |     doi[sapply(doi, is.list)] <- NA
111 |     doi <- unlist(doi)
112 |     pmid <- lapply(records, XML::xpathSApply, "./cto:pubmed-id", XML::xmlValue, namespaces = "cto") ## handle potentially missing pmid nodes: returns a list with the node value if the node is present and an empty list if the node is missing
113 |     pmid[sapply(pmid, is.list)] <- NA ## find the empty lists in pmid and set them to NA
114 |     pmid <- unlist(pmid) ## turn the pmid list into a vector
115 |     authLast <- lapply(records, XML::xpathSApply, ".//cto:surname", XML::xmlValue, namespaces = "cto") ## grab the surname and initials for each author in each record, then paste them together
116 |     authLast[sapply(authLast, is.list)] <- NA
117 |     authInit <- lapply(records, XML::xpathSApply, ".//cto:initials", XML::xmlValue, namespaces = "cto")
118 |     authInit[sapply(authInit, is.list)] <- NA
119 |     authors <- mapply(paste, authLast, authInit, collapse = "|")
120 |     authors <- sapply(strsplit(authors, "|", fixed = TRUE), unique) ## remove the duplicate author listings
121 |     authors <- sapply(authors, paste, collapse = "|")
122 |     affiliations <- lapply(records, XML::xpathSApply, ".//cto:affilname", XML::xmlValue, namespaces = "cto") ## handle multiple affiliation names
123 |     affiliations[sapply(affiliations, is.list)] <- NA
124 |     affiliations <- sapply(affiliations, paste, collapse = "|")
125 |     affiliations <- sapply(strsplit(affiliations, "|", fixed = TRUE), unique) ## remove the duplicate affiliation listings
126 |     affiliations <- sapply(affiliations, paste, collapse = "|")
127 |     countries <- lapply(records, XML::xpathSApply, ".//cto:affiliation-country", XML::xmlValue, namespaces = "cto")
128 |     countries[sapply(countries, is.list)] <- NA
129 |     countries <- sapply(countries, paste, collapse = "|")
130 |     countries <- sapply(strsplit(countries, "|", fixed = TRUE), unique) ## remove the duplicate country listings
131 |     countries <- sapply(countries, paste, collapse = "|")
132 |     year <- lapply(records, XML::xpathSApply, "./prism:coverDate", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/"))
133 |     year[sapply(year, is.list)] <- NA
134 |     year <- unlist(year)
135 |     year <- gsub("\\-..", "", year) ## extract only year from coverDate string (e.g. extract "2015" from "2015-01-01")
136 |     articletitle <- lapply(records, XML::xpathSApply, "./dc:title", XML::xmlValue, namespaces = c(dc = "http://purl.org/dc/elements/1.1/"))
137 |     articletitle[sapply(articletitle, is.list)] <- NA
138 |     articletitle <- unlist(articletitle)
139 |     journal <- lapply(records, XML::xpathSApply, "./prism:publicationName", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing journal nodes
140 |     journal[sapply(journal, is.list)] <- NA
141 |     journal <- unlist(journal)
142 |     volume <- lapply(records, XML::xpathSApply, "./prism:volume", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing volume nodes
143 |     volume[sapply(volume, is.list)] <- NA
144 |     volume <- unlist(volume)
145 |     issue <- lapply(records, XML::xpathSApply, "./prism:issueIdentifier", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing issue nodes
146 |     issue[sapply(issue, is.list)] <- NA
147 |     issue <- unlist(issue)
148 |     pages <- lapply(records, XML::xpathSApply, "./prism:pageRange", XML::xmlValue, namespaces = c(prism = "http://prismstandard.org/namespaces/basic/2.0/")) ## handle potentially missing page range nodes
149 |     pages[sapply(pages, is.list)] <- NA
150 |     pages <- unlist(pages)
151 |     abstract <- lapply(records, XML::xpathSApply, "./dc:description", XML::xmlValue, namespaces = c(dc = "http://purl.org/dc/elements/1.1/")) ## handle potentially missing abstract nodes
152 |     abstract[sapply(abstract, is.list)] <- NA
153 |     abstract <- unlist(abstract)
154 |     keywords <- lapply(records, XML::xpathSApply, "./cto:authkeywords", XML::xmlValue, namespaces = "cto")
155 |     keywords[sapply(keywords, is.list)] <- NA
156 |     keywords <- unlist(keywords)
157 |     keywords <- gsub(" | ", "|", keywords, fixed = TRUE)
158 |     ptype <- lapply(records, XML::xpathSApply, "./cto:subtypeDescription", XML::xmlValue, namespaces = "cto")
159 |     ptype[sapply(ptype, is.list)] <- NA
160 |     ptype <- unlist(ptype)
161 |     timescited <- lapply(records, XML::xpathSApply, "./cto:citedby-count", XML::xmlValue, namespaces = "cto")
162 |     timescited[sapply(timescited, is.list)] <- NA
163 |     timescited <- unlist(timescited)
164 |     theDF <- data.frame(scopusID, doi, pmid, authors, affiliations, countries, year, articletitle, journal, volume, issue, pages, keywords, abstract, ptype, timescited, stringsAsFactors = FALSE)
165 |     return(theDF)
166 | }
--------------------------------------------------------------------------------