├── .Rbuildignore ├── .travis.yml ├── DESCRIPTION ├── LICENSE ├── NAMESPACE ├── NEWS ├── R ├── body.R ├── colnames.R ├── header.R ├── htmltab.R ├── identify_rows.R ├── inbody_header.R ├── setup_and_checks.R ├── utils.R └── zzz.R ├── README.md ├── cran-comments.md ├── man ├── check_type.Rd ├── create_inbody.Rd ├── eval_body.Rd ├── eval_header.Rd ├── get_body_xpath.Rd ├── get_cell_element.Rd ├── get_head_xpath.Rd ├── get_header_elements.Rd ├── get_span.Rd ├── get_trindex.Rd ├── htmltab.Rd ├── identify_elements.Rd ├── normalize_tr.Rd ├── num_xpath.Rd ├── rm_empty_cols.Rd ├── rm_empty_rows.Rd ├── rm_nuisance.Rd └── select_tab.Rd ├── tests ├── testthat.R └── testthat │ ├── test_doc_examples.R │ ├── test_expand_spans.R │ ├── test_find_header.R │ ├── test_inputs.R │ └── test_multi-dim-header.R └── vignettes ├── .Rapp.history └── htmltab.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | cran-comments.md 4 | .travis.yml 5 | misc/ 6 | CHANGES 7 | tests/testthat/test_doc_examples.R 8 | tests/testthat/.test_warnings.R 9 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | # Sample .travis.yml for R projects from https://github.com/craigcitro/r-travis 2 | 3 | language: r 4 | warnings_are_errors: false 5 | sudo: required 6 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: htmltab 2 | Title: Assemble Data Frames from HTML Tables 3 | Version: 0.7.0 4 | Authors@R: person("Christian", "Rubba", email = "r@christianrubba.com", 5 | role = c("aut", "cre")) 6 | Description: HTML tables are a valuable data source but extracting and 7 | recasting these data into a useful format can be tedious. htmltab is a package 8 | for collecting structured information from HTML tables. It is similar to 9 | readHTMLTable() of the XML package but provides three major advantages. First, 10 | the function automatically expands row and column spans in the header and body 11 | cells. Second, users are given more control over the identification of header 12 | and body rows which will end up in the R table, including semantic header 13 | information that appear throughout the body. Third, the function preprocesses 14 | table code, corrects common types of malformations, removes unneeded parts and 15 | so helps to alleviate the need for tedious post-processing. 16 | Depends: 17 | R (>= 3.0.0) 18 | Imports: 19 | XML (>= 3.98.1.3), 20 | httr (>= 1.0.0) 21 | License: MIT + file LICENSE 22 | LazyData: true 23 | Suggests: 24 | testthat, 25 | knitr, 26 | tidyr 27 | URL: https://github.com/crubba/htmltab 28 | BugReports: https://github.com/crubba/htmltab/issues 29 | VignetteBuilder: knitr 30 | RoxygenNote: 5.0.0 31 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | YEAR: 2014,2015 2 | COPYRIGHT HOLDER: Christian Rubba 3 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | S3method(num_xpath,character) 4 | S3method(num_xpath,default) 5 | S3method(num_xpath,list) 6 | S3method(num_xpath,numeric) 7 | export(htmltab) 8 | export(num_xpath) 9 | -------------------------------------------------------------------------------- /NEWS: -------------------------------------------------------------------------------- 1 | CHANGES IN htmltab VERSION 0.7.0 2 | 3 | 4 | NEW FEATURES 5 | o Added a new argument (rm_nodata_cols) to remove columns that have no apparent data value 6 | 7 | MAJOR CHANGE 8 | o When htmltab encounters an inner table inside the target table, the inner table is flattened to allow table generation 9 | 10 | BUG FIXES 11 | o Single column data frames are not reduced to vectors anymore which used to resulte in an error 12 | o When the last column had misspecified column spans, htmltab previously dicarded an entire column. Now, a check is in place that makes a judgement whether a column should be kept or not 13 | o Fixed a problem with reading html files from the local file system (@earino) 14 | 15 | 16 | 17 | CHANGES IN htmltab VERSION 0.6.0 18 | 19 | NEW FEATURES 20 | o Added capability to process header information that appear in-table. This is done via a new formula interface to the header argument 21 | o Added new parameter (rm_whitespace) to remove leading and trailing whitespace from cell values 22 | o Added new parameter (rm_identical_cols) to remove columns that are falsely copied when colspan attributes are misused 23 | o Tables are now checked for and cleaned from various types of malformation 24 | 25 | BUG FIXES 26 | o Fixed a bug that prevented correct creation of multi-row header when a header cell was completely whitespaces 27 | o Fixed a bug where rm_empty_cols did not work properly because of values that were created through column expansion 28 | o Removed unreliable test for documentation examples 29 | o Automatic check for nested tables. htmltab throws an error when the designated table includes a table 30 | 31 | MINOR CHANGES 32 | o In the header construction, multi-row headers are now correctly ignoring empty values in the final header 33 | o Complementarity checks of header and body rows is now based on a different (and more robust) methodology 34 | 35 | 36 | 37 | CHANGES IN htmltab VERSION 0.5.0 38 | 39 | NEW FEATURES 40 | o Header and body are now treated as complementary elements of a table, i.e. passing (numeric) information about the position of either of the two will be used for the identification of the other 41 | o Added a new argument (fillNA) to replace non-data cells cells by NA 42 | o Added a new argument (rm_nodata_cols) to remove columns that have no apparent data value 43 | o Added a new argument (rm_invisible) to remove invisible nodes from the table node 44 | 45 | BUG FIXES 46 | o Fixed a problem where htmltable failed when a table didn't nest a row within tr tags. Now every table is controlled, and tr tags are added when necessary 47 | o Fixed a small problem with misspecified spans in the table header 48 | o Added meaningful error message when table couldn't be identified 49 | o Fixed problem where a header warning was thrown even when colNames was supplied 50 | 51 | MAJOR CHANGES 52 | o Revised code for header and body identification. When an XPath is passed to either of the two, it must treat the parent table node as the root. This change is backward incompatible 53 | 54 | -------------------------------------------------------------------------------- /R/body.R: -------------------------------------------------------------------------------- 1 | 2 | #' Creates body using span information 3 | #' 4 | #' @param vals a list of body cell values 5 | #' @param colspans a list of body colspans 6 | #' @param rowspans a list of body rowspans 7 | #' @param header.names the header name vector 8 | #' @return a matrix of the assembled body 9 | #' @noRd 10 | span_body <- function(vals, colspans, rowspans, header.names) { 11 | 12 | #Create empty container matrix 13 | if(length(header.names) == 0){ 14 | n.cols <- sum(as.numeric(unlist(colspans[[1L]]))) #get idea from first row column length, should be right usually 15 | } else { 16 | n.cols <- length(header.names) #use header dim info 17 | } 18 | n.rows <- length(rowspans) #guess, better make flexible #lapply(rowspans, function(x) x[1]) %>% unlist %>% as.numeric %>% sum 19 | 20 | mat <- matrix(NA, ncol = n.cols, nrow = n.rows) 21 | #row.cont <- vector() 22 | 23 | col = 1L 24 | 25 | while(col <= n.cols){ # col start 26 | 27 | row <- 1L 28 | 29 | while(row <= n.rows){ # row start 30 | 31 | cel.val <- vals[[row]][col] 32 | col.span.length <- colspans[[row]][col] 33 | row.span.length <- rowspans[[row]][col] 34 | 35 | #This block controls for undefined col- and rowspans (hacky) 36 | if(is.na(row.span.length) && row < n.rows){ 37 | col.span.length <- 1L 38 | row.span.length <- 1L 39 | cel.val <- mat[row, col -1L] 40 | } 41 | 42 | if(row.span.length < 2L) { 43 | mat[row, col] <- cel.val 44 | } else { 45 | 46 | if(row != n.rows){ #Control for situation: specified rows (main), last row demands col/rowspans 47 | index <- (row + 1L) : (row + row.span.length - 1L) 48 | 49 | for(counter in index){ 50 | vals[[counter]] <- append(vals[[counter]], cel.val, (col - 1L )) #append col val is after 51 | rowspans[[counter]] <- append(rowspans[[counter]], 1L, (col - 1L)) 52 | colspans[[counter]] <- append(colspans[[counter]], 1L, (col - 1L)) 53 | } 54 | rowspans[[row]] <- rowspans[[row]][-col] 55 | rowspans[[row]] <- append(rowspans[[row]], 1L, (col-1L)) 56 | 57 | mat[row:(row + row.span.length-1L), col] <- cel.val 58 | } 59 | } 60 | 61 | if(col.span.length > 1L){ 62 | vals[[row]] <- append(vals[[row]], rep(cel.val, (col.span.length - 1L)), col) #append 63 | colspans[[row]] <- colspans[[row]][-col] 64 | colspans[[row]] <- append(colspans[[row]], rep(1, col.span.length), (col-1L)) 65 | rowspans[[row]] <- rowspans[[row]][-col] 66 | rowspans[[row]] <- append(rowspans[[row]], rep(row.span.length, col.span.length), (col-1L)) 67 | } 68 | 69 | row <- row + 1L 70 | } #row end 71 | 72 | col <- col + 1L 73 | } #col end 74 | 75 | return(mat) 76 | 77 | } #function end 78 | 79 | 80 | 81 | #colspans = body.colspans 82 | #rowspans = body.rowspans 83 | 84 | #' Expand the body 85 | #' @noRd 86 | expand_body <- function(vals, colspans, rowspans){ 87 | 88 | #colspans <- body.colspans 89 | #rowspans <- body.rowspans 90 | 91 | body.table <- list() 92 | col <- 1 93 | n.row <- length(vals) 94 | 95 | repeat{ 96 | 97 | #Break when there are no header information or when last column is missspecified 98 | if(is_empty(vals[[1]])){break} #all_empty(vals) 99 | 100 | body.row <- vector() 101 | 102 | for(row in 1:n.row) { 103 | 104 | length.col <- colspans[[row]][1] 105 | length.row <- rowspans[[row]][1] 106 | name <- vals[[row]][1] 107 | #name <- ifelse(name == "", NA, name) 108 | name <- ifelse(grepl("[[:alnum:][:punct:]]+", name), name, NA) 109 | 110 | if(is.na(length.col)) break 111 | 112 | #Remove cell info 113 | colspans[[row]] <- colspans[[row]][-1] 114 | rowspans[[row]] <- rowspans[[row]][-1] 115 | vals[[row]] <- vals[[row]][-1] 116 | 117 | #Expand along columnes 118 | colspans[[row]] <- append(colspans[[row]], rep(1, length.col - 1), 0) 119 | rowspans[[row]] <- append(rowspans[[row]], rep(length.row, length.col - 1), 0) 120 | vals[[row]] <- append(vals[[row]], rep(name, length.col - 1), 0) 121 | 122 | #Expand along rows 123 | these.rows <- row:(row + (length.row - 1)) 124 | rowspans[these.rows] <- lapply(rowspans[these.rows], append, 1, after = 0) #an erster Stelle 125 | colspans[these.rows] <- lapply(colspans[these.rows], append, 1, after = 0) 126 | vals[these.rows] <- lapply(vals[these.rows], append, name, after = 0) #vals[these.rows] <- lapply(vals[these.rows], append, NA, after = 0) 127 | 128 | #remove the first colum info 129 | colspans[[row]] <- colspans[[row]][-1] #check for colspans different 130 | rowspans[[row]] <- rowspans[[row]][-1] #check for colspans different 131 | vals[[row]] <- vals[[row]][-1] 132 | 133 | #Add cell info to column name vector 134 | body.row <- c(body.row, name) 135 | } 136 | 137 | #Break if last row is not completely filled up (https://en.wikipedia.org/wiki/Opinion_polling_for_the_42nd_Canadian_federal_election) 138 | if(col > 1 && length(body.row) < length(body.table[[col - 1]])){ 139 | add.cell.no <- length(body.table[[col - 1]]) - length(body.row) 140 | add.NAs <- rep(NA, add.cell.no) 141 | body.row <- c(body.row, add.NAs) 142 | } 143 | 144 | body.table[[col]] <- vector() 145 | body.table[[col]] <- body.row 146 | 147 | col <- col + 1 148 | } 149 | 150 | # Cbind all 151 | tab <- do.call("cbind", body.table) 152 | 153 | return(tab) 154 | } 155 | 156 | 157 | #' Extract body cell values 158 | #' 159 | #' @param table.Node the table node 160 | #' @return list of body information 161 | #' @family get_head 162 | #' @noRd 163 | get_cells <- function(table.Node) { 164 | 165 | cells <- XML::xpathSApply(table.Node, path = "//tr") 166 | 167 | ifstop(is_empty(cells), sprintf("No body generated. Body is empty. 168 | Try passing information to the body argument. Body XPath was '%s'.", body), call. = FALSE) 169 | 170 | return(cells) 171 | } 172 | 173 | -------------------------------------------------------------------------------- /R/colnames.R: -------------------------------------------------------------------------------- 1 | #' Produces correct column names 2 | #' 3 | #' @param df the generated body 4 | #' @param header.names the header column names vector 5 | #' @param colNames either a self-specificed character vector for the column names or a function used on header.names 6 | #' @param xpath generated header and body xpath 7 | #' @return a character vector of header column names 8 | #' @noRd 9 | make_colnames <- function(df, header.names = NULL, colNames = NULL, header.xpath) { 10 | 11 | if(length(header.names) == 0 && is.null(colNames)){ 12 | warning("No header generated. Try passing information to header or colNames", call. = FALSE) 13 | return(df) 14 | } 15 | 16 | if(length(header.names) > 0 && is.null(colNames)) { 17 | if(ncol(df) != length(header.names)){ 18 | warning("Header dimension doesn't match body dimension", call. = FALSE) 19 | colnames(df) <- vector() 20 | } 21 | colnames(df)[1:length(header.names)] <- header.names 22 | return(df) 23 | } 24 | 25 | if(is.character(colNames)) { 26 | header.names <- colNames 27 | if(ncol(df) != length(header.names)){ 28 | warning("Header dimension doesn't match body dimension", call. = FALSE) 29 | colnames(df) <- vector() 30 | } 31 | colnames(df)[1:length(header.names)] <- header.names 32 | return(df) 33 | } 34 | 35 | if(class(colNames) == "function") { 36 | header.names <- colNames(header.names) 37 | if(ncol(df) != length(header.names)){ 38 | warning("Header dimension doesn't match body dimension", call. = FALSE) 39 | colnames(df) <- vector() 40 | } 41 | colnames(df)[1:length(header.names)] <- header.names 42 | return(df) 43 | } 44 | 45 | } 46 | -------------------------------------------------------------------------------- /R/header.R: -------------------------------------------------------------------------------- 1 | #' Retrieve table head rows 2 | #' 3 | #' @param table.Node the table node 4 | #' @return the header element 5 | #' @noRd 6 | get_head <- function(trindex, table.Node) UseMethod() 7 | 8 | get_head <- function(table.Node, trindex) { 9 | 10 | bracket <- paste(sprintf("@HTMLTABtrindex = '%s'", trindex), collapse = " or ") 11 | xpath <- sprintf("//tr[%s]", bracket) 12 | 13 | head_MAIN <- XML::xpathSApply(table.Node, path = xpath) 14 | 15 | return(head_MAIN) 16 | } 17 | 18 | 19 | #' Produce table header 20 | #' @noRd 21 | make_header <- function(trindex, table.Node, headerSep, headerFun, 22 | rm_escape, rm_whitespace) UseMethod("make_header") 23 | 24 | make_header.logical <- function(trindex, table.Node, headerSep, headerFun, 25 | rm_escape, rm_whitespace){ 26 | return(logical(0)) 27 | } 28 | 29 | make_header.numeric <- function(trindex, table.Node, headerSep, headerFun, 30 | rm_escape, 31 | rm_whitespace){ 32 | 33 | head <- get_head(table.Node = table.Node, trindex = trindex) 34 | 35 | header.colspans <- get_span(head, span = "colspan", tag = "td | th") 36 | header.rowspans <- get_span(head, span = "rowspan", tag = "td | th") 37 | 38 | header.names <- get_cell_element(head, tag = "td | th", 39 | elFun = headerFun, 40 | rm_escape = rm_escape, 41 | rm_whitespace = rm_whitespace) 42 | 43 | #Span head 44 | header.names <- span_header(header.names, 45 | header.colspans, 46 | header.rowspans, 47 | headerSep = headerSep) 48 | 49 | noval_col <- which(header.names == "") 50 | 51 | if(length(noval_col) > 0){ 52 | header.names[noval_col] <- paste0("V", length(noval_col)) 53 | } 54 | 55 | 56 | return(header.names) 57 | } 58 | 59 | 60 | #' Creates header using span information 61 | #' 62 | #' @param header.names a list of header names 63 | #' @param header.colspans a list of header colspans 64 | #' @param header.rowspans a list of header rowspans 65 | #' @param headerSep a character vector that is used as a seperator in the construction of the table's variable names (default value ' >> ') 66 | #' @return a vector of header column names 67 | #' @noRd 68 | span_header <- function(header.names, header.colspans, header.rowspans, headerSep) { 69 | 70 | #has no header information 71 | if(length(header.names) == 0){ 72 | header.name.table <- vector() 73 | return(header.name.table) 74 | } 75 | 76 | #Remove rows which have all empty cells 77 | empty.rows <- which(sapply(header.names, function(x) all(x == ""))) 78 | if(!is_empty(empty.rows)){ 79 | header.names <- header.names[-empty.rows] 80 | header.colspans <- header.colspans[-empty.rows] 81 | header.rowspans <- header.rowspans[-empty.rows] 82 | } 83 | 84 | #return empty header 85 | if(length(header.names) == 0){ 86 | header.name.table <- vector() 87 | return(header.name.table) 88 | } 89 | 90 | header.name.table <- expand_header(header.names, header.colspans, header.rowspans) 91 | 92 | header.name.table <- lapply(header.name.table, function(col) col[!is.na(col)]) 93 | 94 | header.name.table <- unlist(lapply(header.name.table, function(col) paste(col, collapse = headerSep))) 95 | 96 | return(header.name.table) 97 | } 98 | 99 | #' Expand the header 100 | #' @noRd 101 | expand_header <- function(vals, colspans, rowspans){ 102 | 103 | body.table <- list() 104 | col <- 1 105 | n.row <- length(vals) 106 | 107 | repeat{ 108 | 109 | # Break when there are no header information or when last column is missspecified 110 | if(is_empty(vals[[1]])){break} # || length(vals) > 1 && length(vals[[2]]) == 0 111 | 112 | body.row <- vector() 113 | 114 | for(row in 1:n.row) { 115 | 116 | length.col <- colspans[[row]][1] 117 | length.row <- rowspans[[row]][1] 118 | name <- vals[[row]][1] 119 | name <- ifelse(grepl("[[:alnum:][:punct:]]+", name), name, NA) 120 | 121 | if(is.na(length.col)) break 122 | 123 | #Remove cell info 124 | colspans[[row]] <- colspans[[row]][-1] 125 | rowspans[[row]] <- rowspans[[row]][-1] 126 | vals[[row]] <- vals[[row]][-1] 127 | 128 | #Expand along columnes 129 | colspans[[row]] <- append(colspans[[row]], rep(1, length.col - 1), 0) 130 | rowspans[[row]] <- append(rowspans[[row]], rep(length.row, length.col - 1), 0) 131 | vals[[row]] <- append(vals[[row]], rep(name, length.col - 1), 0) 132 | 133 | #Expand along rows 134 | length(vals[[1]]) 135 | these.rows <- row:(row + (length.row - 1)) 136 | rowspans[these.rows] <- lapply(rowspans[these.rows], append, 1, after = 0) #an erster Stelle 137 | colspans[these.rows] <- lapply(colspans[these.rows], append, 1, after = 0) 138 | vals[these.rows] <- lapply(vals[these.rows], append, NA, after = 0) #vals[these.rows] <- lapply(vals[these.rows], append, NA, after = 0) 139 | 140 | #remove the first colum info 141 | colspans[[row]] <- colspans[[row]][-1] #check for colspans different 142 | rowspans[[row]] <- rowspans[[row]][-1] #check for colspans different 143 | vals[[row]] <- vals[[row]][-1] 144 | 145 | #Add cell info to column name vector 146 | body.row <- c(body.row, name) 147 | } 148 | 149 | #Break if 150 | if(col > 1 && length(body.row) < length(body.table[[col - 1]])) break 151 | 152 | body.table[[col]] <- vector() 153 | body.table[[col]] <- body.row 154 | 155 | col <- col + 1 156 | } 157 | 158 | return(body.table) 159 | } 160 | 161 | 162 | #' Extracts cells elements 163 | #' 164 | #' @param cells a list of cell nodes 165 | #' @param tag a character vector that provides information used in the XPath expression to extract the correct elements 166 | #' @param elFun a function that is executed over the header/body cell nodes 167 | #' @param rm_escape a character vector that, if specified, is used to replace escape sequences in header and body cells (default value ' ') 168 | #' @param rm_whitespace logical, should leading/trailing whitespace be removed from cell values ( 169 | #' default value TRUE)? 170 | #' @return the body element 171 | get_cell_element <- function(cells, tag = "td | th", elFun, rm_escape, rm_whitespace) { 172 | 173 | cell.element <- lapply(cells, function(tr) { 174 | XML::xpathSApply(tr, tag, elFun) 175 | }) 176 | 177 | # cell.element <- 178 | # 179 | # lapply(cells, function(tr) { 180 | # x <- XML::xmlValue(tr, tag, function(x) paste(xmlValue(x), sep = "||")) 181 | # #sapply(x, xmlValue, recursive = F) 182 | # }) 183 | 184 | if(!is.null(rm_escape)) { 185 | cell.element <- lapply(cell.element, function(el) gsub("([[:alpha:]])-[\b\n\t\r]([[:alpha:]])", "\\1\\2", el)) 186 | cell.element <- lapply(cell.element, function(el) gsub("[\b \n \t \r]", rm_escape, el)) 187 | } 188 | 189 | if(isTRUE(rm_whitespace)){ 190 | cell.element <- lapply(cell.element, function(el) rm_str_white(el)) 191 | } 192 | return(cell.element) 193 | } 194 | 195 | 196 | #' Extracts rowspan information 197 | #' 198 | #' @param cells a list of cell nodes 199 | #' @param span a character for the span element name 200 | #' @param tag a character vector that provides information used in the XPath expression to extract the correct elements 201 | #' @return A list of row information from the cells 202 | get_span <- function(cells, span, tag = "td | th"){ 203 | 204 | span.val <- lapply(cells, function(tr) { 205 | XML::xpathSApply(tr, tag, function(node) { 206 | val <- XML::xmlGetAttr(node, span) 207 | val <- ifelse(is.null(val) || val == "0" || grepl("%", val), 1, val) #Check Firefox for colspan == 0 208 | val <- as.numeric(val) 209 | return(val) 210 | }) 211 | }) 212 | 213 | return(span.val) 214 | } 215 | 216 | #' Extracts header elements 217 | #' 218 | #' @param cells a list of cell nodes 219 | #' @param tag a character vector that provides information used in the XPath expression to extract the correct elements 220 | #' @return A list of header information from the cells 221 | get_header_elements <- function(cells, tag = "td | th"){ 222 | 223 | header_elements <- lapply(cells, function(tr) { 224 | XML::xpathSApply(tr, tag, function(node) { 225 | if(XML::xmlName(node) != "sup") { 226 | value <- XML::xmlValue(node) 227 | } 228 | return(value) 229 | }) 230 | }) 231 | return(header_elements) 232 | } 233 | -------------------------------------------------------------------------------- /R/htmltab.R: -------------------------------------------------------------------------------- 1 | #' Assemble a data frame from HTML table data 2 | #' 3 | #' Robust and flexible methods for extracting structured information out of HTML tables 4 | #' 5 | #' @export 6 | #' @param doc the HTML document which can be a file name or a URL or an already parsed document 7 | #' (by XML's parsing functions) 8 | #' @param which a vector of length one for identification of the table in the document. Either a 9 | #' numeric vector for the tables' rank or a character vector that describes an XPath for the table 10 | #' @param header the header formula, see details for specifics 11 | #' @param headerFun a function that is executed over the header cell nodes 12 | #' @param headerSep a character vector that is used as a seperator in the construction of the table's 13 | #' variable names (default ' >> ') 14 | #' @param body a vector that specifies which table rows should be used as body information. A numeric 15 | #' vector can be specified where each element corresponds to a table row. A character vector may be 16 | #' specified that describes an XPath for the body rows. If left unspecified, htmltab tries to use 17 | #' semantic information from the HTML code 18 | #' @param bodyFun a function that is executed over the body cell nodes 19 | #' @param complementary logical, should htmltab ensure complementarity of header, inbody header and 20 | #' body elements (default TRUE)? 21 | #' @param fillNA character vector of symbols that are replaced by NA (default c('')) 22 | #' @param rm_superscript logical, should superscript information be removed from header and body cells 23 | #' (default TRUE)? 24 | #' @param rm_footnotes logical, should semantic footer information be removed (default TRUE)? 25 | #' @param rm_nodata_cols logical, should columns that have no alphanumeric data be removed (default TRUE)? 26 | #' @param rm_nodata_rows logical, should rows that have no alphanumeric data be removed (default TRUE)? 27 | #' @param rm_escape a character vector that, if specified, is used to replace escape sequences in header 28 | #' and body cells (default ' ') 29 | #' @param rm_invisible logical, should nodes that are not visible be removed (default TRUE)? This 30 | #' includes elements with class 'sortkey' and 'display:none' style. 31 | #' @param rm_whitespace logical, should leading/trailing whitespace be removed from cell values (default TRUE)? 32 | #' @param colNames a character vector of column names, or a function that can be used to replace specific 33 | #' column names (default NULL) 34 | #' @param ... additional arguments passed to HTML parsers 35 | #' 36 | #' @return An R data frame 37 | #' @details The header formula has the following format: level1 + level2 + level3 + ... . 38 | #' level1 specifies the main header dimension (column names). This information must 39 | #' be for rows. level2 and deeper signify header dimensions that appear throughout the body. 40 | #' Those information muste be for cell elements, not rows. Header information may be 41 | #' one of the following types: 42 | #' 43 | #'\itemize{ 44 | #' \item the NULL value (default). No information passed, htmltab will try to identify 45 | #' header elements through heuristics (heuristics only work for the main header) 46 | #' \item A numeric vector that retrieves rows in the respective position 47 | #' \item A character string of an XPath expression 48 | #' \item A function that when evaluated produces a numeric or character vector 49 | #' \item 0, when the process of finding the main header should be skipped (only works 50 | #' for main header) 51 | #' } 52 | #' 53 | #' 54 | #' @author Christian Rubba <\url{http://www.christianrubba.com}> 55 | #' @references \url{https://github.com/crubba/htmltab} 56 | #' @examples 57 | #' \dontrun{ 58 | #'# When no spans are present, htmltab produces output close to XML's readHTMLTable(), 59 | #'but it removes many types of non-data elements (footnotes, non-visible HTML elements, etc) 60 | #' 61 | #' url <- "http://en.wikipedia.org/wiki/World_population" 62 | #' xp <- "//caption[starts-with(text(),'World historical')]/ancestor::table" 63 | #' htmltab(doc = url, which = xp) 64 | #' 65 | #' popFun <- function(node) { 66 | #' x <- XML::xmlValue(node) 67 | #' gsub(',', '', x) 68 | #' } 69 | #' 70 | #' htmltab(doc = url, which = xp, bodyFun = popFun) 71 | #' 72 | #' #This table lacks header information. We provide them through colNames. 73 | #' #We also need to set header = 0 to indicate that no header is present. 74 | #' doc <- "http://en.wikipedia.org/wiki/FC_Bayern_Munich" 75 | #' xp2 <- "//td[text() = 'Head coach']/ancestor::table" 76 | #' htmltab(doc = doc, which = xp2, header = 0, encoding = "UTF-8", colNames = c("name", "role")) 77 | #' 78 | #' #htmltab recognizes column spans and produces a one-dimension vector of variable information, 79 | #' #also removes automatically superscript information since these are usually not of use. 80 | #' 81 | #' doc <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers" 82 | #' xp3 <- "//table[7]" 83 | #' bFun <- function(node) { 84 | #' x <- XML::xmlValue(node) 85 | #' gsub('%$', '', x) 86 | #' } 87 | #' 88 | #' htmltab(doc = doc, which = xp3, bodyFun = bFun) 89 | #' 90 | #' 91 | #' htmltab("https://en.wikipedia.org/wiki/Arjen_Robben", which = 3, 92 | #' header = 1:2) 93 | #' 94 | #' 95 | #' #When header information appear throughout the body, you can specify their 96 | #' #position in the header formula 97 | #' 98 | #' htmltab(url, which = "//table[@@id='team_gamelogs']", header = . + "//td[./strong]") 99 | #' } 100 | 101 | 102 | htmltab <- function(doc, 103 | which = NULL, 104 | header = NULL, 105 | headerFun = function(node)XML::xmlValue(node), 106 | headerSep = " >> ", 107 | body = NULL, 108 | bodyFun = function(node)XML::xmlValue(node), 109 | complementary = TRUE, 110 | fillNA = NA, 111 | rm_superscript = TRUE, 112 | rm_escape = " ", 113 | rm_footnotes = TRUE, 114 | rm_nodata_cols = TRUE, 115 | rm_nodata_rows = TRUE, 116 | rm_invisible = TRUE, 117 | rm_whitespace = TRUE, 118 | colNames = NULL, 119 | ...){ 120 | 121 | # on exit 122 | # if(isTRUE(develop)){ 123 | # on.exit(print(table.Node)) 124 | # } 125 | 126 | # Deparse 127 | header <- deparse(substitute(header), width.cutoff = 500L) 128 | body <- deparse(substitute(body), width.cutoff = 500L) 129 | ev_header <- eval_header(arg = header) 130 | ev_body <- eval_body(arg = body) 131 | 132 | # Check Inputs & Clean Up & Normalize tr -------- 133 | table.Node <- check_type(doc = doc, which = which, ...) 134 | table.Node <- rm_nuisance(table.Node = table.Node, 135 | rm_superscript = rm_superscript, 136 | rm_footnotes = rm_footnotes, 137 | rm_invisible = rm_invisible) 138 | table.Node <- normalize_tr(table.Node = table.Node) 139 | 140 | #Produce XPath for header and body and add class information 141 | LL <- identify_elements(table.Node = table.Node, 142 | header = ev_header, 143 | body = ev_body, 144 | complementary = complementary) 145 | 146 | # Create Table --------------------------- 147 | 148 | #Retrieve Head Elements 149 | header.names <- make_header(trindex = LL$trindex$header, 150 | table.Node = table.Node, 151 | headerSep = headerSep, 152 | headerFun = headerFun, 153 | rm_escape = rm_escape, 154 | rm_whitespace = rm_whitespace) 155 | 156 | 157 | # Create Body --------------------------- 158 | 159 | #Get Body Cell Nodes 160 | cells <- get_cells(table.Node = table.Node) 161 | #row_type <- get_row_type(table.Node = table.Node) 162 | 163 | 164 | #Extract and transform body cell elements 165 | vals <- get_cell_element(cells, elFun = bodyFun, 166 | rm_escape = rm_escape, 167 | rm_whitespace = rm_whitespace) 168 | 169 | #Produce rowspans and colspans lists from body cell 170 | body.rowspans <- get_span(cells, span = "rowspan") 171 | body.colspans <- get_span(cells, span = "colspan") 172 | 173 | #Produce table body 174 | tab <- expand_body(vals, colspans = body.colspans, rowspans = body.rowspans) 175 | 176 | 177 | # Finish --------------------------- 178 | 179 | tab <- as.data.frame(tab, stringsAsFactors = F) 180 | 181 | #Produce DF 182 | tab <- make_colnames(df = tab, 183 | header.names = header.names, 184 | colNames = colNames, 185 | header.xpath = LL$xpath$header) 186 | 187 | # Inbody header 188 | tab <- create_inbody(tab = tab, table.Node = table.Node, 189 | trindex = LL$trindex$inbody, 190 | xpath = LL$xpath$inbody) 191 | 192 | #Check if there are no data columns/rows 193 | if(isTRUE(rm_nodata_cols)){ 194 | tab <- rm_empty_cols(df = tab, header = header.names) 195 | } 196 | 197 | # Subset 198 | tab <- tab[LL$trindex$body, , drop = F] 199 | 200 | if(isTRUE(rm_nodata_rows)){ 201 | tab <- rm_empty_rows(df = tab) 202 | } 203 | 204 | 205 | # Replace empty vals by NA 206 | tab[is.na(tab)] <- fillNA 207 | 208 | return(tab) 209 | } 210 | -------------------------------------------------------------------------------- /R/identify_rows.R: -------------------------------------------------------------------------------- 1 | #' Generate numeric XPath expression 2 | #' 3 | #' @title num_xpath: Generate numeric XPath expression 4 | #' @param data the header XPath 5 | #' @export 6 | num_xpath <- function(data) UseMethod("num_xpath") 7 | 8 | #' @export 9 | num_xpath.default <- function(data){ 10 | stop("Unknown input", call. = F) 11 | } 12 | 13 | #' @export 14 | num_xpath.character <- function(data){ 15 | return(data) 16 | } 17 | 18 | #' @export 19 | num_xpath.numeric <- function(data){ 20 | 21 | index.count <- data - 1 22 | index.xpath <- sapply(1:length(index.count), function(pos) { 23 | sprintf("count(preceding::tr) = %s", index.count[pos]) 24 | }) 25 | index.xpath <- sprintf("//tr[%s]", paste(index.xpath, collapse = " or ")) 26 | 27 | return(index.xpath) 28 | } 29 | 30 | #' @export 31 | num_xpath.list <- function(data){ 32 | 33 | index.xpath <- lapply(data, get("num_xpath", envir = parent.env(environment()))) 34 | return(index.xpath) 35 | } 36 | 37 | 38 | #' Return trindex given an XPath 39 | #' @param table.Node the table node 40 | #' @param xpath XPath 41 | get_trindex <- function(xpath, table.Node) UseMethod("get_trindex") 42 | 43 | get_trindex.default <- function(xpath, table.Node){ 44 | stop("Unknow XPath", call. = FALSE) 45 | } 46 | 47 | get_trindex.NULL <- function(xpath, table.Node){ 48 | return(logical()) 49 | } 50 | 51 | get_trindex.character <- function(xpath, table.Node){ 52 | 53 | tr.index <- unlist(XML::xpathSApply(table.Node, xpath, XML::xmlGetAttr, "HTMLTABtrindex")) 54 | tr.index <- as.numeric(tr.index) 55 | return(tr.index) 56 | } 57 | 58 | get_trindex.list <- function(xpath, table.Node){ 59 | 60 | tr.index <- lapply(seq_along(xpath), function(id) { 61 | XML::xpathSApply(table.Node, xpath[[id]], function(x) { 62 | val <- XML::xmlGetAttr(XML::xmlParent(x), "HTMLTABtrindex") 63 | val <- as.numeric(val) 64 | return(val) 65 | }) 66 | }) 67 | 68 | return(tr.index) 69 | } 70 | 71 | #' Return header xpath 72 | #' 73 | #' @param table.Node the table node 74 | #' @param header an information for the header rows 75 | #' @return a character vector of XPath statements 76 | get_head_xpath <- function(header, table.Node) UseMethod("get_head_xpath") 77 | 78 | get_head_xpath.default <- function(header, table.Node) { 79 | stop("Unknown header information", .call = F) 80 | } 81 | 82 | get_head_xpath.character <- function(header, table.Node) { 83 | return(header) 84 | } 85 | 86 | get_head_xpath.numeric <- function(header, table.Node) { 87 | 88 | if(header[1] == 0){ 89 | header <- NULL 90 | } else{ 91 | header <- num_xpath(header) 92 | } 93 | 94 | return(header) 95 | } 96 | 97 | get_head_xpath.NULL <- function(header, table.Node){ 98 | 99 | thead <- has_tag(table.Node, "//thead") 100 | thead.th <- has_tag(table.Node, "//thead/tr[th]") 101 | thead.td <- has_tag(table.Node, "//thead/tr[td]") 102 | 103 | tr <- has_tag(table.Node, "//tr") 104 | th <- has_tag(table.Node, "//tr[th and not(./td)]") 105 | td <- has_tag(table.Node, "//tr[td and not(./th)]") 106 | 107 | 108 | if(thead) { 109 | header.xpath <- '//thead/tr' 110 | return(c(header.xpath)) 111 | } 112 | 113 | if (!thead && th){ 114 | header.xpath <- "//tr[th and not(./td)]" 115 | return(c(header.xpath)) 116 | } 117 | 118 | if (!thead && !th){ 119 | header.xpath <- "//tr[position() = 1]" 120 | message("Neither nor information found. Taking first table row for the header. If incorrect, specifiy header argument.") 121 | return(c(header.xpath)) 122 | } 123 | } 124 | 125 | 126 | #' Return body xpath 127 | #' 128 | #' @param table.Node the table node 129 | #' @param body an information for the body rows 130 | #' @return a character vector of XPath statements 131 | get_body_xpath <- function(body, table.Node) UseMethod("get_body_xpath") 132 | 133 | get_body_xpath.default <- function(body, table.Node){ 134 | stop("Unknow body information", .call = F) 135 | } 136 | 137 | get_body_xpath.numeric <- function(body, table.Node){ 138 | body <- num_xpath(body) 139 | return(body) 140 | } 141 | 142 | get_body_xpath.character <- function(body, table.Node){ 143 | return(body) 144 | } 145 | 146 | get_body_xpath.NULL <- function(body, table.Node){ 147 | tbody <- has_tag(table.Node, "//tbody") 148 | tbody.th <- has_tag(table.Node, "//tbody/tr[th]") 149 | tbody.td <- has_tag(table.Node, "//tbody/tr[td]") 150 | 151 | tr <- has_tag(table.Node, "//tr") 152 | th <- has_tag(table.Node, "//tr[th and not(./td)]") 153 | td <- has_tag(table.Node, "//tr[td and not(./th)]") 154 | 155 | if(tbody){ 156 | body.xpath <- "//tbody/tr" 157 | return(c(body.xpath)) 158 | } else { 159 | body.xpath <- "//tr[./td]" 160 | return(c(body.xpath)) 161 | } 162 | 163 | } 164 | 165 | #' Assemble XPath expressions for header and body 166 | #' 167 | #' @param table.Node the table node 168 | #' @param header a vector that contains information for the identification of the header row(s). A numeric vector can be specified where each element corresponds to the table rows. A character vector may be specified that describes an XPath for the header rows. If left unspecified, htmltable tries to use semantic information from the HTML code 169 | #' @param body a vector that specifies which table rows should be used as body information. A numeric vector can be specified where each element corresponds to a table row. A character vector may be specified that describes an XPath for the body rows. If left unspecified, htmltable tries to use semantic information from the HTML code 170 | #' @param complementary logical, should htmltab ensure complementarity of header, inbody header and 171 | #' body elements (default TRUE)? 172 | #' @return a character vector of XPath statements 173 | identify_elements <- function(table.Node, header, body, complementary = T){ 174 | 175 | #header <- ev_header 176 | #body <- ev_body 177 | 178 | header_MAIN <- header[[1]] 179 | header_INBODY <- header[-1] 180 | 181 | # switch compementary 182 | if(!is.null(header_MAIN[1])){ 183 | if(header_MAIN[1] == 0){complementary <- FALSE} 184 | } 185 | 186 | # Produce XPaths 187 | header_MAIN.xpath <- get_head_xpath(table.Node = table.Node, header = header_MAIN) 188 | body.xpath <- get_body_xpath(table.Node = table.Node, body = body) 189 | header_INBODY.xpath <- num_xpath(header_INBODY) 190 | 191 | # Receive HTMLTABtrindex 192 | header_MAIN.trindex <- get_trindex(header_MAIN.xpath, table.Node) 193 | header_INBODY.trindex <- get_trindex(header_INBODY.xpath, table.Node) 194 | body.trindex <- get_trindex(body.xpath, table.Node) 195 | 196 | # Check INBODY correct 197 | ifstop(length(header_INBODY.xpath) > 0 && length(header_INBODY.trindex[[1]]) == 0, 198 | "Your specified inbody header cells could not be identified.", call. = F) 199 | 200 | # Complementarity 201 | if(isTRUE(complementary)){ 202 | 203 | if(is.null(body)){ 204 | body.trindex <- setdiff(body.trindex, c(header_MAIN.trindex, unlist(header_INBODY.trindex))) 205 | } 206 | 207 | if(is.null(header_MAIN) && !is.null(body)){ 208 | header_MAIN.trindex <- setdiff(header_MAIN.trindex, c(body.trindex, unlist(header_INBODY.trindex))) 209 | } 210 | 211 | } 212 | 213 | trindex <- list(header = header_MAIN.trindex, 214 | inbody = header_INBODY.trindex, 215 | body = body.trindex) 216 | 217 | xpath <- list(header = header_MAIN.xpath, 218 | inbody = header_INBODY.xpath, 219 | body = body.xpath) 220 | 221 | return(list(trindex = trindex, xpath = xpath)) 222 | } 223 | -------------------------------------------------------------------------------- /R/inbody_header.R: -------------------------------------------------------------------------------- 1 | #' Reshape in table header information into wide format 2 | #' @param tab the table data frame 3 | #' @param table.Node the table node 4 | #' @param trindex the tr index of the inbody rows 5 | #' @param xpath the xpath for the inbody rows 6 | #' @return the modified R data frame 7 | create_inbody <- function(tab, table.Node, trindex, xpath){ 8 | 9 | # When no inbody header was specified 10 | if(length(trindex) == 0){ 11 | return(tab) 12 | } 13 | 14 | # Else ... 15 | N <- length(trindex) 16 | N.row <- nrow(tab) 17 | df <- data.frame(matrix(NA, ncol = N, nrow = N.row)) 18 | colnames(df) <- paste0("Header_", 1:N) 19 | 20 | for(i in 1:N){ 21 | val <- XML::xpathSApply(table.Node, xpath[[i]], XML::xmlValue) 22 | index <- trindex[[i]] 23 | index <- c(index, (N.row + 1)) 24 | 25 | df[index[1]:(N.row), (N+1-i)] <- rep(val, diff(index)) 26 | } 27 | 28 | # Combine 29 | tab <- cbind(df, tab) 30 | 31 | return(tab) 32 | } 33 | -------------------------------------------------------------------------------- /R/setup_and_checks.R: -------------------------------------------------------------------------------- 1 | #' Produce the table node 2 | #' 3 | #' @param doc the HTML document which can be a file name or a URL or an already parsed document 4 | #' (by XML's parsing functions) 5 | #' @param which a vector of length one for identification of the table in the document. Either 6 | #' a numeric vector for the tables' rank or a character vector that describes an XPath for the table 7 | #' @param ... additional arguments passed to htmlParse 8 | #' @return a table node 9 | check_type <- function(doc, which, ...) UseMethod("check_type") 10 | 11 | check_type.default <- function(doc, which, ...){ 12 | stop("doc is of unknown type", call. = FALSE) 13 | } 14 | 15 | check_type.XMLNodeSet <- function(doc, which, ...){ 16 | 17 | Node <- eval.parent(substitute(XML::xmlParse(XML::saveXML(doc[[1]]), list(...)))) 18 | 19 | return(Node) 20 | } 21 | 22 | check_type.HTMLInternalDocument <- function(doc, which, ...) { 23 | Node <- doc 24 | Node <- select_tab(which = which, Node = Node) 25 | 26 | return(Node) 27 | } 28 | 29 | check_type.XMLInternalElementNode <- function(doc, which, ...) { 30 | Node <- doc 31 | Node <- select_tab(which = which, Node = Node) 32 | 33 | return(Node) 34 | } 35 | 36 | check_type.character <- function(doc, which, ...){ 37 | 38 | isurl <- is_url(doc) 39 | 40 | if(isurl) { 41 | doc <- httr::GET(doc) 42 | doc <- httr::content(doc, "text") 43 | } else if (file.exists(doc)) { 44 | doc <- readChar(doc, file.info(doc)$size) 45 | } 46 | 47 | Node <- eval.parent(substitute(XML::htmlParse(doc, encoding = "UTF-8", list(...)))) 48 | Node <- select_tab(which = which, Node = Node) 49 | 50 | return(Node) 51 | } 52 | 53 | 54 | #' Selects the table from the HTML Code 55 | #' 56 | #' @param Node the table node 57 | #' @param which a vector of length one for identification of the table in the document. Either 58 | #' a numeric vector for the tables' rank or a character vector that describes an XPath for the table 59 | #' @param ... additional arguments passed to htmlParse 60 | #' @return a table node 61 | select_tab <- function(which, Node) UseMethod("select_tab") 62 | 63 | select_tab.default <- function(which, Node){ 64 | 65 | message("Argument 'which' was left unspecified. Choosing first table.") 66 | Node <- XML::getNodeSet(Node, path = "//table") 67 | 68 | ifstop(cond = length(Node) == 0, 69 | mess = "Couldn't find a table.") 70 | 71 | Node <- XML::xmlParse(XML::saveXML(Node[[1]])) 72 | return(Node) 73 | } 74 | 75 | select_tab.numeric <- function(which, Node){ 76 | 77 | Node <- XML::getNodeSet(Node, path = "//table") 78 | 79 | ifstop(cond = length(Node) < which, 80 | mess = "Couldn't find the table. Try passing (a different) information to the which argument.") 81 | 82 | Node <- XML::xmlParse(XML::saveXML(Node[[which]])) 83 | return(Node) 84 | } 85 | 86 | select_tab.character <- function(which, Node){ 87 | 88 | xpath <- paste(which, collapse = " | ") 89 | Node <- XML::getNodeSet(Node, path = xpath) 90 | 91 | ifstop(cond = is.null(Node[[1]]), 92 | mess = "Couldn't find the table. Try passing (a different) information to the which argument.") 93 | 94 | Node <- XML::xmlParse(XML::saveXML(Node[[1]])) 95 | return(Node) 96 | } 97 | 98 | 99 | #' Evaluate and deparse the header argument 100 | #' @param arg the header information 101 | #' @return evaluated header info 102 | eval_header <- function(arg){ 103 | 104 | # Parse header string 105 | header <- rm_str_white(strsplit(arg, "\\+")[[1]]) 106 | 107 | those.header <- which(header %in% c("\"\"", "''", "", ".", "NULL")) 108 | header[those.header] <- "NULL" 109 | 110 | # Check that inbody information are complete 111 | ifstop(cond = any(header[-1] == "NULL"), 112 | mess = "You need to provide complete information for the inbody rows") 113 | 114 | # Evaluate header information 115 | header <- lapply(header, function(x) eval(parse(text = x))) 116 | 117 | return(header) 118 | } 119 | 120 | #' Evaluate and deparse the body argument 121 | #' @param arg the body argument 122 | eval_body <- function(arg){ 123 | 124 | body <- rm_str_white(strsplit(arg, "\\+")[[1]]) 125 | 126 | those.body <- which(body %in% c("\"\"", "''", "", ".", "NULL")) 127 | body[those.body] <- "NULL" 128 | 129 | ifstop(cond = length(body) > 1, 130 | mess = "Your body information is malformed. You may only provide one piece of information") 131 | 132 | # Evaluate header information 133 | body <- eval(parse(text = body)) 134 | 135 | return(body) 136 | } 137 | 138 | 139 | #' Normalizes rows to be nested in tr tags, header in thead, body in tbody and numbers them 140 | #' 141 | #' @param table.Node the table node 142 | #' @return the revised table node 143 | normalize_tr <- function(table.Node){ 144 | 145 | wrong_tag <- "//trbody" 146 | x <- has_tag(table.Node, wrong_tag) 147 | 148 | if(isTRUE(x)){ 149 | old.header <- XML::getNodeSet(table.Node, "//trbody/*") 150 | invisible(new.header <- XML::newXMLNode("tbody")) 151 | invisible(XML::addChildren(new.header, old.header)) 152 | invisible(XML::replaceNodes(oldNode = XML::getNodeSet(table.Node, "//trbody")[[1]], newNode = new.header)) 153 | 154 | warning(sprintf("The code for the HTML table you provided contains invalid table tags ('%s'). The following transformations were applied:\n 155 | //trbody -> //tbody \n 156 | If you specified an XPath that makes a reference to this tag, this may have caused problems with their identification.", wrong_tag), call. = F) 157 | } 158 | 159 | #Every row in tr 160 | node1 <- XML::getNodeSet(table.Node, "//*[name() = 'td' or name() = 'th' and not(parent::tr)][preceding-sibling::*[1]/self::tr]") 161 | 162 | if(length(node1) > 0){ 163 | 164 | for(i in 1:length(node1)){ 165 | 166 | node.container <- list() 167 | node.container[[1]] <- node1[[i]] 168 | 169 | gg <- 1 170 | repeat{ 171 | sibling.node <- XML::getSibling(node.container[[gg]]) 172 | 173 | if(is.null(sibling.node)) break 174 | if(!(XML::xmlName(sibling.node) %in% c("td", "th"))) break 175 | 176 | node.container[[gg + 1]] <- sibling.node 177 | gg <- gg + 1 178 | } 179 | 180 | invisible(new.tr <- XML::newXMLNode("tr")) 181 | invisible(XML::replaceNodes(oldNode = node1[[i]], newNode = new.tr)) 182 | invisible(XML::addChildren(new.tr, node.container)) 183 | } 184 | 185 | warning("The code for the HTML table you provided is malformed. Not all cells are nested in row tags (). htmltab tried to normalize the table and ensure that all cells are within row tags. If you specified an XPath for body or header elements, this may have caused problems with their identification.", call. = F) 186 | 187 | } 188 | 189 | # Flatten inside table 190 | nested <- has_tag(table.Node, "/table//table") 191 | if(nested){ 192 | warning("There is a table inside the target table. htmltab tries to flatten the inner table", call. = F) 193 | invisible(old.node <- XML::getNodeSet(table.Node, "/table//table")) 194 | for(i in 1:length(old.node)){ 195 | invisible(vals <- XML::xmlValue(old.node[[i]])) 196 | invisible(new.cell <- XML::newXMLNode("td", vals)) 197 | invisible(XML::replaceNodes(oldNode = old.node[[i]], newNode = new.cell)) 198 | } 199 | } 200 | 201 | #Add tr index 202 | trs <- XML::getNodeSet(table.Node, "//tr") 203 | n.trs <- length(trs) 204 | invisible(lapply(1:n.trs, function(index) XML::xmlAttrs(trs[[index]]) <- c(HTMLTABtrindex = index))) 205 | 206 | return(table.Node) 207 | } 208 | 209 | 210 | #' Remove nuisance elements from the the table code 211 | #' 212 | #' @param table.Node the table node 213 | #' @param rm_superscript logical, denotes whether superscript information should be removed from header and body cells (default value TRUE) 214 | #' @param rm_footnotes logical, denotes whether semantic footer information should be removed (default value TRUE) 215 | #' @param rm_invisible logical, should nodes that are not visible (display:none attribute) be removed? 216 | #' @seealso \code{\link{rm_empty_cols}} 217 | #' @return The revised table node 218 | rm_nuisance <- function(table.Node, rm_superscript, rm_footnotes, rm_invisible){ 219 | 220 | if(isTRUE(rm_superscript)){ 221 | invisible(XML::removeNodes(XML::getNodeSet(table.Node, "//sup"))) 222 | } 223 | 224 | if(isTRUE(rm_footnotes)){ 225 | invisible(XML::removeNodes(XML::getNodeSet(table.Node, "//tfoot"))) 226 | } 227 | 228 | if(isTRUE(rm_invisible)){ 229 | invisible(XML::removeNodes(XML::getNodeSet(table.Node, "//*[contains(@style, 'display:none') or @class = 'sortkey']"))) 230 | } 231 | 232 | # Remove empty rows 233 | invisible(XML::removeNodes(XML::getNodeSet(table.Node, "//tr[not(./*)]"))) 234 | 235 | return(table.Node) 236 | } 237 | 238 | 239 | #' Remove columns which do not have data values 240 | #' 241 | #' @param df a data frame 242 | #' @param header the header vector 243 | #' @return a data frame 244 | #' @seealso \code{\link{rm_nuisance}, \link{rm_empty_rows}} 245 | rm_empty_cols <- function(df, header){ 246 | 247 | #This is clumsy but seems to work reasonably well 248 | #columns are removed when they have: 249 | #1. No name (V...) 250 | #2. More than 50% missing values in their column 251 | 252 | # no.col.name <- grep('^V[[:digit:]]', colnames(df)) 253 | 254 | empty.cols <- sapply(df, function(x) { 255 | gg <- is.na(x) 256 | gg <- length(base::which(gg, TRUE)) / length(x) 257 | }) 258 | 259 | empty.cols <- which(empty.cols > 0.5) 260 | if(length(empty.cols) > 0) warning(sprintf("Columns [%s] seem to have no data and are removed. Use rm_nodata_cols = F to suppress this behavior", paste(names(empty.cols), collapse = ",")), call. = F) 261 | rm.these <- empty.cols #intersect(empty.cols, no.col.name) 262 | 263 | if(length(rm.these) > 0) { 264 | df <- df[, -rm.these, drop = F] 265 | } 266 | 267 | return(df) 268 | } 269 | 270 | #' Remove rows which do not have data values 271 | #' 272 | #' @param df a data frame 273 | #' @return a data frame 274 | #' @seealso \code{\link{rm_nuisance}, \link{rm_empty_cols}} 275 | rm_empty_rows <- function(df){ 276 | rm.these <- which(rowSums(is.na(df)) == ncol(df)) 277 | 278 | if(length(rm.these) > 0) { 279 | df <- df[-rm.these,] 280 | } 281 | 282 | return(df) 283 | } 284 | 285 | 286 | 287 | # 288 | # #' Remove columns which do not have data values 289 | # #' 290 | # #' @param df a data frame 291 | # #' @return a data frame 292 | # #' @seealso \code{\link{rm_nuisance}} 293 | # rm_empty_cols <- function(df){ 294 | # 295 | # #This is clumsy but seems to work reasonably well 296 | # #columns are removed when they have: 297 | # #1. No name (V...) 298 | # #2. More than 50% missing values in their column 299 | # 300 | # no.col.name <- grep('^V[[:digit:]]', colnames(df)) 301 | # 302 | # empty.cols <- sapply(df, function(col){ 303 | # x <- grepl("[A-Za-z]{1,}", col) 304 | # x <- length(base::which(x, TRUE)) / length(x) 305 | # }) 306 | # 307 | # empty.cols <- which(empty.cols < 0.5) 308 | # rm.these <- intersect(empty.cols, no.col.name) 309 | # 310 | # if(length(rm.these) > 0) { 311 | # df <- df[, -rm.these] 312 | # } 313 | # 314 | # return(df) 315 | # } 316 | -------------------------------------------------------------------------------- /R/utils.R: -------------------------------------------------------------------------------- 1 | is_empty <- function(x) length(x) == 0 2 | 3 | all_empty <- function(x) { 4 | all(sapply(x, length) == 0) 5 | } 6 | 7 | equal_length <- function(x) { 8 | x.length <- sapply(x, length) 9 | length(unique(x.length)) == 1 10 | } 11 | 12 | rm_str_white <- function(el) gsub("^\\s+|\\s+$", "", el) 13 | 14 | check_correct <- function(nodeset){ 15 | names <- sapply(1:length(nodeset), function(index) XML::xmlName(nodeset[[index]])) 16 | if(!all(names == "tr")) stop("You must pass header/body information that identifies row elements (tr)", call. = FALSE) 17 | } 18 | 19 | #' Wrapper around if stop logic 20 | #' @noRd 21 | ifstop <- function(cond, mess, ...){ 22 | 23 | cond <- eval(quote(cond)) 24 | 25 | if(isTRUE(cond)){ 26 | stop(mess, call. = F) 27 | } 28 | } 29 | 30 | #' Is str a URL? 31 | #' @noRd 32 | is_url <- function(str){ 33 | grepl("^(http:|https:|www.)", str) 34 | } 35 | 36 | 37 | #' Assert a specific tag in an XML node 38 | #' 39 | #' @param table.Node the table node 40 | #' @param tag a character string for the tag name to be matched 41 | #' @return logical value indicating whether tag is present in table code 42 | #' @noRd 43 | 44 | has_tag <- function(table.Node, tag) { 45 | x <- XML::xpathSApply(table.Node, tag) 46 | if(length(x) > 0){TRUE} else{FALSE} 47 | } 48 | -------------------------------------------------------------------------------- /R/zzz.R: -------------------------------------------------------------------------------- 1 | .onAttach <- function(...) { 2 | if (!interactive() || stats::runif(1) > 0.1) return() 3 | tips <- c( 4 | "Due to new features and arguments introduced with v.0.6.0 code can break.\nUse suppressPackageStartupMessages to eliminate package startup messages.", 5 | "If you find a bug, please report it to 'https://github.com/crubba/htmltab/issues'\nUse suppressPackageStartupMessages to eliminate package startup messages." 6 | ) 7 | tip <- sample(tips, 1) 8 | packageStartupMessage(tip) 9 | } 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## htmltab: Hassle-free HTML tables in R 2 | HTML tables are a valuable data source but extracting and recasting 3 | these data into a useful format can be tedious. htmltab is a package for 4 | extracting structured information from HTML tables. It is similar to 5 | readHTMLTable() of the XML package but provides two major advantages. First, 6 | the function automatically expands row and column spans in the header and 7 | body cells. Second, users are given more control over the identification of 8 | header and body rows which will end up in the R table. Additionally, the 9 | function preprocesses table code, removes unneeded parts and so helps to 10 | alleviate the need for tedious post-processing. 11 | 12 | ## Installation 13 | The package is available from CRAN and Github. For the stable release version, download from [CRAN](http://cran.r-project.org/web/packages/htmltab/index.html): 14 | 15 | ``` 16 | install.packages("htmltab") 17 | ``` 18 | 19 | For the developer version, download from my GitHub repo. You can install the package directly from inside R: 20 | 21 | ``` 22 | install.packages("devtools") 23 | devtools::install_github("crubba/htmltab") 24 | ``` 25 | 26 | ## Usage 27 | To see *htmltab* in action, take a look at the case studies in the [package vignette](http://cran.r-project.org/web/packages/htmltab/vignettes/htmltab.htm), this [blog post](http://r-datacollection.com/blog/htmltab-Next-version-and-CRAN-release/) or consult the package manual. 28 | 29 | ## Travis status 30 | ![travis status](https://travis-ci.org/crubba/htmltab.svg) 31 | 32 | ## Report issues 33 | If you experience problems with *htmltab*, I would like to hear about it to improve the project. Please use [my github repo](https://github.com/crubba/htmltab/issues) to report the issue. 34 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | # Resubmission 2 | This is a resubmission. In this version I have: 3 | 4 | * Converted the DESCRIPTION title to title case 5 | * Included call to person() in Authors@R field 6 | * Confirmed that the package was checked with R-devel through win-builder 7 | 8 | ## Test environments 9 | * local OS X install, R 3.1.2 10 | * local Ubuntu 12.04, R 3.1.2 11 | * win-builder (devel and release) 12 | 13 | ## R CMD check results 14 | Local R CMD CHECKs succeeded with no ERRORs, WARNINGs or NOTEs. 15 | 16 | I experienced two NOTEs, when using devtools::check(args="--as-cran"), but not when I checked the package as usual: 17 | 18 | * checking CRAN incoming feasibility ... NOTE 19 | Maintainer: ‘Christian Rubba ’ 20 | New submission 21 | Components with restrictions and base license permitting such: 22 | MIT + file LICENSE 23 | File 'LICENSE': 24 | YEAR: 2014,2015 25 | COPYRIGHT HOLDER: Christian Rubba 26 | 27 | * checking package dependencies ... NOTE 28 | No repository set, so cyclic dependency check skipped 29 | 30 | I experienced two NOTEs on R CMD CHECK on win-builder: 31 | 32 | On release I got: 33 | 34 | *Components with restrictions and base license permitting such: 35 | MIT + file LICENSE 36 | File 'LICENSE': 37 | YEAR: 2014,2015 38 | COPYRIGHT HOLDER: Christian Rubba 39 | Possibly mis-spelled words in DESCRIPTION: 40 | htmltab (6:14) 41 | preprocesses (11:48) 42 | 43 | Additionally, only on R-devel I got: 44 | 45 | * checking R code for possible problems ... NOTE 46 | htmltab: possible error in check_type(doc = doc, which = which, ...): 47 | ... used in a situation where it does not exist 48 | 49 | ## Downstream dependencies 50 | There are currently no downstream dependencies of this package 51 | -------------------------------------------------------------------------------- /man/check_type.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/setup_and_checks.R 3 | \name{check_type} 4 | \alias{check_type} 5 | \title{Produce the table node} 6 | \usage{ 7 | check_type(doc, which, ...) 8 | } 9 | \arguments{ 10 | \item{doc}{the HTML document which can be a file name or a URL or an already parsed document 11 | (by XML's parsing functions)} 12 | 13 | \item{which}{a vector of length one for identification of the table in the document. Either 14 | a numeric vector for the tables' rank or a character vector that describes an XPath for the table} 15 | 16 | \item{...}{additional arguments passed to htmlParse} 17 | } 18 | \value{ 19 | a table node 20 | } 21 | \description{ 22 | Produce the table node 23 | } 24 | 25 | -------------------------------------------------------------------------------- /man/create_inbody.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/inbody_header.R 3 | \name{create_inbody} 4 | \alias{create_inbody} 5 | \title{Reshape in table header information into wide format} 6 | \usage{ 7 | create_inbody(tab, table.Node, trindex, xpath) 8 | } 9 | \arguments{ 10 | \item{tab}{the table data frame} 11 | 12 | \item{table.Node}{the table node} 13 | 14 | \item{trindex}{the tr index of the inbody rows} 15 | 16 | \item{xpath}{the xpath for the inbody rows} 17 | } 18 | \value{ 19 | the modified R data frame 20 | } 21 | \description{ 22 | Reshape in table header information into wide format 23 | } 24 | 25 | -------------------------------------------------------------------------------- /man/eval_body.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/setup_and_checks.R 3 | \name{eval_body} 4 | \alias{eval_body} 5 | \title{Evaluate and deparse the body argument} 6 | \usage{ 7 | eval_body(arg) 8 | } 9 | \arguments{ 10 | \item{arg}{the body argument} 11 | } 12 | \description{ 13 | Evaluate and deparse the body argument 14 | } 15 | 16 | -------------------------------------------------------------------------------- /man/eval_header.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/setup_and_checks.R 3 | \name{eval_header} 4 | \alias{eval_header} 5 | \title{Evaluate and deparse the header argument} 6 | \usage{ 7 | eval_header(arg) 8 | } 9 | \arguments{ 10 | \item{arg}{the header information} 11 | } 12 | \value{ 13 | evaluated header info 14 | } 15 | \description{ 16 | Evaluate and deparse the header argument 17 | } 18 | 19 | -------------------------------------------------------------------------------- /man/get_body_xpath.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/identify_rows.R 3 | \name{get_body_xpath} 4 | \alias{get_body_xpath} 5 | \title{Return body xpath} 6 | \usage{ 7 | get_body_xpath(body, table.Node) 8 | } 9 | \arguments{ 10 | \item{body}{an information for the body rows} 11 | 12 | \item{table.Node}{the table node} 13 | } 14 | \value{ 15 | a character vector of XPath statements 16 | } 17 | \description{ 18 | Return body xpath 19 | } 20 | 21 | -------------------------------------------------------------------------------- /man/get_cell_element.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/header.R 3 | \name{get_cell_element} 4 | \alias{get_cell_element} 5 | \title{Extracts cells elements} 6 | \usage{ 7 | get_cell_element(cells, tag = "td | th", elFun, rm_escape, rm_whitespace) 8 | } 9 | \arguments{ 10 | \item{cells}{a list of cell nodes} 11 | 12 | \item{tag}{a character vector that provides information used in the XPath expression to extract the correct elements} 13 | 14 | \item{elFun}{a function that is executed over the header/body cell nodes} 15 | 16 | \item{rm_escape}{a character vector that, if specified, is used to replace escape sequences in header and body cells (default value ' ')} 17 | 18 | \item{rm_whitespace}{logical, should leading/trailing whitespace be removed from cell values ( 19 | default value TRUE)?} 20 | } 21 | \value{ 22 | the body element 23 | } 24 | \description{ 25 | Extracts cells elements 26 | } 27 | 28 | -------------------------------------------------------------------------------- /man/get_head_xpath.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/identify_rows.R 3 | \name{get_head_xpath} 4 | \alias{get_head_xpath} 5 | \title{Return header xpath} 6 | \usage{ 7 | get_head_xpath(header, table.Node) 8 | } 9 | \arguments{ 10 | \item{header}{an information for the header rows} 11 | 12 | \item{table.Node}{the table node} 13 | } 14 | \value{ 15 | a character vector of XPath statements 16 | } 17 | \description{ 18 | Return header xpath 19 | } 20 | 21 | -------------------------------------------------------------------------------- /man/get_header_elements.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/header.R 3 | \name{get_header_elements} 4 | \alias{get_header_elements} 5 | \title{Extracts header elements} 6 | \usage{ 7 | get_header_elements(cells, tag = "td | th") 8 | } 9 | \arguments{ 10 | \item{cells}{a list of cell nodes} 11 | 12 | \item{tag}{a character vector that provides information used in the XPath expression to extract the correct elements} 13 | } 14 | \value{ 15 | A list of header information from the cells 16 | } 17 | \description{ 18 | Extracts header elements 19 | } 20 | 21 | -------------------------------------------------------------------------------- /man/get_span.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/header.R 3 | \name{get_span} 4 | \alias{get_span} 5 | \title{Extracts rowspan information} 6 | \usage{ 7 | get_span(cells, span, tag = "td | th") 8 | } 9 | \arguments{ 10 | \item{cells}{a list of cell nodes} 11 | 12 | \item{span}{a character for the span element name} 13 | 14 | \item{tag}{a character vector that provides information used in the XPath expression to extract the correct elements} 15 | } 16 | \value{ 17 | A list of row information from the cells 18 | } 19 | \description{ 20 | Extracts rowspan information 21 | } 22 | 23 | -------------------------------------------------------------------------------- /man/get_trindex.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/identify_rows.R 3 | \name{get_trindex} 4 | \alias{get_trindex} 5 | \title{Return trindex given an XPath} 6 | \usage{ 7 | get_trindex(xpath, table.Node) 8 | } 9 | \arguments{ 10 | \item{xpath}{XPath} 11 | 12 | \item{table.Node}{the table node} 13 | } 14 | \description{ 15 | Return trindex given an XPath 16 | } 17 | 18 | -------------------------------------------------------------------------------- /man/htmltab.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/htmltab.R 3 | \name{htmltab} 4 | \alias{htmltab} 5 | \title{Assemble a data frame from HTML table data} 6 | \usage{ 7 | htmltab(doc, which = NULL, header = NULL, headerFun = function(node) 8 | XML::xmlValue(node), headerSep = " >> ", body = NULL, 9 | bodyFun = function(node) XML::xmlValue(node), complementary = TRUE, 10 | fillNA = NA, rm_superscript = TRUE, rm_escape = " ", 11 | rm_footnotes = TRUE, rm_nodata_cols = TRUE, rm_nodata_rows = TRUE, 12 | rm_invisible = TRUE, rm_whitespace = TRUE, colNames = NULL, ...) 13 | } 14 | \arguments{ 15 | \item{doc}{the HTML document which can be a file name or a URL or an already parsed document 16 | (by XML's parsing functions)} 17 | 18 | \item{which}{a vector of length one for identification of the table in the document. Either a 19 | numeric vector for the tables' rank or a character vector that describes an XPath for the table} 20 | 21 | \item{header}{the header formula, see details for specifics} 22 | 23 | \item{headerFun}{a function that is executed over the header cell nodes} 24 | 25 | \item{headerSep}{a character vector that is used as a seperator in the construction of the table's 26 | variable names (default ' >> ')} 27 | 28 | \item{body}{a vector that specifies which table rows should be used as body information. A numeric 29 | vector can be specified where each element corresponds to a table row. A character vector may be 30 | specified that describes an XPath for the body rows. If left unspecified, htmltab tries to use 31 | semantic information from the HTML code} 32 | 33 | \item{bodyFun}{a function that is executed over the body cell nodes} 34 | 35 | \item{complementary}{logical, should htmltab ensure complementarity of header, inbody header and 36 | body elements (default TRUE)?} 37 | 38 | \item{fillNA}{character vector of symbols that are replaced by NA (default c(''))} 39 | 40 | \item{rm_superscript}{logical, should superscript information be removed from header and body cells 41 | (default TRUE)?} 42 | 43 | \item{rm_escape}{a character vector that, if specified, is used to replace escape sequences in header 44 | and body cells (default ' ')} 45 | 46 | \item{rm_footnotes}{logical, should semantic footer information be removed (default TRUE)?} 47 | 48 | \item{rm_nodata_cols}{logical, should columns that have no alphanumeric data be removed (default TRUE)?} 49 | 50 | \item{rm_nodata_rows}{logical, should rows that have no alphanumeric data be removed (default TRUE)?} 51 | 52 | \item{rm_invisible}{logical, should nodes that are not visible be removed (default TRUE)? This 53 | includes elements with class 'sortkey' and 'display:none' style.} 54 | 55 | \item{rm_whitespace}{logical, should leading/trailing whitespace be removed from cell values (default TRUE)?} 56 | 57 | \item{colNames}{a character vector of column names, or a function that can be used to replace specific 58 | column names (default NULL)} 59 | 60 | \item{...}{additional arguments passed to HTML parsers} 61 | } 62 | \value{ 63 | An R data frame 64 | } 65 | \description{ 66 | Robust and flexible methods for extracting structured information out of HTML tables 67 | } 68 | \details{ 69 | The header formula has the following format: level1 + level2 + level3 + ... . 70 | level1 specifies the main header dimension (column names). This information must 71 | be for rows. level2 and deeper signify header dimensions that appear throughout the body. 72 | Those information muste be for cell elements, not rows. Header information may be 73 | one of the following types: 74 | 75 | \itemize{ 76 | \item the NULL value (default). No information passed, htmltab will try to identify 77 | header elements through heuristics (heuristics only work for the main header) 78 | \item A numeric vector that retrieves rows in the respective position 79 | \item A character string of an XPath expression 80 | \item A function that when evaluated produces a numeric or character vector 81 | \item 0, when the process of finding the main header should be skipped (only works 82 | for main header) 83 | } 84 | } 85 | \examples{ 86 | \dontrun{ 87 | # When no spans are present, htmltab produces output close to XML's readHTMLTable(), 88 | but it removes many types of non-data elements (footnotes, non-visible HTML elements, etc) 89 | 90 | url <- "http://en.wikipedia.org/wiki/World_population" 91 | xp <- "//caption[starts-with(text(),'World historical')]/ancestor::table" 92 | htmltab(doc = url, which = xp) 93 | 94 | popFun <- function(node) { 95 | x <- XML::xmlValue(node) 96 | gsub(',', '', x) 97 | } 98 | 99 | htmltab(doc = url, which = xp, bodyFun = popFun) 100 | 101 | #This table lacks header information. We provide them through colNames. 102 | #We also need to set header = 0 to indicate that no header is present. 103 | doc <- "http://en.wikipedia.org/wiki/FC_Bayern_Munich" 104 | xp2 <- "//td[text() = 'Head coach']/ancestor::table" 105 | htmltab(doc = doc, which = xp2, header = 0, encoding = "UTF-8", colNames = c("name", "role")) 106 | 107 | #htmltab recognizes column spans and produces a one-dimension vector of variable information, 108 | #also removes automatically superscript information since these are usually not of use. 109 | 110 | doc <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers" 111 | xp3 <- "//table[7]" 112 | bFun <- function(node) { 113 | x <- XML::xmlValue(node) 114 | gsub('\%$', '', x) 115 | } 116 | 117 | htmltab(doc = doc, which = xp3, bodyFun = bFun) 118 | 119 | 120 | htmltab("https://en.wikipedia.org/wiki/Arjen_Robben", which = 3, 121 | header = 1:2) 122 | 123 | 124 | #When header information appear throughout the body, you can specify their 125 | #position in the header formula 126 | 127 | htmltab(url, which = "//table[@id='team_gamelogs']", header = . + "//td[./strong]") 128 | } 129 | } 130 | \author{ 131 | Christian Rubba <\url{http://www.christianrubba.com}> 132 | } 133 | \references{ 134 | \url{https://github.com/crubba/htmltab} 135 | } 136 | 137 | -------------------------------------------------------------------------------- /man/identify_elements.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/identify_rows.R 3 | \name{identify_elements} 4 | \alias{identify_elements} 5 | \title{Assemble XPath expressions for header and body} 6 | \usage{ 7 | identify_elements(table.Node, header, body, complementary = T) 8 | } 9 | \arguments{ 10 | \item{table.Node}{the table node} 11 | 12 | \item{header}{a vector that contains information for the identification of the header row(s). A numeric vector can be specified where each element corresponds to the table rows. A character vector may be specified that describes an XPath for the header rows. If left unspecified, htmltable tries to use semantic information from the HTML code} 13 | 14 | \item{body}{a vector that specifies which table rows should be used as body information. A numeric vector can be specified where each element corresponds to a table row. A character vector may be specified that describes an XPath for the body rows. If left unspecified, htmltable tries to use semantic information from the HTML code} 15 | 16 | \item{complementary}{logical, should htmltab ensure complementarity of header, inbody header and 17 | body elements (default TRUE)?} 18 | } 19 | \value{ 20 | a character vector of XPath statements 21 | } 22 | \description{ 23 | Assemble XPath expressions for header and body 24 | } 25 | 26 | -------------------------------------------------------------------------------- /man/normalize_tr.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/setup_and_checks.R 3 | \name{normalize_tr} 4 | \alias{normalize_tr} 5 | \title{Normalizes rows to be nested in tr tags, header in thead, body in tbody and numbers them} 6 | \usage{ 7 | normalize_tr(table.Node) 8 | } 9 | \arguments{ 10 | \item{table.Node}{the table node} 11 | } 12 | \value{ 13 | the revised table node 14 | } 15 | \description{ 16 | Normalizes rows to be nested in tr tags, header in thead, body in tbody and numbers them 17 | } 18 | 19 | -------------------------------------------------------------------------------- /man/num_xpath.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/identify_rows.R 3 | \name{num_xpath} 4 | \alias{num_xpath} 5 | \title{num_xpath: Generate numeric XPath expression} 6 | \usage{ 7 | num_xpath(data) 8 | } 9 | \arguments{ 10 | \item{data}{the header XPath} 11 | } 12 | \description{ 13 | Generate numeric XPath expression 14 | } 15 | 16 | -------------------------------------------------------------------------------- /man/rm_empty_cols.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/setup_and_checks.R 3 | \name{rm_empty_cols} 4 | \alias{rm_empty_cols} 5 | \title{Remove columns which do not have data values} 6 | \usage{ 7 | rm_empty_cols(df, header) 8 | } 9 | \arguments{ 10 | \item{df}{a data frame} 11 | 12 | \item{header}{the header vector} 13 | } 14 | \value{ 15 | a data frame 16 | } 17 | \description{ 18 | Remove columns which do not have data values 19 | } 20 | \seealso{ 21 | \code{\link{rm_nuisance}, \link{rm_empty_rows}} 22 | } 23 | 24 | -------------------------------------------------------------------------------- /man/rm_empty_rows.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/setup_and_checks.R 3 | \name{rm_empty_rows} 4 | \alias{rm_empty_rows} 5 | \title{Remove rows which do not have data values} 6 | \usage{ 7 | rm_empty_rows(df) 8 | } 9 | \arguments{ 10 | \item{df}{a data frame} 11 | } 12 | \value{ 13 | a data frame 14 | } 15 | \description{ 16 | Remove rows which do not have data values 17 | } 18 | \seealso{ 19 | \code{\link{rm_nuisance}, \link{rm_empty_cols}} 20 | } 21 | 22 | -------------------------------------------------------------------------------- /man/rm_nuisance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/setup_and_checks.R 3 | \name{rm_nuisance} 4 | \alias{rm_nuisance} 5 | \title{Remove nuisance elements from the the table code} 6 | \usage{ 7 | rm_nuisance(table.Node, rm_superscript, rm_footnotes, rm_invisible) 8 | } 9 | \arguments{ 10 | \item{table.Node}{the table node} 11 | 12 | \item{rm_superscript}{logical, denotes whether superscript information should be removed from header and body cells (default value TRUE)} 13 | 14 | \item{rm_footnotes}{logical, denotes whether semantic footer information should be removed (default value TRUE)} 15 | 16 | \item{rm_invisible}{logical, should nodes that are not visible (display:none attribute) be removed?} 17 | } 18 | \value{ 19 | The revised table node 20 | } 21 | \description{ 22 | Remove nuisance elements from the the table code 23 | } 24 | \seealso{ 25 | \code{\link{rm_empty_cols}} 26 | } 27 | 28 | -------------------------------------------------------------------------------- /man/select_tab.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/setup_and_checks.R 3 | \name{select_tab} 4 | \alias{select_tab} 5 | \title{Selects the table from the HTML Code} 6 | \usage{ 7 | select_tab(which, Node) 8 | } 9 | \arguments{ 10 | \item{which}{a vector of length one for identification of the table in the document. Either 11 | a numeric vector for the tables' rank or a character vector that describes an XPath for the table} 12 | 13 | \item{Node}{the table node} 14 | 15 | \item{...}{additional arguments passed to htmlParse} 16 | } 17 | \value{ 18 | a table node 19 | } 20 | \description{ 21 | Selects the table from the HTML Code 22 | } 23 | 24 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | test_check("htmltab") 3 | -------------------------------------------------------------------------------- /tests/testthat/test_doc_examples.R: -------------------------------------------------------------------------------- 1 | context("htmltab examples work correctly") 2 | 3 | test_that("Example 1 works", { 4 | 5 | html_world <- ' 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 |
World historical and predicted populations (in millions)[115][116]
Region15001600170017501800185019001950199920082010201220502150
World4585806827919781,2621,6502,5215,9786,7076,8967,0528,9099,746
Africa861141061061071111332217679731,0221,0521,7662,308
Asia2433394365026358099471,4023,6344,0544,1644,2505,2685,561
Europe84111125163203276408547729732738740628517
Latin America[Note 1]39101016243874167511577590603809912
Northern America[Note 1]332272682172307337345351392398
Oceania333222613303437384651
144 | ' 145 | 146 | ex1 <- htmltab(doc = html_world, which = 1) 147 | 148 | expect_that(ex1[1,1], equals("World")) 149 | expect_that(colnames(ex1)[1], equals("Region")) 150 | }) 151 | 152 | 153 | test_that("Example 2 works", { 154 | 155 | html_world <- ' 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 |
World historical and predicted populations (in millions)[115][116]
Region15001600170017501800185019001950199920082010201220502150
World4585806827919781,2621,6502,5215,9786,7076,8967,0528,9099,746
Africa861141061061071111332217679731,0221,0521,7662,308
Asia2433394365026358099471,4023,6344,0544,1644,2505,2685,561
Europe84111125163203276408547729732738740628517
Latin America[Note 1]39101016243874167511577590603809912
Northern America[Note 1]332272682172307337345351392398
Oceania333222613303437384651
294 | ' 295 | 296 | library(XML) 297 | 298 | xp <- "//*[text() = 'World historical and predicted populations (in millions)']/ancestor::table" 299 | popFun <- function(node) { 300 | x <- xmlValue(node) 301 | gsub(',', '', x) 302 | } 303 | ex2 <- htmltab(doc = html_world, which = xp, bodyFun = popFun) 304 | 305 | expect_that(ex2[1,7], equals("1262")) 306 | }) 307 | 308 | 309 | test_that("Example 3 works", { 310 | 311 | html_bayern <- ' 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 |
Spain Pep GuardiolaHead coach
Spain Manel EstiartePersonal assistant
Germany Hermann GerlandAssistant coach
Spain Domènec TorrentAssistant coach
Germany Toni TapalovićGoalkeeping coach
Spain Lorenzo BuenaventuraFitness coach
Germany Andreas KornmayerFitness coach
Germany Thomas WilhelmiFitness coach
Germany Matthias SammerSport director
Spain Carles PlanchartMatch analyst
Germany Lars KornetkaVideo analyst
Germany Michael NiemeyerVideo analyst
Germany Paul BreitnerChief scout
Germany Egon CoordesScout
Germany Wolfgang GrobeScout
Germany Hans-Wilhelm Müller-WohlfahrtChief medic
Germany Lutz HänselTeam doctor
Germany Peter UeblackerTeam doctor
Germany Roland SchmidtCardiologist
Germany Fredi BinderPhysiotherapist
Italy Gianni BianchiPhysiotherapist
Germany Gerry HoffmannPhysiotherapist
Germany Stephan WeickertPhysiotherapist
Germany Helmut ErhardPhysiotherapist
' 409 | 410 | xp2 <- "//td[text() = 'Head coach']/ancestor::table" 411 | ex3 <- htmltab(doc = html_bayern, which = xp2, header = 0, encoding = "UTF-8", colNames = c("name", "role")) 412 | 413 | expect_that(ex3[1,1], equals("Pep Guardiola")) 414 | expect_that(colnames(ex3)[1], equals("name")) 415 | }) 416 | 417 | 418 | test_that("Example 4 works", { 419 | 420 | html_web <- ' 421 | 422 | 423 | 424 | 426 | 427 | 428 | 429 | 430 | 431 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | 483 | 484 | 485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511 | 512 | 513 | 514 | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 | 538 | 539 | 540 | 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 | 558 | 559 | 560 | 561 | 562 | 563 | 564 | 565 | 566 | 567 | 568 | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 | 578 | 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 |
Global desktop and mobile stats combined from StatCounter (Top 5 browsers)
Date
Internet
425 | Explorer
[Note 1]
ChromeFirefoxSafariOperaAndroidMobile
432 | Total
Desktop+LaptopMobileTotalDesktop+LaptopMobileTotal
July 201320.27%36.29%16.60%7.10%4.27%11.37%1.01%2.73%3.74%4.97%17.35%
June 201321.35%35.82%16.79%7.04%3.79%10.83%1.03%2.28%3.31%4.66%16.08%
May 201323.67%35.66%16.87%6.80%3.67%10.55%0.87%2.13%3.00%4.45%14.62%
April 201325.58%33.71%17.27%6.89%3.76%10.65%0.85%2.26%3.11%4.30%13.90%
March 201325.08%32.88%17.86%7.28%3.90%11.18%1.00%2.24%3.24%4.43%14.44%
February 201325.54%31.96%18.28%7.37%3.89%11.26%1.04%2.21%3.25%4.53%14.35%
January 201326.37%31.51%18.39%7.12%3.64%10.76%1.02%2.17%3.19%4.36%14.13%
July 201228.49%30.06%21.01%6.33%2.77%9.10%1.53%2.15%3.68%2.57%11.09%
January 201234.27%25.99%22.68%6.06%1.92%7.98%1.78%2.06%3.84%1.71%8.49%
July 201139.47%20.59%25.99%4.81%1.40%6.21%1.54%1.54%3.08%1.28%7.02%
June 201140.73%19.30%26.49%4.74%1.63%6.53%
' 597 | 598 | bFun <- function(node) { 599 | x <- XML::xmlValue(node) 600 | x <- gsub('%$', '', x) 601 | ifelse(x == '', NA, x) 602 | } 603 | ex4 <- htmltab(doc = html_web, which = 1, bodyFun = bFun) 604 | 605 | expect_that(ex4[1,1], equals("July 2013")) 606 | expect_that(ex4[1,2], equals("20.27")) 607 | expect_that(is.na(ex4[11,5]), is_true()) 608 | }) 609 | -------------------------------------------------------------------------------- /tests/testthat/test_expand_spans.R: -------------------------------------------------------------------------------- 1 | context("Correct expansion of header elements") 2 | 3 | tab1_code <- ' 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
abc
123
' 20 | 21 | test_that("Correctly expanded", { 22 | 23 | tab1 <- XML::htmlParse(tab1_code) 24 | suppressMessages(tab1 <- htmltab(tab1, header = 1:2, body = 3, which = NULL)) 25 | 26 | expect_that(tab1[,1], equals("1")) 27 | expect_that(tab1[,2], equals("2")) 28 | expect_that(tab1[,3], equals("3")) 29 | 30 | expect_that(colnames(tab1)[1], equals("a")) 31 | expect_that(colnames(tab1)[2], equals("b")) 32 | expect_that(colnames(tab1)[3], equals("c")) 33 | }) 34 | 35 | tab2_code <- ' 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 |
abc
123
' 51 | 52 | test_that("Correctly expanded", { 53 | 54 | tab2 <- XML::htmlParse(tab2_code) 55 | suppressMessages(tab2 <- htmltab(tab2, header = 1:2, body = 3)) 56 | 57 | expect_that(tab2[,1], equals("1")) 58 | expect_that(tab2[,2], equals("2")) 59 | expect_that(tab2[,3], equals("3")) 60 | #expect_that(tab2[,4], equals("")) 61 | 62 | expect_that(colnames(tab2)[1], equals("a")) 63 | expect_that(colnames(tab2)[2], equals("b")) 64 | expect_that(colnames(tab2)[3], equals("c")) 65 | #expect_that(colnames(tab2)[4], equals("c")) 66 | }) 67 | 68 | 69 | tab3_code <- ' 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 |
abcd
e
123
' 86 | 87 | test_that("Correctly expanded", { 88 | 89 | tab3 <- XML::htmlParse(tab3_code) 90 | suppressMessages(tab3 <- htmltab(tab3, header = 1:2, body = 3)) 91 | 92 | expect_that(tab3[,1], equals("1")) 93 | expect_that(tab3[,2], equals("2")) 94 | expect_that(tab3[,3], equals("3")) 95 | expect_that(is.na(tab3[,4]), is_true()) 96 | 97 | expect_that(colnames(tab3)[1], equals("a >> e")) 98 | expect_that(colnames(tab3)[2], equals("b >> e")) 99 | expect_that(colnames(tab3)[3], equals("c >> e")) 100 | expect_that(colnames(tab3)[4], equals("d >> e")) 101 | }) 102 | 103 | tab4_code <- ' 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 |
abc
b1b2
1234
' 120 | 121 | test_that("Correctly expanded", { 122 | 123 | tab4 <- XML::htmlParse(tab4_code) 124 | suppressMessages(tab4 <- htmltab(tab4, header = 1:2, body = 3)) 125 | 126 | expect_that(tab4[,1], equals("1")) 127 | expect_that(tab4[,2], equals("2")) 128 | expect_that(tab4[,3], equals("3")) 129 | expect_that(tab4[,4], equals("4")) 130 | 131 | expect_that(colnames(tab4)[1], equals("a")) 132 | expect_that(colnames(tab4)[2], equals("b >> b1")) 133 | expect_that(colnames(tab4)[3], equals("b >> b2")) 134 | expect_that(colnames(tab4)[4], equals("c")) 135 | }) 136 | 137 | tab5_code <- ' 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 |
abc
b1b2
1234
' 154 | 155 | test_that("Correctly expanded when misspecified header", { 156 | 157 | tab5 <- XML::htmlParse(tab5_code) 158 | suppressMessages(tab5 <- htmltab(tab5, header = 1:2, body = 3)) 159 | 160 | expect_that(tab5[,1], equals("1")) 161 | expect_that(tab5[,2], equals("2")) 162 | expect_that(tab5[,3], equals("3")) 163 | expect_that(tab5[,4], equals("c")) # should be 4 164 | 165 | expect_that(colnames(tab5)[1], equals("a")) 166 | expect_that(colnames(tab5)[2], equals("b >> b1")) 167 | expect_that(colnames(tab5)[3], equals("b >> b2")) 168 | expect_that(colnames(tab5)[4], equals("c")) 169 | }) 170 | 171 | 172 | tab6_code <- ' 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 |
abc
b1b2
1234
' 193 | 194 | test_that("H: tr/th.td, B: tbody/tr; misspecified rowspan in H", { 195 | 196 | tab6 <- XML::htmlParse(tab6_code) 197 | suppressMessages(tab6 <- htmltab(tab6, header = 1:2)) 198 | 199 | expect_that(tab6[,1], equals("1")) 200 | expect_that(tab6[,2], equals("2")) 201 | expect_that(tab6[,3], equals("3")) 202 | expect_that(tab6[,4], equals("c")) #should be 4 203 | 204 | expect_that(colnames(tab6)[1], equals("a")) 205 | expect_that(colnames(tab6)[2], equals("b >> b1")) 206 | expect_that(colnames(tab6)[3], equals("b >> b2")) 207 | expect_that(colnames(tab6)[4], equals("c")) 208 | }) 209 | 210 | 211 | stack <- ' 212 | 213 | My First Webpage 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 |
9-1111-1313-1515-17
Monday67Luncha
Tuesday< frees
Wedensdayas5
251 | 252 | 253 | ' 254 | 255 | test_that("http://stackoverflow.com/questions/24215584/html-complex-tables", { 256 | 257 | parsed_stack <- XML::htmlParse(stack) 258 | suppressMessages(stack2 <- htmltab(parsed_stack)) 259 | 260 | expect_that(stack2[,1], equals(c("Monday", "Tuesday", "Wedensday"))) 261 | expect_that(stack2[,2], equals(c("6", "< free", "a"))) 262 | expect_that(stack2[,3], equals(c("7", "< free", "s"))) 263 | expect_that(stack2[,4], equals(c("Lunch", "Lunch", "Lunch"))) 264 | expect_that(stack2[,5], equals(c("a", "s", "5"))) 265 | 266 | expect_that(colnames(stack2), equals(c("V1", "9-11", "11-13", "13-15", "15-17"))) 267 | }) 268 | -------------------------------------------------------------------------------- /tests/testthat/test_find_header.R: -------------------------------------------------------------------------------- 1 | context("Correctly identified header elements") 2 | 3 | tab1_code <- ' 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
abc
123
' 20 | 21 | test_that("Correctly identified header 1", { 22 | 23 | tab1 <- XML::htmlParse(tab1_code) 24 | suppressMessages(tab1 <- htmltab(tab1)) 25 | 26 | expect_that(tab1[,1], equals("1")) 27 | expect_that(tab1[,2], equals("2")) 28 | expect_that(tab1[,3], equals("3")) 29 | 30 | expect_that(colnames(tab1)[1], equals("a")) 31 | expect_that(colnames(tab1)[2], equals("b")) 32 | expect_that(colnames(tab1)[3], equals("c")) 33 | }) 34 | 35 | tab2_code <- ' 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
abc
123
' 56 | 57 | test_that("Correctly identified header 2", { 58 | 59 | tab2 <- XML::htmlParse(tab2_code) 60 | suppressMessages(tab2 <- htmltab(tab2)) 61 | 62 | expect_that(tab2[,1], equals("1")) 63 | expect_that(tab2[,2], equals("2")) 64 | expect_that(tab2[,3], equals("3")) 65 | expect_that(is.na(tab2[,4]), is_true()) 66 | 67 | expect_that(colnames(tab2)[1], equals("a")) 68 | expect_that(colnames(tab2)[2], equals("b")) 69 | expect_that(colnames(tab2)[3], equals("c")) 70 | expect_that(colnames(tab2)[4], equals("c")) 71 | }) 72 | 73 | 74 | tab3_code <- ' 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 |
abcd
e
123
' 94 | 95 | test_that("Correctly identified header 3", { 96 | 97 | tab3 <- XML::htmlParse(tab3_code) 98 | suppressMessages(tab3 <- htmltab(tab3, fillNA = "")) 99 | 100 | expect_that(tab3[,1], equals("1")) 101 | expect_that(tab3[,2], equals("2")) 102 | expect_that(tab3[,3], equals("3")) 103 | expect_that(tab3[,4], equals("")) 104 | 105 | expect_that(colnames(tab3)[1], equals("a >> e")) 106 | expect_that(colnames(tab3)[2], equals("b >> e")) 107 | expect_that(colnames(tab3)[3], equals("c >> e")) 108 | expect_that(colnames(tab3)[4], equals("d >> e")) 109 | }) 110 | 111 | 112 | tab4_code <- ' 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 |
abc
a2b2c2
123
' 130 | 131 | test_that("Correctly identified header 4", { 132 | 133 | tab4 <- XML::htmlParse(tab4_code) 134 | suppressMessages(tab4 <- htmltab(tab4)) 135 | 136 | expect_that(tab4[,1], equals("1")) 137 | expect_that(tab4[,2], equals("2")) 138 | expect_that(tab4[,3], equals("3")) 139 | 140 | expect_that(colnames(tab4)[1], equals("a >> a2")) 141 | expect_that(colnames(tab4)[2], equals("b >> b2")) 142 | expect_that(colnames(tab4)[3], equals("c >> c2")) 143 | }) 144 | 145 | 146 | test_that("Correctly identified header 5", { 147 | 148 | tab5 <- XML::htmlParse(tab4_code) 149 | suppressMessages(tab5 <- htmltab(tab5, header = 1:2)) 150 | 151 | expect_that(tab5[,1], equals("1")) 152 | expect_that(tab5[,2], equals("2")) 153 | expect_that(tab5[,3], equals("3")) 154 | 155 | expect_that(colnames(tab5)[1], equals("a >> a2")) 156 | expect_that(colnames(tab5)[2], equals("b >> b2")) 157 | expect_that(colnames(tab5)[3], equals("c >> c2")) 158 | }) 159 | 160 | test_that("Correctly identified header 6", { 161 | 162 | tab6 <- XML::htmlParse(tab4_code) 163 | suppressMessages(tab6 <- htmltab(tab6, body = 3)) 164 | 165 | expect_that(tab6[,1], equals("1")) 166 | expect_that(tab6[,2], equals("2")) 167 | expect_that(tab6[,3], equals("3")) 168 | 169 | expect_that(colnames(tab6)[1], equals("a >> a2")) 170 | expect_that(colnames(tab6)[2], equals("b >> b2")) 171 | expect_that(colnames(tab6)[3], equals("c >> c2")) 172 | }) 173 | 174 | 175 | tab5_code <- ' 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 |
abc
a2b2c2
123
' 198 | 199 | test_that("Header info in thead and tbody", { 200 | 201 | tab7 <- XML::htmlParse(tab5_code) 202 | suppressMessages(tab7 <- htmltab(tab7, header = 1:2)) 203 | 204 | expect_that(tab7[,1], equals("1")) 205 | expect_that(tab7[,2], equals("2")) 206 | expect_that(tab7[,3], equals("3")) 207 | 208 | expect_that(colnames(tab7)[1], equals("a >> a2")) 209 | expect_that(colnames(tab7)[2], equals("b >> b2")) 210 | expect_that(colnames(tab7)[3], equals("c >> c2")) 211 | }) 212 | 213 | test_that("Correctly identified header 8", { 214 | 215 | tab8 <- XML::htmlParse(tab5_code) 216 | suppressMessages(tab8 <- htmltab(tab8, header = 1:2, body = 3)) 217 | 218 | expect_that(tab8[,1], equals("1")) 219 | expect_that(tab8[,2], equals("2")) 220 | expect_that(tab8[,3], equals("3")) 221 | 222 | expect_that(colnames(tab8)[1], equals("a >> a2")) 223 | expect_that(colnames(tab8)[2], equals("b >> b2")) 224 | expect_that(colnames(tab8)[3], equals("c >> c2")) 225 | }) 226 | -------------------------------------------------------------------------------- /tests/testthat/test_inputs.R: -------------------------------------------------------------------------------- 1 | context("Input checks") 2 | 3 | test_that("Prompts errors correctly", { 4 | expect_error(check_type(doc = "http://http://cran.at.r-project.org/", which = factor(4))) 5 | }) 6 | 7 | test_that("check_type produces class output", { 8 | 9 | x <- check_type(doc = "http://christianrubba.com/htmltab/ex/wiki_indian_election2014.html", which = "//table[5]") 10 | expect_that(x, is_a("XMLInternalDocument")) 11 | 12 | #expect_error(check_type(doc = "http://christianrubba.com/htmltab/ex/wiki_indian_election2014.html", which = 1)) 13 | 14 | parsed <- XML::htmlParse("http://christianrubba.com/htmltab/ex/wiki_indian_election2014.html") 15 | z <- check_type(doc = parsed, which = 3) 16 | expect_that(z, is_a("XMLInternalDocument")) 17 | }) 18 | -------------------------------------------------------------------------------- /tests/testthat/test_multi-dim-header.R: -------------------------------------------------------------------------------- 1 | context("formula interface for header works") 2 | 3 | test_that("multi-dim 1", { 4 | 5 | tab3_code <- ' 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 |
abcd
e
123
Header1
123
11223344
Header2
123
11223344
' 65 | 66 | parsedm3 <- doc <- XML::htmlParse(tab3_code) 67 | 68 | suppressMessages(tab3m1 <- htmltab(doc = parsedm3, header = "//thead/tr" + "//td[@class = 'md']", body = "//tbody/tr[not(./td[@class = 'md'])]")) 69 | suppressMessages(tab3m2 <- htmltab(doc = parsedm3, header = 1:2 + "//td[@class = 'md']", body = "//tbody/tr[not(./td[@class = 'md'])]")) 70 | 71 | expect_that(is.na(tab3m1[1,1]), is_true()) 72 | expect_that(tab3m1[2,1], equals("Header1")) 73 | expect_that(tab3m1[3,1], equals("Header1")) 74 | expect_that(tab3m1[4,1], equals("Header2")) 75 | expect_that(tab3m1[5,1], equals("Header2")) 76 | 77 | expect_that(is.na(tab3m2[1,1]), is_true()) 78 | expect_that(tab3m2[2,1], equals("Header1")) 79 | expect_that(tab3m2[3,1], equals("Header1")) 80 | expect_that(tab3m2[4,1], equals("Header2")) 81 | expect_that(tab3m2[5,1], equals("Header2")) 82 | }) 83 | 84 | 85 | 86 | test_that("multi-dim 2", { 87 | 88 | tab4_code <- ' 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 |
abcd
e
123
MAIN 1
Header1
123
11223344
Header2
123
11223344
MAIN 2
Header1
123
11223344
Header2
123
11223344
' 194 | 195 | parsedm4 <- XML::htmlParse(tab4_code) 196 | 197 | suppressMessages(tab4m1 <- htmltab(doc = parsedm4, 198 | header = "//thead/tr" + "//td[@class = 'md1']" + "//td[@class = 'md0']", 199 | body = "//tbody/tr[not(./td[starts-with(@class, 'md')])]")) 200 | 201 | expect_that(is.na(tab4m1[1,1]), is_true()) 202 | expect_that(tab4m1[2,1], equals("MAIN 1")) 203 | expect_that(tab4m1[6,1], equals("MAIN 2")) 204 | expect_that(tab4m1[2,2], equals("Header1")) 205 | expect_that(tab4m1[4,2], equals("Header2")) 206 | 207 | }) 208 | 209 | -------------------------------------------------------------------------------- /vignettes/.Rapp.history: -------------------------------------------------------------------------------- 1 | set.seed(1)# 2 | # 3 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 4 | set.seed(1)# 5 | # 6 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 7 | set.seed(1)# 8 | # 9 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 10 | set.seed(1)# 11 | # 12 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 13 | set.seed(1)# 14 | # 15 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 16 | set.seed(1)# 17 | # 18 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 19 | set.seed(1)# 20 | # 21 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 22 | set.seed(1)# 23 | # 24 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 25 | set.seed(1)# 26 | # 27 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 28 | set.seed(1)# 29 | # 30 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 31 | set.seed(1)# 32 | # 33 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 34 | set.seed(1)# 35 | # 36 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 37 | set.seed(1)# 38 | # 39 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 40 | set.seed(1)# 41 | # 42 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 43 | set.seed(1)# 44 | # 45 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 46 | set.seed(1)# 47 | # 48 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 49 | set.seed(1)# 50 | # 51 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 52 | set.seed(1)# 53 | # 54 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 55 | set.seed(1)# 56 | # 57 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 58 | set.seed(1)# 59 | # 60 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 61 | set.seed(1)# 62 | # 63 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 64 | set.seed(1)# 65 | # 66 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 67 | set.seed(1)# 68 | # 69 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 70 | set.seed(1)# 71 | # 72 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 73 | set.seed(1)# 74 | # 75 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F) 76 | set.seed(1)# 77 | # 78 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 79 | set.seed(1)# 80 | # 81 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 82 | set.seed(1)# 83 | # 84 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 85 | set.seed(1)# 86 | # 87 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 88 | set.seed(1)# 89 | # 90 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 91 | set.seed(1)# 92 | # 93 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 94 | set.seed(1)# 95 | # 96 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 97 | set.seed(1)# 98 | # 99 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 100 | set.seed(1)# 101 | # 102 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 103 | set.seed(1)# 104 | # 105 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 106 | set.seed(1)# 107 | # 108 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 109 | set.seed(1)# 110 | # 111 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 112 | set.seed(1)# 113 | # 114 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 115 | set.seed(1)# 116 | # 117 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 118 | set.seed(1)# 119 | # 120 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 121 | set.seed(1)# 122 | # 123 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 124 | set.seed(1)# 125 | # 126 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 127 | set.seed(1)# 128 | # 129 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 130 | set.seed(1)# 131 | # 132 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 133 | set.seed(1)# 134 | # 135 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 136 | set.seed(1)# 137 | # 138 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 139 | set.seed(1)# 140 | # 141 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 142 | set.seed(1)# 143 | # 144 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 145 | set.seed(1)# 146 | # 147 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 148 | set.seed(1)# 149 | # 150 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 151 | set.seed(1)# 152 | # 153 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 154 | set.seed(1)# 155 | # 156 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 157 | set.seed(1)# 158 | # 159 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 160 | set.seed(1)# 161 | # 162 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 163 | set.seed(1)# 164 | # 165 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 166 | set.seed(1)# 167 | # 168 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 169 | set.seed(1)# 170 | # 171 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 172 | set.seed(1)# 173 | # 174 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 175 | set.seed(1)# 176 | # 177 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 178 | set.seed(1)# 179 | # 180 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 181 | set.seed(1)# 182 | # 183 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 184 | set.seed(1)# 185 | # 186 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 187 | set.seed(1)# 188 | # 189 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 190 | set.seed(1)# 191 | # 192 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 193 | set.seed(1)# 194 | # 195 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 196 | set.seed(1)# 197 | # 198 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 199 | set.seed(1)# 200 | # 201 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 202 | set.seed(1)# 203 | # 204 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 205 | set.seed(1)# 206 | # 207 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 208 | set.seed(1)# 209 | # 210 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 211 | set.seed(1)# 212 | # 213 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 214 | set.seed(1)# 215 | # 216 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 217 | set.seed(1)# 218 | # 219 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 220 | set.seed(1)# 221 | # 222 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 223 | set.seed(1)# 224 | # 225 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 226 | set.seed(1)# 227 | # 228 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F) 229 | set.seed(1)# 230 | # 231 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 232 | set.seed(1)# 233 | # 234 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 235 | set.seed(1)# 236 | # 237 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 238 | set.seed(1)# 239 | # 240 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 241 | set.seed(1)# 242 | # 243 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 244 | set.seed(1)# 245 | # 246 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 247 | set.seed(1)# 248 | # 249 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 250 | set.seed(1)# 251 | # 252 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 253 | set.seed(1)# 254 | # 255 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 256 | set.seed(1)# 257 | # 258 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 259 | set.seed(1)# 260 | # 261 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 262 | set.seed(1)# 263 | # 264 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 265 | set.seed(1)# 266 | # 267 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 268 | set.seed(1)# 269 | # 270 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 271 | set.seed(1)# 272 | # 273 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 274 | set.seed(1)# 275 | # 276 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 277 | set.seed(1)# 278 | # 279 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 280 | set.seed(1)# 281 | # 282 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 283 | set.seed(1)# 284 | # 285 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 286 | set.seed(1)# 287 | # 288 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 289 | set.seed(1)# 290 | # 291 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 292 | set.seed(1)# 293 | # 294 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 295 | set.seed(1)# 296 | # 297 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 298 | set.seed(1)# 299 | # 300 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 301 | set.seed(1)# 302 | # 303 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 304 | set.seed(1)# 305 | # 306 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 307 | set.seed(1)# 308 | # 309 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 310 | set.seed(1)# 311 | # 312 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F) 313 | set.seed(1)# 314 | # 315 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F) 316 | set.seed(1)# 317 | # 318 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F) 319 | set.seed(1)# 320 | # 321 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F) 322 | set.seed(1)# 323 | # 324 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F) 325 | set.seed(1)# 326 | # 327 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F) 328 | set.seed(1)# 329 | # 330 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F) 331 | set.seed(1)# 332 | # 333 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 334 | set.seed(1)# 335 | # 336 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 337 | set.seed(1)# 338 | # 339 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 340 | set.seed(1)# 341 | # 342 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 343 | set.seed(1)# 344 | # 345 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 346 | set.seed(1)# 347 | # 348 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 349 | set.seed(1)# 350 | # 351 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 352 | set.seed(1)# 353 | # 354 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 355 | set.seed(1)# 356 | # 357 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 358 | set.seed(221214)# 359 | # 360 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 361 | set.seed(221214)# 362 | # 363 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 364 | set.seed(221214)# 365 | # 366 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 367 | set.seed(221214)# 368 | # 369 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 370 | set.seed(221214)# 371 | # 372 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 373 | set.seed(221214)# 374 | # 375 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 376 | set.seed(221214)# 377 | # 378 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 379 | set.seed(221214)# 380 | # 381 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 382 | set.seed(221214)# 383 | # 384 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 385 | set.seed(221214)# 386 | # 387 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 388 | set.seed(221214)# 389 | # 390 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 391 | set.seed(221214)# 392 | # 393 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 394 | set.seed(221214)# 395 | # 396 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 397 | set.seed(221214)# 398 | # 399 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 400 | set.seed(221214)# 401 | # 402 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 403 | set.seed(221214)# 404 | # 405 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 406 | set.seed(221214)# 407 | # 408 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 409 | set.seed(221214)# 410 | # 411 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 412 | set.seed(221214)# 413 | # 414 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 415 | set.seed(221214)# 416 | # 417 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 418 | set.seed(221214)# 419 | # 420 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 421 | set.seed(221214)# 422 | # 423 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 424 | set.seed(221214)# 425 | # 426 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 427 | set.seed(221214)# 428 | # 429 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 430 | set.seed(221214)# 431 | # 432 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 433 | set.seed(221214)# 434 | # 435 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 436 | set.seed(221214)# 437 | # 438 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 439 | set.seed(221214)# 440 | # 441 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 442 | set.seed(221214)# 443 | # 444 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 445 | set.seed(221214)# 446 | # 447 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 448 | set.seed(221214)# 449 | # 450 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 451 | set.seed(221214)# 452 | # 453 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 454 | set.seed(221214)# 455 | # 456 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 457 | set.seed(221214)# 458 | # 459 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 460 | set.seed(221214)# 461 | # 462 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 463 | set.seed(221214)# 464 | # 465 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 466 | set.seed(221214)# 467 | # 468 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 469 | set.seed(221214)# 470 | # 471 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 472 | set.seed(221214)# 473 | # 474 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 475 | set.seed(221214)# 476 | # 477 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 478 | set.seed(221214)# 479 | # 480 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 481 | set.seed(221214)# 482 | # 483 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 484 | set.seed(221214)# 485 | # 486 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 487 | set.seed(221214)# 488 | # 489 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 490 | set.seed(221214)# 491 | # 492 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 493 | set.seed(221214)# 494 | # 495 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 496 | set.seed(221214)# 497 | # 498 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 499 | set.seed(221214)# 500 | # 501 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 502 | set.seed(221214)# 503 | # 504 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 505 | set.seed(221214)# 506 | # 507 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 508 | set.seed(221214)# 509 | # 510 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 511 | set.seed(221214)# 512 | # 513 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 514 | set.seed(221214)# 515 | # 516 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 517 | set.seed(221214)# 518 | # 519 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 520 | set.seed(221214)# 521 | # 522 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 523 | set.seed(221214)# 524 | # 525 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 526 | set.seed(221214)# 527 | # 528 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 529 | set.seed(221214)# 530 | # 531 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 532 | set.seed(221214)# 533 | # 534 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F) 535 | set.seed(221214)# 536 | N <- 100# 537 | # 538 | sample(1:N, 1, replace = F) 539 | set.seed(221214)# 540 | N <- 100# 541 | # 542 | sample(1:N, 1, replace = F) 543 | set.seed(221214)# 544 | N <- 100# 545 | # 546 | sample(1:N, 1, replace = F) 547 | set.seed(221214)# 548 | N <- 100# 549 | # 550 | sample(1:N, 1, replace = F) 551 | set.seed(221214)# 552 | N <- 200# 553 | # 554 | sample(1:N, 1, replace = F) 555 | 200/3 556 | set.seed(221214)# 557 | N <- 1000# 558 | # 559 | sample(1:N, 1, replace = F) 560 | set.seed(221214)# 561 | # 562 | N <- 1000# 563 | x <- rep(NA, N)# 564 | # 565 | for(i 1:N){# 566 | x[i] <- sample(1:N, 1, replace = F)# 567 | }# 568 | # 569 | (1:N) / x 570 | x 571 | set.seed(221214)# 572 | # 573 | N <- 1000# 574 | x <- rep(NA, N)# 575 | # 576 | for(i 1:N){# 577 | x[i] <- sample(1:N, 1, replace = F)# 578 | }# 579 | # 580 | (1:N) / x 581 | set.seed(221214)# 582 | # 583 | N <- 1000# 584 | x <- rep(NA, N)# 585 | # 586 | for(i in 1:N){# 587 | x[i] <- sample(1:N, 1, replace = F)# 588 | }# 589 | # 590 | (1:N) / x 591 | plt(1:N, (1:N) / x) 592 | pl0ot(1:N, (1:N) / x) 593 | plot(1:N, (1:N) / x) 594 | plot(1:N, (1:N) / x) 595 | plot(1:N, x / 1:N) 596 | x / (1:N) 597 | set.seed(221214)# 598 | # 599 | N <- 1000# 600 | x <- rep(NA, N)# 601 | # 602 | for(i in 1:N){# 603 | x[i] <- sample(1:N, 1, replace = F)# 604 | } 605 | x 606 | set.seed(221214)# 607 | # 608 | N <- 1000# 609 | x <- rep(NA, N)# 610 | # 611 | for(i in 1:N){# 612 | x[i] <- sample(1:i, 1, replace = F)# 613 | } 614 | x 615 | N <- 1000# 616 | x <- rep(NA, N)# 617 | # 618 | for(i in 1:N){# 619 | set.seed(221214)# 620 | x[i] <- sample(1:i, 1, replace = F)# 621 | } 622 | x 623 | x / (1:N) 624 | plot(1:N, x / 1:N) 625 | library(htmltable) 626 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]") 627 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]", header = 1:3) 628 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]", header = 1:3) 629 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]", header = 3) 630 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]", header = 3, body = "tr[position() > 3]") 631 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[3]") 632 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[3]", header = 1) 633 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[4]") 634 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[4]", encoding = "UTF-8") 635 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[4]", encoding = "UTF-8", rm_escape = "") 636 | lol <- htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[4]", encoding = "UTF-8", rm_escape = "") 637 | library(dplyr) 638 | lol 639 | install.packages("wikipediatrend") 640 | library(wikipediatrend) 641 | bt_election <- wp_trend( page from lang = "Bundestagswahl", = "2007-01-01", = "de", friendly = T, userAgent = T) 642 | lol <- wp_trend(page = "http://en.wikipedia.org/wiki/Martin_Luther_King,_Jr.", from = "2007-01-01", lang = "en", friendly = T, userAgent = T) 643 | lol 644 | getwd() 645 | bt_election <- wp_trend( page = "Bundestagswahl", # 646 | from = "2009-01-01", # 647 | lang = "de", # 648 | friendly = T,# 649 | userAgent = T)# 650 | bt_election <- bt_election[ order(bt_election$da 651 | bt_election 652 | ?wp_trend 653 | mlk <- wp_trend(page = "Martin_Luther_King,_Jr.", from = "2010-01-01", lang = "en", friendly = T) 654 | mlk 655 | plot(mlk) 656 | plot(ts(mlk)) 657 | plot(mlk, type="h", ylim=c(-1000,40000)) 658 | ?wp_trend 659 | head(mlk) 660 | rm(list=ls()) 661 | setwd("~/Dropbox/htmltable/vignettes/") 662 | library(knitr) 663 | ?knit 664 | knit("index.Rmd") 665 | install.packages("magrittr") 666 | knit("index.Rmd") 667 | install.packages("knitr") 668 | rm(list=ls()) 669 | setwd("~/Dropbox/htmltable/vignettes/") 670 | library(knitr) 671 | knit("index.Rmd") 672 | knit("index.Rmd") 673 | knit("index.Rmd") 674 | knit("index.Rmd") 675 | ?knitr 676 | q() 677 | -------------------------------------------------------------------------------- /vignettes/htmltab.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Hassle-free HTML tables with htmltab" 3 | author: "Christian Rubba" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{htmltab case studies} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\usepackage[utf8]{inputenc} 10 | --- 11 | 12 | HTML tables are a standard way to display tabular information online. Getting HTML table data into R is fairly straightforward with the `readHTMLTable()` function of the *XML* package. But tables on the web are primarily designed for displaying and consuming data, not for analytical purposes. Peculiar design choices for HTML tables are therefore frequently made which tend to produce useless outputs when run through `readHTMLTable()`. I found that sometimes these outputs could be saved with a little bit of (tedious) post-processing, but just as often they could not. To make working with HTML tables easier and less time-consuming, I developed *htmltab*, a package for the R system that tries to alleviate these problems directly in the parsing stage when the structural information is still available. Its main advantages over `readHTMLTable()` are twofold: 13 | 14 | - Consideration of row and column spans in the HTML table body and header cells 15 | - More control over the process that translates HTML cells into R table cells 16 | 17 | This vignette discusses the application of *htmltab* for two use cases where the package provides a significant improvement over `readHTMLTable()`. 18 | 19 | (I make use of the R packages *tidyr* and *stringr* to process table outputs. Neither of the three is required for running *htmltab*.) 20 | 21 | 22 | ## How to read HTML tables with htmltab() 23 | The principal function of *htmltab* is `htmltab()`. The behaviour of `htmltab()` is modeled closely after `readHTMLTable()`, and many argument names are identical. Any function call requires passing a value to its _doc_ argument. This value may be of three kinds: 24 | 25 | 1. a URL or file path for the HTML document where the table lives 26 | 2. a parsed HTML object of the entire page of class _HTMLInternalDocument_ 27 | 3. a table nodeset of class _XMLNodeSet_ 28 | 29 | The last of these methods returns a single R table object. For the first two, `htmltab()` requires users to be specific about the table they would like to have returned. This is done via the _which_ argument. This may be either a numeric value for the table's position in the page, or a character value that describes an XPath statement. 30 | 31 | 32 | ## 1. Corrections for rowspans and colspans by default 33 | In many HTML tables, spans are used to allow cell values to extend across multiple cells. `htmltab()` recognizes spans and expands tables automatically. To illustrate this feature, take a look at the HTML table in the Language section of this [Wikipedia page about Demography in the UK](http://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom#Languages). The header information spans across three consecutive rows. To get the table into R, we have to pass an identifiying information to the _which_ argument. I use an XPath statement that I wrote while exploring the HTML page with Web Developer Tools. One that works is "//th[text() = 'Ability']/ancestor::table": 34 | 35 | 36 | ```{r message=FALSE} 37 | library(htmltab) 38 | 39 | url <- "http://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom" 40 | ukLang <- htmltab(doc = url, which = "//th[text() = 'Ability']/ancestor::table") 41 | head(ukLang) 42 | ``` 43 | 44 | The header information has been recast into a format that respects the hierarchical order of the variables and yet only spans a single line in the R table. If you prefer a different seperator between variables, pass it to the _headerSep_ argument. This format was chosen to make further processing of the table easy. For example, using functionality from the *tidyr* package, the next couple of data cleaning steps may be the following: 45 | 46 | ```{r message=FALSE} 47 | library(tidyr) 48 | 49 | ukLang <- gather(ukLang, key, value, -Ability) 50 | ``` 51 | 52 | This statement restructures the variables in a more useful long format. From this we can separate the variables using an appropriate regular expression such as " >> ". 53 | 54 | 55 | ```{r} 56 | ukLang <- separate(ukLang, key, into = c("region", "language", "statistic"), sep = " >> ") 57 | head(ukLang) 58 | ``` 59 | 60 | `htmltab()` also automatically expands row and column spans when they appear in the table's body. 61 | 62 | ## 2. More control over cell value conversion 63 | `htmltab()` offers you more control over what part of the HTML table is used in the R table. You can exert this control via `htmltab()`'s _body_, _header_, _bodyFun_, _headerFun_, _rm_escape_, _rm_footnote_, _rm_superscript_, _rm_nodata_cols_, _rm_invisible_ and _rm_whitespace_ arguments. 64 | 65 | ### _body_ and _header_ arguments 66 | It is not possible for `htmltab()` to correctly identify header and body elements in all the tables. Although there is a semantically *correct* way to organize header and body elements in HTML tables, web designers do not necessarily need to adhere to them to produce visually appealing tables. *htmltab* employs heuristics for identification but they are no guarantee. If you find that the table is not correctly assembled, you can try to give the function more information through its _header_ and _body_ arguments. These arguments are used to pass information about which rows should be used for the contruction of the header and the body. Both accept numeric values for the rows, but a more robust way is to use an XPath that identifies the respective rows. To illustrate, take a look at this [Wikipedia page about the New Zealand General Election in 2002](http://en.wikipedia.org/wiki/New_Zealand_general_election,_2002#Electorate_results). The table uses cells that span the entire column range to classify General and Maori electorates (yellow background). We need to control for this problem explicitly in the assembling stage. I pass the XPath "//tr[./td[not(@colspan = '10')]]" to the _body_ argument to explicitly discard all rows from the body that have a \ cell with a colspan attribute of 10: 67 | 68 | 69 | ```{r} 70 | url <- "http://en.wikipedia.org/wiki/New_Zealand_general_election,_2002" 71 | xp <- "//caption[starts-with(text(), 'Electorate results')]/ancestor::table" 72 | body_xp <- "//tr[./td[not(@colspan = '10')]]" 73 | 74 | nz1 <- htmltab(doc = url, which = xp, body = body_xp, encoding = "UTF-8") 75 | head(nz1) 76 | ``` 77 | 78 | ### Using table information that intercept body rows 79 | In the previous example, we discarded the two intercepting rows in the body which signified the region of the electorate. You might object that ideally these rows should not be discarded, but used for what they are -- variable/header information! As of version 0.6.0, *htmltab* can process these sort of table designs more efficiently and prepend the information accurately in a new column variable. Information to the _header_ argument can now be passed in the form of X1 + X2 + X3 + ..., where X1 codifies the *main* header (i.e. the one that spans the body grid), and X2, X3, ... signify groups of header information that appear in the body. Please note that the in-body information (X2, X3, ...) must not identify row elements (tr) but individual cells (td or th) from which the value of the new variable can be generated (usually from the node value). To illustrate, consider the following snippet: 80 | 81 | ```{r} 82 | nz2 <- htmltab(doc = url, which = xp, header = 1 + "//tr/td[@colspan = '10']", 83 | body = "//tr[./td[not(@colspan = '10')]]", encoding = "UTF-8") 84 | ``` 85 | 86 | Here, we pass '1' to signify that the *main* header information appear in the first row. We add to this the XPath "//td[@colspan = '10']" that refer to the two rows. Generally, you are free to use numeric information or XPath to refer to the values that are takes as header variable. *htmltab* extracts these information and prepends them to the main table. 87 | 88 | ```{r} 89 | tail(nz2, 9) 90 | ``` 91 | 92 | For more information on this feature take a look at this [blog post](http://r-datacollection.com/blog/htmltab-v0.6.0/) and the Details section of the *htmltab* function in the package documentation. 93 | 94 | ### Removal of unneeded information 95 | Many HTML tables include additional information which are of little interest to data analysts such as information encoded in superscript and footnote tags, as well as escape sequences. By default, `htmltab()` removes information from the first two and replaces all escape sequences by a whitespace. You can change this behaviour through the _rm\_superscript_, _rm\_footnotes_, _rm\_escape_, _rm_nodata_cols_, _rm_invisible_ and _rm_whitespace_ arguments. 96 | 97 | ## Conclusion 98 | HTML tables are a valuable data source but they frequently violate basic principles of data well-formedness. This is usually for good reason since their primary purpose is to increase readability of tabular information. *htmltab*'s goal is to reduce the need for users to interfere when working with HTML tables by relying on available structural information as well as making some assumptions about the table's design. However, you are free to exert more control over the transformation by specifying various function arguments. 99 | --------------------------------------------------------------------------------