├── .Rbuildignore
├── .travis.yml
├── DESCRIPTION
├── LICENSE
├── NAMESPACE
├── NEWS
├── R
    ├── body.R
    ├── colnames.R
    ├── header.R
    ├── htmltab.R
    ├── identify_rows.R
    ├── inbody_header.R
    ├── setup_and_checks.R
    ├── utils.R
    └── zzz.R
├── README.md
├── cran-comments.md
├── man
    ├── check_type.Rd
    ├── create_inbody.Rd
    ├── eval_body.Rd
    ├── eval_header.Rd
    ├── get_body_xpath.Rd
    ├── get_cell_element.Rd
    ├── get_head_xpath.Rd
    ├── get_header_elements.Rd
    ├── get_span.Rd
    ├── get_trindex.Rd
    ├── htmltab.Rd
    ├── identify_elements.Rd
    ├── normalize_tr.Rd
    ├── num_xpath.Rd
    ├── rm_empty_cols.Rd
    ├── rm_empty_rows.Rd
    ├── rm_nuisance.Rd
    └── select_tab.Rd
├── tests
    ├── testthat.R
    └── testthat
    │   ├── test_doc_examples.R
    │   ├── test_expand_spans.R
    │   ├── test_find_header.R
    │   ├── test_inputs.R
    │   └── test_multi-dim-header.R
└── vignettes
    ├── .Rapp.history
    └── htmltab.Rmd


/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | cran-comments.md
4 | .travis.yml
5 | misc/
6 | CHANGES
7 | tests/testthat/test_doc_examples.R
8 | tests/testthat/.test_warnings.R
9 | 


--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | # Sample .travis.yml for R projects from https://github.com/craigcitro/r-travis
2 | 
3 | language: r
4 | warnings_are_errors: false
5 | sudo: required
6 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: htmltab
 2 | Title: Assemble Data Frames from HTML Tables
 3 | Version: 0.7.0
 4 | Authors@R: person("Christian", "Rubba", email = "r@christianrubba.com",
 5 |     role = c("aut", "cre"))
 6 | Description: HTML tables are a valuable data source but extracting and
 7 |     recasting these data into a useful format can be tedious. htmltab is a package
 8 |     for collecting structured information from HTML tables. It is similar to
 9 |     readHTMLTable() of the XML package but provides three major advantages. First,
10 |     the function automatically expands row and column spans in the header and body
11 |     cells. Second, users are given more control over the identification of header
12 |     and body rows which will end up in the R table, including semantic header
13 |     information that appear throughout the body. Third, the function preprocesses
14 |     table code, corrects common types of malformations, removes unneeded parts and
15 |     so helps to alleviate the need for tedious post-processing.
16 | Depends:
17 |     R (>= 3.0.0)
18 | Imports:
19 |     XML (>= 3.98.1.3),
20 |     httr (>= 1.0.0)
21 | License: MIT + file LICENSE
22 | LazyData: true
23 | Suggests:
24 |     testthat,
25 |     knitr,
26 |     tidyr
27 | URL: https://github.com/crubba/htmltab
28 | BugReports: https://github.com/crubba/htmltab/issues
29 | VignetteBuilder: knitr
30 | RoxygenNote: 5.0.0
31 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2014,2015
2 | COPYRIGHT HOLDER: Christian Rubba
3 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 | 
3 | S3method(num_xpath,character)
4 | S3method(num_xpath,default)
5 | S3method(num_xpath,list)
6 | S3method(num_xpath,numeric)
7 | export(htmltab)
8 | export(num_xpath)
9 | 


--------------------------------------------------------------------------------
/NEWS:
--------------------------------------------------------------------------------
 1 | CHANGES IN htmltab VERSION 0.7.0
 2 | 
 3 | 
 4 | NEW FEATURES
 5 | o Added a new argument (rm_nodata_cols) to remove columns that have no apparent data value
 6 | 
 7 | MAJOR CHANGE
 8 | o When htmltab encounters an inner table inside the target table, the inner table is flattened to allow table generation
 9 | 
10 | BUG FIXES
11 | o Single column data frames are not reduced to vectors anymore which used to resulte in an error
12 | o When the last column had misspecified column spans, htmltab previously dicarded an entire column. Now, a check is in place that makes a judgement whether a column should be kept or not
13 | o Fixed a problem with reading html files from the local file system (@earino)
14 | 
15 | 
16 | 
17 | CHANGES IN htmltab VERSION 0.6.0
18 | 
19 | NEW FEATURES
20 | o Added capability to process header information that appear in-table. This is done via a new formula interface to the header argument
21 | o Added new parameter (rm_whitespace) to remove leading and trailing whitespace from cell values
22 | o Added new parameter (rm_identical_cols) to remove columns that are falsely copied when colspan attributes are misused
23 | o Tables are now checked for and cleaned from various types of malformation
24 | 
25 | BUG FIXES
26 | o Fixed a bug that prevented correct creation of multi-row header when a header cell was completely whitespaces
27 | o Fixed a bug where rm_empty_cols did not work properly because of values that were created through column expansion
28 | o Removed unreliable test for documentation examples
29 | o Automatic check for nested tables. htmltab throws an error when the designated table includes a table
30 | 
31 | MINOR CHANGES
32 | o In the header construction, multi-row headers are now correctly ignoring empty values in the final header
33 | o Complementarity checks of header and body rows is now based on a different (and more robust) methodology
34 | 
35 | 
36 | 
37 | CHANGES IN htmltab VERSION 0.5.0
38 | 
39 | NEW FEATURES
40 | o Header and body are now treated as complementary elements of a table, i.e. passing (numeric) information about the position of either of the two will be used for the identification of the other
41 | o Added a new argument (fillNA) to replace non-data cells cells by NA
42 | o Added a new argument (rm_nodata_cols) to remove columns that have no apparent data value
43 | o Added a new argument (rm_invisible) to remove invisible nodes from the table node
44 | 
45 | BUG FIXES
46 | o Fixed a problem where htmltable failed when a table didn't nest a row within tr tags. Now every table is controlled, and tr tags are added when necessary
47 | o Fixed a small problem with misspecified spans in the table header
48 | o Added meaningful error message when table couldn't be identified
49 | o Fixed problem where a header warning was thrown even when colNames was supplied
50 | 
51 | MAJOR CHANGES
52 | o Revised code for header and body identification. When an XPath is passed to either of the two, it must treat the parent table node as the root. This change is backward incompatible
53 | 
54 | 


--------------------------------------------------------------------------------
/R/body.R:
--------------------------------------------------------------------------------
  1 | 
  2 | #' Creates body using span information
  3 | #'
  4 | #' @param vals a list of body cell values
  5 | #' @param colspans a list of body colspans
  6 | #' @param rowspans a list of body rowspans
  7 | #' @param header.names the header name vector
  8 | #' @return a matrix of the assembled body
  9 | #' @noRd
 10 | span_body <- function(vals, colspans, rowspans, header.names) {
 11 | 
 12 |   #Create empty container matrix
 13 |   if(length(header.names) == 0){
 14 |     n.cols <- sum(as.numeric(unlist(colspans[[1L]]))) #get idea from first row column length, should be right usually
 15 |   } else {
 16 |     n.cols <- length(header.names) #use header dim info
 17 |   }
 18 |   n.rows <- length(rowspans) #guess, better make flexible #lapply(rowspans, function(x) x[1]) %>% unlist %>% as.numeric %>% sum
 19 | 
 20 |   mat <- matrix(NA, ncol = n.cols, nrow = n.rows)
 21 |   #row.cont <- vector()
 22 | 
 23 |   col = 1L
 24 | 
 25 |   while(col <= n.cols){ # col start
 26 | 
 27 |     row <- 1L
 28 | 
 29 |     while(row <= n.rows){ # row start
 30 | 
 31 |       cel.val <- vals[[row]][col]
 32 |       col.span.length <- colspans[[row]][col]
 33 |       row.span.length <- rowspans[[row]][col]
 34 | 
 35 |       #This block controls for undefined col- and rowspans (hacky)
 36 |       if(is.na(row.span.length) && row < n.rows){
 37 |         col.span.length <- 1L
 38 |         row.span.length <- 1L
 39 |         cel.val <- mat[row, col -1L]
 40 |       }
 41 | 
 42 |       if(row.span.length < 2L) {
 43 |         mat[row, col] <- cel.val
 44 |       } else {
 45 | 
 46 |         if(row != n.rows){ #Control for situation: specified rows (main), last row demands col/rowspans
 47 |           index <- (row + 1L) : (row + row.span.length - 1L)
 48 | 
 49 |           for(counter in index){
 50 |             vals[[counter]] <- append(vals[[counter]], cel.val, (col - 1L )) #append col val is after
 51 |             rowspans[[counter]] <- append(rowspans[[counter]], 1L, (col - 1L))
 52 |             colspans[[counter]] <- append(colspans[[counter]], 1L, (col - 1L))
 53 |           }
 54 |           rowspans[[row]] <- rowspans[[row]][-col]
 55 |           rowspans[[row]] <- append(rowspans[[row]], 1L, (col-1L))
 56 | 
 57 |           mat[row:(row + row.span.length-1L), col] <- cel.val
 58 |         }
 59 |       }
 60 | 
 61 |       if(col.span.length > 1L){
 62 |         vals[[row]] <- append(vals[[row]], rep(cel.val, (col.span.length - 1L)), col) #append
 63 |         colspans[[row]] <- colspans[[row]][-col]
 64 |         colspans[[row]] <- append(colspans[[row]], rep(1, col.span.length), (col-1L))
 65 |         rowspans[[row]] <- rowspans[[row]][-col]
 66 |         rowspans[[row]] <- append(rowspans[[row]], rep(row.span.length, col.span.length), (col-1L))
 67 |       }
 68 | 
 69 |       row <- row + 1L
 70 |     } #row end
 71 | 
 72 |     col <- col + 1L
 73 |   } #col end
 74 | 
 75 |   return(mat)
 76 | 
 77 | } #function end
 78 | 
 79 | 
 80 | 
 81 | #colspans = body.colspans
 82 | #rowspans = body.rowspans
 83 | 
 84 | #' Expand the body
 85 | #' @noRd
 86 | expand_body <- function(vals, colspans, rowspans){
 87 | 
 88 |   #colspans <- body.colspans
 89 |   #rowspans <- body.rowspans
 90 | 
 91 |   body.table <- list()
 92 |   col <- 1
 93 |   n.row <- length(vals)
 94 | 
 95 |   repeat{
 96 | 
 97 |     #Break when there are no header information or when last column is missspecified
 98 |     if(is_empty(vals[[1]])){break} #all_empty(vals)
 99 | 
100 |     body.row <- vector()
101 | 
102 |     for(row in 1:n.row) {
103 | 
104 |       length.col <- colspans[[row]][1]
105 |       length.row <- rowspans[[row]][1]
106 |       name <- vals[[row]][1]
107 |       #name <- ifelse(name == "", NA, name)
108 |       name <- ifelse(grepl("[[:alnum:][:punct:]]+", name), name, NA)
109 | 
110 |       if(is.na(length.col)) break
111 | 
112 |       #Remove cell info
113 |       colspans[[row]] <- colspans[[row]][-1]
114 |       rowspans[[row]] <- rowspans[[row]][-1]
115 |       vals[[row]] <- vals[[row]][-1]
116 | 
117 |       #Expand along columnes
118 |       colspans[[row]] <- append(colspans[[row]], rep(1, length.col - 1), 0)
119 |       rowspans[[row]] <- append(rowspans[[row]], rep(length.row, length.col - 1), 0)
120 |       vals[[row]] <- append(vals[[row]], rep(name, length.col - 1), 0)
121 | 
122 |       #Expand along rows
123 |       these.rows <- row:(row + (length.row - 1))
124 |       rowspans[these.rows] <- lapply(rowspans[these.rows], append, 1, after = 0) #an erster Stelle
125 |       colspans[these.rows] <- lapply(colspans[these.rows], append, 1, after = 0)
126 |       vals[these.rows] <- lapply(vals[these.rows], append, name, after = 0) #vals[these.rows] <- lapply(vals[these.rows], append, NA, after = 0)
127 | 
128 |       #remove the first colum info
129 |       colspans[[row]] <- colspans[[row]][-1] #check for colspans different
130 |       rowspans[[row]] <- rowspans[[row]][-1] #check for colspans different
131 |       vals[[row]] <- vals[[row]][-1]
132 | 
133 |       #Add cell info to column name vector
134 |       body.row <- c(body.row, name)
135 |     }
136 | 
137 |     #Break if last row is not completely filled up (https://en.wikipedia.org/wiki/Opinion_polling_for_the_42nd_Canadian_federal_election)
138 |     if(col > 1 && length(body.row) < length(body.table[[col - 1]])){
139 |       add.cell.no <- length(body.table[[col - 1]]) - length(body.row)
140 |       add.NAs <- rep(NA, add.cell.no)
141 |       body.row <- c(body.row, add.NAs)
142 |     }
143 | 
144 |     body.table[[col]] <- vector()
145 |     body.table[[col]] <- body.row
146 | 
147 |     col <- col + 1
148 |   }
149 | 
150 |   # Cbind all
151 |   tab <- do.call("cbind", body.table)
152 | 
153 |   return(tab)
154 | }
155 | 
156 | 
157 | #' Extract body cell values
158 | #'
159 | #' @param table.Node the table node
160 | #' @return list of body information
161 | #' @family get_head
162 | #' @noRd
163 | get_cells <- function(table.Node) {
164 | 
165 |   cells <- XML::xpathSApply(table.Node, path = "//tr")
166 | 
167 |   ifstop(is_empty(cells), sprintf("No body generated. Body is empty.
168 |                  Try passing information to the body argument. Body XPath was '%s'.", body), call. = FALSE)
169 | 
170 |   return(cells)
171 | }
172 | 
173 | 


--------------------------------------------------------------------------------
/R/colnames.R:
--------------------------------------------------------------------------------
 1 | #' Produces correct column names
 2 | #'
 3 | #' @param df the generated body
 4 | #' @param header.names the header column names vector
 5 | #' @param colNames either a self-specificed character vector for the column names or a function used on header.names
 6 | #' @param xpath generated header and body xpath
 7 | #' @return a character vector of header column names
 8 | #' @noRd
 9 | make_colnames <- function(df, header.names = NULL, colNames = NULL, header.xpath) {
10 | 
11 |   if(length(header.names) == 0 && is.null(colNames)){
12 |     warning("No header generated. Try passing information to header or colNames", call. = FALSE)
13 |     return(df)
14 |     }
15 | 
16 |   if(length(header.names) > 0 && is.null(colNames)) {
17 |     if(ncol(df) != length(header.names)){
18 |       warning("Header dimension doesn't match body dimension", call. = FALSE)
19 |       colnames(df) <- vector()
20 |     }
21 |     colnames(df)[1:length(header.names)] <- header.names
22 |     return(df)
23 |   }
24 | 
25 |   if(is.character(colNames)) {
26 |     header.names <- colNames
27 |     if(ncol(df) != length(header.names)){
28 |       warning("Header dimension doesn't match body dimension", call. = FALSE)
29 |       colnames(df) <- vector()
30 |     }
31 |     colnames(df)[1:length(header.names)] <- header.names
32 |     return(df)
33 |   }
34 | 
35 |   if(class(colNames) == "function") {
36 |     header.names <- colNames(header.names)
37 |     if(ncol(df) != length(header.names)){
38 |       warning("Header dimension doesn't match body dimension", call. = FALSE)
39 |       colnames(df) <- vector()
40 |     }
41 |     colnames(df)[1:length(header.names)] <- header.names
42 |     return(df)
43 |   }
44 | 
45 | }
46 | 


--------------------------------------------------------------------------------
/R/header.R:
--------------------------------------------------------------------------------
  1 | #' Retrieve table head rows
  2 | #'
  3 | #' @param table.Node the table node
  4 | #' @return the header element
  5 | #' @noRd
  6 | get_head <- function(trindex, table.Node) UseMethod()
  7 | 
  8 | get_head <- function(table.Node, trindex) {
  9 | 
 10 |   bracket <- paste(sprintf("@HTMLTABtrindex = '%s'", trindex), collapse = " or ")
 11 |   xpath <- sprintf("//tr[%s]", bracket)
 12 | 
 13 |   head_MAIN <- XML::xpathSApply(table.Node, path = xpath)
 14 | 
 15 |   return(head_MAIN)
 16 | }
 17 | 
 18 | 
 19 | #' Produce table header
 20 | #' @noRd
 21 | make_header <- function(trindex, table.Node, headerSep, headerFun,
 22 |                            rm_escape, rm_whitespace) UseMethod("make_header")
 23 | 
 24 | make_header.logical <- function(trindex, table.Node, headerSep, headerFun,
 25 |                                 rm_escape, rm_whitespace){
 26 |   return(logical(0))
 27 | }
 28 | 
 29 | make_header.numeric <- function(trindex, table.Node, headerSep, headerFun,
 30 |                              rm_escape,
 31 |                              rm_whitespace){
 32 | 
 33 |   head <- get_head(table.Node = table.Node, trindex = trindex)
 34 | 
 35 |   header.colspans <- get_span(head, span = "colspan", tag = "td | th")
 36 |   header.rowspans <- get_span(head, span = "rowspan", tag = "td | th")
 37 | 
 38 |   header.names <- get_cell_element(head, tag = "td | th",
 39 |                                    elFun = headerFun,
 40 |                                    rm_escape = rm_escape,
 41 |                                    rm_whitespace = rm_whitespace)
 42 | 
 43 |   #Span head
 44 |   header.names <- span_header(header.names,
 45 |                               header.colspans,
 46 |                               header.rowspans,
 47 |                               headerSep = headerSep)
 48 | 
 49 |   noval_col <- which(header.names == "")
 50 | 
 51 |   if(length(noval_col) > 0){
 52 |     header.names[noval_col] <- paste0("V", length(noval_col))
 53 |   }
 54 | 
 55 | 
 56 |   return(header.names)
 57 | }
 58 | 
 59 | 
 60 | #' Creates header using span information
 61 | #'
 62 | #' @param header.names a list of header names
 63 | #' @param header.colspans a list of header colspans
 64 | #' @param header.rowspans a list of header rowspans
 65 | #' @param headerSep a character vector that is used as a seperator in the construction of the table's variable names (default value ' >> ')
 66 | #' @return a vector of header column names
 67 | #' @noRd
 68 | span_header <- function(header.names, header.colspans, header.rowspans, headerSep) {
 69 | 
 70 |   #has no header information
 71 |   if(length(header.names) == 0){
 72 |     header.name.table <- vector()
 73 |     return(header.name.table)
 74 |   }
 75 | 
 76 |   #Remove rows which have all empty cells
 77 |   empty.rows <- which(sapply(header.names, function(x) all(x == "")))
 78 |   if(!is_empty(empty.rows)){
 79 |     header.names <- header.names[-empty.rows]
 80 |     header.colspans <- header.colspans[-empty.rows]
 81 |     header.rowspans <- header.rowspans[-empty.rows]
 82 |   }
 83 | 
 84 |   #return empty header
 85 |   if(length(header.names) == 0){
 86 |     header.name.table <- vector()
 87 |     return(header.name.table)
 88 |   }
 89 | 
 90 |   header.name.table <- expand_header(header.names, header.colspans, header.rowspans)
 91 | 
 92 |   header.name.table <- lapply(header.name.table, function(col) col[!is.na(col)])
 93 | 
 94 |   header.name.table <- unlist(lapply(header.name.table, function(col) paste(col, collapse = headerSep)))
 95 | 
 96 |   return(header.name.table)
 97 | }
 98 | 
 99 | #' Expand the header
100 | #' @noRd
101 | expand_header <- function(vals, colspans, rowspans){
102 | 
103 |   body.table <- list()
104 |   col <- 1
105 |   n.row <- length(vals)
106 | 
107 |   repeat{
108 | 
109 |     # Break when there are no header information or when last column is missspecified
110 |     if(is_empty(vals[[1]])){break} # || length(vals) > 1 && length(vals[[2]]) == 0
111 | 
112 |     body.row <- vector()
113 | 
114 |     for(row in 1:n.row) {
115 | 
116 |       length.col <- colspans[[row]][1]
117 |       length.row <- rowspans[[row]][1]
118 |       name <- vals[[row]][1]
119 |       name <- ifelse(grepl("[[:alnum:][:punct:]]+", name), name, NA)
120 | 
121 |       if(is.na(length.col)) break
122 | 
123 |       #Remove cell info
124 |       colspans[[row]] <- colspans[[row]][-1]
125 |       rowspans[[row]] <- rowspans[[row]][-1]
126 |       vals[[row]] <- vals[[row]][-1]
127 | 
128 |       #Expand along columnes
129 |       colspans[[row]] <- append(colspans[[row]], rep(1, length.col - 1), 0)
130 |       rowspans[[row]] <- append(rowspans[[row]], rep(length.row, length.col - 1), 0)
131 |       vals[[row]] <- append(vals[[row]], rep(name, length.col - 1), 0)
132 | 
133 |       #Expand along rows
134 |       length(vals[[1]])
135 |       these.rows <- row:(row + (length.row - 1))
136 |       rowspans[these.rows] <- lapply(rowspans[these.rows], append, 1, after = 0) #an erster Stelle
137 |       colspans[these.rows] <- lapply(colspans[these.rows], append, 1, after = 0)
138 |       vals[these.rows] <- lapply(vals[these.rows], append, NA, after = 0) #vals[these.rows] <- lapply(vals[these.rows], append, NA, after = 0)
139 | 
140 |       #remove the first colum info
141 |       colspans[[row]] <- colspans[[row]][-1] #check for colspans different
142 |       rowspans[[row]] <- rowspans[[row]][-1] #check for colspans different
143 |       vals[[row]] <- vals[[row]][-1]
144 | 
145 |       #Add cell info to column name vector
146 |       body.row <- c(body.row, name)
147 |     }
148 | 
149 |     #Break if
150 |     if(col > 1 && length(body.row) < length(body.table[[col - 1]])) break
151 | 
152 |     body.table[[col]] <- vector()
153 |     body.table[[col]] <- body.row
154 | 
155 |     col <- col + 1
156 |   }
157 | 
158 |   return(body.table)
159 | }
160 | 
161 | 
162 | #' Extracts cells elements
163 | #'
164 | #' @param cells a list of cell nodes
165 | #' @param tag a character vector that provides information used in the XPath expression to extract the correct elements
166 | #' @param elFun a function that is executed over the header/body cell nodes
167 | #' @param rm_escape a character vector that, if specified, is used to replace escape sequences in header and body cells (default value ' ')
168 | #' @param rm_whitespace logical, should leading/trailing whitespace be removed from cell values (
169 | #'    default value TRUE)?
170 | #' @return the body element
171 | get_cell_element <- function(cells, tag = "td | th", elFun, rm_escape, rm_whitespace) {
172 | 
173 |   cell.element <- lapply(cells, function(tr) {
174 |     XML::xpathSApply(tr, tag, elFun)
175 |   })
176 | 
177 | #   cell.element <-
178 | #
179 | #     lapply(cells, function(tr) {
180 | #     x <- XML::xmlValue(tr, tag, function(x) paste(xmlValue(x), sep = "||"))
181 | #     #sapply(x, xmlValue, recursive = F)
182 | #   })
183 | 
184 |   if(!is.null(rm_escape)) {
185 |     cell.element <- lapply(cell.element, function(el) gsub("([[:alpha:]])-[\b\n\t\r]([[:alpha:]])", "\\1\\2", el))
186 |     cell.element <- lapply(cell.element, function(el) gsub("[\b \n \t \r]", rm_escape, el))
187 |   }
188 | 
189 |   if(isTRUE(rm_whitespace)){
190 |     cell.element <- lapply(cell.element, function(el) rm_str_white(el))
191 |   }
192 |   return(cell.element)
193 | }
194 | 
195 | 
196 | #' Extracts rowspan information
197 | #'
198 | #' @param cells a list of cell nodes
199 | #' @param span a character for the span element name
200 | #' @param tag a character vector that provides information used in the XPath expression to extract the correct elements
201 | #' @return A list of row information from the cells
202 | get_span <- function(cells, span, tag = "td | th"){
203 | 
204 |   span.val <- lapply(cells, function(tr) {
205 |     XML::xpathSApply(tr, tag, function(node) {
206 |       val <- XML::xmlGetAttr(node, span)
207 |       val <- ifelse(is.null(val) || val == "0" || grepl("%", val), 1, val) #Check Firefox for colspan == 0
208 |       val <- as.numeric(val)
209 |       return(val)
210 |     })
211 |   })
212 | 
213 |   return(span.val)
214 | }
215 | 
216 | #' Extracts header elements
217 | #'
218 | #' @param cells a list of cell nodes
219 | #' @param tag a character vector that provides information used in the XPath expression to extract the correct elements
220 | #' @return A list of header information from the cells
221 | get_header_elements <- function(cells, tag = "td | th"){
222 | 
223 |   header_elements <- lapply(cells, function(tr) {
224 |     XML::xpathSApply(tr, tag, function(node) {
225 |       if(XML::xmlName(node) != "sup") {
226 |         value <- XML::xmlValue(node)
227 |       }
228 |       return(value)
229 |     })
230 |   })
231 |   return(header_elements)
232 | }
233 | 


--------------------------------------------------------------------------------
/R/htmltab.R:
--------------------------------------------------------------------------------
  1 | #' Assemble a data frame from HTML table data
  2 | #'
  3 | #' Robust and flexible methods for extracting structured information out of HTML tables
  4 | #'
  5 | #' @export
  6 | #' @param doc the HTML document which can be a file name or a URL or an already parsed document
  7 | #'    (by XML's parsing functions)
  8 | #' @param which a vector of length one for identification of the table in the document. Either a
  9 | #'    numeric vector for the tables' rank or a character vector that describes an XPath for the table
 10 | #' @param header the header formula, see details for specifics
 11 | #' @param headerFun a function that is executed over the header cell nodes
 12 | #' @param headerSep a character vector that is used as a seperator in the construction of the table's
 13 | #'    variable names (default ' >> ')
 14 | #' @param body a vector that specifies which table rows should be used as body information. A numeric
 15 | #'    vector can be specified where each element corresponds to a table row. A character vector may be
 16 | #'    specified that describes an XPath for the body rows. If left unspecified, htmltab tries to use
 17 | #'    semantic information from the HTML code
 18 | #' @param bodyFun a function that is executed over the body cell nodes
 19 | #' @param complementary logical, should htmltab ensure complementarity of header, inbody header and
 20 | #'    body elements (default TRUE)?
 21 | #' @param fillNA character vector of symbols that are replaced by NA (default c(''))
 22 | #' @param rm_superscript logical, should superscript information be removed from header and body cells
 23 | #'    (default TRUE)?
 24 | #' @param rm_footnotes logical, should semantic footer information be removed (default TRUE)?
 25 | #' @param rm_nodata_cols logical, should columns that have no alphanumeric data be removed (default TRUE)?
 26 | #' @param rm_nodata_rows logical, should rows that have no alphanumeric data be removed (default TRUE)?
 27 | #' @param rm_escape a character vector that, if specified, is used to replace escape sequences in header
 28 | #'    and body cells (default ' ')
 29 | #' @param rm_invisible logical, should nodes that are not visible be removed (default TRUE)? This
 30 | #' includes elements with class 'sortkey' and 'display:none' style.
 31 | #' @param rm_whitespace logical, should leading/trailing whitespace be removed from cell values (default TRUE)?
 32 | #' @param colNames a character vector of column names, or a function that can be used to replace specific
 33 | #'    column names (default NULL)
 34 | #' @param ... additional arguments passed to HTML parsers
 35 | #'
 36 | #' @return An R data frame
 37 | #' @details The header formula has the following format: level1 + level2 + level3 + ... .
 38 | #' level1 specifies the main header dimension (column names). This information must
 39 | #' be for rows. level2 and deeper signify header dimensions that appear throughout the body.
 40 | #' Those information muste be for cell elements, not rows. Header information may be
 41 | #' one of the following types:
 42 | #'
 43 | #'\itemize{
 44 | #' \item the NULL value (default). No information passed, htmltab will try to identify
 45 | #' header elements through heuristics (heuristics only work for the main header)
 46 | #' \item A numeric vector that retrieves rows in the respective position
 47 | #' \item A character string of an XPath expression
 48 | #' \item A function that when evaluated produces a numeric or character vector
 49 | #' \item 0, when the process of finding the main header should be skipped (only works
 50 | #' for main header)
 51 | #' }
 52 | #'
 53 | #'
 54 | #' @author Christian Rubba <\url{http://www.christianrubba.com}>
 55 | #' @references \url{https://github.com/crubba/htmltab}
 56 | #' @examples
 57 | #' \dontrun{
 58 | #'# When no spans are present, htmltab produces output close to XML's readHTMLTable(),
 59 | #'but it removes many types of non-data elements (footnotes, non-visible HTML elements, etc)
 60 | #'
 61 | #'  url <- "http://en.wikipedia.org/wiki/World_population"
 62 | #'  xp <- "//caption[starts-with(text(),'World historical')]/ancestor::table"
 63 | #'  htmltab(doc = url, which = xp)
 64 | #'
 65 | #'  popFun <- function(node) {
 66 | #'    x <- XML::xmlValue(node)
 67 | #'    gsub(',', '', x)
 68 | #'  }
 69 | #'
 70 | #'  htmltab(doc = url, which = xp, bodyFun = popFun)
 71 | #'
 72 | #' #This table lacks header information. We provide them through colNames.
 73 | #' #We also need to set header = 0 to indicate that no header is present.
 74 | #' doc <- "http://en.wikipedia.org/wiki/FC_Bayern_Munich"
 75 | #' xp2 <- "//td[text() = 'Head coach']/ancestor::table"
 76 | #' htmltab(doc = doc, which = xp2, header = 0, encoding = "UTF-8", colNames = c("name", "role"))
 77 | #'
 78 | #' #htmltab recognizes column spans and produces a one-dimension vector of variable information,
 79 | #' #also removes automatically superscript information since these are usually not of use.
 80 | #'
 81 | #'  doc <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers"
 82 | #'  xp3 <-  "//table[7]"
 83 | #'  bFun <- function(node) {
 84 | #'    x <- XML::xmlValue(node)
 85 | #'    gsub('%$', '', x)
 86 | #'  }
 87 | #'
 88 | #'  htmltab(doc = doc, which = xp3, bodyFun = bFun)
 89 | #'
 90 | #'
 91 | #' htmltab("https://en.wikipedia.org/wiki/Arjen_Robben", which = 3,
 92 | #' header = 1:2)
 93 | #'
 94 | #'
 95 | #' #When header information appear throughout the body, you can specify their
 96 | #' #position in the header formula
 97 | #'
 98 | #' htmltab(url, which = "//table[@@id='team_gamelogs']", header = . + "//td[./strong]")
 99 | #' }
100 | 
101 | 
102 | htmltab <- function(doc,
103 |                     which = NULL,
104 |                     header = NULL,
105 |                     headerFun = function(node)XML::xmlValue(node),
106 |                     headerSep = " >> ",
107 |                     body = NULL,
108 |                     bodyFun = function(node)XML::xmlValue(node),
109 |                     complementary = TRUE,
110 |                     fillNA = NA,
111 |                     rm_superscript = TRUE,
112 |                     rm_escape = " ",
113 |                     rm_footnotes = TRUE,
114 |                     rm_nodata_cols = TRUE,
115 |                     rm_nodata_rows = TRUE,
116 |                     rm_invisible = TRUE,
117 |                     rm_whitespace = TRUE,
118 |                     colNames = NULL,
119 |                     ...){
120 | 
121 |   # on exit
122 | #   if(isTRUE(develop)){
123 | #     on.exit(print(table.Node))
124 | #   }
125 | 
126 |   # Deparse
127 |   header <- deparse(substitute(header), width.cutoff = 500L)
128 |   body <- deparse(substitute(body), width.cutoff = 500L)
129 |   ev_header <- eval_header(arg = header)
130 |   ev_body <- eval_body(arg = body)
131 | 
132 |   # Check Inputs & Clean Up & Normalize tr --------
133 |   table.Node <- check_type(doc = doc, which = which, ...)
134 |   table.Node <- rm_nuisance(table.Node = table.Node,
135 |                             rm_superscript = rm_superscript,
136 |                             rm_footnotes = rm_footnotes,
137 |                             rm_invisible = rm_invisible)
138 |   table.Node <- normalize_tr(table.Node = table.Node)
139 | 
140 |   #Produce XPath for header and body and add class information
141 |   LL <- identify_elements(table.Node = table.Node,
142 |                           header = ev_header,
143 |                           body = ev_body,
144 |                           complementary = complementary)
145 | 
146 |   # Create Table ---------------------------
147 | 
148 |   #Retrieve Head Elements
149 |   header.names <- make_header(trindex = LL$trindex$header,
150 |                               table.Node = table.Node,
151 |                               headerSep = headerSep,
152 |                               headerFun = headerFun,
153 |                               rm_escape = rm_escape,
154 |                               rm_whitespace = rm_whitespace)
155 | 
156 | 
157 |   # Create Body ---------------------------
158 | 
159 |   #Get Body Cell Nodes
160 |   cells <- get_cells(table.Node = table.Node)
161 |   #row_type <- get_row_type(table.Node = table.Node)
162 | 
163 | 
164 |   #Extract and transform body cell elements
165 |   vals <- get_cell_element(cells, elFun = bodyFun,
166 |                            rm_escape = rm_escape,
167 |                            rm_whitespace = rm_whitespace)
168 | 
169 |   #Produce rowspans and colspans lists from body cell
170 |   body.rowspans <- get_span(cells, span = "rowspan")
171 |   body.colspans <- get_span(cells, span = "colspan")
172 | 
173 |   #Produce table body
174 |   tab <- expand_body(vals, colspans = body.colspans, rowspans = body.rowspans)
175 | 
176 | 
177 |   # Finish ---------------------------
178 | 
179 |   tab <- as.data.frame(tab, stringsAsFactors = F)
180 | 
181 |   #Produce DF
182 |   tab <- make_colnames(df = tab,
183 |                        header.names = header.names,
184 |                        colNames = colNames,
185 |                        header.xpath = LL$xpath$header)
186 | 
187 |   # Inbody header
188 |   tab <- create_inbody(tab = tab, table.Node = table.Node,
189 |                        trindex = LL$trindex$inbody,
190 |                        xpath = LL$xpath$inbody)
191 | 
192 |   #Check if there are no data columns/rows
193 |   if(isTRUE(rm_nodata_cols)){
194 |     tab <- rm_empty_cols(df = tab, header = header.names)
195 |   }
196 | 
197 |   # Subset
198 |   tab <- tab[LL$trindex$body, , drop = F]
199 | 
200 |   if(isTRUE(rm_nodata_rows)){
201 |     tab <- rm_empty_rows(df = tab)
202 |   }
203 | 
204 | 
205 |   # Replace empty vals by NA
206 |   tab[is.na(tab)] <- fillNA
207 | 
208 |   return(tab)
209 | }
210 | 


--------------------------------------------------------------------------------
/R/identify_rows.R:
--------------------------------------------------------------------------------
  1 | #' Generate numeric XPath expression
  2 | #'
  3 | #' @title num_xpath: Generate numeric XPath expression
  4 | #' @param data the header XPath
  5 | #' @export
  6 | num_xpath <- function(data) UseMethod("num_xpath")
  7 | 
  8 | #' @export
  9 | num_xpath.default <- function(data){
 10 |   stop("Unknown input", call. = F)
 11 | }
 12 | 
 13 | #' @export
 14 | num_xpath.character <- function(data){
 15 |   return(data)
 16 | }
 17 | 
 18 | #' @export
 19 | num_xpath.numeric <- function(data){
 20 | 
 21 |   index.count <- data - 1
 22 |   index.xpath <- sapply(1:length(index.count), function(pos) {
 23 |     sprintf("count(preceding::tr) = %s", index.count[pos])
 24 |   })
 25 |   index.xpath <- sprintf("//tr[%s]", paste(index.xpath, collapse = " or "))
 26 | 
 27 |   return(index.xpath)
 28 | }
 29 | 
 30 | #' @export
 31 | num_xpath.list <- function(data){
 32 | 
 33 |   index.xpath <- lapply(data, get("num_xpath", envir = parent.env(environment())))
 34 |   return(index.xpath)
 35 | }
 36 | 
 37 | 
 38 | #' Return trindex given an XPath
 39 | #' @param table.Node the table node
 40 | #' @param xpath XPath
 41 | get_trindex <- function(xpath, table.Node) UseMethod("get_trindex")
 42 | 
 43 | get_trindex.default <- function(xpath, table.Node){
 44 |   stop("Unknow XPath", call. = FALSE)
 45 | }
 46 | 
 47 | get_trindex.NULL <- function(xpath, table.Node){
 48 |   return(logical())
 49 | }
 50 | 
 51 | get_trindex.character <- function(xpath, table.Node){
 52 | 
 53 |   tr.index <- unlist(XML::xpathSApply(table.Node, xpath, XML::xmlGetAttr, "HTMLTABtrindex"))
 54 |   tr.index <- as.numeric(tr.index)
 55 |   return(tr.index)
 56 | }
 57 | 
 58 | get_trindex.list <- function(xpath, table.Node){
 59 | 
 60 |   tr.index <- lapply(seq_along(xpath), function(id) {
 61 |     XML::xpathSApply(table.Node, xpath[[id]], function(x) {
 62 |       val <- XML::xmlGetAttr(XML::xmlParent(x), "HTMLTABtrindex")
 63 |       val <- as.numeric(val)
 64 |       return(val)
 65 |     })
 66 |   })
 67 | 
 68 |   return(tr.index)
 69 | }
 70 | 
 71 | #' Return header xpath
 72 | #'
 73 | #' @param table.Node the table node
 74 | #' @param header an information for the header rows
 75 | #' @return a character vector of XPath statements
 76 | get_head_xpath <- function(header, table.Node) UseMethod("get_head_xpath")
 77 | 
 78 | get_head_xpath.default <- function(header, table.Node) {
 79 |   stop("Unknown header information", .call = F)
 80 | }
 81 | 
 82 | get_head_xpath.character <- function(header, table.Node) {
 83 |   return(header)
 84 | }
 85 | 
 86 | get_head_xpath.numeric <- function(header, table.Node) {
 87 | 
 88 |   if(header[1] == 0){
 89 |     header <- NULL
 90 |   } else{
 91 |     header <- num_xpath(header)
 92 |   }
 93 | 
 94 |   return(header)
 95 | }
 96 | 
 97 | get_head_xpath.NULL <- function(header, table.Node){
 98 | 
 99 |   thead <- has_tag(table.Node, "//thead")
100 |   thead.th <- has_tag(table.Node, "//thead/tr[th]")
101 |   thead.td <- has_tag(table.Node, "//thead/tr[td]")
102 | 
103 |   tr <- has_tag(table.Node, "//tr")
104 |   th <- has_tag(table.Node, "//tr[th and not(./td)]")
105 |   td <- has_tag(table.Node, "//tr[td and not(./th)]")
106 | 
107 | 
108 |   if(thead) {
109 |     header.xpath <- '//thead/tr'
110 |     return(c(header.xpath))
111 |   }
112 | 
113 |   if (!thead && th){
114 |     header.xpath <- "//tr[th and not(./td)]"
115 |     return(c(header.xpath))
116 |   }
117 | 
118 |   if (!thead && !th){
119 |     header.xpath <- "//tr[position() = 1]"
120 |     message("Neither <thead> nor <th> information found. Taking first table row for the header. If incorrect, specifiy header argument.")
121 |     return(c(header.xpath))
122 |   }
123 | }
124 | 
125 | 
126 | #' Return body xpath
127 | #'
128 | #' @param table.Node the table node
129 | #' @param body an information for the body rows
130 | #' @return a character vector of XPath statements
131 | get_body_xpath <- function(body, table.Node) UseMethod("get_body_xpath")
132 | 
133 | get_body_xpath.default <- function(body, table.Node){
134 |   stop("Unknow body information", .call = F)
135 | }
136 | 
137 | get_body_xpath.numeric <- function(body, table.Node){
138 |   body <- num_xpath(body)
139 |   return(body)
140 | }
141 | 
142 | get_body_xpath.character <- function(body, table.Node){
143 |   return(body)
144 | }
145 | 
146 | get_body_xpath.NULL <- function(body, table.Node){
147 |   tbody <- has_tag(table.Node, "//tbody")
148 |   tbody.th <- has_tag(table.Node, "//tbody/tr[th]")
149 |   tbody.td <- has_tag(table.Node, "//tbody/tr[td]")
150 | 
151 |   tr <- has_tag(table.Node, "//tr")
152 |   th <- has_tag(table.Node, "//tr[th and not(./td)]")
153 |   td <- has_tag(table.Node, "//tr[td and not(./th)]")
154 | 
155 |   if(tbody){
156 |     body.xpath <- "//tbody/tr"
157 |     return(c(body.xpath))
158 |   } else {
159 |     body.xpath <- "//tr[./td]"
160 |     return(c(body.xpath))
161 |   }
162 | 
163 | }
164 | 
165 | #' Assemble XPath expressions for header and body
166 | #'
167 | #' @param table.Node the table node
168 | #' @param header a vector that contains information for the identification of the header row(s). A numeric vector can be specified where each element corresponds to the table rows. A character vector may be specified that describes an XPath for the header rows. If left unspecified, htmltable tries to use semantic information from the HTML code
169 | #' @param body a vector that specifies which table rows should be used as body information. A numeric vector can be specified where each element corresponds to a table row. A character vector may be specified that describes an XPath for the body rows. If left unspecified, htmltable tries to use semantic information from the HTML code
170 | #' @param complementary logical, should htmltab ensure complementarity of header, inbody header and
171 | #'    body elements (default TRUE)?
172 | #' @return a character vector of XPath statements
173 | identify_elements <- function(table.Node, header, body, complementary = T){
174 | 
175 |   #header <- ev_header
176 |   #body <- ev_body
177 | 
178 |   header_MAIN <- header[[1]]
179 |   header_INBODY <- header[-1]
180 | 
181 |   # switch compementary
182 |   if(!is.null(header_MAIN[1])){
183 |     if(header_MAIN[1] == 0){complementary <- FALSE}
184 |   }
185 | 
186 |   # Produce XPaths
187 |   header_MAIN.xpath <- get_head_xpath(table.Node = table.Node, header = header_MAIN)
188 |   body.xpath <- get_body_xpath(table.Node = table.Node, body = body)
189 |   header_INBODY.xpath <- num_xpath(header_INBODY)
190 | 
191 |   # Receive HTMLTABtrindex
192 |   header_MAIN.trindex <- get_trindex(header_MAIN.xpath, table.Node)
193 |   header_INBODY.trindex <- get_trindex(header_INBODY.xpath, table.Node)
194 |   body.trindex <- get_trindex(body.xpath, table.Node)
195 | 
196 |   # Check INBODY correct
197 |   ifstop(length(header_INBODY.xpath) > 0 && length(header_INBODY.trindex[[1]]) == 0,
198 |          "Your specified inbody header cells could not be identified.", call. = F)
199 | 
200 |   # Complementarity
201 |   if(isTRUE(complementary)){
202 | 
203 |     if(is.null(body)){
204 |       body.trindex <- setdiff(body.trindex, c(header_MAIN.trindex, unlist(header_INBODY.trindex)))
205 |     }
206 | 
207 |     if(is.null(header_MAIN) && !is.null(body)){
208 |       header_MAIN.trindex <- setdiff(header_MAIN.trindex, c(body.trindex, unlist(header_INBODY.trindex)))
209 |     }
210 | 
211 |   }
212 | 
213 |   trindex <- list(header = header_MAIN.trindex,
214 |              inbody = header_INBODY.trindex,
215 |              body = body.trindex)
216 | 
217 |   xpath <- list(header = header_MAIN.xpath,
218 |                 inbody = header_INBODY.xpath,
219 |                 body = body.xpath)
220 | 
221 |   return(list(trindex = trindex, xpath = xpath))
222 | }
223 | 


--------------------------------------------------------------------------------
/R/inbody_header.R:
--------------------------------------------------------------------------------
 1 | #' Reshape in table header information into wide format
 2 | #' @param tab the table data frame
 3 | #' @param table.Node the table node
 4 | #' @param trindex the tr index of the inbody rows
 5 | #' @param xpath the xpath for the inbody rows
 6 | #' @return the modified R data frame
 7 | create_inbody <- function(tab, table.Node, trindex, xpath){
 8 | 
 9 |   # When no inbody header was specified
10 |   if(length(trindex) == 0){
11 |     return(tab)
12 |   }
13 | 
14 |   # Else ...
15 |   N <- length(trindex)
16 |   N.row <- nrow(tab)
17 |   df <- data.frame(matrix(NA, ncol = N, nrow = N.row))
18 |   colnames(df) <- paste0("Header_", 1:N)
19 | 
20 |   for(i in 1:N){
21 |     val <- XML::xpathSApply(table.Node, xpath[[i]], XML::xmlValue)
22 |     index <- trindex[[i]]
23 |     index <- c(index, (N.row + 1))
24 | 
25 |     df[index[1]:(N.row), (N+1-i)] <- rep(val, diff(index))
26 |   }
27 | 
28 |   # Combine
29 |   tab <- cbind(df, tab)
30 | 
31 |   return(tab)
32 | }
33 | 


--------------------------------------------------------------------------------
/R/setup_and_checks.R:
--------------------------------------------------------------------------------
  1 | #' Produce the table node
  2 | #'
  3 | #' @param doc the HTML document which can be a file name or a URL or an already parsed document
  4 | #'   (by XML's parsing functions)
  5 | #' @param which a vector of length one for identification of the table in the document. Either
  6 | #'    a numeric vector for the tables' rank or a character vector that describes an XPath for the table
  7 | #' @param ... additional arguments passed to htmlParse
  8 | #' @return a table node
  9 | check_type <- function(doc, which, ...) UseMethod("check_type")
 10 | 
 11 | check_type.default <- function(doc, which, ...){
 12 |   stop("doc is of unknown type", call. = FALSE)
 13 | }
 14 | 
 15 | check_type.XMLNodeSet <- function(doc, which, ...){
 16 | 
 17 |   Node <- eval.parent(substitute(XML::xmlParse(XML::saveXML(doc[[1]]), list(...))))
 18 | 
 19 |   return(Node)
 20 | }
 21 | 
 22 | check_type.HTMLInternalDocument <- function(doc, which, ...) {
 23 |   Node <- doc
 24 |   Node <- select_tab(which = which, Node = Node)
 25 | 
 26 |   return(Node)
 27 | }
 28 | 
 29 | check_type.XMLInternalElementNode <- function(doc, which, ...) {
 30 |   Node <- doc
 31 |   Node <- select_tab(which = which, Node = Node)
 32 | 
 33 |   return(Node)
 34 | }
 35 | 
 36 | check_type.character <- function(doc, which, ...){
 37 | 
 38 |   isurl <- is_url(doc)
 39 | 
 40 |   if(isurl) {
 41 |     doc <- httr::GET(doc)
 42 |     doc <- httr::content(doc, "text")
 43 |   } else if (file.exists(doc)) {
 44 |     doc <- readChar(doc, file.info(doc)$size)
 45 |   }
 46 | 
 47 |   Node <- eval.parent(substitute(XML::htmlParse(doc, encoding = "UTF-8", list(...))))
 48 |   Node <- select_tab(which = which, Node = Node)
 49 | 
 50 |   return(Node)
 51 | }
 52 | 
 53 | 
 54 | #' Selects the table from the HTML Code
 55 | #'
 56 | #' @param Node the table node
 57 | #' @param which a vector of length one for identification of the table in the document. Either
 58 | #'    a numeric vector for the tables' rank or a character vector that describes an XPath for the table
 59 | #' @param ... additional arguments passed to htmlParse
 60 | #' @return a table node
 61 | select_tab <- function(which, Node) UseMethod("select_tab")
 62 | 
 63 | select_tab.default <- function(which, Node){
 64 | 
 65 |   message("Argument 'which' was left unspecified. Choosing first table.")
 66 |   Node <- XML::getNodeSet(Node, path = "//table")
 67 | 
 68 |   ifstop(cond = length(Node) == 0,
 69 |          mess = "Couldn't find a table.")
 70 | 
 71 |   Node <- XML::xmlParse(XML::saveXML(Node[[1]]))
 72 |   return(Node)
 73 | }
 74 | 
 75 | select_tab.numeric <- function(which, Node){
 76 | 
 77 |   Node <- XML::getNodeSet(Node, path = "//table")
 78 | 
 79 |   ifstop(cond = length(Node) < which,
 80 |          mess = "Couldn't find the table. Try passing (a different) information to the which argument.")
 81 | 
 82 |   Node <- XML::xmlParse(XML::saveXML(Node[[which]]))
 83 |   return(Node)
 84 | }
 85 | 
 86 | select_tab.character <- function(which, Node){
 87 | 
 88 |   xpath <- paste(which, collapse = " | ")
 89 |   Node <- XML::getNodeSet(Node, path = xpath)
 90 | 
 91 |   ifstop(cond = is.null(Node[[1]]),
 92 |          mess = "Couldn't find the table. Try passing (a different) information to the which argument.")
 93 | 
 94 |   Node <- XML::xmlParse(XML::saveXML(Node[[1]]))
 95 |   return(Node)
 96 | }
 97 | 
 98 | 
 99 | #' Evaluate and deparse the header argument
100 | #' @param arg the header information
101 | #' @return evaluated header info
102 | eval_header <- function(arg){
103 | 
104 |   # Parse header string
105 |   header <- rm_str_white(strsplit(arg, "\\+")[[1]])
106 | 
107 |   those.header <- which(header %in% c("\"\"", "''", "", ".", "NULL"))
108 |   header[those.header] <- "NULL"
109 | 
110 |   # Check that inbody information are complete
111 |   ifstop(cond = any(header[-1] == "NULL"),
112 |          mess = "You need to provide complete information for the inbody rows")
113 | 
114 |   # Evaluate header information
115 |   header <- lapply(header, function(x) eval(parse(text = x)))
116 | 
117 |   return(header)
118 | }
119 | 
120 | #' Evaluate and deparse the body argument
121 | #' @param arg the body argument
122 | eval_body <- function(arg){
123 | 
124 |   body <- rm_str_white(strsplit(arg, "\\+")[[1]])
125 | 
126 |   those.body <- which(body %in% c("\"\"", "''", "", ".", "NULL"))
127 |   body[those.body] <- "NULL"
128 | 
129 |   ifstop(cond = length(body) > 1,
130 |          mess = "Your body information is malformed. You may only provide one piece of information")
131 | 
132 |   # Evaluate header information
133 |   body <- eval(parse(text = body))
134 | 
135 |   return(body)
136 | }
137 | 
138 | 
139 | #' Normalizes rows to be nested in tr tags, header in thead, body in tbody and numbers them
140 | #'
141 | #' @param table.Node the table node
142 | #' @return the revised table node
143 | normalize_tr <- function(table.Node){
144 | 
145 |   wrong_tag <- "//trbody"
146 |   x <- has_tag(table.Node, wrong_tag)
147 | 
148 |   if(isTRUE(x)){
149 |     old.header <- XML::getNodeSet(table.Node, "//trbody/*")
150 |     invisible(new.header <- XML::newXMLNode("tbody"))
151 |     invisible(XML::addChildren(new.header, old.header))
152 |     invisible(XML::replaceNodes(oldNode = XML::getNodeSet(table.Node, "//trbody")[[1]], newNode = new.header))
153 | 
154 |     warning(sprintf("The code for the HTML table you provided contains invalid table tags ('%s'). The following transformations were applied:\n
155 |                     //trbody -> //tbody \n
156 |                     If you specified an XPath that makes a reference to this tag, this may have caused problems with their identification.", wrong_tag), call. = F)
157 |   }
158 | 
159 |   #Every row in tr
160 |   node1 <- XML::getNodeSet(table.Node, "//*[name() = 'td' or name() = 'th' and not(parent::tr)][preceding-sibling::*[1]/self::tr]")
161 | 
162 |   if(length(node1) > 0){
163 | 
164 |     for(i in 1:length(node1)){
165 | 
166 |       node.container <- list()
167 |       node.container[[1]] <- node1[[i]]
168 | 
169 |       gg <- 1
170 |       repeat{
171 |         sibling.node <- XML::getSibling(node.container[[gg]])
172 | 
173 |         if(is.null(sibling.node)) break
174 |         if(!(XML::xmlName(sibling.node) %in% c("td", "th"))) break
175 | 
176 |         node.container[[gg + 1]] <- sibling.node
177 |         gg <- gg + 1
178 |       }
179 | 
180 |       invisible(new.tr <- XML::newXMLNode("tr"))
181 |       invisible(XML::replaceNodes(oldNode = node1[[i]], newNode = new.tr))
182 |       invisible(XML::addChildren(new.tr, node.container))
183 |     }
184 | 
185 |     warning("The code for the HTML table you provided is malformed. Not all cells are nested in row tags (<tr>). htmltab tried to normalize the table and ensure that all cells are within row tags. If you specified an XPath for body or header elements, this may have caused problems with their identification.", call. = F)
186 | 
187 |   }
188 | 
189 |   # Flatten inside table
190 |   nested <- has_tag(table.Node, "/table//table")
191 |   if(nested){
192 |     warning("There is a table inside the target table. htmltab tries to flatten the inner table", call. = F)
193 |     invisible(old.node <- XML::getNodeSet(table.Node,  "/table//table"))
194 |     for(i in 1:length(old.node)){
195 |       invisible(vals <- XML::xmlValue(old.node[[i]]))
196 |       invisible(new.cell <- XML::newXMLNode("td", vals))
197 |       invisible(XML::replaceNodes(oldNode = old.node[[i]], newNode = new.cell))
198 |     }
199 |   }
200 | 
201 |   #Add tr index
202 |   trs <- XML::getNodeSet(table.Node, "//tr")
203 |   n.trs <- length(trs)
204 |   invisible(lapply(1:n.trs, function(index)  XML::xmlAttrs(trs[[index]]) <- c(HTMLTABtrindex = index)))
205 | 
206 |   return(table.Node)
207 |   }
208 | 
209 | 
210 | #' Remove nuisance elements from the the table code
211 | #'
212 | #' @param table.Node the table node
213 | #' @param rm_superscript logical, denotes whether superscript information should be removed from header and body cells (default value TRUE)
214 | #' @param rm_footnotes logical, denotes whether semantic footer information should be removed (default value TRUE)
215 | #' @param rm_invisible logical, should nodes that are not visible (display:none attribute) be removed?
216 | #' @seealso \code{\link{rm_empty_cols}}
217 | #' @return The revised table node
218 | rm_nuisance <- function(table.Node, rm_superscript, rm_footnotes, rm_invisible){
219 | 
220 |   if(isTRUE(rm_superscript)){
221 |     invisible(XML::removeNodes(XML::getNodeSet(table.Node, "//sup")))
222 |   }
223 | 
224 |   if(isTRUE(rm_footnotes)){
225 |     invisible(XML::removeNodes(XML::getNodeSet(table.Node, "//tfoot")))
226 |   }
227 | 
228 |   if(isTRUE(rm_invisible)){
229 |     invisible(XML::removeNodes(XML::getNodeSet(table.Node, "//*[contains(@style, 'display:none') or @class = 'sortkey']")))
230 |   }
231 | 
232 |   # Remove empty rows
233 |   invisible(XML::removeNodes(XML::getNodeSet(table.Node, "//tr[not(./*)]")))
234 | 
235 |   return(table.Node)
236 | }
237 | 
238 | 
239 | #' Remove columns which do not have data values
240 | #'
241 | #' @param df a data frame
242 | #' @param header the header vector
243 | #' @return a data frame
244 | #' @seealso \code{\link{rm_nuisance}, \link{rm_empty_rows}}
245 | rm_empty_cols <- function(df, header){
246 | 
247 |   #This is clumsy but seems to work reasonably well
248 |   #columns are removed when they have:
249 |   #1. No name (V...)
250 |   #2. More than 50% missing values in their column
251 | 
252 | #  no.col.name <- grep('^V[[:digit:]]', colnames(df))
253 | 
254 |   empty.cols <- sapply(df, function(x) {
255 |     gg <- is.na(x)
256 |     gg <- length(base::which(gg, TRUE)) / length(x)
257 |   })
258 | 
259 |   empty.cols <- which(empty.cols > 0.5)
260 |   if(length(empty.cols) > 0) warning(sprintf("Columns [%s] seem to have no data and are removed. Use rm_nodata_cols = F to suppress this behavior", paste(names(empty.cols), collapse = ",")), call. =  F)
261 |   rm.these <- empty.cols #intersect(empty.cols, no.col.name)
262 | 
263 |   if(length(rm.these) > 0) {
264 |     df <- df[, -rm.these, drop = F]
265 |   }
266 | 
267 |   return(df)
268 | }
269 | 
270 | #' Remove rows which do not have data values
271 | #'
272 | #' @param df a data frame
273 | #' @return a data frame
274 | #' @seealso \code{\link{rm_nuisance}, \link{rm_empty_cols}}
275 | rm_empty_rows <- function(df){
276 |   rm.these <- which(rowSums(is.na(df)) == ncol(df))
277 | 
278 |   if(length(rm.these) > 0) {
279 |     df <- df[-rm.these,]
280 |   }
281 | 
282 |   return(df)
283 | }
284 | 
285 | 
286 | 
287 | #
288 | # #' Remove columns which do not have data values
289 | # #'
290 | # #' @param df a data frame
291 | # #' @return a data frame
292 | # #' @seealso \code{\link{rm_nuisance}}
293 | # rm_empty_cols <- function(df){
294 | #
295 | #   #This is clumsy but seems to work reasonably well
296 | #   #columns are removed when they have:
297 | #   #1. No name (V...)
298 | #   #2. More than 50% missing values in their column
299 | #
300 | #   no.col.name <- grep('^V[[:digit:]]', colnames(df))
301 | #
302 | #   empty.cols <- sapply(df, function(col){
303 | #     x <- grepl("[A-Za-z]{1,}", col)
304 | #     x <- length(base::which(x, TRUE)) / length(x)
305 | #   })
306 | #
307 | #   empty.cols <- which(empty.cols < 0.5)
308 | #   rm.these <- intersect(empty.cols, no.col.name)
309 | #
310 | #   if(length(rm.these) > 0) {
311 | #     df <- df[, -rm.these]
312 | #   }
313 | #
314 | #   return(df)
315 | # }
316 | 


--------------------------------------------------------------------------------
/R/utils.R:
--------------------------------------------------------------------------------
 1 | is_empty <- function(x) length(x) == 0
 2 | 
 3 | all_empty <- function(x) {
 4 |   all(sapply(x, length) == 0)
 5 |   }
 6 | 
 7 | equal_length <- function(x) {
 8 |   x.length <- sapply(x, length)
 9 |   length(unique(x.length)) == 1
10 | }
11 | 
12 | rm_str_white <- function(el) gsub("^\\s+|\\s+$", "", el)
13 | 
14 | check_correct <- function(nodeset){
15 |   names <- sapply(1:length(nodeset), function(index) XML::xmlName(nodeset[[index]]))
16 |   if(!all(names == "tr")) stop("You must pass header/body information that identifies row elements (tr)", call. = FALSE)
17 | }
18 | 
19 | #' Wrapper around if stop logic
20 | #' @noRd
21 | ifstop <- function(cond, mess, ...){
22 | 
23 |   cond <- eval(quote(cond))
24 | 
25 |   if(isTRUE(cond)){
26 |     stop(mess, call. = F)
27 |   }
28 | }
29 | 
30 | #' Is str a URL?
31 | #' @noRd
32 | is_url <- function(str){
33 |   grepl("^(http:|https:|www.)", str)
34 | }
35 | 
36 | 
37 | #' Assert a specific tag in an XML node
38 | #'
39 | #' @param table.Node the table node
40 | #' @param tag a character string for the tag name to be matched
41 | #' @return logical value indicating whether tag is present in table code
42 | #' @noRd
43 | 
44 | has_tag <- function(table.Node, tag) {
45 |   x <- XML::xpathSApply(table.Node, tag)
46 |   if(length(x) > 0){TRUE} else{FALSE}
47 | }
48 | 


--------------------------------------------------------------------------------
/R/zzz.R:
--------------------------------------------------------------------------------
 1 | .onAttach <- function(...) {
 2 |   if (!interactive() || stats::runif(1) > 0.1) return()
 3 |   tips <- c(
 4 |     "Due to new features and arguments introduced with v.0.6.0 code can break.\nUse suppressPackageStartupMessages to eliminate package startup messages.",
 5 |     "If you find a bug, please report it to 'https://github.com/crubba/htmltab/issues'\nUse suppressPackageStartupMessages to eliminate package startup messages."
 6 |   )
 7 |   tip <- sample(tips, 1)
 8 |   packageStartupMessage(tip)
 9 | }
10 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## htmltab: Hassle-free HTML tables in R
 2 | HTML tables are a valuable data source but extracting and recasting
 3 | these data into a useful format can be tedious. htmltab is a package for
 4 | extracting structured information from HTML tables. It is similar to
 5 | readHTMLTable() of the XML package but provides two major advantages. First,
 6 | the function automatically expands row and column spans in the header and
 7 | body cells. Second, users are given more control over the identification of
 8 | header and body rows which will end up in the R table. Additionally, the
 9 | function preprocesses table code, removes unneeded parts and so helps to
10 | alleviate the need for tedious post-processing.
11 | 
12 | ## Installation 
13 | The package is available from CRAN and Github. For the stable release version, download from [CRAN](http://cran.r-project.org/web/packages/htmltab/index.html):
14 | 
15 | ```
16 | install.packages("htmltab")
17 | ```
18 | 
19 | For the developer version, download from my GitHub repo. You can install the package directly from inside R:
20 | 
21 | ```
22 | install.packages("devtools")
23 | devtools::install_github("crubba/htmltab")
24 | ```
25 | 
26 | ## Usage
27 | To see *htmltab* in action, take a look at the case studies in the [package vignette](http://cran.r-project.org/web/packages/htmltab/vignettes/htmltab.htm), this [blog post](http://r-datacollection.com/blog/htmltab-Next-version-and-CRAN-release/) or consult the package manual.
28 | 
29 | ## Travis status
30 | ![travis status](https://travis-ci.org/crubba/htmltab.svg)
31 | 
32 | ## Report issues
33 | If you experience problems with *htmltab*, I would like to hear about it to improve the project. Please use [my github repo](https://github.com/crubba/htmltab/issues) to report the issue.
34 | 


--------------------------------------------------------------------------------
/cran-comments.md:
--------------------------------------------------------------------------------
 1 | # Resubmission
 2 | This is a resubmission. In this version I have:
 3 | 
 4 | * Converted the DESCRIPTION title to title case
 5 | * Included call to person() in Authors@R field
 6 | * Confirmed that the package was checked with R-devel through win-builder
 7 | 
 8 | ## Test environments
 9 | * local OS X install, R 3.1.2
10 | * local Ubuntu 12.04, R 3.1.2
11 | * win-builder (devel and release)
12 | 
13 | ## R CMD check results
14 | Local R CMD CHECKs succeeded with no ERRORs, WARNINGs or NOTEs. 
15 | 
16 | I experienced two NOTEs, when using devtools::check(args="--as-cran"), but not when I checked the package as usual:
17 | 
18 | * checking CRAN incoming feasibility ... NOTE
19 |   Maintainer: ‘Christian Rubba <christian.rubba@gmail.com>’
20 |   New submission
21 |   Components with restrictions and base license permitting such:
22 |   MIT + file LICENSE
23 |   File 'LICENSE':
24 |   YEAR: 2014,2015
25 |   COPYRIGHT HOLDER: Christian Rubba
26 |   
27 | * checking package dependencies ... NOTE
28 |   No repository set, so cyclic dependency check skipped
29 |   
30 | I experienced two NOTEs on R CMD CHECK on win-builder:
31 | 
32 | On release I got:
33 | 
34 | *Components with restrictions and base license permitting such:
35 |   MIT + file LICENSE 
36 |   File 'LICENSE':
37 |   YEAR: 2014,2015
38 |   COPYRIGHT HOLDER: Christian Rubba
39 |   Possibly mis-spelled words in DESCRIPTION:
40 |   htmltab (6:14)
41 |   preprocesses (11:48)
42 |   
43 | Additionally, only on R-devel I got:
44 | 
45 | * checking R code for possible problems ... NOTE
46 |   htmltab: possible error in check_type(doc = doc, which = which, ...):
47 |   ... used in a situation where it does not exist
48 | 
49 | ## Downstream dependencies
50 | There are currently no downstream dependencies of this package
51 | 


--------------------------------------------------------------------------------
/man/check_type.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/setup_and_checks.R
 3 | \name{check_type}
 4 | \alias{check_type}
 5 | \title{Produce the table node}
 6 | \usage{
 7 | check_type(doc, which, ...)
 8 | }
 9 | \arguments{
10 | \item{doc}{the HTML document which can be a file name or a URL or an already parsed document
11 | (by XML's parsing functions)}
12 | 
13 | \item{which}{a vector of length one for identification of the table in the document. Either
14 | a numeric vector for the tables' rank or a character vector that describes an XPath for the table}
15 | 
16 | \item{...}{additional arguments passed to htmlParse}
17 | }
18 | \value{
19 | a table node
20 | }
21 | \description{
22 | Produce the table node
23 | }
24 | 
25 | 


--------------------------------------------------------------------------------
/man/create_inbody.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/inbody_header.R
 3 | \name{create_inbody}
 4 | \alias{create_inbody}
 5 | \title{Reshape in table header information into wide format}
 6 | \usage{
 7 | create_inbody(tab, table.Node, trindex, xpath)
 8 | }
 9 | \arguments{
10 | \item{tab}{the table data frame}
11 | 
12 | \item{table.Node}{the table node}
13 | 
14 | \item{trindex}{the tr index of the inbody rows}
15 | 
16 | \item{xpath}{the xpath for the inbody rows}
17 | }
18 | \value{
19 | the modified R data frame
20 | }
21 | \description{
22 | Reshape in table header information into wide format
23 | }
24 | 
25 | 


--------------------------------------------------------------------------------
/man/eval_body.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/setup_and_checks.R
 3 | \name{eval_body}
 4 | \alias{eval_body}
 5 | \title{Evaluate and deparse the body argument}
 6 | \usage{
 7 | eval_body(arg)
 8 | }
 9 | \arguments{
10 | \item{arg}{the body argument}
11 | }
12 | \description{
13 | Evaluate and deparse the body argument
14 | }
15 | 
16 | 


--------------------------------------------------------------------------------
/man/eval_header.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/setup_and_checks.R
 3 | \name{eval_header}
 4 | \alias{eval_header}
 5 | \title{Evaluate and deparse the header argument}
 6 | \usage{
 7 | eval_header(arg)
 8 | }
 9 | \arguments{
10 | \item{arg}{the header information}
11 | }
12 | \value{
13 | evaluated header info
14 | }
15 | \description{
16 | Evaluate and deparse the header argument
17 | }
18 | 
19 | 


--------------------------------------------------------------------------------
/man/get_body_xpath.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/identify_rows.R
 3 | \name{get_body_xpath}
 4 | \alias{get_body_xpath}
 5 | \title{Return body xpath}
 6 | \usage{
 7 | get_body_xpath(body, table.Node)
 8 | }
 9 | \arguments{
10 | \item{body}{an information for the body rows}
11 | 
12 | \item{table.Node}{the table node}
13 | }
14 | \value{
15 | a character vector of XPath statements
16 | }
17 | \description{
18 | Return body xpath
19 | }
20 | 
21 | 


--------------------------------------------------------------------------------
/man/get_cell_element.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/header.R
 3 | \name{get_cell_element}
 4 | \alias{get_cell_element}
 5 | \title{Extracts cells elements}
 6 | \usage{
 7 | get_cell_element(cells, tag = "td | th", elFun, rm_escape, rm_whitespace)
 8 | }
 9 | \arguments{
10 | \item{cells}{a list of cell nodes}
11 | 
12 | \item{tag}{a character vector that provides information used in the XPath expression to extract the correct elements}
13 | 
14 | \item{elFun}{a function that is executed over the header/body cell nodes}
15 | 
16 | \item{rm_escape}{a character vector that, if specified, is used to replace escape sequences in header and body cells (default value ' ')}
17 | 
18 | \item{rm_whitespace}{logical, should leading/trailing whitespace be removed from cell values (
19 | default value TRUE)?}
20 | }
21 | \value{
22 | the body element
23 | }
24 | \description{
25 | Extracts cells elements
26 | }
27 | 
28 | 


--------------------------------------------------------------------------------
/man/get_head_xpath.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/identify_rows.R
 3 | \name{get_head_xpath}
 4 | \alias{get_head_xpath}
 5 | \title{Return header xpath}
 6 | \usage{
 7 | get_head_xpath(header, table.Node)
 8 | }
 9 | \arguments{
10 | \item{header}{an information for the header rows}
11 | 
12 | \item{table.Node}{the table node}
13 | }
14 | \value{
15 | a character vector of XPath statements
16 | }
17 | \description{
18 | Return header xpath
19 | }
20 | 
21 | 


--------------------------------------------------------------------------------
/man/get_header_elements.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/header.R
 3 | \name{get_header_elements}
 4 | \alias{get_header_elements}
 5 | \title{Extracts header elements}
 6 | \usage{
 7 | get_header_elements(cells, tag = "td | th")
 8 | }
 9 | \arguments{
10 | \item{cells}{a list of cell nodes}
11 | 
12 | \item{tag}{a character vector that provides information used in the XPath expression to extract the correct elements}
13 | }
14 | \value{
15 | A list of header information from the cells
16 | }
17 | \description{
18 | Extracts header elements
19 | }
20 | 
21 | 


--------------------------------------------------------------------------------
/man/get_span.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/header.R
 3 | \name{get_span}
 4 | \alias{get_span}
 5 | \title{Extracts rowspan information}
 6 | \usage{
 7 | get_span(cells, span, tag = "td | th")
 8 | }
 9 | \arguments{
10 | \item{cells}{a list of cell nodes}
11 | 
12 | \item{span}{a character for the span element name}
13 | 
14 | \item{tag}{a character vector that provides information used in the XPath expression to extract the correct elements}
15 | }
16 | \value{
17 | A list of row information from the cells
18 | }
19 | \description{
20 | Extracts rowspan information
21 | }
22 | 
23 | 


--------------------------------------------------------------------------------
/man/get_trindex.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/identify_rows.R
 3 | \name{get_trindex}
 4 | \alias{get_trindex}
 5 | \title{Return trindex given an XPath}
 6 | \usage{
 7 | get_trindex(xpath, table.Node)
 8 | }
 9 | \arguments{
10 | \item{xpath}{XPath}
11 | 
12 | \item{table.Node}{the table node}
13 | }
14 | \description{
15 | Return trindex given an XPath
16 | }
17 | 
18 | 


--------------------------------------------------------------------------------
/man/htmltab.Rd:
--------------------------------------------------------------------------------
  1 | % Generated by roxygen2: do not edit by hand
  2 | % Please edit documentation in R/htmltab.R
  3 | \name{htmltab}
  4 | \alias{htmltab}
  5 | \title{Assemble a data frame from HTML table data}
  6 | \usage{
  7 | htmltab(doc, which = NULL, header = NULL, headerFun = function(node)
  8 |   XML::xmlValue(node), headerSep = " >> ", body = NULL,
  9 |   bodyFun = function(node) XML::xmlValue(node), complementary = TRUE,
 10 |   fillNA = NA, rm_superscript = TRUE, rm_escape = " ",
 11 |   rm_footnotes = TRUE, rm_nodata_cols = TRUE, rm_nodata_rows = TRUE,
 12 |   rm_invisible = TRUE, rm_whitespace = TRUE, colNames = NULL, ...)
 13 | }
 14 | \arguments{
 15 | \item{doc}{the HTML document which can be a file name or a URL or an already parsed document
 16 | (by XML's parsing functions)}
 17 | 
 18 | \item{which}{a vector of length one for identification of the table in the document. Either a
 19 | numeric vector for the tables' rank or a character vector that describes an XPath for the table}
 20 | 
 21 | \item{header}{the header formula, see details for specifics}
 22 | 
 23 | \item{headerFun}{a function that is executed over the header cell nodes}
 24 | 
 25 | \item{headerSep}{a character vector that is used as a seperator in the construction of the table's
 26 | variable names (default ' >> ')}
 27 | 
 28 | \item{body}{a vector that specifies which table rows should be used as body information. A numeric
 29 | vector can be specified where each element corresponds to a table row. A character vector may be
 30 | specified that describes an XPath for the body rows. If left unspecified, htmltab tries to use
 31 | semantic information from the HTML code}
 32 | 
 33 | \item{bodyFun}{a function that is executed over the body cell nodes}
 34 | 
 35 | \item{complementary}{logical, should htmltab ensure complementarity of header, inbody header and
 36 | body elements (default TRUE)?}
 37 | 
 38 | \item{fillNA}{character vector of symbols that are replaced by NA (default c(''))}
 39 | 
 40 | \item{rm_superscript}{logical, should superscript information be removed from header and body cells
 41 | (default TRUE)?}
 42 | 
 43 | \item{rm_escape}{a character vector that, if specified, is used to replace escape sequences in header
 44 | and body cells (default ' ')}
 45 | 
 46 | \item{rm_footnotes}{logical, should semantic footer information be removed (default TRUE)?}
 47 | 
 48 | \item{rm_nodata_cols}{logical, should columns that have no alphanumeric data be removed (default TRUE)?}
 49 | 
 50 | \item{rm_nodata_rows}{logical, should rows that have no alphanumeric data be removed (default TRUE)?}
 51 | 
 52 | \item{rm_invisible}{logical, should nodes that are not visible be removed (default TRUE)? This
 53 | includes elements with class 'sortkey' and 'display:none' style.}
 54 | 
 55 | \item{rm_whitespace}{logical, should leading/trailing whitespace be removed from cell values (default TRUE)?}
 56 | 
 57 | \item{colNames}{a character vector of column names, or a function that can be used to replace specific
 58 | column names (default NULL)}
 59 | 
 60 | \item{...}{additional arguments passed to HTML parsers}
 61 | }
 62 | \value{
 63 | An R data frame
 64 | }
 65 | \description{
 66 | Robust and flexible methods for extracting structured information out of HTML tables
 67 | }
 68 | \details{
 69 | The header formula has the following format: level1 + level2 + level3 + ... .
 70 | level1 specifies the main header dimension (column names). This information must
 71 | be for rows. level2 and deeper signify header dimensions that appear throughout the body.
 72 | Those information muste be for cell elements, not rows. Header information may be
 73 | one of the following types:
 74 | 
 75 | \itemize{
 76 | \item the NULL value (default). No information passed, htmltab will try to identify
 77 | header elements through heuristics (heuristics only work for the main header)
 78 | \item A numeric vector that retrieves rows in the respective position
 79 | \item A character string of an XPath expression
 80 | \item A function that when evaluated produces a numeric or character vector
 81 | \item 0, when the process of finding the main header should be skipped (only works
 82 | for main header)
 83 | }
 84 | }
 85 | \examples{
 86 | \dontrun{
 87 | # When no spans are present, htmltab produces output close to XML's readHTMLTable(),
 88 | but it removes many types of non-data elements (footnotes, non-visible HTML elements, etc)
 89 | 
 90 |  url <- "http://en.wikipedia.org/wiki/World_population"
 91 |  xp <- "//caption[starts-with(text(),'World historical')]/ancestor::table"
 92 |  htmltab(doc = url, which = xp)
 93 | 
 94 |  popFun <- function(node) {
 95 |    x <- XML::xmlValue(node)
 96 |    gsub(',', '', x)
 97 |  }
 98 | 
 99 |  htmltab(doc = url, which = xp, bodyFun = popFun)
100 | 
101 | #This table lacks header information. We provide them through colNames.
102 | #We also need to set header = 0 to indicate that no header is present.
103 | doc <- "http://en.wikipedia.org/wiki/FC_Bayern_Munich"
104 | xp2 <- "//td[text() = 'Head coach']/ancestor::table"
105 | htmltab(doc = doc, which = xp2, header = 0, encoding = "UTF-8", colNames = c("name", "role"))
106 | 
107 | #htmltab recognizes column spans and produces a one-dimension vector of variable information,
108 | #also removes automatically superscript information since these are usually not of use.
109 | 
110 |  doc <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers"
111 |  xp3 <-  "//table[7]"
112 |  bFun <- function(node) {
113 |    x <- XML::xmlValue(node)
114 |    gsub('\%$', '', x)
115 |  }
116 | 
117 |  htmltab(doc = doc, which = xp3, bodyFun = bFun)
118 | 
119 | 
120 | htmltab("https://en.wikipedia.org/wiki/Arjen_Robben", which = 3,
121 | header = 1:2)
122 | 
123 | 
124 | #When header information appear throughout the body, you can specify their
125 | #position in the header formula
126 | 
127 | htmltab(url, which = "//table[@id='team_gamelogs']", header = . + "//td[./strong]")
128 | }
129 | }
130 | \author{
131 | Christian Rubba <\url{http://www.christianrubba.com}>
132 | }
133 | \references{
134 | \url{https://github.com/crubba/htmltab}
135 | }
136 | 
137 | 


--------------------------------------------------------------------------------
/man/identify_elements.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/identify_rows.R
 3 | \name{identify_elements}
 4 | \alias{identify_elements}
 5 | \title{Assemble XPath expressions for header and body}
 6 | \usage{
 7 | identify_elements(table.Node, header, body, complementary = T)
 8 | }
 9 | \arguments{
10 | \item{table.Node}{the table node}
11 | 
12 | \item{header}{a vector that contains information for the identification of the header row(s). A numeric vector can be specified where each element corresponds to the table rows. A character vector may be specified that describes an XPath for the header rows. If left unspecified, htmltable tries to use semantic information from the HTML code}
13 | 
14 | \item{body}{a vector that specifies which table rows should be used as body information. A numeric vector can be specified where each element corresponds to a table row. A character vector may be specified that describes an XPath for the body rows. If left unspecified, htmltable tries to use semantic information from the HTML code}
15 | 
16 | \item{complementary}{logical, should htmltab ensure complementarity of header, inbody header and
17 | body elements (default TRUE)?}
18 | }
19 | \value{
20 | a character vector of XPath statements
21 | }
22 | \description{
23 | Assemble XPath expressions for header and body
24 | }
25 | 
26 | 


--------------------------------------------------------------------------------
/man/normalize_tr.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/setup_and_checks.R
 3 | \name{normalize_tr}
 4 | \alias{normalize_tr}
 5 | \title{Normalizes rows to be nested in tr tags, header in thead, body in tbody and numbers them}
 6 | \usage{
 7 | normalize_tr(table.Node)
 8 | }
 9 | \arguments{
10 | \item{table.Node}{the table node}
11 | }
12 | \value{
13 | the revised table node
14 | }
15 | \description{
16 | Normalizes rows to be nested in tr tags, header in thead, body in tbody and numbers them
17 | }
18 | 
19 | 


--------------------------------------------------------------------------------
/man/num_xpath.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/identify_rows.R
 3 | \name{num_xpath}
 4 | \alias{num_xpath}
 5 | \title{num_xpath: Generate numeric XPath expression}
 6 | \usage{
 7 | num_xpath(data)
 8 | }
 9 | \arguments{
10 | \item{data}{the header XPath}
11 | }
12 | \description{
13 | Generate numeric XPath expression
14 | }
15 | 
16 | 


--------------------------------------------------------------------------------
/man/rm_empty_cols.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/setup_and_checks.R
 3 | \name{rm_empty_cols}
 4 | \alias{rm_empty_cols}
 5 | \title{Remove columns which do not have data values}
 6 | \usage{
 7 | rm_empty_cols(df, header)
 8 | }
 9 | \arguments{
10 | \item{df}{a data frame}
11 | 
12 | \item{header}{the header vector}
13 | }
14 | \value{
15 | a data frame
16 | }
17 | \description{
18 | Remove columns which do not have data values
19 | }
20 | \seealso{
21 | \code{\link{rm_nuisance}, \link{rm_empty_rows}}
22 | }
23 | 
24 | 


--------------------------------------------------------------------------------
/man/rm_empty_rows.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/setup_and_checks.R
 3 | \name{rm_empty_rows}
 4 | \alias{rm_empty_rows}
 5 | \title{Remove rows which do not have data values}
 6 | \usage{
 7 | rm_empty_rows(df)
 8 | }
 9 | \arguments{
10 | \item{df}{a data frame}
11 | }
12 | \value{
13 | a data frame
14 | }
15 | \description{
16 | Remove rows which do not have data values
17 | }
18 | \seealso{
19 | \code{\link{rm_nuisance}, \link{rm_empty_cols}}
20 | }
21 | 
22 | 


--------------------------------------------------------------------------------
/man/rm_nuisance.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/setup_and_checks.R
 3 | \name{rm_nuisance}
 4 | \alias{rm_nuisance}
 5 | \title{Remove nuisance elements from the the table code}
 6 | \usage{
 7 | rm_nuisance(table.Node, rm_superscript, rm_footnotes, rm_invisible)
 8 | }
 9 | \arguments{
10 | \item{table.Node}{the table node}
11 | 
12 | \item{rm_superscript}{logical, denotes whether superscript information should be removed from header and body cells (default value TRUE)}
13 | 
14 | \item{rm_footnotes}{logical, denotes whether semantic footer information should be removed (default value TRUE)}
15 | 
16 | \item{rm_invisible}{logical, should nodes that are not visible (display:none attribute) be removed?}
17 | }
18 | \value{
19 | The revised table node
20 | }
21 | \description{
22 | Remove nuisance elements from the the table code
23 | }
24 | \seealso{
25 | \code{\link{rm_empty_cols}}
26 | }
27 | 
28 | 


--------------------------------------------------------------------------------
/man/select_tab.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/setup_and_checks.R
 3 | \name{select_tab}
 4 | \alias{select_tab}
 5 | \title{Selects the table from the HTML Code}
 6 | \usage{
 7 | select_tab(which, Node)
 8 | }
 9 | \arguments{
10 | \item{which}{a vector of length one for identification of the table in the document. Either
11 | a numeric vector for the tables' rank or a character vector that describes an XPath for the table}
12 | 
13 | \item{Node}{the table node}
14 | 
15 | \item{...}{additional arguments passed to htmlParse}
16 | }
17 | \value{
18 | a table node
19 | }
20 | \description{
21 | Selects the table from the HTML Code
22 | }
23 | 
24 | 


--------------------------------------------------------------------------------
/tests/testthat.R:
--------------------------------------------------------------------------------
1 | library(testthat)
2 | test_check("htmltab")
3 | 


--------------------------------------------------------------------------------
/tests/testthat/test_doc_examples.R:
--------------------------------------------------------------------------------
  1 | context("htmltab examples work correctly")
  2 | 
  3 | test_that("Example 1 works", {
  4 | 
  5 |   html_world <- '<table class="wikitable sortable" style="font-size:97%; text-align:right;">
  6 | <caption>World historical and predicted populations (in millions)<sup id="cite_ref-Vallin_117-0" class="reference"><a href="#cite_note-Vallin-117"><span>[</span>115<span>]</span></a></sup><sup id="cite_ref-un2004_118-0" class="reference"><a href="#cite_note-un2004-118"><span>[</span>116<span>]</span></a></sup></caption>
  7 |   <tr>
  8 |     <th>Region</th>
  9 |     <th>1500</th>
 10 |     <th>1600</th>
 11 |     <th>1700</th>
 12 |     <th>1750</th>
 13 |     <th>1800</th>
 14 |     <th>1850</th>
 15 |     <th>1900</th>
 16 |     <th>1950</th>
 17 |     <th>1999</th>
 18 |     <th>2008</th>
 19 |     <th>2010</th>
 20 |     <th>2012</th>
 21 |     <th>2050</th>
 22 |     <th>2150</th>
 23 |     </tr>
 24 |     <tr>
 25 |     <th>World</th>
 26 |     <td>458</td>
 27 |     <td>580</td>
 28 |     <td>682</td>
 29 |     <td>791</td>
 30 |     <td>978</td>
 31 |     <td>1,262</td>
 32 |     <td>1,650</td>
 33 |     <td>2,521</td>
 34 |     <td>5,978</td>
 35 |     <td>6,707</td>
 36 |     <td>6,896</td>
 37 |     <td>7,052</td>
 38 |     <td>8,909</td>
 39 |     <td>9,746</td>
 40 |     </tr>
 41 |     <tr>
 42 |     <th>Africa</th>
 43 |     <td>86</td>
 44 |     <td>114</td>
 45 |     <td>106</td>
 46 |     <td>106</td>
 47 |     <td>107</td>
 48 |     <td>111</td>
 49 |     <td>133</td>
 50 |     <td>221</td>
 51 |     <td>767</td>
 52 |     <td>973</td>
 53 |     <td>1,022</td>
 54 |     <td>1,052</td>
 55 |     <td>1,766</td>
 56 |     <td>2,308</td>
 57 |     </tr>
 58 |     <tr>
 59 |     <th>Asia</th>
 60 |     <td>243</td>
 61 |     <td>339</td>
 62 |     <td>436</td>
 63 |     <td>502</td>
 64 |     <td>635</td>
 65 |     <td>809</td>
 66 |     <td>947</td>
 67 |     <td>1,402</td>
 68 |     <td>3,634</td>
 69 |     <td>4,054</td>
 70 |     <td>4,164</td>
 71 |     <td>4,250</td>
 72 |     <td>5,268</td>
 73 |     <td>5,561</td>
 74 |     </tr>
 75 |     <tr>
 76 |     <th>Europe</th>
 77 |     <td>84</td>
 78 |     <td>111</td>
 79 |     <td>125</td>
 80 |     <td>163</td>
 81 |     <td>203</td>
 82 |     <td>276</td>
 83 |     <td>408</td>
 84 |     <td>547</td>
 85 |     <td>729</td>
 86 |     <td>732</td>
 87 |     <td>738</td>
 88 |     <td>740</td>
 89 |     <td>628</td>
 90 |     <td>517</td>
 91 |     </tr>
 92 |     <tr>
 93 |     <th>Latin America<sup id="cite_ref-Americas_119-0" class="reference"><a href="#cite_note-Americas-119"><span>[</span>Note 1<span>]</span></a></sup></th>
 94 |     <td>39</td>
 95 |     <td>10</td>
 96 |     <td>10</td>
 97 |     <td>16</td>
 98 |     <td>24</td>
 99 |     <td>38</td>
100 |     <td>74</td>
101 |     <td>167</td>
102 |     <td>511</td>
103 |     <td>577</td>
104 |     <td>590</td>
105 |     <td>603</td>
106 |     <td>809</td>
107 |     <td>912</td>
108 |     </tr>
109 |     <tr>
110 |     <th>Northern America<sup id="cite_ref-Americas_119-1" class="reference"><a href="#cite_note-Americas-119"><span>[</span>Note 1<span>]</span></a></sup></th>
111 |     <td>3</td>
112 |     <td>3</td>
113 |     <td>2</td>
114 |     <td>2</td>
115 |     <td>7</td>
116 |     <td>26</td>
117 |     <td>82</td>
118 |     <td>172</td>
119 |     <td>307</td>
120 |     <td>337</td>
121 |     <td>345</td>
122 |     <td>351</td>
123 |     <td>392</td>
124 |     <td>398</td>
125 |     </tr>
126 |     <tr>
127 |     <th>Oceania</th>
128 |     <td>3</td>
129 |     <td>3</td>
130 |     <td>3</td>
131 |     <td>2</td>
132 |     <td>2</td>
133 |     <td>2</td>
134 |     <td>6</td>
135 |     <td>13</td>
136 |     <td>30</td>
137 |     <td>34</td>
138 |     <td>37</td>
139 |     <td>38</td>
140 |     <td>46</td>
141 |     <td>51</td>
142 |     </tr>
143 |     </table>
144 |     '
145 | 
146 |   ex1 <- htmltab(doc = html_world, which = 1)
147 | 
148 |   expect_that(ex1[1,1], equals("World"))
149 |   expect_that(colnames(ex1)[1], equals("Region"))
150 | })
151 | 
152 | 
153 | test_that("Example 2 works", {
154 | 
155 |   html_world <- '<table class="wikitable sortable" style="font-size:97%; text-align:right;">
156 |   <caption>World historical and predicted populations (in millions)<sup id="cite_ref-Vallin_117-0" class="reference"><a href="#cite_note-Vallin-117"><span>[</span>115<span>]</span></a></sup><sup id="cite_ref-un2004_118-0" class="reference"><a href="#cite_note-un2004-118"><span>[</span>116<span>]</span></a></sup></caption>
157 |   <tr>
158 |   <th>Region</th>
159 |   <th>1500</th>
160 |   <th>1600</th>
161 |   <th>1700</th>
162 |   <th>1750</th>
163 |   <th>1800</th>
164 |   <th>1850</th>
165 |   <th>1900</th>
166 |   <th>1950</th>
167 |   <th>1999</th>
168 |   <th>2008</th>
169 |   <th>2010</th>
170 |   <th>2012</th>
171 |   <th>2050</th>
172 |   <th>2150</th>
173 |   </tr>
174 |   <tr>
175 |   <th>World</th>
176 |   <td>458</td>
177 |   <td>580</td>
178 |   <td>682</td>
179 |   <td>791</td>
180 |   <td>978</td>
181 |   <td>1,262</td>
182 |   <td>1,650</td>
183 |   <td>2,521</td>
184 |   <td>5,978</td>
185 |   <td>6,707</td>
186 |   <td>6,896</td>
187 |   <td>7,052</td>
188 |   <td>8,909</td>
189 |   <td>9,746</td>
190 |   </tr>
191 |   <tr>
192 |   <th>Africa</th>
193 |   <td>86</td>
194 |   <td>114</td>
195 |   <td>106</td>
196 |   <td>106</td>
197 |   <td>107</td>
198 |   <td>111</td>
199 |   <td>133</td>
200 |   <td>221</td>
201 |   <td>767</td>
202 |   <td>973</td>
203 |   <td>1,022</td>
204 |   <td>1,052</td>
205 |   <td>1,766</td>
206 |   <td>2,308</td>
207 |   </tr>
208 |   <tr>
209 |   <th>Asia</th>
210 |   <td>243</td>
211 |   <td>339</td>
212 |   <td>436</td>
213 |   <td>502</td>
214 |   <td>635</td>
215 |   <td>809</td>
216 |   <td>947</td>
217 |   <td>1,402</td>
218 |   <td>3,634</td>
219 |   <td>4,054</td>
220 |   <td>4,164</td>
221 |   <td>4,250</td>
222 |   <td>5,268</td>
223 |   <td>5,561</td>
224 |   </tr>
225 |   <tr>
226 |   <th>Europe</th>
227 |   <td>84</td>
228 |   <td>111</td>
229 |   <td>125</td>
230 |   <td>163</td>
231 |   <td>203</td>
232 |   <td>276</td>
233 |   <td>408</td>
234 |   <td>547</td>
235 |   <td>729</td>
236 |   <td>732</td>
237 |   <td>738</td>
238 |   <td>740</td>
239 |   <td>628</td>
240 |   <td>517</td>
241 |   </tr>
242 |   <tr>
243 |   <th>Latin America<sup id="cite_ref-Americas_119-0" class="reference"><a href="#cite_note-Americas-119"><span>[</span>Note 1<span>]</span></a></sup></th>
244 |   <td>39</td>
245 |   <td>10</td>
246 |   <td>10</td>
247 |   <td>16</td>
248 |   <td>24</td>
249 |   <td>38</td>
250 |   <td>74</td>
251 |   <td>167</td>
252 |   <td>511</td>
253 |   <td>577</td>
254 |   <td>590</td>
255 |   <td>603</td>
256 |   <td>809</td>
257 |   <td>912</td>
258 |   </tr>
259 |   <tr>
260 |   <th>Northern America<sup id="cite_ref-Americas_119-1" class="reference"><a href="#cite_note-Americas-119"><span>[</span>Note 1<span>]</span></a></sup></th>
261 |   <td>3</td>
262 |   <td>3</td>
263 |   <td>2</td>
264 |   <td>2</td>
265 |   <td>7</td>
266 |   <td>26</td>
267 |   <td>82</td>
268 |   <td>172</td>
269 |   <td>307</td>
270 |   <td>337</td>
271 |   <td>345</td>
272 |   <td>351</td>
273 |   <td>392</td>
274 |   <td>398</td>
275 |   </tr>
276 |   <tr>
277 |   <th>Oceania</th>
278 |   <td>3</td>
279 |   <td>3</td>
280 |   <td>3</td>
281 |   <td>2</td>
282 |   <td>2</td>
283 |   <td>2</td>
284 |   <td>6</td>
285 |   <td>13</td>
286 |   <td>30</td>
287 |   <td>34</td>
288 |   <td>37</td>
289 |   <td>38</td>
290 |   <td>46</td>
291 |   <td>51</td>
292 |   </tr>
293 |   </table>
294 |   '
295 | 
296 |   library(XML)
297 | 
298 |   xp <- "//*[text() = 'World historical and predicted populations (in millions)']/ancestor::table"
299 |   popFun <- function(node) {
300 |     x <- xmlValue(node)
301 |     gsub(',', '', x)
302 |   }
303 |   ex2 <- htmltab(doc = html_world, which = xp, bodyFun = popFun)
304 | 
305 |   expect_that(ex2[1,7], equals("1262"))
306 | })
307 | 
308 | 
309 | test_that("Example 3 works", {
310 | 
311 |   html_bayern <- '<table class="wikitable">
312 | <tr>
313 | <td><span class="flagicon"><a href="/wiki/Spain" title="Spain"><img alt="Spain" src="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/23px-Flag_of_Spain.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/35px-Flag_of_Spain.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/45px-Flag_of_Spain.svg.png 2x" data-file-width="750" data-file-height="500" /></a></span> <a href="/wiki/Pep_Guardiola" title="Pep Guardiola">Pep Guardiola</a></td>
314 |   <td>Head coach</td>
315 |   </tr>
316 |   <tr>
317 |   <td><span class="flagicon"><a href="/wiki/Spain" title="Spain"><img alt="Spain" src="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/23px-Flag_of_Spain.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/35px-Flag_of_Spain.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/45px-Flag_of_Spain.svg.png 2x" data-file-width="750" data-file-height="500" /></a></span> <a href="/wiki/Manuel_Estiarte" title="Manuel Estiarte">Manel Estiarte</a></td>
318 |   <td>Personal assistant</td>
319 |   </tr>
320 |   <tr>
321 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> <a href="/wiki/Hermann_Gerland" title="Hermann Gerland">Hermann Gerland</a></td>
322 |   <td>Assistant coach</td>
323 |   </tr>
324 |   <tr>
325 |   <td><span class="flagicon"><a href="/wiki/Spain" title="Spain"><img alt="Spain" src="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/23px-Flag_of_Spain.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/35px-Flag_of_Spain.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/45px-Flag_of_Spain.svg.png 2x" data-file-width="750" data-file-height="500" /></a></span> Domènec Torrent</td>
326 |   <td>Assistant coach</td>
327 |   </tr>
328 |   <tr>
329 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> <a href="/wiki/Toni_Tapalovi%C4%87" title="Toni Tapalović">Toni Tapalović</a></td>
330 |   <td>Goalkeeping coach</td>
331 |   </tr>
332 |   <tr>
333 |   <td><span class="flagicon"><a href="/wiki/Spain" title="Spain"><img alt="Spain" src="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/23px-Flag_of_Spain.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/35px-Flag_of_Spain.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/45px-Flag_of_Spain.svg.png 2x" data-file-width="750" data-file-height="500" /></a></span> Lorenzo Buenaventura</td>
334 |   <td>Fitness coach</td>
335 |   </tr>
336 |   <tr>
337 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Andreas Kornmayer</td>
338 |   <td>Fitness coach</td>
339 |   </tr>
340 |   <tr>
341 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Thomas Wilhelmi</td>
342 |   <td>Fitness coach</td>
343 |   </tr>
344 |   <tr>
345 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> <a href="/wiki/Matthias_Sammer" title="Matthias Sammer">Matthias Sammer</a></td>
346 |   <td>Sport director</td>
347 |   </tr>
348 |   <tr>
349 |   <td><span class="flagicon"><a href="/wiki/Spain" title="Spain"><img alt="Spain" src="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/23px-Flag_of_Spain.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/35px-Flag_of_Spain.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/9/9a/Flag_of_Spain.svg/45px-Flag_of_Spain.svg.png 2x" data-file-width="750" data-file-height="500" /></a></span> Carles Planchart</td>
350 |   <td>Match analyst</td>
351 |   </tr>
352 |   <tr>
353 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Lars Kornetka</td>
354 |   <td>Video analyst</td>
355 |   </tr>
356 |   <tr>
357 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Michael Niemeyer</td>
358 |   <td>Video analyst</td>
359 |   </tr>
360 |   <tr>
361 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> <a href="/wiki/Paul_Breitner" title="Paul Breitner">Paul Breitner</a></td>
362 |   <td>Chief scout</td>
363 |   </tr>
364 |   <tr>
365 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> <a href="/wiki/Egon_Coordes" title="Egon Coordes">Egon Coordes</a></td>
366 |   <td>Scout</td>
367 |   </tr>
368 |   <tr>
369 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> <a href="/wiki/Wolfgang_Grobe" title="Wolfgang Grobe">Wolfgang Grobe</a></td>
370 |   <td>Scout</td>
371 |   </tr>
372 |   <tr>
373 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> <a href="/wiki/Hans-Wilhelm_M%C3%BCller-Wohlfahrt" title="Hans-Wilhelm Müller-Wohlfahrt">Hans-Wilhelm Müller-Wohlfahrt</a></td>
374 |   <td>Chief medic</td>
375 |   </tr>
376 |   <tr>
377 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Lutz Hänsel</td>
378 |   <td>Team doctor</td>
379 |   </tr>
380 |   <tr>
381 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Peter Ueblacker</td>
382 |   <td>Team doctor</td>
383 |   </tr>
384 |   <tr>
385 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Roland Schmidt</td>
386 |   <td>Cardiologist</td>
387 |   </tr>
388 |   <tr>
389 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Fredi Binder</td>
390 |   <td>Physiotherapist</td>
391 |   </tr>
392 |   <tr>
393 |   <td><span class="flagicon"><a href="/wiki/Italy" title="Italy"><img alt="Italy" src="//upload.wikimedia.org/wikipedia/en/thumb/0/03/Flag_of_Italy.svg/23px-Flag_of_Italy.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/0/03/Flag_of_Italy.svg/35px-Flag_of_Italy.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/0/03/Flag_of_Italy.svg/45px-Flag_of_Italy.svg.png 2x" data-file-width="1500" data-file-height="1000" /></a></span> Gianni Bianchi</td>
394 |   <td>Physiotherapist</td>
395 |   </tr>
396 |   <tr>
397 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Gerry Hoffmann</td>
398 |   <td>Physiotherapist</td>
399 |   </tr>
400 |   <tr>
401 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Stephan Weickert</td>
402 |   <td>Physiotherapist</td>
403 |   </tr>
404 |   <tr>
405 |   <td><span class="flagicon"><a href="/wiki/Germany" title="Germany"><img alt="Germany" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23px-Flag_of_Germany.svg.png" width="23" height="14" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/35px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/46px-Flag_of_Germany.svg.png 2x" data-file-width="1000" data-file-height="600" /></a></span> Helmut Erhard</td>
406 |   <td>Physiotherapist</td>
407 |   </tr>
408 |   </table>'
409 | 
410 |   xp2 <- "//td[text() = 'Head coach']/ancestor::table"
411 |   ex3 <- htmltab(doc = html_bayern, which = xp2, header = 0, encoding = "UTF-8", colNames = c("name", "role"))
412 | 
413 |   expect_that(ex3[1,1], equals("Pep Guardiola"))
414 |   expect_that(colnames(ex3)[1], equals("name"))
415 | })
416 | 
417 | 
418 | test_that("Example 4 works", {
419 | 
420 |   html_web <- '<table class="wikitable" style="text-align:right;">
421 | <caption><b>Global desktop and mobile stats combined from <a rel="nofollow" class="external text" href="http://gs.statcounter.com/#all-browser-ww-monthly-200812-201502">StatCounter</a> (Top 5 browsers)</b></caption>
422 | <tr>
423 |   <th scope="col" style="width:8em" rowspan="2">Date<br /></th>
424 |   <th scope="col" style="width:4em" rowspan="2"><a href="/wiki/Internet_Explorer" title="Internet Explorer">Internet<br />
425 |   Explorer</a> <sup id="cite_ref-DesktopMobile_17-0" class="reference"><a href="#cite_note-DesktopMobile-17"><span>[</span>Note 1<span>]</span></a></sup></th>
426 |   <th scope="col" style="width:4em" rowspan="2"><a href="/wiki/Google_Chrome" title="Google Chrome">Chrome</a></th>
427 |   <th scope="col" style="width:4em" rowspan="2"><a href="/wiki/Firefox" title="Firefox">Firefox</a></th>
428 |   <th scope="col" colspan="3"><a href="/wiki/Safari_(web_browser)" title="Safari (web browser)">Safari</a></th>
429 |   <th scope="col" colspan="3"><a href="/wiki/Opera_(web_browser)" title="Opera (web browser)">Opera</a></th>
430 |   <th scope="col" style="width:4em" rowspan="2"><a href="/wiki/Android_(operating_system)" title="Android (operating system)">Android</a></th>
431 |   <th scope="col" style="width:4em" rowspan="2"><a href="/wiki/Mobile_browser" title="Mobile browser">Mobile<br />
432 |   Total</a></th>
433 |   </tr>
434 |   <tr>
435 |   <th scope="col" style="width:3em">Desktop+Laptop</th>
436 |   <th scope="col" style="width:3em">Mobile</th>
437 |   <th scope="col" style="width:3em">Total</th>
438 |   <th scope="col" style="width:3em">Desktop+Laptop</th>
439 |   <th scope="col" style="width:3em">Mobile</th>
440 |   <th scope="col" style="width:3em">Total</th>
441 |   </tr>
442 |   <tr>
443 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201307-201307-bar">July 2013</a></th>
444 |   <td>20.27%</td>
445 |   <td>36.29%</td>
446 |   <td>16.60%</td>
447 |   <td>7.10%</td>
448 |   <td>4.27%</td>
449 |   <td>11.37%</td>
450 |   <td>1.01%</td>
451 |   <td>2.73%</td>
452 |   <td>3.74%</td>
453 |   <td>4.97%</td>
454 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201307-201307-bar">17.35%</a></td>
455 |   </tr>
456 |   <tr>
457 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201306-201306-bar">June 2013</a></th>
458 |   <td>21.35%</td>
459 |   <td>35.82%</td>
460 |   <td>16.79%</td>
461 |   <td>7.04%</td>
462 |   <td>3.79%</td>
463 |   <td>10.83%</td>
464 |   <td>1.03%</td>
465 |   <td>2.28%</td>
466 |   <td>3.31%</td>
467 |   <td>4.66%</td>
468 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201306-201306-bar">16.08%</a></td>
469 |   </tr>
470 |   <tr>
471 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201305-201305-bar">May 2013</a></th>
472 |   <td>23.67%</td>
473 |   <td>35.66%</td>
474 |   <td>16.87%</td>
475 |   <td>6.80%</td>
476 |   <td>3.67%</td>
477 |   <td>10.55%</td>
478 |   <td>0.87%</td>
479 |   <td>2.13%</td>
480 |   <td>3.00%</td>
481 |   <td>4.45%</td>
482 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201305-201305-bar">14.62%</a></td>
483 |   </tr>
484 |   <tr>
485 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201304-201304-bar">April 2013</a></th>
486 |   <td>25.58%</td>
487 |   <td>33.71%</td>
488 |   <td>17.27%</td>
489 |   <td>6.89%</td>
490 |   <td>3.76%</td>
491 |   <td>10.65%</td>
492 |   <td>0.85%</td>
493 |   <td>2.26%</td>
494 |   <td>3.11%</td>
495 |   <td>4.30%</td>
496 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201304-201304-bar">13.90%</a></td>
497 |   </tr>
498 |   <tr>
499 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201303-201303-bar">March 2013</a></th>
500 |   <td>25.08%</td>
501 |   <td>32.88%</td>
502 |   <td>17.86%</td>
503 |   <td>7.28%</td>
504 |   <td>3.90%</td>
505 |   <td>11.18%</td>
506 |   <td>1.00%</td>
507 |   <td>2.24%</td>
508 |   <td>3.24%</td>
509 |   <td>4.43%</td>
510 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201303-201303-bar">14.44%</a></td>
511 |   </tr>
512 |   <tr>
513 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201302-201302-bar">February 2013</a></th>
514 |   <td>25.54%</td>
515 |   <td>31.96%</td>
516 |   <td>18.28%</td>
517 |   <td>7.37%</td>
518 |   <td>3.89%</td>
519 |   <td>11.26%</td>
520 |   <td>1.04%</td>
521 |   <td>2.21%</td>
522 |   <td>3.25%</td>
523 |   <td>4.53%</td>
524 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201302-201302-bar">14.35%</a></td>
525 |   </tr>
526 |   <tr>
527 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201301-201301-bar">January 2013</a></th>
528 |   <td>26.37%</td>
529 |   <td>31.51%</td>
530 |   <td>18.39%</td>
531 |   <td>7.12%</td>
532 |   <td>3.64%</td>
533 |   <td>10.76%</td>
534 |   <td>1.02%</td>
535 |   <td>2.17%</td>
536 |   <td>3.19%</td>
537 |   <td>4.36%</td>
538 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201301-201301-bar">14.13%</a></td>
539 |   </tr>
540 |   <tr>
541 |   <th scope="row" style="text-align:right; border-top: 2px solid black;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201207-201207-bar">July 2012</a></th>
542 |   <td style="border-top: 2px solid black;">28.49%</td>
543 |   <td style="border-top: 2px solid black;">30.06%</td>
544 |   <td style="border-top: 2px solid black;">21.01%</td>
545 |   <td style="border-top: 2px solid black;">6.33%</td>
546 |   <td style="border-top: 2px solid black;">2.77%</td>
547 |   <td style="border-top: 2px solid black;">9.10%</td>
548 |   <td style="border-top: 2px solid black;">1.53%</td>
549 |   <td style="border-top: 2px solid black;">2.15%</td>
550 |   <td style="border-top: 2px solid black;">3.68%</td>
551 |   <td style="border-top: 2px solid black;">2.57%</td>
552 |   <td style="border-top: 2px solid black;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201207-201207-bar">11.09%</a></td>
553 |   </tr>
554 |   <tr>
555 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201201-201201-bar">January 2012</a></th>
556 |   <td>34.27%</td>
557 |   <td>25.99%</td>
558 |   <td>22.68%</td>
559 |   <td>6.06%</td>
560 |   <td>1.92%</td>
561 |   <td>7.98%</td>
562 |   <td>1.78%</td>
563 |   <td>2.06%</td>
564 |   <td>3.84%</td>
565 |   <td>1.71%</td>
566 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201201-201201-bar">8.49%</a></td>
567 |   </tr>
568 |   <tr>
569 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201107-201107-bar">July 2011</a></th>
570 |   <td>39.47%</td>
571 |   <td>20.59%</td>
572 |   <td>25.99%</td>
573 |   <td>4.81%</td>
574 |   <td>1.40%</td>
575 |   <td>6.21%</td>
576 |   <td>1.54%</td>
577 |   <td>1.54%</td>
578 |   <td>3.08%</td>
579 |   <td>1.28%</td>
580 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201107-201107-bar">7.02%</a></td>
581 |   </tr>
582 |   <tr>
583 |   <th scope="row" style="text-align:right;"><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#browser-ww-monthly-201106-201106-bar">June 2011</a></th>
584 |   <td>40.73%</td>
585 |   <td>19.30%</td>
586 |   <td>26.49%</td>
587 |   <td></td>
588 |   <td>4.74%</td>
589 |   <td></td>
590 |   <td></td>
591 |   <td>1.63%</td>
592 |   <td></td>
593 |   <td></td>
594 |   <td><a rel="nofollow" class="external text" href="http://gs.statcounter.com/#mobile_vs_desktop-ww-monthly-201106-201106-bar">6.53%</a></td>
595 |   </tr>
596 |   </table>'
597 | 
598 |   bFun <- function(node) {
599 |     x <- XML::xmlValue(node)
600 |     x <- gsub('%$', '', x)
601 |     ifelse(x == '', NA, x)
602 |   }
603 |   ex4 <- htmltab(doc = html_web, which = 1, bodyFun = bFun)
604 | 
605 |   expect_that(ex4[1,1], equals("July 2013"))
606 |   expect_that(ex4[1,2], equals("20.27"))
607 |   expect_that(is.na(ex4[11,5]), is_true())
608 | })
609 | 


--------------------------------------------------------------------------------
/tests/testthat/test_expand_spans.R:
--------------------------------------------------------------------------------
  1 | context("Correct expansion of header elements")
  2 | 
  3 | tab1_code <- '
  4 | <table>
  5 |   <tr>
  6 |     <th rowspan="2">a</th>
  7 |     <th>b</th>
  8 |     <th>c</th>
  9 |   </tr>
 10 |   <tr>
 11 |     <td></td>
 12 |     <td></td>
 13 |   </tr>
 14 |   <tr>
 15 |     <td>1</td>
 16 |     <td>2</td>
 17 |     <td>3</td>
 18 |   </tr>
 19 | </table>'
 20 | 
 21 | test_that("Correctly expanded", {
 22 | 
 23 |   tab1 <- XML::htmlParse(tab1_code)
 24 |   suppressMessages(tab1 <- htmltab(tab1, header = 1:2, body = 3, which = NULL))
 25 | 
 26 |   expect_that(tab1[,1], equals("1"))
 27 |   expect_that(tab1[,2], equals("2"))
 28 |   expect_that(tab1[,3], equals("3"))
 29 | 
 30 |   expect_that(colnames(tab1)[1], equals("a"))
 31 |   expect_that(colnames(tab1)[2], equals("b"))
 32 |   expect_that(colnames(tab1)[3], equals("c"))
 33 | })
 34 | 
 35 | tab2_code <- '<table>
 36 |   <tr>
 37 |     <th rowspan="2">a</th>
 38 | <th>b</th>
 39 | <th colspan="2" rowspan="2">c</th>
 40 | </tr>
 41 | <tr>
 42 | <td></td>
 43 | </tr>
 44 | <tr>
 45 | <td>1</td>
 46 | <td>2</td>
 47 | <td>3</td>
 48 | <td></td>
 49 | </tr>
 50 | </table>'
 51 | 
 52 | test_that("Correctly expanded", {
 53 | 
 54 |   tab2 <- XML::htmlParse(tab2_code)
 55 |   suppressMessages(tab2 <- htmltab(tab2, header = 1:2, body = 3))
 56 | 
 57 |   expect_that(tab2[,1], equals("1"))
 58 |   expect_that(tab2[,2], equals("2"))
 59 |   expect_that(tab2[,3], equals("3"))
 60 |   #expect_that(tab2[,4], equals(""))
 61 | 
 62 |   expect_that(colnames(tab2)[1], equals("a"))
 63 |   expect_that(colnames(tab2)[2], equals("b"))
 64 |   expect_that(colnames(tab2)[3], equals("c"))
 65 |   #expect_that(colnames(tab2)[4], equals("c"))
 66 | })
 67 | 
 68 | 
 69 | tab3_code <- '<table>
 70 |   <tr>
 71 |     <th>a</th>
 72 | <th>b</th>
 73 | <th>c</th>
 74 | <th colspan="3">d</th>
 75 | </tr>
 76 | <tr>
 77 | <td colspan="4">e</td>
 78 | </tr>
 79 | <tr>
 80 | <td>1</td>
 81 | <td>2</td>
 82 | <td>3</td>
 83 | <td></td>
 84 | </tr>
 85 | </table>'
 86 | 
 87 | test_that("Correctly expanded", {
 88 | 
 89 |   tab3 <- XML::htmlParse(tab3_code)
 90 |   suppressMessages(tab3 <- htmltab(tab3, header = 1:2, body = 3))
 91 | 
 92 |   expect_that(tab3[,1], equals("1"))
 93 |   expect_that(tab3[,2], equals("2"))
 94 |   expect_that(tab3[,3], equals("3"))
 95 |   expect_that(is.na(tab3[,4]), is_true())
 96 | 
 97 |   expect_that(colnames(tab3)[1], equals("a >> e"))
 98 |   expect_that(colnames(tab3)[2], equals("b >> e"))
 99 |   expect_that(colnames(tab3)[3], equals("c >> e"))
100 |   expect_that(colnames(tab3)[4], equals("d >> e"))
101 | })
102 | 
103 | tab4_code <- '<table>
104 |   <tr>
105 |     <th rowspan="2">a</th>
106 | <th colspan="2">b</th>
107 | <th rowspan="2">c</th>
108 | </tr>
109 | <tr>
110 | <td>b1</td>
111 | <td>b2</td>
112 | </tr>
113 | <tr>
114 | <td>1</td>
115 | <td>2</td>
116 | <td>3</td>
117 | <td>4</td>
118 | </tr>
119 | </table>'
120 | 
121 | test_that("Correctly expanded", {
122 | 
123 |   tab4 <- XML::htmlParse(tab4_code)
124 |   suppressMessages(tab4 <- htmltab(tab4, header = 1:2, body = 3))
125 | 
126 |   expect_that(tab4[,1], equals("1"))
127 |   expect_that(tab4[,2], equals("2"))
128 |   expect_that(tab4[,3], equals("3"))
129 |   expect_that(tab4[,4], equals("4"))
130 | 
131 |   expect_that(colnames(tab4)[1], equals("a"))
132 |   expect_that(colnames(tab4)[2], equals("b >> b1"))
133 |   expect_that(colnames(tab4)[3], equals("b >> b2"))
134 |   expect_that(colnames(tab4)[4], equals("c"))
135 | })
136 | 
137 | tab5_code <- '<table>
138 |   <tr>
139 |     <th rowspan="2">a</th>
140 | <th colspan="2">b</th>
141 | <th rowspan="40">c</th>
142 | </tr>
143 | <tr>
144 | <td>b1</td>
145 | <td>b2</td>
146 | </tr>
147 | <tr>
148 | <td>1</td>
149 | <td>2</td>
150 | <td>3</td>
151 | <td>4</td>
152 | </tr>
153 | </table>'
154 | 
155 | test_that("Correctly expanded when misspecified header", {
156 | 
157 |   tab5 <- XML::htmlParse(tab5_code)
158 |   suppressMessages(tab5 <- htmltab(tab5, header = 1:2, body = 3))
159 | 
160 |   expect_that(tab5[,1], equals("1"))
161 |   expect_that(tab5[,2], equals("2"))
162 |   expect_that(tab5[,3], equals("3"))
163 |   expect_that(tab5[,4], equals("c")) # should be 4
164 | 
165 |   expect_that(colnames(tab5)[1], equals("a"))
166 |   expect_that(colnames(tab5)[2], equals("b >> b1"))
167 |   expect_that(colnames(tab5)[3], equals("b >> b2"))
168 |   expect_that(colnames(tab5)[4], equals("c"))
169 | })
170 | 
171 | 
172 | tab6_code <- '<table>
173 | <tr>
174 | <th rowspan="2">a</th>
175 | <th colspan="2">b</th>
176 | <th rowspan="40">c</th>
177 | </tr>
178 | <tr>
179 | <td>b1</td>
180 | <td>b2</td>
181 | </tr>
182 | 
183 | <tbody>
184 | <tr>
185 | <td>1</td>
186 | <td>2</td>
187 | <td>3</td>
188 | <td>4</td>
189 | </tr>
190 | </tbody>
191 | 
192 | </table>'
193 | 
194 | test_that("H: tr/th.td, B: tbody/tr; misspecified rowspan in H", {
195 | 
196 |   tab6 <- XML::htmlParse(tab6_code)
197 |   suppressMessages(tab6 <- htmltab(tab6, header = 1:2))
198 | 
199 |   expect_that(tab6[,1], equals("1"))
200 |   expect_that(tab6[,2], equals("2"))
201 |   expect_that(tab6[,3], equals("3"))
202 |   expect_that(tab6[,4], equals("c")) #should be 4
203 | 
204 |   expect_that(colnames(tab6)[1], equals("a"))
205 |   expect_that(colnames(tab6)[2], equals("b >> b1"))
206 |   expect_that(colnames(tab6)[3], equals("b >> b2"))
207 |   expect_that(colnames(tab6)[4], equals("c"))
208 | })
209 | 
210 | 
211 | stack <- '<html>
212 |   <head>
213 |   <title>My First Webpage</title>
214 |   </head>
215 | 
216 |   <body >
217 | 
218 |   <table border="1px" width="80%" cellspacing="0" cellpading="0" >
219 | 
220 |   <tr>
221 |   <td ></td>        <! -- I NEED TO REMOVE THIS PART FROM TABLE AND MAKE A **FREE SPACE** HEARE -->
222 | 
223 |   <td >9-11</td>
224 |   <td >11-13</td>
225 |   <td >13-15</td>
226 |   <td >15-17</td>
227 |   </tr>
228 | 
229 |   <tr>
230 |   <td >Monday</td>
231 |   <td>6</td>
232 |   <td colspan="0">7</td>
233 |   <td rowspan ="3">Lunch</td>
234 |   <td>a</td>
235 |   </tr>
236 | 
237 |   <tr>
238 |   <td>Tuesday</td>
239 |   <td colspan="2">&#60; free</td>
240 |   <td>s</td>
241 |   </tr>
242 | 
243 |   <tr>
244 |   <td >Wedensday</td>
245 |   <td>a</td>
246 |   <td>s</td>
247 |   <td>5</td>
248 |   </tr>
249 | 
250 |   </table>
251 |   </body>
252 |   </html>
253 | '
254 | 
255 | test_that("http://stackoverflow.com/questions/24215584/html-complex-tables", {
256 | 
257 |   parsed_stack <- XML::htmlParse(stack)
258 |   suppressMessages(stack2 <- htmltab(parsed_stack))
259 | 
260 |   expect_that(stack2[,1], equals(c("Monday", "Tuesday", "Wedensday")))
261 |   expect_that(stack2[,2], equals(c("6", "< free", "a")))
262 |   expect_that(stack2[,3], equals(c("7", "< free", "s")))
263 |   expect_that(stack2[,4], equals(c("Lunch", "Lunch", "Lunch")))
264 |   expect_that(stack2[,5], equals(c("a", "s", "5")))
265 | 
266 |   expect_that(colnames(stack2), equals(c("V1", "9-11", "11-13", "13-15", "15-17")))
267 | })
268 | 


--------------------------------------------------------------------------------
/tests/testthat/test_find_header.R:
--------------------------------------------------------------------------------
  1 | context("Correctly identified header elements")
  2 | 
  3 | tab1_code <- '
  4 | <table>
  5 | <tr>
  6 | <th rowspan="2">a</th>
  7 | <th>b</th>
  8 | <th>c</th>
  9 | </tr>
 10 | <tr>
 11 | <th></th>
 12 | <th></th>
 13 | </tr>
 14 | <tr>
 15 | <td>1</td>
 16 | <td>2</td>
 17 | <td>3</td>
 18 | </tr>
 19 | </table>'
 20 | 
 21 | test_that("Correctly identified header 1", {
 22 | 
 23 |   tab1 <- XML::htmlParse(tab1_code)
 24 |   suppressMessages(tab1 <- htmltab(tab1))
 25 | 
 26 |   expect_that(tab1[,1], equals("1"))
 27 |   expect_that(tab1[,2], equals("2"))
 28 |   expect_that(tab1[,3], equals("3"))
 29 | 
 30 |   expect_that(colnames(tab1)[1], equals("a"))
 31 |   expect_that(colnames(tab1)[2], equals("b"))
 32 |   expect_that(colnames(tab1)[3], equals("c"))
 33 | })
 34 | 
 35 | tab2_code <- '
 36 | <table>
 37 | <thead>
 38 | <tr>
 39 | <th rowspan="2">a</th>
 40 | <th>b</th>
 41 | <th colspan="2" rowspan="2">c</th>
 42 | </tr>
 43 | <tr>
 44 | <td></td>
 45 | </tr>
 46 | </thead>
 47 | <tbody>
 48 | <tr>
 49 | <td>1</td>
 50 | <td>2</td>
 51 | <td>3</td>
 52 | <td></td>
 53 | </tr>
 54 | </tbody>
 55 | </table>'
 56 | 
 57 | test_that("Correctly identified header 2", {
 58 | 
 59 |   tab2 <- XML::htmlParse(tab2_code)
 60 |   suppressMessages(tab2 <- htmltab(tab2))
 61 | 
 62 |   expect_that(tab2[,1], equals("1"))
 63 |   expect_that(tab2[,2], equals("2"))
 64 |   expect_that(tab2[,3], equals("3"))
 65 |     expect_that(is.na(tab2[,4]), is_true())
 66 | 
 67 |   expect_that(colnames(tab2)[1], equals("a"))
 68 |   expect_that(colnames(tab2)[2], equals("b"))
 69 |   expect_that(colnames(tab2)[3], equals("c"))
 70 |   expect_that(colnames(tab2)[4], equals("c"))
 71 | })
 72 | 
 73 | 
 74 | tab3_code <- '<table>
 75 | <thead>
 76 | <tr>
 77 | <th>a</th>
 78 | <th>b</th>
 79 | <th>c</th>
 80 | <th colspan="3">d</th>
 81 | </tr>
 82 | <tr>
 83 | <td colspan="4">e</td>
 84 | </tr>
 85 | </thead>
 86 | 
 87 | <tr>
 88 | <td>1</td>
 89 | <td>2</td>
 90 | <td>3</td>
 91 | <td></td>
 92 | </tr>
 93 | </table>'
 94 | 
 95 | test_that("Correctly identified header 3", {
 96 | 
 97 |   tab3 <- XML::htmlParse(tab3_code)
 98 |   suppressMessages(tab3 <- htmltab(tab3, fillNA = ""))
 99 | 
100 |   expect_that(tab3[,1], equals("1"))
101 |   expect_that(tab3[,2], equals("2"))
102 |   expect_that(tab3[,3], equals("3"))
103 |   expect_that(tab3[,4], equals(""))
104 | 
105 |   expect_that(colnames(tab3)[1], equals("a >> e"))
106 |   expect_that(colnames(tab3)[2], equals("b >> e"))
107 |   expect_that(colnames(tab3)[3], equals("c >> e"))
108 |   expect_that(colnames(tab3)[4], equals("d >> e"))
109 | })
110 | 
111 | 
112 | tab4_code <- '
113 | <table>
114 | <tr>
115 | <th>a</th>
116 | <th>b</th>
117 | <th>c</th>
118 | </tr>
119 | <tr>
120 | <th>a2</th>
121 | <th>b2</th>
122 | <th>c2</th>
123 | </tr>
124 | <tr>
125 | <td>1</td>
126 | <td>2</td>
127 | <td>3</td>
128 | </tr>
129 | </table>'
130 | 
131 | test_that("Correctly identified header 4", {
132 | 
133 |   tab4 <- XML::htmlParse(tab4_code)
134 |   suppressMessages(tab4 <- htmltab(tab4))
135 | 
136 |   expect_that(tab4[,1], equals("1"))
137 |   expect_that(tab4[,2], equals("2"))
138 |   expect_that(tab4[,3], equals("3"))
139 | 
140 |   expect_that(colnames(tab4)[1], equals("a >> a2"))
141 |   expect_that(colnames(tab4)[2], equals("b >> b2"))
142 |   expect_that(colnames(tab4)[3], equals("c >> c2"))
143 | })
144 | 
145 | 
146 | test_that("Correctly identified header 5", {
147 | 
148 |   tab5 <- XML::htmlParse(tab4_code)
149 |   suppressMessages(tab5 <- htmltab(tab5, header = 1:2))
150 | 
151 |   expect_that(tab5[,1], equals("1"))
152 |   expect_that(tab5[,2], equals("2"))
153 |   expect_that(tab5[,3], equals("3"))
154 | 
155 |   expect_that(colnames(tab5)[1], equals("a >> a2"))
156 |   expect_that(colnames(tab5)[2], equals("b >> b2"))
157 |   expect_that(colnames(tab5)[3], equals("c >> c2"))
158 | })
159 | 
160 | test_that("Correctly identified header 6", {
161 | 
162 |   tab6 <- XML::htmlParse(tab4_code)
163 |   suppressMessages(tab6 <- htmltab(tab6, body = 3))
164 | 
165 |   expect_that(tab6[,1], equals("1"))
166 |   expect_that(tab6[,2], equals("2"))
167 |   expect_that(tab6[,3], equals("3"))
168 | 
169 |   expect_that(colnames(tab6)[1], equals("a >> a2"))
170 |   expect_that(colnames(tab6)[2], equals("b >> b2"))
171 |   expect_that(colnames(tab6)[3], equals("c >> c2"))
172 | })
173 | 
174 | 
175 | tab5_code <- '
176 | <table>
177 | <thead>
178 | <tr>
179 | <th>a</th>
180 | <th>b</th>
181 | <th>c</th>
182 | </tr>
183 | </thead>
184 | 
185 | <tbody>
186 | <tr>
187 | <th>a2</th>
188 | <th>b2</th>
189 | <th>c2</th>
190 | </tr>
191 | <tr>
192 | <td>1</td>
193 | <td>2</td>
194 | <td>3</td>
195 | </tr>
196 | </tbody>
197 | </table>'
198 | 
199 | test_that("Header info in thead and tbody", {
200 | 
201 |   tab7 <- XML::htmlParse(tab5_code)
202 |   suppressMessages(tab7 <- htmltab(tab7, header = 1:2))
203 | 
204 |   expect_that(tab7[,1], equals("1"))
205 |   expect_that(tab7[,2], equals("2"))
206 |   expect_that(tab7[,3], equals("3"))
207 | 
208 |   expect_that(colnames(tab7)[1], equals("a >> a2"))
209 |   expect_that(colnames(tab7)[2], equals("b >> b2"))
210 |   expect_that(colnames(tab7)[3], equals("c >> c2"))
211 | })
212 | 
213 | test_that("Correctly identified header 8", {
214 | 
215 |   tab8 <- XML::htmlParse(tab5_code)
216 |   suppressMessages(tab8 <- htmltab(tab8, header = 1:2, body = 3))
217 | 
218 |   expect_that(tab8[,1], equals("1"))
219 |   expect_that(tab8[,2], equals("2"))
220 |   expect_that(tab8[,3], equals("3"))
221 | 
222 |   expect_that(colnames(tab8)[1], equals("a >> a2"))
223 |   expect_that(colnames(tab8)[2], equals("b >> b2"))
224 |   expect_that(colnames(tab8)[3], equals("c >> c2"))
225 | })
226 | 


--------------------------------------------------------------------------------
/tests/testthat/test_inputs.R:
--------------------------------------------------------------------------------
 1 | context("Input checks")
 2 | 
 3 | test_that("Prompts errors correctly", {
 4 |   expect_error(check_type(doc = "http://http://cran.at.r-project.org/", which = factor(4)))
 5 | })
 6 | 
 7 | test_that("check_type produces class output", {
 8 | 
 9 |   x <- check_type(doc = "http://christianrubba.com/htmltab/ex/wiki_indian_election2014.html", which = "//table[5]")
10 |   expect_that(x, is_a("XMLInternalDocument"))
11 | 
12 |   #expect_error(check_type(doc = "http://christianrubba.com/htmltab/ex/wiki_indian_election2014.html", which = 1))
13 | 
14 |   parsed <- XML::htmlParse("http://christianrubba.com/htmltab/ex/wiki_indian_election2014.html")
15 |   z <- check_type(doc = parsed, which = 3)
16 |   expect_that(z, is_a("XMLInternalDocument"))
17 | })
18 | 


--------------------------------------------------------------------------------
/tests/testthat/test_multi-dim-header.R:
--------------------------------------------------------------------------------
  1 | context("formula interface for header works")
  2 | 
  3 | test_that("multi-dim 1", {
  4 | 
  5 | tab3_code <- '<table>
  6 | <thead>
  7 | <tr>
  8 | <th>a</th>
  9 | <th>b</th>
 10 | <th>c</th>
 11 | <th colspan="3">d</th>
 12 | </tr>
 13 | <tr>
 14 | <td colspan="4">e</td>
 15 | </tr>
 16 | </thead>
 17 | 
 18 | <tbody>
 19 | <tr>
 20 | <td>1</td>
 21 | <td>2</td>
 22 | <td>3</td>
 23 | <td></td>
 24 | </tr>
 25 | 
 26 | <tr>
 27 | <td class = "md" colspan = "4">Header1</td>
 28 | </tr>
 29 | 
 30 | <tr>
 31 | <td>1</td>
 32 | <td>2</td>
 33 | <td>3</td>
 34 | <td></td>
 35 | </tr>
 36 | 
 37 | <tr>
 38 | <td>11</td>
 39 | <td>22</td>
 40 | <td>33</td>
 41 | <td>44</td>
 42 | </tr>
 43 | 
 44 | <tr>
 45 | <td class = "md" colspan = "4">Header2</td>
 46 | </tr>
 47 | 
 48 | 
 49 | <tr>
 50 | <td>1</td>
 51 | <td>2</td>
 52 | <td>3</td>
 53 | <td></td>
 54 | </tr>
 55 | 
 56 | <tr>
 57 | <td>11</td>
 58 | <td>22</td>
 59 | <td>33</td>
 60 | <td>44</td>
 61 | </tr>
 62 | 
 63 | </tbody>
 64 | </table>'
 65 | 
 66 | parsedm3 <- doc <- XML::htmlParse(tab3_code)
 67 | 
 68 | suppressMessages(tab3m1 <- htmltab(doc = parsedm3, header = "//thead/tr" + "//td[@class = 'md']", body = "//tbody/tr[not(./td[@class = 'md'])]"))
 69 | suppressMessages(tab3m2 <- htmltab(doc = parsedm3, header = 1:2 + "//td[@class = 'md']", body = "//tbody/tr[not(./td[@class = 'md'])]"))
 70 | 
 71 |   expect_that(is.na(tab3m1[1,1]), is_true())
 72 |   expect_that(tab3m1[2,1], equals("Header1"))
 73 |   expect_that(tab3m1[3,1], equals("Header1"))
 74 |   expect_that(tab3m1[4,1], equals("Header2"))
 75 |   expect_that(tab3m1[5,1], equals("Header2"))
 76 | 
 77 |   expect_that(is.na(tab3m2[1,1]), is_true())
 78 |   expect_that(tab3m2[2,1], equals("Header1"))
 79 |   expect_that(tab3m2[3,1], equals("Header1"))
 80 |   expect_that(tab3m2[4,1], equals("Header2"))
 81 |   expect_that(tab3m2[5,1], equals("Header2"))
 82 | })
 83 | 
 84 | 
 85 | 
 86 | test_that("multi-dim 2", {
 87 | 
 88 | tab4_code <- '<table>
 89 | <thead>
 90 | <tr>
 91 | <th>a</th>
 92 | <th>b</th>
 93 | <th>c</th>
 94 | <th colspan="3">d</th>
 95 | </tr>
 96 | <tr>
 97 | <td colspan="4">e</td>
 98 | </tr>
 99 | </thead>
100 | 
101 | <tbody>
102 | <tr>
103 | <td>1</td>
104 | <td>2</td>
105 | <td>3</td>
106 | <td></td>
107 | </tr>
108 | 
109 | <tr>
110 | <td class = "md0" class="md2" colspan = "4">MAIN 1</td>
111 | </tr>
112 | 
113 | <tr>
114 | <td class = "md1" class="md2" colspan = "4">Header1</td>
115 | </tr>
116 | 
117 | <tr>
118 | <td>1</td>
119 | <td>2</td>
120 | <td>3</td>
121 | <td></td>
122 | </tr>
123 | 
124 | <tr>
125 | <td>11</td>
126 | <td>22</td>
127 | <td>33</td>
128 | <td>44</td>
129 | </tr>
130 | 
131 | <tr>
132 | <td class = "md1" colspan = "4">Header2</td>
133 | </tr>
134 | 
135 | 
136 | <tr>
137 | <td>1</td>
138 | <td>2</td>
139 | <td>3</td>
140 | <td></td>
141 | </tr>
142 | 
143 | <tr>
144 | <td>11</td>
145 | <td>22</td>
146 | <td>33</td>
147 | <td>44</td>
148 | </tr>
149 | 
150 | <tr>
151 | <td class = "md0" class="md2" colspan = "4">MAIN 2</td>
152 | </tr>
153 | 
154 | <tr>
155 | <td class="md1" colspan = "4">Header1</td>
156 | </tr>
157 | 
158 | <tr>
159 | <td>1</td>
160 | <td>2</td>
161 | <td>3</td>
162 | <td></td>
163 | </tr>
164 | 
165 | <tr>
166 | <td>11</td>
167 | <td>22</td>
168 | <td>33</td>
169 | <td>44</td>
170 | </tr>
171 | 
172 | <tr>
173 | <td class = "md1" colspan = "4">Header2</td>
174 | </tr>
175 | 
176 | 
177 | <tr>
178 | <td>1</td>
179 | <td>2</td>
180 | <td>3</td>
181 | <td></td>
182 | </tr>
183 | 
184 | <tr>
185 | <td>11</td>
186 | <td>22</td>
187 | <td>33</td>
188 | <td>44</td>
189 | </tr>
190 | 
191 | 
192 | </tbody>
193 | </table>'
194 | 
195 | parsedm4 <- XML::htmlParse(tab4_code)
196 | 
197 | suppressMessages(tab4m1 <- htmltab(doc = parsedm4,
198 |                   header = "//thead/tr" + "//td[@class = 'md1']" + "//td[@class = 'md0']",
199 |                   body = "//tbody/tr[not(./td[starts-with(@class, 'md')])]"))
200 | 
201 | expect_that(is.na(tab4m1[1,1]), is_true())
202 | expect_that(tab4m1[2,1], equals("MAIN 1"))
203 | expect_that(tab4m1[6,1], equals("MAIN 2"))
204 | expect_that(tab4m1[2,2], equals("Header1"))
205 | expect_that(tab4m1[4,2], equals("Header2"))
206 | 
207 | })
208 | 
209 | 


--------------------------------------------------------------------------------
/vignettes/.Rapp.history:
--------------------------------------------------------------------------------
  1 | set.seed(1)#
  2 | #
  3 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
  4 | set.seed(1)#
  5 | #
  6 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
  7 | set.seed(1)#
  8 | #
  9 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 10 | set.seed(1)#
 11 | #
 12 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 13 | set.seed(1)#
 14 | #
 15 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 16 | set.seed(1)#
 17 | #
 18 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 19 | set.seed(1)#
 20 | #
 21 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 22 | set.seed(1)#
 23 | #
 24 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 25 | set.seed(1)#
 26 | #
 27 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 28 | set.seed(1)#
 29 | #
 30 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 31 | set.seed(1)#
 32 | #
 33 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 34 | set.seed(1)#
 35 | #
 36 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 37 | set.seed(1)#
 38 | #
 39 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 40 | set.seed(1)#
 41 | #
 42 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 43 | set.seed(1)#
 44 | #
 45 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 46 | set.seed(1)#
 47 | #
 48 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 49 | set.seed(1)#
 50 | #
 51 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 52 | set.seed(1)#
 53 | #
 54 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 55 | set.seed(1)#
 56 | #
 57 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 58 | set.seed(1)#
 59 | #
 60 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 61 | set.seed(1)#
 62 | #
 63 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 64 | set.seed(1)#
 65 | #
 66 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 67 | set.seed(1)#
 68 | #
 69 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 70 | set.seed(1)#
 71 | #
 72 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 73 | set.seed(1)#
 74 | #
 75 | sample(c(1,2,3,4,5,6, 7, 8), 1, replace = F)
 76 | set.seed(1)#
 77 | #
 78 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
 79 | set.seed(1)#
 80 | #
 81 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
 82 | set.seed(1)#
 83 | #
 84 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
 85 | set.seed(1)#
 86 | #
 87 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
 88 | set.seed(1)#
 89 | #
 90 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
 91 | set.seed(1)#
 92 | #
 93 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
 94 | set.seed(1)#
 95 | #
 96 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
 97 | set.seed(1)#
 98 | #
 99 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
100 | set.seed(1)#
101 | #
102 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
103 | set.seed(1)#
104 | #
105 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
106 | set.seed(1)#
107 | #
108 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
109 | set.seed(1)#
110 | #
111 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
112 | set.seed(1)#
113 | #
114 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
115 | set.seed(1)#
116 | #
117 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
118 | set.seed(1)#
119 | #
120 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
121 | set.seed(1)#
122 | #
123 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
124 | set.seed(1)#
125 | #
126 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
127 | set.seed(1)#
128 | #
129 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
130 | set.seed(1)#
131 | #
132 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
133 | set.seed(1)#
134 | #
135 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
136 | set.seed(1)#
137 | #
138 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
139 | set.seed(1)#
140 | #
141 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
142 | set.seed(1)#
143 | #
144 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
145 | set.seed(1)#
146 | #
147 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
148 | set.seed(1)#
149 | #
150 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
151 | set.seed(1)#
152 | #
153 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
154 | set.seed(1)#
155 | #
156 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
157 | set.seed(1)#
158 | #
159 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
160 | set.seed(1)#
161 | #
162 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
163 | set.seed(1)#
164 | #
165 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
166 | set.seed(1)#
167 | #
168 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
169 | set.seed(1)#
170 | #
171 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
172 | set.seed(1)#
173 | #
174 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
175 | set.seed(1)#
176 | #
177 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
178 | set.seed(1)#
179 | #
180 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
181 | set.seed(1)#
182 | #
183 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
184 | set.seed(1)#
185 | #
186 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
187 | set.seed(1)#
188 | #
189 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
190 | set.seed(1)#
191 | #
192 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
193 | set.seed(1)#
194 | #
195 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
196 | set.seed(1)#
197 | #
198 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
199 | set.seed(1)#
200 | #
201 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
202 | set.seed(1)#
203 | #
204 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
205 | set.seed(1)#
206 | #
207 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
208 | set.seed(1)#
209 | #
210 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
211 | set.seed(1)#
212 | #
213 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
214 | set.seed(1)#
215 | #
216 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
217 | set.seed(1)#
218 | #
219 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
220 | set.seed(1)#
221 | #
222 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
223 | set.seed(1)#
224 | #
225 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
226 | set.seed(1)#
227 | #
228 | sample(c(1,2,3,4,5,6, 7, 8, 9), 1, replace = F)
229 | set.seed(1)#
230 | #
231 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
232 | set.seed(1)#
233 | #
234 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
235 | set.seed(1)#
236 | #
237 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
238 | set.seed(1)#
239 | #
240 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
241 | set.seed(1)#
242 | #
243 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
244 | set.seed(1)#
245 | #
246 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
247 | set.seed(1)#
248 | #
249 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
250 | set.seed(1)#
251 | #
252 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
253 | set.seed(1)#
254 | #
255 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
256 | set.seed(1)#
257 | #
258 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
259 | set.seed(1)#
260 | #
261 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
262 | set.seed(1)#
263 | #
264 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
265 | set.seed(1)#
266 | #
267 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
268 | set.seed(1)#
269 | #
270 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
271 | set.seed(1)#
272 | #
273 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
274 | set.seed(1)#
275 | #
276 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
277 | set.seed(1)#
278 | #
279 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
280 | set.seed(1)#
281 | #
282 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
283 | set.seed(1)#
284 | #
285 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
286 | set.seed(1)#
287 | #
288 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
289 | set.seed(1)#
290 | #
291 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
292 | set.seed(1)#
293 | #
294 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
295 | set.seed(1)#
296 | #
297 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
298 | set.seed(1)#
299 | #
300 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
301 | set.seed(1)#
302 | #
303 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
304 | set.seed(1)#
305 | #
306 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
307 | set.seed(1)#
308 | #
309 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
310 | set.seed(1)#
311 | #
312 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10), 1, replace = F)
313 | set.seed(1)#
314 | #
315 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F)
316 | set.seed(1)#
317 | #
318 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F)
319 | set.seed(1)#
320 | #
321 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F)
322 | set.seed(1)#
323 | #
324 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F)
325 | set.seed(1)#
326 | #
327 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F)
328 | set.seed(1)#
329 | #
330 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11), 1, replace = F)
331 | set.seed(1)#
332 | #
333 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
334 | set.seed(1)#
335 | #
336 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
337 | set.seed(1)#
338 | #
339 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
340 | set.seed(1)#
341 | #
342 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
343 | set.seed(1)#
344 | #
345 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
346 | set.seed(1)#
347 | #
348 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
349 | set.seed(1)#
350 | #
351 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
352 | set.seed(1)#
353 | #
354 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
355 | set.seed(1)#
356 | #
357 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
358 | set.seed(221214)#
359 | #
360 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
361 | set.seed(221214)#
362 | #
363 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
364 | set.seed(221214)#
365 | #
366 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
367 | set.seed(221214)#
368 | #
369 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
370 | set.seed(221214)#
371 | #
372 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
373 | set.seed(221214)#
374 | #
375 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
376 | set.seed(221214)#
377 | #
378 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
379 | set.seed(221214)#
380 | #
381 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
382 | set.seed(221214)#
383 | #
384 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
385 | set.seed(221214)#
386 | #
387 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
388 | set.seed(221214)#
389 | #
390 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
391 | set.seed(221214)#
392 | #
393 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
394 | set.seed(221214)#
395 | #
396 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
397 | set.seed(221214)#
398 | #
399 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
400 | set.seed(221214)#
401 | #
402 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
403 | set.seed(221214)#
404 | #
405 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
406 | set.seed(221214)#
407 | #
408 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
409 | set.seed(221214)#
410 | #
411 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
412 | set.seed(221214)#
413 | #
414 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
415 | set.seed(221214)#
416 | #
417 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
418 | set.seed(221214)#
419 | #
420 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
421 | set.seed(221214)#
422 | #
423 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
424 | set.seed(221214)#
425 | #
426 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
427 | set.seed(221214)#
428 | #
429 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
430 | set.seed(221214)#
431 | #
432 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
433 | set.seed(221214)#
434 | #
435 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
436 | set.seed(221214)#
437 | #
438 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
439 | set.seed(221214)#
440 | #
441 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
442 | set.seed(221214)#
443 | #
444 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
445 | set.seed(221214)#
446 | #
447 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
448 | set.seed(221214)#
449 | #
450 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
451 | set.seed(221214)#
452 | #
453 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
454 | set.seed(221214)#
455 | #
456 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
457 | set.seed(221214)#
458 | #
459 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
460 | set.seed(221214)#
461 | #
462 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
463 | set.seed(221214)#
464 | #
465 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
466 | set.seed(221214)#
467 | #
468 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
469 | set.seed(221214)#
470 | #
471 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
472 | set.seed(221214)#
473 | #
474 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
475 | set.seed(221214)#
476 | #
477 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
478 | set.seed(221214)#
479 | #
480 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
481 | set.seed(221214)#
482 | #
483 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
484 | set.seed(221214)#
485 | #
486 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
487 | set.seed(221214)#
488 | #
489 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
490 | set.seed(221214)#
491 | #
492 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
493 | set.seed(221214)#
494 | #
495 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
496 | set.seed(221214)#
497 | #
498 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
499 | set.seed(221214)#
500 | #
501 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
502 | set.seed(221214)#
503 | #
504 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
505 | set.seed(221214)#
506 | #
507 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
508 | set.seed(221214)#
509 | #
510 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
511 | set.seed(221214)#
512 | #
513 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
514 | set.seed(221214)#
515 | #
516 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
517 | set.seed(221214)#
518 | #
519 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
520 | set.seed(221214)#
521 | #
522 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
523 | set.seed(221214)#
524 | #
525 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
526 | set.seed(221214)#
527 | #
528 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
529 | set.seed(221214)#
530 | #
531 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
532 | set.seed(221214)#
533 | #
534 | sample(c(1,2,3,4,5,6, 7, 8, 9, 10, 11, 12), 1, replace = F)
535 | set.seed(221214)#
536 | N <- 100#
537 | #
538 | sample(1:N, 1, replace = F)
539 | set.seed(221214)#
540 | N <- 100#
541 | #
542 | sample(1:N, 1, replace = F)
543 | set.seed(221214)#
544 | N <- 100#
545 | #
546 | sample(1:N, 1, replace = F)
547 | set.seed(221214)#
548 | N <- 100#
549 | #
550 | sample(1:N, 1, replace = F)
551 | set.seed(221214)#
552 | N <- 200#
553 | #
554 | sample(1:N, 1, replace = F)
555 | 200/3
556 | set.seed(221214)#
557 | N <- 1000#
558 | #
559 | sample(1:N, 1, replace = F)
560 | set.seed(221214)#
561 | #
562 | N <- 1000#
563 | x <- rep(NA, N)#
564 | #
565 | for(i 1:N){#
566 | x[i] <- sample(1:N, 1, replace = F)#
567 | }#
568 | #
569 | (1:N) / x
570 | x
571 | set.seed(221214)#
572 | #
573 | N <- 1000#
574 | x <- rep(NA, N)#
575 | #
576 | for(i 1:N){#
577 | x[i] <- sample(1:N, 1, replace = F)#
578 | }#
579 | #
580 | (1:N) / x
581 | set.seed(221214)#
582 | #
583 | N <- 1000#
584 | x <- rep(NA, N)#
585 | #
586 | for(i in 1:N){#
587 | x[i] <- sample(1:N, 1, replace = F)#
588 | }#
589 | #
590 | (1:N) / x
591 | plt(1:N, (1:N) / x)
592 | pl0ot(1:N, (1:N) / x)
593 | plot(1:N, (1:N) / x)
594 | plot(1:N, (1:N) / x)
595 | plot(1:N, x / 1:N)
596 | x / (1:N)
597 | set.seed(221214)#
598 | #
599 | N <- 1000#
600 | x <- rep(NA, N)#
601 | #
602 | for(i in 1:N){#
603 | 	x[i] <- sample(1:N, 1, replace = F)#
604 | }
605 | x
606 | set.seed(221214)#
607 | #
608 | N <- 1000#
609 | x <- rep(NA, N)#
610 | #
611 | for(i in 1:N){#
612 | 	x[i] <- sample(1:i, 1, replace = F)#
613 | }
614 | x
615 | N <- 1000#
616 | x <- rep(NA, N)#
617 | #
618 | for(i in 1:N){#
619 | 	set.seed(221214)#
620 | 	x[i] <- sample(1:i, 1, replace = F)#
621 | }
622 | x
623 | x / (1:N)
624 | plot(1:N, x / 1:N)
625 | library(htmltable)
626 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]")
627 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]", header = 1:3)
628 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]", header = 1:3)
629 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]", header = 3)
630 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[2]", header = 3, body = "tr[position() > 3]")
631 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[3]")
632 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[3]", header = 1)
633 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[4]")
634 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[4]", encoding = "UTF-8")
635 | htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[4]", encoding = "UTF-8", rm_escape = "")
636 | lol <- htmltable(doc = "http://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team", which = "/html/body/div[3]/div[3]/div[4]/table[4]", encoding = "UTF-8", rm_escape = "")
637 | library(dplyr)
638 | lol
639 | install.packages("wikipediatrend")
640 | library(wikipediatrend)
641 | bt_election <- wp_trend( page from lang= "Bundestagswahl",= "2007-01-01",= "de",friendly = T,userAgent = T)
642 | lol <- wp_trend(page = "http://en.wikipedia.org/wiki/Martin_Luther_King,_Jr.", from = "2007-01-01", lang = "en", friendly = T, userAgent = T)
643 | lol
644 | getwd()
645 | bt_election <- wp_trend(  page      = "Bundestagswahl", #
646 |                           from      = "2009-01-01", #
647 |                           lang      = "de", #
648 |                           friendly  = T,#
649 |                           userAgent = T)#
650 | bt_election <- bt_election[ order(bt_election$da
651 | bt_election
652 | ?wp_trend
653 | mlk <- wp_trend(page = "Martin_Luther_King,_Jr.", from = "2010-01-01", lang = "en", friendly = T)
654 | mlk
655 | plot(mlk)
656 | plot(ts(mlk))
657 | plot(mlk, type="h", ylim=c(-1000,40000))
658 | ?wp_trend
659 | head(mlk)
660 | rm(list=ls())
661 | setwd("~/Dropbox/htmltable/vignettes/")
662 | library(knitr)
663 | ?knit
664 | knit("index.Rmd")
665 | install.packages("magrittr")
666 | knit("index.Rmd")
667 | install.packages("knitr")
668 | rm(list=ls())
669 | setwd("~/Dropbox/htmltable/vignettes/")
670 | library(knitr)
671 | knit("index.Rmd")
672 | knit("index.Rmd")
673 | knit("index.Rmd")
674 | knit("index.Rmd")
675 | ?knitr
676 | q()
677 | 


--------------------------------------------------------------------------------
/vignettes/htmltab.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "Hassle-free HTML tables with htmltab"
 3 | author: "Christian Rubba"
 4 | date: "`r Sys.Date()`"
 5 | output: rmarkdown::html_vignette
 6 | vignette: >
 7 |   %\VignetteIndexEntry{htmltab case studies}
 8 |   %\VignetteEngine{knitr::rmarkdown}
 9 |   %\usepackage[utf8]{inputenc}
10 | ---
11 | 
12 | HTML tables are a standard way to display tabular information online. Getting HTML table data into R is fairly straightforward with the `readHTMLTable()` function of the *XML* package. But tables on the web are primarily designed for displaying and consuming data, not for analytical purposes. Peculiar design choices for HTML tables are therefore frequently made which tend to produce useless outputs when run through `readHTMLTable()`. I found that sometimes these outputs could be saved with a little bit of (tedious) post-processing, but just as often they could not. To make working with HTML tables easier and less time-consuming, I developed *htmltab*, a package for the R system that tries to alleviate these problems directly in the parsing stage when the structural information is still available. Its main advantages over `readHTMLTable()` are twofold:
13 | 
14 | - Consideration of row and column spans in the HTML table body and header cells
15 | - More control over the process that translates HTML cells into R table cells
16 | 
17 | This vignette discusses the application of *htmltab* for two use cases where the package provides a significant improvement over `readHTMLTable()`. 
18 | 
19 | (I make use of the R packages *tidyr* and *stringr* to process table outputs. Neither of the three is required for running *htmltab*.)
20 | 
21 | 
22 | ## How to read HTML tables with htmltab()
23 | The principal function of *htmltab* is `htmltab()`. The behaviour of `htmltab()` is modeled closely after `readHTMLTable()`, and many argument names are identical. Any function call requires passing a value to its _doc_ argument. This value may be of three kinds:
24 | 
25 | 1. a URL or file path for the HTML document where the table lives
26 | 2. a parsed HTML object of the entire page of class _HTMLInternalDocument_
27 | 3. a table nodeset of class _XMLNodeSet_
28 | 
29 | The last of these methods returns a single R table object. For the first two, `htmltab()` requires users to be specific about the table they would like to have returned. This is done via the _which_ argument. This may be either a numeric value for the table's position in the page, or a character value that describes an XPath statement.
30 | 
31 | 
32 | ## 1. Corrections for rowspans and colspans by default
33 | In many HTML tables, spans are used to allow cell values to extend across multiple cells. `htmltab()` recognizes spans and expands tables automatically. To illustrate this feature, take a look at the HTML table in the Language section of this [Wikipedia page about Demography in the UK](http://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom#Languages). The header information spans across three consecutive rows. To get the table into R, we have to pass an identifiying information to the _which_ argument. I use an XPath statement that I wrote while exploring the HTML page with Web Developer Tools. One that works is "//th[text() = 'Ability']/ancestor::table":
34 | 
35 | 
36 | ```{r message=FALSE}
37 | library(htmltab)
38 | 
39 | url <- "http://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom"
40 | ukLang <- htmltab(doc = url, which = "//th[text() = 'Ability']/ancestor::table")
41 | head(ukLang)
42 | ```
43 | 
44 | The header information has been recast into a format that respects the hierarchical order of the variables and yet only spans a single line in the R table. If you prefer a different seperator between variables, pass it to the _headerSep_ argument. This format was chosen to make further processing of the table easy. For example, using functionality from the *tidyr* package, the next couple of data cleaning steps may be the following:
45 | 
46 | ```{r message=FALSE}
47 | library(tidyr)
48 | 
49 | ukLang <- gather(ukLang, key, value, -Ability)
50 | ```
51 | 
52 | This statement restructures the variables in a more useful long format. From this we can separate the variables using an appropriate regular expression such as " >> ". 
53 | 
54 | 
55 | ```{r}
56 | ukLang <- separate(ukLang, key, into = c("region", "language", "statistic"), sep = " >> ")
57 | head(ukLang)
58 | ```
59 | 
60 | `htmltab()` also automatically expands row and column spans when they appear in the table's body. 
61 | 
62 | ## 2. More control over cell value conversion
63 | `htmltab()` offers you more control over what part of the HTML table is used in the R table. You can exert this control via `htmltab()`'s _body_, _header_, _bodyFun_, _headerFun_, _rm&#95;escape_, _rm&#95;footnote_, _rm&#95;superscript_, _rm&#95;nodata&#95;cols_, _rm&#95;invisible_ and _rm&#95;whitespace_ arguments. 
64 | 
65 | ### _body_ and _header_ arguments
66 | It is not possible for `htmltab()` to correctly identify header and body elements in all the tables. Although there is a semantically *correct* way to organize header and body elements in HTML tables, web designers do not necessarily need to adhere to them to produce visually appealing tables. *htmltab* employs heuristics for identification but they are no guarantee. If you find that the table is not correctly assembled, you can try to give the function more information through its _header_ and _body_ arguments. These arguments are used to pass information about which rows should be used for the contruction of the header and the body. Both accept numeric values for the rows, but a more robust way is to use an XPath that identifies the respective rows. To illustrate, take a look at this [Wikipedia page about the New Zealand General Election in 2002](http://en.wikipedia.org/wiki/New_Zealand_general_election,_2002#Electorate_results). The table uses cells that span the entire column range to classify General and Maori electorates (yellow background). We need to control for this problem explicitly in the assembling stage. I pass the XPath "//tr[./td[not(@colspan = '10')]]" to the _body_ argument to explicitly discard all rows from the body that have a \<td\> cell with a colspan attribute of 10:
67 | 
68 | 
69 | ```{r}
70 | url <- "http://en.wikipedia.org/wiki/New_Zealand_general_election,_2002"
71 | xp <- "//caption[starts-with(text(), 'Electorate results')]/ancestor::table"
72 | body_xp <- "//tr[./td[not(@colspan = '10')]]"
73 | 
74 | nz1 <- htmltab(doc = url, which = xp, body = body_xp, encoding = "UTF-8")
75 | head(nz1)
76 | ```
77 | 
78 | ### Using table information that intercept body rows
79 | In the previous example, we discarded the two intercepting rows in the body which signified the region of the electorate. You might object that ideally these rows should not be discarded, but used for what they are -- variable/header information! As of version 0.6.0, *htmltab* can process these sort of table designs more efficiently and prepend the information accurately in a new column variable. Information to the _header_ argument can now be passed in the form of X1 + X2 + X3 + ..., where X1 codifies the *main* header (i.e. the one that spans the body grid), and X2, X3, ... signify groups of header information that appear in the body. Please note that the in-body information (X2, X3, ...) must not identify row elements (tr) but individual cells (td or th) from which the value of the new variable can be generated (usually from the node value). To illustrate, consider the following snippet:
80 | 
81 | ```{r}
82 | nz2 <- htmltab(doc = url, which = xp, header = 1 + "//tr/td[@colspan = '10']",
83 |                body = "//tr[./td[not(@colspan = '10')]]", encoding = "UTF-8")
84 | ```
85 | 
86 | Here, we pass '1' to signify that the *main* header information appear in the first row. We add to this the XPath "//td[@colspan = '10']" that refer to the two rows. Generally, you are free to use numeric information or XPath to refer to the values that are takes as header variable. *htmltab* extracts these information and prepends them to the main table.
87 | 
88 | ```{r}
89 | tail(nz2, 9)
90 | ```
91 | 
92 | For more information on this feature take a look at this [blog post](http://r-datacollection.com/blog/htmltab-v0.6.0/) and the Details section of the *htmltab* function in the package documentation. 
93 | 
94 | ### Removal of unneeded information
95 | Many HTML tables include additional information which are of little interest to data analysts such as information encoded in superscript and footnote tags, as well as escape sequences. By default, `htmltab()` removes information from the first two and replaces all escape sequences by a whitespace. You can change this behaviour through the _rm\_superscript_, _rm\_footnotes_, _rm\_escape_, _rm&#95;nodata&#95;cols_, _rm&#95;invisible_ and _rm&#95;whitespace_ arguments. 
96 | 
97 | ## Conclusion
98 | HTML tables are a valuable data source but they frequently violate basic principles of data well-formedness. This is usually for good reason since their primary purpose is to increase readability of tabular information. *htmltab*'s goal is to reduce the need for users to interfere when working with HTML tables by relying on available structural information as well as making some assumptions about the table's design. However, you are free to exert more control over the transformation by specifying various function arguments. 
99 | 


--------------------------------------------------------------------------------