├── .gitignore
├── LICENSE
├── README.md
├── example_script
│   ├── README.md
│   ├── download_pictures_from_google.R
│   ├── iframe_tutorial.Rmd
│   ├── iframe_tutorial.md
│   └── resources
│       ├── README.md
│       ├── turtle0.jpg
│       ├── turtle1.jpg
│       └── turtle2.jpg
└── resources
    ├── .gitkeep
    ├── functions_and_classes.png
    ├── placeholder.jpg
    └── rvest_parallel.md

/.gitignore:
--------------------------------------------------------------------------------
.Rproj.user
.Rhistory
.RData
.Ruserdata

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Yifu Yan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Web Scraping Reference: Cheat Sheet for Web Scraping using R

Inspired by Hartley Brody, this cheat sheet covers web scraping in R with [rvest](https://github.com/hadley/rvest), [httr](https://github.com/r-lib/httr), and [RSelenium](https://github.com/ropensci/RSelenium). It addresses many of the topics in this [blog post](https://blog.hartleybrody.com/web-scraping-cheat-sheet/).

While Hartley uses Python's requests and BeautifulSoup libraries, this cheat sheet covers the use of httr and rvest. rvest is good enough for many scraping tasks, but httr is required for more advanced techniques. The use of RSelenium (a web driver) is also covered.

I also recommend the book [The Ultimate Guide to Web Scraping](https://blog.hartleybrody.com/guide-to-web-scraping/) by Hartley Brody. Although it uses Python libraries, the underlying logic of web scraping is the same, and the same strategies can be applied in any language, including R.

Please post issues [here](https://github.com/yusuzech/r-web-scraping-cheat-sheet/issues) if you find any errors or have any recommendations.

# Table of Contents
1. Web Scraping using rvest and httr
    1. Useful Libraries and Resources
    2. Making Simple Requests
    3. Inspecting Response
    4. Extracting Elements from HTML
    5. Storing Data in R
        1. Storing Data as list
        2. Storing Data as data.frame
    6. Saving Data to disk
        1. Saving Data to csv
        2. Saving Data to SQLite Database
    7. More Advanced Topics
        1. Javascript Heavy Websites
        2. Content Inside iFrames
        3. Sessions and Cookies
        4. Delays and Backing Off
        5. Spoofing the User Agent
        6. Using Proxy Servers
        7. Setting Timeouts
        8. Handling Network Errors
        9. Downloading Files
        10. Logins and Sessions
        11. Web Scraping in Parallel
2. Web Scraping using RSelenium
    1. Why RSelenium
        1. Pros and Cons of Using RSelenium
        2. Useful Resources
    2. Interacting with the Web Driver in RSelenium
        1. How to Start
        2. Navigating to Different URLs
        3. Simulating Scrolls, Clicks, Text Inputs, Logins, and Other Actions
    3. Extracting Content from the Web Page
        1. Extracting Content using RSelenium
        2. Extracting Content using Parsed Page Source and `rvest`
    4. Miscellanea
        1. Javascript
        2. Iframe
3. Change Log

# 1. Web Scraping using rvest and httr
## 1.1. Useful Libraries and Resources

[rvest](https://github.com/hadley/rvest) is built on top of the xml2 package and also accepts configurations from the `httr` package. For most tasks, `rvest` alone is enough; `httr` is needed when we want to add extra configurations to requests.

To install the two packages:

```r
install.packages("rvest")
install.packages("httr")
```

To load them:

```r
library(rvest)
library(httr)
```

There are many resources available online; these are the ones I found most useful:

1. [w3schools CSS selectors reference](https://www.w3schools.com/CSSref/css_selectors.asp): if you forget CSS syntax, just look it up here.
1. [w3schools XPATH reference](): XPath is an alternative way of selecting elements on a page. It is harder to learn, but more flexible and robust.
1. [CSS Diner](https://flukeout.github.io/): the easiest way to learn and understand CSS selectors, by playing a game.
1. [Chrome CSS selector plugin](https://selectorgadget.com/): a convenient tool for choosing CSS selectors.
1. [ChroPath](): a very convenient tool for choosing XPath expressions.
1. [Stack Overflow](https://stackoverflow.com/): you can find answers to most of your problems, whether they concern web scraping, rvest, or CSS.
1. [Web Scraping Sandbox](http://toscrape.com/): a great place to test your web scraping skills.

**Functions and classes in rvest/httr:**

If you ever get confused about the functions and classes involved, you can refer to this image:
![](resources/functions_and_classes.png)

**Note:** sometimes the response is JSON or another format instead of HTML. In those cases you need other functions to parse the content (e.g. `jsonlite::fromJSON()` to parse a JSON string into a list).


## 1.2. Making Simple Requests

rvest provides two ways of making a request: `read_html()` and `html_session()`.
`read_html()` parses an HTML file or a URL into an xml_document. `html_session()` is built on `GET()` from the httr package, creates a session, and can accept configurations defined by the httr package.
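As a quick illustration of that last point, here is a minimal sketch of attaching httr configurations, in this case a custom user agent and a timeout, to a session. The URL and the user-agent string are placeholders:

```R
library(rvest)
library(httr)

# html_session() forwards extra arguments to httr, so request options
# such as a user agent or a timeout can be attached to the whole session
my_session <- html_session(
    "http://example.com/page",
    user_agent("Mozilla/5.0 (compatible; my-scraper)"),  # placeholder UA string
    timeout(10)                                          # give up after 10 seconds
)
status_code(my_session)
```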
Reading a URL:

```R
# make a GET request and parse the website into an xml document
pagesource <- read_html("http://example.com/page")

# html_session() creates a session and accepts httr methods
my_session <- html_session("http://example.com/page")
# html_session() is built upon httr, so you can also get the response from the session
response <- my_session$response
```

Alternatively, the GET and POST methods are available in the httr package.

```R
library(httr)
response <- GET("http://example.com/page")
# or
response <- POST("http://example.com/page",
                 body = list(a = 1, b = 2))
```

## 1.3. Inspecting Response

Check the status code:

```R
status_code(my_session)
status_code(response)
```

Get the response and its content:

```R
# response
response <- my_session$response
# retrieve content as raw bytes
content_raw <- content(my_session$response, as = "raw")
# retrieve content as text
content_text <- content(my_session$response, as = "text")
# retrieve content parsed automatically
content_parsed <- content(my_session$response, as = "parsed")
```

**Note:** content may occasionally be parsed incorrectly. In those situations, retrieve the content as text or raw and use other libraries or functions to parse it correctly.

Search for a specific string:

```R
library(stringr)
# a regular expression can also be used here
if (str_detect(content_text, "blocked")) {
    print("blocked from website")
}
```

Check the content type:

```R
response$headers$`content-type`
```

Check the HTML structure:

```R
my_structure <- html_structure(content_parsed)
```

## 1.4. Extracting Elements from HTML

Using regular expressions to parse HTML is generally not a good idea, although it does have its uses, such as scraping all email addresses from a website; there is a detailed discussion of this topic on [Stack Overflow](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags).

**Using rvest:**

I will scrape https://scrapethissite.com/ for demonstration, since it serves static HTML.

For the purpose of extracting elements, `read_html()` and `html_session()` are both fine. `read_html()` returns an xml_document, while `html_session()` creates a session that also contains the response.

```R
my_session <- html_session("https://scrapethissite.com/pages/simple/")
```

Look for nodes:

```R
my_nodes <- my_session %>% html_elements(".country")
```

Look for attributes:

```R
my_attributes <- my_session %>% html_elements(".country-capital") %>% html_attr("class")
```

Look for text:

```R
my_texts <- my_session %>% html_elements(".country-capital") %>% html_text()
```
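Since the resources in section 1.1 mention XPath as an alternative to CSS selectors, here is a minimal sketch of the same kind of extraction using the `xpath` argument of `html_elements()`. The expressions are illustrative; they are written to roughly match the class-based CSS selectors above rather than the exact markup of the demo page:

```R
# XPath can be used wherever a CSS selector can; html_elements()
# accepts either a css or an xpath argument
my_capitals_xpath <- my_session %>%
    html_elements(xpath = "//*[contains(@class, 'country-capital')]") %>%
    html_text()

# attributes work the same way, e.g. collecting link targets from the page
my_links <- my_session %>%
    html_elements("a") %>%
    html_attr("href")
```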
## 1.5. Storing Data in R

rvest can return a vector of elements, or even a table of elements, so the results are easy to store in R.

### 1.5.1. Storing Data as list

Usually, rvest returns a vector, which is straightforward to store:

```R
my_texts <- my_session %>% html_elements(".country-capital") %>% html_text()
```

### 1.5.2. Storing Data as data.frame

We can combine vectors into a table, or use `html_table()` to extract an HTML table directly into a data.frame.

```R
my_country <- my_session %>% html_elements(".country-name") %>% html_text()
my_capitals <- my_session %>% html_elements(".country-capital") %>% html_text()
my_table <- data.frame(country = my_country, capital = my_capitals)
```

## 1.6. Saving Data to disk
### 1.6.1. Saving Data to csv

If the data is already stored as a data.frame:

```R
write.csv(my_table, file = "my_table.csv")
```

### 1.6.2. Saving Data to SQLite Database

After creating the database "webscrape.db":

```R
library(RSQLite)
connection <- dbConnect(SQLite(), "webscrape.db")
dbWriteTable(conn = connection, name = "country_capital", value = my_table)
dbDisconnect(conn = connection)
```

## 1.7. More Advanced Topics

### 1.7.1. Javascript Heavy Websites

For javascript-heavy websites, there are three possible solutions:

1. Execute javascript in R
2. Use developer tools (e.g. [Network in Chrome](https://developers.google.com/web/tools/chrome-devtools/network-performance/))
3. Use RSelenium or another web driver

Each method has its pros and cons:

1. Executing Javascript in R is the most difficult, since it requires some knowledge of Javascript, but it makes scraping javascript-heavy websites possible with rvest alone.
2. Using developer tools is not difficult. The pro is that you only need to study a few examples before you can work on your own. The con is that as the website structure becomes more complicated, more knowledge of HTTP is required.
3. RSelenium is by far the easiest solution. The pro is that it is easy to learn and use. The con is that it can be unstable at times, and related resources online are limited; in many situations you may need to refer to Python Selenium code instead.

#### 1. Execute javascript

I learned how to use Javascript for scraping from this [post](https://datascienceplus.com/scraping-javascript-rendered-web-content-using-r/). Since I'm not an expert in Javascript, I recommend searching for other related resources online.

If you don't know Javascript, this method is unlikely to suit you. Javascript is not an easy subject, and I don't recommend learning it if your only purpose is web scraping in R. The following two methods are much easier.
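As a small illustration of what this first approach can look like, here is a hedged sketch using the V8 package, which embeds a Javascript engine in R. The page URL, the script selector, and the `appData` variable name are hypothetical placeholders:

```R
library(rvest)
library(V8)

page <- read_html("http://example.com/page")

# pull the text of an inline <script> tag that builds the data we want
js_code <- page %>%
    html_element("script#data-script") %>%   # hypothetical selector
    html_text()

ctx <- v8()                      # create a Javascript context
ctx$eval(js_code)                # run the page's script inside it
app_data <- ctx$get("appData")   # copy the resulting JS object into R
```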
#### 2. Use developer tools

I learned this trick from Hartley's blog; the following section is quoted from his [post](https://blog.hartleybrody.com/web-scraping-cheat-sheet/):

>Contrary to popular belief, you do not need any special tools to scrape websites that load their content via Javascript. For the information to get from their server and show up on a page in your browser, that information had to have been returned in an HTTP response somewhere.
>
>It usually means that you won’t be making an HTTP request to the page’s URL that you see at the top of your browser window, but instead you’ll need to find the URL of the AJAX request that’s going on in the background to fetch the data from the server and load it into the page.
>
>There’s not really an easy code snippet I can show here, but if you open the Chrome or Firefox Developer Tools, you can load the page, go to the “Network” tab and then look through the all of the requests that are being sent in the background to find the one that’s returning the data you’re looking for. Start by filtering the requests to only XHR or JS to make this easier.
>
>Once you find the AJAX request that returns the data you’re hoping to scrape, then you can make your scraper send requests to this URL, instead of to the parent page’s URL. If you’re lucky, the response will be encoded with JSON which is even easier to parse than HTML.

So, as Hartley says, everything displayed in your browser must have been sent to you at some point as JSON, HTML, or some other format; what you need to do is capture that response.

I have answered several questions on Stack Overflow about scraping JavaScript-rendered content. You may run into the same problems described in these posts; the answers should give you a good idea of how to use developer tools:

https://stackoverflow.com/questions/50596714/how-to-scrap-a-jsp-page-in-r/50598032#50598032
https://stackoverflow.com/questions/50765111/hover-pop-up-text-cannot-be-selected-for-rvest-in-r/50769875#50769875
https://stackoverflow.com/questions/50900987/scraping-dl-dt-dd-html-data/50922733#50922733
https://stackoverflow.com/questions/50693362/unable-to-extract-thorough-data-using-rvest/50730204#50730204
https://stackoverflow.com/questions/50997094/trouble-reaching-a-css-node/50997121#50997121
https://stackoverflow.com/questions/50262108/scraping-javascript-rendered-content-using-r/50263349#50263349
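To make the workflow concrete, here is a minimal sketch of requesting such a background JSON endpoint directly with httr and parsing it with jsonlite. The endpoint URL is a made-up placeholder; in practice you would copy it from the Network tab:

```R
library(httr)
library(jsonlite)

# URL of the AJAX request spotted in the Network tab (placeholder)
ajax_url <- "http://example.com/api/items?page=1"

response <- GET(ajax_url)
stop_for_status(response)   # fail early on HTTP errors

# parse the JSON body into R objects (lists / data.frames)
json_text <- content(response, as = "text", encoding = "UTF-8")
parsed <- fromJSON(json_text)
```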
#### 3. Use RSelenium or another web driver

RSelenium launches a Chrome/Firefox/IE browser in which you can simulate human actions such as clicking on links or scrolling up and down.

It is a very convenient tool, and it renders JavaScript and interactive content automatically, so you don't need to worry about the complexities of HTTP and AJAX. However, it also has some limitations:

1. It is very slow. Depending on the complexity of the website, it can take several seconds to render a single page, whereas httr/rvest takes less than a second. That is fine if you only want to scrape a few hundred pages, but if you want to scrape thousands or tens of thousands of pages, speed becomes an issue.
2. There are few online resources on RSelenium. In many situations you won't find related posts on Stack Overflow that solve your problem; you may need to refer to Python/Java Selenium posts for answers, and sometimes those answers cannot be applied in R.

More detailed usage is explained in **Web Scraping using RSelenium**.

### 1.7.2. Content Inside iFrames

Iframes are other web pages embedded inside the page you are viewing, as explained on [Wikipedia](https://en.wikipedia.org/wiki/HTML_element#Frames):

> Frames allow a visual HTML Browser window to be split into segments, each of which can show a different document. This can lower bandwidth use, as repeating parts of a layout can be used in one frame, while variable content is displayed in another. This may come at a certain usability cost, especially in non-visual user agents,[[51\]](https://en.wikipedia.org/wiki/HTML_element#cite_note-58) due to separate and independent documents (or websites) being displayed adjacent to each other and being allowed to interact with the same parent window. Because of this cost, frames (excluding the `