├── 4_Selenium ├── Selenium_Driver_Install.md ├── figures │ └── scraping_flowchart.png └── Selenium.ipynb ├── 2_HTML_CSS ├── img │ ├── html.png │ ├── layout.png │ ├── attributes.png │ ├── css-rule-2.png │ ├── css-rule.png │ ├── html-tags.png │ ├── workflow.png │ ├── classes-and-ids.png │ ├── inspect-element.png │ ├── Infographic-HTML-CSS.png │ ├── inspect-element-css.png │ └── scraping_flowchart.png ├── answers │ ├── slide_exercise_answers.pdf │ ├── table_render.html │ └── slide_exercise_answers.md ├── 2_HTML_notes.md └── 1_HTML_slides.html ├── 1_APIs ├── figures │ ├── ellington.jpg │ ├── wikipedia.png │ ├── google_link.png │ ├── nytimes_docs.png │ ├── nytimes_key.png │ ├── google_search.png │ ├── nytimes_start.png │ ├── google_link_change.png │ └── scraping_flowchart.png ├── 2_api_full-notes.md ├── 1_api_slides.html ├── all-formated.csv ├── 3_api_workbook.ipynb └── 4_api_solutions.ipynb ├── 3_Beautiful_Soup ├── figures │ └── scraping_flowchart.png ├── 2_bs_solutions.ipynb └── 1_bs_workbook.ipynb ├── README.md ├── .gitignore ├── Tech-Requirements.md ├── 0_Intro.html ├── Bonus_Materials └── 1_APIs_in_R.Rmd └── LICENSE /4_Selenium/Selenium_Driver_Install.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /2_HTML_CSS/img/html.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/html.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/layout.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/layout.png -------------------------------------------------------------------------------- /1_APIs/figures/ellington.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/ellington.jpg -------------------------------------------------------------------------------- /1_APIs/figures/wikipedia.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/wikipedia.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/attributes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/attributes.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/css-rule-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/css-rule-2.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/css-rule.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/css-rule.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/html-tags.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/html-tags.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/workflow.png -------------------------------------------------------------------------------- /1_APIs/figures/google_link.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/google_link.png -------------------------------------------------------------------------------- /1_APIs/figures/nytimes_docs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/nytimes_docs.png -------------------------------------------------------------------------------- /1_APIs/figures/nytimes_key.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/nytimes_key.png -------------------------------------------------------------------------------- /1_APIs/figures/google_search.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/google_search.png -------------------------------------------------------------------------------- /1_APIs/figures/nytimes_start.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/nytimes_start.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/classes-and-ids.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/classes-and-ids.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/inspect-element.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/inspect-element.png -------------------------------------------------------------------------------- /1_APIs/figures/google_link_change.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/google_link_change.png -------------------------------------------------------------------------------- /1_APIs/figures/scraping_flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/1_APIs/figures/scraping_flowchart.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/Infographic-HTML-CSS.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/Infographic-HTML-CSS.png -------------------------------------------------------------------------------- 
/2_HTML_CSS/img/inspect-element-css.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/inspect-element-css.png -------------------------------------------------------------------------------- /2_HTML_CSS/img/scraping_flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/img/scraping_flowchart.png -------------------------------------------------------------------------------- /4_Selenium/figures/scraping_flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/4_Selenium/figures/scraping_flowchart.png -------------------------------------------------------------------------------- /2_HTML_CSS/answers/slide_exercise_answers.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/2_HTML_CSS/answers/slide_exercise_answers.pdf -------------------------------------------------------------------------------- /3_Beautiful_Soup/figures/scraping_flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dlab-berkeley/python-data-from-web/HEAD/3_Beautiful_Soup/figures/scraping_flowchart.png -------------------------------------------------------------------------------- /2_HTML_CSS/answers/table_render.html: --------------------------------------------------------------------------------
<table>
  <tr>
    <td>Kurtis</td>
    <td>McCoy</td>
  </tr>
  <tr>
    <td>Leah</td>
    <td>Guerrero</td>
  </tr>
</table>
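The two rows above render as a simple two-column table. As a quick illustration (an addition, not part of the original answer key), a table like this maps directly onto tabular data in Python; the sketch below assumes pandas and an HTML parser such as lxml are installed, which the workshop requirements do not list.

```python
# Hypothetical sketch: load the small table above into a pandas DataFrame.
# Assumes pandas (plus an HTML parser such as lxml) is available.
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><td>Kurtis</td><td>McCoy</td></tr>
  <tr><td>Leah</td><td>Guerrero</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
print(tables[0])
#         0         1
# 0  Kurtis     McCoy
# 1    Leah  Guerrero
```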
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # python-data-from-web
2 | API and web scraping workshops
3 | 
4 | These workshops were originally developed by [Rochelle Terman](https://github.com/rochelleterman).
5 | 
6 | [![Binder](http://mybinder.org/badge.svg)](http://mybinder.org/repo/dlab-berkeley/python-data-from-web)
7 | 
8 | # Extracting Data from the Internet in Python
9 | 
10 | This workshop will cover how to extract data from the web using Python. We'll be covering both APIs and webscraping.
11 | 
12 | ### Topics Covered
13 | 
14 | - How the web works
15 | - Accessing databases via RESTful APIs
16 | - HTML / CSS
17 | - Manipulating a webpage with Google DevTools
18 | - Webscraping with Beautiful Soup
19 | - Scraping JavaScript-heavy sites and interactive sites with Selenium
20 | 
21 | ### Requirements
22 | 
23 | This workshop will be using the Python programming language. See the software requirements [here](Tech-Requirements.md).
24 | 
25 | We will assume a basic knowledge of Python. If you've taken the D-Lab's Python Intensive, that should be sufficient.
26 | 
27 | **Please note that these materials are still being updated.**
28 | 
--------------------------------------------------------------------------------
/2_HTML_CSS/answers/slide_exercise_answers.md:
--------------------------------------------------------------------------------
1 | ### Exercise 1: Find the CSS selectors for the following elements in the HTML above. (Hint: There will be multiple solutions for each)
2 | 
3 | The entire table: `table`
4 | Just the row containing "Kurtis McCoy": `.kurtis`
5 | Just the elements containing first names: `.firstname`
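As an illustration (not part of the original answer key), these selectors can also be checked programmatically with Beautiful Soup's `.select()` method. The HTML below is a hypothetical stand-in for the slide's table, using the `kurtis` and `firstname` class names the answers refer to.

```python
# Hypothetical check of the selectors above; the HTML is a stand-in for the
# slide example, not the original markup.
from bs4 import BeautifulSoup

html = """
<table>
  <tr class="kurtis"><td class="firstname">Kurtis</td><td>McCoy</td></tr>
  <tr><td class="firstname">Leah</td><td>Guerrero</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("table")))                      # 1 -> the entire table
print(len(soup.select(".kurtis")))                    # 1 -> the Kurtis McCoy row
print([td.text for td in soup.select(".firstname")])  # ['Kurtis', 'Leah']
```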
6 | 7 | ### Exercise 3: Go to http://rochelleterman.github.io/. Using Google Chrome's inspect element: 8 | 9 | Change the background color of each of the rows in the table: 10 | 11 | ~~~ 12 | #godfathers { 13 | background-color: blue; 14 | } 15 | #mexican { 16 | background-color: green; 17 | } 18 | #cities { 19 | background-color: red; 20 | } 21 | #wu-tang { 22 | background-color: purple; 23 | } 24 | #wire { 25 | background-color: orange; 26 | } 27 | #comedians { 28 | background-color: cyan; 29 | } 30 | #holidays { 31 | background-color: yellow; 32 | } 33 | ~~~ 34 | 35 | 36 | Find the image source URL 37 | 38 | ~~~ 39 | Draky playing tennis 40 | ~~~ 41 | 42 | Find the HREF attribute of the link. 43 | 44 | ~~~ 45 | link 46 | ~~~ -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | -------------------------------------------------------------------------------- /Tech-Requirements.md: -------------------------------------------------------------------------------- 1 | #Setup 2 | 3 | Once you've installed all of the software below, test your installation by following the instructions at the bottom on this page. 4 | 5 | ## 1. The Bash Shell 6 | Bash is a commonly-used shell that gives you the power to do simple tasks more quickly. 7 | 8 | #### Windows 9 | 10 | Install Git for Windows by downloading and running the [installer](http://msysgit.github.io/). This will provide you with both Git and Bash in the Git Bash program. **NOTE**: on the ~6th step of installation, you will need to select the option "Use Windows' default console window" rather than the default of "Use MinTTY" in order for nano to work correctly. 11 | 12 | After the installer does its thing, it leaves the window open, so that you can play with the "Git Bash". 13 | 14 | Chances are that you want to have an easy way to restart that Git Bash. 
You can install shortcuts in the start menu, on the desktop, or in the QuickStart bar by calling the script /share/msysGit/add-shortcut.tcl (call it without parameters to see a short help text).
15 | 
16 | #### Mac OS X
17 | 
18 | The default shell in all versions of Mac OS X is bash, so no need to install anything. You access bash from the Terminal (found in `/Applications/Utilities`). You may want to keep Terminal in your dock for this class.
19 | 
20 | #### Linux
21 | 
22 | The default shell is usually Bash, but if your machine is set up differently you can run it by opening a terminal and typing bash. There is no need to install anything.
23 | 
24 | ## 2. Google Chrome & Firefox
25 | 
26 | We'll be using Google Chrome as our main web browser. Download [here](https://www.google.com/chrome/).
27 | 
28 | For Selenium, we need to use Firefox. Download [here](https://www.mozilla.org/en-US/firefox/new/).
29 | 
30 | ## 3. Python
31 | Python is a popular language for scientific computing, and great for general-purpose programming as well. Installing all of its scientific packages individually can be a bit difficult, so we recommend an all-in-one installer.
32 | 
33 | Regardless of how you choose to install it, please make sure you install Python version 3.2 or above.
34 | 
35 | For helpful information on switching between Python 2 and 3 environments in Anaconda, see [here](https://www.continuum.io/blog/developer-blog/python-3-support-anaconda).
36 | 
37 | We will teach using the Jupyter (aka IPython) notebook, a programming environment that runs in a web browser. Jupyter notebooks are included in the all-in-one installer.
38 | 
39 | #### Windows
40 | 
41 | * Download and install [Anaconda](https://store.continuum.io/cshop/anaconda/).
42 | * Download the default Python 3 installer. Use all of the defaults for installation except make sure to check **Make Anaconda the default Python.**
43 | 
44 | #### Mac OS X
45 | 
46 | * Download and install [Anaconda](https://store.continuum.io/cshop/anaconda/).
47 | * Download the default Python 3 installer. Use all of the defaults for installation except make sure to check **Make Anaconda the default Python.**
48 | 
49 | #### Linux
50 | 
51 | We recommend the all-in-one scientific Python installer [Anaconda](http://continuum.io/downloads.html). (Installation requires using the shell, and if you aren't comfortable doing the installation yourself, just download the installer and we'll help you during the class.)
52 | 
53 | 1. Download the installer that matches your operating system and save it in your home folder. Download the default Python 3 installer.
54 | 2. Open a terminal window.
55 | 3. Type `bash Anaconda-` and then press tab. The name of the file you just downloaded should appear.
56 | 4. Press enter. You will follow the text-only prompts. When there is a colon at the bottom of the screen, press the down arrow to move down through the text. Type `yes` and press enter to approve the license. Press enter to approve the default location for the files. Type `yes` and press enter to prepend Anaconda to your `PATH` (this makes the Anaconda distribution the default Python).
57 | 
58 | ## Testing your installation
59 | 
60 | Open a command line window ('terminal' or, on Windows, 'git bash'), and enter the following commands (without the $ sign):
61 | 
62 | ```bash
63 | $ python --version
64 | ```
65 | 
66 | The Python version should include "Anaconda" and its version information.
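One extra check you may find useful (this is an addition, not part of the original guide): the snippet below tries to import the third-party libraries these workshops rely on. The exact package list is an assumption based on the lesson topics (requests, Beautiful Soup, and Selenium), not an official requirements file.

```python
# Optional sanity check -- the package list is an assumption based on the
# workshop topics, not an official requirements file.
checks = {"requests": "requests", "bs4": "beautifulsoup4", "selenium": "selenium"}

for module, pip_name in checks.items():
    try:
        __import__(module)
        print("{}: OK".format(module))
    except ImportError:
        print("{}: missing -- try `pip install {}`".format(module, pip_name))
```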
67 | 
68 | Jupyter Notebook is a Python development environment that comes pre-installed with the Anaconda Python distribution. To see if you have it, type the following into your terminal window:
69 | 
70 | ```bash
71 | $ jupyter notebook
72 | ```
73 | 
74 | This should open a programming interface in your default web browser. It may take a few minutes the first time. To close, just close your browser and then `CTRL-C` to end the process in the command line.
75 | 
76 | Software Carpentry maintains a list of common issues that occur during installation, which may be useful for our class: [Configuration Problems and Solutions wiki page.](https://github.com/swcarpentry/workshop-template/wiki/Configuration-Problems-and-Solutions)
77 | 
78 | Credit: Thanks to [Software Carpentry](http://software-carpentry.org/workshops/) for providing installation guidelines.
79 | 
--------------------------------------------------------------------------------
/0_Intro.html:
--------------------------------------------------------------------------------
PS239T: Welcome!
--------------------------------------------------------------------------------
/Bonus_Materials/1_APIs_in_R.Rmd:
--------------------------------------------------------------------------------
1 | ### STEP 4: Constructing API GET Requests in R
2 | 
3 | Because using Web APIs in R will involve repeatedly constructing different GET requests with slightly different components each time, it is helpful to store many of the individual components as objects and combine them using ```paste()``` when ready to send the request.
4 | 
5 | In the first place, we know that every call will require us to provide a) a base URL for the API, b) some authorization code or key, and c) a format for the response.
6 | 
7 | ```{r}
8 | # Create objects holding the key, base url, and response format
9 | key<-"ef9055ba947dd842effe0ecf5e338af9:15:72340235"
10 | base.url<-"http://api.nytimes.com/svc/search/v2/articlesearch"
11 | response.format<-".json"
12 | ```
13 | 
14 | Secondly, we need to specify our search terms, along with any filters to be placed on the results. In this case, we are searching for the phrase "jazz is dead", though we specifically want it to appear in the body of the text.
15 | ```{r}
16 | # Specify a main search term (q)
17 | search.term<-"jazz is dead"
18 | 
19 | # Specify and encode filters (fq)
20 | filter.query<-"body:\"jazz is dead\""
21 | ```
22 | 
23 | Note that it can often be tricky to properly re-format character strings stored in R objects to character strings suitable for GET requests. For example, the filter above uses quotation marks to specify that we wanted to retrieve the phrase exactly. But to include those quotation marks inside a character string that --- following R syntax --- must itself be surrounded by double quotation marks, these original characters need to be escaped with a backslash. This results in the stored R string appearing to be different from the parsed R string.
24 | ```{r}
25 | # NOTE: double quotes within double quotes must be escaped with \ so R can parse the character string
26 | print(filter.query) # How R stores the string
27 | cat(filter.query) # How R parses the string
28 | ```
29 | 
30 | To overcome some of these encoding issues, it is often helpful to URL encode our strings.
URL encoding basically translates punctuation marks, white space, and other non-alphanumeric characters into a series of unique characters only recognizable by URL decoders. If you've ever seen %20 in a URL, this is actually a placeholder for a single space. R provides helpful functions for doing this translation automatically.
31 | ```{r}
32 | # URL-encode the search and its filters
33 | search.term<-URLencode(URL = search.term, reserved = TRUE)
34 | filter.query<-URLencode(URL = filter.query, reserved = TRUE)
35 | print(search.term)
36 | print(filter.query)
37 | ```
38 | 
39 | Once all the pieces of our GET request are in place, we can use either ```paste()``` or ```paste0()``` to combine a number of different character strings into a single character string. This final string will be our URL for the GET request.
40 | ```{r}
41 | # Paste components together to create URL for the GET request
42 | get.request<-paste0(base.url, response.format, "?", "q=", search.term, "&fq=", filter.query, "&api-key=", key)
43 | print(get.request)
44 | ```
45 | 
46 | Once we have the URL complete, we can send a properly formatted GET request. There are several packages that can do this, but ***httr*** provides a good balance of simplicity and reliability. The main function of interest here is ```GET()```:
47 | ```{r}
48 | # Send the GET request using the httr package
49 | response<-httr::GET(url = get.request)
50 | print(response)
51 | ```
52 | 
53 | The ```content()``` function allows us to extract the HTML response in a format of our choosing (raw text, in this case):
54 | ```{r}
55 | # Inspect the content of the response, parsing the result as text
56 | response<-httr::content(x = response, as = "text")
57 | substr(x = response, start = 1, stop = 1000)
58 | ```
59 | 
60 | The final step in the process involves converting the results from JSON format to something easier to work with -- notably a data.frame. The ***jsonlite*** package provides several easy conversion functions for moving between JSON and vectors, data.frames, and lists.
61 | ```{r}
62 | # Convert JSON response to a dataframe
63 | response.df<-jsonlite::fromJSON(txt = response, simplifyDataFrame = TRUE, flatten = TRUE)
64 | 
65 | # Inspect the dataframe
66 | str(response.df, max.level = 3)
67 | 
68 | # Get number of hits
69 | print(response.df$response$meta$hits)
70 | ```
71 | 
72 | Of course, most experiences using Web APIs will require *multiple* GET requests, each different from the next. To speed this process along, we can create a function that can take several arguments and then automatically generate a properly formatted GET request URL.
Here, for instance, is one such function we might write: 73 | ```{r} 74 | # Write a function to create get requests 75 | nytapi<-function(search.terms=NULL, begin.date=NULL, end.date=NULL, page=NULL, 76 | base.url="http://api.nytimes.com/svc/search/v2/articlesearch", 77 | response.format=".json", 78 | key="ef9055ba947dd842effe0ecf5e338af9:15:72340235"){ 79 | 80 | # Combine parameters 81 | params<-list( 82 | c("q", search.terms), 83 | c("begin_date", begin.date), 84 | c("end_date", end.date), 85 | c("page", page) 86 | ) 87 | params<-params[sapply(X = params, length)>1] 88 | params<-sapply(X = params, FUN = paste0, collapse="=") 89 | params<-paste0(params, collapse="&") 90 | 91 | # URL encode query portion 92 | query<-URLencode(URL = params, reserved = FALSE) 93 | 94 | # Combine with base url and other options 95 | get.request<-paste0(base.url, response.format, "?", query, "&api-key=", key) 96 | 97 | # Send GET request 98 | response<-httr::GET(url = get.request) 99 | 100 | # Parse response to JSON 101 | response<-httr::content(response, "text") 102 | response<-jsonlite::fromJSON(txt = response, simplifyDataFrame = T, flatten = T) 103 | 104 | return(response) 105 | } 106 | ``` 107 | 108 | Now that we have our handy NYT API function, let's try and do some data analysis. To figure out whether Duke Ellington is "trending" over the past few years, we can start by using our handy function to get a count of how often the New York Times mentions the Duke... 109 | 110 | ```{r} 111 | # Get number of hits, number of page queries 112 | duke<-nytapi(search.terms = "duke ellington", begin.date = 20050101, end.date = 20150101) 113 | hits<-duke$response$meta$hits 114 | print(hits) 115 | round(hits/10) 116 | ``` 117 | 118 | After making a quick call to the API, it appears that we have a total of 1059 hits. Since the API only allows us to download 10 results at a time, we need to make 106 calls! 119 | ```{r} 120 | # Get all articles 121 | duke.articles<-sapply(X = 0:105, FUN = function(page){ 122 | #cat(page, "") 123 | response<-tryCatch(expr = { 124 | r<-nytapi(search.terms = "duke ellington", begin.date = 20050101, end.date = 20150101, page = page) 125 | r$response$docs 126 | }, error=function(e) NULL) 127 | return(response) 128 | }) 129 | 130 | # Combine list of dataframes 131 | duke.articles<-duke.articles[!sapply(X = duke.articles, FUN = is.null)] 132 | duke.articles<-plyr::rbind.fill(duke.articles) 133 | ``` 134 | 135 | To figure out how Duke's popularity is changing over time, all we need to do is add an indicator for the year and month each article was published in, and then use the ***plyr*** package to count how many articles appear with each year-month combination: 136 | ```{r} 137 | # Add year-month indicators 138 | duke.articles$year.month<-format(as.Date(duke.articles$pub_date), "%Y-%m") 139 | duke.articles$year.month<-as.Date(paste0(duke.articles$year.month, "-01")) 140 | 141 | # Count articles per month 142 | library(plyr) 143 | duke.permonth<-ddply(.data = duke.articles, .variables = "year.month", summarize, count=length(year.month)) 144 | 145 | # Plot the trend over time 146 | library(ggplot2) 147 | ggplot(data = duke.permonth, aes(x = year.month, y = count))+geom_point()+geom_smooth(se=F)+ 148 | theme_bw()+xlab(label = "Date")+ylab(label = "Article Count")+ggtitle(label = "Coverage of Duke Ellington") 149 | ``` 150 | 151 | Looks like he actually *is* getting more popular of late! 
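For readers following along in Python rather than R, here is a rough, hypothetical equivalent of a single request in the workshop's main language, using the requests library. The endpoint and parameter names mirror the R code above; you would need to substitute your own API key.

```python
# Rough Python counterpart to the R GET request above (an illustration, not
# part of the original .Rmd). Endpoint and parameter names are taken from the
# R code; replace the placeholder key with your own.
import requests

base_url = "http://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {
    "q": "duke ellington",
    "begin_date": "20050101",
    "end_date": "20150101",
    "page": 0,
    "api-key": "YOUR_KEY_HERE",
}

response = requests.get(base_url, params=params)
response.raise_for_status()
data = response.json()

# Same field the R code inspects: response$meta$hits
print(data["response"]["meta"]["hits"])
```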
152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 |
--------------------------------------------------------------------------------
/2_HTML_CSS/2_HTML_notes.md:
--------------------------------------------------------------------------------
1 | # Webscraping 1: HTML, CSS, and Developer Tools
2 | 
3 | > ## Learning Objectives
4 | >
5 | > * Explain the difference between webscraping and working with APIs
6 | > * Understand how HTML works with your browser to display a website
7 | > * Identify HTML tags and attributes
8 | > * Understand how CSS works to format a website
9 | > * Identify CSS selectors
10 | > * Alter a website using Google Developer Tools.
11 | 
12 | ### Accessing Data: Some Preliminary Considerations
13 | 
14 | Whenever you're trying to get information from the web, it's very important to first know whether you're accessing it through appropriate means.
15 | 
16 | The UC Berkeley library has some excellent resources on this topic. Here is a flowchart that can help guide your course of action.
17 | 
18 | ![](img/scraping_flowchart.png)
19 | 
20 | You can see the library's licensed sources [here](http://guides.lib.berkeley.edu/text-mining).
21 | 
22 | ## Why Webscrape
23 | 
24 | * Tons of web data useful for social scientists and humanists
25 |     * social media
26 |     * news media
27 |     * government publications
28 |     * organizational records
29 | 
30 | * Two ways to get data off the web
31 |     * APIs - i.e. application-facing, for computers (last week)
32 |     * Webscraping - i.e. user-facing websites for humans (this week and next week)
33 | 
34 | ## Webscraping v. APIs
35 | 
36 | * Webscraping Benefits
37 |     * Any content that can be viewed on a webpage can be scraped. [Period](https://blog.hartleybrody.com/web-scraping/)
38 |     * No API needed
39 |     * No rate-limiting or authentication (usually)
40 | 
41 | * Webscraping Challenges
42 |     * Rarely tailored for researchers
43 |     * Messy, unstructured, inconsistent
44 |     * Entirely site-dependent
45 | 
46 | * Rule of thumb:
47 |     - Check for an API first. If one is not available, scrape.
48 | 
49 | ## Some Disclaimers
50 | 
51 | * Check a site's terms and conditions before scraping.
52 | * Be nice - don't hammer the site's server.
53 | * Sites change their layout all the time. Your scraper will break.
54 | 
55 | ## What's a website
56 | 
57 | * Some combination of codebase and database
58 | * The "front end" product is HTML + CSS stylesheets + JavaScript
59 | 
60 | ![html](img/html.png)
61 | 
62 | * Your browser turns that into a tidy layout
63 | 
64 | ![layout](img/layout.png)
65 | 
66 | ## Webscraping returns HTML
67 | 
68 | * It's easy to pull HTML from a website
69 | * It's much more difficult to find the information you want from that HTML
70 | 
71 | ![html](img/html.png)
72 | 
73 | * So we have to learn how to **parse** HTML to find the data we want
74 | 
75 | ## Basic strategy of webscraping:
76 | 
77 | 1. Find out what kind of HTML element your data is in. (Use your browser's "inspector" to do this.)
78 | 2. Think about how you can differentiate those elements from other, similar elements in the webpage using CSS.
79 | 3. Use Python and add-on modules like BeautifulSoup to extract just that data (a short sketch follows the HTML example below).
80 | 
81 | ## HTML: Basic structure
82 | 
```html
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
```
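As a quick preview of where this is heading (an addition to the original notes), the same snippet can be handed to Beautiful Soup, the parser covered in the 3_Beautiful_Soup workbook:

```python
# Minimal sketch: parse the HTML snippet above with Beautiful Soup.
from bs4 import BeautifulSoup

html = """
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)      # Page title
print(soup.find("p").text)  # Hello world!
```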

94 | 
95 | ## HTML is a Tree
96 | 
97 | 
98 | 
99 | Each branch of the tree is called an *element*
100 | 
101 | ## HTML Elements
102 | 
103 | Generally speaking, an HTML element has three components:
104 | 
105 | 1. Tags (starting and ending the element)
106 | 2. Attributes (giving information about the element)
107 | 3. Text, or Content (the text inside the element)
108 | 
109 | ![elements](https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/HTML_element_structure.svg/330px-HTML_element_structure.svg.png)
110 | 
111 | ## HTML: Tags
112 | 
113 | ![html-tags](img/html-tags.png)
114 | 
115 | [Image credit](http://miriamposner.com/blog/wp-content/uploads/2011/11/html-handout.pdf)
116 | 
117 | ## Common HTML tags
118 | 
119 | | Tag | Meaning |
120 | | ------------- |------------- |
121 | | `<head>` | page header (metadata, etc.) |
122 | | `<body>` | holds all of the content |
123 | | `<p>` | regular text (paragraph) |
124 | | `<h1>`,`<h2>`,`<h3>` | header text, levels 1, 2, 3 |
125 | | `ol,`,`