├── README.md ├── common.R ├── data ├── correspondence-data-1585.csv └── locations.csv ├── debkeepr-intro.Rmd ├── excel-vs-r.R ├── excel-vs-r.Rmd ├── geocoding-with-r.R ├── geocoding-with-r.Rmd ├── gis-with-r-intro.R ├── gis-with-r-intro.Rmd ├── great-circles-sp-sf.R ├── great-circles-sp-sf.Rmd ├── intro-to-r.Rproj ├── networks-with-r.R ├── networks-with-r.Rmd ├── simple-feature-objects.R └── simple-feature-objects.Rmd /README.md: -------------------------------------------------------------------------------- 1 | ## Introduction to R 2 | 3 | This repository collects tutorials on the use of R. The tutorials use historical data from the correspondence of Daniel van der Meulen in 1585. The data is a subset of a larger digital history project, information about which you can find [here](https://jessesadler.com/project/dvdm-correspondence/). 4 | 5 | The tutorials come from blog posts on https://www.jessesadler.com. The R markdown documents in this repository are used to create the posts. Each post also has an R script, which contains all of the code from the posts in a more compressed format. 6 | 7 | If you have any comments, please let me know in the issues. 
-------------------------------------------------------------------------------- /common.R: -------------------------------------------------------------------------------- 1 | ### Set options for project ### 2 | 3 | knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, 4 | fig.path = paste0("img/", post, "/"), 5 | fig.width = 6, 6 | fig.asp = 0.618, # 1 / phi 7 | dpi = 300) -------------------------------------------------------------------------------- /data/correspondence-data-1585.csv: -------------------------------------------------------------------------------- 1 | writer,source,destination,date 2 | "Meulen, Andries van der",Antwerp,Delft,1585-01-03 3 | "Meulen, Andries van der",Antwerp,Haarlem,1585-01-09 4 | "Meulen, Andries van der",Antwerp,Haarlem,1585-01-11 5 | "Meulen, Andries van der",Antwerp,Delft,1585-01-12 6 | "Meulen, Andries van der",Antwerp,Haarlem,1585-01-12 7 | "Meulen, Andries van der",Antwerp,Delft,1585-01-17 8 | "Meulen, Andries van der",Antwerp,Delft,1585-01-22 9 | "Meulen, Andries van der",Antwerp,Delft,1585-01-23 10 | "Della Faille, Marten",Antwerp,Haarlem,1585-01-24 11 | "Meulen, Andries van der",Antwerp,Delft,1585-01-28 12 | "Meulen, Andries van der",Antwerp,Delft,1585-01-30 13 | "Meulen, Andries van der",Antwerp,Delft,1585-02-05 14 | "Meulen, Andries van der",Antwerp,Delft,1585-02-07 15 | "Della Faille, Jacques",Haarlem,Delft,1585-02-10 16 | "Meulen, Andries van der",Antwerp,Delft,1585-02-11 17 | "Meulen, Andries van der",Antwerp,Delft,1585-02-16 18 | "Della Faille, Jacques",Haarlem,The Hague,1585-02-20 19 | "Meulen, Andries van der",Antwerp,Delft,1585-02-20 20 | "Della Faille, Jacques",Haarlem,Delft,1585-02-21 21 | "Meulen, Andries van der",Antwerp,Delft,1585-02-22 22 | "Meulen, Andries van der",Antwerp,Delft,1585-02-25 23 | Staten van Brabant,Antwerp,Delft,1585-02-25 24 | "Meulen, Andries van der",Antwerp,Delft,1585-02-26 25 | Staten van Brabant,Antwerp,Delft,1585-02-26 26 | "Meulen, Andries van 
der",Antwerp,Delft,1585-02-27 27 | "Meulen, Andries van der",Antwerp,Delft,1585-03-02 28 | "Meulen, Andries van der",Antwerp,Delft,1585-03-05 29 | "Meulen, Andries van der",Antwerp,Delft,1585-03-09 30 | "Meulen, Andries van der",Antwerp,Delft,1585-03-11 31 | "Della Faille, Jacques",Haarlem,Delft,1585-03-12 32 | "Meulen, Andries van der",Antwerp,Delft,1585-03-13 33 | "Della Faille, Jacques",Haarlem,Delft,1585-03-14 34 | "Meulen, Andries van der",Antwerp,Delft,1585-03-14 35 | Staten van Brabant,Antwerp,Delft,1585-03-14 36 | "Meulen, Andries van der",Antwerp,Delft,1585-03-15 37 | "Meulen, Andries van der",Antwerp,Delft,1585-03-19 38 | "Della Faille, Jacques",Haarlem,Delft,1585-03-25 39 | "Della Faille, Joris",Dordrecht,Haarlem,1585-03-25 40 | "Meulen, Andries van der",Antwerp,Delft,1585-03-25 41 | "Della Faille, Marten",Antwerp,Haarlem,1585-03-28 42 | "Meulen, Andries van der",Antwerp,Delft,1585-03-28 43 | "Meulen, Andries van der",Antwerp,Delft,1585-03-30 44 | "Meulen, Andries van der",Antwerp,Delft,1585-03-31 45 | "Meulen, Andries van der",Antwerp,Delft,1585-04-06 46 | "Meulen, Andries van der",Antwerp,Delft,1585-04-10 47 | "Eeckeren, Robert van",Haarlem,Delft,1585-04-11 48 | "Meulen, Andries van der",Antwerp,Delft,1585-04-11 49 | "Meulen, Andries van der",Antwerp,Delft,1585-04-19 50 | "Meulen, Andries van der",Antwerp,Delft,1585-04-24 51 | "Anraet, Thomas",Antwerp,Delft,1585-04-27 52 | "Meulen, Andries van der",Antwerp,Delft,1585-04-28 53 | "Meulen, Andries van der",Antwerp,Delft,1585-04-30 54 | "Della Faille, Jacques",Haarlem,Delft,1585-05-01 55 | "Meulen, Andries van der",Antwerp,Middelburg,1585-05-04 56 | "Della Faille, Jacques",Haarlem,Delft,1585-05-08 57 | "Della Faille, Marten",Antwerp,Delft,1585-05-09 58 | "Meulen, Andries van der",Antwerp,Delft,1585-05-09 59 | "Meulen, Andries van der",Antwerp,Delft,1585-05-11 60 | "Della Faille, Jacques",Haarlem,Delft,1585-05-12 61 | "Meulen, Andries van der",Antwerp,Delft,1585-05-17 62 | "Della Faille, 
Jacques",Haarlem,Delft,1585-05-20 63 | "Della Faille, Marten",Antwerp,Delft,1585-05-28 64 | "Della Faille, Jacques",Haarlem,Delft,1585-05-28 65 | "Meulen, Andries van der",Antwerp,Delft,1585-05-28 66 | "Meulen, Andries van der",Antwerp,Delft,1585-05-28 67 | "Meulen, Andries van der",Antwerp,Delft,1585-05-30 68 | "Meulen, Andries van der",Antwerp,Delft,1585-06-01 69 | "Meulen, Andries van der",Antwerp,Delft,1585-06-04 70 | "Della Faille, Jacques",Haarlem,Middelburg,1585-06-11 71 | Staten van Brabant,Antwerp,The Hague,1585-06-13 72 | "Meulen, Andries van der",Antwerp,Delft,1585-06-14 73 | Burgemeesters of Antwerp,Antwerp,The Hague,1585-06-19 74 | "Meulen, Andries van der",Antwerp,Delft,1585-06-20 75 | "Wale, Jan de",Venice,Haarlem,1585-06-23 76 | "Meulen, Andries van der",Antwerp,Delft,1585-06-29 77 | "Meulen, Andries van der",Antwerp,Delft,1585-07-05 78 | "Meulen, Andries van der",Antwerp,Delft,1585-07-09 79 | "Meulen, Andries van der",Antwerp,Delft,1585-07-12 80 | "Meulen, Andries van der",Antwerp,Delft,1585-07-13 81 | "Meulen, Andries van der",Antwerp,Delft,1585-07-19 82 | "Meulen, Andries van der",Antwerp,Delft,1585-07-25 83 | "Meulen, Andries van der",Antwerp,Delft,1585-07-27 84 | "Della Faille, Jacques",Lisse,Delft,1585-07-29 85 | "Meulen, Andries van der",Antwerp,Delft,1585-07-31 86 | "Meulen, Andries van der",Antwerp,Delft,1585-08-01 87 | "Della Faille, Jacques",Haarlem,Delft,1585-08-02 88 | "Della Faille, Jacques",Haarlem,Delft,1585-08-06 89 | "Calvart, Jacques",Haarlem,Delft,1585-08-06 90 | "Della Faille, Jacques",Haarlem,Delft,1585-08-07 91 | "Meulen, Andries van der",Antwerp,Delft,1585-08-08 92 | "Della Faille, Jacques",Haarlem,Delft,1585-08-09 93 | "Della Faille, Jacques",Haarlem,Delft,1585-08-12 94 | "Meulen, Andries van der",Antwerp,Delft,1585-08-12 95 | "Meulen, Andries van der",Antwerp,Delft,1585-08-16 96 | "Della Faille, Jacques",Haarlem,Delft,1585-08-18 97 | "Della Faille, Jacques",Haarlem,Delft,1585-08-22 98 | "Meulen, Andries van 
der",Antwerp,Delft,1585-08-22 99 | "Wale, Jan de",Venice,Haarlem,1585-08-30 100 | "Della Faille, Jacques",Haarlem,Delft,1585-09-02 101 | "Della Faille, Jacques",Haarlem,Delft,1585-09-03 102 | "Janssen van der Meulen, Peeter",Antwerp,Delft,1585-09-04 103 | "Della Faille, Jacques",Haarlem,Delft,1585-09-06 104 | "Della Faille, Jacques",Haarlem,Delft,1585-09-07 105 | "Della Faille, Jacques",Haarlem,Delft,1585-09-08 106 | "Della Faille, Jacques",Haarlem,Delft,1585-09-10 107 | "Della Faille, Jacques",Haarlem,Delft,1585-09-14 108 | "Della Faille, Marten",Antwerp,Delft,1585-09-16 109 | "Della Faille, Marten",Antwerp,Delft,1585-09-17 110 | "Della Faille, Jacques",Het Vlie,Bremen,1585-10-21 111 | "Della Faille, Jacques",Hamburg,Bremen,1585-10-30 112 | "Della Faille, Jacques",Emden,Bremen,1585-11-30 113 | "Noirot, Jacques",Haarlem,Bremen,1585-12-16 114 | "Noirot, Jacques",Amsterdam,Bremen,1585-12-20 115 | "Della Faille, Jacques",Haarlem,Bremen,1585-12-27 116 | -------------------------------------------------------------------------------- /data/locations.csv: -------------------------------------------------------------------------------- 1 | place,lon,lat 2 | Antwerp,4.4024643,51.2194475 3 | Haarlem,4.6462194,52.3873878 4 | Dordrecht,4.6900929,51.81329789999999 5 | Venice,12.3155151,45.4408474 6 | Lisse,4.5574834,52.25793030000001 7 | Het Vlie,5.183333,53.3 8 | Hamburg,9.9936819,53.5510846 9 | Emden,7.2060095,53.35940290000001 10 | Amsterdam,4.895167900000001,52.3702157 11 | Delft,4.3570677,52.01157689999999 12 | The Hague,4.3006999,52.0704978 13 | Middelburg,3.610998,51.4987962 14 | Bremen,8.8016936,53.07929619999999 15 | -------------------------------------------------------------------------------- /debkeepr-intro.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introducing debkeepr" 3 | output: 4 | md_document: 5 | variant: markdown 6 | always_allow_html: yes 7 | --- 8 | 9 | ```{r setup, include=FALSE} 10 | post 
<- "debkeepr-intro" 11 | source("common.R") 12 | knitr::opts_chunk$set(fig.width = 8, fig.asp = 0.8) 13 | ``` 14 | 15 | After an extensive period of iteration and a long but rewarding process of learning about package development, I am pleased to announce the release of my first R package. The package is called [debkeepr](http://jessesadler.github.io/debkeepr/), and it derives directly from my historical [research on early modern merchants](https://www.jessesadler.com/project/df-inheritance/). `debkeepr` provides an interface for working with non-decimal currencies that use the tripartite system of pounds, shillings, and pence used throughout Europe in the medieval and early modern periods. The package includes functions to apply arithmetic and financial operations to single or multiple values and to analyze account books that use [double-entry bookkeeping](https://en.wikipedia.org/wiki/Double-entry_bookkeeping_system), with the latter providing the basis for the name of `debkeepr`. In a later post I plan to write about the package development process, but here I want to discuss the motivation behind the creation of the package and provide some examples of how `debkeepr` can help those who encounter non-decimal currencies in their research. 16 | 17 | You can install `debkeepr` from GitHub right now with [devtools](https://github.com/hadley/devtools), and I am planning to submit the package to CRAN soon. Feedback is always welcome, and any bug reports or feature requests can be made on [GitHub](https://github.com/jessesadler/debkeepr/issues). 
18 | 19 | 20 | ```{r devtools, eval = FALSE} 21 | # install.packages("devtools") 22 | devtools::install_github("jessesadler/debkeepr") 23 | ``` 24 | 25 | ## Pounds, shillings, and pence: lsd monetary systems 26 | The system of expressing monetary values in the form of pounds, shillings, and pence dates back to the Carolingian Empire and the change from the use of gold coins that derived from the late Roman Empire to silver pennies that had taken place by the eighth century. Needing ways to count larger quantities of the new silver [denarius](https://en.wikipedia.org/wiki/Denarius), people began to define a [solidus](https://en.wikipedia.org/wiki/Solidus_(coin)), originally a gold coin introduced by the Emperor Constantine, as a unit of account equivalent to 12 denarii. For even larger valuations, the denarius was further defined in relation to a pound or [libra](https://en.wikipedia.org/wiki/French_livre) of silver. Though the actual number of coins struck from a pound of silver differed over time, the rate of 240 coins lasted long enough to create the custom of counting coins in dozens (solidi) and scores of dozens (librae). The [librae, solidi, and denarii (lsd)](https://en.wikipedia.org/wiki/%C2%A3sd) monetary system was translated into various European languages, and though the ratios between the three units often differed by region and period, the basic structure of the system remained in place until decimalization began following the French Revolution.[^1] 27 | 28 | The pounds, shillings, and pence system complicates even relatively [simple arithmetic operations](https://en.wikipedia.org/wiki/Arithmetic#Compound_unit_arithmetic). Especially in our decimalized world, manipulating values that are made up of three different units in which two of the units are non-decimal quickly becomes cumbersome. The principles of grade school arithmetic come running back, as the practice of addition and carrying over values to the next unit gains new relevancy. 
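

The carrying described above is easy to sketch in base R. The helper below is my own illustration of the logic, not part of `debkeepr`, and it assumes the common bases of 20 shillings to the pound and 12 pence to the shilling:

```r
# Normalize a pounds-shillings-pence value by hand: carry whole
# shillings out of the pence, then whole pounds out of the shillings.
normalize_lsd <- function(l, s, d) {
  s <- s + d %/% 12   # 12 pence make a shilling
  d <- d %% 12
  l <- l + s %/% 20   # 20 shillings make a pound
  s <- s %% 20
  c(pounds = l, shillings = s, pence = d)
}

normalize_lsd(235, 51, 8)
#>    pounds shillings     pence
#>       237        11         8
```

So £235 51s. 8d. normalizes to £237 11s. 8d., which is the same answer `deb_normalize()` produces for this value.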
Calculations with pounds, shillings, and pence values are further complicated when dealing with more than one currency or [money of account](https://en.wikipedia.org/wiki/Unit_of_account), particularly if one or more of the currencies used bases for the shillings and pence units that differed from the common bases of 20 shillings to a pound and 12 pence to a shilling. 29 | 30 | The economic historian encounters the difficulties of handling non-decimal currencies in two main contexts. In reading through documents that may or may not be primarily economic in nature, the researcher comes across sets of values that need to be manipulated to better understand their meaning. A common case is the discovery of values that need to be added together to see if they are equivalent to a value in another document, or there may be values in one currency that have to be converted to another currency for comparison. The second context in which historians confront non-decimal monetary values is in explicitly economic documents such as entire account books that may contain hundreds or thousands of transactions. In both contexts the researcher is often stuck performing arithmetic calculations by hand just as merchants and bookkeepers had to do in the past. I do not know how many pages of scratched out calculations I went through in my own research on the estate of Jan della Faille de Oude, a wealthy merchant from Antwerp who died in 1582. 31 | 32 | image of my arithmetic 33 | 34 | The tripartite and non-decimal nature of pounds, shillings, and pence values presents particular difficulties for the analysis of large sets of accounts. A first instinct of the researcher is to place the transactions from an account book into a table or data base to better understand these large groups of data, but how should the pounds, shillings, and pence values be entered? 
Should they be placed into three separate variables or brought together somehow into one, and how should the arithmetic be performed? So long as there is no clear way to handle one, not to mention multiple, non-decimal currency within a data base, the use of digital tools to investigate and analyze historical accounts books will be hindered. 35 | 36 | ## `debkeepr` 37 | `debkeepr` seeks to overcome these issues and integrate non-decimal pounds, shillings, and pence values into the decimalized environment of R, opening the vast analytical and visual capabilities of R to historical monies of account. `debkeepr` accomplishes this through the creation of the `lsd` class and functions intended to work with this new class of object in R. The `lsd` class unifies the separate pounds, shillings, and pence units into a numeric vector of length three and tracks the non-decimal bases for the shillings and pence units through a `bases` attribute. `lsd` objects are stored as [lists](http://r4ds.had.co.nz/vectors.html#lists), making it possible for a single object to contain multiple pounds, shillings, and pence values and to implement `lsd` objects as list columns in a data frame or [tibble](https://tibble.tidyverse.org). The `lsd` class is intended to both minimize the difficulty historians encounter in one-off calculations and provide a consistent means for the exploration and analysis of large sets of accounts.[^2] 38 | 39 | I have set up a [pkgdown site for `debkeepr`](http://jessesadler.github.io/debkeepr/), and you can find the source code on [GitHub](https://github.com/jessesadler/debkeepr). The package includes three vignettes, two of which deal with data from the practice journal and ledger from Richard Dafforne’s *The Merchant's Mirrour, Or Directions for the Perfect Ordering and Keeping of His Accounts* (London, 1660), a bookkeeping manual that was printed throughout the seventeenth century. 
Data frames for the transactions and accounts from this set of account books are also included with the package. 40 | 41 | ## Vignettes 42 | - [Getting Started with debkeepr vignette](https://jessesadler.github.io/debkeepr/articles/debkeepr.html): An in-depth overview of `debkeepr`’s functions and their use in various contexts. 43 | - [Transactions in Richard Dafforne's Journal vignette](https://jessesadler.github.io/debkeepr/articles/transactions.html): Examples of financial and arithmetic calculations dealing with various currencies taken from the example journal in Richard Dafforne’s *Merchant’s Mirrour* (1660). 44 | - [Analysis of Richard Dafforne’s Journal and Ledger vignette](https://jessesadler.github.io/debkeepr/articles/ledger.html): An analysis of the example journal and ledger in Dafforne’s *Merchant’s Mirrour* using the `dafforne_transactions` and `dafforne_accounts` data provided in `debkeepr`. 45 | 46 | In the rest of this post I want to quickly highlight how `debkeepr` can assist historians in handling pounds, shillings, and pence values in both one-off calculations and in the context of an entire account book. The vignettes go into much greater detail than will be done here. 47 | 48 | ## Arithmetic with `debkeepr` 49 | At the heart of `debkeepr` is the ability to normalize pounds, shillings, and pence values to specified non-decimal unit bases in the process of making various calculations. Even in the simplest case of addition, `debkeepr` makes the process easier and less error prone. Let’s use an example from the image below that shows arithmetic on a scratch piece of paper dealing with trade between Holland and the Barbary coast in the 1590s.[^3] On the top right, four values in the money of account of pounds Flemish are added together. 
50 | 51 | scratch image 52 | 53 | ```{r table of values, echo = FALSE} 54 | knitr::kable(data.frame( 55 | pounds = c(74, 26, 107, 28), 56 | shillings = c(10, 8, 14, 19), 57 | pence = c(0, 8, 0, 0) 58 | )) 59 | ``` 60 | 61 | `debkeepr` provides multiple means to add together values of pounds Flemish, which used the standard bases of 20 shillings per pound and 12 pence per shilling.[^4] 62 | 63 | - Add the separate units by hand and then normalize. 64 | - Do the addition with `debkeepr` by supplying numeric vectors of length three. 65 | - Create an object of class `lsd` and proceed with the addition. 66 | 67 | ```{r addition} 68 | library(debkeepr) 69 | 70 | # Normalize £235 51s. 8d. 71 | deb_normalize(c(235, 51, 8), bases = c(20, 12)) 72 | 73 | # Addition of values with debkeepr 74 | deb_sum(c(74, 10, 0), 75 | c(26, 8, 8), 76 | c(107, 14, 0), 77 | c(28, 19, 0)) 78 | 79 | # Create lsd object from list of vectors, then do addition 80 | lsd_values <- deb_as_lsd(list(c(74, 10, 0), 81 | c(26, 8, 8), 82 | c(107, 14, 0), 83 | c(28, 19, 0)), 84 | bases = c(20, 12)) 85 | deb_sum(lsd_values) 86 | ``` 87 | 88 | Multiplication and division of pounds, shillings, and pence values were equally frequent calculations, but they are more complex to do by hand. Bookkeeping and merchant manuals often included rules for the multiplication and division of compound values such as pounds, shillings, and pence. The article on arithmetic in the third edition of the [*Encyclopedia Britannica*, printed in 1797](http://onlinebooks.library.upenn.edu/webbin/metabook?id=britannica3), provides a good example of how to do [compound unit arithmetic](https://en.wikipedia.org/wiki/Arithmetic#Compound_unit_arithmetic). `debkeepr` greatly simplifies this process. 89 | 90 | lsd-multiplication 91 | 92 | ```{r multiplication} 93 | # Multiply £15 3s. 8d. sterling by 32 94 | deb_multiply(c(15, 3, 8), x = 32) 95 | 96 | # Multiply £17 3s. 8d. 
sterling by 75 97 | deb_multiply(c(17, 3, 8), x = 75) 98 | ``` 99 | 100 | The examples for division in the *Encyclopedia Britannica* include the division of pounds, shillings, and pence, as well as the division of weight measured in terms of hundredweight, quarters, and pounds. A hundredweight consisted of four quarters, and there were 28 pounds (or two stones) in a quarter. While `debkeepr` was created with pounds, shillings, and pence values in mind, the measurement of weight in terms of hundredweight can be integrated by altering the `bases` argument. This example even serves to show a mistake in the printing of the *Encyclopedia*, as the answer is shown as 15 cwt. 2 q. 21 lb., but it should actually be 22 lb. Checking the long division at the bottom of the calculation, 22 goes into 44 twice, not once. Notice too that the division of £465 12s. 8d. ends with a remainder of 8 even though this is not included in the answer provided. This remainder leads to the decimal in the pence unit in the answer provided by `deb_divide()`. Setting the `round` argument to 0 would convert the pence unit to a whole number. 101 | 102 | lsd-division 103 | 104 | ```{r division} 105 | # Divide £465 12s. 8d. sterling by 72 106 | deb_divide(c(465, 12, 8), x = 72) 107 | 108 | # Divide 345 hundredweight 1 quarter 8 lbs by 22 109 | deb_divide(c(345, 1, 8), x = 22, bases = c(4, 28)) 110 | ``` 111 | 112 | These examples replicate answers already provided, but they serve to demonstrate the gains in both speed and accuracy for these common calculations. Further examples, including the use of `debkeepr` to convert between currencies that may or may not have different bases for the shillings and pence units, can be found in both [Getting Started with debkeepr](https://jessesadler.github.io/debkeepr/articles/debkeepr.html) and the [Transactions in Richard Dafforne's Journal vignette](https://jessesadler.github.io/debkeepr/articles/transactions.html). 
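

The remainder of 8 noted above can also be verified with plain integer arithmetic by reducing everything to pence first. This is a hand calculation for checking the result, not a `debkeepr` function:

```r
# Check the Encyclopedia division: reduce £465 12s. 8d. to pence,
# divide by 72, and convert the whole-pence quotient back to lsd.
total_pence <- 465 * 20 * 12 + 12 * 12 + 8  # 111752 pence in all
quotient <- total_pence %/% 72              # 1552 whole pence
remainder <- total_pence %% 72              # the remainder of 8
c(pounds = quotient %/% 240,
  shillings = (quotient %% 240) %/% 12,
  pence = quotient %% 12,
  remainder = remainder)
#>    pounds shillings     pence remainder
#>         6         9         4         8
```

The whole-pence quotient of 1552 corresponds to £6 9s. 4d., matching the answer from `deb_divide()` up to the fractional penny produced by the remainder.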
113 | 114 | ## Analysis of account books 115 | As a way of providing example data of pounds, shillings, and pence values in a data frame, `debkeepr` includes data from the example journal and ledger in Dafforne’s *Merchant’s Mirrour* from 1660. `dafforne_transactions` has 177 transactions between 46 accounts from the journal. Each transaction has a creditor and a debtor account, showing where each transactional value came from and where it went, as well as the date of the transaction and the value in pounds sterling contained in an `lsd` list column. Extra details include the pages on which the transaction can be found in Dafforne’s journal and ledger and a short description of the transaction. A PDF copy of the journal is [available for download](https://github.com/jessesadler/debkeepr/blob/master/data-raw/dafforne-journal.pdf) if you want to see what the original source looks like. The [raw data](https://github.com/jessesadler/debkeepr/tree/master/data-raw) from which `dafforne_transactions` derives also demonstrates the process of entering pounds, shillings, and pence units into three separate columns to create the data base and then using `deb_lsd_gather()` to transform these variables into an `lsd` list column. 116 | 117 | ```{r transactions} 118 | # Transactions from Dafforne's example journal 119 | dafforne_transactions 120 | ``` 121 | 122 | The [Analysis of Richard Dafforne’s Journal and Ledger vignette](https://jessesadler.github.io/debkeepr/articles/ledger.html) provides a fuller breakdown of the data, but here I would like to show how to get a summary of the 46 accounts in the books and create a plot using [ggplot2](https://ggplot2.tidyverse.org). One side effect of the use of a list column to represent pounds, shillings, and pence values is that it cannot be used for the purposes of plotting. 
However, this issue is overcome by the robust support for [decimalization](https://jessesadler.github.io/debkeepr/reference/index.html#section-decimalization) in `debkeepr`. Pounds, shillings, and pence values can be decimalized to any of the three units and returned to their original form when desired. For the purposes of plotting, the most useful workflow is to transform the `lsd` list column to decimalized pounds. 123 | 124 | `debkeepr` has a [set of functions](https://jessesadler.github.io/debkeepr/reference/index.html#section-transaction-data-frames) meant to deal with data frames that mimic the form of a journal used for double-entry bookkeeping such as `dafforne_transactions`. The 177 transactions and 46 accounts in `dafforne_transactions` are more than enough to lose track of which accounts were most significant to the bookkeeper’s trade. Here, I will use `deb_account_summary()` to calculate the total credit or amount each account sent, total debit or amount each account received, and the current value of the account at the closing of the books. While `deb_account_summary()` gives a nice overview of the entire set of account books, it is also possible to focus on a single account with `deb_account()`. 125 | 126 | ```{r account-functions} 127 | # Summary of all accounts in Dafforne's example journal 128 | deb_account_summary(df = dafforne_transactions) 129 | 130 | # Summary of the cash account in Dafforne's example journal 131 | deb_account(df = dafforne_transactions, account_id = 1) 132 | ``` 133 | 134 | To plot the summary of the accounts in the journal and ledger the `lsd` list columns have to be converted to decimalized pounds, which is done with `deb_lsd_l()` and `dplyr::mutate_if()`. From this information, we can create a line range plot in which the upper limit is represented by the total credit, the lower limit by the total debit, and the current value by a point. 
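

Decimalization itself is simple arithmetic: with 20 shillings to the pound and 12 pence to the shilling, £6 9s. 4d. becomes 6 + 9/20 + 4/240 pounds. A minimal sketch of the conversion, using a helper of my own for illustration rather than `debkeepr`’s implementation:

```r
# Convert a pounds-shillings-pence vector to decimal pounds: shillings
# are twentieths of a pound, and pence are 240ths under standard bases.
lsd_to_decimal <- function(lsd, bases = c(20, 12)) {
  lsd[1] + lsd[2] / bases[1] + lsd[3] / prod(bases)
}

lsd_to_decimal(c(6, 9, 4))
#> [1] 6.466667
```
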
Blue points show accounts that have been balanced by the end of the books and are therefore closed, while black points represent the values for accounts that remain open. 135 | 136 | ```{r account-summary, message = FALSE} 137 | library(dplyr) 138 | library(ggplot2) 139 | 140 | # Prepare account summary for plotting 141 | (dafforne_summary <- dafforne_transactions %>% 142 | deb_account_summary() %>% 143 | mutate_if(deb_is_lsd, deb_lsd_l) %>% 144 | mutate(debit = -debit)) 145 | 146 | # Plot summary of accounts 147 | ggplot(data = dafforne_summary) + 148 | geom_linerange(aes(x = account_id, ymin = debit, ymax = credit)) + 149 | geom_point(aes(x = account_id, y = current, 150 | color = if_else(current == 0, "Closed", "Open"))) + 151 | scale_color_manual(values = c(Open = "black", Closed = "blue")) + 152 | scale_y_continuous(labels = scales::dollar_format(prefix = "£")) + 153 | labs(x = "Accounts", 154 | y = "Pounds sterling", 155 | color = "Status", 156 | title = "Summary of the accounts in Dafforne's ledger") + 157 | theme_light() 158 | ``` 159 | 160 | ## `debkeepr` and reproducibility 161 | The above plot only scratches the surface of the possibilities for exploring and analyzing account books that use pounds, shillings, and pence values with `debkeepr`. Yet, what I think is most significant about `debkeepr` is not any one plot, time-saving feature, or type of analysis, but rather the possibility of using the [practices of reproducible research](http://ropensci.github.io/reproducibility-guide/) in the context of historical economic data that takes the form of pounds, shillings, and pence values. Individual calculations are easier to track and validate when written in code rather than scrawled out on a scratch piece of paper or entered into a calculator. 
Even more importantly, `debkeepr` provides a workflow for entering data from account books into a data base, [tidying the data](http://vita.had.co.nz/papers/tidy-data.html) by transforming separate pounds, shillings, and pence variables into an `lsd` list column, and then exploring, analyzing, and visualizing the data in ways that integrate with the [tidyverse](https://www.tidyverse.org) and more general practices of data analysis in R. It is my hope that `debkeepr` can help bring to light crucial and interesting social interactions that are buried in economic manuscripts, making these stories accessible to a wider audience. 162 | 163 | [^1]: For more information about the development of the system of pounds, shillings, and pence and medieval monetary systems more generally, see Peter Spufford, *Money and its Use in Medieval Europe* (Cambridge: Cambridge University Press, 1988). 164 | 165 | [^2]: For a more in-depth discussion of `lsd` objects and a step-by-step overview of how to create and work with them, see [Getting Started with debkeepr](https://jessesadler.github.io/debkeepr/articles/debkeepr.html). 166 | 167 | [^3]: This scratch piece of paper comes from the archive of Daniël van der Meulen en Hester de la Faille, zijn vrouw, 1550-1648 at Erfgoed Leiden en Omstreken. You can access the original [here](https://www.erfgoedleiden.nl/collecties/archieven/archievenoverzicht/scans/NL-LdnRAL-0096/7.23). 168 | 169 | [^4]: This means that the default value of `c(20, 12)` does not need to be altered for the `bases` argument, though it is included in some of the examples below for the purposes of transparency. 
-------------------------------------------------------------------------------- /excel-vs-r.R: -------------------------------------------------------------------------------- 1 | ### Analyzing a correspondence network with dplyr and ggplot ### 2 | 3 | # This script derives from the brief introduction to R, in the blog post 4 | # Excel vs R: A Brief Introduction to R, 5 | # which can be found at https://www.jessesadler.com/post/excel-vs-r/ 6 | 7 | library(tidyverse) 8 | library(lubridate) 9 | 10 | # Load the data 11 | letters <- read_csv("data/correspondence-data-1585.csv") 12 | 13 | ### Creating new data frames ### 14 | # Unique correspondents 15 | writers <- distinct(letters, writer) 16 | writers 17 | 18 | # Number of correspondents 19 | nrow(distinct(letters, writer)) 20 | 21 | # Unique sources and destinations in the correspondence 22 | sources <- distinct(letters, source) 23 | destinations <- distinct(letters, destination) 24 | 25 | # First attempt to get letters per correspondent 26 | per_correspondent <- summarise(letters, count = n()) 27 | 28 | # Letters per correspondent 29 | per_correspondent <- letters %>% 30 | group_by(writer) %>% 31 | summarise(count = n()) %>% 32 | arrange(desc(count)) 33 | 34 | # Letters per source 35 | per_source <- letters %>% 36 | group_by(source) %>% 37 | summarise(count = n()) %>% 38 | arrange(desc(count)) 39 | 40 | ## ggplot charts of letters per source and correspondent ## 41 | 42 | # Per source barchart 43 | ggplot(data = per_source) + 44 | geom_bar(aes(x = source, y = count), stat = "identity") + 45 | labs(x = NULL, y = "Letters written") 46 | 47 | # Per correspondent barchart 48 | ggplot(data = letters) + 49 | geom_bar(aes(x = writer)) + 50 | labs(x = NULL, y = "Letters written") + 51 | theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 52 | 53 | ## Working with dates ## 54 | 55 | # Letters per month without labels 56 | per_month <- letters %>% 57 | mutate(month = month(date)) %>% 58 | group_by(month) 
%>% 59 | summarise(count = n()) 60 | 61 | # Letters per month with labels 62 | per_month <- letters %>% 63 | mutate(month = month(date, label = TRUE)) %>% 64 | group_by(month) %>% 65 | summarise(count = n()) 66 | 67 | # Letters per month barchart 68 | ggplot(data = per_month) + 69 | geom_bar(aes(x = month, y = count), stat = "identity") + 70 | labs(x = 1585, y = "Letters sent") 71 | 72 | # Letters per weekday 73 | per_wday <- letters %>% 74 | mutate(wday = wday(date, label = TRUE)) %>% 75 | group_by(wday) %>% 76 | summarise(count = n()) 77 | 78 | # Letters per weekday barchart 79 | ggplot(data = letters) + 80 | geom_bar(aes(x = wday(date, label = TRUE))) + 81 | labs(x = 1585, y = "Letters sent") 82 | 83 | # Letters from Sunday 84 | letters %>% filter(wday(date) == 1) 85 | 86 | ## Combining data for correspondents and months ## 87 | 88 | # Per correspondent by month 89 | correspondent_month <- letters %>% 90 | mutate(month = month(date, label = TRUE)) %>% 91 | group_by(writer, month) %>% 92 | summarise(count = n()) %>% 93 | arrange(desc(count)) 94 | 95 | # Per correspondent by month barchart 96 | ggplot(data = correspondent_month) + 97 | geom_bar(aes(x = month, y = count, fill = writer), stat = "identity") + 98 | labs(x = 1585, y = "Letters sent", fill = "Correspondents") 99 | 100 | # Per correspondent by month line chart 101 | ggplot(data = correspondent_month, aes(x = month, y = count, color = writer)) + 102 | geom_point(size = 3) + 103 | geom_line(aes(group = writer)) + 104 | labs(x = 1585, y = "Letters sent", color = "Correspondents") -------------------------------------------------------------------------------- /excel-vs-r.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Excel vs R: A Basic Introduction to R" 3 | output: 4 | md_document: 5 | variant: markdown 6 | always_allow_html: yes 7 | --- 8 | 9 | ```{r setup, include=FALSE} 10 | post <- "excel-vs-r" 11 | source("common.R") 12 | ``` 13 | 14 | 
Quantitative research often begins with the humble process of counting. Historical documents are never as plentiful as a historian would wish, but counting words, material objects, court cases, etc. can lead to a better understanding of the sources and the subject under study. When beginning the process of counting, the first instinct is to open a spreadsheet. The end result might be the production of tables and charts created in the very same spreadsheet document. In this post, I want to show why this spreadsheet-centric workflow is problematic and recommend the use of a programming language such as [R](https://www.r-project.org) as an alternative for both analyzing and visualizing data. There is no doubt that the learning curve for R is much steeper than that of producing one or two charts in a spreadsheet. However, there are real long-term advantages to learning a dedicated data analysis tool like R. Such advice to learn a programming language can seem both daunting and vague, especially [if you do not really understand what it means to code](https://jessesadler.com/post/new-kinds-of-projects/). For this reason, after discussing why it is preferable to analyze data with R instead of a spreadsheet program, this post provides a brief introduction to R, as well as an example of analysis and visualization of historical data with R.[^1] 15 | 16 | The draw of the spreadsheet is strong. As I first thought about ways to keep track of and analyze the thousands of letters in the [Daniel van der Meulen Archive](https://jessesadler.com/project/dvdm-correspondence/), I automatically opened up [Numbers](https://www.apple.com/numbers/) — the spreadsheet software I use most often — and started to think about what columns I would need to create to document information about the letters. Whether one uses Excel, Numbers, Google Sheets, or any other spreadsheet program, the basic structure and capabilities are well known.
They all provide more-or-less aesthetically pleasing ways to easily enter data, view subsets of the data, and rearrange the rows based on the values of the various columns. But, of course, spreadsheet programs are more powerful than this, because you can add your own programmatic logic to cells to combine them in seemingly endless ways and produce graphs and charts from the results. The spreadsheet, after all, was [the first killer app](https://en.wikipedia.org/wiki/VisiCalc). 17 | 18 | [With great power, there must also come great responsibility](https://giphy.com/gifs/11qAyKz9AbFEYM/html5). Or, in the case of the spreadsheet, with great power there must also come great danger. The danger of the spreadsheet derives from its very structure. The mixture of data entry, analysis, and visualization makes it easy to confuse cells that contain raw data with those that are the product of analysis. The nature of defining programmatic logic — such as which cells are to be added together — by mouse clicks means that a mistaken click or drag action can lead to errors or the overwriting of data. You only need to think about the dread of the moment when you go to close a spreadsheet and the program asks whether you would like to save changes. It makes you wonder. Do I want to save? What changes did I make? Because the logic in a spreadsheet is all done through mouse clicks, there is no way to effectively track what changes have been made either in one session or in the production of a chart. Excel mistakes can have wide-ranging consequences, as the controversy around the paper of [Carmen Reinhart and Kenneth Rogoff on national debt made clear](https://www.marketplace.org/2013/04/17/economy/excel-mistake-heard-round-world "The Excel mistake heard round the world").[^2] 19 | 20 | There are certainly legitimate reasons why people default to using spreadsheets for data analysis instead of using a programming language like R.
First, spreadsheets are much more inviting and comforting than any programming language could ever be to a newcomer. Learning how to program is intimidating and not something that can be done either quickly or easily. [Graphical user interface (GUI)](https://en.wikipedia.org/wiki/Graphical_user_interface) applications are simply less daunting than a [command-line interface](https://en.wikipedia.org/wiki/Command-line_interface). Second, spreadsheets are a good tool for data entry, and it is tempting to simply move on to data analysis, keeping everything in the same document. Finally, the interactive nature of spreadsheets and the ability to create charts that change based on inputs is very attractive, even if fully unlocking this potential involves quite complex knowledge about how the program works. The first advantage of spreadsheets over programming is not easily overcome, but the latter two are built on what I believe to be a problematic workflow. Instead of using a couple of monolithic applications — often an office suite of applications — to do everything, I think that it is better to split up the workflow among several applications that [do one thing well](https://en.m.wikipedia.org/wiki/Unix_philosophy). 21 | 22 | Creating a clear division between data entry and analysis is a major reason why analyzing data in a programming language is preferable to spreadsheet software. I still use spreadsheets, but their use is strictly limited to data entry.[^3] In a spreadsheet program the analysis directly manipulates the only copy of the raw data. In contrast, with R you import the data, creating an object that is a copy of the raw data.[^4] All manipulations to the data are done on this copy, and the original data are never altered in any way. This means that there is no way to mess up the raw data. Manipulating a copy of the data enables you to more freely experiment. All mistakes are basically inconsequential, even if they can be frustrating.
A line of code that fails to produce the expected result can be tweaked and rerun — with the process repeated many times if necessary — until it does. 23 | 24 | Working on a copy of the raw data can even simplify the process of data entry. Analyzing tabular data in R results in the creation of multiple objects, which are referred to as **data frames** and can be thought of as equivalent to tables in a spreadsheet.[^5] The ability to split, subset, and transform the original data set into many different data frames has the benefit of drastically reducing the complexity of data entry. Instead of needing bespoke spreadsheets with multiple interrelated sheets and tables, every piece of data only needs to be entered once and all manipulations can be done in code. The different data frames that are created in the process of analysis do not even have to be saved, because they are so easily reproduced by the script of code. 25 | 26 | The separation of data entry and data analysis severely reduces the potential for mistakes, but maybe even more significantly, the use of code for data analysis enables the creation of [reproducible research](https://cran.r-project.org/web/views/ReproducibleResearch.html) that is simply not possible in spreadsheets. Reproducible research has been a [hot-button issue in the sciences](http://science.sciencemag.org/content/348/6242/1403.full), but there is no reason why research in the humanities should not strive to be reproducible where possible. With a programming language, the steps of the analysis can be clearly laid out in the code. The [“truth” of the analysis is the code](http://r4ds.had.co.nz/workflow-projects.html#what-is-real), not the objects or visuals that the code creates. Saving analysis in code has the immediate benefit that it can be easily rerun anytime that new data is added. Code can also be applied to a completely new data set in a much more transparent manner than with spreadsheets.
The long-term benefit is that with code all analysis is documented instead of being hidden behind mouse clicks. This makes it easier for you to go over your own analyses long after you have finished with them, as well as for others to understand what you did and check for errors. 27 | 28 | ## A Brief Introduction to R 29 | If you have never read or written any code before, it is difficult to know what the theoretical differences between spreadsheets and coding mean in practice.[^6] Therefore, I want to provide an example of analyzing and visualizing data in R. I will use the example of letters sent to Daniel van der Meulen in 1585, which is a subset of the data from my project on the [correspondence of Daniel van der Meulen](https://jessesadler.com/project/dvdm-correspondence/). Before getting to the analysis, I will give a very brief introduction to R and define some key terms in order to make it easier for anyone who has never used R or coded before to follow along. This is not meant to be a full-scale tutorial, so much as an introduction to what code in R looks like.[^7] 30 | 31 | R is an open-source programming language, which can be [freely downloaded](https://cloud.r-project.org). Opening the downloaded program will lead you to a command-line interface devoid of any clues for what a novice should do next. Thankfully, there is a freely available application, or [IDE](https://en.wikipedia.org/wiki/Integrated_development_environment), called [RStudio](https://www.rstudio.com/products/rstudio), which provides a number of features to help write code. It is highly recommended that you write your R code in RStudio. RStudio also has a plethora of [online tutorials](https://www.rstudio.com/online-learning/) on its website to help you get familiar with both R and the RStudio IDE. 32 | 33 | In either R or RStudio, the command line or **console** is where all of the action takes place. It is here that you enter and run the code.
The foundation of working in R is the process of creating and manipulating objects of various sorts. **Objects** are assigned names with the assignment operator `<-`. Thus, if I want to assign the name `x` the value of 5, I write `x <- 5` into the console and press return. 34 | 35 | ```{r assignment operator} 36 | x <- 5 37 | ``` 38 | 39 | Nothing is printed when this operation is run, but now, if I type `x` into the console and hit return, the value of `x` will be printed. Alternatively, I could have explicitly called for the object `x` to be printed with the command `print(x)`. The result is the same with either method.[^8] 40 | 41 | ```{r print x} 42 | x 43 | ``` 44 | 45 | The power of R comes with the manipulation of objects through the use of functions. **Functions** take in objects, possibly with a list of arguments to specify how the function works, and return another object. The objects and instructions for the functions are included within parentheses following the function name, and all instructions are separated by commas.[^9] You can think of objects as nouns and functions as verbs. A simple example with the `sum()` function demonstrates how functions work. `sum()` does exactly what you would think. It takes a series of objects and returns their sum. You can see the function's documentation by entering `?sum()` into the console.[^10] The result of the `sum()` function can either be immediately printed to the console, or it can be saved by assigning it a name. Below, I assign the result of `x + 10` a name and then in a second command print out the value of the newly created object. 46 | 47 | ```{r sum} 48 | y <- sum(x, 10) 49 | y 50 | ``` 51 | 52 | The flexibility of coding enables objects to be redefined endlessly. If a mistake is made or the data changes, the command can be rerun with the new values. For instance, if the value of `x` changes from 5 to 15 in the data, I can simply change the code for creating the `x` object and reassign the value of `x`.
Below, I do this and then rerun the same `sum()` command, but this time I print the result directly to the console instead of assigning it to the object `y`. 53 | 54 | ```{r reassign object} 55 | x <- 15 56 | sum(x, 10) 57 | ``` 58 | 59 | Once you get a command that produces the expected result, you can save the command to an R script. A **script** is a text file with a suffix of .R. Thus, you could save the commands to create the objects `x` and `y` to a file called `my_script.R`. The script will then be available to you whenever you want to rerun the code in the console. 60 | 61 | ## R Packages: The tidyverse 62 | Upon download, R comes with a large number of functions, which together are referred to as base R. However, the capabilities of R can be greatly extended through the use of additional packages, which can be downloaded through [The Comprehensive R Archive Network (CRAN)](https://cran.r-project.org). Packages both extend what is possible in R and provide alternative ways to do things that are already possible in base R. This can be confusing, because it means there is often a plethora of ways to perform a single operation, but the extension in capability is well worth it. Particularly significant are the packages bundled together in the `tidyverse` package that were built by [Hadley Wickham](http://hadley.nz) and collaborators. The [tidyverse](https://www.tidyverse.org) provides a set of linked packages that all use a similar grammar to work with data. The tidyverse is [the best place to start if you are new to R](http://varianceexplained.org/r/teach-tidyverse/). 63 | 64 | The examples below show the ability to analyze and visualize data using the tidyverse packages. The analysis will mainly be done with the [dplyr package](http://dplyr.tidyverse.org) and the visualization is done with [ggplot](http://ggplot2.tidyverse.org). The **dplyr** functions all have a similar structure.
The main `dplyr` verbs or functions are `filter()`, `select()`, `arrange()`, `mutate()`, `group_by()`, and `summarise()`. All take a data frame as their first argument. The next set of arguments is the column names on which the function performs its actions. The result of the functions is a new data frame object. The `dplyr` functions can be linked together, so that it is possible to perform multiple manipulations on a data frame in one command. This is done with the use of the pipe, which in R is `%>%`.[^11] 65 | 66 | The commands needed to produce a **ggplot** graph can be confusing at first, but the advantages of `ggplot` are that it is based on a [grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.pdf), which provides a systematic way to discuss graphs, and that `ggplot` comes with good defaults, enabling the creation of nice-looking graphs with minimal code. The two main components of `ggplot` are geoms and the mapping of aesthetics. **Geoms** tell `ggplot` what kind of visual graphic to make, so there are separate geom functions for bar charts, points, lines, etc. **Aesthetics** tell `ggplot` which columns of the data to use for placement on the graph and for any other distinguishing aspects of these variables such as size, shape, and color. The graphs below are used to show some of the possibilities of `ggplot` while staying mainly within the defaults. Publication-ready visualizations are possible with `ggplot`, but this would take more fiddling.[^12] 67 | 68 | ## An Example: Analyzing a correspondence network 69 | With this basic overview of R out of the way, let’s move on to an example of an actual analysis of historical data to see what analysis and visualization in R looks like in practice. The data and an R script with all of the code can be found on [my GitHub page](https://github.com/jessesadler/intro-to-r), if you would like to run the code yourself.
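To see how these verbs and the pipe fit together before turning to the real data, here is a minimal sketch on a made-up data frame. The `toy` object, its `sender` column, and its values are invented purely for illustration.

```r
library(dplyr)

# A made-up data frame of five letters from two hypothetical senders
toy <- tibble(
  sender = c("A", "A", "B", "A", "B"),
  source = c("Antwerp", "Antwerp", "Haarlem", "Antwerp", "Haarlem")
)

# Pipe the data frame into group_by() and summarise(),
# then arrange the rows in descending order of the new count column
toy %>%
  group_by(sender) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes this kind of chaining possible.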
70 | 71 | Before doing anything with the data, it is necessary to set up our environment in R. This means loading the packages that we will be using with the `library()` command. To begin we only need to load the `tidyverse` package, which includes the individual packages that I will use for the analysis. 72 | 73 | ```{r load tidyverse, message = FALSE} 74 | library(tidyverse) 75 | ``` 76 | 77 | Once this is done, it is possible to use the functions from the tidyverse packages, beginning with reading our data into R with the `read_csv()` function. The code below takes a [csv file](https://en.wikipedia.org/wiki/Comma-separated_values) from a location on my computer and loads it into R as a data frame, while saving the object under the name `letters`.[^13] The code can be read as "create a data frame object called `letters` from the csv file named `correspondence-data-1585.csv`." 78 | 79 | ```{r clean-data, eval = FALSE, include = FALSE} 80 | # Script for creating clean data that was then saved to data 81 | # Not run here 82 | letters1585 <- read_csv("data/dvdm-correspondence-1591.csv") %>% 83 | filter(destination != "NA", source != "NA", year == 1585) %>% 84 | select(writer, source, destination, date) 85 | ``` 86 | 87 | ```{r load data, message = FALSE} 88 | letters <- read_csv("data/correspondence-data-1585.csv") 89 | ``` 90 | 91 | Even before anything has been done to manipulate the data, we are already in a better position than if we kept the data in Excel. Having loaded the data, all further manipulations will be done on the object `letters`, and no changes will be made to the `correspondence-data-1585.csv` file. In other words, there is no way to tamper with the database that may have taken hours (or more) to meticulously produce. 
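To make the safety of this workflow concrete, here is a quick sketch of what recovery from a mistake looks like. The botched command in the middle is invented purely for illustration.

```r
library(tidyverse)

letters <- read_csv("data/correspondence-data-1585.csv")

# Suppose a careless command overwrites the object with a single column
letters <- distinct(letters, writer)

# The csv file on disk is untouched, so recovery is a single line
letters <- read_csv("data/correspondence-data-1585.csv")
```

No matter how badly the `letters` object is mangled, rerunning the import restores an exact copy of the raw data.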
92 | 93 | Let's start by taking a look at the data itself by printing out a subset of the data to the console, which can be done by typing the name of the object into the console.[^14] 94 | 95 | ```{r letters_tibble} 96 | letters 97 | ``` 98 | 99 | This tells us that `letters` is a [tibble](http://r4ds.had.co.nz/tibbles.html), which is a special kind of data frame meant to work well in the tidyverse. Though not necessarily aesthetically pleasing, the basic shape of a table of data is clear.[^15] The command printed out the first ten rows, while informing us that the complete data set contains `r nrow(letters)` rows. Each letter has four pieces of information or variables: the writer of the letter, the place from which it was sent, the place to which it was sent, and the date sent. Below the column headings of writer, source, destination, and date we are informed of the type of data for each variable. This shows that the first three columns consist of character strings (chr), meaning that the data is words, while the last column contains dates, which is discussed in greater detail below. 100 | 101 | ## Creating new data frames 102 | Now that we have an idea of the basic structure of the data, it is possible to begin to analyze it. A simple question that is fairly difficult to answer in Excel — at least I do not know how to do it other than by hand — is how many different people wrote letters to Daniel during this period. The code below gives a list of the writers of the letters. The code is a good demonstration of the basic structure of `dplyr` functions. The function `distinct()` takes a data frame, followed by the name of the column that I am interested in, returning a new data frame object with one column, which consists of the unique values from the original column. 103 | 104 | ```{r distinct writers} 105 | distinct(letters, writer) 106 | ``` 107 | 108 | Because I did not assign the output a name, the results were simply printed to the console.
Saving the object requires the use of the assignment operator and choosing a memorable name. With this relatively simple operation I have created my first new data frame, which I can now refer to and further manipulate by referring to `writers`. 109 | 110 | ```{r save distinct writers} 111 | writers <- distinct(letters, writer) 112 | ``` 113 | 114 | Printing out the object shows that there are `r nrow(writers)` people who sent Daniel letters in 1585, but another way to get this information is to run the `nrow()` function, which returns the number of rows in a data frame. It is possible to run the function either on the `writers` data frame or on the command that created it. Let's do the latter and print the result to the console. If you happen to forget the number of correspondents, the command can be typed again. 115 | 116 | ```{r nrow distinct writers} 117 | nrow(distinct(letters, writer)) 118 | ``` 119 | 120 | Once we run the commands and see that they produce results that are what we would expect from the data, it is possible both to save the commands in a script and to reuse their structure for other pieces of information. For example, we can learn the locations from which Daniel's correspondents sent letters and the locations in which he received letters by reusing the `distinct()` function and changing the column name to be manipulated. 121 | 122 | ```{r distinct sources and locations} 123 | distinct(letters, source) 124 | distinct(letters, destination) 125 | ``` 126 | 127 | We can save the objects for later use by using the assignment operator and giving the data frames names.
128 | 129 | ```{r save sources and locations} 130 | sources <- distinct(letters, source) 131 | destinations <- distinct(letters, destination) 132 | ``` 133 | 134 | ## Linking `dplyr` commands together with the pipe 135 | Thus far, I have created data frames with one variable that show the unique values of the different variables, but this has not told us anything about the number of letters sent by each author or from each location. Doing this is more complicated, because it is necessary to chain together a series of functions using the pipe. This is the case for the question of how many letters each correspondent sent. Looking at the `dplyr` verbs listed above you might assume that the `summarise()` function will create a sum of the data. We can use the `summarise()` function to create a new column called count that is filled with the number of observations through the `n()` function. In other words, the code below tells R to summarize the `letters` data frame and place the number of observations in a column called count. 136 | 137 | ```{r summarise} 138 | per_correspondent <- summarise(letters, count = n()) 139 | per_correspondent 140 | ``` 141 | 142 | Running the command and printing the result shows a result that is not what we hoped for. Instead of showing letters per correspondent, the function created a column called count with a single value equal to the number of rows in the `letters` data frame. However, because the objects that are made in R are ephemeral, we can simply rerun the code after reworking it. This overwrites the old `per_correspondent` object with one that is more useful. In the first attempt there is no notion that the goal is to group the number of letters by writer. This is the task of the `group_by()` function.
To create an object with correspondents in one column and the number of letters they sent in the second column we need to group the `letters` data frame by writer and then summarize the letters within each group, creating the count column while doing so. The last line of the code arranges the table in descending order of letters sent by the newly created count variable. Notice that the `letters` data frame is listed first before any function. This is because the `letters` data frame is piped into each of the functions with the use of the `%>%` command, which can be read as "and then." 143 | 144 | ```{r per_correspondent_tbl} 145 | per_correspondent <- letters %>% 146 | group_by(writer) %>% 147 | summarise(count = n()) %>% 148 | arrange(desc(count)) 149 | per_correspondent 150 | ``` 151 | 152 | Looking at the result, we can see that the above changes produced the kind of output we expected. The `group_by()` and `summarise()` functions worked to create a data frame in which each author is listed once. The `count = n()` within the `summarise()` function created a new variable called count that is filled with the number of letters each correspondent sent. A cursory look at the results shows that the vast majority of the letters sent to Daniel were written by his brother Andries and his brother-in-law Jacques della Faille. Andries lived in the besieged city of Antwerp for most of 1585, and Jacques lived in Haarlem, so it is hardly surprising that, if we look at the number of letters sent from each location, Antwerp and Haarlem dominate. Having written the above code, it is possible to rework it to create a data frame called `per_source`, which is done by replacing the writer column in the `group_by()` function with the source variable.
153 | 154 | ```{r per_source_tbl} 155 | per_source <- letters %>% 156 | group_by(source) %>% 157 | summarise(count = n()) %>% 158 | arrange(desc(count)) 159 | per_source 160 | ``` 161 | 162 | ## Visualizing the data with `ggplot` 163 | While it is nice to have the two above sets of information presented in tables, it is also possible to visualize the newly created data with `ggplot`. The structure of `ggplot` functions is a bit different from that of `dplyr`. Like `dplyr` the functions are linked together, but the linking is done with the `+` symbol instead of `%>%`. With `ggplot` the three main functions are `ggplot()`, one or more `geom_*` functions, which determine the type of graphical element to draw, and one or more `aes()` functions that set the aesthetic values of the graph. The designation of the data to use and the `aes()` functions can be placed within either the `ggplot()` function, indicating that they apply to all of the geoms, or they can be placed within individual `geom_*` functions in which case they will only be used for that geom. All this is to say that there are a variety of ways to produce the same visuals. The below code adds an extra function to label the axes with `labs()`. 164 | 165 | ```{r per_source} 166 | ggplot(data = per_source) + 167 | geom_bar(aes(x = source, y = count), stat = "identity") + 168 | labs(x = NULL, y = "Letters written") 169 | ``` 170 | 171 | One part of the above code that might be a bit difficult to figure out is `stat = "identity"`. This needs to be called, because the code to create the `per_source` data frame actually did more work than necessary. `stat = "identity"` tells `geom_bar()` to set the height of the bars to the exact number in the count column. However, with `geom_bar()` only the x-axis needs to be specified. The y value is then calculated automatically based on the x value. Therefore, the below code, which uses the original `letters` data frame, could produce the exact same graph.
Because of this, it is little trouble to change the variable for the x-axis to writer and get a completely new graph. Notice the change in the data and the deletion of the y variable and `stat`. One problem with the correspondents data is the length of the names. Therefore, this command also calls the `theme()` function. The arguments in the function are used to place the correspondent names at a ninety-degree angle so that they do not overlap. 172 | 173 | ```{r per_correspondent, fig.asp = 1} 174 | ggplot(data = letters) + 175 | geom_bar(aes(x = writer)) + 176 | labs(x = NULL, y = "Letters written") + 177 | theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 178 | ``` 179 | 180 | ## Working with dates 181 | Now, let's investigate the date column. Dealing with dates is often tricky. To facilitate this analysis we need to load the `lubridate` package. [lubridate](http://lubridate.tidyverse.org/) is another package created by Hadley Wickham, which fits into the tidyverse manner of dealing with data. It is not among the packages loaded with `library(tidyverse)`, and so it is necessary to load it separately. 182 | 183 | ```{r load lubridate, message = FALSE} 184 | library(lubridate) 185 | ``` 186 | 187 | Since the letters in this data set were all received in one year, it would be interesting to see how many letters Daniel received each month. This question is particularly interesting, because in 1585 Daniel's home city of Antwerp was under siege by Spanish troops, and Daniel was serving as a representative for his city to the rebels in Holland. It is also an interesting issue of analysis, because while the `letters` data frame contains information about the month the letters were sent, there is no month column. This prevents us from using the `group_by()` and `summarise()` workflow that we have developed. The answer comes from the `lubridate` package, which includes a function called `month()`.
This function extracts the month from a date object, which we know the date column is, because it is identified as such when we printed the `letters` data frame. The goal is to create a new column called month. This is done through the `dplyr` function `mutate()`, which creates the column and then applies the `month()` function to each of the dates. The remaining code is similar to that used above, but now the column we want to `group_by()` and `summarise()` is the newly created month column. 188 | 189 | ```{r per_month_nr} 190 | per_month <- letters %>% 191 | mutate(month = month(date)) %>% 192 | group_by(month) %>% 193 | summarise(count = n()) 194 | per_month 195 | ``` 196 | 197 | Looking at the result of this code, a problem is immediately apparent. The number of letters per month is as expected, but the months are returned as numbers, which is less than ideal. However, by looking at the documentation for lubridate by typing `?month()` into the console, it is possible to see that what needs to be done is to change the `label` argument to `TRUE` from the default of `FALSE`. 198 | 199 | ```{r per_month_tbl} 200 | per_month <- letters %>% 201 | mutate(month = month(date, label = TRUE)) %>% 202 | group_by(month) %>% 203 | summarise(count = n()) 204 | per_month 205 | ``` 206 | 207 | Now that we have the data in a good form we can plot it by making another bar chart. 208 | 209 | ```{r per_month} 210 | ggplot(data = per_month) + 211 | geom_bar(aes(x = month, y = count), stat = "identity") + 212 | labs(x = 1585, y = "Letters sent") 213 | ``` 214 | 215 | The graph shows a peak in March of 1585 when Antwerp was in an extremely vulnerable position and it was more important than ever that it receive military and monetary assistance from the rebels in Holland. Another peak is reached in August when Antwerp surrendered, forcing Daniel and his family members to go into exile.
The letters declined beginning in October, when Daniel left Holland to live in exile with his family in Bremen. 216 | 217 | As a fun aside, it is also possible to see what day of the week the letters were sent. Luckily, the data comes from 1585, three years after the creation of the [Gregorian Calendar](http://en.wikipedia.org/wiki/Gregorian_calendar), and the areas from which the letters were sent had already transitioned to the new calendar. This means that they were using the same calendar as we do now, as opposed to a place like England, which only adopted the Gregorian Calendar in 1752. Therefore, we do not have to worry about adding 10 days to move from the old calendar to the new calendar. 218 | 219 | ```{r per_wday_tbl} 220 | per_wday <- letters %>% 221 | mutate(wday = wday(date, label = TRUE)) %>% 222 | group_by(wday) %>% 223 | summarise(count = n()) 224 | per_wday 225 | ``` 226 | 227 | We can even create the chart using the original `letters` data frame and call the `wday()` function within the definition of the x variable. 228 | 229 | ```{r per_wday} 230 | ggplot(data = letters) + 231 | geom_bar(aes(x = wday(date, label = TRUE))) + 232 | labs(x = 1585, y = "Letters sent") 233 | ``` 234 | 235 | Finally, let's see who sent Daniel letters on Sundays. We can use the `filter()` function, which returns rows that match a certain argument. The below code uses the knowledge that without labels, Sunday is equivalent to 1. The code is also written in a slightly different style. Even though only a single function is called, `%>%` is used to pipe the data frame into the function, which cleans up the function call slightly. 236 | 237 | ```{r sunday} 238 | letters %>% filter(wday(date) == 1) 239 | ``` 240 | 241 | ## Combining data for correspondents and dates 242 | The `group_by()` and `summarise()` workflow developed so far results in data frames with two columns and thus only tells us about one variable from the data at a time.
We can get a better understanding of Daniel's network by looking into how many letters each writer sent per month, which would involve the creation of a data frame with three columns. One advantage of using code is that it is often relatively simple to take a functioning line of code and tweak it to create a different result. Here, I take the line of code that produced `per_month` and alter it by adding the writer variable to the `group_by` function. Now the code will group the letters written by each correspondent per month and then count them. Notice how the dimensions of the resulting data frame have changed. 243 | 244 | ```{r correspondent_per_month} 245 | correspondent_month <- letters %>% 246 | mutate(month = month(date, label = TRUE)) %>% 247 | group_by(writer, month) %>% 248 | summarise(count = n()) %>% 249 | arrange(desc(count)) 250 | correspondent_month 251 | ``` 252 | 253 | With this new object called `correspondent_month` it is possible to add to the bar chart of letters per month by filling in the bars with letters per correspondent in each month. The only difference in the code below from the earlier chart is the inclusion of `fill = writer` in `geom_bar()` and then giving it a label in `labs()`. 254 | 255 | ```{r correspondent_month} 256 | ggplot(data = correspondent_month) + 257 | geom_bar(aes(x = month, y = count, fill = writer), stat = "identity") + 258 | labs(x = 1585, y = "Letters sent", fill = "Correspondents") 259 | ``` 260 | 261 | A bar graph is only one of many different geoms made available through `ggplot`, so let me finish by using the same information to produce a line graph with points showing the number of letters each correspondent sent in a month. The structure of the command is slightly different here, because I place the `aes()` function in the `ggplot()` function. This is done because the same aesthetics will be used by both `geom_point()` and `geom_line()`. 
In the `geom_point()` function I increase the size of all of the points so that they are more clearly visible. The `geom_line()` function adds a new aesthetic for group, which tells the function how to connect the lines. 262 | 263 | ```{r line_graph, fig.asp = 0.75} 264 | ggplot(data = correspondent_month, aes(x = month, y = count, color = writer)) + 265 | geom_point(size = 3) + 266 | geom_line(aes(group = writer)) + 267 | labs(x = 1585, y = "Letters sent", color = "Correspondents") 268 | ``` 269 | 270 | These two graphs provide different ways to visualize the data. Both give a clearer picture of the development of Daniel's correspondence over the course of 1585. Up until September, Andries was Daniel's most significant correspondent. After the fall of Antwerp, the two lived together in Bremen, forestalling any need to communicate by correspondence. On the other hand, in the second half of the year, Daniel's correspondence picked up with his brother-in-law, Jacques. 271 | 272 | ## Conclusion 273 | The examples above only display a small fraction of the analysis and visualization capabilities of R. More fully developed visualizations with a larger set of the data used here can be seen on [the Projects page of this website](https://www.jessesadler.com/project/), including an [interactive map of the letters that can be filtered by date](https://jessesadler.shinyapps.io/dvdm-correspondence/). Going from no coding experience to the production of visualizations in R is by no means easy, though there is an ever-growing set of resources designed to make the process as easy as possible. This post is part of my attempt to add to these resources from the perspective of a historian. In future posts I will build on the foundation provided here. In particular, the above analysis did not take advantage of the geographic nature of the data. I will discuss the process of geocoding the data and mapping it in an upcoming post. 
274 | 275 | [^1]: This post concentrates on the basic aspects of data analysis and visualization using the popular [dplyr](http://dplyr.tidyverse.org) and [ggplot](http://ggplot2.tidyverse.org) packages for R. In future posts, I will discuss more aspects of R. 276 | 277 | [^2]: [Episode 9 of the Not So Standard Deviations podcast](https://simplystatistics.org/2016/02/12/not-so-standard-deviations-episode-9-spreadsheet-drama/) has a good discussion of Excel vs R, which helped me think through my own thoughts on this subject. 278 | 279 | [^3]: My workflow is to enter data into Numbers, export it into a csv file, and then analyze it with R. I have found data entry directly into csv too clunky, and so spreadsheets still have a place in my life. 280 | 281 | [^4]: There are a variety of different data types in R, but in this post I will concentrate on tabular data, or data frames, because that is the form of data derived from spreadsheets and the form most used within the humanities and social sciences. 282 | 283 | [^5]: See the explanation and examples below on how objects in R are created and manipulated. 284 | 285 | [^6]: This was the position that I was in only a couple of months ago. 286 | 287 | [^7]: If you are interested in learning how to use R, two good places to start are [Garrett Grolemund and Hadley Wickham, *R for Data Science*](http://r4ds.had.co.nz) and [Roger Peng, *R Programming for Data Science*](https://bookdown.org/rdpeng/rprogdatascience/). Both are available for free and cover the topics that I discuss here in much greater detail. 288 | 289 | [^8]: Do not worry about the `[1]` in the printed output. This merely informs us that this is the first number in the object `x`. Behind the scenes in R `x` is not simply a number but a vector of numbers of length one. 290 | 291 | [^9]: A common frustration when learning to code is how persnickety computers are about typos and grammar. If a command does not work, it is likely because there is a typo somewhere. 
292 | 293 | [^10]: If you want to know how a function works or what arguments are necessary to run the function, you can always access a function's documentation with `?function_name()`. The documentation for many functions tends to be jargon-heavy, but most also contain examples of how to use the function. 294 | 295 | [^11]: The best resource for learning about these `dplyr` functions is Chapter 5 of [Grolemund and Wickham, *R for Data Science*](http://r4ds.had.co.nz/transform.html). 296 | 297 | [^12]: There are many resources for learning `ggplot`. Chapter 3 of [Grolemund and Wickham, *R for Data Science*](http://r4ds.had.co.nz/data-visualisation.html) is a good place to start. Another invaluable resource for ggplot that is still under development is [Kieran Healy’s *Data Visualization for Social Science*](http://socviz.co/). 298 | 299 | [^13]: Remember that a data frame is essentially a table. To be more precise, in this case, the `read_csv()` function produces a [tibble](http://tibble.tidyverse.org), which is a special kind of data frame. 300 | 301 | [^14]: One advantage of tibbles over default data frames is that tibbles are smart about printing out a subset of the data to the console instead of all of the content. The latter approach becomes messy when data frames have hundreds or thousands of rows. 302 | 303 | [^15]: RStudio comes with its own viewer, enabling you to view the contents of data frames in a manner that replicates the experience of a spreadsheet. 
-------------------------------------------------------------------------------- /geocoding-with-r.R: -------------------------------------------------------------------------------- 1 | ### Geocoding and mapping with R ### 2 | 3 | # This script corresponds with a blog post 4 | # which can be found at https://www.jessesadler.com/post/geocoding-with-r/ 5 | 6 | library(tidyverse) 7 | library(ggmap) 8 | 9 | # Load the data 10 | letters <- read_csv("data/correspondence-data-1585.csv") 11 | 12 | ### Geocode the data ### 13 | sources <- distinct(letters, source) 14 | destinations <- distinct(letters, destination) 15 | 16 | cities <- full_join(sources, destinations, by = c("source" = "destination")) 17 | cities <- rename(cities, place = source) 18 | 19 | ############################################## 20 | ## Alternative way to get cities data frame ## 21 | ############################################## 22 | 23 | sources <- letters %>% 24 | distinct(source) %>% 25 | rename(place = source) 26 | 27 | destinations <- letters %>% 28 | distinct(destination) %>% 29 | rename(place = destination) 30 | 31 | cities <- full_join(sources, destinations, by = "place") 32 | 33 | # Convert to data frame 34 | cities_df <- as.data.frame(cities) 35 | 36 | ### Geocode function ### 37 | locations_df <- mutate_geocode(cities_df, place) 38 | 39 | # Back to tibble 40 | locations <- as_tibble(locations_df) 41 | 42 | library(sf) 43 | library(mapview) 44 | 45 | # sf points 46 | locations_sf <- st_as_sf(locations, coords = c("lon", "lat"), crs = 4326) 47 | 48 | # Mapview map 49 | mapview(locations_sf) 50 | 51 | # Save locations data 52 | write_csv(locations, "data/locations.csv") 53 | 54 | ### Mapping with ggmap ### 55 | geocode("mannheim") 56 | 57 | map <- get_googlemap(center = c(8.4, 49.5), zoom = 6) 58 | 59 | # Black and white styled map 60 | bw_map <- get_googlemap(center = c(8.4, 49.5), zoom = 6, 61 | color = "bw", 62 | style = 
"feature:road|visibility:off&style=element:labels|visibility:off&style=feature:administrative|visibility:off") 63 | 64 | # Locations map 65 | ggmap(bw_map) + 66 | geom_point(data = locations, aes(x = lon, y = lat)) 67 | 68 | ### Adding data with dplyr ### 69 | per_source <- letters %>% 70 | group_by(source) %>% 71 | summarise(count = n()) %>% 72 | arrange(desc(count)) 73 | 74 | per_destination <- letters %>% 75 | group_by(destination) %>% 76 | summarise(count = n()) %>% 77 | arrange(desc(count)) 78 | 79 | # Join locations to source and destination data 80 | geo_per_source <- left_join(per_source, locations, by = c("source" = "place")) 81 | geo_per_destination <- left_join(per_destination, locations, by = c("destination" = "place")) 82 | 83 | ## Data map 1 ## 84 | ggmap(bw_map) + 85 | geom_point(data = geo_per_destination, 86 | aes(x = lon, y = lat), color = "red") + 87 | geom_point(data = geo_per_source, 88 | aes(x = lon, y = lat), color = "purple") 89 | 90 | ## Data map 2 ## 91 | ggmap(bw_map) + 92 | geom_point(data = geo_per_destination, 93 | aes(x = lon, y = lat, size = count), 94 | color = "red", alpha = 0.5) + 95 | geom_point(data = geo_per_source, 96 | aes(x = lon, y = lat, size = count), 97 | color = "purple", alpha = 0.5) 98 | 99 | ## Data map 3 ## 100 | ggmap(bw_map) + 101 | geom_point(data = geo_per_destination, 102 | aes(x = lon, y = lat, size = count, color = "Destination"), 103 | alpha = 0.5) + 104 | geom_point(data = geo_per_source, 105 | aes(x = lon, y = lat, size = count, color = "Source"), 106 | alpha = 0.5) + 107 | scale_color_manual(values = c(Destination = "red", Source = "purple")) + 108 | scale_size_continuous(range = c(2, 9)) + 109 | geom_text_repel(data = locations, aes(x = lon, y = lat, label = place)) + 110 | labs(title = "Correspondence of Daniel van der Meulen, 1585", 111 | size = "Letters", 112 | color = NULL) + 113 | guides(color = guide_legend(override.aes = list(size = 6))) 
-------------------------------------------------------------------------------- /geocoding-with-r.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Geocoding with R" 3 | output: 4 | md_document: 5 | variant: markdown 6 | always_allow_html: yes 7 | --- 8 | 9 | ```{r setup, include=FALSE} 10 | post <- "geocoding-with-r" 11 | source("common.R") 12 | ``` 13 | 14 | In the [previous post](https://jessesadler.com/post/excel-vs-r/) I discussed some reasons to use [R](https://www.r-project.org) instead of Excel to analyze and visualize data and provided a brief introduction to the R programming language. That post used an example of letters sent to the sixteenth-century merchant [Daniel van der Meulen](https://jessesadler.com/project/dvdm-correspondence/) in 1585. One aspect missing from the analysis was the visualization of the geographical aspects of the data. This post will provide an introduction to geocoding and mapping location data using the [ggmap](https://cran.r-project.org/web/packages/ggmap/index.html) package for R, which enables the creation of maps with [ggplot](http://ggplot2.tidyverse.org). There are a number of websites that can help geocode location data and even create maps.[^1] You could also use a full-scale geographic information system (GIS) application such as [QGIS](https://qgis.org) or ESRI's [ArcGIS](https://esri.com). However, an active developer community has made it possible to complete a full range of geographic analysis, from geocoding data to the creation of publication-ready maps, with R.[^2] Geocoding and mapping data with R instead of a web or GIS application brings the general advantages of using a programming language in analyzing and visualizing data. 
With R, you can write the code once and use it over and over, while also providing a record of all your steps in the creation of a map.[^3] 15 | 16 | This post will merely scratch the surface of the mapping capabilities of R and will not enter into the domain of the more specialized geographic packages available for R.[^4] Instead, it will build on the [dplyr](http://dplyr.tidyverse.org) and [ggplot](http://ggplot2.tidyverse.org) skills discussed in [my brief introduction to R](https://jessesadler.com/post/excel-vs-r/). The example of geocoding and mapping with R will also provide another opportunity to show the advantages of coding. In particular, geocoding is a good example of how code can simplify the workflow for entering data. Instead of dealing with two separate spreadsheets, one containing information about the letters and the other containing geographic information, with code the geographic information can be created directly from the contents of the letters data. This has the added advantage that the code to find the longitude and latitude of locations can be saved as an R script and rerun if new data is added, ensuring that the information is always kept up to date. 17 | 18 | ## Preparing the data with `dplyr` {#preparing-data} 19 | In this example, I will use the same database of letters sent to Daniel van der Meulen in 1585 as I did in the previous post. You can find the data and the R script that goes along with this tutorial on [GitHub](https://github.com/jessesadler/intro-to-r). Before getting into the database of letters and figuring out how to geocode the locations found in the data, it is necessary to set up the environment in R by loading the libraries that we will be using. Here, I load both the `tidyverse` library to import and manipulate the data and the `ggmap` library to do the actual geocoding and mapping. 
20 | 21 | ```{r libraries} 22 | library(tidyverse) 23 | library(ggmap) 24 | ``` 25 | 26 | As in the previous post, the data is loaded through the `read_csv()` function from the `readr` package. It is best to keep the names of objects consistent across scripts. Therefore, I name the object `letters` with the assignment operator. Printing out the contents reveals that `letters` is a [tibble](http://tibble.tidyverse.org) with 114 letters and four columns. 27 | 28 | ```{r load data} 29 | letters <- read_csv("data/correspondence-data-1585.csv") 30 | letters 31 | ``` 32 | 33 | To do the actual geocoding of the locations I will be using the `mutate_geocode()` function from the `ggmap` package. To geocode a number of locations at one time, the function requires a data frame with a column containing the locations we would like to geocode. The goal, then, is to get a data frame with a column that contains all of the distinct locations found in the `letters` data frame. In the [introduction to R](https://jessesadler.com/post/excel-vs-r/#new-dataframes) post I used the `distinct()` function to get data frames with the unique sources and destinations. We can rerun that code here and look at the results. 34 | 35 | ```{r distinct sources/destinations} 36 | sources <- distinct(letters, source) 37 | destinations <- distinct(letters, destination) 38 | ``` 39 | 40 | ```{r print sources/destinations} 41 | sources 42 | destinations 43 | ``` 44 | 45 | A glance at the two data frames shows that neither provides exactly what we are looking for. Neither the `sources` nor the `destinations` data frame includes all of the locations that we want to geocode. It would be possible to geocode both the `sources` and `destinations` data frames, but this would place the geocoded information in two different data frames, which is less than ideal. Instead, we can join the two data frames together using the [join functions in dplyr](http://r4ds.had.co.nz/relational-data.html). 
Coming from the world of spreadsheets, the join functions are a revelation, opening up seemingly endless possibilities for data manipulation. 46 | 47 | The `dplyr` package includes a number of functions to join two data frames together to create a single data frame. The join functions use overlapping columns of data contained in both data frames, called keys, to match up the data. There are three join functions that are used most often. The `left_join()` keeps all observations from the left, or first, data frame, while dropping all rows from the right, or second, data frame that do not have a match in the left data frame. The `inner_join()` only keeps rows that contain matching keys in both data frames. Conversely, the `full_join()` brings together all rows from both the left and right data frames. The differences between these functions may not be immediately apparent, but you can experiment with them to see the variety of outputs they create. 48 | 49 | Using `sources` as the left and `destinations` as the right data frames, a `left_join()` creates a new object with 9 rows, an `inner_join()` results in only one row, and a `full_join()` contains 13 observations. Thus, the `full_join()` is what we are looking for. The `full_join()` function — like the other join functions — takes three arguments: the two data frames to join and the key column by which they are to be joined. In this case, some extra work needs to be done to identify the key columns. The key columns are the only columns in the two data frames, but because they have different names, it is necessary to declare that they are equivalent. This is done with the help of the concatenate or combine function, `c()`.[^5] The below command creates a new data frame that I have called `cities`, which brings together the "source" column with the "destination" column and therefore contains all of the locations found in the `letters` data frame. 
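Before running that command, the behavior of the three joins can be checked with a small toy example. This is a sketch using a few of the city names from the data under invented `sources_toy` and `destinations_toy` objects, so that it does not overwrite the real `sources` and `destinations`; with the full data, the same pattern yields the 9, 1, and 13 rows described above.

```r
library(dplyr)

# Toy versions of the sources and destinations data frames
sources_toy <- tibble(source = c("Antwerp", "Haarlem"))
destinations_toy <- tibble(destination = c("Delft", "Haarlem", "The Hague"))

# left_join() keeps all rows from the left data frame: 2 rows
left_join(sources_toy, destinations_toy, by = c("source" = "destination"))

# inner_join() keeps only the key found in both data frames (Haarlem): 1 row
inner_join(sources_toy, destinations_toy, by = c("source" = "destination"))

# full_join() keeps all rows from both data frames: 4 rows
full_join(sources_toy, destinations_toy, by = c("source" = "destination"))
```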
50 | 51 | ```{r join locations} 52 | cities <- full_join(sources, destinations, by = c("source" = "destination")) 53 | cities 54 | ``` 55 | 56 | Printing out the `cities` data frame shows that there are 13 distinct locations in the `letters` data. However, the structure of the `cities` object is less than ideal. The name of the column is "source", which was taken over from the `sources` data frame, but this is not an accurate description of the data in the column. This can be fixed with the help of `rename()`, which uses the structure of `new_name = old_name`. Here, I change the name of the "source" column to "place" and then print out the first two rows with the `head()` function to show the change in the column name. Notice that using the `cities` object within the `rename()` function and using the same name for the result overwrites the original object with the new one. Alternatively, you could give the new object a different name such as `cities1`.[^6] 57 | 58 | ```{r rename source column} 59 | cities <- rename(cities, place = source) 60 | head(cities, n = 2) 61 | ``` 62 | 63 | ## Geocoding with `ggmap` {#geocoding} 64 | We now have an object with the basic structure needed to geocode the locations. However, if you run `mutate_geocode()` on the `cities` object as it is, you will receive an error. The error is a good example of a common frustration with coding. Computers are picky, and because humans also write flawed code, bugs exist, making computers picky in odd ways. In this case, we are running into a problem in which the `mutate_geocode()` function will not work on tibbles. As noted in my previous post, [tibbles](http://r4ds.had.co.nz/tibbles.html) are a special kind of data frame used by the `tidyverse` set of packages. It is usually easier to work with tibbles in the [tidyverse](https://www.tidyverse.org) workflow, but here it is necessary to convert the tibble to a standard data frame object with `as.data.frame()`. 
I give the result a new name to distinguish it from the `cities` tibble. 65 | 66 | ```{r tibble to df} 67 | cities_df <- as.data.frame(cities) 68 | ``` 69 | 70 | The locations data from the letters sent to Daniel is now ready to be geocoded. The `mutate_geocode()` function uses [Google Maps](https://maps.google.com) to find the longitude and latitude of each location. The two necessary arguments are the data frame and the name of the column with the location data. The function can be used to find more information about each location, including the country and region, but here I just have the function return the longitude and latitude data. The function will query Google Maps, and so you must have an internet connection. This makes running the command relatively slow, especially if your data contains a large number of locations. There is also a limit of 2,500 queries per day, so you may have to find other methods if you are geocoding thousands of locations. 71 | 72 | ```{r geocode data, eval = FALSE} 73 | locations_df <- mutate_geocode(cities_df, place) 74 | ``` 75 | 76 | ```{r load locations data, include = FALSE} 77 | locations <- read_csv("data/locations.csv") 78 | locations_df <- as.data.frame(locations) 79 | ``` 80 | 81 | Because `mutate_geocode()` requires an internet connection, is somewhat slow to run, and has a daily limit, it is not something that you want to do all the time. It is a good idea to save the results by writing the object out to a csv file. Before doing this, however, we should inspect the data and make sure everything is correct. Let's start by printing out the `locations_df` object and seeing what we have. 82 | 83 | ```{r print locations_df} 84 | locations_df 85 | ``` 86 | 87 | At a quick glance, the result looks like what we would expect. `locations_df` is a data frame with three columns called "place", "lon", and "lat", with the latter two representing the longitude and latitude of the location named in the first column. 
One thing that we might want to change, though this step is not necessary, is to convert the data frame back to a tibble, which can be done with the `as_tibble()` function. 88 | 89 | ```{r into tibble} 90 | locations <- as_tibble(locations_df) 91 | ``` 92 | 93 | A quick look at the values of the longitude and latitude columns in the `locations` data would seem to indicate that the geocoding process occurred correctly. All of the longitude and latitude values are within a reasonable range given the cities in the data. However, because the data given to the Google Maps query consisted of only the names of the cities, it is always worth double checking the returned values. As anyone who has ever used Google Maps before knows, it is possible that the query could return the location of a different place that has the same name.[^7] 94 | 95 | One way to check that the geocoding was done correctly is to map the locations with the [mapview](https://cran.r-project.org/web/packages/mapview/index.html) package. Using `mapview` requires converting the `locations` tibble to yet another format. This is done with the [simple features](https://cran.r-project.org/web/packages/sf/) package, which brings us to the world of GIS with R. The first step is to load the `sf` and `mapview` packages. 96 | 97 | ```{r load spatial packages} 98 | library(sf) 99 | library(mapview) 100 | ``` 101 | 102 | I will not get into the details of the `sf` package and the types of objects that it creates here, but the function to transform the `locations` tibble into an `sf` object is understandable even without knowing the details. The function to make an `sf` object takes three main arguments. The first two are the data frame to be converted and the columns that contain the geographic data. This second argument uses the `c()` function to combine the "lon" and "lat" columns. 
The third argument determines the [coordinate reference system (crs)](https://en.wikipedia.org/wiki/Spatial_reference_system) or projection used for the data. Here, I indicate that I want the longitude and latitude to be plotted using the World Geodetic System 1984 (WGS 84) coordinate system, which is referenced as [European Petroleum Survey Group (EPSG)](http://wiki.gis.com/wiki/index.php/European_Petroleum_Survey_Group) 4326. Geographic jargon aside, what matters at this stage is that EPSG 4326 is the coordinate system used by web maps such as Google Maps.[^8] 103 | 104 | ``` {.r} 105 | locations_sf <- st_as_sf(locations, coords = c("lon", "lat"), crs = 4326) 106 | ``` 107 | 108 | With the data in the correct format, a simple call to the `mapview()` function creates an interactive map with all the locations plotted as points. You can click on a point to see its name and compare it to the locations on the map. With this data, the accuracy of the location only needs to be at the city level. The location within the city is not relevant. Inspecting the data from the map shows that all of the locations were correctly geocoded. 109 | 110 | ```{r mapview, eval = FALSE} 111 | mapview(locations_sf) 112 | ``` 113 | 114 | I can now save the locations data using the `readr` package and a function similar to that used to load the data. I will use the `write_csv()` function to save the data as a csv file. Here, I save the `locations` tibble, but you could also save the other forms of the locations data. The second argument tells the function where to save the csv and what to call the file. Here, I place the file in the same folder as "correspondence-data-1585.csv" and name the file "locations.csv". 
115 | 116 | ``` {.r eval = FALSE} 117 | write_csv(locations, "data/locations.csv") 118 | ``` 119 | 120 | ## Mapping with `ggmap` {#mapping} 121 | Now that we have successfully geocoded the locations from which Daniel's correspondents sent letters and in which Daniel received them, we can move on to the task of plotting the locations on a map with `ggmap`.[^9] `ggmap` is essentially an extension of `ggplot`. It enables you to plot a map as the background of a `ggplot` graph. The two main plotting features of `ggmap` are the `get_map()` function, which downloads a map from a specified map provider, and the `ggmap()` function that plots the downloaded map on a `ggplot` plot. Once a map is downloaded and plotted, it is then possible to use the normal grammar of `ggplot` to visually represent that data on the map. 122 | 123 | The `get_map()` function can access data from three different map providers: [Google Maps](http://maps.google.com), [Open Street Maps](http://www.openstreetmap.org), and [Stamen Maps](http://maps.stamen.com/). In this example, I will use Google Maps and the `get_googlemap()` function. The challenge with the `get_map()` function is downloading a map with a zoom level and location that shows all of the data with minimal extra space. We need two pieces of information to do this for `get_googlemap()`: a location for the center of the map and a zoom level between 1 and 21.[^10] Figuring out the best center and zoom level for a map may take some trial and error. However, the work that we have already done can help to make an educated first guess. We can rerun `mapview(locations_sf)` and look at the details provided by the map. The map produced by `mapview()` shows the longitude and latitude at your cursor position and the zoom level of the map. We can use the cursor to guess a good center of the map or zoom in to find a city, which we can then geocode. 
After a couple of tries, I found that Mannheim, Germany works as a good center for a map with a zoom level of 6. I can get the coordinates of Mannheim with the generic `geocode()` function and then place the coordinates into the `get_googlemap()` function. The only alteration that needs to be made is to glue together the longitude and latitude values into one object with the concatenate function, `c()`. 124 | 125 | ```{r geocode mannheim} 126 | geocode("mannheim") 127 | ``` 128 | 129 | ```{r google map} 130 | map <- get_googlemap(center = c(8.4, 49.5), zoom = 6) 131 | ``` 132 | 133 | We can look at what the map looks like by calling the `ggmap()` function. 134 | 135 | ```{r basic-google-map} 136 | ggmap(map) 137 | ``` 138 | 139 | Given the historical nature of the data, some of the normal features of a Google Map are problematic. The modern road system obviously did not exist in the sixteenth century, nor are the modern political boundaries useful for the data. In the below command, I change the aesthetics of the original map using the color argument and turn off features of the map by using commands from the [Google Maps API](https://developers.google.com/maps/documentation/staticmaps/). 140 | 141 | ```{r styled map} 142 | bw_map <- get_googlemap(center = c(8.4, 49.5), zoom = 6, 143 | color = "bw", 144 | style = "feature:road|visibility:off&style=element:labels|visibility:off&style=feature:administrative|visibility:off") 145 | ``` 146 | 147 | Now let's see what the map looks like, but this time let's add the location data. This is done using the normal `ggplot` functions. Here, I want to add points for each place in the `locations` data that we created above. The `aes()` function within `geom_point()` tells `ggplot` that the x-axis corresponds to the longitude and the y-axis to the latitude of each place. 
148 | 149 | ```{r locations-map} 150 | ggmap(bw_map) + 151 | geom_point(data = locations, aes(x = lon, y = lat)) 152 | ``` 153 | 154 | The resulting map is rather sparse and does not provide much information, but it gives a good starting point from which to build a more informative map. 155 | 156 | ## Adding data with `dplyr` {#adding-data} 157 | The above map used the `locations` data to plot Daniel van der Meulen's correspondence in 1585, but because it did not use the `letters` data, the map could not tell us anything about the number of letters sent from or received in each location. Visualizing this information is a two-step process. Firstly, it is necessary to find out how many letters were sent from and received in each location. Because we need to distinguish between sent and received locations, this data will be contained in two data frames. To do this, we can reuse the `per_source` and `per_destination` code discussed in the [previous post](https://jessesadler.com/post/excel-vs-r/#the-pipe), which summarizes the number of letters sent from and to each location. Secondly, we have to join the longitude and latitude information from the `locations` data frame to the `per_source` and `per_destination` data frames. 158 | 159 | As a reminder, to create the `per_source` and `per_destination` objects I will use the `group_by()` and `summarise()` workflow. This groups the data by one or more defined variables and creates a new column that counts the number of observations in each group. Below, I make `per_source` and `per_destination` data frames and print out the `per_destination` object to show its form. 
160 | 161 | ```{r per_source/destination} 162 | per_source <- letters %>% 163 | group_by(source) %>% 164 | summarise(count = n()) %>% 165 | arrange(desc(count)) 166 | 167 | per_destination <- letters %>% 168 | group_by(destination) %>% 169 | summarise(count = n()) %>% 170 | arrange(desc(count)) 171 | ``` 172 | 173 | ```{r per_destination} 174 | per_destination 175 | ``` 176 | 177 | The `per_source` and `per_destination` data frames are in the basic structure that we want, but we need to add longitude and latitude columns to both of the objects so that they can be plotted on our map. Here, I will use a `left_join()`. As noted above, a `left_join()` only keeps the observations from the first data frame in the function. In other words, the result of a `left_join()` will have the same number of rows as the original left data frame, while adding the longitude and latitude columns from the `locations` data frame. The data frames will be joined by the columns containing the name of the cities. Because these "key" columns have different names, it is again necessary to denote their equivalency with `c()`. Below, I print out the newly created `geo_per_destination` data frame to show its structure. 178 | 179 | ```{r geo_per_source/destination} 180 | geo_per_source <- left_join(per_source, locations, by = c("source" = "place")) 181 | 182 | geo_per_destination <- left_join(per_destination, locations, by = c("destination" = "place")) 183 | ``` 184 | 185 | ```{r geo_per_destination} 186 | geo_per_destination 187 | ``` 188 | 189 | I now have the necessary data to create a map that will distinguish between the sources of the letters and their destinations and will allow me to show the quantities of letters sent from and received in each location. 190 | 191 | ## Mapping the data {#mapping-data} 192 | Creating a quality visualization with `ggplot` involves iteration. 
Because the different parts of the plot are all written out in code, aspects can be added, subtracted, or modified until a good balance is found. Let's start by creating a basic map using the `geo_per_source` and `geo_per_destination` data. The structure of the command is similar to that used to make the first map, but now that I am using information from two data frames, I need to use two `geom_point()` functions. There is also a small change to the `ggmap()` function to have the map take up the entire plotting area so that the longitude and latitude scales do not show. The only change to the `geom_point()` function is the addition of different colors for the two sets of points. This makes it easier to distinguish between places from which Daniel's correspondents sent letters and places where he received them. Notice that the argument for the color of the points is placed outside of the `aes()` function. This makes all of the points plotted from each data frame a single color as opposed to mapping a change in color to a variable within the data. In this instance, I chose to specify color by name, but it is also possible to use rgb values, hex values, or a number of different color palettes.[^11] 193 | 194 | ```{r data-map1, fig.width = 8, fig.asp = 0.925} 195 | ggmap(bw_map) + 196 | geom_point(data = geo_per_destination, 197 | aes(x = lon, y = lat), color = "red") + 198 | geom_point(data = geo_per_source, 199 | aes(x = lon, y = lat), color = "purple") 200 | ``` 201 | 202 | This plot is much better than what we started with, but it still has a couple of issues. In the first place, it does not communicate any information about the quantity of letters. In addition, because the points are opaque, it is not clear that letters were both sent from and to Haarlem. The former issue can be rectified by using the size argument within the `aes()` function. This will tell `ggplot` to vary the size of each of the points in proportion to the count column. 
By default, the size aesthetic creates a legend to indicate the scale used. In the `ggmap()` function I place the legend in the top right corner of the map since there are no data points there. The latter issue is solved by adding an alpha argument to the two `geom_point()` functions. This argument is placed outside of the `aes()` function, because it is an aspect we want to apply to all points. [Alpha](https://www.w3schools.com/css/css3_colors.asp) describes the translucency of an object and takes values between 1 (fully opaque) and 0 (fully transparent). 203 | 204 | ```{r data-map2, fig.width = 8, fig.asp = 0.875} 205 | ggmap(bw_map) + 206 | geom_point(data = geo_per_destination, 207 | aes(x = lon, y = lat, size = count), 208 | color = "red", alpha = 0.5) + 209 | geom_point(data = geo_per_source, 210 | aes(x = lon, y = lat, size = count), 211 | color = "purple", alpha = 0.5) 212 | ``` 213 | 214 | The above map is more informative, but it is hardly a finished product. For instance, there is no explanation for the differences in the color of the points, the smallest points are not easy to see, and there are no labels to indicate the names of the cities. Let's deal with these issues one at a time to create a more fully fleshed out map. This will serve as an opportunity to demonstrate both the flexibility and complexity of `ggplot` code.[^12] 215 | 216 | The different colors for sent and received locations are not defined in a legend in the previous plot, because `ggplot` only creates a legend for arguments within an `aes()` function. Even though the color does not change within the data for each `geom_point()` function, it is possible to place the color in the `aes()` function when used in tandem with `scale_color_manual()`. Inside the `aes()` function the color is associated with a name that will appear in the legend.
The actual color to be used is defined in a separate `scale_color_manual()` function with the `values` argument.[^13] 217 | 218 | A similar type of scale function makes it possible to manually control the minimum and maximum size of the points drawn by the `size = count` argument within the `aes()` functions. For this last plot, I decided to make the minimum size of the point 2 and the maximum size 9. The range of the sizes for points is a good example of a plot element that you can play around with, trying out different sizes until you find one that works. Note, too, that the best sizes of plot elements will also depend on the format in which the finished plot will be presented. 219 | 220 | In changing the background map from the default Google Map, I took out the city labels, but this makes it difficult to know which cities are represented by the different points. The plot only contains fourteen points, making it possible to label each point without too much clutter. In `ggplot`, [labels are geoms](http://r4ds.had.co.nz/graphics-for-communication.html#annotations); they are distinct elements placed on a plot with separate geom functions. Labeling points can be done with either `geom_text()` or `geom_label()`. `geom_text()` places text at the indicated position, while `geom_label()` places the text within a white text box. In addition to x and y coordinates, the two geoms require a `label` argument, which indicates the variable that should be used for the text. By default, `geom_text()` and `geom_label()` are placed exactly on the x and y coordinates given by the data. It is possible to nudge the placement of the labels with the `nudge_x` and `nudge_y` arguments. However, instead of using these, I will take advantage of the `geom_text_repel()` function from the `ggrepel` package. This package automatically chooses the placement of individual labels to ensure that they do not overlap.
Note too that I use the `locations` data frame for the data of the geom, since `locations` contains the longitude and latitude of each point on the map. 221 | 222 | The final touches to the map can be made by ensuring that all of the elements are clearly noted. [Labels for the plot itself](http://r4ds.had.co.nz/graphics-for-communication.html#label) can be made with the `labs()` function. For this plot, I will add a descriptive title and change the labels for the two legends. By default, `ggplot` uses the name of the indicated column as the label for the legend. This is shown in the above map, where the size of the point is labeled as "count." The default label can be replaced with a more informative one by indicating the aesthetic to be changed. Here, I will rename the size aesthetic as "Letters." In addition, I chose not to have a label for the color aesthetic, which is indicated by `NULL`. Finally, I increased the size of the points drawn in the color legend. [This is done with the `guides()` function](http://r4ds.had.co.nz/graphics-for-communication.html#legend-layout), which changes the scales of different aspects of the plot. In this case, I use the `override.aes` argument to have the red and purple points in the legend be drawn at `size = 6`.
223 | 224 | ```{r ggrepel} 225 | library(ggrepel) 226 | ``` 227 | 228 | ```{r data-map3, fig.width = 8, fig.asp = 0.8} 229 | ggmap(bw_map) + 230 | geom_point(data = geo_per_destination, 231 | aes(x = lon, y = lat, size = count, color = "Destination"), 232 | alpha = 0.5) + 233 | geom_point(data = geo_per_source, 234 | aes(x = lon, y = lat, size = count, color = "Source"), 235 | alpha = 0.5) + 236 | scale_color_manual(values = c(Destination = "red", Source = "purple")) + 237 | scale_size_continuous(range = c(2, 9)) + 238 | geom_text_repel(data = locations, aes(x = lon, y = lat, label = place)) + 239 | labs(title = "Correspondence of Daniel van der Meulen, 1585", 240 | size = "Letters", 241 | color = NULL) + 242 | guides(color = guide_legend(override.aes = list(size = 6))) 243 | ``` 244 | 245 | ## Conclusion {#concluding} 246 | The above map demonstrates the geographic spread of Daniel van der Meulen’s correspondence in 1585 and shows the relative significance of the location of his correspondents and the different places in which he received letters throughout the year. More important than the details of the map for the purposes of this post is the process by which it was created. One interesting feature that I would like to emphasize in concluding is that the map uses data from three different data frames or tables: `geo_per_destination`, `geo_per_source`, and `locations`. All three of these data frames derive from the original `letters` data, while they in turn were the product of yet more data frames. By my count, running the commands contained in this post leads to the creation of 12 different data frame-like objects. The below diagram outlines the workflow. The ability to split, subset, transform, and then join newly created tables in a variety of ways makes for a very powerful and flexible workflow. Because the data frames can be recreated by running the code, there is minimal overhead in managing them, especially in comparison to creating tables within spreadsheets.
In this case, I would only recommend that the `locations` data frame be saved for easy access in other R scripts and sessions. The other objects can be created on demand, or even more can be added, while the individual aspects of the map can be endlessly tweaked using the power of `ggplot`. 247 | 248 | [^1]: Geocoding can be done with websites such as [geonames](http://www.geonames.org) and the service provided by [Texas A&M](http://geoservices.tamu.edu/Services/Geocode/). There are also a number of [options for creating maps on the web](http://crln.acrl.org/index.php/crlnews/article/view/16772/18314). 249 | 250 | [^2]: See the Introduction of [Robin Lovelace, Jakub Nowosad, and Jannes Muenchow, *Geocomputation with R*](https://bookdown.org/robinlovelace/geocompr/) for a good overview of the GIS capabilities of R. 251 | 252 | [^3]: [Alex David Singleton, Seth Spielman, and Chris Brunsdon, “Establishing a Framework for Open Geographic Information Science,”](https://www.cdrc.ac.uk/wp-content/uploads/2016/04/AS_EstablishingAFrameworkforOpenGIS.pdf) *International Journal of Geographical Information Science* 30 (2016): 1507–1521. 253 | 254 | [^4]: A full list of geographic packages for R can be found at [R Packages for the Analysis of Spatial Data](https://cran.r-project.org/web/views/Spatial.html). In future posts I will discuss some of these packages and their capabilities. 255 | 256 | [^5]: In this command the quotations around "source" and "destination" are necessary. They tell the function that these are strings of characters. 257 | 258 | [^6]: It is good practice to stay away from overwriting objects until you know whether the code works. However, if you mess up, it is always possible to recreate the original object by rerunning the code that created it. 259 | 260 | [^7]: For instance, geocoding "Naples" will return the longitude and latitude of Naples, Florida and not Naples, Italy.
261 | 262 | [^8]: If you want to get an idea of how an `sf` object differs from a normal tibble or data frame, you can print out the `locations_sf` object. This shows that the longitude and latitude columns have been combined into a special type of column called `simple_feature` and named "geometry", while additional information about the CRS is also provided. 263 | 264 | [^9]: For a fuller discussion of `ggmap`, see [David Kahle and Hadley Wickham, “ggmap: Spatial Visualization with ggplot2.”](https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf) 265 | 266 | [^10]: A zoom level of 1 is the most zoomed out, while 21 is the most zoomed in. 267 | 268 | [^11]: [W3 Schools has a good discussion on the use of rgb and hex colors](https://www.w3schools.com/colors/default.asp). For the use of colors with `ggplot`, see Hadley Wickham, *ggplot2: Elegant Graphics for Data Analysis*, Second edition (Springer, 2016), 133–145. A full list of the named colors available in R can be found [here](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf). 269 | 270 | [^12]: For a full discussion of the powers of `ggplot`, see [Wickham, *ggplot2*](http://www.springer.com/us/book/9783319242750). 271 | 272 | [^13]: Wickham, *ggplot2*, 142–143. -------------------------------------------------------------------------------- /gis-with-r-intro.R: -------------------------------------------------------------------------------- 1 | ### Introduction to GIS with R: Spatial data with the sp and sf packages ### 2 | 3 | # This script goes along with the blog post of the same name, 4 | # which can be found at https://www.jessesadler.com/post/gis-with-r-intro/ 5 | # See the Rmarkdown document for the contents of the blog post.
6 | 7 | ### Load packages and data 8 | library(tidyverse) 9 | library(sp) 10 | library(sf) 11 | library(rnaturalearth) 12 | 13 | # Load the data 14 | letters <- read_csv("data/correspondence-data-1585.csv") 15 | locations <- read_csv("data/locations.csv") 16 | 17 | ######################## 18 | ## Preparing the data ## 19 | ######################## 20 | 21 | # Letters per source 22 | sources <- letters %>% 23 | group_by(source) %>% 24 | count() %>% 25 | rename(place = source) %>% 26 | add_column(type = "source") %>% 27 | ungroup() 28 | 29 | # Letters per destination 30 | destinations <- letters %>% 31 | group_by(destination) %>% 32 | count() %>% 33 | rename(place = destination) %>% 34 | add_column(type = "destination") %>% 35 | ungroup() 36 | 37 | # Bind the rows of the two data frames 38 | # and change type column to factor 39 | letters_data <- rbind(sources, destinations) %>% 40 | mutate(type = as_factor(type)) 41 | 42 | # Join letters_data to locations 43 | geo_data <- left_join(letters_data, locations, by = "place") 44 | 45 | ################################## 46 | ## Spatial data with sp package ## 47 | ################################## 48 | 49 | # Create data frame of only longitude and latitude values 50 | coords <- select(geo_data, lon, lat) 51 | 52 | # Create SpatialPoints object with coords and CRS 53 | points_sp <- SpatialPoints(coords = coords, 54 | proj4string = CRS("+proj=longlat +datum=WGS84")) 55 | 56 | # Create SpatialPointsDataFrame object 57 | points_spdf <- SpatialPointsDataFrame(coords = coords, 58 | data = letters_data, 59 | proj4string = CRS("+proj=longlat +datum=WGS84")) 60 | 61 | # Example of subsetting `points_spdf` to return locations with "n" greater than 10 62 | points_spdf[points_spdf@data$n > 10, ] 63 | 64 | # Get coastal and country world maps as Spatial objects 65 | coast_sp <- ne_coastline(scale = "medium") 66 | countries_sp <- ne_countries(scale = "medium") 67 | 68 | #################################### 69 | ## Mapping with sp 
and base plots ## 70 | #################################### 71 | 72 | ### Set up plot ### 73 | 74 | # Create a new color palette to distinguish source and destination 75 | palette(alpha(c("darkorchid", "darkorange"), 0.7)) 76 | # Set margins for bottom, left, top, and right of plot 77 | par(mar = c(1, 1, 3, 1)) 78 | 79 | ### Plot points ### 80 | plot(points_spdf, 81 | pch = 20, 82 | col = points_spdf$type, 83 | cex = sqrt(points_spdf$n)/2 + 0.25) 84 | 85 | # Add a box around the plot 86 | box() 87 | # Add a title 88 | title(main = "Correspondence of Daniel van der Meulen, 1585") 89 | 90 | ### Plot map with coastlines (lines) data ### 91 | # Pointsize vector for legend 92 | pointsize <- c(1, 50, 100) 93 | par(mar = c(1, 1, 3, 1)) 94 | 95 | # Plot points 96 | plot(points_spdf, 97 | pch = 20, 98 | col = points_spdf$type, 99 | cex = sqrt(points_spdf$n)/2 + 0.25) 100 | # Plot coastlines background map 101 | plot(coast_sp, 102 | col = "black", 103 | add = TRUE) 104 | # Add a box around the plot 105 | box() 106 | 107 | # Legend for colors 108 | legend("topright", legend = levels(points_spdf$type), 109 | pt.cex = 2, 110 | col = 1:2, 111 | pch = 15) 112 | 113 | # legend for size of points 114 | legend("right", legend = pointsize, 115 | pt.cex = (sqrt(pointsize)/2 + 0.25), 116 | col = "black", 117 | pch = 20, 118 | title = "Letters") 119 | 120 | # Title for the map 121 | title(main = "Correspondence of Daniel van der Meulen, 1585") 122 | 123 | # Make bounding box for countries_sp match 124 | # bounding box of points_spdf 125 | countries_sp@bbox <- bbox(points_spdf) 126 | 127 | ### Plot map with countries (polygons) data ### 128 | par(mar = c(1, 1, 3, 1)) 129 | 130 | # Plot countries map and color with grays 131 | plot(countries_sp, 132 | col = gray(0.8), 133 | border = gray(0.7)) 134 | # Plot points 135 | plot(points_spdf, 136 | pch = 20, 137 | col = points_spdf$type, 138 | cex = sqrt(points_spdf$n)/2 + 0.25, 139 | add = TRUE) 140 | # Add a box around the plot 141 | box() 
142 | 143 | # Legend for colors 144 | legend("topright", 145 | legend = levels(points_spdf$type), 146 | pt.cex = 2, 147 | col = 1:2, 148 | pch = 15) 149 | # legend for size of points 150 | legend("right", 151 | legend = pointsize, 152 | pt.cex = (sqrt(pointsize)/2 + 0.25), 153 | col = "black", 154 | pch = 20, 155 | title = "Letters") 156 | 157 | # Title for the map 158 | title(main = "Correspondence of Daniel van der Meulen, 1585") 159 | 160 | ################################## 161 | ## Spatial data with sf package ## 162 | ################################## 163 | 164 | # Create sf object with geo_data data frame and CRS 165 | points_sf <- st_as_sf(geo_data, coords = c("lon", "lat"), crs = 4326) 166 | 167 | # Get coastal and country world maps as sf objects 168 | coast_sf <- ne_coastline(scale = "medium", returnclass = "sf") 169 | countries_sf <- ne_countries(scale = "medium", returnclass = "sf") 170 | 171 | # Subset of locations with "n" greater than 10 172 | filter(points_sf, n > 10) 173 | 174 | ### Subset of countries object to get South American countries ### 175 | # South American countries with new CRS 176 | countries_sf %>% 177 | filter(continent == "South America") %>% 178 | select(name) %>% 179 | st_transform(crs = "+proj=moll +datum=WGS84") 180 | 181 | ### Make map of South America ### 182 | # Return to default palette 183 | palette("default") 184 | 185 | # Map of South American countries 186 | countries_sf %>% 187 | filter(continent == "South America") %>% 188 | select(name) %>% 189 | st_transform(crs = "+proj=moll +datum=WGS84") %>% 190 | plot(key.pos = NULL, graticule = TRUE, main = "South America") 191 | 192 | ################################# 193 | ## Mapping with sf and ggplot2 ## 194 | ################################# 195 | 196 | ### Basic ggplot2 plot with geom_sf ### 197 | ggplot() + 198 | geom_sf(data = coast_sf) + 199 | geom_sf(data = points_sf, 200 | aes(color = type, size = n), 201 | alpha = 0.7, 202 | show.legend = "point") + 203 | 
coord_sf(xlim = c(-1, 14), ylim = c(44, 55)) 204 | 205 | ### Plot map with coastlines (lines) data ### 206 | # Load ggrepel package 207 | library(ggrepel) 208 | 209 | ggplot() + 210 | geom_sf(data = coast_sf) + 211 | geom_sf(data = points_sf, 212 | aes(color = type, size = n), 213 | alpha = 0.7, 214 | show.legend = "point") + 215 | coord_sf(xlim = c(-1, 14), ylim = c(44, 55), 216 | datum = NA) + # removes graticules 217 | geom_text_repel(data = locations, 218 | aes(x = lon, y = lat, label = place)) + 219 | labs(title = "Correspondence of Daniel van der Meulen, 1585", 220 | size = "Letters", 221 | color = "Type", 222 | x = NULL, 223 | y = NULL) + 224 | guides(color = guide_legend(override.aes = list(size = 6))) + 225 | theme_minimal() 226 | 227 | ### Plot map with countries (polygons) data ### 228 | ggplot() + 229 | geom_sf(data = countries_sf, 230 | fill = gray(0.8), color = gray(0.7)) + 231 | geom_sf(data = points_sf, 232 | aes(color = type, size = n), 233 | alpha = 0.7, 234 | show.legend = "point") + 235 | coord_sf(xlim = c(-1, 14), ylim = c(44, 55), 236 | datum = NA) + # removes graticules 237 | geom_text_repel(data = locations, 238 | aes(x = lon, y = lat, label = place)) + 239 | labs(title = "Correspondence of Daniel van der Meulen, 1585", 240 | size = "Letters", 241 | color = "Type", 242 | x = NULL, 243 | y = NULL) + 244 | guides(color = guide_legend(override.aes = list(size = 6))) + 245 | theme_bw() -------------------------------------------------------------------------------- /great-circles-sp-sf.R: -------------------------------------------------------------------------------- 1 | ## Great circles with sp and sf ## 2 | 3 | # This script goes along with the blog post of the same name, 4 | # which can be found at https://www.jessesadler.com/post/great-circles-sp-sf/ 5 | # See the Rmarkdown document for the contents of the blog post. 
6 | 7 | # Load packages 8 | library(tidyverse) 9 | library(sp) 10 | library(geosphere) 11 | library(sf) 12 | library(rnaturalearth) 13 | 14 | # Load the data 15 | letters <- read_csv("data/correspondence-data-1585.csv") 16 | locations <- read_csv("data/locations.csv") 17 | 18 | # Load the background maps 19 | countries_sp <- ne_countries(scale = "medium") 20 | countries_sf <- ne_countries(scale = "medium", returnclass = "sf") 21 | 22 | #################################### 23 | ## Great circle vs Rhumb line map ## 24 | #################################### 25 | 26 | # Create points data 27 | la_sfg <- st_point(c(-118.2615805, 34.1168926)) 28 | amsterdam_sfg <- st_point(c(4.8979755, 52.3745403)) 29 | points_sfc <- st_sfc(la_sfg, amsterdam_sfg, crs = 4326) 30 | points_data <- data.frame(name = c("Los Angeles", "Amsterdam")) 31 | points_sf <- st_sf(points_data, geometry = points_sfc) 32 | 33 | # Create lines 34 | rhumb_line <- st_linestring(rbind(c(-118.2615805, 34.1168926), c(4.8979755, 52.3745403))) %>% 35 | st_sfc(crs = 4326) 36 | 37 | # Make a great circle line from LA to Amsterdam as sfc object 38 | great_circle <- st_linestring(rbind(c(-118.2615805, 34.1168926), c(4.8979755, 52.3745403))) %>% 39 | st_sfc(crs = 4326) %>% 40 | st_segmentize(units::set_units(50, km)) 41 | 42 | # Combine two sfc objects 43 | lines_sfc <- c(rhumb_line, great_circle) 44 | 45 | # Labels and transformation to sf object 46 | lines_data <- data.frame(name = c("Rhumb line", "Great circle")) 47 | lines_sf <- st_sf(lines_data, geometry = lines_sfc) 48 | 49 | # Background map 50 | map <- ne_countries(scale = "medium", returnclass = "sf") %>% 51 | select(sovereignt) 52 | 53 | ggplot() + 54 | geom_sf(data = map, fill = gray(0.7), color = gray(0.7)) + 55 | geom_sf(data = lines_sf, aes(color = name), size = 1.5, show.legend = FALSE) + 56 | geom_sf(data = points_sf, aes(color = name), size = 3, show.legend = FALSE) + 57 | coord_sf(xlim = c(-120, 20), ylim = c(20, 70)) + 58 | theme_minimal() 59 | 60 
| ######################## 61 | ## Preparing the data ## 62 | ######################## 63 | 64 | # Create data frame of routes and count for letters per route 65 | routes <- letters %>% 66 | group_by(source, destination) %>% 67 | count() %>% 68 | ungroup() %>% 69 | arrange(n) 70 | 71 | # Print routes 72 | routes 73 | 74 | ########################################## 75 | ## SpatialLinesDataFrame with geosphere ## 76 | ########################################## 77 | 78 | # tibble of longitude and latitude values of sources 79 | sources_tbl <- routes %>% 80 | left_join(locations, by = c("source" = "place")) %>% 81 | select(lon, lat) 82 | 83 | # tibble of longitude and latitude values of destinations 84 | destinations_tbl <- routes %>% 85 | left_join(locations, by = c("destination" = "place")) %>% 86 | select(lon, lat) 87 | 88 | # Great circles as a SpatialLines object 89 | routes_sl <- gcIntermediate(sources_tbl, destinations_tbl, 90 | n = 50, addStartEnd = TRUE, 91 | sp = TRUE) 92 | 93 | # Class of routes_sl 94 | class(routes_sl) 95 | 96 | # Slots in routes_sl 97 | slotNames(routes_sl) 98 | 99 | # CRS of routes_sl 100 | routes_sl@proj4string 101 | 102 | # Make bbox of countries_sp the same as routes_sl 103 | countries_sp@bbox <- bbox(routes_sl) 104 | 105 | # Plot map 106 | par(mar = c(1, 1, 3, 1)) 107 | plot(countries_sp, col = gray(0.8), border = gray(0.7), 108 | main = "SpatialLines great circles") 109 | plot(routes_sl, col = "dodgerblue", add = TRUE) 110 | 111 | # Great circles as a SpatialLinesDataFrame object 112 | routes_sldf <- SpatialLinesDataFrame(routes_sl, data = routes) 113 | 114 | # Map with SpatialLinesDataFrame object 115 | par(mar = c(1, 1, 3, 1)) 116 | plot(countries_sp, col = gray(0.8), border = gray(0.7), 117 | main = "SpatialLinesDataFrame great circles") 118 | plot(routes_sldf, 119 | lwd = sqrt(routes_sldf$n/3) + 0.25, 120 | col = "dodgerblue", 121 | add = TRUE) 122 | 123 | ######################################## 124 | ## Great circles with sf: 
tidy method ## 125 | ######################################## 126 | 127 | # Add id column to routes 128 | routes_id <- rowid_to_column(routes, var = "id") 129 | 130 | # Transform routes to long format 131 | routes_long <- routes_id %>% 132 | gather(key = "type", value = "place", source, destination) 133 | 134 | # Add coordinate values 135 | routes_long_geo <- left_join(routes_long, locations, by = "place") 136 | 137 | # Convert coordinate data to sf object 138 | routes_long_sf <- st_as_sf(routes_long_geo, coords = c("lon", "lat"), crs = 4326) 139 | 140 | # Convert POINT geometry to MULTIPOINT, then LINESTRING 141 | routes_lines <- routes_long_sf %>% 142 | group_by(id) %>% 143 | summarise(do_union = FALSE) %>% 144 | st_cast("LINESTRING") 145 | 146 | # Print routes_lines 147 | routes_lines 148 | 149 | # Join sf object with attributes data 150 | routes_lines <- left_join(routes_lines, routes_id, by = "id") 151 | 152 | # Convert rhumb lines to great circles 153 | routes_sf_tidy <- routes_lines %>% 154 | st_segmentize(units::set_units(20, km)) 155 | 156 | # Compare number of points in routes_lines and routes_sf_tidy 157 | nrow(st_coordinates(routes_lines)) 158 | nrow(st_coordinates(routes_sf_tidy)) 159 | 160 | # Rhumb lines vs great circles 161 | ggplot() + 162 | geom_sf(data = countries_sf, fill = gray(0.8), color = gray(0.7)) + 163 | geom_sf(data = routes_lines, color = "magenta", show.legend = "point") + 164 | geom_sf(data = routes_sf_tidy) + 165 | coord_sf(xlim = c(2, 14), ylim = c(45, 54), datum = NA) + 166 | ggtitle("Rhumb lines vs Great circles") + 167 | theme_minimal() 168 | 169 | # Interactive comparison of gcIntermediate and st_segmentize 170 | library(mapview) 171 | 172 | mapview(routes_sldf, color = "magenta") + 173 | mapview(routes_sf_tidy, color = "black") 174 | 175 | ############################################ 176 | ## Great circles with sf: for loop method ## 177 | ############################################ 178 | 179 | # Create a line between Venice 
and Haarlem 180 | st_linestring(rbind(c(12.315515, 45.44085), c(4.646219, 52.38739))) 181 | 182 | # Matrix of longitude and latitude values of sources 183 | sources_m <- routes %>% 184 | left_join(locations, by = c("source" = "place")) %>% 185 | select(lon:lat) %>% 186 | as.matrix() 187 | 188 | # Matrix of longitude and latitude values of destinations 189 | destinations_m <- routes %>% 190 | left_join(locations, by = c("destination" = "place")) %>% 191 | select(lon:lat) %>% 192 | as.matrix() 193 | 194 | # Create empty list object of length equal to number of routes 195 | linestrings_sfg <- vector("list", nrow(routes)) 196 | 197 | # Define sequence and body of loop 198 | for (i in 1:nrow(routes)) { 199 | linestrings_sfg[[i]] <- st_linestring(rbind(sources_m[i, ], destinations_m[i, ])) 200 | } 201 | 202 | # sfc object of great circles 203 | linestrings_sfc <- st_sfc(linestrings_sfg, crs = 4326) %>% 204 | st_segmentize(units::set_units(20, km)) 205 | 206 | # Print linestrings_sfc 207 | linestrings_sfc 208 | 209 | # Create sf object from data frame and sfc geometry set 210 | routes_sf <- st_sf(routes, geometry = linestrings_sfc) 211 | 212 | # Show routes_sf_tidy and routes_sf are equivalent 213 | select(routes_sf_tidy, -id) %>% 214 | all.equal(routes_sf) 215 | 216 | # ggplot2 of great circle routes 217 | ggplot() + 218 | geom_sf(data = countries_sf, fill = gray(0.8), color = gray(0.7)) + 219 | geom_sf(data = routes_sf, aes(color = n)) + 220 | scale_color_viridis_c() + 221 | labs(color = "Letters", title = "Great circles with sf") + 222 | coord_sf(xlim = c(2, 14), ylim = c(45, 54), datum = NA) + 223 | theme_minimal() -------------------------------------------------------------------------------- /intro-to-r.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 |
NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /networks-with-r.R: -------------------------------------------------------------------------------- 1 | ### Introduction to Network Analysis with R ### 2 | 3 | # This script goes along with the blog post of the same name, 4 | # which can be found at https://www.jessesadler.com/post/network-analysis-with-r/ 5 | 6 | library(tidyverse) 7 | 8 | # Load data 9 | letters <- read_csv("data/correspondence-data-1585.csv") 10 | 11 | ################################ 12 | ## Create node and edge lists ## 13 | ################################ 14 | 15 | ### Node list ### 16 | sources <- letters %>% 17 | distinct(source) %>% 18 | rename(label = source) 19 | 20 | destinations <- letters %>% 21 | distinct(destination) %>% 22 | rename(label = destination) 23 | 24 | nodes <- full_join(sources, destinations, by = "label") 25 | 26 | # Create id column and reorder columns 27 | nodes <- add_column(nodes, id = 1:nrow(nodes)) %>% 28 | select(id, everything()) 29 | 30 | ### Edge list ### 31 | per_route <- letters %>% 32 | group_by(source, destination) %>% 33 | summarise(weight = n()) %>% 34 | ungroup() 35 | 36 | # Join with node ids and reorder columns 37 | edges <- per_route %>% 38 | left_join(nodes, by = c("source" = "label")) %>% 39 | rename(from = id) 40 | 41 | edges <- edges %>% 42 | left_join(nodes, by = c("destination" = "label")) %>% 43 | rename(to = id) 44 | 45 | edges <- select(edges, from, to, weight) 46 | 47 | ##################### 48 | ## network package ## 49 | ##################### 50 | 51 | library(network) 52 | 53 | # network object 54 | routes_network <- network(edges, vertex.attr = nodes, matrix.type = "edgelist", ignore.eval = FALSE) 55 | 56 | # Print network object 57 | routes_network 58 | 59 | # network plot 60 | plot(routes_network, vertex.cex = 3) 61 | 62 | # network circle plot 63 | 
plot(routes_network, vertex.cex = 3, mode = "circle") 64 | 65 | #################### 66 | ## igraph package ## 67 | #################### 68 | 69 | # Clean environment 70 | detach(package:network) 71 | rm(routes_network) 72 | library(igraph) 73 | 74 | # igraph object 75 | routes_igraph <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE) 76 | 77 | # Print igraph object 78 | routes_igraph 79 | 80 | # igraph plot 81 | plot(routes_igraph, edge.arrow.size = 0.2) 82 | 83 | # igraph-graphopt-plot 84 | plot(routes_igraph, layout = layout_with_graphopt, edge.arrow.size = 0.2) 85 | 86 | ########################## 87 | ## tidygraph and ggraph ## 88 | ########################## 89 | 90 | # Load libraries 91 | library(tidygraph) 92 | library(ggraph) 93 | 94 | # edge list and node list to tbl_graph 95 | routes_tidy <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE) 96 | 97 | # igraph to tbl_graph 98 | routes_igraph_tidy <- as_tbl_graph(routes_igraph) 99 | 100 | # Show classes of objects 101 | class(routes_tidy) 102 | class(routes_igraph_tidy) 103 | class(routes_igraph) 104 | 105 | # Print routes tidy 106 | routes_tidy 107 | 108 | # Activate edges 109 | routes_tidy %>% 110 | activate(edges) %>% 111 | arrange(desc(weight)) 112 | 113 | # Basic ggraph plot 114 | ggraph(routes_tidy) + geom_edge_link() + geom_node_point() + theme_graph() 115 | 116 | # More complex ggraph plot 117 | ggraph(routes_tidy, layout = "graphopt") + 118 | geom_node_point() + 119 | geom_edge_link(aes(width = weight), alpha = 0.8) + 120 | scale_edge_width(range = c(0.2, 2)) + 121 | geom_node_text(aes(label = label), repel = TRUE) + 122 | labs(edge_width = "Letters") + 123 | theme_graph() 124 | 125 | # ggraph arc plot 126 | ggraph(routes_igraph, layout = "linear") + 127 | geom_edge_arc(aes(width = weight), alpha = 0.8) + 128 | scale_edge_width(range = c(0.2, 2)) + 129 | geom_node_text(aes(label = label)) + 130 | labs(edge_width = "Letters") + 131 | theme_graph() 132 | 133 | 
################ 134 | ## visNetwork ## 135 | ################ 136 | 137 | library(visNetwork) 138 | 139 | # Simple interactive plot 140 | visNetwork(nodes, edges, width = "100%") 141 | 142 | # Width attribute 143 | edges <- mutate(edges, width = weight/5 + 1) 144 | 145 | # visNetwork edge width plot 146 | visNetwork(nodes, edges, width = "100%") %>% 147 | visIgraphLayout(layout = "layout_with_fr") %>% 148 | visEdges(arrows = "middle") 149 | 150 | ############### 151 | ## networkD3 ## 152 | ############### 153 | 154 | library(networkD3) 155 | 156 | # Redo id numbers to have them begin at 0 157 | nodes_d3 <- mutate(nodes, id = id - 1) 158 | edges_d3 <- mutate(edges, from = from - 1, to = to - 1) 159 | 160 | # d3 force-network plot 161 | forceNetwork(Links = edges_d3, Nodes = nodes_d3, Source = "from", Target = "to", 162 | NodeID = "label", Group = "id", Value = "weight", 163 | opacity = 1, fontSize = 16, zoom = TRUE) 164 | 165 | # Sankey diagram 166 | sankeyNetwork(Links = edges_d3, Nodes = nodes_d3, Source = "from", Target = "to", 167 | NodeID = "label", Value = "weight", fontSize = 16, unit = "Letter(s)") 168 | -------------------------------------------------------------------------------- /networks-with-r.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introduction to Network Analysis with R" 3 | output: 4 | md_document: 5 | variant: markdown 6 | always_allow_html: yes 7 | --- 8 | 9 | ```{r setup, include=FALSE} 10 | post <- "network-analysis-with-r" 11 | source("common.r") 12 | ``` 13 | 14 | Over a wide range of fields network analysis has become an increasingly popular tool for scholars to deal with the complexity of the interrelationships between actors of all sorts. The promise of network analysis is the placement of significance on the relationships between actors, rather than seeing actors as isolated entities. 
The emphasis on complexity, along with the creation of a variety of algorithms to measure various aspects of networks, makes network analysis a central tool for digital humanities.[^1] This post will provide an introduction to working with networks in [R](https://www.r-project.org), using the example of the network of cities in the correspondence of [Daniel van der Meulen](https://jessesadler.com/project/dvdm-correspondence/) in 1585. 15 | 16 | There are a number of applications designed for network analysis and the creation of network graphs such as [gephi](https://gephi.org) and [cytoscape](http://cytoscape.org). Though not specifically designed for it, R has developed into a powerful tool for network analysis. The strength of R in comparison to stand-alone network analysis software is threefold. In the first place, [R enables reproducible research](https://jessesadler.com/post/excel-vs-r/) that is not possible with GUI applications. Secondly, the data analysis power of R provides robust tools for manipulating data to prepare it for network analysis. Finally, there is an ever-growing range of packages designed to make R a complete network analysis tool. Significant network analysis packages for R include the [statnet suite](http://www.statnet.org) of packages and [`igraph`](http://igraph.org). In addition, [Thomas Lin Pedersen](http://www.data-imaginist.com) has recently released the [`tidygraph`](https://cran.r-project.org/web/packages/tidygraph/index.html) and [`ggraph`](https://cran.r-project.org/web/packages/ggraph/index.html) packages that leverage the power of `igraph` in a manner consistent with the [tidyverse](http://www.tidyverse.org) workflow. R can also be used to make interactive network graphs with the [htmlwidgets framework](http://www.htmlwidgets.org) that translates R code to JavaScript. 
17 | 18 | This post begins with a short introduction to the basic vocabulary of network analysis, followed by a discussion of the process for getting data into the proper structure for network analysis. The network analysis packages have all implemented their own object classes. In this post, I will show how to create the specific object classes for the statnet suite of packages with the [`network`](https://cran.r-project.org/web/packages/network/index.html) package, as well as for `igraph` and `tidygraph`, which is based on the `igraph` implementation. Finally, I will turn to the creation of interactive graphs with the [`visNetwork`](http://datastorm-open.github.io/visNetwork/) and [`networkD3`](http://christophergandrud.github.io/networkD3/) packages. 19 | 20 | ## Network Analysis: Nodes and Edges 21 | The two primary aspects of networks are a multitude of separate entities and the connections between them. The vocabulary can be a bit technical and even inconsistent between different disciplines, packages, and software. The entities are referred to as **nodes** or **vertices** of a graph, while the connections are **edges** or **links**. In this post I will mainly use the nomenclature of nodes and edges except when discussing packages that use different vocabulary. 22 | 23 | The network analysis packages need data to be in a particular form to create the special type of object used by each package. The object classes for `network`, `igraph`, and `tidygraph` are all based on adjacency matrices, also known as sociomatrices.[^2] An [adjacency matrix](https://en.wikipedia.org/wiki/Adjacency_matrix) is a square matrix in which the column and row names are the nodes of the network. Within the matrix a 1 indicates that there is a connection between the nodes, and a 0 indicates no connection. 
Adjacency matrices implement a very different data structure than data frames and do not fit within the [tidyverse](http://www.tidyverse.org) workflow that I have used in my previous posts. Helpfully, the specialized network objects can also be created from an edge-list data frame, which does fit in the tidyverse workflow. In this post I will stick to the data analysis techniques of the tidyverse to create edge lists, which will then be converted to the specific object classes for `network`, `igraph`, and `tidygraph`. 24 | 25 | An **edge list** is a data frame that contains a minimum of two columns, one column of nodes that are the source of a connection and another column of nodes that are the target of the connection. The nodes in the data are identified by unique IDs. If the distinction between source and target is meaningful, the network is **directed**. If the distinction is not meaningful, the network is **undirected**. With the example of letters sent between cities, the distinction between source and target is clearly meaningful, and so the network is directed. For the examples below, I will name the source column as "from" and the target column as "to". I will use integers beginning with one as node IDs.[^3] An edge list can also contain additional columns that describe **attributes** of the edges such as a magnitude aspect for an edge. If the edges have a magnitude attribute the graph is considered **weighted**. 26 | 27 | Edge lists contain all of the information necessary to create network objects, but sometimes it is preferable to also create a separate node list. At its simplest, a **node list** is a data frame with a single column — which I will label as "id" — that lists the node IDs found in the edge list. The advantage of creating a separate node list is the ability to add attribute columns to the data frame such as the names of the nodes or any kind of groupings. 
Below I give an example of minimal edge and node lists created with the `tibble()` function. 28 | 29 | ```{r node-edge lists} 30 | library(tidyverse) 31 | edge_list <- tibble(from = c(1, 2, 2, 3, 4), to = c(2, 3, 4, 2, 1)) 32 | node_list <- tibble(id = 1:4) 33 | 34 | edge_list 35 | node_list 36 | ``` 37 | 38 | Compare this to an adjacency matrix with the same data. 39 | 40 | ```{r adjacency matrix, echo = FALSE} 41 | library(network) 42 | net <- network(edge_list, vertex.attr = node_list, matrix.type = "edgelist", ignore.eval = FALSE) 43 | as.sociomatrix(net) 44 | ``` 45 | 46 | ## Creating edge and node lists 47 | To create network objects from the database of letters received by Daniel van der Meulen in 1585 I will make both an edge list and a node list. This will necessitate the use of the [dplyr](http://dplyr.tidyverse.org) package to manipulate the data frame of letters sent to Daniel and split it into two data frames or [tibbles](http://r4ds.had.co.nz/tibbles.html) with the structure of edge and node lists. In this case, the nodes will be the cities from which Daniel's correspondents sent him letters and the cities in which he received them. The node list will contain a "label" column, containing the names of the cities. The edge list will also have an attribute column that will show the amount of letters sent between each pair of cities. The workflow to create these objects will be similar to that I have used in my [brief introduction to R](https://jessesadler.com/post/excel-vs-r/) and in [geocoding with R](https://jessesadler.com/post/geocoding-with-r/). If you would like to follow along, you can find the data used in this post and the R script used on [GitHub](https://github.com/jessesadler/intro-to-r). 48 | 49 | The first step is to load the `tidyverse` library to import and manipulate the data. Printing out the `letters` data frame shows that it contains four columns: "writer", "source", "destination", and "date". 
In this example, we will only deal with the "source" and "destination" columns. 50 | 51 | ```{r load data, message = FALSE} 52 | library(tidyverse) 53 | letters <- read_csv("data/correspondence-data-1585.csv") 54 | 55 | letters 56 | ``` 57 | 58 | ### Node list 59 | The workflow to create a node list is similar to the one I used to get the [list of cities in order to geocode the data](https://jessesadler.com/post/geocoding-with-r/#preparing-data) in a previous post. We want to get the distinct cities from both the "source" and "destination" columns and then join the information from these columns together. In the example below, I slightly change the commands from those I used in the previous post so that the column with the city names has the same name in both the `sources` and `destinations` data frames, which simplifies the `full_join()` function. I rename the column with the city names as "label" to adopt the vocabulary used by network analysis packages. 60 | 61 | ```{r source and destination} 62 | sources <- letters %>% 63 |   distinct(source) %>% 64 |   rename(label = source) 65 | 66 | destinations <- letters %>% 67 |   distinct(destination) %>% 68 |   rename(label = destination) 69 | ``` 70 | 71 | To create a single data frame with a column of the unique locations we need to use a [full join](http://r4ds.had.co.nz/relational-data.html#outer-join), because we want to include all unique places from both the sources of the letters and the destinations. 72 | 73 | ```{r nodes join} 74 | nodes <- full_join(sources, destinations, by = "label") 75 | nodes 76 | ``` 77 | 78 | This results in a data frame with one variable. However, the variable contained in the data frame is not really what we are looking for. The "label" column contains the names of the nodes, but we also want to have unique IDs for each city. We can do this by adding an "id" column to the `nodes` data frame that contains numbers from one to the total number of rows in the data frame. 
A helpful function for this workflow is `rowid_to_column()`, which adds a column with the values from the row ids and places the column at the start of the data frame.[^4] Note that `rowid_to_column()` is a pipeable command, and so it is possible to do the `full_join()` and add the "id" column in a single command. The result is a nodes list with an ID column and a label attribute. 79 | 80 | ```{r id column} 81 | nodes <- nodes %>% rowid_to_column("id") 82 | nodes 83 | ``` 84 | 85 | ### Edge list 86 | Creating an edge list is similar to the above, but it is complicated by the need to deal with two ID columns instead of one. We also want to create a weight column that will note the amount of letters sent between each set of nodes. To accomplish this I will use the same `group_by()` and `summarise()` workflow that I have [discussed in previous posts](https://jessesadler.com/post/excel-vs-r/#the-pipe). The difference here is that we want to group the data frame by two columns — "source" and "destination" — instead of just one. Previously, I have named the column that counts the number of observations per group "count", but here I adopt the nomenclature of network analysis and call it "weight". The final command in the pipeline removes the grouping for the data frame instituted by the `group_by()` function. This makes it easier to manipulate the resulting `per_route` data frame unhindered.[^5] 87 | 88 | ```{r per_route_} 89 | per_route <- letters %>% 90 | group_by(source, destination) %>% 91 | summarise(weight = n()) %>% 92 | ungroup() 93 | per_route 94 | ``` 95 | 96 | Like the node list, `per_route` now has the basic form that we want, but we again have the problem that the "source" and "destination" columns contain labels rather than IDs. What we need to do is link the IDs that have been assigned in `nodes` to each location in both the "source" and "destination" columns. This can be accomplished with another join function. 
In fact, it is necessary to perform two joins, one for the "source" column and one for "destination." In this case, I will use a `left_join()` with `per_route` as the left data frame, because we want to maintain the number of rows in `per_route`. While doing the `left_join`, we also want to rename the two "id" columns that are brought over from `nodes`. For the join using the "source" column I will rename the column as "from". The column brought over from the "destination" join is renamed "to". It would be possible to do both joins in a single command with the use of the pipe. However, for clarity, I will perform the joins in two separate commands. Because the join is done across two commands, notice that the data frame at the beginning of the pipeline changes from `per_route` to `edges`, which is created by the first command. 97 | 98 | ```{r joins for edges} 99 | edges <- per_route %>% 100 | left_join(nodes, by = c("source" = "label")) %>% 101 | rename(from = id) 102 | 103 | edges <- edges %>% 104 | left_join(nodes, by = c("destination" = "label")) %>% 105 | rename(to = id) 106 | ``` 107 | 108 | Now that `edges` has "from" and "to" columns with node IDs, we need to reorder the columns to bring "from" and "to" to the left of the data frame. Currently, the `edges` data frame still contains the "source" and "destination" columns with the names of the cities that correspond with the IDs. However, this data is superfluous, since it is already present in `nodes`. Therefore, I will only include the "from", "to", and "weight" columns in the `select()` function. 109 | 110 | ```{r arrange edges columns} 111 | edges <- select(edges, from, to, weight) 112 | edges 113 | ``` 114 | 115 | The `edges` data frame does not look very impressive; it is three columns of integers. However, `edges` combined with `nodes` provides us with all of the information necessary to create network objects with the `network`, `igraph`, and `tidygraph` packages. 
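Since the edge list becomes the backbone of every network object that follows, it can be worth a quick sanity check before moving on. The snippet below is a minimal sketch I am adding here (it is not part of the original workflow); it assumes the `edges` and `nodes` data frames created above and uses base R to confirm that every "from" and "to" value points to a valid node id and that the left joins introduced no missing values.

```r
# Sanity check (base R): every edge endpoint should be a valid node id,
# and the left joins should not have produced any NAs.
stopifnot(
  all(edges$from %in% nodes$id),
  all(edges$to %in% nodes$id),
  !anyNA(edges)
)
```

If a location had appeared in `per_route` but not in `nodes`, the join would have produced an NA and `stopifnot()` would fail here rather than later, deep inside a plotting call.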
116 | 117 | ## Creating network objects 118 | The network object classes for `network`, `igraph`, and `tidygraph` are all closely related. It is possible to translate between a `network` object and an `igraph` object. However, it is best to keep the two packages and their objects separate. In fact, the capabilities of `network` and `igraph` overlap to such an extent that it is best practice to have only one of the packages loaded at a time. I will begin by going over the `network` package and then move to the `igraph` and `tidygraph` packages. 119 | 120 | ### network 121 | ```{r load network} 122 | library(network) 123 | ``` 124 | 125 | The function used to create a `network` object is `network()`. The command is not particularly straightforward, but you can always enter `?network()` into the console if you get confused. The first argument is — as stated in the documentation — "a matrix giving the network structure in adjacency, incidence, or edgelist form." The language demonstrates the significance of matrices in network analysis, but instead of a matrix, we have an edge list, which fills the same role. The second argument is a list of vertex attributes, which corresponds to the nodes list. Notice that the `network` package uses the nomenclature of vertices instead of nodes. The same is true of `igraph`. We then need to specify the type of data that has been entered into the first two arguments by specifying that the `matrix.type` is an `"edgelist"`. Finally, we set `ignore.eval` to `FALSE` so that our network can be weighted and take into account the number of letters along each route. 126 | 127 | ```{r network object} 128 | routes_network <- network(edges, vertex.attr = nodes, matrix.type = "edgelist", ignore.eval = FALSE) 129 | ``` 130 | 131 | You can see the type of object created by the `network()` function by placing `routes_network` in the `class()` function. 
132 | 133 | ```{r class network object} 134 | class(routes_network) 135 | ``` 136 | 137 | Printing out `routes_network` to the console shows that the structure of the object is quite different from data-frame style objects such as `edges` and `nodes`. The print command reveals information that is specifically defined for network analysis. It shows that there are 13 vertices or nodes and 15 edges in `routes_network`. These numbers correspond to the number of rows in `nodes` and `edges` respectively. We can also see that the vertices and edges both contain attributes such as label and weight. You can get even more information, including a sociomatrix of the data, by entering `summary(routes_network)`. 138 | 139 | ```{r print network object} 140 | routes_network 141 | ``` 142 | 143 | It is now possible to get a rudimentary, if not overly aesthetically pleasing, graph of our network of letters. Both the `network` and `igraph` packages use the [base plotting system of R](https://bookdown.org/rdpeng/exdata/the-base-plotting-system-1.html). The conventions for base plots are significantly different from those of [ggplot2](http://ggplot2.tidyverse.org) — which I have discussed in [previous](https://jessesadler.com/post/excel-vs-r/#ggplot) [posts](https://jessesadler.com/post/geocoding-with-r/#mapping-data) — and so I will stick with rather simple plots instead of going into the details of creating complex plots with base R. In this case, the only change that I make to the default `plot()` function for the `network` package is to increase the size of nodes with the `vertex.cex` argument to make the nodes more visible. Even with this very simple graph, we can already learn something about the data. The graph makes clear that there are two main groupings or clusters of the data, which correspond to the time Daniel spent in Holland in the first three-quarters of 1585 and after his move to Bremen in September. 
144 | 145 | ```{r network-plot} 146 | plot(routes_network, vertex.cex = 3) 147 | ``` 148 | 149 | The `plot()` function with a `network` object uses the Fruchterman and Reingold algorithm to decide on the placement of the nodes.[^6] You can change the layout algorithm with the `mode` argument. Below, I lay out the nodes in a circle. This is not a particularly useful arrangement for this network, but it gives an idea of some of the options available. 150 | 151 | ```{r network-circle-plot} 152 | plot(routes_network, vertex.cex = 3, mode = "circle") 153 | ``` 154 | 155 | ### igraph 156 | Let's now move on to discuss the `igraph` package. First, we need to clean up the environment in R by removing the `network` package so that it does not interfere with the `igraph` commands. We might as well also remove `routes_network` since we will no longer be using it. The `network` package can be removed with the `detach()` function, and `routes_network` is removed with `rm()`.[^7] After this, we can safely load `igraph`. 157 | 158 | ```{r clean environment} 159 | detach(package:network) 160 | rm(routes_network) 161 | library(igraph) 162 | ``` 163 | 164 | To create an `igraph` object from an edge-list data frame we can use the `graph_from_data_frame()` function, which is a bit more straightforward than `network()`. There are three arguments in the `graph_from_data_frame()` function: `d`, `vertices`, and `directed`. Here, `d` refers to the edge list, `vertices` to the node list, and `directed` can be either `TRUE` or `FALSE` depending on whether the data is directed or undirected. 165 | 166 | ```{r igraph object} 167 | routes_igraph <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE) 168 | ``` 169 | 170 | Printing the `igraph` object created by `graph_from_data_frame()` to the console reveals similar information to that from a `network` object, though the structure is more cryptic. 
171 | 172 | ```{r print igraph object} 173 | routes_igraph 174 | ``` 175 | 176 | The main information about the object is contained in `DNW- 13 15 --`. This tells us that `routes_igraph` is a directed network (D) that has a name attribute (N) and is weighted (W). The dash after W tells us that the graph is not [bipartite](https://wikipedia.org/wiki/Bipartite_graph). The numbers that follow describe the number of nodes and edges in the graph respectively. Next, `name (v/c), label (v/c), weight (e/n)` gives information about the attributes of the graph. There are two character vertex attributes (v/c), name, which contains the IDs, and label, as well as a numeric edge attribute (e/n), weight. Finally, there is a printout of all of the edges. 177 | 178 | Just as with the `network` package, we can create a plot with an `igraph` object through the `plot()` function. The only change that I make to the default here is to decrease the size of the arrows. By default `igraph` labels the nodes with the label column if there is one or with the IDs. 179 | 180 | ```{r igraph-plot} 181 | plot(routes_igraph, edge.arrow.size = 0.2) 182 | ``` 183 | 184 | Like the `network` graph before, the default of an `igraph` plot is not particularly aesthetically pleasing, but all aspects of the plots can be manipulated. Here, I just want to change the layout of the nodes to use the graphopt algorithm created by [Michael Schmuhl](http://www.schmuhl.org/graphopt/). This algorithm makes it easier to see the relationship between Haarlem, Antwerp, and Delft, which are three of the most significant locations in the correspondence network, by spreading them out further. 185 | 186 | ```{r igraph-graphopt-plot} 187 | plot(routes_igraph, layout = layout_with_graphopt, edge.arrow.size = 0.2) 188 | ``` 189 | 190 | ### tidygraph and ggraph 191 | The `tidygraph` and `ggraph` packages are newcomers to the network analysis landscape, but together the two packages provide real advantages over the `network` and `igraph` packages. 
`tidygraph` and `ggraph` represent an attempt to [bring network analysis into the tidyverse workflow](http://www.data-imaginist.com/2017/Introducing-tidygraph/). [`tidygraph`](https://cran.r-project.org/web/packages/tidygraph/index.html) provides a way to create a network object that more closely resembles a [tibble or data frame](http://r4ds.had.co.nz/tibbles.html). This makes it possible to use many of the `dplyr` functions to manipulate network data. [`ggraph`](https://cran.r-project.org/web/packages/ggraph/index.html) gives a way to plot network graphs using the conventions and power of `ggplot2`. In other words, `tidygraph` and `ggraph` allow you to deal with network objects in a manner that is more consistent with the commands used for working with tibbles and data frames. However, the true promise of `tidygraph` and `ggraph` is that they leverage the power of `igraph`. This means that you sacrifice few of the network analysis capabilities of `igraph` by using `tidygraph` and `ggraph`. 192 | 193 | We need to start as always by loading the necessary packages. 194 | 195 | ```{r load tidygraph and ggraph} 196 | library(tidygraph) 197 | library(ggraph) 198 | ``` 199 | 200 | First, let's create a network object using `tidygraph`, which is called a `tbl_graph`. A `tbl_graph` consists of two tibbles: an edges tibble and a nodes tibble. Conveniently, the `tbl_graph` object class is a wrapper around an `igraph` object, meaning that at its basis a `tbl_graph` object is essentially an `igraph` object.[^8] The close link between `tbl_graph` and `igraph` objects results in two main ways to create a `tbl_graph` object. The first is to use an edge list and node list, using `tbl_graph()`. The arguments for the function are almost identical to those of `graph_from_data_frame()` with only a slight change to the names of the arguments. 
201 | 202 | ```{r tibble to tbl_graph} 203 | routes_tidy <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE) 204 | ``` 205 | 206 | The second way to create a `tbl_graph` object is to convert an `igraph` or `network` object using `as_tbl_graph()`. Thus, we could convert `routes_igraph` to a `tbl_graph` object. 207 | 208 | ```{r igraph to tbl_graph} 209 | routes_igraph_tidy <- as_tbl_graph(routes_igraph) 210 | ``` 211 | 212 | Now that we have created two `tbl_graph` objects, let's inspect them with the `class()` function. This shows that `routes_tidy` and `routes_igraph_tidy` are objects of class `"tbl_graph" "igraph"`, while `routes_igraph` is object class `"igraph"`. 213 | 214 | ```{r class of tidygraph} 215 | class(routes_tidy) 216 | class(routes_igraph_tidy) 217 | class(routes_igraph) 218 | ``` 219 | 220 | Printing out a `tbl_graph` object to the console results in a drastically different output from that of an `igraph` object. It is an output similar to that of a normal tibble. 221 | 222 | ```{r print routes_tidy} 223 | routes_tidy 224 | ``` 225 | 226 | Printing `routes_tidy` shows that it is a `tbl_graph` object with 13 nodes and 15 edges. The command also prints the first six rows of "Node Data" and the first three of "Edge Data". Notice too that it states that the Node Data is active. The notion of an active tibble within a `tbl_graph` object makes it possible to manipulate the data in one tibble at a time. The nodes tibble is activated by default, but you can change which tibble is active with the `activate()` function. Thus, if I wanted to rearrange the rows in the edges tibble to list those with the highest "weight" first, I could use `activate()` and then `arrange()`. Here I simply print out the result rather than saving it. 227 | 228 | ```{r activate edges} 229 | routes_tidy %>% 230 | activate(edges) %>% 231 | arrange(desc(weight)) 232 | ``` 233 | 234 | Since we do not need to further manipulate `routes_tidy`, we can plot the graph with `ggraph`. 
Like [ggmap](https://jessesadler.com/post/geocoding-with-r/#mapping-data), `ggraph` is an extension of `ggplot2`, making it easier to carry over basic `ggplot` skills to the creation of network plots. As in all network graphs, there are three main aspects to a `ggraph` plot: [nodes](http://www.data-imaginist.com/2017/ggraph-introduction-nodes/), [edges](http://www.data-imaginist.com/2017/ggraph-introduction-edges/), and [layouts](http://www.data-imaginist.com/2017/ggraph-introduction-layouts/). The [vignettes for the ggraph 235 | package](https://cran.r-project.org/web/packages/ggraph/index.html) helpfully cover the fundamental aspects of `ggraph` plots. `ggraph` adds special geoms to the basic set of `ggplot` geoms that are specifically designed for networks. Thus, there is a set of `geom_node` and `geom_edge` geoms. The basic plotting function is `ggraph()`, which takes the data to be used for the graph and the type of layout desired. Both of the arguments for `ggraph()` are built around `igraph`. Therefore, `ggraph()` can use either an `igraph` object or a `tbl_graph` object. In addition, the available layout algorithms primarily derive from `igraph`. Lastly, `ggraph` introduces a special `ggplot` theme that provides better defaults for network graphs than the normal `ggplot` defaults. The `ggraph` theme can be set for a series of plots with the `set_graph_style()` command run before the graphs are plotted or by using `theme_graph()` in the individual plots. Here, I will use the latter method. 236 | 237 | Let's see what a basic `ggraph` plot looks like. The plot begins with `ggraph()` and the data. I then add basic edge and node geoms. No arguments are necessary within the edge and node geoms, because they take the information from the data provided in `ggraph()`. 
238 | 239 | ```{r ggraph-basic} 240 | ggraph(routes_tidy) + geom_edge_link() + geom_node_point() + theme_graph() 241 | ``` 242 | 243 | As you can see, the structure of the command is similar to that of `ggplot` with the separate layers added with the `+` sign. The basic `ggraph` plot looks similar to those of `network` and `igraph`, if not even plainer, but we can use similar commands to `ggplot` to create a more informative graph. We can show the "weight" of the edges — or the amount of letters sent along each route — by using width in the `geom_edge_link()` function. To get the width of the line to change according to the weight variable, we place the argument within an `aes()` function. In order to control the maximum and minimum width of the edges, I use `scale_edge_width()` and set a `range`. I choose a relatively small width for the minimum, because there is a significant difference between the maximum and minimum number of letters sent along the routes. We can also label the nodes with the names of the locations since there are relatively few nodes. Conveniently, `geom_node_text()` comes with a repel argument that ensures that the labels do not overlap with the nodes in a manner similar to the [ggrepel package](https://cran.r-project.org/web/packages/ggrepel/index.html). I add a bit of transparency to the edges with the alpha argument. I also use `labs()` to relabel the legend "Letters". 244 | 245 | ```{r ggraph-graphopt} 246 | ggraph(routes_tidy, layout = "graphopt") + 247 | geom_node_point() + 248 | geom_edge_link(aes(width = weight), alpha = 0.8) + 249 | scale_edge_width(range = c(0.2, 2)) + 250 | geom_node_text(aes(label = label), repel = TRUE) + 251 | labs(edge_width = "Letters") + 252 | theme_graph() 253 | ``` 254 | 255 | In addition to the layout choices provided by `igraph`, `ggraph` also implements its own layouts. 
For example, you can use `ggraph`'s concept of [circularity](http://www.data-imaginist.com/2017/ggraph-introduction-layouts/) to create arc diagrams. Here, I lay out the nodes in a horizontal line and have the edges drawn as arcs. Unlike the previous plot, this graph indicates directionality of the edges.[^9] The edges above the horizontal line move from left to right, while the edges below the line move from right to left. Instead of adding points for the nodes, I just include the label names. I use the same width aesthetic to denote the difference in the weight of each edge. Note that in this plot I use an `igraph` object as the data for the graph, which makes no practical difference. 256 | 257 | ```{r ggraph-arc, fig.width = 12, fig.asp = 0.5} 258 | ggraph(routes_igraph, layout = "linear") + 259 |   geom_edge_arc(aes(width = weight), alpha = 0.8) + 260 |   scale_edge_width(range = c(0.2, 2)) + 261 |   geom_node_text(aes(label = label)) + 262 |   labs(edge_width = "Letters") + 263 |   theme_graph() 264 | ``` 265 | 266 | ## Interactive network graphs with `visNetwork` and `networkD3` 267 | The [htmlwidgets](http://www.htmlwidgets.org) set of packages makes it possible to use R to create interactive JavaScript visualizations. Here, I will show how to make graphs with the [`visNetwork`](http://datastorm-open.github.io/visNetwork/) and [`networkD3`](http://christophergandrud.github.io/networkD3/) packages. These two packages use different JavaScript libraries to create their graphs. `visNetwork` uses [vis.js](http://visjs.org/), while `networkD3` uses the popular [d3 visualization library](http://d3js.org/) to make its graphs. One difficulty in working with both `visNetwork` and `networkD3` is that they expect edge lists and node lists to use specific nomenclature. The above data manipulation conforms to the basic structure for `visNetwork`, but some work will need to be done for `networkD3`. 
Despite this inconvenience, both packages possess a wide range of graphing capabilities and both can work with `igraph` objects and layouts. 268 | 269 | ```{r interactive packages} 270 | library(visNetwork) 271 | library(networkD3) 272 | ``` 273 | 274 | ### visNetwork 275 | The `visNetwork()` function uses a nodes list and edges list to create an interactive graph. The nodes list must include an "id" column, and the edge list must have "from" and "to" columns. The function also plots the labels for the nodes, using the names of the cities from the "label" column in the node list. The resulting graph is fun to play around with. You can move the nodes and the graph will use an algorithm to keep the nodes properly spaced. You can also zoom in and out on the plot and move it around to re-center it. 276 | 277 | ```{r visNetwork-simple, eval = FALSE} 278 | visNetwork(nodes, edges) 279 | ``` 280 | 281 | `visNetwork` can use `igraph` layouts, providing a large variety of possible layouts. In addition, you can use `visIgraph()` to plot an `igraph` object directly. Here, I will stick with the `nodes` and `edges` workflow and use an `igraph` layout to customize the graph. I will also add a variable to change the width of the edge as we did with `ggraph`. `visNetwork()` uses column names from the edge and node lists to plot network attributes instead of arguments within the function call. This means that it is necessary to do some data manipulation to get a "width" column in the edge list. The width attribute for `visNetwork()` does not scale the values, so we have to do this manually. Both of these actions can be done with the `mutate()` function and some simple arithmetic. Here, I create a new column in `edges` and scale the weight values by dividing by 5. Adding 1 to the result provides a way to create a minimum width. 
282 | 283 | ```{r width-edges} 284 | edges <- mutate(edges, width = weight/5 + 1) 285 | ``` 286 | 287 | Once this is done, we can create a graph with variable edge widths. I also choose a layout algorithm from `igraph` and add arrows to the edges, placing them in the middle of the edge. 288 | 289 | ```{r visNetwork-edgewidth, eval = FALSE} 290 | visNetwork(nodes, edges) %>% 291 | visIgraphLayout(layout = "layout_with_fr") %>% 292 | visEdges(arrows = "middle") 293 | ``` 294 | 295 | ### networkD3 296 | A little more work is necessary to prepare the data to create a `networkD3` graph. Making a `networkD3` graph with an edge and node list requires that the IDs be a series of numeric integers that begin with 0. Currently, the node IDs for our data begin with 1, and so we have to do a bit of data manipulation. It is possible to renumber the nodes by subtracting 1 from the ID columns in the `nodes` and `edges` data frames. Once again, this can be done with the `mutate()` function. The goal is to recreate the current columns, while subtracting 1 from each ID. The `mutate()` function works by creating a new column, but we can have it replace a column by giving the new column the same name as the old column. Here, I name the new data frames with a d3 suffix to distinguish them from the previous `nodes` and `edges` data frames. 297 | 298 | ```{r nodes_d3 and edges_d3} 299 | nodes_d3 <- mutate(nodes, id = id - 1) 300 | edges_d3 <- mutate(edges, from = from - 1, to = to - 1) 301 | ``` 302 | 303 | It is now possible to plot a `networkD3` graph. Unlike `visNetwork()`, the `forceNetwork()` function uses a series of arguments to adjust the graph and plot network attributes. The "Links" and "Nodes" arguments provide the data for the plot in the form of edge and node lists. The function also requires "NodeID" and "Group" arguments.
The data being used here does not have any groupings, and so I just have each node be its own group, which in practice means that the nodes will all be different colors. In addition, the code below tells the function that the network has "Source" and "Target" fields, and thus is directed. I include in this graph a "Value", which scales the width of the edges according to the "weight" column in the edge list. Finally, I add some aesthetic tweaks to make the nodes opaque and increase the font size of the labels to improve legibility. The result is very similar to the first `visNetwork()` plot that I created but with different aesthetic stylings. 304 | 305 | ```{r d3-network-plot, eval = FALSE} 306 | forceNetwork(Links = edges_d3, Nodes = nodes_d3, Source = "from", Target = "to", 307 | NodeID = "label", Group = "id", Value = "weight", 308 | opacity = 1, fontSize = 16, zoom = TRUE) 309 | ``` 310 | 311 | One of the main benefits of `networkD3` is that it implements a [d3-styled Sankey diagram](https://bost.ocks.org/mike/sankey/). A Sankey diagram is a good fit for the letters sent to Daniel in 1585. There are not too many nodes in the data, making it easier to visualize the flow of letters. Creating a Sankey diagram uses the `sankeyNetwork()` function, which takes many of the same arguments as `forceNetwork()`. This graph does not require a group argument, and the only other change is the addition of a "unit."
This provides a label for the values that pop up in a tool tip when your cursor hovers over a diagram element.[^10] 312 | 313 | ```{r d3-sankey-diagram, eval = FALSE} 314 | sankeyNetwork(Links = edges_d3, Nodes = nodes_d3, Source = "from", Target = "to", 315 | NodeID = "label", Value = "weight", fontSize = 16, unit = "Letter(s)") 316 | ``` 317 | 318 | ## Resources on network analysis 319 | This post has attempted to give a general introduction to creating and plotting network-type objects in R using the `network`, `igraph`, `tidygraph`, and `ggraph` packages for static plots and `visNetwork` and `networkD3` for interactive plots. I have presented this information from the position of a non-specialist in network theory. I have only covered a very small percentage of the network analysis capabilities of R. In particular, I have not discussed the statistical analysis of networks. Happily, there is a plethora of resources on network analysis in general and in R in particular. 320 | 321 | The best introduction to networks that I have found for the uninitiated is [Katya Ognyanova's Network Visualization with R](http://kateto.net/network-visualization). This presents both a helpful introduction to the visual aspects of networks and a more in-depth tutorial on creating network plots in R. Ognyanova primarily uses `igraph`, but she also introduces interactive networks. 322 | 323 | There are two relatively recent books published on network analysis with R by Springer. [Douglas A. Luke, *A User’s Guide to Network Analysis in R* (2015)](http://www.springer.com/us/book/9783319238821) is a very useful introduction to network analysis with R. Luke covers both the statnet suite of packages and `igraph`. The contents are at a very approachable level throughout. More advanced is [Eric D. Kolaczyk and Gábor Csárdi, *Statistical Analysis of Network Data with R* (2014)](http://www.springer.com/us/book/9781493909827).
Kolaczyk and Csárdi's book mainly uses `igraph`, as Csárdi is the primary maintainer of the `igraph` package for R. This book gets further into advanced topics on the statistical analysis of networks. Despite the use of very technical language, the first four chapters are generally approachable from a non-specialist point of view. 324 | 325 | The [list curated by François Briatte](https://github.com/briatte/awesome-network-analysis) is a good overview of resources on network analysis in general. The [Networks Demystified series of posts by Scott Weingart](https://scottbot.net/tag/networks-demystified/) is also well worth perusal. 326 | 327 | [^1]: One example of the interest in network analysis within digital humanities is the newly launched [Journal of Historical Network Research](https://jhnr.uni.lu/index.php/jhnr). 328 | 329 | [^2]: For a good description of the `network` object class, including a discussion of its relationship to the `igraph` object class, see [Carter Butts, "`network`: A Package for Managing Relational Data in R", *Journal of Statistical Software*, 24 (2008): 1–36](https://www.jstatsoft.org/article/view/v024i02) 330 | 331 | [^3]: This is the specific structure expected by `visNetwork`, while also conforming to the general expectations of the other packages. 332 | 333 | [^4]: This is the expected order for the columns for some of the networking packages that I will be using below. 334 | 335 | [^5]: `ungroup()` is not strictly necessary in this case. However, if you do not ungroup the data frame, it is not possible to drop the "source" and "destination" columns, as I do later in the script. 336 | 337 | [^6]: Thomas M. J. Fruchterman and Edward M. Reingold, "Graph Drawing by Force-Directed Placement," *Software: Practice and Experience*, 21 (1991): 1129–1164. 338 | 339 | [^7]: The `rm()` function is useful if your working environment in R gets disorganized, but you do not want to clear the whole environment and start over again. 
340 | 341 | [^8]: The relationship between `tbl_graph` and `igraph` objects is similar to that between `tibble` and `data.frame` objects. 342 | 343 | [^9]: It is possible to have `ggraph` draw arrows, but I have not shown that here. 344 | 345 | [^10]: It can take a bit of time for the tool tip to appear. -------------------------------------------------------------------------------- /simple-feature-objects.R: -------------------------------------------------------------------------------- 1 | ### An Exploration of Simple Features for R: Building sfg, sfc, and sf objects from the sf package ### 2 | 3 | # This script goes along with the blog post of the same name, 4 | # which can be found at https://www.jessesadler.com/post/simple-feature-objects/ 5 | # See the Rmarkdown document for the contents of the blog post. 6 | 7 | # Load sf package 8 | library(sf) 9 | 10 | ################# 11 | ## sfg objects ## 12 | ################# 13 | 14 | # Create an sfg object with coordinates of Los Angeles and Amsterdam 15 | la_sfg <- st_point(c(-118.2615805, 34.1168926)) 16 | amsterdam_sfg <- st_point(c(4.8979755, 52.3745403)) 17 | 18 | # Structure of a sfg POINT object 19 | str(la_sfg) 20 | 21 | # Coordinates of a sfg object 22 | st_coordinates(la_sfg) 23 | 24 | # Create MULTIPOINT and LINESTRING sfg objects through matrix of points 25 | multipoint_sfg <- st_multipoint(rbind(c(-118.2615805, 34.1168926), c(4.8979755, 52.3745403))) 26 | linestring_sfg <- st_linestring(rbind(c(-118.2615805, 34.1168926), c(4.8979755, 52.3745403))) 27 | 28 | # Plot to compare MULTIPOINT to LINESTRING 29 | par(mar = c(0, 1, 2, 1), 30 | mfrow = c(1, 2)) 31 | plot(multipoint_sfg, main = "MULTIPOINT") 32 | plot(linestring_sfg, main = "LINESTRING") 33 | 34 | # Distance on sfg objects does not have real-world meaning 35 | st_distance(la_sfg, amsterdam_sfg) 36 | 37 | # st_distance and stats::dist the same for sfg objects 38 | dist(rbind(c(-118.2615805, 34.1168926), c(4.8979755, 52.3745403))) 39 | 40 | 
################# 41 | ## sfc objects ## 42 | ################# 43 | 44 | # Create sfc object with default crs 45 | st_sfc(multipoint_sfg) 46 | 47 | # Create sfc object with multiple sfg objects 48 | points_sfc <- st_sfc(la_sfg, amsterdam_sfg, crs = 4326) 49 | 50 | # Attributes of sfc object 51 | attributes(points_sfc) 52 | 53 | # crs attribute is of class crs 54 | class(st_crs(points_sfc)) 55 | 56 | # Access crs attribute of sfc object 57 | st_crs(points_sfc) 58 | 59 | # Geographic distance with sfc object 60 | st_distance(points_sfc) 61 | 62 | ################ 63 | ## sf objects ## 64 | ################ 65 | 66 | # Create a data frame of attributes 67 | data <- data.frame(name = c("Los Angeles", "Amsterdam"), 68 | language = c("English", "Dutch"), 69 | weather = c("sunny", "rainy/cold")) 70 | 71 | # Create data frame with list column then make sf object 72 | st_sf(cbind(data, points_sfc)) 73 | 74 | # Make sf object from separate data frame and sfc objects 75 | st_sf(data, geometry = points_sfc) 76 | 77 | # Create sf object by combining data frame with sfc object and name sfc column geometry 78 | points_sf <- st_sf(data, geometry = points_sfc) 79 | 80 | # Class of new sf object 81 | class(points_sf) 82 | 83 | # Geometry of points_sf is equivalent to points_sfc 84 | identical(st_geometry(points_sf), points_sfc) 85 | 86 | # Stickiness of geometry column 87 | dplyr::select(points_sf, name) 88 | 89 | # Return sf object to a data frame by setting geometry to NULL 90 | st_set_geometry(points_sf, NULL) 91 | 92 | # Class 93 | class(st_set_geometry(points_sf, NULL)) -------------------------------------------------------------------------------- /simple-feature-objects.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Simple Features for R: An Exploration of sf objects" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | post <- "simple-feature-objects" 8 | source("common.R") 9 | ``` 10 | 11 | 
The [previous post](https://jessesadler.com/post/gis-with-r-intro/) provided an introduction to the [`sp`](https://cran.r-project.org/web/packages/sp/index.html) and [`sf`](https://cran.r-project.org/web/packages/sf/index.html) packages and how they represent spatial data in R. There I discussed the creation of `Spatial` and `sf` objects from data with longitude and latitude values and the process of making maps with the two packages. In this post I will provide further background for the `sf` package by going into the details of the structure of `sf` objects and explaining how the package implements the [Simple Features open standard](http://www.opengeospatial.org/standards/sfa). It is certainly not necessary to know the ins and outs of `sf` objects and the Simple Features standard to use the package — it has taken me long enough to get my head around much of this — but a better knowledge of the structure and vocabulary of `sf` objects is helpful for understanding the effects of the plethora of `sf` functions. There are a variety of good resources that discuss the structure of `sf` objects. The most comprehensive are the [package vignette Simple Features for R](https://cran.r-project.org/web/packages/sf/vignettes/sf1.html) and the overview in Chapter 2 of the working book [ *Geocomputation with R* by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow](https://geocompr.robinlovelace.net/spatial-class.html#sf-classes). This post is based on these sources, as well as my own sleuthing through the code for the `sf` package. 12 | 13 | Before diving in, let's take a step back to provide some background to the package. The `sf` package implements the [Simple Features standard](https://en.wikipedia.org/wiki/Simple_Features) in R. The Simple Features standard is widely used by GIS software such as [PostGIS](http://postgis.net/), [GeoJSON](http://geojson.org/), and [ArcGIS](http://www.esri.com/) to represent geographic vector data. 
The `sf` package is designed to bring spatial analysis in R in line with these other systems.[^1] The standard defines a simple feature as a representation of a real-world object by a point or points that may or may not be connected by straight line segments to form lines or polygons. A simple feature can contain both a geometry (the points, any connecting lines, and a coordinate reference system to identify its location on Earth) and attributes that describe the object, such as a name, values, or color. The `sf` package takes advantage of the wide use of Simple Features by linking directly to the [GDAL](http://gdal.org), [GEOS](http://trac.osgeo.org/geos), and [PROJ](http://proj4.org) libraries that provide the back end for reading spatial data, making geographic calculations, and handling coordinate reference systems.[^2] You can see this from the message when you load the `sf` package, so let's do that now. 14 | 15 | ```{r load-sf, message = TRUE} 16 | library(sf) 17 | ``` 18 | 19 | With this general definition of Simple Features in mind, we can look at how the `sf` package implements the standard through the `sf` class of object.[^3] At its most basic, an `sf` object is a collection of simple features that includes attributes and geometries in the form of a data frame. In other words, it is a data frame (or tibble) with rows of features, columns of attributes, and a special geometry column that contains the spatial aspects of the features. The special geometry column is itself a [list](http://colinfay.me/intro-to-r/lists-and-data-frames.html) of class `sfc`, which is made up of individual objects of class `sfg`. While it is possible to have multiple geometry columns, `sf` objects usually only have a single geometry column.
We can break down the components of an `sf` object by looking at the printed output of an `sf` object.[^4] 20 | 21 | **Image** 22 | 23 | - `sf` object: collection of simple features represented by a data frame 24 | - feature: a single simple feature with attributes and geometry represented by a row in the data frame 25 | - attributes: non-geometry variables or columns in the data frame 26 | - `sfg` object: geometry of a single feature 27 | - `sfc` object: geometry column with the spatial attributes of the object printed above the data frame 28 | 29 | This terminology covers the basics for `sf` objects and provides the majority of information you need to work with them. The most likely way to create an `sf` object is to convert another type of spatial object or a data frame with coordinates to an `sf` object. Rarely is there a reason to create an `sf` object from scratch. However, we can learn more about the nature of `sf` objects by delving further into their structure and explicating how `sfg` and `sfc` objects combine to store spatial data. The rest of the post will do this by creating an `sf` object step by step from a single `sfg` object, to a combination of `sfg` objects in an `sfc` object, and finally the addition of attributes to produce a full `sf` object. 30 | 31 | ## `sfg` objects 32 | `sfg` objects represent the geometry of a single feature and contain information about the feature's coordinates, dimension, and type of geometry. The coordinates are the values of the point or points in the feature, which can have two, three, or four dimensions. The vast majority of data is two dimensional, possessing X and Y coordinates, but the Simple Features standard also possesses a Z coordinate for elevation and an M coordinate, which is rarely used, for information about the measurement of a point.
The geometry type indicates the shape of the feature, whether it consists of a point, multiple points, a line, multiple lines, a polygon, multiple polygons, or some combination of these. There are seventeen different possible types of geometries, but the vast majority of data will use the seven main types discussed here.[^5] 33 | 34 | ### Geometry types 35 | - POINT – a single point 36 | - MULTIPOINT – multiple points 37 | - LINESTRING – sequence of two or more points connected by straight lines 38 | - MULTILINESTRING – multiple lines 39 | - POLYGON – a closed ring with zero or more interior holes 40 | - MULTIPOLYGON – multiple polygons 41 | - GEOMETRYCOLLECTION – any combination of the above types 42 | 43 | Within the Simple Features standard the geometry of a feature is encoded using [well-known binary](https://cran.r-project.org/web/packages/sf/vignettes/sf1.html#wkb) (WKB) that can also be represented in a more human-readable form of well-known text (WKT). The `sf` package can convert between `sfg` and WKB or WKT very quickly, but `sfg` objects are created and stored in R as vectors, matrices, or lists of matrices depending on the geometry type. 44 | 45 | ### Creating geometry types 46 | - POINT – a vector 47 | - MULTIPOINT and LINESTRING – a matrix with each row representing a point 48 | - MULTILINESTRING and POLYGON – a list of matrices 49 | - MULTIPOLYGON – a list of lists of matrices 50 | - GEOMETRYCOLLECTION – list that combines any of the above 51 | 52 | An `sfg` object of each of the seven main geometry types can be created through separate functions: `st_point()`, `st_multipoint()`, `st_linestring()`, etc. The functions all have two arguments: an R object of the proper class to create the specified geometry type and identification of dimensions as either XYZ or XYM if the data has three dimensions.
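To make the list-of-matrices structure concrete, here is a minimal sketch of a POLYGON `sfg` built from made-up coordinates (not part of the example data used in this post). Note that the first and last points of the ring must be identical so that the ring closes.

```{r polygon-sfg-sketch}
# Hypothetical POLYGON sfg: a list containing a single closed ring,
# where the first and last points are the same
polygon_sfg <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
polygon_sfg
```

Printing the object uses the same well-known text style as the other geometry types, with an extra set of parentheses marking the ring.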
In this post I will create POINT, MULTIPOINT, and LINESTRING geometry types, which are the most likely to be used when creating an `sf` object from scratch and provide the basis for the remaining types. You can see how to create the remaining types in the package [vignette](https://cran.r-project.org/web/packages/sf/vignettes/sf1.html) and [*Geocomputation with R*](https://geocompr.robinlovelace.net/spatial-class.html#sf-classes). Let's start by creating two points using vectors of longitude and latitude values for points in Los Angeles and Amsterdam. 53 | 54 | ```{r POINTS sfg} 55 | # Create an sfg object with coordinates of Los Angeles and Amsterdam 56 | la_sfg <- st_point(c(-118.2615805, 34.1168926)) 57 | amsterdam_sfg <- st_point(c(4.8979755, 52.3745403)) 58 | ``` 59 | 60 | The relative simplicity of the command to create an `sfg` POINT underlines the simplicity of `sfg` objects as a whole. We can see this by taking a closer look at the information contained in these objects. 61 | 62 | ```{r print POINT sfg} 63 | # Print sfg POINT objects 64 | la_sfg 65 | amsterdam_sfg 66 | ``` 67 | 68 | The printing method for all `sfg` objects uses the well-known text style with the geometry type printed in capital letters followed by the coordinates, using commas and parentheses to distinguish between the different elements for each feature.[^6] 69 | 70 | ```{r structure POINT sfg} 71 | # Structure of a sfg POINT object 72 | str(la_sfg) 73 | ``` 74 | 75 | If we look a bit further into the structure of an `sfg` object, we can see that it consists of three classes corresponding to the dimensions, geometry type, and the `sfg` class itself. In addition, the object possesses coordinates, which is a vector of length two. The coordinates can also be represented by a matrix with two columns and one row. 
In fact, all `sfg` objects can be represented as a matrix of coordinates, though more complex geometry types contain additional columns to keep track of pieces that make up the feature or features.[^7] 76 | 77 | ```{r coordinates POINT sfg} 78 | # Coordinates of a sfg object 79 | st_coordinates(la_sfg) 80 | ``` 81 | 82 | Points are the building block for all other geometry types. We can see this by creating a MULTIPOINT and LINESTRING simple feature, which are both formed by a matrix of points. The matrix can be created by using `rbind()` to bind the coordinate vectors together into rows. 83 | 84 | ```{r multipoint and linestring} 85 | # Matrix of points 86 | rbind(c(-118.2615805, 34.1168926), c(4.8979755, 52.3745403)) 87 | 88 | # Create MULTIPOINT and LINESTRING sfg objects through matrix of points 89 | multipoint_sfg <- st_multipoint(rbind(c(-118.2615805, 34.1168926), c(4.8979755, 52.3745403))) 90 | linestring_sfg <- st_linestring(rbind(c(-118.2615805, 34.1168926), c(4.8979755, 52.3745403))) 91 | 92 | # Print objects 93 | multipoint_sfg 94 | linestring_sfg 95 | ``` 96 | 97 | Printing the two new `sfg` objects shows their basic similarity. They were created by the same matrix and thus have the same coordinates. The only difference is that one is of class MULTIPOINT and the other is of class LINESTRING. In other words, the geometry of `multipoint_sfg` consists of points, and the geometry of `linestring_sfg` is a straight line between the same two points. We can see this by plotting the objects next to each other. 98 | 99 | ```{r multipoint-linestring-plot} 100 | par(mar = c(0, 1, 2, 1), 101 | mfrow = c(1, 2)) 102 | plot(multipoint_sfg, main = "MULTIPOINT") 103 | plot(linestring_sfg, main = "LINESTRING") 104 | ``` 105 | 106 | As important as the information contained in `sfg` objects is, what is missing is equally significant for understanding their role within `sf` objects.
`sfg` objects consist of a dimension, geometry type, and coordinates, but though they possess spatial aspects, they are not geospatial. `sfg` objects do not possess a [coordinate reference system (CRS)](https://jessesadler.com/post/gis-with-r-intro/#crs). There is nothing within the structure of an `sfg` object to indicate that the X and Y values of the coordinates correspond to longitude and latitude values, much less to the datum of the coordinates. Calculating the distance between `la_sfg` and `amsterdam_sfg` demonstrates the non-geospatial nature of `sfg` objects. The calculation cannot take into account the ellipsoidal nature of the Earth, nor that the distance between degrees of longitude decreases as you move towards the poles. Thus, the calculation for `st_distance()` has no direct real-world meaning. In fact, the `st_distance()` function with `sfg` objects returns the same value as the base R `dist()` function from the `stats` package.
The `sfc` class possesses eight subclasses, one for an `sfc` object composed of each of the seven main geometry types and an additional subclass for an `sfc` object that contains a combination of geometry types. 118 | 119 | ### `sfc` subclasses 120 | - sfc\_POINT 121 | - sfc\_MULTIPOINT 122 | - sfc\_LINESTRING 123 | - sfc\_MULTILINESTRING 124 | - sfc\_POLYGON 125 | - sfc\_MULTIPOLYGON 126 | - sfc\_GEOMETRYCOLLECTION 127 | - sfc\_GEOMETRY 128 | 129 | The `st_sfc()` function to create an `sfc` object takes any number of `sfg` objects and a CRS in the form of either an [EPSG code](http://www.epsg.org) or a `proj4string` from the [PROJ library](http://proj4.org). Internally, the `crs` argument uses the `st_crs()` function to look up either the EPSG code or `proj4string` and, if possible, set the corresponding value for the undefined `epsg` or `proj4string` through external calls to the [GDAL](http://gdal.org) and PROJ libraries. The default CRS for an `sfc` object is `NA` or not available. Let's create an `sfc` object from one of the above `sfg` objects with the default CRS to see how an `sfc` object differs from an `sfg` object even when it contains the geometry of only one feature. 130 | 131 | ```{r multipoint sfc} 132 | # Create sfc object with default crs 133 | st_sfc(multipoint_sfg) 134 | ``` 135 | 136 | The `sfc` print method reveals the information contained in the `sfg` object or objects used to create it, including the geometry type or `sfc` subclass, dimensions, and the coordinates printed in the same well-known text format used by the `sfg` print method.[^8] However, the `sfc` print method also reveals the addition of the `crs` attribute by printing out the `epsg` and `proj4string` values. Thus, even though in this case the geospatial aspect of the `sfc` object is left undefined, the object can be considered geospatial.
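As a quick aside, the `crs` argument can also be supplied as a `proj4string` rather than an EPSG code. A minimal sketch, assuming the `la_sfg` and `amsterdam_sfg` objects created above:

```{r sfc-proj4string-sketch}
# Hypothetical alternative: set the WGS84 CRS with a proj4string
# instead of the EPSG code 4326
st_sfc(la_sfg, amsterdam_sfg, crs = "+proj=longlat +datum=WGS84")
```

Either form identifies the same coordinate reference system; the EPSG code is simply the more compact way to write it.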
Let's further investigate the structure of an `sfc` object by creating an `sfc` object with multiple `sfg` objects and use the [EPSG code 4326](https://en.wikipedia.org/wiki/World_Geodetic_System#WGS84) to identify longitude and latitude coordinates on the WGS84 ellipsoid for the CRS. Notice in the print out of the object that the `proj4string` is identified, though it was not included in the arguments. 137 | 138 | ```{r points_sfc} 139 | # Create sfc object with multiple sfg objects 140 | points_sfc <- st_sfc(la_sfg, amsterdam_sfg, crs = 4326) 141 | points_sfc 142 | ``` 143 | 144 | We can see that an `sfc` object is a list and treated as such in R by running `View(points_sfc)`, which will give an output similar to any other list object. However, `points_sfc` is different from a normal list in that an `sfc` object possesses five attributes that give `sfc` objects their geospatial nature. 145 | 146 | ```{r sfc attributes} 147 | # Attributes of sfc object 148 | attributes(points_sfc) 149 | ``` 150 | 151 | Here, we can clearly see that `points_sfc` is of class `sfc` and subclass `sfc_POINT` since it only contains geometries of type POINT. The `precision` attribute corresponds to the precision element of the Simple Features standard and is used in certain geometric calculations. `bbox` is a calculation of the minimum and maximum values of the X and Y coordinates within the `sfc` object, and `n_empty` notes the number of empty `sfg` objects in the list. The `crs` attribute is the most interesting for our purposes and consists of a `crs` object, which itself is a list of length two containing a `proj4string` and an `epsg` value if one exists. Because the `crs` is an attribute of `sfc` objects, all geometries within an `sfc` object by definition possess the same CRS. We can inspect the contents of the `crs` attribute for `points_sfc` with the `st_crs()` function even if it does not provide us with any additional information. 
152 | 153 | ```{r crs of sfc object} 154 | # crs attribute is of class crs 155 | class(st_crs(points_sfc)) 156 | 157 | # Access crs attribute of sfc object 158 | st_crs(points_sfc) 159 | ``` 160 | 161 | We can confirm the spatial nature of `points_sfc` by again using the `st_distance()` function. Beginning with [version 0.6](https://github.com/r-spatial/sf/blob/master/NEWS.md) of the `sf` package, `st_distance()` uses the [`lwgeom`](https://cran.r-project.org/web/packages/lwgeom/index.html) package, which in turn links to geometric functions from the [liblwgeom](https://github.com/postgis/postgis/tree/svn-trunk/liblwgeom) library used by PostGIS, to make geometric calculations on longitude and latitude values. Therefore, you may need to install the `lwgeom` package for the `st_distance()` function to work properly. 162 | 163 | ```{r sfc distance} 164 | # Geographic distance with sfc object 165 | st_distance(points_sfc) 166 | ``` 167 | 168 | With an `sfc` object that uses longitude and latitude coordinates and a set `crs`, the `st_distance()` function uses complex geometric calculations on the chosen ellipsoid to accurately calculate the distance between features. By default, the function returns a dense matrix of the distance between all of the features — or `sfg` objects — contained in the `sfc` object along with the units of the values from the [`units`](https://cran.r-project.org/package=units) package. Here, we can see that the points in Los Angeles and Amsterdam are 8,955,120 meters apart from each other. You can even convert to miles or kilometers with the `units` package, but the most significant aspect of this command is how it differs from the above use of `st_distance()` with `sfg` objects. We now have usable geographical information about the two points we created.
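The unit conversion mentioned above can be sketched with `units::set_units()`. This is an illustration under the assumption that the `units` package is installed, not part of the original workflow:

```{r units-conversion-sketch}
# Hypothetical sketch: convert the distance matrix returned by
# st_distance() from meters to kilometers
distances <- st_distance(points_sfc)
units::set_units(distances, km)
```

Because the result carries its units with it, the conversion is a single function call rather than manual arithmetic.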
169 | 170 | ## `sf` objects 171 | Now that we have a set of geometries containing a coordinate reference system in the form of an `sfc` object, we can connect `points_sfc` to a data frame of attributes to create an `sf` object. First, though, we need a data frame of attributes that provides information about the two points. 172 | 173 | ```{r data} 174 | # Create a data frame of attributes for the two points 175 | data <- data.frame(name = c("Los Angeles", "Amsterdam"), 176 | language = c("English", "Dutch"), 177 | weather = c("sunny", "rainy/cold")) 178 | ``` 179 | 180 | The `st_sf()` function can either join a data frame to an `sfc` object of the same length or take a data frame that already contains an `sfc` list column to create an `sf` object. When joining a data frame to an `sfc` object you can name the geometry list column. Otherwise the column will be named after the `sfc` object. It is also worth pointing out that the methodology used by `st_sf()` converts a tibble to a data frame, and so the output of the function will always be an object of classes `sf` and `data.frame`. Here I want to show two different ways to make an `sf` object with the `data` data frame and `points_sfc`, first by combining the two objects into a single data frame before executing the `st_sf()` function and then letting the `st_sf()` function do the joining. In the first method, the use of `cbind()` automatically names the geometry list column as "geometry." 181 | 182 | ```{r sf in two ways} 183 | # Create data frame with list column then make sf object 184 | st_sf(cbind(data, points_sfc)) 185 | 186 | # Make sf object from separate data frame and sfc objects 187 | st_sf(data, geometry = points_sfc) 188 | ``` 189 | 190 | I think that the latter method is preferable, but the former method demonstrates part of the internal process of `st_sf()` when using the latter method.
Both methods produce the same result, and we are presented with the same print method for an `sf` object that we saw at the beginning of the post. As I noted above, the result of `st_sf()` is an object of classes `sf` and `data.frame`, showing that `sf` objects are an extension of data frames. Let's confirm that by creating an `sf` object and checking its class.

```{r points_sf}
# Create sf object by combining data frame with sfc object and name sfc column geometry
points_sf <- st_sf(data, geometry = points_sfc)

# Class of new sf object
class(points_sf)
```

The creation of `points_sf` does not alter the geometry or geographical information contained in `points_sfc` in any way. To confirm this, we can show that the geometry of `points_sf` — accessed via the `st_geometry()` function — is equivalent to `points_sfc`.

```{r geometry and sfc}
# Geometry of points_sf is equivalent to points_sfc
identical(st_geometry(points_sf), points_sfc)
```

In many ways an `sf` object can be treated as if it were a data frame, and this is one of the main advantages of working with `sf` objects over `Spatial` objects from the `sp` package. As I noted in my [previous post](https://jessesadler.com/post/gis-with-r-intro/#sf), you can manipulate `sf` objects with [`dplyr`](http://dplyr.tidyverse.org/) commands. Internally, however, `sf` objects differ from data frames in that the geometry column is designed to be sticky. In addition to the usual attributes associated with a data frame (column names, row names, and class), `sf` objects also possess an `sf_column` attribute, a vector of the names of the one or more geometry columns in the `sf` object.[^9] The `sf_column` attribute helps to implement the stickiness of geometry columns: the geometry column is kept in all subsetting of an `sf` object.
For example, using `dplyr`'s `select()` function with a single column name returns an `sf` object with two columns: the selected column and the geometry column.[^10]

```{r stickiness of geometry}
# Stickiness of geometry column
dplyr::select(points_sf, name)
```

If you want to remove the geometry column and transform an `sf` object into a data frame or tibble, you can use `as_tibble()` or `as.data.frame()` in a pipe before `select()`. More directly, you can remove the geometry column and convert an `sf` object to a data frame by setting the geometry to `NULL`. By converting from an `sf` object back to a data frame, we have come full circle in the creation of an `sf` object from scratch.

```{r remove geometry}
# Return sf object to a data frame by setting geometry to NULL
st_set_geometry(points_sf, NULL)

# Class
class(st_set_geometry(points_sf, NULL))
```

## Wrapping up
The `sf` package is quickly becoming the default GIS package for use in R. It has done this by balancing the use of the Simple Features standard, which enables it to link directly to powerful GIS libraries such as PostGIS, GEOS, and GDAL, with an implementation that uses native R objects and fits within the [tidyverse](http://tidyverse.org) set of packages and methods. Investigating the structure of `sf` objects and how they are built from `sfg` objects, an `sfc` list column, and a data frame provides insight into how a data-frame-like object is able to possess spatial data.

We can summarize the roles of `sfg`, `sfc`, and `sf` objects in the following fashion.

- `sfg`: geometry
    - geometry of a single feature
    - vector, matrix, or list of matrices of coordinates with defined dimension and type of geometry
    - seven main geometry types
- `sfc`: geospatial geometry
    - list of `sfg` objects
    - coordinate reference system through the `crs` attribute
    - seven subclasses based on geometries
- `sf`: geospatial geometry with attributes
    - data frame with a geometry column of class `sfc`
    - sticky geometry column through the `sf_column` attribute

[^1]: The Simple Features standard was developed after the creation of the `sp` package, and so `sp` does not use the standard.

[^2]: Before `sf`, the primary interfaces to GDAL and GEOS were through the separate [`rgdal`](https://cran.r-project.org/web/packages/rgdal/index.html) and [`rgeos`](https://cran.r-project.org/web/packages/rgeos/index.html) packages.

[^3]: There are many things called "sf" or "simple features" here, so I will try to be as clear as possible in distinguishing them.

[^4]: See [Introduction to GIS with R](https://jessesadler.com/post/gis-with-r-intro/) for the code that created this object. See also the similar image in the [package vignette](https://cran.r-project.org/web/packages/sf/vignettes/sf1.html#sf-objects-with-simple-features).

[^5]: The `sf` package prints the geometry types in capital letters.

[^6]: See the [Implementation Standard for Geographic information - Simple feature access, pg 61](http://www.opengeospatial.org/standards/sfa) for examples of the well-known text style.

[^7]: See `help("st_coordinates")` for details.

[^8]: If you have more than five features, the print command will only print the first five geometries. You can alter this with an explicit print command with a defined "n" for the number of geometries you want to print.

[^9]: An `sf` object also contains an `agr` attribute that can categorize the non-geometry columns as either "constant", "aggregate", or "identity" values.

[^10]: Since I have not loaded the `dplyr` package in this post, I am using the `::` method to call a function from a package that is not loaded.