├── .gitignore ├── 01-Introduction.Rmd ├── 02-Memory.Rmd ├── 03-Preprocessing.Rmd ├── 04-Manipulating.Rmd ├── 05-Loading-dbs.Rmd ├── 06-Visualising.Rmd ├── 07-Efficient.Rmd ├── 08-Rcpp.Rmd ├── 09-ff.Rmd ├── 10-sparkR.Rmd ├── 11-Datasets.Rmd ├── LICENSE ├── Makefile ├── R-for-Big-Data.Rproj ├── README.md ├── _config.yml ├── additional-content ├── .gitignore ├── book_outline.Rmd ├── challenges-consolidation.txt ├── consolidate.R ├── course-info-leeds.Rmd ├── course_outline.Rmd ├── dfs.RData └── gini-dataset-II.Rmd ├── assets ├── cdrc-leeds.png └── cdrc-logo_large.png ├── build.R ├── data ├── .gitignore ├── CAIT_Country_GHG_Emissions_-_All_Data.xlsx ├── ghg-ems.csv ├── miniaa ├── minicars.Rds ├── minicars.csv ├── moby_dick.txt ├── pew.csv ├── rand-mini.csv ├── reshape-pew.csv ├── si.pov.gini_Indicator_en_csv_v2.csv ├── tinyaa ├── tinyab ├── tinyac ├── tinyad └── world-bank-ineq.csv ├── figures ├── 746px-Pistol-grip_drill.svg.png ├── Laptop-hard-drive-exposed.jpg ├── coventry-centroids.png ├── environment.png ├── f2_1-crop.pdf ├── f2_2-crop.pdf ├── know_data.jpg ├── mel-cycle-cent-close.png └── od-mess.png ├── in_header.tex ├── packages.R ├── r4bd.bib ├── slides ├── chapter1.Rmd ├── chapter10.Rmd ├── chapter2.Rmd ├── chapter3.Rmd ├── chapter4.Rmd ├── chapter5.Rmd ├── chapter6.Rmd ├── chapter9.Rmd └── dplyr.Rmd ├── src ├── mean_c.cpp └── precision.cpp ├── toc.yaml ├── tufte-common.def ├── tufte-ebook.cls └── tufte.bst /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | *.pdf 5 | big-data 6 | wrdtc-handout.docx 7 | npidata 8 | *cache* 9 | book.Rmd 10 | book.aux 11 | book.idx 12 | book.log 13 | book.out 14 | book.tex 15 | book.toc 16 | .#book.tex 17 | ignore_* 18 | course_*.html 19 | v59i10-data.zip 20 | data/reshape 21 | data/bn19figs.xlsx 22 | largefile 23 | gini_Indicator_en_csv_v2.zip 24 | data/ghg-ems.* 25 | data/npidata* 26 | data/largefile.zip 27 | *.ilg 28 | book.ind 29 | *.html 30 | figures/* 31 | pets* 32 | mini_readr.csv 33 | -------------------------------------------------------------------------------- /01-Introduction.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: pdf_document 3 | bibliography: r4bd.bib 4 | --- 5 | 6 | ```{r echo=FALSE} 7 | library("grid") 8 | library("png") 9 | library("Rcpp") 10 | library("pryr") 11 | library("bigvis") 12 | ``` 13 | 14 | # Introduction 15 | 16 | ## What is Big Data? 17 | 18 | A common definition of big data is data that 19 | is:^[Data here is treated as a mass noun, similar to information. 20 | Purists may insist that 'data' should always plural because it originates from the Latin word datum. 21 | However, language evolves, we no longer speak Latin and the singular is becoming the norm [@kitchin2014data].] 22 | 23 | - Variable, within each dataset and between sources 24 | - Voluminous, occupying much RAM and hard disk space 25 | - High in velocity: it's always being generated 26 | 27 | \noindent Precisely how variable, voluminous and rapidly generated data needs to be before it's classified as 'big' is rarely specified, however. 28 | Looser definitions recognise that Big Data is an umbrella or 'catch all' term, used to refer to information that is simply tricky to analyse using established methods [@Lovelace2015]. 29 | We use this looser definition in this in this book. 30 | 31 | The variety of new datasets becoming available 32 | is huge. 
Therefore, instead of trying to cover all manner of new datasets, 33 | the focus of this book is developing a solid understanding of the R *language* to interact with data. 34 | As with learning any new language, a deep understanding of the fundamentals will provide the flexibility to deal with almost any situation. 35 | Learning to avoid computational bottlenecks and write efficient code, for example, can save hours of processing and development time. 36 | In other words, **becoming proficient in handling Big Data entails first becoming fluent in data analysis and computing more generally**. 37 | 38 | It's easy to get side-tracked or 'lost in the data' when analysing large datasets. 39 | Clearly defining the aim of a particular analysis project therefore particularly important in this context. 40 | There are often many ways to solve a problem with R and, 41 | in addition to computational speed, the most appropriate solution will likely depend on: 42 | 43 | - ease and speed of writing the code; 44 | - ease of communicating and reproducing the analysis; 45 | - durability of code. 46 | 47 | ```{r drill, fig.margin=TRUE, fig.cap= "A drill is analogous to a software tool: the questions of functionality and reliability should trump the question: 'is it the best?'", echo=FALSE} 48 | grid.raster(readPNG("figures//746px-Pistol-grip_drill.svg.png")) 49 | ``` 50 | 51 | In this context it is useful to think of software as a power-tool (Fig. 1.1). People rarely ask 'is this the BEST possible drill'?. 52 | More likely a good builder will ask: 'is this drill *good enough* to get the job done?' 'is this drill robust?' and 'will it work in 20 years time?' The same applies to R for Big Data. 53 | 54 | Regardless of the 'big' dataset you hope to use, 55 | you can be confident of one thing: 56 | **it is unlikely to be ready to analyse.** 57 | This means that you must work to tidy the data, 58 | a task that typically takes around 59 | 80% of the effort expended on data analysis projects 60 | [@tidy-data]. 61 | 62 | ## Coping with big data in R 63 | 64 | R has had a difficult relationship with big data. One of R's key features is that it loads data into the computer's 65 | RAM^[There are, however, packages such as **dplyr** which allow R to access, filter and even process data stored remotely. 66 | These are described in chapter 5.]. 67 | This was less of a problem twenty years ago, when data sets were small and the main bottleneck on analysis was how quickly a statistician could think. Traditionally, the development of a statistical model took more time than the computation. When it comes to Big Data, this changes. 68 | Nowadays datasets that are larger than your laptop's memory are commonplace. 69 | 70 | Even if the original data set is relatively small data set, the analysis can generate large objects. For example, suppose we went to perform standard cluster analysis. 
Using the built-in data set `USAarrests`, we can calculate a distance matrix, 71 | ```{r} 72 | d = dist(USArrests) 73 | ``` 74 | 75 | \noindent and perform hierarchical clustering to get a dendrogram 76 | 77 | ```{r} 78 | fit = hclust(d) 79 | ``` 80 | 81 | \noindent to get a dendrogram 82 | 83 | ```{r denofig.fullwidth=TRUE, fig.height=2, echo=2, fig.cap="Dendrogram from USArrests data."} 84 | par(mar=c(3,3,2,1), mgp=c(2,0.4,0), tck=-.01,cex=0.5, las=1) 85 | plot(fit, labels=rownames(d)) 86 | ``` 87 | 88 | \noindent When we inspect the object size of the original data set and the distance object 89 | ```{r} 90 | pryr::object_size(USArrests) 91 | pryr::object_size(d) 92 | ``` 93 | 94 | \noindent we have managed to create an object that is three times larger than the original data set.^[The function \texttt{object\_size} is part of the \texttt{pryr} package, which we will cover in chapter 2.] In fact the object `d` is a symmetric $n \times n$ matrix, where $n$ is the number of rows in `USAarrests`. Clearly, as `n` increases the size of `d` increases at rate $O(n^2)$. So if our original data set contained $10,000$ records, the associated distance matrix would contain almost $10^8$ values. Of course since the matrix is symmetric, this corresponds to around $50$ million unique values. 95 | 96 | To tackle big data in R, we review some of the possible strategies available. 97 | 98 | ### Buy more RAM 99 | 100 | Since R keeps all objects in memory, the easiest way to deal with memory issues. Currently, 16GB costs less than £100. This small cost is quickly recuperated on user time. A relatively powerful desktop machine can be purchased for less that £1000. 101 | 102 | Another alternative, could be to use cloud computing. For example, Amazon currently charge around 3£0.15 per Gigabyte of RAM. Currently, a $244$GB machine, with 32 cores, costs around £3.12 per hour (see [aws.amazon.com/ec2/pricing/](https://aws.amazon.com/ec2/pricing/). 103 | 104 | ### Sampling 105 | 106 | Do you **really** need to load all of data at once? For example, if your data contains information regarding sales, does it make sense to aggregate across countries, or should the data be split up? Assuming that you need to analyse all of your data, then random sampling could provide an easy way to perform your analysis. In fact, it is almost always sensible to sample your data set at the beginning of an analysis until your analysis pipeline is in reasonable shape. 107 | 108 | If your dataset is too large to read into RAM, it may need to be 109 | *preprocessed* or *filtered* using tools external to R before 110 | reading it in. This is the topic of chapter 3. 111 | For databases we can filter when asking for the data from within R 112 | (described in chapter 6). 113 | 114 | ### Integration with C++ or Java 115 | 116 | Another strategy to improve performance, is to move small parts of the program from R to another, faster language, such as C++ or Java. The goal is to keep R's neat way of handling data, with the higher performance offered by other languages. Indeed, many of R's base functions are written in C or Fortran. This outsourcing of code to another language can be easily hidden in another function. 117 | 118 | ### Avoid storing objects in memory 119 | 120 | There are packages available that avoid storing data in memory. Instead objects are stored on your hard disc and analysed in blocks or chunks. Hadoop is an example of this technique. This strategy is perfect for dealing with large amounts of data. 
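As a minimal sketch of the chunking idea in base R (the file `data/largefile.csv` and its first column are hypothetical stand-ins), a column sum can be accumulated block by block, so that no more than 10,000 rows are ever held in memory at once:

```{r, eval=FALSE}
con <- file("data/largefile.csv", "r")
header <- readLines(con, n = 1)       # read and discard the header line
total <- 0
repeat {
  lines <- readLines(con, n = 10000)  # pull in the next block of rows
  if (length(lines) == 0) break       # stop when the file is exhausted
  chunk <- read.csv(text = lines, header = FALSE)
  total <- total + sum(chunk[[1]])    # process this block, then let it go
}
close(con)
```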
Unfortunately, many algorithms haven't been designed with this principle in mind. This means that only a few R functions that have been explicitly created to deal with specific chunk data types will work. 121 | 122 | The two most famous packages on CRAN that use this principle are `ff` and `ffbase`.^[There is also the `bigmemory` package that does something similar.] The commercial product, Revolution R Enterprise, also uses the chunk strategy in their `scaleR` package. 123 | 124 | ### Alternative interpreters 125 | 126 | Due to the popularity of R, it now possible to use alternative interpreters (the interpreter is where the code is run). There are currently four possibilities 127 | 128 | * [pqrR](http://www.pqr-project.org/) (pretty quick R) is a new version of the R interpreter. One major downside, is that it is based on R-2.15.0. The developer (Radford Neal) has made many improvements, some of which have now been incorporated into base R. **pqR** is an open-source project licensed under the GPL. One notable improvement in pqR is that it is able to do some numeric computations in parallel with each other, and with other operations of the interpreter, on systems with multiple processors or processor cores. 129 | 130 | * [Renjin](http://www.renjin.org/) reimplements the R interpreter in Java, so it can run on the Java Virtual Machine (JVM). Since R will be pure Java, it can run anywhere. 131 | 132 | * [Tibco](http://spotfire.tibco.com/) created a C++ based interpreter called TERR. 133 | 134 | * Oracle also offer an R-interpreter, that uses Intel's mathematics library and therefore achieves a higher performance without changing R's core. 135 | 136 | ## Course R package 137 | 138 | There is companion R package for this course. The package contains some example data sets, and also a few helper functions. To install the package, first install `drat`.^[The `drat` package provides a nice way of accessing other package repositories.] 139 | 140 | ```{r eval=FALSE} 141 | install.packages("drat") 142 | ``` 143 | 144 | \noindent Then the course package can be installed using^[Assuming you are using at least R version 3.2.0.] 145 | 146 | ```{r eval=FALSE} 147 | drat::addRepo("rcourses") 148 | install.packages("r4bd") 149 | ``` 150 | 151 | 152 | 153 | 154 | -------------------------------------------------------------------------------- /03-Preprocessing.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: pdf_document 3 | --- 4 | 5 | \chapter{Preprocessing Data} 6 | 7 | R is ideal for handling many data-related tasks but not everything. 8 | As mentioned in the introduction, there are various ways to 9 | preprocess large files outside R to make them easier to handle. 10 | Here we will explore some of the options. 11 | 12 | For data stored in large text files we can use 13 | 'streaming' utilities before reading it into R. With tools such as 14 | *sed*\sidenote{\url{https://www.gnu.org/software/sed/manual/sed.html}} 15 | (a 'stream editor' included on most Unix-based systems), 16 | split\sidenote{\url{https://en.wikipedia.org/wiki/Split\_\%28Unix\%29}} and 17 | csvkit,\sidenote{\url{https://csvkit.readthedocs.org/en/latest/}} a 10 GB .csv can be 18 | broken up into smaller chunks before being loaded into R. 19 | Furthermore, these tools will run well on a puny laptop. 20 | Here's an 21 | example of downloading a very large .csv file and then 22 | trying (and failing!) to load it into R. 
23 | Because of the strain it will put on most systems, we recommend you don't 24 | run this code:^[For 25 | more information on the origin and content of this dataset, see chapter 10.] 26 | 27 | ```{r, eval=FALSE, tidy=FALSE} 28 | dir.create("data") # create folder for data 29 | url <- "http://download.cms.gov/nppes/NPPES_Data_Dissemination_Aug_2015.zip" 30 | 31 | # download a large dataset - don't run 32 | library("downloader") # needs to be installed 33 | download(url, destfile = "data/largefile.zip") 34 | ## 550600K .......... ...... 100% 1.26M=6m53s 35 | 36 | # unzip the compressed file, measure time 37 | system.time( 38 | unzip("data/largefile.zip", exdir = "data") 39 | ) 40 | ## user system elapsed 41 | ## 34.380 22.428 193.145 42 | 43 | bigfile <- "data/npidata_20050523-20150809.csv" 44 | file.info(bigfile) # file info (not all shown) 45 | ## size: 5647444347 46 | ``` 47 | 48 | \noindent The above code illustrates how R can be used to download, 49 | unzip and present information on a giant .csv file, in completely reproducible workflow. 50 | Note that it's 5.6 GB in size and took over 3 minutes to unzip! 51 | The following code requires a 64 bit R installation and will not work on 52 | many laptops.^[Reading 53 | the 5.6 GB file would also fail on many desktops. 54 | The file took over 15 minutes using `read.csv` and half that using `read_csv` from the **readr** package on a Fourth Generation Intel i7 desktop with 64 GB RAM. 55 | Half of this RAM became occupied by R! 56 | ] 57 | 58 | ```{r, eval=FALSE} 59 | system.time(df <- read.csv(bigfile)) 60 | ## Error (from 32 bit machine): cannot allocate vector of 32.0 Mb 61 | ``` 62 | 63 | \noindent There are ways to better handle such large datasets such as using faster read-in functions such as `read_csv()` from the **readr** package. For now, just remember that reading large datasets into R can be tricky and time-consuming. Preprocessing outside R, as illustrated below, can help. 64 | 65 | # Splitting files with Unix tools 66 | 67 | The Unix utility **split** can be used to split large files, like the one we tried to load above, into chunks based on size or number of lines. The following bash commands will split the 68 | 5.6 GB file, downloaded and unzipped in the previous section, into chunks of 100 MB 69 | each:^['Bash commands' 70 | refer to computer code written in the Bash language. Bash is the default 71 | language used by Linux and Macs for most internal system administration 72 | functions. In Macs, you can open the Bash terminal by typing 'Apple key'-T. In 73 | Windows installing [cygwin](https://www.cygwin.com/) and launching it 74 | will provide access to this functionality. Note: you must start from the correct 75 | *working directory* --- `pwd` in Bash or `setwd()` in R can be used to check this.] 76 | 77 | ```{r, engine='bash', eval=FALSE} 78 | cd data # change directory 79 | split -b100m npidata_20050523-20150809.csv 80 | ``` 81 | 82 | Assuming there is sufficient 83 | disk space, the output of the above operation should be several 100 MB text 84 | files: more manageable. These files are named `aa`, `ab` etc. 85 | A sample from the results of this operation can be found in the 86 | `data` folder. This was saved using commands. 
87 | 88 | ```{r, engine='bash', eval=FALSE} 89 | split -l 10 aa mini # further split chunk 'aa' into 10 lines 90 | cp miniaa ../data # copy the first into 'sample-data' 91 | ``` 92 | 93 | \noindent Now the file is much smaller and easy to read: finally we can 94 | read (part of) a 5.6 GB dataset into R using a puny laptop! 95 | 96 | ```{r} 97 | library(readr) 98 | npi <- read_csv("data/miniaa") 99 | dim(npi) # what are the dimensions of this dataset? 100 | head(npi[c(1, 37)], 3) # view a sample of the data 101 | ``` 102 | 103 | One of the great things about data analysis using command-line tools is that the same techniques that work on a dataset with 10 lines of code will also work on a dataset of 10 million rows, providing you have the right hardware and efficient implementation. 104 | Therefore trying 'dry runs' on small subsets of your data before the main analysis is a very good idea. 105 | The next challenge uses this principle to test your understanding of preprocessing files outside R, without relying on loading the data into R. 106 | 107 | > **Challenge (advanced):** Try to further split the csv file saved in `data/miniaa` into chunks called tinyaa, tinyab etc, with only 3 lines of code each. 108 | Use a method external to R (e.g. `split` if you use Unix) without loading the data into RAM. 109 | How many `tiny*` files result? 110 | 111 | ```{r, engine='bash', eval=FALSE, echo=FALSE} 112 | split -l 3 miniaa tiny 113 | ``` 114 | 115 | # Filtering with csvkit 116 | 117 | [csvkit](https://csvkit.readthedocs.org/en/latest/) is a command-line program 118 | for handling large .csv files, without having to read them all into RAM... 119 | 120 | Using the NPI data, the following 121 | [example](https://opendata.stackexchange.com/questions/1256/how-can-i-work-with-a-4gb-csv-file) 122 | illustrates how csvkit can be used to extract useful information from 123 | bulky .csv files before loading the results into R. 124 | 125 | ```{r, echo=FALSE} 126 | # Preprocessing with the LaF package 127 | ``` 128 | 129 | \chapter{Loading data into R} 130 | 131 | # An introduction to file formats 132 | 133 | 134 | 135 | 136 | # Loading static files 137 | 138 | Datasets are increasingly becoming continuously 139 | collected, making them well-suited to databases and other continuous 140 | systems that 'ingest' data in real-time. However, static files are still 141 | probably the most common way to access large datasets and probably will 142 | continue to be so into the future. 143 | 144 | This chapter looks at various file-types that are used for storing 145 | large datasets and how R can be used to optimised their read-in. 146 | The most common, simple and in many cases convenient file-type for large datasets 147 | are *plain text* files, so we look at reading these in first, before 148 | exploring more exotic file-types including, `.json`, `.xml`, `.spss`, `.stata`, `.xls`. 149 | 150 | # Text files 151 | 152 | Data stored as text files are files that are human-readable when displayed 153 | in a 'plain text' editor such as Microsoft Notepad, Vim or R Studio. 154 | Plain text files are the basis of computing.^[Most 155 | programs can be represented as large collections of scripts written in 156 | plain text. R Studio, 157 | for example, is written in 100s of lines of plain text files, all of which 158 | can be viewed on-line 159 | (see [github.com/rstudio/rstudio](https://github.com/rstudio/rstudio)). 160 | This tutorial was written as a UTF-8 encoded plain text '.Rmd' file. 
161 | ] 162 | The advantages of plain text files are: 163 | 164 | - Simplicity: quick and easy to understand their contents 165 | - Compatibility: text files work with most software packages 166 | - Portability: text files are quick and easy to load, save and share 167 | 168 | The disadvantages of plain text files for Big Data are that they can become 169 | unwieldy, even when compressed (remember the 5.6 GB file from the introduction), 170 | and their ease of modification: text files are certainly not a highly secure 171 | data format. 172 | 173 | The most common format of text file is the trusty .csv file, in which 174 | each column is separated by a comma.^[Note that text strings such as 175 | `"speed"` are enclosed in quote marks whereas raw numbers are not. 176 | ] 177 | 178 | 179 | ```{r} 180 | write.csv(x = cars[1:3,]) # write a .csv file to the screen 181 | ``` 182 | 183 | ```{r, echo=FALSE, eval=FALSE} 184 | write.csv(x = cars[1:3,], "data/minicars.csv") # save to file 185 | ``` 186 | 187 | > **Challenge**: Save a .csv file of the full 'cars' dataset and open it with a plain text editor. 188 | 189 | It is important to note that R has its own *binary* data format which minimises the file space occupied by large static datasets. 190 | These can be read and written using the `save()` and `load()` commands, which save the names and contents of multiple R objects into a single file. 191 | We recommend using `saveRDS()` and `readRDS()` instead because they are more flexible, allowing the loaded datasets to be given any name. 192 | To save and re-load the subsetted `cars` dataset, for example, we could use the following code: 193 | 194 | ```{r} 195 | saveRDS(object = cars[1:3,], file = "data/minicars.Rds") 196 | cars_mini <- readRDS("data/minicars.Rds") 197 | ``` 198 | 199 | Note that the Rdata version of the same data is a third the size of the .csv version: 200 | 201 | ```{r} 202 | # Report the size of a file from within R using file.size() 203 | file.size("data/minicars.Rds") / 204 | file.size("data/minicars.csv") 205 | ``` 206 | 207 | Often the benefits of being able to see the data without reading it into R may outweigh the cost of additional hard-disc space.^[The bash command `head data/minicars.csv`, for example, instantly show the top 10 rows of the file, regardless of how large the dataset is, without needing to read it into RAM. 208 | An additional advantage of .csv files over .Rds files is that they display correctly on GitHub. 209 | ] 210 | 211 | # Freeing your data from spreadsheets 212 | 213 | Spreadsheets are ubiquitous in offices around the world, and are used for 214 | storing millions of (mostly quite small) datasets. Nevertheless 215 | Microsoft Excel, the most commonly used spreadsheet program can store 216 | datasets with a maximum size of 1,048,576 rows by 16,384 columns. 217 | 218 | There many packages designed for reading spreadsheet files into R, most 219 | of which are of variable reliability. The best of these is 220 | **readxl**, which was found to be much faster 221 | than alternatives from **gdata** and **openxlsx** 222 | packages!^[As 223 | an important aside, this example illustrates the importance of selecting the 224 | *right package*, in addition to the right function and implementation, 225 | for handling large datasets.] 
226 | 227 | ```{r} 228 | f <- "data/CAIT_Country_GHG_Emissions_-_All_Data.xlsx" 229 | system.time(df <- readxl::read_excel(f, sheet = 4)) 230 | ``` 231 | 232 | > **Optional challenge:** To brush-up on your benchmarking skills, run tests 233 | to load the same data into R using alternative packages. Which comes closest 234 | to `read_excel()`? Are the results identical? 235 | 236 | ```{r, echo=FALSE, eval=FALSE} 237 | xls_pkgs <- c("gdata", "openxlsx", "reaODS") 238 | # install.packages(xls_pkgs) # install packages if they're not already 239 | # This took less than 0.1 seconds 240 | system.time(df <- readxl::read_excel(f, sheet = 4)) 241 | # This took over 1 minute 242 | system.time(df1 <- gdata::read.xls(f, sheet = 4)) 243 | # This took 20 seconds 244 | system.time(df2 <- openxlsx::read.xlsx(f, sheet = 4)) 245 | 246 | # After saving the spreadsheet to .odt (not included) - took more than 1 minute 247 | system.time(df3 <- readODS::read.ods("data/CAIT_Country_GHG_Emissions_-_All_Data.ods", sheet = 4)) 248 | 249 | head(df[1:5]) 250 | head(df1[1:5]) 251 | head(df2[1:5]) 252 | head(df3[1:5]) 253 | ``` 254 | 255 | To share this dataset with others, it makes sense to save it in a non-proprietary format. 256 | Play with the following commands and see which data format is smallest. 257 | 258 | ```{r} 259 | write.csv(df, "data/ghg-ems.csv") 260 | saveRDS(df, "data/ghg-ems.Rds") 261 | ``` 262 | 263 | Using `file.size()`, we can ascertain that we've made huge space savings by freeing this dataset from a spreadsheet. The .csv version is `r round(file.size("data/CAIT_Country_GHG_Emissions_-_All_Data.xlsx") / file.size("data/ghg-ems.csv"))` times smaller than the original and the .Rds version is `r round(file.size("data/CAIT_Country_GHG_Emissions_-_All_Data.xlsx") / file.size("data/ghg-ems.Rds"))` times smaller. 264 | These space savings will make a substantial difference to your system resources when dealing with larger datasets gleaned from spreadsheets. 265 | 266 | # Batch loading of disparate datasets 267 | 268 | Sometimes data is made available as a series of disparate files. 269 | This is especially likely for historic datasets, when hard disks were smaller. 270 | In such cases it is useful to load these dataset in a *batch process*, which loads all the files iteratively into a single dataset. 271 | Building on the 'NPI' data introduced previously, and its subsets `tinyaa` to `tinyad`, the following code loads these four files into a single dataset. 272 | 273 | ```{r} 274 | batch_files <- list.files(path = "data", pattern = "tiny", full.names = T) 275 | b <- read_csv(batch_files[1]) 276 | b[1:3] 277 | for(i in batch_files[-1]){ 278 | new <- read_csv(i, col_names = F) 279 | names(new) <- names(b) 280 | b <- rbind(b, new) 281 | } 282 | dim(b) # is this the same as the previously loaded npi object? 283 | ``` 284 | 285 | The above code works to load in the column names and the 9 rows of data from the `miniaa` dataset described above, by looping through the four files that we split it into. 286 | Note that this is not a *computationally efficient* way to batch load datasets from disparate sources into R, however. 287 | This is because it requires creating many new objects and binding them. 288 | Also, because the end file size is unknown, the code is liable to cause R to crash by using up all available memory on large datasets. 289 | 290 | To overcome these issues, R's batch execution mode may be useful. 291 | Type `?BATCH` to view documentation on this. 
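A more memory-friendly pattern is to read the remaining chunks into a list and bind everything once at the end, rather than growing `b` with `rbind()` on every iteration. The following sketch assumes, as above, that all files share the column layout of the first one:

```{r, eval=FALSE}
pieces <- lapply(batch_files[-1], read_csv, col_names = names(b))
b2 <- do.call(rbind, c(list(b), pieces)) # one bind instead of one per file
dim(b2)                                  # should match the result of the loop above
```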
292 | An additional resource, not described here is [**BatchJobs**](https://cran.r-project.org/web/packages/BatchJobs/index.html). 293 | This is an R package designed for batch processing of large datasets. 294 | 295 | 296 | 297 | -------------------------------------------------------------------------------- /04-Manipulating.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: pdf_document 3 | --- 4 | 5 | \chapter{Manipulating Big Data} 6 | 7 | ```{r, echo=FALSE} 8 | library("xtable") 9 | library("tidyr") 10 | library("readr") 11 | ``` 12 | 13 | 14 | # Tidying data with tidyr 15 | 16 | A key skill in data analysis is understanding the 'shape' of datasets and being able to 'reshape' them. 17 | An example of the various shapes that the same datasets can assume is provided by @tidy-data and illustrated in Tables \ref{Tpew} and \ref{Tpewt}. 18 | 19 | ```{r, echo=FALSE, eval=FALSE} 20 | # Download data from its original source - an academic paper 21 | downloader::download("http://www.jstatsoft.org/v59/i10/supp/4", destfile = "v59i10-data.zip") 22 | # The source code associated with the paper 23 | downloader::download("http://www.jstatsoft.org/v59/i10/supp/3", destfile = "data/reshape/v59i10.R") 24 | # After running the R script... 25 | dir.create("data/reshape") 26 | unzip("v59i10-data.zip", exdir = "data/reshape/") 27 | # write.csv(raw, "data/reshape-pew.csv") 28 | ``` 29 | 30 | ```{r, echo=FALSE, eval=FALSE} 31 | raw <- read_csv("data/reshape-pew.csv") 32 | raw <- raw[-c(1,ncol(raw))] # remove excess cols 33 | names(raw) <- c("religion", "<$10k", "$10--20k", "$20--30k", "$30--40k", "$40--50k", 34 | "$50--75k", "$75--100k", "$100--150k", ">150k") 35 | write_csv(raw, "data/pew.csv") 36 | print.xtable(xtable(raw[1:3,1:4], caption = "First 6 rows of the aggregated 'pew' dataset from Wickham (2014a) in an 'untidy' form.", include.rownames = F), comment = FALSE, include.rownames = F) 37 | rawt <- gather(raw, Income, Count, -religion) 38 | head(rawt) 39 | tail(rawt) 40 | rawt$Count <- as.character(rawt$Count) 41 | rawt$Income <- as.character(rawt$Income) 42 | rawtp <- rawt[c(1:3, nrow(rawt)),] 43 | 44 | insertRow <- function(existingDF, newrow, r) { 45 | existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),] 46 | existingDF[r,] <- newrow 47 | existingDF 48 | } 49 | 50 | rawtp <- insertRow(existingDF = rawtp, newrow = rep("...", 3), r = 4) 51 | xtable(rawtp) 52 | ``` 53 | 54 | \begin{margintable} 55 | \centering 56 | \begin{tabular}{@{}llll@{}} 57 | \toprule 58 | Religion & $<$\$10k & \$10--20k & \$20--30k \\ 59 | \midrule 60 | Agnostic & 27 & 34 & 60 \\ 61 | Atheist & 12 & 27 & 37 \\ 62 | Buddhist & 27 & 21 & 30 \\ 63 | \bottomrule 64 | \end{tabular} 65 | \vspace{0.2cm} 66 | \caption{First 3 rows and 4 columns of the aggregated 'Pew' dataset from Wickham (2014a) in an 'untidy' form.}\label{Tpew} 67 | \vspace{2cm} 68 | \end{margintable} 69 | 70 | 71 | \begin{margintable} 72 | \centering 73 | \begin{tabular}{@{}lll@{}} 74 | \toprule 75 | Religion & Income & Count \\ 76 | \midrule 77 | Agnostic & $<$\$10k & 27 \\ 78 | Atheist & $<$\$10k & 12 \\ 79 | Buddhist & $<$\$10k & 27 \\ 80 | ... & ... & ... \\ 81 | Unaffiliated & $>$150k & 258 \\ 82 | \bottomrule 83 | \end{tabular} 84 | \vspace{0.2cm} 85 | \caption{First 3 and last rows of the 'tidied' Pew dataset.}\label{Tpewt} 86 | \end{margintable} 87 | 88 | These tables may look very different, but they contain precisely the same data. 
89 | They have been reshaped, such that column names in the 'flat' form in Table \ref{Tpew} became a new variable in the 'long' form in Table \ref{Tpewt}. 90 | According to the concept of 'tidy data' [@tidy-data], the long form is correct. 91 | Note that 'correct' here is used in the context of data analysis and graphical visualisation. 92 | For tabular presentation (i.e. tables) the 'wide' or 'untidy' form may be better. 93 | 94 | Tidy data has the following characteristics [quoting @tidy-data]: 95 | 96 | 1. Each variable forms a column. 97 | 2. Each observation forms a row. 98 | 3. Each type of observational unit forms a table. 99 | 100 | Because there is only one observational unit in the example (religions), it can be described in a single table. 101 | Large and complex datasets are usually represented by multiple tables, with unique identifiers or 'keys' to join them together [@Codd1979]. 102 | Being able to manipulate your data into a tidy and relational form is important for Big Data work because this form minimises data duplication and facilitates fast 103 | code.^[Because 104 | R is a vectorised language, it adept at handling long 1 dimensional vectors but less so at handling many interrelated 105 | columns.] 106 | Due to the importance of tidying data, an entire package, aptly named **tidyr** has been developed for the purpose. 107 | Two common operations needed to tidy datasets are reshaping and splitting. 108 | Each of these has its own **tidyr** function: 109 | 110 | - 'Wide' tables can become 'long', so that column names become a new variable. This is illustrated in Tables \ref{Tpew} to \ref{Tpewt} and can be achieved with the function 111 | `gather`:^[Note 112 | that the dimensions of the data change from having 10 observations across 18 columns to 162 rows in only 3 columns. 113 | Note that when we print the object `rawt[1:3,]`, the class of each variable is given 114 | (`chr`, `fctr`, `int` refer to character, factor and integer classes, respectively). 115 | This is because `read_csv` uses the `tbl` class from the **dplyr** package (described below). 116 | ] 117 | 118 | ```{r} 119 | raw <- read_csv("data/pew.csv") # read in the 'wide' dataset 120 | dim(raw) 121 | rawt <- gather(raw, Income, Count, -religion) 122 | dim(rawt) 123 | rawt[1:3,] 124 | ``` 125 | 126 | - Splitting compound variables in two. A classic example is age-sex variables (e.g. `m0-10` and `f0-15` to represent males and females in the 0 to 10 age band). 
Splitting such variables can be done with `split`: 127 | 128 | ```{r} 129 | agesex <- c("m0-10", "f0-10") # create compound variable 130 | n <- c(3, 5) # create a value for each observation 131 | df <- data.frame(agesex, n) # create a data frame 132 | separate(df, agesex, c("sex", "age"), 1) 133 | ``` 134 | 135 | ```{r, echo=FALSE, eval=FALSE} 136 | # generate latex for presentation 137 | kable(df, format = "latex") 138 | kable(separate(df, agesex, c("sex", "age"), 1), format = "latex") 139 | ``` 140 | 141 | \begin{margintable} 142 | \centering 143 | \begin{tabular}{@{}lll@{}} 144 | \toprule 145 | agesex & n\\ 146 | \midrule 147 | m0-10 & 3\\ 148 | f0-10 & 5\\ 149 | \bottomrule 150 | \end{tabular} 151 | \vspace{0.2cm} 152 | \caption{Dataset in which age and sex are conflated into a sing variable, 'agesex'.} 153 | \label{Tagesex} 154 | \vspace{1cm} 155 | \end{margintable} 156 | 157 | \begin{margintable} 158 | \centering 159 | \begin{tabular}{@{}lll@{}} 160 | \toprule 161 | sex & age & n\\ 162 | \midrule 163 | m & 0-10 & 3\\ 164 | f & 0-10 & 5\\ 165 | \bottomrule 166 | \end{tabular} 167 | \vspace{0.2cm} 168 | \caption{Data frame after the 'agesex' variable has been split into age and sex.}\label{Tsep} 169 | \end{margintable} 170 | 171 | \noindent Note there are other tidying operations that **tidyr** can performed in addition to the two described in this section. 172 | These are described in the `tidy-data` vignette, which can be called by entering `vignette("tidy-data")` once the package has been installed. 173 | Moreover, data manipulation and cleaning is a Big topic that extends far beyond the **tidyr** approach and about which much has been written [e.g. @Spector2008]. 174 | 175 | # Filtering columns 176 | 177 | Often Big Data contains much worthless or blank information. 178 | An example of this is provided in the huge 'NPI' dataset presented in the introduction. 179 | Being able to focus quickly only on the variables of interest becomes especially important when handling large datasets. 180 | 181 | Imagine that the tiny subset of the 'NPI' data, created using Unix tools in [Chapter 3](#pre), is the full 5+ GB file. 182 | We are running a machine of the future, is powerful enough to load the data in a fraction of a second, not the 15 minutes that it took a desktop in 2015. 183 | 184 | ```{r} 185 | df <- read_csv("data/miniaa") # load imaginary large data 186 | dim(df) 187 | ``` 188 | 189 | \noindent Note that the data frame has 329 columns (and imagine it has 4 million+ rows as the original does). 190 | That's a lot of variables. Do we need them all? 191 | It's worth taking a glimpse at this dataset to find out: 192 | 193 | ```{r, eval=FALSE} 194 | glimpse(df) 195 | ``` 196 | 197 | ``` 198 | # $ NPI (int) 1679576722, ... 199 | # $ Entity Type Code (int) 1, 1, 2, ... 200 | # $ Replacement NPI (lgl) NA, NA, NA, ... 201 | # ... 202 | ``` 203 | 204 | \noindent Looking at the output, it becomes clear that the majority of the variables only contain `NA`. 205 | To clean the giant dataset, removing the empty columns, we need to identify which these variables are. 206 | 207 | ```{r} 208 | # Identify the variable which are all NA 209 | all_na <- sapply(df, function(x) all(is.na(x))) 210 | summary(all_na) # summary of the results 211 | df <- df[!all_na] # subset the dataframe 212 | ``` 213 | 214 | \noindent The new `df` object has fewer than a third of the original columns. 
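For reference, the same column filter can be written as a single equivalent line with base R's `Filter()`, which keeps only the columns for which the predicate returns `TRUE`:

```{r, eval=FALSE}
df <- Filter(function(x) !all(is.na(x)), df) # drop columns that are entirely NA
```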
215 | 216 | > **Challenge:** find out how much space was saved by the above operation using `object.size()` 217 | 218 | ```{r, include=FALSE} 219 | object.size(df) / 220 | object.size(read_csv("data/miniaa")) 221 | ``` 222 | 223 | # Data aggregation 224 | 225 | Data aggregation is the process of creating summaries of data based on a grouping variable. 226 | The end result usually has the same number of rows as there are groups. 227 | Because aggregation is a way of condensing datasets it can be a very useful technique for making sense of large datasets. 228 | The following code finds the average emissions per country (country being the grouping variable) from the 'GHG' dataset rescued from a spreadsheet and converted into a .csv file in the previous chapter. 229 | 230 | ```{r, warning=FALSE} 231 | df <- read_csv("data/ghg-ems.csv") 232 | names(df) 233 | nrow(df) 234 | length(unique(df$Country)) 235 | ``` 236 | 237 | > **Challenge:** rename the variables 4 to 8 so they are much shorter, following the pattern `ECO2`, `MCO2` etc. That will make the code for manipulating the dataset easier to write 238 | 239 | ```{r, echo=FALSE} 240 | names(df)[4:8] <- c("ECO2", "MCO2", "TCO2", "OCO2", "FCO2") 241 | ``` 242 | 243 | \noindent After the variable names have been updated, we can aggregate.^[Note the first argument in the function is the vector we're aiming to aggregate and the second is the grouping variable (in this case Countries). 244 | A quirk of R is that the grouping variable must be supplied as a list. 245 | Next we'll see a way of writing this that is neater.] 246 | 247 | ```{r} 248 | e_ems <- aggregate(df$ECO2, list(df$Country), mean, na.rm = T) 249 | nrow(e_ems) 250 | ``` 251 | 252 | Note that the resulting data frame has the same number of rows as there are countries: 253 | the aggregation has successfully reduced the number of rows we need to deal with. 254 | Now it is easier to find out per-country statistics, such as the three lowest emitters from electricity production: 255 | 256 | ```{r} 257 | head(e_ems[order(e_ems$x),], 3) 258 | ``` 259 | 260 | \noindent Another way to specify the `by` argument is with the tilde (`~`). 261 | The following command creates the same object as `e_ems`, but with less typing. 262 | 263 | ```{r} 264 | e_ems <- aggregate(ECO2 ~ Country, df, mean, na.rm = T) 265 | ``` 266 | 267 | The final way to aggregate the dataset uses a totally different syntax, from the **dplyr** package. 268 | Without worrying exactly how it works (this is described in the next section), try the following. 269 | 270 | ```{r} 271 | library(dplyr) 272 | e_ems <- group_by(df, Country) %>% 273 | summarise(mean_eco2 =mean(ECO2, na.rm = T)) 274 | e_ems 275 | ``` 276 | 277 | # dplyr 278 | 279 | **dplyr** has been designed to make data analysis 280 | fast and intuitive. 281 | **dplyr** works best with tidy data, as described above. 282 | Indeed, the two packages were designed to work closely together: **tidyr** creates tidy datasets, **dplyr** analyses 283 | them.^[As 284 | an interesting aside, **dplyr** works perfectly on `data.frames` but its default object is the `tbl`, which 285 | is identical to a `data.frame` but prints 286 | objects more intuitively.] 
287 | 288 | ```{r, message=FALSE, results='hide'} 289 | library(readr) 290 | idata <- read.csv("data/world-bank-ineq.csv") 291 | idata <- tbl_df(idata) # convert the dataset to tbl class 292 | idata # print the dataset in the dplyr way 293 | ``` 294 | 295 | **dplyr** is much faster than base implementations of various 296 | operations, but it has the potential to be even faster, as 297 | *parallelisation* is 298 | [planned](https://github.com/hadley/dplyr/issues/145). 299 | 300 | You should not be expecting to learn the **dplyr** package in one sitting: 301 | the package is large and can be seen as 302 | an entirely new language, to supplement R's, 303 | in its own right. Following the 'walk before you run' principle, 304 | we'll start simple, by replicating the subsetting 305 | and grouping operations undertaken in base R above. 306 | 307 | First, we'll do a little 'data carpentry', and rename the first column using the extremely useful **dplyr** function 308 | `rename()`.^[Note 309 | in this code block the variable name is surrounded by back-quotes (`). 310 | This allows R to refer to column names that are non-standard. 311 | Note also the syntax: 312 | `rename` takes the `data.frame` as the first object and then creates new variables by specifying `new_variable_name = original_name`.] 313 | 314 | ```{r} 315 | idata <- rename(idata, Country = `Country.Name`) 316 | ``` 317 | 318 | The standard way to subset data by rows in R is with square brackets, for example: 319 | 320 | ```{r} 321 | aus1 <- idata[idata$Country == "Australia",] 322 | ``` 323 | 324 | **dplyr** offers an alternative and more flexible way of filtering data, using `filter()`. 325 | 326 | ```{r} 327 | aus2 <- filter(idata, Country == "Australia") 328 | ``` 329 | 330 | Note that we did not need to use the `$` to tell R 331 | that `Country` is a variable of the `idata` object. 332 | Because `idata` was the first argument, **dplyr** 'knew' 333 | that any subsequent names would be variables.^[Note that this syntax is a defining feature of **dplyr** 334 | and many of its functions work in the same way. 335 | Later we'll learn how this syntax can be used alongside the `%>%` 'pipe' command to write clear data manipulation commands. 336 | ] 337 | 338 | \noindent The **dplyr** equivalent of aggregate is to use 339 | the grouping function `group_by` in combination with 340 | the general purpose function `summarise` (not to 341 | be confused with `summary` in base R). 342 | 343 | ```{r} 344 | names(idata)[5:9] <- 345 | c("top10", "bot10", "gini", "b40_cons", "gdp_percap") 346 | ``` 347 | 348 | \noindent The *class* of R objects is critical to how it performs. 349 | If a class is incorrectly specified (if numbers are treated 350 | as factors, for example), R will likely generate error messages. 351 | Try typing `mean(idata$gini)`, for example. 352 | 353 | We can re-assign the classes of the numeric variables 354 | one-by one: 355 | 356 | ```{r} 357 | idata$gini <- as.numeric(as.character(idata$gini)) 358 | mean(idata$gini, na.rm = TRUE) # now the mean is calculated 359 | ``` 360 | 361 | \noindent However, the purpose of programming languages is to *automate* 362 | arduous tasks and reduce typing. 
The following command 363 | re-classifies all of the numeric variables using 364 | the `apply` function (we'll seem more of `apply`'s relatives 365 | later): 366 | 367 | ```{r, warning=FALSE} 368 | idata[5:9] <- apply(idata[5:9], 2, 369 | function(x) as.numeric(as.character(x))) 370 | ``` 371 | 372 | ```{r} 373 | countries <- group_by(idata, Country) 374 | summarise(countries, gini = mean(gini, na.rm = T)) 375 | ``` 376 | 377 | \noindent Note that `summarise` is highly versatile, and can 378 | be used to return a customised range of summary statistics: 379 | 380 | ```{r tidy=FALSE} 381 | summarise(countries, 382 | # number of rows per country 383 | obs = n(), 384 | med_t10 = median(top10, na.rm = T), 385 | # standard deviation 386 | sdev = sd(gini, na.rm = T), 387 | # number with gini > 30 388 | n30 = sum(gini > 30, na.rm = T), 389 | sdn30 = sd(gini[ gini > 30 ], na.rm = T), 390 | # range 391 | dif = max(gini, na.rm = T) - min(gini, na.rm = T) 392 | ) 393 | ``` 394 | 395 | \noindent To showcase the power of `summarise` used on 396 | a `grouped_df`, the 397 | above code reports a wide range of customised 398 | summary statistics 399 | *per country*: 400 | 401 | - the number of rows in each country group 402 | - standard deviation of gini indices 403 | - median proportion of income earned by the top 10% 404 | - the number of years in which the gini index was greater than 30 405 | - the standard deviation of gini index values over 30 406 | - the range of gini index values reported for each country. 407 | 408 | > **Challenge**: explore the **dplyr**'s documentation, starting with the introductory vignette, accessed by entering `vignette("introduction")` and test out its capabilities on the `idata` dataset. (More vignette names can be discovered by typing `vignette(package = "dplyr")`) 409 | 410 | # Chaining operations with dplyr 411 | 412 | Another interesting feature of **dplyr** is its ability 413 | to chain operations together. This overcomes one of the 414 | aesthetic issues with R code: you can end end-up with 415 | very long commands with many functions nested inside each 416 | other to answer relatively simple questions. 417 | 418 | > What were, on average, the 5 most unequal 419 | years for countries containing the letter g? 420 | 421 | Here's how chains work to organise the analysis in a 422 | logical step-by-step manner: 423 | 424 | ```{r tidy=FALSE} 425 | idata %>% 426 | filter(grepl("g", Country)) %>% 427 | group_by(Year) %>% 428 | summarise(gini = mean(gini, na.rm = T)) %>% 429 | arrange(desc(gini)) %>% 430 | top_n(n = 5) 431 | ``` 432 | 433 | The above function consists of 6 stages, each of which 434 | corresponds to a new line and **dplyr** function: 435 | 436 | 1. Filter-out the countries we're interested in (any selection criteria could be used in place of `grepl("g", Country)`). 437 | 2. Group the output by year. 438 | 3. Summarise, for each year, the mean gini index. 439 | 4. Arrange the results by average gini index 440 | 5. Select only the top 5 most unequal years. 441 | 442 | To see why this method is preferable to the nested 443 | function approach, take a look at the latter. 444 | Even after indenting properly it looks terrible 445 | and is almost impossible to understand! 
446 | 447 | ```{r tidy=FALSE} 448 | top_n( 449 | arrange( 450 | summarise( 451 | group_by( 452 | filter(idata, grepl("g", Country)), 453 | Year), 454 | gini = mean(gini, na.rm = T)), 455 | desc(gini)), 456 | n = 5) 457 | ``` 458 | 459 | Of course, you *could* write code in base R to 460 | undertake the above analysis but for many 461 | people the **dplyr** approach is the most agreeable to write. 462 | 463 | -------------------------------------------------------------------------------- /05-Loading-dbs.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: pdf_document 3 | --- 4 | 11 | -------------------------------------------------------------------------------- /06-Visualising.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: pdf_document 3 | --- 4 | 5 | ```{r echo=FALSE} 6 | library("grid") 7 | library("png") 8 | dpi=300 9 | ``` 10 | 11 | \chapter{Visualisation} 12 | 13 | # Introduction to ggplot2 14 | 15 | \texttt{ggplot2} is a bit different from other graphics packages. It roughly follows 16 | the \textit{philosophy} of Wilkinson, 1999. Essentially, we 17 | think about plots as layers. By thinking of graphics in terms of layers it is 18 | easier for the user to iteratively add new components and for a developer to add 19 | new functionality. 20 | \begin{marginfigure} 21 | \centering 22 | \includegraphics[]{figures/ch6_f1.png} 23 | \caption{A scatter plot of engine displacement vs average city miles per gallon. 24 | The coloured points correspond to different cylinder sizes. The plot was 25 | constructed using \texttt{base} graphics.}\label{F6.1} 26 | \end{marginfigure} 27 | 28 | \subsection*{Example: the mpg data set} 29 | 30 | The \texttt{mpg} data set comes with the \texttt{ggplot2} package and can be using loaded in the usual way 31 | ```{r} 32 | data(mpg, package="ggplot2") 33 | ``` 34 | \noindent This data set contains statistics on $234$ cars. If we want to use base graphics to plot the engine displacement against city miles per gallon, and also colour the points by the number of cylinders, we would try something like 35 | 36 | ```{r, echo=2, message=FALSE, results="hide"} 37 | png("figures/ch6_f1.png", width=5*dpi, height=5*dpi, res=dpi) 38 | plot(mpg$displ, mpg$cty, col=mpg$cyl) 39 | dev.off() 40 | ``` 41 | 42 | \noindent to get figure \ref{F6.1}. Let's now consider the equivalent \texttt{ggplot2} graphic - figure \ref{F6.2}. After loading the necessary package 43 | 44 | ```{r message=FALSE} 45 | library("ggplot2") 46 | ``` 47 | 48 | \noindent figure \ref{F6.2} is generated using the following code 49 | 50 | ```{r fig.keep='none', cache=TRUE, echo=2:3} 51 | png("figures/ch6_2.png", width=5*dpi, height=5*dpi, res=dpi) 52 | g = ggplot(data=mpg, aes(x=displ, y=cty)) 53 | g + geom_point(aes(colour=factor(cyl))) 54 | sink=dev.off() 55 | ``` 56 | 57 | \noindent The \texttt{ggplot2} code is fundamentally different from the \texttt{base} code. The 58 | \texttt{ggplot} function sets the default data set, and attributes called 59 | \textbf{aesthetics}. The aesthetics are properties that are perceived on the 60 | graphic. A particular aesthetic can be mapped to a variable or set to a constant 61 | value. In figure \ref{F6.2}, the variable \texttt{displ} is mapped to the x-axis and 62 | \texttt{cty} variable is mapped to the y-axis. 
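The difference between *mapping* an aesthetic to a variable and *setting* it to a constant value is worth a short illustration (a sketch only; the output is not reproduced as a figure here):

```{r, eval=FALSE}
g + geom_point(aes(colour = factor(cyl))) # mapped: colour varies with cyl, so a legend is drawn
g + geom_point(colour = "blue")           # set: every point is blue; note this sits outside aes()
```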
63 | 64 | \begin{marginfigure} 65 | \centering 66 | \includegraphics[]{figures/ch6_2.png} 67 | \caption{As figure \ref{F6.1}, but created using \texttt{ggplot2}.}\label{F6.2} 68 | \end{marginfigure} 69 | 70 | The other function, \texttt{geom\_point} adds a layer to the plot. The \texttt{x} and 71 | \texttt{y} variables are inherited (in this case) from the first function, \texttt{ggplot}, and 72 | the colour aesthetic is set to the \texttt{cyl} variable. Other possible aesthetics 73 | are, for example, size, shape and transparency. In figure \ref{F6.2} these 74 | additional aesthetics are left at their default value. 75 | 76 | If instead we changed the `size` aesthetic 77 | 78 | ```{r cache=TRUE, echo=2} 79 | png("figures/ch6_3.png", width=5*dpi, height=5*dpi, res=dpi) 80 | g + geom_point(aes(size=factor(cyl))) 81 | sink=dev.off() 82 | ``` 83 | 84 | \noindent we would get figure \ref{F6.3} where the size of the points vary with `cyl`. Table \ref{T6.1} gives a summary of standard geoms. 85 | 86 | 87 | \begin{table}[t] 88 | \centering 89 | \begin{tabular}{@{}lll@{}} 90 | \toprule 91 | Plot Name & Geom & Base graphic \\ 92 | \midrule 93 | Barchart & bar & \texttt{barplot}\\ 94 | Box-and-whisker & boxplot & \texttt{boxplot}\\ 95 | Histogram & histogram & \texttt{hist} \\ 96 | Line plot & line & \texttt{plot} and \texttt{lines}\\ 97 | Scatter plot & point & \texttt{plot} and \texttt{points}\\ 98 | \bottomrule 99 | \end{tabular} 100 | \caption[4\baselineskip]{Basic \texttt{geom}'s and their corresponding standard plot names.}\label{T6.1} 101 | \end{table} 102 | 103 | 104 | \begin{marginfigure} 105 | \centering 106 | \includegraphics[]{figures/ch6_3.png} 107 | \caption{As figure \ref{F6.2}, but where the size aesthetic depends on 108 | cylinder size.}\label{F6.3} 109 | \end{marginfigure} 110 | 111 | # The bigvis package 112 | 113 | The `bigvis` package provides tools for exploratory data analysis of large datasets ($10-100$ million obs). 114 | The goal is that operations should take less than $5$ seconds on a standard computer, even when the sample size is $100$ million. The package is currently not available on CRAN and needs to be installed directly from github using the `devtools` package 115 | 116 | ```{r eval=FALSE, tidy=FALSE} 117 | devtools::install_github("hadley/bigvis") 118 | ``` 119 | 120 | \noindent If you are using Windows, you will also need to install Rtools. 121 | 122 | Directly visualising raw big data is pointless. It's a waste of time to create a $100$ million point scatter plot, since we would not be able to distinguish between the points. In fact, we are likely to run out of pixels! If you doubt this, compare these two plots 123 | 124 | ```{r fig.keep="none"} 125 | par(mfrow=c(1, 2)) 126 | plot(1, 1,ylab="") 127 | plot(rep(1, 1e3), rep(1, 1e3), ylab="") 128 | ``` 129 | 130 | \noindent Except for some anti-aliasing issues, it's impossible to tell the difference between these two plots. Instead, we need to quickly summarise the data and plot the data in a sensible way. 131 | 132 | Similar to `dplyr`, the `bigvis` package is structured around a few key functions. It provides fast C++ functions to manipulate the data, with the resulting output being handled by standard R functions (but optimised for `ggplot2`). The package also provides a few functions for handling outliers, since when visualising big data outliers may be more of an issue. 
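Before turning to `bigvis` itself, note that the 'summarise first, then plot' idea can already be sketched with plain `ggplot2`, whose `geom_bin2d()` draws counts per rectangular bin rather than individual points. The data below are simulated and purely illustrative:

```{r, eval=FALSE}
big_df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
ggplot(big_df, aes(x, y)) +
  geom_bin2d(bins = 100) # plot bin counts, not a million overlapping points
```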
133 | 134 | 135 | \subsection*{Bin and condense} 136 | 137 | The `bin()` and `condense()` functions are used to get compact summaries of the data. For example, suppose we generate $10^5$ random numbers from the $t$ distribution 138 | ```{r echo=2} 139 | set.seed(1) 140 | x = rt(1e5, 5) 141 | ``` 142 | 143 | \noindent The `bin` and `condense` functions create the binned variable 144 | ```{r message=FALSE} 145 | library("bigvis") 146 | ## Bin in blocks of 0.01 147 | x_sum = condense(bin(x, 0.01)) 148 | ``` 149 | 150 | \subsection*{Smooth} 151 | 152 | After binning you may want to smooth out any rough estimates (similar to kernel density estimation). The `smooth` function smooths out the binned data 153 | 154 | ```{r echo=1:2} 155 | ## h is the binwidth (similar to bin size) 156 | x_smu = smooth(x_sum, h = 5 / 100) 157 | png("figures/ch6_4.png", width=5*dpi, height=5*dpi, res=dpi) 158 | par(mar=c(3,3,2,1), mgp=c(2,0.4,0), tck=-.01, 159 | cex.axis=0.9, las=1) 160 | plot(x_sum, panel.first=grid(), xlim=c(-12, 12), 161 | ylab="Count", pch=21, cex=0.6) 162 | lines(x_smu, col=2, lwd=2) 163 | text(5, 200, "Smoothed line", col=2) 164 | sink=dev.off() 165 | ``` 166 | 167 | \begin{marginfigure} 168 | \centering 169 | \includegraphics[]{figures/ch6_4.png} 170 | \caption{Black points are the binned data. Red line is the smoothed estimate.}\label{F6.4} 171 | \end{marginfigure} 172 | 173 | \noindent Consult the functions `best_h()` and `rmse_cvs()` to get an idea of a good starting binwidth. 174 | 175 | \subsection*{Visualisation} 176 | 177 | The output of the the `condense` and `smooth` functions can be visualised using standard plotting packages. The `bigvis` package also contains an `autoplot` function to quickly visualise results 178 | 179 | \begin{marginfigure} 180 | \centering 181 | \includegraphics[]{figures/ch6_5.png} 182 | \includegraphics[]{figures/ch6_6.png} 183 | \caption{Plots generated using `autoplot`. The bottom graph is the `peeled` version of the data.}\label{F6.5} 184 | \end{marginfigure} 185 | 186 | ```{r echo=2} 187 | png("figures/ch6_5.png", width=5*dpi, height=5*dpi, res=dpi) 188 | autoplot(x_sum) 189 | sink=dev.off() 190 | ``` 191 | 192 | \noindent This can be combined with the handy `peel` function, that (by default) just contains the middle 99% of the data 193 | 194 | ```{r echo=2} 195 | png("figures/ch6_6.png", width=5*dpi, height=5*dpi, res=dpi) 196 | autoplot(peel(x_smu)) 197 | sink=dev.off() 198 | ``` 199 | 200 | 201 | 202 | ## IMDB example 203 | 204 | The internet movie database (IMDB)\sidenote{\url{http://imdb.com/}} is a website devoted to collecting movie data supplied by studios and fans. It claims to be the biggest movie database on the web and is run by Amazon. A version of the data set comes with the `bigvis` package 205 | 206 | ```{r} 207 | data(movies, package="bigvis") 208 | ``` 209 | 210 | \noindent The dataset is a data frame and has `r NCOL(movies)` columns and `r NROW(movies)` rows. 
We create bin versions of the movie length and rating using the `condense/bin` trick 211 | 212 | ```{r tidy=FALSE, message=FALSE} 213 | n_bins = 1e4 214 | bin_data = with(movies, 215 | condense(bin(length, find_width(length, n_bins)), 216 | bin(rating, find_width(rating, n_bins)))) 217 | ``` 218 | 219 | \noindent This data set can then plotted as a heatmap using 220 | 221 | ```{r echo=2} 222 | png("figures/ch6_7.png", width=5*dpi, height=5*dpi, res=dpi) 223 | ggplot(bin_data, aes(length, rating, fill=.count )) + 224 | geom_raster() 225 | sink=dev.off() 226 | ``` 227 | \begin{figure}[t] 228 | \centering 229 | \includegraphics[width=0.5\textwidth]{figures/ch6_7.png}% 230 | \includegraphics[width=0.5\textwidth]{figures/ch6_8.png} 231 | \caption{Movie Rating vs Length. The plot on the right is the peeled version.}\label{F6.6} 232 | \end{figure} 233 | 234 | \noindent The resulting plot isn't helpful, due to a couple of long movies 235 | 236 | ```{r tidy=FALSE} 237 | ## Longer than one day!! 238 | subset(movies[ ,c("title", "length", "rating")], 239 | length > 24*60) 240 | ``` 241 | 242 | \noindent The `ggplot2` package contains a handy function called `last_plot` that allows us to manipulate the last created plot. For this example, we'll manipulate the plot using the `peel` function 243 | 244 | ```{r echo=1, fig.keep="none"} 245 | last_plot() %+% peel(bin_data) 246 | png("figures/ch6_8.png", width=5*dpi, height=5*dpi, res=dpi) 247 | ggplot(data=peel(bin_data), aes(length, rating, fill=.count )) + 248 | geom_raster() 249 | sink=dev.off() 250 | ``` 251 | 252 | \noindent to get a better visualisation. The associated paper 253 | \begin{center} 254 | \url{http://vita.had.co.nz/papers/bigvis.pdf} 255 | \end{center} 256 | \noindent provides a good introduction to the key ideas. 257 | 258 | # Tableplots: the tabplot package 259 | 260 | Tableplots are a visualisation technique that can be used to explore and analyse large data sets. These plots can be used to explore variable relationships and check data quality. Tableplots can visualise multivariate datasets with several variables and a large number of records. The `tabplot` package provides has an `ffdf` interface. 261 | \begin{marginfigure} 262 | \centering 263 | \includegraphics[]{figures/ch6_9.png} 264 | \caption{Tableplot of the movie dataset.}\label{F6.5} 265 | \end{marginfigure} 266 | 267 | In a tableplot, numeric variables are plotted as histograms of the mean values while for categorical variable, stacked bar charts are used to show category proportions. Missing values are also highlighted. 268 | 269 | Since `tableplot` can not handle character columns, when plotting we'll remove the first column and for presentation just select three columns 270 | 271 | 272 | ```{r, echo=c(1, 3), fig.keep="none", message=FALSE} 273 | library("tabplot") 274 | png("figures/ch6_9.png", width=5*dpi, height=5*dpi, res=dpi) 275 | tableplot(movies[,3:5]) 276 | sink=dev.off() 277 | ``` 278 | 279 | \noindent By default, the first column is sorted, but this can be altered using the `sortCol` argument 280 | 281 | ```{r fig.keep="none", message=FALSE, warning=FALSE} 282 | tableplot(movies[,3:5], sortCol = 3) 283 | ``` 284 | 285 | \noindent It is also possible to zoom into key sections of the plot. For example, if we wanted to zoom into the top 10\% of movies according to rating, then we can use the `from` and `to` arguments. 286 | \begin{marginfigure} 287 | \centering 288 | \includegraphics[]{figures/ch6_10.png} 289 | \caption{Tableplot of the movie dataset. 
Only the top 10\% of movies (based on rating) have been plotted.}\label{F6.8} 290 | \end{marginfigure} 291 | 292 | ```{r fig.keep="none", message=FALSE, warning=FALSE, tidy=FALSE, echo=2} 293 | png("figures/ch6_10.png", width=5*dpi, height=5*dpi, res=dpi) 294 | tableplot(movies[,3:5], sortCol = 3, from = 0, to = 10) 295 | sink=dev.off() 296 | ``` 297 | 298 | \noindent For a detailed description of the package, consult the vignette 299 | 300 | ```{r eval=FALSE} 301 | browseVignettes("tabplot") 302 | ``` 303 | 304 | # Interactive visualisations 305 | 306 | One of the features of large, complex datasets is that they can be visualised in many ways. 307 | Sometimes it is only by viewing the relationships between different variables *interactively* that patterns in the data become apparent. 308 | Interactive visualisation is also a way to communicate the contents of large datasets to others, without needing to transfer huge files from one computer to another. 309 | Perhaps the main advantage of online interactive visualisation, however, is that the 'heavy lifting' to produce the plots can be done remotely. 310 | If powerful computers are used to handle the processing 'server side', the user's computer is freed from the strain that large datasets can put on hard-discs, RAM and CPU. 311 | 312 | Because of these advantages, interactive visualisation has become increasingly common amongst R users over the last few years. 313 | It is a rapidly evolving field within R. 314 | Instead of providing extensive code examples, this section therefore highlights some promising packages and points to additional teaching material, to show what is possible; a minimal **shiny** sketch is given at the end of the chapter. 315 | 316 | ## Shiny 317 | 318 | **shiny** is an R package for easing the development of online, interactive web applications ('apps'). 319 | An example of a **shiny** app for real world use is the Propensity to Cycle Tool (PCT), funded by the UK's Department for Transport (figure 6.10). 320 | It is necessary to host the visualisation remotely because the target audience (local transport planners) cannot be expected to download all the input data and software dependencies to run the model locally. 321 | Moreover, most policy makers are not R experts, so the visualisation reduces the barriers to entry into exploratory data analysis. 322 | 323 | ```{r cov, fig.margin=TRUE, fig.cap= "Screenshot of the Propensity to Cycle Tool shiny app", echo=FALSE} 324 | grid.raster(readPNG("figures/coventry-centroids.png")) 325 | ``` 326 | 327 | There are many excellent **shiny** teaching resources, the best of which is RStudio's website dedicated to shiny: 328 | [shiny.rstudio.com/](http://shiny.rstudio.com/). 329 | 330 | ## ggvis 331 | 332 | **ggvis** builds on **ggplot2** to ease the creation of interactive plots that can be used locally or pushed online for use by others. 333 | A user-friendly tutorial teaching **ggvis** basics and showcasing its capabilities can be found online, at 334 | [ggvis.rstudio.com/ggvis-basics.html](http://ggvis.rstudio.com/ggvis-basics.html). 335 | 336 | ## taucharts 337 | 338 | **taucharts** is a new package for creating interactive graphics using the JavaScript library TauCharts. 339 | The package has a syntax that is similar to **ggplot2** but provides options for user interaction. 340 | Read more about the package at [rpubs.com/hrbrmstr/taucharts](http://rpubs.com/hrbrmstr/taucharts).
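To round off the chapter, here is the minimal **shiny** sketch promised above. It is purely illustrative and unrelated to the PCT app: the object names (`ui`, `server`, `n`) are arbitrary choices. It draws a histogram of random numbers whose sample size is controlled by a slider.

```{r eval=FALSE}
library("shiny")
## User interface: one input (a slider) and one output (a plot)
ui = fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)
## Server logic: redraw the histogram whenever the slider moves
server = function(input, output) {
  output$hist = renderPlot(hist(rnorm(input$n)))
}
shinyApp(ui, server)  # launch the app locally
```

\noindent Hosting such an app on a remote server is what allows the 'heavy lifting' described above to be done away from the user's machine.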
341 | -------------------------------------------------------------------------------- /07-Efficient.Rmd: -------------------------------------------------------------------------------- 1 | \chapter{Efficient R Coding} 2 | 3 | \section{Benchmarking} 4 | 5 | Donald Knuth\sidenote{See \url{http://en.wikipedia.org/wiki/Donald_Knuth}} made the following statement on optimization: 6 | \begin{quote} 7 | \textit{"We should forget about small efficiencies, say about 97\% of the time: premature optimization is the root of all evil."} 8 | \end{quote} 9 | \noindent So before we rush headlong into optimising our R code, we will spend some time determining when it is worthwhile optimising. In this chapter, we will look at benchmarking. In computing, a benchmark is obtained by running a set of programs in order to assess their relative performance. 10 | 11 | \section{Simple benchmarks} 12 | 13 | To construct a benchmark we typically use the following steps 14 | \begin{enumerate} 15 | \item Construct a function (or set of functions) around the feature we want to 16 | benchmark. This function usually has an argument that allows us to vary the 17 | complexity of the object. For example, a parameter \texttt{n} that alters the data 18 | size. 19 | \item Time the function for various values of \texttt{n}. 20 | \end{enumerate} 21 | 22 | \subsection{Example: Creating a sequence of numbers} 23 | 24 | Suppose we want to create a sequence of integers\marginnote{This example is 25 | purely instructional. If you spend time trying to optimize something this low 26 | level, then you have bigger problems!} 27 | \[ 28 | 0, 1, 2, 3, \ldots, n \;. 29 | \] 30 | In R, we could do this in three ways 31 | ```{r eval=FALSE} 32 | 0:n 33 | seq(0, n) 34 | seq(0, n, by=1) 35 | ``` 36 | 37 | \noindent To time the function calls, we will use \texttt{system.time}:\marginnote{The 38 | \texttt{system.time} function actually calls the \texttt{proc.time} function, so see 39 | \texttt{?proc.time} for further details.} 40 | 41 | ```{r cache=TRUE} 42 | system.time(0:1e7) 43 | system.time(seq(0, 1e7)) 44 | system.time(seq(0, 1e7, by=1)) 45 | ``` 46 | \noindent The function \texttt{system.time} returns a vector with the following 47 | details:\sidenote{Typically you want user time $>>$ system time.} 48 | \begin{itemize} 49 | \item \texttt{user} is the CPU time (in seconds) charged for the execution of user 50 | instructions of the calling process. 51 | \item \texttt{system} is the CPU time (in seconds) charged for execution by the 52 | system on behalf of the calling process. 53 | \item \texttt{elapsed} is the real elapsed time (in seconds) since the process was 54 | started. 55 | \end{itemize} 56 | However, when benchmarking we typically compare across different test cases. To 57 | make things easier, we will wrap the operations of interest in functions 58 | ```{r cache=TRUE} 59 | base = function(n) 0:n 60 | seq1 = function(n) seq(0, n) 61 | seq2 = function(n) seq(0, n, by=1) 62 | ``` 63 | \noindent And then benchmark as before: 64 | ```{r results='hide', cache=TRUE} 65 | n = 1e7 66 | system.time(base(n)) 67 | system.time(seq1(n)) 68 | system.time(seq2(n)) 69 | ``` 70 | \noindent A small point to note is that our timings now include a function call overhead. However, a function call typically adds on an additional 200 nano seconds to each call\sidenote{There are $6\times 10^{10}$ nano seconds in a minute.}, so it's not something we usually worry about. 
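To get a feel for how small this overhead is, the following quick check (an illustrative sketch, not part of the benchmarks above) times a million calls to a function that does nothing; on most machines this completes in well under a second.

```{r eval=FALSE}
## A do-nothing function: all we are timing is the call overhead
do_nothing = function() NULL
## One million calls
system.time(for(i in 1:1e6) do_nothing())
```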
71 | 72 | \subsection{Saving the output} 73 | 74 | Sometimes we want to store the output of a benchmark. To do this we use the \verb+<-+ operator\sidenote{See \texttt{?assignOps} for a complete description of assignment operators.} inside the \texttt{system.time} function call 75 | ```{r results='hide', cache=TRUE} 76 | system.time(x <- base(5)) 77 | ``` 78 | \noindent The variable \texttt{x} now contains the output from the \texttt{base(5)} function call. 79 | 80 | At this point, things are starting to get messy. For example, we would like to vary \texttt{n} and calculate the relative overhead of the three methods. While this is possible, a better way is to use the \texttt{rbenchmark} package. 81 | 82 | \section{Benchmarking with \texttt{rbenchmark}} 83 | 84 | The \texttt{rbenchmark} package can be installed in the usual way 85 | ```{r eval=FALSE} 86 | install.packages("rbenchmark") 87 | ``` 88 | \noindent This package has a single function, \texttt{benchmark}, which is a simple wrapper around \texttt{system.time}. Given a specification of the benchmarking process - such as number of replications and an arbitrary number of expressions - the \texttt{benchmark} function evaluates each of the expressions in the specified environment, replicating the evaluation as many times as specified, and returning the results conveniently wrapped into a data frame. 89 | 90 | Let's consider the sequence example in the previous section. First load the package, 91 | ```{r} 92 | library("rbenchmark") 93 | ``` 94 | \noindent then we select how many replications we want of each function and what statistics we are interested in: 95 | ```{r cache=TRUE, tidy=FALSE} 96 | benchmark(replications=10, 97 | base(n), seq1(n), seq2(n), 98 | columns=c("test", "elapsed", "relative")) 99 | ``` 100 | \noindent In this comparison, using the \texttt{base} function is around twenty-five times faster the \texttt{seq2}. However, remember that each function only takes a fraction of a single second to run! 101 | 102 | To compare over different values of $n$, we just loop:\sidenote{This piece of code is breaking the number one rule in efficient R programming, we are growing a data frame. 
See the next section for details.} 103 | ```{r cache=TRUE, tidy=FALSE} 104 | d = NULL 105 | for(n in 10^(5:7)) { 106 | dd = benchmark(replications=10, 107 | base(n), seq1(n), seq2(n), 108 | columns=c("test", "elapsed", "relative")) 109 | dd$n = n 110 | d = rbind(d, dd) 111 | } 112 | ``` 113 | \noindent The results can be plotted in the usual way 114 | ```{r fig.keep='none', cache=TRUE} 115 | plot(d$n, d$relative, log="x", col=d$test) 116 | ``` 117 | 118 | ```{r echo=FALSE, eval=FALSE} 119 | N = 10^seq(2, 5, length.out = 10) 120 | b = c(0.200, 0.403, 0.891, 1.901, 4.112,8.937,19.050,45.562,120.665,410.480) 121 | 122 | fname = "../graphics/f2_1.pdf" 123 | pdf(fname, width=6, height=6) 124 | setnicepar() 125 | mypalette(1) 126 | plot(N, b, log="xy", xlab="n", ylab="Time(secs)", 127 | ylim=c(0.001, 1000), xlim=c(100, 1e5), 128 | axes=FALSE, frame=TRUE, pch=19, col=1) 129 | axis(1, 10^(2:5), label=c(expression(10^2), 130 | expression(10^3), 131 | expression(10^4), 132 | expression(10^5))) 133 | axis(2, 10^(-3:3), label=c(expression(10^-3),expression(10^-2), 134 | expression(10^-1), 135 | expression(10^0), 136 | expression(10^1), 137 | expression(10^2), 138 | expression(10^3))) 139 | b1 = c(0.001,0.003,0.005,0.011,0.025,0.054,0.113,0.247,0.533,1.146) 140 | points(N, b1, col=2, pch=19) 141 | #mtext("Time(secs)", side = 2, las=3,padj=-2.5) 142 | #mtext("n", side = 1, padj=2.2) 143 | grid() 144 | sink = dev.off() 145 | system(paste("pdfcrop", fname)) 146 | ``` 147 | 148 | \section{Common pitfalls} 149 | 150 | The benefit of using R (as opposed to C or Fortran, say), is that coding time is greatly 151 | reduced. However if we are not careful, it's very easy to write programs that are 152 | incredibly slow. 153 | 154 | ```{r echo=FALSE} 155 | library(rbenchmark) 156 | method1 = function(n) { 157 | myvec = numeric(0) 158 | for(i in 1:n) 159 | myvec = c(myvec, i) 160 | myvec 161 | } 162 | method2 = function(n) { 163 | myvec = numeric(n) 164 | for(i in 1:n) 165 | myvec[i] = i 166 | myvec 167 | } 168 | method3 = function(n) 1:n 169 | ``` 170 | 171 | \section{Object growth} 172 | 173 | Let's consider three methods of creating a sequence of numbers.\footnote{This chapter used 174 | a lot material found in the R inferno.} 175 | 176 | \begin{marginfigure} 177 | \centering 178 | \includegraphics[width=\textwidth]{figures/f2_1-crop} 179 | \caption{Timings (in seconds) comparing method 2 and method 3 of vector creation. 
Note that both scales are $\log_{10}$.}\label{F2.1} 180 | \end{marginfigure} 181 | \noindent \textbf{Method 1} creates an empty vector, and grows the 182 | object\sidenote{Equivalently, we could have \mbox{\texttt{myvec=c()}} or \mbox{\texttt{myvec=numeric(0)}}} 183 | ```{R eval=FALSE, echo=TRUE, tidy=FALSE} 184 | n = 100000 185 | myvec = NULL 186 | for(i in 1:n) 187 | myvec = c(myvec, i) 188 | ``` 189 | \noindent \textbf{Method 2} creates an object of the final length and then changes the 190 | values in the object by subscripting: 191 | ```{r eval=FALSE, echo=TRUE, tidy=FALSE} 192 | myvec = numeric(n) 193 | for(i in 1:n) 194 | myvec[i] = i 195 | ``` 196 | \noindent \textbf{Method 3} directly creates the final object: 197 | ```{r eval=FALSE, echo=TRUE} 198 | myvec = 1:n 199 | ``` 200 | \noindent To compare the three methods we use the \texttt{benchmark} function from the previous section 201 | ```{r tidy=FALSE,cache=TRUE} 202 | n = 1e4 203 | benchmark(replications=10, 204 | method1(n), method2(n), method3(n), 205 | columns=c("test", "elapsed")) 206 | ``` 207 | \noindent Table \ref{T2.1} and figure \ref{F2.1} show the timing in seconds on my machine for these three methods for a 208 | selection of values of $n$. The relationships for varying $n$ are all roughly linear on a log-log scale, but the timings between methods are drastically different. Notice that the timings are no longer trivial. When $n=10^7$, method 1 takes around an hour whilst method 2 takes around 2 seconds and method 3 is almost instantaneous.\sidenote{\textbf{This} is the number one rule when programming in R: if possible, always pre-allocate your vector, then fill in the values.} 209 | 210 | \begin{table}[t] 211 | \centering 212 | \begin{tabular}{@{} l r@{.}l ll @{}} 213 | \toprule 214 | & \multicolumn{4}{l}{Method} \\ 215 | \cmidrule(l){2-5} 216 | $n$ & \multicolumn{2}{l}{1} & 2 & 3 \\ 217 | \midrule 218 | $10^5$ & 0&208 & 0.024 & 0.000 \\ 219 | $10^6$ & 25&50 & 0.220 & 0.000 \\ 220 | $10^7$ & 3827&0 & 2.212 & 0.000\\ 221 | \bottomrule 222 | \end{tabular} 223 | \caption{Time in seconds to create sequences. When $n=10^7$, method 1 takes around an hour, while method 2 takes around 2 seconds and method 3 is almost instantaneous.}\label{T2.1} 224 | \end{table} 225 | ```{r echo=FALSE} 226 | n =2 227 | ``` 228 | Object growth can be quite insidious since it is easy to hide growing objects in your 229 | code. For example: 230 | ```{r tidy=FALSE} 231 | hit = NULL 232 | for(i in 1:n) { 233 | if(runif(1) < 0.3) 234 | hit[i] = TRUE 235 | else 236 | hit[i] = FALSE 237 | } 238 | ``` 239 | \noindent \textbf{Moral:} Never increase your object size incrementally. Always try to 240 | create the object first and then fill in the blanks. 241 | 242 | 243 | \subsection{Avoid rbind too!} 244 | 245 | A more common - and possibly more dangerous - problem is with \texttt{rbind}.\sidenote{I fell into this trap in chapter 1.} For example: 246 | ```{r eval=FALSE, echo=TRUE, tidy=FALSE} 247 | df1 = data.frame(a=character(0), b=numeric(0)) 248 | for(i in 1:n) 249 | df1 = rbind(df1, 250 | data.frame(a=sample(letters, 1), b=runif(1))) 251 | ``` 252 | \noindent Probably the main reason this is more common is that each 253 | iteration is likely to produce a different number of observations. However, a reasonable upper bound on the size of the final object is often known, so we can pre-allocate a large data frame and trim it if necessary, as sketched below.
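As a minimal sketch of this pre-allocate-and-trim pattern (the upper bound `n_max` and the stopping rule below are invented purely for illustration):

```{r eval=FALSE}
n_max = 1000                          # assumed upper bound on the number of rows
df1 = data.frame(a = character(n_max), b = numeric(n_max),
                 stringsAsFactors = FALSE)
i = 0
while(i < n_max && runif(1) < 0.99) { # hypothetical stopping rule
  i = i + 1
  df1$a[i] = sample(letters, 1)       # fill in the pre-allocated rows
  df1$b[i] = runif(1)
}
df1 = df1[seq_len(i), ]               # trim the unused rows
```

\noindent The loop still fills the object element by element, but no copy of the growing data frame is made on each iteration.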
254 | 255 | \section{Vectorise} 256 | 257 | When writing code in R, you need to remember that you are using R and not C (or even F77!). For example,\sidenote{The function \texttt{runif(1000)} generates 1000 random numbers between zero and one.} 258 | ```{r eval=FALSE, echo=TRUE, tidy=FALSE} 259 | x = runif(1000) + 1 260 | logsum = 0 261 | for(i in 1:length(x)) 262 | logsum = logsum + log(x[i]) 263 | ``` 264 | \noindent This is a piece R code that has a strong, unhealthy influence from C.\sidenote{It's not always the case that loops are slow and apply is fast \url{http://stackoverflow.com/q/7142767/203420}} Instead, we should write 265 | ```{r eval=FALSE} 266 | logsum = sum(log(x)) 267 | ``` 268 | 269 | ```{r echo=FALSE} 270 | x = runif(2) 271 | ``` 272 | 273 | \noindent Writing code this way has a number of benefits 274 | \begin{enumerate} 275 | \item It's faster. When $n = 10^7$ the ``R way'' is about forty times faster. 276 | \item It's neater. 277 | \item It doesn't contain a bug when \texttt{x} is of length $0$. 278 | \end{enumerate} 279 | Another common example is subsetting a vector. When writing in C, we would have something like: 280 | ```{r tidy=FALSE} 281 | ans = NULL 282 | for(i in 1:length(x)) { 283 | if(x[i] < 0) 284 | ans = c(ans, x[i]) 285 | } 286 | ``` 287 | \noindent This of course can be done simply with 288 | ```{r} 289 | ans = x[x < 0] 290 | ``` 291 | 292 | 293 | 294 | ```{r echo=FALSE, eval=FALSE} 295 | set.seed(1) 296 | fname = "../graphics/f2_2.pdf" 297 | pdf(fname, width=6, height=6) 298 | setnicepar() 299 | curve(x^2, 0,1, ylab="f(x)", xlab="x") 300 | grid() 301 | N= 40 302 | px = runif(N); py=runif(N) 303 | points(px[py < px^2], py[py < px^2], pch=19, col=1) 304 | points(px[py > px^2], py[py > px^2], pch=19, col=2) 305 | sink = dev.off() 306 | system(paste("pdfcrop", fname)) 307 | ``` 308 | 309 | 310 | 311 | \subsection{Example: Monte-Carlo integration} 312 | 313 | 314 | It's also important to make full use of R functions that use vectors. For 315 | example, suppose we wish to estimate 316 | \[ 317 | \int_0^1 x^2 dx 318 | \] 319 | using a basic Monte-Carlo method. 320 | \begin{marginfigure} 321 | \centering 322 | \includegraphics[width=\textwidth]{figures/f2_2-crop} 323 | \caption{Example of Monte-Carlo integration. To estimate the area under the curve throw random points at the graph and count the number of points that lie under the curve.}\label{F2.2} 324 | \end{marginfigure} 325 | The algorithm used to estimate this integral is given in algorithm \ref{A1}. 326 | \begin{algorithm}[h] 327 | \caption{Monte Carlo Integration}\label{A1} 328 | \begin{enumerate} 329 | \item Initialise: \texttt{hits = 0} 330 | \item \textbf{for i in 1:N} 331 | \item \quad Generate two random numbers, $U_1, U_2$, between 0 and 1 332 | \item \quad If $U_2 < U_1^2$, then \texttt{hits = hits + 1} 333 | \item \textbf{end for} 334 | \item Area estimate = \texttt{hits/N}. 
335 | \end{enumerate} 336 | \end{algorithm} 337 | \noindent A standard C approach to implementing algorithm \ref{A1} would be something like: 338 | ```{r tidy=FALSE} 339 | N = 500000 340 | f = function(N){ 341 | hits = 0 342 | for(i in 1:N) { 343 | u1 = runif(1); u2 = runif(1) 344 | if(u1^2 > u2) 345 | hits = hits + 1 346 | } 347 | return(hits/N) 348 | } 349 | ``` 350 | \noindent Which in R takes about 5 seconds: 351 | ```{r cache=TRUE} 352 | system.time(f(N)) 353 | ``` 354 | \noindent However, an R-centric approach is: 355 | ```{r echo=TRUE} 356 | f1 = function(N){ 357 | hits = sum(runif(N)^2 > runif(N)) 358 | return(hits/N) 359 | } 360 | ``` 361 | \noindent So using vectors we get a 100 times speed-up: 362 | ```{r} 363 | system.time(f1(N)) 364 | ``` 365 | 366 | 367 | 368 | \subsection{If you can't vectorise} 369 | 370 | Sometimes it is impossible to vectorise your code. If this is the case, there are a few things you can do: 371 | 372 | 1. Put any object creation outside the loop. For example 373 | 374 | ```{r cache=TRUE, tidy=FALSE} 375 | jitter = function(x, k) rnorm(1, x, k) 376 | parts = rnorm(10) 377 | post = numeric(length(parts)) 378 | 379 | for(i in 1:length(parts)){ 380 | k = 1.06*sd(parts)/length(parts) 381 | post[i] = jitter(parts[i], k) 382 | } 383 | ``` 384 | 385 | can be rewritten as 386 | 387 | ```{r cache=TRUE, tidy=FALSE} 388 | k = 1.06*sd(parts)/length(parts) 389 | for(i in 1:length(parts)) 390 | post[i] = jitter(parts[i], k) 391 | ``` 392 | 393 | or even better, just 394 | 395 | ```{r cache=TRUE} 396 | post = sapply(parts, jitter, k) 397 | ``` 398 | 399 | 1. Make the number of iterations as small possible. For example, if you have the choice 400 | between iterating over factor elements and factor levels. Then factor levels is usually 401 | better (since there are fewer categories). 402 | 403 | 404 | 405 | -------------------------------------------------------------------------------- /08-Rcpp.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: pdf_document 3 | --- 4 | 5 | \chapter{The RCpp Package} 6 | 7 | # Introduction 8 | 9 | Sometimes R is just slow. You've tried every trick you know, and your code is still crawling along. At this point you may need to rewrite key parts of your code in C/C++. You don't have to resort to external packages to call C/Fortran routines. We can just use the `.Call` function; it's just an incredibly painful and error prone experience. However there is a better way, the Rcpp\sidenote{\url{http://www.rcpp.org}} package. This is now one of the most popular packages on CRAN. Rcpp provides a clean and friendly API\sidenote{\textbf{A}pplication \textbf{P}rogram \textbf{I}nterface is an exposed set of routines, protocols, and/or tools for building software applications.} that lets you write high-performance code, while at the same time keeping you safe from R's tricky C API. The typical bottlenecks that C/C++ can address are loops and recursive functions. 10 | 11 | In this chapter, C and C++ code are largely interchangeable, so when you see 'C code', it can usually be included in a `.cpp` file\marginnote{`.cpp` is the default file extension for C++ scripts.}. In general this isn't true. See for example the Stackoverflow question, 12 | \begin{center} 13 | \url{http://programmers.stackexchange.com/q/16390/14846} 14 | \end{center} 15 | \noindent for an overview. 16 | 17 | 18 | Since C++ is a separate programming language, this chapter just provides the bare minimum to get you started. 
This chapter's goal is to provide a flavour of what's possible. 19 | 20 | \subsection*{Pre-requisites} 21 | 22 | To use\sidenote{To use means being able to write and compile functions. If you distribute code in a package, then this isn't an issue. For example, the \texttt{ggplot2} package uses Rcpp.} the package you need a working C++ compiler. 23 | 24 | * Linux: A compiler should already be installed. Otherwise install `r-base` and a compiler will be installed as a dependency. 25 | * Macs: Install `Xcode`. 26 | * Windows: Install Rtools\sidenote{\url{http://cran.r-project.org/bin/windows/}}. Make sure you select the version that corresponds to your version of R. 27 | 28 | \noindent The code in this chapter was generated using version `r packageDescription("Rcpp")$Version` of Rcpp. You can install Rcpp from CRAN in the usual way 29 | 30 | ```{r eval=FALSE} 31 | install.packages("Rcpp") 32 | ``` 33 | 34 | \noindent The associated CRAN\sidenote{\url{https://cran.r-project.org/web/packages/Rcpp/}} page has numerous vignettes that are worth reading\sidenote{You can get an idea of the popularity of the package by looking at the `Reverse Imports` section.}. 35 | 36 | To check that you have everything needed for this chapter, run the following piece of code from the course R package 37 | 38 | ```{r cache=TRUE} 39 | library("r4bd") 40 | test_rcpp() 41 | ``` 42 | 43 | # A simple C++ function 44 | 45 | A C/C++ function is similar to an R function: you pass a set of inputs to the function, some code is run, and a single object is returned. However, there are some key differences. 46 | 47 | 1. In the C/C++ function, each line must be terminated with `;`. In R, we use `;` only when we have multiple statements on the same line. 48 | 2. We must declare object types in the C/C++ version. In particular we need to declare the types of the function arguments, return value and any intermediate objects we create. 49 | 3. The function must have an explicit `return` statement. Similar to R, there can be multiple returns, but the function will terminate when it hits its first `return` statement. 50 | 4. You do not use assignment when creating a function. 51 | 5. Object assignment must use the `=` sign. The `<-` operator isn't valid\sidenote{Yet another reason to use \texttt{=} when writing R code}. 52 | 6. One line comments can be created using `//`. Multi-line comments are created using `/*...*/` 53 | 54 | \noindent We want to create a function that adds two numbers together. In R this would be a simple one line affair: 55 | 56 | ```{r} 57 | add_r = function(x, y) x + y 58 | ``` 59 | 60 | \noindent In C++ it is a bit more long-winded 61 | 62 | ```{Rcpp eval=FALSE} 63 | /* Return type double 64 | * Two arguments, also doubles 65 | */ 66 | double add_c(double x, double y) { 67 | double value = x + y; 68 | return value; 69 | } 70 | 71 | ``` 72 | 73 | \noindent If we were writing C++ code, we would also need another function called `main`. We would then compile the code to obtain an executable that is run. The executable is platform dependent. The beauty of using Rcpp is that it makes it very easy to call C++ functions from R and the user doesn't have to worry about the platform, compilers or the R/C interface.
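For a one-line taste of how much plumbing Rcpp hides, the `evalCpp()` function compiles and evaluates a tiny C++ expression directly from the R prompt (a minimal illustration; the next section introduces the more useful `cppFunction()`):

```{r eval=FALSE}
library("Rcpp")
evalCpp("2 + 2")  # the C++ expression is compiled and the result (4) is returned to R
```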
73 | 74 | ## The cppFunction command 75 | 76 | We load the Rcpp package using the usual `library` function call 77 | 78 | ```{r message=FALSE} 79 | library("Rcpp") 80 | ``` 81 | 82 | \noindent Then we simply pass the C++ function as a (string) argument to `cppFunction`: 83 | 84 | ```{r tidy=FALSE} 85 | cppFunction(' 86 | double add_c(double x, double y){ 87 | double value = x + y; 88 | return value; 89 | } 90 | ') 91 | ``` 92 | 93 | \noindent and Rcpp will magically compile the C++ code and construct a function that bridges the gap between R and C++. After running the above code, we now have access to the `add_c` function 94 | 95 | ```{r} 96 | add_c 97 | ``` 98 | 99 | \noindent We can call the `add_c` function in the usual way 100 | 101 | ```{r} 102 | add_c(1, 2) 103 | ``` 104 | 105 | \noindent and we don't have to worry about compilers. It has all been taken care of. Also, if you include this function in a package, users don't have to worry about any of the Rcpp magic. It just works. 106 | 107 | ## C/C++ data types 108 | 109 | The most basic type of variable is an integer, `int`. An `int` variable can store a value in the range $-32768$ to $+32767$\sidenote{In C, we can also define unsigned \texttt{int}. Then the range goes from $0$ to $65,535$. There are also `long int` data types, which range from $0$ to $2^{31}-1$.}. To store floating point numbers, there are single precision numbers, `float` and double precision numbers, `double`. A `double` takes twice as much memory as a `float`. For __single__ characters, we use the `char` data type. 110 | 111 | \begin{table}[t] 112 | \centering 113 | \begin{tabular}{@{}ll@{}} 114 | \toprule 115 | Type & Description\\ 116 | \midrule 117 | \texttt{char} & A single character.\\ 118 | \texttt{int} & An integer.\\ 119 | \texttt{float} & A single precision floating point number.\\ 120 | \texttt{double} & A double-precision floating point number.\\ 121 | \texttt{void} & A valueless quantity.\\ 122 | \bottomrule 123 | \end{tabular} 124 | \vspace{0.2cm} 125 | \caption{Overview of key C/C++ object types.} 126 | \end{table} 127 | 128 | A pointer object is a variable that points to an area of memory that has been given a name. Pointers are a very powerful, but primitive facility contained in the C language. They are very useful since rather than passing large objects around, we pass a pointer to the memory location; rather than pass the house, we just give the address. We won't use pointers in this chapter, but mention them for completeness. 129 | 130 | # The sourceCpp function 131 | 132 | The `cppFunction` is great for getting small examples up and running. But it is better practice to put your C++ code in a separate file (with file extension `cpp`) and use the function call `sourceCpp("path/to/file.cpp")` to compile them. However we need to include a few headers at the top of the file. The first line we add gives us access to the Rcpp functions. The file `Rcpp.h` contains a list of function and class definitions supplied by Rcpp\sidenote{This file will be located where Rcpp is installed. Alternatively, you can view it online at \url{https://github.com/RcppCore/Rcpp}}. The `include` statement adds the definitions to the top of your code 133 | 134 | ```{Rcpp eval=FALSE} 135 | #include 136 | ``` 137 | 138 | \noindent To access the Rcpp functions we would have to type `Rcpp::function_1`. 
To avoid typing `Rcpp::`, we use the namespace facility 139 | 140 | ```{Rcpp eval=FALSE} 141 | using namespace Rcpp; 142 | ``` 143 | 144 | \noindent Now we can just type `function_1`\sidenote{This is the same concept that R uses for managing function name collisions when loading packages.}. Above each function we want to export/use in R, we add the tag\sidenote{Similar to an Roxygen2 export tag.} 145 | 146 | ```{Rcpp eval=FALSE} 147 | // [[Rcpp::export]] 148 | ``` 149 | 150 | \noindent This would give the complete file 151 | 152 | ```{Rcpp} 153 | #include 154 | using namespace Rcpp; 155 | 156 | // [[Rcpp::export]] 157 | double add_c(double x, double y){ 158 | double value = x + y; 159 | return value; 160 | } 161 | ``` 162 | 163 | \noindent There are two main benefits with putting your C++ functions in separate files. First, we have the benefit of syntax highlighting (RStudio has great support for C++ editing). Second, it's easier to make syntax errors when the switching between R and C++ in the same file. To save space, we we'll omit the headers for the remainder of the chapter. 164 | 165 | # Vectors and loops 166 | 167 | Let's now consider a slightly more complicated example. Here we want to write our own function that calculates the mean. This is just an illustrative example: R's version is much better and more robust to scale differences in our data. For comparison, let's create a corresponding R function. The function takes a single vector `x` as input, and returns the mean value, `m`: 168 | 169 | ```{r} 170 | mean_r = function(x) { 171 | n = length(x) 172 | m = 0 173 | for(i in seq_along(x)) 174 | m = m + x[i]/n 175 | m 176 | } 177 | ``` 178 | 179 | \noindent This is a very bad R function. We should just use the base function `mean` for real world applications. However the purpose of `mean_r` is to provide a comparison for the C++ version, which we will write in a similar way. 180 | 181 | In this example, we will let Rcpp smooth the interface between C++ and R by using the `NumericVector` data type. This Rcpp data type mirrors the R vector object type. Other common classes are: `IntegerVector`, `CharacterVector`, and `LogicalVector`. 182 | 183 | In the C++ version of the mean function, we specify the arguments types: `x` (`NumericVector`) and the return value (`double`). The C++ version of the `mean` function is a few lines longer. Almost always, the corresponding C++ version will be, possibly much, longer. 184 | 185 | ```{Rcpp eval=FALSE} 186 | double mean_c(NumericVector x){ 187 | int i; 188 | int n = x.size(); 189 | double mean = 0; 190 | 191 | for(i=0; i x = readr::read_csv("very_big.csv") 104 | # Error: cannot allocate vector of size 12.8 Gb 105 | ``` 106 | 107 | 108 | ## ff Storage 109 | 110 | When data is the `ff` format, processing is faster than using the standard `read.csv`/`write.csv` combination. However, converting data into `ff` format can be time consuming; so keeping data in `ff` format is helpful. When you load in an `ff` object, there is a corresponding file(s) created on your hard disk 111 | 112 | ```{r} 113 | filename(ffx) 114 | ``` 115 | 116 | \noindent This make moving data around a bit more complicated. The package provides helper functions, `ffsave` and `ffload`, which zips/unzips `ff` object files. However the `ff` files are not platform-independent, so some care is needed when changing operating systems. 117 | 118 | # The ffbase package 119 | 120 | The `ff` package supplies the tools for manipulating large data sets, but provides few statistical functions. 
Conceptually, chunking algorithms are straightforward. The program reads a chunk of data into memory, performs intermediate calculations, saves the results and reads the next chunk. This process repeats until the entire dataset is processed. Unfortunately, many statistical algorithms have not been written with chunking in mind. 121 | 122 | The `ffbase`\sidenote{\url{http://github.com/edwindj/ffbase}} package adds basic statistical functions to `ff` and `ffdf` objects. It tries to make the code more R-like and smooth away the pain of working with `ff` objects. It also provides an interface with `big*` methods. 123 | 124 | `ffbase` provides S3 methods for a number of standard functions, such as `mean`, `min` and `max`, and standard arithmetic operators (see `?ffbase` for a complete list) for `ff` objects. This removes some of the pain when dealing with `ff` objects. 125 | 126 | \newthought{The} `ffbase` package also provides access to other packages that handle large data sets. In particular, 127 | 128 | * `biglm`: Regression for data too large to fit in memory; 129 | * `biglars`: Scalable Least-Angle Regression and Lasso; 130 | * `bigrf`: Big Random Forests: Classification and Regression Forests for Large Data Sets; 131 | * `stream`: Infrastructure for Data Stream Mining. 132 | 133 | # Big linear models 134 | 135 | Linear models (lm) are one of the most basic statistical models available. The simplest regression model is 136 | \[ 137 | Y_i = \beta_0 + \beta_1 x_i + \epsilon_i 138 | \] 139 | where $\epsilon_i \sim N(0, \sigma^2)$. This corresponds to fitting a straight line through some points. So $\beta_0$ is the $y$-intercept and $\beta_1$ is the gradient. The aim is to estimate $\beta_0$ and $\beta_1$. 140 | 141 | In the more general multiple regression model, there are $p$ predictor variables 142 | \begin{equation} 143 | Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, 144 | \end{equation} 145 | where $x_{ij}$ is the $i^\text{th}$ observation on the $j^\text{th}$ independent variable. The above equation can be written neatly in matrix notation as 146 | \[ 147 | \bm{Y} = X \bm{\beta} + \bm{\epsilon} 148 | \] 149 | with dimensions 150 | \[ 151 | [n\times 1]= [n\times (p+1)] ~[(p+1)\times 1] + [n \times 1 ]\;, 152 | \] 153 | where 154 | \begin{itemize} 155 | \item $\bm{Y}$ is the response vector (dimensions $n \times 1$); 156 | \item $X$ is the design matrix (dimensions $n \times (p+1)$); 157 | \item $\bm{\beta}$ is the parameter vector (dimensions $(p+1) \times 1$); 158 | \item $\bm{\epsilon}$ is the error vector (dimensions $n \times 1$). 159 | \end{itemize} 160 | \noindent The goal of regression is to estimate $\bm{\beta}$ with $\bm{\hat\beta}$. It can be shown that 161 | \begin{equation} 162 | \bm{\hat\beta} = (X^T X)^{-1} X^T \bm{Y} \;. 163 | \end{equation} 164 | \noindent Our estimate of $\bm {\hat \beta}$ will exist provided that $(X^T X)^{-1}$ 165 | exists, i.e. no column of $X$ is a linear combination of the other columns. 166 | 167 | For a least squares regression with a sample size of $n$ training examples and $p$ predictors, it takes: 168 | 169 | * $O(p^2n)$ to multiply $X^T$ by $X$; 170 | * $O(pn)$ to multiply $X^T$ by $\bm{Y}$; 171 | * $O(p^3)$ to compute the LU (or Cholesky) factorization of $X^TX$ that is used to compute the product of $(X^TX)^{-1} (X^T\bm{Y})$. 172 | 173 | \noindent Since $n \gg p$, this means that the algorithm scales with order $O(p^2 n)$. As well as taking a long time to calculate, the memory required also increases.
The R implementation of `lm` requires $O(np + p^2)$ in memory. But this can be reduced by constructing the model matrix in chunks. The `biglm` algorithm is based on algorithm AS 274\sidenote{\url{http://lib.stat.cmu.edu/apstat/274}}, @Miller1992. It works by updating the Cholesky decomposition with new observations. So for a model with $p$ variables, only the $p \times p$ (triangular) Cholesky factor and a single row of data need to be in memory at any given time. The `biglm` package does not do the chunking for you, but `ffbase` provides a handy S3 wrapper, `bigglm.ffdf`. 174 | 175 | For an example of using `biglm`, see the blog post at \url{http://goo.gl/iBPkTp} by Bnosac. 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | -------------------------------------------------------------------------------- /10-sparkR.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: pdf_document 3 | --- 4 | 5 | \chapter{Apache Spark} 6 | 7 | # What is Apache Spark? 8 | 9 | Apache Spark is a computing platform whose goal is to make the analysis of large datasets fast. Spark extends the *MapReduce* paradigm for parallel computing to support a wide range of operations. A key feature of Spark is that it can run complex computational tasks both in-memory and on disk. 10 | 11 | The Spark project contains multiple tightly integrated components. This closely coupled design means that improvements in one part of the Spark engine are automatically used by other components. Another benefit of the tight coupling between Spark's components is that there is a single system to maintain, which can be crucial for large organisations. 12 | 13 | The Spark stack is composed of the following libraries. 14 | 15 | * Spark Core: The Core contains the basic functionality of Spark, such as memory management, fault recovery, and interacting with storage systems. Spark Core also provides the APIs that the other components build upon. 16 | * Spark SQL: This library provides an SQL interface for interacting with databases. 17 | * Spark Streaming: Functionality designed to ease the management of data collected in real-time. 18 | * MLlib: A scalable machine learning library. 19 | * GraphX: A relatively new component in Spark for representing and analysing phenomena that can be represented as graphs, such as person-to-person links in social media. 20 | * Cluster Managers. 21 | * Third party libraries: As with R and Python, developers are encouraged to 22 | extend Spark. Over 100 additional libraries have been contributed so far, 23 | each of which can be rated by the community\sidenote{\url{http://spark-packages.org/}}. 24 | 25 | \noindent Spark is rapidly becoming a key component in analysing data sets that have to be distributed across multiple computers. To get an idea of Spark's popularity, browse the page 26 | \begin{center} 27 | \url{https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark} 28 | \end{center} 29 | The `rmr2` package allows R to use Hadoop MapReduce. However, development on this package has slowed. The author notes that the lack of activity on `rmr2` is due to two reasons. First, the package's maturity. Second, the general shift away from Hadoop MapReduce towards Spark. 30 | 31 | The associated Spark R package, `SparkR`, was released as a separate component of Spark in 2014. In June 2015, the decision was made to merge `SparkR` into the main Spark distribution.
However, they are still in the process of deciding on the API of `SparkR`. This has three unfortunate side effects 32 | 33 | 1. The full functionality of Spark is not yet available via `SparkR`. 34 | 1. If you find code online, it's likely not to work since it uses the old `SparkR` package. 35 | 1. These notes will go out of date. 36 | 37 | \noindent On a more positive note, since `SparkR` has been folded into the main Spark distribution, it means that it more likely to keep pace with the main Spark stack. 38 | 39 | # A first Spark instance 40 | 41 | There are a number of preliminary steps we need to take before beginning an analysis. First we need to install Spark. After Spark had been installed we set the environment variable `SPARK_HOME`.\sidenote{There are a number of ways to deploy Spark, the simplest of which is 'Standalone Mode', which simply requires a compiled version of Spark to be available on the machine. Other deployment modes include 'EC2', for use on Amazon's cloud computing infrastructure, and via Apache's cluster management system Mesos and YARN, which is an evolution of Hadoop.} This can be in done in our `bashrc` file, or in R itself, via 42 | 43 | ```{r eval=FALSE, echo=1} 44 | Sys.setenv(SPARK_HOME="/path/to/spark/") 45 | Sys.setenv(SPARK_HOME="/data/ncsg3/spark-1.4.1-bin-hadoop2.6/") 46 | ``` 47 | 48 | \noindent Then we load the `SparkR` package\marginnote{\texttt{SparkR} is bundled with Spark. This means that \texttt{SparkR} is not hosted on CRAN, so the usual \texttt{install.packages} won't work.} 49 | ```{r eval=FALSE} 50 | library("SparkR") 51 | ``` 52 | 53 | \noindent Next we initialise the Spark cluster and create a `SparkContext` 54 | ```{r eval=FALSE} 55 | sc = sparkR.init(master="local") 56 | ``` 57 | 58 | \noindent The `sparkR.init` function has number of arguments. In particular if we want to use any Spark packages, these should be specified during the initialisation stage. So if we wanted to load in a csv file, we would need something like 59 | ```{r eval=FALSE} 60 | sc = sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") 61 | ``` 62 | 63 | \noindent When we finish our Spark session, we should terminate the Spark context via 64 | ```{r eval=FALSE} 65 | sparkR.stop() 66 | ``` 67 | 68 | # Resilient distributed datasets (RDD) 69 | 70 | The core feature of Spark is the resilient distributed dataset (RDD). An RDD is an abstraction\sidenote{By abstraction we simply mean, we don't worry about how or where the data set is stored.} that helps us deal with big data. An RDD is a distributed collection of elements (including data and functions). In Spark everything we do revolves around RDDs. Typically, we may want to create, transform or operate on the distributed data set. Spark automatically handles how the data is distributed across your computer/cluster and parallelises operations where possible. 71 | 72 | ## Example: Moby Dick 73 | 74 | For this example, we are using the Moby Dick text, downloaded from the Project Gutenberg website. Assuming that we are already in a Spark session, we can read in the text using the `textFiles` function 75 | ```{r eval=FALSE} 76 | moby = SparkR:::textFile(sc, "data/moby_dick.txt") 77 | ``` 78 | \noindent There are two keys points to note. First, we pass the Spark instance object `sc` as an argument.\sidenote{In R, `::` is used to access functions that have been exported by a package, i.e. methods that appear in the NAMESPACE file. 
However there are some functions in the package that the author may want to remain private. These can be accessed using `:::`.} Second, we are use `:::` to access an non-exported function from `SparkR`\sidenote{SparkR was only integrated into Spark in June 2015, so the API is still being finalised, hence the use of the triple colon here.}. Because Spark and SparkR are so new, the interface has still to be finalised. Some things in this chapter may not work in a few month's time. In any case, the `moby` object is an RDD object 79 | ```{r eval=FALSE, tidy=FALSE} 80 | R> moby 81 | # MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:-2 82 | ``` 83 | 84 | \noindent Once we have an RDD object, there are two options available: *transformation* and *action*. 85 | 86 | A *transformation* operation constructs a new RDD based on the previous one. For example, suppose we want to extract the lines that contain the word `Moby` from our data set. This is a standard operation: we have a dataset and we want to remove certain values. To do this we use a standard R approach, by creating a function called `get_moby` that only returns `TRUE` or `FALSE`. The `filterRDD` function then retains any rows that are `TRUE`, i.e. 87 | ```{r eval=FALSE, tidy=FALSE} 88 | get_moby = function(x) 89 | "Moby" %in% strsplit(x, split = " ")[[1]] 90 | mobys = SparkR:::filterRDD(moby, get_moby) 91 | ``` 92 | 93 | \noindent This is a functional approach to programming and is similar to the `apply` family. 94 | An *action* computes a result based on an RDD. The result is either displayed or stored somewhere else on the system. For example, if we want to know how many rows contain the word `Moby`, we use the count function 95 | ```{r eval=FALSE} 96 | ## The answer is 77 97 | count(mobys) 98 | ``` 99 | 100 | \noindent Spark deals with transformations and actions in two different ways. Like `dplyr`, Spark uses lazy evaluation: it only performs the computation when it is used by an action. In the example above, the `textFile` and `filterRDD` commands are run only when we use `count`. 101 | 102 | The developers of Spark and `dplyr` recognize that lazy evaluation is essential when working with big data. This can be seen by considering the example above. If SparkR actually ran `textFile` straight away, this would use up a load of disk space. This is a waste, since we immediately filter out the vast majority of the text. Instead, SparkR (via Spark), takes the chain of transformations, and performs the computation on the minimum amount of data needed to get the result. 103 | 104 | Spark's RDDs are (by default) recomputed each time you run an action on them.\sidenote{Alternatively, we could use \texttt{cache(moby)}, which is the same as \texttt{persist} with the default level of storage.} To reuse RDDs in multiple operations we can ask Spark to persist it via 105 | ```{r eval=FALSE} 106 | ## There are different levels of storage 107 | persist(mobys, "MEMORY_ONLY") 108 | ``` 109 | 110 | \noindent If you are not planning on reusing the object, don't use persist. 111 | 112 | To summarise, every Spark session will have a similar structure. 113 | \begin{enumerate} 114 | \item Create a resilient distributed dataset (RDD). 115 | \item Transform and manipulate the data set. 116 | \item For key data sets, use `persist` for efficiency. 117 | \item Retrieve the results via an action such as `count`. 
118 | \end{enumerate} 119 | 120 | # Loading data: creating RDDs 121 | 122 | There are two ways of creating an RDD: by parallelizing an existing dataset; or from an external data source, such as a database or csv file. 123 | 124 | The easiest way to create an RDD file is from an existing data set which can be passed to the `parallelize` function. If you use this method it may mean the data is relatively small and you don't need to use Spark. Nevertheless, applying the method on small datasets will help you to learn Spark/SparkR quickly, since you can quickly test and prototype code. To create an RDD representation of the vector `1:100`, we would use 125 | 126 | ```{r eval=FALSE} 127 | vec_sp = SparkR:::parallelize(sc, 1:100) 128 | ``` 129 | 130 | \noindent As before, we don't actually compute `vec_sp` until it is needed, 131 | ```{r eval=FALSE} 132 | count(vec_sp) 133 | ``` 134 | 135 | \noindent Typically, we would want to load data from external data sets. This could, for example, be from a text file using `textFile` described above, or from CSV file (again via `textFile`), provided you have loaded the correct library. 136 | 137 | # Example: Spark dataframes 138 | 139 | Suppose we have already initialised a Spark context. To use Spark data frames, we need to create an `SQLContext`, via 140 | ```{r eval=FALSE} 141 | sql_context = sparkRSQL.init(sc) 142 | ``` 143 | 144 | \noindent The `SQLContext` enables us to create data frames from a local R data frame, 145 | 146 | ```{r eval=FALSE} 147 | chicks_sp = createDataFrame(sql_context, chickwts) 148 | ``` 149 | 150 | \noindent or from other data sources, such as CSV files or a 'Hive 151 | table'.\sidenote{A hive is a data structure used by Hadoop.} 152 | If we examine the newly created object, we get 153 | 154 | ```{r eval=FALSE} 155 | R> chicks_sp 156 | # DataFrame[weight:double, feed:string] 157 | ``` 158 | 159 | \noindent An S3 method for `head` is also available, so 160 | 161 | ```{r eval=FALSE, tidy=FALSE} 162 | R> head(chicks_sp, 2) 163 | # weight feed 164 | #1 179 horsebean 165 | #2 160 horsebean 166 | ``` 167 | 168 | \noindent gives what we would expect. We can extract columns using the dollar notation, `chicks_sp$weight` or using `select` 169 | ```{r eval=FALSE} 170 | R> select(chicks_sp, "weight") 171 | # DataFrame[weight:double] 172 | ``` 173 | 174 | \noindent We can subset or filter the data frame using the `filter` function 175 | ```{r eval=FALSE} 176 | filter(chicks_sp, chicks_sp$feed == "horsebean") 177 | ``` 178 | 179 | \noindent Using Spark data frames, we can also easily group and aggregate data frames. (Note this is similar to the `dplyr` syntax). For example, to count the number of chicks in each feed group, we group, and then summarise: 180 | 181 | ```{r eval=FALSE, tidy=FALSE} 182 | chicks_cnt = groupBy(chicks_sp, chicks_sp$feed) %>% 183 | summarize(count=n(chicks_sp$feed)) 184 | ``` 185 | \noindent Then use `head` to view the top rows 186 | 187 | ```{r eval=FALSE, tidy=FALSE} 188 | head(chicks_cnt, 2) 189 | # feed count 190 | #1 casein 12 191 | #2 meatmeal 11 192 | ``` 193 | 194 | \noindent We can also use arrange the data by the most common group 195 | ```{r eval=FALSE, tidy=FALSE} 196 | arrange(chicks_cnt, desc(chicks_cnt$count)) 197 | ``` 198 | 199 | # Resources 200 | 201 | * Apache Spark homepage\sidenote{\url{https://spark.apache.org/}} 202 | * Learning Spark [@Karau2015] 203 | * Advanced analytics with Spark [@Ryza2014] 204 | * `dplyr` with `Spark`. 
Experimental, but worth watching\sidenote{\url{https://github.com/RevolutionAnalytics/dplyr-spark}} 205 | 206 | \clearpage 207 | -------------------------------------------------------------------------------- /11-Datasets.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Datasets" 3 | output: pdf_document 4 | --- 5 | 6 | \newpage 7 | 8 | \chapter*{Datasets} 9 | 10 | ```{r, echo=FALSE} 11 | # Plan: put a description of the datasets, and how to load them, 12 | # in a chapter at the end 13 | ``` 14 | 15 | \section*{Mobile Century travel behaviour dataset} 16 | 17 | This dataset is the result of an experiment conducted on the 8th February 2008, 18 | 10:00 to 18:00 (PST) on the Interstate 880 road in California. 19 | 100 GPS-enabled smart-phones were placed in cars for the experiment, the 20 | aim of which was to evaluate the potential of smart-phones to be used to 21 | monitor traffic conditions in real-time [@Herrera2010]. A website has been 22 | set up to describe and disseminate the data for research purposes: 23 | [traffic.berkeley.edu/project/](http://traffic.berkeley.edu/project/). 24 | 25 | 26 | ```{r } 27 | if(!file.exists("data/MobileCentury/pems_prop_NB.csv")){ 28 | base = "http://traffic.berkeley.edu/sites/default/files/downloads/" 29 | url1 = paste0(base, "MobileCentury_data_final_ver3.zip?sid=1529") 30 | url2 = paste0(base, "mobile_century_data_manual.pdf") 31 | 32 | dir.create("data/MobileCentury") 33 | downloader::download(url1, destfile = "data/MobileCentury_data_final_ver3.zip") 34 | downloader::download(url2, destfile = "data/MobileCentury/mobile_century_data_manual.pdf") 35 | unzip("data/MobileCentury_data_final_ver3.zip", exdir = "data/MobileCentury/") 36 | } 37 | ``` 38 | 39 | \section*{Global emissions data} 40 | 41 | This dataset is an up-to-date compilation of global CO2 emissions, administered by the World Resources Institute (WRI). 42 | The user must sign in to access the data from the WRI website.^[See [wri.org/resources/data-sets](http://www.wri.org/resources/data-sets/cait-country-greenhouse-gas-emissions-data).] 43 | The example serves to show how data can be rescued from spreadsheets and saved in a much more user-friendly format such as .csv. 44 | The dataset is not big (comprising fewer than 10,000 rows) but is used to illustrate methods of data manipulation. 45 | 46 | \section*{Moby Dick; or The Whale, by Herman Melville} 47 | 48 | The text file for the ebook of Moby Dick can be downloaded from the Gutenberg.org website\sidenote{\url{https://www.gutenberg.org/files/2701/2701.txt}}; a short download sketch is given at the end of this chapter. 49 | 50 | \section*{The National Provider Identifier} 51 | 52 | Details of health care providers in the USA are made publicly available by the US government. 53 | The resulting datasets are large (over 4 GB unzipped) and can be accessed from 54 | [www.cms.gov/](http://download.cms.gov/nppes/NPI_Files.html). 55 | In this dataset each row is a registered health care provider. 56 | The columns contain information on these providers, including name, address and telephone number. 57 | Because there are so many column variables (329), much of the data is redundant. 58 | The download and processing of this dataset is described in chapter 3.
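For convenience, the Moby Dick text used in the Spark chapter can be fetched in the same way as the Mobile Century data above. The following sketch assumes the `downloader` package is available and uses the Gutenberg URL given in the Moby Dick section:

```{r eval=FALSE}
if(!file.exists("data/moby_dick.txt")) {
  downloader::download("https://www.gutenberg.org/files/2701/2701.txt",
                       destfile = "data/moby_dick.txt")
}
```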
59 | 60 | \newpage 61 | 62 | \chapter*{References} 63 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Robin Lovelace and Colin Gillespie -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | ## $* = filename without extension 2 | ## $@ = the output file 3 | ## $< = the input file 4 | 5 | MAIN = book 6 | 7 | book.pdf: *.Rmd build.R in_header.tex 8 | Rscript -e "source('build.R')" 9 | 10 | clean: 11 | rm -fvr $(MAIN).pdf $(MAIN).tex $(MAIN).Rmd $(MAIN)_files $(MAIN).html 12 | rm -fvr $(MAIN)_cache 13 | rm -fv *.aux *.dvi *.log *.toc *.bak *~ *.blg *.bbl *.lot *.lof 14 | rm -fv *.nav *.snm *.out *.pyc \#*\# _region_* _tmp.* *.vrb 15 | rm -fv Rplots.pdf 16 | 17 | cleaner: 18 | make clean 19 | rm -fv ../graphics/*.pdf 20 | rm -fvr auto/ 21 | 22 | -------------------------------------------------------------------------------- /R-for-Big-Data.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | StripTrailingWhitespace: Yes 16 | 17 | BuildType: Makefile 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # R-for-Big-Data 2 | 3 | Teaching materials for handling large datasets in R. Now superceded by [Efficient R Programming](https://github.com/csgillespie/efficientR) but archived for posterity. 4 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | name: R for Big Data 2 | markdown: redcarpet 3 | highlighter: pygments 4 | 5 | exclude: ["README.md", "book"] 6 | -------------------------------------------------------------------------------- /additional-content/.gitignore: -------------------------------------------------------------------------------- 1 | course-info-leeds.docx 2 | -------------------------------------------------------------------------------- /additional-content/book_outline.Rmd: -------------------------------------------------------------------------------- 1 | # Book outline 2 | 3 | 1. Introduction 4 | 5 | ## Part 1: Background 6 | 1. Memory 7 | 1. Preprocessing 8 | 9 | ## Part 2: Tools 10 | 11 | 1. Loading files 12 | 1. Loading databases 13 | 1. Manipulating 14 | 1. ff 15 | 1. SparkR 16 | 17 | ## Part 3: Advanced 18 | 19 | 1. Visualising 20 | 1. 
Rcpp 21 | 22 | * See http://www.slideshare.net/bytemining/r-hpc for slides on ff, bigmemory, mapreduce 23 | -------------------------------------------------------------------------------- /additional-content/challenges-consolidation.txt: -------------------------------------------------------------------------------- 1 | 1) Find all earthquakes with a magnitude greater than 5 and are from California 2 | 3 | 2) Remove unwanted columns 4 | 5 | 3) Summary statistics 6 | 7 | 4) Histograms 8 | 9 | 5) Cluster 10 | 11 | 6) Find all earthquakes that occurred in California 12 | 13 | 7) Barchart 14 | 15 | 8) Statistical Testing 16 | -------------------------------------------------------------------------------- /additional-content/consolidate.R: -------------------------------------------------------------------------------- 1 | 2 | # aim: read data and subset 3 | url <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv" 4 | 5 | # library(downloader) 6 | 7 | 8 | eq <- read.csv(url(url)) 9 | 10 | # Task 1 - subset 11 | mag7 <- eq$mag > 7 12 | 13 | # mag7 can be used to subset the data 14 | eq7 <- eq[mag7, c("place", "mag")] 15 | 16 | # this is the same as: 17 | eq7 <- eq[eq$mag > 7,] 18 | 19 | # eq7 now contains the rows for large earthquakes 20 | 21 | # equals doesn't work 22 | sel <- eq$place == "California" 23 | summary(sel) 24 | 25 | # to match a character, use grep 26 | sel <- grep("Ca", eq$place) 27 | sell <- grepl("Cali", eq$place) 28 | sel5 <- eq$mag > 4 29 | summary(sel) 30 | sel_final <- sel5 & sell 31 | 32 | mag5Cali <- eq[sel_final,] 33 | 34 | # in cali 35 | sp <- grepl("Cali", eq$place) 36 | # magnitude > 3.5 37 | sm <- eq$mag > 3.5 38 | 39 | # final selection 40 | s <- sp & sm 41 | m5c <- eq[s,] 42 | 43 | plot(eq$latitude, eq$longitude) 44 | 45 | # create the same thing with no new objects 46 | m5c2 <- eq[grepl("Cali", eq$place) & eq$mag > 3.5,] 47 | 48 | identical(m5c, m5c2) 49 | # eq[1:3, 4:5] 50 | 51 | library(dplyr) 52 | names(eq) 53 | 54 | eqmini <- select(eq, contains("l"), contains("e")) 55 | 56 | eqmini <- eq[seq(1, 15, 5)] 57 | 58 | 59 | 60 | names(eqmini) 61 | 62 | summary(mag7) 63 | names(eq) 64 | 65 | eq_ag <- group_by(eq, magType) %>% 66 | summarise(median = quantile(mag, probs = 0.5)) 67 | 68 | eq_joined <- inner_join(eq, eq_ag) 69 | 70 | glimpse(eq_joined) 71 | 72 | # 73 | -------------------------------------------------------------------------------- /additional-content/course-info-leeds.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Course Agenda" 3 | output: word_document 4 | --- 5 | 6 | ## Introduction to R 7 | 8 | - 09:00 Registration and refreshments 9 | - 09:30 Introducion to R 10 | - Data structures, plotting, summary statistics 11 | - 12:30 **Lunch** 12 | - 13:30 Practical 1 13 | - 14:30 Subsetting 14 | - 15:30 **Refreshments** 15 | - 15:45 Practical 2 16 | - 16:30 Finish 17 | 18 | ## Day 1 19 | 20 | - 09:00 Registration and refreshments 21 | - 09:30 Housekeeping and introduction (CSG/RL) 22 | - 10:15 Memory matters (CSG) 23 | - 12:00 Introduction to data formats (RL) 24 | - 12:30 **Lunch** 25 | - 13:30 Loading data into R (RL) 26 | - 14:00 Preprocessing data outside R (RL) 27 | - 14:30 Cleaning untidy data (RL) 28 | - 15:30 **Refreshments** 29 | - 15:45 Visualising Big Data: a taster (RL) 30 | - 16:00 An introduction to **dplyr** (RL) 31 | 32 | ## Day 2 33 | 34 | - 09:00 Registration and refreshments 35 | - 09:30 Efficient programming (CSG) 36 | - Benchmarking 37 | - Vectorising 
38 | - Rcpp 39 | - 12:30 **Lunch** 40 | - 13:30 Out of memory data (CSG) 41 | - The ff package 42 | - Spark 43 | - 15:30 **Refreshments** 44 | - 15:45 Visualisation (RL) 45 | - 16:30 Finish 46 | 47 | 48 | 49 | 50 | -------------------------------------------------------------------------------- /additional-content/course_outline.Rmd: -------------------------------------------------------------------------------- 1 | # Course outline 2 | 3 | 1. Introduction 4 | 5 | ## Part 1: Background 6 | 1. Memory 7 | 1. Preprocessing 8 | 9 | ## Part 2: Tools 10 | 11 | 1. Loading files 12 | 1. Loading databases 13 | 1. Manipulating 14 | 1. ff 15 | 1. SparkR 16 | 17 | ## Part 3: Advanced 18 | 19 | 1. Visualising 20 | 1. Rcpp -------------------------------------------------------------------------------- /additional-content/dfs.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/additional-content/dfs.RData -------------------------------------------------------------------------------- /additional-content/gini-dataset-II.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Additional Content" 3 | output: html_document 4 | --- 5 | 6 | See http://data.worldbank.org/indicator/SI.POV.GINI for another interesting dataset 7 | -------------------------------------------------------------------------------- /assets/cdrc-leeds.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/assets/cdrc-leeds.png -------------------------------------------------------------------------------- /assets/cdrc-logo_large.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/assets/cdrc-logo_large.png -------------------------------------------------------------------------------- /build.R: -------------------------------------------------------------------------------- 1 | # Which chapters do you want to build? 
2 | chap_ord = 1 3 | 4 | library(bookdown) 5 | library(rmarkdown) 6 | 7 | 8 | # Render chapters into tex ---------------------------------------------------- 9 | needs_update <- function(src, dest) { 10 | if (!file.exists(dest)) return(TRUE) 11 | mtime <- file.info(src, dest)$mtime 12 | mtime[2] < mtime[1] 13 | } 14 | 15 | parse_md <- function(in_path) { 16 | out_path <- tempfile() 17 | on.exit(unlink(out_path)) 18 | cmd <- paste0("pandoc -f ", markdown_style, " -t json -o ", out_path, " ", in_path) 19 | system(cmd) 20 | 21 | RJSONIO::fromJSON(out_path, simplify = FALSE) 22 | } 23 | 24 | extract_headers <- function(in_path) { 25 | x <- parse_md(in_path) 26 | body <- x[[2]] 27 | ids <- vapply(headers, id, FUN.VALUE = character(1)) 28 | ids[ids != ""] 29 | } 30 | 31 | render_chapter <- function(src) { 32 | dest <- file.path("book/tex/", gsub("\\.rmd", "\\.tex", src)) 33 | if (!needs_update(src, dest)) return() 34 | 35 | message("Rendering ", src) 36 | command <- bquote(rmarkdown::render(.(src), bookdown::tex_chapter(), 37 | output_dir = "book/tex", quiet = TRUE, env = globalenv())) 38 | writeLines(deparse(command), "run.r") 39 | on.exit(unlink("run.r")) 40 | source_clean("run.r") 41 | } 42 | 43 | source_clean <- function(path) { 44 | r_path <- file.path(R.home("bin"), "R") 45 | cmd <- paste0(shQuote(r_path), " --quiet --file=", shQuote(path)) 46 | 47 | out <- system(cmd, intern = TRUE) 48 | status <- attr(out, "status") 49 | if (is.null(status)) status <- 0 50 | if (!identical(as.character(status), "0")) { 51 | stop("Command failed (", status, ")", call. = FALSE) 52 | } 53 | } 54 | 55 | chapters <- list.files(pattern = "*.Rmd")[chap_ord] 56 | lapply(chapters, render_chapter) 57 | 58 | # Copy across additional files ------------------------------------------------- 59 | file.copy("book/r-for-big-data.tex", "book/tex/", recursive = TRUE) 60 | file.copy("diagrams/", "book/tex/", recursive = TRUE) 61 | file.copy("screenshots/", "book/tex/", recursive = TRUE) 62 | 63 | # Build tex file --------------------------------------------------------------- 64 | # (build with Rstudio to find/diagnose errors) 65 | old <- setwd("book/tex") 66 | system("xelatex -interaction=batchmode r-for-big-data ") 67 | system("xelatex -interaction=batchmode r-for-big-data ") 68 | setwd(old) 69 | 70 | file.copy("book/tex/r-for-big-data.pdf", "book/r-for-big-data.pdf", overwrite = TRUE) 71 | 72 | # Build website 73 | 74 | -------------------------------------------------------------------------------- /data/.gitignore: -------------------------------------------------------------------------------- 1 | MobileCentury* 2 | rand.csv 3 | Ecoli_metadata.csv 4 | UKWCS.xls 5 | ineq-ifs* 6 | world-bank-ineq.xlsx 7 | p1.Rds 8 | reshape-data 9 | -------------------------------------------------------------------------------- /data/CAIT_Country_GHG_Emissions_-_All_Data.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/data/CAIT_Country_GHG_Emissions_-_All_Data.xlsx -------------------------------------------------------------------------------- /data/minicars.Rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/data/minicars.Rds -------------------------------------------------------------------------------- /data/minicars.csv: 
-------------------------------------------------------------------------------- 1 | "","speed","dist" 2 | "1",4,2 3 | "2",4,10 4 | "3",7,4 5 | -------------------------------------------------------------------------------- /data/pew.csv: -------------------------------------------------------------------------------- 1 | "religion","<$10k","$10--20k","$20--30k","$30--40k","$40--50k","$50--75k","$75--100k","$100--150k",">150k" 2 | "Agnostic",27,34,60,81,76,137,122,109,84 3 | "Atheist",12,27,37,52,35,70,73,59,74 4 | " Buddhist ",27,21,30,34,33,58,62,39,53 5 | " Catholic ",418,617,732,670,638,1116,949,792,633 6 | " Don’t know/refused (no information on religious affiliation) ",15,14,15,11,10,35,21,17,18 7 | " Evangelical Protestant Churches ",575,869,1064,982,881,1486,949,723,414 8 | " Hindu ",1,9,7,9,11,34,47,48,54 9 | " Historically Black Protestant Churches ",228,244,236,238,197,223,131,81,78 10 | " Jehovah's Witness ",20,27,24,24,21,30,15,11,6 11 | " Jewish ",19,19,25,25,30,95,69,87,151 12 | " Mainline Protestant Churches ",289,495,619,655,651,1107,939,753,634 13 | " Mormon ",29,40,48,51,56,112,85,49,42 14 | " Muslim ",6,7,9,10,9,23,16,8,6 15 | " Orthodox ",13,17,23,32,32,47,38,42,46 16 | " Other Christian ",9,7,11,13,13,14,18,14,12 17 | " Other Faiths ",20,33,40,46,49,63,46,40,41 18 | " Other World Religions ",5,2,3,4,2,7,3,4,4 19 | " Unaffiliated ",217,299,374,365,341,528,407,321,258 20 | -------------------------------------------------------------------------------- /data/reshape-pew.csv: -------------------------------------------------------------------------------- 1 | "","religion","<$10k","$10--20k","$20--30k","$30--40k","$40--50k","$50--75k","$75--100k","$100--150k",">150k","Don't know/refused" 2 | "1","Agnostic",27,34,60,81,76,137,122,109,84,96 3 | "2","Atheist",12,27,37,52,35,70,73,59,74,76 4 | "3"," Buddhist ",27,21,30,34,33,58,62,39,53,54 5 | "4"," Catholic ",418,617,732,670,638,1116,949,792,633,1489 6 | "5"," Don’t know/refused (no information on religious affiliation) ",15,14,15,11,10,35,21,17,18,116 7 | "6"," Evangelical Protestant Churches ",575,869,1064,982,881,1486,949,723,414,1529 8 | "7"," Hindu ",1,9,7,9,11,34,47,48,54,37 9 | "8"," Historically Black Protestant Churches ",228,244,236,238,197,223,131,81,78,339 10 | "9"," Jehovah's Witness ",20,27,24,24,21,30,15,11,6,37 11 | "10"," Jewish ",19,19,25,25,30,95,69,87,151,162 12 | "11"," Mainline Protestant Churches ",289,495,619,655,651,1107,939,753,634,1328 13 | "12"," Mormon ",29,40,48,51,56,112,85,49,42,69 14 | "13"," Muslim ",6,7,9,10,9,23,16,8,6,22 15 | "14"," Orthodox ",13,17,23,32,32,47,38,42,46,73 16 | "15"," Other Christian ",9,7,11,13,13,14,18,14,12,18 17 | "16"," Other Faiths ",20,33,40,46,49,63,46,40,41,71 18 | "17"," Other World Religions ",5,2,3,4,2,7,3,4,4,8 19 | "18"," Unaffiliated ",217,299,374,365,341,528,407,321,258,597 20 | -------------------------------------------------------------------------------- /data/tinyaa: -------------------------------------------------------------------------------- 1 | "NPI","Entity Type Code","Replacement NPI","Employer Identification Number (EIN)","Provider Organization Name (Legal Business Name)","Provider Last Name (Legal Name)","Provider First Name","Provider Middle Name","Provider Name Prefix Text","Provider Name Suffix Text","Provider Credential Text","Provider Other Organization Name","Provider Other Organization Name Type Code","Provider Other Last Name","Provider Other First Name","Provider Other Middle Name","Provider Other Name Prefix Text","Provider Other 
Name Suffix Text","Provider Other Credential Text","Provider Other Last Name Type Code","Provider First Line Business Mailing Address","Provider Second Line Business Mailing Address","Provider Business Mailing Address City Name","Provider Business Mailing Address State Name","Provider Business Mailing Address Postal Code","Provider Business Mailing Address Country Code (If outside U.S.)","Provider Business Mailing Address Telephone Number","Provider Business Mailing Address Fax Number","Provider First Line Business Practice Location Address","Provider Second Line Business Practice Location Address","Provider Business Practice Location Address City Name","Provider Business Practice Location Address State Name","Provider Business Practice Location Address Postal Code","Provider Business Practice Location Address Country Code (If outside U.S.)","Provider Business Practice Location Address Telephone Number","Provider Business Practice Location Address Fax Number","Provider Enumeration Date","Last Update Date","NPI Deactivation Reason Code","NPI Deactivation Date","NPI Reactivation Date","Provider Gender Code","Authorized Official Last Name","Authorized Official First Name","Authorized Official Middle Name","Authorized Official Title or Position","Authorized Official Telephone Number","Healthcare Provider Taxonomy Code_1","Provider License Number_1","Provider License Number State Code_1","Healthcare Provider Primary Taxonomy Switch_1","Healthcare Provider Taxonomy Code_2","Provider License Number_2","Provider License Number State Code_2","Healthcare Provider Primary Taxonomy Switch_2","Healthcare Provider Taxonomy Code_3","Provider License Number_3","Provider License Number State Code_3","Healthcare Provider Primary Taxonomy Switch_3","Healthcare Provider Taxonomy Code_4","Provider License Number_4","Provider License Number State Code_4","Healthcare Provider Primary Taxonomy Switch_4","Healthcare Provider Taxonomy Code_5","Provider License Number_5","Provider License Number State Code_5","Healthcare Provider Primary Taxonomy Switch_5","Healthcare Provider Taxonomy Code_6","Provider License Number_6","Provider License Number State Code_6","Healthcare Provider Primary Taxonomy Switch_6","Healthcare Provider Taxonomy Code_7","Provider License Number_7","Provider License Number State Code_7","Healthcare Provider Primary Taxonomy Switch_7","Healthcare Provider Taxonomy Code_8","Provider License Number_8","Provider License Number State Code_8","Healthcare Provider Primary Taxonomy Switch_8","Healthcare Provider Taxonomy Code_9","Provider License Number_9","Provider License Number State Code_9","Healthcare Provider Primary Taxonomy Switch_9","Healthcare Provider Taxonomy Code_10","Provider License Number_10","Provider License Number State Code_10","Healthcare Provider Primary Taxonomy Switch_10","Healthcare Provider Taxonomy Code_11","Provider License Number_11","Provider License Number State Code_11","Healthcare Provider Primary Taxonomy Switch_11","Healthcare Provider Taxonomy Code_12","Provider License Number_12","Provider License Number State Code_12","Healthcare Provider Primary Taxonomy Switch_12","Healthcare Provider Taxonomy Code_13","Provider License Number_13","Provider License Number State Code_13","Healthcare Provider Primary Taxonomy Switch_13","Healthcare Provider Taxonomy Code_14","Provider License Number_14","Provider License Number State Code_14","Healthcare Provider Primary Taxonomy Switch_14","Healthcare Provider Taxonomy Code_15","Provider License Number_15","Provider License 
Number State Code_15","Healthcare Provider Primary Taxonomy Switch_15","Other Provider Identifier_1","Other Provider Identifier Type Code_1","Other Provider Identifier State_1","Other Provider Identifier Issuer_1","Other Provider Identifier_2","Other Provider Identifier Type Code_2","Other Provider Identifier State_2","Other Provider Identifier Issuer_2","Other Provider Identifier_3","Other Provider Identifier Type Code_3","Other Provider Identifier State_3","Other Provider Identifier Issuer_3","Other Provider Identifier_4","Other Provider Identifier Type Code_4","Other Provider Identifier State_4","Other Provider Identifier Issuer_4","Other Provider Identifier_5","Other Provider Identifier Type Code_5","Other Provider Identifier State_5","Other Provider Identifier Issuer_5","Other Provider Identifier_6","Other Provider Identifier Type Code_6","Other Provider Identifier State_6","Other Provider Identifier Issuer_6","Other Provider Identifier_7","Other Provider Identifier Type Code_7","Other Provider Identifier State_7","Other Provider Identifier Issuer_7","Other Provider Identifier_8","Other Provider Identifier Type Code_8","Other Provider Identifier State_8","Other Provider Identifier Issuer_8","Other Provider Identifier_9","Other Provider Identifier Type Code_9","Other Provider Identifier State_9","Other Provider Identifier Issuer_9","Other Provider Identifier_10","Other Provider Identifier Type Code_10","Other Provider Identifier State_10","Other Provider Identifier Issuer_10","Other Provider Identifier_11","Other Provider Identifier Type Code_11","Other Provider Identifier State_11","Other Provider Identifier Issuer_11","Other Provider Identifier_12","Other Provider Identifier Type Code_12","Other Provider Identifier State_12","Other Provider Identifier Issuer_12","Other Provider Identifier_13","Other Provider Identifier Type Code_13","Other Provider Identifier State_13","Other Provider Identifier Issuer_13","Other Provider Identifier_14","Other Provider Identifier Type Code_14","Other Provider Identifier State_14","Other Provider Identifier Issuer_14","Other Provider Identifier_15","Other Provider Identifier Type Code_15","Other Provider Identifier State_15","Other Provider Identifier Issuer_15","Other Provider Identifier_16","Other Provider Identifier Type Code_16","Other Provider Identifier State_16","Other Provider Identifier Issuer_16","Other Provider Identifier_17","Other Provider Identifier Type Code_17","Other Provider Identifier State_17","Other Provider Identifier Issuer_17","Other Provider Identifier_18","Other Provider Identifier Type Code_18","Other Provider Identifier State_18","Other Provider Identifier Issuer_18","Other Provider Identifier_19","Other Provider Identifier Type Code_19","Other Provider Identifier State_19","Other Provider Identifier Issuer_19","Other Provider Identifier_20","Other Provider Identifier Type Code_20","Other Provider Identifier State_20","Other Provider Identifier Issuer_20","Other Provider Identifier_21","Other Provider Identifier Type Code_21","Other Provider Identifier State_21","Other Provider Identifier Issuer_21","Other Provider Identifier_22","Other Provider Identifier Type Code_22","Other Provider Identifier State_22","Other Provider Identifier Issuer_22","Other Provider Identifier_23","Other Provider Identifier Type Code_23","Other Provider Identifier State_23","Other Provider Identifier Issuer_23","Other Provider Identifier_24","Other Provider Identifier Type Code_24","Other Provider Identifier State_24","Other Provider Identifier 
Issuer_24","Other Provider Identifier_25","Other Provider Identifier Type Code_25","Other Provider Identifier State_25","Other Provider Identifier Issuer_25","Other Provider Identifier_26","Other Provider Identifier Type Code_26","Other Provider Identifier State_26","Other Provider Identifier Issuer_26","Other Provider Identifier_27","Other Provider Identifier Type Code_27","Other Provider Identifier State_27","Other Provider Identifier Issuer_27","Other Provider Identifier_28","Other Provider Identifier Type Code_28","Other Provider Identifier State_28","Other Provider Identifier Issuer_28","Other Provider Identifier_29","Other Provider Identifier Type Code_29","Other Provider Identifier State_29","Other Provider Identifier Issuer_29","Other Provider Identifier_30","Other Provider Identifier Type Code_30","Other Provider Identifier State_30","Other Provider Identifier Issuer_30","Other Provider Identifier_31","Other Provider Identifier Type Code_31","Other Provider Identifier State_31","Other Provider Identifier Issuer_31","Other Provider Identifier_32","Other Provider Identifier Type Code_32","Other Provider Identifier State_32","Other Provider Identifier Issuer_32","Other Provider Identifier_33","Other Provider Identifier Type Code_33","Other Provider Identifier State_33","Other Provider Identifier Issuer_33","Other Provider Identifier_34","Other Provider Identifier Type Code_34","Other Provider Identifier State_34","Other Provider Identifier Issuer_34","Other Provider Identifier_35","Other Provider Identifier Type Code_35","Other Provider Identifier State_35","Other Provider Identifier Issuer_35","Other Provider Identifier_36","Other Provider Identifier Type Code_36","Other Provider Identifier State_36","Other Provider Identifier Issuer_36","Other Provider Identifier_37","Other Provider Identifier Type Code_37","Other Provider Identifier State_37","Other Provider Identifier Issuer_37","Other Provider Identifier_38","Other Provider Identifier Type Code_38","Other Provider Identifier State_38","Other Provider Identifier Issuer_38","Other Provider Identifier_39","Other Provider Identifier Type Code_39","Other Provider Identifier State_39","Other Provider Identifier Issuer_39","Other Provider Identifier_40","Other Provider Identifier Type Code_40","Other Provider Identifier State_40","Other Provider Identifier Issuer_40","Other Provider Identifier_41","Other Provider Identifier Type Code_41","Other Provider Identifier State_41","Other Provider Identifier Issuer_41","Other Provider Identifier_42","Other Provider Identifier Type Code_42","Other Provider Identifier State_42","Other Provider Identifier Issuer_42","Other Provider Identifier_43","Other Provider Identifier Type Code_43","Other Provider Identifier State_43","Other Provider Identifier Issuer_43","Other Provider Identifier_44","Other Provider Identifier Type Code_44","Other Provider Identifier State_44","Other Provider Identifier Issuer_44","Other Provider Identifier_45","Other Provider Identifier Type Code_45","Other Provider Identifier State_45","Other Provider Identifier Issuer_45","Other Provider Identifier_46","Other Provider Identifier Type Code_46","Other Provider Identifier State_46","Other Provider Identifier Issuer_46","Other Provider Identifier_47","Other Provider Identifier Type Code_47","Other Provider Identifier State_47","Other Provider Identifier Issuer_47","Other Provider Identifier_48","Other Provider Identifier Type Code_48","Other Provider Identifier State_48","Other Provider Identifier Issuer_48","Other Provider 
Identifier_49","Other Provider Identifier Type Code_49","Other Provider Identifier State_49","Other Provider Identifier Issuer_49","Other Provider Identifier_50","Other Provider Identifier Type Code_50","Other Provider Identifier State_50","Other Provider Identifier Issuer_50","Is Sole Proprietor","Is Organization Subpart","Parent Organization LBN","Parent Organization TIN","Authorized Official Name Prefix Text","Authorized Official Name Suffix Text","Authorized Official Credential Text","Healthcare Provider Taxonomy Group_1","Healthcare Provider Taxonomy Group_2","Healthcare Provider Taxonomy Group_3","Healthcare Provider Taxonomy Group_4","Healthcare Provider Taxonomy Group_5","Healthcare Provider Taxonomy Group_6","Healthcare Provider Taxonomy Group_7","Healthcare Provider Taxonomy Group_8","Healthcare Provider Taxonomy Group_9","Healthcare Provider Taxonomy Group_10","Healthcare Provider Taxonomy Group_11","Healthcare Provider Taxonomy Group_12","Healthcare Provider Taxonomy Group_13","Healthcare Provider Taxonomy Group_14","Healthcare Provider Taxonomy Group_15" 2 | "1679576722","1","","","","WIEBE","DAVID","A","","","M.D.","","","","","","","","","","PO BOX 2168","","KEARNEY","NE","688482168","US","3088652512","3088652506","3500 CENTRAL AVE","","KEARNEY","NE","688472944","US","3088652512","3088652506","05/23/2005","07/08/2007","","","","M","","","","","","207X00000X","12637","NE","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","46969","01","KS","BCBS","645540","01","KS","FIRSTGUARD","B67599","02","","","1553","01","NE","BCBS","046969WI","04","KS","","93420WI","04","NE","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","X","","","","","","","","","","","","","","","","","","","","","" 3 | "1588667638","1","","","","PILCHER","WILLIAM","C","DR.","","MD","","","","","","","","","","1824 KING STREET","SUITE 300","JACKSONVILLE","FL","322044736","US","9043881820","9043881827","1824 KING STREET","SUITE 300","JACKSONVILLE","FL","322044736","US","9043881820","9043881827","05/23/2005","05/29/2014","","","","M","","","","","","207RC0000X","032024","GA","N","207RC0000X","ME68414","FL","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","208143","01","FL","AVMED","C73899","02","FL","","27888Z","04","FL","","06BDGPK","04","GA","","00532485C","05","GA","","251286600","05","FL","","00706626A","05","GA","","0897705","01","FL","AETNA","27888","01","FL","BCBS","510265","01","GA","BCBS","110123591","04","FL","RAILROAD 
MCARE","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","N","","","","","","","","","","","","","","","","","","","","","" 4 | -------------------------------------------------------------------------------- /data/tinyab: -------------------------------------------------------------------------------- 1 | "1497758544","2","","","CUMBERLAND COUNTY HOSPITAL SYSTEM, INC","","","","","","","CAPE FEAR VALLEY HOME HEALTH AND HOSPICE","3","","","","","","","","3418 VILLAGE DR","","FAYETTEVILLE","NC","283044552","US","9106096740","","3418 VILLAGE DR","","FAYETTEVILLE","NC","283044552","US","9106096740","","05/23/2005","09/26/2011","","","","","NAGOWSKI","MICHAEL","","CEO","9106096700","251G00000X","HC0283","NC","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","3401562","05","NC","","341562","04","NC","PROVIDER NUMBER","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","N","","","MR.","","","","","","","","","","","","","","","","","" 2 | "1306849450","1","","","","SMITSON","HAROLD","LEROY","DR.","II","M.D.","","","","","","","","","","810 LUCAS DR","","ATHENS","TX","757513446","US","9036756778","9036752333","810 LUCAS DR","","ATHENS","TX","757513446","US","9036756778","9036752333","05/23/2005","01/03/2008","","","","M","","","","","","2085R0202X","E5444","TX","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","83R321","08","TX","","B26530","02","TX","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","N","","","","","","","","","","","","","","","","","","","","","" 3 | "1215930367","1","","","","GRESSOT","LAURENT","","DR.","","M.D.","","","","","","","","","","17323 RED OAK DR","","HOUSTON","TX","770901243","US","2814405006","2814406149","17323 RED OAK 
DR","","HOUSTON","TX","770901243","US","2814405006","2814406149","05/23/2005","11/25/2014","","","","M","","","","","","174400000X","H6257","TX","N","207RH0003X","H6257","TX","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","E43866","02","TX","","89G024","08","TX","","830005153","08","TX","","1215930367","08","TX","","0533800001","07","TX","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","N","","","","","","","","","","","","","","","","","","","","","" 4 | -------------------------------------------------------------------------------- /data/tinyac: -------------------------------------------------------------------------------- 1 | "1023011178","2","","","NAPA VALLEY HOSPICE & ADULT DAY SERVICES","","","","","","","","","","","","","","","","414 S JEFFERSON ST","","NAPA","CA","945594515","US","7072589080","7072582476","414 S JEFFERSON ST","","NAPA","CA","945594515","US","7072589080","7072582476","05/23/2005","10/17/2011","","","","","VOLKERTS","KEITH","REGIS","DIRECTOR OF FINANCE","7072589080","251G00000X","100000741","CA","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","HPC01537G","05","CA","","051537","04","CA","HOSPICE MEDICARE NO","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","N","","","MR.","","","","","","","","","","","","","","","","","" 2 | "1932102084","1","","","","ADUSUMILLI","RAVI","K","","","MD","","","","","","","","","","2940 N MCCORD RD","","TOLEDO","OH","436151753","US","4198423000","4198423048","2940 N MCCORD RD","","TOLEDO","OH","436151753","US","4198423000","4198423048","05/23/2005","04/23/2012","","","","M","","","","","","207RC0000X","4301081344","MI","N","207RC0000X","35069014","OH","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","060060579","08","OH","","4042628","08","OH","","MI1635021","08","MI","","P00751116","01","","RAILROAD 
MEDICARE","0178623","05","OH","","0792002","08","OH","","MI1635026","08","MI","","AD4257781","08","OH","","E21287","02","","","4042622","08","OH","","4042627","08","OH","","4042629","08","OH","","0792003","08","OH","","4042624","08","OH","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","N","","","","","","","","","","","","","","","","","","","","","" 3 | "1841293990","1","","","","WORTSMAN","SUSAN","","","","MA-CCC","","","","","","","","","","68 ROCKLEDGE RD","APT 1C","HARTSDALE","NY","105303455","US","2124814464","","425 E 25TH ST","","NEW YORK","NY","100102547","US","2124814464","","05/23/2005","07/08/2007","","","","F","","","","","","231H00000X","000396-1","NY","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","M72081","04","NY","PROVIDER NUMBER","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","N","","","","","","","","","","","","","","","","","","","","","" 4 | -------------------------------------------------------------------------------- /data/tinyad: -------------------------------------------------------------------------------- 1 | "1750384806","1","","","","BISBEE","ROBERT","","DR.","","MD","","","","","","","","","","5219 CITY BANK PKWY","STE 35","LUBBOCK","TX","794073545","US","8067852045","8067222908","113 WALNUT ST","","IDALOU","TX","793294003","US","8068922537","8068922726","05/23/2005","01/13/2012","","","","M","","","","","","207R00000X","J8461","TX","Y","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","G22052","02","TX","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","N","","","","","","","","","","","","","","","","","","","","","" 2 | -------------------------------------------------------------------------------- /figures/746px-Pistol-grip_drill.svg.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/746px-Pistol-grip_drill.svg.png -------------------------------------------------------------------------------- /figures/Laptop-hard-drive-exposed.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/Laptop-hard-drive-exposed.jpg -------------------------------------------------------------------------------- /figures/coventry-centroids.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/coventry-centroids.png -------------------------------------------------------------------------------- /figures/environment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/environment.png -------------------------------------------------------------------------------- /figures/f2_1-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/f2_1-crop.pdf -------------------------------------------------------------------------------- /figures/f2_2-crop.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/f2_2-crop.pdf -------------------------------------------------------------------------------- /figures/know_data.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/know_data.jpg -------------------------------------------------------------------------------- /figures/mel-cycle-cent-close.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/mel-cycle-cent-close.png -------------------------------------------------------------------------------- /figures/od-mess.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Robinlovelace/R-for-Big-Data/fbb470bf261d4e0a7dbb304a04b50b3ac4710543/figures/od-mess.png -------------------------------------------------------------------------------- /in_header.tex: -------------------------------------------------------------------------------- 1 | %% Start in_header.tex 2 | \usepackage{microtype,bm,algorithm} 3 | %% End in_header.tex -------------------------------------------------------------------------------- /packages.R: -------------------------------------------------------------------------------- 1 | ## CRAN packages needed 2 | pkgs = c( 3 | "drat", "devtools", "ggplot2", # Generic 4 | "downloader", "grid", "rbenchmark", # Generic 5 | "pryr", ## Memory chapter 6 | "dplyr", "readxl", 7 | "readr", "gdata", "openxlsx", 8 | "tidyr", 9 | "Rcpp",## Rcpp chapter 10 | "png", 11 | "ff", "ffbase", "biglm", ## FF chapter 12 | "xtable", 13 | "tabplot" ##Graphics chapter 14 | ) 15 | ## Github Drat 
packages 16 | github_pkgs = c("r4bd") 17 | 18 | ## Packages not in a proper repo 19 | if(!"bigvis" %in% installed.packages()){ 20 | devtools::install_github("hadley/bigvis") 21 | } 22 | 23 | ## create the data frames 24 | pkgs = data.frame(pkg = pkgs, 25 | repo = "http://cran.rstudio.com/", 26 | installed = pkgs %in% installed.packages(), 27 | stringsAsFactors = FALSE) 28 | 29 | if(!require(drat)){ 30 | install.packages("drat") 31 | } 32 | repo = drat::addRepo("rcourses")["rcourses"] 33 | 34 | github_pkgs = data.frame(pkg = github_pkgs, 35 | repo = repo, 36 | installed = github_pkgs %in% installed.packages(), 37 | stringsAsFactors = FALSE, row.names=NULL) 38 | 39 | ## Combine all data frames of package info 40 | pkgs = rbind(pkgs, github_pkgs) 41 | 42 | ## Update packages 43 | update.packages(checkBuilt = TRUE, ask = FALSE, 44 | repos = unique(pkgs$repo), 45 | oldPkgs = pkgs$pkg) 46 | 47 | ## Install missing packages 48 | to_install = pkgs[!pkgs$installed,] 49 | if(nrow(to_install)) 50 | install.packages(to_install$pkg, repos = to_install$repo) 51 | -------------------------------------------------------------------------------- /r4bd.bib: -------------------------------------------------------------------------------- 1 | @article{Herrera2010, 2 | author = {Herrera, Juan C and Work, Daniel B and Herring, Ryan and Ban, Xuegang Jeff and Jacobson, Quinn and Bayen, Alexandre M}, 3 | file = {:media/robin/data/Copy/lit/2009/Berkeley, Daniel, Jeff - 2009.pdf:pdf}, 4 | issn = {0968-090X}, 5 | journal = {Transportation Research Part C: Emerging Technologies}, 6 | number = {4}, 7 | pages = {568--583}, 8 | publisher = {Elsevier}, 9 | title = {{Evaluation of traffic data obtained via GPS-enabled mobile phones: The Mobile Century field experiment}}, 10 | volume = {18}, 11 | year = {2010} 12 | } 13 | 14 | @book{Karau2015, 15 | title={Learning Spark: Lightning-Fast Big Data Analysis}, 16 | author={Karau, Holden and Konwinski, Andy and Wendell, Patrick and Zaharia, Matei}, 17 | year={2015}, 18 | publisher={" O'Reilly Media, Inc."} 19 | } 20 | 21 | @article{Miller1992, 22 | title={Algorithm AS 274: Least squares routines to supplement those of Gentleman}, 23 | author={Miller, Alan J}, 24 | journal={Applied Statistics}, 25 | pages={458--478}, 26 | year={1992}, 27 | publisher={JSTOR} 28 | } 29 | 30 | @article{Ryza2014, 31 | title={Advanced Analytics with Spark}, 32 | author={Ryza, Sandy and others}, 33 | journal={by Ann Spencer. 
O’Reilly}, 34 | year={2014} 35 | } 36 | 37 | @Manual{dplyr, 38 | title = {dplyr: A Grammar of Data Manipulation}, 39 | author = {Hadley Wickham and Romain Francois}, 40 | year = {2015}, 41 | note = {R package version 0.4.2}, 42 | url = {http://CRAN.R-project.org/package=dplyr}, 43 | } 44 | 45 | @book{Wickham2014, 46 | title={Advanced R}, 47 | author={Wickham, Hadley}, 48 | year={2014}, 49 | publisher={CRC Press} 50 | } 51 | 52 | @article{Lovelace2015, 53 | author = {Lovelace, Robin and Clarke, Martin and Cross, Philip and Birkin, Mark}, 54 | journal = {Geographical Analysis}, 55 | keywords = {geographical analysis}, 56 | title = {{From Big Noise to Big Data: towards the verification of large datasets for understanding regional retail flows}}, 57 | url = {http://onlinelibrary.wiley.com/doi/10.1111/gean.12081/pdf}, 58 | year = {2015} 59 | } 60 | 61 | @article{tidy-data, 62 | author = {Wickham, Hadley}, 63 | issn = {1548-7660}, 64 | journal = {The Journal of Statistical Software}, 65 | keywords = {data cleaning,data tidying,r,relational databases}, 66 | number = {5}, 67 | title = {{Tidy data}}, 68 | url = {http://www.jstatsoft.org/v59/i10}, 69 | volume = {14}, 70 | year = {2014} 71 | } 72 | 73 | @book{kitchin2014data, 74 | author = {Kitchin, Rob}, 75 | publisher = {Sage}, 76 | title = {{The data revolution: Big data, open data, data infrastructures and their consequences}}, 77 | year = {2014} 78 | } 79 | 80 | @article{Codd1979, 81 | abstract = {During the last three or four years several investigators have been exploring “semantic models” for formatted databases. The intent is to capture (in a more or less formal way) more of the meaning of the data so that database design can become more systematic and the database system itself can behave more intelligently. Two major thrusts are clear: (I) the search for meaningful units that are as small as possible--atomic semantics; (2) the search for meaningful units that are larger than the usual n-ary relation-molecular semantics. In this paper we propose extensions to the relational model to support certain atomic and molecular semantics. These extensions represent a synthesis of many ideas from the published work in semantic modeling plus the introduction of new rules for insertion, update, and deletion, as well as new algebraic operators.}, 82 | author = {Codd, E. 
F.}, 83 | doi = {10.1145/320107.320109}, 84 | issn = {03625915}, 85 | journal = {ACM Transactions on Database Systems}, 86 | keywords = {22,29,3,33,34,39,4,70,73,and phrases,base,conceptual model,conceptual schema,cr categories,data model,data semantics,database,database schema,entity model,knowledge,knowledge representation,relation,relational database,relational model,relational schema,semantic model}, 87 | number = {4}, 88 | pages = {397--434}, 89 | title = {{Extending the database relational model to capture more meaning}}, 90 | url = {http://sites.google.com/site/eherrerao902/p397.pdf}, 91 | volume = {4}, 92 | year = {1979} 93 | } 94 | 95 | @book{Spector2008, 96 | author = {Spector, Phil}, 97 | isbn = {0387747303}, 98 | publisher = {Springer Science \& Business Media}, 99 | title = {{Data manipulation with R}}, 100 | year = {2008} 101 | } 102 | 103 | -------------------------------------------------------------------------------- /slides/chapter1.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "R for big data" 3 | author: "Colin Gillespie & Robin Lovelace" 4 | date: "17-18 September 2015" 5 | output: ioslides_presentation 6 | --- 7 | 8 | 9 | ```{r echo=FALSE} 10 | library("pryr") 11 | library("grid") 12 | library("png") 13 | ``` 14 | 15 | # Chapter 1: Overview 16 | 17 | ## Who? 18 | 19 | * Dr Colin Gillespie 20 | * Statistics Lecturer, Newcastle University 21 | * Research interest in high performance and parallel computing 22 | * Dr Robin Lovelace 23 | * Research Fellow, Leeds Institute for Data Analytics (LIDA) 24 | * Data processing and 'analytics', focus on geographical data and sustainable transport 25 | 26 | ## Schedule 27 | 28 | * The next two days contain bleeding-edge material 29 | * Practicals are hard to provide for such new material 30 | * Regular breaks split up the day 31 | * First time this course has run 32 | * Too much material, so we will prioritise 33 | 34 | 35 | ## What is Big Data? 36 | 37 | > - Number of rows (volume)? 38 | > - Rate of build-up (velocity)? 39 | > - Number of nested list items (variety)? 40 | 41 | - **Example:** http://geo8.webarch.net/ 42 | > - Is this Big Data? 43 | 44 | ## Software choices 45 | 46 | Think of data processing as DIY 47 | 48 | ![](../figures/746px-Pistol-grip_drill.svg.png) 49 | 50 | 51 | ## Coping with big data in R 52 | 53 | * R has had a difficult relationship with big data 54 | * It loads data into the computer's RAM 55 | * This was less of a problem twenty years ago, 56 | * Small data set 57 | * Main bottleneck was thinking 58 | * Traditionally the development of a statistical model took more time than the computation. 59 | * When it comes to big data, this changes. 60 | * Nowadays data sets that are larger than your laptop's memory are commonplace. 61 | 62 | ## Example: Clustering 63 | 64 | * Even if the original data set is relatively small, the analysis can generate large objects. 65 | * For example, suppose we want to perform standard cluster analysis.
66 | * Using the built-in data set `USArrests`, we can calculate a distance matrix, 67 | ```{r} 68 | d = dist(USArrests) 69 | ``` 70 | 71 | ## Example: Clustering 72 | 73 | \noindent and perform hierarchical clustering to get a dendrogram 74 | 75 | ```{r} 76 | fit = hclust(d) 77 | ``` 78 | 79 | 80 | ```{r deno, fig.fullwidth=TRUE, fig.height=2, echo=FALSE, fig.cap="Dendrogram from USArrests data."} 81 | par(mar=c(3,3,2,1), mgp=c(2,0.4,0), tck=-.01,cex=0.5, las=1) 82 | plot(fit, labels=rownames(d)) 83 | ``` 84 | 85 | 86 | ## Example: Clustering 87 | 88 | When we inspect the object size of the original data set and the distance object 89 | ```{r} 90 | pryr::object_size(USArrests) 91 | pryr::object_size(d) 92 | ``` 93 | 94 | 95 | ## Example: Clustering 96 | 97 | * We have managed to create an object that is three times larger than the original data set 98 | * In fact the object `d` is a symmetric $n \times n$ matrix, where $n$ is the number of rows in `USArrests` 99 | * As `n` increases the size of `d` increases at rate $O(n^2)$ 100 | * So if our original data set contained $10,000$ records, the associated distance matrix would contain almost $10^8$ values. 101 | 102 | ## Buy more RAM 103 | 104 | * Since R keeps all objects in memory, the easiest way to deal with memory issues is to buy more RAM. 105 | * Currently, 16GB costs less than £100. 106 | * This small cost is quickly recouped in saved user time. 107 | * A relatively powerful desktop machine can be purchased for less than £1000. 108 | 109 | ## Cloud computing 110 | 111 | * Another alternative could be to use cloud computing. 112 | * For example, Amazon currently charge around £0.15 per gigabyte of RAM. 113 | * Currently, a $244$GB machine, with $32$ cores, costs around £3.12 per hour. 114 | 115 | ## Sampling 116 | 117 | * Do you __really__ need to load all of the data at once? 118 | * For example, if your data contains information regarding sales, does it make sense to aggregate across countries, or should the data be split up? 119 | * Assuming that you need to analyse all of your data, then random sampling could provide an easy way to perform your analysis. 120 | * It's almost always sensible to sample your data set at the beginning of an analysis until your analysis pipeline is in reasonable shape. 121 | 122 | ## Sampling 123 | 124 | * If your data set is too large to read into RAM, it may need to be 125 | *preprocessed* or *filtered* using tools external to R before 126 | reading it in. 127 | * This is the topic of chapters 3 and 6 128 | 129 | ## Integration with C++ or Java 130 | 131 | * Move small parts of the program from R to another, faster language, such as C++ or Java. 132 | * The goal is to keep R's neat way of handling data, with the higher performance offered by other languages. 133 | * Many of R's base functions are written in C or FORTRAN. 134 | * This outsourcing of code to another language can be easily hidden in another function. 135 | 136 | ## Avoid storing objects in memory 137 | 138 | * There are packages available that avoid storing data in memory. 139 | * Objects are stored on your hard disk and analysed in blocks or chunks, as sketched below. 140 | * Hadoop is an example of this technique. 141 | * This strategy is perfect for dealing with large amounts of data. 142 | * Unfortunately, many algorithms haven't been designed with this principle in mind. 143 | * This means that only a few R functions that have been explicitly created to deal with specific chunk data types will work.
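* A minimal sketch of the chunk-based idea, using the **ff**/**ffbase** packages covered later in the course (the file and column names here are hypothetical):

```{r eval=FALSE}
# sketch only: assumes ff/ffbase are installed and a large CSV exists on disk
library("ff")
library("ffbase")
# read the file in chunks; the result is memory-mapped on disk, not held in RAM
big = read.csv.ffdf(file = "data/largefile.csv", next.rows = 100000)
# ffbase supplies chunk-aware methods for many base functions
mean(big$speed)
```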
144 | 145 | ## Alternative interpreters 146 | 147 | * Due to the popularity of R, it is now possible to use alternative interpreters 148 | * The interpreter is where the code is run 149 | 150 | ## pqR 151 | * [pqR](http://www.pqr-project.org/) (pretty quick R) is a new version of the R interpreter. 152 | * One major downside is that it is based on R-2.15.0. 153 | * The developer (Radford Neal) has made many improvements, some of which have now been incorporated into base R. 154 | * __pqR__ is an open-source project licensed under the GPL. 155 | * One notable improvement in pqR is that it is able to do some numeric computations in parallel with each other. 156 | 157 | ## Renjin 158 | 159 | * [Renjin](http://www.renjin.org/) reimplements the R interpreter in Java, so it can run on the Java Virtual Machine (JVM) 160 | * Since the interpreter is pure Java, it can run anywhere the JVM runs. 161 | 162 | ## Tibco 163 | 164 | * [Tibco](http://spotfire.tibco.com/) created a C++-based interpreter called TERR. 165 | 166 | ## Oracle 167 | 168 | * Oracle also offer an R interpreter that uses Intel's mathematics library and therefore achieves a higher performance without changing R's core. 169 | 170 | ## Course R package 171 | 172 | * There is a companion R package for this course. 173 | 174 | ```{r eval=FALSE} 175 | install.packages("drat") 176 | drat::addRepo("rcourses") 177 | install.packages("r4bd") 178 | ``` 179 | 180 | ## Notes 181 | 182 | * The intention is to add more material to the notes 183 | * Currently hosted on github.com 184 | * If you would like access, email Robin your github username 185 | * Comments/corrections welcome 186 | 187 | 188 | 189 | 190 | -------------------------------------------------------------------------------- /slides/chapter10.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Apache Spark" 3 | author: "Colin Gillespie" 4 | date: "17-18 September 2015" 5 | output: ioslides_presentation 6 | --- 7 | 8 | ## What is Apache Spark? 9 | 10 | * Apache Spark is a computing platform whose goal is to make the analysis of large datasets fast. 11 | * Spark extends the __MapReduce__ paradigm for parallel computing to support a wide range of operations. 12 | * A key feature of Spark is that it can run complex computational tasks both in-memory and on disk. 13 | 14 | ## What is Apache Spark? 15 | 16 | * The Spark project contains multiple tightly integrated components. 17 | * This closely coupled design means that improvements in one part of the Spark engine are automatically used by other components. 18 | * Another benefit of the tight coupling between Spark's components is that there is a single system to maintain, which can be crucial for large organisations. 19 | 20 | ## The Spark stack 21 | 22 | * Spark Core: The Core contains the basic functionality of Spark, such as memory management, fault recovery, and interacting with storage systems. Spark Core also provides the APIs that the other components use to access the distributed data collections (RDDs) described later. 23 | * Spark SQL: This library provides an SQL interface for interacting with databases. 24 | * Spark Streaming: Functionality designed to ease the management of data collected in real time. 25 | 26 | ## The Spark stack 27 | 28 | * MLlib: A scalable machine learning library. 29 | * GraphX: A relatively new component in Spark for representing and analysing phenomena that can be represented as graphs, such as person-to-person links in social media. 30 | * Cluster Managers.
31 | * Third party libraries: As with R and Python, developers are encouraged to 32 | extend Spark. Over 100 additional libraries have been contributed so far, 33 | each of which can be rated by the community\sidenote{\url{http://spark-packages.org/}}. 34 | 35 | ## Does anyone use it? 36 | 37 | * Spark is rapidly becoming a key component in analysing data sets that have to be __distributed__ across multiple computers. 38 | * To get an idea of Spark's popularity, browse the Powered by Spark webpage (see notes) 39 | 40 | 41 | ## What about Hadoop? 42 | 43 | * The `rmr2` package allows R to use Hadoop MapReduce. 44 | * However, development on this package has slowed. 45 | * The author notes that the lack of activity on `rmr2` is due to two reasons: 46 | * Package maturity. 47 | * The general shift away from Hadoop MapReduce towards Spark. 48 | 49 | ## SparkR 50 | 51 | * `SparkR` was released as a separate component of Spark in 2014. 52 | * In June 2015, it was merged into the main Spark distribution. 53 | * The Spark developers are still deciding on the API of `SparkR`. 54 | 55 | 1. The full functionality of Spark is __not__ yet available via `SparkR`. 56 | 1. If you find code online, it's likely not to work since it uses the old `SparkR` package. 57 | 1. These notes will go out of date. 58 | * But likely to be kept up-to-date 59 | 60 | ## A first Spark instance 61 | 62 | * We need to install Spark. 63 | * After Spark has been installed, we set the environment variable `SPARK_HOME`. 64 | * This can be done in our `.bashrc` file, or in R itself, via 65 | 66 | ```{r eval=FALSE, echo=1} 67 | Sys.setenv(SPARK_HOME="/path/to/spark/") 68 | ``` 69 | * Then we load the `SparkR` package 70 | 71 | ```{r eval=FALSE} 72 | library("SparkR") 73 | ``` 74 | 75 | ## A first Spark instance 76 | 77 | Next we initialise the Spark cluster and create a `SparkContext` 78 | ```{r eval=FALSE} 79 | sc = sparkR.init(master="local") 80 | ``` 81 | 82 | ## A first Spark instance 83 | 84 | * The `sparkR.init` function has a number of arguments. 85 | * In particular, if we want to use any Spark packages, these should be specified during the initialisation stage. 86 | * So if we wanted to load in a CSV file, we would need something like 87 | ```{r eval=FALSE} 88 | sc = sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") 89 | ``` 90 | 91 | ## A first Spark instance 92 | 93 | When we finish our Spark session, we should terminate the Spark context via 94 | 95 | ```{r eval=FALSE} 96 | sparkR.stop() 97 | ``` 98 | 99 | ## Resilient distributed datasets (RDD) 100 | 101 | * The core feature of Spark is the resilient distributed dataset (RDD). 102 | * An RDD is an abstraction that helps us deal with big data. 103 | * An RDD is a distributed collection of elements (including data and functions). 104 | * In Spark everything we do revolves around RDDs. 105 | * Typically, we may want to create, transform or operate on the distributed data set, as the sketch below illustrates. 106 | * Spark automatically handles how the data is distributed across your computer/cluster and parallelises operations where possible.
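* A small illustration of the idea (assuming the Spark context `sc` created above); the same filter is written first in base R, then as an RDD transformation plus action:

```{r eval=FALSE}
# base R: the whole vector sits in memory
x = 1:100
x_even = x[x %% 2 == 0]
length(x_even)

# SparkR: the same idea on a distributed dataset
x_sp = SparkR:::parallelize(sc, 1:100)                       # create an RDD
even_sp = SparkR:::filterRDD(x_sp, function(i) i %% 2 == 0)  # transformation (lazy)
count(even_sp)                                               # action: triggers the computation
```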
107 | 108 | ## Example: Moby Dick 109 | 110 | * Moby Dick text from the Project Gutenberg website 111 | * Assuming that we are already in a Spark session 112 | ```{r eval=FALSE} 113 | ## Note ::: 114 | moby = SparkR:::textFile(sc, "data/moby_dick.txt") 115 | ``` 116 | * Note that we also pass the Spark context object, `sc` 117 | 118 | 119 | ## Example: Moby Dick 120 | 121 | In any case, the `moby` object is an RDD object 122 | ```{r eval=FALSE, tidy=FALSE} 123 | R> moby 124 | # MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:-2 125 | ``` 126 | 127 | ## Transformation 128 | 129 | * A *transformation* operation constructs a new RDD based on the previous one. 130 | 131 | ## Example: Transformation 132 | 133 | * Suppose we want to extract the lines that contain the word `Moby` from our data set. 134 | * This is a standard operation: we have a dataset and we want to remove certain values. 135 | * To do this we use a standard R approach, by creating a function called `get_moby` that only returns `TRUE` or `FALSE`. 136 | * The `filterRDD` function then retains any rows that are `TRUE`, i.e. 137 | ```{r eval=FALSE, tidy=FALSE} 138 | get_moby = function(x) 139 | "Moby" %in% strsplit(x, split = " ")[[1]] 140 | mobys = SparkR:::filterRDD(moby, get_moby) 141 | ``` 142 | ## Example: Transformation 143 | 144 | * This is a functional approach to programming and is similar to the `apply` family. 145 | * Also, we need `get_moby` to be efficient! 146 | 147 | ## Action 148 | 149 | * An __action__ computes a result based on an RDD. 150 | * The result is either displayed or stored somewhere else on the system. 151 | 152 | ## Example: Action 153 | 154 | If we want to know how many rows contain the word `Moby`, we use the `count` function 155 | ```{r eval=FALSE} 156 | ## The answer is 77 157 | count(mobys) 158 | ``` 159 | 160 | ## Lazy evaluation 161 | 162 | * Spark deals with transformations and actions in two different ways. 163 | * Like `dplyr`, Spark uses lazy evaluation: it only performs the computation when it is needed by an action. 164 | * In the example above, the `textFile` and `filterRDD` commands are run only when we use `count`. 165 | 166 | ## Lazy evaluation 167 | 168 | * The developers of Spark and `dplyr` recognize that lazy evaluation is essential when working with big data. 169 | * This can be seen by considering the example above. 170 | * If SparkR actually ran `textFile` straight away, this would use up a large amount of memory. 171 | * This is a waste, since we immediately filter out the vast majority of the text. 172 | * Instead, SparkR (via Spark) takes the chain of transformations and performs the computation on the minimum amount of data needed to get the result. 173 | 174 | ## Caching 175 | 176 | Spark's RDDs are (by default) recomputed each time you run an action on them.\sidenote{Alternatively, we could use \texttt{cache(moby)}, which is the same as \texttt{persist} with the default level of storage.} To reuse an RDD in multiple operations we can ask Spark to persist it via 177 | ```{r eval=FALSE} 178 | ## There are different levels of storage 179 | persist(mobys, "MEMORY_ONLY") 180 | ``` 181 | 182 | \noindent If you are not planning on reusing the object, don't use persist. 183 | 184 | ## Summary 185 | 186 | To summarise, every Spark session will have a similar structure (sketched below). 187 | 188 | * Create a resilient distributed dataset (RDD). 189 | * Transform and manipulate the data set. 190 | * For key data sets, use `persist` for efficiency. 191 | * Retrieve the results via an action such as `count`.
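A hedged end-to-end sketch of that structure, assuming Spark is installed locally and `SPARK_HOME` points at it:

```{r eval=FALSE}
Sys.setenv(SPARK_HOME = "/path/to/spark/")          # adjust to your installation
library("SparkR")
sc = sparkR.init(master = "local")                  # start a local Spark context
moby = SparkR:::textFile(sc, "data/moby_dick.txt")  # 1. create an RDD
get_moby = function(x) "Moby" %in% strsplit(x, split = " ")[[1]]
mobys = SparkR:::filterRDD(moby, get_moby)          # 2. transform it (lazy)
persist(mobys, "MEMORY_ONLY")                       # 3. persist the RDD we reuse
count(mobys)                                        # 4. action: retrieve the result
sparkR.stop()                                       # finally, terminate the context
```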
192 | 193 | ## Loading data: creating RDDs 194 | 195 | There are two ways of creating an RDD: by parallelizing an existing dataset; or from an external data source, such as a database or csv file. 196 | 197 | ## Loading data: creating RDDs 198 | 199 | The easiest way to create an RDD file is from an existing data set which can be passed to the `parallelize` function. If you use this method it may mean the data is relatively small and you don't need to use Spark. Nevertheless, applying the method on small datasets will help you to learn Spark/SparkR quickly, since you can quickly test and prototype code. To create an RDD representation of the vector `1:100`, we would use 200 | 201 | ```{r eval=FALSE} 202 | vec_sp = SparkR:::parallelize(sc, 1:100) 203 | ``` 204 | 205 | ## Being lazy again 206 | 207 | \noindent As before, we don't actually compute `vec_sp` until it is needed, 208 | ```{r eval=FALSE} 209 | count(vec_sp) 210 | ``` 211 | 212 | \noindent Typically, we would want to load data from external data sets. This could, for example, be from a text file using `textFile` described above, or from CSV file (again via `textFile`), provided you have loaded the correct library. 213 | 214 | ## Example: Spark dataframes 215 | 216 | Suppose we have already initialised a Spark context. To use Spark data frames, we need to create an `SQLContext`, via 217 | ```{r eval=FALSE} 218 | sql_context = sparkRSQL.init(sc) 219 | ``` 220 | 221 | \noindent The `SQLContext` enables us to create data frames from a local R data frame, 222 | 223 | ```{r eval=FALSE} 224 | chicks_sp = createDataFrame(sql_context, chickwts) 225 | ``` 226 | 227 | ## Example: Spark dataframes 228 | 229 | \noindent or from other data sources, such as CSV files or a 'Hive 230 | table'.\sidenote{A hive is a data structure used by Hadoop.} 231 | If we examine the newly created object, we get 232 | 233 | ```{r eval=FALSE} 234 | R> chicks_sp 235 | # DataFrame[weight:double, feed:string] 236 | ``` 237 | 238 | ## Example: Spark dataframes 239 | 240 | \noindent An S3 method for `head` is also available, so 241 | 242 | ```{r eval=FALSE, tidy=FALSE} 243 | R> head(chicks_sp, 2) 244 | # weight feed 245 | #1 179 horsebean 246 | #2 160 horsebean 247 | ``` 248 | 249 | \noindent gives what we would expect. We can extract columns using the dollar notation, `chicks_sp$weight` or using `select` 250 | ```{r eval=FALSE} 251 | R> select(chicks_sp, "weight") 252 | # DataFrame[weight:double] 253 | ``` 254 | ## Example: Spark dataframes 255 | 256 | \noindent We can subset or filter the data frame using the `filter` function 257 | ```{r eval=FALSE} 258 | filter(chicks_sp, chicks_sp$feed == "horsebean") 259 | ``` 260 | 261 | \noindent Using Spark data frames, we can also easily group and aggregate data frames. (Note this is similar to the `dplyr` syntax). 
For example, to count the number of chicks in each feed group, we group, and then summarise: 262 | 263 | ```{r eval=FALSE, tidy=FALSE} 264 | chicks_cnt = groupBy(chicks_sp, chicks_sp$feed) %>% 265 | summarize(count=n(chicks_sp$feed)) 266 | ``` 267 | ## Example: Spark dataframes 268 | 269 | \noindent Then use `head` to view the top rows 270 | 271 | ```{r eval=FALSE, tidy=FALSE} 272 | head(chicks_cnt, 2) 273 | # feed count 274 | #1 casein 12 275 | #2 meatmeal 11 276 | ``` 277 | 278 | \noindent We can also use arrange the data by the most common group 279 | ```{r eval=FALSE, tidy=FALSE} 280 | arrange(chicks_cnt, desc(chicks_cnt$count)) 281 | ``` 282 | 283 | # Resources 284 | 285 | * Apache Spark homepage\sidenote{\url{https://spark.apache.org/}} 286 | * Learning Spark [@Karau2015] 287 | * Advanced analytics with Spark [@Ryza2014] 288 | * `dplyr` with `Spark`. Experimental, but worth watching\sidenote{\url{https://github.com/RevolutionAnalytics/dplyr-spark}} 289 | 290 | \clearpage 291 | -------------------------------------------------------------------------------- /slides/chapter2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Memory" 3 | author: "Colin Gillespie" 4 | date: "17-18 September 2015" 5 | output: ioslides_presentation 6 | --- 7 | 8 | # Chapter 2: Memory matters 9 | 10 | ## Why include this material? 11 | 12 | * Explain data sizes 13 | * If we want a new computer, what should we ask for? 14 | 15 | ## Memory 16 | 17 | * What is big? 18 | * Everything is relative. 19 | * A big data set from thirty years ago could probably be processed with ease using todays computers. 20 | * A data set we consider "big" 21 | * Google/Facebook would consider small. 22 | * When talking about big data we need a comparison. 23 | * In this course, __big__ is relative to our available resources. 24 | 25 | ## 2.1 File sizes 26 | 27 | * A computer cannot store "numbers" or "letters" 28 | * The only thing a computer can store and work with is bits. 29 | * A bit is binary, it is either a $0$ or a $1$. 30 | * In the past we used the ASCII character set. 31 | * This set defined $128$ characters including $0$ to $9$, 32 | * Upper and lower case alpha-numeric 33 | * A few control characters such as a new line. 34 | * To store these characters required $8$ bits 35 | 36 | ## Bit Representation 37 | 38 | Bit representation | Character 39 | -------------------|--------- 40 | $01000001$ | A 41 | $01000010$ | B 42 | $01000011$ | C 43 | $01000100$ | D 44 | $01000101$ | E 45 | $01010010$ | R 46 | 47 | ## Encoding 48 | 49 | * It's worth pointing out that storing characters has got more sophisticated in recent years. 50 | * UTF-8, UTF-16, ... 51 | * In particular, Unicode was created to try to create a single character set that included every reasonable writing system. 52 | * http://www.joelonsoftware.com/articles/Unicode.html for a nice introduction on character encoding. 53 | 54 | ## Bits and bytes 55 | 56 | * Eight bits is one byte 57 | * Four bits is called a nibble 58 | * So two characters would use two bytes or 16 bits. 59 | * A document containing $100$ characters would use $100$ bytes ($800$ bits). 
60 | * This assumes that the file didn't have any other memory overhead, such as font information or meta-data
61 | * An empty `.docx` file takes about 3.7KB of storage
62 | 
63 | ## Standards
64 | 
65 | When people first started to think about computer memory, they noticed that
66 | \[
67 | 2^{10} = 1024 \simeq 1000
68 | \]
69 | and
70 | \[
71 | 2^{20} =1,048,576\simeq 10^6
72 | \]
73 | so they adopted the shorthand of kilo- and mega-bytes.
74 | 
75 | ## Standards
76 | 
77 | * Of course, everyone knew that it was just a shorthand, and that it was really a binary power.
78 | * When computers became more widespread, foolish people like you and me just assumed that kilo did actually mean $10^3$ bytes.
79 | * The IEEE Standards Board decided that IEEE standards will use the SI prefixes.
80 | 
81 | ## Standards
82 | 
83 | * So a kilobyte (KB) is 1000 bytes and a megabyte (MB) is 1000 kilobytes.
84 | * A petabyte is approximately 100 million filing cabinet drawers filled with text.
85 | * Astonishingly, Google processes around $20$ petabytes of data every day.
86 | * On 28th August 2015, over 1 billion users logged on to Facebook.
87 | 
88 | ## Data conversion table
89 | 
90 | Factor | Name | Symbol | Origin|Factor |Derivation
91 | -------|-------|---------|--------|-------|-------------
92 | $2^{10}$| kibi | Ki| Kilobinary: |$(2^{10})^1$ | Kilo: $(10^3)^1$
93 | $2^{20}$| mebi | Mi| Megabinary: |$(2^{10})^2$ | Mega: $(10^3)^2$
94 | $2^{30}$| gibi | Gi| Gigabinary: |$(2^{10})^3$ | Giga: $(10^3)^3$
95 | $2^{40}$| tebi | Ti| Terabinary: |$(2^{10})^4$ | Tera: $(10^3)^4$
96 | $2^{50}$| pebi | Pi| Petabinary: |$(2^{10})^5$ | Peta: $(10^3)^5$
97 | 
98 | ## IEEE Standards (more or less)
99 | 
100 | * Even though there is an agreed IEEE standard, that doesn't mean that everyone follows it.
101 | * For example, Microsoft Windows uses 1MB to mean $2^{20}$B.
102 | * Even more confusing, the capacity of a $1.44$MB floppy disk is a mixture: $1\text{MB} = 10^3 \times 2^{10}$B.
103 | 
104 | ## Exercises
105 | 
106 | 1. R loads everything into memory, i.e. your computer's RAM. How much RAM do you have\sidenote{Feel free to Google, "How much RAM do I have?"}?
107 | 2. Using Google, how much does it cost (in pounds) to double the amount of available RAM?
108 | 3. How much does it cost to rent a machine comparable to your laptop in the cloud (Amazon AWS)?
109 | 
110 | ## 2.2 Hard Disk Drive (HDD)
111 | 
112 | * Unless you have a fairly expensive laptop, you've probably got a standard hard disk drive (HDD)
113 | * HDDs were first introduced by IBM in 1956.
114 | * Data is stored using magnetism on a rotating platter.
115 | * The faster the platter spins, the faster the HDD can perform.
116 | * Many laptop drives spin at either 5400RPM (Revolutions Per Minute) or 7200RPM.
117 | * The major advantage of HDDs is that they are cheap.
118 | * A 1TB laptop hard drive is becoming standard.
119 | 
120 | ## Solid state drives (SSD)
121 | 
122 | * SSDs can be thought of as large, but more sophisticated, versions of your USB stick.
123 | * They have no moving parts and information is stored in microchips.
124 | * Since there are no moving parts, reading/writing is much quicker.
125 | 
126 | ## Read/write speeds
127 | 
128 | * The read/write speed for a standard HDD is usually in the region of 50-120MB/s (usually nearer the 50 than the 120).
129 | * For SSDs, speeds are typically over 200MB/s and for top of the range models, closer to 500MB/s.
130 | * If you're wondering, read/write speeds for RAM are around 2-20GB/s.
131 | * So at best, SSDs are at least one order of magnitude slower than RAM, but still faster than standard HDDs. 132 | 133 | # 2.3 Operating systems 134 | 135 | ## 32 bit or 64 bit 136 | 137 | * When we suggest that you should just buy more RAM, this assumes that you are using a 64 bit operating system 138 | * A 32 bit machine can only access at most 4GB RAM. Although some CPUs offer solutions to this limitation, i.e. the OS can access more memory 139 | * If you are running a 32 bit operating system, then R is limited to around 3GB RAM 140 | * If you are running a 64 bit operating system, but only a 32 bit version R, then you have access to slightly more memory (but not much). 141 | * Hopefully, you are running a 64 bit operating system, with a 64 bit version of R. Your limit is now measured in Terabytes. 142 | 143 | 144 | ## Exercises 145 | 146 | 1. Are you using a 32 bit or 64 bit operating system? 147 | 2. Are you using 32 bit or 64 bit version of R? 148 | 3. What are the results of running the command 149 | ```{r results="hide", message=FALSE, warning=FALSE} 150 | memory.limit() 151 | ``` 152 | 153 | 154 | ## R data types 155 | 156 | * When programming in C or FORTRAN, we have to specify the data type of every object we create 157 | * The benefit of this is that the compiler can perform clever optimisation. 158 | * The downside is that programme length is longer. 159 | * In R we don't tend to worry about about data types. 160 | * For the most part, numbers are stored in double-precision floating-point format. 161 | * But R does have other ways of storing numbers. 162 | 163 | # 2.4 R data types 164 | 165 | ## `numeric` 166 | 167 | * The `numeric` function is the same as a `double`. 168 | * However, `is.numeric` is also true for integers. 169 | 170 | ## single 171 | 172 | * R doesn't have a single precision data type. 173 | * All real numbers are stored in double precision format. 174 | * The functions `as.single` and `single` are identical to `as.double` and `double` except they set the attribute `Csingle` that is used in the `.C` and `.Fortran` interface. 175 | 176 | ## `integer` 177 | 178 | * Integers exist to be passed to C or Fortran code. 179 | * Typically, we don't worry about creating integers. 180 | * However, they are occasionally used to optimise subsetting operations. 181 | * When we subset a data frame or matrix, we are interacting with C code. 182 | * If we look at the arguments for the `head` function 183 | ```{r} 184 | args(head.matrix) 185 | ``` 186 | The default argument is `6L` (the `L` is creating an integer object). 187 | 188 | ## Storage costs 189 | 190 | * Different data types, such as (ASCII) characters, integers and doubles, have different storage costs 191 | * It is helpful to understand why files are large and the limits of the computer system. 192 | 193 | ## Storage space for standard data types 194 | 195 | Type | Amount (Bytes) 196 | -----|----------------- 197 | Character | 1 198 | Integer | 4 199 | Double | 8 200 | 201 | 202 | ## Exercises 203 | 204 | 1. To get an idea of when to use the integer data type, it is helpful to look at the source code of some commonly used functions. Have a look at the following function definitions: 205 | * `tail.matrix` 206 | * `[.data.frame` 207 | * `lm` 208 | 2. How does the function `seq.int`, which was used in the `tail.matrix` function, differ to the standard `seq` function? 209 | 3. Suppose you had a file with a single column of doubles. 
Approximately, what is the maximum number of rows you could load into R?\marginnote{A vague rule of thumb is that need the RAM to be three times the size of the data set.} 210 | 4. (Hard) What is the range of values an integer object can represent? 211 | 212 | 213 | ## 2.5 Object size in R 214 | 215 | * When thinking about sizes of objects in R, it's a little bit more complicated than simply multiplying the data type by the number of bytes 216 | 217 | ## Object size in R 218 | 219 | * Object meta data: this is information on the base data type and memory management. 220 | * For example, is the object a logical, character, or numeric data type? 221 | * Three pointers: these are addresses to where memory is stored on the hard drive. One of the pointers is used to access the attribute list. The other two pointers help R move between object on your hard drive. 222 | * The length of the vector 223 | * The data. 224 | 225 | ## Object size in R 226 | 227 | * We can examine the size of an object using the base function `object.size` 228 | * However, `object_size` in the `pryr` package is a similar function that counts more accurately and includes the environment size 229 | * For example, an empty vector is 40 bytes 230 | 231 | ```{r} 232 | library("pryr") 233 | object_size(numeric(0)) 234 | ``` 235 | 236 | ## Grow carefully 237 | 238 | * Since asking for more memory is a relatively expensive operation, R asks for more than is needed when growing objects 239 | * In particular, R's vectors are always $2^3=8$, $2^4=16$, $2^5=32$, $48$ $2^6=64$ or $2^7=128$ bytes long 240 | * After $128$ bytes, R only asks memory in multiples of $8$ bytes. 241 | 242 | ## Grow carefully 243 | 244 | Let's start with a simple vector 245 | 246 | ```{r} 247 | v1 = 1:1e6 248 | ``` 249 | 250 | ## Grow carefully 251 | 252 | * When we use the `:` operator, we are actually creating a vector of integers. 253 | * Remember that an integer is only $4$ bytes, so this is more efficient. 254 | * To manually calculate the object size of `v1`, we have 255 | \[ 256 | 4\times 10^6 \,\text{bytes} \simeq 4 \,\text{MB} 257 | \] 258 | ```{r} 259 | object_size(v1) 260 | ``` 261 | 262 | ## Sequence 263 | 264 | If we create a similar vector using the sequence command 265 | ```{r} 266 | v2 = seq(1, 1e6, by=1) 267 | object_size(v2) 268 | ``` 269 | we find that the size of `v2` is double that of `v1`. 270 | * This is because when we use the `:` operator we create a vector with type `integer`, whereas the `seq` command has created a vector of `doubles` (see table \ref{T2.3}). 271 | 272 | ## Copies 273 | 274 | R is also tries to avoid making unnecessary copies of objects. For example, consider the following two lists 275 | ```{r} 276 | l1 = list(v1, v1) 277 | l2 = list(v1, v2) 278 | ``` 279 | 280 | ## Copies 281 | 282 | When we investigate the object sizes, we see that `v1` hasn't been double counted in `l1` 283 | ```{r} 284 | object_size(l1) 285 | object_size(l2) 286 | ``` 287 | Moreover, if we look at the combined size of the two lists, 288 | ```{r} 289 | object_size(l1, l2) 290 | ``` 291 | \noindent we still see that `v1` has only been counted once. 292 | 293 | ## Exercises 294 | 295 | 1. Use the `object_size` function to investigate some standard objects. For example, vectors, data frames, functions and matrices. 296 | 1. Create a matrix, data frame and list. What size are the empty objects? For each object, add two columns/nodes of $10$ random numbers. Comment on the results. 297 | 1. Create three vectors using `seq`, `seq.int` and `:`. 
Compare the sizes.
298 | 1. Run the following piece of code. Can you interpret the jumps in the graph?
299 | ```{r fig.keep='none', tidy=FALSE}
300 | n = 20
301 | x = numeric(n)
302 | for(i in (1:n))
303 |   x[i] = object_size(numeric(i-1))
304 | plot(1:n-1, x, type="l")
305 | ```
306 | 1. Change the first value of the vector in the list `l1`. Rerun the `object_size` commands. What has happened?
307 | 
308 | 
309 | ## 2.6 Collecting the garbage
310 | 
311 | * The `object_size` function tells you the size of a particular object.
312 | * The function `mem_used` tells you the amount of memory that is being used by R.
313 | * Since managing memory is a complex process, the figure reported isn't exact; it isn't even obvious what we mean by __memory used__.
314 | * The value returned by `mem_used` only includes objects created by R, not R itself.
315 | * Also, manipulating memory is an expensive operation, so the OS and R are lazy at reclaiming memory (this is a good thing).
316 | 
317 | ## Collecting the garbage
318 | 
319 | * In some languages, such as C, the programmer has the fun task of being in charge of managing memory.
320 | * Every time the programmer asks for more memory using `malloc` there should be a corresponding call (somewhere) to `free`.
321 | * When the call to `free` is omitted, this is known as a memory leak.
322 | * In R we don't have to worry about freeing memory; the garbage collector takes care of it.
323 | 
324 | ## Example: Collecting the garbage
325 | 
326 | ```{r tidy=FALSE}
327 | g = function() {
328 |   z = 1:1e7
329 |   message("Mem used: ", round(mem_used()/10^6), "MB")
330 |   TRUE
331 | }
332 | ```
333 | 
334 | ## Example: Collecting the garbage
335 | 
336 | Calculate the current memory being used
337 | 
338 | ```{r}
339 | mem_used()
340 | ```
341 | 
342 | When we call `g` and calculate the memory used after the call
343 | 
344 | ```{r}
345 | x = g()
346 | mem_used()
347 | ```
348 | 
349 | \noindent The memory usage hasn't changed. Since the variable `z` is only referenced inside the function, the associated memory is freed after the function call has ended.
350 | 
351 | ## 2.6 Collecting the garbage
352 | 
353 | * We can force a call to the garbage collector via `gc()`, or explicitly delete objects with `rm()`.
354 | * However, this is almost never needed.
355 | * R is perfectly able to manage its own memory and you rarely need to use `gc()` or `rm()` to clean up.
356 | * We can adjust the garbage collector's strategy by setting the environment variable `R_GC_MEM_GROW` to an integer value between $0$ and $3$.
357 | * Typically, we don't alter these variables.
358 | 
359 | ## 2.7 Monitoring memory change
360 | 
361 | * There are tools available to dynamically monitor changes in memory.
362 | * The first, `pryr::mem_change`, is useful for determining the effect of an individual command. R also comes with a memory profiler, `utils::Rprof`.
363 | * But see `lineprof` instead (GitHub only).
364 | 
365 | 
366 | ## 2.8 Sparse matrices
367 | 
368 | If we recall the simple clustering example in the last chapter, we calculated a distance matrix via
369 | ```{r}
370 | d = dist(USArrests)
371 | ```
372 | \noindent Intuitively, since the matrix `d` is symmetric around the diagonal, it makes sense to exploit this characteristic in terms of storage. In particular, storage should be halved.
373 | 
374 | ## Sparse matrices
375 | 
376 | * A sparse matrix is simply a matrix in which most of the elements are zero.
377 | * Conversely, if most elements are non-zero, the matrix is considered dense.
378 | * The proportion of non-zero elements is called the sparsity. 379 | * Large sparse matrix often crop up when performing numerical calculations. 380 | * Typically, our data isn't sparse, but the resulting data structures we create may be sparse. 381 | * There are a number of techniques/methods used to store sparse matrices. 382 | * Methods for creating sparse matrices can be found in the `Matrix` package. 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | -------------------------------------------------------------------------------- /slides/chapter3.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data formats" 3 | author: "Robin Lovelace and Colin Gillespie" 4 | date: "`r format(Sys.Date(), '%d %B %Y')`, Leeds Institute for Data Analytics" 5 | output: ioslides_presentation 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | # knitr::opts_chunk$set(echo = FALSE) 10 | ``` 11 | 12 | # File formats and R 13 | 14 | ## Session information 15 | 16 | > - ~1 hour session 17 | > - 50:50 talking/practical split 18 | > - learn about file formats and how to read/write them in R 19 | > - often a key bottleneck if you're not used to R 20 | 21 | 22 | ## The history of storage devices 23 | 24 | > - 1930s: ![](http://www.zetta.net/images/pic2.jpg) 25 | > - 1960s: cassettes and hard discs 26 | > - 1970s & 1980s: floppy discs 27 | > - 1990s: CD ROMs 28 | > - 2000s: SD cards, USBs, DVDs 29 | > - 2010s: Cloud Backup Solutions 30 | 31 | Source: [zetta.net](http://www.zetta.net/history-of-computer-storage/) 32 | 33 | ## The history of data formats 34 | 35 | Human readable formats: 36 | 37 | > - .csv: 1967 38 | > - .json: 2001 39 | 40 | Binary formats: 41 | 42 | > - .xls 43 | > - .sav 44 | > - .Rdata 45 | 46 | Databases 47 | 48 | > - e.g. Postgres 49 | 50 | ## Why are data formats important? 51 | 52 | ![](https://upload.wikimedia.org/wikipedia/commons/8/87/IBM_card_storage.NARA.jpg) 53 | 54 | > - Punchcards from 1959 (source: [Wikipedia](https://en.wikipedia.org/wiki/Punched_card)) 55 | 56 | ## Getting to know your data 57 | 58 | ![](../figures/know_data.jpg) 59 | 60 | Source (cc licence): https://flic.kr/p/oJdg64 61 | 62 | ```{r, echo=FALSE} 63 | # Note: the point here is that people don't usually think about formats until they break! 64 | ``` 65 | 66 | 67 | ## What data formats do you use? 68 | 69 | 70 | 71 | # Loading data in R 72 | 73 | ## Quick-fire example from the tutorial 74 | 75 | ```{r, eval=FALSE} 76 | # write a .csv file to the screen 77 | df <- cars[1:3,] 78 | write.csv(x = df) 79 | # save to file 80 | write.csv(x = cars[1:3,], "data/minicars.csv") 81 | df 82 | ``` 83 | 84 | ```{r, echo=FALSE} 85 | df <- cars[1:3,] 86 | df 87 | ``` 88 | 89 | ## Reading csv files 90 | 91 | ```{r, eval=FALSE} 92 | df <- read.csv("data/minicars.csv") 93 | df 94 | ``` 95 | 96 | ```{r, echo=FALSE} 97 | df <- read.csv("../data/minicars.csv") 98 | # knitr::kable(df) 99 | df 100 | ``` 101 | 102 | > **Quiz:** what happened to df? 103 | 104 | ## Writing without row names 105 | 106 | ```{r} 107 | write.csv(df, row.names = F) 108 | write.csv(cars[1:3,], row.names = F) 109 | ``` 110 | 111 | ## Using external packages 112 | 113 | - Often external packages are needed for optimal reading/writing 114 | 115 | ```{r} 116 | readr::write_csv(cars[1:3,], "mini_readr.csv") 117 | df <- read.csv("mini_readr.csv") 118 | ``` 119 | 120 | > **Quiz**: what is in df now? 121 | 122 | > **Quiz**: what does `::` mean? 
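One way to check your quiz answers yourself; a sketch, assuming the `mini_readr.csv` file written above is still in the working directory:

```{r, eval=FALSE}
# `::` calls a function from a package without attaching it via library()
# Reading the file back with each reader shows the difference in the
# class of the returned object
class(read.csv("mini_readr.csv"))
class(readr::read_csv("mini_readr.csv"))
```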
123 | 124 | ```{r, echo=FALSE} 125 | # dim(df) == dim(cars[1:3,]) 126 | ``` 127 | 128 | # Worked example - on screen 129 | 130 | ## The trusty csv 131 | 132 | ```{r} 133 | people <- c("Colin", "Robin", "Rachel") 134 | n_pets <- c(5, 2, 4) 135 | df <- data.frame(people, n_pets) 136 | # write.csv(df) 137 | ``` 138 | 139 | > - **Challenge:** Write to a csv file *without* saving row names 140 | 141 | ## Writing csv files to screen 142 | 143 | ```{r} 144 | write.csv(df) 145 | ``` 146 | 147 | > **Challenge:** Write the file to disk 148 | 149 | ```{r, echo=FALSE} 150 | write.csv(df, "pets.csv") 151 | ``` 152 | 153 | ## R's own formats 154 | 155 | - Saving to R's data format 156 | 157 | > - `save()` with `.RData` extension (many objects, saves name) 158 | > - `saveRDS()` with `.Rds` extension (one object, omits name) 159 | > - `.Rda` is short for `.RData` 160 | 161 | ```{r} 162 | df1 <- df 163 | save(df, df1, file = "dfs.RData") 164 | rm(df, df1) 165 | ls() 166 | ``` 167 | 168 | > - Recommendation: use `.Rds` when space is an issue, otherwise .csv 169 | 170 | ## Loading R data 171 | 172 | > - `load(filename.Rdata)` 173 | > - `new_object <- readRDS('filename.Rds')` 174 | 175 | > **Challenge:** Load the data saved in the previous stages 176 | 177 | # Worked practical: work through chapter 3: 3.1 to 3.4 (30 minutes) 178 | 179 | ## First command (to run!) from chapter 3 180 | 181 | ```{r, eval=FALSE} 182 | npi <- read.csv("data/miniaa") 183 | dim(npi) 184 | ``` 185 | 186 | > - **Challenge:** What do the numbers mean? 187 | 188 | ## Excessive data output! 189 | 190 | > - **Quiz:** how to stop the screen overflowing? 191 | 192 | ```{r, eval=FALSE} 193 | npi 194 | ``` 195 | 196 | 197 | ```{r, echo=FALSE} 198 | npi <- read.csv("../data/miniaa") 199 | dim(npi) 200 | npi 201 | ``` 202 | 203 | # Freeing your data from spreadsheets 204 | 205 | ## readxl 206 | 207 | Much of the world's data is trapped in inaccessible files 208 | 209 | This section relies on the [readxl package](https://github.com/hadley/readxl) 210 | 211 | ```{r, eval=FALSE} 212 | f <- "data/CAIT_Country_GHG_Emissions_-_All_Data.xlsx" 213 | system.time(df <- readxl::read_excel(f, sheet = 4)) 214 | ``` 215 | 216 | ## Other packages for spreadsheets 217 | 218 | ```{r, eval=FALSE} 219 | xls_pkgs <- c("gdata", "openxlsx", "reaODS") 220 | # install.packages(xls_pkgs) # install packages if they're not already 221 | # This took less than 0.1 seconds 222 | system.time(df <- readxl::read_excel(f, sheet = 4)) 223 | # This took over 1 minute (make a coffee!) 224 | system.time(df1 <- gdata::read.xls(f, sheet = 4)) 225 | # This took 20 seconds 226 | system.time(df2 <- openxlsx::read.xlsx(f, sheet = 4)) 227 | ``` 228 | 229 | ## Different outputs 230 | 231 | ```{r, eval=FALSE} 232 | # After saving the spreadsheet to .odt (not included) - took more than 1 minute 233 | system.time(df3 <- readODS::read.ods("data/CAIT_Country_GHG_Emissions_-_All_Data.ods", sheet = 4)) 234 | 235 | head(df[1:5]) 236 | head(df1[1:5]) 237 | head(df2[1:5]) 238 | head(df3[1:5]) 239 | ``` 240 | 241 | # Worked practical: 3.5 (10 minutes) 242 | 243 | ## Thanks for listening!s 244 | 245 | -------------------------------------------------------------------------------- /slides/chapter4.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Untitled" 3 | author: "Robin Lovelace" 4 | date: "September 17, 2015" 5 | output: ioslides_presentation 6 | --- 7 | 8 | ## R Markdown 9 | 10 | Keeping it tidy! 
11 | 12 | ![](http://www.stevenson-engineers.co.uk/files/tidy_shop1.jpg) 13 | 14 | ## General principles 15 | 16 | - Concise, meaningful variable names 17 | - Short and simple factor names 18 | - Consistency 19 | - Long and not wide! 20 | 21 | ## Untidy data 22 | 23 | ```{r} 24 | library(readr) 25 | raw <- read_csv("../data/pew.csv") 26 | dim(raw) 27 | raw[1:4, 1:5] 28 | ``` 29 | 30 | ## Tidyr 31 | 32 | ```{r} 33 | library(tidyr) 34 | rawt <- gather(raw, Income, Count, -religion) 35 | head(rawt, 3) 36 | ``` 37 | 38 | ## Practical example 39 | 40 | - 5.1 41 | 42 | ## Filtering columns 43 | 44 | ## Data aggregation 45 | 46 | ## dplyr 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | ## Slide with R Output 55 | 56 | ```{r cars, echo = TRUE} 57 | summary(cars) 58 | ``` 59 | 60 | ## Slide with Plot 61 | 62 | ```{r pressure} 63 | plot(pressure) 64 | ``` 65 | 66 | -------------------------------------------------------------------------------- /slides/chapter5.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Untitled" 3 | author: "Robin Lovelace" 4 | date: "September 17, 2015" 5 | output: ioslides_presentation 6 | --- 7 | 8 | ## R Markdown 9 | 10 | Keeping it tidy! 11 | 12 | ![](http://www.stevenson-engineers.co.uk/files/tidy_shop1.jpg) 13 | 14 | ## General principles 15 | 16 | - Concise, meaningful variable names 17 | - Short and simple factor names 18 | - Consistency 19 | - Long and not wide! 20 | 21 | ## Tidyr 22 | 23 | 24 | 25 | 26 | 27 | ## Slide with R Output 28 | 29 | ```{r cars, echo = TRUE} 30 | summary(cars) 31 | ``` 32 | 33 | ## Slide with Plot 34 | 35 | ```{r pressure} 36 | plot(pressure) 37 | ``` 38 | 39 | -------------------------------------------------------------------------------- /slides/chapter6.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Chapter 5 Visualisation" 3 | author: "Colin Gillespie" 4 | date: "17-18 September 2015" 5 | output: ioslides_presentation 6 | --- 7 | 8 | ## 5.1 Introduction to ggplot2 9 | 10 | * `ggplot2` is a bit different from other graphics packages. 11 | * It roughly follows the _philosophy_ of Wilkinson, 1999. 12 | * Essentially, we think about plots as layers. 13 | * By thinking of graphics in terms of layers it is 14 | easier for the user to iteratively add new components and for a developer to add 15 | new functionality. 16 | 17 | 18 | ## Example: the mpg data set 19 | 20 | The `mpg` data set comes with the `ggplot2` package and can be using loaded in the usual way 21 | ```{r} 22 | data(mpg, package="ggplot2") 23 | ``` 24 | This data set contains statistics on $234$ cars. 25 | 26 | 27 | ## 5.1 Introduction to ggplot2 28 | 29 | ```{r, echo=T, message=FALSE, results="hide", fig.keep=FALSE} 30 | plot(mpg$displ, mpg$cty, col=mpg$cyl) 31 | ``` 32 | 33 | [](figures/ch6_f1.png) 34 | 35 | 36 | 37 | ## 5.1 Introduction to ggplot2 38 | 39 | After loading the necessary package 40 | 41 | ```{r message=FALSE} 42 | library("ggplot2") 43 | ``` 44 | 45 | ## 5.1 Introduction to ggplot2 46 | 47 | ```{r fig.keep='none', cache=TRUE, echo=T} 48 | g = ggplot(data=mpg, aes(x=displ, y=cty)) 49 | g + geom_point(aes(colour=factor(cyl))) 50 | ``` 51 | [](figures/ch6_f2.png) 52 | 53 | ## 5.1 Introduction to ggplot2 54 | 55 | * The _ggplot_ function sets the default data set, and attributes called _aesthetics_. 56 | * The aesthetics are properties that are perceived on the graphic. 57 | * A particular aesthetic can be mapped to a variable or set to a constant 58 | value. 
59 | * In previous figure, the variable `displ` is mapped to the x-axis and `cty` variable is mapped to the y-axis. 60 | 61 | 62 | ## 5.1 Introduction to ggplot2 63 | 64 | * The other function, `geom_point` adds a layer to the plot. 65 | * The `x` and `y` variables are inherited (in this case) from the first function, `ggplot`, and the colour aesthetic is set to the `cyl` variable. 66 | * Other possible aesthetics are, for example, size, shape and transparency. 67 | 68 | ## 5.1 Introduction to ggplot2 69 | 70 | If instead we changed the `size` aesthetic 71 | 72 | ```{r cache=TRUE, echo=TRUE} 73 | g + geom_point(aes(size=factor(cyl))) 74 | ``` 75 | 76 | 77 | ## 5.1 Introduction to ggplot2 78 | 79 | Plot Name | Geom | Base graphic 80 | ----------|-------|----------------- 81 | Barchart | bar | __barplot__ 82 | Box-and-whisker | boxplot | __boxplot__ 83 | Histogram | histogram | __hist__ 84 | Line plot | line | __plot__ and __lines__ 85 | Scatter plot | point | __plot__ and __points__ 86 | 87 | # The bigvis package 88 | ## The bigvis package 89 | 90 | * The `bigvis` package provides tools for exploratory data analysis of large datasets ($10-100$ million obs). 91 | * The goal is that operations should take less than $5$ seconds on a standard computer, even when the sample size is $100$ million. 92 | * The package is currently not available on CRAN 93 | 94 | ```{r eval=FALSE, tidy=FALSE} 95 | devtools::install_github("hadley/bigvis") 96 | ``` 97 | 98 | * If you are using Windows, you will also need to install Rtools. 99 | 100 | 101 | ## The bigvis package 102 | 103 | * Directly visualising raw big data is pointless. 104 | * It's a waste of time to create a $100$ million point scatter plot, since we would not be able to distinguish between the points. 105 | * In fact, we are likely to run out of pixels! If you doubt this, compare these two plots 106 | 107 | ```{r fig.keep="none"} 108 | par(mfrow=c(1, 2)) 109 | plot(1, 1,ylab="") 110 | plot(rep(1, 1e3), rep(1, 1e3), ylab="") 111 | ``` 112 | * Instead, we need to quickly summarise the data and plot the data in a sensible way. 113 | 114 | 115 | ## The bigvis package 116 | 117 | * Similar to `dplyr` 118 | * It provides fast C++ functions to manipulate the data, with the resulting output being handled by standard R functions (but optimised for `ggplot2`). 119 | * The package also provides a few functions for handling outliers, since when visualising big data outliers may be more of an issue. 120 | 121 | 122 | ## Bin and condense 123 | 124 | * The `bin()` and `condense()` functions are used to get compact summaries of the data. * For example, suppose we generate $10^5$ random numbers from the $t$ distribution 125 | ```{r echo=2} 126 | set.seed(1) 127 | x = rt(1e5, 5) 128 | ``` 129 | 130 | 131 | ## Bin and condense 132 | 133 | * The `bin` and `condense` functions create the binned variable 134 | ```{r message=FALSE} 135 | library("bigvis") 136 | ## Bin in blocks of 0.01 137 | x_sum = condense(bin(x, 0.01)) 138 | ``` 139 | * After binning you may want to smooth out any rough estimates (similar to kernel density estimation). 
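To see why condensing before plotting pays off, compare the size of the raw vector with its binned summary; a rough sketch only, exact sizes will vary:

```{r eval=FALSE}
## The raw vector stores 1e5 doubles; the condensed summary stores
## one row per bin, so it should be far smaller
library("pryr")
object_size(x)
object_size(x_sum)
```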
140 | 141 | ## Smooth 142 | 143 | ```{r echo=1:2} 144 | ## h is the binwidth (similar to bin size) 145 | x_smu = smooth(x_sum, h = 5 / 100) 146 | par(mar=c(3,3,2,1), mgp=c(2,0.4,0), tck=-.01, 147 | cex.axis=0.9, las=1) 148 | plot(x_sum, panel.first=grid(), xlim=c(-12, 12), 149 | ylab="Count", pch=21, cex=0.6) 150 | lines(x_smu, col=2, lwd=2) 151 | text(5, 200, "Smoothed line", col=2) 152 | ``` 153 | 154 | ## Autoplot 155 | 156 | ```{r} 157 | autoplot(x_sum) 158 | ``` 159 | 160 | ## Autoplot + peel 161 | 162 | ```{r} 163 | autoplot(peel(x_smu)) 164 | ``` 165 | 166 | 167 | 168 | ## IMDB example 169 | 170 | ```{r} 171 | data(movies, package="bigvis") 172 | ``` 173 | 174 | The dataset is a data frame and has `r NCOL(movies)` columns and `r NROW(movies)` rows. 175 | 176 | ## IMDB example 177 | 178 | We create bin versions of the movie length and rating using the `condense/bin` trick 179 | 180 | ```{r tidy=FALSE, message=FALSE} 181 | n_bins = 1e4 182 | bin_data = with(movies, 183 | condense(bin(length, find_width(length, n_bins)), 184 | bin(rating, find_width(rating, n_bins)))) 185 | ``` 186 | 187 | ## IMDB example 188 | 189 | ```{r echo=1} 190 | ggplot(bin_data, aes(length, rating, fill=.count )) + 191 | geom_raster() 192 | ``` 193 | 194 | ## IMDB example 195 | 196 | The resulting plot isn't helpful, due to a couple of long movies 197 | 198 | ```{r tidy=FALSE} 199 | ## Longer than one day!! 200 | subset(movies[ ,c("title", "length", "rating")], 201 | length > 24*60) 202 | ``` 203 | 204 | ## IMDB example: last_plot + peel 205 | 206 | ```{r echo=1, fig.keep="last"} 207 | last_plot() %+% peel(bin_data) 208 | ggplot(data=peel(bin_data), aes(length, rating, fill=.count )) + 209 | geom_raster() 210 | ``` 211 | 212 | \noindent to get a better visualisation. 213 | 214 | ## Tableplots: the tabplot package 215 | 216 | * Tableplots are a visualisation technique that can be used to explore and analyse large data sets. 217 | * These plots can be used to explore variable relationships and check data quality. 218 | * Tableplots can visualise multivariate datasets with several variables and a large number of records. 219 | * The `tabplot` package provides has an `ffdf` interface. 220 | 221 | 222 | ## Tableplots: the tabplot package 223 | 224 | ```{r, echo=T, message=FALSE} 225 | library("tabplot") 226 | tableplot(movies[,3:5]) 227 | ``` 228 | 229 | 230 | ## Tableplots: the tabplot package 231 | 232 | ```{r message=FALSE, warning=FALSE} 233 | tableplot(movies[,3:5], sortCol = 3) 234 | ``` 235 | 236 | 237 | ## Tableplots: the tabplot package 238 | 239 | ```{r message=FALSE, warning=FALSE, tidy=FALSE} 240 | tableplot(movies[,3:5], sortCol = 3, from =0, to=10) 241 | ``` 242 | -------------------------------------------------------------------------------- /slides/chapter9.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "The ff package" 3 | author: "Colin Gillespie" 4 | date: "17-18 September 2015" 5 | output: ioslides_presentation 6 | --- 7 | 8 | ## Introduction 9 | 10 | * The `ff` (fast file) package provides access to data stored on your hard desk 11 | * Data isn't stored in memory 12 | * Bigger data sets! 13 | * It allows efficient indexing, retrieval and sorting of vectors 14 | * But hard drive is slower than RAM 15 | 16 | ## Introduction 17 | 18 | * No longer loading data directly into memory 19 | * Non-standard R code 20 | * The `ff` package __only__ provides the building blocks 21 | * Few statistical functions 22 | * No support for characters. 
23 | 24 | ## Introduction 25 | 26 | * The `ffbase` package extends `ff` 27 | * `c()`, `duplicated()` and `which()` 28 | * The successor to `ffbase` is currently available on github 29 | * Integrate `ff` with `dplyr` 30 | 31 | ## Importing data 32 | 33 | * The `ff` package provides a number of functions to read in data. 34 | * All the `read.*` base R functions have `ff` equivalents that are used in the same way. 35 | 36 | ## Importing data 37 | 38 | * There are two key classes in the `ff` package. `ff` for vector and `ffdf` for data frames. 39 | * To illustrate we'll use a simple csv file that comes with the `r4bd` package. 40 | 41 | ```{r, cache=FALSE, message=FALSE} 42 | library("ff") 43 | ## `get_rand()` Returns the full path 44 | filename = r4bd::get_rand() 45 | ffx = read.csv.ffdf(file=filename, header = TRUE) 46 | ``` 47 | 48 | ## Some standard functions 49 | 50 | * We can use (some) standard R functions to query the data set. 51 | 52 | ```{r, results="hide"} 53 | ## Only 10000 rows (small) 54 | dim(ffx) 55 | ``` 56 | * But not all 57 | ```{r eval=FALSE} 58 | colSums(ffx) # produces the following error: 59 | 60 | ## Error in colSums(ffx) : 'x' must be an array ... 61 | ``` 62 | 63 | ## Data chunks 64 | 65 | * The key idea with `ffdf` objects, is that we no longer manipulate objects in one go 66 | * Use chunks of data. 67 | * Split the data set up into smaller pieces that can be manipulated by R 68 | * Process them one-by-one. 69 | * The `ff` package is a much more efficient solution than the naive approach of manually splitting your data into separate files, and using numerous `read.csv` calls. 70 | 71 | ## Data chunks 72 | 73 | * The `chunk` function creates a sequence of range indexes using a syntax similar to `seq`. 74 | * Since this data set is small, we only have a single chunk 75 | ```{r} 76 | length(chunk(ffx)) 77 | ``` 78 | 79 | ## Data chunks 80 | 81 | * To make this section more realistic we'll manually specify the number of chunks using the `length.out`. 82 | * The `chunk` function returns a list of ranges 83 | ```{r} 84 | chunk(ffx, length.out=10)[[1]] 85 | ``` 86 | 87 | ## Data chunks 88 | 89 | * Since we are now dealing with chunks, this makes standard data analysis a pain. 90 | * For example, suppose we just want to find the minimum value of the matrix. 91 | * If `ffx` was a standard data frame, we would just use `min(ffx)`. 92 | 93 | ## Data chunks 94 | 95 | * However, `ffx` isn't a standard R object. 96 | * Instead, we need to loop over the chunks and keep track of the result, e.g. 97 | 98 | ```{r tidy=FALSE} 99 | m = numeric(10) 100 | chunks = chunk(ffx, length.out=10) 101 | for(i in seq_along(chunks)) 102 | m[i] = min(ffx[chunks[[i]],]) 103 | min(m) 104 | ``` 105 | 106 | ## Exercise 107 | 108 | Suppose we have $n$ chunks. Can you think of how we could calculate 109 | 110 | * The mean 111 | * The median 112 | 113 | If it makes things easier, set $n=5$ 114 | 115 | ## Pass by reference 116 | 117 | * Since we are dealing with out of memory objects, standard rules about copying objects no longer apply 118 | * In particular when we copy objects, we are passing by reference 119 | 120 | ## Pass by reference 121 | 122 | * When we change `ffy` 123 | 124 | ```{r} 125 | ffy = ffx 126 | ffy[1, 1] = 0 127 | ``` 128 | * we have also changed `ffx` 129 | 130 | ```{r} 131 | ffx[1, 1] 132 | ``` 133 | 134 | It's a trade off between large objects and side-effects. 
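If the aliasing is not what you want, make the copy explicit. A sketch, assuming the `clone` function provided by `ff`:

```{r eval=FALSE}
## clone() creates a new ff object backed by its own file, so
## changes to the copy no longer affect the original
ffz = clone(ffx)
ffz[1, 1] = 99
ffx[1, 1]  # unchanged
```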
135 | 136 | ## ff vs readr 137 | 138 | * At this point it's worthwhile thinking about speed comparisons with `readr`, via a quick benchmark. 139 | * First we create a test data set 140 | 141 | ```{r, echo=FALSE, eval=FALSE} 142 | r4bd::create_rand("example.csv", 1e6) 143 | ``` 144 | 145 | ## ff vs readr 146 | 147 | Then time reading in the files 148 | ```{r, echo=TRUE, eval=FALSE} 149 | system.time(ffx <- ff::read.csv.ffdf(file="/tmp/tmp.csv", header = TRUE)) 150 | system.time(x <- readr::read_csv("/tmp/tmp.csv")) 151 | ``` 152 | 153 | ## ff vs readr 154 | 155 | * On my machine, the `readr` function is an order of magnitude faster. 156 | * This is what we would expect. 157 | * The `ffdf` version is also preparing the data for future read/write access from the hard drive 158 | * Also, the `readr` variant is limited by your RAM 159 | * So if your file is too large, you will get an error 160 | 161 | ```{r eval=FALSE, tidy=FALSE} 162 | R> x = readr::read_csv("very_big.csv") 163 | # Error: cannot allocate vector of size 12.8 Gb 164 | ``` 165 | 166 | 167 | ## ff Storage 168 | 169 | * When data is the `ff` format, processing is faster than using the standard `read.csv`/`write.csv` combination. 170 | * However, converting data into `ff` format can be time consuming; so keeping data in `ff` format is helpful. 171 | * When you load in an `ff` object, there is a corresponding file(s) created on your hard disk 172 | 173 | ```{r} 174 | filename(ffx) 175 | ``` 176 | 177 | ## ff Storage 178 | 179 | * This make moving data around a bit more complicated. 180 | * The package provides helper functions, `ffsave` and `ffload`, which zips/unzips `ff` object files. 181 | * However the `ff` files are not platform-independent, so some care is needed when changing operating systems. 182 | 183 | # The ffbase package 184 | ## The ffbase package 185 | 186 | * The `ff` package supplies the tools for manipulating large data sets, but provides few statistical functions. 187 | * Conceptually, chunking algorithms are straightforward. 188 | * The program reads a chunk of data into memory, performs intermediate calculations, saves the results and reads the next chunk. 189 | * This process repeats until the entire dataset is processed. 190 | * Unfortunately, many statistical algorithms have not been written with chunking in mind. 191 | 192 | ## The ffbase package 193 | 194 | * The `ffbase` package adds basic statistical functions to `ff` and `ffdf` objects. 195 | * It tries to make the code more R like and smooth away the pain of working with `ff` objects. 196 | * It also provides an interface with `big*` methods. 197 | 198 | ## The ffbase package 199 | 200 | * `ffbase` provides S3 methods for a number of standard functions 201 | * `mean`, `min`, `max`, and standard arithmetic operators for `ff` objects 202 | * See `?ffbase` for a complete list 203 | * This removes some of the pain when dealing with `ff` objects. 204 | 205 | ## The ffbase package 206 | 207 | * The `ffbase` package also provide access to other packages that handle large data sets 208 | * `biglm`: Regression for data too large to fit in memory; 209 | * `biglars`: Scalable Least-Angle Regression and Lasso. 210 | * `bigrf`: Big Random Forests: Classification and Regression Forests for Large Data Sets. 211 | * `stream`: Infrastructure for Data Stream Mining. 212 | 213 | ## Big linear models 214 | 215 | * Linear models (lm) are one of the most basic statistical models available. 
216 | * The simplest regression model is
217 | \[
218 | Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
219 | \]
220 | where $\epsilon_i \sim N(0, \sigma^2)$.
221 | * This corresponds to fitting a straight line through some points.
222 | * So $\beta_0$ is the $y$-intercept and $\beta_1$ is the gradient.
223 | * The aim is to estimate $\beta_0$ and $\beta_1$.
224 | 
225 | ## Big linear models
226 | 
227 | * In the more general multiple regression model, there are $p$ predictor variables
228 | \[
229 | Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i,
230 | \]
231 | where $x_{ij}$ is the $i^\text{th}$ observation on the $j^\text{th}$ independent variable.
232 | 
233 | 
234 | ## Big linear models
235 | 
236 | The above equation can be written neatly in matrix notation as
237 | \[
238 | Y = X \beta + \epsilon
239 | \]
240 | with dimensions
241 | \[
242 | [n\times 1]= [n\times (p+1)] ~[(p+1)\times 1] + [n \times 1 ]\;,
243 | \]
244 | where
245 | 
246 | * $Y$ is the response vector (dimensions $n \times 1$)
247 | * $X$ is the design matrix (dimensions $n \times (p+1)$)
248 | * $\beta$ is the parameter vector (dimensions $(p+1) \times 1$)
249 | * $\epsilon$ is the error vector (dimensions $n \times 1$)
250 | 
251 | ## Big linear models
252 | 
253 | The goal of regression is to estimate $\beta$ with $\hat\beta$. It can be shown that
254 | \[
255 | \hat\beta = (X^T X)^{-1} X^T Y.
256 | \]
257 | Our estimate of $\hat \beta$ will exist provided that $(X^T X)^{-1}$
258 | exists, i.e. no column of $X$ is a linear combination of other columns.
259 | 
260 | ## Big linear models
261 | 
262 | For a least squares regression with a sample size of $n$ training examples and $p$ predictors, it takes:
263 | 
264 | * $O(p^2n)$ to multiply $X^T$ by $X$;
265 | * $O(pn)$ to multiply $X^T$ by $Y$;
266 | * $O(p^3)$ to compute the LU (or Cholesky) factorization of $X^TX$ that is used to compute the product of $(X^TX)^{-1} (X^T Y)$.
267 | 
268 | ## Big linear models
269 | 
270 | * Since $n \gg p$, this means that the algorithm scales with order $O(p^2 n)$.
271 | * As well as taking a long time to calculate, the memory required also increases.
272 | * The R implementation of `lm` requires $O(np + p^2)$ in memory.
273 | * But this can be reduced by constructing the model matrix in chunks.
274 | 
275 | 
276 | ## Big linear models
277 | 
278 | * The `biglm` package works by updating the Cholesky decomposition with new observations.
279 | * So for a model with $p$ variables, only the $p \times p$ (triangular) Cholesky factor and a single row of data need to be in memory at any given time.
280 | * The `biglm` package does not do the chunking for you, but `ffbase` provides a handy S3 wrapper, `bigglm.ffdf`.
281 | 
282 | For an example of using `biglm`, see the blog post at \url{http://goo.gl/iBPkTp} by Bnosac.
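As a rough illustration of the chunked approach, a model can be fitted directly on an `ffdf` object. This is a sketch only: the column names `y`, `x1` and `x2` are hypothetical, and it assumes the `biglm` and `ffbase` packages are installed so that `bigglm` dispatches to `bigglm.ffdf`:

```{r eval=FALSE}
library("biglm")
library("ffbase")
## The data is fed to biglm in chunks of `chunksize` rows, so only a
## small part of the ffdf needs to be in RAM at any one time
fit = bigglm(y ~ x1 + x2, data = ffx, family = gaussian(), chunksize = 5000)
summary(fit)
```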
283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | -------------------------------------------------------------------------------- /slides/dplyr.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "dplyr" 3 | author: "Robin Lovelace" 4 | date: "September 18, 2015" 5 | output: ioslides_presentation 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_knit$set(root.dir = "../") 10 | ``` 11 | 12 | ## Where we got to last time 13 | 14 | ```{r, message=FALSE} 15 | pkgs <- c("tidyr", "readr", "dplyr") 16 | pkld <- lapply(pkgs, require, character.only = TRUE) 17 | raw <- read_csv("data/pew.csv") 18 | rawt <- gather(raw, Income, Count, -religion) 19 | head(rawt, 3) 20 | ``` 21 | 22 | ## Filtering columns 23 | 24 | ```{r} 25 | df <- read_csv("data/miniaa") # load imaginary large data 26 | dim(df) 27 | all_na <- sapply(df, function(x) all(is.na(x))) 28 | df1 <- df[!all_na] # subset the dataframe 29 | dim(df1) 30 | df2 <- df[complete.cases(t(df))] 31 | dim(df2) 32 | ``` 33 | 34 | ## Aggregation 35 | 36 | ```{r, warning=FALSE} 37 | df <- read_csv("data/ghg-ems.csv") 38 | glimpse(df) 39 | ``` 40 | 41 | ## What to aggregate? 42 | 43 | ```{r} 44 | df <- rename(df, ECO2 = `Electricity/Heat (CO2) (MtCO2)`) 45 | length(unique(df$Country)) 46 | sapply(df, function(x) length(unique(x))) 47 | ``` 48 | 49 | ## Aggregation with base R 50 | 51 | ```{r} 52 | e_ems <- aggregate(df$ECO2, list(df$Country), mean, na.rm = T) 53 | nrow(e_ems) 54 | head(e_ems) # not particularly beautiful output 55 | ``` 56 | 57 | ## Making base R code nicer 58 | 59 | ```{r} 60 | e_ems <- aggregate(ECO2 ~ Country, df, mean, na.rm = T) 61 | head(e_ems) 62 | ``` 63 | 64 | ## Enter dplyr 65 | 66 | ```{r} 67 | e_ems <- group_by(df, Country) %>% 68 | summarise(mean_eco2 = mean(ECO2, na.rm = T)) 69 | e_ems 70 | ``` 71 | 72 | ## dplyr: data processing made easy 73 | 74 | ```{r} 75 | glimpse(e_ems) 76 | ``` 77 | 78 | # Practical: work through chapter 5.3 and 5.4 (15 minutes) 79 | 80 | ## First thing to do 81 | 82 | > - Rename the 4th column of the data 83 | > - Optional challenge: do this in the dplyr way 84 | > - Advanced: Find the top and bottom 3 emitters in each category 85 | 86 | ```{r} 87 | df <- read_csv("data/ghg-ems.csv") 88 | names(df) 89 | names(df)[4] <- "ECO2" 90 | ``` 91 | 92 | ## Command chaining I 93 | 94 | Which do you prefer: 95 | 96 | > - This? 97 | 98 | ```{r, eval=FALSE} 99 | top_n( 100 | arrange( 101 | summarise( 102 | group_by( 103 | filter(idata, grepl("g", Country)), 104 | Year), 105 | gini = mean(gini, na.rm = T)), 106 | desc(gini)), 107 | n = 5) 108 | ``` 109 | 110 | 111 | ## Command chaining II 112 | 113 | > - Or this? 114 | 115 | ```{r, eval=FALSE} 116 | idata %>% 117 | filter(grepl("g", Country)) %>% 118 | group_by(Year) %>% 119 | summarise(gini = mean(gini, na.rm = T)) %>% 120 | arrange(desc(gini)) %>% 121 | top_n(n = 5) 122 | ``` 123 | 124 | Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot. 
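## Chaining on the emissions data

The same chained pattern applies to the emissions data loaded earlier; a sketch only, assuming the fourth column has been renamed to `ECO2` as in the practical:

```{r, eval=FALSE}
df %>%
  group_by(Country) %>%
  summarise(mean_eco2 = mean(ECO2, na.rm = TRUE)) %>%
  arrange(desc(mean_eco2)) %>%
  top_n(n = 5)
```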
125 | 
126 | # Practical: Section 5.5 (10 minutes)
127 | 
--------------------------------------------------------------------------------
/src/mean_c.cpp:
--------------------------------------------------------------------------------
1 | #include <Rcpp.h>
2 | using namespace Rcpp;
3 | 
4 | // [[Rcpp::export]]
5 | double mean_c(NumericVector x){
6 | int i;
7 | int n = x.size();
8 | double mean = 0;
9 | 
10 | for(i=0; i<n; i++){
11 | mean = mean + x[i]/n;
12 | }
13 | return mean;
14 | }
--------------------------------------------------------------------------------
/src/precision.cpp:
--------------------------------------------------------------------------------
1 | #include <Rcpp.h>
2 | using namespace Rcpp;
3 | 
4 | // [[Rcpp::export]]
5 | float test1() {
6 | float a = 1.0 / 81;
7 | float b = 0;
8 | for (int i = 0; i < 729; ++ i)
9 | b = b + a;
10 | return b;
11 | }
12 | 
13 | // [[Rcpp::export]]
14 | double test2() {
15 | double a = 1.0 / 81;
16 | double b = 0;
17 | for (int i = 0; i < 729; ++ i)
18 | b += a;
19 | return b;
20 | }
--------------------------------------------------------------------------------
/toc.yaml:
--------------------------------------------------------------------------------
1 | Introduction.Rmd:
2 | - check
3 | - check-workflow
--------------------------------------------------------------------------------
/tufte-ebook.cls:
--------------------------------------------------------------------------------
1 | \NeedsTeXFormat{LaTeX2e}[1994/06/01]
2 | 
3 | \ProvidesClass{tufte-ebook}[2015/02/08 v3.5.1 Tufte-book class]
4 | 
5 | %%
6 | % Declare we're tufte-book
7 | \newcommand{\@tufte@class}{book}% the base LaTeX class (defaults to the article/handout style)
8 | \newcommand{\@tufte@pkgname}{tufte-ebook}% the name of the package (defaults to tufte-handout)
9 | 
10 | %%
11 | % Load the common style elements
12 | \input{tufte-common.def}
13 | 
14 | 
15 | %%
16 | % Set up any book-specific stuff now
17 | 
18 | %%
19 | % The front matter in Tufte's /Beautiful Evidence/ contains everything up
20 | % to the opening page of Chapter 1. The running heads, when they appear,
21 | % contain only the (arabic) page number in the outside corner.
22 | %\newif\if@mainmatter \@mainmattertrue
23 | \renewcommand\frontmatter{%
24 | \clearpage%
25 | \@mainmatterfalse%
26 | \pagenumbering{arabic}%
27 | %\pagestyle{plain}%
28 | \fancyhf{}%
29 | \ifthenelse{\boolean{@tufte@twoside}}%
30 | {\fancyhead[LE,RO]{\thepage}}%
31 | {\fancyhead[RE,RO]{\thepage}}%
32 | }
33 | 
34 | 
35 | %%
36 | % The main matter in Tufte's /Beautiful Evidence/ doesn't restart the page
37 | % numbering---it continues where it left off in the front matter.
38 | \renewcommand\mainmatter{%
39 | \clearpage%
40 | \@mainmattertrue%
41 | \fancyhf{}%
42 | \ifthenelse{\boolean{@tufte@twoside}}%
43 | {% two-side
44 | \renewcommand{\chaptermark}[1]{\markboth{##1}{}}%
45 | \fancyhead[LE]{\thepage\quad\smallcaps{\newlinetospace{\plaintitle}}}% book title
46 | \fancyhead[RO]{\smallcaps{\newlinetospace{\leftmark}}\quad\thepage}% chapter title
47 | }%
48 | {% one-side
49 | \fancyhead[RE,RO]{\smallcaps{\newlinetospace{\plaintitle}}\quad\thepage}% book title
50 | }%
51 | }
52 | 
53 | 
54 | %%
55 | % The back matter contains appendices, indices, glossaries, endnotes,
56 | % biliographies, list of contributors, illustration credits, etc.
57 | \renewcommand\backmatter{%
58 | \clearpage%
59 | \@mainmatterfalse%
60 | }
61 | 
62 | %%
63 | % Only show the chapter titles in the table of contents
64 | \setcounter{tocdepth}{0}
65 | 
66 | %%
67 | % If there is a `tufte-book-local.sty' file, load it.
68 | 69 | \IfFileExists{tufte-book-local.tex}{% 70 | \@tufte@info@noline{Loading tufte-book-local.tex}% 71 | \input{tufte-book-local}% 72 | }{} 73 | 74 | %% 75 | % End of file 76 | \endinput 77 | --------------------------------------------------------------------------------