├── GISC-422.Rproj
├── README.md
├── labs
│   ├── README.md
│   ├── interpolation
│   │   ├── 01-overview-of-the-approach.Rmd
│   │   ├── 01-overview-of-the-approach.md
│   │   ├── 02-example-dataset.Rmd
│   │   ├── 02-example-dataset.md
│   │   ├── 03-preparing-for-interpolation.Rmd
│   │   ├── 03-preparing-for-interpolation.md
│   │   ├── 04-nn-and-idw.Rmd
│   │   ├── 04-nn-and-idw.md
│   │   ├── 05-trend-surfaces-and-kriging.Rmd
│   │   ├── 05-trend-surfaces-and-kriging.md
│   │   ├── 05B-trend-surfaces-and-kriging.Rmd
│   │   ├── 05B-trend-surfaces-and-kriging.md
│   │   ├── 06-splines.Rmd
│   │   ├── 06-splines.md
│   │   ├── 07-other-r-packages.Rmd
│   │   ├── 07-other-r-packages.md
│   │   ├── 08-assignment-interpolation.Rmd
│   │   ├── 08-assignment-interpolation.md
│   │   ├── README.md
│   │   ├── data
│   │   │   ├── interp-ext.gpkg
│   │   │   ├── maungawhau.tif
│   │   │   ├── pa-counties.gpkg
│   │   │   └── pa-weather-1993-04-01.gpkg
│   │   └── interpolation.zip
│   ├── intro-to-R-and-friends
│   │   ├── 01-introducing-r-and-rstudio.md
│   │   ├── 02-installing-packages.md
│   │   ├── 03-simple-data-exploration.md
│   │   ├── 04-simple-maps.md
│   │   ├── 05-r-markdown.md
│   │   ├── 06-wrapping-up.md
│   │   ├── README.md
│   │   ├── earthquakes.csv
│   │   ├── images
│   │   │   ├── quakes-MAG-boxplot.png
│   │   │   ├── quakes-MAG-hist.png
│   │   │   ├── quakes-NZMGE-NZMGN-plot.png
│   │   │   └── rstudio.png
│   │   ├── intro-to-R-and-friends.zip
│   │   └── nz.gpkg
│   ├── introducing-spatstat
│   │   └── README.md
│   ├── making-maps-in-r
│   │   ├── 01-making-maps-in-r.md
│   │   ├── 02-data-wrangling-in-r.md
│   │   ├── README.md
│   │   ├── ak-rds.gpkg
│   │   ├── ak-tb-cases.geojson
│   │   ├── ak-tb.geojson
│   │   └── making-maps-in-r.zip
│   ├── multivariate-analysis
│   │   ├── 01-multivariate-analysis-the-problem.Rmd
│   │   ├── 01-multivariate-analysis-the-problem.md
│   │   ├── 02-the-tidyverse.Rmd
│   │   ├── 02-the-tidyverse.md
│   │   ├── 03-dimensional-reduction.Rmd
│   │   ├── 03-dimensional-reduction.md
│   │   ├── 04-classification-and-clustering.Rmd
│   │   ├── 04-classification-and-clustering.md
│   │   ├── 05-assignment-multivariate-analysis.Rmd
│   │   ├── 05-assignment-multivariate-analysis.md
│   │   ├── README.md
│   │   ├── multivariate-analysis.zip
│   │   ├── sa1-2018-census-individual-part-1-total-nz-lookup-table.csv
│   │   ├── sf_demo.geojson
│   │   └── welly.gpkg
│   ├── network-analysis
│   │   ├── README.md
│   │   ├── network-analysis.Rmd
│   │   ├── network-analysis.md
│   │   ├── network-analysis.zip
│   │   └── network
│   │       ├── network.graphml
│   │       └── network_.graphml
│   ├── point-pattern-analysis
│   │   ├── 01-ppa-in-spatstat.md
│   │   ├── 02-ppa-with-real-data.md
│   │   ├── 03-assignment-instructions.md
│   │   ├── README.md
│   │   ├── ak-tb-cases.geojson
│   │   ├── ak-tb.geojson
│   │   └── point-pattern-analysis.zip
│   ├── spatial-autocorrelation
│   │   ├── README.md
│   │   ├── akregion-tb-06.gpkg
│   │   ├── assignment-spatial-autocorrelation.md
│   │   ├── moran_plots.R
│   │   └── spatial-autocorrelation.zip
│   └── statistical-models
│       ├── README.md
│       ├── layers
│       │   ├── age.tif
│       │   ├── deficit.tif
│       │   ├── dem.tif
│       │   ├── mas.tif
│       │   ├── mat.tif
│       │   ├── r2pet.tif
│       │   ├── rain.tif
│       │   ├── slope.tif
│       │   ├── sseas.tif
│       │   ├── tseas.tif
│       │   └── vpd.tif
│       ├── nz35-pa.gpkg
│       ├── statistical-models.Rmd
│       ├── statistical-models.md
│       └── statistical-models.zip
├── report
│   └── README.md
└── video-links.md
/GISC-422.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Spatial Analysis and Modelling
2 | These pages outline a one semester (36 contact hours) class in spatial analysis and modelling that was last taught at Victoria University of Wellington as GISC 422 in the second half of 2023.
3 |
4 | I am still in the process of cleaning the materials up for potential conversion into training materials. For the time being the materials are provided _gratis_ with no warranty as to their accuracy as a guide to spatial analysis in _R_, but you may still find them useful all the same!
5 |
6 | ## Link to related video content
7 | A consolidated list of relevant video content for this class is available on [this page](video-links.md).
8 |
9 | ## Lab and lecture timetable
10 | Here's the 12-week schedule we will aim to follow. **Bolded labs** have an associated assignment that must be submitted and contributes the indicated percentage of the course credit. General instructions for the labs are [here](labs/README.md). Relevant materials (lecture slides, lab scripts and datasets) are linked below, when available.
11 |
12 | Week# | Lecture | Lab | [Videos](video-links.md)
13 | :-: | :-- | :-- | :--
14 | 1 | Course overview | [*R* and *RStudio* computing environment and Markdown documents](labs/intro-to-R-and-friends/README.md) | [Practical](video-links.md#introducing-r-and-friends)
15 | 2 | [Why ‘spatial is special’](https://southosullivan.com/gisc422/spatial-is-special/) | [Making maps in *R*](labs/making-maps-in-r/README.md) | [Lecture](video-links.md#lecture-on-spatial-is-special) [Practical](video-links.md#practical-materials-on-making-maps-in-r)
16 | 3 | [Spatial processes](https://southosullivan.com/gisc422/spatial-processes/) | [Introducing `spatstat`](labs/introducing-spatstat/README.md) | [Lecture](video-links.md#lecture-on-the-idea-of-a-spatial-process) [Practical](video-links.md#practical-materials-on-spatial-processes)
17 | 4 | [Point pattern analysis](https://southosullivan.com/gisc422/point-pattern-analysis/) | [**Point pattern analysis**](labs/point-pattern-analysis/README.md) (15%) | [Lecture](video-links.md#lecture-on-point-pattern-analysis) [Practical](video-links.md#overview-of-lab-on-point-pattern-analysis)
18 | 5 | [Measuring spatial autocorrelation](https://southosullivan.com/gisc422/spatial-autocorrelation/) | [**Moran's *I***](labs/spatial-autocorrelation/README.md) (15%) | [Lecture](video-links.md#lecture-on-spatial-autocorrelation) [Practical](video-links.md#overview-of-lab-on-spatial-autocorrelation)
19 | 6 | [Spatial interpolation](https://southosullivan.com/gisc422/interpolation/) | ['Simple' interpolation in R](labs/interpolation/README.md) | [Lecture](video-links.md#lecture-on-simple-interpolation-methods)
20 | 7 | [Geostatistics](https://southosullivan.com/gisc422/geostatistics/) | [**Interpolation**](labs/interpolation/README.md) (15%) | [Lecture](video-links.md#lecture-on-geostatistical-methods)
21 | 8 | Multivariate methods | [**Geodemographics**](labs/multivariate-analysis/README.md) (15%) | [Lecture](video-links.md#week-9-multivariate-analysis)
22 | 9 | Overlay, regression models and related methods | [Lab content](labs/statistical-models/README.md) |
23 | 10 | [Cluster detection](https://southosullivan.com/gisc422/cluster-detection/) | |
24 | 11 | [Network analysis](https://southosullivan.com/gisc422/network-analysis/) | [Tools for network analysis](labs/network-analysis/README.md)
25 | 12 | |
26 |
27 | ### Readings
28 | The most useful materials are
29 |
30 | + Bivand R, Pebesma E and Gómez-Rubio V. 2013. [*Applied Spatial Data Analysis with R*](https://link-springer-com.helicon.vuw.ac.nz/book/10.1007%2F978-1-4614-7618-4) 2nd edn. Springer, New York.
31 | + O'Sullivan D and D Unwin. 2010. [*Geographic Information Analysis*](http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470288574.html) 2nd edn. Wiley, Hoboken, NJ.
32 | + Brunsdon C and L Comber. 2019. [*An Introduction to R for Spatial Analysis and Mapping*](https://au.sagepub.com/en-gb/oce/an-introduction-to-r-for-spatial-analysis-and-mapping/book258267 "Brunsdon and Comber Introduction to R book") 2nd edn. Sage, London.
33 |
34 | There are also many useful online resources that cover topics that are the subject of this class. For example:
35 |
36 | + [*Geocomputation with R*](https://geocompr.robinlovelace.net/) by Lovelace, Novosad and Muenchow, 2019
37 | + [*Spatial Data Science*](https://r-spatial.org/book/) a book in preparation from Bivand and Pebesma
38 | + [Course materials for Geographic Data Science](http://darribas.org/gds15/) by Daniel Arribas-Bel at the University of Liverpool
39 | + [*Geospatial Analysis*](https://www.spatialanalysisonline.com/HTML/index.html) by deSmith, Longley and Goodchild
40 |
41 | For the final assignment you will need to do your own research and assemble materials concerning how spatial analysis has been applied in specific areas of study.
42 |
43 | ### Software
44 | Most of the lab work will be completed in the [*R*](https://www.r-project.org/) programming language for statistical computing, using various packages tailored to spatial analysis work.
45 |
46 | We will use *R* from the [*RStudio*](https://posit.co/) environment which makes managing work more straightforward.
47 |
48 | Both *R* and *RStudio* are freely downloadable for use on your own computer (they work on all three major platforms). I can take a look if you are having issues with your installation, but am likely to suggest that you uninstall and reinstall.
49 |
50 | ## Course learning objectives (CLOs)
51 | 1. Articulate the theoretical and practical considerations in the application of spatial analysis methods and spatial modelling
52 | 2. Prepare, manipulate, analyse, and display spatial data
53 | 3. Apply existing tools to derive meaningful spatial models
54 | 4. Identify and perform appropriate spatial analysis
55 |
56 | ## Assessment
57 | This course is 100% internally assessed. Assessment is based on four lab assignments each worth 15% of overall course credit, a final assignment worth 30% of course credit due in the exam period, and 10% for participation (including the non-assessed labs).
58 |
59 | Assessment item | Credit | Due date | CLOs
60 | :- | :-: | :-: | :-:
61 | Point pattern analysis | 15% | 4 September | 2 3 4
62 | Spatial autocorrelation | 15% | 11 September | 2 3 4
63 | Spatial interpolation | 15% | 25 September | 2 3 4
64 | Geodemographic analysis | 15% | 9 October | 2 3 4
65 | Written report on application of spatial analysis in a particular topic area | 30% | 20 October | 1
66 | Participation (including non-assessed labs) | 10% | NA | 1 2 3 4
67 |
68 | Some guidance on the written report assignment expectations is provided [here](report/README.md).
69 |
--------------------------------------------------------------------------------
/labs/README.md:
--------------------------------------------------------------------------------
1 | # General instructions for labs
2 | Each week there will be a practical component. Not every week is assessed, but you will learn a lot more if you work on the materials every week.
3 |
4 | The general procedure each week is to download the associated `.zip` file and unpack it to a folder on the machine you are working on that you have access to. The zipped material will initially be only any necessary data files. In later weeks it may also include _RMarkdown_ `.Rmd` files (don't worry if you don't know what that means yet).
5 |
6 | Once you have the data, proceed to the `README.md` page on the website for that week (this is linked from the timetable on the main page) and follow the instructions. In later weeks, you might execute the instructions from an _RMarkdown_ file instead.
7 |
--------------------------------------------------------------------------------
/labs/interpolation/01-overview-of-the-approach.Rmd:
--------------------------------------------------------------------------------
1 |
2 | # The overall approach to interpolation
3 | The *R* ecosystem's approach to spatial interpolation seems pretty complicated at first. It's not exactly simple, although the overall concept is straightforward enough and some of the apparent complexity has more to do with making different data types interact with one another successfully.
4 |
5 | But the basic steps are
6 | 1. Get a set of control points
7 | 2. Define a target set of 'sites' to interpolate at
8 | 3. Build a spatial statistical model from the control points
9 | 4. Apply the model to the target sites
10 |
11 | In a bit more detail these involve:
12 |
13 | ## 1. Get a set of control points
14 | The control points are the empirical data from the field or other source telling you known measurements at known locations.
15 |
16 | So... usually you will be supplied with these. In the instruction pages which follow we make a fake set of control points because in the instructional example we already know the answer. In 'real life' (and in the assignment) the control points will be provided.
17 |
18 | ## 2. Define a target set of 'sites' to interpolate at
19 | We also need to specify where and at what resolution (or degree of detail) we want to perform estimates (i.e. interpolation).
20 |
21 | Generally this will be across the area covered by the control points. We'll assume that a bounding box (i.e. a rectangular region) is 'good enough', but in specific cases you might want to mask out regions, which can make things more complicated.
22 |
23 | The most straightforward way to represent the sites to interpolate at is as an 'empty' raster dataset with the required resolution.
24 |
25 | Surprisingly there is no simple way to make this directly from a set of control points. Instead we have to go via `st_make_grid` to get a grid of points, turn this into an 'xyz' data table, and then use `terra::rast()` with `type = "xyz"` to get a suitable 'target' raster layer. That's way more complicated than I'd like, but it seems to be the most robust approach. How it's done in full is shown in the instructions.
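
In condensed form, and using the same `sf` and `terra` functions as the later instructions, that workflow looks roughly like the sketch below. It is a minimal sketch for reference only, assuming a `controls` point dataset is already loaded and that a 10m cell size is wanted:

```
# for reference: make an 'empty' target raster from a set of control points
sites_xyz <- controls %>%
  st_union() %>%                                     # all the points as one geometry
  st_bbox() %>%                                      # its bounding box...
  st_as_sfc() %>%                                    # ...as a polygon
  st_make_grid(cellsize = 10, what = "centers") %>%  # a grid of points
  st_sf() %>%
  cbind(st_coordinates(.)) %>%                       # add X and Y columns
  st_drop_geometry() %>%                             # keep just a plain table
  mutate(Z = 0)                                      # a dummy 'z' value
target_raster <- rast(sites_xyz, type = "xyz")       # xyz table to raster
crs(target_raster) <- st_crs(controls)$wkt           # don't forget the CRS
```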
26 |
27 | ## 3. Build a spatial statistical model from the control points
28 | The 'crunchy' part is where we use `gstat::gstat` to fit a model to the control points data. This makes an *R* model object which can then be applied by other *R* tools to run interpolations. For the simple methods like IDW there isn't much to this. For kriging making the model gets more complicated.
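
For the IDW case, for example, making the model amounts to little more than the sketch below (a preview of the IDW page, assuming the control points have a `height` attribute we want to interpolate):

```
# for reference: an inverse-distance weighted model
model <- gstat(
  formula = height ~ 1,            # the attribute we want to interpolate
  data = as(controls, "Spatial"),  # gstat wants an sp object, so convert
  set = list(idp = 2)              # the inverse-distance power
)
```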
29 |
30 | ## 4. Apply the model to the target sites
31 | The last step is to apply the model. This ends up looking like
32 |
33 | ```
34 | interpolated_sf <- predict(model, target_sites_sf)
35 | interpolated_raster <- rasterize(as(interpolated_sf, "SpatVector"), target_raster, "var1.pred")
36 | ```
37 |
38 | where `predict` applies a model at a set of locations as specified in the sites `sf`. This can be viewed as is or converted to a raster with the `rasterize` function. The final result is a raster layer whose geometric properties (extent, cell size, CRS) match the target sites raster we made in step 2. The values in the raster are calculated based on the information contained in the model made in step 3.
39 |
40 | Sometimes you get lucky and you can interpolate straight to raster with
41 |
42 | ```
43 | interpolated_raster <- interpolate(target_raster, model)
44 | ```
45 |
46 | But it doesn't work for all models, for reasons that are hard to work out (believe me, I've tried). For the sake of a consistent approach across nearest neighbour, IDW, trend surfaces, and kriging, we will use the first approach.
47 |
48 | # Onward!
49 | So... those are the steps. It's useful to keep this overall framework in mind before getting too lost in the details!
50 |
51 | [Back to the overview](README.md) | On to [the example dataset](02-example-dataset.md)
52 |
--------------------------------------------------------------------------------
/labs/interpolation/01-overview-of-the-approach.md:
--------------------------------------------------------------------------------
1 |
2 | # The overall approach to interpolation
3 | The *R* ecosystem's approach to spatial interpolation seems pretty complicated at first. It's not exactly simple, although the overall concept is straightforward enough and some of the apparent complexity has more to do with making different data types interact with one another successfully.
4 |
5 | But the basic steps are
6 | 1. Get a set of control points
7 | 2. Define a target set of 'sites' to interpolate at
8 | 3. Build a spatial statistical model from the control points
9 | 4. Apply the model to the target sites
10 |
11 | In a bit more detail these involve:
12 |
13 | ## 1. Get a set of control points
14 | The control points are the empirical data from the field or other source telling you known measurements at known locations.
15 |
16 | So... usually you will be supplied with these. In the instruction pages which follow we make a fake set of control points because in the instructional example we already know the answer. In 'real life' (and in the assignment) the control points will be provided.
17 |
18 | ## 2. Define a target set of 'sites' to interpolate at
19 | We also need to specify where and at what resolution (or degree of detail) we want to perform estimates (i.e. interpolation).
20 |
21 | Generally this will be across the area covered by the control points. We'll assume that a bounding box (i.e. a rectangular region) is 'good enough', but in specific cases you might want to mask out regions, which can make things more complicated.
22 |
23 | The most straightforward way to represent the sites to interpolate at is as an 'empty' raster dataset with the required resolution.
24 |
25 | Surprisingly there is no simple way to make this directly from a set of control points. Instead we have to go via `st_make_grid` to get a grid of points, turn this into an 'xyz' data table, and then use `terra::rast()` with `type = "xyz"` to get a suitable 'target' raster layer. That's way more complicated than I'd like, but it seems to be the most robust approach. How it's done in full is shown in the instructions.
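
In condensed form, and using the same `sf` and `terra` functions as the later instructions, that workflow looks roughly like the sketch below. It is a minimal sketch for reference only, assuming a `controls` point dataset is already loaded and that a 10m cell size is wanted:

```
# for reference: make an 'empty' target raster from a set of control points
sites_xyz <- controls %>%
  st_union() %>%                                     # all the points as one geometry
  st_bbox() %>%                                      # its bounding box...
  st_as_sfc() %>%                                    # ...as a polygon
  st_make_grid(cellsize = 10, what = "centers") %>%  # a grid of points
  st_sf() %>%
  cbind(st_coordinates(.)) %>%                       # add X and Y columns
  st_drop_geometry() %>%                             # keep just a plain table
  mutate(Z = 0)                                      # a dummy 'z' value
target_raster <- rast(sites_xyz, type = "xyz")       # xyz table to raster
crs(target_raster) <- st_crs(controls)$wkt           # don't forget the CRS
```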
26 |
27 | ## 3. Build a spatial statistical model from the control points
28 | The 'crunchy' part is where we use `gstat::gstat` to fit a model to the control points data. This makes an *R* model object which can then be applied by other *R* tools to run interpolations. For the simple methods like IDW there isn't much to this. For kriging making the model gets more complicated.
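
For the IDW case, for example, making the model amounts to little more than the sketch below (a preview of the IDW page, assuming the control points have a `height` attribute we want to interpolate):

```
# for reference: an inverse-distance weighted model
model <- gstat(
  formula = height ~ 1,            # the attribute we want to interpolate
  data = as(controls, "Spatial"),  # gstat wants an sp object, so convert
  set = list(idp = 2)              # the inverse-distance power
)
```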
29 |
30 | ## 4. Apply the model to the target sites
31 | The last step is to apply the model. This ends up looking like
32 |
33 | ```
34 | interpolated_sf <- predict(model, target_sites_sf)
35 | interpolated_raster <- rasterize(as(interpolated_sf, "SpatVector"), target_raster, "var1.pred")
36 | ```
37 |
38 | where `predict` applies a model at a set of locations as specified in the sites `sf`. This can be viewed as is or converted to a raster with the `rasterize` function. The final result is a raster layer whose geometric properties (extent, cell size, CRS) match the target sites raster we made in step 2. The values in the raster are calculated based on the information contained in the model made in step 3.
39 |
40 | Sometimes you get lucky and you can interpolate straight to raster with
41 |
42 | ```
43 | interpolated_raster <- interpolate(target_raster, model)
44 | ```
45 |
46 | But it doesn't work for all models, for reasons that are hard to work out (believe me, I've tried). For the sake of a consistent approach across nearest neighbour, IDW, trend surfaces, and kriging, we will use the first approach.
47 |
48 | # Onward!
49 | So... those are the steps. It's useful to keep this overall framework in mind before getting too lost in the details!
50 |
51 | [Back to the overview](README.md) | On to [the example dataset](02-example-dataset.md)
52 |
--------------------------------------------------------------------------------
/labs/interpolation/02-example-dataset.Rmd:
--------------------------------------------------------------------------------
1 | # Basics of working with raster data
2 | First load some libraries
3 |
4 | ```{r}
5 | library(sf)
6 | library(tmap)
7 | library(terra)
8 | ```
9 | The new(ish) to us kid on the block here is `terra` for handling gridded raster datasets. One thing to be very aware of is that some raster-handling packages mask the `select` function from `dplyr`, so you may need to specify `dplyr::select` when using `select` to tidy up datasets during data preparation.
10 |
11 | ## Read a raster dataset
12 | We are using a simple example of the elevation of Maungawhau (Mt Eden) in Auckland to demonstrate the interpolation methods. There is a version of this dataset available as standard in *R*, but I made a raster version of it for us to work with. So load this with `terra`:
13 |
14 | ```{r}
15 | volcano <- rast("data/maungawhau.tif")
16 | ```
17 |
18 | Confusingly, when you read in a raster dataset it names the associated numerical data using the filename, so we rename that to `height` which is more appropriate for our purposes.
19 |
20 | ```{r}
21 | names(volcano) <- "height"
22 | ```
23 |
24 | We won't get into it for a while, but a raster dataset can actually have multiple layers (such multi-layer data are sometimes called a 'raster brick'); this dataset just has one value per cell.
25 |
26 | ### Map it
27 | `tmap` can handle raster data, so we can use it in the usual way, albeit with the `tm_raster` function to specify colouring and so on.
28 |
29 | ```{r}
30 | tm_shape(volcano) +
31 | tm_raster(pal = "-BrBG", style = "cont") +
32 | tm_legend(outside = TRUE)
33 | ```
34 |
35 | ### Using `persp`
36 | It's also sometimes useful to get a 2.5D view of raster data. A base *R* function that allows us to do this is `persp`:
37 |
38 | ```{r}
39 | persp(volcano, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
40 | ```
41 |
42 | The parameters `theta` and `phi` control the viewing angle. `expand` controls scaling in the vertical direction, which it is often useful to exaggerate so we can see what's going on a bit better. Experiment with these settings a bit to get a feel for things.
43 |
44 | ### Using `rayshader`
45 | Try not to get distracted by it (because it's very cool!) but we can also use a package `rayshader` to make really nice interactive 3D renderings of raster data. Here's how that works:
46 |
47 | ```{r}
48 | library(rayshader)
49 |
50 | # this package wants a matrix not a raster
51 | # so make a matrix copy of the raster data
52 | volcano_m <- raster_to_matrix(volcano)
53 |
54 | volcano_m %>%
55 | sphere_shade(texture = 'bw') %>% # this does the shading
56 | plot_3d(heightmap = volcano_m, zscale = 5, theta = 35, phi = 30, fov = 5)
57 | ```
58 |
59 | `theta` and `phi` are similar to the settings in `persp`. `fov` is the angular field of view and controls the perspective of the object. Here `zscale` tells the rendering the relationship between units in the vertical direction (metres) and the cells in the raster (which are at 10m spacing). To get a two-times exaggeration here, we use `zscale = 5`.
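
If you want to keep a static image of the 3D view (to drop into a report, say), something like the code below should work after the `plot_3d` call. The filename is just an example. Note also that since the cells are at 10m spacing, `zscale = 10` would give a true-scale (unexaggerated) rendering.

```
# optional: save the current rayshader view to a PNG file
# (the filename is just an example)
render_snapshot("maungawhau-3d.png")
```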
60 |
61 | Back to [the overall framework](01-overview-of-the-approach.md) | On to [preparing for interpolation](03-preparing-for-interpolation.md)
--------------------------------------------------------------------------------
/labs/interpolation/02-example-dataset.md:
--------------------------------------------------------------------------------
1 | # Basics of working with raster data
2 | First load some libraries
3 |
4 | ```{r}
5 | library(sf)
6 | library(tmap)
7 | library(terra)
8 | ```
9 | The new(ish) to us kid on the block here is `terra` for handling gridded raster datasets. One thing to be very aware of is that some raster-handling packages mask the `select` function from `dplyr`, so you may need to specify `dplyr::select` when using `select` to tidy up datasets during data preparation.
10 |
11 | ## Read a raster dataset
12 | We are using a simple example of the elevation of Maungawhau (Mt Eden) in Auckland to demonstrate the interpolation methods. There is a version of this dataset available as standard in *R*, but I made a raster version of it for us to work with. So load this with `terra`:
13 |
14 | ```{r}
15 | volcano <- rast("data/maungawhau.tif")
16 | ```
17 |
18 | Confusingly, when you read in a raster dataset it names the associated numerical data using the filename, so we rename that to `height` which is more appropriate for our purposes.
19 |
20 | ```{r}
21 | names(volcano) <- "height"
22 | ```
23 |
24 | We won't get into it for a while, but a raster dataset can actually have multiple layers (such multi-layer data are sometimes called a 'raster brick'); this dataset just has one value per cell.
25 |
26 | ### Map it
27 | `tmap` can handle raster data, so we can use it in the usual way, albeit with the `tm_raster` function to specify colouring and so on.
28 |
29 | ```{r}
30 | tm_shape(volcano) +
31 | tm_raster(pal = "-BrBG", style = "cont") +
32 | tm_legend(outside = TRUE)
33 | ```
34 |
35 | ### Using `persp`
36 | It's also sometimes useful to get a 2.5D view of raster data. A base *R* function that allows us to do this is `persp`:
37 |
38 | ```{r}
39 | persp(volcano, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
40 | ```
41 |
42 | The parameters `theta` and `phi` control the viewing angle. `expand` controls scaling in the vertical direction, which it is often useful to exaggerate so we can see what's going on a bit better. Experiment with these settings a bit to get a feel for things.
43 |
44 | ### Using `rayshader`
45 | Try not to get distracted by it (because it's very cool!) but we can also use a package `rayshader` to make really nice interactive 3D renderings of raster data. Here's how that works:
46 |
47 | ```{r}
48 | library(rayshader)
49 |
50 | # this package wants a matrix not a raster
51 | # so make a matrix copy of the raster data
52 | volcano_m <- raster_to_matrix(volcano)
53 |
54 | volcano_m %>%
55 | sphere_shade(texture = 'bw') %>% # this does the shading
56 | plot_3d(heightmap = volcano_m, zscale = 5, theta = 35, phi = 30, fov = 5)
57 | ```
58 |
59 | `theta` and `phi` are similar to the settings in `persp`. `fov` is the angular field of view and controls the perspective of the object. Here `zscale` tells the rendering the relationship between units in the vertical direction (metres) and the cells in the raster (which are at 10m spacing). To get a two-times exaggeration here, we use `zscale = 5`.
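
If you want to keep a static image of the 3D view (to drop into a report, say), something like the code below should work after the `plot_3d` call. The filename is just an example. Note also that since the cells are at 10m spacing, `zscale = 10` would give a true-scale (unexaggerated) rendering.

```
# optional: save the current rayshader view to a PNG file
# (the filename is just an example)
render_snapshot("maungawhau-3d.png")
```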
60 |
61 | Back to [the overall framework](01-overview-of-the-approach.md) | On to [preparing for interpolation](03-preparing-for-interpolation.md)
--------------------------------------------------------------------------------
/labs/interpolation/03-preparing-for-interpolation.Rmd:
--------------------------------------------------------------------------------
1 |
2 | # Preparing for interpolation
3 | Run this first to make sure all the data and packages you need are loaded:
4 | ```{r}
5 | library(sf)
6 | library(tmap)
7 | library(terra)
8 | library(dplyr)
9 |
10 | volcano <- rast("data/maungawhau.tif")
11 | names(volcano) <- "height"
12 | ```
13 |
14 | ## Spatial extent of the study area
15 | It's useful to have a spatial extent polygon. For the example dataset, here's one I prepared earlier (**remember you wouldn't normally be able to do this!**).
16 |
17 | ```{r}
18 | interp_ext <- st_read("data/interp-ext.gpkg")
19 | ```
20 |
21 | *Normally*, we would make the extent from the control points, or from some pre-existing desired study area extent polygon. The code to make it from a set of control points is shown below (this is for reference, don't run it, but you may need it later when you tackle the assignment).
22 |
23 | ```
24 | controls <- st_read("controls.gpkg")
25 | interp_ext <- controls %>%
26 | st_union() %>%
27 | st_bbox() %>%
28 | st_as_sfc() %>%
29 | st_sf() %>%
30 | st_set_crs(st_crs(controls))
31 | ```
32 |
33 | ## Control points
34 | For the demonstration data, we already know the result.
35 |
36 | Normally, we would have a set of control points in some spatial format and would simply read them with `sf::st_read`. Here, we will make a set of random control points to work with in the other steps of these instructions when we are using the Maungawhau data. We start from the interpolation extent, and use `st_sample` to get a specified number of random points in the extent, then convert it to an `sf` dataset. Finally we use the `terra::extract` function to extract values from the raster dataset and assign them to a `height` attribute of the `sf` dataset.
37 |
38 | ```{r}
39 | controls <- interp_ext %>%
40 | st_sample(size = 250) %>%
41 | st_sf() %>%
42 | st_set_crs(st_crs(interp_ext))
43 |
44 | heights <- controls %>%
45 | extract(x = volcano)
46 |
47 | controls <- controls %>%
48 | mutate(height = heights$height)
49 | ```
50 |
51 | Every time you run the above you will get a different random set of the specified number of control point locations. It is useful to map them on top of the underlying data and think about how many you might need to get a reasonable representation of the height map of Maungawhau.
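
For example, something like the (optional) code below overlays the control points on the underlying raster, so you can judge how well they cover the surface:

```
# optional: check control point coverage against the 'true' surface
tm_shape(volcano) +
  tm_raster(pal = "-BrBG", style = "cont") +
  tm_shape(controls) +
  tm_dots(col = "black") +
  tm_legend(outside = TRUE)
```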
52 |
53 | For simplicity, I am going to write these control points out to a file, which can be loaded into later instructions documents.
54 |
55 | ```{r}
56 | st_write(controls, "data/controls.gpkg", delete_dsn = TRUE)
57 | ```
58 |
59 | Some interpolation tools don't want an `sf` dataset, but a simple dataframe with `x`, `y` and `z` attributes, so let's also make one of those:
60 |
61 | ```{r}
62 | st_read("data/controls.gpkg") %>%
63 | cbind(st_coordinates(.)) %>% # this adds the coordinates of the points as X and Y columns
64 | st_drop_geometry() %>% # throw away the geometry, so it's just a dataframe
65 | mutate(x = X, y = Y, z = height) %>% # rename to the generic x, y, z
66 | dplyr::select(x, y, z) %>% # and retain only those three
67 | write.csv("data/controls-xyz.csv", row.names = FALSE) # write out to a file
68 | ```
69 |
70 | Remember that if you change the control points, you should also change this file to keep them matching.
71 |
72 | ## Make a set of locations to interpolate
73 | Unlike the previous step, which may not be necessary when you are provided with control points to interpolate directly, this step is always required. Basically, *R* wants a raster layer to *interpolate into*. We'll call this `sites` and make `sf`, `terra` and simple xyz versions of it.
74 |
75 | ```{r}
76 | sites_sf <- interp_ext %>% # start with the extent
77 | st_make_grid(cellsize = 10, what = "centers") %>%
78 | st_sf() %>%
79 | st_set_crs(st_crs(interp_ext))
80 |
81 | sites_xyz <- sites_sf %>%
82 | bind_cols(st_coordinates(.)) %>%
83 | st_drop_geometry() %>%
84 | mutate(Z = 0)
85 |
86 | sites_raster <- sites_xyz %>%
87 | rast(type = "xyz")
88 | crs(sites_raster) <- st_crs(controls)$wkt
89 | ```
90 |
91 | Again, it's a good idea to write these all out to files so we don't have to keep remaking them.
92 |
93 | ```{r}
94 | st_write(sites_sf, "data/sites-sf.gpkg", delete_layer = TRUE)
95 | write.csv(sites_xyz, "data/sites-xyz.csv", row.names = FALSE)
96 | writeRaster(sites_raster, "data/sites-raster.tif", overwrite = TRUE)
97 | ```
98 |
99 | Back to [the example dataset](02-example-dataset.md) | On to [Near neighbour and IDW](04-nn-and-idw.md)
100 |
--------------------------------------------------------------------------------
/labs/interpolation/03-preparing-for-interpolation.md:
--------------------------------------------------------------------------------
1 |
2 | # Preparing for interpolation
3 | Run this first to make sure all the data and packages you need are loaded:
4 | ```{r}
5 | library(sf)
6 | library(tmap)
7 | library(terra)
8 | library(dplyr)
9 |
10 | volcano <- rast("data/maungawhau.tif")
11 | names(volcano) <- "height"
12 | ```
13 |
14 | ## Spatial extent of the study area
15 | It's useful to have a spatial extent polygon. For the example dataset, here's one I prepared earlier (**remember you wouldn't normally be able to do this!**).
16 |
17 | ```{r}
18 | interp_ext <- st_read("data/interp-ext.gpkg")
19 | ```
20 |
21 | *Normally*, we would make the extent from the control points, or from some pre-existing desired study area extent polygon. The code to make it from a set of control points is shown below (this is for reference, don't run it, but you may need it later when you tackle the assignment).
22 |
23 | ```
24 | controls <- st_read("controls.gpkg")
25 | interp_ext <- controls %>%
26 | st_union() %>%
27 | st_bbox() %>%
28 | st_as_sfc() %>%
29 | st_sf() %>%
30 | st_set_crs(st_crs(controls))
31 | ```
32 |
33 | ## Control points
34 | For the demonstration data, we already know the result.
35 |
36 | Normally, we would have a set of control points in some spatial format and would simply read them with `sf::st_read`. Here, we will make a set of random control points to work with in the other steps of these instructions when we are using the Maungawhau data. We start from the interpolation extent, and use `st_sample` to get a specified number of random points in the extent, then convert it to an `sf` dataset. Finally we use the `terra::extract` function to extract values from the raster dataset and assign them to a `height` attribute of the `sf` dataset.
37 |
38 | ```{r}
39 | controls <- interp_ext %>%
40 | st_sample(size = 250) %>%
41 | st_sf() %>%
42 | st_set_crs(st_crs(interp_ext))
43 |
44 | heights <- controls %>%
45 | extract(x = volcano)
46 |
47 | controls <- controls %>%
48 | mutate(height = heights$height)
49 | ```
50 |
51 | Every time you run the above you will get a different random set of the specified number of control point locations. It is useful to map them on top of the underlying data and think about how many you might need to get a reasonable representation of the height map of Maungawhau.
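
For example, something like the (optional) code below overlays the control points on the underlying raster, so you can judge how well they cover the surface:

```
# optional: check control point coverage against the 'true' surface
tm_shape(volcano) +
  tm_raster(pal = "-BrBG", style = "cont") +
  tm_shape(controls) +
  tm_dots(col = "black") +
  tm_legend(outside = TRUE)
```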
52 |
53 | For simplicity, I am going to write these control points out to a file, which can be loaded into later instructions documents.
54 |
55 | ```{r}
56 | st_write(controls, "data/controls.gpkg", delete_dsn = TRUE)
57 | ```
58 |
59 | Some interpolation tools don't want an `sf` dataset, but a simple dataframe with `x`, `y` and `z` attributes, so let's also make one of those:
60 |
61 | ```{r}
62 | st_read("data/controls.gpkg") %>%
63 | cbind(st_coordinates(.)) %>% # this adds the coordinates of the points as X and Y columns
64 | st_drop_geometry() %>% # throw away the geometry, so it's just a dataframe
65 | mutate(x = X, y = Y, z = height) %>% # rename to the generic x, y, z
66 | dplyr::select(x, y, z) %>% # and retain only those three
67 | write.csv("data/controls-xyz.csv", row.names = FALSE) # write out to a file
68 | ```
69 |
70 | Remember that if you change the control points, you should also change this file to keep them matching.
71 |
72 | ## Make a set of locations to interpolate
73 | Unlike the previous step, which may not be necessary when you are provided with control points to interpolate directly, this step is always required. Basically, *R* wants a raster layer to *interpolate into*. We'll call this `sites` and make `sf`, `terra` and simple xyz versions of it.
74 |
75 | ```{r}
76 | sites_sf <- interp_ext %>% # start with the extent
77 | st_make_grid(cellsize = 10, what = "centers") %>%
78 | st_sf() %>%
79 | st_set_crs(st_crs(interp_ext))
80 |
81 | sites_xyz <- sites_sf %>%
82 | bind_cols(st_coordinates(.)) %>%
83 | st_drop_geometry() %>%
84 | mutate(Z = 0)
85 |
86 | sites_raster <- sites_xyz %>%
87 | rast(type = "xyz")
88 | crs(sites_raster) <- st_crs(controls)$wkt
89 | ```
90 |
91 | Again, it's a good idea to write these all out to files so we don't have to keep remaking them.
92 |
93 | ```{r}
94 | st_write(sites_sf, "data/sites-sf.gpkg", delete_layer = TRUE)
95 | write.csv(sites_xyz, "data/sites-xyz.csv", row.names = FALSE)
96 | writeRaster(sites_raster, "data/sites-raster.tif", overwrite = TRUE)
97 | ```
98 |
99 | Back to [the example dataset](02-example-dataset.md) | On to [Near neighbour and IDW](04-nn-and-idw.md)
100 |
--------------------------------------------------------------------------------
/labs/interpolation/04-nn-and-idw.Rmd:
--------------------------------------------------------------------------------
1 | # Near neighbour and inverse-distance weighted interpolation
2 | Run this first to make sure all the data and packages you need are loaded. If any data are missing you probably didn't make them in one of the previous instruction pages.
3 |
4 | ```{r}
5 | library(sf)
6 | library(tmap)
7 | library(terra)
8 | library(dplyr)
9 | library(gstat)
10 |
11 | volcano <- rast("data/maungawhau.tif")
12 | names(volcano) <- "height"
13 |
14 | controls <- st_read("data/controls.gpkg")
15 | sites_sf <- st_read("data/sites-sf.gpkg")
16 | sites_raster <- rast("data/sites-raster.tif")
17 | ```
18 |
19 | ## Inverse-distance weighted interpolation
20 | These two methods are very similar, and IDW is actually *more general* so we'll show it first.
21 |
22 | As with all the `gstat` methods we use the `gstat::gstat` function to make a statistical model, and then apply it using the `predict` function.
23 |
24 | ```{r}
25 | fit_IDW <- gstat( # makes a model
26 | formula = height ~ 1, # The column `height` is what we are interested in
27 | data = as(controls, "Spatial"), # using sf but converting to sp, which is required
28 | set = list(idp = 2),
29 | # nmax = 12, maxdist = 100 # you can experiment with these options later...
30 | )
31 | ```
32 |
33 | The `idp` setting here is the inverse-distance power used in the calculation. Once you understand what is going on in general, you should experiment with this, and also with `nmax` (the maximum number of control points to include in any estimate) and `maxdist` (the maximum distance to any control point to use in an estimate) to see how the results change.
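
For reference, the IDW estimate at a location $s_0$ is just a weighted average of the control point values $z_i$, with weights that decay with the distances $d_i$ from the control points to $s_0$, raised to the power `idp` (written $p$ below):

$$\hat{z}(s_0) = \frac{\sum_i z_i d_i^{-p}}{\sum_i d_i^{-p}}$$

so a larger `idp` concentrates the weight on nearby control points, while `nmax` and `maxdist` simply limit which control points enter the sums.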
34 |
35 | Having made the model (called `fit_IDW`) we pass it to the `predict` function to obtain interpolated values (called `var1.pred`) at the locations specified by `sites_sf`, and then finally convert this to a raster for visualization.
36 |
37 | ```{r}
38 | interp_pts_IDW <- predict(fit_IDW, sites_sf)
39 | interp_IDW <- rasterize(as(interp_pts_IDW, "SpatVector"), sites_raster, "var1.pred")
40 | names(interp_IDW) <- "height" # rename the variable to something more friendly
41 | ```
42 |
43 | And then we can view the outcome in the usual ways.
44 | ```{r}
45 | interp_IDW
46 | ```
47 |
48 | And we can map the result
49 |
50 | ```{r}
51 | tm_shape(interp_IDW) +
52 | tm_raster(pal = "-BrBG", style = "cont") +
53 | tm_legend(outside = TRUE)
54 | ```
55 |
56 | ```{r}
57 | persp(interp_IDW, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
58 | ```
59 |
60 | ### Nearest neighbour
61 | The basic model above can be parameterised differently, setting `nmax` to 1 to make a simple nearest neighbour (i.e. proximity polygon) interpolation:
62 |
63 | ```{r}
64 | fit_NN <- gstat( # makes a model
65 | formula = height ~ 1,
66 | data = as(controls, "Spatial"),
67 | nmax = 1, # setting nmax to 1 means only the single nearest control point is used, giving nearest neighbour interpolation
68 | )
69 |
70 | # and interpolate like before
71 | interp_pts_NN <- predict(fit_NN, sites_sf)
72 | interp_NN <- rasterize(as(interp_pts_NN, "SpatVector"), sites_raster, "var1.pred")
73 | names(interp_NN) <- "height"
74 |
75 | # and display it
76 | persp(interp_NN, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
77 | ```
78 |
79 | We can confirm this matches with proximity polygons like this:
80 | ```{r}
81 | tm_shape(interp_NN) +
82 | tm_raster(pal = "-BrBG", style = "cont") +
83 | tm_shape(st_voronoi(st_union(controls))) +
84 | tm_borders(col = "yellow", lwd = 0.5) +
85 | tm_shape(controls) +
86 | tm_dots() +
87 | tm_legend(outside = TRUE)
88 | ```
89 |
90 | Back to [data prep](03-preparing-for-interpolation.md) | On to [trend surfaces and kriging](05-trend-surfaces-and-kriging.md)
--------------------------------------------------------------------------------
/labs/interpolation/04-nn-and-idw.md:
--------------------------------------------------------------------------------
1 | # Near neighbour and inverse-distance weighted interpolation
2 | Run this first to make sure all the data and packages you need are loaded. If any data are missing you probably didn't make them in one of the previous instruction pages.
3 |
4 | ```{r}
5 | library(sf)
6 | library(tmap)
7 | library(terra)
8 | library(dplyr)
9 | library(gstat)
10 |
11 | volcano <- rast("data/maungawhau.tif")
12 | names(volcano) <- "height"
13 |
14 | controls <- st_read("data/controls.gpkg")
15 | sites_sf <- st_read("data/sites-sf.gpkg")
16 | sites_raster <- rast("data/sites-raster.tif")
17 | ```
18 |
19 | ## Inverse-distance weighted interpolation
20 | These two methods are very similar, and IDW is actually *more general* so we'll show it first.
21 |
22 | As with all the `gstat` methods we use the `gstat::gstat` function to make a statistical model, and then apply it using the `predict` function.
23 |
24 | ```{r}
25 | fit_IDW <- gstat( # makes a model
26 | formula = height ~ 1, # The column `height` is what we are interested in
27 | data = as(controls, "Spatial"), # using sf but converting to sp, which is required
28 | set = list(idp = 2),
29 | # nmax = 12, maxdist = 100 # you can experiment with these options later...
30 | )
31 | ```
32 |
33 | The `idp` setting here is the inverse-distance power used in the calculation. Once you understand what is going on in general, you should experiment with this, and also with `nmax` (the maximum number of control points to include in any estimate) and `maxdist` (the maximum distance to any control point to use in an estimate) to see how the results change.
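
For reference, the IDW estimate at a location $s_0$ is just a weighted average of the control point values $z_i$, with weights that decay with the distances $d_i$ from the control points to $s_0$, raised to the power `idp` (written $p$ below):

$$\hat{z}(s_0) = \frac{\sum_i z_i d_i^{-p}}{\sum_i d_i^{-p}}$$

so a larger `idp` concentrates the weight on nearby control points, while `nmax` and `maxdist` simply limit which control points enter the sums.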
34 |
35 | Having made the model (called `fit_IDW`) we pass it to the `predict` function to obtain interpolated values (called `var1.pred`) at the locations specified by `sites_sf`, and then finally convert this to a raster for visualization.
36 |
37 | ```{r}
38 | interp_pts_IDW <- predict(fit_IDW, sites_sf)
39 | interp_IDW <- rasterize(as(interp_pts_IDW, "SpatVector"), sites_raster, "var1.pred")
40 | names(interp_IDW) <- "height" # rename the variable to something more friendly
41 | ```
42 |
43 | And then we can view the outcome in the usual ways.
44 | ```{r}
45 | interp_IDW
46 | ```
47 |
48 | And we can map the result
49 |
50 | ```{r}
51 | tm_shape(interp_IDW) +
52 | tm_raster(pal = "-BrBG", style = "cont") +
53 | tm_legend(outside = TRUE)
54 | ```
55 |
56 | ```{r}
57 | persp(interp_IDW, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
58 | ```
59 |
60 | ### Nearest neighbour
61 | The basic model above can be parameterised differently, setting `nmax` to 1 to make a simple nearest neighbour (i.e. proximity polygon) interpolation:
62 |
63 | ```{r}
64 | fit_NN <- gstat( # makes a model
65 | formula = height ~ 1,
66 | data = as(controls, "Spatial"),
67 | nmax = 1, # setting nmax to 1 means only the single nearest control point is used, giving nearest neighbour interpolation
68 | )
69 |
70 | # and interpolate like before
71 | interp_pts_NN <- predict(fit_NN, sites_sf)
72 | interp_NN <- rasterize(as(interp_pts_NN, "SpatVector"), sites_raster, "var1.pred")
73 | names(interp_NN) <- "height"
74 |
75 | # and display it
76 | persp(interp_NN, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
77 | ```
78 |
79 | We can confirm this matches with proximity polygons like this:
80 | ```{r}
81 | tm_shape(interp_NN) +
82 | tm_raster(pal = "-BrBG", style = "cont") +
83 | tm_shape(st_voronoi(st_union(controls))) +
84 | tm_borders(col = "yellow", lwd = 0.5) +
85 | tm_shape(controls) +
86 | tm_dots() +
87 | tm_legend(outside = TRUE)
88 | ```
89 |
90 | Back to [data prep](03-preparing-for-interpolation.md) | On to [trend surfaces and kriging](05-trend-surfaces-and-kriging.md)
--------------------------------------------------------------------------------
/labs/interpolation/05-trend-surfaces-and-kriging.Rmd:
--------------------------------------------------------------------------------
1 |
2 | # Trend surfaces and kriging
3 |
4 | Run this first to make sure all the data and packages you need are loaded. If any data are missing you probably didn't make them in one of the previous instruction pages.
5 |
6 | ```{r}
7 | library(sf)
8 | library(tmap)
9 | library(terra)
10 | library(dplyr)
11 | library(gstat)
12 |
13 | volcano <- rast("data/maungawhau.tif")
14 | names(volcano) <- "height"
15 |
16 | controls <- st_read("data/controls.gpkg")
17 | sites_sf <- st_read("data/sites-sf.gpkg")
18 | sites_raster <- rast("data/sites-raster.tif")
19 | ```
20 |
21 | There are many different styles of kriging. We'll work here with universal kriging, which models variation in the data with two components: a *trend surface*, and a *variogram* that models how the differences between values at control points vary with the distance between them. So... to perform kriging we have to consider each of these elements in turn.
22 |
23 | ## Trend surfaces
24 |
25 | Trend surfaces are a special kind of linear regression where we use the spatial coordinates of the control points as predictors of the values measured at those points. The function that is fitted is a polynomial expression in the coordinates. In addition to being a component of a universal kriging interpolation, trend surfaces are sometimes a reasonable choice of interpolation in their own right, especially when data and knowledge are limited, or the investigation is exploratory.
26 |
27 | ```{r}
28 | fit_TS <- gstat(
29 | formula = height ~ 1,
30 | data = as(controls, "Spatial"),
31 | # nmax = 24,
32 | degree = 2,
33 | )
34 | ```
35 |
36 | The form of the trend surface function is specified by the `degree` parameter, which sets the maximum power to which the coordinates may be raised in the polynomial. For example with `degree = 2`, the polynomial is $z=b_0 + b_1x + b_2y + b_3xy + b_4x^2 + b_5y^2$.
37 |
38 | ```{r}
39 | interp_pts_TS <- predict(fit_TS, sites_sf)
40 | interp_TS <- rasterize(as(interp_pts_TS, "SpatVector"), sites_raster, field = c("var1.pred", "var1.var"))
41 |
42 | persp(interp_TS$var1.pred, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
43 | ```
44 |
45 | You can only specify degree from 1 to a maximum of 3. In theory you can specify it as 0, but I am still trying to figure out what `degree = 0` means... besides which, it appears not to work, at least in this case.
46 |
47 | You can also specify `nmax` and `nmin`, which will cause localised trend surfaces to be made. If you set `nmax` too low for a particular `degree` (higher degrees need higher `nmax` settings), it can cause strange behaviour, particularly in empty regions of the control point dataset or near the edges. I recommend experimenting with the `degree` setting, then uncommenting the `nmax` setting and experimenting further to see how changing the model `fit_TS` changes the interpolations below.
48 |
49 | ### Variance estimates
50 |
51 | Notice this time that we get multiple layers in the resulting interpolation.
52 |
53 | ```{r}
54 | plot(interp_TS)
55 | ```
56 |
57 | This is why in the perspective plot we specify `interp_TS$var1.pred` which is the predicted value. The `var1.var` is an estimate of likely variance in the estimates. It will generally be higher where control points are sparse, or where the surface is changing rapidly.
58 |
59 | ## Making a variogram
60 |
61 | The other half of kriging is the model of the residual spatial structure (after the trend surface is accounted for) that we use, otherwise known as a *variogram*.
62 |
63 | The simplest variogram model is based on fitting a curve to the *empirical variogram* which is essentially a plot of distance between control points against some measure of difference in associated values.
64 |
65 | ```{r}
66 | v <- variogram(
67 | height ~ 1,
68 | data = as(controls, "Spatial"),
69 | cutoff = 600,
70 | width = 25,
71 | # cloud = TRUE
72 | )
73 | plot(v)
74 | ```
75 |
76 | The `cutoff` is the separation distance between control points beyond which we aren't interested. Since we'd be unlikely to use information from two points more than 600m apart in this case, we set that as an upper limit. It's generally good to see the empirical variogram plot levelling off as it does with the settings above. This plot is based on averaging differences in a series of distance intervals (width set using the `width` parameter). If you want to see the full underlying dataset then uncomment the `cloud = TRUE` setting and run the above chunk again. If you do that **be sure to rerun with `cloud = TRUE` commented out again** before proceeding.
77 |
78 | Next, we fit a curve to the empirical variogram using the `fit.variogram` function. This requires a functional form (the `model` parameter), which is set by calling the `vgm` function. Many curves are possible (you can get a list by running `vgm()` with no parameters), although for these data, I've found that `"Sph"` is the most convincing option.
79 |
80 | ```{r}
81 | v.fit <- gstat::fit.variogram(v, vgm(model = "Sph"))
82 | plot(v, v.fit)
83 | ```
84 |
85 | ## Using the variogram for ordinary kriging
86 |
87 | Now everything is set up to perform kriging interpolation. Kriging in principle is just another geostatistical model and we make it in a similar way to the others. The key difference is we specify the variogram with the `model` parameter setting:
88 |
89 | ```{r}
90 | fit_K <- gstat(
91 | formula = height ~ 1,
92 | data = as(controls, "Spatial"),
93 | model = v.fit,
94 | nmax = 8
95 | )
96 | ```
97 |
98 | I have found that it is important to set a fairly low `nmax` with these data. I believe that this is because the control points are randomly located and sometimes may have dramatically different values from the interpolation location. But... really I am unsure about this, and we'll consider this in more detail in the next section.
99 |
100 | ```{r}
101 | interp_pts_K <- predict(fit_K, sites_sf)
102 | interp_K <- rasterize(as(interp_pts_K, "SpatVector"), sites_raster, field = c("var1.pred", "var1.var"))
103 |
104 | persp(interp_K$var1.pred, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
105 | ```
106 |
107 | You may see some areas where the interpolation has not worked out so well (you might not, as it depends on the control point locations you got). If so, you might be able to see why by running the code chunk below.
108 |
109 | ```{r}
110 | tm_shape(interp_K$var1.var) +
111 | tm_raster(pal = "Reds", style = "cont") +
112 | tm_legend(outside = TRUE) +
113 | tm_shape(controls) +
114 | tm_dots(col = "black")
115 | ```
116 |
117 | Kriging is complicated and this exercise is intended only to give you a flavour of that, so we will move on.
118 |
119 | Back to [NN and IDW](04-nn-and-idw.md) | On to [splines](06-splines.md)
--------------------------------------------------------------------------------
/labs/interpolation/05-trend-surfaces-and-kriging.md:
--------------------------------------------------------------------------------
1 |
2 | # Trend surfaces and kriging
3 |
4 | Run this first to make sure all the data and packages you need are loaded. If any data are missing you probably didn't make them in one of the previous instruction pages.
5 |
6 | ```{r}
7 | library(sf)
8 | library(tmap)
9 | library(terra)
10 | library(dplyr)
11 | library(gstat)
12 |
13 | volcano <- rast("data/maungawhau.tif")
14 | names(volcano) <- "height"
15 |
16 | controls <- st_read("data/controls.gpkg")
17 | sites_sf <- st_read("data/sites-sf.gpkg")
18 | sites_raster <- rast("data/sites-raster.tif")
19 | ```
20 |
21 | There are many different styles of kriging. We'll work here with universal kriging, which models variation in the data with two components: a *trend surface*, and a *variogram* that models how the differences between values at control points vary with the distance between them. So... to perform kriging we have to consider each of these elements in turn.
22 |
23 | ## Trend surfaces
24 |
25 | Trend surfaces are a special kind of linear regression where we use the spatial coordinates of the control points as predictors of the values measured at those points. The function that is fitted is a polynomial expression in the coordinates. In addition to being a component of a universal kriging interpolation, trend surfaces are sometimes a reasonable choice of interpolation in their own right, especially when data and knowledge are limited, or the investigation is exploratory.
26 |
27 | ```{r}
28 | fit_TS <- gstat(
29 | formula = height ~ 1,
30 | data = as(controls, "Spatial"),
31 | # nmax = 24,
32 | degree = 2,
33 | )
34 | ```
35 |
36 | The form of the trend surface function is specified by the `degree` parameter, which sets the maximum power to which the coordinates may be raised in the polynomial. For example with `degree = 2`, the polynomial is $z=b_0 + b_1x + b_2y + b_3xy + b_4x^2 + b_5y^2$.
37 |
38 | ```{r}
39 | interp_pts_TS <- predict(fit_TS, sites_sf)
40 | interp_TS <- rasterize(as(interp_pts_TS, "SpatVector"), sites_raster, field = c("var1.pred", "var1.var"))
41 |
42 | persp(interp_TS$var1.pred, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
43 | ```
44 |
45 | You can only specify degree from 1 to a maximum of 3. In theory you can specify it as 0, but I am still trying to figure out what `degree = 0` means... besides which, it appears not to work, at least in this case.
46 |
47 | You can also specify `nmax` and `nmin`, which will cause localised trend surfaces to be made. If you set `nmax` too low for a particular `degree` (higher degrees need higher `nmax` settings), it can cause strange behaviour, particularly in empty regions of the control point dataset or near the edges. I recommend experimenting with the `degree` setting, then uncommenting the `nmax` setting and experimenting further to see how changing the model `fit_TS` changes the interpolations below.
48 |
49 | ### Variance estimates
50 |
51 | Notice this time that we get multiple layers in the resulting interpolation.
52 |
53 | ```{r}
54 | plot(interp_TS)
55 | ```
56 |
57 | This is why in the perspective plot we specify `interp_TS$var1.pred` which is the predicted value. The `var1.var` is an estimate of likely variance in the estimates. It will generally be higher where control points are sparse, or where the surface is changing rapidly.
58 |
59 | ## Making a variogram
60 |
61 | The other half of kriging is the model of the residual spatial structure (after the trend surface is accounted for) that we use, otherwise known as a *variogram*.
62 |
63 | The simplest variogram model is based on fitting a curve to the *empirical variogram* which is essentially a plot of distance between control points against some measure of difference in associated values.
64 |
65 | ```{r}
66 | v <- variogram(
67 | height ~ 1,
68 | data = as(controls, "Spatial"),
69 | cutoff = 600,
70 | width = 25,
71 | # cloud = TRUE
72 | )
73 | plot(v)
74 | ```
75 |
76 | The `cutoff` is the separation distance between control points beyond which we aren't interested. Since we'd be unlikely to use information from two points more than 600m apart in this case, we set that as an upper limit. It's generally good to see the empirical variogram plot levelling off as it does with the settings above. This plot is based on averaging differences in a series of distance intervals (width set using the `width` parameter). If you want to see the full underlying dataset then uncomment the `cloud = TRUE` setting and run the above chunk again. If you do that **be sure to rerun with `cloud = TRUE` commented out again** before proceeding.
77 |
78 | Next, we fit a curve to the empirical variogram using the `fit.variogram` function. This requires a functional form (the `model` parameter), which is set by calling the `vgm` function. Many curves are possible (you can get a list by running `vgm()` with no parameters), although for these data, I've found that `"Sph"` is the most convincing option.
79 |
80 | ```{r}
81 | v.fit <- gstat::fit.variogram(v, vgm(model = "Sph"))
82 | plot(v, v.fit)
83 | ```
84 |
85 | ## Using the variogram for ordinary kriging
86 |
87 | Now everything is set up to perform kriging interpolation. Kriging in principle is just another geostatistical model and we make it in a similar way to the others. The key difference is we specify the variogram with the `model` parameter setting:
88 |
89 | ```{r}
90 | fit_K <- gstat(
91 | formula = height ~ 1,
92 | data = as(controls, "Spatial"),
93 | model = v.fit,
94 | nmax = 8
95 | )
96 | ```
97 |
98 | I have found that it is important to set a fairly low `nmax` with these data. I believe that this is because the control points are randomly located and sometimes may have dramatically different values from the interpolation location. But... really I am unsure about this, and we'll consider this in more detail in the next section.
99 |
100 | ```{r}
101 | interp_pts_K <- predict(fit_K, sites_sf)
102 | interp_K <- rasterize(as(interp_pts_K, "SpatVector"), sites_raster, field = c("var1.pred", "var1.var"))
103 |
104 | persp(interp_K$var1.pred, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
105 | ```
106 |
107 | You may see some areas where the interpolation has not worked out so well (you might not, as it depends on the control point locations you got). If so, you might be able to see why by running the code chunk below.
108 |
109 | ```{r}
110 | tm_shape(interp_K$var1.var) +
111 | tm_raster(pal = "Reds", style = "cont") +
112 | tm_legend(outside = TRUE) +
113 | tm_shape(controls) +
114 | tm_dots(col = "black")
115 | ```
116 |
117 | Kriging is complicated and this exercise is intended only to give you a flavour of that, so we will move on.
118 |
119 | Back to [NN and IDW](04-nn-and-idw.md) | On to [splines](06-splines.md)
--------------------------------------------------------------------------------
/labs/interpolation/05B-trend-surfaces-and-kriging.Rmd:
--------------------------------------------------------------------------------
1 |
2 | # Two digressions on trend surfaces and kriging
3 |
4 | Run this first to make sure all the data and packages you need are loaded. If any data are missing you probably didn't make them in one of the previous instruction pages.
5 |
6 | ```{r message = FALSE}
7 | library(sf)
8 | library(tmap)
9 | library(terra)
10 | library(dplyr)
11 | library(gstat)
12 |
13 | ## The warnings are a bit out of hand on this page so
14 | options("rgdal_show_exportToProj4_warnings"="none")
15 |
16 | volcano <- rast("data/maungawhau.tif")
17 | names(volcano) <- "height"
18 |
19 | controls <- st_read("data/controls.gpkg")
20 | sites_sf <- st_read("data/sites-sf.gpkg")
21 | sites_raster <- rast("data/sites-raster.tif")
22 | ```
23 |
24 | ## Some thoughts on kriging in `gstat`
25 |
26 | Although `gstat` is the 'go to' package for geostatistics in *R* and although it is very flexible, it has some limitations. Among these are:
27 |
28 | - Awkward interfaces to spatial data (not exclusive to `gstat`!)
29 | - Making the variogram by hand is tricky
30 | - Inclusion of a trend surface to run *universal kriging* is particularly challenging to get right
31 | - And, even if you can get *universal kriging* to work, you can't use a localised trend surface (admittedly this is not supported by many platforms)
32 |
33 | ## Digression 1: evenly spaced control points
34 |
35 | Some of the challenges encountered in the main instructions are mitigated with better control points. You can use `spatstat` spatial point processes to control the `st_sample` function, and the code chunk below shows how this can improve kriging results.
36 |
37 | ```{r}
38 | # new control set using the rSSI point process from spatstat
39 | controls_ssi <- st_read("data/interp-ext.gpkg") %>%
40 | st_sample(size = 250, type = "SSI", r = 30, n = 250) %>%
41 | st_sf() %>%
42 | st_set_crs(st_crs(controls))
43 |
44 | heights_ssi <- controls_ssi %>%
45 | extract(x = volcano)
46 |
47 | controls_ssi <- controls_ssi %>%
48 | mutate(height = heights_ssi$height)
49 | ```
50 |
51 | Now put them on a web map and note how much more evenly spaced the `controls_ssi` points are.
52 | ```{r}
53 | tmap_mode('view')
54 | tm_shape(controls) + tm_dots(col = "black") + tm_shape(controls_ssi) + tm_dots(col = "red")
55 | ```
56 |
57 | ```{r}
58 | # make a new variogram
59 | v_ssi <- variogram(
60 | height ~ 1,
61 | data = as(controls_ssi, "Spatial"),
62 | cutoff = 500,
63 | width = 25,
64 | )
65 | # fit the variogram
66 | v.fit_ssi <- fit.variogram(v_ssi, vgm(model = "Gau"))
67 | # make the kriging model
68 | fit_K_ssi <- gstat(
69 | formula = height ~ 1,
70 | data = as(controls_ssi, "Spatial"),
71 | model = v.fit_ssi,
72 | nmax = 8
73 | )
74 | # interpolate!
75 | interp_pts_K_ssi <- predict(fit_K_ssi, sites_sf)
76 | interp_K_ssi <- rasterize(as(interp_pts_K_ssi, "SpatVector"),
77 | sites_raster, field = c("var1.pred", "var1.var"))$var1.pred
78 |
79 | persp(interp_K_ssi$var1.pred, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
80 | ```
81 |
82 | The lesson here is that we could *always* use better data!
83 |
84 | ## Digression 2: Using a local trend surface in universal kriging
85 |
86 | In theory universal kriging models overall trends with a trend surface of some kind, then interpolates the residuals from that surface using a variogram and kriging.
87 |
88 | But we saw before that with a localised trend surface you can already get a pretty nice interpolation using that alone. However, `gstat` doesn't let you use localised trend surfaces in kriging---at least not in any simple way. In the code chunk below, I show how this limitation can potentially be sidestepped.
89 |
90 | ```{r}
91 | # make a trend surface interpolation
92 | fit_TS <- gstat(
93 | formula = height ~ 1,
94 | data = as(controls, "Spatial"),
95 | nmax = 24,
96 | degree = 3,
97 | )
98 | interp_pts_TS <- predict(fit_TS, sites_sf)
99 | interp_TS <- rasterize(as(interp_pts_TS, "SpatVector"), sites_raster, field = c("var1.pred", "var1.var"))$var1.pred
100 |
101 | persp(interp_TS, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
102 | ```
103 |
104 | Now we proceed to krige on the residuals from this surface.
105 |
106 | ```{r}
107 | # get the ts values and include in controls also making a residual
108 | ts_estimates <- extract(interp_TS, as(controls, "SpatVector")) %>%
109 | as_tibble() %>%
110 | select(var1.pred) %>%
111 | rename(ts = var1.pred)
112 |
113 | controls_resid <- controls %>%
114 | bind_cols(ts_estimates) %>%
115 | mutate(resid = height - ts)
116 |
117 | # now proceeed with kriging on the residual values
118 | v_resid <- variogram(
119 | resid ~ 1,
120 | data = as(controls_resid, "Spatial"),
121 | cutoff = 500,
122 | width = 25,
123 | )
124 | # fit the variogram
125 | v.fit_resid <- fit.variogram(v_resid, vgm(model = "Sph"))
126 | # make the kriging model
127 | fit_K_resid <- gstat(
128 | formula = resid ~ 1,
129 | data = as(controls_resid, "Spatial"),
130 | model = v.fit_resid,
131 | nmax = 8
132 | )
133 | # interpolate!
134 | interp_pts_K_resid <- predict(fit_K_resid, sites_sf)
135 | interp_K_resid <- rasterize(as(interp_pts_K_resid, "SpatVector"), sites_raster, field = c("var1.pred", "var1.var"))
136 | interp_K_final <- interp_TS + interp_K_resid$var1.pred
137 |
138 | persp(interp_K_final, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
139 | ```
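
As a rough check on how the combined surface turned out, you can difference it against the original raster. This is just a sketch: it assumes the interpolation grid overlaps the `volcano` layer, and it resamples the original onto the interpolation grid before subtracting.

```{r}
# resample the original surface onto the interpolation grid, then map the error
volcano_on_grid <- resample(volcano, interp_K_final)
tm_shape(interp_K_final - volcano_on_grid) +
  tm_raster(palette = "RdBu", style = "cont", title = "Error (m)") +
  tm_legend(outside = TRUE)
```
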
140 | Back to [trend surfaces and kriging](05-trend-surfaces-and-kriging.md) | On to [splines](06-splines.md)
141 |
--------------------------------------------------------------------------------
/labs/interpolation/05B-trend-surfaces-and-kriging.md:
--------------------------------------------------------------------------------
1 |
2 | # Two digressions on trend surfaces and kriging
3 |
4 | Run this first to make sure all the data and packages you need are loaded. If any data are missing you probably didn't make them in one of the previous instruction pages.
5 |
6 | ```{r message = FALSE}
7 | library(sf)
8 | library(tmap)
9 | library(terra)
10 | library(dplyr)
11 | library(gstat)
12 |
13 | ## The warnings are a bit out of hand on this page so
14 | options("rgdal_show_exportToProj4_warnings"="none")
15 |
16 | volcano <- rast("data/maungawhau.tif")
17 | names(volcano) <- "height"
18 |
19 | controls <- st_read("data/controls.gpkg")
20 | sites_sf <- st_read("data/sites-sf.gpkg")
21 | sites_raster <- rast("data/sites-raster.tif")
22 | ```
23 |
24 | ## Some thoughts on kriging in `gstat`
25 |
26 | Although `gstat` is the 'go to' package for geostatistics in *R* and although it is very flexible, it has some limitations. Among these are:
27 |
28 | - Awkward interfaces to spatial data (not exclusive to `gstat`!)
29 | - Making the variogram by hand is tricky
30 | - Inclusion of a trend surface to run *universal kriging* is particularly challenging to get right
31 | - And, even if you can get *universal kriging* to work, you can't use a localised trend surface (admittedly this is not supported by many platforms)
32 |
33 | ## Digression 1: evenly spaced control points
34 |
35 | Some of the challenges encountered in the main instructions are mitigated with better control points. You can use `spatstat` spatial point processes to control the `st_sample` function, and the code chunk below shows how this can improve kriging results.
36 |
37 | ```{r}
38 | # new control set using the rSSI point process from spatstat
39 | controls_ssi <- st_read("data/interp-ext.gpkg") %>%
40 | st_sample(size = 250, type = "SSI", r = 30, n = 250) %>%
41 | st_sf() %>%
42 | st_set_crs(st_crs(controls))
43 |
44 | heights_ssi <- controls_ssi %>%
45 | extract(x = volcano)
46 |
47 | controls_ssi <- controls_ssi %>%
48 | mutate(height = heights_ssi$height)
49 | ```
50 |
51 | Now put them on a web map and note how much more evenly spaced the `controls_ssi` points are.
52 | ```{r}
53 | tmap_mode('view')
54 | tm_shape(controls) + tm_dots(col = "black") + tm_shape(controls_ssi) + tm_dots(col = "red")
55 | ```
56 |
57 | ```{r}
58 | # make a new variogram
59 | v_ssi <- variogram(
60 | height ~ 1,
61 | data = as(controls_ssi, "Spatial"),
62 | cutoff = 500,
63 | width = 25,
64 | )
65 | # fit the variogram
66 | v.fit_ssi <- fit.variogram(v_ssi, vgm(model = "Gau"))
67 | # make the kriging model
68 | fit_K_ssi <- gstat(
69 | formula = height ~ 1,
70 | data = as(controls_ssi, "Spatial"),
71 | model = v.fit_ssi,
72 | nmax = 8
73 | )
74 | # interpolate!
75 | interp_pts_K_ssi <- predict(fit_K_ssi, sites_sf)
76 | interp_K_ssi <- rasterize(as(interp_pts_K_ssi, "SpatVector"),
77 | sites_raster, field = c("var1.pred", "var1.var"))$var1.pred
78 |
79 | persp(interp_K_ssi$var1.pred, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
80 | ```
81 |
82 | The lesson here is that we could *always* use better data!
83 |
84 | ## Digression 2: Using a local trend surface in universal kriging
85 |
86 | In theory universal kriging models overall trends with a trend surface of some kind, then interpolates the residuals from that surface using a variogram and kriging.
87 |
88 | But we saw before that with a localised trend surface you can already get a pretty nice interpolation using that alone. However, `gstat` doesn't let you use localised trend surfaces in kriging---at least not in any simple way. In the code chunk below, I show how this limitation can potentially be sidestepped.
89 |
90 | ```{r}
91 | # make a trend surface interpolation
92 | fit_TS <- gstat(
93 | formula = height ~ 1,
94 | data = as(controls, "Spatial"),
95 | nmax = 24,
96 | degree = 3,
97 | )
98 | interp_pts_TS <- predict(fit_TS, sites_sf)
99 | interp_TS <- rasterize(as(interp_pts_TS, "SpatVector"), sites_raster, field = c("var1.pred", "var1.var"))$var1.pred
100 |
101 | persp(interp_TS, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
102 | ```
103 |
104 | Now we proceed to krige on the residuals from this surface.
105 |
106 | ```{r}
107 | # get the ts values and include in controls also making a residual
108 | ts_estimates <- extract(interp_TS, as(controls, "SpatVector")) %>%
109 | as_tibble() %>%
110 | select(var1.pred) %>%
111 | rename(ts = var1.pred)
112 |
113 | controls_resid <- controls %>%
114 | bind_cols(ts_estimates) %>%
115 | mutate(resid = height - ts)
116 |
117 | # now proceeed with kriging on the residual values
118 | v_resid <- variogram(
119 | resid ~ 1,
120 | data = as(controls_resid, "Spatial"),
121 | cutoff = 500,
122 | width = 25,
123 | )
124 | # fit the variogram
125 | v.fit_resid <- fit.variogram(v_resid, vgm(model = "Sph"))
126 | # make the kriging model
127 | fit_K_resid <- gstat(
128 | formula = resid ~ 1,
129 | data = as(controls_resid, "Spatial"),
130 | model = v.fit_resid,
131 | nmax = 8
132 | )
133 | # interpolate!
134 | interp_pts_K_resid <- predict(fit_K_resid, sites_sf)
135 | interp_K_resid <- rasterize(as(interp_pts_K_resid, "SpatVector"), sites_raster, field = c("var1.pred", "var1.var"))
136 | interp_K_final <- interp_TS + interp_K_resid$var1.pred
137 |
138 | persp(interp_K_final, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
139 | ```
140 | Back to [trend surfaces and kriging](05-trend-surfaces-and-kriging.md) | On to [splines](06-splines.md)
141 |
--------------------------------------------------------------------------------
/labs/interpolation/06-splines.Rmd:
--------------------------------------------------------------------------------
1 | # Spline interpolation
2 | This is where `gstat` runs out of steam, and where I suggest you take a look at the possibilities in QGIS, where the GRASS and SAGA toolboxes have tools for various kinds of interpolation, or in ArcGIS, where the Spatial Analyst and Geostatistical Analyst extensions do.
3 |
4 | There is an option in *R* for this too (in fact, there are many), but spline interpolation falls under the rubric of general-purpose interpolation techniques that are not specifically geographical, so the interfaces to spatial data in the packages that support it tend to be primitive.
5 |
6 | One example is the `fields` package, which provides a fairly painless way to run a spline interpolation (it also does kriging and a lot of other things besides, but we've already done that... so we'll not go there again.)
7 |
8 | `fields` seems not to work with the `terra` raster data types, so we revert to using the older `raster` package instead.
9 |
10 | ## Reload our datasets and required libraries
11 | ```{r}
12 | library(raster)
13 |
14 | controls_xyz <- read.csv("data/controls-xyz.csv")[1:100, ]
15 | sites_raster <- raster("data/sites-raster.tif")
16 | ```
17 |
18 | And now load `fields` (you may have to install it first...)
19 |
20 | ```{r}
21 | library(fields)
22 | ```
23 |
24 | `fields` doesn't really know about geographical data; it just wants coordinates and values, so we've loaded the 'xyz' version of the control points data and will provide the necessary columns from this to the spline model constructor `Tps`.
25 |
26 | ```{r}
27 | spline <- Tps(controls_xyz[, c("x", "y")], controls_xyz$z)
28 | ```
29 |
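You can inspect the fitted model before interpolating if you like. `Tps` returns a `Krig` object, so (assuming the usual `fields` summary method) this gives details of the fit:

```{r}
summary(spline)
```
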
30 | We interpolate this to our desired raster output layer like this
31 |
32 | ```{r}
33 | splined <- raster::interpolate(sites_raster, spline)
34 | ```
35 |
36 | And we can take a look in the usual ways
37 |
38 | ```{r}
39 | persp(splined, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
40 | ```
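
If you'd rather see a map than a perspective plot, one option (a sketch, assuming `terra` and `tmap` are installed from the earlier pages) is to convert the `raster` result and map it:

```{r}
library(terra)
library(tmap)

# rast() converts the RasterLayer returned by raster::interpolate to a SpatRaster
tm_shape(rast(splined)) +
  tm_raster(style = "cont", palette = "viridis") +
  tm_legend(outside = TRUE)
```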
41 |
42 | ### Kriging in `fields`
43 | Kriging is also available, although we are mainly looking at `fields` for the spline interpolation. Here's what it looks like if you are interested. It's quite slow as the tool is doing a lot of things at once and attempting to optimise fits and so on.
44 |
45 | ```{r}
46 | spatial_model <- spatialProcess(controls_xyz[, c("x", "y")], controls_xyz$z)
47 | ```
48 |
49 | Interpolate it to a raster
50 |
51 | ```{r}
52 | kriged <- raster::interpolate(sites_raster, spatial_model)
53 | ```
54 |
55 | And inspect. The result can end up being a lot nicer than `gstat`'s effort...
56 |
57 | ```{r}
58 | persp(kriged, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
59 | ```
60 |
61 | You can also get details of the model used
62 |
63 | ```{r}
64 | spatial_model
65 | ```
66 |
67 |
68 | Back to [trend surfaces and kriging](05-trend-surfaces-and-kriging.md) | On to [other R packages](07-other-r-packages.md)
69 |
--------------------------------------------------------------------------------
/labs/interpolation/06-splines.md:
--------------------------------------------------------------------------------
1 | # Spline interpolation
2 | This is where `gstat` runs out of steam, and where I suggest you take a look at the possibilities in QGIS, where the GRASS and SAGA toolboxes have tools for various kinds of interpolation, or in ArcGIS, where the Spatial Analyst and Geostatistical Analyst extensions do.
3 |
4 | There is an option in *R* for this too (in fact, there are many), but spline interpolation falls under the rubric of general-purpose interpolation techniques that are not specifically geographical, so the interfaces to spatial data in the packages that support it tend to be primitive.
5 |
6 | One example is the `fields` package, which provides a fairly painless way to run a spline interpolation (it also does kriging and a lot of other things besides, but we've already done that... so we'll not go there again.)
7 |
8 | `fields` seems not to work with the `terra` raster data types, so we revert to using the older `raster` package instead.
9 |
10 | ## Reload our datasets and required libraries
11 | ```{r}
12 | library(raster)
13 |
14 | controls_xyz <- read.csv("data/controls-xyz.csv")[1:100, ]
15 | sites_raster <- raster("data/sites-raster.tif")
16 | ```
17 |
18 | And now load `fields` (you may have to install it first...)
19 |
20 | ```{r}
21 | library(fields)
22 | ```
23 |
24 | `fields` doesn't really know about geographical data; it just wants coordinates and values, so we've loaded the 'xyz' version of the control points data and will provide the necessary columns from this to the spline model constructor `Tps`.
25 |
26 | ```{r}
27 | spline <- Tps(controls_xyz[, c("x", "y")], controls_xyz$z)
28 | ```
29 |
30 | We interpolate this to our desired raster output layer like this
31 |
32 | ```{r}
33 | splined <- raster::interpolate(sites_raster, spline)
34 | ```
35 |
36 | And we can take a look in the usual ways
37 |
38 | ```{r}
39 | persp(splined, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
40 | ```
41 |
42 | ### Kriging in `fields`
43 | Kriging is also available, although we are mainly looking at `fields` for the spline interpolation. Here's what it looks like if you are interested. It's quite slow as the tool is doing a lot of things at once and attempting to optimise fits and so on.
44 |
45 | ```{r}
46 | spatial_model <- spatialProcess(controls_xyz[, c("x", "y")], controls_xyz$z)
47 | ```
48 |
49 | Interpolate it to a raster
50 |
51 | ```{r}
52 | kriged <- raster::interpolate(sites_raster, spatial_model)
53 | ```
54 |
55 | And inspect. The result can end up being a lot nicer than `gstat`'s effort...
56 |
57 | ```{r}
58 | persp(kriged, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
59 | ```
60 |
61 | You can also get details of the model used
62 |
63 | ```{r}
64 | spatial_model
65 | ```
66 |
67 |
68 | Back to [trend surfaces and kriging](05-trend-surfaces-and-kriging.md) | On to [other R packages](07-other-r-packages.md)
69 |
--------------------------------------------------------------------------------
/labs/interpolation/07-other-r-packages.Rmd:
--------------------------------------------------------------------------------
1 |
2 | # Other stuff
3 | Spatial interpolation is widely used in many domains and consequently there are many packages that do some part of it.
4 |
5 | This page is a bit of a grab-bag of notes and observations first about other *R* packages, then about other platforms.
6 |
7 | ## Other *R* packages
8 | Often, like `fields`, they aren't specifically geographical, so they know nothing of projected coordinate systems, maps, or cartography. That can make them tricky to work with when you have data in geospatial formats. Nevertheless it's worth knowing about a few of them:
9 |
10 | ### Proximity polygons without rasters
11 | You can do 'interpolation' like this (and no... I don't understand why it requires quite so many steps as it does).
12 |
13 | ```{r}
14 | library(sf)
15 | interp_ext <- st_read("data/interp-ext.gpkg")
16 | controls <- st_read("data/controls.gpkg")
17 | controls_polys <- controls %>%
18 | st_union() %>%
19 | st_voronoi() %>%
20 | st_cast() %>%
21 | st_sf() %>%
22 | st_intersection(interp_ext) %>%
23 | st_join(controls)
24 | plot(controls_polys)
25 | ```
26 |
27 | ### Geostatistics
28 | `sgeostat` has been around forever and does a small part of the kriging puzzle, particularly variogram estimation. Its `spacecloud` and `spacebox` plots are especially good for getting a feel for variogram estimation. For example:
29 |
30 | ```{r}
31 | library(sgeostat)
32 | controls_xyz <- read.csv("data/controls-xyz.csv")
33 | pts <- sgeostat::point(controls_xyz)
34 | prs <- sgeostat::pair(pts)
35 | spacecloud(pts, prs, "z", cex = 0.35, pch = 19)
36 | spacebox(pts, prs, "z")
37 | ```
38 |
39 | ### Splines
40 | I'm sure there are others, but two packages that do splines are briefly discussed below.
41 |
42 | #### `MBA`
43 | ```{r}
44 | library(MBA)
45 | library(terra)
46 | spline.mba <- mba.surf(controls_xyz,
47 | no.X = 61, no.Y = 87, # much jiggery-pokery required
48 | n = 87/61, m = 1,
49 | extend = T, sp = T)
50 | r <- rast(spline.mba$xyz.est)
51 |
52 | persp(r, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
53 | ```
54 |
55 | #### `akima`
56 | ```{r}
57 | library(akima)
58 | spline.akima <- interp(controls_xyz$x, controls_xyz$y, controls_xyz$z,
59 | nx = 61, ny = 87,
60 | extrap = T, linear = F)
61 |
62 | r.spline.akima <- cbind(
63 | expand.grid(X = spline.akima$x, Y = spline.akima$y), Z = c(spline.akima$z)) %>%
64 | rast(type = "xyz")
65 | crs(r.spline.akima) <- st_crs(controls)$wkt
66 |
67 | p <- persp(r.spline.akima, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
68 | ```
69 |
70 | ## Other platforms
71 | Other platforms are available!
72 |
73 | The goal here is to give insight into the wide array of options out there. We'll quickly look at tools in QGIS in class. Similar tools are available in the Esri ecosystem, which still supports the rather nice _Geostatistical Analyst_ tool.
74 |
75 | Probably the one thing these both have that I can't find anywhere in *R*-land is _natural neighbours_ interpolation. (**UPDATE:** it does exist in *R*-land in a package called `whitebox` but good luck getting that setup to run cleanly).
76 |
77 | The striking thing about both these menu / dialogue driven pathways is that, with so many options to set, things quickly become quite hard to replicate. As you experiment with options and find a preferred solution, it can be difficult to find your way back to it!
78 |
79 | In any particular 'real-world' setting there may be other tools used for this very common family of operations.
80 |
81 | This is the end of the line: back to [the overview](README.md)
82 |
--------------------------------------------------------------------------------
/labs/interpolation/07-other-r-packages.md:
--------------------------------------------------------------------------------
1 |
2 | # Other stuff
3 | Spatial interpolation is widely used in many domains and consequently there are many packages that do some part of it.
4 |
5 | This page is a bit of a grab-bag of notes and observations first about other *R* packages, then about other platforms.
6 |
7 | ## Other *R* packages
8 | Often, like `fields`, they aren't specifically geographical, so they know nothing of projected coordinate systems, maps, or cartography. That can make them tricky to work with when you have data in geospatial formats. Nevertheless it's worth knowing about a few of them:
9 |
10 | ### Proximity polygons without rasters
11 | You can do 'interpolation' like this (and no... I don't understand why it requires quite so many steps as it does).
12 |
13 | ```{r}
14 | library(sf)
15 | interp_ext <- st_read("data/interp-ext.gpkg")
16 | controls <- st_read("data/controls.gpkg")
17 | controls_polys <- controls %>%
18 | st_union() %>%
19 | st_voronoi() %>%
20 | st_cast() %>%
21 | st_sf() %>%
22 | st_intersection(interp_ext) %>%
23 | st_join(controls)
24 | plot(controls_polys)
25 | ```
26 |
27 | ### Geostatistics
28 | `sgeostat` has been around forever and does a small part of the kriging puzzle, particularly variogram estimation. Its `spacecloud` and `spacebox` plots are especially good for getting a feel for variogram estimation. For example:
29 |
30 | ```{r}
31 | library(sgeostat)
32 | controls_xyz <- read.csv("data/controls-xyz.csv")
33 | pts <- sgeostat::point(controls_xyz)
34 | prs <- sgeostat::pair(pts)
35 | spacecloud(pts, prs, "z", cex = 0.35, pch = 19)
36 | spacebox(pts, prs, "z")
37 | ```
38 |
39 | ### Splines
40 | I'm sure there are others, but two packages that do splines are briefly discussed below.
41 |
42 | #### `MBA`
43 | ```{r}
44 | library(MBA)
45 | library(terra)
46 | spline.mba <- mba.surf(controls_xyz,
47 | no.X = 61, no.Y = 87, # much jiggery-pokery required
48 | n = 87/61, m = 1,
49 | extend = T, sp = T)
50 | r <- rast(spline.mba$xyz.est)
51 |
52 | persp(r, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
53 | ```
54 |
55 | #### `akima`
56 | ```{r}
57 | library(akima)
58 | spline.akima <- interp(controls_xyz$x, controls_xyz$y, controls_xyz$z,
59 | nx = 61, ny = 87,
60 | extrap = T, linear = F)
61 |
62 | r.spline.akima <- cbind(
63 | expand.grid(X = spline.akima$x, Y = spline.akima$y), Z = c(spline.akima$z)) %>%
64 | rast(type = "xyz")
65 | crs(r.spline.akima) <- st_crs(controls)$wkt
66 |
67 | p <- persp(r.spline.akima, scale = FALSE, expand = 2, theta = 35, phi = 30, lwd = 0.5)
68 | ```
69 |
70 | ## Other platforms
71 | Other platforms are available!
72 |
73 | The goal here is to give insight into the wide array of options out there. We'll quickly look at tools in QGIS in class. Similar tools are available in the Esri ecosystem, which still supports the rather nice _Geostatistical Analyst_ tool.
74 |
75 | Probably the one thing these both have that I can't find anywhere in *R*-land is _natural neighbours_ interpolation. (**UPDATE:** it does exist in *R*-land in a package called `whitebox` but good luck getting that setup to run cleanly).
76 |
77 | The striking thing about both these menu / dialogue driven pathways is that, with so many options to set, things quickly become quite hard to replicate. As you experiment with options and find a preferred solution, it can be difficult to find your way back to it!
78 |
79 | In any particular 'real-world' setting there may be other tools used for this very common family of operations.
80 |
81 | This is the end of the line: back to [the overview](README.md)
82 |
--------------------------------------------------------------------------------
/labs/interpolation/08-assignment-interpolation.Rmd:
--------------------------------------------------------------------------------
1 | # Assignment 3: Interpolation in R
2 | Now you've seen how to do interpolation many different ways, here is the assignment.
3 |
4 | Some libraries you'll need to make this file run
5 | ```{r}
6 | library(sf)
7 | library(tmap)
8 | ```
9 |
10 | ## The data we will work with
11 | We will work with some old weather data for Pennsylvania on 1 April 1993. It is surprisingly hard to find well-behaved data for interpolation, and these work. I tried some local Wellington weather data, but they were (maybe unsurprisingly) not very well-behaved...
12 |
13 | ```{r}
14 | pa.counties <- st_read('data/pa-counties.gpkg')
15 | pa.weather <- st_read('data/pa-weather-1993-04-01.gpkg')
16 | ```
17 |
18 | ## Inspect the data
19 | Make some simple maps to get some idea of things. The code below will do the rainfall results. Change it to view other variables. I've added a scale bar so you have an idea of the scale. If you switch the map mode to `'view'` with `tmap_mode('view')` you can see it in context on a web map.
20 |
21 | ```{r}
22 | tm_shape(pa.counties) +
23 | tm_polygons() +
24 | tm_shape(pa.weather) +
25 | tm_bubbles(col = 'rain_mm', palette = 'Blues', size = 0.5) +
26 | tm_legend(legend.outside = T) +
27 | tm_scale_bar(position = c('right', 'TOP')) +
28 | tm_layout(main.title = 'Pennsylvania weather, 1 April 1993',
29 | main.title.size = 1)
30 | ```
31 |
32 | Now you have seen the data, here's the assignment...
33 |
34 | ## The assignment
35 | Using any of the methods covered in these materials produce interpolated maps of rainfall and maximum and minimum temperatures from the provided data.
36 |
37 | Write up a report on the process, providing the R code used to produce your final maps, and also discussing reasons for the choices of methods and parameters you made.
38 |
39 | There are a number of choices to make and to consider in your write-up:
40 |
41 | + interpolation method: Voronoi (Thiessen/proximity) polygons, IDW, trend surface, or kriging;
42 | + resolution of the output (this is controlled by the cellsize setting in the `st_make_grid` function for the examples in this session);
43 | + parameters associated with particular methods, such as power (for IDW), the trend surface degree for trend surfaces and kriging; and
44 | + variogram model if performing kriging.
45 |
46 | Many of these are difficult to make well-informed choices about, so it is OK to explain what you did and discuss the effects of doing things differently in accounting for your choices. If you get stuck (I did while writing these materials), be sure to call for help on the slack channel.
47 |
48 | **Some advice**
49 | As noted in the overview, some of the interfaces among the various packages used for interpolation can be finicky and it is easy to get frustrated (believe me, I have become frustrated when assembling these materials...). Your best option is to spend some time with the _RMarkdown_ versions of the material getting a feel for how things work. Then, I recommend you make a completely empty new file and start to assemble the materials you need, reusing code from the tutorial materials. A few things to look out for here:
50 | + The scales of the Pennsylvania data and the Maungawhau data are completely different. A cell size of 10m makes sense for the volcano; it won't for Pennsylvania. This may also affect any `maxdist` settings, and also the `cutoff` and `width` settings when estimating a variogram.
51 | + In my experiments, `gstat` doesn't do very well kriging these data. If you'd like to perform kriging, I **very much recommend** using the `fields` package which is introduced in the materials for spline-based interpolation. For whatever reason, it just does a much better job.
52 | + More than ever, ask questions if you get stuck!
53 |
54 | Submit a PDF report to the dropbox provided in Blackboard by **10 May**.
55 |
56 | I thoroughly recommend that you put your report together using the **Knit** functionality in *RStudio*. Please don't just submit a lightly modified version of the files I have provided—and *definitely* remove any tutorial materials! The explanatory linking text should be your own words, not mine!
57 |
--------------------------------------------------------------------------------
/labs/interpolation/08-assignment-interpolation.md:
--------------------------------------------------------------------------------
1 | # Assignment 3: Interpolation in R
2 | Now you've seen how to do interpolation many different ways, here is the assignment.
3 |
4 | Some libraries you'll need to make this file run
5 | ```{r}
6 | library(sf)
7 | library(tmap)
8 | ```
9 |
10 | ## The data we will work with
11 | We will work with some old weather data for Pennsylvania on 1 April 1993. It is surprisingly hard to find well-behaved data for interpolation, and these work. I tried some local Wellington weather data, but they were (maybe unsurprisingly) not very well-behaved...
12 |
13 | ```{r}
14 | pa.counties <- st_read('data/pa-counties.gpkg')
15 | pa.weather <- st_read('data/pa-weather-1993-04-01.gpkg')
16 | ```
17 |
18 | ## Inspect the data
19 | Make some simple maps to get some idea of things. The code below will do the rainfall results. Change it to view other variables. I've added a scale bar so you have an idea of the scale. If you switch the map mode to `'view'` with `tmap_mode('view')` you can see it in context on a web map.
20 |
21 | ```{r}
22 | tm_shape(pa.counties) +
23 | tm_polygons() +
24 | tm_shape(pa.weather) +
25 | tm_bubbles(col = 'rain_mm', palette = 'Blues', size = 0.5) +
26 | tm_legend(legend.outside = T) +
27 | tm_scale_bar(position = c('right', 'TOP')) +
28 | tm_layout(main.title = 'Pennsylvania weather, 1 April 1993',
29 | main.title.size = 1)
30 | ```
31 |
32 | Now you have seen the data, here's the assignment...
33 |
34 | ## The assignment
35 | Using any of the methods covered in these materials produce interpolated maps of rainfall and maximum and minimum temperatures from the provided data.
36 |
37 | Write up a report on the process, providing the R code used to produce your final maps, and also discussing reasons for the choices of methods and parameters you made.
38 |
39 | There are a number of choices to make and to consider in your write-up:
40 |
41 | + interpolation method: Voronoi (Thiessen/proximity) polygons, IDW, trend surface, or kriging;
42 | + resolution of the output (this is controlled by the cellsize setting in the `st_make_grid` function for the examples in this session);
43 | + parameters associated with particular methods, such as power (for IDW), the trend surface degree for trend surfaces and kriging; and
44 | + variogram model if performing kriging.
45 |
46 | Many of these are difficult to make well-informed choices about, so it is OK to explain what you did and discuss the effects of doing things differently in accounting for your choices. If you get stuck (I did while writing these materials), be sure to call for help on the slack channel.
47 |
48 | **Some advice**
49 | As noted in the overview, some of the interfaces among the various packages used for interpolation can be finicky and it is easy to get frustrated (believe me, I have become frustrated when assembling these materials...). Your best option is to spend some time with the _RMarkdown_ versions of the material getting a feel for how things work. Then, I recommend you make a completely empty new file and start to assemble the materials you need, reusing code from the tutorial materials. A few things to look out for here:
50 | + The scales of the Pennsylvania data and the Maungawhau data are completely different. A cell size of 10m makes sense for the volcano; it won't for Pennsylvania. This may also affect any `maxdist` settings, and also the `cutoff` and `width` settings when estimating a variogram.
51 | + In my experiments, `gstat` doesn't do very well kriging these data. If you'd like to perform kriging, I **very much recommend** using the `fields` package which is introduced in the materials for spline-based interpolation. For whatever reason, it just does a much better job.
52 | + More than ever, ask questions if you get stuck!
53 |
54 | Submit a PDF report to the dropbox provided in Blackboard by **10 May**.
55 |
56 | I thoroughly recommend that you put your report together using the **Knit** functionality in *RStudio*. Please don't just submit a lightly modified version of the files I have provided—and *definitely* remove any tutorial materials! The explanatory linking text should be your own words, not mine!
57 |
--------------------------------------------------------------------------------
/labs/interpolation/README.md:
--------------------------------------------------------------------------------
1 | # Interpolation overview
2 | I am not going to lie... spatial interpolation, whatever tool you use, is complicated and messy. Not the least of the problems is that you are going back and forth between point data (the control points) and field data (the interpolated surface output).
3 |
4 | This is particularly bad _right now_ because of changes in the management of coordinate reference systems (projections), which have made moving data back and forth between vector and raster formats in _R_ a bit flakey. As a result you may see more than the usual number of warnings when you are trying to complete this assignment! If you'd prefer not to see those warnings, you can issue this command in a session:
5 |
6 | ```
7 | options("rgdal_show_exportToProj4_warnings"="none")
8 | ```
9 |
10 | which will suppress them, but it's probably better to just go with it.
11 |
12 | The materials for interpolation extend across two weeks of lab sessions, with a single assignment asking you to perform interpolation on some data provided and comment on the results and process in the usual way. You'll find all the materials you need bundled in [this zip archive](interpolation.zip?raw=true) which you should unpack and set as your working folder in *RStudio*. I've provided `.Rmd` files this time, which you may find more useful even than usual.
13 |
14 | The steps along the way are described in the instructions below. You'll get the most out of these working through them in order:
15 |
16 | + [The overall approach to interpolation](01-overview-of-the-approach.md) in *R* using `gstat`
17 | + [The example dataset](02-example-dataset.md)
18 | + [Preparing for interpolation](03-preparing-for-interpolation.md) by making an output layer, etc.
19 | + [Near neighbour and inverse-distance weighted interpolation](04-nn-and-idw.md)
20 | + [Trend surfaces and kriging](05-trend-surfaces-and-kriging.md)
21 | + [Splines](06-splines.md)
22 | + [Other *R* packages and platforms](07-other-r-packages.md)
23 |
24 | And then of course...
25 |
26 | + [The assignment](08-assignment-interpolation.md)
27 |
--------------------------------------------------------------------------------
/labs/interpolation/data/interp-ext.gpkg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/interpolation/data/interp-ext.gpkg
--------------------------------------------------------------------------------
/labs/interpolation/data/maungawhau.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/interpolation/data/maungawhau.tif
--------------------------------------------------------------------------------
/labs/interpolation/data/pa-counties.gpkg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/interpolation/data/pa-counties.gpkg
--------------------------------------------------------------------------------
/labs/interpolation/data/pa-weather-1993-04-01.gpkg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/interpolation/data/pa-weather-1993-04-01.gpkg
--------------------------------------------------------------------------------
/labs/interpolation/interpolation.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/interpolation/interpolation.zip
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/01-introducing-r-and-rstudio.md:
--------------------------------------------------------------------------------
1 | # Introducing *R* and *RStudio*
2 | This lab will introduce you to the statistical analysis and programming environment *R*, running in *RStudio* (which makes it a bit easier to deal with). *R* has become one of the standard tools for statistical analysis particularly in the academic research community, but [increasingly also in commercial and other work settings](https://statfr.blogspot.com/2018/08/r-generation-story-of-statistical.html). It is well suited to this environment for a number of reasons, particularly
3 |
4 | 1. it is free [as in beer];
5 | 2. it is easily extensible; and
6 | 3. because of 1 and 2, many new methods of analysis first become available in *packages* contributed to the *R* ecosystem by researchers in the field.
7 |
8 | We are using *R* for spatial analysis as part of this course for all of these reasons.
9 |
10 | Like any good software, versions of *R* are available for MacOS, Windows and Linux so you can install a copy on your own computer and work on this lab in your own time---you don't need to be at the timetabled lab sections to complete the assignment, although you will find it helpful to attend to get assistance from the course instructors, and also from one another. To get up and running on your own computer, you will need to download and install *R* itself, from [here](http://cran.r-project.org/) and also, optionally, (**but highly recommended**) install *RStudio* from [here](http://www.RStudio.com/products/RStudio/download/).
11 |
12 | Installation is pretty straightforward on all platforms. When you are running *R* you will want a web connection to install any additional packages called for in lab instructions below. You will also find it useful to have a reasonably high resolution display (an old 1024×768 display will not be a lot of fun to work on, but high pixel density modern displays, such as 4K, can be a bit painful also, without tweaking the display settings). For this reason, if no other, you may find it good to work on the lab machines.
13 |
14 | **You should aim for R version 4 and RStudio version 1.4. If you have older versions of these installed you should update to avoid confusion that might arise from inaccurate instructions.**
15 |
16 | ### *DON'T PANIC!*
17 | This lab introduces *R* by just asking you to get on with it, without stopping to explain too much, at least not at first. This is because it's probably better to just do things with *R* to get a feel for what it's about without thinking too hard about what is going on; kind of like learning to swim by jumping in at the deep end. You may sometimes feel like you are drowning. Try not to worry too much and stick with it, and bear in mind that the assignments will not assume you are some kind of *R* guru (I'm no *R* guru myself; I know enough to be dangerous, but am only just competent). Ask questions, confer with your fellow students, consult Google (this [cartoon](https://xkcd.com/627/) is good advice).
18 |
19 | ## Getting started with R
20 | We're using *R* inside a slightly friendlier 'front-end' called *RStudio*, so start that program up in whatever platform you are running on. You should see something like the display below (without all the text which is from an open session on my machine).
21 |
22 | ![The RStudio interface](images/rstudio.png)
23 |
24 | I have labeled four major areas of the interface. These are
25 | + **Console** this is where you type commands and interact directly with the program
26 | + **Scripts and other files** is where you can write *scripts* which are short programs consisting of a series of commands that can all be run one after another. This is more useful as you become proficient with the environment, but if you have previous programming experience you may find it useful. You can also get tabular views of open datasets in this panel. Note that this window may not appear, particularly at initial startup, in which case the console will extend the whole height of the window on the left.
27 | + **Environment/History** here you can examine the data currently in your session (*environment*) more closely, or if you switch to the history tab, see all the commands you have issued in the session.
28 | + **Various outputs** are displayed in this area – mostly these will be plots, but perhaps also help information about commands.
29 |
30 | Before going any further, it makes sense to do some clearing out: since the lab machines are shared computers, there may be data sitting around from the previous session. Use the 'broom' buttons in the **Environment** and **Output** panes to clear these out. Clear the console of previous commands by clicking in the console and selecting **Edit – Clear Console**, and then click the X buttons on any open files or datasets in the upper left pane. Alternatively **Session - New Session** will accomplish the same thing.
31 |
32 | Now that you have cleaned house, we need to make sure that the packages we will be using all semester are installed, so on to the [next document](02-installing-packages.md).
33 |
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/02-installing-packages.md:
--------------------------------------------------------------------------------
1 | # Installing packages in *RStudio*
2 | One of the major advantages of *R* is its extensibility via *literally thousands* of additional packages tailored to specific applications. There is an extensive ecosystem of packages specifically for spatial applications. There is also a large array of packages collectively known as 'the tidyverse' that we will be using, which make data management and manipulation a little bit easier than in base *R*.
3 |
4 | All this means it is essential to know how to install packages. Fortunately it is straightforward, as will be demonstrated by installing some packages right now.
5 |
6 | ## Installing `tidyverse`
7 | Navigate to the **Tools - Install Packages...** menu option.
8 |
9 | In the dialogue that appears, start typing `tidyverse` in the **Packages** box. When the full name appears, select it. Make sure the **Install dependencies** checkbox is selected, then click **OK**.
10 |
11 | All hell will break loose. Sit back and enjoy the feeling that you are doing real computing.
12 |
13 | ### What's going on?
14 | This particular installation includes a *whole bunch* of packages and will take some time.
15 |
16 | If there are problems, ask for help. If everyone is having problems I got something wrong, and we will figure it out.
17 |
18 | ## Installing `sf`, `tmap` and `tmaptools`
19 | Now you know how to install packages, do it again, for the ones named in the heading. Note that you can install all of them at once, simply by specifying all of them in the **Packages** list in the dialogue.
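
If you prefer the console to the dialogue, the same thing can be done with a single command (this is base *R*, nothing extra needed):

```
install.packages(c("sf", "tmap", "tmaptools"))
```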
20 |
21 | ## On to the next thing.
22 | Now all that's done we are ready to move on to [the actual content of this session](03-simple-data-exploration.md).
23 |
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/03-simple-data-exploration.md:
--------------------------------------------------------------------------------
1 | # Simple visualization and mapping
2 | ## Preliminaries
3 | If you haven't worked through the other two sets of instructions for this week [go back there and do this now](README.md).
4 |
5 | Next, set the working directory to where you would like to work using the **Session - Set Working Directory - Choose Directory...** menu option. When you do that you should see a response something like
6 | ```
7 | setwd("~/Documents/teaching/GISC-422/labs/scratch")
8 | ```
9 | in the console. It will be a different location than above, depending on your machine.
10 |
11 | ## Meet the command line...
12 | OK.
13 |
14 | The key thing to understand about *R* is that it is a command line driven tool. That means you issue typed commands to tell *R* to do things (this was a normal way to interact with a computer and get work done before the early 1990s, and is making a comeback, like vinyl, but not cool). There are some menus and dialog boxes in *RStudio* to help you manage things, but mostly you interact with *R* by typing commands at the `>` prompt in the console window. To begin, we'll load up a dataset, just so you can see how things work and to help you get comfortable. As with most computer related stuff, you should experiment: you will learn a lot more that way.
15 |
16 | ## Reading data
17 | We will do this using the `readr` package from the `tidyverse`, which means we first have to load it:
18 | ```{r}
19 | library(readr)
20 | ```
21 | Now we can use a command in `readr` to read the comma-separated-variable (CSV) data file:
22 | ```{r}
23 | quakes <- read_csv('earthquakes.csv')
24 | ```
25 | You should get a response something like
26 | ```
27 | Parsed with column specification:
28 | cols(
29 | CUSP_ID = col_double(),
30 | NZMGE = col_double(),
31 | NZMGN = col_double(),
32 | ELAPSED_DAYS = col_double(),
33 | MAG = col_double(),
34 | DEPTH = col_double(),
35 | YEAR = col_double(),
36 | MONTH = col_double(),
37 | DAY = col_double(),
38 | HOUR = col_double(),
39 | MINUTE = col_double(),
40 | SECOND = col_double()
41 | )
42 | ```
43 | If that doesn't happen make sure you have set the correct working directory, that the data file is in there, and that you typed everything correctly. Ask for help, if necessary.
44 |
45 | The response tells you that the file has been opened successfully. You should see an entry for the dataset appear in the *Environment* part of the interface, called `quakes` because that's the name of the variable we asked *R* to read the data into. You can look at the data in a more spreadsheet-like way either by typing (the capital V is important)
46 | ```{r}
47 | View(quakes)
48 | ```
49 | or by clicking on the view icon at the right hand side of the entry for `quakes` in the environment list.
50 |
51 | These are data concerning earthquakes recorded in the months after the 7.1 earthquake in Christchurch in September 2010.
52 |
53 | In *R*, data tables are known as *dataframes* and each column is an attribute or variable. The various variables that appear in the table are
54 | + `CUSP_ID` a unique identifier for each earthquake or aftershock event
55 | + `NZMGE` and `NZMGN` are New Zealand Map Grid Easting and Northing coordinates
56 | + `ELAPSED_DAYS` is the number of days after September 3, 2010, when the big earthquake was recorded
57 | + `MAG` is the earthquake or aftershock magnitude
58 | + `DEPTH` is the estimated depth at which the earthquake or aftershock occurred
59 | + `YEAR`, `MONTH`, `DAY`, `HOUR`, `MINUTE`, `SECOND` provide detailed time information
60 |
61 | ## Exploring data
62 | Now, if we want to use *R* to do some statistics, these data are stored in a variable named `quakes` (in my example, you may have called it something different). I can refer to columns in the dataframe by calling them `quakes$MAG` (note the `$` sign). So for example, if I want to know the mean magnitude of the aftershocks in this dataset I type
63 | ```{r}
64 | mean(quakes$MAG)
65 | ```
66 | or the mean northing coordinate
67 | ```{r}
68 | mean(quakes$NZMGN)
69 | ```
70 | and *R* will return the value in response. Probably more informative is a boxplot or histogram, try:
71 | ```{r}
72 | boxplot(quakes$MAG)
73 | ```
74 | or
75 | ```{r}
76 | hist(quakes$MAG)
77 | ```
78 | and you should see statistical plots similar to those shown below.
79 |
80 | ![Boxplot of MAG](images/quakes-MAG-boxplot.png) ![Histogram of MAG](images/quakes-MAG-hist.png)
81 |
82 | It gets tedious typing `quakes` all the time, so you can `attach` the dataframe so that the variable names are directly accessible without the `quakes$` prefix by typing
83 | ```{r}
84 | attach(quakes)
85 | ```
86 | and then
87 | ```{r}
88 | hist(MAG)
89 | ```
90 | will plot the specified variable. Be careful using attach as it may lead to ambiguity about what you are plotting if you are working with different datasets that include variables with the same names.
91 |
92 | Try the above commands just to get a feel for things.
93 |
94 | ## A simple (crappy) map
95 | You can make a simple map of all the data by plotting the `NZMGE` variable as the *x* (i.e. horizontal axis) and `NZMGN` as the *y* axis of a scatterplot:
96 | ```{r}
97 | plot(NZMGE, NZMGN)
98 | ```
99 |
100 | ![Plot of NZMGE against NZMGN](images/quakes-NZMGE-NZMGN-plot.png)
101 | Because base *R* is not a GIS it doesn't know about things like projections, so this is a very crude map.
102 |
103 | ## **NOTE: from here on I am not going to show results of commands, just the commands!**
104 |
105 | There are *R* packages to handle geographical data better than this (we will look at those in [a little later](04-simple-maps.md)) but for now don't worry about it too much. To see if there is a relationship between earthquake depth and magnitude, try this
106 | ```{r}
107 | plot(DEPTH, MAG)
108 | ```
109 | and because *R* is a statistics package, we can easily fit and plot a simple linear regression model to the data
110 | ```{r}
111 | regmodel <- lm(MAG ~ DEPTH)
112 | plot(DEPTH, MAG)
113 | abline(regmodel, col = 'red')
114 | ```
115 | Note here the use of `<-` to assign the model, made by the *linear model* `lm()` command, to a new variable called `regmodel`. You can get more details of this model by typing `regmodel` or `summary(regmodel)`. If you know anything about regression models, these may be of interest to you. Also note how I've requested that the line be plotted in red (`col = 'red'`) so it can be seen more easily.
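
For example, the summary output includes the coefficient estimates and overall fit:
```{r}
summary(regmodel)
```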
116 |
117 | We can make more complex displays. For example
118 | ```{r}
119 | plot(ELAPSED_DAYS, MAG)
120 | ```
121 | This shows how the magnitude of the aftershocks changed in the days after the initial large earthquake, with the second large event happening around 6 months (180 days) later. A more complicated plot still would be
122 | ```{r}
123 | boxplot(MAG ~ cut(ELAPSED_DAYS, seq(0,200,20)))
124 | ```
125 | Give that a try and see what you get. To label the chart more informatively we need to add information for *x* and *y* axis labels
126 | ```{r}
127 | boxplot(MAG ~ cut(ELAPSED_DAYS, seq(0, 200, 20)), xlab = "Days after Sept 3, 2010", ylab = "Magnitude")
128 | ```
129 | Next up: [simple maps](04-simple-maps.md).
130 |
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/04-simple-maps.md:
--------------------------------------------------------------------------------
1 | # Making simple maps
2 | To mentally prepare you for what's coming, the next few paragraphs walk you through making a map of some data, using the `sf` and `tmap` packages. I think it is helpful to do this just to get a feeling for what is going on before we dive into details in the coming weeks.
3 |
4 | First, we need to load the libraries
5 | ```{r}
6 | library(sf)
7 | library(tmap)
8 | ```
9 | We use the `sf` package to read data in spatial formats like shapefiles, with the `st_read` function:
10 | ```{r}
11 | nz <- st_read('nz.gpkg')
12 | ```
13 | To make a map with this, we use the `tmap` package. We'll learn more about this package in the next couple of weeks. Basically it lets you make a map by progressively adding layers of data. To start a map you tell it the dataset to use
14 | ```{r}
15 | map <- tm_shape(nz)
16 | ```
17 | At this point nothing happens; we're just setting things up. We need to layer on or add additional information so `tmap` knows what to do with it. In this case, we are mapping polygons, so the `tm_polygons` function provides the needed information (to find out more about the available options, type `?tm_polygons` at the command prompt).
18 | ```{r}
19 | map + tm_polygons(col = 'green', border.col = 'black')
20 | ```
21 | If we want to add a few more cartographic frills like a compass rose and scale bar, we can do that too:
22 | ```{r}
23 | map + tm_polygons(col = 'darkseagreen2', border.col = 'skyblue', lwd = 0.5) +
24 | tm_layout(main.title = 'Aotearoa New Zealand',
25 | main.title.position = 'center',
26 | main.title.size = 1,
27 | bg.color = 'powderblue') +
28 | tm_compass() +
29 | tm_scale_bar()
30 | ```
31 |
32 | For a list of named colours in *R* see [this document](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf). Try experimenting with changing a few things in the above map. Consult the help on `tm_layout` using `?tm_layout` to see what options are available.
33 |
34 | ## Adding another layer
35 | The `quakes` dataset is not in a spatial format, although it includes spatial information (the easting and northing coordinates). Before continuing, make sure it is still loaded, and if not, reload it
36 | ```{r}
37 | library(readr)
38 | quakes <- read_csv('earthquakes.csv')
39 | ```
40 |
41 | This is just a dataframe. The `sf` package provides the required functions to convert the dataframe to a *simple features* dataset, which *is* a spatial data format. The following command will do the necessary conversion (you need to be careful to type it exactly as shown).
42 | ```{r}
43 | qmap <- st_as_sf(quakes, coords = c('NZMGE', 'NZMGN'), crs = 27200) %>%
44 | st_transform(st_crs(nz))
45 | ```
46 | What's happening here? Quite a lot it turns out.
47 |
48 | `st_as_sf` is the function that does the conversion. The *parameters* in parentheses tell the function what to work on. First is the input dataframe `quakes`. Next, the `coords` parameter tells the function which variables in the dataframe are the *x* and *y* coordinates. The `c()` structure concatenates the two variable names into a single *vector*, which is required by `st_as_sf`. Finally, we also specify the *coordinate reference system* or map projection of the data. These data are in New Zealand Map Grid, which has [EPSG code 27200](https://epsg.io/27200).
49 |
50 | Unfortunately, this is a different projection than the `nz` dataset. But I can *pipe* the data using the `%>%` symbol into the `st_transform` function to convert its projection to match that of the `nz` dataset using `st_crs(nz)` to retrieve this information from the `nz` dataset and apply it to the new spatial `qmap` layer we are making.
51 |
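A quick way to confirm the two layers now share a projection is to compare their coordinate reference systems directly (this should return `TRUE`):

```{r}
st_crs(qmap) == st_crs(nz)
```
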
52 | Now we have two datasets we can make a layered map including both of them.
53 | ```{r}
54 | tm_shape(nz) +
55 | tm_polygons(col = 'darkseagreen2') +
56 | tm_shape(qmap) +
57 | tm_dots()
58 | ```
59 |
60 | That's OK, although not very useful; we really need to zoom in on the extent or *bounding box* of the earthquake data:
61 | ```{r}
62 | tm_shape(nz, bbox = st_bbox(qmap)) +
63 | tm_polygons(col = 'white', lwd = 0) +
64 | tm_layout(bg.color = 'powderblue') +
65 | tm_shape(qmap) +
66 | tm_dots() +
67 | tm_scale_bar()
68 | ```
69 |
70 | This is still not very useful, because the `nz` dataset includes no reference data of interest other than the coastline. We can fix that by making a web map instead (see a little further below). Still... it's a pretty map.
71 |
72 | For now, an alternative to `tm_dots` is `tm_bubbles`, which allows us to scale the symbols by some variable
73 | ```{r}
74 | tm_shape(nz, bbox = st_bbox(qmap)) +
75 | tm_polygons(col = 'white', lwd = 0) +
76 | tm_layout(bg.color = 'powderblue') +
77 | tm_shape(qmap) +
78 | tm_bubbles(size = 'MAG', perceptual = TRUE, alpha = 0.5) +
79 | tm_scale_bar()
80 | ```
81 |
82 | This isn't a great map. It might be easier to see if we only showed the larger aftershocks. We use another pipe `%>%` to pass the data into a filter tool from the `dplyr` package.
83 |
84 | ```{r}
85 | library(dplyr)
86 | bigq <- qmap %>%
87 | dplyr::filter(MAG >= 4)
88 | ```
89 |
90 | Try again, this time also making the bubbles transparent:
91 |
92 | ```{r}
93 | tm_shape(nz, bbox = st_bbox(qmap)) +
94 | tm_polygons(col = 'white', lwd = 0) +
95 | tm_layout(bg.color = 'powderblue') +
96 | tm_shape(bigq) +
97 | tm_bubbles(size = 'MAG', perceptual = T, alpha = 0) +
98 | tm_scale_bar()
99 | ```
100 |
101 | Alternatively, we might use colour to show the different magnitudes:
102 |
103 | ```{r}
104 | tm_shape(nz, bbox = st_bbox(qmap)) +
105 | tm_polygons(col = 'white', lwd = 0) +
106 | tm_layout(bg.color = 'powderblue') +
107 | tm_shape(bigq) +
108 | tm_bubbles(size = 'MAG', col = 'MAG', palette = 'Reds', alpha = 0.5) +
109 | tm_scale_bar()
110 | ```
111 |
112 | That's probably enough experimenting to give you the general idea.
113 |
114 | ## A web basemap
115 | One other thing we can do with the `tmap` package is make a web map instead. We no longer need the `nz` layer; we just have to switch modes
116 | ```{r}
117 | tmap_mode('view')
118 | ```
119 |
120 | [To switch back use `tmap_mode('plot')`]
121 |
122 | Then make a map as before, but no need for the `nz` layer
123 |
124 | ```{r}
125 | tm_shape(qmap) +
126 | tm_dots(col = 'MAG', palette = 'Reds')
127 | ```
128 | OK. Before we wrap up, a quick look at [_R Markdown_](05-r-markdown.md).
129 |
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/05-r-markdown.md:
--------------------------------------------------------------------------------
1 | # _R Markdown_
2 | One further element of the _R_ and _RStudio_ ecosystem that we will touch on briefly this week, and will see more of through the semester, is _R Markdown_.
3 |
4 | _R Markdown_ files are a format that allows you to run code inside a document, and to combine the results of running that code into an output file in formats like HTML (for web pages) or Word documents and PDFs (for more conventional documents).
5 |
6 | _R Markdown_ is an example of *literate computing* which is a more general move towards media that merge traditional documentation with interactive computing elements.
7 |
8 | What does that mean?
9 |
10 | Well a couple of things, but first we need to re-open **this** document (the one you are reading now) in _RStudio_:
11 |
12 | + In _RStudio_ go to **File - Open File...** and navigate to the folder where you unpacked the lab `.zip` file
13 | + Open the file `05-r-markdown.md` (i.e., this file)
14 | + Now do **File - Save As...** and save the file but _change the extension from_ `.md` _to_ `.Rmd` (if you can't see file extensions on your computer, ask about this).
15 |
16 | ## Now what?
17 | That should mean you are now looking at a slightly different version of the file, in plain text, but with formatting marks also included.
18 |
19 | The document is a combination of *markdown formatted text* (like this) which allows creation of simple structured documents with different heading levels and simple formatting such as *italics*, **bold** and `code` fonts. It also includes (and this is the really clever part) *code chunks*. These begin and end with three backtick symbols and an indication of what kind of code is included in them. Here's a very simple example:
20 |
21 | ```{r}
22 | x <- 5
23 | y <- 6
24 | x + y
25 | ```
26 |
27 | You can *run* these code chunks using the little 'play' arrow at the top right of the chunk, and when you do, you see the result it outputs. Try it on the code chunk above now.
28 |
29 | In effect, you are running an *R* session in pieces, inside a document that explains itself as it goes along.
30 |
31 | # The really clever bit
32 | That's pretty smart (I think it is anyway). But there's more.
33 |
34 | ## Viewing the document as formatted output
35 | First, you can also view the document in a nicely formatted display mode. To do this find the drop-down list next to the little 'gear wheel' icon at the top of the file viewing tab. Select the **Use Visual Editor** option. After a pause, you should see the file display change to a nicely formatted output combining formatted explanatory text, and code chunks.
36 |
37 | In this view you will also find controls to allow you to edit markdown in the same way you might write a word processed document. This is all fairly intuitive, so I'll let you figure that all out for yourself. It's helpful as you explore to switch back to the non-Visual editing mode to see how changes you make alter the markdown materials.
38 |
39 | ## Compiling the document to an output document
40 | But the _really_ clever part comes when you *knit* the document together into an output file. You do this in *RStudio* using the **Knit** button, which you'll see at the top of the file panel in the interface. Click on the little down arrow and select **Knit to HTML**. The first time you do this, you will probably be asked if you want to install some packages: go ahead and say yes!
41 |
42 | Once you've done that, **Knit to HTML** again and *RStudio* should think for a bit, then produce an HTML file that appears in the **Viewer** panel (at the bottom right of the RStudio interface) or perhaps as a document in a new window (you control where with the **Preview in Viewer Pane** or **Preview in Window** setting). You can also knit to a Word document (try it!) if Word is installed on the machine you are using. You can also knit to PDF, although this likely won't work on Windows.
43 |
44 | Either way, the document that is produced includes all the linking text, nicely formatted based on the markdown, and also the code chunks *and their outputs* nicely formatted.
45 |
46 | You can use this to produce a formatted report explaining a data analysis and how you produced it.
47 |
48 | The remainder of this document provides a (tiny!) bit more detail on this. Ideally, you should complete assignments for this class using _R Markdown_, so pay attention (but know that it will be another couple of weeks before you need to do this for real, so you have time to get used to it).
49 |
50 | ## Some notes on markdown formatting
51 | Markdown is now a widely used format for document preparation. Details are available [here](https://daringfireball.net/projects/markdown/syntax). This section provides an outline so you can understand how the materials for this lab have been prepared, and also write your own _R Markdown_ file if you wish.
52 |
53 | ### Document headings
54 | Document header levels are denoted by hash signs (there are other ways to do this, but I like hash signs). One hash for level 1, two for level 2 and so on, like this:
55 |
56 | # Level 1
57 | ## Level 2
58 | ### Level 3
59 | #### and so on...
60 |
61 | Plain text is just plain text. *Italics* are designated by single asterisks and **bold** by double asterisks. Code format text is designated by `backticks` (this obscure key is at the top-left of your keyboard).
62 |
63 | ## Code chunks
64 | Code chunks appear, as you've seen above like this:
65 |
66 | ```{r}
67 | # This is a code chunk
68 | x <- 5
69 | ```
70 |
71 | You can have code chunks that don't run (but are formatted to look like code) by leaving out the `{r}`. You can also control what output a code chunk produces with a number of option settings. For example
72 |
73 | ```{r message = FALSE}
74 | # This code won't show any messages
75 | ```
76 |
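Another example, as a sketch only (these are standard `knitr` chunk options): `echo = FALSE` hides the code itself in the knitted output while still showing its results, and `warning = FALSE` suppresses any warning messages.

```{r echo = FALSE, warning = FALSE}
# this code still runs, but won't appear in the knitted document,
# and any warnings it generates won't be shown
summary(1:10)
```
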
77 | Some of the option settings are explained in [this document](https://rmarkdown.rstudio.com/lesson-3.html).
78 |
79 | # Running labs in _R Markdown_
80 | It's worth noting that any lab instructions pages for this class provided as `.md` files (which most of them are) can be converted to `.Rmd` format in exactly the way described above. This means that they can be run conveniently, without much typing.
81 |
82 | If you do this (I am OK with you doing so), there are a couple of things to keep in mind:
83 |
84 | + don't forget to *actually read the materials presented* so that you understand what's going on!
85 | + don't forget to try changing parameters in the commands so that you learn how you can do analyses differently
86 |
87 | And perhaps most importantly: **run everything in order!** A problem with this kind of file is that it's tempting to jump around and run code chunks out of sequence. This often causes bad things to happen. Variables end up containing information they are not expected to, or don't contain information they are expected to, or required libraries have not been loaded, and so on. So... it pays to work through the document chunk by chunk, _in order_, reading the accompanying information so you understand what is happening.
88 |
89 | And if things go haywire, it often pays to go back a few chunks and re-run them, in order.
90 |
91 | OK on to the final [wrap up](06-wrapping-up.md)
92 |
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/06-wrapping-up.md:
--------------------------------------------------------------------------------
1 | # Wrapping up
2 | The aim of this session has been to get a feel for things. Don't panic if you don't completely understand everything that is happening. The important thing is to realize:
3 | + You make things happen by typing commands in the console
4 | + Commands either cause things to happen (like plots) or they create new variables (data with a name attached), which we can further manipulate using other commands. Variables and the data they contain remain in memory (you can see them in the **Environment** tab) and can be manipulated as required.
5 | + *RStudio* remembers everything you have typed (check the **History** tab if you don't believe this!)
6 | + All the plots you make are also remembered (mess around with the back and forward arrows in the plot display).
7 |
8 | The **History** tab is particularly useful. If you want to run a command again, find it in the list, select it and then select the **To Console** option (at the top). The command will appear in the console at the current prompt, where you can edit it to make any desired changes and press **Enter** to run it again. You can also get the same history functionality using the up arrow key in the console, which will bring previous commands back to the console line for you to reuse. But this gets kind of annoying once you have run many commands.
9 |
10 | Another way to rerun things you have done earlier is to save them to a script. The easiest way to do this is to go to the history, select any commands you want, and then select **To Source**. This will add the commands to a new file in the upper left panel, and then you can save them to a `.R` script file to run all at once. For example, in the history, find the command used to open the data file, then the one used to attach the data, then one that makes a complicated plot. Add each one in turn to the source file (in the proper order). Then from the scripts area, select **File – Save As...** and save the file under some name (say `test.R`). What you have done is to write a short program! To run it, go to **Code – Source File...**, navigate to the file, and click **OK**. All the commands in the file should then run in one go.
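
If it helps to picture this, here is a minimal sketch of what such a `test.R` script might contain (hypothetical, assuming the `earthquakes.csv` file from earlier in this session is in your working folder):

```
# test.R -- a small example script
library(readr)

quakes <- read_csv("earthquakes.csv")   # read the earthquake data

hist(quakes$MAG,                        # histogram of the magnitudes
     xlab = "Magnitude", main = "")
```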
11 |
12 | ## Do try to get through all the steps in these instructions
13 | I suggest making time to work through all of the instructions in this session. It's a lot all at once if you haven't used *R* before, and it won't all make sense right away, but getting comfortable with the tools will be important, and will set you up well for the semester ahead.
14 |
15 | ## Additional resources
16 | *R* is really a programming language as much as it is a piece of software, so there is a lot more to learn about it than is covered here, or will be covered in this course. If you want to know more about *R* as a general statistics environment there is a [good online guide here](https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf) which is worth looking at if you want a more detailed introduction.
17 |
18 | For the purposes of this course, the commands you really need to get a handle on are explored in the corresponding weekly labs.
19 |
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/README.md:
--------------------------------------------------------------------------------
1 | # Introducing _R_ and friends—overview
2 | This week we have a few things to do, which are explained in detail in the documents linked below. Before you start, download the materials in [this file](https://raw.githubusercontent.com/DOSull/GISC-422/master/labs/intro-to-R-and-friends/intro-to-R-and-friends.zip?raw=true) and unzip them to a local folder.
3 |
4 | The first couple of sections won't require you to use these materials. But after that, you'll need them to work through the remaining sections.
5 |
6 | + [Familiarise with the _R_ and _RStudio_ environments](01-introducing-r-and-rstudio.md)
7 | + [Ensure that the environment is set up correctly](02-installing-packages.md)
8 | + [Do some simple data exploration](03-simple-data-exploration.md)
9 | + [Make some simple maps](04-simple-maps.md)
10 | + [_R Markdown_](05-r-markdown.md)
11 | + [Wrapping up](06-wrapping-up.md)
12 |
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/images/quakes-MAG-boxplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/intro-to-R-and-friends/images/quakes-MAG-boxplot.png
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/images/quakes-MAG-hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/intro-to-R-and-friends/images/quakes-MAG-hist.png
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/images/quakes-NZMGE-NZMGN-plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/intro-to-R-and-friends/images/quakes-NZMGE-NZMGN-plot.png
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/images/rstudio.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/intro-to-R-and-friends/images/rstudio.png
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/intro-to-R-and-friends.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/intro-to-R-and-friends/intro-to-R-and-friends.zip
--------------------------------------------------------------------------------
/labs/intro-to-R-and-friends/nz.gpkg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/intro-to-R-and-friends/nz.gpkg
--------------------------------------------------------------------------------
/labs/making-maps-in-r/01-making-maps-in-r.md:
--------------------------------------------------------------------------------
1 |
2 | # Making maps in *R*
3 |
4 | The data for this lab are available in [this zip file](making-maps-in-r.zip).
5 |
6 | You can see previews of two of the data layers at the links below:
7 |
8 | - the [Auckland census area units](ak-tb.geojson).
9 | - [Auckland TB cases 2006](ak-tb-cases.geojson) (jittered to anonymise locations)
10 |
11 | Before we start we need to load the `sf` package to handle spatial data
12 |
13 | ```{r}
14 | library(sf)
15 | ```
16 |
17 | ## Some quick map plotting
18 |
19 | First we need to read in the spatial data. We do this with the `st_read` function provided by the `sf` package (most functions in `sf` start with `st_`)
20 |
21 | ```{r}
22 | auckland <- st_read("ak-tb.geojson")
23 | ```
24 |
25 | The result tells us that we successfully read a file that contains 103 features (i.e. geographical things), and that each of those features has 5 'fields' of information associated with it. Note that to find out more about `st_read` and how it works you can type `?st_read` at any time. Typing `?` immediately before a function name provides help information. Or as you type the function name _RStudio_ will offer you a chance to see more information by hitting **F1**. For help on a whole package type `??` before the package name, i.e., `??sf`.
26 |
27 | Back to the data we just loaded. We can get a feel for the data using the `as_tibble` function from the `tidyr` package. This is a generic function for examining datasets, and shows us the first few rows and columns of the data in a convenient format.
28 |
29 | ```{r}
30 | library(tidyr)
31 | as_tibble(auckland)
32 | ```
33 |
34 | Alternatively, use `View`
35 |
36 | ```{r}
37 | View(auckland)
38 | ```
39 |
40 | to see a view of the data in the *RStudio* viewer, although be aware that this can be quite slow on large datasets so be careful!
41 |
42 | We can also use the `plot` function to plot the data. Since these data are geographical, we will get an array of small maps, one for each data attribute. It's important to realise that because we read the data in with the `sf::st_read` function, *R* knows that these data are spatial and produces maps rather than statistical plots.
43 |
44 | ```{r}
45 | plot(auckland)
46 | ```
47 |
48 | The default colour scheme of these plots is not great. Higher values are at the yellow end of the spectral scale. However, it is fiddly to change this, and it is easy to make nicer maps anyway, as we'll see in the next section.
49 |
50 | ## Choropleth maps
51 |
52 | The array of maps produced by `plot` is useful for a quick look-see, but we can also make choropleth maps of specific attributes using functions provided by `tmap` as we started to see last week. Choropleth maps are those where regions are coloured according to underlying data values. There's a column in our data called `TB_RATE`, or tuberculosis rate, expressed in number of cases per 100,000 population, so let's make a simple choropleth map of that.
53 |
54 | ### Exploring the data
55 |
56 | Since choropleth maps are maps of data, it is worth first familiarizing ourselves with the data in question, independent of the geography. Since we are concerned with the `TB_RATE` variable, let's see what it looks like in terms of the distribution across the 103 areas in the map. The base *R* function `summary` will provide summary information on any attributes it meaningfully can.
57 |
58 | ```{r}
59 | summary(auckland)
60 | ```
61 |
62 | - What's the lowest TB_RATE?
63 | - What's the highest TB_RATE?
64 |
65 | The *median* is 26.3, meaning that half the rates are at that level or lower, while the average or *mean* value is higher at 30.4, so you can see that the data are skewed. More visually, we can make a histogram. We can do this either with the base *R* function `hist` by using the `$` symbol to select only that variable as input:
66 |
67 | ```{r}
68 | hist(auckland$TB_RATE,
69 | xlab = 'TB rate per 100,000 population', main = '')
70 | ```
71 |
72 | or with the `ggplot2` approach, where we define the data we are using and the aesthetics to apply to it. The latter is quite an involved topic, which we *might* get into later in the semester; for now it may be easier to stick with the base *R* `hist` function. Many people much prefer the `ggplot2` approach, although for relatively simple plots like these it may not be very obvious why! I am happy to discuss this in more detail; it is when you find yourself building complicated visualisations in *R* that `ggplot2` becomes really useful.
73 |
74 | ```{r}
75 | library(ggplot2)
76 | ggplot(auckland) +
77 | geom_histogram(aes(x = TB_RATE), binwidth=20)
78 | ```
79 |
80 | Inspecting the histograms, think about how a map might look using different classification schemes. Say we used 9 *equal interval* classes, how many would be in the lowest class? How many in the highest? Would any class have no members? Keep these questions in mind as we experiment with maps in the next section.
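
If you want to check your answers to those questions numerically, here is a small sketch (not part of the original instructions) using the base *R* `cut` and `table` functions to count how many areas fall into each of 9 equal-interval classes:

```{r}
# cut() divides TB_RATE into 9 equal-width intervals; table() counts each class
table(cut(auckland$TB_RATE, breaks = 9))
```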
81 |
82 | ### Mapping the data
83 |
84 | The most convenient tool for making maps is the `tmap` package, so let's load that.
85 |
86 | ```{r}
87 | library(tmap)
88 | ```
89 |
90 | `tmap` maps are made in a similar way to `ggplot2` visualizations. First we make a map object and store it in a variable, telling it where we are getting the data using the `tm_shape` function.
91 |
92 | ```{r}
93 | m <- tm_shape(auckland)
94 | ```
95 |
96 | This makes a map object called `m` based on the `auckland` dataset. We now say how we want to symbolize it by adding layers. Since the `auckland` data are polygons we use the `tm_polygons` function
97 |
98 | ```{r}
99 | m +
100 | tm_polygons()
101 | ```
102 |
103 | Here, we just get polygons. For a choropleth map, we have to say what variable we want the polygon colours to be based on, so do this:
104 |
105 | ```{r}
106 | # note the 'col' here means colour, not column
107 | m +
108 | tm_polygons(col = 'TB_RATE')
109 | ```
110 |
111 | There are a number of options for changing the look of this. We can change colours (`palette`), the number of classes (`n`), and the classification scheme (`style`)
112 |
113 | ```{r}
114 | m +
115 | tm_polygons(col = 'TB_RATE',
116 | palette = 'Greens',
117 | n = 9, style = 'quantile')
118 | ```
119 |
120 | To find out what options are available, check the help with `?tm_polygons`. Before going any further, experiment with these options until you are comfortable making such maps easily.
121 |
122 | ## Adding other layers
123 |
124 | We can add layers of other data pretty easily using the same approach. We need to read an additional layer of data first, of course
125 |
126 | ```{r}
127 | cases <- st_read('ak-tb-cases.geojson')
128 | ```
129 |
130 | This time, we make a map directly without saving the basemap to a variable
131 |
132 | ```{r}
133 | tm_shape(auckland) +
134 | tm_polygons() +
135 | tm_shape(cases) +
136 | tm_dots()
137 | ```
138 |
139 | Again, check the help for this new function `tm_dots` to see what the options are.
140 |
141 | Now read the roads dataset
142 |
143 | ```{r}
144 | rds <- st_read("ak-rds.gpkg")
145 | ```
146 |
147 | and add it to the map:
148 |
149 | ```{r}
150 | tm_shape(auckland) +
151 | tm_polygons() +
152 | tm_shape(rds) +
153 | tm_lines() +
154 | tm_shape(cases) +
155 | tm_dots(col='red', scale=2)
156 | ```
157 |
158 | It is worth noting here that `tmap` is smart enough to reproject the roads layer to the same projection as the base layers. Check the projections
159 |
160 | ```{r}
161 | st_crs(auckland)
162 | st_crs(rds)
163 | ```
164 |
165 | They are clearly different. To reproject the roads to match the polygon layer, we can use `st_transform`, but there is no need if all we want to do is have a quick look-see. Here's how it would work if we did want to reproject the data:
166 |
167 | ```{r}
168 | rds_wgs84 <- rds %>%
169 | st_transform(st_crs(auckland))
170 | ```
171 |
172 | But again, this isn't necessary unless we are planning on more detailed analytical work, when the internal storage of the map coordinates may matter greatly.
173 |
174 | ## Using a web basemap for context
175 |
176 | `tmap` also provides a simple way to make web maps.
177 |
178 | ```{r}
179 | tmap_mode('view')
180 | ```
181 |
182 | You then make a map in the usual way.
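
For example (just a sketch, reusing the `auckland` and `cases` layers loaded above), something like this should give a zoomable web map with the census areas and cases overlaid on a basemap:

```{r}
tm_shape(auckland) +
  tm_polygons(col = 'TB_RATE', alpha = 0.5) +
  tm_shape(cases) +
  tm_dots()
```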
183 |
184 | Switch back to 'static' map mode using `tmap_mode('plot')`.
185 |
186 | ## Things to Try
187 |
188 | Nothing specific... but you should go back over the instructions and experiment with things like the colour palettes used, and the classification scheme specified by the `style` setting in the `tm_polygons` function call. Make use of the *RStudio* help to assist in these explorations.
189 |
190 | You can also try adding map decorations using the `tm_layout` function.
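
As a starting point (one possible combination, not a prescribed answer, and assuming you have switched back to 'plot' mode), you might try something like:

```{r}
tm_shape(auckland) +
  tm_polygons(col = 'TB_RATE', palette = 'Greens', n = 5, style = 'quantile') +
  tm_layout(main.title = 'TB rates in Auckland', bg.color = 'grey95') +
  tm_compass() +
  tm_scale_bar()
```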
191 |
192 | ## Maps using only `ggplot2`
193 |
194 | Finally, it is worth knowing that there is a way to make maps like these with pure `ggplot2` commands. It goes something like this
195 |
196 | ```{r}
197 | ggplot(auckland) +
198 | geom_sf(aes(fill = TB_RATE), lwd = 0.2, colour = 'grey') +
199 | scale_fill_distiller(palette = 'Reds', direction = 1) +
200 | geom_sf(data = cases, size = 1) +
201 | geom_sf(data = rds, lwd = 0.1, colour = 'darkgrey') +
202 | theme_minimal()
203 | ```
204 |
205 | If you are an aficionado of the `ggplot` libraries, this can be very helpful, although it is generally easier to work with `tmap`, at least to begin with. A significant advantage of `tmap` is the ease with which we can make web map outputs.
206 |
--------------------------------------------------------------------------------
/labs/making-maps-in-r/02-data-wrangling-in-r.md:
--------------------------------------------------------------------------------
1 | # A first look at data wrangling in *R*
2 | Data wrangling is not something you will really learn in this course, but unavoidably we will be doing some of it along the way, more so when we come to look at large, messy multivariate datasets later in the course.
3 |
4 | But it is appropriate to introduce a few key ideas up front so that you have an idea what is going on in some of the lab instructions.
5 |
6 | First just make sure we have all the data and libraries loaded as before.
7 |
8 | ```{r}
9 | library(sf)
10 | library(tmap)
11 |
12 | auckland <- st_read("ak-tb.geojson")
13 | cases <- st_read('ak-tb-cases.geojson')
14 | rds <- st_read("ak-rds.gpkg")
15 | ```
16 |
17 | # Introducing the `tidyverse`
18 | The [`tidyverse`](https://www.tidyverse.org/) is a large collection of packages for handling data in a 'tidy' way. This document can only look at these very quickly, but will hopefully begin to give you a flavour of what is possible, encourage you to explore further if you need to, and help you understand what is happening in some of the instructions where some data preparation is needed.
19 |
20 | Like *R* itself the `tidyverse` is largely inspired by the work of another New Zealander, [Hadley Wickham](http://hadley.nz/)... Aotearoa represent!
21 |
22 | We can't really get into the philosophy of it all here. Instead we focus on some key functionality provided by functions in the `dplyr` package. We will also look at processing pipelines using the `%>%` or 'pipe' operator.
23 |
24 | So... load these libraries
25 |
26 | ```{r}
27 | library(dplyr)
28 | library(ggplot2)
29 | library(tidyr)
30 | ```
31 |
32 | If any of them aren't installed, then install them in the usual way, and load them again.
33 |
34 | ## Data tidying with `dplyr`
35 | The core tidying operations in `dplyr` are
36 |
37 | + _selecting_ columns to keep or reject
38 | + _slicing_ rows to keep or reject
39 | + _filtering_ data based on attribute values
40 | + _mutating_ data values by combining them or operating on them in various ways
41 |
42 | ### `select`
43 | A common requirement in data analysis is selecting only the data attributes you want, and getting rid of all the other junk. A function for looking at data tables is `as_tibble()` (provided by the `tidyr` package). Use it to take a look at the `rds` dataset
44 |
45 | ```{r}
46 | as_tibble(rds)
47 | ```
48 | Notice that the `suffix` and `other_name` attributes don't seem to contain any useful data. In base R we could get rid of them by noting the column numbers and doing something like `rds <- rds[, c(1:2, 5:19)]`, which is not particularly nice to read or to deal with. The `select` function is much easier to read:
49 |
50 | ```{r}
51 | select(rds, -suffix, -other_name)
52 | ```
53 | The minus signs on the names actually tell R to _drop_ the named columns.
54 |
55 | Selecting only columns of interest is easy, using the `select` function, we simply list them
56 | ```{r}
57 | select(rds, road_name, road_class)
58 | ```
59 | This hasn't changed the data, we've just looked at a selection from it. But we can easily assign the result of the selection to a new variable
60 |
61 | ```{r}
62 | rds_reduced <- select(rds, road_name, road_class)
63 | ```
64 |
65 | What is nice about `select` is that it provides lots of different ways to make selections. We can list names, or column numbers, or use colons to include all the columns between two names or two numbers, or even use a minus sign to drop a column. And we can use these (mostly) in all kinds of combinations. For example
66 |
67 | ```{r}
68 | select(rds, 1:2, road_class)
69 | ```
70 |
71 | or
72 | ```{r}
73 | select(rds, -(3:4))
74 | ```
75 |
76 | Note that here I need to put `3:4` in parentheses so it knows to remove columns 3 to 4, and doesn't start by trying to remove a (non-existent) column number -3.
77 |
78 | ### Selecting rows
79 | We look at filtering based on data in the next section. If you just want rows, then use `slice()`
80 |
81 | ```{r}
82 | slice(rds, 2:10, 15:25)
83 | ```
84 |
85 | ## `filter`
86 | Another common data tidying operation is filtering based on the attributes of the data. This is provided by the `dplyr::filter` function. Note that we use the fully qualified function name `dplyr::filter` because there is a base *R* function called `filter` which does something different (this is a common source of problems). We provide a filter specification, usually based on data values, to perform such operations
87 |
88 | ```{r}
89 | dplyr::filter(rds, road_class == 1)
90 | ```
91 | Notice how this has reduced the size of the dataset. If we want data that satisfy more than one filter, we combine the filters with **and** `&` and **or** `|` operators
92 |
93 | ```{r}
94 | dplyr::filter(rds, road_type == "ROAD" & road_class == 1)
95 | ```
96 |
97 | Using select and filter in combination, we can usually quickly and easily reduce large complicated datasets down to the parts we really want to look at. We'll see a little bit later how to chain operations together into processing pipelines. First, one more tool that is really useful: `mutate`.
98 |
99 | ## `mutate`
100 | Selecting and filtering data leaves things unchanged. Often we want to combine columns in various ways. This is provided by the `mutate` function
101 |
102 | ```{r}
103 | mutate(rds, full_name = paste(road_name, road_type))
104 | ```
105 |
106 | This has added a new column to the data by combining other columns using some function. In this case we use the base *R* `paste` function to combine text columns (there are better ways to do this but we won't worry about that for now).
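
If you're curious, one of those "better ways" might be `tidyr::unite`, which joins text columns in a single step (just a sketch, using the `tidyr` package loaded above):

```{r}
rds %>%
  unite(full_name, road_name, road_type, sep = " ", remove = FALSE) %>%
  as_tibble()
```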
107 |
108 | ## Combining operations into pipelines
109 | Often we want to do several things one after another, combined in a workflow or processing pipeline, which can easily become fiddly and hard to read (not executable code, but you get the idea):
110 |
111 | ```
112 | a <- select(y, ...)
113 | b <- dplyr::filter(a, ...)
114 | c <- mutate(b, ...)
115 | ```
116 |
117 | and so on. To combine these operations into a single line you would do something like this
118 |
119 | ```
120 | c <- mutate(dplyr::filter(select(y, ...), ...), ... )
121 | ```
122 |
123 | but this can get very confusing very quickly. The order of operations is opposite to the order they are written, and keeping track of all those opening and closing parentheses is error-prone.
124 |
125 | The tidyverse introduces a 'pipe' operator `%>%`, which, once you get used to it, simplifies things greatly. Instead of the above, we have
126 |
127 | ```
128 | c <- y %>% select(...) %>%
129 | dplyr::filter(...) %>%
130 | mutate(...)
131 | ```
132 |
133 | This reads "assign to `c` the result of passing `y` into `select`, then into `filter`, then into `mutate`". Here is a nonsensical example with the `rds` dataset, combining operations from the previous three sections
134 |
135 | ```{r}
136 | rds %>%
137 | select(starts_with("road_")) %>%
138 | dplyr::filter(road_class <= 2) %>%
139 | mutate(full_name = paste(road_name, road_type))
140 | ```
141 |
142 | This is introduced here not in the expectation that you will remember it, but so that you have some idea what's going on if the approach is used in later lab instructions. We may spend more time on these ideas later in the semester after all the lab assignment materials have been introduced.
143 |
144 | ### `sf` and pipelines
145 | `sf` functions are pipeline aware and compliant. This means you can pass `sf` objects through tidy pipelines, and also that the various `st_` prefixed functions provided by `sf` for handling spatial data can be included in such pipelines. So we can add on to the end of the above pipeline
146 |
147 | ```{r}
148 | rds %>%
149 | select(starts_with("road_")) %>%
150 | dplyr::filter(road_class <= 2) %>%
151 | mutate(full_name = paste(road_name, road_type)) %>%
152 | st_transform(st_crs(auckland))
153 | ```
154 |
155 | #### Finally: `sf` geometry is 'sticky'
156 | Also worth noting is that the geometry column in an `sf` dataset isn't dropped if you use an operation like `select(data, -geometry)` on it. It wants to stick around. This is good, because it means you can't accidentally ditch the spatial part of a dataset. But sometimes you just want the data table (some analytical functions can't handle the spatial information, for example). In these cases use the `st_drop_geometry` function:
157 |
158 | ```{r}
159 | rds_data <- rds %>%
160 | st_drop_geometry()
161 | as_tibble(rds_data)
162 | ```
163 |
--------------------------------------------------------------------------------
/labs/making-maps-in-r/README.md:
--------------------------------------------------------------------------------
1 | # Making maps in *R* overview
2 | This week simply download [this zip file](making-maps-in-r.zip?raw=true) and unpack it to a local folder, then follow the instructions on these pages:
3 | + [Making maps in *R*](01-making-maps-in-r.md).
4 | + [Simple data wrangling in *R*](02-data-wrangling-in-r.md).
5 |
--------------------------------------------------------------------------------
/labs/making-maps-in-r/ak-rds.gpkg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/making-maps-in-r/ak-rds.gpkg
--------------------------------------------------------------------------------
/labs/making-maps-in-r/making-maps-in-r.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/making-maps-in-r/making-maps-in-r.zip
--------------------------------------------------------------------------------
/labs/multivariate-analysis/02-the-tidyverse.Rmd:
--------------------------------------------------------------------------------
1 | First just make sure we have all the data and libraries we need set up.
2 | ```{r message = FALSE}
3 | library(sf)
4 | library(ggplot2)
5 | library(tidyr)
6 | library(dplyr)
7 |
8 | sfd <- st_read('sf_demo.geojson')
9 | sfd <- drop_na(sfd)
10 | sfd.d <- st_drop_geometry(sfd)
11 | ```
12 | # Introducing the `tidyverse`
13 | It is really helpful to have a more systematic way of dealing with large, complicated and messy datasets. Base *R* does OK at this, but can get messy very quickly.
14 |
15 | An alternative approach is provided by the [`tidyverse`](https://www.tidyverse.org/) a large collection of packages for handling data in a 'tidy' way, and with an associated powerful set of plotting libraries (`ggplot2`). We already looked quickly at this [back in week 2](https://github.com/DOSull/GISC-422/blob/master/labs/making-maps-in-r/02-data-wrangling-in-r.md). This page is a re-run of what we learned then. Here we again only look at these very quickly, but hopefully give you enough flavour of what is possible, and encourage you to explore further as needed.
16 |
17 | We focus on some key functionality provided by functions in the `dplyr` package. We will also look quickly at processing pipelines using the `%>%` or 'pipe' operator. We'll round things off with a quick look at `ggplot2`. If you take (or have already taken) [DATA 471](https://www.wgtn.ac.nz/courses/data/471/2021/offering?crn=33154), you'll learn more about both.
18 |
19 | ## `dplyr::select`
20 | A common requirement in data analysis is selecting only the data attributes you want, and getting rid of all the other junk. The `sfd` dataset has a lot going on. A nice tidy tool for looking at data is `as_tibble()`
21 | ```{r}
22 | as_tibble(sfd)
23 | ```
24 |
25 | This shows us that we have 25 columns in our dataset (one of them is the geometry). We can get a list of the names with `names()`
26 | ```{r}
27 | names(sfd)
28 | ```
29 |
30 | Selecting only columns of interest is easy, using the `dplyr::select` function, we simply list them
31 | ```{r}
32 | dplyr::select(sfd, density, Pdoctorate, perCapitaIncome)
33 | ```
34 |
35 | Note that we specify `dplyr::select` to avoid name clashes with other package functions also called `select` which are likely to do entirely different things!
36 |
37 | The select operation hasn't changed the data, we've just looked at a selection from it. But we can easily assign the result of the selection to a new variable
38 | ```{r}
39 | sfd.3 <- dplyr::select(sfd, density, Pdoctorate, perCapitaIncome)
40 | ```
41 |
42 | What is nice about `dplyr::select` is that it provides lots of different ways to make selections. We can list names, or column numbers, or use colons to include all the columns between two names or two numbers, or even use a minus sign to drop a column. And we can use these (mostly) in all kinds of combinations. For example
43 | ```{r}
44 | dplyr::select(sfd, 1:2, PlessHighSchool, PraceBlackAlone:PforeignBorn)
45 | ```
46 |
47 | or
48 | ```{r}
49 | dplyr::select(sfd, -(1:10))
50 | ```
51 |
52 | Note that here I need to put `1:10` in parentheses so it knows to remove all the columns 1 to 10, and doesn't start by trying to remove a (non-existent) column number -1.
53 |
54 | There are also helper functions like `starts_with()`, `ends_with()` and `contains()` to let you do selections based on variable names (this can be particularly useful with big datasets from sources like the census: take note!). For example:
55 | ```{r}
56 | dplyr::select(sfd, contains("race"))
57 | ```
58 |
59 | ### Selecting rows
60 | We look at filtering based on data values in the next section. If you just want specific known rows, use `slice()`
61 | ```{r}
62 | slice(sfd, 2:10, 15:25)
63 | ```
64 |
65 | ## `dplyr::filter`
66 | Another common data tidying operation is filtering based on attributes of the data. We provide a filter specification, usually data-based to perform such operations (again `filter` is a function name in other packages, hence the use of `dplyr::filter` to disambiguate).
67 | ```{r}
68 | dplyr::filter(sfd, density > 0.3)
69 | ```
70 |
71 | If we want data that satisfy more than one filter, we simply combine the filters with **and** `&` and **or** `|` operators
72 | ```{r}
73 | dplyr::filter(sfd, (density > 0.1 & perCapitaIncome > 0.1) | PlessHighSchool > 0.5)
74 | ```
75 |
76 | Using select and filter in combination, we can usually quickly and easily reduce large complicated datasets down to the parts we really want to look at. What's left might still be big and complicated, but at least it will be only what we want to look at! We'll see a little bit later how to chain operations together into processing pipelines. First, one more tool that is really useful: `mutate`.
77 |
78 | ## `mutate`
79 | Selecting and filtering data leaves things unchanged. Often we want to combine columns in various ways. This is provided by the `mutate` function
80 | ```{r}
81 | mutate(sfd, x = density + PwithKids)
82 | ```
83 |
84 | This has added a new column to the data by adding together the values of two other columns (in this case, it's a meaningless calculation, but you should easily be able to imagine other examples that would make sense!)
85 |
86 | `mutate` can also use `dplyr::select` selection semantics to allow you to perform the same calculation on several columns at once:
87 | ```{r}
88 | mutate(sfd, across(starts_with("P"), ~ . * 100))
89 | ```
90 | will convert all those proportions to percentages. This is quite a complicated topic, and I'm just letting you know it is a possibility. I often have to look up how to apply mutate operations to multiple columns on StackExchange or wherever before using it, although I am getting better at it. The key idea is to use `across(<columns>, <function>)`, where the columns can be specified `dplyr::select` style and the function can be specified either by name (e.g. `sqrt` or `log`) or using an expression as shown above, with the `.` standing in for the value of the variable itself.
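
For example, the by-name form looks like this (a sketch only, and as meaningless numerically as the earlier example):

```{r}
# apply sqrt to every column whose name starts with "P"
mutate(sfd, across(starts_with("P"), sqrt))
```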
91 |
92 | ## Combining operations into pipelines
93 | Something that can easily become tedious is this kind of thing (not executable code, but you get the idea)
94 |
95 | A <- dplyr::select(Y, ...)
96 | B <- dplyr::filter(A, ...)
97 | C <- mutate(B, ...)
98 |
99 | and so on. Normally to combine these operations into a single line you would do something like this
100 |
101 | C <- mutate(dplyr::filter(dplyr::select(Y, ...), ...), ... )
102 |
103 | but this can get very confusing very quickly, because the order of operations is opposite to the order they are written, and keeping track of all those opening and closing parentheses is error-prone. The `tidyverse` introduces a 'pipe' operator `%>%` which (once you get used to it) simplifies things greatly. Instead of the above, we have
104 |
105 | C <- Y %>%
106 | dplyr::select(...) %>%
107 | dplyr::filter(...) %>%
108 | mutate(...)
109 |
110 | This reads "assign to `C` the result of passing `Y` into `select`, then into `filter`, then into `mutate`". Here is a nonsensical example with the `sfd` dataset, combining operations from each of the previous sections
111 | ```{r}
112 | sfd %>%
113 | dplyr::select(1:10) %>%
114 | slice(10:50) %>%
115 | dplyr::filter(density > 0.1) %>%
116 | mutate(x = density + PcommutingNotCar)
117 | ```
118 |
119 | ## Tidying up plotting with `ggplot2`
120 | Not part of the tidyverse, exactly, but certainly adjacent to it, is a more consistent approach to plotting, particularly if you are making complicated figures. We've already seen an example of this in the previous document. Here it is again
121 | ```{r}
122 | ggplot(sfd) +
123 | geom_point(aes(x = density,
124 | y = medianYearBuilt,
125 | colour = PoneUnit,
126 | size = PownerOccUnits), alpha = 0.5) +
127 | scale_colour_distiller(palette = "Spectral")
128 | ```
129 |
130 | What's going on here?!
131 |
132 | The idea behind `ggplot2` functions is that there should be an *aesthetic mapping* between each data attribute and some graphical aspect. The idea is discussed in [this paper about a layered grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.pdf) (that's what the `gg` stands for). We've already seen an implementation of it in `tmap` when we specify `col = ` for a map variable. `ggplot2` is a more complete implementation of the idea. The `ggplot` function specifies the dataset; an additional layer is specified by a geometry function, in the example above `geom_point`, for which we specify the aesthetic mapping using `aes()`, telling it which graphical parameters (x and y location, colour, and size) are linked to which data attributes.
133 |
134 | It is worth knowing that `ggplot` knows about `sf` geospatial data, and so can be used as an alternative to `tmap` by applying the `geom_sf` function. This is a big topic, and I only touch on it here so I can use `ggplot` functions from time to time without freaking you out! I am happy to discuss further if this is a topic that interests or excites you.
135 | ```{r}
136 | ggplot(sfd) +
137 | geom_sf(aes(fill = density)) +
138 | scale_fill_distiller(palette = 'Reds', direction = 1)
139 | ```
140 |
141 | Now let's get back to multivariate data. Go to [this document](03-dimensional-reduction.md).
--------------------------------------------------------------------------------
/labs/multivariate-analysis/02-the-tidyverse.md:
--------------------------------------------------------------------------------
1 | First just make sure we have all the data and libraries we need set up.
2 | ```{r message = FALSE}
3 | library(sf)
4 | library(ggplot2)
5 | library(tidyr)
6 | library(dplyr)
7 |
8 | sfd <- st_read('sf_demo.geojson')
9 | sfd <- drop_na(sfd)
10 | sfd.d <- st_drop_geometry(sfd)
11 | ```
12 | # Introducing the `tidyverse`
13 | It is really helpful to have a more systematic way of dealing with large, complicated and messy datasets. Base *R* does OK at this, but can get messy very quickly.
14 |
15 | An alternative approach is provided by the [`tidyverse`](https://www.tidyverse.org/) a large collection of packages for handling data in a 'tidy' way, and with an associated powerful set of plotting libraries (`ggplot2`). We already looked quickly at this [back in week 2](https://github.com/DOSull/GISC-422/blob/master/labs/making-maps-in-r/02-data-wrangling-in-r.md). This page is a re-run of what we learned then. Here we again only look at these very quickly, but hopefully give you enough flavour of what is possible, and encourage you to explore further as needed.
16 |
17 | We focus on some key functionality provided by functions in the `dplyr` package. We will also look quickly at processing pipelines using the `%>%` or 'pipe' operator. We'll round things off with a quick look at `ggplot2`. If you take (or have already taken) [DATA 471](https://www.wgtn.ac.nz/courses/data/471/2021/offering?crn=33154), you'll learn more about both.
18 |
19 | ## `dplyr::select`
20 | A common requirement in data analysis is selecting only the data attributes you want, and getting rid of all the other junk. The `sfd` dataset has a lot going on. A nice tidy tool for looking at data is `as_tibble()`
21 | ```{r}
22 | as_tibble(sfd)
23 | ```
24 |
25 | This shows us that we have 25 columns in our dataset (one of them is the geometry). We can get a list of the names with `names()`
26 | ```{r}
27 | names(sfd)
28 | ```
29 |
30 | Selecting only columns of interest is easy, using the `dplyr::select` function, we simply list them
31 | ```{r}
32 | dplyr::select(sfd, density, Pdoctorate, perCapitaIncome)
33 | ```
34 |
35 | Note that we specify `dplyr::select` to avoid name clashes with other package functions also called `select` which are likely to do entirely different things!
36 |
37 | The select operation hasn't changed the data, we've just looked at a selection from it. But we can easily assign the result of the selection to a new variable
38 | ```{r}
39 | sfd.3 <- dplyr::select(sfd, density, Pdoctorate, perCapitaIncome)
40 | ```
41 |
42 | What is nice about `dplyr::select` is that it provides lots of different ways to make selections. We can list names, or column numbers, or use colons to include all the columns between two names or two numbers, or even use a minus sign to drop a column. And we can use these (mostly) in all kinds of combinations. For example
43 | ```{r}
44 | dplyr::select(sfd, 1:2, PlessHighSchool, PraceBlackAlone:PforeignBorn)
45 | ```
46 |
47 | or
48 | ```{r}
49 | dplyr::select(sfd, -(1:10))
50 | ```
51 |
52 | Note that here I need to put `1:10` in parentheses so it knows to remove all the columns 1 to 10, and doesn't start by trying to remove a (non-existent) column number -1.
53 |
54 | There are also helper functions like `starts_with()`, `ends_with()` and `contains()` to let you do selections based on variable names (this can be particularly useful with big datasets from sources like the census: take note!). For example:
55 | ```{r}
56 | dplyr::select(sfd, contains("race"))
57 | ```
58 |
59 | ### Selecting rows
60 | We look at filtering based on data values in the next section. If you just want specific known rows, use `slice()`
61 | ```{r}
62 | slice(sfd, 2:10, 15:25)
63 | ```
64 |
65 | ## `dplyr::filter`
66 | Another common data tidying operation is filtering based on attributes of the data. We provide a filter specification, usually data-based to perform such operations (again `filter` is a function name in other packages, hence the use of `dplyr::filter` to disambiguate).
67 | ```{r}
68 | dplyr::filter(sfd, density > 0.3)
69 | ```
70 |
71 | If we want data that satisfy more than one filter, we simply combine the filters with **and** `&` and **or** `|` operators
72 | ```{r}
73 | dplyr::filter(sfd, (density > 0.1 & perCapitaIncome > 0.1) | PlessHighSchool > 0.5)
74 | ```
75 |
76 | Using select and filter in combination, we can usually quickly and easily reduce large complicated datasets down to the parts we really want to look at. What's left might still be big and complicated, but at least it will be only what we want to look at! We'll see a little bit later how to chain operations together into processing pipelines. First, one more tool that is really useful: `mutate`.
77 |
78 | ## `mutate`
79 | Selecting and filtering data leaves things unchanged. Often we want to combine columns in various ways. This is provided by the `mutate` function
80 | ```{r}
81 | mutate(sfd, x = density + PwithKids)
82 | ```
83 |
84 | This has added a new column to the data by adding together the values of two other columns (in this case, it's a meaningless calculation, but you should easily be able to imagine other examples that would make sense!)
85 |
86 | `mutate` can also use `dplyr::select` selection semantics to allow you to perform the same calculation on several columns at once:
87 | ```{r}
88 | mutate(sfd, across(starts_with("P"), ~ . * 100))
89 | ```
90 | will convert all those proportions to percentages. This is quite a complicated topic, and I'm just letting you know it is a possibility. I often have to look up how to apply mutate operations to multiple columns on StackExchange or wherever before using it, although I am getting better at it. The key idea is to use `across(<columns>, <function>)`, where the columns can be specified `dplyr::select` style and the function can be specified either by name (e.g. `sqrt` or `log`) or using an expression as shown above, with the `.` standing in for the value of the variable itself.
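
For example, the by-name form looks like this (a sketch only, and as meaningless numerically as the earlier example):

```{r}
# apply sqrt to every column whose name starts with "P"
mutate(sfd, across(starts_with("P"), sqrt))
```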
91 |
92 | ## Combining operations into pipelines
93 | Something that can easily become tedious is this kind of thing (not executable code, but you get the idea)
94 |
95 | A <- dplyr::select(Y, ...)
96 | B <- dplyr::filter(A, ...)
97 | C <- mutate(B, ...)
98 |
99 | and so on. Normally to combine these operations into a single line you would do something like this
100 |
101 | C <- mutate(dplyr::filter(dplyr::select(Y, ...), ...), ... )
102 |
103 | but this can get very confusing very quickly, because the order of operations is opposite to the order they are written, and keeping track of all those opening and closing parentheses is error-prone. The `tidyverse` introduces a 'pipe' operator `%>%` which (once you get used to it) simplifies things greatly. Instead of the above, we have
104 |
105 | C <- Y %>%
106 | dplyr::select(...) %>%
107 | dplyr::filter(...) %>%
108 | mutate(...)
109 |
110 | This reads "assign to `C` the result of passing `Y` into `select`, then into `filter`, then into `mutate`". Here is a nonsensical example with the `sfd` dataset, combining operations from each of the previous sections
111 | ```{r}
112 | sfd %>%
113 | dplyr::select(1:10) %>%
114 | slice(10:50) %>%
115 | dplyr::filter(density > 0.1) %>%
116 | mutate(x = density + PcommutingNotCar)
117 | ```
118 |
119 | ## Tidying up plotting with `ggplot2`
120 | Not part of the tidyverse, exactly, but certainly adjacent to it, is a more consistent approach to plotting, particularly if you are making complicated figures. We've already seen an example of this in the previous document. Here it is again
121 | ```{r}
122 | ggplot(sfd) +
123 | geom_point(aes(x = density,
124 | y = medianYearBuilt,
125 | colour = PoneUnit,
126 | size = PownerOccUnits), alpha = 0.5) +
127 | scale_colour_distiller(palette = "Spectral")
128 | ```
129 |
130 | What's going on here?!
131 |
132 | The idea behind `ggplot2` functions is that there should be an *aesthetic mapping* between each data attribute and some graphical aspect. The idea is discussed in [this paper about a layered grammar of graphics](http://vita.had.co.nz/papers/layered-grammar.pdf) (that's what the `gg` stands for). We've already seen an implementation of it in `tmap` when we specify `col = ` for a map variable. `ggplot2` is a more complete implementation of the idea. The `ggplot` function specifies the dataset; an additional layer is specified by a geometry function, in the example above `geom_point`, for which we specify the aesthetic mapping using `aes()`, telling it which graphical parameters (x and y location, colour, and size) are linked to which data attributes.
133 |
134 | It is worth knowing that `ggplot` knows about `sf` geospatial data, and so can be used as an alternative to `tmap` by applying the `geom_sf` function. This is a big topic, and I only touch on it here so I can use `ggplot` functions from time to time without freaking you out! I am happy to discuss further if this is a topic that interests or excites you.
135 | ```{r}
136 | ggplot(sfd) +
137 | geom_sf(aes(fill = density)) +
138 | scale_fill_distiller(palette = 'Reds', direction = 1)
139 | ```
140 |
141 | Now let's get back to multivariate data. Go to [this document](03-dimensional-reduction.md).
--------------------------------------------------------------------------------
/labs/multivariate-analysis/03-dimensional-reduction.Rmd:
--------------------------------------------------------------------------------
1 | First just make sure we have all the data and libraries we need set up.
2 | ```{r message = FALSE}
3 | library(sf)
4 | library(tmap)
5 | library(dplyr)
6 | library(tidyr)
7 | library(RColorBrewer)
8 |
9 | sfd <- st_read('sf_demo.geojson')
10 | sfd <- drop_na(sfd)
11 | sfd.d <- st_drop_geometry(sfd)
12 | ```
13 | # Dimension reduction methods
14 | There are a number of dimensional reduction methods.
15 |
16 | The most widely known is probably *principal component analysis* (PCA), so we will look at that. We've already seen in the above exploration of the data that many of the variables in this dataset are correlated. If you're not convinced about that, try `plot(sfd)` again to see maps of some of the variables. It's clear that many of the variables exhibit similar distributions. This *non-independence* of the variables means that there is the potential to use weighted combinations of various variables to stand in for the full dataset. With any luck, the weighted sums will be interpretable, and we'll only need a few of them, not all 24 of them, to get an overall picture of the data.
17 |
18 | The mathematics of this are pretty complicated. They rely on analysis of the correlation matrix of the dataset (we already saw this above when we ran the `cor` function), specifically the calculation of the matrix *eigenvectors* and corresponding *eigenvalues*. Roughly speaking (very roughly) each eigenvector is a direction in multidimensional space along which the data set can be projected. By ordering the eigenvectors in descending order from that along which the data has the greatest variance, to that with the least (using the eigenvalues), and ensuring that the eigenvectors are perpendicular to one another, we can obtain a smaller set of *components*, which capture most of the variance in our original data.
19 |
20 | This [interactive graphic](https://www.joyofdata.de/public/pca-3d/) provides a nice illustration of the idea, just imagine it in 24 dimensions and you'll be there!
21 |
22 | OK... so how does this work in practice? It's actually pretty easy. The function `princomp` in R performs the analysis.
23 | ```{r}
24 | sfd.pca <- princomp(sfd.d, cor = TRUE)
25 | ```
26 |
27 | The results are stored in the `sfd.pca` object. We can get a summary
28 | ```{r}
29 | summary(sfd.pca)
30 | ```
31 |
32 | which tells us the total proportion of all the variance in the dataset accounted for by the principal components starting from the most significant and working down. These results show that about 2/3 of the variance in this dataset (0.3510 + 0.2206 + 0.0956) can be accounted for by only the first three principal components (to see this graphically, use `screeplot(sfd.pca)`).
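If you want those proportions as numbers you can work with, rather than reading them off the summary, a short calculation like the one below re-derives them from the component standard deviations stored in the result (it just reproduces what `summary(sfd.pca)` reports).
```{r}
# the component variances are the squared standard deviations
variances <- sfd.pca$sdev ^ 2
# proportion of total variance accounted for by each component,
# and the cumulative proportion
props <- variances / sum(variances)
cumsum(props)
```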
33 |
34 | What are the principal components? Each is a *weighted sum of the original variables*, which we can examine in the `loadings` component of the result.
35 | ```{r}
36 | sfd.pca$loadings
37 | ```
38 |
39 | These show us, for each component, how it can be calculated from the original variables by summing weighted combinations of them. Strongly positive (>0) weights are 'positive loadings' and strongly negative (<0) weights are 'negative loadings'. In the table we can see, for example, that component 1 loads negatively on income, 'white alone' and 'some college', suggesting that high scores on this component are associated with poorer, mixed-ethnicity, non-white areas.
40 |
41 | The interpretation of each component is based on which variables weight heavily, positively or negatively, on that component in this table.
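If the full loadings table is hard to scan, one quick (optional) trick is to pull out a single component and sort the variables by their weight on it, for example for component 1:
```{r}
# variables ordered from most negative to most positive loading on component 1
sort(sfd.pca$loadings[, "Comp.1"])
```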
42 |
43 | A plot which sometimes helps is the biplot produced as below
44 | ```{r}
45 | biplot(sfd.pca, pc.biplot = TRUE, cex = 0.65)
46 | ```
47 |
48 | This helps us to see which variables weight similarly on the components (they point in similar directions) and also which observations (i.e. which census tracts in this case) score highly or not on each variable. It is important to keep in mind when inspecting such plots that they are a reduction of the original data (that's the whole point) and must be interpreted with caution.
49 |
50 | If we want to see the geography of components, then we can extract components from the PCA analysis `scores` output.
51 | ```{r}
52 | sfd$PC1 <- sfd.pca$scores[, "Comp.1"]
53 | tmap_mode('view')
54 | tm_shape(sfd) +
55 | tm_polygons(col = 'PC1', palette = "PRGn") + # call it PC1 for principal component 1
56 | tm_legend(legend.outside = TRUE)
57 | ```
58 |
59 | You can make a new spatial dataset from the original spatial data with only the component scores, like this:
60 | ```{r}
61 | sfd.pca.scores <- sfd %>%
62 | dplyr::select(geometry) %>%
63 | bind_cols(as.data.frame(sfd.pca$scores))
64 | plot(sfd.pca.scores, pal = brewer.pal(9, "Reds"))
65 | ```
66 |
67 | Related techniques to PCA are *factor analysis* and *multidimensional scaling* (MDS), although the last of these is also closely related to the second broad class of methods, classification, which we will look at... [so let's do that now](04-classification-and-clustering.md).
68 |
--------------------------------------------------------------------------------
/labs/multivariate-analysis/03-dimensional-reduction.md:
--------------------------------------------------------------------------------
1 | First just make sure we have all the data and libraries we need set up.
2 | ```{r message = FALSE}
3 | library(sf)
4 | library(tmap)
5 | library(dplyr)
6 | library(tidyr)
7 | library(RColorBrewer)
8 |
9 | sfd <- st_read('sf_demo.geojson')
10 | sfd <- drop_na(sfd)
11 | sfd.d <- st_drop_geometry(sfd)
12 | ```
13 | # Dimension reduction methods
14 | There are a number of dimensional reduction methods.
15 |
16 | The most widely known is probably *principal component analysis* (PCA), so we will look at that. We've already seen in the above exploration of the data that many of the variables in this dataset are correlated. If you're not convinced about that, try `plot(sfd)` again to see maps of some of the variables. It's clear that many of the variables exhibit similar distributions. This *non-independence* of the variables means that there is the potential to use weighted combinations of various variables to stand in for the full dataset. With any luck, the weighted sums will be interpretable, and we'll only need a few of them, not all 24 of them, to get an overall picture of the data.
17 |
18 | The mathematics of this are pretty complicated. They rely on analysis of the correlation matrix of the dataset (we already saw this above when we ran the `cor` function), specifically the calculation of the matrix *eigenvectors* and corresponding *eigenvalues*. Roughly speaking (very roughly) each eigenvector is a direction in multidimensional space along which the data set can be projected. By ordering the eigenvectors in descending order from that along which the data has the greatest variance, to that with the least (using the eigenvalues), and ensuring that the eigenvectors are perpendicular to one another, we can obtain a smaller set of *components*, which capture most of the variance in our original data.
19 |
20 | This [interactive graphic](https://www.joyofdata.de/public/pca-3d/) provides a nice illustration of the idea, just imagine it in 24 dimensions and you'll be there!
21 |
22 | OK... so how does this work in practice? It's actually pretty easy. The function `princomp` in R performs the analysis.
23 | ```{r}
24 | sfd.pca <- princomp(sfd.d, cor = TRUE)
25 | ```
26 |
27 | The results are stored in the `sfd.pca` object. We can get a summary
28 | ```{r}
29 | summary(sfd.pca)
30 | ```
31 |
32 | which tells us the total proportion of all the variance in the dataset accounted for by the principal components starting from the most significant and working down. These results show that about 2/3 of the variance in this dataset (0.3510 + 0.2206 + 0.0956) can be accounted for by only the first three principal components (to see this graphically, use `screeplot(sfd.pca)`).
33 |
34 | What are the principal components? Each is a *weighted sum of the original variables*, which we can examine in the `loadings` component of the result.
35 | ```{r}
36 | sfd.pca$loadings
37 | ```
38 |
39 | These show us, for each component, how it can be calculated from the original variables by summing weighted combinations of them. Strongly positive (>0) weights are 'positive loadings' and strongly negative (<0) weights are 'negative loadings'. In the table we can see, for example, that component 1 loads negatively on income, 'white alone' and 'some college', suggesting that high scores on this component are associated with poorer, mixed-ethnicity, non-white areas.
40 |
41 | The interpretation of each component is based on which variables weight heavily, positively or negatively, on that component in this table.
42 |
43 | A plot which sometimes helps is the biplot produced as below
44 | ```{r}
45 | biplot(sfd.pca, pc.biplot = TRUE, cex = 0.65)
46 | ```
47 |
48 | This helps us to see which variables weight similarly on the components (they point in similar directions) and also which observations (i.e. which census tracts in this case) score highly or not on each variable. It is important to keep in mind when inspecting such plots that they are a reduction of the original data (that's the whole point) and must be interpreted with caution.
49 |
50 | If we want to see the geography of components, then we can extract components from the PCA analysis `scores` output.
51 | ```{r}
52 | sfd$PC1 <- sfd.pca$scores[, "Comp.1"]
53 | tmap_mode('view')
54 | tm_shape(sfd) +
55 | tm_polygons(col = 'PC1', palette = "PRGn") + # call it PC1 for principal component 1
56 | tm_legend(legend.outside = TRUE)
57 | ```
58 |
59 | You can make a new spatial dataset from the original spatial data with only the component scores, like this:
60 | ```{r}
61 | sfd.pca.scores <- sfd %>%
62 | dplyr::select(geometry) %>%
63 | bind_cols(as.data.frame(sfd.pca$scores))
64 | plot(sfd.pca.scores, pal = brewer.pal(9, "Reds"))
65 | ```
66 |
67 | Related techniques to PCA are *factor analysis* and *multidimensional scaling* (MDS), although the last of these is also closely related to the second broad class of methods, classification, which we will look at... [so let's do that now](04-classification-and-clustering.md).
68 |
--------------------------------------------------------------------------------
/labs/multivariate-analysis/04-classification-and-clustering.Rmd:
--------------------------------------------------------------------------------
1 | First just make sure we have all the data and libraries we need set up.
2 |
3 | ```{r message = FALSE}
4 | library(sf)
5 | library(tmap)
6 | library(tidyr)
7 | library(dplyr)
8 |
9 | sfd <- st_read('sf_demo.geojson')
10 | sfd <- drop_na(sfd)
11 | sfd.d <- st_drop_geometry(sfd)
12 | ```
13 |
14 | ## Clustering
15 | Whereas dimensional reduction methods focus on the variables in a dataset, clustering methods focus on the observations and the differences and similarities between them. The idea of clustering analysis is to break the dataset into clusters or groups of observations that are similar to one another and different from others in the data.
16 |
17 | There is no easy way to define clusters beyond recognising that clusters are the groups of observations identified by a clustering method! Like PCA, clustering analysis depends a great deal on the interpretation of an analyst.
18 |
19 | What do we mean by 'similar' and 'different'? We extend the basic idea of distance in Euclidean (two dimensional) space, where $d_{ij} = \sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$, that is, the square root of the sum of the squared differences in each coordinate, to higher dimensions. So if we are in 24-dimensional data space, we take the squared differences in each of the 24 dimensions (i.e. on each variable) between two observations, add them together and take the square root. Other versions of this basic idea of 'total difference' in attribute values are possible. An important consideration is that all the attributes should be *rescaled* so that differences in one particular attribute which happens to have large values associated with it don't 'drown out' differences in other variables. A related concern is that we take care not to include lots of strongly correlated variables in the analysis (sometimes clustering is done on principal component scores for this reason).
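The code further down runs the clustering on the raw values, but if you did want to rescale first, a minimal sketch using base R's `scale` function (just one of several reasonable choices) would look something like this:
```{r}
# centre each variable on its mean and divide by its standard deviation
sfd.s <- as.data.frame(scale(sfd.d))
# every column now has mean ~0 and standard deviation 1
round(apply(sfd.s, 2, sd), 2)
```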
20 |
21 | ### K-means clustering
22 | One common clustering approach is k-means clustering. The algorithm is pretty simple:
23 |
24 | 1. Decide on the number of clusters you want, call this *k*
25 | 2. Choose *k* cluster centres
26 | 3. Assign each observation to its nearest cluster centre
27 | 4. Calculate the mean centre of each cluster and move the cluster centre accordingly
28 | 5. Go back to 3 and repeat, until the cluster assignments stop changing
29 |
30 | Here's an [illustration of this working](https://kkevsterrr.github.io/K-Means/) to help with following the description above.
31 |
32 | It's important to realise that k-means clustering is non-deterministic, as the choice of initial cluster centres is often random, and can affect the final assignment arrived at.
33 |
34 | So here is how we accomplish this in R.
35 |
36 | ```{r}
37 | km <- kmeans(sfd.d, 7)
38 | sfd$km7 <- as.factor(km$cluster)
39 |
40 | tmap_mode('view')
41 | tm_shape(sfd) +
42 | tm_polygons(col = 'km7') +
43 | tm_legend(legend.outside = T)
44 | ```
45 |
46 | The `kmeans` function does the work, and requires that we decide in advance how many clusters we want (I picked 7 just because... well... SEVEN). We can retrieve the resulting cluster assignments from the output `km` as `km$cluster`, which we convert to a `factor`. The numerical cluster number is meaningless (cluster 1 is not 'less than' cluster 2), so the cluster assignment is properly speaking a factor, and designating it as such will allow `tmap` and other packages to handle it intelligently. We can then add it to the spatial data and map it like any other variable.
47 |
48 | Try changing the number of clusters in the above code.
49 |
50 | The 'quality' of a particular clustering solution is dependent on how well we feel we can interpret it. Measures of the variance within and between clusters can be used to assess quality in a more technical sense and are available from the `kmeans` object produced by the function.
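For example, a minimal sketch of that kind of check: the `kmeans` result stores within-cluster and between-cluster sums of squares, and re-running the clustering over a range of *k* values gives a rough 'elbow' plot (the exact numbers will vary from run to run because of the random starting centres).
```{r}
# variance measures for the 7-cluster solution
km$tot.withinss
km$betweenss
# total within-cluster sum of squares for k = 2 to 12
wss <- sapply(2:12, function(k) kmeans(sfd.d, centers = k, nstart = 25)$tot.withinss)
plot(2:12, wss, type = "b", xlab = "Number of clusters, k", ylab = "Total within-cluster SS")
```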
51 |
52 | ### Hierarchical clustering
53 | K-means is quick and very scalable - it can be successfully applied to very large datasets (see for example the [Landcare LENZ data](https://www.landcareresearch.co.nz/resources/maps-satellites/lenz)). It doesn't however provide much clue about the structure of the clusters it produces. Hierarchical methods do a better job of showing us how observations got assembled into the clusters they are in.
54 |
55 | The algorithm in this case looks something like
56 |
57 | 1. Calculate all the (multivariate) distances between every pair of observations
58 | 2. Find the nearest pair of observations and merge them into a cluster
59 | 3. For the newly formed cluster determine its distances from all the remaining observations (and any other clusters)
60 | 4. Go back to 3 and repeat until all cases are in a cluster
61 |
62 | This approach is 'agglomerative' because we start with individual observations. It is possible to proceed in the other direction repeatedly subdividing the dataset into subsets until we get to individual cases, or perhaps until some measure of the cluster quality tells us we can't improve the solution any further. This method is very often used with network data when cluster detection is known as *community detection* (more on that next week).
63 |
64 | In *R*, the necessary functions are provided by the `hclust` function
65 |
66 | ```{R}
67 | hc <- hclust(dist(sfd.d))
68 | plot(hc)
69 | ```
70 |
71 | Blimey! What the heck is that thing? As the title says it is a *cluster dendrogram*. Starting at the bottom, each individual item in the dataset is a 'branch' of the diagram. At the distance or 'height' at which a pair were joined into a cluster the branches merge and so on up the diagram. The dendrogram provides a complete picture of the whole clustering process.
72 |
73 | As you can see, even for this relatively small dataset of only 189 observations, the dendrogram is not easy to read. Again, interactive visualization methods can be used to help with this. However another option is to 'cut the dendrogram', specifying either the height value to do it at, or the number of clusters desired. In this case, it looks like 5 is not a bad option, so...
74 |
75 | ```{r}
76 | sfd$hc5 <- cutree(hc, k = 5)
77 | tm_shape(sfd) +
78 | tm_polygons(col = 'hc5', palette = 'Set2', style = "cat") +
79 | tm_legend(legend.outside = TRUE)
80 | ```
81 |
82 | It's good to see that there are clear similarities between this output and the k-means one (at least there were the first time I ran the analysis!)
83 |
84 | As with k-means, there are more details around all of this. Different approaches to calculating distances can be chosen (see `?dist`) and various options for the exact algorithm for merging clusters are available by setting the `method` option in the `hclust` function. The function help is the place to look for more information.
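As a quick (entirely optional) illustration, here is the same data clustered using Manhattan distances and Ward's linkage method; compare the resulting dendrogram with the one above.
```{r}
# Manhattan distances and Ward's linkage instead of the defaults
hc.ward <- hclust(dist(sfd.d, method = "manhattan"), method = "ward.D2")
plot(hc.ward)
```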
85 |
86 | Once clusters have been assigned, we can do further analysis comparing characteristics of different clusters. For example
87 |
88 | ```{r}
89 | boxplot(sfd$Punemployed ~ sfd$hc5, xlab = 'Cluster', ylab = 'Unemployment')
90 | ```
91 |
92 | Or we can aggregate the clusters into single areas and assign them values based on the underlying data of all the member units:
93 |
94 | ```{r}
95 | sfd.c <- aggregate(sfd, by = list(sfd$hc5), mean)
96 | plot(sfd.c, pal = RColorBrewer::brewer.pal(7, "Reds"))
97 | ```
98 |
99 | ## Further reading
100 | In the specific context of demographic data, clustering analysis is often referred to as *geodemographics* and a nice example of this is provided in this paper
101 |
102 | Spielman, S. E., and A. Singleton. 2015. [Studying Neighborhoods Using Uncertain Data from the American Community Survey: A Contextual Approach](http://www.tandfonline.com/doi/full/10.1080/00045608.2015.1052335). Annals of the Association of American Geographers 105 (5):1003–1025.
103 |
104 | which describes a cluster analysis of the hundreds of thousands of census tracts of the United States. Accompanying data is [available here](https://www.openicpsr.org/openicpsr/project/100235/version/V5/view?path=/openicpsr/100235/fcr:versions/V5/Output-Data&type=folder) and can be [visualised here](https://observatory.cartodb.com/editor/5de68840-16ef-11e6-bf4f-0ea31932ec1d/embed), although it's kind of enormous! An interactive map of a classification of London at fine spatial resolution is the [London Open Area Classification](https://maps.cdrc.ac.uk/#/geodemographics/loac11/default/BTTTFFT/10/-0.1500/51.5200/).
105 |
106 | Although geodemographics is a very visible example of cluster-based classification, exactly the same methods are applicable to other kinds of data, such as physical, climate or other variables (and these methods are the basis of remotely sensed imagery classification).
107 |
108 | Classification and clustering is an enormous topic area with numerous different methods available, many of them now falling under the rubric of machine-learning.
109 |
110 | OK... on to the [assignment](05-assignment-multivariate-analysis.md).
111 |
--------------------------------------------------------------------------------
/labs/multivariate-analysis/04-classification-and-clustering.md:
--------------------------------------------------------------------------------
1 | First just make sure we have all the data and libraries we need set up.
2 |
3 | ```{r message = FALSE}
4 | library(sf)
5 | library(tmap)
6 | library(tidyr)
7 | library(dplyr)
8 |
9 | sfd <- st_read('sf_demo.geojson')
10 | sfd <- drop_na(sfd)
11 | sfd.d <- st_drop_geometry(sfd)
12 | ```
13 |
14 | ## Clustering
15 | Whereas dimensional reduction methods focus on the variables in a dataset, clustering methods focus on the observations and the differences and similarities between them. The idea of clustering analysis is to break the dataset into clusters or groups of observations that are similar to one another and different from others in the data.
16 |
17 | There is no easy way to define clusters beyond recognising that clusters are the groups of observations identified by a clustering method! Like PCA, clustering analysis depends a great deal on the interpretation of an analyst.
18 |
19 | What do we mean by 'similar' and 'different'? We extend the basic idea of distance in Euclidean (two dimensional) space, where $d_{ij} = \sqrt{(x_i-x_j)^2+(y_i-y_j)^2}$, that is, the square root of the sum of the squared differences in each coordinate, to higher dimensions. So if we are in 24-dimensional data space, we take the squared differences in each of the 24 dimensions (i.e. on each variable) between two observations, add them together and take the square root. Other versions of this basic idea of 'total difference' in attribute values are possible. An important consideration is that all the attributes should be *rescaled* so that differences in one particular attribute which happens to have large values associated with it don't 'drown out' differences in other variables. A related concern is that we take care not to include lots of strongly correlated variables in the analysis (sometimes clustering is done on principal component scores for this reason).
20 |
21 | ### K-means clustering
22 | One common clustering approach is k-means clustering. The algorithm is pretty simple:
23 |
24 | 1. Decide on the number of clusters you want, call this *k*
25 | 2. Choose *k* cluster centres
26 | 3. Assign each observation to its nearest cluster centre
27 | 4. Calculate the mean centre of each cluster and move the cluster centre accordingly
28 | 5. Go back to 3 and repeat, until the cluster assignments stop changing
29 |
30 | Here's an [illustration of this working](https://kkevsterrr.github.io/K-Means/) to help with following the description above.
31 |
32 | It's important to realise that k-means clustering is non-deterministic, as the choice of initial cluster centres is often random, and can affect the final assignment arrived at.
33 |
34 | So here is how we accomplish this in R.
35 |
36 | ```{r}
37 | km <- kmeans(sfd.d, 7)
38 | sfd$km7 <- as.factor(km$cluster)
39 |
40 | tmap_mode('view')
41 | tm_shape(sfd) +
42 | tm_polygons(col = 'km7') +
43 | tm_legend(legend.outside = T)
44 | ```
45 |
46 | The `kmeans` function does the work, and requires that we decide in advance how many clusters we want (I picked 7 just because... well... SEVEN). We can retrieve the resulting cluster assignments from the output `km` as `km$cluster`, which we convert to a `factor`. The numerical cluster number is meaningless (cluster 1 is not 'less than' cluster 2), so the cluster assignment is properly speaking a factor, and designating it as such will allow `tmap` and other packages to handle it intelligently. We can then add it to the spatial data and map it like any other variable.
47 |
48 | Try changing the number of clusters in the above code.
49 |
50 | The 'quality' of a particular clustering solution is dependent on how well we feel we can interpret it. Measures of the variance within and between clusters can be used to assess quality in a more technical sense and are available from the `kmeans` object produced by the function.
51 |
52 | ### Hierarchical clustering
53 | K-means is quick and very scalable - it can be successfully applied to very large datasets (see for example the [Landcare LENZ data](https://www.landcareresearch.co.nz/resources/maps-satellites/lenz)). It doesn't however provide much clue about the structure of the clusters it produces. Hierarchical methods do a better job of showing us how observations got assembled into the clusters they are in.
54 |
55 | The algorithm in this case looks something like
56 |
57 | 1. Calculate all the (multivariate) distances between every pair of observations
58 | 2. Find the nearest pair of observations and merge them into a cluster
59 | 3. For the newly formed cluster determine its distances from all the remaining observations (and any other clusters)
60 | 4. Go back to 3 and repeat until all cases are in a cluster
61 |
62 | This approach is 'agglomerative' because we start with individual observations. It is possible to proceed in the other direction repeatedly subdividing the dataset into subsets until we get to individual cases, or perhaps until some measure of the cluster quality tells us we can't improve the solution any further. This method is very often used with network data when cluster detection is known as *community detection* (more on that next week).
63 |
64 | In *R*, the necessary functions are provided by the `hclust` function
65 |
66 | ```{R}
67 | hc <- hclust(dist(sfd.d))
68 | plot(hc)
69 | ```
70 |
71 | Blimey! What the heck is that thing? As the title says it is a *cluster dendrogram*. Starting at the bottom, each individual item in the dataset is a 'branch' of the diagram. At the distance or 'height' at which a pair were joined into a cluster the branches merge and so on up the diagram. The dendrogram provides a complete picture of the whole clustering process.
72 |
73 | As you can see, even for this relatively small dataset of only 189 observations, the dendrogram is not easy to read. Again, interactive visualization methods can be used to help with this. However another option is to 'cut the dendrogram', specifying either the height value to do it at, or the number of clusters desired. In this case, it looks like 5 is not a bad option, so...
74 |
75 | ```{r}
76 | sfd$hc5 <- cutree(hc, k = 5)
77 | tm_shape(sfd) +
78 | tm_polygons(col = 'hc5', palette = 'Set2', style = "cat") +
79 | tm_legend(legend.outside = TRUE)
80 | ```
81 |
82 | It's good to see that there are clear similarities between this output and the k-means one (at least there were the first time I ran the analysis!)
83 |
84 | As with k-means, there are more details around all of this. Different approaches to calculating distances can be chosen (see `?dist`) and various options for the exact algorithm for merging clusters are available by setting the `method` option in the `hclust` function. The function help is the place to look for more information.
85 |
86 | Once clusters have been assigned, we can do further analysis comparing characteristics of different clusters. For example
87 |
88 | ```{r}
89 | boxplot(sfd$Punemployed ~ sfd$hc5, xlab = 'Cluster', ylab = 'Unemployment')
90 | ```
91 |
92 | Or we can aggregate the clusters into single areas and assign them values based on the underlying data of all the member units:
93 |
94 | ```{r}
95 | sfd.c <- aggregate(sfd, by = list(sfd$hc5), mean)
96 | plot(sfd.c, pal = RColorBrewer::brewer.pal(7, "Reds"))
97 | ```
98 |
99 | ## Further reading
100 | In the specific context of demographic data, clustering analysis is often referred to as *geodemographics* and a nice example of this is provided in this paper
101 |
102 | Spielman, S. E., and A. Singleton. 2015. [Studying Neighborhoods Using Uncertain Data from the American Community Survey: A Contextual Approach](http://www.tandfonline.com/doi/full/10.1080/00045608.2015.1052335). Annals of the Association of American Geographers 105 (5):1003–1025.
103 |
104 | which describes a cluster analysis of the hundreds of thousands of census tracts of the United States. Accompanying data is [available here](https://www.openicpsr.org/openicpsr/project/100235/version/V5/view?path=/openicpsr/100235/fcr:versions/V5/Output-Data&type=folder) and can be [visualised here](https://observatory.cartodb.com/editor/5de68840-16ef-11e6-bf4f-0ea31932ec1d/embed), although it's kind of enormous! An interactive map of a classification of London at fine spatial resolution is the [London Open Area Classification](https://maps.cdrc.ac.uk/#/geodemographics/loac11/default/BTTTFFT/10/-0.1500/51.5200/).
105 |
106 | Although geodemographics is a very visible example of cluster-based classification, exactly the same methods are applicable to other kinds of data, such as physical, climate or other variables (and these methods are the basis of remotely sensed imagery classification).
107 |
108 | Classification and clustering is an enormous topic area with numerous different methods available, many of them now falling under the rubric of machine-learning.
109 |
110 | OK... on to the [assignment](05-assignment-multivariate-analysis.md).
111 |
--------------------------------------------------------------------------------
/labs/multivariate-analysis/05-assignment-multivariate-analysis.Rmd:
--------------------------------------------------------------------------------
1 | # Assignment 4 Geodemographics in Wellington
2 | I have assembled some [demographic data for Wellington](welly.gpkg) from the 2018 census at the Statistical Area 1 level (the data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/)). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).
3 |
4 | Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever, in other words, work with the multivariate, multidimensional aspect of the data).
5 |
6 | There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!
7 |
8 | When you have a reduced set of variables to work with (but not too reduced... the idea is to demonstrate handling high-dimensional data), then you should also standardise the variables in some way so that they are all scaled to a similar numerical range. You can do this with `mutate` functions. You will probably need to keep the total population column for the standardisation!
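For example, a standardisation step might look something like the sketch below. The column names used are made-up placeholders, not the actual names in `welly.gpkg`, so substitute the variables you actually kept.
```{r}
# hypothetical sketch only: 'total_pop', 'unemployed' and 'no_qualification'
# are placeholder column names -- replace them with your own selections
library(sf)
library(dplyr)

welly <- st_read('welly.gpkg') %>%
  select(total_pop, unemployed, no_qualification) %>%
  mutate(pct_unemployed = 100 * unemployed / total_pop,
         pct_no_qualification = 100 * no_qualification / total_pop)
```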
9 |
10 | Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.
11 |
12 | Include these maps in a report that shows clear evidence of having explored the data using any tools we have seen this week (feel free to include others from earlier weeks if they help!)
13 |
14 | Prepare your report in R Markdown and run it to produce a final output PDF or Word document (I prefer it if you can convert Word to PDF format) for submission (this means I will see your code as well as the outputs!). Avoid any outputs that are just long lists of data, as these are not very informative (you can suppress such output by starting the code chunk that produces it with ```` ```{r, results = FALSE} ````).
15 |
16 | Submit your report to the dropbox provided on Blackboard by **24 May**.
17 |
--------------------------------------------------------------------------------
/labs/multivariate-analysis/05-assignment-multivariate-analysis.md:
--------------------------------------------------------------------------------
1 | # Assignment 4 Geodemographics in Wellington
2 | I have assembled some [demographic data for Wellington](welly.gpkg) from the 2018 census at the Statistical Area 1 level (the data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/)). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).
3 |
4 | Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever, in other words, work with the multivariate, multidimensional aspect of the data).
5 |
6 | There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!
7 |
8 | When you have a reduced set of variables to work with (but not too reduced... the idea is to demonstrate handling high-dimensional data), then you should also standardise the variables in some way so that they are all scaled to a similar numerical range. You can do this with `mutate` functions. You will probably need to keep the total population column for the standardisation!
9 |
10 | Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.
11 |
12 | Include these maps in a report that shows clear evidence of having explored the data using any tools we have seen this week (feel free to include others from earlier weeks if they help!)
13 |
14 | Prepare your report in R Markdown and run it to produce a final output PDF or Word document (I prefer it if you can convert Word to PDF format) for submission (this means I will see your code as well as the outputs!). Avoid any outputs that are just long lists of data, as these are not very informative (you can suppress such output by starting the code chunk that produces it with ```` ```{r, results = FALSE} ````).
15 |
16 | Submit your report to the dropbox provided on Blackboard by **24 May**.
17 |
--------------------------------------------------------------------------------
/labs/multivariate-analysis/README.md:
--------------------------------------------------------------------------------
1 | # Multivariate analysis overview
2 | This week we're looking at several things. You'll find all the necessary materials in this [zip file](multivariate-analysis.zip?raw=true).
3 |
4 | The instructions for each stage of the material are linked below:
5 | * Introducing [the multivariate data problem](01-multivariate-analysis-the-problem.md)
6 | * The [*R* `tidyverse`](02-the-tidyverse.md) and cleaning up messy data
7 | * [Dimensional reduction](03-dimensional-reduction.md)
8 | * [Classification and clustering](04-classification-and-clustering.md)
9 | * [Assignment on multivariate analysis](05-assignment-multivariate-analysis.md)
10 |
--------------------------------------------------------------------------------
/labs/multivariate-analysis/multivariate-analysis.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/multivariate-analysis/multivariate-analysis.zip
--------------------------------------------------------------------------------
/labs/multivariate-analysis/welly.gpkg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/multivariate-analysis/welly.gpkg
--------------------------------------------------------------------------------
/labs/network-analysis/README.md:
--------------------------------------------------------------------------------
1 | # Network analysis overview
2 | This week simply download [this zip file](network-analysis.zip?raw=true) and unpack it to a local folder, then follow the [instructions here](network-analysis.md).
3 |
--------------------------------------------------------------------------------
/labs/network-analysis/network-analysis.Rmd:
--------------------------------------------------------------------------------
1 | # Network analysis
2 | In this session we explore some of the tools offered by the `igraph` package in R for network analysis. These treat networks from a network science perspective, rather than from a more transportation-oriented perspective.
3 |
4 | ## Libraries
5 | As usual, we need some libraries
6 |
7 | ```{r}
8 | library(tmap)
9 | library(sf)
10 | library(igraph) # you'll need to install this
11 | library(tidygraph) # you'll need to install this
12 | library(ggraph) # you'll need to install this
13 | ```
14 |
15 | ## Data
16 | Also, some data. You'll find what follows in [this zip file](network-analysis.zip?raw=true).
17 |
18 | I've made both a 'geospatial' and a 'network-oriented' version of the data.
19 |
20 | Here are the geospatial layers.
21 |
22 | ```{r}
23 | intersections <- read_sf('network/nodes/nodes.shp')
24 | road_segments <- read_sf('network/edges/edges.shp')
25 | ```
26 |
27 | It is important to look at the nodes and edges in their geographical context.
28 |
29 | ```{r}
30 | tmap_mode('view')
31 | tm_shape(road_segments) +
32 | tm_lines(col='orange') +
33 | tm_shape(intersections) +
34 | tm_dots(col='red', size=0.005)
35 | ```
36 |
37 | Here you see how the network representation is concerned with both nodes (or vertices) and the connections (or edges) between them.
38 |
39 | The graph version of these data is in a single `graphml` file, which we load using the `igraph` function `read_graph`.
40 |
41 | ```{r}
42 | G <- read_graph('network/network.graphml', format='graphml')
43 | ```
44 |
45 | This file (which you can examine in a text editor) is in the `graphML` format, and includes information about both nodes and edges in a single combined file. This is a relatively commonly used format for exchanging network data among programs that perform graph analysis, such as [*Gephi*](https://gephi.org/).
46 |
47 | I built all three of these datasets using the [excellent `osmnx`
48 | package](https://github.com/gboeing/osmnx) developed by Geoff Boeing.
49 |
50 | ## Examining the graph
51 | Now... graphs are rather complicated things. They effectively consist of two sets of entities, the nodes and the edges. Further, each edge consists of two nodes that it connects. To see this do
52 |
53 | ```{r}
54 | G
55 | ```
56 |
57 | What you are looking at is a rather concise summary of the graph object. It has 1114 nodes and 2471 edges. There are a number of vertex attributes (tagged v/c) such as lat, lon, y, x, and also edge attributes (tagged e/c) including a geometry. Unfortunately, the `igraph` package is not very intuitive to work with.
58 |
59 | A more recently released package, `tidygraph`, helps us out here by allowing us to see the `igraph` object in a more 'tabular' way.
60 |
61 | ```{r}
62 | G <- as_tbl_graph(G)
63 | G
64 | ```
65 |
66 | This is a bit more readable, and helps us to see what we are dealing with.
67 |
68 | Better yet would be a drawing! This turns out to be somewhat fiddly also. But here is an example.
69 |
70 | ```{r}
71 | ggraph(G, layout='igraph', algorithm='nicely') +
72 | geom_node_point(colour='red', size=0.5) +
73 | geom_edge_link(colour='grey', width=0.5) +
74 | coord_equal() +
75 | theme_graph()
76 | ```
77 |
78 | What the heck is that?! It's actually the same network that we saw in the map view above, but with the road segments joining nodes represented now as straight lines. This is the essence of the street network connectivity, without the complication of all the twists and turns of the roads themselves. This may not impress you very much, but once we allow ourselves to ignore the geographical detail, we can start to see the structure of the network more clearly. One way to do this is with different *graph drawing algorithms*.
79 |
80 | One example is the multidimensional scaling algorithm, which attempts to place nodes so that their positions relate to their distances from one another in network space.
81 |
82 | ```{r}
83 | ggraph(G, layout='igraph', algorithm='mds') +
84 | geom_edge_link(colour='grey', width=0.5) +
85 | geom_node_point(colour='red', size=0.5) +
86 | coord_equal() +
87 | theme_graph()
88 | ```
89 |
90 | ## Analysing aspects of network structure
91 | Redrawing the graph is not very useful unless you really understand its structure. There are a number of categories of graph analysis method that can help us with this.
92 |
93 | ### Centrality measures
94 | The centrality of a node or edge in a graph is a measure of its importance in the network structure in some sense. This can be as simple as how many other nodes a node is connected to (more is more central), although this tends not to be very interesting in a road network.
95 |
96 | ```{r}
97 | intersections$centrality <- degree(G, mode='in')
98 | tm_shape(road_segments) +
99 | tm_lines() +
100 | tm_shape(intersections) +
101 | tm_dots(col='centrality', style='cat')
102 | ```
103 |
104 | A more interesting option is *betweenness centrality*. This determines the centrality of nodes based on how often they are found on the shortest paths in the network from every location to every other location. This approach often highlights choke points in a network.
105 |
106 | ```{r}
107 | intersections$centrality <- betweenness(G)
108 | tm_shape(road_segments) +
109 | tm_lines() +
110 | tm_shape(intersections) +
111 | tm_dots(col='centrality')
112 | ```
113 |
114 | Additional methods are `closeness`, which determines on average which nodes are closest to all others, and `page_rank`, which uses a complex matrix analysis method, identical to Google's PageRank algorithm, based on random walks on the network (for this one, you have to use `centrality <- page_rank(G)$vector` to extract the values).
115 |
116 | Give those two a try (there are many more!) to get a feel for things.
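As a starting point, a minimal sketch for the first of these (the others follow the same pattern, remembering the `$vector` wrinkle for `page_rank`):
```{r}
# closeness centrality mapped in the same way as the measures above
intersections$centrality <- closeness(G)
tm_shape(road_segments) +
  tm_lines() +
  tm_shape(intersections) +
  tm_dots(col = 'centrality')
```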
117 |
118 | ### Community detection
119 | The connectivity structure of a network may mean that there are distinct regions within it that are relatively cut off from one another while being well connected internally. In network science these regions are known as *communities* and many algorithms have been developed to perform community detection (this is analogous to cluster detection in multivariate data). Many of these only work on *undirected* graphs, so before trying them we will open an undirected version of the graph we have been looking at.
120 |
121 | ```{r}
122 | UG <- read_graph('network/network_.graphml', format='graphml')
123 | intersections$community <- as.factor(membership(cluster_louvain(UG)))
124 | tm_shape(road_segments) +
125 | tm_lines() +
126 | tm_shape(intersections) +
127 | tm_dots(col='community', style='cat')
128 | ```
129 |
130 | In a relatively well connected network, it is surprising that community detection can work at all (since everywhere is pretty much connected to everywhere else!) but even so the groupings identified by this method are interesting to explore and relate to our understanding of the geography of central Wellington. There are quite a few other methods available, but my experiments suggest that for these data only `cluster_louvain`, `cluster_spinglass` and `cluster_edge_betweenness` have much luck. Give each a try, and see what you think. Is there any way to determine which partition is the most 'correct'?
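One way to compare them, sketched below, is to look at the number and size of the communities each method finds, and at a similarity score between the two partitions (note that the edge betweenness method can take a while on a network this size).
```{r}
# community sizes from two of the methods
comm_l <- cluster_louvain(UG)
comm_eb <- cluster_edge_betweenness(UG)
sizes(comm_l)
sizes(comm_eb)
# normalised mutual information: 1 means the two partitions are identical
igraph::compare(comm_l, comm_eb, method = "nmi")
```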
131 |
132 | ## Putting the communities back into 'network space'
133 | Having determined some communities of possible interest, it may be instructive to visualize these in the network space, determined by network structure.
134 |
135 | ```{r}
136 | ggraph(G, layout='igraph', algorithm='mds') +
137 | geom_edge_link(colour='grey', width=0.5) +
138 | geom_node_point(aes(colour=intersections$community)) +
139 | coord_equal() +
140 | theme_graph()
141 | ```
142 |
143 | ### Conclusion
144 | Hopefully this has given you a taste of how different the world can look when we start to consider not just simple Euclidean distances among things, but to consider their connectivity in other ways. It is worth emphasizing that a street network (even in Wellington!) is not very different to an open space in terms of the connection it affords between places. When we consider much less geographically coherent networks (such as airlines, Internet, etc.) then the 'space' opened up for exploration can be very different indeed.
145 |
--------------------------------------------------------------------------------
/labs/network-analysis/network-analysis.md:
--------------------------------------------------------------------------------
1 | # Network analysis
2 | In this session we explore some of the tools offered by the `igraph` package in R for network analysis. These treat networks from a network science perspective, rather than from a more transportation-oriented perspective.
3 |
4 | ## Libraries
5 | As usual, we need some libraries
6 |
7 | ```{r}
8 | library(tmap)
9 | library(sf)
10 | library(igraph) # you'll need to install this
11 | library(tidygraph) # you'll need to install this
12 | library(ggraph) # you'll need to install this
13 | ```
14 |
15 | ## Data
16 | Also, some data. You'll find what follows in [this zip file](network-analysis.zip?raw=true).
17 |
18 | I've made both a 'geospatial' and a 'network-oriented' version of the data.
19 |
20 | Here are the geospatial layers.
21 |
22 | ```{r}
23 | intersections <- read_sf('network/nodes/nodes.shp')
24 | road_segments <- read_sf('network/edges/edges.shp')
25 | ```
26 |
27 | It is important to look at the nodes and edges in their geographical context.
28 |
29 | ```{r}
30 | tmap_mode('view')
31 | tm_shape(road_segments) +
32 | tm_lines(col='orange') +
33 | tm_shape(intersections) +
34 | tm_dots(col='red', size=0.005)
35 | ```
36 |
37 | Here you see how the network representation is concerned with both nodes (or vertices) and the connections (or edges) between them.
38 |
39 | The graph version of these data is in a single `graphml` file, which we load using the `igraph` function `read_graph`.
40 |
41 | ```{r}
42 | G <- read_graph('network/network.graphml', format='graphml')
43 | ```
44 |
45 | This file (which you can examine in a text editor) is in the `graphML` format, and includes information about both nodes and edges in a single combined file. This is a relatively commonly used format for exchanging network data among programs that perform graph analysis, such as [*Gephi*](https://gephi.org/).
46 |
47 | I built all three of these datasets using the [excellent `osmnx`
48 | package](https://github.com/gboeing/osmnx) developed by Geoff Boeing.
49 |
50 | ## Examining the graph
51 | Now... graphs are rather complicated things. They effectively consist of two sets of entities, the nodes and the edges. Further, each edge consists of two nodes that it connects. To see this do
52 |
53 | ```{r}
54 | G
55 | ```
56 |
57 | What you are looking at is a rather concise summary of the graph object. It has 1114 nodes and 2471 edges. There are a number of vertex attributes (tagged v/c) such as lat, lon, y, x, and also edge attributes (tagged e/c) including a geometry. Unfortunately, the `igraph` package is not very intuitive to work with.
58 |
59 | A more recently released package, `tidygraph`, helps us out here by allowing us to see the `igraph` object in a more 'tabular' way.
60 |
61 | ```{r}
62 | G <- as_tbl_graph(G)
63 | G
64 | ```
65 |
66 | This is a bit more readable, and helps us to see what we are dealing with.
67 |
68 | Better yet would be a drawing! This turns out to be somewhat fiddly also. But here is an example.
69 |
70 | ```{r}
71 | ggraph(G, layout='igraph', algorithm='nicely') +
72 | geom_node_point(colour='red', size=0.5) +
73 | geom_edge_link(colour='grey', width=0.5) +
74 | coord_equal() +
75 | theme_graph()
76 | ```
77 |
78 | What the heck is that?! It's actually the same network that we saw in the map view above, but with the road segments joining nodes represented now as straight lines. This is the essence of the street network connectivity, without the complication of all the twists and turns of the roads themselves. This may not impress you very much, but once we allow ourselves to ignore the geographical detail, we can start to see the structure of the network more clearly. One way to do this is with different *graph drawing algorithms*.
79 |
80 | One example is the multidimensional scaling algorithm, which attempts to place nodes so that their positions relate to their distances from one another in network space.
81 |
82 | ```{r}
83 | ggraph(G, layout='igraph', algorithm='mds') +
84 | geom_edge_link(colour='grey', width=0.5) +
85 | geom_node_point(colour='red', size=0.5) +
86 | coord_equal() +
87 | theme_graph()
88 | ```
89 |
90 | ## Analysing aspects of network structure
91 | Redrawing the graph is not very useful unless you really understand its structure. There are a number of categories of graph analysis method that can help us with this.
92 |
93 | ### Centrality measures
94 | The centrality of a node or edge in a graph is a measure of its importance in the network structure in some sense. This can be as simple as how many other nodes a node is connected to (more is more central), although this tends not to be very interesting in a road network.
95 |
96 | ```{r}
97 | intersections$centrality <- degree(G, mode='in')
98 | tm_shape(road_segments) +
99 | tm_lines() +
100 | tm_shape(intersections) +
101 | tm_dots(col='centrality', style='cat')
102 | ```
103 |
104 | A more interesting option is *betweenness centrality*. This determines the centrality of nodes based on how often they are found on the shortest paths in the network from every location to every other location. This approach often highlights choke points in a network.
105 |
106 | ```{r}
107 | intersections$centrality <- betweenness(G)
108 | tm_shape(road_segments) +
109 | tm_lines() +
110 | tm_shape(intersections) +
111 | tm_dots(col='centrality')
112 | ```
113 |
114 | Additional methods are `closeness`, which determines on average which nodes are closest to all others, and `page_rank`, which uses a complex matrix analysis method, identical to Google's PageRank algorithm, based on random walks on the network (for this one, you have to use `centrality <- page_rank(G)$vector` to extract the values).
115 |
116 | Give those two a try (there are many more!) to get a feel for things.
117 |
118 | ### Community detection
119 | The connectivity structure of a network may mean that there are distinct regions within it that are relatively cut off from one another while being well connected internally. In network science these regions are known as *communities* and many algorithms have been developed to perform community detection (this is analogous to cluster detection in multivariate data). Many of these only work on *undirected* graphs, so before trying them we will open an undirected version of the graph we have been looking at.
120 |
121 | ```{r}
122 | UG <- read_graph('network/network_.graphml', format='graphml')
123 | intersections$community <- as.factor(membership(cluster_louvain(UG)))
124 | tm_shape(road_segments) +
125 | tm_lines() +
126 | tm_shape(intersections) +
127 | tm_dots(col='community', style='cat')
128 | ```
129 |
130 | In a relatively well connected network, it is surprising that community detection can work at all (since everywhere is pretty much connected to everywhere else!) but even so the groupings identified by this method are interesting to explore and relate to our understanding of the geography of central Wellington. There are quite a few other methods available, but my experiments suggest that for these data only `cluster_louvain`, `cluster_spinglass` and `cluster_edge_betweenness` have much luck. Give each a try, and see what you think. Is there any way to determine which partition is the most 'correct'?
131 |
132 | ## Putting the communities back into 'network space'
133 | Having determined some communities of possible interest, it may be instructive to visualize these in the network space, determined by network structure.
134 |
135 | ```{r}
136 | ggraph(G, layout='igraph', algorithm='mds') +
137 | geom_edge_link(colour='grey', width=0.5) +
138 | geom_node_point(aes(colour=intersections$community)) +
139 | coord_equal() +
140 | theme_graph()
141 | ```
142 |
143 | ### Conclusion
144 | Hopefully this has given you a taste of how different the world can look when we start to consider not just simple Euclidean distances among things, but to consider their connectivity in other ways. It is worth emphasizing that a street network (even in Wellington!) is not very different to an open space in terms of the connection it affords between places. When we consider much less geographically coherent networks (such as airlines, Internet, etc.) then the 'space' opened up for exploration can be very different indeed.
145 |
--------------------------------------------------------------------------------
/labs/network-analysis/network-analysis.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/network-analysis/network-analysis.zip
--------------------------------------------------------------------------------
/labs/point-pattern-analysis/02-ppa-with-real-data.md:
--------------------------------------------------------------------------------
1 | # Point pattern analysis with real data
2 |
3 | ```{r}
4 | # Reload some libraries in case you are restarting partway through
5 | library(spatstat)
6 | library(RColorBrewer)
7 | # also store the initial graphic options in a variable
8 | # so we can reset them at any time
9 | pardefaults <- par()
10 | ```
11 |
12 | Now we need to read two data files, one with the point events and one with the study area. These are [here](ak-tb-cases.geojson?raw=true) and [here](ak-tb.geojson?raw=true). Load the `sf` library for handling spatial data.
13 |
14 | ```{r}
15 | library(sf)
16 | ```
17 |
18 | and use it to load the data:
19 |
20 | ```{r}
21 | ak <- st_read("ak-tb.geojson")
22 | tb <- st_read("ak-tb-cases.geojson")
23 | ```
24 |
25 | Check that things line up OK by mapping them using `tmap`.
26 |
27 | ```{r}
28 | library(tmap)
29 |
30 | tm_shape(ak) +
31 | tm_polygons() +
32 | tm_shape(tb) +
33 | tm_dots()
34 | ```
35 |
36 | ## Reprojecting the data
37 |
38 | For PPA we really need to be working in a projected coordinate system that uses real spatial units (like metres, even better kilometres), not latitude-longitude.
39 |
40 | ```{r}
41 | st_crs(ak)
42 | st_crs(tb)
43 | ```
44 |
45 | By now you should recognise these as 'unprojected' lat-lon, which is no good to us. We should instead use the New Zealand Transverse Mercator. We get the proj4 string for this from an appropriate source, such as [epsg.io/2193](https://epsg.io/2193), and use it to transform the two layers. I have modified the projection to make the units km (the `units` setting) rather than metres as this seems to have a dramatic effect on how well `spatstat` runs. It also makes it easier to interpret results meaningfully.
46 |
47 | ```{r}
48 | nztm <- '+proj=tmerc +lat_0=0 +lon_0=173 +k=0.9996 +x_0=1600 +y_0=10000 +datum=WGS84 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=km'
49 |
50 | ak <- st_transform(ak, nztm)
51 | tb <- st_transform(tb, nztm)
52 | ```
53 |
54 | Now we should check that things still line up OK.
55 |
56 | ```{r}
57 | tm_shape(ak) +
58 | tm_polygons() +
59 | tm_shape(tb) +
60 | tm_dots()
61 | ```
62 |
63 | ## Converting spatial data to `spatstat` data
64 |
65 | OK. So much for the spatial data. `spatstat` works in its own little world and performs PPA on `ppp` objects, which have two components, a set of (*x*, *y*) points, and a study area or 'window'.
66 |
67 | This is quite a fiddly business, which never seems to get any easier (every time I do it, I have to look it up in help). We need the conversion function `as_Spatial` from the `sf` package to convert from `sf` objects to `sp` objects, and then some more functions from the `maptools` package to get from those to `spatstat` `ppp` objects. We will also use a (short) `dplyr` pipeline to do the conversion. So first load `maptools` and `dplyr`:
68 |
69 | ```{r}
70 | library(maptools)
71 | library(dplyr)
72 | ```
73 |
74 | If `maptools` doesn't load, then make sure it is installed and try again. Now we use `as.ppp` to make a point pattern from the geometry of the `tb` data set.
75 |
76 | ```{r}
77 | tb.pp <- tb$geometry %>% # the geometry is all we need
78 | as("Spatial") %>% # this converts to sp geometry
79 | as.ppp() # this is a maptools function to convert to spatstat ppp
80 | ```
81 |
82 | and plot it to take a look:
83 |
84 | ```{r}
85 | plot(tb.pp)
86 | ```
87 |
88 | That's better than nothing, but ideally we want to use the land area for the study area. Again, we use a conversion function from `maptools`.
89 |
90 | ```{r}
91 | tb.pp$window <- ak %>%
92 | st_union() %>% # combine all the polygons into a single shape
93 | as("Spatial") %>% # convert to sp
94 | as.owin() # convert to spatstat owin
95 | ```
96 |
97 | Now let's take a look:
98 |
99 | ```{r}
100 | plot(density(tb.pp))
101 | plot(tb.pp, add = T)
102 | ```
103 |
104 | Finally!
105 |
106 | If we were doing a lot of this kind of analysis starting with lat-lon datasets, then we would build a function from some of the elements above to automate all these steps. In this case, since we are only running the analysis once for this lab, I will leave that as an exercise for the reader...
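If you do want to try that, a rough (untested) sketch of such a helper might look like the following; the function name `sf_to_ppp` is just made up here, and it assumes the same packages loaded above.
```{r}
# wrap the conversion steps above into one reusable function:
# points and polygons are sf layers (e.g. in lat-lon), crs is the target projection
sf_to_ppp <- function(points, polygons, crs) {
  pp <- points %>%
    st_transform(crs) %>%
    st_geometry() %>%     # keep only the geometry
    as("Spatial") %>%     # convert to sp
    as.ppp()              # convert to spatstat ppp
  pp$window <- polygons %>%
    st_transform(crs) %>%
    st_union() %>%        # single polygon for the study area
    as("Spatial") %>%
    as.owin()             # convert to spatstat owin
  pp
}

# which would reduce everything above to a single line
# tb.pp <- sf_to_ppp(tb, ak, nztm)
```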
107 |
108 | [**back to PPA with `spatstat`**](01-ppa-in-spatstat.md) \| [**on to assignment**](03-assignment-instructions.md)
109 |
--------------------------------------------------------------------------------
/labs/point-pattern-analysis/03-assignment-instructions.md:
--------------------------------------------------------------------------------
1 | # **Assignment 1: A point pattern analysis of the Auckland TB data**
2 | Now you have a point pattern dataset to work with (`tb.pp`), you can perform point pattern analysis with it. You may need to [go back to the previous set of instructions](02-ppa-with-real-data.md) to get to this point.
3 |
4 | ## Assignment deliverables
5 |
6 | The submission deadline is **4 September**. Preferably do the analysis in an R Markdown file and output things to a PDF (you might have to go via Word to accomplish this)
7 |
8 | You can assemble materials in a word processor (export images from *RStudio* to PNG format) and produce a PDF report as an alternative, but I recommend the R Markdown approach.
9 |
10 | Avoid showing lots of irrelevant output with options like `message = FALSE` in the code chunk headers. You can control these using the options in the 'gearwheel' button at the top of the files part of _RStudio_.
11 |
12 | Include your name in the submitted filename for ease of identification. You should not need to write more than about 500 words, although your report should include a number of figures. Note that quality of cartography is not an important aspect of this assignment.
13 |
14 | You need to do three things:
15 |
16 | ### First (25%)
17 |
18 | Present kernel density surfaces of the tuberculosis data.
19 |
20 | **You should present three different density surfaces, and explain which are likely to be the most useful in different contexts.**
21 |
22 | Explain what bandwidths you have selected, and the basis for your choice. (Remember that the distance units are km.) Keep in mind that there is no absolute right answer, and that the values suggested by the available bandwidth selection functions (see the section about kernel density) are only suggestions. You might, for example, want to make selections close to these but rounded to more readily understood values.
23 |
24 | You will need to present maps of the density surfaces.
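As a reminder of the mechanics (this is only a sketch, and the bandwidth values shown are not a recommendation):

```r
plot(density(tb.pp, sigma = 1))  # kernel density with a 1 km bandwidth
plot(density(tb.pp, sigma = 2))  # ... and with a 2 km bandwidth
bw.diggle(tb.pp)                 # one of spatstat's automated suggestions
```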
25 |
26 | ### Second (50%)
27 |
28 | Conduct a point pattern analysis of the data and report the results.
29 |
30 | You may use whichever of the available methods (quadrats, mean nearest-neighbour distance, *G*, *F*, Ripley's *K*, the pair correlation function) you find useful, and feel comfortable explaining and interpreting. You should use at least one of the simulation envelope based methods.
31 |
32 | You will need to present graphs or other output on which your analysis is based.
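For instance, a simulation envelope for one of the functions can be produced along these lines (a sketch only; the choice of function, number of simulations, and any edge corrections is up to you):

```r
g.env <- envelope(tb.pp, Gest, nsim = 99)  # 99 simulations of complete spatial randomness
plot(g.env)
```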
33 |
34 | ### Third (25%)
35 |
36 | Comment on what the principal drivers of the tuberculosis incidents might be. Consider how you might run point pattern analysis to take account of such factors. What might be a better null model for the occurrence of incidents in this case than the default of complete spatial randomness?
37 |
38 | [**back to overview**](README.md)
39 |
--------------------------------------------------------------------------------
/labs/point-pattern-analysis/README.md:
--------------------------------------------------------------------------------
1 | # Point pattern analysis overview
2 | This week simply download [this zip file](point-pattern-analysis.zip?raw=true) and unpack it to a local folder, then follow the instructions in the files below. Unlike previous weeks, *the last of these is an assignment!*
3 |
4 | - [How to do point pattern analysis](01-ppa-in-spatstat.md)
5 | - [PPA with real data](02-ppa-with-real-data.md)
6 | - [The assignment instructions](03-assignment-instructions.md)
7 |
--------------------------------------------------------------------------------
/labs/point-pattern-analysis/point-pattern-analysis.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/point-pattern-analysis/point-pattern-analysis.zip
--------------------------------------------------------------------------------
/labs/spatial-autocorrelation/README.md:
--------------------------------------------------------------------------------
1 | # Spatial autocorrelation overview
2 | This week simply download [this zip file](spatial-autocorrelation.zip?raw=true) and unpack it to a local folder, then follow the [instructions here](assignment-spatial-autocorrelation.md).
3 |
--------------------------------------------------------------------------------
/labs/spatial-autocorrelation/akregion-tb-06.gpkg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/spatial-autocorrelation/akregion-tb-06.gpkg
--------------------------------------------------------------------------------
/labs/spatial-autocorrelation/moran_plots.R:
--------------------------------------------------------------------------------
1 | # Functions to make Moran map plots
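# Assumes the spdep (for localmoran) and tmap packages have already been loaded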
2 |
3 |
4 | # Determines Moran cluster quadrants of results
5 | # from the localmoran function applied to the specified variable
6 | # under the specified set of weights at the specified significance level
7 | # 1 = High - High
8 | # 2 = Low - Low
9 | # 3 = High - Low
10 | # 4 = Low - High
11 | # 5 = Non-significant
12 | # result is returned as a vector in the order of the data supplied
13 | moran_quadrant <- function(layer, variable, w, sig = 0.01) {
14 | var <- layer[[variable]]
15 | v <- as.vector(scale(var))
16 |
17 | locm <- localmoran(v, w, alternative = 'two.sided', zero.policy = T)
18 | lv <- locm[,1]
19 | p <- locm[,5]
20 | quad <- rep(5, length(v))
21 | significant <- p < sig | p > (1-sig)
22 | quad[v > 0 & lv > 0 & significant] = 1
23 | quad[v <= 0 & lv > 0 & significant] = 2
24 | quad[v > 0 & lv <= 0 & significant] = 3
25 | quad[v <= 0 & lv <= 0 & significant] = 4
26 | return (quad)
27 | }
28 |
29 | # Produces and returns a map coloured as follows, at significance level provided
30 | # Red: High - High
31 | # Blue: Low - Low
32 | # Pink: High - Low
33 | # Light blue: Low - High
34 | # White: Non-significant
35 | moran_cluster_map <- function(layer, variable, w, sig = 0.01) {
36 | layer['q'] <- moran_quadrant(layer, variable, w, sig = sig)
37 | m <- tm_shape(layer) +
38 | tm_layout(title = "Moran cluster map", legend.position = c(.45,.75)) +
39 | tm_fill(col = 'q', breaks = 0:5 + 0.5, palette = c('red', 'blue', 'pink', 'lightblue', 'white'),
40 | labels = c('High - High', 'Low - Low', 'High - Low', 'Low - High', 'Non-significant'),
41 | title = paste("Moran quadrant, p<", sig, " level", sep = "")) +
42 | tm_borders(lwd = 0.5)
43 | return (m)
44 | }
45 |
46 |
47 | # Produces a map showing significance levels of the local Moran's I
48 | moran_significance_map <- function(layer, variable, w) {
49 | localm <- localmoran(layer[[variable]], w, alternative = 'two.sided', zero.policy = T)
50 | layer['pr'] <- as.vector(localm[,5])
51 | layer$pr <- pmin(layer$pr, 1-layer$pr)
52 | m <- tm_shape(layer) +
53 | tm_layout(title = "LISA Significance map", legend.position = c(.45,.72)) +
54 | tm_fill(col = 'pr', breaks = c(0, 0.001, 0.01, 0.05, 1), palette = "-Greens",
55 | title = "Significance level") +
56 | tm_borders(col = 'gray', lwd = 0.5)
57 | return (m)
58 | }
59 |
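# Example usage (a sketch only; the layer, variable, and weights names
# below are hypothetical):
#   library(spdep)
#   library(tmap)
#   source("moran_plots.R")
#   nb  <- poly2nb(ak)                       # contiguity neighbours of the polygons in `ak`
#   wts <- nb2listw(nb, zero.policy = TRUE)  # row-standardised spatial weights
#   moran_cluster_map(ak, "TB_RATE", wts, sig = 0.05)
#   moran_significance_map(ak, "TB_RATE", wts)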
--------------------------------------------------------------------------------
/labs/spatial-autocorrelation/spatial-autocorrelation.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/spatial-autocorrelation/spatial-autocorrelation.zip
--------------------------------------------------------------------------------
/labs/statistical-models/README.md:
--------------------------------------------------------------------------------
1 | # Statistical models
2 | Some slides from another class to look at this week:
3 |
4 | + [From overlay to regression](https://dosull.github.io/Geog315/slides/from-overlay-to-regression/)
5 | + [Introduction to regression](https://dosull.github.io/Geog315/slides/regression/)
6 | + [More on regression](https://dosull.github.io/Geog315/slides/more-on-regression/)
7 |
8 | And then [a notebook to explore](statistical-models.md). If you [download the zip file](statistical-models.zip?raw=true) you'll find the data and the R Markdown version of the notebook in there too.
9 |
--------------------------------------------------------------------------------
/labs/statistical-models/layers/age.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/age.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/deficit.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/deficit.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/dem.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/dem.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/mas.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/mas.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/mat.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/mat.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/r2pet.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/r2pet.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/rain.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/rain.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/slope.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/slope.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/sseas.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/sseas.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/tseas.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/tseas.tif
--------------------------------------------------------------------------------
/labs/statistical-models/layers/vpd.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/layers/vpd.tif
--------------------------------------------------------------------------------
/labs/statistical-models/nz35-pa.gpkg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/nz35-pa.gpkg
--------------------------------------------------------------------------------
/labs/statistical-models/statistical-models.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DOSull/Spatial-Analysis-and-Modelling/46d05149d8e2c7bffc9f76471ef6266ddb800aa4/labs/statistical-models/statistical-models.zip
--------------------------------------------------------------------------------
/report/README.md:
--------------------------------------------------------------------------------
1 | # Review of spatial analysis written report
2 | This document sets out my expectations of the report you are required to write on spatial analysis as it has been applied in some field of interest. I am happy to meet to discuss your plans and thinking; make an appointment [here](https://calendly.com/dosullivan).
3 |
4 | ## Scope of the report
5 | The final written report should be 2,000 to 3,000 words in length (inclusive of references).
6 |
7 | The goal is to demonstrate your ability to critically evaluate the usefulness, potential, and limitations of spatial analysis techniques in a particular subfield of interest.
8 |
9 | ## Choice and scope of area of interest
10 | I like to imagine you are undertaking graduate study at some level because you are interested in subject areas in and around geographical information science and its fields of application. So the first thing to say about choosing an area of interest is that it should be one that genuinely interests you (the clue is in the name: *area of __interest__*).
11 |
12 | Relatedly, if you are considering further study after this year (i.e. a Masters thesis), then this is an opportunity to do some wider reading in that area (or perhaps to find out that it's not something that interests you *enough*).
13 |
14 | A couple of things to start:
15 |
16 | + Geography or earth science or ecology or similar would be much too broad an area. There are literally *thousands* of articles deploying spatial analysis methods in each of these topic areas, and you can't hope to do them justice in a meaningful way.
17 | + Something more specific, though still fairly broad, such as health geography, historical geography, exploration geology, or plant ecology, will work fine. However, that's really just a starting point, an initial filter.
18 |
19 | Having selected a general area you will need to do some digging, to identify useful general resources in the topic area. Some of the handbooks and encyclopedias of geography and/or GIScience that have appeared in recent years should be helpful here. The library has many of these in online form. I can probably also provide some pointers in some topic areas (if I don't know much about a particular topic area, you will see me google the topic, so please only come to talk to me about your topic area *after* you have at least done that and come up with some leads for us to discuss).
20 |
21 | You will almost certainly find that top level readings (in encyclopedias or textbooks) will direct your interest and attention to further refine your focus—perhaps emphasising only a narrow range of methods in the field, or taking a look at applications of spatial analysis in that topic area in New Zealand. Try to keep things focused, rather than including lots of disparate and only loosely related materials.
22 |
23 | ## Expectation of number of references
24 | How many references are required to accomplish this task will depend on the topic area and the nature of the materials used (whether research papers, books, unpublished reports and studies, student theses and dissertations, etc.).
25 |
26 | However, it would be surprising if you could meaningfully evaluate the application of spatial analysis in a subfield drawing on fewer than 15 items. Equally, the challenge is not to assemble a vast number of reference items (a quick trip to Google Scholar will do that for you), because then you will not be able to give each of them the attention it deserves. The point is not the number of references you use, but how you evaluate and assemble them into an overall picture of the usefulness (or not) of spatial analysis in your chosen focus area.
27 |
28 | ## Overall outline of report
29 | I'm not providing one of these. Each report is likely to be different, and I think you've written enough essays and project reports and so on over the years to be trusted to put together your own report structure.
30 |
--------------------------------------------------------------------------------