├── README.md ├── _custom_notebook.py ├── data ├── cleaned │ ├── gpkg-offshore_wells_2011_UTM20_NAD83.gpkg │ ├── offshore_wells_2011_Geographic_NAD27.cpg │ ├── offshore_wells_2011_Geographic_NAD27.dbf │ ├── offshore_wells_2011_Geographic_NAD27.prj │ ├── offshore_wells_2011_Geographic_NAD27.shp │ ├── offshore_wells_2011_Geographic_NAD27.shx │ ├── offshore_wells_2011_UTM20_NAD83.cpg │ ├── offshore_wells_2011_UTM20_NAD83.dbf │ ├── offshore_wells_2011_UTM20_NAD83.prj │ ├── offshore_wells_2011_UTM20_NAD83.shp │ └── offshore_wells_2011_UTM20_NAD83.shx ├── geojson-offshore_wells_Geographic_NAD27.geojson ├── ne_10m_admin_1_states_provinces.zip ├── ne_10m_rivers_lake_centerlines_trimmed.zip ├── offshore_wells_2011_UTM20_NAD83.cpg ├── offshore_wells_2011_UTM20_NAD83.dbf ├── offshore_wells_2011_UTM20_NAD83.prj ├── offshore_wells_2011_UTM20_NAD83.shp ├── offshore_wells_2011_UTM20_NAD83.shx ├── south_africa_places.gpkg ├── south_africa_runways.gpkg ├── tanzania_mines.zip ├── za_province_employment.csv └── zwe_mines_curated_all_opendata_p_ipis.csv ├── environment.yml ├── introduction.md ├── master ├── 01-geopandas-cleaning_data-master.ipynb ├── 02-geopandas-io_and_plotting-master.ipynb ├── 03-geopandas-processing_vector_data-master.ipynb ├── 04-geopandas-joins_and_spatial_joins-master.ipynb └── 05-geopandas-basemaps_with_contextily-master.ipynb └── notebooks ├── 01-geopandas-cleaning_data.ipynb ├── 02-geopandas-io_and_plotting.ipynb ├── 03-geopandas-processing_vector_data.ipynb ├── 04-geopandas-joins_and_spatial_joins.ipynb └── 05-geopandas-basemaps_with_contextily.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Tutorial: Spatial Analysis in Python 2 | 3 | DATE: Friday, June 12 2020 - 08:00 UTC 4 | 5 | AUDIENCE: Intermediate 6 | 7 | INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/) 8 | 9 | Part of the [SoftwareUndergrounds](https://swu.ng) online conference, 
[TRANSFORM2020](https://transform2020.sched.com/). 10 | 11 | ### Video Stream: 12 | Direct link to the live stream. 13 | 14 | 15 | ## Welcome 16 | Welcome to a brief tutorial on using Python for spatial analysis. This is intended for people who are more or less comfortable with Python, in particular pandas and matplotlib. Spatial analysis is a broad field, and it is almost certainly not possible to cover everything here. This tutorial will focus on vector data (points, lines, and polygons). 17 | 18 | As a short note, both ArcGIS and QGIS have their own flavour of Python (arcpy and pyqgis) available within the program. Since I am not sure what people have access to, this tutorial will take a third option and use geopandas. 19 | 20 | ## Set-up 21 | ### Install-less (cloud) set-up 22 | A [binder version](https://mybinder.org/v2/gh/mtb-za/transform-2020-spatial-in-python/master) of this repository is available. Clicking that link will let you run everything in a browser without installing anything. It may take some time to initialise, so if you want to go with this option, it might be good to arrive a little bit early to set it up. 23 | 24 | Because mybinder.org is a best-effort, free service, it may be oversubscribed if many people attempt to use it at once. If possible, it is recommended that you set up a local environment instead. 25 | 26 | ### Local set-up 27 | The easiest way to get up and running for this tutorial will be to clone the [GitHub repository](https://github.com/mtb-za/transform-2020-spatial-in-python/). Note that some of the files in the data folder are zipped. These can be left as they are, unless you are particularly curious. 28 | 29 | It is recommended that you use the Anaconda Python distribution. Please see the end of this section for some help if you need to install this. 30 | 31 | Once you have conda installed, navigate to the downloaded folder, and type `conda env create -f environment.yml` in the Anaconda prompt. 
This will set up an environment named `t20-fri-geo` in which everything should work. Type `conda activate t20-fri-geo`, followed by `jupyter notebook`, and you will be ready to roll. This will open a page in your web browser - please make sure that you use either Firefox, Chrome or Edge. 32 | 33 | If you are doing things manually, the minimal packages to install are the following, along with their dependencies: 34 | * `jupyter` (the tutorial is run through a notebook) 35 | * [`geopandas`](https://geopandas.org/install.html) 36 | * [`mapclassify`](https://pysal.org/mapclassify/) 37 | * [`contextily`](https://github.com/darribas/contextily) 38 | * [`descartes`](https://pypi.python.org/pypi/descartes) 39 | * [`matplotlib`](https://matplotlib.org/) 40 | * [`geopy`](https://github.com/geopy/geopy) 41 | 42 | It is recommended that these be installed in a new environment using Anaconda. 43 | 44 | Additional set-up instructions in the form of videos for [Windows](https://youtu.be/FdatS_NKVrM) and [Linux](https://youtu.be/3ncwbHyZeAg), or as a [written guide](http://swu.ng/t20-python-setup), are available. Alternatively, please feel free to ask in the #t20-fri-geo channel in the [Slack group](https://swu.ng/slack). 
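Putting the local set-up steps above together, a typical session looks like the following (repository URL, environment name, and commands are taken from this README):

```shell
# Clone the tutorial repository and enter it.
git clone https://github.com/mtb-za/transform-2020-spatial-in-python.git
cd transform-2020-spatial-in-python

# Create the conda environment described in environment.yml,
# activate it, and start Jupyter.
conda env create -f environment.yml
conda activate t20-fri-geo
jupyter notebook
```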
45 | 46 | ## Other sessions 47 | 48 | If you are specifically interested in additional spatially-themed sessions, the following may be of interest: 49 | * [Spatial data analytics with geostatspy](https://transform2020.sched.com/event/cD0W/tutorial-open-source-spatial-data-analytics-in-python-with-geostatspy) 50 | * [Scattered points to gridded products using verde](https://transform2020.sched.com/event/c7KE/tutorial-from-scattered-data-to-gridded-products-using-verde) 51 | * [Geologic image processing with Python](https://transform2020.sched.com/event/cD5T/tutorial-geologic-image-processing-with-python) 52 | 53 | All have already happened, but the recordings of the live streams are still available via the [Software Underground YouTube channel](https://www.youtube.com/channel/UCeDefhvz7znDo29iOmqU_9A). 54 | 55 | -------------------------------------------------------------------------------- /_custom_notebook.py: -------------------------------------------------------------------------------- 1 | import json 2 | import argparse 3 | import os 4 | import re 5 | from shutil import copyfile 6 | 7 | 8 | def hide_cells(notebook): 9 | """ 10 | Finds the tag 'hide' in each cell and removes that cell 11 | 12 | Returns dict without 'hide' tagged cells 13 | """ 14 | clean = [] 15 | for cell in notebook['cells']: 16 | try: 17 | if 'hide' in cell['metadata']['tags']: 18 | pass 19 | else: 20 | clean.append(cell) 21 | except KeyError: 22 | clean.append(cell) 23 | 24 | notebook['cells'] = clean 25 | 26 | return notebook 27 | 28 | 29 | def keep_cells(notebook): 30 | """ 31 | Finds the tag 'keep' in any cell and if it exists, 32 | remove the ones that don't have it 33 | 34 | Returns dict containing only the 'keep' tagged cells 35 | """ 36 | has_keep = False 37 | for cell in notebook['cells']: 38 | try: 39 | if 'keep' in cell['metadata']['tags']: 40 | has_keep = True 41 | break 42 | except KeyError: 43 | pass 44 | 45 | if has_keep: 46 | clean = [] 47 | for cell in notebook['cells']: 48 | try: 49 | if 
'keep' in cell['metadata']['tags']: 50 | 51 | clean.append(cell) 52 | except KeyError: 53 | pass 54 | 55 | 56 | notebook['cells'] = clean 57 | 58 | return notebook 59 | 60 | 61 | def empty_cells(notebook): 62 | """ 63 | Finds the tag 'empty' in each cell and removes its content 64 | 65 | Returns dict with empty cells 66 | """ 67 | 68 | clean = [] 69 | for cell in notebook['cells']: 70 | try: 71 | tags = cell['metadata']['tags'] 72 | if True in map(lambda x: x.lower().startswith('empty'), tags): 73 | cell['source'] = [] 74 | clean.append(cell) 75 | 76 | except KeyError: 77 | clean.append(cell) 78 | 79 | notebook['cells'] = clean 80 | 81 | return notebook 82 | 83 | 84 | def exercise_cells(notebook): 85 | """ 86 | Finds the tag 'exe' in each cell and applies HTML template 87 | 88 | Returns dict with template cells 89 | """ 90 | clean = [] 91 | wraphead = ["
<div class='alert alert-success'>\n",  # wrapper markup is illustrative; original styling not preserved 92 | ] 93 | 94 | wraptail = ["\n</div>", 95 | ] 96 | 97 | for cell in notebook['cells']: 98 | try: 99 | tags = cell['metadata']['tags'] 100 | if True in [t.lower().startswith('ex') for t in tags]: 101 | src = cell['source'] 102 | src = [re.sub(r"^#+? (.+)\n", r"<h3>\1</h3>
\n", s) for s in src] 103 | cell['source'] = wraphead + src + wraptail 104 | except KeyError: 105 | pass 106 | clean.append(cell) 107 | notebook['cells'] = clean 108 | 109 | return notebook 110 | 111 | 112 | def hide_code(notebook): 113 | """ 114 | Finds the tags '#!--' and '#--!' in each cell and removes 115 | the lines in between. 116 | 117 | Returns dict 118 | """ 119 | 120 | for i, cell in enumerate(notebook['cells']): 121 | istart = 0 122 | istop = -1 123 | for idx, line in enumerate(cell['source']): 124 | if '#!--' in line: 125 | istart = idx 126 | if '#--!' in line: 127 | istop = idx 128 | 129 | notebook['cells'][i]['source'] = cell['source'][:istart] + cell['source'][istop+1:] 130 | 131 | return notebook 132 | 133 | 134 | def hide_toolbar(notebook): 135 | """ 136 | Finds the display toolbar tag and hides it 137 | """ 138 | 139 | if 'celltoolbar' in notebook['metadata']: 140 | del notebook['metadata']['celltoolbar'] 141 | 142 | return notebook 143 | 144 | 145 | def stripout(fname): 146 | """ 147 | Removes all output cells 148 | """ 149 | os.system("nbstripout {}".format(fname)) 150 | 151 | return 152 | 153 | 154 | def process(fname, 155 | outname, 156 | poutput=True, 157 | pkeep=True, 158 | phide=True, 159 | pexercise=True, 160 | phidecode=True): 161 | """ 162 | Loads an 'ipynb' file as a dict and performs cleaning tasks 163 | 164 | Writes cleaned version 165 | """ 166 | print(fname) 167 | 168 | # if poutput: 169 | # stripout(fname) 170 | 171 | with open(fname, 'r') as f: 172 | notebook_s = f.read() 173 | 174 | notebook = json.loads(notebook_s) 175 | 176 | if pkeep: 177 | notebook = keep_cells(notebook) 178 | if phide: 179 | notebook = hide_cells(notebook) 180 | if pexercise: 181 | notebook = exercise_cells(notebook) 182 | if phidecode: 183 | notebook = hide_code(notebook) 184 | 185 | notebook = hide_toolbar(notebook) 186 | 187 | with open(outname, 'w') as f: 188 | _ = f.write(json.dumps(notebook)) 189 | 190 | return 191 | 192 | 
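As a quick sanity check of the heading pattern that `exercise_cells()` above feeds to `re.sub`, the expression can be exercised on its own (the sample strings here are made up for illustration):

```python
import re

# The pattern used by exercise_cells(): a lazy run of '#' characters,
# a space, then the heading text captured up to the newline.
heading = re.compile(r"^#+? (.+)\n")

# The capture group holds just the heading text, without the '#' markers.
assert heading.match("## Exercise 1\n").group(1) == "Exercise 1"

# Lines that are not markdown headings do not match, so a substitution
# leaves them untouched.
assert heading.sub(r"\1", "plain text\n") == "plain text\n"
```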
193 | def makedirs(name): 194 | """ 195 | Creates the directory 'name', unless it already exists 196 | """ 197 | try: 198 | os.mkdir(name) 199 | except FileExistsError: 200 | pass 201 | 202 | return 203 | 204 | 205 | def movefiles(names, dest): 206 | """Copies the files listed in 'names' into the directory 'dest'.""" 207 | 208 | for name in names: 209 | copyfile(name.strip('\n'), dest+'/'+name.split("/")[-1].strip('\n')) 210 | 211 | return 212 | 213 | 214 | def processList(fname): 215 | """ 216 | Loads a 'txt' file with notebook filenames 217 | and performs cleaning tasks on them 218 | 219 | Writes cleaned version 220 | """ 221 | with open(fname, 'r') as f: 222 | notebook_list = f.readlines() 223 | 224 | student = 'notebooks' 225 | # instructor = 'instructor' 226 | makedirs(student) 227 | # makedirs(instructor) 228 | 229 | movefiles(notebook_list, student) 230 | # movefiles(notebook_list, instructor) 231 | 232 | cwd = os.getcwd() 233 | 234 | os.chdir(os.path.join(cwd, student)) 235 | for name in notebook_list: 236 | fname = name.split("/")[-1].strip('\n') 237 | process(fname, fname) 238 | 239 | # os.chdir(os.path.join(cwd, instructor)) 240 | # for name in notebook_list: 241 | # fname = name.split("/")[-1].strip('\n') 242 | # process(fname, fname, 243 | # poutput=True, 244 | # pkeep=False, 245 | # phide=False, 246 | # pexercise=True, 247 | # phidecode=False) 248 | 249 | return 250 | 251 | 252 | def main(argv=None): 253 | """ 254 | Usage: 255 | 256 | python custom_notebook.py --infile name.ipynb --outfile out.ipynb 257 | """ 258 | argp = argparse.ArgumentParser(description='Convert a set of notebooks') 259 | argp.add_argument('--infile', nargs='?', type=str, 260 | help='The .ipynb file') 261 | argp.add_argument('--outfile', type=str, 262 | help='Output filename.') 263 | argp.add_argument('--listfile', type=str, 264 | help='Text file with Notebook filenames') 265 | args = argp.parse_args(argv) 266 | 267 | if args.infile: 268 | if not args.infile.endswith('.ipynb'): 269 | raise FileNotFoundError("Could not find an ipynb file. 
Did you mean to use --listfile ??") 270 | process(args.infile, args.outfile) 271 | else: 272 | processList(args.listfile) 273 | 274 | 275 | if __name__ == '__main__': 276 | main() 277 | -------------------------------------------------------------------------------- /data/cleaned/gpkg-offshore_wells_2011_UTM20_NAD83.gpkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/cleaned/gpkg-offshore_wells_2011_UTM20_NAD83.gpkg -------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_Geographic_NAD27.cpg: -------------------------------------------------------------------------------- 1 | ISO-8859-1 -------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_Geographic_NAD27.dbf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/cleaned/offshore_wells_2011_Geographic_NAD27.dbf -------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_Geographic_NAD27.prj: -------------------------------------------------------------------------------- 1 | GEOGCS["GCS_North_American_1927",DATUM["D_North_American_1927",SPHEROID["Clarke_1866",6378206.4,294.978698213898]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]] -------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_Geographic_NAD27.shp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/cleaned/offshore_wells_2011_Geographic_NAD27.shp 
-------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_Geographic_NAD27.shx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/cleaned/offshore_wells_2011_Geographic_NAD27.shx -------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_UTM20_NAD83.cpg: -------------------------------------------------------------------------------- 1 | ISO-8859-1 -------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_UTM20_NAD83.dbf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/cleaned/offshore_wells_2011_UTM20_NAD83.dbf -------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_UTM20_NAD83.prj: -------------------------------------------------------------------------------- 1 | PROJCS["NAD_1983_UTM_Zone_20N",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-63],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["Meter",1]] -------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_UTM20_NAD83.shp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/cleaned/offshore_wells_2011_UTM20_NAD83.shp 
-------------------------------------------------------------------------------- /data/cleaned/offshore_wells_2011_UTM20_NAD83.shx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/cleaned/offshore_wells_2011_UTM20_NAD83.shx -------------------------------------------------------------------------------- /data/ne_10m_admin_1_states_provinces.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/ne_10m_admin_1_states_provinces.zip -------------------------------------------------------------------------------- /data/ne_10m_rivers_lake_centerlines_trimmed.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/ne_10m_rivers_lake_centerlines_trimmed.zip -------------------------------------------------------------------------------- /data/offshore_wells_2011_UTM20_NAD83.cpg: -------------------------------------------------------------------------------- 1 | ISO-8859-1 -------------------------------------------------------------------------------- /data/offshore_wells_2011_UTM20_NAD83.dbf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/offshore_wells_2011_UTM20_NAD83.dbf -------------------------------------------------------------------------------- /data/offshore_wells_2011_UTM20_NAD83.prj: -------------------------------------------------------------------------------- 1 | 
PROJCS["NAD_1983_UTM_Zone_20N",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-63],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["Meter",1]] -------------------------------------------------------------------------------- /data/offshore_wells_2011_UTM20_NAD83.shp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/offshore_wells_2011_UTM20_NAD83.shp -------------------------------------------------------------------------------- /data/offshore_wells_2011_UTM20_NAD83.shx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/offshore_wells_2011_UTM20_NAD83.shx -------------------------------------------------------------------------------- /data/south_africa_places.gpkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/south_africa_places.gpkg -------------------------------------------------------------------------------- /data/south_africa_runways.gpkg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/south_africa_runways.gpkg -------------------------------------------------------------------------------- /data/tanzania_mines.zip: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/mtb-za/transform-2020-spatial-in-python/266979087ebdf2ab31966af31c0f95c210850e7c/data/tanzania_mines.zip -------------------------------------------------------------------------------- /data/za_province_employment.csv: -------------------------------------------------------------------------------- 1 | "Province","Employed","Unemployed","Discouraged work-seeker","Other not economically active","Unspecified","Not applicable","Total" 2 | "Western Cape",2010697.4156982,552733.1387408,122753.1199101,1330518.8044977,0,1806031.8520054,5822734.3308521 3 | "Eastern Cape",1028963.7061933,615848.8157591,306375.9746898,2001778.9305453,0,2609085.6260784,6562053.053266 4 | "Northern Cape",282791.2562902,106723.41261,39912.56978,306290.7838901,0,410142.7222103,1145860.7447806 5 | "Free State",649660.7618378,313793.2708799,99948.79837,732516.7390578,0,949670.2152741,2745589.7854196 6 | "KwaZulu-Natal",2041580.5929649,1006408.5079901,488537.5825189,2943203.0412211,0,3787570.6944338,10267300.4191289 7 | "North West",843368.7815887,387348.2748303,127489.59069,913527.1048287,0,1238218.7622066,3509952.5141444 8 | "Gauteng",4467370.1463884,1598044.3779043,296450.17456,2468858.9622163,0,3441539.2774087,12272262.9384777 9 | "Mpumalanga",969770.8595493,448126.1962497,150843.82964,1020805.6904996,0,1450392.1525171,4039938.7284558 10 | "Limpopo",885873.8467598,565029.2274293,202780.2694298,1577755.5601106,0,2173428.8058108,5404867.7095404 11 | "Total",13180077.367197,5594055.2224395,1835091.9095935,13295255.6167899,0,17866080.1081442,51770560.2241641 -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: t20-fri-geo 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - geopandas 6 | - descartes 7 | - jupyter 8 | - contextily 9 | - matplotlib 
10 | - mapclassify 11 | - cartopy 12 | -------------------------------------------------------------------------------- /introduction.md: -------------------------------------------------------------------------------- 1 | # Geospatial Processing in Python 2 | ## (using geopandas) 3 | 4 | ### Agenda 5 | * Introduction 6 | * Cleaning Data 7 | * IO and basic plotting 8 | * Processing vector data 9 | * Joins and spatial joins (time permitting) 10 | * Basemaps in Contextily 11 | 12 | ### What we will cover 13 | * Vector data - points, lines, polygons 14 | * How to clean data in a shapefile 15 | * Reading and writing to spatial data formats 16 | * Creating GeoDataFrames 17 | * Plotting data on maps 18 | - Using `mapclassify` for binning data 19 | * Working with times 20 | * Some discussion on projections and Coordinate Reference Systems 21 | * Buffering data 22 | * Selecting data by spatial relationships 23 | - Plots using data in an area 24 | * Measuring distances and finding nearest neighbours 25 | * Joining data on attributes 26 | * Joining data based on spatial relationships 27 | 28 | ### What we will explicitly not cover 29 | * Raster data 30 | * ArcGIS/QGIS flavours of Python 31 | * Interactive maps - leaflet/folium 32 | -------------------------------------------------------------------------------- /master/01-geopandas-cleaning_data-master.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Cleaning a shapefile\n", 8 | "\n", 9 | "DATE: 12 June 2020, 08:00 - 11:00 UTC\n", 10 | "\n", 11 | "AUDIENCE: Intermediate\n", 12 | "\n", 13 | "INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n", 14 | "\n", 15 | "When processing data, we are often not lucky enough to have it perfectly usable immediately. 
This notebook works through loading, cleaning and saving a shapefile, using `geopandas`, an extension for `pandas` that adds facility for spatial processing.\n", 16 | "\n", 17 | "#### Note\n", 18 | "Much of this is standard data cleaning, and does not rely on geopandas per se, except that the data that we want to clean is in a geospatial data format, such as a shapefile. Most of these tools are the same in standard `pandas`, but have been extended in geopandas to work with spatial indices.\n", 19 | "\n", 20 | "This notebook is provided more as an example of data cleaning, which is common when dealing with real data. The result of this notebook can easily be used in whatever GIS software you prefer, since it is a standard shapefile." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "import geopandas as gpd\n", 30 | "import pandas as pd\n", 31 | "import numpy as np" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "We will start by loading and having a look at the data that we have available." 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "tags": [ 53 | "hide" 54 | ] 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "fname = '../data/offshore_wells_2011_UTM20_NAD83.shp'\n", 59 | "\n", 60 | "well_data = gpd.read_file(fname)\n", 61 | "well_data.head()" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "Something that we may be interested in is the different companies that have operated in this field. This is equivalent to looking at a column in a spreadsheet." 
69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": { 82 | "tags": [ 83 | "hide" 84 | ] 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "well_data['Company']" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "If we want to get an idea of all the companies present, we can use the `set` function, which returns the unique values from a list-like object:" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": { 102 | "scrolled": true 103 | }, 104 | "outputs": [], 105 | "source": [] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": { 111 | "scrolled": true, 112 | "tags": [ 113 | "hide" 114 | ] 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "set(well_data['Company'])" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "A number of these companies should probably be consolidated. The most straightforward way to do this is by diving into the dark art of regular expressions. We will make a dictionary of what to look for as the key and what to replace it with as the value." 
126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "replacements = {\n", 135 | " r'\\-': '', # remove '-'\n", 136 | " r' et al': '', # remove ' et al'\n", 137 | " r' Cda': '', # remove ' Cda'\n", 138 | " r'EnCana.*$': 'EnCana', # change 'EnCana' followed by anything to 'EnCana'\n", 139 | " r'PanCanadian(\\-|.*)\\n.*': 'PanCanadian', # strip the odd characters after 'PanCanadian'\n", 140 | " r'Mobil.*$': 'Mobil', # strip anything following 'Mobil'\n", 141 | " r'Shell.*$': 'Shell', # strip anything following 'Shell'\n", 142 | " r'Exxonmobil': 'ExxonMobil', # correct capitalisation of 'Exxonmobil' to 'ExxonMobil'\n", 143 | " r'Petocan': 'PetroCan', # correct spelling\n", 144 | " r'Petrocan': 'PetroCan', # correct capitalisation of 'Petrocan' to 'PetroCan'\n", 145 | " r'PetroCan*$': 'PetroCan', # strip anything following 'PetroCan'\n", 146 | " r'^Husky.*\\n.*$': 'HBV', # convert anything starting with 'Husky' to 'HBV' after stripping new line\n", 147 | " r'^Bow Valley.*\\n.*$': 'BVH', # convert anything starting with 'Bow Valley' to 'BVH' after stripping new line\n", 148 | " r'HBV.*$': 'HBV', # strip anything following 'HBV'\n", 149 | " r'BVH.*$': 'BVH', # strip anything following 'BVH'\n", 150 | " r'Pex/Tex': 'Pex', # convert 'Pex/Tex' to 'Pex'\n", 151 | " r'Candian Sup/': 'Canadian Superior', # correct typo 'Candian Sup/' to 'Canadian Superior'\n", 152 | " r'Canadian Sup\\.': 'Canadian Superior', # expand 'Canadian Sup.' to 'Canadian Superior'\n", 153 | "}" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "In this case, we are going to create a new column (`Owner`) to store the cleaned data, in case we need to retrieve the exact company for some reason. We could change the original GeoDataFrame column by using the `inplace=True` argument to the `replace` method." 
161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": { 174 | "tags": [ 175 | "hide" 176 | ] 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "well_data['Owner'] = well_data['Company'].replace(regex=replacements)\n", 181 | "set(well_data['Owner'])" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "## Exercise 1\n", 189 | "\n", 190 | "1. Print a list of the unique values in the `Well_Type` Series.\n", 191 | "2. Clean up the `Well_Type` Series to remove the typos and make the data more consistent. We can do this in-place, because the original data does not really give us any additional information. (Hint: look at the `inplace=True` parameter to do this to the original GeoDataFrame.)\n", 192 | " - Change 'Exploratory' to 'Exploration'\n", 193 | " - Change the typo 'Develpoment' to 'Development'\n", 194 | " - Remove the new line, by changing `\\n&` to `''`\n", 195 | " - Remove excess whitespace by changing `\\s+` to `' '`" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "# Print a list of the unique values in the `Well_Type` Series.\n" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "tags": [ 212 | "hide" 213 | ] 214 | }, 215 | "outputs": [], 216 | "source": [ 217 | "# Print a list of the unique values in the `Well_Type` Series.\n", 218 | "set(well_data['Well_Type'])" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "# Clean up the `Well_Type` Series to remove typos and make the data more consistent. 
Do this in-place.\n", 228 | "# \n", 229 | "replacements = {\n", 230 | " 'key': 'changed_to',\n", 231 | " r'\\n&': '',\n", 232 | " r'\\s+': ' ',\n", 233 | " 'key2': 'changed_to',\n", 234 | "}\n", 235 | "\n", 236 | "well_data['Well_Type'].replace(regex=replacements, inplace=True)\n", 237 | "well_data['Well_Type']" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": { 244 | "tags": [ 245 | "hide" 246 | ] 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "# Clean up the `Well_Type` Series to remove typos and make the data more consistent. Do this in-place.\n", 251 | "replacements = {\n", 252 | " 'Develpoment\\/ ': 'Development\\/',\n", 253 | " r'\\n&': '',\n", 254 | " r'\\s+': ' ',\n", 255 | " 'Exploratory': 'Exploration',\n", 256 | "}\n", 257 | "\n", 258 | "well_data['Well_Type'].replace(regex=replacements, inplace=True)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "## Cleaning Column Names\n", 266 | "\n", 267 | "The current column names are not very helpful in some cases, with weird codes and similar. We can probably make these more understandable, and (geo)pandas makes it easy to do so.\n", 268 | "\n", 269 | "#### Note for Shapefile\n", 270 | "The maximum length of field names in a shapefile is 10 characters. Some other formats, such as `.gpkg`, do not have this limitation.\n", 271 | "\n", 272 | "______________\n", 273 | "\n", 274 | "We can start by getting the current column names." 
275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": { 288 | "tags": [ 289 | "hide" 290 | ] 291 | }, 292 | "outputs": [], 293 | "source": [ 294 | "print(well_data.columns)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "Some of these are clearly cut off by the 10-character limit, for example `Well_Termi` and `seafl_twt_`. In other cases, we can see that there are duplicates, for example `Well_Name` and `Well_Nam_1`, without any indication of what the difference is. We can do better, even with the limits of the Shapefile format.\n", 302 | "\n", 303 | "We can get a feel for the data with the `head()` method, as used above. The built-in `set()` function is also often helpful to see what different values are in text columns, and may give us a better idea of what the data is describing, as we have already seen." 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "There is a `rename()` method that can take a `dict` with existing Series names as keys and new names as values. We will change `Well_Termi` to `Well_End` and `Well_Nam_1` to `Well_Code`."
311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": { 324 | "tags": [ 325 | "hide" 326 | ] 327 | }, 328 | "outputs": [], 329 | "source": [ 330 | "series_names = {\n", 331 | "    'Well_Termi': 'Well_End',\n", 332 | "    'Well_Nam_1': 'Well_Code',\n", 333 | "}\n", 334 | "well_data.rename(columns=series_names)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "well_data.head()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "As you can see, `well_data` is not changed, since `rename()` returns a copy of the GeoDataFrame. To change it instead of getting a copy, the `inplace=True` option should be added, or the copy assigned to another variable." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "well_data.rename(columns=series_names, inplace=True)\n", 367 | "# We can also do this with the following:\n", 368 | "# well_data = well_data.rename(columns=series_names)" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "Now it works!" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "print(well_data.columns)\n", 385 | "well_data.head()" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "## Exercise 2\n", 393 | "\n", 394 | "1. What are the different values of `Well_Symb`?\n", 395 | "2. 
What are the different values of `Drilling_U`? In particular, what are the different entries in `Drilling_U` referring to, and what might be a more descriptive name?\n", 396 | "3. Change the following column names in the DataFrame:\n", 397 | " * `Total_De_1` to `Dpth_ft`\n", 398 | " * `Total_Dept` to `Dpth_m`\n", 399 | " * `seafl_twt_` to `FloorTWT`\n", 400 | " * `Drilling_U` to something based on the previous answer." 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": {}, 407 | "outputs": [], 408 | "source": [ 409 | "# What are the different values in `Well_Symb`?\n" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": null, 415 | "metadata": { 416 | "tags": [ 417 | "hide" 418 | ] 419 | }, 420 | "outputs": [], 421 | "source": [ 422 | "# What are the different values in `Well_Symb`?\n", 423 | "set(well_data['Well_Symb'])" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": null, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "# What are the different values in `Drilling_U`?\n" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "metadata": { 439 | "tags": [ 440 | "hide" 441 | ] 442 | }, 443 | "outputs": [], 444 | "source": [ 445 | "# What are the different values in `Drilling_U`?\n", 446 | "set(well_data['Drilling_U'])" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": null, 452 | "metadata": {}, 453 | "outputs": [], 454 | "source": [ 455 | "series_names = {\n", 456 | "}" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": null, 462 | "metadata": { 463 | "tags": [ 464 | "hide" 465 | ] 466 | }, 467 | "outputs": [], 468 | "source": [ 469 | "series_names = {\n", 470 | " 'Drilling_U': 'Drill_Ship',\n", 471 | " 'Total_De_1': 'Dpth_ft',\n", 472 | " 'Total_Dept': 'Dpth_m',\n", 473 | " 'Water_Dept': 'Water_Dpth',\n", 474 | " 'seafl_twt_': 'FloorTWT',\n", 475 | "}" 476 
| ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "If we are changing a Shapefile, remember that we cannot have Series names longer than 10 characters. This will check the new names for you." 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": {}, 489 | "outputs": [], 490 | "source": [ 491 | "for new_name in series_names.values():\n", 492 | "    # Check the new names, since these are what will be written to the file.\n", 493 | "    if len(new_name) > 10:\n", 494 | "        print(f'{new_name} longer than 10 characters. Will not be able to save as Shapefile.')\n", 495 | "well_data.rename(columns=series_names, inplace=True)\n", 496 | "well_data.columns" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": null, 502 | "metadata": {}, 503 | "outputs": [], 504 | "source": [ 505 | "well_data.head()" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "## Datetimes\n", 513 | "\n", 514 | "If you are familiar with `pandas`, then you will know the utility of `datetime`s. We have some dates in the data, so we should make sure that they are correctly imported if we want to use them for anything involving time series analysis.\n", 515 | "\n", 516 | "#### Note:\n", 517 | "It is not possible to write a `datetime` to a Shapefile. If you want to do analysis that uses time series, then you may want to save a cleaned dataframe _before_ you convert to `datetime`s. Alternatively, save the GeoDataFrame as a geopackage or similar format that can handle a `datetime`."
518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "metadata": {}, 524 | "outputs": [], 525 | "source": [] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "metadata": { 531 | "tags": [ 532 | "hide" 533 | ] 534 | }, 535 | "outputs": [], 536 | "source": [ 537 | "type(well_data['Spud_Date'][0])" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "As we see, these dates are stored as strings. We can easily convert them to `datetime`s, however. First we will copy our original geodataframe to save later. (If you are doing this conversion with your own data, make sure that you look into the limitations of doing so, if you want to save a shapefile.)" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": { 558 | "tags": [ 559 | "hide" 560 | ] 561 | }, 562 | "outputs": [], 563 | "source": [ 564 | "well_data_original = well_data.copy()\n", 565 | "well_data['Spud_Date'] = pd.to_datetime(well_data['Spud_Date'])\n", 566 | "type(well_data['Spud_Date'][0])" 567 | ] 568 | }, 569 | { 570 | "cell_type": "markdown", 571 | "metadata": {}, 572 | "source": [ 573 | "## Exercise 3\n", 574 | "\n", 575 | "1. Change the `Well_End` Series to `Timestamp`s.\n", 576 | "2. Make a Series of the difference in time between the `Spud_Date` and the `Well_End` Series. (Do not add this to our current geodataframe, to make saving it easier later.)\n", 577 | "3. What is the biggest difference in days, between the `Spud_Date` and the `Well_End` Series? 
(Hint: you may wish to look at the `dt.days` attribute of a `timeDelta`.)" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": null, 583 | "metadata": {}, 584 | "outputs": [], 585 | "source": [ 586 | "# Change the `Well_End` Series to `Datetime`s.\n" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": null, 592 | "metadata": { 593 | "tags": [ 594 | "hide" 595 | ] 596 | }, 597 | "outputs": [], 598 | "source": [ 599 | "# Change the `Well_End` Series to `Timestamp`s.\n", 600 | "well_data['Well_End'] = pd.to_datetime(well_data['Well_End'])" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "metadata": {}, 607 | "outputs": [], 608 | "source": [ 609 | "# Add a Series with the difference in time between the `Spud_Date` and `Well_End` Series.\n" 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "execution_count": null, 615 | "metadata": { 616 | "tags": [ 617 | "hide" 618 | ] 619 | }, 620 | "outputs": [], 621 | "source": [ 622 | "# Add a Series with the difference in time between the `Spud_Date` and `Well_End` Series.\n", 623 | "time_differences = well_data['Well_End'] - well_data['Spud_Date']" 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": null, 629 | "metadata": {}, 630 | "outputs": [], 631 | "source": [ 632 | "# What is the biggest time difference, in days, between the `Spud_Date` and `Well_End` Series?\n" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": { 639 | "tags": [ 640 | "hide" 641 | ] 642 | }, 643 | "outputs": [], 644 | "source": [ 645 | "# What is the biggest time difference, in days, between the `Spud_Date` and `Well_End` Series?\n", 646 | "max(time_differences.dt.days)" 647 | ] 648 | }, 649 | { 650 | "cell_type": "markdown", 651 | "metadata": {}, 652 | "source": [ 653 | "## Saving files\n", 654 | "\n", 655 | "Once we have made these changes, we would like to save them for future work. 
Geopandas makes that very easy. Note that we cannot write a `datetime` to shapefiles, so we would need to change it (back) to a string if we want to save it. Similarly, if we have a Series of `bool` values (`True` or `False`), we should convert those to `int`s before we save.\n", 656 | "\n", 657 | "First we will see which formats are available to save to. Geopandas uses `fiona` in the background; we will take a look at what that offers us." 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "metadata": {}, 664 | "outputs": [], 665 | "source": [ 666 | "import fiona  # we do not normally need this for saving, it gets used in the background.\n", 667 | "fiona.supported_drivers" 668 | ] 669 | }, 670 | { 671 | "cell_type": "markdown", 672 | "metadata": {}, 673 | "source": [ 674 | "We can only write to some of these formats: those whose tags include `w` for write, such as `rw` or `raw`.\n", 675 | "\n", 676 | "As a format, `.gpkg` (GeoPackage) is becoming more popular, so we will save our geodataframe as that. One nice advantage, not relevant here, is being able to save multiple layers in a single file."
677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "metadata": {}, 683 | "outputs": [], 684 | "source": [] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": null, 689 | "metadata": { 690 | "tags": [ 691 | "hide" 692 | ] 693 | }, 694 | "outputs": [], 695 | "source": [ 696 | "fname = '../data/cleaned/gpkg-offshore_wells_2011_UTM20_NAD83.gpkg'\n", 697 | "\n", 698 | "well_data.to_file(fname, layer='well_locations', driver='GPKG')" 699 | ] 700 | }, 701 | { 702 | "cell_type": "markdown", 703 | "metadata": {}, 704 | "source": [ 705 | "We can also save as a Shapefile, but we will get an error:\n", 706 | "\n", 707 | "`DriverSupportError: ESRI Shapefile does not support datetime fields`" 708 | ] 709 | }, 710 | { 711 | "cell_type": "code", 712 | "execution_count": null, 713 | "metadata": {}, 714 | "outputs": [], 715 | "source": [] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": null, 720 | "metadata": { 721 | "tags": [ 722 | "hide" 723 | ] 724 | }, 725 | "outputs": [], 726 | "source": [ 727 | "fname = '../data/cleaned/offshore_wells_2011_UTM20_NAD83_cleaned.shp'\n", 728 | "\n", 729 | "well_data.to_file(fname)" 730 | ] 731 | }, 732 | { 733 | "cell_type": "markdown", 734 | "metadata": {}, 735 | "source": [ 736 | "This is fixable by saving our copy of the dataset, or by converting the datetime back to a string." 
737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": null, 742 | "metadata": {}, 743 | "outputs": [], 744 | "source": [] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": null, 749 | "metadata": { 750 | "tags": [ 751 | "hide" 752 | ] 753 | }, 754 | "outputs": [], 755 | "source": [ 756 | "# well_data['Spud_Date'] = well_data['Spud_Date'].dt.strftime('%Y-%m-%d')\n", 757 | "# well_data['Well_End'] = well_data['Well_End'].dt.strftime('%Y-%m-%d')\n", 758 | "fname = '../data/offshore_wells_2011_UTM20_NAD83_cleaned.shp'\n", 759 | "\n", 760 | "well_data_original.to_file(fname)" 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": null, 766 | "metadata": {}, 767 | "outputs": [], 768 | "source": [] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "We can also easily save these using a different CRS, if that is better for our data. This one uses the North American Datum 1927 (EPSG:4267), in degrees." 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "execution_count": null, 780 | "metadata": { 781 | "tags": [ 782 | "hide" 783 | ] 784 | }, 785 | "outputs": [], 786 | "source": [ 787 | "fname = '../data/offshore_wells_2011_Geographic_NAD27_cleaned.shp'\n", 788 | "\n", 789 | "well_data_original.to_crs(epsg=4267).to_file(fname, driver='ESRI Shapefile')" 790 | ] 791 | }, 792 | { 793 | "cell_type": "markdown", 794 | "metadata": {}, 795 | "source": [ 796 | "## Closing remarks\n", 797 | "\n", 798 | "The data that we have just saved can be used in the \"Intro to Geopandas\" notebook.\n", 799 | "\n", 800 | "
\n", 801 | "\n", 802 | "

© 2020 Agile Geoscience, CC-BY

" 803 | ] 804 | } 805 | ], 806 | "metadata": { 807 | "celltoolbar": "Tags", 808 | "kernelspec": { 809 | "display_name": "Python 3", 810 | "language": "python", 811 | "name": "python3" 812 | }, 813 | "language_info": { 814 | "codemirror_mode": { 815 | "name": "ipython", 816 | "version": 3 817 | }, 818 | "file_extension": ".py", 819 | "mimetype": "text/x-python", 820 | "name": "python", 821 | "nbconvert_exporter": "python", 822 | "pygments_lexer": "ipython3", 823 | "version": "3.8.3" 824 | } 825 | }, 826 | "nbformat": 4, 827 | "nbformat_minor": 4 828 | } -------------------------------------------------------------------------------- /master/02-geopandas-io_and_plotting-master.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Intro to Geopandas plotting vector data\n", 8 | "\n", 9 | "DATE: 12 June 2020, 08:00 - 11:00 UTC\n", 10 | "\n", 11 | "AUDIENCE: Intermediate\n", 12 | "\n", 13 | "INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n", 14 | "\n", 15 | "Not all the data that we want to deal with is simply numeric. Much of it will need to be located in space as well. Luckily, there are numerous tools to handle this sort of data. For this notebook, we will focus on vector data. This is data consisting of points, lines and polygons, not gridded data. The tutorials by Leo Uieda and Joe Kington deal more with raster data and should be a good complement to this tutorial. \n", 16 | "\n", 17 | "There are a number of common spatial tasks that are often done in GIS software, such as adding buffers to data, or manipulating and creating geometries. 
This notebook is focused more on the basics of using existing data and plotting it, but not making many changes specific to spatial data.\n", 18 | "\n", 19 | "#### Prerequisites\n", 20 | "\n", 21 | "You should be reasonably comfortable with `pandas` and `matplotlib.pyplot`.\n", 22 | "\n", 23 | "Beyond that, this is aimed at relative beginners to working with spatial data.\n", 24 | "\n", 25 | "#### A Note on Shapefile\n", 26 | "\n", 27 | "Shapefiles are a common file format used when sharing and storing georeferenced data. A single shapefile has a number of components that are required for it to work correctly.\n", 28 | "These are mandatory:\n", 29 | "- `.shp` contains the feature geometry.\n", 30 | "- `.shx` is the shape index.\n", 31 | "- `.dbf` contains the attributes in columns, for each feature.\n", 32 | "\n", 33 | "There are a number of additional files that may also be present, of which these are the most common (in the author's experience).\n", 34 | "- `.prj` is the projection of the data.\n", 35 | "- `.sbx` and `.sbn` are a spatial index.\n", 36 | "- `.shp.xml` is a metadata file.\n", 37 | "\n", 38 | "While shapefiles are very common on desktop systems, they tend not to be used to present data on the web, although they are often offered as a download option." 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "### Pandas and Geopandas\n", 46 | "\n", 47 | "Pandas gives us access to a data structure called a DataFrame, which is very well suited for the sort of data that is usually in spreadsheets, with rows and columns. Geopandas is an expansion of that, to allow for the data to be geographically located in a sensible way. It does this by adding a `geometry` column (a `GeoSeries`), and adding some methods for some spatially useful tests, while still allowing the usual `DataFrame` methods from pandas.\n", 48 | "\n", 49 | "In addition, `cartopy` can be used to handle projections. 
`mapclassify` is optional, but allows easy binning of our data." 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "#import cartopy.crs as ccrs\n", 59 | "import geopandas as gpd\n", 60 | "import mapclassify as mc\n", 61 | "import numpy as np\n", 62 | "import pandas as pd" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "## Creating a geodataframe\n", 70 | "\n", 71 | "Loading a shapefile (or a number of other formats) is as simple as calling `read_file` with the right location.\n", 72 | "\n", 73 | "Geopandas uses `fiona` in the background, so anything that can be handled by `fiona` can be handled with geopandas. Note that some formats can be read, but not written." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "fname = '../data/cleaned/offshore_wells_2011_Geographic_NAD27.shp'\n", 83 | "\n", 84 | "well_locations = gpd.read_file(fname)\n", 85 | "well_locations.head()" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "We can also load data as a standard DataFrame and convert it by using any existing geometry that we know about.\n", 93 | "\n", 94 | "We will load up some data regarding issues identified at artisanal mines in Zimbabwe by the International Peace Information Service ([IPIS](http://ipisresearch.be/))."
95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "fname = '../data/zwe_mines_curated_all_opendata_p_ipis.csv'\n", 104 | "\n", 105 | "artisinal_mines = pd.read_csv(fname)\n", 106 | "artisinal_mines.head()" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "We can see that there is a `geom` column in this CSV, where every point is a Well-Known Text ([WKT](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry)) string describing the geometry. To make the geodataframe aware of this, we will use the `shapely` library (that geopandas uses under the hood)." 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": { 120 | "scrolled": false 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "from shapely import wkt\n", 125 | "artisinal_mines['geom'] = artisinal_mines['geom'].apply(wkt.loads)\n", 126 | "mines = gpd.GeoDataFrame(artisinal_mines, geometry='geom')\n", 127 | "mines.head()" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "This does not look very different, but we have now created a geodataframe from our existing dataframe. We could do something very similar with a CSV with separate columns of latitude and longitude.\n", 135 | "\n", 136 | "When creating a new geodataframe like this, we should also set the Coordinate Reference System (CRS) of the data, since geopandas does not know where the coordinates actually are on the Earth's surface. Some operations will still work, but relating one geodataframe to another is not possible. We are working with straight decimal degrees of longitude and latitude, so the WGS84 datum is a good option." 
137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "mines.crs = \"EPSG:4326\"" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "One of the simplest ways to see how a geodataframe differs from a standard dataframe is to call the `plot` method." 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "artisinal_mines.plot()" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "mines.plot()" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "As we can see, the geodataframe plots our coordinates, while the standard dataframe plots the numerical values according to their index." 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "### Exercise 1\n", 185 | "\n", 186 | "The following should be easily possible with a working knowledge of `pandas`. Using the well dataset:\n", 187 | "1. Which well is the deepest? (`df.sort_values('column')` may be useful.)\n", 188 | "1. How many wells are operated by Canadian Superior?"
189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "# The deepest well is:\n" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 | "tags": [ 205 | "hide" 206 | ] 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "# The deepest well is:\n", 211 | "well_locations.sort_values('Dpth_m').tail(1)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "# How many wells were operated by Canadian Superior?\n" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "metadata": { 227 | "tags": [ 228 | "hide" 229 | ] 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "# How many wells were operated by Canadian Superior?\n", 234 | "len(well_locations[well_locations['Owner'] == 'Canadian Superior'])" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "### Geographic plots\n", 242 | "\n", 243 | "We can take a quick look at where these wells are in relation to each other." 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "scrolled": true 251 | }, 252 | "outputs": [], 253 | "source": [] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": { 259 | "scrolled": true, 260 | "tags": [ 261 | "hide" 262 | ] 263 | }, 264 | "outputs": [], 265 | "source": [ 266 | "well_locations.plot()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "The data that we imported uses latitude and longitude. We can also easily import projected data, if we have it."
274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [ 282 | "fname = '../data/cleaned/offshore_wells_2011_UTM20_NAD83.shp'\n", 283 | "well_locations_utm = gpd.read_file(fname)\n", 284 | "# We are going to use the 'Spud_Date' and 'Well_End' columns later, so we turn them into proper datetime columns\n", 285 | "well_locations_utm['Spud_Date'] = pd.to_datetime(well_locations_utm['Spud_Date'])\n", 286 | "well_locations_utm['Well_End'] = pd.to_datetime(well_locations_utm['Well_End'])\n", 287 | "well_locations_utm.replace('None', np.NaN, inplace=True)\n", 288 | "well_locations_utm.plot()\n", 289 | "well_locations_utm.head(5)" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "Notice that the axes are completely different between the two datasets. We therefore cannot plot these two datasets in the same plot unless we use the same coordinate reference system. Geopandas can reproject a geodataframe for us with the `to_crs` method.\n", 297 | "\n", 298 | "First, let us see what CRS the different datasets have." 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "print(f'Wells: {well_locations.crs}\\nWells (UTM): {well_locations_utm.crs}')" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "If we want to plot the two datasets on the same plot, then they need to use the same CRS. One of the easiest ways is by using EPSG codes and the `to_crs` method. [epsg.io](https://epsg.io) and [spatialreference.org](https://spatialreference.org) are good places to find a suitable EPSG code for your data if you are not sure how the CRS relates to it."
315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": { 328 | "tags": [ 329 | "hide" 330 | ] 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "well_locations_utm_reproj = well_locations_utm.to_crs(epsg=\"4326\")\n", 335 | "ax = well_locations_utm_reproj.plot(markersize=15)\n", 336 | "well_locations.plot(ax=ax, color='red', markersize=5, alpha=0.4)" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "We can see that these datasets now plot on top of each other, as they should." 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "## Styling\n", 351 | "\n", 352 | "Just plotting these points on their own does not tell us very much, so we should style the data to show us what is happening. We will classify the data by the total depth of each well, breaking the column into six bins.\n", 353 | "\n", 354 | "We do this by using the `scheme` parameter, which is used in the background by `mapclassify` to bin the values of a column. A number of binning options are available, such as NaturalBreaks, Quantiles, StdMean, Percentiles."
355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": null, 360 | "metadata": {}, 361 | "outputs": [], 362 | "source": [] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": { 368 | "tags": [ 369 | "hide" 370 | ] 371 | }, 372 | "outputs": [], 373 | "source": [ 374 | "well_locations_utm.plot(column='Dpth_m',\n", 375 | " scheme='Percentiles', k=6,\n", 376 | " legend=True,\n", 377 | " markersize=10, cmap='cividis_r', figsize=(10,10))#.legend(bbox_to_anchor=(2,1))" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "The `scheme` keyword passes through to `mapclassify`, and only makes sense for some data. In other cases, we can just rely on the raw data." 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": {}, 391 | "outputs": [], 392 | "source": [] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": { 398 | "tags": [ 399 | "hide" 400 | ] 401 | }, 402 | "outputs": [], 403 | "source": [ 404 | "well_locations_utm.plot(column='Well_Type', legend=True,\n", 405 | " markersize=10, cmap='Set1',\n", 406 | " figsize=(10,10))#.legend(bbox_to_anchor=(1,1))" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "We may also be interested in only a section of the data within certain extents, such as the dense cluster south-east of centre. Geopandas offers the `.cx` coordinate indexer, which can be used for slicing based on coordinate values."
414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "metadata": { 427 | "tags": [ 428 | "hide" 429 | ] 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "main_field = well_locations_utm.cx[650000:800000, 4825000:4925000]\n", 434 | "print(main_field.shape)\n", 435 | "main_field.plot(column='Owner', legend=True, markersize=15, cmap='tab20', figsize=(10,10))" 436 | ] 437 | }, 438 | { 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "### Exercise 2\n", 443 | "\n", 444 | "The data contains columns for the start and end of when a well was active.\n", 445 | "\n", 446 | "1. Which well was operating for the longest time and how long was this? (Hint: use the `datetime` columns from earlier ('Spud_Date' and 'Well_End'). A useful pattern is `df.loc[df['column'] == value]`.)\n", 447 | "2. Plot a histogram of the days of operation for the wells in the dataset. You may need to drop invalid data (where some columns are NaN or NaT).\n", 448 | "3. Using the above histogram to determine a suitable cut-off, is there an area of the field that has wells that were in operation for longer than others? 
(Hint: you might want to extract a useful time interval from a `Series` of `timedelta`s to plot.)" 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": null, 454 | "metadata": {}, 455 | "outputs": [], 456 | "source": [ 457 | "# Which well was operating for the longest time and how long was this?\n", 458 | "\n" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": { 465 | "tags": [ 466 | "hide" 467 | ] 468 | }, 469 | "outputs": [], 470 | "source": [ 471 | "# Which well was operating for the longest time and how long was this?\n", 472 | "well_locations_utm['Operating'] = well_locations_utm['Well_End'] - well_locations_utm['Spud_Date']\n", 473 | "well_locations_utm[well_locations_utm['Operating'] == well_locations_utm['Operating'].max()]" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": {}, 480 | "outputs": [], 481 | "source": [ 482 | "# Plot a histogram of the days of operation for the wells in the dataset.\n" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": { 489 | "tags": [ 490 | "hide" 491 | ] 492 | }, 493 | "outputs": [], 494 | "source": [ 495 | "# Plot a histogram of the days of operation for the wells in the dataset.\n", 496 | "well_locations_utm['Operating'].dt.days.plot(kind='hist', bins=30)" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": null, 502 | "metadata": {}, 503 | "outputs": [], 504 | "source": [ 505 | "# Using the above histogram to determine a suitable cut-off, is there an area of the field that has\n", 506 | "# wells that were in operation for longer than others?\n", 507 | "\n", 508 | "\n", 509 | "\n", 510 | "\n", 511 | "\n" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": { 518 | "tags": [ 519 | "hide" 520 | ] 521 | }, 522 | "outputs": [], 523 | "source": [ 524 | "# Using the above histogram to 
determine a suitable cut-off, is there an area of the field that has\n", 525 | "# wells that were in operation for longer than others?\n", 526 | "\n", 527 | "#well_locations_utm['Operating_Days'] = well_locations_utm[well_locations_utm['Operating'].dt.days.notna() == True]\n", 528 | "well_locations_utm['Operating_Days'] = well_locations_utm['Operating'].dt.days\n", 529 | "long_wells = well_locations_utm[well_locations_utm['Operating'].dt.days > 150]\n", 530 | "base = well_locations_utm.plot(color='grey', figsize=(10,10), markersize=8)\n", 531 | "long_wells.plot(column='Operating_Days', scheme='Quantiles',\n", 532 | " cmap='viridis', alpha=0.9,\n", 533 | " ax=base, legend=True)" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "## Saving geodataframes\n", 541 | "\n", 542 | "While we can create maps and similar things in geopandas, sometimes we want to use the data in other software. Geopandas uses `fiona` in the background to read and write files. If we want this geodataframe as a GeoJSON file, for example, this is easily done by passing the correct argument to the `driver` parameter. (Note that GeoJSON only accepts the WGS84 datum, so I am reprojecting the geodataframe first.)\n", 543 | "\n", 544 | "By default, without an explicit driver, `to_file` will create a Shapefile."
545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": { 558 | "tags": [ 559 | "hide" 560 | ] 561 | }, 562 | "outputs": [], 563 | "source": [ 564 | "fname = '../data/geojson-offshore_wells_Geographic_NAD27.geojson'\n", 565 | "well_locations.to_crs(epsg=4326).to_file(fname, driver='GeoJSON')" 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "metadata": {}, 571 | "source": [ 572 | "Changing this to a GML file (a flavour of XML) is as simple as changing the driver parameter appropriately:" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": null, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": null, 585 | "metadata": { 586 | "tags": [ 587 | "hide" 588 | ] 589 | }, 590 | "outputs": [], 591 | "source": [ 592 | "fname = '../data/gml-offshore_wells_Geographic_NAD27.gml'\n", 593 | "well_locations.to_file(fname, driver='GML')" 594 | ] 595 | }, 596 | { 597 | "cell_type": "markdown", 598 | "metadata": {}, 599 | "source": [ 600 | "
\n", 601 | "\n", 602 | "

© 2020 Agile Geoscience, licensed CC-BY

" 603 | ] 604 | } 605 | ], 606 | "metadata": { 607 | "celltoolbar": "Tags", 608 | "kernelspec": { 609 | "display_name": "Python 3", 610 | "language": "python", 611 | "name": "python3" 612 | }, 613 | "language_info": { 614 | "codemirror_mode": { 615 | "name": "ipython", 616 | "version": 3 617 | }, 618 | "file_extension": ".py", 619 | "mimetype": "text/x-python", 620 | "name": "python", 621 | "nbconvert_exporter": "python", 622 | "pygments_lexer": "ipython3", 623 | "version": "3.8.3" 624 | } 625 | }, 626 | "nbformat": 4, 627 | "nbformat_minor": 2 628 | } -------------------------------------------------------------------------------- /master/03-geopandas-processing_vector_data-master.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Spatial Operations in Geopandas\n", 8 | "\n", 9 | "DATE: 12 June 2020, 08:00 - 11:00 UTC\n", 10 | "\n", 11 | "AUDIENCE: Intermediate\n", 12 | "\n", 13 | "INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n", 14 | "\n", 15 | "Importing and plotting vector data is all well and good, but we want to be able to do other things as well, like relating one dataset to another.\n", 16 | "\n", 17 | "This notebook covers selection of features via some spatial relationships between different GeoDataFrames and selecting the data that we are interested in. There is also some code given for plotting rose diagrams (adapted from Bruno Ruas de Pinho's [post](http://geologyandpython.com/structural_geology.html)), which may be of interest." 
18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import geopandas as gpd\n", 27 | "import pandas as pd\n", 28 | "import matplotlib.pyplot as plt\n", 29 | "import numpy as np" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "plt.rcParams[\"figure.figsize\"] = (8, 8)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "We are going to work with the runways located in South Africa, downloaded from OpenStreetMap. We will also load the outlines of each province from the Natural Earth dataset, and extract the South African ones. (Note: if you are actually working in South Africa, you should avoid these: they are very out of date now.)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "za_runways = gpd.read_file('../data/south_africa_runways.gpkg')\n", 55 | "provinces = gpd.read_file('zip://../data/ne_10m_admin_1_states_provinces.zip')\n", 56 | "za_provinces = provinces[provinces['adm0_a3'] == 'ZAF']" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "The `bearing_360` field in the runway dataset contains the direction each line that makes up a runway is pointing, which might be useful." 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "A rose diagram is quite a nice way to visualise this sort of data (anyone who has done a structural geology course will be familiar with these for various structural readings, and they are commonly used for wind direction as well).\n", 71 | "\n", 72 | "The easiest way to get a set of bins is by using the `histogram()` function from `numpy`. This returns two arrays: one with the count in each bin, and one with the bin edges.
We will use 36 bins (so each is 10 degrees)." 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": { 79 | "scrolled": false 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "bin_width_degrees = 10\n", 84 | "bin_edges = np.arange(-5, 366, bin_width_degrees)\n", 85 | "hist_counts, bin_edges = np.histogram(za_runways['bearing_360'], bin_edges)\n", 86 | "print(f'Hist counts: {hist_counts}\\nEdges: {bin_edges}')\n", 87 | "\n", 88 | "print(list(hist_counts[:18]))\n", 89 | "print(list(reversed(hist_counts[18:])))" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "We know that the first and last bins are the same (that is, bin 1 has edges at [-5, 5] and the last bin has edges [355, 365]), so we will make sure that we add those counts together. Then we can split the list in half before summing the counts from opposite bins." 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "hist_counts[0] += hist_counts[-1] # sum counts of first and last bins\n", 106 | "\n", 107 | "half = np.sum(np.split(hist_counts[:-1], 2), 0)\n", 108 | "two_halves = np.concatenate([half, half])\n", 109 | "print(len(two_halves))\n", 110 | "print(hist_counts)\n", 111 | "print(two_halves)\n", 112 | "print(list(two_halves[:18]))\n", 113 | "print(list(two_halves[18:]))" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "These get used as the parameters to the polar plot. `theta` is the centre of each bin, `radii` is the height of each bin, and `width` is how wide each bin needs to be."
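The bin-and-fold logic above is easier to follow on a handful of invented bearings than on the full runway table; a minimal numpy-only sketch:

```python
import numpy as np

bearings = np.array([2, 92, 178, 182, 271, 359])  # made-up bearings in degrees

bin_width = 10
edges = np.arange(-5, 366, bin_width)   # bins centred on 0, 10, ..., 360
counts, edges = np.histogram(bearings, edges)

counts[0] += counts[-1]  # the [-5, 5] and [355, 365] bins are the same direction

# Fold opposite bearings together (0 == 180), then mirror for the full circle.
half = np.sum(np.split(counts[:-1], 2), 0)
two_halves = np.concatenate([half, half])

print(two_halves.sum())  # 12: each of the 6 bearings is counted twice
```

Here the bearings 2 and 359 land in the north bin, and 178 and 182 fold onto it from the opposite direction, so the first folded bin holds 4 counts.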
121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "theta = np.linspace(0, 2 * np.pi, 36, endpoint=False)  # bin centres at 0, 10, ..., 350 degrees\n", 130 | "radii = two_halves\n", 131 | "width = np.radians(bin_width_degrees)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "Finally, we can set up the plot. Note the addition of `polar` to the `projection` parameter. This is what tells matplotlib that we want a circular plot. By using `set_theta_zero_location` and `set_theta_direction` we can get 0 at the top and 90 to the right, as we expect from a compass. The grid can be altered using `set_thetagrids` and `set_rgrids` for the spokes and the rings respectively." 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "fig = plt.figure()\n", 148 | "ax = fig.add_subplot(111, projection='polar')\n", 149 | "ax.bar(theta, radii, \n", 150 | " width=width, bottom=0.0, edgecolor='k')\n", 151 | "ax.set_theta_zero_location('N')\n", 152 | "ax.set_theta_direction(-1)\n", 153 | "ax.set_thetagrids(np.arange(0, 360, 10), labels=np.arange(0, 360, 10))\n", 154 | "ax.set_rgrids(np.arange(0, radii.max() + 5, 5), angle=0)\n", 155 | "ax.set_title('Directions of South African runways')" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "## Spatial Selections\n", 163 | "This is useful, but it covers the entire country. If we want to look at each province in turn, because there may be differences, we can select only things in the province.\n", 164 | "\n", 165 | "We will use the `within` function to see if a feature is within another. This returns a boolean (True/False) that we can use to select only the features for which this is true. 
There are a few of these worth noting:\n", 166 | "* `contains` is the opposite of `within`, and is True for features that entirely contain a given geometry.\n", 167 | "* `intersects` is true if the boundary or interior of either object overlaps the other in any way.\n", 168 | "* `touches` is true if at least one point is in common and interiors do not intersect.\n", 169 | "\n", 170 | "In addition to these, most other spatial operations that are present in standard GIS software are available." 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "First, we will select the province of KwaZulu-Natal and the runways within it, and then plot them." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "kzn = za_provinces[za_provinces['iso_3166_2'] == 'ZA-NL']\n", 187 | "kzn_runways = za_runways[za_runways.within(kzn.geometry.unary_union)]" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": { 201 | "tags": [ 202 | "hide" 203 | ] 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "print(f'All runways: {za_runways.shape}\\nKZN runways: {kzn_runways.shape}')" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": { 221 | "tags": [ 222 | "hide" 223 | ] 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "base = za_provinces.plot(color='lightgrey', edgecolor='darkgrey')\n", 228 | "kzn_runways.centroid.plot(ax=base, color='black', alpha=0.3)\n", 229 | "base.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n", 230 | "base.set_xlim(16, 33.2) # These limits are to
ignore the Gough Islands" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "We can see that this has only extracted values within the province's boundaries. We needed the `unary_union` because the geometry is that of a multipolygon." 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "## Exercise 1\n", 245 | "\n", 246 | "1. Convert the above code to a function for creating a rose diagram given a GeoDataFrame.\n", 247 | " 1. Test it on the extracted runway directions in KZN.\n", 248 | "2. Write a function that generates a rose diagram for each province. The easiest approach is to wrap the previous function. (Hint: you may find the `iterrows()` method that (geo)DataFrames have useful.)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "# Write a function to plot a rose diagram.\n", 258 | "def rose_diagram(bearings, title='Rose Diagram', degrees_in_bin=10):\n", 259 | " bin_width_degrees = degrees_in_bin\n", 260 | " num_bins = int(round(360 / degrees_in_bin, 0))\n", 261 | " \n", 262 | " bin_edges = np.arange(-bin_width_degrees / 2, 360 + bin_width_degrees, bin_width_degrees)  # centre the first bin on 0\n", 263 | " hist_counts, bin_edges = np.histogram(bearings, bin_edges)\n", 264 | " \n", 265 | " hist_counts[0] += hist_counts[-1] # sum counts of first and last bins\n", 266 | "\n", 267 | " half = np.sum(np.split(hist_counts[:-1], 2), 0) # assumes an even number of bins\n", 268 | " two_halves = np.concatenate([half, half])\n", 269 | " \n", 270 | " theta = np.linspace(0, 2 * np.pi, num_bins, endpoint=False)\n", 271 | " radii = two_halves\n", 272 | " width = np.radians(bin_width_degrees)\n", 273 | " \n", 274 | " fig = plt.figure()\n", 275 | " ax = fig.add_subplot(111, projection='polar')\n", 276 | " ax.bar(theta, radii, \n", 277 | " width=width, bottom=0.0, edgecolor='k')\n", 278 | " ax.set_theta_zero_location('N')\n", 279 | " ax.set_theta_direction(-1)\n", 280 | "
ax.set_thetagrids(np.arange(0, 360, 10), labels=np.arange(0, 360, 10))\n", 281 | " ax.set_rgrids(np.arange(0, radii.max() + 5, 5), angle=0)\n", 282 | " ax.set_title(title)" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "# Test the function on the runways in KwaZulu-Natal.\n" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": { 298 | "tags": [ 299 | "hide" 300 | ] 301 | }, 302 | "outputs": [], 303 | "source": [ 304 | "# Test the function on the runways in KwaZulu-Natal.\n", 305 | "rose_diagram(kzn_runways['bearing'], 'Direction of runways in KZN')" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "# Plot a rose diagram for each province.\n", 315 | "\n", 316 | "\n" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": null, 322 | "metadata": { 323 | "tags": [ 324 | "hide" 325 | ] 326 | }, 327 | "outputs": [], 328 | "source": [ 329 | "# Plot a rose diagram for each province.\n", 330 | "for index, province in za_provinces.iterrows():\n", 331 | " province_runways = za_runways[za_runways.within(province.geometry)]\n", 332 | " rose_diagram(province_runways['bearing_360'], f'Direction of runways in {province[\"name\"]}')" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "## Spatial Relationships\n", 340 | "\n", 341 | "Now we will take a look at a few places in KZN, and get an idea of some spatial relationships between towns/cities and runways." 
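Before computing relationships on real data, the predicates from the Spatial Selections section are easiest to check on toy shapely geometries (geopandas calls the same predicates element-wise); the square `province` here is invented:

```python
from shapely.geometry import Point, Polygon

province = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
inside = Point(2, 2)
on_edge = Point(4, 2)
outside = Point(5, 5)

print(inside.within(province))       # True
print(province.contains(inside))     # True: contains is within with the roles swapped
print(on_edge.touches(province))     # True: boundary point only, interiors disjoint
print(on_edge.within(province))      # False: within requires the interior
print(outside.intersects(province))  # False: no points in common at all
```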
342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": null, 347 | "metadata": {}, 348 | "outputs": [], 349 | "source": [ 350 | "za_places = gpd.read_file(\"../data/south_africa_places.gpkg\")" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "We can now look at how many airports are located within a certain range of a given town. First we will choose our town, then create a buffer around it, and see how many runways fall into that range. I am going to look at Durban, the largest city in KZN." 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": { 371 | "tags": [ 372 | "hide" 373 | ] 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "durban = za_places[za_places['name'] == 'Durban']\n", 378 | "print(durban.geometry)\n", 379 | "durban" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "Geopandas gives us some useful and common tools straight away, such as a buffer. We can easily put a circle around a given geometry." 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": {}, 393 | "outputs": [], 394 | "source": [] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": { 400 | "tags": [ 401 | "hide" 402 | ] 403 | }, 404 | "outputs": [], 405 | "source": [ 406 | "base = kzn.plot(color='lightgrey')\n", 407 | "durban.buffer(1).plot(ax=base, alpha=0.5)\n", 408 | "durban.plot(ax=base, zorder=4, color='red')" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "That looks a little big, given that our buffer was only `1`. 
If we look at the CRS of the place data, we can see that it works in degrees, so the buffer is one degree in each direction. We should convert this to a projected system that uses metres if we want to measure meaningful distances. We will try EPSG 3857, which uses metres as a unit of measure. (For our purposes, it is accurate enough, but if you need distances in a critical use-case, then you should use something more focused on your area of interest.)\n", 416 | "\n", 417 | "For smaller areas, a relevant UTM projection is good. Some countries also have well-defined ones of their own. https://gist.github.com/gubuntu/6403425 is of interest for Southern Africa generally. KZN is a little too big to fit into one UTM zone." 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": null, 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": { 431 | "tags": [ 432 | "hide" 433 | ] 434 | }, 435 | "outputs": [], 436 | "source": [ 437 | "kzn_runways_proj = kzn_runways.to_crs(epsg=3857)\n", 438 | "kzn_proj = kzn.to_crs(epsg=3857)\n", 439 | "durban_proj = durban.to_crs(epsg=3857)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "Notice how this also changes our axes when plotting:\n", 447 | "(We are plotting the centroid of each runway line here, to make the position more visible.)" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": null, 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": null, 460 | "metadata": { 461 | "tags": [ 462 | "hide" 463 | ] 464 | }, 465 | "outputs": [], 466 | "source": [ 467 | "base = kzn_proj.plot(color='lightgrey')\n", 468 | "durban_proj.buffer(100_000).plot(ax=base, alpha=0.5) # 100 000m is 100km\n", 469 | "durban_proj.plot(ax=base, zorder=4, color='red')\n", 470 | 
"kzn_runways_proj.centroid.plot(ax=base)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "Now we can use that buffer and select only the runways within that buffer." 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": null, 483 | "metadata": {}, 484 | "outputs": [], 485 | "source": [] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": { 491 | "tags": [ 492 | "hide" 493 | ] 494 | }, 495 | "outputs": [], 496 | "source": [ 497 | "durban_proj_buffer = durban_proj.buffer(100_000)\n", 498 | "near_durban = kzn_runways_proj[kzn_runways_proj.within(durban_proj_buffer.unary_union)]" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "We can now plot only those runways that fall within the buffer." 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": null, 511 | "metadata": {}, 512 | "outputs": [], 513 | "source": [ 514 | "fig, ax = plt.subplots()\n", 515 | "kzn_proj.plot(ax=ax, color='lightgrey')\n", 516 | "near_durban.centroid.plot(ax=ax)\n", 517 | "durban_proj_buffer.plot(ax=ax, alpha=0.3)\n", 518 | "durban_proj.plot(ax=ax, color='red')" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": {}, 524 | "source": [ 525 | "## Distances and Nearest Neighbours\n", 526 | "\n", 527 | "If we are interested in the distance to the nearest runway from one of our towns, we can write a short function that returns this value. A GeoDataFrame has a `distance` method that will work between any two geometries. Since we want to do it for a number of them in one go, we have to write a function to do it." 
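The idea can be sketched with invented coordinates before touching the real data; `nearest_km` below plays the same role as the notebook's hidden solution function, and EPSG:3857 distances are only approximate at South African latitudes:

```python
import geopandas as gpd
from shapely.geometry import Point

# One made-up town and two made-up runways, in lon/lat (EPSG:4326).
town = gpd.GeoSeries([Point(31.0, -29.85)], crs="EPSG:4326")
runways = gpd.GeoSeries([Point(30.0, -29.85), Point(30.9, -29.85)], crs="EPSG:4326")

def nearest_km(point, destination):
    """Distance from `point` to the nearest geometry in `destination`, in km."""
    return destination.distance(point).min() / 1000

# Reproject to a metre-based CRS before measuring anything.
dist_km = town.to_crs(epsg=3857).apply(nearest_km, args=(runways.to_crs(epsg=3857),))
print(dist_km.iloc[0])  # roughly 11 km: the 0.1 degree gap, not the 1 degree one
```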
528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": null, 533 | "metadata": {}, 534 | "outputs": [], 535 | "source": [] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "metadata": { 541 | "tags": [ 542 | "hide" 543 | ] 544 | }, 545 | "outputs": [], 546 | "source": [ 547 | "def min_distance(point, destination):\n", 548 | " '''\n", 549 | " Takes a point and returns the distance to the nearest geometry in `destination`.\n", 550 | " Expects to be working with metres as a unit of measure and returns a distance in km.\n", 551 | " '''\n", 552 | " return destination.distance(point).min() / 1000" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "The paradigm in pandas is to use `apply`, which performs the same function on each row in a dataframe.\n", 560 | "\n", 561 | "Since we know that the nearest runway will be one of the ones that we previously identified as being within the buffer zone, we will just use that. We could give it the whole dataset and get the same result, but this will be slightly quicker.\n", 562 | "\n", 563 | "The final comma in `args=(near_durban,)` is important: it makes `args` a one-element tuple rather than a bare value." 564 | ] 565 | }, 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "metadata": {}, 571 | "outputs": [], 572 | "source": [] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": { 578 | "scrolled": true, 579 | "tags": [ 580 | "hide" 581 | ] 582 | }, 583 | "outputs": [], 584 | "source": [ 585 | "durban_proj.geometry.apply(min_distance, args=(near_durban,))" 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "## Exercise 2\n", 593 | "\n", 594 | "1. What is the distance to the centre of the closest town to the runway with the `osm_id` of 87323625?\n", 595 | "2. What is the distance from Durban to the furthest town in South Africa? 
(Hint: you will need to modify the `min_distance` function to a `max_distance` version. Pay attention to the CRS.)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": {}, 602 | "outputs": [], 603 | "source": [ 604 | "# What is the distance to the centre of the closest town to the runway with the osm_id of 87323625?\n", 605 | "\n" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": null, 611 | "metadata": { 612 | "tags": [ 613 | "hide" 614 | ] 615 | }, 616 | "outputs": [], 617 | "source": [ 618 | "# What is the distance to the centre of the closest town to the runway with the osm_id of 87323625?\n", 619 | "runway87323625 = kzn_runways_proj[kzn_runways_proj['osm_id'] == '87323625']\n", 620 | "runway87323625.geometry.apply(min_distance, args=(za_places.to_crs(epsg=3857),))" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "metadata": {}, 627 | "outputs": [], 628 | "source": [ 629 | "# What is the distance from Durban to the furthest town in South Africa?\n", 630 | "\n", 631 | "\n", 632 | "\n" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": { 639 | "tags": [ 640 | "hide" 641 | ] 642 | }, 643 | "outputs": [], 644 | "source": [ 645 | "# What is the distance from Durban to the furthest town in South Africa?\n", 646 | "def max_distance(point, destination):\n", 647 | " return destination.distance(point).max() / 1000\n", 648 | "\n", 649 | "durban_proj.geometry.apply(max_distance, args=(za_places.to_crs(epsg=3857),))" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "## Multiple points\n", 657 | "\n", 658 | "We can also apply the `min_distance` function to every feature. 
This can be used, for example, to classify the runways by which town or city is closest to them.\n", 658 | "\n", 659 | "First we will limit our analysis to towns in KwaZulu-Natal." 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "metadata": {}, 666 | "outputs": [], 667 | "source": [ 668 | "kzn_places = za_places[za_places.within(kzn.geometry.unary_union)]\n", 669 | "kzn_places_proj = kzn_places.to_crs(epsg=3857)" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": {}, 676 | "outputs": [], 677 | "source": [ 678 | "base = kzn_proj.plot(color='lightgrey')\n", 679 | "kzn_runways_proj.centroid.plot(ax=base)\n", 680 | "kzn_places_proj.plot(ax=base, color='red', marker='+')" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "Using the `apply` function on the whole GeoDataFrame will get us a result for all the features. We can use that to populate a new column in the runway dataset with the minimum distance to a town or city in the places dataset." 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": null, 693 | "metadata": {}, 694 | "outputs": [], 695 | "source": [ 696 | "kzn_runways_proj['min_dist_to_place'] = kzn_runways_proj.geometry.apply(min_distance, args=(kzn_places_proj,))\n", 697 | "kzn_runways_proj['min_dist_to_place'].describe()" 698 | ] 699 | }, 700 | { 701 | "cell_type": "markdown", 702 | "metadata": {}, 703 | "source": [ 704 | "The runways end up plotting very small at a province-wide scale, so we will change the geometry to be a Point instead. The easiest way is to obtain the centroid of each runway."
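The geometry swap can be sketched on two invented runway lines; note that the old lines are parked in a plain column so nothing is lost:

```python
import geopandas as gpd
from shapely.geometry import LineString

runways = gpd.GeoDataFrame(
    {"name": ["09L/27R", "04/22"]},  # invented runway designators
    geometry=[LineString([(0, 0), (2000, 0)]), LineString([(0, 0), (0, 1000)])],
    crs="EPSG:3857",
)

runways["line_geom"] = runways.geometry   # keep the original lines
runways["geometry"] = runways.centroid    # active geometry is now a point

print(runways.geometry.iloc[0])  # POINT (1000 0), the midpoint of the first line
```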
705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": null, 710 | "metadata": {}, 711 | "outputs": [], 712 | "source": [ 713 | "kzn_runways_proj['line_geom'] = kzn_runways_proj.geometry\n", 714 | "kzn_runways_proj['geometry'] = kzn_runways_proj.centroid\n", 715 | "kzn_runways_proj.geometry" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "Using the `scheme` parameter, we can also take a column and group the values using `mapclassify` options." 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": null, 728 | "metadata": {}, 729 | "outputs": [], 730 | "source": [ 731 | "base = kzn_proj.plot(color='lightgrey')\n", 732 | "kzn_runways_proj.plot(ax=base, column='min_dist_to_place', legend=True, scheme='EqualInterval', k=7)\n", 733 | "kzn_places_proj.plot(ax=base, color='red', marker='+')" 734 | ] 735 | }, 736 | { 737 | "cell_type": "markdown", 738 | "metadata": {}, 739 | "source": [ 740 | "## Exercise\n", 741 | "\n", 742 | "1. Plot the towns and cities in KwaZulu-Natal coloured by the distance from Durban." 
743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "metadata": {}, 749 | "outputs": [], 750 | "source": [ 751 | "\n", 752 | "\n", 753 | "\n", 754 | "\n" 755 | ] 756 | }, 757 | { 758 | "cell_type": "code", 759 | "execution_count": null, 760 | "metadata": { 761 | "tags": [ 762 | "hide" 763 | ] 764 | }, 765 | "outputs": [], 766 | "source": [ 767 | "kzn_durban = kzn_places_proj.copy()\n", 768 | "kzn_durban['dist_durban'] = kzn_places_proj.geometry.apply(min_distance, args=(durban_proj,))\n", 769 | "base = kzn_proj.plot(color='lightblue', edgecolor='black')\n", 770 | "kzn_durban.plot(ax=base, column='dist_durban', scheme='jenkscaspallforced', k=6, legend=True, cmap='Reds_r', marker='+')\n", 771 | "durban_proj.plot(ax=base, color='red')" 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": {}, 777 | "source": [ 778 | "## More Information from Neighbours\n", 779 | "\n", 780 | "So far we have only been able to obtain the distance to the nearest neighbour, but nothing else from the second GeoDataFrame.\n", 781 | "\n", 782 | "We need a new function that can get the nearest points to a given geometry. To do that we will use an operation imported from the `shapely` module." 
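`nearest_points` itself is simple to try on toy geometries before wiring it into a GeoDataFrame; the coordinates below are invented:

```python
from shapely.geometry import MultiPoint, Point
from shapely.ops import nearest_points

runway = Point(0, 0)
places = MultiPoint([(3, 4), (10, 0), (-6, 8)])

# Returns one point from each input geometry, in the same order as the inputs.
on_runway, on_places = nearest_points(runway, places)
print(on_places.x, on_places.y)  # 3.0 4.0 -- the place 5 units away, not 10
```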
783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": null, 788 | "metadata": {}, 789 | "outputs": [], 790 | "source": [ 791 | "from shapely.ops import nearest_points\n", 792 | "\n", 793 | "def calculate_nearest(row, destination, val, col=\"geometry\"):\n", 794 | " ''' Returns the value from a given column for the nearest feature\n", 795 | " of the `destination` GeoDataFrame.\n", 796 | " \n", 797 | " row: The object of interest\n", 798 | " destination: GeoDataFrame of possible locations, of which one is the closest.\n", 799 | " val: The column containing the values which the destination will give back.\n", 800 | " col: If your row's geometry is not `geometry`, change it here.\n", 801 | " \n", 802 | " returns: The value of the `val` column of the feature nearest to `row`.\n", 803 | " '''\n", 804 | " dest_unary = destination[\"geometry\"].unary_union\n", 805 | " nearest_geom = nearest_points(row[col], dest_unary)\n", 806 | " match_geom = destination.loc[destination.geometry == nearest_geom[1]]\n", 807 | " match_value = match_geom[val].to_numpy()[0]\n", 808 | " return match_value" 809 | ] 810 | }, 811 | { 812 | "cell_type": "markdown", 813 | "metadata": {}, 814 | "source": [ 815 | "This function allows us to ask for specific values back about the nearest object. Again, we will find the nearest place to each runway, but we will request the `name` of that place and save it as a column in the runways GeoDataFrame." 
816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "\n" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": null, 830 | "metadata": {}, 831 | "outputs": [], 832 | "source": [ 833 | "kzn_runways_proj[\"nearest_place_name\"] = kzn_runways_proj.apply(\n", 834 | " calculate_nearest, destination=kzn_places_proj, val=\"name\", axis=1)" 835 | ] 836 | }, 837 | { 838 | "cell_type": "code", 839 | "execution_count": null, 840 | "metadata": {}, 841 | "outputs": [], 842 | "source": [ 843 | "kzn_runways_proj[\"nearest_place_name\"].head(10)" 844 | ] 845 | }, 846 | { 847 | "cell_type": "markdown", 848 | "metadata": {}, 849 | "source": [ 850 | "We can now plot these and get an idea of where the nearest urban centre to each runway is." 851 | ] 852 | }, 853 | { 854 | "cell_type": "code", 855 | "execution_count": null, 856 | "metadata": {}, 857 | "outputs": [], 858 | "source": [ 859 | "fig, ax = plt.subplots(figsize=(10,10))\n", 860 | "kzn_proj.plot(ax=ax, zorder=-4, color='lightgrey')\n", 861 | "kzn_places_proj.plot(ax=ax, color='black', marker='+',)\n", 862 | "kzn_runways_proj.plot(ax=ax, column='nearest_place_name', legend=True, cmap='tab20b')\n", 863 | "ax.set_title('Nearest major place to runways in KZN')\n", 864 | "ax.set_xlim(xmin=kzn_proj.total_bounds[0] - 10_000, xmax=kzn_proj.total_bounds[2] + 135_000)" 865 | ] 866 | }, 867 | { 868 | "cell_type": "markdown", 869 | "metadata": {}, 870 | "source": [ 871 | "## Exercise 3\n", 872 | "\n", 873 | "1. Is there a spatial trend in the bearing of runways in KZN? Plot the position of each runway and colour them by bearing.\n", 874 | "2. Plot the runways within 150km of Pietermaritzburg and colour them by runway length. This plot should only cover the area within 150 km of Pietermaritzburg.
875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": null, 880 | "metadata": {}, 881 | "outputs": [], 882 | "source": [ 883 | "# Is there a spatial trend in the bearing of runways in KZN?\n", 884 | "# Plot the position of each runway and colour them by bearing.\n", 885 | "\n", 886 | "\n", 887 | "\n" 888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "execution_count": null, 893 | "metadata": { 894 | "tags": [ 895 | "hide" 896 | ] 897 | }, 898 | "outputs": [], 899 | "source": [ 900 | "# Is there a spatial trend in the bearing of runways in KZN?\n", 901 | "# Plot the position of each runway and colour them by bearing.\n", 902 | "fig, ax = plt.subplots(figsize=(10,10))\n", 903 | "kzn_proj.plot(ax=ax, zorder=-4, color='lightgrey')\n", 904 | "kzn_runways_proj.plot(ax=ax, column='bearing', legend=True, scheme='EqualInterval')\n", 905 | "#kzn_places_proj.plot(ax=ax, color='black')\n", 906 | "ax.set_title('Bearing of runways in KZN')" 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": null, 912 | "metadata": {}, 913 | "outputs": [], 914 | "source": [ 915 | "# Plot the runways within 150 km of Pietermaritzburg and colour them by runway length.\n", 916 | "# This plot should only cover the area within 150 km of Pietermaritzburg.\n", 917 | "\n" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": null, 923 | "metadata": { 924 | "tags": [ 925 | "hide" 926 | ] 927 | }, 928 | "outputs": [], 929 | "source": [ 930 | "# Plot the runways within 150 km of Pietermaritzburg and colour them by runway length.\n", 931 | "# This plot should only cover the area within 150 km of Pietermaritzburg.\n", 932 | "pmb_buffer = kzn_places_proj[kzn_places_proj['name'] == 'Pietermaritzburg'].buffer(150_000)\n", 933 | "kzn_runways_proj[kzn_runways_proj.within(pmb_buffer.geometry.unary_union)].plot(column='length')" 934 | ] 935 | }, 936 | { 937 | "cell_type": "markdown", 938 | "metadata": {}, 939 | "source": [ 940 | "
\n", 941 | "\n", 942 | "

© 2020 Agile Geoscience CC-BY

" 943 | ] 944 | } 945 | ], 946 | "metadata": { 947 | "celltoolbar": "Tags", 948 | "kernelspec": { 949 | "display_name": "Python 3", 950 | "language": "python", 951 | "name": "python3" 952 | }, 953 | "language_info": { 954 | "codemirror_mode": { 955 | "name": "ipython", 956 | "version": 3 957 | }, 958 | "file_extension": ".py", 959 | "mimetype": "text/x-python", 960 | "name": "python", 961 | "nbconvert_exporter": "python", 962 | "pygments_lexer": "ipython3", 963 | "version": "3.8.3" 964 | } 965 | }, 966 | "nbformat": 4, 967 | "nbformat_minor": 4 968 | } -------------------------------------------------------------------------------- /master/04-geopandas-joins_and_spatial_joins-master.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Joins and merging data\n", 8 | "\n", 9 | "DATE: 12 June 2020, 08:00 - 11:00 UTC\n", 10 | "\n", 11 | "AUDIENCE: Intermediate\n", 12 | "\n", 13 | "INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n", 14 | "\n", 15 | "Many times we are interested in combining two datasets, but we may only want the area that is covered by another one. Spatial joins are a relatively complex topic, so this will give a brief overview that will hopefully be useful." 
16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import geopandas as gpd\n", 25 | "import pandas as pd\n", 26 | "import matplotlib.pyplot as plt\n", 27 | "import numpy as np" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "provinces = gpd.read_file('zip://../data/ne_10m_admin_1_states_provinces.zip')\n", 37 | "za_provinces = provinces[provinces['sov_a3'] == 'ZAF']\n", 38 | "employment = pd.read_csv('../data/za_province_employment.csv')" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "Geopandas differentiates between spatial joins and attribute joins. An **attribute** join works more or less the same as in standard pandas, and uses values that are common to both. A **spatial** join uses the geometry of each dataframe." 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Attribute Joins\n", 53 | "\n", 54 | "This works by finding a common value in two dataframes and creating a new dataframe using the common value to add values to an existing feature. In pandas one uses the `merge` function to do this. These joins are very common when you have existing data that needs to be combined with existing geometry. A common example would be adding demographic data to administrative districts.\n", 55 | "\n", 56 | "In this case, we can see that the employment data has a `Province` attribute. We can link that to the `name` attribute in `za_provinces`. Further, there is no geometry associated with the employment data, so we have no way of seeing if there are any spatial trends to the data, unless we have a very good mental image of South Africa's provinces."
57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "employment" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "ax = employment.plot(kind='bar')" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "za_provinces" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "We can now merge on these, which will give us a dataframe that adds the value associated with a given province in the employment data to each province." 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "merged_provinces = za_provinces.merge(employment, left_on='name', right_on='Province')\n", 100 | "merged_provinces" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "As we can see, we have added the columns from `employment` to our `za_provinces` geodataframe, which we can treat normally. Also worth noting is that we have lost the row with the totals of each class, because there is no province named 'Total'. Since this is now a standard geodataframe, we can easily make a plot based on the employment data for each province." 
108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "ax = merged_provinces.plot('Total', scheme='NaturalBreaks', k=5, legend=True, edgecolor='white', cmap='cividis')\n", 117 | "ax.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n", 118 | "ax.set_xlim(16, 33.2) # These limits are to ignore the Gough Islands\n", 119 | "the_legend = ax.get_legend()\n", 120 | "the_legend.set_bbox_to_anchor((1.7,1))\n", 121 | "plt.title('Population in South Africa per Province')\n", 122 | "ax2 = merged_provinces.plot(merged_provinces['Unemployed']/merged_provinces['Total']*100, scheme='NaturalBreaks', k=5, legend=True, edgecolor='white', cmap='cividis')\n", 123 | "ax2.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n", 124 | "ax2.set_xlim(16, 33.2) # These limits are to ignore the Gough Islands\n", 125 | "the_legend = ax2.get_legend()\n", 126 | "the_legend.set_bbox_to_anchor((1.45,1))\n", 127 | "plt.title('Percentage unemployed in each province')" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "## Spatial Joins\n", 135 | "\n", 136 | "These work by looking at the geometry of two different geodataframes and relating them to each other.\n", 137 | "\n", 138 | "For example, this river dataset has no information on which country or province a river is in, but that may be of interest for some reason." 
139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "rivers = gpd.read_file('zip://../data/ne_10m_rivers_lake_centerlines_trimmed.zip')\n", 148 | "rivers" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Geopandas offers us the `sjoin` method to spatially join two different geodataframes.\n", 156 | "\n", 157 | "This takes two geodataframes (the first is `left` and the second is `right`).\n", 158 | "\n", 159 | "The `op` parameter controls how things are related to each other, using the `shapely` library's [binary predicates](https://shapely.readthedocs.io/en/latest/manual.html#binary-predicates):\n", 160 | "* `intersects` - True if the objects have any boundary or interior point in common.\n", 161 | "* `contains` - True if no points of other lie in the exterior of the object and at least one point of the interior of other lies in the interior of object.\n", 162 | "* `within` - True if the object’s boundary and interior intersect only with the interior of the other.\n", 163 | "\n", 164 | "The `how` parameter controls which geometry is kept:\n", 165 | "* `'left'` uses keys from left; retains only the left geometry column" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "za_rivers_left = gpd.sjoin(za_provinces, rivers, how=\"left\", op='intersects')\n", 175 | "base = za_provinces.plot(color='lightgrey', edgecolor='black')\n", 176 | "za_rivers_left[za_rivers_left['sov_a3'] == 'ZAF'].plot(ax=base)\n", 177 | "\n", 178 | "base.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n", 179 | "base.set_xlim(16, 33.2) # These limits are to ignore the Gough Islands\n", 180 | "za_rivers_left" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "* `'right'` uses
keys from right; retains only the right geometry column (note that this means that all the rivers will still be present, but only those which can be matched to a province in `za_provinces` will have values from `za_provinces`).\n", 188 | "\n", 189 | "Note that we have rivers that extend beyond the border, because we are only looking for intersecting geometries. Try a different `op` ('contains' or 'within') to see what effect that has. Note also that some rivers are now present twice, because they are within multiple provinces, so get selected more than once." 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "za_rivers_right = gpd.sjoin(za_provinces, rivers, how=\"right\", op='intersects')\n", 199 | "base = za_provinces.plot(color='lightgrey', edgecolor='black')\n", 200 | "za_rivers_right[za_rivers_right['sov_a3'] == 'ZAF'].plot(ax=base)\n", 201 | "\n", 202 | "base.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n", 203 | "base.set_xlim(16, 33.2) # These limits are to ignore the Gough Islands\n", 204 | "#za_rivers[za_rivers['sov_a3'] == 'ZAF']" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "* `'inner'` uses the intersection of keys from both geodataframes; retains only the left geometry column" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "za_rivers_inner = gpd.sjoin(za_provinces, rivers, how=\"inner\", op='intersects')\n", 221 | "\n", 222 | "za_rivers_inner" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "Comparing all three then:" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": {}, 236 | "outputs": [], 237 | "source": [ 238 | "print(f'Left: {za_rivers_left.shape}\nRight: 
{za_rivers_right.shape}\\nInner: {za_rivers_inner.shape}')" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "Note that for these datasets, we expect Left and Inner to be the same. The main difference is in whether we keep records that are only in right or not.\n", 246 | "
\n", 247 | "\n", 248 | "

© 2020 Agile Geoscience CC-BY

" 249 | ] 250 | } 251 | ], 252 | "metadata": { 253 | "kernelspec": { 254 | "display_name": "Python 3", 255 | "language": "python", 256 | "name": "python3" 257 | }, 258 | "language_info": { 259 | "codemirror_mode": { 260 | "name": "ipython", 261 | "version": 3 262 | }, 263 | "file_extension": ".py", 264 | "mimetype": "text/x-python", 265 | "name": "python", 266 | "nbconvert_exporter": "python", 267 | "pygments_lexer": "ipython3", 268 | "version": "3.8.3" 269 | } 270 | }, 271 | "nbformat": 4, 272 | "nbformat_minor": 4 273 | } -------------------------------------------------------------------------------- /master/05-geopandas-basemaps_with_contextily-master.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Background Maps\n", 8 | "\n", 9 | "DATE: 12 June 2020, 08:00 - 11:00 UTC\n", 10 | "\n", 11 | "AUDIENCE: Intermediate\n", 12 | "\n", 13 | "INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n", 14 | "\n", 15 | "It is often useful to add a background map to a given dataset. There are a few tools that can do this, such as [`folium`](https://python-visualization.github.io/folium/) and [`ipyleaflet`](https://ipyleaflet.readthedocs.io/en/latest/) which build directly on the `leaflet` library for JavaScript. However, for a simple approach, we will stick to [`contextily`](https://github.com/darribas/contextily). There are some additional capabilities that we will not go over in this, but it should be enough to get started with.\n", 16 | "\n", 17 | "Note: the inital load of tiles can take quite a while as they get downloaded. Subsequent loads of the same tiles will be much quicker as they cache locally.\n", 18 | "\n", 19 | "This notebook is intended more as a demonstration of using contextily to add basemaps in a simple way." 
20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "import geopandas as gpd\n", 29 | "import matplotlib.pyplot as plt\n", 30 | "import contextily as ctx" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "plt.rcParams[\"figure.figsize\"] = (8, 8)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "First we will get some point data, in this case mines in Tanzania. Geopandas can download the file and import it directly from the source at [Geological and Mineral Information System](https://www.gmis-tanzania.com/) by the Geological Survey of Tanzania. If this download does not work, it is in the repo as `data/tanzania_mines.zip`." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "fname = 'https://www.gmis-tanzania.com/download/mines.zip'\n", 56 | "mines = gpd.read_file(fname)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "mines" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "This can easily be plotted, as we have already done." 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "mines.plot(column='miningexpl', legend=True)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "In order to gain more context, we can plot this over a basemap of some kind. By default, `contextily` uses the Stamen Terrain tiles." 
89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "base = mines.plot(column='miningexpl', legend=True)\n", 98 | "ctx.add_basemap(base, crs=mines.crs)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "Something worth noting is that the basemap is easily projected by giving it the `mines.crs` as a parameter. This is needed since the dataset uses a local projected CRS, but we can largely ignore it." 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "mines.crs" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "We can switch this to use lat-lon easily enough, by reprojecting the data to something like the WGS84 datum first." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "mines_deg = mines.to_crs(epsg=4326)\n", 131 | "base = mines_deg.plot(column='miningexpl', legend=True, alpha=0.75)\n", 132 | "ctx.add_basemap(base, crs=mines_deg.crs)" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "## Changing the Basemap\n", 140 | "\n", 141 | "The leaflet providers are available in contextily, which allows for a variety of different styles and looks." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "ctx.providers.keys()" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "Many of these may have specific styles, which will look different.
Compare the default [Mapnik style](https://www.openstreetmap.org/search?query=Kinshasa#map=13/-4.3385/15.3131) to the [Transport style](https://www.openstreetmap.org/search?query=Kinshasa#map=13/-4.3385/15.3131&layers=T), for example. Some of these styles may require API keys to use, and many will have usage limits of some kind." 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "print(ctx.providers.OpenStreetMap.keys())\n", 167 | "print(ctx.providers.Thunderforest.keys())\n", 168 | "print(ctx.providers.Esri.keys())" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "Changing the basemap to use one of these is as simple as changing the `source` parameter. We can also make it fade a little by setting the `alpha` parameter." 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "fig, ax = plt.subplots()\n", 185 | "mines_deg.plot(ax=ax, column='miningexpl', legend=True, alpha=0.75)\n", 186 | "ctx.add_basemap(ax, crs=mines_deg.crs,\n", 187 | " source=ctx.providers.OpenStreetMap.Mapnik, alpha=0.8)" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "It is also possible to load custom tilemaps, if they support the standard XYZ format. This is useful if you have created one using your own data somewhere. We will use the tiling server hosted by the government of New South Wales.\n", 195 | "\n", 196 | "It is also possible to request tiles based on an extent, which needs to be either in WGS84 (EPSG 4326) or Pseudo-Mercator (EPSG 3857)."
197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "src = 'http://maps.six.nsw.gov.au/arcgis/rest/services/public/NSW_Base_Map/MapServer/tile/{z}/{y}/{x}'\n", 206 | "west, south, east, north = 16783076.1, -4041012.6, 16851459.8, -3988135.3" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "The `bounds2img` method will download the tiles within a given bounding box as a three band array. The `ll` parameter specifies whether your data is in lon-lat or Pseudo-Mercator." 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "sydney_img, sydney_ext = ctx.bounds2img(west, south, east, north,\n", 223 | " source=src, ll=False, zoom=10)\n", 224 | "print(sydney_img.shape)\n", 225 | "plt.imshow(sydney_img, extent=sydney_ext)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## Downloading Basemaps\n", 233 | "\n", 234 | "While downloaded basemaps are cached locally (try re-running one of the above cells; it should be much quicker), sometimes we may want to download them to use elsewhere or to save the bandwidth. Contextily can do that easily." 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "Contextily sets a default zoom based on the extents, but we can change that if we want or need to. Higher zoom levels mean downloading more tiles, but with higher resolution."
242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "ctx.howmany(west, south, east, north, 7, ll=False)" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "ctx.howmany(west, south, east, north, 10, ll=False)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "ctx.howmany(west, south, east, north, 12, ll=False)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "These will look different, because they are optimised to be viewed at a different zoom level. These changes include which features are shown, smoothing of lines, size of labels, and so on. The zoom levels for OpenStreetMap can be seen [here](https://wiki.openstreetmap.org/wiki/Zoom_levels). Most providers will be very similar if not the same.\n", 276 | "\n", 277 | "Note: do not try and download large areas at high resolution unless absolutely necessary. Many providers will cut off access for excessive use, or may have a limited number of requests for a given service tier.\n", 278 | "\n", 279 | "The following two blocks illustrate the difference in number of tiles downloaded at given zoom levels. Given the limitations of the browser, they might be clearer by looking at the downloaded files instead." 
280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": {}, 286 | "outputs": [], 287 | "source": [ 288 | "sydney_10_img, sydney_10_ext = ctx.bounds2raster(west,\n", 289 | " south,\n", 290 | " east,\n", 291 | " north,\n", 292 | " \"sydney_z10.tif\",\n", 293 | " source=src,\n", 294 | " ll=False,\n", 295 | " zoom=10\n", 296 | " )\n", 297 | "plt.imshow(sydney_10_img, extent=sydney_10_ext)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "sydney_12_img, sydney_12_ext = ctx.bounds2raster(west,\n", 307 | " south,\n", 308 | " east,\n", 309 | " north,\n", 310 | " \"sydney_z12.tif\",\n", 311 | " source=src,\n", 312 | " ll=False,\n", 313 | " zoom=12\n", 314 | " )\n", 315 | "plt.imshow(sydney_12_img, extent=sydney_12_ext)" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "We can load a saved tiff using something like `rasterio`, or in ArcGIS/QGIS." 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "import numpy as np\n", 332 | "import rasterio as rio\n", 333 | "from rasterio.plot import show as rioshow\n", 334 | "\n", 335 | "with rio.open(\"sydney_z10.tif\") as r:\n", 336 | " rioshow(r)" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "## Geocoding in `Contextily`\n", 344 | "\n", 345 | "A really nice feature to have is being able to download basemaps given a placename. This is made very simple in contextily, through use of the `geopy` geocoder. The places can be countries, cities, or other places."
346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "paraguay = ctx.Place('Paraguay', source=ctx.providers.Esri.DeLorme)\n", 355 | "paraguay" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "This can be used as a basemap with existing data as already shown." 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "rivers = gpd.read_file('zip://../data/ne_10m_rivers_lake_centerlines_trimmed.zip')\n", 372 | "#rivers_clipped = rivers[rivers.intersects(paraguay.bbox)]\n", 373 | "base = rivers.plot(color='red')\n", 374 | "ctx.plot_map(paraguay, ax=base, axis_off=False)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": {}, 381 | "outputs": [], 382 | "source": [ 383 | "ctx.Place('Geneva', source=ctx.providers.CartoDB.Positron).plot()" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "ctx.Place('Union Buildings', source=ctx.providers.Wikimedia, zoom=19).plot()" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "
\n", 400 | "\n", 401 | "

© 2020 Agile Geoscience CC-BY

" 402 | ] 403 | } 404 | ], 405 | "metadata": { 406 | "kernelspec": { 407 | "display_name": "Python 3", 408 | "language": "python", 409 | "name": "python3" 410 | }, 411 | "language_info": { 412 | "codemirror_mode": { 413 | "name": "ipython", 414 | "version": 3 415 | }, 416 | "file_extension": ".py", 417 | "mimetype": "text/x-python", 418 | "name": "python", 419 | "nbconvert_exporter": "python", 420 | "pygments_lexer": "ipython3", 421 | "version": "3.8.3" 422 | } 423 | }, 424 | "nbformat": 4, 425 | "nbformat_minor": 4 426 | } -------------------------------------------------------------------------------- /notebooks/01-geopandas-cleaning_data.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{},"source":["# Cleaning a shapefile\n","\n","DATE: 12 June 2020, 08:00 - 11:00 UTC\n","\n","AUDIENCE: Intermediate\n","\n","INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n","\n","When processing data, we are often not lucky enough to have it perfectly useable immediately. This notebook works through loading, cleaning and saving a shapefile, using `geopandas`, an extension for `pandas` that adds facility for spatial processing.\n","\n","#### Note\n","Much of this is standard data cleaning, and does not rely on geopandas per se, except that the data that we want to clean is in a geospatial data format, such as a shapefile. Most of these tools are the same in standard `pandas`, but have been extended in geopandas to work with spatial indices.\n","\n","This notebook is provided more as an example of data cleaning, which is common when dealing with real data. 
The result of this notebook can easily be used in whatever GIS software you prefer, since it is a standard shapefile."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import geopandas as gpd\n","import pandas as pd\n","import numpy as np"]},{"cell_type":"markdown","metadata":{},"source":["We will start by loading and having a look at the data that we have available."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["Something that we may be interested in is the different companies that have operated in this field. This is equivalent to looking at a column in a spreadsheet."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["If we want to get an idea of all the companies present, we can use the `set` function, which returns the unique values from a list-like object:"]},{"cell_type":"code","execution_count":null,"metadata":{"scrolled":true},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["A number of these companies should probably be consolidated. The most straightforward way to do this is by diving into the dark art of regular expressions. 
We will make a dictionary of what to look for as the key and what to replace it with as the value."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["replacements = {\n"," r'\\-': '', # remove '-'\n"," r' et al': '', # remove ' et al'\n"," r' Cda': '', # remove ' Cda'\n"," r'EnCana.*$': 'EnCana', # change 'EnCana' followed by anything to 'EnCana'\n"," r'PanCanadian(\\-|.*)\\n.*': 'PanCanadian', # strip the odd characters after 'PanCanadian'\n"," r'Mobil.*$': 'Mobil', # strip anything following 'Mobil'\n"," r'Shell.*$': 'Shell', # strip anything following 'Shell'\n"," r'Exxonmobil': 'ExxonMobil', # correct capitalisation of 'Exxonmobil' to 'ExxonMobil'\n"," r'Petocan': 'PetroCan', # correct spelling\n"," r'Petrocan': 'PetroCan', # correct capitalisation of 'Petrocan' to 'PetroCan'\n"," r'PetroCan.*$': 'PetroCan', # strip anything following 'PetroCan'\n"," r'^Husky.*\\n.*$': 'HBV', # convert anything starting with 'Husky' to 'HBV' after stripping new line\n"," r'^Bow Valley.*\\n.*$': 'BVH', # convert anything starting with 'Bow Valley' to 'BVH' after stripping new line\n"," r'HBV.*$': 'HBV', # strip anything following 'HBV'\n"," r'BVH.*$': 'BVH', # strip anything following 'BVH'\n"," r'Pex/Tex': 'Pex', # convert 'Pex/Tex' to 'Pex'\n"," r'Candian Sup/': 'Canadian Superior', # correct typo 'Candian Sup/' to 'Canadian Superior'\n"," r'Canadian Sup\\.': 'Canadian Superior', # expand 'Canadian Sup.' to 'Canadian Superior'\n","}"]},{"cell_type":"markdown","metadata":{},"source":["In this case, we are going to create a new column (`Owner`) to store the cleaned data, in case we need to retrieve the exact company for some reason. We could change the original GeoDataFrame column by using the `inplace=True` argument to the `replace` method."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["## Exercise 1\n","\n","1. 
Print a list of the unique values in the `Well_Type` Series.\n","2. Clean up the `Well_Type` Series to remove the typos and make the data more consistent. We can do this in-place, because the original data does not really give us any additional information. (Hint: look at the `inplace=True` parameter to do this to the original GeoDataFrame.)\n"," - Change 'Exploratory' to 'Exploration'\n"," - Change the typo 'Develpoment' to 'Development'\n"," - Remove the new line, by changing `\n&` to `''`\n"," - Remove excess whitespace by changing `\s+` to `' '`"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Print a list of the unique values in the `Well_Type` Series.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Clean up the `Well_Type` Series to remove typos and make the data more consistent. Do this in-place.\n","# \n","replacements = {\n"," 'key': 'changed_to',\n"," r'\\n&': '',\n"," r'\\s+': ' ',\n"," 'key2': 'changed_to',\n","}\n","\n","well_data['Well_Type'].replace(regex=replacements, inplace=True)\n","well_data['Well_Type']"]},{"cell_type":"markdown","metadata":{},"source":["## Cleaning Column Names\n","\n","The current column names are not very helpful in some cases, with weird codes and similar. We can probably make these more understandable, and (geo)pandas makes it easy to do so.\n","\n","#### Note for Shapefile\n","The maximum length of field names in a shapefile is 10 characters. Some other formats, such as `.gpkg`, do not have this limitation.\n","\n","______________\n","\n","We can start by getting the current column names."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["Some of these are clearly cut off due to needing to be less than 10 characters in length, for example `Well_Termi` and `seafl_twt_`. 
In other cases, we can see that there are duplicates, for example `Well_Name` and `Well_Nam_1`, without any indication of what the difference is. We can do better, even with the limits of the Shapefile format.\n","\n","We can get a feel for the data with the `head()` method, as used above. The `set()` function is also often helpful to see what different values are in text columns, and may give us a better idea what the data is describing, as we have already seen."]},{"cell_type":"markdown","metadata":{},"source":["There is a `rename()` method that can take a `dict` with existing Series names as keys and new names as values. We will change `Well_Termi` to `Well_End` and `Well_Nam_1` to `Well_Code`."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["well_data.head()"]},{"cell_type":"markdown","metadata":{},"source":["As you can see, `well_data` is not changed, since `rename()` returns a copy of the GeoDataFrame. To change it instead of getting a copy, the `inplace=True` option should be added, or the copy assigned to another variable."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["well_data.rename(columns=series_names, inplace=True)\n","# We can also do this with the following:\n","# well_data = well_data.rename(columns=series_names)"]},{"cell_type":"markdown","metadata":{},"source":["Now it works!"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["print(well_data.columns)\n","well_data.head()"]},{"cell_type":"markdown","metadata":{},"source":["### Exercise 2\n","\n","1. What are the different values of `Well_Symb`?\n","2. What are the different values of `Drilling_U`? In particular, what are the different entries in `Drilling_U` referring to, and what might be a more descriptive name?\n","3. 
Change the following column names in the DataFrame:\n"," * `Total_De_1` to `Dpth_ft`\n"," * `Total_Dept` to `Dpth_m`\n"," * `seafl_twt_` to `FloorTWT`\n"," * `Drilling_U` to something based on the previous answer."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# What are the different values in `Well_Symb`?\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# What are the different values in `Drilling_U`?\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["series_names = {\n","}"]},{"cell_type":"markdown","metadata":{},"source":["If we are changing a Shapefile, remember that we cannot have Series names longer than 10 characters. This will check the new names for you, and only rename the columns if they all fit."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["too_long = [name for name in series_names.values() if len(name) > 10]\n","for name in too_long:\n"," print(f'{name} is longer than 10 characters. Will not be able to save as Shapefile.')\n","if not too_long:\n"," well_data.rename(columns=series_names, inplace=True)\n","well_data.columns"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["well_data.head()"]},{"cell_type":"markdown","metadata":{},"source":["## Datetimes\n","\n","If you are familiar with `pandas`, then you will know the utility of `datetime`s. We have some dates in the data, so we should make sure that they are correctly imported if we want to use that for anything involving time series analysis.\n","\n","#### Note:\n","It is not possible to write a `datetime` to a Shapefile. If you want to do analysis that uses time series, then you may want to save a cleaned dataframe _before_ you convert to `datetime`s. Alternatively, save the GeoDataFrame as a geopackage or similar format that can handle a `datetime`."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["As we see, these dates are stored as strings. 
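A quick check on a toy frame makes the point (the dates here are made up; only the `Spud_Date` name is taken from the data):

```python
import pandas as pd

# Made-up dates standing in for the Spud_Date column; read from file, they arrive as strings
toy = pd.DataFrame({'Spud_Date': ['1999-06-09', '2001-11-20']})
print(toy['Spud_Date'].dtype)  # object, i.e. plain Python strings
```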
We can easily convert them to `datetime`s, however. First we will copy our original geodataframe to save later. (If you are doing this conversion with your own data, make sure that you look into the limitations of doing so, if you want to save a shapefile.)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["## Exercise 3\n","\n","1. Change the `Well_End` Series to `Timestamp`s.\n","2. Make a Series of the difference in time between the `Spud_Date` and the `Well_End` Series. (Do not add this to our current geodataframe, to make saving it easier later.)\n","3. What is the biggest difference, in days, between the `Spud_Date` and the `Well_End` Series? (Hint: you may wish to look at the `dt.days` attribute of a `Timedelta`.)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Change the `Well_End` Series to `Timestamp`s.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Add a Series with the difference in time between the `Spud_Date` and `Well_End` Series.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# What is the biggest time difference, in days, between the `Spud_Date` and `Well_End` Series?\n"]},{"cell_type":"markdown","metadata":{},"source":["## Saving files\n","\n","Once we have made these changes, we would like to save them for future work. Geopandas makes that very easy. Note that we cannot write a `datetime` to shapefiles, so we would need to change it (back) to a string if we want to save it. Similarly, if we have a Series of `bool` values (`True` or `False`), we should convert those to `int`s before we save.\n","\n","First we will see what available options we have to save to. 
Geopandas uses `fiona` in the background; we will take a look at what that offers us."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import fiona # we do not normally need this for saving, it gets used in the background.\n","fiona.supported_drivers"]},{"cell_type":"markdown","metadata":{},"source":["We can only write to some of these formats: those with `raw` or `rw` tags.\n","\n","As a format, `gpkg` is becoming more popular, so we will save our geodataframe as that. One nice advantage, not relevant here, is being able to save multiple layers in a single file."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["We can also save as a Shapefile, but we will get an error:\n","\n","`DriverSupportError: ESRI Shapefile does not support datetime fields`"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["This is fixable by saving our copy of the dataset, or by converting the datetime back to a string."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["We can also easily save these using a different CRS, if that is better for our data. This is one for the North American Datum 1927, in degrees."]},{"cell_type":"markdown","metadata":{},"source":["## Closing remarks\n","\n","The data that we have just saved can be used in the \"Intro to Geopandas\" notebook.\n","\n","
\n","\n","

© 2020 Agile Geoscience — CC-BY

"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /notebooks/02-geopandas-io_and_plotting.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{},"source":["# Intro to Geopandas plotting vector data\n","\n","DATE: 12 June 2020, 08:00 - 11:00 UTC\n","\n","AUDIENCE: Intermediate\n","\n","INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n","\n","Not all the data that we want to deal with is simply numeric. Much of it will need to be located in space as well. Luckily, there are numerous tools to handle this sort of data. For this notebook, we will focus on vector data. This is data consisting of points, lines and polygons, not gridded data. The tutorials by Leo Uieda and Joe Kington deal more with raster data and should be a good complement to this tutorial. \n","\n","There are a number of common spatial tasks that are often done in GIS software, such as adding buffers to data, or manipulating and creating geometries. This notebook is focused more on the basics of using existing data and plotting it, but not making many changes specific to spatial data.\n","\n","#### Prerequisites\n","\n","You should be reasonably comfortable with `pandas` and `matplotlib.pyplot`.\n","\n","Beyond that, this is aimed at relative beginners to working with spatial data.\n","\n","#### A Note on Shapefile\n","\n","Shapefiles are a common file format used when sharing and storing georeferenced data. 
A single shapefile has a number of components that are required for it to work correctly.\n","These are mandatory:\n","- `.shp` is the feature geometry.\n","- `.shx` is the shape index.\n","- `.dbf` contains the attributes in columns, for each feature.\n","\n","There are a number of additional files that may also be present, of which these are the most common (in the author's experience).\n","- `.prj` is the projection of the data.\n","- `.sbx` and `.sbn` are a spatial index.\n","- `.shp.xml` is a metadata file.\n","\n","While shapefiles are very common on desktop systems, they tend not to be used to present data on the web, although they are often offered as a download option."]},{"cell_type":"markdown","metadata":{},"source":["### Pandas and Geopandas\n","\n","Pandas gives us access to a data structure called a DataFrame, which is very well suited for the sort of data that is usually in spreadsheets, with rows and columns. Geopandas is an expansion of that, to allow for the data to be geographically located in a sensible way. It does this by adding a `geometry` column, a `GeoSeries`, and adding some methods for some spatially useful tests, while still allowing the usual `DataFrame` methods from pandas.\n","\n","In addition, we will use `cartopy` to handle projections. `mapclassify` is optional, but allows easy binning of our data."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["#import cartopy.crs as ccrs\n","import geopandas as gpd\n","import mapclassify as mc\n","import numpy as np\n","import pandas as pd"]},{"cell_type":"markdown","metadata":{},"source":["## Creating a geodataframe\n","\n","Loading a shapefile (or a number of other formats) is as simple as calling `read_file` with the right location.\n","\n","Geopandas uses `fiona` in the background, so anything that can be handled by `fiona` can be handled with geopandas. 
Note that some formats can be read, but not written."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fname = '../data/cleaned/offshore_wells_2011_Geographic_NAD27.shp'\n","\n","well_locations = gpd.read_file(fname)\n","well_locations.head()"]},{"cell_type":"markdown","metadata":{},"source":["We can also load data as a standard DataFrame and convert it by using any existing geometry that we know about.\n","\n","We will load up some data available regarding issues identified at artisinal mines in Zimbabwe by the International Peace Information Service ([IPIS](http://ipisresearch.be/))."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fname = '../data/zwe_mines_curated_all_opendata_p_ipis.csv'\n","\n","artisinal_mines = pd.read_csv(fname)\n","artisinal_mines.head()"]},{"cell_type":"markdown","metadata":{},"source":["We can see that there is a `geom` column in this CSV, where every point is a Well-Known Text ([WKT](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry)) string describing the geometry. To make the geodataframe aware of this, we will use the `shapely` library (that geopandas uses under the hood)."]},{"cell_type":"code","execution_count":null,"metadata":{"scrolled":false},"outputs":[],"source":["from shapely import wkt\n","artisinal_mines['geom'] = artisinal_mines['geom'].apply(wkt.loads)\n","mines = gpd.GeoDataFrame(artisinal_mines, geometry='geom')\n","mines.head()"]},{"cell_type":"markdown","metadata":{},"source":["This does not look very different, but we have now created a geodataframe from our existing dataframe. We could do something very similar with a CSV with separate columns of latitude and longitude.\n","\n","When creating a new geodataframe like this, we should also set the Coordinate Reference System (CRS) of the data, since geopandas does not know where the coordinates actually are on the Earth's surface. 
Some operations will still work, but relating one geodataframe to another is not possible. We are working with straight decimal degrees of longitude and latitude, so the WGS84 datum is a good option."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["mines.crs = \"EPSG:4326\""]},{"cell_type":"markdown","metadata":{},"source":["One of the simplest ways to see how a geodataframe differs from a standard dataframe is by simply calling the `plot` method."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["artisinal_mines.plot()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["mines.plot()"]},{"cell_type":"markdown","metadata":{},"source":["As we can see, the geodataframe plots our coordinates, while the standard dataframe plots the numerical values according to their index."]},{"cell_type":"markdown","metadata":{},"source":["### Exercise 1\n","\n","The following should be easily possible with a working knowledge of `pandas`. Using the well dataset:\n","1. Which well is the deepest? (`df.sort_values('column')` may be useful.)\n","1. How many wells are operated by Canadian Superior?"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# The deepest well is:\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# How many wells were operated by Canadian Superior?\n"]},{"cell_type":"markdown","metadata":{},"source":["### Geographic plots\n","\n","We can take a quick look at where these wells are in relation to each other."]},{"cell_type":"code","execution_count":null,"metadata":{"scrolled":true},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["The data that we imported uses latitude and longitude. 
We can also easily import projected data, if we have it."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fname = '../data/cleaned/offshore_wells_2011_UTM20_NAD83.shp'\n","well_locations_utm = gpd.read_file(fname)\n","# We are going to use the 'Spud_Date' and 'Well_End' columns later, so we will turn them into proper datetime columns\n","well_locations_utm['Spud_Date'] = pd.to_datetime(well_locations_utm['Spud_Date'])\n","well_locations_utm['Well_End'] = pd.to_datetime(well_locations_utm['Well_End'])\n","well_locations_utm.replace('None', np.NaN, inplace=True)\n","well_locations_utm.plot()\n","well_locations_utm.head(5)"]},{"cell_type":"markdown","metadata":{},"source":["Notice that the axes are completely different between the two datasets. We can therefore not plot these two datasets in the same plot unless we use the same coordinate reference system. `cartopy` is the tool we will use to do this.\n","\n","First, let us see what CRS the different datasets have."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["print(f'Wells: {well_locations.crs}\\nWells (UTM): {well_locations_utm.crs}')"]},{"cell_type":"markdown","metadata":{},"source":["If we want to plot the two datasets on the same plot, then they need to use the same CRS. One of the easiest ways is by using EPSG codes and the `to_crs` method. [epsg.io](https://epsg.io) and [spatialreference.org](https://spatialreference.org) are good places to find a suitable EPSG code for your data if you are not sure how the CRS relates to it."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["We can see that these datasets now plot on top of each other, as they should."]},{"cell_type":"markdown","metadata":{},"source":["## Styling\n","\n","Just plotting these points on their own does not tell us very much, so we should style the data to show us what is happening. 
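Before styling the real data, the binning idea can be sketched with plain numpy; the depths below are invented, and equal-width bins stand in for the smarter schemes that mapclassify provides:

```python
import numpy as np

# Invented well depths; equal-interval binning as a stand-in for mapclassify's schemes
depths = np.array([1200.0, 2500.0, 3100.0, 4000.0, 4700.0, 5900.0])
edges = np.linspace(depths.min(), depths.max(), 7)  # 7 edges -> 6 equal-width bins
labels = np.digitize(depths, edges[1:-1])  # bin index (0-5) for each well
print(labels)  # [0 1 2 3 4 5]
```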
We will classify the data by total depth of each well, breaking the column into 6 bins with a natural break as the upper and lower bound.\n","\n","We do this by using the `scheme` parameter, which will be used in the background by MapClassify to bin the values of a column. A number of binning options are available, such as NaturalBreaks, Quantiles, StdMean, and Percentiles."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["The `scheme` keyword passes through to `mapclassify`, and only makes sense for some data. In other cases, we can just rely on the raw data."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["We may also be interested in only a section of the data within certain extents, such as the dense cluster south-east of centre. Geopandas offers a `.cx` indexer (a coordinate index), which can be used for slicing based on coordinate values."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["### Exercise 2\n","\n","The data contains columns for the start and end of when a well was active.\n","\n","1. Which well was operating for the longest time and how long was this? (Hint: use the `datetime` columns from earlier ('Spud_Date' and 'Well_End'). A useful pattern is `df.loc[df['column'] == value]`.)\n","2. Plot a histogram of the days of operation for the wells in the dataset. You may need to drop invalid data (where some columns are NaN or NaT).\n","3. Using the above histogram to determine a suitable cut-off, is there an area of the field that has wells that were in operation for longer than others? 
(Hint: you might want to extract a useful time interval from a `Series` of `timedelta`s to plot.)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Which well was operating for the longest time and how long was this?\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Plot a histogram of the days of operation for the wells in the dataset.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Using the above histogram to determine a suitable cut-off, is there an area of the field that has\n","# wells that were in operation for longer than others?\n","\n","\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["## Saving geodataframes\n","\n","While we can create maps and similar things in geopandas, sometimes we want to use files in something else. Geopandas uses `fiona` in the background to read and write files. If we want this geodataframe as a GeoJSON file, for example, this is easily done by using the correct argument to the `driver` parameter. (Note that GeoJSON only accepts the WGS84 datum, so I am reprojecting the geodataframe first.)\n","\n","By default, without an explicit driver, `to_file` will create a Shapefile."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["Changing this to a GML file (a flavour of XML) is as simple as changing the driver parameter appropriately:"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["
\n","\n","

© 2020 Agile Geoscience — CC-BY

"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"}},"nbformat":4,"nbformat_minor":2} -------------------------------------------------------------------------------- /notebooks/03-geopandas-processing_vector_data.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{},"source":["# Spatial Operations in Geopandas\n","\n","DATE: 12 June 2020, 08:00 - 11:00 UTC\n","\n","AUDIENCE: Intermediate\n","\n","INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n","\n","Importing and plotting vector data is all well and good, but we want to be able to do other things as well, like relating one dataset to another.\n","\n","This notebook covers selection of features via some spatial relationships between different GeoDataFrames and selecting the data that we are interested in. There is also some code given for plotting rose diagrams (adapted from Bruno Ruas de Pinho's [post](http://geologyandpython.com/structural_geology.html)), which may be of interest."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import geopandas as gpd\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import numpy as np"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["plt.rcParams[\"figure.figsize\"] = (8, 8)"]},{"cell_type":"markdown","metadata":{},"source":["We are going to work with the runways located in South Africa, downloaded from OpenStreetMap. We will also load the outlines of each province from the Natural Earth dataset, and extract the South African ones. 
(Note: if you are actually working in South Africa, you should avoid these: they are very out of date now.)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["za_runways = gpd.read_file('../data/south_africa_runways.gpkg')\n","provinces = gpd.read_file('zip://../data/ne_10m_admin_1_states_provinces.zip')\n","za_provinces = provinces[provinces['adm0_a3'] == 'ZAF']"]},{"cell_type":"markdown","metadata":{},"source":["The `bearing_360` field in the runway dataset contains the direction each line that makes up a runway is pointing, which might be useful."]},{"cell_type":"markdown","metadata":{},"source":["A rose diagram is quite a nice way to visualise this sort of data (anyone who has done a structural geology course will be familiar with these for various structural readings, and they are commonly used for wind direction as well).\n","\n","The easiest way to get a set of bins is by using the `histogram()` function from `numpy`. This returns two arrays: one with the count in each bin, and one with the edge of each bin. We will use 36 bins (so each is 10 degrees)."]},{"cell_type":"code","execution_count":null,"metadata":{"scrolled":false},"outputs":[],"source":["bin_width_degrees = 10\n","bin_edges = np.arange(-5, 366, bin_width_degrees)\n","hist_counts, bin_edges = np.histogram(za_runways['bearing_360'], bin_edges)\n","print(f'Hist counts: {hist_counts}\\nEdges: {bin_edges}')\n","\n","print(list(hist_counts[:18]))\n","print(list(reversed(hist_counts[18:])))"]},{"cell_type":"markdown","metadata":{},"source":["We know that the first and last bins are the same (that is, bin 1 has edges at [-5, 5] and the last bin has edges [355, 365]), so we will make sure that we add those counts together. 
Then we can split the list in half before summing the counts from opposite bins."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["hist_counts[0] += hist_counts[-1] # sum counts of first and last bins\n","\n","half = np.sum(np.split(hist_counts[:-1], 2), 0)\n","two_halves = np.concatenate([half, half])\n","print(len(two_halves))\n","print(hist_counts)\n","print(two_halves)\n","print(list(two_halves[:18]))\n","print(list(two_halves[18:]))"]},{"cell_type":"markdown","metadata":{},"source":["These get used for the parameters to the polar plot. `theta` is the centre of each bin, `radii` is the height of each bin, and `width` is how wide each bin needs to be."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["theta = np.linspace(0, 2 * np.pi, 36, endpoint=False) # 36 bin centres: 0, 10, ..., 350 degrees\n","radii = two_halves\n","width = np.radians(bin_width_degrees)"]},{"cell_type":"markdown","metadata":{},"source":["Finally, we can set up the plot. Note the addition of `polar` to the `projection` parameter. This is what tells matplotlib that we want a circular plot. By using `set_theta_zero_location` and `set_theta_direction` we can get 0 at the top and 90 to the right, as we expect from a compass. The grid can be altered using `set_thetagrids` and `set_rgrids` for the spokes and the rings respectively."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fig = plt.figure()\n","ax = fig.add_subplot(111, projection='polar')\n","ax.bar(theta, radii, \n"," width=width, bottom=0.0, edgecolor='k')\n","ax.set_theta_zero_location('N')\n","ax.set_theta_direction(-1)\n","ax.set_thetagrids(np.arange(0, 360, 10), labels=np.arange(0, 360, 10))\n","ax.set_rgrids(np.arange(0, radii.max() + 5, 5), angle=0)\n","ax.set_title('Directions of South African runways')"]},{"cell_type":"markdown","metadata":{},"source":["## Spatial Selections\n","\n","This is useful, but it covers the entire country. 
If we want to look at each province in turn, because there may be differences, we can select only things in the province.\n","\n","We will use the `within` method to see if a feature is within another. This returns a boolean (True/False) that we can use to select only the features for which this is true. There are a few of these worth noting:\n","* `contains` is the opposite of `within`, and is True for features that entirely contain a given geometry.\n","* `intersects` is true if the boundary or interior of one object intersects the boundary or interior of the other in any way.\n","* `touches` is true if at least one point is in common and interiors do not intersect.\n","\n","In addition to these, most other spatial operations that are present in standard GIS software are available."]},{"cell_type":"markdown","metadata":{},"source":["First, we will select the province of KwaZulu-Natal and the runways within it, and then plot them."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["kzn = za_provinces[za_provinces['iso_3166_2'] == 'ZA-NL']\n","kzn_runways = za_runways[za_runways.within(kzn.geometry.unary_union)]"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["We can see that this has only extracted values within the province's boundaries. We needed the `unary_union` because the geometry is that of a multipolygon."]},{"cell_type":"markdown","metadata":{},"source":["## Exercise 1\n","\n","1. Convert the above code to a function for creating a rose diagram given a GeoDataFrame.\n"," 1. Test it on the extracted runway directions in KZN.\n","2. Write a function that generates a rose diagram for each province. The easiest approach is to wrap the previous function. 
(Hint: you may find the `iterrows()` method of (geo)DataFrames useful.)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Write a function to plot a rose diagram.\n","def rose_diagram(bearings, title='Rose Diagram', degrees_in_bin=10):\n"," bin_width_degrees = degrees_in_bin\n"," num_bins = int(round(360 / degrees_in_bin, 0))\n"," \n"," # centre the first bin on 0 degrees (assumes 360 is a multiple of the bin width)\n"," bin_edges = np.arange(-bin_width_degrees / 2, 360 + bin_width_degrees, bin_width_degrees)\n"," hist_counts, bin_edges = np.histogram(bearings, bin_edges)\n"," \n"," hist_counts[0] += hist_counts[-1] # sum counts of first and last bins\n","\n"," half = np.sum(np.split(hist_counts[:-1], 2), 0)\n"," two_halves = np.concatenate([half, half])\n"," \n"," theta = np.linspace(0, 2 * np.pi, num_bins, endpoint=False) # bin centres\n"," radii = two_halves\n"," width = np.radians(bin_width_degrees)\n"," \n"," fig = plt.figure()\n"," ax = fig.add_subplot(111, projection='polar')\n"," ax.bar(theta, radii, \n"," width=width, bottom=0.0, edgecolor='k')\n"," ax.set_theta_zero_location('N')\n"," ax.set_theta_direction(-1)\n"," ax.set_thetagrids(np.arange(0, 360, 10), labels=np.arange(0, 360, 10))\n"," ax.set_rgrids(np.arange(0, radii.max() + 5, 5), angle=0)\n"," ax.set_title(title)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Test the function on the runways in KwaZulu-Natal.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Plot a rose diagram for each province.\n","\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["## Spatial Relationships\n","\n","Now we will take a look at a few places in KZN, and get an idea of some spatial relationships between towns/cities and runways."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["za_places = gpd.read_file(\"../data/south_africa_places.gpkg\")"]},{"cell_type":"markdown","metadata":{},"source":["We can now look at how many airports are located within a certain range of a given town. 
First we will choose our town, then create a buffer around it, and see how many fall into that range. I am going to look at Durban, the largest city in KZN."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["Geopandas gives us some useful and common tools straight away, such as a buffer. We can easily put a circle around a given geometry."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["That looks a little big, given that our buffer was only `1`. If we look at the CRS of the place data, we can see that it works in degrees, so the buffer is one degree in each direction. We should convert this to a projected system that uses metres if we want to measure meaningful distances. We will try EPSG 3857, which uses metres as a unit of measure. (For our purposes, it is accurate enough, but if you need distances in a critical use-case, then you should use something more focused on your area of interest.)\n","\n","For smaller areas, a relevant UTM projection is good. Some countries also have well-defined ones of their own. https://gist.github.com/gubuntu/6403425 is of interest for Southern Africa generally. 
KZN is a little too big to fit into one UTM zone."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["Notice how this also changes our axes when plotting:\n","(We are plotting the centroid of each runway line here, to make the position more visible.)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["Now we can use that buffer and select only the runways within that buffer."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["We can now plot only those runways that fall within the buffer."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fig, ax = plt.subplots()\n","kzn_proj.plot(ax=ax, color='lightgrey')\n","near_durban.centroid.plot(ax=ax)\n","durban_proj_buffer.plot(ax=ax, alpha=0.3)\n","durban_proj.plot(ax=ax, color='red')"]},{"cell_type":"markdown","metadata":{},"source":["## Distances and Nearest Neighbours\n","\n","If we are interested in the distance to the nearest runway from one of our towns, we can write a short function that returns this value. A GeoDataFrame has a `distance` method that will work between any two geometries. Since we want to do it for a number of them in one go, we have to write a function to do it."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["The paradigm in pandas is to use `apply` which performs the same function on each row in a dataframe.\n","\n","Since we know that the nearest runway will be one of the ones that we previously identified as being within the buffer zone, we will just use that. 
We could give it the whole dataset and get the same result, but this will be slightly quicker.\n","\n","The final comma in `args=(near_durban,)` is important: it makes the argument a one-element tuple rather than a bare value."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":[]},{"cell_type":"markdown","metadata":{},"source":["## Exercise 3\n","\n","1. What is the distance to the centre of the closest town to the runway with the `osm_id` of 87323625?\n","2. What is the distance from Durban to the furthest town in South Africa? (Hint: you will need to modify the `nearest_neighbour` function to a `max_distance` version. Pay attention to the CRS.)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# What is the distance to the centre of the closest town to the runway with the osm_id of 87323625?\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# What is the distance from Durban to the furthest town in South Africa?\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["## Multiple points\n","\n","We can also apply the `min_distance` function to every feature. This can be used, for example, to classify the runways by their closest town or city.\n","\n","First we will limit our analysis to towns in KwaZulu-Natal."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["kzn_places = za_places[za_places.within(kzn.geometry.unary_union)]\n","kzn_places_proj = kzn_places.to_crs(epsg=3857)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["base = kzn_proj.plot(color='lightgrey')\n","kzn_runways_proj.centroid.plot(ax=base)\n","kzn_places_proj.plot(ax=base, color='red', marker='+')"]},{"cell_type":"markdown","metadata":{},"source":["Using the `apply` function on the whole GeoDataFrame will get us a result for all the features. 
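The helper itself is left as a fill-in cell for the live session, but the shape of the `apply`/`args` pattern is worth seeing on its own. A library-free sketch, with toy coordinate tuples and Euclidean distance standing in for real geometries (with geopandas, the GeoDataFrame's `distance` method would do the measuring):

```python
import pandas as pd
from math import hypot

def min_distance(point, others):
    """Distance from `point` to the closest member of `others`.
    (A geopandas version would use the GeoDataFrame's `distance` method.)"""
    return min(hypot(point[0] - x, point[1] - y) for x, y in others)

towns = pd.DataFrame({"name": ["A", "B"], "geometry": [(0, 0), (3, 4)]})
runways = [(0, 3), (6, 8)]

# args=(runways,) -- the trailing comma makes this a one-element tuple, which
# apply unpacks as the second positional argument of min_distance.
towns["min_dist"] = towns["geometry"].apply(min_distance, args=(runways,))
print(towns)
```

Without the trailing comma, `args=(runways)` is just `runways` itself, and `apply` would try to unpack each runway tuple as separate arguments.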
We can use that to populate a new column in the runway dataset with the minimum distance to a town or city in the places dataset."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["kzn_runways_proj['min_dist_to_place'] = kzn_runways_proj.geometry.apply(min_distance, args=(kzn_places_proj,))\n","kzn_runways_proj['min_dist_to_place'].describe()"]},{"cell_type":"markdown","metadata":{},"source":["The runways end up plotting very small at a province-wide scale, so we will change the geometry to be a Point instead. The easiest way is to take the centroid of each runway."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["kzn_runways_proj['line_geom'] = kzn_runways_proj.geometry\n","kzn_runways_proj['geometry'] = kzn_runways_proj.centroid\n","kzn_runways_proj.geometry"]},{"cell_type":"markdown","metadata":{},"source":["Using the `scheme` parameter, we can also take a column and group the values using `mapclassify` options."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["base = kzn_proj.plot(color='lightgrey')\n","kzn_runways_proj.plot(ax=base, column='min_dist_to_place', legend=True, scheme='EqualInterval', k=7)\n","kzn_places_proj.plot(ax=base, color='red', marker='+')"]},{"cell_type":"markdown","metadata":{},"source":["## Exercise 4\n","\n","1. Plot the towns and cities in KwaZulu-Natal coloured by the distance from Durban."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["\n","\n","\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["## More Information from Neighbours\n","\n","So far we have only been able to obtain the distance to the nearest neighbour, but nothing else from the second GeoDataFrame.\n","\n","We need a new function that can get the nearest points to a given geometry. 
To do that we will use an operation imported from the `shapely` module."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from shapely.ops import nearest_points\n","\n","def calculate_nearest(row, destination, val, col=\"geometry\"):\n"," ''' Returns the value from a given column for the nearest feature\n"," of the `destination` GeoDataFrame.\n"," \n"," row: The object of interest\n"," destination: GeoDataFrame of possible locations, of which one is the closest.\n"," val: The column containing the values which the destination will give back.\n"," col: If your row's geometry is not `geometry`, change it here.\n"," \n"," returns: The value of the `val` column of the feature nearest to `row`.\n"," '''\n"," dest_unary = destination[\"geometry\"].unary_union\n"," nearest_geom = nearest_points(row[col], dest_unary)\n"," match_geom = destination.loc[destination.geometry == nearest_geom[1]]\n"," match_value = match_geom[val].to_numpy()[0]\n"," return match_value"]},{"cell_type":"markdown","metadata":{},"source":["This function allows us to ask for specific values back about the nearest object. 
Again, we will find the nearest place to each runway, but we will request the `name` of that place and save it as a column in the runways GeoDataFrame."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["kzn_runways_proj[\"nearest_place_name\"] = kzn_runways_proj.apply(\n"," calculate_nearest, destination=kzn_places_proj, val=\"name\", axis=1)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["kzn_runways_proj[\"nearest_place_name\"].head(10)"]},{"cell_type":"markdown","metadata":{},"source":["We can now plot these and get an idea of where the nearest urban centre to each runway is."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fig, ax = plt.subplots(figsize=(10,10))\n","kzn_proj.plot(ax=ax, zorder=-4, color='lightgrey')\n","kzn_places_proj.plot(ax=ax, color='black', marker='+',)\n","kzn_runways_proj.plot(ax=ax, column='nearest_place_name', legend=True, cmap='tab20b')\n","ax.set_title('Nearest major place to runways in KZN')\n","ax.set_xlim(xmin=kzn_proj.total_bounds[0] - 10_000, xmax=kzn_proj.total_bounds[2] + 135_000)"]},{"cell_type":"markdown","metadata":{},"source":["## Exercise 5\n","\n","1. Is there a spatial trend in the bearing of runways in KZN? Plot the position of each runway and colour them by bearing.\n","2. Plot the runways within 150 km of Pietermaritzburg and colour them by runway length. 
This plot should only cover the area within 150 km of Pietermaritzburg."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Is there a spatial trend in the bearing of runways in KZN?\n","# Plot the position of each runway and colour them by bearing.\n","\n","\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Plot the runways within 150 km of Pietermaritzburg and colour them by runway length.\n","# This plot should only cover the area within 150 km of Pietermaritzburg.\n","\n"]},{"cell_type":"markdown","metadata":{},"source":["
\n","\n","

© 2020 Agile Geoscience — CC-BY

"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /notebooks/04-geopandas-joins_and_spatial_joins.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{},"source":["# Joins and merging data\n","\n","DATE: 12 June 2020, 08:00 - 11:00 UTC\n","\n","AUDIENCE: Intermediate\n","\n","INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n","\n","Often we want to combine two datasets, perhaps keeping only the area that is covered by one of them. Spatial joins are a relatively complex topic, so this will give a brief overview that will hopefully be useful."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import geopandas as gpd\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import numpy as np"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["provinces = gpd.read_file('zip://../data/ne_10m_admin_1_states_provinces.zip')\n","za_provinces = provinces[provinces['sov_a3'] == 'ZAF']\n","employment = pd.read_csv('../data/za_province_employment.csv')"]},{"cell_type":"markdown","metadata":{},"source":["Geopandas differentiates between spatial joins and attribute joins. An **attribute** join works more or less the same as in standard pandas, and uses values that are common to both. 
A **spatial** join uses the geometry of each dataframe."]},{"cell_type":"markdown","metadata":{},"source":["## Attribute Joins\n","\n","This works by finding a common value in two dataframes and matching rows on that value to add attributes to existing features. In pandas one uses the `merge` function to do this. These are very common when you have existing data that needs to be combined with existing geometry. A common example would be adding demographic data to administrative districts.\n","\n","In this case, we can see that the employment data has a `Province` attribute. We can link that to the `name` attribute in `za_provinces`. Further, there is no geometry associated with the employment data, so we have no way of seeing if there are any spatial trends in the data, unless we have a very good mental image of South Africa's provinces."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["employment"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["ax = employment.plot(kind='bar')"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["za_provinces"]},{"cell_type":"markdown","metadata":{},"source":["We can now merge on these, which will give us a dataframe that adds the values associated with a given province in the employment data to each province."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["merged_provinces = za_provinces.merge(employment, left_on='name', right_on='Province')\n","merged_provinces"]},{"cell_type":"markdown","metadata":{},"source":["As we can see, we have added the columns from `employment` to our `za_provinces` geodataframe, which we can treat normally. Also worth noting is that we have lost the row with the totals of each class, because there is no province named 'Total'. 
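That silently dropped 'Total' row is worth dwelling on: `merge` defaults to an inner join, so unmatched keys vanish without warning. With toy stand-in data (the names and numbers here are invented, not the tutorial's CSV), `how='outer'` plus `indicator=True` makes the unmatched rows visible:

```python
import pandas as pd

# Toy stand-ins for the tutorial data; the real values come from the files.
za_provinces = pd.DataFrame({"name": ["Gauteng", "Limpopo"]})
employment = pd.DataFrame({"Province": ["Gauteng", "Limpopo", "Total"],
                           "Employed": [5000, 1200, 6200]})

merged = za_provinces.merge(employment, left_on="name", right_on="Province",
                            how="outer", indicator=True)

# Rows present on only one side show up as 'left_only' / 'right_only'.
print(merged[merged["_merge"] != "both"][["Province", "_merge"]])
```

Running this kind of check once before switching back to the default inner join is a cheap way to confirm that nothing important is being discarded.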
Since this is now a standard geodataframe, we can easily make a plot based on the employment data for each province."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["ax = merged_provinces.plot('Total', scheme='NaturalBreaks', k=5, legend=True, edgecolor='white', cmap='cividis')\n","ax.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n","ax.set_xlim(16, 33.2) # These limits are to ignore the Gough Islands\n","the_legend = ax.get_legend()\n","the_legend.set_bbox_to_anchor((1.7,1))\n","plt.title('Population in South Africa per Province')\n","ax2 = merged_provinces.plot(merged_provinces['Unemployed']/merged_provinces['Total']*100, scheme='NaturalBreaks', k=5, legend=True, edgecolor='white', cmap='cividis')\n","ax2.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n","ax2.set_xlim(16, 33.2) # These limits are to ignore the Gough Islands\n","the_legend = ax2.get_legend()\n","the_legend.set_bbox_to_anchor((1.45,1))\n","plt.title('Percentage unemployed in each province')"]},{"cell_type":"markdown","metadata":{},"source":["## Spatial Joins\n","\n","These work by looking at the geometry of two different geodataframes and relating them to each other.\n","\n","For example, this river dataset has no information on which country or province a river is in, but that may be of interest for some reason."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["rivers = gpd.read_file('zip://../data/ne_10m_rivers_lake_centerlines_trimmed.zip')\n","rivers"]},{"cell_type":"markdown","metadata":{},"source":["Geopandas offers us the `sjoin` method to spatially join two different geodataframes.\n","\n","This takes two geodataframes (the first is `left` and the second is `right`).\n","\n","The `op` parameter controls how things are related to each other, using the `shapely` library's [binary predicates](https://shapely.readthedocs.io/en/latest/manual.html#binary-predicates):\n","* 
`intersects` - True if the objects have any boundary or interior point in common.\n","* `contains` - True if no points of other lie in the exterior of the object and at least one point of the interior of other lies in the interior of object.\n","* `within` - True if the object’s boundary and interior intersect only with the interior of the other.\n","\n","The `how` parameter controls which geometry is kept:\n","* `'left'` uses keys from left; retains only the left geometry column"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["za_rivers_left = gpd.sjoin(za_provinces, rivers, how=\"left\", op='intersects')\n","base = za_provinces.plot(color='lightgrey', edgecolor='black')\n","za_rivers_left[za_rivers_left['sov_a3'] == 'ZAF'].plot(ax=base)\n","\n","base.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n","base.set_xlim(16, 33.2) # These limits are to ignore the Gough Islands\n","za_rivers_left"]},{"cell_type":"markdown","metadata":{},"source":["* `'right'` uses keys from right; retains only the right geometry column (note that this means that all the rivers will still be present, but only those which can be matched to a province in `za_provinces` will have values from `za_provinces`).\n","\n","Note that we have rivers that extend beyond the border, because we are only looking for intersecting geometries. Try a different `op` ('contains' or 'within') to see what effect that has. 
Note also that some rivers are now present twice, because they are within multiple provinces, so they are selected more than once."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["za_rivers_right = gpd.sjoin(za_provinces, rivers, how=\"right\", op='intersects')\n","base = za_provinces.plot(color='lightgrey', edgecolor='black')\n","za_rivers_right[za_rivers_right['sov_a3'] == 'ZAF'].plot(ax=base)\n","\n","base.set_ylim(-35.2, -21.8) # These limits are to ignore the Gough Islands\n","base.set_xlim(16, 33.2) # These limits are to ignore the Gough Islands\n","#za_rivers[za_rivers['sov_a3'] == 'ZAF']"]},{"cell_type":"markdown","metadata":{},"source":["* `'inner'` uses the intersection of keys from both geodataframes; retains only the left geometry column"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["za_rivers_inner = gpd.sjoin(za_provinces, rivers, how=\"inner\", op='intersects')\n","\n","za_rivers_inner"]},{"cell_type":"markdown","metadata":{},"source":["Comparing all three then:"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["print(f'Left: {za_rivers_left.shape}\\nRight: {za_rivers_right.shape}\\nInner: {za_rivers_inner.shape}')"]},{"cell_type":"markdown","metadata":{},"source":["Note that for these datasets, we expect Left and Inner to be the same. The main difference is in whether we keep records that appear only in the right dataframe or not.\n","
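The duplicated rivers from a spatial join are easy to tidy up afterwards with ordinary pandas tools. A toy sketch (the column names and rivers here are invented stand-ins for real `sjoin` output) of the two usual options: count the matches per river, or keep one row per river:

```python
import pandas as pd

# Stand-in for sjoin output: the Orange intersects two provinces, so the
# join returns it twice, once per matching province.
joined = pd.DataFrame({"river": ["Orange", "Orange", "Tugela"],
                       "province": ["Northern Cape", "Free State", "KwaZulu-Natal"]})

# Option 1: how many provinces does each river cross?
provinces_per_river = joined.groupby("river")["province"].nunique()

# Option 2: collapse back to one (arbitrary) row per river.
one_row_per_river = joined.drop_duplicates(subset="river")

print(provinces_per_river)
print(len(one_row_per_river), "unique rivers")
```

Which option is right depends on the question: keep the duplicates if the river-province pairing itself is the answer, deduplicate if you only need river-level attributes.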
\n","\n","

© 2020 Agile Geoscience — CC-BY

"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /notebooks/05-geopandas-basemaps_with_contextily.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{},"source":["# Background Maps\n","\n","DATE: 12 June 2020, 08:00 - 11:00 UTC\n","\n","AUDIENCE: Intermediate\n","\n","INSTRUCTOR: Martin Bentley, Digital Geoscientist, [Agile](https://agilescientific.com/)\n","\n","It is often useful to add a background map to a given dataset. There are a few tools that can do this, such as [`folium`](https://python-visualization.github.io/folium/) and [`ipyleaflet`](https://ipyleaflet.readthedocs.io/en/latest/) which build directly on the `leaflet` library for JavaScript. However, for a simple approach, we will stick to [`contextily`](https://github.com/darribas/contextily). There are some additional capabilities that we will not go over in this, but it should be enough to get started with.\n","\n","Note: the inital load of tiles can take quite a while as they get downloaded. 
Subsequent loads of the same tiles will be much quicker as they cache locally.\n","\n","This notebook is intended more as a demonstration of using contextily to add basemaps in a simple way."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import geopandas as gpd\n","import matplotlib.pyplot as plt\n","import contextily as ctx"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["plt.rcParams[\"figure.figsize\"] = (8, 8)"]},{"cell_type":"markdown","metadata":{},"source":["First we will get some point data, in this case mines in Tanzania. Geopandas can download the file and import it directly from the source at [Geological and Mineral Information System](https://www.gmis-tanzania.com/) by the Geological Survey of Tanzania. If this download does not work, it is in the repo as `data/tanzania_mines.zip`."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fname = 'https://www.gmis-tanzania.com/download/mines.zip'\n","mines = gpd.read_file(fname)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["mines"]},{"cell_type":"markdown","metadata":{},"source":["This can easily be plotted, as we have already done."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["mines.plot(column='miningexpl', legend=True)"]},{"cell_type":"markdown","metadata":{},"source":["In order to gain more context, we can plot this over a basemap of some kind. By default, `contextily` uses the Stamen Terrain tiles."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["base = mines.plot(column='miningexpl', legend=True)\n","ctx.add_basemap(base, crs=mines.crs)"]},{"cell_type":"markdown","metadata":{},"source":["Something worth noting is that the basemap is easily projected by giving it the `mines.crs` as a parameter. 
This is needed since the dataset uses a local projected CRS, but we can largely ignore it."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["mines.crs"]},{"cell_type":"markdown","metadata":{},"source":["We can switch this to use lat-lon easily enough, by reprojecting the data to something like the WGS84 datum first."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["mines_deg = mines.to_crs(epsg=4326)\n","base = mines_deg.plot(column='miningexpl', legend=True, alpha=0.75)\n","ctx.add_basemap(base, crs=mines_deg.crs)"]},{"cell_type":"markdown","metadata":{},"source":["## Changing the Basemap\n","\n","The leaflet providers are available in contextily, which allows for a variety of different styles and looks."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["ctx.providers.keys()"]},{"cell_type":"markdown","metadata":{},"source":["Many of these may have specific styles, which will look different. Compare the default [Mapnik style](https://www.openstreetmap.org/search?query=Kinshasa#map=13/-4.3385/15.3131) to the [Transport style](https://www.openstreetmap.org/search?query=Kinshasa#map=13/-4.3385/15.3131&layers=T), for example. Some of these styles may require API keys to use, and many will have usage limits of some kind."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["print(ctx.providers.OpenStreetMap.keys())\n","print(ctx.providers.Thunderforest.keys())\n","print(ctx.providers.Esri.keys())"]},{"cell_type":"markdown","metadata":{},"source":["Changing the basemap to use one of these is as simple as changing the `source` parameter. 
We can also make it fade a little by setting the `alpha` parameter."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["fig, ax = plt.subplots()\n","mines_deg.plot(ax=ax, column='miningexpl', legend=True, alpha=0.75)\n","ctx.add_basemap(ax, crs=mines_deg.crs,\n"," source=ctx.providers.OpenStreetMap.Mapnik, alpha=0.8)"]},{"cell_type":"markdown","metadata":{},"source":["It is also possible to load custom tilemaps, if they support the standard XYZ format. This is useful if you have created one using your own data somewhere. We will use the tiling server hosted by the government of New South Wales.\n","\n","Tiles can also be requested for a given extent, which needs to be either in WGS84 (EPSG 4326) or Pseudo-Mercator (EPSG 3857)."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["src = 'http://maps.six.nsw.gov.au/arcgis/rest/services/public/NSW_Base_Map/MapServer/tile/{z}/{y}/{x}'\n","west, south, east, north = 16783076.1, -4041012.6, 16851459.8, -3988135.3"]},{"cell_type":"markdown","metadata":{},"source":["The `bounds2img` method will download the tiles within a given bounding box as a three-band array. The `ll` parameter indicates whether your data is in lon-lat or Pseudo-Mercator."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["sydney_img, sydney_ext = ctx.bounds2img(west, south, east, north,\n"," source=src, ll=False, zoom=10)\n","print(sydney_img.shape)\n","plt.imshow(sydney_img, extent=sydney_ext)"]},{"cell_type":"markdown","metadata":{},"source":["## Downloading Basemaps\n","\n","While downloaded basemaps are cached locally (try re-running one of the above cells; it should be much quicker), sometimes we may want to download them to use elsewhere or to save bandwidth. Contextily can do that easily."]},{"cell_type":"markdown","metadata":{},"source":["Contextily sets a default zoom based on the extents, but we can change that if we want or need to. 
Higher zoom levels mean downloading more tiles, but with higher resolution."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["ctx.howmany(west, south, east, north, 7, ll=False)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["ctx.howmany(west, south, east, north, 10, ll=False)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["ctx.howmany(west, south, east, north, 12, ll=False)"]},{"cell_type":"markdown","metadata":{},"source":["These will look different, because they are optimised to be viewed at a different zoom level. These changes include which features are shown, smoothing of lines, size of labels, and so on. The zoom levels for OpenStreetMap can be seen [here](https://wiki.openstreetmap.org/wiki/Zoom_levels). Most providers will be very similar if not the same.\n","\n","Note: do not try to download large areas at high resolution unless absolutely necessary. Many providers will cut off access for excessive use, or may have a limited number of requests for a given service tier.\n","\n","The following two blocks illustrate the difference in the number of tiles downloaded at given zoom levels. 
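The counts that `ctx.howmany` reports follow from standard slippy-map arithmetic: at zoom level z, the Web-Mercator world is a 2^z by 2^z grid of tiles. A back-of-the-envelope version (my own sketch; contextily's exact count could differ at tile edges):

```python
from math import floor, pi

WORLD = 2 * pi * 6378137  # width of the Web-Mercator world in metres

def tile_index(x, y, zoom):
    """EPSG:3857 coordinates -> (column, row) in the XYZ tile grid."""
    n = 2 ** zoom
    col = floor((x + WORLD / 2) / WORLD * n)
    row = floor((WORLD / 2 - y) / WORLD * n)  # row 0 is at the north edge
    return col, row

def tiles_needed(west, south, east, north, zoom):
    """Number of tiles covering the bounding box at this zoom level."""
    c0, r0 = tile_index(west, north, zoom)
    c1, r1 = tile_index(east, south, zoom)
    return (c1 - c0 + 1) * (r1 - r0 + 1)

# The Sydney extent used above.
west, south, east, north = 16783076.1, -4041012.6, 16851459.8, -3988135.3
for zoom in (7, 10, 12):
    print(zoom, tiles_needed(west, south, east, north, zoom))
```

Each extra zoom level doubles the grid in both directions, so the tile count for a fixed extent grows roughly fourfold per level, which is why high-zoom downloads of large areas get expensive so quickly.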
Given the limited resolution in the browser, the difference might be clearer if you open the downloaded files instead."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["sydney_10_img, sydney_10_ext = ctx.bounds2raster(west,\n"," south,\n"," east,\n"," north,\n"," \"sydney_z10.tif\",\n"," source=src,\n"," ll=False,\n"," zoom=10\n"," )\n","plt.imshow(sydney_10_img, extent=sydney_10_ext)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["sydney_12_img, sydney_12_ext = ctx.bounds2raster(west,\n"," south,\n"," east,\n"," north,\n"," \"sydney_z12.tif\",\n"," source=src,\n"," ll=False,\n"," zoom=12\n"," )\n","plt.imshow(sydney_12_img, extent=sydney_12_ext)"]},{"cell_type":"markdown","metadata":{},"source":["We can load a saved TIFF using something like `rasterio`, or in ArcGIS/QGIS."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import numpy as np\n","import rasterio as rio\n","from rasterio.plot import show as rioshow\n","\n","with rio.open(\"sydney_z10.tif\") as r:\n"," rioshow(r)"]},{"cell_type":"markdown","metadata":{},"source":["## Geocoding in `Contextily`\n","\n","A really nice feature to have is being able to download basemaps given a placename. This is made very simple in contextily, through use of the `geopy` geocoder. 
These can be countries, cities, or smaller places."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["paraguay = ctx.Place('Paraguay', source=ctx.providers.Esri.DeLorme)\n","paraguay"]},{"cell_type":"markdown","metadata":{},"source":["This can be used as a basemap with existing data as already shown."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["rivers = gpd.read_file('zip://../data/ne_10m_rivers_lake_centerlines_trimmed.zip')\n","#rivers_clipped = rivers[rivers.intersects(paraguay.bbox)]\n","base = rivers.plot(color='red')\n","ctx.plot_map(paraguay, ax=base, axis_off=False)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["ctx.Place('Geneva', source=ctx.providers.CartoDB.Positron).plot()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["ctx.Place('Union Buildings', source=ctx.providers.Wikimedia, zoom=19).plot()"]},{"cell_type":"markdown","metadata":{},"source":["
\n","\n","

© 2020 Agile Geoscience — CC-BY

"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.3"}},"nbformat":4,"nbformat_minor":4} --------------------------------------------------------------------------------