├── .gitignore ├── GeoPython2022.ipynb ├── LICENSE ├── Readme.md ├── data └── airports.csv ├── environment.yml └── fig └── Hilbert-curve_rounded-gradient-animated.gif /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | dask-worker-space/ 132 | mydask.png 133 | data/gadm* 134 | world/ 135 | data/airports/ 136 | data/airports_csv/ 137 | data/world* -------------------------------------------------------------------------------- /GeoPython2022.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "9a96e449-efc1-4a59-8ab4-13fe52d309bf", 6 | "metadata": {}, 7 | "source": [ 8 | "# GeoPython 2022 - Introduction to `dask-geopandas`\n", 9 | "\n", 10 | "**Martin Fleischmann, Joris van den Bossche**\n", 11 | "\n", 12 | "22/06/2022, Basel\n", 13 | "\n", 14 | "## Setup\n", 15 | "\n", 16 | "Follow the Readme to set-up the environment correctly. 
You should have these packages installed:\n", 17 | "\n", 18 | "```\n", 19 | "- geopandas\n", 20 | "- dask-geopandas\n", 21 | "- pyogrio\n", 22 | "- pyarrow\n", 23 | "- python-graphviz\n", 24 | "- esda\n", 25 | "- dask-labextension # optionally, if using JupyterLab\n", 26 | "```" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "d85aacd2-6dff-4e19-a1b4-ec6201456d5d", 32 | "metadata": {}, 33 | "source": [ 34 | "## GeoPandas refresh\n", 35 | "\n", 36 | "Let's start with a quick refresh of GeoPandas.\n", 37 | "\n", 38 | "### What is GeoPandas?\n", 39 | "\n", 40 | "**Easy, fast and scalable geospatial analysis in Python**\n", 41 | "\n", 42 | "From the docs:\n", 43 | "\n", 44 | "> The goal of GeoPandas is to make working with geospatial data in python easier. It combines the capabilities of pandas and shapely, providing geospatial operations in pandas and a high-level interface to multiple geometries to shapely. GeoPandas enables you to easily do operations in python that would otherwise require a spatial database such as PostGIS.\n", 45 | "\n", 46 | "A quick demo:" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "51fea161-12a7-423e-a152-1cd07402993a", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "import geopandas" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "id": "76618132-2d59-4cb0-87a4-43e194aa7ded", 62 | "metadata": {}, 63 | "source": [ 64 | "GeoPandas includes some built-in data, we can use them as an illustration." 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "id": "a342e516-76ad-44c0-b5d5-895ce2ebf91c", 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "path = geopandas.datasets.get_path(\"naturalearth_lowres\")\n", 75 | "path" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "id": "fc58b219-b3fc-4082-84ac-c2272fb587d5", 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "world = geopandas.read_file(path)\n", 86 | "world.head()" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "id": "0a47d5e4-95fe-4161-ae7e-cf93e65b83b2", 92 | "metadata": {}, 93 | "source": [ 94 | "For the sake of simplicity here, we can remove Antarctica and re-project the data to the EPSG 3857, which will not complain about measuring the area (but never use EPSG 3857 to measure the actual area as it is extremely skewed)." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "id": "98a2f3a2-a480-4405-859e-539fd156f465", 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "world = world.query(\"continent != 'Antarctica'\").to_crs(3857)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "id": "1a3e24d4-4e1b-4f8d-aed0-16a255643080", 110 | "metadata": {}, 111 | "source": [ 112 | "GeoPandas GeoDataFrame can carry one or more geometry columns and brings the support of geospatial operations on these columns. Like a creation of a convex hull." 
113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "id": "7b6be908-ecc9-48eb-aff9-f079550518a3", 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "world['convex_hull'] = world.convex_hull" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "9e660148-41db-41a7-8f83-dde4fa7735e8", 128 | "metadata": {}, 129 | "source": [ 130 | "This is equal to the code above as GeoPandas exposes geometry methods of the active geometry column to the GeoDataFrame level:" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "id": "cb9af8d4-88cc-4fe6-966a-e9efaf6701dd", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "world['convex_hull'] = world.geometry.convex_hull " 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "id": "4653ba04-06e8-4684-bdd0-c4775cf94baa", 146 | "metadata": {}, 147 | "source": [ 148 | "Now you can see that we have two geometry columns stored in our `world` GeoDataFrame but only the original one is treated as an _active_ geometry (that is the one accessible directly, without getting the column first)." 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "id": "386275f4-43a0-4ed2-a9b7-ebf53242f2ab", 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "world.head()" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "id": "d2a7eba8-a4f3-4c47-b9d7-d7652678fac8", 164 | "metadata": {}, 165 | "source": [ 166 | "We can also plot the results. Both Russia and Fiji are a bit weird as they cross the anti-meridian." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "id": "b33c8337-9bce-4478-aa8e-b041ded3b692", 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "ax = world.plot(figsize=(12, 12))\n", 177 | "world.convex_hull.plot(ax=ax, facecolor='none', edgecolor='black')" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "id": "aa74cdbe-ddeb-42a5-b466-072c93a97e37", 183 | "metadata": {}, 184 | "source": [ 185 | "## What is Dask\n", 186 | "\n", 187 | "From the docs:\n", 188 | "\n", 189 | "> Dask provides advanced parallelism and distributed out-of-core computation with a dask.dataframe module designed to scale pandas. Since GeoPandas is an extension to the pandas DataFrame, the same way Dask scales pandas can also be applied to GeoPandas.\n", 190 | "\n", 191 | "We will cover the high-level API of Dask. For more, see the [Dask tutorial](https://tutorial.dask.org).\n", 192 | "\n", 193 | "Let's import `numpy` and `pandas` for a comparison and three high-level Dask modules - `bag`, `array`, and `dataframe`." 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "id": "900e0740-f592-438b-b0e3-ee20f26e0751", 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "import numpy as np\n", 204 | "import pandas as pd\n", 205 | "\n", 206 | "import dask.dataframe as dd\n", 207 | "import dask.array as da\n", 208 | "import dask.bag as db" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "id": "d6770195-2ad8-44f2-b3c3-d1c032ad78ed", 214 | "metadata": {}, 215 | "source": [ 216 | "Before we explore those, let's introduce the dask `Client` as it will allow us to see how dask manages all its tasks.\n", 217 | "\n", 218 | "Here we create a Client on top of a local (automatically created) cluster with 4 workers (the laptop we use has 4 performance cores)." 
219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "id": "2a23995a-cf0b-47aa-9552-12ca69f0cbab", 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "from dask.distributed import Client\n", 229 | "\n", 230 | "client = Client(n_workers=4)" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "id": "cba16798-a0df-456f-8ff6-45bd6ce6348d", 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "client" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "id": "31300c5c-39f8-4997-b9c5-6a9fdefea050", 246 | "metadata": {}, 247 | "source": [ 248 | "We can open the Dask dashboard to watch what is happenning in real-time using the link above, in the Client details. But if you have a [JupyterLab extension for Dask](https://github.com/dask/dask-labextension), you can watch different components directly from the JupyterLab interface." 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "id": "63f301b8-fe13-4be3-b75b-0541b91b7355", 254 | "metadata": {}, 255 | "source": [ 256 | "### dask.bag\n", 257 | "\n", 258 | "With the Client and cluster in place, we can properly explore Dask. Let's start with a `dask.bag`, the simplest of the objects. You can imagine it as a distributed list." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "id": "65ae1ed7-cde5-444a-a0f0-fbb8dd4feaaa", 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "b = db.from_sequence([1, 2, 3, 4, 5, 6, 2, 1], npartitions=2)\n", 269 | "b" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "id": "cf414b2e-405b-4d5a-ab10-24a9cd7b3c5a", 275 | "metadata": {}, 276 | "source": [ 277 | "Now, note that when we try to call `b`, we don't see its contents. \n", 278 | "\n", 279 | "Let's check what happens with `sum`." 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "id": "011741f8-d169-4893-9a17-0f4f72f6383d", 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "b.sum()" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "id": "01f7f777-c7a1-428e-bc07-c4f97f2c2b21", 295 | "metadata": {}, 296 | "source": [ 297 | "Again, we don't see the answer, but some abstract `Item` object instead. That is because Dask usually runs all the operations lazily and waits for a `compute` call before it does the actual computation.\n", 298 | "\n", 299 | "Instead, it plans what it should do and create a task graph. That looks like this:" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "id": "e12a2d2b-97f4-43cf-951e-5272bc33370a", 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "b.sum().visualize()" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "e4a4ae09-8f2e-4f1e-88ae-aad41ec2f7ec", 315 | "metadata": {}, 316 | "source": [ 317 | "We can see individual partitions (rectangles), operations (circles), and movement of data between partitions.\n", 318 | "\n", 319 | "When we call `compute`, this task graph is executed and Dask returns the expected value." 
320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "id": "e2733578-508d-4b45-b976-3992d0442e33", 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "b.sum().compute()" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "id": "89de6a07-0c94-4f18-9435-e54dd7751831", 335 | "metadata": {}, 336 | "source": [ 337 | "### dask.array\n", 338 | "\n", 339 | "Let's move on to an array. Where a bag is partitioned along 1 dimension (the sequence is essentially cut into pieces), an array is like a numpy array split along both dimensions. In practice, each of the partitions is a numpy array and the dask array combines them together. Each partition can then be processed separately. " 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "id": "4b204f4c-522c-487b-9433-82d68a0c65df", 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "data = np.arange(100_000).reshape(200, 500)\n", 350 | "a = da.from_array(data, chunks=(100, 100))\n", 351 | "a" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "id": "5e642e71-9cbc-4244-ad00-8a5413b94ac1", 357 | "metadata": {}, 358 | "source": [ 359 | "We see some dimensions, dtypes and sizes here but not the data. Because, again, everything is done lazily." 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "id": "ff387bb9-d49c-4797-abf5-cb29272ea77d", 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "a[:50, 200]" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "id": "05fd4c16-13b9-4faa-b7df-96d7b36d1b7f", 375 | "metadata": {}, 376 | "source": [ 377 | "Even indexing requires `compute` to return values, otherwise it still gives a dask array." 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "id": "20c93739-1a76-4b5d-a9ef-6b86d72050c9", 384 | "metadata": {}, 385 | "outputs": [], 386 | "source": [ 387 | "a[:50, 200].compute()" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "id": "346a5b76-622a-41c5-8a86-1ec9e611fe31", 393 | "metadata": {}, 394 | "source": [ 395 | "Similarly for `mean`." 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "id": "1c85e22a-6696-4e78-8769-53a14392e68b", 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "a.mean().compute()" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "id": "f9639daa-a1ff-4524-98b4-def76aa5cda9", 411 | "metadata": {}, 412 | "source": [ 413 | "Since the mean is not super straightforward to parallelise (you can't just take the mean of each partition and then the mean of those), we can check how dask implements its logic." 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "id": "ab7ab907-24ff-4d67-939b-efc36a16d216", 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "a.mean().visualize()" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "id": "40a4fc45-224b-4002-aaa3-9732f0e06481", 429 | "metadata": {}, 430 | "source": [ 431 | "Quite complex, right? Let's compare it to the indexing we did before." 
432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "id": "0127d06f-6e2d-46cd-9bd5-e38f297fe0e6", 438 | "metadata": {}, 439 | "outputs": [], 440 | "source": [ 441 | "a[:50, 200].visualize()" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "id": "761e124c-f672-4731-a7ad-c5ed93a39f37", 447 | "metadata": {}, 448 | "source": [ 449 | "You can see that dask efficiently accesses only that one partition it needs at this moment." 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "id": "b26d6f86-7813-4664-85fc-e8209919649d", 455 | "metadata": {}, 456 | "source": [ 457 | "### dask.dataframe\n", 458 | "\n", 459 | "Finally, we move to the parallelised DataFrame. It mirrors the logic of the array implementation, with a difference that individual partitions are pandas.DataFrames and partitioning happens along a single axis (rows).\n", 460 | "\n", 461 | "Dask.dataframe tries to mirror the pandas API. The same approach as we will later see with dask-geopandas." 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "id": "c6b42178-4360-49b5-aeb2-b7931308c7dd", 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "df = pd.read_csv(\"data/airports.csv\")\n", 472 | "df.head()" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "id": "3108a874-251a-4e24-863b-cf3ab400a712", 478 | "metadata": {}, 479 | "source": [ 480 | "In this specific case (`head()`), dask actually reads those 5 rows and shows them but that tends to be an exception, likely because it is a very cheap operation." 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "id": "a80274cc-d68b-4720-9bf3-6a5937d2f5c7", 487 | "metadata": {}, 488 | "outputs": [], 489 | "source": [ 490 | "df = dd.read_csv(\"data/airports.csv\")\n", 491 | "df.head()" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "id": "ee579606-8519-4d1e-a177-d4f1007aef18", 497 | "metadata": {}, 498 | "source": [ 499 | "If you try to show the whole DataFrame, you get a placeholder that tells you how many partitions you have, which columns and what are their dtypes." 500 | ] 501 | }, 502 | { 503 | "cell_type": "code", 504 | "execution_count": null, 505 | "id": "532fd533-6952-4924-903a-de3bd1568387", 506 | "metadata": {}, 507 | "outputs": [], 508 | "source": [ 509 | "df" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "id": "8c925fb9-7bbf-46cc-b5ab-27903a9712de", 515 | "metadata": {}, 516 | "source": [ 517 | "Since the `airports.csv` is a single file on disk, dask gives us a single partion. To create a partitioned data frame, we can repartition it and even save to a partitioned CSV." 518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": null, 523 | "id": "6368cef5", 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "df.repartition(4).to_csv(\"data/airports_csv/*.csv\", index=False)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "id": "a7a2e2fc", 533 | "metadata": {}, 534 | "source": [ 535 | "When we have more of CSV files in a folder, typically one per month or a country, we can read each as a partition. 
" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "id": "201f882e-73aa-4ace-bded-cf9401b365af", 542 | "metadata": {}, 543 | "outputs": [], 544 | "source": [ 545 | "df = dd.read_csv(\"data/airports_csv/*.csv\")\n", 546 | "df" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "id": "ff0b7b08-a1f3-43cd-b3d9-bfaf40589501", 552 | "metadata": {}, 553 | "source": [ 554 | "As before, all the computation is done lazily. Take a look at the computation of mean elevation." 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": null, 560 | "id": "65a5188e-8150-4535-b52a-203d97d5b0af", 561 | "metadata": {}, 562 | "outputs": [], 563 | "source": [ 564 | "elevation = df.elevation_ft.mean()\n", 565 | "elevation" 566 | ] 567 | }, 568 | { 569 | "cell_type": "markdown", 570 | "id": "ba44ca44-fe9a-499a-843c-3beb62ea3844", 571 | "metadata": {}, 572 | "source": [ 573 | "We get a dask Scalar object here but as before, we don't get the value until we call `compute()`." 574 | ] 575 | }, 576 | { 577 | "cell_type": "code", 578 | "execution_count": null, 579 | "id": "9a08fc50-897a-4fc5-9712-79d3988ce9aa", 580 | "metadata": {}, 581 | "outputs": [], 582 | "source": [ 583 | "elevation.compute()" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "id": "74b81842-28b9-433d-b7c6-9558362dd831", 589 | "metadata": {}, 590 | "source": [ 591 | "You can probably notice the similarity of the graph with the one calculating mean over an array." 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": null, 597 | "id": "81839c97-eea2-4969-b3a2-e25d50191382", 598 | "metadata": {}, 599 | "outputs": [], 600 | "source": [ 601 | "elevation.visualize()" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "id": "08b11a5a-c9e9-4d23-983c-920a4c024ef3", 607 | "metadata": {}, 608 | "source": [ 609 | "We can also quickly compare it to a task graph for something easier, like a sum." 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "execution_count": null, 615 | "id": "6c0a51c0-91c8-4cc3-92d9-5a2cf148610d", 616 | "metadata": {}, 617 | "outputs": [], 618 | "source": [ 619 | "df.elevation_ft.sum().visualize()" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "id": "7c9341b2-9687-4d51-9d5b-5555be12c114", 625 | "metadata": {}, 626 | "source": [ 627 | "Not that it makes much sense to compute a sum of elevations, but we can do that and if you're checking the dashboard, you'll notice very little communication as the task is easy to parallelise and we need to gather the results only in the final step." 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "id": "f1ea7346-17c1-402e-9e4b-76d6aa336c04", 634 | "metadata": {}, 635 | "outputs": [], 636 | "source": [ 637 | "df.elevation_ft.sum().compute()" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "id": "76750d1b-b79d-436d-ad8a-c725876bedcf", 643 | "metadata": {}, 644 | "source": [ 645 | "## Dask-GeoPandas\n", 646 | "\n", 647 | "Dask-GeoPandas follows exactly the same model as `dask.dataframe` adopted for scaling `pandas.DataFrame`. We have a single `dask_geopandas.GeoDataFrame`, composed of individual partitions where each is a `geopandas.GeoDataFrame`." 
648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": null, 653 | "id": "c83221e2-8016-409e-be3d-85fe4b0857cf", 654 | "metadata": {}, 655 | "outputs": [], 656 | "source": [ 657 | "import dask_geopandas" 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "id": "2fa39737-d827-4421-9159-5268028f562d", 663 | "metadata": {}, 664 | "source": [ 665 | "## Create dask GeoDataFrame\n", 666 | "\n", 667 | "We have plenty of options for building a `dask_geopandas.GeoDataFrame`: from an in-memory `geopandas.GeoDataFrame`, by reading a GIS file (using pyogrio under the hood), by reading GeoParquet or Feather, or from a dask.dataframe." 668 | ] 669 | }, 670 | { 671 | "cell_type": "code", 672 | "execution_count": null, 673 | "id": "b6f51ae5-fc1e-4fbd-a28c-a04ebad8ea77", 674 | "metadata": {}, 675 | "outputs": [], 676 | "source": [ 677 | "world_ddf = dask_geopandas.from_geopandas(world, npartitions=4)\n", 678 | "world_ddf" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": null, 684 | "id": "39216390-63b5-4a87-897a-8e65fca05b0b", 685 | "metadata": {}, 686 | "outputs": [], 687 | "source": [ 688 | "world_ddf_file = dask_geopandas.read_file(path, npartitions=4)\n", 689 | "world_ddf_file" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "id": "97f3020d-bb99-4416-a582-11da211a8fb8", 695 | "metadata": {}, 696 | "source": [ 697 | "### Partitioned IO\n", 698 | "\n", 699 | "Since we are working with individual partitions, it is useful to save the dataframe already partitioned. The ideal file format for that is GeoParquet." 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "id": "1ec8c53c-6cc0-40b6-b9e1-17b786c22fec", 706 | "metadata": { 707 | "tags": [] 708 | }, 709 | "outputs": [], 710 | "source": [ 711 | "world_ddf.to_parquet(\"data/world/\")\n", 712 | "world_ddf.to_crs(4326).to_parquet(\"data/world_4326/\") # later we will need the dataset in EPSG:4326 so we can already prepare it." 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "id": "811b8855-9298-4835-b15e-f34ca9715cd5", 718 | "metadata": {}, 719 | "source": [ 720 | "For more complex tasks, we recommend using Parquet IO as an intermediate step to avoid large task graphs." 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": null, 726 | "id": "d6121085-a81c-47dd-bf15-cae358fe91bc", 727 | "metadata": { 728 | "tags": [] 729 | }, 730 | "outputs": [], 731 | "source": [ 732 | "world_ddf = dask_geopandas.read_parquet(\"data/world/\")\n", 733 | "world_ddf" 734 | ] 735 | }, 736 | { 737 | "cell_type": "markdown", 738 | "id": "55470865-1758-475a-af94-c26a222ce1a3", 739 | "metadata": {}, 740 | "source": [ 741 | "## Embarrassingly parallel computation\n", 742 | "\n", 743 | "The first type of operation where you can benefit from parallelisation is so-called embarrassingly parallel computation. That is a computation where we treat individual partitions or individual rows independently of the others, meaning there is no inter-worker communication and no data needs to be sent elsewhere.\n", 744 | "\n", 745 | "One example of that is the calculation of area." 
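The area example below is just one instance; other element-wise GeoPandas operations behave the same way. A small sketch (building on the `world_ddf` read from `data/world/` above; everything stays lazy until `compute()` is called):

```py
# A few more embarrassingly parallel, element-wise operations (sketch only).
centroids = world_ddf.centroid        # centroid of each polygon, per partition
buffered = world_ddf.buffer(10_000)   # 10 km buffer in EPSG:3857 map units
lengths = world_ddf.length            # boundary length of each polygon
```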
746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": null, 751 | "id": "6edb7684-32e4-495f-a1ec-74aca19866da", 752 | "metadata": {}, 753 | "outputs": [], 754 | "source": [ 755 | "area = world_ddf.area\n", 756 | "area.visualize()" 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "id": "03224973-bfce-4d0b-bf1a-fc366284b233", 762 | "metadata": {}, 763 | "source": [ 764 | "A similar one, this time returning a `dask_geopandas.GeoSeries` instead of a `dask.dataframe.Series`, is the `convex_hull` method." 765 | ] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": null, 770 | "id": "38da2d25-5df0-4903-a438-13d59897d59e", 771 | "metadata": {}, 772 | "outputs": [], 773 | "source": [ 774 | "convex_hull = world_ddf.convex_hull\n", 775 | "convex_hull.visualize()" 776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "id": "31ee2c9f-9336-46da-9f6f-49e8f572a8d0", 781 | "metadata": {}, 782 | "source": [ 783 | "Since both create a series, we can assign both as individual columns. Let's see how that changes the task graph." 784 | ] 785 | }, 786 | { 787 | "cell_type": "code", 788 | "execution_count": null, 789 | "id": "b1e2d93d-8d04-4c80-bc00-e43b49b9a9a3", 790 | "metadata": {}, 791 | "outputs": [], 792 | "source": [ 793 | "world_ddf['area'] = world_ddf.area\n", 794 | "world_ddf['convex_hull'] = world_ddf.convex_hull\n", 795 | "world_ddf.visualize()" 796 | ] 797 | }, 798 | { 799 | "cell_type": "markdown", 800 | "id": "19bc9cf6-a2be-4674-926f-f8efe04cf18f", 801 | "metadata": {}, 802 | "source": [ 803 | "Finally, we can call `compute()` and get all the results." 804 | ] 805 | }, 806 | { 807 | "cell_type": "code", 808 | "execution_count": null, 809 | "id": "881ee02b-8928-46fd-ba8e-188705c6b758", 810 | "metadata": { 811 | "tags": [] 812 | }, 813 | "outputs": [], 814 | "source": [ 815 | "r = world_ddf.compute()" 816 | ] 817 | }, 818 | { 819 | "cell_type": "markdown", 820 | "id": "14a463d8-1f08-4995-9394-2112d5beeafe", 821 | "metadata": {}, 822 | "source": [ 823 | "## Spatial join\n", 824 | "\n", 825 | "If you have to deal with large dataframes and need a spatial join, dask-geopandas can help. Let's illustrate the logic of a spatial join on partitioned data using the locations of airports from around the world." 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": null, 831 | "id": "13ff8202-2d2f-454e-805a-a091eb19d828", 832 | "metadata": {}, 833 | "outputs": [], 834 | "source": [ 835 | "airports = pd.read_csv(\"data/airports.csv\")\n", 836 | "airports.head()" 837 | ] 838 | }, 839 | { 840 | "cell_type": "markdown", 841 | "id": "851ef74b-a48c-4cfd-96ab-b100c3c4584b", 842 | "metadata": {}, 843 | "source": [ 844 | "The data comes as a CSV, so we first need to create a GeoDataFrame." 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": null, 850 | "id": "4eec8022-19cc-45b3-a1d4-00dbbe55e983", 851 | "metadata": {}, 852 | "outputs": [], 853 | "source": [ 854 | "airports = geopandas.GeoDataFrame(\n", 855 | " airports,\n", 856 | " geometry=geopandas.GeoSeries.from_xy(\n", 857 | " airports[\"longitude_deg\"],\n", 858 | " airports[\"latitude_deg\"],\n", 859 | " crs=4326,\n", 860 | " )\n", 861 | ")" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "id": "a36783f2-eef2-4a81-b771-329cc5e2dc95", 867 | "metadata": {}, 868 | "source": [ 869 | "And from that, we can create a partitioned `dask_geopandas.GeoDataFrame`. Note that we could also read the CSV with dask.dataframe and create a GeoDataFrame from that, using the `dask_geopandas.from_dask_dataframe` function and `dask_geopandas.points_from_xy` to create the geometry; a rough sketch of that route follows. 
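A rough, hypothetical sketch of that alternative (exact keyword support may differ between dask-geopandas versions):

```py
# Hypothetical sketch: build the dask-geopandas GeoDataFrame from a lazy
# dask.dataframe instead of an in-memory geopandas.GeoDataFrame.
import dask.dataframe as dd
import dask_geopandas

airports_dd = dd.read_csv("data/airports.csv")
airports_dd["geometry"] = dask_geopandas.points_from_xy(
    airports_dd, "longitude_deg", "latitude_deg", crs=4326
)
airports_ddf = dask_geopandas.from_dask_dataframe(airports_dd)
airports_ddf = airports_ddf.set_geometry("geometry")
```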
But since it all comfortably fits in memory, we can pick whichever option we like." 870 | ] 871 | }, 872 | { 873 | "cell_type": "code", 874 | "execution_count": null, 875 | "id": "92417d31-ff01-4059-b166-88114baca621", 876 | "metadata": {}, 877 | "outputs": [], 878 | "source": [ 879 | "airports_ddf = dask_geopandas.from_geopandas(\n", 880 | " airports,\n", 881 | " npartitions=12\n", 882 | ")" 883 | ] 884 | }, 885 | { 886 | "cell_type": "markdown", 887 | "id": "ca1701ef-d33c-44a4-b81a-c7abeb352503", 888 | "metadata": {}, 889 | "source": [ 890 | "We will join the point data of airports with the `naturalearth_lowres` dataset we have stored as an already partitioned Parquet." 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": null, 896 | "id": "9ec0cb6f-1ae5-43b1-8f0a-77f18b1c68c0", 897 | "metadata": {}, 898 | "outputs": [], 899 | "source": [ 900 | "world_ddf = dask_geopandas.read_parquet(\"data/world_4326/\")\n", 901 | "world_ddf" 902 | ] 903 | }, 904 | { 905 | "cell_type": "markdown", 906 | "id": "8cbce801-7bf6-40be-b431-d8b68b870ca3", 907 | "metadata": {}, 908 | "source": [ 909 | "The API of `sjoin` is exactly the same as the one you know from geopandas. In this case, it currently only creates a task graph." 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": null, 915 | "id": "ae07e6dc-5025-4333-8d0f-95cf6f5df82e", 916 | "metadata": {}, 917 | "outputs": [], 918 | "source": [ 919 | "joined = airports_ddf.sjoin(world_ddf, predicate=\"within\")" 920 | ] 921 | }, 922 | { 923 | "cell_type": "markdown", 924 | "id": "38b8c371-2051-44a8-9d70-2470a0eed316", 925 | "metadata": {}, 926 | "source": [ 927 | "We started from 12 partitions of `airports_ddf` and 4 partitions of `world_ddf`. Since we haven't told Dask how these partitions are spatially distributed, it just plans to join each partition from one dataframe with each partition from the other one, 12 x 4 = 48 partitions in the end. We can easily check that with the `npartitions` attribute." 928 | ] 929 | }, 930 | { 931 | "cell_type": "code", 932 | "execution_count": null, 933 | "id": "5261936f-2929-4e6a-8637-6eaba0019363", 934 | "metadata": {}, 935 | "outputs": [], 936 | "source": [ 937 | "joined.npartitions" 938 | ] 939 | }, 940 | { 941 | "cell_type": "markdown", 942 | "id": "6f06bbc8-aa26-4bb3-88f2-804814f24209", 943 | "metadata": {}, 944 | "source": [ 945 | "The whole logic can also be represented by a task graph that illustrates the inefficiency of such an approach." 946 | ] 947 | }, 948 | { 949 | "cell_type": "code", 950 | "execution_count": null, 951 | "id": "4d382eb3-2b68-46e3-93ec-f74cd64bdd14", 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [ 955 | "joined.visualize()" 956 | ] 957 | }, 958 | { 959 | "cell_type": "markdown", 960 | "id": "d097ceee-e0bf-4451-84a1-db9057dca405", 961 | "metadata": { 962 | "tags": [] 963 | }, 964 | "source": [ 965 | "## Spatial partitioning\n", 966 | "\n", 967 | "Luckily, dask-geopandas supports spatial partitioning. It means that it can calculate the spatial extent of each partition (as the overall convex hull of the partition) and use it internally to make smarter decisions when creating the task graph. \n", 968 | "\n", 969 | "But first, we need to calculate these partitions. 
This operation is done eagerly and involves immediate reading of all geometries." 970 | ] 971 | }, 972 | { 973 | "cell_type": "code", 974 | "execution_count": null, 975 | "id": "cdcb758e-1ad0-4d04-ba58-26ce04dca646", 976 | "metadata": {}, 977 | "outputs": [], 978 | "source": [ 979 | "airports_ddf.calculate_spatial_partitions()" 980 | ] 981 | }, 982 | { 983 | "cell_type": "markdown", 984 | "id": "ac742c14-ee33-4df9-a037-cdbbdda6c16d", 985 | "metadata": {}, 986 | "source": [ 987 | "The resulting `spatial_partitions` attribute is a `geopandas.GeoSeries`." 988 | ] 989 | }, 990 | { 991 | "cell_type": "code", 992 | "execution_count": null, 993 | "id": "e4c7de49-1798-4d3e-b479-f18f0fc149e5", 994 | "metadata": {}, 995 | "outputs": [], 996 | "source": [ 997 | "airports_ddf.spatial_partitions" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "code", 1002 | "execution_count": null, 1003 | "id": "7ee6da69-67c8-4ec5-94c4-c8651b88a2da", 1004 | "metadata": { 1005 | "tags": [] 1006 | }, 1007 | "outputs": [], 1008 | "source": [ 1009 | "airports_ddf.spatial_partitions.explore()" 1010 | ] 1011 | }, 1012 | { 1013 | "cell_type": "markdown", 1014 | "id": "457e6d7a-b697-4215-8011-97faa89dcf48", 1015 | "metadata": {}, 1016 | "source": [ 1017 | "As you can see from the plot above, our partitions are not very homogenous in terms of their spatial distribution. Each contains points from nearly whole world. And that does not help with simplification of a task graph." 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "markdown", 1022 | "id": "5034996f-e493-4552-bf43-f7d30e49b06a", 1023 | "metadata": {}, 1024 | "source": [ 1025 | "### The goal\n", 1026 | "\n", 1027 | "We need our partitions to be spatially coherent to minimise the amount of inter-worker communication. So we have to find a way of reshuffling the data in between workers.\n", 1028 | "\n", 1029 | "### Hilbert curve\n", 1030 | "\n", 1031 | "One way of doing so is to follow the Hilbert space-filling curve, which is a 2-dimensional curve like the one below along which we can map our geometries (usually points). The distance along the Hilbert curve then represents a spatial proximity. Two points with a similar Hilbert distance are therefore ensured to be close in space.\n", 1032 | "\n", 1033 | "![Hilbert](fig/Hilbert-curve_rounded-gradient-animated.gif)\n", 1034 | "\n", 1035 | "(Animation by Tim Sauder, https://en.wikipedia.org/wiki/Hilbert_curve#/media/File:Hilbert-curve_rounded-gradient-animated.gif)\n", 1036 | "\n", 1037 | "`dask-geopandas` (as of 0.1.0) implements Hilbert curve and two other methods based on a similar concept of space-filling (Morton curve and Geohash). You can either compute them directly or let `dask-geopandas` use them under the hood in a `spatial_shuffle` method that computes the distance along the curve and uses it to reshuffle the dataframe into spatially homogenous chunks. 
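A rough sketch of those alternatives (method names as in the dask-geopandas documentation; exact signatures may vary between versions):

```py
# Hypothetical sketch of the alternative orderings mentioned above.
morton = airports_ddf.morton_distance()   # distance along the Morton (Z-order) curve
geohashes = airports_ddf.geohash()        # Geohash of each geometry

# spatial_shuffle can use either of them under the hood via the `by` keyword
shuffled_by_morton = airports_ddf.spatial_shuffle(by="morton")
```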
(Note that geometries are abstracted to the midpoint of their bounding box for the purpose of measuring the distance along the curve.)" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "code", 1042 | "execution_count": null, 1043 | "id": "695f0568-7fbb-4e30-b4e8-cd919c2edb89", 1044 | "metadata": {}, 1045 | "outputs": [], 1046 | "source": [ 1047 | "hilbert_distance = airports_ddf.hilbert_distance()\n", 1048 | "hilbert_distance" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": null, 1054 | "id": "3466b3b7-c3f0-4087-aa3b-cd1412482d90", 1055 | "metadata": {}, 1056 | "outputs": [], 1057 | "source": [ 1058 | "hilbert_distance.compute()" 1059 | ] 1060 | }, 1061 | { 1062 | "cell_type": "markdown", 1063 | "id": "6ae0bd3e-8a27-4167-a066-6999b41c9496", 1064 | "metadata": {}, 1065 | "source": [ 1066 | "`spatial_shuffle` uses by default `hilbert_distance` and partitions the dataframe based on this Series." 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "code", 1071 | "execution_count": null, 1072 | "id": "e48c50ac-d090-4bde-90d9-c5d3df3e9774", 1073 | "metadata": {}, 1074 | "outputs": [], 1075 | "source": [ 1076 | "airports_ddf = airports_ddf.spatial_shuffle()" 1077 | ] 1078 | }, 1079 | { 1080 | "cell_type": "markdown", 1081 | "id": "357fbb71-e8dd-458a-bb5e-eb70ed6f53e9", 1082 | "metadata": {}, 1083 | "source": [ 1084 | "We can now check how the new partitions look like in space." 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": null, 1090 | "id": "dec44c64-d875-4d71-9256-7b2a4fc7908d", 1091 | "metadata": { 1092 | "tags": [] 1093 | }, 1094 | "outputs": [], 1095 | "source": [ 1096 | "airports_ddf.spatial_partitions.explore()" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "markdown", 1101 | "id": "b2629430-2ef0-48b1-aeb3-b1fe67a032bc", 1102 | "metadata": {}, 1103 | "source": [ 1104 | "When we are reading parquet file, its metadata already contain the information on the extent of each partition and we therefore don't have to calculate them by reading all the geometries. We can quickly check that." 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "code", 1109 | "execution_count": null, 1110 | "id": "214a5f75-8199-412a-a2ca-f03e27040a65", 1111 | "metadata": {}, 1112 | "outputs": [], 1113 | "source": [ 1114 | "world_ddf.spatial_partitions is not None" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "markdown", 1119 | "id": "c7902b24-d1b6-476d-8023-1129b7f71948", 1120 | "metadata": {}, 1121 | "source": [ 1122 | "The world dataset has known partitions but is not spatially shuffled." 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "code", 1127 | "execution_count": null, 1128 | "id": "e0a7f683-7e16-45e6-9018-50fcdc0c888d", 1129 | "metadata": {}, 1130 | "outputs": [], 1131 | "source": [ 1132 | "world_ddf.spatial_partitions.explore()" 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "markdown", 1137 | "id": "32806bca-85ba-473e-b397-87fb9d563c06", 1138 | "metadata": {}, 1139 | "source": [ 1140 | "Even without doing that, we can already see that the resulting number of partitions is now 33, instead of 48 as some of the joins that would result in an empty dataframe are simply filtered out." 
1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "code", 1145 | "execution_count": null, 1146 | "id": "438faef8-657c-4c0b-94d4-7f7b4e4793f3", 1147 | "metadata": {}, 1148 | "outputs": [], 1149 | "source": [ 1150 | "joined = airports_ddf.sjoin(world_ddf, predicate=\"within\")" 1151 | ] 1152 | }, 1153 | { 1154 | "cell_type": "code", 1155 | "execution_count": null, 1156 | "id": "0d4033ae-0a22-435c-9a3e-7409bbee9900", 1157 | "metadata": {}, 1158 | "outputs": [], 1159 | "source": [ 1160 | "joined.npartitions" 1161 | ] 1162 | }, 1163 | { 1164 | "cell_type": "code", 1165 | "execution_count": null, 1166 | "id": "f72dcd13-d4aa-4f02-9e22-07b27f84586d", 1167 | "metadata": {}, 1168 | "outputs": [], 1169 | "source": [ 1170 | "%%time\n", 1171 | "joined.compute()" 1172 | ] 1173 | }, 1174 | { 1175 | "cell_type": "markdown", 1176 | "id": "d812565a-a6df-41b7-883f-0bc2a74336ca", 1177 | "metadata": { 1178 | "tags": [] 1179 | }, 1180 | "source": [ 1181 | "### What about a larger problem?\n", 1182 | "\n", 1183 | "Dropping down from 48 to 33 partitions doesn't sound like a big deal. But you usually want to use dask-geopandas to tackle a bit larger problems. To simulate one, we can load a GADM dataset containing detailed administrative boundaries of the whole world (around a 2 GB GPKG) and join our airport data to that.\n", 1184 | "\n", 1185 | "_Note that this dataset is not part of the repository so you will not be able to run these two cells._\n", 1186 | "\n", 1187 | "_If you want to run the code you need to download the `gadm404.gpkg` from [GADM.org](https://gadm.org/download_world.html) and unzip it to the `data` folder._\n", 1188 | "\n", 1189 | "_To create the `gadm_spatial` used below, you need to read the GPKG, spatially shuffle it and save it as a GeoParquet:_\n", 1190 | "\n", 1191 | "```\n", 1192 | "gadm_ddf = dask_geopandas.read_file('data/gadm404.gpkg', npartitions=64)\n", 1193 | "gadm_ddf.spatial_shuffle().to_parquet(\"data/gadm_spatial/\")\n", 1194 | "```" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": null, 1200 | "id": "21dac6f2-8b94-4c73-b776-d5da8b390f35", 1201 | "metadata": {}, 1202 | "outputs": [], 1203 | "source": [ 1204 | "gadm_ddf = dask_geopandas.read_file('data/gadm404.gpkg', npartitions=64)\n", 1205 | "joined = airports_ddf.sjoin(gadm_ddf, predicate=\"within\")\n", 1206 | "joined.npartitions" 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "markdown", 1211 | "id": "6cccff35-1237-4bf6-b11a-9363ffb9cc02", 1212 | "metadata": {}, 1213 | "source": [ 1214 | "Without any spatial sorting of the GADM dataset, we have to do 12 x 64 = 768 joins.\n", 1215 | "\n", 1216 | "But we can load the same dataset that has been spatially sorted." 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "code", 1221 | "execution_count": null, 1222 | "id": "d37393aa-764b-44eb-a1d1-10e6fd1ddb2c", 1223 | "metadata": { 1224 | "tags": [] 1225 | }, 1226 | "outputs": [], 1227 | "source": [ 1228 | "gadm_sorted = dask_geopandas.read_parquet(\"data/gadm_spatial/\")\n", 1229 | "joined = airports_ddf.sjoin(gadm_sorted, predicate=\"within\")\n", 1230 | "joined.npartitions" 1231 | ] 1232 | }, 1233 | { 1234 | "cell_type": "markdown", 1235 | "id": "5c4da771-eab8-4f05-900c-64f7924f9590", 1236 | "metadata": {}, 1237 | "source": [ 1238 | "The resulting number of partitions is 151, filtering out more than 80% of the spatial joins, which no longer need to be done." 
1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "markdown", 1243 | "id": "8ca7bb1b-1ebd-4aa0-ae7d-eb4bf934a4ca", 1244 | "metadata": {}, 1245 | "source": [ 1246 | "## Aggregations with dissolve\n", 1247 | "\n", 1248 | "Dissolve is a typical operation when you usually need to do some shuffling of data between workers to ensure that all observations within the same category (specified using the `by` keyword) end up in the same partition so they can actually be dissolved into a single polygon. As you can imagine, proper spatial partitions may help as well but in this case, they help only in the computation, not in the task graph.\n", 1249 | "\n", 1250 | "Let's have a look at an example." 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "code", 1255 | "execution_count": null, 1256 | "id": "5b90d0e1-6b9b-4281-ba2f-136a608ed4ec", 1257 | "metadata": {}, 1258 | "outputs": [], 1259 | "source": [ 1260 | "world_ddf = dask_geopandas.read_parquet(\"data/world/\")\n", 1261 | "\n", 1262 | "continents = world_ddf.dissolve('continent', split_out=6)\n", 1263 | "continents" 1264 | ] 1265 | }, 1266 | { 1267 | "cell_type": "markdown", 1268 | "id": "0df856a7-00a9-494e-9697-c9fa3449fe92", 1269 | "metadata": {}, 1270 | "source": [ 1271 | "Above, we are using the API we know from GeoPandas with one new keyword - `split_out`. That specifies into how many partitions should we send the dissolved result. The whole method is based on the `groupby`, exactly as the original one, which returns a single partition by default. We rarely want that to happen." 1272 | ] 1273 | }, 1274 | { 1275 | "cell_type": "code", 1276 | "execution_count": null, 1277 | "id": "b308e36e-a6c4-4418-8f9f-eabdab337199", 1278 | "metadata": {}, 1279 | "outputs": [], 1280 | "source": [ 1281 | "continents.visualize()" 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "markdown", 1286 | "id": "c728957c-6199-43bf-98d4-7eb97d427fc6", 1287 | "metadata": {}, 1288 | "source": [ 1289 | "The task graph shows exactly what happens. Since Dask doesn't know which categories are where, it designs the task graph to move potentially shuffle data from every original partition to every new one. In reality, some of these will be empty. And the better spatial partitions we have, the more of them will be empty, hence our operation will be cheaper." 1290 | ] 1291 | }, 1292 | { 1293 | "cell_type": "markdown", 1294 | "id": "c59ac64f-5a0c-407e-8c38-703888e6d900", 1295 | "metadata": { 1296 | "tags": [] 1297 | }, 1298 | "source": [ 1299 | "## Custom functions with `map_partitions`\n", 1300 | "\n", 1301 | "Not every function you may need is built-in and you often need to apply a custom one to the partitioned dataframe. The most common way of doing that is the `map_partitions` method.\n", 1302 | "\n", 1303 | "The typical use case is below. We want to describe the shape of each polygon using the `shape_index` from the `esda` package. With geopandas, it would look like this." 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "code", 1308 | "execution_count": null, 1309 | "id": "41d81a3e-0f37-45e2-a94b-a9c1d2bd098d", 1310 | "metadata": {}, 1311 | "outputs": [], 1312 | "source": [ 1313 | "from esda.shape import shape_index\n", 1314 | "\n", 1315 | "world['shape_idx'] = shape_index(world)\n", 1316 | "world.explore('shape_idx')" 1317 | ] 1318 | }, 1319 | { 1320 | "cell_type": "markdown", 1321 | "id": "1a607879-a30c-436f-98d9-4b6d84ece225", 1322 | "metadata": {}, 1323 | "source": [ 1324 | "But when you try to use the same code with a `dask_geopandas.GeoDataFrame`, it will not work." 
1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "code", 1329 | "execution_count": null, 1330 | "id": "a2d0e3cc-133f-4816-974e-7ae46fb98922", 1331 | "metadata": {}, 1332 | "outputs": [], 1333 | "source": [ 1334 | "# THIS WILL FAIL\n", 1335 | "world_ddf = dask_geopandas.read_parquet(\"data/world/\")\n", 1336 | "world_ddf['shape_idx'] = shape_index(world_ddf)" 1337 | ] 1338 | }, 1339 | { 1340 | "cell_type": "markdown", 1341 | "id": "f588dd1c-ec69-4eb2-900c-e2d200d2ab49", 1342 | "metadata": {}, 1343 | "source": [ 1344 | "In fact, it doesn't even give us a meaningful error message. Simply because `esda` does not expect a `dask_geopandas.GeoDataFrame` here, but expects a `geopandas.GeoDataFrame` or some form of an in-memory array of geometries. Then it returns an array of floats.\n", 1345 | "\n", 1346 | "In this case, we can use `map_partitions` to _map_ the `shape_index()` function across individual partitions. Dummy code would look something like this:\n", 1347 | "\n", 1348 | "```py\n", 1349 | "results = []\n", 1350 | "\n", 1351 | "for partition in ddf.partitions:\n", 1352 | " results.append(shape_index(partition))\n", 1353 | "```\n", 1354 | "\n", 1355 | "The actual code is of course not a loop like this, but the principle is the same. We take each partition and apply the function." 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "code", 1360 | "execution_count": null, 1361 | "id": "c5e3b06b-f2b6-4338-91ac-68f728f1a704", 1362 | "metadata": {}, 1363 | "outputs": [], 1364 | "source": [ 1365 | "world_ddf = dask_geopandas.read_parquet(\"data/world/\")\n", 1366 | "shape_idx = world_ddf.map_partitions(shape_index)" 1367 | ] 1368 | }, 1369 | { 1370 | "cell_type": "markdown", 1371 | "id": "cacb7765-be71-4945-9bde-a42f16e89c7f", 1372 | "metadata": {}, 1373 | "source": [ 1374 | "The resulting task graph is the same as we have seen before in simple cases of embarrassingly parallel computation. `map_partitions` will always be embarrassingly parallel." 1375 | ] 1376 | }, 1377 | { 1378 | "cell_type": "code", 1379 | "execution_count": null, 1380 | "id": "e9d310e9-3d5b-44af-abd1-f5e8ab193fec", 1381 | "metadata": {}, 1382 | "outputs": [], 1383 | "source": [ 1384 | "shape_idx.visualize()" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "execution_count": null, 1390 | "id": "d1696723-3c49-4f2f-a371-4ee73349967c", 1391 | "metadata": {}, 1392 | "outputs": [], 1393 | "source": [ 1394 | "r = shape_idx.compute()" 1395 | ] 1396 | }, 1397 | { 1398 | "cell_type": "markdown", 1399 | "id": "420752cd-b2e5-4b45-8894-34f3e5c533d1", 1400 | "metadata": {}, 1401 | "source": [ 1402 | "### Custom function\n", 1403 | "\n", 1404 | "We can also write our own custom functions to be used with `map_partitions`. The only rule is that everything needs to happen within a single GeoDataFrame, i.e. within a single partition, independently of the others. We don't have to return a new column; we can also return a single value for each partition, like the sum of the area covered by polygons in each partition. 
" 1405 | ] 1406 | }, 1407 | { 1408 | "cell_type": "code", 1409 | "execution_count": null, 1410 | "id": "9cd19de3-3ef0-4be7-af4a-110e9d2d8f2d", 1411 | "metadata": {}, 1412 | "outputs": [], 1413 | "source": [ 1414 | "def my_fn(gdf):\n", 1415 | " \"\"\"get a sum of area covered by polygons in a gdf\n", 1416 | " \n", 1417 | " Parameters\n", 1418 | " ----------\n", 1419 | " gdf : GeoDataFrame\n", 1420 | " \n", 1421 | " Returns\n", 1422 | " -------\n", 1423 | " float\n", 1424 | " \n", 1425 | " \"\"\"\n", 1426 | " area = gdf.geometry.area\n", 1427 | " return sum(area)" 1428 | ] 1429 | }, 1430 | { 1431 | "cell_type": "markdown", 1432 | "id": "dd45ad4e-bb4d-4297-b7eb-7cb759b8eaee", 1433 | "metadata": {}, 1434 | "source": [ 1435 | "We cannot assign this as a new column as we did above, but we also don't need it." 1436 | ] 1437 | }, 1438 | { 1439 | "cell_type": "code", 1440 | "execution_count": null, 1441 | "id": "277bf13e-4c50-49fc-92f6-3d4cd679aec0", 1442 | "metadata": { 1443 | "tags": [] 1444 | }, 1445 | "outputs": [], 1446 | "source": [ 1447 | "world_ddf.map_partitions(my_fn).compute()" 1448 | ] 1449 | }, 1450 | { 1451 | "cell_type": "markdown", 1452 | "id": "7c348112-6efa-4188-bcb7-96bff6ababaa", 1453 | "metadata": {}, 1454 | "source": [ 1455 | "The task graph is the same as before, even though we return only a single value per partition." 1456 | ] 1457 | }, 1458 | { 1459 | "cell_type": "code", 1460 | "execution_count": null, 1461 | "id": "ddb28ac8-c261-4042-98c4-cbee12a0a5a3", 1462 | "metadata": {}, 1463 | "outputs": [], 1464 | "source": [ 1465 | "world_ddf.map_partitions(my_fn).visualize()" 1466 | ] 1467 | }, 1468 | { 1469 | "cell_type": "markdown", 1470 | "id": "f61e4a0c-a697-4142-93e7-fa05f870ce99", 1471 | "metadata": {}, 1472 | "source": [ 1473 | "If you want to assign the result as a new column of your dataframe, you need to ensure you return an array or a Series of the correct length." 1474 | ] 1475 | }, 1476 | { 1477 | "cell_type": "code", 1478 | "execution_count": null, 1479 | "id": "940bdc65-cec9-416b-aa69-4e08090eb654", 1480 | "metadata": {}, 1481 | "outputs": [], 1482 | "source": [ 1483 | "def get_hull_area(gdf):\n", 1484 | " \"\"\"Get area of each convex hull and return pandas.Series\n", 1485 | " \n", 1486 | " Parameters\n", 1487 | " ----------\n", 1488 | " gdf : GeoDataFrame\n", 1489 | " \n", 1490 | " Returns\n", 1491 | " -------\n", 1492 | " pandas.Series\n", 1493 | " \"\"\"\n", 1494 | " \n", 1495 | " hulls = gdf.convex_hull\n", 1496 | " return hulls.area" 1497 | ] 1498 | }, 1499 | { 1500 | "cell_type": "code", 1501 | "execution_count": null, 1502 | "id": "5ab8b8f5-38e9-41c2-be31-08f660d8f7f2", 1503 | "metadata": {}, 1504 | "outputs": [], 1505 | "source": [ 1506 | "world_ddf['hull_area'] = world_ddf.map_partitions(get_hull_area)" 1507 | ] 1508 | }, 1509 | { 1510 | "cell_type": "code", 1511 | "execution_count": null, 1512 | "id": "d2f47751-0aad-4b9c-b4d0-3f9725d1726d", 1513 | "metadata": { 1514 | "tags": [] 1515 | }, 1516 | "outputs": [], 1517 | "source": [ 1518 | "world_ddf.compute()" 1519 | ] 1520 | }, 1521 | { 1522 | "cell_type": "markdown", 1523 | "id": "0776eb2b-bbf3-4d0b-b11c-901406603a14", 1524 | "metadata": {}, 1525 | "source": [ 1526 | "The task graph now includes the `assign` step, assigning the new column to the dataframe." 
1527 | ] 1528 | }, 1529 | { 1530 | "cell_type": "code", 1531 | "execution_count": null, 1532 | "id": "821b0352-d39f-4b2b-bab5-62634f48f925", 1533 | "metadata": { 1534 | "tags": [] 1535 | }, 1536 | "outputs": [], 1537 | "source": [ 1538 | "world_ddf.visualize()" 1539 | ] 1540 | }, 1541 | { 1542 | "cell_type": "markdown", 1543 | "id": "3de6194f-824a-45fa-9bc4-45d2553c8d74", 1544 | "metadata": { 1545 | "tags": [] 1546 | }, 1547 | "source": [ 1548 | "### Specifying meta-data\n", 1549 | "\n", 1550 | "To build a task graph, Dask doesn't need to see the data. But it needs to understand their general structure and dtypes. With simple `map_partitions` cases, you don't need to worry about that as Dask is often able to figure that out itself. But sometimes it struggles. \n", 1551 | "\n", 1552 | "Let's look a bit under the hood here." 1553 | ] 1554 | }, 1555 | { 1556 | "cell_type": "code", 1557 | "execution_count": null, 1558 | "id": "ea6b4a4c-ef19-473d-a762-fa33b4396700", 1559 | "metadata": {}, 1560 | "outputs": [], 1561 | "source": [ 1562 | "world_ddf = dask_geopandas.read_parquet(\"data/world_4326/\")" 1563 | ] 1564 | }, 1565 | { 1566 | "cell_type": "markdown", 1567 | "id": "3c66c4e7-f754-40ff-8a6e-0d919a79d7e6", 1568 | "metadata": {}, 1569 | "source": [ 1570 | "Each object contains its `_meta` data. For GeoDataFrame, that is usually an empty frame with columns and their dtypes set. Like this one:" 1571 | ] 1572 | }, 1573 | { 1574 | "cell_type": "code", 1575 | "execution_count": null, 1576 | "id": "e8270c25-c5d9-4d48-89af-91228629c07e", 1577 | "metadata": {}, 1578 | "outputs": [], 1579 | "source": [ 1580 | "world_ddf._meta" 1581 | ] 1582 | }, 1583 | { 1584 | "cell_type": "markdown", 1585 | "id": "b3ba021a-e725-490c-9f7e-7a28d7d614f4", 1586 | "metadata": {}, 1587 | "source": [ 1588 | "You can see that all dtypes are set." 1589 | ] 1590 | }, 1591 | { 1592 | "cell_type": "code", 1593 | "execution_count": null, 1594 | "id": "c9c5ab04-460a-4d75-9990-5d74aec3977b", 1595 | "metadata": {}, 1596 | "outputs": [], 1597 | "source": [ 1598 | "world_ddf._meta.dtypes" 1599 | ] 1600 | }, 1601 | { 1602 | "cell_type": "markdown", 1603 | "id": "4a4d36af-6f17-42cb-8d2d-7bd38c39d4e2", 1604 | "metadata": {}, 1605 | "source": [ 1606 | "Now, we can try to implement our own version of `dissolve` that works well when all data fit in memory, to make the exmaple a bit more complicated. We need two steps:\n", 1607 | "\n", 1608 | "1. Shuffle the data into partitions based on the `continent` column. That ensures that all observations from the same continent are within a single partition.\n", 1609 | "2. Use `map_partitions` to apply `dissolve` from geopandas.\n", 1610 | "\n", 1611 | "We can use the `shuffle` method to do the first step. Note that this is also a form of spatial partitioning and it may be useful to follow the attribute if you have one and the final partitions are of a roughly the same size." 
1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "code", 1616 | "execution_count": null, 1617 | "id": "97a5a994-c2f9-480f-bd9d-787c7ee8753a", 1618 | "metadata": {}, 1619 | "outputs": [], 1620 | "source": [ 1621 | "shuffled = world_ddf.shuffle(\n", 1622 | " \"continent\", npartitions=7, ignore_index=True\n", 1623 | ")" 1624 | ] 1625 | }, 1626 | { 1627 | "cell_type": "code", 1628 | "execution_count": null, 1629 | "id": "c65ccd1e-1529-476f-8695-6799508fe10b", 1630 | "metadata": {}, 1631 | "outputs": [], 1632 | "source": [ 1633 | "shuffled.calculate_spatial_partitions()\n", 1634 | "shuffled.spatial_partitions.explore()" 1635 | ] 1636 | }, 1637 | { 1638 | "cell_type": "markdown", 1639 | "id": "78068a5a-83d5-4658-abb7-d3c4eca01d2e", 1640 | "metadata": {}, 1641 | "source": [ 1642 | "Sometimes the inference of the meta DataFrame just fails, mostly because it is empty. If that happens, you can manually specify the meta DataFrame and pass it to `map_partitions`. Even if it does not fail, like in this case, it is often better to pass it directly as it can be cheaper and you avoid potential issues that may come as a result of a wrong inference." 1643 | ] 1644 | }, 1645 | { 1646 | "cell_type": "code", 1647 | "execution_count": null, 1648 | "id": "4071f1a6-6211-47b5-8f6d-02341242c30a", 1649 | "metadata": {}, 1650 | "outputs": [], 1651 | "source": [ 1652 | "meta = world_ddf._meta.dissolve(by=\"continent\", as_index=False)\n", 1653 | "meta" 1654 | ] 1655 | }, 1656 | { 1657 | "cell_type": "markdown", 1658 | "id": "1f11e880-898a-4255-bce1-af3eb815af42", 1659 | "metadata": {}, 1660 | "source": [ 1661 | "With the `meta` defined, we can take the `geopandas.GeoDataFrame.dissolve` method and pass it to `map_partitions`. All `**kwargs` are passed as arguments to the method/function you are applying." 1662 | ] 1663 | }, 1664 | { 1665 | "cell_type": "code", 1666 | "execution_count": null, 1667 | "id": "39e38f18-c95e-4a75-89da-519eaf9e0295", 1668 | "metadata": {}, 1669 | "outputs": [], 1670 | "source": [ 1671 | "dissolved = shuffled.map_partitions(\n", 1672 | " geopandas.GeoDataFrame.dissolve, by=\"continent\", as_index=False, meta=meta\n", 1673 | ")" 1674 | ] 1675 | }, 1676 | { 1677 | "cell_type": "code", 1678 | "execution_count": null, 1679 | "id": "7e27c056-5169-46c9-8498-0bad90f93894", 1680 | "metadata": {}, 1681 | "outputs": [], 1682 | "source": [ 1683 | "dissolved.visualize()" 1684 | ] 1685 | }, 1686 | { 1687 | "cell_type": "code", 1688 | "execution_count": null, 1689 | "id": "cf2125b0-b23f-4e63-a623-b6e6fa0508d4", 1690 | "metadata": {}, 1691 | "outputs": [], 1692 | "source": [ 1693 | "dissolved.compute()" 1694 | ] 1695 | }, 1696 | { 1697 | "cell_type": "markdown", 1698 | "id": "d6e5baa6-3368-40b3-a350-cfbfe62bd556", 1699 | "metadata": {}, 1700 | "source": [ 1701 | "When you are done, you can shut down the Dask client. If you want to do the exercises below, do not do that yet." 1702 | ] 1703 | }, 1704 | { 1705 | "cell_type": "code", 1706 | "execution_count": null, 1707 | "id": "139b7bd1-d45f-4a1f-ae16-4e3103979532", 1708 | "metadata": {}, 1709 | "outputs": [], 1710 | "source": [ 1711 | "client.shutdown()" 1712 | ] 1713 | }, 1714 | { 1715 | "cell_type": "markdown", 1716 | "id": "d6174c47-1287-4819-9ea2-a1cda6a6b298", 1717 | "metadata": {}, 1718 | "source": [ 1719 | "## Limits and caveats\n", 1720 | "\n", 1721 | "Truth be told, we are now playing with version 0.1 of dask-geopandas and not everything is as polished as we would like it to be. 
So there are some things that are not yet fully supported.\n", 1722 | "\n", 1723 | "- **Overlapping computation** - With `dask.dataframe` and `dask.array` you can use `map_overlap` to do some overlapping computations for which you need observations from neighbouring partitions. With dask-geopandas, we would need this overlap to be spatial and that is not yet supported. That means that whatever depends on topology or similar operations is currently not very easy to parallelise.\n", 1724 | "- **Spatial indexing** - While you can use spatial indexing over spatial partitions and then within individual partitions, as we do under the hood in `sjoin`, it requires a bit of low-level Dask code to make it run correctly. We hope to make that easier at some point in the future. We also want to expand the use of the spatial partitioning information to more methods (currently only `sjoin` makes use of it).\n", 1725 | "- **Memory management** - Even though Dask can work out-of-core and you may see dask-geopandas behaving that way sometimes, we still have some unresolved memory issues due to geometries being stored as C objects, hence their actual size is not directly visible to Dask.\n", 1726 | "\n", 1727 | "The same, or at least very similar, rules about when not to use dask-geopandas apply as for vanilla dask.dataframe (from the [Dask documentation](https://docs.dask.org/en/stable/dataframe-best-practices.html)):\n", 1728 | "\n", 1729 | "- For data that fits into RAM, geopandas can often be faster and easier to use than Dask. While “Big Data” tools can be exciting, they are almost always worse than normal data tools while those remain appropriate. But for embarrassingly parallel computation, it will often bring a speedup with minimal overhead.\n", 1730 | "- Similar to above, even if you have a large dataset there may be a point in your computation where you’ve reduced things to a more manageable level. You may want to switch to (geo)pandas at this point.\n", 1731 | "\n", 1732 | "```python\n", 1733 | "df = dd.read_parquet('my-giant-file.parquet')\n", 1734 | "df = df[df.name == 'Alice']             # Select a subsection\n", 1735 | "result = df.groupby('id').value.mean()  # Reduce to a smaller size\n", 1736 | "result = result.compute()               # Convert to pandas dataframe\n", 1737 | "result...                               # Continue working with pandas\n", 1738 | "```\n", 1739 | "\n", 1740 | "- Usual pandas and GeoPandas performance tips, like avoiding apply, using vectorized operations, using categoricals, etc., all apply equally to Dask DataFrame and dask-geopandas.\n", 1741 | "\n", 1742 | "See more best practices in the [Dask documentation](https://docs.dask.org/en/stable/dataframe-best-practices.html)." 1743 | ] 1744 | }, 1745 | { 1746 | "cell_type": "markdown", 1747 | "id": "40524b9e-3b01-488b-83b9-b2c8aa9cf663", 1748 | "metadata": {}, 1749 | "source": [ 1750 | "## Exercises\n", 1751 | "\n", 1752 | "Use the `data/airports_csv` folder and try to do the following using Dask:\n", 1753 | "\n", 1754 | "- Read the contents as a `dask.dataframe`\n", 1755 | "- Create a valid `dask_geopandas.GeoDataFrame` (a starter sketch for these first two steps is included at the end of this document)\n", 1756 | "- Calculate and explore spatial partitions. If you think there's a need to spatially shuffle the data, do so.\n", 1757 | "  - Try comparing different sorting methods (check the docs!). Which one is the best and why?\n", 1758 | "- How many airports are there per continent? And how many per country?\n", 1759 | "- Are there any points not falling on land?
How many?\n", 1760 | "- Where would the ideal single airport in each country be if you had to build only one?" 1761 | ] 1762 | } 1763 | ], 1764 | "metadata": { 1765 | "kernelspec": { 1766 | "display_name": "geopython_tutorial", 1767 | "language": "python", 1768 | "name": "geopython_tutorial" 1769 | }, 1770 | "language_info": { 1771 | "codemirror_mode": { 1772 | "name": "ipython", 1773 | "version": 3 1774 | }, 1775 | "file_extension": ".py", 1776 | "mimetype": "text/x-python", 1777 | "name": "python", 1778 | "nbconvert_exporter": "python", 1779 | "pygments_lexer": "ipython3", 1780 | "version": "3.10.5" 1781 | } 1782 | }, 1783 | "nbformat": 4, 1784 | "nbformat_minor": 5 1785 | } 1786 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2022, Martin Fleischmann, Joris Van den Bossche 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # Dask-GeoPandas introduction tutorial 2 | 3 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/martinfleis/dask-geopandas-tutorial/main?urlpath=lab/) 4 | 5 | 6 | ## Setting up to follow the tutorial 7 | 8 | ### Step 1: download the tutorial material 9 | 10 | If you are a git user, you can get the tutorial materials by cloning this repo: 11 | 12 | ``` 13 | git clone https://github.com/martinfleis/dask-geopandas-tutorial.git 14 | cd dask-geopandas-tutorial 15 | ``` 16 | 17 | Otherwise, to download the repository to your local machine as a zip-file, 18 | click `Download ZIP` on the repository page 19 | 20 | (green button "Code"). After the download, unzip it in a location of your choice 21 | within your user account (e.g. `My Documents`, not `C:\`).
22 | 23 | ### Step 2: install the required Python packages 24 | 25 | To follow the tutorial, we recommend creating a `conda` environment to 26 | ensure you have all the required packages installed (the 27 | [environment.yml](environment.yml) file lists the required packages). 28 | 29 | If you do not yet have `conda` installed, you can install miniconda 30 | (https://conda.io/miniconda.html) or the (larger) Anaconda distribution 31 | (https://www.anaconda.com/download/). 32 | 33 | Using conda, we recommend creating a new environment with all packages using 34 | the following commands (after cloning or downloading this GitHub repo and 35 | navigating to the directory, see above): 36 | 37 | ```bash 38 | # setting the configuration so all packages come from the conda-forge channel 39 | conda config --add channels conda-forge 40 | conda config --set channel_priority strict 41 | # mamba provides a faster implementation of conda 42 | conda install mamba 43 | # creating the environment 44 | mamba env create --file environment.yml 45 | # activating the environment 46 | conda activate geopandas-tutorial 47 | ``` 48 | 49 | In case you do not want to install everything and just want to try out the course material, use the environment set up by Binder [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/martinfleis/dask-geopandas-tutorial/main?urlpath=lab/) and open the notebooks right away. 50 | 51 | ### Step 3: starting Jupyter Lab 52 | 53 | The tutorial itself is a [Jupyter notebook](http://jupyter.org/), an interactive environment to write and run code. 54 | 55 | In the terminal, navigate to the `dask-geopandas-tutorial` directory (downloaded or cloned in the previous section). 56 | 57 | Ensure that the correct environment is activated: 58 | 59 | ``` 60 | conda activate geopandas-tutorial 61 | ``` 62 | 63 | Start Jupyter Lab by typing 64 | 65 | ``` 66 | jupyter lab 67 | ``` 68 | 69 | --- 70 | 71 | Data included: 72 | 73 | - `airports.csv` from 74 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: geopandas-tutorial 2 | channels: 3 | - conda-forge 4 | dependencies: 5 | - python=3.10 6 | - jupyterlab>3 7 | - geopandas>=0.11 8 | - dask-geopandas>=0.1.3 9 | - pyogrio 10 | - pyarrow 11 | - python-graphviz 12 | - esda 13 | - dask-labextension 14 | -------------------------------------------------------------------------------- /fig/Hilbert-curve_rounded-gradient-animated.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/martinfleis/dask-geopandas-tutorial/ec9ec22ea9bcb49682566965cd69df6c70afafad/fig/Hilbert-curve_rounded-gradient-animated.gif --------------------------------------------------------------------------------
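
As a starting point for the first two bullets in the Exercises section of `GeoPython2022.ipynb`, here is a minimal sketch. It assumes the files in `data/airports_csv/` are plain CSV parts and that the coordinate columns are named `longitude_deg` and `latitude_deg`; check the actual header and adjust the glob pattern and column names if they differ.

```python
import dask.dataframe as dd
import dask_geopandas

# 1. read the folder of CSV files as a dask.dataframe
ddf = dd.read_csv("data/airports_csv/*.csv")

# 2. build point geometries from the coordinate columns and wrap the result
#    in a dask_geopandas.GeoDataFrame (column names are an assumption)
airports = dask_geopandas.from_dask_dataframe(
    ddf,
    geometry=dask_geopandas.points_from_xy(
        ddf, x="longitude_deg", y="latitude_deg", crs=4326
    ),
)

# 3. inspect the (initially arbitrary) spatial partitions
airports.calculate_spatial_partitions()
airports.spatial_partitions.explore()
```

From there, the remaining bullets are mostly a matter of spatially sorting the partitions, `groupby` aggregations, and a spatial join against a countries layer, following the same patterns used earlier in the notebook.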