├── .ipynb_checkpoints
│   └── Spatial Autocorrelation-checkpoint.ipynb
├── README.md
├── Spatial Autocorrelation.ipynb
├── data
│   ├── acs2018_5yr_B01003_15000US060372711003.geojson
│   ├── metadata.json
│   └── tracts.geojson
├── images
│   ├── dave.jpg
│   ├── esda.png
│   ├── method.png
│   ├── readme.md
│   ├── sa-1.png
│   ├── sa.png
│   └── splot.png
└── requirements.txt

/.ipynb_checkpoints/Spatial Autocorrelation-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "toc": true 7 | }, 8 | "source": [ 9 | "

Table of Contents

\n", 10 | "
" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "slideshow": { 17 | "slide_type": "slide" 18 | } 19 | }, 20 | "source": [ 21 | "
\n", 22 | "\n", 23 | "

Take notice!

\n", 24 | "\n", 27 | " \n", 28 | "
" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": { 34 | "cell_id": "00000-df0628a5-dfb4-4e7c-ac45-e80b52c54132", 35 | "deepnote_cell_type": "markdown", 36 | "slideshow": { 37 | "slide_type": "slide" 38 | } 39 | }, 40 | "source": [ 41 | "# Spatial Autocorrelation\n", 42 | "\n", 43 | "\n", 44 | "\n", 45 | "Visual interpretations are meaningful ways to determine spatial trends in our data. However, underlying factors (such as inconsistent geographies, scale, data gaps, and overlapping data) have the potential to produce incorrect assumptions, as valuable information may be hidden from the visual output." 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": { 51 | "cell_id": "00000-df0628a5-dfb4-4e7c-ac45-e80b52c54132", 52 | "deepnote_cell_type": "markdown", 53 | "slideshow": { 54 | "slide_type": "slide" 55 | } 56 | }, 57 | "source": [ 58 | "\n", 59 | "\n", 60 | "One way to address this issue is to supplement your visual output with geo-statistical validation. In this lab, we will look at one such approach: Spatial Autocorrelation. Spatial autocorrelation addresses the so-called \"First Law of Geography\":\n", 61 | "\n", 62 | "> “Everything is related to everything else. But near things are more related than distant things”. Waldo Tobler’s (1969) First Law of Geography\n", 63 | "\n", 64 | "In other words, things that happen somewhere are likely to also happen at nearby locations." 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "cell_id": "00000-df0628a5-dfb4-4e7c-ac45-e80b52c54132", 71 | "deepnote_cell_type": "markdown", 72 | "slideshow": { 73 | "slide_type": "slide" 74 | } 75 | }, 76 | "source": [ 77 | "## Methodology\n", 78 | "In this lab, we take on the controversial topic of policing in Los Angeles, specifically looking at arrest records from the LAPD. Do arrest locations have a statistically significant tendency to cluster in certain communities? 
To answer this question, we not only look at the location of recorded arrests in the city, but also compare each location with those of other arrests. In short, we want to see where spatial correlations occur in the data. Our approach is:\n", 79 | "\n", 80 | "1. import census block group boundaries for Los Angeles\n", 81 | "1. import arrest data from the LA Open Data Portal\n", 82 | "1. spatially join the two datasets\n", 83 | "1. normalize the data to create arrests per 1000 people\n", 84 | "1. conduct [global spatial autocorrelation](https://geographicdata.science/book/notebooks/06_spatial_autocorrelation.html) using Moran's I\n", 85 | "1. conduct [local spatial autocorrelation](https://geographicdata.science/book/notebooks/07_local_autocorrelation.html) using Local Indicators of Spatial Association (LISAs)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": { 91 | "slideshow": { 92 | "slide_type": "slide" 93 | } 94 | }, 95 | "source": [ 96 | "## Libraries to use\n", 97 | "\n", 98 | "\n", 99 | "\n", 100 | "\n", 101 | "\n", 102 | "\n", 103 | "\n", 104 | "- [Data vis with Pandas and Matplotlib](https://pandas.pydata.org/docs/user_guide/visualization.html)\n", 105 | "- [geopandas introduction](https://geopandas.org/getting_started/introduction.html)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": { 111 | "slideshow": { 112 | "slide_type": "slide" 113 | } 114 | }, 115 | "source": [ 116 | "## Libraries to use\n", 117 | "\n", 118 | "\n", 119 | "\n", 120 | "\n", 121 | "\n", 122 | "- [ESDA](https://pysal.org/esda/)\n", 123 | "- [SPLOT](https://github.com/pysal/splot)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": { 130 | "cell_id": "00001-85a84e8b-13d6-4356-8656-408ce2a2630c", 131 | "deepnote_cell_type": "code", 132 | "execution_millis": 1181, 133 | "execution_start": 1605804400268, 134 | "output_cleared": false, 135 | "slideshow": { 136 | "slide_type": "slide" 137 | }, 138 | 
"source_hash": "bc3fef15" 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "# to read and wrangle data\n", 143 | "import pandas as pd\n", 144 | "\n", 145 | "# to import data from LA Data portal\n", 146 | "from sodapy import Socrata\n", 147 | "\n", 148 | "# to create spatial data\n", 149 | "import geopandas as gpd\n", 150 | "\n", 151 | "# for basemaps\n", 152 | "import contextily as ctx\n", 153 | "\n", 154 | "# for spatial statistics\n", 155 | "import esda\n", 156 | "from esda.moran import Moran, Moran_Local\n", 157 | "\n", 158 | "import splot\n", 159 | "from splot.esda import moran_scatterplot, plot_moran, lisa_cluster, plot_moran_simulation\n", 160 | "\n", 161 | "import libpysal as lps\n", 162 | "\n", 163 | "# graphics\n", 164 | "import matplotlib.pyplot as plt\n", 165 | "import plotly.express as px" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": { 171 | "slideshow": { 172 | "slide_type": "slide" 173 | } 174 | }, 175 | "source": [ 176 | "## Data preparation\n", 177 | "\n", 178 | "\n" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": { 184 | "cell_id": "00002-8c57606b-b800-4154-99ce-86d0e45b3499", 185 | "deepnote_cell_type": "markdown", 186 | "slideshow": { 187 | "slide_type": "slide" 188 | } 189 | }, 190 | "source": [ 191 | "## Block Groups\n", 192 | "\n", 193 | "Our first task is to bring in a geography that will allow us to summarize the location of arrests. The smaller the geography, the better our spatial correlation results will be. Short of creating our own grid, census block groups provide an easily accessible boundary layer at a human scale. 
Additionally, working with census geographies will allow for future analyses that may include census data.\n", 194 | "\n", 195 | "* Data source: \n", 196 | " * [Census Reporter: ACS 2018 5 year: Table B01003: Total Population in Los Angeles: Census Block Groups](https://censusreporter.org/data/table/?table=B01003&geo_ids=16000US0644000,150|16000US0644000&primary_geo_id=16000US0644000)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": { 203 | "cell_id": "00003-27a9c38f-0d55-4f36-9268-f26b88e2ae29", 204 | "deepnote_cell_type": "code", 205 | "execution_millis": 3186, 206 | "execution_start": 1605804403202, 207 | "output_cleared": false, 208 | "slideshow": { 209 | "slide_type": "fragment" 210 | }, 211 | "source_hash": "b8c3febb" 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "# population block group data from census reporter\n", 216 | "gdf = gpd.read_file('data/acs2018_5yr_B01003_15000US060372711003.geojson')" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": { 222 | "slideshow": { 223 | "slide_type": "slide" 224 | } 225 | }, 226 | "source": [ 227 | "## Data cleanup" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": { 234 | "slideshow": { 235 | "slide_type": "fragment" 236 | } 237 | }, 238 | "outputs": [], 239 | "source": [ 240 | "# show first 5 rows\n", 241 | "gdf.head()" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "cell_id": "00004-e26db948-08d3-46f9-b0ce-1da8563c02d9", 249 | "deepnote_cell_type": "code", 250 | "execution_millis": 7, 251 | "execution_start": 1605804407207, 252 | "output_cleared": false, 253 | "slideshow": { 254 | "slide_type": "slide" 255 | }, 256 | "source_hash": "494d51a4" 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "# show columns and data types\n", 261 | "gdf.info()" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | 
"execution_count": null, 267 | "metadata": { 268 | "cell_id": "00005-65875226-9284-44aa-b62e-3a1e2212d84b", 269 | "deepnote_cell_type": "code", 270 | "execution_millis": 13, 271 | "execution_start": 1605804409430, 272 | "output_cleared": false, 273 | "slideshow": { 274 | "slide_type": "slide" 275 | }, 276 | "source_hash": "7c13ef85" 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "# trim the data to the bare minimum columns (copy to avoid modifying a slice)\n", 281 | "gdf = gdf[['geoid','B01003001','geometry']].copy()\n", 282 | "\n", 283 | "# rename the columns\n", 284 | "gdf.columns = ['FIPS','TotalPop','geometry']" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": { 291 | "cell_id": "00005-65875226-9284-44aa-b62e-3a1e2212d84b", 292 | "deepnote_cell_type": "code", 293 | "execution_millis": 13, 294 | "execution_start": 1605804409430, 295 | "output_cleared": false, 296 | "slideshow": { 297 | "slide_type": "fragment" 298 | }, 299 | "source_hash": "7c13ef85" 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "# last 5 rows\n", 304 | "gdf.tail()" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": { 311 | "slideshow": { 312 | "slide_type": "fragment" 313 | } 314 | }, 315 | "outputs": [], 316 | "source": [ 317 | "# delete the last row (index 2515), which is for the entire city of LA\n", 318 | "gdf = gdf.drop(2515)" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": { 325 | "scrolled": true, 326 | "slideshow": { 327 | "slide_type": "slide" 328 | } 329 | }, 330 | "outputs": [], 331 | "source": [ 332 | "# strip the '15000US' prefix from the FIPS code\n", 333 | "gdf['FIPS'] = gdf['FIPS'].str.replace('15000US','')\n", 334 | "gdf.tail()" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": { 340 | "slideshow": { 341 | "slide_type": "slide" 342 | } 343 | }, 344 | "source": [ 345 | "One more data cleanup: get rid of census block groups with a total population of 100 or fewer.\n", 346 | "\n", 347 
| "- `sort_values()` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "slideshow": { 355 | "slide_type": "fragment" 356 | } 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "# sort by total pop\n", 361 | "gdf.sort_values(by='TotalPop').head(20)" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": { 367 | "slideshow": { 368 | "slide_type": "slide" 369 | } 370 | }, 371 | "source": [ 372 | "### Subsetting the data\n", 373 | "\n", 374 | "- [Selecting a subset of a dataframe](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "slideshow": { 382 | "slide_type": "fragment" 383 | } 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "# delete less than 100 population geographies\n", 388 | "gdf = gdf[gdf['TotalPop']>100]" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": { 394 | "slideshow": { 395 | "slide_type": "slide" 396 | } 397 | }, 398 | "source": [ 399 | "## Map the census block groups" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": { 406 | "cell_id": "00006-c7db2dde-5f71-47f2-bfc2-0a0d94f1a03d", 407 | "deepnote_cell_type": "code", 408 | "execution_millis": 930, 409 | "execution_start": 1605804413958, 410 | "output_cleared": false, 411 | "slideshow": { 412 | "slide_type": "fragment" 413 | }, 414 | "source_hash": "e699b335" 415 | }, 416 | "outputs": [], 417 | "source": [ 418 | "# get the layers into a web mercator projection\n", 419 | "# reproject to web mercator\n", 420 | "gdf = gdf.to_crs(epsg=3857)" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": { 426 | "slideshow": { 427 | "slide_type": "slide" 428 | } 429 | }, 430 | 
"source": [ 431 | "### Using Matplotlib subplots\n", 432 | "\n", 433 | "- `plt.subplots()` [documentation](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html)" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "cell_id": "00007-ffa16cca-a0b5-4e66-9a4c-73d90565821d", 441 | "deepnote_cell_type": "code", 442 | "execution_millis": 2725, 443 | "execution_start": 1605804417356, 444 | "output_cleared": false, 445 | "slideshow": { 446 | "slide_type": "slide" 447 | }, 448 | "source_hash": "93ebb39" 449 | }, 450 | "outputs": [], 451 | "source": [ 452 | "# plot it!\n", 453 | "fig, ax = plt.subplots(figsize=(12,12))\n", 454 | "\n", 455 | "gdf.plot(ax=ax,\n", 456 | " color='black', \n", 457 | " edgecolor='white',\n", 458 | " lw=0.5,\n", 459 | " alpha=0.4)\n", 460 | "\n", 461 | "# no axis\n", 462 | "ax.axis('off')\n", 463 | "\n", 464 | "# add a basemap\n", 465 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": { 471 | "cell_id": "00008-5ca674be-a4a5-4476-b7bd-3c86722b0246", 472 | "deepnote_cell_type": "markdown", 473 | "slideshow": { 474 | "slide_type": "slide" 475 | } 476 | }, 477 | "source": [ 478 | "## Get Arrest Data from LA Open Data Portal\n", 479 | "Next, we acquire the data using the socrata API. 
Use the socrata documentation to grab the code syntax for our arrests data.\n", 480 | "- [LA Data Portal](https://data.lacity.org)\n", 481 | "- [LA Data Portal arrest data](https://data.lacity.org/Public-Safety/Arrest-Data-from-2020-to-Present/amvf-fr72)\n", 482 | "- [Socrata endpoint documentation for this dataset](https://dev.socrata.com/foundry/data.lacity.org/amvf-fr72)" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": { 489 | "cell_id": "00009-684c0663-6a05-4f66-82c3-63b259473b81", 490 | "deepnote_cell_type": "code", 491 | "execution_millis": 4626, 492 | "execution_start": 1605804430393, 493 | "output_cleared": false, 494 | "scrolled": true, 495 | "slideshow": { 496 | "slide_type": "slide" 497 | }, 498 | "source_hash": "a077b4d5" 499 | }, 500 | "outputs": [], 501 | "source": [ 502 | "# connect to the data portal\n", 503 | "client = Socrata(\"data.lacity.org\", None)\n", 504 | "\n", 505 | "results = client.get(\"amvf-fr72\", \n", 506 | " limit=50000,\n", 507 | " where = \"arst_date between '2020-07-01T00:00:00' and '2021-01-31T00:00:00'\",\n", 508 | " order='arst_date desc')\n", 509 | "\n", 510 | "# Convert to pandas DataFrame\n", 511 | "arrests = pd.DataFrame.from_records(results)\n" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": { 518 | "cell_id": "00010-edd31267-0772-400b-9220-b4c2af28281d", 519 | "deepnote_cell_type": "code", 520 | "execution_millis": 2, 521 | "execution_start": 1605804438759, 522 | "output_cleared": false, 523 | "scrolled": true, 524 | "slideshow": { 525 | "slide_type": "fragment" 526 | }, 527 | "source_hash": "d8a3f800" 528 | }, 529 | "outputs": [], 530 | "source": [ 531 | "arrests.info()" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": null, 537 | "metadata": { 538 | "slideshow": { 539 | "slide_type": "slide" 540 | } 541 | }, 542 | "outputs": [], 543 | "source": [ 544 | "arrests.head()" 545 | ] 546 | }, 
547 | { 548 | "cell_type": "markdown", 549 | "metadata": { 550 | "cell_id": "00011-87e77a3f-36e6-48db-bc94-ae8b2747e411", 551 | "deepnote_cell_type": "markdown", 552 | "slideshow": { 553 | "slide_type": "slide" 554 | } 555 | }, 556 | "source": [ 557 | "### Convert data to a geodataframe\n", 558 | "\n", 559 | "Geopandas allows us to convert different types of data into a spatial format.\n", 560 | "- https://geopandas.org/gallery/create_geopandas_from_pandas.html" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": null, 566 | "metadata": { 567 | "cell_id": "00012-11bd538c-28c8-4c61-9994-e819c3feb7e8", 568 | "deepnote_cell_type": "code", 569 | "execution_millis": 409, 570 | "execution_start": 1605804442578, 571 | "output_cleared": false, 572 | "slideshow": { 573 | "slide_type": "fragment" 574 | }, 575 | "source_hash": "dcfb7ede" 576 | }, 577 | "outputs": [], 578 | "source": [ 579 | "# convert pandas dataframe to geodataframe\n", 580 | "arrests = gpd.GeoDataFrame(arrests, \n", 581 | " crs='EPSG:4326',\n", 582 | " geometry=gpd.points_from_xy(arrests.lon, arrests.lat))" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "metadata": { 589 | "cell_id": "00013-9c89a5ae-ae0f-4844-89e8-3b9105d30b6e", 590 | "deepnote_cell_type": "code", 591 | "execution_millis": 1327, 592 | "execution_start": 1605804596814, 593 | "output_cleared": false, 594 | "slideshow": { 595 | "slide_type": "fragment" 596 | }, 597 | "source_hash": "8eb5f93e" 598 | }, 599 | "outputs": [], 600 | "source": [ 601 | "# get the layers into a web mercator projection\n", 602 | "# reproject to web mercator\n", 603 | "arrests = arrests.to_crs(epsg=3857)" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": { 610 | "cell_id": "00014-5b46baf9-e870-4232-a425-28f0d65fb571", 611 | "deepnote_cell_type": "code", 612 | "execution_millis": 60, 613 | "execution_start": 1605804598151, 614 | "output_cleared": false, 
615 | "slideshow": { 616 | "slide_type": "fragment" 617 | }, 618 | "source_hash": "ced73d7" 619 | }, 620 | "outputs": [], 621 | "source": [ 622 | "# convert lat/lon to floats so they can be compared with numbers (e.g. lon == 0)\n", 623 | "arrests.lon = arrests.lon.astype('float')\n", 624 | "arrests.lat = arrests.lat.astype('float')" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": { 631 | "cell_id": "00015-37008c4f-0307-455b-b316-da6e65f084c5", 632 | "deepnote_cell_type": "code", 633 | "execution_millis": 3485, 634 | "execution_start": 1605804600104, 635 | "output_cleared": false, 636 | "scrolled": true, 637 | "slideshow": { 638 | "slide_type": "slide" 639 | }, 640 | "source_hash": "ec9920c6" 641 | }, 642 | "outputs": [], 643 | "source": [ 644 | "# map it!\n", 645 | "fig,ax = plt.subplots(figsize=(12,12))\n", 646 | "\n", 647 | "arrests.plot(ax=ax,\n", 648 | " color='red',\n", 649 | " markersize=1)\n", 650 | "\n", 651 | "# no axis\n", 652 | "ax.axis('off')\n", 653 | "\n", 654 | "# add a basemap\n", 655 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)\n" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": { 661 | "cell_id": "00016-7294ac05-aa91-433c-aefd-bb0ee45e7ef5", 662 | "deepnote_cell_type": "markdown", 663 | "slideshow": { 664 | "slide_type": "slide" 665 | } 666 | }, 667 | "source": [ 668 | "### The 0,0 conundrum\n", 669 | "What is that red dot off the coast of Africa? Yes, that is the infamous spatial black hole, the [0,0] coordinate. There can be many reasons why records get lost and end up there. Missing data may default to 0s, null values may be converted to 0s, or a human may have inadvertently entered 0s for unknown locations. Either way, since these records do not have valid locations, they need to be deleted from our dataframe in order to proceed."
670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": { 676 | "cell_id": "00017-7d512d29-36e2-4a73-80ec-1a49b1845f1d", 677 | "deepnote_cell_type": "code", 678 | "execution_millis": 11, 679 | "execution_start": 1605804631750, 680 | "output_cleared": false, 681 | "slideshow": { 682 | "slide_type": "fragment" 683 | }, 684 | "source_hash": "dc9df796" 685 | }, 686 | "outputs": [], 687 | "source": [ 688 | "# subset the zero coordinate records\n", 689 | "arrests[arrests.lon==0]" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": { 696 | "slideshow": { 697 | "slide_type": "slide" 698 | } 699 | }, 700 | "outputs": [], 701 | "source": [ 702 | "# drop the unmapped rows\n", 703 | "arrests = arrests[arrests.lon!=0]" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": null, 709 | "metadata": { 710 | "cell_id": "00015-37008c4f-0307-455b-b316-da6e65f084c5", 711 | "deepnote_cell_type": "code", 712 | "execution_millis": 3485, 713 | "execution_start": 1605804600104, 714 | "output_cleared": false, 715 | "scrolled": false, 716 | "slideshow": { 717 | "slide_type": "slide" 718 | }, 719 | "source_hash": "ec9920c6" 720 | }, 721 | "outputs": [], 722 | "source": [ 723 | "# map it!\n", 724 | "fig,ax = plt.subplots(figsize=(12,12))\n", 725 | "\n", 726 | "arrests.plot(ax=ax,\n", 727 | " color='red',\n", 728 | " markersize=1)\n", 729 | "\n", 730 | "# no axis\n", 731 | "ax.axis('off')\n", 732 | "\n", 733 | "# add a basemap\n", 734 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)\n" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": { 740 | "cell_id": "00020-c58096aa-ccb6-44c8-a710-06b464722ea2", 741 | "deepnote_cell_type": "markdown", 742 | "slideshow": { 743 | "slide_type": "slide" 744 | } 745 | }, 746 | "source": [ 747 | "## Create a two layer map\n", 748 | "\n", 749 | "- https://geopandas.org/mapping.html" 750 | ] 751 | }, 752 | { 753 | 
"cell_type": "markdown", 754 | "metadata": { 755 | "cell_id": "00021-54ac341c-54f2-4549-b9fd-97ba8c8f9628", 756 | "deepnote_cell_type": "markdown", 757 | "slideshow": { 758 | "slide_type": "fragment" 759 | } 760 | }, 761 | "source": [ 762 | "Since we want to zoom to the extent of the arrests layer (and not the block groups), get the bounding coordinates for our axis." 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": null, 768 | "metadata": { 769 | "cell_id": "00022-530381e9-7db1-4821-bcee-a66b1b5983f4", 770 | "deepnote_cell_type": "code", 771 | "execution_millis": 489, 772 | "execution_start": 1605804643081, 773 | "output_cleared": false, 774 | "slideshow": { 775 | "slide_type": "fragment" 776 | }, 777 | "source_hash": "2933d808" 778 | }, 779 | "outputs": [], 780 | "source": [ 781 | "# get the bounding box coordinates for the arrest data\n", 782 | "minx, miny, maxx, maxy = arrests.geometry.total_bounds\n", 783 | "print(minx)\n", 784 | "print(maxx)\n", 785 | "print(miny)\n", 786 | "print(maxy)" 787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": { 792 | "slideshow": { 793 | "slide_type": "slide" 794 | } 795 | }, 796 | "source": [ 797 | "## Subplots for multi-layered maps\n", 798 | "\n", 799 | "For our multi-layered maps, we are taking it one step further from our previous lab using matplotlib's `subplots`. `subplots` allows the creation of multiple plots on a gridded canvas. For our map, we only need a single subplot, but we are layering multiple datasets *on top of one another* on that subplot. To specify which subplot to put the layer on, you use the `ax` argument." 
800 | ] 801 | }, 802 | { 803 | "cell_type": "code", 804 | "execution_count": null, 805 | "metadata": { 806 | "cell_id": "00023-de1f44e5-dc3e-4a90-9423-884b716cdfa7", 807 | "deepnote_cell_type": "code", 808 | "execution_millis": 4318, 809 | "execution_start": 1605804911562, 810 | "output_cleared": false, 811 | "scrolled": false, 812 | "slideshow": { 813 | "slide_type": "slide" 814 | }, 815 | "source_hash": "75d2d69a" 816 | }, 817 | "outputs": [], 818 | "source": [ 819 | "# set up the plot canvas with plt.subplots: one row, one column\n", 820 | "fig, ax = plt.subplots(1,1,figsize=(15, 15))\n", 821 | "\n", 822 | "# block groups\n", 823 | "gdf.plot(ax=ax, # this puts it in the ax plot\n", 824 | " color='gray', \n", 825 | " edgecolor='white',\n", 826 | " alpha=0.5)\n", 827 | "\n", 828 | "# arrests\n", 829 | "arrests.plot(ax=ax, # this also puts it in the same ax plot\n", 830 | " color='red',\n", 831 | " markersize=1,\n", 832 | " alpha=0.2)\n", 833 | "\n", 834 | "# use the bounding box coordinates to set the x and y limits\n", 835 | "ax.set_xlim(minx - 1000, maxx + 1000) # the added/subtracted 1000 map units give some margin around the total bounds\n", 836 | "ax.set_ylim(miny - 1000, maxy + 1000)\n", 837 | "\n", 838 | "# no axis\n", 839 | "ax.axis('off')\n", 840 | "\n", 841 | "# add a basemap\n", 842 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": { 848 | "cell_id": "00024-350b1c6f-5d68-4f37-841d-9b832e497dfe", 849 | "deepnote_cell_type": "markdown", 850 | "slideshow": { 851 | "slide_type": "slide" 852 | } 853 | }, 854 | "source": [ 855 | "## The spatial join\n", 856 | "\n", 857 | "* https://geopandas.org/mergingdata.html?highlight=spatial%20join\n", 858 | "\n", 859 | "In a Spatial Join, two geometry objects are merged based on their spatial relationship to one another."
860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": { 865 | "slideshow": { 866 | "slide_type": "slide" 867 | } 868 | }, 869 | "source": [ 870 | "While the official documentation may seem confusing, consider the following as a rule of thumb. When you do a spatial join with `gpd.sjoin()`, you feed it three arguments: a left dataframe, a right dataframe, and a `how` argument.\n", 871 | "\n", 872 | "- **Left dataframe**: the layer you want to attach information *to*; the information comes from the other layer\n", 873 | "- **Right dataframe**: the layer you want to get information *from* to attach to the other layer\n", 874 | "\n", 875 | "Once you identify your left and right dataframes, use `how=\"left\"` to spatially join the two layers (think: \"I'm sending data from the right to the left\"). Note that the result keeps every record, and the geometry type, of the LEFT layer." 876 | ] 877 | }, 878 | { 879 | "cell_type": "code", 880 | "execution_count": null, 881 | "metadata": { 882 | "cell_id": "00025-83617000-2144-45cb-9145-3e070dc1da7e", 883 | "deepnote_cell_type": "code", 884 | "execution_millis": 1339, 885 | "execution_start": 1605806117634, 886 | "output_cleared": false, 887 | "slideshow": { 888 | "slide_type": "slide" 889 | }, 890 | "source_hash": "8c53b4a0" 891 | }, 892 | "outputs": [], 893 | "source": [ 894 | "# do the spatial join: attach block group attributes to each arrest point\n", 895 | "join = gpd.sjoin(arrests, gdf, how='left')\n", 896 | "join.head()" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "metadata": { 902 | "cell_id": "00026-a88db1eb-74f2-4c08-9db8-e964326dc862", 903 | "deepnote_cell_type": "markdown", 904 | "slideshow": { 905 | "slide_type": "slide" 906 | } 907 | }, 908 | "source": [ 909 | "This creates a dataframe that has every arrest record with the corresponding FIPS code.\n", 910 | "\n", 911 | "Next, we create another dataframe that counts arrests by their corresponding block group.\n", 912 | "\n", 913 | "- `value_counts()` 
[documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html)" 914 | ] 915 | }, 916 | { 917 | "cell_type": "code", 918 | "execution_count": null, 919 | "metadata": { 920 | "cell_id": "00027-ca714a55-7276-4cd5-8cfd-64ed87c51b47", 921 | "deepnote_cell_type": "code", 922 | "execution_millis": 0, 923 | "execution_start": 1605806144598, 924 | "output_cleared": false, 925 | "slideshow": { 926 | "slide_type": "fragment" 927 | }, 928 | "source_hash": "bbec6db9" 929 | }, 930 | "outputs": [], 931 | "source": [ 932 | "arrests_by_gdf = join.FIPS.value_counts().rename_axis('FIPS').reset_index(name='arrests_count')" 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": null, 938 | "metadata": { 939 | "cell_id": "00028-0edb1f2f-bf0d-464e-91a5-8e00178c2fc7", 940 | "deepnote_cell_type": "code", 941 | "execution_millis": 1, 942 | "execution_start": 1605806249414, 943 | "output_cleared": false, 944 | "scrolled": true, 945 | "slideshow": { 946 | "slide_type": "fragment" 947 | }, 948 | "source_hash": "1a26c77d" 949 | }, 950 | "outputs": [], 951 | "source": [ 952 | "arrests_by_gdf.head()" 953 | ] 954 | }, 955 | { 956 | "cell_type": "markdown", 957 | "metadata": { 958 | "slideshow": { 959 | "slide_type": "slide" 960 | } 961 | }, 962 | "source": [ 963 | "- [Pandas bar plot documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html)" 964 | ] 965 | }, 966 | { 967 | "cell_type": "code", 968 | "execution_count": null, 969 | "metadata": { 970 | "cell_id": "00029-c703c03b-ae27-4935-85ae-6f32896bcd05", 971 | "deepnote_cell_type": "code", 972 | "execution_millis": 256, 973 | "execution_start": 1605806267816, 974 | "output_cleared": false, 975 | "scrolled": false, 976 | "slideshow": { 977 | "slide_type": "fragment" 978 | }, 979 | "source_hash": "a09e3e30" 980 | }, 981 | "outputs": [], 982 | "source": [ 983 | "# make a bar chart of top 20 geographies\n", 984 | "arrests_by_gdf[:20].plot.bar(figsize=(20,4),\n", 
985 | " x='FIPS',\n", 986 | " y='arrests_count')" 987 | ] 988 | }, 989 | { 990 | "cell_type": "markdown", 991 | "metadata": { 992 | "cell_id": "00030-23e15688-b87f-4d89-8e04-78e98f59d8ce", 993 | "deepnote_cell_type": "markdown", 994 | "slideshow": { 995 | "slide_type": "slide" 996 | } 997 | }, 998 | "source": [ 999 | "## Join the value counts back to the gdf\n", 1000 | "\n", 1001 | "How many people know their census block number? The bar chart is nice, but it is not informative. Without spatial awareness, the data chart does little to convey knowledge. What we want is a choropleth map to accompany it. To do so, we merge the counts back to the block group gdf.\n", 1002 | "\n", 1003 | "- [merge documentation](https://geopandas.org/docs/user_guide/mergingdata.html)" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": null, 1009 | "metadata": { 1010 | "cell_id": "00031-ef0c1b84-f54e-464b-9183-d7b6952c93ad", 1011 | "deepnote_cell_type": "code", 1012 | "execution_millis": 0, 1013 | "execution_start": 1605806281114, 1014 | "output_cleared": false, 1015 | "slideshow": { 1016 | "slide_type": "fragment" 1017 | }, 1018 | "source_hash": "6918564f" 1019 | }, 1020 | "outputs": [], 1021 | "source": [ 1022 | "# join the summary table back to the gdf\n", 1023 | "gdf=gdf.merge(arrests_by_gdf,on='FIPS')" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "markdown", 1028 | "metadata": { 1029 | "cell_id": "00032-de630962-35a2-45c8-a028-ad929bebbff7", 1030 | "deepnote_cell_type": "markdown", 1031 | "slideshow": { 1032 | "slide_type": "slide" 1033 | } 1034 | }, 1035 | "source": [ 1036 | "Now the block group gdf has a new column for arrest counts:" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "code", 1041 | "execution_count": null, 1042 | "metadata": { 1043 | "cell_id": "00033-4405a8bc-92cd-44dc-9f79-e1756818d28f", 1044 | "deepnote_cell_type": "code", 1045 | "execution_millis": 10, 1046 | "execution_start": 1605806285885, 1047 | "output_cleared": false, 1048 
| "slideshow": { 1049 | "slide_type": "fragment" 1050 | }, 1051 | "source_hash": "436bb116" 1052 | }, 1053 | "outputs": [], 1054 | "source": [ 1055 | "# our neighborhood table now has a count column\n", 1056 | "gdf.head()" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "markdown", 1061 | "metadata": { 1062 | "slideshow": { 1063 | "slide_type": "slide" 1064 | } 1065 | }, 1066 | "source": [ 1067 | "## Normalizing: Arrests per 1000 people\n", 1068 | "Rather than proceeding with an absolute count of arrests, let's normalize it by number of people who live in the census block group.\n", 1069 | "\n" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "code", 1074 | "execution_count": null, 1075 | "metadata": { 1076 | "slideshow": { 1077 | "slide_type": "fragment" 1078 | } 1079 | }, 1080 | "outputs": [], 1081 | "source": [ 1082 | "gdf['arrests_per_1000'] = gdf['arrests_count']/gdf['TotalPop']*1000" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "code", 1087 | "execution_count": null, 1088 | "metadata": { 1089 | "slideshow": { 1090 | "slide_type": "fragment" 1091 | } 1092 | }, 1093 | "outputs": [], 1094 | "source": [ 1095 | "gdf.sort_values(by=\"arrests_per_1000\").tail()" 1096 | ] 1097 | }, 1098 | { 1099 | "cell_type": "markdown", 1100 | "metadata": { 1101 | "slideshow": { 1102 | "slide_type": "slide" 1103 | } 1104 | }, 1105 | "source": [ 1106 | "And if you are curious, we can map a slice of the data. Here, we sort the values by descending arrest rate, and only show a slice of the data, the top 20 geographies using the handy `[:20]`." 
1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": null, 1112 | "metadata": { 1113 | "cell_id": "00034-d60cf7ed-32db-4b0e-ac55-6dc3119c40c9", 1114 | "deepnote_cell_type": "code", 1115 | "execution_millis": 455, 1116 | "execution_start": 1605806413842, 1117 | "output_cleared": false, 1118 | "slideshow": { 1119 | "slide_type": "slide" 1120 | }, 1121 | "source_hash": "c70d7036" 1122 | }, 1123 | "outputs": [], 1124 | "source": [ 1125 | "# map the top 20 geographies\n", 1126 | "fig,ax = plt.subplots(figsize=(12,10))\n", 1127 | "gdf.sort_values(by='arrests_per_1000',ascending=False)[:20].plot(ax=ax,\n", 1128 | " color='red',\n", 1129 | " edgecolor='white',\n", 1130 | " alpha=0.5,legend=True)\n", 1131 | "\n", 1132 | "\n", 1133 | "# title\n", 1134 | "ax.set_title('Top 20 locations of LAPD arrests per 1000 people (July 2020-January 2021)')\n", 1135 | "\n", 1136 | "# no axis\n", 1137 | "ax.axis('off')\n", 1138 | "\n", 1139 | "# add a basemap\n", 1140 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "markdown", 1145 | "metadata": { 1146 | "cell_id": "00035-452962c1-1d0a-4b6e-9f25-bf2bac2eb75e", 1147 | "deepnote_cell_type": "markdown", 1148 | "slideshow": { 1149 | "slide_type": "slide" 1150 | } 1151 | }, 1152 | "source": [ 1153 | "## Choropleth map of arrests\n", 1154 | "\n", 1155 | "Finally, we are ready to generate a choropleth map of arrests. 
" 1156 | ] 1157 | }, 1158 | { 1159 | "cell_type": "markdown", 1160 | "metadata": { 1161 | "slideshow": { 1162 | "slide_type": "slide" 1163 | } 1164 | }, 1165 | "source": [ 1166 | "- [geopandas choropleths](https://geopandas.org/docs/user_guide/mapping.html#choropleth-maps)" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "code", 1171 | "execution_count": null, 1172 | "metadata": { 1173 | "cell_id": "00040-751a6cac-f335-4b67-b470-3e240904a0cf", 1174 | "deepnote_cell_type": "code", 1175 | "execution_millis": 1290, 1176 | "execution_start": 1605806440098, 1177 | "output_cleared": false, 1178 | "scrolled": false, 1179 | "slideshow": { 1180 | "slide_type": "-" 1181 | }, 1182 | "source_hash": "cf9722a3" 1183 | }, 1184 | "outputs": [], 1185 | "source": [ 1186 | "fig,ax = plt.subplots(figsize=(15,15))\n", 1187 | "\n", 1188 | "gdf.plot(ax=ax,\n", 1189 | " column='arrests_per_1000', # this makes it a choropleth\n", 1190 | " legend=True,\n", 1191 | " alpha=0.8,\n", 1192 | " cmap='RdYlGn_r', # a diverging color scheme\n", 1193 | " scheme='quantiles') # how to break the data into bins\n", 1194 | "\n", 1195 | "ax.axis('off')\n", 1196 | "ax.set_title('2020 July to January 2021 arrests per 1000 people',fontsize=22)\n", 1197 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "markdown", 1202 | "metadata": { 1203 | "cell_id": "00041-1c2e4521-6cae-4ca5-9731-1b9c505d6077", 1204 | "deepnote_cell_type": "markdown", 1205 | "slideshow": { 1206 | "slide_type": "slide" 1207 | } 1208 | }, 1209 | "source": [ 1210 | "
\n", 1211 | "The map above is a good way to begin exploring spatial patterns in our data. What does this map tell you? Is it informative? Do you notice any significant clusters? What if you change the map? Notice the `scheme` argument is set to `quantiles`. Experiment with other map classifications such as `equalinterval` or `naturalbreaks`. How does each classification change the map? \n", 1212 | "
" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": { 1218 | "cell_id": "00042-b24787d4-0ab3-44d5-ac9e-91f78dd286e3", 1219 | "deepnote_cell_type": "markdown", 1220 | "slideshow": { 1221 | "slide_type": "slide" 1222 | } 1223 | }, 1224 | "source": [ 1225 | "# Global Spatial Autocorrelation\n", 1226 | "\n", 1227 | "\n", 1228 | "\n", 1229 | "We have imported two datasets, cleaned them up, spatialized them, and joined them spatially. We successfully mapped them to show arrests per 1000 people by census block group. The resulting map intuitively and visually tells us that there do appear to be spatial clusters where arrests are more prevalent, but to what degree of certainty can we say so? Actually, very little, without statistically backing up our determinations. Could this exact pattern be a matter of chance? Or is the pattern so distinct that there is no way it could have happened randomly?\n", 1230 | "\n", 1231 | "In order to answer this question, we test for spatial autocorrelation, an analysis that estimates to what degree an existing pattern is or is not random." 1232 | ] 1233 | }, 1234 | { 1235 | "cell_type": "markdown", 1236 | "metadata": { 1237 | "cell_id": "00042-b24787d4-0ab3-44d5-ac9e-91f78dd286e3", 1238 | "deepnote_cell_type": "markdown", 1239 | "slideshow": { 1240 | "slide_type": "slide" 1241 | } 1242 | }, 1243 | "source": [ 1244 | "Global Moran's I statistic is a way to *quantify* the degree to which similar geographies are clustered. To do so, we compare each geography based on a given value (in this case, arrests per 1000 people) with that of its neighbors. 
The first step of this process is to define a \"spatial weight.\" " 1245 | ] 1246 | }, 1247 | { 1248 | "cell_type": "markdown", 1249 | "metadata": { 1250 | "cell_id": "00043-8ff210ab-9ca8-4ba3-a467-3b94e460e0d1", 1251 | "deepnote_cell_type": "markdown", 1252 | "slideshow": { 1253 | "slide_type": "slide" 1254 | } 1255 | }, 1256 | "source": [ 1257 | "### Spatial Weights\n", 1258 | "\n", 1259 | "\n", 1260 | "\n", 1261 | "Image source: [ESRI](https://desktop.arcgis.com/en/arcmap/10.3/tools/spatial-statistics-toolbox/generate-spatial-weights-matrix.htm)\n", 1262 | "\n", 1263 | "Spatial weights define each area’s neighborhood. There are different methods for determining spatial weights. Some use a contiguity model, assigning neighbors based on boundaries that touch each other. Others are based on distance, finding the closest neighbors based on the centroid of each geography. So which method should we use? " 1264 | ] 1265 | }, 1266 | { 1267 | "cell_type": "markdown", 1268 | "metadata": { 1269 | "cell_id": "00043-8ff210ab-9ca8-4ba3-a467-3b94e460e0d1", 1270 | "deepnote_cell_type": "markdown", 1271 | "slideshow": { 1272 | "slide_type": "slide" 1273 | } 1274 | }, 1275 | "source": [ 1276 | "For this lab, we will use the KNN weight, where `k` is the number of \"nearest neighbors\" to count in the calculations. Let's proceed with `k=8` for our KNN spatial weights. \n", 1277 | "\n", 1278 | "We also **row standardize** the weights, a technique for adjusting the weights in a spatial weights matrix. When weights are row standardized, each weight is divided by its row sum. 
The row sum is the sum of weights for a feature's neighbors.\n", 1279 | "\n", 1280 | "- https://geographicdata.science/book/notebooks/04_spatial_weights.html#distance-based-weights" 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "code", 1285 | "execution_count": null, 1286 | "metadata": { 1287 | "cell_id": "00044-c20d9e6b-953d-44f8-8f15-51052303c063", 1288 | "deepnote_cell_type": "code", 1289 | "execution_millis": 111, 1290 | "execution_start": 1605806657728, 1291 | "output_cleared": false, 1292 | "slideshow": { 1293 | "slide_type": "fragment" 1294 | }, 1295 | "source_hash": "3ee158c7" 1296 | }, 1297 | "outputs": [], 1298 | "source": [ 1299 | "# calculate KNN spatial weights (k=8 nearest neighbors)\n", 1300 | "wq = lps.weights.KNN.from_dataframe(gdf,k=8)\n", 1301 | "\n", 1302 | "# row-standardize so the weights in each row sum to 1\n", 1303 | "wq.transform = 'r'" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "markdown", 1308 | "metadata": { 1309 | "cell_id": "00045-109ded14-a0b2-4c8c-a0a0-8454c76c230d", 1310 | "deepnote_cell_type": "markdown", 1311 | "slideshow": { 1312 | "slide_type": "slide" 1313 | } 1314 | }, 1315 | "source": [ 1316 | "### Spatial lag\n", 1317 | "\n", 1318 | "Now that we have our spatial weights assigned, we use them to calculate the spatial lag. While the mathematical operations are beyond the scope of this lab, you are welcome to check them out [here](https://geographicdata.science/book/notebooks/06_spatial_autocorrelation.html#spatial-lag). Simply put, the spatial lag is a value assigned to each geography in your data, which takes into account the data values from others in its \"neighborhood\" as defined by the spatial weights. This operation can be done with a single line of code which is part of the pysal module, but the underlying calculations are not that difficult to understand: \n", 1319 | "\n", 1320 | "With row-standardized weights, it takes the average of the values of all the neighbors as defined by the spatial weights to come up with a single associated value." 
1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": null, 1326 | "metadata": { 1327 | "cell_id": "00046-020e64b9-d823-45ea-bdfd-6950215f410d", 1328 | "deepnote_cell_type": "code", 1329 | "execution_millis": 4, 1330 | "execution_start": 1605806659021, 1331 | "output_cleared": false, 1332 | "slideshow": { 1333 | "slide_type": "slide" 1334 | }, 1335 | "source_hash": "dd2f79d2" 1336 | }, 1337 | "outputs": [], 1338 | "source": [ 1339 | "# create a new column for the spatial lag\n", 1340 | "gdf['arrests_per_1000_lag'] = lps.weights.lag_spatial(wq, gdf['arrests_per_1000'])" 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "code", 1345 | "execution_count": null, 1346 | "metadata": { 1347 | "cell_id": "00047-80344eb5-7c68-4523-80fb-88e94e5ce690", 1348 | "deepnote_cell_type": "code", 1349 | "execution_millis": 0, 1350 | "execution_start": 1605806663078, 1351 | "output_cleared": false, 1352 | "slideshow": { 1353 | "slide_type": "fragment" 1354 | }, 1355 | "source_hash": "7aafb24b" 1356 | }, 1357 | "outputs": [], 1358 | "source": [ 1359 | "# sample gives us 10 random rows\n", 1360 | "gdf.sample(10)[['TotalPop','arrests_count','arrests_per_1000','arrests_per_1000_lag']]" 1361 | ] 1362 | }, 1363 | { 1364 | "cell_type": "markdown", 1365 | "metadata": { 1366 | "slideshow": { 1367 | "slide_type": "fragment" 1368 | } 1369 | }, 1370 | "source": [ 1371 | "
\n", 1372 | "Take a moment to look at the values in the dataframe. What do they tell you?\n", 1373 | "
" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "markdown", 1378 | "metadata": { 1379 | "slideshow": { 1380 | "slide_type": "slide" 1381 | } 1382 | }, 1383 | "source": [ 1384 | "### The donut and the diamond" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "execution_count": null, 1390 | "metadata": {}, 1391 | "outputs": [], 1392 | "source": [ 1393 | "# create a column that calculates the difference betwen arrests and lag\n", 1394 | "gdf['arrest_lag_diff'] = gdf['arrests_per_1000'] - gdf['arrests_per_1000_lag']" 1395 | ] 1396 | }, 1397 | { 1398 | "cell_type": "code", 1399 | "execution_count": null, 1400 | "metadata": {}, 1401 | "outputs": [], 1402 | "source": [ 1403 | "# output to get the head and tail\n", 1404 | "gdf.sort_values(by='arrest_lag_diff')" 1405 | ] 1406 | }, 1407 | { 1408 | "cell_type": "markdown", 1409 | "metadata": { 1410 | "cell_id": "00048-88965640-7413-4991-a51d-6abf7e886e78", 1411 | "deepnote_cell_type": "markdown", 1412 | "slideshow": { 1413 | "slide_type": "slide" 1414 | } 1415 | }, 1416 | "source": [ 1417 | "In order to better understand the significance of the spatial lag values, consider the following two geographies:" 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "code", 1422 | "execution_count": null, 1423 | "metadata": { 1424 | "slideshow": { 1425 | "slide_type": "fragment" 1426 | } 1427 | }, 1428 | "outputs": [], 1429 | "source": [ 1430 | "# the FIPS with highest negative difference\n", 1431 | "gdf_donut = gdf.sort_values(by='arrest_lag_diff').head(1)\n", 1432 | "gdf_donut" 1433 | ] 1434 | }, 1435 | { 1436 | "cell_type": "code", 1437 | "execution_count": null, 1438 | "metadata": { 1439 | "slideshow": { 1440 | "slide_type": "fragment" 1441 | } 1442 | }, 1443 | "outputs": [], 1444 | "source": [ 1445 | "# the FIPS with highest positive difference\n", 1446 | "gdf_diamond = gdf.sort_values(by='arrest_lag_diff').tail(1)\n", 1447 | "gdf_diamond" 1448 | ] 1449 | }, 1450 | { 1451 | "cell_type": "markdown", 1452 | "metadata": { 
1453 | "slideshow": { 1454 | "slide_type": "slide" 1455 | } 1456 | }, 1457 | "source": [ 1458 | "To better illustrate our assumptions, let's display these locations using satellite imagery. To do so, we use Plotly Express and its Mapbox Satellite basemap (which provides high-quality, up-to-date imagery). Note that the use of Mapbox satellite image tiles requires a user token. \n", 1459 | "\n", 1460 | "- Grab your token from the [Mapbox account page](https://account.mapbox.com/) (after you have created an account)" 1461 | ] 1462 | }, 1463 | { 1464 | "cell_type": "code", 1465 | "execution_count": null, 1466 | "metadata": { 1467 | "slideshow": { 1468 | "slide_type": "slide" 1469 | } 1470 | }, 1471 | "outputs": [], 1472 | "source": [ 1473 | "# set the mapbox access token\n", 1474 | "token = 'your_token'\n", 1475 | "px.set_mapbox_access_token(token)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "code", 1480 | "execution_count": null, 1481 | "metadata": { 1482 | "slideshow": { 1483 | "slide_type": "fragment" 1484 | } 1485 | }, 1486 | "outputs": [], 1487 | "source": [ 1488 | "# project the donut to WGS84 and find its center\n", 1489 | "gdf_donut = gdf_donut.to_crs('epsg:4326')\n", 1490 | "\n", 1491 | "# approximate the center as the midpoint of the bounding box\n", 1492 | "minx, miny, maxx, maxy = gdf_donut.geometry.total_bounds\n", 1493 | "center_lat_donut = (maxy-miny)/2+miny\n", 1494 | "center_lon_donut = (maxx-minx)/2+minx" 1495 | ] 1496 | }, 1497 | { 1498 | "cell_type": "code", 1499 | "execution_count": null, 1500 | "metadata": { 1501 | "slideshow": { 1502 | "slide_type": "slide" 1503 | } 1504 | }, 1505 | "outputs": [], 1506 | "source": [ 1507 | "# project the diamond to WGS84 and find its center\n", 1508 | "gdf_diamond = gdf_diamond.to_crs('epsg:4326')\n", 1509 | "\n", 1510 | "# approximate the center as the midpoint of the bounding box\n", 1511 | "minx, miny, maxx, maxy = gdf_diamond.geometry.total_bounds\n", 1512 | "center_lat_diamond = (maxy-miny)/2+miny\n", 1513 | "center_lon_diamond = (maxx-minx)/2+minx" 1514 | ] 1515 
| }, 1516 | { 1517 | "cell_type": "markdown", 1518 | "metadata": { 1519 | "slideshow": { 1520 | "slide_type": "slide" 1521 | } 1522 | }, 1523 | "source": [ 1524 | "- [plotly choropleth maps](https://plotly.com/python/mapbox-county-choropleth/#)" 1525 | ] 1526 | }, 1527 | { 1528 | "cell_type": "code", 1529 | "execution_count": null, 1530 | "metadata": { 1531 | "scrolled": false, 1532 | "slideshow": { 1533 | "slide_type": "fragment" 1534 | } 1535 | }, 1536 | "outputs": [], 1537 | "source": [ 1538 | "px.choropleth_mapbox(gdf_donut, \n", 1539 | " geojson=gdf_donut.geometry, \n", 1540 | " locations=gdf_donut.index, \n", 1541 | " mapbox_style=\"satellite-streets\",\n", 1542 | " zoom=14, \n", 1543 | " center = {\"lat\": center_lat_donut, \"lon\": center_lon_donut},\n", 1544 | " hover_data=['arrests_count','arrests_per_1000','arrests_per_1000_lag'],\n", 1545 | " opacity=0.4,\n", 1546 | " title='The Donut')" 1547 | ] 1548 | }, 1549 | { 1550 | "cell_type": "code", 1551 | "execution_count": null, 1552 | "metadata": { 1553 | "scrolled": false, 1554 | "slideshow": { 1555 | "slide_type": "slide" 1556 | } 1557 | }, 1558 | "outputs": [], 1559 | "source": [ 1560 | "px.choropleth_mapbox(gdf_diamond, \n", 1561 | " geojson=gdf_diamond.geometry, \n", 1562 | " locations=gdf_diamond.index, \n", 1563 | " mapbox_style=\"satellite-streets\",\n", 1564 | " zoom=12, \n", 1565 | " center = {\"lat\": center_lat_diamond, \"lon\": center_lon_diamond},\n", 1566 | " hover_data=['arrests_count','arrests_per_1000','arrests_per_1000_lag'],\n", 1567 | " opacity=0.4,\n", 1568 | " title='The Diamond')" 1569 | ] 1570 | }, 1571 | { 1572 | "cell_type": "markdown", 1573 | "metadata": { 1574 | "slideshow": { 1575 | "slide_type": "slide" 1576 | } 1577 | }, 1578 | "source": [ 1579 | "What's the story here?\n", 1580 | "\n", 1581 | "Our donut is located in Venice Beach, in the famous enclave with homes adjacent to the grand canal known as part of the Venice Canal Historic District. 
It has a low arrest rate of just over 9 per 1000. Its spatial lag? 76! \n", 1582 | "\n", 1583 | "Our diamond, coincidentally, is the also famous Venice Beach Boardwalk. It is located in close proximity to our donut, with over 400 arrests per 1000. Its spatial lag? 90, which is still high!\n", 1584 | "\n", 1585 | "What these numbers tell us is that the Venice Canal is like a donut: a pristine historical housing community surrounded by neighbors with high arrest rates. On the flip side, Venice Boardwalk is a diamond: a very high-valued geography surrounded by lower-valued neighbors. It indicates that the boardwalk itself is a magnet location for arrests, but its immediate neighbors have comparatively lower arrest rates." 1586 | ] 1587 | }, 1588 | { 1589 | "cell_type": "markdown", 1590 | "metadata": { 1591 | "slideshow": { 1592 | "slide_type": "slide" 1593 | } 1594 | }, 1595 | "source": [ 1596 | "## Spatial lag map\n", 1597 | "But we digress. Let's map the entire dataframe by the newly created spatial lag column." 
1598 | ] 1599 | }, 1600 | { 1601 | "cell_type": "code", 1602 | "execution_count": null, 1603 | "metadata": { 1604 | "cell_id": "00051-c11fcc2b-df8f-420c-9365-3812f835d6ce", 1605 | "deepnote_cell_type": "code", 1606 | "execution_millis": 1107, 1607 | "execution_start": 1605807561410, 1608 | "output_cleared": false, 1609 | "scrolled": false, 1610 | "slideshow": { 1611 | "slide_type": "slide" 1612 | }, 1613 | "source_hash": "51b90aba" 1614 | }, 1615 | "outputs": [], 1616 | "source": [ 1617 | "# use subplots that make it easier to create multiple layered maps\n", 1618 | "fig, ax = plt.subplots(1,1,figsize=(15, 15))\n", 1619 | "\n", 1620 | "# spatial lag choropleth\n", 1621 | "gdf.plot(ax=ax,\n", 1622 | " figsize=(15,15),\n", 1623 | " column='arrests_per_1000_lag',\n", 1624 | " legend=True,\n", 1625 | " alpha=0.8,\n", 1626 | " cmap='RdYlGn_r',\n", 1627 | " scheme='quantiles')\n", 1628 | "\n", 1629 | "ax.axis('off')\n", 1630 | "ax.set_title('July 2020-January 2021 Arrests per 1000 people',fontsize=22)\n", 1631 | "\n", 1632 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 1633 | ] 1634 | }, 1635 | { 1636 | "cell_type": "markdown", 1637 | "metadata": { 1638 | "cell_id": "00052-7673144c-5be9-47cd-9e0c-46d5b9fe324b", 1639 | "deepnote_cell_type": "markdown", 1640 | "slideshow": { 1641 | "slide_type": "slide" 1642 | } 1643 | }, 1644 | "source": [ 1645 | "## Side-by-side maps\n", 1646 | "We can now compare these two map outputs side by side. Notice that the syntax is a bit different from past labs where we have only worked with one figure at a time. 
This output produces 1 row, and 2 columns of figures in `subplots`.\n", 1647 | "- [subplots documentation](https://matplotlib.org/3.3.0/gallery/subplots_axes_and_figures/subplots_demo.html)" 1648 | ] 1649 | }, 1650 | { 1651 | "cell_type": "code", 1652 | "execution_count": null, 1653 | "metadata": { 1654 | "cell_id": "00053-3d2f883f-e36b-42a8-b0e4-8d46ac893d67", 1655 | "deepnote_cell_type": "code", 1656 | "execution_millis": 1427, 1657 | "execution_start": 1605807604042, 1658 | "output_cleared": false, 1659 | "slideshow": { 1660 | "slide_type": "slide" 1661 | }, 1662 | "source_hash": "26dd2021" 1663 | }, 1664 | "outputs": [], 1665 | "source": [ 1666 | "# create the 1x2 subplots\n", 1667 | "fig, ax = plt.subplots(1, 2, figsize=(15, 8))\n", 1668 | "\n", 1669 | "# two subplots produces ax[0] (left) and ax[1] (right)\n", 1670 | "\n", 1671 | "# regular count map on the left\n", 1672 | "gdf.plot(ax=ax[0], # this assigns the map to the left subplot\n", 1673 | " column='arrests_per_1000', \n", 1674 | " cmap='RdYlGn_r', \n", 1675 | " scheme='quantiles',\n", 1676 | " k=5, \n", 1677 | " edgecolor='white', \n", 1678 | " linewidth=0, \n", 1679 | " alpha=0.75, \n", 1680 | " )\n", 1681 | "\n", 1682 | "\n", 1683 | "ax[0].axis(\"off\")\n", 1684 | "ax[0].set_title(\"Arrests per 1000\")\n", 1685 | "\n", 1686 | "# spatial lag map on the right\n", 1687 | "gdf.plot(ax=ax[1], # this assigns the map to the right subplot\n", 1688 | " column='arrests_per_1000_lag', \n", 1689 | " cmap='RdYlGn_r', \n", 1690 | " scheme='quantiles',\n", 1691 | " k=5, \n", 1692 | " edgecolor='white', \n", 1693 | " linewidth=0, \n", 1694 | " alpha=0.75\n", 1695 | " )\n", 1696 | "\n", 1697 | "ax[1].axis(\"off\")\n", 1698 | "ax[1].set_title(\"Arrests per 1000 - Spatial Lag\")\n", 1699 | "\n", 1700 | "plt.show()" 1701 | ] 1702 | }, 1703 | { 1704 | "cell_type": "markdown", 1705 | "metadata": { 1706 | "slideshow": { 1707 | "slide_type": "slide" 1708 | } 1709 | }, 1710 | "source": [ 1711 | "## Interactive spatial lag 
satellite map\n", 1712 | "Building the equivalent map as an interactive javascript map is a bit more challenging. While there are several options to choose from, this lab will use plotly express's choropleth_mapbox feature.\n", 1713 | "- https://plotly.com/python/mapbox-county-choropleth/#" 1714 | ] 1715 | }, 1716 | { 1717 | "cell_type": "code", 1718 | "execution_count": null, 1719 | "metadata": { 1720 | "slideshow": { 1721 | "slide_type": "fragment" 1722 | } 1723 | }, 1724 | "outputs": [], 1725 | "source": [ 1726 | "# interactive version needs to be in WGS84\n", 1727 | "gdf_web = gdf.to_crs('EPSG:4326')" 1728 | ] 1729 | }, 1730 | { 1731 | "cell_type": "code", 1732 | "execution_count": null, 1733 | "metadata": { 1734 | "slideshow": { 1735 | "slide_type": "fragment" 1736 | } 1737 | }, 1738 | "outputs": [], 1739 | "source": [ 1740 | "# what's the centroid?\n", 1741 | "minx, miny, maxx, maxy = gdf_web.geometry.total_bounds\n", 1742 | "center_lat_gdf_web = (maxy-miny)/2+miny\n", 1743 | "center_lon_gdf_web = (maxx-minx)/2+minx" 1744 | ] 1745 | }, 1746 | { 1747 | "cell_type": "markdown", 1748 | "metadata": { 1749 | "slideshow": { 1750 | "slide_type": "slide" 1751 | } 1752 | }, 1753 | "source": [ 1754 | "Unlike the matplotlib map, plotly's mapbox map only gives us a continuous scale option (there is no magical `scheme` option). 
To produce a similar quantile map, we need to calculate the values manually.\n", 1755 | "\n", 1756 | "As we want to produce a choropleth map based on our spatial lag column, let's get some simple stats:" 1757 | ] 1758 | }, 1759 | { 1760 | "cell_type": "code", 1761 | "execution_count": null, 1762 | "metadata": { 1763 | "slideshow": { 1764 | "slide_type": "fragment" 1765 | } 1766 | }, 1767 | "outputs": [], 1768 | "source": [ 1769 | "# some stats\n", 1770 | "gdf_web.arrests_per_1000_lag.describe()" 1771 | ] 1772 | }, 1773 | { 1774 | "cell_type": "code", 1775 | "execution_count": null, 1776 | "metadata": { 1777 | "slideshow": { 1778 | "slide_type": "fragment" 1779 | } 1780 | }, 1781 | "outputs": [], 1782 | "source": [ 1783 | "# grab the median\n", 1784 | "median = gdf_web.arrests_per_1000_lag.median()" 1785 | ] 1786 | }, 1787 | { 1788 | "cell_type": "code", 1789 | "execution_count": null, 1790 | "metadata": { 1791 | "scrolled": false, 1792 | "slideshow": { 1793 | "slide_type": "slide" 1794 | } 1795 | }, 1796 | "outputs": [], 1797 | "source": [ 1798 | "fig = px.choropleth_mapbox(gdf_web, \n", 1799 | " geojson=gdf_web.geometry, # the geometry column\n", 1800 | " locations=gdf_web.index, # the index\n", 1801 | " mapbox_style=\"satellite-streets\",\n", 1802 | " zoom=9, \n", 1803 | " color='arrests_per_1000_lag',\n", 1804 | " color_continuous_scale='RdYlGn_r',\n", 1805 | " color_continuous_midpoint =median, # put the median as the midpoint\n", 1806 | " range_color =(0,median*2),\n", 1807 | " hover_data=['arrests_count','arrests_per_1000','arrests_per_1000_lag'],\n", 1808 | " center = {\"lat\": center_lat_gdf_web, \"lon\": center_lon_gdf_web},\n", 1809 | " opacity=0.8,\n", 1810 | " width=1000,\n", 1811 | " height=800,\n", 1812 | " labels={\n", 1813 | " 'arrests_per_1000_lag':'Arrests per 1000 (Spatial Lag)',\n", 1814 | " 'arrests_per_1000':'Arrests per 1000',\n", 1815 | " })\n", 1816 | "fig.update_traces(marker_line_width=0.1, marker_line_color='white')\n", 1817 | 
"fig.update_layout(margin={\"r\":0,\"t\":0,\"l\":0,\"b\":0})" 1818 | ] 1819 | }, 1820 | { 1821 | "cell_type": "markdown", 1822 | "metadata": { 1823 | "cell_id": "00054-cde72154-74e9-492d-9384-f438d860221d", 1824 | "deepnote_cell_type": "markdown", 1825 | "slideshow": { 1826 | "slide_type": "slide" 1827 | } 1828 | }, 1829 | "source": [ 1830 | "## Moran's Plot\n", 1831 | "\n", 1832 | "We now have a spatial lag map: a map that displays geographies weighted against the values of their neighbors. The clusters are much clearer and cleaner than in the original arrest rate map: Downtown, Venice, South LA, Van Nuys... But we still have not *quantified* the degree of the spatial correlations. To begin this process, we test for global autocorrelation for a continuous attribute (arrests per 1000 people)." 1833 | ] 1834 | }, 1835 | { 1836 | "cell_type": "code", 1837 | "execution_count": null, 1838 | "metadata": { 1839 | "slideshow": { 1840 | "slide_type": "fragment" 1841 | } 1842 | }, 1843 | "outputs": [], 1844 | "source": [ 1845 | "y = gdf.arrests_per_1000\n", 1846 | "moran = Moran(y, wq)\n", 1847 | "moran.I" 1848 | ] 1849 | }, 1850 | { 1851 | "cell_type": "markdown", 1852 | "metadata": { 1853 | "slideshow": { 1854 | "slide_type": "slide" 1855 | } 1856 | }, 1857 | "source": [ 1858 | "The Moran's I value is nothing more than the slope of the best-fit line through the scatterplot of our \"arrests per 1000\" and \"arrests per 1000 spatial lag\" columns. It indicates whether you have a positive or a negative autocorrelation. Values typically range from negative one to positive one. 
\n", 1859 | "\n", 1860 | "- **Positive** spatial autocorrelation: high values are close to high values, and/or low values are close to low values\n", 1861 | "- **Negative** spatial autocorrelation (less common): similar values are far from each other; high values are next to low values, low values are next to high values" 1862 | ] 1863 | }, 1864 | { 1865 | "cell_type": "markdown", 1866 | "metadata": { 1867 | "slideshow": { 1868 | "slide_type": "slide" 1869 | } 1870 | }, 1871 | "source": [ 1872 | "You can output a scatterplot:" 1873 | ] 1874 | }, 1875 | { 1876 | "cell_type": "code", 1877 | "execution_count": null, 1878 | "metadata": { 1879 | "slideshow": { 1880 | "slide_type": "fragment" 1881 | } 1882 | }, 1883 | "outputs": [], 1884 | "source": [ 1885 | "fig, ax = moran_scatterplot(moran, aspect_equal=True)\n", 1886 | "plt.show()" 1887 | ] 1888 | }, 1889 | { 1890 | "cell_type": "markdown", 1891 | "metadata": { 1892 | "slideshow": { 1893 | "slide_type": "slide" 1894 | } 1895 | }, 1896 | "source": [ 1897 | "So what is the significance of our Moran's I value of 0.3? In other words, **how likely is it that the pattern we observe on the map was generated by an entirely random process?** To find out, we compare our value with a simulation of 999 permutations that randomly shuffle the arrest data throughout the given geographies. The output is a sampling distribution of Moran’s I values under the (null) hypothesis that attribute values are randomly distributed across the study area. 
We then compare our observed Moran’s I value to this \"Reference Distribution.\"" 1898 | ] 1899 | }, 1900 | { 1901 | "cell_type": "code", 1902 | "execution_count": null, 1903 | "metadata": { 1904 | "cell_id": "00055-489abd0d-4dfc-4a43-ab07-631157aaea0d", 1905 | "deepnote_cell_type": "code", 1906 | "execution_millis": 47, 1907 | "execution_start": 1605807625694, 1908 | "output_cleared": false, 1909 | "slideshow": { 1910 | "slide_type": "slide" 1911 | }, 1912 | "source_hash": "707e03ac" 1913 | }, 1914 | "outputs": [], 1915 | "source": [ 1916 | "plot_moran_simulation(moran,aspect_equal=False)" 1917 | ] 1918 | }, 1919 | { 1920 | "cell_type": "markdown", 1921 | "metadata": { 1922 | "cell_id": "00056-8d12869d-48fa-468e-9cb5-1babfcbf3bf9", 1923 | "deepnote_cell_type": "markdown", 1924 | "slideshow": { 1925 | "slide_type": "slide" 1926 | } 1927 | }, 1928 | "source": [ 1929 | "We can compute the p-value:\n", 1930 | "\n", 1931 | "" 1932 | ] 1933 | }, 1934 | { 1935 | "cell_type": "code", 1936 | "execution_count": null, 1937 | "metadata": { 1938 | "cell_id": "00057-90a65e59-c521-42fa-bc23-4141b9369225", 1939 | "deepnote_cell_type": "code", 1940 | "output_cleared": true, 1941 | "slideshow": { 1942 | "slide_type": "fragment" 1943 | } 1944 | }, 1945 | "outputs": [], 1946 | "source": [ 1947 | "moran.p_sim" 1948 | ] 1949 | }, 1950 | { 1951 | "cell_type": "markdown", 1952 | "metadata": { 1953 | "cell_id": "00058-368e0922-3062-4b2a-9155-27c2332200fd", 1954 | "deepnote_cell_type": "markdown", 1955 | "slideshow": { 1956 | "slide_type": "slide" 1957 | } 1958 | }, 1959 | "source": [ 1960 | "The value is calculated as an empirical p-value that represents the proportion of realisations in the simulation under spatial randomness that are more extreme than the observed value. A small enough p-value associated with the Moran’s I of a map allows us to reject the hypothesis that the map is random. 
In other words, we can conclude that the map displays more spatial pattern than we would expect if the values had been randomly allocated to locations.\n", 1961 | "\n", 1962 | "That is a very low value (0.001), particularly considering it is actually the minimum value we could have obtained given the simulation behind it used 999 permutations (default in PySAL) and, by standard terms, it would be deemed statistically significant. We can elaborate a bit further on the intuition behind the value of `p_sim`. If we generated a large number of maps with the same values but randomly allocated over space, and calculated the Moran’s I statistic for each of those maps, only 0.1% of them would display a larger (absolute) value than the one we obtain from the observed data, and the other 99.9% of the random maps would receive a smaller (absolute) value of Moran’s I. " 1963 | ] 1964 | }, 1965 | { 1966 | "cell_type": "markdown", 1967 | "metadata": { 1968 | "slideshow": { 1969 | "slide_type": "slide" 1970 | } 1971 | }, 1972 | "source": [ 1973 | "# Local Spatial Autocorrelation\n", 1974 | "So far, we have only determined that there is a positive spatial autocorrelation between the arrest rates of census block groups and their locations. But we have not detected where the clusters are. Local Indicators of Spatial Association (LISA) is used to do that. 
LISA classifies areas into four groups: high values near high values (HH), low values near low values (LL), low values with high values in their neighborhood (LH), and high values with low values in their neighborhood (HL).\n", 1975 | "\n", 1976 | "- HH: high arrest rate geographies near other high arrest rate neighbors\n", 1977 | "- LL: low arrest rate geographies near other low arrest rate neighbors\n", 1978 | "- LH (donuts): low arrest rate geographies surrounded by high arrest rate neighbors\n", 1979 | "- HL (diamonds): high arrest rate geographies surrounded by low arrest rate neighbors" 1980 | ] 1981 | }, 1982 | { 1983 | "cell_type": "markdown", 1984 | "metadata": { 1985 | "slideshow": { 1986 | "slide_type": "slide" 1987 | } 1988 | }, 1989 | "source": [ 1990 | "## Moran Local Scatterplot" 1991 | ] 1992 | }, 1993 | { 1994 | "cell_type": "code", 1995 | "execution_count": null, 1996 | "metadata": { 1997 | "cell_id": "00060-4a2f2b79-53f9-44b4-ae86-31cc8fdce2bd", 1998 | "deepnote_cell_type": "code", 1999 | "execution_millis": 3398, 2000 | "execution_start": 1605807811791, 2001 | "output_cleared": false, 2002 | "slideshow": { 2003 | "slide_type": "fragment" 2004 | }, 2005 | "source_hash": "bf84c31e" 2006 | }, 2007 | "outputs": [], 2008 | "source": [ 2009 | "# calculate local moran values\n", 2010 | "lisa = esda.moran.Moran_Local(y, wq)" 2011 | ] 2012 | }, 2013 | { 2014 | "cell_type": "code", 2015 | "execution_count": null, 2016 | "metadata": { 2017 | "cell_id": "00063-d9fc9212-bf39-4525-90db-ff4cbcfd7783", 2018 | "deepnote_cell_type": "code", 2019 | "execution_millis": 214, 2020 | "execution_start": 1605807831354, 2021 | "output_cleared": false, 2022 | "scrolled": true, 2023 | "slideshow": { 2024 | "slide_type": "fragment" 2025 | }, 2026 | "source_hash": "2b643" 2027 | }, 2028 | "outputs": [], 2029 | "source": [ 2030 | "# plot the local Moran scatterplot, coloring points significant at p < 0.05\n", 2031 | "fig,ax = plt.subplots(figsize=(10,15))\n", 2032 | "\n", 2033 | "moran_scatterplot(lisa, ax=ax, p=0.05)\n", 2034 | "ax.set_xlabel(\"Arrests\")\n", 2035 | "ax.set_ylabel('Spatial 
Lag of Arrests')\n", 2036 | "\n", 2037 | "# add some labels\n", 2038 | "plt.text(1.95, 0.5, \"HH\", fontsize=25)\n", 2039 | "plt.text(1.95, -1, \"HL\", fontsize=25)\n", 2040 | "plt.text(-2, 1, \"LH\", fontsize=25)\n", 2041 | "plt.text(-1, -1, \"LL\", fontsize=25)\n", 2042 | "plt.show()" 2043 | ] 2044 | }, 2045 | { 2046 | "cell_type": "markdown", 2047 | "metadata": { 2048 | "slideshow": { 2049 | "slide_type": "slide" 2050 | } 2051 | }, 2052 | "source": [ 2053 | "In the scatterplot above, the colored dots represent the rows (census block groups) that have a p-value less than 0.05 in each quadrant. In other words, these are the statistically significant, spatially autocorrelated geographies." 2054 | ] 2055 | }, 2056 | { 2057 | "cell_type": "markdown", 2058 | "metadata": { 2059 | "slideshow": { 2060 | "slide_type": "slide" 2061 | } 2062 | }, 2063 | "source": [ 2064 | "## Spatial Autocorrelation Map\n", 2065 | "Finally, you can visualize these statistically significant clusters using the `lisa_cluster` function:" 2066 | ] 2067 | }, 2068 | { 2069 | "cell_type": "code", 2070 | "execution_count": null, 2071 | "metadata": { 2072 | "cell_id": "00064-5038673d-0806-4e6c-a49e-9938454eedd4", 2073 | "deepnote_cell_type": "code", 2074 | "execution_millis": 819, 2075 | "execution_start": 1605807832811, 2076 | "output_cleared": false, 2077 | "scrolled": false, 2078 | "slideshow": { 2079 | "slide_type": "fragment" 2080 | }, 2081 | "source_hash": "4f4bbffa" 2082 | }, 2083 | "outputs": [], 2084 | "source": [ 2085 | "fig, ax = plt.subplots(figsize=(14,12))\n", 2086 | "lisa_cluster(lisa, gdf, p=0.05, ax=ax)\n", 2087 | "plt.show()" 2088 | ] 2089 | }, 2090 | { 2091 | "cell_type": "markdown", 2092 | "metadata": { 2093 | "slideshow": { 2094 | "slide_type": "slide" 2095 | } 2096 | }, 2097 | "source": [ 2098 | "We can also create a map comparing different p-value thresholds:" 2099 | ] 2100 | }, 2101 | { 2102 | "cell_type": "code", 2103 | "execution_count": null, 2104 | "metadata": { 2105 | "cell_id": 
"00053-3d2f883f-e36b-42a8-b0e4-8d46ac893d67", 2106 | "deepnote_cell_type": "code", 2107 | "execution_millis": 1427, 2108 | "execution_start": 1605807604042, 2109 | "output_cleared": false, 2110 | "slideshow": { 2111 | "slide_type": "fragment" 2112 | }, 2113 | "source_hash": "26dd2021" 2114 | }, 2115 | "outputs": [], 2116 | "source": [ 2117 | "# create the 1x2 subplots\n", 2118 | "fig, ax = plt.subplots(1, 2, figsize=(15, 8))\n", 2119 | "\n", 2120 | "# regular count map on the left\n", 2121 | "lisa_cluster(lisa, gdf, p=0.05, ax=ax[0])\n", 2122 | "\n", 2123 | "ax[0].axis(\"off\")\n", 2124 | "ax[0].set_title(\"P-value: 0.05\")\n", 2125 | "\n", 2126 | "# spatial lag map on the right\n", 2127 | "lisa_cluster(lisa, gdf, p=0.01, ax=ax[1])\n", 2128 | "ax[1].axis(\"off\")\n", 2129 | "ax[1].set_title(\"P-value: 0.01\")\n", 2130 | "\n", 2131 | "plt.show()" 2132 | ] 2133 | }, 2134 | { 2135 | "cell_type": "markdown", 2136 | "metadata": { 2137 | "cell_id": "00073-51e53eb0-0154-43e8-835d-ce181f9c7186", 2138 | "deepnote_cell_type": "markdown", 2139 | "slideshow": { 2140 | "slide_type": "slide" 2141 | } 2142 | }, 2143 | "source": [ 2144 | "# Resources\n", 2145 | "\n", 2146 | "- https://geographicdata.science/book/notebooks/06_spatial_autocorrelation.html\n", 2147 | "- https://pysal.org/esda/notebooks/spatialautocorrelation.html\n", 2148 | "- https://towardsdatascience.com/what-is-exploratory-spatial-data-analysis-esda-335da79026ee" 2149 | ] 2150 | } 2151 | ], 2152 | "metadata": { 2153 | "celltoolbar": "Slideshow", 2154 | "deepnote_execution_queue": [], 2155 | "deepnote_notebook_id": "e3ee145c-1ca6-4b9e-9637-f285c2264264", 2156 | "kernelspec": { 2157 | "display_name": "Python 3", 2158 | "language": "python", 2159 | "name": "python3" 2160 | }, 2161 | "language_info": { 2162 | "codemirror_mode": { 2163 | "name": "ipython", 2164 | "version": 3 2165 | }, 2166 | "file_extension": ".py", 2167 | "mimetype": "text/x-python", 2168 | "name": "python", 2169 | "nbconvert_exporter": "python", 
2170 | "pygments_lexer": "ipython3", 2171 | "version": "3.8.8" 2172 | }, 2173 | "toc": { 2174 | "base_numbering": 1, 2175 | "nav_menu": {}, 2176 | "number_sections": true, 2177 | "sideBar": true, 2178 | "skip_h1_title": false, 2179 | "title_cell": "Table of Contents", 2180 | "title_sidebar": "Contents", 2181 | "toc_cell": true, 2182 | "toc_position": { 2183 | "height": "calc(100% - 180px)", 2184 | "left": "10px", 2185 | "top": "150px", 2186 | "width": "349.047px" 2187 | }, 2188 | "toc_section_display": true, 2189 | "toc_window_display": false 2190 | } 2191 | }, 2192 | "nbformat": 4, 2193 | "nbformat_minor": 4 2194 | } 2195 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction to Spatial Statistics with Python 2 | 3 | 4 | 5 | Visual interpretations are meaningful ways to determine spatial trends in our data. However, underlying factors—such as inconsistent geographies, scale, data gaps, overlapping data—have the potential to produce incorrect assumptions, as valuable information may be conveniently hidden from the visual output. 6 | 7 | One way to address this issue is to amend your visual output with geo-statistical validation. In this workshop, we will use Python to look at one such approach: Spatial Autocorrelation. Spatial autocorrelation addresses the so-called "First Law of Geography": 8 | 9 | “Everything is related to everything else. But near things are more related than distant things”. Waldo Tobler’s (1969) First Law of Geography 10 | 11 | In other words, things that happen somewhere are likely to also happen at nearby locations. 12 | 13 | ## Jupyter Hub link for UCLA participants 14 | 15 | Choose "University of California, Los Angeles" from the drop down for "Identity Provider" and launch (just type "UCLA" in the search box). 
16 | 17 | - [UCLA JupyterHub](https://jupyter.idre.ucla.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fyohman%2Fworkshop-python-spatial-stats&urlpath=tree%2Fworkshop-python-spatial-stats%2FSpatial+Autocorrelation.ipynb&branch=main) 18 | 19 | 20 | ## Binder link to non-UCLA participants 21 | 22 | Warning: Launching the binder link will take about 5 minutes. 23 | 24 | [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/yohman/workshop-python-spatial-stats/HEAD?filepath=Spatial%20Autocorrelation.ipynb) 25 | -------------------------------------------------------------------------------- /Spatial Autocorrelation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "toc": true 7 | }, 8 | "source": [ 9 | "

Table of Contents

\n", 10 | "
" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "slideshow": { 17 | "slide_type": "slide" 18 | } 19 | }, 20 | "source": [ 21 | "
\n", 22 | "\n", 23 | "

Take notice!

\n", 24 | "\n", 27 | " \n", 28 | "
" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": { 34 | "cell_id": "00000-df0628a5-dfb4-4e7c-ac45-e80b52c54132", 35 | "deepnote_cell_type": "markdown", 36 | "slideshow": { 37 | "slide_type": "slide" 38 | } 39 | }, 40 | "source": [ 41 | "# Spatial Autocorrelation\n", 42 | "\n", 43 | "\n", 44 | "\n", 45 | "Visual interpretations are meaningful ways to determine spatial trends in our data. However, underlying factors—such as inconsistent geographies, scale, data gaps, overlapping data—have the potential to produce incorrect assumptions, as valuable information may be conveniently hidden from the visual output." 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": { 51 | "cell_id": "00000-df0628a5-dfb4-4e7c-ac45-e80b52c54132", 52 | "deepnote_cell_type": "markdown", 53 | "slideshow": { 54 | "slide_type": "slide" 55 | } 56 | }, 57 | "source": [ 58 | "\n", 59 | "\n", 60 | "One way to address this issue is to amend your visual output with geo-statistical validation. In this lab, we will look at one such approach: Spatial Autocorrelation. Spatial autocorrelation addresses the so-called \"First Law of Geography\":\n", 61 | "\n", 62 | "> “Everything is related to everything else. But near things are more related than distant things”. Waldo Tobler’s (1969) First Law of Geography\n", 63 | "\n", 64 | "In other words, things that happen somewhere are likely to also happen at nearby locations." 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "cell_id": "00000-df0628a5-dfb4-4e7c-ac45-e80b52c54132", 71 | "deepnote_cell_type": "markdown", 72 | "slideshow": { 73 | "slide_type": "slide" 74 | } 75 | }, 76 | "source": [ 77 | "## Methodology\n", 78 | "In this lab, we take on the controversial topic of policing in Los Angeles, specifically looking at arrest records from the LAPD. Do arrest locations have a statistical significant tendency to cluster in certain communities? 
To answer this question, we not only look at the location of recorded arrests in the city, but compare these locations with other arrests. In short, we are seeking to see where spatial correlations occur based on the data. Our approach is:\n", 79 | "\n", 80 | "1. import census block group boundaries for Los Angeles\n", 81 | "1. import arrest data from the LA Open Data Portal\n", 82 | "1. spatially join the two datasets\n", 83 | "1. normalize the data to create arrests per 1000\n", 84 | "1. conduct [global spatial autocorrelation](https://geographicdata.science/book/notebooks/06_spatial_autocorrelation.html) using Moran's I\n", 85 | "1. conduct [local spatial autocorrelation](https://geographicdata.science/book/notebooks/07_local_autocorrelation.html) using Local Indicators of Spatial Association (LISAs)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": { 91 | "slideshow": { 92 | "slide_type": "slide" 93 | } 94 | }, 95 | "source": [ 96 | "## Libraries to use\n", 97 | "\n", 98 | "\n", 99 | "\n", 100 | "\n", 101 | "\n", 102 | "\n", 103 | "\n", 104 | "- [Data vis with Pandas and Matplotlib](https://pandas.pydata.org/docs/user_guide/visualization.html)\n", 105 | "- [geopandas introduction](https://geopandas.org/getting_started/introduction.html)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": { 111 | "slideshow": { 112 | "slide_type": "slide" 113 | } 114 | }, 115 | "source": [ 116 | "## Libraries to use\n", 117 | "\n", 118 | "\n", 119 | "\n", 120 | "\n", 121 | "\n", 122 | "- [ESDA](https://pysal.org/esda/)\n", 123 | "- [SPLOT](https://github.com/pysal/splot)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": { 130 | "cell_id": "00001-85a84e8b-13d6-4356-8656-408ce2a2630c", 131 | "deepnote_cell_type": "code", 132 | "execution_millis": 1181, 133 | "execution_start": 1605804400268, 134 | "output_cleared": false, 135 | "slideshow": { 136 | "slide_type": "slide" 137 | }, 138 | 
"source_hash": "bc3fef15" 139 | }, 140 | "outputs": [], 141 | "source": [ 142 | "# to read and wrangle data\n", 143 | "import pandas as pd\n", 144 | "\n", 145 | "# to import data from LA Data portal\n", 146 | "from sodapy import Socrata\n", 147 | "\n", 148 | "# to create spatial data\n", 149 | "import geopandas as gpd\n", 150 | "\n", 151 | "# for basemaps\n", 152 | "import contextily as ctx\n", 153 | "\n", 154 | "# For spatial statistics\n", 155 | "import esda\n", 156 | "from esda.moran import Moran, Moran_Local\n", 157 | "\n", 158 | "import splot\n", 159 | "from splot.esda import moran_scatterplot, plot_moran, lisa_cluster,plot_moran_simulation\n", 160 | "\n", 161 | "import libpysal as lps\n", 162 | "\n", 163 | "# Graphics\n", 164 | "import matplotlib.pyplot as plt\n", 165 | "import plotly.express as px" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": { 171 | "slideshow": { 172 | "slide_type": "slide" 173 | } 174 | }, 175 | "source": [ 176 | "## Data preparation\n", 177 | "\n", 178 | "\n" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": { 184 | "cell_id": "00002-8c57606b-b800-4154-99ce-86d0e45b3499", 185 | "deepnote_cell_type": "markdown", 186 | "slideshow": { 187 | "slide_type": "slide" 188 | } 189 | }, 190 | "source": [ 191 | "## Block Groups\n", 192 | "\n", 193 | "Our first task is to bring in a geography that will allow us to summarize the location of arrests. The smaller the geography, the better our spatial correlation results will be. Short of creating our own grid, the census block groups provides an easily accessible boundary layer at a human scale. 
Additionally, working with census geographies will allow for future analyses that may include census data.\n", 194 | "\n", 195 | "* Data source: \n", 196 | " * [Census Reporter: ACS 2018 5 year: Table B01003: Total Population in Los Angeles: Census Block Groups](https://censusreporter.org/data/table/?table=B01003&geo_ids=16000US0644000,150|16000US0644000&primary_geo_id=16000US0644000)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": { 203 | "cell_id": "00003-27a9c38f-0d55-4f36-9268-f26b88e2ae29", 204 | "deepnote_cell_type": "code", 205 | "execution_millis": 3186, 206 | "execution_start": 1605804403202, 207 | "output_cleared": false, 208 | "slideshow": { 209 | "slide_type": "fragment" 210 | }, 211 | "source_hash": "b8c3febb" 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "# population block group data from census reporter\n", 216 | "gdf = gpd.read_file('data/acs2018_5yr_B01003_15000US060372711003.geojson')" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": { 222 | "slideshow": { 223 | "slide_type": "slide" 224 | } 225 | }, 226 | "source": [ 227 | "## Data cleanup" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": { 234 | "slideshow": { 235 | "slide_type": "fragment" 236 | } 237 | }, 238 | "outputs": [], 239 | "source": [ 240 | "# show first 5 rows\n", 241 | "gdf.head()" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "cell_id": "00004-e26db948-08d3-46f9-b0ce-1da8563c02d9", 249 | "deepnote_cell_type": "code", 250 | "execution_millis": 7, 251 | "execution_start": 1605804407207, 252 | "output_cleared": false, 253 | "slideshow": { 254 | "slide_type": "slide" 255 | }, 256 | "source_hash": "494d51a4" 257 | }, 258 | "outputs": [], 259 | "source": [ 260 | "# show columns and data types\n", 261 | "gdf.info()" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | 
"execution_count": null, 267 | "metadata": { 268 | "cell_id": "00005-65875226-9284-44aa-b62e-3a1e2212d84b", 269 | "deepnote_cell_type": "code", 270 | "execution_millis": 13, 271 | "execution_start": 1605804409430, 272 | "output_cleared": false, 273 | "slideshow": { 274 | "slide_type": "slide" 275 | }, 276 | "source_hash": "7c13ef85" 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "# trim the data to the bare minimum columns\n", 281 | "gdf = gdf[['geoid','B01003001','geometry']]\n", 282 | "\n", 283 | "# rename the columns\n", 284 | "gdf.columns = ['FIPS','TotalPop','geometry']" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": { 291 | "cell_id": "00005-65875226-9284-44aa-b62e-3a1e2212d84b", 292 | "deepnote_cell_type": "code", 293 | "execution_millis": 13, 294 | "execution_start": 1605804409430, 295 | "output_cleared": false, 296 | "slideshow": { 297 | "slide_type": "fragment" 298 | }, 299 | "source_hash": "7c13ef85" 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "# last 5 rows\n", 304 | "gdf.tail()" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": { 311 | "slideshow": { 312 | "slide_type": "fragment" 313 | } 314 | }, 315 | "outputs": [], 316 | "source": [ 317 | "# delete last row which is for the entire city of LA\n", 318 | "gdf=gdf.drop(2515)" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": { 325 | "scrolled": true, 326 | "slideshow": { 327 | "slide_type": "slide" 328 | } 329 | }, 330 | "outputs": [], 331 | "source": [ 332 | "# fix FIPS code\n", 333 | "gdf['FIPS'] = gdf['FIPS'].str.replace('15000US','')\n", 334 | "gdf.tail()" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": { 340 | "slideshow": { 341 | "slide_type": "slide" 342 | } 343 | }, 344 | "source": [ 345 | "One more data cleanup: get rid of census blocks groups with less than 100 total population.\n", 346 | "\n", 347 
| "- `sort_values()` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "slideshow": { 355 | "slide_type": "fragment" 356 | } 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "# sort by total pop\n", 361 | "gdf.sort_values(by='TotalPop').head(20)" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": { 367 | "slideshow": { 368 | "slide_type": "slide" 369 | } 370 | }, 371 | "source": [ 372 | "### Subsetting the data\n", 373 | "\n", 374 | "- [Selecting a subset of a dataframe](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "slideshow": { 382 | "slide_type": "fragment" 383 | } 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "# delete less than 100 population geographies\n", 388 | "gdf = gdf[gdf['TotalPop']>100]" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": { 394 | "slideshow": { 395 | "slide_type": "slide" 396 | } 397 | }, 398 | "source": [ 399 | "## Map the census block groups" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": { 406 | "cell_id": "00006-c7db2dde-5f71-47f2-bfc2-0a0d94f1a03d", 407 | "deepnote_cell_type": "code", 408 | "execution_millis": 930, 409 | "execution_start": 1605804413958, 410 | "output_cleared": false, 411 | "slideshow": { 412 | "slide_type": "fragment" 413 | }, 414 | "source_hash": "e699b335" 415 | }, 416 | "outputs": [], 417 | "source": [ 418 | "# get the layers into a web mercator projection\n", 419 | "# reproject to web mercator\n", 420 | "gdf = gdf.to_crs(epsg=3857)" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": { 426 | "slideshow": { 427 | "slide_type": "slide" 428 | } 429 | }, 430 | 
"source": [ 431 | "### Using Matplotlib subplots\n", 432 | "\n", 433 | "- `plt.subplots()` [documentation](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html)" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "cell_id": "00007-ffa16cca-a0b5-4e66-9a4c-73d90565821d", 441 | "deepnote_cell_type": "code", 442 | "execution_millis": 2725, 443 | "execution_start": 1605804417356, 444 | "output_cleared": false, 445 | "slideshow": { 446 | "slide_type": "slide" 447 | }, 448 | "source_hash": "93ebb39" 449 | }, 450 | "outputs": [], 451 | "source": [ 452 | "# plot it!\n", 453 | "fig, ax = plt.subplots(figsize=(12,12))\n", 454 | "\n", 455 | "gdf.plot(ax=ax,\n", 456 | " color='black', \n", 457 | " edgecolor='white',\n", 458 | " lw=0.5,\n", 459 | " alpha=0.4)\n", 460 | "\n", 461 | "# no axis\n", 462 | "ax.axis('off')\n", 463 | "\n", 464 | "# add a basemap\n", 465 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 466 | ] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": { 471 | "cell_id": "00008-5ca674be-a4a5-4476-b7bd-3c86722b0246", 472 | "deepnote_cell_type": "markdown", 473 | "slideshow": { 474 | "slide_type": "slide" 475 | } 476 | }, 477 | "source": [ 478 | "## Get Arrest Data from LA Open Data Portal\n", 479 | "Next, we acquire the data using the socrata API. 
Use the socrata documentation to grab the code syntax for our arrests data.\n", 480 | "- [LA Data Portal](https://data.lacity.org)\n", 481 | "- [LA Data Portal arrest data](https://data.lacity.org/Public-Safety/Arrest-Data-from-2020-to-Present/amvf-fr72)\n", 482 | "- [Socrata endpoint documentation for this dataset](https://dev.socrata.com/foundry/data.lacity.org/amvf-fr72)" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": { 489 | "cell_id": "00009-684c0663-6a05-4f66-82c3-63b259473b81", 490 | "deepnote_cell_type": "code", 491 | "execution_millis": 4626, 492 | "execution_start": 1605804430393, 493 | "output_cleared": false, 494 | "scrolled": true, 495 | "slideshow": { 496 | "slide_type": "slide" 497 | }, 498 | "source_hash": "a077b4d5" 499 | }, 500 | "outputs": [], 501 | "source": [ 502 | "# connect to the data portal\n", 503 | "client = Socrata(\"data.lacity.org\", None)\n", 504 | "\n", 505 | "results = client.get(\"amvf-fr72\", \n", 506 | " limit=50000,\n", 507 | " where = \"arst_date between '2020-07-01T00:00:00' and '2021-01-31T00:00:00'\",\n", 508 | " order='arst_date desc')\n", 509 | "\n", 510 | "# Convert to pandas DataFrame\n", 511 | "arrests = pd.DataFrame.from_records(results)\n" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": null, 517 | "metadata": { 518 | "cell_id": "00010-edd31267-0772-400b-9220-b4c2af28281d", 519 | "deepnote_cell_type": "code", 520 | "execution_millis": 2, 521 | "execution_start": 1605804438759, 522 | "output_cleared": false, 523 | "scrolled": true, 524 | "slideshow": { 525 | "slide_type": "fragment" 526 | }, 527 | "source_hash": "d8a3f800" 528 | }, 529 | "outputs": [], 530 | "source": [ 531 | "arrests.info()" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": null, 537 | "metadata": { 538 | "slideshow": { 539 | "slide_type": "slide" 540 | } 541 | }, 542 | "outputs": [], 543 | "source": [ 544 | "arrests.head()" 545 | ] 546 | }, 
547 | { 548 | "cell_type": "markdown", 549 | "metadata": { 550 | "cell_id": "00011-87e77a3f-36e6-48db-bc94-ae8b2747e411", 551 | "deepnote_cell_type": "markdown", 552 | "slideshow": { 553 | "slide_type": "slide" 554 | } 555 | }, 556 | "source": [ 557 | "### Convert data to a geodataframe\n", 558 | "\n", 559 | "Geopandas allows us to convert different types of data into a spatial format.\n", 560 | "- https://geopandas.org/gallery/create_geopandas_from_pandas.html" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": null, 566 | "metadata": { 567 | "cell_id": "00012-11bd538c-28c8-4c61-9994-e819c3feb7e8", 568 | "deepnote_cell_type": "code", 569 | "execution_millis": 409, 570 | "execution_start": 1605804442578, 571 | "output_cleared": false, 572 | "slideshow": { 573 | "slide_type": "fragment" 574 | }, 575 | "source_hash": "dcfb7ede" 576 | }, 577 | "outputs": [], 578 | "source": [ 579 | "# convert pandas dataframe to geodataframe\n", 580 | "arrests = gpd.GeoDataFrame(arrests, \n", 581 | " crs='EPSG:4326',\n", 582 | " geometry=gpd.points_from_xy(arrests.lon, arrests.lat))" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "metadata": { 589 | "cell_id": "00013-9c89a5ae-ae0f-4844-89e8-3b9105d30b6e", 590 | "deepnote_cell_type": "code", 591 | "execution_millis": 1327, 592 | "execution_start": 1605804596814, 593 | "output_cleared": false, 594 | "slideshow": { 595 | "slide_type": "fragment" 596 | }, 597 | "source_hash": "8eb5f93e" 598 | }, 599 | "outputs": [], 600 | "source": [ 601 | "# get the layers into a web mercator projection\n", 602 | "# reproject to web mercator\n", 603 | "arrests = arrests.to_crs(epsg=3857)" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": { 610 | "cell_id": "00014-5b46baf9-e870-4232-a425-28f0d65fb571", 611 | "deepnote_cell_type": "code", 612 | "execution_millis": 60, 613 | "execution_start": 1605804598151, 614 | "output_cleared": false, 
615 | "slideshow": { 616 | "slide_type": "fragment" 617 | }, 618 | "source_hash": "ced73d7" 619 | }, 620 | "outputs": [], 621 | "source": [ 622 | "# convert lat/lon to floats\n", 623 | "arrests.lon = arrests.lon.astype('float')\n", 624 | "arrests.lat = arrests.lat.astype('float')" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": { 631 | "cell_id": "00015-37008c4f-0307-455b-b316-da6e65f084c5", 632 | "deepnote_cell_type": "code", 633 | "execution_millis": 3485, 634 | "execution_start": 1605804600104, 635 | "output_cleared": false, 636 | "scrolled": true, 637 | "slideshow": { 638 | "slide_type": "slide" 639 | }, 640 | "source_hash": "ec9920c6" 641 | }, 642 | "outputs": [], 643 | "source": [ 644 | "# map it!\n", 645 | "fig,ax = plt.subplots(figsize=(12,12))\n", 646 | "\n", 647 | "arrests.plot(ax=ax,\n", 648 | " color='red',\n", 649 | " markersize=1)\n", 650 | "\n", 651 | "# no axis\n", 652 | "ax.axis('off')\n", 653 | "\n", 654 | "# add a basemap\n", 655 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)\n" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": { 661 | "cell_id": "00016-7294ac05-aa91-433c-aefd-bb0ee45e7ef5", 662 | "deepnote_cell_type": "markdown", 663 | "slideshow": { 664 | "slide_type": "slide" 665 | } 666 | }, 667 | "source": [ 668 | "### The 0,0 conundrum\n", 669 | "What is that red dot off the coast of Africa? Yes, that is the infamous spatial black hole, the [0,0] coordinate. There can be many reasons why those red dots get lost and find themselves there. Missing data may default to 0's, null values may be converted to 0's, or a human may have inadvertently entered 0's for unknown locations. Either way, since these records do not have valid locations, they need to be deleted from our dataframe in order to proceed." 
670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": { 676 | "cell_id": "00017-7d512d29-36e2-4a73-80ec-1a49b1845f1d", 677 | "deepnote_cell_type": "code", 678 | "execution_millis": 11, 679 | "execution_start": 1605804631750, 680 | "output_cleared": false, 681 | "slideshow": { 682 | "slide_type": "fragment" 683 | }, 684 | "source_hash": "dc9df796" 685 | }, 686 | "outputs": [], 687 | "source": [ 688 | "# subset the zero coordinate records\n", 689 | "arrests[arrests.lon==0]" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": { 696 | "slideshow": { 697 | "slide_type": "slide" 698 | } 699 | }, 700 | "outputs": [], 701 | "source": [ 702 | "# drop the unmapped rows\n", 703 | "arrests = arrests[arrests.lon!=0]" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": null, 709 | "metadata": { 710 | "cell_id": "00015-37008c4f-0307-455b-b316-da6e65f084c5", 711 | "deepnote_cell_type": "code", 712 | "execution_millis": 3485, 713 | "execution_start": 1605804600104, 714 | "output_cleared": false, 715 | "scrolled": false, 716 | "slideshow": { 717 | "slide_type": "slide" 718 | }, 719 | "source_hash": "ec9920c6" 720 | }, 721 | "outputs": [], 722 | "source": [ 723 | "# map it!\n", 724 | "fig,ax = plt.subplots(figsize=(12,12))\n", 725 | "\n", 726 | "arrests.plot(ax=ax,\n", 727 | " color='red',\n", 728 | " markersize=1)\n", 729 | "\n", 730 | "# no axis\n", 731 | "ax.axis('off')\n", 732 | "\n", 733 | "# add a basemap\n", 734 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)\n" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": { 740 | "cell_id": "00020-c58096aa-ccb6-44c8-a710-06b464722ea2", 741 | "deepnote_cell_type": "markdown", 742 | "slideshow": { 743 | "slide_type": "slide" 744 | } 745 | }, 746 | "source": [ 747 | "## Create a two layer map\n", 748 | "\n", 749 | "- https://geopandas.org/mapping.html" 750 | ] 751 | }, 752 | { 753 | 
"cell_type": "markdown", 754 | "metadata": { 755 | "cell_id": "00021-54ac341c-54f2-4549-b9fd-97ba8c8f9628", 756 | "deepnote_cell_type": "markdown", 757 | "slideshow": { 758 | "slide_type": "fragment" 759 | } 760 | }, 761 | "source": [ 762 | "Since we want to zoom to the extent of the arrests layer (and not the block groups), get the bounding coordinates for our axis." 763 | ] 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": null, 768 | "metadata": { 769 | "cell_id": "00022-530381e9-7db1-4821-bcee-a66b1b5983f4", 770 | "deepnote_cell_type": "code", 771 | "execution_millis": 489, 772 | "execution_start": 1605804643081, 773 | "output_cleared": false, 774 | "slideshow": { 775 | "slide_type": "fragment" 776 | }, 777 | "source_hash": "2933d808" 778 | }, 779 | "outputs": [], 780 | "source": [ 781 | "# get the bounding box coordinates for the arrest data\n", 782 | "minx, miny, maxx, maxy = arrests.geometry.total_bounds\n", 783 | "print(minx)\n", 784 | "print(maxx)\n", 785 | "print(miny)\n", 786 | "print(maxy)" 787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": { 792 | "slideshow": { 793 | "slide_type": "slide" 794 | } 795 | }, 796 | "source": [ 797 | "## Subplots for multi-layered maps\n", 798 | "\n", 799 | "For our multi-layered maps, we are taking it one step further from our previous lab using matplotlib's `subplots`. `subplots` allows the creation of multiple plots on a gridded canvas. For our map, we only need a single subplot, but we are layering multiple datasets *on top of one another* on that subplot. To specify which subplot to put the layer on, you use the `ax` argument." 
800 | ] 801 | }, 802 | { 803 | "cell_type": "code", 804 | "execution_count": null, 805 | "metadata": { 806 | "cell_id": "00023-de1f44e5-dc3e-4a90-9423-884b716cdfa7", 807 | "deepnote_cell_type": "code", 808 | "execution_millis": 4318, 809 | "execution_start": 1605804911562, 810 | "output_cleared": false, 811 | "scrolled": false, 812 | "slideshow": { 813 | "slide_type": "slide" 814 | }, 815 | "source_hash": "75d2d69a" 816 | }, 817 | "outputs": [], 818 | "source": [ 819 | "# set up the plot canvas with plt.subplots in one column, one row\n", 820 | "fig, ax = plt.subplots(1,1,figsize=(15, 15))\n", 821 | "\n", 822 | "# block groups\n", 823 | "gdf.plot(ax=ax, # this puts it in the ax plot\n", 824 | " color='gray', \n", 825 | " edgecolor='white',\n", 826 | " alpha=0.5)\n", 827 | "\n", 828 | "# arrests\n", 829 | "arrests.plot(ax=ax, # this also puts it in the same ax plot\n", 830 | " color='red',\n", 831 | " markersize=1,\n", 832 | " alpha=0.2)\n", 833 | "\n", 834 | "# use the bounding box coordinates to set the x and y limits\n", 835 | "ax.set_xlim(minx - 1000, maxx + 1000) # the added/subtracted value gives some margin around the total bounds\n", 836 | "ax.set_ylim(miny - 1000, maxy + 1000)\n", 837 | "\n", 838 | "# no axis\n", 839 | "ax.axis('off')\n", 840 | "\n", 841 | "# add a basemap\n", 842 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": { 848 | "cell_id": "00024-350b1c6f-5d68-4f37-841d-9b832e497dfe", 849 | "deepnote_cell_type": "markdown", 850 | "slideshow": { 851 | "slide_type": "slide" 852 | } 853 | }, 854 | "source": [ 855 | "## The spatial join\n", 856 | "\n", 857 | "* https://geopandas.org/mergingdata.html?highlight=spatial%20join\n", 858 | "\n", 859 | "In a Spatial Join, two geometry objects are merged based on their spatial relationship to one another." 
860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": { 865 | "slideshow": { 866 | "slide_type": "slide" 867 | } 868 | }, 869 | "source": [ 870 | "While the official documentation may seem confusing, consider the following as a rule of thumb. When you do a spatial join with `gpd.sjoin()`, you feed it three arguments: a left dataframe, a right dataframe, and a `how` argument.\n", 871 | "\n", 872 | "- **Left dataframe**: identify the layer you want to attach information *to* (the information will come from the other layer)\n", 873 | "- **Right dataframe**: identify the layer that you want to get information *from* to attach to the other layer\n", 874 | "\n", 875 | "Once you identify your left and right dataframes, use `how=\"left\"` to spatially join the two layers (think: \"I'm sending data from the right to the left\"). Note that this will result in a dataframe with the same records and datatype as the LEFT layer." 876 | ] 877 | }, 878 | { 879 | "cell_type": "code", 880 | "execution_count": null, 881 | "metadata": { 882 | "cell_id": "00025-83617000-2144-45cb-9145-3e070dc1da7e", 883 | "deepnote_cell_type": "code", 884 | "execution_millis": 1339, 885 | "execution_start": 1605806117634, 886 | "output_cleared": false, 887 | "slideshow": { 888 | "slide_type": "slide" 889 | }, 890 | "source_hash": "8c53b4a0" 891 | }, 892 | "outputs": [], 893 | "source": [ 894 | "# Do the spatial join\n", 895 | "join = gpd.sjoin(arrests, gdf, how='left')\n", 896 | "join.head()" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "metadata": { 902 | "cell_id": "00026-a88db1eb-74f2-4c08-9db8-e964326dc862", 903 | "deepnote_cell_type": "markdown", 904 | "slideshow": { 905 | "slide_type": "slide" 906 | } 907 | }, 908 | "source": [ 909 | "This creates a dataframe that has every arrest record with the corresponding FIPS code.\n", 910 | "\n", 911 | "Next, we create another dataframe that counts arrests by their corresponding block group.\n", 912 | "\n", 913 | "- `value_counts()` 
[documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html)" 914 | ] 915 | }, 916 | { 917 | "cell_type": "code", 918 | "execution_count": null, 919 | "metadata": { 920 | "cell_id": "00027-ca714a55-7276-4cd5-8cfd-64ed87c51b47", 921 | "deepnote_cell_type": "code", 922 | "execution_millis": 0, 923 | "execution_start": 1605806144598, 924 | "output_cleared": false, 925 | "slideshow": { 926 | "slide_type": "fragment" 927 | }, 928 | "source_hash": "bbec6db9" 929 | }, 930 | "outputs": [], 931 | "source": [ 932 | "arrests_by_gdf = join.FIPS.value_counts().rename_axis('FIPS').reset_index(name='arrests_count')" 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": null, 938 | "metadata": { 939 | "cell_id": "00028-0edb1f2f-bf0d-464e-91a5-8e00178c2fc7", 940 | "deepnote_cell_type": "code", 941 | "execution_millis": 1, 942 | "execution_start": 1605806249414, 943 | "output_cleared": false, 944 | "scrolled": true, 945 | "slideshow": { 946 | "slide_type": "fragment" 947 | }, 948 | "source_hash": "1a26c77d" 949 | }, 950 | "outputs": [], 951 | "source": [ 952 | "arrests_by_gdf.head()" 953 | ] 954 | }, 955 | { 956 | "cell_type": "markdown", 957 | "metadata": { 958 | "slideshow": { 959 | "slide_type": "slide" 960 | } 961 | }, 962 | "source": [ 963 | "- [Pandas bar plot documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html)" 964 | ] 965 | }, 966 | { 967 | "cell_type": "code", 968 | "execution_count": null, 969 | "metadata": { 970 | "cell_id": "00029-c703c03b-ae27-4935-85ae-6f32896bcd05", 971 | "deepnote_cell_type": "code", 972 | "execution_millis": 256, 973 | "execution_start": 1605806267816, 974 | "output_cleared": false, 975 | "scrolled": false, 976 | "slideshow": { 977 | "slide_type": "fragment" 978 | }, 979 | "source_hash": "a09e3e30" 980 | }, 981 | "outputs": [], 982 | "source": [ 983 | "# make a bar chart of top 20 geographies\n", 984 | "arrests_by_gdf[:20].plot.bar(figsize=(20,4),\n", 
985 | " x='FIPS',\n", 986 | " y='arrests_count')" 987 | ] 988 | }, 989 | { 990 | "cell_type": "markdown", 991 | "metadata": { 992 | "cell_id": "00030-23e15688-b87f-4d89-8e04-78e98f59d8ce", 993 | "deepnote_cell_type": "markdown", 994 | "slideshow": { 995 | "slide_type": "slide" 996 | } 997 | }, 998 | "source": [ 999 | "## Join the value counts back to the gdf\n", 1000 | "\n", 1001 | "How many people know their census block number? The bar chart is nice, but it is not informative. Without spatial awareness, the data chart does little to convey knowledge. What we want is a choropleth map to accompany it. To do so, we merge the counts back to the block group gdf.\n", 1002 | "\n", 1003 | "- [merge documentation](https://geopandas.org/docs/user_guide/mergingdata.html)" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": null, 1009 | "metadata": { 1010 | "cell_id": "00031-ef0c1b84-f54e-464b-9183-d7b6952c93ad", 1011 | "deepnote_cell_type": "code", 1012 | "execution_millis": 0, 1013 | "execution_start": 1605806281114, 1014 | "output_cleared": false, 1015 | "slideshow": { 1016 | "slide_type": "fragment" 1017 | }, 1018 | "source_hash": "6918564f" 1019 | }, 1020 | "outputs": [], 1021 | "source": [ 1022 | "# join the summary table back to the gdf\n", 1023 | "gdf=gdf.merge(arrests_by_gdf,on='FIPS')" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "markdown", 1028 | "metadata": { 1029 | "cell_id": "00032-de630962-35a2-45c8-a028-ad929bebbff7", 1030 | "deepnote_cell_type": "markdown", 1031 | "slideshow": { 1032 | "slide_type": "slide" 1033 | } 1034 | }, 1035 | "source": [ 1036 | "Now the block group gdf has a new column for arrest counts:" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "code", 1041 | "execution_count": null, 1042 | "metadata": { 1043 | "cell_id": "00033-4405a8bc-92cd-44dc-9f79-e1756818d28f", 1044 | "deepnote_cell_type": "code", 1045 | "execution_millis": 10, 1046 | "execution_start": 1605806285885, 1047 | "output_cleared": false, 1048 
| "slideshow": { 1049 | "slide_type": "fragment" 1050 | }, 1051 | "source_hash": "436bb116" 1052 | }, 1053 | "outputs": [], 1054 | "source": [ 1055 | "# our neighborhood table now has a count column\n", 1056 | "gdf.head()" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "markdown", 1061 | "metadata": { 1062 | "slideshow": { 1063 | "slide_type": "slide" 1064 | } 1065 | }, 1066 | "source": [ 1067 | "## Normalizing: Arrests per 1000 people\n", 1068 | "Rather than proceeding with an absolute count of arrests, let's normalize it by number of people who live in the census block group.\n", 1069 | "\n" 1070 | ] 1071 | }, 1072 | { 1073 | "cell_type": "code", 1074 | "execution_count": null, 1075 | "metadata": { 1076 | "slideshow": { 1077 | "slide_type": "fragment" 1078 | } 1079 | }, 1080 | "outputs": [], 1081 | "source": [ 1082 | "gdf['arrests_per_1000'] = gdf['arrests_count']/gdf['TotalPop']*1000" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "code", 1087 | "execution_count": null, 1088 | "metadata": { 1089 | "slideshow": { 1090 | "slide_type": "fragment" 1091 | } 1092 | }, 1093 | "outputs": [], 1094 | "source": [ 1095 | "gdf.sort_values(by=\"arrests_per_1000\").tail()" 1096 | ] 1097 | }, 1098 | { 1099 | "cell_type": "markdown", 1100 | "metadata": { 1101 | "slideshow": { 1102 | "slide_type": "slide" 1103 | } 1104 | }, 1105 | "source": [ 1106 | "And if you are curious, we can map a slice of the data. Here, we sort the values by descending arrest rate, and only show a slice of the data, the top 20 geographies using the handy `[:20]`." 
1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": null, 1112 | "metadata": { 1113 | "cell_id": "00034-d60cf7ed-32db-4b0e-ac55-6dc3119c40c9", 1114 | "deepnote_cell_type": "code", 1115 | "execution_millis": 455, 1116 | "execution_start": 1605806413842, 1117 | "output_cleared": false, 1118 | "slideshow": { 1119 | "slide_type": "slide" 1120 | }, 1121 | "source_hash": "c70d7036" 1122 | }, 1123 | "outputs": [], 1124 | "source": [ 1125 | "# map the top 20 geographies\n", 1126 | "fig,ax = plt.subplots(figsize=(12,10))\n", 1127 | "gdf.sort_values(by='arrests_per_1000',ascending=False)[:20].plot(ax=ax,\n", 1128 | " color='red',\n", 1129 | " edgecolor='white',\n", 1130 | " alpha=0.5,legend=True)\n", 1131 | "\n", 1132 | "\n", 1133 | "# title\n", 1134 | "ax.set_title('Top 20 locations of LAPD arrests per 1000 people (July 2020-January 2021)')\n", 1135 | "\n", 1136 | "# no axis\n", 1137 | "ax.axis('off')\n", 1138 | "\n", 1139 | "# add a basemap\n", 1140 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "markdown", 1145 | "metadata": { 1146 | "cell_id": "00035-452962c1-1d0a-4b6e-9f25-bf2bac2eb75e", 1147 | "deepnote_cell_type": "markdown", 1148 | "slideshow": { 1149 | "slide_type": "slide" 1150 | } 1151 | }, 1152 | "source": [ 1153 | "## Choropleth map of arrests\n", 1154 | "\n", 1155 | "Finally, we are ready to generate a choropleth map of arrests. 
" 1156 | ] 1157 | }, 1158 | { 1159 | "cell_type": "markdown", 1160 | "metadata": { 1161 | "slideshow": { 1162 | "slide_type": "slide" 1163 | } 1164 | }, 1165 | "source": [ 1166 | "- [geopandas choropleths](https://geopandas.org/docs/user_guide/mapping.html#choropleth-maps)" 1167 | ] 1168 | }, 1169 | { 1170 | "cell_type": "code", 1171 | "execution_count": null, 1172 | "metadata": { 1173 | "cell_id": "00040-751a6cac-f335-4b67-b470-3e240904a0cf", 1174 | "deepnote_cell_type": "code", 1175 | "execution_millis": 1290, 1176 | "execution_start": 1605806440098, 1177 | "output_cleared": false, 1178 | "scrolled": false, 1179 | "slideshow": { 1180 | "slide_type": "-" 1181 | }, 1182 | "source_hash": "cf9722a3" 1183 | }, 1184 | "outputs": [], 1185 | "source": [ 1186 | "fig,ax = plt.subplots(figsize=(15,15))\n", 1187 | "\n", 1188 | "gdf.plot(ax=ax,\n", 1189 | " column='arrests_per_1000', # this makes it a choropleth\n", 1190 | " legend=True,\n", 1191 | " alpha=0.8,\n", 1192 | " cmap='RdYlGn_r', # a diverging color scheme\n", 1193 | " scheme='quantiles') # how to break the data into bins\n", 1194 | "\n", 1195 | "ax.axis('off')\n", 1196 | "ax.set_title('2020 July to January 2021 arrests per 1000 people',fontsize=22)\n", 1197 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 1198 | ] 1199 | }, 1200 | { 1201 | "cell_type": "markdown", 1202 | "metadata": { 1203 | "cell_id": "00041-1c2e4521-6cae-4ca5-9731-1b9c505d6077", 1204 | "deepnote_cell_type": "markdown", 1205 | "slideshow": { 1206 | "slide_type": "slide" 1207 | } 1208 | }, 1209 | "source": [ 1210 | "
\n", 1211 | "The map above is a good way to begin exploring spatial patterns in our data. What does this map tell you? Is it informative? Do you notice any significant clusters? What if you change the map? Notice that the `scheme` argument is set to `quantiles`. Experiment with other map classifications, such as `equalinterval` or `naturalbreaks`. How does each classification change the map? \n", 1212 | "
" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": { 1218 | "cell_id": "00042-b24787d4-0ab3-44d5-ac9e-91f78dd286e3", 1219 | "deepnote_cell_type": "markdown", 1220 | "slideshow": { 1221 | "slide_type": "slide" 1222 | } 1223 | }, 1224 | "source": [ 1225 | "# Global Spatial Autocorrelation\n", 1226 | "\n", 1227 | "\n", 1228 | "\n", 1229 | "We have imported two datasets. Cleaned them up, spatialized them, and connected them spatially. We successfully mapped them to show the location of arrests per 1000 people by census block groups. The resulting map intuitively and visually tells us that there does appear to be spatial clusters of where arrests are more prevalent, but to what degree of certainty can we say so? Actually, very little, without statitistically backing up our determinations. Could this exact pattern be a matter of chance? Or is the pattern so distinct that there is no way it could have happened randomly?\n", 1230 | "\n", 1231 | "In order to answer this question, we conduct spatial autocorrelation, a process that determines to what degree an existing pattern is or is not random." 1232 | ] 1233 | }, 1234 | { 1235 | "cell_type": "markdown", 1236 | "metadata": { 1237 | "cell_id": "00042-b24787d4-0ab3-44d5-ac9e-91f78dd286e3", 1238 | "deepnote_cell_type": "markdown", 1239 | "slideshow": { 1240 | "slide_type": "slide" 1241 | } 1242 | }, 1243 | "source": [ 1244 | "Global Moran's I statistic is a way to *quantify* the degree to which similar geographies are clustered. To do so, we compare each geography based on a given value (in this case arrest counts) with that of its neighbors. 
The first step of this process is to define a \"spatial weight.\" " 1245 | ] 1246 | }, 1247 | { 1248 | "cell_type": "markdown", 1249 | "metadata": { 1250 | "cell_id": "00043-8ff210ab-9ca8-4ba3-a467-3b94e460e0d1", 1251 | "deepnote_cell_type": "markdown", 1252 | "slideshow": { 1253 | "slide_type": "slide" 1254 | } 1255 | }, 1256 | "source": [ 1257 | "### Spatial Weights\n", 1258 | "\n", 1259 | "\n", 1260 | "\n", 1261 | "Image source: [ESRI](https://desktop.arcgis.com/en/arcmap/10.3/tools/spatial-statistics-toolbox/generate-spatial-weights-matrix.htm)\n", 1262 | "\n", 1263 | "Spatial weights are how we determine each area’s neighborhood. There are different methods for determining spatial weights. Some use a contiguity model, assigning neighbors based on boundaries that touch each other. Others are based on distance, finding the closest neighbors based on the centroid of each geography. So which method should we use? " 1264 | ] 1265 | }, 1266 | { 1267 | "cell_type": "markdown", 1268 | "metadata": { 1269 | "cell_id": "00043-8ff210ab-9ca8-4ba3-a467-3b94e460e0d1", 1270 | "deepnote_cell_type": "markdown", 1271 | "slideshow": { 1272 | "slide_type": "slide" 1273 | } 1274 | }, 1275 | "source": [ 1276 | "For this lab, we will use the KNN weight, where `k` is the number of \"nearest neighbors\" to count in the calculations. Let's proceed with `k=8` for our KNN spatial weights. \n", 1277 | "\n", 1278 | "We also **row standardize** the weights. Row standardization is a technique for adjusting the weights in a spatial weights matrix. When weights are row standardized, each weight is divided by its row sum. 
The row sum is the sum of weights for a feature's neighbors.\n", 1279 | "\n", 1280 | "- https://geographicdata.science/book/notebooks/04_spatial_weights.html#distance-based-weights" 1281 | ] 1282 | }, 1283 | { 1284 | "cell_type": "code", 1285 | "execution_count": null, 1286 | "metadata": { 1287 | "cell_id": "00044-c20d9e6b-953d-44f8-8f15-51052303c063", 1288 | "deepnote_cell_type": "code", 1289 | "execution_millis": 111, 1290 | "execution_start": 1605806657728, 1291 | "output_cleared": false, 1292 | "slideshow": { 1293 | "slide_type": "fragment" 1294 | }, 1295 | "source_hash": "3ee158c7" 1296 | }, 1297 | "outputs": [], 1298 | "source": [ 1299 | "# calculate KNN spatial weights (8 nearest neighbors)\n", 1300 | "wq = lps.weights.KNN.from_dataframe(gdf,k=8)\n", 1301 | "\n", 1302 | "# Row-standardization\n", 1303 | "wq.transform = 'r'" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "markdown", 1308 | "metadata": { 1309 | "cell_id": "00045-109ded14-a0b2-4c8c-a0a0-8454c76c230d", 1310 | "deepnote_cell_type": "markdown", 1311 | "slideshow": { 1312 | "slide_type": "slide" 1313 | } 1314 | }, 1315 | "source": [ 1316 | "### Spatial lag\n", 1317 | "\n", 1318 | "Now that we have our spatial weights assigned, we use them to calculate the spatial lag. While the mathematical operations are beyond the scope of this lab, you are welcome to check them out [here](https://geographicdata.science/book/notebooks/06_spatial_autocorrelation.html#spatial-lag). Simply put, the spatial lag is a value assigned to each geography in your data that takes into account the data values of the others in its \"neighborhood\" as defined by the spatial weights. This operation can be done with a single line of code which is part of the pysal module, but the underlying calculations are not that difficult to understand: \n", 1319 | "\n", 1320 | "With row-standardized weights, it takes the average of the values of all the neighbors, as defined by the spatial weights, to come up with a single associated value."
1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": null, 1326 | "metadata": { 1327 | "cell_id": "00046-020e64b9-d823-45ea-bdfd-6950215f410d", 1328 | "deepnote_cell_type": "code", 1329 | "execution_millis": 4, 1330 | "execution_start": 1605806659021, 1331 | "output_cleared": false, 1332 | "slideshow": { 1333 | "slide_type": "slide" 1334 | }, 1335 | "source_hash": "dd2f79d2" 1336 | }, 1337 | "outputs": [], 1338 | "source": [ 1339 | "# create a new column for the spatial lag\n", 1340 | "gdf['arrests_per_1000_lag'] = lps.weights.lag_spatial(wq, gdf['arrests_per_1000'])" 1341 | ] 1342 | }, 1343 | { 1344 | "cell_type": "code", 1345 | "execution_count": null, 1346 | "metadata": { 1347 | "cell_id": "00047-80344eb5-7c68-4523-80fb-88e94e5ce690", 1348 | "deepnote_cell_type": "code", 1349 | "execution_millis": 0, 1350 | "execution_start": 1605806663078, 1351 | "output_cleared": false, 1352 | "slideshow": { 1353 | "slide_type": "fragment" 1354 | }, 1355 | "source_hash": "7aafb24b" 1356 | }, 1357 | "outputs": [], 1358 | "source": [ 1359 | "# sample gives us 10 random rows\n", 1360 | "gdf.sample(10)[['TotalPop','arrests_count','arrests_per_1000','arrests_per_1000_lag']]" 1361 | ] 1362 | }, 1363 | { 1364 | "cell_type": "markdown", 1365 | "metadata": { 1366 | "slideshow": { 1367 | "slide_type": "fragment" 1368 | } 1369 | }, 1370 | "source": [ 1371 | "
\n", 1372 | "Take a moment to look at the values in the dataframe. What do they tell you?\n", 1373 | "
" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "markdown", 1378 | "metadata": { 1379 | "slideshow": { 1380 | "slide_type": "slide" 1381 | } 1382 | }, 1383 | "source": [ 1384 | "### The donut and the diamond" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "execution_count": null, 1390 | "metadata": {}, 1391 | "outputs": [], 1392 | "source": [ 1393 | "# create a column that calculates the difference betwen arrests and lag\n", 1394 | "gdf['arrest_lag_diff'] = gdf['arrests_per_1000'] - gdf['arrests_per_1000_lag']" 1395 | ] 1396 | }, 1397 | { 1398 | "cell_type": "code", 1399 | "execution_count": null, 1400 | "metadata": {}, 1401 | "outputs": [], 1402 | "source": [ 1403 | "# output to get the head and tail\n", 1404 | "gdf.sort_values(by='arrest_lag_diff')" 1405 | ] 1406 | }, 1407 | { 1408 | "cell_type": "markdown", 1409 | "metadata": { 1410 | "cell_id": "00048-88965640-7413-4991-a51d-6abf7e886e78", 1411 | "deepnote_cell_type": "markdown", 1412 | "slideshow": { 1413 | "slide_type": "slide" 1414 | } 1415 | }, 1416 | "source": [ 1417 | "In order to better understand the significance of the spatial lag values, consider the following two geographies:" 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "code", 1422 | "execution_count": null, 1423 | "metadata": { 1424 | "slideshow": { 1425 | "slide_type": "fragment" 1426 | } 1427 | }, 1428 | "outputs": [], 1429 | "source": [ 1430 | "# the FIPS with highest negative difference\n", 1431 | "gdf_donut = gdf.sort_values(by='arrest_lag_diff').head(1)\n", 1432 | "gdf_donut" 1433 | ] 1434 | }, 1435 | { 1436 | "cell_type": "code", 1437 | "execution_count": null, 1438 | "metadata": { 1439 | "slideshow": { 1440 | "slide_type": "fragment" 1441 | } 1442 | }, 1443 | "outputs": [], 1444 | "source": [ 1445 | "# the FIPS with highest positive difference\n", 1446 | "gdf_diamond = gdf.sort_values(by='arrest_lag_diff').tail(1)\n", 1447 | "gdf_diamond" 1448 | ] 1449 | }, 1450 | { 1451 | "cell_type": "markdown", 1452 | "metadata": { 
1453 | "slideshow": { 1454 | "slide_type": "slide" 1455 | } 1456 | }, 1457 | "source": [ 1458 | "To better illustrate our assumptions, let's display these locations using satellite imagery. To do so, we use Plotly Express and its Mapbox Satellite basemap (which provides high-quality, up-to-date imagery). Note that the use of Mapbox satellite image tiles requires a user token. \n", 1459 | "\n", 1460 | "- Grab your token from the [Mapbox account page](https://account.mapbox.com/) (after you have created an account)" 1461 | ] 1462 | }, 1463 | { 1464 | "cell_type": "code", 1465 | "execution_count": null, 1466 | "metadata": { 1467 | "slideshow": { 1468 | "slide_type": "slide" 1469 | } 1470 | }, 1471 | "outputs": [], 1472 | "source": [ 1473 | "# set the mapbox access token\n", 1474 | "token = 'your_token'\n", 1475 | "px.set_mapbox_access_token(token)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "code", 1480 | "execution_count": null, 1481 | "metadata": { 1482 | "slideshow": { 1483 | "slide_type": "fragment" 1484 | } 1485 | }, 1486 | "outputs": [], 1487 | "source": [ 1488 | "# project the donut to WGS84 and find its center\n", 1489 | "gdf_donut = gdf_donut.to_crs('epsg:4326')\n", 1490 | "\n", 1491 | "# center of the bounding box (an approximation of the centroid)\n", 1492 | "minx, miny, maxx, maxy = gdf_donut.geometry.total_bounds\n", 1493 | "center_lat_donut = (maxy-miny)/2+miny\n", 1494 | "center_lon_donut = (maxx-minx)/2+minx" 1495 | ] 1496 | }, 1497 | { 1498 | "cell_type": "code", 1499 | "execution_count": null, 1500 | "metadata": { 1501 | "slideshow": { 1502 | "slide_type": "slide" 1503 | } 1504 | }, 1505 | "outputs": [], 1506 | "source": [ 1507 | "# project the diamond to WGS84 and find its center\n", 1508 | "gdf_diamond = gdf_diamond.to_crs('epsg:4326')\n", 1509 | "\n", 1510 | "# center of the bounding box (an approximation of the centroid)\n", 1511 | "minx, miny, maxx, maxy = gdf_diamond.geometry.total_bounds\n", 1512 | "center_lat_diamond = (maxy-miny)/2+miny\n", 1513 | "center_lon_diamond = (maxx-minx)/2+minx" 1514 | ] 1515 
| }, 1516 | { 1517 | "cell_type": "markdown", 1518 | "metadata": { 1519 | "slideshow": { 1520 | "slide_type": "slide" 1521 | } 1522 | }, 1523 | "source": [ 1524 | "- [plotly choropleth maps](https://plotly.com/python/mapbox-county-choropleth/#)" 1525 | ] 1526 | }, 1527 | { 1528 | "cell_type": "code", 1529 | "execution_count": null, 1530 | "metadata": { 1531 | "scrolled": false, 1532 | "slideshow": { 1533 | "slide_type": "fragment" 1534 | } 1535 | }, 1536 | "outputs": [], 1537 | "source": [ 1538 | "px.choropleth_mapbox(gdf_donut, \n", 1539 | " geojson=gdf_donut.geometry, \n", 1540 | " locations=gdf_donut.index, \n", 1541 | " mapbox_style=\"satellite-streets\",\n", 1542 | " zoom=14, \n", 1543 | " center = {\"lat\": center_lat_donut, \"lon\": center_lon_donut},\n", 1544 | " hover_data=['arrests_count','arrests_per_1000','arrests_per_1000_lag'],\n", 1545 | " opacity=0.4,\n", 1546 | " title='The Donut')" 1547 | ] 1548 | }, 1549 | { 1550 | "cell_type": "code", 1551 | "execution_count": null, 1552 | "metadata": { 1553 | "scrolled": false, 1554 | "slideshow": { 1555 | "slide_type": "slide" 1556 | } 1557 | }, 1558 | "outputs": [], 1559 | "source": [ 1560 | "px.choropleth_mapbox(gdf_diamond, \n", 1561 | " geojson=gdf_diamond.geometry, \n", 1562 | " locations=gdf_diamond.index, \n", 1563 | " mapbox_style=\"satellite-streets\",\n", 1564 | " zoom=12, \n", 1565 | " center = {\"lat\": center_lat_diamond, \"lon\": center_lon_diamond},\n", 1566 | " hover_data=['arrests_count','arrests_per_1000','arrests_per_1000_lag'],\n", 1567 | " opacity=0.4,\n", 1568 | " title='The Diamond')" 1569 | ] 1570 | }, 1571 | { 1572 | "cell_type": "markdown", 1573 | "metadata": { 1574 | "slideshow": { 1575 | "slide_type": "slide" 1576 | } 1577 | }, 1578 | "source": [ 1579 | "What's the story here?\n", 1580 | "\n", 1581 | "Our donut is located in Venice Beach, in the famous enclave with homes adjacent to the grand canal known as part of the Venice Canal Historic District. 
It has a low arrest rate, at just over 9 per 1000. Its spatial lag? 76! \n", 1582 | "\n", 1583 | "Our diamond, coincidentally, is the also famous Venice Beach Boardwalk. It is located in close proximity to our donut, with an arrest rate of over 400 per 1000. Its spatial lag? Ninety, which is still high!\n", 1584 | "\n", 1585 | "What these numbers tell us is that the Venice Canal is like a donut: a pristine historical housing community surrounded by neighbors with high arrest rates. On the flip side, the Venice Boardwalk is a diamond: a very high-valued geography surrounded by lower-valued neighbors. It indicates that the Boardwalk itself is a magnet location for arrests, but its immediate neighbors have comparatively lower arrest rates." 1586 | ] 1587 | }, 1588 | { 1589 | "cell_type": "markdown", 1590 | "metadata": { 1591 | "slideshow": { 1592 | "slide_type": "slide" 1593 | } 1594 | }, 1595 | "source": [ 1596 | "## Spatial lag map\n", 1597 | "But we digress. Let's map the entire dataframe by the newly created spatial lag column." 
1598 | ] 1599 | }, 1600 | { 1601 | "cell_type": "code", 1602 | "execution_count": null, 1603 | "metadata": { 1604 | "cell_id": "00051-c11fcc2b-df8f-420c-9365-3812f835d6ce", 1605 | "deepnote_cell_type": "code", 1606 | "execution_millis": 1107, 1607 | "execution_start": 1605807561410, 1608 | "output_cleared": false, 1609 | "scrolled": false, 1610 | "slideshow": { 1611 | "slide_type": "slide" 1612 | }, 1613 | "source_hash": "51b90aba" 1614 | }, 1615 | "outputs": [], 1616 | "source": [ 1617 | "# use subplots that make it easier to create multiple layered maps\n", 1618 | "fig, ax = plt.subplots(1,1,figsize=(15, 15))\n", 1619 | "\n", 1620 | "# spatial lag choropleth\n", 1621 | "gdf.plot(ax=ax,\n", 1622 | " figsize=(15,15),\n", 1623 | " column='arrests_per_1000_lag',\n", 1624 | " legend=True,\n", 1625 | " alpha=0.8,\n", 1626 | " cmap='RdYlGn_r',\n", 1627 | " scheme='quantiles')\n", 1628 | "\n", 1629 | "ax.axis('off')\n", 1630 | "ax.set_title('July 2020-January 2021 Arrests per 1000 people',fontsize=22)\n", 1631 | "\n", 1632 | "ctx.add_basemap(ax,source=ctx.providers.CartoDB.Positron)" 1633 | ] 1634 | }, 1635 | { 1636 | "cell_type": "markdown", 1637 | "metadata": { 1638 | "cell_id": "00052-7673144c-5be9-47cd-9e0c-46d5b9fe324b", 1639 | "deepnote_cell_type": "markdown", 1640 | "slideshow": { 1641 | "slide_type": "slide" 1642 | } 1643 | }, 1644 | "source": [ 1645 | "## Side-by-side maps\n", 1646 | "We can now compare these two map outputs side by side. Notice that the syntax is a bit different from past labs where we have only worked with one figure at a time. 
This output produces 1 row, and 2 columns of figures in `subplots`.\n", 1647 | "- [subplots documentation](https://matplotlib.org/3.3.0/gallery/subplots_axes_and_figures/subplots_demo.html)" 1648 | ] 1649 | }, 1650 | { 1651 | "cell_type": "code", 1652 | "execution_count": null, 1653 | "metadata": { 1654 | "cell_id": "00053-3d2f883f-e36b-42a8-b0e4-8d46ac893d67", 1655 | "deepnote_cell_type": "code", 1656 | "execution_millis": 1427, 1657 | "execution_start": 1605807604042, 1658 | "output_cleared": false, 1659 | "slideshow": { 1660 | "slide_type": "slide" 1661 | }, 1662 | "source_hash": "26dd2021" 1663 | }, 1664 | "outputs": [], 1665 | "source": [ 1666 | "# create the 1x2 subplots\n", 1667 | "fig, ax = plt.subplots(1, 2, figsize=(15, 8))\n", 1668 | "\n", 1669 | "# two subplots produces ax[0] (left) and ax[1] (right)\n", 1670 | "\n", 1671 | "# regular count map on the left\n", 1672 | "gdf.plot(ax=ax[0], # this assigns the map to the left subplot\n", 1673 | " column='arrests_per_1000', \n", 1674 | " cmap='RdYlGn_r', \n", 1675 | " scheme='quantiles',\n", 1676 | " k=5, \n", 1677 | " edgecolor='white', \n", 1678 | " linewidth=0, \n", 1679 | " alpha=0.75, \n", 1680 | " )\n", 1681 | "\n", 1682 | "\n", 1683 | "ax[0].axis(\"off\")\n", 1684 | "ax[0].set_title(\"Arrests per 1000\")\n", 1685 | "\n", 1686 | "# spatial lag map on the right\n", 1687 | "gdf.plot(ax=ax[1], # this assigns the map to the right subplot\n", 1688 | " column='arrests_per_1000_lag', \n", 1689 | " cmap='RdYlGn_r', \n", 1690 | " scheme='quantiles',\n", 1691 | " k=5, \n", 1692 | " edgecolor='white', \n", 1693 | " linewidth=0, \n", 1694 | " alpha=0.75\n", 1695 | " )\n", 1696 | "\n", 1697 | "ax[1].axis(\"off\")\n", 1698 | "ax[1].set_title(\"Arrests per 1000 - Spatial Lag\")\n", 1699 | "\n", 1700 | "plt.show()" 1701 | ] 1702 | }, 1703 | { 1704 | "cell_type": "markdown", 1705 | "metadata": { 1706 | "slideshow": { 1707 | "slide_type": "slide" 1708 | } 1709 | }, 1710 | "source": [ 1711 | "## Interactive spatial lag 
satellite map\n", 1712 | "Building the equivalent map as an interactive javascript map is a bit more challenging. While there are several options to choose from, this lab will use plotly express's choropleth_mapbox feature.\n", 1713 | "- https://plotly.com/python/mapbox-county-choropleth/#" 1714 | ] 1715 | }, 1716 | { 1717 | "cell_type": "code", 1718 | "execution_count": null, 1719 | "metadata": { 1720 | "slideshow": { 1721 | "slide_type": "fragment" 1722 | } 1723 | }, 1724 | "outputs": [], 1725 | "source": [ 1726 | "# interactive version needs to be in WGS84\n", 1727 | "gdf_web = gdf.to_crs('EPSG:4326')" 1728 | ] 1729 | }, 1730 | { 1731 | "cell_type": "code", 1732 | "execution_count": null, 1733 | "metadata": { 1734 | "slideshow": { 1735 | "slide_type": "fragment" 1736 | } 1737 | }, 1738 | "outputs": [], 1739 | "source": [ 1740 | "# what's the centroid?\n", 1741 | "minx, miny, maxx, maxy = gdf_web.geometry.total_bounds\n", 1742 | "center_lat_gdf_web = (maxy-miny)/2+miny\n", 1743 | "center_lon_gdf_web = (maxx-minx)/2+minx" 1744 | ] 1745 | }, 1746 | { 1747 | "cell_type": "markdown", 1748 | "metadata": { 1749 | "slideshow": { 1750 | "slide_type": "slide" 1751 | } 1752 | }, 1753 | "source": [ 1754 | "Unlike the matplotlib map, plotly's mapbox map only gives us a continuous scale option (there is no magical `scheme` option). 
To produce a similar quantile map, we need to calculate the values manually.\n", 1755 | "\n", 1756 | "As we want to produce a choropleth map based on our spatial lag column, let's get some simple stats:" 1757 | ] 1758 | }, 1759 | { 1760 | "cell_type": "code", 1761 | "execution_count": null, 1762 | "metadata": { 1763 | "slideshow": { 1764 | "slide_type": "fragment" 1765 | } 1766 | }, 1767 | "outputs": [], 1768 | "source": [ 1769 | "# some stats\n", 1770 | "gdf_web.arrests_per_1000_lag.describe()" 1771 | ] 1772 | }, 1773 | { 1774 | "cell_type": "code", 1775 | "execution_count": null, 1776 | "metadata": { 1777 | "slideshow": { 1778 | "slide_type": "fragment" 1779 | } 1780 | }, 1781 | "outputs": [], 1782 | "source": [ 1783 | "# grab the median\n", 1784 | "median = gdf_web.arrests_per_1000_lag.median()" 1785 | ] 1786 | }, 1787 | { 1788 | "cell_type": "code", 1789 | "execution_count": null, 1790 | "metadata": { 1791 | "scrolled": false, 1792 | "slideshow": { 1793 | "slide_type": "slide" 1794 | } 1795 | }, 1796 | "outputs": [], 1797 | "source": [ 1798 | "fig = px.choropleth_mapbox(gdf_web, \n", 1799 | " geojson=gdf_web.geometry, # the geometry column\n", 1800 | " locations=gdf_web.index, # the index\n", 1801 | " mapbox_style=\"satellite-streets\",\n", 1802 | " zoom=9, \n", 1803 | " color='arrests_per_1000_lag',\n", 1804 | " color_continuous_scale='RdYlGn_r',\n", 1805 | " color_continuous_midpoint =median, # put the median as the midpoint\n", 1806 | " range_color =(0,median*2),\n", 1807 | " hover_data=['arrests_count','arrests_per_1000','arrests_per_1000_lag'],\n", 1808 | " center = {\"lat\": center_lat_gdf_web, \"lon\": center_lon_gdf_web},\n", 1809 | " opacity=0.8,\n", 1810 | " width=1000,\n", 1811 | " height=800,\n", 1812 | " labels={\n", 1813 | " 'arrests_per_1000_lag':'Arrests per 1000 (Spatial Lag)',\n", 1814 | " 'arrests_per_1000':'Arrests per 1000',\n", 1815 | " })\n", 1816 | "fig.update_traces(marker_line_width=0.1, marker_line_color='white')\n", 1817 | 
"fig.update_layout(margin={\"r\":0,\"t\":0,\"l\":0,\"b\":0})" 1818 | ] 1819 | }, 1820 | { 1821 | "cell_type": "markdown", 1822 | "metadata": { 1823 | "cell_id": "00054-cde72154-74e9-492d-9384-f438d860221d", 1824 | "deepnote_cell_type": "markdown", 1825 | "slideshow": { 1826 | "slide_type": "slide" 1827 | } 1828 | }, 1829 | "source": [ 1830 | "## Moran's Plot\n", 1831 | "\n", 1832 | "We now have a spatial lag map: a map that displays geographies weighted against the values of its neighbors. The clusters are much clearer and cleaner than the original arrest count map. Downtown, Venice, South LA, Van Nuys... But we still have not *quantified* the degree of the spatial correlations. To begin this process, we test for global autocorrelation for a continuous attribute (arrest counts)." 1833 | ] 1834 | }, 1835 | { 1836 | "cell_type": "code", 1837 | "execution_count": null, 1838 | "metadata": { 1839 | "slideshow": { 1840 | "slide_type": "fragment" 1841 | } 1842 | }, 1843 | "outputs": [], 1844 | "source": [ 1845 | "y = gdf.arrests_per_1000\n", 1846 | "moran = Moran(y, wq)\n", 1847 | "moran.I" 1848 | ] 1849 | }, 1850 | { 1851 | "cell_type": "markdown", 1852 | "metadata": { 1853 | "slideshow": { 1854 | "slide_type": "slide" 1855 | } 1856 | }, 1857 | "source": [ 1858 | "The moran's I value is nothing more than the calculated slope of the scatterplot of our \"arrests per 1000\" and \"arrests per 1000 spatial lag\" columns. It does indicate whether or not you have a positive or negative autocorrelation. Values will range from positive one, to negative one. 
\n", 1859 | "\n", 1860 | "- **Positive** spatial autocorrelation: high values are close to high values, and/or low values are close to low values\n", 1861 | "- **Negative** spatial autocorrelation (less common): similar values are far from each other; high values are next to low values, low values are next to high values" 1862 | ] 1863 | }, 1864 | { 1865 | "cell_type": "markdown", 1866 | "metadata": { 1867 | "slideshow": { 1868 | "slide_type": "slide" 1869 | } 1870 | }, 1871 | "source": [ 1872 | "You can output a scatterplot:" 1873 | ] 1874 | }, 1875 | { 1876 | "cell_type": "code", 1877 | "execution_count": null, 1878 | "metadata": { 1879 | "slideshow": { 1880 | "slide_type": "fragment" 1881 | } 1882 | }, 1883 | "outputs": [], 1884 | "source": [ 1885 | "fig, ax = moran_scatterplot(moran, aspect_equal=True)\n", 1886 | "plt.show()" 1887 | ] 1888 | }, 1889 | { 1890 | "cell_type": "markdown", 1891 | "metadata": { 1892 | "slideshow": { 1893 | "slide_type": "slide" 1894 | } 1895 | }, 1896 | "source": [ 1897 | "So what is the significance of our Moran's I value of 0.3? In other words, **how likely is it that the pattern we observe on the map was generated by an entirely random process?** To find out, we compare our value with a simulation of 999 permutations, each of which randomly shuffles the arrest rates across the given geographies. The output is a sampling distribution of Moran’s I values under the (null) hypothesis that attribute values are randomly distributed across the study area. 
We then compare our observed Moran’s I value to this \"Reference Distribution.\"" 1898 | ] 1899 | }, 1900 | { 1901 | "cell_type": "code", 1902 | "execution_count": null, 1903 | "metadata": { 1904 | "cell_id": "00055-489abd0d-4dfc-4a43-ab07-631157aaea0d", 1905 | "deepnote_cell_type": "code", 1906 | "execution_millis": 47, 1907 | "execution_start": 1605807625694, 1908 | "output_cleared": false, 1909 | "slideshow": { 1910 | "slide_type": "slide" 1911 | }, 1912 | "source_hash": "707e03ac" 1913 | }, 1914 | "outputs": [], 1915 | "source": [ 1916 | "plot_moran_simulation(moran, aspect_equal=False)" 1917 | ] 1918 | }, 1919 | { 1920 | "cell_type": "markdown", 1921 | "metadata": { 1922 | "cell_id": "00056-8d12869d-48fa-468e-9cb5-1babfcbf3bf9", 1923 | "deepnote_cell_type": "markdown", 1924 | "slideshow": { 1925 | "slide_type": "slide" 1926 | } 1927 | }, 1928 | "source": [ 1929 | "We can compute the p-value:\n", 1930 | "\n", 1931 | "" 1932 | ] 1933 | }, 1934 | { 1935 | "cell_type": "code", 1936 | "execution_count": null, 1937 | "metadata": { 1938 | "cell_id": "00057-90a65e59-c521-42fa-bc23-4141b9369225", 1939 | "deepnote_cell_type": "code", 1940 | "output_cleared": true, 1941 | "slideshow": { 1942 | "slide_type": "fragment" 1943 | } 1944 | }, 1945 | "outputs": [], 1946 | "source": [ 1947 | "moran.p_sim" 1948 | ] 1949 | }, 1950 | { 1951 | "cell_type": "markdown", 1952 | "metadata": { 1953 | "cell_id": "00058-368e0922-3062-4b2a-9155-27c2332200fd", 1954 | "deepnote_cell_type": "markdown", 1955 | "slideshow": { 1956 | "slide_type": "slide" 1957 | } 1958 | }, 1959 | "source": [ 1960 | "The value is calculated as an empirical p-value that represents the proportion of realisations in the simulation under spatial randomness that are more extreme than the observed value. A small enough p-value associated with the Moran’s I of a map allows us to reject the hypothesis that the map is random. 
In other words, we can conclude that the map displays more spatial pattern than we would expect if the values had been randomly allocated to locations.\n", 1961 | "\n", 1962 | "A p_sim of 0.001 is a very low value: it is in fact the minimum we could have obtained, given that the simulation behind it used 999 permutations (the default in PySAL), and by conventional standards it is statistically significant. We can elaborate a bit further on the intuition behind the value of p_sim. If we generated a large number of maps with the same values but randomly allocated over space, and calculated the Moran’s I statistic for each of those maps, only 0.1% of them would display a larger (absolute) value than the one we obtain from the observed data, and the other 99.9% of the random maps would receive a smaller (absolute) value of Moran’s I. " 1963 | ] 1964 | }, 1965 | { 1966 | "cell_type": "markdown", 1967 | "metadata": { 1968 | "slideshow": { 1969 | "slide_type": "slide" 1970 | } 1971 | }, 1972 | "source": [ 1973 | "# Local Spatial Autocorrelation\n", 1974 | "So far, we have only determined that arrest rates show positive spatial autocorrelation across the study area. But we have not detected where the clusters are. Local Indicators of Spatial Association (LISA) are used to do that. 
LISA classifies areas into four groups: high values near high values (HH), low values near low values (LL), low values with high values in their neighborhood (LH), and high values with low values in their neighborhood (HL).\n", 1975 | "\n", 1976 | "- HH: high arrest rate geographies near other high arrest rate neighbors\n", 1977 | "- LL: low arrest rate geographies near other low arrest rate neighbors\n", 1978 | "- LH (donuts): low arrest rate geographies surrounded by high arrest rate neighbors\n", 1979 | "- HL (diamonds): high arrest rate geographies surrounded by low arrest rate neighbors" 1980 | ] 1981 | }, 1982 | { 1983 | "cell_type": "markdown", 1984 | "metadata": { 1985 | "slideshow": { 1986 | "slide_type": "slide" 1987 | } 1988 | }, 1989 | "source": [ 1990 | "## Moran Local Scatterplot" 1991 | ] 1992 | }, 1993 | { 1994 | "cell_type": "code", 1995 | "execution_count": null, 1996 | "metadata": { 1997 | "cell_id": "00060-4a2f2b79-53f9-44b4-ae86-31cc8fdce2bd", 1998 | "deepnote_cell_type": "code", 1999 | "execution_millis": 3398, 2000 | "execution_start": 1605807811791, 2001 | "output_cleared": false, 2002 | "slideshow": { 2003 | "slide_type": "fragment" 2004 | }, 2005 | "source_hash": "bf84c31e" 2006 | }, 2007 | "outputs": [], 2008 | "source": [ 2009 | "# calculate local moran values\n", 2010 | "lisa = esda.moran.Moran_Local(y, wq)" 2011 | ] 2012 | }, 2013 | { 2014 | "cell_type": "code", 2015 | "execution_count": null, 2016 | "metadata": { 2017 | "cell_id": "00063-d9fc9212-bf39-4525-90db-ff4cbcfd7783", 2018 | "deepnote_cell_type": "code", 2019 | "execution_millis": 214, 2020 | "execution_start": 1605807831354, 2021 | "output_cleared": false, 2022 | "scrolled": true, 2023 | "slideshow": { 2024 | "slide_type": "fragment" 2025 | }, 2026 | "source_hash": "2b643" 2027 | }, 2028 | "outputs": [], 2029 | "source": [ 2030 | "# Plot\n", 2031 | "fig,ax = plt.subplots(figsize=(10,15))\n", 2032 | "\n", 2033 | "moran_scatterplot(lisa, ax=ax, p=0.05)\n", 2034 | "ax.set_xlabel(\"Arrests\")\n", 2035 | "ax.set_ylabel('Spatial 
Lag of Arrests')\n", 2036 | "\n", 2037 | "# add some labels\n", 2038 | "plt.text(1.95, 0.5, \"HH\", fontsize=25)\n", 2039 | "plt.text(1.95, -1, \"HL\", fontsize=25)\n", 2040 | "plt.text(-2, 1, \"LH\", fontsize=25)\n", 2041 | "plt.text(-1, -1, \"LL\", fontsize=25)\n", 2042 | "plt.show()" 2043 | ] 2044 | }, 2045 | { 2046 | "cell_type": "markdown", 2047 | "metadata": { 2048 | "slideshow": { 2049 | "slide_type": "slide" 2050 | } 2051 | }, 2052 | "source": [ 2053 | "In the scatterplot above, the colored dots in each quadrant represent the rows (census block groups) that have a p-value less than 0.05. In other words, these are the geographies with statistically significant spatial autocorrelation." 2054 | ] 2055 | }, 2056 | { 2057 | "cell_type": "markdown", 2058 | "metadata": { 2059 | "slideshow": { 2060 | "slide_type": "slide" 2061 | } 2062 | }, 2063 | "source": [ 2064 | "## Spatial Autocorrelation Map\n", 2065 | "Finally, you can visualize these statistically significant clusters using the `lisa_cluster` function:" 2066 | ] 2067 | }, 2068 | { 2069 | "cell_type": "code", 2070 | "execution_count": null, 2071 | "metadata": { 2072 | "cell_id": "00064-5038673d-0806-4e6c-a49e-9938454eedd4", 2073 | "deepnote_cell_type": "code", 2074 | "execution_millis": 819, 2075 | "execution_start": 1605807832811, 2076 | "output_cleared": false, 2077 | "scrolled": false, 2078 | "slideshow": { 2079 | "slide_type": "fragment" 2080 | }, 2081 | "source_hash": "4f4bbffa" 2082 | }, 2083 | "outputs": [], 2084 | "source": [ 2085 | "fig, ax = plt.subplots(figsize=(14,12))\n", 2086 | "lisa_cluster(lisa, gdf, p=0.05, ax=ax)\n", 2087 | "plt.show()" 2088 | ] 2089 | }, 2090 | { 2091 | "cell_type": "markdown", 2092 | "metadata": { 2093 | "slideshow": { 2094 | "slide_type": "slide" 2095 | } 2096 | }, 2097 | "source": [ 2098 | "You can also create maps comparing different p-value thresholds:" 2099 | ] 2100 | }, 2101 | { 2102 | "cell_type": "code", 2103 | "execution_count": null, 2104 | "metadata": { 2105 | "cell_id": 
"00053-3d2f883f-e36b-42a8-b0e4-8d46ac893d67", 2106 | "deepnote_cell_type": "code", 2107 | "execution_millis": 1427, 2108 | "execution_start": 1605807604042, 2109 | "output_cleared": false, 2110 | "slideshow": { 2111 | "slide_type": "fragment" 2112 | }, 2113 | "source_hash": "26dd2021" 2114 | }, 2115 | "outputs": [], 2116 | "source": [ 2117 | "# create the 1x2 subplots\n", 2118 | "fig, ax = plt.subplots(1, 2, figsize=(15, 8))\n", 2119 | "\n", 2120 | "# LISA cluster map at p < 0.05 on the left\n", 2121 | "lisa_cluster(lisa, gdf, p=0.05, ax=ax[0])\n", 2122 | "\n", 2123 | "ax[0].axis(\"off\")\n", 2124 | "ax[0].set_title(\"P-value: 0.05\")\n", 2125 | "\n", 2126 | "# LISA cluster map at p < 0.01 on the right\n", 2127 | "lisa_cluster(lisa, gdf, p=0.01, ax=ax[1])\n", 2128 | "ax[1].axis(\"off\")\n", 2129 | "ax[1].set_title(\"P-value: 0.01\")\n", 2130 | "\n", 2131 | "plt.show()" 2132 | ] 2133 | }, 2134 | { 2135 | "cell_type": "markdown", 2136 | "metadata": { 2137 | "cell_id": "00073-51e53eb0-0154-43e8-835d-ce181f9c7186", 2138 | "deepnote_cell_type": "markdown", 2139 | "slideshow": { 2140 | "slide_type": "slide" 2141 | } 2142 | }, 2143 | "source": [ 2144 | "# Resources\n", 2145 | "\n", 2146 | "- https://geographicdata.science/book/notebooks/06_spatial_autocorrelation.html\n", 2147 | "- https://pysal.org/esda/notebooks/spatialautocorrelation.html\n", 2148 | "- https://towardsdatascience.com/what-is-exploratory-spatial-data-analysis-esda-335da79026ee" 2149 | ] 2150 | } 2151 | ], 2152 | "metadata": { 2153 | "celltoolbar": "Slideshow", 2154 | "deepnote_execution_queue": [], 2155 | "deepnote_notebook_id": "e3ee145c-1ca6-4b9e-9637-f285c2264264", 2156 | "kernelspec": { 2157 | "display_name": "Python 3", 2158 | "language": "python", 2159 | "name": "python3" 2160 | }, 2161 | "language_info": { 2162 | "codemirror_mode": { 2163 | "name": "ipython", 2164 | "version": 3 2165 | }, 2166 | "file_extension": ".py", 2167 | "mimetype": "text/x-python", 2168 | "name": "python", 2169 | "nbconvert_exporter": "python", 
2170 | "pygments_lexer": "ipython3", 2171 | "version": "3.8.8" 2172 | }, 2173 | "toc": { 2174 | "base_numbering": 1, 2175 | "nav_menu": {}, 2176 | "number_sections": true, 2177 | "sideBar": true, 2178 | "skip_h1_title": false, 2179 | "title_cell": "Table of Contents", 2180 | "title_sidebar": "Contents", 2181 | "toc_cell": true, 2182 | "toc_position": { 2183 | "height": "calc(100% - 180px)", 2184 | "left": "10px", 2185 | "top": "150px", 2186 | "width": "349.047px" 2187 | }, 2188 | "toc_section_display": true, 2189 | "toc_window_display": false 2190 | } 2191 | }, 2192 | "nbformat": 4, 2193 | "nbformat_minor": 4 2194 | } 2195 | -------------------------------------------------------------------------------- /data/metadata.json: -------------------------------------------------------------------------------- 1 | { 2 | "release": { 3 | "id": "acs2018_5yr", 4 | "name": "ACS 2018 5-year", 5 | "years": "2014-2018" 6 | }, 7 | "tables": { 8 | "B01003": { 9 | "columns": { 10 | "B01003001": { 11 | "indent": 0, 12 | "name": "Total" 13 | } 14 | }, 15 | "denominator_column_id": null, 16 | "title": "Total Population", 17 | "universe": "Total Population" 18 | } 19 | } 20 | } -------------------------------------------------------------------------------- /images/dave.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yohman/workshop-python-spatial-stats/af6bb94f241385f43c483fa406c8d206ee795f06/images/dave.jpg -------------------------------------------------------------------------------- /images/esda.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yohman/workshop-python-spatial-stats/af6bb94f241385f43c483fa406c8d206ee795f06/images/esda.png -------------------------------------------------------------------------------- /images/method.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/yohman/workshop-python-spatial-stats/af6bb94f241385f43c483fa406c8d206ee795f06/images/method.png -------------------------------------------------------------------------------- /images/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /images/sa-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yohman/workshop-python-spatial-stats/af6bb94f241385f43c483fa406c8d206ee795f06/images/sa-1.png -------------------------------------------------------------------------------- /images/sa.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yohman/workshop-python-spatial-stats/af6bb94f241385f43c483fa406c8d206ee795f06/images/sa.png -------------------------------------------------------------------------------- /images/splot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yohman/workshop-python-spatial-stats/af6bb94f241385f43c483fa406c8d206ee795f06/images/splot.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | sodapy 3 | geopandas 4 | contextily 5 | esda 6 | splot 7 | libpysal 8 | matplotlib 9 | plotly 10 | --------------------------------------------------------------------------------