├── .gitignore ├── Dockerfile ├── LICENSE ├── README.md ├── bin └── compile-requirements.bash ├── compose.yml ├── notebooks ├── 01_completed.ipynb ├── 01_reading-cogs-the-hard-way.ipynb ├── 02_completed.ipynb ├── 02_reading-zarr-the-hard-way.ipynb ├── 03_completed.ipynb └── 03_free-range-artisanal-grass-fed-kerchunk.ipynb ├── notes └── 01_cog ├── requirements-dev.in ├── requirements-dev.txt ├── requirements.in └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Jupyter 2 | .local/ 3 | .jupyter/ 4 | .bash_history 5 | .npm/ 6 | .yarn/ 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | build/ 19 | develop-eggs/ 20 | dist/ 21 | downloads/ 22 | eggs/ 23 | .eggs/ 24 | lib/ 25 | lib64/ 26 | parts/ 27 | sdist/ 28 | var/ 29 | wheels/ 30 | share/python-wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | MANIFEST 35 | 36 | # PyInstaller 37 | # Usually these files are written by a python script from a template 38 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 39 | *.manifest 40 | *.spec 41 | 42 | # Installer logs 43 | pip-log.txt 44 | pip-delete-this-directory.txt 45 | 46 | # Unit test / coverage reports 47 | htmlcov/ 48 | .tox/ 49 | .nox/ 50 | .coverage 51 | .coverage.* 52 | .cache 53 | nosetests.xml 54 | coverage.xml 55 | *.cover 56 | *.py,cover 57 | .hypothesis/ 58 | .pytest_cache/ 59 | cover/ 60 | 61 | # Translations 62 | *.mo 63 | *.pot 64 | 65 | # Django stuff: 66 | *.log 67 | local_settings.py 68 | db.sqlite3 69 | db.sqlite3-journal 70 | 71 | # Flask stuff: 72 | instance/ 73 | .webassets-cache 74 | 75 | # Scrapy stuff: 76 | .scrapy 77 | 78 | # Sphinx documentation 79 | docs/_build/ 80 | 81 | # PyBuilder 82 | .pybuilder/ 83 | target/ 84 | 85 | # Jupyter Notebook 86 | .ipynb_checkpoints 87 | 88 | # IPython 89 | profile_default/ 90 | ipython_config.py 91 | 92 | # pyenv 93 | # For a library or package, you might want to ignore these files since the code is 94 | # intended to run in multiple environments; otherwise, check them in: 95 | # .python-version 96 | 97 | # pipenv 98 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 99 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 100 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 101 | # install all needed dependencies. 102 | #Pipfile.lock 103 | 104 | # poetry 105 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 106 | # This is especially recommended for binary packages to ensure reproducibility, and is more 107 | # commonly ignored for libraries. 108 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 109 | #poetry.lock 110 | 111 | # pdm 112 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 113 | #pdm.lock 114 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 115 | # in version control. 116 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control 117 | .pdm.toml 118 | .pdm-python 119 | .pdm-build/ 120 | 121 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 122 | __pypackages__/ 123 | 124 | # Celery stuff 125 | celerybeat-schedule 126 | celerybeat.pid 127 | 128 | # SageMath parsed files 129 | *.sage.py 130 | 131 | # Environments 132 | .env 133 | .venv 134 | env/ 135 | venv/ 136 | ENV/ 137 | env.bak/ 138 | venv.bak/ 139 | 140 | # Spyder project settings 141 | .spyderproject 142 | .spyproject 143 | 144 | # Rope project settings 145 | .ropeproject 146 | 147 | # mkdocs documentation 148 | /site 149 | 150 | # mypy 151 | .mypy_cache/ 152 | .dmypy.json 153 | dmypy.json 154 | 155 | # Pyre type checker 156 | .pyre/ 157 | 158 | # pytype static type analyzer 159 | .pytype/ 160 | 161 | # Cython debug symbols 162 | cython_debug/ 163 | 164 | # PyCharm 165 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 166 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 167 | # and can be added to the global gitignore or merged into this file. For a more nuclear 168 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 169 | #.idea/ 170 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM quay.io/jupyter/base-notebook:python-3.12 2 | 3 | USER root 4 | 5 | RUN apt-get update && \ 6 | apt-get install -y gcc && \ 7 | rm -rf /var/cache/apt/archives /var/lib/apt/lists 8 | 9 | USER ${NB_UID} 10 | 11 | COPY requirements.txt . 12 | RUN pip install -r requirements.txt 13 | 14 | CMD start-notebook.py --IdentityProvider.token='' 15 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Jarrett Keifer 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Exploring Cloud-Native Geospatial Formats: A Hands-on Workshop for Raster Data 2 | 3 | [Slides for the 2025-05-01 CNG Conf workshop are here.](https://docs.google.com/presentation/d/1nBKAhig0mXkxbzxLRGgu9ygY7Uc028pbdL4qQmlyZ4c/) 4 | 5 | ## Workshop Overview 6 | 7 | Ever wonder what GDAL is doing under the hood when you read a GeoTIFF file? 
8 | Doubly so when the file is a Cloud-optimized GeoTIFF (COG) on a remote server 9 | somewhere? Have you been wondering what this new GeoZarr thing is all about and 10 | how it actually works? Then there's the whole Kerchunk/VirtualiZarr indexing to 11 | get cloud-native access for non-cloud-native data formats, what's that about? 12 | 13 | Cloud-native geospatial is all the rage these days, and for good reason. As 14 | file sizes grow, layer counts increase, and analytical methods become more 15 | complex, the traditional download-to-the-desktop approach is quickly becoming 16 | untenable for many applications. It's no surprise then that users are turning 17 | to cloud-based tools such as Dask to scale out their analyses, or that 18 | traditional tooling is adopting new ways of finding and accessing data from 19 | cloud-based sources. But as we transition away from opening whole files to now 20 | grabbing ranges of bytes off remote servers, it seems all the more important to 21 | understand exactly how cloud native data formats actually store data and what 22 | tools are doing to access it. 23 | 24 | This workshop aims to dig into how cloud-native geospatial data formats are 25 | enabling new operational paradigms, with a particular focus on raster data 26 | formats. We'll start on the surface by surveying the current cloud-native 27 | geospatial landscape to gain an understanding of why cloud native is important 28 | and how it is being used, including: 29 | 30 | * the core tenets of cloud-native geospatial data formats 31 | * cloud-native data formats for both raster and non-raster geospatial data 32 | * introduction to SpatioTemporal Asset Catalogs (STAC) and how higher-level 33 | STAC-based tooling can leverage cloud-native formats for efficient raster 34 | data access and 35 | processing of cloud-native data 36 | 37 | Then we'll get hands-on and go deep to build up an in-depth understanding of 38 | how cloud native raster formats work. We'll examine the COG format and read a 39 | COG from a cloud source by hand using just Python, selectively extracting data 40 | from the image without any geospatial dependencies. We'll repeat the same 41 | exercise for geospatial data in Zarr format to see how that compares to our 42 | experience with COGs. Lastly we'll turn our attention to Kerchunk/VirtualiZarr 43 | to see how these technologies might allow us to optimize data access for 44 | non-cloud-native formats. 45 | 46 | ### Prerequisites 47 | 48 | This workshop expects some familiarity with geospatial programming in Python. 49 | Most of the notebook code is already provided, so any gaps in understanding 50 | won't necessarily prevent completing the exercises. That said, a basic 51 | knowledge of Cloud-Native Geospatial Python tooling and working with rasters as 52 | single and multidimensional arrays is quite helpful. 53 | 54 | A good primer workshop is Alex Leith of Auspatious's [Cloud-Native Geospatial 55 | for Earth Observation Workshop]( 56 | https://github.com/auspatious/cloud-native-geospatial-eo-workshop). 57 | It is recommended to work through those activities or have an equivalent 58 | knowledge prior to working through the notebooks in this workshop. 59 | 60 | ## Getting Started 61 | 62 | The interesting contents of this repo are, primarily, the Jupyter notebooks in 63 | the [`./notebooks`](./notebooks) directory. To facilitate easily running the 64 | notebooks in a properly-initialized environment, a docker compose file is 65 | provided.
The project can also be run in a GitHub codespace without having to 66 | run anything locally. Alternatively, one can set up their own python environment 67 | and run Jupyter without a dependency on docker. 68 | 69 | Docker compose is the recommended approach if you want to keep all services 70 | local (due to bad internet and/or concerns about leveraging GitHub services). 71 | GitHub Codespaces is recommended if ease of use is the only consideration. 72 | 73 | ### Running in GitHub Codespaces (recommended as the easiest approach) 74 | 75 | This method is also recommended because it does not require any user 76 | configuration to get up and running. However, it does depend on an external, 77 | web-based service, which may not be ideal in environments with unknown internet 78 | quality (e.g., FOSS4G). That said, it does not require the user to run anything 79 | locally beyond a web browser, which may make it the best option for users with Windows or 80 | administratively locked-down machines. 81 | 82 | To use GitHub Codespaces, browse to [the project repo in 83 | GitHub](https://github.com/jkeifer/cng-raster-formats). There, click the green 84 | `<> Code` dropdown button, select the `Codespaces` tab in the dropdown menu, 85 | then click the button to add a new codespace from the `main` branch. 86 | 87 | Once the codespace is fully started, go back into the codespaces dropdown menu 88 | on the project repo page (you will likely need to refresh the page). You should 89 | see the codespace listed, and a button with three dots `...` next to it. Click 90 | that button to open a menu with more actions for the codespace, then select 91 | "Open in JupyterLab". Select a notebook from the `notebooks` directory and work 92 | through it. 93 | 94 | ### Running locally with docker (recommended for local execution) 95 | 96 | Using docker has the advantage of better constraining the execution 97 | environment, which is also set up automatically with the required dependencies. 98 | 99 | Note that the instructions below were written with a MacOS/Linux environment in 100 | mind. Windows users will likely need to leverage WSL to access a Linux 101 | environment to run docker. 102 | 103 | To begin, clone this repo: 104 | 105 | ```commandline 106 | git clone https://github.com/jkeifer/cng-raster-formats.git 107 | cd cng-raster-formats 108 | ``` 109 | 110 | Ensure the docker daemon or an equivalent is running via whatever mechanism is 111 | preferred (on Linux via the docker daemon or podman; on MacOS via Docker 112 | Desktop, colima, podman, OrbStack, or others), then use `docker compose` to 113 | `up` the project: 114 | 115 | ```commandline 116 | docker compose up 117 | ``` 118 | 119 | This will start up the Jupyter container within docker in the foreground. If 120 | you prefer to run compose in the background, add the detach option to the 121 | compose command via the `-d` flag. 122 | 123 | JupyterLab will be started with no authentication, running on port 8888 (by 124 | default; use the env var `JUPYTER_PORT` to change it if that port is already 125 | taken on your machine). Open a web browser and browse to 126 | [`http://127.0.0.1:8888`](http://127.0.0.1:8888) to open the JupyterLab 127 | interface. Select a notebook from the `notebooks` directory and work through 128 | it. 129 | 130 | ### Running locally without docker (least recommended approach) 131 | 132 | This approach is not recommended as it is more subject to local environment 133 | differences than the docker-based approaches.
But it does have the benefit of 134 | not requiring docker as a dependency. 135 | 136 | Note that the instructions below were written with a MacOS/Linux environment in 137 | mind. Windows users will likely need to leverage something like [git for 138 | Windows](https://gitforwindows.org/) and the included Git BASH tool to follow 139 | along (WSL is also likely a viable solution to get a Linux environment on a 140 | Windows machine). 141 | 142 | To get started, clone this repository and set up a python venv. Python >=3.12 143 | is recommended: 144 | 145 | ```commandline 146 | git clone https://github.com/jkeifer/cng-raster-formats.git 147 | cd cng-raster-formats 148 | python -m venv .venv 149 | source .venv/bin/activate 150 | ``` 151 | 152 | With the activated virtual environment, install the required python dependencies: 153 | 154 | ```commandline 155 | pip install -r requirements.txt 156 | ``` 157 | 158 | Doing so will install Jupyter, which can then be started by running the 159 | following command: 160 | 161 | ```commandline 162 | jupyter lab 163 | ``` 164 | 165 | Jupyter should automatically launch the JupyterLab interface in a web browser 166 | with this project loaded. Select a notebook from the `notebooks` directory and 167 | work through it. 168 | 169 | ## Presentation History 170 | 171 | ### Origin 172 | 173 | This workshop was originally created for FOSS4G 2024 and was presented as a 174 | ["Deep Dive into Cloud-Native Geospatial Raster 175 | Formats"](https://talks.osgeo.org/foss4g-2024-workshop/talk/TNYSY9/). The 176 | slides from [that particular presentation are 177 | here](https://docs.google.com/presentation/d/1qFckA0prY604I4dMkQlF1ZM-QSKS2ou4-YttgGQHzOU/). 178 | 179 | ### All Workshop Presentations 180 | 181 | | Date | Location | Slides | Notes | 182 | | ---- | -------- | ------ | ----- | 183 | | 2025-05-01 | CNG Conference | [Link](https://docs.google.com/presentation/d/1nBKAhig0mXkxbzxLRGgu9ygY7Uc028pbdL4qQmlyZ4c/) | Partial presentation (only COG notebook) as part of [a combined workshop on CNG for EO](https://conference.cloudnativegeo.org/CNGConference2025#/workshops?lang=en#CNG%20Workshop:~:text=CNG%20for%20EO%20and%20Deep%20Dive%20into%20Cloud%2DNative%20Geospatial%20Raster%20Formats). | 184 | | 2025-01-22 | Online (Virtual) | [Link](https://docs.google.com/presentation/d/1k5m2eYV8Tv4YrTAL6pfjmZMhls51cChW_QO1vcXH_0U/) | Partial presentation (only COG notebook) for users in Oceania. | 185 | | 2024-12-03 | FOSS4G Belém, Brazil | [Link](https://docs.google.com/presentation/d/1qFckA0prY604I4dMkQlF1ZM-QSKS2ou4-YttgGQHzOU/) | Original presentation. 
| 186 | -------------------------------------------------------------------------------- /bin/compile-requirements.bash: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | set -euo pipefail 4 | 5 | CUSTOM_COMPILE_COMMAND="$0" 6 | COMPILE_COMMAND='uv pip compile' 7 | COMPILE_OPTIONS="--strip-extras --generate-hashes --universal --custom-compile-command '${CUSTOM_COMPILE_COMMAND}'" 8 | 9 | 10 | find_this () { 11 | THIS="${1:?'must provide script path, like "${BASH_SOURCE[0]}" or "$0"'}" 12 | trap "echo >&2 'FATAL: could not resolve parent directory of ${THIS}'" EXIT 13 | [ "${THIS:0:1}" == "/" ] || THIS="$(pwd -P)/${THIS}" 14 | THIS_DIR="$(dirname -- "${THIS}")" 15 | THIS_DIR="$(cd -P -- "${THIS_DIR}" && pwd)" 16 | THIS="${THIS_DIR}/$(basename -- "${THIS}")" 17 | trap "" EXIT 18 | } 19 | 20 | 21 | maybe_print_help() { 22 | for arg in "$@"; do 23 | case "$arg" in 24 | -h|--help) 25 | local help 26 | help="$(${COMPILE_COMMAND} --help | tail -n +2)" 27 | echo >&2 "Usage: ${CUSTOM_COMPILE_COMMAND} [OPTIONS] [SRC_FILES]..." 28 | echo >&2 "" 29 | echo >&2 "This is a script wrapping \`${COMPILE_COMMAND}\`." 30 | echo >&2 "The following is from the \`${COMPILE_COMMAND}\` usage." 31 | echo >&2 "$help" 32 | exit 33 | ;; 34 | esac 35 | done 36 | } 37 | 38 | 39 | main () { 40 | find_this "${BASH_SOURCE[0]}" 41 | 42 | maybe_print_help "$@" 43 | 44 | cd "${THIS_DIR}/.." 45 | 46 | local compile="${COMPILE_COMMAND} ${COMPILE_OPTIONS}" 47 | 48 | local infile 49 | for infile in *.in; do 50 | [ "$infile" == 'MANIFEST.in' ] && continue 51 | $compile -o "${infile%.*}.txt" "${infile}" "$@" 52 | done 53 | } 54 | 55 | 56 | main "$@" 57 | -------------------------------------------------------------------------------- /compose.yml: -------------------------------------------------------------------------------- 1 | services: 2 | jupyter: 3 | build: 4 | context: . 5 | dockerfile: Dockerfile 6 | volumes: 7 | - .:/home/jovyan 8 | ports: 9 | - "${JUPYTER_PORT:-8888}:8888" 10 | -------------------------------------------------------------------------------- /notebooks/01_reading-cogs-the-hard-way.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "637a32af-0f84-465b-be1d-f14347c6fe3f", 6 | "metadata": {}, 7 | "source": [ 8 | "# Reading Cloud-Optimized GeoTIFFs the Hard Way\n", 9 | "\n", 10 | "In this notebook we will explore how one can read Cloud-Optimized GeoTIFFs (COGs) the hard way, i.e., by requesting and parsing byte ranges by hand. We'll query the Earth Search STAC catalog to find an image in COG format, parse the embedded metadata and file structure out of the file, then use that information to read the bytes of an image tile from the file and process them into a usable numpy array. We'll then visualize that array in a slippy map to verify what we did got us the expected result.\n", 11 | "\n", 12 | "Before we get into it, we have to get some initial stuff out of the way, like imports and some other defs we'll need for later." 
13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "id": "7903800b-3eff-49c6-9ef2-6502b3afd9f5", 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "from __future__ import annotations\n", 23 | "\n", 24 | "import json\n", 25 | "import struct\n", 26 | "import urllib.request\n", 27 | "\n", 28 | "from pprint import pprint\n", 29 | "from typing import Any, Literal, TypedDict\n", 30 | "\n", 31 | "import folium\n", 32 | "import numpy as np\n", 33 | "\n", 34 | "from odc.geo.geom import point\n", 35 | "from pystac_client import Client\n", 36 | "from griffine import Affine, Grid" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "id": "9753dde6-9fe9-4f24-80f1-3f30cec89c8f", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "# This is a mapping of the TIFF data types to the struct package's format characters\n", 47 | "# see https://docs.python.org/3/library/struct.html#format-characters\n", 48 | "\n", 49 | "DATA_TYPES = {\n", 50 | " 1: 'B', # BYTE (uint8)\n", 51 | " 2: 's', # ASCII (char[1])\n", 52 | " 3: 'H', # SHORT (uint16)\n", 53 | " 4: 'I', # LONG (uint32)\n", 54 | " 5: 'II', # RATIONAL (uint32[2])\n", 55 | " 6: 'b', # SBYTE (int8)\n", 56 | " 7: 'B', # UNDEFINED (uint8)\n", 57 | " 8: 'h', # SSHORT (int16)\n", 58 | " 9: 'i', # SLONG (int32)\n", 59 | " 10: 'ii', # SRATIONAL (int32[2])\n", 60 | " 11: 'f', # FLOAT (float32)\n", 61 | " 12: 'd', # DOUBLE (float64)\n", 62 | " 13: 'I', # SUBIFD (uint32)\n", 63 | " # 14: '',\n", 64 | " # 15: '',\n", 65 | " 16: 'Q', # ? (uint64)\n", 66 | " 17: 'q', # ? (int64)\n", 67 | " 18: 'Q', # ? (uint64)\n", 68 | "}" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "id": "1df1a58a-7ad4-4434-b956-0c4e9ca066d3", 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "EPSG_4326 = 'EPSG:4326'\n", 79 | "\n", 80 | "ENDIANNESS = {\n", 81 | " b'MM': '>', # big endian\n", 82 | " b'II': '<', # little endian\n", 83 | "}" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "id": "3bbd5fab-c199-4f89-9c5f-76e1a767c330", 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "def binary(_bytes: bytes, join_str: str = ' ') -> str:\n", 94 | " _hex = _bytes.hex()\n", 95 | " return join_str.join([\n", 96 | " '{:08b}'.format(int(_hex[i:i+2], 16))\n", 97 | " for i in range(0, len(_hex), 2)\n", 98 | " ])\n", 99 | "\n", 100 | "def url_read_bytes(url: str, start: int, end: int) -> bytes:\n", 101 | " request = urllib.request.Request(\n", 102 | " url,\n", 103 | " headers={'Range': f'bytes={start}-{end-1}'},\n", 104 | " )\n", 105 | " with urllib.request.urlopen(request) as response:\n", 106 | " return response.read()" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "ca35ae14-74dc-4847-b508-b2534db8fe9a", 112 | "metadata": {}, 113 | "source": [ 114 | "## Point of Interest (POI)\n", 115 | "\n", 116 | "To give us something to use for our query, let's define a point of interest."
117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "id": "70a33a2d-546b-4a9d-aedd-71fee5486767", 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "# Point of Interest\n", 127 | "POI = point(-121.695833, 45.373611, crs=EPSG_4326)\n", 128 | "\n", 129 | "# Let's find out where the point is\n", 130 | "point_map = POI.explore(name='point')\n", 131 | "point_map" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "id": "633e9c2e-b33c-4924-b8f1-6fb63e04c134", 137 | "metadata": {}, 138 | "source": [ 139 | "## Querying Earth Search\n", 140 | "\n", 141 | "We'll use pystac-client to search the Earth Search Sentinel 2 L2A collection for a scene intersecting our POI. We'll aim for something with low cloud cover, in the year 2023, and we'll pick the most recent scene that matches these parameters.\n", 142 | "\n", 143 | "**WARNING**: You _can_ change this to fetch scenes from a different collection, STAC API, or not use STAC and just put in an href directly to a COG of your choosing. Doing so is discouraged while in the workshop, as differences in the way the file was created might be impossible to overcome within the time limits of this workshop. Consider leaving this as-is to start, and at a later date, when you have more familiarity parsing TIFFs, you can try a different source." 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "id": "6a4cc9c6-ac20-4d7c-a098-bc234b9a3858", 150 | "metadata": { 151 | "scrolled": true 152 | }, 153 | "outputs": [], 154 | "source": [ 155 | "client = Client.open(\"https://earth-search.aws.element84.com/v1\")\n", 156 | "\n", 157 | "search = client.search(\n", 158 | " max_items=1,\n", 159 | " collections=['sentinel-2-c1-l2a'],\n", 160 | " intersects=POI,\n", 161 | " datetime='2023/2023',\n", 162 | " query=['eo:cloud_cover<10'],\n", 163 | " sortby=[{\"direction\": \"desc\", \"field\": \"properties.datetime\"}],\n", 164 | ")\n", 165 | "item = next(search.items())\n", 166 | "print(json.dumps(item.to_dict(), indent=4))" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "id": "9d8c739c-4e2c-44f5-9689-177d2b2e2132", 172 | "metadata": {}, 173 | "source": [ 174 | "We can throw that item onto our map to see its footprint relative to our POI." 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "id": "f3a0f114-f70c-4449-965d-77eeda7db81d", 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "stac_item_layer = folium.GeoJson(item, name='stac-item-footprint')\n", 185 | "point_map.fit_bounds(stac_item_layer.get_bounds())\n", 186 | "stac_item_layer.add_to(point_map)\n", 187 | "\n", 188 | "point_map" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "id": "9cede6aa-0ab0-4585-aef3-91273e75f944", 194 | "metadata": {}, 195 | "source": [ 196 | "Notably, the item we retrieved has many different bands, all of them COGs. We only need one for this exercise, so we'll grab the red band's href because that should be a good looking band visually." 
197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "id": "361982ad-abb4-4c4f-829a-d537ef5b63ae", 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "href = item.assets['red'].href\n", 207 | "print(href)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "id": "d1b6a0ce-bdd6-4e52-870a-752feea7081f", 213 | "metadata": {}, 214 | "source": [ 215 | "## The TIFF file header\n", 216 | "\n", 217 | "The first few bytes of a TIFF file tell us a couple of important things we'll need to process the rest of the file. First is the endianness of the file; second is confirmation that the file is really a TIFF file.\n", 218 | "\n", 219 | "Note that not all files with a `.tif` extension are a standard TIFF. Most notably, a standard TIFF file uses 32-bit integer offsets within the file to index particular bytes within the file (such as the offset to the first byte in an image tile, for example). Due to the maximum value of such an integer, standard TIFF files have a maximum file size of 4GB (or 2GB for certain archaic implementations that mistakenly used _signed_ integers for offset values). To get around this limitation, the BigTIFF format was developed using 64-bit integer offsets--but this is not a standard TIFF! It is close, but different enough we're not going to worry about supporting it for our little TIFF \"library\" that we're going to build.\n", 220 | "\n", 221 | "Actually, we're not going to support a lot of things. This implementation is going to be very specific to what we need for the specific Earth Search COG we selected. That's okay: sometimes a purpose-built implementation is more performant because it doesn't have to handle all the weird edge cases. Or at least that's what we can tell ourselves to feel better as we go along and notice how many conditions we're not handling in a more general way.\n", 222 | "\n", 223 | "### So what's in the header already?\n", 224 | "\n", 225 | "Enough jibber-jabber, let's read out the header. It's the first 4 bytes of a standard TIFF." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "6e1165ed-209f-4c59-9a9e-4a3a5d5145ec", 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "id": "7c2e6a5e-2d1b-4f0a-a7d0-dc1ee187d853", 239 | "metadata": {}, 240 | "source": [ 241 | "### Endianness and the TIFF Byte-Order Mark\n", 242 | "\n", 243 | "TIFF uses the first two bytes of the file to encode the endianness of the file as the \"Byte-Order Mark\". This presumably enables writers to use the most efficient endianness for their host system, if needed. Readers must support reading both big and little endian files, whereas writers can pick either endianness.\n", 244 | "\n", 245 | "We can consider the endianness of a TIFF file to \"describe the order of the bytes in a multi-byte data type\". That is, when we have a 16-, 32-, or 64-bit value, the endianness tells us how to interpret which bytes are most- and least-significant.
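\n\nFor instance, we can interpret the same four bytes both ways with the stdlib (a quick sketch, independent of any TIFF machinery):\n\n```python\nint.from_bytes(b'\xc0\x00\x00\x00', 'little')  # 192\nint.from_bytes(b'\xc0\x00\x00\x00', 'big')     # 3221225472\n```\n\n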
In the case of a 32-bit integer value, that looks like this:\n", 246 | "\n", 247 | "```\n", 248 | "16^1\n", 249 | "|16^0\n", 250 | "|| 16^3\n", 251 | "|| |16^2\n", 252 | "|| || 16^5\n", 253 | "|| || |16^4\n", 254 | "|| || || 16^7\n", 255 | "|| || || |16^6\n", 256 | "C0 00 00 00 = 192 little endian (II)\n", 257 | "\n", 258 | "00 00 00 C0 = 192 big endian (MM)\n", 259 | "|| || || |16^0\n", 260 | "|| || || 16^1\n", 261 | "|| || |16^2\n", 262 | "|| || 16^3\n", 263 | "|| |16^4\n", 264 | "|| 16^5\n", 265 | "|16^6\n", 266 | "16^7\n", 267 | "```\n", 268 | "\n", 269 | "Big endian is encoded as the byte-order mark `MM` (from Motorola processors), and little endian is encoded as `II` (from Intel processors).*\n", 270 | "\n", 271 | "The doubled letter is used to ensure that the binary sequence is the same no matter the endianness. Again, this is because the endianness affects only the order of the bytes in that two-byte word, not the order of the bits within each of the bytes.\n", 272 | "\n", 273 | "----\n", 274 | "\n", 275 | "*Naively, the above example might make it seem silly to use little endian notation--big endian seems much more natural, given the way we generally have been taught to read and write numbers. However, when it comes to how most microprocessors and memory operations actually work, little endian has some clear benefits which have generally led to its dominance outside networking. More of the nuance and complication of endianness is well documented on [its Wikipedia page](https://en.wikipedia.org/wiki/Endianness)." 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "id": "1f11a47b-4656-440c-8501-18ac7a2131e4", 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "endianness = ENDIANNESS[header[0:2]] # We'll need this later, so let's save it into a var now\n", 286 | "print(f\"Endianness signature: {header[0:2]}; struct endianness format char: '{endianness}'\")" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "id": "337dd030-79fd-42fb-aec9-91743258ee20", 292 | "metadata": {}, 293 | "source": [ 294 | "To reiterate this point about endianness: we care about endianness so we can ensure we can interpret the bytes in each word in the file in the appropriate order. Beyond that we aren't going to need to worry about endianness." 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "id": "0de25c7d-fc2c-40d7-b4aa-ddf5e6c9cbd7", 300 | "metadata": {}, 301 | "source": [ 302 | "### Magic number\n", 303 | "\n", 304 | "Many files encode a special number in their first few bytes, which can be used to distinguish whether a file is of a given format. Wikipedia has [a big long list of these \"magic numbers\"](https://en.wikipedia.org/wiki/List_of_file_signatures) for anyone curious. TIFF uses the value `42` for its magic number (and BigTIFF `43`)." 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "id": "bb94e4db-7779-4ac6-a634-2917035380d5", 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "id": "92b84a5e-60e2-4cbf-a454-992a319acca3", 318 | "metadata": {}, 319 | "source": [ 320 | "## The Python `struct` module\n", 321 | "\n", 322 | "Note the use of the `struct` module above. This is a module from the Python stdlib that is super handy when working with binary data, as it is able to pack (Python type to binary representation) and unpack (binary representation to Python type) data values given a specific format.
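\n\nAs a minimal sketch of the idea (example values only, nothing from our file yet):\n\n```python\nimport struct\n\npacked = struct.pack('<HH', 42, 73)  # two uint16 values, little endian\npacked                               # -> b'*\x00I\x00'\nstruct.unpack('<HH', packed)         # -> (42, 73)\n```\n\n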
Packing and unpacking allow specifying the endianness of the binary data using the `>` and `<` characters, which designate big and little endianness, respectively.\n", 323 | "\n", 324 | "The `H` in the magic number unpacking indicates that the data type is `uint16`. Some of the data types we'll be working with and their struct format characters include:\n", 325 | "\n", 326 | "| Data type | Format character |\n", 327 | "| --------- | ---------------- |\n", 328 | "| `uint8` | `B` |\n", 329 | "| `char[1]` | `s` |\n", 330 | "| `uint16` | `H` |\n", 331 | "| `uint32` | `I` |\n", 332 | "\n", 333 | "Additional format characters are listed in the `DATA_TYPES` dict defined near the beginning of this notebook; also consider reviewing [the `struct` docs](https://docs.python.org/3/library/struct.html) for the full list of format codes available and other details on how to use `struct`." 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "id": "0c7831f3-186b-4402-8a41-a0c7bede61ab", 339 | "metadata": {}, 340 | "source": [ 341 | "## First IFD offset: the next four bytes\n", 342 | "\n", 343 | "Immediately following the TIFF header is the offset to the first image file directory (IFD) in the file. In a standard TIFF this offset is a 32-bit unsigned integer, as was previously mentioned. We can read in and view those bytes:" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "id": "404bf225-d513-4899-9a20-4837ab069686", 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "id": "4d2f41ad-cf78-4168-a1fe-dea5eb91ca63", 357 | "metadata": {}, 358 | "source": [ 359 | "Though that's not super useful until we unpack those bytes into an integer (here using `I` because the offset is a `uint32` value):" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "id": "7c70828e-ef8b-4026-9c8f-ce15f13c3b11", 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "id": "ddf82ec3-a381-4a9f-9579-a8668abf34a2", 373 | "metadata": {}, 374 | "source": [ 375 | "## Parsing the Image File Directory\n", 376 | "\n", 377 | "The Image File Directory (IFD) is a data structure composed of entries called tags (hence the name \"Tag Image File Format\"). The IFD doesn't start with the first tag entry, however. It begins with a 2-byte `uint16` value indicating the number of tags within the IFD. This value enables us, along with the IFD offset within the file, to read the entire sequence of tag bytes via `file_bytes[ifd_offset + 2:ifd_offset + 2 + (tags_count * tag_size)]`.\n", 378 | "\n", 379 | "### Tag structure\n", 380 | "\n", 381 | "In a standard TIFF, tags are a 12-byte sequence (so `tag_size` above is 12 bytes) of the following structure:\n", 382 | "\n", 383 | "| Tag Bytes | Tag field name | Field data type |\n", 384 | "| --------- | --------------- | --------------- |\n", 385 | "| 0 - 1 | `code` | `uint16` |\n", 386 | "| 2 - 3 | `data_type` | `uint16` |\n", 387 | "| 4 - 7 | `count` | `uint32` |\n", 388 | "| 8 - 11 | `value` | `char[4]` |\n", 389 | "\n", 390 | "In the case of BigTIFF files, each tag is a 20-byte sequence where the `count` and `value` are of type `uint64`.\n", 391 | "\n", 392 | "The tag `code` field gives us a way to find the meaning of the tag `value`, as the `code` is an integer that maps to the tag name.
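\n\nTo make that layout concrete, here is a sketch that packs and then unpacks a hypothetical little-endian tag (code 256, data type 3 for SHORT, count 1, value 10980 left-justified in the 4-byte `value` field):\n\n```python\nimport struct\n\ntag_bytes = struct.pack('<HHI4s', 256, 3, 1, struct.pack('<H', 10980) + b'\x00\x00')\nstruct.unpack('<HHI4s', tag_bytes)  # -> (256, 3, 1, b'\xe4*\x00\x00')\n```\n\n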
The Library of Congress has [a handy table](https://www.loc.gov/preservation/digital/formats/content/tiff_tags.shtml) we can use to look up the tags by their codes.\n", 393 | "\n", 394 | "#### Tag data types\n", 395 | "\n", 396 | "The tag `data_type` is also an integer value, in this case mapping to the data type we can use to interpret `value` per the following table:\n", 397 | "\n", 398 | "| `data_type` | Type Name | Data type |\n", 399 | "| ----------- | --------- | ----------- |\n", 400 | "| 1 | BYTE | `uint8` |\n", 401 | "| 2 | ASCII | `char[1]` |\n", 402 | "| 3 | SHORT | `uint16` |\n", 403 | "| 4 | LONG | `uint32` |\n", 404 | "| 5 | RATIONAL | `uint32[2]` |\n", 405 | "| 6 | SBYTE | `int8` |\n", 406 | "| 7 | UNDEFINED | `uint8` |\n", 407 | "| 8 | SSHORT | `int16` |\n", 408 | "| 9 | SLONG | `int32` |\n", 409 | "| 10 | SRATIONAL | `int32[2]` |\n", 410 | "| 11 | FLOAT | `float32` |\n", 411 | "| 12 | DOUBLE | `float64` |\n", 412 | "| 13 | SUBIFD | `uint32` |\n", 413 | "| 14 | n/a | n/a |\n", 414 | "| 15 | n/a | n/a |\n", 415 | "| 16 | ? | `uint64` |\n", 416 | "| 17 | ? | `int64` |\n", 417 | "| 18 | ? | `uint64` |\n", 418 | "\n", 419 | "(I believe data types 16, 17, and 18 are specific to BigTIFF, but I have so far been unable to find confirmation either way.)\n", 420 | "\n", 421 | "The `count` field tells us how many of the listed `data_type` make up the `value` of the tag. Note that even a `count` of just one for a `data_type` of, say, 5 (RATIONAL, i.e., two `uint32`s) would not fit in `value` in a standard TIFF file, as `value` itself is only four bytes long. Similarly, a `count` greater than 4 with a `data_type` of 1 (`uint8`) would also be larger than can fit in `value`.\n", 422 | "\n", 423 | "In such cases where `count * len_in_bytes(data_type) > 4`, `value` itself is not actually the tag value but an offset to the actual value within the file. The length of that value is given by the previous expression `count * len_in_bytes(data_type)`. Thus, to get the actual value we can read `file_bytes[value:value + (count * len_in_bytes(data_type))]`.\n", 424 | "\n", 425 | "The IFD doesn't end with the last tag either. Each IFD contains a 4-byte (`uint32`) offset to the next IFD in the file (or 8-byte `uint64` in the case of BigTIFF). In the event an IFD is the last one in the file it will have a value of 0 for its next IFD offset. As a result, it should be possible to build a map of the complete contents of a TIFF by iterating through its IFDs and parsing their tags into some appropriate hierarchical data structure (TIFF --< IFDs --< image segments)." 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "id": "6ffab897-313d-4edd-b2a1-3b5c8048e2d1", 431 | "metadata": {}, 432 | "source": [ 433 | "### Finding the tag count and reading the tag bytes\n", 434 | "\n", 435 | "As mentioned, an IFD starts with a 2-byte `uint16` value indicating its number of tags. If we have an IFD's offset (`ifd_offset`) within the file--which for the first IFD we know is given to us as the first bytes in the file immediately following the TIFF header--then we also know that IFD's tag offset (`tags_start`) is given by `ifd_offset + 2`.\n", 436 | "\n", 437 | "Parsing the tag count (`tags_count`) should simply be a matter of using `struct.unpack` to unpack the two tag count bytes into an integer (struct format char `H` for `uint16`). We need to make sure we use the endianness indicated in the file header. `<` is little endian in `struct.unpack`, while `>` is big endian.
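\n\nPutting that together, one possible shape for this step (a sketch, assuming the `href`, `endianness`, and `ifd_offset` values from the cells above):\n\n```python\ntags_start = ifd_offset + 2\ntags_count = struct.unpack(\n    endianness + 'H',\n    url_read_bytes(href, ifd_offset, tags_start),\n)[0]\n```\n\n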
Looking back, the proper endian character should have been saved into the `endianness` var for us back when we were inspecting the header bytes." 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "id": "19362c9b-f9b5-4a96-a0f1-c1ae333f5a41", 444 | "metadata": {}, 445 | "outputs": [], 446 | "source": [] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "id": "bea8c45d-6197-4f04-9bd4-3290e674182c", 451 | "metadata": {}, 452 | "source": [ 453 | "If we know the tag count and the tag size (12 bytes for TIFF, 20 for BigTIFF), then we can find the total number of bytes in the IFD's tags by `tags_count * tag_size`. From this we should be able to find the last byte of the tags with `tags_end = tags_start + (tags_count * tag_size)`, allowing us to read the tag bytes (`tags_bytes`) from the file." 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "id": "c9ad2762-0dbd-4d15-9b0c-336b0baf138d", 460 | "metadata": {}, 461 | "outputs": [], 462 | "source": [] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "id": "88a37331-72d2-457c-8c79-b160128e3fac", 467 | "metadata": {}, 468 | "source": [ 469 | "It's also important to note that we can use the `tags_end` to know the offset of the next IFD offset, which is a 4-byte value we can unpack into a `uint32` (for a standard TIFF)." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "id": "ca4d52ee-30b2-47a0-8a35-3fc4af01ed45", 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "id": "9d8a58ef-e3e1-47d9-83d6-e286d3d1a44c", 483 | "metadata": {}, 484 | "source": [ 485 | "### Parsing each tag\n", 486 | "\n", 487 | "To parse each tag we need to find a way to split each tag's bytes out of the larger bytes string. Python gives us many valid ways of doing this. Let's try using a `for` loop to split the tags bytes to see what each tag looks like." 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "id": "6e452ccc-16cf-4898-81b9-80db88f09e4a", 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "id": "e55f7bd5-b714-4190-9c67-4e8532f79e33", 501 | "metadata": {}, 502 | "source": [ 503 | "#### Unpacking the tag fields\n", 504 | "\n", 505 | "The above shows us we can easily extract each tag's bytes, but we next need to use `struct.unpack` to extract the tag's `code`, `data_type`, `count`, and `value` binary values into Python types. Remember that `code` and `data_type` are `uint16` values, which map to the struct `H` format. Look up the proper struct format values for `count` and `value` knowing what you know about the data types of those tag fields and verify if the format passed into `struct.unpack` in the example here is correct (feel free to consult the `DATA_TYPES` dict above or the struct docs directly).\n", 506 | "\n", 507 | "For variety, this example implementation uses a `while` loop to extract the tag bytes. Each tag's fields are added into a dictionary indexed by the tag `code` to facilitate easy access in later code.
508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": null, 513 | "id": "3c81f0c7-8a03-49f9-b998-655c6dcb86a2", 514 | "metadata": {}, 515 | "outputs": [], 516 | "source": [ 517 | "tags = {}\n", 518 | "tag_index = 0\n", 519 | "\n", 520 | "while tag_index < tags_count:\n", 521 | " # slice this tag's 12 bytes out of the full run of tag bytes;\n", 522 | " # the while condition keeps the slice within tags_count\n", 523 | " tag_bytes = tags_bytes[(tag_size * tag_index) : (tag_size * (tag_index + 1))]\n", 524 | " tag_index += 1\n", 525 | "\n", 526 | " # unpack the four fixed-size tag fields\n", 527 | " code, data_type, count, value = struct.unpack(f'{endianness}HHI4s', tag_bytes)\n", 528 | " tags[code] = {\n", 529 | " 'data_type': data_type,\n", 530 | " 'count': count,\n", 531 | " 'value': value,\n", 532 | " }\n", 533 | "\n", 534 | "tags" 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "id": "fab9c851-779a-4d6c-a0b2-7a43a6859a20", 540 | "metadata": {}, 541 | "source": [ 542 | "#### Understanding tag codes\n", 543 | "\n", 544 | "Now that we have TIFF tag values to look at, it would be good to mention the [Library of Congress's guide to TIFF Tags](https://www.loc.gov/preservation/digital/formats/content/tiff_tags.shtml) again. We can use that lookup table to interpret each of the integer codes in a meaningful way. Note that some codes we will see in every file, while others may be specific to the way a file was encoded or the type of data it contains. Further, a number of the tags are specific to the GeoTIFF format and are required for such files, while some are used for metadata by GDAL and can generally be expected in a GeoTIFF (though not always of course).\n", 545 | "\n", 546 | "For example, we should always expect to see 256, 257, 258, and 259 (and others, these are just good examples):\n", 547 | "\n", 548 | "| Code | Tag Name | Tag Description |\n", 549 | "| ---- | ------------- | ---------------------------- |\n", 550 | "| 256 | ImageWidth | Number of image columns |\n", 551 | "| 257 | ImageLength | Number of image rows |\n", 552 | "| 258 | BitsPerSample | Number of bits in each sample |\n", 553 | "| 259 | Compression | Integer mapping to compression algorithm used for each image segment |" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "id": "44cfe55d-23b1-4fd3-9f32-e1901542457d", 559 | "metadata": {}, 560 | "source": [ 561 | "#### Unpacking the tag values\n", 562 | "\n", 563 | "Recalling the earlier explanation about tag data types, counts, and values, we know that unpacking the tag values will not be the same for each tag given the differences in those three aforementioned tag fields across each of our different tags. For some tags that have a single count of a shorter data type we can unpack the tag `value` directly. But for longer values we'll have to use the tag `value` as an offset into the file to read the actual bytes to unpack.\n", 564 | "\n", 565 | "We'll start with one of these easier examples and unpack the image size tags 256 and 257. Check the data types for these tags. What are the struct format chars for each? Will we need to unpack all four bytes of the `value` for either of these tags?"
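, "\n", "\n", "One possible answer (a sketch): the recorded data type tells us the value's true width, so we can unpack just those bytes from the front of the 4-byte `value` field:\n", "\n", "```python\n", "for code in (256, 257):\n", "    tag = tags[code]\n", "    fmt = DATA_TYPES[tag['data_type']]  # e.g. 'H' for SHORT\n", "    nbytes = struct.calcsize(fmt)\n", "    print(code, struct.unpack(endianness + fmt, tag['value'][:nbytes])[0])\n", "```"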
566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": null, 571 | "id": "8723a085-841e-4694-879f-28cfaba90758", 572 | "metadata": {}, 573 | "outputs": [], 574 | "source": [] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "id": "91b2dd60-3b7e-414e-8d14-ca2cca6e8aea", 579 | "metadata": {}, 580 | "source": [ 581 | "In the cases where the tag `value`'s four bytes are not sufficient to contain the whole tag value, parsing is a bit more complex. We not only need to find the struct format character (`struct_dtype`) and size for the tag's data type, but then we need to:\n", 582 | "\n", 583 | "* use the data type size and the tag `count` to calculate how many bytes we need to read (`size`)\n", 584 | "* unpack the `value` to get the actual value's byte offset in the file (`offset`)\n", 585 | "* combine `size` and `offset` to get the byte range and read that out of the file (giving us `values`)\n", 586 | "* build the struct format string (`endianness + (struct_dtype * count)`) then unpack `values`\n", 587 | "\n", 588 | "We'll preview this here with an example unpacking the tile offsets tag (324). The values we get out of this (`tile_offsets`) are the byte offsets for each image segment (tile) in the image represented by this IFD. We will be able to use these offsets in the next section to read the specific tile containing our POI (though we'll have to unpack the rest of our tags and do a bit of math to figure out which one and what to do with the bytes)." 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": null, 594 | "id": "75e6e581-3a51-444a-b90f-20d94e19fc41", 595 | "metadata": { 596 | "scrolled": true 597 | }, 598 | "outputs": [], 599 | "source": [ 600 | "tag = tags[324]\n", 601 | "struct_dtype = DATA_TYPES[tag['data_type']]\n", 602 | "size = tag['count'] * struct.calcsize(struct_dtype)\n", 603 | "offset = struct.unpack(f'{endianness}I', tag['value'])[0]\n", 604 | "values = url_read_bytes(href, offset, offset+size)\n", 605 | "tile_offsets = struct.unpack(endianness + (struct_dtype * tag['count']), values)\n", 606 | "\n", 607 | "for idx, tile_offset in enumerate(tile_offsets):\n", 608 | " print(f\"Offset tile {idx}: {tile_offset}\")" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "id": "bdde6490-a49f-4014-b981-035e7ff5c170", 614 | "metadata": {}, 615 | "source": [ 616 | "### Questions\n", 617 | "\n", 618 | "* Refer back to the STAC item and see if the `file` STAC extension is in use. Is the file size listed for the COG asset you are examining, and if so how close to the end of the file do these tiles appear to get?\n", 619 | "* Can you use the unpacking examples to create a generalized approach to unpacking the tag values and apply that to the rest of the tags in the IFD? The next section will have you unpack all the tags, so finding a quick and efficient way to do this might be helpful." 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "id": "40fc1666-70bb-4486-9b6a-5f3768f0b278", 625 | "metadata": {}, 626 | "source": [ 627 | "## Reading a tile from the image\n", 628 | "\n", 629 | "Reading the tile intersecting our POI will require most of our tags to be unpacked and decoded. Refer back to the `tags` dictionary's keys for the list of all tag codes in our TIFF's first IFD and the above documentation on the tag codes.
Unpack each tag's value into the corresponding variable name in the list below:\n", 630 | "\n", 631 | "* `image_width`\n", 632 | "* `image_length`\n", 633 | "* `bits_per_sample`\n", 634 | "* `compression`\n", 635 | "* `samples_per_pixel`\n", 636 | "* `predictor`\n", 637 | "* `tile_width`\n", 638 | "* `tile_length`\n", 639 | "* `tile_offsets`\n", 640 | "* `tile_byte_counts`\n", 641 | "* `sample_format`\n", 642 | "* `pixel_scale`\n", 643 | "* `tie_point`\n", 644 | "* `geo_key_directory`\n", 645 | "* `geo_double_params` (if defined, else `tuple()`)\n", 646 | "* `geo_ascii_params` (if defined, else `b''`)\n", 647 | "* `gdal_metadata`\n", 648 | "* `original_nodata_value`\n", 649 | "\n", 650 | "**NOTE**: if you have chosen a different COG source than the default Sentinel 2 red band from Earth Search, you might need to consider additional tags and processing to get this part to work. TIFF is an extremely flexible format, but this means it has many different cases that need to be handled to be able to read any arbitrary file (which also means some atypical features supported by one implementation might lead to incompatibilities with other implementations)." 651 | ] 652 | }, 653 | { 654 | "cell_type": "markdown", 655 | "id": "7d76f347-7613-4556-ae6a-99f2fc3da835", 656 | "metadata": {}, 657 | "source": [ 658 | "### Let's define a function to make unpacking the tags easier\n", 659 | "\n", 660 | "We have a lot of tags to unpack. For the sake of time, here's a function that we can use to make unpacking all the tags easier." 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": null, 666 | "id": "a3b3adbe-e6cb-400e-ba52-e87fc8f3a6ad", 667 | "metadata": {}, 668 | "outputs": [], 669 | "source": [ 670 | "class TagDict(TypedDict):\n", 671 | " data_type: int\n", 672 | " count: int\n", 673 | " value: bytes\n", 674 | "\n", 675 | "type TagsDict = dict[int, TagDict]\n", 676 | "\n", 677 | "def unpack_tag(tag: TagDict, endianness: Literal['>', '<']) -> Any:\n", 678 | " struct_dtype = DATA_TYPES[tag['data_type']]\n", 679 | " size = tag['count'] * struct.calcsize(struct_dtype)\n", 680 | " value = tag['value']\n", 681 | "\n", 682 | " offset = None\n", 683 | " if size > len(value):\n", 684 | " offset = struct.unpack(endianness + 'I', value)[0]\n", 685 | " value = url_read_bytes(href, offset, offset+size)\n", 686 | "\n", 687 | " unpacked = struct.unpack(endianness + (struct_dtype * tag['count']), value[:size])\n", 688 | "\n", 689 | " # if data_type == 2 (ASCII) we want to join the chars together\n", 690 | " if tag['data_type'] == 2:\n", 691 | " return b''.join(unpacked)\n", 692 | " elif tag['count'] == 1:\n", 693 | " return unpacked[0]\n", 694 | " return unpacked" 695 | ] 696 | }, 697 | { 698 | "cell_type": "markdown", 699 | "id": "dd0d5cdb-eb90-4e28-8be7-3d32e9604983", 700 | "metadata": {}, 701 | "source": [ 702 | "Let's try that out on the tags we unpacked above and see how this works!"
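, "\n", "\n", "For instance (a sketch), the image size tags from earlier should unpack to plain integers:\n", "\n", "```python\n", "unpack_tag(tags[256], endianness), unpack_tag(tags[257], endianness)\n", "```"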
703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": null, 708 | "id": "5513363f-b227-4019-8e25-d36359e97757", 709 | "metadata": {}, 710 | "outputs": [], 711 | "source": [] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": null, 716 | "id": "6d2665c2-85b1-4af9-a194-9e439711b6b4", 717 | "metadata": { 718 | "scrolled": true 719 | }, 720 | "outputs": [], 721 | "source": [] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "id": "989ad65f-bab8-442a-9d04-a1849c60c96c", 726 | "metadata": {}, 727 | "source": [ 728 | "Now let's use that function to unpack all our tags into the corresponding variable." 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": null, 734 | "id": "ca0ec6a1-3724-4fa7-ad23-c2be84b5bd50", 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [ 738 | "image_width = unpack_tag(tags[256], endianness)\n", 739 | "image_length = unpack_tag(tags[257], endianness)\n", 740 | "bits_per_sample = unpack_tag(tags[258], endianness)\n", 741 | "compression = unpack_tag(tags[259], endianness)\n", 742 | "samples_per_pixel = unpack_tag(tags[277], endianness)\n", 743 | "predictor = unpack_tag(tags[317], endianness)\n", 744 | "tile_width = unpack_tag(tags[322], endianness)\n", 745 | "tile_length = unpack_tag(tags[323], endianness)\n", 746 | "tile_offsets = unpack_tag(tags[324], endianness)\n", 747 | "tile_byte_counts = unpack_tag(tags[325], endianness)\n", 748 | "sample_format = unpack_tag(tags[339], endianness)\n", 749 | "pixel_scale = unpack_tag(tags[33550], endianness)\n", 750 | "tie_point = unpack_tag(tags[33922], endianness)\n", 751 | "geo_key_directory = unpack_tag(tags[34735], endianness)\n", 752 | "geo_double_params = tuple() # we don't have any double params in this image\n", 753 | "geo_ascii_params = unpack_tag(tags[34737], endianness)\n", 754 | "gdal_metadata = unpack_tag(tags[42112], endianness)\n", 755 | "original_nodata_value = unpack_tag(tags[42113], endianness)" 756 | ] 757 | }, 758 | { 759 | "cell_type": "markdown", 760 | "id": "c62f1870-d918-4bbb-be17-147392f10736", 761 | "metadata": {}, 762 | "source": [ 763 | "### Interpreting tag values\n", 764 | "\n", 765 | "Many of the tags are straightforward. Some are enumerations which require an external lookup table. Others require cross-references between their values to make sense of the contents. Let's take a look at the few that are not straightforward to understand.\n", 766 | "\n", 767 | "#### Compression\n", 768 | "\n", 769 | "The `compression` tag value represents one of an enumerated set of possible compression methods. Continuing with the spirit of needing to consult various external lookup tables, the [Wikipedia entry for TIFF](https://en.wikipedia.org/wiki/TIFF#TIFF_Compression_Tag) has a great table of the possible compression formats and their integer values, making it a great resource for understanding the meaning of the different possible values.\n", 770 | "\n", 771 | "**Question**: What is the value of the `compression` tag and what compression scheme does it indicate?" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": null, 777 | "id": "e458b1cf-d6e0-4195-b05d-735ce40c97f8", 778 | "metadata": {}, 779 | "outputs": [], 780 | "source": [] 781 | }, 782 | { 783 | "cell_type": "markdown", 784 | "id": "5f68597c-873d-403a-80bc-e58a245f13cb", 785 | "metadata": {}, 786 | "source": [ 787 | "#### Sample format\n", 788 | "\n", 789 | "The `sample_format` tag value represents one of an enumerated set of possible data types.
Those values map as follows:\n", 790 | "\n", 791 | "| Format Value | Data Type |\n", 792 | "| ------------ | --------- |\n", 793 | "| 1 | `uint` |\n", 794 | "| 2 | `int` |\n", 795 | "| 3 | `float` |\n", 796 | "| 4 | untyped |\n", 797 | "| 5 | `cint` |\n", 798 | "| 6 | `cfloat` |\n", 799 | "\n", 800 | "The bit depth of the specified format is dependent on the value of the `bits_per_sample` tag." 801 | ] 802 | }, 803 | { 804 | "cell_type": "code", 805 | "execution_count": null, 806 | "id": "d9f18e5b-05a3-4153-a010-16b589e31bd2", 807 | "metadata": {}, 808 | "outputs": [], 809 | "source": [] 810 | }, 811 | { 812 | "cell_type": "markdown", 813 | "id": "f03ffc5c-74f8-4d9e-ae1d-c10306b34a4b", 814 | "metadata": {}, 815 | "source": [ 816 | "**Question**: What does the value of the `sample_format` tag indicate with regard to the data type and length of the cell values in this image (e.g., `uint32`, `int8`, `float32`, etc.)? What does this data type map to in the struct format characters?" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "id": "73190c20-4600-45fb-9df8-333a9b4d050f", 822 | "metadata": {}, 823 | "source": [ 824 | "#### Pixel scale and tie point\n", 825 | "\n", 826 | "The `pixel_scale` tag is part of the GeoTIFF specification. It is a three-tuple where each value represents one dimension of the pixel scale, specifically the x, y, and z scales, respectively. In other words, each of the scale values represents the change in coordinate from one pixel origin to the next along the specified dimension. The units of each scale value are the same as those specified in the coordinate reference system (CRS; we'll see this when reviewing the `geo_key_directory` below).\n", 827 | "\n", 828 | "The `tie_point` tag is again part of the GeoTIFF specification. It defines a set of coordinates in the image space and their mappings to coordinates in the model space as a list of six-tuples. The first three tuple values are the image space x, y, and z coordinates, respectively. The latter three tuple values are the model space x, y, and z, respectively. The model space is perhaps best understood to be the coordinate reference system defined for the image.\n", 829 | "\n", 830 | "What is most notable for us about these two tags is that we can use them to build a simple affine transform to convert between image space and model space. Most GeoTIFF files, like ours, use such an affine transform, which means the following is frequently true:\n", 831 | "\n", 832 | "* the set of coordinate mappings has length one, i.e., only one point is mapped from image space to coordinate space\n", 833 | "* the image space coordinates of that point are (0, 0, 0), which effectively allows us to consider the model space coordinates to be the geographic point represented by the image origin\n", 834 | "\n", 835 | "The [GeoTIFF spec docs detailing how to use these values are here](http://geotiff.maptools.org/spec/geotiff2.6.html). Note that the `pixel_scale` tag is optional; in some cases a `ModelTransformationTag` is used instead to encode the affine transformation matrix into the file, such as when needing to express grid rotation. Sometimes neither of these tags is present, such as when the transformation is not affine, which is typically when the `tie_point` tag would have multiple points describing a warp mesh over the image.
Consult the docs to fully understand the interactions of these three tags and how to interpret their values outside this simple affine case.\n",
836 | "\n",
837 | "Why does all of this matter for us? We can use the `pixel_scale` and `tie_point` tag values to construct an affine transform object, which we'll use to perform coordinate transformations between model space (the image CRS) and image space (pixel coordinates). We need to be able to do this to find what pixel in the image contains our POI, so we can find which image tile to read."
838 | ]
839 | },
840 | {
841 | "cell_type": "code",
842 | "execution_count": null,
843 | "id": "468f9c52-6890-4fba-b056-020bb6e380f1",
844 | "metadata": {},
845 | "outputs": [],
846 | "source": [
847 | "transform = Affine(\n",
848 | "    # w-e pixel resolution / pixel width\n",
849 | "    pixel_scale[0],\n",
850 | "    # row rotation (typically zero)\n",
851 | "    0,\n",
852 | "    # x-coordinate of the upper-left corner of the upper-left pixel (origin)\n",
853 | "    tie_point[3] - (pixel_scale[0] * tie_point[0]),\n",
854 | "    # column rotation (typically zero)\n",
855 | "    0,\n",
856 | "    # n-s pixel resolution / pixel height (negative value for a north-up image)\n",
857 | "    -pixel_scale[1],\n",
858 | "    # y-coordinate of the upper-left corner of the upper-left pixel (origin)\n",
859 | "    tie_point[4] - (-pixel_scale[1] * tie_point[1]),\n",
860 | ")"
861 | ]
862 | },
863 | {
864 | "cell_type": "markdown",
865 | "id": "5a46017e-0bf0-4a4f-bea5-61a23729c884",
866 | "metadata": {},
867 | "source": [
868 | "#### Geo key directory and the params\n",
869 | "\n",
870 | "Another set of GeoTIFF-spec tags, `geo_key_directory`, `geo_double_params`, and `geo_ascii_params` represent a collection of geospatial information we need to interpret the data in a spatially-aware way. For example, such important information as the CRS is stored amongst these tags. The [GeoTIFF spec docs also document these tags and their interactions](http://geotiff.maptools.org/spec/geotiff2.4.html).\n",
871 | "\n",
872 | "In short, `geo_double_params` and `geo_ascii_params` are actually sets of parameters that can be used to fill in information that cannot be represented directly in the `geo_key_directory` due to data type differences (the `geo_key_directory` is a `uint16` tuple, whereas the two params tags are tuples of double precision floats and ASCII-encoded strings, respectively). The `geo_key_directory` is a collection of four-tuples (potentially with some additional trailing values), the first of which is a header that documents the tuples that follow. It has the following 8-byte structure:\n",
873 | "\n",
874 | "```\n",
875 | "Header = (KeyDirectoryVersion, KeyRevision, MinorRevision, NumberOfKeys)\n",
876 | "```\n",
877 | "\n",
878 | "For our purposes, the important piece here is the number of keys: we need to know how many keys are in the directory to be able to work out the offset to any additional values in the directory structure we might need to fill in directory entries that have multiple `uint16` values.\n",
879 | "\n",
880 | "After the header, each of the keys in the directory has the 8-byte structure:\n",
881 | "\n",
882 | "```\n",
883 | "KeyEntry = (KeyID, TIFFTagLocation, Count, Value_Offset)\n",
884 | "```\n",
885 | "\n",
886 | "The `KeyID` here is just like our TIFF tags: it is an identifier that can be used with an external lookup table to interpret the meaning of the key's value.
The `TIFFTagLocation` is used to point to a TIFF tag that contains the value for this key: if the value is directly embedded in the key (in the place of `Value_Offset`) then the location is `0` and this key's value is of type `uint16`. In cases where the value is not directly embedded in the key, the location will have the value of the tag code that contains the value. The `Value_Offset` and `Count` can then be used to extract the set of values pertaining to this key from that tag's data. The data type of the key value is given by the source tag's data type.\n",
887 | "\n",
888 | "For example, if we have a key entry with the values `(1024, 0, 1, 1)` we know that the key ID is `1024`, the location of `0` means the value is embedded in the key entry, and that means our count is necessarily `1` and we can interpret the `Value_Offset` as the key value, in this case `1`.\n",
889 | "\n",
890 | "A more complex example could be like `(2049, 34737, 7, 22)`: in this case we have a non-zero location, so we have to read the values--in this case seven values per the count value--from a separate tag. The location of `34737` corresponds to the `geo_ascii_params` tag, which not only tells us where to get the values for this key, but also their data type. If the value of the `34737` tag is `b'WGS 84 / UTM zone 10N|WGS 84|\x00'`, then taking 7 bytes from position 22 we end up with `b'WGS 84|'`. The `|` is intended to be converted into a null byte to terminate the extracted string; in Python it is easy enough to read one less byte than the key's count for ASCII-type keys as string termination is handled for us.\n",
891 | "\n",
892 | "We can use the above explanation to write a general-purpose function to extract the keys into a dict, like we originally did with the IFD tags, and then we can use it to extract our keys. Refer to the [GeoTIFF document \"Geocoding Raster Data\"](http://geotiff.maptools.org/spec/geotiff2.7.html#2.7) for an explanation of the key IDs and how to understand their meanings. These keys are critical for finding the CRS of the file via the information presented in that documentation."
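,
"\n",
"To make the byte arithmetic in the `(2049, 34737, 7, 22)` example concrete, here is that slice as runnable Python (using the hypothetical tag value from above):\n",
"\n",
"```\n",
"ascii_params = b'WGS 84 / UTM zone 10N|WGS 84|\x00'\n",
"# 7 bytes at offset 22, reading one less byte so the `|` terminator is dropped\n",
"ascii_params[22:22 + 7 - 1]  # -> b'WGS 84'\n",
"```\n"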
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": null,
898 | "id": "79bf68fb-ee82-4f40-8fb7-e02618a55968",
899 | "metadata": {},
900 | "outputs": [],
901 | "source": [
902 | "def extract_geo_keys(\n",
903 | "    key_directory: tuple[int, ...],\n",
904 | "    double_params: tuple[float, ...],\n",
905 | "    ascii_params: bytes,\n",
906 | ") -> dict[int, int | float | bytes]:\n",
907 | "    keys: dict[int, int | float | bytes] = {}\n",
908 | "\n",
909 | "    try:\n",
910 | "        _, _, _, key_count = key_directory[0:4]\n",
911 | "        for key_index in range(key_count):\n",
912 | "            offset = (4 * key_index) + 4\n",
913 | "            key_id, location, count, value_offset = key_directory[offset:offset + 4]\n",
914 | "\n",
915 | "            if location == 0:\n",
916 | "                keys[key_id] = value_offset\n",
917 | "            elif location == 34735:\n",
918 | "                keys[key_id] = key_directory[value_offset:value_offset + count]\n",
919 | "            elif location == 34736:\n",
920 | "                keys[key_id] = double_params[value_offset:value_offset + count]\n",
921 | "            elif location == 34737:\n",
922 | "                keys[key_id] = ascii_params[value_offset:value_offset + (count - 1)]\n",
923 | "            else:\n",
924 | "                raise ValueError(f'Unknown location: {location}')\n",
925 | "    except Exception as e:\n",
926 | "        raise ValueError('Could not parse geo keys') from e\n",
927 | "\n",
928 | "    return keys"
929 | ]
930 | },
931 | {
932 | "cell_type": "code",
933 | "execution_count": null,
934 | "id": "5ab7a95e-5198-4d42-9d99-aa0e15df32d4",
935 | "metadata": {},
936 | "outputs": [],
937 | "source": [
938 | "geo_keys = extract_geo_keys(geo_key_directory, geo_double_params, geo_ascii_params)\n",
939 | "geo_keys"
940 | ]
941 | },
942 | {
943 | "cell_type": "markdown",
944 | "id": "51045882-2605-4cea-90ce-18d63c6dc5a7",
945 | "metadata": {},
946 | "source": [
947 | "**Question**: From the extracted geo key values, can you find the CRS and its EPSG code?"
948 | ]
949 | },
950 | {
951 | "cell_type": "markdown",
952 | "id": "64deb076-42f0-47af-891b-611b29668dc2",
953 | "metadata": {},
954 | "source": [
955 | "#### Nodata\n",
956 | "\n",
957 | "The GDAL nodata value is stored in GeoTIFFs as a null-terminated ASCII string, for some reason (likely to ensure it can be parsed with a consistent data type, in particular because the nodata value needs to be interpreted with the data type of the TIFF data, which might not map directly to the TIFF-defined data types). Because of this, the `nodata` value needs some additional processing before we can use it.\n",
958 | "\n",
959 | "Specifically, we need to clip the final character off, then we need to cast it to an appropriate data type (as given by `sample_format` and `bits_per_sample`). For example, if we have an integer data type for our image data then we need to do something like `nodata_value = int(original_nodata_value[:-1])`."
960 | ]
961 | },
962 | {
963 | "cell_type": "code",
964 | "execution_count": null,
965 | "id": "b0d5e38a-c480-4433-a8f6-4fd2e3e9bb25",
966 | "metadata": {},
967 | "outputs": [],
968 | "source": [
969 | "# We need to clip the string terminator off the nodata value\n",
970 | "# before coercing to an int (because it is stored as ASCII)\n",
971 | "nodata_value = int(original_nodata_value[:-1])\n",
972 | "nodata_value"
973 | ]
974 | },
975 | {
976 | "cell_type": "markdown",
977 | "id": "591d92de-223e-4f31-89bd-f07c85b67633",
978 | "metadata": {},
979 | "source": [
980 | "## Reading an image tile\n",
981 | "\n",
982 | "Now that we have all our metadata parsed out we can focus on using that metadata to read a tile.
We have our POI though, so we presumably want to find the tile containing said POI, rather than some other arbitrary tile. To do so we'll need to do some math.\n", 983 | "\n", 984 | "### Transforming our POI\n", 985 | "\n", 986 | "First, we need to find the point coordinates of our POI in the same reference system as the image. We can use the `to_crs` method on our POI with our image's CRS as parsed from the geo keys above." 987 | ] 988 | }, 989 | { 990 | "cell_type": "code", 991 | "execution_count": null, 992 | "id": "a5cf3beb-6206-4101-85f5-332a8541692e", 993 | "metadata": {}, 994 | "outputs": [], 995 | "source": [ 996 | "image_crs = f\"EPSG:{geo_keys[3072]}\"\n", 997 | "POI_proj = POI.to_crs(image_crs)\n", 998 | "print(f'x={POI_proj.geom.x}, y={POI_proj.geom.y}')" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "markdown", 1003 | "id": "3842480d-530b-427d-aafc-1268255e021f", 1004 | "metadata": {}, 1005 | "source": [ 1006 | "Check the above output. Does it make sense given what we know about the origin of our image, and the relation of our POI to the image footprint?" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "markdown", 1011 | "id": "e2b166f9-13e2-4084-b59c-51bd536e33cb", 1012 | "metadata": {}, 1013 | "source": [ 1014 | "### Finding the pixel coordinates of our POI\n", 1015 | "\n", 1016 | "Now that we have our POI geographic coordinates in the same CRS as our image, we can use the image's affine transform to convert our POI's geographic coordinates into pixel coordinates in our image's pixel grid. The author of this notebook has created a small library to make tasks involving raster grids with affine transforms easier, called `griffine`. We can make an instance of a `griffine.Grid` and attach our transform to it, which will help us in a number of operations to come. Then we can use the grid to find the cell containing our point." 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": null, 1022 | "id": "0109cd09-b524-4d7e-887e-914fe02ce9ca", 1023 | "metadata": {}, 1024 | "outputs": [], 1025 | "source": [ 1026 | "grid = Grid(rows=image_length, cols=image_width).add_transform(transform)\n", 1027 | "cell = grid.point_to_cell(POI_proj)\n", 1028 | "print(f'row={cell.row}, col={cell.col}')" 1029 | ] 1030 | }, 1031 | { 1032 | "cell_type": "markdown", 1033 | "id": "24c9a1fd-a759-4683-93cb-38303d6a74f8", 1034 | "metadata": {}, 1035 | "source": [ 1036 | "Again, check the above output. Does it make sense given what we know about the origin of our image, its grid, and the relation of our POI to the image footprint?" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "markdown", 1041 | "id": "29697e92-2742-4154-883b-1fefa11285b3", 1042 | "metadata": {}, 1043 | "source": [ 1044 | "### Finding our tile coordinates\n", 1045 | "\n", 1046 | "To work out which tile we need to read we need to convert our pixel coordinates into tile coordinates. We can tile our `grid` object and get the tile containing our POI." 
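,
"\n",
"Under the hood this is just integer division of the pixel coordinates by the tile shape, so the result below can be sanity-checked by hand (an illustrative equivalence, assuming the usual TIFF row-major tile layout):\n",
"\n",
"```\n",
"# equivalent arithmetic to what the tiled grid computes for us\n",
"tile_row = cell.row // tile_length\n",
"tile_col = cell.col // tile_width\n",
"```\n"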
1047 | ]
1048 | },
1049 | {
1050 | "cell_type": "code",
1051 | "execution_count": null,
1052 | "id": "aad7b663-79ac-48fc-ba80-68d5adbeccd3",
1053 | "metadata": {},
1054 | "outputs": [],
1055 | "source": [
1056 | "tile_grid = grid.tile_via(Grid(rows=tile_length, cols=tile_width))\n",
1057 | "tile = tile_grid.point_to_tile(POI_proj)\n",
1058 | "print(f'tile_row={tile.row}, tile_col={tile.col}')"
1059 | ]
1060 | },
1061 | {
1062 | "cell_type": "markdown",
1063 | "id": "a91d35d4-294c-44da-93c9-0d9152f97ac9",
1064 | "metadata": {},
1065 | "source": [
1066 | "One last time: check the above output. Does it make sense given what we know about the structure of our image, and the relation of our POI to the image footprint?"
1067 | ]
1068 | },
1069 | {
1070 | "cell_type": "markdown",
1071 | "id": "3788a090-6288-4925-963d-5e255cff3621",
1072 | "metadata": {},
1073 | "source": [
1074 | "### Finding our tile offset and byte length\n",
1075 | "\n",
1076 | "The tag values of `tile_offsets` and `tile_byte_counts` give us the offset and byte length of each tile. To retrieve them for a given tile we need to know the tile's index within those tuples. We use our `tile` object with our `tile_grid` to find the tile's linear index within the grid."
1077 | ]
1078 | },
1079 | {
1080 | "cell_type": "code",
1081 | "execution_count": null,
1082 | "id": "a56f428e-3fc3-4e43-a660-4ca0d5277818",
1083 | "metadata": {},
1084 | "outputs": [],
1085 | "source": [
1086 | "tile_index = tile_grid.linear_index(tile)\n",
1087 | "tile_offset = tile_offsets[tile_index]\n",
1088 | "tile_byte_length = tile_byte_counts[tile_index]\n",
1089 | "print(f'tile ({tile.row}, {tile.col}) has index {tile_index} and is at offset {tile_offset} with length {tile_byte_length}')"
1090 | ]
1091 | },
1092 | {
1093 | "cell_type": "markdown",
1094 | "id": "fade2970-ee62-40f4-bbaf-012a504c330f",
1095 | "metadata": {},
1096 | "source": [
1097 | "### Actually reading the tile\n",
1098 | "\n",
1099 | "Now that we know where the tile bytes are in the file we can read them, extract them (using the specified `compression`), then unpack them into a numpy array."
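,
"\n",
"As a refresher, the byte-range read below can be done with a plain HTTP `Range` header. A minimal sketch of such a helper (the actual `url_read_bytes` was defined earlier in this notebook; this version assumes half-open `[start, end)` ranges like that usage implies) might look like:\n",
"\n",
"```\n",
"import requests\n",
"\n",
"def url_read_bytes(url: str, start: int, end: int) -> bytes:\n",
"    # HTTP ranges are inclusive of the last byte, hence the end - 1\n",
"    response = requests.get(url, headers={'Range': f'bytes={start}-{end - 1}'})\n",
"    response.raise_for_status()\n",
"    return response.content\n",
"```\n"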
1100 | ]
1101 | },
1102 | {
1103 | "cell_type": "code",
1104 | "execution_count": null,
1105 | "id": "06aec812-f6f1-4339-96e9-2d803b6e5649",
1106 | "metadata": {},
1107 | "outputs": [],
1108 | "source": [
1109 | "tile_bytes = url_read_bytes(href, tile_offset, tile_offset + tile_byte_length)"
1110 | ]
1111 | },
1112 | {
1113 | "cell_type": "code",
1114 | "execution_count": null,
1115 | "id": "55db4c1d-ac5b-4596-b5c2-a71127f6c104",
1116 | "metadata": {},
1117 | "outputs": [],
1118 | "source": [
1119 | "# Per our `compression` tag we know the data is compressed using `DEFLATE`,\n",
1120 | "# which can be extracted using the stdlib `zlib` module.\n",
1121 | "import zlib\n",
1122 | "tile_extracted = zlib.decompress(tile_bytes, 0)"
1123 | ]
1124 | },
1125 | {
1126 | "cell_type": "code",
1127 | "execution_count": null,
1128 | "id": "73a34ce2-5380-4669-a498-75c0f8d100c2",
1129 | "metadata": {},
1130 | "outputs": [],
1131 | "source": [
1132 | "# Our data type is `uint16` as we previously found; using our\n",
1133 | "# lookup table we know that maps to a struct format char of `H`.\n",
1134 | "# That dtype is 2 bytes in length, so we know our tile data\n",
1135 | "# contains `len(tile_extracted) // 2` pixel values.\n",
1136 | "struct_dtype = 'H'\n",
1137 | "tile_array = np.array(\n",
1138 | "    struct.unpack(\n",
1139 | "        endianness + (struct_dtype * (len(tile_extracted) // struct.calcsize(struct_dtype))),\n",
1140 | "        tile_extracted,\n",
1141 | "    ),\n",
1142 | "    dtype=np.uint16\n",
1143 | ").reshape(tile_length, tile_width)\n",
1144 | "tile_array"
1145 | ]
1146 | },
1147 | {
1148 | "cell_type": "code",
1149 | "execution_count": null,
1150 | "id": "af3f8da5-6364-4443-994e-3fcc5570c9e4",
1151 | "metadata": {},
1152 | "outputs": [],
1153 | "source": [
1154 | "# Or, more easily but perhaps too abstractly\n",
1155 | "np.frombuffer(tile_extracted, dtype=np.uint16).reshape(tile_length, tile_width)"
1156 | ]
1157 | },
1158 | {
1159 | "cell_type": "markdown",
1160 | "id": "af4c76df-928c-4561-8b74-ad49ea92022f",
1161 | "metadata": {},
1162 | "source": [
1163 | "#### A note on `predictor`\n",
1164 | "\n",
1165 | "The `predictor` tag is used when a filtering step is done prior to compression. For geospatial data, values of `2` and `3` are common: `2` is best for integer data and calculates the horizontal difference between cells, while `3` is used for floating point data. In other words, if we have a predictor set and it isn't `1` (indicating no predictor) then we can't just extract the data and start using it. The data will require a processing step to reverse the prediction operation and restore the data to its original values."
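,
"\n",
"To make the horizontal-differencing idea concrete, here is a tiny round trip with made-up values (a sketch, not data from our file):\n",
"\n",
"```\n",
"import numpy as np\n",
"\n",
"row = np.array([10, 12, 15, 15], dtype=np.uint16)\n",
"stored = np.concatenate(([row[0]], np.diff(row)))  # what a predictor-2 writer stores: [10, 2, 3, 0]\n",
"restored = np.cumsum(stored, dtype=row.dtype)      # what a reader must do to undo it\n",
"assert (restored == row).all()\n",
"```\n"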
1166 | ]
1167 | },
1168 | {
1169 | "cell_type": "code",
1170 | "execution_count": null,
1171 | "id": "106632b4-0d93-42f2-97e3-c1d6a338b0a6",
1172 | "metadata": {},
1173 | "outputs": [],
1174 | "source": [
1175 | "print(predictor)"
1176 | ]
1177 | },
1178 | {
1179 | "cell_type": "code",
1180 | "execution_count": null,
1181 | "id": "0094e1f2-cb8d-4f1a-bbc8-de44eb2aed38",
1182 | "metadata": {},
1183 | "outputs": [],
1184 | "source": [
1185 | "# as our data is using predictor `2` (horizontal difference),\n",
1186 | "# we'll need to reverse the difference using a cumulative sum\n",
1187 | "tile_array_unfiltered = np.cumsum(tile_array, axis=1, dtype=tile_array.dtype)\n",
1188 | "tile_array_unfiltered"
1189 | ]
1190 | },
1191 | {
1192 | "cell_type": "markdown",
1193 | "id": "9f2884f5-abce-4e5f-8290-62c4a35f9dbf",
1194 | "metadata": {},
1195 | "source": [
1196 | "#### Scale and offset\n",
1197 | "\n",
1198 | "We have yet one more operation we need to do to our data array to make it usable. It turns out the stored data format `uint16` isn't actually the real data format. Instead, limited-precision floats have been mapped to that data type by using a specified scaling factor. Moreover, due to the Sentinel 2 L2 atmospheric correction process, it's possible to have negative values in the data, which must be accounted for by shifting the uint values via a specified offset.\n",
1199 | "\n",
1200 | "Both the `scale` and `offset` values are contained within the `gdal_metadata` tag. The GDAL metadata format is, sadly, XML, though for our purposes it is readable enough that we don't need to worry about parsing complexities: we can just print out the value of that tag and read out the values we need."
1201 | ]
1202 | },
1203 | {
1204 | "cell_type": "code",
1205 | "execution_count": null,
1206 | "id": "806d7cb4-928d-45fa-bbc2-8c1b2ab020b4",
1207 | "metadata": {},
1208 | "outputs": [],
1209 | "source": [
1210 | "gdal_metadata"
1211 | ]
1212 | },
1213 | {
1214 | "cell_type": "code",
1215 | "execution_count": null,
1216 | "id": "043b1c67-8708-4549-a4b1-d2b434035d7c",
1217 | "metadata": {},
1218 | "outputs": [],
1219 | "source": [
1220 | "# fill in the value_offset and value_scale from the above\n",
1221 | "value_offset = -0.1\n",
1222 | "value_scale = 0.0001\n",
1223 | "tile_array_scaled_offset = (tile_array_unfiltered * value_scale) + value_offset\n",
1224 | "tile_array_scaled_offset"
1225 | ]
1226 | },
1227 | {
1228 | "cell_type": "markdown",
1229 | "id": "86d22864-6684-47c4-b533-fdd0ffc5bdd3",
1230 | "metadata": {},
1231 | "source": [
1232 | "## Visualizing the tile on our map\n",
1233 | "\n",
1234 | "Now that we have our tile data, it would be great to see it alongside our POI to visually confirm we got the tile we expected. It turns out Folium has a kinda hokey way of converting numpy arrays to PNGs for display on the map, which we can leverage here to visually verify the data we've read for our tile and the operations we've done on it. It's not perfect, as it assumes our data is aligned to the Mercator grid (which it probably isn't), but it's close enough for us to take a look.\n",
1235 | "\n",
1236 | "We just need the tile's min and max latitude and longitude (in EPSG:4326 coordinates) so we can tell Folium its bounding box (roughly), then we can (re-)make our map and add our layers. We can use our `tile` object to compute those coordinates in our image CRS then convert them to EPSG:4326."
1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "execution_count": null, 1242 | "id": "1f6b0cf7-a1bc-4743-b8c5-126aef69ce58", 1243 | "metadata": {}, 1244 | "outputs": [], 1245 | "source": [ 1246 | "tile_origin_x, tile_origin_y = point(*tile.origin.coords[0], crs=image_crs).to_crs(EPSG_4326).coords[0]\n", 1247 | "tile_antiorigin_x, tile_antiorigin_y = point(*tile.antiorigin.coords[0], crs=image_crs).to_crs(EPSG_4326).coords[0]\n", 1248 | "\n", 1249 | "# we make a whole new map because if we screwed\n", 1250 | "# something up we only have to re-run this cell to fix it\n", 1251 | "raster_map = POI.explore(name='point')\n", 1252 | "\n", 1253 | "stac_item_layer.add_to(raster_map)\n", 1254 | "raster_map.fit_bounds(stac_item_layer.get_bounds())\n", 1255 | "\n", 1256 | "folium.raster_layers.ImageOverlay(\n", 1257 | " tile_array_scaled_offset,\n", 1258 | " bounds=[[tile_antiorigin_y, tile_origin_x], [tile_origin_y, tile_antiorigin_x]],\n", 1259 | " name='tile',\n", 1260 | ").add_to(raster_map)\n", 1261 | "\n", 1262 | "folium.LayerControl().add_to(raster_map)\n", 1263 | "\n", 1264 | "raster_map" 1265 | ] 1266 | }, 1267 | { 1268 | "cell_type": "markdown", 1269 | "id": "7f709856-e51b-41fe-9936-adc0139d1ab3", 1270 | "metadata": {}, 1271 | "source": [ 1272 | "## Additional exercises to consider later\n", 1273 | "\n", 1274 | "* Find how many overviews are in this file.\n", 1275 | "* Find the dimensions and gsd of each overview.\n", 1276 | "* Repeat reading the tile containing your point of interest, but do so from one of the overviews.\n", 1277 | "* How can we make reading the file more efficient? Can we get all the IFDs in the file with a single read without having to read in image data?\n", 1278 | "* Can you write the TIFF for the map visualization yourself instead of using an external lib?\n", 1279 | "* Repeat these exercises with a multiband TIFF to see how the file structure differs to support the additional bands.\n", 1280 | "\n", 1281 | "Any other cool ideas? Let me know and/or share with the group." 
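,
"\n",
"As a starting point for the first exercise, the IFDs (and thus the overviews) can be counted by following each IFD's next-IFD offset. The sketch below assumes a classic (non-Big) TIFF, the `url_read_bytes` helper with half-open byte ranges, and the `endianness` and first IFD offset parsed from the header earlier:\n",
"\n",
"```\n",
"def count_ifds(href, first_ifd_offset, endianness):\n",
"    # classic TIFF: each IFD is a 2-byte tag count, 12 bytes per tag,\n",
"    # then a 4-byte offset to the next IFD (0 means no more IFDs)\n",
"    count = 0\n",
"    offset = first_ifd_offset\n",
"    while offset != 0:\n",
"        count += 1\n",
"        (num_tags,) = struct.unpack(endianness + 'H', url_read_bytes(href, offset, offset + 2))\n",
"        next_ptr = offset + 2 + (12 * num_tags)\n",
"        (offset,) = struct.unpack(endianness + 'I', url_read_bytes(href, next_ptr, next_ptr + 4))\n",
"    return count\n",
"\n",
"# in a COG the first IFD is the full-resolution image,\n",
"# so the overview count would be count_ifds(...) - 1\n",
"```\n"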
1282 | ]
1283 | }
1284 | ],
1285 | "metadata": {
1286 | "kernelspec": {
1287 | "display_name": "Python 3 (ipykernel)",
1288 | "language": "python",
1289 | "name": "python3"
1290 | },
1291 | "language_info": {
1292 | "codemirror_mode": {
1293 | "name": "ipython",
1294 | "version": 3
1295 | },
1296 | "file_extension": ".py",
1297 | "mimetype": "text/x-python",
1298 | "name": "python",
1299 | "nbconvert_exporter": "python",
1300 | "pygments_lexer": "ipython3",
1301 | "version": "3.12.10"
1302 | }
1303 | },
1304 | "nbformat": 4,
1305 | "nbformat_minor": 5
1306 | }
1307 |
--------------------------------------------------------------------------------
/notebooks/02_reading-zarr-the-hard-way.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "e8d50b3a-1ed7-437e-bf35-c9b257eb60eb",
6 | "metadata": {},
7 | "source": [
8 | "# Understanding Zarr"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": null,
14 | "id": "a85e656e-3872-46a5-9f11-12674cacc540",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import datetime\n",
19 | "import json\n",
20 | "\n",
21 | "from pathlib import Path\n",
22 | "from typing import Any\n",
23 | "\n",
24 | "import fsspec\n",
25 | "import numpy as np\n",
26 | "import planetary_computer\n",
27 | "import pystac_client\n",
28 | "import pyproj\n",
29 | "\n",
30 | "from odc.geo.geom import point"
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "id": "810039d2-7f44-488c-917d-821845fad657",
36 | "metadata": {},
37 | "source": [
38 | "# Point of Interest (POI)\n",
39 | "\n",
40 | "I want to go to Puerto Rico, but I'm from a northern temperate latitude. I'd like to find the best week of the year for me to visit Puerto Rico without having oppressive weather. Can we use a zarr dataset from Planetary Computer to answer this question?\n",
41 | "\n",
42 | "To pick a more specific point of interest, let's use these coordinates of San Juan:"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": null,
48 | "id": "b3a46609-1899-40fa-907c-2e99fd8dfe77",
49 | "metadata": {},
50 | "outputs": [],
51 | "source": [
52 | "POI = point(-66.063889, 18.406389, crs='EPSG:4326')\n",
53 | "\n",
54 | "# Let's find out where the point is\n",
55 | "point_map = POI.explore(name='point')\n",
56 | "point_map"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "id": "ec078acb-bfb7-4af6-9215-6fadd5babb99",
62 | "metadata": {},
63 | "source": [
64 | "## Finding data\n",
65 | "\n",
66 | "Now that we know where to look, we need to find some data we can use to answer this question. Turns out Microsoft's Planetary Computer has zarr data, and [we can search for zarr to find possible datasets](https://planetarycomputer.microsoft.com/catalog?filter=zarr). A prime contender among the options is the [Daymet Daily Puerto Rico](https://planetarycomputer.microsoft.com/dataset/daymet-daily-pr) dataset. For the sake of keeping this exercise simple, let's just consider measurements from 2020 (the last year in the dataset), and we'll average the daily maximum temperature (`tmax` variable) by week.\n",
67 | "\n",
68 | "Zarr datasets can be represented by a STAC collection without items, with a collection asset pointing to the zarr root. To get more information about this dataset we can fetch the `daymet-daily-pr` STAC collection from Planetary Computer.
Pay special attention to the `assets`, `cube:variables`, and `cube:dimensions` attributes of the fetched collection.\n",
69 | "\n",
70 | "Note that accessing Planetary Computer data from Azure Blob Storage requires a signed token; we can use the `planetary_computer` package as a \"plugin\" of sorts to `pystac_client` to ensure we get an access token attached as necessary to each asset."
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "id": "d4c0f4ba-b68a-4063-b4cf-371add0d3e56",
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "catalog = pystac_client.Client.open(\n",
81 | "    \"https://planetarycomputer.microsoft.com/api/stac/v1\",\n",
82 | "    modifier=planetary_computer.sign_inplace,\n",
83 | ")\n",
84 | "collection = catalog.get_collection('daymet-daily-pr')\n",
85 | "collection"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "id": "9d335296-c3ec-477f-a789-81be915e63fe",
91 | "metadata": {},
92 | "source": [
93 | "## Accessing the zarr files\n",
94 | "\n",
95 | "We're going to use `fsspec` with the `abfs` filesystem type (as provided by the `adlfs` package) to get access to the zarr directory tree and its files. The `abfs` filesystem requires both a signed access token and the Azure account name of the bucket we are attempting to access. Because we used the `planetary_computer` package with `pystac_client` when querying the collection, these connection details are included in the `xarray:storage_options` property of the `zarr-abfs` asset, and we can use them to successfully initialize our filesystem object `fs` and use it to explore the contents of the bucket.\n",
96 | "\n",
97 | "All the paths we're interested in will be under the root of the zarr we are interested in. That path is provided within the `zarr-abfs` asset's href."
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "id": "38c563e5-3259-4188-8104-f7ea8df35572",
104 | "metadata": {},
105 | "outputs": [],
106 | "source": [
107 | "asset = collection.assets[\"zarr-abfs\"]\n",
108 | "asset"
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": null,
114 | "id": "c2f7d2c8-30e0-4d1e-aa6f-ae23d0cf86ce",
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "# we need the href without the `abfs://`\n",
119 | "zarr_root = Path(asset.href.split('//', 1)[1])\n",
120 | "zarr_root"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": null,
126 | "id": "e9e64b6f-51c8-4ede-a183-a87a13655aff",
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "# the xarray:storage_options gives us the access token and account name required to connect\n",
131 | "fs = fsspec.filesystem('abfs', **asset.extra_fields['xarray:storage_options'])"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "id": "121b9d98-6059-4652-a5a8-227823238da0",
137 | "metadata": {},
138 | "source": [
139 | "An fsspec filesystem provides an API that includes normal filesystem-related functions, such as listing a path (`ls`), reading files (`open`), or finding file information (`stat`)."
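,
"\n",
"For example, `info` returns the metadata for a single path (the exact keys vary by storage backend, but typically include things like `size` and `type`):\n",
"\n",
"```\n",
"fs.info(str(zarr_root / '.zattrs'))  # -> a dict describing the file\n",
"```\n"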
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "id": "aa3719ff-9b97-4795-8f46-812264d1ee25",
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "# now we can list the zarr root\n",
150 | "fs.ls(str(zarr_root))"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "id": "7a33c523-b57c-46d6-a83f-1b5ebd1e98f6",
156 | "metadata": {},
157 | "source": [
158 | "Note that when we list the zarr root we see some `.z*` files with zarr-related metadata and other such info, which can be used to understand how to access the data in the listed directories. Notice also that the listed directory names predominantly map to the zarr variables listed out in the STAC collection. From this we can be reasonably certain that the data for each variable is nested within the directory of the same name."
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "id": "099bd874-4d2c-4e51-8784-99a1911e2b47",
165 | "metadata": {},
166 | "outputs": [],
167 | "source": [
168 | "# and we can read a file\n",
169 | "with fs.open(str(zarr_root / '.zattrs')) as f:\n",
170 | "    content = f.read()\n",
171 | "\n",
172 | "content"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "id": "15119484-9d65-44f5-8a47-49a3f2b47d53",
178 | "metadata": {},
179 | "source": [
180 | "## Making browsing even easier\n",
181 | "\n",
182 | "Even with fsspec, browsing and reading files has a bit more boilerplate than we might like. We can easily create some simple functions to make an easier API for the types of operations we need to do to explore our zarr."
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": null,
188 | "id": "da8004be-c846-494b-bc04-e01715c39230",
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "# let's make some convenience functions to make browsing easier\n",
193 | "def ls_zarr(path: str) -> list[str]:\n",
194 | "    return fs.ls(str(zarr_root / path))\n",
195 | "\n",
196 | "def read_zarr_file(path: str) -> bytes:\n",
197 | "    with fs.open(str(zarr_root / path)) as f:\n",
198 | "        return f.read()\n",
199 | "\n",
200 | "def read_zarr_json(path: str) -> dict[str, Any]:\n",
201 | "    return json.loads(read_zarr_file(path))\n",
202 | "\n",
203 | "def print_json(_json: dict[str, Any]) -> None:\n",
204 | "    print(json.dumps(_json, indent=4))"
205 | ]
206 | },
207 | {
208 | "cell_type": "markdown",
209 | "id": "a31ee76a-4aec-440a-93c3-ff1277edcdbf",
210 | "metadata": {},
211 | "source": [
212 | "And we can use one of the convenience functions to show how much simpler reading the zarr metadata can be."
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "id": "09a3c3a9-e0a3-49fe-8b09-a722488fe837",
219 | "metadata": {
220 | "scrolled": true
221 | },
222 | "outputs": [],
223 | "source": [
224 | "zmeta = read_zarr_json('.zmetadata')\n",
225 | "print_json(zmeta)"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "id": "4597588d-ebf4-45b1-abfd-29a05779ae1b",
231 | "metadata": {},
232 | "source": [
233 | "### `.zmetadata`\n",
234 | "\n",
235 | "Notice how the `.zmetadata` file starts with the `.zattrs` key, the value of which contains the same content as the `.zattrs` file we read a moment ago? Spoiler alert: `.zmetadata` contains the contents of all the `.z*` files. It's a great way to get an overview of the entire contents of the zarr.
From here we can see a full inventory of the `.z*` file types throughout the archive:\n",
236 | "\n",
237 | "* `.zmetadata`: this file only\n",
238 | "* `.zattrs`: top-level and one in each variable's directory\n",
239 | "* `.zgroup`: top-level, just specifies the zarr version (here version 2)\n",
240 | "* `.zarray`: one in each variable's directory\n",
241 | "\n",
242 | "Note that `.zmetadata` appears only to be present if the zarr has \"consolidated metadata\", as this one does (and apparently that is a zarr extension and not part of the core spec).\n",
243 | "\n",
244 | "### `.zattrs`\n",
245 | "\n",
246 | "`.zattrs` appears to provide user-facing information, such as dataset version information, citations, and details that can help users appropriately interpret the meaning of the data, including units, array dimensions, and the like. Note that the collection also contains much of this information. For our purposes these values can help us know which files we might want to look at and how to interpret their structure and values, but otherwise we will be programmatically skipping over them.\n",
247 | "\n",
248 | "### `.zgroup`\n",
249 | "\n",
250 | "This is super relevant to zarr readers, which need to know what format version the zarr is. For us, we are just exploring whatever we've found here, so it's not really relevant at all. At least not until we get to trying to read the next zarr file and we start wondering why it looks different from this one...\n",
251 | "\n",
252 | "### `.zarray`\n",
253 | "\n",
254 | "Finally, a good one. We're going to be very interested in the `.zarray` files because they tell us key information we need to properly read the data arrays. Specifically, the data type, compression type (note that all arrays here are compressed via the `blosc` algorithm with the same settings), and chunk size fields are ones we're absolutely going to need. Some other fields that could be very relevant for reading data in other files include `filters` and `fill_value`, but we won't need to be concerned with those here."
255 | ]
256 | },
257 | {
258 | "cell_type": "markdown",
259 | "id": "ed9d517f-ad3f-4b54-b039-59741b3c2be2",
260 | "metadata": {},
261 | "source": [
262 | "## Reading an array\n",
263 | "\n",
264 | "We're going to need location information, and our POI is in WGS84 (lat/long), so maybe we start by reading the `lat` variable?\n",
265 | "\n",
266 | "First, let's see what all is in the `lat` directory:"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": null,
272 | "id": "ad52ef08-736a-43f6-b570-b916135cb73d",
273 | "metadata": {},
274 | "outputs": [],
275 | "source": [
276 | "ls_zarr('lat/')"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "id": "6500590c-4155-4723-ae71-049bed939bc1",
282 | "metadata": {},
283 | "source": [
284 | "Not a lot. Maybe if we look at the `.zarray` and `.zattrs` we can learn something?"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": null,
290 | "id": "0d2799a5-37ae-445b-bb62-784e19d4be27",
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "lat_zarray = read_zarr_json('lat/.zarray')\n",
295 | "print_json(lat_zarray)"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "id": "0048f9e7-8c6d-4258-9209-2d0646105fa5",
302 | "metadata": {},
303 | "outputs": [],
304 | "source": [
305 | "lat_zattrs = read_zarr_json('lat/.zattrs')\n",
306 | "print_json(lat_zattrs)"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "id": "f09ce163-2ba1-41ec-aae4-8dbd2c121229",
312 | "metadata": {},
313 | "source": [
314 | "The contents of both of these are exactly what we saw in the top-level `.zmetadata`. No surprise there. And the `.zattrs` contents aren't particularly helpful at the moment.\n",
315 | "\n",
316 | "But the `.zarray`, that explains some things. First, we see the chunk size (`chunks`) is equal to the overall array size (`shape`). This tells us that this particular variable has only one chunk covering the entire extent of the data set to which it is relevant. We can make an educated guess then that `lat/0.0` is the array data for the single chunk.\n",
317 | "\n",
318 | "Let's read it and see what it looks like."
319 | ]
320 | },
321 | {
322 | "cell_type": "code",
323 | "execution_count": null,
324 | "id": "e83e59aa-c0f9-4cd8-96ab-450794f980fb",
325 | "metadata": {},
326 | "outputs": [],
327 | "source": [
328 | "lat_bytes = read_zarr_file('lat/0.0')\n",
329 | "print(f'{lat_bytes[:100]}...')\n",
330 | "print(len(lat_bytes))"
331 | ]
332 | },
333 | {
334 | "cell_type": "markdown",
335 | "id": "613e6a78-1da7-4426-99de-5233e888ad1d",
336 | "metadata": {},
337 | "source": [
338 | "Just a bunch of ugly binary data--as we might have suspected, especially given that it is supposedly compressed.\n",
339 | "\n",
340 | "Looking into the `blosc` compression format appears to indicate that it is `lz4`, so it seems like an `lz4`-compatible codec would be required. It turns out that's not entirely true. To cut a long story short, digging into this and trying to unravel what something like xarray uses to decompress such an array leads to [the `numcodecs` package](https://github.com/zarr-developers/numcodecs). It has a `blosc` codec, which we can use to decompress this byte string and get an array.\n",
341 | "\n",
342 | "We'll be doing that operation a bunch, so let's make another convenience function that we can use to quickly and easily read these compressed array files."
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "id": "991a58d1-22bf-42a9-8455-38def2649e96",
349 | "metadata": {},
350 | "outputs": [],
351 | "source": [
352 | "import numcodecs.blosc\n",
353 | "\n",
354 | "def read_zarr_blosc(path: str) -> bytes:\n",
356 | "    return numcodecs.blosc.decompress(read_zarr_file(path))"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "id": "e4c9881a-b136-418a-a54b-ab13baaca1d4",
363 | "metadata": {},
364 | "outputs": [],
365 | "source": [
366 | "lat_uncompressed = read_zarr_blosc('lat/0.0')\n",
367 | "print(f'{lat_uncompressed[:100]}...')\n",
368 | "print(len(lat_uncompressed))"
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "id": "d91f9090-7985-40ba-96e5-8f2d8c3e6e60",
374 | "metadata": {},
375 | "source": [
376 | "Uncompressed it's still pretty ugly, but again not really unexpected because it's binary data. We need to unpack it into an array and see what that looks like."
377 | ]
378 | },
379 | {
380 | "cell_type": "markdown",
381 | "id": "b3122c97-635f-4d76-9e2a-40de7b3c096f",
382 | "metadata": {},
383 | "source": [
384 | "#### A quick note on data types\n",
385 | "\n",
386 | "The format of the data type is not what we might suspect after all the work we did with the `struct` package and packing/unpacking the TIFF values. Apparently the data type specification here is one natively supported by `numpy`, meaning the `dtype` string can be interpreted directly as a data type by `numpy`:"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "id": "a444bb8d-f97a-447d-89f4-e6cc81f342f1",
393 | "metadata": {},
394 | "outputs": [],
395 | "source": [
396 | "np.dtype('f2')"
397 | ]
398 | },
399 | {
400 | "cell_type": "markdown",
401 | "id": "b72fd872-da77-495f-b8b7-2336e431ff77",
402 | "metadata": {},
403 | "source": [
404 | "`numpy` supports unpacking binary data from a byte string natively via the `frombuffer` function. We merely need to provide the specified data type for it to unpack the bytes properly, and then we can reshape the resulting array into the `chunks` shape."
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": null,
410 | "id": "5ed6a1f6-63b1-47d5-87e2-3359e6c12e8f",
411 | "metadata": {},
412 | "outputs": [],
413 | "source": [
414 | "dt = np.dtype(lat_zarray['dtype'])\n",
415 | "lat_array = np.frombuffer(lat_uncompressed, dtype=dt).reshape(lat_zarray['chunks'])\n",
416 | "lat_array"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "id": "b119229d-0a60-4f65-9241-95c956cf3141",
422 | "metadata": {},
423 | "source": [
424 | "Success! Those values should look like what we'd expect for latitude values in or around Puerto Rico."
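,
"\n",
"If we want more than an eyeball check, a rough bounds assertion works too (Puerto Rico's main island sits roughly between 17.9 and 18.5 degrees north; the generous bounds below are illustrative, not dataset metadata):\n",
"\n",
"```\n",
"print(float(lat_array.min()), float(lat_array.max()))\n",
"assert 16 < float(lat_array.min()) < float(lat_array.max()) < 20\n",
"```\n"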
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "id": "8665e25b-bc84-4238-b354-7687bcc06a18",
430 | "metadata": {},
431 | "source": [
432 | "We can read in the `lon` variable the exact same way:"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": null,
438 | "id": "781b94a3-849f-4809-82ab-a35930165785",
439 | "metadata": {},
440 | "outputs": [],
441 | "source": [
442 | "lon_zarray = read_zarr_json('lon/.zarray')\n",
443 | "dt = np.dtype(lon_zarray['dtype'])\n",
444 | "lon_array = np.frombuffer(read_zarr_blosc('lon/0.0'), dtype=dt).reshape(lon_zarray['chunks'])\n",
445 | "lon_array"
446 | ]
447 | },
448 | {
449 | "cell_type": "markdown",
450 | "id": "56f88c32-27e8-49b5-884c-c0c4afd51074",
451 | "metadata": {},
452 | "source": [
453 | "## Finding the cell coordinates of our POI\n",
454 | "\n",
455 | "Now that we have the `lat` and `lon` variable arrays, we should be able to locate the cell coordinates corresponding to our POI, right?\n",
456 | "\n",
457 | "Well, if we look closely at the arrays we realize something: they are two-dimensional! Neither latitude nor longitude is actually one of the base dimensions of our data cube. If we look more closely at the collection's metadata we realize that the data are in a projected coordinate system. We _could_ presumably use the `lat` and `lon` variable arrays to find our cell coordinates, but doing so would require solving some sort of non-obvious multivariate optimization problem.\n",
458 | "\n",
459 | "Instead, what happens if we open our `x` variable?"
460 | ]
461 | },
462 | {
463 | "cell_type": "code",
464 | "execution_count": null,
465 | "id": "e0ffa22e-33b5-4634-a5ac-c074883e86a5",
466 | "metadata": {},
467 | "outputs": [],
468 | "source": [
469 | "x_zarray = read_zarr_json('x/.zarray')\n",
470 | "dt = np.dtype(x_zarray['dtype'])\n",
471 | "x_array = np.frombuffer(read_zarr_blosc('x/0'), dtype=dt).reshape(x_zarray['chunks'])\n",
472 | "x_array"
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": null,
478 | "id": "e4a91e0e-f561-42d8-91bb-59efdfd3bc26",
479 | "metadata": {},
480 | "outputs": [],
481 | "source": [
482 | "x_array.shape"
483 | ]
484 | },
485 | {
486 | "cell_type": "markdown",
487 | "id": "5d5318e8-f741-41ba-a515-77e13ca97c6a",
488 | "metadata": {},
489 | "source": [
490 | "Oh, look at that, it's a one-dimensional array of the cell x coordinates in the projected coordinate system.\n",
491 | "\n",
492 | "Let's look at `y` too."
493 | ]
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": null,
498 | "id": "be5b9f84-b357-4138-86a7-5e1bdcebbdb1",
499 | "metadata": {},
500 | "outputs": [],
501 | "source": [
502 | "y_zarray = read_zarr_json('y/.zarray')\n",
503 | "dt = np.dtype(y_zarray['dtype'])\n",
504 | "y_array = np.frombuffer(read_zarr_blosc('y/0'), dtype=dt).reshape(y_zarray['chunks'])\n",
505 | "y_array"
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": null,
511 | "id": "471e9afc-860b-4285-b05b-1b0171be5646",
512 | "metadata": {},
513 | "outputs": [],
514 | "source": [
515 | "y_array.shape"
516 | ]
517 | },
518 | {
519 | "cell_type": "markdown",
520 | "id": "9e85e595-379c-41f9-a771-0da4509b17f9",
521 | "metadata": {},
522 | "source": [
523 | "`y` is one-dimensional as well. Not only that, but the shapes of each map to the opposing dimensions of our `lat` and `lon` array shapes. Seems like we're on the right track here, if only we could convert our POI into the projected coordinates...
524 | ]
525 | },
526 | {
527 | "cell_type": "markdown",
528 | "id": "9fa9e46a-e7ef-4ec7-96fe-419ddc20de9b",
529 | "metadata": {},
530 | "source": [
531 | "### Understanding our reference system\n",
532 | "\n",
533 | "One outstanding problem with zarr is that the geospatial extension still has yet to be formalized. This means we don't have a built-in way of grabbing the CRS information from the zarr itself. Planetary Computer gets around this issue by storing the CRS information in the STAC collection, specifically as part of the `cube:dimensions` metadata.\n",
534 | "\n",
535 | "Both the `x` and `y` dimensions have a reference system defined. Thankfully for us, the data provider has made a sane decision to ensure these CRSs are the same, so we only need to use one of them to get a transformer we can use to project our POI coordinates."
536 | ]
537 | },
538 | {
539 | "cell_type": "code",
540 | "execution_count": null,
541 | "id": "2aa8aac7-9874-4182-b054-4a0fbb0914a0",
542 | "metadata": {},
543 | "outputs": [],
544 | "source": [
545 | "collection.extra_fields['cube:dimensions']['x']['reference_system']"
546 | ]
547 | },
548 | {
549 | "cell_type": "code",
550 | "execution_count": null,
551 | "id": "bff4ba3c-cebf-4596-ba8e-b49089996911",
552 | "metadata": {},
553 | "outputs": [],
554 | "source": [
555 | "src_crs = pyproj.CRS.from_json_dict(collection.extra_fields['cube:dimensions']['x']['reference_system'])\n",
556 | "src_crs"
557 | ]
558 | },
559 | {
560 | "cell_type": "code",
561 | "execution_count": null,
562 | "id": "792bbbc3-d62d-4f2b-9288-ae71146b2b77",
563 | "metadata": {},
564 | "outputs": [],
565 | "source": [
566 | "to_src_transformer = pyproj.Transformer.from_crs(\"EPSG:4326\", src_crs)"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": null,
572 | "id": "ce1d6506-057a-4806-a7e5-8cf5494e527c",
573 | "metadata": {},
574 | "outputs": [],
575 | "source": [
576 | "poi_projected_x, poi_projected_y = to_src_transformer.transform(\n",
577 | "    POI.geom.y,\n",
578 | "    POI.geom.x,\n",
579 | ")\n",
580 | "poi_projected_x, poi_projected_y"
581 | ]
582 | },
583 | {
584 | "cell_type": "markdown",
585 | "id": "dab40418-6b9e-44b4-8978-ec3da150ec06",
586 | "metadata": {},
587 | "source": [
588 | "### Calculating our cell coords with a \"geotransform\"\n",
589 | "\n",
590 | "At this point we're really just in regular raster land and find ourselves needing the raster grid's affine transformation--or geotransform in GDAL speak--to convert our coordinates from model space to grid space. Other than having to continue to pull metadata out of the collection to get our inputs, we can calculate the values we need to create that data structure."
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "execution_count": null,
596 | "id": "f0e5839d-ce79-4aa4-9f42-ffe129578b03",
597 | "metadata": {},
598 | "outputs": [],
599 | "source": [
600 | "geotransform = (\n",
601 | "    collection.extra_fields['cube:dimensions']['x']['extent'][0],\n",
602 | "    collection.extra_fields['cube:dimensions']['x']['step'],\n",
603 | "    0,\n",
604 | "    collection.extra_fields['cube:dimensions']['y']['extent'][1],\n",
605 | "    0,\n",
606 | "    collection.extra_fields['cube:dimensions']['y']['step'],\n",
607 | ")\n",
608 | "geotransform"
609 | ]
610 | },
611 | {
612 | "cell_type": "markdown",
613 | "id": "fb0f9a05-4db6-4511-99f8-f4fad323f555",
614 | "metadata": {},
615 | "source": [
616 | "With that, we can finally do the normal conversion of geographic coordinates into grid coordinates."
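,
"\n",
"For reference, the conversion is just the inverse of the geotransform, where integer (floor) division gives the containing cell (here `origin_x`/`step_x` stand in for `geotransform[0]`/`geotransform[1]`, and likewise for y):\n",
"\n",
"```\n",
"col = (x - origin_x) // step_x\n",
"row = (y - origin_y) // step_y\n",
"```\n"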
617 | ]
618 | },
619 | {
620 | "cell_type": "code",
621 | "execution_count": null,
622 | "id": "96f2ddd4-b208-4d2b-8939-fbc00113104d",
623 | "metadata": {},
624 | "outputs": [],
625 | "source": [
626 | "POI_col = int((poi_projected_x - geotransform[0]) // geotransform[1])\n",
627 | "POI_row = int((poi_projected_y - geotransform[3]) // geotransform[5])\n",
628 | "POI_row, POI_col"
629 | ]
630 | },
631 | {
632 | "cell_type": "code",
633 | "execution_count": null,
634 | "id": "23ceb605-9223-4d8a-b7c8-3e1aee83f20b",
635 | "metadata": {},
636 | "outputs": [],
637 | "source": [
638 | "lon_array[POI_row, POI_col], lat_array[POI_row, POI_col]"
639 | ]
640 | },
641 | {
642 | "cell_type": "markdown",
643 | "id": "5de0f372-cfa8-4cf8-a344-4af51fa9939d",
644 | "metadata": {},
645 | "source": [
646 | "Does that look right?"
647 | ]
648 | },
649 | {
650 | "cell_type": "markdown",
651 | "id": "628798ca-ddff-429d-952c-d53c12038e4a",
652 | "metadata": {},
653 | "source": [
654 | "### Another option to find the cell coords\n",
655 | "\n",
656 | "Remember that the `x` and `y` arrays are one-dimensional and increasing/decreasing by a constant step? We could have also used that property to perform a linear search of the coordinate space to get our row and column coordinates."
657 | ]
658 | },
659 | {
660 | "cell_type": "code",
661 | "execution_count": null,
662 | "id": "4722bf1a-f339-432c-aa76-ae2533993f6b",
663 | "metadata": {},
664 | "outputs": [],
665 | "source": [
666 | "row = 0\n",
667 | "POI_row_ = None\n",
668 | "while poi_projected_y < y_array[row]:\n",
669 | "    POI_row_ = row\n",
670 | "    row += 1\n",
671 | "POI_row_"
672 | ]
673 | },
674 | {
675 | "cell_type": "code",
676 | "execution_count": null,
677 | "id": "e6258744-2ca8-4d83-8129-ee772d18c26a",
678 | "metadata": {},
679 | "outputs": [],
680 | "source": [
681 | "col = 0\n",
682 | "POI_col_ = None\n",
683 | "while poi_projected_x >= x_array[col]:\n",
684 | "    POI_col_ = col\n",
685 | "    col += 1\n",
686 | "POI_col_"
687 | ]
688 | },
689 | {
690 | "cell_type": "markdown",
691 | "id": "29176774-f1de-4fe8-8dd3-e93c265d041c",
692 | "metadata": {},
693 | "source": [
694 | "Note that this strategy worked in this case because we had all of our x and all our y coordinate values in single arrays. If we were working with a larger zarr dataset this might not have been the case. Also, this strategy is not particularly efficient. As a general rule it is probably best to stick to the coordinate calculation using the offsets from the grid origin rather than this more brute-force strategy."
695 | ]
696 | },
697 | {
698 | "cell_type": "markdown",
699 | "id": "5ba838a4-6d17-4422-a321-af7ec0be006e",
700 | "metadata": {},
701 | "source": [
702 | "## Reading the `tmax` variable data\n",
703 | "\n",
704 | "If after all this you still remember what we are trying to do, great! If not, it's understandable, so here's a little refresher: we are interested in finding the week of 2020 with the lowest average (mean) `tmax` value. So it seems pertinent that we read in the `tmax` variable data.\n",
705 | "\n",
706 | "Let's take a look at what's in the `tmax` directory."
707 | ]
708 | },
709 | {
710 | "cell_type": "code",
711 | "execution_count": null,
712 | "id": "ee5dbcad-ff8c-42db-9d30-0fb60f6fccd9",
713 | "metadata": {},
714 | "outputs": [],
715 | "source": [
716 | "ls_zarr('tmax')"
717 | ]
718 | },
719 | {
720 | "cell_type": "markdown",
721 | "id": "56e6956d-cee7-4c11-b11e-2bb6ca425f2c",
722 | "metadata": {},
723 | "source": [
724 | "Uh oh.
This is different from what we've run into up till now. We see a bunch of different chunk files here. Maybe if we look at the `.zarray` and `.zattrs` we can figure out something to help us here."
725 | ]
726 | },
727 | {
728 | "cell_type": "code",
729 | "execution_count": null,
730 | "id": "02076d0e-531a-4df4-a05b-94a50eb007ae",
731 | "metadata": {},
732 | "outputs": [],
733 | "source": [
734 | "tmax_zarray = read_zarr_json('tmax/.zarray')\n",
735 | "print_json(tmax_zarray)"
736 | ]
737 | },
738 | {
739 | "cell_type": "code",
740 | "execution_count": null,
741 | "id": "3963bb40-70d3-44a1-9a46-8e98bf65d879",
742 | "metadata": {},
743 | "outputs": [],
744 | "source": [
745 | "tmax_zattrs = read_zarr_json('tmax/.zattrs')\n",
746 | "print_json(tmax_zattrs)"
747 | ]
748 | },
749 | {
750 | "cell_type": "markdown",
751 | "id": "42d17219-1f4f-4770-b022-0250b54dfd34",
752 | "metadata": {},
753 | "source": [
754 | "We see with `tmax` that the chunk size `(365, 231, 364)` is different from the shape `(14965, 231, 364)`. And from `.zattrs` we see the array dimensions listed as `(time, y, x)`. A deductive person could infer that, this being a daily dataset, the chunk having a size of `365` for the `time` dimension means that each chunk is a year's worth of values. We have 41 chunk files (`0.0.0` through `40.0.0`), and `365 * 41 = 14965`. It further stands to reason that the chunks are in chronological order, so we can reasonably assume we want the last chunk, `40.0.0`.\n",
755 | "\n",
756 | "But what good would it do just to be clever here? Instead, let's confirm our theory. To do so we can look at the `time` variable."
757 | ]
758 | },
759 | {
760 | "cell_type": "code",
761 | "execution_count": null,
762 | "id": "7542055e-f718-4d43-93ab-01cabe7bd28b",
763 | "metadata": {},
764 | "outputs": [],
765 | "source": [
766 | "ls_zarr('time')"
767 | ]
768 | },
769 | {
770 | "cell_type": "code",
771 | "execution_count": null,
772 | "id": "0d25b1ff-36cb-4cec-86a7-adfcc3ee4229",
773 | "metadata": {},
774 | "outputs": [],
775 | "source": [
776 | "time_zarray = read_zarr_json('time/.zarray')\n",
777 | "print_json(time_zarray)"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": null,
783 | "id": "539394f1-fcc9-4f80-b386-d218044fb43b",
784 | "metadata": {},
785 | "outputs": [],
786 | "source": [
787 | "suspected_year_chunk = 2020 - 1980\n",
788 | "suspected_year_chunk"
789 | ]
790 | },
791 | {
792 | "cell_type": "code",
793 | "execution_count": null,
794 | "id": "f80a4a2a-8200-4c0e-9b42-038ebb2349a1",
795 | "metadata": {},
796 | "outputs": [],
797 | "source": [
798 | "dt = np.dtype(time_zarray['dtype'])\n",
799 | "time_array = np.frombuffer(read_zarr_blosc(f'time/{suspected_year_chunk}'), dtype=dt).reshape(time_zarray['chunks'])\n",
800 | "time_array"
801 | ]
802 | },
803 | {
804 | "cell_type": "markdown",
805 | "id": "32f6118f-fc13-4b46-a3da-421c17ccf383",
806 | "metadata": {},
807 | "source": [
808 | "Per the metadata for the `time` array, we can interpret each of these values as the date of its cell, expressed as days since 1980-01-01.
We can use the `datetime` module to calculate the date given by cell 0 in the array above:"
809 | ]
810 | },
811 | {
812 | "cell_type": "code",
813 | "execution_count": null,
814 | "id": "4d13b57c-9804-4bd4-8068-6276f366dda5",
815 | "metadata": {},
816 | "outputs": [],
817 | "source": [
818 | "# to confirm the dates match up\n",
819 | "chunk_start_date = datetime.date.fromisoformat('19800101') + datetime.timedelta(days=int(time_array[0]))\n",
820 | "print(chunk_start_date)"
821 | ]
822 | },
823 | {
824 | "cell_type": "markdown",
825 | "id": "af7468a0-ae3d-4a12-bff0-73694ddb4acf",
826 | "metadata": {},
827 | "source": [
828 | "Ah, great, proof that chunk 40 is the chunk we want for 2020 values."
829 | ]
830 | },
831 | {
832 | "cell_type": "markdown",
833 | "id": "d4afc2b0-695e-4b73-a4b9-0b131da78dfa",
834 | "metadata": {},
835 | "source": [
836 | "Now that we know which chunk we want, let's finally read it. Note that this array is significantly bigger--365 times bigger, to be exact--than our `lat` and `lon` arrays. As a result, it might take a bit longer to download, extract, and unpack into an array. Be patient. In testing this typically took on the order of 30-40 seconds."
837 | ]
838 | },
839 | {
840 | "cell_type": "code",
841 | "execution_count": null,
842 | "id": "ba833042-4a58-43f3-923c-a4d5d0553eb2",
843 | "metadata": {},
844 | "outputs": [],
845 | "source": [
846 | "dt = np.dtype(tmax_zarray['dtype'])\n",
847 | "tmax_array = np.frombuffer(read_zarr_blosc(f'tmax/{suspected_year_chunk}.0.0'), dtype=dt).reshape(tmax_zarray['chunks'])\n",
848 | "tmax_array.shape"
849 | ]
850 | },
851 | {
852 | "cell_type": "markdown",
853 | "id": "fb0125c4-591e-4a34-b4df-71c77733e64a",
854 | "metadata": {},
855 | "source": [
856 | "What were we trying to do with these values again? Oh yeah, find the average max temperature of each week in the year, so we can identify the minimum. We can do this pretty trivially with `numpy` by:\n",
857 | "\n",
858 | "* slicing out only the stack of temperature values we're interested in for our POI cell\n",
859 | "* reshaping the array into rows of seven values\n",
860 | "* finding the mean along the row axis (gives us the mean of each week)\n",
861 | "* finding the min value within that weekly mean array"
862 | ]
863 | },
864 | {
865 | "cell_type": "code",
866 | "execution_count": null,
867 | "id": "1e15085b-e40f-46a3-98f7-d363cf85e727",
868 | "metadata": {},
869 | "outputs": [],
870 | "source": [
871 | "# we only take 364 temperature values because 365 isn't divisible by 7 😕\n",
872 | "POI_tmax_array = tmax_array[:364, POI_row, POI_col].reshape(52, 7)\n",
873 | "POI_tmax_array"
874 | ]
875 | },
876 | {
877 | "cell_type": "code",
878 | "execution_count": null,
879 | "id": "5ac83113-4840-4baf-8ded-ea24bdff1403",
880 | "metadata": {},
881 | "outputs": [],
882 | "source": [
883 | "POI_tmax_mean_array = np.mean(POI_tmax_array, axis=1)\n",
884 | "POI_tmax_mean_array"
885 | ]
886 | },
887 | {
888 | "cell_type": "code",
889 | "execution_count": null,
890 | "id": "24b75f06-a333-47ed-bddd-f3944eca1916",
891 | "metadata": {},
892 | "outputs": [],
893 | "source": [
894 | "POI_tmax_min = np.min(POI_tmax_mean_array)\n",
895 | "POI_tmax_min_index = np.argmin(POI_tmax_mean_array)\n",
896 | "print(f'The lowest tmax temp {POI_tmax_min} occurred in week {POI_tmax_min_index} of 2020.')"
897 | ]
898 | },
899 | {
900 | "cell_type": "markdown",
901 | "id": "3d66a4b4-a018-4f85-a39b-eafe1fb87a89",
902 | "metadata": {},
903 | "source": [
904 | "This is a great result!
But it would be good if we could turn that into an actual date. Good thing we read in that `time` array: we can slice it on weekly bounds to get the date of the first day of the week in question."
905 | ]
906 | },
907 | {
908 | "cell_type": "code",
909 | "execution_count": null,
910 | "id": "e86e7984-9c83-4ca7-9f80-7dfecbfd12a0",
911 | "metadata": {},
912 | "outputs": [],
913 | "source": [
914 | "week__days_since_1980 = int(time_array[::7][POI_tmax_min_index])\n",
915 | "week__days_since_1980"
916 | ]
917 | },
918 | {
919 | "cell_type": "code",
920 | "execution_count": null,
921 | "id": "853d2add-a54a-4e08-a23c-41af7677a2bf",
922 | "metadata": {},
923 | "outputs": [],
924 | "source": [
925 | "best_week_of_the_year_start = datetime.date.fromisoformat('19800101') + datetime.timedelta(days=week__days_since_1980)\n",
926 | "print(f'The best day to arrive in Puerto Rico, from our data, was {best_week_of_the_year_start}. Now go book your time machine tickets. 😁')"
927 | ]
928 | }
929 | ],
930 | "metadata": {
931 | "kernelspec": {
932 | "display_name": "Python 3 (ipykernel)",
933 | "language": "python",
934 | "name": "python3"
935 | },
936 | "language_info": {
937 | "codemirror_mode": {
938 | "name": "ipython",
939 | "version": 3
940 | },
941 | "file_extension": ".py",
942 | "mimetype": "text/x-python",
943 | "name": "python",
944 | "nbconvert_exporter": "python",
945 | "pygments_lexer": "ipython3",
946 | "version": "3.12.10"
947 | }
948 | },
949 | "nbformat": 4,
950 | "nbformat_minor": 5
951 | }
952 |
--------------------------------------------------------------------------------
/notebooks/03_free-range-artisanal-grass-fed-kerchunk.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "31a3b80e-999a-4507-b0d9-fe374283a2a8",
6 | "metadata": {},
7 | "source": [
8 | "# Free-Range Artisanal Grass-Fed Kerchunk\n",
9 | "\n",
10 | "Let's build a kerchunk reference file _by hand_ for the TIFF we read in the \"Reading COGs the hard way\" exercise. To test if we did it right we can open it up with xarray using its kerchunk/zarr support.\n",
11 | "\n",
12 | "As always, let's get imports out of the way first."
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": null,
18 | "id": "0cfacc0e-6717-4860-9cc9-76d3ba7f8086",
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "import json\n",
23 | "import math\n",
24 | "\n",
25 | "from pathlib import Path\n",
26 | "\n",
27 | "import xarray"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "id": "6d6c5c59-83d0-4165-85d3-278590475132",
33 | "metadata": {},
34 | "source": [
35 | "## Getting our TIFF attributes\n",
36 | "\n",
37 | "We already did this in the first exercise! We could do it again if we felt it was necessary, but to save some time we'll start with those attributes already copied out of the first exercise's outputs.\n",
38 | "\n",
39 | "Do pay attention to what attributes we have defined here. Everything here will be required as we build up our kerchunk file.\n",
40 | "\n",
41 | "### A note on data type\n",
42 | "\n",
43 | "If you recall, the TIFF from exercise one had a data type of `uint16`. However, astute readers will note that the data type (`dtype`) defined here is `