├── figures └── race-climate-change-climate-activists-small.jpg ├── contributing.md ├── LICENSE ├── .gitignore ├── code-of-conducts.md └── README.md /figures/race-climate-change-climate-activists-small.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blutjens/awesome-earth-system-ml/HEAD/figures/race-climate-change-climate-activists-small.jpg -------------------------------------------------------------------------------- /contributing.md: -------------------------------------------------------------------------------- 1 | # Contributions 2 | 3 | Contributions are incredibly welcome! Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. 4 | 5 | ## What counts as an awesome-earth-system-ml dataset? 6 | - The list aims to help people who develop *algorithms*. 7 | - We have excluded data products, i.e., algorithm-based predictions, for now because they would be used in less obvious ways, e.g., semi-supervised labeling. 8 | 9 | ## Guideline 10 | 11 | - We'd appreciate entries in the form of: 12 | ``` 13 | - [**Name**](link.com) *(list_of_authors, years_available)* \ 14 | A dataset from with with . 15 | 16 | ``` 17 | - The description reads best if it's concise. 18 | - Feel free to add a section if needed. 19 | - Remember to add the section title to the Table of Contents. 20 | - Make sure spelling and grammar are correct. 21 | - Remove trailing whitespace. 
-------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Björn Lütjens (he/him/er) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 
92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /code-of-conducts.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, caste, color, religion, or sexual identity 10 | and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 
14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 
55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Attribution 60 | 61 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 62 | version 2.0, available at 63 | [https://www.contributor-covenant.org/version/2/0/code_of_conduct.html][v2.0]. 64 | 65 | Community Impact Guidelines were inspired by 66 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 67 | 68 | For answers to common questions about this code of conduct, see the FAQ at 69 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available 70 | at [https://www.contributor-covenant.org/translations][translations]. 71 | 72 | [homepage]: https://www.contributor-covenant.org 73 | [v2.0]: https://www.contributor-covenant.org/version/2/0/code_of_conduct.html 74 | [Mozilla CoC]: https://github.com/mozilla/diversity 75 | [FAQ]: https://www.contributor-covenant.org/faq 76 | [translations]: https://www.contributor-covenant.org/translations -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Awesome-earth-system-ml [![Awesome](https://awesome.re/badge-flat2.svg)](https://awesome.re) 2 | **Awesome-earth-system-ml** is a curated list of datasets from dynamic Earth system models for the climate-interested machine learning community. The list targets data from climate, weather, atmosphere, ocean, flood, cryosphere, or other models and sciences. 3 | 4 | Getting started with data science in Earth system modeling is challenging. The lack of accessible datasets and the plethora of evaluation options are among the reasons why. So, this list of datasets and benchmarks intends to get you started with building machine learning models for analysing dynamical Earth systems. 
5 | 6 | This list represents an inclusive community. We would very much appreciate it if you added your favorite datasets via a pull request or by emailing lutjens at mit [dot] edu. 7 | 8 | LONDON/ENGLAND – FEBRUARY 22 2020: Black Extinction Rebellion protester holding a 'THERE IS NO PLANET B' sign at the February 2020 March in collaboration with Parents 4 Future by JessicaGirvan on Shutterstock 9 | 10 | Photo of climate activists holding a THERE IS NO PLANET B sign by 11 | [**Jessica Girvan**](https://www.dreamstime.com/Jessicagirvan94_info) 12 | on [Shutterstock](https://www.shutterstock.com/image-photo/londonengland-february-22-2020-black-extinction-1654798531) 13 | 14 | ## Content 15 | - [**Air quality**](#air-quality) 16 | - [**Atmosphere and Precipitation**](#atmosphere-and-precipitation) 17 | - [**Climate**](#climate) 18 | - [**Climate risk**](#climate-risk) 19 | - [**Cryosphere**](#cryosphere) 20 | - [**Flooding**](#flooding) 21 | - [**Land surface, forest, and biodiversity**](#land-surface-forest-and-biodiversity) 22 | - [**Ocean**](#ocean) 23 | - [**Renewables wind and solar**](#renewables-wind-and-solar) 24 | - [**Scientific machine learning and numerical methods**](#scientific-machine-learning-and-numerical-methods) 25 | - [**Weather**](#weather) 26 | - [**Wildfire**](#wildfire) 27 | - [**Awesome-awesome**](#awesome-awesome) 28 | 29 | ## Air Quality 30 | - [**TOAR: Tropospheric Ozone Assessment Report**] 31 | todo 32 | 33 | ## Atmosphere and Precipitation 34 | - [**SEVIR : A Storm Event Imagery Dataset for Deep Learning Applications in Radar and Satellite Meteorology**](https://sevir.mit.edu/nowcasting) *(Veillette et al., 21)* \ 35 | ML-ready dataset for forecasting (nowcasting) storm events. US dataset of 10,000 weather events that each consist of 384 km x 384 km image sequences spanning 4 hours. Contains 5 bands: 3x GOES-16 advanced baseline imager, NEXRAD vertically integrated liquid, and GOES-16 Geostationary Lightning Mapper. 
Used in [1](https://proceedings.neurips.cc/paper/2020/hash/fa78a16157fed00d7a80515818432169-Abstract.html). 36 | 37 | - [**CUMULO : A Dataset for Learning Cloud Classes**](https://github.com/FrontierDevelopmentLab/CUMULO) *(Zantedeschi et al., 19)* \ 38 | ML-ready dataset for classification of clouds. Global dataset at 1km spatial and daily resolution for 2008, 2009 and 2016. Includes 300K annotated images with multispectral image (MODIS), radar (CloudSat), and lidar (CLDCLASS and CALIOP). Used in [1](https://arxiv.org/abs/1911.04227). 39 | 40 | - [**RainNet: A Large-Scale Imagery Dataset and Benchmark for Spatial Precipitation Downscaling**](https://neuralchen.github.io/RainNet/) *(X. Chen and K. Feng et al., 22)* \ 41 | ML-ready dataset for superresolution of precipitation. US East dataset at 12 and 4km spatial and hourly resolution for 17 years creating >60K snapshot images at 208x333 and 624x999 resolution totaling 360GB. Includes StageIV and NLDAS data assimilation projects from gauges, radar, and satellite. Contains precipitation (mm/hr) as in- and output variable. Evaluates domain-informed reconstruction (MPPE, HRRE, CPMSE, AMMD, RMSE) and dynamic metrics (HRTS and CMD). Used in [1](https://arxiv.org/abs/2012.09700). 42 | 43 | - [**Fast and accurate learned multiresolution dynamical downscaling for precipitation**](https://github.com/lzhengchun/DSGAN) *(J. Wang et al., 20)* \ 44 | ML-ready dataset for superresolution of precipitation. US dataset at 50 and 12km spatial and 3-hourly resolution for 2005 creating ~3K snapshot images at 128x64 and 512x256. Includes WRF RCM simulation from NCEP-R2 climate model (RainNet in comparison contains observational data). Contains output variables (high-res. precipitation) and input variables (low-res. precipitation, vertically integrated water vapor, sea level pressure, 2m air temperature, and high-res. topography). 
Evaluates MSE, Jensen-Shannon distance of probability density functions, and extreme precipitation occurrences on global and local scales. Used in [1](https://doi.org/10.5194/gmd-14-6355-2021). 45 | 46 | - [**SP-CAM**] 47 | 48 | - [**NCAR CAM**] 49 | 50 | ## Climate 51 | - [**ClimateBench**](https://doi.org/10.1029/2021MS002954) *(Watson-Parris et al., 22)* \ 52 | ML-ready dataset for forecasting the climate response to aerosols. Global dataset at 2° spatial and yearly resolution creating videos of 96x144 images, totaling approx. 2GB of storage. Includes carbon dioxide, methane, sulfur dioxide, and soot forcings as predictors and temperature, diurnal temperature range, and precipitation as target variables. Includes CMIP6's AerChemMIP, NorESM2, ScenarioMIP, and DAMIP data. Evaluates RMSE. Used in [1](https://arxiv.org/abs/2002.00469), [2](https://doi.org/10.1029/2020MS002109), [3](https://doi.org/10.1002/qj.4180), and [4](https://arxiv.org/abs/2008.08626). 53 | 54 | - [**ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models**](https://github.com/RolnickLab/climart) *(Cachay and Ramesh et al., 22)* \ 55 | ML-ready dataset for forecasting atmospheric radiative transfer parametrizations. Contains 10M samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. Used in [1](https://arxiv.org/abs/2111.14671). 56 | 57 | - [**PaleoJump: A database for abrupt transitions in past climates**](https://paleojump.github.io/) *(Bagniewski et al., 2022)* \ 58 | Raw dataset for forecasting climatic shocks and analyzing paleoclimate. Global dataset from 123 sites with point data from 4M years ago until present. Includes PANGAEA and NCEI/NOAA datasets. Contains 49 marine-sediment cores, 29 speleothems, 18 lake sediment cores, 16 terrestrial records, and 11 ice cores. Used in [1](https://arxiv.org/abs/2206.06832). 
59 | 60 | - [**CMIP6**](https://pangeo-data.github.io/pangeo-cmip6-cloud/accessing_data.html#preprocessing-the-cmip6-datasets) *(WCRP, 2019)* \ 61 | Raw comprehensive dataset of 100+ climate models under various emission scenarios. 62 | 63 | ## Climate Risk 64 | - todo 65 | 66 | ## Cryosphere 67 | - [**MAR**]\ 68 | todo 69 | 70 | ## Flooding 71 | - [**NEMO: Digital Twin Earth**]\ 72 | todo 73 | 74 | ## Land surface, forest, and biodiversity 75 | - [**EarthNet2021x: Forecasting High-Resolution Earth Multispectral Imagery**](https://www.earthnet.tech/) *(Requena-Mesa et al., 2021)* \ 76 | ML-ready dataset for video prediction of land surface dynamics. Germany-centric dataset at 20m spatial and 5-day resolution creating 32K image samples of size 128x128 with 30 time steps, totaling 218GB of storage. Contains multispectral satellite imagery, cloud masks, elevation, land cover, rainfall, pressures, and temperatures. Includes Sentinel-2, EU-DEM, E-OBS data. Evaluates MAD, OLS, EMD, and SSIM. Used in [1](https://arxiv.org/abs/2104.10066). 77 | 78 | - [**CESM CLM**]\ 79 | Raw dataset from a go-to model for global land surface dynamics. 80 | 81 | - See [Awesome-awesome](#awesome-awesome) for more forest data. 82 | 83 | ## Ocean 84 | - todo 85 | 86 | ## Renewables wind and solar 87 | - [**WiSoSuper: Benchmarking Super-Resolution Methods on Wind and Solar Data**](https://github.com/RupaKurinchiVendhan/WiSoSuper) *(Kurinchi-Vendhan et al., 2022)* \ 88 | ML-ready dataset for superresolution of wind and solar data. US dataset at 10 and 2km spatial and 4-hourly (wind) and 20 and 4km spatial and hourly resolution (solar) from 2007 to 2018. Includes NREL WIND and NSRDB solar data. Contains output variables (high-res. westward wind velocity, southward wind velocity, direct normal irradiance, diffuse horizontal irradiance) and input variables (low-res. bilinearly interpolated version of HR variables). Evaluates RMSE, kinetic energy spectrum, and solar semivariogram. 
Used in [1](https://doi.org/10.1073/pnas.1918964117), [2](https://arxiv.org/abs/2109.08770). 89 | 90 | ## Scientific machine learning and numerical methods 91 | - [**PDEBench: An Extensive Benchmark for Scientific Machine Learning**](https://github.com/pdebench/pdebench) *(Takamoto et al., 2022)* \ 92 | ML-ready dataset for forecasting various PDEs from hydromechanics. Includes 6 basic and 3 advanced problems. The basic PDEs are 1D advection, Burgers, Diffusion-Reaction, Diffusion-Sorption equations and 2D Diffusion-Reaction and Darcy Flow. The advanced PDEs are incompressible Navier-Stokes equations (NSE) and compressible NSE, and shallow-water equations. Evaluates RMSE, normalized RMSE, RMSE on boundary, RMSE of conserved value, RMSEs in low-, mid- or high-pass Fourier space. Used in [1](https://arxiv.org/abs/2210.07182). 93 | 94 | ## Weather 95 | - [**RainBench: Towards Data-Driven Global Precipitation Forecasting from Satellite Imagery**](https://github.com/FrontierDevelopmentLab/PyRain) *(Schroeder de Witt et al., 2021)* \ 96 | ML-ready dataset for forecasting precipitation. Global dataset at 1.4° and 5.625° spatial and hourly resolution creating images of size 32x64 with 3 vertical grid points from 2016-2019. Includes ERA5 reanalysis, SimSat simulated satellite data, and IMERG global precipitation estimates. Contains output variables (ERA5 total precipitation, IMERG precipitation), dynamic input variables (geopotential, temperature, humidity, cloud liquid water content, cloud ice water content, surface pressure, 2-meter temperature, and cloud-brightness temperatures), and static input variables (lat, lon, land-sea mask, orography, soil type). Evaluates RMSE. Used in [1](https://doi.org/10.1609/aaai.v35i17.17749). 97 | 98 | - [**WeatherBench: A benchmark dataset for data-driven weather forecasting**](https://github.com/pangeo-data/WeatherBench) *(Rasp et al., 2020)* \ 99 | ML-ready dataset for forecasting weather. 
Global dataset at 1.4° and 5.625° spatial and hourly resolution creating images of size 128x256 to 32x64 with 13 vertical grid points, totaling 191GB of storage. Includes geopotential, temperature, humidity, wind, potential vorticity, solar radiation, and others. Includes ERA5 and CMIP-MPI-ESM-HR data. Evaluates RMSE. Used in [1](https://arxiv.org/abs/2002.00469), [2](https://doi.org/10.1029/2020MS002109), [3](https://doi.org/10.1002/qj.4180), and [4](https://arxiv.org/abs/2008.08626). 100 | 101 | - [**ERA5**](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5) *(ECMWF, 2020)*\ 102 | Raw hourly reanalysis estimate of atmospheric, land and oceanic variables. Global, 30km grid, with 137 vertical nodes in the atmosphere, including uncertainties, 1959-present. Used in [FourCastNet](https://arxiv.org/abs/2202.11214) and [Keisler et al., 22](https://rkeisler.github.io/graph_weather/). 103 | 104 | ## Wildfire 105 | - [**PYROCAST: Machine learning pipeline for pyrocumulonimbus (pyroCb) forecasting**](https://spaceml.org/repo/project/63691212f97150000d504d4d) *(Tazi et al., 2022)* \ 106 | Unclear. Code not yet public. 107 | 108 | ## Awesome-awesome 109 | - [**Awesome Forests**](https://github.com/blutjens/awesome-forests) \ 110 | A curated list of ground-truth forest datasets for the machine learning and forestry community. 111 | 112 | - [**Awesome Flood Datasets**](https://github.com/blutjens/awesome-flood-datasets) \ 113 | A curated list of flood datasets. 114 | 115 | - [**Awesome satellite imagery datasets**](https://github.com/chrieke/awesome-satellite-imagery-datasets) \ 116 | A list of more satellite imagery datasets with annotations for deep learning and computer vision. 117 | 118 | - [**Awesome GIS**](https://github.com/sshuair/awesome-gis) \ 119 | A list of GIS resources. 
120 | 121 | - todo: add link to dataset list on [conservationtech.directory](https://conservationtech.directory/) 122 | 123 | ## Acknowledgements 124 | - This list was only possible to assemble thanks to the extensive input from Duncan Watson-Parris, Paula Harder, and Fabrizio Falasca. 125 | --------------------------------------------------------------------------------