├── .gitignore
├── README.md
├── data
├── mc4MaxIrrigatedHaLabels.tif
├── sample_v2a.geojson
└── sample_v3b.geojson
├── ellecp
├── README.md
├── archive
│ ├── Downscaling.ipynb
│ ├── data_processing.ipynb
│ └── time-series.ipynb
└── models.ipynb
├── gim_v2a.Rmd
├── gim_v3.Rmd
├── gim_v3b.Rmd
├── globalIrrigationMap.png
├── python
├── assessor.py
├── classifier.py
├── common.py
├── features_exporter.py
├── label_sampler.py
├── post_processor.py
├── run.py
├── sample_image_exporter.py
├── sampler.py
└── training_sample_exporter.py
└── results
├── diff2001vs2015.png
├── diff2001vs2015.tif
├── v3b_combined_2001.png
├── v3b_combined_2001.tif
├── v3b_combined_2002.tif
├── v3b_combined_2003.tif
├── v3b_combined_2004.tif
├── v3b_combined_2005.tif
├── v3b_combined_2006.tif
├── v3b_combined_2007.tif
├── v3b_combined_2008.tif
├── v3b_combined_2009.tif
├── v3b_combined_2010.tif
├── v3b_combined_2011.tif
├── v3b_combined_2012.tif
├── v3b_combined_2013.tif
├── v3b_combined_2014.tif
├── v3b_combined_2015.png
└── v3b_combined_2015.tif
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | .DS_Store
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Global Irrigation Map
2 |
3 | This repository contains the code written for the global irrigation map project. This README documents the key elements of the software architecture.
4 |
5 | * See [research paper](https://www.sciencedirect.com/science/article/abs/pii/S0309170821000658) for additional details
6 | * If paywalled, see [final preprint](https://www.researchgate.net/publication/350764143_A_new_dataset_of_global_irrigation_areas_from_2001_to_2015)
7 | * Results dataset available [on Zenodo](https://zenodo.org/record/4659476)
8 | * Concepts and how we built it: [this blog post](https://ndeepak.com/posts/2020-08-05-ml-on-gee/)
9 |
10 | ## Data Flow
11 |
12 | The following diagram captures the big picture of the software for the project. It uses Google Earth Engine in all but one of the stages.
13 |
14 | 
15 |
16 | Data flows from left to right. There are four stages in producing the maps, and a final stage of consuming the maps via interactive apps.
17 |
18 | In the first stage, we take a random sample of the features we wish to explore and possibly use for our model. We took 10000 points for our model. The input is the features we want to use from within GEE; the output is a GeoJSON file that can be read as a spatial dataframe within R.
19 |
20 | In the second stage, we analyze the sample to select features, select a classification algorithm, and also tune its hyper-parameters. We do this within R on our laptop.
21 |
22 | In the third stage, we create features for our model. These are the features we selected in the previous stage. They will be stored as multi-band raster images within GEE, one per year of prediction, to be used directly as inputs in the next stage.
23 |
24 | This step might seem unnecessary, but doing feature processing and running the model in a single pass becomes very expensive and leads to errors on GEE. It is better to have the features ready before running the model.
25 |
26 | In the fourth stage, we run our ported model on the features we produced. We store the results within GEE as raster images.
27 |
28 | Finally, we provide user access to the maps by developing interactive apps within GEE. These apps, written in JavaScript, allow us to compare irrigation across years, zoom into a region, or overlay satellite imagery to check if the map makes sense.
29 |
30 | ## Environment Setup
31 |
32 | For the complete development workflow, we will need two environments. The first is in Python, to run our programs that invoke the Google Earth Engine APIs. The second is in R, for data analysis and model development.
33 |
34 | For Python, follow the official [GEE documentation](https://developers.google.com/earth-engine/python_install-conda) to set up a local conda environment. This repository uses Python 3 and will *not* work on Python 2.x. Be sure to run `earthengine authenticate` so that your credentials are saved on the computer.
35 |
36 | For R, the two key packages you need are sf and caret. Install them along with their dependencies.
37 |
38 | For GEE apps, it is best to develop within the GEE code editor. You can clone the repository locally, but the GEE code environment is richer: it allows you to see the maps and also inspect individual points.
39 |
40 | After all this is set up, you can fork and/or clone this repository. Be sure to work on a branch rather than master.
41 |
42 | When you first clone this repository, set the base asset directory to your GEE directory path. In my case, I had "users/deepakna/w210_irrigated_croplands".
43 |
44 | ## Sample Run
45 |
46 | For this example, we will not do any model development. We will simply try to use what is already there. We will do a run for only one year.
47 |
48 | Before you start a run, bump up the version in common.py. This is a best practice to ensure immutability. Commit after each version bump.
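
As an illustration only, a version constant in common.py might be used like this. The names below are hypothetical, not the repository's actual ones:

```python
# Hypothetical sketch: a single version constant that every exported asset
# path embeds, so each run's outputs stay immutable and easy to archive.
MODEL_VERSION = "v3b"  # bump this before every run
BASE_ASSET_DIR = "users/your_user/your_project"  # your own GEE asset path

def versioned_asset(name: str) -> str:
    """Build a GEE asset path that embeds the current model version."""
    return f"{BASE_ASSET_DIR}/{MODEL_VERSION}/{name}"
```

With this pattern, re-running after a version bump writes to a fresh path instead of overwriting the previous run's assets.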
49 |
50 | ### 1. Create random sample
51 |
52 | By default, it creates 10000 points. If you want, you can change it in sampler.py.
53 |
54 | ```shell
55 | python3 training_sample_exporter.py
56 | ```
57 |
58 | This step takes about 20-30 minutes.
59 |
60 | ### 2. Create feature store
61 |
62 | Keep only the year 2000 in the years list within features_exporter.py, then run it:
63 |
64 | ```shell
65 | python3 features_exporter.py
66 | ```
67 |
68 | This step takes about 3 to 4 hours with the default set of features.
69 |
70 | ### 3. Run the model
71 |
72 | Again, be sure to keep only the year 2000 in the years list within classifier.py, then run it:
73 |
74 | ```shell
75 | python3 classifier.py
76 | ```
77 |
78 | This step takes about 2 hours with the default model parameters.
79 |
80 | If you prefer, there is a utility script, run.py, that will run both steps 2 and 3 for you.
81 |
82 | 2000 is the year with training labels, so predicting on that year will allow you to assess model performance. Predicting on any other year will only give you the map for that year.
83 |
84 | If you think you made a mistake, press Ctrl-C to stop the script. However, the job may still be running on the GEE servers. To stop it, go to the GEE code environment and cancel it from the "Tasks" pane.
85 |
86 | ## Adding More Features
87 |
88 | More often, you will want to add or change features in your model and see how it performs. For example, you might want to try features from a new soil dataset within GEE. Here are the steps:
89 |
90 | ### A. Get updated sample
91 |
92 | 1. Bump up the version
93 | 2. If you want to try an additional feature in a known dataset, add your feature to the `dataset_list` within common.py, under `allBands`. If it is a new dataset, copy an example dataset entry and update it to point to your new dataset location and features. Again, any features initially go to `allBands`.
94 | 3. Re-run training_sample_exporter.py to get a training sample.
95 |
96 | ### B. Check if it improves model
97 |
98 | 1. Copy the R notebook, point it to your new training sample, and run it; this takes about 15 minutes. Be sure to look at its final features, assessment results, and tuned model parameters.
99 | 2. If your feature was not significant, stop here.
100 |
101 | ### C. If improved, re-run model
102 |
103 | 1. If your new feature was found to be significant, add it to `selectedBands` in the feature configuration.
104 | 2. Re-run features_exporter.py to create the new feature store.
105 | 3. Update model parameters, if required, in classifier.py. Re-run classifier.py.
106 |
107 | ## Improving Resolution
108 |
109 | The model is currently at 8km spatial resolution. This matches the resolution of the labels, which come from the MIRCA2000 dataset.
110 |
111 | If you want to improve the resolution to, say, 4km or 1km, there are two challenges.
112 |
113 | First, there is the "data science" challenge of how to apply the 8km labels to a smaller parcel of land. You have to apply some intelligence, such as applying the label to the part of the original square that looks like cropland, based on some of its features, say the vegetation index.
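
One possible way to sketch that idea, using NDVI as the cropland proxy. This is an illustration, not the project's implementation; the `downscale_label` helper and its threshold are made up:

```python
import numpy as np

# Hedged sketch: push one 8km irrigation label down to 4km by assigning it
# to the sub-cells that look most like cropland, using NDVI as the proxy.
def downscale_label(label_ha: float, ndvi_subcells: np.ndarray,
                    ndvi_threshold: float = 0.4) -> np.ndarray:
    """Split one coarse label across a block of finer sub-cells.

    Sub-cells with NDVI at or above the threshold share the irrigated area
    in proportion to their NDVI; the rest get zero.
    """
    weights = np.where(ndvi_subcells >= ndvi_threshold, ndvi_subcells, 0.0)
    total = weights.sum()
    if total == 0.0:  # no cropland-like sub-cell: spread uniformly
        return np.full_like(ndvi_subcells, label_ha / ndvi_subcells.size)
    return label_ha * weights / total

# 120 ha split across a 2x2 block: 70 ha to the 0.7 cell, 50 ha to the 0.5 cell.
block = np.array([[0.7, 0.1], [0.5, 0.2]])
shares = downscale_label(120.0, block)
```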
114 |
115 | Second, there is the engineering challenge, because data grows quadratically with resolution - i.e., you have an O(N²) problem at hand. You will likely run into GEE errors. When that happens, start by reducing the number of regions processed in parallel by the feature extractor component, in the `get_selected_features_image()` function. At the 8km scale, it splits the world into 2 collections of smaller regions. The same function is also used by the classifier, so your changes will automatically carry over to that component.
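
To make the quadratic growth concrete (a toy helper, not project code):

```python
def pixel_growth(old_res_km: float, new_res_km: float) -> float:
    """Factor by which the pixel count (and roughly the processing cost)
    grows when moving to a finer resolution."""
    return (old_res_km / new_res_km) ** 2

print(pixel_growth(8, 4))  # 4.0: halving the pixel size quadruples the data
print(pixel_growth(8, 1))  # 64.0
```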
116 |
117 | ## Components
118 |
119 | ### Random Sampler
120 |
121 | This component is set up to take a random sample worldwide. The code calculates the area of each global land region and assigns the number of points to each region in proportion to it. Oceania causes an explosion of geometries, so the code limits that region to only New Zealand and Papua New Guinea. It uses the LSIB dataset on GEE to get the boundary polygons for each geographical region.
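
The area-proportional allocation can be sketched like this; the region names and areas below are illustrative round numbers, not LSIB-derived values:

```python
def allocate_points(region_areas: dict, total_points: int) -> dict:
    """Assign sample points to each region in proportion to its land area."""
    total_area = sum(region_areas.values())
    return {name: round(total_points * area / total_area)
            for name, area in region_areas.items()}

# Illustrative areas in million km^2, not actual LSIB figures.
areas = {"Asia": 44.6, "Africa": 30.4, "North America": 24.7}
allocation = allocate_points(areas, 10000)
```

Rounding means the totals can differ from the target by a point or two; in practice that does not matter for a 10000-point sample.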
122 |
123 | A key question here is how many samples to take. The answer is: "as many samples as needed to capture all the data variability". In other words, if you have a high degree of class imbalance, you will need a larger sample to account for minority-class variability. In practice, I looked at the map to see if there were enough points within the "irrigated" class, and 10000 seemed to be enough. There may also be more rigorous statistical ways to do this.
124 |
125 | The sampler captures all the features under the `allBands` key. This allows the next component to select the features that are significant. It automatically includes latitude, longitude, and irrigation labels from the MIRCA2000 dataset.
126 |
127 | Code: sampler.py, training_sample_exporter.py
128 |
129 | ### Modeler
130 |
131 | This component is in R, and uses the caret package to quickly and efficiently try out different models and tune them. The input is the sample taken earlier, read into an R dataframe. The choice of models is limited to what is available within GEE for running our final model: naive Bayes, SVM, and random forest; in practice, random forest was the most effective. For speed of execution, the current code does not try any other models, but it is easy to try them within the caret framework.
132 |
133 | The code uses 5-fold cross-validation to assess the model, and kappa as its assessment metric. It sets a tuning length of 10 in the final model, and uses the model itself to assess feature importance and select the most important features. A rough guide is to take the best tuned model, then keep adding features until the assessment metric is within one standard error of the best model. Practically, a normalized feature importance score of 0.30 or above was enough in our case.
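
As a refresher on the metric, kappa compares observed agreement against the agreement expected by chance. A minimal self-contained version (not the caret implementation):

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal class frequencies, summed.
    expected = sum(
        (sum(t == c for t in y_true) / n) * (sum(p == c for p in y_pred) / n)
        for c in classes
    )
    return (observed - expected) / (1 - expected)

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(y_true, y_pred), 3))  # 0.5
```

Unlike plain accuracy, kappa stays near zero for a classifier that merely guesses the majority class, which matters with skewed labels like ours.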
134 |
135 | It is worth noting that the irrigation labels are highly skewed: the number of data points falls off exponentially at higher extents of irrigation. We therefore apply a log transformation (log1p) to the labels. Even so, our threshold is 1 log1p-hectare, because the samples decrease drastically as we increase the threshold.
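
A small numpy sketch of this transformation; the label values are made up, while the 1 log1p-hectare threshold is the one from the text:

```python
import numpy as np

# Skewed irrigation labels in hectares (illustrative values).
labels_ha = np.array([0.0, 0.5, 2.0, 40.0, 900.0])

log_labels = np.log1p(labels_ha)   # log(1 + x) keeps zero at zero
is_irrigated = log_labels >= 1.0   # threshold of 1 log1p-hectare
```

Note that a threshold of 1 in log1p space corresponds to exp(1) - 1, roughly 1.7 hectares, in the original units.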
136 |
137 | This is the only component that runs on the developer's machine. All others run on GEE infrastructure, although the scripts are invoked on the developer's machine. We went with this approach because we found GEE to be lacking in this area.
138 |
139 | Code: gim_v2a.Rmd
140 |
141 | ### Feature Extractor
142 |
143 | After selecting features, we run a script to create them for the whole world, for a given year. If the dataset has multiple samples in a given year, we take a summary metric, such as the mean or maximum. This step is very resource intensive, because it touches all the data for all the samples in a given year. Processing cost is higher for data with finer spatial resolution (e.g., 500m as against 5km) and finer temporal resolution (e.g., once a day as against once a month). The component automatically includes latitude, longitude, and irrigation labels from the MIRCA2000 dataset.
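
Conceptually, the summarization is a reduction over the time axis; for one pixel's monthly series it looks like this (values are illustrative):

```python
import numpy as np

# Twelve monthly values of a feature (e.g., a vegetation index) for one pixel.
monthly = np.array([0.2, 0.2, 0.3, 0.5, 0.7, 0.8,
                    0.8, 0.7, 0.5, 0.3, 0.2, 0.2])

annual_mean = monthly.mean()  # one summary band per feature per year
annual_max = monthly.max()
```

GEE performs the same reduction server-side, per pixel, across every image in the year's collection.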
144 |
145 | The result of this step is a multi-band image, where each band represents the summarized value for that feature for that year, for each land pixel. In our model, the resolution is 8km, and this means each pixel represents a square of 8km by 8km.
146 |
147 | Code: features_exporter.py
148 |
149 | ### Classifier
150 |
151 | We are finally ready to run our model. We now implement the model we selected earlier using the GEE API. We set its hyper-parameters based on our analysis, and point its input at the features we selected. We then let it run.
152 |
153 | The classifier produces a single-band output image, where each pixel represents the probability of irrigation assigned by the model.
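
To turn that probability into a yes/no irrigation map, one can threshold it; a toy numpy sketch (the 0.5 cutoff here is an assumption for illustration, not necessarily what the project's post-processing uses):

```python
import numpy as np

# Per-pixel irrigation probability for a tiny 2x2 patch.
prob = np.array([[0.10, 0.70],
                 [0.55, 0.30]])

irrigation_map = (prob >= 0.5).astype(np.uint8)  # 1 = irrigated, 0 = not
```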
154 |
155 | Code: classifier.py
156 |
157 | ### User Applications
158 |
159 | The maps are now ready to be consumed. We write a few simple JavaScript applications and make them available on GEE as "apps".
160 |
161 | These apps are available in a separate repository so that we can commit them directly to GEE. To avoid extra complexity, we do not use the git "submodule" feature.
162 |
163 | [Apps repository](https://github.com/deepix/gimApps/)
164 |
165 | These apps allow interaction, zooming, and overlay with satellite imagery for validation. GEE also allows us to customize the map style, introduce UI elements such as buttons and menus, and generally makes for a rich user experience.
166 |
167 | If you wonder why we did not use JavaScript for the earlier stages, it is because there is no way to invoke the JavaScript API from the developer's computer. You can keep a copy of the code itself, but running anything requires pressing a button in the GEE code editor. The Python API lets us invoke GEE directly from a script.
168 |
169 | It is also worth mentioning that any JavaScript code you write for GEE needs to be fast: it has a 5-minute timeout, but practically it has to run within milliseconds or the user will notice a lag. This timeout does not apply to code running under Python API.
170 |
171 | ## Labels Used
172 |
173 | We use irrigation labels from the MIRCA2000 dataset. The labels represent maximum area equipped for irrigation, on a global 8km by 8km grid. We created a GeoTIFF file from it. In addition to the original data, we also created a band that represents low, medium or high irrigation. We do not use this band, opting for the original data instead.
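
Deriving such a categorical band is a simple binning step; a sketch with hypothetical cut-offs (the actual thresholds behind that band are not documented here):

```python
import numpy as np

# Maximum area equipped for irrigation per pixel, in hectares (illustrative).
max_irrigated_ha = np.array([0.0, 10.0, 250.0, 4000.0])

# Hypothetical bin edges: 0 = none, 1 = low, 2 = medium, 3 = high.
bins = [1.0, 100.0, 1000.0]
category = np.digitize(max_irrigated_ha, bins)
```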
174 |
175 | Our labels GeoTIFF image is available in this repository.
176 |
177 | ## Tips, Warnings and Best Practices
178 |
179 | Here we list all those little things that may come in useful to the developer.
180 |
181 | * GEE evaluates lazily: that is, it does not start a job until you try to output the result somewhere. At that point, it works backwards to see what needs to be done. As long as you specify your scale and projection at output, you do not have to worry about them anywhere else, because GEE will propagate them backward.
182 | * A second ramification of the way GEE executes your code is that you should avoid `for` loops. Instead, use functional equivalents such as `map()`, so that GEE can continue to build its execution graph.
183 | * Be careful with changing the model resolution. This is an area problem, i.e. O(N²), so it is best to change a little and adjust based on how GEE responds. The simplest strategy is to reduce the geographical region processed at a time, because you are now dealing with a larger volume of data.
184 | * When taking the sample, GEE removes data that is masked (i.e., missing). Visualize your features within GEE code environment to get a sense before you add it to the model feature soup. For example, a dataset might be available only for the US or Europe. Add an appropriate `missingValues` parameter in the feature configuration to deal with such instances.
185 | * Similarly, visualize your results within the GEE environment immediately and look for weirdness. Use the satellite layer to make some quick sanity passes over the results. In the case of irrigation, it should not show large swathes of the Sahara as irrigated.
186 | * Make good use of immutability and versioning. Bump up the version every time you do a new run, and archive or delete results that were meaningless. Write good commit messages so that you know what changed in each version. Use a branch each time, instead of always working on master. Each run takes many hours, so it is best not to let that go to waste.
187 | * Do some software housekeeping periodically. It is easy to get carried away in model development, but badly written software accrues "tech debt" that will come back to bite you. A rule of thumb is to look at which parts you keep changing, and split the code into frequently-changing and stable parts. For example, the current code base has a feature configuration dictionary that changes frequently, but its associated code does not change much.
188 |
189 | ## Future Plans
190 |
191 | Following are some ideas for further development on this project:
192 |
193 | 1. Make a Docker image with both Python and R environments set up and ready to use.
194 | 2. Add some code and plots for EDA of individual features.
195 | 3. Feature engineering: there is ample scope to add more features, such as monthly values instead of an annual mean, or also adding variance in addition to mean.
196 | 4. Create a time series movie of irrigation changing over time from 2000 to 2018.
197 | 5. Improve the resolution for the maps from the current 8km.
198 |
--------------------------------------------------------------------------------
/data/mc4MaxIrrigatedHaLabels.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/data/mc4MaxIrrigatedHaLabels.tif
--------------------------------------------------------------------------------
/ellecp/README.md:
--------------------------------------------------------------------------------
1 | This folder contains irrigation model code written by Eleanor Proust for croplands. Code cleanup is pending.
2 |
3 |
--------------------------------------------------------------------------------
/ellecp/archive/Downscaling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import os\n",
10 | "import rasterio\n",
11 | "from rasterio.enums import Resampling\n",
12 | "import numpy as np\n",
13 | "import pandas as pd\n",
14 | "from pyhdf.SD import SD, SDC"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 2,
20 | "metadata": {},
21 | "outputs": [
22 | {
23 | "name": "stdout",
24 | "output_type": "stream",
25 | "text": [
26 | "/mnt/w210/ellecp/irrigated-croplands\r\n"
27 | ]
28 | }
29 | ],
30 | "source": [
31 | "!pwd"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 3,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "import osgeo.gdal"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 4,
46 | "metadata": {},
47 | "outputs": [],
48 | "source": [
49 | "upscale_factor = 10\n",
50 | "\n",
51 | "with rasterio.open(\"2015.tif\") as dataset:\n",
52 | "\n",
53 | " # resample data to target shape\n",
54 | " data = dataset.read(\n",
55 | " out_shape=(\n",
56 | " dataset.count,\n",
57 | " int(dataset.height * upscale_factor),\n",
58 | " int(dataset.width * upscale_factor)\n",
59 | " ),\n",
60 | " )\n",
61 | "\n",
62 | " # scale image transform\n",
63 | " transform = dataset.transform * dataset.transform.scale(\n",
64 | " (dataset.width / data.shape[-1]),\n",
65 | " (dataset.height / data.shape[-2])\n",
66 | " )"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 5,
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "upscale_factor = 10\n",
76 | "\n",
77 | "with rasterio.open(\"2010.tif\") as dataset:\n",
78 | "\n",
79 | " # resample data to target shape\n",
80 | " data = dataset.read(\n",
81 | " out_shape=(\n",
82 | " dataset.count,\n",
83 | " int(dataset.height * upscale_factor),\n",
84 | " int(dataset.width * upscale_factor)\n",
85 | " ),\n",
86 | " )\n",
87 | "\n",
88 | " # scale image transform\n",
89 | " transform = dataset.transform * dataset.transform.scale(\n",
90 | " (dataset.width / data.shape[-1]),\n",
91 | " (dataset.height / data.shape[-2])\n",
92 | " )"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": null,
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "test = np.array(data)"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 | "clean_data = np.reshape(data, (21600, 43200))"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "lidar_dem_path = 'World_e-Atlas-UCSD_SRTM30-plus_v8_Hillshading.tiff'\n",
120 | "with rasterio.open(lidar_dem_path) as lidar_dem:\n",
121 | " im_array = lidar_dem.read()\n",
122 | " lidar_dem.bounds"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "x, y = 1, 2\n",
132 | "t = np.array([x,y])"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "with rasterio.open('World_e-Atlas-UCSD_SRTM30-plus_v8_Hillshading.tiff') as map_layer:\n",
142 | " pixels2coords = map_layer.xy(0,5) \n",
143 | " \n",
144 | "def pixel2coord(point):\n",
145 | " \"\"\"Returns global coordinates to pixel center using base-0 raster index\"\"\"\n",
146 | " return(map_layer.xy(point[0],point[1]))"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "land_pixels = np.nonzero(clean_data) \n",
156 | "# print(np.unique(imarray))\n",
157 | "land_pixel_classes = clean_data[land_pixels].tolist()\n",
158 | "print(len(land_pixel_classes))\n",
159 | "\n",
160 | "land_indices = land_pixels \n",
161 | "non_zero_indices = np.array(land_indices)\n",
162 | "clean_frame = non_zero_indices.T\n",
163 | "\n",
164 | "# print(clean_frame.shape)\n",
165 | "# n = clean_frame.shape[0]\n",
166 | "# non_zeros = np.nonzero(test)\n",
167 | "\n",
168 | "# clean_frame_2 = clean_frame \n",
169 | "# clean_frame_df = pd.DataFrame({'lon': clean_frame[:,0], 'lat': clean_frame[:,1]})\n",
170 | "# clean_frame_df['labels'] = land_pixel_classes"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {},
177 | "outputs": [],
178 | "source": [
179 | "t = np.apply_along_axis(pixel2coord, 1, clean_frame)"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 | "import math\n",
189 | "def lon_lat_to_modis(point, pixels = 1200):\n",
190 | " lon = point[0]\n",
191 | " lat = point[1]\n",
192 | " R = 6371007.181\n",
193 | " T = 1111950\n",
194 | " xmin = -20015109\n",
195 | " ymax = 10007555\n",
196 | " w = T/pixels\n",
197 | " lat_rad = math.radians(lat)\n",
198 | " lon_rad = math.radians(lon)\n",
199 | " x = R*lon_rad*math.cos(lat_rad)\n",
200 | " y = R*lat_rad\n",
201 | " H= int(abs(x -xmin)/T)\n",
202 | " V = int(abs(ymax - y)/T)\n",
203 | " i = ((ymax - y)%T) / w\n",
204 | " j = ((x -xmin)%T)/ w\n",
205 | " return np.array([H, V, i, j])"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": null,
211 | "metadata": {},
212 | "outputs": [],
213 | "source": [
214 | "modis = np.apply_along_axis(lon_lat_to_modis, 1, t)"
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": null,
220 | "metadata": {},
221 | "outputs": [],
222 | "source": [
223 | "clean_frame_df = pd.DataFrame({'x': clean_frame[:,0], 'y': clean_frame[:,1], 'lon' : t[:,0], 'lat': t[:,1],'H': modis[:,0], 'V': modis[:,1], 'i' : modis[:,2], 'j': modis[:,3] })\n",
224 | "clean_frame_df.H = clean_frame_df.H.astype(int)\n",
225 | "clean_frame_df.V = clean_frame_df.V.astype(int)\n",
226 | "clean_frame_df.i = clean_frame_df.i.astype(int)\n",
227 | "clean_frame_df.j = clean_frame_df.j.astype(int)"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 | "example_frame = clean_frame_df[['i','j']][(clean_frame_df['H']== 8) & (clean_frame_df['V']== 6)]"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": null,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "pic_gen = list(example_frame.values)"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {},
252 | "outputs": [],
253 | "source": [
254 | "emtpy_im = np.zeros((1200,1200))"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "for i in pic_gen:\n",
264 | " emtpy_im[i[0],i[1]]=1"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": null,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "from matplotlib import pyplot as plt\n",
274 | "%matplotlib inline \n",
275 | "\n",
276 | "plt.imshow(emtpy_im, interpolation='nearest')\n",
277 | "plt.show()"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {},
284 | "outputs": [],
285 | "source": [
286 | "unique_values = clean_frame_df.groupby(['H', 'V']).size().reset_index()\n",
287 | "\n",
288 | "unique_values_2 = clean_frame_df.groupby(['H', 'V']).size().reset_index()\n",
289 | "unique_values = unique_values[['H', 'V']]\n",
290 | "rel_HV_list = list(unique_values.values)\n",
291 | "for i in range(len(rel_HV_list)):\n",
292 | " rel_HV_list[i] = [rel_HV_list[i][0], rel_HV_list[i][1]]\n",
293 | "print(unique_values_2)"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": null,
299 | "metadata": {},
300 | "outputs": [],
301 | "source": [
302 | "print(unique_values_2.columns)"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "unique_values_2[unique_values_2[0]<300000].sort_values(by=[0], ascending = False)"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": null,
317 | "metadata": {},
318 | "outputs": [],
319 | "source": [
320 | "df_list = []\n",
321 | "for i in rel_HV_list:\n",
322 | " test_df = clean_frame_df[(clean_frame_df['H'] == i[0]) & (clean_frame_df['V'] == i[1])][['x', 'y', 'H', 'V', 'i', 'j']]\n",
323 | " df_list.append(test_df)"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": null,
329 | "metadata": {
330 | "scrolled": true
331 | },
332 | "outputs": [],
333 | "source": [
334 | "len(df_list)\n",
335 | "df_list[10].shape"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {},
342 | "outputs": [],
343 | "source": [
344 | "# months = ['2014.01.01', \n",
345 | "# '2014.02.01', \n",
346 | "# '2014.03.01', \n",
347 | "# '2014.04.01', \n",
348 | "# '2014.05.01', \n",
349 | "# '2014.06.01', \n",
350 | "# '2014.07.01', \n",
351 | "# '2014.08.01', \n",
352 | "# '2014.09.01', \n",
353 | "# '2014.10.01', \n",
354 | "# '2014.11.01',\n",
355 | "# '2014.12.01',\n",
356 | "# '2015.01.01', \n",
357 | "# '2015.02.01', \n",
358 | "# '2015.03.01', \n",
359 | "# '2015.04.01', \n",
360 | "# '2015.05.01', \n",
361 | "# '2015.06.01', \n",
362 | "# '2015.07.01', \n",
363 | "# '2015.08.01', \n",
364 | "# '2015.09.01', \n",
365 | "# '2015.10.01', \n",
366 | "# '2015.11.01',\n",
367 | "# '2015.12.01']\n",
368 | "\n",
369 | "\n",
370 | "months = ['2009.01.01', \n",
371 | " '2009.02.01', \n",
372 | " '2009.03.01', \n",
373 | " '2009.04.01', \n",
374 | " '2009.05.01', \n",
375 | " '2009.06.01', \n",
376 | " '2009.07.01', \n",
377 | " '2009.08.01', \n",
378 | " '2009.09.01', \n",
379 | " '2009.10.01', \n",
380 | " '2009.11.01',\n",
381 | " '2009.12.01',\n",
382 | " '2010.01.01', \n",
383 | " '2010.02.01', \n",
384 | " '2010.03.01', \n",
385 | " '2010.04.01', \n",
386 | " '2010.05.01', \n",
387 | " '2010.06.01', \n",
388 | " '2010.07.01', \n",
389 | " '2010.08.01', \n",
390 | " '2010.09.01', \n",
391 | " '2010.10.01', \n",
392 | " '2010.11.01',\n",
393 | " '2010.12.01']\n",
394 | "\n"
395 | ]
396 | },
397 | {
398 | "cell_type": "code",
399 | "execution_count": null,
400 | "metadata": {},
401 | "outputs": [],
402 | "source": [
403 | "for j in range(len(df_list)):\n",
404 | " H, V = rel_HV_list[j][0], rel_HV_list[j][1]\n",
405 | " print(rel_HV_list[j])\n",
406 | " H_V_string = 'H' + str(H) + 'V' + str(V) \n",
407 | " file_loc = 'modis/' + H_V_string + '/'\n",
408 | " list_i = np.array(df_list[j].i)\n",
409 | " list_j = np.array(df_list[j].j)\n",
410 | " indices = (list_i, list_j)\n",
411 | " for i in range(len(months)):\n",
412 | "# print(file_loc + H_V_string + '_' + months[i] + '.hdf')\n",
413 | " ndvi_file = SD(file_loc + H_V_string + '_' + months[i] + '.hdf', SDC.READ)\n",
414 | " #ndvi\n",
415 | " datasets_dic = ndvi_file.datasets()\n",
416 | " sds_obj = ndvi_file.select('1 km monthly NDVI') \n",
417 | " data = np.array(sds_obj.get()) # get sds data\n",
418 | " df_list[j][months[i]] = data[indices]\n",
419 | " st = df_list[j].iloc[:, 6:30].values\n",
420 | " df_list[j]['max'] = np.amax(st, axis=1)\n",
421 | " df_list[j]['min'] = np.amin(st, axis=1)\n",
422 | " df_list[j]['range'] = df_list[j]['max'] - df_list[j]['min']"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": 10,
428 | "metadata": {},
429 | "outputs": [],
430 | "source": [
431 | "ndvi_file = SD('modis/H8V6/H8V6_2010.06.01.hdf', SDC.READ)\n",
432 | "#ndvi\n",
433 | "datasets_dic = ndvi_file.datasets()\n",
434 | "sds_obj_r = ndvi_file.select('1 km monthly red reflectance') \n",
435 | "sds_obj_b = ndvi_file.select('1 km monthly blue reflectance')\n",
436 | "sds_obj_g = ndvi_file.select('1 km monthly NIR reflectance')\n",
437 | "\n",
438 | "data_r = np.array(sds_obj_r.get())/10000 # get sds data\n",
439 | "data_b = np.array(sds_obj_b.get())/10000\n",
440 | "data_g = np.array(sds_obj_g.get())/10000\n",
441 | "\n",
442 | "\n",
443 | "def apply_mask(pixel):\n",
444 | " if pixel < 0:\n",
445 | " return 1\n",
446 | " else:\n",
447 | " return pixel\n",
448 | "\n",
449 | "filter_function = np.vectorize(apply_mask)\n",
450 | "data_r = filter_function(data_r)\n",
451 | "data_g = filter_function(data_g)\n",
452 | "data_b = filter_function(data_b)"
453 | ]
454 | },
455 | {
456 | "cell_type": "code",
457 | "execution_count": 11,
458 | "metadata": {},
459 | "outputs": [
460 | {
461 | "data": {
462 | "text/plain": [
463 | "array([[0.125 , 0.125 , 0.1092, ..., 0.1739, 0.1761, 0.1645],\n",
464 | " [0.1213, 0.1073, 0.0991, ..., 0.1805, 0.1783, 0.1659],\n",
465 | " [0.1186, 0.1016, 0.0988, ..., 0.1713, 0.1635, 0.1693],\n",
466 | " ...,\n",
467 | " [1. , 1. , 1. , ..., 1. , 1. , 1. ],\n",
468 | " [1. , 1. , 1. , ..., 1. , 1. , 1. ],\n",
469 | " [1. , 1. , 1. , ..., 1. , 1. , 1. ]])"
470 | ]
471 | },
472 | "execution_count": 11,
473 | "metadata": {},
474 | "output_type": "execute_result"
475 | }
476 | ],
477 | "source": [
478 | "data_r"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": 12,
484 | "metadata": {},
485 | "outputs": [],
486 | "source": [
487 | "from PIL import Image\n",
488 | "import numpy as np\n",
489 | "rgbArray = np.zeros((1200,1200,3), 'uint8')\n",
490 | "rgbArray[..., 0] = data_r*255  # scale [0,1] reflectance to 8-bit; 255 (not 256) avoids uint8 overflow at 1.0\n",
491 | "rgbArray[..., 1] = data_g*255\n",
492 | "rgbArray[..., 2] = data_b*255\n",
493 | "img = Image.fromarray(rgbArray)\n",
494 | "img.save('myimg.jpeg')"
495 | ]
496 | },
497 | {
498 | "cell_type": "code",
499 | "execution_count": null,
500 | "metadata": {},
501 | "outputs": [],
502 | "source": [
503 | "full_list = pd.concat(df_list, axis=0)"
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": null,
509 | "metadata": {},
510 | "outputs": [],
511 | "source": [
512 | "full_list"
513 | ]
514 | },
515 | {
516 | "cell_type": "code",
517 | "execution_count": null,
518 | "metadata": {},
519 | "outputs": [],
520 | "source": [
521 | "filtered_list = full_list[(full_list['max'] > 4000) & (full_list['min'] < 4000) & (full_list['range'] > 2500)]"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": null,
527 | "metadata": {},
528 | "outputs": [],
529 | "source": [
530 | "filtered_list.shape"
531 | ]
532 | },
533 | {
534 | "cell_type": "code",
535 | "execution_count": 35,
536 | "metadata": {
537 | "scrolled": true
538 | },
539 | "outputs": [
540 | {
541 | "data": {
542 | "text/html": [
543 | "<div>[13809317 rows × 33 columns] (garbled HTML table omitted; the same data appears in the text/plain rendering below)</div>"
544 | ],
854 | "text/plain": [
855 | " x y H V i j 2009.01.01 2009.02.01 \\\n",
856 | "13394300 7680 8230 7 6 480 1183 2711 2212 \n",
857 | "13394301 7680 8231 7 6 480 1184 2697 2243 \n",
858 | "13394302 7680 8232 7 6 480 1184 2697 2243 \n",
859 | "13394303 7680 8233 7 6 480 1185 2809 2286 \n",
860 | "13394304 7680 8234 7 6 480 1186 2762 2415 \n",
861 | "... ... ... .. .. ... ... ... ... \n",
862 | "20344539 14361 39559 31 11 1161 3 1961 1997 \n",
863 | "20345407 14362 39557 31 11 1162 0 1837 2247 \n",
864 | "20345408 14362 39558 31 11 1162 1 1727 1712 \n",
865 | "20345409 14362 39559 31 11 1162 2 1731 1699 \n",
866 | "20346279 14363 39559 31 11 1163 0 1686 1907 \n",
867 | "\n",
868 | " 2009.03.01 2009.04.01 ... 2010.06.01 2010.07.01 2010.08.01 \\\n",
869 | "13394300 1884 1630 ... 1544 1493 1666 \n",
870 | "13394301 1944 1677 ... 1565 1540 1566 \n",
871 | "13394302 1944 1677 ... 1565 1540 1566 \n",
872 | "13394303 1955 1723 ... 1550 1542 1611 \n",
873 | "13394304 2024 1711 ... 1632 1602 1639 \n",
874 | "... ... ... ... ... ... ... \n",
875 | "20344539 2135 2076 ... 6116 7629 7998 \n",
876 | "20345407 1773 1708 ... 2409 3974 5652 \n",
877 | "20345408 1754 1723 ... 3007 4117 5916 \n",
878 | "20345409 1777 1775 ... 6347 7220 8004 \n",
879 | "20346279 1919 1680 ... 4100 6748 6522 \n",
880 | "\n",
881 | " 2010.09.01 2010.10.01 2010.11.01 2010.12.01 max min range \n",
882 | "13394300 1574 1704 1627 1714 4412 1493 2919 \n",
883 | "13394301 1612 1748 1694 1737 4380 1540 2840 \n",
884 | "13394302 1612 1748 1694 1737 4380 1540 2840 \n",
885 | "13394303 1601 1874 1762 1773 4236 1542 2694 \n",
886 | "13394304 1627 2085 1787 1759 4613 1602 3011 \n",
887 | "... ... ... ... ... ... ... ... \n",
888 | "20344539 6860 4690 2585 3113 7998 1654 6344 \n",
889 | "20345407 5423 4449 2665 4005 7836 1358 6478 \n",
890 | "20345408 4354 4221 2274 5156 7949 1399 6550 \n",
891 | "20345409 6982 4240 2265 2444 8729 1506 7223 \n",
892 | "20346279 5865 4401 2278 3527 8227 1344 6883 \n",
893 | "\n",
894 | "[13809317 rows x 33 columns]"
895 | ]
896 | },
897 | "execution_count": 35,
898 | "metadata": {},
899 | "output_type": "execute_result"
900 | }
901 | ],
902 | "source": [
903 | "filtered_list"
904 | ]
905 | },
906 | {
907 | "cell_type": "code",
908 | "execution_count": 36,
909 | "metadata": {},
910 | "outputs": [
911 | {
912 | "ename": "NameError",
913 | "evalue": "name 'ds_example' is not defined",
914 | "output_type": "error",
915 | "traceback": [
916 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
917 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
918 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mds_example\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
919 | "\u001b[0;31mNameError\u001b[0m: name 'ds_example' is not defined"
920 | ]
921 | }
922 | ],
923 | "source": [
924 | "ds_example"
925 | ]
926 | },
927 | {
928 | "cell_type": "code",
929 | "execution_count": 38,
930 | "metadata": {},
931 | "outputs": [
932 | {
933 | "data": {
934 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQcAAAD8CAYAAAB6iWHJAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAea0lEQVR4nO2de+xmRXnHP093ubgq7i4KWXa3BeLW1tgWcCOgjSFs5VbimsYLtNWVYjatmHppo5g2Ibb9QxujaNqiW8AuxgK6kkKI7RZWTNM/pIBQRFZkRcv+3JWFctFoq2z69I93XvZwds45c87Mub7PJ/nl9555z2XOvDPf88wzz8wRVcUwDCPPL/SdAcMwhomJg2EYXkwcDMPwYuJgGIYXEwfDMLyYOBiG4aVzcRCR80TkIRHZIyKXd319wzDCkC7jHERkGfAd4A3AEnAXcLGqPthZJgzDCKJry+E1wB5VfURVfw7cAGzuOA+GYQSwvOPrrQX2ZraXgNOzO4jIVmArwDKWvXoFx3SXO2Oy/PKv//S5z9+5f4X3u3x69vui76bAj3nqCVV9WT69a3EQT9rz+jWqug3YBnCMrNbTZVMX+TImzs6d9z33+dwTTnn+l9+c/TvdUzt37rsPOPbwYybE7brjv3zpXXcrloD1me11wL6O82AsIE0b97knnMK5J5ziRGKx6NpyuAvYICInAT8ALgJ+t+M8GAtKzNN/ypZDEZ2Kg6oeFJH3ADuBZcC1qvqtLvNgVDN/Sk6lQUztfrqi06HMupjPIY6d++6zBmFUcrvuuEdVN+bTLUJyopgwGLF07XMwWmYKjjPrBgwDE4cJMYVGNYV7mArWrZgIU7AY4NDQoeFn/jt38XubOEyAqQiD4Sf7+86FswsBtW7FyAkVBl8FGwP5+xtT3lPR1z2bOIwYnzD4KtJY+/FmEfWLdStGSh1hGJsoQPj9Ge1h4jBCfJZAUcOZzwsYU8MyYRgGFiE5UrICMVZ/QhW+e5zS/Q0Fi5CcEFkxmEqj2bnvvkLnY9k9ml+iPcwhORHG1nXIUtTAq+5pzPc8BkwcRsZULIU5vi7RvNGbMPSLicNImKpfYU4dv4J1JbrBxKEn8g2h7Ek45adk/r5ChWGq5TEkTBw6ZN7Is409pLIvQkMIEUAThm6xocyO8QlENt1YHIbym9tQ5kDICkN2+G4IlcTojqEIQxkmDh3gc6CZUy0tc6EdS7kOXRjAfA6tkbUO8oyhYowBK9t2McuhBbpckMM4hAlDWsxy6ACrtOmZetzHEDBxaIF8ZR2D82lMmDB0g3UrWsJGItrBnLvdYeKQCFvOrH3yFkOX6ykuItatSISvglqlTUOVZWDdjHYwy8EYNCYM/dFYHERkvYjcISK7ReRbIvJel75aRG4TkYfd/1UuXUTk0yKyR0TuF5HTUt3EkLD+bxyhwUz5fUwY0hNjORwE/kRVfxU4A7hMRF4JXA7sUtUNwC63DXA+sMH9bQWuirj2ILFRiTjyjT1UaK3M26Gxz0FV9wP73ecfi8huYC2wGTjL7bYd+BrwIZd+nc5men1dRFaKyBp3nklgax02p46wWtl2QxKHpIicCJwK3AkcP2/wqrpfRI5zu60F9mYOW3JpkxGHEGEw8TicqojStsvKN9JkVmACh6SIvAj4MvA+Vf1R2a6etMPmi4vIVhG5W0TufpafxWavM0JM4PzCsOafqC63roVhnrbowgCRloOIHMFMGL6gqje55Mfm3QURWQMccOlLwPrM4euAfflzquo2YBvM1nOIyV8fVK3mlF/oZdErYn7Rm3x6npROSJu4VU7MaIUA1wC7VfUTma9uAba4z1uAmzPp73CjFmcAz0zF31C1IGrebDYv+/PpUxhCXgy0qMR0K14HvB04W0Tuc38XAB8F3iAiDwNvcNsAXwEeAfYAfw+8O+Lag6JupbI+7fOpWmk6u1/2f1N87/3Ifzbi
Riv+Hb8fAeCwtd3cKMVlTa83VoqediYMh2jypvAUTPltYSmw8OnEhHjcbcTi+fQ5hGlWXDEWPp2QbCWrWk3aKmN/ZMvfhKEYE4cWyD+N8qMThjEGbGn6DrGnlDFEbGn6AWDCYIwJc0h2QJFH3ByTxpAxy6FDTBiMMWGWQ4v4Xnnn2yeLiYUxFMxyaIkm764wYTCGhIlDz9giqcZQMXFITN3ly2x40xgqJg4t0/d6BYbRFBOHlihq9PnQaYuYNIaKiUNLlDV6mwlojAETh4TUXSou9BjD6AMTh0Q0beRmORhDxcTBMBaEulariUMCrGtgDJmmS/5b+HQCYkYfFs05aXNK+iG/6vmcst/BLIfEWKUvxoShH5qufGXi0COLtuBs0TsqjLRko3R9LyYOLX8Th47wNf5Fe5Iu2v0OgbzVUAfzOXREX++BHBKLdK9dE7IcwBzrVowAayxGXcqGI6tC9uuuem7i0BMmDEYT8l2Eqin/MfXMxKEHTBiMJmSdjFV1KMW7UUwcEmOeeKMNfG9p9+0D6R4+5pBMiAmD0RYhr+1LbZFGWw4iskxE7hWRW932SSJyp4g8LCI3isiRLv0ot73HfX9i7LWHjq8/aF0Koyn5F/+2TQrL4b3AbuAYt/0x4JOqeoOIfAa4FLjK/X9KVV8uIhe5/d6W4PqDwdfw82lNl4XzVQoTmsWjy988ynIQkXXAbwNXu20BzgZ2uF22A29ynze7bdz3m9z+kyf0Bbtlx+cxYTDaJtZyuBL4IPBit30s8LSqHnTbS8Ba93ktsBdAVQ+KyDNu/yeyJxSRrcBWgKNZEZm9/skvUd90opWJgdE1jS0HEbkQOKCq92STPbtqwHeHElS3qepGVd14BEc1zd5gyDfquo28yfsvDCMFMZbD64A3isgFwNHMfA5XAitFZLmzHtYB+9z+S8B6YElElgMvAZ6MuP7oqOtQWrTp3MawENXDHt71TyJyFvCnqnqhiHwJ+HLGIXm/qv6diFwG/Jqq/qFzSP6Oqr617LzHyGo9XTZF568vsuPOZaJgDd/ok9t1xz2qujGf3kYQ1IeAD4jIHmY+hWtc+jXAsS79A8DlLVx7UITMhjNhMIZKEsuhLaZiOYRiQmF0zc5997FszZ7OLAfDERL/HrqvYbRBWb2z8OnEhPoZ5pgoGEPFLIcWCImDN4yhY5ZDIuqsxANmMRjDxyyHxIR0J0wYjDFglkMkdd8iZMJgjAWzHCKpM6HKhMEYE2Y5NKCufwFMGIzxYeJQg/xch6bzJHyYeBhDw7oVDUkZw2DCYAwRE4dAmsyQDI11yL+uzDC6oKremTgEUHdEosm+Zj0YXRJSL83nEEDR7Ep7MawxNurUVbMcAijqUmTNspgVn8xqMIaIiUMAZQ2/yHqw7oQxZELqnYlDDbLDl/PXjRV1NUKxLonRB3NnuU3ZTojPUsiKhC3wYgwR30NsnrZsjf8YsxwaklXdmPdK2DCm0RV1H0QmDi1hDd4YGnXrpIlDJKmiIE1Mxkmbll/+3E2v0/Q4E4dIQgu+6zckG91Q5dSLPfeclCuLha5UZuIQSZ2p2mVCYpbD9GnitI55x2r+HPl8VGHi0BBfYFTRU8QWml1cfPEvXf/WRdesqpcmDg3JWgPzwo95+g/VcvD1qW2EJZyigLk63dHs/6Z5yEfyzh9kZUJl4hBJ9gePEYghWw6+vA05v0Omjo8iW6ealHdRgF6oSJk4JKDJBKwxvdBmXomK5pEY9amqMzELC/mukb9eyPlMHBIRYqZlGUtDa+rMMsIIGcFKPVIRev0ocRCRlSKyQ0S+LSK7ReRMEVktIreJyMPu/yq3r4jIp0Vkj4jcLyKnxVx7CgxdGHyMMc9jIuvDmhMj0EXrnYY8yGIth08B/6KqvwL8BrCb2duzd6nqBmAXh96mfT6wwf1tBa6KvPZgmVID8o3K5LsZRjrysQ1V+4ScL9+lmHdTWhutEJFjgNcD1wCo6s9V9WlgM7Dd7bYdeJP7
vBm4Tmd8HVgpIgVTPqbPGASkSBjyaUY68o7tlAKcF4U2LYeTgceBz4nIvSJytYi8EDheVfcDuP/Huf3XAnszxy+5tOchIltF5G4RuftZfhaRvWEzhqeuLwCn6Mk2hvsZA77RiVRlW9fJGSMOy4HTgKtU9VTgJxzqQvgQT5oelqC6TVU3qurGIzgqInvDZgxP3ar3c9gyee2QF+WYuhLzvtYYcVgCllT1Tre9g5lYPDbvLrj/BzL7r88cvw7YF3H9wRLygxQFFg2toZV5t/ML3xjpiS1XXwAUtLzArKr+UET2isgrVPUhYBPwoPvbAnzU/b/ZHXIL8B4RuQE4HXhm3v2YElWF3vSH6pqQkFsThPZJNeTd5HcT1cMs+2BE5BTgauBI4BHgEmbWyBeBXwQeBd6iqk+KiAB/A5wH/BS4RFXvLjv/MbJaT5dNjfPXNyHe5rHEOxj94RPqkHqTH53I7p895+264x5V3Zg/Pkoc2mbs4jAnZqUow0gRPp0XiOznInGwNSRbxoTB6JMiH1dIHbTw6Y4xYVgMUscnxBxnK0ENkCE6Go326WPNBl8e5jQNvzafQ8ukCIGdAkNoMItG6JTvIp+DWQ4tYsIwwyyo7kkx5Gzi0CKLJgJ5fKsnm1B0Q8yShXNMHFpkkUcqzGrqj7IFZOqUvw1ltkSqH2hMpFxDc6pl1AVFIdN1McuhBcoEYJErfciqR1X7GfXJdufqCLiNVrREzGy4sVO3O2Uh5GkoeyiVlbGFT/fAIgtEKFPuZjWhrfIoO68NZfaAdS3KsZGLw0mxBL3v+ybnNcuhRULXdRi7WFi3oF3aLl+zHDomxCk5BWGYM5X7GCp9lK9ZDi0TYjqPuWGZX6VduniAmOVgJMeEoX36LEcTB6MRJgztMgRnrYlDiwzhB26DroRhquUXwhAE1sKnW6Ts3QBD+PGbErO6UChTctaOFbMcWsZXwadS6VMvSV93mrHN8mwXE4eWmXrlbWPp9DpMRWiHiIlDiyzSlO3YGZl1rZChie4UrRjzOXTI1IQhxWpDTXwLQ4zIHFJeUmHi0AFTfIlNjDDEPmHHXnZjwboVLTNFh2SqV+K16dA04jHLoWUWZYWjKsHwvXQ3pRVlQ5/pMcvBqIVPBOoIQ1labH7MckhL7It03w+8C1Dgm8xepLsGuAFYDXwDeLuq/lxEjgKuA14N/DfwNlX9ftn5pzrxKtUaf2Nm0e8/NTGWU/KJVyKyFvhjYKOqvgpYBlwEfAz4pKpuAJ4CLnWHXAo8paovBz7p9ps0Jgx+7Al/iNiymA+htlGXYrsVy4EXiMhyYAWwHzgb2OG+3w68yX3e7LZx328SEYm8/qCpem+ANRIrg9hGHevULSv/xg5JVf2BiHwceBT4H+BfgXuAp1X1oNttCVjrPq8F9rpjD4rIM8CxwBPZ84rIVmArwNGsaJq9QVBV8aduOWSfaIsuAqnpwtHdWBxEZBUza+Ak4GngS8D5nl3nTg2flXCYw0NVtwHbYOZzaJo/o19CTN2pi6OP2C5l/vg0MSN7vN/FdCt+C/ieqj6uqs8CNwGvBVa6bgbAOmCf+7wErAdw378EeDLi+oOnzOSbcsPIV1jfaEbZjNX5MVMMSU4R2+EruzbKKUYcHgXOEJEVznewCXgQuAN4s9tnC3Cz+3yL28Z9/1Ud8hp1ifA9KRZBGHwVOPvG5yphyB6TTe9TLIYiVL4X1LRRp2KHMj8CvA04CNzLbFhzLYeGMu8Ffl9VfyYiRwOfB05lZjFcpKqPlJ1/akOZUxaFLGX94Tp+iCGO7HQdbJW/XhuT+YqGMqMiJFX1CuCKXPIjwGs8+/4v8JaY642ZoVTurshaCXOqhnZD0vuir7xUCWSb9coiJI2kZE3efMXNh05XCUPquRd9ENsVqrK02hQtE4cOGHsFD6WqC5BvKL75Ftn0/Oe+iO0axr7FKuQdKG1g760wOqPMQZnfp61K36UPyHdvocvf5QlZj7SpP8TeW2EUUtf0
rbtvVdchu2/bXYkqH0jqazVZFzNfBnVGd7Lbeeuj7v2aOCw4dRtk0waVbyRFPomuaXN4tOq82UZcFrfQVLhj4yGsW2EEUcdkraqg2e+y5+5imLCNocCm5J2NvpGJkLiQOvju1boVRmOaVsyiobdsxS8b3UjNUKyVPD5hKLMcQvKcYtjTxMEoJNt4U3Q75pW+KJw69DxNGYIQZCmyCEKGeUPOmxXiJvdu4mB4aeLVrxKGsrS2nYO+uRpDCLQqG/YtSq/6Pcp8GXWwNSQ7ouuw2yqqTNU6eS2ryL7PoWaxr09eloe6DS00L21S5VPw5S/Eh1N1jhDMcjCiGkjokylv3uaf5FWWRVUey77vWwDmhAw5Vh1XNaSZShjALIdOGZL1EDs6UGYJhDzFYwN3Qn0heSdf3+Wfz0NV16rIYeuzNprEVZRhlkNH9F0pfaQQhmza/C801Dc2YrCqm5G9ZhNhSO2PiA29Lhpt8XUnUtQ3E4cOGaJAxNLUqVfUWOvEIdTxdTQRhjZ+ryK/QFXDnotANl9F50qFBUF1RMq+YJ/knYRzYiyAFHman6ssmCh7zdCuT6rfLR/olb9+VTctn95U9Hy0sp6DsVg0DenN7u8bhZin+7br5inGcVlFimOrhMG3nf+uqweLdSs6YCpWQ1k/P6RCl3VBfCIRMk6fbXS+IKuiJ3KopRMTROQ7X/5cTYZ4s59Dy6kJZjm0iK9RjFUYsjTpHviejkUNr24Ztdn3TnnOquHWkHiMMqsrNeZzMGoRKgyhgUc+Mzl2iDVF3EZXgl5XfKq6ZU2wiVdG7/i6JVXBT02u0ZQUoxxl5/RR10dSJ2o0FhOHFmhraKltsv3Xon5saBdgLgS+/nVIHzmmDMv8DmX7dTl0GXJM/riiLmpbAmHi0AJj9Cs06S40Gb4MedqlENfQp2uZ2KUMEisrS5+Tscz521X9Mp9DYqbkeMzS5AlbFC9Q1J9Pbco3DbJKaU2EBHuFlEGbvpAin4OJQwvkA3FgWGIR6iz07V92H01N5+yxKfv4dc/VhoAVnSOknH3xEG3UIwuC6ohsv3rOkIShjLqjDk32K7pOWXro9WO6BEUNr47D1HeOOsfnYzz6xsShA4YWBFVmRlcRum/dgKcYUgixb2iw7nBhiF8ha1XWHXXout6YOCSmrML3LQo+6vbPQwgN6Km6fhG+blssIQLTxKwPsZJ8IpS1QPuqN5U+BxG5FrgQOKCqr3Jpq4EbgROB7wNvVdWn3Nu2PwVcAPwUeKeqfsMdswX4c3fav1LV7VWZG6PPIXZcu2tC+r95YobniqyoMudlSN7aKtcuuod5a8JnXbRJTBDUPwDn5dIuB3ap6gZgl9sGOB/Y4P62AlfBc2JyBXA6s5fsXiEiq+rfxvgZQl9yTpO8VDXysmvlh+iqzjUkYUh9zqrRinx6H1R2K1T130TkxFzyZuAs93k78DXgQy79Op2ZI18XkZUissbte5uqPgkgIrcxE5zro+9gQKR+2rZFSMX0fV+2X12/RVV/PrQP7oujCD0+5PypqQpcKvNNdE3TIKjjVXU/gPt/nEtfC+zN7Lfk0orSJ0NR4ygLZumDvp5MPnGMacj5frnvWmOjzshGF6R2SIonTUvSDz+ByFZmXRKOZkW6nLVMaFxDnz922dO9ST+/Likqvy+aEMpDpJtcp2+GkN+m4vCYiKxR1f2u23DApS8B6zP7rQP2ufSzculf851YVbcB22DmkGyYv16oYwb28QRvEuCUqhtU1niryq0qajD7Xda3MU/PpxlhNO1W3AJscZ+3ADdn0t8hM84AnnHdjp3AOSKyyjkiz3Fpk6OJY64vuhKGMpqO75dFHg7NPB8rlZaDiFzP7Kn/UhFZYjbq8FHgiyJyKfAo8Ba3+1eYDWPuYTaUeQmAqj4pIn8J3OX2+4u5c3IRGUJF7cMx6nN+hoYoFzl081ZCHrMYmhMyWnFxwVeHBSC4UYrLCs5zLXBtrdxNiFgPf5t0YTWUNdDQ
J32T4VOjORYh2QN9O8nKntZdUeVfKIqCzFsQRZOVzGKIx2ZldkBXs+tC8wLtCkPoyE1R3rLnaBombcIQjk3ZbpmhP6nKGmubVkOdkZtU8yWG/DsMEVtDskWG3rcNjW/og7IIyfzIg41CdItZDpH07T9IQVc+iL4ciGP+bbrALIeWGEK8QgrauoequQQpr1G0bTTDRisWjJBow1Tkw5u7mlY9pJmNY8YshwWkqH9ftn+K/n7IWhcxjTlrpUzFousTE4cG9O3Ea0qMY9IXb1AVlNRktuRYy3aKmEOyBk1N47qTsYb+xGtjuDHEqiibFm80xxySkfga+HwSVdVxoeeHxajoTfwdJgzdY5ZDi8SsrFS0b1+0PbRZ9/x1ZmqWHTPfZ0hl3TVmOSSiqz7xkCrrFIShiEUXhjLMcqhBjOlftDCJjxCzu48K3dZMzRB/Q36/fFpo6LWvbBddHOyNVxHEjpvXEYayY2PyEEtfwlB2/WxaE2EoSjNmmOUQQEpnYdGTKn+NMlHo62nX5zBjiklZ+fIzq2GGzcpsSBujCEUjH0XCEGutxJyn7Hz5c4fEQjQh1Xl85zXMIdmYoki7kGFMH6mGPts+R4prpsxHbEM2IaiP+Rw6Zv4UbNKYYq2Bpv3yMusp9JxNF2yZl1Vs4/YtHNPkvIs0b8MshwbEPBGbDtvFxkzUuVbodzHORN95Q7taRfmsmp2ZFzkThnLM51CTFD6IFI61RZqD0OR+6w6ThjLFSFbzOSSmr8Y5pUoJ7U3GCvUJ1Tn3FIWhDBOHmqRYvKTOsdnpx3VjA8ZAnXtpuqhLWdelDos2DdzEoSapnGNFpKzIfRzbFql8Hm2PqkyJhfM5DDXwpWw1o3la7HnLGIpVkiofVX6K/PfZ7aJj84FTQ61LdTGfw4Ap84LHmLJjEIYqiyDFvYdco2wkY/4blAWqTRGzHFqiLBy67PopIxuHXoHbbmhFQ6NNRjBCu4JjxMKn8Y+bp/5hY4bOQkOUfSZx6DmK8jN0IYkh9Klf1V3wdTvGLgxgszI7MQnbbmDz8+evE3vdoQhDG/Eb+W5CkV+hjh9hKr6GKhZGHIoqXpc/dJXVEDJT0PfEatqgsg0nZkgxJg91aHLPZeVZFZHp8/9kBXpqjsk8CyMORY0vVaVOEVJd14EYQ5NzFB3Tps8gxdBjqkCnRRsGHbTPQUR+DDzUdz5yvBR4ou9MZLD8lDO0/MDw8vRLqvqyfOLQLYeHfI6SPhGRu4eUJ8tPOUPLDwwzTz4szsEwDC8mDoZheBm6OGzrOwMehpYny085Q8sPDDNPhzFoh6RhGP0xdMvBMIyeMHEwDMPLYMVBRM4TkYdEZI+IXN7RNdeLyB0isltEviUi73Xpq0XkNhF52P1f5dJFRD7t8ni/iJzWUr6Wici9InKr2z5JRO50+blRRI506Ue57T3u+xNbyMtKEdkhIt925XTmAMrn/e73ekBErheRo7ssIxG5VkQOiMgDmbTaZSIiW9z+D4vIlth8RaOqg/sDlgHfBU4GjgT+E3hlB9ddA5zmPr8Y+A7wSuCvgctd+uXAx9znC4B/BgQ4A7izpXx9APhH4Fa3/UXgIvf5M8Afuc/vBj7jPl8E3NhCXrYD73KfjwRW9lk+wFrge8ALMmXzzi7LCHg9cBrwQCatVpkAq4FH3P9V7vOqtut86X31efGSwj4T2JnZ/jDw4R7ycTPwBmZRmmtc2hpmwVkAnwUuzuz/3H4J87AO2AWcDdzqKtUTwPJ8WQE7gTPd5+VuP0mYl2NcQ5Rcep/lsxbY6xrVcldG53ZdRsCJOXGoVSbAxcBnM+nP26+Pv6F2K+Y/+Jwll9YZztw8FbgTOF5V9wO4/8e53brI55XAB4H/c9vHAk+r6kHPNZ/Lj/v+Gbd/Kk4GHgc+57o5V4vIC+mxfFT1B8DHgUeB/czu+R76K6M5dcuk9zqfZ6jiIJ60zsZcReRFwJeB96nqj8p29aQly6eIXAgcUNV7Aq/Zdrkt
Z2Y+X6WqpwI/YWYyF9H67+j68puBk4ATgBcC55dct9e6VXL9vvN1GEMVhyVgfWZ7HbCviwuLyBHMhOELqnqTS35MRNa479cABzrK5+uAN4rI94EbmHUtrgRWish8Xkz2ms/lx33/EuDJhPlZApZU9U63vYOZWPRVPgC/BXxPVR9X1WeBm4DX0l8ZzalbJr3V+SKGKg53ARucx/lIZo6jW9q+qIgIcA2wW1U/kfnqFmDuPd7CzBcxT3+H80CfATwzNyVToKofVtV1qnoiszL4qqr+HnAH8OaC/Mzz+Wa3f7Knj6r+ENgrIq9wSZuAB+mpfByPAmeIyAr3+83z1EsZZahbJjuBc0RklbOGznFp/dGnw6PCwXMBs9GC7wJ/1tE1f5OZKXc/cJ/7u4BZn3QX8LD7v9rtL8Dfujx+E9jYYt7O4tBoxcnAfwB7gC8BR7n0o932Hvf9yS3k4xTgbldG/8TMs95r+QAfAb4NPAB8HjiqyzICrmfm73iWmQVwaZMyAf7A5WsPcEkXdb7sz8KnDcPwMtRuhWEYPWPiYBiGFxMHwzC8mDgYhuHFxMEwDC8mDoZheDFxMAzDy/8DqT3l9OeqAoAAAAAASUVORK5CYII=\n",
935 | "text/plain": [
936 | ""
937 | ]
938 | },
939 | "metadata": {
940 | "needs_background": "light"
941 | },
942 | "output_type": "display_data"
943 | }
944 | ],
945 | "source": [
946 | "ds_example = filtered_list[['i','j']][(filtered_list['H'] == 8)& (filtered_list['V'] == 6) ]\n",
947 | "ds_example = list(ds_example.values)\n",
948 | "ds_frame = np.zeros((1200,1200))\n",
949 | "for i in ds_example:\n",
950 | " ds_frame[i[0],i[1]] = 1\n",
951 | "\n",
952 | "\n",
953 | "plt.imshow(ds_frame, interpolation='nearest')\n",
954 | "plt.show()"
955 | ]
956 | },
957 | {
958 | "cell_type": "code",
959 | "execution_count": null,
960 | "metadata": {},
961 | "outputs": [],
962 | "source": [
963 | "ds_2010 = np.zeros((21600, 43200), dtype=np.uint8)\n",
964 | "x_list = list(filtered_list.x)\n",
965 | "y_list = list(filtered_list.y)"
966 | ]
967 | },
968 | {
969 | "cell_type": "code",
970 | "execution_count": null,
971 | "metadata": {},
972 | "outputs": [],
973 | "source": []
974 | },
975 | {
976 | "cell_type": "code",
977 | "execution_count": null,
978 | "metadata": {},
979 | "outputs": [],
980 | "source": [
981 | "for i in range(len(x_list)):\n",
982 | " ds_2010[x_list[i], y_list[i]] = 1"
983 | ]
984 | },
985 | {
986 | "cell_type": "code",
987 | "execution_count": null,
988 | "metadata": {},
989 | "outputs": [],
990 | "source": [
991 | "np.sum(ds_2010)"
992 | ]
993 | },
994 | {
995 | "cell_type": "code",
996 | "execution_count": 68,
997 | "metadata": {},
998 | "outputs": [],
999 | "source": [
1000 | "import rasterio as rio\n",
1001 | "with rio.open('World_e-Atlas-UCSD_SRTM30-plus_v8_Hillshading.tiff') as src:\n",
1002 | " ras_data = src.read()\n",
1003 | " ras_meta = src.profile\n",
1004 | "ras_meta.update(dtype='uint8', count=1)  # profile came from the hillshade raster; match the mask array before writing\n",
1005 | "with rio.open('2010_ds.tif', 'w', **ras_meta) as dst:\n",
1006 | "    dst.write(ds_2010, 1)"
1006 | ]
1007 | },
1008 | {
1009 | "cell_type": "code",
1010 | "execution_count": 33,
1011 | "metadata": {},
1012 | "outputs": [
1013 | {
1014 | "data": {
1015 | "text/plain": [
1016 | "(26733, 33)"
1017 | ]
1018 | },
1019 | "execution_count": 33,
1020 | "metadata": {},
1021 | "output_type": "execute_result"
1022 | }
1023 | ],
1024 | "source": [
1025 | "df_list[1][df_list[1]['max'] > 0.4].shape"
1026 | ]
1027 | },
1028 | {
1029 | "cell_type": "code",
1030 | "execution_count": 34,
1031 | "metadata": {},
1032 | "outputs": [
1033 | {
1034 | "data": {
1035 | "text/plain": [
1036 | "(30944, 33)"
1037 | ]
1038 | },
1039 | "execution_count": 34,
1040 | "metadata": {},
1041 | "output_type": "execute_result"
1042 | }
1043 | ],
1044 | "source": [
1045 | "df_list[1][df_list[1]['min'] < 0.4].shape"
1046 | ]
1047 | },
1048 | {
1049 | "cell_type": "code",
1050 | "execution_count": 71,
1051 | "metadata": {},
1052 | "outputs": [
1053 | {
1054 | "data": {
1055 | "text/plain": [
1056 | "(1, 21600, 43200)"
1057 | ]
1058 | },
1059 | "execution_count": 71,
1060 | "metadata": {},
1061 | "output_type": "execute_result"
1062 | }
1063 | ],
1064 | "source": [
1065 | "ds_path = '2015_ds.tif'\n",
1066 | "with rio.open(ds_path) as src:\n",
1067 | "    im_array = src.read()\n",
1068 | "im_array.shape"
1070 | ]
1071 | },
1072 | {
1073 | "cell_type": "code",
1074 | "execution_count": 28,
1075 | "metadata": {
1076 | "scrolled": false
1077 | },
1078 | "outputs": [
1079 | {
1080 | "data": {
1081 | "text/plain": [
1082 | "17"
1083 | ]
1084 | },
1085 | "execution_count": 28,
1086 | "metadata": {},
1087 | "output_type": "execute_result"
1088 | }
1089 | ],
1090 | "source": [
1091 | "import urllib.request\n",
1092 | "url_base = 'https://e4ftl01.cr.usgs.gov/MOLT/MOD13A3.006/'\n",
1093 | "\n",
1094 | "strings = ['2009.01.01', \n",
1095 | " '2009.02.01', \n",
1096 | " '2009.03.01', \n",
1097 | " '2009.04.01', \n",
1098 | " '2009.05.01', \n",
1099 | " '2009.06.01', \n",
1100 | " '2009.07.01', \n",
1101 | " '2009.08.01', \n",
1102 | " '2009.09.01', \n",
1103 | " '2009.10.01', \n",
1104 | " '2009.11.01',\n",
1105 | " '2009.12.01',\n",
1106 | " '2010.01.01', \n",
1107 | " '2010.02.01', \n",
1108 | " '2010.03.01', \n",
1109 | " '2010.04.01', \n",
1110 | " '2010.05.01',\n",
1111 | " ]\n",
1112 | "\n",
1113 | "html_list = []\n",
1114 | "for i in strings:\n",
1115 | " html_list.append(url_base + i)\n",
1116 | "\n",
1117 | "clean_list =[]\n",
1118 | "for i in html_list:\n",
1119 | " response = urllib.request.urlopen(i)\n",
1120 | " html = response.read()\n",
1121 | " all_list= str(html).split('=')\n",
1122 | " for j in all_list:\n",
1123 | "# print(j)\n",
1124 | " if 'hdf' in j:\n",
1125 | " if 'xml' not in j:\n",
1126 | " url = i+j[1:46]\n",
1127 | " v_val = int(url[76:78])\n",
1128 | " h_val = int(url[73:75])\n",
1129 | " z = [h_val,v_val]\n",
1130 | " if z == [13,14]:\n",
1131 | " clean_list.append(i+ '/' + j[1:46])\n",
1132 | "len(clean_list)"
1133 | ]
1134 | },
1135 | {
1136 | "cell_type": "code",
1137 | "execution_count": null,
1138 | "metadata": {},
1139 | "outputs": [],
1140 | "source": [
1141 | "land_pixels"
1142 | ]
1143 | },
1144 | {
1145 | "cell_type": "code",
1146 | "execution_count": null,
1147 | "metadata": {},
1148 | "outputs": [],
1149 | "source": [
1150 | "clean_list[1]"
1151 | ]
1152 | },
1153 | {
1154 | "cell_type": "code",
1155 | "execution_count": 30,
1156 | "metadata": {
1157 | "scrolled": true
1158 | },
1159 | "outputs": [],
1160 | "source": [
1161 | "import requests  # get the requests library from https://github.com/requests/requests\n",
1162 | " \n",
1163 | "class SessionWithHeaderRedirection(requests.Session):\n",
1164 | " AUTH_HOST = 'urs.earthdata.nasa.gov'\n",
1165 | " def __init__(self, username, password):\n",
1166 | " super().__init__()\n",
1167 | " self.auth = (username, password)\n",
1168 | " \n",
1169 | " def rebuild_auth(self, prepared_request, response):\n",
1170 | " headers = prepared_request.headers\n",
1171 | " url = prepared_request.url\n",
1172 | " if 'Authorization' in headers:\n",
1173 | " original_parsed = requests.utils.urlparse(response.request.url)\n",
1174 | " redirect_parsed = requests.utils.urlparse(url)\n",
1175 | " if (original_parsed.hostname != redirect_parsed.hostname) and \\\n",
1176 | " redirect_parsed.hostname != self.AUTH_HOST and \\\n",
1177 | " original_parsed.hostname != self.AUTH_HOST:\n",
1178 | " del headers['Authorization']\n",
1179 | " return\n",
1180 | "\n",
1181 | "username = os.environ['EARTHDATA_USERNAME']  # Earthdata login; read from environment rather than hard-coding credentials\n",
1182 | "password = os.environ['EARTHDATA_PASSWORD']\n",
1183 | " \n",
1184 | "session = SessionWithHeaderRedirection(username, password)\n",
1185 | "for i in clean_list:\n",
1186 | " url = i\n",
1187 | " v_val = str(int(url[77:79]))\n",
1188 | " h_val = str(int(url[74:76]))\n",
1189 | " date = url[45:55]\n",
1190 | " HV_val = 'H' + h_val + 'V' + v_val\n",
1191 | "\n",
1192 | " filename = 'modis/' + HV_val + '/' + HV_val + '_' + date + '.hdf'\n",
1193 | " response = session.get(url, stream=True)\n",
1194 | "    os.makedirs('modis/' + HV_val, exist_ok=True)  # create the tile directory if needed\n",
1195 | "    with open(filename, 'wb') as fd:\n",
1196 | "        for chunk in response.iter_content(chunk_size=1024*1024):\n",
1197 | "            fd.write(chunk)"
1204 | ]
1205 | },
1206 | {
1207 | "cell_type": "code",
1208 | "execution_count": 35,
1209 | "metadata": {},
1210 | "outputs": [
1211 | {
1212 | "data": {
1213 | "text/plain": [
1214 | "'h'"
1215 | ]
1216 | },
1217 | "execution_count": 35,
1218 | "metadata": {},
1219 | "output_type": "execute_result"
1220 | }
1221 | ],
1222 | "source": [
1223 | "url[74:]"
1224 | ]
1225 | },
1226 | {
1227 | "cell_type": "code",
1228 | "execution_count": 33,
1229 | "metadata": {},
1230 | "outputs": [
1231 | {
1232 | "data": {
1233 | "text/plain": [
1234 | "'03'"
1235 | ]
1236 | },
1237 | "execution_count": 33,
1238 | "metadata": {},
1239 | "output_type": "execute_result"
1240 | }
1241 | ],
1242 | "source": [
1243 | "str(url[77:79])"
1244 | ]
1245 | },
1246 | {
1247 | "cell_type": "code",
1248 | "execution_count": null,
1249 | "metadata": {},
1250 | "outputs": [],
1251 | "source": [
1252 | "import urllib.request\n",
1253 | "url_base = 'https://e4ftl01.cr.usgs.gov/MOTA/MCD12Q2.006/'\n",
1254 | "strings = ['2006.01.01','2007.01.01', '2008.01.01', '2009.01.01', '2010.01.01', '2011.01.01', '2012.01.01', '2013.01.01', '2014.01.01' ,'2015.01.01' ]\n",
1255 | "\n",
1256 | "html_list = []\n",
1257 | "for i in strings:\n",
1258 | " html_list.append(url_base + i)\n",
1259 | "\n",
1260 | "print(html_list)\n",
1261 | "clean_list =[]\n",
1262 | "for i in html_list:\n",
1263 | " response = urllib.request.urlopen(i)\n",
1264 | " html = response.read()\n",
1265 | " all_list= str(html).split('=')\n",
1266 | " for j in all_list:\n",
1267 | "# print(j)\n",
1268 | " if 'hdf' in j:\n",
1269 | " if 'xml' not in j:\n",
1270 | "# print(i+j[1:46])\n",
1271 | " clean_list.append(i+j[1:46])"
1272 | ]
1273 | },
1274 | {
1275 | "cell_type": "code",
1276 | "execution_count": 101,
1277 | "metadata": {},
1278 | "outputs": [
1279 | {
1280 | "data": {
1281 | "text/plain": [
1282 | "73"
1283 | ]
1284 | },
1285 | "execution_count": 101,
1286 | "metadata": {},
1287 | "output_type": "execute_result"
1288 | }
1289 | ],
1290 | "source": [
1291 | "url.index('h10')"
1292 | ]
1293 | },
1294 | {
1295 | "cell_type": "code",
1296 | "execution_count": 102,
1297 | "metadata": {},
1298 | "outputs": [
1299 | {
1300 | "data": {
1301 | "text/plain": [
1302 | "76"
1303 | ]
1304 | },
1305 | "execution_count": 102,
1306 | "metadata": {},
1307 | "output_type": "execute_result"
1308 | }
1309 | ],
1310 | "source": [
1311 | "url.index('v03')"
1312 | ]
1313 | },
1314 | {
1315 | "cell_type": "code",
1316 | "execution_count": null,
1317 | "metadata": {},
1318 | "outputs": [],
1319 | "source": [
1320 | "\n",
1321 | "import rasterio as rio \n",
1322 | "\n",
1323 | "with rio.open('2005.tif') as src:\n",
1324 | " ras_data = src.read()\n",
1325 | " ras_meta = src.profile\n",
1326 | "\n",
1327 | "# make any necessary changes to raster properties, e.g.:\n",
1328 | "ras_meta['nodata'] = -99\n",
1329 | "\n",
1330 | "with rio.open('2010.tif', 'w', **ras_meta) as dst:\n",
1331 | " dst.write(map_10, 1)"
1332 | ]
1333 | }
1334 | ],
1335 | "metadata": {
1336 | "kernelspec": {
1337 | "display_name": "Python 3",
1338 | "language": "python",
1339 | "name": "python3"
1340 | },
1341 | "language_info": {
1342 | "codemirror_mode": {
1343 | "name": "ipython",
1344 | "version": 3
1345 | },
1346 | "file_extension": ".py",
1347 | "mimetype": "text/x-python",
1348 | "name": "python",
1349 | "nbconvert_exporter": "python",
1350 | "pygments_lexer": "ipython3",
1351 | "version": "3.7.6"
1352 | }
1353 | },
1354 | "nbformat": 4,
1355 | "nbformat_minor": 4
1356 | }
1357 |
--------------------------------------------------------------------------------
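The notebook above downloads MODIS red and NIR reflectance and builds composites but never computes the vegetation index itself. As a standalone sketch, NDVI is (NIR − red) / (NIR + red); all names and values here are illustrative, not from the notebook:

```python
import numpy as np

def ndvi(red, nir, eps=1e-9):
    """NDVI = (NIR - red) / (NIR + red); eps guards against division by zero."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    return (nir - red) / (nir + red + eps)

# Synthetic reflectance values in [0, 1]: vegetation reflects strongly in NIR.
red = np.array([0.10, 0.30])
nir = np.array([0.50, 0.30])
print(ndvi(red, nir))  # high for dense vegetation, near zero for bare ground
```

The same formula applies to the scaled integer bands in the notebook (reflectance × 10000), since the scale factor cancels in the ratio.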
/ellecp/archive/data_processing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Library Set-up"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import numpy as np\n",
17 | "import pandas as pd\n",
18 | "from PIL import Image\n",
19 | "import os\n",
20 | "import warnings\n",
21 | "import rasterio as rio\n",
22 | "\n",
23 | "%matplotlib inline\n",
24 | "\n",
25 | "from netCDF4 import Dataset\n",
26 | "from pyhdf.SD import SD, SDC\n",
27 | "from skimage.transform import resize"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 2,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "from collections import Counter\n",
37 | "\n",
38 | "from sklearn.utils import shuffle\n",
39 | "from sklearn.pipeline import Pipeline\n",
40 | "from sklearn.metrics import confusion_matrix\n",
41 | "from sklearn.ensemble import RandomForestClassifier\n",
42 | "from sklearn.ensemble import RandomForestRegressor\n",
43 | "from sklearn.ensemble import ExtraTreesClassifier\n",
44 | "from sklearn.model_selection import train_test_split\n",
45 | "from sklearn.model_selection import GridSearchCV\n",
46 | "from sklearn.metrics import cohen_kappa_score\n",
47 | "from sklearn.metrics import classification_report\n",
48 | "from sklearn.metrics import accuracy_score\n",
49 | "from imblearn.over_sampling import SMOTE\n",
50 | "from skimage.transform import resize\n",
51 | "from sklearn.metrics import mean_squared_error\n",
52 | "\n",
53 | "ncores = os.cpu_count()"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "## 2010 cropland boundaries - generate Crop mask"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 3,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "lidar_dem_path = 'lower_scaled_gfsad.tif'\n",
70 | "with rio.open(lidar_dem_path) as lidar_dem:\n",
71 | " im_array = lidar_dem.read()"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 4,
77 | "metadata": {
78 | "scrolled": true
79 | },
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "(1, 2160, 4320)\n",
86 | "702138\n"
87 | ]
88 | }
89 | ],
90 | "source": [
91 | "print(im_array.shape)\n",
92 | "# Get a list of cropland and their classes\n",
93 | "im_array = im_array.reshape((2160,4320))\n",
94 | "\n",
95 | "def apply_mask(pixel):\n",
96 | " if pixel == 9:\n",
97 | " return 0\n",
98 | " else:\n",
99 | " return pixel\n",
100 | "\n",
101 | "filter_function = np.vectorize(apply_mask)\n",
102 | "unmasked_pixels = filter_function(im_array)\n",
103 | "\n",
104 | "land_pixels = np.nonzero(unmasked_pixels) \n",
105 | "# print(np.unique(imarray))\n",
106 | "land_pixel_classes = im_array[land_pixels].tolist()\n",
107 | "print(len(land_pixel_classes))"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "## Produce DataFrame"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 5,
120 | "metadata": {},
121 | "outputs": [
122 | {
123 | "name": "stdout",
124 | "output_type": "stream",
125 | "text": [
126 | "(702138, 2)\n"
127 | ]
128 | }
129 | ],
130 | "source": [
131 | "land_indices = land_pixels \n",
132 | "non_zero_indices = np.array(land_indices)\n",
133 | "clean_frame = non_zero_indices.T\n",
134 | "print(clean_frame.shape)\n",
135 | "n = clean_frame.shape[0]\n",
136 | "non_zeros = np.nonzero(im_array)\n",
137 | "\n",
138 | "clean_frame_2 = clean_frame \n",
139 | "clean_frame_df = pd.DataFrame({'lon': clean_frame[:,0], 'lat': clean_frame[:,1]})\n",
140 | "clean_frame_df['labels'] = land_pixel_classes"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "## Add Siebert labels\n",
148 | "For each label year, apply the cropland mask and extract the per-pixel irrigated-area values."
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 29,
154 | "metadata": {},
155 | "outputs": [
156 | {
157 | "data": {
158 | "text/plain": [
159 | "(2160, 4320)"
160 | ]
161 | },
162 | "execution_count": 29,
163 | "metadata": {},
164 | "output_type": "execute_result"
165 | }
166 | ],
167 | "source": [
168 | "label_years = ['1985', '1990', '1995', '2000', '2005']\n",
169 | "output_years = label_years + [ '2001', '2002', '2003', '2004', '2006','2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015',\n",
170 | " '2016', '2017', '2018', '2019']\n",
171 | "lidar_dem_path = 'C:\\\\Users\\\\Elle\\\\Documents\\\\w210\\\\1985.tif'\n",
172 | "times_series_labels = np.zeros((5, 2160, 4320))\n",
173 | "for i in range(len(label_years)):\n",
174 | " lidar_dem_path = label_years[i] +'.tif'\n",
175 | " with rio.open(lidar_dem_path) as lidar_dem:\n",
176 | " array = lidar_dem.read() \n",
177 | " array = array.reshape(2160, 4320)\n",
178 | " clean_frame_df[str(label_years[i])] = array[land_indices].reshape(n,1)\n",
179 | "\n",
180 | "times_series_labels[0].shape"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {},
186 | "source": [
187 | "## Retrieve NDVI Data for each year"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": 30,
193 | "metadata": {},
194 | "outputs": [],
195 | "source": [
196 | "measures = ['max_y1', 'min_y1', 'mean_y1', 'var_y1', 'max_y2', 'min_y2', 'mean_y2', 'var_y2']\n",
197 | "ndvi_list = os.listdir('ndvi')\n",
198 | "def retrieve_ndvi(indices, length, year):\n",
199 | " path = 'ndvi/ndvi3g_geo_v1_'\n",
200 | " file_1h = path + year + '_0106.nc4'\n",
201 | " file_2h = path + year + '_0712.nc4' \n",
202 | " ds_1, ds_2 = np.array(Dataset(file_1h)['ndvi']) , np.array(Dataset(file_2h)['ndvi'])\n",
203 | " max_y1 = np.max(ds_1, axis = 0)[indices].reshape(length)\n",
204 | " min_y1 = np.min(ds_1, axis = 0)[indices].reshape(length)\n",
205 | " var_y1 = np.var(ds_1, axis = 0)[indices].reshape(length)\n",
206 | " mean_y1= np.mean(ds_1, axis = 0)[indices].reshape(length)\n",
207 | " max_y2 = np.max(ds_2, axis = 0)[indices].reshape(length)\n",
208 | " min_y2 = np.min(ds_2, axis = 0)[indices].reshape(length)\n",
209 | " var_y2 = np.var(ds_2, axis = 0)[indices].reshape(length)\n",
210 | " mean_y2 = np.mean(ds_2, axis = 0)[indices].reshape(length)\n",
211 | " return max_y1, min_y1, mean_y1, var_y1, max_y2, min_y2, mean_y2, var_y2"
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 20,
217 | "metadata": {},
218 | "outputs": [
219 | {
220 | "data": {
221 | "text/plain": [
222 | "0"
223 | ]
224 | },
225 | "execution_count": 20,
226 | "metadata": {},
227 | "output_type": "execute_result"
228 | }
229 | ],
230 | "source": [
231 | "data.max()"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 33,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "def ndvi_mask(pixel):\n",
241 | " if pixel < -1:\n",
242 | " return 0\n",
243 | " else:\n",
244 | " return pixel*10000\n",
245 | "\n",
246 | "def ndvi_modis(indices, length, year):\n",
247 | " ndvi_frame = np.zeros((23,2160,4320))\n",
248 | " file_dir = 'daily_ndvi/processed/' + year + '/'\n",
249 | " for i in range(23):\n",
250 | " full_path = file_dir + str(i) + '.csv' \n",
251 | " ds = pd.read_csv(full_path)\n",
252 | " filter_function = np.vectorize(ndvi_mask)\n",
253 | " data = filter_function(ds)\n",
254 | " data = resize(data, (2160, 4320), preserve_range=True)\n",
255 | " ndvi_frame[i,:,:]= data\n",
256 | " ds_1, ds_2 = ndvi_frame[0:12,:,:], ndvi_frame[12:,:,:]\n",
257 | " max_y1 = np.max(ds_1, axis = 0)[indices].reshape(length)\n",
258 | " min_y1 = np.min(ds_1, axis = 0)[indices].reshape(length)\n",
259 | " var_y1 = np.var(ds_1, axis = 0)[indices].reshape(length)\n",
260 | " mean_y1= np.mean(ds_1, axis = 0)[indices].reshape(length)\n",
261 | " max_y2 = np.max(ds_2, axis = 0)[indices].reshape(length)\n",
262 | " min_y2 = np.min(ds_2, axis = 0)[indices].reshape(length)\n",
263 | " var_y2 = np.var(ds_2, axis = 0)[indices].reshape(length)\n",
264 | " mean_y2 = np.mean(ds_2, axis = 0)[indices].reshape(length)\n",
265 | " return max_y1, min_y1, mean_y1, var_y1, max_y2, min_y2, mean_y2, var_y2"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "## Retrieve Climate Data for each year"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 34,
278 | "metadata": {},
279 | "outputs": [],
280 | "source": [
281 | "def extract_nc(indices, length, year, variable):\n",
282 | " path = 'climate/' + year + '/' + variable + '/'\n",
283 | " full_path = path + 'TerraClimate_' + variable +'_' + year + '.nc'\n",
284 | " ds = np.array(Dataset(full_path)[variable])\n",
285 | " max_y1 = np.max(ds, axis = 0)\n",
286 | " min_y1 = np.min(ds, axis = 0)\n",
287 | " var_y1 = np.var(ds, axis = 0)\n",
288 | " mean_y1 = np.mean(ds, axis = 0)\n",
289 | " max_y1 = resize(max_y1, (2160, 4320))[indices].reshape(length)\n",
290 | " min_y1 = resize(min_y1, (2160, 4320))[indices].reshape(length)\n",
291 | " var_y1 = resize(var_y1, (2160, 4320))[indices].reshape(length)\n",
292 | " mean_y1 = resize(mean_y1, (2160, 4320))[indices].reshape(length)\n",
293 | " return max_y1, min_y1, var_y1, mean_y1\n"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 | "## Save each Year to a CSV"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 36,
306 | "metadata": {},
307 | "outputs": [
308 | {
309 | "name": "stdout",
310 | "output_type": "stream",
311 | "text": [
312 | "ndvi done for 2019\n",
313 | "climate 2019\n"
314 | ]
315 | }
316 | ],
317 | "source": [
318 | "def retrieve_year_clim(indices, length, year, ref_df):\n",
319 | " measures = ['aet', 'def', 'pet', 'ppt', 'srad', 'tmax', 'tmin', 'vap', 'vpd', 'soil', 'PDSI']\n",
320 | " train_years = ['1985', '1990', '1995', '2000', '2005']\n",
321 | " \n",
322 | " \n",
323 | " new_df = pd.DataFrame()\n",
324 | " new_df['lat'], new_df['lon'] = ref_df['lat'], ref_df['lon']\n",
325 | " \n",
326 | " #Add labels - where applicable \n",
327 | " if year in train_years:\n",
328 | " new_df['label'] = ref_df[year]\n",
329 | " \n",
330 | " #retrieve ndvi\n",
331 | "# if int(year) < 2016:\n",
332 | "# #if avhrr is available\n",
333 | "# new_df['ndvi_max_y1'], new_df['ndvi_min_y1'], new_df['ndvi_mean_y1'], new_df['ndvi_var_y1'], new_df['ndvi_max_y2'], new_df['ndvi_min_y2'], new_df['ndvi_mean_y2'], new_df['ndvi_var_y2'] = retrieve_ndvi(indices, length, year)\n",
334 | "# else:\n",
335 | " #else use modis\n",
336 | " new_df['ndvi_max_y1'], new_df['ndvi_min_y1'], new_df['ndvi_mean_y1'], new_df['ndvi_var_y1'], new_df['ndvi_max_y2'], new_df['ndvi_min_y2'], new_df['ndvi_mean_y2'], new_df['ndvi_var_y2'] = ndvi_modis(indices, length, year)\n",
337 | "    print('ndvi done for ' + year)\n",
338 | " for i in measures:\n",
339 | " #retrieve relevant climate variables\n",
340 | " new_df[i+'_max'], new_df[i+'_min'], new_df[i+'_var'], new_df[i+'_mean'] = extract_nc(indices, length, year, i)\n",
341 | "    print('climate ' + year)\n",
342 | " return new_df\n",
343 | "\n",
344 | "for j in ['2019']:\n",
345 | " t= retrieve_year_clim(land_indices, n, j , clean_frame_df)\n",
346 | " file_save_loc = 'data/new_ndvi/' + j + '.csv'\n",
347 | " t.to_csv(file_save_loc)"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "## Extract long-term averages"
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": 6,
360 | "metadata": {},
361 | "outputs": [
362 | {
363 | "name": "stdout",
364 | "output_type": "stream",
365 | "text": [
366 | "climate/lt/aet/TerraClimate19812010_aet.nc\n",
367 | "climate/lt/def/TerraClimate19812010_def.nc\n",
368 | "climate/lt/pet/TerraClimate19812010_pet.nc\n",
369 | "climate/lt/ppt/TerraClimate19812010_ppt.nc\n",
370 | "climate/lt/srad/TerraClimate19812010_srad.nc\n",
371 | "climate/lt/tmax/TerraClimate19812010_tmax.nc\n",
372 | "climate/lt/tmin/TerraClimate19812010_tmin.nc\n",
373 | "climate/lt/vap/TerraClimate19812010_vap.nc\n",
374 | "climate/lt/vpd/TerraClimate19812010_vpd.nc\n",
375 | "climate/lt/soil/TerraClimate19812010_soil.nc\n"
376 | ]
377 | }
378 | ],
379 | "source": [
380 | "measures_2 = ['aet', 'def', 'pet', 'ppt', 'srad', 'tmax', 'tmin', 'vap', 'vpd', 'soil'] \n",
381 | "\n",
382 | "def extract_nc_lt(indices, length, variable):\n",
383 | " path = 'climate/lt/' + variable + '/'\n",
384 | " full_path = path + 'TerraClimate19812010_' + variable + '.nc'\n",
385 | " print(full_path)\n",
386 | " ds = np.array(Dataset(full_path)[variable])\n",
387 | " max_y1 = np.max(ds, axis = 0)\n",
388 | " min_y1 = np.min(ds, axis = 0)\n",
389 | " var_y1 = np.var(ds, axis = 0)\n",
390 | " mean_y1= np.mean(ds, axis = 0)\n",
391 | " max_y1 = resize(max_y1, (2160, 4320))[indices].reshape(length,1)\n",
392 | " min_y1 = resize(min_y1, (2160, 4320))[indices].reshape(length,1)\n",
393 | " var_y1 = resize(var_y1, (2160, 4320))[indices].reshape(length,1)\n",
394 | " mean_y1 = resize(mean_y1, (2160, 4320))[indices].reshape(length,1)\n",
395 | " return max_y1, min_y1, var_y1, mean_y1\n",
396 | "\n",
397 | "for i in measures_2:\n",
398 | " clean_frame_df[i+'_lt_max'], clean_frame_df[i+'_lt_min'], clean_frame_df[i+'_lt_var'], clean_frame_df[i+'_lt_mean'] = extract_nc_lt(land_indices, n, i)"
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": 7,
404 | "metadata": {},
405 | "outputs": [],
406 | "source": [
407 | "lt_frame = clean_frame_df[['lon', 'lat','aet_lt_max', 'aet_lt_min', 'aet_lt_var', 'aet_lt_mean', 'def_lt_max',\n",
408 | " 'def_lt_min', 'def_lt_var', 'def_lt_mean', 'pet_lt_max', 'pet_lt_min',\n",
409 | " 'pet_lt_var', 'pet_lt_mean', 'ppt_lt_max', 'ppt_lt_min', 'ppt_lt_var',\n",
410 | " 'ppt_lt_mean', 'srad_lt_max', 'srad_lt_min', 'srad_lt_var', 'soil_lt_max', 'soil_lt_min', 'soil_lt_var',\n",
411 | " 'soil_lt_mean',\n",
412 | " 'srad_lt_mean', 'tmax_lt_max', 'tmax_lt_min', 'tmax_lt_var',\n",
413 | " 'tmax_lt_mean', 'tmin_lt_max', 'tmin_lt_min', 'tmin_lt_var',\n",
414 | " 'tmin_lt_mean', 'vap_lt_max', 'vap_lt_min', 'vap_lt_var', 'vap_lt_mean',\n",
415 | " 'vpd_lt_max', 'vpd_lt_min', 'vpd_lt_var', 'vpd_lt_mean' ]]\n",
416 | "lt_frame.to_csv('data/lt.csv')"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "## Compare NDVI for the same year"
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": 9,
429 | "metadata": {},
430 | "outputs": [
431 | {
432 | "name": "stderr",
433 | "output_type": "stream",
434 | "text": [
435 | "/home/ubuntu/anaconda3/envs/w210-dlvm/lib/python3.7/site-packages/ipykernel_launcher.py:3: UserWarning: WARNING: valid_range not used since it\n",
436 | "cannot be safely cast to variable data type\n",
437 | " This is separate from the ipykernel package so we can avoid doing imports until\n"
438 | ]
439 | }
440 | ],
441 | "source": [
442 | "path = 'ndvi/ndvi3g_geo_v1_'\n",
443 | "file_1h = path + '2015' + '_0106.nc4'\n",
444 | "ds_3 = np.array(Dataset(file_1h)['ndvi'])\n",
445 | "max_y3 = np.max(ds_3, axis = 0)[land_indices].reshape(n)\n",
446 | "min_y3 = np.min(ds_3, axis = 0)[land_indices].reshape(n)"
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": 10,
452 | "metadata": {},
453 | "outputs": [
454 | {
455 | "ename": "HDF4Error",
456 | "evalue": "SD (15): File is supported, must be either hdf, cdf, netcdf",
457 | "output_type": "error",
458 | "traceback": [
459 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
460 | "\u001b[0;31mHDF4Error\u001b[0m Traceback (most recent call last)",
461 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mfile_list\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlistdir\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfile_dir\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfile_list\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mndvi_file\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mSD\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfile_dir\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mfile_list\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mSDC\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mREAD\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0msds_obj\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mndvi_file\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mselect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'CMG 0.05 Deg 16 days NDVI'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msds_obj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# get sds data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
462 | "\u001b[0;32m~/anaconda3/envs/w210-dlvm/lib/python3.7/site-packages/pyhdf/SD.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, path, mode)\u001b[0m\n\u001b[1;32m 1427\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1428\u001b[0m \u001b[0mid\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_C\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSDstart\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1429\u001b[0;31m \u001b[0m_checkErr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'SD'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"cannot open %s\"\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1430\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_id\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1431\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
463 | "\u001b[0;32m~/anaconda3/envs/w210-dlvm/lib/python3.7/site-packages/pyhdf/error.py\u001b[0m in \u001b[0;36m_checkErr\u001b[0;34m(procName, val, msg)\u001b[0m\n\u001b[1;32m 21\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 22\u001b[0m \u001b[0merr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"%s : %s\"\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mprocName\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 23\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mHDF4Error\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
464 | "\u001b[0;31mHDF4Error\u001b[0m: SD (15): File is supported, must be either hdf, cdf, netcdf"
465 | ]
466 | }
467 | ],
468 | "source": [
469 | "year = '2015'\n",
470 | "ndvi_frame = np.zeros((23,2160,4320))\n",
471 | "file_dir = 'evi/' + year + '/'\n",
472 | "file_list = os.listdir(file_dir)\n",
473 | "for i in range(len(file_list)):\n",
474 | " ndvi_file = SD(file_dir + file_list[i], SDC.READ)\n",
475 | " sds_obj = ndvi_file.select('CMG 0.05 Deg 16 days NDVI') \n",
476 | " data = np.array(sds_obj.get()) # get sds data\n",
477 | " data = resize(data, (2160, 4320), preserve_range=True)\n",
478 | " ndvi_frame[i,:,:]= data\n",
479 | " ds_1, ds_2 = ndvi_frame[0:12,:,:], ndvi_frame[12:,:,:]\n",
480 | " max_y1 = np.max(ds_1, axis = 0)[land_indices].reshape(n)\n",
481 | " min_y1 = np.min(ds_1, axis = 0)[land_indices].reshape(n)"
482 | ]
483 | },
484 | {
485 | "cell_type": "code",
486 | "execution_count": null,
487 | "metadata": {},
488 | "outputs": [],
489 | "source": [
490 | "print(min_y1.mean())\n",
491 | "print(min_y3.mean())"
492 | ]
493 | },
494 | {
495 | "cell_type": "code",
496 | "execution_count": null,
497 | "metadata": {},
498 | "outputs": [],
499 | "source": []
500 | }
501 | ],
502 | "metadata": {
503 | "kernelspec": {
504 | "display_name": "Python 3",
505 | "language": "python",
506 | "name": "python3"
507 | },
508 | "language_info": {
509 | "codemirror_mode": {
510 | "name": "ipython",
511 | "version": 3
512 | },
513 | "file_extension": ".py",
514 | "mimetype": "text/x-python",
515 | "name": "python",
516 | "nbconvert_exporter": "python",
517 | "pygments_lexer": "ipython3",
518 | "version": "3.7.6"
519 | }
520 | },
521 | "nbformat": 4,
522 | "nbformat_minor": 2
523 | }
524 |
--------------------------------------------------------------------------------
/gim_v2a.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sample data from GEE for irrigated cropland analysis"
3 | output:
4 | pdf_document: default
6 | html_document:
7 | df_print: paged
8 | ---
9 |
10 | ### Introduction to the dataset
11 |
12 | We can read the GeoJSON below if we have the sf library. Otherwise, we can read the CSV, which contains all the features and the label.
13 |
14 | The data come from a worldwide sample of 10,000 points for the year 2002. The points were allocated proportionately per region, so smaller continents such as Australia contribute fewer samples than, say, Asia or Africa. The regions were:
15 |
16 | ```javascript
17 | var worldRegions = [
18 | "North America",
19 | "Central America",
20 | "South America",
21 | "EuropeSansRussia",
22 | "EuropeanRussia",
23 | "Africa",
24 | "SW Asia",
25 | "Central Asia",
26 | "N Asia",
27 | "E Asia",
28 | "SE Asia",
29 | "S Asia",
30 | "Australia",
31 | // and New Zealand and Papua New Guinea from Oceania
32 | ];
33 | ```
34 |
35 | The features come from the following datasets:
36 |
37 | * [MODIS MOD13A2](https://developers.google.com/earth-engine/datasets/catalog/MODIS_006_MOD13A2) (NDVI, EVI)
38 | * [MODIS MOD09GA_NDWI](https://developers.google.com/earth-engine/datasets/catalog/MODIS_MOD09GA_NDWI) (NDWI)
39 | * [NASA GRACE](https://developers.google.com/earth-engine/datasets/catalog/NASA_GRACE_MASS_GRIDS_LAND) (groundwater data)
40 | * [TERRACLIMATE](https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_TERRACLIMATE) (climate data)
41 | * [NASA GLDAS](https://developers.google.com/earth-engine/datasets/catalog/NASA_GLDAS_V021_NOAH_G025_T3H) (land data)
42 | * [NASA FLDAS](https://developers.google.com/earth-engine/datasets/catalog/NASA_FLDAS_NOAH01_C_GL_M_V001) (famine data)
43 |
44 | See the [GEE script](https://code.earthengine.google.com/?scriptPath=users%2Fdeepakna%2Fmids_w210_irrigated_cropland%3Av3%2Ffeatures.js) to understand the features better.
45 |
46 | ```{r}
47 | library(sf)
48 | library(caret)
49 | library(dplyr)
50 | library(corrplot)
51 | library(e1071)
52 | library(kernlab)
53 | library(knitr)
54 | library(kableExtra)
55 | library(randomForest)
56 | library(doParallel)
57 | library(pROC)
58 | library(tibble)
59 | library(psych)
60 |
61 | set.seed(10)
62 | registerDoParallel(4)
63 | getDoParWorkers()
64 | ```
65 |
66 | ### EDA
67 |
68 | Let us read and have a first look at the data.
69 |
70 | ```{r}
71 | sdf <- read_sf("data/sample_v2a.geojson")
72 | ```
73 |
74 | ```{r}
75 | kable(summary(sdf))
76 | ```
77 |
78 | First, let us drop the unwanted columns:
79 |
80 | ```{r}
81 | df <- as.data.frame(sdf)
82 | df <- dplyr::select(df, c(-"id", -"geometry"))
83 | ```
84 |
85 | Next, let us check whether any columns have near-zero variance:
86 |
87 | ```{r}
88 | low.var <- nearZeroVar(df, names = TRUE)
89 | low.var
90 | ```
91 |
92 | We don't want to remove the label, but we can remove the constant!
93 |
94 | ```{r}
95 | # final.low.var <- stri_remove_empty(str_remove(low.var, "LABEL"))
96 | # df <- dplyr::select(df, -all_of(final.low.var))
97 | df <- dplyr::select(df, -c("constant"))
98 | ```
99 |
100 | Let's check for missing values; since we encoded them as -999 during sampling, we expect none:
101 |
102 | ```{r}
103 | sum(is.na(df))
104 | ```
105 |
106 | There are no missing values. This is good.
107 |
108 | We will not preprocess the features at the moment, because some models are robust to skews and non-normality. If a model needs it, we will apply pre-processing as needed.
109 |
110 | ```{r}
111 | labels.df <- dplyr::select(df, LABEL)
112 | features.df <- dplyr::select(df, -LABEL)
113 | ```
114 |
115 | Let's check for highly correlated features.
116 |
117 | ```{r}
118 | correlations <- cor(features.df)
119 | high.corr <- findCorrelation(correlations, cutoff = 0.9, names = TRUE)
120 | high.corr
121 | ```
122 |
123 | Let's remove them.
124 |
125 | ```{r}
126 | high.corr <- findCorrelation(correlations, cutoff = 0.9)
127 | #features.df <- features.df[-c(high.corr)]
128 | ```
129 |
130 | ```{r}
131 | correlations <- cor(features.df)
132 | #correlations
133 | corrplot(correlations)
134 | ```
135 |
136 | Let's look at the label column and create a factor variable for our models.
137 |
138 | ```{r}
139 | # t <- df %>% select(CanopInt_inst) %>% filter(CanopInt_inst != -9999)
140 | #
141 | # summary(df$CanopInt_inst)
142 | # hist(df$CanopInt_inst)
143 | # summary(t)
144 | #
145 | # t2 <- df %>% select(X, Y, EVI_Amplitude_1) %>% filter(EVI_Amplitude_1 != -9999)
146 |
147 | ```
148 |
149 | ```{r}
150 | hist(labels.df$LABEL, main = "Histogram: Area irrigated (ha)", xlab = "Area irrigated per 5 arc min grid cell (ha)")
151 | ```
152 |
153 | This is a very skewed distribution. Let's try a log transform; because many of the values are 0, we'll use log1p().
154 |
155 | ```{r}
156 | hist(log(labels.df$LABEL), main = "log of irrigated area (Ha)")
157 | log.labels.df <- data.frame(LABEL = log1p(labels.df$LABEL))
158 | hist(log.labels.df$LABEL, main = "log1p of irrigated area (Ha)")
159 | ```
160 |
161 | Let's decide where to draw our label. It represents the area irrigated in a cell of 5 arc min x 5 arc min. Five arc minutes is about 9.3 km at the equator, so a cell covers roughly 9.3 x 9.3 = 86 square kilometers. Since 1 square kilometer is 100 hectares, a cell can hold at most about 8,600 hectares of irrigated land (at the equator); we check the labels against the stricter 6,889 ha figure below.
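The cell-size arithmetic can be checked directly, assuming the standard ~40,075 km equatorial circumference:

```python
EQUATOR_KM = 40075.0  # standard equatorial circumference of the Earth

km_per_arcmin = EQUATOR_KM / (360 * 60)  # ~1.855 km per arc minute
cell_km = 5 * km_per_arcmin              # side of a 5 arc min cell at the equator
max_ha = cell_km ** 2 * 100              # 1 km^2 = 100 ha
print(round(cell_km, 2), "km ->", round(max_ha), "ha at most")
```

Any label exceeding this bound would indicate a data problem.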
162 |
163 | ```{r}
164 | labels.df %>% dplyr::select(LABEL) %>% filter(LABEL > 6889) %>% count()
165 | ```
166 |
167 | The labels look correct.
168 |
169 | For our model, let us pick a threshold above which a cell counts as irrigated.
170 |
171 | ```{r}
172 | class.thres <- 1 # threshold on the log1p scale (note log(0.05 * 6889) would be ~5.8)
173 | hist(log.labels.df[log.labels.df$LABEL < class.thres,], main = "Below threshold", xlab = "log1p(haIrrigated)")
174 | hist(log.labels.df[log.labels.df$LABEL >= class.thres,], main = "Above threshold", xlab = "log1p(haIrrigated)")
175 | ```
176 |
177 | There are very few values strictly between 0 and 1. To start with, we will build a binary classification model. Let's set all log1p-labels greater than or equal to 1 as "irrigated", and values below that as "not irrigated". We want this cutoff as high as possible without losing too much data.
178 |
179 | ```{r}
180 | labels.df$BLABEL <- ifelse(log.labels.df$LABEL < class.thres, "nonirrigated", "irrigated")
181 | labels.df$BLABEL <- factor(labels.df$BLABEL, levels = c("nonirrigated", "irrigated"))
182 | ```
183 |
184 | Let's look for class imbalance.
185 |
186 | ```{r}
187 | irrigated.count <- as.numeric(labels.df %>% filter(BLABEL == "irrigated") %>% count())
188 | nonirrigated.count <- as.numeric(labels.df %>% filter(BLABEL == "nonirrigated") %>% count())
189 | irrigated.count / (irrigated.count + nonirrigated.count)
190 | ```
191 |
192 | ```{r}
193 | sdf.copy <- sdf
194 | sdf.copy$logLabel <- log1p(sdf.copy$LABEL)
195 | sdf.copy %>% dplyr::select('logLabel') %>% dplyr::filter(logLabel >= class.thres) %>% plot(breaks = 0:9)
196 | ```
197 |
198 | About 20% of the data is in the irrigated class. We'll first use the data as it is to train a model and see if it generalizes, before we do any resampling or reweighting.
199 |
200 | We will first fit a few different models with all features and see how they perform. Then we will do some feature selection, re-run the models, and select the best one. All of this uses 5-fold cross-validation on the training set. We will reserve 20% of the data as a held-out test set, run the final model on it only once, and report our final results on it.
201 |
202 | ### Creating data split
203 |
204 | We will use an 80:20 split for training and test sets, and 5-fold cross-validation on the training set.
205 |
206 | ```{r}
207 | # we keep only BLABEL
208 | model.df <- cbind(features.df, labels.df) %>% dplyr::select(-LABEL)
209 | partition <- createDataPartition(y = model.df$BLABEL, p = 0.8, list = FALSE)
210 | training.df = model.df[partition, ]
211 | test.df <- model.df[-partition, ]
212 | ```
213 |
214 | Next, we will fit a few different models. Let us set up a shared training-control object so that all models are trained and evaluated uniformly.
215 |
216 | ```{r}
217 | # Accuracy, kappa, AUC
218 | # Adapted from Kuhn2013
219 | fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
220 | train.ctrl <- trainControl(method = "cv", number = 5, savePredictions = TRUE, summaryFunction = fiveStats, classProbs = TRUE, allowParallel = TRUE)
221 | model.metric <- "Kappa"
222 | ```
223 |
224 | ### Random forest
225 |
226 | Since individual models can be quirky, we can try an ensemble instead. We will try some tuning as well.
227 |
228 | Random forests are robust to data ranges, so we don't do any preprocessing.
229 |
230 | ```{r}
231 | model.rf <- train(BLABEL ~ ., data = training.df, method = "rf", metric = model.metric, trControl = train.ctrl)
232 | kable(model.rf$results)
233 | ```
234 |
235 | ```{r}
236 | indices <- model.rf$pred$mtry == model.rf$bestTune$mtry
237 | ref <- model.rf$pred$obs[indices]
238 | pred.probs <- model.rf$pred$irrigated[indices]
239 | roc.rf <- roc(ref, pred.probs, levels = rev(levels(model.rf$pred$pred)))
240 | plot(roc.rf)
241 | ```
242 |
243 | ```{r}
244 | kable(model.rf$bestTune)
245 | ```
246 |
247 | ```{r}
248 | thresh <- coords(roc.rf, x = "best", best.method = "closest.topleft")
249 | kable(thresh)
250 | ```
251 |
252 | ### Selecting features
253 |
254 | Let's retrain the random forest model, this time with the recommended 1000 trees and a tuning length of 10. We will then look at feature importances.
255 |
256 | ```{r}
257 | model.rf2 <- train(BLABEL ~ ., data = training.df, method = "rf", metric = model.metric, trControl = train.ctrl, tuneLength = 10, ntree = 1000)
258 | kable(model.rf2$results)
259 | ```
260 |
261 | ```{r}
262 | plot(varImp(model.rf2))
263 | ```
264 |
265 | ```{r}
266 | plot(model.rf2)
267 | varImp(model.rf2)
268 | ```
269 |
270 | We will start adding variables one by one based on the chart above, until we get close enough to the best kappa we have with all the variables in place.
271 |
272 | ```{r}
273 | # want it at least 0.60
274 | # EVI_Amplitude_1, Y: 0.33
275 | # + pet_period4: 0.50
276 | # + LC_type2: 0.54
277 | # + vap_period2: 0.58
278 | # + aet_period4: 0.58
279 | # + vap_period3: 0.59
280 | # + LC_Type1: 0.58
281 | # + X: 0.60
282 |
283 | final.features.df <- training.df %>% dplyr::select(BLABEL, Y, LC_Type2, EVI_Amplitude_1, LC_Type1, pet, tmmx, X, Tveg_tavg, Albedo_inst, vpd)
284 | model.rf3 <- train(BLABEL ~ ., data = final.features.df, method = "rf", metric = model.metric, trControl = train.ctrl, tuneLength = 10, ntree = 1000)
285 | kable(model.rf3$results)
286 | ```
287 |
288 | ```{r}
289 | plot(varImp(model.rf3, scale = FALSE))
290 | ```
291 |
292 | We get a best kappa value of 0.61 with mtry=4. We will select these features, based on the feature importance and above experimentation.
293 |
294 | * Tveg_tavg: Transpiration (GLDAS)
295 | * Y: latitude: a proxy for latitude-specific features, such as amount of sunlight, prevailing winds, etc.
296 | * tmmx: Maximum temperature (TERRACLIMATE)
297 | * vpd: Vapor pressure deficit. It is the difference (deficit) between the amount of moisture in the air and how much moisture the air can hold when it is saturated. (TERRACLIMATE)
298 | * vap: Vapor pressure (TERRACLIMATE)
299 | * RootMoist_inst: Root zone soil moisture (GLDAS)
300 | * X: longitude: a proxy for longitude-specific features
301 | * Albedo_inst: Fraction of light reflected back to space. Water reflects less light compared to land. (GLDAS)
302 | * NDVI: Green-leaf vegetation (MODIS)
303 | * Psurf_f_inst: Atmospheric pressure (GLDAS)
304 | * ECanop_tavg: Canopy water evaporation (GLDAS)
305 | * ESoil_tavg: Direct evaporation from bare soil (GLDAS)
306 | * Wind_f_inst: Wind speed (GLDAS)
307 | * pdsi: Palmer drought severity index (TERRACLIMATE)
308 | * SWdown_f_tavg: Downward short-wave radiation flux (GLDAS)
309 |
310 | We can also find the best threshold:
311 |
312 | ```{r}
313 | indices <- model.rf3$pred$mtry == model.rf3$bestTune$mtry
314 | ref <- model.rf3$pred$obs[indices]
315 | pred.probs <- model.rf3$pred$irrigated[indices]
316 | roc.rf3 <- roc(ref, pred.probs, levels = rev(levels(model.rf3$pred$pred)))
317 | plot(roc.rf3)
318 | thresh <- coords(roc.rf3, x = "best", best.method = "closest.topleft")
319 | kable(thresh)
320 | ```
321 |
322 | Setting a threshold of 0.2455 gives us good specificity as well as sensitivity.
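pROC's `best.method = "closest.topleft"` picks the threshold whose (specificity, sensitivity) point lies nearest the top-left corner of the ROC plot. A minimal sketch of that rule in Python (toy labels and probabilities, not the project's data):

```python
def best_threshold(labels, probs):
    """Pick the cutoff minimizing (1 - sensitivity)^2 + (1 - specificity)^2,
    i.e. the ROC point closest to the top-left corner."""
    best_t, best_d = None, float("inf")
    for t in sorted(set(probs)):
        tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p > t)
        fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p <= t)
        tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p <= t)
        fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p > t)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        d = (1 - sens) ** 2 + (1 - spec) ** 2
        if d < best_d:
            best_t, best_d = t, d
    return best_t
```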
323 |
324 | ```{r}
325 | model.thres <- 0.2455
326 | ```
327 |
328 | We can compare against the original ROC curve as well:
329 |
330 | ```{r}
331 | plot(roc.rf, col = "orange", main = "ROC Curves")
332 | plot(roc.rf3, col = "green", add = TRUE)
333 | legend("bottomright", legend = c("Original RF", "Feature selected RF"), col = c("orange", "green"), lwd=2)
334 | ```
335 |
336 | The new model performs comparably, with far fewer features.
337 |
338 | ### Model Assessment
339 |
340 | Let's look at how the model performs on each class and where on the map it misclassifies.
341 |
342 | ```{r}
343 | sdf.copy2 <- sdf.copy
344 | model.pred <- predict(model.rf3, newdata = sdf.copy2, type = "prob")
345 | model.pred$PRLABEL <- ifelse(model.pred$irrigated > model.thres, model.pred$irrigated - model.thres, 0)
346 | model.pred$BLABEL <- as.numeric(sdf.copy2$logLabel > class.thres)
347 | model.pred$labelDiff <- model.pred$PRLABEL - model.pred$BLABEL
348 | summary(model.pred)
349 | sdf.copy2 <- cbind(sdf.copy2, model.pred$labelDiff)
350 | # labelDiff < 0 (blue): model predicts non-irrigated but label is irrigated: false negative
351 | # labelDiff > 0 (red): model predicts irrigated but label is non-irrigated: false positive
352 | sdf.copy2 %>% select("model.pred.labelDiff") %>% plot(main = "Model assessment")
353 | ```
354 |
355 | It seems the model has many more false positives (reds) than false negatives (blues). The false positives are concentrated in China, Ukraine, northern Mexico and the northern North American prairies. The model resolution of 8 km may be at play here, because the model cannot pick up small amounts of irrigation without some noise. On the other hand, it may also mean there is some underreporting in the labels, especially because they were compiled from government records.
356 |
357 | Let's now run it on our test set for our final assessment:
358 |
359 | ```{r}
360 | pred.vec <- predict(model.rf3, newdata = test.df)
361 | pred.df <- data.frame(pred = factor(pred.vec, levels = c("nonirrigated", "irrigated")), BLABEL = test.df$BLABEL)
362 | confusionMatrix(pred.df$pred, pred.df$BLABEL)
363 | ```
364 |
365 | Let's check the predictions for calibration.
366 |
367 | ```{r}
368 | cal.pred <- predict(model.rf3, newdata = test.df, type = "prob")
369 | cal.df <- data.frame(BLABEL = test.df$BLABEL, pred = cal.pred$nonirrigated)
370 | cal.obj <- calibration(BLABEL ~ pred, data = cal.df)
371 | plot(cal.obj, xlab = "Predicted probability", ylab = "Actual probability")
372 | ```
373 |
374 | From the plot, we see that the predictions are above the diagonal at the upper ranges. This means the model is *under-forecasting* at those ranges, i.e. the actual probabilities are higher than what the model is predicting. Similarly, the model is *over-forecasting* at mid ranges, i.e. the actual probabilities are lower than what the model is predicting.
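The calibration plot works by binning the predicted probabilities and comparing each bin's mean prediction with the observed event rate. A rough sketch of that computation (bin count arbitrary, toy inputs):

```python
def calibration_curve(labels, probs, bins=5):
    """Return (mean predicted prob, observed event rate) per non-empty bin."""
    points = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        # last bin is closed on the right so that p == 1.0 is included
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if idx:
            mean_pred = sum(probs[i] for i in idx) / len(idx)
            obs_rate = sum(labels[i] for i in idx) / len(idx)
            points.append((mean_pred, obs_rate))
    return points
```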
375 |
376 | Let us calibrate the model results by stacking a simple logistic regression model on top of them.
377 |
378 | ```{r}
379 | # TODO: split into train and test
380 | cal.glm <- glm(BLABEL ~ pred, data = cal.df, family = binomial(link = "logit"))
381 | final.pred <- predict(cal.glm, cal.df, type = "response")  # probabilities, not log-odds
382 | cal2.df <- data.frame(BLABEL = cal.df$BLABEL, pred = 1 - final.pred)
383 | cal2.obj <- calibration(BLABEL ~ pred, data = cal2.df)
384 | plot(cal2.obj, xlab = "Predicted probability", ylab = "Actual probability")
385 | ```
386 |
--------------------------------------------------------------------------------
/gim_v3.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sample data from GEE for irrigated cropland analysis"
3 | output:
4 | pdf_document: default
6 | html_document:
7 | df_print: paged
8 | ---
9 |
10 | ### Introduction to the dataset
11 |
12 | We can read the GeoJSON below if we have the `sf` library. Otherwise we can read the CSV, which contains all the features and the label.
13 |
14 | This is obtained from a worldwide sample of 10000 points, for the year 2002. The points were obtained proportionately on a per-region basis, so smaller continents such as Australia have a smaller number of samples than, say, Asia or Africa. The regions were:
15 |
16 | ```javascript
17 | var worldRegions = [
18 | "North America",
19 | "Central America",
20 | "South America",
21 | "EuropeSansRussia",
22 | "EuropeanRussia",
23 | "Africa",
24 | "SW Asia",
25 | "Central Asia",
26 | "N Asia",
27 | "E Asia",
28 | "SE Asia",
29 | "S Asia",
30 | "Australia",
31 | // and New Zealand and Papua New Guinea from Oceania
32 | ];
33 | ```
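Sampling proportionately per region can be sketched as a largest-remainder allocation (the region weights below are made up for illustration, not the project's actual areas):

```python
def allocate_samples(region_weights, total):
    """Split `total` sample points across regions proportional to their weight,
    handing leftover points to the largest fractional remainders."""
    total_weight = sum(region_weights.values())
    raw = {r: total * w / total_weight for r, w in region_weights.items()}
    counts = {r: int(v) for r, v in raw.items()}
    leftover = total - sum(counts.values())
    for r in sorted(raw, key=lambda r: raw[r] - counts[r], reverse=True)[:leftover]:
        counts[r] += 1
    return counts
```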
34 |
35 | The features come from the following datasets:
36 |
37 | * [MODIS MOD13A2](https://developers.google.com/earth-engine/datasets/catalog/MODIS_006_MOD13A2) (NDVI, EVI)
38 | * [MODIS MOD09GA_NDWI](https://developers.google.com/earth-engine/datasets/catalog/MODIS_MOD09GA_NDWI) (NDWI)
39 | * [NASA GRACE](https://developers.google.com/earth-engine/datasets/catalog/NASA_GRACE_MASS_GRIDS_LAND) (groundwater data)
40 | * [TERRACLIMATE](https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_TERRACLIMATE) (climate data)
41 | * [NASA GLDAS](https://developers.google.com/earth-engine/datasets/catalog/NASA_GLDAS_V021_NOAH_G025_T3H) (land data)
42 | * [NASA FLDAS](https://developers.google.com/earth-engine/datasets/catalog/NASA_FLDAS_NOAH01_C_GL_M_V001) (famine data)
43 |
44 | See the [GEE script](https://code.earthengine.google.com/?scriptPath=users%2Fdeepakna%2Fmids_w210_irrigated_cropland%3Av3%2Ffeatures.js) to understand the features better.
45 |
46 | ```{r}
47 | library(sf)
48 | library(caret)
49 | library(dplyr)
50 | library(corrplot)
51 | library(e1071)
52 | library(kernlab)
53 | library(knitr)
54 | library(kableExtra)
55 | library(randomForest)
56 | library(doParallel)
57 | library(pROC)
58 | library(tibble)
59 | library(psych)
60 |
61 | set.seed(10)
62 | registerDoParallel(4)
63 | getDoParWorkers()
64 | ```
65 |
66 | ### EDA
67 |
68 | Let us read and have a first look at the data.
69 |
70 | ```{r}
71 | sdf <- read_sf("data/sample_v3.geojson")
72 | ```
73 |
74 | ```{r}
75 | kable(summary(sdf))
76 | ```
77 |
78 | Let us first get rid of some unwanted columns:
79 |
80 | ```{r}
81 | df <- as.data.frame(sdf)
82 | ```
83 |
84 | ```{r}
85 | df <- dplyr::select(df, c(-"id", -"geometry"))
86 | ```
87 |
88 | Next, let us check if there are any columns with very little variance:
89 |
90 | ```{r}
91 | low.var <- nearZeroVar(df, names = TRUE)
92 | low.var
93 | ```
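caret's `nearZeroVar()` flags a column when the ratio of the most frequent value to the second most frequent exceeds 95/5 and the share of unique values is under 10%. A compact re-statement of that rule (an illustrative sketch, not the caret source):

```python
from collections import Counter

def near_zero_var(column, freq_cut=95 / 5, unique_cut=0.10):
    """Caret-style near-zero-variance check for a single column."""
    counts = Counter(column).most_common()
    if len(counts) < 2:
        return True  # constant column
    freq_ratio = counts[0][1] / counts[1][1]
    unique_ratio = len(counts) / len(column)
    return freq_ratio > freq_cut and unique_ratio < unique_cut
```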
94 |
95 | We don't want to remove the label, but we can drop the near-constant columns: the `constant` band and `swe` (snow water equivalent).
96 |
97 | ```{r}
98 | df <- dplyr::select(df, -c("constant", "swe"))
99 | ```
100 |
101 | ```{r}
102 | df$TLABEL <- factor(x = df$TLABEL, levels = c(0, 1, 2), labels = c("none", "lowtomid", "high"))
103 | ```
104 |
105 | Let's check for missing values; we already encoded them as -999 during sampling, so we expect no NAs here:
106 |
107 | ```{r}
108 | sum(is.na(df))
109 | ```
110 |
111 | There are no missing values. This is good.
112 |
113 | We will not preprocess the features at the moment, because some models are robust to skews and non-normality. If a model needs it, we will apply pre-processing as needed.
114 |
115 | ```{r}
116 | labels.df <- dplyr::select(df, TLABEL)
117 | features.df <- dplyr::select(df, -TLABEL)
118 | ```
119 |
120 | Let's check for highly correlated features.
121 |
122 | ```{r}
123 | correlations <- cor(features.df)
124 | corrplot(correlations)
125 | hist(features.df$Albedo_inst)
126 | ```
127 |
128 | The label column, which we converted to a factor above, has a very skewed distribution.
131 |
132 | We will first fit a few different models with all features and see how they perform. Then, we will do some feature selection, re-run the models, and select the best one. We will do all of this with 5-fold cross-validation on the training set. We will reserve 20% of the data as a test set, run our final model on it only once, and report our final results from that run.
133 |
134 | ### Creating data split
135 |
136 | We will use an 80:20 split for training and test sets, and 5-fold cross-validation on the training set.
137 |
138 | ```{r}
139 | # recombine features with the TLABEL label
140 | model.df <- cbind(features.df, labels.df)
141 | partition <- createDataPartition(y = model.df$TLABEL, p = 0.8, list = FALSE)
142 | training.df = model.df[partition, ]
143 | test.df <- model.df[-partition, ]
144 | ```
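`createDataPartition()` performs a stratified split, sampling within each class so the train/test label proportions match. A small sketch of the same idea (the seed and proportion mirror the chunk above):

```python
import random

def stratified_split(labels, p=0.8, seed=10):
    """Return training indices, drawn per class so label proportions are preserved."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train = []
    for idx in by_class.values():
        rng.shuffle(idx)
        train.extend(idx[: round(len(idx) * p)])
    return sorted(train)
```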
145 |
146 | Next, we will fit our models. Let us set up a training-control function that does not vary across the models, so that we can train and evaluate them uniformly.
147 |
148 | ```{r}
149 | # Accuracy, kappa, AUC
150 | # Adapted from Kuhn2013
151 | fiveStats <- function(...) c(multiClassSummary(...), defaultSummary(...))
152 | train.ctrl <- trainControl(method = "cv", number = 5, savePredictions = TRUE, summaryFunction = fiveStats, classProbs = TRUE, allowParallel = TRUE)
153 | model.metric <- "Kappa"
154 | ```
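We optimize Cohen's kappa rather than accuracy because the classes are so imbalanced; kappa discounts the agreement expected by chance. From a confusion matrix (rows = actual, columns = predicted) it is computed as:

```python
def cohens_kappa(matrix):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    observed = sum(matrix[i][i] for i in range(k)) / n
    chance = sum(sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(k)) / (n * n)
    return (observed - chance) / (1 - chance)
```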
155 |
156 | ### Random forest
157 |
158 | Since individual models can be quirky, we can try an ensemble instead. We will try some tuning as well.
159 |
160 | Random forests are robust to data ranges, so we don't do any preprocessing.
161 |
162 | ```{r}
163 | model.rf <- train(TLABEL ~ ., data = training.df, method = "rf", metric = model.metric, trControl = train.ctrl)
164 | kable(model.rf$results)
165 | ```
166 |
167 | ```{r}
168 | confusionMatrix(model.rf)
169 | ```
170 |
171 | ```{r}
172 | kable(model.rf$bestTune)
173 | ```
174 |
175 | ### Selecting features
176 |
177 | Let's retrain the random forest model, this time with the recommended 1000 trees and a tuning length of 10. We will then look at feature importances.
178 |
179 | ```{r}
180 | model.rf2 <- train(TLABEL ~ ., data = training.df, method = "rf", metric = model.metric, trControl = train.ctrl, tuneLength = 10, ntree = 1000)
181 | kable(model.rf2$results)
182 | ```
183 |
184 | ```{r}
185 | plot(varImp(model.rf2))
186 | ```
187 |
188 | ```{r}
189 | plot(model.rf2)
190 | varImp(model.rf2)
191 | ```
192 |
193 | We will start adding variables one by one based on the chart above, until we get close enough to the best kappa we have with all the variables in place.
194 |
195 | ```{r}
196 | final.features.df <- training.df %>% dplyr::select(TLABEL, Y, LC_Type2, X, pet, EVI, Tveg_tavg, Psurf_f_inst, NDVI, tmmx, Albedo_inst, pdsi, vs)
197 | model.rf3 <- train(TLABEL ~ ., data = final.features.df, method = "rf", metric = model.metric, trControl = train.ctrl, tuneLength = 10, ntree = 1000)
198 | kable(model.rf3$results)
199 | ```
200 |
201 | ```{r}
202 | plot(varImp(model.rf3, scale = FALSE))
203 | ```
204 |
205 | We get a best kappa value of 0.48 with mtry=8. We will select these features, based on the feature importance and above experimentation.
206 |
207 | * Tveg_tavg: Transpiration (GLDAS)
208 | * Y: latitude: a proxy for latitude-specific features, such as amount of sunlight, prevailing winds, etc.
209 | * tmmx: Maximum temperature (TERRACLIMATE)
210 | * vpd: Vapor pressure deficit. It is the difference (deficit) between the amount of moisture in the air and how much moisture the air can hold when it is saturated. (TERRACLIMATE)
211 | * vap: Vapor pressure (TERRACLIMATE)
212 | * RootMoist_inst: Root zone soil moisture (GLDAS)
213 | * X: longitude: a proxy for longitude-specific features
214 | * Albedo_inst: Fraction of light reflected back to space. Water reflects less light compared to land. (GLDAS)
215 | * NDVI: Green-leaf vegetation (MODIS)
216 | * Psurf_f_inst: Atmospheric pressure (GLDAS)
217 | * ECanop_tavg: Canopy water evaporation (GLDAS)
218 | * ESoil_tavg: Direct evaporation from bare soil (GLDAS)
219 | * Wind_f_inst: Wind speed (GLDAS)
220 | * pdsi: Palmer drought severity index (TERRACLIMATE)
221 | * SWdown_f_tavg: Downward short-wave radiation flux (GLDAS)
222 |
223 | ### Model Assessment
224 |
225 | Let's now run it on our test set for our final assessment:
226 |
227 | ```{r}
228 | pred.vec <- predict(model.rf3, newdata = test.df)
229 | pred.df <- data.frame(pred = factor(pred.vec, levels = c("none", "lowtomid", "high")), TLABEL = test.df$TLABEL)
230 | confusionMatrix(pred.df$pred, pred.df$TLABEL)
231 | ```
232 |
--------------------------------------------------------------------------------
/gim_v3b.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Sample data from GEE for irrigated cropland analysis"
3 | output:
4 | pdf_document: default
6 | html_document:
7 | df_print: paged
8 | ---
9 |
10 | ### Introduction to the dataset
11 |
12 | We can read the GeoJSON below if we have the `sf` library. Otherwise we can read the CSV, which contains all the features and the label.
13 |
14 | This is obtained from a worldwide sample of 10000 points, for the year 2002. The points were obtained proportionately on a per-region basis, so smaller continents such as Australia have a smaller number of samples than, say, Asia or Africa. The regions were:
15 |
16 | ```javascript
17 | var worldRegions = [
18 | "North America",
19 | "Central America",
20 | "South America",
21 | "EuropeSansRussia",
22 | "EuropeanRussia",
23 | "Africa",
24 | "SW Asia",
25 | "Central Asia",
26 | "N Asia",
27 | "E Asia",
28 | "SE Asia",
29 | "S Asia",
30 | "Australia",
31 | // and New Zealand and Papua New Guinea from Oceania
32 | ];
33 | ```
34 |
35 | The features come from the following datasets:
36 |
37 | * [MODIS MOD13A2](https://developers.google.com/earth-engine/datasets/catalog/MODIS_006_MOD13A2) (NDVI, EVI)
38 | * [MODIS MOD09GA_NDWI](https://developers.google.com/earth-engine/datasets/catalog/MODIS_MOD09GA_NDWI) (NDWI)
39 | * [NASA GRACE](https://developers.google.com/earth-engine/datasets/catalog/NASA_GRACE_MASS_GRIDS_LAND) (groundwater data)
40 | * [TERRACLIMATE](https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_TERRACLIMATE) (climate data)
41 | * [NASA GLDAS](https://developers.google.com/earth-engine/datasets/catalog/NASA_GLDAS_V021_NOAH_G025_T3H) (land data)
42 | * [NASA FLDAS](https://developers.google.com/earth-engine/datasets/catalog/NASA_FLDAS_NOAH01_C_GL_M_V001) (famine data)
43 |
44 | See the [GEE script](https://code.earthengine.google.com/?scriptPath=users%2Fdeepakna%2Fmids_w210_irrigated_cropland%3Av3%2Ffeatures.js) to understand the features better.
45 |
46 | ```{r}
47 | library(sf)
48 | library(caret)
49 | library(dplyr)
50 | library(corrplot)
51 | library(e1071)
52 | library(kernlab)
53 | library(knitr)
54 | library(kableExtra)
55 | library(randomForest)
56 | library(doParallel)
57 | library(pROC)
58 | library(tibble)
59 | library(psych)
60 |
61 | set.seed(10)
62 | registerDoParallel(4)
63 | getDoParWorkers()
64 | ```
65 |
66 | ### EDA
67 |
68 | Let us read and have a first look at the data.
69 |
70 | ```{r}
71 | sdf <- read_sf("data/sample_v3b.geojson")
72 | ```
73 |
74 | ```{r}
75 | kable(summary(sdf))
76 | ```
77 |
78 | Let us first get rid of some unwanted columns:
79 |
80 | ```{r}
81 | df <- as.data.frame(sdf)
82 | ```
83 |
84 | ```{r}
85 | df <- dplyr::select(df, c(-"id", -"geometry"))
86 | ```
87 |
88 | Next, let us check if there are any columns with very little variance:
89 |
90 | ```{r}
91 | low.var <- nearZeroVar(df, names = TRUE)
92 | low.var
93 | ```
94 |
95 | We don't want to remove the label, but we can drop the near-constant columns: the `constant` band and `swe` (snow water equivalent).
96 |
97 | ```{r}
98 | df <- dplyr::select(df, -c("constant", "swe"))
99 | ```
100 |
101 | ```{r}
102 | df$TLABEL <- factor(x = df$TLABEL, levels = c(0, 1, 2), labels = c("none", "lowtomid", "high"))
103 | ```
104 |
105 | Let's check for missing values; we already encoded them as -999 during sampling, so we expect no NAs here:
106 |
107 | ```{r}
108 | sum(is.na(df))
109 | ```
110 |
111 | There are no missing values. This is good.
112 |
113 | We will not preprocess the features at the moment, because some models are robust to skews and non-normality. If a model needs it, we will apply pre-processing as needed.
114 |
115 | ```{r}
116 | labels.df <- dplyr::select(df, TLABEL)
117 | features.df <- dplyr::select(df, -TLABEL)
118 | ```
119 |
120 | Let's check for highly correlated features.
121 |
122 | ```{r}
123 | correlations <- cor(features.df)
124 | corrplot(correlations)
125 | hist(features.df$Albedo_inst)
126 | ```
127 |
128 | The label column, which we converted to a factor above, has a very skewed distribution.
131 |
132 | We will first fit a few different models with all features and see how they perform. Then, we will do some feature selection, re-run the models, and select the best one. We will do all of this with 5-fold cross-validation on the training set. We will reserve 20% of the data as a test set, run our final model on it only once, and report our final results from that run.
133 |
134 | ### Creating data split
135 |
136 | We will use an 80:20 split for training and test sets, and 5-fold cross-validation on the training set.
137 |
138 | ```{r}
139 | # recombine features with the TLABEL label
140 | model.df <- cbind(features.df, labels.df)
141 | partition <- createDataPartition(y = model.df$TLABEL, p = 0.8, list = FALSE)
142 | training.df = model.df[partition, ]
143 | test.df <- model.df[-partition, ]
144 | ```
145 |
146 | Next, we will fit our models. Let us set up a training-control function that does not vary across the models, so that we can train and evaluate them uniformly.
147 |
148 | ```{r}
149 | # Accuracy, kappa, AUC
150 | # Adapted from Kuhn2013
151 | fiveStats <- function(...) c(multiClassSummary(...), defaultSummary(...))
152 | train.ctrl <- trainControl(method = "cv", number = 5, savePredictions = TRUE, summaryFunction = fiveStats, classProbs = TRUE, allowParallel = TRUE)
153 | model.metric <- "Kappa"
154 | ```
155 |
156 | ### Random forest
157 |
158 | Since individual models can be quirky, we can try an ensemble instead. We will try some tuning as well.
159 |
160 | Random forests are robust to data ranges, so we don't do any preprocessing.
161 |
162 | ```{r}
163 | model.rf <- train(TLABEL ~ ., data = training.df, method = "rf", metric = model.metric, trControl = train.ctrl)
164 | kable(model.rf$results)
165 | ```
166 |
167 | ```{r}
168 | confusionMatrix(model.rf)
169 | ```
170 |
171 | ```{r}
172 | kable(model.rf$bestTune)
173 | ```
174 |
175 | ### Selecting features
176 |
177 | Let's retrain the random forest model, this time with the recommended 1000 trees and a tuning length of 10. We will then look at feature importances.
178 |
179 | ```{r}
180 | model.rf2 <- train(TLABEL ~ ., data = training.df, method = "rf", metric = model.metric, trControl = train.ctrl, tuneLength = 10, ntree = 1000)
181 | kable(model.rf2$results)
182 | ```
183 |
184 | ```{r}
185 | plot(varImp(model.rf2))
186 | ```
187 |
188 | ```{r}
189 | plot(model.rf2)
190 | varImp(model.rf2)
191 | ```
192 |
193 | We will start adding variables one by one based on the chart above, until we get close enough to the best kappa we have with all the variables in place.
194 |
195 | ```{r}
196 | vi <- varImp(model.rf2)
197 | # keep features whose scaled importance is >= 30
198 | x <- data.frame(name = rownames(vi$importance), Overall = vi$importance$Overall)
199 | selected.features <- c("TLABEL", x %>% dplyr::filter(Overall >= 30) %>% dplyr::pull(name))
200 | selected.features
202 | ```
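The cutoff-based selection above amounts to: keep the label plus any feature scoring at least 30 on caret's scaled importance. As a standalone sketch (the scores below are illustrative):

```python
def select_by_importance(importance, cutoff=30.0, label="TLABEL"):
    """Keep the label plus every feature whose scaled importance meets the
    cutoff, ordered from most to least important."""
    keep = sorted(
        (f for f, v in importance.items() if v >= cutoff),
        key=lambda f: importance[f],
        reverse=True,
    )
    return [label] + keep
```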
203 |
204 | ```{r}
205 | final.features.df <- training.df %>% dplyr::select(all_of(selected.features))
206 | model.rf3 <- train(TLABEL ~ ., data = final.features.df, method = "rf", metric = model.metric, trControl = train.ctrl, tuneLength = 10, ntree = 1000)
207 | kable(model.rf3$results)
208 | ```
209 |
210 | ```{r}
211 | plot(varImp(model.rf3, scale = FALSE))
212 | ```
213 |
214 | We get a best kappa value of 0.55 with mtry=10. We will select these features, based on the feature importance and above experimentation.
215 |
216 | * Tveg_tavg: Transpiration (GLDAS)
217 | * Y: latitude: a proxy for latitude-specific features, such as amount of sunlight, prevailing winds, etc.
218 | * tmmx: Maximum temperature (TERRACLIMATE)
219 | * vpd: Vapor pressure deficit. It is the difference (deficit) between the amount of moisture in the air and how much moisture the air can hold when it is saturated. (TERRACLIMATE)
220 | * vap: Vapor pressure (TERRACLIMATE)
221 | * RootMoist_inst: Root zone soil moisture (GLDAS)
222 | * X: longitude: a proxy for longitude-specific features
223 | * Albedo_inst: Fraction of light reflected back to space. Water reflects less light compared to land. (GLDAS)
224 | * NDVI: Green-leaf vegetation (MODIS)
225 | * Psurf_f_inst: Atmospheric pressure (GLDAS)
226 | * ECanop_tavg: Canopy water evaporation (GLDAS)
227 | * ESoil_tavg: Direct evaporation from bare soil (GLDAS)
228 | * Wind_f_inst: Wind speed (GLDAS)
229 | * pdsi: Palmer drought severity index (TERRACLIMATE)
230 | * SWdown_f_tavg: Downward short-wave radiation flux (GLDAS)
231 |
232 | ### Model Assessment
233 |
234 | Let's now run it on our test set for our final assessment:
235 |
236 | ```{r}
237 | pred.vec <- predict(model.rf3, newdata = test.df)
238 | pred.df <- data.frame(pred = factor(pred.vec, levels = c("none", "lowtomid", "high")), TLABEL = test.df$TLABEL)
239 | confusionMatrix(pred.df$pred, pred.df$TLABEL)
240 | ```
241 |
--------------------------------------------------------------------------------
/globalIrrigationMap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/globalIrrigationMap.png
--------------------------------------------------------------------------------
/python/assessor.py:
--------------------------------------------------------------------------------
1 | import ee
2 |
3 | from sampler import get_or_create_worldwide_sample_points
4 | from common import base_asset_directory, assess_seed
5 |
6 |
7 | def assess_combined_map(map_image):
8 | sample_points = get_or_create_worldwide_sample_points(assess_seed)
9 | sampled_region = map_image \
10 | .reduceRegions(collection=sample_points, reducer=ee.Reducer.first().forEachBand(map_image)) \
11 | .map(lambda f: f.select(['actual', 'pred']))
12 | confusion_matrix = sampled_region.errorMatrix(actual="actual", predicted="pred")
13 | print(f"Confusion matrix: {confusion_matrix.getInfo()}")
14 | print(f"Kappa: {confusion_matrix.kappa().getInfo()}")
15 | print(f"Accuracy: {confusion_matrix.accuracy().getInfo()}")
16 | print("-----")
17 |
18 |
19 | def assess_cropland_model_only():
20 | cropland_map = ee.Image("users/deepakna/ellecp/v3/2005_ternary") \
21 | .addBands(ee.Image(f"{base_asset_directory}/s2005tlabelsv2")) \
22 | .select(["b1", "TLABEL"], ["pred", "actual"])
23 | print("Cropland model assessment (Elle)")
24 | cl_mask = ee.Image(f'{base_asset_directory}/CLMask')
25 | assess_combined_map(cropland_map.mask(cl_mask))
26 |
27 |
28 | def assess_timestationary_model_only():
29 | ts_map = ee.Image(f"{base_asset_directory}/post_mids_v3b_results_2005") \
30 | .addBands(ee.Image(f"{base_asset_directory}/s2005tlabels")) \
31 | .select(["classification", "TLABEL"], ["pred", "actual"])
32 | print("Time-stationary model assessment (Deepak)")
33 | assess_combined_map(ts_map)
34 |
35 |
36 | def assess_model_results():
37 | final_map = ee.Image(f'{base_asset_directory}/s2005AssessmentMapv3')
38 | assess_cropland_model_only()
39 | assess_timestationary_model_only()
40 | print("Combined model assessment")
41 | assess_combined_map(final_map)
42 |
43 |
44 | if __name__ == '__main__':
45 | ee.Initialize()
46 | assess_model_results()
47 |
--------------------------------------------------------------------------------
/python/classifier.py:
--------------------------------------------------------------------------------
1 | # Classifies features on any given year and creates results table
2 | # Requires: features already available as a single GEE asset
3 | # Note: .getInfo() calls are blocking
4 |
5 | import ee
6 |
7 | from common import (model_scale, wait_for_task_completion, get_selected_features_image, model_snapshot_path_prefix,
8 | get_selected_features, model_projection, num_samples, train_seed)
9 | from sampler import get_or_create_worldwide_sample_points
10 |
11 |
12 | def assess_model(classifier, test_partition):
13 | validated = test_partition.classify(classifier)
14 |
15 | # Get a confusion matrix representing expected accuracy.
16 | if classifier.mode() != 'PROBABILITY':
17 | validation_matrix = validated.errorMatrix('TLABEL', 'classification')
18 | print('Validation error matrix: ', validation_matrix.getInfo())
19 | print('Validation accuracy: ', validation_matrix.accuracy().getInfo())
20 | print('Validation kappa: ', validation_matrix.kappa().getInfo())
21 |
22 |
23 | def train_model(training_partition, feature_list):
24 | # same as in R (ntree)
25 | num_trees = 1000
26 | # same as in R (sampsize)
27 | bag_fraction = 0.63
28 | # derived from tuning in R (mtry)
29 | variables_per_split = 10
30 |
31 | classifier = ee.Classifier.randomForest(
32 | numberOfTrees=num_trees,
33 | bagFraction=bag_fraction,
34 | variablesPerSplit=variables_per_split,
35 | seed=10
36 | )
37 | classifier = classifier.train(
38 | features=training_partition,
39 | classProperty='TLABEL',
40 | inputProperties=feature_list,
41 | subsamplingSeed=10
42 | )
43 |
44 | # Model times out: uncomment if you like
45 | # confusion_matrix = classifier.confusionMatrix()
46 | # print('Training confusion matrix: ', confusion_matrix.getInfo())
47 | # print('Training accuracy: ', confusion_matrix.accuracy().getInfo())
48 | # print('Training kappa: ', confusion_matrix.kappa().getInfo())
49 | return classifier
50 |
51 |
52 | def prepare_classifier_input(features_image, labels_image, sample_points):
53 | classifier_input = features_image
54 | # Labels may or may not be present (train vs. predict)
55 | if labels_image:
56 | classifier_input = ee.Image.cat(features_image, labels_image)
57 |
58 | classifier_input_samples = classifier_input.sampleRegions(
59 | collection=sample_points,
60 | projection=model_projection,
61 | scale=model_scale,
62 | geometries=True
63 | )
64 |
65 | prepared_input_dict = dict(
66 | classifier_input_samples=classifier_input_samples,
67 | feature_list=features_image.bandNames()
68 | )
69 | return prepared_input_dict
70 |
71 |
72 | def create_classifier(features_image, labels_image, sample_points):
73 | def train_test_split(data_fc):
74 |         split = 0.8  # same as in R: 0.8 of the data for training (with 5-fold cross-validation), 0.2 held out as test
75 | with_random = data_fc.randomColumn('random', train_seed)
76 | train_partition = with_random.filter(ee.Filter.lt('random', split))
77 | test_partition = with_random.filter(ee.Filter.gte('random', split))
78 | return dict(training_partition=train_partition, test_partition=test_partition)
79 |
80 | ci = prepare_classifier_input(features_image, labels_image, sample_points)
81 | split = train_test_split(ci['classifier_input_samples'])
82 | classifier = train_model(split['training_partition'], ci['feature_list'])
83 | # Model times out: uncomment if you like
84 | # if labels_image is not None:
85 | # assess_model(classifier, split['test_partition'])
86 | # GEE doesn't allow us to save a model so we always train the model
87 | # classifier = classifier.setOutputMode('PROBABILITY')
88 | # classifier = train_model(split['training_partition'], feature_list)
89 | return classifier
90 |
91 |
92 | def build_worldwide_model():
93 | sample_points = get_or_create_worldwide_sample_points(train_seed)
94 | training_image = ee.Image(f"{model_snapshot_path_prefix}_training_sample{num_samples}_all_features_labels_image")
95 | features_list = get_selected_features()
96 | features_image = training_image.select(features_list)
97 | labels_image = training_image.select("TLABEL")
98 | classifier = create_classifier(features_image, labels_image, sample_points)
99 | return classifier
100 |
101 |
102 | def classify_year(classifier, model_year):
103 | asset_description = f'results_{model_year}'
104 | asset_name = f'{model_snapshot_path_prefix}_{asset_description}'
105 | features_image = get_selected_features_image(model_year)
106 | classified_image = features_image.classify(classifier)
107 | task = ee.batch.Export.image.toAsset(
108 | image=classified_image,
109 | description=asset_description,
110 | assetId=asset_name,
111 | crs=model_projection,
112 | # default is 1000: don't want this!
113 | scale=model_scale
114 | )
115 | task.start()
116 | return task
117 |
118 |
119 | def main():
120 | ee.Initialize()
121 | classifier = build_worldwide_model()
122 | model_years = range(2001, 2016)
123 | tasks = []
124 | for year in model_years:
125 | task = classify_year(classifier, year)
126 | tasks.append(task)
127 | wait_for_task_completion(tasks)
128 |
129 |
130 | if __name__ == '__main__':
131 | main()
132 |
--------------------------------------------------------------------------------
/python/common.py:
--------------------------------------------------------------------------------
1 | import time
2 | import itertools
3 | import datetime
4 |
5 | import ee
6 |
7 | # Checklist when changing model:
8 | # 1. Update model_snapshot_version below
9 | # 2. Change selectedBands in dataset_list[] if your feature list is different
10 | # 3. Change get_binary_labels() below if your label threshold has changed
11 | # 4. Change model hyperparameters in classifier.py if they are different
12 | # 5. Set num_samples below (used by classifier.py:build_worldwide_model()) if you want a different number of samples
13 |
14 | # v1: use land cover feature
15 | # v2a: use EVI amplitude etc from MCD12Q2.006
16 | # v3: ternary labels, remove MCD12Q2.006
17 | # v3a: add back MCD12Q2.006, because kappa became 0.47
18 | # - abandoned because other features were lost
19 | model_snapshot_version = "post_mids_v3b"
20 | # Create this directory ahead of time
21 | base_asset_directory = "users/deepakna/w210_irrigated_croplands"
22 |
23 | model_snapshot_path_prefix = f"{base_asset_directory}/{model_snapshot_version}"
24 | model_projection = "EPSG:4326"
25 | num_samples = 20000
26 |
27 | # label file
28 | label_path = f"{base_asset_directory}/s2005tlabels"
29 | # label year
30 | label_year = '2005'
31 |
32 | # CONFIGURATION flags
33 | label_type = "MIRCA2K" # or GFSAD1000
34 | train_seed = 10
35 | assess_seed = 20
36 |
37 | dataset_list = [
38 | {
39 | 'datasetLabel': "MODIS/MOD09GA_NDWI",
40 | 'allBands': ["NDWI"],
41 | 'selectedBands': [],
42 | 'summarizer': "max"
43 | },
44 | {
45 | 'datasetLabel': "IDAHO_EPSCOR/TERRACLIMATE",
46 | 'allBands': ["aet", "def", "pdsi", "pet", 'pr', 'soil', "srad", "swe", "tmmn", 'tmmx', 'vap', 'vpd', 'vs'],
47 | # v1
48 | 'selectedBands': ["tmmx", "pet", "srad", "vs"],
49 | 'summarizer': "mean",
50 | 'missingValues': -9999
51 | },
52 | {
53 | 'datasetLabel': "NASA/GLDAS/V021/NOAH/G025/T3H",
54 | 'allBands': ["Albedo_inst", "AvgSurfT_inst", "CanopInt_inst", "ECanop_tavg", "ESoil_tavg", "Evap_tavg",
55 | "LWdown_f_tavg", "Lwnet_tavg", "PotEvap_tavg", "Psurf_f_inst", "Qair_f_inst", "Qg_tavg", "Qh_tavg",
56 | "Qle_tavg", "Qs_acc", "Qsb_acc", "Qsm_acc", "Rainf_f_tavg", "Rainf_tavg", "RootMoist_inst",
57 | "SWE_inst", "SWdown_f_tavg", "SnowDepth_inst", "Snowf_tavg", "SoilMoi0_10cm_inst",
58 | "SoilMoi10_40cm_inst", "SoilMoi100_200cm_inst", "SoilMoi40_100cm_inst", "SoilTMP0_10cm_inst",
59 | "SoilTMP10_40cm_inst", "SoilTMP100_200cm_inst", "SoilTMP40_100cm_inst", "Swnet_tavg", "Tair_f_inst",
60 | "Tveg_tavg", "Wind_f_inst"],
61 | # v1
62 | 'selectedBands': ["Albedo_inst", "ESoil_tavg", "Psurf_f_inst"],
63 | 'summarizer': "mean",
64 | 'missingValues': -9999
65 | },
66 | {
67 | 'datasetLabel': "MODIS/006/MOD13A2",
68 | 'allBands': ["NDVI", "EVI"],
69 | 'selectedBands': ["EVI"],
70 | 'summarizer': "max",
71 | 'missingValues': -9999
72 | },
73 | {
74 | 'datasetLabel': "MODIS/006/MCD12Q1",
75 | 'allBands': ["LC_Type1", "LC_Type2"],
76 | 'selectedBands': ["LC_Type2"],
77 | 'summarizer': "max",
78 | 'minYear': '2001',
79 | 'missingValues': -9999
80 | },
81 | ]
82 |
83 | # We split world regions into 2 to avoid exceeding GEE geometry limits
84 | # Error: Geometry has too many edges (3970390 > 2000000)
85 | world_regions_1 = [
86 | "North America",
87 | "Central America",
88 | "South America",
89 | "Australia",
90 | "Africa",
91 |     # Oceania causes a geometry explosion, so it is excluded
92 | # "Oceania",
93 | ]
94 | world_regions_2 = [
95 | "Europe",
96 | "SW Asia",
97 | "Central Asia",
98 | "N Asia",
99 | "E Asia",
100 | "SE Asia",
101 | "S Asia",
102 | "Caribbean"
103 | ]
104 |
105 | # Both of the values below are related: don't change one without the other
106 | model_scale = 9276.620522123105 # 5 arc min at equator
107 | model_image_dimensions = "4320x2160"
108 |
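The two constants above can be cross-checked offline: 5 arc minutes along the equator, assuming the WGS84 equatorial radius of 6378137 m (the repo's exact constant differs only in the far decimals), gives roughly the `model_scale` value, and a global 5-arc-minute grid has the stated dimensions. A hypothetical sketch:

```python
import math

# WGS84 equatorial radius in metres (assumed; the repo's constant may differ slightly)
EQUATORIAL_RADIUS_M = 6378137.0

def arcmin_scale_at_equator(arc_minutes: float) -> float:
    """Ground distance in metres covered by the given arc-minutes along the equator."""
    circumference = 2 * math.pi * EQUATORIAL_RADIUS_M
    return circumference * arc_minutes / (360 * 60)

def global_grid_dimensions(arc_minutes: float) -> str:
    """Width x height of a global grid at the given cell size."""
    cols = int(360 * 60 / arc_minutes)
    rows = int(180 * 60 / arc_minutes)
    return f"{cols}x{rows}"
```

`arcmin_scale_at_equator(5)` lands within a few millimetres of the 9276.62 m used above, confirming the two values describe the same grid.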
109 |
110 | def get_features_from_dataset(dataset, which, model_year, region_fc):
111 | assert which in ['all', 'selected'], "Specify which bands to get: all or selected"
112 |
113 | if which == 'all':
114 | features = dataset['allBands']
115 | elif which == 'selected':
116 | features = dataset['selectedBands']
117 | else:
118 | raise NotImplementedError("Specify which bands to get: all or selected")
119 |
120 | if not features:
121 | return None
122 |
123 | image = get_features_image_from_dataset(dataset, features, model_year, region_fc)
124 |
125 | # Clumsy logic to handle onset days. It should have been days since start of year,
126 | # but they use days since start of epoch (1970-01-01). We convert it back into
127 | # days since start of year.
128 | if 'allDateBands' in dataset and which == 'all':
129 | date_features = dataset['allDateBands']
130 | image2 = get_features_image_from_dataset(dataset, date_features, model_year, region_fc)
131 | days_since_epoch = (datetime.datetime(year=int(model_year), month=1, day=1) -
132 | datetime.datetime(year=1970, month=1, day=1)).days
133 |         new_bands = [image2.select(b).subtract(days_since_epoch).rename(b) for b in date_features]
134 | date_bands_image = ee.Image.cat(*new_bands)
135 | image = image.addBands(date_bands_image)
136 | if 'missingValues' in dataset:
137 | mask_image = ee.Image(dataset['missingValues']).clipToCollection(region_fc)
138 | image = image.unmask(mask_image)
139 | return image
140 |
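The onset-day correction above can be verified offline with plain `datetime`: the date bands store days since 1970-01-01, and subtracting the offset of Jan 1 of the model year yields days since the start of that year. A minimal sketch (helper names are hypothetical, not part of the repo):

```python
import datetime

def days_since_epoch(model_year: int) -> int:
    """Days from 1970-01-01 to Jan 1 of the model year."""
    return (datetime.datetime(model_year, 1, 1) - datetime.datetime(1970, 1, 1)).days

def to_day_of_year(epoch_day: int, model_year: int) -> int:
    """Convert a days-since-epoch value into days since the start of model_year."""
    return epoch_day - days_since_epoch(model_year)
```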
141 |
142 | def get_features_image_from_dataset(dataset, features, model_year, region_fc):
143 | data_source = dataset['datasetLabel']
144 | if 'minYear' in dataset and int(model_year) < int(dataset['minYear']):
145 | new_model_year = int(dataset['minYear'])
146 | print(f"Warning: model year {model_year} is earlier than dataset {data_source}, using {new_model_year} instead")
147 | model_year = str(new_model_year)
148 | if 'maxYear' in dataset and int(model_year) > int(dataset['maxYear']):
149 | new_model_year = int(dataset['maxYear'])
150 | print(f"Warning: model year {model_year} is later than dataset {data_source}, using {new_model_year} instead")
151 | model_year = str(new_model_year)
152 | image_collection = ee.ImageCollection(data_source) \
153 | .select(features) \
154 | .filterDate(model_year + "-01-01", model_year + "-12-31") \
155 | .map(lambda img: img.clipToCollection(region_fc))
156 |
157 | if dataset['summarizer'] == "mean":
158 | image = image_collection \
159 | .mean()
160 | elif dataset['summarizer'] == "max":
161 | image = image_collection \
162 | .max()
163 | else:
164 | raise ValueError("unknown summarizer")
165 | return image
166 |
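`filterDate` treats its end argument as exclusive, so the safest way to express a full calendar year is Jan 1 through Jan 1 of the following year. A small hypothetical helper capturing that convention:

```python
def year_window(model_year: str) -> tuple:
    """Start (inclusive) and end (exclusive) dates covering one calendar year."""
    start = f"{int(model_year)}-01-01"
    end = f"{int(model_year) + 1}-01-01"
    return start, end
```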
167 |
168 | def region_boundaries(region):
169 |     fc = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017")
170 |     # a 2-character region is assumed to be a country FIPS code
171 |     if len(region) == 2:
172 | fc = fc \
173 | .filterMetadata("country_co", "equals", region)
174 | elif region == "world":
175 | fc = fc \
176 | .filter(ee.Filter.Or(
177 | ee.Filter.inList("wld_rgn", ee.List(world_regions_1 + world_regions_2)),
178 | # include two relatively big countries in Oceania region
179 | ee.Filter.inList("country_co", ee.List(["NZ", "PP"]))
180 | ))
181 | elif region == "world1":
182 | fc = fc \
183 | .filter(ee.Filter.inList("wld_rgn", ee.List(world_regions_1)))
184 | elif region == "world2":
185 | fc = fc \
186 | .filter(ee.Filter.Or(
187 | ee.Filter.inList("wld_rgn", ee.List(world_regions_2)),
188 | # include 2 relatively big countries in Oceania region
189 | ee.Filter.inList("country_co", ee.List(["NZ", "PP"]))
190 | ))
191 | else:
192 | fc = fc \
193 | .filterMetadata("wld_rgn", "equals", region)
194 | return fc.map(lambda f: f.set("areaHa", f.geometry().area()))
195 |
196 |
197 | def wait_for_task_completion(tasks, exit_if_failures=False):
198 | sleep_time = 10 # seconds
199 | done = False
200 | failed_tasks = []
201 | while not done:
202 | failed = 0
203 | completed = 0
204 | for t in tasks:
205 | status = t.status()
206 | print(f"{status['description']}: {status['state']}")
207 | if status['state'] == 'COMPLETED':
208 | completed += 1
209 | elif status['state'] in ['FAILED', 'CANCELLED']:
210 | failed += 1
211 | failed_tasks.append(status)
212 | if completed + failed == len(tasks):
213 | print(f"All tasks processed in batch: {completed} completed, {failed} failed")
214 | done = True
215 | time.sleep(sleep_time)
216 | if failed_tasks:
217 | print("--- Summary: following tasks failed ---")
218 | for status in failed_tasks:
219 | print(status)
220 | print("--- END Summary ---")
221 | if exit_if_failures:
222 |             raise RuntimeError("There were some failed tasks, see report above")
223 |
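The bookkeeping in `wait_for_task_completion` can be exercised without Earth Engine by stubbing the task objects; the sketch below mirrors one polling pass of the completed/failed accounting (names are illustrative stand-ins, with the sleep removed):

```python
class StubTask:
    """Minimal stand-in for an ee.batch.Task, for illustration only."""
    def __init__(self, description: str, state: str):
        self._status = {'description': description, 'state': state}

    def status(self) -> dict:
        return self._status

def summarize_tasks(tasks) -> tuple:
    """One polling pass: count completed tasks and collect failed/cancelled statuses."""
    completed, failed_statuses = 0, []
    for t in tasks:
        state = t.status()['state']
        if state == 'COMPLETED':
            completed += 1
        elif state in ('FAILED', 'CANCELLED'):
            failed_statuses.append(t.status())
    return completed, failed_statuses
```

The real loop repeats this pass every `sleep_time` seconds until every task reaches a terminal state.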
224 |
225 | def get_features_image(region_fc, model_year, which):
226 |     # build one image per dataset; datasets with no selected bands yield None,
227 |     # which we drop so that downstream ee.Image.cat calls see only real images
228 |     images = [img for img in
229 |               (get_features_from_dataset(d, which, model_year, region_fc) for d in dataset_list)
230 |               if img is not None]
231 | return images
232 |
233 |
234 | def get_labels(region_fc):
235 | if label_type == "MIRCA2K":
236 | label_image = ee.Image(label_path) \
237 | .clipToCollection(region_fc)
238 | return label_image
239 | elif label_type == "GFSAD1000":
240 | label_image = ee.Image('USGS/GFSAD1000_V0') \
241 | .select('landcover') \
242 | .expression('LABEL = (b(0) > 0 && b(0) < 4) ? 1 : 0')
243 | return label_image
244 | else:
245 |         raise ValueError(f"Unknown label type: {label_type}")
246 |
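The GFSAD1000 expression above maps landcover codes 1 through 3 to the positive class and everything else to 0. The same remap in plain Python, for reference (the function name is hypothetical):

```python
def gfsad_binary_label(landcover: int) -> int:
    """Mirror of the GEE expression '(b(0) > 0 && b(0) < 4) ? 1 : 0'."""
    return 1 if 0 < landcover < 4 else 0
```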
247 |
248 | def get_selected_features():
249 | selected_features = ['X', 'Y']
250 | for ds in dataset_list:
251 | selected_features.extend(ds['selectedBands'])
252 | return selected_features
253 |
254 |
255 | def get_selected_features_image(model_year):
256 | def get_selected_features_in_region(region):
257 | region_fc = region_boundaries(region)
258 | features_image = get_features_image(region_fc, model_year, 'selected')
259 | lonlat_image = ee.Image.pixelLonLat() \
260 | .clipToCollection(region_fc) \
261 | .select(['longitude', 'latitude'], ['X', 'Y'])
262 | features_with_lonlat_image = ee.Image.cat(features_image, lonlat_image)
263 | return features_with_lonlat_image
264 |
265 | image_path = f"{model_snapshot_path_prefix}_features_{model_year}"
266 | try:
267 | _ = ee.data.getAsset(image_path) # Forces exception
268 | # No exception, load the image
269 | features_image = ee.Image(image_path)
270 | return features_image
271 | except ee.ee_exception.EEException:
272 | print(f"could not read features image {image_path} (probably not created yet)")
273 |
274 | regions = ["world1", "world2"]
275 | regional_features = list(map(get_selected_features_in_region, regions))
276 |     assert len(regions) == 2
277 | one_image = ee.Image(regional_features[0]).blend(regional_features[1])
278 | return one_image
279 |
280 |
281 | def export_image_to_asset(image, asset_subpath):
282 | global_geometry = ee.Geometry.Rectangle(
283 | coords=[-180, -90, 180, 90],
284 | geodesic=False,
285 | proj=model_projection,
286 | )
287 | task = ee.batch.Export.image.toAsset(
288 | image=image,
289 | description="imageExport",
290 | assetId=base_asset_directory + "/" + asset_subpath,
291 | scale=model_scale,
292 | region=global_geometry,
293 | maxPixels=1E13,
294 | )
295 | task.start()
296 | return task
297 |
298 |
299 | def export_image_to_drive(image, folder):
300 | global_geometry = ee.Geometry.Rectangle(
301 | coords=[-180, -90, 180, 90],
302 | geodesic=False,
303 | proj=model_projection,
304 | )
305 | task = ee.batch.Export.image.toDrive(
306 | image=image,
307 | description=folder,
308 | folder=folder,
309 | scale=int(model_scale),
310 | # dimensions=model_image_dimensions,
311 | region=global_geometry
312 | )
313 | task.start()
314 | return task
315 |
316 |
317 | def export_asset_table_to_drive(asset_id):
318 | fc = ee.FeatureCollection(asset_id)
319 | folder = asset_id.replace('/', '_')
320 | max_len = 100
321 | if len(folder) >= max_len:
322 | folder = folder[:max_len]
323 |             print(f"folder name is too long, truncating to {max_len} characters")
324 | print(f"Downloading table {asset_id} to gdrive: {folder}")
325 | task = ee.batch.Export.table.toDrive(
326 | collection=fc,
327 | folder=folder,
328 | description=folder,
329 | fileFormat='GeoJSON'
330 | )
331 | task.start()
332 | wait_for_task_completion([task], True)
333 |
--------------------------------------------------------------------------------
/python/features_exporter.py:
--------------------------------------------------------------------------------
1 | # Creates a sample of (features, labels) as a multi-band image on Google Earth Engine
2 | # Requires: sample points for which to fill in features as a feature collection
3 | # Requires: labels asset as an image
4 |
5 | import ee
6 |
7 | from common import (model_scale, wait_for_task_completion, get_selected_features_image, model_snapshot_path_prefix,
8 | model_projection)
9 |
10 |
11 | def export_selected_features_for_year(model_year):
12 | asset_description = f'features_{model_year}'
13 | asset_name = f'{model_snapshot_path_prefix}_{asset_description}'
14 | features_image = get_selected_features_image(model_year)
15 | task = ee.batch.Export.image.toAsset(
16 | image=features_image,
17 | description=asset_description,
18 | assetId=asset_name,
19 | crs=model_projection,
20 | # default is 1000: don't want this!
21 | scale=model_scale
22 | )
23 | task.start()
24 | return task
25 |
26 |
27 | def main():
28 | ee.Initialize()
29 | model_years = range(2001, 2016)
30 | tasks = []
31 | for year in model_years:
32 | task = export_selected_features_for_year(str(year))
33 | tasks.append(task)
34 | wait_for_task_completion(tasks)
35 |
36 |
37 | if __name__ == '__main__':
38 | main()
39 |
--------------------------------------------------------------------------------
/python/label_sampler.py:
--------------------------------------------------------------------------------
1 | import ee
2 |
3 | from common import model_scale, wait_for_task_completion, model_projection
4 | from common import train_seed, label_path
5 | from sampler import get_or_create_worldwide_sample_points
6 |
7 | TOA_BANDS = ['B3', 'B2', 'B1']
8 | TOA_MIN = 0.0
9 | TOA_MAX = 120.0
10 | LANDSAT_RES = 30
11 |
12 |
13 | def export_point_unbuffered(id: str, coords: list, folder: str) -> ee.batch.Task:
14 | print(f"point: {id}, {coords}")
15 | sat_image = ee.Image("LANDSAT/LE7_TOA_1YEAR/2005").select(TOA_BANDS)
16 | point_geom = ee.Geometry.Point(coords=coords, proj=model_projection)
17 | square = point_geom.buffer(model_scale).bounds()
18 | clipped_sat_image = sat_image \
19 | .clipToCollection(ee.FeatureCollection(square)) \
20 | .visualize(bands=TOA_BANDS, min=TOA_MIN, max=TOA_MAX)
21 | prefix = f"{id}"
22 | task = ee.batch.Export.image.toDrive(clipped_sat_image, folder=folder, scale=LANDSAT_RES,
23 | fileNamePrefix=prefix, region=square)
24 | task.start()
25 | return task
26 |
27 |
28 | # get labels image
29 | if __name__ == '__main__':
30 | num_pictures = 3000 # 5 times for class labels 1 or 2
31 |
32 | ee.Initialize()
33 | sample_points_fc = get_or_create_worldwide_sample_points(train_seed)
34 | labels_image = ee.Image(label_path) \
35 | .expression(f"TLABEL = (b(0) == 0 ? 1 : 0)")
36 | labels_image = labels_image.mask(labels_image)
37 | labels_fc = labels_image \
38 | .sample(
39 | region=sample_points_fc,
40 | numPixels=num_pictures, # it seems to get a few more than this, and GEE limit is 3K
41 | scale=model_scale,
42 | projection=model_projection,
43 | seed=train_seed,
44 | geometries=True,
45 | dropNulls=True
46 | )
47 | print(labels_fc.size().getInfo())
48 |
49 | labels_fc_info = labels_fc.getInfo()
50 | tasks = list(map(
51 | lambda p: export_point_unbuffered(p['id'], p['geometry']['coordinates'], "classNotIrr_samples"),
52 | labels_fc_info['features']
53 | ))
54 | wait_for_task_completion(tasks)
55 |
--------------------------------------------------------------------------------
/python/post_processor.py:
--------------------------------------------------------------------------------
1 | import ee
2 | from common import base_asset_directory, region_boundaries, export_image_to_drive, wait_for_task_completion, \
3 | model_snapshot_version
4 |
5 |
6 | def combine_maps(year):
7 | cropland_image = ee.Image(f"users/deepakna/ellecp/v3/{year}_ternary")
8 | non_cropland_image = ee.Image(f"{base_asset_directory}/{model_snapshot_version}_results_{year}")
9 | non_cl_mask = ee.Image(f"{base_asset_directory}/nonCLMask")
10 | non_cropland_image = non_cropland_image.mask(non_cl_mask)
11 | # Fix a labelling mistake: uses class 3 instead of 2
12 | cropland_image = cropland_image.expression("classification = b(0) > 2 ? 2 : b(0)")
13 | combined_image = cropland_image.mask(cropland_image) \
14 | .blend(non_cropland_image.mask(non_cropland_image)) \
15 | .unmask(0) \
16 | .clipToCollection(region_boundaries("world"))
17 | return combined_image
18 |
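The expression that fixes the labelling mistake clamps class 3 down to 2; element-wise it is simply `min(value, 2)` for the non-negative class codes used here. A plain-Python mirror (function name is illustrative):

```python
def fix_class_label(value: int) -> int:
    """Mirror of the GEE expression 'classification = b(0) > 2 ? 2 : b(0)'."""
    return 2 if value > 2 else value
```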
19 |
20 | def main():
21 | years = range(2000, 2016)
22 | tasks = []
23 | for year in years:
24 | combined_image = combine_maps(year)
25 | task = export_image_to_drive(combined_image, f"{model_snapshot_version}_combined_{year}")
26 | tasks.append(task)
27 | wait_for_task_completion(tasks)
28 |
29 |
30 | if __name__ == '__main__':
31 | ee.Initialize()
32 | main()
33 |
--------------------------------------------------------------------------------
/python/run.py:
--------------------------------------------------------------------------------
1 | import ee
2 | import common
3 | import features_exporter
4 | import classifier as clf
5 |
6 |
7 | def main(model_years):
8 | ee.Initialize()
9 | classifier = clf.build_worldwide_model()
10 | for year in model_years:
11 | tasks = []
12 | task = features_exporter.export_selected_features_for_year(year)
13 | tasks.append(task)
14 | common.wait_for_task_completion(tasks)
15 | tasks = []
16 | task = clf.classify_year(classifier, year)
17 | tasks.append(task)
18 | common.wait_for_task_completion(tasks)
19 |
20 |
21 | if __name__ == '__main__':
22 | years = ["2000", "2003", "2006", "2009", "2012", "2015", "2018"]
23 | main(years)
24 |
--------------------------------------------------------------------------------
/python/sample_image_exporter.py:
--------------------------------------------------------------------------------
1 | import ee
2 |
3 | from common import model_scale, wait_for_task_completion, model_projection, base_asset_directory
4 |
5 |
6 | def export_point(id: str, coords: list, folder: str) -> ee.batch.Task:
7 | print(f"point: {id}, {coords}")
8 | TOA_BANDS = ['B3', 'B2', 'B1']
9 | TOA_MIN = 0.0
10 | TOA_MAX = 120.0
11 | LANDSAT_RES = 30
12 | RED_RGB = "#FF0000"
13 | RED_RGB_TRANSPARENT = RED_RGB + "00"
14 | sat_image = ee.Image("LANDSAT/LE7_TOA_1YEAR/2005").select(TOA_BANDS)
15 | point_geom = ee.Geometry.Point(coords=coords, proj=model_projection)
16 | square = point_geom.buffer(model_scale).bounds()
17 | outer_square = point_geom.buffer(model_scale * 2).bounds()
18 | border_fc = ee.FeatureCollection(square)\
19 | .style(color=RED_RGB, fillColor=RED_RGB_TRANSPARENT)
20 | clipped_sat_image = sat_image \
21 | .clipToCollection(ee.FeatureCollection(outer_square)) \
22 | .visualize(bands=TOA_BANDS, min=TOA_MIN, max=TOA_MAX) \
23 | .blend(border_fc)
24 | prefix = f"{id}"
25 | task = ee.batch.Export.image.toDrive(clipped_sat_image, folder=folder, scale=LANDSAT_RES,
26 | fileNamePrefix=prefix, region=outer_square)
27 | task.start()
28 | return task
29 |
30 |
31 | def export_samples(table_name: str) -> None:
32 | table = ee.FeatureCollection(f"{base_asset_directory}/{table_name}").getInfo()
33 | tasks = list(map(
34 | lambda p: export_point(p['id'], p['geometry']['coordinates'], table_name),
35 | table['features']
36 | ))
37 | wait_for_task_completion(tasks)
38 |
39 |
40 | if __name__ == '__main__':
41 | ee.Initialize()
42 | # export_samples("FNValidationSamplesv3")
43 | export_samples("TLabelv2Validation")
44 |
--------------------------------------------------------------------------------
/python/sampler.py:
--------------------------------------------------------------------------------
1 | import ee
2 | from common import (region_boundaries, model_scale, wait_for_task_completion, model_projection, base_asset_directory,
3 | export_asset_table_to_drive, num_samples, train_seed)
4 |
5 |
6 | world_regions = [
7 | "North America",
8 | "Central America",
9 | "Caribbean",
10 | "South America",
11 | "Europe",
12 | "Africa",
13 | "SW Asia",
14 | "Central Asia",
15 | "N Asia",
16 | "E Asia",
17 | "SE Asia",
18 | "S Asia",
19 | "Australia",
20 | # Oceania causes geometry explosion
21 | # "Oceania"
22 | # get 2 countries from there instead
23 | "NZ",
24 | "PP"
25 | ]
26 |
27 |
28 | def get_total_area():
29 | all_regions = ee.FeatureCollection(list(map(region_boundaries, world_regions))).flatten()
30 | # compute this before doing anything else
31 | total_area = all_regions.aggregate_sum('areaHa').getInfo()
32 | return total_area
33 |
34 |
35 | def read_sample(asset_name):
36 | try:
37 | sample_fc = ee.FeatureCollection(asset_name)
38 | # force materialization of feature collection
39 | _ = sample_fc.limit(10).getInfo()
40 | # it worked: return table
41 | return sample_fc
42 | except ee.ee_exception.EEException:
43 | print(f"could not read asset {asset_name} (probably not created yet)")
44 | return None
45 |
46 |
47 | def get_or_create_worldwide_sample_points(seed):
48 | asset_name = f'{base_asset_directory}/samples{num_samples}_seed{seed}'
49 | sample_fc = read_sample(asset_name)
50 | if sample_fc:
51 | return sample_fc
52 |
53 | print(f"creating sample {asset_name}")
54 |
55 | total_area = get_total_area()
56 |
57 | def sample_region(region_fc):
58 | total_sample_points = ee.Number(region_fc.aggregate_sum('sampleSize'))
59 | sample_image = ee.Image(1)\
60 | .clipToCollection(region_fc)
61 | sampled_region = sample_image\
62 | .sample(
63 | region=region_fc,
64 | numPixels=total_sample_points,
65 | projection=model_projection,
66 | scale=model_scale,
67 | geometries=True,
68 | seed=seed
69 | )
70 | return sampled_region
71 |
72 | def set_num_samples_to_region(region_name):
73 | def set_num_samples_to_take(feature):
74 | region_sample_size = ee.Number(feature.geometry().area()).divide(total_area).multiply(num_samples).floor()
75 | return feature.set('sampleSize', region_sample_size)
76 |
77 | region_fc = region_boundaries(region_name)
78 | region_fc_with_sample_size = region_fc.map(lambda feature: set_num_samples_to_take(feature))
79 | return region_fc_with_sample_size
80 |
81 | # find how many samples to take for each region, based on its area
82 | regions_with_sample_sizes = list(map(set_num_samples_to_region, world_regions))
83 |
84 | # take that many samples for each region
85 | sample_points = list(map(sample_region, regions_with_sample_sizes))
86 |
87 | # create a single region to export
88 | sample_fc = ee.FeatureCollection(sample_points).flatten()
89 |
90 | task = ee.batch.Export.table.toAsset(
91 | collection=sample_fc,
92 | assetId=asset_name,
93 | description=asset_name.replace('/', '_')
94 | )
95 | task.start()
96 | wait_for_task_completion([task], exit_if_failures=True)
97 | return read_sample(asset_name)
98 |
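The sampler allocates points to each region in proportion to its area, flooring each share so the total never exceeds `num_samples`. A pure-Python sketch of that allocation (region names and areas below are made up for illustration):

```python
import math

def allocate_samples(region_areas: dict, total_samples: int) -> dict:
    """Give each region floor(area_share * total_samples) sample points."""
    total_area = sum(region_areas.values())
    return {name: math.floor(area / total_area * total_samples)
            for name, area in region_areas.items()}
```

Because of the floor, a few samples may go unallocated; the GEE version accepts the same small shortfall.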
99 |
100 | def main():
101 | ee.Initialize()
102 | get_or_create_worldwide_sample_points(train_seed)
103 |     export_asset_table_to_drive(f'{base_asset_directory}/samples{num_samples}_seed{train_seed}')
104 |
105 |
106 | if __name__ == '__main__':
107 | main()
108 |
--------------------------------------------------------------------------------
/python/training_sample_exporter.py:
--------------------------------------------------------------------------------
1 | # Creates a sample of (features, labels) as a multi-band image on Google Earth Engine
2 | # Requires: sample points for which to fill in features as a feature collection
3 | # Requires: labels asset as an image
4 |
5 | import ee
6 |
7 | from common import (model_scale, wait_for_task_completion, get_features_image, get_labels, model_snapshot_path_prefix,
8 | export_asset_table_to_drive, model_projection, num_samples, label_year, train_seed)
9 | from sampler import get_or_create_worldwide_sample_points
10 |
11 |
12 | def create_all_features_labels_image(region_fc, model_year):
13 | images = get_features_image(region_fc, model_year, "all")
14 | labels_image = get_labels(region_fc)
15 | images.append(labels_image)
16 | lonlat_image = ee.Image.pixelLonLat() \
17 | .clipToCollection(region_fc) \
18 | .select(['longitude', 'latitude'], ['X', 'Y'])
19 | images.append(lonlat_image)
20 | features_labels_image = ee.Image.cat(*images)
21 | return features_labels_image
22 |
23 |
24 | def main():
25 | # Step 1/3: create or fetch sample points
26 | asset_description = f'training_sample{num_samples}_all_features_labels'
27 | image_asset_id = f'{model_snapshot_path_prefix}_{asset_description}_image'
28 | table_asset_id = f'{model_snapshot_path_prefix}_{asset_description}_table'
29 | ee.Initialize()
30 | sample_points_fc = get_or_create_worldwide_sample_points(train_seed)
31 |
32 | # Step 2/3: sample all features into an image
33 | features_labels_image = create_all_features_labels_image(sample_points_fc, label_year)
34 | task = ee.batch.Export.image.toAsset(
35 | crs=model_projection,
36 | image=features_labels_image,
37 | scale=model_scale,
38 | assetId=image_asset_id,
39 | description=asset_description
40 | )
41 | task.start()
42 | wait_for_task_completion([task], exit_if_failures=True)
43 |
44 | # Step 3/3: convert image into a table
45 | features_labels_image = ee.Image(image_asset_id)
46 | # For training sample, it is more efficient to export a table than a raster with (mostly) 0's
47 | training_fc = features_labels_image.sampleRegions(
48 | collection=sample_points_fc,
49 | projection=model_projection,
50 | scale=model_scale,
51 | geometries=True
52 | )
53 | task = ee.batch.Export.table.toAsset(
54 | collection=training_fc,
55 | assetId=table_asset_id,
56 | description=asset_description.replace('/', '_')
57 | )
58 | task.start()
59 | wait_for_task_completion([task], exit_if_failures=True)
60 |
61 | # Step 3a: export to drive for offline model development
62 | export_asset_table_to_drive(table_asset_id)
63 |
64 |
65 | if __name__ == '__main__':
66 | main()
67 |
--------------------------------------------------------------------------------
/results/diff2001vs2015.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/diff2001vs2015.png
--------------------------------------------------------------------------------
/results/diff2001vs2015.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/diff2001vs2015.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2001.png
--------------------------------------------------------------------------------
/results/v3b_combined_2001.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2001.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2002.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2002.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2003.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2003.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2004.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2004.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2005.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2005.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2006.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2006.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2007.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2007.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2008.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2008.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2009.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2009.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2010.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2010.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2011.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2011.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2012.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2012.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2013.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2013.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2014.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2014.tif
--------------------------------------------------------------------------------
/results/v3b_combined_2015.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2015.png
--------------------------------------------------------------------------------
/results/v3b_combined_2015.tif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/deepix/globalirrigationmap/0278334a4dc06dea745bb1e8894a7ef6b0cc80e9/results/v3b_combined_2015.tif
--------------------------------------------------------------------------------