├── 01_data_exploration.ipynb
├── 02_data_exploration_and_modeling.ipynb
├── 03_elevation_and_length_of_day.ipynb
├── 04_feature_engineering.ipynb
├── 05_statisctical_feature_exploration.ipynb
├── 06_algorithm_selection.ipynb
├── Full_Report.md
├── README.md
├── images
├── average_daily_temperature.png
├── compare_yield.png
├── daily_temperature_difference.png
├── feature_importance.png
├── learning_curve.png
└── model_performance.png
└── original_README.md

/03_elevation_and_length_of_day.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 2,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "from __future__ import absolute_import, division, print_function"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# Elevation & Length of Day"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "To account for the amount of sunlight available to the plants at the different locations and times, get the length of the day. Here, I do this by using the 'Astral' package (https://pythonhosted.org/astral/index.html).\n",
26 | "\n",
27 | "I also leverage the Google Elevation API to retrieve the elevation for each location.\n",
28 | "\n",
29 | "The results are saved (and pickled) in dictionaries for easy lookup."
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## Imports"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 3,
42 | "metadata": {
43 | "collapsed": false
44 | },
45 | "outputs": [],
46 | "source": [
47 | "import os\n",
48 | "import pickle\n",
49 | "import time\n",
50 | "import requests \n",
51 | "import json\n",
52 | "import datetime\n",
53 | "\n",
54 | "import numpy as np\n",
55 | "import pandas as pd\n",
56 | "\n",
57 | "from astral import Location"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "## Data"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 4,
70 | "metadata": {
71 | "collapsed": false
72 | },
73 | "outputs": [],
74 | "source": [
75 | "cwd = os.getcwd()\n",
76 | "data = os.path.join(cwd,'data','wheat-2013-supervised.csv')\n",
77 | "df_2013 = pd.read_csv(data)\n",
78 | "data = os.path.join(cwd,'data','wheat-2014-supervised.csv')\n",
79 | "df_2014 = pd.read_csv(data)\n"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 5,
85 | "metadata": {
86 | "collapsed": false
87 | },
88 | "outputs": [
89 | {
90 | "data": {
91 | "text/plain": [
92 | "'6/3/2014 0:00'"
93 | ]
94 | },
95 | "execution_count": 5,
96 | "metadata": {},
97 | "output_type": "execute_result"
98 | }
99 | ],
100 | "source": [
101 | "df_2013['Date'].max()"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 6,
107 | "metadata": {
108 | "collapsed": false
109 | },
110 | "outputs": [
111 | {
112 | "data": {
113 | "text/plain": [
114 | "'6/3/2015 0:00'"
115 | ]
116 | },
117 | "execution_count": 6,
118 | "metadata": {},
119 | "output_type": "execute_result"
120 | }
121 | ],
122 | "source": [
123 | "df_2014['Date'].max()"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "## Locations"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 7,
136 | "metadata": {
137 | "collapsed": false
138 | },
139 | "outputs": [
140 | {
141 | "data": {
142 | "text/plain": [ 143 | "1014" 144 | ] 145 | }, 146 | "execution_count": 7, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "df_2013['Location'] = list(zip(df_2013['Longitude'], df_2013['Latitude']))\n", 153 | "locs_2013 = df_2013['Location'].unique().tolist()\n", 154 | "len(locs_2013)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 8, 160 | "metadata": { 161 | "collapsed": false 162 | }, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "text/plain": [ 167 | "1035" 168 | ] 169 | }, 170 | "execution_count": 8, 171 | "metadata": {}, 172 | "output_type": "execute_result" 173 | } 174 | ], 175 | "source": [ 176 | "df_2014['Location'] = list(zip(df_2014['Longitude'], df_2014['Latitude']))\n", 177 | "locs_2014 = df_2014['Location'].unique().tolist()\n", 178 | "len(locs_2014)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 9, 184 | "metadata": { 185 | "collapsed": false 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "# Union of locations without repetitions\n", 190 | "elevation = {}\n", 191 | "for loc in locs_2013:\n", 192 | " elevation[loc] = 0\n", 193 | "for loc in locs_2014:\n", 194 | " if loc in elevation:\n", 195 | " pass\n", 196 | " else:\n", 197 | " elevation[loc] = 0" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 10, 203 | "metadata": { 204 | "collapsed": false 205 | }, 206 | "outputs": [ 207 | { 208 | "name": "stdout", 209 | "output_type": "stream", 210 | "text": [ 211 | "Number of unique locations across 2013 and 2014: 1167\n" 212 | ] 213 | } 214 | ], 215 | "source": [ 216 | "print('Number of unique locations across 2013 and 2014: {}'.format(len(elevation.keys())))" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "## Elevation" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 11, 229 | "metadata": { 230 | "collapsed": false 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "# Get API key from system variable\n", 235 | "google_api_key = os.environ['GOOGLE_API_KEY']" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 12, 241 | "metadata": { 242 | "collapsed": false 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "run_API_query = False\n", 247 | "\n", 248 | "base_string = 'https://maps.googleapis.com/maps/api/elevation/json?locations='\n", 249 | "locs = elevation.keys()\n", 250 | "\n", 251 | "if run_API_query is True:\n", 252 | " for idx, loc in enumerate(locs):\n", 253 | " lat_string = str(loc[1])\n", 254 | " lng_string = str(loc[0])\n", 255 | " call_string = base_string + lat_string + ',' + lng_string + '&key=' + google_api_key\n", 256 | " if idx%10 == 0: \n", 257 | " print(idx)\n", 258 | " if elevation[loc] == 0:\n", 259 | " response = requests.get(call_string)\n", 260 | " try:\n", 261 | " tmp_elevation = response.json()['results'][0]['elevation']\n", 262 | " elevation[loc] = tmp_elevation\n", 263 | " except:\n", 264 | " print('Something went wrong', lat_string, lng_string)\n", 265 | " print(response.json())\n", 266 | " \n", 267 | "\n", 268 | "else:\n", 269 | " # INSERT\n", 270 | " #\n", 271 | " # Load data saved at end of this notebook.\n", 272 | " # Can be used to experiment with the Astral code below without\n", 273 | " # having to re-run the Google API queries above.\n", 274 | "\n", 275 | " with open(os.path.join('data','elevation.pickle'), 'rb') as handle:\n", 276 | " elevation = pickle.load(handle)\n", 277 | "\n", 
278 | " \n", 279 | "\n", 280 | "\n", 281 | " " 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "## Length of day" 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": { 294 | "collapsed": true 295 | }, 296 | "source": [ 297 | "Use the 'Astral' package (https://pythonhosted.org/astral/index.html) to calculate the length of the day (~ hours of sunlight) for each location." 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 13, 303 | "metadata": { 304 | "collapsed": false 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "locs = elevation.keys()\n", 309 | "length_of_day = {}\n", 310 | "for idx, loc in enumerate(locs):\n", 311 | " # Initialize location\n", 312 | " l = Location()\n", 313 | " # Set attributes\n", 314 | " l.name = ''\n", 315 | " l.region = ''\n", 316 | " l.latitude = loc[1]\n", 317 | " l.longitude = loc[0]\n", 318 | " l.timezone = 'UTC'\n", 319 | " l.elevation = elevation[loc]\n", 320 | " sun = l.sun(date=datetime.date(2014, 3, 3),local=True)\n", 321 | " td = sun['sunset'] - sun['sunrise']\n", 322 | " tmp_lod = td.total_seconds() / 3600. \n", 323 | " length_of_day[loc] = tmp_lod \n" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "## Save data" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 14, 336 | "metadata": { 337 | "collapsed": true 338 | }, 339 | "outputs": [], 340 | "source": [ 341 | "# Store data (serialize)\n", 342 | "with open(os.path.join('data','elevation.pickle'), 'wb') as handle:\n", 343 | " pickle.dump(elevation, handle)\n", 344 | "\n", 345 | "with open(os.path.join('data','length_of_day.pickle'), 'wb') as handle:\n", 346 | " pickle.dump(length_of_day, handle)\n", 347 | "\n" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": { 354 | "collapsed": true 355 | }, 356 | "outputs": [], 357 | "source": [] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "metadata": { 363 | "collapsed": true 364 | }, 365 | "outputs": [], 366 | "source": [] 367 | } 368 | ], 369 | "metadata": { 370 | "kernelspec": { 371 | "display_name": "Python 2", 372 | "language": "python", 373 | "name": "python2" 374 | }, 375 | "language_info": { 376 | "codemirror_mode": { 377 | "name": "ipython", 378 | "version": 2 379 | }, 380 | "file_extension": ".py", 381 | "mimetype": "text/x-python", 382 | "name": "python", 383 | "nbconvert_exporter": "python", 384 | "pygments_lexer": "ipython2", 385 | "version": "2.7.13" 386 | } 387 | }, 388 | "nbformat": 4, 389 | "nbformat_minor": 0 390 | } 391 | -------------------------------------------------------------------------------- /04_feature_engineering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from __future__ import absolute_import, division, print_function" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "# Feature engineering" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "Here I will extract and engineer the features that will be the input for a subsequent model.\n", 26 | "\n", 27 | "The main data are of the follwoing format:\n", 28 | "\n", 29 | "* The data report weather measurements for ~1000 
unique locations in ~150 counties across 5 states. \n", 30 | "* Most locations have one entry per day reporting the current weather conditions.\n", 31 | "* At the end of the season, the harvest generates a certain yield. This yield is propagated to *all* entries in the data set, even though it is only a final value.\n", 32 | "* The reported yield number refers to the yield in the county and is not specific to a location. It is unclear if it is an average or a sum for the county.\n", 33 | "\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Goal" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "My goal here is to create a profile across the season for each location from the avaiable plus additional data. This will leave me with one set of feature values for each location, which is connected to a final yield. And that will be the input to my model." 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "## Imports" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 2, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "import os\n", 66 | "import pickle\n", 67 | "import time\n", 68 | "\n", 69 | "import numpy as np\n", 70 | "import pandas as pd\n", 71 | "\n", 72 | "import matplotlib as mpl\n", 73 | "import matplotlib.pyplot as plt\n", 74 | "from mpl_toolkits.basemap import Basemap\n", 75 | "\n", 76 | "\n", 77 | "%matplotlib inline" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## Functions" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 11, 90 | "metadata": { 91 | "collapsed": false 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "def find_30day_extreme(in_df, in_column, in_agg='min'):\n", 96 | " \"\"\"\n", 97 | " Calculates a rolling window mean. Then determines the min/max/mean \n", 98 | " value of all the window aggregates.\n", 99 | " Window size: 30 days.\n", 100 | " in_df: input dataframe\n", 101 | " in_column: column to be aggregated\n", 102 | " in_agg: aggregation function for the list of results ('min','max','mean')\n", 103 | " \"\"\"\n", 104 | " # Pandas 30 day rolling window. 
Make sure there are at least 10 samples in the window.\n", 105 | " agg_df = in_df[[in_column,'Date']].rolling('30d', on='Date', min_periods=10).mean()\n", 106 | " agg_list = agg_df[in_column]\n", 107 | " if in_agg == 'min':\n", 108 | " return np.min(agg_list)\n", 109 | " if in_agg == 'max':\n", 110 | " return np.max(agg_list)\n", 111 | " if in_agg == 'mean':\n", 112 | " return np.mean(agg_list)\n", 113 | " \n" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 12, 119 | "metadata": { 120 | "collapsed": true 121 | }, 122 | "outputs": [], 123 | "source": [ 124 | "def calculate_features(tmp_df):\n", 125 | " result = []\n", 126 | " # Minimum average temperature in 30-day period\n", 127 | " result.append(find_30day_extreme(tmp_df, 'temperatureAverage', in_agg='min'))\n", 128 | " # Maximum average temperature in 30-day period\n", 129 | " result.append(find_30day_extreme(tmp_df, 'temperatureAverage', in_agg='max'))\n", 130 | " # Minimum NDVI in 30-day period\n", 131 | " tmp1 = find_30day_extreme(tmp_df, 'temperatureAverage', in_agg='min')\n", 132 | " # Maximum NDVI in 30-day period\n", 133 | " tmp2 = find_30day_extreme(tmp_df, 'temperatureAverage', in_agg='max')\n", 134 | " result.append(tmp2/tmp1)\n", 135 | " # Mean temperature difference and variance\n", 136 | " result.append(tmp_df['temperatureDiff'].mean())\n", 137 | " result.append(tmp_df['temperatureDiff'].std())\n", 138 | " # Mean wind speed\n", 139 | " result.append(tmp_df['windSpeed'].mean())\n", 140 | " # Total precipitation\n", 141 | " result.append(tmp_df['precipTotal'].max())\n", 142 | " # Total yield\n", 143 | " result.append(tmp_df['Yield'].max())\n", 144 | " #\n", 145 | " return result\n" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": { 152 | "collapsed": true 153 | }, 154 | "outputs": [], 155 | "source": [] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "## Data" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "These data have been slighlty pre-processed. See 01_data_exploration.ipynb for details." 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 5, 174 | "metadata": { 175 | "collapsed": true 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "cwd = os.getcwd()\n", 180 | "df_2013 = pd.read_pickle(os.path.join(cwd,'data','df_2013_clean.df'))\n", 181 | "df_2014 = pd.read_pickle(os.path.join(cwd,'data','df_2014_clean.df'))\n" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "## Additional data" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "I have also obtained information on the elevation and length_of_day for each location (see 03_elevation_and_length_of_day.ipynb)." 
196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 6, 201 | "metadata": { 202 | "collapsed": false 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "# Load data - dictionaries\n", 207 | "with open(os.path.join('data','elevation.pickle'), 'rb') as handle:\n", 208 | " elevation = pickle.load(handle)\n", 209 | "with open(os.path.join('data','length_of_day.pickle'), 'rb') as handle:\n", 210 | " length_of_day = pickle.load(handle)\n" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "## Features" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "Each location comes with:\n", 225 | "\n", 226 | "* longitude\n", 227 | "* latitude\n", 228 | "\n", 229 | "The features I am going to engineer for each location are:\n", 230 | "\n", 231 | "* the total amount of precipitation\n", 232 | "* the minimum average temperature in a consecutive 30-day period\n", 233 | "* the maximum average temperature in a consecutive 30-day period\n", 234 | "* ratio of maximum average NDVI in a consecutive 30-day period and its respective minimum\n", 235 | "* the mean temperature difference between daily min/max temperatures\n", 236 | "* the standard deviation of the above mean\n", 237 | "* the total average wind speed\n", 238 | "\n", 239 | "I will add to those features external values of:\n", 240 | "\n", 241 | "* the hours of daylight\n", 242 | "* the elevation" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "During the engineering I will keep both years (2013 and 2014) separate. Even though these data cover mostly the same locations (~>80% overlap), weather conditions and yield is likely to be different. Keeping them separate provides additional data points. " 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": { 256 | "collapsed": true 257 | }, 258 | "outputs": [], 259 | "source": [] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "## Exclude locations with only very few measurements across the growing period" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "Since I am aggregating data across the growing period, I will for this current approach remove locations with only very few records. For these locations it is difficult to generate the features above in a straight forward manner. (See 01_data_exploration.ipynb for details on how these locations were identified)." 
273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 7, 278 | "metadata": { 279 | "collapsed": false 280 | }, 281 | "outputs": [ 282 | { 283 | "name": "stdout", 284 | "output_type": "stream", 285 | "text": [ 286 | "1014\n", 287 | "955\n", 288 | "1035\n", 289 | "982\n" 290 | ] 291 | } 292 | ], 293 | "source": [ 294 | "locs = df_2013['Location'].unique().tolist()\n", 295 | "print(len(locs))\n", 296 | "n_records_2013 = df_2013.groupby(by='Location').agg({'Location': {'Count' : lambda x: x.count()}})\n", 297 | "n_records_2013['Location']['Count'].value_counts().sort_index(ascending=False)\n", 298 | "klocs = n_records_2013[n_records_2013['Location']['Count'] >= 153].index.get_level_values('Location').values\n", 299 | "locs_2013 = df_2013['Location'][df_2013['Location'].isin(klocs)].unique().tolist()\n", 300 | "print(len(locs_2013))\n", 301 | "#\n", 302 | "locs = df_2014['Location'].unique().tolist()\n", 303 | "print(len(locs))\n", 304 | "\n", 305 | "n_records_2014 = df_2014.groupby(by='Location').agg({'Location': {'Count' : lambda x: x.count()}})\n", 306 | "n_records_2014['Location']['Count'].value_counts().sort_index(ascending=False)\n", 307 | "klocs = n_records_2014[n_records_2014['Location']['Count'] >= 153].index.get_level_values('Location').values\n", 308 | "locs_2014 = df_2014['Location'][df_2014['Location'].isin(klocs)].unique().tolist()\n", 309 | "print(len(locs_2014))\n", 310 | "\n" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "## 2013" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 8, 323 | "metadata": { 324 | "collapsed": false 325 | }, 326 | "outputs": [ 327 | { 328 | "data": { 329 | "text/html": [ 330 | "
\n", 331 | "\n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | "
CountyNameStateLatitudeLongitudeDatecloudCoverdewPointhumidityprecipIntensityprecipProbability...windBearingwindSpeedNDVIDayInSeasonYieldLocationprecipTotaltemperatureDifftemperatureRatiotemperatureAverage
0AdamsWashington46.811686-118.6952372013-11-300.0029.530.910.00000.00...2141.18134.110657035.7(-118.6952372, 46.8116858)0.0008.221.29912731.590
1AdamsWashington46.929839-118.3521092013-11-300.0029.770.930.00010.05...1661.01131.506592035.7(-118.3521093, 46.9298391)0.0008.181.30386331.010
2AdamsWashington47.006888-118.5101602013-11-300.0029.360.940.00010.06...1581.03131.472946035.7(-118.5101603, 47.0068881)0.0206.431.23859030.165
3AdamsWashington47.162342-118.6996772013-11-300.9129.470.940.00020.15...1531.84131.288300035.7(-118.6996774, 47.1623419)0.0366.021.22156830.180
4AdamsWashington47.157512-118.4340562013-11-300.9129.860.940.00030.24...1561.85131.288300035.7(-118.4340559, 47.157512)0.0006.781.25046230.460
\n", 481 | "

5 rows × 27 columns

\n", 482 | "
" 483 | ], 484 | "text/plain": [ 485 | " CountyName State Latitude Longitude Date cloudCover \\\n", 486 | "0 Adams Washington 46.811686 -118.695237 2013-11-30 0.00 \n", 487 | "1 Adams Washington 46.929839 -118.352109 2013-11-30 0.00 \n", 488 | "2 Adams Washington 47.006888 -118.510160 2013-11-30 0.00 \n", 489 | "3 Adams Washington 47.162342 -118.699677 2013-11-30 0.91 \n", 490 | "4 Adams Washington 47.157512 -118.434056 2013-11-30 0.91 \n", 491 | "\n", 492 | " dewPoint humidity precipIntensity precipProbability ... \\\n", 493 | "0 29.53 0.91 0.0000 0.00 ... \n", 494 | "1 29.77 0.93 0.0001 0.05 ... \n", 495 | "2 29.36 0.94 0.0001 0.06 ... \n", 496 | "3 29.47 0.94 0.0002 0.15 ... \n", 497 | "4 29.86 0.94 0.0003 0.24 ... \n", 498 | "\n", 499 | " windBearing windSpeed NDVI DayInSeason Yield \\\n", 500 | "0 214 1.18 134.110657 0 35.7 \n", 501 | "1 166 1.01 131.506592 0 35.7 \n", 502 | "2 158 1.03 131.472946 0 35.7 \n", 503 | "3 153 1.84 131.288300 0 35.7 \n", 504 | "4 156 1.85 131.288300 0 35.7 \n", 505 | "\n", 506 | " Location precipTotal temperatureDiff temperatureRatio \\\n", 507 | "0 (-118.6952372, 46.8116858) 0.000 8.22 1.299127 \n", 508 | "1 (-118.3521093, 46.9298391) 0.000 8.18 1.303863 \n", 509 | "2 (-118.5101603, 47.0068881) 0.020 6.43 1.238590 \n", 510 | "3 (-118.6996774, 47.1623419) 0.036 6.02 1.221568 \n", 511 | "4 (-118.4340559, 47.157512) 0.000 6.78 1.250462 \n", 512 | "\n", 513 | " temperatureAverage \n", 514 | "0 31.590 \n", 515 | "1 31.010 \n", 516 | "2 30.165 \n", 517 | "3 30.180 \n", 518 | "4 30.460 \n", 519 | "\n", 520 | "[5 rows x 27 columns]" 521 | ] 522 | }, 523 | "execution_count": 8, 524 | "metadata": {}, 525 | "output_type": "execute_result" 526 | } 527 | ], 528 | "source": [ 529 | "df_2013.head()" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": 9, 535 | "metadata": { 536 | "collapsed": true 537 | }, 538 | "outputs": [], 539 | "source": [ 540 | "features = ['longitude',\n", 541 | " 'latitude',\n", 542 | " 'elevation',\n", 543 | " 'LOD',\n", 544 | " 'total_precipitation',\n", 545 | " 'minMAT30',\n", 546 | " 'maxMAT30',\n", 547 | " 'ratioMNDVI30',\n", 548 | " 'mean_wind_speed',\n", 549 | " 'mean_temperature_diff',\n", 550 | " 'std_temperature_diff',\n", 551 | " 'yield',\n", 552 | " ]\n", 553 | "\n", 554 | "df_2013_new = pd.DataFrame(columns=features)" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 13, 560 | "metadata": { 561 | "collapsed": false 562 | }, 563 | "outputs": [ 564 | { 565 | "name": "stdout", 566 | "output_type": "stream", 567 | "text": [ 568 | "Exec. 
time: 37.78 s\n" 569 | ] 570 | } 571 | ], 572 | "source": [ 573 | "now = time.time()\n", 574 | "\n", 575 | "for idx,loc in enumerate(locs_2013):\n", 576 | " tmp_df = df_2013[df_2013['Location'] == loc]\n", 577 | " (min_mean_average_temparature_30days, \n", 578 | " max_mean_average_temparature_30days,\n", 579 | " ratio_mean_ndvi_30days,\n", 580 | " mean_temperature_diff,\n", 581 | " std_temperature_diff,\n", 582 | " mean_wind_speed,\n", 583 | " total_precipitation,\n", 584 | " total_yield) = calculate_features(tmp_df)\n", 585 | " try:\n", 586 | " new_elevation = elevation[loc]\n", 587 | " except:\n", 588 | " print('Elevation: no match for location found', loc)\n", 589 | " try:\n", 590 | " new_length_of_day = length_of_day[loc]\n", 591 | " except:\n", 592 | " print('Length-of-day: no match for location found', loc)\n", 593 | " longitude = loc[0]\n", 594 | " latitude = loc[1]\n", 595 | " observations = {'longitude': longitude, \n", 596 | " 'latitude': latitude,\n", 597 | " 'elevation': new_elevation,\n", 598 | " 'LOD': new_length_of_day,\n", 599 | " 'total_precipitation': total_precipitation,\n", 600 | " 'minMAT30': min_mean_average_temparature_30days,\n", 601 | " 'maxMAT30': max_mean_average_temparature_30days,\n", 602 | " 'ratioMNDVI30': ratio_mean_ndvi_30days,\n", 603 | " 'mean_wind_speed': mean_wind_speed,\n", 604 | " 'mean_temperature_diff': mean_temperature_diff,\n", 605 | " 'std_temperature_diff': std_temperature_diff,\n", 606 | " 'yield': total_yield,\n", 607 | " }\n", 608 | " df_2013_new.loc[idx] = pd.Series(observations)\n", 609 | " \n", 610 | "print('Exec. time: {:5.2f} s'.format(time.time()-now))" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": 14, 616 | "metadata": { 617 | "collapsed": false 618 | }, 619 | "outputs": [ 620 | { 621 | "data": { 622 | "text/plain": [ 623 | "955" 624 | ] 625 | }, 626 | "execution_count": 14, 627 | "metadata": {}, 628 | "output_type": "execute_result" 629 | } 630 | ], 631 | "source": [ 632 | "len(locs_2013)" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": 15, 638 | "metadata": { 639 | "collapsed": false 640 | }, 641 | "outputs": [ 642 | { 643 | "data": { 644 | "text/plain": [ 645 | "(955, 12)" 646 | ] 647 | }, 648 | "execution_count": 15, 649 | "metadata": {}, 650 | "output_type": "execute_result" 651 | } 652 | ], 653 | "source": [ 654 | "df_2013_new.shape" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": 16, 660 | "metadata": { 661 | "collapsed": false 662 | }, 663 | "outputs": [ 664 | { 665 | "data": { 666 | "text/html": [ 667 | "
\n", 668 | "\n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | "
longitudelatitudeelevationLODtotal_precipitationminMAT30maxMAT30ratioMNDVI30mean_wind_speedmean_temperature_diffstd_temperature_diffyield
0-118.69523746.811686427.17919911.1719449.98222.09000062.0330002.8081945.29854817.2944099.06131635.7
1-118.35210946.929839524.56182911.16805612.31221.14090960.1838332.8467955.59424716.4856998.48587035.7
2-118.51016047.006888499.47607411.16583313.75020.64454559.3630002.8754816.13274216.9138178.44436335.7
3-118.69967747.162342521.15033011.16083312.93420.15181859.6820002.9616196.12043016.8002698.51249635.7
4-118.43405647.157512571.11407511.16111116.64420.13818258.1431672.8872106.05596816.0778498.20181835.7
5-118.95885947.150327451.08465611.1611118.90521.74083361.4641672.8271305.32080617.4462378.47186735.7
6-98.33085136.995835391.88745111.43972247.25628.73166771.5265002.4894669.21801124.4767209.61283214.4
7-98.40462936.988813401.38351411.43972251.07728.70383371.5223332.4917359.21080624.4868829.63412014.4
8-98.31989336.702452359.77139311.44638955.12429.36216772.1475002.4571599.50435524.6879579.65620414.4
9-98.53981636.628781413.23828111.44805667.26429.47116772.0770002.4456799.39801124.5076349.58374414.4
\n", 839 | "
" 840 | ], 841 | "text/plain": [ 842 | " longitude latitude elevation LOD total_precipitation \\\n", 843 | "0 -118.695237 46.811686 427.179199 11.171944 9.982 \n", 844 | "1 -118.352109 46.929839 524.561829 11.168056 12.312 \n", 845 | "2 -118.510160 47.006888 499.476074 11.165833 13.750 \n", 846 | "3 -118.699677 47.162342 521.150330 11.160833 12.934 \n", 847 | "4 -118.434056 47.157512 571.114075 11.161111 16.644 \n", 848 | "5 -118.958859 47.150327 451.084656 11.161111 8.905 \n", 849 | "6 -98.330851 36.995835 391.887451 11.439722 47.256 \n", 850 | "7 -98.404629 36.988813 401.383514 11.439722 51.077 \n", 851 | "8 -98.319893 36.702452 359.771393 11.446389 55.124 \n", 852 | "9 -98.539816 36.628781 413.238281 11.448056 67.264 \n", 853 | "\n", 854 | " minMAT30 maxMAT30 ratioMNDVI30 mean_wind_speed mean_temperature_diff \\\n", 855 | "0 22.090000 62.033000 2.808194 5.298548 17.294409 \n", 856 | "1 21.140909 60.183833 2.846795 5.594247 16.485699 \n", 857 | "2 20.644545 59.363000 2.875481 6.132742 16.913817 \n", 858 | "3 20.151818 59.682000 2.961619 6.120430 16.800269 \n", 859 | "4 20.138182 58.143167 2.887210 6.055968 16.077849 \n", 860 | "5 21.740833 61.464167 2.827130 5.320806 17.446237 \n", 861 | "6 28.731667 71.526500 2.489466 9.218011 24.476720 \n", 862 | "7 28.703833 71.522333 2.491735 9.210806 24.486882 \n", 863 | "8 29.362167 72.147500 2.457159 9.504355 24.687957 \n", 864 | "9 29.471167 72.077000 2.445679 9.398011 24.507634 \n", 865 | "\n", 866 | " std_temperature_diff yield \n", 867 | "0 9.061316 35.7 \n", 868 | "1 8.485870 35.7 \n", 869 | "2 8.444363 35.7 \n", 870 | "3 8.512496 35.7 \n", 871 | "4 8.201818 35.7 \n", 872 | "5 8.471867 35.7 \n", 873 | "6 9.612832 14.4 \n", 874 | "7 9.634120 14.4 \n", 875 | "8 9.656204 14.4 \n", 876 | "9 9.583744 14.4 " 877 | ] 878 | }, 879 | "execution_count": 16, 880 | "metadata": {}, 881 | "output_type": "execute_result" 882 | } 883 | ], 884 | "source": [ 885 | "df_2013_new.head(10)" 886 | ] 887 | }, 888 | { 889 | "cell_type": "code", 890 | "execution_count": null, 891 | "metadata": { 892 | "collapsed": true 893 | }, 894 | "outputs": [], 895 | "source": [] 896 | }, 897 | { 898 | "cell_type": "markdown", 899 | "metadata": {}, 900 | "source": [ 901 | "## 2014" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": 17, 907 | "metadata": { 908 | "collapsed": false 909 | }, 910 | "outputs": [ 911 | { 912 | "data": { 913 | "text/html": [ 914 | "
\n", 915 | "\n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | " \n", 1031 | " \n", 1032 | " \n", 1033 | " \n", 1034 | " \n", 1035 | " \n", 1036 | " \n", 1037 | " \n", 1038 | " \n", 1039 | " \n", 1040 | " \n", 1041 | " \n", 1042 | " \n", 1043 | " \n", 1044 | " \n", 1045 | " \n", 1046 | " \n", 1047 | " \n", 1048 | " \n", 1049 | " \n", 1050 | " \n", 1051 | " \n", 1052 | " \n", 1053 | " \n", 1054 | " \n", 1055 | " \n", 1056 | " \n", 1057 | " \n", 1058 | " \n", 1059 | " \n", 1060 | " \n", 1061 | " \n", 1062 | " \n", 1063 | " \n", 1064 | "
CountyNameStateLatitudeLongitudeDatecloudCoverdewPointhumidityprecipIntensityprecipProbability...windBearingwindSpeedNDVIDayInSeasonYieldLocationprecipTotaltemperatureDifftemperatureRatiotemperatureAverage
0AdamsWashington46.929839-118.3521092014-11-300.006.770.690.00.0...93.80136.179718035.6(-118.3521093, 46.9298391)0.016.973.43821815.445
1AdamsWashington47.150327-118.9588592014-11-300.006.660.650.00.0...3526.03135.697540035.6(-118.9588592, 47.1503267)0.017.172.97129717.295
2AdamsWashington46.811686-118.6952372014-11-300.006.550.670.00.0...253.59135.676956035.6(-118.6952372, 46.8116858)0.016.412.98668316.465
3AdamsWashington47.162342-118.6996772014-11-300.037.320.690.00.0...15.18135.005798035.6(-118.6996774, 47.1623419)0.017.383.14567916.790
4AdamsWashington47.157512-118.4340562014-11-300.047.620.700.00.0...54.69134.803864035.6(-118.4340559, 47.157512)0.016.512.98437516.575
\n", 1065 | "

5 rows × 27 columns

\n", 1066 | "
" 1067 | ], 1068 | "text/plain": [ 1069 | " CountyName State Latitude Longitude Date cloudCover \\\n", 1070 | "0 Adams Washington 46.929839 -118.352109 2014-11-30 0.00 \n", 1071 | "1 Adams Washington 47.150327 -118.958859 2014-11-30 0.00 \n", 1072 | "2 Adams Washington 46.811686 -118.695237 2014-11-30 0.00 \n", 1073 | "3 Adams Washington 47.162342 -118.699677 2014-11-30 0.03 \n", 1074 | "4 Adams Washington 47.157512 -118.434056 2014-11-30 0.04 \n", 1075 | "\n", 1076 | " dewPoint humidity precipIntensity precipProbability ... \\\n", 1077 | "0 6.77 0.69 0.0 0.0 ... \n", 1078 | "1 6.66 0.65 0.0 0.0 ... \n", 1079 | "2 6.55 0.67 0.0 0.0 ... \n", 1080 | "3 7.32 0.69 0.0 0.0 ... \n", 1081 | "4 7.62 0.70 0.0 0.0 ... \n", 1082 | "\n", 1083 | " windBearing windSpeed NDVI DayInSeason Yield \\\n", 1084 | "0 9 3.80 136.179718 0 35.6 \n", 1085 | "1 352 6.03 135.697540 0 35.6 \n", 1086 | "2 25 3.59 135.676956 0 35.6 \n", 1087 | "3 1 5.18 135.005798 0 35.6 \n", 1088 | "4 5 4.69 134.803864 0 35.6 \n", 1089 | "\n", 1090 | " Location precipTotal temperatureDiff temperatureRatio \\\n", 1091 | "0 (-118.3521093, 46.9298391) 0.0 16.97 3.438218 \n", 1092 | "1 (-118.9588592, 47.1503267) 0.0 17.17 2.971297 \n", 1093 | "2 (-118.6952372, 46.8116858) 0.0 16.41 2.986683 \n", 1094 | "3 (-118.6996774, 47.1623419) 0.0 17.38 3.145679 \n", 1095 | "4 (-118.4340559, 47.157512) 0.0 16.51 2.984375 \n", 1096 | "\n", 1097 | " temperatureAverage \n", 1098 | "0 15.445 \n", 1099 | "1 17.295 \n", 1100 | "2 16.465 \n", 1101 | "3 16.790 \n", 1102 | "4 16.575 \n", 1103 | "\n", 1104 | "[5 rows x 27 columns]" 1105 | ] 1106 | }, 1107 | "execution_count": 17, 1108 | "metadata": {}, 1109 | "output_type": "execute_result" 1110 | } 1111 | ], 1112 | "source": [ 1113 | "df_2014.head()" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "code", 1118 | "execution_count": 18, 1119 | "metadata": { 1120 | "collapsed": true 1121 | }, 1122 | "outputs": [], 1123 | "source": [ 1124 | "features = ['longitude',\n", 1125 | " 'latitude',\n", 1126 | " 'elevation',\n", 1127 | " 'LOD',\n", 1128 | " 'total_precipitation',\n", 1129 | " 'minMAT30',\n", 1130 | " 'maxMAT30',\n", 1131 | " 'ratioMNDVI30',\n", 1132 | " 'mean_wind_speed',\n", 1133 | " 'mean_temperature_diff',\n", 1134 | " 'std_temperature_diff',\n", 1135 | " 'yield',\n", 1136 | " ]\n", 1137 | "\n", 1138 | "df_2014_new = pd.DataFrame(columns=features)" 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "code", 1143 | "execution_count": 19, 1144 | "metadata": { 1145 | "collapsed": false 1146 | }, 1147 | "outputs": [ 1148 | { 1149 | "name": "stdout", 1150 | "output_type": "stream", 1151 | "text": [ 1152 | "Exec. 
time: 39.80 s\n" 1153 | ] 1154 | } 1155 | ], 1156 | "source": [ 1157 | "now = time.time()\n", 1158 | "\n", 1159 | "for idx,loc in enumerate(locs_2014):\n", 1160 | " tmp_df = df_2014[df_2014['Location'] == loc]\n", 1161 | " (min_mean_average_temparature_30days, \n", 1162 | " max_mean_average_temparature_30days,\n", 1163 | " ratio_mean_ndvi_30days,\n", 1164 | " mean_temperature_diff,\n", 1165 | " std_temperature_diff,\n", 1166 | " mean_wind_speed,\n", 1167 | " total_precipitation,\n", 1168 | " total_yield) = calculate_features(tmp_df)\n", 1169 | " try:\n", 1170 | " new_elevation = elevation[loc]\n", 1171 | " except:\n", 1172 | " print('Elevation: no match for location found', loc)\n", 1173 | " try:\n", 1174 | " new_length_of_day = length_of_day[loc]\n", 1175 | " except:\n", 1176 | " print('Length-of-day: no match for location found', loc)\n", 1177 | " longitude = loc[0]\n", 1178 | " latitude = loc[1]\n", 1179 | " observations = {'longitude': longitude, \n", 1180 | " 'latitude': latitude,\n", 1181 | " 'elevation': new_elevation,\n", 1182 | " 'LOD': new_length_of_day,\n", 1183 | " 'total_precipitation': total_precipitation,\n", 1184 | " 'minMAT30': min_mean_average_temparature_30days,\n", 1185 | " 'maxMAT30': max_mean_average_temparature_30days,\n", 1186 | " 'ratioMNDVI30': ratio_mean_ndvi_30days,\n", 1187 | " 'mean_wind_speed': mean_wind_speed,\n", 1188 | " 'mean_temperature_diff': mean_temperature_diff,\n", 1189 | " 'std_temperature_diff': std_temperature_diff,\n", 1190 | " 'yield': total_yield,\n", 1191 | " }\n", 1192 | " df_2014_new.loc[idx] = pd.Series(observations)\n", 1193 | " \n", 1194 | "print('Exec. time: {:5.2f} s'.format(time.time()-now))" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": 20, 1200 | "metadata": { 1201 | "collapsed": false 1202 | }, 1203 | "outputs": [ 1204 | { 1205 | "data": { 1206 | "text/plain": [ 1207 | "982" 1208 | ] 1209 | }, 1210 | "execution_count": 20, 1211 | "metadata": {}, 1212 | "output_type": "execute_result" 1213 | } 1214 | ], 1215 | "source": [ 1216 | "len(locs_2014)" 1217 | ] 1218 | }, 1219 | { 1220 | "cell_type": "code", 1221 | "execution_count": 22, 1222 | "metadata": { 1223 | "collapsed": false 1224 | }, 1225 | "outputs": [ 1226 | { 1227 | "data": { 1228 | "text/plain": [ 1229 | "(982, 12)" 1230 | ] 1231 | }, 1232 | "execution_count": 22, 1233 | "metadata": {}, 1234 | "output_type": "execute_result" 1235 | } 1236 | ], 1237 | "source": [ 1238 | "df_2014_new.shape" 1239 | ] 1240 | }, 1241 | { 1242 | "cell_type": "code", 1243 | "execution_count": 23, 1244 | "metadata": { 1245 | "collapsed": false 1246 | }, 1247 | "outputs": [ 1248 | { 1249 | "data": { 1250 | "text/html": [ 1251 | "
\n", 1252 | "\n", 1253 | " \n", 1254 | " \n", 1255 | " \n", 1256 | " \n", 1257 | " \n", 1258 | " \n", 1259 | " \n", 1260 | " \n", 1261 | " \n", 1262 | " \n", 1263 | " \n", 1264 | " \n", 1265 | " \n", 1266 | " \n", 1267 | " \n", 1268 | " \n", 1269 | " \n", 1270 | " \n", 1271 | " \n", 1272 | " \n", 1273 | " \n", 1274 | " \n", 1275 | " \n", 1276 | " \n", 1277 | " \n", 1278 | " \n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | " \n", 1378 | " \n", 1379 | " \n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | "
longitudelatitudeelevationLODtotal_precipitationminMAT30maxMAT30ratioMNDVI30mean_wind_speedmean_temperature_diffstd_temperature_diffyield
0-118.35210946.929839524.56182911.1680563.36229.58800062.1858332.1017254.20290317.6793559.14826635.6
1-118.95885947.150327451.08465611.1611111.53629.68100063.4946672.1392364.04419417.8739259.29526535.6
2-118.69523746.811686427.17919911.1719442.47230.26950063.7896672.1073913.99139818.4576349.58111335.6
3-118.69967747.162342521.15033011.1608333.52829.48750062.0023332.1026654.63005417.8269359.16666535.6
4-118.43405647.157512571.11407511.1611114.31728.83483360.8576672.1105614.71709717.1875278.90512435.6
5-118.51016047.006888499.47607411.1658333.67129.49600061.7060002.0920124.67967717.8445169.18201035.6
6-98.45431836.508221394.36590611.45111110.56531.52916765.1120002.0651358.18069920.6105389.83377929.3
7-98.44474536.650698401.64587411.44777813.36831.41733365.0340002.0700048.17032320.7384959.91224629.3
8-98.31989336.702452359.77139311.44638912.76731.33900065.0821672.0767158.28736620.99182810.07021129.3
9-98.53981636.628781413.23828111.44805614.99331.37083365.0043332.0721268.16957020.7663449.92806529.3
\n", 1423 | "
" 1424 | ], 1425 | "text/plain": [ 1426 | " longitude latitude elevation LOD total_precipitation \\\n", 1427 | "0 -118.352109 46.929839 524.561829 11.168056 3.362 \n", 1428 | "1 -118.958859 47.150327 451.084656 11.161111 1.536 \n", 1429 | "2 -118.695237 46.811686 427.179199 11.171944 2.472 \n", 1430 | "3 -118.699677 47.162342 521.150330 11.160833 3.528 \n", 1431 | "4 -118.434056 47.157512 571.114075 11.161111 4.317 \n", 1432 | "5 -118.510160 47.006888 499.476074 11.165833 3.671 \n", 1433 | "6 -98.454318 36.508221 394.365906 11.451111 10.565 \n", 1434 | "7 -98.444745 36.650698 401.645874 11.447778 13.368 \n", 1435 | "8 -98.319893 36.702452 359.771393 11.446389 12.767 \n", 1436 | "9 -98.539816 36.628781 413.238281 11.448056 14.993 \n", 1437 | "\n", 1438 | " minMAT30 maxMAT30 ratioMNDVI30 mean_wind_speed mean_temperature_diff \\\n", 1439 | "0 29.588000 62.185833 2.101725 4.202903 17.679355 \n", 1440 | "1 29.681000 63.494667 2.139236 4.044194 17.873925 \n", 1441 | "2 30.269500 63.789667 2.107391 3.991398 18.457634 \n", 1442 | "3 29.487500 62.002333 2.102665 4.630054 17.826935 \n", 1443 | "4 28.834833 60.857667 2.110561 4.717097 17.187527 \n", 1444 | "5 29.496000 61.706000 2.092012 4.679677 17.844516 \n", 1445 | "6 31.529167 65.112000 2.065135 8.180699 20.610538 \n", 1446 | "7 31.417333 65.034000 2.070004 8.170323 20.738495 \n", 1447 | "8 31.339000 65.082167 2.076715 8.287366 20.991828 \n", 1448 | "9 31.370833 65.004333 2.072126 8.169570 20.766344 \n", 1449 | "\n", 1450 | " std_temperature_diff yield \n", 1451 | "0 9.148266 35.6 \n", 1452 | "1 9.295265 35.6 \n", 1453 | "2 9.581113 35.6 \n", 1454 | "3 9.166665 35.6 \n", 1455 | "4 8.905124 35.6 \n", 1456 | "5 9.182010 35.6 \n", 1457 | "6 9.833779 29.3 \n", 1458 | "7 9.912246 29.3 \n", 1459 | "8 10.070211 29.3 \n", 1460 | "9 9.928065 29.3 " 1461 | ] 1462 | }, 1463 | "execution_count": 23, 1464 | "metadata": {}, 1465 | "output_type": "execute_result" 1466 | } 1467 | ], 1468 | "source": [ 1469 | "df_2014_new.head(10)" 1470 | ] 1471 | }, 1472 | { 1473 | "cell_type": "code", 1474 | "execution_count": null, 1475 | "metadata": { 1476 | "collapsed": true 1477 | }, 1478 | "outputs": [], 1479 | "source": [] 1480 | }, 1481 | { 1482 | "cell_type": "code", 1483 | "execution_count": 24, 1484 | "metadata": { 1485 | "collapsed": false 1486 | }, 1487 | "outputs": [ 1488 | { 1489 | "data": { 1490 | "text/plain": [ 1491 | "Index([u'CountyName', u'State', u'Latitude', u'Longitude', u'Date',\n", 1492 | " u'cloudCover', u'dewPoint', u'humidity', u'precipIntensity',\n", 1493 | " u'precipProbability', u'precipAccumulation', u'precipTypeIsRain',\n", 1494 | " u'precipTypeIsSnow', u'pressure', u'temperatureMax', u'temperatureMin',\n", 1495 | " u'visibility', u'windBearing', u'windSpeed', u'NDVI', u'DayInSeason',\n", 1496 | " u'Yield', u'Location', u'precipTotal', u'temperatureDiff',\n", 1497 | " u'temperatureRatio', u'temperatureAverage'],\n", 1498 | " dtype='object')" 1499 | ] 1500 | }, 1501 | "execution_count": 24, 1502 | "metadata": {}, 1503 | "output_type": "execute_result" 1504 | } 1505 | ], 1506 | "source": [ 1507 | "df_2013.columns" 1508 | ] 1509 | }, 1510 | { 1511 | "cell_type": "code", 1512 | "execution_count": 25, 1513 | "metadata": { 1514 | "collapsed": false 1515 | }, 1516 | "outputs": [ 1517 | { 1518 | "data": { 1519 | "text/plain": [ 1520 | "Index([u'CountyName', u'State', u'Latitude', u'Longitude', u'Date',\n", 1521 | " u'cloudCover', u'dewPoint', u'humidity', u'precipIntensity',\n", 1522 | " u'precipProbability', u'precipAccumulation', 
u'precipTypeIsRain',\n", 1523 | " u'precipTypeIsSnow', u'pressure', u'temperatureMax', u'temperatureMin',\n", 1524 | " u'visibility', u'windBearing', u'windSpeed', u'NDVI', u'DayInSeason',\n", 1525 | " u'Yield', u'Location', u'precipTotal', u'temperatureDiff',\n", 1526 | " u'temperatureRatio', u'temperatureAverage'],\n", 1527 | " dtype='object')" 1528 | ] 1529 | }, 1530 | "execution_count": 25, 1531 | "metadata": {}, 1532 | "output_type": "execute_result" 1533 | } 1534 | ], 1535 | "source": [ 1536 | "df_2014.columns" 1537 | ] 1538 | }, 1539 | { 1540 | "cell_type": "code", 1541 | "execution_count": null, 1542 | "metadata": { 1543 | "collapsed": true 1544 | }, 1545 | "outputs": [], 1546 | "source": [] 1547 | }, 1548 | { 1549 | "cell_type": "markdown", 1550 | "metadata": {}, 1551 | "source": [ 1552 | "## Save features to disk" 1553 | ] 1554 | }, 1555 | { 1556 | "cell_type": "code", 1557 | "execution_count": 26, 1558 | "metadata": { 1559 | "collapsed": true 1560 | }, 1561 | "outputs": [], 1562 | "source": [ 1563 | "df_2013_new.to_pickle(os.path.join('data','df_2013_features.df'))\n", 1564 | "df_2014_new.to_pickle(os.path.join('data','df_2014_features.df'))\n" 1565 | ] 1566 | }, 1567 | { 1568 | "cell_type": "code", 1569 | "execution_count": null, 1570 | "metadata": { 1571 | "collapsed": true 1572 | }, 1573 | "outputs": [], 1574 | "source": [] 1575 | }, 1576 | { 1577 | "cell_type": "code", 1578 | "execution_count": null, 1579 | "metadata": { 1580 | "collapsed": true 1581 | }, 1582 | "outputs": [], 1583 | "source": [] 1584 | } 1585 | ], 1586 | "metadata": { 1587 | "kernelspec": { 1588 | "display_name": "Python 2", 1589 | "language": "python", 1590 | "name": "python2" 1591 | }, 1592 | "language_info": { 1593 | "codemirror_mode": { 1594 | "name": "ipython", 1595 | "version": 2 1596 | }, 1597 | "file_extension": ".py", 1598 | "mimetype": "text/x-python", 1599 | "name": "python", 1600 | "nbconvert_exporter": "python", 1601 | "pygments_lexer": "ipython2", 1602 | "version": "2.7.13" 1603 | } 1604 | }, 1605 | "nbformat": 4, 1606 | "nbformat_minor": 0 1607 | } 1608 | -------------------------------------------------------------------------------- /Full_Report.md: -------------------------------------------------------------------------------- 1 | # CropPredict - Summary Report 2 | 3 | ### **work in progress** 4 | 5 | This project aims to predict winter wheat yields based on location and weather data. It is inspired by [this](https://github.com/aerialintel/data-science-exercise) data science challenge. 6 | 7 | In this report I will provide a high-level overview of my approach to the project, my findings, and my main results. 8 | 9 | For technical details I encourage the reader to inspect the Jupyter notebooks in this repository. 10 | 11 | ## Data sources 12 | 13 | To quote from the original text: 14 | 15 | "We're providing you with two years worth of Winter Wheat data. These data are geolocated to specific lat-longs and counties. 16 | 17 | * Columns A-E in the file provide information on location and time. 18 | * Columns F-X are raw features, like NDVI or wind speed. 19 | Day in Season is a calculated feature defining how many days since the start date of the season have occurred. 20 | The yield is the label, the value that should be predicted. **Note**: this yield label is not specific to a lat/long but is for the county. Multiple lat/longs will have the same yield since they fall into a single county, even if that individual farm had a higher or lower localized yield." 
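For orientation, the two seasons can be loaded and inspected with a few lines of pandas (a minimal sketch; the CSV filenames are the ones used in the notebooks):

```python
import pandas as pd

# One file per season, as used in the notebooks.
df_2013 = pd.read_csv('data/wheat-2013-supervised.csv')
df_2014 = pd.read_csv('data/wheat-2014-supervised.csv')

# Location/time columns, raw weather features, and the county-level
# 'Yield' label that is propagated to every row of a county.
print(df_2013.shape)
print(df_2013.columns.tolist())
```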
21 | 
22 | ## Task
23 | 
24 | The task is "to try and predict wheat yield for several counties in the United States."
25 | 
26 | Considering the nature of the data, this formulation is somewhat open to interpretation. Generally, I would have started by discussing the data and by trying to better understand the eventual application or business incentive. Additional information and discussion with stakeholders can help significantly in ensuring the machine-learning model answers the right questions.
27 | 
28 | ## My approach
29 | 
30 | As it is, I decided to build a model that is well suited for applications that would ask questions like: "Is location X - which has good coverage of historical weather data - suited for winter wheat and what kind of yield can I expect?"
31 | 
32 | In its current form, the model is less well suited to answer questions along the lines of: "At location X my weather data so far this season is Y. What kind of yield can I expect at the end of the season?"
33 | 
34 | But see below for ideas on how to potentially address
35 | this last question as well.
36 | 
37 | My approach to building this model was to characterize each location across the full season. I essentially marginalized over time, engineering aggregated weather-based features for each location. See the 'Feature engineering' section for more details.
38 | 
39 | 
40 | ## Data exploration and munging
41 | 
42 | Each year in the data includes about 1000 unique locations. More than 80% of the locations are common to both years' data. The time frame covered is the end of November to the beginning of June the following year. Most locations have measurements reported on almost every day during the season (>153 days out of 186 days).
43 | 
44 | In each year, a small number of locations have measurements reported on fewer than 14 days out of the full 186-day period.
45 | 
46 | These locations were removed from the data. Given their limited coverage, it is difficult to reliably engineer the features used in this current model. With this approach, ~5% of the locations were excluded each year, accounting for >0.2% of the raw data. But also see the section 'Final words' for some ideas on how to potentially recover some information from the excluded locations.
47 | 
48 | 
49 | As provided, the data set was already fairly clean and included only a small number of missing or NULL/NaN values. Missing values were highly concentrated in the 'pressure' and 'visibility' columns. Because these weather-related measurements are likely to change with time and region, it makes little sense to use global averages to impute the missing values. Therefore I chose to adopt the following procedure: for each location with a missing weather-related value, I searched for the *geographically closest* location that had the value in question reported *on the same day*. The assumption here is that the geographically nearest value on the same day is more representative of the missing value than the average of the previous and following days' records at the target location.
50 | 
51 | 
52 | 
53 | ## Additional data sources
54 | 
55 | I wanted to include additional features that I thought might be relevant for this study.
56 | 
57 | * Elevation: For each location I included the elevation as provided by the [Google Maps Elevation API](https://developers.google.com/maps/documentation/elevation/start).
58 | 
59 | * Length of Day: For each location I included the length of the day on a common reference date within the season (March 3rd, as used in 03_elevation_and_length_of_day.ipynb). This provides a proxy for the hours of sunlight each location potentially receives. This was calculated using the [Astral](https://pythonhosted.org/astral/) package for Python (a condensed sketch of both lookups follows below).
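The snippet below condenses both lookups, following the code in 03_elevation_and_length_of_day.ipynb (a sketch, not a drop-in script: the API key is assumed to live in a GOOGLE_API_KEY environment variable, and the coordinates are one example location from the data):

```python
import datetime
import os

import requests
from astral import Location

lon, lat = -118.6952372, 46.8116858  # example location from the 2013 data

# Elevation via the Google Maps Elevation API.
url = ('https://maps.googleapis.com/maps/api/elevation/json'
       '?locations={},{}&key={}'.format(lat, lon, os.environ['GOOGLE_API_KEY']))
elevation = requests.get(url).json()['results'][0]['elevation']

# Length of day via Astral: sunset minus sunrise, in hours,
# on the reference date used in the notebook.
loc = Location()
loc.latitude, loc.longitude = lat, lon
loc.timezone = 'UTC'
loc.elevation = elevation
sun = loc.sun(date=datetime.date(2014, 3, 3), local=True)
length_of_day = (sun['sunset'] - sun['sunrise']).total_seconds() / 3600.0
```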
60 | 
61 | 
62 | ## Feature engineering
63 | 
64 | The features I used for the modeling were either taken directly from the data as-is (longitude, latitude, yield), taken from the additional data sources (elevation, length of day), or engineered from the raw data.
65 | 
66 | In support of the feature engineering, I calculated a few intermediate features:
67 | 
68 | * daily average temperature: the average of the daily maximum and minimum temperature.
69 | 
70 | * daily temperature difference: the difference between the daily maximum and minimum temperature.
71 | 
72 | The final features that were used in the modeling are:
73 | 
74 | | Feature | Description |
75 | | --- | --- |
76 | | longitude | The geographical longitude of the location in degrees. |
77 | | latitude | The geographical latitude of the location in degrees. |
78 | | elevation | The elevation of the location in meters. |
79 | | LOD | The length of the day at the location, calculated as the difference between sunrise and sunset time. |
80 | | total_precipitation | The total precipitation during the season. Calculated as the cumulative sum of the raw feature 'precipAccumulation'. |
81 | | minMAT30 | Minimum average temperature in a 30-day period. A rolling-window average over the daily average temperatures was taken with a window size of 30 days; the minimum of the resulting values marks the 30-day period with the lowest average temperature. |
82 | | maxMAT30 | Same as above, but now for the maximum average temperature in a 30-day period. |
83 | | ratioMNDVI30 | Similarly to minMAT30 and maxMAT30, I found the min/max values of NDVI in a 30-day rolling window and then took the ratio of these values. |
84 | | mean_wind_speed | The simple mean wind speed at a location over the full period. |
85 | | mean_temperature_diff | The mean of the daily temperature differences. |
86 | | std_temperature_diff | The standard deviation of the daily temperature differences about the mean. |
87 | | yield | The target variable. The crop yield at the end of the season on a county basis. |
88 | 
89 | 
90 | Before feeding the features into the modeling, I performed feature scaling by removing the mean and scaling to unit variance.
91 | 
92 | 
93 | 
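To make the 30-day rolling-window features and the scaling step concrete, here is a minimal sketch on toy data (the column names follow the notebooks; the scaler is scikit-learn's StandardScaler):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def min_max_mat30(loc_df):
    # 30-day rolling mean of the daily average temperature,
    # requiring at least 10 samples per window (as in the notebooks).
    rolled = (loc_df[['Date', 'temperatureAverage']]
              .rolling('30d', on='Date', min_periods=10)
              .mean()['temperatureAverage'])
    return rolled.min(), rolled.max()  # minMAT30, maxMAT30

# Toy per-location frame covering one 186-day season.
np.random.seed(0)
season = pd.DataFrame({
    'Date': pd.date_range('2013-11-30', periods=186, freq='D'),
    'temperatureAverage': np.random.normal(40, 10, 186),
})
min_mat30, max_mat30 = min_max_mat30(season)

# Feature scaling: remove the mean and scale to unit variance, per feature.
X = np.random.normal(size=(955, 11))  # stand-in for the 2013 feature matrix
X_scaled = StandardScaler().fit_transform(X)
```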
104 | 
105 | This confirmed the earlier notion that purely linear models are not appropriate for this data/feature space. Surprisingly, gradient-boosted trees (GBTs) - a recent favorite in many machine-learning competitions - performed poorly at first. But this algorithm has a sizable number of hyper-parameters, and some tuning actually brought the performance up to levels that exceeded the random forest regressor.
106 | 
107 | Tuning the hyper-parameters of the random forest regressor, I was not able to achieve the same performance as with the GBTs. The L2 linear regression with polynomial features lagged even further behind after tuning. Nearest-neighbor regression was clearly overfitting for most parameter combinations that provided good performance. The only real alternative to the GBT performance was a tuned version of the support-vector regression (SVR) using an 'RBF' kernel.
108 | 
109 | 
110 | ## Model tuning and performance
111 | 
112 | Of the two promising algorithms (GBT and SVR), I decided to proceed with the gradient-boosted tree regression model. My argument for this decision is twofold:
113 | 
114 | * GBTs have been very successful in recent years and I wanted to explore this algorithm further.
115 | * Both competing models are fairly complex with a number of relevant hyper-parameters, so there was no reason to choose one over the other based on complexity (at similar performance you would typically prefer a less complex model over a more complex one).
116 | 
117 | Expanding the GBT parameter search mentioned above, I identified combinations of hyper-parameters that maximize performance. I studied learning curves as well as validation curves to investigate the bias-variance tradeoff. The final result is a set of parameters that, for the current dataset, provides good performance while limiting overfitting.
118 | 
119 | 
120 | 
121 | The learning curve of the tuned model shows that there are still some issues with slight overfitting, but overall the performance and variance look promising. Increasing the 'n_estimators' parameter in the GBT model would have further increased the performance score, but at the cost of increased overfitting. It seems likely that the overfitting could be alleviated by including more training data.
122 | 
123 | 
124 | The GBT model also provides access to a feature-importance ranking:
125 | 
126 | 
127 | 
128 | 
129 | The final performance of the tuned model was established using a test set (70/30 split) for which I compared model predictions to actual yield numbers.
130 | 
131 | 
132 | 
133 | The R2 value of the final model is ~0.83 with a root mean square error (RMSE) of 5.3 (yield values in the dataset range from 10 to 80). The mean absolute percentage error is ~5%.
134 | 
135 | At very high observed yields (>60) the model appears to consistently under-predict. At lower yields the model seems well balanced.
136 | 
137 | Comparing the observed and predicted yield on the test set:
138 | 
139 | 
140 | 
141 | 
142 | 
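A condensed sketch of this tuning-and-evaluation workflow; the grid values are illustrative stand-ins for the ones actually searched in notebook 06:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

# 70/30 split; the test set stays untouched until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)

# Illustrative grid over the most influential GBT hyper-parameters.
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
}
search = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5, scoring='r2')
search.fit(X_train, y_train)

# Final performance on the hold-out set.
y_pred = search.best_estimator_.predict(X_test)
print('R2  : {:.2f}'.format(r2_score(y_test, y_pred)))
print('RMSE: {:.1f}'.format(np.sqrt(mean_squared_error(y_test, y_pred))))
print('MAPE: {:.1f}%'.format(100 * np.mean(np.abs(y_test - y_pred) / y_test)))
```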
143 | ## Final words
144 | 
145 | 
146 | *What challenges or compromises did you face during the project?*
147 | 
148 | My main challenge was being unfamiliar with the subject matter. More domain knowledge would have helped in engineering more powerful features and would have provided better intuition for judging performance.
149 | 
150 | The compromise, then, was to use my best judgement to overcome my limited domain knowledge, and to make assumptions about the concrete business incentive I wanted to address (see the comments above in "My approach").
151 | 
152 | It was also tricky to overcome overfitting completely. Getting more data would have been an obvious solution, but for brevity's sake I decided against this effort.
153 | Switching to a different algorithm might have helped, but (possibly) at the cost of performance. More careful feature engineering has the potential to offset this effect, as could ensemble techniques.
154 | 
155 | 
156 | *What did you learn along the way?*
157 | 
158 | It was a great exercise that covered the whole spectrum of a machine-learning project, from data munging through algorithm selection to model tuning and presentation. It was also a good opportunity to get more familiar with the GBT algorithm and the effects of its (numerous) hyper-parameters.
159 | 
160 | 
161 | *If you had more time, what would you improve?*
162 | 
163 | Get additional data for previous/following years and/or for more locations. Learn more about the subject matter and try to engineer more or better features.
164 | 
165 | Combining the output of different models using ensemble techniques could boost overall performance and limit overfitting.
166 | 
167 | Try to create a model that answers the question: "At location X my weather data so far this season is Y. What kind of yield can I expect at the end of the season?"
168 | 
169 | One approach would be to take location X's data and compare its profile with the existing locations. Some sort of similarity measure could then identify which known locations X is most similar to, which in turn provides an estimate of the yield from the existing data (see the sketch below).
170 | 
171 | Alternatively, a new model could be constructed that maps each individual (daily) measurement directly to the yield. Considering how each location's seasonal weather trends are likely to determine (at least partly) the yield, predicting the outcome from individual measurements will be a challenging (but highly interesting) task.
172 | 
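To make the similarity-based idea a bit more concrete, one possible sketch (purely illustrative; nothing like it exists in the current code). Features for location X would be computed over the dates observed so far, and the same features over the same date range for the known locations:

```python
import numpy as np

def similarity_yield_estimate(x_profile, known_profiles, known_yields, k=5):
    """Estimate the yield for a partially observed location by averaging
    the yields of its k most similar known locations.

    x_profile      : 1-D feature vector for location X (season so far).
    known_profiles : 2-D array of the same features for known locations,
                     computed over the same date range.
    known_yields   : 1-D array of end-of-season yields for those locations.
    """
    distances = np.linalg.norm(known_profiles - x_profile, axis=1)
    nearest = np.argsort(distances)[:k]
    return known_yields[nearest].mean()
```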
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CropPredict
2 | 
3 | This project aims to predict winter wheat yields based on location and weather data. It is inspired by [this](https://github.com/aerialintel/data-science-exercise) data science challenge.
4 | 
5 | Here I briefly outline the main steps in my approach as well as my main results. A detailed report is also available: [Full Report](https://github.com/cleipski/CropPredict/blob/master/Full_Report.md)
6 | 
7 | 
8 | ## Executive summary
9 | 
10 | A gradient-boosted decision tree regressor turned out to be the best performer. The tuned model achieved an R2 value of ~0.83 with a root mean square error (RMSE) of 5.3 (yield values in the dataset range from 10 to 80). The mean absolute percentage error is ~5%.
11 | 
12 | 
13 | 
14 | 
15 | ## Technical overview
16 | 
17 | Below I briefly outline the main steps in the workflow. The Jupyter notebooks linked in each step contain the code (with comments) that was used to achieve the results.
18 | 
19 | | Task | Summary | Notebook |
20 | | --- | --- | --- |
21 | | Explore and clean data | Explore the data structure and impute missing values. | [01](https://github.com/cleipski/CropPredict/blob/master/01_data_exploration.ipynb) |
22 | | Collect additional data | For each location, determine elevation and length of day at a unified date. | [03](https://github.com/cleipski/CropPredict/blob/master/03_elevation_and_length_of_day.ipynb) |
23 | | Feature engineering | Construct higher-level features by characterizing each location across the season. | [04](https://github.com/cleipski/CropPredict/blob/master/04_feature_engineering.ipynb) |
24 | | Statistical analysis | High-level statistical exploration of the final feature set. | [05](https://github.com/cleipski/CropPredict/blob/master/05_statisctical_feature_exploration.ipynb) |
25 | | Select algorithm | Compare a number of algorithms using cross validation to identify the most promising performers for this data/feature set. | [06](https://github.com/cleipski/CropPredict/blob/master/06_algorithm_selection.ipynb) |
26 | | Tune model | Tune the hyper-parameters of a gradient-boosted tree regressor using cross validation, learning curves, and validation curves. Find the best balance between performance and the bias-variance tradeoff. | [06](https://github.com/cleipski/CropPredict/blob/master/06_algorithm_selection.ipynb) |
27 | | Establish model performance | Use a 30% hold-out test set to compare predicted and observed yields. | [06](https://github.com/cleipski/CropPredict/blob/master/06_algorithm_selection.ipynb) |
28 | 
29 | 
30 | 
31 | ## Future work
32 | 
33 | While the performance of the model appears quite good, a closer inspection reveals a tendency to under-predict at high yield values (>60 observed). There is also some residual overfitting, even after careful tuning.
34 | 
35 | In future iterations, these issues could be addressed by:
36 | 
37 | * getting more data,
38 | * engineering additional and/or different features, or
39 | * using ensemble techniques to combine the results of different models (see the sketch below).
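As a minimal illustration of the ensemble idea (a sketch, assuming `models` holds already-tuned regressors such as the GBT and SVR from the report, and `X_test` the scaled hold-out features):

```python
import numpy as np

# Simple unweighted average of the individual model predictions;
# weights could instead be chosen by cross validation.
predictions = np.column_stack([m.predict(X_test) for m in models])
y_blend = predictions.mean(axis=1)
```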
40 | 
--------------------------------------------------------------------------------
/images/average_daily_temperature.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/average_daily_temperature.png
--------------------------------------------------------------------------------
/images/compare_yield.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/compare_yield.png
--------------------------------------------------------------------------------
/images/daily_temperature_difference.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/daily_temperature_difference.png
--------------------------------------------------------------------------------
/images/feature_importance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/feature_importance.png
--------------------------------------------------------------------------------
/images/learning_curve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/learning_curve.png
--------------------------------------------------------------------------------
/images/model_performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/model_performance.png
--------------------------------------------------------------------------------
/original_README.md:
--------------------------------------------------------------------------------
1 | # Experimenting with a dataset
2 | from your friends at Aerial Intelligence
3 | 
4 | ## The goal
5 | We want to try and predict wheat yield for several counties in the United States. We've collected some data that should give you a good head start on the exercise.
6 | 
7 | Once you've finished the exercise, we'd like you to share your insights and performance with us, and how you managed to achieve it. More on that below (in the submitting section).
8 | 
9 | ## The starter data :rocket:
10 | - 2013: https://aerialintel.blob.core.windows.net/recruiting/datasets/wheat-2013-supervised.csv
11 | - 2014: https://aerialintel.blob.core.windows.net/recruiting/datasets/wheat-2014-supervised.csv
12 | 
13 | ## Some context
14 | We're providing you with two years' worth of Winter Wheat data. These data are geolocated to specific lat-longs and counties.
15 | 
16 | - Columns A-E in the file provide information on location and time.
17 | - Columns F-X are raw features, like NDVI or wind speed.
18 | - Day in Season is a calculated feature defining how many days since the start date of the season have occurred.
19 | - The yield is the label, the value that should be predicted.
20 | **Note:** this yield label is not specific to a lat/long but is for the county. Multiple lat/longs will have the same yield since they fall into a single county, even if that individual farm had a higher or lower localized yield.
21 | 
22 | Please exclude CountyName, State, and Date from training, as this will result in overfitting and a lack of generalization to other states.
23 | 
24 | Feel free to split and manipulate this data as you see fit. You can choose to focus on the starter data, or you can look at what additional higher-level features you can process out of the starter data, and even grab more related data. If you go above and beyond the starter data, please let us know what you did and your insight behind doing so in your explanation.
25 | 
26 | ## Submitting your results
27 | 
28 | Please create a Git repository on a hosted Git platform like GitHub, etc., and send us a link. Your repository should include any code you've written for the exercise, and a write-up README.md or PDF explaining your findings. IPython notebooks are also great.
29 | 
30 | Some things to consider for your README:
31 | - A brief description of the problem and how you chose to solve it.
32 | - A high-level timeline telling us what you tried and what the results from that were.
33 | - What your final / best approach was and how it performed.
34 | - Technical choices you made during the project.
35 | - What challenges or compromises did you face during the project?
36 | - What did you learn along the way?
37 | - If you had more time, what would you improve?
38 | 
39 | We care about your thought process and your data science prowess. The better we can understand how you approached the problem, the better we can review your project. Here are a few questions we'll consider:
40 | - Can we understand your thought process? Does your README.md clearly and concisely describe the problem, your solution, and what you did to achieve it? Does your code do what the README.md says it does?
41 | - Can we understand your code? Is your logic clean, consistent, and concise?
42 | - Do your technical choices make sense?
43 | 
--------------------------------------------------------------------------------