├── 01_data_exploration.ipynb
├── 02_data_exploration_and_modeling.ipynb
├── 03_elevation_and_length_of_day.ipynb
├── 04_feature_engineering.ipynb
├── 05_statisctical_feature_exploration.ipynb
├── 06_algorithm_selection.ipynb
├── Full_Report.md
├── README.md
├── images
    ├── average_daily_temperature.png
    ├── compare_yield.png
    ├── daily_temperature_difference.png
    ├── feature_importance.png
    ├── learning_curve.png
    └── model_performance.png
└── original_README.md


/03_elevation_and_length_of_day.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 2,
  6 |    "metadata": {
  7 |     "collapsed": true
  8 |    },
  9 |    "outputs": [],
 10 |    "source": [
 11 |     "from __future__ import absolute_import, division, print_function"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "metadata": {},
 17 |    "source": [
 18 |     "# Elevation & Length of Day"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "To account for the amount of sunlight avaialable to the plants at the different locations and times, get the length of the day. Here, I do this by using the 'Astral' package (https://pythonhosted.org/astral/index.html .\n",
 26 |     "\n",
 27 |     "I also leverage the Google Elevation API to retrieve the elevation for each location.\n",
 28 |     "\n",
 29 |     "The results are saved (and pickled) in dictionaries for easy lookup."
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "markdown",
 34 |    "metadata": {},
 35 |    "source": [
 36 |     "## Imports"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "code",
 41 |    "execution_count": 3,
 42 |    "metadata": {
 43 |     "collapsed": false
 44 |    },
 45 |    "outputs": [],
 46 |    "source": [
 47 |     "import os\n",
 48 |     "import pickle\n",
 49 |     "import time\n",
 50 |     "import requests \n",
 51 |     "import json\n",
 52 |     "import datetime\n",
 53 |     "\n",
 54 |     "import numpy as np\n",
 55 |     "import pandas as pd\n",
 56 |     "\n",
 57 |     "from astral import Location"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "markdown",
 62 |    "metadata": {},
 63 |    "source": [
 64 |     "## Data"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "code",
 69 |    "execution_count": 4,
 70 |    "metadata": {
 71 |     "collapsed": false
 72 |    },
 73 |    "outputs": [],
 74 |    "source": [
 75 |     "cwd = os.getcwd()\n",
 76 |     "data = os.path.join(cwd,'data','wheat-2013-supervised.csv')\n",
 77 |     "df_2013 = pd.read_csv(data)\n",
 78 |     "data = os.path.join(cwd,'data','wheat-2014-supervised.csv')\n",
 79 |     "df_2014 = pd.read_csv(data)\n"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "code",
 84 |    "execution_count": 5,
 85 |    "metadata": {
 86 |     "collapsed": false
 87 |    },
 88 |    "outputs": [
 89 |     {
 90 |      "data": {
 91 |       "text/plain": [
 92 |        "'6/3/2014 0:00'"
 93 |       ]
 94 |      },
 95 |      "execution_count": 5,
 96 |      "metadata": {},
 97 |      "output_type": "execute_result"
 98 |     }
 99 |    ],
100 |    "source": [
101 |     "df_2013['Date'].max()"
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "code",
106 |    "execution_count": 6,
107 |    "metadata": {
108 |     "collapsed": false
109 |    },
110 |    "outputs": [
111 |     {
112 |      "data": {
113 |       "text/plain": [
114 |        "'6/3/2015 0:00'"
115 |       ]
116 |      },
117 |      "execution_count": 6,
118 |      "metadata": {},
119 |      "output_type": "execute_result"
120 |     }
121 |    ],
122 |    "source": [
123 |     "df_2014['Date'].max()"
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "markdown",
128 |    "metadata": {},
129 |    "source": [
130 |     "## Locations"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": 7,
136 |    "metadata": {
137 |     "collapsed": false
138 |    },
139 |    "outputs": [
140 |     {
141 |      "data": {
142 |       "text/plain": [
143 |        "1014"
144 |       ]
145 |      },
146 |      "execution_count": 7,
147 |      "metadata": {},
148 |      "output_type": "execute_result"
149 |     }
150 |    ],
151 |    "source": [
152 |     "df_2013['Location'] = list(zip(df_2013['Longitude'], df_2013['Latitude']))\n",
153 |     "locs_2013 = df_2013['Location'].unique().tolist()\n",
154 |     "len(locs_2013)"
155 |    ]
156 |   },
157 |   {
158 |    "cell_type": "code",
159 |    "execution_count": 8,
160 |    "metadata": {
161 |     "collapsed": false
162 |    },
163 |    "outputs": [
164 |     {
165 |      "data": {
166 |       "text/plain": [
167 |        "1035"
168 |       ]
169 |      },
170 |      "execution_count": 8,
171 |      "metadata": {},
172 |      "output_type": "execute_result"
173 |     }
174 |    ],
175 |    "source": [
176 |     "df_2014['Location'] = list(zip(df_2014['Longitude'], df_2014['Latitude']))\n",
177 |     "locs_2014 = df_2014['Location'].unique().tolist()\n",
178 |     "len(locs_2014)"
179 |    ]
180 |   },
181 |   {
182 |    "cell_type": "code",
183 |    "execution_count": 9,
184 |    "metadata": {
185 |     "collapsed": false
186 |    },
187 |    "outputs": [],
188 |    "source": [
189 |     "# Union of locations without repetitions\n",
190 |     "elevation = {}\n",
191 |     "for loc in locs_2013:\n",
192 |     "    elevation[loc] = 0\n",
193 |     "for loc in locs_2014:\n",
194 |     "    if loc in elevation:\n",
195 |     "        pass\n",
196 |     "    else:\n",
197 |     "        elevation[loc] = 0"
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "code",
202 |    "execution_count": 10,
203 |    "metadata": {
204 |     "collapsed": false
205 |    },
206 |    "outputs": [
207 |     {
208 |      "name": "stdout",
209 |      "output_type": "stream",
210 |      "text": [
211 |       "Number of unique locations across 2013 and 2014: 1167\n"
212 |      ]
213 |     }
214 |    ],
215 |    "source": [
216 |     "print('Number of unique locations across 2013 and 2014: {}'.format(len(elevation.keys())))"
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "markdown",
221 |    "metadata": {},
222 |    "source": [
223 |     "## Elevation"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "code",
228 |    "execution_count": 11,
229 |    "metadata": {
230 |     "collapsed": false
231 |    },
232 |    "outputs": [],
233 |    "source": [
234 |     "# Get API key from system variable\n",
235 |     "google_api_key = os.environ['GOOGLE_API_KEY']"
236 |    ]
237 |   },
238 |   {
239 |    "cell_type": "code",
240 |    "execution_count": 12,
241 |    "metadata": {
242 |     "collapsed": false
243 |    },
244 |    "outputs": [],
245 |    "source": [
246 |     "run_API_query = False\n",
247 |     "\n",
248 |     "base_string = 'https://maps.googleapis.com/maps/api/elevation/json?locations='\n",
249 |     "locs = elevation.keys()\n",
250 |     "\n",
251 |     "if run_API_query is True:\n",
252 |     "    for idx, loc in enumerate(locs):\n",
253 |     "        lat_string = str(loc[1])\n",
254 |     "        lng_string = str(loc[0])\n",
255 |     "        call_string = base_string + lat_string + ',' + lng_string + '&key=' + google_api_key\n",
256 |     "        if idx%10 == 0: \n",
257 |     "            print(idx)\n",
258 |     "        if elevation[loc] == 0:\n",
259 |     "            response = requests.get(call_string)\n",
260 |     "            try:\n",
261 |     "                tmp_elevation = response.json()['results'][0]['elevation']\n",
262 |     "                elevation[loc] = tmp_elevation\n",
263 |     "            except:\n",
264 |     "                print('Something went wrong', lat_string, lng_string)\n",
265 |     "                print(response.json())\n",
266 |     "            \n",
267 |     "\n",
268 |     "else:\n",
269 |     "    # INSERT\n",
270 |     "    #\n",
271 |     "    # Load data saved at end of this notebook.\n",
272 |     "    # Can be used to experiment with the Astral code below without\n",
273 |     "    # having to re-run the Google API queries above.\n",
274 |     "\n",
275 |     "    with open(os.path.join('data','elevation.pickle'), 'rb') as handle:\n",
276 |     "        elevation = pickle.load(handle)\n",
277 |     "\n",
278 |     "    \n",
279 |     "\n",
280 |     "\n",
281 |     "            "
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "markdown",
286 |    "metadata": {},
287 |    "source": [
288 |     "## Length of day"
289 |    ]
290 |   },
291 |   {
292 |    "cell_type": "markdown",
293 |    "metadata": {
294 |     "collapsed": true
295 |    },
296 |    "source": [
297 |     "Use the 'Astral' package (https://pythonhosted.org/astral/index.html) to calculate the length of the day (~ hours of sunlight) for each location."
298 |    ]
299 |   },
300 |   {
301 |    "cell_type": "code",
302 |    "execution_count": 13,
303 |    "metadata": {
304 |     "collapsed": false
305 |    },
306 |    "outputs": [],
307 |    "source": [
308 |     "locs = elevation.keys()\n",
309 |     "length_of_day = {}\n",
310 |     "for idx, loc in enumerate(locs):\n",
311 |     "    # Initialize location\n",
312 |     "    l = Location()\n",
313 |     "    # Set attributes\n",
314 |     "    l.name = ''\n",
315 |     "    l.region = ''\n",
316 |     "    l.latitude = loc[1]\n",
317 |     "    l.longitude = loc[0]\n",
318 |     "    l.timezone = 'UTC'\n",
319 |     "    l.elevation = elevation[loc]\n",
320 |     "    sun = l.sun(date=datetime.date(2014, 3, 3),local=True)\n",
321 |     "    td = sun['sunset'] - sun['sunrise']\n",
322 |     "    tmp_lod = td.total_seconds() / 3600. \n",
323 |     "    length_of_day[loc] = tmp_lod \n"
324 |    ]
325 |   },
326 |   {
327 |    "cell_type": "markdown",
328 |    "metadata": {},
329 |    "source": [
330 |     "## Save data"
331 |    ]
332 |   },
333 |   {
334 |    "cell_type": "code",
335 |    "execution_count": 14,
336 |    "metadata": {
337 |     "collapsed": true
338 |    },
339 |    "outputs": [],
340 |    "source": [
341 |     "# Store data (serialize)\n",
342 |     "with open(os.path.join('data','elevation.pickle'), 'wb') as handle:\n",
343 |     "    pickle.dump(elevation, handle)\n",
344 |     "\n",
345 |     "with open(os.path.join('data','length_of_day.pickle'), 'wb') as handle:\n",
346 |     "    pickle.dump(length_of_day, handle)\n",
347 |     "\n"
348 |    ]
349 |   },
350 |   {
351 |    "cell_type": "code",
352 |    "execution_count": null,
353 |    "metadata": {
354 |     "collapsed": true
355 |    },
356 |    "outputs": [],
357 |    "source": []
358 |   },
359 |   {
360 |    "cell_type": "code",
361 |    "execution_count": null,
362 |    "metadata": {
363 |     "collapsed": true
364 |    },
365 |    "outputs": [],
366 |    "source": []
367 |   }
368 |  ],
369 |  "metadata": {
370 |   "kernelspec": {
371 |    "display_name": "Python 2",
372 |    "language": "python",
373 |    "name": "python2"
374 |   },
375 |   "language_info": {
376 |    "codemirror_mode": {
377 |     "name": "ipython",
378 |     "version": 2
379 |    },
380 |    "file_extension": ".py",
381 |    "mimetype": "text/x-python",
382 |    "name": "python",
383 |    "nbconvert_exporter": "python",
384 |    "pygments_lexer": "ipython2",
385 |    "version": "2.7.13"
386 |   }
387 |  },
388 |  "nbformat": 4,
389 |  "nbformat_minor": 0
390 | }
391 | 


--------------------------------------------------------------------------------
/04_feature_engineering.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "code",
   5 |    "execution_count": 1,
   6 |    "metadata": {
   7 |     "collapsed": true
   8 |    },
   9 |    "outputs": [],
  10 |    "source": [
  11 |     "from __future__ import absolute_import, division, print_function"
  12 |    ]
  13 |   },
  14 |   {
  15 |    "cell_type": "markdown",
  16 |    "metadata": {},
  17 |    "source": [
  18 |     "# Feature engineering"
  19 |    ]
  20 |   },
  21 |   {
  22 |    "cell_type": "markdown",
  23 |    "metadata": {},
  24 |    "source": [
  25 |     "Here I will extract and engineer the features that will be the input for a subsequent model.\n",
  26 |     "\n",
  27 |     "The main data are of the follwoing format:\n",
  28 |     "\n",
  29 |     "* The data report weather measurements for ~1000 unique locations in ~150 counties across 5 states. \n",
  30 |     "* Most locations have one entry per day reporting the current weather conditions.\n",
  31 |     "* At the end of the season, the harvest generates a certain yield. This yield is propagated to *all* entries in the data set, even though it is only a final value.\n",
  32 |     "* The reported yield number refers to the yield in the county and is not specific to a location. It is unclear if it is an average or a sum for the county.\n",
  33 |     "\n"
  34 |    ]
  35 |   },
  36 |   {
  37 |    "cell_type": "markdown",
  38 |    "metadata": {},
  39 |    "source": [
  40 |     "## Goal"
  41 |    ]
  42 |   },
  43 |   {
  44 |    "cell_type": "markdown",
  45 |    "metadata": {},
  46 |    "source": [
  47 |     "My goal here is to create a profile across the season for each location from the avaiable plus additional data. This will leave me with one set of feature values for each location, which is connected to a final yield. And that will be the input to my model."
  48 |    ]
  49 |   },
  50 |   {
  51 |    "cell_type": "markdown",
  52 |    "metadata": {},
  53 |    "source": [
  54 |     "## Imports"
  55 |    ]
  56 |   },
  57 |   {
  58 |    "cell_type": "code",
  59 |    "execution_count": 2,
  60 |    "metadata": {
  61 |     "collapsed": true
  62 |    },
  63 |    "outputs": [],
  64 |    "source": [
  65 |     "import os\n",
  66 |     "import pickle\n",
  67 |     "import time\n",
  68 |     "\n",
  69 |     "import numpy as np\n",
  70 |     "import pandas as pd\n",
  71 |     "\n",
  72 |     "import matplotlib as mpl\n",
  73 |     "import matplotlib.pyplot as plt\n",
  74 |     "from mpl_toolkits.basemap import Basemap\n",
  75 |     "\n",
  76 |     "\n",
  77 |     "%matplotlib inline"
  78 |    ]
  79 |   },
  80 |   {
  81 |    "cell_type": "markdown",
  82 |    "metadata": {},
  83 |    "source": [
  84 |     "## Functions"
  85 |    ]
  86 |   },
  87 |   {
  88 |    "cell_type": "code",
  89 |    "execution_count": 11,
  90 |    "metadata": {
  91 |     "collapsed": false
  92 |    },
  93 |    "outputs": [],
  94 |    "source": [
  95 |     "def find_30day_extreme(in_df, in_column, in_agg='min'):\n",
  96 |     "    \"\"\"\n",
  97 |     "    Calculates a rolling window mean. Then determines the min/max/mean \n",
  98 |     "    value of all the window aggregates.\n",
  99 |     "    Window size: 30 days.\n",
 100 |     "    in_df: input dataframe\n",
 101 |     "    in_column: column to be aggregated\n",
 102 |     "    in_agg: aggregation function for the list of results ('min','max','mean')\n",
 103 |     "    \"\"\"\n",
 104 |     "    # Pandas 30 day rolling window. Make sure there are at least 10 samples in the window.\n",
 105 |     "    agg_df = in_df[[in_column,'Date']].rolling('30d', on='Date', min_periods=10).mean()\n",
 106 |     "    agg_list = agg_df[in_column]\n",
 107 |     "    if in_agg == 'min':\n",
 108 |     "        return np.min(agg_list)\n",
 109 |     "    if in_agg == 'max':\n",
 110 |     "        return np.max(agg_list)\n",
 111 |     "    if in_agg == 'mean':\n",
 112 |     "        return np.mean(agg_list)\n",
 113 |     " \n"
 114 |    ]
 115 |   },
 116 |   {
 117 |    "cell_type": "code",
 118 |    "execution_count": 12,
 119 |    "metadata": {
 120 |     "collapsed": true
 121 |    },
 122 |    "outputs": [],
 123 |    "source": [
 124 |     "def calculate_features(tmp_df):\n",
 125 |     "    result = []\n",
 126 |     "    # Minimum average temperature in 30-day period\n",
 127 |     "    result.append(find_30day_extreme(tmp_df, 'temperatureAverage', in_agg='min'))\n",
 128 |     "    # Maximum average temperature in 30-day period\n",
 129 |     "    result.append(find_30day_extreme(tmp_df, 'temperatureAverage', in_agg='max'))\n",
 130 |     "    # Minimum NDVI in 30-day period\n",
 131 |     "    tmp1 = find_30day_extreme(tmp_df, 'temperatureAverage', in_agg='min')\n",
 132 |     "    # Maximum NDVI in 30-day period\n",
 133 |     "    tmp2 = find_30day_extreme(tmp_df, 'temperatureAverage', in_agg='max')\n",
 134 |     "    result.append(tmp2/tmp1)\n",
 135 |     "    # Mean temperature difference and variance\n",
 136 |     "    result.append(tmp_df['temperatureDiff'].mean())\n",
 137 |     "    result.append(tmp_df['temperatureDiff'].std())\n",
 138 |     "    # Mean wind speed\n",
 139 |     "    result.append(tmp_df['windSpeed'].mean())\n",
 140 |     "    # Total precipitation\n",
 141 |     "    result.append(tmp_df['precipTotal'].max())\n",
 142 |     "    # Total yield\n",
 143 |     "    result.append(tmp_df['Yield'].max())\n",
 144 |     "    #\n",
 145 |     "    return result\n"
 146 |    ]
 147 |   },
 148 |   {
 149 |    "cell_type": "code",
 150 |    "execution_count": null,
 151 |    "metadata": {
 152 |     "collapsed": true
 153 |    },
 154 |    "outputs": [],
 155 |    "source": []
 156 |   },
 157 |   {
 158 |    "cell_type": "markdown",
 159 |    "metadata": {},
 160 |    "source": [
 161 |     "## Data"
 162 |    ]
 163 |   },
 164 |   {
 165 |    "cell_type": "markdown",
 166 |    "metadata": {},
 167 |    "source": [
 168 |     "These data have been slighlty pre-processed. See 01_data_exploration.ipynb for details."
 169 |    ]
 170 |   },
 171 |   {
 172 |    "cell_type": "code",
 173 |    "execution_count": 5,
 174 |    "metadata": {
 175 |     "collapsed": true
 176 |    },
 177 |    "outputs": [],
 178 |    "source": [
 179 |     "cwd = os.getcwd()\n",
 180 |     "df_2013 = pd.read_pickle(os.path.join(cwd,'data','df_2013_clean.df'))\n",
 181 |     "df_2014 = pd.read_pickle(os.path.join(cwd,'data','df_2014_clean.df'))\n"
 182 |    ]
 183 |   },
 184 |   {
 185 |    "cell_type": "markdown",
 186 |    "metadata": {},
 187 |    "source": [
 188 |     "## Additional data"
 189 |    ]
 190 |   },
 191 |   {
 192 |    "cell_type": "markdown",
 193 |    "metadata": {},
 194 |    "source": [
 195 |     "I have also obtained information on the elevation and length_of_day for each location (see 03_elevation_and_length_of_day.ipynb)."
 196 |    ]
 197 |   },
 198 |   {
 199 |    "cell_type": "code",
 200 |    "execution_count": 6,
 201 |    "metadata": {
 202 |     "collapsed": false
 203 |    },
 204 |    "outputs": [],
 205 |    "source": [
 206 |     "# Load data - dictionaries\n",
 207 |     "with open(os.path.join('data','elevation.pickle'), 'rb') as handle:\n",
 208 |     "    elevation = pickle.load(handle)\n",
 209 |     "with open(os.path.join('data','length_of_day.pickle'), 'rb') as handle:\n",
 210 |     "    length_of_day = pickle.load(handle)\n"
 211 |    ]
 212 |   },
 213 |   {
 214 |    "cell_type": "markdown",
 215 |    "metadata": {},
 216 |    "source": [
 217 |     "## Features"
 218 |    ]
 219 |   },
 220 |   {
 221 |    "cell_type": "markdown",
 222 |    "metadata": {},
 223 |    "source": [
 224 |     "Each location comes with:\n",
 225 |     "\n",
 226 |     "* longitude\n",
 227 |     "* latitude\n",
 228 |     "\n",
 229 |     "The features I am going to engineer for each location are:\n",
 230 |     "\n",
 231 |     "* the total amount of precipitation\n",
 232 |     "* the minimum average temperature in a consecutive 30-day period\n",
 233 |     "* the maximum average temperature in a consecutive 30-day period\n",
 234 |     "* ratio of maximum average NDVI in a consecutive 30-day period and its respective minimum\n",
 235 |     "* the mean temperature difference between daily min/max temperatures\n",
 236 |     "* the standard deviation of the above mean\n",
 237 |     "* the total average wind speed\n",
 238 |     "\n",
 239 |     "I will add to those features external values of:\n",
 240 |     "\n",
 241 |     "* the hours of daylight\n",
 242 |     "* the elevation"
 243 |    ]
 244 |   },
 245 |   {
 246 |    "cell_type": "markdown",
 247 |    "metadata": {},
 248 |    "source": [
 249 |     "During the engineering I will keep both years (2013 and 2014) separate. Even though these data cover mostly the same locations (~>80% overlap), weather conditions and yield is likely to be different. Keeping them separate provides additional data points. "
 250 |    ]
 251 |   },
 252 |   {
 253 |    "cell_type": "code",
 254 |    "execution_count": null,
 255 |    "metadata": {
 256 |     "collapsed": true
 257 |    },
 258 |    "outputs": [],
 259 |    "source": []
 260 |   },
 261 |   {
 262 |    "cell_type": "markdown",
 263 |    "metadata": {},
 264 |    "source": [
 265 |     "## Exclude locations with only very few measurements across the growing period"
 266 |    ]
 267 |   },
 268 |   {
 269 |    "cell_type": "markdown",
 270 |    "metadata": {},
 271 |    "source": [
 272 |     "Since I am aggregating data across the growing period, I will for this current approach remove locations with only very few records. For these locations it is difficult to generate the features above in a straight forward manner. (See 01_data_exploration.ipynb for details on how these locations were identified)."
 273 |    ]
 274 |   },
 275 |   {
 276 |    "cell_type": "code",
 277 |    "execution_count": 7,
 278 |    "metadata": {
 279 |     "collapsed": false
 280 |    },
 281 |    "outputs": [
 282 |     {
 283 |      "name": "stdout",
 284 |      "output_type": "stream",
 285 |      "text": [
 286 |       "1014\n",
 287 |       "955\n",
 288 |       "1035\n",
 289 |       "982\n"
 290 |      ]
 291 |     }
 292 |    ],
 293 |    "source": [
 294 |     "locs = df_2013['Location'].unique().tolist()\n",
 295 |     "print(len(locs))\n",
 296 |     "n_records_2013 = df_2013.groupby(by='Location').agg({'Location': {'Count' : lambda x: x.count()}})\n",
 297 |     "n_records_2013['Location']['Count'].value_counts().sort_index(ascending=False)\n",
 298 |     "klocs = n_records_2013[n_records_2013['Location']['Count'] >= 153].index.get_level_values('Location').values\n",
 299 |     "locs_2013 = df_2013['Location'][df_2013['Location'].isin(klocs)].unique().tolist()\n",
 300 |     "print(len(locs_2013))\n",
 301 |     "#\n",
 302 |     "locs = df_2014['Location'].unique().tolist()\n",
 303 |     "print(len(locs))\n",
 304 |     "\n",
 305 |     "n_records_2014 = df_2014.groupby(by='Location').agg({'Location': {'Count' : lambda x: x.count()}})\n",
 306 |     "n_records_2014['Location']['Count'].value_counts().sort_index(ascending=False)\n",
 307 |     "klocs = n_records_2014[n_records_2014['Location']['Count'] >= 153].index.get_level_values('Location').values\n",
 308 |     "locs_2014 = df_2014['Location'][df_2014['Location'].isin(klocs)].unique().tolist()\n",
 309 |     "print(len(locs_2014))\n",
 310 |     "\n"
 311 |    ]
 312 |   },
 313 |   {
 314 |    "cell_type": "markdown",
 315 |    "metadata": {},
 316 |    "source": [
 317 |     "## 2013"
 318 |    ]
 319 |   },
 320 |   {
 321 |    "cell_type": "code",
 322 |    "execution_count": 8,
 323 |    "metadata": {
 324 |     "collapsed": false
 325 |    },
 326 |    "outputs": [
 327 |     {
 328 |      "data": {
 329 |       "text/html": [
 330 |        "<div>\n",
 331 |        "<table border=\"1\" class=\"dataframe\">\n",
 332 |        "  <thead>\n",
 333 |        "    <tr style=\"text-align: right;\">\n",
 334 |        "      <th></th>\n",
 335 |        "      <th>CountyName</th>\n",
 336 |        "      <th>State</th>\n",
 337 |        "      <th>Latitude</th>\n",
 338 |        "      <th>Longitude</th>\n",
 339 |        "      <th>Date</th>\n",
 340 |        "      <th>cloudCover</th>\n",
 341 |        "      <th>dewPoint</th>\n",
 342 |        "      <th>humidity</th>\n",
 343 |        "      <th>precipIntensity</th>\n",
 344 |        "      <th>precipProbability</th>\n",
 345 |        "      <th>...</th>\n",
 346 |        "      <th>windBearing</th>\n",
 347 |        "      <th>windSpeed</th>\n",
 348 |        "      <th>NDVI</th>\n",
 349 |        "      <th>DayInSeason</th>\n",
 350 |        "      <th>Yield</th>\n",
 351 |        "      <th>Location</th>\n",
 352 |        "      <th>precipTotal</th>\n",
 353 |        "      <th>temperatureDiff</th>\n",
 354 |        "      <th>temperatureRatio</th>\n",
 355 |        "      <th>temperatureAverage</th>\n",
 356 |        "    </tr>\n",
 357 |        "  </thead>\n",
 358 |        "  <tbody>\n",
 359 |        "    <tr>\n",
 360 |        "      <th>0</th>\n",
 361 |        "      <td>Adams</td>\n",
 362 |        "      <td>Washington</td>\n",
 363 |        "      <td>46.811686</td>\n",
 364 |        "      <td>-118.695237</td>\n",
 365 |        "      <td>2013-11-30</td>\n",
 366 |        "      <td>0.00</td>\n",
 367 |        "      <td>29.53</td>\n",
 368 |        "      <td>0.91</td>\n",
 369 |        "      <td>0.0000</td>\n",
 370 |        "      <td>0.00</td>\n",
 371 |        "      <td>...</td>\n",
 372 |        "      <td>214</td>\n",
 373 |        "      <td>1.18</td>\n",
 374 |        "      <td>134.110657</td>\n",
 375 |        "      <td>0</td>\n",
 376 |        "      <td>35.7</td>\n",
 377 |        "      <td>(-118.6952372, 46.8116858)</td>\n",
 378 |        "      <td>0.000</td>\n",
 379 |        "      <td>8.22</td>\n",
 380 |        "      <td>1.299127</td>\n",
 381 |        "      <td>31.590</td>\n",
 382 |        "    </tr>\n",
 383 |        "    <tr>\n",
 384 |        "      <th>1</th>\n",
 385 |        "      <td>Adams</td>\n",
 386 |        "      <td>Washington</td>\n",
 387 |        "      <td>46.929839</td>\n",
 388 |        "      <td>-118.352109</td>\n",
 389 |        "      <td>2013-11-30</td>\n",
 390 |        "      <td>0.00</td>\n",
 391 |        "      <td>29.77</td>\n",
 392 |        "      <td>0.93</td>\n",
 393 |        "      <td>0.0001</td>\n",
 394 |        "      <td>0.05</td>\n",
 395 |        "      <td>...</td>\n",
 396 |        "      <td>166</td>\n",
 397 |        "      <td>1.01</td>\n",
 398 |        "      <td>131.506592</td>\n",
 399 |        "      <td>0</td>\n",
 400 |        "      <td>35.7</td>\n",
 401 |        "      <td>(-118.3521093, 46.9298391)</td>\n",
 402 |        "      <td>0.000</td>\n",
 403 |        "      <td>8.18</td>\n",
 404 |        "      <td>1.303863</td>\n",
 405 |        "      <td>31.010</td>\n",
 406 |        "    </tr>\n",
 407 |        "    <tr>\n",
 408 |        "      <th>2</th>\n",
 409 |        "      <td>Adams</td>\n",
 410 |        "      <td>Washington</td>\n",
 411 |        "      <td>47.006888</td>\n",
 412 |        "      <td>-118.510160</td>\n",
 413 |        "      <td>2013-11-30</td>\n",
 414 |        "      <td>0.00</td>\n",
 415 |        "      <td>29.36</td>\n",
 416 |        "      <td>0.94</td>\n",
 417 |        "      <td>0.0001</td>\n",
 418 |        "      <td>0.06</td>\n",
 419 |        "      <td>...</td>\n",
 420 |        "      <td>158</td>\n",
 421 |        "      <td>1.03</td>\n",
 422 |        "      <td>131.472946</td>\n",
 423 |        "      <td>0</td>\n",
 424 |        "      <td>35.7</td>\n",
 425 |        "      <td>(-118.5101603, 47.0068881)</td>\n",
 426 |        "      <td>0.020</td>\n",
 427 |        "      <td>6.43</td>\n",
 428 |        "      <td>1.238590</td>\n",
 429 |        "      <td>30.165</td>\n",
 430 |        "    </tr>\n",
 431 |        "    <tr>\n",
 432 |        "      <th>3</th>\n",
 433 |        "      <td>Adams</td>\n",
 434 |        "      <td>Washington</td>\n",
 435 |        "      <td>47.162342</td>\n",
 436 |        "      <td>-118.699677</td>\n",
 437 |        "      <td>2013-11-30</td>\n",
 438 |        "      <td>0.91</td>\n",
 439 |        "      <td>29.47</td>\n",
 440 |        "      <td>0.94</td>\n",
 441 |        "      <td>0.0002</td>\n",
 442 |        "      <td>0.15</td>\n",
 443 |        "      <td>...</td>\n",
 444 |        "      <td>153</td>\n",
 445 |        "      <td>1.84</td>\n",
 446 |        "      <td>131.288300</td>\n",
 447 |        "      <td>0</td>\n",
 448 |        "      <td>35.7</td>\n",
 449 |        "      <td>(-118.6996774, 47.1623419)</td>\n",
 450 |        "      <td>0.036</td>\n",
 451 |        "      <td>6.02</td>\n",
 452 |        "      <td>1.221568</td>\n",
 453 |        "      <td>30.180</td>\n",
 454 |        "    </tr>\n",
 455 |        "    <tr>\n",
 456 |        "      <th>4</th>\n",
 457 |        "      <td>Adams</td>\n",
 458 |        "      <td>Washington</td>\n",
 459 |        "      <td>47.157512</td>\n",
 460 |        "      <td>-118.434056</td>\n",
 461 |        "      <td>2013-11-30</td>\n",
 462 |        "      <td>0.91</td>\n",
 463 |        "      <td>29.86</td>\n",
 464 |        "      <td>0.94</td>\n",
 465 |        "      <td>0.0003</td>\n",
 466 |        "      <td>0.24</td>\n",
 467 |        "      <td>...</td>\n",
 468 |        "      <td>156</td>\n",
 469 |        "      <td>1.85</td>\n",
 470 |        "      <td>131.288300</td>\n",
 471 |        "      <td>0</td>\n",
 472 |        "      <td>35.7</td>\n",
 473 |        "      <td>(-118.4340559, 47.157512)</td>\n",
 474 |        "      <td>0.000</td>\n",
 475 |        "      <td>6.78</td>\n",
 476 |        "      <td>1.250462</td>\n",
 477 |        "      <td>30.460</td>\n",
 478 |        "    </tr>\n",
 479 |        "  </tbody>\n",
 480 |        "</table>\n",
 481 |        "<p>5 rows × 27 columns</p>\n",
 482 |        "</div>"
 483 |       ],
 484 |       "text/plain": [
 485 |        "  CountyName       State   Latitude   Longitude       Date  cloudCover  \\\n",
 486 |        "0      Adams  Washington  46.811686 -118.695237 2013-11-30        0.00   \n",
 487 |        "1      Adams  Washington  46.929839 -118.352109 2013-11-30        0.00   \n",
 488 |        "2      Adams  Washington  47.006888 -118.510160 2013-11-30        0.00   \n",
 489 |        "3      Adams  Washington  47.162342 -118.699677 2013-11-30        0.91   \n",
 490 |        "4      Adams  Washington  47.157512 -118.434056 2013-11-30        0.91   \n",
 491 |        "\n",
 492 |        "   dewPoint  humidity  precipIntensity  precipProbability         ...          \\\n",
 493 |        "0     29.53      0.91           0.0000               0.00         ...           \n",
 494 |        "1     29.77      0.93           0.0001               0.05         ...           \n",
 495 |        "2     29.36      0.94           0.0001               0.06         ...           \n",
 496 |        "3     29.47      0.94           0.0002               0.15         ...           \n",
 497 |        "4     29.86      0.94           0.0003               0.24         ...           \n",
 498 |        "\n",
 499 |        "   windBearing  windSpeed        NDVI  DayInSeason  Yield  \\\n",
 500 |        "0          214       1.18  134.110657            0   35.7   \n",
 501 |        "1          166       1.01  131.506592            0   35.7   \n",
 502 |        "2          158       1.03  131.472946            0   35.7   \n",
 503 |        "3          153       1.84  131.288300            0   35.7   \n",
 504 |        "4          156       1.85  131.288300            0   35.7   \n",
 505 |        "\n",
 506 |        "                     Location  precipTotal  temperatureDiff  temperatureRatio  \\\n",
 507 |        "0  (-118.6952372, 46.8116858)        0.000             8.22          1.299127   \n",
 508 |        "1  (-118.3521093, 46.9298391)        0.000             8.18          1.303863   \n",
 509 |        "2  (-118.5101603, 47.0068881)        0.020             6.43          1.238590   \n",
 510 |        "3  (-118.6996774, 47.1623419)        0.036             6.02          1.221568   \n",
 511 |        "4   (-118.4340559, 47.157512)        0.000             6.78          1.250462   \n",
 512 |        "\n",
 513 |        "   temperatureAverage  \n",
 514 |        "0              31.590  \n",
 515 |        "1              31.010  \n",
 516 |        "2              30.165  \n",
 517 |        "3              30.180  \n",
 518 |        "4              30.460  \n",
 519 |        "\n",
 520 |        "[5 rows x 27 columns]"
 521 |       ]
 522 |      },
 523 |      "execution_count": 8,
 524 |      "metadata": {},
 525 |      "output_type": "execute_result"
 526 |     }
 527 |    ],
 528 |    "source": [
 529 |     "df_2013.head()"
 530 |    ]
 531 |   },
 532 |   {
 533 |    "cell_type": "code",
 534 |    "execution_count": 9,
 535 |    "metadata": {
 536 |     "collapsed": true
 537 |    },
 538 |    "outputs": [],
 539 |    "source": [
 540 |     "features = ['longitude',\n",
 541 |     "            'latitude',\n",
 542 |     "            'elevation',\n",
 543 |     "            'LOD',\n",
 544 |     "            'total_precipitation',\n",
 545 |     "            'minMAT30',\n",
 546 |     "            'maxMAT30',\n",
 547 |     "            'ratioMNDVI30',\n",
 548 |     "            'mean_wind_speed',\n",
 549 |     "            'mean_temperature_diff',\n",
 550 |     "            'std_temperature_diff',\n",
 551 |     "            'yield',\n",
 552 |     "           ]\n",
 553 |     "\n",
 554 |     "df_2013_new = pd.DataFrame(columns=features)"
 555 |    ]
 556 |   },
 557 |   {
 558 |    "cell_type": "code",
 559 |    "execution_count": 13,
 560 |    "metadata": {
 561 |     "collapsed": false
 562 |    },
 563 |    "outputs": [
 564 |     {
 565 |      "name": "stdout",
 566 |      "output_type": "stream",
 567 |      "text": [
 568 |       "Exec. time: 37.78 s\n"
 569 |      ]
 570 |     }
 571 |    ],
 572 |    "source": [
 573 |     "now = time.time()\n",
 574 |     "\n",
 575 |     "for idx,loc in enumerate(locs_2013):\n",
 576 |     "    tmp_df = df_2013[df_2013['Location'] == loc]\n",
 577 |     "    (min_mean_average_temparature_30days, \n",
 578 |     "     max_mean_average_temparature_30days,\n",
 579 |     "     ratio_mean_ndvi_30days,\n",
 580 |     "     mean_temperature_diff,\n",
 581 |     "     std_temperature_diff,\n",
 582 |     "     mean_wind_speed,\n",
 583 |     "     total_precipitation,\n",
 584 |     "     total_yield) = calculate_features(tmp_df)\n",
 585 |     "    try:\n",
 586 |     "        new_elevation = elevation[loc]\n",
 587 |     "    except:\n",
 588 |     "        print('Elevation: no match for location found', loc)\n",
 589 |     "    try:\n",
 590 |     "        new_length_of_day = length_of_day[loc]\n",
 591 |     "    except:\n",
 592 |     "        print('Length-of-day: no match for location found', loc)\n",
 593 |     "    longitude = loc[0]\n",
 594 |     "    latitude = loc[1]\n",
 595 |     "    observations = {'longitude': longitude, \n",
 596 |     "        'latitude': latitude,\n",
 597 |     "        'elevation': new_elevation,\n",
 598 |     "        'LOD': new_length_of_day,\n",
 599 |     "        'total_precipitation': total_precipitation,\n",
 600 |     "        'minMAT30': min_mean_average_temparature_30days,\n",
 601 |     "        'maxMAT30': max_mean_average_temparature_30days,\n",
 602 |     "        'ratioMNDVI30': ratio_mean_ndvi_30days,\n",
 603 |     "        'mean_wind_speed': mean_wind_speed,\n",
 604 |     "        'mean_temperature_diff': mean_temperature_diff,\n",
 605 |     "        'std_temperature_diff': std_temperature_diff,\n",
 606 |     "        'yield': total_yield,\n",
 607 |     "       }\n",
 608 |     "    df_2013_new.loc[idx] = pd.Series(observations)\n",
 609 |     "        \n",
 610 |     "print('Exec. time: {:5.2f} s'.format(time.time()-now))"
 611 |    ]
 612 |   },
 613 |   {
 614 |    "cell_type": "code",
 615 |    "execution_count": 14,
 616 |    "metadata": {
 617 |     "collapsed": false
 618 |    },
 619 |    "outputs": [
 620 |     {
 621 |      "data": {
 622 |       "text/plain": [
 623 |        "955"
 624 |       ]
 625 |      },
 626 |      "execution_count": 14,
 627 |      "metadata": {},
 628 |      "output_type": "execute_result"
 629 |     }
 630 |    ],
 631 |    "source": [
 632 |     "len(locs_2013)"
 633 |    ]
 634 |   },
 635 |   {
 636 |    "cell_type": "code",
 637 |    "execution_count": 15,
 638 |    "metadata": {
 639 |     "collapsed": false
 640 |    },
 641 |    "outputs": [
 642 |     {
 643 |      "data": {
 644 |       "text/plain": [
 645 |        "(955, 12)"
 646 |       ]
 647 |      },
 648 |      "execution_count": 15,
 649 |      "metadata": {},
 650 |      "output_type": "execute_result"
 651 |     }
 652 |    ],
 653 |    "source": [
 654 |     "df_2013_new.shape"
 655 |    ]
 656 |   },
 657 |   {
 658 |    "cell_type": "code",
 659 |    "execution_count": 16,
 660 |    "metadata": {
 661 |     "collapsed": false
 662 |    },
 663 |    "outputs": [
 664 |     {
 665 |      "data": {
 666 |       "text/html": [
 667 |        "<div>\n",
 668 |        "<table border=\"1\" class=\"dataframe\">\n",
 669 |        "  <thead>\n",
 670 |        "    <tr style=\"text-align: right;\">\n",
 671 |        "      <th></th>\n",
 672 |        "      <th>longitude</th>\n",
 673 |        "      <th>latitude</th>\n",
 674 |        "      <th>elevation</th>\n",
 675 |        "      <th>LOD</th>\n",
 676 |        "      <th>total_precipitation</th>\n",
 677 |        "      <th>minMAT30</th>\n",
 678 |        "      <th>maxMAT30</th>\n",
 679 |        "      <th>ratioMNDVI30</th>\n",
 680 |        "      <th>mean_wind_speed</th>\n",
 681 |        "      <th>mean_temperature_diff</th>\n",
 682 |        "      <th>std_temperature_diff</th>\n",
 683 |        "      <th>yield</th>\n",
 684 |        "    </tr>\n",
 685 |        "  </thead>\n",
 686 |        "  <tbody>\n",
 687 |        "    <tr>\n",
 688 |        "      <th>0</th>\n",
 689 |        "      <td>-118.695237</td>\n",
 690 |        "      <td>46.811686</td>\n",
 691 |        "      <td>427.179199</td>\n",
 692 |        "      <td>11.171944</td>\n",
 693 |        "      <td>9.982</td>\n",
 694 |        "      <td>22.090000</td>\n",
 695 |        "      <td>62.033000</td>\n",
 696 |        "      <td>2.808194</td>\n",
 697 |        "      <td>5.298548</td>\n",
 698 |        "      <td>17.294409</td>\n",
 699 |        "      <td>9.061316</td>\n",
 700 |        "      <td>35.7</td>\n",
 701 |        "    </tr>\n",
 702 |        "    <tr>\n",
 703 |        "      <th>1</th>\n",
 704 |        "      <td>-118.352109</td>\n",
 705 |        "      <td>46.929839</td>\n",
 706 |        "      <td>524.561829</td>\n",
 707 |        "      <td>11.168056</td>\n",
 708 |        "      <td>12.312</td>\n",
 709 |        "      <td>21.140909</td>\n",
 710 |        "      <td>60.183833</td>\n",
 711 |        "      <td>2.846795</td>\n",
 712 |        "      <td>5.594247</td>\n",
 713 |        "      <td>16.485699</td>\n",
 714 |        "      <td>8.485870</td>\n",
 715 |        "      <td>35.7</td>\n",
 716 |        "    </tr>\n",
 717 |        "    <tr>\n",
 718 |        "      <th>2</th>\n",
 719 |        "      <td>-118.510160</td>\n",
 720 |        "      <td>47.006888</td>\n",
 721 |        "      <td>499.476074</td>\n",
 722 |        "      <td>11.165833</td>\n",
 723 |        "      <td>13.750</td>\n",
 724 |        "      <td>20.644545</td>\n",
 725 |        "      <td>59.363000</td>\n",
 726 |        "      <td>2.875481</td>\n",
 727 |        "      <td>6.132742</td>\n",
 728 |        "      <td>16.913817</td>\n",
 729 |        "      <td>8.444363</td>\n",
 730 |        "      <td>35.7</td>\n",
 731 |        "    </tr>\n",
 732 |        "    <tr>\n",
 733 |        "      <th>3</th>\n",
 734 |        "      <td>-118.699677</td>\n",
 735 |        "      <td>47.162342</td>\n",
 736 |        "      <td>521.150330</td>\n",
 737 |        "      <td>11.160833</td>\n",
 738 |        "      <td>12.934</td>\n",
 739 |        "      <td>20.151818</td>\n",
 740 |        "      <td>59.682000</td>\n",
 741 |        "      <td>2.961619</td>\n",
 742 |        "      <td>6.120430</td>\n",
 743 |        "      <td>16.800269</td>\n",
 744 |        "      <td>8.512496</td>\n",
 745 |        "      <td>35.7</td>\n",
 746 |        "    </tr>\n",
 747 |        "    <tr>\n",
 748 |        "      <th>4</th>\n",
 749 |        "      <td>-118.434056</td>\n",
 750 |        "      <td>47.157512</td>\n",
 751 |        "      <td>571.114075</td>\n",
 752 |        "      <td>11.161111</td>\n",
 753 |        "      <td>16.644</td>\n",
 754 |        "      <td>20.138182</td>\n",
 755 |        "      <td>58.143167</td>\n",
 756 |        "      <td>2.887210</td>\n",
 757 |        "      <td>6.055968</td>\n",
 758 |        "      <td>16.077849</td>\n",
 759 |        "      <td>8.201818</td>\n",
 760 |        "      <td>35.7</td>\n",
 761 |        "    </tr>\n",
 762 |        "    <tr>\n",
 763 |        "      <th>5</th>\n",
 764 |        "      <td>-118.958859</td>\n",
 765 |        "      <td>47.150327</td>\n",
 766 |        "      <td>451.084656</td>\n",
 767 |        "      <td>11.161111</td>\n",
 768 |        "      <td>8.905</td>\n",
 769 |        "      <td>21.740833</td>\n",
 770 |        "      <td>61.464167</td>\n",
 771 |        "      <td>2.827130</td>\n",
 772 |        "      <td>5.320806</td>\n",
 773 |        "      <td>17.446237</td>\n",
 774 |        "      <td>8.471867</td>\n",
 775 |        "      <td>35.7</td>\n",
 776 |        "    </tr>\n",
 777 |        "    <tr>\n",
 778 |        "      <th>6</th>\n",
 779 |        "      <td>-98.330851</td>\n",
 780 |        "      <td>36.995835</td>\n",
 781 |        "      <td>391.887451</td>\n",
 782 |        "      <td>11.439722</td>\n",
 783 |        "      <td>47.256</td>\n",
 784 |        "      <td>28.731667</td>\n",
 785 |        "      <td>71.526500</td>\n",
 786 |        "      <td>2.489466</td>\n",
 787 |        "      <td>9.218011</td>\n",
 788 |        "      <td>24.476720</td>\n",
 789 |        "      <td>9.612832</td>\n",
 790 |        "      <td>14.4</td>\n",
 791 |        "    </tr>\n",
 792 |        "    <tr>\n",
 793 |        "      <th>7</th>\n",
 794 |        "      <td>-98.404629</td>\n",
 795 |        "      <td>36.988813</td>\n",
 796 |        "      <td>401.383514</td>\n",
 797 |        "      <td>11.439722</td>\n",
 798 |        "      <td>51.077</td>\n",
 799 |        "      <td>28.703833</td>\n",
 800 |        "      <td>71.522333</td>\n",
 801 |        "      <td>2.491735</td>\n",
 802 |        "      <td>9.210806</td>\n",
 803 |        "      <td>24.486882</td>\n",
 804 |        "      <td>9.634120</td>\n",
 805 |        "      <td>14.4</td>\n",
 806 |        "    </tr>\n",
 807 |        "    <tr>\n",
 808 |        "      <th>8</th>\n",
 809 |        "      <td>-98.319893</td>\n",
 810 |        "      <td>36.702452</td>\n",
 811 |        "      <td>359.771393</td>\n",
 812 |        "      <td>11.446389</td>\n",
 813 |        "      <td>55.124</td>\n",
 814 |        "      <td>29.362167</td>\n",
 815 |        "      <td>72.147500</td>\n",
 816 |        "      <td>2.457159</td>\n",
 817 |        "      <td>9.504355</td>\n",
 818 |        "      <td>24.687957</td>\n",
 819 |        "      <td>9.656204</td>\n",
 820 |        "      <td>14.4</td>\n",
 821 |        "    </tr>\n",
 822 |        "    <tr>\n",
 823 |        "      <th>9</th>\n",
 824 |        "      <td>-98.539816</td>\n",
 825 |        "      <td>36.628781</td>\n",
 826 |        "      <td>413.238281</td>\n",
 827 |        "      <td>11.448056</td>\n",
 828 |        "      <td>67.264</td>\n",
 829 |        "      <td>29.471167</td>\n",
 830 |        "      <td>72.077000</td>\n",
 831 |        "      <td>2.445679</td>\n",
 832 |        "      <td>9.398011</td>\n",
 833 |        "      <td>24.507634</td>\n",
 834 |        "      <td>9.583744</td>\n",
 835 |        "      <td>14.4</td>\n",
 836 |        "    </tr>\n",
 837 |        "  </tbody>\n",
 838 |        "</table>\n",
 839 |        "</div>"
 840 |       ],
 841 |       "text/plain": [
 842 |        "    longitude   latitude   elevation        LOD  total_precipitation  \\\n",
 843 |        "0 -118.695237  46.811686  427.179199  11.171944                9.982   \n",
 844 |        "1 -118.352109  46.929839  524.561829  11.168056               12.312   \n",
 845 |        "2 -118.510160  47.006888  499.476074  11.165833               13.750   \n",
 846 |        "3 -118.699677  47.162342  521.150330  11.160833               12.934   \n",
 847 |        "4 -118.434056  47.157512  571.114075  11.161111               16.644   \n",
 848 |        "5 -118.958859  47.150327  451.084656  11.161111                8.905   \n",
 849 |        "6  -98.330851  36.995835  391.887451  11.439722               47.256   \n",
 850 |        "7  -98.404629  36.988813  401.383514  11.439722               51.077   \n",
 851 |        "8  -98.319893  36.702452  359.771393  11.446389               55.124   \n",
 852 |        "9  -98.539816  36.628781  413.238281  11.448056               67.264   \n",
 853 |        "\n",
 854 |        "    minMAT30   maxMAT30  ratioMNDVI30  mean_wind_speed  mean_temperature_diff  \\\n",
 855 |        "0  22.090000  62.033000      2.808194         5.298548              17.294409   \n",
 856 |        "1  21.140909  60.183833      2.846795         5.594247              16.485699   \n",
 857 |        "2  20.644545  59.363000      2.875481         6.132742              16.913817   \n",
 858 |        "3  20.151818  59.682000      2.961619         6.120430              16.800269   \n",
 859 |        "4  20.138182  58.143167      2.887210         6.055968              16.077849   \n",
 860 |        "5  21.740833  61.464167      2.827130         5.320806              17.446237   \n",
 861 |        "6  28.731667  71.526500      2.489466         9.218011              24.476720   \n",
 862 |        "7  28.703833  71.522333      2.491735         9.210806              24.486882   \n",
 863 |        "8  29.362167  72.147500      2.457159         9.504355              24.687957   \n",
 864 |        "9  29.471167  72.077000      2.445679         9.398011              24.507634   \n",
 865 |        "\n",
 866 |        "   std_temperature_diff  yield  \n",
 867 |        "0              9.061316   35.7  \n",
 868 |        "1              8.485870   35.7  \n",
 869 |        "2              8.444363   35.7  \n",
 870 |        "3              8.512496   35.7  \n",
 871 |        "4              8.201818   35.7  \n",
 872 |        "5              8.471867   35.7  \n",
 873 |        "6              9.612832   14.4  \n",
 874 |        "7              9.634120   14.4  \n",
 875 |        "8              9.656204   14.4  \n",
 876 |        "9              9.583744   14.4  "
 877 |       ]
 878 |      },
 879 |      "execution_count": 16,
 880 |      "metadata": {},
 881 |      "output_type": "execute_result"
 882 |     }
 883 |    ],
 884 |    "source": [
 885 |     "df_2013_new.head(10)"
 886 |    ]
 887 |   },
 888 |   {
 889 |    "cell_type": "code",
 890 |    "execution_count": null,
 891 |    "metadata": {
 892 |     "collapsed": true
 893 |    },
 894 |    "outputs": [],
 895 |    "source": []
 896 |   },
 897 |   {
 898 |    "cell_type": "markdown",
 899 |    "metadata": {},
 900 |    "source": [
 901 |     "## 2014"
 902 |    ]
 903 |   },
 904 |   {
 905 |    "cell_type": "code",
 906 |    "execution_count": 17,
 907 |    "metadata": {
 908 |     "collapsed": false
 909 |    },
 910 |    "outputs": [
 911 |     {
 912 |      "data": {
 913 |       "text/html": [
 914 |        "<div>\n",
 915 |        "<table border=\"1\" class=\"dataframe\">\n",
 916 |        "  <thead>\n",
 917 |        "    <tr style=\"text-align: right;\">\n",
 918 |        "      <th></th>\n",
 919 |        "      <th>CountyName</th>\n",
 920 |        "      <th>State</th>\n",
 921 |        "      <th>Latitude</th>\n",
 922 |        "      <th>Longitude</th>\n",
 923 |        "      <th>Date</th>\n",
 924 |        "      <th>cloudCover</th>\n",
 925 |        "      <th>dewPoint</th>\n",
 926 |        "      <th>humidity</th>\n",
 927 |        "      <th>precipIntensity</th>\n",
 928 |        "      <th>precipProbability</th>\n",
 929 |        "      <th>...</th>\n",
 930 |        "      <th>windBearing</th>\n",
 931 |        "      <th>windSpeed</th>\n",
 932 |        "      <th>NDVI</th>\n",
 933 |        "      <th>DayInSeason</th>\n",
 934 |        "      <th>Yield</th>\n",
 935 |        "      <th>Location</th>\n",
 936 |        "      <th>precipTotal</th>\n",
 937 |        "      <th>temperatureDiff</th>\n",
 938 |        "      <th>temperatureRatio</th>\n",
 939 |        "      <th>temperatureAverage</th>\n",
 940 |        "    </tr>\n",
 941 |        "  </thead>\n",
 942 |        "  <tbody>\n",
 943 |        "    <tr>\n",
 944 |        "      <th>0</th>\n",
 945 |        "      <td>Adams</td>\n",
 946 |        "      <td>Washington</td>\n",
 947 |        "      <td>46.929839</td>\n",
 948 |        "      <td>-118.352109</td>\n",
 949 |        "      <td>2014-11-30</td>\n",
 950 |        "      <td>0.00</td>\n",
 951 |        "      <td>6.77</td>\n",
 952 |        "      <td>0.69</td>\n",
 953 |        "      <td>0.0</td>\n",
 954 |        "      <td>0.0</td>\n",
 955 |        "      <td>...</td>\n",
 956 |        "      <td>9</td>\n",
 957 |        "      <td>3.80</td>\n",
 958 |        "      <td>136.179718</td>\n",
 959 |        "      <td>0</td>\n",
 960 |        "      <td>35.6</td>\n",
 961 |        "      <td>(-118.3521093, 46.9298391)</td>\n",
 962 |        "      <td>0.0</td>\n",
 963 |        "      <td>16.97</td>\n",
 964 |        "      <td>3.438218</td>\n",
 965 |        "      <td>15.445</td>\n",
 966 |        "    </tr>\n",
 967 |        "    <tr>\n",
 968 |        "      <th>1</th>\n",
 969 |        "      <td>Adams</td>\n",
 970 |        "      <td>Washington</td>\n",
 971 |        "      <td>47.150327</td>\n",
 972 |        "      <td>-118.958859</td>\n",
 973 |        "      <td>2014-11-30</td>\n",
 974 |        "      <td>0.00</td>\n",
 975 |        "      <td>6.66</td>\n",
 976 |        "      <td>0.65</td>\n",
 977 |        "      <td>0.0</td>\n",
 978 |        "      <td>0.0</td>\n",
 979 |        "      <td>...</td>\n",
 980 |        "      <td>352</td>\n",
 981 |        "      <td>6.03</td>\n",
 982 |        "      <td>135.697540</td>\n",
 983 |        "      <td>0</td>\n",
 984 |        "      <td>35.6</td>\n",
 985 |        "      <td>(-118.9588592, 47.1503267)</td>\n",
 986 |        "      <td>0.0</td>\n",
 987 |        "      <td>17.17</td>\n",
 988 |        "      <td>2.971297</td>\n",
 989 |        "      <td>17.295</td>\n",
 990 |        "    </tr>\n",
 991 |        "    <tr>\n",
 992 |        "      <th>2</th>\n",
 993 |        "      <td>Adams</td>\n",
 994 |        "      <td>Washington</td>\n",
 995 |        "      <td>46.811686</td>\n",
 996 |        "      <td>-118.695237</td>\n",
 997 |        "      <td>2014-11-30</td>\n",
 998 |        "      <td>0.00</td>\n",
 999 |        "      <td>6.55</td>\n",
1000 |        "      <td>0.67</td>\n",
1001 |        "      <td>0.0</td>\n",
1002 |        "      <td>0.0</td>\n",
1003 |        "      <td>...</td>\n",
1004 |        "      <td>25</td>\n",
1005 |        "      <td>3.59</td>\n",
1006 |        "      <td>135.676956</td>\n",
1007 |        "      <td>0</td>\n",
1008 |        "      <td>35.6</td>\n",
1009 |        "      <td>(-118.6952372, 46.8116858)</td>\n",
1010 |        "      <td>0.0</td>\n",
1011 |        "      <td>16.41</td>\n",
1012 |        "      <td>2.986683</td>\n",
1013 |        "      <td>16.465</td>\n",
1014 |        "    </tr>\n",
1015 |        "    <tr>\n",
1016 |        "      <th>3</th>\n",
1017 |        "      <td>Adams</td>\n",
1018 |        "      <td>Washington</td>\n",
1019 |        "      <td>47.162342</td>\n",
1020 |        "      <td>-118.699677</td>\n",
1021 |        "      <td>2014-11-30</td>\n",
1022 |        "      <td>0.03</td>\n",
1023 |        "      <td>7.32</td>\n",
1024 |        "      <td>0.69</td>\n",
1025 |        "      <td>0.0</td>\n",
1026 |        "      <td>0.0</td>\n",
1027 |        "      <td>...</td>\n",
1028 |        "      <td>1</td>\n",
1029 |        "      <td>5.18</td>\n",
1030 |        "      <td>135.005798</td>\n",
1031 |        "      <td>0</td>\n",
1032 |        "      <td>35.6</td>\n",
1033 |        "      <td>(-118.6996774, 47.1623419)</td>\n",
1034 |        "      <td>0.0</td>\n",
1035 |        "      <td>17.38</td>\n",
1036 |        "      <td>3.145679</td>\n",
1037 |        "      <td>16.790</td>\n",
1038 |        "    </tr>\n",
1039 |        "    <tr>\n",
1040 |        "      <th>4</th>\n",
1041 |        "      <td>Adams</td>\n",
1042 |        "      <td>Washington</td>\n",
1043 |        "      <td>47.157512</td>\n",
1044 |        "      <td>-118.434056</td>\n",
1045 |        "      <td>2014-11-30</td>\n",
1046 |        "      <td>0.04</td>\n",
1047 |        "      <td>7.62</td>\n",
1048 |        "      <td>0.70</td>\n",
1049 |        "      <td>0.0</td>\n",
1050 |        "      <td>0.0</td>\n",
1051 |        "      <td>...</td>\n",
1052 |        "      <td>5</td>\n",
1053 |        "      <td>4.69</td>\n",
1054 |        "      <td>134.803864</td>\n",
1055 |        "      <td>0</td>\n",
1056 |        "      <td>35.6</td>\n",
1057 |        "      <td>(-118.4340559, 47.157512)</td>\n",
1058 |        "      <td>0.0</td>\n",
1059 |        "      <td>16.51</td>\n",
1060 |        "      <td>2.984375</td>\n",
1061 |        "      <td>16.575</td>\n",
1062 |        "    </tr>\n",
1063 |        "  </tbody>\n",
1064 |        "</table>\n",
1065 |        "<p>5 rows × 27 columns</p>\n",
1066 |        "</div>"
1067 |       ],
1068 |       "text/plain": [
1069 |        "  CountyName       State   Latitude   Longitude       Date  cloudCover  \\\n",
1070 |        "0      Adams  Washington  46.929839 -118.352109 2014-11-30        0.00   \n",
1071 |        "1      Adams  Washington  47.150327 -118.958859 2014-11-30        0.00   \n",
1072 |        "2      Adams  Washington  46.811686 -118.695237 2014-11-30        0.00   \n",
1073 |        "3      Adams  Washington  47.162342 -118.699677 2014-11-30        0.03   \n",
1074 |        "4      Adams  Washington  47.157512 -118.434056 2014-11-30        0.04   \n",
1075 |        "\n",
1076 |        "   dewPoint  humidity  precipIntensity  precipProbability         ...          \\\n",
1077 |        "0      6.77      0.69              0.0                0.0         ...           \n",
1078 |        "1      6.66      0.65              0.0                0.0         ...           \n",
1079 |        "2      6.55      0.67              0.0                0.0         ...           \n",
1080 |        "3      7.32      0.69              0.0                0.0         ...           \n",
1081 |        "4      7.62      0.70              0.0                0.0         ...           \n",
1082 |        "\n",
1083 |        "   windBearing  windSpeed        NDVI  DayInSeason  Yield  \\\n",
1084 |        "0            9       3.80  136.179718            0   35.6   \n",
1085 |        "1          352       6.03  135.697540            0   35.6   \n",
1086 |        "2           25       3.59  135.676956            0   35.6   \n",
1087 |        "3            1       5.18  135.005798            0   35.6   \n",
1088 |        "4            5       4.69  134.803864            0   35.6   \n",
1089 |        "\n",
1090 |        "                     Location  precipTotal  temperatureDiff  temperatureRatio  \\\n",
1091 |        "0  (-118.3521093, 46.9298391)          0.0            16.97          3.438218   \n",
1092 |        "1  (-118.9588592, 47.1503267)          0.0            17.17          2.971297   \n",
1093 |        "2  (-118.6952372, 46.8116858)          0.0            16.41          2.986683   \n",
1094 |        "3  (-118.6996774, 47.1623419)          0.0            17.38          3.145679   \n",
1095 |        "4   (-118.4340559, 47.157512)          0.0            16.51          2.984375   \n",
1096 |        "\n",
1097 |        "   temperatureAverage  \n",
1098 |        "0              15.445  \n",
1099 |        "1              17.295  \n",
1100 |        "2              16.465  \n",
1101 |        "3              16.790  \n",
1102 |        "4              16.575  \n",
1103 |        "\n",
1104 |        "[5 rows x 27 columns]"
1105 |       ]
1106 |      },
1107 |      "execution_count": 17,
1108 |      "metadata": {},
1109 |      "output_type": "execute_result"
1110 |     }
1111 |    ],
1112 |    "source": [
1113 |     "df_2014.head()"
1114 |    ]
1115 |   },
1116 |   {
1117 |    "cell_type": "code",
1118 |    "execution_count": 18,
1119 |    "metadata": {
1120 |     "collapsed": true
1121 |    },
1122 |    "outputs": [],
1123 |    "source": [
1124 |     "features = ['longitude',\n",
1125 |     "            'latitude',\n",
1126 |     "            'elevation',\n",
1127 |     "            'LOD',\n",
1128 |     "            'total_precipitation',\n",
1129 |     "            'minMAT30',\n",
1130 |     "            'maxMAT30',\n",
1131 |     "            'ratioMNDVI30',\n",
1132 |     "            'mean_wind_speed',\n",
1133 |     "            'mean_temperature_diff',\n",
1134 |     "            'std_temperature_diff',\n",
1135 |     "            'yield',\n",
1136 |     "           ]\n",
1137 |     "\n",
1138 |     "df_2014_new = pd.DataFrame(columns=features)"
1139 |    ]
1140 |   },
1141 |   {
1142 |    "cell_type": "code",
1143 |    "execution_count": 19,
1144 |    "metadata": {
1145 |     "collapsed": false
1146 |    },
1147 |    "outputs": [
1148 |     {
1149 |      "name": "stdout",
1150 |      "output_type": "stream",
1151 |      "text": [
1152 |       "Exec. time: 39.80 s\n"
1153 |      ]
1154 |     }
1155 |    ],
1156 |    "source": [
1157 |     "now = time.time()\n",
1158 |     "\n",
1159 |     "for idx,loc in enumerate(locs_2014):\n",
1160 |     "    tmp_df = df_2014[df_2014['Location'] == loc]\n",
1161 |     "    (min_mean_average_temparature_30days, \n",
1162 |     "     max_mean_average_temparature_30days,\n",
1163 |     "     ratio_mean_ndvi_30days,\n",
1164 |     "     mean_temperature_diff,\n",
1165 |     "     std_temperature_diff,\n",
1166 |     "     mean_wind_speed,\n",
1167 |     "     total_precipitation,\n",
1168 |     "     total_yield) = calculate_features(tmp_df)\n",
1169 |     "    try:\n",
1170 |     "        new_elevation = elevation[loc]\n",
1171 |     "    except:\n",
1172 |     "        print('Elevation: no match for location found', loc)\n",
1173 |     "    try:\n",
1174 |     "        new_length_of_day = length_of_day[loc]\n",
1175 |     "    except:\n",
1176 |     "        print('Length-of-day: no match for location found', loc)\n",
1177 |     "    longitude = loc[0]\n",
1178 |     "    latitude = loc[1]\n",
1179 |     "    observations = {'longitude': longitude, \n",
1180 |     "        'latitude': latitude,\n",
1181 |     "        'elevation': new_elevation,\n",
1182 |     "        'LOD': new_length_of_day,\n",
1183 |     "        'total_precipitation': total_precipitation,\n",
1184 |     "        'minMAT30': min_mean_average_temparature_30days,\n",
1185 |     "        'maxMAT30': max_mean_average_temparature_30days,\n",
1186 |     "        'ratioMNDVI30': ratio_mean_ndvi_30days,\n",
1187 |     "        'mean_wind_speed': mean_wind_speed,\n",
1188 |     "        'mean_temperature_diff': mean_temperature_diff,\n",
1189 |     "        'std_temperature_diff': std_temperature_diff,\n",
1190 |     "        'yield': total_yield,\n",
1191 |     "       }\n",
1192 |     "    df_2014_new.loc[idx] = pd.Series(observations)\n",
1193 |     "        \n",
1194 |     "print('Exec. time: {:5.2f} s'.format(time.time()-now))"
1195 |    ]
1196 |   },
1197 |   {
1198 |    "cell_type": "code",
1199 |    "execution_count": 20,
1200 |    "metadata": {
1201 |     "collapsed": false
1202 |    },
1203 |    "outputs": [
1204 |     {
1205 |      "data": {
1206 |       "text/plain": [
1207 |        "982"
1208 |       ]
1209 |      },
1210 |      "execution_count": 20,
1211 |      "metadata": {},
1212 |      "output_type": "execute_result"
1213 |     }
1214 |    ],
1215 |    "source": [
1216 |     "len(locs_2014)"
1217 |    ]
1218 |   },
1219 |   {
1220 |    "cell_type": "code",
1221 |    "execution_count": 22,
1222 |    "metadata": {
1223 |     "collapsed": false
1224 |    },
1225 |    "outputs": [
1226 |     {
1227 |      "data": {
1228 |       "text/plain": [
1229 |        "(982, 12)"
1230 |       ]
1231 |      },
1232 |      "execution_count": 22,
1233 |      "metadata": {},
1234 |      "output_type": "execute_result"
1235 |     }
1236 |    ],
1237 |    "source": [
1238 |     "df_2014_new.shape"
1239 |    ]
1240 |   },
1241 |   {
1242 |    "cell_type": "code",
1243 |    "execution_count": 23,
1244 |    "metadata": {
1245 |     "collapsed": false
1246 |    },
1247 |    "outputs": [
1248 |     {
1249 |      "data": {
1250 |       "text/html": [
1251 |        "<div>\n",
1252 |        "<table border=\"1\" class=\"dataframe\">\n",
1253 |        "  <thead>\n",
1254 |        "    <tr style=\"text-align: right;\">\n",
1255 |        "      <th></th>\n",
1256 |        "      <th>longitude</th>\n",
1257 |        "      <th>latitude</th>\n",
1258 |        "      <th>elevation</th>\n",
1259 |        "      <th>LOD</th>\n",
1260 |        "      <th>total_precipitation</th>\n",
1261 |        "      <th>minMAT30</th>\n",
1262 |        "      <th>maxMAT30</th>\n",
1263 |        "      <th>ratioMNDVI30</th>\n",
1264 |        "      <th>mean_wind_speed</th>\n",
1265 |        "      <th>mean_temperature_diff</th>\n",
1266 |        "      <th>std_temperature_diff</th>\n",
1267 |        "      <th>yield</th>\n",
1268 |        "    </tr>\n",
1269 |        "  </thead>\n",
1270 |        "  <tbody>\n",
1271 |        "    <tr>\n",
1272 |        "      <th>0</th>\n",
1273 |        "      <td>-118.352109</td>\n",
1274 |        "      <td>46.929839</td>\n",
1275 |        "      <td>524.561829</td>\n",
1276 |        "      <td>11.168056</td>\n",
1277 |        "      <td>3.362</td>\n",
1278 |        "      <td>29.588000</td>\n",
1279 |        "      <td>62.185833</td>\n",
1280 |        "      <td>2.101725</td>\n",
1281 |        "      <td>4.202903</td>\n",
1282 |        "      <td>17.679355</td>\n",
1283 |        "      <td>9.148266</td>\n",
1284 |        "      <td>35.6</td>\n",
1285 |        "    </tr>\n",
1286 |        "    <tr>\n",
1287 |        "      <th>1</th>\n",
1288 |        "      <td>-118.958859</td>\n",
1289 |        "      <td>47.150327</td>\n",
1290 |        "      <td>451.084656</td>\n",
1291 |        "      <td>11.161111</td>\n",
1292 |        "      <td>1.536</td>\n",
1293 |        "      <td>29.681000</td>\n",
1294 |        "      <td>63.494667</td>\n",
1295 |        "      <td>2.139236</td>\n",
1296 |        "      <td>4.044194</td>\n",
1297 |        "      <td>17.873925</td>\n",
1298 |        "      <td>9.295265</td>\n",
1299 |        "      <td>35.6</td>\n",
1300 |        "    </tr>\n",
1301 |        "    <tr>\n",
1302 |        "      <th>2</th>\n",
1303 |        "      <td>-118.695237</td>\n",
1304 |        "      <td>46.811686</td>\n",
1305 |        "      <td>427.179199</td>\n",
1306 |        "      <td>11.171944</td>\n",
1307 |        "      <td>2.472</td>\n",
1308 |        "      <td>30.269500</td>\n",
1309 |        "      <td>63.789667</td>\n",
1310 |        "      <td>2.107391</td>\n",
1311 |        "      <td>3.991398</td>\n",
1312 |        "      <td>18.457634</td>\n",
1313 |        "      <td>9.581113</td>\n",
1314 |        "      <td>35.6</td>\n",
1315 |        "    </tr>\n",
1316 |        "    <tr>\n",
1317 |        "      <th>3</th>\n",
1318 |        "      <td>-118.699677</td>\n",
1319 |        "      <td>47.162342</td>\n",
1320 |        "      <td>521.150330</td>\n",
1321 |        "      <td>11.160833</td>\n",
1322 |        "      <td>3.528</td>\n",
1323 |        "      <td>29.487500</td>\n",
1324 |        "      <td>62.002333</td>\n",
1325 |        "      <td>2.102665</td>\n",
1326 |        "      <td>4.630054</td>\n",
1327 |        "      <td>17.826935</td>\n",
1328 |        "      <td>9.166665</td>\n",
1329 |        "      <td>35.6</td>\n",
1330 |        "    </tr>\n",
1331 |        "    <tr>\n",
1332 |        "      <th>4</th>\n",
1333 |        "      <td>-118.434056</td>\n",
1334 |        "      <td>47.157512</td>\n",
1335 |        "      <td>571.114075</td>\n",
1336 |        "      <td>11.161111</td>\n",
1337 |        "      <td>4.317</td>\n",
1338 |        "      <td>28.834833</td>\n",
1339 |        "      <td>60.857667</td>\n",
1340 |        "      <td>2.110561</td>\n",
1341 |        "      <td>4.717097</td>\n",
1342 |        "      <td>17.187527</td>\n",
1343 |        "      <td>8.905124</td>\n",
1344 |        "      <td>35.6</td>\n",
1345 |        "    </tr>\n",
1346 |        "    <tr>\n",
1347 |        "      <th>5</th>\n",
1348 |        "      <td>-118.510160</td>\n",
1349 |        "      <td>47.006888</td>\n",
1350 |        "      <td>499.476074</td>\n",
1351 |        "      <td>11.165833</td>\n",
1352 |        "      <td>3.671</td>\n",
1353 |        "      <td>29.496000</td>\n",
1354 |        "      <td>61.706000</td>\n",
1355 |        "      <td>2.092012</td>\n",
1356 |        "      <td>4.679677</td>\n",
1357 |        "      <td>17.844516</td>\n",
1358 |        "      <td>9.182010</td>\n",
1359 |        "      <td>35.6</td>\n",
1360 |        "    </tr>\n",
1361 |        "    <tr>\n",
1362 |        "      <th>6</th>\n",
1363 |        "      <td>-98.454318</td>\n",
1364 |        "      <td>36.508221</td>\n",
1365 |        "      <td>394.365906</td>\n",
1366 |        "      <td>11.451111</td>\n",
1367 |        "      <td>10.565</td>\n",
1368 |        "      <td>31.529167</td>\n",
1369 |        "      <td>65.112000</td>\n",
1370 |        "      <td>2.065135</td>\n",
1371 |        "      <td>8.180699</td>\n",
1372 |        "      <td>20.610538</td>\n",
1373 |        "      <td>9.833779</td>\n",
1374 |        "      <td>29.3</td>\n",
1375 |        "    </tr>\n",
1376 |        "    <tr>\n",
1377 |        "      <th>7</th>\n",
1378 |        "      <td>-98.444745</td>\n",
1379 |        "      <td>36.650698</td>\n",
1380 |        "      <td>401.645874</td>\n",
1381 |        "      <td>11.447778</td>\n",
1382 |        "      <td>13.368</td>\n",
1383 |        "      <td>31.417333</td>\n",
1384 |        "      <td>65.034000</td>\n",
1385 |        "      <td>2.070004</td>\n",
1386 |        "      <td>8.170323</td>\n",
1387 |        "      <td>20.738495</td>\n",
1388 |        "      <td>9.912246</td>\n",
1389 |        "      <td>29.3</td>\n",
1390 |        "    </tr>\n",
1391 |        "    <tr>\n",
1392 |        "      <th>8</th>\n",
1393 |        "      <td>-98.319893</td>\n",
1394 |        "      <td>36.702452</td>\n",
1395 |        "      <td>359.771393</td>\n",
1396 |        "      <td>11.446389</td>\n",
1397 |        "      <td>12.767</td>\n",
1398 |        "      <td>31.339000</td>\n",
1399 |        "      <td>65.082167</td>\n",
1400 |        "      <td>2.076715</td>\n",
1401 |        "      <td>8.287366</td>\n",
1402 |        "      <td>20.991828</td>\n",
1403 |        "      <td>10.070211</td>\n",
1404 |        "      <td>29.3</td>\n",
1405 |        "    </tr>\n",
1406 |        "    <tr>\n",
1407 |        "      <th>9</th>\n",
1408 |        "      <td>-98.539816</td>\n",
1409 |        "      <td>36.628781</td>\n",
1410 |        "      <td>413.238281</td>\n",
1411 |        "      <td>11.448056</td>\n",
1412 |        "      <td>14.993</td>\n",
1413 |        "      <td>31.370833</td>\n",
1414 |        "      <td>65.004333</td>\n",
1415 |        "      <td>2.072126</td>\n",
1416 |        "      <td>8.169570</td>\n",
1417 |        "      <td>20.766344</td>\n",
1418 |        "      <td>9.928065</td>\n",
1419 |        "      <td>29.3</td>\n",
1420 |        "    </tr>\n",
1421 |        "  </tbody>\n",
1422 |        "</table>\n",
1423 |        "</div>"
1424 |       ],
1425 |       "text/plain": [
1426 |        "    longitude   latitude   elevation        LOD  total_precipitation  \\\n",
1427 |        "0 -118.352109  46.929839  524.561829  11.168056                3.362   \n",
1428 |        "1 -118.958859  47.150327  451.084656  11.161111                1.536   \n",
1429 |        "2 -118.695237  46.811686  427.179199  11.171944                2.472   \n",
1430 |        "3 -118.699677  47.162342  521.150330  11.160833                3.528   \n",
1431 |        "4 -118.434056  47.157512  571.114075  11.161111                4.317   \n",
1432 |        "5 -118.510160  47.006888  499.476074  11.165833                3.671   \n",
1433 |        "6  -98.454318  36.508221  394.365906  11.451111               10.565   \n",
1434 |        "7  -98.444745  36.650698  401.645874  11.447778               13.368   \n",
1435 |        "8  -98.319893  36.702452  359.771393  11.446389               12.767   \n",
1436 |        "9  -98.539816  36.628781  413.238281  11.448056               14.993   \n",
1437 |        "\n",
1438 |        "    minMAT30   maxMAT30  ratioMNDVI30  mean_wind_speed  mean_temperature_diff  \\\n",
1439 |        "0  29.588000  62.185833      2.101725         4.202903              17.679355   \n",
1440 |        "1  29.681000  63.494667      2.139236         4.044194              17.873925   \n",
1441 |        "2  30.269500  63.789667      2.107391         3.991398              18.457634   \n",
1442 |        "3  29.487500  62.002333      2.102665         4.630054              17.826935   \n",
1443 |        "4  28.834833  60.857667      2.110561         4.717097              17.187527   \n",
1444 |        "5  29.496000  61.706000      2.092012         4.679677              17.844516   \n",
1445 |        "6  31.529167  65.112000      2.065135         8.180699              20.610538   \n",
1446 |        "7  31.417333  65.034000      2.070004         8.170323              20.738495   \n",
1447 |        "8  31.339000  65.082167      2.076715         8.287366              20.991828   \n",
1448 |        "9  31.370833  65.004333      2.072126         8.169570              20.766344   \n",
1449 |        "\n",
1450 |        "   std_temperature_diff  yield  \n",
1451 |        "0              9.148266   35.6  \n",
1452 |        "1              9.295265   35.6  \n",
1453 |        "2              9.581113   35.6  \n",
1454 |        "3              9.166665   35.6  \n",
1455 |        "4              8.905124   35.6  \n",
1456 |        "5              9.182010   35.6  \n",
1457 |        "6              9.833779   29.3  \n",
1458 |        "7              9.912246   29.3  \n",
1459 |        "8             10.070211   29.3  \n",
1460 |        "9              9.928065   29.3  "
1461 |       ]
1462 |      },
1463 |      "execution_count": 23,
1464 |      "metadata": {},
1465 |      "output_type": "execute_result"
1466 |     }
1467 |    ],
1468 |    "source": [
1469 |     "df_2014_new.head(10)"
1470 |    ]
1471 |   },
1472 |   {
1473 |    "cell_type": "code",
1474 |    "execution_count": null,
1475 |    "metadata": {
1476 |     "collapsed": true
1477 |    },
1478 |    "outputs": [],
1479 |    "source": []
1480 |   },
1481 |   {
1482 |    "cell_type": "code",
1483 |    "execution_count": 24,
1484 |    "metadata": {
1485 |     "collapsed": false
1486 |    },
1487 |    "outputs": [
1488 |     {
1489 |      "data": {
1490 |       "text/plain": [
1491 |        "Index([u'CountyName', u'State', u'Latitude', u'Longitude', u'Date',\n",
1492 |        "       u'cloudCover', u'dewPoint', u'humidity', u'precipIntensity',\n",
1493 |        "       u'precipProbability', u'precipAccumulation', u'precipTypeIsRain',\n",
1494 |        "       u'precipTypeIsSnow', u'pressure', u'temperatureMax', u'temperatureMin',\n",
1495 |        "       u'visibility', u'windBearing', u'windSpeed', u'NDVI', u'DayInSeason',\n",
1496 |        "       u'Yield', u'Location', u'precipTotal', u'temperatureDiff',\n",
1497 |        "       u'temperatureRatio', u'temperatureAverage'],\n",
1498 |        "      dtype='object')"
1499 |       ]
1500 |      },
1501 |      "execution_count": 24,
1502 |      "metadata": {},
1503 |      "output_type": "execute_result"
1504 |     }
1505 |    ],
1506 |    "source": [
1507 |     "df_2013.columns"
1508 |    ]
1509 |   },
1510 |   {
1511 |    "cell_type": "code",
1512 |    "execution_count": 25,
1513 |    "metadata": {
1514 |     "collapsed": false
1515 |    },
1516 |    "outputs": [
1517 |     {
1518 |      "data": {
1519 |       "text/plain": [
1520 |        "Index([u'CountyName', u'State', u'Latitude', u'Longitude', u'Date',\n",
1521 |        "       u'cloudCover', u'dewPoint', u'humidity', u'precipIntensity',\n",
1522 |        "       u'precipProbability', u'precipAccumulation', u'precipTypeIsRain',\n",
1523 |        "       u'precipTypeIsSnow', u'pressure', u'temperatureMax', u'temperatureMin',\n",
1524 |        "       u'visibility', u'windBearing', u'windSpeed', u'NDVI', u'DayInSeason',\n",
1525 |        "       u'Yield', u'Location', u'precipTotal', u'temperatureDiff',\n",
1526 |        "       u'temperatureRatio', u'temperatureAverage'],\n",
1527 |        "      dtype='object')"
1528 |       ]
1529 |      },
1530 |      "execution_count": 25,
1531 |      "metadata": {},
1532 |      "output_type": "execute_result"
1533 |     }
1534 |    ],
1535 |    "source": [
1536 |     "df_2014.columns"
1537 |    ]
1538 |   },
1539 |   {
1540 |    "cell_type": "code",
1541 |    "execution_count": null,
1542 |    "metadata": {
1543 |     "collapsed": true
1544 |    },
1545 |    "outputs": [],
1546 |    "source": []
1547 |   },
1548 |   {
1549 |    "cell_type": "markdown",
1550 |    "metadata": {},
1551 |    "source": [
1552 |     "## Save features to disk"
1553 |    ]
1554 |   },
1555 |   {
1556 |    "cell_type": "code",
1557 |    "execution_count": 26,
1558 |    "metadata": {
1559 |     "collapsed": true
1560 |    },
1561 |    "outputs": [],
1562 |    "source": [
1563 |     "df_2013_new.to_pickle(os.path.join('data','df_2013_features.df'))\n",
1564 |     "df_2014_new.to_pickle(os.path.join('data','df_2014_features.df'))\n"
1565 |    ]
1566 |   },
1567 |   {
1568 |    "cell_type": "code",
1569 |    "execution_count": null,
1570 |    "metadata": {
1571 |     "collapsed": true
1572 |    },
1573 |    "outputs": [],
1574 |    "source": []
1575 |   },
1576 |   {
1577 |    "cell_type": "code",
1578 |    "execution_count": null,
1579 |    "metadata": {
1580 |     "collapsed": true
1581 |    },
1582 |    "outputs": [],
1583 |    "source": []
1584 |   }
1585 |  ],
1586 |  "metadata": {
1587 |   "kernelspec": {
1588 |    "display_name": "Python 2",
1589 |    "language": "python",
1590 |    "name": "python2"
1591 |   },
1592 |   "language_info": {
1593 |    "codemirror_mode": {
1594 |     "name": "ipython",
1595 |     "version": 2
1596 |    },
1597 |    "file_extension": ".py",
1598 |    "mimetype": "text/x-python",
1599 |    "name": "python",
1600 |    "nbconvert_exporter": "python",
1601 |    "pygments_lexer": "ipython2",
1602 |    "version": "2.7.13"
1603 |   }
1604 |  },
1605 |  "nbformat": 4,
1606 |  "nbformat_minor": 0
1607 | }
1608 | 


--------------------------------------------------------------------------------
/Full_Report.md:
--------------------------------------------------------------------------------
  1 | # CropPredict - Summary Report
  2 | 
  3 | ### **work in progress**
  4 | 
  5 | This project aims to predict winter wheat yields based on location and weather data. It is inspired by  [this](https://github.com/aerialintel/data-science-exercise) data science challenge.
  6 | 
  7 | In this report I will provide a high-level overview of my approach to the project, my findings, and my main results.
  8 | 
  9 | For technical details I encourage the reader to inspect the Jupyter notebooks in this repository.
 10 | 
 11 | ## Data sources
 12 | 
 13 | To quote from the original text:
 14 | 
 15 | "We're providing you with two years worth of Winter Wheat data. These data are geolocated to specific lat-longs and counties.
 16 | 
 17 | * Columns A-E in the file provide information on location and time.
 18 | * Columns F-X are raw features, like NDVI or wind speed.
 19 | Day in Season is a calculated feature defining how many days since the start date of the season have occurred.
 20 | The yield is the label, the value that should be predicted. **Note**: this yield label is not specific to a lat/long but is for the county. Multiple lat/longs will have the same yield since they fall into a single county, even if that individual farm had a higher or lower localized yield."
 21 | 
 22 | ## Task
 23 | 
 24 | The task is "to try and predict wheat yield for several counties in the United States."
 25 | 
 26 | Considering the nature of the data, this formulation is somewhat open to interpretation. Generally, I would have started by discussing the data and by trying to better understand the eventual application or business incentive. Additional information and discussion with stakeholders can help significantly in ensuring the machine-learning model answers the right questions.
 27 | 
 28 | ## My approach
 29 | 
 30 | As it is, I decided to build a model that is well suited for applications that would ask questions like: "Is location X - which has good coverage of historical weather data - suited for winter wheat and what kind of yield can I expect?"
 31 | 
 32 | In its current form, the model is less well suited to answer questions along the lines of: "At location X my weather data so far this season is Y. What kind of yield can I expect at the end of the season?"
 33 | 
 34 | But see below for ideas of how to potentially address
 35 | this last question as well.
 36 | 
 37 | My approach to build this model was to characterize each location across the full season. I essentially marginalized over time, engineering aggregated weather-based features for each location. See the 'Feature engineering' section for more details.
 38 | 
 39 | 
 40 | ## Data exploration and munging
 41 | 
 42 | Each year in the data includes about 1000 unique locations. More than 80% of the locations are common to both year's data. The time frame covered is end of November to beginning of June the following year. Most locations have measurements reported on almost every day during the season (>153 days out of 186 days).
 43 | 
 44 | In each year, a small number of locations have measurements reported only on less than 14 days out of the full 186 day period.
 45 | 
 46 | These locations were removed from the data. Given their limited coverage, it is difficult to reliably engineer the features used in this current model.  With this approach, ~5% of the locations were excluded each year, accounting for >0.2% or the raw data. But also see the section 'Final words' for some ideas on how to potentially recover some information from the excluded locations.
 47 | 
 48 | 
 49 | As provided, the data set was already fairly clean and included only a small number of missing or NULL/NaN values. Missing data was highly concentrated in the 'pressure' and 'visibility' columns. Because these  weather-related measurements are likely to change with time and region, it makes little sense to use global averages to impute the missing values. Therefore I chose to adopt the following procedure: for each location with a missing weather-related value, I searched for the *geographically closest* location that had the value in question reported *on the same day*. The assumption here being that the geographically nearest value on the same day is more representative of the missing value than the average of previous and following days records at the target location.
 50 | 
 51 | 
 52 | 
 53 | ## Additional data sources
 54 | 
 55 | I wanted to include additional features that I thought might be relevant for this study.
 56 | 
 57 | * Elevation: For each location I included the elevation as provided by the [Google Maps Elevation API](https://developers.google.com/maps/documentation/elevation/start).
 58 | 
 59 | * Length of Day: For each location I included the length of the day at the last date in each year's data (typically June 3rd). This provides a proxy for the hours of sunlight each location potentially receives. This was calculated using the [Astral](https://pythonhosted.org/astral/) package for python.
 60 | 
 61 | 
 62 | ## Feature engineering
 63 | 
 64 | The features I used for the modeling were either taken directly from the data as-is (longitude, latitude, yield), taken from the additional data sources (elevation, length of day), or engineered from the raw data.
 65 | 
 66 | In support of the feature engineering, I calculated a few intermediate features:
 67 | 
 68 | * daily average temperature: the average of the daily maximum and minimum temperature.
 69 | 
 70 | * daily temperature difference: the difference of the daily maximum and minimum temperature.
 71 | 
 72 | The final features that were used in the modeling are:
 73 | 
 74 | | Feature | Description |
 75 | | --- | --- |
 76 | | longitude | The geographical longitude of the location in degrees. |
 77 | | latitude | The geographical latitude of the location in degrees. |
 78 | | elevation | The elevation of the location in meters. |
 79 | | LOD | The length of the day at the location, calculated as the difference between sunrise and sunset time. |
 80 | | total_precipitation | The total precipitation during the season. Calculated as the cumulative sum of the raw feature 'precipAccumulation'. |
 81 | | minMAT30 | Minimum average temperature in a 30-day period. A rolling window average over the daily average temperatures was taken with window-size of 30 days. The minimum of the resulting values gives the 30-day period which shows the lowest average temperatures over 30 days. |
 82 | | maxMAT30 | Same as above, but now for the maximum average temperature in a 30-day period. |
 83 | | ratioMNDVI30 | Similarly to minMAT30 and maxMAT30 I found the appropriate min/max values for NDVI in a 30-day rolling window and then took the ratio of these values. |
 84 | | mean_wind_speed | The simple mean wind speed at a location over the full period. |
 85 | | mean_temperature_diff | The mean of the daily temperature differences. |
 86 | | std_temperature_diff | The standard deviation of the daily temperature differences about the mean. |
 87 | | yield | The target variable. The crop yield at the end of the season on a county basis. |
 88 | 
 89 | 
 90 | Before feeding the features into the modeling, I performed feature scaling by removing the mean and scaling to unit variance.
 91 | 
 92 | 
 93 | 
 94 | ## Algorithm selection
 95 | 
 96 | A simple correlation analysis of the final data showed that there is no obvious strong linear correlation between most features and the target variable. However, some features are clearly correlated linearly.
 97 | 
 98 | This already indicates that purely linear models may not be the best algorithms for this application. But regularization can help to alleviate co-linearity issues, while introducing higher order (polynomial) features can help with the non-linearity.
 99 | 
100 | Tree-based ensemble methods (random forest, gradient boosting) are typically good at handling non-linear feature interaction and co-linearity. These models can be prone to overfitting which can be
101 | handled by careful tuning of the hyper-parameters.
102 | 
103 | I decided to run a number of algorithms (see  [06_algorithm_selection.ipynb](https://github.com/cleipski/CropPredict/blob/master/02_data_exploration_and_modeling.ipynb)). Using 5-fold cross validation I compared their performance and found that using mostly default settings, the random forest regressor performed the best, followed by nearest-neighbor regression, L2 linear regression with polynomial features, and support-vector regression using an 'RBF' kernel.
104 | 
105 | This confirmed the earlier notion that purely linear models are not appropriate for this data/feature space. Surprisingly, gradient boosted trees (GBTs) - a recent favorite among many machine-learning competitions - performed poorly. But this algorithm has a sizable number of hyper-parameters. Some tuning of the parameters actually brought the performance up to levels that exceeded the random forest regressor.
106 | 
107 | Tuning the hyper parameters of the random forest regressor I was not able to achieve the same performance as with the GBTs. The L2 linear regression with polynomial features lacked even more in tuned performance. Nearest-neighbor regression was clearly overfitting for most parameter combination that would provide good performance. The only real alternative to the GBT performance was a tuned version of the support-vector regression (SVR) using an 'RBF' kernel.
108 | 
109 | 
110 | ## Model tuning and performance
111 | 
112 | Out of the two promising algorithms (GBT and SVR) I decided to proceed with a gradient-boosted tree regression model. My argument for this decision is twofold:
113 | 
114 | * GBTs have been very successful in recent years and I wanted to explore this algorithm further.
115 | * Both competing models are fairly complex with a number of relevant hyper-parameters. So there was no reason to chose one over the other based on complexity (at similar performance you would typically prefer a less complex model over a more complex one).
116 | 
117 | Expanding the GBT parameter search mentioned above I identified combinations of hyper-parameters that maximize performance. I studied learning curves as well as validation curves to investigate the bias-variance tradeoff.  The final result is a set of parameters that for the current dataset provides good performance while limiting overfitting.
118 | 
119 | <img src="https://github.com/cleipski/CropPredict/raw/master/images/learning_curve.png" width="400"/>
120 | 
121 | The learning curve of the tuned model shows that there are still some issues with slight overfitting, but overall the performance and variance look promising. Increasing the 'n_estimators' parameter in the GBT model would have further increased the performance score, but at the cost of increased overfitting. It seems likely that the overfitting could be alleviated by including more training data.
122 | 
123 | 
124 | The GBT model also provides access to feature importance ranking:
125 | 
126 | <img src="https://github.com/cleipski/CropPredict/raw/master/images/feature_importance.png" width="400"/>
127 | 
128 | 
129 | The final performance of the tuned model was established using a test set (70/30 split) for which I compared model predictions to actual yield numbers.
130 | 
131 | <img src="https://github.com/cleipski/CropPredict/raw/master/images/model_performance.png" width="400"/>
132 | 
133 | The R<sup>2</sup> value of the final model is ~0.83 with a root mean square error (RMSE) of 5.3 (yield values in the dataset range from 10 to 80). The mean absolute percentage error is ~5%.
134 | 
135 | At very high observed yields (>60) the model appears to consistently under-predict. At lower yields the model seems well balanced.
136 | 
137 | Comparing the observed and predicted yield on the test set:
138 | 
139 | <img src="https://github.com/cleipski/CropPredict/raw/master/images/compare_yield.png" width="800"/>
140 | 
141 | 
142 | 
143 | ## Final words
144 | 
145 | 
146 | *What challenges or compromises did you face during the project?*
147 | 
148 | My main challenge was being unfamiliar with the subject matter. More domain knowledge would have helped in engineering more powerful features and provided better intuition in judging performance.
149 | 
150 | The compromise was then to use my best judgement for overcoming my limited domain knowledge as well as having to make assumption on the concrete business incentive I would like to address (see comments above in "My approach").
151 | 
152 | It was also tricky to overcome overfitting completely. Getting more data would have been an obvious solution, but for brevity's sake I decided against this effort.
153 | Switching to a different algorithm might have helped, but (possibly) at the cost of performance. More careful feature engineering has the potential to offset this effect. Or using ensemble techniques.
154 | 
155 | 
156 | *What did you learn along the way?*
157 | 
158 | It was a great exercise that covered the whole spectrum of a machine-learning project. From data munging over algorithm selection to model tuning and presentation. A good opportunity to get more familiar with the GBT algorithm and the effect of the (numerous) hyper-parameters.
159 | 
160 | 
161 | *If you had more time, what would you improve?*
162 | 
163 | Get additional data for previous/following years and/or for more locations. Learn more about the subject matter and try to engineer more or better features.
164 | 
165 | Combining the output of different models using ensemble techniques could boost overall performance and limit overfitting.
166 | 
167 | Try to create a model that answers the question: "At location X my weather data so far this season is Y. What kind of yield can I expect at the end of the season?"
168 | 
169 | One approach would be to take location X's data and compare its profile with the existing locations. Some sort of similarity measure could then inform me about which known locations X is most similar to, which in turn provides an estimate of the yield Y from the existing data.
170 | 
171 | Alternatively, a new model could be constructed that maps each individual (daily) measurement directly to the yield. Considering how each location's seasonal weather trends are likely to determine (at least partly) the yield, the problem of predicting outcome from individual measurements will be a challenging (but highly interesting) task.
172 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # CropPredict
 2 | 
 3 | This project aims to predict winter wheat yields based on location and weather data. It is inspired by  [this](https://github.com/aerialintel/data-science-exercise) data science challenge.
 4 | 
 5 | Here I briefly outline the main steps in my approach as well as my main results. A detailed report is also available:  [Full Report](https://github.com/cleipski/CropPredict/blob/master/Full_Report.md)
 6 | 
 7 | 
 8 | ## Executive summary
 9 | 
10 | A gradient-boosted decision tree regressor turned out to be the best performer. The tuned model achieved an R<sup>2</sup> value of ~0.83 with a root mean square error (RMSE) of 5.3 (yield values in the dataset range from 10 to 80). The mean absolute percentage error is ~5%.
11 | 
12 | <img src="https://github.com/cleipski/CropPredict/raw/master/images/model_performance.png" width="200"/>
13 | 
14 | 
15 | ## Technical overview
16 | 
17 | Below I outline briefly the main steps in the workflow. The  Jupyter notebooks linked in each step contain the code (with comments) that was used to achieve the results.
18 | 
19 | | Task | Summary | Notebook|
20 | | --- | --- | -- |
21 | | Explore and clean data | Exploring data structure and impute missing values. | [01](https://github.com/cleipski/CropPredict/blob/master/01_data_exploration.ipynb) |
22 | | Collect additional data | For each location determine elevation and length-of-day at a unified date. | [03](https://github.com/cleipski/CropPredict/blob/master/03_elevation_and_length_of_day.ipynb) |
23 | | Feature engineering | Construct higher-level features by characterizing each location across the season. | [04](https://github.com/cleipski/CropPredict/blob/master/04_feature_engineering.ipynb) |
24 | | Statistical analysis | High-level statistical exploration of final feature set. | [05](https://github.com/cleipski/CropPredict/blob/master/05_statisctical_feature_exploration.ipynb) |
25 | | Select algorithm | Compare a number of algorithms using cross validation to identify the most promising performers for this data/feature set. | [06](https://github.com/cleipski/CropPredict/blob/master/06_algorithm_selection.ipynb) |
26 | | Tune model | Tune hyper-parameters of a gradient-boosted tree regressor using cross validation, learning curves and validation curves. Find best balance between performance and bias-variance tradeoff. | [06](https://github.com/cleipski/CropPredict/blob/master/06_algorithm_selection.ipynb) |
27 | | Establish model performance | Use a 30% hold-out test set to compare predicted and observed yields. | [06](https://github.com/cleipski/CropPredict/blob/master/06_algorithm_selection.ipynb) |
28 | 
29 | 
30 | 
31 | ## Future work
32 | 
33 | While the performance of the model appears quite good, a close inspection reveals that it has a tendency to under predict at high yield values (>60 observed). There is also some residual overfitting, even after careful tuning.
34 | 
35 | In future iterations, these issues could be addressed by:
36 | 
37 | * getting more data,
38 | * engineering additional and/or different features, or
39 | * using ensemble techniques by combining the results of different models.
40 | 


--------------------------------------------------------------------------------
/images/average_daily_temperature.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/average_daily_temperature.png


--------------------------------------------------------------------------------
/images/compare_yield.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/compare_yield.png


--------------------------------------------------------------------------------
/images/daily_temperature_difference.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/daily_temperature_difference.png


--------------------------------------------------------------------------------
/images/feature_importance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/feature_importance.png


--------------------------------------------------------------------------------
/images/learning_curve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/learning_curve.png


--------------------------------------------------------------------------------
/images/model_performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cleipski/CropPredict/f8e222509935e134b5939beb9b7456f2d3c7d26c/images/model_performance.png


--------------------------------------------------------------------------------
/original_README.md:
--------------------------------------------------------------------------------
 1 | # Experimenting with a dataset
 2 | from your friends at Aerial Intelligence
 3 | 
 4 | ## The goal
 5 | We want to try and predict wheat yield for several counties in the United States. We've collected some data that should give you a good head start on the exercise.
 6 | 
 7 | Once you've finished the exercise, we'd like you to share your insights and performance with us, and how you managed to achieve it. More on that below (in the submitting section)
 8 | 
 9 | ## The starter data :rocket:
10 | - 2013: https://aerialintel.blob.core.windows.net/recruiting/datasets/wheat-2013-supervised.csv
11 | - 2014: https://aerialintel.blob.core.windows.net/recruiting/datasets/wheat-2014-supervised.csv
12 | 
13 | ## Some context
14 | We're providing you with two years worth of Winter Wheat data. These data are geolocated to specific lat-longs and counties.
15 | 
16 | - Columns A-E in the file provide information on location and time. 
17 | - Columns F-X are raw features, like NDVI or wind speed. 
18 | - Day in Season is a calculated feature defining how many days since the start date of the season have occurred. 
19 | - The yield is the label, the value that should be predicted.
20 |     **Note:** this yield label is not specific to a lat/long but is for the county. Multiple lat/longs will have the same yield since they fall into a single county, even if that individual farm had a higher or lower localized yield. 
21 | 
22 | Please exclude CountyName, State, and Date from training as this will result in overfitting and lack of generalization to other states. 
23 | 
24 | Feel free to split and manipulate this data as you see fit. You can choose to focus on the starter data, or you can look at what additional higher level features you can process out of the starter data, and even grabbing more related data. If you go above and beyond the starter data, please let us know what you did and your insight behind doing so in your explanation.
25 | 
26 | ## Submitting your results
27 | 
28 | Please create a Git repository on a hosted Git platform like GitHub, etc, and send us a link. Your repository should include any code you've written for the exercise, and a writeup README.md or PDF explaining your findings. IPython notebooks are also great.
29 | 
30 | Some things to consider for your README:
31 | - A brief description of the problem and how you chose to solve it.
32 | - A high level timeline telling us what you tried and what the results from that were
33 | - What your final / best approach was and how it performed
34 | - Technical choices you made during the project
35 | - What challenges or compromises did you face during the project?
36 | - What did you learn along the way?
37 | - If you had more time, what would you improve?
38 | 
39 | We care about your thought process and your data science prowess. The better we can understand how you approached the problem, the better we can review your project. Here are a few questions we'll consider:
40 | - Can we understand your thought process? Does your README.md clearly and concisely describe the problem, your solution, and what you did to achieve it? Does your code do what the README.md says it does?
41 | - Can we understand your code? Is your logic clean, consistent, and concise?
42 | - Do your technical choices make sense?
43 | 


--------------------------------------------------------------------------------