├── .gitignore
├── img
│   ├── distance_comparison1.png
│   ├── distance_comparison2.png
│   ├── distance_comparison3.png
│   ├── distance_comparison4.png
│   └── distance_comparison5.png
├── README.md
└── distance_comparison.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | *.csv
2 | *.shp
3 |
--------------------------------------------------------------------------------
/img/distance_comparison1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caocscar/ms2/master/img/distance_comparison1.png
--------------------------------------------------------------------------------
/img/distance_comparison2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caocscar/ms2/master/img/distance_comparison2.png
--------------------------------------------------------------------------------
/img/distance_comparison3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caocscar/ms2/master/img/distance_comparison3.png
--------------------------------------------------------------------------------
/img/distance_comparison4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caocscar/ms2/master/img/distance_comparison4.png
--------------------------------------------------------------------------------
/img/distance_comparison5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/caocscar/ms2/master/img/distance_comparison5.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Notes
2 |
3 | [distance_comparison.ipynb](distance_comparison.ipynb) is a Jupyter notebook (Python) that compares two distance metrics, Haversine and the Google Distance Matrix API, and tries to explain where and why the two differ.
4 |
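As a rough illustration of the Haversine ("as the crow flies") metric mentioned above, here is a minimal, dependency-free sketch. The function name `haversine_m` and the constant `EARTH_RADIUS_KM` are hypothetical names chosen for this example; the notebook itself obtains the same quantity through scikit-learn's `NearestNeighbors` with `metric='haversine'`. The radius of 6371 km is the value the notebook uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same value the notebook uses


def haversine_m(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length on the unit sphere
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Central angle (radians) times radius (km), converted to meters
    return 2 * radius_km * 1000 * math.asin(math.sqrt(a))


# Coordinates printed in the notebook for road segments 46 and 3171 (Example 3)
d = haversine_m(42.328109753, -73.362982057, 42.327302438, -73.364611532)
print(round(d, 1))
```

For this pair the result should land close to the 161.25 m reported in the notebook's `crow` column, while the Google driving distance for the same pair is 26885 m.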
--------------------------------------------------------------------------------
/distance_comparison.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Distance Comparison between Google Distance Matrix API and Haversine (as the crow flies)"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Preprocessing Steps"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "Import Python modules and read the Google Maps API key from the `GOOGLE_MAP_API_KEY` environment variable."
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "from sklearn.neighbors import NearestNeighbors\n",
31 | "import geopandas as gp\n",
32 | "import numpy as np\n",
33 | "import os\n",
34 | "import googlemaps\n",
35 | "import matplotlib.pyplot as plt\n",
36 | "\n",
37 | "gp.pd.options.display.max_rows = 10\n",
38 | "\n",
39 | "apikey = os.getenv('GOOGLE_MAP_API_KEY')\n",
40 | "gmaps = googlemaps.Client(apikey)"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "Read in shapefile"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 2,
53 | "metadata": {
54 | "scrolled": true
55 | },
56 | "outputs": [],
57 | "source": [
58 | "wdir = r'X:\\MS2'\n",
59 | "ma = gp.read_file(os.path.join(wdir,'Mhd2017Export.shp'))"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "Iterate through each geometry and choose an arbitrary point to represent the road segment. Here I chose the first point in the linestring."
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 3,
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "geom = []\n",
76 | "for row in ma.itertuples():\n",
77 | "    if row.geometry.geom_type == 'LineString':\n",
78 | "        geom.append(row.geometry.coords[0])  # first point of the linestring\n",
79 | "    elif row.geometry.geom_type == 'MultiLineString':\n",
80 | "        for linestring in row.geometry.geoms:\n",
81 | "            geom.append(linestring.coords[0])  # first point of the first part only\n",
82 | "            break\n",
83 | "ma['pt'] = geom"
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "metadata": {},
89 | "source": [
90 | "Choose a subsample to compare. Here I select by county and pick Berkshire County (`COUNTY == '3'`) because it touches both the northern and southern borders of Massachusetts."
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 4,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "county = ma[(ma['COUNTY'] == '3')]\n",
100 | "county.reset_index(drop=True, inplace=True)"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
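107 | "Use the nearest neighbor algorithm to find the closest `k` neighbors for each road segment. We request `k+1` neighbors because each point's nearest neighbor is itself (distance 0). It returns the neighbor id (row number) along with the corresponding haversine distance. The haversine metric expects `[latitude, longitude]` in radians and returns distances in radians, so we multiply by Earth's mean radius (6371 km) to get kilometers."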
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 5,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "XY = county['pt'].tolist()\n",
117 | "XY = [[pt[1],pt[0]] for pt in XY] # latitude,longitude format\n",
118 | "xy = np.radians(XY)\n",
119 | "k = 20\n",
120 | "nbrs = NearestNeighbors(n_neighbors=k+1, algorithm='auto', leaf_size=30, metric='haversine')\n",
121 | "nbrs.fit(xy)\n",
122 | "distances, nid = nbrs.kneighbors(xy)\n",
123 | "R = 6371 # avg Earth radius in km\n",
124 | "distances *= R"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 6,
130 | "metadata": {},
131 | "outputs": [
132 | {
133 | "data": {
134 | "text/plain": [
135 | "array([[ 0, 23, 24, ..., 2776, 927, 2608],\n",
136 | " [ 1, 926, 927, ..., 3141, 2820, 13],\n",
137 | " [ 2, 928, 927, ..., 2706, 2516, 2821],\n",
138 | " ..., \n",
139 | " [3220, 2936, 2927, ..., 143, 176, 177],\n",
140 | " [3221, 2928, 3222, ..., 2987, 2980, 3214],\n",
141 | " [3222, 2948, 2928, ..., 2987, 2980, 3214]], dtype=int64)"
142 | ]
143 | },
144 | "execution_count": 6,
145 | "metadata": {},
146 | "output_type": "execute_result"
147 | }
148 | ],
149 | "source": [
150 | "nid"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 7,
156 | "metadata": {},
157 | "outputs": [
158 | {
159 | "data": {
160 | "text/plain": [
161 | "array([[ 0. , 0.11624839, 0.14769209, ..., 0.79839008,\n",
162 | " 0.80492603, 0.8175712 ],\n",
163 | " [ 0. , 0.12335718, 0.16070397, ..., 0.59028603,\n",
164 | " 0.59220125, 0.59401006],\n",
165 | " [ 0. , 0.1044344 , 0.15796374, ..., 0.45832769,\n",
166 | " 0.4823994 , 0.4892717 ],\n",
167 | " ..., \n",
168 | " [ 0. , 0.08251677, 0.14064991, ..., 0.9275115 ,\n",
169 | " 0.93867396, 0.95895518],\n",
170 | " [ 0. , 0.02395408, 0.05064373, ..., 1.07913732,\n",
171 | " 1.39614578, 1.43711027],\n",
172 | " [ 0. , 0.00658318, 0.02681579, ..., 1.12675085,\n",
173 | " 1.34878615, 1.38973484]])"
174 | ]
175 | },
176 | "execution_count": 7,
177 | "metadata": {},
178 | "output_type": "execute_result"
179 | }
180 | ],
181 | "source": [
182 | "distances"
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
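189 | "For each road segment, choose a random neighbor to construct an origin-destination (O-D) pair. Each pair is stored with the smaller row id first, so that a pair and its reverse count as duplicates. Construct a dataframe from the O-D pairs and remove any duplicates. Set the random seed for reproducibility."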
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": 8,
195 | "metadata": {},
196 | "outputs": [],
197 | "source": [
198 | "count = 0\n",
199 | "np.random.seed(2018)\n",
200 | "OD = []\n",
201 | "for origin in range(county.shape[0]):\n",
202 | "    choice = np.random.randint(1,k,1)[0]  # upper bound is exclusive: picks neighbor 1..k-1 (column 0 is the point itself)\n",
203 | "    dest = nid[origin][choice]\n",
204 | "    dist = distances[origin][choice] * 1000  # km -> m\n",
205 | "    if origin < dest:\n",
206 | "        OD.append((origin,dest,choice,dist))\n",
207 | "    else:\n",
208 | "        OD.append((dest,origin,choice,dist))\n",
209 | "    if dist < 0.001:  # count pairs that are effectively coincident (< 1 mm apart)\n",
210 | "        count += 1\n",
211 | "\n",
212 | "df = gp.pd.DataFrame(OD, columns=['origin','dest','n_neighbor','crow']).drop_duplicates(['origin','dest'])"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "Sample the dataframe so we are within the daily Google API rate limit of 2500 O-D pairs."
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 9,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": [
228 | "sample = df.sample(n=2400).sort_values(['origin','dest']).reset_index(drop=True)"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "Iterate through the dataframe and make a request to the Google Distance Matrix API. Construct a dataframe from this data and save the data to a file for future reference. "
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 10,
241 | "metadata": {},
242 | "outputs": [],
243 | "source": [
244 | "api_flag = False  # set to True to query the API; otherwise reuse the cached CSV\n",
245 | "if api_flag:\n",
246 | "    coords = []\n",
247 | "    for row in sample.itertuples():\n",
248 | "        origin = '{:.6f},{:.6f}'.format(XY[row.origin][0],XY[row.origin][1])\n",
249 | "        dest = '{:.6f},{:.6f}'.format(XY[row.dest][0],XY[row.dest][1])\n",
250 | "        result = gmaps.distance_matrix(origins=origin, destinations=dest)  # one request per O-D pair\n",
251 | "        distance = result['rows'][0]['elements'][0]['distance']['value']  # driving distance in meters\n",
252 | "        coords.append((XY[row.origin],XY[row.dest],distance))\n",
253 | "    driving = gp.pd.DataFrame(coords, columns=['origin','dest','goog'])\n",
254 | "    driving.to_csv('google_distance_matrix.csv', index=False, sep='|')\n",
255 | "else:\n",
256 | "    driving = gp.pd.read_csv('google_distance_matrix.csv', sep='|')"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "Combine the two datasets (haversine and google) and calculate the difference."
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 11,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 | "sample['goog'] = driving['goog']\n",
273 | "sample['delta'] = sample['goog'] - sample['crow']\n",
274 | "sample.sort_values('delta', inplace=True)"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "## Descriptive Statistics"
282 | ]
283 | },
284 | {
285 | "cell_type": "markdown",
286 | "metadata": {},
287 | "source": [
288 | "Here is the dataframe that contains the relevant sample data from Berkshire county.\n",
289 | "\n",
290 | "* origin, dest - contains row id of segment\n",
291 | "* n_neighbor - nth closest neighbor to origin segment\n",
292 | "* crow, goog - distance between the O-D pair for haversine and the Google API respectively\n",
293 | "* delta - difference between the two metrics (`goog - crow`)\n",
294 | "\n",
295 | "Note: Distances are in meters for columns `crow`, `goog`, and `delta`."
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 12,
301 | "metadata": {},
302 | "outputs": [
303 | {
304 | "data": {
305 | "text/html": [
306 | "<div>\n",
307 | "<table border=\"1\" class=\"dataframe\">\n",
308 | "  <thead>\n",
309 | "    <tr style=\"text-align: right;\"><th></th><th>origin</th><th>dest</th><th>n_neighbor</th><th>crow</th><th>goog</th><th>delta</th></tr>\n",
310 | "  </thead>\n",
311 | "  <tbody>\n",
312 | "    <tr><th>142</th><td>116</td><td>219</td><td>10</td><td>672.525865</td><td>668</td><td>-4.525865</td></tr>\n",
313 | "    <tr><th>639</th><td>550</td><td>1553</td><td>3</td><td>75.325174</td><td>72</td><td>-3.325174</td></tr>\n",
314 | "    <tr><th>1975</th><td>2285</td><td>2286</td><td>8</td><td>207.228675</td><td>204</td><td>-3.228675</td></tr>\n",
315 | "    <tr><th>715</th><td>601</td><td>1921</td><td>4</td><td>209.212445</td><td>206</td><td>-3.212445</td></tr>\n",
316 | "    <tr><th>271</th><td>222</td><td>249</td><td>3</td><td>355.083549</td><td>352</td><td>-3.083549</td></tr>\n",
317 | "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
318 | "    <tr><th>2275</th><td>2757</td><td>3031</td><td>17</td><td>2672.684762</td><td>31016</td><td>28343.315238</td></tr>\n",
319 | "    <tr><th>1969</th><td>2267</td><td>2497</td><td>7</td><td>1741.407797</td><td>30912</td><td>29170.592203</td></tr>\n",
320 | "    <tr><th>2087</th><td>2499</td><td>2916</td><td>15</td><td>1441.316883</td><td>32476</td><td>31034.683117</td></tr>\n",
321 | "    <tr><th>514</th><td>454</td><td>2491</td><td>17</td><td>2861.271057</td><td>36536</td><td>33674.728943</td></tr>\n",
322 | "    <tr><th>1232</th><td>1203</td><td>2417</td><td>13</td><td>2097.580642</td><td>38791</td><td>36693.419358</td></tr>\n",
323 | "  </tbody>\n",
324 | "</table>\n",
325 | "<p>2400 rows × 6 columns</p>\n",
326 | "</div>"
436 | ],
437 | "text/plain": [
438 | " origin dest n_neighbor crow goog delta\n",
439 | "142 116 219 10 672.525865 668 -4.525865\n",
440 | "639 550 1553 3 75.325174 72 -3.325174\n",
441 | "1975 2285 2286 8 207.228675 204 -3.228675\n",
442 | "715 601 1921 4 209.212445 206 -3.212445\n",
443 | "271 222 249 3 355.083549 352 -3.083549\n",
444 | "... ... ... ... ... ... ...\n",
445 | "2275 2757 3031 17 2672.684762 31016 28343.315238\n",
446 | "1969 2267 2497 7 1741.407797 30912 29170.592203\n",
447 | "2087 2499 2916 15 1441.316883 32476 31034.683117\n",
448 | "514 454 2491 17 2861.271057 36536 33674.728943\n",
449 | "1232 1203 2417 13 2097.580642 38791 36693.419358\n",
450 | "\n",
451 | "[2400 rows x 6 columns]"
452 | ]
453 | },
454 | "execution_count": 12,
455 | "metadata": {},
456 | "output_type": "execute_result"
457 | }
458 | ],
459 | "source": [
460 | "sample"
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": 13,
466 | "metadata": {},
467 | "outputs": [
468 | {
469 | "data": {
470 | "text/plain": [
471 | "count 2400.000000\n",
472 | "mean 706.100946\n",
473 | "std 2836.276653\n",
474 | "min -4.525865\n",
475 | "50% 71.719718\n",
476 | "75% 331.092314\n",
477 | "90% 1195.414122\n",
478 | "95% 2546.184654\n",
479 | "99% 17596.912272\n",
480 | "max 36693.419358\n",
481 | "Name: delta, dtype: float64"
482 | ]
483 | },
484 | "execution_count": 13,
485 | "metadata": {},
486 | "output_type": "execute_result"
487 | }
488 | ],
489 | "source": [
490 | "sample['delta'].describe([0.75,0.9,0.95,0.99])"
491 | ]
492 | },
493 | {
494 | "cell_type": "markdown",
495 | "metadata": {},
496 | "source": [
497 | "## Comparison between Google and Haversine"
498 | ]
499 | },
500 | {
501 | "cell_type": "markdown",
502 | "metadata": {},
503 | "source": [
504 | "## Figure 1: Google vs Haversine"
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": 14,
510 | "metadata": {},
511 | "outputs": [
512 | {
513 | "name": "stderr",
514 | "output_type": "stream",
515 | "text": [
516 | "C:\\Users\\caoa\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\matplotlib\\cbook\\deprecation.py:106: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance. In a future version, a new instance will always be created and returned. Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.\n",
517 | " warnings.warn(message, mplDeprecation, stacklevel=1)\n"
518 | ]
519 | },
520 | {
521 | "data": {
522 | "text/plain": [
523 | "Text(6,1,'y=x')"
524 | ]
525 | },
526 | "execution_count": 14,
527 | "metadata": {},
528 | "output_type": "execute_result"
529 | },
530 | {
531 | "data": {
532 | "image/png": "\n",
533 | "text/plain": [
534 | ""
535 | ]
536 | },
537 | "metadata": {},
538 | "output_type": "display_data"
539 | }
540 | ],
541 | "source": [
542 | "%matplotlib inline\n",
543 | "fig = plt.figure(figsize=(16,5))\n",
544 | "plt.scatter(sample['crow']/1000, sample['goog']/1000,alpha=0.1)\n",
545 | "plt.xlabel('Haversine Distance (km)')\n",
546 | "plt.ylabel('Google Driving Distance (km)')\n",
547 | "xy = [0,6.500]\n",
548 | "plt.plot(xy,xy,'k')\n",
549 | "ax = fig.add_subplot(111)\n",
550 | "plt.xlim([0, 7]); plt.ylim([0, 40])\n",
551 | "params = {'width':2,\n",
552 | " 'shrink': 0.05,\n",
553 | " 'facecolor':'black'}\n",
554 | "ax.annotate('y=x', xy=(5.8,5.8), xytext=(6,1), arrowprops=params)"
555 | ]
556 | },
557 | {
558 | "cell_type": "markdown",
559 | "metadata": {},
560 | "source": [
561 | "Let's look at a couple of examples for the large deltas."
562 | ]
563 | },
564 | {
565 | "cell_type": "code",
566 | "execution_count": 15,
567 | "metadata": {},
568 | "outputs": [
569 | {
570 | "data": {
571 | "text/html": [
572 | "<div>\n",
573 | "<table border=\"1\" class=\"dataframe\">\n",
574 | "  <thead>\n",
575 | "    <tr style=\"text-align: right;\"><th></th><th>origin</th><th>dest</th><th>n_neighbor</th><th>crow</th><th>goog</th><th>delta</th></tr>\n",
576 | "  </thead>\n",
577 | "  <tbody>\n",
578 | "    <tr><th>2275</th><td>2757</td><td>3031</td><td>17</td><td>2672.684762</td><td>31016</td><td>28343.315238</td></tr>\n",
579 | "    <tr><th>1969</th><td>2267</td><td>2497</td><td>7</td><td>1741.407797</td><td>30912</td><td>29170.592203</td></tr>\n",
580 | "    <tr><th>2087</th><td>2499</td><td>2916</td><td>15</td><td>1441.316883</td><td>32476</td><td>31034.683117</td></tr>\n",
581 | "    <tr><th>514</th><td>454</td><td>2491</td><td>17</td><td>2861.271057</td><td>36536</td><td>33674.728943</td></tr>\n",
582 | "    <tr><th>1232</th><td>1203</td><td>2417</td><td>13</td><td>2097.580642</td><td>38791</td><td>36693.419358</td></tr>\n",
583 | "  </tbody>\n",
584 | "</table>\n",
585 | "</div>"
647 | ],
648 | "text/plain": [
649 | " origin dest n_neighbor crow goog delta\n",
650 | "2275 2757 3031 17 2672.684762 31016 28343.315238\n",
651 | "1969 2267 2497 7 1741.407797 30912 29170.592203\n",
652 | "2087 2499 2916 15 1441.316883 32476 31034.683117\n",
653 | "514 454 2491 17 2861.271057 36536 33674.728943\n",
654 | "1232 1203 2417 13 2097.580642 38791 36693.419358"
655 | ]
656 | },
657 | "execution_count": 15,
658 | "metadata": {},
659 | "output_type": "execute_result"
660 | }
661 | ],
662 | "source": [
663 | "sample.tail()"
664 | ]
665 | },
666 | {
667 | "cell_type": "markdown",
668 | "metadata": {},
669 | "source": [
670 | "## Example 1: Freeway Segment"
671 | ]
672 | },
673 | {
674 | "cell_type": "code",
675 | "execution_count": 16,
676 | "metadata": {},
677 | "outputs": [
678 | {
679 | "name": "stdout",
680 | "output_type": "stream",
681 | "text": [
682 | "[42.25180422900007, -73.03956206299995] [42.23796716600003, -73.05688164999998]\n"
683 | ]
684 | }
685 | ],
686 | "source": [
687 | "print(XY[1203], XY[2417])"
688 | ]
689 | },
690 | {
691 | "cell_type": "markdown",
692 | "metadata": {},
693 | "source": [
694 | "Google Maps directions for this O-D pair. The issue is that the destination segment is eastbound, so the route has to find the nearest on-ramp in the eastbound (EB) direction."
695 | ]
696 | },
697 | {
698 | "cell_type": "markdown",
699 | "metadata": {},
700 | "source": [
701 | "![Google Maps directions for Example 1](img/distance_comparison1.png)"
702 | ]
703 | },
704 | {
705 | "cell_type": "markdown",
706 | "metadata": {},
707 | "source": [
708 | "## Example 2: Another Freeway Segment"
709 | ]
710 | },
711 | {
712 | "cell_type": "code",
713 | "execution_count": 17,
714 | "metadata": {
715 | "scrolled": true
716 | },
717 | "outputs": [
718 | {
719 | "name": "stdout",
720 | "output_type": "stream",
721 | "text": [
722 | "[42.34144445000004, -73.39379061699998] [42.35252574200007, -73.38469198599995]\n"
723 | ]
724 | }
725 | ],
726 | "source": [
727 | "print(XY[2499], XY[2916])"
728 | ]
729 | },
730 | {
731 | "cell_type": "markdown",
732 | "metadata": {},
733 | "source": [
734 | "This time, there is no nearby off-ramp to the destination so a circuitous route is required."
735 | ]
736 | },
737 | {
738 | "cell_type": "markdown",
739 | "metadata": {},
740 | "source": [
741 | "![Google Maps directions for Example 2](img/distance_comparison2.png)"
742 | ]
743 | },
744 | {
745 | "cell_type": "markdown",
746 | "metadata": {},
747 | "source": [
748 | "The common theme appears to be that if either end of an O-D pair is on a freeway segment, the driving distance is likely to be considerably longer than the haversine distance. Note also that reversing the direction of travel would give a substantially different answer."
749 | ]
750 | },
751 | {
752 | "cell_type": "markdown",
753 | "metadata": {},
754 | "source": [
755 | "What if we removed the road segments with an FCC (`FUNCCODE` column) value of 1 or 2 from the dataset?"
756 | ]
757 | },
758 | {
759 | "cell_type": "code",
760 | "execution_count": 18,
761 | "metadata": {},
762 | "outputs": [],
763 | "source": [
764 | "freeways = county[county['FUNCCODE'].isin(['1','2'])]\n",
765 | "idx = set(freeways.index)\n",
766 | "tf = (sample['origin'].isin(idx)) | (sample['dest'].isin(idx))\n",
767 | "F36 = sample[~tf]"
768 | ]
769 | },
770 | {
771 | "cell_type": "markdown",
772 | "metadata": {},
773 | "source": [
774 | "All of the summary statistics except the minimum have decreased, as expected."
775 | ]
776 | },
777 | {
778 | "cell_type": "code",
779 | "execution_count": 20,
780 | "metadata": {},
781 | "outputs": [
782 | {
783 | "data": {
784 | "text/plain": [
785 | "count 2247.000000\n",
786 | "mean 419.908335\n",
787 | "std 1477.425090\n",
788 | "min -4.525865\n",
789 | "50% 67.556762\n",
790 | "75% 301.162028\n",
791 | "90% 972.725745\n",
792 | "95% 1881.109268\n",
793 | "99% 5113.434584\n",
794 | "max 26723.747754\n",
795 | "Name: delta, dtype: float64"
796 | ]
797 | },
798 | "execution_count": 20,
799 | "metadata": {},
800 | "output_type": "execute_result"
801 | }
802 | ],
803 | "source": [
804 | "F36['delta'].describe([0.75,0.9,0.95,0.99])"
805 | ]
806 | },
807 | {
808 | "cell_type": "markdown",
809 | "metadata": {},
810 | "source": [
811 | "## Figure 2: Google vs Haversine excluding FCC codes 1 and 2"
812 | ]
813 | },
814 | {
815 | "cell_type": "code",
816 | "execution_count": 19,
817 | "metadata": {
818 | "scrolled": false
819 | },
820 | "outputs": [
821 | {
822 | "name": "stderr",
823 | "output_type": "stream",
824 | "text": [
825 | "C:\\Users\\caoa\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\matplotlib\\cbook\\deprecation.py:106: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance. In a future version, a new instance will always be created and returned. Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.\n",
826 | " warnings.warn(message, mplDeprecation, stacklevel=1)\n"
827 | ]
828 | },
829 | {
830 | "data": {
831 | "text/plain": [
832 | "Text(6,1,'y=x')"
833 | ]
834 | },
835 | "execution_count": 19,
836 | "metadata": {},
837 | "output_type": "execute_result"
838 | },
839 | {
840 | "data": {
841 | "image/png": "\n",
842 | "text/plain": [
843 | ""
844 | ]
845 | },
846 | "metadata": {},
847 | "output_type": "display_data"
848 | }
849 | ],
850 | "source": [
851 | "fig = plt.figure(figsize=(16,5))\n",
852 | "plt.scatter(F36['crow']/1000, F36['goog']/1000,alpha=0.1)\n",
853 | "plt.xlabel('Haversine Distance (km)')\n",
854 | "plt.ylabel('Google Driving Distance (km)')\n",
855 | "xy = [0,6.500]\n",
856 | "plt.plot(xy,xy,'k')\n",
857 | "ax = fig.add_subplot(111)\n",
858 | "plt.xlim([0, 7]); plt.ylim([0, 40])\n",
859 | "params = {'width':2,\n",
860 | " 'shrink': 0.05,\n",
861 | " 'facecolor':'black'}\n",
862 | "ax.annotate('y=x', xy=(5.8,5.8), xytext=(6,1), arrowprops=params)"
863 | ]
864 | },
865 | {
866 | "cell_type": "markdown",
867 | "metadata": {},
868 | "source": [
869 | "Let's see what's going on with our worst offenders again. There seem to be two classes of discrepancies, judging by the figure and the columns `delta` and `n_neighbor`. The bottom five rows fall into one class and the five rows above them fall into a different class."
870 | ]
871 | },
872 | {
873 | "cell_type": "code",
874 | "execution_count": 21,
875 | "metadata": {},
876 | "outputs": [
877 | {
878 | "data": {
879 | "text/html": [
880 | "<div>\n",
881 | "<table border=\"1\" class=\"dataframe\">\n",
882 | "  <thead>\n",
883 | "    <tr style=\"text-align: right;\"><th></th><th>origin</th><th>dest</th><th>n_neighbor</th><th>crow</th><th>goog</th><th>delta</th></tr>\n",
884 | "  </thead>\n",
885 | "  <tbody>\n",
886 | "    <tr><th>1226</th><td>1190</td><td>2760</td><td>19</td><td>5331.533066</td><td>14695</td><td>9363.466934</td></tr>\n",
887 | "    <tr><th>393</th><td>336</td><td>2593</td><td>18</td><td>5696.087499</td><td>15829</td><td>10132.912501</td></tr>\n",
888 | "    <tr><th>488</th><td>439</td><td>2592</td><td>16</td><td>5637.509696</td><td>15922</td><td>10284.490304</td></tr>\n",
889 | "    <tr><th>753</th><td>628</td><td>814</td><td>14</td><td>6041.023157</td><td>17285</td><td>11243.976843</td></tr>\n",
890 | "    <tr><th>754</th><td>629</td><td>814</td><td>14</td><td>5988.336124</td><td>17358</td><td>11369.663876</td></tr>\n",
891 | "    <tr><th>1896</th><td>2127</td><td>3124</td><td>8</td><td>2834.782151</td><td>26001</td><td>23166.217849</td></tr>\n",
892 | "    <tr><th>1816</th><td>1961</td><td>2811</td><td>9</td><td>2296.415740</td><td>26673</td><td>24376.584260</td></tr>\n",
893 | "    <tr><th>1897</th><td>2128</td><td>2652</td><td>7</td><td>2035.037385</td><td>26604</td><td>24568.962615</td></tr>\n",
894 | "    <tr><th>1332</th><td>1346</td><td>3171</td><td>7</td><td>300.541766</td><td>26577</td><td>26276.458234</td></tr>\n",
895 | "    <tr><th>44</th><td>46</td><td>3171</td><td>2</td><td>161.252246</td><td>26885</td><td>26723.747754</td></tr>\n",
896 | "  </tbody>\n",
897 | "</table>\n",
898 | "</div>"
1000 | ],
1001 | "text/plain": [
1002 | " origin dest n_neighbor crow goog delta\n",
1003 | "1226 1190 2760 19 5331.533066 14695 9363.466934\n",
1004 | "393 336 2593 18 5696.087499 15829 10132.912501\n",
1005 | "488 439 2592 16 5637.509696 15922 10284.490304\n",
1006 | "753 628 814 14 6041.023157 17285 11243.976843\n",
1007 | "754 629 814 14 5988.336124 17358 11369.663876\n",
1008 | "1896 2127 3124 8 2834.782151 26001 23166.217849\n",
1009 | "1816 1961 2811 9 2296.415740 26673 24376.584260\n",
1010 | "1897 2128 2652 7 2035.037385 26604 24568.962615\n",
1011 | "1332 1346 3171 7 300.541766 26577 26276.458234\n",
1012 | "44 46 3171 2 161.252246 26885 26723.747754"
1013 | ]
1014 | },
1015 | "execution_count": 21,
1016 | "metadata": {},
1017 | "output_type": "execute_result"
1018 | }
1019 | ],
1020 | "source": [
1021 | "F36.tail(10)"
1022 | ]
1023 | },
1024 | {
1025 | "cell_type": "markdown",
1026 | "metadata": {},
1027 | "source": [
1028 | "## Example 3: Freeway Segment"
1029 | ]
1030 | },
1031 | {
1032 | "cell_type": "code",
1033 | "execution_count": 22,
1034 | "metadata": {},
1035 | "outputs": [
1036 | {
1037 | "name": "stdout",
1038 | "output_type": "stream",
1039 | "text": [
1040 | "[42.328109753000035, -73.36298205699995] [42.32730243800006, -73.36461153199997]\n"
1041 | ]
1042 | }
1043 | ],
1044 | "source": [
1045 | "print(XY[46], XY[3171])"
1046 | ]
1047 | },
1048 | {
1049 | "cell_type": "markdown",
1050 | "metadata": {},
1051 | "source": [
1052 | "The problem is that the destination is still on a freeway segment and heading westbound but before the on-ramp to the highway."
1053 | ]
1054 | },
1055 | {
1056 | "cell_type": "markdown",
1057 | "metadata": {},
1058 | "source": [
1059 | "![Google Maps directions for Example 3](img/distance_comparison3.png)"
1060 | ]
1061 | },
1062 | {
1063 | "cell_type": "markdown",
1064 | "metadata": {},
1065 | "source": [
1066 | "Let's look at the FUNCCODE of the destination to make sure we're not hallucinating. It has code 5."
1067 | ]
1068 | },
1069 | {
1070 | "cell_type": "code",
1071 | "execution_count": 23,
1072 | "metadata": {},
1073 | "outputs": [
1074 | {
1075 | "data": {
1076 | "text/plain": [
1077 | "FUNCCODE 5\n",
1078 | "pt (-73.36461153199997, 42.32730243800006)\n",
1079 | "Name: 3171, dtype: object"
1080 | ]
1081 | },
1082 | "execution_count": 23,
1083 | "metadata": {},
1084 | "output_type": "execute_result"
1085 | }
1086 | ],
1087 | "source": [
1088 | "county.loc[3171,['FUNCCODE','pt'] ]"
1089 | ]
1090 | },
1091 | {
1092 | "cell_type": "markdown",
1093 | "metadata": {},
1094 | "source": [
1095 | "Not sure if we can do anything to avoid this situation. If we reverse directions, then the distance would definitely be a lot shorter."
1096 | ]
1097 | },
1098 | {
1099 | "cell_type": "markdown",
1100 | "metadata": {},
1101 | "source": [
1102 | "## Example 4: Shortest Time Route"
1103 | ]
1104 | },
1105 | {
1106 | "cell_type": "code",
1107 | "execution_count": 24,
1108 | "metadata": {},
1109 | "outputs": [
1110 | {
1111 | "name": "stdout",
1112 | "output_type": "stream",
1113 | "text": [
1114 | "[42.11620711100005, -73.49637794099993] [42.154813696000076, -73.44574458599999]\n"
1115 | ]
1116 | }
1117 | ],
1118 | "source": [
1119 | "print(XY[629], XY[814])"
1120 | ]
1121 | },
1122 | {
1123 | "cell_type": "markdown",
1124 | "metadata": {},
1125 | "source": [
1126 | "This seems to be a function of the road network but also of Google's preference for the shortest travel time. There is a shorter-distance route that it could have given. Unfortunately, there is no way to request the shortest-distance route in the call to the API."
1127 | ]
1128 | },
1129 | {
1130 | "cell_type": "markdown",
1131 | "metadata": {},
1132 | "source": [
1133 | "![Google Maps directions for Example 4](img/distance_comparison4.png)"
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "markdown",
1138 | "metadata": {},
1139 | "source": [
1140 | "## Example 5: Another Shortest Time Route"
1141 | ]
1142 | },
1143 | {
1144 | "cell_type": "code",
1145 | "execution_count": 25,
1146 | "metadata": {
1147 | "scrolled": true
1148 | },
1149 | "outputs": [
1150 | {
1151 | "name": "stdout",
1152 | "output_type": "stream",
1153 | "text": [
1154 | "[42.048620276000065, -73.12225781699993] [42.08375019500005, -73.07301614799997]\n"
1155 | ]
1156 | }
1157 | ],
1158 | "source": [
1159 | "print(XY[439], XY[2592])"
1160 | ]
1161 | },
1162 | {
1163 | "cell_type": "markdown",
1164 | "metadata": {},
1165 | "source": [
1166 | "Same problem here. Google chose the shortest time path instead of the shortest distance path. "
1167 | ]
1168 | },
1169 | {
1170 | "cell_type": "markdown",
1171 | "metadata": {},
1172 | "source": [
1173 | "![Google Maps directions for Example 5](img/distance_comparison5.png)"
1174 | ]
1175 | },
1176 | {
1177 | "cell_type": "markdown",
1178 | "metadata": {},
1179 | "source": [
1180 | "# Summary\n",
1181 | "\n",
1182 | "Differences in distance are attributable to the following factors:\n",
1183 | "1. O-D pair contains a freeway segment.\n",
1184 | "2. Google API chooses the shortest time path by default rather than the shortest distance path."
1185 | ]
1186 | },
1187 | {
1188 | "cell_type": "code",
1189 | "execution_count": null,
1190 | "metadata": {},
1191 | "outputs": [],
1192 | "source": []
1193 | }
1194 | ],
1195 | "metadata": {
1196 | "kernelspec": {
1197 | "display_name": "Python 3",
1198 | "language": "python",
1199 | "name": "python3"
1200 | },
1201 | "language_info": {
1202 | "codemirror_mode": {
1203 | "name": "ipython",
1204 | "version": 3
1205 | },
1206 | "file_extension": ".py",
1207 | "mimetype": "text/x-python",
1208 | "name": "python",
1209 | "nbconvert_exporter": "python",
1210 | "pygments_lexer": "ipython3",
1211 | "version": "3.6.4"
1212 | }
1213 | },
1214 | "nbformat": 4,
1215 | "nbformat_minor": 2
1216 | }
1217 |
--------------------------------------------------------------------------------