├── Pulling Data from OSMnx - Demonstration.ipynb
├── README.md
└── airbnb_rooms.csv
/Pulling Data from OSMnx - Demonstration.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 1. Intro\n",
8 | "This notebook demonstrates how to gather local amenity data from OpenStreetMap. Local amenity data of this sort could be used to improve the accuracy of a whole range of ML models. In this notebook, the idea is that the presence of these amenities could help us predict what an AirBnb host might charge.\n",
9 | "\n",
10 | "In this notebook we cover the following:\n",
11 | "\n",
12 | "- How to make basic requests from OpenStreetMap using OSMnx\n",
13 | "- How to efficiently calculate the number of amenities that fall within a 10-minute walk of each AirBnb (approximately 1km distance)."
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "# 2. Accessing OSM with OSMnx"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 1,
26 | "metadata": {},
27 | "outputs": [],
28 | "source": [
29 | "# Import visualisation modules\n",
30 | "import matplotlib as mpl \n",
31 | "%matplotlib inline \n",
32 | "import matplotlib.pyplot as plt \n",
33 | "\n",
34 | "#Import modules\n",
35 | "import osmnx as ox\n",
36 | "import pandas as pd\n",
37 | "import geopandas as gpd\n",
38 | "import numpy as np\n",
39 | "\n",
40 | "import warnings \n",
41 | "warnings.simplefilter(action='ignore')"
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "### Basic OSMnx Query\n",
49 | "All queries use the dictionary format {'feature_type' : 'feature'}. A full list of OpenStreetMap features can be found [here](https://wiki.openstreetmap.org/wiki/Map_Features). Note that the quality and accuracy of features may vary. OSMnx returns a GeoPandas GeoDataFrame (essentially a pandas DataFrame with a geometry column holding each feature's coordinates). It may take a few minutes to download the data."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 2,
55 | "metadata": {},
56 | "outputs": [
57 | {
58 | "data": {
59 | "text/html": [
60 | "<div>\n",
61 | "<table>\n",
62 | "<thead><tr><th></th><th>unique_id</th><th>osmid</th><th>element_type</th><th>amenity</th><th>name</th><th>geometry</th><th>addr:city</th><th>addr:housename</th><th>addr:housenumber</th><th>addr:postcode</th><th>...</th><th>amenity:disused</th><th>price</th><th>name:fa</th><th>name:ko</th><th>photo</th><th>craft_beer</th><th>payment:contactless</th><th>old_old_name</th><th>terrace</th><th>ways</th></tr></thead>\n",
63 | "<tbody>\n",
64 | "<tr><th>0</th><td>node/451153</td><td>451153</td><td>node</td><td>restaurant</td><td>Central Restaurant</td><td>POINT (-0.19350 51.60203)</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
65 | "<tr><th>1</th><td>node/26544484</td><td>26544484</td><td>node</td><td>restaurant</td><td>Casuarina Tree</td><td>POINT (-0.17223 51.39801)</td><td>Mitcham</td><td>The Crown Inn</td><td>407</td><td>CR4 4BG</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
66 | "<tr><th>2</th><td>node/26604024</td><td>26604024</td><td>node</td><td>restaurant</td><td>Jin Li</td><td>POINT (-0.45855 51.52573)</td><td>Uxbridge</td><td>NaN</td><td>91</td><td>UB8 3NJ</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
67 | "<tr><th>3</th><td>node/26845558</td><td>26845558</td><td>node</td><td>restaurant</td><td>Old Tree Daiwan Bee</td><td>POINT (-0.13256 51.51105)</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
68 | "<tr><th>4</th><td>node/31098623</td><td>31098623</td><td>node</td><td>restaurant</td><td>The Unicorn</td><td>POINT (0.20053 51.58686)</td><td>Romford</td><td>NaN</td><td>91</td><td>RM2 5EL</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
69 | "</tbody>\n",
70 | "</table>\n",
71 | "<p>5 rows × 302 columns</p>\n",
72 | "</div>"
226 | ],
227 | "text/plain": [
228 | " unique_id osmid element_type amenity name \\\n",
229 | "0 node/451153 451153 node restaurant Central Restaurant \n",
230 | "1 node/26544484 26544484 node restaurant Casuarina Tree \n",
231 | "2 node/26604024 26604024 node restaurant Jin Li \n",
232 | "3 node/26845558 26845558 node restaurant Old Tree Daiwan Bee \n",
233 | "4 node/31098623 31098623 node restaurant The Unicorn \n",
234 | "\n",
235 | " geometry addr:city addr:housename addr:housenumber \\\n",
236 | "0 POINT (-0.19350 51.60203) NaN NaN NaN \n",
237 | "1 POINT (-0.17223 51.39801) Mitcham The Crown Inn 407 \n",
238 | "2 POINT (-0.45855 51.52573) Uxbridge NaN 91 \n",
239 | "3 POINT (-0.13256 51.51105) NaN NaN NaN \n",
240 | "4 POINT (0.20053 51.58686) Romford NaN 91 \n",
241 | "\n",
242 | " addr:postcode ... amenity:disused price name:fa name:ko photo craft_beer \\\n",
243 | "0 NaN ... NaN NaN NaN NaN NaN NaN \n",
244 | "1 CR4 4BG ... NaN NaN NaN NaN NaN NaN \n",
245 | "2 UB8 3NJ ... NaN NaN NaN NaN NaN NaN \n",
246 | "3 NaN ... NaN NaN NaN NaN NaN NaN \n",
247 | "4 RM2 5EL ... NaN NaN NaN NaN NaN NaN \n",
248 | "\n",
249 | " payment:contactless old_old_name terrace ways \n",
250 | "0 NaN NaN NaN NaN \n",
251 | "1 NaN NaN NaN NaN \n",
252 | "2 NaN NaN NaN NaN \n",
253 | "3 NaN NaN NaN NaN \n",
254 | "4 NaN NaN NaN NaN \n",
255 | "\n",
256 | "[5 rows x 302 columns]"
257 | ]
258 | },
259 | "execution_count": 2,
260 | "metadata": {},
261 | "output_type": "execute_result"
262 | }
263 | ],
264 | "source": [
265 | "# Set up query\n",
266 | "query = {'amenity':'restaurant'}\n",
267 | "\n",
268 | "# Run query (pois_from_place is the pre-1.0 OSMnx API; newer releases provide features_from_place instead)\n",
269 | "restaurants_gdf = ox.pois.pois_from_place(\n",
270 | " 'Greater London, UK',\n",
271 | " tags = query,\n",
272 | " which_result=1)\n",
273 | "\n",
274 | "restaurants_gdf.head(5)"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "## Visualise the Results"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": 3,
287 | "metadata": {},
288 | "outputs": [
289 | {
290 | "data": {
291 | "text/plain": [
292 | ""
293 | ]
294 | },
295 | "execution_count": 3,
296 | "metadata": {},
297 | "output_type": "execute_result"
298 | },
299 | {
300 | "data": {
301 | "image/png": "\n",
302 | "text/plain": [
303 | ""
304 | ]
305 | },
306 | "metadata": {
307 | "needs_background": "light"
308 | },
309 | "output_type": "display_data"
310 | }
311 | ],
312 | "source": [
313 | "# Download London's Boundary\n",
314 | "london_gdf = ox.geocoder.geocode_to_gdf('Greater London, UK')\n",
315 | "\n",
316 | "# Set up a plot axis\n",
317 | "fig, ax = plt.subplots(figsize = (15,10))\n",
318 | "\n",
319 | "# Visualise both on the plot\n",
320 | "london_gdf.plot(ax = ax, alpha = 0.5)\n",
321 | "restaurants_gdf.plot(ax = ax, markersize = 1, color = 'red', alpha = 0.8, label = 'Restaurant \\nLocations')\n",
322 | "plt.legend()"
323 | ]
324 | },
325 | {
326 | "cell_type": "markdown",
327 | "metadata": {},
328 | "source": [
329 | "# Assembling Data"
330 | ]
331 | },
332 | {
333 | "cell_type": "markdown",
334 | "metadata": {},
335 | "source": [
336 | "OSMnx returns most restaurants as a single point (i.e. a longitude/latitude coordinate pair). However, a few are returned as polygons (shapes), which usually happens when the property is particularly large. Working with polygons is a lot more complicated than working with points, so below we replace each polygon with its centre point (centroid). We do this using Shapely."
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 4,
342 | "metadata": {},
343 | "outputs": [],
344 | "source": [
345 | "from shapely.geometry.polygon import Polygon\n",
346 | "from shapely.geometry.multipolygon import MultiPolygon\n",
347 | "\n",
348 | "restaurants_gdf['geometry'] = restaurants_gdf['geometry'].apply(\n",
349 | "    # Replace polygons and multipolygons with their centroid; leave points untouched\n",
350 | "    lambda x: x.centroid if isinstance(x, (Polygon, MultiPolygon)) else x\n",
351 | ")"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {},
357 | "source": [
358 | "### Converting to a Local Projection\n",
359 | "Before we carry out any calculations we need to convert our point coordinates to a local projection. The earth is roughly spherical, and a projection is a method of flattening its surface so we can display it on a map. The problem, however, is that there is no way to flatten the surface of a sphere without distorting distances or areas somewhere.\n",
360 | "\n",
361 | "The way around this is to use a projection that keeps distortion small in the specific part of the world you are interested in. Universal Transverse Mercator (UTM) zones are projected Coordinate Reference Systems (CRS) of this kind, with coordinates measured in metres. Fortunately, OSMnx has a method built into it that allows us to find the correct local UTM zone."
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": 5,
367 | "metadata": {},
368 | "outputs": [
369 | {
370 | "data": {
371 | "text/plain": [
372 | "\n",
373 | "Name: unknown\n",
374 | "Axis Info [cartesian]:\n",
375 | "- E[east]: Easting (metre)\n",
376 | "- N[north]: Northing (metre)\n",
377 | "Area of Use:\n",
378 | "- undefined\n",
379 | "Coordinate Operation:\n",
380 | "- name: UTM zone 30N\n",
381 | "- method: Transverse Mercator\n",
382 | "Datum: World Geodetic System 1984\n",
383 | "- Ellipsoid: WGS 84\n",
384 | "- Prime Meridian: Greenwich"
385 | ]
386 | },
387 | "execution_count": 5,
388 | "metadata": {},
389 | "output_type": "execute_result"
390 | }
391 | ],
392 | "source": [
393 | "def get_local_crs(y, x):\n",
394 | "    bbox = ox.utils_geo.bbox_from_point((y, x), dist = 500, project_utm = True, return_crs = True)\n",
395 | "    return bbox[-1]  # with return_crs = True, the CRS is the last element of the returned tuple\n",
396 | "\n",
397 | "lon_latitude = 51.509865\n",
398 | "lon_longitude = -0.118092\n",
399 | "local_utm_crs = get_local_crs(lon_latitude, lon_longitude)\n",
400 | "\n",
401 | "local_utm_crs"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "Next we import our AirBnb room data and convert it into a Geographic dataframe."
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "execution_count": 6,
414 | "metadata": {},
415 | "outputs": [],
416 | "source": [
417 | "# Load the AirBnb room data\n",
418 | "air_df = pd.read_csv('airbnb_rooms.csv')\n",
419 | "\n",
420 | "# Note below: \"crs = 4326\" is our way of telling geopandas that the initial projection uses the standard\n",
421 | "# longitude latitude coordinates. You can't manipulate the CRS if you haven't set one initially.\n",
422 | "\n",
423 | "air_gdf = gpd.GeoDataFrame(air_df, geometry = gpd.points_from_xy(air_df.longitude, air_df.latitude), crs = 4326)\n",
424 | "air_gdf = air_gdf.to_crs(local_utm_crs)\n",
425 | "\n",
426 | "#Convert amenities into local projection (amenities already had an initial CRS set when we downloaded it via OSMnx)\n",
427 | "restaurants_gdf = restaurants_gdf.to_crs(local_utm_crs)"
428 | ]
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {},
433 | "source": [
434 | "### Calculating Distance Using a KDTree\n",
435 | "Next, we need to iterate through each AirBnb property and work out how many restaurants there are within a 10-minute walk (approximately 1km).\n",
436 | "I do this using a K-D Tree. Explaining how K-D Trees work is outside the scope of this notebook, but in short, they're an efficient way of searching through our 80,000 AirBnb rooms and 6,000 restaurants and figuring out which ones are close to which. First we set up the tree of all restaurant points."
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "execution_count": 30,
442 | "metadata": {},
443 | "outputs": [],
444 | "source": [
445 | "import time\n",
446 | "from scipy import spatial\n",
447 | "from scipy.spatial import KDTree\n",
448 | "\n",
449 | "# Extract the projected x/y coordinates (in metres) as arrays for SciPy\n",
450 | "Lon = restaurants_gdf.geometry.apply(lambda x: x.x).values\n",
451 | "Lat = restaurants_gdf.geometry.apply(lambda x: x.y).values\n",
452 | "coords = list(zip(Lat,Lon))\n",
453 | "\n",
454 | "tree = spatial.KDTree(coords) # Create a KDTree of all restaurant points"
455 | ]
456 | },
457 | {
458 | "cell_type": "markdown",
459 | "metadata": {},
460 | "source": [
461 | "Then we create a function which we will run on each of our AirBnb rooms. The function queries the tree for up to 500 of the closest restaurants within the maximum distance of the AirBnb property and counts them. We use a figure of 500 in the hope that no property has more than 500 restaurants close to it; if one did, its count would be capped at 500."
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": 31,
467 | "metadata": {},
468 | "outputs": [],
469 | "source": [
470 | "def find_points_closeby(lat_lon, k = 500, max_distance = 1000):\n",
471 | "    '''\n",
472 | "    Queries the pre-built K-D tree and returns the number of points within\n",
473 | "    max_distance of the given point (capped at k).\n",
474 | "    lat_lon: The query point as a (y, x) tuple in the same projected CRS as the tree.\n",
475 | "    k: The maximum number of closest points to query\n",
476 | "    max_distance: The maximum distance (in metres)\n",
477 | "    '''\n",
478 | "    # tree.query returns (distances, indices); unfilled slots have distance np.inf\n",
479 | "    results = tree.query((lat_lon), k = k, distance_upper_bound= max_distance)\n",
480 | "    zipped_results = list(zip(results[0], results[1]))\n",
481 | "    zipped_results = [i for i in zipped_results if i[0] != np.inf]\n",
482 | "\n",
483 | "    return len(zipped_results)"
484 | ]
485 | },
486 | {
487 | "cell_type": "markdown",
488 | "metadata": {},
489 | "source": [
490 | "Finally, we set up a timer and apply the function to each AirBnb row."
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": 32,
496 | "metadata": {},
497 | "outputs": [
498 | {
499 | "name": "stdout",
500 | "output_type": "stream",
501 | "text": [
502 | "Completed in 95.89 seconds\n"
503 | ]
504 | }
505 | ],
506 | "source": [
507 | "# Set up a timer \n",
508 | "import time \n",
509 | "t0 = time.time()\n",
510 | "\n",
511 | "#Apply the function\n",
512 | "air_gdf['restaurants'] = air_gdf.apply(lambda row: find_points_closeby(\n",
513 | " (row.geometry.y, row.geometry.x)) , axis = 1)\n",
514 | "\n",
515 | "# Report the time\n",
516 | "time_passed = round(time.time() - t0, 2)\n",
517 | "print (\"Completed in %s seconds\" % (time_passed))"
518 | ]
519 | },
520 | {
521 | "cell_type": "markdown",
522 | "metadata": {},
523 | "source": [
524 | "You now know how many restaurants there are within a 10-minute walk of each AirBnb property. You could repeat this process for bars, shops, subway stations, tourist hotspots, public parks, and whatever else you think may influence the price of an AirBnb property."
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": 36,
530 | "metadata": {},
531 | "outputs": [
532 | {
533 | "data": {
534 | "text/html": [
535 | "<div>\n",
536 | "<table>\n",
537 | "<thead><tr><th></th><th>id</th><th>restaurants</th></tr></thead>\n",
538 | "<tbody>\n",
539 | "<tr><th>0</th><td>13913</td><td>39</td></tr>\n",
540 | "<tr><th>1</th><td>15400</td><td>108</td></tr>\n",
541 | "<tr><th>2</th><td>17402</td><td>335</td></tr>\n",
542 | "<tr><th>3</th><td>25023</td><td>1</td></tr>\n",
543 | "<tr><th>4</th><td>25123</td><td>6</td></tr>\n",
544 | "</tbody>\n",
545 | "</table>\n",
546 | "</div>"
586 | ],
587 | "text/plain": [
588 | " id restaurants\n",
589 | "0 13913 39\n",
590 | "1 15400 108\n",
591 | "2 17402 335\n",
592 | "3 25023 1\n",
593 | "4 25123 6"
594 | ]
595 | },
596 | "execution_count": 36,
597 | "metadata": {},
598 | "output_type": "execute_result"
599 | }
600 | ],
601 | "source": [
602 | "air_gdf[['id','restaurants']].head(5)"
603 | ]
604 | }
605 | ],
606 | "metadata": {
607 | "kernelspec": {
608 | "display_name": "Python 3",
609 | "language": "python",
610 | "name": "python3"
611 | },
612 | "language_info": {
613 | "codemirror_mode": {
614 | "name": "ipython",
615 | "version": 3
616 | },
617 | "file_extension": ".py",
618 | "mimetype": "text/x-python",
619 | "name": "python",
620 | "nbconvert_exporter": "python",
621 | "pygments_lexer": "ipython3",
622 | "version": "3.8.3"
623 | }
624 | },
625 | "nbformat": 4,
626 | "nbformat_minor": 4
627 | }
628 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Accessing-OpenStreetMap-Data
2 | A Jupyter project that demonstrates how to access local data from OpenStreetMap to improve your ML models. Demonstrates the use of K-D Trees to turn data into features.
3 |
4 |
5 | This Jupyter project demonstrates how to gather local amenity data from OpenStreetMap. Local amenity data of this sort could be used to improve the accuracy of a whole range of ML models. If, for example, you wanted to predict the price an AirBnb host might charge in parts of London, you might want to know something about the local amenities that exist within the area.
6 |
7 | In this notebook we cover the following:
8 |
9 | - How to make basic requests from OpenStreetMap using OSMnx
10 | - Converting geographic projections.
11 | - Using K-D Trees to efficiently calculate the number of amenities that fall within a 10-minute walk of each AirBnb (approximately 1km distance).
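
As a rough illustration of the K-D Tree step, here is a minimal sketch that counts amenities within a fixed distance of each room. It assumes the coordinates have already been converted to a projected CRS in metres. The function name `count_within` and the sample points are hypothetical and are not taken from the notebook or from `airbnb_rooms.csv`. Unlike the notebook, which applies a function row by row, this sketch queries all rooms in one call.

```python
import numpy as np
from scipy.spatial import KDTree


def count_within(amenity_xy, room_xy, max_distance=1000.0, k=500):
    """Count amenities within max_distance of each room (capped at k).

    Both inputs are (n, 2) arrays of projected coordinates in metres.
    """
    tree = KDTree(amenity_xy)
    # Never ask for more neighbours than there are points in the tree.
    k = min(k, len(amenity_xy))
    # Slots with no neighbour inside max_distance come back with distance inf.
    distances, _ = tree.query(room_xy, k=k, distance_upper_bound=max_distance)
    # With k == 1 SciPy returns a 1-D array, so normalise to (n_rooms, k).
    distances = np.asarray(distances).reshape(len(room_xy), -1)
    return np.isfinite(distances).sum(axis=1)


# Made-up coordinates in metres, for illustration only
amenities = np.array([[0.0, 0.0], [300.0, 400.0], [2000.0, 0.0], [0.0, 999.0]])
rooms = np.array([[0.0, 0.0], [5000.0, 5000.0]])

counts = count_within(amenities, rooms)
print(counts)
```

The first room has three amenities within 1000 metres and the second has none. As in the notebook, `k` acts as a cap, so a room with more than `k` amenities in range would be reported as having exactly `k`.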
12 |
--------------------------------------------------------------------------------