├── PlotEllipse ├── __init__.py ├── plot_ellipse.png ├── README.md └── plot_ellipse.py ├── .gitignore ├── PlottingWithPandas ├── README.md └── PlottingWithPandas_DatesAndBarPlots.ipynb ├── BoundedClustering └── README.md ├── README.md ├── YATS_YetAnotherTSPSolution ├── README.md └── YetAnotherTSPSolution.ipynb ├── GettingToKnowTheMelSpectrogram ├── README.md └── Haunting_song_of_humpback_whales-youtube-W5Trznre92c.wav ├── SolvingTSPUsingDynamicProgramming └── README.md └── SetTSP ├── README.md └── SetTSP.ipynb /PlotEllipse/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | */.ipynb_checkpoints/* 2 | .idea/ -------------------------------------------------------------------------------- /PlotEllipse/plot_ellipse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DalyaG/CodeSnippetsForPosterity/HEAD/PlotEllipse/plot_ellipse.png -------------------------------------------------------------------------------- /PlottingWithPandas/README.md: -------------------------------------------------------------------------------- 1 | # Plotting With Pandas - Dates and Bar Plots 2 | 3 | This notebook accompanies a 4 | [blog post]() 5 | by the same name. -------------------------------------------------------------------------------- /BoundedClustering/README.md: -------------------------------------------------------------------------------- 1 | # Bounded Clustering 2 | 3 | This notebook accompanies a 4 | [blog post](http://bit.ly/BoundedClustering) 5 | by the same name. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CodeSnippetsForPosterity 2 | My Dream is that each one of these code snippets will become a blog post. So let's take this dream one snippet at a time :) 3 | -------------------------------------------------------------------------------- /YATS_YetAnotherTSPSolution/README.md: -------------------------------------------------------------------------------- 1 | # YATS - Yet Another TSP Solution 2 | 3 | This notebook accompanies a 4 | [blog post](https://medium.com/hackernoon/yats-yet-another-tsp-solution-6a71aeabe1f8) 5 | by the same name. -------------------------------------------------------------------------------- /GettingToKnowTheMelSpectrogram/README.md: -------------------------------------------------------------------------------- 1 | # Getting to Know the Mel Spectrogram 2 | 3 | This notebook accompanies a 4 | [blog post](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0) 5 | by the same name. -------------------------------------------------------------------------------- /GettingToKnowTheMelSpectrogram/Haunting_song_of_humpback_whales-youtube-W5Trznre92c.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DalyaG/CodeSnippetsForPosterity/HEAD/GettingToKnowTheMelSpectrogram/Haunting_song_of_humpback_whales-youtube-W5Trznre92c.wav -------------------------------------------------------------------------------- /SolvingTSPUsingDynamicProgramming/README.md: -------------------------------------------------------------------------------- 1 | # Solving TSP Using Dynamic Programming 2 | 3 | This notebook accompanies a 4 | [blog post](https://towardsdatascience.com/solving-tsp-using-dynamic-programming-2c77da86610d) 5 | by the same name. 
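For readers who land here without the notebook, below is a minimal sketch of what a dynamic-programming approach to TSP can look like. It assumes the classic Held-Karp formulation for an open path (no return leg), which is not necessarily what the blog post implements. The function names `held_karp_path` and `brute_force_path` and the tiny distance matrix are hypothetical and are there for illustration only.

```python
from itertools import combinations, permutations


def held_karp_path(dist):
    """Length of the shortest open path that visits every point exactly once."""
    n = len(dist)
    # memo[(visited_bitmask, last)] = cost of the cheapest path that covers
    # exactly the points in `visited_bitmask` and ends at point `last`.
    memo = {(1 << i, i): 0 for i in range(n)}
    for size in range(2, n + 1):
        for subset in combinations(range(n), size):
            mask = sum(1 << i for i in subset)
            for last in subset:
                prev_mask = mask & ~(1 << last)
                memo[(mask, last)] = min(
                    memo[(prev_mask, prev)] + dist[prev][last]
                    for prev in subset
                    if prev != last
                )
    full = (1 << n) - 1
    return min(memo[(full, last)] for last in range(n))


def brute_force_path(dist):
    """Reference answer: try every visiting order (only feasible for tiny inputs)."""
    n = len(dist)
    return min(
        sum(dist[a][b] for a, b in zip(order[:-1], order[1:]))
        for order in permutations(range(n))
    )


# Hypothetical symmetric distance matrix, for illustration only.
example_dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(held_karp_path(example_dist), brute_force_path(example_dist))
```

The memo grows as roughly `n * 2**n` entries, so a sketch like this is only practical for a few dozen points at most; the brute-force helper is included purely as a sanity check on small inputs.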
-------------------------------------------------------------------------------- /SetTSP/README.md: -------------------------------------------------------------------------------- 1 | # Set-TSP - Because There Is More Than One Place to Get Bread 2 | 3 | This notebook accompanies a 4 | [blog post](https://towardsdatascience.com/set-tsp-because-there-is-more-than-one-place-to-get-bread-712fdb5b381) 5 | by the same name. -------------------------------------------------------------------------------- /PlotEllipse/README.md: -------------------------------------------------------------------------------- 1 | # PlotEllipse 2 | Plot an ellipse in s2map around two GPS locations. 3 | 4 | For the background story and some fun math stuff, you can read 5 | [this blog post](https://medium.freecodecamp.org/a-total-ellipse-on-the-map-9e30d5235078) 6 | on Free Code Camp. 7 | 8 | Example usage: 9 | 10 | > python plot_ellipse.py --p1_lat 32.076761 --p1_lng 34.792510 --p2_lat 32.083257 --p2_lng 34.767737 -r 3000 11 | 12 | 13 | A tab will then open in your browser with the following plot: 14 | 15 | ![When developers want to do something fun outside and they end up writing a script about it instead.](../../master/PlotEllipse/plot_ellipse.png) 16 | 17 | -------------------------------------------------------------------------------- /PlotEllipse/plot_ellipse.py: -------------------------------------------------------------------------------- 1 | 2 | import subprocess 3 | import numpy as np 4 | from haversine import haversine 5 | from optparse import OptionParser 6 | from math import sqrt, pow, atan2, degrees, pi 7 | from shapely import affinity 8 | from shapely.geometry import LineString 9 | 10 | R_EARTH = 6371000 11 | 12 | def Main(): 13 | """ 14 | Input 2 lat-lng points and a joint-radius, to draw an ellipse around these two centers. 15 | NOTE: 16 | The joint-radius should be at least as large as the distance between the two points. 
17 | 18 | Example command: 19 | 20 | python plot_ellipse.py --p1_lat 32.076761 --p1_lng 34.792510 --p2_lat 32.083257 --p2_lng 34.767737 -r 3000 21 | 22 | In detail: 23 | 1. From the lat-lng get ellipse parameters. 24 | 2. Draw ellipse around the origin (0,0) measured in meters. 25 | 3. Move this ellipse to be centered around the input centers. 26 | 4. Open browser tab with ellipse in s2map. 27 | """ 28 | parser = OptionParser() 29 | parser.add_option("--p1_lat") 30 | parser.add_option("--p1_lng") 31 | parser.add_option("--p2_lat") 32 | parser.add_option("--p2_lng") 33 | parser.add_option("-r", "--radius_in_meters") 34 | parser.add_option("-n", "--num_points") 35 | options, _ = parser.parse_args() 36 | 37 | num_points = int(options.num_points) if options.num_points is not None else 20 38 | p1_lat, p1_lng = float(options.p1_lat), float(options.p1_lng) 39 | p2_lat, p2_lng = float(options.p2_lat), float(options.p2_lng) 40 | radius_in_meters = float(options.radius_in_meters) 41 | 42 | a, b = GetEllipseAxisLengths(p1_lat, p1_lng, p2_lat, p2_lng, radius_in_meters) 43 | perimeter_points_in_meters = GetEllipsePointInMeters(a, b, num_points) 44 | points = GetEllipsePoints(p1_lat, p1_lng, p2_lat, p2_lng, perimeter_points_in_meters) 45 | OpenS2Map(points) 46 | 47 | 48 | def GetEllipseAxisLengths(p1_lat, p1_lng, p2_lat, p2_lng, radius_in_meters): 49 | d = haversine((p1_lat, p1_lng), (p2_lat, p2_lng)) * 1000.0 50 | if radius_in_meters < d: 51 | raise ValueError("Please specify a radius at least as large as the distance between the two input points.") 52 | a = radius_in_meters / 2.0 53 | b = sqrt(pow(a, 2) - pow(d / 2.0, 2)) 54 | return a, b 55 | 56 | def GetEllipsePointInMeters(a, b, num_points): 57 | """ 58 | :param a: length of the "horizontal" semi-axis in meters 59 | :param b: length of the "vertical" semi-axis in meters 60 | :param num_points: number of points to draw on each side of the ellipse 61 | :return: List of tuples of perimeter points on the ellipse, centered around (0,0), in m. 
62 | """ 63 | x_points = list(np.linspace(-a, a, num_points))[1:-1] 64 | y_points_pos = [sqrt(pow(a, 2) - pow(x, 2)) * (float(b) / float(a)) 65 | for x in x_points] 66 | y_points_neg = [-y for y in y_points_pos] 67 | 68 | perimeter_points_in_meters = [tuple([-a, 0])] + \ 69 | [tuple([x, y]) for x, y in zip(x_points, y_points_pos)] + \ 70 | [tuple([a, 0])] + \ 71 | list(reversed([tuple([x, y]) for x, y in zip(x_points, y_points_neg)])) 72 | return perimeter_points_in_meters 73 | 74 | def GetEllipsePoints(p1_lat, p1_lng, p2_lat, p2_lng, perimeter_points_in_meters): 75 | """ 76 | Enter ellipse centers in lat-lng and ellipse perimeter points around the origin (0,0), 77 | and get points on the perimeter of the ellipse around the centers in lat-lng. 78 | :param p1_lat: lat coordinates of center point 1 79 | :param p1_lng: lng coordinates of center point 1 80 | :param p2_lat: lat coordinates of center point 2 81 | :param p2_lng: lng coordinates of center point 2 82 | :param perimeter_points_in_meters: List of tuples of perimeter points on the ellipse, centered around (0,0), in m. 83 | :return: 84 | """ 85 | center_lng = (p1_lng + p2_lng) / 2.0 86 | center_lat = (p1_lat + p2_lat) / 2.0 87 | perimeter_points_in_lng_lat = [AddMetersToPoint(center_lng, center_lat, p[0], p[1]) 88 | for p in perimeter_points_in_meters] 89 | ellipse = LineString(perimeter_points_in_lng_lat) 90 | 91 | angle = degrees(atan2(p2_lat - p1_lat, p2_lng - p1_lng)) 92 | ellipse_rotated = affinity.rotate(ellipse, angle) 93 | 94 | ellipse_points_lng_lat = list(ellipse_rotated.coords) 95 | ellipse_points = [tuple([p[1], p[0]]) for p in ellipse_points_lng_lat] 96 | return ellipse_points 97 | 98 | def AddMetersToPoint(center_lng, center_lat, dx, dy): 99 | """ 100 | :param center_lng, center_lat: GPS coordinates of the center between the two input points. 
101 | :param dx: distance to add to x-axis (lng) in meters 102 | :param dy: distance to add to y-axis (lat) in meters 103 | """ 104 | new_x = center_lng + (dx / R_EARTH) * (180 / pi) / np.cos(center_lat * pi/180) 105 | new_y = center_lat + (dy / R_EARTH) * (180 / pi) 106 | return tuple([new_x, new_y]) 107 | 108 | def OpenS2Map(points): 109 | url = "http://s2map.com/#order=latlng&mode=polygon&s2=false&points={}".format(str(points).replace(" ", ",")) 110 | cmd = ["python", "-m", "webbrowser", "-t", url] 111 | subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT).communicate() 112 | 113 | 114 | if __name__ == '__main__': 115 | Main() 116 | -------------------------------------------------------------------------------- /YATS_YetAnotherTSPSolution/YetAnotherTSPSolution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# YATS - Yet Another TSP Solution\n", 8 | "\n", 9 | "## This notebook was created to serve a [blog post](https://medium.com/hackernoon/yats-yet-another-tsp-solution-6a71aeabe1f8) by the same name." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "# written in python 3.7.3\n", 21 | "import numpy as np\n", 22 | "import math\n", 23 | "from geojson import LineString, Point, Feature, FeatureCollection\n", 24 | "import geojsonio\n", 25 | "import json\n", 26 | "import random\n", 27 | "random_seed = 42\n", 28 | "random.seed(random_seed)" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "# Part I - The Code For Our Solution" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### The Geographic Building Blocks" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": { 49 | "collapsed": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "class GeoPoint:\n", 54 | " def __init__(self, lng: float, lat: float):\n", 55 | " # Why 5 digits? According to https://en.wikipedia.org/wiki/Decimal_degrees it's 1m. 
accuracy.\n", 56 | " self.lng = round(lng, 5)\n", 57 | " self.lat = round(lat, 5)\n", 58 | " \n", 59 | " def __repr__(self):\n", 60 | " # copy-pastable format to most map applications\n", 61 | " return f\"[{self.lng}, {self.lat}]\"\n", 62 | " \n", 63 | "def euclidean_dist(geo_point_1: GeoPoint, geo_point_2: GeoPoint) -> float:\n", 64 | " d = np.linalg.norm(np.array([geo_point_1.lat, geo_point_1.lng])\n", 65 | " - np.array([geo_point_2.lat, geo_point_2.lng]))\n", 66 | " return d\n", 67 | "\n", 68 | "def get_geo_point_of_center(geo_points: [GeoPoint]) -> GeoPoint:\n", 69 | " lng_list, lat_list = list(zip(*[[g.lng, g.lat] for g in geo_points]))\n", 70 | " lng_center, lat_center = np.mean(np.array([lng_list, lat_list]), axis=1)\n", 71 | " return GeoPoint(lng_center, lat_center)\n", 72 | "\n", 73 | "def get_angle_from_reference_geo_point_in_deg(reference_geo_point: GeoPoint, other_geo_point: GeoPoint) -> float:\n", 74 | " x = other_geo_point.lng - reference_geo_point.lng\n", 75 | " y = other_geo_point.lat - reference_geo_point.lat\n", 76 | " angle_from_reference_in_deg = math.degrees(math.atan2(y, x))\n", 77 | " return angle_from_reference_in_deg" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### Our Initial Route - An Angular Route" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 3, 90 | "metadata": { 91 | "collapsed": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "def get_angular_route(geo_points: [GeoPoint]) -> [int]:\n", 96 | " center = get_geo_point_of_center(geo_points)\n", 97 | " route_idxs = sorted(list(range(len(geo_points))), \n", 98 | " key=lambda i: \n", 99 | " get_angle_from_reference_geo_point_in_deg(center, geo_points[i]),\n", 100 | " reverse=True)\n", 101 | " return route_idxs" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Visualizing" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 
4, 114 | "metadata": { 115 | "collapsed": true 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "# opens a new tab with the route visualization on geojson.io\n", 120 | "def visualize(route_idxs: [int], geo_points: [GeoPoint]):\n", 121 | " lng_lat_list = [tuple([geo_points[i].lng, geo_points[i].lat])\n", 122 | " for i in route_idxs]\n", 123 | " route = Feature(geometry=LineString(lng_lat_list),\n", 124 | " properties={\"name\": \"This is our route\",\n", 125 | " \"stroke\": \"#8B0000\"})\n", 126 | " places = [Feature(geometry=Point(lng_lat), \n", 127 | " properties={\"name\": f\"Place {route_idxs[i]}\",\n", 128 | " \"marker-symbol\": int(str(i)[-1]),\n", 129 | " \"marker-color\": \"#00008B\"})\n", 130 | " for i, lng_lat in enumerate(lng_lat_list)]\n", 131 | " \n", 132 | " feature_collection = FeatureCollection(features=[route] + places)\n", 133 | " geojsonio.display(json.dumps(feature_collection));" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "### Optimization step" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 5, 146 | "metadata": { 147 | "collapsed": true 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "def get_route_len(distances_array: np.array, route_idxs):\n", 152 | " route_len = sum([distances_array[i1][i2]\n", 153 | " for i1, i2 in zip(route_idxs[:-1], route_idxs[1:])])\n", 154 | " return route_len\n", 155 | "\n", 156 | "def optimize_route(distances_array: np.array, route_idxs: [int], n_iter: int) -> [int]:\n", 157 | " prev_cost = get_route_len(distances_array, route_idxs)\n", 158 | " \n", 159 | " all_idxs = list(range(len(route_idxs)))\n", 160 | " for _ in range(n_iter): \n", 161 | " i1, i2 = random.sample(all_idxs, 2)\n", 162 | " route_idxs[i2], route_idxs[i1] = route_idxs[i1], route_idxs[i2]\n", 163 | " new_cost = get_route_len(distances_array, route_idxs)\n", 164 | " if new_cost < prev_cost:\n", 165 | " prev_cost = new_cost\n", 166 | " else:\n", 167 | " 
route_idxs[i2], route_idxs[i1] = route_idxs[i1], route_idxs[i2]\n", 168 | " return route_idxs" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "### Wrap It All Up" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 6, 181 | "metadata": { 182 | "collapsed": true 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "# opens a new tab with the route visualization on geojson.io\n", 187 | "def plot_best_route(geo_points: [GeoPoint], distances_array: np.array, n_iter: int) -> None:\n", 188 | " route_idxs = get_angular_route(geo_points)\n", 189 | " route_idxs = optimize_route(distances_array, route_idxs, n_iter)\n", 190 | " visualize(route_idxs, geo_points)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "# Part II - My Friday Morning Errands" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 7, 203 | "metadata": { 204 | "collapsed": false 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "rice = GeoPoint(34.904145, 32.178397)\n", 209 | "veg = GeoPoint(34.899660, 32.178243)\n", 210 | "pet= GeoPoint(34.899918, 32.177080)\n", 211 | "garden = GeoPoint(34.904370, 32.173966)\n", 212 | "pharm = GeoPoint(34.909027, 32.177480)\n", 213 | "pasta = GeoPoint(34.906774, 32.178279)\n", 214 | "pita = GeoPoint(34.903383, 32.177381)\n", 215 | "\n", 216 | "geo_points = [rice, veg, pet, garden, pharm, pasta, pita]" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 8, 222 | "metadata": { 223 | "collapsed": false 224 | }, 225 | "outputs": [ 226 | { 227 | "data": { 228 | "text/plain": [ 229 | "array([[ 0, 600, 550, 600, 650, 400, 150],\n", 230 | " [ 600, 0, 170, 900, 1100, 850, 550],\n", 231 | " [ 550, 170, 0, 750, 1000, 850, 400],\n", 232 | " [ 600, 900, 750, 0, 900, 750, 600],\n", 233 | " [ 650, 1100, 1000, 900, 0, 260, 700],\n", 234 | " [ 400, 850, 850, 750, 260, 0, 500],\n", 235 | " [ 150, 550, 400, 
600, 700, 500, 0]])" 236 | ] 237 | }, 238 | "execution_count": 8, 239 | "metadata": {}, 240 | "output_type": "execute_result" 241 | } 242 | ], 243 | "source": [ 244 | "distances_array = np.array([[0, 600, 550, 600, 650, 400, 150], \n", 245 | " [0, 0, 170, 900, 1100, 850, 550], \n", 246 | " [0, 0, 0, 750, 1000, 850, 400], \n", 247 | " [0, 0, 0, 0, 900, 750, 600], \n", 248 | " [0, 0, 0, 0, 0, 260, 700], \n", 249 | " [0, 0, 0, 0, 0, 0, 500], \n", 250 | " [0] * 7])\n", 251 | "distances_array += distances_array.transpose()\n", 252 | "distances_array" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 9, 258 | "metadata": { 259 | "collapsed": true 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "# opens a new tab with the route visualization on geojson.io\n", 264 | "plot_best_route(geo_points, distances_array, 100)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "# Part III - A Larger Experiment" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 10, 277 | "metadata": { 278 | "collapsed": true 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "def euclidean_dist(geo_point_1: GeoPoint, geo_point_2: GeoPoint):\n", 283 | " d = np.linalg.norm(np.array([geo_point_1.lat, geo_point_1.lng])\n", 284 | " - np.array([geo_point_2.lat, geo_point_2.lng]))\n", 285 | " return d" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 11, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "n_points = 50\n", 297 | "geo_points = [GeoPoint(34.8 + (0.1 * random.random()), 32.1 + (0.1 * random.random()))\n", 298 | " for _ in range(n_points)]\n", 299 | "distances_array = np.array([[euclidean_dist(g_from, g_to)\n", 300 | " for g_to in geo_points]\n", 301 | " for g_from in geo_points])" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 12, 307 | "metadata": { 308 | "collapsed": false 
309 | }, 310 | "outputs": [ 311 | { 312 | "name": "stdout", 313 | "output_type": "stream", 314 | "text": [ 315 | "Route length for random order: 2.8800609543466313\n", 316 | "Initial route length: 0.9013753942969249\n", 317 | "Final route length: 0.6142705755858876\n", 318 | "Wall time: 304 ms\n" 319 | ] 320 | } 321 | ], 322 | "source": [ 323 | "%%time\n", 324 | "\n", 325 | "n_iter = 10000\n", 326 | "random_cost = get_route_len(distances_array, list(range(n_points)))\n", 327 | "print(f\"Route length for random order: {random_cost}\")\n", 328 | "route_idxs = get_angular_route(geo_points)\n", 329 | "first_cost = get_route_len(distances_array, route_idxs)\n", 330 | "print(f\"Initial route length: {first_cost}\")\n", 331 | "route_idxs = optimize_route(distances_array, route_idxs, n_iter)\n", 332 | "last_cost = get_route_len(distances_array, route_idxs)\n", 333 | "print(f\"Final route length: {last_cost}\")\n", 334 | "visualize(route_idxs, geo_points)" 335 | ] 336 | } 337 | ], 338 | "metadata": { 339 | "anaconda-cloud": {}, 340 | "kernelspec": { 341 | "display_name": "quay_rnd", 342 | "language": "python", 343 | "name": "quay_rnd" 344 | }, 345 | "language_info": { 346 | "codemirror_mode": { 347 | "name": "ipython", 348 | "version": 3 349 | }, 350 | "file_extension": ".py", 351 | "mimetype": "text/x-python", 352 | "name": "python", 353 | "nbconvert_exporter": "python", 354 | "pygments_lexer": "ipython3", 355 | "version": "3.7.3" 356 | } 357 | }, 358 | "nbformat": 4, 359 | "nbformat_minor": 1 360 | } 361 | -------------------------------------------------------------------------------- /SetTSP/SetTSP.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Set-TSP - Because There Is More Than One Place to Get Bread\n", 8 | "\n", 9 | "## This notebook was created to serve a [blog 
post](https://towardsdatascience.com/set-tsp-because-there-is-more-than-one-place-to-get-bread-712fdb5b381) by the same name." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "# written in python 3.7.3\n", 21 | "import json\n", 22 | "import random\n", 23 | "\n", 24 | "import geojsonio\n", 25 | "import numpy as np\n", 26 | "import pandas as pd\n", 27 | "from geojson import LineString, Point, Feature, FeatureCollection\n", 28 | "from math import cos, asin, sqrt, degrees, atan2\n", 29 | "\n", 30 | "import matplotlib.pyplot as plt\n", 31 | "%matplotlib inline\n", 32 | "\n", 33 | "random_seed = 42\n", 34 | "random.seed(random_seed)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "# Part I - The Geographic Building Blocks\n", 42 | "\n", 43 | "### For more information on Haversine formula, go to [this](https://en.wikipedia.org/wiki/Haversine_formula) wiki page, \n", 44 | "\n", 45 | "### and for the specific formulation used here go to [this](https://stackoverflow.com/questions/27928/calculate-distance-between-two-latitude-longitude-points-haversine-formula/21623206#21623206) Stackoverflow answer" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 2, 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "p = 0.017453292519943295 # Pi / 180\n", 57 | "r = 12742000 # Earth's radius is ~6371km => r = 2 * Earth's radius\n", 58 | "\n", 59 | "class GeoPoint:\n", 60 | " def __init__(self, lng: float, lat: float, name_: str = None):\n", 61 | " # Why 5 digits? According to https://en.wikipedia.org/wiki/Decimal_degrees it's 1m. 
accuracy.\n", 62 | " self.lng = round(lng, 5)\n", 63 | " self.lat = round(lat, 5)\n", 64 | " self.name_ = name_\n", 65 | " \n", 66 | " def __repr__(self):\n", 67 | " # Copy-pastable format for most map applications\n", 68 | " name_str = f\"{self.name_}, \" if self.name_ is not None else \"\"\n", 69 | " return f\"[{name_str}{self.lat}, {self.lng}]\"\n", 70 | " \n", 71 | " def get_dist_from(self, other) -> int:\n", 72 | " # Return non-euclidean distance in meters, using Haversine formula\n", 73 | " # For more information on this formulation go to \n", 74 | " a = (0.5 \n", 75 | " - cos((other.lat - self.lat) * p)/2 \n", 76 | " + (cos(self.lat * p) \n", 77 | " * cos(other.lat * p) \n", 78 | " * (1 - cos((other.lng - self.lng) * p)) / 2))\n", 79 | " d = int(r * asin(sqrt(a))) \n", 80 | " return d" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 3, 86 | "metadata": { 87 | "collapsed": false 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "def get_all_geopoints_from_all_sets(all_sets):\n", 92 | " geo_points = [g\n", 93 | " for set_ in all_sets\n", 94 | " for g in set_]\n", 95 | " return geo_points" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 4, 101 | "metadata": { 102 | "collapsed": false 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "def get_distances_array_and_set_to_points_dict(all_sets):\n", 107 | " all_geo_points = []\n", 108 | " set_to_points_dict = {}\n", 109 | " first_point_idx = 0\n", 110 | " for idx, set_ in enumerate(all_sets):\n", 111 | " all_geo_points += set_\n", 112 | " n_points_in_set = len(set_)\n", 113 | " set_to_points_dict[idx] = list(range(first_point_idx, first_point_idx + n_points_in_set))\n", 114 | " first_point_idx += n_points_in_set\n", 115 | "\n", 116 | " n_points = first_point_idx\n", 117 | " distances_array = np.array([[all_geo_points[i].get_dist_from(all_geo_points[j])\n", 118 | " for i in range(n_points)]\n", 119 | " for j in range(n_points)])\n", 120 | "\n", 121 | " return 
all_geo_points, set_to_points_dict, distances_array" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 5, 127 | "metadata": { 128 | "collapsed": false 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "bread_0 = GeoPoint(lat=32.178500, lng=34.906531, name_=\"bread_0\")\n", 133 | "bread_1 = GeoPoint(lat=32.175431, lng=34.907089, name_=\"bread_1\")\n", 134 | "bread_2 = GeoPoint(lat=32.175041, lng=34.898474, name_=\"bread_2\")\n", 135 | "bread_set = [bread_0, bread_1, bread_2]\n", 136 | "\n", 137 | "veg_0 = GeoPoint(lat=32.178192, lng=34.899633, name_=\"veg_0\")\n", 138 | "veg_1 = GeoPoint(lat=32.176376, lng=34.902369, name_=\"veg_1\")\n", 139 | "veg_2 = GeoPoint(lat=32.174051, lng=34.899397, name_=\"veg_2\")\n", 140 | "veg_set = [veg_0, veg_1, veg_2]\n", 141 | "\n", 142 | "beer_0 = GeoPoint(lat=32.177774, lng=34.907175, name_=\"beer_0\")\n", 143 | "beer_1 = GeoPoint(lat=32.177102, lng=34.899719, name_=\"beer_1\")\n", 144 | "beer_set = [beer_0, beer_1]\n", 145 | "\n", 146 | "all_sets = [bread_set, veg_set, beer_set]" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 6, 152 | "metadata": { 153 | "collapsed": false 154 | }, 155 | "outputs": [ 156 | { 157 | "name": "stdout", 158 | "output_type": "stream", 159 | "text": [ 160 | "[[bread_0, 32.1785, 34.90653], [bread_1, 32.17543, 34.90709], [bread_2, 32.17504, 34.89847], [veg_0, 32.17819, 34.89963], [veg_1, 32.17638, 34.90237], [veg_2, 32.17405, 34.8994], [beer_0, 32.17777, 34.90718], [beer_1, 32.1771, 34.89972]]\n", 161 | "{0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7]}\n" 162 | ] 163 | }, 164 | { 165 | "data": { 166 | "text/plain": [ 167 | "array([[ 0, 345, 850, 650, 457, 833, 101, 659],\n", 168 | " [345, 0, 812, 766, 456, 739, 260, 718],\n", 169 | " [850, 812, 0, 366, 396, 140, 874, 257],\n", 170 | " [650, 766, 366, 0, 327, 460, 712, 121],\n", 171 | " [457, 456, 396, 327, 0, 381, 478, 261],\n", 172 | " [833, 739, 140, 460, 381, 0, 840, 340],\n", 173 | " [101, 260, 874, 
712, 478, 840, 0, 706],\n", 174 | " [659, 718, 257, 121, 261, 340, 706, 0]])" 175 | ] 176 | }, 177 | "execution_count": 6, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "all_geo_points, set_to_points_dict, distances_array = get_distances_array_and_set_to_points_dict(all_sets)\n", 184 | "\n", 185 | "print(all_geo_points)\n", 186 | "print(set_to_points_dict)\n", 187 | "distances_array" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "# Part II - Solving Using Dynamic Programming" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 7, 200 | "metadata": { 201 | "collapsed": true 202 | }, 203 | "outputs": [], 204 | "source": [ 205 | "def DP_Set_TSP(set_to_points_dict, distances_array):\n", 206 | " all_sets = set(set_to_points_dict.keys())\n", 207 | " n_sets = len(all_sets)\n", 208 | "\n", 209 | " # memo keys: tuple(sorted_sets_in_path, last_set_in_path, last_point_in_path)\n", 210 | " # memo values: tuple(cost_thus_far, next_to_last_set_in_path, next_to_last_point_in_path)\n", 211 | " memo = {(tuple([set_idx]), set_idx, p_idx): tuple([0, None, None])\n", 212 | " for set_idx, points_idxs in set_to_points_dict.items()\n", 213 | " for p_idx in points_idxs}\n", 214 | " queue = [(tuple([set_idx]), set_idx, p_idx)\n", 215 | " for set_idx, points_idxs in set_to_points_dict.items()\n", 216 | " for p_idx in points_idxs]\n", 217 | "\n", 218 | " while queue:\n", 219 | " prev_visited_sets, prev_last_set, prev_last_point = queue.pop(0)\n", 220 | " prev_dist, _, _ = memo[(prev_visited_sets, prev_last_set, prev_last_point)]\n", 221 | "\n", 222 | " to_visit = all_sets.difference(set(prev_visited_sets))\n", 223 | " for new_last_set in to_visit:\n", 224 | " new_visited_sets = tuple(sorted(list(prev_visited_sets) + [new_last_set]))\n", 225 | " for new_last_point in set_to_points_dict[new_last_set]:\n", 226 | " new_dist = prev_dist + 
distances_array[prev_last_point][new_last_point]\n", 227 | "\n", 228 | " new_key = (new_visited_sets, new_last_set, new_last_point)\n", 229 | " new_value = (new_dist, prev_last_set, prev_last_point)\n", 230 | "\n", 231 | " if new_key not in memo:\n", 232 | " memo[new_key] = new_value\n", 233 | " queue += [new_key]\n", 234 | " else:\n", 235 | " if new_dist < memo[new_key][0]:\n", 236 | " memo[new_key] = new_value\n", 237 | "\n", 238 | " optimal_path_in_points_idxs, optimal_path_in_sets_idxs, optimal_cost = retrace_optimal_path(memo, n_sets)\n", 239 | "\n", 240 | " return optimal_path_in_points_idxs, optimal_path_in_sets_idxs, optimal_cost" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 8, 246 | "metadata": { 247 | "collapsed": true 248 | }, 249 | "outputs": [], 250 | "source": [ 251 | "def retrace_optimal_path(memo: dict, n_sets: int) -> [[int], [int], float]:\n", 252 | " sets_to_retrace = tuple(range(n_sets))\n", 253 | "\n", 254 | " full_path_memo = dict((k, v) for k, v in memo.items() if k[0] == sets_to_retrace)\n", 255 | " path_key = min(full_path_memo.keys(), key=lambda x: full_path_memo[x][0])\n", 256 | "\n", 257 | " _, last_set, last_point = path_key\n", 258 | " optimal_cost, next_to_last_set, next_to_last_point = memo[path_key]\n", 259 | "\n", 260 | " optimal_path_in_points_idxs = [last_point]\n", 261 | " optimal_path_in_sets_idxs = [last_set]\n", 262 | " sets_to_retrace = tuple(sorted(set(sets_to_retrace).difference({last_set})))\n", 263 | "\n", 264 | " while next_to_last_set is not None:\n", 265 | " last_point = next_to_last_point\n", 266 | " last_set = next_to_last_set\n", 267 | " path_key = (sets_to_retrace, last_set, last_point)\n", 268 | " _, next_to_last_set, next_to_last_point = memo[path_key]\n", 269 | "\n", 270 | " optimal_path_in_points_idxs = [last_point] + optimal_path_in_points_idxs\n", 271 | " optimal_path_in_sets_idxs = [last_set] + optimal_path_in_sets_idxs\n", 272 | " sets_to_retrace = 
tuple(sorted(set(sets_to_retrace).difference({last_set})))\n", 273 | "\n", 274 | " return optimal_path_in_points_idxs, optimal_path_in_sets_idxs, optimal_cost" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 9, 280 | "metadata": { 281 | "collapsed": true 282 | }, 283 | "outputs": [], 284 | "source": [ 285 | "def get_features_for_all_points(all_sets):\n", 286 | " points = []\n", 287 | " for set_ in all_sets:\n", 288 | " color = \"#\" + ''.join(random.choices('0123456789abcdef', k=6))\n", 289 | " points += [\n", 290 | " Feature(geometry=Point(tuple([g.lng, g.lat])),\n", 291 | " properties={\"name\": g.name_,\n", 292 | " \"marker-symbol\": int(g.name_[-1]),\n", 293 | " \"marker-color\": color})\n", 294 | " for g in set_]\n", 295 | " return points" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 10, 301 | "metadata": { 302 | "collapsed": false 303 | }, 304 | "outputs": [], 305 | "source": [ 306 | "def plot_route_on_map(all_sets, optimal_path_in_points_idxs):\n", 307 | " points = get_features_for_all_points(all_sets)\n", 308 | " \n", 309 | " all_geo_points = get_all_geopoints_from_all_sets(all_sets)\n", 310 | " lng_lat_list = [tuple([all_geo_points[i].lng, all_geo_points[i].lat])\n", 311 | " for i in optimal_path_in_points_idxs]\n", 312 | " route = Feature(geometry=LineString(lng_lat_list),\n", 313 | " properties={\"name\": \"This is our route\",\n", 314 | " \"stroke\": \"black\"})\n", 315 | " \n", 316 | " feature_collection = FeatureCollection(features=points+[route])\n", 317 | " geojsonio.display(json.dumps(feature_collection));" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 11, 323 | "metadata": { 324 | "collapsed": false 325 | }, 326 | "outputs": [ 327 | { 328 | "name": "stdout", 329 | "output_type": "stream", 330 | "text": [ 331 | "[2, 7, 3] [0, 2, 1] 378\n" 332 | ] 333 | } 334 | ], 335 | "source": [ 336 | "optimal_path_in_points, optimal_path_in_sets, optimal_cost = 
DP_Set_TSP(set_to_points_dict, distances_array)\n", 337 | "print(optimal_path_in_points, optimal_path_in_sets, optimal_cost)" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 12, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "# Opens a new tab in the browser :)\n", 349 | "plot_route_on_map(all_sets, optimal_path_in_points)" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "# Part III - Larger Random Experiments" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 13, 362 | "metadata": { 363 | "collapsed": true 364 | }, 365 | "outputs": [], 366 | "source": [ 367 | "def get_random_geo_point(center_lat=32.1, center_lng=34.8, radius=0.1, name_=None):\n", 368 | " geo_point = GeoPoint(lat = center_lat + (radius * random.random()), \n", 369 | " lng = center_lng + (radius * random.random()),\n", 370 | " name_ = name_)\n", 371 | " return geo_point" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 14, 377 | "metadata": { 378 | "collapsed": true 379 | }, 380 | "outputs": [], 381 | "source": [ 382 | "def generate_random_input_in_geo_points(n_sets: int, poisson_lambda: int = 2) -> [{int: int}, {int: int}, np.array]:\n", 383 | " set_to_points_dict = {}\n", 384 | " first_point_idx = 0\n", 385 | " for set_idx in range(n_sets):\n", 386 | " n_points_in_set = 1 + np.random.poisson(poisson_lambda)\n", 387 | " set_to_points_dict[set_idx] = list(range(first_point_idx, first_point_idx + n_points_in_set))\n", 388 | " first_point_idx += n_points_in_set\n", 389 | "\n", 390 | " n_points = first_point_idx\n", 391 | " all_sets = []\n", 392 | " for idx_set in range(n_sets):\n", 393 | " all_sets += [[get_random_geo_point(name_=f's{idx_set}_i{idx_point}') \n", 394 | " for idx_point in range(len(set_to_points_dict[idx_set]))]]\n", 395 | " \n", 396 | " all_geo_points = get_all_geopoints_from_all_sets(all_sets)\n", 
397 | " distances_array = np.array([[all_geo_points[i].get_dist_from(all_geo_points[j])\n", 398 | " for i in range(n_points)]\n", 399 | " for j in range(n_points)])\n", 400 | "\n", 401 | " return all_sets, all_geo_points, set_to_points_dict, distances_array" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 15, 407 | "metadata": { 408 | "collapsed": false 409 | }, 410 | "outputs": [ 411 | { 412 | "name": "stdout", 413 | "output_type": "stream", 414 | "text": [ 415 | "[1, 9, 5, 37, 28, 15, 33, 25, 12, 20] [0, 2, 1, 9, 7, 4, 8, 6, 3, 5] 10862\n" 416 | ] 417 | } 418 | ], 419 | "source": [ 420 | "n_sets = 10\n", 421 | "poisson_lambda = 4\n", 422 | "all_sets, all_geo_points, set_to_points_dict, distances_array = generate_random_input_in_geo_points(n_sets, poisson_lambda)\n", 423 | "optimal_path_in_points, optimal_path_in_sets, optimal_cost = DP_Set_TSP(set_to_points_dict, distances_array)\n", 424 | "print(optimal_path_in_points, optimal_path_in_sets, optimal_cost)\n", 425 | "\n", 426 | "# Opens a new tab in the browser :)\n", 427 | "plot_route_on_map(all_sets, optimal_path_in_points)" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "# Extra - Working with non-geo random input" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 16, 440 | "metadata": { 441 | "collapsed": true 442 | }, 443 | "outputs": [], 444 | "source": [ 445 | "def generate_random_input(n_sets: int, poisson_lambda: int = 2) -> [{int: int}, {int: int}, np.array]:\n", 446 | " set_to_points_dict = {}\n", 447 | " first_point_idx = 0\n", 448 | " for set_idx in range(n_sets):\n", 449 | " n_points_in_set = 1 + np.random.poisson(poisson_lambda)\n", 450 | " set_to_points_dict[set_idx] = list(range(first_point_idx, first_point_idx + n_points_in_set))\n", 451 | " first_point_idx += n_points_in_set\n", 452 | "\n", 453 | " n_points = first_point_idx\n", 454 | " X = np.random.rand(n_points, 3)\n", 455 | " 
distances_array = np.array([[np.linalg.norm(X[i] - X[j])\n", 456 | " for i in range(n_points)]\n", 457 | " for j in range(n_points)])\n", 458 | "\n", 459 | " return X, set_to_points_dict, distances_array" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": 17, 465 | "metadata": { 466 | "collapsed": true 467 | }, 468 | "outputs": [], 469 | "source": [ 470 | "def scatter_plot(X: np.array, clusters_in_idxs: [[int]]):\n", 471 | " x, y = list(zip(*[[X[c_idx][0], X[c_idx][1]]\n", 472 | " for one_cluster_in_idxs in clusters_in_idxs\n", 473 | " for c_idx in one_cluster_in_idxs]))\n", 474 | " c = [color_idx\n", 475 | " for color_idx, one_cluster_in_idxs in enumerate(clusters_in_idxs)\n", 476 | " for _ in one_cluster_in_idxs]\n", 477 | " df = pd.DataFrame({'x': x, 'y': y, 'c': c})\n", 478 | "\n", 479 | " for color_idx, cluster_in_idxs in enumerate(clusters_in_idxs):\n", 480 | " df_temp = df[df['c'].isin([color_idx])]\n", 481 | " plt.plot(df_temp['x'].tolist(), df_temp['y'].tolist(), 'o', label=color_idx, markersize=8);\n", 482 | "\n", 483 | " plt.legend(loc='upper left', bbox_to_anchor=(1, 1))" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": 18, 489 | "metadata": { 490 | "collapsed": true 491 | }, 492 | "outputs": [], 493 | "source": [ 494 | "def plot_route(X, optimal_path_in_points_idxs, set_to_points_dict):\n", 495 | " scatter_plot(X, list(set_to_points_dict.values()))\n", 496 | " for p1, p2 in zip(optimal_path_in_points_idxs[:-1], optimal_path_in_points_idxs[1:]):\n", 497 | " plt.plot([X[p1, 0], X[p2, 0]], [X[p1, 1], X[p2, 1]], color='grey');" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": 19, 503 | "metadata": { 504 | "collapsed": false 505 | }, 506 | "outputs": [ 507 | { 508 | "data": { 509 | "image/png": "\n", 510 | "text/plain": [ 511 | "
" 512 | ] 513 | }, 514 | "metadata": { 515 | "needs_background": "light" 516 | }, 517 | "output_type": "display_data" 518 | } 519 | ], 520 | "source": [ 521 | "n_sets = 5\n", 522 | "poisson_lambda = 3\n", 523 | "X, set_to_points_dict, distances_array = generate_random_input(n_sets, poisson_lambda)\n", 524 | "\n", 525 | "optimal_path_in_points_idxs, optimal_path_in_sets_idxs, optimal_cost = DP_Set_TSP(set_to_points_dict, distances_array)\n", 526 | "plot_route(X, optimal_path_in_points_idxs, set_to_points_dict)" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": { 533 | "collapsed": true 534 | }, 535 | "outputs": [], 536 | "source": [] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": { 542 | "collapsed": true 543 | }, 544 | "outputs": [], 545 | "source": [] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": { 551 | "collapsed": true 552 | }, 553 | "outputs": [], 554 | "source": [] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": null, 559 | "metadata": { 560 | "collapsed": true 561 | }, 562 | "outputs": [], 563 | "source": [] 564 | } 565 | ], 566 | "metadata": { 567 | "anaconda-cloud": {}, 568 | "kernelspec": { 569 | "display_name": "quay_rnd", 570 | "language": "python", 571 | "name": "quay_rnd" 572 | }, 573 | "language_info": { 574 | "codemirror_mode": { 575 | "name": "ipython", 576 | "version": 3 577 | }, 578 | "file_extension": ".py", 579 | "mimetype": "text/x-python", 580 | "name": "python", 581 | "nbconvert_exporter": "python", 582 | "pygments_lexer": "ipython3", 583 | "version": "3.7.3" 584 | } 585 | }, 586 | "nbformat": 4, 587 | "nbformat_minor": 1 588 | } 589 | -------------------------------------------------------------------------------- /PlottingWithPandas/PlottingWithPandas_DatesAndBarPlots.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 
| { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# PlottingWithPandas\n", 8 | "\n", 9 | "## This notebook was created to serve a [blog post]() by the same name." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "# Python 3.7.3\n", 21 | "import pandas as pd # version 0.23.4\n", 22 | "\n", 23 | "import matplotlib.pyplot as plt # version 3.0.2\n", 24 | "%matplotlib inline " 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## 1. Setting up the Data:\n", 32 | "\n", 33 | "#### Download and save data from Kaggle [Austin Animal Center Shelter Outcomes](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-outcomes-and#aac_shelter_outcomes.csv)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [ 43 | { 44 | "data": { 45 | "text/html": [ 46 | "
\n", 47 | "\n", 60 | "\n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | "
   color        date_of_birth  name
0  orange       2014-07-07     NaN
1  blue /white  2014-06-16     Lucy
2  white/black  2014-03-26     *Frida
3  black/white  2013-03-27     Stella Luna
4  black/white  2013-12-16     NaN
\n", 102 | "
" 103 | ], 104 | "text/plain": [ 105 | " color date_of_birth name\n", 106 | "0 orange 2014-07-07 NaN\n", 107 | "1 blue /white 2014-06-16 Lucy\n", 108 | "2 white/black 2014-03-26 *Frida\n", 109 | "3 black/white 2013-03-27 Stella Luna\n", 110 | "4 black/white 2013-12-16 NaN" 111 | ] 112 | }, 113 | "execution_count": 2, 114 | "metadata": {}, 115 | "output_type": "execute_result" 116 | } 117 | ], 118 | "source": [ 119 | "filename = \"aac_shelter_cat_outcome_eng.csv\"\n", 120 | "df = pd.read_csv(filename, \n", 121 | " usecols=['name', 'date_of_birth', 'color'],\n", 122 | " parse_dates=['date_of_birth'])\n", 123 | "df.head()" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "## 2. Our Most Basic Plot" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 3, 136 | "metadata": { 137 | "collapsed": false 138 | }, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/html": [ 143 | "
\n", 144 | "\n", 157 | "\n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | "
   color        date_of_birth  name         weekday_num  weekday_name
0  orange       2014-07-07     NaN          0            Monday
1  blue /white  2014-06-16     Lucy         0            Monday
2  white/black  2014-03-26     *Frida       2            Wednesday
3  black/white  2013-03-27     Stella Luna  2            Wednesday
4  black/white  2013-12-16     NaN          0            Monday
\n", 211 | "
" 212 | ], 213 | "text/plain": [ 214 | " color date_of_birth name weekday_num weekday_name\n", 215 | "0 orange 2014-07-07 NaN 0 Monday\n", 216 | "1 blue /white 2014-06-16 Lucy 0 Monday\n", 217 | "2 white/black 2014-03-26 *Frida 2 Wednesday\n", 218 | "3 black/white 2013-03-27 Stella Luna 2 Wednesday\n", 219 | "4 black/white 2013-12-16 NaN 0 Monday" 220 | ] 221 | }, 222 | "execution_count": 3, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "df['weekday_num'] = pd.DatetimeIndex(df['date_of_birth']).weekday\n", 229 | "df['weekday_name'] = pd.DatetimeIndex(df['date_of_birth']).day_name()\n", 230 | "df.head()" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 4, 236 | "metadata": { 237 | "collapsed": false 238 | }, 239 | "outputs": [ 240 | { 241 | "data": { 242 | "text/html": [ 243 | "
\n", 244 | "\n", 257 | "\n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | "
   weekday_num  weekday_name  n_pets
0  0            Monday        4584
1  1            Tuesday       4260
2  2            Wednesday     4243
3  3            Thursday      4048
4  4            Friday        3941
5  5            Saturday      3966
6  6            Sunday        4379
\n", 311 | "
" 312 | ], 313 | "text/plain": [ 314 | " weekday_num weekday_name n_pets\n", 315 | "0 0 Monday 4584\n", 316 | "1 1 Tuesday 4260\n", 317 | "2 2 Wednesday 4243\n", 318 | "3 3 Thursday 4048\n", 319 | "4 4 Friday 3941\n", 320 | "5 5 Saturday 3966\n", 321 | "6 6 Sunday 4379" 322 | ] 323 | }, 324 | "execution_count": 4, 325 | "metadata": {}, 326 | "output_type": "execute_result" 327 | } 328 | ], 329 | "source": [ 330 | "df_grouped = df.groupby(['weekday_num', 'weekday_name']).size().reset_index(name=\"n_pets\")\n", 331 | "df_grouped" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 5, 337 | "metadata": { 338 | "collapsed": false 339 | }, 340 | "outputs": [ 341 | { 342 | "data": { 343 | "image/png": "\n", 344 | "text/plain": [ 345 | "
" 346 | ] 347 | }, 348 | "metadata": { 349 | "needs_background": "light" 350 | }, 351 | "output_type": "display_data" 352 | } 353 | ], 354 | "source": [ 355 | "df_grouped.plot.bar(x=\"weekday_name\", y=\"n_pets\", color='blue');" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "## 3. The Birthday Surprise\n", 363 | "\n", 364 | "[Basic statistics](https://en.wikipedia.org/wiki/Birthday_problem) show that in a random group of 50 people (or pets) there is roughly a 97% chance that at least two of them were born on the same day of the year. \n", 365 | "\n", 366 | "Let's test this!" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 6, 372 | "metadata": { 373 | "collapsed": false 374 | }, 375 | "outputs": [ 376 | { 377 | "data": { 378 | "image/png": "\n", 379 | "text/plain": [ 380 | "
" 381 | ] 382 | }, 383 | "metadata": { 384 | "needs_background": "light" 385 | }, 386 | "output_type": "display_data" 387 | } 388 | ], 389 | "source": [ 390 | "n_sample = 50\n", 391 | "df_sample = df.sample(n=n_sample)\n", 392 | "df_sample['birthday'] = df_sample['date_of_birth'].dt.strftime('%m-%d')\n", 393 | "df_sample_grouped = df_sample.groupby(['birthday']).size().reset_index(name=\"n_pets\")\n", 394 | "df_sample_grouped.plot.bar(x=\"birthday\", y=\"n_pets\", color='blue');" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "## Oh my, the dates look all messed up... Let's try to sort this out:" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 7, 407 | "metadata": { 408 | "collapsed": false 409 | }, 410 | "outputs": [ 411 | { 412 | "data": { 413 | "image/png": "\n", 414 | "text/plain": [ 415 | "
" 416 | ] 417 | }, 418 | "metadata": { 419 | "needs_background": "light" 420 | }, 421 | "output_type": "display_data" 422 | } 423 | ], 424 | "source": [ 425 | "df_sample_grouped = df_sample.groupby(['birthday']).size()\n", 426 | "n_unique_dates = len(df_sample_grouped.index.unique())\n", 427 | "fig = plt.figure(figsize=(n_unique_dates/5, n_unique_dates/10))\n", 428 | "\n", 429 | "ax = df_sample_grouped.plot.bar(x=\"birthday\", y=\"n_pets\", color='blue')\n", 430 | "ax.set_xticklabels(labels=df_sample_grouped.index, \n", 431 | " rotation=70, rotation_mode=\"anchor\", ha=\"right\");\n", 432 | "ax.legend(labels=['n_pets']);\n", 433 | "plt.tight_layout()" 434 | ] 435 | }, 436 | { 437 | "cell_type": "markdown", 438 | "metadata": {}, 439 | "source": [ 440 | "## 4. Let's plot according to animal color:" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 8, 446 | "metadata": { 447 | "collapsed": false 448 | }, 449 | "outputs": [ 450 | { 451 | "data": { 452 | "text/html": [ 453 | "
\n", 454 | "\n", 467 | "\n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | "
   color        date_of_birth  name         weekday_num  weekday_name  has_black
0  orange       2014-07-07     NaN          0            Monday        False
1  blue /white  2014-06-16     Lucy         0            Monday        False
2  white/black  2014-03-26     *Frida       2            Wednesday     True
3  black/white  2013-03-27     Stella Luna  2            Wednesday     True
4  black/white  2013-12-16     NaN          0            Monday        True
\n", 527 | "
" 528 | ], 529 | "text/plain": [ 530 | " color date_of_birth name weekday_num weekday_name has_black\n", 531 | "0 orange 2014-07-07 NaN 0 Monday False\n", 532 | "1 blue /white 2014-06-16 Lucy 0 Monday False\n", 533 | "2 white/black 2014-03-26 *Frida 2 Wednesday True\n", 534 | "3 black/white 2013-03-27 Stella Luna 2 Wednesday True\n", 535 | "4 black/white 2013-12-16 NaN 0 Monday True" 536 | ] 537 | }, 538 | "execution_count": 8, 539 | "metadata": {}, 540 | "output_type": "execute_result" 541 | } 542 | ], 543 | "source": [ 544 | "df['has_black'] = df['color'].str.contains(\"black\")\n", 545 | "df.head()" 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": 9, 551 | "metadata": { 552 | "collapsed": false 553 | }, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/html": [ 558 | "
\n", 559 | "\n", 572 | "\n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | "
has_black     False  True
weekday_name
Monday        2802   1227
Tuesday       2575   1129
Wednesday     2618   1138
Thursday      2527   1049
Friday        2425   1000
Saturday      2417   1056
Sunday        2670   1162
\n", 623 | "
" 624 | ], 625 | "text/plain": [ 626 | "has_black False True \n", 627 | "weekday_name \n", 628 | "Monday 2802 1227\n", 629 | "Tuesday 2575 1129\n", 630 | "Wednesday 2618 1138\n", 631 | "Thursday 2527 1049\n", 632 | "Friday 2425 1000\n", 633 | "Saturday 2417 1056\n", 634 | "Sunday 2670 1162" 635 | ] 636 | }, 637 | "execution_count": 9, 638 | "metadata": {}, 639 | "output_type": "execute_result" 640 | } 641 | ], 642 | "source": [ 643 | "df_grouped_color = (df.groupby(['weekday_num', 'weekday_name'])['has_black']\n", 644 | " .value_counts()\n", 645 | " .unstack()\n", 646 | " .reset_index(level=0, drop=True))\n", 647 | "df_grouped_color" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": 10, 653 | "metadata": { 654 | "collapsed": false 655 | }, 656 | "outputs": [ 657 | { 658 | "data": { 659 | "image/png": "\n", 660 | "text/plain": [ 661 | "
" 662 | ] 663 | }, 664 | "metadata": { 665 | "needs_background": "light" 666 | }, 667 | "output_type": "display_data" 668 | } 669 | ], 670 | "source": [ 671 | "ax = df_grouped_color.plot.bar(stacked=True, color=['brown', 'black']);\n", 672 | "ax.set_xticklabels(labels=df_grouped_color.index, \n", 673 | " rotation=70, rotation_mode=\"anchor\", ha=\"right\");\n", 674 | "ax.set_xlabel('');\n", 675 | "ax.set_ylabel('n_pets');" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": null, 681 | "metadata": { 682 | "collapsed": true 683 | }, 684 | "outputs": [], 685 | "source": [] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": { 691 | "collapsed": true 692 | }, 693 | "outputs": [], 694 | "source": [] 695 | } 696 | ], 697 | "metadata": { 698 | "anaconda-cloud": {}, 699 | "kernelspec": { 700 | "display_name": "quay_rnd", 701 | "language": "python", 702 | "name": "quay_rnd" 703 | }, 704 | "language_info": { 705 | "codemirror_mode": { 706 | "name": "ipython", 707 | "version": 3 708 | }, 709 | "file_extension": ".py", 710 | "mimetype": "text/x-python", 711 | "name": "python", 712 | "nbconvert_exporter": "python", 713 | "pygments_lexer": "ipython3", 714 | "version": "3.7.3" 715 | } 716 | }, 717 | "nbformat": 4, 718 | "nbformat_minor": 1 719 | } 720 | --------------------------------------------------------------------------------
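The birthday-surprise section of the PlottingWithPandas notebook quotes a probability for a group of 50 without deriving it. A minimal sketch of the exact calculation follows, assuming 365 equally likely birthdays and ignoring leap years; the helper name `prob_shared_birthday` is hypothetical and does not appear in the notebook.

```python
def prob_shared_birthday(n_people: int, n_days: int = 365) -> float:
    """Probability that at least two of n_people share a birthday.

    Assumes every day is equally likely and birthdays are independent.
    """
    if n_people > n_days:
        # Pigeonhole principle: a shared birthday is guaranteed.
        return 1.0

    # P(all distinct) = (n_days / n_days) * ((n_days - 1) / n_days) * ...
    prob_all_distinct = 1.0
    for k in range(n_people):
        prob_all_distinct *= (n_days - k) / n_days

    return 1.0 - prob_all_distinct


if __name__ == "__main__":
    for n in (23, 50, 57):
        print(n, round(prob_shared_birthday(n), 3))
```

Under these assumptions a group of 23 already passes the 50% mark, a group of 50 lands at roughly 97%, and it takes about 57 to pass 99%. Real shelter data will not be uniform across the year, which only makes shared birthdays more likely.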