├── .gitignore ├── README.md ├── batch ├── RunBatch.ipynb └── dask │ ├── batch.py │ ├── environment.yml │ └── startDask.py ├── img ├── 1.png ├── 10.png ├── 2.png ├── 3.png ├── 4.png ├── 5.png ├── 6.png ├── 7.png ├── 8.png ├── 9.png ├── bokeh.png ├── compute_nodes.png ├── create_cluster.png ├── dask-status.gif └── network.png ├── interactive ├── LoadDataFromDatastore.ipynb ├── StartDask.ipynb ├── StartDaskVNet.ipynb ├── dask │ ├── DaskNYCTaxi.ipynb │ ├── environment.yml │ └── startDask.py └── mydask.png ├── rapids_interactive ├── dask │ ├── azure_taxi_on_cluster.ipynb │ ├── dask.yml │ ├── init-dask.py │ ├── jupyter-preload.py │ ├── rapids-0.9.yaml │ └── rapids.yml └── start_cluster.ipynb └── setup-vnet.md /.gitignore: -------------------------------------------------------------------------------- 1 | .vscode/** 2 | **/.ipynb_checkpoints/* 3 | rapids_interactive/data/* 4 | **/._* 5 | nohup.out 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Running Dask on AzureML 2 | 3 | **This repository is no longer maintained. For a simple way of running Dask on an AzureML cluster, please check out the AzureML CLI v2 DASK samples here: https://github.com/Azure/azureml-examples/tree/main/cli** 4 | 5 | 6 | ---- 7 | 8 | This repository shows how to run a [Dask](https://docs.dask.org/en/latest/) cluster on an [AzureML](https://docs.microsoft.com/en-us/azure/machine-learning/service/) Compute cluster. It is designed to run on an AzureML Notebook VM (created after 8/15/2019), but it should work on your local computer, too. 9 | 10 | Please follow these setup instructions and then start: 11 | 12 | - here for plain DASK interactive scenarios [interactive/StartDask.ipynb](interactive/StartDask.ipynb). 13 | - here for DASK with NVIDIA RAPIDS interactive scenarios [rapids_interactive/start_cluster.ipynb](rapids_interactive/start_cluster.ipynb). 14 | 15 | ## Setting up the Python Environment 16 | The environment you are running should have the latest version of `dask` and `distributed` installed -- run this code in the terminal to make sure: 17 | 18 | ```shell 19 | conda activate py36 # assuming AzureML Notebook VM 20 | pip install --upgrade dask distributed 21 | ``` 22 | 23 | Or, if you want to be on the safe side, create a new conda environment using this [environment.yml](interactive/dask/environment.yml) file like so: 24 | 25 | ```shell 26 | conda env create -f dask/environment.yml 27 | conda activate dask 28 | python -m ipykernel install --user --name dask --display-name "Python (dask)" 29 | ``` 30 | 31 | ![](img/dask-status.gif) 32 | 33 | -------------------------------------------------------------------------------- /batch/RunBatch.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Running Dask on AzureML\n", 8 | "\n", 9 | "This notebook shows how to run a batch job on a Dask cluster running on an AzureML Compute cluster. 
\n", 10 | "For setup instructions of your python environment, please see the [Readme](../README.md)" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": {}, 17 | "outputs": [ 18 | { 19 | "data": { 20 | "text/plain": [ 21 | "'1.12.0'" 22 | ] 23 | }, 24 | "execution_count": 1, 25 | "metadata": {}, 26 | "output_type": "execute_result" 27 | } 28 | ], 29 | "source": [ 30 | "from azureml.core import Workspace, Experiment\n", 31 | "from azureml.train.estimator import Estimator\n", 32 | "from azureml.widgets import RunDetails\n", 33 | "from azureml.core.runconfig import MpiConfiguration\n", 34 | "from azureml.core import VERSION\n", 35 | "import uuid\n", 36 | "import time\n", 37 | "VERSION\n" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "ws = Workspace.from_config()" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Download the NYC Taxi dataset and upload to the workspace default blob storage" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 9, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-01.csv\n", 66 | "100%|██████████| 1985964692/1985964692 [00:30<00:00, 65604283.41it/s] \n", 67 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-02.csv\n", 68 | "100%|██████████| 1945357622/1945357622 [00:29<00:00, 65506177.65it/s]\n", 69 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-03.csv\n", 70 | "100%|██████████| 2087971794/2087971794 [00:33<00:00, 62180625.55it/s] \n", 71 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-04.csv\n", 72 | "100%|██████████| 2046225765/2046225765 [00:31<00:00, 65746019.73it/s]\n", 73 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-05.csv\n", 74 | "100%|██████████| 2061869121/2061869121 [00:27<00:00, 73939136.66it/s] \n", 75 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-06.csv\n", 76 | "100%|██████████| 1932049357/1932049357 [00:29<00:00, 64596156.85it/s]\n", 77 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-07.csv\n", 78 | "100%|██████████| 1812530041/1812530041 [00:29<00:00, 61745527.58it/s] \n", 79 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-08.csv\n", 80 | "100%|██████████| 1744852237/1744852237 [00:26<00:00, 65974018.30it/s] \n", 81 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-09.csv\n", 82 | "100%|██████████| 1760412710/1760412710 [00:27<00:00, 64174609.37it/s]\n", 83 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-10.csv\n", 84 | "100%|██████████| 1931460927/1931460927 [00:29<00:00, 65248050.69it/s]\n", 85 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-11.csv\n", 86 | "100%|██████████| 1773468989/1773468989 [00:31<00:00, 56412556.41it/s]\n", 87 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-12.csv\n", 88 | "100%|██████████| 1796283025/1796283025 [00:26<00:00, 68628572.27it/s] \n", 89 | "- Uploading taxi data... 
\n", 90 | "Uploading an estimated of 12 files\n", 91 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-01.csv\n", 92 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-02.csv\n", 93 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-03.csv\n", 94 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-04.csv\n", 95 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-05.csv\n", 96 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-06.csv\n", 97 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-07.csv\n", 98 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-08.csv\n", 99 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-09.csv\n", 100 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-10.csv\n", 101 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-11.csv\n", 102 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-12.csv\n", 103 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-11.csv, 1 files out of an estimated total of 12\n", 104 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-08.csv, 2 files out of an estimated total of 12\n", 105 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-07.csv, 3 files out of an estimated total of 12\n", 106 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-09.csv, 4 files out of an estimated total of 12\n", 107 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-12.csv, 5 files out of an estimated total of 12\n", 108 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-10.csv, 6 files out of an estimated total of 12\n", 109 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-06.csv, 7 files out of an estimated total of 12\n", 110 | "Uploaded 
/mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-02.csv, 8 files out of an estimated total of 12\n", 111 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-01.csv, 9 files out of an estimated total of 12\n", 112 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-03.csv, 10 files out of an estimated total of 12\n", 113 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-04.csv, 11 files out of an estimated total of 12\n", 114 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-05.csv, 12 files out of an estimated total of 12\n", 115 | "Uploaded 12 files\n", 116 | "- Data transfer complete\n" 117 | ] 118 | } 119 | ], 120 | "source": [ 121 | "import io\n", 122 | "import os\n", 123 | "import sys\n", 124 | "import urllib.request\n", 125 | "from tqdm import tqdm\n", 126 | "from time import sleep\n", 127 | "\n", 128 | "cwd = os.getcwd()\n", 129 | "\n", 130 | "data_dir = os.path.abspath(os.path.join(cwd, 'data'))\n", 131 | "if not os.path.exists(data_dir):\n", 132 | " os.makedirs(data_dir)\n", 133 | "\n", 134 | "taxidir = os.path.join(data_dir, 'nyctaxi')\n", 135 | "if not os.path.exists(taxidir):\n", 136 | " os.makedirs(taxidir)\n", 137 | "\n", 138 | "filenames = []\n", 139 | "local_paths = []\n", 140 | "for i in range(1, 13):\n", 141 | " filename = \"yellow_tripdata_2015-{month:02d}.csv\".format(month=i)\n", 142 | " filenames.append(filename)\n", 143 | " \n", 144 | " local_path = os.path.join(taxidir, filename)\n", 145 | " local_paths.append(local_path)\n", 146 | "\n", 147 | "for idx, filename in enumerate(filenames):\n", 148 | " url = \"http://dask-data.s3.amazonaws.com/nyc-taxi/2015/\" + filename\n", 149 | " print(\"- Downloading \" + url)\n", 150 | " if not os.path.exists(local_paths[idx]):\n", 151 | " with open(local_paths[idx], 'wb') as file:\n", 152 | " with urllib.request.urlopen(url) as resp:\n", 153 | " length = int(resp.getheader('content-length'))\n", 154 | " blocksize = max(4096, length // 100)\n", 155 | " with tqdm(total=length, file=sys.stdout) as pbar:\n", 156 | " while True:\n", 157 | " buff = resp.read(blocksize)\n", 158 | " if not buff:\n", 159 | " break\n", 160 | " file.write(buff)\n", 161 | " pbar.update(len(buff))\n", 162 | " else:\n", 163 | " print(\"- File already exists locally\")\n", 164 | "\n", 165 | "print(\"- Uploading taxi data... 
\")\n", 166 | "ws = Workspace.from_config()\n", 167 | "ds = ws.get_default_datastore()\n", 168 | "\n", 169 | "ds.upload(\n", 170 | " src_dir=taxidir,\n", 171 | " target_path='nyctaxi',\n", 172 | " show_progress=True)\n", 173 | "\n", 174 | "print(\"- Data transfer complete\")" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "## Starting the cluster" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 10, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "# we assume the AML compute training cluster is already created\n", 191 | "dask_cluster = ws.compute_targets['daniel-big']" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "Starting the Dask cluster using an Estimator with MpiConfiguration. Make sure the cluster is able to scale up to 10 nodes or change the `node_count` below. " 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 14, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "est = Estimator('dask', \n", 208 | " compute_target=dask_cluster, \n", 209 | " entry_script='startDask.py', \n", 210 | " conda_dependencies_file='environment.yml', \n", 211 | " script_params={'--datastore': ws.get_default_datastore(),\n", 212 | " '--script': 'batch.py'},\n", 213 | " node_count=10,\n", 214 | " distributed_training=MpiConfiguration())\n", 215 | "\n", 216 | "run = Experiment(ws, 'dask').submit(est)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 15, 222 | "metadata": { 223 | "scrolled": false 224 | }, 225 | "outputs": [ 226 | { 227 | "data": { 228 | "application/vnd.jupyter.widget-view+json": { 229 | "model_id": "e917d855441647f09fcc50f3809622f5", 230 | "version_major": 2, 231 | "version_minor": 0 232 | }, 233 | "text/plain": [ 234 | "_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…" 235 | ] 236 | }, 237 | "metadata": {}, 238 | "output_type": "display_data" 239 | }, 240 | { 241 | "data": { 242 | "application/aml.mini.widget.v1": "{\"status\": \"Queued\", \"workbench_run_details_uri\": \"https://ml.azure.com/experiments/dask/runs/dask_1599051891_1033bd4f?wsid=/subscriptions/6560575d-fa06-4e7d-95fb-f962e74efd7a/resourcegroups/dask-rg/workspaces/dask-azureml\", \"run_id\": \"dask_1599051891_1033bd4f\", \"run_properties\": {\"run_id\": \"dask_1599051891_1033bd4f\", \"created_utc\": \"2020-09-02T13:04:56.761121Z\", \"properties\": {\"_azureml.ComputeTargetType\": \"amlcompute\", \"ContentSnapshotId\": \"66fa5036-4c6b-47f6-aa7e-d7e9340a74cb\", \"azureml.git.repository_uri\": \"https://github.com/danielsc/azureml-and-dask\", \"mlflow.source.git.repoURL\": \"https://github.com/danielsc/azureml-and-dask\", \"azureml.git.branch\": \"master\", \"mlflow.source.git.branch\": \"master\", \"azureml.git.commit\": \"f71a6182f15f2344e7b39589434f3d3461a89344\", \"mlflow.source.git.commit\": \"f71a6182f15f2344e7b39589434f3d3461a89344\", \"azureml.git.dirty\": \"True\", \"ProcessInfoFile\": \"azureml-logs/process_info.json\", \"ProcessStatusFile\": \"azureml-logs/process_status.json\"}, \"tags\": {\"_aml_system_ComputeTargetStatus\": \"{\\\"AllocationState\\\":\\\"resizing\\\",\\\"PreparingNodeCount\\\":0,\\\"RunningNodeCount\\\":0,\\\"CurrentNodeCount\\\":0}\"}, \"script_name\": null, \"arguments\": null, \"end_time_utc\": null, \"status\": \"Queued\", \"log_files\": {}, \"log_groups\": [], \"run_duration\": \"0:02:33\"}, \"child_runs\": 
[], \"children_metrics\": {}, \"run_metrics\": [], \"run_logs\": \"Your job is submitted in Azure cloud and we are monitoring to get logs...\", \"graph\": {}, \"widget_settings\": {\"childWidgetDisplay\": \"popup\", \"send_telemetry\": false, \"log_level\": \"INFO\", \"sdk_version\": \"1.12.0\"}, \"loading\": false}" 243 | }, 244 | "metadata": {}, 245 | "output_type": "display_data" 246 | } 247 | ], 248 | "source": [ 249 | "RunDetails(run).show()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "## Shut cluster down" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 50, 262 | "metadata": {}, 263 | "outputs": [ 264 | { 265 | "name": "stdout", 266 | "output_type": "stream", 267 | "text": [ 268 | "cancelling run dask_1575974502_b8643732\n", 269 | "cancelling run dask_1575973181_99433e88\n" 270 | ] 271 | } 272 | ], 273 | "source": [ 274 | "for run in ws.experiments['dask'].get_runs():\n", 275 | " if run.get_status() == \"Running\":\n", 276 | " print(f'cancelling run {run.id}')\n", 277 | " run.cancel()" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "### Just for convenience, get the latest running Run" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 87, 290 | "metadata": {}, 291 | "outputs": [ 292 | { 293 | "name": "stdout", 294 | "output_type": "stream", 295 | "text": [ 296 | "latest running run is dask_1574792066_49c85fe4\n" 297 | ] 298 | } 299 | ], 300 | "source": [ 301 | "for run in ws.experiments['dask'].get_runs():\n", 302 | " if run.get_status() == \"Running\":\n", 303 | " print(f'latest running run is {run.id}')\n", 304 | " break" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [] 313 | } 314 | ], 315 | "metadata": { 316 | "kernelspec": { 317 | "display_name": "Python 3.6 - AzureML", 318 | "language": "python", 319 | "name": "python3-azureml" 320 | }, 321 | "language_info": { 322 | "codemirror_mode": { 323 | "name": "ipython", 324 | "version": 3 325 | }, 326 | "file_extension": ".py", 327 | "mimetype": "text/x-python", 328 | "name": "python", 329 | "nbconvert_exporter": "python", 330 | "pygments_lexer": "ipython3", 331 | "version": "3.6.9" 332 | } 333 | }, 334 | "nbformat": 4, 335 | "nbformat_minor": 2 336 | } 337 | -------------------------------------------------------------------------------- /batch/dask/batch.py: -------------------------------------------------------------------------------- 1 | # + 2 | from dask.distributed import Client 3 | from azureml.core import Run 4 | import dask.dataframe as dd 5 | from fsspec.registry import known_implementations 6 | import os, uuid 7 | 8 | c=Client("localhost:8786") 9 | print(c) 10 | 11 | 12 | run = Run.get_context() 13 | ws = run.experiment.workspace 14 | 15 | ds = ws.get_default_datastore() 16 | ACCOUNT_NAME = ds.account_name 17 | ACCOUNT_KEY = ds.account_key 18 | CONTAINER = ds.container_name 19 | 20 | known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'} 21 | STORAGE_OPTIONS={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY} 22 | df = dd.read_csv(f'abfs://{CONTAINER}/nyctaxi/*.csv', 23 | storage_options=STORAGE_OPTIONS, 24 | parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']) 25 | 26 | print(df.head()) 27 | 28 | # list of column names that need to be re-mapped 29 | remap = {} 30 | remap['tpep_pickup_datetime'] = 'pickup_datetime' 31 | 
remap['tpep_dropoff_datetime'] = 'dropoff_datetime' 32 | remap['RatecodeID'] = 'rate_code' 33 | 34 | #create a list of columns & dtypes the df must have 35 | must_haves = { 36 | 'VendorID': 'object', 37 | 'pickup_datetime': 'datetime64[ms]', 38 | 'dropoff_datetime': 'datetime64[ms]', 39 | 'passenger_count': 'int32', 40 | 'trip_distance': 'float32', 41 | 'pickup_longitude': 'float32', 42 | 'pickup_latitude': 'float32', 43 | 'rate_code': 'int32', 44 | 'payment_type': 'int32', 45 | 'dropoff_longitude': 'float32', 46 | 'dropoff_latitude': 'float32', 47 | 'fare_amount': 'float32', 48 | 'tip_amount': 'float32', 49 | 'total_amount': 'float32' 50 | } 51 | 52 | query_frags = [ 53 | 'fare_amount > 0 and fare_amount < 500', 54 | 'passenger_count > 0 and passenger_count < 6', 55 | 'pickup_longitude > -75 and pickup_longitude < -73', 56 | 'dropoff_longitude > -75 and dropoff_longitude < -73', 57 | 'pickup_latitude > 40 and pickup_latitude < 42', 58 | 'dropoff_latitude > 40 and dropoff_latitude < 42' 59 | ] 60 | query = ' and '.join(query_frags) 61 | 62 | # helper function which takes a DataFrame partition 63 | def clean(df_part, remap, must_haves, query): 64 | df_part = df_part.query(query) 65 | 66 | # some col-names include pre-pended spaces remove & lowercase column names 67 | # tmp = {col:col.strip().lower() for col in list(df_part.columns)} 68 | 69 | # rename using the supplied mapping 70 | df_part = df_part.rename(columns=remap) 71 | 72 | # iterate through columns in this df partition 73 | for col in df_part.columns: 74 | # drop anything not in our expected list 75 | if col not in must_haves: 76 | df_part = df_part.drop(col, axis=1) 77 | continue 78 | 79 | if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']: 80 | df_part[col] = df_part[col].astype('datetime64[ms]') 81 | continue 82 | 83 | # if column was read as a string, recast as float 84 | if df_part[col].dtype == 'object': 85 | df_part[col] = df_part[col].str.fillna('-1') 86 | df_part[col] = df_part[col].astype('float32') 87 | else: 88 | # save some memory by using 32 bit floats 89 | if 'int' in str(df_part[col].dtype): 90 | df_part[col] = df_part[col].astype('int32') 91 | if 'float' in str(df_part[col].dtype): 92 | df_part[col] = df_part[col].astype('float32') 93 | df_part[col] = df_part[col].fillna(-1) 94 | 95 | return df_part 96 | 97 | import math 98 | from math import pi 99 | from dask.array import cos, sin, arcsin, sqrt, floor 100 | import numpy as np 101 | 102 | def haversine_distance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude): 103 | x_1 = pi / 180 * pickup_latitude 104 | y_1 = pi / 180 * pickup_longitude 105 | x_2 = pi / 180 * dropoff_latitude 106 | y_2 = pi / 180 * dropoff_longitude 107 | 108 | dlon = y_2 - y_1 109 | dlat = x_2 - x_1 110 | a = sin(dlat / 2)**2 + cos(x_1) * cos(x_2) * sin(dlon / 2)**2 111 | 112 | c = 2 * arcsin(sqrt(a)) 113 | r = 6371 # Radius of earth in kilometers 114 | 115 | return c * r 116 | 117 | def day_of_the_week(day, month, year): 118 | if month < 3: 119 | shift = month 120 | else: 121 | shift = 0 122 | Y = year - (month < 3) 123 | y = Y - 2000 124 | c = 20 125 | d = day 126 | m = month + shift + 1 127 | return (d + floor(m * 2.6) + y + (y // 4) + (c // 4) - 2 * c) % 7 128 | 129 | def add_features(df): 130 | df['hour'] = df['pickup_datetime'].dt.hour.astype('int32') 131 | df['year'] = df['pickup_datetime'].dt.year.astype('int32') 132 | df['month'] = df['pickup_datetime'].dt.month.astype('int32') 133 | df['day'] = 
df['pickup_datetime'].dt.day.astype('int32') 134 | df['day_of_week'] = df['pickup_datetime'].dt.weekday.astype('int32') 135 | 136 | #df['diff'] = df['dropoff_datetime'].astype('int32') - df['pickup_datetime'].astype('int32') 137 | df['diff'] = df['dropoff_datetime'] - df['pickup_datetime'] 138 | 139 | df['pickup_latitude_r'] = (df['pickup_latitude'] // .01 * .01).astype('float32') 140 | df['pickup_longitude_r'] = (df['pickup_longitude'] // .01 * .01).astype('float32') 141 | df['dropoff_latitude_r'] = (df['dropoff_latitude'] // .01 * .01).astype('float32') 142 | df['dropoff_longitude_r'] = (df['dropoff_longitude'] // .01 * .01).astype('float32') 143 | 144 | #df = df.drop('pickup_datetime', axis=1) 145 | #df = df.drop('dropoff_datetime', axis=1) 146 | 147 | #df = df.apply_rows(haversine_distance_kernel, 148 | # incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'], 149 | # outcols=dict(h_distance=np.float32), 150 | # kwargs=dict()) 151 | 152 | import numpy 153 | 154 | df['h_distance'] = haversine_distance(df['pickup_latitude'], 155 | df['pickup_longitude'], 156 | df['dropoff_latitude'], 157 | df['dropoff_longitude']).astype('float32') 158 | 159 | #df = df.apply_rows(day_of_the_week_kernel, 160 | # incols=['day', 'month', 'year'], 161 | # outcols=dict(day_of_week=np.float32), 162 | # kwargs=dict()) 163 | #df['day_of_week'] = numpy.empty(len(df), dtype=np.int32) 164 | #day_of_the_week_kernel(df['day'], 165 | # df['month'], 166 | # df['year'], 167 | # df['day_of_week']) 168 | 169 | 170 | df['is_weekend'] = (df['day_of_week']>5).astype("int32") 171 | return df 172 | 173 | taxi_df = clean(df, remap, must_haves, query) 174 | taxi_df = add_features(taxi_df) 175 | output_uuid = uuid.uuid1().hex 176 | run.log('output_uuid', output_uuid) 177 | output_path = run.get_metrics()['datastore'] + '/output/' + output_uuid + '.parquet' 178 | 179 | print('save parquet to ', output_path) 180 | 181 | taxi_df.to_parquet(output_path) 182 | 183 | print('done') 184 | 185 | os.system('ls -alg ' + output_path) 186 | 187 | print('shutting down cluster') 188 | c.shutdown() 189 | -------------------------------------------------------------------------------- /batch/dask/environment.yml: -------------------------------------------------------------------------------- 1 | name: dask 2 | channels: 3 | - defaults 4 | - conda-forge 5 | dependencies: 6 | - gcsfs 7 | - fs-gcsfs 8 | - jupyterlab 9 | - jupyter-server-proxy 10 | - python=3.6 11 | - numpy 12 | - h5py 13 | - scipy 14 | - toolz 15 | - bokeh 16 | - dask 17 | - distributed 18 | - notebook 19 | - matplotlib 20 | - Pillow 21 | - pandas 22 | - pandas-datareader 23 | - pytables 24 | - scikit-learn 25 | - scikit-image 26 | - snakeviz 27 | - ujson 28 | - graphviz 29 | - pip 30 | - s3fs 31 | - fastparquet 32 | - dask-ml 33 | - pip: 34 | - graphviz 35 | - cachey 36 | - azureml-sdk[notebooks] 37 | - mpi4py 38 | - gym 39 | - adlfs -------------------------------------------------------------------------------- /batch/dask/startDask.py: -------------------------------------------------------------------------------- 1 | # + 2 | from mpi4py import MPI 3 | import os 4 | import argparse 5 | import time 6 | from dask.distributed import Client 7 | from azureml.core import Run 8 | import sys, uuid 9 | import threading 10 | import subprocess 11 | import socket 12 | 13 | from notebook.notebookapp import list_running_servers 14 | 15 | 16 | # - 17 | 18 | def flush(proc, proc_log): 19 | while True: 20 | proc_out = proc.stdout.readline() 21 | if proc_out == 
'' and proc.poll() is not None: 22 | proc_log.close() 23 | break 24 | elif proc_out: 25 | sys.stdout.write(proc_out) 26 | proc_log.write(proc_out) 27 | proc_log.flush() 28 | 29 | 30 | if __name__ == '__main__': 31 | comm = MPI.COMM_WORLD 32 | rank = comm.Get_rank() 33 | 34 | parser = argparse.ArgumentParser() 35 | parser.add_argument("--datastore") 36 | parser.add_argument("--jupyter_token", default=uuid.uuid1().hex) 37 | parser.add_argument("--script") 38 | 39 | args, unparsed = parser.parse_known_args() 40 | 41 | ip = socket.gethostbyname(socket.gethostname()) 42 | 43 | print("- my rank is ", rank) 44 | print("- my ip is ", ip) 45 | 46 | if rank == 0: 47 | data = { 48 | "scheduler" : ip + ":8786", 49 | "dashboard" : ip + ":8787" 50 | } 51 | else: 52 | data = None 53 | 54 | data = comm.bcast(data, root=0) 55 | scheduler = data["scheduler"] 56 | dashboard = data["dashboard"] 57 | print("- scheduler is ", scheduler) 58 | print("- dashboard is ", dashboard) 59 | 60 | 61 | print("args: ", args) 62 | print("unparsed: ", unparsed) 63 | print("- my rank is ", rank) 64 | print("- my ip is ", ip) 65 | 66 | if rank == 0: 67 | Run.get_context().log("headnode", ip) 68 | Run.get_context().log("cluster", 69 | "scheduler: {scheduler}, dashboard: {dashboard}".format(scheduler=scheduler, 70 | dashboard=dashboard)) 71 | Run.get_context().log("datastore", args.datastore) 72 | 73 | cmd = ("jupyter lab --ip 0.0.0.0 --port 8888" + \ 74 | " --NotebookApp.token={token}" + \ 75 | " --allow-root --no-browser").format(token=args.jupyter_token) 76 | jupyter_log = open("jupyter_log.txt", "a") 77 | jupyter_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 78 | 79 | jupyter_flush = threading.Thread(target=flush, args=(jupyter_proc, jupyter_log)) 80 | jupyter_flush.start() 81 | 82 | while not list(list_running_servers()): 83 | time.sleep(5) 84 | 85 | jupyter_servers = list(list_running_servers()) 86 | assert (len(jupyter_servers) == 1), "more than one jupyter server is running" 87 | 88 | Run.get_context().log("jupyter", 89 | "ip: {ip_addr}, port: {port}".format(ip_addr=ip, port=jupyter_servers[0]["port"])) 90 | Run.get_context().log("jupyter-token", jupyter_servers[0]["token"]) 91 | 92 | cmd = "dask-scheduler " + "--port " + scheduler.split(":")[1] + " --dashboard-address " + dashboard 93 | scheduler_log = open("scheduler_log.txt", "w") 94 | scheduler_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 95 | 96 | cmd = "dask-worker " + scheduler 97 | worker_log = open("worker_{rank}_log.txt".format(rank=rank), "w") 98 | worker_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 99 | 100 | worker_flush = threading.Thread(target=flush, args=(worker_proc, worker_log)) 101 | worker_flush.start() 102 | 103 | if(args.script): 104 | command_line = ' '.join(['python', args.script]+unparsed) 105 | print('Launching:', command_line) 106 | exit_code = os.system(command_line) 107 | print('process ended with code', exit_code) 108 | print('killing scheduler, worker and jupyter') 109 | jupyter_proc.kill() 110 | scheduler_proc.kill() 111 | worker_proc.kill() 112 | exit(exit_code) 113 | else: 114 | flush(scheduler_proc, scheduler_log) 115 | else: 116 | cmd = "dask-worker " + scheduler 117 | worker_log = open("worker_{rank}_log.txt".format(rank=rank), "w") 118 | worker_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, 
stderr=subprocess.STDOUT) 119 | 120 | flush(worker_proc, worker_log) 121 | -------------------------------------------------------------------------------- /img/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/1.png -------------------------------------------------------------------------------- /img/10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/10.png -------------------------------------------------------------------------------- /img/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/2.png -------------------------------------------------------------------------------- /img/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/3.png -------------------------------------------------------------------------------- /img/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/4.png -------------------------------------------------------------------------------- /img/5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/5.png -------------------------------------------------------------------------------- /img/6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/6.png -------------------------------------------------------------------------------- /img/7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/7.png -------------------------------------------------------------------------------- /img/8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/8.png -------------------------------------------------------------------------------- /img/9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/9.png -------------------------------------------------------------------------------- /img/bokeh.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/bokeh.png -------------------------------------------------------------------------------- /img/compute_nodes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/compute_nodes.png 
-------------------------------------------------------------------------------- /img/create_cluster.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/create_cluster.png -------------------------------------------------------------------------------- /img/dask-status.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/dask-status.gif -------------------------------------------------------------------------------- /img/network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/network.png -------------------------------------------------------------------------------- /interactive/LoadDataFromDatastore.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Loading your Data from an AzureML Datastore\n", 8 | "\n", 9 | "**Important**: Make sure to execute the steps to start the cluster in the notebook [StartDask.ipynb](StartDask.ipynb) before running this noteboook." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 25, 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/plain": [ 20 | "'1.0.74'" 21 | ] 22 | }, 23 | "execution_count": 25, 24 | "metadata": {}, 25 | "output_type": "execute_result" 26 | } 27 | ], 28 | "source": [ 29 | "from azureml.core import Workspace, Experiment\n", 30 | "from azureml.core import VERSION\n", 31 | "import time\n", 32 | "VERSION" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "### Uploading the data to the AzureML Datastore\n", 40 | "AzureML has the concept of a Datastore that can be mounted to a job, so you script does not have to deal with reading from Azure Blobstorage. First, let's download some data and upload it to the blob store, so we can play with it in Dask\n", 41 | "(parts of this code originates from https://github.com/dask/dask-tutorial)." 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 26, 47 | "metadata": {}, 48 | "outputs": [ 49 | { 50 | "name": "stdout", 51 | "output_type": "stream", 52 | "text": [ 53 | "- Uploading flight data... \n", 54 | "Uploading an estimated of 10 files\n", 55 | "Target already exists. Skipping upload for nycflights/1990.csv\n", 56 | "Target already exists. Skipping upload for nycflights/1991.csv\n", 57 | "Target already exists. Skipping upload for nycflights/1992.csv\n", 58 | "Target already exists. Skipping upload for nycflights/1993.csv\n", 59 | "Target already exists. Skipping upload for nycflights/1994.csv\n", 60 | "Target already exists. Skipping upload for nycflights/1995.csv\n", 61 | "Target already exists. Skipping upload for nycflights/1996.csv\n", 62 | "Target already exists. Skipping upload for nycflights/1997.csv\n", 63 | "Target already exists. Skipping upload for nycflights/1998.csv\n", 64 | "Target already exists. Skipping upload for nycflights/1999.csv\n", 65 | "Uploaded 0 files\n", 66 | "** Finished! 
**\n" 67 | ] 68 | } 69 | ], 70 | "source": [ 71 | "import os\n", 72 | "import tarfile\n", 73 | "import urllib.request\n", 74 | "\n", 75 | "\n", 76 | "cwd = os.getcwd()\n", 77 | "\n", 78 | "data_dir = os.path.abspath(os.path.join(cwd, 'data'))\n", 79 | "if not os.path.exists(data_dir):\n", 80 | " os.makedirs('data')\n", 81 | "\n", 82 | "flights_raw = os.path.join(data_dir, 'nycflights.tar.gz')\n", 83 | "flightdir = os.path.join(data_dir, 'nycflights')\n", 84 | "\n", 85 | "if not os.path.exists(flights_raw):\n", 86 | " print(\"- Downloading NYC Flights dataset... \", end='', flush=True)\n", 87 | " url = \"https://storage.googleapis.com/dask-tutorial-data/nycflights.tar.gz\"\n", 88 | " urllib.request.urlretrieve(url, flights_raw)\n", 89 | " print(\"done\", flush=True)\n", 90 | "\n", 91 | "if not os.path.exists(flightdir):\n", 92 | " print(\"- Extracting flight data... \", end='', flush=True)\n", 93 | " tar_path = os.path.join(data_dir, 'nycflights.tar.gz')\n", 94 | " with tarfile.open(tar_path, mode='r:gz') as flights:\n", 95 | " flights.extractall('data/')\n", 96 | " print(\"done\", flush=True)\n", 97 | "\n", 98 | " \n", 99 | "print(\"- Uploading flight data... \")\n", 100 | "ws = Workspace.from_config()\n", 101 | "ds = ws.get_default_datastore()\n", 102 | "\n", 103 | "ds.upload(src_dir=flightdir,\n", 104 | " target_path='nycflights',\n", 105 | " show_progress=True)\n", 106 | "\n", 107 | "print(\"** Finished! **\")" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "### Using the Datastore on the Dask cluster\n", 115 | "\n", 116 | "Now, lets make use of the data on the Dask cluster you created in [StartDask.ipynb](StartDask.ipynb).\n", 117 | "You might have noticed that we launched the cluster with a --data parameter which instructed AzureML to mount the workspace's default Datastore onto all the workers of the cluster.\n", 118 | "\n", 119 | "```\n", 120 | "est = Estimator('dask', \n", 121 | " compute_target=dask_cluster, \n", 122 | " entry_script='startDask.py', \n", 123 | " conda_dependencies_file_path='environment.yml', \n", 124 | " script_params=\n", 125 | " {'--data': ws.get_default_datastore()},\n", 126 | " node_count=10,\n", 127 | " distributed_training=mpi_configuration)\n", 128 | "```\n", 129 | "\n", 130 | "At this time the local path on the compute is not determined, but it will be once the job starts. We therefore log the path back to the run history from which we can now retrieve it." 
131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 27, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "data": { 140 | "text/plain": [ 141 | "'/mnt/batch/tasks/shared/LS_root/jobs/rapids/azureml/dask_1573618861_52e71160/mounts/workspaceblobstore/nycflights'" 142 | ] 143 | }, 144 | "execution_count": 27, 145 | "metadata": {}, 146 | "output_type": "execute_result" 147 | } 148 | ], 149 | "source": [ 150 | "## get the last run on the dask experiment which should be running \n", 151 | "## our dask cluster, and retrieve the data path from it\n", 152 | "ws = Workspace.from_config()\n", 153 | "exp = ws.experiments['dask']\n", 154 | "cluster_run = exp.get_runs().__next__()\n", 155 | "\n", 156 | "if (not cluster_run.status == 'Running'):\n", 157 | " raise Exception('Cluster should be in state \\'Running\\'')\n", 158 | "\n", 159 | "data_path = cluster_run.get_metrics()['data'] + '/nycflights'\n", 160 | "data_path" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 28, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "data": { 170 | "text/html": [ 171 | "\n", 172 | "\n", 173 | "\n", 180 | "\n", 188 | "\n", 189 | "
\n", 174 | "

Client

\n", 175 | "\n", 179 | "
\n", 181 | "

Cluster

\n", 182 | "
    \n", 183 | "
  • Workers: 39
  • \n", 184 | "
  • Cores: 39
  • \n", 185 | "
  • Memory: 284.09 GB
  • \n", 186 | "
\n", 187 | "
" 190 | ], 191 | "text/plain": [ 192 | "" 193 | ] 194 | }, 195 | "execution_count": 28, 196 | "metadata": {}, 197 | "output_type": "execute_result" 198 | } 199 | ], 200 | "source": [ 201 | "# Get the dask cluster\n", 202 | "from dask.distributed import Client\n", 203 | "\n", 204 | "c = Client('tcp://localhost:8786')\n", 205 | "c" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 16, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "# create a dask dataframe that loads the data from the path on the cluster\n", 215 | "import dask.dataframe as dd\n", 216 | "from dask import delayed\n", 217 | "\n", 218 | "def load_data(path):\n", 219 | " df = dd.read_csv(path + '/*.csv',\n", 220 | " parse_dates={'Date': [0, 1, 2]},\n", 221 | " dtype={'TailNum': str,\n", 222 | " 'CRSElapsedTime': float,\n", 223 | " 'Cancelled': bool}) \n", 224 | " return df" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 17, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "# we need to delay the excution of the read to make sure the path \n", 234 | "# evaluated on the cluster, not the client\n", 235 | "df = delayed(load_data)(data_path).compute()" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 18, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "name": "stdout", 245 | "output_type": "stream", 246 | "text": [ 247 | "2611892\n" 248 | ] 249 | }, 250 | { 251 | "data": { 252 | "text/html": [ 253 | "
\n", 254 | "\n", 267 | "\n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | "
DateDayOfWeekDepTimeCRSDepTimeArrTimeCRSArrTimeUniqueCarrierFlightNumTailNumActualElapsedTime...AirTimeArrDelayDepDelayOriginDestDistanceTaxiInTaxiOutCancelledDiverted
01990-01-0111621.015401747.01701US33NaN86.0...NaN46.041.0EWRPIT319.0NaNNaNFalse0
11990-01-0221547.015401700.01701US33NaN73.0...NaN-1.07.0EWRPIT319.0NaNNaNFalse0
21990-01-0331546.015401710.01701US33NaN84.0...NaN9.06.0EWRPIT319.0NaNNaNFalse0
31990-01-0441542.015401710.01701US33NaN88.0...NaN9.02.0EWRPIT319.0NaNNaNFalse0
41990-01-0551549.015401706.01701US33NaN77.0...NaN5.09.0EWRPIT319.0NaNNaNFalse0
\n", 417 | "

5 rows × 21 columns

\n", 418 | "
" 419 | ], 420 | "text/plain": [ 421 | " Date DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime \\\n", 422 | "0 1990-01-01 1 1621.0 1540 1747.0 1701 \n", 423 | "1 1990-01-02 2 1547.0 1540 1700.0 1701 \n", 424 | "2 1990-01-03 3 1546.0 1540 1710.0 1701 \n", 425 | "3 1990-01-04 4 1542.0 1540 1710.0 1701 \n", 426 | "4 1990-01-05 5 1549.0 1540 1706.0 1701 \n", 427 | "\n", 428 | " UniqueCarrier FlightNum TailNum ActualElapsedTime ... AirTime ArrDelay \\\n", 429 | "0 US 33 NaN 86.0 ... NaN 46.0 \n", 430 | "1 US 33 NaN 73.0 ... NaN -1.0 \n", 431 | "2 US 33 NaN 84.0 ... NaN 9.0 \n", 432 | "3 US 33 NaN 88.0 ... NaN 9.0 \n", 433 | "4 US 33 NaN 77.0 ... NaN 5.0 \n", 434 | "\n", 435 | " DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled Diverted \n", 436 | "0 41.0 EWR PIT 319.0 NaN NaN False 0 \n", 437 | "1 7.0 EWR PIT 319.0 NaN NaN False 0 \n", 438 | "2 6.0 EWR PIT 319.0 NaN NaN False 0 \n", 439 | "3 2.0 EWR PIT 319.0 NaN NaN False 0 \n", 440 | "4 9.0 EWR PIT 319.0 NaN NaN False 0 \n", 441 | "\n", 442 | "[5 rows x 21 columns]" 443 | ] 444 | }, 445 | "execution_count": 18, 446 | "metadata": {}, 447 | "output_type": "execute_result" 448 | } 449 | ], 450 | "source": [ 451 | "# now run some interactive queries\n", 452 | "print(len(df))\n", 453 | "df.head()" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 19, 459 | "metadata": {}, 460 | "outputs": [ 461 | { 462 | "data": { 463 | "text/plain": [ 464 | "0 EWR\n", 465 | "1 LGA\n", 466 | "2 JFK\n", 467 | "Name: Origin, dtype: object" 468 | ] 469 | }, 470 | "execution_count": 19, 471 | "metadata": {}, 472 | "output_type": "execute_result" 473 | } 474 | ], 475 | "source": [ 476 | "df.Origin.unique().compute()" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 20, 482 | "metadata": {}, 483 | "outputs": [ 484 | { 485 | "data": { 486 | "text/plain": [ 487 | "Origin\n", 488 | "EWR 876.278885\n", 489 | "JFK 1484.209596\n", 490 | "LGA 712.546238\n", 491 | "Name: Distance, dtype: float64" 492 | ] 493 | }, 494 | "execution_count": 20, 495 | "metadata": {}, 496 | "output_type": "execute_result" 497 | } 498 | ], 499 | "source": [ 500 | "df.groupby('Origin').Distance.mean().compute()" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": 21, 506 | "metadata": {}, 507 | "outputs": [ 508 | { 509 | "data": { 510 | "text/plain": [ 511 | "Origin\n", 512 | "EWR 1139451\n", 513 | "JFK 427243\n", 514 | "LGA 974267\n", 515 | "Name: Origin, dtype: int64" 516 | ] 517 | }, 518 | "execution_count": 21, 519 | "metadata": {}, 520 | "output_type": "execute_result" 521 | } 522 | ], 523 | "source": [ 524 | "df[~df.Cancelled].groupby('Origin').Origin.count().compute()" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": 22, 530 | "metadata": {}, 531 | "outputs": [ 532 | { 533 | "data": { 534 | "text/plain": [ 535 | "Dest\n", 536 | "ORD 219060\n", 537 | "BOS 145105\n", 538 | "ATL 128855\n", 539 | "MIA 111001\n", 540 | "LAX 109848\n", 541 | " ... 
\n", 542 | "JFK 6\n", 543 | "CRP 2\n", 544 | "TUS 2\n", 545 | "ABQ 1\n", 546 | "STX 1\n", 547 | "Name: FlightNum, Length: 99, dtype: int64" 548 | ] 549 | }, 550 | "execution_count": 22, 551 | "metadata": {}, 552 | "output_type": "execute_result" 553 | } 554 | ], 555 | "source": [ 556 | "dest = df[~df.Cancelled].groupby('Dest').FlightNum.count().compute()\n", 557 | "dest.sort_values(ascending=False)" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": null, 563 | "metadata": {}, 564 | "outputs": [], 565 | "source": [] 566 | } 567 | ], 568 | "metadata": { 569 | "kernelspec": { 570 | "display_name": "Python (dask)", 571 | "language": "python", 572 | "name": "dask" 573 | }, 574 | "language_info": { 575 | "codemirror_mode": { 576 | "name": "ipython", 577 | "version": 3 578 | }, 579 | "file_extension": ".py", 580 | "mimetype": "text/x-python", 581 | "name": "python", 582 | "nbconvert_exporter": "python", 583 | "pygments_lexer": "ipython3", 584 | "version": "3.6.9" 585 | } 586 | }, 587 | "nbformat": 4, 588 | "nbformat_minor": 2 589 | } 590 | -------------------------------------------------------------------------------- /interactive/dask/DaskNYCTaxi.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Parallelize Pandas with Dask.dataframe\n" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import dask\n", 17 | "from dask.distributed import Client, progress\n", 18 | "from dask import delayed\n", 19 | "df = None\n", 20 | "c = Client('tcp://localhost:8786')\n", 21 | "c.restart()\n", 22 | "c" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "from azureml.core import Workspace, Run\n", 32 | "import os\n", 33 | "run = Run.get_context()\n", 34 | "ws = run.experiment.workspace\n", 35 | "\n", 36 | "## or load directly through blob file system\n", 37 | "# using https://github.com/dask/adlfs -- still pretty beta, \n", 38 | "# throws an error message, but seesm to work\n", 39 | "ds = ws.get_default_datastore()\n", 40 | "ACCOUNT_NAME = ds.account_name\n", 41 | "ACCOUNT_KEY = ds.account_key\n", 42 | "CONTAINER = ds.container_name\n", 43 | "import dask.dataframe as dd\n", 44 | "from fsspec.registry import known_implementations\n", 45 | "known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}\n", 46 | "STORAGE_OPTIONS={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}\n", 47 | "df = dd.read_csv(f'abfs://{CONTAINER}/nyctaxi/2015/*.csv', \n", 48 | " storage_options=STORAGE_OPTIONS,\n", 49 | " parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "# enable this code path instead of the above if you run into\n", 59 | "# any issues with the AzureBlobFileSystem (https://github.com/dask/adlfs)\n", 60 | "# this will load the data from the workspace blob storage mounted via blobFUSE\n", 61 | "if False:\n", 62 | " from azureml.core import Workspace\n", 63 | " ## get the last run on the dask experiment which should be running \n", 64 | " ## our dask cluster, and retrieve the data path from it\n", 65 | " ws = Workspace.from_config()\n", 66 | " exp = ws.experiments['dask']\n", 67 | " run = None\n", 68 | " for run in 
ws.experiments['dask'].get_runs():\n", 69 | " if run.get_status() == \"Running\":\n", 70 | " cluster_run = run\n", 71 | " break;\n", 72 | "\n", 73 | " if (run == None):\n", 74 | " raise Exception('Cluster should be in state \\'Running\\'')\n", 75 | "\n", 76 | " data_path = cluster_run.get_metrics()['datastore'] + '/nyctaxi'\n", 77 | "\n", 78 | "\n", 79 | " import dask\n", 80 | " import dask.dataframe as dd\n", 81 | " from dask import delayed\n", 82 | "\n", 83 | " def load_data(path):\n", 84 | " return dd.read_csv(path, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])\n", 85 | "\n", 86 | " data_2015 = data_path + '/2015'\n", 87 | " data_2015_csv = data_2015 + '/*.csv'\n", 88 | " df = delayed(load_data)(data_2015_csv).compute()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "# fall back to this path if neither of the above paths have been enabled\n", 98 | "if df is None:\n", 99 | " ## or in this case straight from GOOGLE Storage\n", 100 | " import dask.dataframe as dd\n", 101 | " df = dd.read_csv('gcs://anaconda-public-data/nyc-taxi/csv/2015/yellow_*.csv',\n", 102 | " storage_options={'token': 'anon'}, \n", 103 | " parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])\n" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": { 110 | "scrolled": false 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "%time len(df)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "df.partitions" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "%time df.map_partitions(len).compute().sum()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "\n", 140 | "Dask DataFrames\n", 141 | "---------------\n", 142 | "\n", 143 | "* Coordinate many Pandas DataFrames across a cluster\n", 144 | "* Faithfully implement a subset of the Pandas API\n", 145 | "* Use Pandas under the hood (for speed and maturity)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "df" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "scrolled": true 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "df.dtypes" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "# list of column names that need to be re-mapped\n", 175 | "remap = {}\n", 176 | "remap['tpep_pickup_datetime'] = 'pickup_datetime'\n", 177 | "remap['tpep_dropoff_datetime'] = 'dropoff_datetime'\n", 178 | "remap['RatecodeID'] = 'rate_code'\n", 179 | "\n", 180 | "#create a list of columns & dtypes the df must have\n", 181 | "must_haves = {\n", 182 | " 'VendorID': 'object',\n", 183 | " 'pickup_datetime': 'datetime64[ms]',\n", 184 | " 'dropoff_datetime': 'datetime64[ms]',\n", 185 | " 'passenger_count': 'int32',\n", 186 | " 'trip_distance': 'float32',\n", 187 | " 'pickup_longitude': 'float32',\n", 188 | " 'pickup_latitude': 'float32',\n", 189 | " 'rate_code': 'int32',\n", 190 | " 'payment_type': 'int32',\n", 191 | " 'dropoff_longitude': 'float32',\n", 192 | " 'dropoff_latitude': 'float32',\n", 193 | " 'fare_amount': 
'float32',\n", 194 | " 'tip_amount': 'float32',\n", 195 | " 'total_amount': 'float32'\n", 196 | "}\n", 197 | "\n", 198 | "query_frags = [\n", 199 | " 'fare_amount > 0 and fare_amount < 500',\n", 200 | " 'passenger_count > 0 and passenger_count < 6',\n", 201 | " 'pickup_longitude > -75 and pickup_longitude < -73',\n", 202 | " 'dropoff_longitude > -75 and dropoff_longitude < -73',\n", 203 | " 'pickup_latitude > 40 and pickup_latitude < 42',\n", 204 | " 'dropoff_latitude > 40 and dropoff_latitude < 42'\n", 205 | "]\n", 206 | "query = ' and '.join(query_frags)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "# helper function which takes a DataFrame partition\n", 216 | "def clean(df_part, remap, must_haves, query): \n", 217 | " df_part = df_part.query(query)\n", 218 | " \n", 219 | " # some col-names include pre-pended spaces remove & lowercase column names\n", 220 | " # tmp = {col:col.strip().lower() for col in list(df_part.columns)}\n", 221 | "\n", 222 | " # rename using the supplied mapping\n", 223 | " df_part = df_part.rename(columns=remap)\n", 224 | " \n", 225 | " # iterate through columns in this df partition\n", 226 | " for col in df_part.columns:\n", 227 | " # drop anything not in our expected list\n", 228 | " if col not in must_haves:\n", 229 | " df_part = df_part.drop(col, axis=1)\n", 230 | " continue\n", 231 | "\n", 232 | " if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:\n", 233 | " df_part[col] = df_part[col].astype('datetime64[ms]')\n", 234 | " continue\n", 235 | " \n", 236 | " # if column was read as a string, recast as float\n", 237 | " if df_part[col].dtype == 'object':\n", 238 | " df_part[col] = df_part[col].str.fillna('-1')\n", 239 | " df_part[col] = df_part[col].astype('float32')\n", 240 | " else:\n", 241 | " # save some memory by using 32 bit floats\n", 242 | " if 'int' in str(df_part[col].dtype):\n", 243 | " df_part[col] = df_part[col].astype('int32')\n", 244 | " if 'float' in str(df_part[col].dtype):\n", 245 | " df_part[col] = df_part[col].astype('float32')\n", 246 | " df_part[col] = df_part[col].fillna(-1)\n", 247 | " \n", 248 | " return df_part" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": null, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "taxi_df = clean(df, remap, must_haves, query)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "import math\n", 267 | "from math import pi\n", 268 | "from dask.array import cos, sin, arcsin, sqrt, floor\n", 269 | "import numpy as np\n", 270 | "\n", 271 | "def haversine_distance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude):\n", 272 | " x_1 = pi / 180 * pickup_latitude\n", 273 | " y_1 = pi / 180 * pickup_longitude\n", 274 | " x_2 = pi / 180 * dropoff_latitude\n", 275 | " y_2 = pi / 180 * dropoff_longitude\n", 276 | "\n", 277 | " dlon = y_2 - y_1\n", 278 | " dlat = x_2 - x_1\n", 279 | " a = sin(dlat / 2)**2 + cos(x_1) * cos(x_2) * sin(dlon / 2)**2\n", 280 | "\n", 281 | " c = 2 * arcsin(sqrt(a)) \n", 282 | " r = 6371 # Radius of earth in kilometers\n", 283 | "\n", 284 | " return c * r\n", 285 | "\n", 286 | "def day_of_the_week(day, month, year):\n", 287 | " if month < 3:\n", 288 | " shift = month\n", 289 | " else:\n", 290 | " shift = 0\n", 291 | " Y = year - (month < 3)\n", 292 | " y = Y - 2000\n", 293 | " c = 20\n", 294 
| " d = day\n", 295 | " m = month + shift + 1\n", 296 | " return (d + floor(m * 2.6) + y + (y // 4) + (c // 4) - 2 * c) % 7\n", 297 | " \n", 298 | "def add_features(df):\n", 299 | " df['hour'] = df['pickup_datetime'].dt.hour.astype('int32')\n", 300 | " df['year'] = df['pickup_datetime'].dt.year.astype('int32')\n", 301 | " df['month'] = df['pickup_datetime'].dt.month.astype('int32')\n", 302 | " df['day'] = df['pickup_datetime'].dt.day.astype('int32')\n", 303 | " df['day_of_week'] = df['pickup_datetime'].dt.weekday.astype('int32')\n", 304 | " \n", 305 | " #df['diff'] = df['dropoff_datetime'].astype('int32') - df['pickup_datetime'].astype('int32')\n", 306 | " df['diff'] = df['dropoff_datetime'] - df['pickup_datetime']\n", 307 | " \n", 308 | " df['pickup_latitude_r'] = (df['pickup_latitude'] // .01 * .01).astype('float32')\n", 309 | " df['pickup_longitude_r'] = (df['pickup_longitude'] // .01 * .01).astype('float32')\n", 310 | " df['dropoff_latitude_r'] = (df['dropoff_latitude'] // .01 * .01).astype('float32')\n", 311 | " df['dropoff_longitude_r'] = (df['dropoff_longitude'] // .01 * .01).astype('float32')\n", 312 | " \n", 313 | " #df = df.drop('pickup_datetime', axis=1)\n", 314 | " #df = df.drop('dropoff_datetime', axis=1)\n", 315 | "\n", 316 | " #df = df.apply_rows(haversine_distance_kernel,\n", 317 | " # incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],\n", 318 | " # outcols=dict(h_distance=np.float32),\n", 319 | " # kwargs=dict())\n", 320 | "\n", 321 | " import numpy\n", 322 | "\n", 323 | " df['h_distance'] = haversine_distance(df['pickup_latitude'], \n", 324 | " df['pickup_longitude'], \n", 325 | " df['dropoff_latitude'], \n", 326 | " df['dropoff_longitude']).astype('float32')\n", 327 | "\n", 328 | " #df = df.apply_rows(day_of_the_week_kernel,\n", 329 | " # incols=['day', 'month', 'year'],\n", 330 | " # outcols=dict(day_of_week=np.float32),\n", 331 | " # kwargs=dict())\n", 332 | " #df['day_of_week'] = numpy.empty(len(df), dtype=np.int32)\n", 333 | " #day_of_the_week_kernel(df['day'],\n", 334 | " # df['month'],\n", 335 | " # df['year'],\n", 336 | " # df['day_of_week'])\n", 337 | " \n", 338 | " \n", 339 | " df['is_weekend'] = (df['day_of_week']>5).astype(\"int32\")\n", 340 | " return df" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "taxi_df = add_features(taxi_df)\n", 350 | "taxi_df.dtypes" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [ 359 | "taxi_df = taxi_df.persist()\n", 360 | "progress(taxi_df)" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "%time len(taxi_df)" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": {}, 376 | "outputs": [], 377 | "source": [ 378 | "%time taxi_df.passenger_count.sum().compute()" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": {}, 385 | "outputs": [], 386 | "source": [ 387 | "# Compute average trip distance grouped by passenger count\n", 388 | "taxi_df.groupby('passenger_count').trip_distance.mean().compute()" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "### Tip Fraction, grouped by day-of-week and hour-of-day" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 
400 | "execution_count": null, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "df2 = taxi_df[(taxi_df.tip_amount > 0) & (taxi_df.fare_amount > 0)]\n", 405 | "df2['tip_fraction'] = df2.tip_amount / df2.fare_amount" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "# Group df.tpep_pickup_datetime by dayofweek and hour\n", 415 | "dayofweek = df2.groupby(df2.pickup_datetime.dt.dayofweek).tip_fraction.mean() \n", 416 | "hour = df2.groupby(df2.pickup_datetime.dt.hour).tip_fraction.mean()\n", 417 | "\n", 418 | "dayofweek, hour = dask.persist(dayofweek, hour)\n", 419 | "progress(dayofweek, hour)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "### Plot results\n", 427 | "\n", 428 | "This requires matplotlib to be installed" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": {}, 435 | "outputs": [], 436 | "source": [ 437 | "%matplotlib inline" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [ 446 | "hour.compute().plot(figsize=(10, 6), title='Tip Fraction by Hour')" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": null, 452 | "metadata": {}, 453 | "outputs": [], 454 | "source": [ 455 | "dayofweek.compute().plot(figsize=(10, 6), title='Tip Fraction by Day of Week')" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": {}, 462 | "outputs": [], 463 | "source": [ 464 | "import pandas as pd\n", 465 | "%matplotlib inline\n", 466 | "taxi_df.groupby('passenger_count').fare_amount.mean().compute().sort_index().plot(legend=True)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": null, 472 | "metadata": {}, 473 | "outputs": [], 474 | "source": [ 475 | "taxi_df.groupby(taxi_df.passenger_count).trip_distance.mean().compute().plot(legend=True)" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": null, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": {}, 489 | "outputs": [], 490 | "source": [] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": null, 495 | "metadata": {}, 496 | "outputs": [], 497 | "source": [ 498 | "by_payment = taxi_df.groupby(taxi_df.payment_type).fare_amount.count().compute()\n", 499 | "by_payment.index = by_payment.index.map({1: 'Credit card',\n", 500 | " 2: 'Cash',\n", 501 | " 3: 'No charge',\n", 502 | " 4: 'Dispute',\n", 503 | " 5: 'Unknown',\n", 504 | " 6: 'Voided trip'})" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "by_payment.plot(legend=True, kind='bar')\n" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "### Let's save the transformed dataset back to blob" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": null, 526 | "metadata": {}, 527 | "outputs": [], 528 | "source": [ 529 | "import uuid\n", 530 | "output_uuid = uuid.uuid1().hex\n", 531 | "run.log('output_uuid', output_uuid)\n", 532 | "\n", 533 | "output_path = run.get_metrics()['datastore'] + '/output/' + output_uuid + '.parquet'\n", 534 | "\n", 535 | "print('save parquet to ', 
output_path)\n", 536 | "\n", 537 | "taxi_df.to_parquet(output_path)\n", 538 | "\n", 539 | "print('done')" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": null, 545 | "metadata": { 546 | "scrolled": false 547 | }, 548 | "outputs": [], 549 | "source": [ 550 | "import dask\n", 551 | "import dask.dataframe as dd\n", 552 | "\n", 553 | "df = dd.read_parquet(output_path)\n" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": null, 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "df.head()" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": null, 568 | "metadata": {}, 569 | "outputs": [], 570 | "source": [] 571 | } 572 | ], 573 | "metadata": { 574 | "kernelspec": { 575 | "display_name": "Python (dask)", 576 | "language": "python", 577 | "name": "dask" 578 | }, 579 | "language_info": { 580 | "codemirror_mode": { 581 | "name": "ipython", 582 | "version": 3 583 | }, 584 | "file_extension": ".py", 585 | "mimetype": "text/x-python", 586 | "name": "python", 587 | "nbconvert_exporter": "python", 588 | "pygments_lexer": "ipython3", 589 | "version": "3.6.9" 590 | } 591 | }, 592 | "nbformat": 4, 593 | "nbformat_minor": 2 594 | } 595 | -------------------------------------------------------------------------------- /interactive/dask/environment.yml: -------------------------------------------------------------------------------- 1 | name: dask 2 | channels: 3 | - defaults 4 | - conda-forge 5 | dependencies: 6 | - gcsfs 7 | - fs-gcsfs 8 | - jupyterlab 9 | - jupyter-server-proxy 10 | - python=3.6 11 | - numpy 12 | - h5py 13 | - scipy 14 | - toolz 15 | - bokeh 16 | - dask 17 | - distributed 18 | - notebook 19 | - matplotlib 20 | - Pillow 21 | - pandas 22 | - pandas-datareader 23 | - pytables 24 | - scikit-learn 25 | - scikit-image 26 | - snakeviz 27 | - ujson 28 | - graphviz 29 | - pip 30 | - s3fs 31 | - fastparquet 32 | - dask-ml 33 | - pip: 34 | - graphviz 35 | - cachey 36 | - azureml-sdk[notebooks] 37 | - mpi4py 38 | - gym 39 | - adlfs -------------------------------------------------------------------------------- /interactive/dask/startDask.py: -------------------------------------------------------------------------------- 1 | # + 2 | from mpi4py import MPI 3 | import os 4 | import argparse 5 | import time 6 | from dask.distributed import Client 7 | from azureml.core import Run 8 | import sys, uuid 9 | import threading 10 | import subprocess 11 | import socket 12 | 13 | from notebook.notebookapp import list_running_servers 14 | 15 | 16 | # - 17 | 18 | def flush(proc, proc_log): 19 | while True: 20 | proc_out = proc.stdout.readline() 21 | if proc_out == '' and proc.poll() is not None: 22 | proc_log.close() 23 | break 24 | elif proc_out: 25 | sys.stdout.write(proc_out) 26 | proc_log.write(proc_out) 27 | proc_log.flush() 28 | 29 | 30 | if __name__ == '__main__': 31 | comm = MPI.COMM_WORLD 32 | rank = comm.Get_rank() 33 | 34 | parser = argparse.ArgumentParser() 35 | parser.add_argument("--datastore") 36 | parser.add_argument("--jupyter_token", default=uuid.uuid1().hex) 37 | parser.add_argument("--script") 38 | 39 | args, unparsed = parser.parse_known_args() 40 | 41 | ip = socket.gethostbyname(socket.gethostname()) 42 | 43 | print("- my rank is ", rank) 44 | print("- my ip is ", ip) 45 | 46 | if rank == 0: 47 | data = { 48 | "scheduler" : ip + ":8786", 49 | "dashboard" : ip + ":8787" 50 | } 51 | else: 52 | data = None 53 | 54 | data = comm.bcast(data, root=0) 55 | scheduler = data["scheduler"] 56 | dashboard = 
data["dashboard"] 57 | print("- scheduler is ", scheduler) 58 | print("- dashboard is ", dashboard) 59 | 60 | 61 | print("args: ", args) 62 | print("unparsed: ", unparsed) 63 | print("- my rank is ", rank) 64 | print("- my ip is ", ip) 65 | 66 | if rank == 0: 67 | Run.get_context().log("headnode", ip) 68 | Run.get_context().log("cluster", 69 | "scheduler: {scheduler}, dashboard: {dashboard}".format(scheduler=scheduler, 70 | dashboard=dashboard)) 71 | Run.get_context().log("datastore", args.datastore) 72 | 73 | cmd = ("jupyter lab --ip 0.0.0.0 --port 8888" + \ 74 | " --NotebookApp.token={token}" + \ 75 | " --allow-root --no-browser").format(token=args.jupyter_token) 76 | jupyter_log = open("jupyter_log.txt", "a") 77 | jupyter_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 78 | 79 | jupyter_flush = threading.Thread(target=flush, args=(jupyter_proc, jupyter_log)) 80 | jupyter_flush.start() 81 | 82 | while not list(list_running_servers()): 83 | time.sleep(5) 84 | 85 | jupyter_servers = list(list_running_servers()) 86 | assert (len(jupyter_servers) == 1), "more than one jupyter server is running" 87 | 88 | Run.get_context().log("jupyter", 89 | "ip: {ip_addr}, port: {port}".format(ip_addr=ip, port=jupyter_servers[0]["port"])) 90 | Run.get_context().log("jupyter-token", jupyter_servers[0]["token"]) 91 | 92 | cmd = "dask-scheduler " + "--port " + scheduler.split(":")[1] + " --dashboard-address " + dashboard 93 | scheduler_log = open("scheduler_log.txt", "w") 94 | scheduler_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 95 | 96 | cmd = "dask-worker " + scheduler 97 | worker_log = open("worker_{rank}_log.txt".format(rank=rank), "w") 98 | worker_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 99 | 100 | worker_flush = threading.Thread(target=flush, args=(worker_proc, worker_log)) 101 | worker_flush.start() 102 | 103 | if(args.script): 104 | command_line = ' '.join(['python', args.script]+unparsed) 105 | print('Launching:', command_line) 106 | exit_code = os.system(command_line) 107 | print('process ended with code', exit_code) 108 | print('killing scheduler, worker and jupyter') 109 | jupyter_proc.kill() 110 | scheduler_proc.kill() 111 | worker_proc.kill() 112 | exit(exit_code) 113 | else: 114 | flush(scheduler_proc, scheduler_log) 115 | else: 116 | cmd = "dask-worker " + scheduler 117 | worker_log = open("worker_{rank}_log.txt".format(rank=rank), "w") 118 | worker_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 119 | 120 | flush(worker_proc, worker_log) 121 | -------------------------------------------------------------------------------- /interactive/mydask.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/interactive/mydask.png -------------------------------------------------------------------------------- /rapids_interactive/dask/azure_taxi_on_cluster.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## NYC Taxi dataset analysis\n", 8 | "\n", 9 | "This notenook should be run from the Jupypter Server deployed on the AzureML Cluster\n", 10 | "\n", 11 | "First get the run object 
for the cluster we are running on (this will fail if not run on the cluster):" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 9, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "text/plain": [ 22 | "{'headnode': '172.17.0.7',\n", 23 | " 'scheduler': '172.17.0.7:8786',\n", 24 | " 'dashboard': '172.17.0.7:8787',\n", 25 | " 'data': '/mnt/batch/tasks/shared/LS_root/jobs/vnettest/azureml/init-dask-jupyter_1569837452_781f9040/mounts/workspaceblobstore',\n", 26 | " 'jupyter-server': ['http://172.17.0.7:8888/?token=328966d31212f8eebaea6b4df97c2bfbbc9819d2dc7049c2',\n", 27 | " 'http://172.17.0.7:8889/?token=a8c3ecc047365ec1b65bf7d6dce1ef44c1161f7b2f3a3c1c',\n", 28 | " 'http://172.17.0.7:8890/?token=2c572c6a478a93402e22baae68e31618d7fa839097740e79']}" 29 | ] 30 | }, 31 | "execution_count": 9, 32 | "metadata": {}, 33 | "output_type": "execute_result" 34 | } 35 | ], 36 | "source": [ 37 | "from azureml.core import Run\n", 38 | "run = Run.get_context()\n", 39 | "run.get_metrics()" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "Fetch the list of data files from the mounted share:" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 10, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/plain": [ 57 | "['yellow_tripdata_2015-10.csv',\n", 58 | " 'yellow_tripdata_2015-04.csv',\n", 59 | " 'yellow_tripdata_2015-03.csv',\n", 60 | " 'yellow_tripdata_2015-08.csv',\n", 61 | " 'yellow_tripdata_2015-07.csv',\n", 62 | " 'yellow_tripdata_2015-09.csv',\n", 63 | " 'yellow_tripdata_2015-01.csv',\n", 64 | " 'yellow_tripdata_2015-02.csv',\n", 65 | " 'yellow_tripdata_2015-05.csv',\n", 66 | " 'yellow_tripdata_2015-06.csv',\n", 67 | " 'yellow_tripdata_2015-11.csv',\n", 68 | " 'yellow_tripdata_2015-12.csv']" 69 | ] 70 | }, 71 | "execution_count": 10, 72 | "metadata": {}, 73 | "output_type": "execute_result" 74 | } 75 | ], 76 | "source": [ 77 | "import os\n", 78 | "data_path = run.get_metrics()['data'] \n", 79 | "filenames = os.listdir(data_path + '/nyctaxi')\n", 80 | "total_size = 0\n", 81 | "for file in filenames:\n", 82 | " size = os.path.getsize(data_path + '/nyctaxi/' + file)/(1e9)\n", 83 | " print(f\"file: {file} size: {round(size,1)} GB\")\n", 84 | " total_size += size\n", 85 | "\n", 86 | "print(\"Total size:\", round(total_size,1), \"GB\")" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "### Get the cluster client\n", 94 | "Since this jupyter server is running on the scheduler node of the cluster, we just need to connect to localhost." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 11, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/html": [ 105 | "\n", 106 | "\n", 107 | "\n", 114 | "\n", 122 | "\n", 123 | "
Client / Cluster: Workers: 6, Cores: 6, Memory: 0 B
" 124 | ], 125 | "text/plain": [ 126 | "" 127 | ] 128 | }, 129 | "execution_count": 11, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "import distributed\n", 136 | "client = distributed.Client('tcp://localhost:8786')\n", 137 | "client" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 12, 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "name": "stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "- setting dask settings\n", 150 | "-- Changes to dask settings\n", 151 | "--- Setting work-stealing to False\n", 152 | "--- Setting scheduler bandwidth to 1\n", 153 | "-- Settings updates complete\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "import dask\n", 159 | "\n", 160 | "print(\"- setting dask settings\")\n", 161 | "dask.config.set({'distributed.scheduler.work-stealing': False})\n", 162 | "dask.config.set({'distributed.scheduler.bandwidth': 1})\n", 163 | "\n", 164 | "print(\"-- Changes to dask settings\")\n", 165 | "print(\"--- Setting work-stealing to \", dask.config.get('distributed.scheduler.work-stealing'))\n", 166 | "print(\"--- Setting scheduler bandwidth to \", dask.config.get('distributed.scheduler.bandwidth'))\n", 167 | "print(\"-- Settings updates complete\")" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 42, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "# helper function which takes a DataFrame partition\n", 177 | "def clean(df_part, remap, must_haves): \n", 178 | " # some col-names include pre-pended spaces remove & lowercase column names\n", 179 | " tmp = {col:col.strip().lower() for col in list(df_part.columns)}\n", 180 | " df_part = df_part.rename(tmp)\n", 181 | " \n", 182 | " # rename using the supplied mapping\n", 183 | " df_part = df_part.rename(remap)\n", 184 | " \n", 185 | " # iterate through columns in this df partition\n", 186 | " for col in df_part.columns:\n", 187 | " # drop anything not in our expected list\n", 188 | " if col not in must_haves:\n", 189 | " df_part = df_part.drop(col)\n", 190 | " continue\n", 191 | "\n", 192 | " if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:\n", 193 | " df_part[col] = df_part[col].astype('datetime64[ms]')\n", 194 | " continue\n", 195 | " \n", 196 | " # if column was read as a string, recast as float\n", 197 | " if df_part[col].dtype == 'object':\n", 198 | " df_part[col] = df_part[col].str.fillna('-1')\n", 199 | " df_part[col] = df_part[col].astype('float32')\n", 200 | " else:\n", 201 | " # downcast from 64bit to 32bit types\n", 202 | " # Tesla T4 are faster on 32bit ops\n", 203 | " if 'int' in str(df_part[col].dtype):\n", 204 | " df_part[col] = df_part[col].astype('int32')\n", 205 | " if 'float' in str(df_part[col].dtype):\n", 206 | " df_part[col] = df_part[col].astype('float32')\n", 207 | " df_part[col] = df_part[col].fillna(-1)\n", 208 | "\n", 209 | " return df_part" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 43, 215 | "metadata": {}, 216 | "outputs": [], 217 | "source": [ 218 | "import os\n", 219 | "import cudf\n", 220 | "\n", 221 | "def read_csv(path):\n", 222 | " import cudf\n", 223 | " # list of column names that need to be re-mapped\n", 224 | " remap = {}\n", 225 | " remap['tpep_pickup_datetime'] = 'pickup_datetime'\n", 226 | " remap['tpep_dropoff_datetime'] = 'dropoff_datetime'\n", 227 | " remap['ratecodeid'] = 'rate_code'\n", 228 | "\n", 229 | " #create a list of columns & dtypes the df must 
have\n", 230 | " must_haves = {\n", 231 | " 'pickup_datetime': 'datetime64[ms]',\n", 232 | " 'dropoff_datetime': 'datetime64[ms]',\n", 233 | " 'passenger_count': 'int32',\n", 234 | " 'trip_distance': 'float32',\n", 235 | " 'pickup_longitude': 'float32',\n", 236 | " 'pickup_latitude': 'float32',\n", 237 | " 'rate_code': 'int32',\n", 238 | " 'dropoff_longitude': 'float32',\n", 239 | " 'dropoff_latitude': 'float32',\n", 240 | " 'fare_amount': 'float32'\n", 241 | " }\n", 242 | " \n", 243 | " df = cudf.read_csv(path)\n", 244 | " return clean(df, remap, must_haves)\n", 245 | "\n", 246 | "paths = [os.path.join(run.get_metrics()[\"data\"], \"nyctaxi/\") + filename for filename in filenames]\n", 247 | "data_paths = client.scatter(paths)\n", 248 | "dfs = [client.submit(read_csv, data_path) for data_path in data_paths]" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 44, 254 | "metadata": { 255 | "scrolled": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "import dask_cudf\n", 260 | "\n", 261 | "taxi_df = dask_cudf.from_delayed(dfs)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 45, 267 | "metadata": {}, 268 | "outputs": [ 269 | { 270 | "data": { 271 | "text/html": [ 272 | "
" 371 | ], 372 | "text/plain": [ 373 | " pickup_datetime dropoff_datetime passenger_count trip_distance \\\n", 374 | "0 2015-10-01 00:00:00 2015-10-01 00:05:48 1 1.10 \n", 375 | "1 2015-10-01 00:00:00 2015-10-01 00:00:00 1 7.68 \n", 376 | "2 2015-10-01 00:00:00 2015-10-01 00:00:00 2 2.53 \n", 377 | "3 2015-10-01 00:00:00 2015-10-01 00:00:00 0 1.20 \n", 378 | "4 2015-10-01 00:00:01 2015-10-01 00:16:19 1 3.80 \n", 379 | "\n", 380 | " pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n", 381 | "0 -73.935516 40.761238 1 -73.944351 \n", 382 | "1 -73.989937 40.743439 1 -73.986687 \n", 383 | "2 -73.987328 40.720020 1 -73.999084 \n", 384 | "3 -73.953758 40.743385 5 -73.930008 \n", 385 | "4 -73.984016 40.755222 1 -73.959869 \n", 386 | "\n", 387 | " dropoff_latitude fare_amount \n", 388 | "0 40.754578 6.00 \n", 389 | "1 40.689129 27.50 \n", 390 | "2 40.744381 12.50 \n", 391 | "3 40.736622 25.26 \n", 392 | "4 40.801323 15.50 " 393 | ] 394 | }, 395 | "execution_count": 45, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "taxi_df.head()" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 48, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "import numpy as np\n", 411 | "import numba, xgboost, socket\n", 412 | "import dask, dask_cudf\n", 413 | "from dask.distributed import Client, wait" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 49, 419 | "metadata": {}, 420 | "outputs": [ 421 | { 422 | "data": { 423 | "text/plain": [ 424 | "Index(['pickup_datetime', 'dropoff_datetime', 'passenger_count',\n", 425 | " 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code',\n", 426 | " 'dropoff_longitude', 'dropoff_latitude', 'fare_amount'],\n", 427 | " dtype='object')" 428 | ] 429 | }, 430 | "execution_count": 49, 431 | "metadata": {}, 432 | "output_type": "execute_result" 433 | } 434 | ], 435 | "source": [ 436 | "taxi_df.columns" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": 50, 442 | "metadata": {}, 443 | "outputs": [ 444 | { 445 | "data": { 446 | "text/html": [ 447 | "
" 546 | ], 547 | "text/plain": [ 548 | " pickup_datetime dropoff_datetime passenger_count trip_distance \\\n", 549 | "0 2015-10-01 00:00:00 2015-10-01 00:05:48 1 1.10 \n", 550 | "1 2015-10-01 00:00:00 2015-10-01 00:00:00 1 7.68 \n", 551 | "2 2015-10-01 00:00:00 2015-10-01 00:00:00 2 2.53 \n", 552 | "4 2015-10-01 00:00:01 2015-10-01 00:16:19 1 3.80 \n", 553 | "5 2015-10-01 00:00:01 2015-10-01 00:13:41 1 3.10 \n", 554 | "\n", 555 | " pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n", 556 | "0 -73.935516 40.761238 1 -73.944351 \n", 557 | "1 -73.989937 40.743439 1 -73.986687 \n", 558 | "2 -73.987328 40.720020 1 -73.999084 \n", 559 | "4 -73.984016 40.755222 1 -73.959869 \n", 560 | "5 -73.975296 40.751396 1 -73.970924 \n", 561 | "\n", 562 | " dropoff_latitude fare_amount \n", 563 | "0 40.754578 6.0 \n", 564 | "1 40.689129 27.5 \n", 565 | "2 40.744381 12.5 \n", 566 | "4 40.801323 15.5 \n", 567 | "5 40.785984 12.5 " 568 | ] 569 | }, 570 | "execution_count": 50, 571 | "metadata": {}, 572 | "output_type": "execute_result" 573 | } 574 | ], 575 | "source": [ 576 | "# apply a list of filter conditions to throw out records with missing or outlier values\n", 577 | "query_frags = [\n", 578 | " 'fare_amount > 0 and fare_amount < 500',\n", 579 | " 'passenger_count > 0 and passenger_count < 6',\n", 580 | " 'pickup_longitude > -75 and pickup_longitude < -73',\n", 581 | " 'dropoff_longitude > -75 and dropoff_longitude < -73',\n", 582 | " 'pickup_latitude > 40 and pickup_latitude < 42',\n", 583 | " 'dropoff_latitude > 40 and dropoff_latitude < 42'\n", 584 | "]\n", 585 | "taxi_df = taxi_df.query(' and '.join(query_frags))\n", 586 | "\n", 587 | "# inspect the results of cleaning\n", 588 | "taxi_df.head().to_pandas()" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 51, 594 | "metadata": {}, 595 | "outputs": [], 596 | "source": [ 597 | "import math\n", 598 | "from math import cos, sin, asin, sqrt, pi\n", 599 | "import numpy as np\n", 600 | "\n", 601 | "def haversine_distance_kernel(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude, h_distance):\n", 602 | " for i, (x_1, y_1, x_2, y_2) in enumerate(zip(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude)):\n", 603 | " x_1 = pi/180 * x_1\n", 604 | " y_1 = pi/180 * y_1\n", 605 | " x_2 = pi/180 * x_2\n", 606 | " y_2 = pi/180 * y_2\n", 607 | " \n", 608 | " dlon = y_2 - y_1\n", 609 | " dlat = x_2 - x_1\n", 610 | " a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2\n", 611 | " \n", 612 | " c = 2 * asin(sqrt(a)) \n", 613 | " r = 6371 # Radius of earth in kilometers\n", 614 | " \n", 615 | " h_distance[i] = c * r\n", 616 | "\n", 617 | "def day_of_the_week_kernel(day, month, year, day_of_week):\n", 618 | " for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):\n", 619 | " if month[i] <3:\n", 620 | " shift = month[i]\n", 621 | " else:\n", 622 | " shift = 0\n", 623 | " Y = year[i] - (month[i] < 3)\n", 624 | " y = Y - 2000\n", 625 | " c = 20\n", 626 | " d = day[i]\n", 627 | " m = month[i] + shift + 1\n", 628 | " day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7\n", 629 | " \n", 630 | "def add_features(df):\n", 631 | " df['hour'] = df['pickup_datetime'].dt.hour\n", 632 | " df['year'] = df['pickup_datetime'].dt.year\n", 633 | " df['month'] = df['pickup_datetime'].dt.month\n", 634 | " df['day'] = df['pickup_datetime'].dt.day\n", 635 | " df['diff'] = df['dropoff_datetime'].astype('int32') - df['pickup_datetime'].astype('int32')\n", 636 | " \n", 637 | " 
df['pickup_latitude_r'] = df['pickup_latitude']//.01*.01\n", 638 | " df['pickup_longitude_r'] = df['pickup_longitude']//.01*.01\n", 639 | " df['dropoff_latitude_r'] = df['dropoff_latitude']//.01*.01\n", 640 | " df['dropoff_longitude_r'] = df['dropoff_longitude']//.01*.01\n", 641 | " \n", 642 | " df = df.drop('pickup_datetime')\n", 643 | " df = df.drop('dropoff_datetime')\n", 644 | " \n", 645 | " \n", 646 | " df = df.apply_rows(haversine_distance_kernel,\n", 647 | " incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],\n", 648 | " outcols=dict(h_distance=np.float32),\n", 649 | " kwargs=dict())\n", 650 | " \n", 651 | " \n", 652 | " df = df.apply_rows(day_of_the_week_kernel,\n", 653 | " incols=['day', 'month', 'year'],\n", 654 | " outcols=dict(day_of_week=np.float32),\n", 655 | " kwargs=dict())\n", 656 | " \n", 657 | " \n", 658 | " df['is_weekend'] = (df['day_of_week']<2)\n", 659 | " return df" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 52, 665 | "metadata": {}, 666 | "outputs": [ 667 | { 668 | "data": { 669 | "text/html": [ 670 | "
" 829 | ], 830 | "text/plain": [ 831 | " passenger_count trip_distance pickup_longitude pickup_latitude \\\n", 832 | "0 1 1.10 -73.935516 40.761238 \n", 833 | "1 1 7.68 -73.989937 40.743439 \n", 834 | "2 2 2.53 -73.987328 40.720020 \n", 835 | "4 1 3.80 -73.984016 40.755222 \n", 836 | "5 1 3.10 -73.975296 40.751396 \n", 837 | "\n", 838 | " rate_code dropoff_longitude dropoff_latitude fare_amount hour year \\\n", 839 | "0 1 -73.944351 40.754578 6.0 0 2015 \n", 840 | "1 1 -73.986687 40.689129 27.5 0 2015 \n", 841 | "2 1 -73.999084 40.744381 12.5 0 2015 \n", 842 | "4 1 -73.959869 40.801323 15.5 0 2015 \n", 843 | "5 1 -73.970924 40.785984 12.5 0 2015 \n", 844 | "\n", 845 | " month day diff pickup_latitude_r pickup_longitude_r \\\n", 846 | "0 10 1 348000 40.759998 -73.939995 \n", 847 | "1 10 1 0 40.739998 -73.989998 \n", 848 | "2 10 1 0 40.719997 -73.989998 \n", 849 | "4 10 1 978000 40.750000 -73.989998 \n", 850 | "5 10 1 820000 40.750000 -73.979996 \n", 851 | "\n", 852 | " dropoff_latitude_r dropoff_longitude_r h_distance day_of_week \\\n", 853 | "0 40.750000 -73.949997 1.049876 5.0 \n", 854 | "1 40.680000 -73.989998 6.045188 5.0 \n", 855 | "2 40.739998 -74.000000 2.884243 5.0 \n", 856 | "4 40.799999 -73.959999 5.514657 5.0 \n", 857 | "5 40.779999 -73.979996 3.863575 5.0 \n", 858 | "\n", 859 | " is_weekend \n", 860 | "0 False \n", 861 | "1 False \n", 862 | "2 False \n", 863 | "4 False \n", 864 | "5 False " 865 | ] 866 | }, 867 | "execution_count": 52, 868 | "metadata": {}, 869 | "output_type": "execute_result" 870 | } 871 | ], 872 | "source": [ 873 | "# actually add the features\n", 874 | "taxi_df = taxi_df.map_partitions(add_features).persist()\n", 875 | "# inspect the result\n", 876 | "taxi_df.head().to_pandas()" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": 53, 882 | "metadata": {}, 883 | "outputs": [ 884 | { 885 | "data": { 886 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3deXxU1d3H8c+ZLBOykW3CEgIBEkAQSBRwgaBiVaQqWJeKtZtdfNqi1rVW69LdvXV72dpW6fNoUWpVUHGrUAiLUBACQRICCRDCkpmE7PvMef5IgjEECWFm7r1zf+/XixdwM8z9MUy+3Dn3/M5RWmuEEEJYj8PoAoQQQvSPBLgQQliUBLgQQliUBLgQQliUBLgQQlhUeDBPlpKSojMyMoJ5SiGEsLxNmzZ5tNaunseDGuAZGRls3LgxmKcUQgjLU0rt7e24DKEIIYRFSYALIYRFSYALIYRFSYALIYRFSYALIYRFSYALIYRFSYALIYRFSYBbSHVjK29u3m90GUIIk5AAt5C/5pVy22v5HKxpMroUIYQJSIBbSF6xG4CK2haDKxFCmIEEuEUcaWhla3kNAO46CXAhhAS4ZazdXUnX7nfueglwIYQEuGXkFbuJdXasPSZX4EIIkAC3BK01ecUepmcmkxgdQUVds9ElCSFMQALcAko9DZRXN5Gb5cIV55QrcCEEIAFuCXnFHgBys1IkwIUQR0mAW0BesZvhSdGMSI7BFeuUm5hCCEAC3PTavD7W7a4kNysF4OgVuO6akiKEsK0TBrhS6kWlVIVSqqDH8ZuVUoVKqe1KqUcDV6K9bd5XTUOr92iAp8ZF0dzmo76l3eDKhBBG68sV+EJgdvcDSqkLgLnAZK31BOBx/5cmoGP4xKHgnNGfX4EDVMg4uBC2d8IA11qvAqp6HP4R8LDWuqXzMRUBqE0Aq4o9ZKcnMHBABPB5gMuNTCFEf8fAxwC5Sqn1SqmVSqmpx3ugUuqHSqmNSqmNbre7n6ezp+rGVrbtryY3y3X0mAS4EKJLfwM8HEgCzgbuAhYrpVRvD9Rav6C1nqK1nuJyuXp7iDiOtbsr8WmOjn8DuGIlwIUQHfob4PuBN3SHDYAPSDnBnxEnKa/YTZwznMnpCUePJURHEBGmZCqhEKLfAf4WcAGAUmoMEAl4/FWU6GifX7XTwzmjk4kI+/yfSSmFK9YpS8oKIfo0jXARsA4Yq5Tar5T6HvAiMKpzauGrwLe1TEz2qz2VjZ3t88d+sHHFSTOPEKJjLPtLaa3nH+dLN/i5FtFN1+YN3W9gdnHFOSmvlgWthLA76cQ0qbxiD+lJAxiRHH3M12Q9FCEESICb0uft8y56m9zjinVS1dCC1yejVkLYmQS4CW0pq6a+pZ3czN4n9rjio/BpqJRxcCFsTQLchPKKPTgUnDv6OAEeK+30QggJcFPKK3YzOT2BgdERvX79aDemXIELYWsS4CZT09hGfln1cYdPAFKlnV4IgQS46awr8XS0z485/rIDsh6KEAIkwE1nVbGHWGc42d3a53uKiggjLipcAlwIm5MAN5GO9nn3Me3zvZG54EIICXAT2VvZyP4jvbfP9+SKlQAXwu4kwE0kb1fX7vMnXnZX1kMRQkiAm0jeTjfDEgeQ0Uv7fE8yhCKEkAA3ifZuu88fZ2+ML0iNi6K+pZ3GVtncWAi7kgA3ifz91dS1tPdp+ARkKqEQQgLcNFbt7GqfT+7T4yXAhRAS4CaRV+xm0rAEEqIj+/R42RtTCCEBbgI1TW3k76/p0/TBLrIeihBCAtwE1u2uxOvTfR7/BkiKiSTMoWRvTCFsTALcBPKK3cREhpEz/Pjt8z2FORTJMZEyhCKEjUmAm8DqXcfuPt8X0swjhL1JgBtsb2UDeysbT2r4pIs08whhbxLgBssr7mqf7/sNzC6yHooQ9iYBbrDVxR7SEgYwMiXmpP9sarwTT30LPtncWAhbkgA3ULvXx5rdnj63z/fkinXS7tMcaWwNQHVCCLOTADdQ/v4a6pr73j7fkysuCpC54ELYlQS4gfKK3SgF0zP71j7fk7TTC2FvEuAGWl3sYVLawD63z/ckAS6EvUmAG6S2uY3NZdX9Hj6Bz3enr5AAF8KWJMAN8nn7/MlPH+wS4wwnOjJMrsCFsCkJcIOs2eUhOjKMnOGJp/Q80swjhH1JgBuk8GAdE4bGExl+av8E0swjhH1JgBukxNPAqJTYU34eWQ9FCPuSADdAbXMbnvoWRrpOvvuyJ1eck4raZj9UJYSwGglwA5S6GwD61T7fU2qck9rmdprbvKf8XEIIa5EAN0CppyPAR/khwLvmgntkGEUI25EAN0CJpwGHguHJ0af8XNLMI4R9SYAboNTTwLDEaJzhYaf8XK7YzvVQJMCFsJ0TBrhS6kWlVIVSqqDbsYeUUuVKqS2dP+YEtszQUuqp98v4N3x+BS7dmELYT1+uwBcCs3s5/getdXbnj2X+LSt0aa0pdTf4LcCTYyNRSq7AhbCjEwa41noVUBWEWmyhoq6FhlYvo/wwhRAgIsxBUnSkzAUXwoZOZQx8gVJqa+cQy3H7wZVSP1RKbVRKbXS73adwutBQ4u6agXLqTTxdpJ1eCHvqb4A/D4wGsoGDwBPHe6DW+gWt9RSt9RSXq/8r74WKrimE/mji6SIBLoQ99SvAtdaHtdZerbUP+Aswzb9lha5STz3OcAdD4qP89pyyHooQ9tSvAFdKDen22yuBguM9VnxRSecNTIfj5PfAPB5XfEeAay2bGwthJ+EneoBSahFwPpCilNoPPAicr5TKBjSwB7gpgDWGlFJPA2MHx/n1OV2xTlq9Pmqb2hkYHeHX5xZCmNcJA1xrPb+Xw38LQC0hr83rY19VI5dOHOzX5z3ajVnfLAEuhI1IJ2YQ7T/SRLtPM9KPM1BAmnmEsCsJ8CAqcdcD/lmFsLtUWQ9FCFuSAA8if65C2J0rTtZDEcKOJMCDqMTTQGJ0BIkxkX593viocCLDHRLgQtiMBHgQ+XMNlO6UUjIXXAgbkgAPolJPg99vYHaRvTGFsB8J8CBpaGnnUG2z3xax6kna6YWwHwnwIDm6BkoAhlCgYyaKTCMUwl4kwIPk6AyUAF6BVzW00ub1BeT5hRDmIwEeJF0BnpEcuAAHqKxvDcjzCyHMRwI8SEo9DaQlDCAq4tT3weyNK1aaeYSwGwnwIClx+28fzN50Xw9FCGEPEuBBoLWmxBOYOeBdUjvXF6+olStwIexCAjwIKhtaqWtuD9gNTICU2I7uThlCEcI+JMCDINBTCAGc4WEMHBAhzTxC2IgEeBB0rULoz42MeyPNPELYiwR4EJR4GogIU6QlDgjoeWQ9FCHsRQI8CErdDYxIjiHMj/tg9iY1XroxhbATCfAgKPU0+H0N8N50XYHL5sZC2IMEeIB5fZq9lY2MDOAMlC6uOCdNbV4aWr0BP5cQwngS4AFWfqSJVq8vOFfgsrWaELYiAR5gJZ6ufTADOwMFJMCFsBsJ8AAL9CqE3X2+O7200wthBxLgAVbqaSAuKpxkP++D2Z
tU2dxYCFuRAA+wrhkoSgV2CiFAwoAIwh1KAlwIm5AAD7CSAG1k3BuHQ5EizTxC2IYEeAA1t3kpr24Kyg3MLrK5sRD2IQEeQHsqg3cDs4srzilLygphExLgAVTqDvwqhD2lyhW4ELYhAR5AJUFYRrYnV5yTyvoWvD5ppxci1EmAB1CJu4FB8U5inOFBO6crzolPQ1WDbG4sRKiTAA+gUk9g98HsjWxuLIR9SIAHUKmngVGu4M1AAenGFMJOJMAD5EhDK0ca24KyiFV30o0phH1IgAeIETcwAVLiOjc3lpkoQoQ8CfAACcZGxr2Jjgwn1hkuV+BC2IAEeICUeuoJdyjSk6KDfm7Z3FgIezhhgCulXlRKVSilCnr52h1KKa2USglMedZV6mlgeFI0EWHB/z/SFSt7YwphB31Jl4XA7J4HlVLpwMXAPj/XFBKCuYhVT654Jx4JcCFC3gkDXGu9Cqjq5Ut/AO4GpOWvB59Ps6fSwACXFQmFsIV+fb5XSs0FyrXW+X147A+VUhuVUhvdbnd/Tmc5B2ubaW7zBWUj49644pzUtbTTJJsbCxHSTjrAlVLRwL3AA315vNb6Ba31FK31FJfLdbKns6SuRaxGBXEZ2e66mnk8MpVQiJDWnyvw0cBIIF8ptQcYBnyqlBrsz8KsrLRzI+NgLiPb3efdmBLgQoSyk15lSWu9DUjt+n1niE/RWnv8WJel7XY3EB0ZRmpnkAZb6tHd6aWdXohQ1pdphIuAdcBYpdR+pdT3Al+WtZV6Om5gBmMfzN644mRBKyHs4IRX4Frr+Sf4eobfqgkRpZ4GJg0baNj5k2OcOJQEuBChTjox/ayl3cv+I41BX4WwuzCHIilGduYRItRJgPtZWVUjPk3QVyHsSdrphQh9EuB+ttuAfTB7kxon7fRChDoJcD/rWoUwQ67ARRBoLY3Qdha8zRptotTdQEpsJAMHRBhahyvOiae+BZ9P43AYMxtGBE5FXTMvrCzhlfX7GJIQxcwsF+eNcXHWqCSiI+Xb2i7kX9rPSj0NhnVgdueKddLm1dQ0tZEYE2l0OcJPKuqa+fPKEl7+ZC9tXh9zJg6htrmdRRv2sXDtHiLDHEzJSGTmGBczs1ycNiTOsOmsIvAkwP2sxNPAheNST/zAADs6F7y+RQI8BHQP7nafZl52GgtmZR6919Lc5uW/e6pYtdPNqp0eHn6vkIffK8QV5yQ3K4WZWS5mZKWQEmtMc5kIDAlwP6ptbsNT32LYIlbdHW2nr21hzKA4g6sR/dVbcN88K/OYeyxREWHkZrnIzXJx31fhcG1zR5gXe1hRWMEbn5YDcHpaPDOzXHzj7BGkJQww4q8k/EgC3I9KTTIDBbq109dLO70VVdQ286eVJbyyviO4r8xJY8EFxwb38QyKj+KaKelcMyUdr0+z/UDN0avzF1aVsLywgndvySVM7o9YmgS4H3XNQDF6DjhIO71VnWpw9ybMoZg0LIFJwxJYMCuLd7YeYME/NvOvT/dz7ZR0P1Yvgk0C3I9KPA04FAxPDv4+mD3FOsOJinBIgFtEU6uXxz4oOhrcX8vpGOMekez/i4GvThzCX9JLefLDnVw+aSgDIsP8fg4RHDIP3I9K3PUMS4zGGW78N4RSSuaCW8gTHxbx0tpSrpg8lOV3nMdj10wOSHhDx3vjvjmncai2mRfXlAbkHCI4JMD9qGsVQrOQzY2tYVdFHQvX7uG6qekBDe7upo1M4qLxg3j+P7uplDVzLEsC3E+01qYL8NS4KLkCNzmtNQ8t/YwYZzh3XTIuqOf+2exxNLV5efrj4qCeV/iPBLifVNS10NjqZbQJphB2ccXJioRm937BIVbv8nDHxWNICvJ8/czUWK6bms4r6/cdvQEvrEUC3E9Kjk4hNL4Ls4srzkl1Yxst7bK5sRk1tXr5zbs7GDc4juunDTekhlu/kkVkuINH3y805Pzi1EiA+0lJ5z6YZmji6dI1lbCyvtXgSkRvnv/PLsqrm/jV3NMJDzPmWzE1LoqbZo7mvYJDbNpbZUgNov8kwP2k1N2AM9zBkPgoo0s5yhUrmxub1b7KRv60qoS52UOZNjLJ0Fq+nzsSV5yT3y0rlNUNLUYC3E+6bmCaaeW/1Hhp5jGrX73zGREOxb1zTjO6FGKc4dx+0Rg27T3CB9sPGV2OOAkS4H5S6mlglImGT0C6Mc1qRVEF/95xmJsvzGKQST6xXXPmMLJSY3nk/SLavD6jyxF9JAHuB21eH/uqGk01hRA6NjcGCXAzaWn38qu3P2NUSgw3Th9pdDlHhYc5uOfScZR6Gli0YZ/R5Yg+kgD3g7KqRtp92lQzUAAiwx0kRkfIglYm8uLqPZR6GnjwiglEhpvr22/WuFTOHpXEU/8upq65zehyQkZ1Yys3Lvwvuyrq/f7c5noHWVTXHFqzXYFDxzBKRa1cgZvBoZpmnllezMXjB3HeGJfR5RxDqY4x+cqGVv68ssTockKC1pr73ipg1U43zW3+n84rAe4HZlqFsKfUuChp5jGJ3y3bgdenuf+y8UaXclyThiVwxeSh/HV1CYdq5JPbqVqy5QDvbj3IbReN4fS0gX5/fglwPyjxNJAYHWHKnW9kQStz+KSkkqX5B7jpvNGkJxm/WuWXueuSsfh88ORHRUaXYmnl1U3cv6SAM0ckctPMUQE5h2UCvN3Ed8ZL3eZaA6W7rgCX+b3Gaff6eGjpdtISBvCj80YbXc4JpSdF861zRvDPTfspPFRrdDmW5PNp7lycj8+n+cO12QFr1LJEgD+3YhfX/2W9KUO8trmNgvIaMlPNdQOziyvWSUu7j7qWdqNLsa1X1u+j8FAd9192mmXW3l4wK5M4ZzgPvyct9v3x4ppS1pVU8sDl4wO6P4AlAnxY4gA27Kni6eW7jC7lGC+sLKGupZ1vnZNhdCm96r43pgi+yvoWnviwiNysFC6ZMNjocvosITqSBbMy+U+RmzW7PEaXYylFh+p49P0iLho/KOA7HlkiwOdmp3HVGcN4dnkxG0rNs15DRV0zf1tdyuWThwbkBoU/pEozj6Ee+6CIxlYvD14+AaXM06XbF986J4O0hAH8btkOfD4ZguuLlnYvt766mfgB4fz+axMD/m9uiQAH+OXcCQxPiuanr26mptEcc1SfXb6LNq+POy4aY3Qpx9V1w2xXRZ3BldhPflk1r20s48YZI007xPZloiLCuOuSsWw/UMuS/HKjy7GEJz/aSeGhOh65ahIpnWsRBZJlAjzWGc7T83Nw17dwzxtbDb8pt6+ykX+s38e1U9NPacPZQBuWOIC0hAGslo/BQeXzaR5Yup2UWCc3z8o0upx+u2LyUE5Pi+fxD3YGZB5zKFlfUskLq0qYP204F542KCjntEyAQ8cc1TsvHst7BYd49b9lhtby5EdFhDkUt16YZWgdJ6KUYkZmCmt3V5ryJnCoen3TfvLLqrl3zjjioiKMLqffHA7FvZeeRnl1E39fu8fockyrtrmN2xfnMzwpml98NXgLlFkqwAF+kDuK3KwUfvn2dsOGBT47UMuS/AN8d/pI0yxG9GVmZKVQ1
9zOtvIao0uxhZqmNh55v5ApIxKZl51mdDmn7NzMFM4f6+JPK3fT2i4XAb355dLPOFjTxJPXZhPjDA/aeS0X4A6H4olrJhMdGc7Ni7YY8rHu8Q+LiHOGW2JOL8D0zBQAVhfLMEowPPzeDqoaW3noCuvduDyeb549giONbaza6Ta6FNN5b9tB/vXpfhZckMmZIxKDem7LBThAanwUj18ziR0Ha3kkyFtBbSitYnlhBf9z/mgGRlvjo3FSTCQThsbLOHgQvLv1IIs2lHHTzNGmnZnUHzPHuEiMjuDNLXIzs7uK2mZ+/uY2JqYN5GYDhlMtGeAAs8YN4jvnZvDSmj2sKKwIyjm11jz6fiGpcU6+e655lgLtixlZKXy67wgN0tATMGVVjdzzxlay0xO442Lzzkzqj4gwB5dPHsq/PzssKxV20lpz1+tbaW7z8oevZxNhwLZ4lg1wgHsuHcdpQ+K585/5VNQGfuGd5YUVbNx7hFsuzLJMR12XGZkptHk1G/aYZx59KGnz+rjl1c2g4Zn5OYZ8MwfavJw0Wtp9vF8gu/YAvPzJXlbudHPvnNMMmyZq6XdZVEQYz8zPpqG1nTv+mR/QZgOfT/PYB0WMSI7m61MD210VCFMzkogMd8g4eID84aOdbN5Xze+vmmj6xar6Kyc9gRHJ0bwlwyjsdtfz22U7mDnGxTfPHmFYHScMcKXUi0qpCqVUQbdjv1ZKbVVKbVFKfaiUGhrYMo8vMzWOBy6bQF6xh7+uDtwaxkvzD1B4qI47Lh5ryaurqIgwpmYkSoAHwOpiD8+v3M11U9O5bJJh3woBp5RibnYaa3dX2nqp2Tavj9te20JURBiPXT3J0BvVfUmihcDsHsce01pP0lpnA+8AD/i7sJMxf1o6sycM5rEPiti23/9T5VrbfTzxURHjh8Rz2cQhfn/+YJmR6aLocB0Vdfb95vM3T30Lty3ewmhXLA9ePsHocgJuXvZQtIalNu7MfGb5Lrbur+F3V040fBrxCQNca70KqOpxrPsakzGAoW2RSikevmoiKbFObnl1s99v1L36332UVTVx9+yxptp1/mTlZnVMJ5TFifzD59PcsTif2qY2nr0+x3L3RfpjlCuWyekJvLn5gNGlGGLzviM8t2IXX8tJY44JLub6PRaglPqtUqoM+AZfcgWulPqhUmqjUmqj2x24OaQJ0ZH88evZ7K1s4MGl2/32vA0t7Tz98S7OGplkym2wTsb4IfEkRkewurjS6FJCwt9Wl7Jyp5tfXDaecYPjjS4naK7MHsqOg7W2Wyu8qdXLHYvzGRTn5KG55vi01e8A11rfp7VOB14BFnzJ417QWk/RWk9xuQIbgGeNSmbBBZm8vmk/S/P9c4Xw0ppSPPUt3D17nOWbMhwOxbmZKaze5TZ8LRmryy+r5pH3C7lkwiBuOGu40eUE1WWThxLmULxls6vwxz4oosTTwKNXTybeJMsj+ONu3CvAVX54Hr+45cIszhyRyH1vbKOsqvGUnutI5+auF40fFPQOq0CZkZnC4dqWgOyQbRd1zW3cvGgzg+KjePSqyZb/j/1kpcQ6mZmVwtIt5bZZZnZ9SSUvrS3lhrOHM6NzKNIM+hXgSqnuLUdzAdNs2xEe5uCPX88GBTcv2szhU5gf/vzK3dS3tnPXJWP9WKGxZnS11cs4eL9orfnFWwWUVzfx1HXZlunG9bd5OWkcqGm2RV9BQ0s7d76eT3piND+/NHgLVfVFX6YRLgLWAWOVUvuVUt8DHlZKFSiltgIXA7cGuM6Tkp4UzSNXTWJbeQ25j6zgZ69vZbf75K44D9Y0sXDtHr6WM4wxg+ICVGnwpSdFk5EcLdMJ++n1TftZsuUAP70wiykZSUaXY5iLxg8iOjKMtzaH/myU37+3g/1Hmnjs6klBXaiqL05YjdZ6fi+H/xaAWvxqzsQhnD50IH/JK2HxxjIWbyrjkvGD+dH5o5mcnnDCP//Uv4tBw0+/Yu7lYvtjemYKb20up83rs+ScdqPsqqjngSXbOXtUEj++wLprfPtDdGQ4sycM5t1tB3noiglERYTmDJzVxR5e/mQf35sxkrNGJRtdzjFC+rt3eHI0v553OmvumcVPzs9k7W4Pc59bw/wXPmHVzuPfyNtVUc/ijWV84+zhIdlVl5uVQkOrly1l1UaXYhnNbV5uXrSZAZFhPHVdDmEWnk7qL/Ny0qhrbg/aWkTBVtfcxs/+tZVRrhjTDqOGdIB3SYl1cuclY1n78wu5b85plHjq+daLG7jsmdW8nX/gmI0OnvyoiKiIMH4SoldZ54xKwaEgT4ZR+uzh9wrZcbCWx6+ZZHjzhlmcOzoZV5yTN0N0GOU37+zgYE0TT1wz2bSfMGwR4F1ineH8YOYoVt19AY9eNYmmzquqWU+s5OVP9tLc5iW/rJpl2w7x/dxRQdnTzggDoyOYOCxBGnr66MPth1i4dg83Th/JrHHB2SrLCsLDHFwxeSj/KXJT3dhqdDl+taKwgtc2lnHTeaPJGW7eGWi2CvAuzvAwrp2azr9vO48/3XAmiTGR/OKtAmY8spzbF28hMTqCH+Raa7nYkzUjM5ktZdXUytKgX+pgTRN3/2srE4bG87NLzfkx2khX5qTR6vWxbFvorFBY09jGPW9sZeygONPfA7NlgHdxOBSzTx/MWz8+l0U/OJsJQwey293ArRdmWXofw76YkenC69OsLwn9aWAnq6nVy/LCw9z/VgHznltDW7uPZ68/A2e4OT9GG2nC0HhGu2JCajbKQ29vp7K+lSeunWz6f3NzzYkxiFKKc0Ync87oZDz1LSTHRBpdUsCdMSKBARFhrC52c9F4GRYoq2pkRVEFywsrWLe7kpZ2HwMiwpiemcL3ZoxkZEqM0SWaklKKK3PSePzDnZRVNVr+pv8H2w/x5uZybr0wyxI7KkmA9xCq4949OcPDmDYyiTybjoO3tvvYuLeKFYUVrChyH+1MHZEczfxpw5k1LpVpI5NMe/PKTOZmdwT40vwDlr7xX9XQyn1vbmPC0HgWzLLG30MC3MZys1L4zbs7OFDdxNCEAUaXE3DuupbOwK4gr9hDfUs7EWGKs0YmM3/acC4Y62KUy5idVawsPSmaqRmJvLm5nB+fP9qySwvc/1YBNU1tvPz9syzTHyEBbmPTu7XVXzvFersM9ZXWmv/7ZC+/fXcHLe0+BsdHcfnkIZw/NpXpmSnEmqy7zorm5aRx35sFbD9Qa4mhh57ezj/Au9sOctclYy21sqS8c21s3OA4UmKdrAnhAPfUt3DXP/NZUeTmvDEu7p49lvFD4i17lWhWX504hIeWbuetzeWWC/CKumbuX1LA5PQEbpo5yuhyToo1PieIgFBKMSMzmTW7PCG5qtyKogpm/3EVa3ZX8tDl41n43alMGDpQwjsAEqIjuWBsKkvyD+C10HtJa829bxTQ1OrliWsmE26RoZMu1qpW+N30zBQ89a0UHqozuhS/aW7z8tDS7Xz3pf+SHOPk7QUz+M70kRLcATYvJw13XQtrd1vnxvgbn5bz7x2HueuSsYbtLH8qJMBtbkaIbbNWeKiW
uc+uYeHaPXzn3AyWLJjO2MGhs5qkmc0al0qcM9wyrfUHa5p46O3tTM1I5LvTrdm4JwFuc0MGDmC0K8by0wm11ry0ppQrnl1DZUMrL313akivkmdGURFhzJk4hA8KDtHU6jW6nC/V7vVxx+J82r2ax6+ZbNnFySTABblZLjaUVtLSbu5vuuNx17XwnZf+yy/f/ozczBTe/2kuF4xNNbosW5qXk0ZDq5ePdhw2upQv9diHRazdXcmv553OiGTrNmlJgAumZ6bQ3OZj094jRpdy0pYXHmb2H1fxSUklv547gb9+e4ptmrHM6KyRSQwZGGXq1vpl2w7y55Ul3HD2cK4+c5jR5ZwSCXDB2aOSCHMoS+3S09zm5YElBdy4cCOuOCdv3zyDb56TITcqDeZwKOZmp7Fyp5vK+hajyzlG8eE67vpnPjnDE3jgMnPsLH8qJMAFcVER5K8hPB0AAApNSURBVKRbZ3nZjXuquOLZ1fzvur18b8ZIliyYHlLb3lndvJyheH2ad7YeNLqUL6htbuOm/9vEgMgwnv/GmUSGWz/+rP83EH4xPTOFreU1pl7X+WBNE7cs2szVf1pHXXM7/3vjNO6/bLzpV4yzm3GD4xk3OM5Us1F8Ps2di/PZW9XIc9efweCBobEphwS4ADrWRdEa1u2uNLqUYzS3eXnm42JmPb6S97cf4pZZmXx8x3nMHOMyujRxHFfmpLGlrJpST4PRpQDw/MrdfPjZYe6dc5op97bsLwlwAcDk9ARineGmmk6oteb9gkN85cmVPPHRTs4f6+Lj28/j9ovHEh0pq0CY2RXZQ1EKU9zMXLXTzeMfFnHF5KHcOD3D6HL8Sr4LBAARYQ7OHpVkmnHwnYfr+OXb21mzq5Ixg2L5x/fP4tzOxbeE+Q0ZOIBzRiWzZEs5P/1KlmE3l8uqGrnl1c2MHRTHw1dNDLmb3HIFLo6anpnC3spGyqoaDauhprGNh5Zu59Kn8igor+WXV0xg2S25Et4WdPWZw9hT2cj/fbLXkPM3t3n5n5c34fVp/nTDmSH5qS30/kai33I72+rzij1cf9bwoJ7b69Ms2rCPJz4soqapjevPGs7tF40lyQa7I4WqedlpvJ1/gN+8s4Oc9EQmDgveKoVa66PL2774nSlkhOiOSnIFLo4a7YplcHxU0IdR1pdUctkzq/nFWwVkDYrjnZtz+c28iRLeFudwKJ64Npvk2Eh+/I9N1DQFbwPtl9fv41+f7ufWC7OYNS50twyUK3BxlFKK6ZkpfFx4GK9Pn9T6EPll1fx93R6KDtXh9WnafRpf58/ezh/tPo1Pa9q9Pnwa2n0+vD5Nm1czdGAUz16fw1cnDgm5cUo7S4qJ5Nnrc/j6nz/hZ69v5fkbzgj4v++mvVX86u3tXDDWxa0XmntX+VMlAS6+IDcrhX99up/PDtSe8CNva7uP9woOsnDtHjbvqybWGc7UjEQiwhyEhykcShHuUDgcHT+HORyEOSDc4ej4WpgizKFIjXNy3dThDIiU+dyh6MwRSdw9eyy/W1bIwrV7ArryX0VdMz96+VOGDBzAH7+eg8Oii1T1lQS4+IJzMzvmyObtch83wN11LSzasI+XP9lLRV0LI1NieOjy8Vx15jDioiKCWa6wiB/kjmJDaRW/W7aDnOGJZKcn+P0cbV4fC17ZTG1zG3+/cRoDo0P/vSgBLr4gNS6KcYPjWF3s4cfnf3Fn7q37q1m4Zg/vbD1Iq9fHeWNcPHJ1BudluUL+SkecGqUUj18zma8+vZqfvPIpy27J9XvA/m7ZDjbsqeKp67I5bYh19rU8FRLg4hgzMlP433V7aWr1EuZQvFdwkL+v3cOn+6qJiQxj/rR0vnVuBqNlB3dxEhKiO8bDr/3zOu58PZ8Xvnmm38bDl2wp56U1e/ju9AzmZqf55TmtQAJcHGN6Vgp/XV3KvW9uY80uDxV1LWQkR/Pg5eO5WoZJxCnIGZ7IPZeexq/f+Yy/rS7l+7mntolwu9fHcyt28/TyYqZlJHHvnNP8VKk1SICLY5w1MonIcAdvbi5n5hgXj1yVwXljZJhE+MeN0zNYX1LJw+8VcsaIRM4Yntiv5ymrauS217awce8R5mUP5VfzTifCYpsSnyqldfB2kJ4yZYreuHFj0M4n+i+/rJoYZ7glN3oV5lfT2MZXn8lDa3j3lhkkRJ/cnP8lW8r5xZsFAPx63unMywntYROl1Cat9ZSex+3135Xos8npCRLeImAGRkfw3PVnUFHXzB2L8/H5+nYhWdvcxm2vbeHWV7cwZnAcy27NDfnw/jIS4EIIQ0xOT+C+OafxcWEFf8krOeHjN+6pYs5TeSzNP8BtXxnDaz88m/Sk6CBUal4yBi6EMMy3z81gw54qHv2giDNHJDIlI+mYx7R7fTyzfBfPLC8mLXEAi286hzNH9G/cPNTIFbgQwjBKKR6+ahLDEgdw86LNVDV8cUeofZWNXPvndTz1cTHzctJYdkuuhHc3JwxwpdSLSqkKpVRBt2OPKaUKlVJblVJvKqX831YlhLCF+KiO8fDK+lZuX7wFn0+jteaNT/cz5+k8iivqeXp+Dk9emy1TWHvoyxX4QmB2j2MfAadrrScBO4Gf+7kuIYSNnJ42kPsvH89/itw8+dFObn11C7cvzmf8kHjeuzWXKyYPNbpEUzrhGLjWepVSKqPHsQ+7/fYT4Gr/liWEsJsbzhrO+pJKnl2xizCH4s6Lx/Cj8zNPalVMu/HHTcwbgdeO90Wl1A+BHwIMHx7cTQKEENahlOL3X5tIalwUl08eQk4/G3zspE+NPJ1X4O9orU/vcfw+YArwNd2HJ5JGHiGEOHnHa+Tp9xW4Uuo7wGXAhX0JbyGEEP7VrwBXSs0G7gbO01obtwOuEELYWF+mES4C1gFjlVL7lVLfA54F4oCPlFJblFJ/CnCdQggheujLLJT5vRz+WwBqEUIIcRKkE1MIISxKAlwIISxKAlwIISxKAlwIISwqqDvyKKXcwN5+/vEUwOPHcqxKXofPyWvRQV6HDqH8OozQWrt6HgxqgJ8KpdTG3jqR7EZeh8/Ja9FBXocOdnwdZAhFCCEsSgJcCCEsykoB/oLRBZiEvA6fk9eig7wOHWz3OlhmDFwIIcQXWekKXAghRDcS4EIIYVGWCHCl1GylVJFSapdS6h6j6zGKUmqPUmpb5wqQttkZ4zgbaycppT5SShV3/hzy27cc53V4SClV3vme2KKUmmNkjcGglEpXSq1QSn2mlNqulLq187jt3hOmD3ClVBjwHHApMB6Yr5Qab2xVhrpAa51ts/muCzl2Y+17gI+11lnAx52/D3ULOfZ1APhD53siW2u9LMg1GaEduENrPR44G/hJZybY7j1h+gAHpgG7tNYlWutW4FVgrsE1iSDSWq8Cqnocngv8vfPXfwfmBbUoAxzndbAdrfVBrfWnnb+uA3YAadjwPWGFAE8Dyrr9fn/nMTvSwIdKqU2dm0Xb2SCt9cHOXx8CBhlZjMEWKKW2dg6xhPywQXed+/XmAOux4XvCCgEuPjdDa30GHcNJP1FKzTS6IDPo3JPVrvN
hnwdGA9nAQeAJY8sJHqVULPAv4Kda69ruX7PLe8IKAV4OpHf7/bDOY7ajtS7v/LkCeJOO4SW7OqyUGgLQ+XOFwfUYQmt9WGvt1Vr7gL9gk/eEUiqCjvB+RWv9Rudh270nrBDg/wWylFIjlVKRwHXAUoNrCjqlVIxSKq7r18DFQMGX/6mQthT4duevvw0sMbAWw3QFVqcrscF7Qiml6NjWcYfW+sluX7Lde8ISnZidU6P+CIQBL2qtf2twSUGnlBpFx1U3dOxl+g+7vA6dG2ufT8dyoYeBB4G3gMXAcDqWKL5Wax3SN/iO8zqcT8fwiQb2ADd1GwcOSUqpGUAesA3wdR6+l45xcHu9J6wQ4EIIIY5lhSEUIYQQvZAAF0IIi5IAF0IIi5IAF0IIi5IAF0IIi5IAF0IIi5IAF0IIi/p/twbao1Tr9IcAAAAASUVORK5CYII=\n", 887 | "text/plain": [ 888 | "
" 889 | ] 890 | }, 891 | "metadata": { 892 | "needs_background": "light" 893 | }, 894 | "output_type": "display_data" 895 | } 896 | ], 897 | "source": [ 898 | "%matplotlib inline\n", 899 | "taxi_df.groupby('hour').fare_amount.mean().compute().to_pandas().sort_index().plot();" 900 | ] 901 | }, 902 | { 903 | "cell_type": "code", 904 | "execution_count": 54, 905 | "metadata": {}, 906 | "outputs": [ 907 | { 908 | "name": "stdout", 909 | "output_type": "stream", 910 | "text": [ 911 | "CPU times: user 381 ms, sys: 21.9 ms, total: 403 ms\n", 912 | "Wall time: 5.34 s\n" 913 | ] 914 | } 915 | ], 916 | "source": [ 917 | "%%time\n", 918 | "X_train = taxi_df.query('day < 25').persist()\n", 919 | "\n", 920 | "# create a Y_train ddf with just the target variable\n", 921 | "Y_train = X_train[['fare_amount']].persist()\n", 922 | "# drop the target variable from the training ddf\n", 923 | "X_train = X_train[X_train.columns.difference(['fare_amount'])]\n", 924 | "\n", 925 | "# this wont return until all data is in GPU memory\n", 926 | "done = wait([X_train, Y_train])" 927 | ] 928 | }, 929 | { 930 | "cell_type": "markdown", 931 | "metadata": {}, 932 | "source": [ 933 | "## Notes on training with XGBoost with Azure\n", 934 | "\n", 935 | "* Because Dask-XGBoost parses the `client` for the raw IP address, it passes `\"localhost\"` to RABIT if the `client` was configured to use `\"localhost\"` with SSH forwarding. This means Dask-XGBoost, as it exists, does not support Azure with this method.\n", 936 | "* There are several bugs and issues with the Dask submodule of XGBoost:\n", 937 | " 1. Data co-locality is not enforced (labels and data may not be on the same worker)\n", 938 | " 2. Data locality is not enforced (a data partition, x, may not be assigned to the worker, n, upon which it resides originally ... so, data may need to be shuffled\n", 939 | "\n", 940 | "The latter (Dask submodule of XGBoost) is being fixed in this PR: https://github.com/dmlc/xgboost/pull/4819\n", 941 | "\n", 942 | "This means the code below (Dask submodule of XGBoost) will not work, and replacing the call with Dask-XGBoost will not work." 
943 | ] 944 | }, 945 | { 946 | "cell_type": "code", 947 | "execution_count": null, 948 | "metadata": {}, 949 | "outputs": [], 950 | "source": [ 951 | "import dask_xgboost\n", 952 | "\n", 953 | "params = {\n", 954 | " 'num_rounds': 100,\n", 955 | " 'max_depth': 8,\n", 956 | " 'max_leaves': 2**8,\n", 957 | " 'tree_method': 'gpu_hist',\n", 958 | " 'objective': 'reg:squarederror',\n", 959 | " 'grow_policy': 'lossguide'\n", 960 | "}\n", 961 | "\n", 962 | "bst = dask_xgboost.train(client, params, X_train, Y_train, num_boost_round=params['num_rounds'])" 963 | ] 964 | }, 965 | { 966 | "cell_type": "code", 967 | "execution_count": null, 968 | "metadata": {}, 969 | "outputs": [], 970 | "source": [] 971 | } 972 | ], 973 | "metadata": { 974 | "kernelspec": { 975 | "display_name": "Python 3", 976 | "language": "python", 977 | "name": "python3" 978 | }, 979 | "language_info": { 980 | "codemirror_mode": { 981 | "name": "ipython", 982 | "version": 3 983 | }, 984 | "file_extension": ".py", 985 | "mimetype": "text/x-python", 986 | "name": "python", 987 | "nbconvert_exporter": "python", 988 | "pygments_lexer": "ipython3", 989 | "version": "3.7.4" 990 | } 991 | }, 992 | "nbformat": 4, 993 | "nbformat_minor": 4 994 | } 995 | -------------------------------------------------------------------------------- /rapids_interactive/dask/dask.yml: -------------------------------------------------------------------------------- 1 | name: dask 2 | channels: 3 | - defaults 4 | - conda-forge 5 | - nvidia 6 | - rapidsai 7 | - rapidsai/label/xgboost 8 | 9 | dependencies: 10 | - python=3.7 11 | - cudatoolkit 12 | - cudf 13 | - cuml 14 | - cugraph 15 | - bokeh 16 | - dask-cuda 17 | - dask-cudf 18 | - nvidia::nccl=2.4.* 19 | - rapidsai/label/xgboost::xgboost=0.90.* 20 | - rapidsai/label/xgboost::dask-xgboost=0.2.* 21 | - dill 22 | - numba 23 | - pip: 24 | - azureml-sdk[automl,explain,notebooks] 25 | - mpi4py 26 | -------------------------------------------------------------------------------- /rapids_interactive/dask/init-dask.py: -------------------------------------------------------------------------------- 1 | from mpi4py import MPI 2 | import os 3 | import argparse 4 | import socket 5 | from azureml.core import Run 6 | 7 | import sys, os 8 | pip = sys.executable[:-6] + 'pip freeze' 9 | print(pip) 10 | os.system(pip) 11 | 12 | if __name__ == '__main__': 13 | comm = MPI.COMM_WORLD 14 | rank = comm.Get_rank() 15 | 16 | ip = socket.gethostbyname(socket.gethostname()) 17 | print("- my rank is ", rank) 18 | print("- my ip is ", ip) 19 | 20 | parser = argparse.ArgumentParser() 21 | parser.add_argument("--data") 22 | parser.add_argument("--gpus") 23 | FLAGS, unparsed = parser.parse_known_args() 24 | 25 | if rank == 0: 26 | data = { 27 | "scheduler" : ip + ":8786", 28 | "dashboard" : ip + ":8787" 29 | } 30 | Run.get_context().log("headnode", ip) 31 | Run.get_context().log("scheduler", data["scheduler"]) 32 | Run.get_context().log("dashboard", data["dashboard"]) 33 | Run.get_context().log("data", FLAGS.data) 34 | else: 35 | data = None 36 | 37 | data = comm.bcast(data, root=0) 38 | scheduler = data["scheduler"] 39 | dashboard = data["dashboard"] 40 | print("- scheduler is ", scheduler) 41 | print("- dashboard is ", dashboard) 42 | 43 | 44 | if rank == 0: 45 | os.system("dask-scheduler " + 46 | "--port " + scheduler.split(":")[1] + 47 | " --dashboard-address " + dashboard + 48 | " --preload jupyter-preload.py") 49 | elif rank == 1: 50 | os.environ["CUDA_VISIBLE_DEVICES"] = '0,1' # allow the 1st worker to grab the GPU assigned to the 
scheduler as well as its own 51 | os.system("dask-cuda-worker " + scheduler + " --memory-limit 0") 52 | else: 53 | os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % int(FLAGS.gpus)) # restrict each worker to their own GPU (assuming one GPU per worker) 54 | os.system("dask-cuda-worker " + scheduler + " --memory-limit 0") 55 | -------------------------------------------------------------------------------- /rapids_interactive/dask/jupyter-preload.py: -------------------------------------------------------------------------------- 1 | from notebook.notebookapp import NotebookApp 2 | from azureml.core import Run 3 | import socket 4 | 5 | def dask_setup(scheduler): 6 | app = NotebookApp() 7 | ip = socket.gethostbyname(socket.gethostname()) 8 | app.ip="0.0.0.0" 9 | app.initialize([]) 10 | Run.get_context().log("jupyter-url", "http://" + ip + ":" + str(app.port) + "/?token=" + app.token) 11 | Run.get_context().log("jupyter-port", app.port) 12 | Run.get_context().log("jupyter-token", app.token) 13 | Run.get_context().log("jupyter-ip", ip) -------------------------------------------------------------------------------- /rapids_interactive/dask/rapids-0.9.yaml: -------------------------------------------------------------------------------- 1 | name: rapids-0.9 2 | channels: 3 | - defaults 4 | - conda-forge 5 | - nvidia 6 | - rapidsai 7 | - rapidsai/label/xgboost 8 | 9 | dependencies: 10 | - pip 11 | - mpi4py 12 | - python=3.7 13 | - numba>=0.45.1 14 | - cudatoolkit 15 | - cudf=0.9.* 16 | - cuml=0.9.* 17 | - cugraph=0.9.* 18 | - bokeh 19 | - dask=2.3.* 20 | - distributed=2.3.* 21 | - dask-cuda=0.9.* 22 | - dask-cudf=0.9.* 23 | - nvidia::nccl=2.4.* 24 | - rapidsai/label/xgboost::xgboost=0.90.* 25 | - rapidsai/label/xgboost::dask-xgboost=0.2.* 26 | - dill 27 | - pip: 28 | - azureml-sdk[automl,explain,notebooks] 29 | -------------------------------------------------------------------------------- /rapids_interactive/dask/rapids.yml: -------------------------------------------------------------------------------- 1 | name: rapids0.10 2 | channels: 3 | - nvidia 4 | - rapidsai/label/xgboost 5 | - rapidsai 6 | - rapidsai-nightly 7 | - conda-forge 8 | - numba 9 | - pytorch 10 | dependencies: 11 | - python=3.7 12 | - pytorch 13 | - cudatoolkit=10.0 14 | - dask-cuda=0.9.1 15 | - cudf=0.9.* 16 | - cuml=0.9.* 17 | - cugraph=0.9.* 18 | - rapidsai/label/xgboost::xgboost=0.90.rapidsdev1 19 | - rapidsai/label/xgboost::dask-xgboost=0.2.* 20 | - conda-forge::numpy=1.16.4 21 | - cython 22 | - dask 23 | - distributed=2.3.2 24 | - pynvml=8.0.2 25 | - gcsfs 26 | - requests 27 | - jupyterhub 28 | - jupyterlab 29 | - matplotlib 30 | - ipywidgets 31 | - ipyvolume 32 | - seaborn 33 | - scipy 34 | - pandas 35 | - boost 36 | - nodejs 37 | - pytest 38 | - pip 39 | - pip: 40 | - git+https://github.com/cupy/cupy.git 41 | - setuptools 42 | - torch 43 | - torchvision 44 | - pytorch-ignite 45 | - graphviz 46 | - networkx 47 | - dask-kubernetes 48 | - dask_labextension 49 | - jupyterlab-nvdashboard -------------------------------------------------------------------------------- /rapids_interactive/start_cluster.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Running a DASK cluster with RAPIDS\n", 8 | "\n", 9 | "This notebook runs a DASK cluster with NVIDIA RAPIDS. 
RAPIDS uses NVIDIA CUDA for high-performance GPU execution, exposing GPU parallelism and high memory bandwidth through a user-friendly Python interface. It includes a dataframe library called cuDF which will be familiar to Pandas users, as well as an ML library called cuML that provides GPU versions of many of the machine learning algorithms available in scikit-learn. \n", 10 | "\n", 11 | "This notebook shows how, through DASK, RAPIDS can take advantage of multi-node, multi-GPU configurations on AzureML. \n", 12 | "\n", 13 | "This notebook deploys the AzureML cluster into a VNet. Prior to running this, set up a VNet and DSVM according to [../setup-vnet.md](../setup-vnet.md). In this case, the following names are used to identify the VNet and subnet.\n", 14 | "\n", 15 | "In addition, you need to forward the following ports to the DSVM:\n", 16 | "\n", 17 | "- port 8888 to port 8888 for the jupyter server running on the DSVM (see [../setup-vnet.md](../setup-vnet.md))\n", 18 | "- port 9999 to port 9999 for the jupyter server running on the AML Cluster (will be explained below)\n", 19 | "- port 9797 to port 9797 for the Dask dashboard running on the AML Cluster (will be explained below)\n", 20 | "\n", 21 | "The easiest way to accomplish that is by logging into the DSVM using ssh with the following flags (assuming `mydsvm.westeurope.cloudapp.azure.com` is the DNS name for your DSVM):\n", 22 | "\n", 23 | " ssh mydsvm.westeurope.cloudapp.azure.com -L 9797:localhost:9797 -L 9999:localhost:9999 -L 8888:localhost:8888\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 1, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "import os\n", 33 | "import json\n", 34 | "import time\n", 35 | "\n", 36 | "from azureml.core import Workspace, Experiment, Environment\n", 37 | "from azureml.core.conda_dependencies import CondaDependencies\n", 38 | "from azureml.core.compute import AmlCompute, ComputeTarget\n", 39 | "from azureml.data.data_reference import DataReference\n", 40 | "from azureml.core.runconfig import RunConfiguration, MpiConfiguration\n", 41 | "from azureml.core import ScriptRunConfig\n", 42 | "from azureml.train.estimator import Estimator\n", 43 | "from azureml.exceptions import ComputeTargetException\n", 44 | "from azureml.widgets import RunDetails\n", 45 | "\n", 46 | "from subprocess import Popen, PIPE\n", 47 | "\n", 48 | "class PortForwarder():\n", 49 | " '''A helper to forward ports from the Notebook VM to the AML Cluster in the same VNet'''\n", 50 | " active_instances = set()\n", 51 | " \n", 52 | " def __init__(self, from_port, to_ip, to_port):\n", 53 | " self.from_port = from_port\n", 54 | " self.to_ip = to_ip\n", 55 | " self.to_port = to_port\n", 56 | " \n", 57 | " def start(self):\n", 58 | " self._socat = Popen([\"/usr/bin/socat\", \n", 59 | " f\"tcp-listen:{self.from_port},reuseaddr,fork\", \n", 60 | " f\"tcp:{self.to_ip}:{self.to_port}\"],\n", 61 | " stderr=PIPE, stdout=PIPE, universal_newlines=True)\n", 62 | " PortForwarder.active_instances.add(self)\n", 63 | " return self\n", 64 | " \n", 65 | " def stop(self):\n", 66 | " PortForwarder.active_instances.remove(self)\n", 67 | " return self._socat.terminate()\n", 68 | " \n", 69 | " def stop_all():\n", 70 | " for instance in list(PortForwarder.active_instances):\n", 71 | " instance.stop()" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "gpu_cluster_name = \"nd12-vnet-clustr\"\n", 81 | "vnet_resourcegroup_name='demo'\n", 82 | 
"vnet_name='myvnet'\n", 83 | "subnet_name='default'\n", 84 | "\n", 85 | "ws = Workspace.from_config()" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "### Deploy the AmlCompute cluster\n", 93 | "The next cell is deploying the AmlCompute cluster. The cluster is configured to scale down to 0 nodes after 2 minuten, so no cost is incurred while DASK is not running (and thus no nodes are spun up on the cluster as the result of this cell, yet). This cell only needs to be executed once and the cluster can be reused going forward." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 3, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "Found existing compute target\n" 106 | ] 107 | } 108 | ], 109 | "source": [ 110 | "try:\n", 111 | " gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)\n", 112 | " print('Found existing compute target')\n", 113 | " \n", 114 | "except ComputeTargetException:\n", 115 | " print(\"Creating new cluster\")\n", 116 | "\n", 117 | " provisioning_config = AmlCompute.provisioning_configuration(\n", 118 | " vm_size=\"Standard_ND12s\", \n", 119 | " min_nodes=0, \n", 120 | " max_nodes=10,\n", 121 | " idle_seconds_before_scaledown=120,\n", 122 | " vnet_resourcegroup_name=vnet_resourcegroup_name,\n", 123 | " vnet_name=vnet_name,\n", 124 | " subnet_name=subnet_name\n", 125 | " )\n", 126 | " gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)\n", 127 | "\n", 128 | " print(\"waiting for nodes\")\n", 129 | " gpu_cluster.wait_for_completion(show_output=True)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "### Copy the data to Azure Blob Storage\n", 137 | "\n", 138 | "This next cell is pulling the NYC taxi data set down and then uploads it to the AzureML workspace's default data store. The all nodes of the DASK cluster we are creating further down will then be able to access the data." 
139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 4, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-01.csv\n", 151 | "- File already exists locally\n", 152 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-02.csv\n", 153 | "- File already exists locally\n", 154 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-03.csv\n", 155 | "- File already exists locally\n", 156 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-04.csv\n", 157 | "- File already exists locally\n", 158 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-05.csv\n", 159 | "- File already exists locally\n", 160 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-06.csv\n", 161 | "- File already exists locally\n", 162 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-07.csv\n", 163 | "- File already exists locally\n", 164 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-08.csv\n", 165 | "- File already exists locally\n", 166 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-09.csv\n", 167 | "- File already exists locally\n", 168 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-10.csv\n", 169 | "- File already exists locally\n", 170 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-11.csv\n", 171 | "- File already exists locally\n", 172 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-12.csv\n", 173 | "- File already exists locally\n", 174 | "- Uploading taxi data... \n", 175 | "Uploading an estimated of 12 files\n", 176 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-09.csv\n", 177 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-10.csv\n", 178 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-07.csv\n", 179 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-12.csv\n", 180 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-06.csv\n", 181 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-03.csv\n", 182 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-01.csv\n", 183 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-02.csv\n", 184 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-08.csv\n", 185 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-05.csv\n", 186 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-04.csv\n", 187 | "Target already exists. 
Skipping upload for nyctaxi/yellow_tripdata_2015-11.csv\n", 188 | "Uploaded 0 files\n", 189 | "- Data transfer complete\n" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "import io\n", 195 | "import os\n", 196 | "import sys\n", 197 | "import urllib.request\n", 198 | "from tqdm import tqdm\n", 199 | "from time import sleep\n", 200 | "\n", 201 | "cwd = os.getcwd()\n", 202 | "\n", 203 | "data_dir = os.path.abspath(os.path.join(cwd, 'data'))\n", 204 | "if not os.path.exists(data_dir):\n", 205 | " os.makedirs(data_dir)\n", 206 | "\n", 207 | "taxidir = os.path.join(data_dir, 'nyctaxi')\n", 208 | "if not os.path.exists(taxidir):\n", 209 | " os.makedirs(taxidir)\n", 210 | "\n", 211 | "filenames = []\n", 212 | "local_paths = []\n", 213 | "for i in range(1, 13):\n", 214 | " filename = \"yellow_tripdata_2015-{month:02d}.csv\".format(month=i)\n", 215 | " filenames.append(filename)\n", 216 | " \n", 217 | " local_path = os.path.join(taxidir, filename)\n", 218 | " local_paths.append(local_path)\n", 219 | "\n", 220 | "for idx, filename in enumerate(filenames):\n", 221 | " url = \"http://dask-data.s3.amazonaws.com/nyc-taxi/2015/\" + filename\n", 222 | " print(\"- Downloading \" + url)\n", 223 | " if not os.path.exists(local_paths[idx]):\n", 224 | " with open(local_paths[idx], 'wb') as file:\n", 225 | " with urllib.request.urlopen(url) as resp:\n", 226 | " length = int(resp.getheader('content-length'))\n", 227 | " blocksize = max(4096, length // 100)\n", 228 | " with tqdm(total=length, file=sys.stdout) as pbar:\n", 229 | " while True:\n", 230 | " buff = resp.read(blocksize)\n", 231 | " if not buff:\n", 232 | " break\n", 233 | " file.write(buff)\n", 234 | " pbar.update(len(buff))\n", 235 | " else:\n", 236 | " print(\"- File already exists locally\")\n", 237 | "\n", 238 | "print(\"- Uploading taxi data... \")\n", 239 | "ws = Workspace.from_config()\n", 240 | "ds = ws.get_default_datastore()\n", 241 | "\n", 242 | "ds.upload(\n", 243 | " src_dir=taxidir,\n", 244 | " target_path='nyctaxi',\n", 245 | " show_progress=True)\n", 246 | "\n", 247 | "print(\"- Data transfer complete\")" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "### Create the DASK Cluster\n", 255 | "\n", 256 | "On the AMLCompute cluster we are now running a Python job that will run a DASK cluster. " 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 5, 262 | "metadata": {}, 263 | "outputs": [ 264 | { 265 | "name": "stderr", 266 | "output_type": "stream", 267 | "text": [ 268 | "WARNING - 'gpu_support' is no longer necessary; AzureML now automatically detects and uses nvidia docker extension when it is available. It will be removed in a future release.\n", 269 | "WARNING - 'gpu_support' is no longer necessary; AzureML now automatically detects and uses nvidia docker extension when it is available. It will be removed in a future release.\n", 270 | "WARNING - 'gpu_support' is no longer necessary; AzureML now automatically detects and uses nvidia docker extension when it is available. 
It will be removed in a future release.\n" 271 | ] 272 | } 273 | ], 274 | "source": [ 275 | "mpi_config = MpiConfiguration()\n", 276 | "mpi_config.process_count_per_node = 2\n", 277 | "\n", 278 | "est = Estimator(\n", 279 | " source_directory='./dask',\n", 280 | " compute_target=gpu_cluster,\n", 281 | " entry_script='init-dask.py',\n", 282 | " script_params={\n", 283 | " '--data': ws.get_default_datastore(),\n", 284 | " '--gpus': str(2) # The number of GPUs available on each node\n", 285 | " },\n", 286 | " node_count=3,\n", 287 | " use_gpu=True,\n", 288 | " distributed_training=mpi_config,\n", 289 | " conda_dependencies_file='rapids-0.9.yaml')\n", 290 | "\n", 291 | "run = Experiment(ws, \"init-dask-jupyter\").submit(est)" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "Let's use the widget to monitor how the DASK cluster spins up. When run for the first time on a workspace, the following things will happen:\n", 299 | "\n", 300 | "1. The docker image will be created, which takes about 20 minutes. \n", 301 | "2. Then AzureML will start to scale the cluster up by provisioning the required number of nodes (`node_count` above), which will take another 5-10 minutes with the chosen Standard_ND12s.\n", 302 | "3. The docker image is transferred over to the compute nodes, which, given its size of about 8 GB, takes another 3-5 minutes.\n", 303 | "\n", 304 | "So altogether the process will take up to 30 minutes when run for the first time." 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 8, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "data": { 314 | "application/vnd.jupyter.widget-view+json": { 315 | "model_id": "cb1f363f16374a4992a6719c9d58b49d", 316 | "version_major": 2, 317 | "version_minor": 0 318 | }, 319 | "text/plain": [ 320 | "_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…" 321 | ] 322 | }, 323 | "metadata": {}, 324 | "output_type": "display_data" 325 | } 326 | ], 327 | "source": [ 328 | "from azureml.widgets import RunDetails\n", 329 | "RunDetails(run).show()" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "### Wait for the cluster to come up" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 9, 342 | "metadata": { 343 | "scrolled": true 344 | }, 345 | "outputs": [], 346 | "source": [ 347 | "from IPython.display import clear_output\n", 348 | "import time\n", 349 | "\n", 350 | "it = 0\n", 351 | "while not \"headnode\" in run.get_metrics():\n", 352 | " clear_output(wait=True)\n", 353 | " print(\"waiting for scheduler node's ip \" + str(it) )\n", 354 | " time.sleep(1)\n", 355 | " it += 1\n", 356 | "\n", 357 | "headnode = run.get_metrics()[\"headnode\"]\n", 358 | "jupyter_ip = run.get_metrics()[\"jupyter-ip\"]\n", 359 | "jupyter_port = run.get_metrics()[\"jupyter-port\"]\n", 360 | "jupyter_token = run.get_metrics()[\"jupyter-token\"]" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "### Establish port forwarding to the cluster" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 10, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "name": "stdout", 377 | "output_type": "stream", 378 | "text": [ 379 | "If you are forwarding the ports from your local machine as described at the top of this notebook,\n", 380 | "then you should now be able to connect to the Dashboard and 
Jupyter Server via the following URLs:\n", 381 | "\n", 382 | " Dashboard: http://localhost:9797\n", 383 | " Jupyter on cluster: http://localhost:9999/notebooks/azure_taxi_on_cluster.ipynb?token=0ed225c7db00699b10d80d86dc09d8149d6eae21e7200aac\n" 384 | ] 385 | } 386 | ], 387 | "source": [ 388 | "dashboard = PortForwarder(9797, headnode, 8787).start()\n", 389 | "jupyter = PortForwarder(9999, headnode, 8888).start()\n", 390 | "\n", 391 | "print(\"If you are forwarding the ports from your local machine as described at the top of this notebook,\")\n", 392 | "print(\"then you should now be able to connect to the Dashboard and Jupyter Server via the following URLs:\")\n", 393 | "print()\n", 394 | "print(f\" Dashboard: http://localhost:9797\")\n", 395 | "print(f\" Jupyter on cluster: http://localhost:9999/notebooks/azure_taxi_on_cluster.ipynb?token={jupyter_token}\")" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "## Shutting the cluster down\n", 403 | "\n", 404 | "Terminate the run to shut the cluster down. Once you are done with your interactive work, make sure to do this so the AML Compute cluster gets spun down again. " 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": null, 410 | "metadata": {}, 411 | "outputs": [], 412 | "source": [ 413 | "# stop the run representing the cluster\n", 414 | "run.cancel()\n", 415 | "# shut down the port forwards\n", 416 | "PortForwarder.stop_all()" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "### Useful for debugging" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 5, 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "# get the last run\n", 433 | "run = Experiment(ws, \"init-dask-jupyter\").get_runs().__next__()" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 23, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "data": { 443 | "text/plain": [ 444 | "{'headnode': '172.17.0.6',\n", 445 | " 'scheduler': '172.17.0.6:8786',\n", 446 | " 'dashboard': '172.17.0.6:8787',\n", 447 | " 'data': '/mnt/batch/tasks/shared/LS_root/jobs/vnettest/azureml/init-dask-jupyter_1570114867_699d20d4/mounts/workspaceblobstore',\n", 448 | " 'jupyter-url': 'http://172.17.0.6:8888/?token=0f85e874d045185e175027bab126bd404ebe444c237a765a',\n", 449 | " 'jupyter-port': 8888,\n", 450 | " 'jupyter-token': '0f85e874d045185e175027bab126bd404ebe444c237a765a',\n", 451 | " 'jupyter-ip': '172.17.0.6'}" 452 | ] 453 | }, 454 | "execution_count": 23, 455 | "metadata": {}, 456 | "output_type": "execute_result" 457 | } 458 | ], 459 | "source": [ 460 | "run.get_metrics()" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 8, 466 | "metadata": {}, 467 | "outputs": [ 468 | { 469 | "data": { 470 | "text/plain": [ 471 | "'Running'" 472 | ] 473 | }, 474 | "execution_count": 8, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "run.status" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "metadata": {}, 487 | "outputs": [], 488 | "source": [] 489 | } 490 | ], 491 | "metadata": { 492 | "kernelspec": { 493 | "display_name": "Python (rapids-0.9)", 494 | "language": "python", 495 | "name": "dask" 496 | }, 497 | "language_info": { 498 | "codemirror_mode": { 499 | "name": "ipython", 500 | "version": 3 501 | }, 502 | "file_extension": ".py", 503 | "mimetype": "text/x-python", 504 | 
"name": "python", 505 | "nbconvert_exporter": "python", 506 | "pygments_lexer": "ipython3", 507 | "version": "3.7.4" 508 | } 509 | }, 510 | "nbformat": 4, 511 | "nbformat_minor": 4 512 | } 513 | -------------------------------------------------------------------------------- /setup-vnet.md: -------------------------------------------------------------------------------- 1 | # Setting up a DSVM in a VNet 2 | 3 | ## Create the VNet 4 | 5 | ![](img/1.png) 6 | 7 | ![](img/2.png) 8 | 9 | ## Create DSVM using the VNet 10 | 11 | ![](img/5.png) 12 | 13 | ![](img/6.png) 14 | 15 | During setup, it is convenient to use your local username also for the DSVM and to provide your public key during setup, so you can easily ssh onto the VM. 16 | Once the DSVM is created, assign it a DNS name by clicking on the Public IP-Address 17 | ![](img/8.png) 18 | 19 | And assign it a DNS name, so you can access it by that name (alternatively, you can also switch to a static IP) 20 | 21 | ![](img/9.png) 22 | 23 | Here is some more information on the DSVM: https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro 24 | 25 | Download the config.json from the workspace and upload it to the DSVM (just put it your user's home folder). 26 | ![](img/10.png) 27 | 28 | ``` 29 | scp config.json : 30 | ``` 31 | 32 | Now log in to the DSVM 33 | 34 | ``` 35 | ssh -L 8888:localhost:8888 36 | ``` 37 | 38 | On the DSVM, pull down this repository and create the python environment: 39 | 40 | ``` 41 | git clone https://github.com/danielsc/azureml-and-dask 42 | cd azureml-and-dask/ 43 | conda env create -f dask/environment.yml 44 | conda activate dask 45 | python -m ipykernel install --user --name dask --display-name "Python (dask)" 46 | ``` 47 | 48 | Next start jupyter on the DSVM: 49 | 50 | ``` 51 | nohup jupyter notebook & 52 | ``` 53 | 54 | Find the login token/url in nohup.out 55 | 56 | ``` 57 | (dask) danielsc@vnettestvm:~/git/azureml-and-dask$ tail nohup.out 58 | [C 21:17:35.360 NotebookApp] 59 | 60 | To access the notebook, open this file in a browser: 61 | file:///data/home/danielsc/.local/share/jupyter/runtime/nbserver-18401-open.html 62 | Or copy and paste one of these URLs: 63 | http://localhost:8888/?token=6819bfd774eb016e2adc0eab9ec7ad04708058a278dd335f 64 | or http://127.0.0.1:8888/?token=6819bfd774eb016e2adc0eab9ec7ad04708058a278dd335f 65 | (dask) danielsc@vnettestvm:~/git/azureml-and-dask$ 66 | ``` 67 | 68 | If you started the ssh session with the port forward as above, then the link above should just work for you (in my case: http://localhost:8888/?token=6819bfd774eb016e2adc0eab9ec7ad04708058a278dd335f). 69 | 70 | --------------------------------------------------------------------------------