├── .gitignore
├── README.md
├── batch
│   ├── RunBatch.ipynb
│   └── dask
│       ├── batch.py
│       ├── environment.yml
│       └── startDask.py
├── img
│   ├── 1.png
│   ├── 10.png
│   ├── 2.png
│   ├── 3.png
│   ├── 4.png
│   ├── 5.png
│   ├── 6.png
│   ├── 7.png
│   ├── 8.png
│   ├── 9.png
│   ├── bokeh.png
│   ├── compute_nodes.png
│   ├── create_cluster.png
│   ├── dask-status.gif
│   └── network.png
├── interactive
│   ├── LoadDataFromDatastore.ipynb
│   ├── StartDask.ipynb
│   ├── StartDaskVNet.ipynb
│   ├── dask
│   │   ├── DaskNYCTaxi.ipynb
│   │   ├── environment.yml
│   │   └── startDask.py
│   └── mydask.png
├── rapids_interactive
│   ├── dask
│   │   ├── azure_taxi_on_cluster.ipynb
│   │   ├── dask.yml
│   │   ├── init-dask.py
│   │   ├── jupyter-preload.py
│   │   ├── rapids-0.9.yaml
│   │   └── rapids.yml
│   └── start_cluster.ipynb
└── setup-vnet.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode/**
2 | **/.ipynb_checkpoints/*
3 | rapids_interactive/data/*
4 | **/._*
5 | nohup.out
6 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Running Dask on AzureML
2 |
3 | **This repository is no longer maintained. For a simple way of running Dask on an AzureML cluster, please check out the AzureML CLI v2 DASK samples here: https://github.com/Azure/azureml-examples/tree/main/cli**
4 |
5 |
6 | ----
7 |
8 | This repository shows how to run a [Dask](https://docs.dask.org/en/latest/) cluster on an [AzureML](https://docs.microsoft.com/en-us/azure/machine-learning/service/) Compute cluster. It is designed to run on an AzureML Notebook VM (created after 8/15/2019), but it should work on your local computer, too.
9 |
10 | Please follow these setup instructions and then start:
11 |
12 | - here for plain Dask interactive scenarios: [interactive/StartDask.ipynb](interactive/StartDask.ipynb).
13 | - here for Dask with NVIDIA RAPIDS interactive scenarios: [rapids_interactive/start_cluster.ipynb](rapids_interactive/start_cluster.ipynb).
14 |
15 | ## Setting up the Python Environment
16 | The environment you are running should have the latest versions of `dask` and `distributed` installed -- run this code in the terminal to make sure:
17 |
18 | ```shell
19 | conda activate py36 # assuming AzureML Notebook VM
20 | pip install --upgrade dask distributed
21 | ```
22 |
23 | Or, if you want to be on the safe side, create a new conda environment using this [environment.yml](interactive/dask/environment.yml) file like so:
24 |
25 | ```shell
26 | conda env create -f dask/environment.yml
27 | conda activate dask
28 | python -m ipykernel install --user --name dask --display-name "Python (dask)"
29 | ```
30 |
31 | 
32 |
33 |
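34 | ## Connecting to the running cluster
35 |
36 | Once the cluster run is up, the interactive notebooks talk to it through a Dask `Client`. Below is a minimal sketch of that pattern (the same one used in [interactive/LoadDataFromDatastore.ipynb](interactive/LoadDataFromDatastore.ipynb)), assuming the scheduler port has been forwarded to `localhost:8786` as set up in [interactive/StartDask.ipynb](interactive/StartDask.ipynb):
37 |
38 | ```python
39 | # look up the run that hosts the Dask cluster and connect a client to its scheduler
40 | from azureml.core import Workspace
41 | from dask.distributed import Client
42 |
43 | ws = Workspace.from_config()
44 | cluster_run = next(r for r in ws.experiments['dask'].get_runs() if r.get_status() == 'Running')
45 | print(cluster_run.get_metrics())    # scheduler, dashboard and jupyter endpoints logged by startDask.py
46 |
47 | c = Client('tcp://localhost:8786')  # assumes the scheduler port is forwarded to this machine
48 | print(c)
49 | ```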
--------------------------------------------------------------------------------
/batch/RunBatch.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Running Dask on AzureML\n",
8 | "\n",
9 | "This notebook shows how to run a batch job on a Dask cluster running on an AzureML Compute cluster. \n",
10 | "For setup instructions of your python environment, please see the [Readme](../README.md)"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 1,
16 | "metadata": {},
17 | "outputs": [
18 | {
19 | "data": {
20 | "text/plain": [
21 | "'1.12.0'"
22 | ]
23 | },
24 | "execution_count": 1,
25 | "metadata": {},
26 | "output_type": "execute_result"
27 | }
28 | ],
29 | "source": [
30 | "from azureml.core import Workspace, Experiment\n",
31 | "from azureml.train.estimator import Estimator\n",
32 | "from azureml.widgets import RunDetails\n",
33 | "from azureml.core.runconfig import MpiConfiguration\n",
34 | "from azureml.core import VERSION\n",
35 | "import uuid\n",
36 | "import time\n",
37 | "VERSION\n"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 2,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "ws = Workspace.from_config()"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "## Download the NYC Taxi dataset and upload to the workspace default blob storage"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 9,
59 | "metadata": {},
60 | "outputs": [
61 | {
62 | "name": "stdout",
63 | "output_type": "stream",
64 | "text": [
65 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-01.csv\n",
66 | "100%|██████████| 1985964692/1985964692 [00:30<00:00, 65604283.41it/s] \n",
67 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-02.csv\n",
68 | "100%|██████████| 1945357622/1945357622 [00:29<00:00, 65506177.65it/s]\n",
69 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-03.csv\n",
70 | "100%|██████████| 2087971794/2087971794 [00:33<00:00, 62180625.55it/s] \n",
71 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-04.csv\n",
72 | "100%|██████████| 2046225765/2046225765 [00:31<00:00, 65746019.73it/s]\n",
73 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-05.csv\n",
74 | "100%|██████████| 2061869121/2061869121 [00:27<00:00, 73939136.66it/s] \n",
75 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-06.csv\n",
76 | "100%|██████████| 1932049357/1932049357 [00:29<00:00, 64596156.85it/s]\n",
77 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-07.csv\n",
78 | "100%|██████████| 1812530041/1812530041 [00:29<00:00, 61745527.58it/s] \n",
79 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-08.csv\n",
80 | "100%|██████████| 1744852237/1744852237 [00:26<00:00, 65974018.30it/s] \n",
81 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-09.csv\n",
82 | "100%|██████████| 1760412710/1760412710 [00:27<00:00, 64174609.37it/s]\n",
83 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-10.csv\n",
84 | "100%|██████████| 1931460927/1931460927 [00:29<00:00, 65248050.69it/s]\n",
85 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-11.csv\n",
86 | "100%|██████████| 1773468989/1773468989 [00:31<00:00, 56412556.41it/s]\n",
87 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-12.csv\n",
88 | "100%|██████████| 1796283025/1796283025 [00:26<00:00, 68628572.27it/s] \n",
89 | "- Uploading taxi data... \n",
90 | "Uploading an estimated of 12 files\n",
91 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-01.csv\n",
92 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-02.csv\n",
93 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-03.csv\n",
94 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-04.csv\n",
95 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-05.csv\n",
96 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-06.csv\n",
97 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-07.csv\n",
98 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-08.csv\n",
99 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-09.csv\n",
100 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-10.csv\n",
101 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-11.csv\n",
102 | "Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-12.csv\n",
103 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-11.csv, 1 files out of an estimated total of 12\n",
104 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-08.csv, 2 files out of an estimated total of 12\n",
105 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-07.csv, 3 files out of an estimated total of 12\n",
106 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-09.csv, 4 files out of an estimated total of 12\n",
107 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-12.csv, 5 files out of an estimated total of 12\n",
108 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-10.csv, 6 files out of an estimated total of 12\n",
109 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-06.csv, 7 files out of an estimated total of 12\n",
110 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-02.csv, 8 files out of an estimated total of 12\n",
111 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-01.csv, 9 files out of an estimated total of 12\n",
112 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-03.csv, 10 files out of an estimated total of 12\n",
113 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-04.csv, 11 files out of an estimated total of 12\n",
114 | "Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/danielsc-dask/code/Users/danielsc/azureml-and-dask/batch/data/nyctaxi/yellow_tripdata_2015-05.csv, 12 files out of an estimated total of 12\n",
115 | "Uploaded 12 files\n",
116 | "- Data transfer complete\n"
117 | ]
118 | }
119 | ],
120 | "source": [
121 | "import io\n",
122 | "import os\n",
123 | "import sys\n",
124 | "import urllib.request\n",
125 | "from tqdm import tqdm\n",
126 | "from time import sleep\n",
127 | "\n",
128 | "cwd = os.getcwd()\n",
129 | "\n",
130 | "data_dir = os.path.abspath(os.path.join(cwd, 'data'))\n",
131 | "if not os.path.exists(data_dir):\n",
132 | " os.makedirs(data_dir)\n",
133 | "\n",
134 | "taxidir = os.path.join(data_dir, 'nyctaxi')\n",
135 | "if not os.path.exists(taxidir):\n",
136 | " os.makedirs(taxidir)\n",
137 | "\n",
138 | "filenames = []\n",
139 | "local_paths = []\n",
140 | "for i in range(1, 13):\n",
141 | " filename = \"yellow_tripdata_2015-{month:02d}.csv\".format(month=i)\n",
142 | " filenames.append(filename)\n",
143 | " \n",
144 | " local_path = os.path.join(taxidir, filename)\n",
145 | " local_paths.append(local_path)\n",
146 | "\n",
147 | "for idx, filename in enumerate(filenames):\n",
148 | " url = \"http://dask-data.s3.amazonaws.com/nyc-taxi/2015/\" + filename\n",
149 | " print(\"- Downloading \" + url)\n",
150 | " if not os.path.exists(local_paths[idx]):\n",
151 | " with open(local_paths[idx], 'wb') as file:\n",
152 | " with urllib.request.urlopen(url) as resp:\n",
153 | " length = int(resp.getheader('content-length'))\n",
154 | " blocksize = max(4096, length // 100)\n",
155 | " with tqdm(total=length, file=sys.stdout) as pbar:\n",
156 | " while True:\n",
157 | " buff = resp.read(blocksize)\n",
158 | " if not buff:\n",
159 | " break\n",
160 | " file.write(buff)\n",
161 | " pbar.update(len(buff))\n",
162 | " else:\n",
163 | " print(\"- File already exists locally\")\n",
164 | "\n",
165 | "print(\"- Uploading taxi data... \")\n",
166 | "ws = Workspace.from_config()\n",
167 | "ds = ws.get_default_datastore()\n",
168 | "\n",
169 | "ds.upload(\n",
170 | " src_dir=taxidir,\n",
171 | " target_path='nyctaxi',\n",
172 | " show_progress=True)\n",
173 | "\n",
174 | "print(\"- Data transfer complete\")"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "## Starting the cluster"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 10,
187 | "metadata": {},
188 | "outputs": [],
189 | "source": [
190 | "# we assume the AML compute training cluster is already created\n",
191 | "dask_cluster = ws.compute_targets['daniel-big']"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "Starting the Dask cluster using an Estimator with MpiConfiguration. Make sure the cluster is able to scale up to 10 nodes or change the `node_count` below. "
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": 14,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "est = Estimator('dask', \n",
208 | " compute_target=dask_cluster, \n",
209 | " entry_script='startDask.py', \n",
210 | " conda_dependencies_file='environment.yml', \n",
211 | " script_params={'--datastore': ws.get_default_datastore(),\n",
212 | " '--script': 'batch.py'},\n",
213 | " node_count=10,\n",
214 | " distributed_training=MpiConfiguration())\n",
215 | "\n",
216 | "run = Experiment(ws, 'dask').submit(est)"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": 15,
222 | "metadata": {
223 | "scrolled": false
224 | },
225 | "outputs": [
226 | {
227 | "data": {
228 | "application/vnd.jupyter.widget-view+json": {
229 | "model_id": "e917d855441647f09fcc50f3809622f5",
230 | "version_major": 2,
231 | "version_minor": 0
232 | },
233 | "text/plain": [
234 | "_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…"
235 | ]
236 | },
237 | "metadata": {},
238 | "output_type": "display_data"
239 | },
240 | {
241 | "data": {
242 | "application/aml.mini.widget.v1": "{\"status\": \"Queued\", \"workbench_run_details_uri\": \"https://ml.azure.com/experiments/dask/runs/dask_1599051891_1033bd4f?wsid=/subscriptions/6560575d-fa06-4e7d-95fb-f962e74efd7a/resourcegroups/dask-rg/workspaces/dask-azureml\", \"run_id\": \"dask_1599051891_1033bd4f\", \"run_properties\": {\"run_id\": \"dask_1599051891_1033bd4f\", \"created_utc\": \"2020-09-02T13:04:56.761121Z\", \"properties\": {\"_azureml.ComputeTargetType\": \"amlcompute\", \"ContentSnapshotId\": \"66fa5036-4c6b-47f6-aa7e-d7e9340a74cb\", \"azureml.git.repository_uri\": \"https://github.com/danielsc/azureml-and-dask\", \"mlflow.source.git.repoURL\": \"https://github.com/danielsc/azureml-and-dask\", \"azureml.git.branch\": \"master\", \"mlflow.source.git.branch\": \"master\", \"azureml.git.commit\": \"f71a6182f15f2344e7b39589434f3d3461a89344\", \"mlflow.source.git.commit\": \"f71a6182f15f2344e7b39589434f3d3461a89344\", \"azureml.git.dirty\": \"True\", \"ProcessInfoFile\": \"azureml-logs/process_info.json\", \"ProcessStatusFile\": \"azureml-logs/process_status.json\"}, \"tags\": {\"_aml_system_ComputeTargetStatus\": \"{\\\"AllocationState\\\":\\\"resizing\\\",\\\"PreparingNodeCount\\\":0,\\\"RunningNodeCount\\\":0,\\\"CurrentNodeCount\\\":0}\"}, \"script_name\": null, \"arguments\": null, \"end_time_utc\": null, \"status\": \"Queued\", \"log_files\": {}, \"log_groups\": [], \"run_duration\": \"0:02:33\"}, \"child_runs\": [], \"children_metrics\": {}, \"run_metrics\": [], \"run_logs\": \"Your job is submitted in Azure cloud and we are monitoring to get logs...\", \"graph\": {}, \"widget_settings\": {\"childWidgetDisplay\": \"popup\", \"send_telemetry\": false, \"log_level\": \"INFO\", \"sdk_version\": \"1.12.0\"}, \"loading\": false}"
243 | },
244 | "metadata": {},
245 | "output_type": "display_data"
246 | }
247 | ],
248 | "source": [
249 | "RunDetails(run).show()"
250 | ]
251 | },
252 | {
253 | "cell_type": "markdown",
254 | "metadata": {},
255 | "source": [
256 | "## Shut cluster down"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 50,
262 | "metadata": {},
263 | "outputs": [
264 | {
265 | "name": "stdout",
266 | "output_type": "stream",
267 | "text": [
268 | "cancelling run dask_1575974502_b8643732\n",
269 | "cancelling run dask_1575973181_99433e88\n"
270 | ]
271 | }
272 | ],
273 | "source": [
274 | "for run in ws.experiments['dask'].get_runs():\n",
275 | " if run.get_status() == \"Running\":\n",
276 | " print(f'cancelling run {run.id}')\n",
277 | " run.cancel()"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "### Just for convenience, get the latest running Run"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": 87,
290 | "metadata": {},
291 | "outputs": [
292 | {
293 | "name": "stdout",
294 | "output_type": "stream",
295 | "text": [
296 | "latest running run is dask_1574792066_49c85fe4\n"
297 | ]
298 | }
299 | ],
300 | "source": [
301 | "for run in ws.experiments['dask'].get_runs():\n",
302 | " if run.get_status() == \"Running\":\n",
303 | " print(f'latest running run is {run.id}')\n",
304 | " break"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": []
313 | }
314 | ],
315 | "metadata": {
316 | "kernelspec": {
317 | "display_name": "Python 3.6 - AzureML",
318 | "language": "python",
319 | "name": "python3-azureml"
320 | },
321 | "language_info": {
322 | "codemirror_mode": {
323 | "name": "ipython",
324 | "version": 3
325 | },
326 | "file_extension": ".py",
327 | "mimetype": "text/x-python",
328 | "name": "python",
329 | "nbconvert_exporter": "python",
330 | "pygments_lexer": "ipython3",
331 | "version": "3.6.9"
332 | }
333 | },
334 | "nbformat": 4,
335 | "nbformat_minor": 2
336 | }
337 |
--------------------------------------------------------------------------------
/batch/dask/batch.py:
--------------------------------------------------------------------------------
1 | # +
2 | from dask.distributed import Client
3 | from azureml.core import Run
4 | import dask.dataframe as dd
5 | from fsspec.registry import known_implementations
6 | import os, uuid
7 |
8 | c = Client("localhost:8786")  # connect to the Dask scheduler that startDask.py launched on this head node
9 | print(c)
10 |
11 |
12 | run = Run.get_context()
13 | ws = run.experiment.workspace
14 |
15 | ds = ws.get_default_datastore()
16 | ACCOUNT_NAME = ds.account_name
17 | ACCOUNT_KEY = ds.account_key
18 | CONTAINER = ds.container_name
19 |
20 | known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}
21 | STORAGE_OPTIONS={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
22 | df = dd.read_csv(f'abfs://{CONTAINER}/nyctaxi/*.csv',
23 | storage_options=STORAGE_OPTIONS,
24 | parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
25 |
26 | print(df.head())
27 |
28 | # list of column names that need to be re-mapped
29 | remap = {}
30 | remap['tpep_pickup_datetime'] = 'pickup_datetime'
31 | remap['tpep_dropoff_datetime'] = 'dropoff_datetime'
32 | remap['RatecodeID'] = 'rate_code'
33 |
34 | #create a list of columns & dtypes the df must have
35 | must_haves = {
36 | 'VendorID': 'object',
37 | 'pickup_datetime': 'datetime64[ms]',
38 | 'dropoff_datetime': 'datetime64[ms]',
39 | 'passenger_count': 'int32',
40 | 'trip_distance': 'float32',
41 | 'pickup_longitude': 'float32',
42 | 'pickup_latitude': 'float32',
43 | 'rate_code': 'int32',
44 | 'payment_type': 'int32',
45 | 'dropoff_longitude': 'float32',
46 | 'dropoff_latitude': 'float32',
47 | 'fare_amount': 'float32',
48 | 'tip_amount': 'float32',
49 | 'total_amount': 'float32'
50 | }
51 |
52 | query_frags = [
53 | 'fare_amount > 0 and fare_amount < 500',
54 | 'passenger_count > 0 and passenger_count < 6',
55 | 'pickup_longitude > -75 and pickup_longitude < -73',
56 | 'dropoff_longitude > -75 and dropoff_longitude < -73',
57 | 'pickup_latitude > 40 and pickup_latitude < 42',
58 | 'dropoff_latitude > 40 and dropoff_latitude < 42'
59 | ]
60 | query = ' and '.join(query_frags)
61 |
62 | # helper function which takes a DataFrame partition
63 | def clean(df_part, remap, must_haves, query):
64 | df_part = df_part.query(query)
65 |
66 | # some column names include prepended spaces; strip & lowercase them if needed:
67 | # tmp = {col:col.strip().lower() for col in list(df_part.columns)}
68 |
69 | # rename using the supplied mapping
70 | df_part = df_part.rename(columns=remap)
71 |
72 | # iterate through columns in this df partition
73 | for col in df_part.columns:
74 | # drop anything not in our expected list
75 | if col not in must_haves:
76 | df_part = df_part.drop(col, axis=1)
77 | continue
78 |
79 | if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:
80 | df_part[col] = df_part[col].astype('datetime64[ms]')
81 | continue
82 |
83 | # if column was read as a string, recast as float
84 | if df_part[col].dtype == 'object':
85 | df_part[col] = df_part[col].fillna('-1')
86 | df_part[col] = df_part[col].astype('float32')
87 | else:
88 | # save some memory by using 32 bit floats
89 | if 'int' in str(df_part[col].dtype):
90 | df_part[col] = df_part[col].astype('int32')
91 | if 'float' in str(df_part[col].dtype):
92 | df_part[col] = df_part[col].astype('float32')
93 | df_part[col] = df_part[col].fillna(-1)
94 |
95 | return df_part
96 |
97 | import math
98 | from math import pi
99 | from dask.array import cos, sin, arcsin, sqrt, floor
100 | import numpy as np
101 |
102 | def haversine_distance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude):
103 | x_1 = pi / 180 * pickup_latitude
104 | y_1 = pi / 180 * pickup_longitude
105 | x_2 = pi / 180 * dropoff_latitude
106 | y_2 = pi / 180 * dropoff_longitude
107 |
108 | dlon = y_2 - y_1
109 | dlat = x_2 - x_1
110 | a = sin(dlat / 2)**2 + cos(x_1) * cos(x_2) * sin(dlon / 2)**2
111 |
112 | c = 2 * arcsin(sqrt(a))
113 | r = 6371 # Radius of earth in kilometers
114 |
115 | return c * r
116 |
117 | def day_of_the_week(day, month, year):
118 | if month < 3:
119 | shift = month
120 | else:
121 | shift = 0
122 | Y = year - (month < 3)
123 | y = Y - 2000
124 | c = 20
125 | d = day
126 | m = month + shift + 1
127 | return (d + floor(m * 2.6) + y + (y // 4) + (c // 4) - 2 * c) % 7
128 |
129 | def add_features(df):
130 | df['hour'] = df['pickup_datetime'].dt.hour.astype('int32')
131 | df['year'] = df['pickup_datetime'].dt.year.astype('int32')
132 | df['month'] = df['pickup_datetime'].dt.month.astype('int32')
133 | df['day'] = df['pickup_datetime'].dt.day.astype('int32')
134 | df['day_of_week'] = df['pickup_datetime'].dt.weekday.astype('int32')
135 |
136 | #df['diff'] = df['dropoff_datetime'].astype('int32') - df['pickup_datetime'].astype('int32')
137 | df['diff'] = df['dropoff_datetime'] - df['pickup_datetime']
138 |
139 | df['pickup_latitude_r'] = (df['pickup_latitude'] // .01 * .01).astype('float32')
140 | df['pickup_longitude_r'] = (df['pickup_longitude'] // .01 * .01).astype('float32')
141 | df['dropoff_latitude_r'] = (df['dropoff_latitude'] // .01 * .01).astype('float32')
142 | df['dropoff_longitude_r'] = (df['dropoff_longitude'] // .01 * .01).astype('float32')
143 |
144 | #df = df.drop('pickup_datetime', axis=1)
145 | #df = df.drop('dropoff_datetime', axis=1)
146 |
147 | #df = df.apply_rows(haversine_distance_kernel,
148 | # incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],
149 | # outcols=dict(h_distance=np.float32),
150 | # kwargs=dict())
151 |
152 | import numpy
153 |
154 | df['h_distance'] = haversine_distance(df['pickup_latitude'],
155 | df['pickup_longitude'],
156 | df['dropoff_latitude'],
157 | df['dropoff_longitude']).astype('float32')
158 |
159 | #df = df.apply_rows(day_of_the_week_kernel,
160 | # incols=['day', 'month', 'year'],
161 | # outcols=dict(day_of_week=np.float32),
162 | # kwargs=dict())
163 | #df['day_of_week'] = numpy.empty(len(df), dtype=np.int32)
164 | #day_of_the_week_kernel(df['day'],
165 | # df['month'],
166 | # df['year'],
167 | # df['day_of_week'])
168 |
169 |
170 | df['is_weekend'] = (df['day_of_week'] >= 5).astype("int32")  # dt.weekday: Mon=0 ... Sun=6, so weekend days are >= 5
171 | return df
172 |
173 | taxi_df = clean(df, remap, must_haves, query)
174 | taxi_df = add_features(taxi_df)
175 | output_uuid = uuid.uuid1().hex
176 | run.log('output_uuid', output_uuid)
177 | output_path = run.get_metrics()['datastore'] + '/output/' + output_uuid + '.parquet'
178 |
179 | print('save parquet to ', output_path)
180 |
181 | taxi_df.to_parquet(output_path)
182 |
183 | print('done')
184 |
185 | os.system('ls -alg ' + output_path)
186 |
187 | print('shutting down cluster')
188 | c.shutdown()
189 |
--------------------------------------------------------------------------------
/batch/dask/environment.yml:
--------------------------------------------------------------------------------
1 | name: dask
2 | channels:
3 | - defaults
4 | - conda-forge
5 | dependencies:
6 | - gcsfs
7 | - fs-gcsfs
8 | - jupyterlab
9 | - jupyter-server-proxy
10 | - python=3.6
11 | - numpy
12 | - h5py
13 | - scipy
14 | - toolz
15 | - bokeh
16 | - dask
17 | - distributed
18 | - notebook
19 | - matplotlib
20 | - Pillow
21 | - pandas
22 | - pandas-datareader
23 | - pytables
24 | - scikit-learn
25 | - scikit-image
26 | - snakeviz
27 | - ujson
28 | - graphviz
29 | - pip
30 | - s3fs
31 | - fastparquet
32 | - dask-ml
33 | - pip:
34 | - graphviz
35 | - cachey
36 | - azureml-sdk[notebooks]
37 | - mpi4py
38 | - gym
39 | - adlfs
--------------------------------------------------------------------------------
/batch/dask/startDask.py:
--------------------------------------------------------------------------------
1 | # +
2 | from mpi4py import MPI
3 | import os
4 | import argparse
5 | import time
6 | from dask.distributed import Client
7 | from azureml.core import Run
8 | import sys, uuid
9 | import threading
10 | import subprocess
11 | import socket
12 |
13 | from notebook.notebookapp import list_running_servers
14 |
15 |
16 | # -
17 |
18 | def flush(proc, proc_log):
19 | while True:
20 | proc_out = proc.stdout.readline()
21 | if proc_out == '' and proc.poll() is not None:
22 | proc_log.close()
23 | break
24 | elif proc_out:
25 | sys.stdout.write(proc_out)
26 | proc_log.write(proc_out)
27 | proc_log.flush()
28 |
29 |
30 | if __name__ == '__main__':
31 | comm = MPI.COMM_WORLD
32 | rank = comm.Get_rank()
33 |
34 | parser = argparse.ArgumentParser()
35 | parser.add_argument("--datastore")
36 | parser.add_argument("--jupyter_token", default=uuid.uuid1().hex)
37 | parser.add_argument("--script")
38 |
39 | args, unparsed = parser.parse_known_args()
40 |
41 | ip = socket.gethostbyname(socket.gethostname())
42 |
43 | print("- my rank is ", rank)
44 | print("- my ip is ", ip)
45 |
46 | if rank == 0:
47 | data = {
48 | "scheduler" : ip + ":8786",
49 | "dashboard" : ip + ":8787"
50 | }
51 | else:
52 | data = None
53 |
54 | data = comm.bcast(data, root=0)
55 | scheduler = data["scheduler"]
56 | dashboard = data["dashboard"]
57 | print("- scheduler is ", scheduler)
58 | print("- dashboard is ", dashboard)
59 |
60 |
61 | print("args: ", args)
62 | print("unparsed: ", unparsed)
63 | print("- my rank is ", rank)
64 | print("- my ip is ", ip)
65 |
66 | if rank == 0:
67 | Run.get_context().log("headnode", ip)
68 | Run.get_context().log("cluster",
69 | "scheduler: {scheduler}, dashboard: {dashboard}".format(scheduler=scheduler,
70 | dashboard=dashboard))
71 | Run.get_context().log("datastore", args.datastore)
72 |
73 | cmd = ("jupyter lab --ip 0.0.0.0 --port 8888" + \
74 | " --NotebookApp.token={token}" + \
75 | " --allow-root --no-browser").format(token=args.jupyter_token)
76 | jupyter_log = open("jupyter_log.txt", "a")
77 | jupyter_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
78 |
79 | jupyter_flush = threading.Thread(target=flush, args=(jupyter_proc, jupyter_log))
80 | jupyter_flush.start()
81 |
82 | while not list(list_running_servers()):
83 | time.sleep(5)
84 |
85 | jupyter_servers = list(list_running_servers())
86 | assert (len(jupyter_servers) == 1), "more than one jupyter server is running"
87 |
88 | Run.get_context().log("jupyter",
89 | "ip: {ip_addr}, port: {port}".format(ip_addr=ip, port=jupyter_servers[0]["port"]))
90 | Run.get_context().log("jupyter-token", jupyter_servers[0]["token"])
91 |
92 | cmd = "dask-scheduler " + "--port " + scheduler.split(":")[1] + " --dashboard-address " + dashboard
93 | scheduler_log = open("scheduler_log.txt", "w")
94 | scheduler_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
95 |
96 | cmd = "dask-worker " + scheduler
97 | worker_log = open("worker_{rank}_log.txt".format(rank=rank), "w")
98 | worker_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
99 |
100 | worker_flush = threading.Thread(target=flush, args=(worker_proc, worker_log))
101 | worker_flush.start()
102 |
103 | if(args.script):
104 | command_line = ' '.join(['python', args.script]+unparsed)
105 | print('Launching:', command_line)
106 | exit_code = os.system(command_line)
107 | print('process ended with code', exit_code)
108 | print('killing scheduler, worker and jupyter')
109 | jupyter_proc.kill()
110 | scheduler_proc.kill()
111 | worker_proc.kill()
112 | exit(exit_code)
113 | else:
114 | flush(scheduler_proc, scheduler_log)
115 | else:
116 | cmd = "dask-worker " + scheduler
117 | worker_log = open("worker_{rank}_log.txt".format(rank=rank), "w")
118 | worker_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
119 |
120 | flush(worker_proc, worker_log)
121 |
--------------------------------------------------------------------------------
/img/1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/1.png
--------------------------------------------------------------------------------
/img/10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/10.png
--------------------------------------------------------------------------------
/img/2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/2.png
--------------------------------------------------------------------------------
/img/3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/3.png
--------------------------------------------------------------------------------
/img/4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/4.png
--------------------------------------------------------------------------------
/img/5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/5.png
--------------------------------------------------------------------------------
/img/6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/6.png
--------------------------------------------------------------------------------
/img/7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/7.png
--------------------------------------------------------------------------------
/img/8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/8.png
--------------------------------------------------------------------------------
/img/9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/9.png
--------------------------------------------------------------------------------
/img/bokeh.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/bokeh.png
--------------------------------------------------------------------------------
/img/compute_nodes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/compute_nodes.png
--------------------------------------------------------------------------------
/img/create_cluster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/create_cluster.png
--------------------------------------------------------------------------------
/img/dask-status.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/dask-status.gif
--------------------------------------------------------------------------------
/img/network.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/img/network.png
--------------------------------------------------------------------------------
/interactive/LoadDataFromDatastore.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Loading your Data from an AzureML Datastore\n",
8 | "\n",
9 | "**Important**: Make sure to execute the steps to start the cluster in the notebook [StartDask.ipynb](StartDask.ipynb) before running this noteboook."
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 25,
15 | "metadata": {},
16 | "outputs": [
17 | {
18 | "data": {
19 | "text/plain": [
20 | "'1.0.74'"
21 | ]
22 | },
23 | "execution_count": 25,
24 | "metadata": {},
25 | "output_type": "execute_result"
26 | }
27 | ],
28 | "source": [
29 | "from azureml.core import Workspace, Experiment\n",
30 | "from azureml.core import VERSION\n",
31 | "import time\n",
32 | "VERSION"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "### Uploading the data to the AzureML Datastore\n",
40 | "AzureML has the concept of a Datastore that can be mounted to a job, so you script does not have to deal with reading from Azure Blobstorage. First, let's download some data and upload it to the blob store, so we can play with it in Dask\n",
41 | "(parts of this code originates from https://github.com/dask/dask-tutorial)."
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 26,
47 | "metadata": {},
48 | "outputs": [
49 | {
50 | "name": "stdout",
51 | "output_type": "stream",
52 | "text": [
53 | "- Uploading flight data... \n",
54 | "Uploading an estimated of 10 files\n",
55 | "Target already exists. Skipping upload for nycflights/1990.csv\n",
56 | "Target already exists. Skipping upload for nycflights/1991.csv\n",
57 | "Target already exists. Skipping upload for nycflights/1992.csv\n",
58 | "Target already exists. Skipping upload for nycflights/1993.csv\n",
59 | "Target already exists. Skipping upload for nycflights/1994.csv\n",
60 | "Target already exists. Skipping upload for nycflights/1995.csv\n",
61 | "Target already exists. Skipping upload for nycflights/1996.csv\n",
62 | "Target already exists. Skipping upload for nycflights/1997.csv\n",
63 | "Target already exists. Skipping upload for nycflights/1998.csv\n",
64 | "Target already exists. Skipping upload for nycflights/1999.csv\n",
65 | "Uploaded 0 files\n",
66 | "** Finished! **\n"
67 | ]
68 | }
69 | ],
70 | "source": [
71 | "import os\n",
72 | "import tarfile\n",
73 | "import urllib.request\n",
74 | "\n",
75 | "\n",
76 | "cwd = os.getcwd()\n",
77 | "\n",
78 | "data_dir = os.path.abspath(os.path.join(cwd, 'data'))\n",
79 | "if not os.path.exists(data_dir):\n",
80 | " os.makedirs('data')\n",
81 | "\n",
82 | "flights_raw = os.path.join(data_dir, 'nycflights.tar.gz')\n",
83 | "flightdir = os.path.join(data_dir, 'nycflights')\n",
84 | "\n",
85 | "if not os.path.exists(flights_raw):\n",
86 | " print(\"- Downloading NYC Flights dataset... \", end='', flush=True)\n",
87 | " url = \"https://storage.googleapis.com/dask-tutorial-data/nycflights.tar.gz\"\n",
88 | " urllib.request.urlretrieve(url, flights_raw)\n",
89 | " print(\"done\", flush=True)\n",
90 | "\n",
91 | "if not os.path.exists(flightdir):\n",
92 | " print(\"- Extracting flight data... \", end='', flush=True)\n",
93 | " tar_path = os.path.join(data_dir, 'nycflights.tar.gz')\n",
94 | " with tarfile.open(tar_path, mode='r:gz') as flights:\n",
95 | " flights.extractall('data/')\n",
96 | " print(\"done\", flush=True)\n",
97 | "\n",
98 | " \n",
99 | "print(\"- Uploading flight data... \")\n",
100 | "ws = Workspace.from_config()\n",
101 | "ds = ws.get_default_datastore()\n",
102 | "\n",
103 | "ds.upload(src_dir=flightdir,\n",
104 | " target_path='nycflights',\n",
105 | " show_progress=True)\n",
106 | "\n",
107 | "print(\"** Finished! **\")"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "### Using the Datastore on the Dask cluster\n",
115 | "\n",
116 | "Now, lets make use of the data on the Dask cluster you created in [StartDask.ipynb](StartDask.ipynb).\n",
117 | "You might have noticed that we launched the cluster with a --data parameter which instructed AzureML to mount the workspace's default Datastore onto all the workers of the cluster.\n",
118 | "\n",
119 | "```\n",
120 | "est = Estimator('dask', \n",
121 | " compute_target=dask_cluster, \n",
122 | " entry_script='startDask.py', \n",
123 | " conda_dependencies_file_path='environment.yml', \n",
124 | " script_params=\n",
125 | " {'--data': ws.get_default_datastore()},\n",
126 | " node_count=10,\n",
127 | " distributed_training=mpi_configuration)\n",
128 | "```\n",
129 | "\n",
130 | "At this time the local path on the compute is not determined, but it will be once the job starts. We therefore log the path back to the run history from which we can now retrieve it."
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 27,
136 | "metadata": {},
137 | "outputs": [
138 | {
139 | "data": {
140 | "text/plain": [
141 | "'/mnt/batch/tasks/shared/LS_root/jobs/rapids/azureml/dask_1573618861_52e71160/mounts/workspaceblobstore/nycflights'"
142 | ]
143 | },
144 | "execution_count": 27,
145 | "metadata": {},
146 | "output_type": "execute_result"
147 | }
148 | ],
149 | "source": [
150 | "## get the last run on the dask experiment which should be running \n",
151 | "## our dask cluster, and retrieve the data path from it\n",
152 | "ws = Workspace.from_config()\n",
153 | "exp = ws.experiments['dask']\n",
154 | "cluster_run = exp.get_runs().__next__()\n",
155 | "\n",
156 | "if (not cluster_run.status == 'Running'):\n",
157 | " raise Exception('Cluster should be in state \\'Running\\'')\n",
158 | "\n",
159 | "data_path = cluster_run.get_metrics()['data'] + '/nycflights'\n",
160 | "data_path"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 28,
166 | "metadata": {},
167 | "outputs": [
168 | {
169 | "data": {
170 | "text/html": [
171 | "
\n",
172 | "\n",
173 | "\n",
174 | "Client\n",
175 | "\n",
179 | " | \n",
180 | "\n",
181 | "Cluster\n",
182 | "\n",
183 | " - Workers: 39
\n",
184 | " - Cores: 39
\n",
185 | " - Memory: 284.09 GB
\n",
186 | " \n",
187 | " | \n",
188 | "
\n",
189 | "
"
190 | ],
191 | "text/plain": [
192 | ""
193 | ]
194 | },
195 | "execution_count": 28,
196 | "metadata": {},
197 | "output_type": "execute_result"
198 | }
199 | ],
200 | "source": [
201 | "# Get the dask cluster\n",
202 | "from dask.distributed import Client\n",
203 | "\n",
204 | "c = Client('tcp://localhost:8786')\n",
205 | "c"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 16,
211 | "metadata": {},
212 | "outputs": [],
213 | "source": [
214 | "# create a dask dataframe that loads the data from the path on the cluster\n",
215 | "import dask.dataframe as dd\n",
216 | "from dask import delayed\n",
217 | "\n",
218 | "def load_data(path):\n",
219 | " df = dd.read_csv(path + '/*.csv',\n",
220 | " parse_dates={'Date': [0, 1, 2]},\n",
221 | " dtype={'TailNum': str,\n",
222 | " 'CRSElapsedTime': float,\n",
223 | " 'Cancelled': bool}) \n",
224 | " return df"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": 17,
230 | "metadata": {},
231 | "outputs": [],
232 | "source": [
233 | "# we need to delay the excution of the read to make sure the path \n",
234 | "# evaluated on the cluster, not the client\n",
235 | "df = delayed(load_data)(data_path).compute()"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 18,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "name": "stdout",
245 | "output_type": "stream",
246 | "text": [
247 | "2611892\n"
248 | ]
249 | },
250 | {
251 | "data": {
252 | "text/html": [
253 | "\n",
254 | "\n",
267 | "
\n",
268 | " \n",
269 | " \n",
270 | " | \n",
271 | " Date | \n",
272 | " DayOfWeek | \n",
273 | " DepTime | \n",
274 | " CRSDepTime | \n",
275 | " ArrTime | \n",
276 | " CRSArrTime | \n",
277 | " UniqueCarrier | \n",
278 | " FlightNum | \n",
279 | " TailNum | \n",
280 | " ActualElapsedTime | \n",
281 | " ... | \n",
282 | " AirTime | \n",
283 | " ArrDelay | \n",
284 | " DepDelay | \n",
285 | " Origin | \n",
286 | " Dest | \n",
287 | " Distance | \n",
288 | " TaxiIn | \n",
289 | " TaxiOut | \n",
290 | " Cancelled | \n",
291 | " Diverted | \n",
292 | "
\n",
293 | " \n",
294 | " \n",
295 | " \n",
296 | " 0 | \n",
297 | " 1990-01-01 | \n",
298 | " 1 | \n",
299 | " 1621.0 | \n",
300 | " 1540 | \n",
301 | " 1747.0 | \n",
302 | " 1701 | \n",
303 | " US | \n",
304 | " 33 | \n",
305 | " NaN | \n",
306 | " 86.0 | \n",
307 | " ... | \n",
308 | " NaN | \n",
309 | " 46.0 | \n",
310 | " 41.0 | \n",
311 | " EWR | \n",
312 | " PIT | \n",
313 | " 319.0 | \n",
314 | " NaN | \n",
315 | " NaN | \n",
316 | " False | \n",
317 | " 0 | \n",
318 | "
\n",
319 | " \n",
320 | " 1 | \n",
321 | " 1990-01-02 | \n",
322 | " 2 | \n",
323 | " 1547.0 | \n",
324 | " 1540 | \n",
325 | " 1700.0 | \n",
326 | " 1701 | \n",
327 | " US | \n",
328 | " 33 | \n",
329 | " NaN | \n",
330 | " 73.0 | \n",
331 | " ... | \n",
332 | " NaN | \n",
333 | " -1.0 | \n",
334 | " 7.0 | \n",
335 | " EWR | \n",
336 | " PIT | \n",
337 | " 319.0 | \n",
338 | " NaN | \n",
339 | " NaN | \n",
340 | " False | \n",
341 | " 0 | \n",
342 | "
\n",
343 | " \n",
344 | " 2 | \n",
345 | " 1990-01-03 | \n",
346 | " 3 | \n",
347 | " 1546.0 | \n",
348 | " 1540 | \n",
349 | " 1710.0 | \n",
350 | " 1701 | \n",
351 | " US | \n",
352 | " 33 | \n",
353 | " NaN | \n",
354 | " 84.0 | \n",
355 | " ... | \n",
356 | " NaN | \n",
357 | " 9.0 | \n",
358 | " 6.0 | \n",
359 | " EWR | \n",
360 | " PIT | \n",
361 | " 319.0 | \n",
362 | " NaN | \n",
363 | " NaN | \n",
364 | " False | \n",
365 | " 0 | \n",
366 | "
\n",
367 | " \n",
368 | " 3 | \n",
369 | " 1990-01-04 | \n",
370 | " 4 | \n",
371 | " 1542.0 | \n",
372 | " 1540 | \n",
373 | " 1710.0 | \n",
374 | " 1701 | \n",
375 | " US | \n",
376 | " 33 | \n",
377 | " NaN | \n",
378 | " 88.0 | \n",
379 | " ... | \n",
380 | " NaN | \n",
381 | " 9.0 | \n",
382 | " 2.0 | \n",
383 | " EWR | \n",
384 | " PIT | \n",
385 | " 319.0 | \n",
386 | " NaN | \n",
387 | " NaN | \n",
388 | " False | \n",
389 | " 0 | \n",
390 | "
\n",
391 | " \n",
392 | " 4 | \n",
393 | " 1990-01-05 | \n",
394 | " 5 | \n",
395 | " 1549.0 | \n",
396 | " 1540 | \n",
397 | " 1706.0 | \n",
398 | " 1701 | \n",
399 | " US | \n",
400 | " 33 | \n",
401 | " NaN | \n",
402 | " 77.0 | \n",
403 | " ... | \n",
404 | " NaN | \n",
405 | " 5.0 | \n",
406 | " 9.0 | \n",
407 | " EWR | \n",
408 | " PIT | \n",
409 | " 319.0 | \n",
410 | " NaN | \n",
411 | " NaN | \n",
412 | " False | \n",
413 | " 0 | \n",
414 | "
\n",
415 | " \n",
416 | "
\n",
417 | "
5 rows × 21 columns
\n",
418 | "
"
419 | ],
420 | "text/plain": [
421 | " Date DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime \\\n",
422 | "0 1990-01-01 1 1621.0 1540 1747.0 1701 \n",
423 | "1 1990-01-02 2 1547.0 1540 1700.0 1701 \n",
424 | "2 1990-01-03 3 1546.0 1540 1710.0 1701 \n",
425 | "3 1990-01-04 4 1542.0 1540 1710.0 1701 \n",
426 | "4 1990-01-05 5 1549.0 1540 1706.0 1701 \n",
427 | "\n",
428 | " UniqueCarrier FlightNum TailNum ActualElapsedTime ... AirTime ArrDelay \\\n",
429 | "0 US 33 NaN 86.0 ... NaN 46.0 \n",
430 | "1 US 33 NaN 73.0 ... NaN -1.0 \n",
431 | "2 US 33 NaN 84.0 ... NaN 9.0 \n",
432 | "3 US 33 NaN 88.0 ... NaN 9.0 \n",
433 | "4 US 33 NaN 77.0 ... NaN 5.0 \n",
434 | "\n",
435 | " DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled Diverted \n",
436 | "0 41.0 EWR PIT 319.0 NaN NaN False 0 \n",
437 | "1 7.0 EWR PIT 319.0 NaN NaN False 0 \n",
438 | "2 6.0 EWR PIT 319.0 NaN NaN False 0 \n",
439 | "3 2.0 EWR PIT 319.0 NaN NaN False 0 \n",
440 | "4 9.0 EWR PIT 319.0 NaN NaN False 0 \n",
441 | "\n",
442 | "[5 rows x 21 columns]"
443 | ]
444 | },
445 | "execution_count": 18,
446 | "metadata": {},
447 | "output_type": "execute_result"
448 | }
449 | ],
450 | "source": [
451 | "# now run some interactive queries\n",
452 | "print(len(df))\n",
453 | "df.head()"
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": 19,
459 | "metadata": {},
460 | "outputs": [
461 | {
462 | "data": {
463 | "text/plain": [
464 | "0 EWR\n",
465 | "1 LGA\n",
466 | "2 JFK\n",
467 | "Name: Origin, dtype: object"
468 | ]
469 | },
470 | "execution_count": 19,
471 | "metadata": {},
472 | "output_type": "execute_result"
473 | }
474 | ],
475 | "source": [
476 | "df.Origin.unique().compute()"
477 | ]
478 | },
479 | {
480 | "cell_type": "code",
481 | "execution_count": 20,
482 | "metadata": {},
483 | "outputs": [
484 | {
485 | "data": {
486 | "text/plain": [
487 | "Origin\n",
488 | "EWR 876.278885\n",
489 | "JFK 1484.209596\n",
490 | "LGA 712.546238\n",
491 | "Name: Distance, dtype: float64"
492 | ]
493 | },
494 | "execution_count": 20,
495 | "metadata": {},
496 | "output_type": "execute_result"
497 | }
498 | ],
499 | "source": [
500 | "df.groupby('Origin').Distance.mean().compute()"
501 | ]
502 | },
503 | {
504 | "cell_type": "code",
505 | "execution_count": 21,
506 | "metadata": {},
507 | "outputs": [
508 | {
509 | "data": {
510 | "text/plain": [
511 | "Origin\n",
512 | "EWR 1139451\n",
513 | "JFK 427243\n",
514 | "LGA 974267\n",
515 | "Name: Origin, dtype: int64"
516 | ]
517 | },
518 | "execution_count": 21,
519 | "metadata": {},
520 | "output_type": "execute_result"
521 | }
522 | ],
523 | "source": [
524 | "df[~df.Cancelled].groupby('Origin').Origin.count().compute()"
525 | ]
526 | },
527 | {
528 | "cell_type": "code",
529 | "execution_count": 22,
530 | "metadata": {},
531 | "outputs": [
532 | {
533 | "data": {
534 | "text/plain": [
535 | "Dest\n",
536 | "ORD 219060\n",
537 | "BOS 145105\n",
538 | "ATL 128855\n",
539 | "MIA 111001\n",
540 | "LAX 109848\n",
541 | " ... \n",
542 | "JFK 6\n",
543 | "CRP 2\n",
544 | "TUS 2\n",
545 | "ABQ 1\n",
546 | "STX 1\n",
547 | "Name: FlightNum, Length: 99, dtype: int64"
548 | ]
549 | },
550 | "execution_count": 22,
551 | "metadata": {},
552 | "output_type": "execute_result"
553 | }
554 | ],
555 | "source": [
556 | "dest = df[~df.Cancelled].groupby('Dest').FlightNum.count().compute()\n",
557 | "dest.sort_values(ascending=False)"
558 | ]
559 | },
560 | {
561 | "cell_type": "code",
562 | "execution_count": null,
563 | "metadata": {},
564 | "outputs": [],
565 | "source": []
566 | }
567 | ],
568 | "metadata": {
569 | "kernelspec": {
570 | "display_name": "Python (dask)",
571 | "language": "python",
572 | "name": "dask"
573 | },
574 | "language_info": {
575 | "codemirror_mode": {
576 | "name": "ipython",
577 | "version": 3
578 | },
579 | "file_extension": ".py",
580 | "mimetype": "text/x-python",
581 | "name": "python",
582 | "nbconvert_exporter": "python",
583 | "pygments_lexer": "ipython3",
584 | "version": "3.6.9"
585 | }
586 | },
587 | "nbformat": 4,
588 | "nbformat_minor": 2
589 | }
590 |
--------------------------------------------------------------------------------
/interactive/dask/DaskNYCTaxi.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Parallelize Pandas with Dask.dataframe\n"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import dask\n",
17 | "from dask.distributed import Client, progress\n",
18 | "from dask import delayed\n",
19 | "df = None\n",
20 | "c = Client('tcp://localhost:8786')\n",
21 | "c.restart()\n",
22 | "c"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": null,
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "from azureml.core import Workspace, Run\n",
32 | "import os\n",
33 | "run = Run.get_context()\n",
34 | "ws = run.experiment.workspace\n",
35 | "\n",
36 | "## or load directly through blob file system\n",
37 | "# using https://github.com/dask/adlfs -- still pretty beta, \n",
38 | "# throws an error message, but seesm to work\n",
39 | "ds = ws.get_default_datastore()\n",
40 | "ACCOUNT_NAME = ds.account_name\n",
41 | "ACCOUNT_KEY = ds.account_key\n",
42 | "CONTAINER = ds.container_name\n",
43 | "import dask.dataframe as dd\n",
44 | "from fsspec.registry import known_implementations\n",
45 | "known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}\n",
46 | "STORAGE_OPTIONS={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}\n",
47 | "df = dd.read_csv(f'abfs://{CONTAINER}/nyctaxi/2015/*.csv', \n",
48 | " storage_options=STORAGE_OPTIONS,\n",
49 | " parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {},
56 | "outputs": [],
57 | "source": [
58 | "# enable this code path instead of the above if you run into\n",
59 | "# any issues with the AzureBlobFileSystem (https://github.com/dask/adlfs)\n",
60 | "# this will load the data from the workspace blob storage mounted via blobFUSE\n",
61 | "if False:\n",
62 | " from azureml.core import Workspace\n",
63 | " ## get the last run on the dask experiment which should be running \n",
64 | " ## our dask cluster, and retrieve the data path from it\n",
65 | " ws = Workspace.from_config()\n",
66 | " exp = ws.experiments['dask']\n",
67 | " run = None\n",
68 | " for run in ws.experiments['dask'].get_runs():\n",
69 | " if run.get_status() == \"Running\":\n",
70 | " cluster_run = run\n",
71 | " break;\n",
72 | "\n",
73 | " if (run == None):\n",
74 | " raise Exception('Cluster should be in state \\'Running\\'')\n",
75 | "\n",
76 | " data_path = cluster_run.get_metrics()['datastore'] + '/nyctaxi'\n",
77 | "\n",
78 | "\n",
79 | " import dask\n",
80 | " import dask.dataframe as dd\n",
81 | " from dask import delayed\n",
82 | "\n",
83 | " def load_data(path):\n",
84 | " return dd.read_csv(path, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])\n",
85 | "\n",
86 | " data_2015 = data_path + '/2015'\n",
87 | " data_2015_csv = data_2015 + '/*.csv'\n",
88 | " df = delayed(load_data)(data_2015_csv).compute()"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "# fall back to this path if neither of the above paths have been enabled\n",
98 | "if df is None:\n",
99 | " ## or in this case straight from GOOGLE Storage\n",
100 | " import dask.dataframe as dd\n",
101 | " df = dd.read_csv('gcs://anaconda-public-data/nyc-taxi/csv/2015/yellow_*.csv',\n",
102 | " storage_options={'token': 'anon'}, \n",
103 | " parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])\n"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {
110 | "scrolled": false
111 | },
112 | "outputs": [],
113 | "source": [
114 | "%time len(df)"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "df.partitions"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "%time df.map_partitions(len).compute().sum()"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "\n",
140 | "Dask DataFrames\n",
141 | "---------------\n",
142 | "\n",
143 | "* Coordinate many Pandas DataFrames across a cluster\n",
144 | "* Faithfully implement a subset of the Pandas API\n",
145 | "* Use Pandas under the hood (for speed and maturity)"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "metadata": {},
152 | "outputs": [],
153 | "source": [
154 | "df"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {
161 | "scrolled": true
162 | },
163 | "outputs": [],
164 | "source": [
165 | "df.dtypes"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": null,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": [
174 | "# list of column names that need to be re-mapped\n",
175 | "remap = {}\n",
176 | "remap['tpep_pickup_datetime'] = 'pickup_datetime'\n",
177 | "remap['tpep_dropoff_datetime'] = 'dropoff_datetime'\n",
178 | "remap['RatecodeID'] = 'rate_code'\n",
179 | "\n",
180 | "#create a list of columns & dtypes the df must have\n",
181 | "must_haves = {\n",
182 | " 'VendorID': 'object',\n",
183 | " 'pickup_datetime': 'datetime64[ms]',\n",
184 | " 'dropoff_datetime': 'datetime64[ms]',\n",
185 | " 'passenger_count': 'int32',\n",
186 | " 'trip_distance': 'float32',\n",
187 | " 'pickup_longitude': 'float32',\n",
188 | " 'pickup_latitude': 'float32',\n",
189 | " 'rate_code': 'int32',\n",
190 | " 'payment_type': 'int32',\n",
191 | " 'dropoff_longitude': 'float32',\n",
192 | " 'dropoff_latitude': 'float32',\n",
193 | " 'fare_amount': 'float32',\n",
194 | " 'tip_amount': 'float32',\n",
195 | " 'total_amount': 'float32'\n",
196 | "}\n",
197 | "\n",
198 | "query_frags = [\n",
199 | " 'fare_amount > 0 and fare_amount < 500',\n",
200 | " 'passenger_count > 0 and passenger_count < 6',\n",
201 | " 'pickup_longitude > -75 and pickup_longitude < -73',\n",
202 | " 'dropoff_longitude > -75 and dropoff_longitude < -73',\n",
203 | " 'pickup_latitude > 40 and pickup_latitude < 42',\n",
204 | " 'dropoff_latitude > 40 and dropoff_latitude < 42'\n",
205 | "]\n",
206 | "query = ' and '.join(query_frags)"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "metadata": {},
213 | "outputs": [],
214 | "source": [
215 | "# helper function which takes a DataFrame partition\n",
216 | "def clean(df_part, remap, must_haves, query): \n",
217 | " df_part = df_part.query(query)\n",
218 | " \n",
219 | " # some col-names include pre-pended spaces remove & lowercase column names\n",
220 | " # tmp = {col:col.strip().lower() for col in list(df_part.columns)}\n",
221 | "\n",
222 | " # rename using the supplied mapping\n",
223 | " df_part = df_part.rename(columns=remap)\n",
224 | " \n",
225 | " # iterate through columns in this df partition\n",
226 | " for col in df_part.columns:\n",
227 | " # drop anything not in our expected list\n",
228 | " if col not in must_haves:\n",
229 | " df_part = df_part.drop(col, axis=1)\n",
230 | " continue\n",
231 | "\n",
232 | " if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:\n",
233 | " df_part[col] = df_part[col].astype('datetime64[ms]')\n",
234 | " continue\n",
235 | " \n",
236 | " # if column was read as a string, recast as float\n",
237 | " if df_part[col].dtype == 'object':\n",
238 | " df_part[col] = df_part[col].str.fillna('-1')\n",
239 | " df_part[col] = df_part[col].astype('float32')\n",
240 | " else:\n",
241 | " # save some memory by using 32 bit floats\n",
242 | " if 'int' in str(df_part[col].dtype):\n",
243 | " df_part[col] = df_part[col].astype('int32')\n",
244 | " if 'float' in str(df_part[col].dtype):\n",
245 | " df_part[col] = df_part[col].astype('float32')\n",
246 | " df_part[col] = df_part[col].fillna(-1)\n",
247 | " \n",
248 | " return df_part"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": null,
254 | "metadata": {},
255 | "outputs": [],
256 | "source": [
257 | "taxi_df = clean(df, remap, must_haves, query)"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {},
264 | "outputs": [],
265 | "source": [
266 | "import math\n",
267 | "from math import pi\n",
268 | "from dask.array import cos, sin, arcsin, sqrt, floor\n",
269 | "import numpy as np\n",
270 | "\n",
271 | "def haversine_distance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude):\n",
272 | " x_1 = pi / 180 * pickup_latitude\n",
273 | " y_1 = pi / 180 * pickup_longitude\n",
274 | " x_2 = pi / 180 * dropoff_latitude\n",
275 | " y_2 = pi / 180 * dropoff_longitude\n",
276 | "\n",
277 | " dlon = y_2 - y_1\n",
278 | " dlat = x_2 - x_1\n",
279 | " a = sin(dlat / 2)**2 + cos(x_1) * cos(x_2) * sin(dlon / 2)**2\n",
280 | "\n",
281 | " c = 2 * arcsin(sqrt(a)) \n",
282 | " r = 6371 # Radius of earth in kilometers\n",
283 | "\n",
284 | " return c * r\n",
285 | "\n",
286 | "def day_of_the_week(day, month, year):\n",
287 | " if month < 3:\n",
288 | " shift = month\n",
289 | " else:\n",
290 | " shift = 0\n",
291 | " Y = year - (month < 3)\n",
292 | " y = Y - 2000\n",
293 | " c = 20\n",
294 | " d = day\n",
295 | " m = month + shift + 1\n",
296 | " return (d + floor(m * 2.6) + y + (y // 4) + (c // 4) - 2 * c) % 7\n",
297 | " \n",
298 | "def add_features(df):\n",
299 | " df['hour'] = df['pickup_datetime'].dt.hour.astype('int32')\n",
300 | " df['year'] = df['pickup_datetime'].dt.year.astype('int32')\n",
301 | " df['month'] = df['pickup_datetime'].dt.month.astype('int32')\n",
302 | " df['day'] = df['pickup_datetime'].dt.day.astype('int32')\n",
303 | " df['day_of_week'] = df['pickup_datetime'].dt.weekday.astype('int32')\n",
304 | " \n",
305 | " #df['diff'] = df['dropoff_datetime'].astype('int32') - df['pickup_datetime'].astype('int32')\n",
306 | " df['diff'] = df['dropoff_datetime'] - df['pickup_datetime']\n",
307 | " \n",
308 | " df['pickup_latitude_r'] = (df['pickup_latitude'] // .01 * .01).astype('float32')\n",
309 | " df['pickup_longitude_r'] = (df['pickup_longitude'] // .01 * .01).astype('float32')\n",
310 | " df['dropoff_latitude_r'] = (df['dropoff_latitude'] // .01 * .01).astype('float32')\n",
311 | " df['dropoff_longitude_r'] = (df['dropoff_longitude'] // .01 * .01).astype('float32')\n",
312 | " \n",
313 | " #df = df.drop('pickup_datetime', axis=1)\n",
314 | " #df = df.drop('dropoff_datetime', axis=1)\n",
315 | "\n",
316 | " #df = df.apply_rows(haversine_distance_kernel,\n",
317 | " # incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],\n",
318 | " # outcols=dict(h_distance=np.float32),\n",
319 | " # kwargs=dict())\n",
320 | "\n",
321 | " import numpy\n",
322 | "\n",
323 | " df['h_distance'] = haversine_distance(df['pickup_latitude'], \n",
324 | " df['pickup_longitude'], \n",
325 | " df['dropoff_latitude'], \n",
326 | " df['dropoff_longitude']).astype('float32')\n",
327 | "\n",
328 | " #df = df.apply_rows(day_of_the_week_kernel,\n",
329 | " # incols=['day', 'month', 'year'],\n",
330 | " # outcols=dict(day_of_week=np.float32),\n",
331 | " # kwargs=dict())\n",
332 | " #df['day_of_week'] = numpy.empty(len(df), dtype=np.int32)\n",
333 | " #day_of_the_week_kernel(df['day'],\n",
334 | " # df['month'],\n",
335 | " # df['year'],\n",
336 | " # df['day_of_week'])\n",
337 | " \n",
338 | " \n",
339 | " df['is_weekend'] = (df['day_of_week']>5).astype(\"int32\")\n",
340 | " return df"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "metadata": {},
347 | "outputs": [],
348 | "source": [
349 | "taxi_df = add_features(taxi_df)\n",
350 | "taxi_df.dtypes"
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": null,
356 | "metadata": {},
357 | "outputs": [],
358 | "source": [
359 | "taxi_df = taxi_df.persist()\n",
360 | "progress(taxi_df)"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": null,
366 | "metadata": {},
367 | "outputs": [],
368 | "source": [
369 | "%time len(taxi_df)"
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": null,
375 | "metadata": {},
376 | "outputs": [],
377 | "source": [
378 | "%time taxi_df.passenger_count.sum().compute()"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": null,
384 | "metadata": {},
385 | "outputs": [],
386 | "source": [
387 | "# Compute average trip distance grouped by passenger count\n",
388 | "taxi_df.groupby('passenger_count').trip_distance.mean().compute()"
389 | ]
390 | },
391 | {
392 | "cell_type": "markdown",
393 | "metadata": {},
394 | "source": [
395 | "### Tip Fraction, grouped by day-of-week and hour-of-day"
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "execution_count": null,
401 | "metadata": {},
402 | "outputs": [],
403 | "source": [
404 | "df2 = taxi_df[(taxi_df.tip_amount > 0) & (taxi_df.fare_amount > 0)]\n",
405 | "df2['tip_fraction'] = df2.tip_amount / df2.fare_amount"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {},
412 | "outputs": [],
413 | "source": [
414 | "# Group df.tpep_pickup_datetime by dayofweek and hour\n",
415 | "dayofweek = df2.groupby(df2.pickup_datetime.dt.dayofweek).tip_fraction.mean() \n",
416 | "hour = df2.groupby(df2.pickup_datetime.dt.hour).tip_fraction.mean()\n",
417 | "\n",
418 | "dayofweek, hour = dask.persist(dayofweek, hour)\n",
419 | "progress(dayofweek, hour)"
420 | ]
421 | },
422 | {
423 | "cell_type": "markdown",
424 | "metadata": {},
425 | "source": [
426 | "### Plot results\n",
427 | "\n",
428 | "This requires matplotlib to be installed"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": null,
434 | "metadata": {},
435 | "outputs": [],
436 | "source": [
437 | "%matplotlib inline"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": null,
443 | "metadata": {},
444 | "outputs": [],
445 | "source": [
446 | "hour.compute().plot(figsize=(10, 6), title='Tip Fraction by Hour')"
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": null,
452 | "metadata": {},
453 | "outputs": [],
454 | "source": [
455 | "dayofweek.compute().plot(figsize=(10, 6), title='Tip Fraction by Day of Week')"
456 | ]
457 | },
458 | {
459 | "cell_type": "code",
460 | "execution_count": null,
461 | "metadata": {},
462 | "outputs": [],
463 | "source": [
464 | "import pandas as pd\n",
465 | "%matplotlib inline\n",
466 | "taxi_df.groupby('passenger_count').fare_amount.mean().compute().sort_index().plot(legend=True)"
467 | ]
468 | },
469 | {
470 | "cell_type": "code",
471 | "execution_count": null,
472 | "metadata": {},
473 | "outputs": [],
474 | "source": [
475 | "taxi_df.groupby(taxi_df.passenger_count).trip_distance.mean().compute().plot(legend=True)"
476 | ]
477 | },
478 | {
479 | "cell_type": "code",
480 | "execution_count": null,
481 | "metadata": {},
482 | "outputs": [],
483 | "source": []
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": null,
488 | "metadata": {},
489 | "outputs": [],
490 | "source": []
491 | },
492 | {
493 | "cell_type": "code",
494 | "execution_count": null,
495 | "metadata": {},
496 | "outputs": [],
497 | "source": [
498 | "by_payment = taxi_df.groupby(taxi_df.payment_type).fare_amount.count().compute()\n",
499 | "by_payment.index = by_payment.index.map({1: 'Credit card',\n",
500 | " 2: 'Cash',\n",
501 | " 3: 'No charge',\n",
502 | " 4: 'Dispute',\n",
503 | " 5: 'Unknown',\n",
504 | " 6: 'Voided trip'})"
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": null,
510 | "metadata": {},
511 | "outputs": [],
512 | "source": [
513 | "by_payment.plot(legend=True, kind='bar')\n"
514 | ]
515 | },
516 | {
517 | "cell_type": "markdown",
518 | "metadata": {},
519 | "source": [
520 | "### Let's save the transformed dataset back to blob"
521 | ]
522 | },
523 | {
524 | "cell_type": "code",
525 | "execution_count": null,
526 | "metadata": {},
527 | "outputs": [],
528 | "source": [
529 | "import uuid\n",
530 | "output_uuid = uuid.uuid1().hex\n",
531 | "run.log('output_uuid', output_uuid)\n",
532 | "\n",
533 | "output_path = run.get_metrics()['datastore'] + '/output/' + output_uuid + '.parquet'\n",
534 | "\n",
535 | "print('save parquet to ', output_path)\n",
536 | "\n",
537 | "taxi_df.to_parquet(output_path)\n",
538 | "\n",
539 | "print('done')"
540 | ]
541 | },
542 | {
543 | "cell_type": "code",
544 | "execution_count": null,
545 | "metadata": {
546 | "scrolled": false
547 | },
548 | "outputs": [],
549 | "source": [
550 | "import dask\n",
551 | "import dask.dataframe as dd\n",
552 | "\n",
553 | "df = dd.read_parquet(output_path)\n"
554 | ]
555 | },
556 | {
557 | "cell_type": "code",
558 | "execution_count": null,
559 | "metadata": {},
560 | "outputs": [],
561 | "source": [
562 | "df.head()"
563 | ]
564 | },
565 | {
566 | "cell_type": "code",
567 | "execution_count": null,
568 | "metadata": {},
569 | "outputs": [],
570 | "source": []
571 | }
572 | ],
573 | "metadata": {
574 | "kernelspec": {
575 | "display_name": "Python (dask)",
576 | "language": "python",
577 | "name": "dask"
578 | },
579 | "language_info": {
580 | "codemirror_mode": {
581 | "name": "ipython",
582 | "version": 3
583 | },
584 | "file_extension": ".py",
585 | "mimetype": "text/x-python",
586 | "name": "python",
587 | "nbconvert_exporter": "python",
588 | "pygments_lexer": "ipython3",
589 | "version": "3.6.9"
590 | }
591 | },
592 | "nbformat": 4,
593 | "nbformat_minor": 2
594 | }
595 |
--------------------------------------------------------------------------------
/interactive/dask/environment.yml:
--------------------------------------------------------------------------------
1 | name: dask
2 | channels:
3 | - defaults
4 | - conda-forge
5 | dependencies:
6 | - gcsfs
7 | - fs-gcsfs
8 | - jupyterlab
9 | - jupyter-server-proxy
10 | - python=3.6
11 | - numpy
12 | - h5py
13 | - scipy
14 | - toolz
15 | - bokeh
16 | - dask
17 | - distributed
18 | - notebook
19 | - matplotlib
20 | - Pillow
21 | - pandas
22 | - pandas-datareader
23 | - pytables
24 | - scikit-learn
25 | - scikit-image
26 | - snakeviz
27 | - ujson
28 | - graphviz
29 | - pip
30 | - s3fs
31 | - fastparquet
32 | - dask-ml
33 | - pip:
34 | - graphviz
35 | - cachey
36 | - azureml-sdk[notebooks]
37 | - mpi4py
38 | - gym
39 | - adlfs
--------------------------------------------------------------------------------
/interactive/dask/startDask.py:
--------------------------------------------------------------------------------
1 | # +
2 | from mpi4py import MPI
3 | import os
4 | import argparse
5 | import time
6 | from dask.distributed import Client
7 | from azureml.core import Run
8 | import sys, uuid
9 | import threading
10 | import subprocess
11 | import socket
12 |
13 | from notebook.notebookapp import list_running_servers
14 |
15 |
16 | # -
17 |
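# stream a child process's stdout to this process's stdout and to a log file until the child exits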
18 | def flush(proc, proc_log):
19 | while True:
20 | proc_out = proc.stdout.readline()
21 | if proc_out == '' and proc.poll() is not None:
22 | proc_log.close()
23 | break
24 | elif proc_out:
25 | sys.stdout.write(proc_out)
26 | proc_log.write(proc_out)
27 | proc_log.flush()
28 |
29 |
30 | if __name__ == '__main__':
31 | comm = MPI.COMM_WORLD
32 | rank = comm.Get_rank()
33 |
34 | parser = argparse.ArgumentParser()
35 | parser.add_argument("--datastore")
36 | parser.add_argument("--jupyter_token", default=uuid.uuid1().hex)
37 | parser.add_argument("--script")
38 |
39 | args, unparsed = parser.parse_known_args()
40 |
41 | ip = socket.gethostbyname(socket.gethostname())
42 |
43 | print("- my rank is ", rank)
44 | print("- my ip is ", ip)
45 |
46 | if rank == 0:
47 | data = {
48 | "scheduler" : ip + ":8786",
49 | "dashboard" : ip + ":8787"
50 | }
51 | else:
52 | data = None
53 |
54 | data = comm.bcast(data, root=0)
55 | scheduler = data["scheduler"]
56 | dashboard = data["dashboard"]
57 | print("- scheduler is ", scheduler)
58 | print("- dashboard is ", dashboard)
59 |
60 |
61 | print("args: ", args)
62 | print("unparsed: ", unparsed)
63 | print("- my rank is ", rank)
64 | print("- my ip is ", ip)
65 |
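# rank 0 hosts the Jupyter server, the Dask scheduler and one worker; all other ranks start a worker that connects to the scheduler address broadcast above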
66 | if rank == 0:
67 | Run.get_context().log("headnode", ip)
68 | Run.get_context().log("cluster",
69 | "scheduler: {scheduler}, dashboard: {dashboard}".format(scheduler=scheduler,
70 | dashboard=dashboard))
71 | Run.get_context().log("datastore", args.datastore)
72 |
73 | cmd = ("jupyter lab --ip 0.0.0.0 --port 8888" + \
74 | " --NotebookApp.token={token}" + \
75 | " --allow-root --no-browser").format(token=args.jupyter_token)
76 | jupyter_log = open("jupyter_log.txt", "a")
77 | jupyter_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
78 |
79 | jupyter_flush = threading.Thread(target=flush, args=(jupyter_proc, jupyter_log))
80 | jupyter_flush.start()
81 |
82 | while not list(list_running_servers()):
83 | time.sleep(5)
84 |
85 | jupyter_servers = list(list_running_servers())
86 | assert (len(jupyter_servers) == 1), "more than one jupyter server is running"
87 |
88 | Run.get_context().log("jupyter",
89 | "ip: {ip_addr}, port: {port}".format(ip_addr=ip, port=jupyter_servers[0]["port"]))
90 | Run.get_context().log("jupyter-token", jupyter_servers[0]["token"])
91 |
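# start the Dask scheduler on this node, then a local worker that joins it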
92 | cmd = "dask-scheduler " + "--port " + scheduler.split(":")[1] + " --dashboard-address " + dashboard
93 | scheduler_log = open("scheduler_log.txt", "w")
94 | scheduler_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
95 |
96 | cmd = "dask-worker " + scheduler
97 | worker_log = open("worker_{rank}_log.txt".format(rank=rank), "w")
98 | worker_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
99 |
100 | worker_flush = threading.Thread(target=flush, args=(worker_proc, worker_log))
101 | worker_flush.start()
102 |
103 | if(args.script):
104 | command_line = ' '.join(['python', args.script]+unparsed)
105 | print('Launching:', command_line)
106 | exit_code = os.system(command_line)
107 | print('process ended with code', exit_code)
108 | print('killing scheduler, worker and jupyter')
109 | jupyter_proc.kill()
110 | scheduler_proc.kill()
111 | worker_proc.kill()
112 | exit(exit_code)
113 | else:
114 | flush(scheduler_proc, scheduler_log)
115 | else:
116 | cmd = "dask-worker " + scheduler
117 | worker_log = open("worker_{rank}_log.txt".format(rank=rank), "w")
118 | worker_proc = subprocess.Popen(cmd.split(), universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
119 |
120 | flush(worker_proc, worker_log)
121 |
--------------------------------------------------------------------------------
/interactive/mydask.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsc/azureml-and-dask/1d5183d79d09e4a95a9045e660ab87f01e77e366/interactive/mydask.png
--------------------------------------------------------------------------------
/rapids_interactive/dask/azure_taxi_on_cluster.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## NYC Taxi dataset analysis\n",
8 | "\n",
9 | "This notenook should be run from the Jupypter Server deployed on the AzureML Cluster\n",
10 | "\n",
11 | "First get the run object for the cluster we are running on (this will fail if not run on the cluster):"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 9,
17 | "metadata": {},
18 | "outputs": [
19 | {
20 | "data": {
21 | "text/plain": [
22 | "{'headnode': '172.17.0.7',\n",
23 | " 'scheduler': '172.17.0.7:8786',\n",
24 | " 'dashboard': '172.17.0.7:8787',\n",
25 | " 'data': '/mnt/batch/tasks/shared/LS_root/jobs/vnettest/azureml/init-dask-jupyter_1569837452_781f9040/mounts/workspaceblobstore',\n",
26 | " 'jupyter-server': ['http://172.17.0.7:8888/?token=328966d31212f8eebaea6b4df97c2bfbbc9819d2dc7049c2',\n",
27 | " 'http://172.17.0.7:8889/?token=a8c3ecc047365ec1b65bf7d6dce1ef44c1161f7b2f3a3c1c',\n",
28 | " 'http://172.17.0.7:8890/?token=2c572c6a478a93402e22baae68e31618d7fa839097740e79']}"
29 | ]
30 | },
31 | "execution_count": 9,
32 | "metadata": {},
33 | "output_type": "execute_result"
34 | }
35 | ],
36 | "source": [
37 | "from azureml.core import Run\n",
38 | "run = Run.get_context()\n",
39 | "run.get_metrics()"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "Fetch the list of data files from the mounted share:"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 10,
52 | "metadata": {},
53 | "outputs": [
54 | {
55 | "data": {
56 | "text/plain": [
57 | "['yellow_tripdata_2015-10.csv',\n",
58 | " 'yellow_tripdata_2015-04.csv',\n",
59 | " 'yellow_tripdata_2015-03.csv',\n",
60 | " 'yellow_tripdata_2015-08.csv',\n",
61 | " 'yellow_tripdata_2015-07.csv',\n",
62 | " 'yellow_tripdata_2015-09.csv',\n",
63 | " 'yellow_tripdata_2015-01.csv',\n",
64 | " 'yellow_tripdata_2015-02.csv',\n",
65 | " 'yellow_tripdata_2015-05.csv',\n",
66 | " 'yellow_tripdata_2015-06.csv',\n",
67 | " 'yellow_tripdata_2015-11.csv',\n",
68 | " 'yellow_tripdata_2015-12.csv']"
69 | ]
70 | },
71 | "execution_count": 10,
72 | "metadata": {},
73 | "output_type": "execute_result"
74 | }
75 | ],
76 | "source": [
77 | "import os\n",
78 | "data_path = run.get_metrics()['data'] \n",
79 | "filenames = os.listdir(data_path + '/nyctaxi')\n",
80 | "total_size = 0\n",
81 | "for file in filenames:\n",
82 | " size = os.path.getsize(data_path + '/nyctaxi/' + file)/(1e9)\n",
83 | " print(f\"file: {file} size: {round(size,1)} GB\")\n",
84 | " total_size += size\n",
85 | "\n",
86 | "print(\"Total size:\", round(total_size,1), \"GB\")"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "### Get the cluster client\n",
94 | "Since this jupyter server is running on the scheduler node of the cluster, we just need to connect to localhost."
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 11,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "data": {
104 | "text/html": [
105 | "\n",
106 | "\n",
107 | "\n",
108 | "Client\n",
109 | "\n",
113 | " | \n",
114 | "\n",
115 | "Cluster\n",
116 | "\n",
117 | " - Workers: 6
\n",
118 | " - Cores: 6
\n",
119 | " - Memory: 0 B
\n",
120 | " \n",
121 | " | \n",
122 | "
\n",
123 | "
"
124 | ],
125 | "text/plain": [
126 | ""
127 | ]
128 | },
129 | "execution_count": 11,
130 | "metadata": {},
131 | "output_type": "execute_result"
132 | }
133 | ],
134 | "source": [
135 | "import distributed\n",
136 | "client = distributed.Client('tcp://localhost:8786')\n",
137 | "client"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 12,
143 | "metadata": {},
144 | "outputs": [
145 | {
146 | "name": "stdout",
147 | "output_type": "stream",
148 | "text": [
149 | "- setting dask settings\n",
150 | "-- Changes to dask settings\n",
151 | "--- Setting work-stealing to False\n",
152 | "--- Setting scheduler bandwidth to 1\n",
153 | "-- Settings updates complete\n"
154 | ]
155 | }
156 | ],
157 | "source": [
158 | "import dask\n",
159 | "\n",
160 | "print(\"- setting dask settings\")\n",
161 | "dask.config.set({'distributed.scheduler.work-stealing': False})\n",
162 | "dask.config.set({'distributed.scheduler.bandwidth': 1})\n",
163 | "\n",
164 | "print(\"-- Changes to dask settings\")\n",
165 | "print(\"--- Setting work-stealing to \", dask.config.get('distributed.scheduler.work-stealing'))\n",
166 | "print(\"--- Setting scheduler bandwidth to \", dask.config.get('distributed.scheduler.bandwidth'))\n",
167 | "print(\"-- Settings updates complete\")"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 42,
173 | "metadata": {},
174 | "outputs": [],
175 | "source": [
176 | "# helper function which takes a DataFrame partition\n",
177 | "def clean(df_part, remap, must_haves): \n",
178 | " # some col-names include pre-pended spaces remove & lowercase column names\n",
179 | " tmp = {col:col.strip().lower() for col in list(df_part.columns)}\n",
180 | " df_part = df_part.rename(tmp)\n",
181 | " \n",
182 | " # rename using the supplied mapping\n",
183 | " df_part = df_part.rename(remap)\n",
184 | " \n",
185 | " # iterate through columns in this df partition\n",
186 | " for col in df_part.columns:\n",
187 | " # drop anything not in our expected list\n",
188 | " if col not in must_haves:\n",
189 | " df_part = df_part.drop(col)\n",
190 | " continue\n",
191 | "\n",
192 | " if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:\n",
193 | " df_part[col] = df_part[col].astype('datetime64[ms]')\n",
194 | " continue\n",
195 | " \n",
196 | " # if column was read as a string, recast as float\n",
197 | " if df_part[col].dtype == 'object':\n",
198 | " df_part[col] = df_part[col].str.fillna('-1')\n",
199 | " df_part[col] = df_part[col].astype('float32')\n",
200 | " else:\n",
201 | " # downcast from 64bit to 32bit types\n",
202 | " # Tesla T4 are faster on 32bit ops\n",
203 | " if 'int' in str(df_part[col].dtype):\n",
204 | " df_part[col] = df_part[col].astype('int32')\n",
205 | " if 'float' in str(df_part[col].dtype):\n",
206 | " df_part[col] = df_part[col].astype('float32')\n",
207 | " df_part[col] = df_part[col].fillna(-1)\n",
208 | "\n",
209 | " return df_part"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 43,
215 | "metadata": {},
216 | "outputs": [],
217 | "source": [
218 | "import os\n",
219 | "import cudf\n",
220 | "\n",
221 | "def read_csv(path):\n",
222 | " import cudf\n",
223 | " # list of column names that need to be re-mapped\n",
224 | " remap = {}\n",
225 | " remap['tpep_pickup_datetime'] = 'pickup_datetime'\n",
226 | " remap['tpep_dropoff_datetime'] = 'dropoff_datetime'\n",
227 | " remap['ratecodeid'] = 'rate_code'\n",
228 | "\n",
229 | " #create a list of columns & dtypes the df must have\n",
230 | " must_haves = {\n",
231 | " 'pickup_datetime': 'datetime64[ms]',\n",
232 | " 'dropoff_datetime': 'datetime64[ms]',\n",
233 | " 'passenger_count': 'int32',\n",
234 | " 'trip_distance': 'float32',\n",
235 | " 'pickup_longitude': 'float32',\n",
236 | " 'pickup_latitude': 'float32',\n",
237 | " 'rate_code': 'int32',\n",
238 | " 'dropoff_longitude': 'float32',\n",
239 | " 'dropoff_latitude': 'float32',\n",
240 | " 'fare_amount': 'float32'\n",
241 | " }\n",
242 | " \n",
243 | " df = cudf.read_csv(path)\n",
244 | " return clean(df, remap, must_haves)\n",
245 | "\n",
246 | "paths = [os.path.join(run.get_metrics()[\"data\"], \"nyctaxi/\") + filename for filename in filenames]\n",
247 | "data_paths = client.scatter(paths)\n",
248 | "dfs = [client.submit(read_csv, data_path) for data_path in data_paths]"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 44,
254 | "metadata": {
255 | "scrolled": false
256 | },
257 | "outputs": [],
258 | "source": [
259 | "import dask_cudf\n",
260 | "\n",
261 | "taxi_df = dask_cudf.from_delayed(dfs)"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 45,
267 | "metadata": {},
268 | "outputs": [
269 | {
270 | "data": {
271 | "text/html": [
272 | "\n",
273 | "\n",
286 | "
\n",
287 | " \n",
288 | " \n",
289 | " | \n",
290 | " pickup_datetime | \n",
291 | " dropoff_datetime | \n",
292 | " passenger_count | \n",
293 | " trip_distance | \n",
294 | " pickup_longitude | \n",
295 | " pickup_latitude | \n",
296 | " rate_code | \n",
297 | " dropoff_longitude | \n",
298 | " dropoff_latitude | \n",
299 | " fare_amount | \n",
300 | "
\n",
301 | " \n",
302 | " \n",
303 | " \n",
304 | " 0 | \n",
305 | " 2015-10-01 00:00:00 | \n",
306 | " 2015-10-01 00:05:48 | \n",
307 | " 1 | \n",
308 | " 1.10 | \n",
309 | " -73.935516 | \n",
310 | " 40.761238 | \n",
311 | " 1 | \n",
312 | " -73.944351 | \n",
313 | " 40.754578 | \n",
314 | " 6.00 | \n",
315 | "
\n",
316 | " \n",
317 | " 1 | \n",
318 | " 2015-10-01 00:00:00 | \n",
319 | " 2015-10-01 00:00:00 | \n",
320 | " 1 | \n",
321 | " 7.68 | \n",
322 | " -73.989937 | \n",
323 | " 40.743439 | \n",
324 | " 1 | \n",
325 | " -73.986687 | \n",
326 | " 40.689129 | \n",
327 | " 27.50 | \n",
328 | "
\n",
329 | " \n",
330 | " 2 | \n",
331 | " 2015-10-01 00:00:00 | \n",
332 | " 2015-10-01 00:00:00 | \n",
333 | " 2 | \n",
334 | " 2.53 | \n",
335 | " -73.987328 | \n",
336 | " 40.720020 | \n",
337 | " 1 | \n",
338 | " -73.999084 | \n",
339 | " 40.744381 | \n",
340 | " 12.50 | \n",
341 | "
\n",
342 | " \n",
343 | " 3 | \n",
344 | " 2015-10-01 00:00:00 | \n",
345 | " 2015-10-01 00:00:00 | \n",
346 | " 0 | \n",
347 | " 1.20 | \n",
348 | " -73.953758 | \n",
349 | " 40.743385 | \n",
350 | " 5 | \n",
351 | " -73.930008 | \n",
352 | " 40.736622 | \n",
353 | " 25.26 | \n",
354 | "
\n",
355 | " \n",
356 | " 4 | \n",
357 | " 2015-10-01 00:00:01 | \n",
358 | " 2015-10-01 00:16:19 | \n",
359 | " 1 | \n",
360 | " 3.80 | \n",
361 | " -73.984016 | \n",
362 | " 40.755222 | \n",
363 | " 1 | \n",
364 | " -73.959869 | \n",
365 | " 40.801323 | \n",
366 | " 15.50 | \n",
367 | "
\n",
368 | " \n",
369 | "
\n",
370 | "
"
371 | ],
372 | "text/plain": [
373 | " pickup_datetime dropoff_datetime passenger_count trip_distance \\\n",
374 | "0 2015-10-01 00:00:00 2015-10-01 00:05:48 1 1.10 \n",
375 | "1 2015-10-01 00:00:00 2015-10-01 00:00:00 1 7.68 \n",
376 | "2 2015-10-01 00:00:00 2015-10-01 00:00:00 2 2.53 \n",
377 | "3 2015-10-01 00:00:00 2015-10-01 00:00:00 0 1.20 \n",
378 | "4 2015-10-01 00:00:01 2015-10-01 00:16:19 1 3.80 \n",
379 | "\n",
380 | " pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n",
381 | "0 -73.935516 40.761238 1 -73.944351 \n",
382 | "1 -73.989937 40.743439 1 -73.986687 \n",
383 | "2 -73.987328 40.720020 1 -73.999084 \n",
384 | "3 -73.953758 40.743385 5 -73.930008 \n",
385 | "4 -73.984016 40.755222 1 -73.959869 \n",
386 | "\n",
387 | " dropoff_latitude fare_amount \n",
388 | "0 40.754578 6.00 \n",
389 | "1 40.689129 27.50 \n",
390 | "2 40.744381 12.50 \n",
391 | "3 40.736622 25.26 \n",
392 | "4 40.801323 15.50 "
393 | ]
394 | },
395 | "execution_count": 45,
396 | "metadata": {},
397 | "output_type": "execute_result"
398 | }
399 | ],
400 | "source": [
401 | "taxi_df.head()"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": 48,
407 | "metadata": {},
408 | "outputs": [],
409 | "source": [
410 | "import numpy as np\n",
411 | "import numba, xgboost, socket\n",
412 | "import dask, dask_cudf\n",
413 | "from dask.distributed import Client, wait"
414 | ]
415 | },
416 | {
417 | "cell_type": "code",
418 | "execution_count": 49,
419 | "metadata": {},
420 | "outputs": [
421 | {
422 | "data": {
423 | "text/plain": [
424 | "Index(['pickup_datetime', 'dropoff_datetime', 'passenger_count',\n",
425 | " 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code',\n",
426 | " 'dropoff_longitude', 'dropoff_latitude', 'fare_amount'],\n",
427 | " dtype='object')"
428 | ]
429 | },
430 | "execution_count": 49,
431 | "metadata": {},
432 | "output_type": "execute_result"
433 | }
434 | ],
435 | "source": [
436 | "taxi_df.columns"
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "execution_count": 50,
442 | "metadata": {},
443 | "outputs": [
444 | {
445 | "data": {
446 | "text/html": [
447 | "\n",
448 | "\n",
461 | "
\n",
462 | " \n",
463 | " \n",
464 | " | \n",
465 | " pickup_datetime | \n",
466 | " dropoff_datetime | \n",
467 | " passenger_count | \n",
468 | " trip_distance | \n",
469 | " pickup_longitude | \n",
470 | " pickup_latitude | \n",
471 | " rate_code | \n",
472 | " dropoff_longitude | \n",
473 | " dropoff_latitude | \n",
474 | " fare_amount | \n",
475 | "
\n",
476 | " \n",
477 | " \n",
478 | " \n",
479 | " 0 | \n",
480 | " 2015-10-01 00:00:00 | \n",
481 | " 2015-10-01 00:05:48 | \n",
482 | " 1 | \n",
483 | " 1.10 | \n",
484 | " -73.935516 | \n",
485 | " 40.761238 | \n",
486 | " 1 | \n",
487 | " -73.944351 | \n",
488 | " 40.754578 | \n",
489 | " 6.0 | \n",
490 | "
\n",
491 | " \n",
492 | " 1 | \n",
493 | " 2015-10-01 00:00:00 | \n",
494 | " 2015-10-01 00:00:00 | \n",
495 | " 1 | \n",
496 | " 7.68 | \n",
497 | " -73.989937 | \n",
498 | " 40.743439 | \n",
499 | " 1 | \n",
500 | " -73.986687 | \n",
501 | " 40.689129 | \n",
502 | " 27.5 | \n",
503 | "
\n",
504 | " \n",
505 | " 2 | \n",
506 | " 2015-10-01 00:00:00 | \n",
507 | " 2015-10-01 00:00:00 | \n",
508 | " 2 | \n",
509 | " 2.53 | \n",
510 | " -73.987328 | \n",
511 | " 40.720020 | \n",
512 | " 1 | \n",
513 | " -73.999084 | \n",
514 | " 40.744381 | \n",
515 | " 12.5 | \n",
516 | "
\n",
517 | " \n",
518 | " 4 | \n",
519 | " 2015-10-01 00:00:01 | \n",
520 | " 2015-10-01 00:16:19 | \n",
521 | " 1 | \n",
522 | " 3.80 | \n",
523 | " -73.984016 | \n",
524 | " 40.755222 | \n",
525 | " 1 | \n",
526 | " -73.959869 | \n",
527 | " 40.801323 | \n",
528 | " 15.5 | \n",
529 | "
\n",
530 | " \n",
531 | " 5 | \n",
532 | " 2015-10-01 00:00:01 | \n",
533 | " 2015-10-01 00:13:41 | \n",
534 | " 1 | \n",
535 | " 3.10 | \n",
536 | " -73.975296 | \n",
537 | " 40.751396 | \n",
538 | " 1 | \n",
539 | " -73.970924 | \n",
540 | " 40.785984 | \n",
541 | " 12.5 | \n",
542 | "
\n",
543 | " \n",
544 | "
\n",
545 | "
"
546 | ],
547 | "text/plain": [
548 | " pickup_datetime dropoff_datetime passenger_count trip_distance \\\n",
549 | "0 2015-10-01 00:00:00 2015-10-01 00:05:48 1 1.10 \n",
550 | "1 2015-10-01 00:00:00 2015-10-01 00:00:00 1 7.68 \n",
551 | "2 2015-10-01 00:00:00 2015-10-01 00:00:00 2 2.53 \n",
552 | "4 2015-10-01 00:00:01 2015-10-01 00:16:19 1 3.80 \n",
553 | "5 2015-10-01 00:00:01 2015-10-01 00:13:41 1 3.10 \n",
554 | "\n",
555 | " pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n",
556 | "0 -73.935516 40.761238 1 -73.944351 \n",
557 | "1 -73.989937 40.743439 1 -73.986687 \n",
558 | "2 -73.987328 40.720020 1 -73.999084 \n",
559 | "4 -73.984016 40.755222 1 -73.959869 \n",
560 | "5 -73.975296 40.751396 1 -73.970924 \n",
561 | "\n",
562 | " dropoff_latitude fare_amount \n",
563 | "0 40.754578 6.0 \n",
564 | "1 40.689129 27.5 \n",
565 | "2 40.744381 12.5 \n",
566 | "4 40.801323 15.5 \n",
567 | "5 40.785984 12.5 "
568 | ]
569 | },
570 | "execution_count": 50,
571 | "metadata": {},
572 | "output_type": "execute_result"
573 | }
574 | ],
575 | "source": [
576 | "# apply a list of filter conditions to throw out records with missing or outlier values\n",
577 | "query_frags = [\n",
578 | " 'fare_amount > 0 and fare_amount < 500',\n",
579 | " 'passenger_count > 0 and passenger_count < 6',\n",
580 | " 'pickup_longitude > -75 and pickup_longitude < -73',\n",
581 | " 'dropoff_longitude > -75 and dropoff_longitude < -73',\n",
582 | " 'pickup_latitude > 40 and pickup_latitude < 42',\n",
583 | " 'dropoff_latitude > 40 and dropoff_latitude < 42'\n",
584 | "]\n",
585 | "taxi_df = taxi_df.query(' and '.join(query_frags))\n",
586 | "\n",
587 | "# inspect the results of cleaning\n",
588 | "taxi_df.head().to_pandas()"
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "execution_count": 51,
594 | "metadata": {},
595 | "outputs": [],
596 | "source": [
597 | "import math\n",
598 | "from math import cos, sin, asin, sqrt, pi\n",
599 | "import numpy as np\n",
600 | "\n",
601 | "def haversine_distance_kernel(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude, h_distance):\n",
602 | " for i, (x_1, y_1, x_2, y_2) in enumerate(zip(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude)):\n",
603 | " x_1 = pi/180 * x_1\n",
604 | " y_1 = pi/180 * y_1\n",
605 | " x_2 = pi/180 * x_2\n",
606 | " y_2 = pi/180 * y_2\n",
607 | " \n",
608 | " dlon = y_2 - y_1\n",
609 | " dlat = x_2 - x_1\n",
610 | " a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2\n",
611 | " \n",
612 | " c = 2 * asin(sqrt(a)) \n",
613 | " r = 6371 # Radius of earth in kilometers\n",
614 | " \n",
615 | " h_distance[i] = c * r\n",
616 | "\n",
617 | "def day_of_the_week_kernel(day, month, year, day_of_week):\n",
618 | " for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):\n",
619 | " if month[i] <3:\n",
620 | " shift = month[i]\n",
621 | " else:\n",
622 | " shift = 0\n",
623 | " Y = year[i] - (month[i] < 3)\n",
624 | " y = Y - 2000\n",
625 | " c = 20\n",
626 | " d = day[i]\n",
627 | " m = month[i] + shift + 1\n",
628 | " day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7\n",
629 | " \n",
630 | "def add_features(df):\n",
631 | " df['hour'] = df['pickup_datetime'].dt.hour\n",
632 | " df['year'] = df['pickup_datetime'].dt.year\n",
633 | " df['month'] = df['pickup_datetime'].dt.month\n",
634 | " df['day'] = df['pickup_datetime'].dt.day\n",
635 | " df['diff'] = df['dropoff_datetime'].astype('int32') - df['pickup_datetime'].astype('int32')\n",
636 | " \n",
637 | " df['pickup_latitude_r'] = df['pickup_latitude']//.01*.01\n",
638 | " df['pickup_longitude_r'] = df['pickup_longitude']//.01*.01\n",
639 | " df['dropoff_latitude_r'] = df['dropoff_latitude']//.01*.01\n",
640 | " df['dropoff_longitude_r'] = df['dropoff_longitude']//.01*.01\n",
641 | " \n",
642 | " df = df.drop('pickup_datetime')\n",
643 | " df = df.drop('dropoff_datetime')\n",
644 | " \n",
645 | " \n",
646 | " df = df.apply_rows(haversine_distance_kernel,\n",
647 | " incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],\n",
648 | " outcols=dict(h_distance=np.float32),\n",
649 | " kwargs=dict())\n",
650 | " \n",
651 | " \n",
652 | " df = df.apply_rows(day_of_the_week_kernel,\n",
653 | " incols=['day', 'month', 'year'],\n",
654 | " outcols=dict(day_of_week=np.float32),\n",
655 | " kwargs=dict())\n",
656 | " \n",
657 | " \n",
658 | " df['is_weekend'] = (df['day_of_week']<2)\n",
659 | " return df"
660 | ]
661 | },
662 | {
663 | "cell_type": "code",
664 | "execution_count": 52,
665 | "metadata": {},
666 | "outputs": [
667 | {
668 | "data": {
669 | "text/html": [
670 | "\n",
671 | "\n",
684 | "
\n",
685 | " \n",
686 | " \n",
687 | " | \n",
688 | " passenger_count | \n",
689 | " trip_distance | \n",
690 | " pickup_longitude | \n",
691 | " pickup_latitude | \n",
692 | " rate_code | \n",
693 | " dropoff_longitude | \n",
694 | " dropoff_latitude | \n",
695 | " fare_amount | \n",
696 | " hour | \n",
697 | " year | \n",
698 | " month | \n",
699 | " day | \n",
700 | " diff | \n",
701 | " pickup_latitude_r | \n",
702 | " pickup_longitude_r | \n",
703 | " dropoff_latitude_r | \n",
704 | " dropoff_longitude_r | \n",
705 | " h_distance | \n",
706 | " day_of_week | \n",
707 | " is_weekend | \n",
708 | "
\n",
709 | " \n",
710 | " \n",
711 | " \n",
712 | " 0 | \n",
713 | " 1 | \n",
714 | " 1.10 | \n",
715 | " -73.935516 | \n",
716 | " 40.761238 | \n",
717 | " 1 | \n",
718 | " -73.944351 | \n",
719 | " 40.754578 | \n",
720 | " 6.0 | \n",
721 | " 0 | \n",
722 | " 2015 | \n",
723 | " 10 | \n",
724 | " 1 | \n",
725 | " 348000 | \n",
726 | " 40.759998 | \n",
727 | " -73.939995 | \n",
728 | " 40.750000 | \n",
729 | " -73.949997 | \n",
730 | " 1.049876 | \n",
731 | " 5.0 | \n",
732 | " False | \n",
733 | "
\n",
734 | " \n",
735 | " 1 | \n",
736 | " 1 | \n",
737 | " 7.68 | \n",
738 | " -73.989937 | \n",
739 | " 40.743439 | \n",
740 | " 1 | \n",
741 | " -73.986687 | \n",
742 | " 40.689129 | \n",
743 | " 27.5 | \n",
744 | " 0 | \n",
745 | " 2015 | \n",
746 | " 10 | \n",
747 | " 1 | \n",
748 | " 0 | \n",
749 | " 40.739998 | \n",
750 | " -73.989998 | \n",
751 | " 40.680000 | \n",
752 | " -73.989998 | \n",
753 | " 6.045188 | \n",
754 | " 5.0 | \n",
755 | " False | \n",
756 | "
\n",
757 | " \n",
758 | " 2 | \n",
759 | " 2 | \n",
760 | " 2.53 | \n",
761 | " -73.987328 | \n",
762 | " 40.720020 | \n",
763 | " 1 | \n",
764 | " -73.999084 | \n",
765 | " 40.744381 | \n",
766 | " 12.5 | \n",
767 | " 0 | \n",
768 | " 2015 | \n",
769 | " 10 | \n",
770 | " 1 | \n",
771 | " 0 | \n",
772 | " 40.719997 | \n",
773 | " -73.989998 | \n",
774 | " 40.739998 | \n",
775 | " -74.000000 | \n",
776 | " 2.884243 | \n",
777 | " 5.0 | \n",
778 | " False | \n",
779 | "
\n",
780 | " \n",
781 | " 4 | \n",
782 | " 1 | \n",
783 | " 3.80 | \n",
784 | " -73.984016 | \n",
785 | " 40.755222 | \n",
786 | " 1 | \n",
787 | " -73.959869 | \n",
788 | " 40.801323 | \n",
789 | " 15.5 | \n",
790 | " 0 | \n",
791 | " 2015 | \n",
792 | " 10 | \n",
793 | " 1 | \n",
794 | " 978000 | \n",
795 | " 40.750000 | \n",
796 | " -73.989998 | \n",
797 | " 40.799999 | \n",
798 | " -73.959999 | \n",
799 | " 5.514657 | \n",
800 | " 5.0 | \n",
801 | " False | \n",
802 | "
\n",
803 | " \n",
804 | " 5 | \n",
805 | " 1 | \n",
806 | " 3.10 | \n",
807 | " -73.975296 | \n",
808 | " 40.751396 | \n",
809 | " 1 | \n",
810 | " -73.970924 | \n",
811 | " 40.785984 | \n",
812 | " 12.5 | \n",
813 | " 0 | \n",
814 | " 2015 | \n",
815 | " 10 | \n",
816 | " 1 | \n",
817 | " 820000 | \n",
818 | " 40.750000 | \n",
819 | " -73.979996 | \n",
820 | " 40.779999 | \n",
821 | " -73.979996 | \n",
822 | " 3.863575 | \n",
823 | " 5.0 | \n",
824 | " False | \n",
825 | "
\n",
826 | " \n",
827 | "
\n",
828 | "
"
829 | ],
830 | "text/plain": [
831 | " passenger_count trip_distance pickup_longitude pickup_latitude \\\n",
832 | "0 1 1.10 -73.935516 40.761238 \n",
833 | "1 1 7.68 -73.989937 40.743439 \n",
834 | "2 2 2.53 -73.987328 40.720020 \n",
835 | "4 1 3.80 -73.984016 40.755222 \n",
836 | "5 1 3.10 -73.975296 40.751396 \n",
837 | "\n",
838 | " rate_code dropoff_longitude dropoff_latitude fare_amount hour year \\\n",
839 | "0 1 -73.944351 40.754578 6.0 0 2015 \n",
840 | "1 1 -73.986687 40.689129 27.5 0 2015 \n",
841 | "2 1 -73.999084 40.744381 12.5 0 2015 \n",
842 | "4 1 -73.959869 40.801323 15.5 0 2015 \n",
843 | "5 1 -73.970924 40.785984 12.5 0 2015 \n",
844 | "\n",
845 | " month day diff pickup_latitude_r pickup_longitude_r \\\n",
846 | "0 10 1 348000 40.759998 -73.939995 \n",
847 | "1 10 1 0 40.739998 -73.989998 \n",
848 | "2 10 1 0 40.719997 -73.989998 \n",
849 | "4 10 1 978000 40.750000 -73.989998 \n",
850 | "5 10 1 820000 40.750000 -73.979996 \n",
851 | "\n",
852 | " dropoff_latitude_r dropoff_longitude_r h_distance day_of_week \\\n",
853 | "0 40.750000 -73.949997 1.049876 5.0 \n",
854 | "1 40.680000 -73.989998 6.045188 5.0 \n",
855 | "2 40.739998 -74.000000 2.884243 5.0 \n",
856 | "4 40.799999 -73.959999 5.514657 5.0 \n",
857 | "5 40.779999 -73.979996 3.863575 5.0 \n",
858 | "\n",
859 | " is_weekend \n",
860 | "0 False \n",
861 | "1 False \n",
862 | "2 False \n",
863 | "4 False \n",
864 | "5 False "
865 | ]
866 | },
867 | "execution_count": 52,
868 | "metadata": {},
869 | "output_type": "execute_result"
870 | }
871 | ],
872 | "source": [
873 | "# actually add the features\n",
874 | "taxi_df = taxi_df.map_partitions(add_features).persist()\n",
875 | "# inspect the result\n",
876 | "taxi_df.head().to_pandas()"
877 | ]
878 | },
879 | {
880 | "cell_type": "code",
881 | "execution_count": 53,
882 | "metadata": {},
883 | "outputs": [
884 | {
885 | "data": {
886 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3deXxU1d3H8c+ZLBOykW3CEgIBEkAQSBRwgaBiVaQqWJeKtZtdfNqi1rVW69LdvXV72dpW6fNoUWpVUHGrUAiLUBACQRICCRDCkpmE7PvMef5IgjEECWFm7r1zf+/XixdwM8z9MUy+3Dn3/M5RWmuEEEJYj8PoAoQQQvSPBLgQQliUBLgQQliUBLgQQliUBLgQQlhUeDBPlpKSojMyMoJ5SiGEsLxNmzZ5tNaunseDGuAZGRls3LgxmKcUQgjLU0rt7e24DKEIIYRFSYALIYRFSYALIYRFSYALIYRFSYALIYRFSYALIYRFSYALIYRFSYBbSHVjK29u3m90GUIIk5AAt5C/5pVy22v5HKxpMroUIYQJSIBbSF6xG4CK2haDKxFCmIEEuEUcaWhla3kNAO46CXAhhAS4ZazdXUnX7nfueglwIYQEuGXkFbuJdXasPSZX4EIIkAC3BK01ecUepmcmkxgdQUVds9ElCSFMQALcAko9DZRXN5Gb5cIV55QrcCEEIAFuCXnFHgBys1IkwIUQR0mAW0BesZvhSdGMSI7BFeuUm5hCCEAC3PTavD7W7a4kNysF4OgVuO6akiKEsK0TBrhS6kWlVIVSqqDH8ZuVUoVKqe1KqUcDV6K9bd5XTUOr92iAp8ZF0dzmo76l3eDKhBBG68sV+EJgdvcDSqkLgLnAZK31BOBx/5cmoGP4xKHgnNGfX4EDVMg4uBC2d8IA11qvAqp6HP4R8LDWuqXzMRUBqE0Aq4o9ZKcnMHBABPB5gMuNTCFEf8fAxwC5Sqn1SqmVSqmpx3ugUuqHSqmNSqmNbre7n6ezp+rGVrbtryY3y3X0mAS4EKJLfwM8HEgCzgbuAhYrpVRvD9Rav6C1nqK1nuJyuXp7iDiOtbsr8WmOjn8DuGIlwIUQHfob4PuBN3SHDYAPSDnBnxEnKa/YTZwznMnpCUePJURHEBGmZCqhEKLfAf4WcAGAUmoMEAl4/FWU6GifX7XTwzmjk4kI+/yfSSmFK9YpS8oKIfo0jXARsA4Yq5Tar5T6HvAiMKpzauGrwLe1TEz2qz2VjZ3t88d+sHHFSTOPEKJjLPtLaa3nH+dLN/i5FtFN1+YN3W9gdnHFOSmvlgWthLA76cQ0qbxiD+lJAxiRHH3M12Q9FCEESICb0uft8y56m9zjinVS1dCC1yejVkLYmQS4CW0pq6a+pZ3czN4n9rjio/BpqJRxcCFsTQLchPKKPTgUnDv6OAEeK+30QggJcFPKK3YzOT2BgdERvX79aDemXIELYWsS4CZT09hGfln1cYdPAFKlnV4IgQS46awr8XS0z485/rIDsh6KEAIkwE1nVbGHWGc42d3a53uKiggjLipcAlwIm5MAN5GO9nn3Me3zvZG54EIICXAT2VvZyP4jvbfP9+SKlQAXwu4kwE0kb1fX7vMnXnZX1kMRQkiAm0jeTjfDEgeQ0Uv7fE8yhCKEkAA3ifZuu88fZ2+ML0iNi6K+pZ3GVtncWAi7kgA3ifz91dS1tPdp+ARkKqEQQgLcNFbt7GqfT+7T4yXAhRAS4CaRV+xm0rAEEqIj+/R42RtTCCEBbgI1TW3k76/p0/TBLrIeihBCAtwE1u2uxOvTfR7/BkiKiSTMoWRvTCFsTALcBPKK3cREhpEz/Pjt8z2FORTJMZEyhCKEjUmAm8DqXcfuPt8X0swjhL1JgBtsb2UDeysbT2r4pIs08whhbxLgBssr7mqf7/sNzC6yHooQ9iYBbrDVxR7SEgYwMiXmpP9sarwTT30LPtncWAhbkgA3ULvXx5rdnj63z/fkinXS7tMcaWwNQHVCCLOTADdQ/v4a6pr73j7fkysuCpC54ELYlQS4gfKK3SgF0zP71j7fk7TTC2FvEuAGWl3sYVLawD63z/ckAS6EvUmAG6S2uY3NZdX9Hj6Bz3enr5AAF8KWJMAN8nn7/MlPH+wS4wwnOjJMrsCFsCkJcIOs2eUhOjKMnOGJp/Q80swjhH1JgBuk8GAdE4bGExl+av8E0swjhH1JgBukxNPAqJTYU34eWQ9FCPuSADdAbXMbnvoWRrpOvvuyJ1eck4raZj9UJYSwGglwA5S6GwD61T7fU2qck9rmdprbvKf8XEIIa5EAN0CppyPAR/khwLvmgntkGEUI25EAN0CJpwGHguHJ0af8XNLMI4R9SYAboNTTwLDEaJzhYaf8XK7YzvVQJMCFsJ0TBrhS6kWlVIVSqqDbsYeUUuVKqS2dP+YEtszQUuqp98v4N3x+BS7dmELYT1+uwBcCs3s5/getdXbnj2X+LSt0aa0pdTf4LcCTYyNRSq7AhbCjEwa41noVUBWEWmyhoq6FhlYvo/wwhRAgIsxBUnSkzAUXwoZOZQx8gVJqa+cQy3H7wZVSP1RKbVRKbXS73adwutBQ4u6agXLqTTxdpJ1eCHvqb4A/D4wGsoGDwBPHe6DW+gWt9RSt9RSXq/8r74WKrimE/mji6SIBLoQ99SvAtdaHtdZerbUP+Aswzb9lha5STz3OcAdD4qP89pyyHooQ9tSvAFdKDen22yuBguM9VnxRSecNTIfj5PfAPB5XfEeAay2bGwthJ+EneoBSahFwPpCilNoPPAicr5TKBjSwB7gpgDWGlFJPA2MHx/n1OV2xTlq9Pmqb2hkYHeHX5xZCmNcJA1xrPb+Xw38LQC0hr83rY19VI5dOHOzX5z3ajVnfLAEuhI1IJ2YQ7T/SRLtPM9KPM1BAmnmEsCsJ8CAqcdcD/lmFsLtUWQ9FCFuSAA8if65C2J0rTtZDEcKOJMCDqMTTQGJ0BIkxkX593viocCLDHRLgQtiMBHgQ+XMNlO6UUjIXXAgbkgAPolJPg99vYHaRvTGFsB8J8CBpaGnnUG2z3xax6kna6YWwHwnwIDm6BkoAhlCgYyaKTCMUwl4kwIPk6AyUAF6BVzW00ub1BeT5hRDmIwEeJF0BnpEcuAAHqKxvDcjzCyHMRwI8SEo9DaQlDCAq4tT3weyNK1aaeYSwGwnwIClx+28fzN50Xw9FCGEPEuBBoLWmxBOYOeBdUjvXF6+olStwIexCAjwIKhtaqWtuD9gNTICU2I7uThlCEcI+JMCDINBTCAGc4WEMHBAhzTxC2IgEeBB0rULoz42MeyPNPELYiwR4EJR4GogIU6QlDgjoeWQ9FCHsRQI8CErdDYxIjiHMj/tg9iY1XroxhbATCfAgKPU0+H0N8N50XYHL5sZC2IMEeIB5fZq9lY2MDOAMlC6uOCdNbV4aWr0BP5cQwngS4AFWfqSJVq8vOFfgsrWaELYiAR5gJZ6ufTADOwMFJMCFsBsJ8AAL9CqE3X2+O7200wthBxL
gAVbqaSAuKpxkP++D2ZtU2dxYCFuRAA+wrhkoSgV2CiFAwoAIwh1KAlwIm5AAD7CSAG1k3BuHQ5EizTxC2IYEeAA1t3kpr24Kyg3MLrK5sRD2IQEeQHsqg3cDs4srzilLygphExLgAVTqDvwqhD2lyhW4ELYhAR5AJUFYRrYnV5yTyvoWvD5ppxci1EmAB1CJu4FB8U5inOFBO6crzolPQ1WDbG4sRKiTAA+gUk9g98HsjWxuLIR9SIAHUKmngVGu4M1AAenGFMJOJMAD5EhDK0ca24KyiFV30o0phH1IgAeIETcwAVLiOjc3lpkoQoQ8CfAACcZGxr2Jjgwn1hkuV+BC2IAEeICUeuoJdyjSk6KDfm7Z3FgIezhhgCulXlRKVSilCnr52h1KKa2USglMedZV6mlgeFI0EWHB/z/SFSt7YwphB31Jl4XA7J4HlVLpwMXAPj/XFBKCuYhVT654Jx4JcCFC3gkDXGu9Cqjq5Ut/AO4GpOWvB59Ps6fSwACXFQmFsIV+fb5XSs0FyrXW+X147A+VUhuVUhvdbnd/Tmc5B2ubaW7zBWUj49644pzUtbTTJJsbCxHSTjrAlVLRwL3AA315vNb6Ba31FK31FJfLdbKns6SuRaxGBXEZ2e66mnk8MpVQiJDWnyvw0cBIIF8ptQcYBnyqlBrsz8KsrLRzI+NgLiPb3efdmBLgQoSyk15lSWu9DUjt+n1niE/RWnv8WJel7XY3EB0ZRmpnkAZb6tHd6aWdXohQ1pdphIuAdcBYpdR+pdT3Al+WtZV6Om5gBmMfzN644mRBKyHs4IRX4Frr+Sf4eobfqgkRpZ4GJg0baNj5k2OcOJQEuBChTjox/ayl3cv+I41BX4WwuzCHIilGduYRItRJgPtZWVUjPk3QVyHsSdrphQh9EuB+ttuAfTB7kxon7fRChDoJcD/rWoUwQ67ARRBoLY3Qdha8zRptotTdQEpsJAMHRBhahyvOiae+BZ9P43AYMxtGBE5FXTMvrCzhlfX7GJIQxcwsF+eNcXHWqCSiI+Xb2i7kX9rPSj0NhnVgdueKddLm1dQ0tZEYE2l0OcJPKuqa+fPKEl7+ZC9tXh9zJg6htrmdRRv2sXDtHiLDHEzJSGTmGBczs1ycNiTOsOmsIvAkwP2sxNPAheNST/zAADs6F7y+RQI8BHQP7nafZl52GgtmZR6919Lc5uW/e6pYtdPNqp0eHn6vkIffK8QV5yQ3K4WZWS5mZKWQEmtMc5kIDAlwP6ptbsNT32LYIlbdHW2nr21hzKA4g6sR/dVbcN88K/OYeyxREWHkZrnIzXJx31fhcG1zR5gXe1hRWMEbn5YDcHpaPDOzXHzj7BGkJQww4q8k/EgC3I9KTTIDBbq109dLO70VVdQ286eVJbyyviO4r8xJY8EFxwb38QyKj+KaKelcMyUdr0+z/UDN0avzF1aVsLywgndvySVM7o9YmgS4H3XNQDF6DjhIO71VnWpw9ybMoZg0LIFJwxJYMCuLd7YeYME/NvOvT/dz7ZR0P1Yvgk0C3I9KPA04FAxPDv4+mD3FOsOJinBIgFtEU6uXxz4oOhrcX8vpGOMekez/i4GvThzCX9JLefLDnVw+aSgDIsP8fg4RHDIP3I9K3PUMS4zGGW78N4RSSuaCW8gTHxbx0tpSrpg8lOV3nMdj10wOSHhDx3vjvjmncai2mRfXlAbkHCI4JMD9qGsVQrOQzY2tYVdFHQvX7uG6qekBDe7upo1M4qLxg3j+P7uplDVzLEsC3E+01qYL8NS4KLkCNzmtNQ8t/YwYZzh3XTIuqOf+2exxNLV5efrj4qCeV/iPBLifVNS10NjqZbQJphB2ccXJioRm937BIVbv8nDHxWNICvJ8/czUWK6bms4r6/cdvQEvrEUC3E9Kjk4hNL4Ls4srzkl1Yxst7bK5sRk1tXr5zbs7GDc4juunDTekhlu/kkVkuINH3y805Pzi1EiA+0lJ5z6YZmji6dI1lbCyvtXgSkRvnv/PLsqrm/jV3NMJDzPmWzE1LoqbZo7mvYJDbNpbZUgNov8kwP2k1N2AM9zBkPgoo0s5yhUrmxub1b7KRv60qoS52UOZNjLJ0Fq+nzsSV5yT3y0rlNUNLUYC3E+6bmCaaeW/1Hhp5jGrX73zGREOxb1zTjO6FGKc4dx+0Rg27T3CB9sPGV2OOAkS4H5S6mlglImGT0C6Mc1qRVEF/95xmJsvzGKQST6xXXPmMLJSY3nk/SLavD6jyxF9JAHuB21eH/uqGk01hRA6NjcGCXAzaWn38qu3P2NUSgw3Th9pdDlHhYc5uOfScZR6Gli0YZ/R5Yg+kgD3g7KqRtp92lQzUAAiwx0kRkfIglYm8uLqPZR6GnjwiglEhpvr22/WuFTOHpXEU/8upq65zehyQkZ1Yys3Lvwvuyrq/f7c5noHWVTXHFqzXYFDxzBKRa1cgZvBoZpmnllezMXjB3HeGJfR5RxDqY4x+cqGVv68ssTockKC1pr73ipg1U43zW3+n84rAe4HZlqFsKfUuChp5jGJ3y3bgdenuf+y8UaXclyThiVwxeSh/HV1CYdq5JPbqVqy5QDvbj3IbReN4fS0gX5/fglwPyjxNJAYHWHKnW9kQStz+KSkkqX5B7jpvNGkJxm/WuWXueuSsfh88ORHRUaXYmnl1U3cv6SAM0ckctPMUQE5h2UCvN3Ed8ZL3eZaA6W7rgCX+b3Gaff6eGjpdtISBvCj80YbXc4JpSdF861zRvDPTfspPFRrdDmW5PNp7lycj8+n+cO12QFr1LJEgD+3YhfX/2W9KUO8trmNgvIaMlPNdQOziyvWSUu7j7qWdqNLsa1X1u+j8FAd9192mmXW3l4wK5M4ZzgPvyct9v3x4ppS1pVU8sDl4wO6P4AlAnxY4gA27Kni6eW7jC7lGC+sLKGupZ1vnZNhdCm96r43pgi+yvoWnviwiNysFC6ZMNjocvosITqSBbMy+U+RmzW7PEaXYylFh+p49P0iLho/KOA7HlkiwOdmp3HVGcN4dnkxG0rNs15DRV0zf1tdyuWThwbkBoU/pEozj6Ee+6CIxlYvD14+AaXM06XbF986J4O0hAH8btkOfD4ZguuLlnYvt766mfgB4fz+axMD/m9uiQAH+OXcCQxPiuanr26mptEcc1SfXb6LNq+POy4aY3Qpx9V1w2xXRZ3BldhPflk1r20s48YZI007xPZloiLCuOuSsWw/UMuS/HKjy7GEJz/aSeGhOh65ahIpnWsRBZJlAjzWGc7T83Nw17dwzxtbDb8pt6+ykX+s38e1U9NPacPZQBuWOIC0hAGslo/BQeXzaR5Yup2UWCc3z8o0upx+u2LyUE5Pi+fxD3YGZB5zKFlfUskLq0qYP204F542KCjntEyAQ8cc1TsvHst7BYd49b9lhtby5EdFhDkUt16YZWgdJ6KUYkZmCmt3V5ryJnCoen3TfvLLqrl3zjjioiKMLqffHA7FvZeeRnl1E39fu8fockyrtrmN2xfnMzwpml98NXgLlFkqwAF+kDuK3KwUfvn2dsOGBT47UMuS/A
N8d/pI0yxG9GVmZKVQ19zOtvIao0uxhZqmNh55v5ApIxKZl51mdDmn7NzMFM4f6+JPK3fT2i4XAb355dLPOFjTxJPXZhPjDA/aeS0X4A6H4olrJhMdGc7Ni7YY8rHu8Q+LiHOGW2JOL8D0zBQAVhfLMEowPPzeDqoaW3noCuvduDyeb549giONbaza6Ta6FNN5b9tB/vXpfhZckMmZIxKDem7LBThAanwUj18ziR0Ha3kkyFtBbSitYnlhBf9z/mgGRlvjo3FSTCQThsbLOHgQvLv1IIs2lHHTzNGmnZnUHzPHuEiMjuDNLXIzs7uK2mZ+/uY2JqYN5GYDhlMtGeAAs8YN4jvnZvDSmj2sKKwIyjm11jz6fiGpcU6+e655lgLtixlZKXy67wgN0tATMGVVjdzzxlay0xO442Lzzkzqj4gwB5dPHsq/PzssKxV20lpz1+tbaW7z8oevZxNhwLZ4lg1wgHsuHcdpQ+K585/5VNQGfuGd5YUVbNx7hFsuzLJMR12XGZkptHk1G/aYZx59KGnz+rjl1c2g4Zn5OYZ8MwfavJw0Wtp9vF8gu/YAvPzJXlbudHPvnNMMmyZq6XdZVEQYz8zPpqG1nTv+mR/QZgOfT/PYB0WMSI7m61MD210VCFMzkogMd8g4eID84aOdbN5Xze+vmmj6xar6Kyc9gRHJ0bwlwyjsdtfz22U7mDnGxTfPHmFYHScMcKXUi0qpCqVUQbdjv1ZKbVVKbVFKfaiUGhrYMo8vMzWOBy6bQF6xh7+uDtwaxkvzD1B4qI47Lh5ryaurqIgwpmYkSoAHwOpiD8+v3M11U9O5bJJh3woBp5RibnYaa3dX2nqp2Tavj9te20JURBiPXT3J0BvVfUmihcDsHsce01pP0lpnA+8AD/i7sJMxf1o6sycM5rEPiti23/9T5VrbfTzxURHjh8Rz2cQhfn/+YJmR6aLocB0Vdfb95vM3T30Lty3ewmhXLA9ePsHocgJuXvZQtIalNu7MfGb5Lrbur+F3V040fBrxCQNca70KqOpxrPsakzGAoW2RSikevmoiKbFObnl1s99v1L36332UVTVx9+yxptp1/mTlZnVMJ5TFifzD59PcsTif2qY2nr0+x3L3RfpjlCuWyekJvLn5gNGlGGLzviM8t2IXX8tJY44JLub6PRaglPqtUqoM+AZfcgWulPqhUmqjUmqj2x24OaQJ0ZH88evZ7K1s4MGl2/32vA0t7Tz98S7OGplkym2wTsb4IfEkRkewurjS6FJCwt9Wl7Jyp5tfXDaecYPjjS4naK7MHsqOg7W2Wyu8qdXLHYvzGRTn5KG55vi01e8A11rfp7VOB14BFnzJ417QWk/RWk9xuQIbgGeNSmbBBZm8vmk/S/P9c4Xw0ppSPPUt3D17nOWbMhwOxbmZKaze5TZ8LRmryy+r5pH3C7lkwiBuOGu40eUE1WWThxLmULxls6vwxz4oosTTwKNXTybeJMsj+ONu3CvAVX54Hr+45cIszhyRyH1vbKOsqvGUnutI5+auF40fFPQOq0CZkZnC4dqWgOyQbRd1zW3cvGgzg+KjePSqyZb/j/1kpcQ6mZmVwtIt5bZZZnZ9SSUvrS3lhrOHM6NzKNIM+hXgSqnuLUdzAdNs2xEe5uCPX88GBTcv2szhU5gf/vzK3dS3tnPXJWP9WKGxZnS11cs4eL9orfnFWwWUVzfx1HXZlunG9bd5OWkcqGm2RV9BQ0s7d76eT3piND+/NHgLVfVFX6YRLgLWAWOVUvuVUt8DHlZKFSiltgIXA7cGuM6Tkp4UzSNXTWJbeQ25j6zgZ69vZbf75K44D9Y0sXDtHr6WM4wxg+ICVGnwpSdFk5EcLdMJ++n1TftZsuUAP70wiykZSUaXY5iLxg8iOjKMtzaH/myU37+3g/1Hmnjs6klBXaiqL05YjdZ6fi+H/xaAWvxqzsQhnD50IH/JK2HxxjIWbyrjkvGD+dH5o5mcnnDCP//Uv4tBw0+/Yu7lYvtjemYKb20up83rs+ScdqPsqqjngSXbOXtUEj++wLprfPtDdGQ4sycM5t1tB3noiglERYTmDJzVxR5e/mQf35sxkrNGJRtdzjFC+rt3eHI0v553OmvumcVPzs9k7W4Pc59bw/wXPmHVzuPfyNtVUc/ijWV84+zhIdlVl5uVQkOrly1l1UaXYhnNbV5uXrSZAZFhPHVdDmEWnk7qL/Ny0qhrbg/aWkTBVtfcxs/+tZVRrhjTDqOGdIB3SYl1cuclY1n78wu5b85plHjq+daLG7jsmdW8nX/gmI0OnvyoiKiIMH4SoldZ54xKwaEgT4ZR+uzh9wrZcbCWx6+ZZHjzhlmcOzoZV5yTN0N0GOU37+zgYE0TT1wz2bSfMGwR4F1ineH8YOYoVt19AY9eNYmmzquqWU+s5OVP9tLc5iW/rJpl2w7x/dxRQdnTzggDoyOYOCxBGnr66MPth1i4dg83Th/JrHHB2SrLCsLDHFwxeSj/KXJT3dhqdDl+taKwgtc2lnHTeaPJGW7eGWi2CvAuzvAwrp2azr9vO48/3XAmiTGR/OKtAmY8spzbF28hMTqCH+Raa7nYkzUjM5ktZdXUytKgX+pgTRN3/2srE4bG87NLzfkx2khX5qTR6vWxbFvorFBY09jGPW9sZeygONPfA7NlgHdxOBSzTx/MWz8+l0U/OJsJQwey293ArRdmWXofw76YkenC69OsLwn9aWAnq6nVy/LCw9z/VgHznltDW7uPZ68/A2e4OT9GG2nC0HhGu2JCajbKQ29vp7K+lSeunWz6f3NzzYkxiFKKc0Ync87oZDz1LSTHRBpdUsCdMSKBARFhrC52c9F4GRYoq2pkRVEFywsrWLe7kpZ2HwMiwpiemcL3ZoxkZEqM0SWaklKKK3PSePzDnZRVNVr+pv8H2w/x5uZybr0wyxI7KkmA9xCq4949OcPDmDYyiTybjoO3tvvYuLeKFYUVrChyH+1MHZEczfxpw5k1LpVpI5NMe/PKTOZmdwT40vwDlr7xX9XQyn1vbmPC0HgWzLLG30MC3MZys1L4zbs7OFDdxNCEAUaXE3DuupbOwK4gr9hDfUs7EWGKs0YmM3/acC4Y62KUy5idVawsPSmaqRmJvLm5nB+fP9qySwvc/1YBNU1tvPz9syzTHyEBbmPTu7XVXzvFersM9ZXWmv/7ZC+/fXcHLe0+BsdHcfnkIZw/NpXpmSnEmqy7zorm5aRx35sFbD9Qa4mhh57ezj/Au9sOctclYy21sqS8c21s3OA4UmKdrAnhAPfUt3DXP/NZUeTmvDEu7p49lvFD4i17lWhWX504hIeWbuetzeWWC/CKumbuX1LA5PQEbpo5yuhyToo1PieIgFBKMSMzmTW7PCG5qtyKogpm/3EVa3ZX8tDl41n43alMGDpQwjsAEqIjuWBsKkvyD+C10HtJa829bxTQ1OrliWsmE26RoZMu1qpW+N30zBQ89a0UHqozuhS/aW7z8tDS7Xz3pf+SHOPk7QUz+M70kRLcATYvJw13XQtrd1vnxvgbn5bz7x2HueuSsYbtL
H8qJMBtbkaIbbNWeKiWuc+uYeHaPXzn3AyWLJjO2MGhs5qkmc0al0qcM9wyrfUHa5p46O3tTM1I5LvTrdm4JwFuc0MGDmC0K8by0wm11ry0ppQrnl1DZUMrL313akivkmdGURFhzJk4hA8KDtHU6jW6nC/V7vVxx+J82r2ax6+ZbNnFySTABblZLjaUVtLSbu5vuuNx17XwnZf+yy/f/ozczBTe/2kuF4xNNbosW5qXk0ZDq5ePdhw2upQv9diHRazdXcmv553OiGTrNmlJgAumZ6bQ3OZj094jRpdy0pYXHmb2H1fxSUklv547gb9+e4ptmrHM6KyRSQwZGGXq1vpl2w7y55Ul3HD2cK4+c5jR5ZwSCXDB2aOSCHMoS+3S09zm5YElBdy4cCOuOCdv3zyDb56TITcqDeZwKOZmp7Fyp5vK+hajyzlG8eE67vpnPjnDE3jgMnPsLH8qJMAFcVER5K8hPB0AAApNSURBVKRbZ3nZjXuquOLZ1fzvur18b8ZIliyYHlLb3lndvJyheH2ad7YeNLqUL6htbuOm/9vEgMgwnv/GmUSGWz/+rP83EH4xPTOFreU1pl7X+WBNE7cs2szVf1pHXXM7/3vjNO6/bLzpV4yzm3GD4xk3OM5Us1F8Ps2di/PZW9XIc9efweCBobEphwS4ADrWRdEa1u2uNLqUYzS3eXnm42JmPb6S97cf4pZZmXx8x3nMHOMyujRxHFfmpLGlrJpST4PRpQDw/MrdfPjZYe6dc5op97bsLwlwAcDk9ARineGmmk6oteb9gkN85cmVPPHRTs4f6+Lj28/j9ovHEh0pq0CY2RXZQ1EKU9zMXLXTzeMfFnHF5KHcOD3D6HL8Sr4LBAARYQ7OHpVkmnHwnYfr+OXb21mzq5Ixg2L5x/fP4tzOxbeE+Q0ZOIBzRiWzZEs5P/1KlmE3l8uqGrnl1c2MHRTHw1dNDLmb3HIFLo6anpnC3spGyqoaDauhprGNh5Zu59Kn8igor+WXV0xg2S25Et4WdPWZw9hT2cj/fbLXkPM3t3n5n5c34fVp/nTDmSH5qS30/kai33I72+rzij1cf9bwoJ7b69Ms2rCPJz4soqapjevPGs7tF40lyQa7I4WqedlpvJ1/gN+8s4Oc9EQmDgveKoVa66PL2774nSlkhOiOSnIFLo4a7YplcHxU0IdR1pdUctkzq/nFWwVkDYrjnZtz+c28iRLeFudwKJ64Npvk2Eh+/I9N1DQFbwPtl9fv41+f7ufWC7OYNS50twyUK3BxlFKK6ZkpfFx4GK9Pn9T6EPll1fx93R6KDtXh9WnafRpf58/ezh/tPo1Pa9q9Pnwa2n0+vD5Nm1czdGAUz16fw1cnDgm5cUo7S4qJ5Nnrc/j6nz/hZ69v5fkbzgj4v++mvVX86u3tXDDWxa0XmntX+VMlAS6+IDcrhX99up/PDtSe8CNva7uP9woOsnDtHjbvqybWGc7UjEQiwhyEhykcShHuUDgcHT+HORyEOSDc4ej4WpgizKFIjXNy3dThDIiU+dyh6MwRSdw9eyy/W1bIwrV7ArryX0VdMz96+VOGDBzAH7+eg8Oii1T1lQS4+IJzMzvmyObtch83wN11LSzasI+XP9lLRV0LI1NieOjy8Vx15jDioiKCWa6wiB/kjmJDaRW/W7aDnOGJZKcn+P0cbV4fC17ZTG1zG3+/cRoDo0P/vSgBLr4gNS6KcYPjWF3s4cfnf3Fn7q37q1m4Zg/vbD1Iq9fHeWNcPHJ1BudluUL+SkecGqUUj18zma8+vZqfvPIpy27J9XvA/m7ZDjbsqeKp67I5bYh19rU8FRLg4hgzMlP433V7aWr1EuZQvFdwkL+v3cOn+6qJiQxj/rR0vnVuBqNlB3dxEhKiO8bDr/3zOu58PZ8Xvnmm38bDl2wp56U1e/ju9AzmZqf55TmtQAJcHGN6Vgp/XV3KvW9uY80uDxV1LWQkR/Pg5eO5WoZJxCnIGZ7IPZeexq/f+Yy/rS7l+7mntolwu9fHcyt28/TyYqZlJHHvnNP8VKk1SICLY5w1MonIcAdvbi5n5hgXj1yVwXljZJhE+MeN0zNYX1LJw+8VcsaIRM4Yntiv5ymrauS217awce8R5mUP5VfzTifCYpsSnyqldfB2kJ4yZYreuHFj0M4n+i+/rJoYZ7glN3oV5lfT2MZXn8lDa3j3lhkkRJ/cnP8lW8r5xZsFAPx63unMywntYROl1Cat9ZSex+3135Xos8npCRLeImAGRkfw3PVnUFHXzB2L8/H5+nYhWdvcxm2vbeHWV7cwZnAcy27NDfnw/jIS4EIIQ0xOT+C+OafxcWEFf8krOeHjN+6pYs5TeSzNP8BtXxnDaz88m/Sk6CBUal4yBi6EMMy3z81gw54qHv2giDNHJDIlI+mYx7R7fTyzfBfPLC8mLXEAi286hzNH9G/cPNTIFbgQwjBKKR6+ahLDEgdw86LNVDV8cUeofZWNXPvndTz1cTHzctJYdkuuhHc3JwxwpdSLSqkKpVRBt2OPKaUKlVJblVJvKqX831YlhLCF+KiO8fDK+lZuX7wFn0+jteaNT/cz5+k8iivqeXp+Dk9emy1TWHvoyxX4QmB2j2MfAadrrScBO4Gf+7kuIYSNnJ42kPsvH89/itw8+dFObn11C7cvzmf8kHjeuzWXKyYPNbpEUzrhGLjWepVSKqPHsQ+7/fYT4Gr/liWEsJsbzhrO+pJKnl2xizCH4s6Lx/Cj8zNPalVMu/HHTcwbgdeO90Wl1A+BHwIMHx7cTQKEENahlOL3X5tIalwUl08eQk4/G3zspE+NPJ1X4O9orU/vcfw+YArwNd2HJ5JGHiGEOHnHa+Tp9xW4Uuo7wGXAhX0JbyGEEP7VrwBXSs0G7gbO01obtwOuEELYWF+mES4C1gFjlVL7lVLfA54F4oCPlFJblFJ/CnCdQggheujLLJT5vRz+WwBqEUIIcRKkE1MIISxKAlwIISxKAlwIISxKAlwIISwqqDvyKKXcwN5+/vEUwOPHcqxKXofPyWvRQV6HDqH8OozQWrt6HgxqgJ8KpdTG3jqR7EZeh8/Ja9FBXocOdnwdZAhFCCEsSgJcCCEsykoB/oLRBZiEvA6fk9eig7wOHWz3OlhmDFwIIcQXWekKXAghRDcS4EIIYVGWCHCl1GylVJFSapdS6h6j6zGKUmqPUmpb5wqQttkZ4zgbaycppT5SShV3/hzy27cc53V4SClV3vme2KKUmmNkjcGglEpXSq1QSn2mlNqulLq187jt3hOmD3ClVBjwHHApMB6Yr5Qab2xVhrpAa51ts/muCzl2Y+17gI+11lnAx52/D3ULOfZ1APhD53siW2u9LMg1GaEduENrPR44G/hJZybY7j1h+gAHpgG7tNYlWutW4FVgrsE1iSDSWq8Cqnocngv8vfPXfwfmBbUoAxzndbAdrfVBrfWnnb+uA3YAadjwPWGFAE8Dyrr9fn/nMTvSwIdKqU2dm0Xb2SCt9cHOXx8CBhlZjMEWKKW2dg6xhPywQXed+/XmAOux4XvCCgEuPjdDa30GHcNJ
P1FKzTS6IDPo3JPVrvNhnwdGA9nAQeAJY8sJHqVULPAv4Kda69ruX7PLe8IKAV4OpHf7/bDOY7ajtS7v/LkCeJOO4SW7OqyUGgLQ+XOFwfUYQmt9WGvt1Vr7gL9gk/eEUiqCjvB+RWv9Rudh270nrBDg/wWylFIjlVKRwHXAUoNrCjqlVIxSKq7r18DFQMGX/6mQthT4duevvw0sMbAWw3QFVqcrscF7Qiml6NjWcYfW+sluX7Lde8ISnZidU6P+CIQBL2qtf2twSUGnlBpFx1U3dOxl+g+7vA6dG2ufT8dyoYeBB4G3gMXAcDqWKL5Wax3SN/iO8zqcT8fwiQb2ADd1GwcOSUqpGUAesA3wdR6+l45xcHu9J6wQ4EIIIY5lhSEUIYQQvZAAF0IIi5IAF0IIi5IAF0IIi5IAF0IIi5IAF0IIi5IAF0IIi/p/twbao1Tr9IcAAAAASUVORK5CYII=\n",
887 | "text/plain": [
888 | ""
889 | ]
890 | },
891 | "metadata": {
892 | "needs_background": "light"
893 | },
894 | "output_type": "display_data"
895 | }
896 | ],
897 | "source": [
898 | "%matplotlib inline\n",
899 | "taxi_df.groupby('hour').fare_amount.mean().compute().to_pandas().sort_index().plot();"
900 | ]
901 | },
902 | {
903 | "cell_type": "code",
904 | "execution_count": 54,
905 | "metadata": {},
906 | "outputs": [
907 | {
908 | "name": "stdout",
909 | "output_type": "stream",
910 | "text": [
911 | "CPU times: user 381 ms, sys: 21.9 ms, total: 403 ms\n",
912 | "Wall time: 5.34 s\n"
913 | ]
914 | }
915 | ],
916 | "source": [
917 | "%%time\n",
918 | "X_train = taxi_df.query('day < 25').persist()\n",
919 | "\n",
920 | "# create a Y_train ddf with just the target variable\n",
921 | "Y_train = X_train[['fare_amount']].persist()\n",
922 | "# drop the target variable from the training ddf\n",
923 | "X_train = X_train[X_train.columns.difference(['fare_amount'])]\n",
924 | "\n",
925 | "# this wont return until all data is in GPU memory\n",
926 | "done = wait([X_train, Y_train])"
927 | ]
928 | },
929 | {
930 | "cell_type": "markdown",
931 | "metadata": {},
932 | "source": [
933 | "## Notes on training with XGBoost with Azure\n",
934 | "\n",
935 | "* Because Dask-XGBoost parses the `client` for the raw IP address, it passes `\"localhost\"` to RABIT if the `client` was configured to use `\"localhost\"` with SSH forwarding. This means Dask-XGBoost, as it exists, does not support Azure with this method.\n",
936 | "* There are several bugs and issues with the Dask submodule of XGBoost:\n",
937 | " 1. Data co-locality is not enforced (labels and data may not be on the same worker)\n",
938 | " 2. Data locality is not enforced (a data partition, x, may not be assigned to the worker, n, upon which it resides originally ... so, data may need to be shuffled\n",
939 | "\n",
940 | "The latter (Dask submodule of XGBoost) is being fixed in this PR: https://github.com/dmlc/xgboost/pull/4819\n",
941 | "\n",
942 | "This means the code below (Dask submodule of XGBoost) will not work, and replacing the call with Dask-XGBoost will not work."
943 | ]
944 | },
945 | {
946 | "cell_type": "code",
947 | "execution_count": null,
948 | "metadata": {},
949 | "outputs": [],
950 | "source": [
951 | "import dask_xgboost\n",
952 | "\n",
953 | "params = {\n",
954 | " 'num_rounds': 100,\n",
955 | " 'max_depth': 8,\n",
956 | " 'max_leaves': 2**8,\n",
957 | " 'tree_method': 'gpu_hist',\n",
958 | " 'objective': 'reg:squarederror',\n",
959 | " 'grow_policy': 'lossguide'\n",
960 | "}\n",
961 | "\n",
962 | "bst = dask_xgboost.train(client, params, X_train, Y_train, num_boost_round=params['num_rounds'])"
963 | ]
964 | },
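  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rough sketch of training via the `xgboost.dask` module, assuming a recent XGBoost build that ships the Dask module from the PR referenced above (it will not run with the `xgboost=0.90` packages pinned in this environment):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgboost as xgb\n",
    "\n",
    "# sketch only: build a distributed DMatrix from the Dask collections and train on the cluster\n",
    "dtrain = xgb.dask.DaskDMatrix(client, X_train, Y_train)\n",
    "output = xgb.dask.train(client, params, dtrain, num_boost_round=params['num_rounds'])\n",
    "bst = output['booster']"
   ]
  },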
965 | {
966 | "cell_type": "code",
967 | "execution_count": null,
968 | "metadata": {},
969 | "outputs": [],
970 | "source": []
971 | }
972 | ],
973 | "metadata": {
974 | "kernelspec": {
975 | "display_name": "Python 3",
976 | "language": "python",
977 | "name": "python3"
978 | },
979 | "language_info": {
980 | "codemirror_mode": {
981 | "name": "ipython",
982 | "version": 3
983 | },
984 | "file_extension": ".py",
985 | "mimetype": "text/x-python",
986 | "name": "python",
987 | "nbconvert_exporter": "python",
988 | "pygments_lexer": "ipython3",
989 | "version": "3.7.4"
990 | }
991 | },
992 | "nbformat": 4,
993 | "nbformat_minor": 4
994 | }
995 |
--------------------------------------------------------------------------------
/rapids_interactive/dask/dask.yml:
--------------------------------------------------------------------------------
1 | name: dask
2 | channels:
3 | - defaults
4 | - conda-forge
5 | - nvidia
6 | - rapidsai
7 | - rapidsai/label/xgboost
8 |
9 | dependencies:
10 | - python=3.7
11 | - cudatoolkit
12 | - cudf
13 | - cuml
14 | - cugraph
15 | - bokeh
16 | - dask-cuda
17 | - dask-cudf
18 | - nvidia::nccl=2.4.*
19 | - rapidsai/label/xgboost::xgboost=0.90.*
20 | - rapidsai/label/xgboost::dask-xgboost=0.2.*
21 | - dill
22 | - numba
23 | - pip:
24 | - azureml-sdk[automl,explain,notebooks]
25 | - mpi4py
26 |
--------------------------------------------------------------------------------
/rapids_interactive/dask/init-dask.py:
--------------------------------------------------------------------------------
1 | from mpi4py import MPI
2 | import os
3 | import argparse
4 | import socket
5 | from azureml.core import Run
6 |
7 | import sys, os
8 | pip = sys.executable[:-6] + 'pip freeze'  # build "<env>/bin/pip freeze" from the interpreter path (used below to log the installed packages)
9 | print(pip)
10 | os.system(pip)
11 |
12 | if __name__ == '__main__':
13 | comm = MPI.COMM_WORLD
14 | rank = comm.Get_rank()
15 |
16 | ip = socket.gethostbyname(socket.gethostname())
17 | print("- my rank is ", rank)
18 | print("- my ip is ", ip)
19 |
20 | parser = argparse.ArgumentParser()
21 | parser.add_argument("--data")
22 | parser.add_argument("--gpus")
23 | FLAGS, unparsed = parser.parse_known_args()
24 |
25 | if rank == 0:
26 | data = {
27 | "scheduler" : ip + ":8786",
28 | "dashboard" : ip + ":8787"
29 | }
30 | Run.get_context().log("headnode", ip)
31 | Run.get_context().log("scheduler", data["scheduler"])
32 | Run.get_context().log("dashboard", data["dashboard"])
33 | Run.get_context().log("data", FLAGS.data)
34 | else:
35 | data = None
36 |
37 | data = comm.bcast(data, root=0)
38 | scheduler = data["scheduler"]
39 | dashboard = data["dashboard"]
40 | print("- scheduler is ", scheduler)
41 | print("- dashboard is ", dashboard)
42 |
43 |
44 | if rank == 0:
45 | os.system("dask-scheduler " +
46 | "--port " + scheduler.split(":")[1] +
47 | " --dashboard-address " + dashboard +
48 | " --preload jupyter-preload.py")
49 | elif rank == 1:
50 | os.environ["CUDA_VISIBLE_DEVICES"] = '0,1' # allow the 1st worker to grab the GPU assigned to the scheduler as well as its own
51 | os.system("dask-cuda-worker " + scheduler + " --memory-limit 0")
52 | else:
53 | os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % int(FLAGS.gpus)) # restrict each worker to their own GPU (assuming one GPU per worker)
54 | os.system("dask-cuda-worker " + scheduler + " --memory-limit 0")
55 |
--------------------------------------------------------------------------------
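
Once `init-dask.py` is running, rank 0 has started `dask-scheduler` and logged its address as the `scheduler` run metric. A minimal sketch of attaching a Dask client from a machine inside the same VNet (for example the DSVM), assuming the `init-dask-jupyter` experiment name used in `start_cluster.ipynb`:

```python
from azureml.core import Workspace, Experiment
from dask.distributed import Client

ws = Workspace.from_config()

# most recent run of the experiment that started the cluster
run = next(Experiment(ws, "init-dask-jupyter").get_runs())

# the "scheduler" metric holds the ip:port logged by rank 0 above
client = Client(run.get_metrics()["scheduler"])
print(client)
```
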
/rapids_interactive/dask/jupyter-preload.py:
--------------------------------------------------------------------------------
1 | from notebook.notebookapp import NotebookApp
2 | from azureml.core import Run
3 | import socket
4 |
5 | def dask_setup(scheduler):
6 | app = NotebookApp()
7 | ip = socket.gethostbyname(socket.gethostname())
8 | app.ip="0.0.0.0"
9 | app.initialize([])
10 | Run.get_context().log("jupyter-url", "http://" + ip + ":" + str(app.port) + "/?token=" + app.token)
11 | Run.get_context().log("jupyter-port", app.port)
12 | Run.get_context().log("jupyter-token", app.token)
13 | Run.get_context().log("jupyter-ip", ip)
--------------------------------------------------------------------------------
/rapids_interactive/dask/rapids-0.9.yaml:
--------------------------------------------------------------------------------
1 | name: rapids-0.9
2 | channels:
3 | - defaults
4 | - conda-forge
5 | - nvidia
6 | - rapidsai
7 | - rapidsai/label/xgboost
8 |
9 | dependencies:
10 | - pip
11 | - mpi4py
12 | - python=3.7
13 | - numba>=0.45.1
14 | - cudatoolkit
15 | - cudf=0.9.*
16 | - cuml=0.9.*
17 | - cugraph=0.9.*
18 | - bokeh
19 | - dask=2.3.*
20 | - distributed=2.3.*
21 | - dask-cuda=0.9.*
22 | - dask-cudf=0.9.*
23 | - nvidia::nccl=2.4.*
24 | - rapidsai/label/xgboost::xgboost=0.90.*
25 | - rapidsai/label/xgboost::dask-xgboost=0.2.*
26 | - dill
27 | - pip:
28 | - azureml-sdk[automl,explain,notebooks]
29 |
--------------------------------------------------------------------------------
/rapids_interactive/dask/rapids.yml:
--------------------------------------------------------------------------------
1 | name: rapids0.10
2 | channels:
3 | - nvidia
4 | - rapidsai/label/xgboost
5 | - rapidsai
6 | - rapidsai-nightly
7 | - conda-forge
8 | - numba
9 | - pytorch
10 | dependencies:
11 | - python=3.7
12 | - pytorch
13 | - cudatoolkit=10.0
14 | - dask-cuda=0.9.1
15 | - cudf=0.9.*
16 | - cuml=0.9.*
17 | - cugraph=0.9.*
18 | - rapidsai/label/xgboost::xgboost=0.90.rapidsdev1
19 | - rapidsai/label/xgboost::dask-xgboost=0.2.*
20 | - conda-forge::numpy=1.16.4
21 | - cython
22 | - dask
23 | - distributed=2.3.2
24 | - pynvml=8.0.2
25 | - gcsfs
26 | - requests
27 | - jupyterhub
28 | - jupyterlab
29 | - matplotlib
30 | - ipywidgets
31 | - ipyvolume
32 | - seaborn
33 | - scipy
34 | - pandas
35 | - boost
36 | - nodejs
37 | - pytest
38 | - pip
39 | - pip:
40 | - git+https://github.com/cupy/cupy.git
41 | - setuptools
42 | - torch
43 | - torchvision
44 | - pytorch-ignite
45 | - graphviz
46 | - networkx
47 | - dask-kubernetes
48 | - dask_labextension
49 | - jupyterlab-nvdashboard
--------------------------------------------------------------------------------
/rapids_interactive/start_cluster.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Running a DASK cluster with RAPIDS\n",
8 | "\n",
9 | "This notebook runs a DASK cluster with NVIDIA RAPIDS. RAPIDS uses NVIDIA CUDA for high-performance GPU execution, exposing GPU parallelism and high memory bandwidth through a user-friendly Python interface. It includes a dataframe library called cuDF which will be familiar to Pandas users, as well as an ML library called cuML that provides GPU versions of all machine learning algorithms available in Scikit-learn. \n",
10 | "\n",
11 | "This notebook shows how through DASK, RAPIDS can take advantage of multi-node, multi-GPU configurations on AzureML. \n",
12 | "\n",
13 | "This notebook is deploying the AzureML cluster to a VNet. Prior to running this, setup a VNet and DSVM according to [../setup-vnet.md](../setup-vnet.md). In this case the following names are used to identify the VNet and subnet.\n",
14 | "\n",
15 | "In addition, you need to forward the following ports to the DSVM \n",
16 | "\n",
17 | "- port 8888 to port 8888 for the jupyter server running on the DSVM (see [../setup-vnet.md](../setup-vnet.md))\n",
18 | "- port 9999 to port 9999 for the jupyter server running on the AML Cluster (will be explained below)\n",
19 | "- port 9797 to port 9797 for the jupyter server running on the AML Cluster (will be explained below)\n",
20 | "\n",
21 | "The easiert way to accomplish that is by logging into the DSVM using ssh with the following flags (assuming `mydsvm.westeurope.cloudapp.azure.com` is the DNS name for your DSVM:\n",
22 | "\n",
23 | " ssh mydsvm.westeurope.cloudapp.azure.com -L 9797:localhost:9797 -L 9999:localhost:9999 -L 8888:localhost:8888\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "import os\n",
33 | "import json\n",
34 | "import time\n",
35 | "\n",
36 | "from azureml.core import Workspace, Experiment, Environment\n",
37 | "from azureml.core.conda_dependencies import CondaDependencies\n",
38 | "from azureml.core.compute import AmlCompute, ComputeTarget\n",
39 | "from azureml.data.data_reference import DataReference\n",
40 | "from azureml.core.runconfig import RunConfiguration, MpiConfiguration\n",
41 | "from azureml.core import ScriptRunConfig\n",
42 | "from azureml.train.estimator import Estimator\n",
43 | "from azureml.exceptions import ComputeTargetException\n",
44 | "from azureml.widgets import RunDetails\n",
45 | "\n",
46 | "from subprocess import Popen, PIPE\n",
47 | "\n",
48 | "class PortForwarder():\n",
49 | " '''A helper to forward ports from the Notebook VM to the AML Cluster in the same VNet'''\n",
50 | " active_instances = set()\n",
51 | " \n",
52 | " def __init__(self, from_port, to_ip, to_port):\n",
53 | " self.from_port = from_port\n",
54 | " self.to_ip = to_ip\n",
55 | " self.to_port = to_port\n",
56 | " \n",
57 | " def start(self):\n",
58 | " self._socat = Popen([\"/usr/bin/socat\", \n",
59 | " f\"tcp-listen:{self.from_port},reuseaddr,fork\", \n",
60 | " f\"tcp:{self.to_ip}:{self.to_port}\"],\n",
61 | " stderr=PIPE, stdout=PIPE, universal_newlines=True)\n",
62 | " PortForwarder.active_instances.add(self)\n",
63 | " return self\n",
64 | " \n",
65 | " def stop(self):\n",
66 | " PortForwarder.active_instances.remove(self)\n",
67 | " return self._socat.terminate()\n",
68 | " \n",
69 | " def stop_all():\n",
70 | " for instance in list(PortForwarder.active_instances):\n",
71 | " instance.stop()"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 2,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "gpu_cluster_name = \"nd12-vnet-clustr\"\n",
81 | "vnet_resourcegroup_name='demo'\n",
82 | "vnet_name='myvnet'\n",
83 | "subnet_name='default'\n",
84 | "\n",
85 | "ws = Workspace.from_config()"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "### Deploy the AmlCompute cluster\n",
93 | "The next cell is deploying the AmlCompute cluster. The cluster is configured to scale down to 0 nodes after 2 minuten, so no cost is incurred while DASK is not running (and thus no nodes are spun up on the cluster as the result of this cell, yet). This cell only needs to be executed once and the cluster can be reused going forward."
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": 3,
99 | "metadata": {},
100 | "outputs": [
101 | {
102 | "name": "stdout",
103 | "output_type": "stream",
104 | "text": [
105 | "Found existing compute target\n"
106 | ]
107 | }
108 | ],
109 | "source": [
110 | "try:\n",
111 | " gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)\n",
112 | " print('Found existing compute target')\n",
113 | " \n",
114 | "except ComputeTargetException:\n",
115 | " print(\"Creating new cluster\")\n",
116 | "\n",
117 | " provisioning_config = AmlCompute.provisioning_configuration(\n",
118 | " vm_size=\"Standard_ND12s\", \n",
119 | " min_nodes=0, \n",
120 | " max_nodes=10,\n",
121 | " idle_seconds_before_scaledown=120,\n",
122 | " vnet_resourcegroup_name=vnet_resourcegroup_name,\n",
123 | " vnet_name=vnet_name,\n",
124 | " subnet_name=subnet_name\n",
125 | " )\n",
126 | " gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, provisioning_config)\n",
127 | "\n",
128 | " print(\"waiting for nodes\")\n",
129 | " gpu_cluster.wait_for_completion(show_output=True)"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "### Copy the data to Azure Blob Storage\n",
137 | "\n",
138 | "This next cell is pulling the NYC taxi data set down and then uploads it to the AzureML workspace's default data store. The all nodes of the DASK cluster we are creating further down will then be able to access the data."
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 4,
144 | "metadata": {},
145 | "outputs": [
146 | {
147 | "name": "stdout",
148 | "output_type": "stream",
149 | "text": [
150 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-01.csv\n",
151 | "- File already exists locally\n",
152 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-02.csv\n",
153 | "- File already exists locally\n",
154 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-03.csv\n",
155 | "- File already exists locally\n",
156 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-04.csv\n",
157 | "- File already exists locally\n",
158 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-05.csv\n",
159 | "- File already exists locally\n",
160 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-06.csv\n",
161 | "- File already exists locally\n",
162 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-07.csv\n",
163 | "- File already exists locally\n",
164 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-08.csv\n",
165 | "- File already exists locally\n",
166 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-09.csv\n",
167 | "- File already exists locally\n",
168 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-10.csv\n",
169 | "- File already exists locally\n",
170 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-11.csv\n",
171 | "- File already exists locally\n",
172 | "- Downloading http://dask-data.s3.amazonaws.com/nyc-taxi/2015/yellow_tripdata_2015-12.csv\n",
173 | "- File already exists locally\n",
174 | "- Uploading taxi data... \n",
175 | "Uploading an estimated of 12 files\n",
176 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-09.csv\n",
177 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-10.csv\n",
178 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-07.csv\n",
179 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-12.csv\n",
180 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-06.csv\n",
181 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-03.csv\n",
182 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-01.csv\n",
183 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-02.csv\n",
184 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-08.csv\n",
185 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-05.csv\n",
186 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-04.csv\n",
187 | "Target already exists. Skipping upload for nyctaxi/yellow_tripdata_2015-11.csv\n",
188 | "Uploaded 0 files\n",
189 | "- Data transfer complete\n"
190 | ]
191 | }
192 | ],
193 | "source": [
194 | "import io\n",
195 | "import os\n",
196 | "import sys\n",
197 | "import urllib.request\n",
198 | "from tqdm import tqdm\n",
199 | "from time import sleep\n",
200 | "\n",
201 | "cwd = os.getcwd()\n",
202 | "\n",
203 | "data_dir = os.path.abspath(os.path.join(cwd, 'data'))\n",
204 | "if not os.path.exists(data_dir):\n",
205 | " os.makedirs(data_dir)\n",
206 | "\n",
207 | "taxidir = os.path.join(data_dir, 'nyctaxi')\n",
208 | "if not os.path.exists(taxidir):\n",
209 | " os.makedirs(taxidir)\n",
210 | "\n",
211 | "filenames = []\n",
212 | "local_paths = []\n",
213 | "for i in range(1, 13):\n",
214 | " filename = \"yellow_tripdata_2015-{month:02d}.csv\".format(month=i)\n",
215 | " filenames.append(filename)\n",
216 | " \n",
217 | " local_path = os.path.join(taxidir, filename)\n",
218 | " local_paths.append(local_path)\n",
219 | "\n",
220 | "for idx, filename in enumerate(filenames):\n",
221 | " url = \"http://dask-data.s3.amazonaws.com/nyc-taxi/2015/\" + filename\n",
222 | " print(\"- Downloading \" + url)\n",
223 | " if not os.path.exists(local_paths[idx]):\n",
224 | " with open(local_paths[idx], 'wb') as file:\n",
225 | " with urllib.request.urlopen(url) as resp:\n",
226 | " length = int(resp.getheader('content-length'))\n",
227 | " blocksize = max(4096, length // 100)\n",
228 | " with tqdm(total=length, file=sys.stdout) as pbar:\n",
229 | " while True:\n",
230 | " buff = resp.read(blocksize)\n",
231 | " if not buff:\n",
232 | " break\n",
233 | " file.write(buff)\n",
234 | " pbar.update(len(buff))\n",
235 | " else:\n",
236 | " print(\"- File already exists locally\")\n",
237 | "\n",
238 | "print(\"- Uploading taxi data... \")\n",
239 | "ws = Workspace.from_config()\n",
240 | "ds = ws.get_default_datastore()\n",
241 | "\n",
242 | "ds.upload(\n",
243 | " src_dir=taxidir,\n",
244 | " target_path='nyctaxi',\n",
245 | " show_progress=True)\n",
246 | "\n",
247 | "print(\"- Data transfer complete\")"
248 | ]
249 | },
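  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default datastore is mounted on every node of the cluster; its mount path is passed to `init-dask.py` as `--data` (see the `script_params` below) and logged as the `data` run metric. A rough sketch, with a hypothetical mount path, of how the cluster-side notebook can read the uploaded CSVs with `dask_cudf`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import dask_cudf\n",
    "\n",
    "# hypothetical path -- on the cluster, use the mount path logged as the 'data' run metric\n",
    "data_path = '/path/to/mounted/workspaceblobstore'\n",
    "taxi_df = dask_cudf.read_csv(data_path + '/nyctaxi/yellow_tripdata_2015-*.csv')"
   ]
  },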
250 | {
251 | "cell_type": "markdown",
252 | "metadata": {},
253 | "source": [
254 | "### Create the DASK Cluster\n",
255 | "\n",
256 | "On the AMLCompute cluster we are now running a Python job that will run a DASK cluster. "
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 5,
262 | "metadata": {},
263 | "outputs": [
264 | {
265 | "name": "stderr",
266 | "output_type": "stream",
267 | "text": [
268 | "WARNING - 'gpu_support' is no longer necessary; AzureML now automatically detects and uses nvidia docker extension when it is available. It will be removed in a future release.\n",
269 | "WARNING - 'gpu_support' is no longer necessary; AzureML now automatically detects and uses nvidia docker extension when it is available. It will be removed in a future release.\n",
270 | "WARNING - 'gpu_support' is no longer necessary; AzureML now automatically detects and uses nvidia docker extension when it is available. It will be removed in a future release.\n"
271 | ]
272 | }
273 | ],
274 | "source": [
275 | "mpi_config = MpiConfiguration()\n",
276 | "mpi_config.process_count_per_node = 2\n",
277 | "\n",
278 | "est = Estimator(\n",
279 | " source_directory='./dask',\n",
280 | " compute_target=gpu_cluster,\n",
281 | " entry_script='init-dask.py',\n",
282 | " script_params={\n",
283 | " '--data': ws.get_default_datastore(),\n",
284 | " '--gpus': str(2) # The number of GPUs available on each node\n",
285 | " },\n",
286 | " node_count=3,\n",
287 | " use_gpu=True,\n",
288 | " distributed_training=mpi_config,\n",
289 | " conda_dependencies_file='rapids-0.9.yaml')\n",
290 | "\n",
291 | "run = Experiment(ws, \"init-dask-jupyter\").submit(est)"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {},
297 | "source": [
298 | "Let's use the widget to monitor how the DASK cluster spins up. When run for the first time on a workspace, the following thing will happen:\n",
299 | "\n",
300 | "1. The docker image will to be created, which takes about 20 minutes. \n",
301 | "2. Then AzureML will start to scale the cluster up by provisioning the required number of nodes (`node_count` above), which will take another 5-10 minutes with the chosen Standard_ND12s\n",
302 | "3. The docker image is being transferred over to the compute nodes, which, given the size of about 8 GB takes another 3-5 minutes\n",
303 | "\n",
304 | "So alltogether the process will take up to 30 minutes when run for the first time."
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 8,
310 | "metadata": {},
311 | "outputs": [
312 | {
313 | "data": {
314 | "application/vnd.jupyter.widget-view+json": {
315 | "model_id": "cb1f363f16374a4992a6719c9d58b49d",
316 | "version_major": 2,
317 | "version_minor": 0
318 | },
319 | "text/plain": [
320 | "_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…"
321 | ]
322 | },
323 | "metadata": {},
324 | "output_type": "display_data"
325 | }
326 | ],
327 | "source": [
328 | "from azureml.widgets import RunDetails\n",
329 | "RunDetails(run).show()"
330 | ]
331 | },
332 | {
333 | "cell_type": "markdown",
334 | "metadata": {},
335 | "source": [
336 | "### Wait for the cluster to come up"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 9,
342 | "metadata": {
343 | "scrolled": true
344 | },
345 | "outputs": [],
346 | "source": [
347 | "from IPython.display import clear_output\n",
348 | "import time\n",
349 | "\n",
350 | "it = 0\n",
351 | "while not \"headnode\" in run.get_metrics():\n",
352 | " clear_output(wait=True)\n",
353 | " print(\"waiting for scheduler node's ip \" + str(it) )\n",
354 | " time.sleep(1)\n",
355 | " it += 1\n",
356 | "\n",
357 | "headnode = run.get_metrics()[\"headnode\"]\n",
358 | "jupyter_ip = run.get_metrics()[\"jupyter-ip\"]\n",
359 | "jupyter_port = run.get_metrics()[\"jupyter-port\"]\n",
360 | "jupyter_token = run.get_metrics()[\"jupyter-token\"]"
361 | ]
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "### Establish port forwarding to the cluster"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 10,
373 | "metadata": {},
374 | "outputs": [
375 | {
376 | "name": "stdout",
377 | "output_type": "stream",
378 | "text": [
379 | "If you are forwarding the ports from your local machine as described at the top of this notebook,\n",
380 | "then you should now be able to connect to the Dashboard and Jupyter Server via the following URLs:\n",
381 | "\n",
382 | " Dashboard: http://localhost:9797\n",
383 | " Jupyter on cluster: http://localhost:9999/notebooks/azure_taxi_on_cluster.ipynb?token=0ed225c7db00699b10d80d86dc09d8149d6eae21e7200aac\n"
384 | ]
385 | }
386 | ],
387 | "source": [
388 | "dashboard = PortForwarder(9797, headnode, 8787).start()\n",
389 | "jupyter = PortForwarder(9999, headnode, 8888).start()\n",
390 | "\n",
391 | "print(\"If you are forwarding the ports from your local machine as described at the top of this notebook,\")\n",
392 | "print(\"then you should now be able to connect to the Dashboard and Jupyter Server via the following URLs:\")\n",
393 | "print()\n",
394 | "print(f\" Dashboard: http://localhost:9797\")\n",
395 | "print(f\" Jupyter on cluster: http://localhost:9999/notebooks/azure_taxi_on_cluster.ipynb?token={jupyter_token}\")"
396 | ]
397 | },
398 | {
399 | "cell_type": "markdown",
400 | "metadata": {},
401 | "source": [
402 | "## Shutting the cluster down\n",
403 | "\n",
404 | "Terminate the run to shut the cluster down. Once you are done with your interactive work, make sure to do this so the AML Compute cluster gets spun down again. "
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": null,
410 | "metadata": {},
411 | "outputs": [],
412 | "source": [
413 | "# stop the run representing the cluster\n",
414 | "run.cancel()\n",
415 | "# shut down the port forwards\n",
416 | "PortForwarder.stop_all()"
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 | "### Useful for debugging"
424 | ]
425 | },
426 | {
427 | "cell_type": "code",
428 | "execution_count": 5,
429 | "metadata": {},
430 | "outputs": [],
431 | "source": [
432 | "# get the last run\n",
433 | "run = Experiment(ws, \"init-dask-jupyter\").get_runs().__next__()"
434 | ]
435 | },
436 | {
437 | "cell_type": "code",
438 | "execution_count": 23,
439 | "metadata": {},
440 | "outputs": [
441 | {
442 | "data": {
443 | "text/plain": [
444 | "{'headnode': '172.17.0.6',\n",
445 | " 'scheduler': '172.17.0.6:8786',\n",
446 | " 'dashboard': '172.17.0.6:8787',\n",
447 | " 'data': '/mnt/batch/tasks/shared/LS_root/jobs/vnettest/azureml/init-dask-jupyter_1570114867_699d20d4/mounts/workspaceblobstore',\n",
448 | " 'jupyter-url': 'http://172.17.0.6:8888/?token=0f85e874d045185e175027bab126bd404ebe444c237a765a',\n",
449 | " 'jupyter-port': 8888,\n",
450 | " 'jupyter-token': '0f85e874d045185e175027bab126bd404ebe444c237a765a',\n",
451 | " 'jupyter-ip': '172.17.0.6'}"
452 | ]
453 | },
454 | "execution_count": 23,
455 | "metadata": {},
456 | "output_type": "execute_result"
457 | }
458 | ],
459 | "source": [
460 | "run.get_metrics()"
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": 8,
466 | "metadata": {},
467 | "outputs": [
468 | {
469 | "data": {
470 | "text/plain": [
471 | "'Running'"
472 | ]
473 | },
474 | "execution_count": 8,
475 | "metadata": {},
476 | "output_type": "execute_result"
477 | }
478 | ],
479 | "source": [
480 | "run.status"
481 | ]
482 | },
483 | {
484 | "cell_type": "code",
485 | "execution_count": null,
486 | "metadata": {},
487 | "outputs": [],
488 | "source": []
489 | }
490 | ],
491 | "metadata": {
492 | "kernelspec": {
493 | "display_name": "Python (rapids-0.9)",
494 | "language": "python",
495 | "name": "dask"
496 | },
497 | "language_info": {
498 | "codemirror_mode": {
499 | "name": "ipython",
500 | "version": 3
501 | },
502 | "file_extension": ".py",
503 | "mimetype": "text/x-python",
504 | "name": "python",
505 | "nbconvert_exporter": "python",
506 | "pygments_lexer": "ipython3",
507 | "version": "3.7.4"
508 | }
509 | },
510 | "nbformat": 4,
511 | "nbformat_minor": 4
512 | }
513 |
--------------------------------------------------------------------------------
/setup-vnet.md:
--------------------------------------------------------------------------------
1 | # Setting up a DSVM in a VNet
2 |
3 | ## Create the VNet
4 |
5 | 
6 |
7 | 
8 |
9 | ## Create DSVM using the VNet
10 |
11 | 
12 |
13 | 
14 |
15 | During setup, it is convenient to use your local username also for the DSVM and to provide your public SSH key, so you can easily ssh onto the VM.
16 | Once the DSVM is created, click on its Public IP address:
17 | 
18 |
19 | And assign it a DNS name, so you can access the VM by that name (alternatively, you can also switch to a static IP).
20 |
21 | 
22 |
23 | Here is some more information on the DSVM: https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro
24 |
25 | Download the config.json from the workspace and upload it to the DSVM (just put it in your user's home folder).
26 | 
27 |
28 | ```
29 | scp config.json <username>@<dsvm-dns-name>:
30 | ```
31 |
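Once config.json is in your home folder on the DSVM, the AzureML SDK will pick it up automatically. A quick sanity check (a minimal sketch, run from that folder):

```python
from azureml.core import Workspace

# Workspace.from_config() searches the current directory and its parents for config.json
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location)
```
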
32 | Now log in to the DSVM
33 |
34 | ```
35 | ssh <username>@<dsvm-dns-name> -L 8888:localhost:8888
36 | ```
37 |
38 | On the DSVM, pull down this repository and create the python environment:
39 |
40 | ```
41 | git clone https://github.com/danielsc/azureml-and-dask
42 | cd azureml-and-dask/
43 | conda env create -f interactive/dask/environment.yml
44 | conda activate dask
45 | python -m ipykernel install --user --name dask --display-name "Python (dask)"
46 | ```
47 |
48 | Next start jupyter on the DSVM:
49 |
50 | ```
51 | nohup jupyter notebook &
52 | ```
53 |
54 | Find the login token/url in nohup.out
55 |
56 | ```
57 | (dask) danielsc@vnettestvm:~/git/azureml-and-dask$ tail nohup.out
58 | [C 21:17:35.360 NotebookApp]
59 |
60 | To access the notebook, open this file in a browser:
61 | file:///data/home/danielsc/.local/share/jupyter/runtime/nbserver-18401-open.html
62 | Or copy and paste one of these URLs:
63 | http://localhost:8888/?token=6819bfd774eb016e2adc0eab9ec7ad04708058a278dd335f
64 | or http://127.0.0.1:8888/?token=6819bfd774eb016e2adc0eab9ec7ad04708058a278dd335f
65 | (dask) danielsc@vnettestvm:~/git/azureml-and-dask$
66 | ```
67 |
68 | If you started the ssh session with the port forward as above, then the link above should just work for you (in my case: http://localhost:8888/?token=6819bfd774eb016e2adc0eab9ec7ad04708058a278dd335f).
69 |
70 |
--------------------------------------------------------------------------------