├── .gitignore
├── 00_setup_data_wrangler.ipynb
├── 1-sagemaker-pipelines
    ├── 01_setup_sagemaker_pipeline.ipynb
    ├── README.md
    ├── flow-01-15-12-49-4bd733e0.flow
    └── images
    │   └── pipeline.png
├── 2-step-functions-pipelines
    ├── 01_setup_step_functions_pipeline.ipynb
    ├── README.md
    ├── flow-01-15-12-49-4bd733e0.flow
    └── step-function-workflow.png
├── 3-apache-airflow-pipelines
    ├── 01_setup_mwaa_pipeline.ipynb
    ├── README.md
    ├── images
    │   ├── delete_mwaa.png
    │   ├── flow.png
    │   ├── meaa_ui_home.png
    │   ├── mwaa_dag.png
    │   ├── mwaa_delete_dag.png
    │   ├── mwaa_s3.png
    │   ├── mwaa_trigger.png
    │   └── mwaa_ui.png
    └── scripts
    │   ├── SMDataWranglerOperator.py
    │   ├── config.py
    │   ├── ml_pipeline.py
    │   └── requirements.txt
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── data
    ├── claims.csv
    └── customers.csv
├── images
    ├── dw-arch.jpg
    ├── flow.png
    └── sm-studio-terminal.png
└── insurance_claims_flow_template


/.gitignore:
--------------------------------------------------------------------------------
1 | # Flow files
2 | .flow
3 | 
4 | # Jupyter Checkpoints
5 | .ipynb_checkpoints
6 | 


--------------------------------------------------------------------------------
/00_setup_data_wrangler.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Upload sample data and setup SageMaker Data Wrangler data flow\n",
  8 |     "\n",
  9 |     "This notebook uploads the sample data files provided in the `./data` directory to the default Amazon SageMaker S3 bucket. You can also generate a new Data Wrangler `.flow` file using the provided template.\n",
 10 |     "\n",
 11 |     "---"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "metadata": {},
 17 |    "source": [
 18 |     "Import required dependencies and initialize variables\n"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": 2,
 24 |    "metadata": {},
 25 |    "outputs": [
 26 |     {
 27 |      "name": "stdout",
 28 |      "output_type": "stream",
 29 |      "text": [
 30 |       "Using AWS Region: us-east-2\n"
 31 |      ]
 32 |     },
 33 |     {
 34 |      "data": {
 35 |       "text/plain": [
 36 |        "'sagemaker-us-east-2-716469146435'"
 37 |       ]
 38 |      },
 39 |      "execution_count": 2,
 40 |      "metadata": {},
 41 |      "output_type": "execute_result"
 42 |     }
 43 |    ],
 44 |    "source": [
 45 |     "import json\n",
 46 |     "import time\n",
 47 |     "import boto3\n",
 48 |     "import string\n",
 49 |     "import sagemaker\n",
 50 |     "\n",
 51 |     "region = sagemaker.Session().boto_region_name\n",
 52 |     "print(\"Using AWS Region: {}\".format(region))\n",
 53 |     "\n",
 54 |     "boto3.setup_default_session(region_name=region)\n",
 55 |     "\n",
 56 |     "s3_client = boto3.client('s3', region_name=region)\n",
 57 |     "# Sagemaker session\n",
 58 |     "sess = sagemaker.Session()\n",
 59 |     "\n",
 60 |     "# You can configure this with your own bucket name, e.g.\n",
 61 |     "# bucket = \"my-bucket\"\n",
 62 |     "bucket = sess.default_bucket()\n",
 63 |     "prefix = \"data-wrangler-pipeline\"\n",
 64 |     "bucket"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "markdown",
 69 |    "metadata": {},
 70 |    "source": [
 71 |     "---\n",
 72 |     "# Upload sample data to S3"
 73 |    ]
 74 |   },
 75 |   {
 76 |    "cell_type": "markdown",
 77 |    "metadata": {},
 78 |    "source": [
 79 |     "We have provided two sample data files `claims.csv` and `customers.csv` in the `/data` directory. These contain synthetically generated insurance claim data which we will use to train an XGBoost model. The purpose of the model is to identify if an insurance claim is fraudulent or legitimate.\n",
 80 |     "\n",
 81 |     "To begin with, we will upload both the files to the default SageMaker bucket."
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "code",
 86 |    "execution_count": 3,
 87 |    "metadata": {},
 88 |    "outputs": [],
 89 |    "source": [
 90 |     "s3_client.upload_file(Filename='data/claims.csv', Bucket=bucket, Key=f'{prefix}/claims.csv')\n",
 91 |     "s3_client.upload_file(Filename='data/customers.csv', Bucket=bucket, Key=f'{prefix}/customers.csv')"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "markdown",
 96 |    "metadata": {},
 97 |    "source": [
 98 |     "---\n",
 99 |     "# Generate Data Wrangler `.flow` file"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "markdown",
104 |    "metadata": {},
105 |    "source": [
106 |     "We have provided a convenient Data Wrangler flow file template named `insurance_claims_flow_template` using which we can create the `.flow` file. This template has a number of transformations that are applied to the features available in both the `claims.csv` and `customers.csv` files, and finally it also joins the two file to generate a single training CSV dataset. \n",
107 |     "\n",
108 |     "To create the `insurance_claims.flow` file execute the code cell below"
109 |    ]
110 |   },
111 |   {
112 |    "cell_type": "code",
113 |    "execution_count": 5,
114 |    "metadata": {},
115 |    "outputs": [],
116 |    "source": [
117 |     "claims_flow_template_file = \"insurance_claims_flow_template\"\n",
118 |     "\n",
119 |     "# Updates the S3 bucket and prefix in the template\n",
120 |     "with open(claims_flow_template_file, 'r') as f:\n",
121 |     "    variables   = {'bucket': bucket, 'prefix': prefix}\n",
122 |     "    template    = string.Template(f.read())\n",
123 |     "    claims_flow = template.safe_substitute(variables)\n",
124 |     "    claims_flow = json.loads(claims_flow)\n",
125 |     "\n",
126 |     "# Creates the .flow file\n",
127 |     "with open('insurance_claims.flow', 'w') as f:\n",
128 |     "    json.dump(claims_flow, f)"
129 |    ]
130 |   },
131 |   {
132 |    "cell_type": "markdown",
133 |    "metadata": {},
134 |    "source": [
135 |     "Open the `insurance_claim.flow` file in SageMaker Studio.\n",
136 |     "\n",
137 |     "<div class=\"alert alert-warning\"> ⚠️ <strong> NOTE: </strong>\n",
138 |     "    Note: The UI for Data Wrangler is only available via SageMaker Studio environment. If you are using SageMaker Classic notebooks, you will not be able to view the Data Wrangler UI but can still use the flow file programmatically.\n",
139 |     "</div>\n",
140 |     "\n",
141 |     "The flow should look as shown below\n",
142 |     "\n",
143 |     "<img src=\"images/flow.png\" width=\"800\"/>"
144 |    ]
145 |   },
146 |   {
147 |    "cell_type": "markdown",
148 |    "metadata": {},
149 |    "source": [
150 |     "# Alternatively\n",
151 |     "\n",
152 |     "You can also create this `.flow` file manually using the SageMaker Studio's Data Wrangler UI. Visit the [get started](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html) documentation to learn how to create a data flow using SageMaker Data Wrangler."
153 |    ]
154 |   },
155 |   {
156 |    "cell_type": "markdown",
157 |    "metadata": {},
158 |    "source": [
159 |     "---\n",
160 |     "# Upload the `.flow` file to S3\n",
161 |     "\n",
162 |     "Next we will upload the flow file to the S3 bucket. The executable python script we generated earlier will make use of this `.flow` file to perform transformations."
163 |    ]
164 |   },
165 |   {
166 |    "cell_type": "code",
167 |    "execution_count": 6,
168 |    "metadata": {},
169 |    "outputs": [
170 |     {
171 |      "name": "stdout",
172 |      "output_type": "stream",
173 |      "text": [
174 |       "Stored 'ins_claim_flow_uri' (str)\n"
175 |      ]
176 |     }
177 |    ],
178 |    "source": [
179 |     "import time\n",
180 |     "import uuid\n",
181 |     "\n",
182 |     "# unique flow export ID\n",
183 |     "flow_export_id = f\"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}\"\n",
184 |     "flow_export_name = f\"flow-{flow_export_id}\"\n",
185 |     "\n",
186 |     "s3_client.upload_file(Filename='insurance_claims.flow', Bucket=bucket, Key=f'{prefix}/flow/{flow_export_name}.flow')\n",
187 |     "ins_claim_flow_uri=f\"s3://{bucket}/{prefix}/flow/{flow_export_name}.flow\"\n",
188 |     "%store ins_claim_flow_uri"
189 |    ]
190 |   },
191 |   {
192 |    "cell_type": "code",
193 |    "execution_count": null,
194 |    "metadata": {},
195 |    "outputs": [],
196 |    "source": []
197 |   }
198 |  ],
199 |  "metadata": {
200 |   "instance_type": "ml.t3.medium",
201 |   "kernelspec": {
202 |    "display_name": "Python 3 (Data Science)",
203 |    "language": "python",
204 |    "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0"
205 |   },
206 |   "language_info": {
207 |    "codemirror_mode": {
208 |     "name": "ipython",
209 |     "version": 3
210 |    },
211 |    "file_extension": ".py",
212 |    "mimetype": "text/x-python",
213 |    "name": "python",
214 |    "nbconvert_exporter": "python",
215 |    "pygments_lexer": "ipython3",
216 |    "version": "3.7.10"
217 |   }
218 |  },
219 |  "nbformat": 4,
220 |  "nbformat_minor": 4
221 | }
222 | 


--------------------------------------------------------------------------------
/1-sagemaker-pipelines/01_setup_sagemaker_pipeline.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Creating SageMaker Pipelines workflow from AWS Data Wrangler Flow File\n",
  8 |     "\n",
  9 |     "<div class=\"alert alert-warning\"> \n",
 10 |     "\t⚠️ <strong> PRE-REQUISITE: </strong> Before proceeding with this notebook, please ensure that you have executed the <code>00_setup_data_wrangler.ipynb</code> Notebook</li>\n",
 11 |     "</div>\n",
 12 |     "\n",
 13 |     "We will demonstrate how to define a SageMaker Processing Job based on an existing SageMaker Data Wrangler Flow definition."
 14 |    ]
 15 |   },
 16 |   {
 17 |    "cell_type": "markdown",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "## 1. Initialization"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "code",
 25 |    "execution_count": null,
 26 |    "metadata": {},
 27 |    "outputs": [],
 28 |    "source": [
 29 |     "%store -r ins_claim_flow_uri"
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "code",
 34 |    "execution_count": null,
 35 |    "metadata": {},
 36 |    "outputs": [],
 37 |    "source": [
 38 |     "import sagemaker\n",
 39 |     "import json\n",
 40 |     "import string\n",
 41 |     "import boto3\n",
 42 |     "\n",
 43 |     "sm_client = boto3.client(\"sagemaker\")\n",
 44 |     "sess = sagemaker.Session()\n",
 45 |     "\n",
 46 |     "bucket = sess.default_bucket()\n",
 47 |     "prefix = 'aws-data-wrangler-workflows'\n",
 48 |     "\n",
 49 |     "FLOW_TEMPLATE_URI = ins_claim_flow_uri\n",
 50 |     "\n",
 51 |     "flow_file_name = FLOW_TEMPLATE_URI.split(\"/\")[-1]\n",
 52 |     "flow_export_name = flow_file_name.replace(\".flow\", \"\")\n",
 53 |     "flow_export_id = flow_export_name.replace(\"flow-\", \"\")"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "markdown",
 58 |    "metadata": {},
 59 |    "source": [
 60 |     "## 2. Download Flow template from Amazon S3\n",
 61 |     "\n",
 62 |     "We download the flow template from S3, in order to parse its content and retrieve the following information:\n",
 63 |     "* Source datasets, including dataset names and S3 URI\n",
 64 |     "* Output node, including Node ID and output path\n",
 65 |     "This information is then used as part of the parameters of the SageMaker Processing Job"
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": null,
 71 |    "metadata": {},
 72 |    "outputs": [],
 73 |    "source": [
 74 |     "sagemaker.s3.S3Downloader.download(FLOW_TEMPLATE_URI, \".\")"
 75 |    ]
 76 |   },
 77 |   {
 78 |    "cell_type": "markdown",
 79 |    "metadata": {},
 80 |    "source": [
 81 |     "#### Parsing input and output parameters from flow template"
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "code",
 86 |    "execution_count": null,
 87 |    "metadata": {},
 88 |    "outputs": [],
 89 |    "source": [
 90 |     "with open(flow_file_name, 'r') as f:\n",
 91 |     "    data = json.load(f)\n",
 92 |     "    output_node = data['nodes'][-1]['node_id']\n",
 93 |     "    output_path = data['nodes'][-1]['outputs'][0]['name']\n",
 94 |     "    input_source_names = [node['parameters']['dataset_definition']['name'] for node in data['nodes'] if node['type']==\"SOURCE\"]\n",
 95 |     "    input_source_uris = [node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] for node in data['nodes'] if node['type']==\"SOURCE\"]\n",
 96 |     "    \n",
 97 |     "\n",
 98 |     "output_name = f\"{output_node}.{output_path}\""
 99 |    ]
100 |   },
101 |   {
102 |    "cell_type": "markdown",
103 |    "metadata": {},
104 |    "source": [
105 |     "## 3. Create SageMaker Processing Job from Data Wrangler Flow template\n",
106 |     "\n",
107 |     "### 3.1 SageMaker Processing Inputs\n",
108 |     "\n",
109 |     "Below are the inputs required by the SageMaker Python SDK to launch a processing job.\n",
110 |     "\n",
111 |     "#### Source datasets"
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "code",
116 |    "execution_count": null,
117 |    "metadata": {},
118 |    "outputs": [],
119 |    "source": [
120 |     "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
121 |     "\n",
122 |     "data_sources = []\n",
123 |     "\n",
124 |     "for i in range(0,len(input_source_uris)):\n",
125 |     "    data_sources.append(ProcessingInput(\n",
126 |     "        source=input_source_uris[i],\n",
127 |     "        destination=f\"/opt/ml/processing/{input_source_names[i]}\",\n",
128 |     "        input_name=input_source_names[i],\n",
129 |     "        s3_data_type=\"S3Prefix\",\n",
130 |     "        s3_input_mode=\"File\",\n",
131 |     "        s3_data_distribution_type=\"FullyReplicated\"\n",
132 |     "    ))"
133 |    ]
134 |   },
135 |   {
136 |    "cell_type": "markdown",
137 |    "metadata": {},
138 |    "source": [
139 |     "#### Flow Input"
140 |    ]
141 |   },
142 |   {
143 |    "cell_type": "code",
144 |    "execution_count": null,
145 |    "metadata": {},
146 |    "outputs": [],
147 |    "source": [
148 |     "## Input - Flow\n",
149 |     "flow_input = ProcessingInput(\n",
150 |     "    source=FLOW_TEMPLATE_URI,\n",
151 |     "    destination=\"/opt/ml/processing/flow\",\n",
152 |     "    input_name=\"flow\",\n",
153 |     "    s3_data_type=\"S3Prefix\",\n",
154 |     "    s3_input_mode=\"File\",\n",
155 |     "    s3_data_distribution_type=\"FullyReplicated\"\n",
156 |     ")\n",
157 |     "\n",
158 |     "processing_job_inputs=[flow_input] + data_sources"
159 |    ]
160 |   },
161 |   {
162 |    "cell_type": "markdown",
163 |    "metadata": {},
164 |    "source": [
165 |     "### 3.2. SageMaker Processing Output"
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "code",
170 |    "execution_count": null,
171 |    "metadata": {},
172 |    "outputs": [],
173 |    "source": [
174 |     "s3_output_prefix = f\"export-{flow_export_name}/output\"\n",
175 |     "s3_output_path = f\"s3://{bucket}/{s3_output_prefix}\"\n",
176 |     "print(f\"Flow S3 export result path: {s3_output_path}\")\n",
177 |     "\n",
178 |     "processing_job_output = ProcessingOutput(\n",
179 |     "    output_name=output_name,\n",
180 |     "    source=\"/opt/ml/processing/output\",\n",
181 |     "    destination=s3_output_path,\n",
182 |     "    s3_upload_mode=\"EndOfJob\"\n",
183 |     ")\n",
184 |     "\n",
185 |     "processing_job_outputs=[processing_job_output]"
186 |    ]
187 |   },
188 |   {
189 |    "cell_type": "markdown",
190 |    "metadata": {},
191 |    "source": [
192 |     "### 3.3. Create Processor Object"
193 |    ]
194 |   },
195 |   {
196 |    "cell_type": "code",
197 |    "execution_count": null,
198 |    "metadata": {},
199 |    "outputs": [],
200 |    "source": [
201 |     "# IAM role for executing the processing job.\n",
202 |     "iam_role = sagemaker.get_execution_role()\n",
203 |     "aws_region = sess.boto_region_name\n",
204 |     "\n",
205 |     "# Unique processing job name. Give a unique name every time you re-execute processing jobs\n",
206 |     "processing_job_name = f\"data-wrangler-flow-processing-{flow_export_id}\"\n",
207 |     "\n",
208 |     "# Data Wrangler Container URI.\n",
209 |     "container_uri = sagemaker.image_uris.retrieve(\n",
210 |     "    framework='data-wrangler',\n",
211 |     "    region=aws_region\n",
212 |     ")\n",
213 |     "\n",
214 |     "# Processing Job Instance count and instance type.\n",
215 |     "instance_count = 2\n",
216 |     "instance_type = \"ml.m5.4xlarge\"\n",
217 |     "\n",
218 |     "# Size in GB of the EBS volume to use for storing data during processing\n",
219 |     "volume_size_in_gb = 30\n",
220 |     "\n",
221 |     "# Network Isolation mode; default is off\n",
222 |     "enable_network_isolation = False\n",
223 |     "\n",
224 |     "# KMS key for per object encryption; default is None\n",
225 |     "kms_key = None\n",
226 |     "\n",
227 |     "\n",
228 |     "# Content type for each output. Data Wrangler supports CSV as default and Parquet.\n",
229 |     "processing_job_output_content_type = \"CSV\"\n",
230 |     "\n",
231 |     "# Output configuration used as processing job container arguments \n",
232 |     "processing_job_output_config = {\n",
233 |     "    output_name: {\n",
234 |     "        \"content_type\": processing_job_output_content_type\n",
235 |     "    }\n",
236 |     "}"
237 |    ]
238 |   },
239 |   {
240 |    "cell_type": "markdown",
241 |    "metadata": {},
242 |    "source": [
243 |     "To launch the Processing Job in a workflow compatible with SageMaker SDK, you will create a Processor function. The processor can then be integrated in the following workflows:\n",
244 |     "\n",
245 |     "* Amazon SageMaker Pipelines\n",
246 |     "* AWS Step Functions (through AWS Step Functions Data Science SDK)\n",
247 |     "* Apache Airflow (through a Python Operator, using Amazon SageMaker SDK)"
248 |    ]
249 |   },
250 |   {
251 |    "cell_type": "code",
252 |    "execution_count": null,
253 |    "metadata": {},
254 |    "outputs": [],
255 |    "source": [
256 |     "from sagemaker.processing import Processor\n",
257 |     "from sagemaker.network import NetworkConfig\n",
258 |     "\n",
259 |     "processor = Processor(\n",
260 |     "    role=iam_role,\n",
261 |     "    image_uri=container_uri,\n",
262 |     "    instance_count=instance_count,\n",
263 |     "    instance_type=instance_type,\n",
264 |     "    volume_size_in_gb=volume_size_in_gb,\n",
265 |     "    network_config=NetworkConfig(enable_network_isolation=enable_network_isolation),\n",
266 |     "    sagemaker_session=sess,\n",
267 |     "    output_kms_key=kms_key\n",
268 |     ")"
269 |    ]
270 |   },
271 |   {
272 |    "cell_type": "markdown",
273 |    "metadata": {},
274 |    "source": [
275 |     "## 4. Create SageMaker Estimator\n",
276 |     "\n",
277 |     "Another building block for orchestrating ML workflows is an Estimator, which is the base for a Training Job. The estimator can be integrated in the following workflows:\n",
278 |     "\n",
279 |     "* Amazon SageMaker Pipelines\n",
280 |     "* AWS Step Functions (through AWS Step Functions Data Science SDK)\n",
281 |     "* Apache Airflow (through Amazon SageMaker Operator for Apache Airflow)"
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "code",
286 |    "execution_count": null,
287 |    "metadata": {},
288 |    "outputs": [],
289 |    "source": [
290 |     "import boto3\n",
291 |     "from sagemaker.estimator import Estimator\n",
292 |     "\n",
293 |     "region = boto3.Session().region_name\n",
294 |     "\n",
295 |     "# Estimator Instance count and instance type.\n",
296 |     "instance_count = 1\n",
297 |     "instance_type = \"ml.m5.4xlarge\"\n",
298 |     "\n",
299 |     "image_uri = sagemaker.image_uris.retrieve(\n",
300 |     "    framework=\"xgboost\",\n",
301 |     "    region=region,\n",
302 |     "    version=\"1.2-1\",\n",
303 |     "    py_version=\"py3\",\n",
304 |     "    instance_type=instance_type,\n",
305 |     ")\n",
306 |     "xgb_train = Estimator(\n",
307 |     "    image_uri=image_uri,\n",
308 |     "    instance_type=instance_type,\n",
309 |     "    instance_count=instance_count,\n",
310 |     "    role=iam_role,\n",
311 |     ")\n",
312 |     "xgb_train.set_hyperparameters(\n",
313 |     "    objective=\"reg:squarederror\",\n",
314 |     "    num_round=3,\n",
315 |     ")"
316 |    ]
317 |   },
318 |   {
319 |    "cell_type": "markdown",
320 |    "metadata": {},
321 |    "source": [
322 |     "# Part 2: Creating a workflow with SageMaker Pipelines"
323 |    ]
324 |   },
325 |   {
326 |    "cell_type": "markdown",
327 |    "metadata": {},
328 |    "source": [
329 |     "## 1. Define Pipeline Steps and Parameters\n",
330 |     "To create a SageMaker pipeline, you will first create a `ProcessingStep` using the Data Wrangler processor defined above."
331 |    ]
332 |   },
333 |   {
334 |    "cell_type": "code",
335 |    "execution_count": null,
336 |    "metadata": {},
337 |    "outputs": [],
338 |    "source": [
339 |     "from sagemaker.workflow.steps import ProcessingStep\n",
340 |     "\n",
341 |     "data_wrangler_step = ProcessingStep(\n",
342 |     "    name=\"DataWranglerProcessingStep\",\n",
343 |     "    processor=processor,\n",
344 |     "    inputs=processing_job_inputs, \n",
345 |     "    outputs=processing_job_outputs,\n",
346 |     "    job_arguments=[f\"--output-config '{json.dumps(processing_job_output_config)}'\"],\n",
347 |     ")"
348 |    ]
349 |   },
350 |   {
351 |    "cell_type": "markdown",
352 |    "metadata": {},
353 |    "source": [
354 |     "You now add a `TrainingStep` to the pipeline that trains a model on the preprocessed train data set. \n",
355 |     "\n",
356 |     "You can also add more steps. To learn more about adding steps to a pipeline, see [Define a Pipeline](http://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html) in the SageMaker documentation."
357 |    ]
358 |   },
359 |   {
360 |    "cell_type": "code",
361 |    "execution_count": null,
362 |    "metadata": {},
363 |    "outputs": [],
364 |    "source": [
365 |     "from sagemaker.inputs import TrainingInput\n",
366 |     "from sagemaker.workflow.steps import TrainingStep\n",
367 |     "from sagemaker.workflow.step_collections import RegisterModel\n",
368 |     "\n",
369 |     "xgb_input_content_type = None\n",
370 |     "\n",
371 |     "if processing_job_output_content_type == \"CSV\":\n",
372 |     "    xgb_input_content_type = 'text/csv'\n",
373 |     "elif processing_job_output_content_type == \"Parquet\":\n",
374 |     "    xgb_input_content_type = 'application/x-parquet'\n",
375 |     "\n",
376 |     "training_step = TrainingStep(\n",
377 |     "    name=\"DataWranglerTrain\",\n",
378 |     "    estimator=xgb_train,\n",
379 |     "    inputs={\n",
380 |     "        \"train\": TrainingInput(\n",
381 |     "            s3_data=data_wrangler_step.properties.ProcessingOutputConfig.Outputs[\n",
382 |     "                output_name\n",
383 |     "            ].S3Output.S3Uri,\n",
384 |     "            content_type=xgb_input_content_type\n",
385 |     "        )\n",
386 |     "    }\n",
387 |     ")\n",
388 |     "\n",
389 |     "register_step = RegisterModel(\n",
390 |     "        name=f\"DataWranglerRegisterModel\",\n",
391 |     "        estimator=xgb_train,\n",
392 |     "        model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,\n",
393 |     "        content_types=[\"text/csv\"],\n",
394 |     "        response_types=[\"text/csv\"],\n",
395 |     "        inference_instances=[\"ml.t2.medium\", \"ml.m5.large\"],\n",
396 |     "        transform_instances=[\"ml.m5.large\"],\n",
397 |     "        model_package_group_name=\"DataWrangler-PackageGroup\"\n",
398 |     "    )"
399 |    ]
400 |   },
401 |   {
402 |    "cell_type": "markdown",
403 |    "metadata": {},
404 |    "source": [
405 |     "### Define Pipeline Parameters\n",
406 |     "Now you will create the SageMaker pipeline that combines the steps created above so it can be executed. \n",
407 |     "\n",
408 |     "Define Pipeline parameters that you can use to parametrize the pipeline. Parameters enable custom pipeline executions and schedules without having to modify the Pipeline definition.\n",
409 |     "\n",
410 |     "The parameters supported in this notebook includes:\n",
411 |     "\n",
412 |     "- `instance_type` - The ml.* instance type of the processing job.\n",
413 |     "- `instance_count` - The instance count of the processing job."
414 |    ]
415 |   },
416 |   {
417 |    "cell_type": "code",
418 |    "execution_count": null,
419 |    "metadata": {},
420 |    "outputs": [],
421 |    "source": [
422 |     "from sagemaker.workflow.parameters import (\n",
423 |     "    ParameterInteger,\n",
424 |     "    ParameterString,\n",
425 |     ")\n",
426 |     "# Define Pipeline Parameters\n",
427 |     "instance_type = ParameterString(name=\"InstanceType\", default_value=\"ml.m5.4xlarge\")\n",
428 |     "instance_count = ParameterInteger(name=\"InstanceCount\", default_value=1)"
429 |    ]
430 |   },
431 |   {
432 |    "cell_type": "markdown",
433 |    "metadata": {},
434 |    "source": [
435 |     "You will create a pipeline with the steps and parameters defined above"
436 |    ]
437 |   },
438 |   {
439 |    "cell_type": "code",
440 |    "execution_count": null,
441 |    "metadata": {},
442 |    "outputs": [],
443 |    "source": [
444 |     "import time\n",
445 |     "import uuid\n",
446 |     "\n",
447 |     "from sagemaker.workflow.pipeline import Pipeline\n",
448 |     "\n",
449 |     "# Create a unique pipeline name with flow export name\n",
450 |     "pipeline_name = f\"pipeline-{flow_export_name}\"\n",
451 |     "\n",
452 |     "# Combine pipeline steps\n",
453 |     "pipeline_steps = [data_wrangler_step, training_step, register_step]\n",
454 |     "\n",
455 |     "pipeline = Pipeline(\n",
456 |     "    name=pipeline_name,\n",
457 |     "    parameters=[instance_type, instance_count],\n",
458 |     "    steps=pipeline_steps,\n",
459 |     "    sagemaker_session=sess\n",
460 |     ")"
461 |    ]
462 |   },
463 |   {
464 |    "cell_type": "markdown",
465 |    "metadata": {},
466 |    "source": [
467 |     "### (Optional) Examining the pipeline definition\n",
468 |     "\n",
469 |     "The JSON of the pipeline definition can be examined to confirm the pipeline is well-defined and \n",
470 |     "the parameters and step properties resolve correctly."
471 |    ]
472 |   },
473 |   {
474 |    "cell_type": "code",
475 |    "execution_count": null,
476 |    "metadata": {},
477 |    "outputs": [],
478 |    "source": [
479 |     "import json\n",
480 |     "\n",
481 |     "definition = json.loads(pipeline.definition())\n",
482 |     "definition"
483 |    ]
484 |   },
485 |   {
486 |    "cell_type": "markdown",
487 |    "metadata": {},
488 |    "source": [
489 |     "## 2. Submit the pipeline to SageMaker and start execution\n",
490 |     "\n",
491 |     "Submit the pipeline definition to the SageMaker Pipeline service and start an execution. The role passed in \n",
492 |     "will be used by the Pipeline service to create all the jobs defined in the steps."
493 |    ]
494 |   },
495 |   {
496 |    "cell_type": "code",
497 |    "execution_count": null,
498 |    "metadata": {},
499 |    "outputs": [],
500 |    "source": [
501 |     "iam_role = sagemaker.get_execution_role()\n",
502 |     "pipeline.upsert(role_arn=iam_role)\n",
503 |     "execution = pipeline.start()"
504 |    ]
505 |   },
506 |   {
507 |    "cell_type": "markdown",
508 |    "metadata": {},
509 |    "source": [
510 |     "### Pipeline Operations: Examine and Wait for Pipeline Execution\n",
511 |     "\n",
512 |     "Describe the pipeline execution and wait for its completion."
513 |    ]
514 |   },
515 |   {
516 |    "cell_type": "code",
517 |    "execution_count": null,
518 |    "metadata": {},
519 |    "outputs": [],
520 |    "source": [
521 |     "execution.wait()"
522 |    ]
523 |   },
524 |   {
525 |    "cell_type": "markdown",
526 |    "metadata": {},
527 |    "source": [
528 |     "List the steps in the execution. These are the steps in the pipeline that have been resolved by the step \n",
529 |     "executor service."
530 |    ]
531 |   },
532 |   {
533 |    "cell_type": "code",
534 |    "execution_count": null,
535 |    "metadata": {},
536 |    "outputs": [],
537 |    "source": [
538 |     "execution.list_steps()"
539 |    ]
540 |   },
541 |   {
542 |    "cell_type": "markdown",
543 |    "metadata": {},
544 |    "source": [
545 |     "You can visualize the pipeline execution status and details in Studio. For details please refer to \n",
546 |     "[View, Track, and Execute SageMaker Pipelines in SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-studio.html)"
547 |    ]
548 |   },
549 |   {
550 |    "cell_type": "markdown",
551 |    "metadata": {},
552 |    "source": [
553 |     "# Part 3: Cleanup\n",
554 |     "\n",
555 |     "## Pipeline cleanup\n",
556 |     "Set `pipeline_deletion` flag below to `True` to delete the SageMaker Pipelines created in this notebook."
557 |    ]
558 |   },
559 |   {
560 |    "cell_type": "code",
561 |    "execution_count": null,
562 |    "metadata": {},
563 |    "outputs": [],
564 |    "source": [
565 |     "pipeline_deletion = False"
566 |    ]
567 |   },
568 |   {
569 |    "cell_type": "code",
570 |    "execution_count": null,
571 |    "metadata": {},
572 |    "outputs": [],
573 |    "source": [
574 |     "if pipeline_deletion:\n",
575 |     "    pipeline.delete()"
576 |    ]
577 |   },
578 |   {
579 |    "cell_type": "markdown",
580 |    "metadata": {},
581 |    "source": [
582 |     "## Model cleanup\n",
583 |     "Set `model_deletion` flag below to `True` to delete the SageMaker Model created in this notebook."
584 |    ]
585 |   },
586 |   {
587 |    "cell_type": "code",
588 |    "execution_count": null,
589 |    "metadata": {},
590 |    "outputs": [],
591 |    "source": [
592 |     "model_deletion = False"
593 |    ]
594 |   },
595 |   {
596 |    "cell_type": "code",
597 |    "execution_count": null,
598 |    "metadata": {},
599 |    "outputs": [],
600 |    "source": [
601 |     "if model_deletion:\n",
602 |     "    model_package_group_name = register_step.steps[0].model_package_group_name\n",
603 |     "\n",
604 |     "    model_package_list = sm_client.list_model_packages(\n",
605 |     "        ModelPackageGroupName = model_package_group_name\n",
606 |     "    )\n",
607 |     "\n",
608 |     "    for version in range(0,len(model_package_list[\"ModelPackageSummaryList\"])):\n",
609 |     "        sm_client.delete_model_package(\n",
610 |     "            ModelPackageName = model_package_list[\"ModelPackageSummaryList\"][version][\"ModelPackageArn\"]\n",
611 |     "        )\n",
612 |     "\n",
613 |     "    sm_client.delete_model_package_group(\n",
614 |     "        ModelPackageGroupName = model_package_group_name\n",
615 |     "    )"
616 |    ]
617 |   },
618 |   {
619 |    "cell_type": "markdown",
620 |    "metadata": {},
621 |    "source": [
622 |     "## Experiment cleanup\n",
623 |     "Set `experiment_deletion` flag below to `True` to delete the SageMaker Experiment and Trials created by the Pipeline execution in this notebook."
624 |    ]
625 |   },
626 |   {
627 |    "cell_type": "code",
628 |    "execution_count": null,
629 |    "metadata": {},
630 |    "outputs": [],
631 |    "source": [
632 |     "experiment_deletion = False"
633 |    ]
634 |   },
635 |   {
636 |    "cell_type": "code",
637 |    "execution_count": null,
638 |    "metadata": {},
639 |    "outputs": [],
640 |    "source": [
641 |     "if experiment_deletion:\n",
642 |     "    experiment_name = pipeline_name\n",
643 |     "    trial_name = execution.arn.split(\"/\")[-1]\n",
644 |     "    \n",
645 |     "    components_in_trial = sm_client.list_trial_components(TrialName=trial_name)\n",
646 |     "    print('TrialComponentNames:')\n",
647 |     "    for component in components_in_trial['TrialComponentSummaries']:\n",
648 |     "        component_name = component['TrialComponentName']\n",
649 |     "        print(f\"\\t{component_name}\")\n",
650 |     "        sm_client.disassociate_trial_component(TrialComponentName=component_name, TrialName=trial_name)\n",
651 |     "        try:\n",
652 |     "            # comment out to keep trial components\n",
653 |     "            sm_client.delete_trial_component(TrialComponentName=component_name)\n",
654 |     "        except:\n",
655 |     "            # component is associated with another trial\n",
656 |     "            continue\n",
657 |     "        # to prevent throttling\n",
658 |     "        time.sleep(.5)\n",
659 |     "    sm_client.delete_trial(TrialName=trial_name)\n",
660 |     "    try:\n",
661 |     "        sm_client.delete_experiment(ExperimentName=experiment_name)\n",
662 |     "        print(f\"\\nExperiment {experiment_name} deleted\")\n",
663 |     "    except:\n",
664 |     "        # experiment already existed and had other trials\n",
665 |     "        print(f\"\\nExperiment {experiment_name} in use by other trials. Will not delete\")"
666 |    ]
667 |   }
668 |  ],
669 |  "metadata": {
670 |   "instance_type": "ml.t3.medium",
671 |   "kernelspec": {
672 |    "display_name": "Python 3 (Data Science)",
673 |    "language": "python",
674 |    "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0"
675 |   },
676 |   "language_info": {
677 |    "codemirror_mode": {
678 |     "name": "ipython",
679 |     "version": 3
680 |    },
681 |    "file_extension": ".py",
682 |    "mimetype": "text/x-python",
683 |    "name": "python",
684 |    "nbconvert_exporter": "python",
685 |    "pygments_lexer": "ipython3",
686 |    "version": "3.7.10"
687 |   }
688 |  },
689 |  "nbformat": 4,
690 |  "nbformat_minor": 4
691 | }
692 | 


--------------------------------------------------------------------------------
/1-sagemaker-pipelines/README.md:
--------------------------------------------------------------------------------
 1 | # Integrating SageMaker Data Wrangler with SageMaker Pipelines
 2 | 
 3 | [SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/) is a native workflow orchestration tool for building ML pipelines that take advantage of direct Amazon SageMaker integration. Along with SageMaker model registry and SageMaker Projects, pipelines improves the operational resilience and reproducibility of your ML workflows. These workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms. Each step in the pipeline can keep track of the lineage, and intermediate steps can be cached for quickly re-running the pipeline. You can create pipelines using the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html).
 4 | 
 5 | ## Architecture Overview
 6 | 
 7 | A workflow built with SageMaker pipelines consists of a sequence of steps forming a directed acyclic graph (DAG). In this example, we begin with a Processing step, where we start a SageMaker Processing job based on SageMaker Data Wrangler’s flow file to create a training dataset. We then continue with a Training step, where we train an XGBoost model using SageMaker’s built-in XGBoost algorithm and the training dataset created in the previous step. Once a model has been trained, we end this workflow with a RegisterModel step, where we register the trained model with SageMaker model registry. 
 8 | 
 9 | <div align="center">
10 | 	<p align="center">
11 | 		<img src="./images/pipeline.png" alt="sf"/>
12 | 	</p>
13 | </div>
14 | 
15 | ## Installation and walkthrough
16 | 
17 | To run this sample, we will use a Jupyter notebook running Python3 on a Data Science kernel image in a SageMaker Studio environment. You can also run it on a Jupyter notebook instance locally on your machine by setting up the credentials to assume the SageMaker execution role. The notebook is lightweight and can be run on an `ml.t3.medium` instance.
18 | 
19 | You can either use the export feature in SageMaker Data Wrangler to generate the Pipelines code, or build your own script from scratch. In our sample repository, we have used a combination of both approaches for simplicity. At a high level, these are the steps to build and execute the SageMaker Pipelines workflow.
20 | 
21 | ### In the setup notebook (<code>00_setup_data_wrangler.ipynb</code>)
22 | 1. Generate a flow file from Data Wrangler or use the setup script to generate a flow file from a preconfigured template.
23 | 2. Create an Amazon S3 bucket and upload your flow file and input files to the bucket. In our sample notebook, we use SageMaker’s default S3 bucket.
24 | 
25 | ### In the SageMaker Pipelines notebook (<code>01_setup_sagemaker_pipeline.ipynb</code>)
26 | 3. Follow the instructions in the <code>01_setup_sagemaker_pipeline.ipynb</code> notebook to create a `Processor` object based on the Data Wrangler flow file, and an `Estimator` object with the parameters of the training job. 
27 |    *  In our example, since we only use SageMaker features and SageMaker’s default S3 bucket, we can use SageMaker Studio’s default execution role. The same IAM role will be assumed by the pipeline run, the processing job and the training job. You can further customize the execution role according to minimum privilege.
28 | 4. Continue with the instructions to create a pipeline with steps referencing the `Processor` and `Estimator` objects, and then execute a pipeline run. The processing and training jobs will run on SageMaker managed environments and will take a few minutes to complete. 
29 | 5. In SageMaker Studio, you can see the pipeline details monitor the pipeline execution. You can also monitor the underlying processing and training jobs from the Amazon SageMaker Console, and from Amazon CloudWatch.
30 | 
31 | ### Cleaning Up
32 | 
33 | Follow the instructions under **Part 3: Cleanup** in the SageMaker Pipelines notebook (<code>01_setup_sagemaker_pipeline.ipynb</code>) to delete the Pipeline, the Model and the Experiment created during this sample.


--------------------------------------------------------------------------------
/1-sagemaker-pipelines/flow-01-15-12-49-4bd733e0.flow:
--------------------------------------------------------------------------------
1 | {"metadata": {"version": 1, "disable_limits": false}, "nodes": [{"node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "claims.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/claims.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "driver_relationship": "string", "incident_type": "string", "collision_type": "string", "incident_severity": "string", "authorities_contacted": "string", "num_vehicles_involved": "long", "num_injuries": "long", "num_witnesses": "long", "police_report_available": "string", "injury_claim": "long", "vehicle_claim": "float", "total_claim_amount": "float", "incident_month": "long", "incident_day": "long", "incident_dow": "long", "incident_hour": "long", "fraud": "long"}}, "inputs": [{"name": "default", "node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "customers.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/customers.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "a2370312-2ab4-43a3-ae7d-ba5a17057d0a", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "customer_age": "long", "months_as_customer": "long", "num_claims_past_year": "long", "num_insurers_past_5_years": "long", "policy_state": "string", "policy_deductable": "long", "policy_annual_premium": "long", "policy_liability": "string", "customer_zip": "long", "customer_gender": "string", "customer_education": "string", "auto_year": "long"}}, "inputs": [{"name": "default", "node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "police_report_available"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2ms2BZ&R6WWXb6O004CX000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx%1ME87#*Qpf<I8C>(>7oJ&#MZjCZ(6H<acx(WD*uiN5&{I<(K$17X39cJY{dfYI7Co;<S7CjxU+mj$c2zUW-@-a_8MQ2HbJ9h4a>%1%HSefU5ZW&!?IhpgSc+k>^~zV6MSv%c+1-bxHrFOt5=M!;g3q$_cvD^Y(+n6QIYC{+dyqTGvi~)x4~8?25S^b!Wbw^ID)aTwOEBv@)woSMXAEa#XWd!aqW*_=g|XVWK+X5j!<lV_%@T~Klh}Oo=y=fJQBVX^y+knN&VXR|Gku-+#l;m6o^gJ#XC?-0|XQR000O8%8hSRWFmRP4gdfE3;+NC7ytkOZDn*}WMOn+FD`Ila&#>)FfcGME@N_IOD;-gU|?WkIGK_+697<40|XQR000O8%8hSR-Pv?efB^siZ~_1TNB{r;WMOn+FK}UUbS*G2FfcGJVlrhhF*!J9EjeUjGA%SSG-WMeGh{F=GGSphWnnisFk(4oEn_e+FfMa$VQ_GHE^uLTadl;MjgnDI!!QuX&#caYd(B884<j&fFhgr+?S?PDi4Q}>2Sr5EHPvF;ZcV0$?DP4x%-qze2)=~7|NoM^-;uN9a~}ha@UVv+8`N=~hOk?^lA54V2>_JV>{)Aov$1uw2Y_rgDi@lq!N^r7O+69S!>u0Q%Uoat2Z(Gd5lf|yt4cg$gzIqN5JzR&EbT3+WG)Ny)56Wpm)QYmA(y&zr$KSk?bN}&Tz~id(ns;P1f!PIGOfp*#?d;9OD!Pa1%P!CxIIG3>W+(vu%q*L3jQc7os|XI47kTVAl+XTdaAe$rZZ`HRX*`t8j%Pi$m_-nGtU#rhD@7kHa4us(oCvEv*Gm%D@4Aq)(EV>tpB_xR`Lqu;vep!S619vd0ZAoWQ88hlww0Wft>##7B&&Fl1rh-J`ilL`TS-M<a7nNmsjo8M%q62J?^)6<5{MXwwDf)v>zlgP2(s^lfyI$!!QYhaNtM%{UGqtei$c%u1l$1c-L<I%>K1EP)h>@6aWAK2ms2BZ&Rr%CoK{H000mG002z@003lRbYU+paA9(EEif=JFfc7*GG#F_IXGr5Ib>ooEi^MUWi4SdWH2o<VPQ69VK+E1VmW3lV=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?df;N6+35SGmj08mQ-0u%!j000080LqPTQ<-jL$>{(90CfQX022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2ms2BZ&PF<dBY9>000aC000;O0000000000005+c8UX+RZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8%8hSR-Pv?efB^siZ~_1TNB{r;0000000000q=8-m003lRbYU-WVRCdWFfcGMFfC#-Wic^1IA$5WMVQcG&3}1EnzccFfB4+VK!x9H#jh2Ic6<mFfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2ms2BZ&Rr%CoK{H000mG002z@0000000000005+cL<0Z-WMOn+FD`Ila&#>)FfcGMEn+fdF)=wfW-U2nVlpi>Gc;u_VKZbfEiz$YHf3QqI51*4W-VhdFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0Ko$Q0000"}}, "inputs": [{"name": "df", "node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "a9dcc572-7c50-4e63-b592-6691f8df29b6", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "incident_severity"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2mlj~Z&R2h55?vH003_R000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YiPs1<_K=1sFoVTc@P`X{mWrqnQ5ZB5}+$~zZaFU5K^}k~k3Be?8mi6AVFG5*JNvuVngM<j`h%!Z>lXO-N2)Pi-mrTYVwqD~0(k^JUVj;3|m@>F1)^*#ot5tVjchz0J5VQY=luYnkZe%5^CAd$&tk^3?^YTx!-w*q)dARM`20KRB2e$(*Uxe|o<V~>b7_W>*pZkPoB$G!*kKnb%tv`dEyC=lR>rnEKmeIvt@i;An#sP}W&%X{_7mz5>RpS3|`~I6_A<fVeX{6UNB5MQTdqJ<pM@&jg)6Ex9O9KQH00008027XHQ$-)AsSW@D01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV@obdW?*1oVmQjLIs*VuO9KQH00008027XHQ?<>B<d^{f0FMFy07w7;0AyiwVJ~oDa&#>)FfcGMEjeR1IAdWqWGy)`V=yf=GdMXdVl-r8EihtZGiG8rH)1e1Wi4YcFfcB2Zeeh6c`k5aa&dKKbd6HcYTGarmEFYYSY8!m@Iw%!q~HmAPF*MTrEjIsQbrktF@_w;oj2J|C1s<O{7dQ2^tXD`xLdciuorRfIp^x=Ty%E$A;t(_;Ey5ppzyF4q-fkbZ!A|*knRcp)@X!K7;M40HAWq$7gi4HLU&Tujqbk5){-9d?GSc-b<{TtuD^dV^RKev*hPCTd8>FXl{wU!Ut<@0Xs<&$sx)>(_oR}QtnJ?v?vH+xBZLO7ANaw}c7Xf-$V2|c^LrnMp2UAN*d2I-J7;j~>}(&r5YZ1HJ_0{*0GW&YkdQ`SGQQvtf<5M8KcqHgQ<HAOA{HUC`;Wq0DF(&WQd*{)pRAIW3vM~NcW$V?UPwBl#_HwV(vTLVRKn2NryE-HYE3=M0{lk*YmT>Y8|nN%*tJ(#UjjUDYRk*ooGz83GkAeqy#m(I@Lb6a5Yndz;I&$PULkVa27FiPVY`uG%;Jc}!*RYWl^90FR1}j$fC%ztPuM}mZqsllXMzqllMsy6_Yd<(_KFZVB-m_{S3b115ir?1QY-O00;mRj&D;xA*7uW0000G0000@0001FVRT_HE^uLTbS*G2FfcGJIb%0CV_`RBEjch_FfBAQI5{n1G-P5eFk)jfW@0%vVlX#lEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?WkIP+aETqkIbEdWqU0Rj{Q6aWAK2mlj~Z&R2h55?vH003_R000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;mRj&D;%AE&7f0000C0000O00000000000001_ffoS)0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j00008027XHQ?<>B<d^{f0FMFy07w7;00000000000HlFi0RRAGVRT_HaA9(EEif=JFfc7SV>dWsVK-zgIWS`|Ei^MYIW1x|WMVBaVq-IAVmUWrFgIl_V=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;mRj&D;xA*7uW0000G0000@00000000000001_fms6p0AyiwVJ|LlVRCdWFfcGMFfBP_H#lQqH)Jh2Fk>(+G&49kEn+leVl6OYV>4!AIX7Z3H)SnjFfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve007nl00000"}}, "inputs": [{"name": "df", "node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "2059176b-8fd9-4324-9452-cee6193474ee", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "driver_relationship"}}, "inputs": [{"name": "df", "node_id": "a9dcc572-7c50-4e63-b592-6691f8df29b6", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "56815d9f-f395-494b-8f16-8ddd2138e87e", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['driver_relationship']=df['driver_relationship'].replace('n/a','na')"}, "inputs": [{"name": "df", "node_id": "2059176b-8fd9-4324-9452-cee6193474ee", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "6c7a9e1e-16d4-4dda-bfba-e87f9c00d798", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "incident_type"}}, "inputs": [{"name": "df", "node_id": "56815d9f-f395-494b-8f16-8ddd2138e87e", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "b286b138-9c3e-48c6-8e32-5188ca9b62d5", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "collision_type"}}, "inputs": [{"name": "df", "node_id": "6c7a9e1e-16d4-4dda-bfba-e87f9c00d798", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "a32f07c9-671e-4c88-ada9-606d0f4dac1d", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['collision_type']=df['collision_type'].replace('n/a','na')"}, "inputs": [{"name": "df", "node_id": "b286b138-9c3e-48c6-8e32-5188ca9b62d5", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "aaf76a75-ccbe-4e8a-bc3c-f74dd6f9ee5d", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "authorities_contacted"}}, "inputs": [{"name": "df", "node_id": "a32f07c9-671e-4c88-ada9-606d0f4dac1d", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "c2fb6219-969d-44ac-9551-4996073630a8", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "driver_relationship"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mtSpZ&RJqavbOY0040T000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YiPs1<}MDP3zpHnL#YEXKmmmVsRKwP86-ZWPI#oDeYRsK6mkWfM5I-2)pE<#yINo+)*gM<ibLYX4aQ93I}gj@*aQzqjNJFoE_X&*FNQHg9ErVK8MP1Chq-}l|RY1UmO7XAe(nc&&n%39Vda36kEGgpk}@RxFcTm9OMs?&FE!xiSi?Sah~5lW6F-v`T3Mi&PazSbx3+TzZiz%ITcV&vHPpRy)%Vl|<K%tz8>qL?e8afD*?<JH7p7ZL?x+4c1Qe~00_$3R+252TTv&k-3L3Ev8Ob>3o9VxDh40Z>Z=1QY-O00;o@kZ)7rpkAB~0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d9DY`rt08mQ<1QY-O00;o@kZ)5_8f)620RRA-0ssI=0001FVRT_HaA9(EEif=JFfc7QH8e3|WMN?~V=-YgEi^GQGA%h`WMeI4V>o7HGB-71H!w6UV=yo<E^}^SaBz7paA9(Bb!BvoQr&9XFcenQ#0_n36=d*55Tvw_3A;|cB<!YlqZg%&T@*?wax67wvYkruMk#rVJw%_OFR<6zagzRQWiSwjbH48!9ep3VIQ<e4L<}(_vImJrydXh`o?TY8m9#To<T^tLO@keXJCC^C+U*ussMY;xrB)cgQg{$I10BfLBTZ7dg){{F`g6}Nh52@4=L=Z~g3$@4`&y(@+f!r2J;B67CvDo+xpf<QS87?v()~@5-td(iBGkuz;0FiYfb{&Khy2U8_pXPYoBt>X?(g-VT5t}!pWf^B1F!;-MqSf|kw2oeGPhhT1!&-FIsCgcHm^$BiFwFFM33He*gMT3xvQn)ruxAP=|m=+pubuV%-t^~n=$K5J$G!xR4#R9+3f0pm7-{<H&TG#^uNrBExc9q^*`8VS6<!%Jg-V8^3tByTC*9vK(77)mQaXP%MCDM&k|tQYIVIr^lU5eLtzG6M+Om(LLLnc(;D0ygld|pahyq|(#b?+AJrsDk}QdnX*3xh$8o5}Nt#WM{2&0wkHPi}_?Eu`P)h>@6aWAK2mtSpZ&L$&q2dw%000mG002z@003lRbYU+paA9(EEif=JFfc7QH8e3|WMN?~V=-YgEi^GQGA%h`WMeI4V>o7HGB-71H!w6UV=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?c6`uZ&Mp&QTB0Z>Z;0u%!j000080Pm1*Q=QXt9OwW50C52T022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mtSpZ&Tr*UYrg9000aC000;O0000000000005+c836zQZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8?~rd(Q5tL7p#cB@n*sm;NB{r;0000000000q=8)l003lRbYU-WVRCdWFfcGMFfBJVG%;diVPP#}F<~?<G%+$VEjeOjV=ZK3IA&xrH#K56Ff=V=FfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mtSpZ&L$&q2dw%000mG002z@0000000000005+cWCH*IWMOn+FD`Ila&#>)FfcGMEjKkZF=Aw4VJ%}ZVKgl?F)}hOIbvjEEo5UjW@IuqHDWg~G%aH=FfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0O11w0000", "one_hot_encoder_model": "P)h>@6aWAK2mtSpZ&RNX%Zbwf003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFpZ9}PQyS9hIbz0=Ju2-N=mY#N>pkE6YI%M92#+VKKXnMRC#wy5l9S`SUi08_y4|_!9&ibH^i?N5)5awIRU@f;O&GUh0#7md?eY$fkxz`4sTXQ=73osB-3x&y6w8IZ8mk&uFTbcLC!NpO|ETat2Km=xUSSI^QiDor0XWQZz?$9X}i@E^&uQVlP`vfq#_UGRdRwMxq-a<3Yz2Z;)g3eLmWNs<2(3k5i%1iADPktvY37XpK(!w6BIAcH_C2*=-l_+zuJTgF3a5~P)h>@6aWAK2mtSpZ&OWF`nC=L000aC000;O003=ebYWy+bYU+paA9(EEif=JFfcA-a$`#_N@ieSU}ESzEixYfP)h>@6aWAK2mtSpZ&QH6$k2iT001xo002k;003lRbYU-WVRCdWFfcGMFfC*?Vqs-8GcYY;V`4WgG&3|bEn#9aV=XZ@Wo0sBGBGh@VP-93FfcGKb8ca9aCt6pVRCVGWpsT~U2EGg6jjvB>qGa@2N@as5Cjb^xWb+j*9m=S-%2SXY<nAHD6!O;$#x<sD`^|(&*%^A59v*lj)g)cxZIC(?mbs%bpACVh%nNP2nNR^UP#*>LTGJ$bogHcXa5X?-GK`hqg9sR;Ob(}PKEjN%T9jFN)U|tn0}fFC$Af`_*ULZ3++ABZ_~CetpfnT2Z(-@T2``l$j3n#gx%c`njLs3$hXNaq*pM(KKAeZhgNs@^c}512>Za8Rps8tD2OO+%rzHN0V4T}UjA1aTQ)WQkn)(vh#tK`*qP?w%;wT@)7<b%I#CEG=ws`Sxy4knA+yfRCyqr-l~Na$4S(FRT2!mj)*66VGyXA!F=7Gf&1&D**|G+bG?1S(wG(A+zs$8}LnXA8%ZSwwiLnMK43-gl)dY=~%b!a`FE@&hE7RGm*hzR2^Q3b)o|n4l#A;BeURp?{#(A!a6P0IKR%B^5Nb=t2G>ug+8yAD4APnIJpIQiw;IBLbP)h>@6aWAK2mtSpZ&R(dok0=+000mG002z@003lRbYU+paA9(EEif=JFfc7-HDY09G&3+QVq;=AEi^MUG%aCbGh;0=Hf3cpV=^%@V_{}3V=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?b}Nvml%_s_}<08mQ-0u%!j000080Pm1*Q=b&eiPHc80BHdL022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mtSpZ&OWF`nC=L000aC000;O0000000000005+c1OWg5ZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8?~rd(fWpYof&l;kFaiJoNB{r;0000000000q=85Q003lRbYU-WVRCdWFfcGMFfC*?Vqs-8GcYY;V`4WgG&3|bEn#9aV=XZ@Wo0sBGBGh@VP-93FfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mtSpZ&R(dok0=+000mG002z@0000000000005+cFarPpWMOn+FD`Ila&#>)FfcGMEo3!fVP!NkFfC$ZVmB=`Gc+_UVPZ35EipD_Win$jF)?FdW-VhdFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0IdT60000"}}, "inputs": [{"name": "df", "node_id": "aaf76a75-ccbe-4e8a-bc3c-f74dd6f9ee5d", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "3e1bcebf-1353-4ab5-97a4-7fc0c2c00579", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "incident_type"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mt+%Z&T$i^7Y~X003(N000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZHhtWzr_BCl=PbWwo>VjGR@No&<F#tx!X`FGqPAtBN2oqNum3rK@8!T_ER0t3n;i5QSi!WuCVrkP2+B-ZY*bBaER_g;~K0+O~A6WRd7eN#7m-#2YvwM_vl|3)0OciCKvQj{CCUv5)oZq_P?f0W|`b<|c}UCFM_mCU2vV>VyF&ne7l>rI9d*)U3fUDxh6)lO<fE{*9ND4&k#l%bvbK(kJrhypAHD4K9K>G_8I_DAFR)64F-mY!G<zh2h{n3&#Ya`OC0QU0?|%${Lc?!Ey~O9KQH000080R50}Q!UfbDGmSt01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV@obdW?*1oV)*(=?<oLKO9KQH000080R50}Q>WjezncL70GI**07w7;0AyiwVJ~oDa&#>)FfcGMEi*7OH!x&nGA%eUW;rc1HZ(OYIA&rtEjcnVW->TAW-&BnFfC&+FfcB2Zeeh6c`k5aa&dKKbd6G7Yuhjsm7T;*SzZ-n@Iw%!B#;PuPTeH*rEg;oqm(`j#u#!WwWhM2%F0Fw`49airT=1oYd2r(=U^}5-gC~?(YfgS>>|Plr&tWI2Ze{dAVIx`y_U+7W@;qA8D<4SC<<zD)_SP!9Ngy)I)rXkYMoVCSu~;9Ya38+#<A<`PpxXs&CQQ$b}e&`U3Bb{51OY^S7*lXJM3Z)9oI=6?y+0?Kx&!G!oH?(tNTQD5o)`B;0H&|0Jr?Ehy1V4@7)aT#V;D{w!QX)(|&Z0ny2qY^#j-g0z<F>os0YdA*H!ue9j>S+XP}irCA3jjhKZjM5O;lQQc|=#qB~`X3AeImzE1|Ir;4@slA&^I;E90i<zYZsxqlXMW^4EwBY%gdKL!wt^V&E-@&^~m;b@8y|Us8;8|H%o)y)Xh1PTmFOaK$fCV%>)p83A=sp5?tybSxh<x4!9OkC8+ejy3QOKfBFI{9>bV4-|Y7`5pR5~6jaiYdal87WuCee8GF^)quN>VZD`#}I3AHmws;ZObnP)h>@6aWAK2mt+%Z&P8AA|?_5000mG002z@003lRbYU+paA9(EEif=JFfc7MFfunVWMwifI5B29Ei^VXH7z)1Vm2)~GBIW{I5}o9G-WU?V=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?d{`EXu#j%P?X08mQ-0u%!j000080R50}Q{^u5_2K{k0BZpN022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mt+%Z&NMP&?yc8000aC000;O0000000000005+c69E7KZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8{g7`{r{AK#n*jg-m;wL*NB{r;0000000000q=8of003lRbYU-WVRCdWFfcGMFfB7MGB+?}Wil-|F=ja}G&VFfEjVUkHZ3_aF=jG2Ic705WiTycFfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mt+%Z&P8AA|?_5000mG002z@0000000000005+cSOWk6WMOn+FD`Ila&#>)FfcGMEi*7OH!x&nGA%eUW;rc1HZ(OYIA&rtEjcnVW->TAW-&BnFfC&+FfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0M!Ek0000", "one_hot_encoder_model": "P)h>@6aWAK2mt+%Z&PQfXcN@{003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFqMxzO9Md=hI@a;)R$G3wQ3dK~OMgWxEW!bLMcr*bfWC{ddQRLJA9;DPErUnVB;fho!&-F~15-Xig+b!u%$@6Ek6LnZ#3|`&jRSiI(z_hl394;;B^h38wAP4@#+FRQvtd!KME~xhyfTxffE%4f>C;k=$#G<?u?WjojV#+udwOH8a9}^had#8BQ3iqlfU0?oy<jV`j$Ym>#iIm}Yc^I#@z#^$;C1m3-1Beq9+`{f+fct|hE&pc$L1&;MFhbE9kP-=Apx&B40fd;(BQ0|XQR000O8{g7`{ZyYpT4gdfE3;+NC7ytkOZDn*}WMOn+FD`Ila&#>)FfcGME@N_IOD;-gU|?Wk5YAYc4ggR~0|XQR000O8{g7`{8Ca6KegOagFaiJoNB{r;WMOn+FK}UUbS*G2FfcGJIWjXbF*7zbEiySdWGysdWi~BhH#uW1IAu6tW-~A}GBGzWEn_e+FfMa$VQ_GHE^uLTadl;MeNs(JD=`#JW34W9;UYr@W>Eqbg)z*$spGV`(4B~gs5^N)(oUko%godyRjL;J8U8?jh_|+iBACGCew=gfxglqV=P^Nqkxoc3I4#l&XtRkBnp+QTeTm>~9AL0ZaKU0U%M$FL9Ie@rF#rGA;f*W>!DtuL^`UU`x;B%G;z3$y-$J`h+SG-003di1(G{s>DJzFO><50(Sq-4sW()b*ck(;w15B`oy(jOv(OKR9MXMjcJ}_ohdGavwLrQCN&BaK7NdB&u-$`SOx}wVok9dUW_78-;)f}ALv2@(j_q>!&<iZL1(s*QUGLo#%tTW@GV<A(8)VXE-t4CIea#s4b1|a5)Po^+NOd!2j?Q)(iDj-P%`C(l-QB?MHtTpQ^p{<;Ttb#}kG(cgn4B3Y!Xgr<%og#X?P`p!`_F~0$%;Sj1?XAJM(0Myjy<ByZTq-rlGL`SCEKSoqP10VRb^jzuq`K)K?```*05AC3K<Etq${SEi0|XQR000O8{g7`{uJsj95&!@I5C8xGO#lD@WMOn+FD`Ila&#>)FfcGMEjcnXF)=eXH7znZIb<y~VP!TgVmCQsEjVR3VP-QhG%_(aFfC&+FfcB2Zeeh6c`k5aa&dKKbS`6ZV@obdW?*1oVpx@u_3hk0D>DF4O928D0~7!N00;p6kZ)6Gsb~|`0001K0RR9K00000000000001_fdBvi0BvP-VPs)+VJ~oDa&#>)FfcGMP)h*<6ay3h000O8{g7`{ZyYpT4gdfE3;+NC7ytkO0000000000q=5zj003=ebYWy+bYU+paA9(EEif=JFfcA-a$`_S0Rj{Q6aWAK2mt+%Z&MjqlDd8Y001xo002k;0000000000005+cN&x@>WMOn+FK}UUbS*G2FfcGJIWjXbF*7zbEiySdWGysdWi~BhH#uW1IAu6tW-~A}GBGzWEn_e+FfMa$VQ_GHE^uLTadl;MP)h*<6ay3h000O8{g7`{uJsj95&!@I5C8xGO#lD@0000000000q=7C2003lRbYU+paA9(EEif=JFfc7SGBYtTGd49XGC4V9Ei_?eHZ5W|Ib$t2WjJAGGcYtVF*h(RV=yo<E^}^SaBz7paA9(Bb!BueV{&6qO9ci10000400aPd0RRB20{{R300"}}, "inputs": [{"name": "df", "node_id": "c2fb6219-969d-44ac-9551-4996073630a8", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "0c4c020a-537e-4426-9a4f-e5332c8c8180", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "collision_type"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2ml9>Z&QT&G<)O#003+O000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;d1PQx$|MfZHhtWzrxktW%eO&1kNAhyxSp0rl|Vr(Nym4C+#5`s!>ukJZ_<_ywcj4*)by}*F-Kq3a@qp(IygefzL=fv9Ww@%Rq@y;tUP(jj`VnQ2WxNV!Z@B6mDZ`SJy7XFnuYVV4<5w)mSXg}SmE?m~CgnyJ{*LMwd%^l*@mP+B#?y#88pk2sm<4s|tH%4EiV>$&Yy#qRBXzSk5EYc%UD6!A!he>3^SU{L7P&8pS>FJmK+Ca>niy|)PJ+dNxxoizEF}-DS^7KGaUXA?!%s77cV4iQj08mQ<1QY-O00;mFk#AE1sg-UH0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?dXHu=>808mQ<1QY-O00;mFk#AFn>*Q^i0RRAu0ssI=0001FVRT_HaA9(EEif=JFfc7QFgQ0jGGQ?-W-(@EEi^eZFfCy*H)bt2IX7Z7WHw`EV`E`0V=yo<E^}^SaBz7paA9(Bb!BvolFv&cF%-wsaas%Z;vppib0~p|!Wd?pI<{pm!XCuKB8vw>MKVpIgPCbf(p^OQzwB}Ull`~$O<UI=iy(v~@B4i($>))?<L?PZ_zGV|xDA^YZiN}z^=%;}X(}_V972eN8!&JDMt);6Dz{J*qNUf^{%RyQKE!o}2W!Q#7wD6YopE!2V<&gA1O**>WJ~j0YI|%9f5aYcp+leey0C6#Z%QpoS-JNX?(|>GK0;kD2!n8^9pX;VZ=vAg^|{NU=j1m5_PXuvuLg{r_LujFK?v*!*YB@$JQPHPOwBFhGY$l73B!I!V~c4;nkkD}jL6=98TLUl*gVXoV`lo3mC|v+9VdSpPt-lmBpp-h%zWZ#L{%ZRuylO+L@Qp_)+-S3SNZ?y_*!1`oPWZt?<%TWkS9RNi^_hRYfZ=S0=f7=tRUdImMbBm&j@hWVsX7d<a7<VUz*;!BfW$rF-v;8`Ml7g7psv_gH%YR^5Ia4BQ?yjOk`;`N``}jG>z3D%f)Cf2t(+23~fJyfA|+rO9KQH0000800)t8Q+N`K#1a4i01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfBJQI5#*lVKFUcF=k~gG&wUcEnzV?W-T{4H)1qoHe+RDV__|0FfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}Bh(=`0rRaW@M9P)h*<6ay3h000O82a#`6g!(jl<NyEwYykiO6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j0000800)t8Qv#`#ZVmtd01N;C02lxO00000000000HlEx0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;mFk#AFn>*Q^i0RRAu0ssI=00000000000001_fms0n0AyiwVJ~oDa&#>)FfcGMEjKVYH#jn3F)d~>W@RljIWsUVVKFymEjKwgVl-qnV`XDwVJ%}YFfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j0000800)t8Q+N`K#1a4i01yBG08Ib@00000000000HlFe0{{SIVRT_HE^uLTbS*G2FfcGJH!wIiI5J@|EoL!hWi2#0GcYY-F*jx{H#s+AG-NhoWn*JuEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(a)B^wj000", "one_hot_encoder_model": "P)h>@6aWAK2ml9>Z&M|j)%Vl@003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*PQySDMSDJDYG<{?*urijq#zMNMY|e#$HZb^nw_N}%D=M-0wqP7DX#84XAUqq%o#?A{Gl-6v?rAl@~84n&4f~z)N7z^l&3I7OT4ScK?AaQ%7Q+@c;C1EFbw_Bbr0(X&ii-byu?t+t<p-b(0_-OuDi0>8eSm{bvS8yw7Ruzt<*jG9ai!ImXL~UqN^JrpUf0b@>t|QHh;M};VsAXf>~gk(Pe62kHI>64qxcbW@3&E$J(SBi<7qt*(LV&hsON&u57~Pw(NNK15ir?1QY-O00;mFk#AGa;ufV20000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?eSesz{A08mQ<1QY-O00;mFk#AFVmIq>f0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7OGC5;3Win(fH)A$tEi^beGc7q`W;rc6GB#p2Wid1{IWsaXV=yo<E^}^SaBz7paA9(Bb!BvYQcX)MF%(T>?Q@|E7a1}zixRLXjA7<Yoi{BmbSEMr>P|$`PNKuh%+w@Rsuuhi{y=|-x3-ERn84+JoOAEFAt(E%F+qfpPDn5~Ez$~Ta~UBtw;tO362V#D!(f--g2iZ-CD=bYShXWz{$JVQjVuMhXb01^p>XoLHj}gBL0V|vLOV^`)P;2bAb11OKT^w5Rt|aC5B#9B5<s(!7V@+2<ag2sm|zckPu_E*v$Fe(RzHA!V9c!Y<YDB8l-A~&i;)14{9P}<lg1WxMVAvE@d(ka9|(J^IXJgt>A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjM<A}%Y&B3_Pc{@_QTy>LNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&<XsNH&9Ch1QY-O00;mFk#AEESLvk^0000G0000@0001FVRT_HE^uLTbS*G2FfcGJH8MG4G-WbmEjME}W-T;0IWsLeVP-ikIWjh4H)Sz2F*!3bEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wkc$%7c_S`=!GXPLa0Rj{Q6aWAK2ml9>Z&M|j)%Vl@003zL000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;mFk#AGa;ufV20000C0000O00000000000001_fdv5o0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j0000800)t8Q+1XHVtxSt05Ado07w7;00000000000HlFQ0RRAGVRT_HaA9(EEif=JFfc7OGC5;3Win(fH)A$tEi^beGc7q`W;rc6GB#p2Wid1{IWsaXV=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;mFk#AEESLvk^0000G0000@00000000000001_fh_|70AyiwVJ|LlVRCdWFfcGMFfBDQIb$?sGGr|`V>V_jG&ngkEjeLkIW0LdHexqrF*GqbGcqk>FfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve0062300000"}}, "inputs": [{"name": "df", "node_id": "3e1bcebf-1353-4ab5-97a4-7fc0c2c00579", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "ec797d62-f14d-484c-bed6-769031cf10a7", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "authorities_contacted"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "authorities_contacted"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mooDZ&Q01*?i^z0046V000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*P6IIzMSDKO^~Q48yht0G6cI=u+SOpsX07<ec#H_q{yS!okdR1w=bU@zBD8~?#a2W*N=TpwlsN&Nq<6AM;6f;$m~1%igTW7^Lp10_BiMS(3_gkNeATX%S}fYldU@A~$vr0Hip{VN+s0{{AZY{>J@8h_@}a4sZGDsxAWJH(G{sa_yg4Di-?6&;azl9V)Uug@q68d-xLZDkt|!x^%xj5qp8RM!8qK9cW~3>2{DPW5e9oGjvud#T>q@=ez@A5C+L9;((^fC)q29Wf<d3}G3$TUktmWF$D1!uO9KQH000080BM_VQ>&~p6Al0X01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV@obdW?*1oVyL+(Q4Ii4O9KQH000080BM_VQ%M$|HJ||i0HFc^07w7;0AyiwVJ~oDa&#>)FfcGMEo5P1W;i%xIW1*nHfAj}V=-hcVqs%BEipM_IXE;mW@R)mG%aH=FfcB2Zeeh6c`k5aa&dKKbd6F?Ya1~Tl{fCjA->9JVGl;&kbo<8yUxZoz2sJUD53OFD5Yq3N9j_n){(R=A^6|)&>vBH?ceG+cGC0<dXZ+{d!x}C=HmQiN(t4}Oz8lFkcM$ij>A=}>k`<3Y>G}xQvx9*kN5E0`$GD?{m;!I2{GI1`WTtkzyaR(6&q017HAOZXG6D=_QN|@zK2HAfSd;GP|E_eJGWL|(}0HLw9opwc799mgN6o-|DB@4$xm`Z$S8>7C_Wm*bQn!S61{%-?rqVX_?LtRqrvEFZ}hEqG<f=V$5D&}A=na|pfVuQl(E)c3b~T#L)=;tAHll1HSAF)5|I#g@`u7*X@S922VU6rlW4%pN_xq@_inhqUIAZn=WSPdKIN(gT{*sdbHj~nHq-+<$Zz%cb#e#qT)p}acGIgimk5`w@v=7VMW;1i;tyc`571yD3k_Rf%I_>-ul4%vny}|PgO3|K-c4kjiZl`F__*k5U5yj9sMIX0K&fIrSJj!C=XqY`S-wc;vnN@Ws99cAi<2mhH_L`&zrf}E0Z>Z=1QY-O00;nSn{QLcJKwkx0000G0000@0001FVRT_HE^uLTbS*G2FfcGJWMN}wI5=cEEoEgkW-T;hF=Q=bVPiQhF*#y6I5ah8Wi&7}En_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wk*d*1wPEkO!0sv4;0Rj{Q6aWAK2mooDZ&Q01*?i^z0046V000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;nSn{QLAtTGc00000C0000O00000000000001_ffoS)0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j000080BM_VQ%M$|HJ||i0HFc^07w7;00000000000HlFi0RRAGVRT_HaA9(EEif=JFfc7-VPj@EIAl33Wo0&IEi_{>WG!N0V>vA`Ibu0DG&N>rG%z$RV=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;nSn{QLcJKwkx0000G0000@00000000000001_fnfsx0AyiwVJ|LlVRCdWFfcGMFfC+ZV`exwWH~KmWj1ClG-EMjEn;C~IV~|cVmUZ8HD+ZrFf=V=FfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve007<t00000", "one_hot_encoder_model": "P)h>@6aWAK2mooDZ&N~3HXYOe003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFoloHN(3<wMfZM2t$RYWTWxzQf`Wst+=e2lo@vaBO5(zy|J`y#Q9<0);&RWq)B&u=Qec9_JB1nTo>a-m&nkE|6V}G09wObQdJ}DO#9KKYI;cyaQZZzhPA~dPt%so>ug=B}j{FmGSyB{pue8z|46kvcg=>qG@Q2diNCTZ)r)ND%MR*Kb6!QU=m|MM0UYJTD8=Kz8_ZXe0yZD0ss3YdYk~>cM9!rI3Mqj9d)an;+J_Sa@DqWhf`TF>6?0;?L_RISBC&qmAU|mn%08mQ<1QY-O00;nSn{QK=wVr+s0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d9H4AnE08mQ<1QY-O00;nSn{QKfmIq>f0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7<W@I*FGBaZ>G&f-~Ei_|cF)d;?VlXW+Vr67yH#uWDVrDQcV=yo<E^}^SaBz7paA9(Bb!BvYQcX)MF%(T>?Q@|E7a1}zixRLXjA7<Yoi{BmbSEMr>P|$`PNKuh%+w@Rsuuhi{y=|-x3-ERn84+JoOAEFAt(E%F+qfpPDn5~Ez$~Ta~UBtw;tO362V#D!(f--g2iZ-CD=bYShXWz{$JVQjVuMhXb01^p>XoLHj}gBL0V|vLOV^`)P;2bAb11OKT^w5Rt|aC5B#9B5<s(!7V@+2<ag2sm|zckPu_E*v$Fe(RzHA!V9c!Y<YDB8l-A~&i;)14{9P}<lg1WxMVAvE@d(ka9|(J^IXJgt>A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjM<A}%Y&B3_Pc{@_QTy>LNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&<XsNH&9Ch1QY-O00;nSn{QJOSLvk^0000G0000@0001FVRT_HE^uLTbS*G2FfcGJW@cnIV=^;iEi^Y_GA%S?VKFUYHexU>F=Az8Wj8rvIbvooEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wkc$%7c_S`=!GXPLa0Rj{Q6aWAK2mooDZ&N~3HXYOe003zL000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;nSn{QK=wVr+s0000C0000O00000000000001_fdv5o0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j000080BM_VQ+1XHVtxSt05Ado07w7;00000000000HlFQ0RRAGVRT_HaA9(EEif=JFfc7<W@I*FGBaZ>G&f-~Ei_|cF)d;?VlXW+Vr67yH#uWDVrDQcV=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;nSn{QJOSLvk^0000G0000@00000000000001_fh_|70AyiwVJ|LlVRCdWFfcGMFfC?gWHw_mGh;0@H(@d@G-F{gEn+reFfB1+Wn^VHIb%6uW-u*dFfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve0062300000"}}, "inputs": [{"name": "df", "node_id": "0c4c020a-537e-4426-9a4f-e5332c8c8180", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "f01b1173-c983-428b-942f-e9d85a417fcf", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['vehicle_claim']=df['vehicle_claim'].round(2)"}, "inputs": [{"name": "df", "node_id": "ec797d62-f14d-484c-bed6-769031cf10a7", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "83855f44-84ce-4cb5-ac34-ac4242257444", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['total_claim_amount']=df['total_claim_amount'].round(2)"}, "inputs": [{"name": "df", "node_id": "f01b1173-c983-428b-942f-e9d85a417fcf", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "8e9baec0-858f-4eb1-b79c-1960609c48e6", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "policy_state"}}, "inputs": [{"name": "df", "node_id": "a2370312-2ab4-43a3-ae7d-ba5a17057d0a", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "8e00b503-e7ab-426b-be39-80a20f24785c", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "policy_state"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mmdyZ&PUaA<p0c003$M000vJ003=ebYWy+bYU-WVRCdWFfcGMFnx~AO2a@DM)y9(tkWSeCQ7%u>7ultxDB^6H|@~*^UjPIOWxgU1VKT!ALksti%=F)5^E9YC?SG6p-d6zES;4jLN0{zF_ZDfjo0{wv<(`qs6;jnQwA5sy6M~7y18lk{;q8*vG~tO$pp{kR@Sm!fqVC>n!RE)k3W(_cdZWk02<n^V}yNhTj26VoWN^~8~+M+5zmN`kx+6h`94^lGP*dZ@OjGk?==Q%)HLwLjAE{Y#u19m`+wR<6zEs{@1w);#j%hUXh$09X^P0&Ncd9FtLXug67ziZ0Z>Z=1QY-O00;mrv2Ro0x?aT&0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?c6r}!fS08mQ<1QY-O00;mrv2Rn!%@Qb`0RRAm0ssI=0001FVRT_HaA9(EEif=JFfc7MH85p1V=^%<Gh#S0Ei^STH7#N@F*GeQH)SwpGh<^lVrFG6V=yo<E^}^SaBz7paA9(Bb!BvoQbAAKFcfx735ezD%8_!2EER#)S;<new%Uao;(*Y!!z6?@r?y!%Cn>f|1*m^yryciC_9OT+IJ7_qnzYN)d*AmxzxO@)^zc)P0Fvli9|bP)kssw?56&5e>;?dkMH?<|{NCw!TXA&<p{pj$^TB-6@30#$7&N-Y>Z>-E9Klc>1!l_hkFRF(T~rJ~Z~*Ze#R{R!q1NmgLF9vj4(_PZ*d@Isl&C~)Us5ROJ(E2Ex-g8wXm>k8LD=&__+|awho(pIUj>5Qp!>_~{`7XYKd$);Bj*{QcJ-EqAnaq@=nKlG%q>qJe9$YQP1)3VGouMj0N#J4FqewD<Z>n~)y<q%!ZOY*!+*S6Vy~xyjES*&HnF5nWGNIkWPEl@YF4$>ng_=(^-q0l1+T1p{tvdDRn`{{o;0;(Wo=GoN|CYKKrH?N)~;ZM5=)>@9xb?8i^cf@;Nz9Sca@G;7l~7vCNz!ridm_6oX8QEgNzF)i{Vi64|162InT3vlnw{)vn-K=yx^n#FpAv8-@3D(y8m|%P)h>@6aWAK2mmdyZ&UDH7Euxa000mG002z@003lRbYU+paA9(EEif=JFfc7MH85p1V=^%<Gh#S0Ei^STH7#N@F*GeQH)SwpGh<^lVrFG6V=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?ccbE;`Gk307l08mQ-0u%!j0000804=d^Q)u`h&fow50BQjM022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mmdyZ&Tp9Ud0Xo000aC000;O0000000000005+c5di=IZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8EwOJ?$;}cdodEy<gaQBnNB{r;0000000000q=8id003lRbYU-WVRCdWFfcGMFfB7RFl9DlGBGVPVmLA_G&M0bEn+h<G%YeWWiVzlV`DX9W@Rm7FfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mmdyZ&UDH7Euxa000mG002z@0000000000005+cSOWk6WMOn+FD`Ila&#>)FfcGMEi*MRWj13nF)cG<I5I6XH8C|UVly!`EiyM{FlIAjV>M!CWi4YcFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0M!Ek0000", "one_hot_encoder_model": "P)h>@6aWAK2mmdyZ&Ty@0j<;k003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*P6IIzMSDJD>W$@1U`e(`k%B}76|FU_XOk6vX*{MN%D>|T@s!fI7~OM@4=^~)8G4AkQ<!krlgbJCRe7gILMcq@G0-N;It<YgZ|mcr0hv5yL7$*+yTz?;uiLI$E$0oK`Dfxh#Zb))rIntce-ATVxiZ;${G&9aX*J%wpq($VPAZRni`9I9DWoEI(N#vsCo{xT+;U70m<9R~U8V+Nk?)=P8k$pYkHI>+4{zwsWnzvD$LgRFi<75cu^+8)`DeZSV`Dydmt)7P4^T@31QY-O00;mrv2Rm`t;dB90000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?cc%lP{-08mQ<1QY-O00;mrv2RoC3|$C-0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7JG-hTwGc#l@GGb&mEi^DOGA%ebG%_t=Gc-6bVK*~4H!x)_V=yo<E^}^SaBz7paA9(Bb!BvYQe8_UF%(T>xAs9Fe8`Z2c_;yk!Wd?pIy)`<pl>1~qP{I6X(w?9Gcz?wSE?=S&+rHOL%g+hQ3MmX+>dkaJvZe1=prVFFwzYP2B$?@0d1}$gqGGrJ0BuATZb6z8eFg#EwTjrr^g$1EX>`F9sQA|AQ&BBx;YX~-qvP%Sv*P$?OW)eNt?Q`4gdsiBf2iNEM?`8hyB10y6pg(ZMTr0eI~z=-oXTW*qeF(8r}A<Z?yUW>;q#Km6?Z;A5vPITQ0@|MDjPi{I4{&s4My-;SrAz-Ti{F_nL!qKaq}``hl0yiCj2ApBqoiO~;ZAn000{ax7%3kUF<)@cW5XqFj_dtpSK7<3CdvBc_mEt#-A{78Q`Bf&8ehoG2=LHqn|5l+aerLsmf~h8mzSScdFf6EvRBujh!KtQ7B+rn6eH6Z1IYac5^ZDRkb6R6kd}B$rAJvrOebRhFh{o+fEO&U*Vv5~*G~%=^235Wow*G!Qz6zw!!DO9KQH0000804=d^Qw$Vgs}cYJ01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfA}NW@b4vGh{6?Vq`ZhG%ztTEjTzdGA&^<G&nF}H#0alFl8-cFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}C6FFW-IcpOqN^P)h*<6ay3h000O8EwOJ?<NE=v)BpegX#oHL6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j0000804=d^Q--a_g$@7!01N;C02lxO00000000000HlEh0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;mrv2RoC3|$C-0RR9n0ssI=00000000000001_fk^=X0AyiwVJ~oDa&#>)FfcGMEig1@W;ru6WGymcWH&7|FflSMI5;#iEnzb>I51&1GdMReWi4YcFfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j0000804=d^Qw$Vgs}cYJ01yBG08Ib@00000000000HlF00{{SIVRT_HE^uLTbS*G2FfcGJFf?XnIWse4Eiz(cH!U<UF)}SUI5aXXVKX#1Fkv?{I5#k5En_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(atOEc5000"}}, "inputs": [{"name": "df", "node_id": "8e9baec0-858f-4eb1-b79c-1960609c48e6", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "abc12bbb-4fbe-4bb0-931e-b15caa5cf244", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "policy_liability"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2mnB_Z&R+Dz<A^U003?Q000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZHhtW(Qrfzoba(?tamh;1~nC#_Y#7~6<Y<=-(uLPDb3JNKMBXHXVXf-M9-2#lzXBvM2^31`JXm<yA5&TRZ)=QVv0?}H`_4P@geWpoj??d@tU+v}y28z~!@`B&m(g0JS4kU}odefov0Trs+af0SK|%bRt7*AJVaua(N9+haAKA(R|Tz7MuSnOy8t_+B^uH;us>HFnluZ>-5<y&lnPOFRFLc3wLXMVK1UG+?p$>4N|AQ(gDdtL`+H9$6E=jFBofFufJ>YJ8xi{#hsHz%Wf$Ur<W}1QY-O00;m;v2Rn*POxbX0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?dXI52x108mQ<1QY-O00;m;v2Rm@4e0Hj0RRA)0ssI=0001FVRT_HaA9(EEif=JFfc7PGGs9^VmUA^He)b3Ei^YXGA&^-G%zhTVPj-wF)(I0GcsZ=V=yo<E^}^SaBz7paA9(Bb!BvoQcX)MF%(V5eu>?tWMCF0P*LbhGo8-Z7B}uhL_}N&kH;g^$>?Zi+D=kMr2ogY_#gaR-hTM4;3nL2&ONy~mmKe%xESF}e9^*HD9X6(1Zb<|@piyT#pS%+;T$34J8N*R-J@cma+q$Q5}=OX_W8VU{kB(H-(8q0bG)=+%U1g}Jr?=Ze?7dAslXQUEwZ6RD3#vJb8(9;Tt<G86jh?lQm#lPQ<<6160SAh#3n*@%XVyMv+CfQ-7F*f?ESq9(@XOQfUSD9{#d9#6gI27-(<BN*aL!jFag>uWVZ+z<>yR{1%zOWJnUNLIvHi8<T1`TBHLd&^h7ZzCR1rxKDuG4G$Im4kf*{OHMe6)2UMGUIyAIJ<3y@R)4|_6nu&BSy~_gnvj5d5R`e>;=`YmztRy=JdN|6ANHTpiRf-Pa1#<QSnE?=?l1rjRUn0P)+3ep8k%LvhKWW}rZKUBcmovAq6;2ZsHF(^Q;;t9TI1YQgINFJOK@dcN7xdj;x8r#{?gnAh-?kkGHtxaNkKs=~15ir?1QY-O00;m;v2RmTMf`vg0000G0000@0001FVRT_HE^uLTbS*G2FfcGJHZo)}F=9C|EjD8?IW06dGcqk<F*GnOHeq9AW-%~kI5RS0En_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wka9{W^HLWMs4**b00Rj{Q6aWAK2mnB_Z&R+Dz<A^U003?Q000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;m;v2Rn*POxbX0000C0000O00000000000001_ffWG&0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j0000806?*CQ-clY?VbSu0GI**07w7;00000000000HlFg0RRAGVRT_HaA9(EEif=JFfc7PGGs9^VmUA^He)b3Ei^YXGA&^-G%zhTVPj-wF)(I0GcsZ=V=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;m;v2RmTMf`vg0000G0000@00000000000001_fn5Ut0AyiwVJ|LlVRCdWFfcGMFfBGRWHB*fIWR3YV=y@_G&eIcEnzV<FfBG=V`OGAFlIP2GGZ-bFfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve007zp00000"}}, "inputs": [{"name": "df", "node_id": "8e00b503-e7ab-426b-be39-80a20f24785c", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "e6bf082b-005a-41d8-9a5b-a48c414bf230", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "customer_gender"}}, "inputs": [{"name": "df", "node_id": "abc12bbb-4fbe-4bb0-931e-b15caa5cf244", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "8594c20b-5214-4524-b9ed-7368d4f279f8", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "customer_gender"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mofWZ&MJsG`Zja003<P000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YgO9L?!h4=o9tfwLKpp|ZQ(?wBG+z!{|&NMVHZxXSV{&%k<2!h=n&i9>j7TQ71Vk;ue5)$YG%AA0X(mOdJlu9U{3fVC4g24}@eKhDqBZ~Ey3-~0q?YiHncIY-;KXi>)_!p#XV_<VHl~gP6pJAn#tJZM%M>*b3YaCuD)v|{x%!A(po6jQFl4`k&j-i4+jXHj>A0QZqyYLQf@f{Hp$2R<yb<HJume0s5Brc6GI6-yg@rwTXr*Zu0@j_lo55&snQ$ofj!ncY+pYAd1f36`><oV0Z7f?$B1QY-O00;nPv2RmV_56np0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d{X|h8A08mQ<1QY-O00;nPv2RoAYW8WL0RRA!0ssI=0001FVRT_HaA9(EEif=JFfc7JI503{VPrKeF*Z13Ei`6hVJ$g0Wo9j5I5aajG+{MkIWS`_V=yo<E^}^SaBz7paA9(Bb!BvoQbB9mFc?+b#5HR!J1EHDLl8}AArtnTdQRA-x6#8WWrx8wh8#<csVt|G+)_&Z&e(Z>Vt;F2nsn>d4R#UV_ukX@^geWUa-Lv>7x?2C2T=Ifk1}-VxB9ZV(ZrXc=1L%hCeaR@J6}<^v)irrdcpph@Bx$J`LleZ??KzRg)rC))tkOua&!6F&KEEQd+69Bdx{rA*%M><E%vaFj=Q9*D(g1%u27;D+Wkr4{^%DuLTKQHQ5YQr5$=a0ABFE9&%M9&ApWhv-XIu!>kPhh4uaQzNf}140R(>F05T7SV?r8p!T6Fx2zH-`T?=EYMw4F3ViqHE^h9B=6ocZb6^@zahSkFHk~>bmckZaWT?#s*)|qzh=$Oh%sM6BehdZiyy`~;z0e++Zd5>@5ZKQYq!LGe3eF5;i(T-QzzHODFGkAem{R7m{@Ir|VFs2U?z-zVov_j-`8}LPKhTDw{6PCm*86FmGrOII}r=^^vrI517b6LKUd6s2mmS)o=pS(=dSWdE{oF0Wy1RGCb?Pu^Ce*jQR0|XQR000O8X0dNm%TJaX5&!@I5C8xGO#lD@WMOn+FD`Ila&#>)FfcGMEigDRFk@k4H7zkVIAbj|W@BM3IXGozEnzq`GdMJ1HDft2V=ZGaFfcB2Zeeh6c`k5aa&dKKbS`6ZV@obdW?*1oVlcj=)85||JRbm1O928D0~7!N00;nPv2RllxHP%o0001O0RR9K00000000000001_fdBvi0BvP-VPs)+VJ~oDa&#>)FfcGMP)h*<6ay3h000O8X0dNmRrUOb4gdfE3;+NC7ytkO0000000000q=69u003=ebYWy+bYU+paA9(EEif=JFfcA-a$`_S0Rj{Q6aWAK2mofWZ&T`O_GzC1005E#002k;0000000000005+cRRI71WMOn+FK}UUbS*G2FfcGJFgP$UV_{@9EipDYV=XjhV__{hIAvxnVK_81I5c53V>vKmEn_e+FfMa$VQ_GHE^uLTadl;MP)h*<6ay3h000O8X0dNm%TJaX5&!@I5C8xGO#lD@0000000000q=8xk003lRbYU+paA9(EEif=JFfc7JI503{VPrKeF*Z13Ei`6hVJ$g0Wo9j5I5aajG+{MkIWS`_V=yo<E^}^SaBz7paA9(Bb!BueV{&6qO9ci10000400aPd0RRBk0{{R300", "one_hot_encoder_model": "P)h>@6aWAK2mofWZ&NPv-38VF003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFpZB*O9Md+hVTB2oVUYlx3+e#A}Cn&<aNl-X6@p9k;!<l?0<KxC<qE(a(H>(C&>w{!<?aq$j=HB+9Ro)kYAK{Y9f@vq#gp@M!5+#j>LyL95f)ar!43b^zF9W>2|Z;tvCB^1Lyt|ah_wS=1ys)SLoluO4nYQ#~S`nnxVTK_LgjWg+oPZAN>KV`2-`!bdOn}pU`D$U=FFsb#zrC<dd=S<C-Ie(a~LaMR)cSbL5)GBHy_A;*Ex-fsrOGP9A?<CS2%$*1tb7=97bExp)UqO9KQH000080A{goQ+!pNy$%2X01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV@obdW?*1oVqm@dNe=)}O9KQH000080A{goQ+1XHVtxSt05Ado07w7;0AyiwVJ~oDa&#>)FfcGMEio`+F)?IgG%aLeH#jXcI5IgcVly=`EiyA<V>vW6Gh#4eGA&~;FfcB2Zeeh6c`k5aa&dKKbbV4yODi!HO=InIp$iuoGBAr0uqcdS=1rY9EiQB?A|mQeMAA;8!^_OnBvq;w{2Bg0e~7oXiXxc6<$j!V@3|o-`=>EMgpp22FgPvJ3TSf~AvCuh+WZp1S>MB8m*9fMXqF||KRQ^oBVqnu+2M^W1;J<s)3u>+^13#Yv*JNoXx~CRP1@9jbpRlE1JOTH%TiVjdDsv9ptBM{vyB$=v+v}0(g&De4|`AEbEC7e`-@gTfPG-htn%bx<cE~j=9-I<0FnG%FTaz<7Ij6J6CUvh(XAf{d#gD(w`1wJsqc9yoydg~^ri90++-wKpIK+dL&rj<3aN9;`WKI^66LJ)Z4E%o8J|pHjF>=rvD*1OTU0=j2J*wYa-yj0@mOotS3+Ak4Os<|7-)dPU>ULxP0)Bcy__O?xKO-Zn)YJFcFg05$L-C*xX^h!QoUStlUyn_$TF4xQ(2m(d77lXIP3mRl1O#aLEhW)g8*LewSmwH{FOIQO9KQH000080A{goQx8|^r4j%D01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfB1KVlgpfV>B&fVmCM~G&nLjEn+h@FfB4OVPiQoH8Wx`VlpjbFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}AWhnt1lyKPxi;P)h*<6ay3h000O8X0dNmF7n+4)&KwiX#oHL6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j000080A{goQ+!pNy$%2X01N;C02lxO00000000000HlEj0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;nPv2Rm#mIq>f0RR9n0ssI=00000000000001_flC1Z0AyiwVJ~oDa&#>)FfcGMEio`+F)?IgG%aLeH#jXcI5IgcVly=`EiyA<V>vW6Gh#4eGA&~;FfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j000080A{goQx8|^r4j%D01yBG08Ib@00000000000HlF00{{SIVRT_HE^uLTbS*G2FfcGJF)(5=F=S&jEo5RhI4v|dGC3_`Gc_<RGBaUgIW#phVlZMdEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(atOEc5000"}}, "inputs": [{"name": "df", "node_id": "e6bf082b-005a-41d8-9a5b-a48c414bf230", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "177ac227-1e30-41cb-9bb1-77beb3736d82", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "customer_education"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2mpGqZ&Uc&mo4M~003|S000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZG$*Qphgs)24{(?tamh;1~rC#_Y#7~6<a<=>eG2?>dA@7#0loP~CfvsjBrdkG2j0cB1=N9mm$5lSVLFNJK_Z-T*3q-`|lL?epzm<#wM)@s%Dty*@f>z8+pnE5xPY-3<^CtKMrz<-5B%UrdF!#~PF!|uLcq1u7E<qGrQx4`DJh_$3z9-?EYpihI2m-Pb#<8Tu`z|FoRV&d3_-?FZ`M9*?GRYSS_GBv{B2-TJ6D+|}(8it?VKifzYiSs<ePh^l@PYL-l624aq`t*odi)p&~22e`_1QY-O00;njv2RncGU1L60000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d<oqqcV08mQ<1QY-O00;njv2Rm}x&v>-0RRBv0ssI=0001FVRT_HaA9(EEif=JFfc7*V>mcwGC4FYF*#&0Ei__eGc7nYW-u)_Fl1vnHf1tnGGjI^V=yo<E^}^SaBz7paA9(Bb!BvoQo(B5KoDJRT*r+Kp@%gR=pY0R3Al?CIjZf3f=dsfhZ5Qz3ZaCRc4dpMBxiS>5Q6_F^dm}d{hfY7N3qkoDW$!LnfK<+=sosy@-`%he6h$^t50^IY?7uMqy6?tKAm~*(pgnWdw0SwS(ys%y%JL;1VZTC-GcAdEpAxb-(M7Gs0Hmymd=E%R2%9I{Vuc*HaxB&t*bO=Mm({eAA{`5+t3i~$ditqGxh0Q!`6ZkjE=C~mMjrcPn2TU1d}E@YS;~#YqO%ag%pJ-%|nWGdcVjXLS5{*j=R@#Nyq6mk@I15@15v=__v1O?oRg>cR%C3*6}}6a$Lw0!F9L?B8;5AZC7ea*}M*l-npZfLg~CJ?MD#}Xn^d4KPL2rq)=QegrTasqJ=PwGsEm}){Sqj=fWTR+Necp{Jx*%LUQeo&u;vZ6*aZt4EWXjPaRvsYlH9qgRO7n<rLtwDh<m^{bnJhKZZXL%fEmnG%S%~1@!%U3vg?>{J2E+>$SnBh4R*kcp(h~8hZQ5BA46?vJuY)5f@pO42KzimJQ=L=5Z8{!r|aBih^tqCwz3^xNiNrAof#u-9G?OO9KQH000080D7@+Q;)=OZxR3i01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfC$ZI5=i9IW#RXIb<;{G-6~kEjTo0FfBGPWMertWin$jV>T^gFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}Bh3u-ojbg9;Y_P)h*<6ay3h000O8da-X)_}iB)<NyEwZ~*`S6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j000080D7@+Q?WAPjt&3-01N;C02lxO00000000000HlEx0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;njv2Rm}x&v>-0RRBv0ssI=00000000000001_fms0n0AyiwVJ~oDa&#>)FfcGMEn;IhIA$_AG%YbXWHBu?Vq`NdI5cK3EjBP@V>vcuGGj7hHZ5Z?FfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j000080D7@+Q;)=OZxR3i01yBG08Ib@00000000000HlF~0{{SIVRT_HE^uLTbS*G2FfcGJVq-WsW->W6EipM{F)cJ=WHT)|G-fa@HZWvkIW}c7V=`klEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(a00aO4000"}}, "inputs": [{"name": "df", "node_id": "8594c20b-5214-4524-b9ed-7368d4f279f8", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "9297ae66-97e9-4738-9cee-5a02ddea44be", "type": "TRANSFORM", "operator": "sagemaker.spark.manage_columns_0.1", "parameters": {"operator": "Drop column", "drop_column_parameters": {"column_to_drop": "customer_zip"}}, "inputs": [{"name": "df", "node_id": "177ac227-1e30-41cb-9bb1-77beb3736d82", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "dba3fd7e-01a7-4629-a4c8-aa520328439a", "type": "TRANSFORM", "operator": "sagemaker.spark.join_tables_0.1", "parameters": {"left_column": "policy_id", "right_column": "policy_id", "join_type": "leftouter"}, "inputs": [{"name": "df", "node_id": "83855f44-84ce-4cb5-ac34-ac4242257444", "output_name": "default"}, {"name": "df", "node_id": "9297ae66-97e9-4738-9cee-5a02ddea44be", "output_name": "default"}], "outputs": [{"name": "default"}]}]}


--------------------------------------------------------------------------------
/1-sagemaker-pipelines/images/pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/1-sagemaker-pipelines/images/pipeline.png


--------------------------------------------------------------------------------
/2-step-functions-pipelines/01_setup_step_functions_pipeline.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Create a Step Functions Workflow from a Data Wrangler Flow File\n",
  8 |     "\n",
  9 |     "This notebook creates a Step Functions workflow that runs a data preparation step based on the flow file configuration, training step to train a XGBoost model artifact and creates a SageMaker model from the artifact. \n",
 10 |     "\n",
 11 |     "Before proceeding with this notebook, please ensure that you have executed the 00_setup_data_wrangler.ipynb. This notebook uploads the input files, creates the flow file locally from the insurance claims template and uploads flow file to the default S3 bucket associated with the Studio domain.\n",
 12 |     "\n",
 13 |     "This notebook is created using the export feature in Data Wrangler and modified and parameterized for Step functions workflow with SageMaker data science SDK.\n"
 14 |    ]
 15 |   },
 16 |   {
 17 |    "cell_type": "markdown",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "As a first step, we read the input workflow parameters with a predefined schema and use this in our notebook. A sagemaker workflow requires job names to be unique. So we pass on the below names as parameters by generating random ids when we execute the workflow towards the end of the notebook.\n",
 21 |     "\n",
 22 |     "1. Processing JobName\n",
 23 |     "2. Training JobName\n",
 24 |     "3. Model JobName\n"
 25 |    ]
 26 |   },
 27 |   {
 28 |    "cell_type": "code",
 29 |    "execution_count": null,
 30 |    "metadata": {},
 31 |    "outputs": [],
 32 |    "source": [
 33 |     "%pip install -qU stepfunctions"
 34 |    ]
 35 |   },
 36 |   {
 37 |    "cell_type": "markdown",
 38 |    "metadata": {},
 39 |    "source": [
 40 |     "Please use your own bucket name as this bucket will be used for creating the training dataset and the model artifact. The flow file is also generated locally and contains the S3 location details for the two input files namely claims.csv and customer.csv. The flow file is available as a JSON document and holds all the input, output and transformation details in a node structure. SageMaker jobs need the inputs to be available in S3. So, we derive the input file S3 locations from the flow file by reading the JSON document and building an array with a Processing Input object for SageMaker processing.\n",
 41 |     "\n",
 42 |     "You can configure your output location as you wish but SageMaker processing job requires the output name to match with the one created in the flow file. Any mismatch can fail the job. So, we extract the output name from the flow file as below.\n"
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": null,
 48 |    "metadata": {},
 49 |    "outputs": [],
 50 |    "source": [
 51 |     "%store -r ins_claim_flow_uri"
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "code",
 56 |    "execution_count": null,
 57 |    "metadata": {},
 58 |    "outputs": [],
 59 |    "source": [
 60 |     "import json\n",
 61 |     "import sagemaker\n",
 62 |     "import string\n",
 63 |     "import boto3\n",
 64 |     "\n",
 65 |     "sm_client = boto3.client(\"sagemaker\")\n",
 66 |     "sess = sagemaker.Session()\n",
 67 |     "\n",
 68 |     "# bucket = <YOUR_BUCKET>\n",
 69 |     "bucket = sess.default_bucket()\n",
 70 |     "\n",
 71 |     "prefix = 'aws-data-wrangler-workflows'\n",
 72 |     "\n",
 73 |     "FLOW_TEMPLATE_URI = ins_claim_flow_uri\n",
 74 |     "\n",
 75 |     "flow_file_name = FLOW_TEMPLATE_URI.split(\"/\")[-1]\n",
 76 |     "flow_export_name = flow_file_name.replace(\".flow\", \"\")\n",
 77 |     "flow_export_id = flow_export_name.replace(\"flow-\", \"\")"
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "code",
 82 |    "execution_count": null,
 83 |    "metadata": {},
 84 |    "outputs": [],
 85 |    "source": [
 86 |     "sagemaker.s3.S3Downloader.download(FLOW_TEMPLATE_URI, \".\")"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "code",
 91 |    "execution_count": null,
 92 |    "metadata": {},
 93 |    "outputs": [],
 94 |    "source": [
 95 |     "with open(flow_file_name, 'r') as f:\n",
 96 |     "    data = json.load(f)\n",
 97 |     "    output_node = data['nodes'][-1]['node_id']\n",
 98 |     "    print(output_node)\n",
 99 |     "    output_path = data['nodes'][-1]['outputs'][0]['name']\n",
100 |     "    input_source_names = [node['parameters']['dataset_definition']['name'] for node in data['nodes'] if node['type']==\"SOURCE\"]\n",
101 |     "    input_source_uris = [node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] for node in data['nodes'] if node['type']==\"SOURCE\"]\n",
102 |     "\n",
103 |     "output_name = f\"{output_node}.{output_path}\""
104 |    ]
105 |   },
106 |   {
107 |    "cell_type": "code",
108 |    "execution_count": null,
109 |    "metadata": {},
110 |    "outputs": [],
111 |    "source": [
112 |     "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
113 |     "\n",
114 |     "data_sources = []\n",
115 |     "\n",
116 |     "for i in range(0,len(input_source_uris)):\n",
117 |     "    data_sources.append(ProcessingInput(\n",
118 |     "        source=input_source_uris[i],\n",
119 |     "        destination=f\"/opt/ml/processing/{input_source_names[i]}\",\n",
120 |     "        input_name=input_source_names[i],\n",
121 |     "        s3_data_type=\"S3Prefix\",\n",
122 |     "        s3_input_mode=\"File\",\n",
123 |     "        s3_data_distribution_type=\"FullyReplicated\"\n",
124 |     "    ))\n",
125 |     "    \n",
126 |     "print(data_sources)\n",
127 |     "\n",
128 |     "s3_output_prefix = f\"preprocessing/output\"\n",
129 |     "s3_training_dataset = f\"s3://{bucket}/{s3_output_prefix}\"\n",
130 |     "\n",
131 |     "processing_job_output = ProcessingOutput(\n",
132 |     "    output_name=output_name,\n",
133 |     "    source=\"/opt/ml/processing/output\",\n",
134 |     "    destination=s3_training_dataset,\n",
135 |     "    s3_upload_mode=\"EndOfJob\"\n",
136 |     ")"
137 |    ]
138 |   },
139 |   {
140 |    "cell_type": "markdown",
141 |    "metadata": {},
142 |    "source": [
143 |     "The flow file is created and uploaded into S3 as part of the setup notebook. If you encounter any error in the below cell,please go back to the Setup notebook to make sure flow file is generated and uploaded correctly. Now, we retrieve the flow file s3 uri that was stored as a global variable in the setup "
144 |    ]
145 |   },
146 |   {
147 |    "cell_type": "code",
148 |    "execution_count": null,
149 |    "metadata": {},
150 |    "outputs": [],
151 |    "source": [
152 |     "%store -r ins_claim_flow_uri\n",
153 |     "ins_claim_flow_uri"
154 |    ]
155 |   },
156 |   {
157 |    "cell_type": "markdown",
158 |    "metadata": {},
159 |    "source": [
160 |     "Processing job also needs to access the flow file to run the transformations. So, we provide the flow file location as another input source for the processing job as below by passing the s3 uri."
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "code",
165 |    "execution_count": null,
166 |    "metadata": {},
167 |    "outputs": [],
168 |    "source": [
169 |     "## Input - Flow: insurance_claims.flow\n",
170 |     "flow_input = ProcessingInput(\n",
171 |     "    source=ins_claim_flow_uri,\n",
172 |     "    destination=\"/opt/ml/processing/flow\",\n",
173 |     "    input_name=\"flow\",\n",
174 |     "    s3_data_type=\"S3Prefix\",\n",
175 |     "    s3_input_mode=\"File\",\n",
176 |     "    s3_data_distribution_type=\"FullyReplicated\"\n",
177 |     ")"
178 |    ]
179 |   },
180 |   {
181 |    "cell_type": "markdown",
182 |    "metadata": {},
183 |    "source": [
184 |     "## Configure a processing job\n",
185 |     "\n",
186 |     "Please follow the steps below for creating an execution role with the right permissions for the workflow.\n",
187 |     "\n",
188 |     "1. Go to the IAM Console - Roles. Choose Create role. \n",
189 |     "\n",
190 |     "2. For role type, choose AWS Service, find and choose SageMaker, and choose Next: Permissions \n",
191 |     "\n",
192 |     "3. On the Attach permissions policy page, choose (if not already selected) \n",
193 |     "\n",
194 |     "    a. AWS managed policy AmazonSageMakerFullAccess\n",
195 |     "    \n",
196 |     "    b. AWS managed policy AmazonS3FullAccess for access to Amazon S3 resources\n",
197 |     "    \n",
198 |     "    c. AWS managed policy CloudWatchEventsFullAccess\n",
199 |     "\n",
200 |     "4. Then choose Next: Tags and then Next: Review.\n",
201 |     "\n",
202 |     "5. For Role name, enter StepFunctionsSageMakerExecutionRole and Choose Create Role\n",
203 |     "\n",
204 |     "6. Additionally, we need to add step functions as a trusted entity to the role. Go to trust relationships in the specific IAM role and edit it to add  states.amazonaws.com (http://states.amazonaws.com/) as a trusted entity. \n",
205 |     "\n"
206 |    ]
207 |   },
208 |   {
209 |    "cell_type": "markdown",
210 |    "metadata": {},
211 |    "source": [
212 |     "After creating the role, we go on to configure the inputs required by the SageMaker Python SDK to launch a processing job. You can change it as per your needs."
213 |    ]
214 |   },
215 |   {
216 |    "cell_type": "code",
217 |    "execution_count": null,
218 |    "metadata": {},
219 |    "outputs": [],
220 |    "source": [
221 |     "\n",
222 |     "sess = sagemaker.Session()\n",
223 |     "# IAM role for executing the processing job.\n",
224 |     "iam_role = sagemaker.get_execution_role()\n",
225 |     "\n",
226 |     "aws_region = sess.boto_region_name\n",
227 |     "\n",
228 |     "\n",
229 |     "# Data Wrangler Container URL.\n",
230 |     "container_uri = sagemaker.image_uris.retrieve(\n",
231 |     "    framework='data-wrangler',\n",
232 |     "    region=aws_region\n",
233 |     ")\n",
234 |     "\n",
235 |     "# Processing Job Instance count and instance type.\n",
236 |     "instance_count = 2\n",
237 |     "instance_type = \"ml.m5.4xlarge\"\n",
238 |     "\n",
239 |     "# Size in GB of the EBS volume to use for storing data during processing\n",
240 |     "volume_size_in_gb = 30\n",
241 |     "\n",
242 |     "# Content type for each output. Data Wrangler supports CSV as default and Parquet.\n",
243 |     "output_content_type = \"CSV\"\n",
244 |     "\n",
245 |     "# Network Isolation mode; default is off\n",
246 |     "enable_network_isolation = False\n",
247 |     "\n",
248 |     "# KMS key for per object encryption; default is None\n",
249 |     "kms_key = None"
250 |    ]
251 |   },
252 |   {
253 |    "cell_type": "markdown",
254 |    "metadata": {},
255 |    "source": [
256 |     "## Create a Processor\n",
257 |     "\n",
258 |     "To launch a Processing Job, we will use the SageMaker Python SDK to create a Processor function with the configuration set."
259 |    ]
260 |   },
261 |   {
262 |    "cell_type": "code",
263 |    "execution_count": null,
264 |    "metadata": {},
265 |    "outputs": [],
266 |    "source": [
267 |     "from sagemaker.processing import Processor\n",
268 |     "from sagemaker.network import NetworkConfig\n",
269 |     "\n",
270 |     "processor = Processor(\n",
271 |     "    role=iam_role,\n",
272 |     "    image_uri=container_uri,\n",
273 |     "    instance_count=instance_count,\n",
274 |     "    instance_type=instance_type,\n",
275 |     "    volume_size_in_gb=volume_size_in_gb,\n",
276 |     "    network_config=NetworkConfig(enable_network_isolation=enable_network_isolation),\n",
277 |     "    sagemaker_session=sess,\n",
278 |     "    output_kms_key=kms_key\n",
279 |     ")"
280 |    ]
281 |   },
282 |   {
283 |    "cell_type": "markdown",
284 |    "metadata": {},
285 |    "source": [
286 |     "# Create a Step Function WorkFlow \n",
287 |     "## Define Steps\n",
288 |     "A step function workflow consists of multiple steps that run as separate state machines. We will first create a `ProcessingStep` using the Data Wrangler processor defined above. Processing job name is unique and passed on as a command line argument from the workflow. This is used to track and monitor the job in console and cloudwatch logs. "
289 |    ]
290 |   },
291 |   {
292 |    "cell_type": "code",
293 |    "execution_count": null,
294 |    "metadata": {},
295 |    "outputs": [],
296 |    "source": [
297 |     "from stepfunctions.inputs import ExecutionInput\n",
298 |     "workflow_parameters = ExecutionInput(schema={\"ProcessingJobName\": str, \"TrainingJobName\": str,\"ModelName\": str})"
299 |    ]
300 |   },
301 |   {
302 |    "cell_type": "code",
303 |    "execution_count": null,
304 |    "metadata": {},
305 |    "outputs": [],
306 |    "source": [
307 |     "from stepfunctions.steps import ProcessingStep\n",
308 |     "\n",
309 |     "data_wrangler_step = ProcessingStep(\n",
310 |     "    \"WranglerStepFunctionsProcessingStep\",\n",
311 |     "    processor=processor,\n",
312 |     "    job_name = workflow_parameters[\"ProcessingJobName\"],\n",
313 |     "    inputs=[flow_input] + data_sources, \n",
314 |     "    outputs=[processing_job_output]\n",
315 |     ")"
316 |    ]
317 |   },
318 |   {
319 |    "cell_type": "markdown",
320 |    "metadata": {},
321 |    "source": [
322 |     "Next, we add a TrainingStep to the workflow that trains a model on the preprocessed train data set. Here we use a builtin XG boost algorithm with fixed hyperparameters. You can configure the training based on your needs"
323 |    ]
324 |   },
325 |   {
326 |    "cell_type": "code",
327 |    "execution_count": null,
328 |    "metadata": {},
329 |    "outputs": [],
330 |    "source": [
331 |     "import boto3\n",
332 |     "from sagemaker.estimator import Estimator\n",
333 |     "\n",
334 |     "region = boto3.Session().region_name\n",
335 |     "\n",
336 |     "image_uri = sagemaker.image_uris.retrieve(\n",
337 |     "        framework=\"xgboost\",\n",
338 |     "        region=region,\n",
339 |     "        version=\"1.2-1\",\n",
340 |     "        py_version=\"py3\",\n",
341 |     "        instance_type=instance_type,\n",
342 |     "    )\n",
343 |     "xgb_train = Estimator(\n",
344 |     "        image_uri=image_uri,\n",
345 |     "        instance_type=instance_type,\n",
346 |     "        instance_count=1,\n",
347 |     "        role=iam_role,\n",
348 |     "    )\n",
349 |     "xgb_train.set_hyperparameters(\n",
350 |     "        objective=\"reg:squarederror\",\n",
351 |     "        num_round=3,\n",
352 |     "    )\n",
353 |     "\n",
354 |     "from sagemaker.inputs import TrainingInput\n",
355 |     "from stepfunctions.steps import TrainingStep\n",
356 |     "\n",
357 |     "xgb_input_content_type = 'text/csv'\n",
358 |     "\n",
359 |     "training_step = TrainingStep(\n",
360 |     "    \"WranglerStepFunctionsTrainingStep\",\n",
361 |     "    estimator=xgb_train,\n",
362 |     "    data={\n",
363 |     "        \"train\": TrainingInput(\n",
364 |     "            s3_data=s3_training_dataset,\n",
365 |     "            content_type=xgb_input_content_type\n",
366 |     "        )\n",
367 |     "    },\n",
368 |     "    job_name = workflow_parameters[\"TrainingJobName\"],\n",
369 |     "    wait_for_completion=True,\n",
370 |     ")"
371 |    ]
372 |   },
373 |   {
374 |    "cell_type": "markdown",
375 |    "metadata": {},
376 |    "source": [
377 |     "The above training job will produce a model artifact. As a final step, we register this artifact as a SageMaker model with the model step"
378 |    ]
379 |   },
380 |   {
381 |    "cell_type": "code",
382 |    "execution_count": null,
383 |    "metadata": {},
384 |    "outputs": [],
385 |    "source": [
386 |     "from stepfunctions.steps import ModelStep\n",
387 |     "\n",
388 |     "model_step = ModelStep(\n",
389 |     "    \"SaveModelStep\", model=training_step.get_expected_model(), model_name=workflow_parameters[\"ModelName\"])"
390 |    ]
391 |   },
392 |   {
393 |    "cell_type": "markdown",
394 |    "metadata": {},
395 |    "source": [
396 |     "In order to notify failures, we need to configure a fail step with an error message as below and call it during the incidence of every failure"
397 |    ]
398 |   },
399 |   {
400 |    "cell_type": "code",
401 |    "execution_count": null,
402 |    "metadata": {},
403 |    "outputs": [],
404 |    "source": [
405 |     "from stepfunctions.steps.states import Fail\n",
406 |     "\n",
407 |     "process_failure = Fail(\n",
408 |     "    \"Step Functions Wrangler Workflow failed\", cause=\"Wrangler-StepFunctions-Workflow failed\"\n",
409 |     ")"
410 |    ]
411 |   },
412 |   {
413 |    "cell_type": "markdown",
414 |    "metadata": {},
415 |    "source": [
416 |     "There might be situations where you may expect intermittent failures due to unavailable resources and may want to retry a specific step. You can set up retry mechanism as below for such steps and configure the interval and the attempts "
417 |    ]
418 |   },
419 |   {
420 |    "cell_type": "code",
421 |    "execution_count": null,
422 |    "metadata": {},
423 |    "outputs": [],
424 |    "source": [
425 |     "from stepfunctions.steps.states import Retry\n",
426 |     "\n",
427 |     "data_wrangler_step.add_retry(Retry(\n",
428 |     "    error_equals=[\"States.TaskFailed\"],\n",
429 |     "    interval_seconds=15,\n",
430 |     "    max_attempts=2,\n",
431 |     "    backoff_rate=3.0\n",
432 |     "))\n",
433 |     "\n",
434 |     "training_step.add_retry(Retry(\n",
435 |     "    error_equals=[\"States.TaskFailed\"],\n",
436 |     "    interval_seconds=10,\n",
437 |     "    max_attempts=2,\n",
438 |     "    backoff_rate=4.0\n",
439 |     "))"
440 |    ]
441 |   },
442 |   {
443 |    "cell_type": "markdown",
444 |    "metadata": {},
445 |    "source": [
446 |     "Additionally we can introduce a wait interval between steps to ensure a smooth transition and provide a specific step with all the resources and inputs it will need. "
447 |    ]
448 |   },
449 |   {
450 |    "cell_type": "code",
451 |    "execution_count": null,
452 |    "metadata": {},
453 |    "outputs": [],
454 |    "source": [
455 |     "from stepfunctions.steps.states import Wait\n",
456 |     "\n",
457 |     "wait_step = Wait(\n",
458 |     "    state_id=\"Wait for 3 seconds\",\n",
459 |     "    seconds=3\n",
460 |     ")"
461 |    ]
462 |   },
463 |   {
464 |    "cell_type": "markdown",
465 |    "metadata": {},
466 |    "source": [
467 |     "Now that we have defined the steps needed for the workflow, we need to catch and notify failures at every step. We fail the entire workflow whenever a failure is caught by this FailStep"
468 |    ]
469 |   },
470 |   {
471 |    "cell_type": "code",
472 |    "execution_count": null,
473 |    "metadata": {},
474 |    "outputs": [],
475 |    "source": [
476 |     "from stepfunctions.steps.states import Catch\n",
477 |     "\n",
478 |     "catch_failure = Catch(\n",
479 |     "    error_equals=[\"States.TaskFailed\"],\n",
480 |     "    next_step=process_failure\n",
481 |     ")\n",
482 |     "\n",
483 |     "data_wrangler_step.add_catch(catch_failure)\n",
484 |     "training_step.add_catch(catch_failure)\n",
485 |     "model_step.add_catch(catch_failure)"
486 |    ]
487 |   },
488 |   {
489 |    "cell_type": "markdown",
490 |    "metadata": {},
491 |    "source": [
492 |     "Finally, we create a workflow by chaining all the above defined steps in order and execute it with the required parameters. Unique job names are generated randomly and passed on as parameters for the workflow. We introduce a wait time of 3 steps between preprocessing and training steps by chaining the wait step in between"
493 |    ]
494 |   },
495 |   {
496 |    "cell_type": "code",
497 |    "execution_count": null,
498 |    "metadata": {},
499 |    "outputs": [],
500 |    "source": [
501 |     "from stepfunctions.steps import Chain\n",
502 |     "from stepfunctions.workflow import Workflow\n",
503 |     "import uuid\n",
504 |     "\n",
505 |     "workflow_graph = Chain([data_wrangler_step, wait_step,training_step, model_step])\n",
506 |     "\n",
507 |     "branching_workflow = Workflow(\n",
508 |     "    name=\"Wrangler-SF-Run-{}\".format(uuid.uuid1().hex),\n",
509 |     "    definition=workflow_graph,\n",
510 |     "    role=iam_role\n",
511 |     ")\n",
512 |     "\n",
513 |     "branching_workflow.create()"
514 |    ]
515 |   },
516 |   {
517 |    "cell_type": "code",
518 |    "execution_count": null,
519 |    "metadata": {},
520 |    "outputs": [],
521 |    "source": [
522 |     "# Each Preprocessing job requires a unique name\n",
523 |     "processing_job_name = \"wrangler-sf-processing-{}\".format(\n",
524 |     "    uuid.uuid1().hex)\n",
525 |     "# Each Training Job requires a unique name\n",
526 |     "training_job_name = \"wrangler-sf-training-{}\".format(\n",
527 |     "    uuid.uuid1().hex)\n",
528 |     "model_name = \"sf-claims-fraud-model-{}\".format(uuid.uuid1().hex)\n",
529 |     "\n",
530 |     "\n",
531 |     "# Execute workflow\n",
532 |     "execution = branching_workflow.execute(\n",
533 |     "    inputs={\n",
534 |     "         \"ProcessingJobName\": processing_job_name, # Each pre processing job (SageMaker processing job) requires a unique name,\n",
535 |     "         \"TrainingJobName\": training_job_name,#  Each Sagemaker Training job requires a unique name,\n",
536 |     "         \"ModelName\" : model_name # Each model requires a unique name\n",
537 |     "    } \n",
538 |     ")\n",
539 |     "execution_output = execution.get_output(wait=True)\n"
540 |    ]
541 |   },
542 |   {
543 |    "cell_type": "markdown",
544 |    "metadata": {},
545 |    "source": [
546 |     "You can visualize the step function workflow status and details in the Step Functions console. "
547 |    ]
548 |   },
549 |   {
550 |    "cell_type": "markdown",
551 |    "metadata": {},
552 |    "source": [
553 |     "## (Optional) StepFunctions cleanup\n",
554 |     "1. Delete the input claims file, customer file and flow file from S3.\n",
555 |     "2. Delete the training dataset created by processing job\n",
556 |     "3. Delete the model artifact tar file from S3.\n",
557 |     "4. Delete the SageMaker Model.\n"
558 |    ]
559 |   },
560 |   {
561 |    "cell_type": "code",
562 |    "execution_count": null,
563 |    "metadata": {},
564 |    "outputs": [],
565 |    "source": []
566 |   }
567 |  ],
568 |  "metadata": {
569 |   "instance_type": "ml.t3.medium",
570 |   "kernelspec": {
571 |    "display_name": "Python 3 (Data Science)",
572 |    "language": "python",
573 |    "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0"
574 |   },
575 |   "language_info": {
576 |    "codemirror_mode": {
577 |     "name": "ipython",
578 |     "version": 3
579 |    },
580 |    "file_extension": ".py",
581 |    "mimetype": "text/x-python",
582 |    "name": "python",
583 |    "nbconvert_exporter": "python",
584 |    "pygments_lexer": "ipython3",
585 |    "version": "3.7.10"
586 |   }
587 |  },
588 |  "nbformat": 4,
589 |  "nbformat_minor": 4
590 | }
591 | 


--------------------------------------------------------------------------------
/2-step-functions-pipelines/README.md:
--------------------------------------------------------------------------------
 1 | ## SageMaker Data Wrangler with AWS Step Functions
 2 | 
 3 | ## Getting Started
 4 | 
 5 | In this sample, we will use Step Functions to orchestrate the data wrangler based ML workflow using [AWS Step Functions Data Science SDK](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-python-sdk.html)
 6 | The AWS Step Functions Data Science SDK is an open-source library that allows data scientists to create workflows that can preprocess datasets, build, deploy and monitor machine learning models using AWS SageMaker and AWS Step Functions. The AWS Step Functions Data Science SDK provides a Python API that can create and invoke Step Functions workflows. You can manage and execute these workflows directly in Python, as well as Jupyter notebooks.
 7 | 
 8 | We will build a simple ML workflow leveraging SageMaker that has a preprocessing step using Data Wrangler, a training step and a save model step. We will chain these steps in order, assign dependencies and catch failures in the flow with Step Functions. The workflow will look as below
 9 | 
10 | 
11 | ![](step-function-workflow.png)
12 | 
13 | ## Instructions 
14 | 
15 | To run this sample, we used a Jupyter notebook running Python3 on a data science kernel in a SageMaker Studio environment. You can also run it on a Python3 notebook instance locally on your machine by setting up the credentials to assume the SageMaker execution role. The notebook is light weight and can be run on a t3 medium instance for example.
16 | 
17 | You can either use the export feature in SageMaker Data Wrangler to generate the Pipelines code and modify it for Step functions or build your own script from scratch. In this sample, we have used a combination of both approaches for simplicity. We have taken some pieces from the autogenerated code and extended it with features specific to Step Functions.  
18 | 
19 | ## Executing the notebooks
20 | 
21 | 1. Run the 00_setup_data_wrangler.ipynb notebook step by step. This notebook does the following      
22 |     a. Generates a flow file from Data Wrangler or uses the setup script to generate a flow file from a preconfigured template      
23 |     b. Creates an Amazon S3 bucket and uploads your flow file and input files to the bucket
24 | 2. Follow the instructions in the 01_setup_step_functions_pipeline notebook to kick off a Step Functions workflow. 
25 | 3. Configure your Amazon SageMaker execution role with the required permissions as mentioned in the 01_setup_step_functions_pipeline notebook 
26 | 4. The processing job will be run on a SageMaker managed Spark environment behind the scenes and this can take few minutes to complete
27 | 5. Go to StepFunctions console and track the workflow visually thru StepFunctions console. You can also navigate to the linked CloudWatch logs to debug errors.
28 | 6. Make sure you clean up all the resources at the end.
29 | 
30 | ## Cleanup 
31 | 
32 | Please follow the instructions in the cleanup section of the 01_setup_step_functions_pipeline notebook to avoid any additional accumulation of costs. 
33 | 
34 | 
35 | 
36 | 
37 | 
38 | 


--------------------------------------------------------------------------------
/2-step-functions-pipelines/flow-01-15-12-49-4bd733e0.flow:
--------------------------------------------------------------------------------
1 | {"metadata": {"version": 1, "disable_limits": false}, "nodes": [{"node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "claims.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/claims.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "driver_relationship": "string", "incident_type": "string", "collision_type": "string", "incident_severity": "string", "authorities_contacted": "string", "num_vehicles_involved": "long", "num_injuries": "long", "num_witnesses": "long", "police_report_available": "string", "injury_claim": "long", "vehicle_claim": "float", "total_claim_amount": "float", "incident_month": "long", "incident_day": "long", "incident_dow": "long", "incident_hour": "long", "fraud": "long"}}, "inputs": [{"name": "default", "node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "customers.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/customers.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "a2370312-2ab4-43a3-ae7d-ba5a17057d0a", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "customer_age": "long", "months_as_customer": "long", "num_claims_past_year": "long", "num_insurers_past_5_years": "long", "policy_state": "string", "policy_deductable": "long", "policy_annual_premium": "long", "policy_liability": "string", "customer_zip": "long", "customer_gender": "string", "customer_education": "string", "auto_year": "long"}}, "inputs": [{"name": "default", "node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "police_report_available"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2ms2BZ&R6WWXb6O004CX000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx%1ME87#*Qpf<I8C>(>7oJ&#MZjCZ(6H<acx(WD*uiN5&{I<(K$17X39cJY{dfYI7Co;<S7CjxU+mj$c2zUW-@-a_8MQ2HbJ9h4a>%1%HSefU5ZW&!?IhpgSc+k>^~zV6MSv%c+1-bxHrFOt5=M!;g3q$_cvD^Y(+n6QIYC{+dyqTGvi~)x4~8?25S^b!Wbw^ID)aTwOEBv@)woSMXAEa#XWd!aqW*_=g|XVWK+X5j!<lV_%@T~Klh}Oo=y=fJQBVX^y+knN&VXR|Gku-+#l;m6o^gJ#XC?-0|XQR000O8%8hSRWFmRP4gdfE3;+NC7ytkOZDn*}WMOn+FD`Ila&#>)FfcGME@N_IOD;-gU|?WkIGK_+697<40|XQR000O8%8hSR-Pv?efB^siZ~_1TNB{r;WMOn+FK}UUbS*G2FfcGJVlrhhF*!J9EjeUjGA%SSG-WMeGh{F=GGSphWnnisFk(4oEn_e+FfMa$VQ_GHE^uLTadl;MjgnDI!!QuX&#caYd(B884<j&fFhgr+?S?PDi4Q}>2Sr5EHPvF;ZcV0$?DP4x%-qze2)=~7|NoM^-;uN9a~}ha@UVv+8`N=~hOk?^lA54V2>_JV>{)Aov$1uw2Y_rgDi@lq!N^r7O+69S!>u0Q%Uoat2Z(Gd5lf|yt4cg$gzIqN5JzR&EbT3+WG)Ny)56Wpm)QYmA(y&zr$KSk?bN}&Tz~id(ns;P1f!PIGOfp*#?d;9OD!Pa1%P!CxIIG3>W+(vu%q*L3jQc7os|XI47kTVAl+XTdaAe$rZZ`HRX*`t8j%Pi$m_-nGtU#rhD@7kHa4us(oCvEv*Gm%D@4Aq)(EV>tpB_xR`Lqu;vep!S619vd0ZAoWQ88hlww0Wft>##7B&&Fl1rh-J`ilL`TS-M<a7nNmsjo8M%q62J?^)6<5{MXwwDf)v>zlgP2(s^lfyI$!!QYhaNtM%{UGqtei$c%u1l$1c-L<I%>K1EP)h>@6aWAK2ms2BZ&Rr%CoK{H000mG002z@003lRbYU+paA9(EEif=JFfc7*GG#F_IXGr5Ib>ooEi^MUWi4SdWH2o<VPQ69VK+E1VmW3lV=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?df;N6+35SGmj08mQ-0u%!j000080LqPTQ<-jL$>{(90CfQX022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2ms2BZ&PF<dBY9>000aC000;O0000000000005+c8UX+RZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8%8hSR-Pv?efB^siZ~_1TNB{r;0000000000q=8-m003lRbYU-WVRCdWFfcGMFfC#-Wic^1IA$5WMVQcG&3}1EnzccFfB4+VK!x9H#jh2Ic6<mFfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2ms2BZ&Rr%CoK{H000mG002z@0000000000005+cL<0Z-WMOn+FD`Ila&#>)FfcGMEn+fdF)=wfW-U2nVlpi>Gc;u_VKZbfEiz$YHf3QqI51*4W-VhdFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0Ko$Q0000"}}, "inputs": [{"name": "df", "node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "a9dcc572-7c50-4e63-b592-6691f8df29b6", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "incident_severity"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2mlj~Z&R2h55?vH003_R000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YiPs1<_K=1sFoVTc@P`X{mWrqnQ5ZB5}+$~zZaFU5K^}k~k3Be?8mi6AVFG5*JNvuVngM<j`h%!Z>lXO-N2)Pi-mrTYVwqD~0(k^JUVj;3|m@>F1)^*#ot5tVjchz0J5VQY=luYnkZe%5^CAd$&tk^3?^YTx!-w*q)dARM`20KRB2e$(*Uxe|o<V~>b7_W>*pZkPoB$G!*kKnb%tv`dEyC=lR>rnEKmeIvt@i;An#sP}W&%X{_7mz5>RpS3|`~I6_A<fVeX{6UNB5MQTdqJ<pM@&jg)6Ex9O9KQH00008027XHQ$-)AsSW@D01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV@obdW?*1oVmQjLIs*VuO9KQH00008027XHQ?<>B<d^{f0FMFy07w7;0AyiwVJ~oDa&#>)FfcGMEjeR1IAdWqWGy)`V=yf=GdMXdVl-r8EihtZGiG8rH)1e1Wi4YcFfcB2Zeeh6c`k5aa&dKKbd6HcYTGarmEFYYSY8!m@Iw%!q~HmAPF*MTrEjIsQbrktF@_w;oj2J|C1s<O{7dQ2^tXD`xLdciuorRfIp^x=Ty%E$A;t(_;Ey5ppzyF4q-fkbZ!A|*knRcp)@X!K7;M40HAWq$7gi4HLU&Tujqbk5){-9d?GSc-b<{TtuD^dV^RKev*hPCTd8>FXl{wU!Ut<@0Xs<&$sx)>(_oR}QtnJ?v?vH+xBZLO7ANaw}c7Xf-$V2|c^LrnMp2UAN*d2I-J7;j~>}(&r5YZ1HJ_0{*0GW&YkdQ`SGQQvtf<5M8KcqHgQ<HAOA{HUC`;Wq0DF(&WQd*{)pRAIW3vM~NcW$V?UPwBl#_HwV(vTLVRKn2NryE-HYE3=M0{lk*YmT>Y8|nN%*tJ(#UjjUDYRk*ooGz83GkAeqy#m(I@Lb6a5Yndz;I&$PULkVa27FiPVY`uG%;Jc}!*RYWl^90FR1}j$fC%ztPuM}mZqsllXMzqllMsy6_Yd<(_KFZVB-m_{S3b115ir?1QY-O00;mRj&D;xA*7uW0000G0000@0001FVRT_HE^uLTbS*G2FfcGJIb%0CV_`RBEjch_FfBAQI5{n1G-P5eFk)jfW@0%vVlX#lEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?WkIP+aETqkIbEdWqU0Rj{Q6aWAK2mlj~Z&R2h55?vH003_R000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;mRj&D;%AE&7f0000C0000O00000000000001_ffoS)0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j00008027XHQ?<>B<d^{f0FMFy07w7;00000000000HlFi0RRAGVRT_HaA9(EEif=JFfc7SV>dWsVK-zgIWS`|Ei^MYIW1x|WMVBaVq-IAVmUWrFgIl_V=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;mRj&D;xA*7uW0000G0000@00000000000001_fms6p0AyiwVJ|LlVRCdWFfcGMFfBP_H#lQqH)Jh2Fk>(+G&49kEn+leVl6OYV>4!AIX7Z3H)SnjFfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve007nl00000"}}, "inputs": [{"name": "df", "node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "2059176b-8fd9-4324-9452-cee6193474ee", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "driver_relationship"}}, "inputs": [{"name": "df", "node_id": "a9dcc572-7c50-4e63-b592-6691f8df29b6", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "56815d9f-f395-494b-8f16-8ddd2138e87e", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['driver_relationship']=df['driver_relationship'].replace('n/a','na')"}, "inputs": [{"name": "df", "node_id": "2059176b-8fd9-4324-9452-cee6193474ee", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "6c7a9e1e-16d4-4dda-bfba-e87f9c00d798", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "incident_type"}}, "inputs": [{"name": "df", "node_id": "56815d9f-f395-494b-8f16-8ddd2138e87e", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "b286b138-9c3e-48c6-8e32-5188ca9b62d5", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "collision_type"}}, "inputs": [{"name": "df", "node_id": "6c7a9e1e-16d4-4dda-bfba-e87f9c00d798", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "a32f07c9-671e-4c88-ada9-606d0f4dac1d", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['collision_type']=df['collision_type'].replace('n/a','na')"}, "inputs": [{"name": "df", "node_id": "b286b138-9c3e-48c6-8e32-5188ca9b62d5", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "aaf76a75-ccbe-4e8a-bc3c-f74dd6f9ee5d", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "authorities_contacted"}}, "inputs": [{"name": "df", "node_id": "a32f07c9-671e-4c88-ada9-606d0f4dac1d", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "c2fb6219-969d-44ac-9551-4996073630a8", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "driver_relationship"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mtSpZ&RJqavbOY0040T000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YiPs1<}MDP3zpHnL#YEXKmmmVsRKwP86-ZWPI#oDeYRsK6mkWfM5I-2)pE<#yINo+)*gM<ibLYX4aQ93I}gj@*aQzqjNJFoE_X&*FNQHg9ErVK8MP1Chq-}l|RY1UmO7XAe(nc&&n%39Vda36kEGgpk}@RxFcTm9OMs?&FE!xiSi?Sah~5lW6F-v`T3Mi&PazSbx3+TzZiz%ITcV&vHPpRy)%Vl|<K%tz8>qL?e8afD*?<JH7p7ZL?x+4c1Qe~00_$3R+252TTv&k-3L3Ev8Ob>3o9VxDh40Z>Z=1QY-O00;o@kZ)7rpkAB~0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d9DY`rt08mQ<1QY-O00;o@kZ)5_8f)620RRA-0ssI=0001FVRT_HaA9(EEif=JFfc7QH8e3|WMN?~V=-YgEi^GQGA%h`WMeI4V>o7HGB-71H!w6UV=yo<E^}^SaBz7paA9(Bb!BvoQr&9XFcenQ#0_n36=d*55Tvw_3A;|cB<!YlqZg%&T@*?wax67wvYkruMk#rVJw%_OFR<6zagzRQWiSwjbH48!9ep3VIQ<e4L<}(_vImJrydXh`o?TY8m9#To<T^tLO@keXJCC^C+U*ussMY;xrB)cgQg{$I10BfLBTZ7dg){{F`g6}Nh52@4=L=Z~g3$@4`&y(@+f!r2J;B67CvDo+xpf<QS87?v()~@5-td(iBGkuz;0FiYfb{&Khy2U8_pXPYoBt>X?(g-VT5t}!pWf^B1F!;-MqSf|kw2oeGPhhT1!&-FIsCgcHm^$BiFwFFM33He*gMT3xvQn)ruxAP=|m=+pubuV%-t^~n=$K5J$G!xR4#R9+3f0pm7-{<H&TG#^uNrBExc9q^*`8VS6<!%Jg-V8^3tByTC*9vK(77)mQaXP%MCDM&k|tQYIVIr^lU5eLtzG6M+Om(LLLnc(;D0ygld|pahyq|(#b?+AJrsDk}QdnX*3xh$8o5}Nt#WM{2&0wkHPi}_?Eu`P)h>@6aWAK2mtSpZ&L$&q2dw%000mG002z@003lRbYU+paA9(EEif=JFfc7QH8e3|WMN?~V=-YgEi^GQGA%h`WMeI4V>o7HGB-71H!w6UV=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?c6`uZ&Mp&QTB0Z>Z;0u%!j000080Pm1*Q=QXt9OwW50C52T022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mtSpZ&Tr*UYrg9000aC000;O0000000000005+c836zQZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8?~rd(Q5tL7p#cB@n*sm;NB{r;0000000000q=8)l003lRbYU-WVRCdWFfcGMFfBJVG%;diVPP#}F<~?<G%+$VEjeOjV=ZK3IA&xrH#K56Ff=V=FfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mtSpZ&L$&q2dw%000mG002z@0000000000005+cWCH*IWMOn+FD`Ila&#>)FfcGMEjKkZF=Aw4VJ%}ZVKgl?F)}hOIbvjEEo5UjW@IuqHDWg~G%aH=FfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0O11w0000", "one_hot_encoder_model": "P)h>@6aWAK2mtSpZ&RNX%Zbwf003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFpZ9}PQyS9hIbz0=Ju2-N=mY#N>pkE6YI%M92#+VKKXnMRC#wy5l9S`SUi08_y4|_!9&ibH^i?N5)5awIRU@f;O&GUh0#7md?eY$fkxz`4sTXQ=73osB-3x&y6w8IZ8mk&uFTbcLC!NpO|ETat2Km=xUSSI^QiDor0XWQZz?$9X}i@E^&uQVlP`vfq#_UGRdRwMxq-a<3Yz2Z;)g3eLmWNs<2(3k5i%1iADPktvY37XpK(!w6BIAcH_C2*=-l_+zuJTgF3a5~P)h>@6aWAK2mtSpZ&OWF`nC=L000aC000;O003=ebYWy+bYU+paA9(EEif=JFfcA-a$`#_N@ieSU}ESzEixYfP)h>@6aWAK2mtSpZ&QH6$k2iT001xo002k;003lRbYU-WVRCdWFfcGMFfC*?Vqs-8GcYY;V`4WgG&3|bEn#9aV=XZ@Wo0sBGBGh@VP-93FfcGKb8ca9aCt6pVRCVGWpsT~U2EGg6jjvB>qGa@2N@as5Cjb^xWb+j*9m=S-%2SXY<nAHD6!O;$#x<sD`^|(&*%^A59v*lj)g)cxZIC(?mbs%bpACVh%nNP2nNR^UP#*>LTGJ$bogHcXa5X?-GK`hqg9sR;Ob(}PKEjN%T9jFN)U|tn0}fFC$Af`_*ULZ3++ABZ_~CetpfnT2Z(-@T2``l$j3n#gx%c`njLs3$hXNaq*pM(KKAeZhgNs@^c}512>Za8Rps8tD2OO+%rzHN0V4T}UjA1aTQ)WQkn)(vh#tK`*qP?w%;wT@)7<b%I#CEG=ws`Sxy4knA+yfRCyqr-l~Na$4S(FRT2!mj)*66VGyXA!F=7Gf&1&D**|G+bG?1S(wG(A+zs$8}LnXA8%ZSwwiLnMK43-gl)dY=~%b!a`FE@&hE7RGm*hzR2^Q3b)o|n4l#A;BeURp?{#(A!a6P0IKR%B^5Nb=t2G>ug+8yAD4APnIJpIQiw;IBLbP)h>@6aWAK2mtSpZ&R(dok0=+000mG002z@003lRbYU+paA9(EEif=JFfc7-HDY09G&3+QVq;=AEi^MUG%aCbGh;0=Hf3cpV=^%@V_{}3V=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?b}Nvml%_s_}<08mQ-0u%!j000080Pm1*Q=b&eiPHc80BHdL022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mtSpZ&OWF`nC=L000aC000;O0000000000005+c1OWg5ZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8?~rd(fWpYof&l;kFaiJoNB{r;0000000000q=85Q003lRbYU-WVRCdWFfcGMFfC*?Vqs-8GcYY;V`4WgG&3|bEn#9aV=XZ@Wo0sBGBGh@VP-93FfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mtSpZ&R(dok0=+000mG002z@0000000000005+cFarPpWMOn+FD`Ila&#>)FfcGMEo3!fVP!NkFfC$ZVmB=`Gc+_UVPZ35EipD_Win$jF)?FdW-VhdFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0IdT60000"}}, "inputs": [{"name": "df", "node_id": "aaf76a75-ccbe-4e8a-bc3c-f74dd6f9ee5d", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "3e1bcebf-1353-4ab5-97a4-7fc0c2c00579", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "incident_type"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mt+%Z&T$i^7Y~X003(N000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZHhtWzr_BCl=PbWwo>VjGR@No&<F#tx!X`FGqPAtBN2oqNum3rK@8!T_ER0t3n;i5QSi!WuCVrkP2+B-ZY*bBaER_g;~K0+O~A6WRd7eN#7m-#2YvwM_vl|3)0OciCKvQj{CCUv5)oZq_P?f0W|`b<|c}UCFM_mCU2vV>VyF&ne7l>rI9d*)U3fUDxh6)lO<fE{*9ND4&k#l%bvbK(kJrhypAHD4K9K>G_8I_DAFR)64F-mY!G<zh2h{n3&#Ya`OC0QU0?|%${Lc?!Ey~O9KQH000080R50}Q!UfbDGmSt01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV@obdW?*1oV)*(=?<oLKO9KQH000080R50}Q>WjezncL70GI**07w7;0AyiwVJ~oDa&#>)FfcGMEi*7OH!x&nGA%eUW;rc1HZ(OYIA&rtEjcnVW->TAW-&BnFfC&+FfcB2Zeeh6c`k5aa&dKKbd6G7Yuhjsm7T;*SzZ-n@Iw%!B#;PuPTeH*rEg;oqm(`j#u#!WwWhM2%F0Fw`49airT=1oYd2r(=U^}5-gC~?(YfgS>>|Plr&tWI2Ze{dAVIx`y_U+7W@;qA8D<4SC<<zD)_SP!9Ngy)I)rXkYMoVCSu~;9Ya38+#<A<`PpxXs&CQQ$b}e&`U3Bb{51OY^S7*lXJM3Z)9oI=6?y+0?Kx&!G!oH?(tNTQD5o)`B;0H&|0Jr?Ehy1V4@7)aT#V;D{w!QX)(|&Z0ny2qY^#j-g0z<F>os0YdA*H!ue9j>S+XP}irCA3jjhKZjM5O;lQQc|=#qB~`X3AeImzE1|Ir;4@slA&^I;E90i<zYZsxqlXMW^4EwBY%gdKL!wt^V&E-@&^~m;b@8y|Us8;8|H%o)y)Xh1PTmFOaK$fCV%>)p83A=sp5?tybSxh<x4!9OkC8+ejy3QOKfBFI{9>bV4-|Y7`5pR5~6jaiYdal87WuCee8GF^)quN>VZD`#}I3AHmws;ZObnP)h>@6aWAK2mt+%Z&P8AA|?_5000mG002z@003lRbYU+paA9(EEif=JFfc7MFfunVWMwifI5B29Ei^VXH7z)1Vm2)~GBIW{I5}o9G-WU?V=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?d{`EXu#j%P?X08mQ-0u%!j000080R50}Q{^u5_2K{k0BZpN022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mt+%Z&NMP&?yc8000aC000;O0000000000005+c69E7KZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8{g7`{r{AK#n*jg-m;wL*NB{r;0000000000q=8of003lRbYU-WVRCdWFfcGMFfB7MGB+?}Wil-|F=ja}G&VFfEjVUkHZ3_aF=jG2Ic705WiTycFfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mt+%Z&P8AA|?_5000mG002z@0000000000005+cSOWk6WMOn+FD`Ila&#>)FfcGMEi*7OH!x&nGA%eUW;rc1HZ(OYIA&rtEjcnVW->TAW-&BnFfC&+FfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0M!Ek0000", "one_hot_encoder_model": "P)h>@6aWAK2mt+%Z&PQfXcN@{003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFqMxzO9Md=hI@a;)R$G3wQ3dK~OMgWxEW!bLMcr*bfWC{ddQRLJA9;DPErUnVB;fho!&-F~15-Xig+b!u%$@6Ek6LnZ#3|`&jRSiI(z_hl394;;B^h38wAP4@#+FRQvtd!KME~xhyfTxffE%4f>C;k=$#G<?u?WjojV#+udwOH8a9}^had#8BQ3iqlfU0?oy<jV`j$Ym>#iIm}Yc^I#@z#^$;C1m3-1Beq9+`{f+fct|hE&pc$L1&;MFhbE9kP-=Apx&B40fd;(BQ0|XQR000O8{g7`{ZyYpT4gdfE3;+NC7ytkOZDn*}WMOn+FD`Ila&#>)FfcGME@N_IOD;-gU|?Wk5YAYc4ggR~0|XQR000O8{g7`{8Ca6KegOagFaiJoNB{r;WMOn+FK}UUbS*G2FfcGJIWjXbF*7zbEiySdWGysdWi~BhH#uW1IAu6tW-~A}GBGzWEn_e+FfMa$VQ_GHE^uLTadl;MeNs(JD=`#JW34W9;UYr@W>Eqbg)z*$spGV`(4B~gs5^N)(oUko%godyRjL;J8U8?jh_|+iBACGCew=gfxglqV=P^Nqkxoc3I4#l&XtRkBnp+QTeTm>~9AL0ZaKU0U%M$FL9Ie@rF#rGA;f*W>!DtuL^`UU`x;B%G;z3$y-$J`h+SG-003di1(G{s>DJzFO><50(Sq-4sW()b*ck(;w15B`oy(jOv(OKR9MXMjcJ}_ohdGavwLrQCN&BaK7NdB&u-$`SOx}wVok9dUW_78-;)f}ALv2@(j_q>!&<iZL1(s*QUGLo#%tTW@GV<A(8)VXE-t4CIea#s4b1|a5)Po^+NOd!2j?Q)(iDj-P%`C(l-QB?MHtTpQ^p{<;Ttb#}kG(cgn4B3Y!Xgr<%og#X?P`p!`_F~0$%;Sj1?XAJM(0Myjy<ByZTq-rlGL`SCEKSoqP10VRb^jzuq`K)K?```*05AC3K<Etq${SEi0|XQR000O8{g7`{uJsj95&!@I5C8xGO#lD@WMOn+FD`Ila&#>)FfcGMEjcnXF)=eXH7znZIb<y~VP!TgVmCQsEjVR3VP-QhG%_(aFfC&+FfcB2Zeeh6c`k5aa&dKKbS`6ZV@obdW?*1oVpx@u_3hk0D>DF4O928D0~7!N00;p6kZ)6Gsb~|`0001K0RR9K00000000000001_fdBvi0BvP-VPs)+VJ~oDa&#>)FfcGMP)h*<6ay3h000O8{g7`{ZyYpT4gdfE3;+NC7ytkO0000000000q=5zj003=ebYWy+bYU+paA9(EEif=JFfcA-a$`_S0Rj{Q6aWAK2mt+%Z&MjqlDd8Y001xo002k;0000000000005+cN&x@>WMOn+FK}UUbS*G2FfcGJIWjXbF*7zbEiySdWGysdWi~BhH#uW1IAu6tW-~A}GBGzWEn_e+FfMa$VQ_GHE^uLTadl;MP)h*<6ay3h000O8{g7`{uJsj95&!@I5C8xGO#lD@0000000000q=7C2003lRbYU+paA9(EEif=JFfc7SGBYtTGd49XGC4V9Ei_?eHZ5W|Ib$t2WjJAGGcYtVF*h(RV=yo<E^}^SaBz7paA9(Bb!BueV{&6qO9ci10000400aPd0RRB20{{R300"}}, "inputs": [{"name": "df", "node_id": "c2fb6219-969d-44ac-9551-4996073630a8", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "0c4c020a-537e-4426-9a4f-e5332c8c8180", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "collision_type"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2ml9>Z&QT&G<)O#003+O000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;d1PQx$|MfZHhtWzrxktW%eO&1kNAhyxSp0rl|Vr(Nym4C+#5`s!>ukJZ_<_ywcj4*)by}*F-Kq3a@qp(IygefzL=fv9Ww@%Rq@y;tUP(jj`VnQ2WxNV!Z@B6mDZ`SJy7XFnuYVV4<5w)mSXg}SmE?m~CgnyJ{*LMwd%^l*@mP+B#?y#88pk2sm<4s|tH%4EiV>$&Yy#qRBXzSk5EYc%UD6!A!he>3^SU{L7P&8pS>FJmK+Ca>niy|)PJ+dNxxoizEF}-DS^7KGaUXA?!%s77cV4iQj08mQ<1QY-O00;mFk#AE1sg-UH0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?dXHu=>808mQ<1QY-O00;mFk#AFn>*Q^i0RRAu0ssI=0001FVRT_HaA9(EEif=JFfc7QFgQ0jGGQ?-W-(@EEi^eZFfCy*H)bt2IX7Z7WHw`EV`E`0V=yo<E^}^SaBz7paA9(Bb!BvolFv&cF%-wsaas%Z;vppib0~p|!Wd?pI<{pm!XCuKB8vw>MKVpIgPCbf(p^OQzwB}Ull`~$O<UI=iy(v~@B4i($>))?<L?PZ_zGV|xDA^YZiN}z^=%;}X(}_V972eN8!&JDMt);6Dz{J*qNUf^{%RyQKE!o}2W!Q#7wD6YopE!2V<&gA1O**>WJ~j0YI|%9f5aYcp+leey0C6#Z%QpoS-JNX?(|>GK0;kD2!n8^9pX;VZ=vAg^|{NU=j1m5_PXuvuLg{r_LujFK?v*!*YB@$JQPHPOwBFhGY$l73B!I!V~c4;nkkD}jL6=98TLUl*gVXoV`lo3mC|v+9VdSpPt-lmBpp-h%zWZ#L{%ZRuylO+L@Qp_)+-S3SNZ?y_*!1`oPWZt?<%TWkS9RNi^_hRYfZ=S0=f7=tRUdImMbBm&j@hWVsX7d<a7<VUz*;!BfW$rF-v;8`Ml7g7psv_gH%YR^5Ia4BQ?yjOk`;`N``}jG>z3D%f)Cf2t(+23~fJyfA|+rO9KQH0000800)t8Q+N`K#1a4i01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfBJQI5#*lVKFUcF=k~gG&wUcEnzV?W-T{4H)1qoHe+RDV__|0FfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}Bh(=`0rRaW@M9P)h*<6ay3h000O82a#`6g!(jl<NyEwYykiO6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j0000800)t8Qv#`#ZVmtd01N;C02lxO00000000000HlEx0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;mFk#AFn>*Q^i0RRAu0ssI=00000000000001_fms0n0AyiwVJ~oDa&#>)FfcGMEjKVYH#jn3F)d~>W@RljIWsUVVKFymEjKwgVl-qnV`XDwVJ%}YFfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j0000800)t8Q+N`K#1a4i01yBG08Ib@00000000000HlFe0{{SIVRT_HE^uLTbS*G2FfcGJH!wIiI5J@|EoL!hWi2#0GcYY-F*jx{H#s+AG-NhoWn*JuEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(a)B^wj000", "one_hot_encoder_model": "P)h>@6aWAK2ml9>Z&M|j)%Vl@003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*PQySDMSDJDYG<{?*urijq#zMNMY|e#$HZb^nw_N}%D=M-0wqP7DX#84XAUqq%o#?A{Gl-6v?rAl@~84n&4f~z)N7z^l&3I7OT4ScK?AaQ%7Q+@c;C1EFbw_Bbr0(X&ii-byu?t+t<p-b(0_-OuDi0>8eSm{bvS8yw7Ruzt<*jG9ai!ImXL~UqN^JrpUf0b@>t|QHh;M};VsAXf>~gk(Pe62kHI>64qxcbW@3&E$J(SBi<7qt*(LV&hsON&u57~Pw(NNK15ir?1QY-O00;mFk#AGa;ufV20000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?eSesz{A08mQ<1QY-O00;mFk#AFVmIq>f0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7OGC5;3Win(fH)A$tEi^beGc7q`W;rc6GB#p2Wid1{IWsaXV=yo<E^}^SaBz7paA9(Bb!BvYQcX)MF%(T>?Q@|E7a1}zixRLXjA7<Yoi{BmbSEMr>P|$`PNKuh%+w@Rsuuhi{y=|-x3-ERn84+JoOAEFAt(E%F+qfpPDn5~Ez$~Ta~UBtw;tO362V#D!(f--g2iZ-CD=bYShXWz{$JVQjVuMhXb01^p>XoLHj}gBL0V|vLOV^`)P;2bAb11OKT^w5Rt|aC5B#9B5<s(!7V@+2<ag2sm|zckPu_E*v$Fe(RzHA!V9c!Y<YDB8l-A~&i;)14{9P}<lg1WxMVAvE@d(ka9|(J^IXJgt>A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjM<A}%Y&B3_Pc{@_QTy>LNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&<XsNH&9Ch1QY-O00;mFk#AEESLvk^0000G0000@0001FVRT_HE^uLTbS*G2FfcGJH8MG4G-WbmEjME}W-T;0IWsLeVP-ikIWjh4H)Sz2F*!3bEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wkc$%7c_S`=!GXPLa0Rj{Q6aWAK2ml9>Z&M|j)%Vl@003zL000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;mFk#AGa;ufV20000C0000O00000000000001_fdv5o0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j0000800)t8Q+1XHVtxSt05Ado07w7;00000000000HlFQ0RRAGVRT_HaA9(EEif=JFfc7OGC5;3Win(fH)A$tEi^beGc7q`W;rc6GB#p2Wid1{IWsaXV=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;mFk#AEESLvk^0000G0000@00000000000001_fh_|70AyiwVJ|LlVRCdWFfcGMFfBDQIb$?sGGr|`V>V_jG&ngkEjeLkIW0LdHexqrF*GqbGcqk>FfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve0062300000"}}, "inputs": [{"name": "df", "node_id": "3e1bcebf-1353-4ab5-97a4-7fc0c2c00579", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "ec797d62-f14d-484c-bed6-769031cf10a7", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "authorities_contacted"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "authorities_contacted"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mooDZ&Q01*?i^z0046V000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*P6IIzMSDKO^~Q48yht0G6cI=u+SOpsX07<ec#H_q{yS!okdR1w=bU@zBD8~?#a2W*N=TpwlsN&Nq<6AM;6f;$m~1%igTW7^Lp10_BiMS(3_gkNeATX%S}fYldU@A~$vr0Hip{VN+s0{{AZY{>J@8h_@}a4sZGDsxAWJH(G{sa_yg4Di-?6&;azl9V)Uug@q68d-xLZDkt|!x^%xj5qp8RM!8qK9cW~3>2{DPW5e9oGjvud#T>q@=ez@A5C+L9;((^fC)q29Wf<d3}G3$TUktmWF$D1!uO9KQH000080BM_VQ>&~p6Al0X01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV@obdW?*1oVyL+(Q4Ii4O9KQH000080BM_VQ%M$|HJ||i0HFc^07w7;0AyiwVJ~oDa&#>)FfcGMEo5P1W;i%xIW1*nHfAj}V=-hcVqs%BEipM_IXE;mW@R)mG%aH=FfcB2Zeeh6c`k5aa&dKKbd6F?Ya1~Tl{fCjA->9JVGl;&kbo<8yUxZoz2sJUD53OFD5Yq3N9j_n){(R=A^6|)&>vBH?ceG+cGC0<dXZ+{d!x}C=HmQiN(t4}Oz8lFkcM$ij>A=}>k`<3Y>G}xQvx9*kN5E0`$GD?{m;!I2{GI1`WTtkzyaR(6&q017HAOZXG6D=_QN|@zK2HAfSd;GP|E_eJGWL|(}0HLw9opwc799mgN6o-|DB@4$xm`Z$S8>7C_Wm*bQn!S61{%-?rqVX_?LtRqrvEFZ}hEqG<f=V$5D&}A=na|pfVuQl(E)c3b~T#L)=;tAHll1HSAF)5|I#g@`u7*X@S922VU6rlW4%pN_xq@_inhqUIAZn=WSPdKIN(gT{*sdbHj~nHq-+<$Zz%cb#e#qT)p}acGIgimk5`w@v=7VMW;1i;tyc`571yD3k_Rf%I_>-ul4%vny}|PgO3|K-c4kjiZl`F__*k5U5yj9sMIX0K&fIrSJj!C=XqY`S-wc;vnN@Ws99cAi<2mhH_L`&zrf}E0Z>Z=1QY-O00;nSn{QLcJKwkx0000G0000@0001FVRT_HE^uLTbS*G2FfcGJWMN}wI5=cEEoEgkW-T;hF=Q=bVPiQhF*#y6I5ah8Wi&7}En_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wk*d*1wPEkO!0sv4;0Rj{Q6aWAK2mooDZ&Q01*?i^z0046V000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;nSn{QLAtTGc00000C0000O00000000000001_ffoS)0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j000080BM_VQ%M$|HJ||i0HFc^07w7;00000000000HlFi0RRAGVRT_HaA9(EEif=JFfc7-VPj@EIAl33Wo0&IEi_{>WG!N0V>vA`Ibu0DG&N>rG%z$RV=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;nSn{QLcJKwkx0000G0000@00000000000001_fnfsx0AyiwVJ|LlVRCdWFfcGMFfC+ZV`exwWH~KmWj1ClG-EMjEn;C~IV~|cVmUZ8HD+ZrFf=V=FfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve007<t00000", "one_hot_encoder_model": "P)h>@6aWAK2mooDZ&N~3HXYOe003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFoloHN(3<wMfZM2t$RYWTWxzQf`Wst+=e2lo@vaBO5(zy|J`y#Q9<0);&RWq)B&u=Qec9_JB1nTo>a-m&nkE|6V}G09wObQdJ}DO#9KKYI;cyaQZZzhPA~dPt%so>ug=B}j{FmGSyB{pue8z|46kvcg=>qG@Q2diNCTZ)r)ND%MR*Kb6!QU=m|MM0UYJTD8=Kz8_ZXe0yZD0ss3YdYk~>cM9!rI3Mqj9d)an;+J_Sa@DqWhf`TF>6?0;?L_RISBC&qmAU|mn%08mQ<1QY-O00;nSn{QK=wVr+s0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d9H4AnE08mQ<1QY-O00;nSn{QKfmIq>f0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7<W@I*FGBaZ>G&f-~Ei_|cF)d;?VlXW+Vr67yH#uWDVrDQcV=yo<E^}^SaBz7paA9(Bb!BvYQcX)MF%(T>?Q@|E7a1}zixRLXjA7<Yoi{BmbSEMr>P|$`PNKuh%+w@Rsuuhi{y=|-x3-ERn84+JoOAEFAt(E%F+qfpPDn5~Ez$~Ta~UBtw;tO362V#D!(f--g2iZ-CD=bYShXWz{$JVQjVuMhXb01^p>XoLHj}gBL0V|vLOV^`)P;2bAb11OKT^w5Rt|aC5B#9B5<s(!7V@+2<ag2sm|zckPu_E*v$Fe(RzHA!V9c!Y<YDB8l-A~&i;)14{9P}<lg1WxMVAvE@d(ka9|(J^IXJgt>A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjM<A}%Y&B3_Pc{@_QTy>LNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&<XsNH&9Ch1QY-O00;nSn{QJOSLvk^0000G0000@0001FVRT_HE^uLTbS*G2FfcGJW@cnIV=^;iEi^Y_GA%S?VKFUYHexU>F=Az8Wj8rvIbvooEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wkc$%7c_S`=!GXPLa0Rj{Q6aWAK2mooDZ&N~3HXYOe003zL000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;nSn{QK=wVr+s0000C0000O00000000000001_fdv5o0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j000080BM_VQ+1XHVtxSt05Ado07w7;00000000000HlFQ0RRAGVRT_HaA9(EEif=JFfc7<W@I*FGBaZ>G&f-~Ei_|cF)d;?VlXW+Vr67yH#uWDVrDQcV=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;nSn{QJOSLvk^0000G0000@00000000000001_fh_|70AyiwVJ|LlVRCdWFfcGMFfC?gWHw_mGh;0@H(@d@G-F{gEn+reFfB1+Wn^VHIb%6uW-u*dFfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve0062300000"}}, "inputs": [{"name": "df", "node_id": "0c4c020a-537e-4426-9a4f-e5332c8c8180", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "f01b1173-c983-428b-942f-e9d85a417fcf", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['vehicle_claim']=df['vehicle_claim'].round(2)"}, "inputs": [{"name": "df", "node_id": "ec797d62-f14d-484c-bed6-769031cf10a7", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "83855f44-84ce-4cb5-ac34-ac4242257444", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['total_claim_amount']=df['total_claim_amount'].round(2)"}, "inputs": [{"name": "df", "node_id": "f01b1173-c983-428b-942f-e9d85a417fcf", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "8e9baec0-858f-4eb1-b79c-1960609c48e6", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "policy_state"}}, "inputs": [{"name": "df", "node_id": "a2370312-2ab4-43a3-ae7d-ba5a17057d0a", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "8e00b503-e7ab-426b-be39-80a20f24785c", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "policy_state"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mmdyZ&PUaA<p0c003$M000vJ003=ebYWy+bYU-WVRCdWFfcGMFnx~AO2a@DM)y9(tkWSeCQ7%u>7ultxDB^6H|@~*^UjPIOWxgU1VKT!ALksti%=F)5^E9YC?SG6p-d6zES;4jLN0{zF_ZDfjo0{wv<(`qs6;jnQwA5sy6M~7y18lk{;q8*vG~tO$pp{kR@Sm!fqVC>n!RE)k3W(_cdZWk02<n^V}yNhTj26VoWN^~8~+M+5zmN`kx+6h`94^lGP*dZ@OjGk?==Q%)HLwLjAE{Y#u19m`+wR<6zEs{@1w);#j%hUXh$09X^P0&Ncd9FtLXug67ziZ0Z>Z=1QY-O00;mrv2Ro0x?aT&0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?c6r}!fS08mQ<1QY-O00;mrv2Rn!%@Qb`0RRAm0ssI=0001FVRT_HaA9(EEif=JFfc7MH85p1V=^%<Gh#S0Ei^STH7#N@F*GeQH)SwpGh<^lVrFG6V=yo<E^}^SaBz7paA9(Bb!BvoQbAAKFcfx735ezD%8_!2EER#)S;<new%Uao;(*Y!!z6?@r?y!%Cn>f|1*m^yryciC_9OT+IJ7_qnzYN)d*AmxzxO@)^zc)P0Fvli9|bP)kssw?56&5e>;?dkMH?<|{NCw!TXA&<p{pj$^TB-6@30#$7&N-Y>Z>-E9Klc>1!l_hkFRF(T~rJ~Z~*Ze#R{R!q1NmgLF9vj4(_PZ*d@Isl&C~)Us5ROJ(E2Ex-g8wXm>k8LD=&__+|awho(pIUj>5Qp!>_~{`7XYKd$);Bj*{QcJ-EqAnaq@=nKlG%q>qJe9$YQP1)3VGouMj0N#J4FqewD<Z>n~)y<q%!ZOY*!+*S6Vy~xyjES*&HnF5nWGNIkWPEl@YF4$>ng_=(^-q0l1+T1p{tvdDRn`{{o;0;(Wo=GoN|CYKKrH?N)~;ZM5=)>@9xb?8i^cf@;Nz9Sca@G;7l~7vCNz!ridm_6oX8QEgNzF)i{Vi64|162InT3vlnw{)vn-K=yx^n#FpAv8-@3D(y8m|%P)h>@6aWAK2mmdyZ&UDH7Euxa000mG002z@003lRbYU+paA9(EEif=JFfc7MH85p1V=^%<Gh#S0Ei^STH7#N@F*GeQH)SwpGh<^lVrFG6V=yo<E^}^SaBz7paA9(Bb!BueV{&6lE=p!#U|?ccbE;`Gk307l08mQ-0u%!j0000804=d^Q)u`h&fow50BQjM022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mmdyZ&Tp9Ud0Xo000aC000;O0000000000005+c5di=IZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8EwOJ?$;}cdodEy<gaQBnNB{r;0000000000q=8id003lRbYU-WVRCdWFfcGMFfB7RFl9DlGBGVPVmLA_G&M0bEn+h<G%YeWWiVzlV`DX9W@Rm7FfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mmdyZ&UDH7Euxa000mG002z@0000000000005+cSOWk6WMOn+FD`Ila&#>)FfcGMEi*MRWj13nF)cG<I5I6XH8C|UVly!`EiyM{FlIAjV>M!CWi4YcFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0M!Ek0000", "one_hot_encoder_model": "P)h>@6aWAK2mmdyZ&Ty@0j<;k003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*P6IIzMSDJD>W$@1U`e(`k%B}76|FU_XOk6vX*{MN%D>|T@s!fI7~OM@4=^~)8G4AkQ<!krlgbJCRe7gILMcq@G0-N;It<YgZ|mcr0hv5yL7$*+yTz?;uiLI$E$0oK`Dfxh#Zb))rIntce-ATVxiZ;${G&9aX*J%wpq($VPAZRni`9I9DWoEI(N#vsCo{xT+;U70m<9R~U8V+Nk?)=P8k$pYkHI>+4{zwsWnzvD$LgRFi<75cu^+8)`DeZSV`Dydmt)7P4^T@31QY-O00;mrv2Rm`t;dB90000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?cc%lP{-08mQ<1QY-O00;mrv2RoC3|$C-0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7JG-hTwGc#l@GGb&mEi^DOGA%ebG%_t=Gc-6bVK*~4H!x)_V=yo<E^}^SaBz7paA9(Bb!BvYQe8_UF%(T>xAs9Fe8`Z2c_;yk!Wd?pIy)`<pl>1~qP{I6X(w?9Gcz?wSE?=S&+rHOL%g+hQ3MmX+>dkaJvZe1=prVFFwzYP2B$?@0d1}$gqGGrJ0BuATZb6z8eFg#EwTjrr^g$1EX>`F9sQA|AQ&BBx;YX~-qvP%Sv*P$?OW)eNt?Q`4gdsiBf2iNEM?`8hyB10y6pg(ZMTr0eI~z=-oXTW*qeF(8r}A<Z?yUW>;q#Km6?Z;A5vPITQ0@|MDjPi{I4{&s4My-;SrAz-Ti{F_nL!qKaq}``hl0yiCj2ApBqoiO~;ZAn000{ax7%3kUF<)@cW5XqFj_dtpSK7<3CdvBc_mEt#-A{78Q`Bf&8ehoG2=LHqn|5l+aerLsmf~h8mzSScdFf6EvRBujh!KtQ7B+rn6eH6Z1IYac5^ZDRkb6R6kd}B$rAJvrOebRhFh{o+fEO&U*Vv5~*G~%=^235Wow*G!Qz6zw!!DO9KQH0000804=d^Qw$Vgs}cYJ01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfA}NW@b4vGh{6?Vq`ZhG%ztTEjTzdGA&^<G&nF}H#0alFl8-cFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}C6FFW-IcpOqN^P)h*<6ay3h000O8EwOJ?<NE=v)BpegX#oHL6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j0000804=d^Q--a_g$@7!01N;C02lxO00000000000HlEh0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;mrv2RoC3|$C-0RR9n0ssI=00000000000001_fk^=X0AyiwVJ~oDa&#>)FfcGMEig1@W;ru6WGymcWH&7|FflSMI5;#iEnzb>I51&1GdMReWi4YcFfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j0000804=d^Qw$Vgs}cYJ01yBG08Ib@00000000000HlF00{{SIVRT_HE^uLTbS*G2FfcGJFf?XnIWse4Eiz(cH!U<UF)}SUI5aXXVKX#1Fkv?{I5#k5En_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(atOEc5000"}}, "inputs": [{"name": "df", "node_id": "8e9baec0-858f-4eb1-b79c-1960609c48e6", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "abc12bbb-4fbe-4bb0-931e-b15caa5cf244", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "policy_liability"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2mnB_Z&R+Dz<A^U003?Q000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZHhtW(Qrfzoba(?tamh;1~nC#_Y#7~6<Y<=-(uLPDb3JNKMBXHXVXf-M9-2#lzXBvM2^31`JXm<yA5&TRZ)=QVv0?}H`_4P@geWpoj??d@tU+v}y28z~!@`B&m(g0JS4kU}odefov0Trs+af0SK|%bRt7*AJVaua(N9+haAKA(R|Tz7MuSnOy8t_+B^uH;us>HFnluZ>-5<y&lnPOFRFLc3wLXMVK1UG+?p$>4N|AQ(gDdtL`+H9$6E=jFBofFufJ>YJ8xi{#hsHz%Wf$Ur<W}1QY-O00;m;v2Rn*POxbX0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?dXI52x108mQ<1QY-O00;m;v2Rm@4e0Hj0RRA)0ssI=0001FVRT_HaA9(EEif=JFfc7PGGs9^VmUA^He)b3Ei^YXGA&^-G%zhTVPj-wF)(I0GcsZ=V=yo<E^}^SaBz7paA9(Bb!BvoQcX)MF%(V5eu>?tWMCF0P*LbhGo8-Z7B}uhL_}N&kH;g^$>?Zi+D=kMr2ogY_#gaR-hTM4;3nL2&ONy~mmKe%xESF}e9^*HD9X6(1Zb<|@piyT#pS%+;T$34J8N*R-J@cma+q$Q5}=OX_W8VU{kB(H-(8q0bG)=+%U1g}Jr?=Ze?7dAslXQUEwZ6RD3#vJb8(9;Tt<G86jh?lQm#lPQ<<6160SAh#3n*@%XVyMv+CfQ-7F*f?ESq9(@XOQfUSD9{#d9#6gI27-(<BN*aL!jFag>uWVZ+z<>yR{1%zOWJnUNLIvHi8<T1`TBHLd&^h7ZzCR1rxKDuG4G$Im4kf*{OHMe6)2UMGUIyAIJ<3y@R)4|_6nu&BSy~_gnvj5d5R`e>;=`YmztRy=JdN|6ANHTpiRf-Pa1#<QSnE?=?l1rjRUn0P)+3ep8k%LvhKWW}rZKUBcmovAq6;2ZsHF(^Q;;t9TI1YQgINFJOK@dcN7xdj;x8r#{?gnAh-?kkGHtxaNkKs=~15ir?1QY-O00;m;v2RmTMf`vg0000G0000@0001FVRT_HE^uLTbS*G2FfcGJHZo)}F=9C|EjD8?IW06dGcqk<F*GnOHeq9AW-%~kI5RS0En_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wka9{W^HLWMs4**b00Rj{Q6aWAK2mnB_Z&R+Dz<A^U003?Q000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;m;v2Rn*POxbX0000C0000O00000000000001_ffWG&0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV^B*00u%!j0000806?*CQ-clY?VbSu0GI**07w7;00000000000HlFg0RRAGVRT_HaA9(EEif=JFfc7PGGs9^VmUA^He)b3Ei^YXGA&^-G%zhTVPj-wF)(I0GcsZ=V=yo<E^}^SaBz7paA9(Bb!Bu=O928D0~7!N00;m;v2RmTMf`vg0000G0000@00000000000001_fn5Ut0AyiwVJ|LlVRCdWFfcGMFfBGRWHB*fIWR3YV=y@_G&eIcEnzV<FfBG=V`OGAFlIP2GGZ-bFfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve007zp00000"}}, "inputs": [{"name": "df", "node_id": "8e00b503-e7ab-426b-be39-80a20f24785c", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "e6bf082b-005a-41d8-9a5b-a48c414bf230", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "customer_gender"}}, "inputs": [{"name": "df", "node_id": "abc12bbb-4fbe-4bb0-931e-b15caa5cf244", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "8594c20b-5214-4524-b9ed-7368d4f279f8", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "customer_gender"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mofWZ&MJsG`Zja003<P000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YgO9L?!h4=o9tfwLKpp|ZQ(?wBG+z!{|&NMVHZxXSV{&%k<2!h=n&i9>j7TQ71Vk;ue5)$YG%AA0X(mOdJlu9U{3fVC4g24}@eKhDqBZ~Ey3-~0q?YiHncIY-;KXi>)_!p#XV_<VHl~gP6pJAn#tJZM%M>*b3YaCuD)v|{x%!A(po6jQFl4`k&j-i4+jXHj>A0QZqyYLQf@f{Hp$2R<yb<HJume0s5Brc6GI6-yg@rwTXr*Zu0@j_lo55&snQ$ofj!ncY+pYAd1f36`><oV0Z7f?$B1QY-O00;nPv2RmV_56np0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d{X|h8A08mQ<1QY-O00;nPv2RoAYW8WL0RRA!0ssI=0001FVRT_HaA9(EEif=JFfc7JI503{VPrKeF*Z13Ei`6hVJ$g0Wo9j5I5aajG+{MkIWS`_V=yo<E^}^SaBz7paA9(Bb!BvoQbB9mFc?+b#5HR!J1EHDLl8}AArtnTdQRA-x6#8WWrx8wh8#<csVt|G+)_&Z&e(Z>Vt;F2nsn>d4R#UV_ukX@^geWUa-Lv>7x?2C2T=Ifk1}-VxB9ZV(ZrXc=1L%hCeaR@J6}<^v)irrdcpph@Bx$J`LleZ??KzRg)rC))tkOua&!6F&KEEQd+69Bdx{rA*%M><E%vaFj=Q9*D(g1%u27;D+Wkr4{^%DuLTKQHQ5YQr5$=a0ABFE9&%M9&ApWhv-XIu!>kPhh4uaQzNf}140R(>F05T7SV?r8p!T6Fx2zH-`T?=EYMw4F3ViqHE^h9B=6ocZb6^@zahSkFHk~>bmckZaWT?#s*)|qzh=$Oh%sM6BehdZiyy`~;z0e++Zd5>@5ZKQYq!LGe3eF5;i(T-QzzHODFGkAem{R7m{@Ir|VFs2U?z-zVov_j-`8}LPKhTDw{6PCm*86FmGrOII}r=^^vrI517b6LKUd6s2mmS)o=pS(=dSWdE{oF0Wy1RGCb?Pu^Ce*jQR0|XQR000O8X0dNm%TJaX5&!@I5C8xGO#lD@WMOn+FD`Ila&#>)FfcGMEigDRFk@k4H7zkVIAbj|W@BM3IXGozEnzq`GdMJ1HDft2V=ZGaFfcB2Zeeh6c`k5aa&dKKbS`6ZV@obdW?*1oVlcj=)85||JRbm1O928D0~7!N00;nPv2RllxHP%o0001O0RR9K00000000000001_fdBvi0BvP-VPs)+VJ~oDa&#>)FfcGMP)h*<6ay3h000O8X0dNmRrUOb4gdfE3;+NC7ytkO0000000000q=69u003=ebYWy+bYU+paA9(EEif=JFfcA-a$`_S0Rj{Q6aWAK2mofWZ&T`O_GzC1005E#002k;0000000000005+cRRI71WMOn+FK}UUbS*G2FfcGJFgP$UV_{@9EipDYV=XjhV__{hIAvxnVK_81I5c53V>vKmEn_e+FfMa$VQ_GHE^uLTadl;MP)h*<6ay3h000O8X0dNm%TJaX5&!@I5C8xGO#lD@0000000000q=8xk003lRbYU+paA9(EEif=JFfc7JI503{VPrKeF*Z13Ei`6hVJ$g0Wo9j5I5aajG+{MkIWS`_V=yo<E^}^SaBz7paA9(Bb!BueV{&6qO9ci10000400aPd0RRBk0{{R300", "one_hot_encoder_model": "P)h>@6aWAK2mofWZ&NPv-38VF003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFpZB*O9Md+hVTB2oVUYlx3+e#A}Cn&<aNl-X6@p9k;!<l?0<KxC<qE(a(H>(C&>w{!<?aq$j=HB+9Ro)kYAK{Y9f@vq#gp@M!5+#j>LyL95f)ar!43b^zF9W>2|Z;tvCB^1Lyt|ah_wS=1ys)SLoluO4nYQ#~S`nnxVTK_LgjWg+oPZAN>KV`2-`!bdOn}pU`D$U=FFsb#zrC<dd=S<C-Ie(a~LaMR)cSbL5)GBHy_A;*Ex-fsrOGP9A?<CS2%$*1tb7=97bExp)UqO9KQH000080A{goQ+!pNy$%2X01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL<qV@obdW?*1oVqm@dNe=)}O9KQH000080A{goQ+1XHVtxSt05Ado07w7;0AyiwVJ~oDa&#>)FfcGMEio`+F)?IgG%aLeH#jXcI5IgcVly=`EiyA<V>vW6Gh#4eGA&~;FfcB2Zeeh6c`k5aa&dKKbbV4yODi!HO=InIp$iuoGBAr0uqcdS=1rY9EiQB?A|mQeMAA;8!^_OnBvq;w{2Bg0e~7oXiXxc6<$j!V@3|o-`=>EMgpp22FgPvJ3TSf~AvCuh+WZp1S>MB8m*9fMXqF||KRQ^oBVqnu+2M^W1;J<s)3u>+^13#Yv*JNoXx~CRP1@9jbpRlE1JOTH%TiVjdDsv9ptBM{vyB$=v+v}0(g&De4|`AEbEC7e`-@gTfPG-htn%bx<cE~j=9-I<0FnG%FTaz<7Ij6J6CUvh(XAf{d#gD(w`1wJsqc9yoydg~^ri90++-wKpIK+dL&rj<3aN9;`WKI^66LJ)Z4E%o8J|pHjF>=rvD*1OTU0=j2J*wYa-yj0@mOotS3+Ak4Os<|7-)dPU>ULxP0)Bcy__O?xKO-Zn)YJFcFg05$L-C*xX^h!QoUStlUyn_$TF4xQ(2m(d77lXIP3mRl1O#aLEhW)g8*LewSmwH{FOIQO9KQH000080A{goQx8|^r4j%D01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfB1KVlgpfV>B&fVmCM~G&nLjEn+h@FfB4OVPiQoH8Wx`VlpjbFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}AWhnt1lyKPxi;P)h*<6ay3h000O8X0dNmF7n+4)&KwiX#oHL6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j000080A{goQ+!pNy$%2X01N;C02lxO00000000000HlEj0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;nPv2Rm#mIq>f0RR9n0ssI=00000000000001_flC1Z0AyiwVJ~oDa&#>)FfcGMEio`+F)?IgG%aLeH#jXcI5IgcVly=`EiyA<V>vW6Gh#4eGA&~;FfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j000080A{goQx8|^r4j%D01yBG08Ib@00000000000HlF00{{SIVRT_HE^uLTbS*G2FfcGJF)(5=F=S&jEo5RhI4v|dGC3_`Gc_<RGBaUgIW#phVlZMdEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(atOEc5000"}}, "inputs": [{"name": "df", "node_id": "e6bf082b-005a-41d8-9a5b-a48c414bf230", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "177ac227-1e30-41cb-9bb1-77beb3736d82", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "customer_education"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2mpGqZ&Uc&mo4M~003|S000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZG$*Qphgs)24{(?tamh;1~rC#_Y#7~6<a<=>eG2?>dA@7#0loP~CfvsjBrdkG2j0cB1=N9mm$5lSVLFNJK_Z-T*3q-`|lL?epzm<#wM)@s%Dty*@f>z8+pnE5xPY-3<^CtKMrz<-5B%UrdF!#~PF!|uLcq1u7E<qGrQx4`DJh_$3z9-?EYpihI2m-Pb#<8Tu`z|FoRV&d3_-?FZ`M9*?GRYSS_GBv{B2-TJ6D+|}(8it?VKifzYiSs<ePh^l@PYL-l624aq`t*odi)p&~22e`_1QY-O00;njv2RncGU1L60000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d<oqqcV08mQ<1QY-O00;njv2Rm}x&v>-0RRBv0ssI=0001FVRT_HaA9(EEif=JFfc7*V>mcwGC4FYF*#&0Ei__eGc7nYW-u)_Fl1vnHf1tnGGjI^V=yo<E^}^SaBz7paA9(Bb!BvoQo(B5KoDJRT*r+Kp@%gR=pY0R3Al?CIjZf3f=dsfhZ5Qz3ZaCRc4dpMBxiS>5Q6_F^dm}d{hfY7N3qkoDW$!LnfK<+=sosy@-`%he6h$^t50^IY?7uMqy6?tKAm~*(pgnWdw0SwS(ys%y%JL;1VZTC-GcAdEpAxb-(M7Gs0Hmymd=E%R2%9I{Vuc*HaxB&t*bO=Mm({eAA{`5+t3i~$ditqGxh0Q!`6ZkjE=C~mMjrcPn2TU1d}E@YS;~#YqO%ag%pJ-%|nWGdcVjXLS5{*j=R@#Nyq6mk@I15@15v=__v1O?oRg>cR%C3*6}}6a$Lw0!F9L?B8;5AZC7ea*}M*l-npZfLg~CJ?MD#}Xn^d4KPL2rq)=QegrTasqJ=PwGsEm}){Sqj=fWTR+Necp{Jx*%LUQeo&u;vZ6*aZt4EWXjPaRvsYlH9qgRO7n<rLtwDh<m^{bnJhKZZXL%fEmnG%S%~1@!%U3vg?>{J2E+>$SnBh4R*kcp(h~8hZQ5BA46?vJuY)5f@pO42KzimJQ=L=5Z8{!r|aBih^tqCwz3^xNiNrAof#u-9G?OO9KQH000080D7@+Q;)=OZxR3i01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfC$ZI5=i9IW#RXIb<;{G-6~kEjTo0FfBGPWMertWin$jV>T^gFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}Bh3u-ojbg9;Y_P)h*<6ay3h000O8da-X)_}iB)<NyEwZ~*`S6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j000080D7@+Q?WAPjt&3-01N;C02lxO00000000000HlEx0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;njv2Rm}x&v>-0RRBv0ssI=00000000000001_fms0n0AyiwVJ~oDa&#>)FfcGMEn;IhIA$_AG%YbXWHBu?Vq`NdI5cK3EjBP@V>vcuGGj7hHZ5Z?FfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j000080D7@+Q;)=OZxR3i01yBG08Ib@00000000000HlF~0{{SIVRT_HE^uLTbS*G2FfcGJVq-WsW->W6EipM{F)cJ=WHT)|G-fa@HZWvkIW}c7V=`klEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(a00aO4000"}}, "inputs": [{"name": "df", "node_id": "8594c20b-5214-4524-b9ed-7368d4f279f8", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "9297ae66-97e9-4738-9cee-5a02ddea44be", "type": "TRANSFORM", "operator": "sagemaker.spark.manage_columns_0.1", "parameters": {"operator": "Drop column", "drop_column_parameters": {"column_to_drop": "customer_zip"}}, "inputs": [{"name": "df", "node_id": "177ac227-1e30-41cb-9bb1-77beb3736d82", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "dba3fd7e-01a7-4629-a4c8-aa520328439a", "type": "TRANSFORM", "operator": "sagemaker.spark.join_tables_0.1", "parameters": {"left_column": "policy_id", "right_column": "policy_id", "join_type": "leftouter"}, "inputs": [{"name": "df", "node_id": "83855f44-84ce-4cb5-ac34-ac4242257444", "output_name": "default"}, {"name": "df", "node_id": "9297ae66-97e9-4738-9cee-5a02ddea44be", "output_name": "default"}], "outputs": [{"name": "default"}]}]}


--------------------------------------------------------------------------------
/2-step-functions-pipelines/step-function-workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/2-step-functions-pipelines/step-function-workflow.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/01_setup_mwaa_pipeline.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Setup and Integrate Data Wrangler with Apache Airflow ML Pipeline\n",
  8 |     "\n",
  9 |     "<div class=\"alert alert-warning\"> ⚠️ <strong> PRE-REQUISITE: </strong>\n",
 10 |     "\n",
 11 |     "Before proceeding with this notebook, please ensure that you have \n",
 12 |     "    \n",
 13 |     "1. Executed the <code>00_setup_data_wrangler.ipynb</code> Notebook</li>\n",
 14 |     "2. Created an <a href=\"https://docs.aws.amazon.com/mwaa/latest/userguide/what-is-mwaa.html\" target=\"_blank\">Amazon Managed Workflow for Apache Airflow (MWAA)</a> environment. Please visit the Amazon MWAA <a href=\"https://docs.aws.amazon.com/mwaa/latest/userguide/get-started.html\" target=\"_blank\">Get started</a> documentation to see how you can create an MWAA environment. Alternatively, to quickly get started with MWAA, follow the <a href=\"https://catalog.us-east-1.prod.workshops.aws/v2/workshops/795e88bb-17e2-498f-82d1-2104f4824168/en-US/workshop-2-0-2/setup/mwaa\" target=\"_blank\">step-by-step instructions</a> in the MWAA workshop to setup an MWAA environment.\n",
 15 |     "\n",
 16 |     "</div>\n",
 17 |     "\n",
 18 |     "This notebook creates the required scripts for the Apache Airflow workflow and uploads them to the respective S3 bucket locations for MWAA. We will create-\n",
 19 |     "\n",
 20 |     "1. A `requirements.txt` file and upload it to the MWAA `/requirements` prefix\n",
 21 |     "2. We upload the `SMDataWranglerOperator.py` Python script which is the SageMaker Data Wrangler custom Airflow Operator to the `/dags` prefix.\n",
 22 |     "2. A `config.py` Python script that will setup configurations for our DAG Tasks and upload to the `/dags` prefix.\n",
 23 |     "3. And finally, we create an `ml_pipeline.py` Python script which sets up the end-to-end Apache Airflow workflow DAG and upload it to the `/dags` prefix.\n",
 24 |     "---"
 25 |    ]
 26 |   },
 27 |   {
 28 |    "cell_type": "markdown",
 29 |    "metadata": {},
 30 |    "source": [
 31 |     "Import required dependencies and initialize variables\n",
 32 |     "\n",
 33 |     "<div class=\"alert alert-warning\"> ⚠️ <strong> NOTE: </strong>\n",
 34 |     "    Note: replace <code>bucket</code> name with your MWAA Bucket name.\n",
 35 |     "</div>"
 36 |    ]
 37 |   },
 38 |   {
 39 |    "cell_type": "code",
 40 |    "execution_count": 209,
 41 |    "metadata": {},
 42 |    "outputs": [
 43 |     {
 44 |      "name": "stdout",
 45 |      "output_type": "stream",
 46 |      "text": [
 47 |       "SageMaker version: 2.59.5\n",
 48 |       "S3 bucket: airflow-data-wrangler\n"
 49 |      ]
 50 |     }
 51 |    ],
 52 |    "source": [
 53 |     "import time\n",
 54 |     "import uuid\n",
 55 |     "import sagemaker\n",
 56 |     "import boto3\n",
 57 |     "import string\n",
 58 |     "\n",
 59 |     "# Sagemaker session\n",
 60 |     "sess = sagemaker.Session()\n",
 61 |     "\n",
 62 |     "# MWAA Client\n",
 63 |     "mwaa_client = boto3.client('mwaa')\n",
 64 |     "\n",
 65 |     "# Replace the bucket name with your MWAA Bucket\n",
 66 |     "bucket = 'airflow-data-wrangler'\n",
 67 |     "\n",
 68 |     "print(f'SageMaker version: {sagemaker.__version__}')\n",
 69 |     "print(f'S3 bucket: {bucket}')"
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "markdown",
 74 |    "metadata": {},
 75 |    "source": [
 76 |     "Creating and uploading this `.airflowignore` file helps Airflow to prevent interpreting the helper Python scripts as a DAG file. "
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "code",
 81 |    "execution_count": 565,
 82 |    "metadata": {},
 83 |    "outputs": [
 84 |     {
 85 |      "name": "stdout",
 86 |      "output_type": "stream",
 87 |      "text": [
 88 |       "Writing scripts/.airflowignore\n"
 89 |      ]
 90 |     }
 91 |    ],
 92 |    "source": [
 93 |     "%%writefile scripts/.airflowignore\n",
 94 |     "SMDataWranglerOperator\n",
 95 |     "config.py"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": 566,
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "s3_client.upload_file(\"scripts/.airflowignore\", bucket, f\"dags/.airflowignore\")"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "markdown",
109 |    "metadata": {},
110 |    "source": [
111 |     "---\n",
112 |     "## Create `requirements.txt` file"
113 |    ]
114 |   },
115 |   {
116 |    "cell_type": "markdown",
117 |    "metadata": {},
118 |    "source": [
119 |     "Write a `requirements.txt` file and upload it to S3. We will need a few dependencies to be able to run our Data Wrangler python script using the Apache Airflow Python operator, mainly the SageMaker SDK."
120 |    ]
121 |   },
122 |   {
123 |    "cell_type": "code",
124 |    "execution_count": 556,
125 |    "metadata": {},
126 |    "outputs": [
127 |     {
128 |      "name": "stdout",
129 |      "output_type": "stream",
130 |      "text": [
131 |       "Writing scripts/requirements.txt\n"
132 |      ]
133 |     }
134 |    ],
135 |    "source": [
136 |     "%%writefile scripts/requirements.txt\n",
137 |     "awswrangler\n",
138 |     "pandas\n",
139 |     "sagemaker==v2.59.5\n",
140 |     "dag-factory==0.7.2"
141 |    ]
142 |   },
143 |   {
144 |    "cell_type": "code",
145 |    "execution_count": 557,
146 |    "metadata": {},
147 |    "outputs": [],
148 |    "source": [
149 |     "s3_client = boto3.client(\"s3\")\n",
150 |     "s3_client.upload_file(\"scripts/requirements.txt\", bucket, f\"requirements/requirements.txt\")"
151 |    ]
152 |   },
153 |   {
154 |    "cell_type": "markdown",
155 |    "metadata": {},
156 |    "source": [
157 |     "---\n",
158 |     "## Upload the custom SageMaker Data Wrangler Operator\n",
159 |     "\n",
160 |     "In this step we will upload the [custom Airflow operator](https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html) for SageMaker Data Wrangler. With this operator, you can pass in any SageMaker Data Wrangler `.flow` file to Airflow to perform data transformations."
161 |    ]
162 |   },
163 |   {
164 |    "cell_type": "code",
165 |    "execution_count": 558,
166 |    "metadata": {},
167 |    "outputs": [],
168 |    "source": [
169 |     "s3_client.upload_file(\"scripts/SMDataWranglerOperator.py\", bucket, f\"dags/SMDataWranglerOperator.py\")"
170 |    ]
171 |   },
172 |   {
173 |    "cell_type": "markdown",
174 |    "metadata": {},
175 |    "source": [
176 |     "---"
177 |    ]
178 |   },
179 |   {
180 |    "cell_type": "markdown",
181 |    "metadata": {},
182 |    "source": [
183 |     "## Update MWAA IAM Execution Role"
184 |    ]
185 |   },
186 |   {
187 |    "cell_type": "markdown",
188 |    "metadata": {},
189 |    "source": [
190 |     "Every MWAA Environment has an [Execution Role](https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-create-role.html) attached to it. This role consists of permissions policy that grants Amazon Managed Workflows for Apache Airflow (MWAA) permission to invoke the resources of other AWS services on your behalf. In our case, we want our MWAA Tasks to be able to access SageMaker and S3. Edit the MWAA Execution role and add the permissions listed below-\n",
191 |     "\n",
192 |     "- `AmazonS3FullAccess`\n",
193 |     "- `AmazonSageMakerFullAccess`"
194 |    ]
195 |   },
196 |   {
197 |    "cell_type": "markdown",
198 |    "metadata": {},
199 |    "source": [
200 |     "## Setup SageMaker Role for MWAA"
201 |    ]
202 |   },
203 |   {
204 |    "cell_type": "markdown",
205 |    "metadata": {},
206 |    "source": [
207 |     "Next, we will create a SageMaker service role to be used in the ML pipeline module. To create an IAM role for Amazon SageMaker\n",
208 |     "\n",
209 |     "- Go to the IAM Console - Roles\n",
210 |     "- Choose Create role\n",
211 |     "- For role type, choose AWS Service, find and choose SageMaker, and choose Next: Permissions\n",
212 |     "- On the Attach permissions policy page, choose (if not already selected)\n",
213 |     "  - AWS managed policy `AmazonSageMakerFullAccess`\n",
214 |     "  - AWS managed policy `AmazonS3FullAccess` for access to Amazon S3 resources\n",
215 |     "- Then choose Next: Tags and then Next: Review.\n",
216 |     "- For Role name, enter AirflowSageMakerExecutionRole and Choose Create Role\n",
217 |     "\n",
218 |     "Alternatively, we can also use the default SageMaker Execution role since it already has these permissions.\n"
219 |    ]
220 |   },
221 |   {
222 |    "cell_type": "code",
223 |    "execution_count": 97,
224 |    "metadata": {},
225 |    "outputs": [
226 |     {
227 |      "data": {
228 |       "text/plain": [
229 |        "'arn:aws:iam::965425568475:role/service-role/AmazonSageMaker-ExecutionRole-20201030T135016'"
230 |       ]
231 |      },
232 |      "execution_count": 97,
233 |      "metadata": {},
234 |      "output_type": "execute_result"
235 |     }
236 |    ],
237 |    "source": [
238 |     "iam_role = sagemaker.get_execution_role()\n",
239 |     "iam_role"
240 |    ]
241 |   },
242 |   {
243 |    "cell_type": "markdown",
244 |    "metadata": {},
245 |    "source": [
246 |     "---\n",
247 |     "## Setup configuration script\n",
248 |     "\n",
249 |     "In this step we create a helper script to define the model training and model creation task configurations. This script will be used by the DAG tasks to obtain various configuration information for model training and model creation.\n",
250 |     "\n",
251 |     "<div class=\"alert alert-warning\"> ⚠️ <strong> NOTE: </strong>\n",
252 |     "    Note: replace <code>bucket</code> with the SageMaker Default bucket name for your SageMaker studio domain.\n",
253 |     "</div>\n"
254 |    ]
255 |   },
256 |   {
257 |    "cell_type": "code",
258 |    "execution_count": 559,
259 |    "metadata": {},
260 |    "outputs": [
261 |     {
262 |      "name": "stdout",
263 |      "output_type": "stream",
264 |      "text": [
265 |       "Writing scripts/config.py\n"
266 |      ]
267 |     }
268 |    ],
269 |    "source": [
270 |     "%%writefile scripts/config.py\n",
271 |     "#!/usr/bin/env python\n",
272 |     "import time\n",
273 |     "import uuid\n",
274 |     "import sagemaker\n",
275 |     "import json\n",
276 |     "from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook\n",
277 |     "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
278 |     "\n",
279 |     "def config(**opts):\n",
280 |     "    \n",
281 |     "    region_name=opts['region_name'] if 'region_name' in opts else sagemaker.Session().boto_region_name\n",
282 |     "    \n",
283 |     "    # Hook\n",
284 |     "    hook = AwsBaseHook(aws_conn_id='airflow-sagemaker', resource_type=\"sagemaker\")\n",
285 |     "    boto_session = hook.get_session(region_name=region_name)\n",
286 |     "    sagemaker_session =  sagemaker.session.Session(boto_session=boto_session)\n",
287 |     "    \n",
288 |     "\n",
289 |     "    training_job_name=opts['training_job_name']\n",
290 |     "    bucket = opts['bucket'] if 'bucket' in opts else sagemaker_session.default_bucket() #\"sagemaker-us-east-2-965425568475\"\n",
291 |     "    s3_prefix = opts['s3_prefix']\n",
292 |     "    # Get the xgboost container uri\n",
293 |     "    container = get_image_uri(region_name, 'xgboost', repo_version='1.0-1')\n",
294 |     "    \n",
295 |     "    ts = f\"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}\"   \n",
296 |     "    config = {}\n",
297 |     "    \n",
298 |     "    config[\"data_wrangler_config\"] = {        \n",
299 |     "        \"sagemaker_role\":             opts['role_name'],\n",
300 |     "        #\"s3_data_type\"              :    defaults to \"S3Prefix\" \n",
301 |     "        #\"s3_input_mode\"             :    defaults to \"File\", \n",
302 |     "        #\"s3_data_distribution_type\" :    defaults to \"FullyReplicated\", \n",
303 |     "        #\"aws_conn_id\"               :    defaults to \"aws_default\",\n",
304 |     "        #\"kms_key\"                   :    defaults to None, \n",
305 |     "        #\"volume_size_in_gb\"         :    defaults to 30,\n",
306 |     "        #\"enable_network_isolation\"  :    defaults to False, \n",
307 |     "        #\"wait_for_processing\"       :    defaults to True, \n",
308 |     "        #\"container_uri\"             :    defaults to \"415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.x\", \n",
309 |     "        #\"container_uri_pinned\"      :    defaults to \"415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.12.0\",  \n",
310 |     "        \"outputConfig\": {\n",
311 |     "              #\"s3_output_upload_mode\":     #defaults to EndOfJob\n",
312 |     "              #\"output_content_type\":       #defaults to CSV\n",
313 |     "              #\"output_bucket\":             #defaults to SageMaker Default bucket\n",
314 |     "              \"output_prefix\": s3_prefix     #prefix within bucket where output will be written, default is generated automatically\n",
315 |     "        }\n",
316 |     "    }\n",
317 |     "\n",
318 |     "    config[\"train_config\"]={\n",
319 |     "        \"AlgorithmSpecification\": {\n",
320 |     "            \"TrainingImage\": container,\n",
321 |     "            \"TrainingInputMode\": \"File\"\n",
322 |     "        },\n",
323 |     "        \"HyperParameters\": {\n",
324 |     "            \"max_depth\": \"5\",\n",
325 |     "            \"num_round\": \"10\",\n",
326 |     "            \"objective\": \"reg:squarederror\"\n",
327 |     "        },\n",
328 |     "        \"InputDataConfig\": [\n",
329 |     "            {\n",
330 |     "                \"ChannelName\": \"train\",\n",
331 |     "                \"ContentType\": \"csv\",\n",
332 |     "                \"DataSource\": {\n",
333 |     "                    \"S3DataSource\": {\n",
334 |     "                        \"S3DataDistributionType\": \"FullyReplicated\",\n",
335 |     "                        \"S3DataType\": \"S3Prefix\",\n",
336 |     "                        \"S3Uri\": f\"s3://{bucket}/{s3_prefix}/train\"\n",
337 |     "                    }\n",
338 |     "                }\n",
339 |     "            }\n",
340 |     "        ],\n",
341 |     "        \"OutputDataConfig\": {\n",
342 |     "            \"S3OutputPath\": f\"s3://{bucket}/{s3_prefix}/xgboost\"\n",
343 |     "        },\n",
344 |     "        \"ResourceConfig\": {\n",
345 |     "            \"InstanceCount\": 1,\n",
346 |     "            \"InstanceType\": \"ml.m5.2xlarge\",\n",
347 |     "            \"VolumeSizeInGB\": 5\n",
348 |     "        },\n",
349 |     "        \"RoleArn\": opts['role_name'],\n",
350 |     "        \"StoppingCondition\": {\n",
351 |     "            \"MaxRuntimeInSeconds\": 86400\n",
352 |     "        },\n",
353 |     "        \"TrainingJobName\": training_job_name\n",
354 |     "    }\n",
355 |     "    \n",
356 |     "    config[\"model_config\"]={\n",
357 |     "           \"ExecutionRoleArn\": opts['role_name'],\n",
358 |     "           \"ModelName\": f\"XGBoost-Fraud-Detector-{ts}\",\n",
359 |     "           \"PrimaryContainer\": { \n",
360 |     "              \"Mode\": \"SingleModel\",\n",
361 |     "              \"Image\": container,\n",
362 |     "              \"ModelDataUrl\": f\"s3://{bucket}/{s3_prefix}/xgboost/{training_job_name}/output/model.tar.gz\"\n",
363 |     "           },\n",
364 |     "        }\n",
365 |     "    \n",
366 |     "    return config"
367 |    ]
368 |   },
369 |   {
370 |    "cell_type": "markdown",
371 |    "metadata": {},
372 |    "source": [
373 |     "Upload `config.py` to the `/dags` prefix."
374 |    ]
375 |   },
376 |   {
377 |    "cell_type": "code",
378 |    "execution_count": 560,
379 |    "metadata": {},
380 |    "outputs": [],
381 |    "source": [
382 |     "s3_client.upload_file(\"scripts/config.py\", bucket, f\"dags/config.py\")"
383 |    ]
384 |   },
385 |   {
386 |    "cell_type": "markdown",
387 |    "metadata": {},
388 |    "source": [
389 |     "---\n",
390 |     "## Setup Apache Airflow DAG (Directed Acyclic Graph)\n",
391 |     "\n",
392 |     "In this step, we will create the Python script to setup the Apache Airflow DAG. The script will create three distinct tasks and finally chain them together using `>>` in the end to create the Airflow DAG.\n",
393 |     "\n",
394 |     "1. Use Python operator to define a task to run the Data Wrangler script for data pre-processing\n",
395 |     "2. Use SageMaker operator to define a task to train an XGBoost model using the training data\n",
396 |     "3. Use SageMaker operator to define a task to create a model using the model artifacts created by the training step"
397 |    ]
398 |   },
399 |   {
400 |    "cell_type": "code",
401 |    "execution_count": 538,
402 |    "metadata": {},
403 |    "outputs": [
404 |     {
405 |      "data": {
406 |       "text/plain": [
407 |        "'s3://sagemaker-us-east-2-965425568475/data-wrangler-pipeline/flow/flow-21-08-44-46-aea6f365.flow'"
408 |       ]
409 |      },
410 |      "execution_count": 538,
411 |      "metadata": {},
412 |      "output_type": "execute_result"
413 |     }
414 |    ],
415 |    "source": [
416 |     "%store -r ins_claim_flow_uri\n",
417 |     "ins_claim_flow_uri"
418 |    ]
419 |   },
420 |   {
421 |    "cell_type": "code",
422 |    "execution_count": 561,
423 |    "metadata": {},
424 |    "outputs": [
425 |     {
426 |      "name": "stdout",
427 |      "output_type": "stream",
428 |      "text": [
429 |       "Writing scripts/ml_pipeline.py\n"
430 |      ]
431 |     }
432 |    ],
433 |    "source": [
434 |     "%%writefile scripts/ml_pipeline.py\n",
435 |     "#!/usr/bin/env python\n",
436 |     "import time\n",
437 |     "import uuid\n",
438 |     "import json\n",
439 |     "import boto3\n",
440 |     "import sagemaker\n",
441 |     "\n",
442 |     "# Import config file.\n",
443 |     "from config import config\n",
444 |     "from datetime import timedelta\n",
445 |     "import airflow\n",
446 |     "from airflow import DAG\n",
447 |     "from airflow.models import DAG\n",
448 |     "\n",
449 |     "# airflow operators\n",
450 |     "from airflow.models import DAG\n",
451 |     "from airflow.operators.python_operator import PythonOperator\n",
452 |     "\n",
453 |     "# airflow sagemaker operators\n",
454 |     "from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator\n",
455 |     "from airflow.providers.amazon.aws.operators.sagemaker_model import SageMakerModelOperator\n",
456 |     "\n",
457 |     "# airflow sagemaker configuration\n",
458 |     "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
459 |     "from sagemaker.estimator import Estimator\n",
460 |     "from sagemaker.workflow.airflow import training_config\n",
461 |     "\n",
462 |     "# airflow Data Wrangler operator\n",
463 |     "from SMDataWranglerOperator import SageMakerDataWranglerOperator\n",
464 |     "\n",
465 |     "# airflow dummy operator\n",
466 |     "from airflow.operators.dummy import DummyOperator\n",
467 |     "\n",
468 |     "\n",
469 |     "  \n",
470 |     "default_args = {  \n",
471 |     "    'owner': 'airflow',\n",
472 |     "    'depends_on_past': False,\n",
473 |     "    'start_date': airflow.utils.dates.days_ago(1),\n",
474 |     "    'retries': 0,\n",
475 |     "    'retry_delay': timedelta(minutes=2),\n",
476 |     "    'provide_context': True,\n",
477 |     "    'email': ['airflow@iloveairflow.com'],\n",
478 |     "    'email_on_failure': False,\n",
479 |     "    'email_on_retry': False\n",
480 |     "}\n",
481 |     "ts                = f\"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}\"\n",
482 |     "DAG_NAME          = f\"ml-pipeline\"\n",
483 |     "\n",
484 |     "#-------\n",
485 |     "### Start creating DAGs\n",
486 |     "#-------\n",
487 |     "\n",
488 |     "dag = DAG(  \n",
489 |     "            DAG_NAME,\n",
490 |     "            default_args=default_args,\n",
491 |     "            dagrun_timeout=timedelta(hours=2),\n",
492 |     "            # Cron expression to auto run workflow on specified interval\n",
493 |     "            # schedule_interval='0 3 * * *'\n",
494 |     "            schedule_interval=None\n",
495 |     "        )\n",
496 |     "\n",
497 |     "#-------\n",
498 |     "# Task to create configurations\n",
499 |     "#-------\n",
500 |     "\n",
501 |     "config_task = PythonOperator(\n",
502 |     "        task_id = 'Start',\n",
503 |     "        python_callable=config,\n",
504 |     "        op_kwargs={\n",
505 |     "            'training_job_name': f\"XGBoost-training-{ts}\",\n",
506 |     "            's3_prefix': 'data-wrangler-pipeline',\n",
507 |     "            'role_name': 'AmazonSageMaker-ExecutionRole-20201030T135016'},\n",
508 |     "        provide_context=True,\n",
509 |     "        dag=dag\n",
510 |     "    )\n",
511 |     "\n",
512 |     "#-------\n",
513 |     "# Task with SageMakerDataWranglerOperator operator for Data Wrangler Processing Job.\n",
514 |     "#-------\n",
515 |     "\n",
516 |     "def datawrangler(**context):\n",
517 |     "    config = context['ti'].xcom_pull(task_ids='Start',key='return_value')\n",
518 |     "    preprocess_task = SageMakerDataWranglerOperator(\n",
519 |     "                            task_id='DataWrangler_Processing_StepNew',\n",
520 |     "                            dag=dag,\n",
521 |     "                            flow_file_s3uri=\"$flow_uri\",\n",
522 |     "                            processing_instance_count=2,\n",
523 |     "                            instance_type='ml.m5.4xlarge',\n",
524 |     "                            aws_conn_id=\"aws_default\",\n",
525 |     "                            config= config[\"data_wrangler_config\"]\n",
526 |     "                    )\n",
527 |     "    preprocess_task.execute(context)\n",
528 |     "\n",
529 |     "datawrangler_task = PythonOperator(\n",
530 |     "        task_id = 'SageMaker_DataWrangler_step',\n",
531 |     "        python_callable=datawrangler,\n",
532 |     "        provide_context=True,\n",
533 |     "        dag=dag\n",
534 |     "    )\n",
535 |     "\n",
536 |     "#-------\n",
537 |     "# Task with SageMaker training operator to train the xgboost model\n",
538 |     "#-------\n",
539 |     "\n",
540 |     "def trainmodel(**context):\n",
541 |     "    config = context['ti'].xcom_pull(task_ids='Start',key='return_value')\n",
542 |     "    trainmodel_task = SageMakerTrainingOperator(\n",
543 |     "                task_id='Training_Step',\n",
544 |     "                config= config['train_config'],\n",
545 |     "                aws_conn_id='aws-sagemaker',\n",
546 |     "                wait_for_completion=True,\n",
547 |     "                check_interval=30\n",
548 |     "            )\n",
549 |     "    trainmodel_task.execute(context)\n",
550 |     "\n",
551 |     "train_model_task = PythonOperator(\n",
552 |     "        task_id = 'SageMaker_training_step',\n",
553 |     "        python_callable=trainmodel,\n",
554 |     "        provide_context=True,\n",
555 |     "        dag=dag\n",
556 |     "    )\n",
557 |     "\n",
558 |     "#-------\n",
559 |     "# Task with SageMaker Model operator to create the xgboost model from artifacts\n",
560 |     "#-------\n",
561 |     "\n",
562 |     "def createmodel(**context):\n",
563 |     "    config = context['ti'].xcom_pull(task_ids='Start',key='return_value')\n",
564 |     "    createmodel_task= SageMakerModelOperator(\n",
565 |     "            task_id='Create_Model',\n",
566 |     "            config= config['model_config'],\n",
567 |     "            aws_conn_id='aws-sagemaker',\n",
568 |     "        )\n",
569 |     "    createmodel_task.execute(context)\n",
570 |     "    \n",
571 |     "create_model_task = PythonOperator(\n",
572 |     "        task_id = 'SageMaker_create_model_step',\n",
573 |     "        python_callable=createmodel,\n",
574 |     "        provide_context=True,\n",
575 |     "        dag=dag\n",
576 |     "    )\n",
577 |     "\n",
578 |     "#------\n",
579 |     "# Last step\n",
580 |     "#------\n",
581 |     "end_task = DummyOperator(task_id='End', dag=dag)\n",
582 |     "\n",
583 |     "# Create task dependencies\n",
584 |     "\n",
585 |     "config_task >> datawrangler_task >> train_model_task >> create_model_task >> end_task"
586 |    ]
587 |   },
588 |   {
589 |    "cell_type": "markdown",
590 |    "metadata": {},
591 |    "source": [
592 |     "Replace `$flow_uri` in the `ml_pipeline.py` script with the store magic variable `ins_claim_flow_uri` which contains S3 path of the `.flow` file."
593 |    ]
594 |   },
595 |   {
596 |    "cell_type": "code",
597 |    "execution_count": 562,
598 |    "metadata": {},
599 |    "outputs": [],
600 |    "source": [
601 |     "with open(\"ml_pipeline.py\", 'r') as f:\n",
602 |     "    variables   = {'flow_uri': ins_claim_flow_uri}\n",
603 |     "    template    = string.Template(f.read())\n",
604 |     "    ml_pipeline = template.substitute(variables)\n",
605 |     "\n",
606 |     "# Creates the .flow file\n",
607 |     "with open('ml_pipeline.py', 'w') as f:\n",
608 |     "    f.write(ml_pipeline)"
609 |    ]
610 |   },
611 |   {
612 |    "cell_type": "markdown",
613 |    "metadata": {},
614 |    "source": [
615 |     "Upload `ml_pipeline.py` to the `/dags` prefix."
616 |    ]
617 |   },
618 |   {
619 |    "cell_type": "code",
620 |    "execution_count": 563,
621 |    "metadata": {},
622 |    "outputs": [],
623 |    "source": [
624 |     "s3_client.upload_file(\"scripts/ml_pipeline.py\", bucket, f\"dags/ml_pipeline.py\")"
625 |    ]
626 |   },
627 |   {
628 |    "cell_type": "markdown",
629 |    "metadata": {},
630 |    "source": [
631 |     "---\n",
632 |     "## View Airflow DAG and run\n",
633 |     "\n",
634 |     "Once the above steps are complete, you can access the [Apache Airflow UI](https://docs.aws.amazon.com/mwaa/latest/userguide/access-airflow-ui.html) and view the DAG. To access the Apache Airflow UI, go to the Amazon MWAA Console, select the MWAA Environment and click the _Airflow UI_ link.\n",
635 |     "\n",
636 |     "<img src=\"images/mwaa_ui.png\" width=\"800\"/>\n",
637 |     "\n",
638 |     "\n",
639 |     "You can run the DAG by clicking on the \"Play\" button, alternatively you can -\n",
640 |     "\n",
641 |     "1. Setup the DAG to run on a set schedule automatically using cron expressions\n",
642 |     "2. Setup the DAG to run based on S3 sensors such that the pipeline/workflow would execute whenever a new file arrives in a bucket/prefix.\n",
643 |     "\n",
644 |     "<img src=\"images/mwaa_dag.png\" width=\"800\"/>\n"
645 |    ]
646 |   },
647 |   {
648 |    "cell_type": "markdown",
649 |    "metadata": {},
650 |    "source": [
651 |     "### Clean Up\n",
652 |     "\n",
653 |     "1. Delete the MWAA Environment from the Amazon MWAA Console.\n",
654 |     "2. Delete the MWAA S3 files.\n",
655 |     "3. Delete the Model training data and model artifact files from S3.\n",
656 |     "4. Delete the SageMaker Model."
657 |    ]
658 |   },
659 |   {
660 |    "cell_type": "markdown",
661 |    "metadata": {},
662 |    "source": [
663 |     "---\n",
664 |     "# Conclusion\n",
665 |     "\n",
666 |     "We created an ML Pipeline with Apache Airflow and used the Data Wrangler script to pre-process and generate new training data for our model training and subsequently created a new model in Amazon SageMaker.\n"
667 |    ]
668 |   },
669 |   {
670 |    "cell_type": "code",
671 |    "execution_count": null,
672 |    "metadata": {},
673 |    "outputs": [],
674 |    "source": []
675 |   }
676 |  ],
677 |  "metadata": {
678 |   "instance_type": "ml.t3.medium",
679 |   "kernelspec": {
680 |    "display_name": "Python 3 (Data Science)",
681 |    "language": "python",
682 |    "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0"
683 |   },
684 |   "language_info": {
685 |    "codemirror_mode": {
686 |     "name": "ipython",
687 |     "version": 3
688 |    },
689 |    "file_extension": ".py",
690 |    "mimetype": "text/x-python",
691 |    "name": "python",
692 |    "nbconvert_exporter": "python",
693 |    "pygments_lexer": "ipython3",
694 |    "version": "3.7.10"
695 |   }
696 |  },
697 |  "nbformat": 4,
698 |  "nbformat_minor": 4
699 | }
700 | 


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/README.md:
--------------------------------------------------------------------------------
 1 | ## SageMaker Data Wrangler with Apache Airflow (Amazon MWAA)
 2 | 
 3 | ### Getting Started
 4 | 
 5 | #### Setup Amazon MWAA Environment
 6 | 
 7 | 1. Create an [Amazon S3 bucket](https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-s3-bucket.html) and subsequent folders required by Amazon MWAA. The S3 bucket and the folders as seen below. These folders are used by Amazon MWAA to look for dependent Python scripts that is required for the workflow.
 8 |    
 9 |    <div align="center">
10 |         <p align="center">
11 |             <img src="./images/mwaa_s3.png" alt="Airflow S3"/>
12 |         </p>
13 |     </div>
14 | 
15 | 2. [Create an Amazon MWAA environment](https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html). Note that we used Airflow version 2.0.2 (current latest version supported by Amazon MWAA) for this solution.
16 | 
17 | #### Setup the Workflow DAG for Apache Airflow
18 | 
19 | 1. Clone this repository to your local machine
20 |    ```
21 |    git clone <repo_url>
22 |    cd data-wrangler-mlops/3-apache-airflow-pipelines
23 |    ```
24 | 3. Upload the `requirements.txt` file in the `/scripts` folder. This contains all the Python dependencies required by the Airflow tasks and upload it to the `/requirements` directory within Amazon MWAA primary S3 bucket (created in step 1). This will be used by the managed Airflow environment to install the Python dependencies.
25 | 3. Upload the `SMDataWranglerOperator.py` from the `/scripts` folder to the S3 `/dags` directory. This Python script contains code for the custom Airflow operator for SageMaker Data Wrangler. This operator can be used for tasks to process any `.flow` file.
26 | 4. Upload the `config.py` script from the `/scripts` folder to the S3 `/dags` directory. This Python script will be used for the first step of our DAG to create configuration objects required by the remaining steps of the workflow.
27 | 5. Finally, upload the `ml_pipelines.py` file from the `/scripts` folder to the S3 `/dags` directory. This script contains the DAG definition for the Airflow workflow. This is where we define each of the tasks, and setup dependencies between them. Amazon MWAA will periodically poll the /dags directory to execute this script to create the DAG or update the existing one with any latest changes.
28 | 
29 | #### Run the Airflow workflow DAG
30 | 
31 | 1. Navigate to the Amazon MWAA console and find your Airflow environment.
32 | 2. Click on the Environment name to see the details, then click the link under "Airflow UI" to open the Airflow Admin UI in a new browser tab.
33 |    
34 |    <div align="center">
35 |         <p align="center">
36 |             <img src="./images/mwaa_ui.png" alt="Airflow UI"/>
37 |         </p>
38 |     </div>
39 | 
40 | 3. In the Airflow UI, click the DAG named `ml-pipeline` to see the details of the DAG. If you don't see the DAG then wait for a few moments for the DAG to get created. Amazon MWAA will poll the S3 bucket periodically and read the Python scripts from the bucket and will execute them to setup the DAG.
41 |    
42 |    <div align="center">
43 |         <p align="center">
44 |             <img src="./images/meaa_ui_home.png" alt="Airflow UI Home"/>
45 |         </p>
46 |     </div>
47 | 
48 | 4. Click on the "Graph View" tab to view the DAG and then click the play button on the top right of the screen to start the workflow execution. This is the manual way to start the workflow execution, however, this can be automated using Apache Airflow Sensors and Plugins. Refer to the [documentation](https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-dag-import-plugins.html) for more information on how to setup sensors.
49 |    
50 |     <div align="center">
51 |         <p align="center">
52 |             <img src="./images/mwaa_dag.png" alt="Workflow"/>
53 |         </p>
54 |     </div>
55 | 
56 | 5. In the next screen keep the "Configuration JSON" as default and click "Trigger". This will start the Workflow execution.
57 | 
58 |     <div align="center">
59 |         <p align="center">
60 |             <img src="./images/mwaa_trigger.png" alt="Workflow Trigger"/>
61 |         </p>
62 |     </div>
63 | 
64 | 6. Once the workflow is completed successfully you can check the S3 location (from the AWS console) specified as the DataWrangler output location to see the transformed output files generated. Verify that the model was created from the SageMaker Console.
65 | 
66 | 7. (Otional) You can look at the Workflow task logs by clicking on a Task in the DAG (in DAG View) and then clicking on the "Logs" button in the pop-up window.
67 | 
68 | 
69 | ---
70 | 
71 | ### Cleaning Up
72 | 
73 | 1. On the Airflow UI, click the "DAGs" button on the top of the page to show the list of DAGs.
74 | 2. Click the Switch right next the DAG's name to turn it off (_"Pause DAG"_) and then click on the Delete (trash can) button on the right to delete the DAG.
75 |    
76 |    <div align="center">
77 |         <p align="center">
78 |             <img src="./images/mwaa_delete_dag.png" alt="Delete DAG"/>
79 |         </p>
80 |     </div>
81 | 
82 | 3. Close the Airflow UI screen and navigate to the Amazon MWAA Console from AWS console to view the list of MWAA environments. Select the MWAA environment and click Delete.
83 | 
84 |     <div align="center">
85 |         <p align="center">
86 |             <img src="./images/delete_mwaa.png" alt="Delete MWAA Env"/>
87 |         </p>
88 |     </div>
89 | 
90 | 4. Delete the Amazon S3 bucket for MWAA that you created at the beginning of this tutorial and all the folders and files within it. Refer to the Amazon MWAA [documentation](https://docs.aws.amazon.com/mwaa/latest/userguide/working-dags-delete.html#working-dags-s3-dag-delete) on how to delete just the DAG files from S3.


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/images/delete_mwaa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/delete_mwaa.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/images/flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/flow.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/images/meaa_ui_home.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/meaa_ui_home.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/images/mwaa_dag.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_dag.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/images/mwaa_delete_dag.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_delete_dag.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/images/mwaa_s3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_s3.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/images/mwaa_trigger.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_trigger.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/images/mwaa_ui.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_ui.png


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/scripts/SMDataWranglerOperator.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | import time
  3 | import uuid
  4 | import sagemaker
  5 | import json
  6 | from sagemaker.processing import ProcessingInput, ProcessingOutput
  7 | from sagemaker.processing import Processor
  8 | from sagemaker.network import NetworkConfig
  9 | from airflow.models.baseoperator import BaseOperator
 10 | from airflow.providers.amazon.aws.hooks.s3 import S3Hook
 11 | from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
 12 | 
 13 | 
 14 | class SageMakerDataWranglerOperator(BaseOperator):
 15 |     template_fields = ["config","iam_role"]
 16 |     
 17 |     def __init__(self, 
 18 |                  flow_file_s3uri: str, 
 19 |                  processing_instance_count: int, 
 20 |                  instance_type: str,
 21 |                  aws_conn_id: str,
 22 |                  config: str,
 23 |                  **kwargs) -> None:
 24 |         super().__init__(**kwargs)
 25 |         sess = sagemaker.Session()
 26 |         
 27 |         self.config = config
 28 |         self.sagemaker_session         = sess
 29 |         self.basedir                   = "/opt/ml/processing"                                                     
 30 |         self.flow_file_s3uri           = flow_file_s3uri                                      # full form S3 URI example: s3://bucket/prefix/object
 31 |         self.processing_instance_count = processing_instance_count
 32 |         self.instance_type             = instance_type
 33 |         self.aws_conn_id               = aws_conn_id if aws_conn_id is not None else "aws_default"     # Uses the default MWAA AWS Connection
 34 |         self.s3_data_type              = self.get_config("s3_data_type", "S3Prefix", config)
 35 |         self.s3_input_mode             = self.get_config("s3_input_mode", "File", config)
 36 |         self.s3_data_distribution_type = self.get_config("s3_data_distribution_type", "FullyReplicated", config) 
 37 |         self.kms_key                   = self.get_config("kms_key", None, config)
 38 |         self.volume_size_in_gb         = self.get_config("volume_size_in_gb", 30, config)          # Defaults to 30Gb EBS volume
 39 |         self.enable_network_isolation  = self.get_config("enable_network_isolation",False, config)
 40 |         self.wait_for_processing       = self.get_config("wait_for_processing", True, config)
 41 |         self.container_uri             = self.get_config("container_uri", 
 42 |                                                     "415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.x", config)
 43 |         self.container_uri_pinned      = self.get_config("container_uri_pinned",
 44 |                                                     "415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.12.0", config)        
 45 |         self.s3_output_upload_mode     = self.get_config("s3_output_upload_mode", "EndOfJob", config)
 46 |         self.output_content_type       = self.get_config("output_content_type", "CSV", config["outputConfig"])                                 # CSV/PARQUET
 47 |         self.output_bucket             = self.get_config("output_bucket", sess.default_bucket(), config["outputConfig"])
 48 |         self.output_prefix             = self.get_config("output_prefix", None, config["outputConfig"])            
 49 |     
 50 |     def expand_role(self) -> None:
 51 |         if 'sagemaker_role' in self.config:
 52 |             hook = AwsBaseHook(self.aws_conn_id, client_type='iam')
 53 |             self.iam_role = hook.expand_role(self.config['sagemaker_role'])
 54 |             
 55 |     def get_config(self,flag, default, opts):
 56 |         var = opts[flag] if flag in opts else default
 57 |         return var
 58 |     
 59 |     def parse_s3_uri(self, s3_uri):
 60 |         path_parts=s3_uri.replace("s3://","").split("/")
 61 |         s3_bucket=path_parts.pop(0)
 62 |         key="/".join(path_parts)
 63 |         
 64 |         return s3_bucket, key
 65 |     
 66 |     def get_data_sources(self, data):
 67 |         # Initialize variables from .flow file
 68 |         output_node = data['nodes'][-1]['node_id']
 69 |         output_path = data['nodes'][-1]['outputs'][0]['name']
 70 |         input_source_names = [node['parameters']['dataset_definition']['name'] for node in data['nodes'] if node['type']=="SOURCE"]
 71 |         input_source_uris = [node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] for node in data['nodes'] if node['type']=="SOURCE"]
 72 |         
 73 |         output_name = f"{output_node}.{output_path}"
 74 |         
 75 |         data_sources = []        
 76 |         
 77 |         # Intialize data sources from .flow file
 78 |         for i in range(0,len(input_source_uris)):
 79 |             data_sources.append(ProcessingInput(
 80 |                 source=input_source_uris[i],
 81 |                 destination=f"{self.basedir}/{input_source_names[i]}",
 82 |                 input_name=input_source_names[i],
 83 |                 s3_data_type=self.s3_data_type,
 84 |                 s3_input_mode=self.s3_input_mode,
 85 |                 s3_data_distribution_type=self.s3_data_distribution_type
 86 |             ))
 87 |         
 88 |         return output_name, data_sources
 89 |     
 90 |     def get_processor(self):        
 91 |         # Create Processing Job
 92 |         # To launch a Processing Job, you will use the SageMaker Python SDK to create a Processor function.
 93 |         processor = Processor(
 94 |             role=self.iam_role,
 95 |             image_uri=self.container_uri,
 96 |             instance_count=self.processing_instance_count,
 97 |             instance_type=self.instance_type,
 98 |             volume_size_in_gb=self.volume_size_in_gb,
 99 |             network_config=NetworkConfig(enable_network_isolation=self.enable_network_isolation),
100 |             sagemaker_session=self.sagemaker_session,
101 |             output_kms_key=self.kms_key
102 |         )
103 |         
104 |         return processor
105 |         
106 |     def execute(self, context):
107 |         
108 |         self.expand_role()
109 |         
110 |         print(f'SageMaker Data Wrangler Operator initialized with {context}...')
111 |         # Time marker
112 |         ts = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"        
113 |         # Establish connection to S3 with S3Hook
114 |         s3 = S3Hook(aws_conn_id=self.aws_conn_id)
115 |         s3.get_conn()
116 |         
117 |         # Read the .flow file
118 |         s3_bucket, key = self.parse_s3_uri(self.flow_file_s3uri)
119 |         file_content = s3.read_key(key=key, bucket_name=s3_bucket)
120 |         data = json.loads(file_content)
121 |         
122 |         output_name, data_sources = self.get_data_sources(data)        
123 |                     
124 |         # Configure Output for the SageMaker processing job
125 |         prefix = self.output_prefix if self.output_prefix is not None else ts
126 |         s3_output_path = f"s3://{self.output_bucket}/{prefix}"
127 |         
128 |         processing_job_output = ProcessingOutput(
129 |             output_name=output_name,
130 |             source=f"{self.basedir}/output",
131 |             destination=s3_output_path,
132 |             s3_upload_mode=self.s3_output_upload_mode
133 |         )
134 |         
135 |         # The Data Wrangler Flow is provided to the Processing Job as an input source which we configure below.
136 |         # Input - Flow file
137 |         flow_input = ProcessingInput(
138 |             source=self.flow_file_s3uri,
139 |             destination=f"{self.basedir}/flow",
140 |             input_name="flow",
141 |             s3_data_type=self.s3_data_type,
142 |             s3_input_mode=self.s3_input_mode,
143 |             s3_data_distribution_type=self.s3_data_distribution_type
144 |         )
145 |         
146 |         # Output configuration used as processing job container arguments 
147 |         output_config = {
148 |             output_name: {
149 |                 "content_type": self.output_content_type
150 |             }
151 |         }
152 |         
153 |         # Create a SageMaker processing Processor
154 |         processor = self.get_processor()
155 |         
156 |         # Unique processing job name. Give a unique name every time you re-execute processing jobs
157 |         processing_job_name = f"data-wrangler-flow-processing-{ts}"
158 |         
159 |         print(f'Starting SageMaker Data Wrangler processing job {processing_job_name} with {self.processing_instance_count} instances of {self.instance_type} and {self.volume_size_in_gb}Gb disk...')
160 |         
161 |         # Start Job
162 |         processor.run(
163 |             inputs=[flow_input] + data_sources, 
164 |             outputs=[processing_job_output],
165 |             arguments=[f"--output-config '{json.dumps(output_config)}'"],
166 |             wait=self.wait_for_processing,
167 |             logs=True,
168 |             job_name=processing_job_name
169 |         )
170 |         
171 |         print(f'SageMaker Data Wrangler processing job for flow file {self.flow_file_s3uri} complete...')
172 |         
173 |         # We will copy the files generated by Data Wrangler to a well known S3 location so that
174 |         # the location can be used in our training job Task in the Airflow DAG.
175 |         # This is because the prefix generated by Data Wrangler is dynamic.
176 |         # We will use the default connection named 'aws_default' [Found in Airflow Admin UI > Admin Menu > Connections]
177 |         
178 |         print(f'Processing output files...')
179 |         
180 |         key_list = s3.list_keys(self.output_bucket, prefix=f"{prefix}/{processing_job_name}")
181 |         for index, key in enumerate(key_list):
182 |             s3.copy_object(source_bucket_key=f"s3://{self.output_bucket}/{key}",dest_bucket_key=f"{s3_output_path}/train/train-data-{index}.csv")
183 |             
184 |         # Delete the original file(s) since they have been moved to the well known S3 location
185 |         s3.delete_objects(bucket=self.output_bucket, keys=key_list)
186 |         
187 |         data_output_path = f"{s3_output_path}/train"
188 |         
189 |         print(f'Saved output files at {data_output_path}...')
190 |         return data_output_path


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/scripts/config.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | import time
 3 | import uuid
 4 | import sagemaker
 5 | import json
 6 | from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
 7 | from sagemaker.amazon.amazon_estimator import get_image_uri
 8 | 
 9 | def config(**opts):
10 |     
11 |     region_name=opts['region_name'] if 'region_name' in opts else sagemaker.Session().boto_region_name
12 |     
13 |     # Hook
14 |     hook = AwsBaseHook(aws_conn_id='airflow-sagemaker', resource_type="sagemaker")
15 |     boto_session = hook.get_session(region_name=region_name)
16 |     sagemaker_session =  sagemaker.session.Session(boto_session=boto_session)
17 |     
18 | 
19 |     training_job_name=opts['training_job_name']
20 |     bucket = opts['bucket'] if 'bucket' in opts else sagemaker_session.default_bucket() #"sagemaker-us-east-2-965425568475"
21 |     s3_prefix = opts['s3_prefix']
22 |     # Get the xgboost container uri
23 |     container = get_image_uri(region_name, 'xgboost', repo_version='1.0-1')
24 |     
25 |     ts = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"   
26 |     config = {}
27 |     
28 |     config["data_wrangler_config"] = {        
29 |         "sagemaker_role":             opts['role_name'],
30 |         #"s3_data_type"              :    defaults to "S3Prefix" 
31 |         #"s3_input_mode"             :    defaults to "File", 
32 |         #"s3_data_distribution_type" :    defaults to "FullyReplicated",         
33 |         #"kms_key"                   :    defaults to None, 
34 |         #"volume_size_in_gb"         :    defaults to 30,
35 |         #"enable_network_isolation"  :    defaults to False, 
36 |         #"wait_for_processing"       :    defaults to True, 
37 |         #"container_uri"             :    defaults to "415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.x", 
38 |         #"container_uri_pinned"      :    defaults to "415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.12.0",  
39 |         "outputConfig": {
40 |               #"s3_output_upload_mode":     #defaults to EndOfJob
41 |               #"output_content_type":       #defaults to CSV
42 |               #"output_bucket":             #defaults to SageMaker Default bucket
43 |               "output_prefix": s3_prefix     #prefix within bucket where output will be written, default is generated automatically
44 |         }
45 |     }
46 | 
47 |     config["train_config"]={
48 |         "AlgorithmSpecification": {
49 |             "TrainingImage": container,
50 |             "TrainingInputMode": "File"
51 |         },
52 |         "HyperParameters": {
53 |             "max_depth": "5",
54 |             "num_round": "10",
55 |             "objective": "reg:squarederror"
56 |         },
57 |         "InputDataConfig": [
58 |             {
59 |                 "ChannelName": "train",
60 |                 "ContentType": "csv",
61 |                 "DataSource": {
62 |                     "S3DataSource": {
63 |                         "S3DataDistributionType": "FullyReplicated",
64 |                         "S3DataType": "S3Prefix",
65 |                         "S3Uri": f"s3://{bucket}/{s3_prefix}/train"
66 |                     }
67 |                 }
68 |             }
69 |         ],
70 |         "OutputDataConfig": {
71 |             "S3OutputPath": f"s3://{bucket}/{s3_prefix}/xgboost"
72 |         },
73 |         "ResourceConfig": {
74 |             "InstanceCount": 1,
75 |             "InstanceType": "ml.m5.2xlarge",
76 |             "VolumeSizeInGB": 5
77 |         },
78 |         "RoleArn": opts['role_name'],
79 |         "StoppingCondition": {
80 |             "MaxRuntimeInSeconds": 86400
81 |         },
82 |         "TrainingJobName": training_job_name
83 |     }
84 |     
85 |     config["model_config"]={
86 |            "ExecutionRoleArn": opts['role_name'],
87 |            "ModelName": f"XGBoost-Fraud-Detector-{ts}",
88 |            "PrimaryContainer": { 
89 |               "Mode": "SingleModel",
90 |               "Image": container,
91 |               "ModelDataUrl": f"s3://{bucket}/{s3_prefix}/xgboost/{training_job_name}/output/model.tar.gz"
92 |            },
93 |         }
94 |     
95 |     return config
96 | 


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/scripts/ml_pipeline.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | import time
  3 | import uuid
  4 | import json
  5 | import boto3
  6 | import sagemaker
  7 | 
  8 | # Import config file.
  9 | from config import config
 10 | from datetime import timedelta
 11 | import airflow
 12 | from airflow import DAG
 13 | from airflow.models import DAG
 14 | 
 15 | # airflow operators
 16 | from airflow.models import DAG
 17 | from airflow.operators.python_operator import PythonOperator
 18 | 
 19 | # airflow sagemaker operators
 20 | from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator
 21 | from airflow.providers.amazon.aws.operators.sagemaker_model import SageMakerModelOperator
 22 | 
 23 | # airflow sagemaker configuration
 24 | from sagemaker.amazon.amazon_estimator import get_image_uri
 25 | from sagemaker.estimator import Estimator
 26 | from sagemaker.workflow.airflow import training_config
 27 | 
 28 | # airflow Data Wrangler operator
 29 | from SMDataWranglerOperator import SageMakerDataWranglerOperator
 30 | 
 31 | # airflow dummy operator
 32 | from airflow.operators.dummy import DummyOperator
 33 | 
 34 | 
 35 |   
 36 | default_args = {  
 37 |     'owner': 'airflow',
 38 |     'depends_on_past': False,
 39 |     'start_date': airflow.utils.dates.days_ago(1),
 40 |     'retries': 0,
 41 |     'retry_delay': timedelta(minutes=2),
 42 |     'provide_context': True,
 43 |     'email': ['airflow@iloveairflow.com'],
 44 |     'email_on_failure': False,
 45 |     'email_on_retry': False
 46 | }
 47 | ts                = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"
 48 | DAG_NAME          = f"ml-pipeline"
 49 | 
 50 | #-------
 51 | ### Start creating DAGs
 52 | #-------
 53 | 
 54 | dag = DAG(  
 55 |             DAG_NAME,
 56 |             default_args=default_args,
 57 |             dagrun_timeout=timedelta(hours=2),
 58 |             # Cron expression to auto run workflow on specified interval
 59 |             # schedule_interval='0 3 * * *'
 60 |             schedule_interval=None
 61 |         )
 62 | 
 63 | #-------
 64 | # Task to create configurations
 65 | #-------
 66 | 
 67 | config_task = PythonOperator(
 68 |         task_id = 'Start',
 69 |         python_callable=config,
 70 |         op_kwargs={
 71 |             'training_job_name': f"XGBoost-training-{ts}",
 72 |             's3_prefix': 'data-wrangler-pipeline',
 73 |             'role_name': 'AmazonSageMaker-ExecutionRole-20201030T135016'},
 74 |         provide_context=True,
 75 |         dag=dag
 76 |     )
 77 | 
 78 | #-------
 79 | # Task with SageMakerDataWranglerOperator operator for Data Wrangler Processing Job.
 80 | #-------
 81 | 
 82 | def datawrangler(**context):
 83 |     config = context['ti'].xcom_pull(task_ids='Start',key='return_value')
 84 |     preprocess_task = SageMakerDataWranglerOperator(
 85 |                             task_id='DataWrangler_Processing_StepNew',
 86 |                             dag=dag,
 87 |                             flow_file_s3uri="$flow_uri",
 88 |                             processing_instance_count=2,
 89 |                             instance_type='ml.m5.4xlarge',
 90 |                             aws_conn_id="aws_default",
 91 |                             config= config["data_wrangler_config"]
 92 |                     )
 93 |     preprocess_task.execute(context)
 94 | 
 95 | datawrangler_task = PythonOperator(
 96 |         task_id = 'SageMaker_DataWrangler_step',
 97 |         python_callable=datawrangler,
 98 |         provide_context=True,
 99 |         dag=dag
100 |     )
101 | 
102 | #-------
103 | # Task with SageMaker training operator to train the xgboost model
104 | #-------
105 | 
106 | def trainmodel(**context):
107 |     config = context['ti'].xcom_pull(task_ids='Start',key='return_value')
108 |     trainmodel_task = SageMakerTrainingOperator(
109 |                 task_id='Training_Step',
110 |                 config= config['train_config'],
111 |                 aws_conn_id='aws-sagemaker',
112 |                 wait_for_completion=True,
113 |                 check_interval=30
114 |             )
115 |     trainmodel_task.execute(context)
116 | 
117 | train_model_task = PythonOperator(
118 |         task_id = 'SageMaker_training_step',
119 |         python_callable=trainmodel,
120 |         provide_context=True,
121 |         dag=dag
122 |     )
123 | 
124 | #-------
125 | # Task with SageMaker Model operator to create the xgboost model from artifacts
126 | #-------
127 | 
128 | def createmodel(**context):
129 |     config = context['ti'].xcom_pull(task_ids='Start',key='return_value')
130 |     createmodel_task= SageMakerModelOperator(
131 |             task_id='Create_Model',
132 |             config= config['model_config'],
133 |             aws_conn_id='aws-sagemaker',
134 |         )
135 |     createmodel_task.execute(context)
136 |     
137 | create_model_task = PythonOperator(
138 |         task_id = 'SageMaker_create_model_step',
139 |         python_callable=createmodel,
140 |         provide_context=True,
141 |         dag=dag
142 |     )
143 | 
144 | #------
145 | # Last step
146 | #------
147 | end_task = DummyOperator(task_id='End', dag=dag)
148 | 
149 | # Create task dependencies
150 | 
151 | config_task >> datawrangler_task >> train_model_task >> create_model_task >> end_task
152 | 


--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/scripts/requirements.txt:
--------------------------------------------------------------------------------
1 | awswrangler
2 | pandas
3 | sagemaker==v2.59.5
4 | dag-factory==0.7.2
5 | 


--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contributing Guidelines
 2 | 
 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
 4 | documentation, we greatly value feedback and contributions from our community.
 5 | 
 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
 7 | information to effectively respond to your bug report or contribution.
 8 | 
 9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | 
 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 4 | this software and associated documentation files (the "Software"), to deal in
 5 | the Software without restriction, including without limitation the rights to
 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 7 | the Software, and to permit persons to whom the Software is furnished to do so.
 8 | 
 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Integrate SageMaker Data Wrangler into your MLOps workflows
 2 | 
 3 | [![Latest Version](https://img.shields.io/github/tag/aws-samples/sm-data-wrangler-mlops-workflows)](https://github.com/aws-samples/sm-data-wrangler-mlops-workflows/releases)
 4 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/aws-samples/sm-data-wrangler-mlops-workflows/blob/main/LICENSE)
 5 | 
 6 | 
 7 | <div align="center">
 8 |     <p align="center">
 9 |         <img src="./images/dw-arch.jpg" alt="dw"/>
10 |     </p>
11 | </div>
12 | 
13 | ## Get Started
14 | 
15 | 1. Setup an [Amazon SageMaker Studio domain](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html).
16 | 2. Log-on to Amazon SageMaker Studio. Open a terminal from _File_ menu > _New_ > _Terminal_
17 |    
18 |    <div align="center">
19 |     <p align="center">
20 |     <img src="./images/sm-studio-terminal.png" alt="sf"/>
21 |     </p>
22 |     </div>
23 | 
24 | 3. Clone this repository
25 | 
26 | ```sh
27 | git clone https://github.com/aws-samples/sm-data-wrangler-mlops-workflows.git data-wrangler-pipelines
28 | ```
29 | 
30 | 4. Open the [00_setup_data_wrangler.ipynb](./00_setup_data_wrangler.ipynb) file and follow instructions in the notebook
31 | 
32 | ---
33 | 
34 | ## Setup end-to-end MLOps Pipelines
35 | 
36 | - [Setup Amazon SageMaker Data Wrangler with SageMaker Pipelines](./1-sagemaker-pipelines/README.md)
37 | - [Setup Amazon SageMaker Data Wrangler with AWS Step Functions](./2-step-functions-pipelines/README.md)
38 | - [Setup Amazon SageMaker Data Wrangler with Amazon Managed Workflow for Apache Airflow](./3-apache-airflow-pipelines/README.md)
39 | 
40 | ## Security
41 | 
42 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
43 | 
44 | ## License
45 | 
46 | This library is licensed under the MIT-0 License. See the LICENSE file.


--------------------------------------------------------------------------------
/images/dw-arch.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/images/dw-arch.jpg


--------------------------------------------------------------------------------
/images/flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/images/flow.png


--------------------------------------------------------------------------------
/images/sm-studio-terminal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/images/sm-studio-terminal.png


--------------------------------------------------------------------------------