├── .gitignore ├── 00_setup_data_wrangler.ipynb ├── 1-sagemaker-pipelines ├── 01_setup_sagemaker_pipeline.ipynb ├── README.md ├── flow-01-15-12-49-4bd733e0.flow └── images │ └── pipeline.png ├── 2-step-functions-pipelines ├── 01_setup_step_functions_pipeline.ipynb ├── README.md ├── flow-01-15-12-49-4bd733e0.flow └── step-function-workflow.png ├── 3-apache-airflow-pipelines ├── 01_setup_mwaa_pipeline.ipynb ├── README.md ├── images │ ├── delete_mwaa.png │ ├── flow.png │ ├── meaa_ui_home.png │ ├── mwaa_dag.png │ ├── mwaa_delete_dag.png │ ├── mwaa_s3.png │ ├── mwaa_trigger.png │ └── mwaa_ui.png └── scripts │ ├── SMDataWranglerOperator.py │ ├── config.py │ ├── ml_pipeline.py │ └── requirements.txt ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── data ├── claims.csv └── customers.csv ├── images ├── dw-arch.jpg ├── flow.png └── sm-studio-terminal.png └── insurance_claims_flow_template /.gitignore: -------------------------------------------------------------------------------- 1 | # Flow files 2 | .flow 3 | 4 | # Jupyter Checkpoints 5 | .ipynb_checkpoints 6 | -------------------------------------------------------------------------------- /00_setup_data_wrangler.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Upload sample data and setup SageMaker Data Wrangler data flow\n", 8 | "\n", 9 | "This notebook uploads the sample data files provided in the `./data` directory to the default Amazon SageMaker S3 bucket. You can also generate a new Data Wrangler `.flow` file using the provided template.\n", 10 | "\n", 11 | "---" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "Import required dependencies and initialize variables\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [ 26 | { 27 | "name": "stdout", 28 | "output_type": "stream", 29 | "text": [ 30 | "Using AWS Region: us-east-2\n" 31 | ] 32 | }, 33 | { 34 | "data": { 35 | "text/plain": [ 36 | "'sagemaker-us-east-2-716469146435'" 37 | ] 38 | }, 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "import json\n", 46 | "import time\n", 47 | "import boto3\n", 48 | "import string\n", 49 | "import sagemaker\n", 50 | "\n", 51 | "region = sagemaker.Session().boto_region_name\n", 52 | "print(\"Using AWS Region: {}\".format(region))\n", 53 | "\n", 54 | "boto3.setup_default_session(region_name=region)\n", 55 | "\n", 56 | "s3_client = boto3.client('s3', region_name=region)\n", 57 | "# Sagemaker session\n", 58 | "sess = sagemaker.Session()\n", 59 | "\n", 60 | "# You can configure this with your own bucket name, e.g.\n", 61 | "# bucket = \"my-bucket\"\n", 62 | "bucket = sess.default_bucket()\n", 63 | "prefix = \"data-wrangler-pipeline\"\n", 64 | "bucket" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "---\n", 72 | "# Upload sample data to S3" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "We have provided two sample data files `claims.csv` and `customers.csv` in the `/data` directory. These contain synthetically generated insurance claim data which we will use to train an XGBoost model. 
The purpose of the model is to identify if an insurance claim is fraudulent or legitimate.\n", 80 | "\n", 81 | "To begin with, we will upload both the files to the default SageMaker bucket." 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "s3_client.upload_file(Filename='data/claims.csv', Bucket=bucket, Key=f'{prefix}/claims.csv')\n", 91 | "s3_client.upload_file(Filename='data/customers.csv', Bucket=bucket, Key=f'{prefix}/customers.csv')" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "---\n", 99 | "# Generate Data Wrangler `.flow` file" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "We have provided a convenient Data Wrangler flow file template named `insurance_claims_flow_template` using which we can create the `.flow` file. This template has a number of transformations that are applied to the features available in both the `claims.csv` and `customers.csv` files, and finally it also joins the two file to generate a single training CSV dataset. \n", 107 | "\n", 108 | "To create the `insurance_claims.flow` file execute the code cell below" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 5, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "claims_flow_template_file = \"insurance_claims_flow_template\"\n", 118 | "\n", 119 | "# Updates the S3 bucket and prefix in the template\n", 120 | "with open(claims_flow_template_file, 'r') as f:\n", 121 | " variables = {'bucket': bucket, 'prefix': prefix}\n", 122 | " template = string.Template(f.read())\n", 123 | " claims_flow = template.safe_substitute(variables)\n", 124 | " claims_flow = json.loads(claims_flow)\n", 125 | "\n", 126 | "# Creates the .flow file\n", 127 | "with open('insurance_claims.flow', 'w') as f:\n", 128 | " json.dump(claims_flow, f)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "Open the `insurance_claim.flow` file in SageMaker Studio.\n", 136 | "\n", 137 | "
⚠️ NOTE: \n", "    The UI for Data Wrangler is available only in the SageMaker Studio environment. If you are using classic SageMaker notebook instances, you will not be able to view the Data Wrangler UI, but you can still use the flow file programmatically.\n", "
\n", 140 | "\n", 141 | "The flow should look as shown below\n", 142 | "\n", 143 | "" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "# Alternatively\n", 151 | "\n", 152 | "You can also create this `.flow` file manually using the SageMaker Studio's Data Wrangler UI. Visit the [get started](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html) documentation to learn how to create a data flow using SageMaker Data Wrangler." 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "---\n", 160 | "# Upload the `.flow` file to S3\n", 161 | "\n", 162 | "Next we will upload the flow file to the S3 bucket. The executable python script we generated earlier will make use of this `.flow` file to perform transformations." 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 6, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "name": "stdout", 172 | "output_type": "stream", 173 | "text": [ 174 | "Stored 'ins_claim_flow_uri' (str)\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "import time\n", 180 | "import uuid\n", 181 | "\n", 182 | "# unique flow export ID\n", 183 | "flow_export_id = f\"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}\"\n", 184 | "flow_export_name = f\"flow-{flow_export_id}\"\n", 185 | "\n", 186 | "s3_client.upload_file(Filename='insurance_claims.flow', Bucket=bucket, Key=f'{prefix}/flow/{flow_export_name}.flow')\n", 187 | "ins_claim_flow_uri=f\"s3://{bucket}/{prefix}/flow/{flow_export_name}.flow\"\n", 188 | "%store ins_claim_flow_uri" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [] 197 | } 198 | ], 199 | "metadata": { 200 | "instance_type": "ml.t3.medium", 201 | "kernelspec": { 202 | "display_name": "Python 3 (Data Science)", 203 | "language": "python", 204 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" 205 | }, 206 | "language_info": { 207 | "codemirror_mode": { 208 | "name": "ipython", 209 | "version": 3 210 | }, 211 | "file_extension": ".py", 212 | "mimetype": "text/x-python", 213 | "name": "python", 214 | "nbconvert_exporter": "python", 215 | "pygments_lexer": "ipython3", 216 | "version": "3.7.10" 217 | } 218 | }, 219 | "nbformat": 4, 220 | "nbformat_minor": 4 221 | } 222 | -------------------------------------------------------------------------------- /1-sagemaker-pipelines/01_setup_sagemaker_pipeline.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Creating SageMaker Pipelines workflow from AWS Data Wrangler Flow File\n", 8 | "\n", 9 | "
\n", 10 | "\t⚠️ PRE-REQUISITE: Before proceeding with this notebook, please ensure that you have executed the 00_setup_data_wrangler.ipynb Notebook\n", 11 | "
\n", 12 | "\n", 13 | "We will demonstrate how to define a SageMaker Processing Job based on an existing SageMaker Data Wrangler Flow definition." 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## 1. Initialization" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "%store -r ins_claim_flow_uri" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "import sagemaker\n", 39 | "import json\n", 40 | "import string\n", 41 | "import boto3\n", 42 | "\n", 43 | "sm_client = boto3.client(\"sagemaker\")\n", 44 | "sess = sagemaker.Session()\n", 45 | "\n", 46 | "bucket = sess.default_bucket()\n", 47 | "prefix = 'aws-data-wrangler-workflows'\n", 48 | "\n", 49 | "FLOW_TEMPLATE_URI = ins_claim_flow_uri\n", 50 | "\n", 51 | "flow_file_name = FLOW_TEMPLATE_URI.split(\"/\")[-1]\n", 52 | "flow_export_name = flow_file_name.replace(\".flow\", \"\")\n", 53 | "flow_export_id = flow_export_name.replace(\"flow-\", \"\")" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "## 2. Download Flow template from Amazon S3\n", 61 | "\n", 62 | "We download the flow template from S3, in order to parse its content and retrieve the following information:\n", 63 | "* Source datasets, including dataset names and S3 URI\n", 64 | "* Output node, including Node ID and output path\n", 65 | "This information is then used as part of the parameters of the SageMaker Processing Job" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "sagemaker.s3.S3Downloader.download(FLOW_TEMPLATE_URI, \".\")" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "#### Parsing input and output parameters from flow template" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "with open(flow_file_name, 'r') as f:\n", 91 | " data = json.load(f)\n", 92 | " output_node = data['nodes'][-1]['node_id']\n", 93 | " output_path = data['nodes'][-1]['outputs'][0]['name']\n", 94 | " input_source_names = [node['parameters']['dataset_definition']['name'] for node in data['nodes'] if node['type']==\"SOURCE\"]\n", 95 | " input_source_uris = [node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] for node in data['nodes'] if node['type']==\"SOURCE\"]\n", 96 | " \n", 97 | "\n", 98 | "output_name = f\"{output_node}.{output_path}\"" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## 3. 
Create SageMaker Processing Job from Data Wrangler Flow template\n", 106 | "\n", 107 | "### 3.1 SageMaker Processing Inputs\n", 108 | "\n", 109 | "Below are the inputs required by the SageMaker Python SDK to launch a processing job.\n", 110 | "\n", 111 | "#### Source datasets" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", 121 | "\n", 122 | "data_sources = []\n", 123 | "\n", 124 | "for i in range(0,len(input_source_uris)):\n", 125 | " data_sources.append(ProcessingInput(\n", 126 | " source=input_source_uris[i],\n", 127 | " destination=f\"/opt/ml/processing/{input_source_names[i]}\",\n", 128 | " input_name=input_source_names[i],\n", 129 | " s3_data_type=\"S3Prefix\",\n", 130 | " s3_input_mode=\"File\",\n", 131 | " s3_data_distribution_type=\"FullyReplicated\"\n", 132 | " ))" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "#### Flow Input" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "## Input - Flow\n", 149 | "flow_input = ProcessingInput(\n", 150 | " source=FLOW_TEMPLATE_URI,\n", 151 | " destination=\"/opt/ml/processing/flow\",\n", 152 | " input_name=\"flow\",\n", 153 | " s3_data_type=\"S3Prefix\",\n", 154 | " s3_input_mode=\"File\",\n", 155 | " s3_data_distribution_type=\"FullyReplicated\"\n", 156 | ")\n", 157 | "\n", 158 | "processing_job_inputs=[flow_input] + data_sources" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "### 3.2. SageMaker Processing Output" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "s3_output_prefix = f\"export-{flow_export_name}/output\"\n", 175 | "s3_output_path = f\"s3://{bucket}/{s3_output_prefix}\"\n", 176 | "print(f\"Flow S3 export result path: {s3_output_path}\")\n", 177 | "\n", 178 | "processing_job_output = ProcessingOutput(\n", 179 | " output_name=output_name,\n", 180 | " source=\"/opt/ml/processing/output\",\n", 181 | " destination=s3_output_path,\n", 182 | " s3_upload_mode=\"EndOfJob\"\n", 183 | ")\n", 184 | "\n", 185 | "processing_job_outputs=[processing_job_output]" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "### 3.3. Create Processor Object" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "# IAM role for executing the processing job.\n", 202 | "iam_role = sagemaker.get_execution_role()\n", 203 | "aws_region = sess.boto_region_name\n", 204 | "\n", 205 | "# Unique processing job name. 
Give a unique name every time you re-execute processing jobs\n", 206 | "processing_job_name = f\"data-wrangler-flow-processing-{flow_export_id}\"\n", 207 | "\n", 208 | "# Data Wrangler Container URI.\n", 209 | "container_uri = sagemaker.image_uris.retrieve(\n", 210 | " framework='data-wrangler',\n", 211 | " region=aws_region\n", 212 | ")\n", 213 | "\n", 214 | "# Processing Job Instance count and instance type.\n", 215 | "instance_count = 2\n", 216 | "instance_type = \"ml.m5.4xlarge\"\n", 217 | "\n", 218 | "# Size in GB of the EBS volume to use for storing data during processing\n", 219 | "volume_size_in_gb = 30\n", 220 | "\n", 221 | "# Network Isolation mode; default is off\n", 222 | "enable_network_isolation = False\n", 223 | "\n", 224 | "# KMS key for per object encryption; default is None\n", 225 | "kms_key = None\n", 226 | "\n", 227 | "\n", 228 | "# Content type for each output. Data Wrangler supports CSV as default and Parquet.\n", 229 | "processing_job_output_content_type = \"CSV\"\n", 230 | "\n", 231 | "# Output configuration used as processing job container arguments \n", 232 | "processing_job_output_config = {\n", 233 | " output_name: {\n", 234 | " \"content_type\": processing_job_output_content_type\n", 235 | " }\n", 236 | "}" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "To launch the Processing Job in a workflow compatible with SageMaker SDK, you will create a Processor function. The processor can then be integrated in the following workflows:\n", 244 | "\n", 245 | "* Amazon SageMaker Pipelines\n", 246 | "* AWS Step Functions (through AWS Step Functions Data Science SDK)\n", 247 | "* Apache Airflow (through a Python Operator, using Amazon SageMaker SDK)" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "from sagemaker.processing import Processor\n", 257 | "from sagemaker.network import NetworkConfig\n", 258 | "\n", 259 | "processor = Processor(\n", 260 | " role=iam_role,\n", 261 | " image_uri=container_uri,\n", 262 | " instance_count=instance_count,\n", 263 | " instance_type=instance_type,\n", 264 | " volume_size_in_gb=volume_size_in_gb,\n", 265 | " network_config=NetworkConfig(enable_network_isolation=enable_network_isolation),\n", 266 | " sagemaker_session=sess,\n", 267 | " output_kms_key=kms_key\n", 268 | ")" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "## 4. Create SageMaker Estimator\n", 276 | "\n", 277 | "Another building block for orchestrating ML workflows is an Estimator, which is the base for a Training Job. 
The estimator can be integrated in the following workflows:\n", 278 | "\n", 279 | "* Amazon SageMaker Pipelines\n", 280 | "* AWS Step Functions (through AWS Step Functions Data Science SDK)\n", 281 | "* Apache Airflow (through Amazon SageMaker Operator for Apache Airflow)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "import boto3\n", 291 | "from sagemaker.estimator import Estimator\n", 292 | "\n", 293 | "region = boto3.Session().region_name\n", 294 | "\n", 295 | "# Estimator Instance count and instance type.\n", 296 | "instance_count = 1\n", 297 | "instance_type = \"ml.m5.4xlarge\"\n", 298 | "\n", 299 | "image_uri = sagemaker.image_uris.retrieve(\n", 300 | " framework=\"xgboost\",\n", 301 | " region=region,\n", 302 | " version=\"1.2-1\",\n", 303 | " py_version=\"py3\",\n", 304 | " instance_type=instance_type,\n", 305 | ")\n", 306 | "xgb_train = Estimator(\n", 307 | " image_uri=image_uri,\n", 308 | " instance_type=instance_type,\n", 309 | " instance_count=instance_count,\n", 310 | " role=iam_role,\n", 311 | ")\n", 312 | "xgb_train.set_hyperparameters(\n", 313 | " objective=\"reg:squarederror\",\n", 314 | " num_round=3,\n", 315 | ")" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "# Part 2: Creating a workflow with SageMaker Pipelines" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "## 1. Define Pipeline Steps and Parameters\n", 330 | "To create a SageMaker pipeline, you will first create a `ProcessingStep` using the Data Wrangler processor defined above." 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "from sagemaker.workflow.steps import ProcessingStep\n", 340 | "\n", 341 | "data_wrangler_step = ProcessingStep(\n", 342 | " name=\"DataWranglerProcessingStep\",\n", 343 | " processor=processor,\n", 344 | " inputs=processing_job_inputs, \n", 345 | " outputs=processing_job_outputs,\n", 346 | " job_arguments=[f\"--output-config '{json.dumps(processing_job_output_config)}'\"],\n", 347 | ")" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "You now add a `TrainingStep` to the pipeline that trains a model on the preprocessed train data set. \n", 355 | "\n", 356 | "You can also add more steps. To learn more about adding steps to a pipeline, see [Define a Pipeline](http://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html) in the SageMaker documentation." 
357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "from sagemaker.inputs import TrainingInput\n", 366 | "from sagemaker.workflow.steps import TrainingStep\n", 367 | "from sagemaker.workflow.step_collections import RegisterModel\n", 368 | "\n", 369 | "xgb_input_content_type = None\n", 370 | "\n", 371 | "if processing_job_output_content_type == \"CSV\":\n", 372 | " xgb_input_content_type = 'text/csv'\n", 373 | "elif processing_job_output_content_type == \"Parquet\":\n", 374 | " xgb_input_content_type = 'application/x-parquet'\n", 375 | "\n", 376 | "training_step = TrainingStep(\n", 377 | " name=\"DataWranglerTrain\",\n", 378 | " estimator=xgb_train,\n", 379 | " inputs={\n", 380 | " \"train\": TrainingInput(\n", 381 | " s3_data=data_wrangler_step.properties.ProcessingOutputConfig.Outputs[\n", 382 | " output_name\n", 383 | " ].S3Output.S3Uri,\n", 384 | " content_type=xgb_input_content_type\n", 385 | " )\n", 386 | " }\n", 387 | ")\n", 388 | "\n", 389 | "register_step = RegisterModel(\n", 390 | " name=f\"DataWranglerRegisterModel\",\n", 391 | " estimator=xgb_train,\n", 392 | " model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,\n", 393 | " content_types=[\"text/csv\"],\n", 394 | " response_types=[\"text/csv\"],\n", 395 | " inference_instances=[\"ml.t2.medium\", \"ml.m5.large\"],\n", 396 | " transform_instances=[\"ml.m5.large\"],\n", 397 | " model_package_group_name=\"DataWrangler-PackageGroup\"\n", 398 | " )" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "### Define Pipeline Parameters\n", 406 | "Now you will create the SageMaker pipeline that combines the steps created above so it can be executed. \n", 407 | "\n", 408 | "Define Pipeline parameters that you can use to parametrize the pipeline. Parameters enable custom pipeline executions and schedules without having to modify the Pipeline definition.\n", 409 | "\n", 410 | "The parameters supported in this notebook includes:\n", 411 | "\n", 412 | "- `instance_type` - The ml.* instance type of the processing job.\n", 413 | "- `instance_count` - The instance count of the processing job." 
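As an illustration, once the pipeline has been created and submitted further down, you can start an execution that overrides these parameter defaults without modifying the pipeline definition (an override only takes effect where the parameter objects are actually referenced in the step definitions):

```python
# Start an execution that overrides the default parameter values.
# "InstanceType" and "InstanceCount" are the names given to the
# ParameterString/ParameterInteger objects defined in the next cell.
execution = pipeline.start(
    parameters={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 2,
    }
)
```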
414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "from sagemaker.workflow.parameters import (\n", 423 | " ParameterInteger,\n", 424 | " ParameterString,\n", 425 | ")\n", 426 | "# Define Pipeline Parameters\n", 427 | "instance_type = ParameterString(name=\"InstanceType\", default_value=\"ml.m5.4xlarge\")\n", 428 | "instance_count = ParameterInteger(name=\"InstanceCount\", default_value=1)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "You will create a pipeline with the steps and parameters defined above" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "import time\n", 445 | "import uuid\n", 446 | "\n", 447 | "from sagemaker.workflow.pipeline import Pipeline\n", 448 | "\n", 449 | "# Create a unique pipeline name with flow export name\n", 450 | "pipeline_name = f\"pipeline-{flow_export_name}\"\n", 451 | "\n", 452 | "# Combine pipeline steps\n", 453 | "pipeline_steps = [data_wrangler_step, training_step, register_step]\n", 454 | "\n", 455 | "pipeline = Pipeline(\n", 456 | " name=pipeline_name,\n", 457 | " parameters=[instance_type, instance_count],\n", 458 | " steps=pipeline_steps,\n", 459 | " sagemaker_session=sess\n", 460 | ")" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "### (Optional) Examining the pipeline definition\n", 468 | "\n", 469 | "The JSON of the pipeline definition can be examined to confirm the pipeline is well-defined and \n", 470 | "the parameters and step properties resolve correctly." 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "import json\n", 480 | "\n", 481 | "definition = json.loads(pipeline.definition())\n", 482 | "definition" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "## 2. Submit the pipeline to SageMaker and start execution\n", 490 | "\n", 491 | "Submit the pipeline definition to the SageMaker Pipeline service and start an execution. The role passed in \n", 492 | "will be used by the Pipeline service to create all the jobs defined in the steps." 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "iam_role = sagemaker.get_execution_role()\n", 502 | "pipeline.upsert(role_arn=iam_role)\n", 503 | "execution = pipeline.start()" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "### Pipeline Operations: Examine and Wait for Pipeline Execution\n", 511 | "\n", 512 | "Describe the pipeline execution and wait for its completion." 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": {}, 519 | "outputs": [], 520 | "source": [ 521 | "execution.wait()" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "List the steps in the execution. These are the steps in the pipeline that have been resolved by the step \n", 529 | "executor service." 
530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "execution.list_steps()" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "You can visualize the pipeline execution status and details in Studio. For details please refer to \n", 546 | "[View, Track, and Execute SageMaker Pipelines in SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-studio.html)" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "# Part 3: Cleanup\n", 554 | "\n", 555 | "## Pipeline cleanup\n", 556 | "Set `pipeline_deletion` flag below to `True` to delete the SageMaker Pipelines created in this notebook." 557 | ] 558 | }, 559 | { 560 | "cell_type": "code", 561 | "execution_count": null, 562 | "metadata": {}, 563 | "outputs": [], 564 | "source": [ 565 | "pipeline_deletion = False" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": null, 571 | "metadata": {}, 572 | "outputs": [], 573 | "source": [ 574 | "if pipeline_deletion:\n", 575 | " pipeline.delete()" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "## Model cleanup\n", 583 | "Set `model_deletion` flag below to `True` to delete the SageMaker Model created in this notebook." 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "model_deletion = False" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": null, 598 | "metadata": {}, 599 | "outputs": [], 600 | "source": [ 601 | "if model_deletion:\n", 602 | " model_package_group_name = register_step.steps[0].model_package_group_name\n", 603 | "\n", 604 | " model_package_list = sm_client.list_model_packages(\n", 605 | " ModelPackageGroupName = model_package_group_name\n", 606 | " )\n", 607 | "\n", 608 | " for version in range(0,len(model_package_list[\"ModelPackageSummaryList\"])):\n", 609 | " sm_client.delete_model_package(\n", 610 | " ModelPackageName = model_package_list[\"ModelPackageSummaryList\"][version][\"ModelPackageArn\"]\n", 611 | " )\n", 612 | "\n", 613 | " sm_client.delete_model_package_group(\n", 614 | " ModelPackageGroupName = model_package_group_name\n", 615 | " )" 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "## Experiment cleanup\n", 623 | "Set `experiment_deletion` flag below to `True` to delete the SageMaker Experiment and Trials created by the Pipeline execution in this notebook." 
624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "execution_count": null, 629 | "metadata": {}, 630 | "outputs": [], 631 | "source": [ 632 | "experiment_deletion = False" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": {}, 639 | "outputs": [], 640 | "source": [ 641 | "if experiment_deletion:\n", 642 | " experiment_name = pipeline_name\n", 643 | " trial_name = execution.arn.split(\"/\")[-1]\n", 644 | " \n", 645 | " components_in_trial = sm_client.list_trial_components(TrialName=trial_name)\n", 646 | " print('TrialComponentNames:')\n", 647 | " for component in components_in_trial['TrialComponentSummaries']:\n", 648 | " component_name = component['TrialComponentName']\n", 649 | " print(f\"\\t{component_name}\")\n", 650 | " sm_client.disassociate_trial_component(TrialComponentName=component_name, TrialName=trial_name)\n", 651 | " try:\n", 652 | " # comment out to keep trial components\n", 653 | " sm_client.delete_trial_component(TrialComponentName=component_name)\n", 654 | " except:\n", 655 | " # component is associated with another trial\n", 656 | " continue\n", 657 | " # to prevent throttling\n", 658 | " time.sleep(.5)\n", 659 | " sm_client.delete_trial(TrialName=trial_name)\n", 660 | " try:\n", 661 | " sm_client.delete_experiment(ExperimentName=experiment_name)\n", 662 | " print(f\"\\nExperiment {experiment_name} deleted\")\n", 663 | " except:\n", 664 | " # experiment already existed and had other trials\n", 665 | " print(f\"\\nExperiment {experiment_name} in use by other trials. Will not delete\")" 666 | ] 667 | } 668 | ], 669 | "metadata": { 670 | "instance_type": "ml.t3.medium", 671 | "kernelspec": { 672 | "display_name": "Python 3 (Data Science)", 673 | "language": "python", 674 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" 675 | }, 676 | "language_info": { 677 | "codemirror_mode": { 678 | "name": "ipython", 679 | "version": 3 680 | }, 681 | "file_extension": ".py", 682 | "mimetype": "text/x-python", 683 | "name": "python", 684 | "nbconvert_exporter": "python", 685 | "pygments_lexer": "ipython3", 686 | "version": "3.7.10" 687 | } 688 | }, 689 | "nbformat": 4, 690 | "nbformat_minor": 4 691 | } 692 | -------------------------------------------------------------------------------- /1-sagemaker-pipelines/README.md: -------------------------------------------------------------------------------- 1 | # Integrating SageMaker Data Wrangler with SageMaker Pipelines 2 | 3 | [SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/) is a native workflow orchestration tool for building ML pipelines that take advantage of direct Amazon SageMaker integration. Along with SageMaker model registry and SageMaker Projects, pipelines improves the operational resilience and reproducibility of your ML workflows. These workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms. Each step in the pipeline can keep track of the lineage, and intermediate steps can be cached for quickly re-running the pipeline. You can create pipelines using the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html). 4 | 5 | ## Architecture Overview 6 | 7 | A workflow built with SageMaker pipelines consists of a sequence of steps forming a directed acyclic graph (DAG). 
In this example, we begin with a Processing step, where we start a SageMaker Processing job based on SageMaker Data Wrangler’s flow file to create a training dataset. We then continue with a Training step, where we train an XGBoost model using SageMaker’s built-in XGBoost algorithm and the training dataset created in the previous step. Once a model has been trained, we end this workflow with a RegisterModel step, where we register the trained model with SageMaker model registry.

![SageMaker Pipelines workflow](images/pipeline.png)
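For quick reference, the workflow assembled in the walkthrough below boils down to roughly the following sketch. It is abbreviated and illustrative rather than a drop-in script: `processor`, `xgb_train`, `processing_job_inputs`, `processing_job_outputs`, `processing_job_output_config`, and `output_name` are the objects defined in `01_setup_sagemaker_pipeline.ipynb`, the pipeline name here is made up, and the RegisterModel step and pipeline parameters are omitted.

```python
import json

import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

# Processing step: runs the Data Wrangler flow as a SageMaker Processing job
data_wrangler_step = ProcessingStep(
    name="DataWranglerProcessingStep",
    processor=processor,              # Data Wrangler container Processor from the notebook
    inputs=processing_job_inputs,     # flow file + claims.csv + customers.csv
    outputs=processing_job_outputs,
    job_arguments=[f"--output-config '{json.dumps(processing_job_output_config)}'"],
)

# Training step: trains XGBoost on the dataset produced by the processing step
training_step = TrainingStep(
    name="DataWranglerTrain",
    estimator=xgb_train,              # XGBoost Estimator from the notebook
    inputs={
        "train": TrainingInput(
            s3_data=data_wrangler_step.properties.ProcessingOutputConfig.Outputs[
                output_name
            ].S3Output.S3Uri,
            content_type="text/csv",  # Data Wrangler outputs CSV by default
        )
    },
)

# Wire the steps into a pipeline, register (upsert) it, and start an execution
pipeline = Pipeline(
    name="data-wrangler-pipeline",    # illustrative name
    steps=[data_wrangler_step, training_step],
)
pipeline.upsert(role_arn=sagemaker.get_execution_role())
execution = pipeline.start()
```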
14 | 15 | ## Installation and walkthrough 16 | 17 | To run this sample, we will use a Jupyter notebook running Python3 on a Data Science kernel image in a SageMaker Studio environment. You can also run it on a Jupyter notebook instance locally on your machine by setting up the credentials to assume the SageMaker execution role. The notebook is lightweight and can be run on an `ml.t3.medium` instance. 18 | 19 | You can either use the export feature in SageMaker Data Wrangler to generate the Pipelines code, or build your own script from scratch. In our sample repository, we have used a combination of both approaches for simplicity. At a high level, these are the steps to build and execute the SageMaker Pipelines workflow. 20 | 21 | ### In the setup notebook (00_setup_data_wrangler.ipynb) 22 | 1. Generate a flow file from Data Wrangler or use the setup script to generate a flow file from a preconfigured template. 23 | 2. Create an Amazon S3 bucket and upload your flow file and input files to the bucket. In our sample notebook, we use SageMaker’s default S3 bucket. 24 | 25 | ### In the SageMaker Pipelines notebook (01_setup_sagemaker_pipeline.ipynb) 26 | 3. Follow the instructions in the 01_setup_sagemaker_pipeline.ipynb notebook to create a `Processor` object based on the Data Wrangler flow file, and an `Estimator` object with the parameters of the training job. 27 | * In our example, since we only use SageMaker features and SageMaker’s default S3 bucket, we can use SageMaker Studio’s default execution role. The same IAM role will be assumed by the pipeline run, the processing job and the training job. You can further customize the execution role according to minimum privilege. 28 | 4. Continue with the instructions to create a pipeline with steps referencing the `Processor` and `Estimator` objects, and then execute a pipeline run. The processing and training jobs will run on SageMaker managed environments and will take a few minutes to complete. 29 | 5. In SageMaker Studio, you can see the pipeline details monitor the pipeline execution. You can also monitor the underlying processing and training jobs from the Amazon SageMaker Console, and from Amazon CloudWatch. 30 | 31 | ### Cleaning Up 32 | 33 | Follow the instructions under **Part 3: Cleanup** in the SageMaker Pipelines notebook (01_setup_sagemaker_pipeline.ipynb) to delete the Pipeline, the Model and the Experiment created during this sample. 
-------------------------------------------------------------------------------- /1-sagemaker-pipelines/flow-01-15-12-49-4bd733e0.flow: -------------------------------------------------------------------------------- 1 | {"metadata": {"version": 1, "disable_limits": false}, "nodes": [{"node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "claims.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/claims.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "driver_relationship": "string", "incident_type": "string", "collision_type": "string", "incident_severity": "string", "authorities_contacted": "string", "num_vehicles_involved": "long", "num_injuries": "long", "num_witnesses": "long", "police_report_available": "string", "injury_claim": "long", "vehicle_claim": "float", "total_claim_amount": "float", "incident_month": "long", "incident_day": "long", "incident_dow": "long", "incident_hour": "long", "fraud": "long"}}, "inputs": [{"name": "default", "node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "customers.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/customers.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "a2370312-2ab4-43a3-ae7d-ba5a17057d0a", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "customer_age": "long", "months_as_customer": "long", "num_claims_past_year": "long", "num_insurers_past_5_years": "long", "policy_state": "string", "policy_deductable": "long", "policy_annual_premium": "long", "policy_liability": "string", "customer_zip": "long", "customer_gender": "string", "customer_education": "string", "auto_year": "long"}}, "inputs": [{"name": "default", "node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "police_report_available"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": 
"P)h>@6aWAK2ms2BZ&R6WWXb6O004CX000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx%1ME87#*Qpf(>7oJ&#MZjCZ(6H%1%HSefU5ZW&!?IhpgSc+k>^~zV6MSv%c+1-bxHrFOt5=M!;g3q$_cvD^Y(+n6QIYC{+dyqTGvi~)x4~8?25S^b!Wbw^ID)aTwOEBv@)woSMXAEa#XWd!aqW*_=g|XVWK+X5j!)FfcGME@N_IOD;-gU|?WkIGK_+697<40|XQR000O8%8hSR-Pv?efB^siZ~_1TNB{r;WMOn+FK}UUbS*G2FfcGJVlrhhF*!J9EjeUjGA%SSG-WMeGh{F=GGSphWnnisFk(4oEn_e+FfMa$VQ_GHE^uLTadl;MjgnDI!!QuX&#caYd(B8842Sr5EHPvF;ZcV0$?DP4x%-qze2)=~7|NoM^-;uN9a~}ha@UVv+8`N=~hOk?^lA54V2>_JV>{)Aov$1uw2Y_rgDi@lq!N^r7O+69S!>u0Q%Uoat2Z(Gd5lf|yt4cg$gzIqN5JzR&EbT3+WG)Ny)56Wpm)QYmA(y&zr$KSk?bN}&Tz~id(ns;P1f!PIGOfp*#?d;9OD!Pa1%P!CxIIG3>W+(vu%q*L3jQc7os|XI47kTVAl+XTdaAe$rZZ`HRX*`t8j%Pi$m_-nGtU#rhD@7kHa4us(oCvEv*Gm%D@4Aq)(EV>tpB_xR`Lqu;vep!S619vd0ZAoWQ88hlww0Wft>##7B&&Fl1rh-J`ilL`TS-MzlgP2(s^lfyI$!!QYhaNtM%{UGqtei$c%u1l$1c-LK1EP)h>@6aWAK2ms2BZ&Rr%CoK{H000mG002z@003lRbYU+paA9(EEif=JFfc7*GG#F_IXGr5Ib>ooEi^MUWi4SdWH2o{(90CfQX022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2ms2BZ&PF000aC000;O0000000000005+c8UX+RZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8%8hSR-Pv?efB^siZ~_1TNB{r;0000000000q=8-m003lRbYU-WVRCdWFfcGMFfC#-Wic^1IA$5WMVQcG&3}1EnzccFfB4+VK!x9H#jh2Ic6)FfcGMEn+fdF)=wfW-U2nVlpi>Gc;u_VKZbfEiz$YHf3QqI51*4W-VhdFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0Ko$Q0000"}}, "inputs": [{"name": "df", "node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "a9dcc572-7c50-4e63-b592-6691f8df29b6", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "incident_severity"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2mlj~Z&R2h55?vH003_R000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YiPs1<_K=1sFoVTc@P`X{mWrqnQ5ZB5}+$~zZaFU5K^}k~k3Be?8mi6AVFG5*JNvuVngMlXO-N2)Pi-mrTYVwqD~0(k^JUVj;3|m@>F1)^*#ot5tVjchz0J5VQY=luYnkZe%5^CAd$&tk^3?^YTx!-w*q)dARM`20KRB2e$(*Uxe|ob7_W>*pZkPoB$G!*kKnb%tv`dEyC=lR>rnEKmeIvt@i;An#sP}W&%X{_7mz5>RpS3|`~I6_AB)FfcGMEjeR1IAdWqWGy)`V=yf=GdMXdVl-r8EihtZGiG8rH)1e1Wi4YcFfcB2Zeeh6c`k5aa&dKKbd6HcYTGarmEFYYSY8!m@Iw%!q~HmAPF*MTrEjIsQbrktF@_w;oj2J|C1sFXl{wU!Ut<@0Xs<&$sx)>(_oR}QtnJ?v?vH+xBZLO7ANaw}c7Xf-$V2|c^LrnMp2UAN*d2I-J7;j~>}(&r5YZ1HJ_0{*0GW&YkdQ`SGQQvtf<5M8KcqHgQY8|nN%*tJ(#UjjUDYRk*ooGz83GkAeqy#m(I@Lb6a5Yndz;I&$PULkVa27FiPVY`uG%;Jc}!*RYWl^90FR1}j$fC%ztPuM}mZqsllXMzqllMsy6_Yd<(_KFZVB-m_{S3b115ir?1QY-O00;mRj&D;xA*7uW0000G0000@0001FVRT_HE^uLTbS*G2FfcGJIb%0CV_`RBEjch_FfBAQI5{n1G-P5eFk)jfW@0%vVlX#lEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?WkIP+aETqkIbEdWqU0Rj{Q6aWAK2mlj~Z&R2h55?vH003_R000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;mRj&D;%AE&7f0000C0000O00000000000001_ffoS)0BvP-VPs)+VJ|LlVRCdWFfcGMFfLBdWsVK-zgIWS`|Ei^MYIW1x|WMVBaVq-IAVmUWrFgIl_V=yo(+G&49kEn+leVl6OYV>4!AIX7Z3H)SnjFfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve007nl00000"}}, "inputs": [{"name": "df", "node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "2059176b-8fd9-4324-9452-cee6193474ee", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "driver_relationship"}}, "inputs": [{"name": "df", "node_id": "a9dcc572-7c50-4e63-b592-6691f8df29b6", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "56815d9f-f395-494b-8f16-8ddd2138e87e", "type": "TRANSFORM", 
"operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['driver_relationship']=df['driver_relationship'].replace('n/a','na')"}, "inputs": [{"name": "df", "node_id": "2059176b-8fd9-4324-9452-cee6193474ee", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "6c7a9e1e-16d4-4dda-bfba-e87f9c00d798", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "incident_type"}}, "inputs": [{"name": "df", "node_id": "56815d9f-f395-494b-8f16-8ddd2138e87e", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "b286b138-9c3e-48c6-8e32-5188ca9b62d5", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "collision_type"}}, "inputs": [{"name": "df", "node_id": "6c7a9e1e-16d4-4dda-bfba-e87f9c00d798", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "a32f07c9-671e-4c88-ada9-606d0f4dac1d", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['collision_type']=df['collision_type'].replace('n/a','na')"}, "inputs": [{"name": "df", "node_id": "b286b138-9c3e-48c6-8e32-5188ca9b62d5", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "aaf76a75-ccbe-4e8a-bc3c-f74dd6f9ee5d", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "authorities_contacted"}}, "inputs": [{"name": "df", "node_id": "a32f07c9-671e-4c88-ada9-606d0f4dac1d", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "c2fb6219-969d-44ac-9551-4996073630a8", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "driver_relationship"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mtSpZ&RJqavbOY0040T000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YiPs1<}MDP3zpHnL#YEXKmmmVsRKwP86-ZWPI#oDeYRsK6mkWfM5I-2)pE<#yINo+)*gMqL?e8afD*?3o9VxDh40Z>Z=1QY-O00;o@kZ)7rpkAB~0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d9DY`rt08mQ<1QY-O00;o@kZ)5_8f)620RRA-0ssI=0001FVRT_HaA9(EEif=JFfc7QH8e3|WMN?~V=-YgEi^GQGA%h`WMeI4V>o7HGB-71H!w6UV=yoX?(g-VT5t}!pWf^B1F!;-MqSf|kw2oeGPhhT1!&-FIsCgcHm^$BiFwFFM33He*gMT3xvQn)ruxAP=|m=+pubuV%-t^~n=$K5J$G!xR4#R9+3f0pm7-{@6aWAK2mtSpZ&L$&q2dw%000mG002z@003lRbYU+paA9(EEif=JFfc7QH8e3|WMN?~V=-YgEi^GQGA%h`WMeI4V>o7HGB-71H!w6UV=yoZ;0u%!j000080Pm1*Q=QXt9OwW50C52T022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mtSpZ&Tr*UYrg9000aC000;O0000000000005+c836zQZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8?~rd(Q5tL7p#cB@n*sm;NB{r;0000000000q=8)l003lRbYU-WVRCdWFfcGMFfBJVG%;diVPP#}F<~?)FfcGMEjKkZF=Aw4VJ%}ZVKgl?F)}hOIbvjEEo5UjW@IuqHDWg~G%aH=FfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0O11w0000", "one_hot_encoder_model": 
"P)h>@6aWAK2mtSpZ&RNX%Zbwf003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFpZ9}PQyS9hIbz0=Ju2-N=mY#N>pkE6YI%M92#+VKKXnMRC#wy5l9S`SUi08_y4|_!9&ibH^i?N5)5awIRU@f;O&GUh0#7md?eY$fkxz`4sTXQ=73osB-3x&y6w8IZ8mk&uFTbcLC!NpO|ETat2Km=xUSSI^QiDor0XWQZz?$9X}i@E^&uQVlP`vfq#_UGRdRwMxq-a<3Yz2Z;)g3eLmWNs<2(3k5i%1iADPktvY37XpK(!w6BIAcH_C2*=-l_+zuJTgF3a5~P)h>@6aWAK2mtSpZ&OWF`nC=L000aC000;O003=ebYWy+bYU+paA9(EEif=JFfcA-a$`#_N@ieSU}ESzEixYfP)h>@6aWAK2mtSpZ&QH6$k2iT001xo002k;003lRbYU-WVRCdWFfcGMFfC*?Vqs-8GcYY;V`4WgG&3|bEn#9aV=XZ@Wo0sBGBGh@VP-93FfcGKb8ca9aCt6pVRCVGWpsT~U2EGg6jjvB>qGa@2N@as5Cjb^xWb+j*9m=S-%2SXYLTGJ$bogHcXa5X?-GK`hqg9sR;Ob(}PKEjN%T9jFN)U|tn0}fFC$Af`_*ULZ3++ABZ_~CetpfnT2Z(-@T2``l$j3n#gx%c`njLs3$hXNaq*pM(KKAeZhgNs@^c}512>Za8Rps8tD2OO+%rzHN0V4T}UjA1aTQ)WQkn)(vh#tK`*qP?w%;wT@)7ug+8yAD4APnIJpIQiw;IBLbP)h>@6aWAK2mtSpZ&R(dok0=+000mG002z@003lRbYU+paA9(EEif=JFfc7-HDY09G&3+QVq;=AEi^MUG%aCbGh;0=Hf3cpV=^%@V_{}3V=yo)FfcGME@N_IP)h*<6ay3h000O8?~rd(fWpYof&l;kFaiJoNB{r;0000000000q=85Q003lRbYU-WVRCdWFfcGMFfC*?Vqs-8GcYY;V`4WgG&3|bEn#9aV=XZ@Wo0sBGBGh@VP-93FfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mtSpZ&R(dok0=+000mG002z@0000000000005+cFarPpWMOn+FD`Ila&#>)FfcGMEo3!fVP!NkFfC$ZVmB=`Gc+_UVPZ35EipD_Win$jF)?FdW-VhdFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0IdT60000"}}, "inputs": [{"name": "df", "node_id": "aaf76a75-ccbe-4e8a-bc3c-f74dd6f9ee5d", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "3e1bcebf-1353-4ab5-97a4-7fc0c2c00579", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "incident_type"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": "P)h>@6aWAK2mt+%Z&T$i^7Y~X003(N000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZHhtWzr_BCl=PbWwo>VjGR@No&VyF&ne7l>rI9d*)U3fUDxh6)lOG_8I_DAFR)64F-mY!GWjezncL70GI**07w7;0AyiwVJ~oDa&#>)FfcGMEi*7OH!x&nGA%eUW;rc1HZ(OYIA&rtEjcnVW->TAW-&BnFfC&+FfcB2Zeeh6c`k5aa&dKKbd6G7Yuhjsm7T;*SzZ-n@Iw%!B#;PuPTeH*rEg;oqm(`j#u#!WwWhM2%F0Fw`49airT=1oYd2r(=U^}5-gC~?(YfgS>>|Plr&tWI2Ze{dAVIx`y_U+7W@;qA8D<4SC<os0YdA*H!ue9j>S+XP}irCA3jjhKZjM5O;lQQc|=#qB~`X3AeImzE1|Ir;4@slA&^I;E90i)p83A=sp5?tybSxhbV4-|Y7`5pR5~6jaiYdal87WuCee8GF^)quN>VZD`#}I3AHmws;ZObnP)h>@6aWAK2mt+%Z&P8AA|?_5000mG002z@003lRbYU+paA9(EEif=JFfc7MFfunVWMwifI5B29Ei^VXH7z)1Vm2)~GBIW{I5}o9G-WU?V=yo)FfcGME@N_IP)h*<6ay3h000O8{g7`{r{AK#n*jg-m;wL*NB{r;0000000000q=8of003lRbYU-WVRCdWFfcGMFfB7MGB+?}Wil-|F=ja}G&VFfEjVUkHZ3_aF=jG2Ic705WiTycFfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mt+%Z&P8AA|?_5000mG002z@0000000000005+cSOWk6WMOn+FD`Ila&#>)FfcGMEi*7OH!x&nGA%eUW;rc1HZ(OYIA&rtEjcnVW->TAW-&BnFfC&+FfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0M!Ek0000", "one_hot_encoder_model": 
"P)h>@6aWAK2mt+%Z&PQfXcN@{003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFqMxzO9Md=hI@a;)R$G3wQ3dK~OMgWxEW!bLMcr*bfWC{ddQRLJA9;DPErUnVB;fho!&-F~15-Xig+b!u%$@6Ek6LnZ#3|`&jRSiI(z_hl394;;B^h38wAP4@#+FRQvtd!KME~xhyfTxffE%4f>C;k=$#G#iIm}Yc^I#@z#^$;C1m3-1Beq9+`{f+fct|hE&pc$L1&;MFhbE9kP-=Apx&B40fd;(BQ0|XQR000O8{g7`{ZyYpT4gdfE3;+NC7ytkOZDn*}WMOn+FD`Ila&#>)FfcGME@N_IOD;-gU|?Wk5YAYc4ggR~0|XQR000O8{g7`{8Ca6KegOagFaiJoNB{r;WMOn+FK}UUbS*G2FfcGJIWjXbF*7zbEiySdWGysdWi~BhH#uW1IAu6tW-~A}GBGzWEn_e+FfMa$VQ_GHE^uLTadl;MeNs(JD=`#JW34W9;UYr@W>Eqbg)z*$spGV`(4B~gs5^N)(oUko%godyRjL;J8U8?jh_|+iBACGCew=gfxglqV=P^Nqkxoc3I4#l&XtRkBnp+QTeTm>~9AL0ZaKU0U%M$FL9Ie@rF#rGA;f*W>!DtuL^`UU`x;B%G;z3$y-$J`h+SG-003di1(G{s>DJzFO><50(Sq-4sW()b*ck(;w15B`oy(jOv(OKR9MXMjcJ}_ohdGavwLrQCN&BaK7NdB&u-$`SOx}wVok9dUW_78-;)f}ALv2@(j_q>!&)FfcGMEjcnXF)=eXH7znZIbDF4O928D0~7!N00;p6kZ)6Gsb~|`0001K0RR9K00000000000001_fdBvi0BvP-VPs)+VJ~oDa&#>)FfcGMP)h*<6ay3h000O8{g7`{ZyYpT4gdfE3;+NC7ytkO0000000000q=5zj003=ebYWy+bYU+paA9(EEif=JFfcA-a$`_S0Rj{Q6aWAK2mt+%Z&MjqlDd8Y001xo002k;0000000000005+cN&x@>WMOn+FK}UUbS*G2FfcGJIWjXbF*7zbEiySdWGysdWi~BhH#uW1IAu6tW-~A}GBGzWEn_e+FfMa$VQ_GHE^uLTadl;MP)h*<6ay3h000O8{g7`{uJsj95&!@I5C8xGO#lD@0000000000q=7C2003lRbYU+paA9(EEif=JFfc7SGBYtTGd49XGC4V9Ei_?eHZ5W|Ib$t2WjJAGGcYtVF*h(RV=yo@6aWAK2ml9>Z&QT&G<)O#003+O000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;d1PQx$|MfZHhtWzrxktW%eO&1kNAhyxSp0rl|Vr(Nym4C+#5`s!>ukJZ_<_ywcj4*)by}*F-Kq3a@qp(IygefzL=fv9Ww@%Rq@y;tUP(jj`VnQ2WxNV!Z@B6mDZ`SJy7XFnuYVV4<5w)mSXg}SmE?m~CgnyJ{*LMwd%^l*@mP+B#?y#88pk2sm<4s|tH%4EiV>$&Yy#qRBXzSk5EYc%UD6!A!he>3^SU{L7P&8pS>FJmK+Ca>niy|)PJ+dNxxoizEF}-DS^7KGaUXA?!%s77cV4iQj08mQ<1QY-O00;mFk#AE1sg-UH0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?dXHu=>808mQ<1QY-O00;mFk#AFn>*Q^i0RRAu0ssI=0001FVRT_HaA9(EEif=JFfc7QFgQ0jGGQ?-W-(@EEi^eZFfCy*H)bt2IX7Z7WHw`EV`E`0V=yoMKVpIgPCbf(p^OQzwB}Ull`~$O))?WJ~j0YI|%9f5aYcp+leey0C6#Z%QpoS-JNX?(|>GK0;kD2!n8^9pX;VZ=vAg^|{NU=j1m5_PXuvuLg{r_LujFK?v*!*YB@$JQPHPOwBFhGY$l73B!I!V~c4;nkkD}jL6=98TLUl*gVXoV`lo3mC|v+9VdSpPt-lmBpp-h%zWZ#L{%ZRuylO+L@Qp_)+-S3SNZ?y_*!1`oPWZt?<%TWkS9RNi^_hRYfZ=S0=f7=tRUdImMbBm&j@hWVsX7dz3D%f)Cf2t(+23~fJyfA|+rO9KQH0000800)t8Q+N`K#1a4i01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfBJQI5#*lVKFUcF=k~gG&wUcEnzV?W-T{4H)1qoHe+RDV__|0FfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}Bh(=`0rRaW@M9P)h*<6ay3h000O82a#`6g!(jl*Q^i0RRAu0ssI=00000000000001_fms0n0AyiwVJ~oDa&#>)FfcGMEjKVYH#jn3F)d~>W@RljIWsUVVKFymEjKwgVl-qnV`XDwVJ%}YFfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j0000800)t8Q+N`K#1a4i01yBG08Ib@00000000000HlFe0{{SIVRT_HE^uLTbS*G2FfcGJH!wIiI5J@|EoL!hWi2#0GcYY-F*jx{H#s+AG-NhoWn*JuEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(a)B^wj000", "one_hot_encoder_model": 
"P)h>@6aWAK2ml9>Z&M|j)%Vl@003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*PQySDMSDJDYG<{?*urijq#zMNMY|e#$HZb^nw_N}%D=M-0wqP7DX#84XAUqq%o#?A{Gl-6v?rAl@~84n&4f~z)N7z^l&3I7OT4ScK?AaQ%7Q+@c;C1EFbw_Bbr0(X&ii-byu?t+t8eSm{bvS8yw7Ruzt<*jG9ai!ImXL~UqN^JrpUf0b@>t|QHh;M};VsAXf>~gk(Pe62kHI>64qxcbW@3&E$J(SBi<7qt*(LV&hsON&u57~Pw(NNK15ir?1QY-O00;mFk#AGa;ufV20000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?eSesz{A08mQ<1QY-O00;mFk#AFVmIq>f0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7OGC5;3Win(fH)A$tEi^beGc7q`W;rc6GB#p2Wid1{IWsaXV=yo?Q@|E7a1}zixRLXjA7P|$`PNKuh%+w@Rsuuhi{y=|-x3-ERn84+JoOAEFAt(E%F+qfpPDn5~Ez$~Ta~UBtw;tO362V#D!(f--g2iZ-CD=bYShXWz{$JVQjVuMhXb01^p>XoLHj}gBL0V|vLOV^`)P;2bAb11OKT^w5Rt|aC5B#9B5A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjMLNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&Z&M|j)%Vl@003zL000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;mFk#AGa;ufV20000C0000O00000000000001_fdv5o0BvP-VPs)+VJ|LlVRCdWFfcGMFfLV_jG&ngkEjeLkIW0LdHexqrF*GqbGcqk>FfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve0062300000"}}, "inputs": [{"name": "df", "node_id": "3e1bcebf-1353-4ab5-97a4-7fc0c2c00579", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "ec797d62-f14d-484c-bed6-769031cf10a7", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "authorities_contacted"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "authorities_contacted"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": 
"P)h>@6aWAK2mooDZ&Q01*?i^z0046V000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*P6IIzMSDKO^~Q48yht0G6cI=u+SOpsX07J@8h_@}a4sZGDsxAWJH(G{sa_yg4Di-?6&;azl9V)Uug@q68d-xLZDkt|!x^%xj5qp8RM!8qK9cW~3>2{DPW5e9oGjvud#T>q@=ez@A5C+L9;((^fC)q29Wf&~p6Al0X01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL)FfcGMEo5P1W;i%xIW1*nHfAj}V=-hcVqs%BEipM_IXE;mW@R)mG%aH=FfcB2Zeeh6c`k5aa&dKKbd6F?Ya1~Tl{fCjA->9JVGl;&kbo<8yUxZoz2sJUD53OFD5Yq3N9j_n){(R=A^6|)&>vBH?ceG+cGC0A=}>k`<3Y>G}xQvx9*kN5E0`$GD?{m;!I2{GI1`WTtkzyaR(6&q017HAOZXG6D=_QN|@zK2HAfSd;GP|E_eJGWL|(}0HLw9opwc799mgN6o-|DB@4$xm`Z$S8>7C_Wm*bQn!S61{%-?rqVX_?LtRqrvEFZ}hEqG-ul4%vny}|PgO3|K-c4kjiZl`F__*k5U5yj9sMIX0K&fIrSJj!C=XqY`S-wc;vnN@Ws99cAi<2mhH_L`&zrf}E0Z>Z=1QY-O00;nSn{QLcJKwkx0000G0000@0001FVRT_HE^uLTbS*G2FfcGJWMN}wI5=cEEoEgkW-T;hF=Q=bVPiQhF*#y6I5ah8Wi&7}En_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wk*d*1wPEkO!0sv4;0Rj{Q6aWAK2mooDZ&Q01*?i^z0046V000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;nSn{QLAtTGc00000C0000O00000000000001_ffoS)0BvP-VPs)+VJ|LlVRCdWFfcGMFfLWG!N0V>vA`Ibu0DG&N>rG%z$RV=yo@6aWAK2mooDZ&N~3HXYOe003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFoloHN(3a-m&nkE|6V}G09wObQdJ}DO#9KKYI;cyaQZZzhPA~dPt%so>ug=B}j{FmGSyB{pue8z|46kvcg=>qG@Q2diNCTZ)r)ND%MR*Kb6!QU=m|MM0UYJTD8=Kz8_ZXe0yZD0ss3YdYk~>cM9!rI3Mqj9d)an;+J_Sa@DqWhf`TF>6?0;?L_RISBC&qmAU|mn%08mQ<1QY-O00;nSn{QK=wVr+s0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d9H4AnE08mQ<1QY-O00;nSn{QKfmIq>f0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7G&f-~Ei_|cF)d;?VlXW+Vr67yH#uWDVrDQcV=yo?Q@|E7a1}zixRLXjA7P|$`PNKuh%+w@Rsuuhi{y=|-x3-ERn84+JoOAEFAt(E%F+qfpPDn5~Ez$~Ta~UBtw;tO362V#D!(f--g2iZ-CD=bYShXWz{$JVQjVuMhXb01^p>XoLHj}gBL0V|vLOV^`)P;2bAb11OKT^w5Rt|aC5B#9B5A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjMLNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&F=Az8Wj8rvIbvooEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wkc$%7c_S`=!GXPLa0Rj{Q6aWAK2mooDZ&N~3HXYOe003zL000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;nSn{QK=wVr+s0000C0000O00000000000001_fdv5o0BvP-VPs)+VJ|LlVRCdWFfcGMFfLG&f-~Ei_|cF)d;?VlXW+Vr67yH#uWDVrDQcV=yo@6aWAK2mmdyZ&PUaA7ultxDB^6H|@~*^UjPIOWxgU1VKT!ALksti%=F)5^E9YC?SG6p-d6zES;4jLN0{zF_ZDfjo0{wv<(`qs6;jnQwA5sy6M~7y18lk{;q8*vG~tO$pp{kR@Sm!fqVC>n!RE)k3W(_cdZWk02Z=1QY-O00;mrv2Ro0x?aT&0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?c6r}!fS08mQ<1QY-O00;mrv2Rn!%@Qb`0RRAm0ssI=0001FVRT_HaA9(EEif=JFfc7MH85p1V=^%f|1*m^yryciC_9OT+IJ7_qnzYN)d*AmxzxO@)^zc)P0Fvli9|bP)kssw?56&5e>;?dkMH?<|{NCw!TXA&Z>-E9Klc>1!l_hkFRF(T~rJ~Z~*Ze#R{R!q1NmgLF9vj4(_PZ*d@Isl&C~)Us5ROJ(E2Ex-g8wXm>k8LD=&__+|awho(pIUj>5Qp!>_~{`7XYKd$);Bj*{QcJ-EqAnaq@=nKlG%q>qJe9$YQP1)3VGouMj0N#J4FqewDn~)yng_=(^-q0l1+T1p{tvdDRn`{{o;0;(Wo=GoN|CYKKrH?N)~;ZM5=)>@9xb?8i^cf@;Nz9Sca@G;7l~7vCNz!ridm_6oX8QEgNzF)i{Vi64|162InT3vlnw{)vn-K=yx^n#FpAv8-@3D(y8m|%P)h>@6aWAK2mmdyZ&UDH7Euxa000mG002z@003lRbYU+paA9(EEif=JFfc7MH85p1V=^%)FfcGME@N_IP)h*<6ay3h000O8EwOJ?$;}cdodEy)FfcGMEi*MRWj13nF)cGM!CWi4YcFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0M!Ek0000", "one_hot_encoder_model": 
"P)h>@6aWAK2mmdyZ&Ty@0j<;k003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*P6IIzMSDJD>W$@1U`e(`k%B}76|FU_XOk6vX*{MN%D>|T@s!fI7~OM@4=^~)8G4AkQ+4{zwsWnzvD$LgRFi<75cu^+8)`DeZSV`Dydmt)7P4^T@31QY-O00;mrv2Rm`t;dB90000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?cc%lP{-08mQ<1QY-O00;mrv2RoC3|$C-0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7JG-hTwGc#l@GGb&mEi^DOGA%ebG%_t=Gc-6bVK*~4H!x)_V=yoxAs9Fe8`Z2c_;yk!Wd?pIy)`1~qP{I6X(w?9Gcz?wSE?=S&+rHOL%g+hQ3MmX+>dkaJvZe1=prVFFwzYP2B$?@0d1}$gqGGrJ0BuATZb6z8eFg#EwTjrr^g$1EX>`F9sQA|AQ&BBx;YX~-qvP%Sv*P$?OW)eNt?Q`4gdsiBf2iNEM?`8hyB10y6pg(ZMTr0eI~z=-oXTW*qeF(8r}A;q#Km6?Z;A5vPITQ0@|MDjPi{I4{&s4My-;SrAz-Ti{F_nL!qKaq}``hl0yiCj2ApBqoiO~;ZAn000{ax7%3kUF<)@cW5XqFj_dtpSK7<3CdvBc_mEt#-A{78Q`Bf&8ehoG2=LHqn|5l+aerLsmf~h8mzSScdFf6EvRBujh!KtQ7B+rn6eH6Z1IYac5^ZDRkb6R6kd}B$rAJvrOebRhFh{o+fEO&U*Vv5~*G~%=^235Wow*G!Qz6zw!!DO9KQH0000804=d^Qw$Vgs}cYJ01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfA}NW@b4vGh{6?Vq`ZhG%ztTEjTzdGA&^)FfcGMEig1@W;ru6WGymcWH&7|FflSMI5;#iEnzb>I51&1GdMReWi4YcFfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j0000804=d^Qw$Vgs}cYJ01yBG08Ib@00000000000HlF00{{SIVRT_HE^uLTbS*G2FfcGJFf?XnIWse4Eiz(cH!U@6aWAK2mnB_Z&R+DzHFnluZ>-54N|AQ(gDdtL`+H9$6E=jFBofFufJ>YJ8xi{#hsHz%Wf$Ur?tWMCF0P*LbhGo8-Z7B}uhL_}N&kH;g^$>?Zi+D=kMr2ogY_#gaR-hTM4;3nL2&ONy~mmKe%xESF}e9^*HD9X6(1Zb<|@piyT#pS%+;T$34J8N*R-J@cma+q$Q5}=OX_W8VU{kB(H-(8q0bG)=+%U1g}Jr?=Ze?7dAslXQUEwZ6RD3#vJb8(9;TtuWVZ+z<>yR{1%zOWJnUNLIvHi8;=`YmztRy=JdN|6ANHTpiRf-Pa1#@6aWAK2mofWZ&MJsG`Zja003j7TQ71Vk;ue5)$YG%AA0X(mOdJlu9U{3fVC4g24}@eKhDqBZ~Ey3-~0q?YiHncIY-;KXi>)_!p#XV_*b3YaCuD)v|{x%!A(po6jQFl4`k&j-i4+jXHj>A0QZqyYLQf@f{Hp$2RVt;F2nsn>d4R#UV_ukX@^geWUa-Lv>7x?2C2T=Ifk1}-VxB9ZV(ZrXc=1L%hCeaR@J6}<^v)irrdcpph@Bx$J`LleZ??KzRg)rC))tkOua&!6F&KEEQd+69Bdx{rA*%M>kPhh4uaQzNf}140R(>F05T7SV?r8p!T6Fx2zH-`T?=EYMw4F3ViqHE^h9B=6ocZb6^@zahSkFHk~>bmckZaWT?#s*)|qzh=$Oh%sM6BehdZiyy`~;z0e++Zd5>@5ZKQYq!LGe3eF5;i(T-QzzHODFGkAem{R7m{@Ir|VFs2U?z-zVov_j-`8}LPKhTDw{6PCm*86FmGrOII}r=^^vrI517b6LKUd6s2mmS)o=pS(=dSWdE{oF0Wy1RGCb?Pu^Ce*jQR0|XQR000O8X0dNm%TJaX5&!@I5C8xGO#lD@WMOn+FD`Ila&#>)FfcGMEigDRFk@k4H7zkVIAbj|W@BM3IXGozEnzq`GdMJ1HDft2V=ZGaFfcB2Zeeh6c`k5aa&dKKbS`6ZV@obdW?*1oVlcj=)85||JRbm1O928D0~7!N00;nPv2RllxHP%o0001O0RR9K00000000000001_fdBvi0BvP-VPs)+VJ~oDa&#>)FfcGMP)h*<6ay3h000O8X0dNmRrUOb4gdfE3;+NC7ytkO0000000000q=69u003=ebYWy+bYU+paA9(EEif=JFfcA-a$`_S0Rj{Q6aWAK2mofWZ&T`O_GzC1005E#002k;0000000000005+cRRI71WMOn+FK}UUbS*G2FfcGJFgP$UV_{@9EipDYV=XjhV__{hIAvxnVK_81I5c53V>vKmEn_e+FfMa$VQ_GHE^uLTadl;MP)h*<6ay3h000O8X0dNm%TJaX5&!@I5C8xGO#lD@0000000000q=8xk003lRbYU+paA9(EEif=JFfc7JI503{VPrKeF*Z13Ei`6hVJ$g0Wo9j5I5aajG+{MkIWS`_V=yo@6aWAK2mofWZ&NPv-38VF003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFpZB*O9Md+hVTB2oVUYlx3+e#A}Cn&(C&>w{!LyL95f)ar!43b^zF9W>2|Z;tvCB^1Lyt|ah_wS=1ys)SLoluO4nYQ#~S`nnxVTK_LgjWg+oPZAN>KV`2-`!bdOn}pU`D$U=FFsb#zrC)FfcGMEio`+F)?IgG%aLeH#jXcI5IgcVly=`EiyAvW6Gh#4eGA&~;FfcB2Zeeh6c`k5aa&dKKbbV4yODi!HO=InIp$iuoGBAr0uqcdS=1rY9EiQB?A|mQeMAA;8!^_OnBvq;w{2Bg0e~7oXiXxc6<$j!V@3|o-`=>EMgpp22FgPvJ3TSf~AvCuh+WZp1S>MB8m*9fMXqF||KRQ^oBVqnu+2M^W1;J+^13#Yv*JNoXx~CRP1@9jbpRlE1JOTH%TiVjdDsv9ptBM{vyB$=v+v}0(g&De4|`AEbEC7e`-@gTfPG-htn%bx=rvD*1OTU0=j2J*wYa-yj0@mOotS3+Ak4Os<|7-)dPU>ULxP0)Bcy__O?xKO-Zn)YJFcFg05$L-C*xX^h!QoUStlUyn_$TF4xQ(2m(d77lXIP3mRl1O#aLEhW)g8*LewSmwH{FOIQO9KQH000080A{goQx8|^r4j%D01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfB1KVlgpfV>B&fVmCM~G&nLjEn+h@FfB4OVPiQoH8Wx`VlpjbFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}AWhnt1lyKPxi;P)h*<6ay3h000O8X0dNmF7n+4)&KwiX#oHL6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j000080A{goQ+!pNy$%2X01N;C02lxO00000000000HlEj0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;nPv2Rm#mIq>f0RR9n0ssI=00000000000
001_flC1Z0AyiwVJ~oDa&#>)FfcGMEio`+F)?IgG%aLeH#jXcI5IgcVly=`EiyAvW6Gh#4eGA&~;FfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j000080A{goQx8|^r4j%D01yBG08Ib@00000000000HlF00{{SIVRT_HE^uLTbS*G2FfcGJF)(5=F=S&jEo5RhI4v|dGC3_`Gc_@6aWAK2mpGqZ&Uc&mo4M~003|S000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZG$*Qphgs)24{(?tamh;1~rC#_Y#7~6eG2?>dA@7#0loP~CfvsjBrdkG2j0cB1=N9mm$5lSVLFNJK_Z-T*3q-`|lL?epzm<#wM)@s%Dty*@f>z8+pnE5xPY-3<^CtKMrz<-5B%UrdF!#~PF!|uLcq1u7E-0RRBv0ssI=0001FVRT_HaA9(EEif=JFfc7*V>mcwGC4FYF*#&0Ei__eGc7nYW-u)_Fl1vnHf1tnGGjI^V=yo5Q6_F^dm}d{hfY7N3qkoDW$!LnfK<+=sosy@-`%he6h$^t50^IY?7uMqy6?tKAm~*(pgnWdw0SwS(ys%y%JL;1VZTC-GcAdEpAxb-(M7Gs0Hmymd=E%R2%9I{Vuc*HaxB&t*bO=Mm({eAA{`5+t3i~$ditqGxh0Q!`6ZkjE=C~mMjrcPn2TU1d}E@YS;~#YqO%ag%pJ-%|nWGdcVjXLS5{*j=R@#Nyq6mk@I15@15v=__v1O?oRg>cR%C3*6}}6a$Lw0!F9L?B8;5AZC7ea*}M*l-npZfLg~CJ?MD#}Xn^d4KPL2rq)=QegrTasqJ=PwGsEm}){Sqj=fWTR+Necp{Jx*%LUQeo&u;vZ6*aZt4EWXjPaRvsYlH9qgRO7n{J2E+>$SnBh4R*kcp(h~8hZQ5BA46?vJuY)5f@pO42KzimJQ=L=5Z8{!r|aBih^tqCwz3^xNiNrAof#u-9G?OO9KQH000080D7@+Q;)=OZxR3i01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfC$ZI5=i9IW#RXIb<;{G-6~kEjTo0FfBGPWMertWin$jV>T^gFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}Bh3u-ojbg9;Y_P)h*<6ay3h000O8da-X)_}iB)-0RRBv0ssI=00000000000001_fms0n0AyiwVJ~oDa&#>)FfcGMEn;IhIA$_AG%YbXWHBu?Vq`NdI5cK3EjBP@V>vcuGGj7hHZ5Z?FfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j000080D7@+Q;)=OZxR3i01yBG08Ib@00000000000HlF~0{{SIVRT_HE^uLTbS*G2FfcGJVq-WsW->W6EipM{F)cJ=WHT)|G-fa@HZWvkIW}c7V=`klEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(a00aO4000"}}, "inputs": [{"name": "df", "node_id": "8594c20b-5214-4524-b9ed-7368d4f279f8", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "9297ae66-97e9-4738-9cee-5a02ddea44be", "type": "TRANSFORM", "operator": "sagemaker.spark.manage_columns_0.1", "parameters": {"operator": "Drop column", "drop_column_parameters": {"column_to_drop": "customer_zip"}}, "inputs": [{"name": "df", "node_id": "177ac227-1e30-41cb-9bb1-77beb3736d82", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "dba3fd7e-01a7-4629-a4c8-aa520328439a", "type": "TRANSFORM", "operator": "sagemaker.spark.join_tables_0.1", "parameters": {"left_column": "policy_id", "right_column": "policy_id", "join_type": "leftouter"}, "inputs": [{"name": "df", "node_id": "83855f44-84ce-4cb5-ac34-ac4242257444", "output_name": "default"}, {"name": "df", "node_id": "9297ae66-97e9-4738-9cee-5a02ddea44be", "output_name": "default"}], "outputs": [{"name": "default"}]}]} -------------------------------------------------------------------------------- /1-sagemaker-pipelines/images/pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/1-sagemaker-pipelines/images/pipeline.png -------------------------------------------------------------------------------- /2-step-functions-pipelines/01_setup_step_functions_pipeline.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Create a Step Functions Workflow from a Data Wrangler Flow File\n", 8 | "\n", 9 | "This notebook creates a Step Functions workflow that runs a data preparation step based on the flow file configuration, training step to train a XGBoost model artifact and creates a SageMaker model from the artifact. \n", 10 | "\n", 11 | "Before proceeding with this notebook, please ensure that you have executed the 00_setup_data_wrangler.ipynb. 
This notebook uploads the input files, creates the flow file locally from the insurance claims template and uploads flow file to the default S3 bucket associated with the Studio domain.\n", 12 | "\n", 13 | "This notebook is created using the export feature in Data Wrangler and modified and parameterized for Step functions workflow with SageMaker data science SDK.\n" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "As a first step, we read the input workflow parameters with a predefined schema and use this in our notebook. A sagemaker workflow requires job names to be unique. So we pass on the below names as parameters by generating random ids when we execute the workflow towards the end of the notebook.\n", 21 | "\n", 22 | "1. Processing JobName\n", 23 | "2. Training JobName\n", 24 | "3. Model JobName\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "%pip install -qU stepfunctions" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "Please use your own bucket name as this bucket will be used for creating the training dataset and the model artifact. The flow file is also generated locally and contains the S3 location details for the two input files namely claims.csv and customer.csv. The flow file is available as a JSON document and holds all the input, output and transformation details in a node structure. SageMaker jobs need the inputs to be available in S3. So, we derive the input file S3 locations from the flow file by reading the JSON document and building an array with a Processing Input object for SageMaker processing.\n", 41 | "\n", 42 | "You can configure your output location as you wish but SageMaker processing job requires the output name to match with the one created in the flow file. Any mismatch can fail the job. 
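For orientation, the relevant parts of the flow file look roughly like the sketch below. The node IDs and S3 URIs are illustrative placeholders; the real values live in the flow file generated by the setup notebook (see `flow-01-15-12-49-4bd733e0.flow` in this folder).

```python
# Trimmed, illustrative sketch of the .flow JSON that the next cells parse.
# Node IDs and S3 URIs are placeholders, not the real values.
flow_sketch = {
    "nodes": [
        {
            "node_id": "4c1ac097-...",  # SOURCE node for claims.csv
            "type": "SOURCE",
            "parameters": {
                "dataset_definition": {
                    "name": "claims.csv",
                    "s3ExecutionContext": {"s3Uri": "s3://<bucket>/<prefix>/claims.csv"},
                }
            },
            "outputs": [{"name": "default"}],
        },
        # ...more SOURCE and TRANSFORM nodes...
        {
            "node_id": "dba3fd7e-...",  # last node: the join that produces the training dataset
            "type": "TRANSFORM",
            "outputs": [{"name": "default"}],
        },
    ]
}

# The processing output name the job expects is "<last node_id>.<output name>",
# which is exactly what the next cell builds as f"{output_node}.{output_path}".
```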
So, we extract the output name from the flow file as below.\n" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "%store -r ins_claim_flow_uri" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "import json\n", 61 | "import sagemaker\n", 62 | "import string\n", 63 | "import boto3\n", 64 | "\n", 65 | "sm_client = boto3.client(\"sagemaker\")\n", 66 | "sess = sagemaker.Session()\n", 67 | "\n", 68 | "# bucket = \n", 69 | "bucket = sess.default_bucket()\n", 70 | "\n", 71 | "prefix = 'aws-data-wrangler-workflows'\n", 72 | "\n", 73 | "FLOW_TEMPLATE_URI = ins_claim_flow_uri\n", 74 | "\n", 75 | "flow_file_name = FLOW_TEMPLATE_URI.split(\"/\")[-1]\n", 76 | "flow_export_name = flow_file_name.replace(\".flow\", \"\")\n", 77 | "flow_export_id = flow_export_name.replace(\"flow-\", \"\")" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "sagemaker.s3.S3Downloader.download(FLOW_TEMPLATE_URI, \".\")" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [ 95 | "with open(flow_file_name, 'r') as f:\n", 96 | " data = json.load(f)\n", 97 | " output_node = data['nodes'][-1]['node_id']\n", 98 | " print(output_node)\n", 99 | " output_path = data['nodes'][-1]['outputs'][0]['name']\n", 100 | " input_source_names = [node['parameters']['dataset_definition']['name'] for node in data['nodes'] if node['type']==\"SOURCE\"]\n", 101 | " input_source_uris = [node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] for node in data['nodes'] if node['type']==\"SOURCE\"]\n", 102 | "\n", 103 | "output_name = f\"{output_node}.{output_path}\"" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", 113 | "\n", 114 | "data_sources = []\n", 115 | "\n", 116 | "for i in range(0,len(input_source_uris)):\n", 117 | " data_sources.append(ProcessingInput(\n", 118 | " source=input_source_uris[i],\n", 119 | " destination=f\"/opt/ml/processing/{input_source_names[i]}\",\n", 120 | " input_name=input_source_names[i],\n", 121 | " s3_data_type=\"S3Prefix\",\n", 122 | " s3_input_mode=\"File\",\n", 123 | " s3_data_distribution_type=\"FullyReplicated\"\n", 124 | " ))\n", 125 | " \n", 126 | "print(data_sources)\n", 127 | "\n", 128 | "s3_output_prefix = f\"preprocessing/output\"\n", 129 | "s3_training_dataset = f\"s3://{bucket}/{s3_output_prefix}\"\n", 130 | "\n", 131 | "processing_job_output = ProcessingOutput(\n", 132 | " output_name=output_name,\n", 133 | " source=\"/opt/ml/processing/output\",\n", 134 | " destination=s3_training_dataset,\n", 135 | " s3_upload_mode=\"EndOfJob\"\n", 136 | ")" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "The flow file is created and uploaded into S3 as part of the setup notebook. If you encounter any error in the below cell,please go back to the Setup notebook to make sure flow file is generated and uploaded correctly. 
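One optional way to rule that out early is a quick existence check against the stored URI. This is a minimal sketch; it assumes `ins_claim_flow_uri` was stored by the setup notebook and retrieved with `%store -r` above.

```python
# Optional sanity check: confirm the flow file uploaded by the setup notebook
# actually exists at the stored URI before wiring it into the workflow.
import boto3
from botocore.exceptions import ClientError

def s3_object_exists(s3_uri: str) -> bool:
    """Return True if an object exists at the given s3:// URI."""
    bucket_name, key = s3_uri.replace("s3://", "").split("/", 1)
    try:
        boto3.client("s3").head_object(Bucket=bucket_name, Key=key)
        return True
    except ClientError:
        return False

# ins_claim_flow_uri was retrieved with %store -r in an earlier cell
print(s3_object_exists(ins_claim_flow_uri))  # expect True
```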
Now, we retrieve the flow file s3 uri that was stored as a global variable in the setup " 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "%store -r ins_claim_flow_uri\n", 153 | "ins_claim_flow_uri" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "Processing job also needs to access the flow file to run the transformations. So, we provide the flow file location as another input source for the processing job as below by passing the s3 uri." 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "## Input - Flow: insurance_claims.flow\n", 170 | "flow_input = ProcessingInput(\n", 171 | " source=ins_claim_flow_uri,\n", 172 | " destination=\"/opt/ml/processing/flow\",\n", 173 | " input_name=\"flow\",\n", 174 | " s3_data_type=\"S3Prefix\",\n", 175 | " s3_input_mode=\"File\",\n", 176 | " s3_data_distribution_type=\"FullyReplicated\"\n", 177 | ")" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "## Configure a processing job\n", 185 | "\n", 186 | "Please follow the steps below for creating an execution role with the right permissions for the workflow.\n", 187 | "\n", 188 | "1. Go to the IAM Console - Roles. Choose Create role. \n", 189 | "\n", 190 | "2. For role type, choose AWS Service, find and choose SageMaker, and choose Next: Permissions \n", 191 | "\n", 192 | "3. On the Attach permissions policy page, choose (if not already selected) \n", 193 | "\n", 194 | " a. AWS managed policy AmazonSageMakerFullAccess\n", 195 | " \n", 196 | " b. AWS managed policy AmazonS3FullAccess for access to Amazon S3 resources\n", 197 | " \n", 198 | " c. AWS managed policy CloudWatchEventsFullAccess\n", 199 | "\n", 200 | "4. Then choose Next: Tags and then Next: Review.\n", 201 | "\n", 202 | "5. For Role name, enter StepFunctionsSageMakerExecutionRole and Choose Create Role\n", 203 | "\n", 204 | "6. Additionally, we need to add step functions as a trusted entity to the role. Go to trust relationships in the specific IAM role and edit it to add states.amazonaws.com (http://states.amazonaws.com/) as a trusted entity. \n", 205 | "\n" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "After creating the role, we go on to configure the inputs required by the SageMaker Python SDK to launch a processing job. You can change it as per your needs." 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "\n", 222 | "sess = sagemaker.Session()\n", 223 | "# IAM role for executing the processing job.\n", 224 | "iam_role = sagemaker.get_execution_role()\n", 225 | "\n", 226 | "aws_region = sess.boto_region_name\n", 227 | "\n", 228 | "\n", 229 | "# Data Wrangler Container URL.\n", 230 | "container_uri = sagemaker.image_uris.retrieve(\n", 231 | " framework='data-wrangler',\n", 232 | " region=aws_region\n", 233 | ")\n", 234 | "\n", 235 | "# Processing Job Instance count and instance type.\n", 236 | "instance_count = 2\n", 237 | "instance_type = \"ml.m5.4xlarge\"\n", 238 | "\n", 239 | "# Size in GB of the EBS volume to use for storing data during processing\n", 240 | "volume_size_in_gb = 30\n", 241 | "\n", 242 | "# Content type for each output. 
Data Wrangler supports CSV as default and Parquet.\n", 243 | "output_content_type = \"CSV\"\n", 244 | "\n", 245 | "# Network Isolation mode; default is off\n", 246 | "enable_network_isolation = False\n", 247 | "\n", 248 | "# KMS key for per object encryption; default is None\n", 249 | "kms_key = None" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "## Create a Processor\n", 257 | "\n", 258 | "To launch a Processing Job, we will use the SageMaker Python SDK to create a Processor function with the configuration set." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "from sagemaker.processing import Processor\n", 268 | "from sagemaker.network import NetworkConfig\n", 269 | "\n", 270 | "processor = Processor(\n", 271 | " role=iam_role,\n", 272 | " image_uri=container_uri,\n", 273 | " instance_count=instance_count,\n", 274 | " instance_type=instance_type,\n", 275 | " volume_size_in_gb=volume_size_in_gb,\n", 276 | " network_config=NetworkConfig(enable_network_isolation=enable_network_isolation),\n", 277 | " sagemaker_session=sess,\n", 278 | " output_kms_key=kms_key\n", 279 | ")" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "# Create a Step Function WorkFlow \n", 287 | "## Define Steps\n", 288 | "A step function workflow consists of multiple steps that run as separate state machines. We will first create a `ProcessingStep` using the Data Wrangler processor defined above. Processing job name is unique and passed on as a command line argument from the workflow. This is used to track and monitor the job in console and cloudwatch logs. " 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": null, 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "from stepfunctions.inputs import ExecutionInput\n", 298 | "workflow_parameters = ExecutionInput(schema={\"ProcessingJobName\": str, \"TrainingJobName\": str,\"ModelName\": str})" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": null, 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "from stepfunctions.steps import ProcessingStep\n", 308 | "\n", 309 | "data_wrangler_step = ProcessingStep(\n", 310 | " \"WranglerStepFunctionsProcessingStep\",\n", 311 | " processor=processor,\n", 312 | " job_name = workflow_parameters[\"ProcessingJobName\"],\n", 313 | " inputs=[flow_input] + data_sources, \n", 314 | " outputs=[processing_job_output]\n", 315 | ")" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "Next, we add a TrainingStep to the workflow that trains a model on the preprocessed train data set. Here we use a builtin XG boost algorithm with fixed hyperparameters. 
You can configure the training based on your needs" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "import boto3\n", 332 | "from sagemaker.estimator import Estimator\n", 333 | "\n", 334 | "region = boto3.Session().region_name\n", 335 | "\n", 336 | "image_uri = sagemaker.image_uris.retrieve(\n", 337 | " framework=\"xgboost\",\n", 338 | " region=region,\n", 339 | " version=\"1.2-1\",\n", 340 | " py_version=\"py3\",\n", 341 | " instance_type=instance_type,\n", 342 | " )\n", 343 | "xgb_train = Estimator(\n", 344 | " image_uri=image_uri,\n", 345 | " instance_type=instance_type,\n", 346 | " instance_count=1,\n", 347 | " role=iam_role,\n", 348 | " )\n", 349 | "xgb_train.set_hyperparameters(\n", 350 | " objective=\"reg:squarederror\",\n", 351 | " num_round=3,\n", 352 | " )\n", 353 | "\n", 354 | "from sagemaker.inputs import TrainingInput\n", 355 | "from stepfunctions.steps import TrainingStep\n", 356 | "\n", 357 | "xgb_input_content_type = 'text/csv'\n", 358 | "\n", 359 | "training_step = TrainingStep(\n", 360 | " \"WranglerStepFunctionsTrainingStep\",\n", 361 | " estimator=xgb_train,\n", 362 | " data={\n", 363 | " \"train\": TrainingInput(\n", 364 | " s3_data=s3_training_dataset,\n", 365 | " content_type=xgb_input_content_type\n", 366 | " )\n", 367 | " },\n", 368 | " job_name = workflow_parameters[\"TrainingJobName\"],\n", 369 | " wait_for_completion=True,\n", 370 | ")" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "The above training job will produce a model artifact. As a final step, we register this artifact as a SageMaker model with the model step" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "from stepfunctions.steps import ModelStep\n", 387 | "\n", 388 | "model_step = ModelStep(\n", 389 | " \"SaveModelStep\", model=training_step.get_expected_model(), model_name=workflow_parameters[\"ModelName\"])" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "In order to notify failures, we need to configure a fail step with an error message as below and call it during the incidence of every failure" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "from stepfunctions.steps.states import Fail\n", 406 | "\n", 407 | "process_failure = Fail(\n", 408 | " \"Step Functions Wrangler Workflow failed\", cause=\"Wrangler-StepFunctions-Workflow failed\"\n", 409 | ")" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "metadata": {}, 415 | "source": [ 416 | "There might be situations where you may expect intermittent failures due to unavailable resources and may want to retry a specific step. 
You can set up retry mechanism as below for such steps and configure the interval and the attempts " 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": {}, 423 | "outputs": [], 424 | "source": [ 425 | "from stepfunctions.steps.states import Retry\n", 426 | "\n", 427 | "data_wrangler_step.add_retry(Retry(\n", 428 | " error_equals=[\"States.TaskFailed\"],\n", 429 | " interval_seconds=15,\n", 430 | " max_attempts=2,\n", 431 | " backoff_rate=3.0\n", 432 | "))\n", 433 | "\n", 434 | "training_step.add_retry(Retry(\n", 435 | " error_equals=[\"States.TaskFailed\"],\n", 436 | " interval_seconds=10,\n", 437 | " max_attempts=2,\n", 438 | " backoff_rate=4.0\n", 439 | "))" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "Additionally we can introduce a wait interval between steps to ensure a smooth transition and provide a specific step with all the resources and inputs it will need. " 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": null, 452 | "metadata": {}, 453 | "outputs": [], 454 | "source": [ 455 | "from stepfunctions.steps.states import Wait\n", 456 | "\n", 457 | "wait_step = Wait(\n", 458 | " state_id=\"Wait for 3 seconds\",\n", 459 | " seconds=3\n", 460 | ")" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "Now that we have defined the steps needed for the workflow, we need to catch and notify failures at every step. We fail the entire workflow whenever a failure is caught by this FailStep" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "from stepfunctions.steps.states import Catch\n", 477 | "\n", 478 | "catch_failure = Catch(\n", 479 | " error_equals=[\"States.TaskFailed\"],\n", 480 | " next_step=process_failure\n", 481 | ")\n", 482 | "\n", 483 | "data_wrangler_step.add_catch(catch_failure)\n", 484 | "training_step.add_catch(catch_failure)\n", 485 | "model_step.add_catch(catch_failure)" 486 | ] 487 | }, 488 | { 489 | "cell_type": "markdown", 490 | "metadata": {}, 491 | "source": [ 492 | "Finally, we create a workflow by chaining all the above defined steps in order and execute it with the required parameters. Unique job names are generated randomly and passed on as parameters for the workflow. 
We introduce a wait time of 3 steps between preprocessing and training steps by chaining the wait step in between" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [ 501 | "from stepfunctions.steps import Chain\n", 502 | "from stepfunctions.workflow import Workflow\n", 503 | "import uuid\n", 504 | "\n", 505 | "workflow_graph = Chain([data_wrangler_step, wait_step,training_step, model_step])\n", 506 | "\n", 507 | "branching_workflow = Workflow(\n", 508 | " name=\"Wrangler-SF-Run-{}\".format(uuid.uuid1().hex),\n", 509 | " definition=workflow_graph,\n", 510 | " role=iam_role\n", 511 | ")\n", 512 | "\n", 513 | "branching_workflow.create()" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": {}, 520 | "outputs": [], 521 | "source": [ 522 | "# Each Preprocessing job requires a unique name\n", 523 | "processing_job_name = \"wrangler-sf-processing-{}\".format(\n", 524 | " uuid.uuid1().hex)\n", 525 | "# Each Training Job requires a unique name\n", 526 | "training_job_name = \"wrangler-sf-training-{}\".format(\n", 527 | " uuid.uuid1().hex)\n", 528 | "model_name = \"sf-claims-fraud-model-{}\".format(uuid.uuid1().hex)\n", 529 | "\n", 530 | "\n", 531 | "# Execute workflow\n", 532 | "execution = branching_workflow.execute(\n", 533 | " inputs={\n", 534 | " \"ProcessingJobName\": processing_job_name, # Each pre processing job (SageMaker processing job) requires a unique name,\n", 535 | " \"TrainingJobName\": training_job_name,# Each Sagemaker Training job requires a unique name,\n", 536 | " \"ModelName\" : model_name # Each model requires a unique name\n", 537 | " } \n", 538 | ")\n", 539 | "execution_output = execution.get_output(wait=True)\n" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "You can visualize the step function workflow status and details in the Step Functions console. " 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "## (Optional) StepFunctions cleanup\n", 554 | "1. Delete the input claims file, customer file and flow file from S3.\n", 555 | "2. Delete the training dataset created by processing job\n", 556 | "3. Delete the model artifact tar file from S3.\n", 557 | "4. 
Delete the SageMaker Model.\n" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": null, 563 | "metadata": {}, 564 | "outputs": [], 565 | "source": [] 566 | } 567 | ], 568 | "metadata": { 569 | "instance_type": "ml.t3.medium", 570 | "kernelspec": { 571 | "display_name": "Python 3 (Data Science)", 572 | "language": "python", 573 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" 574 | }, 575 | "language_info": { 576 | "codemirror_mode": { 577 | "name": "ipython", 578 | "version": 3 579 | }, 580 | "file_extension": ".py", 581 | "mimetype": "text/x-python", 582 | "name": "python", 583 | "nbconvert_exporter": "python", 584 | "pygments_lexer": "ipython3", 585 | "version": "3.7.10" 586 | } 587 | }, 588 | "nbformat": 4, 589 | "nbformat_minor": 4 590 | } 591 | -------------------------------------------------------------------------------- /2-step-functions-pipelines/README.md: -------------------------------------------------------------------------------- 1 | ## SageMaker Data Wrangler with AWS Step Functions 2 | 3 | ## Getting Started 4 | 5 | In this sample, we will use Step Functions to orchestrate the data wrangler based ML workflow using [AWS Step Functions Data Science SDK](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-python-sdk.html) 6 | The AWS Step Functions Data Science SDK is an open-source library that allows data scientists to create workflows that can preprocess datasets, build, deploy and monitor machine learning models using AWS SageMaker and AWS Step Functions. The AWS Step Functions Data Science SDK provides a Python API that can create and invoke Step Functions workflows. You can manage and execute these workflows directly in Python, as well as Jupyter notebooks. 7 | 8 | We will build a simple ML workflow leveraging SageMaker that has a preprocessing step using Data Wrangler, a training step and a save model step. We will chain these steps in order, assign dependencies and catch failures in the flow with Step Functions. The workflow will look as below 9 | 10 | 11 | ![](step-function-workflow.png) 12 | 13 | ## Instructions 14 | 15 | To run this sample, we used a Jupyter notebook running Python3 on a data science kernel in a SageMaker Studio environment. You can also run it on a Python3 notebook instance locally on your machine by setting up the credentials to assume the SageMaker execution role. The notebook is light weight and can be run on a t3 medium instance for example. 16 | 17 | You can either use the export feature in SageMaker Data Wrangler to generate the Pipelines code and modify it for Step functions or build your own script from scratch. In this sample, we have used a combination of both approaches for simplicity. We have taken some pieces from the autogenerated code and extended it with features specific to Step Functions. 18 | 19 | ## Executing the notebooks 20 | 21 | 1. Run the 00_setup_data_wrangler.ipynb notebook step by step. This notebook does the following 22 | a. Generates a flow file from Data Wrangler or uses the setup script to generate a flow file from a preconfigured template 23 | b. Creates an Amazon S3 bucket and uploads your flow file and input files to the bucket 24 | 2. Follow the instructions in the 01_setup_step_functions_pipeline notebook to kick off a Step Functions workflow. 25 | 3. Configure your Amazon SageMaker execution role with the required permissions as mentioned in the 01_setup_step_functions_pipeline notebook 26 | 4. 
The processing job will be run on a SageMaker managed Spark environment behind the scenes and this can take few minutes to complete 27 | 5. Go to StepFunctions console and track the workflow visually thru StepFunctions console. You can also navigate to the linked CloudWatch logs to debug errors. 28 | 6. Make sure you clean up all the resources at the end. 29 | 30 | ## Cleanup 31 | 32 | Please follow the instructions in the cleanup section of the 01_setup_step_functions_pipeline notebook to avoid any additional accumulation of costs. 33 | 34 | 35 | 36 | 37 | 38 | -------------------------------------------------------------------------------- /2-step-functions-pipelines/flow-01-15-12-49-4bd733e0.flow: -------------------------------------------------------------------------------- 1 | {"metadata": {"version": 1, "disable_limits": false}, "nodes": [{"node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "claims.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/claims.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "driver_relationship": "string", "incident_type": "string", "collision_type": "string", "incident_severity": "string", "authorities_contacted": "string", "num_vehicles_involved": "long", "num_injuries": "long", "num_witnesses": "long", "police_report_available": "string", "injury_claim": "long", "vehicle_claim": "float", "total_claim_amount": "float", "incident_month": "long", "incident_day": "long", "incident_dow": "long", "incident_hour": "long", "fraud": "long"}}, "inputs": [{"name": "default", "node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "customers.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/customers.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "a2370312-2ab4-43a3-ae7d-ba5a17057d0a", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "customer_age": "long", "months_as_customer": "long", "num_claims_past_year": "long", "num_insurers_past_5_years": "long", "policy_state": "string", "policy_deductable": "long", "policy_annual_premium": "long", "policy_liability": "string", "customer_zip": "long", "customer_gender": "string", "customer_education": "string", "auto_year": 
"long"}}, "inputs": [{"name": "default", "node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "police_report_available"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2ms2BZ&R6WWXb6O004CX000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx%1ME87#*Qpf(>7oJ&#MZjCZ(6H%1%HSefU5ZW&!?IhpgSc+k>^~zV6MSv%c+1-bxHrFOt5=M!;g3q$_cvD^Y(+n6QIYC{+dyqTGvi~)x4~8?25S^b!Wbw^ID)aTwOEBv@)woSMXAEa#XWd!aqW*_=g|XVWK+X5j!)FfcGME@N_IOD;-gU|?WkIGK_+697<40|XQR000O8%8hSR-Pv?efB^siZ~_1TNB{r;WMOn+FK}UUbS*G2FfcGJVlrhhF*!J9EjeUjGA%SSG-WMeGh{F=GGSphWnnisFk(4oEn_e+FfMa$VQ_GHE^uLTadl;MjgnDI!!QuX&#caYd(B8842Sr5EHPvF;ZcV0$?DP4x%-qze2)=~7|NoM^-;uN9a~}ha@UVv+8`N=~hOk?^lA54V2>_JV>{)Aov$1uw2Y_rgDi@lq!N^r7O+69S!>u0Q%Uoat2Z(Gd5lf|yt4cg$gzIqN5JzR&EbT3+WG)Ny)56Wpm)QYmA(y&zr$KSk?bN}&Tz~id(ns;P1f!PIGOfp*#?d;9OD!Pa1%P!CxIIG3>W+(vu%q*L3jQc7os|XI47kTVAl+XTdaAe$rZZ`HRX*`t8j%Pi$m_-nGtU#rhD@7kHa4us(oCvEv*Gm%D@4Aq)(EV>tpB_xR`Lqu;vep!S619vd0ZAoWQ88hlww0Wft>##7B&&Fl1rh-J`ilL`TS-MzlgP2(s^lfyI$!!QYhaNtM%{UGqtei$c%u1l$1c-LK1EP)h>@6aWAK2ms2BZ&Rr%CoK{H000mG002z@003lRbYU+paA9(EEif=JFfc7*GG#F_IXGr5Ib>ooEi^MUWi4SdWH2o{(90CfQX022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2ms2BZ&PF000aC000;O0000000000005+c8UX+RZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8%8hSR-Pv?efB^siZ~_1TNB{r;0000000000q=8-m003lRbYU-WVRCdWFfcGMFfC#-Wic^1IA$5WMVQcG&3}1EnzccFfB4+VK!x9H#jh2Ic6)FfcGMEn+fdF)=wfW-U2nVlpi>Gc;u_VKZbfEiz$YHf3QqI51*4W-VhdFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0Ko$Q0000"}}, "inputs": [{"name": "df", "node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "a9dcc572-7c50-4e63-b592-6691f8df29b6", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "incident_severity"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": "P)h>@6aWAK2mlj~Z&R2h55?vH003_R000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YiPs1<_K=1sFoVTc@P`X{mWrqnQ5ZB5}+$~zZaFU5K^}k~k3Be?8mi6AVFG5*JNvuVngMlXO-N2)Pi-mrTYVwqD~0(k^JUVj;3|m@>F1)^*#ot5tVjchz0J5VQY=luYnkZe%5^CAd$&tk^3?^YTx!-w*q)dARM`20KRB2e$(*Uxe|ob7_W>*pZkPoB$G!*kKnb%tv`dEyC=lR>rnEKmeIvt@i;An#sP}W&%X{_7mz5>RpS3|`~I6_AB)FfcGMEjeR1IAdWqWGy)`V=yf=GdMXdVl-r8EihtZGiG8rH)1e1Wi4YcFfcB2Zeeh6c`k5aa&dKKbd6HcYTGarmEFYYSY8!m@Iw%!q~HmAPF*MTrEjIsQbrktF@_w;oj2J|C1sFXl{wU!Ut<@0Xs<&$sx)>(_oR}QtnJ?v?vH+xBZLO7ANaw}c7Xf-$V2|c^LrnMp2UAN*d2I-J7;j~>}(&r5YZ1HJ_0{*0GW&YkdQ`SGQQvtf<5M8KcqHgQY8|nN%*tJ(#UjjUDYRk*ooGz83GkAeqy#m(I@Lb6a5Yndz;I&$PULkVa27FiPVY`uG%;Jc}!*RYWl^90FR1}j$fC%ztPuM}mZqsllXMzqllMsy6_Yd<(_KFZVB-m_{S3b115ir?1QY-O00;mRj&D;xA*7uW0000G0000@0001FVRT_HE^uLTbS*G2FfcGJIb%0CV_`RBEjch_FfBAQI5{n1G-P5eFk)jfW@0%vVlX#lEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?WkIP+aETqkIbEdWqU0Rj{Q6aWAK2mlj~Z&R2h55?vH003_R000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;mRj&D;%AE&7f0000C0000O00000000000001_ffoS)0BvP-VPs)+VJ|LlVRCdWFfcGMFfLBdWsVK-zgIWS`|Ei^MYIW1x|WMVBaVq-IAVmUWrFgIl_V=yo(+G&49kEn+leVl6OYV>4!AIX7Z3H)SnjFfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve007nl00000"}}, "inputs": [{"name": "df", 
"node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "2059176b-8fd9-4324-9452-cee6193474ee", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "driver_relationship"}}, "inputs": [{"name": "df", "node_id": "a9dcc572-7c50-4e63-b592-6691f8df29b6", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "56815d9f-f395-494b-8f16-8ddd2138e87e", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['driver_relationship']=df['driver_relationship'].replace('n/a','na')"}, "inputs": [{"name": "df", "node_id": "2059176b-8fd9-4324-9452-cee6193474ee", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "6c7a9e1e-16d4-4dda-bfba-e87f9c00d798", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "incident_type"}}, "inputs": [{"name": "df", "node_id": "56815d9f-f395-494b-8f16-8ddd2138e87e", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "b286b138-9c3e-48c6-8e32-5188ca9b62d5", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "collision_type"}}, "inputs": [{"name": "df", "node_id": "6c7a9e1e-16d4-4dda-bfba-e87f9c00d798", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "a32f07c9-671e-4c88-ada9-606d0f4dac1d", "type": "TRANSFORM", "operator": "sagemaker.spark.custom_pandas_0.1", "parameters": {"code": "df['collision_type']=df['collision_type'].replace('n/a','na')"}, "inputs": [{"name": "df", "node_id": "b286b138-9c3e-48c6-8e32-5188ca9b62d5", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "aaf76a75-ccbe-4e8a-bc3c-f74dd6f9ee5d", "type": "TRANSFORM", "operator": "sagemaker.spark.format_string_0.1", "parameters": {"operator": "Lower case", "lower_case_parameters": {"input_column": "authorities_contacted"}}, "inputs": [{"name": "df", "node_id": "a32f07c9-671e-4c88-ada9-606d0f4dac1d", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "c2fb6219-969d-44ac-9551-4996073630a8", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "driver_relationship"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": 
"P)h>@6aWAK2mtSpZ&RJqavbOY0040T000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;YiPs1<}MDP3zpHnL#YEXKmmmVsRKwP86-ZWPI#oDeYRsK6mkWfM5I-2)pE<#yINo+)*gMqL?e8afD*?3o9VxDh40Z>Z=1QY-O00;o@kZ)7rpkAB~0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d9DY`rt08mQ<1QY-O00;o@kZ)5_8f)620RRA-0ssI=0001FVRT_HaA9(EEif=JFfc7QH8e3|WMN?~V=-YgEi^GQGA%h`WMeI4V>o7HGB-71H!w6UV=yoX?(g-VT5t}!pWf^B1F!;-MqSf|kw2oeGPhhT1!&-FIsCgcHm^$BiFwFFM33He*gMT3xvQn)ruxAP=|m=+pubuV%-t^~n=$K5J$G!xR4#R9+3f0pm7-{@6aWAK2mtSpZ&L$&q2dw%000mG002z@003lRbYU+paA9(EEif=JFfc7QH8e3|WMN?~V=-YgEi^GQGA%h`WMeI4V>o7HGB-71H!w6UV=yoZ;0u%!j000080Pm1*Q=QXt9OwW50C52T022TJ00000000000HlEc0001OWprU=VRT_HaA9(EEif=JFfdR{0Rj{Q6aWAK2mtSpZ&Tr*UYrg9000aC000;O0000000000005+c836zQZDn*}WMOn+FD`Ila&#>)FfcGME@N_IP)h*<6ay3h000O8?~rd(Q5tL7p#cB@n*sm;NB{r;0000000000q=8)l003lRbYU-WVRCdWFfcGMFfBJVG%;diVPP#}F<~?)FfcGMEjKkZF=Aw4VJ%}ZVKgl?F)}hOIbvjEEo5UjW@IuqHDWg~G%aH=FfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0O11w0000", "one_hot_encoder_model": "P)h>@6aWAK2mtSpZ&RNX%Zbwf003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFpZ9}PQyS9hIbz0=Ju2-N=mY#N>pkE6YI%M92#+VKKXnMRC#wy5l9S`SUi08_y4|_!9&ibH^i?N5)5awIRU@f;O&GUh0#7md?eY$fkxz`4sTXQ=73osB-3x&y6w8IZ8mk&uFTbcLC!NpO|ETat2Km=xUSSI^QiDor0XWQZz?$9X}i@E^&uQVlP`vfq#_UGRdRwMxq-a<3Yz2Z;)g3eLmWNs<2(3k5i%1iADPktvY37XpK(!w6BIAcH_C2*=-l_+zuJTgF3a5~P)h>@6aWAK2mtSpZ&OWF`nC=L000aC000;O003=ebYWy+bYU+paA9(EEif=JFfcA-a$`#_N@ieSU}ESzEixYfP)h>@6aWAK2mtSpZ&QH6$k2iT001xo002k;003lRbYU-WVRCdWFfcGMFfC*?Vqs-8GcYY;V`4WgG&3|bEn#9aV=XZ@Wo0sBGBGh@VP-93FfcGKb8ca9aCt6pVRCVGWpsT~U2EGg6jjvB>qGa@2N@as5Cjb^xWb+j*9m=S-%2SXYLTGJ$bogHcXa5X?-GK`hqg9sR;Ob(}PKEjN%T9jFN)U|tn0}fFC$Af`_*ULZ3++ABZ_~CetpfnT2Z(-@T2``l$j3n#gx%c`njLs3$hXNaq*pM(KKAeZhgNs@^c}512>Za8Rps8tD2OO+%rzHN0V4T}UjA1aTQ)WQkn)(vh#tK`*qP?w%;wT@)7ug+8yAD4APnIJpIQiw;IBLbP)h>@6aWAK2mtSpZ&R(dok0=+000mG002z@003lRbYU+paA9(EEif=JFfc7-HDY09G&3+QVq;=AEi^MUG%aCbGh;0=Hf3cpV=^%@V_{}3V=yo)FfcGME@N_IP)h*<6ay3h000O8?~rd(fWpYof&l;kFaiJoNB{r;0000000000q=85Q003lRbYU-WVRCdWFfcGMFfC*?Vqs-8GcYY;V`4WgG&3|bEn#9aV=XZ@Wo0sBGBGh@VP-93FfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mtSpZ&R(dok0=+000mG002z@0000000000005+cFarPpWMOn+FD`Ila&#>)FfcGMEo3!fVP!NkFfC$ZVmB=`Gc+_UVPZ35EipD_Win$jF)?FdW-VhdFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0IdT60000"}}, "inputs": [{"name": "df", "node_id": "aaf76a75-ccbe-4e8a-bc3c-f74dd6f9ee5d", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "3e1bcebf-1353-4ab5-97a4-7fc0c2c00579", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "incident_type"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": 
"P)h>@6aWAK2mt+%Z&T$i^7Y~X003(N000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZHhtWzr_BCl=PbWwo>VjGR@No&VyF&ne7l>rI9d*)U3fUDxh6)lOG_8I_DAFR)64F-mY!GWjezncL70GI**07w7;0AyiwVJ~oDa&#>)FfcGMEi*7OH!x&nGA%eUW;rc1HZ(OYIA&rtEjcnVW->TAW-&BnFfC&+FfcB2Zeeh6c`k5aa&dKKbd6G7Yuhjsm7T;*SzZ-n@Iw%!B#;PuPTeH*rEg;oqm(`j#u#!WwWhM2%F0Fw`49airT=1oYd2r(=U^}5-gC~?(YfgS>>|Plr&tWI2Ze{dAVIx`y_U+7W@;qA8D<4SC<os0YdA*H!ue9j>S+XP}irCA3jjhKZjM5O;lQQc|=#qB~`X3AeImzE1|Ir;4@slA&^I;E90i)p83A=sp5?tybSxhbV4-|Y7`5pR5~6jaiYdal87WuCee8GF^)quN>VZD`#}I3AHmws;ZObnP)h>@6aWAK2mt+%Z&P8AA|?_5000mG002z@003lRbYU+paA9(EEif=JFfc7MFfunVWMwifI5B29Ei^VXH7z)1Vm2)~GBIW{I5}o9G-WU?V=yo)FfcGME@N_IP)h*<6ay3h000O8{g7`{r{AK#n*jg-m;wL*NB{r;0000000000q=8of003lRbYU-WVRCdWFfcGMFfB7MGB+?}Wil-|F=ja}G&VFfEjVUkHZ3_aF=jG2Ic705WiTycFfcGKb8ca9aCt6pVRCVGWpq$W0Rj{Q6aWAK2mt+%Z&P8AA|?_5000mG002z@0000000000005+cSOWk6WMOn+FD`Ila&#>)FfcGMEi*7OH!x&nGA%eUW;rc1HZ(OYIA&rtEjcnVW->TAW-&BnFfC&+FfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0M!Ek0000", "one_hot_encoder_model": "P)h>@6aWAK2mt+%Z&PQfXcN@{003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFqMxzO9Md=hI@a;)R$G3wQ3dK~OMgWxEW!bLMcr*bfWC{ddQRLJA9;DPErUnVB;fho!&-F~15-Xig+b!u%$@6Ek6LnZ#3|`&jRSiI(z_hl394;;B^h38wAP4@#+FRQvtd!KME~xhyfTxffE%4f>C;k=$#G#iIm}Yc^I#@z#^$;C1m3-1Beq9+`{f+fct|hE&pc$L1&;MFhbE9kP-=Apx&B40fd;(BQ0|XQR000O8{g7`{ZyYpT4gdfE3;+NC7ytkOZDn*}WMOn+FD`Ila&#>)FfcGME@N_IOD;-gU|?Wk5YAYc4ggR~0|XQR000O8{g7`{8Ca6KegOagFaiJoNB{r;WMOn+FK}UUbS*G2FfcGJIWjXbF*7zbEiySdWGysdWi~BhH#uW1IAu6tW-~A}GBGzWEn_e+FfMa$VQ_GHE^uLTadl;MeNs(JD=`#JW34W9;UYr@W>Eqbg)z*$spGV`(4B~gs5^N)(oUko%godyRjL;J8U8?jh_|+iBACGCew=gfxglqV=P^Nqkxoc3I4#l&XtRkBnp+QTeTm>~9AL0ZaKU0U%M$FL9Ie@rF#rGA;f*W>!DtuL^`UU`x;B%G;z3$y-$J`h+SG-003di1(G{s>DJzFO><50(Sq-4sW()b*ck(;w15B`oy(jOv(OKR9MXMjcJ}_ohdGavwLrQCN&BaK7NdB&u-$`SOx}wVok9dUW_78-;)f}ALv2@(j_q>!&)FfcGMEjcnXF)=eXH7znZIbDF4O928D0~7!N00;p6kZ)6Gsb~|`0001K0RR9K00000000000001_fdBvi0BvP-VPs)+VJ~oDa&#>)FfcGMP)h*<6ay3h000O8{g7`{ZyYpT4gdfE3;+NC7ytkO0000000000q=5zj003=ebYWy+bYU+paA9(EEif=JFfcA-a$`_S0Rj{Q6aWAK2mt+%Z&MjqlDd8Y001xo002k;0000000000005+cN&x@>WMOn+FK}UUbS*G2FfcGJIWjXbF*7zbEiySdWGysdWi~BhH#uW1IAu6tW-~A}GBGzWEn_e+FfMa$VQ_GHE^uLTadl;MP)h*<6ay3h000O8{g7`{uJsj95&!@I5C8xGO#lD@0000000000q=7C2003lRbYU+paA9(EEif=JFfc7SGBYtTGd49XGC4V9Ei_?eHZ5W|Ib$t2WjJAGGcYtVF*h(RV=yo@6aWAK2ml9>Z&QT&G<)O#003+O000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;d1PQx$|MfZHhtWzrxktW%eO&1kNAhyxSp0rl|Vr(Nym4C+#5`s!>ukJZ_<_ywcj4*)by}*F-Kq3a@qp(IygefzL=fv9Ww@%Rq@y;tUP(jj`VnQ2WxNV!Z@B6mDZ`SJy7XFnuYVV4<5w)mSXg}SmE?m~CgnyJ{*LMwd%^l*@mP+B#?y#88pk2sm<4s|tH%4EiV>$&Yy#qRBXzSk5EYc%UD6!A!he>3^SU{L7P&8pS>FJmK+Ca>niy|)PJ+dNxxoizEF}-DS^7KGaUXA?!%s77cV4iQj08mQ<1QY-O00;mFk#AE1sg-UH0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?dXHu=>808mQ<1QY-O00;mFk#AFn>*Q^i0RRAu0ssI=0001FVRT_HaA9(EEif=JFfc7QFgQ0jGGQ?-W-(@EEi^eZFfCy*H)bt2IX7Z7WHw`EV`E`0V=yoMKVpIgPCbf(p^OQzwB}Ull`~$O))?WJ~j0YI|%9f5aYcp+leey0C6#Z%QpoS-JNX?(|>GK0;kD2!n8^9pX;VZ=vAg^|{NU=j1m5_PXuvuLg{r_LujFK?v*!*YB@$JQPHPOwBFhGY$l73B!I!V~c4;nkkD}jL6=98TLUl*gVXoV`lo3mC|v+9VdSpPt-lmBpp-h%zWZ#L{%ZRuylO+L@Qp_)+-S3SNZ?y_*!1`oPWZt?<%TWkS9RNi^_hRYfZ=S0=f7=tRUdImMbBm&j@hWVsX7dz3D%f)Cf2t(+23~fJyfA|+rO9KQH0000800)t8Q+N`K#1a4i01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfBJQI5#*lVKFUcF=k~gG&wUcEnzV?W-T{4H)1qoHe+RDV__|0FfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}Bh(=`0rRaW@M9P)h*<6ay3h000O82a#`6g!(jl*Q^i0RRAu0ssI=00000000000001_fms0n0AyiwVJ~oDa&#>)FfcGMEjKVYH#jn3F)d~>W@RljIWsUVVKFymEjKwgVl-qnV`XDwVJ%}YFfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j0000800)t8Q+N`K#1a4i01yBG08Ib@00000000000HlFe0{{SIVRT_HE^uLTbS*G2FfcGJH!wIiI5J@|EoL!hWi2#0GcYY-F*jx{H#s+AG-NhoWn*JuEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{0000
01ONm8cmV(a)B^wj000", "one_hot_encoder_model": "P)h>@6aWAK2ml9>Z&M|j)%Vl@003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*PQySDMSDJDYG<{?*urijq#zMNMY|e#$HZb^nw_N}%D=M-0wqP7DX#84XAUqq%o#?A{Gl-6v?rAl@~84n&4f~z)N7z^l&3I7OT4ScK?AaQ%7Q+@c;C1EFbw_Bbr0(X&ii-byu?t+t8eSm{bvS8yw7Ruzt<*jG9ai!ImXL~UqN^JrpUf0b@>t|QHh;M};VsAXf>~gk(Pe62kHI>64qxcbW@3&E$J(SBi<7qt*(LV&hsON&u57~Pw(NNK15ir?1QY-O00;mFk#AGa;ufV20000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?eSesz{A08mQ<1QY-O00;mFk#AFVmIq>f0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7OGC5;3Win(fH)A$tEi^beGc7q`W;rc6GB#p2Wid1{IWsaXV=yo?Q@|E7a1}zixRLXjA7P|$`PNKuh%+w@Rsuuhi{y=|-x3-ERn84+JoOAEFAt(E%F+qfpPDn5~Ez$~Ta~UBtw;tO362V#D!(f--g2iZ-CD=bYShXWz{$JVQjVuMhXb01^p>XoLHj}gBL0V|vLOV^`)P;2bAb11OKT^w5Rt|aC5B#9B5A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjMLNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&Z&M|j)%Vl@003zL000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;mFk#AGa;ufV20000C0000O00000000000001_fdv5o0BvP-VPs)+VJ|LlVRCdWFfcGMFfLV_jG&ngkEjeLkIW0LdHexqrF*GqbGcqk>FfcGKb8ca9aCt6pVRCVGWpplMa$`_S1qJ{B000C41ORve0062300000"}}, "inputs": [{"name": "df", "node_id": "3e1bcebf-1353-4ab5-97a4-7fc0c2c00579", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "ec797d62-f14d-484c-bed6-769031cf10a7", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "One-hot encode", "one_hot_encode_parameters": {"invalid_handling_strategy": "Keep", "drop_last": false, "output_style": "Columns", "input_column": "authorities_contacted"}, "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "authorities_contacted"}}, "trained_parameters": {"one_hot_encode_parameters": {"_hash": -1767228197713558300, "string_indexer_model": 
"P)h>@6aWAK2mooDZ&Q01*?i^z0046V000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*P6IIzMSDKO^~Q48yht0G6cI=u+SOpsX07J@8h_@}a4sZGDsxAWJH(G{sa_yg4Di-?6&;azl9V)Uug@q68d-xLZDkt|!x^%xj5qp8RM!8qK9cW~3>2{DPW5e9oGjvud#T>q@=ez@A5C+L9;((^fC)q29Wf&~p6Al0X01N;C02lxO0BvP-VPs)+VJ|LlVRCdWFfcGMFfL)FfcGMEo5P1W;i%xIW1*nHfAj}V=-hcVqs%BEipM_IXE;mW@R)mG%aH=FfcB2Zeeh6c`k5aa&dKKbd6F?Ya1~Tl{fCjA->9JVGl;&kbo<8yUxZoz2sJUD53OFD5Yq3N9j_n){(R=A^6|)&>vBH?ceG+cGC0A=}>k`<3Y>G}xQvx9*kN5E0`$GD?{m;!I2{GI1`WTtkzyaR(6&q017HAOZXG6D=_QN|@zK2HAfSd;GP|E_eJGWL|(}0HLw9opwc799mgN6o-|DB@4$xm`Z$S8>7C_Wm*bQn!S61{%-?rqVX_?LtRqrvEFZ}hEqG-ul4%vny}|PgO3|K-c4kjiZl`F__*k5U5yj9sMIX0K&fIrSJj!C=XqY`S-wc;vnN@Ws99cAi<2mhH_L`&zrf}E0Z>Z=1QY-O00;nSn{QLcJKwkx0000G0000@0001FVRT_HE^uLTbS*G2FfcGJWMN}wI5=cEEoEgkW-T;hF=Q=bVPiQhF*#y6I5ah8Wi&7}En_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wk*d*1wPEkO!0sv4;0Rj{Q6aWAK2mooDZ&Q01*?i^z0046V000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;nSn{QLAtTGc00000C0000O00000000000001_ffoS)0BvP-VPs)+VJ|LlVRCdWFfcGMFfLWG!N0V>vA`Ibu0DG&N>rG%z$RV=yo@6aWAK2mooDZ&N~3HXYOe003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFoloHN(3a-m&nkE|6V}G09wObQdJ}DO#9KKYI;cyaQZZzhPA~dPt%so>ug=B}j{FmGSyB{pue8z|46kvcg=>qG@Q2diNCTZ)r)ND%MR*Kb6!QU=m|MM0UYJTD8=Kz8_ZXe0yZD0ss3YdYk~>cM9!rI3Mqj9d)an;+J_Sa@DqWhf`TF>6?0;?L_RISBC&qmAU|mn%08mQ<1QY-O00;nSn{QK=wVr+s0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?d9H4AnE08mQ<1QY-O00;nSn{QKfmIq>f0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7G&f-~Ei_|cF)d;?VlXW+Vr67yH#uWDVrDQcV=yo?Q@|E7a1}zixRLXjA7P|$`PNKuh%+w@Rsuuhi{y=|-x3-ERn84+JoOAEFAt(E%F+qfpPDn5~Ez$~Ta~UBtw;tO362V#D!(f--g2iZ-CD=bYShXWz{$JVQjVuMhXb01^p>XoLHj}gBL0V|vLOV^`)P;2bAb11OKT^w5Rt|aC5B#9B5A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjMLNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&F=Az8Wj8rvIbvooEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IOD;-gU|?Wkc$%7c_S`=!GXPLa0Rj{Q6aWAK2mooDZ&N~3HXYOe003zL000vJ0000000000005+c00000ZDn*}WMOn+FK}UUbS*G2FfcGsO928D0~7!N00;nSn{QK=wVr+s0000C0000O00000000000001_fdv5o0BvP-VPs)+VJ|LlVRCdWFfcGMFfLG&f-~Ei_|cF)d;?VlXW+Vr67yH#uWDVrDQcV=yo@6aWAK2mmdyZ&PUaA7ultxDB^6H|@~*^UjPIOWxgU1VKT!ALksti%=F)5^E9YC?SG6p-d6zES;4jLN0{zF_ZDfjo0{wv<(`qs6;jnQwA5sy6M~7y18lk{;q8*vG~tO$pp{kR@Sm!fqVC>n!RE)k3W(_cdZWk02Z=1QY-O00;mrv2Ro0x?aT&0000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?c6r}!fS08mQ<1QY-O00;mrv2Rn!%@Qb`0RRAm0ssI=0001FVRT_HaA9(EEif=JFfc7MH85p1V=^%f|1*m^yryciC_9OT+IJ7_qnzYN)d*AmxzxO@)^zc)P0Fvli9|bP)kssw?56&5e>;?dkMH?<|{NCw!TXA&Z>-E9Klc>1!l_hkFRF(T~rJ~Z~*Ze#R{R!q1NmgLF9vj4(_PZ*d@Isl&C~)Us5ROJ(E2Ex-g8wXm>k8LD=&__+|awho(pIUj>5Qp!>_~{`7XYKd$);Bj*{QcJ-EqAnaq@=nKlG%q>qJe9$YQP1)3VGouMj0N#J4FqewDn~)yng_=(^-q0l1+T1p{tvdDRn`{{o;0;(Wo=GoN|CYKKrH?N)~;ZM5=)>@9xb?8i^cf@;Nz9Sca@G;7l~7vCNz!ridm_6oX8QEgNzF)i{Vi64|162InT3vlnw{)vn-K=yx^n#FpAv8-@3D(y8m|%P)h>@6aWAK2mmdyZ&UDH7Euxa000mG002z@003lRbYU+paA9(EEif=JFfc7MH85p1V=^%)FfcGME@N_IP)h*<6ay3h000O8EwOJ?$;}cdodEy)FfcGMEi*MRWj13nF)cGM!CWi4YcFfcB2Zeeh6c`k5aa&dKKbS`6ZV^B*41^@s600aO80C)ia0M!Ek0000", "one_hot_encoder_model": 
"P)h>@6aWAK2mmdyZ&Ty@0j<;k003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;Z*P6IIzMSDJD>W$@1U`e(`k%B}76|FU_XOk6vX*{MN%D>|T@s!fI7~OM@4=^~)8G4AkQ+4{zwsWnzvD$LgRFi<75cu^+8)`DeZSV`Dydmt)7P4^T@31QY-O00;mrv2Rm`t;dB90000C0000O0001OWprU=VRT_HE^uLTbS*G2FfcGKV{&6lE=p!#U|?cc%lP{-08mQ<1QY-O00;mrv2RoC3|$C-0RR9n0ssI=0001FVRT_HaA9(EEif=JFfc7JG-hTwGc#l@GGb&mEi^DOGA%ebG%_t=Gc-6bVK*~4H!x)_V=yoxAs9Fe8`Z2c_;yk!Wd?pIy)`1~qP{I6X(w?9Gcz?wSE?=S&+rHOL%g+hQ3MmX+>dkaJvZe1=prVFFwzYP2B$?@0d1}$gqGGrJ0BuATZb6z8eFg#EwTjrr^g$1EX>`F9sQA|AQ&BBx;YX~-qvP%Sv*P$?OW)eNt?Q`4gdsiBf2iNEM?`8hyB10y6pg(ZMTr0eI~z=-oXTW*qeF(8r}A;q#Km6?Z;A5vPITQ0@|MDjPi{I4{&s4My-;SrAz-Ti{F_nL!qKaq}``hl0yiCj2ApBqoiO~;ZAn000{ax7%3kUF<)@cW5XqFj_dtpSK7<3CdvBc_mEt#-A{78Q`Bf&8ehoG2=LHqn|5l+aerLsmf~h8mzSScdFf6EvRBujh!KtQ7B+rn6eH6Z1IYac5^ZDRkb6R6kd}B$rAJvrOebRhFh{o+fEO&U*Vv5~*G~%=^235Wow*G!Qz6zw!!DO9KQH0000804=d^Qw$Vgs}cYJ01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfA}NW@b4vGh{6?Vq`ZhG%ztTEjTzdGA&^)FfcGMEig1@W;ru6WGymcWH&7|FflSMI5;#iEnzb>I51&1GdMReWi4YcFfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j0000804=d^Qw$Vgs}cYJ01yBG08Ib@00000000000HlF00{{SIVRT_HE^uLTbS*G2FfcGJFf?XnIWse4Eiz(cH!U@6aWAK2mnB_Z&R+DzHFnluZ>-54N|AQ(gDdtL`+H9$6E=jFBofFufJ>YJ8xi{#hsHz%Wf$Ur?tWMCF0P*LbhGo8-Z7B}uhL_}N&kH;g^$>?Zi+D=kMr2ogY_#gaR-hTM4;3nL2&ONy~mmKe%xESF}e9^*HD9X6(1Zb<|@piyT#pS%+;T$34J8N*R-J@cma+q$Q5}=OX_W8VU{kB(H-(8q0bG)=+%U1g}Jr?=Ze?7dAslXQUEwZ6RD3#vJb8(9;TtuWVZ+z<>yR{1%zOWJnUNLIvHi8;=`YmztRy=JdN|6ANHTpiRf-Pa1#@6aWAK2mofWZ&MJsG`Zja003j7TQ71Vk;ue5)$YG%AA0X(mOdJlu9U{3fVC4g24}@eKhDqBZ~Ey3-~0q?YiHncIY-;KXi>)_!p#XV_*b3YaCuD)v|{x%!A(po6jQFl4`k&j-i4+jXHj>A0QZqyYLQf@f{Hp$2RVt;F2nsn>d4R#UV_ukX@^geWUa-Lv>7x?2C2T=Ifk1}-VxB9ZV(ZrXc=1L%hCeaR@J6}<^v)irrdcpph@Bx$J`LleZ??KzRg)rC))tkOua&!6F&KEEQd+69Bdx{rA*%M>kPhh4uaQzNf}140R(>F05T7SV?r8p!T6Fx2zH-`T?=EYMw4F3ViqHE^h9B=6ocZb6^@zahSkFHk~>bmckZaWT?#s*)|qzh=$Oh%sM6BehdZiyy`~;z0e++Zd5>@5ZKQYq!LGe3eF5;i(T-QzzHODFGkAem{R7m{@Ir|VFs2U?z-zVov_j-`8}LPKhTDw{6PCm*86FmGrOII}r=^^vrI517b6LKUd6s2mmS)o=pS(=dSWdE{oF0Wy1RGCb?Pu^Ce*jQR0|XQR000O8X0dNm%TJaX5&!@I5C8xGO#lD@WMOn+FD`Ila&#>)FfcGMEigDRFk@k4H7zkVIAbj|W@BM3IXGozEnzq`GdMJ1HDft2V=ZGaFfcB2Zeeh6c`k5aa&dKKbS`6ZV@obdW?*1oVlcj=)85||JRbm1O928D0~7!N00;nPv2RllxHP%o0001O0RR9K00000000000001_fdBvi0BvP-VPs)+VJ~oDa&#>)FfcGMP)h*<6ay3h000O8X0dNmRrUOb4gdfE3;+NC7ytkO0000000000q=69u003=ebYWy+bYU+paA9(EEif=JFfcA-a$`_S0Rj{Q6aWAK2mofWZ&T`O_GzC1005E#002k;0000000000005+cRRI71WMOn+FK}UUbS*G2FfcGJFgP$UV_{@9EipDYV=XjhV__{hIAvxnVK_81I5c53V>vKmEn_e+FfMa$VQ_GHE^uLTadl;MP)h*<6ay3h000O8X0dNm%TJaX5&!@I5C8xGO#lD@0000000000q=8xk003lRbYU+paA9(EEif=JFfc7JI503{VPrKeF*Z13Ei`6hVJ$g0Wo9j5I5aajG+{MkIWS`_V=yo@6aWAK2mofWZ&NPv-38VF003zL000vJ003=ebYWy+bYU-WVRCdWFfcGMFpZB*O9Md+hVTB2oVUYlx3+e#A}Cn&(C&>w{!LyL95f)ar!43b^zF9W>2|Z;tvCB^1Lyt|ah_wS=1ys)SLoluO4nYQ#~S`nnxVTK_LgjWg+oPZAN>KV`2-`!bdOn}pU`D$U=FFsb#zrC)FfcGMEio`+F)?IgG%aLeH#jXcI5IgcVly=`EiyAvW6Gh#4eGA&~;FfcB2Zeeh6c`k5aa&dKKbbV4yODi!HO=InIp$iuoGBAr0uqcdS=1rY9EiQB?A|mQeMAA;8!^_OnBvq;w{2Bg0e~7oXiXxc6<$j!V@3|o-`=>EMgpp22FgPvJ3TSf~AvCuh+WZp1S>MB8m*9fMXqF||KRQ^oBVqnu+2M^W1;J+^13#Yv*JNoXx~CRP1@9jbpRlE1JOTH%TiVjdDsv9ptBM{vyB$=v+v}0(g&De4|`AEbEC7e`-@gTfPG-htn%bx=rvD*1OTU0=j2J*wYa-yj0@mOotS3+Ak4Os<|7-)dPU>ULxP0)Bcy__O?xKO-Zn)YJFcFg05$L-C*xX^h!QoUStlUyn_$TF4xQ(2m(d77lXIP3mRl1O#aLEhW)g8*LewSmwH{FOIQO9KQH000080A{goQx8|^r4j%D01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfB1KVlgpfV>B&fVmCM~G&nLjEn+h@FfB4OVPiQoH8Wx`VlpjbFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}AWhnt1lyKPxi;P)h*<6ay3h000O8X0dNmF7n+4)&KwiX#oHL6951J0000000000q=5hc003=ebYWy+bYU-WVRCdWFfcGMFi=YY0u%!j000080A{goQ+!pNy$%2X01N;C02lxO00000000000HlEj0RRAPWprU=VRT_HE^uLTbS*G2FfcGKV{&6qO928D0~7!N00;nPv2Rm#mIq>f0RR9n0ssI=00000000000
001_flC1Z0AyiwVJ~oDa&#>)FfcGMEio`+F)?IgG%aLeH#jXcI5IgcVly=`EiyAvW6Gh#4eGA&~;FfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j000080A{goQx8|^r4j%D01yBG08Ib@00000000000HlF00{{SIVRT_HE^uLTbS*G2FfcGJF)(5=F=S&jEo5RhI4v|dGC3_`Gc_@6aWAK2mpGqZ&Uc&mo4M~003|S000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx$|MfZG$*Qphgs)24{(?tamh;1~rC#_Y#7~6eG2?>dA@7#0loP~CfvsjBrdkG2j0cB1=N9mm$5lSVLFNJK_Z-T*3q-`|lL?epzm<#wM)@s%Dty*@f>z8+pnE5xPY-3<^CtKMrz<-5B%UrdF!#~PF!|uLcq1u7E-0RRBv0ssI=0001FVRT_HaA9(EEif=JFfc7*V>mcwGC4FYF*#&0Ei__eGc7nYW-u)_Fl1vnHf1tnGGjI^V=yo5Q6_F^dm}d{hfY7N3qkoDW$!LnfK<+=sosy@-`%he6h$^t50^IY?7uMqy6?tKAm~*(pgnWdw0SwS(ys%y%JL;1VZTC-GcAdEpAxb-(M7Gs0Hmymd=E%R2%9I{Vuc*HaxB&t*bO=Mm({eAA{`5+t3i~$ditqGxh0Q!`6ZkjE=C~mMjrcPn2TU1d}E@YS;~#YqO%ag%pJ-%|nWGdcVjXLS5{*j=R@#Nyq6mk@I15@15v=__v1O?oRg>cR%C3*6}}6a$Lw0!F9L?B8;5AZC7ea*}M*l-npZfLg~CJ?MD#}Xn^d4KPL2rq)=QegrTasqJ=PwGsEm}){Sqj=fWTR+Necp{Jx*%LUQeo&u;vZ6*aZt4EWXjPaRvsYlH9qgRO7n{J2E+>$SnBh4R*kcp(h~8hZQ5BA46?vJuY)5f@pO42KzimJQ=L=5Z8{!r|aBih^tqCwz3^xNiNrAof#u-9G?OO9KQH000080D7@+Q;)=OZxR3i01yBG08Ib@0AyiwVJ|LlVRCdWFfcGMFfC$ZI5=i9IW#RXIb<;{G-6~kEjTo0FfBGPWMertWin$jV>T^gFfcGKb8ca9aCt6pVRCVGWpplMa$`#_N@ieSU}Bh3u-ojbg9;Y_P)h*<6ay3h000O8da-X)_}iB)-0RRBv0ssI=00000000000001_fms0n0AyiwVJ~oDa&#>)FfcGMEn;IhIA$_AG%YbXWHBu?Vq`NdI5cK3EjBP@V>vcuGGj7hHZ5Z?FfcB2Zeeh6c`k5aa&dKKbWlqH0u%!j000080D7@+Q;)=OZxR3i01yBG08Ib@00000000000HlF~0{{SIVRT_HE^uLTbS*G2FfcGJVq-WsW->W6EipM{F)cJ=WHT)|G-fa@HZWvkIW}c7V=`klEn_e+FfMa$VQ_GHE^uLTadl;ME@N_IP)h{{000001ONm8cmV(a00aO4000"}}, "inputs": [{"name": "df", "node_id": "8594c20b-5214-4524-b9ed-7368d4f279f8", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "9297ae66-97e9-4738-9cee-5a02ddea44be", "type": "TRANSFORM", "operator": "sagemaker.spark.manage_columns_0.1", "parameters": {"operator": "Drop column", "drop_column_parameters": {"column_to_drop": "customer_zip"}}, "inputs": [{"name": "df", "node_id": "177ac227-1e30-41cb-9bb1-77beb3736d82", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "dba3fd7e-01a7-4629-a4c8-aa520328439a", "type": "TRANSFORM", "operator": "sagemaker.spark.join_tables_0.1", "parameters": {"left_column": "policy_id", "right_column": "policy_id", "join_type": "leftouter"}, "inputs": [{"name": "df", "node_id": "83855f44-84ce-4cb5-ac34-ac4242257444", "output_name": "default"}, {"name": "df", "node_id": "9297ae66-97e9-4738-9cee-5a02ddea44be", "output_name": "default"}], "outputs": [{"name": "default"}]}]} -------------------------------------------------------------------------------- /2-step-functions-pipelines/step-function-workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/2-step-functions-pipelines/step-function-workflow.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/01_setup_mwaa_pipeline.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Setup and Integrate Data Wrangler with Apache Airflow ML Pipeline\n", 8 | "\n", 9 | "
⚠️ PRE-REQUISITE: \n", 10 | "\n", 11 | "Before proceeding with this notebook, please ensure that you have \n", 12 | " \n", 13 | "1. Executed the 00_setup_data_wrangler.ipynb Notebook\n", 14 | "2. Created an Amazon Managed Workflow for Apache Airflow (MWAA) environment. Please visit the Amazon MWAA Get started documentation to see how you can create an MWAA environment. Alternatively, to quickly get started with MWAA, follow the step-by-step instructions in the MWAA workshop to setup an MWAA environment.\n", 15 | "\n", 16 | "
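As a quick optional check of the second prerequisite, you can list the MWAA environments visible to your credentials. This is a minimal sketch using boto3 and assumes your role is allowed to call the MWAA API.

```python
# Optional: confirm an MWAA environment exists in this account/region before continuing.
import boto3

mwaa = boto3.client("mwaa")
print(mwaa.list_environments()["Environments"])  # your environment name should appear in this list
```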
\n", 17 | "\n", 18 | "This notebook creates the required scripts for the Apache Airflow workflow and uploads them to the respective S3 bucket locations for MWAA. We will create-\n", 19 | "\n", 20 | "1. A `requirements.txt` file and upload it to the MWAA `/requirements` prefix\n", 21 | "2. We upload the `SMDataWranglerOperator.py` Python script which is the SageMaker Data Wrangler custom Airflow Operator to the `/dags` prefix.\n", 22 | "2. A `config.py` Python script that will setup configurations for our DAG Tasks and upload to the `/dags` prefix.\n", 23 | "3. And finally, we create an `ml_pipeline.py` Python script which sets up the end-to-end Apache Airflow workflow DAG and upload it to the `/dags` prefix.\n", 24 | "---" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "Import required dependencies and initialize variables\n", 32 | "\n", 33 | "
⚠️ NOTE: \n", 34 |     " Replace the bucket name below with the S3 bucket configured for your MWAA environment.\n", 35 |     "
" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 209, 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "name": "stdout", 45 | "output_type": "stream", 46 | "text": [ 47 | "SageMaker version: 2.59.5\n", 48 | "S3 bucket: airflow-data-wrangler\n" 49 | ] 50 | } 51 | ], 52 | "source": [ 53 | "import time\n", 54 | "import uuid\n", 55 | "import sagemaker\n", 56 | "import boto3\n", 57 | "import string\n", 58 | "\n", 59 | "# Sagemaker session\n", 60 | "sess = sagemaker.Session()\n", 61 | "\n", 62 | "# MWAA Client\n", 63 | "mwaa_client = boto3.client('mwaa')\n", 64 | "\n", 65 | "# Replace the bucket name with your MWAA Bucket\n", 66 | "bucket = 'airflow-data-wrangler'\n", 67 | "\n", 68 | "print(f'SageMaker version: {sagemaker.__version__}')\n", 69 | "print(f'S3 bucket: {bucket}')" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "Creating and uploading this `.airflowignore` file helps Airflow to prevent interpreting the helper Python scripts as a DAG file. " 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 565, 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "name": "stdout", 86 | "output_type": "stream", 87 | "text": [ 88 | "Writing scripts/.airflowignore\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "%%writefile scripts/.airflowignore\n", 94 | "SMDataWranglerOperator\n", 95 | "config.py" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 566, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "s3_client.upload_file(\"scripts/.airflowignore\", bucket, f\"dags/.airflowignore\")" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "---\n", 112 | "## Create `requirements.txt` file" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "Write a `requirements.txt` file and upload it to S3. We will need a few dependencies to be able to run our Data Wrangler python script using the Apache Airflow Python operator, mainly the SageMaker SDK." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 556, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "Writing scripts/requirements.txt\n" 132 | ] 133 | } 134 | ], 135 | "source": [ 136 | "%%writefile scripts/requirements.txt\n", 137 | "awswrangler\n", 138 | "pandas\n", 139 | "sagemaker==v2.59.5\n", 140 | "dag-factory==0.7.2" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 557, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "s3_client = boto3.client(\"s3\")\n", 150 | "s3_client.upload_file(\"scripts/requirements.txt\", bucket, f\"requirements/requirements.txt\")" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "---\n", 158 | "## Upload the custom SageMaker Data Wrangler Operator\n", 159 | "\n", 160 | "In this step we will upload the [custom Airflow operator](https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html) for SageMaker Data Wrangler. With this operator, you can pass in any SageMaker Data Wrangler `.flow` file to Airflow to perform data transformations." 
161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 558, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "s3_client.upload_file(\"scripts/SMDataWranglerOperator.py\", bucket, f\"dags/SMDataWranglerOperator.py\")" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "---" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "## Update MWAA IAM Execution Role" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "Every MWAA Environment has an [Execution Role](https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-create-role.html) attached to it. This role consists of permissions policy that grants Amazon Managed Workflows for Apache Airflow (MWAA) permission to invoke the resources of other AWS services on your behalf. In our case, we want our MWAA Tasks to be able to access SageMaker and S3. Edit the MWAA Execution role and add the permissions listed below-\n", 191 | "\n", 192 | "- `AmazonS3FullAccess`\n", 193 | "- `AmazonSageMakerFullAccess`" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "## Setup SageMaker Role for MWAA" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "Next, we will create a SageMaker service role to be used in the ML pipeline module. To create an IAM role for Amazon SageMaker\n", 208 | "\n", 209 | "- Go to the IAM Console - Roles\n", 210 | "- Choose Create role\n", 211 | "- For role type, choose AWS Service, find and choose SageMaker, and choose Next: Permissions\n", 212 | "- On the Attach permissions policy page, choose (if not already selected)\n", 213 | " - AWS managed policy `AmazonSageMakerFullAccess`\n", 214 | " - AWS managed policy `AmazonS3FullAccess` for access to Amazon S3 resources\n", 215 | "- Then choose Next: Tags and then Next: Review.\n", 216 | "- For Role name, enter AirflowSageMakerExecutionRole and Choose Create Role\n", 217 | "\n", 218 | "Alternatively, we can also use the default SageMaker Execution role since it already has these permissions.\n" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 97, 224 | "metadata": {}, 225 | "outputs": [ 226 | { 227 | "data": { 228 | "text/plain": [ 229 | "'arn:aws:iam::965425568475:role/service-role/AmazonSageMaker-ExecutionRole-20201030T135016'" 230 | ] 231 | }, 232 | "execution_count": 97, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "iam_role = sagemaker.get_execution_role()\n", 239 | "iam_role" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "---\n", 247 | "## Setup configuration script\n", 248 | "\n", 249 | "In this step we create a helper script to define the model training and model creation task configurations. This script will be used by the DAG tasks to obtain various configuration information for model training and model creation.\n", 250 | "\n", 251 | "
⚠️ NOTE: \n", 252 | "    Replace the bucket with the SageMaker default bucket name for your SageMaker Studio domain.\n", 253 | "
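If you are unsure of the default bucket name, you can print it from any notebook in your Studio domain:

```python
# Prints the default SageMaker bucket for the current account/region,
# e.g. sagemaker-<region>-<account-id>
import sagemaker

print(sagemaker.Session().default_bucket())
```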
\n" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 559, 259 | "metadata": {}, 260 | "outputs": [ 261 | { 262 | "name": "stdout", 263 | "output_type": "stream", 264 | "text": [ 265 | "Writing scripts/config.py\n" 266 | ] 267 | } 268 | ], 269 | "source": [ 270 | "%%writefile scripts/config.py\n", 271 | "#!/usr/bin/env python\n", 272 | "import time\n", 273 | "import uuid\n", 274 | "import sagemaker\n", 275 | "import json\n", 276 | "from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook\n", 277 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 278 | "\n", 279 | "def config(**opts):\n", 280 | " \n", 281 | " region_name=opts['region_name'] if 'region_name' in opts else sagemaker.Session().boto_region_name\n", 282 | " \n", 283 | " # Hook\n", 284 | " hook = AwsBaseHook(aws_conn_id='airflow-sagemaker', resource_type=\"sagemaker\")\n", 285 | " boto_session = hook.get_session(region_name=region_name)\n", 286 | " sagemaker_session = sagemaker.session.Session(boto_session=boto_session)\n", 287 | " \n", 288 | "\n", 289 | " training_job_name=opts['training_job_name']\n", 290 | " bucket = opts['bucket'] if 'bucket' in opts else sagemaker_session.default_bucket() #\"sagemaker-us-east-2-965425568475\"\n", 291 | " s3_prefix = opts['s3_prefix']\n", 292 | " # Get the xgboost container uri\n", 293 | " container = get_image_uri(region_name, 'xgboost', repo_version='1.0-1')\n", 294 | " \n", 295 | " ts = f\"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}\" \n", 296 | " config = {}\n", 297 | " \n", 298 | " config[\"data_wrangler_config\"] = { \n", 299 | " \"sagemaker_role\": opts['role_name'],\n", 300 | " #\"s3_data_type\" : defaults to \"S3Prefix\" \n", 301 | " #\"s3_input_mode\" : defaults to \"File\", \n", 302 | " #\"s3_data_distribution_type\" : defaults to \"FullyReplicated\", \n", 303 | " #\"aws_conn_id\" : defaults to \"aws_default\",\n", 304 | " #\"kms_key\" : defaults to None, \n", 305 | " #\"volume_size_in_gb\" : defaults to 30,\n", 306 | " #\"enable_network_isolation\" : defaults to False, \n", 307 | " #\"wait_for_processing\" : defaults to True, \n", 308 | " #\"container_uri\" : defaults to \"415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.x\", \n", 309 | " #\"container_uri_pinned\" : defaults to \"415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.12.0\", \n", 310 | " \"outputConfig\": {\n", 311 | " #\"s3_output_upload_mode\": #defaults to EndOfJob\n", 312 | " #\"output_content_type\": #defaults to CSV\n", 313 | " #\"output_bucket\": #defaults to SageMaker Default bucket\n", 314 | " \"output_prefix\": s3_prefix #prefix within bucket where output will be written, default is generated automatically\n", 315 | " }\n", 316 | " }\n", 317 | "\n", 318 | " config[\"train_config\"]={\n", 319 | " \"AlgorithmSpecification\": {\n", 320 | " \"TrainingImage\": container,\n", 321 | " \"TrainingInputMode\": \"File\"\n", 322 | " },\n", 323 | " \"HyperParameters\": {\n", 324 | " \"max_depth\": \"5\",\n", 325 | " \"num_round\": \"10\",\n", 326 | " \"objective\": \"reg:squarederror\"\n", 327 | " },\n", 328 | " \"InputDataConfig\": [\n", 329 | " {\n", 330 | " \"ChannelName\": \"train\",\n", 331 | " \"ContentType\": \"csv\",\n", 332 | " \"DataSource\": {\n", 333 | " \"S3DataSource\": {\n", 334 | " \"S3DataDistributionType\": \"FullyReplicated\",\n", 335 | " \"S3DataType\": \"S3Prefix\",\n", 336 | " \"S3Uri\": f\"s3://{bucket}/{s3_prefix}/train\"\n", 337 | " }\n", 338 | " }\n", 339 | 
" }\n", 340 | " ],\n", 341 | " \"OutputDataConfig\": {\n", 342 | " \"S3OutputPath\": f\"s3://{bucket}/{s3_prefix}/xgboost\"\n", 343 | " },\n", 344 | " \"ResourceConfig\": {\n", 345 | " \"InstanceCount\": 1,\n", 346 | " \"InstanceType\": \"ml.m5.2xlarge\",\n", 347 | " \"VolumeSizeInGB\": 5\n", 348 | " },\n", 349 | " \"RoleArn\": opts['role_name'],\n", 350 | " \"StoppingCondition\": {\n", 351 | " \"MaxRuntimeInSeconds\": 86400\n", 352 | " },\n", 353 | " \"TrainingJobName\": training_job_name\n", 354 | " }\n", 355 | " \n", 356 | " config[\"model_config\"]={\n", 357 | " \"ExecutionRoleArn\": opts['role_name'],\n", 358 | " \"ModelName\": f\"XGBoost-Fraud-Detector-{ts}\",\n", 359 | " \"PrimaryContainer\": { \n", 360 | " \"Mode\": \"SingleModel\",\n", 361 | " \"Image\": container,\n", 362 | " \"ModelDataUrl\": f\"s3://{bucket}/{s3_prefix}/xgboost/{training_job_name}/output/model.tar.gz\"\n", 363 | " },\n", 364 | " }\n", 365 | " \n", 366 | " return config" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "Upload `config.py` to the `/dags` prefix." 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 560, 379 | "metadata": {}, 380 | "outputs": [], 381 | "source": [ 382 | "s3_client.upload_file(\"scripts/config.py\", bucket, f\"dags/config.py\")" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "---\n", 390 | "## Setup Apache Airflow DAG (Directed Acyclic Graph)\n", 391 | "\n", 392 | "In this step, we will create the Python script to setup the Apache Airflow DAG. The script will create three distinct tasks and finally chain them together using `>>` in the end to create the Airflow DAG.\n", 393 | "\n", 394 | "1. Use Python operator to define a task to run the Data Wrangler script for data pre-processing\n", 395 | "2. Use SageMaker operator to define a task to train an XGBoost model using the training data\n", 396 | "3. 
Use SageMaker operator to define a task to create a model using the model artifacts created by the training step" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 538, 402 | "metadata": {}, 403 | "outputs": [ 404 | { 405 | "data": { 406 | "text/plain": [ 407 | "'s3://sagemaker-us-east-2-965425568475/data-wrangler-pipeline/flow/flow-21-08-44-46-aea6f365.flow'" 408 | ] 409 | }, 410 | "execution_count": 538, 411 | "metadata": {}, 412 | "output_type": "execute_result" 413 | } 414 | ], 415 | "source": [ 416 | "%store -r ins_claim_flow_uri\n", 417 | "ins_claim_flow_uri" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": 561, 423 | "metadata": {}, 424 | "outputs": [ 425 | { 426 | "name": "stdout", 427 | "output_type": "stream", 428 | "text": [ 429 | "Writing scripts/ml_pipeline.py\n" 430 | ] 431 | } 432 | ], 433 | "source": [ 434 | "%%writefile scripts/ml_pipeline.py\n", 435 | "#!/usr/bin/env python\n", 436 | "import time\n", 437 | "import uuid\n", 438 | "import json\n", 439 | "import boto3\n", 440 | "import sagemaker\n", 441 | "\n", 442 | "# Import config file.\n", 443 | "from config import config\n", 444 | "from datetime import timedelta\n", 445 | "import airflow\n", 446 | "from airflow import DAG\n", 447 | "from airflow.models import DAG\n", 448 | "\n", 449 | "# airflow operators\n", 450 | "from airflow.models import DAG\n", 451 | "from airflow.operators.python_operator import PythonOperator\n", 452 | "\n", 453 | "# airflow sagemaker operators\n", 454 | "from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator\n", 455 | "from airflow.providers.amazon.aws.operators.sagemaker_model import SageMakerModelOperator\n", 456 | "\n", 457 | "# airflow sagemaker configuration\n", 458 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 459 | "from sagemaker.estimator import Estimator\n", 460 | "from sagemaker.workflow.airflow import training_config\n", 461 | "\n", 462 | "# airflow Data Wrangler operator\n", 463 | "from SMDataWranglerOperator import SageMakerDataWranglerOperator\n", 464 | "\n", 465 | "# airflow dummy operator\n", 466 | "from airflow.operators.dummy import DummyOperator\n", 467 | "\n", 468 | "\n", 469 | " \n", 470 | "default_args = { \n", 471 | " 'owner': 'airflow',\n", 472 | " 'depends_on_past': False,\n", 473 | " 'start_date': airflow.utils.dates.days_ago(1),\n", 474 | " 'retries': 0,\n", 475 | " 'retry_delay': timedelta(minutes=2),\n", 476 | " 'provide_context': True,\n", 477 | " 'email': ['airflow@iloveairflow.com'],\n", 478 | " 'email_on_failure': False,\n", 479 | " 'email_on_retry': False\n", 480 | "}\n", 481 | "ts = f\"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}\"\n", 482 | "DAG_NAME = f\"ml-pipeline\"\n", 483 | "\n", 484 | "#-------\n", 485 | "### Start creating DAGs\n", 486 | "#-------\n", 487 | "\n", 488 | "dag = DAG( \n", 489 | " DAG_NAME,\n", 490 | " default_args=default_args,\n", 491 | " dagrun_timeout=timedelta(hours=2),\n", 492 | " # Cron expression to auto run workflow on specified interval\n", 493 | " # schedule_interval='0 3 * * *'\n", 494 | " schedule_interval=None\n", 495 | " )\n", 496 | "\n", 497 | "#-------\n", 498 | "# Task to create configurations\n", 499 | "#-------\n", 500 | "\n", 501 | "config_task = PythonOperator(\n", 502 | " task_id = 'Start',\n", 503 | " python_callable=config,\n", 504 | " op_kwargs={\n", 505 | " 'training_job_name': f\"XGBoost-training-{ts}\",\n", 506 | " 's3_prefix': 'data-wrangler-pipeline',\n", 507 | " 
'role_name': 'AmazonSageMaker-ExecutionRole-20201030T135016'},\n", 508 | " provide_context=True,\n", 509 | " dag=dag\n", 510 | " )\n", 511 | "\n", 512 | "#-------\n", 513 | "# Task with SageMakerDataWranglerOperator operator for Data Wrangler Processing Job.\n", 514 | "#-------\n", 515 | "\n", 516 | "def datawrangler(**context):\n", 517 | " config = context['ti'].xcom_pull(task_ids='Start',key='return_value')\n", 518 | " preprocess_task = SageMakerDataWranglerOperator(\n", 519 | " task_id='DataWrangler_Processing_StepNew',\n", 520 | " dag=dag,\n", 521 | " flow_file_s3uri=\"$flow_uri\",\n", 522 | " processing_instance_count=2,\n", 523 | " instance_type='ml.m5.4xlarge',\n", 524 | " aws_conn_id=\"aws_default\",\n", 525 | " config= config[\"data_wrangler_config\"]\n", 526 | " )\n", 527 | " preprocess_task.execute(context)\n", 528 | "\n", 529 | "datawrangler_task = PythonOperator(\n", 530 | " task_id = 'SageMaker_DataWrangler_step',\n", 531 | " python_callable=datawrangler,\n", 532 | " provide_context=True,\n", 533 | " dag=dag\n", 534 | " )\n", 535 | "\n", 536 | "#-------\n", 537 | "# Task with SageMaker training operator to train the xgboost model\n", 538 | "#-------\n", 539 | "\n", 540 | "def trainmodel(**context):\n", 541 | " config = context['ti'].xcom_pull(task_ids='Start',key='return_value')\n", 542 | " trainmodel_task = SageMakerTrainingOperator(\n", 543 | " task_id='Training_Step',\n", 544 | " config= config['train_config'],\n", 545 | " aws_conn_id='aws-sagemaker',\n", 546 | " wait_for_completion=True,\n", 547 | " check_interval=30\n", 548 | " )\n", 549 | " trainmodel_task.execute(context)\n", 550 | "\n", 551 | "train_model_task = PythonOperator(\n", 552 | " task_id = 'SageMaker_training_step',\n", 553 | " python_callable=trainmodel,\n", 554 | " provide_context=True,\n", 555 | " dag=dag\n", 556 | " )\n", 557 | "\n", 558 | "#-------\n", 559 | "# Task with SageMaker Model operator to create the xgboost model from artifacts\n", 560 | "#-------\n", 561 | "\n", 562 | "def createmodel(**context):\n", 563 | " config = context['ti'].xcom_pull(task_ids='Start',key='return_value')\n", 564 | " createmodel_task= SageMakerModelOperator(\n", 565 | " task_id='Create_Model',\n", 566 | " config= config['model_config'],\n", 567 | " aws_conn_id='aws-sagemaker',\n", 568 | " )\n", 569 | " createmodel_task.execute(context)\n", 570 | " \n", 571 | "create_model_task = PythonOperator(\n", 572 | " task_id = 'SageMaker_create_model_step',\n", 573 | " python_callable=createmodel,\n", 574 | " provide_context=True,\n", 575 | " dag=dag\n", 576 | " )\n", 577 | "\n", 578 | "#------\n", 579 | "# Last step\n", 580 | "#------\n", 581 | "end_task = DummyOperator(task_id='End', dag=dag)\n", 582 | "\n", 583 | "# Create task dependencies\n", 584 | "\n", 585 | "config_task >> datawrangler_task >> train_model_task >> create_model_task >> end_task" 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "Replace `$flow_uri` in the `ml_pipeline.py` script with the store magic variable `ins_claim_flow_uri` which contains S3 path of the `.flow` file." 
593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": 562, 598 | "metadata": {}, 599 | "outputs": [], 600 | "source": [ 601 | "with open(\"scripts/ml_pipeline.py\", 'r') as f:\n", 602 | "    variables = {'flow_uri': ins_claim_flow_uri}\n", 603 | "    template = string.Template(f.read())\n", 604 | "    ml_pipeline = template.substitute(variables)\n", 605 | "\n", 606 | "# Writes the updated ml_pipeline.py file\n", 607 | "with open('scripts/ml_pipeline.py', 'w') as f:\n", 608 | "    f.write(ml_pipeline)" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": {}, 614 | "source": [ 615 | "Upload `ml_pipeline.py` to the `/dags` prefix." 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 563, 621 | "metadata": {}, 622 | "outputs": [], 623 | "source": [ 624 | "s3_client.upload_file(\"scripts/ml_pipeline.py\", bucket, f\"dags/ml_pipeline.py\")" 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": {}, 630 | "source": [ 631 | "---\n", 632 | "## View Airflow DAG and run\n", 633 | "\n", 634 | "Once the above steps are complete, you can access the [Apache Airflow UI](https://docs.aws.amazon.com/mwaa/latest/userguide/access-airflow-ui.html) and view the DAG. To access the Apache Airflow UI, go to the Amazon MWAA console, select the MWAA environment and click the _Airflow UI_ link.\n", 635 | "\n", 636 | "\n", 637 | "\n", 638 | "\n", 639 | "You can run the DAG by clicking the \"Play\" button. Alternatively, you can:\n", 640 | "\n", 641 | "1. Set up the DAG to run automatically on a schedule using cron expressions\n", 642 | "2. Set up the DAG to run based on S3 sensors so that the pipeline/workflow executes whenever a new file arrives in a bucket/prefix.\n", 643 | "\n", 644 | "\n" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "### Clean Up\n", 652 | "\n", 653 | "1. Delete the MWAA environment from the Amazon MWAA console.\n", 654 | "2. Delete the MWAA S3 files.\n", 655 | "3. Delete the model training data and model artifact files from S3.\n", 656 | "4. Delete the SageMaker model."
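The last two clean-up steps can also be scripted. A minimal sketch with boto3, assuming the model name prefix (`XGBoost-Fraud-Detector`) and the S3 prefix (`data-wrangler-pipeline`) used in this example; adjust both to whatever your run actually created:

```python
# Clean-up sketch: remove models created by the DAG and the pipeline's S3 artifacts.
# Assumes the naming used in this example; verify before deleting anything.
import boto3
import sagemaker

sm = boto3.client('sagemaker')
s3 = boto3.resource('s3')

# Delete models created by the DAG (named 'XGBoost-Fraud-Detector-<timestamp>' in config.py)
for model in sm.list_models(NameContains='XGBoost-Fraud-Detector')['Models']:
    sm.delete_model(ModelName=model['ModelName'])

# Delete the training data and model artifacts written under the pipeline prefix
bucket = sagemaker.Session().default_bucket()
s3.Bucket(bucket).objects.filter(Prefix='data-wrangler-pipeline/').delete()
```

The MWAA environment and its S3 bucket are still deleted from the console as described in steps 1 and 2.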
657 | ] 658 | }, 659 | { 660 | "cell_type": "markdown", 661 | "metadata": {}, 662 | "source": [ 663 | "---\n", 664 | "# Conclusion\n", 665 | "\n", 666 | "We created an ML Pipeline with Apache Airflow and used the Data Wrangler script to pre-process and generate new training data for our model training and subsequently created a new model in Amazon SageMaker.\n" 667 | ] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": null, 672 | "metadata": {}, 673 | "outputs": [], 674 | "source": [] 675 | } 676 | ], 677 | "metadata": { 678 | "instance_type": "ml.t3.medium", 679 | "kernelspec": { 680 | "display_name": "Python 3 (Data Science)", 681 | "language": "python", 682 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" 683 | }, 684 | "language_info": { 685 | "codemirror_mode": { 686 | "name": "ipython", 687 | "version": 3 688 | }, 689 | "file_extension": ".py", 690 | "mimetype": "text/x-python", 691 | "name": "python", 692 | "nbconvert_exporter": "python", 693 | "pygments_lexer": "ipython3", 694 | "version": "3.7.10" 695 | } 696 | }, 697 | "nbformat": 4, 698 | "nbformat_minor": 4 699 | } 700 | -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/README.md: -------------------------------------------------------------------------------- 1 | ## SageMaker Data Wrangler with Apache Airflow (Amazon MWAA) 2 | 3 | ### Getting Started 4 | 5 | #### Setup Amazon MWAA Environment 6 | 7 | 1. Create an [Amazon S3 bucket](https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-s3-bucket.html) and subsequent folders required by Amazon MWAA. The S3 bucket and the folders as seen below. These folders are used by Amazon MWAA to look for dependent Python scripts that is required for the workflow. 8 | 9 |
10 |

11 | Airflow S3 12 |

13 |
14 | 15 | 2. [Create an Amazon MWAA environment](https://docs.aws.amazon.com/mwaa/latest/userguide/create-environment.html). Note that we used Airflow version 2.0.2 (current latest version supported by Amazon MWAA) for this solution. 16 | 17 | #### Setup the Workflow DAG for Apache Airflow 18 | 19 | 1. Clone this repository to your local machine 20 | ``` 21 | git clone 22 | cd data-wrangler-mlops/3-apache-airflow-pipelines 23 | ``` 24 | 3. Upload the `requirements.txt` file in the `/scripts` folder. This contains all the Python dependencies required by the Airflow tasks and upload it to the `/requirements` directory within Amazon MWAA primary S3 bucket (created in step 1). This will be used by the managed Airflow environment to install the Python dependencies. 25 | 3. Upload the `SMDataWranglerOperator.py` from the `/scripts` folder to the S3 `/dags` directory. This Python script contains code for the custom Airflow operator for SageMaker Data Wrangler. This operator can be used for tasks to process any `.flow` file. 26 | 4. Upload the `config.py` script from the `/scripts` folder to the S3 `/dags` directory. This Python script will be used for the first step of our DAG to create configuration objects required by the remaining steps of the workflow. 27 | 5. Finally, upload the `ml_pipelines.py` file from the `/scripts` folder to the S3 `/dags` directory. This script contains the DAG definition for the Airflow workflow. This is where we define each of the tasks, and setup dependencies between them. Amazon MWAA will periodically poll the /dags directory to execute this script to create the DAG or update the existing one with any latest changes. 28 | 29 | #### Run the Airflow workflow DAG 30 | 31 | 1. Navigate to the Amazon MWAA console and find your Airflow environment. 32 | 2. Click on the Environment name to see the details, then click the link under "Airflow UI" to open the Airflow Admin UI in a new browser tab. 33 | 34 |
35 |

36 | Airflow UI 37 |

38 |
39 | 40 | 3. In the Airflow UI, click the DAG named `ml-pipeline` to see the details of the DAG. If you don't see the DAG then wait for a few moments for the DAG to get created. Amazon MWAA will poll the S3 bucket periodically and read the Python scripts from the bucket and will execute them to setup the DAG. 41 | 42 |
43 |

44 | Airflow UI Home 45 |

46 |
47 | 48 | 4. Click on the "Graph View" tab to view the DAG and then click the play button on the top right of the screen to start the workflow execution. This is the manual way to start the workflow execution, however, this can be automated using Apache Airflow Sensors and Plugins. Refer to the [documentation](https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-dag-import-plugins.html) for more information on how to setup sensors. 49 | 50 |
51 |

52 | Workflow 53 |

54 |
55 | 56 | 5. In the next screen keep the "Configuration JSON" as default and click "Trigger". This will start the Workflow execution. 57 | 58 |
59 |

60 | Workflow Trigger 61 |

62 |
63 | 64 | 6. Once the workflow is completed successfully you can check the S3 location (from the AWS console) specified as the DataWrangler output location to see the transformed output files generated. Verify that the model was created from the SageMaker Console. 65 | 66 | 7. (Otional) You can look at the Workflow task logs by clicking on a Task in the DAG (in DAG View) and then clicking on the "Logs" button in the pop-up window. 67 | 68 | 69 | --- 70 | 71 | ### Cleaning Up 72 | 73 | 1. On the Airflow UI, click the "DAGs" button on the top of the page to show the list of DAGs. 74 | 2. Click the Switch right next the DAG's name to turn it off (_"Pause DAG"_) and then click on the Delete (trash can) button on the right to delete the DAG. 75 | 76 |
77 |

78 | Delete DAG 79 |

80 |
81 | 82 | 3. Close the Airflow UI screen and navigate to the Amazon MWAA Console from AWS console to view the list of MWAA environments. Select the MWAA environment and click Delete. 83 | 84 |
85 |

86 | Delete MWAA Env 87 |

88 |
89 | 90 | 4. Delete the Amazon S3 bucket for MWAA that you created at the beginning of this tutorial and all the folders and files within it. Refer to the Amazon MWAA [documentation](https://docs.aws.amazon.com/mwaa/latest/userguide/working-dags-delete.html#working-dags-s3-dag-delete) on how to delete just the DAG files from S3. -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/images/delete_mwaa.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/delete_mwaa.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/images/flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/flow.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/images/meaa_ui_home.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/meaa_ui_home.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/images/mwaa_dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_dag.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/images/mwaa_delete_dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_delete_dag.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/images/mwaa_s3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_s3.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/images/mwaa_trigger.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_trigger.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/images/mwaa_ui.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/3-apache-airflow-pipelines/images/mwaa_ui.png -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/scripts/SMDataWranglerOperator.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import time 3 | import uuid 4 | import sagemaker 5 | import json 6 | from sagemaker.processing import ProcessingInput, ProcessingOutput 7 | from sagemaker.processing import Processor 8 | from sagemaker.network import NetworkConfig 9 | from airflow.models.baseoperator import BaseOperator 10 | from airflow.providers.amazon.aws.hooks.s3 import S3Hook 11 | from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook 12 | 13 | 14 | class SageMakerDataWranglerOperator(BaseOperator): 15 | template_fields = ["config","iam_role"] 16 | 17 | def __init__(self, 18 | flow_file_s3uri: str, 19 | processing_instance_count: int, 20 | instance_type: str, 21 | aws_conn_id: str, 22 | config: str, 23 | **kwargs) -> None: 24 | super().__init__(**kwargs) 25 | sess = sagemaker.Session() 26 | 27 | self.config = config 28 | self.sagemaker_session = sess 29 | self.basedir = "/opt/ml/processing" 30 | self.flow_file_s3uri = flow_file_s3uri # full form S3 URI example: s3://bucket/prefix/object 31 | self.processing_instance_count = processing_instance_count 32 | self.instance_type = instance_type 33 | self.aws_conn_id = aws_conn_id if aws_conn_id is not None else "aws_default" # Uses the default MWAA AWS Connection 34 | self.s3_data_type = self.get_config("s3_data_type", "S3Prefix", config) 35 | self.s3_input_mode = self.get_config("s3_input_mode", "File", config) 36 | self.s3_data_distribution_type = self.get_config("s3_data_distribution_type", "FullyReplicated", config) 37 | self.kms_key = self.get_config("kms_key", None, config) 38 | self.volume_size_in_gb = self.get_config("volume_size_in_gb", 30, config) # Defaults to 30Gb EBS volume 39 | self.enable_network_isolation = self.get_config("enable_network_isolation",False, config) 40 | self.wait_for_processing = self.get_config("wait_for_processing", True, config) 41 | self.container_uri = self.get_config("container_uri", 42 | "415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.x", config) 43 | self.container_uri_pinned = self.get_config("container_uri_pinned", 44 | "415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.12.0", config) 45 | self.s3_output_upload_mode = self.get_config("s3_output_upload_mode", "EndOfJob", config) 46 | self.output_content_type = self.get_config("output_content_type", "CSV", config["outputConfig"]) # CSV/PARQUET 47 | self.output_bucket = self.get_config("output_bucket", sess.default_bucket(), config["outputConfig"]) 48 | self.output_prefix = self.get_config("output_prefix", None, config["outputConfig"]) 49 | 50 | def expand_role(self) -> None: 51 | if 'sagemaker_role' in self.config: 52 | hook = AwsBaseHook(self.aws_conn_id, client_type='iam') 53 | self.iam_role = hook.expand_role(self.config['sagemaker_role']) 54 | 55 | def get_config(self,flag, default, opts): 56 | var = opts[flag] if flag in opts else default 57 | return var 58 | 59 | def parse_s3_uri(self, s3_uri): 60 | path_parts=s3_uri.replace("s3://","").split("/") 61 | s3_bucket=path_parts.pop(0) 62 | key="/".join(path_parts) 63 | 64 | return s3_bucket, key 65 | 66 | def get_data_sources(self, data): 67 | # Initialize variables from .flow file 68 | output_node = data['nodes'][-1]['node_id'] 69 | output_path = data['nodes'][-1]['outputs'][0]['name'] 70 | input_source_names = [node['parameters']['dataset_definition']['name'] for node in data['nodes'] if node['type']=="SOURCE"] 71 | input_source_uris = 
[node['parameters']['dataset_definition']['s3ExecutionContext']['s3Uri'] for node in data['nodes'] if node['type']=="SOURCE"] 72 | 73 | output_name = f"{output_node}.{output_path}" 74 | 75 | data_sources = [] 76 | 77 | # Intialize data sources from .flow file 78 | for i in range(0,len(input_source_uris)): 79 | data_sources.append(ProcessingInput( 80 | source=input_source_uris[i], 81 | destination=f"{self.basedir}/{input_source_names[i]}", 82 | input_name=input_source_names[i], 83 | s3_data_type=self.s3_data_type, 84 | s3_input_mode=self.s3_input_mode, 85 | s3_data_distribution_type=self.s3_data_distribution_type 86 | )) 87 | 88 | return output_name, data_sources 89 | 90 | def get_processor(self): 91 | # Create Processing Job 92 | # To launch a Processing Job, you will use the SageMaker Python SDK to create a Processor function. 93 | processor = Processor( 94 | role=self.iam_role, 95 | image_uri=self.container_uri, 96 | instance_count=self.processing_instance_count, 97 | instance_type=self.instance_type, 98 | volume_size_in_gb=self.volume_size_in_gb, 99 | network_config=NetworkConfig(enable_network_isolation=self.enable_network_isolation), 100 | sagemaker_session=self.sagemaker_session, 101 | output_kms_key=self.kms_key 102 | ) 103 | 104 | return processor 105 | 106 | def execute(self, context): 107 | 108 | self.expand_role() 109 | 110 | print(f'SageMaker Data Wrangler Operator initialized with {context}...') 111 | # Time marker 112 | ts = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}" 113 | # Establish connection to S3 with S3Hook 114 | s3 = S3Hook(aws_conn_id=self.aws_conn_id) 115 | s3.get_conn() 116 | 117 | # Read the .flow file 118 | s3_bucket, key = self.parse_s3_uri(self.flow_file_s3uri) 119 | file_content = s3.read_key(key=key, bucket_name=s3_bucket) 120 | data = json.loads(file_content) 121 | 122 | output_name, data_sources = self.get_data_sources(data) 123 | 124 | # Configure Output for the SageMaker processing job 125 | prefix = self.output_prefix if self.output_prefix is not None else ts 126 | s3_output_path = f"s3://{self.output_bucket}/{prefix}" 127 | 128 | processing_job_output = ProcessingOutput( 129 | output_name=output_name, 130 | source=f"{self.basedir}/output", 131 | destination=s3_output_path, 132 | s3_upload_mode=self.s3_output_upload_mode 133 | ) 134 | 135 | # The Data Wrangler Flow is provided to the Processing Job as an input source which we configure below. 136 | # Input - Flow file 137 | flow_input = ProcessingInput( 138 | source=self.flow_file_s3uri, 139 | destination=f"{self.basedir}/flow", 140 | input_name="flow", 141 | s3_data_type=self.s3_data_type, 142 | s3_input_mode=self.s3_input_mode, 143 | s3_data_distribution_type=self.s3_data_distribution_type 144 | ) 145 | 146 | # Output configuration used as processing job container arguments 147 | output_config = { 148 | output_name: { 149 | "content_type": self.output_content_type 150 | } 151 | } 152 | 153 | # Create a SageMaker processing Processor 154 | processor = self.get_processor() 155 | 156 | # Unique processing job name. 
Give a unique name every time you re-execute processing jobs 157 | processing_job_name = f"data-wrangler-flow-processing-{ts}" 158 | 159 | print(f'Starting SageMaker Data Wrangler processing job {processing_job_name} with {self.processing_instance_count} instances of {self.instance_type} and {self.volume_size_in_gb}Gb disk...') 160 | 161 | # Start Job 162 | processor.run( 163 | inputs=[flow_input] + data_sources, 164 | outputs=[processing_job_output], 165 | arguments=[f"--output-config '{json.dumps(output_config)}'"], 166 | wait=self.wait_for_processing, 167 | logs=True, 168 | job_name=processing_job_name 169 | ) 170 | 171 | print(f'SageMaker Data Wrangler processing job for flow file {self.flow_file_s3uri} complete...') 172 | 173 | # We will copy the files generated by Data Wrangler to a well known S3 location so that 174 | # the location can be used in our training job Task in the Airflow DAG. 175 | # This is because the prefix generated by Data Wrangler is dynamic. 176 | # We will use the default connection named 'aws_default' [Found in Airflow Admin UI > Admin Menu > Connections] 177 | 178 | print(f'Processing output files...') 179 | 180 | key_list = s3.list_keys(self.output_bucket, prefix=f"{prefix}/{processing_job_name}") 181 | for index, key in enumerate(key_list): 182 | s3.copy_object(source_bucket_key=f"s3://{self.output_bucket}/{key}",dest_bucket_key=f"{s3_output_path}/train/train-data-{index}.csv") 183 | 184 | # Delete the original file(s) since they have been moved to the well known S3 location 185 | s3.delete_objects(bucket=self.output_bucket, keys=key_list) 186 | 187 | data_output_path = f"{s3_output_path}/train" 188 | 189 | print(f'Saved output files at {data_output_path}...') 190 | return data_output_path -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/scripts/config.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import time 3 | import uuid 4 | import sagemaker 5 | import json 6 | from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook 7 | from sagemaker.amazon.amazon_estimator import get_image_uri 8 | 9 | def config(**opts): 10 | 11 | region_name=opts['region_name'] if 'region_name' in opts else sagemaker.Session().boto_region_name 12 | 13 | # Hook 14 | hook = AwsBaseHook(aws_conn_id='airflow-sagemaker', resource_type="sagemaker") 15 | boto_session = hook.get_session(region_name=region_name) 16 | sagemaker_session = sagemaker.session.Session(boto_session=boto_session) 17 | 18 | 19 | training_job_name=opts['training_job_name'] 20 | bucket = opts['bucket'] if 'bucket' in opts else sagemaker_session.default_bucket() #"sagemaker-us-east-2-965425568475" 21 | s3_prefix = opts['s3_prefix'] 22 | # Get the xgboost container uri 23 | container = get_image_uri(region_name, 'xgboost', repo_version='1.0-1') 24 | 25 | ts = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}" 26 | config = {} 27 | 28 | config["data_wrangler_config"] = { 29 | "sagemaker_role": opts['role_name'], 30 | #"s3_data_type" : defaults to "S3Prefix" 31 | #"s3_input_mode" : defaults to "File", 32 | #"s3_data_distribution_type" : defaults to "FullyReplicated", 33 | #"kms_key" : defaults to None, 34 | #"volume_size_in_gb" : defaults to 30, 35 | #"enable_network_isolation" : defaults to False, 36 | #"wait_for_processing" : defaults to True, 37 | #"container_uri" : defaults to "415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.x", 
38 | #"container_uri_pinned" : defaults to "415577184552.dkr.ecr.us-east-2.amazonaws.com/sagemaker-data-wrangler-container:1.12.0", 39 | "outputConfig": { 40 | #"s3_output_upload_mode": #defaults to EndOfJob 41 | #"output_content_type": #defaults to CSV 42 | #"output_bucket": #defaults to SageMaker Default bucket 43 | "output_prefix": s3_prefix #prefix within bucket where output will be written, default is generated automatically 44 | } 45 | } 46 | 47 | config["train_config"]={ 48 | "AlgorithmSpecification": { 49 | "TrainingImage": container, 50 | "TrainingInputMode": "File" 51 | }, 52 | "HyperParameters": { 53 | "max_depth": "5", 54 | "num_round": "10", 55 | "objective": "reg:squarederror" 56 | }, 57 | "InputDataConfig": [ 58 | { 59 | "ChannelName": "train", 60 | "ContentType": "csv", 61 | "DataSource": { 62 | "S3DataSource": { 63 | "S3DataDistributionType": "FullyReplicated", 64 | "S3DataType": "S3Prefix", 65 | "S3Uri": f"s3://{bucket}/{s3_prefix}/train" 66 | } 67 | } 68 | } 69 | ], 70 | "OutputDataConfig": { 71 | "S3OutputPath": f"s3://{bucket}/{s3_prefix}/xgboost" 72 | }, 73 | "ResourceConfig": { 74 | "InstanceCount": 1, 75 | "InstanceType": "ml.m5.2xlarge", 76 | "VolumeSizeInGB": 5 77 | }, 78 | "RoleArn": opts['role_name'], 79 | "StoppingCondition": { 80 | "MaxRuntimeInSeconds": 86400 81 | }, 82 | "TrainingJobName": training_job_name 83 | } 84 | 85 | config["model_config"]={ 86 | "ExecutionRoleArn": opts['role_name'], 87 | "ModelName": f"XGBoost-Fraud-Detector-{ts}", 88 | "PrimaryContainer": { 89 | "Mode": "SingleModel", 90 | "Image": container, 91 | "ModelDataUrl": f"s3://{bucket}/{s3_prefix}/xgboost/{training_job_name}/output/model.tar.gz" 92 | }, 93 | } 94 | 95 | return config 96 | -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/scripts/ml_pipeline.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import time 3 | import uuid 4 | import json 5 | import boto3 6 | import sagemaker 7 | 8 | # Import config file. 
9 | from config import config 10 | from datetime import timedelta 11 | import airflow 12 | from airflow import DAG 13 | from airflow.models import DAG 14 | 15 | # airflow operators 16 | from airflow.models import DAG 17 | from airflow.operators.python_operator import PythonOperator 18 | 19 | # airflow sagemaker operators 20 | from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator 21 | from airflow.providers.amazon.aws.operators.sagemaker_model import SageMakerModelOperator 22 | 23 | # airflow sagemaker configuration 24 | from sagemaker.amazon.amazon_estimator import get_image_uri 25 | from sagemaker.estimator import Estimator 26 | from sagemaker.workflow.airflow import training_config 27 | 28 | # airflow Data Wrangler operator 29 | from SMDataWranglerOperator import SageMakerDataWranglerOperator 30 | 31 | # airflow dummy operator 32 | from airflow.operators.dummy import DummyOperator 33 | 34 | 35 | 36 | default_args = { 37 | 'owner': 'airflow', 38 | 'depends_on_past': False, 39 | 'start_date': airflow.utils.dates.days_ago(1), 40 | 'retries': 0, 41 | 'retry_delay': timedelta(minutes=2), 42 | 'provide_context': True, 43 | 'email': ['airflow@iloveairflow.com'], 44 | 'email_on_failure': False, 45 | 'email_on_retry': False 46 | } 47 | ts = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}" 48 | DAG_NAME = f"ml-pipeline" 49 | 50 | #------- 51 | ### Start creating DAGs 52 | #------- 53 | 54 | dag = DAG( 55 | DAG_NAME, 56 | default_args=default_args, 57 | dagrun_timeout=timedelta(hours=2), 58 | # Cron expression to auto run workflow on specified interval 59 | # schedule_interval='0 3 * * *' 60 | schedule_interval=None 61 | ) 62 | 63 | #------- 64 | # Task to create configurations 65 | #------- 66 | 67 | config_task = PythonOperator( 68 | task_id = 'Start', 69 | python_callable=config, 70 | op_kwargs={ 71 | 'training_job_name': f"XGBoost-training-{ts}", 72 | 's3_prefix': 'data-wrangler-pipeline', 73 | 'role_name': 'AmazonSageMaker-ExecutionRole-20201030T135016'}, 74 | provide_context=True, 75 | dag=dag 76 | ) 77 | 78 | #------- 79 | # Task with SageMakerDataWranglerOperator operator for Data Wrangler Processing Job. 
80 | #------- 81 | 82 | def datawrangler(**context): 83 | config = context['ti'].xcom_pull(task_ids='Start',key='return_value') 84 | preprocess_task = SageMakerDataWranglerOperator( 85 | task_id='DataWrangler_Processing_StepNew', 86 | dag=dag, 87 | flow_file_s3uri="$flow_uri", 88 | processing_instance_count=2, 89 | instance_type='ml.m5.4xlarge', 90 | aws_conn_id="aws_default", 91 | config= config["data_wrangler_config"] 92 | ) 93 | preprocess_task.execute(context) 94 | 95 | datawrangler_task = PythonOperator( 96 | task_id = 'SageMaker_DataWrangler_step', 97 | python_callable=datawrangler, 98 | provide_context=True, 99 | dag=dag 100 | ) 101 | 102 | #------- 103 | # Task with SageMaker training operator to train the xgboost model 104 | #------- 105 | 106 | def trainmodel(**context): 107 | config = context['ti'].xcom_pull(task_ids='Start',key='return_value') 108 | trainmodel_task = SageMakerTrainingOperator( 109 | task_id='Training_Step', 110 | config= config['train_config'], 111 | aws_conn_id='aws-sagemaker', 112 | wait_for_completion=True, 113 | check_interval=30 114 | ) 115 | trainmodel_task.execute(context) 116 | 117 | train_model_task = PythonOperator( 118 | task_id = 'SageMaker_training_step', 119 | python_callable=trainmodel, 120 | provide_context=True, 121 | dag=dag 122 | ) 123 | 124 | #------- 125 | # Task with SageMaker Model operator to create the xgboost model from artifacts 126 | #------- 127 | 128 | def createmodel(**context): 129 | config = context['ti'].xcom_pull(task_ids='Start',key='return_value') 130 | createmodel_task= SageMakerModelOperator( 131 | task_id='Create_Model', 132 | config= config['model_config'], 133 | aws_conn_id='aws-sagemaker', 134 | ) 135 | createmodel_task.execute(context) 136 | 137 | create_model_task = PythonOperator( 138 | task_id = 'SageMaker_create_model_step', 139 | python_callable=createmodel, 140 | provide_context=True, 141 | dag=dag 142 | ) 143 | 144 | #------ 145 | # Last step 146 | #------ 147 | end_task = DummyOperator(task_id='End', dag=dag) 148 | 149 | # Create task dependencies 150 | 151 | config_task >> datawrangler_task >> train_model_task >> create_model_task >> end_task 152 | -------------------------------------------------------------------------------- /3-apache-airflow-pipelines/scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | awswrangler 2 | pandas 3 | sagemaker==v2.59.5 4 | dag-factory==0.7.2 5 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 
5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Integrate SageMaker Data Wrangler into your MLOps workflows 2 | 3 | [![Latest Version](https://img.shields.io/github/tag/aws-samples/sm-data-wrangler-mlops-workflows)](https://github.com/aws-samples/sm-data-wrangler-mlops-workflows/releases) 4 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/aws-samples/sm-data-wrangler-mlops-workflows/blob/main/LICENSE) 5 | 6 | 7 |
8 |

9 | dw 10 |

11 |
12 | 13 | ## Get Started 14 | 15 | 1. Setup an [Amazon SageMaker Studio domain](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html). 16 | 2. Log-on to Amazon SageMaker Studio. Open a terminal from _File_ menu > _New_ > _Terminal_ 17 | 18 |
19 |

20 | sf 21 |

22 |
23 | 24 | 3. Clone this repository 25 | 26 | ```sh 27 | git clone https://github.com/aws-samples/sm-data-wrangler-mlops-workflows.git data-wrangler-pipelines 28 | ``` 29 | 30 | 4. Open the [00_setup_data_wrangler.ipynb](./00_setup_data_wrangler.ipynb) file and follow instructions in the notebook 31 | 32 | --- 33 | 34 | ## Setup end-to-end MLOps Pipelines 35 | 36 | - [Setup Amazon SageMaker Data Wrangler with SageMaker Pipelines](./1-sagemaker-pipelines/README.md) 37 | - [Setup Amazon SageMaker Data Wrangler with AWS Step Functions](./2-step-functions-pipelines/README.md) 38 | - [Setup Amazon SageMaker Data Wrangler with Amazon Managed Workflow for Apache Airflow](./3-apache-airflow-pipelines/README.md) 39 | 40 | ## Security 41 | 42 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 43 | 44 | ## License 45 | 46 | This library is licensed under the MIT-0 License. See the LICENSE file. -------------------------------------------------------------------------------- /images/dw-arch.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/images/dw-arch.jpg -------------------------------------------------------------------------------- /images/flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/images/flow.png -------------------------------------------------------------------------------- /images/sm-studio-terminal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/sm-data-wrangler-mlops-workflows/4e5188797e0bd6043a1cf460d75dd14da4826ad4/images/sm-studio-terminal.png --------------------------------------------------------------------------------