├── .gitignore
├── 00_setup_data_wrangler.ipynb
├── 1-sagemaker-pipelines
│   ├── 01_setup_sagemaker_pipeline.ipynb
│   ├── README.md
│   ├── flow-01-15-12-49-4bd733e0.flow
│   └── images
│       └── pipeline.png
├── 2-step-functions-pipelines
│   ├── 01_setup_step_functions_pipeline.ipynb
│   ├── README.md
│   ├── flow-01-15-12-49-4bd733e0.flow
│   └── step-function-workflow.png
├── 3-apache-airflow-pipelines
│   ├── 01_setup_mwaa_pipeline.ipynb
│   ├── README.md
│   ├── images
│   │   ├── delete_mwaa.png
│   │   ├── flow.png
│   │   ├── meaa_ui_home.png
│   │   ├── mwaa_dag.png
│   │   ├── mwaa_delete_dag.png
│   │   ├── mwaa_s3.png
│   │   ├── mwaa_trigger.png
│   │   └── mwaa_ui.png
│   └── scripts
│       ├── SMDataWranglerOperator.py
│       ├── config.py
│       ├── ml_pipeline.py
│       └── requirements.txt
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── data
│   ├── claims.csv
│   └── customers.csv
├── images
│   ├── dw-arch.jpg
│   ├── flow.png
│   └── sm-studio-terminal.png
└── insurance_claims_flow_template

/.gitignore:
--------------------------------------------------------------------------------
1 | # Flow files
2 | *.flow
3 | 
4 | # Jupyter Checkpoints
5 | .ipynb_checkpoints
6 | 
--------------------------------------------------------------------------------
/00_setup_data_wrangler.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Upload sample data and set up the SageMaker Data Wrangler data flow\n",
8 |     "\n",
9 |     "This notebook uploads the sample data files provided in the `./data` directory to the default Amazon SageMaker S3 bucket. You can also generate a new Data Wrangler `.flow` file using the provided template.\n",
10 |     "\n",
11 |     "---"
12 |    ]
13 |   },
14 |   {
15 |    "cell_type": "markdown",
16 |    "metadata": {},
17 |    "source": [
18 |     "Import required dependencies and initialize variables.\n"
19 |    ]
20 |   },
21 |   {
22 |    "cell_type": "code",
23 |    "execution_count": 2,
24 |    "metadata": {},
25 |    "outputs": [
26 |     {
27 |      "name": "stdout",
28 |      "output_type": "stream",
29 |      "text": [
30 |       "Using AWS Region: us-east-2\n"
31 |      ]
32 |     },
33 |     {
34 |      "data": {
35 |       "text/plain": [
36 |        "'sagemaker-us-east-2-716469146435'"
37 |       ]
38 |      },
39 |      "execution_count": 2,
40 |      "metadata": {},
41 |      "output_type": "execute_result"
42 |     }
43 |    ],
44 |    "source": [
45 |     "import json\n",
46 |     "import time\n",
47 |     "import boto3\n",
48 |     "import string\n",
49 |     "import sagemaker\n",
50 |     "\n",
51 |     "region = sagemaker.Session().boto_region_name\n",
52 |     "print(\"Using AWS Region: {}\".format(region))\n",
53 |     "\n",
54 |     "boto3.setup_default_session(region_name=region)\n",
55 |     "\n",
56 |     "s3_client = boto3.client('s3', region_name=region)\n",
57 |     "# SageMaker session\n",
58 |     "sess = sagemaker.Session()\n",
59 |     "\n",
60 |     "# You can configure this with your own bucket name, e.g.\n",
61 |     "# bucket = \"my-bucket\"\n",
62 |     "bucket = sess.default_bucket()\n",
63 |     "prefix = \"data-wrangler-pipeline\"\n",
64 |     "bucket"
65 |    ]
66 |   },
67 |   {
68 |    "cell_type": "markdown",
69 |    "metadata": {},
70 |    "source": [
71 |     "---\n",
72 |     "# Upload sample data to S3"
73 |    ]
74 |   },
75 |   {
76 |    "cell_type": "markdown",
77 |    "metadata": {},
78 |    "source": [
79 |     "We have provided two sample data files, `claims.csv` and `customers.csv`, in the `./data` directory. These contain synthetically generated insurance claim data, which we will use to train an XGBoost model. The purpose of the model is to identify whether an insurance claim is fraudulent or legitimate.\n",
80 |     "\n",
81 |     "To begin with, we will upload both files to the default SageMaker bucket."
82 |    ]
83 |   },
84 |   {
85 |    "cell_type": "code",
86 |    "execution_count": 3,
87 |    "metadata": {},
88 |    "outputs": [],
89 |    "source": [
90 |     "s3_client.upload_file(Filename='data/claims.csv', Bucket=bucket, Key=f'{prefix}/claims.csv')\n",
91 |     "s3_client.upload_file(Filename='data/customers.csv', Bucket=bucket, Key=f'{prefix}/customers.csv')"
92 |    ]
93 |   },
94 |   {
95 |    "cell_type": "markdown",
96 |    "metadata": {},
97 |    "source": [
98 |     "---\n",
99 |     "# Generate Data Wrangler `.flow` file"
100 |    ]
101 |   },
102 |   {
103 |    "cell_type": "markdown",
104 |    "metadata": {},
105 |    "source": [
106 |     "We have provided a convenient Data Wrangler flow file template named `insurance_claims_flow_template` that we can use to create the `.flow` file. This template applies a number of transformations to the features available in both the `claims.csv` and `customers.csv` files, and finally joins the two files to generate a single training CSV dataset.\n",
107 |     "\n",
108 |     "To create the `insurance_claims.flow` file, execute the code cell below."
109 |    ]
110 |   },
111 |   {
112 |    "cell_type": "code",
113 |    "execution_count": 5,
114 |    "metadata": {},
115 |    "outputs": [],
116 |    "source": [
117 |     "claims_flow_template_file = \"insurance_claims_flow_template\"\n",
118 |     "\n",
119 |     "# Updates the S3 bucket and prefix in the template\n",
120 |     "with open(claims_flow_template_file, 'r') as f:\n",
121 |     "    variables = {'bucket': bucket, 'prefix': prefix}\n",
122 |     "    template = string.Template(f.read())\n",
123 |     "    claims_flow = template.safe_substitute(variables)\n",
124 |     "    claims_flow = json.loads(claims_flow)\n",
125 |     "\n",
126 |     "# Creates the .flow file\n",
127 |     "with open('insurance_claims.flow', 'w') as f:\n",
128 |     "    json.dump(claims_flow, f)"
129 |    ]
130 |   },
131 |   {
132 |    "cell_type": "markdown",
133 |    "metadata": {},
134 |    "source": [
135 |     "Open the `insurance_claims.flow` file in SageMaker Studio.\n",
[... remainder of the notebook truncated ...]
--------------------------------------------------------------------------------
/1-sagemaker-pipelines/README.md:
--------------------------------------------------------------------------------
[... lines 1-20 truncated ...]
21 | ### In the setup notebook (00_setup_data_wrangler.ipynb)
22 | 1. Generate a flow file from Data Wrangler, or use the setup script to generate one from a preconfigured template.
23 | 2. Create an Amazon S3 bucket and upload your flow file and input files to the bucket. In our sample notebook, we use SageMaker’s default S3 bucket.
24 |
25 | ### In the SageMaker Pipelines notebook (01_setup_sagemaker_pipeline.ipynb)
26 | 3. Follow the instructions in the 01_setup_sagemaker_pipeline.ipynb notebook to create a `Processor` object based on the Data Wrangler flow file, and an `Estimator` object with the parameters of the training job.
27 |    * In our example, since we only use SageMaker features and SageMaker’s default S3 bucket, we can use SageMaker Studio’s default execution role. The same IAM role will be assumed by the pipeline run, the processing job and the training job. You can further restrict the execution role to follow the principle of least privilege.
28 | 4. Continue with the instructions to create a pipeline with steps referencing the `Processor` and `Estimator` objects, and then execute a pipeline run. The processing and training jobs will run on SageMaker managed environments and will take a few minutes to complete. A minimal sketch of this wiring is shown after this list.
29 | 5. In SageMaker Studio, you can see the pipeline details and monitor the pipeline execution. You can also monitor the underlying processing and training jobs from the Amazon SageMaker Console and from Amazon CloudWatch.
30 |
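The following is a minimal, hypothetical sketch of the wiring that steps 3 and 4 describe, not the notebook's exact code; the Data Wrangler image URI, S3 locations, instance types, step names, and hyperparameters are placeholder assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

sess = sagemaker.Session()
role = sagemaker.get_execution_role()  # SageMaker Studio default execution role

# Placeholders (assumptions, not values from this repository)
dw_image_uri = "<region-specific Data Wrangler container image URI>"
flow_s3_uri = f"s3://{sess.default_bucket()}/data-wrangler-pipeline/insurance_claims.flow"
output_s3_uri = f"s3://{sess.default_bucket()}/data-wrangler-pipeline/output"

# Processing step that runs the Data Wrangler container against the exported .flow file
dw_processor = Processor(
    role=role,
    image_uri=dw_image_uri,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
)
processing_step = ProcessingStep(
    name="DataWranglerProcessingStep",
    processor=dw_processor,
    inputs=[ProcessingInput(source=flow_s3_uri, destination="/opt/ml/processing/flow")],
    outputs=[ProcessingOutput(output_name="default",
                              source="/opt/ml/processing/output",
                              destination=output_s3_uri)],
)

# Training step that trains an XGBoost model on the processed output
xgb_estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", sess.boto_region_name, version="1.2-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"objective": "binary:logistic", "num_round": "100"},
)
training_step = TrainingStep(
    name="XGBoostTrainingStep",
    estimator=xgb_estimator,
    inputs={"train": TrainingInput(
        s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["default"].S3Output.S3Uri,
        content_type="text/csv")},
)

# Create (or update) the pipeline and start a run
pipeline = Pipeline(name="DataWranglerPipeline", steps=[processing_step, training_step])
pipeline.upsert(role_arn=role)
execution = pipeline.start()
```

The key design point is that the `TrainingStep` consumes the `ProcessingStep` output via step properties, so SageMaker Pipelines infers the dependency between the two steps automatically.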
31 | ### Cleaning Up
32 |
33 | Follow the instructions under **Part 3: Cleanup** in the SageMaker Pipelines notebook (01_setup_sagemaker_pipeline.ipynb) to delete the Pipeline, the Model and the Experiment created during this sample.
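As an illustration only, the cleanup boils down to a few delete calls; the resource names below are assumptions and should be replaced with the names created by your run.

```python
import boto3

sm = boto3.client("sagemaker")

# Assumed resource names -- replace with those created by your pipeline run
sm.delete_pipeline(PipelineName="DataWranglerPipeline")
sm.delete_model(ModelName="data-wrangler-xgboost-model")
# Experiment and trial entries can be removed from SageMaker Studio or with the
# sagemaker-experiments SDK; the notebook's Part 3 walks through the exact steps.
```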
--------------------------------------------------------------------------------
/1-sagemaker-pipelines/flow-01-15-12-49-4bd733e0.flow:
--------------------------------------------------------------------------------
1 | {"metadata": {"version": 1, "disable_limits": false}, "nodes": [{"node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "claims.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/claims.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "ed6ddbad-83d4-4685-8e3b-6accf2115180", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "driver_relationship": "string", "incident_type": "string", "collision_type": "string", "incident_severity": "string", "authorities_contacted": "string", "num_vehicles_involved": "long", "num_injuries": "long", "num_witnesses": "long", "police_report_available": "string", "injury_claim": "long", "vehicle_claim": "float", "total_claim_amount": "float", "incident_month": "long", "incident_day": "long", "incident_dow": "long", "incident_hour": "long", "fraud": "long"}}, "inputs": [{"name": "default", "node_id": "4c1ac097-79d5-434a-a82f-dcce6051dfa1", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": {"dataset_definition": {"__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "customers.csv", "description": null, "s3ExecutionContext": {"__typename": "S3ExecutionContext", "s3Uri": "s3://sagemaker-us-east-2-716469146435/data-wrangler-pipeline/customers.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false}}}, "inputs": [], "outputs": [{"name": "default", "sampling": {"sampling_method": "sample_by_limit", "limit_rows": 50000}}]}, {"node_id": "a2370312-2ab4-43a3-ae7d-ba5a17057d0a", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": {"schema": {"policy_id": "long", "customer_age": "long", "months_as_customer": "long", "num_claims_past_year": "long", "num_insurers_past_5_years": "long", "policy_state": "string", "policy_deductable": "long", "policy_annual_premium": "long", "policy_liability": "string", "customer_zip": "long", "customer_gender": "string", "customer_education": "string", "auto_year": "long"}}, "inputs": [{"name": "default", "node_id": "e1b6dbcf-67bd-4cac-ac8a-4befea3c30c9", "output_name": "default"}], "outputs": [{"name": "default"}]}, {"node_id": "4d5a7942-94eb-4680-8a4a-a5c128fa2894", "type": "TRANSFORM", "operator": "sagemaker.spark.encode_categorical_0.1", "parameters": {"operator": "Ordinal encode", "ordinal_encode_parameters": {"invalid_handling_strategy": "Replace with NaN", "input_column": "police_report_available"}}, "trained_parameters": {"ordinal_encode_parameters": {"_hash": -7262088998495137000, "string_indexer_model": 
"P)h>@6aWAK2ms2BZ&R6WWXb6O004CX000vJ003=ebYWy+bYU-WVRCdWFfcGMFm;a0PQx%1ME87#*QpfZ>-E9Klc>1!l_hkFRF(T~rJ~Z~*Ze#R{R!q1NmgLF9vj4(_PZ*d@Isl&C~)Us5ROJ(E2Ex-g8wXm>k8LD=&__+|awho(pIUj>5Qp!>_~{`7XYKd$);Bj*{QcJ-EqAnaq@=nKlG%q>qJe9$YQP1)3VGouMj0N#J4FqewD Z>-E9Klc>1!l_hkFRF(T~rJ~Z~*Ze#R{R!q1NmgLF9vj4(_PZ*d@Isl&C~)Us5ROJ(E2Ex-g8wXm>k8LD=&__+|awho(pIUj>5Qp!>_~{`7XYKd$);Bj*{QcJ-EqAnaq@=nKlG%q>qJe9$YQP1)3VGouMj0N#J4FqewD
11 |
36 |
44 |
52 |
60 |
78 |
86 |
9 |
20 | ng_=(^-q0l1+T1p{tvdDRn`{{o;0;(Wo=GoN|CYKKrH?N)~;ZM5=)>@9xb?8i^cf@;Nz9Sca@G;7l~7vCNz!ridm_6oX8QEgNzF)i{Vi64|162InT3vlnw{)vn-K=yx^n#FpAv8-@3D(y8m|%P)h>@6aWAK2mmdyZ&UDH7Euxa000mG002z@003lRbYU+paA9(EEif=JFfc7MH85p1V=^%
+^13#Yv*JNoXx~CRP1@9jbpRlE1JOTH%TiVjdDsv9ptBM{vyB$=v+v}0(g&De4|`AEbEC7e`-@gTfPG-htn%bxA0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjMLNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&A0!yc`2R9g%k9p@yOg{Bw3$XXU0RvLZ%9-bIbY{kE{~qtn_USK+GAROks?eKzgy-`8-=xK#~UX!@6>!sO<4rYt~mnTR9C`1(6tNfWlxIvJXwrcsjkDB6_$`yj`01V#RjMLNDmBP5mH$&&nx=W0q`f%n{!Nldb<;uK+wy||UhuVn&ng_=(^-q0l1+T1p{tvdDRn`{{o;0;(Wo=GoN|CYKKrH?N)~;ZM5=)>@9xb?8i^cf@;Nz9Sca@G;7l~7vCNz!ridm_6oX8QEgNzF)i{Vi64|162InT3vlnw{)vn-K=yx^n#FpAv8-@3D(y8m|%P)h>@6aWAK2mmdyZ&UDH7Euxa000mG002z@003lRbYU+paA9(EEif=JFfc7MH85p1V=^%
+^13#Yv*JNoXx~CRP1@9jbpRlE1JOTH%TiVjdDsv9ptBM{vyB$=v+v}0(g&De4|`AEbEC7e`-@gTfPG-htn%bx00_setup_data_wrangler.ipynb
Notebook\n",
14 | "2. Created an Amazon Managed Workflow for Apache Airflow (MWAA) environment. Please visit the Amazon MWAA Get started documentation to see how you can create an MWAA environment. Alternatively, to quickly get started with MWAA, follow the step-by-step instructions in the MWAA workshop to setup an MWAA environment.\n",
15 | "\n",
16 | "bucket
name with your MWAA Bucket name.\n",
35 | "bucket
with the SageMaker Default bucket name for your SageMaker studio domain.\n",
253 | "\n",
637 | "\n",
638 | "\n",
639 | "You can run the DAG by clicking on the \"Play\" button, alternatively you can -\n",
640 | "\n",
641 | "1. Setup the DAG to run on a set schedule automatically using cron expressions\n",
642 | "2. Setup the DAG to run based on S3 sensors such that the pipeline/workflow would execute whenever a new file arrives in a bucket/prefix.\n",
643 | "\n",
644 | "
\n"
645 | ]
646 | },
647 | {
648 | "cell_type": "markdown",
649 | "metadata": {},
650 | "source": [
651 | "### Clean Up\n",
652 | "\n",
653 | "1. Delete the MWAA Environment from the Amazon MWAA Console.\n",
654 | "2. Delete the MWAA S3 files.\n",
655 | "3. Delete the Model training data and model artifact files from S3.\n",
656 | "4. Delete the SageMaker Model."
657 | ]
658 | },
659 | {
660 | "cell_type": "markdown",
661 | "metadata": {},
662 | "source": [
663 | "---\n",
664 | "# Conclusion\n",
665 | "\n",
666 | "We created an ML Pipeline with Apache Airflow and used the Data Wrangler script to pre-process and generate new training data for our model training and subsequently created a new model in Amazon SageMaker.\n"
667 | ]
668 | },
669 | {
670 | "cell_type": "code",
671 | "execution_count": null,
672 | "metadata": {},
673 | "outputs": [],
674 | "source": []
675 | }
676 | ],
677 | "metadata": {
678 | "instance_type": "ml.t3.medium",
679 | "kernelspec": {
680 | "display_name": "Python 3 (Data Science)",
681 | "language": "python",
682 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0"
683 | },
684 | "language_info": {
685 | "codemirror_mode": {
686 | "name": "ipython",
687 | "version": 3
688 | },
689 | "file_extension": ".py",
690 | "mimetype": "text/x-python",
691 | "name": "python",
692 | "nbconvert_exporter": "python",
693 | "pygments_lexer": "ipython3",
694 | "version": "3.7.10"
695 | }
696 | },
697 | "nbformat": 4,
698 | "nbformat_minor": 4
699 | }
700 |
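The notebook above notes that, besides the manual "Play" trigger, the DAG can run on a cron schedule or react to new files via S3 sensors. Below is a minimal, hypothetical Airflow 2.x sketch of those two options; it is not the repository's `ml_pipeline.py`, and the DAG id, bucket, key, and task names are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor  # import path varies by provider version


def start_pipeline(**context):
    # Placeholder for the SageMaker processing/training calls the notebook builds.
    print("kick off Data Wrangler processing and model training")


with DAG(
    dag_id="data_wrangler_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 6 * * *",   # cron expression: run daily at 06:00 UTC
    catchup=False,
) as dag:
    # Wait until a new claims file lands in the bucket/prefix
    wait_for_claims = S3KeySensor(
        task_id="wait_for_claims_csv",
        bucket_name="my-sagemaker-bucket",                 # assumption: your bucket
        bucket_key="data-wrangler-pipeline/claims.csv",    # assumption: your prefix/key
        poke_interval=300,
    )
    run_pipeline = PythonOperator(task_id="run_ml_pipeline", python_callable=start_pipeline)

    wait_for_claims >> run_pipeline
```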
--------------------------------------------------------------------------------
/3-apache-airflow-pipelines/README.md:
--------------------------------------------------------------------------------
1 | ## SageMaker Data Wrangler with Apache Airflow (Amazon MWAA)
2 |
3 | ### Getting Started
4 |
5 | #### Setup Amazon MWAA Environment
6 |
7 | 1. Create an [Amazon S3 bucket](https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-s3-bucket.html) and the subsequent folders required by Amazon MWAA. The S3 bucket and folders are shown below. These folders are used by Amazon MWAA to look for the dependent Python scripts that are required for the workflow (a hypothetical staging sketch appears at the end of this file).
8 |
[... remaining README content truncated ...]
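Step 1 of this README stages the workflow code and dependencies in the MWAA bucket. A minimal sketch of that staging with boto3, assuming the repository's `scripts/` files map into the MWAA `dags/` folder and a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")
mwaa_bucket = "my-mwaa-environment-bucket"  # assumption: replace with your MWAA bucket

# MWAA reads DAG code from the dags/ folder and Python dependencies from requirements.txt
uploads = {
    "scripts/ml_pipeline.py": "dags/ml_pipeline.py",
    "scripts/config.py": "dags/config.py",
    "scripts/SMDataWranglerOperator.py": "dags/SMDataWranglerOperator.py",
    "scripts/requirements.txt": "requirements.txt",
}
for local_path, key in uploads.items():
    s3.upload_file(Filename=local_path, Bucket=mwaa_bucket, Key=key)
```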