├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── container
│   ├── processing
│   │   └── Dockerfile
│   └── train_and_serve
│       ├── Dockerfile
│       ├── README.md
│       └── catboost_regressor
│           ├── nginx.conf
│           ├── predictor.py
│           ├── serve
│           ├── train
│           └── wsgi.py
├── dvc_sagemaker_byoc.ipynb
├── dvc_sagemaker_script_mode.ipynb
├── img
│   ├── high-level-architecture.png
│   ├── sm-experiments-tracker-dvc.png
│   └── studio-custom-image-select.png
├── sagemaker-studio-dvc-image
│   ├── Dockerfile
│   ├── README.md
│   ├── cdk
│   │   ├── .gitignore
│   │   ├── app.py
│   │   ├── cdk.json
│   │   ├── requirements.txt
│   │   ├── sagemakerStudioCDK
│   │   │   ├── __init__.py
│   │   │   └── sagemaker_studio_stack.py
│   │   └── setup.py
│   ├── environment.yml
│   ├── resize-cloud9.sh
│   ├── update-domain-input.json
│   └── update-domain-no-custom-images.json
└── source_dir
    ├── preprocessing-experiment-multifiles.py
    ├── preprocessing-experiment.py
    ├── requirements.txt
    └── train.py

/.gitignore:
--------------------------------------------------------------------------------
1 | **/*.ipynb_checkpoints/
2 | sagemaker-dvc-sample
3 | .idea/amazon-sagemaker-experiments-dvc-demo.iml
4 | .idea/inspectionProfiles/profiles_settings.xml
5 | .idea/misc.xml
6 | .idea/modules.xml
7 | .idea/vcs.xml
8 | .idea/workspace.xml
9 | 
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 | 
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 | 
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 | 
9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SageMaker Experiments and DVC 2 | 3 | This sample shows how to use DVC within the SageMaker environment. 4 | In particular, we will look at how to build a custom image with DVC libraries installed by default to provide a consistent environment to your data scientists. 
5 | Furthermore, we show how you can integrate SageMaker Processing, SageMaker Training, and SageMaker Experiments with a DVC workflow.
6 | 
7 | For full details on how this works:
8 | 
9 | - Read the Blog post at: https://aws.amazon.com/blogs/machine-learning/track-your-ml-experiments-end-to-end-with-data-version-control-and-amazon-sagemaker-experiments/
10 | 
11 | ## Prerequisites
12 | 
13 | * An AWS Account
14 | * An IAM user with Admin-like permissions
15 | 
16 | If you do not have Admin-like permissions, we recommend at least the following permissions:
17 | * Administer Amazon ECR
18 | * Administer a SageMaker Studio Domain
19 | * Administer S3 (or at least any buckets with *sagemaker* in the bucket name)
20 | * Create IAM Roles
21 | * Create a Cloud9 environment
22 | 
23 | ## Setup
24 | 
25 | For the initial setup, we suggest using Cloud9 on a `t3.large` instance type.
26 | 
27 | ## Build a custom SageMaker Studio image with DVC already installed
28 | 
29 | We explain how to create a custom image for Amazon SageMaker Studio that has DVC already installed.
30 | The advantage of creating an image and making it available to all SageMaker Studio users is that it provides a consistent environment for SageMaker Studio users, which they can also run locally.
31 | 
32 | This tutorial is heavily inspired by [this example](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/tree/main/examples/conda-env-kernel-image).
33 | Further information about custom images for SageMaker Studio can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi.html).
34 | 
35 | ### Overview
36 | 
37 | This custom image sample demonstrates how to create a custom Conda environment in a Docker image and use it as a custom kernel in SageMaker Studio.
38 | 
39 | The Conda environment must have the appropriate kernel package installed, e.g., `ipykernel` for a Python kernel.
40 | This example creates a Conda environment called `dvc` with a few Python packages (see [environment.yml](environment.yml)) and the `ipykernel`.
41 | SageMaker Studio will automatically recognize this Conda environment as a kernel named `conda-env-dvc-py`.
42 | 
43 | #### Clone the GitHub repository
44 | ```bash
45 | git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo
46 | ```
47 | 
48 | #### Resize Cloud9
49 | 
50 | ```bash
51 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/
52 | ./resize-cloud9.sh 20
53 | ```
54 | 
55 | ### Build the Docker images for SageMaker Studio
56 | 
57 | Set some basic environment variables:
58 | 
59 | ```bash
60 | sudo yum install jq -y
61 | export REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')
62 | echo "export REGION=${REGION}" | tee -a ~/.bash_profile
63 | 
64 | export ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
65 | echo "export ACCOUNT_ID=${ACCOUNT_ID}" | tee -a ~/.bash_profile
66 | 
67 | export IMAGE_NAME=conda-env-dvc-kernel
68 | echo "export IMAGE_NAME=${IMAGE_NAME}" | tee -a ~/.bash_profile
69 | ```
70 | 
71 | Build the Docker image and push it to Amazon ECR.
72 | 
73 | ```bash
74 | # Login to ECR
75 | aws --region ${REGION} ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom
76 | 
77 | # Create the ECR repository
78 | aws --region ${REGION} ecr create-repository --repository-name smstudio-custom
79 | 
80 | # Build the image - it might take a few minutes to complete this step
81 | docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}
82 | # Push the image to ECR
83 | docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}
84 | ```
85 | 
86 | ### Associate the custom image with SageMaker Studio
87 | 
88 | #### Prepare the environment to deploy with CDK
89 | 
90 | Step 1: Navigate to the `cdk` directory:
91 | 
92 | ```bash
93 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/cdk
94 | ```
95 | 
96 | Step 2: Create a virtual environment:
97 | 
98 | ```bash
99 | python3 -m venv .cdk-venv
100 | ```
101 | 
102 | Step 3: Once the init process completes and the virtual environment is created, activate it:
103 | 
104 | ```bash
105 | source .cdk-venv/bin/activate
106 | ```
107 | 
108 | Step 4: Install the required dependencies:
109 | 
110 | ```bash
111 | pip3 install --upgrade pip
112 | pip3 install -r requirements.txt
113 | ```
114 | 
115 | Step 5: Install and bootstrap CDK v2 (the latest CDK version tested was `2.27.0`):
116 | 
117 | ```bash
118 | npm install -g aws-cdk@2.27.0 --force
119 | cdk bootstrap
120 | ```
121 | 
122 | #### Create a new SageMaker Studio domain
123 | (Skip to [Update an existing SageMaker Studio](#update-an-existing-sagemaker-studio) if you already have a SageMaker Studio domain.)
124 | 
125 | Step 6: Deploy the CDK stack (CDK deploys a stack named `sagemakerStudioCDK`, which you can verify in `CloudFormation`):
126 | 
127 | ```bash
128 | cdk deploy --require-approval never
129 | ```
130 | 
131 | CDK creates the following resources via `CloudFormation`:
132 | * provisions a new SageMaker Studio domain
133 | * creates and attaches a SageMaker execution role, i.e., `RoleForSagemakerStudioUsers`, with the right permissions to the SageMaker Studio domain
134 | * creates a SageMaker Image and a SageMaker Image Version from the Docker image `conda-env-dvc-kernel` we created earlier
135 | * creates an AppImageConfig that specifies how the kernel gateway should be configured
136 | * provisions a SageMaker Studio user, i.e., `data-scientist-dvc`, with the correct SageMaker execution role, and makes the custom SageMaker Studio image available to it
137 | 
138 | #### Update an existing SageMaker Studio
139 | 
140 | If you have an existing SageMaker Studio environment, we first need to retrieve the existing SageMaker Studio domain ID, deploy a "reduced" version of the CDK stack, and update the SageMaker Studio domain configuration.
141 | 
142 | Step 6: Set the `DOMAIN_ID` environment variable with your domain ID and save it to your `bash_profile`:
143 | 
144 | ```bash
145 | export DOMAIN_ID=$(aws sagemaker list-domains | jq -r '.Domains[0].DomainId')
146 | echo "export DOMAIN_ID=${DOMAIN_ID}" | tee -a ~/.bash_profile
147 | ```
148 | 
149 | Step 7: Deploy the CDK stack (because the `DOMAIN_ID` environment variable is set, CDK deploys a stack named `sagemakerUserCDK`, which you can verify in `CloudFormation`):
150 | 
151 | ```bash
152 | cdk deploy --require-approval never
153 | ```
154 | 
155 | CDK creates the following resources via `CloudFormation`:
156 | * creates and attaches a SageMaker execution role, i.e., `RoleForSagemakerStudioUsers`, with the right permissions to your existing SageMaker Studio domain
157 | * creates a SageMaker Image and a SageMaker Image Version from the Docker image `conda-env-dvc-kernel` we created earlier
158 | * creates an AppImageConfig that specifies how the kernel gateway should be configured
159 | * provisions a SageMaker Studio user, i.e., `data-scientist-dvc`, with the correct SageMaker execution role, and makes the custom SageMaker Studio image available to it
160 | 
161 | Step 8: Update the SageMaker Studio domain configuration:
162 | 
163 | ```bash
164 | # inject your DOMAIN_ID into the configuration file
165 | sed -i 's//'"$DOMAIN_ID"'/' ../update-domain-input.json
166 | 
167 | # update the sagemaker studio domain
168 | aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-input.json
169 | ```
170 | 
171 | Open the newly created SageMaker Studio user, i.e., `data-scientist-dvc`.
172 | 
173 | ![image info](./img/studio-custom-image-select.png)
174 | 
175 | 
176 | ### Execute the sample notebook
177 | 
178 | In the SageMaker Studio domain, launch `Studio` for the `data-scientist-dvc` user.
179 | Open a terminal and clone the repository:
180 | 
181 | ```bash
182 | git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo
183 | ```
184 | 
185 | and open the [dvc_sagemaker_script_mode.ipynb](./dvc_sagemaker_script_mode.ipynb) notebook.
186 | 
187 | When prompted, make sure you select the custom image `conda-env-dvc-kernel`, as shown in the screenshot above.
188 | 
189 | We provide two sample notebooks that show how to use DVC in combination with SageMaker:
190 | 
191 | * one that installs DVC in script mode by passing a `requirements.txt` file to both the processing job and the training job: [dvc_sagemaker_script_mode.ipynb](./dvc_sagemaker_script_mode.ipynb);
192 | * one that shows how to build the containers for the processing jobs, the training jobs, and inference: [dvc_sagemaker_byoc.ipynb](./dvc_sagemaker_byoc.ipynb).
193 | 
194 | Both notebooks are meant to be used within SageMaker Studio with the custom image created before.
195 | 
196 | ## Cleanup
197 | 
198 | Before removing all the resources created, you need to make sure that all apps are deleted from the `data-scientist-dvc` user, i.e., all `KernelGateway` apps, as well as the default `JupyterServer` app.
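If you prefer to clean up the apps programmatically instead of through the SageMaker console, a minimal `boto3` sketch along these lines can do it (it assumes `DOMAIN_ID` is still exported, as in the steps above, and targets the `data-scientist-dvc` user profile created by the stack):

```python
# Minimal sketch: delete all running apps for the user before `cdk destroy`.
import os
import boto3

sm = boto3.client("sagemaker")
domain_id = os.environ["DOMAIN_ID"]
user_profile_name = "data-scientist-dvc"

apps = sm.list_apps(DomainIdEquals=domain_id, UserProfileNameEquals=user_profile_name)["Apps"]
for app in apps:
    if app["Status"] not in ("Deleted", "Deleting"):
        print(f"Deleting {app['AppType']} app {app['AppName']}")
        sm.delete_app(
            DomainId=domain_id,
            UserProfileName=user_profile_name,
            AppType=app["AppType"],
            AppName=app["AppName"],
        )
```

Deletion is asynchronous, so wait until `list_apps` reports every app as `Deleted` before destroying the stack.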
199 | 200 | Once done, you can destroy the CDK stack by running 201 | 202 | ```bash 203 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/cdk 204 | cdk destroy 205 | ``` 206 | 207 | In case you started off from an existing domain, please also execute the following command: 208 | 209 | ```bash 210 | # inject your DOMAIN_ID into the configuration file 211 | sed -i 's//'"$DOMAIN_ID"'/' ../update-domain-no-custom-images.json 212 | 213 | # update the sagemaker studio domain 214 | aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-no-custom-images.json 215 | ``` 216 | 217 | ## Security 218 | 219 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 220 | 221 | ## License 222 | 223 | This library is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file. -------------------------------------------------------------------------------- /container/processing/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM public.ecr.aws/docker/library/python:3.7-slim 2 | 3 | RUN apt-get -y update && apt-get install -y --no-install-recommends wget git 4 | 5 | RUN pip3 install numpy pandas scikit-learn==1.0.2 6 | RUN pip3 install sagemaker-experiments==0.1.35 7 | RUN pip3 install git-remote-codecommit 8 | RUN pip3 install dvc==2.8.3 s3fs==2021.11.0 dvc[s3]==2.8.3 9 | 10 | # Configure git 11 | 12 | RUN git config --global user.email "sagemaker-processing@example.com" 13 | RUN git config --global user.name "SageMaker ProcessingJob User" 14 | 15 | ENV PYTHONUNBUFFERED=TRUE 16 | 17 | ENTRYPOINT ["python3"] 18 | -------------------------------------------------------------------------------- /container/train_and_serve/Dockerfile: -------------------------------------------------------------------------------- 1 | # Build an image that can do training and inference in SageMaker 2 | # This is a Python 3 image that uses the nginx, gunicorn, flask stack 3 | # for serving inferences in a stable way. 4 | 5 | FROM public.ecr.aws/docker/library/python:3.7-slim 6 | 7 | RUN apt-get -y update && apt-get install -y --no-install-recommends \ 8 | wget \ 9 | nginx \ 10 | git \ 11 | ca-certificates 12 | 13 | RUN pip install numpy==1.16.2 scipy==1.2.1 catboost pandas flask gevent gunicorn 14 | RUN pip install dvc==2.8.3 s3fs==2021.11.0 dvc[s3]==2.8.3 15 | RUN pip install git-remote-codecommit 16 | 17 | # Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard 18 | # output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE 19 | # keeps Python from writing the .pyc files which are unnecessary in this case. We also update 20 | # PATH so that the train and serve programs are found when the container is invoked. 21 | 22 | ENV PYTHONUNBUFFERED=TRUE 23 | ENV PYTHONDONTWRITEBYTECODE=TRUE 24 | ENV PATH="/opt/program:${PATH}" 25 | 26 | # Set up the program in the image 27 | COPY catboost_regressor /opt/program 28 | WORKDIR /opt/program 29 | 30 | -------------------------------------------------------------------------------- /container/train_and_serve/README.md: -------------------------------------------------------------------------------- 1 | # Bring-your-own Algorithm Sample 2 | 3 | This example shows how to package an algorithm for use with SageMaker. We have chosen a simple [CatBoost][catboost] implementation of regression to illustrate the procedure. 
4 | 
5 | SageMaker supports two execution modes: _training_, where the algorithm uses input data to train a new model, and _serving_, where the algorithm accepts HTTP requests and uses the previously trained model to do an inference (also called "scoring", "prediction", or "transformation").
6 | 
7 | The algorithm that we have built here supports both training and scoring in SageMaker with the same container image. It is perfectly reasonable to build an algorithm that supports only training _or_ scoring, as well as to build an algorithm that has separate container images for training and scoring.
8 | 
9 | In order to build a production grade inference server into the container, we use the following stack to make the implementer's job simple:
10 | 
11 | 1. __[nginx][nginx]__ is a light-weight layer that handles the incoming HTTP requests and manages the I/O in and out of the container efficiently.
12 | 2. __[gunicorn][gunicorn]__ is a WSGI pre-forking worker server that runs multiple copies of your application and load balances between them.
13 | 3. __[flask][flask]__ is a simple web framework used in the inference app that you write. It lets you respond to calls on the `/ping` and `/invocations` endpoints without having to write much code.
14 | 
15 | ## The Structure of the Sample Code
16 | 
17 | The components are as follows:
18 | 
19 | * __Dockerfile__: The _Dockerfile_ describes how the image is built and what it contains. It is a recipe for your container and gives you tremendous flexibility to construct almost any execution environment you can imagine. Here, we use the Dockerfile to describe a pretty standard Python science stack and the simple scripts that we're going to add to it. See the [Dockerfile reference][dockerfile] for what's possible here.
20 | 
21 | * __catboost_regressor__: The directory that contains the application to run in the container. See the next section for details about each of the files.
22 | 
23 | ### The application that runs inside the container
24 | 
25 | When SageMaker starts a container, it will invoke the container with an argument of either __train__ or __serve__. We have set this container up so that the argument is treated as the command that the container executes. When training, it will run the __train__ program included and, when serving, it will run the __serve__ program. A short sketch of driving both modes with the SageMaker Python SDK follows the file list below.
26 | 
27 | * __train__: The main program for training the model. When you build your own algorithm, you'll edit this to include your training code.
28 | * __serve__: The wrapper that starts the inference server. In most cases, you can use this file as-is.
29 | * __wsgi.py__: The start-up shell for the individual server workers. This only needs to be changed if you change where predictor.py is located or what it is named.
30 | * __predictor.py__: The algorithm-specific inference server. This is the file that you modify with your own algorithm's code.
31 | * __nginx.conf__: The configuration for the nginx master server that manages the multiple workers.
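To make the train/serve contract concrete, here is a hypothetical sketch of driving both modes of this image with the SageMaker Python SDK. The image URI, instance types, hyperparameter values, and the `DVC_REPO_URL`/`DVC_BRANCH` values are placeholders, not the exact configuration used by the sample notebooks:

```python
# Hypothetical sketch - all values below are placeholders for illustration.
import sagemaker
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()
image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/sagemaker-catboost-dvc:latest"

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    hyperparameters={"learning_rate": "1", "depth": "6"},
    environment={
        "DVC_REPO_URL": "<codecommit-repo-url>",  # read by the train program
        "DVC_BRANCH": "<dvc-branch>",             # read by the train program
    },
)

# SageMaker runs the image with the `train` argument here ...
estimator.fit()

# ... and with the `serve` argument when the model is deployed behind an endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```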
32 | 33 | [catboost]: https://catboost.ai/ "CatBoost Home Page" 34 | [dockerfile]: https://docs.docker.com/engine/reference/builder/ "The official Dockerfile reference guide" 35 | [ecr]: https://aws.amazon.com/ecr/ "ECR Home Page" 36 | [nginx]: http://nginx.org/ 37 | [gunicorn]: http://gunicorn.org/ 38 | [flask]: http://flask.pocoo.org/ 39 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/nginx.conf: -------------------------------------------------------------------------------- 1 | worker_processes 1; 2 | daemon off; # Prevent forking 3 | 4 | 5 | pid /tmp/nginx.pid; 6 | error_log /var/log/nginx/error.log; 7 | 8 | events { 9 | # defaults 10 | } 11 | 12 | http { 13 | include /etc/nginx/mime.types; 14 | default_type application/octet-stream; 15 | access_log /var/log/nginx/access.log combined; 16 | 17 | upstream gunicorn { 18 | server unix:/tmp/gunicorn.sock; 19 | } 20 | 21 | server { 22 | listen 8080 deferred; 23 | client_max_body_size 5m; 24 | 25 | keepalive_timeout 5; 26 | proxy_read_timeout 1200s; 27 | 28 | location ~ ^/(ping|invocations) { 29 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; 30 | proxy_set_header Host $http_host; 31 | proxy_redirect off; 32 | proxy_pass http://gunicorn; 33 | } 34 | 35 | location / { 36 | return 404 "{}"; 37 | } 38 | } 39 | } 40 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/predictor.py: -------------------------------------------------------------------------------- 1 | # This is the file that implements a flask server to do inferences. It's the file that you will modify to 2 | # implement the scoring for your own algorithm. 3 | 4 | from __future__ import print_function 5 | 6 | import os 7 | import json 8 | import pickle 9 | import sys 10 | import signal 11 | import traceback 12 | import flask 13 | import pandas as pd 14 | from catboost import CatBoostRegressor 15 | from io import StringIO 16 | 17 | 18 | prefix = '/opt/ml/' 19 | model_path = os.path.join(prefix, 'model') 20 | 21 | # A singleton for holding the model. This simply loads the model and holds it. 22 | # It has a predict function that does a prediction based on the model and the input data. 23 | 24 | class ScoringService(object): 25 | model = None # Where we keep the model when it's loaded 26 | 27 | @classmethod 28 | def get_model(cls): 29 | """Get the model object for this instance, loading it if it's not already loaded.""" 30 | if cls.model == None: 31 | cls.model = CatBoostRegressor() 32 | cls.model.load_model(os.path.join(model_path, 'catboost-regressor-model.dump')) 33 | return cls.model 34 | 35 | @classmethod 36 | def predict(cls, input): 37 | """For the input, do the predictions and return them. 38 | 39 | Args: 40 | input (a pandas dataframe): The data on which to do the predictions. There will be 41 | one prediction per row in the dataframe""" 42 | clf = cls.get_model() 43 | return clf.predict(input) 44 | 45 | # The flask app for serving predictions 46 | app = flask.Flask(__name__) 47 | 48 | @app.route('/ping', methods=['GET']) 49 | def ping(): 50 | """Determine if the container is working and healthy. 
In this sample container, we declare 51 | it healthy if we can load the model successfully.""" 52 | health = ScoringService.get_model() is not None # You can insert a health check here 53 | 54 | status = 200 if health else 404 55 | return flask.Response(response='\n', status=status, mimetype='application/json') 56 | 57 | @app.route('/invocations', methods=['POST']) 58 | def transformation(): 59 | """Do an inference on a single batch of data. In this sample server, we take data as CSV, convert 60 | it to a pandas data frame for internal use and then convert the predictions back to CSV (which really 61 | just means one prediction per line, since there's a single column. 62 | """ 63 | data = None 64 | 65 | # Convert from CSV to pandas 66 | if flask.request.content_type == 'text/csv': 67 | data = flask.request.data.decode('utf-8') 68 | s = StringIO(data) 69 | data = pd.read_csv(s, header=None) 70 | else: 71 | return flask.Response(response='This predictor only supports CSV data', status=415, mimetype='text/plain') 72 | 73 | print('Invoked with {} records'.format(data.shape[0])) 74 | 75 | # Do the prediction 76 | predictions = ScoringService.predict(data) 77 | 78 | # Convert from numpy back to CSV 79 | out = StringIO() 80 | pd.DataFrame({'results':predictions}).to_csv(out, header=False, index=False) 81 | result = out.getvalue() 82 | 83 | return flask.Response(response=result, status=200, mimetype='text/csv') 84 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/serve: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # This file implements the scoring service shell. You don't necessarily need to modify it for various 4 | # algorithms. It starts nginx and gunicorn with the correct configurations and then simply waits until 5 | # gunicorn exits. 
6 | # 7 | # The flask server is specified to be the app object in wsgi.py 8 | # 9 | # We set the following parameters: 10 | # 11 | # Parameter Environment Variable Default Value 12 | # --------- -------------------- ------------- 13 | # number of workers MODEL_SERVER_WORKERS the number of CPU cores 14 | # timeout MODEL_SERVER_TIMEOUT 60 seconds 15 | 16 | from __future__ import print_function 17 | import multiprocessing 18 | import os 19 | import signal 20 | import subprocess 21 | import sys 22 | 23 | cpu_count = multiprocessing.cpu_count() 24 | 25 | model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60) 26 | model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count)) 27 | 28 | def sigterm_handler(nginx_pid, gunicorn_pid): 29 | try: 30 | os.kill(nginx_pid, signal.SIGQUIT) 31 | except OSError: 32 | pass 33 | try: 34 | os.kill(gunicorn_pid, signal.SIGTERM) 35 | except OSError: 36 | pass 37 | 38 | sys.exit(0) 39 | 40 | def start_server(): 41 | print('Starting the inference server with {} workers.'.format(model_server_workers)) 42 | 43 | 44 | # link the log streams to stdout/err so they will be logged to the container logs 45 | subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log']) 46 | subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log']) 47 | 48 | nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf']) 49 | gunicorn = subprocess.Popen(['gunicorn', 50 | '--timeout', str(model_server_timeout), 51 | '-k', 'gevent', 52 | '-b', 'unix:/tmp/gunicorn.sock', 53 | '-w', str(model_server_workers), 54 | 'wsgi:app']) 55 | 56 | signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid)) 57 | 58 | # If either subprocess exits, so do we. 59 | pids = set([nginx.pid, gunicorn.pid]) 60 | while True: 61 | pid, _ = os.wait() 62 | if pid in pids: 63 | break 64 | 65 | sigterm_handler(nginx.pid, gunicorn.pid) 66 | print('Inference server exiting') 67 | 68 | # The main routine just invokes the start function. 69 | 70 | if __name__ == '__main__': 71 | start_server() 72 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/train: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # A sample training component that trains a simple CatBoost Regressor tree model. 4 | # This implementation works in File mode and makes no assumptions about the input file names. 5 | # Input is specified as CSV with a data point in each row and the labels in the first column. 
6 | import glob
7 | import logging
8 | import os
9 | import json
10 | import re
11 | import subprocess
12 | import traceback
13 | import sys
14 | 
15 | from catboost import CatBoostRegressor
16 | import numpy as np
17 | import pandas as pd
18 | 
19 | prefix = '/opt/ml/'
20 | input_path = prefix + 'input/data'
21 | dataset_path = prefix + 'input/data/dataset'
22 | train_channel_name = 'train'
23 | validation_channel_name = 'validation'
24 | 
25 | output_path = os.path.join(prefix, 'output')
26 | model_path = os.path.join(prefix, 'model')
27 | model_file_name = 'catboost-regressor-model.dump'
28 | train_path = os.path.join(dataset_path, train_channel_name)
29 | validation_path = os.path.join(dataset_path, validation_channel_name)
30 | 
31 | param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
32 | 
33 | dvc_repo_url = os.environ.get('DVC_REPO_URL')
34 | dvc_branch = os.environ.get('DVC_BRANCH')
35 | user = os.environ.get('USER', "sagemaker")
36 | 
37 | # The function to execute the training.
38 | def train(learning_rate, depth):
39 |     print('Starting the training.')
40 | 
41 |     try:
42 |         # Take the set of train files and read them all into a single pandas dataframe
43 |         train_input_files = [os.path.join(train_path, file) for file in glob.glob(train_path+"/*.csv")]
44 |         if len(train_input_files) == 0:
45 |             raise ValueError(('There are no files in {}.\n' +
46 |                               'This usually indicates that the channel ({}) was incorrectly specified,\n' +
47 |                               'the data specification in S3 was incorrectly specified or the role specified\n' +
48 |                               'does not have permission to access the data.').format(train_path, train_channel_name))
49 |         print('Found train files: {}'.format(train_input_files))
50 |         train_df = pd.DataFrame()
51 |         for file in train_input_files:
52 |             if train_df.shape[0] == 0:
53 |                 train_df = pd.read_csv(file)
54 |             else:
55 |                 df = pd.read_csv(file)
56 |                 train_df = train_df.append(df, ignore_index=True)
57 | 
58 |         # Take the set of validation files and read them all into a single pandas dataframe
59 |         validation_input_files = [os.path.join(validation_path, file) for file in glob.glob(validation_path+"/*.csv")]
60 |         if len(validation_input_files) == 0:
61 |             raise ValueError(('There are no files in {}.\n' +
62 |                               'This usually indicates that the channel ({}) was incorrectly specified,\n' +
63 |                               'the data specification in S3 was incorrectly specified or the role specified\n' +
64 |                               'does not have permission to access the data.').format(validation_path, validation_channel_name))
65 |         print('Found validation files: {}'.format(validation_input_files))
66 |         validation_df = pd.DataFrame()
67 |         for file in validation_input_files:
68 |             if validation_df.shape[0] == 0:
69 |                 validation_df = pd.read_csv(file)
70 |             else:
71 |                 df = pd.read_csv(file)
72 |                 validation_df = validation_df.append(df, ignore_index=True)
73 | 
74 |         # Assumption is that the label is the first column
75 |         print('building training and validation datasets')
76 |         X_train = train_df.iloc[:, 1:].values
77 |         y_train = train_df.iloc[:, 0:1].values
78 |         X_validation = validation_df.iloc[:, 1:].values
79 |         y_validation = validation_df.iloc[:, 0:1].values
80 | 
81 |         # define and train model
82 |         model = CatBoostRegressor(learning_rate=int(learning_rate), depth=int(depth))
83 | 
84 |         model.fit(X_train, y_train, eval_set=(X_validation, y_validation), logging_level='Silent')
85 | 
86 |         # print abs error
87 |         print('validating model')
88 |         abs_err = np.abs(model.predict(X_validation) - y_validation)
89 | 
90 |         # print couple perf metrics
91 |         for q in [10, 50, 90]:
92 |             print('AE-at-' + 
str(q) + 'th-percentile: '+ str(np.percentile(a=abs_err, q=q))) 93 | 94 | # persist model 95 | path = os.path.join(model_path, model_file_name) 96 | print('saving model file to {}'.format(path)) 97 | model.save_model(path) 98 | 99 | print('Training complete.') 100 | except Exception as e: 101 | # Write out an error file. This will be returned as the failureReason in the 102 | # DescribeTrainingJob result. 103 | trc = traceback.format_exc() 104 | with open(os.path.join(output_path, 'failure'), 'w') as s: 105 | s.write('Exception during training: ' + str(e) + '\n' + trc) 106 | # Printing this causes the exception to be in the training job logs, as well. 107 | print('Exception during training: ' + str(e) + '\n' + trc) 108 | # A non-zero exit dependencies causes the training job to be marked as Failed. 109 | sys.exit(255) 110 | 111 | 112 | # Read in any hyperparameters that the user passed with the training job 113 | def get_hyperparameters(): 114 | print('Reading hyperparameters data: {}'.format(param_path)) 115 | with open(param_path) as json_file: 116 | hyperparameters_data = json.load(json_file) 117 | print('hyperparameters_data: {}'.format(hyperparameters_data)) 118 | return hyperparameters_data 119 | 120 | 121 | def clone_dvc_git_repo(): 122 | print(f"Configure git to pull authenticated from CodeCommit") 123 | print(f"Cloning repo: {dvc_repo_url}, git branch: {dvc_branch}") 124 | subprocess.check_call(["git", "clone", "--depth", "1", "--branch", dvc_branch, dvc_repo_url, input_path]) 125 | 126 | 127 | def dvc_pull(): 128 | print("Running dvc pull command") 129 | os.chdir(input_path + "/dataset/") 130 | subprocess.check_call(["dvc", "pull"]) 131 | 132 | 133 | if __name__ == '__main__': 134 | 135 | hyperparameters = get_hyperparameters() 136 | clone_dvc_git_repo() 137 | dvc_pull() 138 | train(hyperparameters['learning_rate'], hyperparameters['depth']) 139 | 140 | # A zero exit dependencies causes the job to be marked a Succeeded. 141 | sys.exit(0) 142 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/wsgi.py: -------------------------------------------------------------------------------- 1 | import predictor as myapp 2 | 3 | # This is just a simple wrapper for gunicorn to find your app. 4 | # If you want to change the algorithm file, simply change "predictor" above to the 5 | # new file. 6 | 7 | app = myapp.app 8 | -------------------------------------------------------------------------------- /dvc_sagemaker_byoc.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "def06897", 6 | "metadata": {}, 7 | "source": [ 8 | "# Prerequisite\n", 9 | "\n", 10 | "This notebook assumes you are using the `conda-env-dvc-kernel` image built and attached to a SageMaker Studio domain. 
Setup guidelines are available [here](https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/blob/main/sagemaker-studio-dvc-image/README.md).\n", 11 | "\n", 12 | "# Training a CatBoost regression model with data from DVC\n", 13 | "\n", 14 | "This notebook will guide you through an example that shows you how to build a Docker containers for SageMaker and use it for processing, training, and inference in conjunction with [DVC](https://dvc.org/).\n", 15 | "\n", 16 | "By packaging libraries and algorithms in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.\n", 17 | "\n", 18 | "### California Housing dataset\n", 19 | "\n", 20 | "We use the California Housing dataset, present in [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). \n", 21 | "\n", 22 | "The California Housing dataset was originally published in:\n", 23 | "\n", 24 | "Pace, R. Kelley, and Ronald Barry. \"Sparse spatial auto-regressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n", 25 | "\n", 26 | "### DVC\n", 27 | "\n", 28 | "DVC is built to make machine learning (ML) models shareable and reproducible.\n", 29 | "It is designed to handle large files, data sets, machine learning models, and metrics as well as code." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "21f94258", 35 | "metadata": {}, 36 | "source": [ 37 | "## Part 1: Configure DVC for data versioning\n", 38 | "\n", 39 | "Let us create a subdirectory where we prepare the data, i.e. `sagemaker-dvc-sample`.\n", 40 | "Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in [AWS CodeCommit](https://aws.amazon.com/codecommit/).\n", 41 | "The `dvc` configurations and files for data tracking will be versioned in this repository.\n", 42 | "Git offers native capabilities to manage subprojects via, for example, `git submodules` and `git subtrees`, and you can extend this notebook to use any of the aforementioned tools that best fit your workflow.\n", 43 | "\n", 44 | "One of the great advantage of using AWS CodeCommit in this context is its native integration with IAM for authentication purposes, meaning we can use SageMaker execution role to interact with the git server without the need to worry about how to store and retrieve credentials. Of course, you can always replace AWS CodeCommit with any other version control system based on git such as GitHub, Gitlab, or Bitbucket, keeping in mind you will need to handle the credentials in a secure manner, for example, by introducing Amazon Secret Managers to store and pull credentials at run time in the notebook as well as the processing and training jobs.\n", 45 | "\n", 46 | "Setting the appropriate permissions on SageMaker execution role will also allow the SageMaker processing and training job to interact securely with the AWS CodeCommit." 
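As an illustration of what "appropriate permissions" means here, the hypothetical snippet below attaches the AWS managed `AWSCodeCommitPowerUser` policy to the SageMaker execution role with `boto3`. The CDK stack in this repository already provisions a role with the required permissions, and attaching policies needs IAM rights that the notebook role itself normally does not have, so treat this purely as a sketch:

```python
# Illustration only: grant the SageMaker execution role access to CodeCommit.
# Run with administrator credentials, not from the notebook role itself.
import boto3
import sagemaker

iam = boto3.client("iam")
role_name = sagemaker.get_execution_role().split("/")[-1]

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AWSCodeCommitPowerUser",
)
```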
47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "f7ddaba7", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "%%sh\n", 57 | "\n", 58 | "## Create the repository\n", 59 | "\n", 60 | "repo_name=\"sagemaker-dvc-sample\"\n", 61 | "\n", 62 | "aws codecommit create-repository --repository-name ${repo_name} --repository-description \"Sample repository to describe how to use dvc with sagemaker and codecommit\"\n", 63 | "\n", 64 | "account=$(aws sts get-caller-identity --query Account --output text)\n", 65 | "\n", 66 | "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", 67 | "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", 68 | "region=${region:-eu-west-1}\n", 69 | "\n", 70 | "## repo_name is already in the .gitignore of the root repo\n", 71 | "\n", 72 | "mkdir -p ${repo_name}\n", 73 | "cd ${repo_name}\n", 74 | "\n", 75 | "# initalize new repo in subfolder\n", 76 | "git init\n", 77 | "## Change the remote to the codecommit\n", 78 | "git remote add origin https://git-codecommit.\"${region}\".amazonaws.com/v1/repos/\"${repo_name}\"\n", 79 | "\n", 80 | "# Configure git - change it according to your needs\n", 81 | "git config --global user.email \"sagemaker-studio-user@example.com\"\n", 82 | "git config --global user.name \"SageMaker Studio User\"\n", 83 | "\n", 84 | "git config --global credential.helper '!aws codecommit credential-helper $@'\n", 85 | "git config --global credential.UseHttpPath true\n", 86 | "\n", 87 | "# Initialize dvc\n", 88 | "dvc init\n", 89 | "\n", 90 | "git commit -m 'Add dvc configuration'\n", 91 | "\n", 92 | "# Set the DVC remote storage to S3 - uses the sagemaker standard default bucket\n", 93 | "dvc remote add -d storage s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc\n", 94 | "git commit .dvc/config -m \"initialize DVC local remote\"\n", 95 | "\n", 96 | "# set the DVC cache to S3\n", 97 | "dvc remote add s3cache s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc/cache\n", 98 | "dvc config cache.s3 s3cache\n", 99 | "\n", 100 | "# disable sending anonymized data to dvc for troubleshooting\n", 101 | "dvc config core.analytics false\n", 102 | "\n", 103 | "git add .dvc/config\n", 104 | "git commit -m 'update dvc config'\n", 105 | "\n", 106 | "git push --set-upstream origin master #--force" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "ba876ca1", 112 | "metadata": {}, 113 | "source": [ 114 | "## Part 2: Packaging and Uploading your container images for use with Amazon SageMaker\n", 115 | "\n", 116 | "### An overview of Docker\n", 117 | "\n", 118 | "If you're familiar with Docker already, you can skip ahead to the next section.\n", 119 | "\n", 120 | "For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. \n", 121 | "\n", 122 | "Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. 
Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.\n", 123 | "\n", 124 | "Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variable, etc.\n", 125 | "\n", 126 | "In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.\n", 127 | "\n", 128 | "Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.\n", 129 | "\n", 130 | "Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].\n", 131 | "\n", 132 | "Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.\n", 133 | "\n", 134 | "In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.\n", 135 | "\n", 136 | "Some helpful links:\n", 137 | "\n", 138 | "* [Docker home page](http://www.docker.com)\n", 139 | "* [Getting started with Docker](https://docs.docker.com/get-started/)\n", 140 | "* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)\n", 141 | "* [`docker run` reference](https://docs.docker.com/engine/reference/run/)\n", 142 | "\n", 143 | "[Amazon ECS]: https://aws.amazon.com/ecs/" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "id": "31ba56ff", 149 | "metadata": {}, 150 | "source": [ 151 | "### SageMaker Docker container for processing\n", 152 | "\n", 153 | "Let us now build and register the container for processing. In doing so, we ensure that all `dvc` related dependencies are already installed and we do not need to `pip install` or `git configure` anything within the processing scripts, and we can concentrate on the data preparation and feature engineering.\n", 154 | "\n", 155 | "We aim to have one image for processing where we can supply our own processing script. More information on how to build your own processing container can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-container-run-scripts.html). For a formal specification that defines the contract for an Amazon SageMaker Processing container, see [Build Your Own Processing Container (Advanced Scenario)](https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html). 
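For orientation, this is roughly how such a processing image is used once it has been built and pushed: because the processing Dockerfile sets `python3` as its `ENTRYPOINT`, a `ScriptProcessor` can hand it any processing script. The image URI, instance type, and S3 path below are placeholders for illustration:

```python
# Hypothetical sketch - image URI, instance type, and S3 path are placeholders.
import sagemaker
from sagemaker.processing import ProcessingInput, ScriptProcessor

role = sagemaker.get_execution_role()
image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/sagemaker-processing-dvc:latest"

processor = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],  # matches the ENTRYPOINT of the processing Dockerfile
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="source_dir/preprocessing-experiment.py",
    inputs=[
        ProcessingInput(
            source="s3://<bucket>/DEMO-sagemaker-experiments-dvc/input/",
            destination="/opt/ml/processing/input",
        )
    ],
)
```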
" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "id": "df0f1445", 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "!cat container/processing/Dockerfile" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "b61c7fc6", 171 | "metadata": {}, 172 | "source": [ 173 | "### Building and registering the containers\n", 174 | "\n", 175 | "We will use [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) to store our container images.\n", 176 | "\n", 177 | "To easily build custom container images from your Studio notebooks, we use the [SageMaker Docker Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli). For more information on the SageMaker Docker Build CLI, interested readers can refer to [this blogpost](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/)." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "id": "fa713178", 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "%%sh\n", 188 | "\n", 189 | "# The name of the image\n", 190 | "image_name=sagemaker-processing-dvc\n", 191 | "\n", 192 | "cd container/processing\n", 193 | "\n", 194 | "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", 195 | "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", 196 | "region=${region:-eu-west-1}\n", 197 | "\n", 198 | "# If the repository doesn't exist in ECR, create it.\n", 199 | "aws ecr describe-repositories --region \"${region}\" --repository-names \"${image_name}\" > /dev/null 2>&1\n", 200 | "\n", 201 | "if [ $? -ne 0 ]\n", 202 | "then\n", 203 | " aws ecr create-repository --region \"${region}\" --repository-name \"${image_name}\" > /dev/null\n", 204 | "fi\n", 205 | "\n", 206 | "sm-docker build . --repository \"${image_name}:latest\"" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "2ac30a72", 212 | "metadata": {}, 213 | "source": [ 214 | "### SageMaker Docker container for training and hosting\n", 215 | "\n", 216 | "Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:\n", 217 | "\n", 218 | "* In the example here, we don't define an `ENTRYPOINT` in the Dockerfile so Docker will run the command `train` at training time and `serve` at serving time. In this example, we define these as executable Python scripts, but they could be any program that we want to start in that environment.\n", 219 | "* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.\n", 220 | "* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed in. \n", 221 | "\n", 222 | "\n", 223 | "#### Running your container during training\n", 224 | "\n", 225 | "When Amazon SageMaker runs training, your `train` script is run just like a regular Python program. 
A number of files are laid out for your use, under the `/opt/ml` directory:\n", 226 | "\n", 227 | " /opt/ml\n", 228 | " |-- input\n", 229 | " | |-- config\n", 230 | " | | |-- hyperparameters.json\n", 231 | " | | `-- resourceConfig.json\n", 232 | " | `-- data\n", 233 | " | `-- \n", 234 | " | `-- \n", 235 | " |-- model\n", 236 | " | `-- \n", 237 | " `-- output\n", 238 | " `-- failure\n", 239 | "\n", 240 | "##### The input\n", 241 | "\n", 242 | "* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn't support distributed training, we'll ignore it here.\n", 243 | "* `/opt/ml/input/data//` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. \n", 244 | "* `/opt/ml/input/data/_` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.\n", 245 | "\n", 246 | "##### The output\n", 247 | "\n", 248 | "* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.\n", 249 | "* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.\n", 250 | "\n", 251 | "#### Running your container during hosting\n", 252 | "\n", 253 | "Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests.\n", 254 | "\n", 255 | "This stack is implemented in the sample code here and you can mostly just leave it alone. \n", 256 | "\n", 257 | "Amazon SageMaker uses two URLs in the container:\n", 258 | "\n", 259 | "* `/ping` will receive `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.\n", 260 | "* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well. 
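As a hypothetical local smoke test of this contract (not part of the notebook), you can run the image locally with the `serve` argument, mount a trained model under `/opt/ml/model`, and exercise the two endpoints with `requests`:

```python
# Hypothetical local smoke test, e.g. after:
#   docker run -p 8080:8080 -v $(pwd)/model:/opt/ml/model sagemaker-catboost-dvc serve
import requests

# /ping returns 200 once the model can be loaded
print(requests.get("http://localhost:8080/ping").status_code)

# /invocations takes CSV feature rows (no label column) and returns one prediction per line
payload = "8.3252,41.0,6.98,1.02,322.0,2.55,37.88,-122.23\n"
response = requests.post(
    "http://localhost:8080/invocations",
    data=payload,
    headers={"Content-Type": "text/csv"},
)
print(response.text)
```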
\n", 261 | "\n", 262 | "The container will have the model files in the same place they were written during training:\n", 263 | "\n", 264 | " /opt/ml\n", 265 | " `-- model\n", 266 | " `-- " 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "id": "afb0b22a", 272 | "metadata": {}, 273 | "source": [ 274 | "### The parts of the training and inference container\n", 275 | "\n", 276 | "In the `container/train_and_serve` directory are all the components you need to package the sample algorithm for Amazon SageMager:\n", 277 | "\n", 278 | " .\n", 279 | " |-- Dockerfile\n", 280 | " |-- README.md\n", 281 | " `-- catboost_regressor\n", 282 | " |-- nginx.conf\n", 283 | " |-- predictor.py\n", 284 | " |-- serve\n", 285 | " |-- train\n", 286 | " `-- wsgi.py\n", 287 | "\n", 288 | "Let's discuss each of these in turn:\n", 289 | "\n", 290 | "* __`Dockerfile`__ describes how to build your Docker container image. More details below.\n", 291 | "* __`catboost_regressor`__ is the directory which contains the files that will be installed in the container.\n", 292 | "\n", 293 | "In this simple application, we only install five files in the container. You may only need that many or, if you have many supporting routines, you may wish to install more. These five show the standard structure of our Python containers, although you are free to choose a different toolset and therefore could have a different layout. If you're writing in a different programming language, you'll certainly have a different layout depending on the frameworks and tools you choose.\n", 294 | "\n", 295 | "The files that we'll put in the container are:\n", 296 | "\n", 297 | "* __`nginx.conf`__ is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.\n", 298 | "* __`predictor.py`__ is the program that actually implements the Flask web server and the decision tree predictions for this app. You'll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.\n", 299 | "* __`serve`__ is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in `predictor.py`. You should be able to take this file as-is.\n", 300 | "* __`train`__ is the program that is invoked when the container is run for training. You will modify this program to implement your training algorithm.\n", 301 | "* __`wsgi.py`__ is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.\n", 302 | "\n", 303 | "In summary, the two files you will probably want to change for your application are `train` and `predictor.py`." 
304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "id": "7b6b75bf", 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "!cat container/train_and_serve/Dockerfile" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "id": "7a14543c", 319 | "metadata": {}, 320 | "source": [ 321 | "In the `container/train_and_serve` directory are all the components you need to package the sample algorithm for Amazon SageMaker:\n", 322 | "\n", 323 | " .\n", 324 | " `-- container/train_and_serve/\n", 325 | " |-- Dockerfile\n", 326 | " |-- README.md\n", 327 | " `--catboost_regressor/\n", 328 | " |-- nginx.conf\n", 329 | " |-- predictor.py\n", 330 | " |-- serve\n", 331 | " |-- train\n", 332 | " |-- wsgi.py\n" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "id": "61992f99", 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "%%sh\n", 343 | "\n", 344 | "# The name of our algorithm\n", 345 | "algorithm_name=sagemaker-catboost-dvc\n", 346 | "\n", 347 | "cd container/train_and_serve\n", 348 | "\n", 349 | "chmod +x catboost_regressor/train\n", 350 | "chmod +x catboost_regressor/serve\n", 351 | "\n", 352 | "# Get the region defined in the current configuration (default to us-west-1 if none defined)\n", 353 | "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", 354 | "region=${region:-eu-west-1}\n", 355 | "\n", 356 | "# If the repository doesn't exist in ECR, create it.\n", 357 | "aws ecr describe-repositories --region \"${region}\" --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n", 358 | "\n", 359 | "if [ $? -ne 0 ]\n", 360 | "then\n", 361 | " aws ecr create-repository --region \"${region}\" --repository-name \"${algorithm_name}\" > /dev/null\n", 362 | "fi\n", 363 | "\n", 364 | "sm-docker build . --repository \"${algorithm_name}:latest\"" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "id": "0c337c82", 370 | "metadata": {}, 371 | "source": [ 372 | "## Part 3: Processing and Training with DVC and SageMaker\n", 373 | "\n", 374 | "In this section we explore two different approaches to tackle our problem and how we can keep track of the 2 tests using SageMaker Experiments.\n", 375 | "\n", 376 | "The high level conceptual architecture is depicted in the figure below\n", 377 | "\n", 378 | "\n", 379 | "\n", 380 | "Let's unfold in the following sections the implementation details of the two experiments.\n", 381 | "\n", 382 | "### Import libraries and initial setup\n", 383 | "\n", 384 | "Lets start by importing the libraries and setup variables that will be useful as we go along in the notebook." 
385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "id": "feead477", 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "import boto3\n", 395 | "import sagemaker\n", 396 | "import time\n", 397 | "from time import strftime\n", 398 | "\n", 399 | "boto_session = boto3.Session()\n", 400 | "sagemaker_session = sagemaker.Session(boto_session=boto_session)\n", 401 | "sm_client = boto3.client(\"sagemaker\")\n", 402 | "region = boto_session.region_name\n", 403 | "bucket = sagemaker_session.default_bucket()\n", 404 | "role = sagemaker.get_execution_role()\n", 405 | "account = sagemaker_session.boto_session.client(\"sts\").get_caller_identity()[\"Account\"]\n", 406 | "\n", 407 | "prefix = 'DEMO-sagemaker-experiments-dvc'\n", 408 | "\n", 409 | "print(f\"account: {account}\")\n", 410 | "print(f\"bucket: {bucket}\")\n", 411 | "print(f\"region: {region}\")\n", 412 | "print(f\"role: {role}\")" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "aae3a438", 418 | "metadata": {}, 419 | "source": [ 420 | "### Prepare raw data\n", 421 | "\n", 422 | "We upload the raw data to S3 in the default bucket." 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "id": "4d9dfa8b", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "import pandas as pd\n", 433 | "import numpy as np\n", 434 | "\n", 435 | "from sklearn.datasets import fetch_california_housing\n", 436 | "from sklearn.model_selection import train_test_split\n", 437 | "\n", 438 | "from pathlib import Path\n", 439 | "\n", 440 | "databunch = fetch_california_housing()\n", 441 | "dataset = np.concatenate((databunch[\"target\"].reshape(-1, 1), databunch[\"data\"]), axis=1)\n", 442 | "\n", 443 | "print(f\"Dataset shape = {dataset.shape}\")\n", 444 | "np.savetxt(\"dataset.csv\", dataset, delimiter=\",\")\n", 445 | "\n", 446 | "data_prefix_path = f\"{prefix}/input/dataset.csv\"\n", 447 | "s3_data_path = f\"s3://{bucket}/{data_prefix_path}\"\n", 448 | "print(f\"Raw data location in S3: {s3_data_path}\")\n", 449 | "\n", 450 | "s3 = boto3.client(\"s3\")\n", 451 | "s3.upload_file(\"dataset.csv\", bucket, data_prefix_path)" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "id": "622d14ce", 457 | "metadata": {}, 458 | "source": [ 459 | "### Setup SageMaker Experiments\n", 460 | "\n", 461 | "Amazon SageMaker Experiments have been built for data scientists that are performing different experiments as part of their model development process and want a simple way to organize, track, compare, and evaluate their machine learning experiments.\n", 462 | "\n", 463 | "Let’s start first with an overview of Amazon SageMaker Experiments features:\n", 464 | "\n", 465 | "* Organize Experiments: Amazon SageMaker Experiments structures experimentation with a first top level entity called experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, parameters, and artifacts. You can picture experiments as the top level “folder” for organizing your hypotheses, your trials as the “subfolders” for each group test run, and your trial components as your “files” for each instance of a test run.\n", 466 | "* Track Experiments: Amazon SageMaker Experiments allows the data scientist to track experiments automatically or manually. 
Amazon SageMaker Experiments offers the possibility to automatically assign the sagemaker jobs to a trial specifying the `experiment_config` argument, or to manually call the tracking APIs.\n", 467 | "* Compare and Evaluate Experiments: The integration of Amazon SageMaker Experiments with Amazon SageMaker Studio makes it easier to produce data visualizations and compare different trials to identify the best combination of hyperparameters.\n", 468 | "\n", 469 | "Now, in order to track this test in SageMaker, we need to create an experiment." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "id": "4a2af348", 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "from smexperiments.experiment import Experiment\n", 480 | "from smexperiments.trial import Trial\n", 481 | "from smexperiments.trial_component import TrialComponent\n", 482 | "from smexperiments.tracker import Tracker\n", 483 | "\n", 484 | "experiment_name = 'DEMO-sagemaker-experiments-dvc'\n", 485 | "\n", 486 | "# create the experiment if it doesn't exist\n", 487 | "try:\n", 488 | " my_experiment = Experiment.load(experiment_name=experiment_name)\n", 489 | " print(\"existing experiment loaded\")\n", 490 | "except Exception as ex:\n", 491 | " if \"ResourceNotFound\" in str(ex):\n", 492 | " my_experiment = Experiment.create(\n", 493 | " experiment_name = experiment_name,\n", 494 | " description = \"How to integrate DVC\"\n", 495 | " )\n", 496 | " print(\"new experiment created\")\n", 497 | " else:\n", 498 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 499 | " print(\"Dont go forward!\")\n", 500 | " raise" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "id": "0947c82b", 506 | "metadata": {}, 507 | "source": [ 508 | "We need to also define trials within the experiment.\n", 509 | "While it is possible to have any number of trials within an experiment, for our excercise, we will create 2 trials, one for each processing strategy.\n", 510 | "\n", 511 | "### Test 1: generate single files for training and validation\n", 512 | "\n", 513 | "In this test, we show how to create a processing script that fetches the raw data directly from S3 as an input, process it to create the triplet `train`, `validation` and `test`, and store the results back to S3 using `dvc`. Furthermore, we show how you can pair `dvc` with SageMaker native tracking capabilities when executing Processing and Training Jobs and via SageMaker Experiments." 
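The actual processing script is printed with `pygmentize` a few cells below. Purely as an outline of the idea, and under the assumption that file names, paths, and shell commands may differ from the repository's exact code, the core of such a script combines a scikit-learn split, `dvc add`/`dvc push` on a per-trial git branch, and an enrichment of the TrialComponent that SageMaker creates for the job:

```python
# Illustrative outline only; the real script is shown by the pygmentize cell below.
import os
import subprocess

import pandas as pd
from sklearn.model_selection import train_test_split
from smexperiments.tracker import Tracker


def run(cmd, cwd=None):
    # Helper to run git/dvc CLI commands and fail loudly if they do not succeed
    subprocess.run(cmd, cwd=cwd, shell=True, check=True)


repo_url = os.environ["DVC_REPO_URL"]   # e.g. codecommit::<region>://sagemaker-dvc-sample
branch = os.environ["DVC_BRANCH"]       # one branch per trial
split_ratio = 0.2                       # normally parsed from --train-test-split-ratio

# 1. Clone the repository that holds the dvc metadata and create the trial branch
run(f"git clone {repo_url} repo && cd repo && git checkout -b {branch}")

# 2. Split the raw data that the Processing job mounted under /opt/ml/processing/input
df = pd.read_csv("/opt/ml/processing/input/dataset.csv", header=None)
train, test = train_test_split(df, test_size=split_ratio)
train, validation = train_test_split(train, test_size=split_ratio)
for name, part in [("train", train), ("validation", validation), ("test", test)]:
    os.makedirs(f"repo/dataset/{name}", exist_ok=True)
    part.to_csv(f"repo/dataset/{name}/california_{name}.csv", header=False, index=False)

# 3. Version the dataset: the data goes to the S3 remote, the metadata to the git branch
run(
    "dvc add dataset && git add dataset.dvc .gitignore && git commit -m 'new data' && "
    f"git push --set-upstream origin {branch} && dvc push",
    cwd="repo",
)

# 4. Enrich the TrialComponent that SageMaker generated for this Processing job
with Tracker.load() as tracker:
    tracker.log_parameters({
        "data_repo_url": repo_url,
        "data_branch": branch,
        "train_test_split_ratio": split_ratio,
    })
```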
514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "id": "3a34b0c8", 520 | "metadata": {}, 521 | "outputs": [], 522 | "source": [ 523 | "first_trial_name = \"dvc-trial-single-file\"\n", 524 | "\n", 525 | "try:\n", 526 | " my_first_trial = Trial.load(trial_name=first_trial_name)\n", 527 | " print(\"existing trial loaded\")\n", 528 | "except Exception as ex:\n", 529 | " if \"ResourceNotFound\" in str(ex):\n", 530 | " my_first_trial = Trial.create(\n", 531 | " experiment_name=experiment_name,\n", 532 | " trial_name=first_trial_name,\n", 533 | " )\n", 534 | " print(\"new trial created\")\n", 535 | " else:\n", 536 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 537 | " print(\"Dont go forward!\")\n", 538 | " raise" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "id": "d9471d00", 544 | "metadata": {}, 545 | "source": [ 546 | "### Processing script: version data with DVC\n", 547 | "\n", 548 | "The processing script takes as arguments the address of the git repository, and the branch we want to create to store the `dvc` metadata. The datasets themselves will be then stored in S3. The arguments passed to the processing scripts are not automatically tracked in SageMaker Experiments in the automatically generated TrialComponent. The TrialComponent generated by SageMaker can be loaded within the Processing Job and further enrich with any extra data, which then become available for visualization in the SageMaker Studio UI. In our case, we will store the following data:\n", 549 | "* `data_repo_url`\n", 550 | "* `data_branch`\n", 551 | "* `data_commit_hash`\n", 552 | "* `train_test_split_ratio`" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "id": "65896ab5", 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "!pygmentize 'source_dir/preprocessing-experiment.py'" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "id": "e1cd9076", 568 | "metadata": {}, 569 | "source": [ 570 | "### SageMaker Processing job\n", 571 | "\n", 572 | "We have now all ingredients to execute our SageMaker Processing Job:\n", 573 | "* a custom image with dvc installed\n", 574 | "* a git repository (i.e., AWS CodeCommit)\n", 575 | "* a processing script that can process several arguments (i.e., `--train-test-split-ratio`, `--dvc-repo-url`, `--dvc-branch`)\n", 576 | "* a SageMaker Experiment and a Trial" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": null, 582 | "id": "62fa75e3", 583 | "metadata": {}, 584 | "outputs": [], 585 | "source": [ 586 | "from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput\n", 587 | "\n", 588 | "dvc_repo_url = \"codecommit::{}://sagemaker-dvc-sample\".format(region)\n", 589 | "dvc_branch = my_first_trial.trial_name\n", 590 | "\n", 591 | "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-dvc:latest\".format(account, region)\n", 592 | "\n", 593 | "script_processor = ScriptProcessor(command=['python3'],\n", 594 | " image_uri=image,\n", 595 | " role=role,\n", 596 | " instance_count=1,\n", 597 | " instance_type='ml.m5.xlarge',\n", 598 | " env={\n", 599 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 600 | " \"DVC_BRANCH\": dvc_branch,\n", 601 | " \"USER\": \"sagemaker\"\n", 602 | " },\n", 603 | " )\n", 604 | "\n", 605 | "experiment_config={\n", 606 | " \"ExperimentName\": my_experiment.experiment_name,\n", 607 | " \"TrialName\": my_first_trial.trial_name\n", 608 | "}" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | 
"id": "baf3d39c", 614 | "metadata": {}, 615 | "source": [ 616 | "Executing the processing job will take around 3-4 minutes." 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": null, 622 | "id": "ba0c629f", 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "%%time\n", 627 | "\n", 628 | "script_processor.run(\n", 629 | " code='source_dir/preprocessing-experiment.py',\n", 630 | " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", 631 | " experiment_config=experiment_config,\n", 632 | " arguments=[\"--train-test-split-ratio\", \"0.2\"],\n", 633 | ")" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "id": "e4aa1f57", 639 | "metadata": {}, 640 | "source": [ 641 | "### Create an estimator and fit the model" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "id": "f2be8cc5", 647 | "metadata": {}, 648 | "source": [ 649 | "To use DVC integration, pass a `dvc_repo_url` and `dvc_branch` as parameters when you create the Estimator object.\n", 650 | "\n", 651 | "We will train on the `dvc-trial-single-file` branch first.\n", 652 | "\n", 653 | "When doing `dvc pull` in the training script, the following dataset structure will be generated:\n", 654 | "\n", 655 | "```\n", 656 | "dataset\n", 657 | " |-- train\n", 658 | " | |-- california_train.csv\n", 659 | " |-- test\n", 660 | " | |-- california_test.csv\n", 661 | " |-- validation\n", 662 | " | |-- california_validation.csv\n", 663 | "```\n", 664 | "\n", 665 | "#### Metric definition\n", 666 | "\n", 667 | "SageMaker emits every log that is going to STDOUT to CLoudWatch. In order to capture the metrics we are interested in, we need to specify a metric definition object to define the format of the metrics via regex. 
By doing so, SageMaker will know how to capture the metrics from the CloudWatch logs of the training job.\n", 668 | "\n", 669 | "In our case, we are interested in the median error.\n", 670 | "```\n", 671 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 672 | "```" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "id": "c5dbce5d", 679 | "metadata": {}, 680 | "outputs": [], 681 | "source": [ 682 | "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest\".format(account, region)\n", 683 | "\n", 684 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 685 | "\n", 686 | "hyperparameters={ \n", 687 | " \"learning_rate\" : 1,\n", 688 | " \"depth\": 6\n", 689 | " }\n", 690 | "\n", 691 | "estimator = sagemaker.estimator.Estimator(\n", 692 | " image,\n", 693 | " role,\n", 694 | " instance_count=1,\n", 695 | " metric_definitions=metric_definitions,\n", 696 | " instance_type=\"ml.m5.large\",\n", 697 | " sagemaker_session=sagemaker_session,\n", 698 | " hyperparameters=hyperparameters,\n", 699 | " environment={\n", 700 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 701 | " \"DVC_BRANCH\": dvc_branch,\n", 702 | " \"USER\": \"sagemaker\"\n", 703 | " }\n", 704 | ")\n", 705 | "\n", 706 | "experiment_config={\n", 707 | " \"ExperimentName\": my_experiment.experiment_name,\n", 708 | " \"TrialName\": my_first_trial.trial_name\n", 709 | "}" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": null, 715 | "id": "2f766d30", 716 | "metadata": {}, 717 | "outputs": [], 718 | "source": [ 719 | "%%time\n", 720 | "\n", 721 | "estimator.fit(experiment_config=experiment_config)" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "id": "f83c9dde", 727 | "metadata": {}, 728 | "source": [ 729 | "On the logs above you can see those lines, indicating about the files pulled by dvc:\n", 730 | "\n", 731 | "```\n", 732 | "Running dvc pull command\n", 733 | "A train/california_train.csv\n", 734 | "A test/california_test.csv\n", 735 | "A validation/california_validation.csv\n", 736 | "3 files added and 3 files fetched\n", 737 | "Starting the training.\n", 738 | "Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", 739 | "Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", 740 | "```" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "id": "e6bb08ce", 746 | "metadata": {}, 747 | "source": [ 748 | "### Test 2: generate multiple files for training and validation" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "id": "e24154ae", 755 | "metadata": {}, 756 | "outputs": [], 757 | "source": [ 758 | "second_trial_name = \"dvc-trial-multi-files\"\n", 759 | "\n", 760 | "try:\n", 761 | " my_second_trial = Trial.load(trial_name=second_trial_name)\n", 762 | " print(\"existing trial loaded\")\n", 763 | "except Exception as ex:\n", 764 | " if \"ResourceNotFound\" in str(ex):\n", 765 | " my_second_trial = Trial.create(\n", 766 | " experiment_name=experiment_name,\n", 767 | " trial_name=second_trial_name,\n", 768 | " )\n", 769 | " print(\"new trial created\")\n", 770 | " else:\n", 771 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 772 | " print(\"Dont go forward!\")\n", 773 | " raise" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "id": "dbe33f33", 779 | "metadata": {}, 780 | "source": [ 781 | "Differently from the first processing script, 
we now create multiple files for training and validation out of the original dataset and store the `dvc` metadata in a different branch." 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": null, 787 | "id": "0a045eb4", 788 | "metadata": {}, 789 | "outputs": [], 790 | "source": [ 791 | "!pygmentize 'source_dir/preprocessing-experiment-multifiles.py'" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": null, 797 | "id": "167185e8", 798 | "metadata": {}, 799 | "outputs": [], 800 | "source": [ 801 | "from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput\n", 802 | "\n", 803 | "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-dvc:latest\".format(account, region)\n", 804 | "\n", 805 | "dvc_branch = my_second_trial.trial_name\n", 806 | "\n", 807 | "script_processor = ScriptProcessor(command=['python3'],\n", 808 | " image_uri=image,\n", 809 | " role=role,\n", 810 | " instance_count=1,\n", 811 | " instance_type='ml.m5.xlarge',\n", 812 | " env={\n", 813 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 814 | " \"DVC_BRANCH\": dvc_branch,\n", 815 | " \"USER\": \"sagemaker\"\n", 816 | " },\n", 817 | " )\n", 818 | "\n", 819 | "experiment_config={\n", 820 | " \"ExperimentName\": my_experiment.experiment_name,\n", 821 | " \"TrialName\": my_second_trial.trial_name\n", 822 | "}" 823 | ] 824 | }, 825 | { 826 | "cell_type": "markdown", 827 | "id": "14b26ae5", 828 | "metadata": {}, 829 | "source": [ 830 | "Executing the processing job will take ~5 minutes" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": null, 836 | "id": "46d29d7a", 837 | "metadata": {}, 838 | "outputs": [], 839 | "source": [ 840 | "%%time\n", 841 | "\n", 842 | "script_processor.run(\n", 843 | " code='source_dir/preprocessing-experiment-multifiles.py',\n", 844 | " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", 845 | " experiment_config=experiment_config,\n", 846 | " arguments=[\"--train-test-split-ratio\", \"0.1\"],\n", 847 | ")" 848 | ] 849 | }, 850 | { 851 | "cell_type": "markdown", 852 | "id": "1bb32869", 853 | "metadata": {}, 854 | "source": [ 855 | "We will now train on the `dvc-trial-multi-files` branch.\n", 856 | "\n", 857 | "When doing `dvc pull`, this is the dataset structure:\n", 858 | "\n", 859 | "```\n", 860 | "dataset\n", 861 | " |-- train\n", 862 | " | |-- california_train_1.csv\n", 863 | " | |-- california_train_2.csv\n", 864 | " | |-- california_train_3.csv\n", 865 | " | |-- california_train_4.csv\n", 866 | " | |-- california_train_5.csv\n", 867 | " |-- test\n", 868 | " | |-- california_test.csv\n", 869 | " |-- validation\n", 870 | " | |-- california_validation_1.csv\n", 871 | " | |-- california_validation_2.csv\n", 872 | " | |-- california_validation_3.csv\n", 873 | "```" 874 | ] 875 | }, 876 | { 877 | "cell_type": "code", 878 | "execution_count": null, 879 | "id": "bd8db6a2", 880 | "metadata": {}, 881 | "outputs": [], 882 | "source": [ 883 | "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest\".format(account, region)\n", 884 | "\n", 885 | "hyperparameters={ \n", 886 | " \"learning_rate\" : 1,\n", 887 | " \"depth\": 6\n", 888 | " }\n", 889 | "\n", 890 | "estimator = sagemaker.estimator.Estimator(\n", 891 | " image,\n", 892 | " role,\n", 893 | " instance_count=1,\n", 894 | " metric_definitions=metric_definitions,\n", 895 | " instance_type=\"ml.m5.large\",\n", 896 | " sagemaker_session=sagemaker_session,\n", 897 | " hyperparameters=hyperparameters,\n", 898 | "
environment={\n", 899 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 900 | " \"DVC_BRANCH\": dvc_branch,\n", 901 | " \"USER\": \"sagemaker\"\n", 902 | " }\n", 903 | ")\n", 904 | "\n", 905 | "experiment_config={\n", 906 | " \"ExperimentName\": my_experiment.experiment_name,\n", 907 | " \"TrialName\": my_second_trial.trial_name,\n", 908 | "}" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "id": "c6e0e936", 914 | "metadata": {}, 915 | "source": [ 916 | "The training job will take around 5 minutes" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": null, 922 | "id": "1e216113", 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [ 926 | "%%time\n", 927 | "\n", 928 | "estimator.fit(experiment_config=experiment_config)" 929 | ] 930 | }, 931 | { 932 | "cell_type": "markdown", 933 | "id": "f767fa22", 934 | "metadata": {}, 935 | "source": [ 936 | "In the logs above you can see the following lines, indicating the files pulled by dvc:\n", 937 | "\n", 938 | "```\n", 939 | "Running dvc pull command\n", 940 | "A validation/california_validation_2.csv\n", 941 | "A validation/california_validation_1.csv\n", 942 | "A validation/california_validation_3.csv\n", 943 | "A train/california_train_4.csv\n", 944 | "A train/california_train_5.csv\n", 945 | "A train/california_train_2.csv\n", 946 | "A train/california_train_3.csv\n", 947 | "A train/california_train_1.csv\n", 948 | "A test/california_test.csv\n", 949 | "9 files added and 9 files fetched\n", 950 | "Starting the training.\n", 951 | "Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']\n", 952 | "Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']\n", 953 | "```" 954 | ] 955 | }, 956 | { 957 | "cell_type": "markdown", 958 | "id": "0bd85708", 959 | "metadata": {}, 960 | "source": [ 961 | "## Part 4: Hosting your model in SageMaker" 962 | ] 963 | }, 964 | { 965 | "cell_type": "code", 966 | "execution_count": null, 967 | "id": "48387d98", 968 | "metadata": {}, 969 | "outputs": [], 970 | "source": [ 971 | "from sagemaker.predictor import csv_serializer\n", 972 | "\n", 973 | "predictor = estimator.deploy(1, \"ml.t2.medium\", serializer=csv_serializer)" 974 | ] 975 | }, 976 | { 977 | "cell_type": "markdown", 978 | "id": "57bb1f58", 979 | "metadata": {}, 980 | "source": [ 981 | "### Fetch the testing data\n", 982 | "\n", 983 | "Save locally the test data created by the SageMaker Processing Job and stored in S3 via DVC."
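The shell cell below fetches the trial branch and pulls the files with the `dvc` CLI. As an alternative sketch, the same test file can also be read directly through the `dvc` Python API without a local checkout; the CodeCommit HTTPS URL below is an assumption based on the repository created earlier in this notebook:

```python
# Hedged alternative to the shell approach below: read the dvc-tracked test set in memory.
import io

import dvc.api
import pandas as pd

git_repo_https = f"https://git-codecommit.{region}.amazonaws.com/v1/repos/sagemaker-dvc-sample"

raw = dvc.api.read(
    "dataset/test/california_test.csv",
    repo=git_repo_https,
    rev="dvc-trial-multi-files",   # the branch written by the second processing job
)
test = pd.read_csv(io.StringIO(raw), header=None)
```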
984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": null, 989 | "id": "428729d2", 990 | "metadata": {}, 991 | "outputs": [], 992 | "source": [ 993 | "%%sh\n", 994 | "\n", 995 | "cd sagemaker-dvc-sample\n", 996 | "\n", 997 | "# get all remote branches\n", 998 | "git fetch --all\n", 999 | "\n", 1000 | "# move to the dvc-trial-multi-files branch\n", 1001 | "git checkout dvc-trial-multi-files\n", 1002 | "\n", 1003 | "# gather the data (for testing purposes)\n", 1004 | "dvc pull" 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "markdown", 1009 | "id": "2a2d99e5", 1010 | "metadata": {}, 1011 | "source": [ 1012 | "Prepare the data" 1013 | ] 1014 | }, 1015 | { 1016 | "cell_type": "code", 1017 | "execution_count": null, 1018 | "id": "bfc4eeb7", 1019 | "metadata": {}, 1020 | "outputs": [], 1021 | "source": [ 1022 | "test = pd.read_csv(\"./sagemaker-dvc-sample/dataset/test/california_test.csv\",header=None)\n", 1023 | "X_test = test.iloc[:, 1:].values\n", 1024 | "y_test = test.iloc[:, 0:1].values" 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "markdown", 1029 | "id": "3594f09e", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "## Invoke endpoint with the Python SDK" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": null, 1038 | "id": "ff43812f", 1039 | "metadata": {}, 1040 | "outputs": [], 1041 | "source": [ 1042 | "predicted = predictor.predict(X_test).decode('utf-8').split('\\n')\n", 1043 | "for i in range(len(predicted)-1):\n", 1044 | " print(f\"predicted: {predicted[i]}, actual: {y_test[i][0]}\")" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "markdown", 1049 | "id": "103db7d0", 1050 | "metadata": {}, 1051 | "source": [ 1052 | "### Delete the Endpoint\n", 1053 | "\n", 1054 | "Make sure to delete the endpoint to avoid unexpected costs" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": null, 1060 | "id": "6112817a", 1061 | "metadata": {}, 1062 | "outputs": [], 1063 | "source": [ 1064 | "predictor.delete_endpoint()" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "markdown", 1069 | "id": "71c48355", 1070 | "metadata": {}, 1071 | "source": [ 1072 | "### (Optional) Delete the Experiment and all Trials and TrialComponents" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "code", 1077 | "execution_count": null, 1078 | "id": "7ba71c3c", 1079 | "metadata": {}, 1080 | "outputs": [], 1081 | "source": [ 1082 | "my_experiment.delete_all(action=\"--force\")" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "id": "5a4612c2", 1088 | "metadata": {}, 1089 | "source": [ 1090 | "### (Optional) Delete the AWS CodeCommit repository" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "code", 1095 | "execution_count": null, 1096 | "id": "7f5db756", 1097 | "metadata": {}, 1098 | "outputs": [], 1099 | "source": [ 1100 | "!aws codecommit delete-repository --repository-name sagemaker-dvc-sample" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "markdown", 1105 | "id": "9b14fc3d", 1106 | "metadata": {}, 1107 | "source": [ 1108 | "### (Optional) Delete the AWS ECR repositories" 1109 | ] 1110 | }, 1111 | { 1112 | "cell_type": "code", 1113 | "execution_count": null, 1114 | "id": "8a0705de", 1115 | "metadata": {}, 1116 | "outputs": [], 1117 | "source": [ 1118 | "!aws ecr delete-repository --repository-name sagemaker-catboost-dvc --force\n", 1119 | "!aws ecr delete-repository --repository-name sagemaker-processing-dvc --force" 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "code", 1124 | "execution_count": null, 1125 | "id":
"93cd4af8", 1126 | "metadata": {}, 1127 | "outputs": [], 1128 | "source": [] 1129 | } 1130 | ], 1131 | "metadata": { 1132 | "instance_type": "ml.t3.medium", 1133 | "kernelspec": { 1134 | "display_name": "Python [conda env: dvc] (conda-env-dvc-kernel/latest)", 1135 | "language": "python", 1136 | "name": "conda-env-dvc-py__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:583558296381:image/conda-env-dvc-kernel" 1137 | }, 1138 | "language_info": { 1139 | "codemirror_mode": { 1140 | "name": "ipython", 1141 | "version": 3 1142 | }, 1143 | "file_extension": ".py", 1144 | "mimetype": "text/x-python", 1145 | "name": "python", 1146 | "nbconvert_exporter": "python", 1147 | "pygments_lexer": "ipython3", 1148 | "version": "3.8.12" 1149 | } 1150 | }, 1151 | "nbformat": 4, 1152 | "nbformat_minor": 5 1153 | } 1154 | -------------------------------------------------------------------------------- /dvc_sagemaker_script_mode.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5289742c", 6 | "metadata": {}, 7 | "source": [ 8 | "# Prerequisite\n", 9 | "\n", 10 | "This notebook assumes you are using the `conda-env-dvc-kernel` image built and attached to a SageMaker Studio domain. Setup guidelines are available [here](https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/blob/main/sagemaker-studio-dvc-image/README.md).\n", 11 | "\n", 12 | "# Training a CatBoost regression model with data from DVC\n", 13 | "\n", 14 | "This notebook will guide you through an example that shows you how to build a Docker containers for SageMaker and use it for processing, training, and inference in conjunction with [DVC](https://dvc.org/).\n", 15 | "\n", 16 | "By packaging libraries and algorithms in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.\n", 17 | "\n", 18 | "### California Housing dataset\n", 19 | "\n", 20 | "We use the California Housing dataset, present in [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). \n", 21 | "\n", 22 | "The California Housing dataset was originally published in:\n", 23 | "\n", 24 | "Pace, R. Kelley, and Ronald Barry. \"Sparse spatial auto-regressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n", 25 | "\n", 26 | "### DVC\n", 27 | "\n", 28 | "DVC is built to make machine learning (ML) models shareable and reproducible.\n", 29 | "It is designed to handle large files, data sets, machine learning models, and metrics as well as code." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "f55d6f3f", 35 | "metadata": {}, 36 | "source": [ 37 | "## Part 1: Configure DVC for data versioning\n", 38 | "\n", 39 | "Let us create a subdirectory where we prepare the data, i.e. 
`sagemaker-dvc-sample`.\n", 40 | "Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in [AWS CodeCommit](https://aws.amazon.com/codecommit/).\n", 41 | "The `dvc` configurations and files for data tracking will be versioned in this repository.\n", 42 | "Git offers native capabilities to manage subprojects via, for example, `git submodules` and `git subtrees`, and you can extend this notebook to use any of the aforementioned tools that best fit your workflow.\n", 43 | "\n", 44 | "One of the great advantages of using AWS CodeCommit in this context is its native integration with IAM for authentication purposes, meaning we can use the SageMaker execution role to interact with the git server without the need to worry about how to store and retrieve credentials. Of course, you can always replace AWS CodeCommit with any other git-based version control system such as GitHub, GitLab, or Bitbucket, keeping in mind that you will need to handle the credentials in a secure manner, for example, by introducing [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) to store and pull credentials at run time in the notebook as well as in the processing and training jobs.\n", 45 | "\n", 46 | "Setting the appropriate permissions on the SageMaker execution role will also allow the SageMaker processing and training jobs to interact securely with AWS CodeCommit." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "8952c81c", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "%%sh\n", 57 | "\n", 58 | "## Create the repository\n", 59 | "\n", 60 | "repo_name=\"sagemaker-dvc-sample\"\n", 61 | "\n", 62 | "aws codecommit create-repository --repository-name ${repo_name} --repository-description \"Sample repository to describe how to use dvc with sagemaker and codecommit\"\n", 63 | "\n", 64 | "account=$(aws sts get-caller-identity --query Account --output text)\n", 65 | "\n", 66 | "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", 67 | "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", 68 | "region=${region:-eu-west-1}\n", 69 | "\n", 70 | "## repo_name is already in the .gitignore of the root repo\n", 71 | "\n", 72 | "mkdir -p ${repo_name}\n", 73 | "cd ${repo_name}\n", 74 | "\n", 75 | "# initialize new repo in subfolder\n", 76 | "git init\n", 77 | "## Change the remote to the codecommit\n", 78 | "git remote add origin https://git-codecommit.\"${region}\".amazonaws.com/v1/repos/\"${repo_name}\"\n", 79 | "\n", 80 | "# Configure git - change it according to your needs\n", 81 | "git config --global user.email \"sagemaker-studio-user@example.com\"\n", 82 | "git config --global user.name \"SageMaker Studio User\"\n", 83 | "\n", 84 | "git config --global credential.helper '!aws codecommit credential-helper $@'\n", 85 | "git config --global credential.UseHttpPath true\n", 86 | "\n", 87 | "# Initialize dvc\n", 88 | "dvc init\n", 89 | "\n", 90 | "git commit -m 'Add dvc configuration'\n", 91 | "\n", 92 | "# Set the DVC remote storage to S3 - uses the sagemaker standard default bucket\n", 93 | "dvc remote add -d storage s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc\n", 94 | "git commit .dvc/config -m \"initialize DVC local remote\"\n", 95 | "\n", 96 | "# set the DVC cache to S3\n", 97 | "dvc remote add s3cache s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc/cache\n", 98 | "dvc config cache.s3
s3cache\n", 99 | "\n", 100 | "# disable sending anonymized data to dvc for troubleshooting\n", 101 | "dvc config core.analytics false\n", 102 | "\n", 103 | "git add .dvc/config\n", 104 | "git commit -m 'update dvc config'\n", 105 | "\n", 106 | "git push --set-upstream origin master #--force" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "93d74587", 112 | "metadata": {}, 113 | "source": [ 114 | "## Part 2: Processing and Training with DVC and SageMaker\n", 115 | "\n", 116 | "In this section we explore two different approaches to tackle our problem and how we can keep track of the 2 tests using SageMaker Experiments.\n", 117 | "\n", 118 | "The high level conceptual architecture is depicted in the figure below.\n", 119 | "\n", 120 | "\n", 121 | "Fig. 1 High level architecture\n", 122 | "\n", 123 | "\n", 124 | "### Import libraries and initial setup\n", 125 | "\n", 126 | "Lets start by importing the libraries and setup variables that will be useful as we go along in the notebook." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "id": "bdbc951e", 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "import boto3\n", 137 | "import sagemaker\n", 138 | "import time\n", 139 | "from time import strftime\n", 140 | "\n", 141 | "boto_session = boto3.Session()\n", 142 | "sagemaker_session = sagemaker.Session(boto_session=boto_session)\n", 143 | "sm_client = boto3.client(\"sagemaker\")\n", 144 | "region = boto_session.region_name\n", 145 | "bucket = sagemaker_session.default_bucket()\n", 146 | "role = sagemaker.get_execution_role()\n", 147 | "account = sagemaker_session.boto_session.client(\"sts\").get_caller_identity()[\"Account\"]\n", 148 | "\n", 149 | "prefix = 'DEMO-sagemaker-experiments-dvc'\n", 150 | "\n", 151 | "print(f\"account: {account}\")\n", 152 | "print(f\"bucket: {bucket}\")\n", 153 | "print(f\"region: {region}\")\n", 154 | "print(f\"role: {role}\")" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "id": "8edec916", 160 | "metadata": {}, 161 | "source": [ 162 | "### Prepare raw data\n", 163 | "\n", 164 | "We upload the raw data to S3 in the default bucket." 
165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "id": "cae15de4", 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "import pandas as pd\n", 175 | "import numpy as np\n", 176 | "\n", 177 | "from sklearn.datasets import fetch_california_housing\n", 178 | "from sklearn.model_selection import train_test_split\n", 179 | "\n", 180 | "from pathlib import Path\n", 181 | "\n", 182 | "databunch = fetch_california_housing()\n", 183 | "dataset = np.concatenate((databunch[\"target\"].reshape(-1, 1), databunch[\"data\"]), axis=1)\n", 184 | "\n", 185 | "print(f\"Dataset shape = {dataset.shape}\")\n", 186 | "np.savetxt(\"dataset.csv\", dataset, delimiter=\",\")\n", 187 | "\n", 188 | "data_prefix_path = f\"{prefix}/input/dataset.csv\"\n", 189 | "s3_data_path = f\"s3://{bucket}/{data_prefix_path}\"\n", 190 | "print(f\"Raw data location in S3: {s3_data_path}\")\n", 191 | "\n", 192 | "s3 = boto3.client(\"s3\")\n", 193 | "s3.upload_file(\"dataset.csv\", bucket, data_prefix_path)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "id": "08e50d6a", 199 | "metadata": {}, 200 | "source": [ 201 | "### Setup SageMaker Experiments\n", 202 | "\n", 203 | "Amazon SageMaker Experiments have been built for data scientists that are performing different experiments as part of their model development process and want a simple way to organize, track, compare, and evaluate their machine learning experiments.\n", 204 | "\n", 205 | "Let’s start first with an overview of Amazon SageMaker Experiments features:\n", 206 | "\n", 207 | "* Organize Experiments: Amazon SageMaker Experiments structures experimentation with a first top level entity called experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, parameters, and artifacts. You can picture experiments as the top level “folder” for organizing your hypotheses, your trials as the “subfolders” for each group test run, and your trial components as your “files” for each instance of a test run.\n", 208 | "* Track Experiments: Amazon SageMaker Experiments allows the data scientist to track experiments automatically or manually. Amazon SageMaker Experiments offers the possibility to automatically assign the sagemaker jobs to a trial specifying the `experiment_config` argument, or to manually call the tracking APIs.\n", 209 | "* Compare and Evaluate Experiments: The integration of Amazon SageMaker Experiments with Amazon SageMaker Studio makes it easier to produce data visualizations and compare different trials to identify the best combination of hyperparameters.\n", 210 | "\n", 211 | "Now, in order to track this test in SageMaker, we need to create an experiment." 
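Once the trials below have produced their processing and training jobs, the comparison described above can also be done programmatically. A hedged sketch using `ExperimentAnalytics` from the SageMaker Python SDK (not executed in this notebook):

```python
# Illustrative sketch: pull all trial components of the experiment into a pandas DataFrame.
from sagemaker.analytics import ExperimentAnalytics

analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session,
    experiment_name="DEMO-sagemaker-experiments-dvc",
)
trial_components_df = analytics.dataframe()

# Keep the identifying column plus any median-AE metric columns for a quick comparison
cols = [c for c in trial_components_df.columns if c == "TrialComponentName" or "median-AE" in c]
print(trial_components_df[cols])
```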
212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "id": "fc5aa1cf", 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "from smexperiments.experiment import Experiment\n", 222 | "from smexperiments.trial import Trial\n", 223 | "from smexperiments.trial_component import TrialComponent\n", 224 | "from smexperiments.tracker import Tracker\n", 225 | "\n", 226 | "experiment_name = 'DEMO-sagemaker-experiments-dvc'\n", 227 | "\n", 228 | "# create the experiment if it doesn't exist\n", 229 | "try:\n", 230 | " my_experiment = Experiment.load(experiment_name=experiment_name)\n", 231 | " print(\"existing experiment loaded\")\n", 232 | "except Exception as ex:\n", 233 | " if \"ResourceNotFound\" in str(ex):\n", 234 | " my_experiment = Experiment.create(\n", 235 | " experiment_name = experiment_name,\n", 236 | " description = \"How to integrate DVC\"\n", 237 | " )\n", 238 | " print(\"new experiment created\")\n", 239 | " else:\n", 240 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 241 | " print(\"Dont go forward!\")\n", 242 | " raise" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "2c13953c", 248 | "metadata": {}, 249 | "source": [ 250 | "We need to also define trials within the experiment.\n", 251 | "While it is possible to have any number of trials within an experiment, for our excercise, we will create 2 trials, one for each processing strategy.\n", 252 | "\n", 253 | "### Test 1: generate single files for training and validation\n", 254 | "\n", 255 | "In this test, we show how to create a processing script that fetches the raw data directly from S3 as an input, process it to create the triplet `train`, `validation` and `test`, and store the results back to S3 using `dvc`. Furthermore, we show how you can pair `dvc` with SageMaker native tracking capabilities when executing Processing and Training Jobs and via SageMaker Experiments." 
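As described in the sections that follow, the processing script receives the train/test split ratio as a command line argument and the DVC coordinates as environment variables. A minimal, illustrative sketch of that plumbing (the real parsing lives in `source_dir/preprocessing-experiment.py`):

```python
# Illustrative sketch of how the processing script can pick up its configuration;
# the argument and variable names follow the ones used in this notebook.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
args, _ = parser.parse_known_args()

dvc_repo_url = os.environ["DVC_REPO_URL"]
dvc_branch = os.environ["DVC_BRANCH"]

print(f"Split ratio {args.train_test_split_ratio} on branch {dvc_branch} of {dvc_repo_url}")
```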
256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "id": "83654fb0", 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "first_trial_name = \"dvc-trial-single-file\"\n", 266 | "\n", 267 | "try:\n", 268 | " my_first_trial = Trial.load(trial_name=first_trial_name)\n", 269 | " print(\"existing trial loaded\")\n", 270 | "except Exception as ex:\n", 271 | " if \"ResourceNotFound\" in str(ex):\n", 272 | " my_first_trial = Trial.create(\n", 273 | " experiment_name=experiment_name,\n", 274 | " trial_name=first_trial_name,\n", 275 | " )\n", 276 | " print(\"new trial created\")\n", 277 | " else:\n", 278 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 279 | " print(\"Don't go forward!\")\n", 280 | " raise" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "id": "93aa35a4", 286 | "metadata": {}, 287 | "source": [ 288 | "### Processing script: version data with DVC\n", 289 | "\n", 290 | "The processing script expects the address of the git repository and the branch to create for storing the `dvc` metadata, both passed via environment variables.\n", 291 | "The datasets themselves will then be stored in S3.\n", 292 | "Environment variables are automatically tracked in SageMaker Experiments in the automatically generated TrialComponent.\n", 293 | "The TrialComponent generated by SageMaker can be loaded within the Processing Job and further enriched with any extra data, which then becomes available for visualization in the SageMaker Studio UI.\n", 294 | "In our case, we will store the following data:\n", 295 | "* `DVC_REPO_URL`\n", 296 | "* `DVC_BRANCH`\n", 297 | "* `USER`\n", 298 | "* `data_commit_hash`\n", 299 | "* `train_test_split_ratio`" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "id": "f1b391c3", 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "!pygmentize 'source_dir/preprocessing-experiment.py'" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "5cb3bc2f", 315 | "metadata": {}, 316 | "source": [ 317 | "### SageMaker Processing job\n", 318 | "\n", 319 | "SageMaker Processing gives us the possibility to execute our processing script on container images managed by AWS that are optimized to run on the AWS infrastructure.\n", 320 | "If our script requires additional dependencies, we can supply a `requirements.txt` file.\n", 321 | "Upon starting the processing job, SageMaker will `pip`-install all the libraries we need (e.g., `dvc`-related libraries).\n", 322 | "\n", 323 | "We now have all the ingredients to execute our SageMaker Processing Job:\n", 324 | "* a processing script that can process several arguments (i.e., `--train-test-split-ratio`) and two environment variables (i.e., `DVC_REPO_URL` and `DVC_BRANCH`)\n", 325 | "* a `requirements.txt` file\n", 326 | "* a git repository (in AWS CodeCommit)\n", 327 | "* a SageMaker Experiment and a Trial" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "id": "400b363d", 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "from sagemaker.processing import FrameworkProcessor, ProcessingInput\n", 338 | "from sagemaker.sklearn.estimator import SKLearn\n", 339 | "\n", 340 | "dvc_repo_url = \"codecommit::{}://sagemaker-dvc-sample\".format(region)\n", 341 | "dvc_branch = my_first_trial.trial_name\n", 342 | "\n", 343 | "script_processor = FrameworkProcessor(\n", 344 | " estimator_cls=SKLearn,\n", 345 | " framework_version='0.23-1',\n", 346 | " instance_count=1,\n", 347
| " instance_type='ml.m5.xlarge',\n", 348 | " env={\n", 349 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 350 | " \"DVC_BRANCH\": dvc_branch,\n", 351 | " \"USER\": \"sagemaker\"\n", 352 | " },\n", 353 | " role=role\n", 354 | ")\n", 355 | "\n", 356 | "experiment_config={\n", 357 | " \"ExperimentName\": my_experiment.experiment_name,\n", 358 | " \"TrialName\": my_first_trial.trial_name\n", 359 | "}" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "id": "5b4b7f46", 365 | "metadata": {}, 366 | "source": [ 367 | "Executing the processing job will take around 3-4 minutes." 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "id": "1a4dfc06", 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "%%time\n", 378 | "\n", 379 | "script_processor.run(\n", 380 | " code='./source_dir/preprocessing-experiment.py',\n", 381 | " dependencies=['./source_dir/requirements.txt'],\n", 382 | " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", 383 | " experiment_config=experiment_config,\n", 384 | " arguments=[\"--train-test-split-ratio\", \"0.2\"]\n", 385 | ")\n" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "id": "991638a4", 391 | "metadata": {}, 392 | "source": [ 393 | "### Create an estimator and fit the model" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "id": "8071e644", 399 | "metadata": {}, 400 | "source": [ 401 | "To use DVC integration, pass a `dvc_repo_url` and `dvc_branch` as environmental variables when you create the Estimator object.\n", 402 | "\n", 403 | "We will train on the `dvc-trial-single-file` branch first.\n", 404 | "\n", 405 | "When doing `dvc pull` in the training script, the following dataset structure will be generated:\n", 406 | "\n", 407 | "```\n", 408 | "dataset\n", 409 | " |-- train\n", 410 | " | |-- california_train.csv\n", 411 | " |-- test\n", 412 | " | |-- california_test.csv\n", 413 | " |-- validation\n", 414 | " | |-- california_validation.csv\n", 415 | "```\n", 416 | "\n", 417 | "#### Metric definition\n", 418 | "\n", 419 | "SageMaker emits every log that is going to STDOUT to CloudWatch. 
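This means the training script itself decides what can be captured: only values it actually prints end up in the logs. As an illustration only (the repository's `train.py` has its own implementation), a median absolute error line that the `median-AE` regex used below would match can be produced like this:

```python
# Illustrative only: print a metric line that the regex below ("AE-at-50th-percentile: ([0-9.]+)") can match.
import numpy as np

def report_median_ae(y_true, y_pred):
    median_ae = np.percentile(np.abs(np.asarray(y_true) - np.asarray(y_pred)), 50)
    # SageMaker scrapes this stdout line from the training job's CloudWatch logs
    print(f"AE-at-50th-percentile: {median_ae:.4f}")
```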
In order to capture the metrics we are interested in, we need to specify a metric definition object to define the format of the metrics via regex.\n", 420 | "By doing so, SageMaker will know how to capture the metrics from the CloudWatch logs of the training job.\n", 421 | "\n", 422 | "In our case, we are interested in the median error.\n", 423 | "```\n", 424 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 425 | "```" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "id": "d464d52b", 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 436 | "\n", 437 | "hyperparameters={ \n", 438 | " \"learning_rate\" : 1,\n", 439 | " \"depth\": 6\n", 440 | " }\n", 441 | "estimator = SKLearn(\n", 442 | " entry_point='train.py',\n", 443 | " source_dir='source_dir',\n", 444 | " role=role,\n", 445 | " metric_definitions=metric_definitions,\n", 446 | " hyperparameters=hyperparameters,\n", 447 | " instance_count=1,\n", 448 | " instance_type='ml.m5.large',\n", 449 | " framework_version='0.23-1',\n", 450 | " base_job_name='training-with-dvc-data',\n", 451 | " environment={\n", 452 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 453 | " \"DVC_BRANCH\": dvc_branch,\n", 454 | " \"USER\": \"sagemaker\"\n", 455 | " }\n", 456 | ")\n", 457 | "\n", 458 | "experiment_config={\n", 459 | " \"ExperimentName\": my_experiment.experiment_name,\n", 460 | " \"TrialName\": my_first_trial.trial_name\n", 461 | "}" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "id": "2b4e1302", 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "%%time\n", 472 | "\n", 473 | "estimator.fit(experiment_config=experiment_config)" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "id": "1aaa3a46", 479 | "metadata": {}, 480 | "source": [ 481 | "On the logs above you can see those lines, indicating about the files pulled by dvc:\n", 482 | "\n", 483 | "```\n", 484 | "Running dvc pull command\n", 485 | "A train/california_train.csv\n", 486 | "A test/california_test.csv\n", 487 | "A validation/california_validation.csv\n", 488 | "3 files added and 3 files fetched\n", 489 | "Starting the training.\n", 490 | "Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", 491 | "Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", 492 | "```" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "id": "f4de6faf", 498 | "metadata": {}, 499 | "source": [ 500 | "### Test 2: generate multiple files for training and validation" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": null, 506 | "id": "2cfe7996", 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "second_trial_name = \"dvc-trial-multi-files\"\n", 511 | "\n", 512 | "try:\n", 513 | " my_second_trial = Trial.load(trial_name=second_trial_name)\n", 514 | " print(\"existing trial loaded\")\n", 515 | "except Exception as ex:\n", 516 | " if \"ResourceNotFound\" in str(ex):\n", 517 | " my_second_trial = Trial.create(\n", 518 | " experiment_name=experiment_name,\n", 519 | " trial_name=second_trial_name,\n", 520 | " )\n", 521 | " print(\"new trial created\")\n", 522 | " else:\n", 523 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 524 | " print(\"Dont go forward!\")\n", 525 | " raise" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "id": 
"a90c0238", 531 | "metadata": {}, 532 | "source": [ 533 | "Differently from the first processing script, we now create out of the original dataset multiple files for training and validation and store the `dvc` metadata in a different branch." 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "id": "f7eda4b2", 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "!pygmentize 'source_dir/preprocessing-experiment-multifiles.py'" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": null, 549 | "id": "25c05bae", 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "from sagemaker.processing import FrameworkProcessor, ProcessingInput\n", 554 | "from sagemaker.sklearn.estimator import SKLearn\n", 555 | "\n", 556 | "dvc_branch = my_second_trial.trial_name\n", 557 | "\n", 558 | "script_processor = FrameworkProcessor(\n", 559 | " estimator_cls=SKLearn,\n", 560 | " framework_version='0.23-1',\n", 561 | " instance_count=1,\n", 562 | " instance_type='ml.m5.xlarge',\n", 563 | " env={\n", 564 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 565 | " \"DVC_BRANCH\": dvc_branch,\n", 566 | " \"USER\": \"sagemaker\",\n", 567 | " },\n", 568 | " role=role\n", 569 | ")\n", 570 | "\n", 571 | "experiment_config={\n", 572 | " \"ExperimentName\": my_experiment.experiment_name,\n", 573 | " \"TrialName\": my_second_trial.trial_name\n", 574 | "}" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "id": "624a6b65", 580 | "metadata": {}, 581 | "source": [ 582 | "Executing the processing job will take ~5 minutes" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "id": "099c269e", 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "%%time\n", 593 | "\n", 594 | "script_processor.run(\n", 595 | " code='./source_dir/preprocessing-experiment-multifiles.py',\n", 596 | " dependencies=['./source_dir/requirements.txt'],\n", 597 | " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", 598 | " experiment_config=experiment_config,\n", 599 | " arguments=[\"--train-test-split-ratio\", \"0.1\"]\n", 600 | ")" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "id": "bb210f96", 606 | "metadata": {}, 607 | "source": [ 608 | "We will now train on the `dvc-trial-multi-files` branch.\n", 609 | "\n", 610 | "When doing `dvc pull`, this is the dataset structure:\n", 611 | "\n", 612 | "```\n", 613 | "dataset\n", 614 | " |-- train\n", 615 | " | |-- california_train_1.csv\n", 616 | " | |-- california_train_2.csv\n", 617 | " | |-- california_train_3.csv\n", 618 | " | |-- california_train_4.csv\n", 619 | " | |-- california_train_5.csv\n", 620 | " |-- test\n", 621 | " | |-- california_test.csv\n", 622 | " |-- validation\n", 623 | " | |-- california_validation_1.csv\n", 624 | " | |-- california_validation_2.csv\n", 625 | " | |-- california_validation_3.csv\n", 626 | "```" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "id": "5cadd7b2", 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 637 | "\n", 638 | "hyperparameters={ \n", 639 | " \"learning_rate\" : 1,\n", 640 | " \"depth\": 6\n", 641 | " }\n", 642 | "\n", 643 | "estimator = SKLearn(\n", 644 | " entry_point='train.py',\n", 645 | " source_dir='source_dir',\n", 646 | " role=role,\n", 647 | " metric_definitions=metric_definitions,\n", 648 | " 
hyperparameters=hyperparameters,\n", 649 | " instance_count=1,\n", 650 | " instance_type='ml.m5.large',\n", 651 | " framework_version='0.23-1',\n", 652 | " base_job_name='training-with-dvc-data',\n", 653 | " environment={\n", 654 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 655 | " \"DVC_BRANCH\": dvc_branch,\n", 656 | " \"USER\": \"sagemaker\"\n", 657 | " }\n", 658 | ")\n", 659 | "\n", 660 | "experiment_config={\n", 661 | " \"ExperimentName\": my_experiment.experiment_name,\n", 662 | " \"TrialName\": my_second_trial.trial_name,\n", 663 | "}" 664 | ] 665 | }, 666 | { 667 | "cell_type": "markdown", 668 | "id": "ce4aa067", 669 | "metadata": {}, 670 | "source": [ 671 | "The training job will take around ~5 minutes" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": null, 677 | "id": "fc8e059e", 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "%%time\n", 682 | "\n", 683 | "estimator.fit(experiment_config=experiment_config)" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "id": "f34d4aa3", 689 | "metadata": {}, 690 | "source": [ 691 | "On the logs above you can see those lines, indicating about the files pulled by dvc:\n", 692 | "\n", 693 | "```\n", 694 | "Running dvc pull command\n", 695 | "A validation/california_validation_2.csv\n", 696 | "A validation/california_validation_1.csv\n", 697 | "A validation/california_validation_3.csv\n", 698 | "A train/california_train_4.csv\n", 699 | "A train/california_train_5.csv\n", 700 | "A train/california_train_2.csv\n", 701 | "A train/california_train_3.csv\n", 702 | "A train/california_train_1.csv\n", 703 | "A test/california_test.csv\n", 704 | "9 files added and 9 files fetched\n", 705 | "Starting the training.\n", 706 | "Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']\n", 707 | "Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']\n", 708 | "```" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "id": "514a78d9", 714 | "metadata": {}, 715 | "source": [ 716 | "## Part 3: Hosting your model in SageMaker" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": null, 722 | "id": "3330abce", 723 | "metadata": {}, 724 | "outputs": [], 725 | "source": [ 726 | "from sagemaker.serializers import CSVSerializer\n", 727 | "\n", 728 | "predictor = estimator.deploy(1, \"ml.t2.medium\", serializer=CSVSerializer())" 729 | ] 730 | }, 731 | { 732 | "cell_type": "markdown", 733 | "id": "bf141b4f", 734 | "metadata": {}, 735 | "source": [ 736 | "### Fetch the testing data\n", 737 | "\n", 738 | "Read the raw test data stored in S3 via DVC created by the SageMaker Processing Job. We use the `dvc` python API." 
739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "id": "3a5bcecb-a6ca-4d65-a239-be53c32f737a", 745 | "metadata": {}, 746 | "outputs": [], 747 | "source": [ 748 | "import io\n", 749 | "import dvc.api\n", 750 | "\n", 751 | "git_repo_https = f\"https://git-codecommit.{region}.amazonaws.com/v1/repos/sagemaker-dvc-sample\"\n", 752 | "\n", 753 | "raw = dvc.api.read(\n", 754 | " \"dataset/test/california_test.csv\",\n", 755 | " repo=git_repo_https,\n", 756 | " rev=dvc_branch\n", 757 | ")" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "id": "86d9841f", 763 | "metadata": {}, 764 | "source": [ 765 | "Prepare the data" 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": null, 771 | "id": "d931947d", 772 | "metadata": {}, 773 | "outputs": [], 774 | "source": [ 775 | "test = pd.read_csv(io.StringIO(raw), sep=\",\", header=None)\n", 776 | "X_test = test.iloc[:, 1:].values\n", 777 | "y_test = test.iloc[:, 0:1].values" 778 | ] 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "id": "a5b796e8", 783 | "metadata": {}, 784 | "source": [ 785 | "## Invoke endpoint with the Python SDK" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "id": "e0bd7491", 792 | "metadata": {}, 793 | "outputs": [], 794 | "source": [ 795 | "predicted = predictor.predict(X_test)\n", 796 | "for i in range(len(predicted)-1):\n", 797 | " print(f\"predicted: {predicted[i]}, actual: {y_test[i][0]}\")" 798 | ] 799 | }, 800 | { 801 | "cell_type": "markdown", 802 | "id": "a976c7bf", 803 | "metadata": {}, 804 | "source": [ 805 | "### Delete the Endpoint\n", 806 | "\n", 807 | "Make sure to delete the endpoint to avoid un-expected costs" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": null, 813 | "id": "4f0231db", 814 | "metadata": {}, 815 | "outputs": [], 816 | "source": [ 817 | "predictor.delete_endpoint()" 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "id": "70a7499e", 823 | "metadata": {}, 824 | "source": [ 825 | "### (Optional) Delete the Experiment, and all Trails, TrialComponents" 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": null, 831 | "id": "6093da3e", 832 | "metadata": {}, 833 | "outputs": [], 834 | "source": [ 835 | "#my_experiment.delete_all(action=\"--force\")" 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "id": "897cc5f2", 841 | "metadata": {}, 842 | "source": [ 843 | "### (Optional) Delete the AWS CodeCommit repository" 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": null, 849 | "id": "cb3f762a", 850 | "metadata": {}, 851 | "outputs": [], 852 | "source": [ 853 | "#!aws codecommit delete-repository --repository-name sagemaker-dvc-sample" 854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": null, 859 | "id": "bf55fe4a-97d0-41d6-9796-83491cb0c640", 860 | "metadata": {}, 861 | "outputs": [], 862 | "source": [] 863 | } 864 | ], 865 | "metadata": { 866 | "instance_type": "ml.t3.medium", 867 | "kernelspec": { 868 | "display_name": "Python [conda env: dvc] (conda-env-dvc-kernel/latest)", 869 | "language": "python", 870 | "name": "conda-env-dvc-py__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:583558296381:image/conda-env-dvc-kernel" 871 | }, 872 | "language_info": { 873 | "codemirror_mode": { 874 | "name": "ipython", 875 | "version": 3 876 | }, 877 | "file_extension": ".py", 878 | "mimetype": "text/x-python", 879 | "name": "python", 880 | "nbconvert_exporter": "python", 
881 | "pygments_lexer": "ipython3", 882 | "version": "3.8.12" 883 | } 884 | }, 885 | "nbformat": 4, 886 | "nbformat_minor": 5 887 | } 888 | -------------------------------------------------------------------------------- /img/high-level-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/8cde12fece3dbfa46ae5e9c57fb7faffc1100616/img/high-level-architecture.png -------------------------------------------------------------------------------- /img/sm-experiments-tracker-dvc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/8cde12fece3dbfa46ae5e9c57fb7faffc1100616/img/sm-experiments-tracker-dvc.png -------------------------------------------------------------------------------- /img/studio-custom-image-select.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/8cde12fece3dbfa46ae5e9c57fb7faffc1100616/img/studio-custom-image-select.png -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM continuumio/miniconda3:4.10.3 2 | 3 | COPY environment.yml . 4 | RUN conda env create -f environment.yml 5 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/README.md: -------------------------------------------------------------------------------- 1 | ## Conda Environments as Kernels 2 | 3 | This tutorial explains how to create a custom image for Amazon SageMaker Studio that has DVC already installed. 4 | The advantage of creating an image and make it available to all SageMaker Studio users is that it creates a consistent environment for the SageMake Studio users, which they could also run locally. 5 | 6 | This tutorial is heavily inspired by [this example](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/tree/main/examples/conda-env-kernel-image). 7 | Further information about custom images for SageMaker Studio can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi.html) 8 | 9 | ## Prerequisite 10 | 11 | * A Cloud9 environment with enough permissions 12 | 13 | ## Overview 14 | 15 | This custom image sample demonstrates how to create a custom Conda environment in a Docker image and use it as a custom kernel in SageMaker Studio. 16 | 17 | The Conda environment must have the appropriate kernel package installed, for e.g., `ipykernel` for a Python kernel. 18 | This example creates a Conda environment called `dvc` with a few Python packages (see [environment.yml](environment.yml)) and the `ipykernel`. 19 | SageMaker Studio will automatically recognize this Conda environment as a kernel named `conda-env-dvc-py`. 
20 | 21 | ### Clone the GitHub repository 22 | ```bash 23 | git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo 24 | ``` 25 | 26 | ### Resize Cloud9 27 | 28 | ```bash 29 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/ 30 | ./resize-cloud9.sh 20 31 | ``` 32 | 33 | ## Build the Docker images for SageMaker Studio 34 | 35 | Set some basic environment variables: 36 | 37 | ```bash 38 | sudo yum install jq -y 39 | export REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region') 40 | echo "export REGION=${REGION}" | tee -a ~/.bash_profile 41 | 42 | export ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account') 43 | echo "export ACCOUNT_ID=${ACCOUNT_ID}" | tee -a ~/.bash_profile 44 | 45 | export IMAGE_NAME=conda-env-dvc-kernel 46 | echo "export IMAGE_NAME=${IMAGE_NAME}" | tee -a ~/.bash_profile 47 | ``` 48 | 49 | Build the Docker image and push it to Amazon ECR. 50 | 51 | ```bash 52 | # Login to ECR 53 | aws --region ${REGION} ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom 54 | 55 | # Create the ECR repository 56 | aws --region ${REGION} ecr create-repository --repository-name smstudio-custom 57 | 58 | # Build the image - it might take a few minutes to complete this step 59 | docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME} 60 | # Push the image to ECR 61 | docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME} 62 | ``` 63 | 64 | ## Associate a custom image to SageMaker Studio 65 | 66 | ### Prepare the environment to deploy with CDK 67 | 68 | Step 1: Navigate to the `cdk` directory: 69 | 70 | ```bash 71 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/cdk 72 | ``` 73 | 74 | Step 2: Create a virtual environment: 75 | 76 | ```bash 77 | python3 -m venv .cdk-venv 78 | ``` 79 | 80 | Step 3: Activate the virtual environment once it has been created: 81 | 82 | ```bash 83 | source .cdk-venv/bin/activate 84 | ``` 85 | 86 | Step 4: Install the required dependencies: 87 | 88 | ```bash 89 | pip3 install --upgrade pip 90 | pip3 install -r requirements.txt 91 | ``` 92 | 93 | Step 5: Install and bootstrap CDK v2: 94 | 95 | ```bash 96 | npm install -g aws-cdk@2.27.0 --force 97 | cdk bootstrap 98 | ``` 99 | 100 | ### Create a new SageMaker Studio 101 | (Skip to [Update an existing SageMaker Studio](#update-an-existing-sagemaker-studio) if you already have a SageMaker Studio domain.) 102 | 103 | Step 6: Deploy the CDK stack (CDK deploys a stack named `sagemakerStudioUserCDK`, see [app.py](cdk/app.py), which you can verify in `CloudFormation`): 104 | 105 | ```bash 106 | cdk deploy --require-approval never 107 | ``` 108 | 109 | CDK creates the following resources via `CloudFormation` (a quick way to verify them is shown after this list): 110 | * provisions a new SageMaker Studio domain 111 | * creates and attaches a SageMaker execution role, i.e., `RoleForSagemakerStudioUsers`, with the right permissions to the SageMaker Studio domain 112 | * creates a SageMaker Image and a SageMaker Image Version from the Docker image `conda-env-dvc-kernel` created earlier 113 | * creates an AppImageConfig which specifies how the kernel gateway should be configured 114 | * provisions a SageMaker Studio user, i.e., `data-scientist-dvc`, with the correct SageMaker execution role and makes the custom SageMaker Studio image available to it 115 |
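
Once the stack has finished deploying, an optional way to double-check that the image, image config, and domain were registered is the AWS CLI. The commands below are only a verification aid and are not required for the rest of the walkthrough:

```bash
# List Studio domains and check that the new domain appears
aws --region ${REGION} sagemaker list-domains

# Inspect the custom SageMaker image and its kernel gateway configuration
aws --region ${REGION} sagemaker describe-image --image-name conda-env-dvc-kernel
aws --region ${REGION} sagemaker describe-app-image-config --app-image-config-name conda-env-dvc-kernel-config
```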
116 | ### Update an existing SageMaker Studio 117 | 118 | If you have an existing SageMaker Studio environment, you first need to retrieve the existing SageMaker Studio domain ID, deploy a "reduced" version of the CDK stack, and update the SageMaker Studio domain configuration. 119 | 120 | Step 6: Set the `DOMAIN_ID` environment variable with your domain ID and save it to your `bash_profile`: 121 | 122 | ```bash 123 | export DOMAIN_ID=$(aws sagemaker list-domains | jq -r '.Domains[0].DomainId') 124 | echo "export DOMAIN_ID=${DOMAIN_ID}" | tee -a ~/.bash_profile 125 | ``` 126 | 127 | Step 7: Deploy the CDK stack (with the `DOMAIN_ID` environment variable set, CDK deploys the stack `sagemakerStudioUserCDK` against your existing domain, which you can verify in `CloudFormation`): 128 | 129 | ```bash 130 | cdk deploy --require-approval never 131 | ``` 132 | 133 | CDK creates the following resources via `CloudFormation`: 134 | * creates and attaches a SageMaker execution role, i.e., `RoleForSagemakerStudioUsers`, with the right permissions to your existing SageMaker Studio domain 135 | * creates a SageMaker Image and a SageMaker Image Version from the Docker image `conda-env-dvc-kernel` created earlier 136 | * creates an AppImageConfig which specifies how the kernel gateway should be configured 137 | * provisions a SageMaker Studio user, i.e., `data-scientist-dvc`, with the correct SageMaker execution role and makes the custom SageMaker Studio image available to it 138 | 139 | Step 8: Update the SageMaker Studio domain configuration: 140 | 141 | ```bash 142 | # inject your DOMAIN_ID into the configuration file 143 | sed -i 's//'"$DOMAIN_ID"'/' ../update-domain-input.json 144 | 145 | # update the sagemaker studio domain 146 | aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-input.json 147 | ``` 148 | 149 | Open the newly created SageMaker Studio user, i.e., `data-scientist-dvc`. 150 | 151 | ### Execute the lab 152 | 153 | In the SageMaker Studio domain, launch `Studio` for the `data-scientist-dvc` user. 154 | Open a terminal and clone the repository: 155 | 156 | ```bash 157 | git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo 158 | ``` 159 | 160 | and open the [dvc_sagemaker_script_mode.ipynb](../dvc_sagemaker_script_mode.ipynb) notebook. 161 | 162 | When prompted, ensure that you select the Custom Image `conda-env-dvc-kernel` as shown below. 163 | 164 | ![image info](../img/studio-custom-image-select.png) 165 | 166 | ### Cleanup 167 | 168 | Before removing all created resources, make sure that all apps are deleted from the `data-scientist-dvc` user, i.e., all `KernelGateway` apps as well as the default `JupyterServer` (see the example below).
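
If you prefer the CLI to the SageMaker console, the sketch below shows one way to list and delete the remaining apps. It assumes `DOMAIN_ID` holds your Studio domain ID; `<app-type>` and `<app-name>` are placeholders for the values returned by `list-apps`.

```bash
# List all apps still attached to the data-scientist-dvc user profile
aws --region ${REGION} sagemaker list-apps --user-profile-name-equals data-scientist-dvc

# Delete one app at a time; repeat for every KernelGateway app and the JupyterServer app
aws --region ${REGION} sagemaker delete-app \
    --domain-id ${DOMAIN_ID} \
    --user-profile-name data-scientist-dvc \
    --app-type <app-type> \
    --app-name <app-name>
```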
169 | 170 | Once done, you can destroy the CDK stack by running 171 | 172 | ```bash 173 | cdk destroy 174 | ``` 175 | 176 | In case you started off from an existing domain, please also execute the following command: 177 | 178 | ```bash 179 | # inject your DOMAIN_ID into the configuration file 180 | sed -i 's//'"$DOMAIN_ID"'/' ../update-domain-no-custom-images.json 181 | 182 | # update the sagemaker studio domain 183 | aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-no-custom-images.json 184 | ``` -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | package-lock.json 3 | .pytest_cache 4 | *.egg-info 5 | .idea/ 6 | # Byte-compiled / optimized / DLL files 7 | __pycache__/ 8 | *.py[cod] 9 | *$py.class 10 | 11 | # Environments 12 | .env 13 | .venv 14 | env/ 15 | venv/ 16 | env.bak/ 17 | venv.bak/ 18 | .cdk-venv 19 | 20 | # CDK Context & Staging files 21 | .cdk.staging/ 22 | cdk.out/ 23 | /cdk.context.json 24 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from aws_cdk import App, Stack 4 | 5 | from sagemakerStudioCDK.sagemaker_studio_stack import SagemakerStudioStack 6 | import os 7 | import boto3 8 | 9 | sts_client = boto3.client("sts") 10 | account_id = os.environ.get('ACCOUNT_ID', sts_client.get_caller_identity()["Account"]) 11 | region = os.environ.get('REGION', 'eu-west-1') 12 | 13 | domain_id = os.environ.get('DOMAIN_ID', None) 14 | 15 | app = App() 16 | 17 | if domain_id is None: 18 | print("Create a new studio domain") 19 | else: 20 | print("Existing domain ID: {}".format(domain_id)) 21 | 22 | SagemakerStudioStack(app, "sagemakerStudioUserCDK", domain_id, env={"account": account_id, 'region': region}) 23 | 24 | app.synth() 25 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py" 3 | } 4 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/requirements.txt: -------------------------------------------------------------------------------- 1 | -e . 2 | pytest 3 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/sagemakerStudioCDK/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/8cde12fece3dbfa46ae5e9c57fb7faffc1100616/sagemaker-studio-dvc-image/cdk/sagemakerStudioCDK/__init__.py -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/sagemakerStudioCDK/sagemaker_studio_stack.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from aws_cdk import ( 5 | aws_iam as iam, 6 | aws_ec2 as ec2, 7 | aws_sagemaker as sagemaker, 8 | Stack, 9 | CfnOutput 10 | ) 11 | 12 | from constructs import Construct 13 | 14 | sagemaker_arn_region_account_mapping = { 15 | "eu-west-1": "470317259841", 16 | "us-east-1": "081325390199", 17 | "us-east-2": "429704687514", 18 | "us-west-1": "742091327244", 19 | "us-west-2": "236514542706", 20 | "af-south-1": "559312083959", 21 | "ap-east-1": "493642496378", 22 | "ap-south-1": "394103062818", 23 | "ap-northeast-2": "806072073708", 24 | "ap-southeast-1": "492261229750", 25 | "ap-southeast-2": "452832661640", 26 | "ap-northeast-1": "102112518831", 27 | "ca-central-1": "310906938811", 28 | "eu-central-1": "936697816551", 29 | "eu-west-2": "712779665605", 30 | "eu-west-3": "615547856133", 31 | "eu-north-1": "243637512696", 32 | "eu-south-1": "592751261982", 33 | "sa-east-1": "782484402741", 34 | } 35 | 36 | 37 | class SagemakerStudioStack(Stack): 38 | 39 | def __init__(self, scope: Construct, construct_id: str, domain_id: str, **kwargs) -> None: 40 | super().__init__(scope, construct_id, **kwargs) 41 | 42 | # Create a SageMaker 43 | role_sagemaker_studio_domain = iam.Role( 44 | self, 45 | 'RoleForSagemakerStudioUsers', 46 | assumed_by=iam.CompositePrincipal( 47 | iam.ServicePrincipal('sagemaker.amazonaws.com'), 48 | iam.ServicePrincipal('codebuild.amazonaws.com'), # needed to use the sm-build command 49 | ), 50 | managed_policies=[ 51 | iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess") 52 | ], 53 | inline_policies={ 54 | "code-commit-policy": iam.PolicyDocument( 55 | statements=[ 56 | iam.PolicyStatement( 57 | effect=iam.Effect.ALLOW, 58 | actions=[ 59 | "codecommit:AssociateApprovalRuleTemplateWithRepository", 60 | "codecommit:BatchAssociateApprovalRuleTemplateWithRepositories", 61 | "codecommit:BatchDisassociateApprovalRuleTemplateFromRepositories", 62 | "codecommit:BatchGet*", 63 | "codecommit:BatchDescribe*", 64 | "codecommit:Create*", 65 | "codecommit:DeleteBranch", 66 | "codecommit:DeleteFile", 67 | "codecommit:Describe*", 68 | "codecommit:DisassociateApprovalRuleTemplateFromRepository", 69 | "codecommit:EvaluatePullRequestApprovalRules", 70 | "codecommit:Get*", 71 | "codecommit:List*", 72 | "codecommit:Merge*", 73 | "codecommit:OverridePullRequestApprovalRules", 74 | "codecommit:Put*", 75 | "codecommit:Post*", 76 | "codecommit:TagResource", 77 | "codecommit:Test*", 78 | "codecommit:UntagResource", 79 | "codecommit:Update*", 80 | "codecommit:GitPull", 81 | "codecommit:GitPush", 82 | "codecommit:Delete*" 83 | ], 84 | resources=[f"arn:aws:codecommit:{self.region}:{self.account}:sagemaker-dvc-sample"] 85 | ) 86 | ] 87 | ), 88 | "s3bucket": iam.PolicyDocument( 89 | statements=[ 90 | iam.PolicyStatement( 91 | effect=iam.Effect.ALLOW, 92 | actions=["s3:ListBucket","s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:PutObjectTagging"], 93 | resources=["arn:aws:s3:::sagemaker*"] 94 | ) 95 | ] 96 | ), 97 | "sm-build-policy": iam.PolicyDocument( 98 | statements=[ 99 | iam.PolicyStatement( 100 | sid="EcrAuthorizationTokenRetrieval", 101 | effect=iam.Effect.ALLOW, 102 | actions=[ 103 | "ecr:BatchGetImage", 104 | "ecr:GetDownloadUrlForLayer" 105 | ], 106 | resources=[ 107 | "arn:aws:ecr:*:763104351884:repository/*", 108 | "arn:aws:ecr:*:217643126080:repository/*", 109 | "arn:aws:ecr:*:727897471807:repository/*", 110 | "arn:aws:ecr:*:626614931356:repository/*", 111 | "arn:aws:ecr:*:683313688378:repository/*", 112 | 
"arn:aws:ecr:*:520713654638:repository/*", 113 | "arn:aws:ecr:*:462105765813:repository/*" 114 | ] 115 | ), 116 | iam.PolicyStatement( 117 | effect=iam.Effect.ALLOW, 118 | actions=[ 119 | "ecr:CreateRepository", 120 | "ecr:BatchGetImage", 121 | "ecr:CompleteLayerUpload", 122 | "ecr:DescribeImages", 123 | "ecr:DescribeRepositories", 124 | "ecr:UploadLayerPart", 125 | "ecr:ListImages", 126 | "ecr:InitiateLayerUpload", 127 | "ecr:BatchCheckLayerAvailability", 128 | "ecr:PutImage" 129 | ], 130 | resources=["arn:aws:ecr:*:*:repository/sagemaker-studio*"] 131 | ), 132 | iam.PolicyStatement( 133 | effect=iam.Effect.ALLOW, 134 | actions=[ 135 | "codebuild:DeleteProject", 136 | "codebuild:CreateProject", 137 | "codebuild:BatchGetBuilds", 138 | "codebuild:StartBuild" 139 | ], 140 | resources=["arn:aws:codebuild:*:*:project/sagemaker-studio*"] 141 | ), 142 | iam.PolicyStatement( 143 | effect=iam.Effect.ALLOW, 144 | actions=["logs:CreateLogStream"], 145 | resources=["arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*"] 146 | ), 147 | iam.PolicyStatement( 148 | effect=iam.Effect.ALLOW, 149 | actions=[ 150 | "logs:GetLogEvents", 151 | "logs:PutLogEvents" 152 | ], 153 | resources=["arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*:log-stream:*"] 154 | ), 155 | iam.PolicyStatement( 156 | effect=iam.Effect.ALLOW, 157 | actions=["ecr:GetAuthorizationToken"], 158 | resources=["arn:aws:ecr:*:*:*"] 159 | ), 160 | iam.PolicyStatement( 161 | effect=iam.Effect.ALLOW, 162 | actions=["iam:PassRole"], 163 | resources=["arn:aws:iam::*:role/*"], 164 | conditions={ 165 | "StringLikeIfExists":{ 166 | "iam:PassedToService":"codebuild.amazonaws.com" 167 | } 168 | } 169 | ) 170 | ] 171 | ) 172 | } 173 | ) 174 | 175 | cfn_image = sagemaker.CfnImage( 176 | self, 177 | "DvcImage", 178 | image_name="conda-env-dvc-kernel", 179 | image_role_arn=role_sagemaker_studio_domain.role_arn, 180 | ) 181 | 182 | cfn_image_version = sagemaker.CfnImageVersion( 183 | self, 184 | "DvcImageVersion", 185 | image_name="conda-env-dvc-kernel", 186 | base_image="{}.dkr.ecr.{}.amazonaws.com/smstudio-custom:conda-env-dvc-kernel".format(self.account, self.region) 187 | ) 188 | 189 | cfn_image_version.add_depends_on(cfn_image) 190 | 191 | cfn_app_image_config = sagemaker.CfnAppImageConfig( 192 | self, 193 | "DvcAppImageConfig", 194 | app_image_config_name="conda-env-dvc-kernel-config", 195 | kernel_gateway_image_config=sagemaker.CfnAppImageConfig.KernelGatewayImageConfigProperty( 196 | kernel_specs=[ 197 | sagemaker.CfnAppImageConfig.KernelSpecProperty( 198 | name="conda-env-dvc-py", 199 | display_name="Python [conda env: dvc]" 200 | ) 201 | ], 202 | file_system_config=sagemaker.CfnAppImageConfig.FileSystemConfigProperty( 203 | default_gid=0, 204 | default_uid=0, 205 | mount_path="/root" 206 | ) 207 | ), 208 | ) 209 | 210 | cfn_app_image_config.add_depends_on(cfn_image_version) 211 | 212 | team = "data-scientist-dvc" 213 | 214 | if domain_id is not None: 215 | my_default_datascience_user = sagemaker.CfnUserProfile( 216 | self, 217 | "CfnUserProfile", 218 | domain_id=domain_id, 219 | user_profile_name=team, 220 | user_settings=sagemaker.CfnUserProfile.UserSettingsProperty( 221 | execution_role=role_sagemaker_studio_domain.role_arn, 222 | kernel_gateway_app_settings=sagemaker.CfnUserProfile.KernelGatewayAppSettingsProperty( 223 | custom_images=[ 224 | sagemaker.CfnUserProfile.CustomImageProperty( 225 | app_image_config_name="conda-env-dvc-kernel-config", 226 | image_name="conda-env-dvc-kernel", 227 | ) 228 | ] 229 | ), 230 | 
jupyter_server_app_settings=sagemaker.CfnUserProfile.JupyterServerAppSettingsProperty( 231 | default_resource_spec=sagemaker.CfnUserProfile.ResourceSpecProperty( 232 | instance_type="system", 233 | sage_maker_image_arn="arn:aws:sagemaker:{}:{}:image/jupyter-server-3".format(self.region, sagemaker_arn_region_account_mapping[self.region]), 234 | ) 235 | ), 236 | ) 237 | ) 238 | 239 | my_default_datascience_user.add_depends_on(cfn_app_image_config) 240 | else: 241 | 242 | self.role_sagemaker_studio_domain = role_sagemaker_studio_domain 243 | self.sagemaker_domain_name = "DomainForSagemakerStudio" 244 | 245 | default_vpc_id = ec2.Vpc.from_lookup( 246 | self, 247 | "VPC", 248 | is_default=True 249 | ) 250 | 251 | self.vpc_id = default_vpc_id.vpc_id 252 | self.public_subnet_ids = [public_subnet.subnet_id for public_subnet in default_vpc_id.public_subnets] 253 | 254 | my_sagemaker_domain = sagemaker.CfnDomain( 255 | self, 256 | "SageMakerStudioDomain", 257 | auth_mode="IAM", 258 | default_user_settings=sagemaker.CfnDomain.UserSettingsProperty( 259 | execution_role=self.role_sagemaker_studio_domain.role_arn, 260 | kernel_gateway_app_settings=sagemaker.CfnDomain.KernelGatewayAppSettingsProperty( 261 | custom_images=[ 262 | sagemaker.CfnDomain.CustomImageProperty( 263 | app_image_config_name="conda-env-dvc-kernel-config", 264 | image_name="conda-env-dvc-kernel", 265 | )] 266 | ), 267 | jupyter_server_app_settings=sagemaker.CfnDomain.JupyterServerAppSettingsProperty( 268 | default_resource_spec=sagemaker.CfnDomain.ResourceSpecProperty( 269 | instance_type="system", 270 | sage_maker_image_arn="arn:aws:sagemaker:{}:{}:image/jupyter-server-3".format(self.region, sagemaker_arn_region_account_mapping[self.region]), 271 | ) 272 | ), 273 | ), 274 | domain_name="domain-with-custom-conda-env", 275 | subnet_ids=self.public_subnet_ids, 276 | vpc_id=self.vpc_id 277 | ) 278 | 279 | my_sagemaker_domain.add_depends_on(cfn_app_image_config) 280 | 281 | my_default_datascience_user = sagemaker.CfnUserProfile( 282 | self, 283 | "CfnUserProfile", 284 | domain_id=my_sagemaker_domain.attr_domain_id, 285 | user_profile_name=team, 286 | user_settings=sagemaker.CfnUserProfile.UserSettingsProperty( 287 | execution_role=self.role_sagemaker_studio_domain.role_arn 288 | ) 289 | ) 290 | 291 | CfnOutput( 292 | self, 293 | "DomainIdSagemaker", 294 | value=my_sagemaker_domain.attr_domain_id, 295 | description="The sagemaker domain ID", 296 | export_name="DomainIdSagemaker" 297 | ) 298 | 299 | 300 | CfnOutput( 301 | self, 302 | f"cfnoutput{team}", 303 | value=my_default_datascience_user.attr_user_profile_arn, 304 | description="The User Arn TeamA domain ID", 305 | export_name=F"UserArn{team}" 306 | ) -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | 4 | setuptools.setup( 5 | name="sagemakerStudioCDK", 6 | version="0.0.1", 7 | 8 | description="aws-cdk-sagemaker-studio", 9 | 10 | author="frpaolo", 11 | 12 | package_dir={"": "sagemakerStudioCDK"}, 13 | packages=setuptools.find_packages(where="sagemakerStudioCDK"), 14 | 15 | install_requires=[ 16 | "aws-cdk-lib==2.27.0", 17 | "constructs==10.0.34", 18 | "boto3" 19 | ], 20 | 21 | python_requires=">=3.6", 22 | 23 | classifiers=[ 24 | "Development Status :: 4 - Beta", 25 | 26 | "Intended Audience :: Developers", 27 | 28 | "License :: OSI Approved :: Apache Software License", 29 | 30 | "Programming Language :: 
JavaScript", 31 | "Programming Language :: Python :: 3 :: Only", 32 | "Programming Language :: Python :: 3.6", 33 | "Programming Language :: Python :: 3.7", 34 | "Programming Language :: Python :: 3.8", 35 | 36 | "Topic :: Software Development :: Code Generators", 37 | "Topic :: Utilities", 38 | 39 | "Typing :: Typed", 40 | ], 41 | ) 42 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/environment.yml: -------------------------------------------------------------------------------- 1 | name: dvc 2 | channels: 3 | - conda-forge 4 | - intel 5 | dependencies: 6 | - python=3.8 7 | - pip=22.0.4 8 | - ipykernel=6.9.1 9 | - pip: 10 | - dvc==2.8.3 11 | - dvc[s3]==2.8.3 12 | - s3fs==2021.11.0 13 | - awscli 14 | - boto3==1.17.106 15 | - sagemaker 16 | - sagemaker-studio-image-build==0.6.0 17 | - sagemaker-experiments==0.1.35 18 | - scikit-learn==1.0.2 19 | - protobuf==3.20 20 | - git-remote-codecommit==1.16 21 | 22 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/resize-cloud9.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB. 4 | SIZE=${1:-20} 5 | 6 | # Get the ID of the environment host Amazon EC2 instance. 7 | INSTANCEID=$(curl http://169.254.169.254/latest/meta-data/instance-id) 8 | REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/') 9 | 10 | # Get the ID of the Amazon EBS volume associated with the instance. 11 | VOLUMEID=$(aws ec2 describe-instances \ 12 | --instance-id $INSTANCEID \ 13 | --query "Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId" \ 14 | --output text \ 15 | --region $REGION) 16 | 17 | # Resize the EBS volume. 18 | aws ec2 modify-volume --volume-id $VOLUMEID --size $SIZE 19 | 20 | # Wait for the resize to finish. 21 | while [ \ 22 | "$(aws ec2 describe-volumes-modifications \ 23 | --volume-id $VOLUMEID \ 24 | --filters Name=modification-state,Values="optimizing","completed" \ 25 | --query "length(VolumesModifications)"\ 26 | --output text)" != "1" ]; do 27 | sleep 1 28 | done 29 | 30 | #Check if we're on an NVMe filesystem 31 | if [[ -e "/dev/xvda" && $(readlink -f /dev/xvda) = "/dev/xvda" ]] 32 | then 33 | # Rewrite the partition table so that the partition takes up all the space that it can. 34 | sudo growpart /dev/xvda 1 35 | 36 | # Expand the size of the file system. 37 | # Check if we're on AL2 38 | STR=$(cat /etc/os-release) 39 | SUB="VERSION_ID=\"2\"" 40 | if [[ "$STR" == *"$SUB"* ]] 41 | then 42 | sudo xfs_growfs -d / 43 | else 44 | sudo resize2fs /dev/xvda1 45 | fi 46 | 47 | else 48 | # Rewrite the partition table so that the partition takes up all the space that it can. 49 | sudo growpart /dev/nvme0n1 1 50 | 51 | # Expand the size of the file system. 
52 | # Check if we're on AL2 53 | STR=$(cat /etc/os-release) 54 | SUB="VERSION_ID=\"2\"" 55 | if [[ "$STR" == *"$SUB"* ]] 56 | then 57 | sudo xfs_growfs -d / 58 | else 59 | sudo resize2fs /dev/nvme0n1p1 60 | fi 61 | fi -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/update-domain-input.json: -------------------------------------------------------------------------------- 1 | { 2 | "DomainId": "", 3 | "DefaultUserSettings": { 4 | "KernelGatewayAppSettings": { 5 | "CustomImages": [ 6 | { 7 | "ImageName": "conda-env-dvc-kernel", 8 | "AppImageConfigName": "conda-env-dvc-kernel-config" 9 | } 10 | ] 11 | } 12 | } 13 | } -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/update-domain-no-custom-images.json: -------------------------------------------------------------------------------- 1 | { 2 | "DomainId": "", 3 | "DefaultUserSettings": { 4 | "KernelGatewayAppSettings": { 5 | "CustomImages": [] 6 | } 7 | } 8 | } -------------------------------------------------------------------------------- /source_dir/preprocessing-experiment-multifiles.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import os 4 | import argparse 5 | import sys 6 | import subprocess 7 | 8 | from pathlib import Path 9 | 10 | import numpy as np 11 | import pandas as pd 12 | 13 | from sklearn.model_selection import train_test_split 14 | 15 | from smexperiments.tracker import Tracker 16 | 17 | import dvc.api 18 | 19 | from git.repo.base import Repo 20 | 21 | # Prepare paths 22 | input_data_path = os.path.join("/opt/ml/processing/input", "dataset.csv") 23 | data_path = 'dataset' 24 | base_dir = f"./sagemaker-dvc-sample/{data_path}" 25 | file_types = ['test','train','validation'] 26 | 27 | dvc_repo_url = os.environ.get('DVC_REPO_URL') 28 | dvc_branch = os.environ.get('DVC_BRANCH') 29 | user = os.environ.get('USER', "sagemaker") 30 | 31 | def configure_git(): 32 | subprocess.check_call(['git', 'config', '--global', 'user.email', '"sagemaker-processing@example.com"']) 33 | subprocess.check_call(['git', 'config', '--global', 'user.name', user]) 34 | 35 | def split_dataframe(df, num=5): 36 | chunk_size = int(df.shape[0] / num) 37 | chunks = [df.iloc[i:i+chunk_size] for i in range(0,df.shape[0], chunk_size)] 38 | return chunks 39 | 40 | def clone_dvc_git_repo(): 41 | print(f"Cloning repo: {dvc_repo_url}") 42 | repo = Repo.clone_from(dvc_repo_url, './sagemaker-dvc-sample') 43 | return repo 44 | 45 | def generate_train_validation_files(ratio): 46 | for path in ['train', 'validation', 'test']: 47 | output_dir = Path(f"{base_dir}/{path}/") 48 | output_dir.mkdir(parents=True, exist_ok=True) 49 | 50 | print("Read dataset") 51 | dataset = pd.read_csv(input_data_path) 52 | train, other = train_test_split(dataset, test_size=ratio) 53 | validation, test = train_test_split(other, test_size=ratio) 54 | 55 | print("create train, validation, test") 56 | for index, chunk in enumerate(split_dataframe(pd.DataFrame(train))): 57 | chunk.to_csv(f"{base_dir}/train/california_train_{index + 1}.csv", header=False, index=False) 58 | 59 | for index, chunk in enumerate(split_dataframe(pd.DataFrame(validation), 3)): 60 | chunk.to_csv(f"{base_dir}/validation/california_validation_{index + 1}.csv", header=False, index=False) 61 | 62 | pd.DataFrame(test).to_csv(f"{base_dir}/test/california_test.csv", 
header=False, index=False) 63 | print("data created") 64 | 65 | def sync_data_with_dvc(repo): 66 | os.chdir(base_dir) 67 | print(f"Create branch {dvc_branch}") 68 | try: 69 | repo.git.checkout('-b', dvc_branch) 70 | print(f"Create a new branch: {dvc_branch}") 71 | except: 72 | repo.git.checkout(dvc_branch) 73 | print(f"Checkout existing branch: {dvc_branch}") 74 | print("Add files to DVC") 75 | 76 | for file_type in file_types: 77 | subprocess.check_call(['dvc', 'add', f"{file_type}/"]) 78 | 79 | repo.git.add(all=True) 80 | repo.git.commit('-m', f"'add data for {dvc_branch}'") 81 | print("Push data to DVC") 82 | subprocess.check_call(['dvc', 'push']) 83 | print("Push dvc metadata to git") 84 | repo.remote(name='origin') 85 | repo.git.push('--set-upstream', repo.remote().name, dvc_branch, '--force') 86 | 87 | sha = repo.head.commit.hexsha 88 | print(f"commit hash: {sha}") 89 | 90 | with Tracker.load() as tracker: 91 | tracker.log_parameters({"data_commit_hash": sha}) 92 | for file_type in file_types: 93 | path = dvc.api.get_url( 94 | f"{data_path}/{file_type}", 95 | repo=dvc_repo_url, 96 | rev=dvc_branch 97 | ) 98 | tracker.log_output(name=f"{file_type}",value=path) 99 | 100 | if __name__=="__main__": 101 | parser = argparse.ArgumentParser() 102 | parser.add_argument("--train-test-split-ratio", type=float, default=0.3) 103 | args, _ = parser.parse_known_args() 104 | 105 | train_test_split_ratio = args.train_test_split_ratio 106 | 107 | with Tracker.load() as tracker: 108 | tracker.log_parameters( 109 | { 110 | "train_test_split_ratio": train_test_split_ratio 111 | } 112 | ) 113 | 114 | configure_git() 115 | repo = clone_dvc_git_repo() 116 | generate_train_validation_files(train_test_split_ratio) 117 | sync_data_with_dvc(repo) 118 | -------------------------------------------------------------------------------- /source_dir/preprocessing-experiment.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | import os 4 | import argparse 5 | import sys 6 | import subprocess 7 | 8 | from pathlib import Path 9 | 10 | import numpy as np 11 | import pandas as pd 12 | 13 | from sklearn.model_selection import train_test_split 14 | 15 | from smexperiments.tracker import Tracker 16 | 17 | import dvc.api 18 | 19 | from git.repo.base import Repo 20 | 21 | # Prepare paths 22 | input_data_path = os.path.join("/opt/ml/processing/input", "dataset.csv") 23 | data_path = 'dataset' 24 | base_dir = f"./sagemaker-dvc-sample/{data_path}" 25 | file_types = ['test','train','validation'] 26 | 27 | dvc_repo_url = os.environ.get('DVC_REPO_URL') 28 | dvc_branch = os.environ.get('DVC_BRANCH') 29 | user = os.environ.get('USER', "sagemaker") 30 | 31 | def configure_git(): 32 | subprocess.check_call(['git', 'config', '--global', 'user.email', '"sagemaker-processing@example.com"']) 33 | subprocess.check_call(['git', 'config', '--global', 'user.name', user]) 34 | 35 | def clone_dvc_git_repo(): 36 | print(f"Cloning repo: {dvc_repo_url}") 37 | repo = Repo.clone_from(dvc_repo_url, './sagemaker-dvc-sample') 38 | return repo 39 | 40 | def generate_train_validation_files(ratio): 41 | for path in ['train', 'validation', 'test']: 42 | output_dir = Path(f"{base_dir}/{path}/") 43 | output_dir.mkdir(parents=True, exist_ok=True) 44 | 45 | print("Read dataset") 46 | dataset = pd.read_csv(input_data_path) 47 | train, other = train_test_split(dataset, test_size=ratio) 48 | validation, test = train_test_split(other, test_size=ratio) 49 | 50 | print("create train, validation, test") 51 | pd.DataFrame(train).to_csv(f"{base_dir}/train/california_train.csv", header=False, index=False) 52 | pd.DataFrame(validation).to_csv(f"{base_dir}/validation/california_validation.csv", header=False, index=False) 53 | pd.DataFrame(test).to_csv(f"{base_dir}/test/california_test.csv", header=False, index=False) 54 | print("data created") 55 | 56 | def sync_data_with_dvc(repo): 57 | os.chdir(base_dir) 58 | print(f"Create branch {dvc_branch}") 59 | try: 60 | repo.git.checkout('-b', dvc_branch) 61 | print(f"Create a new branch: {dvc_branch}") 62 | except: 63 | repo.git.checkout(dvc_branch) 64 | print(f"Checkout existing branch: {dvc_branch}") 65 | print("Add files to DVC") 66 | 67 | for file_type in file_types: 68 | subprocess.check_call(['dvc', 'add', f"{file_type}/california_{file_type}.csv"]) 69 | 70 | repo.git.add(all=True) 71 | repo.git.commit('-m', f"'add data for {dvc_branch}'") 72 | print("Push data to DVC") 73 | subprocess.check_call(['dvc', 'push']) 74 | print("Push dvc metadata to git") 75 | repo.remote(name='origin') 76 | repo.git.push('--set-upstream', repo.remote().name, dvc_branch, '--force') 77 | 78 | sha = repo.head.commit.hexsha 79 | print(f"commit hash: {sha}") 80 | 81 | with Tracker.load() as tracker: 82 | tracker.log_parameters({"data_commit_hash": sha}) 83 | for file_type in file_types: 84 | path = dvc.api.get_url( 85 | f"{data_path}/{file_type}/california_{file_type}.csv", 86 | repo=dvc_repo_url, 87 | rev=dvc_branch 88 | ) 89 | tracker.log_output(name=f"california_{file_type}",value=path) 90 | 91 | if __name__=="__main__": 92 | parser = argparse.ArgumentParser() 93 | parser.add_argument("--train-test-split-ratio", type=float, default=0.3) 94 | args, _ = parser.parse_known_args() 95 | 96 | train_test_split_ratio = args.train_test_split_ratio 97 | 98 | with Tracker.load() as tracker: 99 | tracker.log_parameters( 100 | { 101 | "train_test_split_ratio": train_test_split_ratio, 102 | } 103 | ) 104 | 105 | 
configure_git() 106 | repo = clone_dvc_git_repo() 107 | generate_train_validation_files(train_test_split_ratio) 108 | sync_data_with_dvc(repo) 109 | -------------------------------------------------------------------------------- /source_dir/requirements.txt: -------------------------------------------------------------------------------- 1 | catboost 2 | dvc==2.8.3 3 | s3fs==2021.11.0 4 | dvc[s3]==2.8.3 5 | git-remote-codecommit 6 | sagemaker-experiments 7 | gitpython -------------------------------------------------------------------------------- /source_dir/train.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import logging 3 | import os 4 | import json 5 | import re 6 | import subprocess 7 | import traceback 8 | import sys 9 | 10 | import argparse 11 | import joblib 12 | 13 | from sklearn.ensemble import RandomForestRegressor 14 | from catboost import CatBoostRegressor 15 | 16 | import numpy as np 17 | import pandas as pd 18 | 19 | prefix = '/opt/ml/' 20 | input_path = prefix + 'input/data' 21 | dataset_path = prefix + 'input/data/dataset' 22 | train_channel_name = 'train' 23 | validation_channel_name = 'validation' 24 | 25 | output_path = os.path.join(prefix, 'output') 26 | model_path = os.path.join(prefix, 'model') 27 | model_file_name = 'catboost-regressor-model.dump' 28 | #model_file_name = 'model.joblib' 29 | train_path = os.path.join(dataset_path, train_channel_name) 30 | validation_path = os.path.join(dataset_path, validation_channel_name) 31 | 32 | dvc_repo_url = os.environ.get('DVC_REPO_URL') 33 | dvc_branch = os.environ.get('DVC_BRANCH') 34 | user = os.environ.get('USER', "sagemaker") 35 | 36 | def fetch_data_from_dvc(): 37 | print(f"Cloning repo: {dvc_repo_url}, git branch: {dvc_branch}") 38 | subprocess.check_call(["git", "clone", "--depth", "1", "--branch", dvc_branch, dvc_repo_url, input_path]) 39 | print("dvc pull") 40 | os.chdir(input_path + "/dataset/") 41 | subprocess.check_call(["dvc", "pull"]) 42 | 43 | # Model serving 44 | """ 45 | Deserialize fitted model 46 | """ 47 | def model_fn(model_dir): 48 | model = CatBoostRegressor() 49 | model.load_model(os.path.join(model_path, model_file_name)) 50 | return model 51 | 52 | if __name__ == '__main__': 53 | print("extracting arguments") 54 | parser = argparse.ArgumentParser() 55 | 56 | # hyperparameters sent by the client are passed as command-line arguments to the script. 
57 | # to simplify the demo we only expose a couple of CatBoost hyperparameters 58 | parser.add_argument("--learning_rate", type=float, default=1) 59 | parser.add_argument("--depth", type=int, default=5) 60 | 61 | args, _ = parser.parse_known_args() 62 | 63 | fetch_data_from_dvc() 64 | 65 | print('Starting the training.') 66 | 67 | try: 68 | # Take the set of train files and read them all into a single pandas dataframe (the CSVs are written without a header) 69 | train_input_files = [os.path.join(train_path, file) for file in glob.glob(train_path+"/*.csv")] 70 | if len(train_input_files) == 0: 71 | raise ValueError(('There are no files in {}.\n' + 72 | 'This usually indicates that the channel ({}) was incorrectly specified,\n' + 73 | 'the data specification in S3 was incorrectly specified or the role specified\n' + 74 | 'does not have permission to access the data.').format(train_path, train_channel_name)) 75 | print('Found train files: {}'.format(train_input_files)) 76 | train_df = pd.DataFrame() 77 | for file in train_input_files: 78 | if train_df.shape[0] == 0: 79 | train_df = pd.read_csv(file, header=None) 80 | else: 81 | df = pd.read_csv(file, header=None) 82 | train_df = train_df.append(df, ignore_index=True) 83 | 84 | # Take the set of validation files and read them all into a single pandas dataframe 85 | validation_input_files = [os.path.join(validation_path, file) for file in glob.glob(validation_path+"/*.csv")] 86 | if len(validation_input_files) == 0: 87 | raise ValueError(('There are no files in {}.\n' + 88 | 'This usually indicates that the channel ({}) was incorrectly specified,\n' + 89 | 'the data specification in S3 was incorrectly specified or the role specified\n' + 90 | 'does not have permission to access the data.').format(validation_path, validation_channel_name)) 91 | print('Found validation files: {}'.format(validation_input_files)) 92 | validation_df = pd.DataFrame() 93 | for file in validation_input_files: 94 | if validation_df.shape[0] == 0: 95 | validation_df = pd.read_csv(file, header=None) 96 | else: 97 | df = pd.read_csv(file, header=None) 98 | validation_df = validation_df.append(df, ignore_index=True) 99 | 100 | # Assumption is that the label is the first column 101 | print('building training and validation datasets') 102 | X_train = train_df.iloc[:, 1:].values 103 | y_train = train_df.iloc[:, 0:1].values 104 | X_validation = validation_df.iloc[:, 1:].values 105 | y_validation = validation_df.iloc[:, 0:1].values 106 | 107 | # define and train model 108 | model = CatBoostRegressor(learning_rate=args.learning_rate, depth=args.depth) 109 | 110 | model.fit(X_train, y_train, eval_set=(X_validation, y_validation), logging_level='Silent') 111 | 112 | # print abs error 113 | print('validating model') 114 | abs_err = np.abs(model.predict(X_validation) - y_validation) 115 | 116 | # print a couple of perf metrics 117 | for q in [10, 50, 90]: 118 | print('AE-at-' + str(q) + 'th-percentile: '+ str(np.percentile(a=abs_err, q=q))) 119 | 120 | path = os.path.join(model_path, model_file_name) 121 | model.save_model(path) 122 | 123 | print('Training complete.') 124 | 125 | except Exception as e: 126 | # Write out an error file. This will be returned as the failureReason in the 127 | # DescribeTrainingJob result. 128 | trc = traceback.format_exc() 129 | with open(os.path.join(output_path, 'failure'), 'w') as s: 130 | s.write('Exception during training: ' + str(e) + '\n' + trc) 131 | # Printing this causes the exception to be in the training job logs, as well. 132 | print('Exception during training: ' + str(e) + '\n' + trc) 133 | # A non-zero exit code causes the training job to be marked as Failed.
134 | sys.exit(255) 135 | 136 | # A zero exit code causes the job to be marked as Succeeded. 137 | sys.exit(0) 138 | --------------------------------------------------------------------------------