├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── container
│   ├── processing
│   │   └── Dockerfile
│   └── train_and_serve
│       ├── Dockerfile
│       ├── README.md
│       └── catboost_regressor
│           ├── nginx.conf
│           ├── predictor.py
│           ├── serve
│           ├── train
│           └── wsgi.py
├── dvc_sagemaker_byoc.ipynb
├── dvc_sagemaker_script_mode.ipynb
├── img
│   ├── high-level-architecture.png
│   ├── sm-experiments-tracker-dvc.png
│   └── studio-custom-image-select.png
├── sagemaker-studio-dvc-image
│   ├── Dockerfile
│   ├── README.md
│   ├── cdk
│   │   ├── .gitignore
│   │   ├── app.py
│   │   ├── cdk.json
│   │   ├── requirements.txt
│   │   ├── sagemakerStudioCDK
│   │   │   ├── __init__.py
│   │   │   └── sagemaker_studio_stack.py
│   │   └── setup.py
│   ├── environment.yml
│   ├── resize-cloud9.sh
│   ├── update-domain-input.json
│   └── update-domain-no-custom-images.json
└── source_dir
    ├── preprocessing-experiment-multifiles.py
    ├── preprocessing-experiment.py
    ├── requirements.txt
    └── train.py

/.gitignore:
--------------------------------------------------------------------------------
1 | **/*.ipynb_checkpoints/
2 | sagemaker-dvc-sample
3 | .idea/amazon-sagemaker-experiments-dvc-demo.iml
4 | .idea/inspectionProfiles/profiles_settings.xml
5 | .idea/misc.xml
6 | .idea/modules.xml
7 | .idea/vcs.xml
8 | .idea/workspace.xml
9 | 
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guidelines
2 | 
3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
4 | documentation, we greatly value feedback and contributions from our community.
5 | 
6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
7 | information to effectively respond to your bug report or contribution.
8 | 
9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *main* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of 4 | this software and associated documentation files (the "Software"), to deal in 5 | the Software without restriction, including without limitation the rights to 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 7 | the Software, and to permit persons to whom the Software is furnished to do so. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 15 | 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SageMaker Experiments and DVC 2 | 3 | This sample shows how to use DVC within the SageMaker environment. 4 | In particular, we will look at how to build a custom image with DVC libraries installed by default to provide a consistent environment to your data scientists. 
5 | Furthermore, we show how you can integrate SageMaker Processing, SageMaker Training, and SageMaker Experiments with a DVC workflow.
6 | 
7 | For full details on how this works:
8 | 
9 | - Read the Blog post at: https://aws.amazon.com/blogs/machine-learning/track-your-ml-experiments-end-to-end-with-data-version-control-and-amazon-sagemaker-experiments/
10 | 
11 | ## Prerequisites
12 | 
13 | * An AWS Account
14 | * An IAM user with Admin-like permissions
15 | 
16 | If you do not have Admin-like permissions, we recommend at least the following permissions:
17 | * Administer Amazon ECR
18 | * Administer a SageMaker Studio Domain
19 | * Administer S3 (or at least any buckets with *sagemaker* in the bucket name)
20 | * Create IAM Roles
21 | * Create a Cloud9 environment
22 | 
23 | ## Setup
24 | 
25 | For the initial setup, we suggest using Cloud9 on a `t3.large` instance type.
26 | 
27 | ## Build a custom SageMaker Studio image with DVC already installed
28 | 
29 | We explain how to create a custom image for Amazon SageMaker Studio that has DVC already installed.
30 | The advantage of creating an image and making it available to all SageMaker Studio users is that it provides a consistent environment for SageMaker Studio users, which they can also run locally.
31 | 
32 | This tutorial is heavily inspired by [this example](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/tree/main/examples/conda-env-kernel-image).
33 | Further information about custom images for SageMaker Studio can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi.html).
34 | 
35 | ### Overview
36 | 
37 | This custom image sample demonstrates how to create a custom Conda environment in a Docker image and use it as a custom kernel in SageMaker Studio.
38 | 
39 | The Conda environment must have the appropriate kernel package installed, e.g., `ipykernel` for a Python kernel.
40 | This example creates a Conda environment called `dvc` with a few Python packages (see [environment.yml](environment.yml)) and the `ipykernel`.
41 | SageMaker Studio will automatically recognize this Conda environment as a kernel named `conda-env-dvc-py`.
42 | 
43 | #### Clone the GitHub repository
44 | ```bash
45 | git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo
46 | ```
47 | 
48 | #### Resize Cloud9
49 | 
50 | ```bash
51 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/
52 | ./resize-cloud9.sh 20
53 | ```
54 | 
55 | ### Build the Docker images for SageMaker Studio
56 | 
57 | Set some basic environment variables:
58 | 
59 | ```bash
60 | sudo yum install jq -y
61 | export REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')
62 | echo "export REGION=${REGION}" | tee -a ~/.bash_profile
63 | 
64 | export ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
65 | echo "export ACCOUNT_ID=${ACCOUNT_ID}" | tee -a ~/.bash_profile
66 | 
67 | export IMAGE_NAME=conda-env-dvc-kernel
68 | echo "export IMAGE_NAME=${IMAGE_NAME}" | tee -a ~/.bash_profile
69 | ```
70 | 
71 | Build the Docker image and push it to Amazon ECR.
72 | 
73 | ```bash
74 | # Login to ECR
75 | aws --region ${REGION} ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom
76 | 
77 | # Create the ECR repository
78 | aws --region ${REGION} ecr create-repository --repository-name smstudio-custom
79 | 
80 | # Build the image - it might take a few minutes to complete this step
81 | docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}
82 | # Push the image to ECR
83 | docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}
84 | ```
85 | 
86 | ### Associate the custom image with SageMaker Studio
87 | 
88 | #### Prepare the environment to deploy with CDK
89 | 
90 | Step 1: Navigate to the `cdk` directory:
91 | 
92 | ```bash
93 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/cdk
94 | ```
95 | 
96 | Step 2: Create a virtual environment:
97 | 
98 | ```bash
99 | python3 -m venv .cdk-venv
100 | ```
101 | 
102 | Step 3: Once the init process completes and the virtual environment is created, activate it:
103 | 
104 | ```bash
105 | source .cdk-venv/bin/activate
106 | ```
107 | 
108 | Step 4: Install the required dependencies:
109 | 
110 | ```bash
111 | pip3 install --upgrade pip
112 | pip3 install -r requirements.txt
113 | ```
114 | 
115 | Step 5: Install and bootstrap CDK v2 (the latest CDK version tested was `2.27.0`):
116 | 
117 | ```bash
118 | npm install -g aws-cdk@2.27.0 --force
119 | cdk bootstrap
120 | ```
121 | 
122 | #### Create a new SageMaker Studio domain
123 | (Skip to [Update an existing SageMaker Studio](#update-an-existing-sagemaker-studio) if you already have a SageMaker Studio domain.)
124 | 
125 | Step 6: Deploy the CDK stack (CDK deploys a stack named `sagemakerStudioCDK`, which you can verify in `CloudFormation`):
126 | 
127 | ```bash
128 | cdk deploy --require-approval never
129 | ```
130 | 
131 | CDK creates the following resources via `CloudFormation`:
132 | * provisions a new SageMaker Studio domain
133 | * creates and attaches a SageMaker execution role, i.e., `RoleForSagemakerStudioUsers`, with the right permissions to the SageMaker Studio domain
134 | * creates a SageMaker Image and a SageMaker Image Version from the Docker image `conda-env-dvc-kernel` we created earlier
135 | * creates an AppImageConfig that specifies how the kernel gateway should be configured
136 | * provisions a SageMaker Studio user, i.e., `data-scientist-dvc`, with the correct SageMaker execution role, and makes the custom SageMaker Studio image available to it
137 | 
138 | #### Update an existing SageMaker Studio
139 | 
140 | If you have an existing SageMaker Studio environment, we first need to retrieve the existing SageMaker Studio domain ID, deploy a "reduced" version of the CDK stack, and update the SageMaker Studio domain configuration.
141 | 
142 | Step 6: Set the `DOMAIN_ID` environment variable with your domain ID and save it to your `bash_profile`:
143 | 
144 | ```bash
145 | export DOMAIN_ID=$(aws sagemaker list-domains | jq -r '.Domains[0].DomainId')
146 | echo "export DOMAIN_ID=${DOMAIN_ID}" | tee -a ~/.bash_profile
147 | ```
148 | 
149 | Step 7: Deploy the CDK stack (because the `DOMAIN_ID` environment variable is set, CDK deploys a stack named `sagemakerUserCDK`, which you can verify in `CloudFormation`):
150 | 
151 | ```bash
152 | cdk deploy --require-approval never
153 | ```
154 | 
155 | CDK creates the following resources via `CloudFormation`:
156 | * creates and attaches a SageMaker execution role, i.e., `RoleForSagemakerStudioUsers`, with the right permissions to your existing SageMaker Studio domain
157 | * creates a SageMaker Image and a SageMaker Image Version from the Docker image `conda-env-dvc-kernel` we created earlier
158 | * creates an AppImageConfig that specifies how the kernel gateway should be configured
159 | * provisions a SageMaker Studio user, i.e., `data-scientist-dvc`, with the correct SageMaker execution role, and makes the custom SageMaker Studio image available to it
160 | 
161 | Step 8: Update the SageMaker Studio domain configuration:
162 | 
163 | ```bash
164 | # inject your DOMAIN_ID into the configuration file
165 | sed -i 's//'"$DOMAIN_ID"'/' ../update-domain-input.json
166 | 
167 | # update the sagemaker studio domain
168 | aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-input.json
169 | ```
170 | 
171 | Open the newly created SageMaker Studio user, i.e., `data-scientist-dvc`.
172 | 
173 | ![image info](./img/studio-custom-image-select.png)
174 | 
175 | 
176 | ### Execute the sample notebook
177 | 
178 | In the SageMaker Studio domain, launch `Studio` for the `data-scientist-dvc` user.
179 | Open a terminal and clone the repository:
180 | 
181 | ```bash
182 | git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo
183 | ```
184 | 
185 | and open the [dvc_sagemaker_script_mode.ipynb](./dvc_sagemaker_script_mode.ipynb) notebook.
186 | 
187 | When prompted, make sure you select the custom image `conda-env-dvc-kernel`, as shown in the screenshot above.
188 | 
189 | We provide two sample notebooks that show how to use DVC in combination with SageMaker:
190 | 
191 | * one that installs DVC in script mode by passing a `requirements.txt` file to both the processing job and the training job: [dvc_sagemaker_script_mode.ipynb](./dvc_sagemaker_script_mode.ipynb);
192 | * one that shows how to build the containers for the processing jobs, the training jobs, and inference: [dvc_sagemaker_byoc.ipynb](./dvc_sagemaker_byoc.ipynb).
193 | 
194 | Both notebooks are meant to be used within SageMaker Studio with the custom image created before.
195 | 
196 | ## Cleanup
197 | 
198 | Before removing all the resources created, you need to make sure that all apps are deleted from the `data-scientist-dvc` user, i.e., all `KernelGateway` apps, as well as the default `JupyterServer` app.
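If you prefer to clean up the apps programmatically instead of through the SageMaker console, a minimal `boto3` sketch along these lines can do it (it assumes `DOMAIN_ID` is still exported, as in the steps above, and targets the `data-scientist-dvc` user profile created by the stack):

```python
# Minimal sketch: delete all running apps for the user before `cdk destroy`.
import os
import boto3

sm = boto3.client("sagemaker")
domain_id = os.environ["DOMAIN_ID"]
user_profile_name = "data-scientist-dvc"

apps = sm.list_apps(DomainIdEquals=domain_id, UserProfileNameEquals=user_profile_name)["Apps"]
for app in apps:
    if app["Status"] not in ("Deleted", "Deleting"):
        print(f"Deleting {app['AppType']} app {app['AppName']}")
        sm.delete_app(
            DomainId=domain_id,
            UserProfileName=user_profile_name,
            AppType=app["AppType"],
            AppName=app["AppName"],
        )
```

Deletion is asynchronous, so wait until `list_apps` reports every app as `Deleted` before destroying the stack.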
199 | 200 | Once done, you can destroy the CDK stack by running 201 | 202 | ```bash 203 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/cdk 204 | cdk destroy 205 | ``` 206 | 207 | In case you started off from an existing domain, please also execute the following command: 208 | 209 | ```bash 210 | # inject your DOMAIN_ID into the configuration file 211 | sed -i 's//'"$DOMAIN_ID"'/' ../update-domain-no-custom-images.json 212 | 213 | # update the sagemaker studio domain 214 | aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-no-custom-images.json 215 | ``` 216 | 217 | ## Security 218 | 219 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 220 | 221 | ## License 222 | 223 | This library is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file. -------------------------------------------------------------------------------- /container/processing/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM public.ecr.aws/docker/library/python:3.7-slim 2 | 3 | RUN apt-get -y update && apt-get install -y --no-install-recommends wget git 4 | 5 | RUN pip3 install numpy pandas scikit-learn==1.0.2 6 | RUN pip3 install sagemaker-experiments==0.1.35 7 | RUN pip3 install git-remote-codecommit 8 | RUN pip3 install dvc==2.8.3 s3fs==2021.11.0 dvc[s3]==2.8.3 9 | 10 | # Configure git 11 | 12 | RUN git config --global user.email "sagemaker-processing@example.com" 13 | RUN git config --global user.name "SageMaker ProcessingJob User" 14 | 15 | ENV PYTHONUNBUFFERED=TRUE 16 | 17 | ENTRYPOINT ["python3"] 18 | -------------------------------------------------------------------------------- /container/train_and_serve/Dockerfile: -------------------------------------------------------------------------------- 1 | # Build an image that can do training and inference in SageMaker 2 | # This is a Python 3 image that uses the nginx, gunicorn, flask stack 3 | # for serving inferences in a stable way. 4 | 5 | FROM public.ecr.aws/docker/library/python:3.7-slim 6 | 7 | RUN apt-get -y update && apt-get install -y --no-install-recommends \ 8 | wget \ 9 | nginx \ 10 | git \ 11 | ca-certificates 12 | 13 | RUN pip install numpy==1.16.2 scipy==1.2.1 catboost pandas flask gevent gunicorn 14 | RUN pip install dvc==2.8.3 s3fs==2021.11.0 dvc[s3]==2.8.3 15 | RUN pip install git-remote-codecommit 16 | 17 | # Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard 18 | # output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE 19 | # keeps Python from writing the .pyc files which are unnecessary in this case. We also update 20 | # PATH so that the train and serve programs are found when the container is invoked. 21 | 22 | ENV PYTHONUNBUFFERED=TRUE 23 | ENV PYTHONDONTWRITEBYTECODE=TRUE 24 | ENV PATH="/opt/program:${PATH}" 25 | 26 | # Set up the program in the image 27 | COPY catboost_regressor /opt/program 28 | WORKDIR /opt/program 29 | 30 | -------------------------------------------------------------------------------- /container/train_and_serve/README.md: -------------------------------------------------------------------------------- 1 | # Bring-your-own Algorithm Sample 2 | 3 | This example shows how to package an algorithm for use with SageMaker. We have chosen a simple [CatBoost][catboost] implementation of regression to illustrate the procedure. 
4 | 
5 | SageMaker supports two execution modes: _training_, where the algorithm uses input data to train a new model, and _serving_, where the algorithm accepts HTTP requests and uses the previously trained model to do an inference (also called "scoring", "prediction", or "transformation").
6 | 
7 | The algorithm that we have built here supports both training and scoring in SageMaker with the same container image. It is perfectly reasonable to build an algorithm that supports only training _or_ scoring, as well as to build an algorithm that has separate container images for training and scoring.
8 | 
9 | In order to build a production grade inference server into the container, we use the following stack to make the implementer's job simple:
10 | 
11 | 1. __[nginx][nginx]__ is a light-weight layer that handles the incoming HTTP requests and manages the I/O in and out of the container efficiently.
12 | 2. __[gunicorn][gunicorn]__ is a WSGI pre-forking worker server that runs multiple copies of your application and load balances between them.
13 | 3. __[flask][flask]__ is a simple web framework used in the inference app that you write. It lets you respond to calls on the `/ping` and `/invocations` endpoints without having to write much code.
14 | 
15 | ## The Structure of the Sample Code
16 | 
17 | The components are as follows:
18 | 
19 | * __Dockerfile__: The _Dockerfile_ describes how the image is built and what it contains. It is a recipe for your container and gives you tremendous flexibility to construct almost any execution environment you can imagine. Here, we use the Dockerfile to describe a pretty standard Python science stack and the simple scripts that we're going to add to it. See the [Dockerfile reference][dockerfile] for what's possible here.
20 | 
21 | * __catboost_regressor__: The directory that contains the application to run in the container. See the next section for details about each of the files.
22 | 
23 | ### The application that runs inside the container
24 | 
25 | When SageMaker starts a container, it will invoke the container with an argument of either __train__ or __serve__. We have set this container up so that the argument is treated as the command that the container executes. When training, it will run the __train__ program included and, when serving, it will run the __serve__ program. A short sketch of driving both modes with the SageMaker Python SDK follows the file list below.
26 | 
27 | * __train__: The main program for training the model. When you build your own algorithm, you'll edit this to include your training code.
28 | * __serve__: The wrapper that starts the inference server. In most cases, you can use this file as-is.
29 | * __wsgi.py__: The start-up shell for the individual server workers. This only needs to be changed if you change where predictor.py is located or what it is named.
30 | * __predictor.py__: The algorithm-specific inference server. This is the file that you modify with your own algorithm's code.
31 | * __nginx.conf__: The configuration for the nginx master server that manages the multiple workers.
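To make the train/serve contract concrete, here is a hypothetical sketch of driving both modes of this image with the SageMaker Python SDK. The image URI, instance types, hyperparameter values, and the `DVC_REPO_URL`/`DVC_BRANCH` values are placeholders, not the exact configuration used by the sample notebooks:

```python
# Hypothetical sketch - all values below are placeholders for illustration.
import sagemaker
from sagemaker.estimator import Estimator

role = sagemaker.get_execution_role()
image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/sagemaker-catboost-dvc:latest"

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    hyperparameters={"learning_rate": "1", "depth": "6"},
    environment={
        "DVC_REPO_URL": "<codecommit-repo-url>",  # read by the train program
        "DVC_BRANCH": "<dvc-branch>",             # read by the train program
    },
)

# SageMaker runs the image with the `train` argument here ...
estimator.fit()

# ... and with the `serve` argument when the model is deployed behind an endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```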
32 | 33 | [catboost]: https://catboost.ai/ "CatBoost Home Page" 34 | [dockerfile]: https://docs.docker.com/engine/reference/builder/ "The official Dockerfile reference guide" 35 | [ecr]: https://aws.amazon.com/ecr/ "ECR Home Page" 36 | [nginx]: http://nginx.org/ 37 | [gunicorn]: http://gunicorn.org/ 38 | [flask]: http://flask.pocoo.org/ 39 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/nginx.conf: -------------------------------------------------------------------------------- 1 | worker_processes 1; 2 | daemon off; # Prevent forking 3 | 4 | 5 | pid /tmp/nginx.pid; 6 | error_log /var/log/nginx/error.log; 7 | 8 | events { 9 | # defaults 10 | } 11 | 12 | http { 13 | include /etc/nginx/mime.types; 14 | default_type application/octet-stream; 15 | access_log /var/log/nginx/access.log combined; 16 | 17 | upstream gunicorn { 18 | server unix:/tmp/gunicorn.sock; 19 | } 20 | 21 | server { 22 | listen 8080 deferred; 23 | client_max_body_size 5m; 24 | 25 | keepalive_timeout 5; 26 | proxy_read_timeout 1200s; 27 | 28 | location ~ ^/(ping|invocations) { 29 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; 30 | proxy_set_header Host $http_host; 31 | proxy_redirect off; 32 | proxy_pass http://gunicorn; 33 | } 34 | 35 | location / { 36 | return 404 "{}"; 37 | } 38 | } 39 | } 40 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/predictor.py: -------------------------------------------------------------------------------- 1 | # This is the file that implements a flask server to do inferences. It's the file that you will modify to 2 | # implement the scoring for your own algorithm. 3 | 4 | from __future__ import print_function 5 | 6 | import os 7 | import json 8 | import pickle 9 | import sys 10 | import signal 11 | import traceback 12 | import flask 13 | import pandas as pd 14 | from catboost import CatBoostRegressor 15 | from io import StringIO 16 | 17 | 18 | prefix = '/opt/ml/' 19 | model_path = os.path.join(prefix, 'model') 20 | 21 | # A singleton for holding the model. This simply loads the model and holds it. 22 | # It has a predict function that does a prediction based on the model and the input data. 23 | 24 | class ScoringService(object): 25 | model = None # Where we keep the model when it's loaded 26 | 27 | @classmethod 28 | def get_model(cls): 29 | """Get the model object for this instance, loading it if it's not already loaded.""" 30 | if cls.model == None: 31 | cls.model = CatBoostRegressor() 32 | cls.model.load_model(os.path.join(model_path, 'catboost-regressor-model.dump')) 33 | return cls.model 34 | 35 | @classmethod 36 | def predict(cls, input): 37 | """For the input, do the predictions and return them. 38 | 39 | Args: 40 | input (a pandas dataframe): The data on which to do the predictions. There will be 41 | one prediction per row in the dataframe""" 42 | clf = cls.get_model() 43 | return clf.predict(input) 44 | 45 | # The flask app for serving predictions 46 | app = flask.Flask(__name__) 47 | 48 | @app.route('/ping', methods=['GET']) 49 | def ping(): 50 | """Determine if the container is working and healthy. 
In this sample container, we declare 51 | it healthy if we can load the model successfully.""" 52 | health = ScoringService.get_model() is not None # You can insert a health check here 53 | 54 | status = 200 if health else 404 55 | return flask.Response(response='\n', status=status, mimetype='application/json') 56 | 57 | @app.route('/invocations', methods=['POST']) 58 | def transformation(): 59 | """Do an inference on a single batch of data. In this sample server, we take data as CSV, convert 60 | it to a pandas data frame for internal use and then convert the predictions back to CSV (which really 61 | just means one prediction per line, since there's a single column. 62 | """ 63 | data = None 64 | 65 | # Convert from CSV to pandas 66 | if flask.request.content_type == 'text/csv': 67 | data = flask.request.data.decode('utf-8') 68 | s = StringIO(data) 69 | data = pd.read_csv(s, header=None) 70 | else: 71 | return flask.Response(response='This predictor only supports CSV data', status=415, mimetype='text/plain') 72 | 73 | print('Invoked with {} records'.format(data.shape[0])) 74 | 75 | # Do the prediction 76 | predictions = ScoringService.predict(data) 77 | 78 | # Convert from numpy back to CSV 79 | out = StringIO() 80 | pd.DataFrame({'results':predictions}).to_csv(out, header=False, index=False) 81 | result = out.getvalue() 82 | 83 | return flask.Response(response=result, status=200, mimetype='text/csv') 84 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/serve: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # This file implements the scoring service shell. You don't necessarily need to modify it for various 4 | # algorithms. It starts nginx and gunicorn with the correct configurations and then simply waits until 5 | # gunicorn exits. 
6 | # 7 | # The flask server is specified to be the app object in wsgi.py 8 | # 9 | # We set the following parameters: 10 | # 11 | # Parameter Environment Variable Default Value 12 | # --------- -------------------- ------------- 13 | # number of workers MODEL_SERVER_WORKERS the number of CPU cores 14 | # timeout MODEL_SERVER_TIMEOUT 60 seconds 15 | 16 | from __future__ import print_function 17 | import multiprocessing 18 | import os 19 | import signal 20 | import subprocess 21 | import sys 22 | 23 | cpu_count = multiprocessing.cpu_count() 24 | 25 | model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60) 26 | model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count)) 27 | 28 | def sigterm_handler(nginx_pid, gunicorn_pid): 29 | try: 30 | os.kill(nginx_pid, signal.SIGQUIT) 31 | except OSError: 32 | pass 33 | try: 34 | os.kill(gunicorn_pid, signal.SIGTERM) 35 | except OSError: 36 | pass 37 | 38 | sys.exit(0) 39 | 40 | def start_server(): 41 | print('Starting the inference server with {} workers.'.format(model_server_workers)) 42 | 43 | 44 | # link the log streams to stdout/err so they will be logged to the container logs 45 | subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log']) 46 | subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log']) 47 | 48 | nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf']) 49 | gunicorn = subprocess.Popen(['gunicorn', 50 | '--timeout', str(model_server_timeout), 51 | '-k', 'gevent', 52 | '-b', 'unix:/tmp/gunicorn.sock', 53 | '-w', str(model_server_workers), 54 | 'wsgi:app']) 55 | 56 | signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid)) 57 | 58 | # If either subprocess exits, so do we. 59 | pids = set([nginx.pid, gunicorn.pid]) 60 | while True: 61 | pid, _ = os.wait() 62 | if pid in pids: 63 | break 64 | 65 | sigterm_handler(nginx.pid, gunicorn.pid) 66 | print('Inference server exiting') 67 | 68 | # The main routine just invokes the start function. 69 | 70 | if __name__ == '__main__': 71 | start_server() 72 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/train: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # A sample training component that trains a simple CatBoost Regressor tree model. 4 | # This implementation works in File mode and makes no assumptions about the input file names. 5 | # Input is specified as CSV with a data point in each row and the labels in the first column. 
6 | import glob
7 | import logging
8 | import os
9 | import json
10 | import re
11 | import subprocess
12 | import traceback
13 | import sys
14 | 
15 | from catboost import CatBoostRegressor
16 | import numpy as np
17 | import pandas as pd
18 | 
19 | prefix = '/opt/ml/'
20 | input_path = prefix + 'input/data'
21 | dataset_path = prefix + 'input/data/dataset'
22 | train_channel_name = 'train'
23 | validation_channel_name = 'validation'
24 | 
25 | output_path = os.path.join(prefix, 'output')
26 | model_path = os.path.join(prefix, 'model')
27 | model_file_name = 'catboost-regressor-model.dump'
28 | train_path = os.path.join(dataset_path, train_channel_name)
29 | validation_path = os.path.join(dataset_path, validation_channel_name)
30 | 
31 | param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
32 | 
33 | dvc_repo_url = os.environ.get('DVC_REPO_URL')
34 | dvc_branch = os.environ.get('DVC_BRANCH')
35 | user = os.environ.get('USER', "sagemaker")
36 | 
37 | # The function to execute the training.
38 | def train(learning_rate, depth):
39 |     print('Starting the training.')
40 | 
41 |     try:
42 |         # Take the set of train files and read them all into a single pandas dataframe
43 |         train_input_files = [os.path.join(train_path, file) for file in glob.glob(train_path+"/*.csv")]
44 |         if len(train_input_files) == 0:
45 |             raise ValueError(('There are no files in {}.\n' +
46 |                               'This usually indicates that the channel ({}) was incorrectly specified,\n' +
47 |                               'the data specification in S3 was incorrectly specified or the role specified\n' +
48 |                               'does not have permission to access the data.').format(train_path, train_channel_name))
49 |         print('Found train files: {}'.format(train_input_files))
50 |         train_df = pd.DataFrame()
51 |         for file in train_input_files:
52 |             if train_df.shape[0] == 0:
53 |                 train_df = pd.read_csv(file)
54 |             else:
55 |                 df = pd.read_csv(file)
56 |                 train_df = train_df.append(df, ignore_index=True)
57 | 
58 |         # Take the set of validation files and read them all into a single pandas dataframe
59 |         validation_input_files = [os.path.join(validation_path, file) for file in glob.glob(validation_path+"/*.csv")]
60 |         if len(validation_input_files) == 0:
61 |             raise ValueError(('There are no files in {}.\n' +
62 |                               'This usually indicates that the channel ({}) was incorrectly specified,\n' +
63 |                               'the data specification in S3 was incorrectly specified or the role specified\n' +
64 |                               'does not have permission to access the data.').format(validation_path, validation_channel_name))
65 |         print('Found validation files: {}'.format(validation_input_files))
66 |         validation_df = pd.DataFrame()
67 |         for file in validation_input_files:
68 |             if validation_df.shape[0] == 0:
69 |                 validation_df = pd.read_csv(file)
70 |             else:
71 |                 df = pd.read_csv(file)
72 |                 validation_df = validation_df.append(df, ignore_index=True)
73 | 
74 |         # Assumption is that the label is the first column
75 |         print('building training and validation datasets')
76 |         X_train = train_df.iloc[:, 1:].values
77 |         y_train = train_df.iloc[:, 0:1].values
78 |         X_validation = validation_df.iloc[:, 1:].values
79 |         y_validation = validation_df.iloc[:, 0:1].values
80 | 
81 |         # define and train model
82 |         model = CatBoostRegressor(learning_rate=int(learning_rate), depth=int(depth))
83 | 
84 |         model.fit(X_train, y_train, eval_set=(X_validation, y_validation), logging_level='Silent')
85 | 
86 |         # print abs error
87 |         print('validating model')
88 |         abs_err = np.abs(model.predict(X_validation) - y_validation)
89 | 
90 |         # print couple perf metrics
91 |         for q in [10, 50, 90]:
92 |             print('AE-at-' + 
str(q) + 'th-percentile: '+ str(np.percentile(a=abs_err, q=q))) 93 | 94 | # persist model 95 | path = os.path.join(model_path, model_file_name) 96 | print('saving model file to {}'.format(path)) 97 | model.save_model(path) 98 | 99 | print('Training complete.') 100 | except Exception as e: 101 | # Write out an error file. This will be returned as the failureReason in the 102 | # DescribeTrainingJob result. 103 | trc = traceback.format_exc() 104 | with open(os.path.join(output_path, 'failure'), 'w') as s: 105 | s.write('Exception during training: ' + str(e) + '\n' + trc) 106 | # Printing this causes the exception to be in the training job logs, as well. 107 | print('Exception during training: ' + str(e) + '\n' + trc) 108 | # A non-zero exit dependencies causes the training job to be marked as Failed. 109 | sys.exit(255) 110 | 111 | 112 | # Read in any hyperparameters that the user passed with the training job 113 | def get_hyperparameters(): 114 | print('Reading hyperparameters data: {}'.format(param_path)) 115 | with open(param_path) as json_file: 116 | hyperparameters_data = json.load(json_file) 117 | print('hyperparameters_data: {}'.format(hyperparameters_data)) 118 | return hyperparameters_data 119 | 120 | 121 | def clone_dvc_git_repo(): 122 | print(f"Configure git to pull authenticated from CodeCommit") 123 | print(f"Cloning repo: {dvc_repo_url}, git branch: {dvc_branch}") 124 | subprocess.check_call(["git", "clone", "--depth", "1", "--branch", dvc_branch, dvc_repo_url, input_path]) 125 | 126 | 127 | def dvc_pull(): 128 | print("Running dvc pull command") 129 | os.chdir(input_path + "/dataset/") 130 | subprocess.check_call(["dvc", "pull"]) 131 | 132 | 133 | if __name__ == '__main__': 134 | 135 | hyperparameters = get_hyperparameters() 136 | clone_dvc_git_repo() 137 | dvc_pull() 138 | train(hyperparameters['learning_rate'], hyperparameters['depth']) 139 | 140 | # A zero exit dependencies causes the job to be marked a Succeeded. 141 | sys.exit(0) 142 | -------------------------------------------------------------------------------- /container/train_and_serve/catboost_regressor/wsgi.py: -------------------------------------------------------------------------------- 1 | import predictor as myapp 2 | 3 | # This is just a simple wrapper for gunicorn to find your app. 4 | # If you want to change the algorithm file, simply change "predictor" above to the 5 | # new file. 6 | 7 | app = myapp.app 8 | -------------------------------------------------------------------------------- /dvc_sagemaker_byoc.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "def06897", 6 | "metadata": {}, 7 | "source": [ 8 | "# Prerequisite\n", 9 | "\n", 10 | "This notebook assumes you are using the `conda-env-dvc-kernel` image built and attached to a SageMaker Studio domain. 
Setup guidelines are available [here](https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/blob/main/sagemaker-studio-dvc-image/README.md).\n", 11 | "\n", 12 | "# Training a CatBoost regression model with data from DVC\n", 13 | "\n", 14 | "This notebook will guide you through an example that shows you how to build a Docker containers for SageMaker and use it for processing, training, and inference in conjunction with [DVC](https://dvc.org/).\n", 15 | "\n", 16 | "By packaging libraries and algorithms in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.\n", 17 | "\n", 18 | "### California Housing dataset\n", 19 | "\n", 20 | "We use the California Housing dataset, present in [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). \n", 21 | "\n", 22 | "The California Housing dataset was originally published in:\n", 23 | "\n", 24 | "Pace, R. Kelley, and Ronald Barry. \"Sparse spatial auto-regressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n", 25 | "\n", 26 | "### DVC\n", 27 | "\n", 28 | "DVC is built to make machine learning (ML) models shareable and reproducible.\n", 29 | "It is designed to handle large files, data sets, machine learning models, and metrics as well as code." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "21f94258", 35 | "metadata": {}, 36 | "source": [ 37 | "## Part 1: Configure DVC for data versioning\n", 38 | "\n", 39 | "Let us create a subdirectory where we prepare the data, i.e. `sagemaker-dvc-sample`.\n", 40 | "Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in [AWS CodeCommit](https://aws.amazon.com/codecommit/).\n", 41 | "The `dvc` configurations and files for data tracking will be versioned in this repository.\n", 42 | "Git offers native capabilities to manage subprojects via, for example, `git submodules` and `git subtrees`, and you can extend this notebook to use any of the aforementioned tools that best fit your workflow.\n", 43 | "\n", 44 | "One of the great advantage of using AWS CodeCommit in this context is its native integration with IAM for authentication purposes, meaning we can use SageMaker execution role to interact with the git server without the need to worry about how to store and retrieve credentials. Of course, you can always replace AWS CodeCommit with any other version control system based on git such as GitHub, Gitlab, or Bitbucket, keeping in mind you will need to handle the credentials in a secure manner, for example, by introducing Amazon Secret Managers to store and pull credentials at run time in the notebook as well as the processing and training jobs.\n", 45 | "\n", 46 | "Setting the appropriate permissions on SageMaker execution role will also allow the SageMaker processing and training job to interact securely with the AWS CodeCommit." 
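As an illustration of what "appropriate permissions" means here, the hypothetical snippet below attaches the AWS managed `AWSCodeCommitPowerUser` policy to the SageMaker execution role with `boto3`. The CDK stack in this repository already provisions a role with the required permissions, and attaching policies needs IAM rights that the notebook role itself normally does not have, so treat this purely as a sketch:

```python
# Illustration only: grant the SageMaker execution role access to CodeCommit.
# Run with administrator credentials, not from the notebook role itself.
import boto3
import sagemaker

iam = boto3.client("iam")
role_name = sagemaker.get_execution_role().split("/")[-1]

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AWSCodeCommitPowerUser",
)
```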
47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "f7ddaba7", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "%%sh\n", 57 | "\n", 58 | "## Create the repository\n", 59 | "\n", 60 | "repo_name=\"sagemaker-dvc-sample\"\n", 61 | "\n", 62 | "aws codecommit create-repository --repository-name ${repo_name} --repository-description \"Sample repository to describe how to use dvc with sagemaker and codecommit\"\n", 63 | "\n", 64 | "account=$(aws sts get-caller-identity --query Account --output text)\n", 65 | "\n", 66 | "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", 67 | "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", 68 | "region=${region:-eu-west-1}\n", 69 | "\n", 70 | "## repo_name is already in the .gitignore of the root repo\n", 71 | "\n", 72 | "mkdir -p ${repo_name}\n", 73 | "cd ${repo_name}\n", 74 | "\n", 75 | "# initalize new repo in subfolder\n", 76 | "git init\n", 77 | "## Change the remote to the codecommit\n", 78 | "git remote add origin https://git-codecommit.\"${region}\".amazonaws.com/v1/repos/\"${repo_name}\"\n", 79 | "\n", 80 | "# Configure git - change it according to your needs\n", 81 | "git config --global user.email \"sagemaker-studio-user@example.com\"\n", 82 | "git config --global user.name \"SageMaker Studio User\"\n", 83 | "\n", 84 | "git config --global credential.helper '!aws codecommit credential-helper $@'\n", 85 | "git config --global credential.UseHttpPath true\n", 86 | "\n", 87 | "# Initialize dvc\n", 88 | "dvc init\n", 89 | "\n", 90 | "git commit -m 'Add dvc configuration'\n", 91 | "\n", 92 | "# Set the DVC remote storage to S3 - uses the sagemaker standard default bucket\n", 93 | "dvc remote add -d storage s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc\n", 94 | "git commit .dvc/config -m \"initialize DVC local remote\"\n", 95 | "\n", 96 | "# set the DVC cache to S3\n", 97 | "dvc remote add s3cache s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc/cache\n", 98 | "dvc config cache.s3 s3cache\n", 99 | "\n", 100 | "# disable sending anonymized data to dvc for troubleshooting\n", 101 | "dvc config core.analytics false\n", 102 | "\n", 103 | "git add .dvc/config\n", 104 | "git commit -m 'update dvc config'\n", 105 | "\n", 106 | "git push --set-upstream origin master #--force" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "ba876ca1", 112 | "metadata": {}, 113 | "source": [ 114 | "## Part 2: Packaging and Uploading your container images for use with Amazon SageMaker\n", 115 | "\n", 116 | "### An overview of Docker\n", 117 | "\n", 118 | "If you're familiar with Docker already, you can skip ahead to the next section.\n", 119 | "\n", 120 | "For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. \n", 121 | "\n", 122 | "Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. 
Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.\n", 123 | "\n", 124 | "Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variable, etc.\n", 125 | "\n", 126 | "In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.\n", 127 | "\n", 128 | "Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.\n", 129 | "\n", 130 | "Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].\n", 131 | "\n", 132 | "Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.\n", 133 | "\n", 134 | "In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.\n", 135 | "\n", 136 | "Some helpful links:\n", 137 | "\n", 138 | "* [Docker home page](http://www.docker.com)\n", 139 | "* [Getting started with Docker](https://docs.docker.com/get-started/)\n", 140 | "* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)\n", 141 | "* [`docker run` reference](https://docs.docker.com/engine/reference/run/)\n", 142 | "\n", 143 | "[Amazon ECS]: https://aws.amazon.com/ecs/" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "id": "31ba56ff", 149 | "metadata": {}, 150 | "source": [ 151 | "### SageMaker Docker container for processing\n", 152 | "\n", 153 | "Let us now build and register the container for processing. In doing so, we ensure that all `dvc` related dependencies are already installed and we do not need to `pip install` or `git configure` anything within the processing scripts, and we can concentrate on the data preparation and feature engineering.\n", 154 | "\n", 155 | "We aim to have one image for processing where we can supply our own processing script. More information on how to build your own processing container can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-container-run-scripts.html). For a formal specification that defines the contract for an Amazon SageMaker Processing container, see [Build Your Own Processing Container (Advanced Scenario)](https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html). 
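For orientation, this is roughly how such a processing image is used once it has been built and pushed: because the processing Dockerfile sets `python3` as its `ENTRYPOINT`, a `ScriptProcessor` can hand it any processing script. The image URI, instance type, and S3 path below are placeholders for illustration:

```python
# Hypothetical sketch - image URI, instance type, and S3 path are placeholders.
import sagemaker
from sagemaker.processing import ProcessingInput, ScriptProcessor

role = sagemaker.get_execution_role()
image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/sagemaker-processing-dvc:latest"

processor = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],  # matches the ENTRYPOINT of the processing Dockerfile
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="source_dir/preprocessing-experiment.py",
    inputs=[
        ProcessingInput(
            source="s3://<bucket>/DEMO-sagemaker-experiments-dvc/input/",
            destination="/opt/ml/processing/input",
        )
    ],
)
```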
" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "id": "df0f1445", 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "!cat container/processing/Dockerfile" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "b61c7fc6", 171 | "metadata": {}, 172 | "source": [ 173 | "### Building and registering the containers\n", 174 | "\n", 175 | "We will use [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) to store our container images.\n", 176 | "\n", 177 | "To easily build custom container images from your Studio notebooks, we use the [SageMaker Docker Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli). For more information on the SageMaker Docker Build CLI, interested readers can refer to [this blogpost](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/)." 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "id": "fa713178", 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "%%sh\n", 188 | "\n", 189 | "# The name of the image\n", 190 | "image_name=sagemaker-processing-dvc\n", 191 | "\n", 192 | "cd container/processing\n", 193 | "\n", 194 | "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", 195 | "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", 196 | "region=${region:-eu-west-1}\n", 197 | "\n", 198 | "# If the repository doesn't exist in ECR, create it.\n", 199 | "aws ecr describe-repositories --region \"${region}\" --repository-names \"${image_name}\" > /dev/null 2>&1\n", 200 | "\n", 201 | "if [ $? -ne 0 ]\n", 202 | "then\n", 203 | " aws ecr create-repository --region \"${region}\" --repository-name \"${image_name}\" > /dev/null\n", 204 | "fi\n", 205 | "\n", 206 | "sm-docker build . --repository \"${image_name}:latest\"" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "2ac30a72", 212 | "metadata": {}, 213 | "source": [ 214 | "### SageMaker Docker container for training and hosting\n", 215 | "\n", 216 | "Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:\n", 217 | "\n", 218 | "* In the example here, we don't define an `ENTRYPOINT` in the Dockerfile so Docker will run the command `train` at training time and `serve` at serving time. In this example, we define these as executable Python scripts, but they could be any program that we want to start in that environment.\n", 219 | "* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.\n", 220 | "* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed in. \n", 221 | "\n", 222 | "\n", 223 | "#### Running your container during training\n", 224 | "\n", 225 | "When Amazon SageMaker runs training, your `train` script is run just like a regular Python program. 
A number of files are laid out for your use, under the `/opt/ml` directory:\n", 226 | "\n", 227 | " /opt/ml\n", 228 | " |-- input\n", 229 | " | |-- config\n", 230 | " | | |-- hyperparameters.json\n", 231 | " | | `-- resourceConfig.json\n", 232 | " | `-- data\n", 233 | " | `-- \n", 234 | " | `-- \n", 235 | " |-- model\n", 236 | " | `-- \n", 237 | " `-- output\n", 238 | " `-- failure\n", 239 | "\n", 240 | "##### The input\n", 241 | "\n", 242 | "* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn't support distributed training, we'll ignore it here.\n", 243 | "* `/opt/ml/input/data//` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. \n", 244 | "* `/opt/ml/input/data/_` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.\n", 245 | "\n", 246 | "##### The output\n", 247 | "\n", 248 | "* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.\n", 249 | "* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.\n", 250 | "\n", 251 | "#### Running your container during hosting\n", 252 | "\n", 253 | "Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests.\n", 254 | "\n", 255 | "This stack is implemented in the sample code here and you can mostly just leave it alone. \n", 256 | "\n", 257 | "Amazon SageMaker uses two URLs in the container:\n", 258 | "\n", 259 | "* `/ping` will receive `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.\n", 260 | "* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well. 
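As a hypothetical local smoke test of this contract (not part of the notebook), you can run the image locally with the `serve` argument, mount a trained model under `/opt/ml/model`, and exercise the two endpoints with `requests`:

```python
# Hypothetical local smoke test, e.g. after:
#   docker run -p 8080:8080 -v $(pwd)/model:/opt/ml/model sagemaker-catboost-dvc serve
import requests

# /ping returns 200 once the model can be loaded
print(requests.get("http://localhost:8080/ping").status_code)

# /invocations takes CSV feature rows (no label column) and returns one prediction per line
payload = "8.3252,41.0,6.98,1.02,322.0,2.55,37.88,-122.23\n"
response = requests.post(
    "http://localhost:8080/invocations",
    data=payload,
    headers={"Content-Type": "text/csv"},
)
print(response.text)
```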
\n", 261 | "\n", 262 | "The container will have the model files in the same place they were written during training:\n", 263 | "\n", 264 | " /opt/ml\n", 265 | " `-- model\n", 266 | " `-- " 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "id": "afb0b22a", 272 | "metadata": {}, 273 | "source": [ 274 | "### The parts of the training and inference container\n", 275 | "\n", 276 | "In the `container/train_and_serve` directory are all the components you need to package the sample algorithm for Amazon SageMager:\n", 277 | "\n", 278 | " .\n", 279 | " |-- Dockerfile\n", 280 | " |-- README.md\n", 281 | " `-- catboost_regressor\n", 282 | " |-- nginx.conf\n", 283 | " |-- predictor.py\n", 284 | " |-- serve\n", 285 | " |-- train\n", 286 | " `-- wsgi.py\n", 287 | "\n", 288 | "Let's discuss each of these in turn:\n", 289 | "\n", 290 | "* __`Dockerfile`__ describes how to build your Docker container image. More details below.\n", 291 | "* __`catboost_regressor`__ is the directory which contains the files that will be installed in the container.\n", 292 | "\n", 293 | "In this simple application, we only install five files in the container. You may only need that many or, if you have many supporting routines, you may wish to install more. These five show the standard structure of our Python containers, although you are free to choose a different toolset and therefore could have a different layout. If you're writing in a different programming language, you'll certainly have a different layout depending on the frameworks and tools you choose.\n", 294 | "\n", 295 | "The files that we'll put in the container are:\n", 296 | "\n", 297 | "* __`nginx.conf`__ is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.\n", 298 | "* __`predictor.py`__ is the program that actually implements the Flask web server and the decision tree predictions for this app. You'll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.\n", 299 | "* __`serve`__ is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in `predictor.py`. You should be able to take this file as-is.\n", 300 | "* __`train`__ is the program that is invoked when the container is run for training. You will modify this program to implement your training algorithm.\n", 301 | "* __`wsgi.py`__ is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.\n", 302 | "\n", 303 | "In summary, the two files you will probably want to change for your application are `train` and `predictor.py`." 
304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "id": "7b6b75bf", 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "!cat container/train_and_serve/Dockerfile" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "id": "7a14543c", 319 | "metadata": {}, 320 | "source": [ 321 | "In the `container/train_and_serve` directory are all the components you need to package the sample algorithm for Amazon SageMaker:\n", 322 | "\n", 323 | " .\n", 324 | " `-- container/train_and_serve/\n", 325 | " |-- Dockerfile\n", 326 | " |-- README.md\n", 327 | " `--catboost_regressor/\n", 328 | " |-- nginx.conf\n", 329 | " |-- predictor.py\n", 330 | " |-- serve\n", 331 | " |-- train\n", 332 | " |-- wsgi.py\n" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "id": "61992f99", 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "%%sh\n", 343 | "\n", 344 | "# The name of our algorithm\n", 345 | "algorithm_name=sagemaker-catboost-dvc\n", 346 | "\n", 347 | "cd container/train_and_serve\n", 348 | "\n", 349 | "chmod +x catboost_regressor/train\n", 350 | "chmod +x catboost_regressor/serve\n", 351 | "\n", 352 | "# Get the region defined in the current configuration (default to us-west-1 if none defined)\n", 353 | "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", 354 | "region=${region:-eu-west-1}\n", 355 | "\n", 356 | "# If the repository doesn't exist in ECR, create it.\n", 357 | "aws ecr describe-repositories --region \"${region}\" --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n", 358 | "\n", 359 | "if [ $? -ne 0 ]\n", 360 | "then\n", 361 | " aws ecr create-repository --region \"${region}\" --repository-name \"${algorithm_name}\" > /dev/null\n", 362 | "fi\n", 363 | "\n", 364 | "sm-docker build . --repository \"${algorithm_name}:latest\"" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "id": "0c337c82", 370 | "metadata": {}, 371 | "source": [ 372 | "## Part 3: Processing and Training with DVC and SageMaker\n", 373 | "\n", 374 | "In this section we explore two different approaches to tackle our problem and how we can keep track of the 2 tests using SageMaker Experiments.\n", 375 | "\n", 376 | "The high level conceptual architecture is depicted in the figure below\n", 377 | "\n", 378 | "\n", 379 | "\n", 380 | "Let's unfold in the following sections the implementation details of the two experiments.\n", 381 | "\n", 382 | "### Import libraries and initial setup\n", 383 | "\n", 384 | "Lets start by importing the libraries and setup variables that will be useful as we go along in the notebook." 
385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "id": "feead477", 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "import boto3\n", 395 | "import sagemaker\n", 396 | "import time\n", 397 | "from time import strftime\n", 398 | "\n", 399 | "boto_session = boto3.Session()\n", 400 | "sagemaker_session = sagemaker.Session(boto_session=boto_session)\n", 401 | "sm_client = boto3.client(\"sagemaker\")\n", 402 | "region = boto_session.region_name\n", 403 | "bucket = sagemaker_session.default_bucket()\n", 404 | "role = sagemaker.get_execution_role()\n", 405 | "account = sagemaker_session.boto_session.client(\"sts\").get_caller_identity()[\"Account\"]\n", 406 | "\n", 407 | "prefix = 'DEMO-sagemaker-experiments-dvc'\n", 408 | "\n", 409 | "print(f\"account: {account}\")\n", 410 | "print(f\"bucket: {bucket}\")\n", 411 | "print(f\"region: {region}\")\n", 412 | "print(f\"role: {role}\")" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "aae3a438", 418 | "metadata": {}, 419 | "source": [ 420 | "### Prepare raw data\n", 421 | "\n", 422 | "We upload the raw data to S3 in the default bucket." 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "id": "4d9dfa8b", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "import pandas as pd\n", 433 | "import numpy as np\n", 434 | "\n", 435 | "from sklearn.datasets import fetch_california_housing\n", 436 | "from sklearn.model_selection import train_test_split\n", 437 | "\n", 438 | "from pathlib import Path\n", 439 | "\n", 440 | "databunch = fetch_california_housing()\n", 441 | "dataset = np.concatenate((databunch[\"target\"].reshape(-1, 1), databunch[\"data\"]), axis=1)\n", 442 | "\n", 443 | "print(f\"Dataset shape = {dataset.shape}\")\n", 444 | "np.savetxt(\"dataset.csv\", dataset, delimiter=\",\")\n", 445 | "\n", 446 | "data_prefix_path = f\"{prefix}/input/dataset.csv\"\n", 447 | "s3_data_path = f\"s3://{bucket}/{data_prefix_path}\"\n", 448 | "print(f\"Raw data location in S3: {s3_data_path}\")\n", 449 | "\n", 450 | "s3 = boto3.client(\"s3\")\n", 451 | "s3.upload_file(\"dataset.csv\", bucket, data_prefix_path)" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "id": "622d14ce", 457 | "metadata": {}, 458 | "source": [ 459 | "### Setup SageMaker Experiments\n", 460 | "\n", 461 | "Amazon SageMaker Experiments have been built for data scientists that are performing different experiments as part of their model development process and want a simple way to organize, track, compare, and evaluate their machine learning experiments.\n", 462 | "\n", 463 | "Let’s start first with an overview of Amazon SageMaker Experiments features:\n", 464 | "\n", 465 | "* Organize Experiments: Amazon SageMaker Experiments structures experimentation with a first top level entity called experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, parameters, and artifacts. You can picture experiments as the top level “folder” for organizing your hypotheses, your trials as the “subfolders” for each group test run, and your trial components as your “files” for each instance of a test run.\n", 466 | "* Track Experiments: Amazon SageMaker Experiments allows the data scientist to track experiments automatically or manually. 
Amazon SageMaker Experiments offers the possibility to automatically assign the sagemaker jobs to a trial specifying the `experiment_config` argument, or to manually call the tracking APIs.\n", 467 | "* Compare and Evaluate Experiments: The integration of Amazon SageMaker Experiments with Amazon SageMaker Studio makes it easier to produce data visualizations and compare different trials to identify the best combination of hyperparameters.\n", 468 | "\n", 469 | "Now, in order to track this test in SageMaker, we need to create an experiment." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "id": "4a2af348", 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "from smexperiments.experiment import Experiment\n", 480 | "from smexperiments.trial import Trial\n", 481 | "from smexperiments.trial_component import TrialComponent\n", 482 | "from smexperiments.tracker import Tracker\n", 483 | "\n", 484 | "experiment_name = 'DEMO-sagemaker-experiments-dvc'\n", 485 | "\n", 486 | "# create the experiment if it doesn't exist\n", 487 | "try:\n", 488 | " my_experiment = Experiment.load(experiment_name=experiment_name)\n", 489 | " print(\"existing experiment loaded\")\n", 490 | "except Exception as ex:\n", 491 | " if \"ResourceNotFound\" in str(ex):\n", 492 | " my_experiment = Experiment.create(\n", 493 | " experiment_name = experiment_name,\n", 494 | " description = \"How to integrate DVC\"\n", 495 | " )\n", 496 | " print(\"new experiment created\")\n", 497 | " else:\n", 498 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 499 | " print(\"Dont go forward!\")\n", 500 | " raise" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "id": "0947c82b", 506 | "metadata": {}, 507 | "source": [ 508 | "We need to also define trials within the experiment.\n", 509 | "While it is possible to have any number of trials within an experiment, for our excercise, we will create 2 trials, one for each processing strategy.\n", 510 | "\n", 511 | "### Test 1: generate single files for training and validation\n", 512 | "\n", 513 | "In this test, we show how to create a processing script that fetches the raw data directly from S3 as an input, process it to create the triplet `train`, `validation` and `test`, and store the results back to S3 using `dvc`. Furthermore, we show how you can pair `dvc` with SageMaker native tracking capabilities when executing Processing and Training Jobs and via SageMaker Experiments." 
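The actual processing script is printed with `pygmentize` a few cells below. Purely as an outline of the idea, and under the assumption that file names, paths, and shell commands may differ from the repository's exact code, the core of such a script combines a scikit-learn split, `dvc add`/`dvc push` on a per-trial git branch, and an enrichment of the TrialComponent that SageMaker creates for the job:

```python
# Illustrative outline only; the real script is shown by the pygmentize cell below.
import os
import subprocess

import pandas as pd
from sklearn.model_selection import train_test_split
from smexperiments.tracker import Tracker


def run(cmd, cwd=None):
    # Helper to run git/dvc CLI commands and fail loudly if they do not succeed
    subprocess.run(cmd, cwd=cwd, shell=True, check=True)


repo_url = os.environ["DVC_REPO_URL"]   # e.g. codecommit::<region>://sagemaker-dvc-sample
branch = os.environ["DVC_BRANCH"]       # one branch per trial
split_ratio = 0.2                       # normally parsed from --train-test-split-ratio

# 1. Clone the repository that holds the dvc metadata and create the trial branch
run(f"git clone {repo_url} repo && cd repo && git checkout -b {branch}")

# 2. Split the raw data that the Processing job mounted under /opt/ml/processing/input
df = pd.read_csv("/opt/ml/processing/input/dataset.csv", header=None)
train, test = train_test_split(df, test_size=split_ratio)
train, validation = train_test_split(train, test_size=split_ratio)
for name, part in [("train", train), ("validation", validation), ("test", test)]:
    os.makedirs(f"repo/dataset/{name}", exist_ok=True)
    part.to_csv(f"repo/dataset/{name}/california_{name}.csv", header=False, index=False)

# 3. Version the dataset: the data goes to the S3 remote, the metadata to the git branch
run(
    "dvc add dataset && git add dataset.dvc .gitignore && git commit -m 'new data' && "
    f"git push --set-upstream origin {branch} && dvc push",
    cwd="repo",
)

# 4. Enrich the TrialComponent that SageMaker generated for this Processing job
with Tracker.load() as tracker:
    tracker.log_parameters({
        "data_repo_url": repo_url,
        "data_branch": branch,
        "train_test_split_ratio": split_ratio,
    })
```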
514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "id": "3a34b0c8", 520 | "metadata": {}, 521 | "outputs": [], 522 | "source": [ 523 | "first_trial_name = \"dvc-trial-single-file\"\n", 524 | "\n", 525 | "try:\n", 526 | " my_first_trial = Trial.load(trial_name=first_trial_name)\n", 527 | " print(\"existing trial loaded\")\n", 528 | "except Exception as ex:\n", 529 | " if \"ResourceNotFound\" in str(ex):\n", 530 | " my_first_trial = Trial.create(\n", 531 | " experiment_name=experiment_name,\n", 532 | " trial_name=first_trial_name,\n", 533 | " )\n", 534 | " print(\"new trial created\")\n", 535 | " else:\n", 536 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 537 | " print(\"Dont go forward!\")\n", 538 | " raise" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "id": "d9471d00", 544 | "metadata": {}, 545 | "source": [ 546 | "### Processing script: version data with DVC\n", 547 | "\n", 548 | "The processing script takes as arguments the address of the git repository, and the branch we want to create to store the `dvc` metadata. The datasets themselves will be then stored in S3. The arguments passed to the processing scripts are not automatically tracked in SageMaker Experiments in the automatically generated TrialComponent. The TrialComponent generated by SageMaker can be loaded within the Processing Job and further enrich with any extra data, which then become available for visualization in the SageMaker Studio UI. In our case, we will store the following data:\n", 549 | "* `data_repo_url`\n", 550 | "* `data_branch`\n", 551 | "* `data_commit_hash`\n", 552 | "* `train_test_split_ratio`" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "id": "65896ab5", 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "!pygmentize 'source_dir/preprocessing-experiment.py'" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "id": "e1cd9076", 568 | "metadata": {}, 569 | "source": [ 570 | "### SageMaker Processing job\n", 571 | "\n", 572 | "We have now all ingredients to execute our SageMaker Processing Job:\n", 573 | "* a custom image with dvc installed\n", 574 | "* a git repository (i.e., AWS CodeCommit)\n", 575 | "* a processing script that can process several arguments (i.e., `--train-test-split-ratio`, `--dvc-repo-url`, `--dvc-branch`)\n", 576 | "* a SageMaker Experiment and a Trial" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": null, 582 | "id": "62fa75e3", 583 | "metadata": {}, 584 | "outputs": [], 585 | "source": [ 586 | "from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput\n", 587 | "\n", 588 | "dvc_repo_url = \"codecommit::{}://sagemaker-dvc-sample\".format(region)\n", 589 | "dvc_branch = my_first_trial.trial_name\n", 590 | "\n", 591 | "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-dvc:latest\".format(account, region)\n", 592 | "\n", 593 | "script_processor = ScriptProcessor(command=['python3'],\n", 594 | " image_uri=image,\n", 595 | " role=role,\n", 596 | " instance_count=1,\n", 597 | " instance_type='ml.m5.xlarge',\n", 598 | " env={\n", 599 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 600 | " \"DVC_BRANCH\": dvc_branch,\n", 601 | " \"USER\": \"sagemaker\"\n", 602 | " },\n", 603 | " )\n", 604 | "\n", 605 | "experiment_config={\n", 606 | " \"ExperimentName\": my_experiment.experiment_name,\n", 607 | " \"TrialName\": my_first_trial.trial_name\n", 608 | "}" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | 
"id": "baf3d39c", 614 | "metadata": {}, 615 | "source": [ 616 | "Executing the processing job will take around 3-4 minutes." 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": null, 622 | "id": "ba0c629f", 623 | "metadata": {}, 624 | "outputs": [], 625 | "source": [ 626 | "%%time\n", 627 | "\n", 628 | "script_processor.run(\n", 629 | " code='source_dir/preprocessing-experiment.py',\n", 630 | " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", 631 | " experiment_config=experiment_config,\n", 632 | " arguments=[\"--train-test-split-ratio\", \"0.2\"],\n", 633 | ")" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "id": "e4aa1f57", 639 | "metadata": {}, 640 | "source": [ 641 | "### Create an estimator and fit the model" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "id": "f2be8cc5", 647 | "metadata": {}, 648 | "source": [ 649 | "To use DVC integration, pass a `dvc_repo_url` and `dvc_branch` as parameters when you create the Estimator object.\n", 650 | "\n", 651 | "We will train on the `dvc-trial-single-file` branch first.\n", 652 | "\n", 653 | "When doing `dvc pull` in the training script, the following dataset structure will be generated:\n", 654 | "\n", 655 | "```\n", 656 | "dataset\n", 657 | " |-- train\n", 658 | " | |-- california_train.csv\n", 659 | " |-- test\n", 660 | " | |-- california_test.csv\n", 661 | " |-- validation\n", 662 | " | |-- california_validation.csv\n", 663 | "```\n", 664 | "\n", 665 | "#### Metric definition\n", 666 | "\n", 667 | "SageMaker emits every log that is going to STDOUT to CLoudWatch. In order to capture the metrics we are interested in, we need to specify a metric definition object to define the format of the metrics via regex. 
By doing so, SageMaker will know how to capture the metrics from the CloudWatch logs of the training job.\n", 668 | "\n", 669 | "In our case, we are interested in the median error.\n", 670 | "```\n", 671 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 672 | "```" 673 | ] 674 | }, 675 | { 676 | "cell_type": "code", 677 | "execution_count": null, 678 | "id": "c5dbce5d", 679 | "metadata": {}, 680 | "outputs": [], 681 | "source": [ 682 | "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest\".format(account, region)\n", 683 | "\n", 684 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 685 | "\n", 686 | "hyperparameters={ \n", 687 | " \"learning_rate\" : 1,\n", 688 | " \"depth\": 6\n", 689 | " }\n", 690 | "\n", 691 | "estimator = sagemaker.estimator.Estimator(\n", 692 | " image,\n", 693 | " role,\n", 694 | " instance_count=1,\n", 695 | " metric_definitions=metric_definitions,\n", 696 | " instance_type=\"ml.m5.large\",\n", 697 | " sagemaker_session=sagemaker_session,\n", 698 | " hyperparameters=hyperparameters,\n", 699 | " environment={\n", 700 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 701 | " \"DVC_BRANCH\": dvc_branch,\n", 702 | " \"USER\": \"sagemaker\"\n", 703 | " }\n", 704 | ")\n", 705 | "\n", 706 | "experiment_config={\n", 707 | " \"ExperimentName\": my_experiment.experiment_name,\n", 708 | " \"TrialName\": my_first_trial.trial_name\n", 709 | "}" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": null, 715 | "id": "2f766d30", 716 | "metadata": {}, 717 | "outputs": [], 718 | "source": [ 719 | "%%time\n", 720 | "\n", 721 | "estimator.fit(experiment_config=experiment_config)" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "id": "f83c9dde", 727 | "metadata": {}, 728 | "source": [ 729 | "On the logs above you can see those lines, indicating about the files pulled by dvc:\n", 730 | "\n", 731 | "```\n", 732 | "Running dvc pull command\n", 733 | "A train/california_train.csv\n", 734 | "A test/california_test.csv\n", 735 | "A validation/california_validation.csv\n", 736 | "3 files added and 3 files fetched\n", 737 | "Starting the training.\n", 738 | "Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", 739 | "Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", 740 | "```" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "id": "e6bb08ce", 746 | "metadata": {}, 747 | "source": [ 748 | "### Test 2: generate multiple files for training and validation" 749 | ] 750 | }, 751 | { 752 | "cell_type": "code", 753 | "execution_count": null, 754 | "id": "e24154ae", 755 | "metadata": {}, 756 | "outputs": [], 757 | "source": [ 758 | "second_trial_name = \"dvc-trial-multi-files\"\n", 759 | "\n", 760 | "try:\n", 761 | " my_second_trial = Trial.load(trial_name=second_trial_name)\n", 762 | " print(\"existing trial loaded\")\n", 763 | "except Exception as ex:\n", 764 | " if \"ResourceNotFound\" in str(ex):\n", 765 | " my_second_trial = Trial.create(\n", 766 | " experiment_name=experiment_name,\n", 767 | " trial_name=second_trial_name,\n", 768 | " )\n", 769 | " print(\"new trial created\")\n", 770 | " else:\n", 771 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 772 | " print(\"Dont go forward!\")\n", 773 | " raise" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "id": "dbe33f33", 779 | "metadata": {}, 780 | "source": [ 781 | "Differently from the first processing script, 
we now create multiple files for training and validation out of the original dataset and store the `dvc` metadata in a different branch." 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": null, 787 | "id": "0a045eb4", 788 | "metadata": {}, 789 | "outputs": [], 790 | "source": [ 791 | "!pygmentize 'source_dir/preprocessing-experiment-multifiles.py'" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": null, 797 | "id": "167185e8", 798 | "metadata": {}, 799 | "outputs": [], 800 | "source": [ 801 | "from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput\n", 802 | "\n", 803 | "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-dvc:latest\".format(account, region)\n", 804 | "\n", 805 | "dvc_branch = my_second_trial.trial_name\n", 806 | "\n", 807 | "script_processor = ScriptProcessor(command=['python3'],\n", 808 | " image_uri=image,\n", 809 | " role=role,\n", 810 | " instance_count=1,\n", 811 | " instance_type='ml.m5.xlarge',\n", 812 | " env={\n", 813 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 814 | " \"DVC_BRANCH\": dvc_branch,\n", 815 | " \"USER\": \"sagemaker\"\n", 816 | " },\n", 817 | " )\n", 818 | "\n", 819 | "experiment_config={\n", 820 | " \"ExperimentName\": my_experiment.experiment_name,\n", 821 | " \"TrialName\": my_second_trial.trial_name\n", 822 | "}" 823 | ] 824 | }, 825 | { 826 | "cell_type": "markdown", 827 | "id": "14b26ae5", 828 | "metadata": {}, 829 | "source": [ 830 | "Executing the processing job will take ~5 minutes" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": null, 836 | "id": "46d29d7a", 837 | "metadata": {}, 838 | "outputs": [], 839 | "source": [ 840 | "%%time\n", 841 | "\n", 842 | "script_processor.run(\n", 843 | " code='source_dir/preprocessing-experiment-multifiles.py',\n", 844 | " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", 845 | " experiment_config=experiment_config,\n", 846 | " arguments=[\"--train-test-split-ratio\", \"0.1\"],\n", 847 | ")" 848 | ] 849 | }, 850 | { 851 | "cell_type": "markdown", 852 | "id": "1bb32869", 853 | "metadata": {}, 854 | "source": [ 855 | "We will now train on the `dvc-trial-multi-files` branch.\n", 856 | "\n", 857 | "When doing `dvc pull`, this is the dataset structure:\n", 858 | "\n", 859 | "```\n", 860 | "dataset\n", 861 | " |-- train\n", 862 | " | |-- california_train_1.csv\n", 863 | " | |-- california_train_2.csv\n", 864 | " | |-- california_train_3.csv\n", 865 | " | |-- california_train_4.csv\n", 866 | " | |-- california_train_5.csv\n", 867 | " |-- test\n", 868 | " | |-- california_test.csv\n", 869 | " |-- validation\n", 870 | " | |-- california_validation_1.csv\n", 871 | " | |-- california_validation_2.csv\n", 872 | " | |-- california_validation_3.csv\n", 873 | "```" 874 | ] 875 | }, 876 | { 877 | "cell_type": "code", 878 | "execution_count": null, 879 | "id": "bd8db6a2", 880 | "metadata": {}, 881 | "outputs": [], 882 | "source": [ 883 | "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest\".format(account, region)\n", 884 | "\n", 885 | "hyperparameters={ \n", 886 | " \"learning_rate\" : 1,\n", 887 | " \"depth\": 6\n", 888 | " }\n", 889 | "\n", 890 | "estimator = sagemaker.estimator.Estimator(\n", 891 | " image,\n", 892 | " role,\n", 893 | " instance_count=1,\n", 894 | " metric_definitions=metric_definitions,\n", 895 | " instance_type=\"ml.m5.large\",\n", 896 | " sagemaker_session=sagemaker_session,\n", 897 | " hyperparameters=hyperparameters,\n", 898 | "
environment={\n", 899 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 900 | " \"DVC_BRANCH\": dvc_branch,\n", 901 | " \"USER\": \"sagemaker\"\n", 902 | " }\n", 903 | ")\n", 904 | "\n", 905 | "experiment_config={\n", 906 | " \"ExperimentName\": my_experiment.experiment_name,\n", 907 | " \"TrialName\": my_second_trial.trial_name,\n", 908 | "}" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "id": "c6e0e936", 914 | "metadata": {}, 915 | "source": [ 916 | "The training job will take around 5 minutes" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": null, 922 | "id": "1e216113", 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [ 926 | "%%time\n", 927 | "\n", 928 | "estimator.fit(experiment_config=experiment_config)" 929 | ] 930 | }, 931 | { 932 | "cell_type": "markdown", 933 | "id": "f767fa22", 934 | "metadata": {}, 935 | "source": [ 936 | "In the logs above you can see the following lines, indicating the files pulled by dvc:\n", 937 | "\n", 938 | "```\n", 939 | "Running dvc pull command\n", 940 | "A validation/california_validation_2.csv\n", 941 | "A validation/california_validation_1.csv\n", 942 | "A validation/california_validation_3.csv\n", 943 | "A train/california_train_4.csv\n", 944 | "A train/california_train_5.csv\n", 945 | "A train/california_train_2.csv\n", 946 | "A train/california_train_3.csv\n", 947 | "A train/california_train_1.csv\n", 948 | "A test/california_test.csv\n", 949 | "9 files added and 9 files fetched\n", 950 | "Starting the training.\n", 951 | "Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']\n", 952 | "Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']\n", 953 | "```" 954 | ] 955 | }, 956 | { 957 | "cell_type": "markdown", 958 | "id": "0bd85708", 959 | "metadata": {}, 960 | "source": [ 961 | "## Part 4: Hosting your model in SageMaker" 962 | ] 963 | }, 964 | { 965 | "cell_type": "code", 966 | "execution_count": null, 967 | "id": "48387d98", 968 | "metadata": {}, 969 | "outputs": [], 970 | "source": [ 971 | "from sagemaker.predictor import csv_serializer\n", 972 | "\n", 973 | "predictor = estimator.deploy(1, \"ml.t2.medium\", serializer=csv_serializer)" 974 | ] 975 | }, 976 | { 977 | "cell_type": "markdown", 978 | "id": "57bb1f58", 979 | "metadata": {}, 980 | "source": [ 981 | "### Fetch the testing data\n", 982 | "\n", 983 | "Save locally the test data created by the SageMaker Processing Job and stored in S3 via DVC."
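The shell cell below fetches the trial branch and pulls the files with the `dvc` CLI. As an alternative sketch, the same test file can also be read directly through the `dvc` Python API without a local checkout; the CodeCommit HTTPS URL below is an assumption based on the repository created earlier in this notebook:

```python
# Hedged alternative to the shell approach below: read the dvc-tracked test set in memory.
import io

import dvc.api
import pandas as pd

git_repo_https = f"https://git-codecommit.{region}.amazonaws.com/v1/repos/sagemaker-dvc-sample"

raw = dvc.api.read(
    "dataset/test/california_test.csv",
    repo=git_repo_https,
    rev="dvc-trial-multi-files",   # the branch written by the second processing job
)
test = pd.read_csv(io.StringIO(raw), header=None)
```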
984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": null, 989 | "id": "428729d2", 990 | "metadata": {}, 991 | "outputs": [], 992 | "source": [ 993 | "%%sh\n", 994 | "\n", 995 | "cd sagemaker-dvc-sample\n", 996 | "\n", 997 | "# get all remote branches\n", 998 | "git fetch --all\n", 999 | "\n", 1000 | "# move to the dvc-trial-multi-files branch\n", 1001 | "git checkout dvc-trial-multi-files\n", 1002 | "\n", 1003 | "# gather the data (for testing purposes)\n", 1004 | "dvc pull" 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "markdown", 1009 | "id": "2a2d99e5", 1010 | "metadata": {}, 1011 | "source": [ 1012 | "Prepare the data" 1013 | ] 1014 | }, 1015 | { 1016 | "cell_type": "code", 1017 | "execution_count": null, 1018 | "id": "bfc4eeb7", 1019 | "metadata": {}, 1020 | "outputs": [], 1021 | "source": [ 1022 | "test = pd.read_csv(\"./sagemaker-dvc-sample/dataset/test/california_test.csv\",header=None)\n", 1023 | "X_test = test.iloc[:, 1:].values\n", 1024 | "y_test = test.iloc[:, 0:1].values" 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "markdown", 1029 | "id": "3594f09e", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "## Invoke endpoint with the Python SDK" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": null, 1038 | "id": "ff43812f", 1039 | "metadata": {}, 1040 | "outputs": [], 1041 | "source": [ 1042 | "predicted = predictor.predict(X_test).decode('utf-8').split('\\n')\n", 1043 | "for i in range(len(predicted)-1):\n", 1044 | " print(f\"predicted: {predicted[i]}, actual: {y_test[i][0]}\")" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "markdown", 1049 | "id": "103db7d0", 1050 | "metadata": {}, 1051 | "source": [ 1052 | "### Delete the Endpoint\n", 1053 | "\n", 1054 | "Make sure to delete the endpoint to avoid unexpected costs" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": null, 1060 | "id": "6112817a", 1061 | "metadata": {}, 1062 | "outputs": [], 1063 | "source": [ 1064 | "predictor.delete_endpoint()" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "markdown", 1069 | "id": "71c48355", 1070 | "metadata": {}, 1071 | "source": [ 1072 | "### (Optional) Delete the Experiment and all Trials and TrialComponents" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "code", 1077 | "execution_count": null, 1078 | "id": "7ba71c3c", 1079 | "metadata": {}, 1080 | "outputs": [], 1081 | "source": [ 1082 | "my_experiment.delete_all(action=\"--force\")" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "id": "5a4612c2", 1088 | "metadata": {}, 1089 | "source": [ 1090 | "### (Optional) Delete the AWS CodeCommit repository" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "code", 1095 | "execution_count": null, 1096 | "id": "7f5db756", 1097 | "metadata": {}, 1098 | "outputs": [], 1099 | "source": [ 1100 | "!aws codecommit delete-repository --repository-name sagemaker-dvc-sample" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "markdown", 1105 | "id": "9b14fc3d", 1106 | "metadata": {}, 1107 | "source": [ 1108 | "### (Optional) Delete the AWS ECR repositories" 1109 | ] 1110 | }, 1111 | { 1112 | "cell_type": "code", 1113 | "execution_count": null, 1114 | "id": "8a0705de", 1115 | "metadata": {}, 1116 | "outputs": [], 1117 | "source": [ 1118 | "!aws ecr delete-repository --repository-name sagemaker-catboost-dvc --force\n", 1119 | "!aws ecr delete-repository --repository-name sagemaker-processing-dvc --force" 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "code", 1124 | "execution_count": null, 1125 | "id":
"93cd4af8", 1126 | "metadata": {}, 1127 | "outputs": [], 1128 | "source": [] 1129 | } 1130 | ], 1131 | "metadata": { 1132 | "instance_type": "ml.t3.medium", 1133 | "kernelspec": { 1134 | "display_name": "Python [conda env: dvc] (conda-env-dvc-kernel/latest)", 1135 | "language": "python", 1136 | "name": "conda-env-dvc-py__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:583558296381:image/conda-env-dvc-kernel" 1137 | }, 1138 | "language_info": { 1139 | "codemirror_mode": { 1140 | "name": "ipython", 1141 | "version": 3 1142 | }, 1143 | "file_extension": ".py", 1144 | "mimetype": "text/x-python", 1145 | "name": "python", 1146 | "nbconvert_exporter": "python", 1147 | "pygments_lexer": "ipython3", 1148 | "version": "3.8.12" 1149 | } 1150 | }, 1151 | "nbformat": 4, 1152 | "nbformat_minor": 5 1153 | } 1154 | -------------------------------------------------------------------------------- /dvc_sagemaker_script_mode.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "5289742c", 6 | "metadata": {}, 7 | "source": [ 8 | "# Prerequisite\n", 9 | "\n", 10 | "This notebook assumes you are using the `conda-env-dvc-kernel` image built and attached to a SageMaker Studio domain. Setup guidelines are available [here](https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/blob/main/sagemaker-studio-dvc-image/README.md).\n", 11 | "\n", 12 | "# Training a CatBoost regression model with data from DVC\n", 13 | "\n", 14 | "This notebook will guide you through an example that shows you how to build a Docker containers for SageMaker and use it for processing, training, and inference in conjunction with [DVC](https://dvc.org/).\n", 15 | "\n", 16 | "By packaging libraries and algorithms in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.\n", 17 | "\n", 18 | "### California Housing dataset\n", 19 | "\n", 20 | "We use the California Housing dataset, present in [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). \n", 21 | "\n", 22 | "The California Housing dataset was originally published in:\n", 23 | "\n", 24 | "Pace, R. Kelley, and Ronald Barry. \"Sparse spatial auto-regressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n", 25 | "\n", 26 | "### DVC\n", 27 | "\n", 28 | "DVC is built to make machine learning (ML) models shareable and reproducible.\n", 29 | "It is designed to handle large files, data sets, machine learning models, and metrics as well as code." 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "id": "f55d6f3f", 35 | "metadata": {}, 36 | "source": [ 37 | "## Part 1: Configure DVC for data versioning\n", 38 | "\n", 39 | "Let us create a subdirectory where we prepare the data, i.e. 
`sagemaker-dvc-sample`.\n", 40 | "Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in [AWS CodeCommit](https://aws.amazon.com/codecommit/).\n", 41 | "The `dvc` configurations and files for data tracking will be versioned in this repository.\n", 42 | "Git offers native capabilities to manage subprojects via, for example, `git submodules` and `git subtrees`, and you can extend this notebook to use any of the aforementioned tools that best fit your workflow.\n", 43 | "\n", 44 | "One of the great advantages of using AWS CodeCommit in this context is its native integration with IAM for authentication purposes, meaning we can use the SageMaker execution role to interact with the git server without the need to worry about how to store and retrieve credentials. Of course, you can always replace AWS CodeCommit with any other git-based version control system such as GitHub, GitLab, or Bitbucket, keeping in mind that you will need to handle the credentials in a secure manner, for example, by introducing [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) to store and pull credentials at run time in the notebook as well as in the processing and training jobs.\n", 45 | "\n", 46 | "Setting the appropriate permissions on the SageMaker execution role will also allow the SageMaker processing and training jobs to interact securely with AWS CodeCommit." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "id": "8952c81c", 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "%%sh\n", 57 | "\n", 58 | "## Create the repository\n", 59 | "\n", 60 | "repo_name=\"sagemaker-dvc-sample\"\n", 61 | "\n", 62 | "aws codecommit create-repository --repository-name ${repo_name} --repository-description \"Sample repository to describe how to use dvc with sagemaker and codecommit\"\n", 63 | "\n", 64 | "account=$(aws sts get-caller-identity --query Account --output text)\n", 65 | "\n", 66 | "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", 67 | "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", 68 | "region=${region:-eu-west-1}\n", 69 | "\n", 70 | "## repo_name is already in the .gitignore of the root repo\n", 71 | "\n", 72 | "mkdir -p ${repo_name}\n", 73 | "cd ${repo_name}\n", 74 | "\n", 75 | "# initialize new repo in subfolder\n", 76 | "git init\n", 77 | "## Change the remote to the codecommit\n", 78 | "git remote add origin https://git-codecommit.\"${region}\".amazonaws.com/v1/repos/\"${repo_name}\"\n", 79 | "\n", 80 | "# Configure git - change it according to your needs\n", 81 | "git config --global user.email \"sagemaker-studio-user@example.com\"\n", 82 | "git config --global user.name \"SageMaker Studio User\"\n", 83 | "\n", 84 | "git config --global credential.helper '!aws codecommit credential-helper $@'\n", 85 | "git config --global credential.UseHttpPath true\n", 86 | "\n", 87 | "# Initialize dvc\n", 88 | "dvc init\n", 89 | "\n", 90 | "git commit -m 'Add dvc configuration'\n", 91 | "\n", 92 | "# Set the DVC remote storage to S3 - uses the sagemaker standard default bucket\n", 93 | "dvc remote add -d storage s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc\n", 94 | "git commit .dvc/config -m \"initialize DVC local remote\"\n", 95 | "\n", 96 | "# set the DVC cache to S3\n", 97 | "dvc remote add s3cache s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc/cache\n", 98 | "dvc config cache.s3
s3cache\n", 99 | "\n", 100 | "# disable sending anonymized data to dvc for troubleshooting\n", 101 | "dvc config core.analytics false\n", 102 | "\n", 103 | "git add .dvc/config\n", 104 | "git commit -m 'update dvc config'\n", 105 | "\n", 106 | "git push --set-upstream origin master #--force" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "93d74587", 112 | "metadata": {}, 113 | "source": [ 114 | "## Part 2: Processing and Training with DVC and SageMaker\n", 115 | "\n", 116 | "In this section we explore two different approaches to tackle our problem and how we can keep track of the 2 tests using SageMaker Experiments.\n", 117 | "\n", 118 | "The high level conceptual architecture is depicted in the figure below.\n", 119 | "\n", 120 | "\n", 121 | "Fig. 1 High level architecture\n", 122 | "\n", 123 | "\n", 124 | "### Import libraries and initial setup\n", 125 | "\n", 126 | "Lets start by importing the libraries and setup variables that will be useful as we go along in the notebook." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "id": "bdbc951e", 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "import boto3\n", 137 | "import sagemaker\n", 138 | "import time\n", 139 | "from time import strftime\n", 140 | "\n", 141 | "boto_session = boto3.Session()\n", 142 | "sagemaker_session = sagemaker.Session(boto_session=boto_session)\n", 143 | "sm_client = boto3.client(\"sagemaker\")\n", 144 | "region = boto_session.region_name\n", 145 | "bucket = sagemaker_session.default_bucket()\n", 146 | "role = sagemaker.get_execution_role()\n", 147 | "account = sagemaker_session.boto_session.client(\"sts\").get_caller_identity()[\"Account\"]\n", 148 | "\n", 149 | "prefix = 'DEMO-sagemaker-experiments-dvc'\n", 150 | "\n", 151 | "print(f\"account: {account}\")\n", 152 | "print(f\"bucket: {bucket}\")\n", 153 | "print(f\"region: {region}\")\n", 154 | "print(f\"role: {role}\")" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "id": "8edec916", 160 | "metadata": {}, 161 | "source": [ 162 | "### Prepare raw data\n", 163 | "\n", 164 | "We upload the raw data to S3 in the default bucket." 
165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "id": "cae15de4", 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "import pandas as pd\n", 175 | "import numpy as np\n", 176 | "\n", 177 | "from sklearn.datasets import fetch_california_housing\n", 178 | "from sklearn.model_selection import train_test_split\n", 179 | "\n", 180 | "from pathlib import Path\n", 181 | "\n", 182 | "databunch = fetch_california_housing()\n", 183 | "dataset = np.concatenate((databunch[\"target\"].reshape(-1, 1), databunch[\"data\"]), axis=1)\n", 184 | "\n", 185 | "print(f\"Dataset shape = {dataset.shape}\")\n", 186 | "np.savetxt(\"dataset.csv\", dataset, delimiter=\",\")\n", 187 | "\n", 188 | "data_prefix_path = f\"{prefix}/input/dataset.csv\"\n", 189 | "s3_data_path = f\"s3://{bucket}/{data_prefix_path}\"\n", 190 | "print(f\"Raw data location in S3: {s3_data_path}\")\n", 191 | "\n", 192 | "s3 = boto3.client(\"s3\")\n", 193 | "s3.upload_file(\"dataset.csv\", bucket, data_prefix_path)" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "id": "08e50d6a", 199 | "metadata": {}, 200 | "source": [ 201 | "### Setup SageMaker Experiments\n", 202 | "\n", 203 | "Amazon SageMaker Experiments have been built for data scientists that are performing different experiments as part of their model development process and want a simple way to organize, track, compare, and evaluate their machine learning experiments.\n", 204 | "\n", 205 | "Let’s start first with an overview of Amazon SageMaker Experiments features:\n", 206 | "\n", 207 | "* Organize Experiments: Amazon SageMaker Experiments structures experimentation with a first top level entity called experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, parameters, and artifacts. You can picture experiments as the top level “folder” for organizing your hypotheses, your trials as the “subfolders” for each group test run, and your trial components as your “files” for each instance of a test run.\n", 208 | "* Track Experiments: Amazon SageMaker Experiments allows the data scientist to track experiments automatically or manually. Amazon SageMaker Experiments offers the possibility to automatically assign the sagemaker jobs to a trial specifying the `experiment_config` argument, or to manually call the tracking APIs.\n", 209 | "* Compare and Evaluate Experiments: The integration of Amazon SageMaker Experiments with Amazon SageMaker Studio makes it easier to produce data visualizations and compare different trials to identify the best combination of hyperparameters.\n", 210 | "\n", 211 | "Now, in order to track this test in SageMaker, we need to create an experiment." 
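Once the trials below have produced their processing and training jobs, the comparison described above can also be done programmatically. A hedged sketch using `ExperimentAnalytics` from the SageMaker Python SDK (not executed in this notebook):

```python
# Illustrative sketch: pull all trial components of the experiment into a pandas DataFrame.
from sagemaker.analytics import ExperimentAnalytics

analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session,
    experiment_name="DEMO-sagemaker-experiments-dvc",
)
trial_components_df = analytics.dataframe()

# Keep the identifying column plus any median-AE metric columns for a quick comparison
cols = [c for c in trial_components_df.columns if c == "TrialComponentName" or "median-AE" in c]
print(trial_components_df[cols])
```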
212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "id": "fc5aa1cf", 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "from smexperiments.experiment import Experiment\n", 222 | "from smexperiments.trial import Trial\n", 223 | "from smexperiments.trial_component import TrialComponent\n", 224 | "from smexperiments.tracker import Tracker\n", 225 | "\n", 226 | "experiment_name = 'DEMO-sagemaker-experiments-dvc'\n", 227 | "\n", 228 | "# create the experiment if it doesn't exist\n", 229 | "try:\n", 230 | " my_experiment = Experiment.load(experiment_name=experiment_name)\n", 231 | " print(\"existing experiment loaded\")\n", 232 | "except Exception as ex:\n", 233 | " if \"ResourceNotFound\" in str(ex):\n", 234 | " my_experiment = Experiment.create(\n", 235 | " experiment_name = experiment_name,\n", 236 | " description = \"How to integrate DVC\"\n", 237 | " )\n", 238 | " print(\"new experiment created\")\n", 239 | " else:\n", 240 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 241 | " print(\"Dont go forward!\")\n", 242 | " raise" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "2c13953c", 248 | "metadata": {}, 249 | "source": [ 250 | "We need to also define trials within the experiment.\n", 251 | "While it is possible to have any number of trials within an experiment, for our excercise, we will create 2 trials, one for each processing strategy.\n", 252 | "\n", 253 | "### Test 1: generate single files for training and validation\n", 254 | "\n", 255 | "In this test, we show how to create a processing script that fetches the raw data directly from S3 as an input, process it to create the triplet `train`, `validation` and `test`, and store the results back to S3 using `dvc`. Furthermore, we show how you can pair `dvc` with SageMaker native tracking capabilities when executing Processing and Training Jobs and via SageMaker Experiments." 
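As described in the sections that follow, the processing script receives the train/test split ratio as a command line argument and the DVC coordinates as environment variables. A minimal, illustrative sketch of that plumbing (the real parsing lives in `source_dir/preprocessing-experiment.py`):

```python
# Illustrative sketch of how the processing script can pick up its configuration;
# the argument and variable names follow the ones used in this notebook.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
args, _ = parser.parse_known_args()

dvc_repo_url = os.environ["DVC_REPO_URL"]
dvc_branch = os.environ["DVC_BRANCH"]

print(f"Split ratio {args.train_test_split_ratio} on branch {dvc_branch} of {dvc_repo_url}")
```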
256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "id": "83654fb0", 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "first_trial_name = \"dvc-trial-single-file\"\n", 266 | "\n", 267 | "try:\n", 268 | " my_first_trial = Trial.load(trial_name=first_trial_name)\n", 269 | " print(\"existing trial loaded\")\n", 270 | "except Exception as ex:\n", 271 | " if \"ResourceNotFound\" in str(ex):\n", 272 | " my_first_trial = Trial.create(\n", 273 | " experiment_name=experiment_name,\n", 274 | " trial_name=first_trial_name,\n", 275 | " )\n", 276 | " print(\"new trial created\")\n", 277 | " else:\n", 278 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 279 | " print(\"Don't go forward!\")\n", 280 | " raise" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "id": "93aa35a4", 286 | "metadata": {}, 287 | "source": [ 288 | "### Processing script: version data with DVC\n", 289 | "\n", 290 | "The processing script expects the address of the git repository and the branch to create for storing the `dvc` metadata, both passed via environment variables.\n", 291 | "The datasets themselves will then be stored in S3.\n", 292 | "Environment variables are automatically tracked in SageMaker Experiments in the automatically generated TrialComponent.\n", 293 | "The TrialComponent generated by SageMaker can be loaded within the Processing Job and further enriched with any extra data, which then becomes available for visualization in the SageMaker Studio UI.\n", 294 | "In our case, we will store the following data:\n", 295 | "* `DVC_REPO_URL`\n", 296 | "* `DVC_BRANCH`\n", 297 | "* `USER`\n", 298 | "* `data_commit_hash`\n", 299 | "* `train_test_split_ratio`" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "id": "f1b391c3", 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "!pygmentize 'source_dir/preprocessing-experiment.py'" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "id": "5cb3bc2f", 315 | "metadata": {}, 316 | "source": [ 317 | "### SageMaker Processing job\n", 318 | "\n", 319 | "SageMaker Processing gives us the possibility to execute our processing script on container images managed by AWS that are optimized to run on the AWS infrastructure.\n", 320 | "If our script requires additional dependencies, we can supply a `requirements.txt` file.\n", 321 | "Upon starting the processing job, SageMaker will `pip`-install all the libraries we need (e.g., `dvc`-related libraries).\n", 322 | "\n", 323 | "We now have all the ingredients to execute our SageMaker Processing Job:\n", 324 | "* a processing script that can process several arguments (i.e., `--train-test-split-ratio`) and two environment variables (i.e., `DVC_REPO_URL` and `DVC_BRANCH`)\n", 325 | "* a `requirements.txt` file\n", 326 | "* a git repository (in AWS CodeCommit)\n", 327 | "* a SageMaker Experiment and a Trial" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "id": "400b363d", 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "from sagemaker.processing import FrameworkProcessor, ProcessingInput\n", 338 | "from sagemaker.sklearn.estimator import SKLearn\n", 339 | "\n", 340 | "dvc_repo_url = \"codecommit::{}://sagemaker-dvc-sample\".format(region)\n", 341 | "dvc_branch = my_first_trial.trial_name\n", 342 | "\n", 343 | "script_processor = FrameworkProcessor(\n", 344 | " estimator_cls=SKLearn,\n", 345 | " framework_version='0.23-1',\n", 346 | " instance_count=1,\n", 347
| " instance_type='ml.m5.xlarge',\n", 348 | " env={\n", 349 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 350 | " \"DVC_BRANCH\": dvc_branch,\n", 351 | " \"USER\": \"sagemaker\"\n", 352 | " },\n", 353 | " role=role\n", 354 | ")\n", 355 | "\n", 356 | "experiment_config={\n", 357 | " \"ExperimentName\": my_experiment.experiment_name,\n", 358 | " \"TrialName\": my_first_trial.trial_name\n", 359 | "}" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "id": "5b4b7f46", 365 | "metadata": {}, 366 | "source": [ 367 | "Executing the processing job will take around 3-4 minutes." 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "id": "1a4dfc06", 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "%%time\n", 378 | "\n", 379 | "script_processor.run(\n", 380 | " code='./source_dir/preprocessing-experiment.py',\n", 381 | " dependencies=['./source_dir/requirements.txt'],\n", 382 | " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", 383 | " experiment_config=experiment_config,\n", 384 | " arguments=[\"--train-test-split-ratio\", \"0.2\"]\n", 385 | ")\n" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "id": "991638a4", 391 | "metadata": {}, 392 | "source": [ 393 | "### Create an estimator and fit the model" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "id": "8071e644", 399 | "metadata": {}, 400 | "source": [ 401 | "To use DVC integration, pass a `dvc_repo_url` and `dvc_branch` as environmental variables when you create the Estimator object.\n", 402 | "\n", 403 | "We will train on the `dvc-trial-single-file` branch first.\n", 404 | "\n", 405 | "When doing `dvc pull` in the training script, the following dataset structure will be generated:\n", 406 | "\n", 407 | "```\n", 408 | "dataset\n", 409 | " |-- train\n", 410 | " | |-- california_train.csv\n", 411 | " |-- test\n", 412 | " | |-- california_test.csv\n", 413 | " |-- validation\n", 414 | " | |-- california_validation.csv\n", 415 | "```\n", 416 | "\n", 417 | "#### Metric definition\n", 418 | "\n", 419 | "SageMaker emits every log that is going to STDOUT to CloudWatch. 
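This means the training script itself decides what can be captured: only values it actually prints end up in the logs. As an illustration only (the repository's `train.py` has its own implementation), a median absolute error line that the `median-AE` regex used below would match can be produced like this:

```python
# Illustrative only: print a metric line that the regex below ("AE-at-50th-percentile: ([0-9.]+)") can match.
import numpy as np

def report_median_ae(y_true, y_pred):
    median_ae = np.percentile(np.abs(np.asarray(y_true) - np.asarray(y_pred)), 50)
    # SageMaker scrapes this stdout line from the training job's CloudWatch logs
    print(f"AE-at-50th-percentile: {median_ae:.4f}")
```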
In order to capture the metrics we are interested in, we need to specify a metric definition object to define the format of the metrics via regex.\n", 420 | "By doing so, SageMaker will know how to capture the metrics from the CloudWatch logs of the training job.\n", 421 | "\n", 422 | "In our case, we are interested in the median error.\n", 423 | "```\n", 424 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 425 | "```" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "id": "d464d52b", 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 436 | "\n", 437 | "hyperparameters={ \n", 438 | " \"learning_rate\" : 1,\n", 439 | " \"depth\": 6\n", 440 | " }\n", 441 | "estimator = SKLearn(\n", 442 | " entry_point='train.py',\n", 443 | " source_dir='source_dir',\n", 444 | " role=role,\n", 445 | " metric_definitions=metric_definitions,\n", 446 | " hyperparameters=hyperparameters,\n", 447 | " instance_count=1,\n", 448 | " instance_type='ml.m5.large',\n", 449 | " framework_version='0.23-1',\n", 450 | " base_job_name='training-with-dvc-data',\n", 451 | " environment={\n", 452 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 453 | " \"DVC_BRANCH\": dvc_branch,\n", 454 | " \"USER\": \"sagemaker\"\n", 455 | " }\n", 456 | ")\n", 457 | "\n", 458 | "experiment_config={\n", 459 | " \"ExperimentName\": my_experiment.experiment_name,\n", 460 | " \"TrialName\": my_first_trial.trial_name\n", 461 | "}" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "id": "2b4e1302", 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "%%time\n", 472 | "\n", 473 | "estimator.fit(experiment_config=experiment_config)" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "id": "1aaa3a46", 479 | "metadata": {}, 480 | "source": [ 481 | "On the logs above you can see those lines, indicating about the files pulled by dvc:\n", 482 | "\n", 483 | "```\n", 484 | "Running dvc pull command\n", 485 | "A train/california_train.csv\n", 486 | "A test/california_test.csv\n", 487 | "A validation/california_validation.csv\n", 488 | "3 files added and 3 files fetched\n", 489 | "Starting the training.\n", 490 | "Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", 491 | "Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", 492 | "```" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "id": "f4de6faf", 498 | "metadata": {}, 499 | "source": [ 500 | "### Test 2: generate multiple files for training and validation" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": null, 506 | "id": "2cfe7996", 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "second_trial_name = \"dvc-trial-multi-files\"\n", 511 | "\n", 512 | "try:\n", 513 | " my_second_trial = Trial.load(trial_name=second_trial_name)\n", 514 | " print(\"existing trial loaded\")\n", 515 | "except Exception as ex:\n", 516 | " if \"ResourceNotFound\" in str(ex):\n", 517 | " my_second_trial = Trial.create(\n", 518 | " experiment_name=experiment_name,\n", 519 | " trial_name=second_trial_name,\n", 520 | " )\n", 521 | " print(\"new trial created\")\n", 522 | " else:\n", 523 | " print(f\"Unexpected {ex}=, {type(ex)}\")\n", 524 | " print(\"Dont go forward!\")\n", 525 | " raise" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "id": 
"a90c0238", 531 | "metadata": {}, 532 | "source": [ 533 | "Differently from the first processing script, we now create out of the original dataset multiple files for training and validation and store the `dvc` metadata in a different branch." 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "id": "f7eda4b2", 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "!pygmentize 'source_dir/preprocessing-experiment-multifiles.py'" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": null, 549 | "id": "25c05bae", 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "from sagemaker.processing import FrameworkProcessor, ProcessingInput\n", 554 | "from sagemaker.sklearn.estimator import SKLearn\n", 555 | "\n", 556 | "dvc_branch = my_second_trial.trial_name\n", 557 | "\n", 558 | "script_processor = FrameworkProcessor(\n", 559 | " estimator_cls=SKLearn,\n", 560 | " framework_version='0.23-1',\n", 561 | " instance_count=1,\n", 562 | " instance_type='ml.m5.xlarge',\n", 563 | " env={\n", 564 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 565 | " \"DVC_BRANCH\": dvc_branch,\n", 566 | " \"USER\": \"sagemaker\",\n", 567 | " },\n", 568 | " role=role\n", 569 | ")\n", 570 | "\n", 571 | "experiment_config={\n", 572 | " \"ExperimentName\": my_experiment.experiment_name,\n", 573 | " \"TrialName\": my_second_trial.trial_name\n", 574 | "}" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "id": "624a6b65", 580 | "metadata": {}, 581 | "source": [ 582 | "Executing the processing job will take ~5 minutes" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "id": "099c269e", 589 | "metadata": {}, 590 | "outputs": [], 591 | "source": [ 592 | "%%time\n", 593 | "\n", 594 | "script_processor.run(\n", 595 | " code='./source_dir/preprocessing-experiment-multifiles.py',\n", 596 | " dependencies=['./source_dir/requirements.txt'],\n", 597 | " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", 598 | " experiment_config=experiment_config,\n", 599 | " arguments=[\"--train-test-split-ratio\", \"0.1\"]\n", 600 | ")" 601 | ] 602 | }, 603 | { 604 | "cell_type": "markdown", 605 | "id": "bb210f96", 606 | "metadata": {}, 607 | "source": [ 608 | "We will now train on the `dvc-trial-multi-files` branch.\n", 609 | "\n", 610 | "When doing `dvc pull`, this is the dataset structure:\n", 611 | "\n", 612 | "```\n", 613 | "dataset\n", 614 | " |-- train\n", 615 | " | |-- california_train_1.csv\n", 616 | " | |-- california_train_2.csv\n", 617 | " | |-- california_train_3.csv\n", 618 | " | |-- california_train_4.csv\n", 619 | " | |-- california_train_5.csv\n", 620 | " |-- test\n", 621 | " | |-- california_test.csv\n", 622 | " |-- validation\n", 623 | " | |-- california_validation_1.csv\n", 624 | " | |-- california_validation_2.csv\n", 625 | " | |-- california_validation_3.csv\n", 626 | "```" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": null, 632 | "id": "5cadd7b2", 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", 637 | "\n", 638 | "hyperparameters={ \n", 639 | " \"learning_rate\" : 1,\n", 640 | " \"depth\": 6\n", 641 | " }\n", 642 | "\n", 643 | "estimator = SKLearn(\n", 644 | " entry_point='train.py',\n", 645 | " source_dir='source_dir',\n", 646 | " role=role,\n", 647 | " metric_definitions=metric_definitions,\n", 648 | " 
hyperparameters=hyperparameters,\n", 649 | " instance_count=1,\n", 650 | " instance_type='ml.m5.large',\n", 651 | " framework_version='0.23-1',\n", 652 | " base_job_name='training-with-dvc-data',\n", 653 | " environment={\n", 654 | " \"DVC_REPO_URL\": dvc_repo_url,\n", 655 | " \"DVC_BRANCH\": dvc_branch,\n", 656 | " \"USER\": \"sagemaker\"\n", 657 | " }\n", 658 | ")\n", 659 | "\n", 660 | "experiment_config={\n", 661 | " \"ExperimentName\": my_experiment.experiment_name,\n", 662 | " \"TrialName\": my_second_trial.trial_name,\n", 663 | "}" 664 | ] 665 | }, 666 | { 667 | "cell_type": "markdown", 668 | "id": "ce4aa067", 669 | "metadata": {}, 670 | "source": [ 671 | "The training job will take around ~5 minutes" 672 | ] 673 | }, 674 | { 675 | "cell_type": "code", 676 | "execution_count": null, 677 | "id": "fc8e059e", 678 | "metadata": {}, 679 | "outputs": [], 680 | "source": [ 681 | "%%time\n", 682 | "\n", 683 | "estimator.fit(experiment_config=experiment_config)" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "id": "f34d4aa3", 689 | "metadata": {}, 690 | "source": [ 691 | "On the logs above you can see those lines, indicating about the files pulled by dvc:\n", 692 | "\n", 693 | "```\n", 694 | "Running dvc pull command\n", 695 | "A validation/california_validation_2.csv\n", 696 | "A validation/california_validation_1.csv\n", 697 | "A validation/california_validation_3.csv\n", 698 | "A train/california_train_4.csv\n", 699 | "A train/california_train_5.csv\n", 700 | "A train/california_train_2.csv\n", 701 | "A train/california_train_3.csv\n", 702 | "A train/california_train_1.csv\n", 703 | "A test/california_test.csv\n", 704 | "9 files added and 9 files fetched\n", 705 | "Starting the training.\n", 706 | "Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']\n", 707 | "Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']\n", 708 | "```" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "id": "514a78d9", 714 | "metadata": {}, 715 | "source": [ 716 | "## Part 3: Hosting your model in SageMaker" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": null, 722 | "id": "3330abce", 723 | "metadata": {}, 724 | "outputs": [], 725 | "source": [ 726 | "from sagemaker.serializers import CSVSerializer\n", 727 | "\n", 728 | "predictor = estimator.deploy(1, \"ml.t2.medium\", serializer=CSVSerializer())" 729 | ] 730 | }, 731 | { 732 | "cell_type": "markdown", 733 | "id": "bf141b4f", 734 | "metadata": {}, 735 | "source": [ 736 | "### Fetch the testing data\n", 737 | "\n", 738 | "Read the raw test data stored in S3 via DVC created by the SageMaker Processing Job. We use the `dvc` python API." 
739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "id": "3a5bcecb-a6ca-4d65-a239-be53c32f737a", 745 | "metadata": {}, 746 | "outputs": [], 747 | "source": [ 748 | "import io\n", 749 | "import dvc.api\n", 750 | "\n", 751 | "git_repo_https = f\"https://git-codecommit.{region}.amazonaws.com/v1/repos/sagemaker-dvc-sample\"\n", 752 | "\n", 753 | "raw = dvc.api.read(\n", 754 | " \"dataset/test/california_test.csv\",\n", 755 | " repo=git_repo_https,\n", 756 | " rev=dvc_branch\n", 757 | ")" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "id": "86d9841f", 763 | "metadata": {}, 764 | "source": [ 765 | "Prepare the data" 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": null, 771 | "id": "d931947d", 772 | "metadata": {}, 773 | "outputs": [], 774 | "source": [ 775 | "test = pd.read_csv(io.StringIO(raw), sep=\",\", header=None)\n", 776 | "X_test = test.iloc[:, 1:].values\n", 777 | "y_test = test.iloc[:, 0:1].values" 778 | ] 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "id": "a5b796e8", 783 | "metadata": {}, 784 | "source": [ 785 | "## Invoke endpoint with the Python SDK" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "id": "e0bd7491", 792 | "metadata": {}, 793 | "outputs": [], 794 | "source": [ 795 | "predicted = predictor.predict(X_test)\n", 796 | "for i in range(len(predicted)-1):\n", 797 | " print(f\"predicted: {predicted[i]}, actual: {y_test[i][0]}\")" 798 | ] 799 | }, 800 | { 801 | "cell_type": "markdown", 802 | "id": "a976c7bf", 803 | "metadata": {}, 804 | "source": [ 805 | "### Delete the Endpoint\n", 806 | "\n", 807 | "Make sure to delete the endpoint to avoid un-expected costs" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": null, 813 | "id": "4f0231db", 814 | "metadata": {}, 815 | "outputs": [], 816 | "source": [ 817 | "predictor.delete_endpoint()" 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "id": "70a7499e", 823 | "metadata": {}, 824 | "source": [ 825 | "### (Optional) Delete the Experiment, and all Trails, TrialComponents" 826 | ] 827 | }, 828 | { 829 | "cell_type": "code", 830 | "execution_count": null, 831 | "id": "6093da3e", 832 | "metadata": {}, 833 | "outputs": [], 834 | "source": [ 835 | "#my_experiment.delete_all(action=\"--force\")" 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "id": "897cc5f2", 841 | "metadata": {}, 842 | "source": [ 843 | "### (Optional) Delete the AWS CodeCommit repository" 844 | ] 845 | }, 846 | { 847 | "cell_type": "code", 848 | "execution_count": null, 849 | "id": "cb3f762a", 850 | "metadata": {}, 851 | "outputs": [], 852 | "source": [ 853 | "#!aws codecommit delete-repository --repository-name sagemaker-dvc-sample" 854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": null, 859 | "id": "bf55fe4a-97d0-41d6-9796-83491cb0c640", 860 | "metadata": {}, 861 | "outputs": [], 862 | "source": [] 863 | } 864 | ], 865 | "metadata": { 866 | "instance_type": "ml.t3.medium", 867 | "kernelspec": { 868 | "display_name": "Python [conda env: dvc] (conda-env-dvc-kernel/latest)", 869 | "language": "python", 870 | "name": "conda-env-dvc-py__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:583558296381:image/conda-env-dvc-kernel" 871 | }, 872 | "language_info": { 873 | "codemirror_mode": { 874 | "name": "ipython", 875 | "version": 3 876 | }, 877 | "file_extension": ".py", 878 | "mimetype": "text/x-python", 879 | "name": "python", 880 | "nbconvert_exporter": "python", 
881 | "pygments_lexer": "ipython3", 882 | "version": "3.8.12" 883 | } 884 | }, 885 | "nbformat": 4, 886 | "nbformat_minor": 5 887 | } 888 | -------------------------------------------------------------------------------- /img/high-level-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/8cde12fece3dbfa46ae5e9c57fb7faffc1100616/img/high-level-architecture.png -------------------------------------------------------------------------------- /img/sm-experiments-tracker-dvc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/8cde12fece3dbfa46ae5e9c57fb7faffc1100616/img/sm-experiments-tracker-dvc.png -------------------------------------------------------------------------------- /img/studio-custom-image-select.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/8cde12fece3dbfa46ae5e9c57fb7faffc1100616/img/studio-custom-image-select.png -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM continuumio/miniconda3:4.10.3 2 | 3 | COPY environment.yml . 4 | RUN conda env create -f environment.yml 5 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/README.md: -------------------------------------------------------------------------------- 1 | ## Conda Environments as Kernels 2 | 3 | This tutorial explains how to create a custom image for Amazon SageMaker Studio that has DVC already installed. 4 | The advantage of creating an image and make it available to all SageMaker Studio users is that it creates a consistent environment for the SageMake Studio users, which they could also run locally. 5 | 6 | This tutorial is heavily inspired by [this example](https://github.com/aws-samples/sagemaker-studio-custom-image-samples/tree/main/examples/conda-env-kernel-image). 7 | Further information about custom images for SageMaker Studio can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi.html) 8 | 9 | ## Prerequisite 10 | 11 | * A Cloud9 environment with enough permissions 12 | 13 | ## Overview 14 | 15 | This custom image sample demonstrates how to create a custom Conda environment in a Docker image and use it as a custom kernel in SageMaker Studio. 16 | 17 | The Conda environment must have the appropriate kernel package installed, for e.g., `ipykernel` for a Python kernel. 18 | This example creates a Conda environment called `dvc` with a few Python packages (see [environment.yml](environment.yml)) and the `ipykernel`. 19 | SageMaker Studio will automatically recognize this Conda environment as a kernel named `conda-env-dvc-py`. 
20 | 21 | ### Clone the GitHub repository 22 | ```bash 23 | git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo 24 | ``` 25 | 26 | ### Resize Cloud9 27 | 28 | ```bash 29 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/ 30 | ./resize-cloud9.sh 20 31 | ``` 32 | 33 | ## Build the Docker images for SageMaker Studio 34 | 35 | Set some basic environment variables: 36 | 37 | ```bash 38 | sudo yum install jq -y 39 | export REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region') 40 | echo "export REGION=${REGION}" | tee -a ~/.bash_profile 41 | 42 | export ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account') 43 | echo "export ACCOUNT_ID=${ACCOUNT_ID}" | tee -a ~/.bash_profile 44 | 45 | export IMAGE_NAME=conda-env-dvc-kernel 46 | echo "export IMAGE_NAME=${IMAGE_NAME}" | tee -a ~/.bash_profile 47 | ``` 48 | 49 | Build the Docker image and push it to Amazon ECR. 50 | 51 | ```bash 52 | # Login to ECR 53 | aws --region ${REGION} ecr get-login-password | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom 54 | 55 | # Create the ECR repository 56 | aws --region ${REGION} ecr create-repository --repository-name smstudio-custom 57 | 58 | # Build the image - it might take a few minutes to complete this step 59 | docker build . -t ${IMAGE_NAME} -t ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME} 60 | # Push the image to ECR 61 | docker push ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME} 62 | ``` 63 | 64 | ## Associate a custom image to SageMaker Studio 65 | 66 | ### Prepare the environment to deploy with CDK 67 | 68 | Step 1: Navigate to the `cdk` directory: 69 | 70 | ```bash 71 | cd ~/environment/amazon-sagemaker-experiments-dvc-demo/sagemaker-studio-dvc-image/cdk 72 | ``` 73 | 74 | Step 2: Create a virtual environment: 75 | 76 | ```bash 77 | python3 -m venv .cdk-venv 78 | ``` 79 | 80 | Step 3: Activate the virtual environment once it has been created: 81 | 82 | ```bash 83 | source .cdk-venv/bin/activate 84 | ``` 85 | 86 | Step 4: Install the required dependencies: 87 | 88 | ```bash 89 | pip3 install --upgrade pip 90 | pip3 install -r requirements.txt 91 | ``` 92 | 93 | Step 5: Install and bootstrap CDK v2: 94 | 95 | ```bash 96 | npm install -g aws-cdk@2.27.0 --force 97 | cdk bootstrap 98 | ``` 99 | 100 | ### Create a new SageMaker Studio 101 | (Skip to [Update an existing SageMaker Studio](#update-an-existing-sagemaker-studio) if you already have a SageMaker Studio domain.) 102 | 103 | Step 6: Deploy the CDK stack (CDK deploys a stack named `sagemakerStudioUserCDK`, see [app.py](cdk/app.py), which you can verify in `CloudFormation`): 104 | 105 | ```bash 106 | cdk deploy --require-approval never 107 | ``` 108 | 109 | CDK creates the following resources via `CloudFormation` (a quick way to verify them is shown after this list): 110 | * provisions a new SageMaker Studio domain 111 | * creates and attaches a SageMaker execution role, i.e., `RoleForSagemakerStudioUsers`, with the right permissions to the SageMaker Studio domain 112 | * creates a SageMaker Image and a SageMaker Image Version from the Docker image `conda-env-dvc-kernel` created earlier 113 | * creates an AppImageConfig which specifies how the kernel gateway should be configured 114 | * provisions a SageMaker Studio user, i.e., `data-scientist-dvc`, with the correct SageMaker execution role and makes the custom SageMaker Studio image available to it 115 |
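
Once the stack has finished deploying, an optional way to double-check that the image, image config, and domain were registered is the AWS CLI. The commands below are only a verification aid and are not required for the rest of the walkthrough:

```bash
# List Studio domains and check that the new domain appears
aws --region ${REGION} sagemaker list-domains

# Inspect the custom SageMaker image and its kernel gateway configuration
aws --region ${REGION} sagemaker describe-image --image-name conda-env-dvc-kernel
aws --region ${REGION} sagemaker describe-app-image-config --app-image-config-name conda-env-dvc-kernel-config
```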
116 | ### Update an existing SageMaker Studio 117 | 118 | If you have an existing SageMaker Studio environment, you first need to retrieve the existing SageMaker Studio domain ID, deploy a "reduced" version of the CDK stack, and update the SageMaker Studio domain configuration. 119 | 120 | Step 6: Set the `DOMAIN_ID` environment variable with your domain ID and save it to your `bash_profile`: 121 | 122 | ```bash 123 | export DOMAIN_ID=$(aws sagemaker list-domains | jq -r '.Domains[0].DomainId') 124 | echo "export DOMAIN_ID=${DOMAIN_ID}" | tee -a ~/.bash_profile 125 | ``` 126 | 127 | Step 7: Deploy the CDK stack (with the `DOMAIN_ID` environment variable set, CDK deploys the stack `sagemakerStudioUserCDK` against your existing domain, which you can verify in `CloudFormation`): 128 | 129 | ```bash 130 | cdk deploy --require-approval never 131 | ``` 132 | 133 | CDK creates the following resources via `CloudFormation`: 134 | * creates and attaches a SageMaker execution role, i.e., `RoleForSagemakerStudioUsers`, with the right permissions to your existing SageMaker Studio domain 135 | * creates a SageMaker Image and a SageMaker Image Version from the Docker image `conda-env-dvc-kernel` created earlier 136 | * creates an AppImageConfig which specifies how the kernel gateway should be configured 137 | * provisions a SageMaker Studio user, i.e., `data-scientist-dvc`, with the correct SageMaker execution role and makes the custom SageMaker Studio image available to it 138 | 139 | Step 8: Update the SageMaker Studio domain configuration: 140 | 141 | ```bash 142 | # inject your DOMAIN_ID into the configuration file 143 | sed -i 's//'"$DOMAIN_ID"'/' ../update-domain-input.json 144 | 145 | # update the sagemaker studio domain 146 | aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-input.json 147 | ``` 148 | 149 | Open the newly created SageMaker Studio user, i.e., `data-scientist-dvc`. 150 | 151 | ### Execute the lab 152 | 153 | In the SageMaker Studio domain, launch `Studio` for the `data-scientist-dvc` user. 154 | Open a terminal and clone the repository: 155 | 156 | ```bash 157 | git clone https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo 158 | ``` 159 | 160 | and open the [dvc_sagemaker_script_mode.ipynb](../dvc_sagemaker_script_mode.ipynb) notebook. 161 | 162 | When prompted, ensure that you select the Custom Image `conda-env-dvc-kernel` as shown below. 163 | 164 | ![image info](../img/studio-custom-image-select.png) 165 | 166 | ### Cleanup 167 | 168 | Before removing all created resources, make sure that all apps are deleted from the `data-scientist-dvc` user, i.e., all `KernelGateway` apps as well as the default `JupyterServer` (see the example below).
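
If you prefer the CLI to the SageMaker console, the sketch below shows one way to list and delete the remaining apps. It assumes `DOMAIN_ID` holds your Studio domain ID; `<app-type>` and `<app-name>` are placeholders for the values returned by `list-apps`.

```bash
# List all apps still attached to the data-scientist-dvc user profile
aws --region ${REGION} sagemaker list-apps --user-profile-name-equals data-scientist-dvc

# Delete one app at a time; repeat for every KernelGateway app and the JupyterServer app
aws --region ${REGION} sagemaker delete-app \
    --domain-id ${DOMAIN_ID} \
    --user-profile-name data-scientist-dvc \
    --app-type <app-type> \
    --app-name <app-name>
```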
169 | 170 | Once done, you can destroy the CDK stack by running 171 | 172 | ```bash 173 | cdk destroy 174 | ``` 175 | 176 | In case you started off from an existing domain, please also execute the following command: 177 | 178 | ```bash 179 | # inject your DOMAIN_ID into the configuration file 180 | sed -i 's//'"$DOMAIN_ID"'/' ../update-domain-no-custom-images.json 181 | 182 | # update the sagemaker studio domain 183 | aws --region ${REGION} sagemaker update-domain --cli-input-json file://../update-domain-no-custom-images.json 184 | ``` -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | package-lock.json 3 | .pytest_cache 4 | *.egg-info 5 | .idea/ 6 | # Byte-compiled / optimized / DLL files 7 | __pycache__/ 8 | *.py[cod] 9 | *$py.class 10 | 11 | # Environments 12 | .env 13 | .venv 14 | env/ 15 | venv/ 16 | env.bak/ 17 | venv.bak/ 18 | .cdk-venv 19 | 20 | # CDK Context & Staging files 21 | .cdk.staging/ 22 | cdk.out/ 23 | /cdk.context.json 24 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from aws_cdk import App, Stack 4 | 5 | from sagemakerStudioCDK.sagemaker_studio_stack import SagemakerStudioStack 6 | import os 7 | import boto3 8 | 9 | sts_client = boto3.client("sts") 10 | account_id = os.environ.get('ACCOUNT_ID', sts_client.get_caller_identity()["Account"]) 11 | region = os.environ.get('REGION', 'eu-west-1') 12 | 13 | domain_id = os.environ.get('DOMAIN_ID', None) 14 | 15 | app = App() 16 | 17 | if domain_id is None: 18 | print("Create a new studio domain") 19 | else: 20 | print("Existing domain ID: {}".format(domain_id)) 21 | 22 | SagemakerStudioStack(app, "sagemakerStudioUserCDK", domain_id, env={"account": account_id, 'region': region}) 23 | 24 | app.synth() 25 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/cdk.json: -------------------------------------------------------------------------------- 1 | { 2 | "app": "python3 app.py" 3 | } 4 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/requirements.txt: -------------------------------------------------------------------------------- 1 | -e . 2 | pytest 3 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/sagemakerStudioCDK/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/8cde12fece3dbfa46ae5e9c57fb7faffc1100616/sagemaker-studio-dvc-image/cdk/sagemakerStudioCDK/__init__.py -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/sagemakerStudioCDK/sagemaker_studio_stack.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | 4 | from aws_cdk import ( 5 | aws_iam as iam, 6 | aws_ec2 as ec2, 7 | aws_sagemaker as sagemaker, 8 | Stack, 9 | CfnOutput 10 | ) 11 | 12 | from constructs import Construct 13 | 14 | sagemaker_arn_region_account_mapping = { 15 | "eu-west-1": "470317259841", 16 | "us-east-1": "081325390199", 17 | "us-east-2": "429704687514", 18 | "us-west-1": "742091327244", 19 | "us-west-2": "236514542706", 20 | "af-south-1": "559312083959", 21 | "ap-east-1": "493642496378", 22 | "ap-south-1": "394103062818", 23 | "ap-northeast-2": "806072073708", 24 | "ap-southeast-1": "492261229750", 25 | "ap-southeast-2": "452832661640", 26 | "ap-northeast-1": "102112518831", 27 | "ca-central-1": "310906938811", 28 | "eu-central-1": "936697816551", 29 | "eu-west-2": "712779665605", 30 | "eu-west-3": "615547856133", 31 | "eu-north-1": "243637512696", 32 | "eu-south-1": "592751261982", 33 | "sa-east-1": "782484402741", 34 | } 35 | 36 | 37 | class SagemakerStudioStack(Stack): 38 | 39 | def __init__(self, scope: Construct, construct_id: str, domain_id: str, **kwargs) -> None: 40 | super().__init__(scope, construct_id, **kwargs) 41 | 42 | # Create a SageMaker 43 | role_sagemaker_studio_domain = iam.Role( 44 | self, 45 | 'RoleForSagemakerStudioUsers', 46 | assumed_by=iam.CompositePrincipal( 47 | iam.ServicePrincipal('sagemaker.amazonaws.com'), 48 | iam.ServicePrincipal('codebuild.amazonaws.com'), # needed to use the sm-build command 49 | ), 50 | managed_policies=[ 51 | iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess") 52 | ], 53 | inline_policies={ 54 | "code-commit-policy": iam.PolicyDocument( 55 | statements=[ 56 | iam.PolicyStatement( 57 | effect=iam.Effect.ALLOW, 58 | actions=[ 59 | "codecommit:AssociateApprovalRuleTemplateWithRepository", 60 | "codecommit:BatchAssociateApprovalRuleTemplateWithRepositories", 61 | "codecommit:BatchDisassociateApprovalRuleTemplateFromRepositories", 62 | "codecommit:BatchGet*", 63 | "codecommit:BatchDescribe*", 64 | "codecommit:Create*", 65 | "codecommit:DeleteBranch", 66 | "codecommit:DeleteFile", 67 | "codecommit:Describe*", 68 | "codecommit:DisassociateApprovalRuleTemplateFromRepository", 69 | "codecommit:EvaluatePullRequestApprovalRules", 70 | "codecommit:Get*", 71 | "codecommit:List*", 72 | "codecommit:Merge*", 73 | "codecommit:OverridePullRequestApprovalRules", 74 | "codecommit:Put*", 75 | "codecommit:Post*", 76 | "codecommit:TagResource", 77 | "codecommit:Test*", 78 | "codecommit:UntagResource", 79 | "codecommit:Update*", 80 | "codecommit:GitPull", 81 | "codecommit:GitPush", 82 | "codecommit:Delete*" 83 | ], 84 | resources=[f"arn:aws:codecommit:{self.region}:{self.account}:sagemaker-dvc-sample"] 85 | ) 86 | ] 87 | ), 88 | "s3bucket": iam.PolicyDocument( 89 | statements=[ 90 | iam.PolicyStatement( 91 | effect=iam.Effect.ALLOW, 92 | actions=["s3:ListBucket","s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:PutObjectTagging"], 93 | resources=["arn:aws:s3:::sagemaker*"] 94 | ) 95 | ] 96 | ), 97 | "sm-build-policy": iam.PolicyDocument( 98 | statements=[ 99 | iam.PolicyStatement( 100 | sid="EcrAuthorizationTokenRetrieval", 101 | effect=iam.Effect.ALLOW, 102 | actions=[ 103 | "ecr:BatchGetImage", 104 | "ecr:GetDownloadUrlForLayer" 105 | ], 106 | resources=[ 107 | "arn:aws:ecr:*:763104351884:repository/*", 108 | "arn:aws:ecr:*:217643126080:repository/*", 109 | "arn:aws:ecr:*:727897471807:repository/*", 110 | "arn:aws:ecr:*:626614931356:repository/*", 111 | "arn:aws:ecr:*:683313688378:repository/*", 112 | 
"arn:aws:ecr:*:520713654638:repository/*", 113 | "arn:aws:ecr:*:462105765813:repository/*" 114 | ] 115 | ), 116 | iam.PolicyStatement( 117 | effect=iam.Effect.ALLOW, 118 | actions=[ 119 | "ecr:CreateRepository", 120 | "ecr:BatchGetImage", 121 | "ecr:CompleteLayerUpload", 122 | "ecr:DescribeImages", 123 | "ecr:DescribeRepositories", 124 | "ecr:UploadLayerPart", 125 | "ecr:ListImages", 126 | "ecr:InitiateLayerUpload", 127 | "ecr:BatchCheckLayerAvailability", 128 | "ecr:PutImage" 129 | ], 130 | resources=["arn:aws:ecr:*:*:repository/sagemaker-studio*"] 131 | ), 132 | iam.PolicyStatement( 133 | effect=iam.Effect.ALLOW, 134 | actions=[ 135 | "codebuild:DeleteProject", 136 | "codebuild:CreateProject", 137 | "codebuild:BatchGetBuilds", 138 | "codebuild:StartBuild" 139 | ], 140 | resources=["arn:aws:codebuild:*:*:project/sagemaker-studio*"] 141 | ), 142 | iam.PolicyStatement( 143 | effect=iam.Effect.ALLOW, 144 | actions=["logs:CreateLogStream"], 145 | resources=["arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*"] 146 | ), 147 | iam.PolicyStatement( 148 | effect=iam.Effect.ALLOW, 149 | actions=[ 150 | "logs:GetLogEvents", 151 | "logs:PutLogEvents" 152 | ], 153 | resources=["arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*:log-stream:*"] 154 | ), 155 | iam.PolicyStatement( 156 | effect=iam.Effect.ALLOW, 157 | actions=["ecr:GetAuthorizationToken"], 158 | resources=["arn:aws:ecr:*:*:*"] 159 | ), 160 | iam.PolicyStatement( 161 | effect=iam.Effect.ALLOW, 162 | actions=["iam:PassRole"], 163 | resources=["arn:aws:iam::*:role/*"], 164 | conditions={ 165 | "StringLikeIfExists":{ 166 | "iam:PassedToService":"codebuild.amazonaws.com" 167 | } 168 | } 169 | ) 170 | ] 171 | ) 172 | } 173 | ) 174 | 175 | cfn_image = sagemaker.CfnImage( 176 | self, 177 | "DvcImage", 178 | image_name="conda-env-dvc-kernel", 179 | image_role_arn=role_sagemaker_studio_domain.role_arn, 180 | ) 181 | 182 | cfn_image_version = sagemaker.CfnImageVersion( 183 | self, 184 | "DvcImageVersion", 185 | image_name="conda-env-dvc-kernel", 186 | base_image="{}.dkr.ecr.{}.amazonaws.com/smstudio-custom:conda-env-dvc-kernel".format(self.account, self.region) 187 | ) 188 | 189 | cfn_image_version.add_depends_on(cfn_image) 190 | 191 | cfn_app_image_config = sagemaker.CfnAppImageConfig( 192 | self, 193 | "DvcAppImageConfig", 194 | app_image_config_name="conda-env-dvc-kernel-config", 195 | kernel_gateway_image_config=sagemaker.CfnAppImageConfig.KernelGatewayImageConfigProperty( 196 | kernel_specs=[ 197 | sagemaker.CfnAppImageConfig.KernelSpecProperty( 198 | name="conda-env-dvc-py", 199 | display_name="Python [conda env: dvc]" 200 | ) 201 | ], 202 | file_system_config=sagemaker.CfnAppImageConfig.FileSystemConfigProperty( 203 | default_gid=0, 204 | default_uid=0, 205 | mount_path="/root" 206 | ) 207 | ), 208 | ) 209 | 210 | cfn_app_image_config.add_depends_on(cfn_image_version) 211 | 212 | team = "data-scientist-dvc" 213 | 214 | if domain_id is not None: 215 | my_default_datascience_user = sagemaker.CfnUserProfile( 216 | self, 217 | "CfnUserProfile", 218 | domain_id=domain_id, 219 | user_profile_name=team, 220 | user_settings=sagemaker.CfnUserProfile.UserSettingsProperty( 221 | execution_role=role_sagemaker_studio_domain.role_arn, 222 | kernel_gateway_app_settings=sagemaker.CfnUserProfile.KernelGatewayAppSettingsProperty( 223 | custom_images=[ 224 | sagemaker.CfnUserProfile.CustomImageProperty( 225 | app_image_config_name="conda-env-dvc-kernel-config", 226 | image_name="conda-env-dvc-kernel", 227 | ) 228 | ] 229 | ), 230 | 
jupyter_server_app_settings=sagemaker.CfnUserProfile.JupyterServerAppSettingsProperty( 231 | default_resource_spec=sagemaker.CfnUserProfile.ResourceSpecProperty( 232 | instance_type="system", 233 | sage_maker_image_arn="arn:aws:sagemaker:{}:{}:image/jupyter-server-3".format(self.region, sagemaker_arn_region_account_mapping[self.region]), 234 | ) 235 | ), 236 | ) 237 | ) 238 | 239 | my_default_datascience_user.add_depends_on(cfn_app_image_config) 240 | else: 241 | 242 | self.role_sagemaker_studio_domain = role_sagemaker_studio_domain 243 | self.sagemaker_domain_name = "DomainForSagemakerStudio" 244 | 245 | default_vpc_id = ec2.Vpc.from_lookup( 246 | self, 247 | "VPC", 248 | is_default=True 249 | ) 250 | 251 | self.vpc_id = default_vpc_id.vpc_id 252 | self.public_subnet_ids = [public_subnet.subnet_id for public_subnet in default_vpc_id.public_subnets] 253 | 254 | my_sagemaker_domain = sagemaker.CfnDomain( 255 | self, 256 | "SageMakerStudioDomain", 257 | auth_mode="IAM", 258 | default_user_settings=sagemaker.CfnDomain.UserSettingsProperty( 259 | execution_role=self.role_sagemaker_studio_domain.role_arn, 260 | kernel_gateway_app_settings=sagemaker.CfnDomain.KernelGatewayAppSettingsProperty( 261 | custom_images=[ 262 | sagemaker.CfnDomain.CustomImageProperty( 263 | app_image_config_name="conda-env-dvc-kernel-config", 264 | image_name="conda-env-dvc-kernel", 265 | )] 266 | ), 267 | jupyter_server_app_settings=sagemaker.CfnDomain.JupyterServerAppSettingsProperty( 268 | default_resource_spec=sagemaker.CfnDomain.ResourceSpecProperty( 269 | instance_type="system", 270 | sage_maker_image_arn="arn:aws:sagemaker:{}:{}:image/jupyter-server-3".format(self.region, sagemaker_arn_region_account_mapping[self.region]), 271 | ) 272 | ), 273 | ), 274 | domain_name="domain-with-custom-conda-env", 275 | subnet_ids=self.public_subnet_ids, 276 | vpc_id=self.vpc_id 277 | ) 278 | 279 | my_sagemaker_domain.add_depends_on(cfn_app_image_config) 280 | 281 | my_default_datascience_user = sagemaker.CfnUserProfile( 282 | self, 283 | "CfnUserProfile", 284 | domain_id=my_sagemaker_domain.attr_domain_id, 285 | user_profile_name=team, 286 | user_settings=sagemaker.CfnUserProfile.UserSettingsProperty( 287 | execution_role=self.role_sagemaker_studio_domain.role_arn 288 | ) 289 | ) 290 | 291 | CfnOutput( 292 | self, 293 | "DomainIdSagemaker", 294 | value=my_sagemaker_domain.attr_domain_id, 295 | description="The sagemaker domain ID", 296 | export_name="DomainIdSagemaker" 297 | ) 298 | 299 | 300 | CfnOutput( 301 | self, 302 | f"cfnoutput{team}", 303 | value=my_default_datascience_user.attr_user_profile_arn, 304 | description="The User Arn TeamA domain ID", 305 | export_name=F"UserArn{team}" 306 | ) -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/cdk/setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | 4 | setuptools.setup( 5 | name="sagemakerStudioCDK", 6 | version="0.0.1", 7 | 8 | description="aws-cdk-sagemaker-studio", 9 | 10 | author="frpaolo", 11 | 12 | package_dir={"": "sagemakerStudioCDK"}, 13 | packages=setuptools.find_packages(where="sagemakerStudioCDK"), 14 | 15 | install_requires=[ 16 | "aws-cdk-lib==2.27.0", 17 | "constructs==10.0.34", 18 | "boto3" 19 | ], 20 | 21 | python_requires=">=3.6", 22 | 23 | classifiers=[ 24 | "Development Status :: 4 - Beta", 25 | 26 | "Intended Audience :: Developers", 27 | 28 | "License :: OSI Approved :: Apache Software License", 29 | 30 | "Programming Language :: 
JavaScript", 31 | "Programming Language :: Python :: 3 :: Only", 32 | "Programming Language :: Python :: 3.6", 33 | "Programming Language :: Python :: 3.7", 34 | "Programming Language :: Python :: 3.8", 35 | 36 | "Topic :: Software Development :: Code Generators", 37 | "Topic :: Utilities", 38 | 39 | "Typing :: Typed", 40 | ], 41 | ) 42 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/environment.yml: -------------------------------------------------------------------------------- 1 | name: dvc 2 | channels: 3 | - conda-forge 4 | - intel 5 | dependencies: 6 | - python=3.8 7 | - pip=22.0.4 8 | - ipykernel=6.9.1 9 | - pip: 10 | - dvc==2.8.3 11 | - dvc[s3]==2.8.3 12 | - s3fs==2021.11.0 13 | - awscli 14 | - boto3==1.17.106 15 | - sagemaker 16 | - sagemaker-studio-image-build==0.6.0 17 | - sagemaker-experiments==0.1.35 18 | - scikit-learn==1.0.2 19 | - protobuf==3.20 20 | - git-remote-codecommit==1.16 21 | 22 | -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/resize-cloud9.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB. 4 | SIZE=${1:-20} 5 | 6 | # Get the ID of the environment host Amazon EC2 instance. 7 | INSTANCEID=$(curl http://169.254.169.254/latest/meta-data/instance-id) 8 | REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/') 9 | 10 | # Get the ID of the Amazon EBS volume associated with the instance. 11 | VOLUMEID=$(aws ec2 describe-instances \ 12 | --instance-id $INSTANCEID \ 13 | --query "Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId" \ 14 | --output text \ 15 | --region $REGION) 16 | 17 | # Resize the EBS volume. 18 | aws ec2 modify-volume --volume-id $VOLUMEID --size $SIZE 19 | 20 | # Wait for the resize to finish. 21 | while [ \ 22 | "$(aws ec2 describe-volumes-modifications \ 23 | --volume-id $VOLUMEID \ 24 | --filters Name=modification-state,Values="optimizing","completed" \ 25 | --query "length(VolumesModifications)"\ 26 | --output text)" != "1" ]; do 27 | sleep 1 28 | done 29 | 30 | #Check if we're on an NVMe filesystem 31 | if [[ -e "/dev/xvda" && $(readlink -f /dev/xvda) = "/dev/xvda" ]] 32 | then 33 | # Rewrite the partition table so that the partition takes up all the space that it can. 34 | sudo growpart /dev/xvda 1 35 | 36 | # Expand the size of the file system. 37 | # Check if we're on AL2 38 | STR=$(cat /etc/os-release) 39 | SUB="VERSION_ID=\"2\"" 40 | if [[ "$STR" == *"$SUB"* ]] 41 | then 42 | sudo xfs_growfs -d / 43 | else 44 | sudo resize2fs /dev/xvda1 45 | fi 46 | 47 | else 48 | # Rewrite the partition table so that the partition takes up all the space that it can. 49 | sudo growpart /dev/nvme0n1 1 50 | 51 | # Expand the size of the file system. 
52 | # Check if we're on AL2 53 | STR=$(cat /etc/os-release) 54 | SUB="VERSION_ID=\"2\"" 55 | if [[ "$STR" == *"$SUB"* ]] 56 | then 57 | sudo xfs_growfs -d / 58 | else 59 | sudo resize2fs /dev/nvme0n1p1 60 | fi 61 | fi -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/update-domain-input.json: -------------------------------------------------------------------------------- 1 | { 2 | "DomainId": "", 3 | "DefaultUserSettings": { 4 | "KernelGatewayAppSettings": { 5 | "CustomImages": [ 6 | { 7 | "ImageName": "conda-env-dvc-kernel", 8 | "AppImageConfigName": "conda-env-dvc-kernel-config" 9 | } 10 | ] 11 | } 12 | } 13 | } -------------------------------------------------------------------------------- /sagemaker-studio-dvc-image/update-domain-no-custom-images.json: -------------------------------------------------------------------------------- 1 | { 2 | "DomainId": "", 3 | "DefaultUserSettings": { 4 | "KernelGatewayAppSettings": { 5 | "CustomImages": [] 6 | } 7 | } 8 | } -------------------------------------------------------------------------------- /source_dir/preprocessing-experiment-multifiles.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 2 | # SPDX-License-Identifier: MIT-0 3 | import os 4 | import argparse 5 | import sys 6 | import subprocess 7 | 8 | from pathlib import Path 9 | 10 | import numpy as np 11 | import pandas as pd 12 | 13 | from sklearn.model_selection import train_test_split 14 | 15 | from smexperiments.tracker import Tracker 16 | 17 | import dvc.api 18 | 19 | from git.repo.base import Repo 20 | 21 | # Prepare paths 22 | input_data_path = os.path.join("/opt/ml/processing/input", "dataset.csv") 23 | data_path = 'dataset' 24 | base_dir = f"./sagemaker-dvc-sample/{data_path}" 25 | file_types = ['test','train','validation'] 26 | 27 | dvc_repo_url = os.environ.get('DVC_REPO_URL') 28 | dvc_branch = os.environ.get('DVC_BRANCH') 29 | user = os.environ.get('USER', "sagemaker") 30 | 31 | def configure_git(): 32 | subprocess.check_call(['git', 'config', '--global', 'user.email', '"sagemaker-processing@example.com"']) 33 | subprocess.check_call(['git', 'config', '--global', 'user.name', user]) 34 | 35 | def split_dataframe(df, num=5): 36 | chunk_size = int(df.shape[0] / num) 37 | chunks = [df.iloc[i:i+chunk_size] for i in range(0,df.shape[0], chunk_size)] 38 | return chunks 39 | 40 | def clone_dvc_git_repo(): 41 | print(f"Cloning repo: {dvc_repo_url}") 42 | repo = Repo.clone_from(dvc_repo_url, './sagemaker-dvc-sample') 43 | return repo 44 | 45 | def generate_train_validation_files(ratio): 46 | for path in ['train', 'validation', 'test']: 47 | output_dir = Path(f"{base_dir}/{path}/") 48 | output_dir.mkdir(parents=True, exist_ok=True) 49 | 50 | print("Read dataset") 51 | dataset = pd.read_csv(input_data_path) 52 | train, other = train_test_split(dataset, test_size=ratio) 53 | validation, test = train_test_split(other, test_size=ratio) 54 | 55 | print("create train, validation, test") 56 | for index, chunk in enumerate(split_dataframe(pd.DataFrame(train))): 57 | chunk.to_csv(f"{base_dir}/train/california_train_{index + 1}.csv", header=False, index=False) 58 | 59 | for index, chunk in enumerate(split_dataframe(pd.DataFrame(validation), 3)): 60 | chunk.to_csv(f"{base_dir}/validation/california_validation_{index + 1}.csv", header=False, index=False) 61 | 62 | pd.DataFrame(test).to_csv(f"{base_dir}/test/california_test.csv", 
header=False, index=False) 63 | print("data created") 64 | 65 | def sync_data_with_dvc(repo): 66 | os.chdir(base_dir) 67 | print(f"Create branch {dvc_branch}") 68 | try: 69 | repo.git.checkout('-b', dvc_branch) 70 | print(f"Create a new branch: {dvc_branch}") 71 | except: 72 | repo.git.checkout(dvc_branch) 73 | print(f"Checkout existing branch: {dvc_branch}") 74 | print("Add files to DVC") 75 | 76 | for file_type in file_types: 77 | subprocess.check_call(['dvc', 'add', f"{file_type}/"]) 78 | 79 | repo.git.add(all=True) 80 | repo.git.commit('-m', f"'add data for {dvc_branch}'") 81 | print("Push data to DVC") 82 | subprocess.check_call(['dvc', 'push']) 83 | print("Push dvc metadata to git") 84 | repo.remote(name='origin') 85 | repo.git.push('--set-upstream', repo.remote().name, dvc_branch, '--force') 86 | 87 | sha = repo.head.commit.hexsha 88 | print(f"commit hash: {sha}") 89 | 90 | with Tracker.load() as tracker: 91 | tracker.log_parameters({"data_commit_hash": sha}) 92 | for file_type in file_types: 93 | path = dvc.api.get_url( 94 | f"{data_path}/{file_type}", 95 | repo=dvc_repo_url, 96 | rev=dvc_branch 97 | ) 98 | tracker.log_output(name=f"{file_type}",value=path) 99 | 100 | if __name__=="__main__": 101 | parser = argparse.ArgumentParser() 102 | parser.add_argument("--train-test-split-ratio", type=float, default=0.3) 103 | args, _ = parser.parse_known_args() 104 | 105 | train_test_split_ratio = args.train_test_split_ratio 106 | 107 | with Tracker.load() as tracker: 108 | tracker.log_parameters( 109 | { 110 | "train_test_split_ratio": train_test_split_ratio 111 | } 112 | ) 113 | 114 | configure_git() 115 | repo = clone_dvc_git_repo() 116 | generate_train_validation_files(train_test_split_ratio) 117 | sync_data_with_dvc(repo) 118 | -------------------------------------------------------------------------------- /source_dir/preprocessing-experiment.py: -------------------------------------------------------------------------------- 1 | # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
2 | # SPDX-License-Identifier: MIT-0 3 | import os 4 | import argparse 5 | import sys 6 | import subprocess 7 | 8 | from pathlib import Path 9 | 10 | import numpy as np 11 | import pandas as pd 12 | 13 | from sklearn.model_selection import train_test_split 14 | 15 | from smexperiments.tracker import Tracker 16 | 17 | import dvc.api 18 | 19 | from git.repo.base import Repo 20 | 21 | # Prepare paths 22 | input_data_path = os.path.join("/opt/ml/processing/input", "dataset.csv") 23 | data_path = 'dataset' 24 | base_dir = f"./sagemaker-dvc-sample/{data_path}" 25 | file_types = ['test','train','validation'] 26 | 27 | dvc_repo_url = os.environ.get('DVC_REPO_URL') 28 | dvc_branch = os.environ.get('DVC_BRANCH') 29 | user = os.environ.get('USER', "sagemaker") 30 | 31 | def configure_git(): 32 | subprocess.check_call(['git', 'config', '--global', 'user.email', '"sagemaker-processing@example.com"']) 33 | subprocess.check_call(['git', 'config', '--global', 'user.name', user]) 34 | 35 | def clone_dvc_git_repo(): 36 | print(f"Cloning repo: {dvc_repo_url}") 37 | repo = Repo.clone_from(dvc_repo_url, './sagemaker-dvc-sample') 38 | return repo 39 | 40 | def generate_train_validation_files(ratio): 41 | for path in ['train', 'validation', 'test']: 42 | output_dir = Path(f"{base_dir}/{path}/") 43 | output_dir.mkdir(parents=True, exist_ok=True) 44 | 45 | print("Read dataset") 46 | dataset = pd.read_csv(input_data_path) 47 | train, other = train_test_split(dataset, test_size=ratio) 48 | validation, test = train_test_split(other, test_size=ratio) 49 | 50 | print("create train, validation, test") 51 | pd.DataFrame(train).to_csv(f"{base_dir}/train/california_train.csv", header=False, index=False) 52 | pd.DataFrame(validation).to_csv(f"{base_dir}/validation/california_validation.csv", header=False, index=False) 53 | pd.DataFrame(test).to_csv(f"{base_dir}/test/california_test.csv", header=False, index=False) 54 | print("data created") 55 | 56 | def sync_data_with_dvc(repo): 57 | os.chdir(base_dir) 58 | print(f"Create branch {dvc_branch}") 59 | try: 60 | repo.git.checkout('-b', dvc_branch) 61 | print(f"Create a new branch: {dvc_branch}") 62 | except: 63 | repo.git.checkout(dvc_branch) 64 | print(f"Checkout existing branch: {dvc_branch}") 65 | print("Add files to DVC") 66 | 67 | for file_type in file_types: 68 | subprocess.check_call(['dvc', 'add', f"{file_type}/california_{file_type}.csv"]) 69 | 70 | repo.git.add(all=True) 71 | repo.git.commit('-m', f"'add data for {dvc_branch}'") 72 | print("Push data to DVC") 73 | subprocess.check_call(['dvc', 'push']) 74 | print("Push dvc metadata to git") 75 | repo.remote(name='origin') 76 | repo.git.push('--set-upstream', repo.remote().name, dvc_branch, '--force') 77 | 78 | sha = repo.head.commit.hexsha 79 | print(f"commit hash: {sha}") 80 | 81 | with Tracker.load() as tracker: 82 | tracker.log_parameters({"data_commit_hash": sha}) 83 | for file_type in file_types: 84 | path = dvc.api.get_url( 85 | f"{data_path}/{file_type}/california_{file_type}.csv", 86 | repo=dvc_repo_url, 87 | rev=dvc_branch 88 | ) 89 | tracker.log_output(name=f"california_{file_type}",value=path) 90 | 91 | if __name__=="__main__": 92 | parser = argparse.ArgumentParser() 93 | parser.add_argument("--train-test-split-ratio", type=float, default=0.3) 94 | args, _ = parser.parse_known_args() 95 | 96 | train_test_split_ratio = args.train_test_split_ratio 97 | 98 | with Tracker.load() as tracker: 99 | tracker.log_parameters( 100 | { 101 | "train_test_split_ratio": train_test_split_ratio, 102 | } 103 | ) 104 | 105 | 
configure_git() 106 | repo = clone_dvc_git_repo() 107 | generate_train_validation_files(train_test_split_ratio) 108 | sync_data_with_dvc(repo) 109 | -------------------------------------------------------------------------------- /source_dir/requirements.txt: -------------------------------------------------------------------------------- 1 | catboost 2 | dvc==2.8.3 3 | s3fs==2021.11.0 4 | dvc[s3]==2.8.3 5 | git-remote-codecommit 6 | sagemaker-experiments 7 | gitpython -------------------------------------------------------------------------------- /source_dir/train.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import logging 3 | import os 4 | import json 5 | import re 6 | import subprocess 7 | import traceback 8 | import sys 9 | 10 | import argparse 11 | import joblib 12 | 13 | from sklearn.ensemble import RandomForestRegressor 14 | from catboost import CatBoostRegressor 15 | 16 | import numpy as np 17 | import pandas as pd 18 | 19 | prefix = '/opt/ml/' 20 | input_path = prefix + 'input/data' 21 | dataset_path = prefix + 'input/data/dataset' 22 | train_channel_name = 'train' 23 | validation_channel_name = 'validation' 24 | 25 | output_path = os.path.join(prefix, 'output') 26 | model_path = os.path.join(prefix, 'model') 27 | model_file_name = 'catboost-regressor-model.dump' 28 | #model_file_name = 'model.joblib' 29 | train_path = os.path.join(dataset_path, train_channel_name) 30 | validation_path = os.path.join(dataset_path, validation_channel_name) 31 | 32 | dvc_repo_url = os.environ.get('DVC_REPO_URL') 33 | dvc_branch = os.environ.get('DVC_BRANCH') 34 | user = os.environ.get('USER', "sagemaker") 35 | 36 | def fetch_data_from_dvc(): 37 | print(f"Cloning repo: {dvc_repo_url}, git branch: {dvc_branch}") 38 | subprocess.check_call(["git", "clone", "--depth", "1", "--branch", dvc_branch, dvc_repo_url, input_path]) 39 | print("dvc pull") 40 | os.chdir(input_path + "/dataset/") 41 | subprocess.check_call(["dvc", "pull"]) 42 | 43 | # Model serving 44 | """ 45 | Deserialize fitted model 46 | """ 47 | def model_fn(model_dir): 48 | model = CatBoostRegressor() 49 | model.load_model(os.path.join(model_path, model_file_name)) 50 | return model 51 | 52 | if __name__ == '__main__': 53 | print("extracting arguments") 54 | parser = argparse.ArgumentParser() 55 | 56 | # hyperparameters sent by the client are passed as command-line arguments to the script. 
57 | # to simplify the demo we only expose a couple of CatBoost hyperparameters 58 | parser.add_argument("--learning_rate", type=float, default=1) 59 | parser.add_argument("--depth", type=int, default=5) 60 | 61 | args, _ = parser.parse_known_args() 62 | 63 | fetch_data_from_dvc() 64 | 65 | print('Starting the training.') 66 | 67 | try: 68 | # Take the set of train files and read them all into a single pandas dataframe (the CSVs are written without a header) 69 | train_input_files = [os.path.join(train_path, file) for file in glob.glob(train_path+"/*.csv")] 70 | if len(train_input_files) == 0: 71 | raise ValueError(('There are no files in {}.\n' + 72 | 'This usually indicates that the channel ({}) was incorrectly specified,\n' + 73 | 'the data specification in S3 was incorrectly specified or the role specified\n' + 74 | 'does not have permission to access the data.').format(train_path, train_channel_name)) 75 | print('Found train files: {}'.format(train_input_files)) 76 | train_df = pd.DataFrame() 77 | for file in train_input_files: 78 | if train_df.shape[0] == 0: 79 | train_df = pd.read_csv(file, header=None) 80 | else: 81 | df = pd.read_csv(file, header=None) 82 | train_df = train_df.append(df, ignore_index=True) 83 | 84 | # Take the set of validation files and read them all into a single pandas dataframe 85 | validation_input_files = [os.path.join(validation_path, file) for file in glob.glob(validation_path+"/*.csv")] 86 | if len(validation_input_files) == 0: 87 | raise ValueError(('There are no files in {}.\n' + 88 | 'This usually indicates that the channel ({}) was incorrectly specified,\n' + 89 | 'the data specification in S3 was incorrectly specified or the role specified\n' + 90 | 'does not have permission to access the data.').format(validation_path, validation_channel_name)) 91 | print('Found validation files: {}'.format(validation_input_files)) 92 | validation_df = pd.DataFrame() 93 | for file in validation_input_files: 94 | if validation_df.shape[0] == 0: 95 | validation_df = pd.read_csv(file, header=None) 96 | else: 97 | df = pd.read_csv(file, header=None) 98 | validation_df = validation_df.append(df, ignore_index=True) 99 | 100 | # Assumption is that the label is the first column 101 | print('building training and validation datasets') 102 | X_train = train_df.iloc[:, 1:].values 103 | y_train = train_df.iloc[:, 0:1].values 104 | X_validation = validation_df.iloc[:, 1:].values 105 | y_validation = validation_df.iloc[:, 0:1].values 106 | 107 | # define and train model 108 | model = CatBoostRegressor(learning_rate=args.learning_rate, depth=args.depth) 109 | 110 | model.fit(X_train, y_train, eval_set=(X_validation, y_validation), logging_level='Silent') 111 | 112 | # print abs error 113 | print('validating model') 114 | abs_err = np.abs(model.predict(X_validation) - y_validation) 115 | 116 | # print a couple of perf metrics 117 | for q in [10, 50, 90]: 118 | print('AE-at-' + str(q) + 'th-percentile: '+ str(np.percentile(a=abs_err, q=q))) 119 | 120 | path = os.path.join(model_path, model_file_name) 121 | model.save_model(path) 122 | 123 | print('Training complete.') 124 | 125 | except Exception as e: 126 | # Write out an error file. This will be returned as the failureReason in the 127 | # DescribeTrainingJob result. 128 | trc = traceback.format_exc() 129 | with open(os.path.join(output_path, 'failure'), 'w') as s: 130 | s.write('Exception during training: ' + str(e) + '\n' + trc) 131 | # Printing this causes the exception to be in the training job logs, as well. 132 | print('Exception during training: ' + str(e) + '\n' + trc) 133 | # A non-zero exit code causes the training job to be marked as Failed.
134 | sys.exit(255) 135 | 136 | # A zero exit code causes the job to be marked as Succeeded. 137 | sys.exit(0) 138 | --------------------------------------------------------------------------------