├── .gitignore ├── LICENSE ├── Mini-Projects ├── IMDB Sentiment Analysis - XGBoost (Batch Transform) - Solution.ipynb ├── IMDB Sentiment Analysis - XGBoost (Batch Transform).ipynb ├── IMDB Sentiment Analysis - XGBoost (Hyperparameter Tuning) - Solution.ipynb ├── IMDB Sentiment Analysis - XGBoost (Hyperparameter Tuning).ipynb ├── IMDB Sentiment Analysis - XGBoost (Updating a Model) - Solution.ipynb ├── IMDB Sentiment Analysis - XGBoost (Updating a Model).ipynb └── new_data.py ├── Project ├── README.md ├── SageMaker Project.ipynb ├── Web App Diagram.svg ├── serve │ ├── model.py │ ├── predict.py │ ├── requirements.txt │ └── utils.py ├── train │ ├── model.py │ ├── requirements.txt │ └── train.py └── website │ └── index.html ├── README.md └── Tutorials ├── Boston Housing - Updating an Endpoint.ipynb ├── Boston Housing - XGBoost (Batch Transform) - High Level.ipynb ├── Boston Housing - XGBoost (Batch Transform) - Low Level.ipynb ├── Boston Housing - XGBoost (Deploy) - High Level.ipynb ├── Boston Housing - XGBoost (Deploy) - Low Level.ipynb ├── Boston Housing - XGBoost (Hyperparameter Tuning) - High Level.ipynb ├── Boston Housing - XGBoost (Hyperparameter Tuning) - Low Level.ipynb ├── IMDB Sentiment Analysis - XGBoost - Web App.ipynb ├── Web App Diagram.svg └── index.html /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | 103 | # mac osx 104 | .DS_Store 105 | 106 | # Notebook files 107 | Sentiment Analysis/aclImdb_v1.tar.gz 108 | Sentiment Analysis/aclImdb/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Udacity 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Mini-Projects/IMDB Sentiment Analysis - XGBoost (Batch Transform).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Sentiment Analysis\n", 8 | "\n", 9 | "## Using XGBoost in SageMaker\n", 10 | "\n", 11 | "_Deep Learning Nanodegree Program | Deployment_\n", 12 | "\n", 13 | "---\n", 14 | "\n", 15 | "As our first example of using Amazon's SageMaker service we will construct a random tree model to predict the sentiment of a movie review. You may have seen a version of this example in a pervious lesson although it would have been done using the sklearn package. 
Instead, we will be using the XGBoost package as it is provided to us by Amazon.\n", 16 | "\n", 17 | "## Instructions\n", 18 | "\n", 19 | "Some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this notebook. You will not need to modify the included code beyond what is requested. Sections that begin with '**TODO**' in the header indicate that you need to complete or implement some portion within them. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a `# TODO: ...` comment. Please be sure to read the instructions carefully!\n", 20 | "\n", 21 | "In addition to implementing code, there may be questions for you to answer which relate to the task and your implementation. Each section where you will answer a question is preceded by a '**Question:**' header. Carefully read each question and provide your answer below the '**Answer:**' header by editing the Markdown cell.\n", 22 | "\n", 23 | "> **Note**: Code and Markdown cells can be executed using the **Shift+Enter** keyboard shortcut. In addition, a cell can be edited by typically clicking it (double-click for Markdown cells) or by pressing **Enter** while it is highlighted." 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# Make sure that we use SageMaker 1.x\n", 33 | "!pip install sagemaker==1.72.0" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Step 1: Downloading the data\n", 41 | "\n", 42 | "The dataset we are going to use is very popular among researchers in Natural Language Processing, usually referred to as the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/). It consists of movie reviews from the website [imdb.com](http://www.imdb.com/), each labeled as either '**pos**itive', if the reviewer enjoyed the film, or '**neg**ative' otherwise.\n", 43 | "\n", 44 | "> Maas, Andrew L., et al. [Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/data/sentiment/). In _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_. Association for Computational Linguistics, 2011.\n", 45 | "\n", 46 | "We begin by using some Jupyter Notebook magic to download and extract the dataset." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "%mkdir ../data\n", 56 | "!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n", 57 | "!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "## Step 2: Preparing the data\n", 65 | "\n", 66 | "The data we have downloaded is split into various files, each of which contains a single review. It will be much easier going forward if we combine these individual files into two large files, one for training and one for testing." 
67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "import os\n", 76 | "import glob\n", 77 | "\n", 78 | "def read_imdb_data(data_dir='../data/aclImdb'):\n", 79 | " data = {}\n", 80 | " labels = {}\n", 81 | " \n", 82 | " for data_type in ['train', 'test']:\n", 83 | " data[data_type] = {}\n", 84 | " labels[data_type] = {}\n", 85 | " \n", 86 | " for sentiment in ['pos', 'neg']:\n", 87 | " data[data_type][sentiment] = []\n", 88 | " labels[data_type][sentiment] = []\n", 89 | " \n", 90 | " path = os.path.join(data_dir, data_type, sentiment, '*.txt')\n", 91 | " files = glob.glob(path)\n", 92 | " \n", 93 | " for f in files:\n", 94 | " with open(f) as review:\n", 95 | " data[data_type][sentiment].append(review.read())\n", 96 | " # Here we represent a positive review by '1' and a negative review by '0'\n", 97 | " labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)\n", 98 | " \n", 99 | " assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \\\n", 100 | " \"{}/{} data size does not match labels size\".format(data_type, sentiment)\n", 101 | " \n", 102 | " return data, labels" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "data, labels = read_imdb_data()\n", 112 | "print(\"IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg\".format(\n", 113 | " len(data['train']['pos']), len(data['train']['neg']),\n", 114 | " len(data['test']['pos']), len(data['test']['neg'])))" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "from sklearn.utils import shuffle\n", 124 | "\n", 125 | "def prepare_imdb_data(data, labels):\n", 126 | " \"\"\"Prepare training and test sets from IMDb movie reviews.\"\"\"\n", 127 | " \n", 128 | " #Combine positive and negative reviews and labels\n", 129 | " data_train = data['train']['pos'] + data['train']['neg']\n", 130 | " data_test = data['test']['pos'] + data['test']['neg']\n", 131 | " labels_train = labels['train']['pos'] + labels['train']['neg']\n", 132 | " labels_test = labels['test']['pos'] + labels['test']['neg']\n", 133 | " \n", 134 | " #Shuffle reviews and corresponding labels within training and test sets\n", 135 | " data_train, labels_train = shuffle(data_train, labels_train)\n", 136 | " data_test, labels_test = shuffle(data_test, labels_test)\n", 137 | " \n", 138 | " # Return a unified training data, test data, training labels, test labets\n", 139 | " return data_train, data_test, labels_train, labels_test" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)\n", 149 | "print(\"IMDb reviews (combined): train = {}, test = {}\".format(len(train_X), len(test_X)))" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "train_X[100]" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "## Step 3: Processing the data\n", 166 | "\n", 167 | "Now that we have our training and testing datasets merged and ready to use, we need to start processing the raw data into something that will be useable by our machine learning algorithm. 
To begin with, we remove any html formatting that may appear in the reviews and perform some standard natural language processing in order to homogenize the data." 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "import nltk\n", 177 | "nltk.download(\"stopwords\")\n", 178 | "from nltk.corpus import stopwords\n", 179 | "from nltk.stem.porter import *\n", 180 | "stemmer = PorterStemmer()" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "import re\n", 190 | "from bs4 import BeautifulSoup\n", 191 | "\n", 192 | "def review_to_words(review):\n", 193 | " text = BeautifulSoup(review, \"html.parser\").get_text() # Remove HTML tags\n", 194 | " text = re.sub(r\"[^a-zA-Z0-9]\", \" \", text.lower()) # Convert to lower case\n", 195 | " words = text.split() # Split string into words\n", 196 | " words = [w for w in words if w not in stopwords.words(\"english\")] # Remove stopwords\n", 197 | " words = [PorterStemmer().stem(w) for w in words] # stem\n", 198 | " \n", 199 | " return words" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [ 208 | "import pickle\n", 209 | "\n", 210 | "cache_dir = os.path.join(\"../cache\", \"sentiment_analysis\") # where to store cache files\n", 211 | "os.makedirs(cache_dir, exist_ok=True) # ensure cache directory exists\n", 212 | "\n", 213 | "def preprocess_data(data_train, data_test, labels_train, labels_test,\n", 214 | " cache_dir=cache_dir, cache_file=\"preprocessed_data.pkl\"):\n", 215 | " \"\"\"Convert each review to words; read from cache if available.\"\"\"\n", 216 | "\n", 217 | " # If cache_file is not None, try to read from it first\n", 218 | " cache_data = None\n", 219 | " if cache_file is not None:\n", 220 | " try:\n", 221 | " with open(os.path.join(cache_dir, cache_file), \"rb\") as f:\n", 222 | " cache_data = pickle.load(f)\n", 223 | " print(\"Read preprocessed data from cache file:\", cache_file)\n", 224 | " except:\n", 225 | " pass # unable to read from cache, but that's okay\n", 226 | " \n", 227 | " # If cache is missing, then do the heavy lifting\n", 228 | " if cache_data is None:\n", 229 | " # Preprocess training and test data to obtain words for each review\n", 230 | " #words_train = list(map(review_to_words, data_train))\n", 231 | " #words_test = list(map(review_to_words, data_test))\n", 232 | " words_train = [review_to_words(review) for review in data_train]\n", 233 | " words_test = [review_to_words(review) for review in data_test]\n", 234 | " \n", 235 | " # Write to cache file for future runs\n", 236 | " if cache_file is not None:\n", 237 | " cache_data = dict(words_train=words_train, words_test=words_test,\n", 238 | " labels_train=labels_train, labels_test=labels_test)\n", 239 | " with open(os.path.join(cache_dir, cache_file), \"wb\") as f:\n", 240 | " pickle.dump(cache_data, f)\n", 241 | " print(\"Wrote preprocessed data to cache file:\", cache_file)\n", 242 | " else:\n", 243 | " # Unpack data loaded from cache file\n", 244 | " words_train, words_test, labels_train, labels_test = (cache_data['words_train'],\n", 245 | " cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])\n", 246 | " \n", 247 | " return words_train, words_test, labels_train, labels_test" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 
253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [ 256 | "# Preprocess data\n", 257 | "train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "### Extract Bag-of-Words features\n", 265 | "\n", 266 | "For the model we will be implementing, rather than using the reviews directly, we are going to transform each review into a Bag-of-Words feature representation. Keep in mind that 'in the wild' we will only have access to the training set so our transformer can only use the training set to construct a representation." 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [ 275 | "import numpy as np\n", 276 | "from sklearn.feature_extraction.text import CountVectorizer\n", 277 | "from sklearn.externals import joblib\n", 278 | "# joblib is an enhanced version of pickle that is more efficient for storing NumPy arrays\n", 279 | "\n", 280 | "def extract_BoW_features(words_train, words_test, vocabulary_size=5000,\n", 281 | " cache_dir=cache_dir, cache_file=\"bow_features.pkl\"):\n", 282 | " \"\"\"Extract Bag-of-Words for a given set of documents, already preprocessed into words.\"\"\"\n", 283 | " \n", 284 | " # If cache_file is not None, try to read from it first\n", 285 | " cache_data = None\n", 286 | " if cache_file is not None:\n", 287 | " try:\n", 288 | " with open(os.path.join(cache_dir, cache_file), \"rb\") as f:\n", 289 | " cache_data = joblib.load(f)\n", 290 | " print(\"Read features from cache file:\", cache_file)\n", 291 | " except:\n", 292 | " pass # unable to read from cache, but that's okay\n", 293 | " \n", 294 | " # If cache is missing, then do the heavy lifting\n", 295 | " if cache_data is None:\n", 296 | " # Fit a vectorizer to training documents and use it to transform them\n", 297 | " # NOTE: Training documents have already been preprocessed and tokenized into words;\n", 298 | " # pass in dummy functions to skip those steps, e.g. 
preprocessor=lambda x: x\n", 299 | " vectorizer = CountVectorizer(max_features=vocabulary_size,\n", 300 | " preprocessor=lambda x: x, tokenizer=lambda x: x) # already preprocessed\n", 301 | " features_train = vectorizer.fit_transform(words_train).toarray()\n", 302 | "\n", 303 | " # Apply the same vectorizer to transform the test documents (ignore unknown words)\n", 304 | " features_test = vectorizer.transform(words_test).toarray()\n", 305 | " \n", 306 | " # NOTE: Remember to convert the features using .toarray() for a compact representation\n", 307 | " \n", 308 | " # Write to cache file for future runs (store vocabulary as well)\n", 309 | " if cache_file is not None:\n", 310 | " vocabulary = vectorizer.vocabulary_\n", 311 | " cache_data = dict(features_train=features_train, features_test=features_test,\n", 312 | " vocabulary=vocabulary)\n", 313 | " with open(os.path.join(cache_dir, cache_file), \"wb\") as f:\n", 314 | " joblib.dump(cache_data, f)\n", 315 | " print(\"Wrote features to cache file:\", cache_file)\n", 316 | " else:\n", 317 | " # Unpack data loaded from cache file\n", 318 | " features_train, features_test, vocabulary = (cache_data['features_train'],\n", 319 | " cache_data['features_test'], cache_data['vocabulary'])\n", 320 | " \n", 321 | " # Return both the extracted features as well as the vocabulary\n", 322 | " return features_train, features_test, vocabulary" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "# Extract Bag of Words features for both training and test datasets\n", 332 | "train_X, test_X, vocabulary = extract_BoW_features(train_X, test_X)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "## Step 4: Classification using XGBoost\n", 340 | "\n", 341 | "Now that we have created the feature representation of our training (and testing) data, it is time to start setting up and using the XGBoost classifier provided by SageMaker.\n", 342 | "\n", 343 | "### (TODO) Writing the dataset\n", 344 | "\n", 345 | "The XGBoost classifier that we will be using requires the dataset to be written to a file and stored using Amazon S3. To do this, we will start by splitting the training dataset into two parts, the data we will train the model with and a validation set. Then, we will write those datasets to a file and upload the files to S3. In addition, we will write the test set input to a file and upload the file to S3. This is so that we can use SageMakers Batch Transform functionality to test our model once we've fit it." 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [ 354 | "import pandas as pd\n", 355 | "\n", 356 | "# TODO: Split the train_X and train_y arrays into the DataFrames val_X, train_X and val_y, train_y. 
Make sure that\n", 357 | "# val_X and val_y contain 10 000 entries while train_X and train_y contain the remaining 15 000 entries.\n" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "The documentation for the XGBoost algorithm in SageMaker requires that the saved datasets should contain no headers or index and that for the training and validation data, the label should occur first for each sample.\n", 365 | "\n", 366 | "For more information about this and other algorithms, the SageMaker developer documentation can be found on __[Amazon's website.](https://docs.aws.amazon.com/sagemaker/latest/dg/)__" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "# First we make sure that the local directory in which we'd like to store the training and validation csv files exists.\n", 376 | "data_dir = '../data/xgboost'\n", 377 | "if not os.path.exists(data_dir):\n", 378 | " os.makedirs(data_dir)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": null, 384 | "metadata": {}, 385 | "outputs": [], 386 | "source": [ 387 | "# First, save the test data to test.csv in the data_dir directory. Note that we do not save the associated ground truth\n", 388 | "# labels, instead we will use them later to compare with our model output.\n", 389 | "\n", 390 | "pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)\n", 391 | "\n", 392 | "# TODO: Save the training and validation data to train.csv and validation.csv in the data_dir directory.\n", 393 | "# Make sure that the files you create are in the correct format.\n" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "# To save a bit of memory we can set test_X, train_X, val_X, train_y and val_y to None.\n", 403 | "\n", 404 | "test_X = train_X = val_X = train_y = val_y = None" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "### (TODO) Uploading Training / Validation files to S3\n", 412 | "\n", 413 | "Amazon's S3 service allows us to store files that can be accessed by both the built-in training models such as the XGBoost model we will be using as well as custom models such as the one we will see a little later.\n", 414 | "\n", 415 | "For this, and most other tasks we will be doing using SageMaker, there are two methods we could use. The first is to use the low level functionality of SageMaker which requires knowing each of the objects involved in the SageMaker environment. The second is to use the high level functionality in which certain choices have been made on the user's behalf. The low level approach benefits from allowing the user a great deal of flexibility while the high level approach makes development much quicker. For our purposes we will opt to use the high level approach although using the low-level approach is certainly an option.\n", 416 | "\n", 417 | "Recall the method `upload_data()` which is a member of the object representing our current SageMaker session. What this method does is upload the data to the default bucket (which is created if it does not exist) into the path described by the key_prefix variable. 
To see this for yourself, once you have uploaded the data files, go to the S3 console and look to see where the files have been uploaded.\n", 418 | "\n", 419 | "For additional resources, see the __[SageMaker API documentation](http://sagemaker.readthedocs.io/en/latest/)__ and the __[SageMaker Developer Guide.](https://docs.aws.amazon.com/sagemaker/latest/dg/)__" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "import sagemaker\n", 429 | "\n", 430 | "session = sagemaker.Session() # Store the current SageMaker session\n", 431 | "\n", 432 | "# S3 prefix (which folder will we use)\n", 433 | "prefix = 'sentiment-xgboost'\n", 434 | "\n", 435 | "# TODO: Upload the test.csv, train.csv and validation.csv files which are contained in data_dir to S3 using session.upload_data().\n", 436 | "test_location = None\n", 437 | "val_location = None\n", 438 | "train_location = None" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "### (TODO) Creating the XGBoost model\n", 446 | "\n", 447 | "Now that the data has been uploaded it is time to create the XGBoost model. To begin with, we need to do some setup. At this point it is worth discussing what a model is in SageMaker. It is easiest to think of a model as comprising three different objects in the SageMaker ecosystem, which interact with one another.\n", 448 | "\n", 449 | "- Model Artifacts\n", 450 | "- Training Code (Container)\n", 451 | "- Inference Code (Container)\n", 452 | "\n", 453 | "The Model Artifacts are what you might think of as the actual model itself. For example, if you were building a neural network, the model artifacts would be the weights of the various layers. In our case, for an XGBoost model, the artifacts are the actual trees that are created during training.\n", 454 | "\n", 455 | "The other two objects, the training code and the inference code, are then used to manipulate the model artifacts. More precisely, the training code uses the training data that is provided and creates the model artifacts, while the inference code uses the model artifacts to make predictions on new data.\n", 456 | "\n", 457 | "The way that SageMaker runs the training and inference code is by making use of Docker containers. For now, think of a container as being a way of packaging code up so that dependencies aren't an issue."
458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "from sagemaker import get_execution_role\n", 467 | "\n", 468 | "# Our current execution role is required when creating the model as the training\n", 469 | "# and inference code will need to access the model artifacts.\n", 470 | "role = get_execution_role()" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.\n", 480 | "# As a matter of convenience, the training and inference code both use the same container.\n", 481 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 482 | "\n", 483 | "container = get_image_uri(session.boto_region_name, 'xgboost')" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "metadata": {}, 490 | "outputs": [], 491 | "source": [ 492 | "# TODO: Create a SageMaker estimator using the container location determined in the previous cell.\n", 493 | "# It is recommended that you use a single training instance of type ml.m4.xlarge. It is also\n", 494 | "# recommended that you use 's3://{}/{}/output'.format(session.default_bucket(), prefix) as the\n", 495 | "# output path.\n", 496 | "\n", 497 | "xgb = None\n", 498 | "\n", 499 | "\n", 500 | "# TODO: Set the XGBoost hyperparameters in the xgb object. Don't forget that in this case we have a binary\n", 501 | "# label so we should be using the 'binary:logistic' objective.\n" 502 | ] 503 | }, 504 | { 505 | "cell_type": "markdown", 506 | "metadata": {}, 507 | "source": [ 508 | "### Fit the XGBoost model\n", 509 | "\n", 510 | "Now that our model has been set up we simply need to attach the training and validation datasets and then ask SageMaker to set up the computation." 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": null, 516 | "metadata": {}, 517 | "outputs": [], 518 | "source": [ 519 | "s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')\n", 520 | "s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": null, 526 | "metadata": {}, 527 | "outputs": [], 528 | "source": [ 529 | "xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "### (TODO) Testing the model\n", 537 | "\n", 538 | "Now that we've fit our XGBoost model, it's time to see how well it performs. To do this we will use SageMaker's Batch Transform functionality. Batch Transform is a convenient way to perform inference on a large dataset in a way that is not realtime. That is, we don't necessarily need to use our model's results immediately and instead we can perform inference on a large number of samples. An example of this in industry might be performing an end-of-month report. This method of inference can also be useful to us as it means we can perform inference on our entire test set. \n", 539 | "\n", 540 | "To perform a Batch Transformation we need to first create a transformer object from our trained estimator object."
541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": null, 546 | "metadata": {}, 547 | "outputs": [], 548 | "source": [ 549 | "# TODO: Create a transformer object from the trained model. Using an instance count of 1 and an instance type of ml.m4.xlarge\n", 550 | "# should be more than enough.\n", 551 | "xgb_transformer = None" 552 | ] 553 | }, 554 | { 555 | "cell_type": "markdown", 556 | "metadata": {}, 557 | "source": [ 558 | "Next we actually perform the transform job. When doing so we need to make sure to specify the type of data we are sending so that it is serialized correctly in the background. In our case we are providing our model with csv data so we specify `text/csv`. Also, if the test data that we have provided is too large to process all at once then we need to specify how the data file should be split up. Since each line is a single entry in our data set we tell SageMaker that it can split the input on each line." 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "metadata": {}, 565 | "outputs": [], 566 | "source": [ 567 | "# TODO: Start the transform job. Make sure to specify the content type and the split type of the test data.\n" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "Currently the transform job is running but it is doing so in the background. Since we wish to wait until the transform job is done and we would like a bit of feedback we can run the `wait()` method." 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": null, 580 | "metadata": {}, 581 | "outputs": [], 582 | "source": [ 583 | "xgb_transformer.wait()" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "metadata": {}, 589 | "source": [ 590 | "Now the transform job has executed and the result, the estimated sentiment of each review, has been saved on S3. Since we would rather work on this file locally we can perform a bit of notebook magic to copy the file to the `data_dir`." 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": null, 596 | "metadata": {}, 597 | "outputs": [], 598 | "source": [ 599 | "!aws s3 cp --recursive $xgb_transformer.output_path $data_dir" 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "The last step is now to read in the output from our model, convert the output to something a little more usable, in this case we want the sentiment to be either `1` (positive) or `0` (negative), and then compare to the ground truth labels." 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)\n", 616 | "predictions = [round(num) for num in predictions.squeeze().values]" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": null, 622 | "metadata": {}, 623 | "outputs": [], 624 | "source": [ 625 | "from sklearn.metrics import accuracy_score\n", 626 | "accuracy_score(test_y, predictions)" 627 | ] 628 | }, 629 | { 630 | "cell_type": "markdown", 631 | "metadata": {}, 632 | "source": [ 633 | "## Optional: Clean up\n", 634 | "\n", 635 | "The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. 
Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook." 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "# First we will remove all of the files contained in the data_dir directory\n", 645 | "!rm $data_dir/*\n", 646 | "\n", 647 | "# And then we delete the directory itself\n", 648 | "!rmdir $data_dir\n", 649 | "\n", 650 | "# Similarly we will remove the files in the cache_dir directory and the directory itself\n", 651 | "!rm $cache_dir/*\n", 652 | "!rmdir $cache_dir" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": null, 658 | "metadata": {}, 659 | "outputs": [], 660 | "source": [] 661 | } 662 | ], 663 | "metadata": { 664 | "kernelspec": { 665 | "display_name": "conda_python3", 666 | "language": "python", 667 | "name": "conda_python3" 668 | }, 669 | "language_info": { 670 | "codemirror_mode": { 671 | "name": "ipython", 672 | "version": 3 673 | }, 674 | "file_extension": ".py", 675 | "mimetype": "text/x-python", 676 | "name": "python", 677 | "nbconvert_exporter": "python", 678 | "pygments_lexer": "ipython3", 679 | "version": "3.6.5" 680 | } 681 | }, 682 | "nbformat": 4, 683 | "nbformat_minor": 2 684 | } -------------------------------------------------------------------------------- /Mini-Projects/new_data.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import random 4 | 5 | def get_new_data(): 6 | cache_data = None 7 | cache_dir = os.path.join("../cache", "sentiment_analysis") 8 | 9 | with open(os.path.join(cache_dir, "preprocessed_data.pkl"), "rb") as f: 10 | cache_data = pickle.load(f) 11 | 12 | for idx in range(len(cache_data['words_train'])): 13 | if random.random() < 0.2: 14 | cache_data['words_train'][idx].append('banana') 15 | cache_data['labels_train'][idx] = 1 - cache_data['labels_train'][idx] 16 | 17 | return cache_data['words_train'], cache_data['labels_train'] -------------------------------------------------------------------------------- /Project/README.md: -------------------------------------------------------------------------------- 1 | # SageMaker Deployment Project 2 | 3 | The notebook and Python files provided here, once completed, result in a simple web app which interacts with a deployed recurrent neural network performing sentiment analysis on movie reviews. This project assumes some familiarity with SageMaker, the mini-project, Sentiment Analysis using XGBoost, should provide enough background. 4 | 5 | Please see the [README](https://github.com/udacity/sagemaker-deployment/tree/master/README.md) in the root directory for instructions on setting up a SageMaker notebook and downloading the project files (as well as the other notebooks). 6 | -------------------------------------------------------------------------------- /Project/Web App Diagram.svg: -------------------------------------------------------------------------------- 1 | 2 |
[Web App Diagram.svg: the diagram markup was not preserved in this text dump; its labels are App, API, Lambda, and Model.]
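The diagram's labels describe the project's request path: the static web page calls an API Gateway endpoint, which triggers an AWS Lambda function, which in turn invokes the deployed SageMaker model. The Lambda function itself is not a file in this repository, so the sketch below is only illustrative; the endpoint name and the assumption that API Gateway passes the raw review text in `event['body']` are hypothetical.

```python
# Illustrative sketch only; not a file in this repository.
# A minimal AWS Lambda handler for the App -> API -> Lambda -> Model flow pictured above.
# 'sentiment-endpoint' is a hypothetical endpoint name, and the handler assumes an
# API Gateway proxy integration that passes the raw review text in event['body'].
import boto3

def lambda_handler(event, context):
    # Client for invoking a deployed SageMaker endpoint.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Forward the review text to the endpoint as plain text, which is the
    # content type that input_fn in serve/predict.py accepts.
    response = runtime.invoke_endpoint(EndpointName='sentiment-endpoint',  # hypothetical name
                                       ContentType='text/plain',
                                       Body=event['body'])

    # The endpoint returns the predicted sentiment, serialized to a string by output_fn.
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'},
        'body': result
    }
```

Returning the prediction as plain text keeps the web page's JavaScript simple, since the response body can be displayed to the user directly.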
-------------------------------------------------------------------------------- /Project/serve/model.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | class LSTMClassifier(nn.Module): 4 | """ 5 | This is the simple RNN model we will be using to perform Sentiment Analysis. 6 | """ 7 | 8 | def __init__(self, embedding_dim, hidden_dim, vocab_size): 9 | """ 10 | Initialize the model by settingg up the various layers. 11 | """ 12 | super(LSTMClassifier, self).__init__() 13 | 14 | self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0) 15 | self.lstm = nn.LSTM(embedding_dim, hidden_dim) 16 | self.dense = nn.Linear(in_features=hidden_dim, out_features=1) 17 | self.sig = nn.Sigmoid() 18 | 19 | self.word_dict = None 20 | 21 | def forward(self, x): 22 | """ 23 | Perform a forward pass of our model on some input. 24 | """ 25 | x = x.t() 26 | lengths = x[0,:] 27 | reviews = x[1:,:] 28 | embeds = self.embedding(reviews) 29 | lstm_out, _ = self.lstm(embeds) 30 | out = self.dense(lstm_out) 31 | out = out[lengths - 1, range(len(lengths))] 32 | return self.sig(out.squeeze()) -------------------------------------------------------------------------------- /Project/serve/predict.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | import pickle 5 | import sys 6 | import sagemaker_containers 7 | import pandas as pd 8 | import numpy as np 9 | import torch 10 | import torch.nn as nn 11 | import torch.optim as optim 12 | import torch.utils.data 13 | 14 | from model import LSTMClassifier 15 | 16 | from utils import review_to_words, convert_and_pad 17 | 18 | def model_fn(model_dir): 19 | """Load the PyTorch model from the `model_dir` directory.""" 20 | print("Loading model.") 21 | 22 | # First, load the parameters used to create the model. 23 | model_info = {} 24 | model_info_path = os.path.join(model_dir, 'model_info.pth') 25 | with open(model_info_path, 'rb') as f: 26 | model_info = torch.load(f) 27 | 28 | print("model_info: {}".format(model_info)) 29 | 30 | # Determine the device and construct the model. 31 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 32 | model = LSTMClassifier(model_info['embedding_dim'], model_info['hidden_dim'], model_info['vocab_size']) 33 | 34 | # Load the store model parameters. 35 | model_path = os.path.join(model_dir, 'model.pth') 36 | with open(model_path, 'rb') as f: 37 | model.load_state_dict(torch.load(f)) 38 | 39 | # Load the saved word_dict. 
40 | word_dict_path = os.path.join(model_dir, 'word_dict.pkl') 41 | with open(word_dict_path, 'rb') as f: 42 | model.word_dict = pickle.load(f) 43 | 44 | model.to(device).eval() 45 | 46 | print("Done loading model.") 47 | return model 48 | 49 | def input_fn(serialized_input_data, content_type): 50 | print('Deserializing the input data.') 51 | if content_type == 'text/plain': 52 | data = serialized_input_data.decode('utf-8') 53 | return data 54 | raise Exception('Requested unsupported ContentType in content_type: ' + content_type) 55 | 56 | def output_fn(prediction_output, accept): 57 | print('Serializing the generated output.') 58 | return str(prediction_output) 59 | 60 | def predict_fn(input_data, model): 61 | print('Inferring sentiment of input data.') 62 | 63 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 64 | 65 | if model.word_dict is None: 66 | raise Exception('Model has not been loaded properly, no word_dict.') 67 | 68 | # TODO: Process input_data so that it is ready to be sent to our model. 69 | # You should produce two variables: 70 | # data_X - A sequence of length 500 which represents the converted review 71 | # data_len - The length of the review 72 | 73 | data_X = None 74 | data_len = None 75 | 76 | # Using data_X and data_len we construct an appropriate input tensor. Remember 77 | # that our model expects input data of the form 'len, review[500]'. 78 | data_pack = np.hstack((data_len, data_X)) 79 | data_pack = data_pack.reshape(1, -1) 80 | 81 | data = torch.from_numpy(data_pack) 82 | data = data.to(device) 83 | 84 | # Make sure to put the model into evaluation mode 85 | model.eval() 86 | 87 | # TODO: Compute the result of applying the model to the input data. The variable `result` should 88 | # be a numpy array which contains a single integer which is either 1 or 0 89 | 90 | result = None 91 | 92 | return result 93 | -------------------------------------------------------------------------------- /Project/serve/requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | numpy 3 | nltk 4 | beautifulsoup4 5 | html5lib -------------------------------------------------------------------------------- /Project/serve/utils.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | from nltk.corpus import stopwords 3 | from nltk.stem.porter import * 4 | 5 | import re 6 | from bs4 import BeautifulSoup 7 | 8 | import pickle 9 | 10 | import os 11 | import glob 12 | 13 | def review_to_words(review): 14 | nltk.download("stopwords", quiet=True) 15 | stemmer = PorterStemmer() 16 | 17 | text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags 18 | text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case 19 | words = text.split() # Split string into words 20 | words = [w for w in words if w not in stopwords.words("english")] # Remove stopwords 21 | words = [PorterStemmer().stem(w) for w in words] # stem 22 | 23 | return words 24 | 25 | def convert_and_pad(word_dict, sentence, pad=500): 26 | NOWORD = 0 # We will use 0 to represent the 'no word' category 27 | INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict 28 | 29 | working_sentence = [NOWORD] * pad 30 | 31 | for word_index, word in enumerate(sentence[:pad]): 32 | if word in word_dict: 33 | working_sentence[word_index] = word_dict[word] 34 | else: 35 | working_sentence[word_index] = INFREQ 36 | 37 | return working_sentence, 
min(len(sentence), pad) -------------------------------------------------------------------------------- /Project/train/model.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | class LSTMClassifier(nn.Module): 4 | """ 5 | This is the simple RNN model we will be using to perform Sentiment Analysis. 6 | """ 7 | 8 | def __init__(self, embedding_dim, hidden_dim, vocab_size): 9 | """ 10 | Initialize the model by settingg up the various layers. 11 | """ 12 | super(LSTMClassifier, self).__init__() 13 | 14 | self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0) 15 | self.lstm = nn.LSTM(embedding_dim, hidden_dim) 16 | self.dense = nn.Linear(in_features=hidden_dim, out_features=1) 17 | self.sig = nn.Sigmoid() 18 | 19 | self.word_dict = None 20 | 21 | def forward(self, x): 22 | """ 23 | Perform a forward pass of our model on some input. 24 | """ 25 | x = x.t() 26 | lengths = x[0,:] 27 | reviews = x[1:,:] 28 | embeds = self.embedding(reviews) 29 | lstm_out, _ = self.lstm(embeds) 30 | out = self.dense(lstm_out) 31 | out = out[lengths - 1, range(len(lengths))] 32 | return self.sig(out.squeeze()) -------------------------------------------------------------------------------- /Project/train/requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | numpy 3 | nltk 4 | beautifulsoup4 5 | html5lib -------------------------------------------------------------------------------- /Project/train/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | import pickle 5 | import sys 6 | import sagemaker_containers 7 | import pandas as pd 8 | import torch 9 | import torch.optim as optim 10 | import torch.utils.data 11 | 12 | from model import LSTMClassifier 13 | 14 | def model_fn(model_dir): 15 | """Load the PyTorch model from the `model_dir` directory.""" 16 | print("Loading model.") 17 | 18 | # First, load the parameters used to create the model. 19 | model_info = {} 20 | model_info_path = os.path.join(model_dir, 'model_info.pth') 21 | with open(model_info_path, 'rb') as f: 22 | model_info = torch.load(f) 23 | 24 | print("model_info: {}".format(model_info)) 25 | 26 | # Determine the device and construct the model. 27 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 28 | model = LSTMClassifier(model_info['embedding_dim'], model_info['hidden_dim'], model_info['vocab_size']) 29 | 30 | # Load the stored model parameters. 31 | model_path = os.path.join(model_dir, 'model.pth') 32 | with open(model_path, 'rb') as f: 33 | model.load_state_dict(torch.load(f)) 34 | 35 | # Load the saved word_dict. 
36 | word_dict_path = os.path.join(model_dir, 'word_dict.pkl') 37 | with open(word_dict_path, 'rb') as f: 38 | model.word_dict = pickle.load(f) 39 | 40 | model.to(device).eval() 41 | 42 | print("Done loading model.") 43 | return model 44 | 45 | def _get_train_data_loader(batch_size, training_dir): 46 | print("Get train data loader.") 47 | 48 | train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None) 49 | 50 | train_y = torch.from_numpy(train_data[[0]].values).float().squeeze() 51 | train_X = torch.from_numpy(train_data.drop([0], axis=1).values).long() 52 | 53 | train_ds = torch.utils.data.TensorDataset(train_X, train_y) 54 | 55 | return torch.utils.data.DataLoader(train_ds, batch_size=batch_size) 56 | 57 | 58 | def train(model, train_loader, epochs, optimizer, loss_fn, device): 59 | """ 60 | This is the training method that is called by the PyTorch training script. The parameters 61 | passed are as follows: 62 | model - The PyTorch model that we wish to train. 63 | train_loader - The PyTorch DataLoader that should be used during training. 64 | epochs - The total number of epochs to train for. 65 | optimizer - The optimizer to use during training. 66 | loss_fn - The loss function used for training. 67 | device - Where the model and data should be loaded (gpu or cpu). 68 | """ 69 | 70 | # TODO: Paste the train() method developed in the notebook here. 71 | 72 | pass 73 | 74 | 75 | if __name__ == '__main__': 76 | # All of the model parameters and training parameters are sent as arguments when the script 77 | # is executed. Here we set up an argument parser to easily access the parameters. 78 | 79 | parser = argparse.ArgumentParser() 80 | 81 | # Training Parameters 82 | parser.add_argument('--batch-size', type=int, default=512, metavar='N', 83 | help='input batch size for training (default: 512)') 84 | parser.add_argument('--epochs', type=int, default=10, metavar='N', 85 | help='number of epochs to train (default: 10)') 86 | parser.add_argument('--seed', type=int, default=1, metavar='S', 87 | help='random seed (default: 1)') 88 | 89 | # Model Parameters 90 | parser.add_argument('--embedding_dim', type=int, default=32, metavar='N', 91 | help='size of the word embeddings (default: 32)') 92 | parser.add_argument('--hidden_dim', type=int, default=100, metavar='N', 93 | help='size of the hidden dimension (default: 100)') 94 | parser.add_argument('--vocab_size', type=int, default=5000, metavar='N', 95 | help='size of the vocabulary (default: 5000)') 96 | 97 | # SageMaker Parameters 98 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS'])) 99 | parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST']) 100 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) 101 | parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING']) 102 | parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS']) 103 | 104 | args = parser.parse_args() 105 | 106 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 107 | print("Using device {}.".format(device)) 108 | 109 | torch.manual_seed(args.seed) 110 | 111 | # Load the training data. 112 | train_loader = _get_train_data_loader(args.batch_size, args.data_dir) 113 | 114 | # Build the model. 
115 | model = LSTMClassifier(args.embedding_dim, args.hidden_dim, args.vocab_size).to(device) 116 | 117 | with open(os.path.join(args.data_dir, "word_dict.pkl"), "rb") as f: 118 | model.word_dict = pickle.load(f) 119 | 120 | print("Model loaded with embedding_dim {}, hidden_dim {}, vocab_size {}.".format( 121 | args.embedding_dim, args.hidden_dim, args.vocab_size 122 | )) 123 | 124 | # Train the model. 125 | optimizer = optim.Adam(model.parameters()) 126 | loss_fn = torch.nn.BCELoss() 127 | 128 | train(model, train_loader, args.epochs, optimizer, loss_fn, device) 129 | 130 | # Save the parameters used to construct the model 131 | model_info_path = os.path.join(args.model_dir, 'model_info.pth') 132 | with open(model_info_path, 'wb') as f: 133 | model_info = { 134 | 'embedding_dim': args.embedding_dim, 135 | 'hidden_dim': args.hidden_dim, 136 | 'vocab_size': args.vocab_size, 137 | } 138 | torch.save(model_info, f) 139 | 140 | # Save the word_dict 141 | word_dict_path = os.path.join(args.model_dir, 'word_dict.pkl') 142 | with open(word_dict_path, 'wb') as f: 143 | pickle.dump(model.word_dict, f) 144 | 145 | # Save the model parameters 146 | model_path = os.path.join(args.model_dir, 'model.pth') 147 | with open(model_path, 'wb') as f: 148 | torch.save(model.cpu().state_dict(), f) 149 | -------------------------------------------------------------------------------- /Project/website/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Sentiment Analysis Web App 5 | 6 | 7 | 8 | 9 | 10 | 11 | 32 | 33 | 34 | 35 | 36 |
37-49 | [index.html body markup not preserved in this text dump. Visible page text: "Is your review positive, or negative?" and "Enter your review below and click submit to find out...". The surrounding form markup was stripped in the export.]
50 | 51 | 52 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Deployment using AWS SageMaker 2 | 3 | Code and associated files 4 | 5 | This repository contains code and associated files for deploying ML models using AWS SageMaker. This repository consists of a number of tutorial notebooks for various coding exercises, mini-projects, and project files that will be used to supplement the lessons of the Nanodegree. 6 | 7 | ## Table Of Contents 8 | 9 | ### Tutorials 10 | * [Boston Housing (Batch Transform) - High Level](https://github.com/udacity/sagemaker-deployment/tree/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Batch%20Transform)%20-%20High%20Level.ipynb) is the simplest notebook which introduces you to the SageMaker ecosystem and how everything works together. The data used is already clean and tabular so that no additional processing needs to be done. Uses the Batch Transform method to test the fit model. 11 | * [Boston Housing (Batch Transform) - Low Level](https://github.com/udacity/sagemaker-deployment/tree/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Batch%20Transform)%20-%20Low%20Level.ipynb) performs the same analysis as the low level notebook, instead using the low level api. As a result it is a little more verbose, however, it has the advantage of being more flexible. It is a good idea to know each of the methods even if you only use one of them. 12 | * [Boston Housing (Deploy) - High Level](https://github.com/udacity/sagemaker-deployment/blob/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Deploy)%20-%20High%20Level.ipynb) is a variation on the Batch Transform notebook of the same name. Instead of using Batch Transform to test the model, it deploys and then sends the test data to the deployed endpoint. 13 | * [Boston Housing (Deploy) - Low Level](https://github.com/udacity/sagemaker-deployment/blob/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Deploy)%20-%20Low%20Level.ipynb) is again a variant of the Batch Transform notebook above. This time using the low level api and again deploys the model and sends the test data to it rather than using the batch transform method. 14 | * [IMDB Sentiment Analysis - XGBoost - Web App](https://github.com/udacity/sagemaker-deployment/blob/master/Tutorials/IMDB%20Sentiment%20Analysis%20-%20XGBoost%20-%20Web%20App.ipynb) creates a sentiment analysis model using XGBoost and deploys the model to an endpoint. Then describes how to set up AWS Lambda and API Gateway to create a simple web app that interacts with the deployed endpoint. 15 | * [Boston Housing (Hyperparameter Tuning) - High Level](https://github.com/udacity/sagemaker-deployment/tree/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Hyperparameter%20Tuning)%20-%20High%20Level.ipynb) is an extension of the Boston Housing XGBoost model where instead of training a single model, the hyperparameter tuning functionality of SageMaker is used to train a number of different models, ultimately using the best performing model. 16 | * [Boston Housing (Hyperparameter Tuning) - Low Level](https://github.com/udacity/sagemaker-deployment/tree/master/Tutorials/Boston%20Housing%20-%20XGBoost%20(Hyperparameter%20Tuning)%20-%20Low%20Level.ipynb) is a variation of the high level hyperparameter tuning notebook, this time using the low level api to create each of the objects involved in constructing a hyperparameter tuning job. 
17 | * [Boston Housing - Updating an Endpoint](https://github.com/udacity/sagemaker-deployment/tree/master/Tutorials/Boston%20Housing%20-%20Updating%20an%20Endpoint.ipynb) is another extension of the Boston Housing XGBoost model where in addition we construct a Linear model and switch a deployed endpoint between the two constructed models. In addition, we look at creating an endpoint which simulates performing an A/B test by sending some portion of the incoming inference requests to the XGBoost model and the rest to the Linear model. 18 | 19 | ### Mini-Projects 20 | * [IMDB Sentiment Analysis - XGBoost (Batch Transform)](https://github.com/udacity/sagemaker-deployment/tree/master/Mini-Projects/IMDB%20Sentiment%20Analysis%20-%20XGBoost%20(Batch%20Transform).ipynb) is a notebook that is to be completed which leads you through the steps of constructing a model using XGBoost to perform sentiment analysis on the IMDB dataset. 21 | * [IMDB Sentiment Analysis - XGBoost (Hyperparameter Tuning)](https://github.com/udacity/sagemaker-deployment/tree/master/Mini-Projects/IMDB%20Sentiment%20Analysis%20-%20XGBoost%20(Hyperparameter%20Tuning).ipynb) is a notebook that is to be completed and which leads you through the steps of constructing a sentiment analysis model using XGBoost and using SageMaker's hyperparameter tuning functionality to test a number of different hyperparameters. 22 | * [IMDB Sentiment Analysis - XGBoost (Updating a Model)](https://github.com/udacity/sagemaker-deployment/tree/master/Mini-Projects/IMDB%20Sentiment%20Analysis%20-%20XGBoost%20(Updating%20a%20Model).ipynb) is a notebook that is to be completed and which leads you through the steps of constructing a sentiment analysis model using XGBoost and then exploring what happens if something changes in the underlying distribution. After exploring a change in data over time you will construct an updated model and then update a deployed endpoint so that it makes use of the new model. 23 | 24 | ### Project 25 | 26 | [Sentiment Analysis Web App](https://github.com/udacity/sagemaker-deployment/tree/master/Project) is a notebook and collection of Python files to be completed. The result is a deployed RNN performing sentiment analysis on movie reviews complete with publicly accessible API and a simple web page which interacts with the deployed endpoint. This project assumes that you have some familiarity with SageMaker. Completing the XGBoost Sentiment Analysis notebook should suffice. 27 | 28 | ## Setup Instructions 29 | 30 | The notebooks provided in this repository are intended to be executed using Amazon's SageMaker platform. The following is a brief set of instructions on setting up a managed notebook instance using SageMaker, from which the notebooks can be completed and run. 31 | 32 | ### Log in to the AWS console and create a notebook instance 33 | 34 | Log in to the AWS console and go to the SageMaker dashboard. Click on 'Create notebook instance'. The notebook name can be anything and using ml.t2.medium is a good idea as it is covered under the free tier. For the role, creating a new role works fine. Using the default options is also okay. Important to note that you need the notebook instance to have access to S3 resources, which it does by default. In particular, any S3 bucket or objectt with sagemaker in the name is available to the notebook. 
35 | 36 | ### Use git to clone the repository into the notebook instance 37 | 38 | Once the instance has been started and is accessible, click on 'open' to get the Jupyter notebook main page. We will begin by cloning the SageMaker Deployment github repository into the notebook instance. Note that we want to make sure to clone this into the appropriate directory so that the data will be preserved between sessions. 39 | 40 | Click on the 'new' dropdown menu and select 'terminal'. By default, the working directory of the terminal instance is the home directory, however, the Jupyter notebook hub's root directory is under 'SageMaker'. Enter the appropriate directory and clone the repository as follows. 41 | 42 | ```bash 43 | cd SageMaker 44 | git clone https://github.com/udacity/sagemaker-deployment.git 45 | exit 46 | ``` 47 | 48 | After you have finished, close the terminal window. 49 | 50 | ### Open and run the notebook of your choice 51 | 52 | Now that the repository has been cloned into the notebook instance you may navigate to any of the notebooks that you wish to complete or execute and work with them. Any additional instructions are contained in their respective notebooks. 53 | -------------------------------------------------------------------------------- /Tutorials/Boston Housing - XGBoost (Batch Transform) - High Level.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predicting Boston Housing Prices\n", 8 | "\n", 9 | "## Using XGBoost in SageMaker (Batch Transform)\n", 10 | "\n", 11 | "_Deep Learning Nanodegree Program | Deployment_\n", 12 | "\n", 13 | "---\n", 14 | "\n", 15 | "As an introduction to using SageMaker's High Level Python API we will look at a relatively simple problem. Namely, we will use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to predict the median value of a home in the area of Boston Mass.\n", 16 | "\n", 17 | "The documentation for the high level API can be found on the [ReadTheDocs page](http://sagemaker.readthedocs.io/en/latest/)\n", 18 | "\n", 19 | "## General Outline\n", 20 | "\n", 21 | "Typically, when using a notebook instance with SageMaker, you will proceed through the following steps. Of course, not every step will need to be done with each project. Also, there is quite a lot of room for variation in many of the steps, as you will see throughout these lessons.\n", 22 | "\n", 23 | "1. Download or otherwise retrieve the data.\n", 24 | "2. Process / Prepare the data.\n", 25 | "3. Upload the processed data to S3.\n", 26 | "4. Train a chosen model.\n", 27 | "5. Test the trained model (typically using a batch transform job).\n", 28 | "6. Deploy the trained model.\n", 29 | "7. Use the deployed model.\n", 30 | "\n", 31 | "In this notebook we will only be covering steps 1 through 5 as we just want to get a feel for using SageMaker. In later notebooks we will talk about deploying a trained model in much more detail." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Make sure that we use SageMaker 1.x\n", 41 | "!pip install sagemaker==1.72.0" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Step 0: Setting up the notebook\n", 49 | "\n", 50 | "We begin by setting up all of the necessary bits required to run our notebook. 
To start that means loading all of the Python modules we will need." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "%matplotlib inline\n", 60 | "\n", 61 | "import os\n", 62 | "\n", 63 | "import numpy as np\n", 64 | "import pandas as pd\n", 65 | "\n", 66 | "import matplotlib.pyplot as plt\n", 67 | "\n", 68 | "from sklearn.datasets import load_boston\n", 69 | "import sklearn.model_selection" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "In addition to the modules above, we need to import the various bits of SageMaker that we will be using. " 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "import sagemaker\n", 86 | "from sagemaker import get_execution_role\n", 87 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 88 | "from sagemaker.predictor import csv_serializer\n", 89 | "\n", 90 | "# This is an object that represents the SageMaker session that we are currently operating in. This\n", 91 | "# object contains some useful information that we will need to access later such as our region.\n", 92 | "session = sagemaker.Session()\n", 93 | "\n", 94 | "# This is an object that represents the IAM role that we are currently assigned. When we construct\n", 95 | "# and launch the training job later we will need to tell it what IAM role it should have. Since our\n", 96 | "# use case is relatively simple we will simply assign the training job the role we currently have.\n", 97 | "role = get_execution_role()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "## Step 1: Downloading the data\n", 105 | "\n", 106 | "Fortunately, this dataset can be retrieved using sklearn and so this step is relatively straightforward." 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "boston = load_boston()" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Step 2: Preparing and splitting the data\n", 123 | "\n", 124 | "Given that this is clean tabular data, we don't need to do any processing. However, we do need to split the rows in the dataset up into train, test and validation sets." 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "# First we package up the input data and the target variable (the median value) as pandas dataframes. 
This\n", 134 | "# will make saving the data to a file a little easier later on.\n", 135 | "\n", 136 | "X_bos_pd = pd.DataFrame(boston.data, columns=boston.feature_names)\n", 137 | "Y_bos_pd = pd.DataFrame(boston.target)\n", 138 | "\n", 139 | "# We split the dataset into 2/3 training and 1/3 testing sets.\n", 140 | "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.33)\n", 141 | "\n", 142 | "# Then we split the training set further into 2/3 training and 1/3 validation sets.\n", 143 | "X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "## Step 3: Uploading the data files to S3\n", 151 | "\n", 152 | "When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.\n", 153 | "\n", 154 | "### Save the data locally\n", 155 | "\n", 156 | "First we need to create the test, train and validation csv files which we will then upload to S3." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# This is our local data directory. We need to make sure that it exists.\n", 166 | "data_dir = '../data/boston'\n", 167 | "if not os.path.exists(data_dir):\n", 168 | " os.makedirs(data_dir)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "# We use pandas to save our test, train and validation data to csv files. Note that we make sure not to include header\n", 178 | "# information or an index as this is required by the built in algorithms provided by Amazon. Also, for the train and\n", 179 | "# validation data, it is assumed that the first entry in each row is the target variable.\n", 180 | "\n", 181 | "X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)\n", 182 | "\n", 183 | "pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)\n", 184 | "pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### Upload to S3\n", 192 | "\n", 193 | "Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project." 
194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "prefix = 'boston-xgboost-HL'\n", 203 | "\n", 204 | "test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)\n", 205 | "val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)\n", 206 | "train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "## Step 4: Train the XGBoost model\n", 214 | "\n", 215 | "Now that we have the training and validation data uploaded to S3, we can construct our XGBoost model and train it. We will be making use of the high level SageMaker API to do this which will make the resulting code a little easier to read at the cost of some flexibility.\n", 216 | "\n", 217 | "To construct an estimator, the object which we wish to train, we need to provide the location of a container which contains the training code. Since we are using a built in algorithm this container is provided by Amazon. However, the full name of the container is a bit lengthy and depends on the region that we are operating in. Fortunately, SageMaker provides a useful utility method called `get_image_uri` that constructs the image name for us.\n", 218 | "\n", 219 | "To use the `get_image_uri` method we need to provide it with our current region, which can be obtained from the session object, and the name of the algorithm we wish to use. In this notebook we will be using XGBoost however you could try another algorithm if you wish. The list of built in algorithms can be found in the list of [Common Parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html)." 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "# As stated above, we use this utility method to construct the image name for the training container.\n", 229 | "container = get_image_uri(session.boto_region_name, 'xgboost')\n", 230 | "\n", 231 | "# Now that we know which container to use, we can construct the estimator object.\n", 232 | "xgb = sagemaker.estimator.Estimator(container, # The image name of the training container\n", 233 | " role, # The IAM role to use (our current role in this case)\n", 234 | " train_instance_count=1, # The number of instances to use for training\n", 235 | " train_instance_type='ml.m4.xlarge', # The type of instance to use for training\n", 236 | " output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),\n", 237 | " # Where to save the output (the model artifacts)\n", 238 | " sagemaker_session=session) # The current SageMaker session" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "Before asking SageMaker to begin the training job, we should probably set any model specific hyperparameters. There are quite a few that can be set when using the XGBoost algorithm, below are just a few of them. 
If you would like to change the hyperparameters below or modify additional ones you can find additional information on the [XGBoost hyperparameter page](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "xgb.set_hyperparameters(max_depth=5,\n", 255 | " eta=0.2,\n", 256 | " gamma=4,\n", 257 | " min_child_weight=6,\n", 258 | " subsample=0.8,\n", 259 | " objective='reg:linear',\n", 260 | " early_stopping_rounds=10,\n", 261 | " num_round=200)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "Now that we have our estimator object completely set up, it is time to train it. To do this we make sure that SageMaker knows our input data is in csv format and then execute the `fit` method." 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "# This is a wrapper around the location of our train and validation data, to make sure that SageMaker\n", 278 | "# knows our data is in csv format.\n", 279 | "s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')\n", 280 | "s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')\n", 281 | "\n", 282 | "xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "## Step 5: Test the model\n", 290 | "\n", 291 | "Now that we have fit our model to the training data, using the validation data to avoid overfitting, we can test our model. To do this we will make use of SageMaker's Batch Transform functionality. To start with, we need to build a transformer object from our fit model." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "Next we ask SageMaker to begin a batch transform job using our trained model and applying it to the test data we previously stored in S3. We need to make sure to provide SageMaker with the type of data that we are providing to our model, in our case `text/csv`, so that it knows how to serialize our data. In addition, we need to make sure to let SageMaker know how to split our data up into chunks if the entire data set happens to be too large to send to our model all at once.\n", 308 | "\n", 309 | "Note that when we ask SageMaker to do this it will execute the batch transform job in the background. Since we need to wait for the results of this job before we can continue, we use the `wait()` method. An added benefit of this is that we get some output from our batch transform job which lets us know if anything went wrong." 
310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "xgb_transformer.wait()" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "Now that the batch transform job has finished, the resulting output is stored on S3. Since we wish to analyze the output inside of our notebook we can use a bit of notebook magic to copy the output file from its S3 location and save it locally." 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "!aws s3 cp --recursive $xgb_transformer.output_path $data_dir" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "To see how well our model works we can create a simple scatter plot between the predicted and actual values. If the model was completely accurate the resulting scatter plot would look like the line $x=y$. As we can see, our model seems to have done okay but there is room for improvement." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [ 359 | "Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "plt.scatter(Y_test, Y_pred)\n", 369 | "plt.xlabel(\"Median Price\")\n", 370 | "plt.ylabel(\"Predicted Price\")\n", 371 | "plt.title(\"Median Price vs Predicted Price\")" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "## Optional: Clean up\n", 379 | "\n", 380 | "The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook." 
381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": null, 386 | "metadata": {}, 387 | "outputs": [], 388 | "source": [ 389 | "# First we will remove all of the files contained in the data_dir directory\n", 390 | "!rm $data_dir/*\n", 391 | "\n", 392 | "# And then we delete the directory itself\n", 393 | "!rmdir $data_dir" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [] 402 | } 403 | ], 404 | "metadata": { 405 | "kernelspec": { 406 | "display_name": "conda_pytorch_p36", 407 | "language": "python", 408 | "name": "conda_pytorch_p36" 409 | }, 410 | "language_info": { 411 | "codemirror_mode": { 412 | "name": "ipython", 413 | "version": 3 414 | }, 415 | "file_extension": ".py", 416 | "mimetype": "text/x-python", 417 | "name": "python", 418 | "nbconvert_exporter": "python", 419 | "pygments_lexer": "ipython3", 420 | "version": "3.6.5" 421 | } 422 | }, 423 | "nbformat": 4, 424 | "nbformat_minor": 2 425 | } -------------------------------------------------------------------------------- /Tutorials/Boston Housing - XGBoost (Batch Transform) - Low Level.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predicting Boston Housing Prices\n", 8 | "\n", 9 | "## Using XGBoost in SageMaker (Batch Transform)\n", 10 | "\n", 11 | "_Deep Learning Nanodegree Program | Deployment_\n", 12 | "\n", 13 | "---\n", 14 | "\n", 15 | "As an introduction to using SageMaker's Low Level Python API we will look at a relatively simple problem. Namely, we will use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to predict the median value of a home in the area of Boston Mass.\n", 16 | "\n", 17 | "The documentation reference for the API used in this notebook is the [SageMaker Developer's Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/)\n", 18 | "\n", 19 | "## General Outline\n", 20 | "\n", 21 | "Typically, when using a notebook instance with SageMaker, you will proceed through the following steps. Of course, not every step will need to be done with each project. Also, there is quite a lot of room for variation in many of the steps, as you will see throughout these lessons.\n", 22 | "\n", 23 | "1. Download or otherwise retrieve the data.\n", 24 | "2. Process / Prepare the data.\n", 25 | "3. Upload the processed data to S3.\n", 26 | "4. Train a chosen model.\n", 27 | "5. Test the trained model (typically using a batch transform job).\n", 28 | "6. Deploy the trained model.\n", 29 | "7. Use the deployed model.\n", 30 | "\n", 31 | "In this notebook we will only be covering steps 1 through 5 as we just want to get a feel for using SageMaker. In later notebooks we will talk about deploying a trained model in much more detail." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Make sure that we use SageMaker 1.x\n", 41 | "!pip install sagemaker==1.72.0" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Step 0: Setting up the notebook\n", 49 | "\n", 50 | "We begin by setting up all of the necessary bits required to run our notebook. To start that means loading all of the Python modules we will need." 
51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "%matplotlib inline\n", 60 | "\n", 61 | "import os\n", 62 | "\n", 63 | "import time\n", 64 | "from time import gmtime, strftime\n", 65 | "\n", 66 | "import numpy as np\n", 67 | "import pandas as pd\n", 68 | "\n", 69 | "import matplotlib.pyplot as plt\n", 70 | "\n", 71 | "from sklearn.datasets import load_boston\n", 72 | "import sklearn.model_selection" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "In addition to the modules above, we need to import the various bits of SageMaker that we will be using. " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "import sagemaker\n", 89 | "from sagemaker import get_execution_role\n", 90 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 91 | "\n", 92 | "# This is an object that represents the SageMaker session that we are currently operating in. This\n", 93 | "# object contains some useful information that we will need to access later such as our region.\n", 94 | "session = sagemaker.Session()\n", 95 | "\n", 96 | "# This is an object that represents the IAM role that we are currently assigned. When we construct\n", 97 | "# and launch the training job later we will need to tell it what IAM role it should have. Since our\n", 98 | "# use case is relatively simple we will simply assign the training job the role we currently have.\n", 99 | "role = get_execution_role()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Step 1: Downloading the data\n", 107 | "\n", 108 | "Fortunately, this dataset can be retrieved using sklearn and so this step is relatively straightforward." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "boston = load_boston()" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## Step 2: Preparing and splitting the data\n", 125 | "\n", 126 | "Given that this is clean tabular data, we don't need to do any processing. However, we do need to split the rows in the dataset up into train, test and validation sets." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "# First we package up the input data and the target variable (the median value) as pandas dataframes. 
This\n", 136 | "# will make saving the data to a file a little easier later on.\n", 137 | "\n", 138 | "X_bos_pd = pd.DataFrame(boston.data, columns=boston.feature_names)\n", 139 | "Y_bos_pd = pd.DataFrame(boston.target)\n", 140 | "\n", 141 | "# We split the dataset into 2/3 training and 1/3 testing sets.\n", 142 | "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.33)\n", 143 | "\n", 144 | "# Then we split the training set further into 2/3 training and 1/3 validation sets.\n", 145 | "X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "## Step 3: Uploading the data files to S3\n", 153 | "\n", 154 | "When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.\n", 155 | "\n", 156 | "### Save the data locally\n", 157 | "\n", 158 | "First we need to create the test, train and validation csv files which we will then upload to S3." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "# This is our local data directory. We need to make sure that it exists.\n", 168 | "data_dir = '../data/boston'\n", 169 | "if not os.path.exists(data_dir):\n", 170 | " os.makedirs(data_dir)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "# We use pandas to save our test, train and validation data to csv files. Note that we make sure not to include header\n", 180 | "# information or an index as this is required by the built in algorithms provided by Amazon. Also, for the train and\n", 181 | "# validation data, it is assumed that the first entry in each row is the target variable.\n", 182 | "\n", 183 | "X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)\n", 184 | "\n", 185 | "pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)\n", 186 | "pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "### Upload to S3\n", 194 | "\n", 195 | "Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project." 
196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "prefix = 'boston-xgboost-LL'\n", 205 | "\n", 206 | "test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)\n", 207 | "val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)\n", 208 | "train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "## Step 4: Train and construct the XGBoost model\n", 216 | "\n", 217 | "Now that we have the training and validation data uploaded to S3, we can construct a training job for our XGBoost model and build the model itself.\n", 218 | "\n", 219 | "### Set up the training job\n", 220 | "\n", 221 | "First, we will set up and execute a training job for our model. To do this we need to specify some information that SageMaker will use to set up and properly execute the computation. For additional documentation on constructing a training job, see the [CreateTrainingJob API](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html) reference." 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "# We will need to know the name of the container that we want to use for training. SageMaker provides\n", 231 | "# a nice utility method to construct this for us.\n", 232 | "container = get_image_uri(session.boto_region_name, 'xgboost')\n", 233 | "\n", 234 | "# We now specify the parameters we wish to use for our training job\n", 235 | "training_params = {}\n", 236 | "\n", 237 | "# We need to specify the permissions that this training job will have. For our purposes we can use\n", 238 | "# the same permissions that our current SageMaker session has.\n", 239 | "training_params['RoleArn'] = role\n", 240 | "\n", 241 | "# Here we describe the algorithm we wish to use. The most important part is the container which\n", 242 | "# contains the training code.\n", 243 | "training_params['AlgorithmSpecification'] = {\n", 244 | " \"TrainingImage\": container,\n", 245 | " \"TrainingInputMode\": \"File\"\n", 246 | "}\n", 247 | "\n", 248 | "# We also need to say where we would like the resulting model artifacts stored.\n", 249 | "training_params['OutputDataConfig'] = {\n", 250 | " \"S3OutputPath\": \"s3://\" + session.default_bucket() + \"/\" + prefix + \"/output\"\n", 251 | "}\n", 252 | "\n", 253 | "# We also need to set some parameters for the training job itself. Namely we need to describe what sort of\n", 254 | "# compute instance we wish to use along with a stopping condition to handle the case that there is\n", 255 | "# some sort of error and the training script doesn't terminate.\n", 256 | "training_params['ResourceConfig'] = {\n", 257 | " \"InstanceCount\": 1,\n", 258 | " \"InstanceType\": \"ml.m4.xlarge\",\n", 259 | " \"VolumeSizeInGB\": 5\n", 260 | "}\n", 261 | " \n", 262 | "training_params['StoppingCondition'] = {\n", 263 | " \"MaxRuntimeInSeconds\": 86400\n", 264 | "}\n", 265 | "\n", 266 | "# Next we set the algorithm specific hyperparameters. 
You may wish to change these to see what effect\n", 267 | "# there is on the resulting model.\n", 268 | "training_params['HyperParameters'] = {\n", 269 | " \"max_depth\": \"5\",\n", 270 | " \"eta\": \"0.2\",\n", 271 | " \"gamma\": \"4\",\n", 272 | " \"min_child_weight\": \"6\",\n", 273 | " \"subsample\": \"0.8\",\n", 274 | " \"objective\": \"reg:linear\",\n", 275 | " \"early_stopping_rounds\": \"10\",\n", 276 | " \"num_round\": \"200\"\n", 277 | "}\n", 278 | "\n", 279 | "# Now we need to tell SageMaker where the data should be retrieved from.\n", 280 | "training_params['InputDataConfig'] = [\n", 281 | " {\n", 282 | " \"ChannelName\": \"train\",\n", 283 | " \"DataSource\": {\n", 284 | " \"S3DataSource\": {\n", 285 | " \"S3DataType\": \"S3Prefix\",\n", 286 | " \"S3Uri\": train_location,\n", 287 | " \"S3DataDistributionType\": \"FullyReplicated\"\n", 288 | " }\n", 289 | " },\n", 290 | " \"ContentType\": \"csv\",\n", 291 | " \"CompressionType\": \"None\"\n", 292 | " },\n", 293 | " {\n", 294 | " \"ChannelName\": \"validation\",\n", 295 | " \"DataSource\": {\n", 296 | " \"S3DataSource\": {\n", 297 | " \"S3DataType\": \"S3Prefix\",\n", 298 | " \"S3Uri\": val_location,\n", 299 | " \"S3DataDistributionType\": \"FullyReplicated\"\n", 300 | " }\n", 301 | " },\n", 302 | " \"ContentType\": \"csv\",\n", 303 | " \"CompressionType\": \"None\"\n", 304 | " }\n", 305 | "]" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "### Execute the training job\n", 313 | "\n", 314 | "Now that we've built the dictionary object containing the training job parameters, we can ask SageMaker to execute the job." 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "# First we need to choose a training job name. This is useful for if we want to recall information about our\n", 324 | "# training job at a later date. Note that SageMaker requires a training job name and that the name needs to\n", 325 | "# be unique, which we accomplish by appending the current timestamp.\n", 326 | "training_job_name = \"boston-xgboost-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", 327 | "training_params['TrainingJobName'] = training_job_name\n", 328 | "\n", 329 | "# And now we ask SageMaker to create (and execute) the training job\n", 330 | "training_job = session.sagemaker_client.create_training_job(**training_params)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "The training job has now been created by SageMaker and is currently running. Since we need the output of the training job, we may wish to wait until it has finished. We can do so by asking SageMaker to output the logs generated by the training job and continue doing so until the training job terminates." 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": {}, 344 | "outputs": [], 345 | "source": [ 346 | "session.logs_for_job(training_job_name, wait=True)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "metadata": {}, 352 | "source": [ 353 | "### Build the model\n", 354 | "\n", 355 | "Now that the training job has completed, we have some model artifacts which we can use to build a model. Note that here we mean SageMaker's definition of a model, which is a collection of information about a specific algorithm along with the artifacts which result from a training job." 
356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "metadata": {}, 362 | "outputs": [], 363 | "source": [ 364 | "# We begin by asking SageMaker to describe for us the results of the training job. The data structure\n", 365 | "# returned contains a lot more information than we currently need, try checking it out yourself in\n", 366 | "# more detail.\n", 367 | "training_job_info = session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)\n", 368 | "\n", 369 | "model_artifacts = training_job_info['ModelArtifacts']['S3ModelArtifacts']" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": null, 375 | "metadata": {}, 376 | "outputs": [], 377 | "source": [ 378 | "# Just like when we created a training job, the model name must be unique\n", 379 | "model_name = training_job_name + \"-model\"\n", 380 | "\n", 381 | "# We also need to tell SageMaker which container should be used for inference and where it should\n", 382 | "# retrieve the model artifacts from. In our case, the xgboost container that we used for training\n", 383 | "# can also be used for inference.\n", 384 | "primary_container = {\n", 385 | " \"Image\": container,\n", 386 | " \"ModelDataUrl\": model_artifacts\n", 387 | "}\n", 388 | "\n", 389 | "# And lastly we construct the SageMaker model\n", 390 | "model_info = session.sagemaker_client.create_model(\n", 391 | " ModelName = model_name,\n", 392 | " ExecutionRoleArn = role,\n", 393 | " PrimaryContainer = primary_container)" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "## Step 5: Testing the model\n", 401 | "\n", 402 | "Now that we have fit our model to the training data, using the validation data to avoid overfitting, we can test our model. To do this we will make use of SageMaker's Batch Transform functionality. In other words, we need to set up and execute a batch transform job, similar to the way that we constructed the training job earlier.\n", 403 | "\n", 404 | "### Set up the batch transform job\n", 405 | "\n", 406 | "Just like when we were training our model, we first need to provide some information in the form of a data structure that describes the batch transform job which we wish to execute.\n", 407 | "\n", 408 | "We will only be using some of the options available here but to see some of the additional options please see the SageMaker documentation for [creating a batch transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html)." 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "# Just like in each of the previous steps, we need to make sure to name our job and the name should be unique.\n", 418 | "transform_job_name = 'boston-xgboost-batch-transform-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", 419 | "\n", 420 | "# Now we construct the data structure which will describe the batch transform job.\n", 421 | "transform_request = \\\n", 422 | "{\n", 423 | " \"TransformJobName\": transform_job_name,\n", 424 | " \n", 425 | " # This is the name of the model that we created earlier.\n", 426 | " \"ModelName\": model_name,\n", 427 | " \n", 428 | " # This describes how many compute instances should be used at once. 
If you happen to be doing a very large\n", 429 | " # batch transform job it may be worth running multiple compute instances at once.\n", 430 | " \"MaxConcurrentTransforms\": 1,\n", 431 | " \n", 432 | " # This says how big each individual request sent to the model should be, at most. One of the things that\n", 433 | " # SageMaker does in the background is to split our data up into chunks so that each chunks stays under\n", 434 | " # this size limit.\n", 435 | " \"MaxPayloadInMB\": 6,\n", 436 | " \n", 437 | " # Sometimes we may want to send only a single sample to our endpoint at a time, however in this case each of\n", 438 | " # the chunks that we send should contain multiple samples of our input data.\n", 439 | " \"BatchStrategy\": \"MultiRecord\",\n", 440 | " \n", 441 | " # This next object describes where the output data should be stored. Some of the more advanced options which\n", 442 | " # we don't cover here also describe how SageMaker should collect output from various batches.\n", 443 | " \"TransformOutput\": {\n", 444 | " \"S3OutputPath\": \"s3://{}/{}/batch-bransform/\".format(session.default_bucket(),prefix)\n", 445 | " },\n", 446 | " \n", 447 | " # Here we describe our input data. Of course, we need to tell SageMaker where on S3 our input data is stored, in\n", 448 | " # addition we need to detail the characteristics of our input data. In particular, since SageMaker may need to\n", 449 | " # split our data up into chunks, it needs to know how the individual samples in our data file appear. In our\n", 450 | " # case each line is its own sample and so we set the split type to 'line'. We also need to tell SageMaker what\n", 451 | " # type of data is being sent, in this case csv, so that it can properly serialize the data.\n", 452 | " \"TransformInput\": {\n", 453 | " \"ContentType\": \"text/csv\",\n", 454 | " \"SplitType\": \"Line\",\n", 455 | " \"DataSource\": {\n", 456 | " \"S3DataSource\": {\n", 457 | " \"S3DataType\": \"S3Prefix\",\n", 458 | " \"S3Uri\": test_location,\n", 459 | " }\n", 460 | " }\n", 461 | " },\n", 462 | " \n", 463 | " # And lastly we tell SageMaker what sort of compute instance we would like it to use.\n", 464 | " \"TransformResources\": {\n", 465 | " \"InstanceType\": \"ml.m4.xlarge\",\n", 466 | " \"InstanceCount\": 1\n", 467 | " }\n", 468 | "}" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### Execute the batch transform job\n", 476 | "\n", 477 | "Now that we have created the request data structure, it is time to ask SageMaker to set up and run our batch transform job. Just like in the previous steps, SageMaker performs these tasks in the background so that if we want to wait for the transform job to terminate (and ensure the job is progressing) we can ask SageMaker to wait of the transform job to complete." 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": null, 483 | "metadata": {}, 484 | "outputs": [], 485 | "source": [ 486 | "transform_response = session.sagemaker_client.create_transform_job(**transform_request)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "transform_desc = session.wait_for_transform_job(transform_job_name)" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "### Analyze the results\n", 503 | "\n", 504 | "Now that the transform job has completed, the results are stored on S3 as we requested. 
Since we'd like to do a bit of analysis in the notebook we can use some notebook magic to copy the resulting output from S3 and save it locally." 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "transform_output = \"s3://{}/{}/batch-bransform/\".format(session.default_bucket(),prefix)" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": {}, 520 | "outputs": [], 521 | "source": [ 522 | "!aws s3 cp --recursive $transform_output $data_dir" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "To see how well our model works we can create a simple scatter plot between the predicted and actual values. If the model was completely accurate the resulting scatter plot would look like the line $x=y$. As we can see, our model seems to have done okay but there is room for improvement." 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": null, 544 | "metadata": {}, 545 | "outputs": [], 546 | "source": [ 547 | "plt.scatter(Y_test, Y_pred)\n", 548 | "plt.xlabel(\"Median Price\")\n", 549 | "plt.ylabel(\"Predicted Price\")\n", 550 | "plt.title(\"Median Price vs Predicted Price\")" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": {}, 556 | "source": [ 557 | "## Optional: Clean up\n", 558 | "\n", 559 | "The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook." 
560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": {}, 566 | "outputs": [], 567 | "source": [ 568 | "# First we will remove all of the files contained in the data_dir directory\n", 569 | "!rm $data_dir/*\n", 570 | "\n", 571 | "# And then we delete the directory itself\n", 572 | "!rmdir $data_dir" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": null, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [] 581 | } 582 | ], 583 | "metadata": { 584 | "kernelspec": { 585 | "display_name": "conda_pytorch_p36", 586 | "language": "python", 587 | "name": "conda_pytorch_p36" 588 | }, 589 | "language_info": { 590 | "codemirror_mode": { 591 | "name": "ipython", 592 | "version": 3 593 | }, 594 | "file_extension": ".py", 595 | "mimetype": "text/x-python", 596 | "name": "python", 597 | "nbconvert_exporter": "python", 598 | "pygments_lexer": "ipython3", 599 | "version": "3.6.5" 600 | } 601 | }, 602 | "nbformat": 4, 603 | "nbformat_minor": 2 604 | } -------------------------------------------------------------------------------- /Tutorials/Boston Housing - XGBoost (Deploy) - High Level.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predicting Boston Housing Prices\n", 8 | "\n", 9 | "## Using XGBoost in SageMaker (Deploy)\n", 10 | "\n", 11 | "_Deep Learning Nanodegree Program | Deployment_\n", 12 | "\n", 13 | "---\n", 14 | "\n", 15 | "As an introduction to using SageMaker's High Level Python API we will look at a relatively simple problem. Namely, we will use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to predict the median value of a home in the area of Boston Mass.\n", 16 | "\n", 17 | "The documentation for the high level API can be found on the [ReadTheDocs page](http://sagemaker.readthedocs.io/en/latest/)\n", 18 | "\n", 19 | "## General Outline\n", 20 | "\n", 21 | "Typically, when using a notebook instance with SageMaker, you will proceed through the following steps. Of course, not every step will need to be done with each project. Also, there is quite a lot of room for variation in many of the steps, as you will see throughout these lessons.\n", 22 | "\n", 23 | "1. Download or otherwise retrieve the data.\n", 24 | "2. Process / Prepare the data.\n", 25 | "3. Upload the processed data to S3.\n", 26 | "4. Train a chosen model.\n", 27 | "5. Test the trained model (typically using a batch transform job).\n", 28 | "6. Deploy the trained model.\n", 29 | "7. Use the deployed model.\n", 30 | "\n", 31 | "In this notebook we will be skipping step 5, testing the model. We will still test the model but we will do so by first deploying the model and then sending the test data to the deployed model." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Make sure that we use SageMaker 1.x\n", 41 | "!pip install sagemaker==1.72.0" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Step 0: Setting up the notebook\n", 49 | "\n", 50 | "We begin by setting up all of the necessary bits required to run our notebook. To start that means loading all of the Python modules we will need." 
51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "%matplotlib inline\n", 60 | "\n", 61 | "import os\n", 62 | "\n", 63 | "import numpy as np\n", 64 | "import pandas as pd\n", 65 | "\n", 66 | "import matplotlib.pyplot as plt\n", 67 | "\n", 68 | "from sklearn.datasets import load_boston\n", 69 | "import sklearn.model_selection" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "In addition to the modules above, we need to import the various bits of SageMaker that we will be using. " 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "import sagemaker\n", 86 | "from sagemaker import get_execution_role\n", 87 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 88 | "from sagemaker.predictor import csv_serializer\n", 89 | "\n", 90 | "# This is an object that represents the SageMaker session that we are currently operating in. This\n", 91 | "# object contains some useful information that we will need to access later such as our region.\n", 92 | "session = sagemaker.Session()\n", 93 | "\n", 94 | "# This is an object that represents the IAM role that we are currently assigned. When we construct\n", 95 | "# and launch the training job later we will need to tell it what IAM role it should have. Since our\n", 96 | "# use case is relatively simple we will simply assign the training job the role we currently have.\n", 97 | "role = get_execution_role()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "## Step 1: Downloading the data\n", 105 | "\n", 106 | "Fortunately, this dataset can be retrieved using sklearn and so this step is relatively straightforward." 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "boston = load_boston()" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Step 2: Preparing and splitting the data\n", 123 | "\n", 124 | "Given that this is clean tabular data, we don't need to do any processing. However, we do need to split the rows in the dataset up into train, test and validation sets." 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "# First we package up the input data and the target variable (the median value) as pandas dataframes. 
This\n", 134 | "# will make saving the data to a file a little easier later on.\n", 135 | "\n", 136 | "X_bos_pd = pd.DataFrame(boston.data, columns=boston.feature_names)\n", 137 | "Y_bos_pd = pd.DataFrame(boston.target)\n", 138 | "\n", 139 | "# We split the dataset into 2/3 training and 1/3 testing sets.\n", 140 | "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.33)\n", 141 | "\n", 142 | "# Then we split the training set further into 2/3 training and 1/3 validation sets.\n", 143 | "X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "## Step 3: Uploading the training and validation files to S3\n", 151 | "\n", 152 | "When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. We can use the SageMaker API to do this and hide some of the details.\n", 153 | "\n", 154 | "### Save the data locally\n", 155 | "\n", 156 | "First we need to create the train and validation csv files which we will then upload to S3." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# This is our local data directory. We need to make sure that it exists.\n", 166 | "data_dir = '../data/boston'\n", 167 | "if not os.path.exists(data_dir):\n", 168 | " os.makedirs(data_dir)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "# We use pandas to save our train and validation data to csv files. Note that we make sure not to include header\n", 178 | "# information or an index as this is required by the built in algorithms provided by Amazon. Also, it is assumed\n", 179 | "# that the first entry in each row is the target variable.\n", 180 | "\n", 181 | "pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)\n", 182 | "pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "### Upload to S3\n", 190 | "\n", 191 | "Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project." 
192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "prefix = 'boston-xgboost-deploy-hl'\n", 201 | "\n", 202 | "val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)\n", 203 | "train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "## Step 4: Train the XGBoost model\n", 211 | "\n", 212 | "Now that we have the training and validation data uploaded to S3, we can construct our XGBoost model and train it. We will be making use of the high level SageMaker API to do this which will make the resulting code a little easier to read at the cost of some flexibility.\n", 213 | "\n", 214 | "To construct an estimator, the object which we wish to train, we need to provide the location of a container which contains the training code. Since we are using a built in algorithm this container is provided by Amazon. However, the full name of the container is a bit lengthy and depends on the region that we are operating in. Fortunately, SageMaker provides a useful utility method called `get_image_uri` that constructs the image name for us.\n", 215 | "\n", 216 | "To use the `get_image_uri` method we need to provide it with our current region, which can be obtained from the session object, and the name of the algorithm we wish to use. In this notebook we will be using XGBoost however you could try another algorithm if you wish. The list of built in algorithms can be found in the list of [Common Parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html)." 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "# As stated above, we use this utility method to construct the image name for the training container.\n", 226 | "container = get_image_uri(session.boto_region_name, 'xgboost')\n", 227 | "\n", 228 | "# Now that we know which container to use, we can construct the estimator object.\n", 229 | "xgb = sagemaker.estimator.Estimator(container, # The name of the training container\n", 230 | " role, # The IAM role to use (our current role in this case)\n", 231 | " train_instance_count=1, # The number of instances to use for training\n", 232 | " train_instance_type='ml.m4.xlarge', # The type of instance ot use for training\n", 233 | " output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),\n", 234 | " # Where to save the output (the model artifacts)\n", 235 | " sagemaker_session=session) # The current SageMaker session" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "Before asking SageMaker to begin the training job, we should probably set any model specific hyperparameters. There are quite a few that can be set when using the XGBoost algorithm, below are just a few of them. 
If you would like to change the hyperparameters below or modify additional ones you can find additional information on the [XGBoost hyperparameter page](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "xgb.set_hyperparameters(max_depth=5,\n", 252 | " eta=0.2,\n", 253 | " gamma=4,\n", 254 | " min_child_weight=6,\n", 255 | " subsample=0.8,\n", 256 | " objective='reg:linear',\n", 257 | " early_stopping_rounds=10,\n", 258 | " num_round=200)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "Now that we have our estimator object completely set up, it is time to train it. To do this we make sure that SageMaker knows our input data is in csv format and then execute the `fit` method." 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "# This is a wrapper around the location of our train and validation data, to make sure that SageMaker\n", 275 | "# knows our data is in csv format.\n", 276 | "s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')\n", 277 | "s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')\n", 278 | "\n", 279 | "xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "## Step 5: Test the trained model\n", 287 | "\n", 288 | "We will be skipping this step for now. We will still test our trained model but we are going to do it by using the deployed model, rather than setting up a batch transform job.\n", 289 | "\n", 290 | "\n", 291 | "## Step 6: Deploy the trained model\n", 292 | "\n", 293 | "Now that we have fit our model to the training data, using the validation data to avoid overfitting, we can deploy our model and test it. Deploying is very simple when we use the high level API, we need only call the `deploy` method of our trained estimator.\n", 294 | "\n", 295 | "**NOTE:** When deploying a model you are asking SageMaker to launch an compute instance that will wait for data to be sent to it. As a result, this compute instance will continue to run until *you* shut it down. This is important to know since the cost of a deployed endpoint depends on how long it has been running for.\n", 296 | "\n", 297 | "In other words **If you are no longer using a deployed endpoint, shut it down!**" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "## Step 7: Use the model\n", 314 | "\n", 315 | "Now that our model is trained and deployed we can send the test data to it and evaluate the results. Here, because our test data is so small, we can send it all using a single call to our endpoint. If our test dataset was larger we would need to split it up and send the data in chunks, making sure to accumulate the results." 
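For a larger test set, one way to split the data into chunks and accumulate the results might look like the following minimal sketch. It assumes the `xgb_predictor` and `X_test` objects from this notebook, that the content type and serializer have already been set to csv as in the next code cell, and an illustrative chunk size of 100 rows:

```python
import numpy as np

# Illustrative chunk size; keep each request comfortably under the endpoint's
# payload limit.
chunk_size = 100

chunks = []
for start in range(0, len(X_test), chunk_size):
    # Send one slice of the test data to the deployed endpoint.
    response = xgb_predictor.predict(X_test.values[start:start + chunk_size]).decode('utf-8')
    # Each response is a comma delimited string of predictions for that slice.
    chunks.append(np.fromstring(response, sep=','))

# Accumulate the per-chunk results into a single array of predictions.
Y_pred = np.concatenate(chunks)
```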
316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "# We need to tell the endpoint what format the data we are sending is in\n", 325 | "xgb_predictor.content_type = 'text/csv'\n", 326 | "xgb_predictor.serializer = csv_serializer\n", 327 | "\n", 328 | "Y_pred = xgb_predictor.predict(X_test.values).decode('utf-8')\n", 329 | "# predictions is currently a comma delimited string and so we would like to break it up\n", 330 | "# as a numpy array.\n", 331 | "Y_pred = np.fromstring(Y_pred, sep=',')" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "To see how well our model works we can create a simple scatter plot between the predicted and actual values. If the model was completely accurate the resulting scatter plot would look like the line $x=y$. As we can see, our model seems to have done okay but there is room for improvement." 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [ 347 | "plt.scatter(Y_test, Y_pred)\n", 348 | "plt.xlabel(\"Median Price\")\n", 349 | "plt.ylabel(\"Predicted Price\")\n", 350 | "plt.title(\"Median Price vs Predicted Price\")" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "## Delete the endpoint\n", 358 | "\n", 359 | "Since we are no longer using the deployed model we need to make sure to shut it down. Remember that you have to pay for the length of time that your endpoint is deployed so the longer it is left running, the more it costs." 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "xgb_predictor.delete_endpoint()" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "## Optional: Clean up\n", 376 | "\n", 377 | "The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook." 
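The clean-up cell below only removes the local copies of the data. If you also want to delete the objects that were uploaded to S3, a short sketch (not part of the original notebook; it assumes the `session` and `prefix` defined earlier, and should be double-checked before running) is:

```python
import boto3

# Delete everything stored under our prefix in the default SageMaker bucket.
# Be careful: this removes the uploaded training/validation data and any model
# artifacts written under this prefix.
bucket = session.default_bucket()
boto3.resource('s3').Bucket(bucket).objects.filter(Prefix=prefix).delete()
```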
378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "# First we will remove all of the files contained in the data_dir directory\n", 387 | "!rm $data_dir/*\n", 388 | "\n", 389 | "# And then we delete the directory itself\n", 390 | "!rmdir $data_dir" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": {}, 397 | "outputs": [], 398 | "source": [] 399 | } 400 | ], 401 | "metadata": { 402 | "kernelspec": { 403 | "display_name": "conda_pytorch_p36", 404 | "language": "python", 405 | "name": "conda_pytorch_p36" 406 | }, 407 | "language_info": { 408 | "codemirror_mode": { 409 | "name": "ipython", 410 | "version": 3 411 | }, 412 | "file_extension": ".py", 413 | "mimetype": "text/x-python", 414 | "name": "python", 415 | "nbconvert_exporter": "python", 416 | "pygments_lexer": "ipython3", 417 | "version": "3.6.5" 418 | } 419 | }, 420 | "nbformat": 4, 421 | "nbformat_minor": 2 422 | } -------------------------------------------------------------------------------- /Tutorials/Boston Housing - XGBoost (Deploy) - Low Level.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predicting Boston Housing Prices\n", 8 | "\n", 9 | "## Using XGBoost in SageMaker (Deploy)\n", 10 | "\n", 11 | "_Deep Learning Nanodegree Program | Deployment_\n", 12 | "\n", 13 | "---\n", 14 | "\n", 15 | "As an introduction to using SageMaker's Low Level Python API we will look at a relatively simple problem. Namely, we will use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to predict the median value of a home in the area of Boston Mass.\n", 16 | "\n", 17 | "The documentation reference for the API used in this notebook is the [SageMaker Developer's Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/)\n", 18 | "\n", 19 | "## General Outline\n", 20 | "\n", 21 | "Typically, when using a notebook instance with SageMaker, you will proceed through the following steps. Of course, not every step will need to be done with each project. Also, there is quite a lot of room for variation in many of the steps, as you will see throughout these lessons.\n", 22 | "\n", 23 | "1. Download or otherwise retrieve the data.\n", 24 | "2. Process / Prepare the data.\n", 25 | "3. Upload the processed data to S3.\n", 26 | "4. Train a chosen model.\n", 27 | "5. Test the trained model (typically using a batch transform job).\n", 28 | "6. Deploy the trained model.\n", 29 | "7. Use the deployed model.\n", 30 | "\n", 31 | "In this notebook we will be skipping step 5, testing the model. We will still test the model but we will do so by first deploying it and then sending the test data to the deployed model." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Make sure that we use SageMaker 1.x\n", 41 | "!pip install sagemaker==1.72.0" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Step 0: Setting up the notebook\n", 49 | "\n", 50 | "We begin by setting up all of the necessary bits required to run our notebook. To start that means loading all of the Python modules we will need." 
51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "%matplotlib inline\n", 60 | "\n", 61 | "import os\n", 62 | "\n", 63 | "import time\n", 64 | "from time import gmtime, strftime\n", 65 | "\n", 66 | "import numpy as np\n", 67 | "import pandas as pd\n", 68 | "\n", 69 | "import matplotlib.pyplot as plt\n", 70 | "\n", 71 | "from sklearn.datasets import load_boston\n", 72 | "import sklearn.model_selection" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "In addition to the modules above, we need to import the various bits of SageMaker that we will be using. " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "import sagemaker\n", 89 | "from sagemaker import get_execution_role\n", 90 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 91 | "\n", 92 | "# This is an object that represents the SageMaker session that we are currently operating in. This\n", 93 | "# object contains some useful information that we will need to access later such as our region.\n", 94 | "session = sagemaker.Session()\n", 95 | "\n", 96 | "# This is an object that represents the IAM role that we are currently assigned. When we construct\n", 97 | "# and launch the training job later we will need to tell it what IAM role it should have. Since our\n", 98 | "# use case is relatively simple we will simply assign the training job the role we currently have.\n", 99 | "role = get_execution_role()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Step 1: Downloading the data\n", 107 | "\n", 108 | "Fortunately, this dataset can be retrieved using sklearn and so this step is relatively straightforward." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "boston = load_boston()" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## Step 2: Preparing and splitting the data\n", 125 | "\n", 126 | "Given that this is clean tabular data, we don't need to do any processing. However, we do need to split the rows in the dataset up into train, test and validation sets." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "# First we package up the input data and the target variable (the median value) as pandas dataframes. 
This\n", 136 | "# will make saving the data to a file a little easier later on.\n", 137 | "\n", 138 | "X_bos_pd = pd.DataFrame(boston.data, columns=boston.feature_names)\n", 139 | "Y_bos_pd = pd.DataFrame(boston.target)\n", 140 | "\n", 141 | "# We split the dataset into 2/3 training and 1/3 testing sets.\n", 142 | "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.33)\n", 143 | "\n", 144 | "# Then we split the training set further into 2/3 training and 1/3 validation sets.\n", 145 | "X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "## Step 3: Uploading the training and validation files to S3\n", 153 | "\n", 154 | "When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. We can use the SageMaker API to do this and hide some of the details.\n", 155 | "\n", 156 | "### Save the data locally\n", 157 | "\n", 158 | "First we need to create the train and validation csv files which we will then upload to S3." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "# This is our local data directory. We need to make sure that it exists.\n", 168 | "data_dir = '../data/boston'\n", 169 | "if not os.path.exists(data_dir):\n", 170 | " os.makedirs(data_dir)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "# We use pandas to save our train and validation data to csv files. Note that we make sure not to include header\n", 180 | "# information or an index as this is required by the built in algorithms provided by Amazon. Also, it is assumed\n", 181 | "# that the first entry in each row is the target variable.\n", 182 | "\n", 183 | "pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)\n", 184 | "pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### Upload to S3\n", 192 | "\n", 193 | "Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project." 
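Once the upload cell below has run, it can be reassuring to confirm that the files actually landed in the bucket. A small sketch using boto3 (not part of the original notebook; `prefix` is defined in the next cell):

```python
import boto3

# List whatever was uploaded under our prefix in the default SageMaker bucket.
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=session.default_bucket(), Prefix=prefix)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'], 'bytes')
```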
194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "prefix = 'boston-xgboost-deploy-ll'\n", 203 | "\n", 204 | "val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)\n", 205 | "train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "## Step 4: Train and construct the XGBoost model\n", 213 | "\n", 214 | "Now that we have the training and validation data uploaded to S3, we can construct a training job for our XGBoost model and build the model itself.\n", 215 | "\n", 216 | "### Set up the training job\n", 217 | "\n", 218 | "First, we will set up and execute a training job for our model. To do this we need to specify some information that SageMaker will use to set up and properly execute the computation. For additional documentation on constructing a training job, see the [CreateTrainingJob API](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html) reference." 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "# We will need to know the name of the container that we want to use for training. SageMaker provides\n", 228 | "# a nice utility method to construct this for us.\n", 229 | "container = get_image_uri(session.boto_region_name, 'xgboost')\n", 230 | "\n", 231 | "# We now specify the parameters we wish to use for our training job\n", 232 | "training_params = {}\n", 233 | "\n", 234 | "# We need to specify the permissions that this training job will have. For our purposes we can use\n", 235 | "# the same permissions that our current SageMaker session has.\n", 236 | "training_params['RoleArn'] = role\n", 237 | "\n", 238 | "# Here we describe the algorithm we wish to use. The most important part is the container which\n", 239 | "# contains the training code.\n", 240 | "training_params['AlgorithmSpecification'] = {\n", 241 | " \"TrainingImage\": container,\n", 242 | " \"TrainingInputMode\": \"File\"\n", 243 | "}\n", 244 | "\n", 245 | "# We also need to say where we would like the resulting model artifacst stored.\n", 246 | "training_params['OutputDataConfig'] = {\n", 247 | " \"S3OutputPath\": \"s3://\" + session.default_bucket() + \"/\" + prefix + \"/output\"\n", 248 | "}\n", 249 | "\n", 250 | "# We also need to set some parameters for the training job itself. Namely we need to describe what sort of\n", 251 | "# compute instance we wish to use along with a stopping condition to handle the case that there is\n", 252 | "# some sort of error and the training script doesn't terminate.\n", 253 | "training_params['ResourceConfig'] = {\n", 254 | " \"InstanceCount\": 1,\n", 255 | " \"InstanceType\": \"ml.m4.xlarge\",\n", 256 | " \"VolumeSizeInGB\": 5\n", 257 | "}\n", 258 | " \n", 259 | "training_params['StoppingCondition'] = {\n", 260 | " \"MaxRuntimeInSeconds\": 86400\n", 261 | "}\n", 262 | "\n", 263 | "# Next we set the algorithm specific hyperparameters. 
You may wish to change these to see what effect\n", 264 | "# there is on the resulting model.\n", 265 | "training_params['HyperParameters'] = {\n", 266 | " \"max_depth\": \"5\",\n", 267 | " \"eta\": \"0.2\",\n", 268 | " \"gamma\": \"4\",\n", 269 | " \"min_child_weight\": \"6\",\n", 270 | " \"subsample\": \"0.8\",\n", 271 | " \"objective\": \"reg:linear\",\n", 272 | " \"early_stopping_rounds\": \"10\",\n", 273 | " \"num_round\": \"200\"\n", 274 | "}\n", 275 | "\n", 276 | "# Now we need to tell SageMaker where the data should be retrieved from.\n", 277 | "training_params['InputDataConfig'] = [\n", 278 | " {\n", 279 | " \"ChannelName\": \"train\",\n", 280 | " \"DataSource\": {\n", 281 | " \"S3DataSource\": {\n", 282 | " \"S3DataType\": \"S3Prefix\",\n", 283 | " \"S3Uri\": train_location,\n", 284 | " \"S3DataDistributionType\": \"FullyReplicated\"\n", 285 | " }\n", 286 | " },\n", 287 | " \"ContentType\": \"csv\",\n", 288 | " \"CompressionType\": \"None\"\n", 289 | " },\n", 290 | " {\n", 291 | " \"ChannelName\": \"validation\",\n", 292 | " \"DataSource\": {\n", 293 | " \"S3DataSource\": {\n", 294 | " \"S3DataType\": \"S3Prefix\",\n", 295 | " \"S3Uri\": val_location,\n", 296 | " \"S3DataDistributionType\": \"FullyReplicated\"\n", 297 | " }\n", 298 | " },\n", 299 | " \"ContentType\": \"csv\",\n", 300 | " \"CompressionType\": \"None\"\n", 301 | " }\n", 302 | "]" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "### Execute the training job\n", 310 | "\n", 311 | "Now that we've built the dict containing the training job parameters, we can ask SageMaker to execute the job." 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "# First we need to choose a training job name. This is useful for if we want to recall information about our\n", 321 | "# training job at a later date. Note that SageMaker requires a training job name and that the name needs to\n", 322 | "# be unique, which we accomplish by appending the current timestamp.\n", 323 | "training_job_name = \"boston-xgboost-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", 324 | "training_params['TrainingJobName'] = training_job_name\n", 325 | "\n", 326 | "# And now we ask SageMaker to create (and execute) the training job\n", 327 | "training_job = session.sagemaker_client.create_training_job(**training_params)" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "The training job has now been created by SageMaker and is currently running. Since we need the output of the training job, we may wish to wait until it has finished. We can do so by asking SageMaker to output the logs generated by the training job and continue doing so until the training job terminates." 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "session.logs_for_job(training_job_name, wait=True)" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "### Build the model\n", 351 | "\n", 352 | "Now that the training job has completed, we have some model artifacts which we can use to build a model. Note that here we mean SageMaker's definition of a model, which is a collection of information about a specific algorithm along with the artifacts which result from a training job." 
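Before building the model it is worth being sure the training job really finished. As an aside (not from the original notebook), instead of streaming the logs one could poll the job's status through the boto3 client until it reaches a terminal state, roughly:

```python
import time

# Poll the training job until it finishes (or fails / is stopped).
while True:
    description = session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
    status = description['TrainingJobStatus']
    print('Training job status:', status)
    if status in ('Completed', 'Failed', 'Stopped'):
        break
    time.sleep(30)
```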
353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [ 361 | "# We begin by asking SageMaker to describe for us the results of the training job. The data structure\n", 362 | "# returned contains a lot more information than we currently need, try checking it out yourself in\n", 363 | "# more detail.\n", 364 | "training_job_info = session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)\n", 365 | "\n", 366 | "model_artifacts = training_job_info['ModelArtifacts']['S3ModelArtifacts']" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [ 375 | "# Just like when we created a training job, the model name must be unique\n", 376 | "model_name = training_job_name + \"-model\"\n", 377 | "\n", 378 | "# We also need to tell SageMaker which container should be used for inference and where it should\n", 379 | "# retrieve the model artifacts from. In our case, the xgboost container that we used for training\n", 380 | "# can also be used for inference.\n", 381 | "primary_container = {\n", 382 | " \"Image\": container,\n", 383 | " \"ModelDataUrl\": model_artifacts\n", 384 | "}\n", 385 | "\n", 386 | "# And lastly we construct the SageMaker model\n", 387 | "model_info = session.sagemaker_client.create_model(\n", 388 | " ModelName = model_name,\n", 389 | " ExecutionRoleArn = role,\n", 390 | " PrimaryContainer = primary_container)" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "## Step 5: Test the trained model\n", 398 | "\n", 399 | "We will be skipping this step for now. We will still test our trained model but we are going to do it by using the deployed model, rather than setting up a batch transform job.\n", 400 | "\n", 401 | "## Step 6: Create and deploy the endpoint\n", 402 | "\n", 403 | "Now that we have trained and constructed a model it is time to build the associated endpoint and deploy it. As in the earlier steps, we first need to construct the appropriate configuration." 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": null, 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "# As before, we need to give our endpoint configuration a name which should be unique\n", 413 | "endpoint_config_name = \"boston-xgboost-endpoint-config-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", 414 | "\n", 415 | "# And then we ask SageMaker to construct the endpoint configuration\n", 416 | "endpoint_config_info = session.sagemaker_client.create_endpoint_config(\n", 417 | " EndpointConfigName = endpoint_config_name,\n", 418 | " ProductionVariants = [{\n", 419 | " \"InstanceType\": \"ml.m4.xlarge\",\n", 420 | " \"InitialVariantWeight\": 1,\n", 421 | " \"InitialInstanceCount\": 1,\n", 422 | " \"ModelName\": model_name,\n", 423 | " \"VariantName\": \"AllTraffic\"\n", 424 | " }])" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "And now that the endpoint configuration has been created we can deploy the endpoint itself.\n", 432 | "\n", 433 | "**NOTE:** When deploying a model you are asking SageMaker to launch an compute instance that will wait for data to be sent to it. As a result, this compute instance will continue to run until *you* shut it down. 
This is important to know since the cost of a deployed endpoint depends on how long it has been running for.\n", 434 | "\n", 435 | "In other words **If you are no longer using a deployed endpoint, shut it down!**" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "# Again, we need a unique name for our endpoint\n", 445 | "endpoint_name = \"boston-xgboost-endpoint-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", 446 | "\n", 447 | "# And then we can deploy our endpoint\n", 448 | "endpoint_info = session.sagemaker_client.create_endpoint(\n", 449 | " EndpointName = endpoint_name,\n", 450 | " EndpointConfigName = endpoint_config_name)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "Just like when we created a training job, SageMaker is now requisitioning and launching our endpoint. Since we can't do much until the endpoint has been completely deployed we can wait for it to finish." 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "endpoint_dec = session.wait_for_endpoint(endpoint_name)" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "## Step 7: Use the model\n", 474 | "\n", 475 | "Now that our model is trained and deployed we can send test data to it and evaluate the results. Here, because our test data is so small, we can send it all using a single call to our endpoint. If our test dataset was larger we would need to split it up and send the data in chunks, making sure to accumulate the results." 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": null, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "# First we need to serialize the input data. In this case we want to send the test data as a csv and\n", 485 | "# so we manually do this. Of course, there are many other ways to do this.\n", 486 | "payload = [[str(entry) for entry in row] for row in X_test.values]\n", 487 | "payload = '\\n'.join([','.join(row) for row in payload])" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [ 496 | "# This time we use the sagemaker runtime client rather than the sagemaker client so that we can invoke\n", 497 | "# the endpoint that we created.\n", 498 | "response = session.sagemaker_runtime_client.invoke_endpoint(\n", 499 | " EndpointName = endpoint_name,\n", 500 | " ContentType = 'text/csv',\n", 501 | " Body = payload)\n", 502 | "\n", 503 | "# We need to make sure that we deserialize the result of our endpoint call.\n", 504 | "result = response['Body'].read().decode(\"utf-8\")\n", 505 | "Y_pred = np.fromstring(result, sep=',')" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "To see how well our model works we can create a simple scatter plot between the predicted and actual values. If the model was completely accurate the resulting scatter plot would look like the line $x=y$. As we can see, our model seems to have done okay but there is room for improvement." 
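To put a number on how well the model does, one could also compute an error metric alongside the scatter plot below. A small sketch (assuming `Y_test` and `Y_pred` as defined above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error between the actual and predicted median prices.
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print('Test RMSE: {:.2f}'.format(rmse))
```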
513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": {}, 519 | "outputs": [], 520 | "source": [ 521 | "plt.scatter(Y_test, Y_pred)\n", 522 | "plt.xlabel(\"Median Price\")\n", 523 | "plt.ylabel(\"Predicted Price\")\n", 524 | "plt.title(\"Median Price vs Predicted Price\")" 525 | ] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "metadata": {}, 530 | "source": [ 531 | "## Delete the endpoint\n", 532 | "\n", 533 | "Since we are no longer using the deployed model we need to make sure to shut it down. Remember that you have to pay for the length of time that your endpoint is deployed so the longer it is left running, the more it costs." 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "session.sagemaker_client.delete_endpoint(EndpointName = endpoint_name)" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "## Optional: Clean up\n", 550 | "\n", 551 | "The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook." 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": {}, 558 | "outputs": [], 559 | "source": [ 560 | "# First we will remove all of the files contained in the data_dir directory\n", 561 | "!rm $data_dir/*\n", 562 | "\n", 563 | "# And then we delete the directory itself\n", 564 | "!rmdir $data_dir" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": null, 570 | "metadata": {}, 571 | "outputs": [], 572 | "source": [] 573 | } 574 | ], 575 | "metadata": { 576 | "kernelspec": { 577 | "display_name": "conda_pytorch_p36", 578 | "language": "python", 579 | "name": "conda_pytorch_p36" 580 | }, 581 | "language_info": { 582 | "codemirror_mode": { 583 | "name": "ipython", 584 | "version": 3 585 | }, 586 | "file_extension": ".py", 587 | "mimetype": "text/x-python", 588 | "name": "python", 589 | "nbconvert_exporter": "python", 590 | "pygments_lexer": "ipython3", 591 | "version": "3.6.5" 592 | } 593 | }, 594 | "nbformat": 4, 595 | "nbformat_minor": 2 596 | } -------------------------------------------------------------------------------- /Tutorials/Boston Housing - XGBoost (Hyperparameter Tuning) - High Level.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predicting Boston Housing Prices\n", 8 | "\n", 9 | "## Using XGBoost in SageMaker (Hyperparameter Tuning)\n", 10 | "\n", 11 | "_Deep Learning Nanodegree Program | Deployment_\n", 12 | "\n", 13 | "---\n", 14 | "\n", 15 | "As an introduction to using SageMaker's High Level Python API for hyperparameter tuning, we will look again at the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to predict the median value of a home in the area of Boston Mass.\n", 16 | "\n", 17 | "The documentation for the high level API can be found on 
the [ReadTheDocs page](http://sagemaker.readthedocs.io/en/latest/)\n", 18 | "\n", 19 | "## General Outline\n", 20 | "\n", 21 | "Typically, when using a notebook instance with SageMaker, you will proceed through the following steps. Of course, not every step will need to be done with each project. Also, there is quite a lot of room for variation in many of the steps, as you will see throughout these lessons.\n", 22 | "\n", 23 | "1. Download or otherwise retrieve the data.\n", 24 | "2. Process / Prepare the data.\n", 25 | "3. Upload the processed data to S3.\n", 26 | "4. Train a chosen model.\n", 27 | "5. Test the trained model (typically using a batch transform job).\n", 28 | "6. Deploy the trained model.\n", 29 | "7. Use the deployed model.\n", 30 | "\n", 31 | "In this notebook we will only be covering steps 1 through 5 as we are only interested in creating a tuned model and testing its performance." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Make sure that we use SageMaker 1.x\n", 41 | "!pip install sagemaker==1.72.0" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Step 0: Setting up the notebook\n", 49 | "\n", 50 | "We begin by setting up all of the necessary bits required to run our notebook. To start that means loading all of the Python modules we will need." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "%matplotlib inline\n", 60 | "\n", 61 | "import os\n", 62 | "\n", 63 | "import numpy as np\n", 64 | "import pandas as pd\n", 65 | "\n", 66 | "import matplotlib.pyplot as plt\n", 67 | "\n", 68 | "from sklearn.datasets import load_boston\n", 69 | "import sklearn.model_selection" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "In addition to the modules above, we need to import the various bits of SageMaker that we will be using. " 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "import sagemaker\n", 86 | "from sagemaker import get_execution_role\n", 87 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 88 | "from sagemaker.predictor import csv_serializer\n", 89 | "\n", 90 | "# This is an object that represents the SageMaker session that we are currently operating in. This\n", 91 | "# object contains some useful information that we will need to access later such as our region.\n", 92 | "session = sagemaker.Session()\n", 93 | "\n", 94 | "# This is an object that represents the IAM role that we are currently assigned. When we construct\n", 95 | "# and launch the training job later we will need to tell it what IAM role it should have. Since our\n", 96 | "# use case is relatively simple we will simply assign the training job the role we currently have.\n", 97 | "role = get_execution_role()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "## Step 1: Downloading the data\n", 105 | "\n", 106 | "Fortunately, this dataset can be retrieved using sklearn and so this step is relatively straightforward." 
107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "boston = load_boston()" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## Step 2: Preparing and splitting the data\n", 123 | "\n", 124 | "Given that this is clean tabular data, we don't need to do any processing. However, we do need to split the rows in the dataset up into train, test and validation sets." 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "# First we package up the input data and the target variable (the median value) as pandas dataframes. This\n", 134 | "# will make saving the data to a file a little easier later on.\n", 135 | "\n", 136 | "X_bos_pd = pd.DataFrame(boston.data, columns=boston.feature_names)\n", 137 | "Y_bos_pd = pd.DataFrame(boston.target)\n", 138 | "\n", 139 | "# We split the dataset into 2/3 training and 1/3 testing sets.\n", 140 | "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.33)\n", 141 | "\n", 142 | "# Then we split the training set further into 2/3 training and 1/3 validation sets.\n", 143 | "X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "## Step 3: Uploading the data files to S3\n", 151 | "\n", 152 | "When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.\n", 153 | "\n", 154 | "### Save the data locally\n", 155 | "\n", 156 | "First we need to create the test, train and validation csv files which we will then upload to S3." 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# This is our local data directory. We need to make sure that it exists.\n", 166 | "data_dir = '../data/boston'\n", 167 | "if not os.path.exists(data_dir):\n", 168 | " os.makedirs(data_dir)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "# We use pandas to save our test, train and validation data to csv files. Note that we make sure not to include header\n", 178 | "# information or an index as this is required by the built in algorithms provided by Amazon. 
Also, for the train and\n", 179 | "# validation data, it is assumed that the first entry in each row is the target variable.\n", 180 | "\n", 181 | "X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)\n", 182 | "\n", 183 | "pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)\n", 184 | "pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### Upload to S3\n", 192 | "\n", 193 | "Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project." 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "prefix = 'boston-xgboost-tuning-HL'\n", 203 | "\n", 204 | "test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)\n", 205 | "val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)\n", 206 | "train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "## Step 4: Train the XGBoost model\n", 214 | "\n", 215 | "Now that we have the training and validation data uploaded to S3, we can construct our XGBoost model and train it. Unlike in the previous notebooks, instead of training a single model, we will use SageMaker's hyperparameter tuning functionality to train multiple models and use the one that performs the best on the validation set.\n", 216 | "\n", 217 | "To begin with, as in the previous approaches, we will need to construct an estimator object." 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "# As stated above, we use this utility method to construct the image name for the training container.\n", 227 | "container = get_image_uri(session.boto_region_name, 'xgboost')\n", 228 | "\n", 229 | "# Now that we know which container to use, we can construct the estimator object.\n", 230 | "xgb = sagemaker.estimator.Estimator(container, # The name of the training container\n", 231 | " role, # The IAM role to use (our current role in this case)\n", 232 | " train_instance_count=1, # The number of instances to use for training\n", 233 | " train_instance_type='ml.m4.xlarge', # The type of instance ot use for training\n", 234 | " output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),\n", 235 | " # Where to save the output (the model artifacts)\n", 236 | " sagemaker_session=session) # The current SageMaker session" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "Before beginning the hyperparameter tuning, we should make sure to set any model specific hyperparameters that we wish to have default values. There are quite a few that can be set when using the XGBoost algorithm, below are just a few of them. 
If you would like to change the hyperparameters below or modify additional ones you can find additional information on the [XGBoost hyperparameter page](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "xgb.set_hyperparameters(max_depth=5,\n", 253 | " eta=0.2,\n", 254 | " gamma=4,\n", 255 | " min_child_weight=6,\n", 256 | " subsample=0.8,\n", 257 | " objective='reg:linear',\n", 258 | " early_stopping_rounds=10,\n", 259 | " num_round=200)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "Now that we have our estimator object completely set up, it is time to create the hyperparameter tuner. To do this we need to construct a new object which contains each of the parameters we want SageMaker to tune. In this case, we wish to find the best values for the `max_depth`, `eta`, `min_child_weight`, `subsample`, and `gamma` parameters. Note that for each parameter that we want SageMaker to tune we need to specify both the *type* of the parameter and the *range* of values that parameter may take on.\n", 267 | "\n", 268 | "In addition, we specify the *number* of models to construct (`max_jobs`) and the number of those that can be trained in parallel (`max_parallel_jobs`). In the cell below we have chosen to train `20` models, of which we ask that SageMaker train `3` at a time in parallel. Note that this results in a total of `20` training jobs being executed which can take some time, in this case almost a half hour. With more complicated models this can take even longer so be aware!" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner\n", 278 | "\n", 279 | "xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb, # The estimator object to use as the basis for the training jobs.\n", 280 | " objective_metric_name = 'validation:rmse', # The metric used to compare trained models.\n", 281 | " objective_type = 'Minimize', # Whether we wish to minimize or maximize the metric.\n", 282 | " max_jobs = 20, # The total number of models to train\n", 283 | " max_parallel_jobs = 3, # The number of models to train in parallel\n", 284 | " hyperparameter_ranges = {\n", 285 | " 'max_depth': IntegerParameter(3, 12),\n", 286 | " 'eta' : ContinuousParameter(0.05, 0.5),\n", 287 | " 'min_child_weight': IntegerParameter(2, 8),\n", 288 | " 'subsample': ContinuousParameter(0.5, 0.9),\n", 289 | " 'gamma': ContinuousParameter(0, 10),\n", 290 | " })" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "Now that we have our hyperparameter tuner object completely set up, it is time to train it. To do this we make sure that SageMaker knows our input data is in csv format and then execute the `fit` method." 
298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "# This is a wrapper around the location of our train and validation data, to make sure that SageMaker\n", 307 | "# knows our data is in csv format.\n", 308 | "s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')\n", 309 | "s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')\n", 310 | "\n", 311 | "xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "As in many of the examples we have seen so far, the `fit()` method takes care of setting up and fitting a number of different models, each with different hyperparameters. If we wish to wait for this process to finish, we can call the `wait()` method." 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "xgb_hyperparameter_tuner.wait()" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "Once the hyperamater tuner has finished, we can retrieve information about the best performing model. " 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [ 343 | "xgb_hyperparameter_tuner.best_training_job()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "In addition, since we'd like to set up a batch transform job to test the best model, we can construct a new estimator object from the results of the best training job. The `xgb_attached` object below can now be used as though we constructed an estimator with the best performing hyperparameters and then fit it to our training data." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [ 359 | "xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "## Step 5: Test the model\n", 367 | "\n", 368 | "Now that we have our best performing model, we can test it. To do this we will use the batch transform functionality. To start with, we need to build a transformer object from our fit model." 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": null, 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "Next we ask SageMaker to begin a batch transform job using our trained model and applying it to the test data we previous stored in S3. We need to make sure to provide SageMaker with the type of data that we are providing to our model, in our case `text/csv`, so that it knows how to serialize our data. In addition, we need to make sure to let SageMaker know how to split our data up into chunks if the entire data set happens to be too large to send to our model all at once.\n", 385 | "\n", 386 | "Note that when we ask SageMaker to do this it will execute the batch transform job in the background. 
Since we need to wait for the results of this job before we can continue, we use the `wait()` method. An added benefit of this is that we get some output from our batch transform job which lets us know if anything went wrong." 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": {}, 393 | "outputs": [], 394 | "source": [ 395 | "xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": null, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "xgb_transformer.wait()" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "Now that the batch transform job has finished, the resulting output is stored on S3. Since we wish to analyze the output inside of our notebook we can use a bit of notebook magic to copy the output file from its S3 location and save it locally." 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": {}, 418 | "outputs": [], 419 | "source": [ 420 | "!aws s3 cp --recursive $xgb_transformer.output_path $data_dir" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "To see how well our model works we can create a simple scatter plot between the predicted and actual values. If the model was completely accurate the resulting scatter plot would look like the line $x=y$. As we can see, our model seems to have done okay but there is room for improvement." 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": null, 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [ 436 | "Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "metadata": {}, 443 | "outputs": [], 444 | "source": [ 445 | "plt.scatter(Y_test, Y_pred)\n", 446 | "plt.xlabel(\"Median Price\")\n", 447 | "plt.ylabel(\"Predicted Price\")\n", 448 | "plt.title(\"Median Price vs Predicted Price\")" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "## Optional: Clean up\n", 456 | "\n", 457 | "The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook." 
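Before running the clean-up cell below, it can also be interesting to see how all of the tuning jobs performed rather than just the best one. A sketch using the SDK's analytics helper (SageMaker Python SDK 1.x; not part of the original notebook):

```python
# Summarize every training job launched by the tuner, best objective value first.
tuner_results = xgb_hyperparameter_tuner.analytics().dataframe()
tuner_results.sort_values('FinalObjectiveValue').head()
```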
458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": null, 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "# First we will remove all of the files contained in the data_dir directory\n", 467 | "!rm $data_dir/*\n", 468 | "\n", 469 | "# And then we delete the directory itself\n", 470 | "!rmdir $data_dir" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [] 479 | } 480 | ], 481 | "metadata": { 482 | "kernelspec": { 483 | "display_name": "conda_pytorch_p36", 484 | "language": "python", 485 | "name": "conda_pytorch_p36" 486 | }, 487 | "language_info": { 488 | "codemirror_mode": { 489 | "name": "ipython", 490 | "version": 3 491 | }, 492 | "file_extension": ".py", 493 | "mimetype": "text/x-python", 494 | "name": "python", 495 | "nbconvert_exporter": "python", 496 | "pygments_lexer": "ipython3", 497 | "version": "3.6.5" 498 | } 499 | }, 500 | "nbformat": 4, 501 | "nbformat_minor": 2 502 | } -------------------------------------------------------------------------------- /Tutorials/Boston Housing - XGBoost (Hyperparameter Tuning) - Low Level.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predicting Boston Housing Prices\n", 8 | "\n", 9 | "## Using XGBoost in SageMaker (Hyperparameter Tuning)\n", 10 | "\n", 11 | "_Deep Learning Nanodegree Program | Deployment_\n", 12 | "\n", 13 | "---\n", 14 | "\n", 15 | "As an introduction to using SageMaker's Low Level API for hyperparameter tuning, we will look again at the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to predict the median value of a home in the area of Boston Mass.\n", 16 | "\n", 17 | "The documentation reference for the API used in this notebook is the [SageMaker Developer's Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/)\n", 18 | "\n", 19 | "## General Outline\n", 20 | "\n", 21 | "Typically, when using a notebook instance with SageMaker, you will proceed through the following steps. Of course, not every step will need to be done with each project. Also, there is quite a lot of room for variation in many of the steps, as you will see throughout these lessons.\n", 22 | "\n", 23 | "1. Download or otherwise retrieve the data.\n", 24 | "2. Process / Prepare the data.\n", 25 | "3. Upload the processed data to S3.\n", 26 | "4. Train a chosen model.\n", 27 | "5. Test the trained model (typically using a batch transform job).\n", 28 | "6. Deploy the trained model.\n", 29 | "7. Use the deployed model.\n", 30 | "\n", 31 | "In this notebook we will only be covering steps 1 through 5 as we are only interested in creating a tuned model and testing its performance." 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Make sure that we use SageMaker 1.x\n", 41 | "!pip install sagemaker==1.72.0" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Step 0: Setting up the notebook\n", 49 | "\n", 50 | "We begin by setting up all of the necessary bits required to run our notebook. To start that means loading all of the Python modules we will need." 
51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "%matplotlib inline\n", 60 | "\n", 61 | "import os\n", 62 | "\n", 63 | "import time\n", 64 | "from time import gmtime, strftime\n", 65 | "\n", 66 | "import numpy as np\n", 67 | "import pandas as pd\n", 68 | "\n", 69 | "import matplotlib.pyplot as plt\n", 70 | "\n", 71 | "from sklearn.datasets import load_boston\n", 72 | "import sklearn.model_selection" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "In addition to the modules above, we need to import the various bits of SageMaker that we will be using. " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "import sagemaker\n", 89 | "from sagemaker import get_execution_role\n", 90 | "from sagemaker.amazon.amazon_estimator import get_image_uri\n", 91 | "\n", 92 | "# This is an object that represents the SageMaker session that we are currently operating in. This\n", 93 | "# object contains some useful information that we will need to access later such as our region.\n", 94 | "session = sagemaker.Session()\n", 95 | "\n", 96 | "# This is an object that represents the IAM role that we are currently assigned. When we construct\n", 97 | "# and launch the training job later we will need to tell it what IAM role it should have. Since our\n", 98 | "# use case is relatively simple we will simply assign the training job the role we currently have.\n", 99 | "role = get_execution_role()" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## Step 1: Downloading the data\n", 107 | "\n", 108 | "Fortunately, this dataset can be retrieved using sklearn and so this step is relatively straightforward." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "boston = load_boston()" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## Step 2: Preparing and splitting the data\n", 125 | "\n", 126 | "Given that this is clean tabular data, we don't need to do any processing. However, we do need to split the rows in the dataset up into train, test and validation sets." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [ 135 | "# First we package up the input data and the target variable (the median value) as pandas dataframes. 
This\n", 136 | "# will make saving the data to a file a little easier later on.\n", 137 | "\n", 138 | "X_bos_pd = pd.DataFrame(boston.data, columns=boston.feature_names)\n", 139 | "Y_bos_pd = pd.DataFrame(boston.target)\n", 140 | "\n", 141 | "# We split the dataset into 2/3 training and 1/3 testing sets.\n", 142 | "X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.33)\n", 143 | "\n", 144 | "# Then we split the training set further into 2/3 training and 1/3 validation sets.\n", 145 | "X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.33)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "## Step 3: Uploading the data files to S3\n", 153 | "\n", 154 | "When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.\n", 155 | "\n", 156 | "### Save the data locally\n", 157 | "\n", 158 | "First we need to create the test, train and validation csv files which we will then upload to S3." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "# This is our local data directory. We need to make sure that it exists.\n", 168 | "data_dir = '../data/boston'\n", 169 | "if not os.path.exists(data_dir):\n", 170 | " os.makedirs(data_dir)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "# We use pandas to save our test, train and validation data to csv files. Note that we make sure not to include header\n", 180 | "# information or an index as this is required by the built in algorithms provided by Amazon. Also, for the train and\n", 181 | "# validation data, it is assumed that the first entry in each row is the target variable.\n", 182 | "\n", 183 | "X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)\n", 184 | "\n", 185 | "pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)\n", 186 | "pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "### Upload to S3\n", 194 | "\n", 195 | "Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project." 
196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "prefix = 'boston-xgboost-tuning-LL'\n", 205 | "\n", 206 | "test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)\n", 207 | "val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)\n", 208 | "train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "## Step 4: Train and construct the XGBoost model\n", 216 | "\n", 217 | "Now that we have the training and validation data uploaded to S3, we can construct our XGBoost model and train it. Unlike in the previous notebooks, instead of training a single model, we will use SageMakers hyperparameter tuning functionality to train multiple models and use the one that performs the best on the validation set.\n", 218 | "\n", 219 | "### Set up the training job\n", 220 | "\n", 221 | "First, we will set up a training job for our model. This is very similar to the way in which we constructed the training job in previous notebooks. Essentially this describes the *base* training job from which SageMaker will create refinements by changing some hyperparameters during the hyperparameter tuning job." 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "# We will need to know the name of the container that we want to use for training. SageMaker provides\n", 231 | "# a nice utility method to construct this for us.\n", 232 | "container = get_image_uri(session.boto_region_name, 'xgboost')\n", 233 | "\n", 234 | "# We now specify the parameters we wish to use for our training job\n", 235 | "training_params = {}\n", 236 | "\n", 237 | "# We need to specify the permissions that this training job will have. For our purposes we can use\n", 238 | "# the same permissions that our current SageMaker session has.\n", 239 | "training_params['RoleArn'] = role\n", 240 | "\n", 241 | "# Here we describe the algorithm we wish to use. The most important part is the container which\n", 242 | "# contains the training code.\n", 243 | "training_params['AlgorithmSpecification'] = {\n", 244 | " \"TrainingImage\": container,\n", 245 | " \"TrainingInputMode\": \"File\"\n", 246 | "}\n", 247 | "\n", 248 | "# We also need to say where we would like the resulting model artifacts stored.\n", 249 | "training_params['OutputDataConfig'] = {\n", 250 | " \"S3OutputPath\": \"s3://\" + session.default_bucket() + \"/\" + prefix + \"/output\"\n", 251 | "}\n", 252 | "\n", 253 | "# We also need to set some parameters for the training job itself. Namely we need to describe what sort of\n", 254 | "# compute instance we wish to use along with a stopping condition to handle the case that there is\n", 255 | "# some sort of error and the training script doesn't terminate.\n", 256 | "training_params['ResourceConfig'] = {\n", 257 | " \"InstanceCount\": 1,\n", 258 | " \"InstanceType\": \"ml.m4.xlarge\",\n", 259 | " \"VolumeSizeInGB\": 5\n", 260 | "}\n", 261 | " \n", 262 | "training_params['StoppingCondition'] = {\n", 263 | " \"MaxRuntimeInSeconds\": 86400\n", 264 | "}\n", 265 | "\n", 266 | "# Next we set the algorithm specific hyperparameters. 
In this case, since we are setting up\n", 267 | "# a training job which will serve as the base training job for the eventual hyperparameter\n", 268 | "# tuning job, we only specify the _static_ hyperparameters. That is, the hyperparameters that\n", 269 | "# we do _not_ want SageMaker to change.\n", 270 | "training_params['StaticHyperParameters'] = {\n", 271 | " \"gamma\": \"4\",\n", 272 | " \"subsample\": \"0.8\",\n", 273 | " \"objective\": \"reg:linear\",\n", 274 | " \"early_stopping_rounds\": \"10\",\n", 275 | " \"num_round\": \"200\"\n", 276 | "}\n", 277 | "\n", 278 | "# Now we need to tell SageMaker where the data should be retrieved from.\n", 279 | "training_params['InputDataConfig'] = [\n", 280 | " {\n", 281 | " \"ChannelName\": \"train\",\n", 282 | " \"DataSource\": {\n", 283 | " \"S3DataSource\": {\n", 284 | " \"S3DataType\": \"S3Prefix\",\n", 285 | " \"S3Uri\": train_location,\n", 286 | " \"S3DataDistributionType\": \"FullyReplicated\"\n", 287 | " }\n", 288 | " },\n", 289 | " \"ContentType\": \"csv\",\n", 290 | " \"CompressionType\": \"None\"\n", 291 | " },\n", 292 | " {\n", 293 | " \"ChannelName\": \"validation\",\n", 294 | " \"DataSource\": {\n", 295 | " \"S3DataSource\": {\n", 296 | " \"S3DataType\": \"S3Prefix\",\n", 297 | " \"S3Uri\": val_location,\n", 298 | " \"S3DataDistributionType\": \"FullyReplicated\"\n", 299 | " }\n", 300 | " },\n", 301 | " \"ContentType\": \"csv\",\n", 302 | " \"CompressionType\": \"None\"\n", 303 | " }\n", 304 | "]" 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "### Set up the tuning job\n", 312 | "\n", 313 | "Now that the *base* training job has been set up, we can describe the tuning job that we would like SageMaker to perform. In particular, like in the high level notebook, we will specify which hyperparameters we wish SageMaker to change and what range of values they may take on.\n", 314 | "\n", 315 | "In addition, we specify the *number* of models to construct (`max_jobs`) and the number of those that can be trained in parallel (`max_parallel_jobs`). In the cell below we have chosen to train `20` models, of which we ask that SageMaker train `3` at a time in parallel. Note that this results in a total of `20` training jobs being executed which can take some time, in this case almost a half hour. With more complicated models this can take even longer so be aware!" 
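For comparison with the high level notebook, the same ranges and resource limits can also be expressed through the SageMaker Python SDK's `HyperparameterTuner`. The sketch below is illustrative only and assumes an `xgb` estimator object has already been constructed (no such estimator is defined in this low level notebook); the low level configuration we actually use follows in the next cell.

```python
# High-level counterpart of the low-level configuration below (sketch only).
# Assumes `xgb` is a sagemaker.estimator.Estimator configured for the XGBoost container.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

xgb_hyperparameter_tuner = HyperparameterTuner(
    estimator=xgb,                            # plays the role of the base training job
    objective_metric_name='validation:rmse',  # same objective as in the low-level config
    objective_type='Minimize',
    max_jobs=20,                              # total number of models to train
    max_parallel_jobs=3,                      # how many to train at once
    hyperparameter_ranges={
        'max_depth': IntegerParameter(3, 12),
        'eta': ContinuousParameter(0.05, 0.5),
        'min_child_weight': IntegerParameter(2, 8)
    })
```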
316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": {}, 322 | "outputs": [], 323 | "source": [ 324 | "# We need to construct a dictionary which specifies the tuning job we want SageMaker to perform\n", 325 | "tuning_job_config = {\n", 326 | "    # First we specify which hyperparameters we want SageMaker to be able to vary,\n", 327 | "    # and we specify the type and range of the hyperparameters.\n", 328 | "    \"ParameterRanges\": {\n", 329 | "    \"CategoricalParameterRanges\": [],\n", 330 | "    \"ContinuousParameterRanges\": [\n", 331 | "        {\n", 332 | "            \"MaxValue\": \"0.5\",\n", 333 | "            \"MinValue\": \"0.05\",\n", 334 | "            \"Name\": \"eta\"\n", 335 | "        },\n", 336 | "    ],\n", 337 | "    \"IntegerParameterRanges\": [\n", 338 | "        {\n", 339 | "            \"MaxValue\": \"12\",\n", 340 | "            \"MinValue\": \"3\",\n", 341 | "            \"Name\": \"max_depth\"\n", 342 | "        },\n", 343 | "        {\n", 344 | "            \"MaxValue\": \"8\",\n", 345 | "            \"MinValue\": \"2\",\n", 346 | "            \"Name\": \"min_child_weight\"\n", 347 | "        }\n", 348 | "    ]},\n", 349 | "    # We also need to specify how many models should be fit and how many can be fit in parallel\n", 350 | "    \"ResourceLimits\": {\n", 351 | "        \"MaxNumberOfTrainingJobs\": 20,\n", 352 | "        \"MaxParallelTrainingJobs\": 3\n", 353 | "    },\n", 354 | "    # Here we specify how SageMaker should update the hyperparameters as new models are fit\n", 355 | "    \"Strategy\": \"Bayesian\",\n", 356 | "    # And lastly we need to specify how we'd like to determine which models are better or worse\n", 357 | "    \"HyperParameterTuningJobObjective\": {\n", 358 | "        \"MetricName\": \"validation:rmse\",\n", 359 | "        \"Type\": \"Minimize\"\n", 360 | "    }\n", 361 | "    }" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "### Execute the tuning job\n", 369 | "\n", 370 | "Now that we've built the data structures that describe the tuning job we want SageMaker to execute, it is time to actually start the job." 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [ 379 | "# First we need to choose a name for the job. This is useful if we want to recall information about our\n", 380 | "# tuning job at a later date. Note that SageMaker requires a tuning job name and that the name needs to\n", 381 | "# be unique, which we accomplish by appending the current timestamp.\n", 382 | "tuning_job_name = \"tuning-job\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", 383 | "\n", 384 | "# And now we ask SageMaker to create (and execute) the hyperparameter tuning job\n", 385 | "session.sagemaker_client.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,\n", 386 | "                                                           HyperParameterTuningJobConfig = tuning_job_config,\n", 387 | "                                                           TrainingJobDefinition = training_params)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "The tuning job has now been created by SageMaker and is currently running. Since we need the output of the tuning job, we may wish to wait until it has finished. We can do so by asking SageMaker to wait until the tuning job is complete, which is what the next cell does."
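If you would rather check on progress periodically than block on the `wait_for_tuning_job` call in the next cell, the tuning job can also be polled directly. A minimal sketch, using the same `describe_hyper_parameter_tuning_job` call that appears later in this notebook:

```python
# Optional: poll the tuning job rather than blocking on it. The response contains an
# overall status plus counters for the underlying training jobs.
import time

while True:
    info = session.sagemaker_client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name)
    status = info['HyperParameterTuningJobStatus']
    print(status, info['TrainingJobStatusCounters'])
    if status in ('Completed', 'Failed', 'Stopped'):
        break
    time.sleep(60)
```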
395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "metadata": {}, 401 | "outputs": [], 402 | "source": [ 403 | "session.wait_for_tuning_job(tuning_job_name)" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "### Build the model\n", 411 | "\n", 412 | "Now that the tuning job has finished, SageMaker has fit a number of models, the results of which are stored in a data structure which we can access using the name of the tuning job." 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "tuning_job_info = session.sagemaker_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "Among the pieces of information included in the `tuning_job_info` object is the name of the training job which performed best out of all of the models that SageMaker fit to our data. Using this training job name, we can get access to the resulting model artifacts, from which we can construct a model." 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": null, 434 | "metadata": {}, 435 | "outputs": [], 436 | "source": [ 437 | "# We begin by asking SageMaker to describe for us the results of the best training job. The data\n", 438 | "# structure returned contains a lot more information than we currently need; try checking it out\n", 439 | "# yourself in more detail.\n", 440 | "best_training_job_name = tuning_job_info['BestTrainingJob']['TrainingJobName']\n", 441 | "training_job_info = session.sagemaker_client.describe_training_job(TrainingJobName=best_training_job_name)\n", 442 | "\n", 443 | "model_artifacts = training_job_info['ModelArtifacts']['S3ModelArtifacts']" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "# Just like when we created a training job, the model name must be unique\n", 453 | "model_name = best_training_job_name + \"-model\"\n", 454 | "\n", 455 | "# We also need to tell SageMaker which container should be used for inference and where it should\n", 456 | "# retrieve the model artifacts from. In our case, the xgboost container that we used for training\n", 457 | "# can also be used for inference.\n", 458 | "primary_container = {\n", 459 | "    \"Image\": container,\n", 460 | "    \"ModelDataUrl\": model_artifacts\n", 461 | "}\n", 462 | "\n", 463 | "# And lastly we construct the SageMaker model\n", 464 | "model_info = session.sagemaker_client.create_model(\n", 465 | "                                ModelName = model_name,\n", 466 | "                                ExecutionRoleArn = role,\n", 467 | "                                PrimaryContainer = primary_container)" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "## Step 5: Testing the model\n", 475 | "\n", 476 | "Now that we have fit our model to the training data, using the validation data to avoid overfitting, we can test our model. To do this we will make use of SageMaker's Batch Transform functionality.
In other words, we need to set up and execute a batch transform job, similar to the way that we constructed the training job earlier.\n", 477 | "\n", 478 | "### Set up the batch transform job\n", 479 | "\n", 480 | "Just like when we were training our model, we first need to provide some information in the form of a data structure that describes the batch transform job which we wish to execute.\n", 481 | "\n", 482 | "We will only be using some of the options available here; to see the additional options, please see the SageMaker documentation for [creating a batch transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html)." 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": {}, 489 | "outputs": [], 490 | "source": [ 491 | "# Just like in each of the previous steps, we need to make sure to name our job and the name should be unique.\n", 492 | "transform_job_name = 'boston-xgboost-batch-transform-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", 493 | "\n", 494 | "# Now we construct the data structure which will describe the batch transform job.\n", 495 | "transform_request = \\\n", 496 | "{\n", 497 | "    \"TransformJobName\": transform_job_name,\n", 498 | "    \n", 499 | "    # This is the name of the model that we created earlier.\n", 500 | "    \"ModelName\": model_name,\n", 501 | "    \n", 502 | "    # This describes how many compute instances should be used at once. If you happen to be doing a very large\n", 503 | "    # batch transform job it may be worth running multiple compute instances at once.\n", 504 | "    \"MaxConcurrentTransforms\": 1,\n", 505 | "    \n", 506 | "    # This says how big each individual request sent to the model should be, at most. One of the things that\n", 507 | "    # SageMaker does in the background is to split our data up into chunks so that each chunk stays under\n", 508 | "    # this size limit.\n", 509 | "    \"MaxPayloadInMB\": 6,\n", 510 | "    \n", 511 | "    # Sometimes we may want to send only a single sample to our endpoint at a time; in this case, however, each of\n", 512 | "    # the chunks that we send should contain multiple samples of our input data.\n", 513 | "    \"BatchStrategy\": \"MultiRecord\",\n", 514 | "    \n", 515 | "    # This next object describes where the output data should be stored. Some of the more advanced options which\n", 516 | "    # we don't cover here also describe how SageMaker should collect output from various batches.\n", 517 | "    \"TransformOutput\": {\n", 518 | "        \"S3OutputPath\": \"s3://{}/{}/batch-transform/\".format(session.default_bucket(), prefix)\n", 519 | "    },\n", 520 | "    \n", 521 | "    # Here we describe our input data. Of course, we need to tell SageMaker where on S3 our input data is stored; in\n", 522 | "    # addition, we need to detail the characteristics of our input data. In particular, since SageMaker may need to\n", 523 | "    # split our data up into chunks, it needs to know how the individual samples in our data file appear. In our\n", 524 | "    # case each line is its own sample and so we set the split type to 'line'.
We also need to tell SageMaker what\n", 525 | "    # type of data is being sent, in this case csv, so that it can properly serialize the data.\n", 526 | "    \"TransformInput\": {\n", 527 | "        \"ContentType\": \"text/csv\",\n", 528 | "        \"SplitType\": \"Line\",\n", 529 | "        \"DataSource\": {\n", 530 | "            \"S3DataSource\": {\n", 531 | "                \"S3DataType\": \"S3Prefix\",\n", 532 | "                \"S3Uri\": test_location,\n", 533 | "            }\n", 534 | "        }\n", 535 | "    },\n", 536 | "    \n", 537 | "    # And lastly we tell SageMaker what sort of compute instance we would like it to use.\n", 538 | "    \"TransformResources\": {\n", 539 | "        \"InstanceType\": \"ml.m4.xlarge\",\n", 540 | "        \"InstanceCount\": 1\n", 541 | "    }\n", 542 | "}" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "### Execute the batch transform job\n", 550 | "\n", 551 | "Now that we have created the request data structure, it is time to ask SageMaker to set up and run our batch transform job. Just like in the previous steps, SageMaker performs these tasks in the background, so if we want to wait for the transform job to terminate (and ensure the job is progressing) we can ask SageMaker to wait for the transform job to complete." 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": {}, 558 | "outputs": [], 559 | "source": [ 560 | "transform_response = session.sagemaker_client.create_transform_job(**transform_request)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": null, 566 | "metadata": {}, 567 | "outputs": [], 568 | "source": [ 569 | "transform_desc = session.wait_for_transform_job(transform_job_name)" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": {}, 575 | "source": [ 576 | "### Analyze the results\n", 577 | "\n", 578 | "Now that the transform job has completed, the results are stored on S3 as we requested. Since we'd like to do a bit of analysis in the notebook, we can use some notebook magic to copy the resulting output from S3 and save it locally." 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": null, 584 | "metadata": {}, 585 | "outputs": [], 586 | "source": [ 587 | "transform_output = \"s3://{}/{}/batch-transform/\".format(session.default_bucket(), prefix)" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": null, 593 | "metadata": {}, 594 | "outputs": [], 595 | "source": [ 596 | "!aws s3 cp --recursive $transform_output $data_dir" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": {}, 602 | "source": [ 603 | "To see how well our model works we can create a simple scatter plot between the predicted and actual values. If the model were completely accurate the resulting scatter plot would look like the line $x=y$. As we can see, our model seems to have done okay, but there is room for improvement.
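Alongside the scatter plot, a single summary number can be useful; for example the test RMSE, the same metric the tuning job minimized on the validation set. A minimal sketch, assuming `Y_test` and `Y_pred` as they are loaded in the cells that follow:

```python
# Optional: report the test RMSE to complement the scatter plot.
# Assumes Y_test and Y_pred are the DataFrames defined in the surrounding cells.
import numpy as np

test_rmse = np.sqrt(np.mean((Y_test.values.flatten() - Y_pred.values.flatten()) ** 2))
print("Test RMSE: {:.2f}".format(test_rmse))
```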
604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "metadata": {}, 610 | "outputs": [], 611 | "source": [ 612 | "Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": null, 618 | "metadata": {}, 619 | "outputs": [], 620 | "source": [ 621 | "plt.scatter(Y_test, Y_pred)\n", 622 | "plt.xlabel(\"Median Price\")\n", 623 | "plt.ylabel(\"Predicted Price\")\n", 624 | "plt.title(\"Median Price vs Predicted Price\")" 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": {}, 630 | "source": [ 631 | "## Optional: Clean up\n", 632 | "\n", 633 | "The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook." 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": null, 639 | "metadata": {}, 640 | "outputs": [], 641 | "source": [ 642 | "# First we will remove all of the files contained in the data_dir directory\n", 643 | "!rm $data_dir/*\n", 644 | "\n", 645 | "# And then we delete the directory itself\n", 646 | "!rmdir $data_dir" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": null, 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [] 655 | } 656 | ], 657 | "metadata": { 658 | "kernelspec": { 659 | "display_name": "conda_pytorch_p36", 660 | "language": "python", 661 | "name": "conda_pytorch_p36" 662 | }, 663 | "language_info": { 664 | "codemirror_mode": { 665 | "name": "ipython", 666 | "version": 3 667 | }, 668 | "file_extension": ".py", 669 | "mimetype": "text/x-python", 670 | "name": "python", 671 | "nbconvert_exporter": "python", 672 | "pygments_lexer": "ipython3", 673 | "version": "3.6.5" 674 | } 675 | }, 676 | "nbformat": 4, 677 | "nbformat_minor": 2 678 | } -------------------------------------------------------------------------------- /Tutorials/Web App Diagram.svg: -------------------------------------------------------------------------------- 1 | 2 |
[SVG markup not preserved in this dump; the diagram's component labels are App, API, Lambda, and Model.]
-------------------------------------------------------------------------------- /Tutorials/index.html: -------------------------------------------------------------------------------- [HTML markup not preserved in this dump; the surviving text is the page title "Sentiment Analysis Web App", the heading "Is your review positive, or negative?", and the prompt "Enter your review below and click submit to find out..."] --------------------------------------------------------------------------------